wcfs.go 40.9 KB
// Copyright (C) 2018-2019  Nexedi SA and Contributors.
//                          Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.

// Program wcfs provides filesystem server with file data backed by wendelin.core arrays.
//
// Intro
//
// Each wendelin.core array (ZBigArray) is actually a linear file (ZBigFile)
// and array metadata like dtype, shape and strides associated with it. This
// program exposes as files only ZBigFile data and leaves rest of
// array-specific handling to clients. Every ZBigFile is exposed as one separate
// file that represents whole ZBigFile's data.
//
// For a client, the primary way to access a bigfile should be to mmap
// head/bigfile/<bigfileX> which represents always latest bigfile data.
// Clients that want an isolation guarantee should subscribe to invalidations
// and re-mmap invalidated regions to a file with pinned bigfile revision for
// the duration of their transaction. See "Invalidation protocol" for details.
//
// In the usual situation when bigfiles are big, and there are O(1)/δt updates,
// there should be no need for any cache besides shared kernel cache of latest
// bigfile data.
//
//
// Filesystem organization
//
// Top-level structure of provided filesystem is as follows:
//
//	head/			; latest database view
//		...
//	@<rev1>/		; database view as of revision <rev1>
//		...
//	@<rev2>/
//		...
//	...
//
// where head/ represents latest data as stored in upstream ZODB, and
// @<revX>/ represents data as of database revision <revX>.
//
// head/ has the following structure:
//
//	head/
//		at			; data inside head/ is as of this ZODB transaction
//		watch			; channel for bigfile invalidations
//		bigfile/		; bigfiles' data
//			<oid(bigfile1)>
//			<oid(bigfile2)>
//			...
//
// where /bigfile/<bigfileX> represents latest bigfile data as stored in
// upstream ZODB. As there can be some lag receiving updates from the database,
// /at describes precisely ZODB state for which bigfile data is currently
// exposed. Whenever bigfile data is changed in upstream ZODB, information
// about the changes is first propagated to /watch, and only after that
// /bigfile/<bigfileX> is updated. See "Invalidation protocol" for details.
//
// @<revX>/ has the following structure:
//
//	@<revX>/
//		at
//		bigfile/		; bigfiles' data as of revision <revX>
//			<oid(bigfile1)>
//			<oid(bigfile2)>
//			...
//
// where /bigfile/<bigfileX> represents bigfile data as of revision <revX>.
//
// Unless accessed, {head,@<revX>}/bigfile/<bigfileX> are not automatically visible in
// wcfs filesystem. Similarly @<revX>/ should be explicitly created by client via mkdir.	XXX -> just @<revX> access.
//
//
// Invalidation protocol
//
// In order to support isolation, wcfs implements invalidation protocol that
// must be cooperatively followed by both wcfs and client.
//
// First, client mmaps latest bigfile, but does not access it
//
//	mmap(head/bigfile/<bigfileX>)
//
// Then client opens head/watch and tells wcfs through it for which ZODB state
// it wants to get bigfile's view.
//
//	C: 1 watch <bigfileX> @<at>
//
// The server then, after potentially sending initial pin messages (see below),
// reports either success or failure:
//
//	S: 1 ok
//	S: 1 error ...		; if <at> is too far away back from head/at
//
// The server sends "ok" reply only after head/at is ≥ requested <at>, and
// only after all initial pin messages are fully acknowledged by the client.
// The client can start to use mmapped data after it gets "ok".
// The server sends "error" reply if requested <at> is too far away back from
// head/at.
//
// Upon watch request, either initially, or after sending "ok", the server
// notifies the client about file blocks that the client needs to pin in order
// to observe the file's data as of <at> revision:
//
// The filesystem server itself receives information about changed data from
// ZODB server through regular ZODB invalidation channel (as it is ZODB client
// itself). Then, separately for each changed file block, before actually
// updating head/bigfile/<bigfileX> content, it notifies through head/watch to
// clients, that had requested it (separately to each client), about the
// changes:
//
//	S: 2 pin <bigfileX> #<blk> @<rev_max>
//
// and waits until all clients confirm that changed file block can be updated
// in global OS cache.
//
// The client in turn should now re-mmap the block requested to be pinned to bigfile@<rev_max>:
//
//	# mmapped at address corresponding to #blk
//	mmap(@<rev_max>/bigfile/<bigfileX>, #blk, MAP_FIXED)
//
// and must send ack back to the server when it is done:
//
//	C: 2 ack
//
// The server sends pin notifications only for file blocks, that are known to
// be potentially changed after client's <at>, and <rev_max> describes the
// upper bound for the block revision:
//
//	<at>	<  <rev_max>		FIXME ->	<rev_max> ≤ <at>	(?)
//
// The server maintains short history tail of file changes to be able to
// support openings with <at> being slightly in the past compared to current
// head/at. The server might reject a watch request if <at> is too far away in
// the past from head/at. The client is advised to restart its transaction with
// more up-to-date database view if it gets watch setup error.
//
// A later request from the client for the same <bigfileX> but with different
// <at>, overrides previous watch request for that file. A client can use "-"
// instead of "@<at>" to stop watching a file.
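//
// For illustration, a complete exchange over one head/watch open could look as
// follows. This is only a sketch assembled from the message forms above: the
// request numbers are illustrative, and the reply to the "-" request is left
// out because its form is not described here:
//
//	C: 1 watch <bigfileX> @<at>
//	S: 2 pin <bigfileX> #<blk> @<rev_max>	; zero or more initial pins
//	C: 2 ack				; sent after the block is re-mmapped
//	S: 1 ok					; client may now use mmapped data
//	...
//	S: 4 pin <bigfileX> #<blk> @<rev_max>	; block changed in ZODB after <at>
//	C: 4 ack
//	...
//	C: 3 watch <bigfileX> -			; stop watching the file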
//
// A single client can send several watch requests through single head/watch
// open, as well as it can use several head/watch opens simultaneously.
// The server sends pin notifications for all files requested to be watched via
// every head/watch open.
//
// Note: a client could use a single watch to manage its several views for the same
// file but with different <at>. This could be achieved via watching with
// @<at_min>, and then deciding internally which views need to be adjusted and
// which views need not. Wcfs does not oblige clients to do so though, and a
// client is free to use as many head/watch openings as it needs to.
//
// When clients are done with @<revX>/bigfile/<bigfileX> (i.e. client's
// transaction ends and array is unmapped), the server sees number of opened
// files to @<revX>/bigfile/<bigfileX> drops to zero, and automatically
// destroys @<revX>/bigfile/<bigfileX> after reasonable timeout.
//
//
// Protection against slow or faulty clients
//
// If a client, on purpose or due to a bug or being stopped, is slow to respond
// with ack to file invalidation notification, it creates a problem because the
// server will become blocked waiting for pin acknowledgments, and thus all
// other clients, that try to work with the same file, will get stuck.
//
// The problem could be avoided if wcfs resided inside the OS kernel and this
// way were able to manipulate clients' address space directly (then the
// invalidation protocol would not be needed). It is also possible to imagine a
// mechanism, where wcfs would synchronously change clients' address space via
// injecting trusted code and running it on client side via ptrace to adjust
// file mappings.
//
// However ptrace does not work when client thread is blocked under pagefault,
// and that is exactly what wcfs would need to do to process invalidations
// lazily, because eager invalidation processing results in prohibitively slow
// file opens. See internal wcfs overview for details about why ptrace
// cannot be used and why lazy invalidation processing is required.
//
// Lacking OS primitives to change address space of another process and not
// being able to work around it with ptrace in userspace, wcfs takes the approach
// of killing a slow client after a 30-second timeout by default.
//
//
// Writes
//
// As each bigfile is represented by 1 synthetic file, there can be several
// write schemes:
//
// 1. mmap(MAP_PRIVATE) + writeout by client
//
// In this scheme bigfile data is mmapped in MAP_PRIVATE mode, so that local
// user changes are not automatically propagated back to the file. When there
// is a need to commit, client investigates via some OS mechanism, e.g.
// /proc/self/pagemap or something similar, which pages of this mapping it
// modified. Knowing this it knows which data it dirtied and so can write this
// data back to ZODB itself, without filesystem server providing write support.
//
// 2. mmap(MAP_SHARED, PROT_READ) + write-tracking & writeout by client
//
// In this scheme bigfile data is mmapped in MAP_SHARED mode with read-only pages
// protection. Then whenever write fault occurs, client allocates RAM from
// shmfs, copies faulted page to it, and then mmaps RAM page with RW protection
// in place of original bigfile page. Writeout implementation should be similar
// to "1", only here client already knows the pages it dirtied, and this way
// there is no need to consult /proc/self/pagemap.
//
// The advantage of this scheme over mmap(MAP_PRIVATE) is that in case
// there are several in-process mappings of the same bigfile with overlapping
// in-file ranges, changes in one mapping will be visible in another mapping.
// Contrary: whenever a MAP_PRIVATE mapping is modified, the kernel COWs
// faulted page into a page completely private to this mapping, so that other
// MAP_PRIVATE mappings of this file, including ones created from the same
// process, do not see changes made to the first mapping.
//
// Since wendelin.core needs to provide coherency in between different slices
// of the same array, this is the mode wendelin.core actually uses.
//
// 3. write to wcfs
//
// TODO we later could implement "write-directly" mode where clients would write
// data directly into the file.
package main

// Wcfs organization
//
// Wcfs is a ZODB client that translates ZODB objects into OS files as
// non-wcfs wendelin.core would do for a ZBigFile. Contrary to non-wcfs wendelin.core,
// it keeps bigfile data in shared OS cache efficiently. It is organized as follows:
//
// 1) 1 ZODB connection for "latest data" for whole filesystem (zhead).
// 2) head/bigfile/* of all bigfiles represent state as of zhead.At .
// 3) for head/bigfile/* the following invariant is maintained:
//
//	#blk ∈ file cache    =>    ZBlk(#blk) + all BTree/Bucket that lead to it  ∈ zhead cache
//	                           (ZBlk* in ghost state(%))
//
//    The invariant helps on invalidation: if we see a changed oid, and
//    zhead.cache.lookup(oid) = ø -> we know we don't have to invalidate OS
//    cache for any part of any file (even if oid relates to a file block - that
//    block is not cached and will trigger ZODB load on file read).
//
//    Currently we maintain this invariant by simply never evicting LOBTree/LOBucket
//    objects from ZODB Connection cache (LOBucket keeps references to ZBlk* and
//    so ZBlk* also stay in cache in ghost form). In the future we may want to
//    try to synchronize to kernel freeing its pagecache pages.
//
// 4) when we receive an invalidation message from ZODB - we process it and
//    propagate invalidations to OS file cache of head/bigfile/*:
//
//	invalidation message: (tid↑, []oid)
//
//    4.1) zhead.cache.lookup(oid)
//    4.2) ø: nothing to do - see invariant ^^^.
//    4.3) obj found:
//
//	- ZBlk*		-> file/#blk
//	- BTree/Bucket	-> δ(BTree)  -> file/[]#blk
//
//	in the end after processing all []oid from invalidation message we have
//
//	  [] of file/[]#blk
//
//	that describes which parts of which file(s) need to be invalidated.
//
//    4.4) for all file/blk to invalidate we do:
//
//	- try to retrieve head/bigfile/file[blk] from OS file cache(*);
//	- if retrieved successfully -> store retrieved data back into OS file
//	  cache for @<rev>/bigfile/file[blk], where
//
//	    rev = max(δFtail.by(#blk)) || min(rev ∈ δFtail) || zhead.at	; see below about δFtail
//
//	- invalidate head/bigfile/file[blk] in OS file cache.
//
//	This preserves previous data in OS file cache in case it will be needed
//	by not-yet-uptodate clients, and makes sure file read of head/bigfile/file[blk]
//	won't be served from OS file cache and instead will trigger a FUSE read
//	request to wcfs.
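//
//	For example, reading "||" in the rev formula above as "otherwise, if the
//	left side is not available" (an interpretation, not something stated
//	here), and taking hypothetical revisions rev1 < rev3 recorded in δFtail:
//
//	  δFtail.by(#blk) = [rev1, rev3]		-> rev = rev3
//	  #blk ∉ δFtail, δFtail covers rev1..rev3	-> rev = rev1
//	  δFtail is empty				-> rev = zhead.at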
//
//    4.5) no invalidation messages are sent to wcfs clients at this point(+).
//
//    4.6) processing ZODB invalidations and serving file reads (see 7) are
//      organized to be mutually exclusive.
//
//	(TODO head.zconnMu -> special mutex with Lock(ctx) so that Lock could be canceled)
//
// 5) after OS file cache was invalidated, we resync zhead to new database
//    view corresponding to tid.
//
// 6) for every file δFtail invalidation info about head/data is maintained:
//
//	- tailv: [](rev↑, []#blk)
//	- by:    {} #blk -> []rev↑ in tail
//
//    δFtail.tailv describes invalidations to file we learned from ZODB invalidation.
//    δFtail.by    allows quick lookup of information by #blk.
//
//    min(rev) in δFtail is min(@at) at which head/bigfile/file is currently mmapped (see below).
//
//    to support initial openings with @at being slightly in the past, we also
//    make sure that min(rev) is enough to cover last 10 minutes of history
//    from head/at.
//
// 7) when we receive a FUSE read(#blk) request to a head/bigfile/file we process it as follows:
//
//   7.1) load blkdata for head/bigfile/file[blk] @zhead.at .
//
//	while loading this also gives upper bound estimate of when the block
//	was last changed:
//
//	  rev(blk) ≤ max(_.serial for _ in (ZBlk(#blk), all BTree/Bucket that lead to ZBlk))
//
//	it is not exact because BTree/Bucket can change (e.g. rebalance)
//	but still point to the same k->ZBlk.
//
//	we also use file.δFtail to find either exact blk revision:
//
//	  rev(blk) = max(file.δFtail.by(#blk) -> []rev↑)
//
//	or another upper bound if #blk ∉ δFtail:
//
//	  rev(blk) ≤ min(rev ∈ δFtail)		; #blk ∉ δFtail
//
//
//	below rev'(blk) is min(of the estimates found):
//
//	  rev(blk) ≤ rev'(blk)		rev'(blk) = min(^^^)
//
//
//   7.2) for all client@at mmappings of head/bigfile/file:
//
//	- rev'(blk) ≤ at: -> do nothing		XXX if rev is not last and there is rev_next ≤ at ? -> also consult δFtail ?
//	- rev'(blk) > at:
//	  - if blk ∈ mmapping.pinned -> do nothing
//	  - rev = max(δFtail.by(#blk) : _ ≤ at)	|| min(rev ∈ δFtail : rev ≤ at)	|| at
//	  - client.remmap(file, #blk, @rev/bigfile/file)
//	  - mmapping.pinned += blk
//
//	remmapping is done via "invalidation protocol" exchange with client.
//	( one could imagine adjusting mappings synchronously via running
//	  wcfs-trusted code via ptrace that wcfs injects into clients, but ptrace
//	  won't work when client thread is blocked under pagefault or syscall(^) )
//
//	in order to support remmapping for each head/bigfile/file
//
//	  [] of mmapping{client@at↑, pinned}
//
//	is maintained.
//
//   7.3) blkdata is returned to kernel.
//
//   Thus a client that wants latest data on pagefault will get latest data,
//   and a client that wants @rev data will get @rev data, even if it was this
//   "old" client that triggered the pagefault(~).
//
// XXX 8) serving read from @<rev>/data + zconn(s) for historical state
// XXX 9) gc @rev/ and @rev/bigfile/<bigfileX> automatically on atime timeout
//
//
// (*) see notes.txt -> "Notes on OS pagecache control"
// (+) see notes.txt -> "Invalidations to wcfs clients are delayed until block access"
// (~) see notes.txt -> "Changing mmapping while under pagefault is possible"
// (^) see notes.txt -> "Client cannot be ptraced while under pagefault"
// (%) no need to keep track of ZData - ZBlk1 is always marked as changed on blk data change.
//
// XXX For every ZODB connection a dedicated read-only transaction is maintained.

import (
	"context"
	"flag"
	"fmt"
	stdlog "log"
	"os"
	"runtime"
	"strings"
	"sync"
	"syscall"

	log "github.com/golang/glog"
	"golang.org/x/sync/errgroup"

	"lab.nexedi.com/kirr/go123/xcontext"
	"lab.nexedi.com/kirr/go123/xerr"

	"lab.nexedi.com/kirr/neo/go/transaction"
	"lab.nexedi.com/kirr/neo/go/zodb"
	"lab.nexedi.com/kirr/neo/go/zodb/btree"
	_ "lab.nexedi.com/kirr/neo/go/zodb/wks"

	"github.com/hanwen/go-fuse/fuse"
	"github.com/hanwen/go-fuse/fuse/nodefs"
	"github.com/pkg/errors"
)

// Root represents root of wcfs filesystem.
type Root struct {
	nodefs.Node

	// ZODB storage we work with
	zstor zodb.IStorage

	// ZODB DB handle for zstor.
	// keeps cache of connections for both head/ and @<rev>/ accesses.
	// XXX head won't be kept here and will be .Resync()'ed explicitly?
	//
	// only one connection is used for head/ and only one for each @<rev>.
	zdb *zodb.DB

	// directory + ZODB connection for head/
	head *Head

	// directories + ZODB connections for @<rev>/
	revMu  sync.Mutex
	revTab map[zodb.Tid]*Head
}

// /(head|<rev>)/			- served by Head.
type Head struct {
	nodefs.Node
	rev   zodb.Tid    // 0 for head/, !0 for @<rev>/
	bfdir *BigFileDir // bigfile/
	// at    - served by .readAt
	// watch - implicitly linked to by fs

	// ZODB connection for everything under this head
	zconnMu sync.RWMutex // protects access to zconn & live _objects_ associated with it
	zconn   *ZConn       // for head/ zwatcher resyncs head.zconn; others only read zconn objects.

	// XXX move zconn's current transaction to Head here?
}

// /head/watch				- served by Watch.
type Watch struct {
	nodefs.Node

	// TODO
}

// /(head|<rev>)/bigfile/		- served by BigFileDir.
type BigFileDir struct {
	nodefs.Node
	head *Head // parent head/ or @<rev>/

	// {} oid -> <bigfileX>
	mu      sync.Mutex
	fileTab map[zodb.Oid]*BigFile
}

// /(head|<rev>)/bigfile/<bigfileX>	- served by BigFile.
type BigFile struct {
	nodefs.Node

	// this BigFile is under .head/bigfile/; it views ZODB via .head.zconn
	// parent's BigFileDir.head is the same.
	head	*Head

	// ZBigFile top-level object. Kept activated during lifetime of current transaction.
	zbf	*ZBigFile

	// zbf.Size(). It is constant during lifetime of current transaction.
	zbfSize int64

	// tail change history of this file.
	δFtail *ΔTailI64 // [](rev↑, []#blk)

	// TODO -> δFtail
	// lastChange	zodb.Tid // last change to whole bigfile as of .zconn.At view

	// inflight loadings of ZBigFile from ZODB.
	// successful load results are kept here until blkdata is put into OS pagecache.
	loadMu  sync.Mutex
	loading map[int64]*blkLoadState // #blk -> {... blkdata}

	// XXX mappings where client(s) requested isolation guarantee
	//mappings ...
}

// blkLoadState represents a ZBlk load state/result.
//
// when !ready the loading is in progress.
// when ready the loading has been completed.
type blkLoadState struct {
	ready chan struct{}

	blkdata []byte
	err     error
}
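
/*
Illustration (not part of the original wcfs source): the ready-channel
protocol described for blkLoadState can be tried out in isolation. The sketch
below is a minimal, assumption-laden model: the names loadState, loader, get
and demo are hypothetical, the load callback stands in for loading a block
from ZODB, and eviction of completed entries from the map is deliberately
left out.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// loadState mirrors the shape of blkLoadState: ready is closed once
// data and err have been set.
type loadState struct {
	ready chan struct{}
	data  []byte
	err   error
}

// loader deduplicates loads of the same block number.
type loader struct {
	mu      sync.Mutex
	loading map[int64]*loadState
	load    func(ctx context.Context, blk int64) ([]byte, error) // hypothetical backend
}

// get returns data of block blk; the backend is asked at most once per block.
func (l *loader) get(ctx context.Context, blk int64) ([]byte, error) {
	l.mu.Lock()
	st, already := l.loading[blk]
	if !already {
		st = &loadState{ready: make(chan struct{})}
		l.loading[blk] = st
	}
	l.mu.Unlock()

	if already {
		// someone else is (or was) loading: wait for completion or cancellation
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-st.ready:
			return st.data, st.err
		}
	}

	// we are responsible for the load: publish the result, then signal readiness
	st.data, st.err = l.load(ctx, blk)
	close(st.ready)
	return st.data, st.err
}

// demo requests the same block twice and reports how often the backend ran.
func demo() (data []byte, calls int, err error) {
	l := &loader{loading: map[int64]*loadState{}}
	l.load = func(ctx context.Context, blk int64) ([]byte, error) {
		calls++
		return []byte{byte(blk)}, nil
	}
	ctx := context.Background()
	if _, err = l.get(ctx, 7); err != nil {
		return nil, calls, err
	}
	data, err = l.get(ctx, 7) // served from the already-completed state
	return data, calls, err
}

func main() {
	fmt.Println(demo())
}
```

Closing the channel only after data and err are stored is what lets waiters
read both fields without taking the mutex again.
*/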

// ----------------------------------------

// zodbCacheControl implements zodb.LiveCacheControl to tune ZODB to never evict
// LOBTree/LOBucket from live cache. We want to keep LOBTree/LOBucket always alive
// because it is essentially the index where to find ZBigFile data.
//
// For the data itself - we put it to kernel pagecache and always deactivate
// from ZODB right after that.
//
// See "3) for */head/data the following invariant is maintained..."
type zodbCacheControl struct {}

func (cc *zodbCacheControl) WantEvict(obj zodb.IPersistent) bool {
	switch obj.(type) {
	default:
		return true

	case *btree.LOBTree:
	case *btree.LOBucket:

	// ZBlk* are kept referenced by a LOBucket, so they don't go away from Connection.cache.objtab

	// FIXME we need to keep ZBigFile in cache: even if we keep a pointer
	// to ZBigFile, but it is allowed to drop its state, it will release
	// pointer to LOBTree object and, consequently, that LOBTree object,
	// even if it was marked not to be released from cache will be GC'ed by
	// go runtime, and the cache will lose its weak reference to it.
	// XXX however we cannot protect ZBigFile from releasing state - as
	// any object can be explicitly invalidated.
	// FIXME -> teach zodb.LiveCache to keep object by itself?
	//
	// we also keep ZBigFile alive because we want to make sure .blksize
	// and (p. ref) .blktab do not change.
	// XXX do we really need to keep ZBigFile alive for that?
	//case *ZBigFile:
	}

	return false
}

func traceWatch(format string, argv ...interface{}) {
	if !log.V(1) {	// XXX -> 2?
		return
	}

	log.Infof("zwatcher: " + format, argv...)
}

// zwatcher watches for ZODB changes.
// see "4) when we receive an invalidation message from ZODB ..."
func (root *Root) zwatcher(ctx context.Context) (err error) {
	defer xerr.Contextf(&err, "zwatch")	// XXX more in context?
	// XXX unmount on error? -> always EIO?

	traceWatch(">>>")

	zwatchq := make(chan zodb.CommitEvent)
	root.zstor.AddWatch(zwatchq)
	defer root.zstor.DelWatch(zwatchq)

	var zevent zodb.CommitEvent
	var ok bool

	for {
		traceWatch("select ...")
		select {
		case <-ctx.Done():
			traceWatch("cancel")
			return ctx.Err()

		// TODO handle errors from ZODB watch stream
		case zevent, ok = <-zwatchq:
			if !ok {
				traceWatch("zwatchq closed")
				return nil // closed	XXX ok?
			}

			traceWatch("zevent: %s", zevent)
		}

		root.zhandle1(zevent)
	}
}

// zhandle1 handles 1 event from ZODB notification.
func (root *Root) zhandle1(zevent zodb.CommitEvent) {
	// while we are invalidating OS cache, make sure that nothing, that
	// even reads /head/bigfile/*, is running (see 4.6).
	root.head.zconnMu.Lock()
	defer root.head.zconnMu.Unlock()

	zhead := root.head.zconn
	bfdir := root.head.bfdir

	toinvalidate := map[*BigFile]SetI64{} // {} file -> set(#blk)

	// zevent = (tid^, []oid)
	for _, oid := range zevent.Changev {
		// XXX zhead.Cache() lock/unlock + comment it is not really needed
		obj := zhead.Cache().Get(oid)
		if obj == nil {
			continue // nothing to do - see invariant
		}

		switch obj := obj.(type) {
		default:
			continue // object not related to any bigfile

		case *btree.LOBTree:
			// XXX -> δBTree

		case *btree.LOBucket:
			// XXX -> δBTree

		case zBlk:	// ZBlk*
			// blkBoundTo locking: no other bindZFile are running,
			// since we write-locked head.zconnMu and bindZFile is
			// run when loading objects - thus when head.zconnMu is
			// read-locked.
			//
			// bfdir locking: similarly not needed, since we are
			// exclusively holding head lock.
			for zfile, objBlk := range obj.blkBoundTo() {
				file, ok := bfdir.fileTab[zfile.POid()]
				if !ok {
					// even though zfile is in ZODB cache, the
					// filesystem already forgot about this file.
					continue
				}

				blkmap, ok := toinvalidate[file]
				if !ok {
					blkmap = SetI64{}
					toinvalidate[file] = blkmap
				}
				blkmap.Update(objBlk)
			}

		case *ZBigFile:
			// XXX check that .blksize and .blktab (it is only
			// persistent reference) do not change.

			// XXX shutdown fs with ^^^ message.
		}

		// make sure obj won't be garbage-collected until we finish handling it.
		runtime.KeepAlive(obj)
	}

	//wg = ...
	ctx := context.TODO()
	for file, blkmap := range toinvalidate {
		for blk := range blkmap {
			go file.invalidateBlk(ctx, blk)	// XXX -> wg.Go
		}
	}

	// resync .zhead to zevent.tid
	//zhead.Resync(zevent.Tid)		XXX reenable

	// notify .wcfs/zhead
	for sk := range gdebug.zheadSockTab {
		_, err := fmt.Fprintf(sk, "%s\n", zevent.Tid)
		if err != nil {
			log.Error(err)	// XXX errctx, -> warning?
			sk.Close()
			delete(gdebug.zheadSockTab, sk)
		}
	}
}

// invalidateBlk invalidates 1 file block.	XXX
//
// called with f.head.zconnMu wlocked.
//
// XXX see "4.4) for all file/blk to in invalidate we do"
func (f *BigFile) invalidateBlk(ctx context.Context, blk int64) error {
	fsconn := gfsconn
	blksize := f.zbf.blksize
	off := blk*blksize

	// try to retrieve cache of current head/data[blk]
	//
	// if less than blksize was cached - probably the kernel had to evict
	// some data from its cache already. In such case we don't try to
	// preserve the rest and drop what was read, to avoid keeping the
	// system overloaded.
	//
	// XXX st != OK -> warn?
	blkdata := make([]byte, blksize)
	n, st := fsconn.FileRetrieveCache(f.Inode(), off, blkdata)
	if int64(n) == blksize {
		// XXX -> go
		// store retrieved data back to OS cache for file @<rev>/file[blk]
		blkrev, _ := f.δFtail.LastRevOf(blk, f.head.zconn.At())
		frev, err := groot.mkrevfile(blkrev, f.zbf.POid())
		if err != nil {
			// XXX
		}
		st = fsconn.FileNotifyStoreCache(frev.Inode(), off, blkdata)
		if st != fuse.OK {
			// XXX log	- dup wrt readBlk -> common func.
		}
	}

	// invalidate file/head/data[blk] in OS file cache.
	st = fsconn.FileNotify(f.Inode(), off, blksize)
	// XXX st != ok (fatal here)

	panic("TODO")
}


// mkrevfile makes sure inode ID of /@<rev>/bigfile/<fid> is known to kernel.
//
// We need node ID to be known to the kernel, when we need to store data into
// file's kernel cache - if the kernel doesn't have the node ID for the file in
// question, FileNotifyStoreCache will just fail.
//
// For kernel to know the inode mkrevfile issues regular filesystem lookup
// request which goes to kernel and should go back to wcfs. It is thus not safe
// to use mkrevfile from under FUSE request handler as doing so might deadlock.
func (root *Root) mkrevfile(rev zodb.Tid, fid zodb.Oid) (_ *BigFile, err error) {
	fsconn := gfsconn

	frevpath := fmt.Sprintf("@%s/bigfile/%s", rev, fid) // relative to fs root for now
	defer xerr.Contextf(&err, "/: mkrevfile %s", frevpath)

	// first check without going through kernel, whether the inode is maybe known already
	xfrev := fsconn.LookupNode(root.Inode(), frevpath)
	if xfrev != nil {
		// FIXME checking for "node{0}" is fragile, but currently no other way
		if xfrev.String() != "node{0}" {
			return xfrev.Node().(*BigFile), nil
		}
	}

	// we have to ping the kernel
	frevospath := gmntpt + "/" + frevpath // now starting from OS /
	f, err := os.Open(frevospath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	xfrev = fsconn.LookupNode(root.Inode(), frevpath)
	// must be !nil as open succeeded	XXX better recheck
	return xfrev.Node().(*BigFile), nil
}

// ----------------------------------------

// /(head|<rev>)/at -> readAt serves read.
func (h *Head) readAt() []byte {
	h.zconnMu.RLock()
	defer h.zconnMu.RUnlock()

	return []byte(h.zconn.At().String())
}

// /(head|<rev>)/bigfile/ -> Lookup receives client request to create /(head|<rev>)/bigfile/<bigfileX>.
func (bfdir *BigFileDir) Lookup(out *fuse.Attr, name string, fctx *fuse.Context) (*nodefs.Inode, fuse.Status) {
	f, err := bfdir.lookup(out, name, fctx)
	var inode *nodefs.Inode
	if f != nil {
		inode = f.Inode()
	}
	return inode, err2LogStatus(err)

}

func (bfdir *BigFileDir) lookup(out *fuse.Attr, name string, fctx *fuse.Context) (f *BigFile, err error) {
	defer xerr.Contextf(&err, "/XXXbigfile: lookup %q", name)	// XXX name -> path

	oid, err := zodb.ParseOid(name)
	if err != nil {
		return nil, eINVALf("not oid")
	}

	bfdir.head.zconnMu.RLock()
	defer bfdir.head.zconnMu.RUnlock()

	defer func() {
		if f != nil {
			f.getattr(out)
		}
	}()

	// check to see if dir(oid) is already there
	bfdir.mu.Lock()
	f, already := bfdir.fileTab[oid]
	bfdir.mu.Unlock()

	if already {
		return f, nil
	}

	// not there - without bfdir lock proceed to open BigFile from ZODB
	f, err = bfdir.head.bigopen(asctx(fctx), oid)
	if err != nil {
		return nil, err
	}

	// relock bfdir and either register f or, if the file was maybe
	// simultaneously created while we were not holding bfdir.mu, return that.
	bfdir.mu.Lock()
	f2, already := bfdir.fileTab[oid]
	if already {
		bfdir.mu.Unlock()
		f.Close()
		return f2, nil
	}

	bfdir.fileTab[oid] = f
	bfdir.mu.Unlock()

	// mkfile takes filesystem treeLock - do it outside bfdir.mu
	mkfile(bfdir, name, f)

	return f, nil
}

// XXX do we need to support unlink? (probably no)

// / -> Mkdir receives client request to create @<rev>/.
//
// it is not an error if @<rev>/ already exists - mkdir succeeds and EEXIST is not returned.
// in other words mkdir behaves here similarly to `mkdir -p`.
//
// XXX -> remove mkdir and just create @revX/ on lookup.
func (root *Root) Mkdir(name string, mode uint32, fctx *fuse.Context) (*nodefs.Inode, fuse.Status) {
	inode, err := root.mkdir(name, fctx) // XXX ok to ignore mode?
	return inode, err2LogStatus(err)
}

func (root *Root) mkdir(name string, fctx *fuse.Context) (_ *nodefs.Inode, err error) {
	defer xerr.Contextf(&err, "/: mkdir %q", name)

	var rev zodb.Tid
	ok := false

	if strings.HasPrefix(name, "@") {
		rev, err = zodb.ParseTid(name[1:])
		ok = (err == nil)
	}
	if !ok {
		return nil, eINVALf("not @rev")
	}

	// check to see if dir(rev) is already there
	root.revMu.Lock()
	_, already := root.revTab[rev]
	root.revMu.Unlock()

	if already {
		return nil, syscall.EEXIST
	}

	// not there - without revMu lock proceed to open @rev view of ZODB
	ctx := asctx(fctx)
//	zconnRev, err := root.zopenAt(ctx, rev)
	zconnRev, err := zopen(ctx, root.zdb, &zodb.ConnOptions{At: rev})
	if err != nil {
		return nil, err
	}

	// relock root and either mkdir or EEXIST if the directory was maybe
	// simultaneously created while we were not holding revMu.
	root.revMu.Lock()
	_, already = root.revTab[rev]
	if already {
		root.revMu.Unlock()
//		zconnRev.Release()
		transaction.Current(zconnRev.txnCtx).Abort()
		return nil, syscall.EEXIST
	}

	// XXX -> newHead()
	revDir := &Head{
		Node:  newDefaultNode(),
		rev:   rev,
		zconn: zconnRev,
	}

	bfdir := &BigFileDir{
		Node:    newDefaultNode(),
		head:    revDir,
		fileTab: make(map[zodb.Oid]*BigFile),
	}
	revDir.bfdir = bfdir

	root.revTab[rev] = revDir
	root.revMu.Unlock()

	// mkdir takes filesystem treeLock - do it outside revMu.
	mkdir(root, name, revDir)
	mkdir(revDir, "bigfile", bfdir)
	// XXX + "at"

	return revDir.Inode(), nil
}


// bigopen opens BigFile corresponding to oid on head.zconn.
//
// A ZBigFile corresponding to oid is activated and statted.
//
// head.zconn must be locked.
func (head *Head) bigopen(ctx context.Context, oid zodb.Oid) (_ *BigFile, err error) {
	zconn := head.zconn
	defer xerr.Contextf(&err, "bigopen %s @%s", oid, zconn.At())

	// XXX better ctx = transaction.PutIntoContext(ctx, txn)
	ctx, cancel := xcontext.Merge(ctx, zconn.txnCtx)
	defer cancel()

	xzbf, err := zconn.Get(ctx, oid)
	if err != nil {
		switch errors.Cause(err).(type) {
		case *zodb.NoObjectError:
			return nil, eINVAL(err)
		case *zodb.NoDataError:
			return nil, eINVAL(err) // XXX what to do if it was existing and got deleted?
		default:
			return nil, err
		}
	}

	zbf, ok := xzbf.(*ZBigFile)
	if !ok {
		return nil, eINVALf("%s is not a ZBigFile", typeOf(xzbf))
	}

	// activate ZBigFile and keep it this way
	err = zbf.PActivate(ctx)
	if err != nil {
		return nil, err
	}
	defer func() {
		if err != nil {
			zbf.PDeactivate()
		}
	}()

	zbfSize, err := zbf.Size(ctx)
	if err != nil {
		return nil, err
	}

//	zconn.Incref()
	return &BigFile{
		Node:    newDefaultNode(),
		head:    head,
		zbf:     zbf,
		zbfSize: zbfSize,

		// XXX this is needed only for head/
		δFtail:  NewΔTailI64(),	// XXX indicate we have coverage starting from zconn.at?
		loading: make(map[int64]*blkLoadState),
	}, nil
}

// Close releases all resources of BigFile.
func (f *BigFile) Close() error {
	f.zbf.PDeactivate()	// XXX f.head.zconn must be locked
	f.zbf = nil

//	f.zconn.Release()
//	f.zconn = nil
	f.head = nil

	return nil
}

// /(head|<rev>)/bigfile/<bigfileX> -> GetAttr serves stat.
func (f *BigFile) GetAttr(out *fuse.Attr, _ nodefs.File, _ *fuse.Context) fuse.Status {
	// XXX locking
	f.getattr(out)
	return fuse.OK
}

func (f *BigFile) getattr(out *fuse.Attr) {
	out.Mode = fuse.S_IFREG | 0444
	out.Size = uint64(f.zbfSize)
	// .Blocks
	// .Blksize

	// FIXME lastChange should cover all bigfile data, not only ZBigFile itself
	//mtime := &f.lastChange.Time().Time
	lastChange := f.zbf.PSerial()

	mtime := lastChange.Time().Time
	out.SetTimes(/*atime=*/nil, /*mtime=*/&mtime, /*ctime=*/&mtime)
}


// /(head|<rev>)/bigfile/<bigfileX> -> Read serves reading bigfile data.
func (f *BigFile) Read(_ nodefs.File, dest []byte, off int64, fctx *fuse.Context) (fuse.ReadResult, fuse.Status) {
	f.head.zconnMu.RLock()
	defer f.head.zconnMu.RUnlock()

	zbf := f.zbf

	// cap read request to file size
	end := off + int64(len(dest))		// XXX overflow?
	if end > f.zbfSize {
		end = f.zbfSize
	}
	if end <= off {
		// XXX off >= size -> EINVAL? (but when size=0 kernel issues e.g. [0 +4K) read)
		return fuse.ReadResultData(nil), fuse.OK
	}

	// widen read request to be aligned with blksize granularity
	// (we can load only whole ZBlk* blocks)
	aoff := off - (off % zbf.blksize)
	aend := end
	if re := end % zbf.blksize; re != 0 {
		aend += zbf.blksize - re
	}
	dest = make([]byte, aend - aoff) // ~> [aoff:aend) in file

	// XXX better ctx = transaction.PutIntoContext(ctx, txn)
	ctx, cancel := xcontext.Merge(asctx(fctx), f.head.zconn.txnCtx)
	defer cancel()

	// read/load all block(s) in parallel
	wg, ctx := errgroup.WithContext(ctx)
	for blkoff := aoff; blkoff < aend; blkoff += zbf.blksize {
		blkoff := blkoff
		blk := blkoff / zbf.blksize
		wg.Go(func() error {
			δ := blkoff-aoff // blk position in dest
			//log.Infof("readBlk #%d dest[%d:+%d]", blk, δ, zbf.blksize)
			return f.readBlk(ctx, blk, dest[δ:δ+zbf.blksize])
		})
	}

	err := wg.Wait()
	if err != nil {
		log.Errorf("%s", err)	// XXX + /bigfile/XXX: read [a,b): -> ...
		return nil, fuse.EIO
	}

	return fuse.ReadResultData(dest[off-aoff:end-aoff]), fuse.OK
}
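
/*
Illustration (not part of the original wcfs source): the range arithmetic
performed by Read above can be checked in isolation. The request is first
capped to the file size and then widened to blksize boundaries, because only
whole ZBlk* blocks can be loaded. The name alignToBlk and its signature are
assumptions made for this sketch; blksize is assumed to be > 0 and off >= 0.

```go
package main

import "fmt"

// alignToBlk caps the request [off, off+n) to the file size and then widens
// it to blksize boundaries. It returns the capped end and the aligned range
// [aoff, aend).
func alignToBlk(off, n, size, blksize int64) (end, aoff, aend int64) {
	end = off + n
	if end > size {
		end = size
	}
	if end <= off {
		return off, off, off // nothing to read
	}

	aoff = off - (off % blksize)
	aend = end
	if re := end % blksize; re != 0 {
		aend += blksize - re
	}
	return end, aoff, aend
}

func main() {
	// a 3-byte read at offset 5 of a 10-byte file with 4-byte blocks
	end, aoff, aend := alignToBlk(5, 3, 10, 4)
	fmt.Println(end, aoff, aend) // prints "8 4 8"
}
```

The caller then slices the aligned buffer with [off-aoff : end-aoff] to hand
back only the bytes that were actually requested.
*/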

// readBlk serves Read to read 1 ZBlk #blk into destination buffer.
//
// see "7) when we receive a FUSE read(#blk) request ..." in overview.
//
// len(dest) == blksize.
func (f *BigFile) readBlk(ctx context.Context, blk int64, dest []byte) error {
	// XXX errctx?
	// XXX locking

	// check if someone else is already loading this block
	f.loadMu.Lock()
	loading, already := f.loading[blk]
	if !already {
		loading = &blkLoadState{
			ready:   make(chan struct{}),
		}
		f.loading[blk] = loading
	}
	f.loadMu.Unlock()

	// if it is already loading - just wait for it
	if already {
		select {
		case <-ctx.Done():
			return ctx.Err()

		case <-loading.ready:
			if loading.err == nil {
				copy(dest, loading.blkdata)
			}
			return loading.err
		}
	}

	// no one was loading - we became responsible for loading this block

	zbf := f.zbf
	blkdata, err := zbf.LoadBlk(ctx, blk)	// XXX -> +blkrevmax1
	loading.blkdata = blkdata
	loading.err = err
	close(loading.ready)

	// XXX before loading.ready?
	blkrevmax2, _ := f.δFtail.LastRevOf(blk, zbf.PJar().At())
	//revmax := min(blkrevmax1, blkrevmax2)
	revmax := blkrevmax2
	_ = revmax

/*
	// XXX remmapping - only if head.rev == 0
	// XXX -> own func?
	// XXX locking
	for _, mapping := range f.mappings {
		if revmax <= mapping.at || !mapping.blkrange.in(blk) {
			continue // do nothing
		}

		if mapping.pinned.Contains(blk) {
			continue // do nothing
		}

		rev = max(δFtail.by(blk) : _ <= mapping.at)

		// XXX vvv -> go
		client.remmap(mapping.addr[blk], file/@<rev>/data)
		mapping.pinned.Add(blk)


	}
*/


	// data loaded with error - cleanup .loading
	if loading.err != nil {
		f.loadMu.Lock()
		delete(f.loading, blk)
		f.loadMu.Unlock()
		return err
	}

	// data loaded ok
	copy(dest, blkdata)

	// store to kernel pagecache the whole block that we've just loaded from database.
	// This way, even if the user currently requested to read only a small portion of it,
	// the next, e.g. consecutive, user read request will not hit the DB again
	// and will instead be served by the kernel from its pagecache.
	//
	// We cannot do this directly from reading goroutine - while reading
	// kernel FUSE is holding the corresponding page in pagecache locked, and if
	// we tried to update that same page in pagecache it would result
	// in a deadlock inside the kernel.
	//
	// .loading cleanup is done once we are finished with putting the data into OS pagecache.
	// If we do it earlier - a simultaneous read covered by the same block could
	// miss both the kernel pagecache (if not yet updated) and .loading[blk] (already
	// emptied), and thus would trigger DB access again.
	go func() {
		// XXX locking - invalidation must make sure these workers are finished.

		// XXX if direct-io: don't touch pagecache
		st := gfsconn.FileNotifyStoreCache(f.Inode(), blk*zbf.blksize, blkdata)

		f.loadMu.Lock()
		delete(f.loading, blk)
		f.loadMu.Unlock()

		if st == fuse.OK {
			return
		}

		// pagecache update failed, but it must not (we verified on startup that
		// pagecache control is supported by kernel). We can correctly live on
		// with the error, but data access will be likely very slow. Tell user
		// about the problem.
		log.Errorf("BUG: bigfile %s: blk %d: -> pagecache: %s  (ignoring, but reading from bigfile will be very slow)", zbf.POid(), blk, st)
	}()

	return nil
}





// FIXME groot/gfsconn is tmp workaround for lack of way to retrieve FileSystemConnector from nodefs.Inode
// TODO:
//	- Inode += .Mount() -> nodefs.Mount
//	- Mount:
//		.Root()		-> root Inode of the fs
//		.Connector()	-> FileSystemConnector through which fs is mounted
var groot   *Root
var gfsconn *nodefs.FileSystemConnector

// root of the filesystem is mounted here.
//
// we need to talk to kernel and lookup @<rev>/bigfile/<fid> before uploading
// data to kernel cache there. Referencing root of the filesystem via path is
// vulnerable to bugs wrt e.g. `mount --move` and/or mounting something else
// over wcfs. However keeping an opened root fd would prevent wcfs from being unmounted,
// so we still have to reference the root via path.
var gmntpt string

// debugging
var gdebug = struct {
	// .wcfs/zhead opens
	// protected by groot.head.zconnMu
	zheadSockTab map[*FileSock]struct{}
}{}

func init() {
	gdebug.zheadSockTab = make(map[*FileSock]struct{})
}

// _wcfs_Zhead serves .wcfs/zhead opens.
type _wcfs_Zhead struct {
	nodefs.Node
}

func (zh *_wcfs_Zhead) Open(flags uint32, fctx *fuse.Context) (nodefs.File, fuse.Status) {
	// XXX check flags?
	sk := NewFileSock()
	sk.CloseRead()

	groot.head.zconnMu.Lock()
	defer groot.head.zconnMu.Unlock()

	gdebug.zheadSockTab[sk] = struct{}{}
	return sk.File(), fuse.OK
}

func main() {
	stdlog.SetPrefix("wcfs: ")
	//log.CopyStandardLogTo("WARNING") // XXX -> "DEBUG" if -d ?
	defer log.Flush()

	debug := flag.Bool("d", false, "debug")
	autoexit := flag.Bool("autoexit", false, "automatically stop service when there is no client activity")
	// XXX option to prevent starting if wcfs was already started ?

	flag.Parse()
	if len(flag.Args()) != 2 {
		log.Fatalf("Usage: %s [OPTIONS] zurl mntpt", os.Args[0])
	}
	zurl := flag.Args()[0]
	mntpt := flag.Args()[1]

	// debug -> precise t, no dates	(XXX -> always precise t?)
	if *debug {
		stdlog.SetFlags(stdlog.Lmicroseconds)
	}

	// open zodb storage/db/connection
	ctx := context.Background()	// XXX + timeout?
	zstor, err := zodb.OpenStorage(ctx, zurl, &zodb.OpenOptions{
		ReadOnly: true,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer zstor.Close()

	zdb := zodb.NewDB(zstor)
	zhead, err := zopen(ctx, zdb, &zodb.ConnOptions{
		// we need zhead.cache to be maintained across several transactions.
		// see "3) for head/bigfile/* the following invariant is maintained ..."
		NoPool: true,
	})
	if err != nil {
		log.Fatal(err)
	}
	zhead.Cache().SetControl(&zodbCacheControl{})	// XXX +locking?

	// mount root + head/
	// XXX -> newHead()
	head := &Head{
		Node:  newDefaultNode(),
		rev:   0,
		zconn: zhead,
	}
	bfdir := &BigFileDir{
		Node:    newDefaultNode(),
		head:    head,
		fileTab: make(map[zodb.Oid]*BigFile),
	}
	head.bfdir = bfdir

	root := &Root{
		Node:   newDefaultNode(),
		zstor:  zstor,
		zdb:    zdb,
		head:   head,
		revTab: make(map[zodb.Tid]*Head),
	}

	opts := &fuse.MountOptions{
		FsName: zurl,
		Name:   "wcfs",

		DisableXAttrs: true, // we don't use
		Debug:         *debug,
	}

	fssrv, fsconn, err := mount(mntpt, root, opts)
	if err != nil {
		log.Fatal(err)
	}
	groot   = root		// FIXME temp workaround (see ^^^)
	gfsconn = fsconn	// FIXME ----//----
	gmntpt  = mntpt

	// we require proper pagecache control (added to Linux 2.6.36 in 2010)
	supports := fssrv.KernelSettings().SupportsNotify
	if !(supports(fuse.NOTIFY_STORE_CACHE) && supports(fuse.NOTIFY_RETRIEVE_CACHE)) {
		log.Fatalf("kernel FUSE does not support pagecache control")
	}

	// add entries to /
	mkdir(root, "head", head)
	mkdir(head, "bigfile", bfdir)
	mkfile(head, "at", NewSmallFile(head.readAt))   // TODO mtime(at) = tidtime(at)
	// XXX ^^^ invalidate cache or direct io

	// for debugging/testing
	_wcfs := newDefaultNode()
	mkdir(root, ".wcfs", _wcfs)
	mkfile(_wcfs, "zurl", NewStaticFile([]byte(zurl)))

	// .wcfs/zhead - special file channel that sends zhead.at.
	//
	// If a user opens it, it will start to receive the tids through which
	// zhead.at passes, starting from the time when .wcfs/zhead was opened.
	// There can be multiple openers. Once opened, the file must be read,
	// as wcfs blocks waiting for data to be read before XXX.
	mkfile(_wcfs, "zhead", &_wcfs_Zhead{
		Node: newDefaultNode(),
	})

	// XXX place = ok?
	// XXX ctx = ok?
	// XXX wait for zwatcher shutdown.
	go root.zwatcher(ctx)

	// TODO handle autoexit
	_ = autoexit

	// serve client requests
	fssrv.Serve()	// XXX Serve returns no error
}