
Commit c40f3831 authored Dec 24, 2018 by Kirill Smelkov

Showing 1 changed file with 87 additions and 80 deletions

wcfs/wcfs.go +87 -80

--- a/wcfs/wcfs.go
+++ b/wcfs/wcfs.go
@@ -28,7 +28,7 @@
 // file that represents whole ZBigFile's data.
 //
 // For a client, the primary way to access a bigfile should be to mmap
-// bigfile/<bigfileX>/head/data which represents always latest bigfile data.
+// head/bigfile/<bigfileX> which represents always latest bigfile data.
 // Clients that want to get isolation guarantee should subscribe for
 // invalidations and re-mmap invalidated regions to file with pinned bigfile revision for
 // the duration of their transaction. See "Invalidation protocol" for details.
@@ -42,119 +42,125 @@
 //
 // Top-level structure of provided filesystem is as follows:
 //
-//	bigfile/
-//		<oid(bigfile1)>/
-//			...
-//		<oid(bigfile2)>/
-//			...
+//	head/			; latest database view
 //		...
-//
-// where for a bigfileX there is bigfile/<oid(bigfileX)>/ directory, with
-// oid(bigfileX) being ZODB object-id of corresponding ZBigFile object formatted with %016x.
-//
-// Each bigfileX/ has the following structure:
-//
-//	bigfile/<bigfileX>/
-//		head/		; latest bigfile revision
-//			...
-//		@<tid1>/	; bigfile revision as of transaction <tidX>
-//			...
-//		@<tid2>/
-//			...
+//	@<rev1>/		; database view as of revision <revX>
+//		...
+//	@<rev2>/
 //		...
+//	...
 //
-// where head/ represents latest bigfile as stored in upstream ZODB, and
-// @<tidX>/ represents bigfile as of transaction <tidX>.
+// where head/ represents latest data as stored in upstream ZODB, and
+// @<revX>/ represents data as of revision <revX>.
 //
 // head/ has the following structure:
 //
-//	bigfile/<bigfileX>/head/
-//		data		; latest bigfile data
-//		at		; data is bigfile view as of this ZODB transaction
-//		invalidations	; channel that describes invalidated data regions
+//	head/
+//		at			; data inside head/ is as of this ZODB transaction
+//		watch			; channel for bigfile invalidations
+//		bigfile/		; bigfiles' data
+//			<oid(bigfile1)>
+//			<oid(bigfile2)>
+//			...
 //
-// where /data represents latest bigfile data as stored in upstream ZODB. As
-// there can be some lag receiving updates from the database, /at describes
-// precisely ZODB state for which bigfile data is currently exposed. Whenever
-// bigfile data is changed in upstream ZODB, information about the changes is
-// first propagated to /invalidations, and only after that /data is
-// updated. See "Invalidation protocol" for details.
+// where /bigfile/<bigfileX> represents latest bigfile data as stored in
+// upstream ZODB. As there can be some lag receiving updates from the database,
+// /at describes precisely ZODB state for which bigfile data is currently
+// exposed. Whenever bigfile data is changed in upstream ZODB, information
+// about the changes is first propagated to /watch, and only after that
+// /bigfile/<bigfileX> is updated. See "Invalidation protocol" for details.
 //
-// @<tidX>/ has the following structure:
+// @<revX>/ has the following structure:
 //
-//	bigfile/<bigfileX>/@<tidX>/
-//		data		; bigfile data as of transaction <tidX>
+//	@<revX>/
+//		at
+//		bigfile/		; bigfiles' data as of revision <revX>
+//			<oid(bigfile1)>
+//			<oid(bigfile2)>
+//			...
 //
-// where /data represents bigfile data as of transaction <tidX>.
+// where /bigfile/<bigfileX> represents bigfile data as of revision <revX>.
 //
-// bigfile/<bigfileX>/ should be created by client via mkdir. Unless explicitly
-// created bigfile/<bigfileX>/ are not automatically visible in wcfs
-// filesystem. Similarly bigfile/<bigfileX>/@<tidX>/ should be too created by
-// client.
+// Unless accessed {head,@<revX>}/bigfile/<bigfileX> are not automatically visible in
+// wcfs filesystem. Similarly @<revX>/ should be explicitly created by client via mkdir.
 //
 //
 // Invalidation protocol
 //
-// XXX invalidations will be done via ptrace because we need them to be
-// synchronous (see "wcfs organization")
-//
-// In order to support isolation wcfs implements invalidation protocol that
+// In order to support isolation, wcfs implements invalidation protocol that
 // must be cooperatively followed by both wcfs and client.
 //
-// First, before client wants to mmap bigfile, it opens
-// bigfile/<bigfileX>/head/invalidations and tells wcfs through it for which
-// ZODB state it wants to get bigfile view. The server in turn reports for
-// which ZODB state head/data is current, δ describing changed bigfile region
-// between those revisions, or "wait" flag if server state is earlier compared
-// to what client wants:
+// First, client mmaps latest bigfile, but does not access it
 //
-//	C: want <Cat>
-//	S: have <Sat>, wait		; Sat < Cat
-//	S: have <Sat>, δR(Cat,Sat)	; Sat ≥ Cat
+//	mmap(head/bigfile/<bigfileX>)
 //
-// If server reply was "wait" the client does nothing and waits for next server
-// message which must come without "wait" flag set. When client receives have
-// message with δR(Cat,Sat) it has the guarantee from wcfs that head/data
-// content is for Sat ZODB revision and won't change until client sends ack
-// back to the server. The client in turn now can mmap head/data and
-// @<Cat>/data to get bigfile view as of Cat:
+// Then client opens head/watch and tells wcfs through it for which ZODB state
+// it wants to get bigfile's view.
 //
-//	mmap(bigfile/<bigfileX>/head/data)
-//	mmap(bigfile/<bigfileX>/@<Cat>/data, δR(Cat,Sat), MAP_FIXED)  # mmaped at addresses corresponding to δR(Cat,Sat)
+//	C: 1 watch <bigfileX> @<at>
 //
-// When client completes its initial mmapping it sends ack back to the server:
+// The server then, after potentially sending initial pin messages (see below),
+// reports either success or failure:
 //
-//	C: ack
+//	S: 1 ok
+//	S: 1 error ...		; if <at> is too far away back from head/at
 //
-// From now on the server will be processing updates to bigfile coming from
-// ZODB as follows:
+// The server sends "ok" reply only after head/at is ≥ requested <at>, and
+// only after all initial pin messages are fully acknowledged by the client.
+// The client can start to use mmapped data after it gets "ok".
+// The server sends "error" reply if requested <at> is too far away back from
+// head/at.
 //
+// Upon watch request, either initially, or after sending "ok", the server will be notifying the
+// client about file blocks that client needs to pin in order to observe file's
+// data as of <at> revision:
 //
-// The filesystem server itself receives information about changed data
-// from ZODB server through regular ZODB invalidation channel (as it is ZODB
-// client itself). Then, before actually updating bigfile/<bigfileX>/head/data
-// content in changed part, it notifies through bigfile/<bigfileX>/head/invalidations
-// to clients that had opened this file (separately to each client) about the changes:
+// The filesystem server itself receives information about changed data from
+// ZODB server through regular ZODB invalidation channel (as it is ZODB client
+// itself). Then, separately for each changed file block, before actually
+// updating head/bigfile/<bigfileX> content, it notifies through head/watch to
+// clients, that had requested it (separately to each client), about the
+// changes:
 //
-//	S: have <Sat>, δR(Sat_prev, Sat)
+//	S: 2 pin <bigfileX> #<blk> @<rev_max>
 //
-// where Sat_prev is ZODB revision last reported to client for this bigfile,
-// and waits until they all confirm that changed file part can be updated in
-// global OS cache.
+// and waits until all clients confirm that changed file block can be updated
+// in global OS cache.
 //
-// The client in turn can now re-mmap invalidated regions to bigfile@Cat
+// The client in turn should now re-mmap requested to be pinned block to bigfile@<rev_max>
 //
-//	# mmapped at addresses corresponding to δR(Sat_prev, Sat)
-//	mmap(bigfile/<bigfileX>/@<Cat>/data, δR(Sat_prev, Sat), MAP_FIXED)
+//	# mmapped at address corresponding to #blk
+//	mmap(@<rev_max>/bigfile/<bigfileX>, #blk, MAP_FIXED)
 //
 // and must send ack back to the server when it is done:
 //
-//	C: ack
+//	C: 2 ack
+//
+// The server sends pin notifications only for file blocks, that are known to
+// be potentially changed after client's <at>, and <rev_max> describes the
+// upper bound for the block revision:
+//
+//	<at>	<  <rev_max>
+//
+// The server maintains short history tail of file changes to be able to
+// support openings with <at> being slightly in the past compared to current
+// head/at. The server might reject a watch request if <at> is too far away in
+// the past from head/at. The client is advised to restart its transaction with
+// more up-to-date database view if it gets watch setup error.
+//
+// A later request from the client for the same <bigfileX> but with different
+// <at>, overrides previous watch request for that file. A client can use "-"
+// instead of "@<at>" to stop watching the file.
+//
+// A single client can send several watch requests through single head/watch
+// open, as well as it can use several head/watch opens simultaneously.
+// The server sends pin notifications for all files requested to be watched via
+// every head/watch open.
 //
-// When clients are done with bigfile/<bigfileX>/@<Cat>/data (i.e. Cat
+// When clients are done with @<revX>/bigfile/<bigfileX> (i.e. client's
 // transaction ends and array is unmapped), the server sees number of opened
-// files to bigfile/<bigfileX>/@<Cat>/data drops to zero, and automatically
-// destroys bigfile/<bigfileX>/@<Cat>/ directory after reasonable timeout.
+// files to @<revX>/bigfile/<bigfileX> drops to zero, and automatically
+// destroys @<revX>/bigfile/<bigfileX> after reasonable timeout.
 //
 //
 // Protection against slow or faulty clients
@@ -293,6 +299,7 @@ package main
 //    δFtail.by   allows to quickly lookup information by #blk.
 //
 //    min(rev) in δFtail is min(@at) at which head/data is currently mmapped (see below).
+//    XXX min(10 minutes) of history to support initial openings
 //
 // 7) when we receive a FUSE read(#blk) request to a file/head/data we process it as follows:
 //