1. 26 Oct, 2021 14 commits
    • wcfs: Handle ZODB invalidations · e16e029a
      Kirill Smelkov authored
      Use ΔFtail.Track on every READ and, upon receiving a ZODB invalidation,
      query the accumulated ΔFtail to find out which blocks of which files have
      changed. Then invalidate those blocks in the OS file cache.
      
      See added documentation to wcfs.go and notes.txt for details.
      
      Now the filesystem is no longer stale: it provides a view of data
      that is up to date with respect to changes on ZODB storage.
      
      Some preliminary history:
      
      9b4a42a3    X invalidation design draftly settled
      27d91d47    X δFtail settled
      33e0dfce    X ΔTail draftly done
      822366a7    X keeping fd to root opened prevents the filesystem from being unmounted
      89ad3a79    X Don't keep ZBigFile activated during whole current transaction
      245511ac    X Give pointer on from where to get nxd-fuse.ko
      d1cd128c    X Hit FUSE-related deadlock
      d134ee44    X FUSE lookup deadlock should be hopefully fixed
      0e60e9ff    X wcfs: Don't noise ZWatcher trace logs with "select ..."
      bf9a7405    X No longer rely on ZODB cache invariant for invalidations
    • wcfs: Add FileSock FUSE utility · cb14b213
      Kirill Smelkov authored
      FileSock is bidirectional channel associated with opened file.
      
      FileSock provides streaming write/read operations for filesystem server that
      are correspondingly matched with read/write operations on filesystem user side.
      
      WCFS will use FileSock to implement exchange over .wcfs/zhead and,
      later, head/watch files.
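
      A minimal sketch of the idea, assuming plain io.Pipe from the standard
      library instead of xio.Pipe, and so without IO cancellation. The names
      fileSock, userEnd, newFileSock and roundTrip are hypothetical and are
      not the actual FileSock API:

```go
package main

import (
	"fmt"
	"io"
)

// fileSock is a hypothetical stand-in for the server side: one pipe per direction.
type fileSock struct {
	toUser   *io.PipeWriter // server writes; the user-side read receives it
	fromUser *io.PipeReader // server reads what the user-side write sent
}

// userEnd stands in for read/write calls on the opened file on the user side.
type userEnd struct {
	fromServer *io.PipeReader
	toServer   *io.PipeWriter
}

func newFileSock() (*fileSock, *userEnd) {
	sr, uw := io.Pipe() // user -> server
	ur, sw := io.Pipe() // server -> user
	return &fileSock{toUser: sw, fromUser: sr}, &userEnd{fromServer: ur, toServer: uw}
}

func (s *fileSock) Write(p []byte) (int, error) { return s.toUser.Write(p) }
func (s *fileSock) Read(p []byte) (int, error)  { return s.fromUser.Read(p) }
func (u *userEnd) Write(p []byte) (int, error)  { return u.toServer.Write(p) }
func (u *userEnd) Read(p []byte) (int, error)   { return u.fromServer.Read(p) }

// roundTrip sends msg server -> user and has the user echo it back.
// msg must be non-empty; errors are ignored to keep the sketch short.
func roundTrip(msg string) string {
	srv, usr := newFileSock()
	go func() {
		buf := make([]byte, len(msg))
		io.ReadFull(usr, buf) // user-side read matches server-side write
		usr.Write(buf)        // user-side write matches server-side read
	}()
	go srv.Write([]byte(msg))
	buf := make([]byte, len(msg))
	io.ReadFull(srv, buf)
	return string(buf)
}

func main() {
	fmt.Println(roundTrip("ping"))
}
```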
      
      Some preliminary history:
      
      b17aeb8c    X Change FileSock to use xio.Pipe which is io.Pipe + support for IO cancellation
    • wcfs: zdata: ΔFtail · 23d8da82
      Kirill Smelkov authored
      ΔFtail builds on ΔBtail and provides ZBigFile-level history that WCFS
      will use to compute which blocks of a ZBigFile need to be invalidated in
      OS file cache given raw ZODB changes on ZODB invalidation message.
      
      It also will be used by WCFS to implement isolation protocol, where on
      every FUSE READ request WCFS will query ΔFtail to find out revision of
      corresponding file block.
      
      Quoting ΔFtail documentation:
      
      ---- 8< ----
      
      ΔFtail provides ZBigFile-level history tail.
      
      It translates ZODB object-level changes to information about which blocks of
      which ZBigFile were modified, and provides service to query that information.
      
      ΔFtail class documentation
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      ΔFtail represents tail of revisional changes to files.
      
      It semantically consists of
      
          []δF			; rev ∈ (tail, head]
      
      where δF represents a change in files space
      
          δF:
          	.rev↑
          	{} file ->  {}blk | EPOCH
      
      Only files and blocks explicitly requested to be tracked are guaranteed to
      be present. In particular a block that was not explicitly requested to be
      tracked, even if it was changed in δZ, is not guaranteed to be present in δF.
      
      After file epoch (file creation, deletion, or any other change to file
      object) previous track requests for that file become forgotten and have no
      further effect.
      
      ΔFtail provides the following operations:
      
        .Track(file, blk, path, zblk)	- add file and block reached via BTree path to tracked set.
      
        .Update(δZ) -> δF				- update files δ tail given raw ZODB changes
        .ForgetPast(revCut)			- forget changes ≤ revCut
        .SliceByRev(lo, hi) -> []δF		- query for all files changes with rev ∈ (lo, hi]
        .SliceByFileRev(file, lo, hi) -> []δfile	- query for changes of a file with rev ∈ (lo, hi]
        .BlkRevAt(file, #blk, at) -> blkrev	- query for what is last revision that changed
          					  file[#blk] as of @at database state.
      
      where δfile represents a change to one file
      
          δfile:
          	.rev↑
          	{}blk | EPOCH
      
      See also zodb.ΔTail and xbtree.ΔBtail
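
      To illustrate the (lo, hi] and @at conventions used by the queries
      above, here is a toy model, not the real implementation. The names
      deltaF, sliceByRev and blkRevAt are hypothetical, files are plain
      strings, and epochs and the tracked set are left out:

```go
package main

import "fmt"

type Tid int64

// deltaF is a toy model of δF: one revision and the blocks changed per file.
type deltaF struct {
	Rev    Tid
	ByFile map[string][]int64 // file -> changed block numbers
}

// sliceByRev returns entries with lo < rev <= hi.
// v is assumed to be sorted by Rev in ascending order.
func sliceByRev(v []deltaF, lo, hi Tid) []deltaF {
	var out []deltaF
	for _, d := range v {
		if lo < d.Rev && d.Rev <= hi {
			out = append(out, d)
		}
	}
	return out
}

// blkRevAt returns the last revision <= at that changed file[blk].
// ok is false when no such change is present in the tail.
func blkRevAt(v []deltaF, file string, blk int64, at Tid) (rev Tid, ok bool) {
	for i := len(v) - 1; i >= 0; i-- {
		if v[i].Rev > at {
			continue
		}
		for _, b := range v[i].ByFile[file] {
			if b == blk {
				return v[i].Rev, true
			}
		}
	}
	return 0, false
}

func main() {
	v := []deltaF{
		{Rev: 10, ByFile: map[string][]int64{"f": {1, 2}}},
		{Rev: 20, ByFile: map[string][]int64{"f": {2}}},
	}
	fmt.Println(len(sliceByRev(v, 10, 20))) // 1: rev=10 is excluded, rev=20 included
	fmt.Println(blkRevAt(v, "f", 2, 15))    // 10 true
}
```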
      
      Concurrency
      
      ΔFtail is safe to use in single-writer / multiple-readers mode. That is at
      any time there should be either only sole writer, or, potentially several
      simultaneous readers. The table below classifies operations:
      
          Writers:  Update, ForgetPast
          Readers:  Track + all queries (SliceByRev, SliceByFileRev, BlkRevAt)
      
      Note that, in particular, it is correct to run multiple Track and queries
      requests simultaneously.
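
      One way a caller could enforce this discipline is a sync.RWMutex held
      exclusively around writers and shared around readers. The sketch below
      only illustrates that pattern; guardedTail and its methods are
      hypothetical names, and the tail is modelled as a plain slice of
      revisions:

```go
package main

import (
	"fmt"
	"sync"
)

// guardedTail is a hypothetical caller-side wrapper around a structure that
// supports single-writer / multiple-readers use.
type guardedTail struct {
	mu   sync.RWMutex
	revs []int64
}

// Update is a writer: it takes the lock exclusively.
func (g *guardedTail) Update(rev int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.revs = append(g.revs, rev)
}

// CountSince is a reader: many readers may hold the shared lock at once.
func (g *guardedTail) CountSince(lo int64) int {
	g.mu.RLock()
	defer g.mu.RUnlock()
	n := 0
	for _, r := range g.revs {
		if r > lo {
			n++
		}
	}
	return n
}

func main() {
	g := &guardedTail{}
	for rev := int64(1); rev <= 3; rev++ {
		g.Update(rev)
	}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = g.CountSince(1) // readers run in parallel
		}()
	}
	wg.Wait()
	fmt.Println(g.CountSince(1)) // 2
}
```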
      
      ΔFtail organization
      ~~~~~~~~~~~~~~~~~~~
      
      ΔFtail leverages:
      
          - ΔBtail to track changes to ZBigFile.blktab BTree, and
          - ΔZtail to track changes to ZBlk objects and to ZBigFile object itself.
      
      then every query merges ΔBtail and ΔZtail data on the fly to provide
      ZBigFile-level result.
      
      Merging on the fly, contrary to computing and maintaining vδF data, is done
      to avoid complexity of recomputing vδF when tracking set changes. Most of
      ΔFtail complexity is, thus, located in ΔBtail, which implements BTree diff
      and handles complexity of recomputing vδB when set of tracked blocks
      changes after new track requests.
      
      Changes to ZBigFile object indicate epochs. Epochs could be:
      
          - file creation or deletion,
          - change of ZBigFile.blksize,
          - change of ZBigFile.blktab to point to another BTree.
      
      Epochs represent major changes to file history where file is assumed to
      change so dramatically, that practically it can be considered to be a
      "whole" change. In particular, WCFS, upon seeing a ZBigFile epoch,
      invalidates all data in corresponding OS-level cache for the file.
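
      The difference between an ordinary change and an epoch can be sketched
      as follows. This is an illustration only: deltaFile and
      blocksToInvalidate are hypothetical names, and the OS cache is
      modelled as a list of cached block numbers:

```go
package main

import "fmt"

// deltaFile is a toy model of δfile: either an epoch or a set of changed blocks.
type deltaFile struct {
	Epoch  bool
	Blocks []int64
}

// blocksToInvalidate returns which of the cached blocks must be dropped.
func blocksToInvalidate(d deltaFile, cached []int64) []int64 {
	if d.Epoch {
		// epoch: treat the whole file as changed
		return append([]int64(nil), cached...)
	}
	changed := make(map[int64]bool, len(d.Blocks))
	for _, b := range d.Blocks {
		changed[b] = true
	}
	var out []int64
	for _, b := range cached {
		if changed[b] {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	cached := []int64{1, 2, 3}
	fmt.Println(blocksToInvalidate(deltaFile{Blocks: []int64{1, 5}}, cached)) // [1]
	fmt.Println(blocksToInvalidate(deltaFile{Epoch: true}, cached))           // [1 2 3]
}
```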
      
      The only historical data, that ΔFtail maintains by itself, is history of
      epochs. That history does not need to be recomputed when more blocks become
      tracked and is thus easy to maintain. It also can be maintained only in
      ΔFtail because ΔBtail and ΔZtail do not "know" anything about ZBigFile.
      
      Concurrency
      
      In order to allow multiple Track and queries requests to be served in
      parallel, ΔFtail bases its concurrency promise on ΔBtail guarantees +
      snapshot-style access for vδE and ztrackInBlk in queries:
      
      1. Track calls ΔBtail.Track and quickly updates .byFile, .byRoot and
         _RootTrack indices under a lock.
      
      2. BlkRevAt queries ΔBtail.GetAt and then combines retrieved information
         about zblk with vδE and δZ.
      
      3. SliceByFileRev queries ΔBtail.SliceByRootRev and then merges retrieved
         vδT data with vδZ, vδE and ztrackInBlk.
      
      4. In queries vδE is retrieved/built in snapshot style similarly to how vδT
         is built in ΔBtail. Note that vδE needs to be built only the first time,
         and does not need to be further rebuilt, so the logic in ΔFtail is simpler
         compared to ΔBtail.
      
      5. for ztrackInBlk - that is used by SliceByFileRev query - an atomic
         snapshot is retrieved for objects of interest. This allows holding the
         δFtail.mu lock only briefly, without blocking other parallel
         Track/queries requests for long.
      
      Combined, this organization allows non-overlapping queries/track-requests
      to run simultaneously. (This property is essential to WCFS because otherwise
      WCFS would not be able to serve several non-overlapping READ requests to one
      file in parallel.)
      
      See also "Concurrency" in ΔBtail organization for more details.
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Some preliminary history:
      
      ef74aebc    X ΔFtail: Keep reference to ZBigFile via Oid, not via *ZBigFile
      bf9a7405    X No longer rely on ZODB cache invariant for invalidations
      46340069    X found by Random
      e7b598c6    X start of ΔFtail.SliceByFileRev rework to function via merging δB and δZ histories on the fly
      59c83009    X ΔFtail.SliceByFileRoot tests started to work draftly after "on-the-fly" rework
      210e9b07    X Fix ΔBtail.SliceByRootRev (lo,hi] handling
      bf3ace66    X ΔFtail: Rebuild vδE after first track
      46624787    X ΔFtail: `go test -failfast -short -v -run Random -randseed=1626793016249041295` discovered problems
      786dd336    X Size no longer tracks [0,∞) since we start tracking when zfile is non-empty
      4f707117    X test that shows problem of SliceByRootRev where untracked blocks are not added uniformly into whole history
      c0b7e4c3    X ΔFtail.SliceByFileRev: Fix untracked entries to be present uniformly in result
      aac37c11    X zdata: Introduce T to start removing duplication in tests
      bf411aa9    X zdata: Deduplicate zfile loading
      b74dda09    X Start switching Track from Track(key) to Track(keycov)
      aa0288ce    X Switch SliceByRootRev to vδTSnapForTracked
      588a512a    X zdata: Switch SliceByFileRev not to clone Zinblk
      8b5d8523    X Move tracking of which blocks were accessed from wcfs to ΔFtail
    • wcfs: xbtree: ΔBtail · 305d897b
      Kirill Smelkov authored
      ΔBtail provides BTree-level history tail that WCFS - via ΔFtail - will
      use to compute which blocks of a ZBigFile need to be invalidated in OS
      file cache given raw ZODB changes on ZODB invalidation message.
      
      It also will be used by WCFS to implement isolation protocol, where on
      every FUSE READ request WCFS will query ΔBtail - again via ΔFtail - to
      find out revision of corresponding file block.
      
      Quoting ΔBtail documentation:
      
      ---- 8< ----
      
      ΔBtail provides BTree-level history tail.
      
      It translates ZODB object-level changes to information about which keys of
      which BTree were modified, and provides service to query that information.
      
      ΔBtail class documentation
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      ΔBtail represents tail of revisional changes to BTrees.
      
      It semantically consists of
      
          []δB			; rev ∈ (tail, head]
      
      where δB represents a change in BTrees space
      
          δB:
          	.rev↑
          	{} root -> {}(key, δvalue)
      
      It covers only changes to keys from tracked subset of BTrees parts.
      In particular a key that was not explicitly requested to be tracked, even if
      it was changed in δZ, is not guaranteed to be present in δB.
      
      ΔBtail provides the following operations:
      
        .Track(path)	- start tracking tree nodes and keys; root=path[0], keys=path[-1].(lo,hi]
      
        .Update(δZ) -> δB				- update BTree δ tail given raw ZODB changes
        .ForgetPast(revCut)			- forget changes ≤ revCut
        .SliceByRev(lo, hi) -> []δB		- query for all trees changes with rev ∈ (lo, hi]
        .SliceByRootRev(root, lo, hi) -> []δT	- query for changes of a tree with rev ∈ (lo, hi]
        .GetAt(root, key, at) -> (value, rev)	- get root[key] @at assuming root[key] ∈ tracked
      
      where δT represents a change to one tree
      
          δT:
          	.rev↑
          	{}(key, δvalue)
      
      An example for tracked set is a set of visited BTree paths.
      There is no requirement that tracked set belongs to only one single BTree.
      
      See also zodb.ΔTail and zdata.ΔFtail
      
      Concurrency
      
      ΔBtail is safe to use in single-writer / multiple-readers mode. That is at
      any time there should be either only sole writer, or, potentially several
      simultaneous readers. The table below classifies operations:
      
          Writers:  Update, ForgetPast
          Readers:  Track + all queries (SliceByRev, SliceByRootRev, GetAt)
      
      Note that, in particular, it is correct to run multiple Track and queries
      requests simultaneously.
      
      ΔBtail organization
      ~~~~~~~~~~~~~~~~~~~
      
      ΔBtail keeps raw ZODB history in ΔZtail and uses BTree-diff algorithm(*) to
      turn δZ into BTree-level diff. For each tracked BTree a separate ΔTtail is
      maintained with tree-level history in ΔTtail.vδT .
      
      Because it is very computationally expensive(+) to find out for an object to
      which BTree it belongs, ΔBtail cannot provide full BTree-level history given
      just ΔZtail with δZ changes. Due to this ΔBtail requires help from
      users, which are expected to call ΔBtail.Track(treepath) to let ΔBtail know
      that such and such ZODB objects constitute a path from root of a tree to some
      of its leaf. After Track call the objects from the path and tree keys, that
      are covered by leaf node, become tracked: from now-on ΔBtail will detect
      and provide BTree-level changes caused by any change of tracked tree objects
      or tracked keys. This guarantee can be provided because ΔBtail now knows
      that such and such objects belong to a particular tree.
      
      To manage knowledge which tree part is tracked ΔBtail uses PPTreeSubSet.
      This data-structure represents so-called PP-connected set of tree nodes:
      simply speaking it builds on some leafs and then includes parent(leaf),
      parent(parent(leaf)), etc. In other words it's a "parent"-closure of the
      leafs. The property of being PP-connected means that starting from any node
      from such set, it is always possible to reach root node by traversing
      .parent links, and that every intermediate node went-through during
      traversal also belongs to the set.
      
      A new Track request potentially grows tracked keys coverage. Due to this,
      on a query, ΔBtail needs to recompute potentially whole vδT of the affected
      tree. This recomputation is managed by "vδTSnapForTracked*" and "_rebuild"
      functions and uses the same treediff algorithm, that Update is using, but
      modulo PPTreeSubSet corresponding to δ key coverage. Update also potentially
      needs to rebuild whole vδT history, not only append new δT, because a
      change to tracked tree nodes can result in growth of tracked key coverage.
      
      Queries are relatively straightforward code that work on vδT snapshot. The
      main complexity, besides BTree-diff algorithm, lies in recomputing vδT when
      set of tracked keys changes, and in handling that recomputation in such a way
      that multiple Track and queries requests could be all served in parallel.
      
      Concurrency
      
      In order to allow multiple Track and queries requests to be served in
      parallel ΔBtail employs special organization of vδT rebuild process where
      complexity of concurrency is reduced to math on merging updates to vδT and
      trackSet, and on key range lookup:
      
      1. vδT is managed under read-copy-update (RCU) discipline: before making
         any vδT change the mutator atomically clones whole vδT and applies its
         change to the clone. This way a query, once it retrieves vδT snapshot,
         does not need to further synchronize with vδT mutators, and can rely on
         that retrieved vδT snapshot will remain immutable.
      
      2. a Track request goes through 3 states: "new", "handle-in-progress" and
         "handled". At each state keys/nodes of the Track are maintained in:
      
         - ΔTtail.ktrackNew and .trackNew       for "new",
         - ΔTtail.krebuildJobs                  for "handle-in-progress", and
         - ΔBtail.trackSet                      for "handled".
      
         trackSet keeps nodes, and implicitly keys, from all handled Track
         requests. For all keys, covered by trackSet, vδT is fully computed.
      
         a new Track(keycov, path) is remembered in ktrackNew and trackNew to be
         further processed when a query should need keys from keycov. vδT is not
         yet providing data for keycov keys.
      
         when a Track request starts to be processed, its keys and nodes are moved
         from ktrackNew/trackNew into krebuildJobs. vδT is not yet providing data
         for requested-to-be-tracked keys.
      
         all trackSet, trackNew/ktrackNew and krebuildJobs are completely disjoint:
      
          trackSet ^ trackNew     = ø
          trackSet ^ krebuildJobs = ø
          trackNew ^ krebuildJobs = ø
      
      3. when a query is served, it needs to retrieve vδT snapshot that takes
         related previous Track requests into account. Retrieving such snapshots
         is implemented in vδTSnapForTracked*() family of functions: there it
         checks ktrackNew/trackNew, and if those sets overlap with query's keys
         of interest, run vδT rebuild for keys queued in ktrackNew.
      
         the main part of that rebuild can be run without any locks, because it
         does not use nor modify any ΔBtail data, and for δ(vδT) it just computes
         a fresh full vδT build modulo retrieved ktrackNew. Only after that
         computation is complete, ΔBtail is locked again to quickly merge in
         δ(vδT) update back into vδT.
      
         This organization is based on the fact that
      
          vδT/(T₁∪T₂) = vδT/T₁ | vδT/T₂
      
           ( i.e. vδT computed for tracked set being union of T₁ and T₂ is the
             same as merge of vδT computed for tracked set T₁ and vδT computed
            for tracked set T₂ )
      
         and that
      
          trackSet | (δPP₁|δPP₂) = (trackSet|δPP₁) | (trackSet|δPP₂)
      
          ( i.e. tracking set updated for union of δPP₁ and δPP₂ is the same
            as union of tracking set updated with δPP₁ and tracking set updated
            with δPP₂ )
      
         these merge properties allow running the computation of δ(vδT) and δ(trackSet)
         independently and with ΔBtail unlocked, which in turn enables running
         several Track/queries in parallel.
      
      4. while vδT rebuild is being run, krebuildJobs keeps corresponding keycov
         entry to indicate in-progress rebuild. Should a query need vδT for keys
         from that job, it first waits for corresponding job(s) to complete.
      
      The rebuild organization explained above allows non-overlapping queries/track-requests
      to run simultaneously. (This property is essential to WCFS because otherwise
      WCFS would not be able to serve several non-overlapping READ requests to one
      file in parallel.)
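
      The read-copy-update discipline from point 1 can be sketched with
      sync/atomic as below. This is only an illustration of the pattern
      under simplifying assumptions: vδT is modelled as a slice of
      revisions, and rcuVec, Snapshot and Append are hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// rcuVec keeps a slice that is never modified in place once published.
type rcuVec struct {
	mu  sync.Mutex   // serializes mutators only; readers never take it
	cur atomic.Value // always holds a []int64
}

func newRCUVec() *rcuVec {
	v := &rcuVec{}
	v.cur.Store([]int64(nil))
	return v
}

// Snapshot returns the current version; the caller may read it without
// further synchronization because published versions are immutable.
func (v *rcuVec) Snapshot() []int64 {
	return v.cur.Load().([]int64)
}

// Append clones the current version, changes the clone and publishes it.
func (v *rcuVec) Append(x int64) {
	v.mu.Lock()
	defer v.mu.Unlock()
	old := v.Snapshot()
	clone := make([]int64, len(old), len(old)+1)
	copy(clone, old)
	v.cur.Store(append(clone, x))
}

func main() {
	v := newRCUVec()
	snap0 := v.Snapshot()
	v.Append(1)
	v.Append(2)
	fmt.Println(len(snap0), v.Snapshot()) // 0 [1 2]
}
```

      The cost is a full copy on every mutation, which is acceptable when
      mutations are rare compared to queries.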
      
      --------
      
      (*) implemented in treediff.go
      (+) full database scan
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Some preliminary history:
      
      877e64a9    X wcfs: Fix tests to pass again
      c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
      78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
      5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
      f65f775b    X wcfs/xbtree: treediff(ø, b)
      c75b1c6f    X wcfs/xbtree: Start killing holeIdx
      0fa06cbd    X kadj must be taken into account as kadj^δZ
      ef5e5183    X treediff ret += δtkeycov
      f30826a6    X another bug in δtkeyconv computation
      0917380e    X wcfs: assert that keycov only grow
      502e05c2    X found why TestΔBTailAllStructs was not effective to find δtkeycov bugs
      450ba707    X Fix rebuild with ø @at2
      f60528c9    X ΔBtail.Clone had bug that it was aliasing klon and orig data
      9d20f8e8    X treediff: Fix BUG while computing AB coverage
      ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
      324241eb    X rebuild: tests: Don't reflect.DeepEqual in inner loop
      8f6e2b1e    X rebuild: tests: Don't access ZODB in XGetδKV
      2c0b4793    X rebuild: tests: Don't access ZODB in xtrackKeys
      8f0e37f2    X rebuild: tests: Precompute kadj10·kadj21
      271d953d    X rebuild: tests: Move ΔBtail.Clone test out of hot inner loop into separate test
      a87cc6de    X rebuild: tests: Don't recompute trackSet(keys1R2) several times
      01433e96    X rebuild: tests: Don't compute keyCover in trackSet
      7371f9c5    X rebuild: tests: Inline _assertTrack
      3e9164b3    X rebuild: tests: Don't exercise keys from keys2 that already became tracked after Track(keys1) + Update
      e9c4b619    X rebuild: tests: Random testing
      d0fe680a    X δbtail += ForgetPast
      210e9b07    X Fix ΔBtail.SliceByRootRev (lo,hi] handling
      855ab4b8    X ΔBtail: Goodbye .KVAtTail
      2f5582e6    X ΔBtail: Tweak tests to run faster in normal mode
      cf352737    X random testing found another failing test for rebuild...
      7f7e34e0    X wcfs/xbtree: Fix update not to add duplicate extra point if rebuild  - called by Update - already added it
      6ad0052c    X ΔBtail.Track: No need to return error
      aafcacdf    X xbtree: GetAt test
      784a6761    X xbtree: Fix KAdj definition after treediff was reworked this summer to base decisions on node keycoverage instead of particular node keys
      0bb1c22e    X xbtree: Verify that ForgetPast clones vδT on trim
      a8945cbf    X Start reworking rebuild routines not to modify data inplace
      b74dda09    X Start switching Track from Track(key) to Track(keycov)
      dea85e87    X Switch GetAt to vδTSnapForTrackedKey
      aa0288ce    X Switch SliceByRootRev to vδTSnapForTracked
      c4366b14    X xbtree: tests: Also verify state of ΔTtail.ktrackNew
      b98706ad    X Track should be nop if keycov/path is already in krebuildJobs
      e141848a    X test.go  ↑ timeout  10m -> 20m
    • wcfs: xbtree: BTree-diff algorithm · b7b59e20
      Kirill Smelkov authored
      This algorithm will be internally used by ΔBtail in the next patch.
      
      The algorithm would be simple if we needed to diff two trees
      completely. However, in ΔBtail only a subset of BTree nodes is tracked(*)
      and the diff has to work modulo that tracking set.
      
      No tests now because ΔBtail tests will cover treediff functionality as well.
      
      Some preliminary history:
      
      78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
      5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
      f65f775b    X wcfs/xbtree: treediff(ø, b)
      c75b1c6f    X wcfs/xbtree: Start killing holeIdx
      ef5e5183    X treediff ret += δtkeycov
      9d20f8e8    X treediff: Fix BUG while computing AB coverage
      ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
      f68398c9    X wcfs: Move treediff into its own file
      
      (*) because full BTree scan is needed to discover all of its nodes.
      
      Quoting treediff documentation:
      
      ---- 8< ----
      
      treediff provides diff for BTrees
      
      Use δZConnectTracked + treediff to compute BTree-diff caused by δZ:
      
          δZConnectTracked(δZ, trackSet)                         -> δZTC, δtopsByRoot
          treediff(root, δtops, δZTC, trackSet, zconn{Old,New})  -> δT, δtrack, δtkeycov
      
      δZConnectTracked computes BTree-connected closure of δZ modulo tracked set
      and also returns δtopsByRoot to indicate which tree objects were changed and
      in which subtree parts. With that information one can call treediff for each
      changed root to compute BTree-diff and δ for trackSet itself.
      
      BTree diff algorithm
      
      diffT, diffB and δMerge constitute the diff algorithm implementation.
      diff(A,B) works on pair of A and B whole key ranges split into regions
      covered by tree nodes. The splitting represents current state of recursion
      into corresponding tree. If a node in particular key range is Bucket, that
      bucket contributes to δ- in case of A, and to δ+ in case of B. If a node in
      particular key range is Tree, the algorithm may want to expand that tree
      node into its children and to recurse into some of the children.
      
      There are two phases:
      
      - Phase 1 expands A top->down driven by δZTC, adds reached buckets to δ-,
        and queues key regions of those buckets to be processed on B.
      
      - Phase 2 starts processing from queued key regions, expands them on B and
        adds reached buckets to δ+. Then it iterates to reach consistency in between
        A and B because processing buckets on B side may increase δ key coverage,
        and so corresponding key ranges have to be processed again on A. That in
        turn may increase δ key coverage again, which needs to be processed on B side,
        etc...
      
      The final δ is merge of δ- and δ+.
      
      diffT has more detailed explanation of phase 1 and phase 2 logic.
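
      The last step, merging δ- and δ+ into the final δ, can be illustrated
      on flat maps. This sketch ignores tree structure and tracking
      entirely. The names kvDelta and mergeDelta are hypothetical, and the
      empty string is assumed to mean "key absent":

```go
package main

import "fmt"

// kvDelta is an old -> new value change for one key; "" means absent (assumption).
type kvDelta struct{ Old, New string }

// mergeDelta combines entries removed on the A side (del) with entries added
// on the B side (add). Keys whose value is the same on both sides cancel out.
func mergeDelta(del, add map[int64]string) map[int64]kvDelta {
	out := map[int64]kvDelta{}
	for k, v := range del {
		out[k] = kvDelta{Old: v}
	}
	for k, v := range add {
		d := out[k]
		d.New = v
		if d.Old == d.New {
			delete(out, k) // same value in A and B: no change
		} else {
			out[k] = d
		}
	}
	return out
}

func main() {
	del := map[int64]string{1: "a", 2: "b"} // collected from buckets of A
	add := map[int64]string{2: "b", 3: "c"} // collected from buckets of B
	fmt.Println(mergeDelta(del, add))
}
```

      Here key 1 is reported as deleted, key 3 as added, and key 2 drops
      out because its value did not change.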
    • wcfs: xbtree: blib += PPTreeSubSet, ΔPPTreeSubSet · ce84b07f
      Kirill Smelkov authored
      These data structures will be used in ΔBtail to maintain the set of tracked
      BTree nodes, and to represent δ to such a set.
      
      Some preliminary history:
      
      78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
      5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
      f65f775b    X wcfs/xbtree: treediff(ø, b)
      66bc41ce    X Fix bug in PPTreeSubSet.Difference  - it was always leaving root node alive
      ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
      a87cc6de    X rebuild: tests: Don't recompute trackSet(keys1R2) several times
      
      Quoting PPTreeSubSet and ΔPPTreeSubSet documentation:
      
      ---- 8< ----
      
      PPTreeSubSet represents PP-connected subset of tree node objects.
      
      It is
      
          PP(xleafs)
      
      where PP(node) maps node to {node, node.parent, node.parent.parent, ...} up
      to top root from where the node is reached.
      
      The nodes in the set are represented by their Oid.
      
      Usually PPTreeSubSet is built as PP(some-leafs), but in general the starting
      nodes are arbitrary. PPTreeSubSet can also have many root nodes, thus not
      necessarily representing a subset of a single tree.
      
      Usual set operations are provided: Union, Difference and Intersection.
      
      Nodes can be added into the set via AddPath. Path is reverse operation - it
      returns path to tree node given its oid.
      
      Every node in the set comes with .parent pointer.
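
      A minimal sketch of the parent-pointer representation with AddPath and
      Path, under simplifying assumptions: ppSet and noParent are
      hypothetical names, child counting and set operations are omitted,
      and re-adding a node under a different parent is not detected:

```go
package main

import "fmt"

type Oid int64

const noParent Oid = -1 // marks a root node (assumption of this sketch)

// ppSet maps each node in the set to its parent.
type ppSet map[Oid]Oid

// AddPath adds all nodes of a root -> ... -> node path, so the set stays
// closed under taking .parent.
func (s ppSet) AddPath(path []Oid) {
	parent := noParent
	for _, n := range path {
		s[n] = parent
		parent = n
	}
}

// Path returns the root -> ... -> oid path, or nil if oid is not in the set.
func (s ppSet) Path(oid Oid) []Oid {
	var rev []Oid
	for n := oid; n != noParent; n = s[n] {
		if _, ok := s[n]; !ok {
			return nil
		}
		rev = append(rev, n)
	}
	for i, j := 0, len(rev)-1; i < j; i, j = i+1, j-1 {
		rev[i], rev[j] = rev[j], rev[i]
	}
	return rev
}

func main() {
	s := ppSet{}
	s.AddPath([]Oid{1, 2, 3})
	s.AddPath([]Oid{1, 2, 4})
	fmt.Println(s.Path(4)) // [1 2 4]
}
```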
      
      ~~~~
      
      ΔPPTreeSubSet represents a change to PPTreeSubSet.
      
      It can be applied via PPTreeSubSet.ApplyΔ .
      
      The result B of applying δ to A is:
      
          B = A.xDifference(δ.Del).xUnion(δ.Add)		(*)
      
      (*) NOTE δ.Del and δ.Add might have their leafs starting from non-leaf nodes in A/B.
          This situation arises when δ represents a change in path to particular
          node, but that node itself does not change, for example:
      
                 c*             c
                / \            /
              41*  42         41
               |    |         | \
              22   43        46  43
                    |         |   |
                   44        22  44
      
          Here nodes {c, 41} are changed, node 42 is unlinked, and node 46 is added.
          Nodes 43 and 44 stay unchanged.
      
              δ.Del = c-42-43   | c-41-22
              δ.Add = c-41-43   | c-41-46-22
      
          The second component with "-22" builds from leaf, but the first
          component with "-43" builds from non-leaf node.
      
              ΔnchildNonLeafs = {43: +1}
      
          Only complete result of applying all
      
              - xfixup(-1, ΔnchildNonLeafs)
              - δ.Del,
              - δ.Add, and
              - xfixup(+1, ΔnchildNonLeafs)
      
          produces correctly PP-connected set.
    • wcfs: xbtree: blib += RangedMap, RangedKeySet · b7c560c5
      Kirill Smelkov authored
      RangedMap is Key->VALUE map with adjacent keys mapped to the same value coalesced into Ranges.
      RangedKeySet is set of Keys with adjacent keys coalesced into Ranges.
      
      These data structures will be needed for ΔBtail.
      
      For now the implementation is simple since it keeps whole map in a
      linear slice because both RangedMap and RangedKeySet will be used in
      ΔBtail to keep something proportional to δ of a change, which is assumed
      to be small or medium most of the time.
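
      The linear-slice approach can be sketched as follows. This is an
      illustration rather than the blib code: keyRange, rangedKeySet,
      AddRange and Has are hypothetical names, and ranges are assumed to be
      half-open [Lo, Hi):

```go
package main

import "fmt"

// keyRange is the half-open range [Lo, Hi) (assumption of this sketch).
type keyRange struct{ Lo, Hi int64 }

// rangedKeySet keeps ranges sorted, non-overlapping and non-adjacent.
type rangedKeySet struct{ ranges []keyRange }

// AddRange inserts r and coalesces it with overlapping or adjacent ranges.
func (s *rangedKeySet) AddRange(r keyRange) {
	if r.Lo >= r.Hi {
		return // empty range
	}
	var out []keyRange
	i := 0
	// ranges that end strictly before r stay as they are
	for ; i < len(s.ranges) && s.ranges[i].Hi < r.Lo; i++ {
		out = append(out, s.ranges[i])
	}
	// ranges that touch or overlap r are absorbed into it
	for ; i < len(s.ranges) && s.ranges[i].Lo <= r.Hi; i++ {
		if s.ranges[i].Lo < r.Lo {
			r.Lo = s.ranges[i].Lo
		}
		if s.ranges[i].Hi > r.Hi {
			r.Hi = s.ranges[i].Hi
		}
	}
	out = append(out, r)
	out = append(out, s.ranges[i:]...)
	s.ranges = out
}

// Has reports whether key k is in the set; a linear scan is enough for small sets.
func (s *rangedKeySet) Has(k int64) bool {
	for _, r := range s.ranges {
		if r.Lo <= k && k < r.Hi {
			return true
		}
	}
	return false
}

func main() {
	s := &rangedKeySet{}
	s.AddRange(keyRange{1, 3})
	s.AddRange(keyRange{5, 7})
	s.AddRange(keyRange{3, 5}) // bridges the gap
	fmt.Println(s.ranges)      // [{1 7}]
}
```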
      
      Some preliminary history:
      
      6ea5920a    X xbtree: Less copy/garbage in RangedKeySet ops
      3ecacd99    X need to keep Value first so that sizeof(set-entry) = sizeof(KeyRange)
      a5b9b19b    X SetRange draftly works
      ed2de0de    X Tests for Get
      3b7b69e6    X fixes for empty set/range
      6972f999    X xbtree/blib: RangedMap, RangedSet += IntersectsRange, Intersection
    • wcfs: tests: Tree-based testing environment · 828da0e1
      Kirill Smelkov authored
      Add treeenv.go that combines Treegen and client side access to ZODB with
      committed trees as extension to testing.T . The environment allows to
      easily see which tree update was committed, what is the difference in
      terms of KV, what is the state of updated tree and state of pointed-to
      ZBlk objects.
      
      This will be used to test upcoming ΔBtail and ΔFtail.
      
      Main functionality is in treeenv.go; the other added files are to
      support that.
      
      Some preliminary history:
      
      f07502fc    X xbtreetest: Teach T & Commit to automatically provide At in symbolic form
      0d62b05e    X Adjust to btree.VGet & friends signature change to include keycov in visit callback
      588a512a    X zdata: Switch SliceByFileRev not to clone Zinblk
      e9c4b619    X rebuild: tests: Random testing
      43090ac7    X tests: Factor-out tree-test-env into tTreeEnv
      d4a523b2    X δbtail: tests: Run much faster with live ZODB cache
      271d953d    X rebuild: tests: Move ΔBtail.Clone test out of hot inner loop into separate test
      c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
      5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
      8f6e2b1e    X rebuild: tests: Don't access ZODB in XGetδKV
    • wcfs: Set package · 84b89f42
      Kirill Smelkov authored
      Lacking generics we have set.go.in and instantiations for Set[int64],
      Set[string], Set[Oid] and Set[Tid] - that will be used in follow-up
      patches.
      
      The set.go.in itself is mostly a generalized copy from git-backup:
      
      https://lab.nexedi.com/kirr/git-backup/blob/c9db60e8/set.go
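
      For illustration, one instantiation of such a template could look
      roughly like the sketch below. The type and method names (SetI64,
      Add, Has, Update, SortedElements) are assumptions and not necessarily
      those generated from set.go.in:

```go
package main

import (
	"fmt"
	"sort"
)

// SetI64 is a set of int64 built on a map with empty-struct values.
type SetI64 map[int64]struct{}

// Add inserts v into the set.
func (s SetI64) Add(v int64) { s[v] = struct{}{} }

// Has reports whether v is in the set.
func (s SetI64) Has(v int64) bool {
	_, ok := s[v]
	return ok
}

// Update adds all elements of t to s.
func (s SetI64) Update(t SetI64) {
	for v := range t {
		s.Add(v)
	}
}

// SortedElements returns the elements in ascending order.
func (s SetI64) SortedElements() []int64 {
	out := make([]int64, 0, len(s))
	for v := range s {
		out = append(out, v)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	a := SetI64{}
	a.Add(3)
	a.Add(1)
	b := SetI64{}
	b.Add(2)
	a.Update(b)
	fmt.Println(a.SortedElements()) // [1 2 3]
}
```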
    • wcfs: tests: Treegen functionality · 9cb12737
      Kirill Smelkov authored
      treegen.go and treegen.py together provide a way
      
      - to commit a particular BTree topology into ZODB, and
      - to generate set of random tree topologies that all correspond to particular {k->v} dict.
      
      This will be used in upcoming ΔBtail and ΔFtail tests.
      
      See treegen.py documentation for details.
      
      Some preliminary history:
      
      9eca74ec    X Teach AllStructs to emit topologies with values
      1b962f03    X Restructure: found bug that it was not marking objects as modified
      2139af2c    X treegen: Verify that tree actually saved to storage is what was requested
      b5e39d4a    X wcfs/treegen: allstructs: Do not keep all tree structures in memory
      e9c4b619    X rebuild: tests: Random testing
      c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
      4300d88a    X wcfs/xbtreetest/treegen.py: Fix it on ZODB4
      9cb12737
    • Kirill Smelkov's avatar
      wcfs: xbtree: blib: Start of package · 56cb2897
      Kirill Smelkov authored
      This will be the place to keep BTree-related utilities.
      For now it provides only type aliases since Go lacks generics.
      56cb2897
    • Kirill Smelkov's avatar
      wcfs: tests: xbtree.py package for inspecting/manipulating internal structure of BTrees · c58c20d6
      Kirill Smelkov authored
      To handle invalidations, WCFS will need to detect changes both to ZBlk
      objects and to the ZBigFile.blktab BTree that maps file blocks to ZBlk
      objects. Detecting changes to a BTree is much more complex, because
      when a BTree changes, it might be rebalanced, or keys might migrate from
      one tree/bucket node to another. In other words, a BTree change might be
      not only a change to the {}key->value dictionary, but also a change to
      the BTree topology.
      
      Because there are many BTree topologies that correspond to the same
      {}key->value state, a change from kv₁ to kv₂, even if kv₁ and kv₂ are
      close to each other, might be accompanied by a dramatic change to the
      topology of the tree. This creates a need for thoroughly testing the
      BTree difference algorithm, because many BTree topology changes are
      tricky: a simple algorithm that works on relatively stable topology
      updates will not necessarily keep working correctly in the general case.
      
      So, as a preparatory step, here comes the xbtree.py package, which can be
      used to inspect tree topologies, to create trees with a specified topology
      and to manipulate the topology of an existing tree. This package will be
      used in tests for the upcoming ΔBtail.
      
      For debugging, and because those tests will involve both Go and Python
      parts, we need to be able to specify and exchange the topology of a tree
      as a compact string. For this the package also defines a so-called
      "topology encoding".
      
      Some preliminary history:
      
      fb56193f    X fix metric to keep Z <- N order stable over key^
      809304d1    X "B:" indicates ø bucket with k&b, "B" - ø bucket with only keys
      9eca74ec    X Teach AllStructs to emit topologies with values
      1b962f03    X Restructure: found bug that it was not marking objects as modified
      9181c5d9    X Restructure; verify that it marks as changed only modifed nodes
      e9902c4a    X improve `xbtree topoview`
      
      For reference, the xbtree.py package documentation is quoted below.
      
      ---- 8< ----
      
      Package xbtree provides utilities for inspecting/manipulating internal
      structure of integer-keyed BTrees.
      
      It will be primarily used to help verify ΔBtail in WCFS.
      
      - `Tree` represents a tree node.
      - `Bucket` represents a bucket node.
      - `StructureOf` returns internal structure of ZODB BTree represented as Tree
        and Bucket nodes.
      - `Restructure` reorganizes ZODB BTree instance according to specified topology
        structure.
      
      - `AllStructs` generates all possible BTree topology structures with given keys.
      
      Topology encoding
      -----------------
      
      Topology encoding provides a way to represent the structure of a Tree as a path-like string.
      
      TopoEncode converts Tree into its topology-encoded representation, while
      TopoDecode decodes topology-encoded string back into Tree.
      
      The following example illustrates topology encoding represented by string
      "T3/T-T/B1-T5/B-B7,8,9":
      
            [ 3 ]             T3/         represents Tree([3])
             / \
           [ ] [ ]            T-T/        represents two empty Tree([])
            ↓   ↓
           |1|[ 5 ]           B1-T5/      represents Bucket([1]) and Tree([5])
               / \
              || |7|8|9|      B-B7,8,9    represents empty Bucket([]) and Bucket([7,8,9])
      
      Topology encoding specification:
      
      A Tree is encoded by level-order traversal, delimiting layers with "/".
      Inside a layer Tree and Bucket nodes are signalled as
      
          "T<keys>"           ; Tree
          "B<keys>"           ; Bucket with only keys
          "B<keys+values>"    ; Bucket with keys and values
      
      Keys are represented as a ","-delimited list of integers. For example, a Tree
      or Bucket with [1,3,5] keys is represented as
      
          "T1,3,5"        ; Tree([1,3,5])
          "B1,3,5"        ; Bucket([1,3,5])
      
      Keys+values are represented as a ","-delimited list of "<key>:<value>" pairs. For
      example, a Bucket corresponding to {1:1, 2:4, 3:9} is represented as
      
          "B1:1,2:4,3:9"  ; Bucket([1,2,3], [1,4,9])
      
      Empty keys+values are represented as ":" - an empty Bucket for key->value
      mapping is represented as
      
          "B:"            ; Bucket([], [])
      
      Nodes inside one layer are delimited with "-". For example, a layer consisting
      of an empty Tree, a Tree with [1,3] keys, and a Bucket with [4,5] keys is
      represented as
      
          "T-T1,3-B4,5"   ; layer with Tree([]), Tree([1,3]) and Bucket([4,5])
      
      A layer consists of the nodes reached by following node-node links from the
      upper layer, in left-to-right order.
      
      Visualization
      -------------
      
      The following visualization utilities are provided to help understand BTrees
      better:
      
      - `topoview` displays BTree structure given its topology-encoded representation.
      - `Tree.graphviz` returns Tree graph representation in dot language.
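      
      As an illustration of the topology encoding specified above, here is a
      minimal Python sketch of the encoding direction only. It is not the
      xbtree.py implementation: the `Tree` and `Bucket` classes, their
      constructors, and the `encode_node` / `topo_encode` helpers are
      hypothetical stand-ins written for this example, and values are assumed
      to be printable with str().

```python
# Hypothetical sketch of topology encoding; NOT the real xbtree.py API.

class Bucket:
    """Leaf node: keys only (values=None) or keys with values."""
    def __init__(self, keys, values=None):
        self.keys = list(keys)
        self.values = None if values is None else list(values)


class Tree:
    """Inner node: N keys delimit N+1 children (an empty Tree has 1 child)."""
    def __init__(self, keys, *children):
        assert len(children) == len(keys) + 1
        self.keys = list(keys)
        self.children = list(children)


def encode_node(node):
    """Encode one node as "T<keys>", "B<keys>" or "B<keys+values>"."""
    if isinstance(node, Tree):
        return "T" + ",".join(str(k) for k in node.keys)
    if node.values is None:
        return "B" + ",".join(str(k) for k in node.keys)
    if not node.keys:
        return "B:"  # empty bucket of a key->value mapping
    pairs = zip(node.keys, node.values)
    return "B" + ",".join("%s:%s" % (k, v) for k, v in pairs)


def topo_encode(root):
    """Level-order traversal: nodes joined by "-", layers joined by "/"."""
    layers = []
    layer = [root]
    while layer:
        layers.append("-".join(encode_node(n) for n in layer))
        # next layer: children of Tree nodes, in left-to-right order
        layer = [c for n in layer if isinstance(n, Tree) for c in n.children]
    return "/".join(layers)


# The tree from the example diagram in the quoted documentation.
example = Tree([3],
               Tree([], Bucket([1])),
               Tree([], Tree([5], Bucket([]), Bucket([7, 8, 9]))))
print(topo_encode(example))
```

      Because a layer is built only from the children of Tree nodes in the
      previous layer, buckets naturally terminate the traversal, which is why
      the encoded string needs no explicit end marker.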
      c58c20d6
    • Kirill Smelkov's avatar
      wcfs: tests: Start verifying state of OS file cache · fbf15309
      Kirill Smelkov authored
      For WCFS to be efficient it will have to carefully preserve the OS cache
      on file invalidations. As a preparatory step, establish infrastructure for
      verifying the state of the OS file cache and start asserting on OS cache
      state in a couple of places.
      
      See comments added to tFile constructor that describe how OS cache state
      verification is set up.
      
      Some preliminary history:
      
      8293025b    X Thoughts on how to avoid readahead touching pages of neighbour block
      3054e4a3    X not touching neighbour block works via setting MADV_RANDOM in last 1/4 of every block
      18362227    X #5 access still triggers read to #4 ?
      17dbf94e    X Provide mlock2 fallback for Ubuntu
      d134c0b9    X wcfs: test: try to live with only hard memlock limit adjusted
      c2423296    X Fix mlock2 build on Debian 8
      fbf15309
    • Kirill Smelkov's avatar
      wcfs: Initial implementation of basic filesystem · 58e2a88c
      Kirill Smelkov authored
      Provide a filesystem view of in-ZODB ZBigFiles, but do not yet implement
      support for invalidations or the isolation protocol. In particular, because
      ZODB invalidations are not yet handled, the filesystem does not update its
      data in accordance with ZODB updates, and instead provides a stale data view
      that corresponds to the state of ZODB at the time when wcfs was mounted.
      
      The main parts of this patch are:
      
      - wcfs/wcfs.go is the filesystem implementation itself, together with an overview.
      - wcfs/__init__.py is the Python wrapper to spawn and interoperate with that filesystem.
      - wcfs/wcfs_test.py contains the tests.
      
      Some preliminary history:
      
      fe7efb94    X start of wcfs
      878b2787    X draft loading
      d58c71e8    X don't overalign end by 1 blksize if end is already aligned
      29c9f13d    X readBlk: Fix thinko in already case
      59552328    X wcfs: Care to disable OS polling on us
      c00d94c7    X workaround lack of exception chaining on Python2 with xdefer
      0398e23d    X bytearray turned out to be copying data
      7a837040    X print wcfs.py py-level traceback on SIGBUS (e.g. wcfs.go aborting due to bug/panic)
      661b871f    X make sure tests don't get stuck even if wcfs gets killed -9 ...
      2c043d29    X More effort to unmount failed wcfs.go
      1ccc4478    X Use `with gil` + regular py code instead of PyGILState_Ensure/PyGILState_Release/PyRun_SimpleString
      5dc9c791    X wcfs: Kill xdefer
      91e9eba8    X wcfs: test: Register tFile to tDB early
      a7138fef    X wcfs: mkdir /tmp/wcfs with sticky bit
      1eec76d0    X wcfs: try to set sticky for /tmp/wcfs even if the directory already exists
      c2c35851    X wcfs: tests: Factor-out waiting for a general condition to become true into waitfor
      78f36993    X wcfs: test: Fix thinko in getting /sys/fs/fuse/connection/<X> for wcfs
      bc9eb16f    X wcfs: tests: Don't use testmntpt everywhere
      6dec74e7    X wcfs: tests: Split tDB into -> tDB + tWCFS
      3a6bd764    X wcfs: tests: Run `fusermount -u` the second time if we had to kill wcfs
      112720f3    X wcfs: tests: Print which files are still opened on wcfs if `fusermount -u` fails
      bb40185b    X wcfs: Take $WENDELIN_CORE_WCFS_OPTIONS into account not only from under join
      03a9ef33    X wcfs: Remove credentials from zurl when computing wcfs mountpoint
      68ee5bdc    X wcfs: lsof tweaks
      21671879    X wcfs: Teach entrypoint frontend to handle subcommands: serve, status, stop
      b0642b80    X wcfs: Switch mountpoints from /tmp/wcfs/* to /dev/shm/*
      b0ca031f    X wcfs: Teach join/serve to start successfully even after unclean wcfs shutdown
      5bfa8cf8    X wcfs: Add start to spawn a Server that can be later stopped  (draft)
      5fcec261    X wcfs: Run fusermount and friends with /bin:/usr/bin always on path
      669d7a20    fixup! X wcfs: Run fusermount and friends with /bin:/usr/bin always on path
      6b22f8c4    X wcfs: Teach start to start successfully even after unclean wcfs shutdown
      15389db0    X wcfs: Tune _fuse_unmount to include `fusermount -u` error message into raised exception
      153c002a    X wcfs: _fuse_unmount: Try first `kill -TERM` before `kill -QUIT` wcfs
      3244f3a6    X wcfs: lsof +D misbehaves - don't use it
      a126e709    X wcfs: Put client log into its own logger
      ac303d1e    X wcfs: tests: -v  ->  show only wcfs.py logs verbosely
      d671a9e9    X wcfs: Give more time to stop wcfs server
      58e2a88c
  2. 25 Oct, 2021 10 commits
  3. 01 Apr, 2021 3 commits
    • Kirill Smelkov's avatar
      tests: Reset transaction synchronizers before every test run · fe369d32
      Kirill Smelkov authored
      Otherwise, e.g. after a failing test that closed its storage and DB but
      not all of its Connections, another test, just by starting a new
      transaction, would invoke synchronization on that unclosed connection,
      which would try to access the closed storage and likely fail.
      
      Fixes e.g. https://nexedijs.erp5.net/#/test_result_module/20210401-31B27B3D/5
      
      The crash scenario is the same as described in 5a5ed2c7 (tests: Force-close
      ZODB connections in teardown, that testing code forgot to explicitly
      close). Only now we isolate tests from each other not just across
      different modules, but also inside the same module.
      fe369d32
    • Kirill Smelkov's avatar
      lib/zodb: Add tests for critical ZODB properties that Wendelin.core 2 will depend on · c37a989d
      Kirill Smelkov authored
      The tests verify that there are no concurrency bugs around load,
      Connection.open and invalidations. See e.g.
      
      https://github.com/zopefoundation/ZODB/issues/290
      https://github.com/zopefoundation/ZEO/issues/155
      
      By including the tests in wendelin.core, we will have CI coverage for
      all supported storages (FileStorage, ZEO, NEO), and for all supported
      ZODB versions (currently ZODB4, ZODB4-wc2 and ZODB5).
      
      ZEO5 is known to currently fail zloadrace.
      However, even though ZODB#290 was fixed, ZEO5 turned out to also fail on zopenrace:
      
              def test_zodb_zopenrace():
                  # exercises ZODB.Connection + particular storage implementation
          >       zopenrace.main()
      
          lib/tests/test_zodb.py:382:
          _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
          <decorator-gen-1>:2: in main
              ???
          ../../tools/go/pygolang/golang/__init__.py:103: in _
              return f(*argv, **kw)
          lib/tests/testprog/zopenrace.py:115: in main
              test(zstor)
          <decorator-gen-2>:2: in test
              ???
          ../../tools/go/pygolang/golang/__init__.py:103: in _
              return f(*argv, **kw)
          lib/tests/testprog/zopenrace.py:201: in test
              wg.wait()
          golang/_sync.pyx:246: in golang._sync.PyWorkGroup.wait
              ???
          golang/_sync.pyx:226: in golang._sync.PyWorkGroup.go.pyrunf
              ???
          lib/tests/testprog/zopenrace.py:165: in T1
              t1()
          _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
      
              def t1():
                  transaction.begin()
                  zconn = db.open()
      
                  root = zconn.root()
                  obj1 = root['obj1']
                  obj2 = root['obj2']
      
                  # obj1 - reload it from zstor
                  # obj2 - get it from zconn cache
                  obj1._p_invalidate()
      
                  # both objects must have the same values
                  i1 = obj1.i
                  i2 = obj2.i
                  if i1 != i2:
          >           raise AssertionError("T1: obj1.i (%d)  !=  obj2.i (%d)" % (i1, i2))
          E           AssertionError: T1: obj1.i (3)  !=  obj2.i (2)
      
          lib/tests/testprog/zopenrace.py:156: AssertionError
      c37a989d
    • Kirill Smelkov's avatar
      *: tests: don't hang on exception in non-main thread · 08e0c9fb
      Kirill Smelkov authored
      Previously, if an assert or something else failed in a spawned thread, the
      main thread usually kept spinning indefinitely, so the tests hung. Switch
      all threading places to use sync.WorkGroup: this way, if a thread fails,
      all other threads are canceled and the exception is reported back to
      wg.wait in the main thread.
      
      Since we start to go this route, NotifyChannel is reworked to fully use
      channels instead of busy-waiting.
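      
      The failure-propagation pattern described above can be sketched with the
      Python standard library alone. This is not pygolang's sync.WorkGroup and
      does not reproduce its API: `MiniWorkGroup`, its `cancel` event and the
      two worker functions are hypothetical names used only for illustration,
      and passing a cancellation event to every worker is an assumption of
      this sketch.

```python
import threading


class MiniWorkGroup:
    """Hypothetical stand-in that mimics the idea of a work group:
    the first failing worker cancels the rest, and wait() re-raises
    its exception in the caller's thread."""

    def __init__(self):
        self.cancel = threading.Event()  # set when any worker fails
        self._threads = []
        self._lock = threading.Lock()
        self._err = None                 # first exception seen

    def go(self, f, *args):
        def run():
            try:
                f(self.cancel, *args)
            except BaseException as e:
                with self._lock:
                    if self._err is None:
                        self._err = e
                self.cancel.set()        # tell the other workers to stop
        t = threading.Thread(target=run)
        self._threads.append(t)
        t.start()

    def wait(self):
        for t in self._threads:
            t.join()
        if self._err is not None:
            raise self._err              # report failure to the waiting thread


def failing(cancel):
    raise AssertionError("worker failed")


def waiter(cancel):
    # Blocks until cancelled (or the timeout expires) instead of busy-waiting.
    cancel.wait(timeout=10)


wg = MiniWorkGroup()
wg.go(waiter)
wg.go(failing)
try:
    wg.wait()
except AssertionError as e:
    print("reported to main thread:", e)
```

      The point of the design is that the waiting thread never has to poll
      worker state: a failure both wakes the other workers and surfaces as an
      ordinary exception at the wait call.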
      08e0c9fb
  4. 26 Mar, 2021 1 commit
  5. 08 Mar, 2021 3 commits
    • Kirill Smelkov's avatar
      tox: v↑ NEO (1.9 -> 1.12) · 95b012d3
      Kirill Smelkov authored
      NEO 1.9 was released in 2018 and is outdated by now. NEO 1.12 is
      currently the latest NEO release.
      95b012d3
    • Kirill Smelkov's avatar
      Require Zodbtools · d62a297c
      Kirill Smelkov authored
      After switching to ZODB >= 4 in the previous commit, we can safely
      require zodbtools, because there is now no conflict between the
      ZODB3 and ZODB eggs.
      d62a297c
    • Kirill Smelkov's avatar
      Drop support for ZODB3 · 0802da2b
      Kirill Smelkov authored
      It's been a while since the last ZODB3 release, 3.10.7, in 2016, and the
      last commit in the upstream ZODB3 repository (3.10 branch) is from 2017.
      The world has since switched to ZODB4, and to ZODB5 after that.
      
      We were still requiring ZODB3, because the ZODB3 3.11 egg was just a
      dependency on newer ZODB, ZEO, BTrees and persistent; this way we
      could support ZODB3.10.x, ZODB4 and ZODB5 all via ZODB3.11.
      
      However, the upcoming Wendelin.core 2 needs, for its proper working, MVCC
      semantics as implemented in ZODB5. This forces us, even for ZODB4, to
      backport non-trivial bits from ZODB5 (see [1]). Maintaining ZODB3
      support at this point becomes impractical, because, to our knowledge,
      there is no wendelin.core user that plans to continue using ZODB3
      without switching to at least ZODB4 in the near future.
      
      So goodbye ZODB3. Even though ZODB still stays with us, it gives a
      feeling similar to [2], because in 2014, when I was myself learning
      ZODB, it was through ZODB3 - still at the time when all ZODB bits were
      living together in one place.
      
      [1] nexedi/ZODB!1
      [2] https://lists.osuosl.org/pipermail/darcs-users/2008-September/014095.html
      0802da2b
  6. 11 Dec, 2020 1 commit
    • Kirill Smelkov's avatar
      tests: Don't try to access db.storage when automatically closing connections · fd6b5252
      Kirill Smelkov authored
      DB.close() does `del self.storage`.
      
      https://github.com/zopefoundation/ZODB/blob/5.6.0-14-g0eae10cd0/src/ZODB/DB.py#L646
      
      This way if DB was closed, but some conn(s) were not, it will crash in
      teardown as e.g. below:
      
          _____________ ERROR at teardown of test_bigfile_zblk1_zdata_reuse ______________
      
              def teardown_module():
          >       testdb.teardown()
      
          bigfile/tests/test_filezodb.py:58:
          _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
      
          self = <wendelin.lib.testing.TestDB_ZEO object at 0x7fb9c0216350>
      
              def teardown(self):
                  # close connections that test code forgot to close
                  for connref, tb in self.connv:
                      conn = connref()
                      if conn is None:
                          continue
                      if not conn.opened:
                          continue    # still alive, but closed
                      print("W: testdb: teardown: %s left not closed by test code"
                            "; opened by:\n%s" % (conn, tb), file=sys.stderr)
      
                      db = conn.db()
          >           stor = db.storage
          E           AttributeError: 'DB' object has no attribute 'storage'
      
          lib/testing.py:217: AttributeError
      
      The fix is simple - don't use db.storage at all, because it is not actually used in that code.
      fd6b5252
  7. 17 Nov, 2020 2 commits
  8. 03 Nov, 2020 2 commits
    • Kirill Smelkov's avatar
      t/tfault-run: Require bash · a702d410
      Kirill Smelkov authored
      Otherwise when /bin/sh is dash it fails with
      
          t/tfault-run: 35: test: on_pagefault: unexpected operator
      a702d410
    • Kirill Smelkov's avatar
      t/tfault-run: Clear state from previous run before starting · cf92dfca
      Kirill Smelkov authored
      Otherwise, if previous test.fault failed, tfault-run fails to start, e.g.
      
          >>> test.fault
          $ make test.fault # MAKEFLAGS=-j1
          x86_64-linux-gnu-gcc -pthread -g -Wall -D_GNU_SOURCE -std=gnu99 -fplan9-extensions -Wno-declaration-after-statement -Wno-error=declaration-after-statement  -Iinclude -I3rdparty/ccan -I3rdparty/include   bigfile/tests/tfault.c lib/bug.c lib/utils.c 3rdparty/ccan/ccan/tap/tap.c  -o bigfile/tests/tfault.t
          t/tfault-run bigfile/tests/tfault.t faultr on_pagefault
          mkdir: cannot create directory ‘t/tfault-run.faultr’: File exists
          Makefile:186: recipe for target 'faultr.tfault' failed
          make: *** [faultr.tfault] Error 1
          rm bigfile/tests/tfault.t
          error   test.fault      0.433s  # 1t 1e 0f 0s
      cf92dfca
  9. 02 Nov, 2020 1 commit
  10. 11 Sep, 2020 1 commit
  11. 17 May, 2020 2 commits