Commit 2ab4be93 authored by Kirill Smelkov's avatar Kirill Smelkov

wcfs: xbtree: ΔBtail

ΔBtail provides BTree-level history tail that WCFS - via ΔFtail - will
use to compute which blocks of a ZBigFile need to be invalidated in OS
file cache given raw ZODB changes on ZODB invalidation message.

It also will be used by WCFS to implement isolation protocol, where on
every FUSE READ request WCFS will query ΔBtail - again via ΔFtail - to
find out revision of corresponding file block.

Quoting ΔBtail documentation:

---- 8< ----

ΔBtail provides BTree-level history tail.

It translates ZODB object-level changes to information about which keys of
which BTree were modified, and provides service to query that information.

ΔBtail class documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~

ΔBtail represents tail of revisional changes to BTrees.

It semantically consists of

    []δB			; rev ∈ (tail, head]

where δB represents a change in BTrees space

    δB:
    	.rev↑
    	{} root -> {}(key, δvalue)

It covers only changes to keys from tracked subset of BTrees parts.
In particular a key that was not explicitly requested to be tracked, even if
it was changed in δZ, is not guaranteed to be present in δB.

ΔBtail provides the following operations:

  .Track(path)	- start tracking tree nodes and keys; root=path[0], keys=path[-1].(lo,hi]

  .Update(δZ) -> δB				- update BTree δ tail given raw ZODB changes
  .ForgetPast(revCut)			- forget changes ≤ revCut
  .SliceByRev(lo, hi) -> []δB		- query for all trees changes with rev ∈ (lo, hi]
  .SliceByRootRev(root, lo, hi) -> []δT	- query for changes of a tree with rev ∈ (lo, hi]
  .GetAt(root, key, at) -> (value, rev)	- get root[key] @at assuming root[key] ∈ tracked

where δT represents a change to one tree

    δT:
    	.rev↑
    	{}(key, δvalue)

An example for tracked set is a set of visited BTree paths.
There is no requirement that tracked set belongs to only one single BTree.

See also zodb.ΔTail and zdata.ΔFtail

Concurrency

ΔBtail is safe to use in single-writer / multiple-readers mode. That is at
any time there should be either only sole writer, or, potentially several
simultaneous readers. The table below classifies operations:

    Writers:  Update, ForgetPast
    Readers:  Track + all queries (SliceByRev, SliceByRootRev, GetAt)

Note that, in particular, it is correct to run multiple Track and queries
requests simultaneously.

ΔBtail organization
~~~~~~~~~~~~~~~~~~~

ΔBtail keeps raw ZODB history in ΔZtail and uses BTree-diff algorithm(*) to
turn δZ into BTree-level diff. For each tracked BTree a separate ΔTtail is
maintained with tree-level history in ΔTtail.vδT .

Because it is very computationally expensive(+) to find out for an object to
which BTree it belongs, ΔBtail cannot provide full BTree-level history given
just ΔZtail with δZ changes. Due to this ΔBtail requires help from
users, which are expected to call ΔBtail.Track(treepath) to let ΔBtail know
that such and such ZODB objects constitute a path from root of a tree to some
of its leaf. After Track call the objects from the path and tree keys, that
are covered by leaf node, become tracked: from now-on ΔBtail will detect
and provide BTree-level changes caused by any change of tracked tree objects
or tracked keys. This guarantee can be provided because ΔBtail now knows
that such and such objects belong to a particular tree.

To manage knowledge which tree part is tracked ΔBtail uses PPTreeSubSet.
This data-structure represents so-called PP-connected set of tree nodes:
simply speaking it builds on some leafs and then includes parent(leaf),
parent(parent(leaf)), etc. In other words it's a "parent"-closure of the
leafs. The property of being PP-connected means that starting from any node
from such set, it is always possible to reach root node by traversing
.parent links, and that every intermediate node went-through during
traversal also belongs to the set.

A new Track request potentially grows tracked keys coverage. Due to this,
on a query, ΔBtail needs to recompute potentially whole vδT of the affected
tree. This recomputation is managed by "vδTSnapForTracked*" and "_rebuild"
functions and uses the same treediff algorithm, that Update is using, but
modulo PPTreeSubSet corresponding to δ key coverage. Update also potentially
needs to rebuild whole vδT history, not only append new δT, because a
change to tracked tree nodes can result in growth of tracked key coverage.

Queries are relatively straightforward code that work on vδT snapshot. The
main complexity, besides BTree-diff algorithm, lies in recomputing vδT when
set of tracked keys changes, and in handling that recomputation in such a way
that multiple Track and queries requests could be all served in parallel.

Concurrency

In order to allow multiple Track and queries requests to be served in
parallel ΔBtail employs special organization of vδT rebuild process where
complexity of concurrency is reduced to math on merging updates to vδT and
trackSet, and on key range lookup:

1. vδT is managed under read-copy-update (RCU) discipline: before making
   any vδT change the mutator atomically clones whole vδT and applies its
   change to the clone. This way a query, once it retrieves vδT snapshot,
   does not need to further synchronize with vδT mutators, and can rely on
   that retrieved vδT snapshot will remain immutable.

2. a Track request goes through 3 states: "new", "handle-in-progress" and
   "handled". At each state keys/nodes of the Track are maintained in:

   - ΔTtail.ktrackNew and .trackNew       for "new",
   - ΔTtail.krebuildJobs                  for "handle-in-progress", and
   - ΔBtail.trackSet                      for "handled".

   trackSet keeps nodes, and implicitly keys, from all handled Track
   requests. For all keys, covered by trackSet, vδT is fully computed.

   a new Track(keycov, path) is remembered in ktrackNew and trackNew to be
   further processed when a query should need keys from keycov. vδT is not
   yet providing data for keycov keys.

   when a Track request starts to be processed, its keys and nodes are moved
   from ktrackNew/trackNew into krebuildJobs. vδT is not yet providing data
   for requested-to-be-tracked keys.

   all trackSet, trackNew/ktrackNew and krebuildJobs are completely disjoint:

    trackSet ^ trackNew     = ø
    trackSet ^ krebuildJobs = ø
    trackNew ^ krebuildJobs = ø

3. when a query is served, it needs to retrieve vδT snapshot that takes
   related previous Track requests into account. Retrieving such snapshots
   is implemented in vδTSnapForTracked*() family of functions: there it
   checks ktrackNew/trackNew, and if those sets overlap with query's keys
   of interest, run vδT rebuild for keys queued in ktrackNew.

   the main part of that rebuild can be run without any locks, because it
   does not use nor modify any ΔBtail data, and for δ(vδT) it just computes
   a fresh full vδT build modulo retrieved ktrackNew. Only after that
   computation is complete, ΔBtail is locked again to quickly merge in
   δ(vδT) update back into vδT.

   This organization is based on the fact that

    vδT/(T₁∪T₂) = vδT/T₁ | vδT/T₂

     ( i.e. vδT computed for tracked set being union of T₁ and T₂ is the
       same as merge of vδT computed for tracked set T₁ and vδT computed
      for tracked set T₂ )

   and that

    trackSet | (δPP₁|δPP₂) = (trackSet|δPP₁) | (trackSet|δPP₂)

    ( i.e. tracking set updated for union of δPP₁ and δPP₂ is the same
      as union of tracking set updated with δPP₁ and tracking set updated
      with δPP₂ )

   these merge properties allow to run computation for δ(vδT) and δ(trackSet)
   independently and with ΔBtail unlocked, which in turn enables running
   several Track/queries in parallel.

4. while vδT rebuild is being run, krebuildJobs keeps corresponding keycov
   entry to indicate in-progress rebuild. Should a query need vδT for keys
   from that job, it first waits for corresponding job(s) to complete.

Explained rebuild organization allows non-overlapping queries/track-requests
to run simultaneously. (This property is essential to WCFS because otherwise
WCFS would not be able to serve several non-overlapping READ requests to one
file in parallel.)

--------

(*) implemented in treediff.go
(+) full database scan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some preliminary history:

877e64a9    X wcfs: Fix tests to pass again
c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
f65f775b    X wcfs/xbtree: treediff(ø, b)
c75b1c6f    X wcfs/xbtree: Start killing holeIdx
0fa06cbd    X kadj must be taken into account as kadj^δZ
ef5e5183    X treediff ret += δtkeycov
f30826a6    X another bug in δtkeyconv computation
0917380e    X wcfs: assert that keycov only grow
502e05c2    X found why TestΔBTailAllStructs was not effective to find δtkeycov bugs
450ba707    X Fix rebuild with ø @at2
f60528c9    X ΔBtail.Clone had bug that it was aliasing klon and orig data
9d20f8e8    X treediff: Fix BUG while computing AB coverage
ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
324241eb    X rebuild: tests: Don't reflect.DeepEqual in inner loop
8f6e2b1e    X rebuild: tests: Don't access ZODB in XGetδKV
2c0b4793    X rebuild: tests: Don't access ZODB in xtrackKeys
8f0e37f2    X rebuild: tests: Precompute kadj10·kadj21
271d953d    X rebuild: tests: Move ΔBtail.Clone test out of hot inner loop into separate test
a87cc6de    X rebuild: tests: Don't recompute trackSet(keys1R2) several times
01433e96    X rebuild: tests: Don't compute keyCover in trackSet
7371f9c5    X rebuild: tests: Inline _assertTrack
3e9164b3    X rebuild: tests: Don't exercise keys from keys2 that already became tracked after Track(keys1) + Update
e9c4b619    X rebuild: tests: Random testing
d0fe680a    X δbtail += ForgetPast
210e9b07    X Fix ΔBtail.SliceByRootRev (lo,hi] handling
855ab4b8    X ΔBtail: Goodbye .KVAtTail
2f5582e6    X ΔBtail: Tweak tests to run faster in normal mode
cf352737    X random testing found another failing test for rebuild...
7f7e34e0    X wcfs/xbtree: Fix update not to add duplicate extra point if rebuild  - called by Update - already added it
6ad0052c    X ΔBtail.Track: No need to return error
aafcacdf    X xbtree: GetAt test
784a6761    X xbtree: Fix KAdj definition after treediff was reworked this summer to base decisions on node keycoverage instead of particular node keys
0bb1c22e    X xbtree: Verify that ForgetPast clones vδT on trim
a8945cbf    X Start reworking rebuild routines not to modify data inplace
b74dda09    X Start switching Track from Track(key) to Track(keycov)
dea85e87    X Switch GetAt to vδTSnapForTrackedKey
aa0288ce    X Switch SliceByRootRev to vδTSnapForTracked
c4366b14    X xbtree: tests: Also verify state of ΔTtail.ktrackNew
b98706ad    X Track should be nop if keycov/path is already in krebuildJobs
e141848a    X test.go  ↑ timeout  10m -> 20m
423f77be    X wcfs: Goodby holeIdx
37c2e806    X wcfs: Teach treediff to compute not only δtrack (set of nodes), but also δ for track-key coverage
52c72dbb    X ΔBtail.rebuild started to work draftly
c9f13fc7    X Get rebuild tests to run in a sane time; Add proper random-based testing for rebuild
c7f1e3c9    X xbtree: Factor testing infrastructure bits into xbtree/xbtreetest
7602c1f4    ΔBtail concurrency
parent 80153aa5
...@@ -186,7 +186,9 @@ test.py.drd: bigfile/_bigfile.so wcfs/wcfs ...@@ -186,7 +186,9 @@ test.py.drd: bigfile/_bigfile.so wcfs/wcfs
# run go tests # run go tests
test.go : test.go :
cd wcfs && $(GO) test -count=1 ./... # -count=1 disables tests caching # -timeout=20m: xbtree tests are computationally expensive, and
# sometime cross 10 minutes border when testnodes are otherwise loaded.
cd wcfs && $(GO) test -timeout=20m -count=1 ./... # -count=1 disables tests caching
# test pagefault for double/real faults - it should crash # test pagefault for double/real faults - it should crash
......
...@@ -20,7 +20,7 @@ ...@@ -20,7 +20,7 @@
"""Package xbtree provides utilities for inspecting/manipulating internal """Package xbtree provides utilities for inspecting/manipulating internal
structure of integer-keyed BTrees. structure of integer-keyed BTrees.
It will be primarily used to help verify ΔBTail in WCFS. It is primarily used to help verify ΔBTail in WCFS.
- `Tree` represents a tree node. - `Tree` represents a tree node.
- `Bucket` represents a bucket node. - `Bucket` represents a bucket node.
......
...@@ -56,7 +56,9 @@ const KeyMin = blib.KeyMin ...@@ -56,7 +56,9 @@ const KeyMin = blib.KeyMin
type Value = zodb.Oid type Value = zodb.Oid
const VDEL = zodb.InvalidOid const VDEL = zodb.InvalidOid
type setKey = set.I64
type setOid = set.Oid type setOid = set.Oid
type setTid = set.Tid
// pathEqual returns whether two paths are the same. // pathEqual returns whether two paths are the same.
...@@ -99,6 +101,10 @@ func zgetNodeOrNil(ctx context.Context, zconn *zodb.Connection, oid zodb.Oid) (n ...@@ -99,6 +101,10 @@ func zgetNodeOrNil(ctx context.Context, zconn *zodb.Connection, oid zodb.Oid) (n
} }
func kstr(key Key) string {
return blib.KStr(key)
}
func panicf(format string, argv ...interface{}) { func panicf(format string, argv ...interface{}) {
panic(fmt.Sprintf(format, argv...)) panic(fmt.Sprintf(format, argv...))
} }
// Copyright (C) 2020-2021 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
package xbtreetest
// testing-related support
import (
"flag"
"math/rand"
"testing"
"time"
)
var (
verylongFlag = flag.Bool("verylong", false, `switch tests to run in "very long" mode`)
randseedFlag = flag.Int64("randseed", -1, `seed for random number generator`)
)
// VeryLong returns whether -verylong flag is in effect.
func VeryLong() bool {
// -short takes priority over -verylong
if testing.Short() {
return false
}
return *verylongFlag
}
// N returns short, medium, or long depending on whether tests were ran with
// -short, -verylong, or normally.
func N(short, medium, long int) int {
// -short
if testing.Short() {
return short
}
// -verylong
if *verylongFlag {
return long
}
// default
return medium
}
// NewRand returns new random-number generator and seed that was used to initialize it.
//
// The seed can be controlled via -randseed option.
func NewRand() (rng *rand.Rand, seed int64) {
seed = *randseedFlag
if seed == -1 {
seed = time.Now().UnixNano()
}
rng = rand.New(rand.NewSource(seed))
return rng, seed
}
...@@ -20,7 +20,7 @@ ...@@ -20,7 +20,7 @@
# See https://www.nexedi.com/licensing for rationale and options. # See https://www.nexedi.com/licensing for rationale and options.
"""Program treegen provides infrastructure to generate ZODB BTree states. """Program treegen provides infrastructure to generate ZODB BTree states.
It will be used as helper for ΔBtail and ΔFtail tests. It is used as helper for ΔBtail and ΔFtail tests.
The following subcommands are provided: The following subcommands are provided:
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
// Copyright (C) 2020-2021 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
package xbtree_test
import (
_ "lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/xbtree/xbtreetest/init"
)
// Copyright (C) 2018-2021 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Package xtail provides utilities that are common to ΔBtail and ΔFtail.
package xtail
import (
"fmt"
"lab.nexedi.com/kirr/neo/go/zodb"
)
// AssertSlice asserts that δ.tail ≤ lo ≤ hi ≤ δ.head
func AssertSlice(δ interface { Head() zodb.Tid; Tail() zodb.Tid }, lo, hi zodb.Tid) {
tail := δ.Tail()
head := δ.Head()
if !(tail <= lo && lo <= hi && hi <= head) {
panic(fmt.Sprintf("invalid slice: (%s, %s]; (tail, head] = (%s, %s]", lo, hi, tail, head))
}
}
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment