- 15 Sep, 2024 1 commit
-
-
Kirill Smelkov authored
Currently a zwatcher failure leads to wcfs serving stale data instead of up-to-date data. Fix that by detecting zwatcher failures and explicitly switching the filesystem into a mode where any access to anything returns "input/output error". The zwatcher can fail, for example, on a failure to retrieve transactions from the ZODB storage, or for any other reason. With this patch we make sure such failures do not go unnoticed.
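As a rough client-side illustration (a hedged sketch, not code from this patch; the mountpoint path and /head/bigfile/* layout are assumptions based on the file naming used elsewhere in this history), after the switch every access is expected to fail with EIO:

```python
import errno

def wcfs_failed(mnt):
    """Return True if wcfs mounted at `mnt` has switched to the EIO failure mode."""
    try:
        # hypothetical path of one bigfile under the wcfs mountpoint
        with open(mnt + "/head/bigfile/0000000000000002", "rb") as f:
            f.read(512)
    except IOError as e:
        # once zwatcher failure is detected, any access returns "input/output error"
        return e.errno == errno.EIO
    return False
```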
-
- 25 Jun, 2024 1 commit
-
-
Carlos Ramos Carreño authored
Some modules and methods have changed names in Python 3. The `thread` module has been renamed to `_thread` and the old name gives error when run on Python 3: ```python Traceback: /opt/slapgrid/b0df76c24a1d2728ccf3e276f07c1790/parts/python3/lib/python3.9/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) wcfs/client/client_test.py:32: in <module> from wendelin.wcfs.wcfs_test import tDB, tAt, timeout, eprint wcfs/wcfs_test.py:44: in <module> from thread import get_ident as gettid E ModuleNotFoundError: No module named 'thread' ``` In a similar vein, the `items` method of dictionaries plays the same role as the old `iteritems`. We use the `six` module to paper over these differences. /reviewed-by @kirr /reviewed-on nexedi/wendelin.core!27
-
- 21 Dec, 2022 1 commit
-
-
Kirill Smelkov authored
The function to do it was there, but I missed adding the corresponding defer. Fixes 6f0cdaff (wcfs: Provide isolation to clients).
-
- 21 Jan, 2022 7 commits
-
-
Kirill Smelkov authored
The problem is similar to a7bf0311 (wcfs: Fix crash if on invalidation handledδZ needs to access ZODB) - I forgot to put zhead's transaction into context. Without the fix the added test fails as:

    wcfs_test.py::test_wcfs_crash_old_data
    ---------------- live log call -----------------
    WARNING  ZODB.FileStorage:FileStorage.py:413 Ignoring index for /tmp/testdb_fs.OV0rS6/1.fs
    M: commit -> @at0 (03e5a3342bc5ab22)
    M: commit -> @at1 (03e5a3342bc88899)
    M:      f<0000000000000002>     [0]
    INFO     wcfs:__init__.py:293 starting for file:///tmp/testdb_fs.OV0rS6/1.fs ...
    I0120 17:12:10.274379  704327 wcfs.go:2393] start "/dev/shm/wcfs/556fa61a9f9675f34c6b44e1f978842c37176c59" "file:///tmp/testdb_fs.OV0rS6/1.fs"
    I0120 17:12:10.274409  704327 wcfs.go:2399] (built with go1.17.6)
    W0120 17:12:10.274560  704327 storage.go:152] zodb: FIXME: open file:///tmp/testdb_fs.OV0rS6/1.fs: raw cache is not ready for invalidations -> NoCache forced
    INFO     wcfs:__init__.py:334 started pid704327 @ /dev/shm/wcfs/556fa61a9f9675f34c6b44e1f978842c37176c59
    C: setup watch f<0000000000000002> @at1 (03e5a3342bc88899)  # pinok: {}
    M: commit -> @at2 (03e5a3342c895777)
    M:      f<0000000000000002>     [1]
    M: commit -> @at3 (03e5a3342ca5ef55)
    M:      f<0000000000000002>     [0]
    C: setup watch f<0000000000000002> @at2 (03e5a3342c895777)  # pinok: {0: @at1 (03e5a3342bc88899)}

    panic: transaction: no current transaction

    goroutine 88 [running]:
    lab.nexedi.com/kirr/neo/go/transaction.currentTxn({0x969718, 0xc0000b6240})
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/transaction/transaction.go:59 +0x77
    lab.nexedi.com/kirr/neo/go/transaction.Current(...)
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/transaction/api.go:206
    lab.nexedi.com/kirr/neo/go/zodb.(*Connection).checkTxnCtx(...)
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/connection.go:374
    lab.nexedi.com/kirr/neo/go/zodb.(*Connection).Get(0xc0000c25a0, {0x969718, 0xc0000b6240}, 0x4)
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/connection.go:331 +0x73
    lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata.(*ΔFtail).BlkRevAt(0xc00009dd40, {0x969718, 0xc0000b6240}, 0xc000100540, 0x30, 0x3e5a3342c895777)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata/δftail.go:1140 +0x39d
    main.(*WatchLink).setupWatch(0xc0000120a0, {0x969718, 0xc0000b6240}, 0x2, 0x3e5a3342c895777)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1754 +0xe3f
    main.(*WatchLink)._handleWatch(0x0, {0x969718, 0xc0000b6240}, {0xc0000a0122, 0x0})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1973 +0x65
    main.(*WatchLink).handleWatch(0x0, {0x969718, 0xc0000b6240}, 0x0, {0xc0000a0122, 0x28})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1955 +0x10c
    main.(*WatchLink)._serve.func3({0x969718, 0xc0000b6240})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1944 +0x3c
    lab.nexedi.com/kirr/go123/xsync.(*WorkGroup).Go.func1()
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/go123/xsync/xsync.go:86 +0x68
    created by lab.nexedi.com/kirr/go123/xsync.(*WorkGroup).Go
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/go123/xsync/xsync.go:83 +0x92

    >>> Change history by file:

    f<0000000000000002>:
                                            0 1 2 3 4 5 6 7     a b c d e f g h
            @at0 (03e5a3342bc5ab22)
            @at1 (03e5a3342bc88899)         0
            @at2 (03e5a3342c895777)           1
            @at3 (03e5a3342ca5ef55)         0

    ----------------------------------------

    # wcfs was crashing in setting up watch because of "1" and "2" from above, and
    # 3. setupWatch was calling ΔFtail.BlkRevAt without putting zhead's transaction into ctx.
            wl2 = t.openwatch()
    >       wl2.watch(zf, at2, {0:at1})
-
Kirill Smelkov authored
Watching with at=tail is inevitable as explained in the previous patch.
-
Kirill Smelkov authored
This is needed because when e.g. wcfs has just started, the coverage of ΔFtail is (head,head], i.e. empty, and if a user wants to set up a watch with at=head, that becomes a watch with at=tail. That at is then used in a query, and if point queries with at=tail are disallowed it panics with "at out of bounds". This fixes crashes in test_wcfs_watch_setup (see 339f1884 "wcfs: tests: Always start tDB with ZBigFile pre-created before WCFS startup") and in test_wcfs_crash_old_data (see 97ce5105 "wcfs: tests: Add test do demonstrate "at out of bounds" crash on readPinWatchers -> ΔFtail.BlkRevAt"). For reference, zodb.ΔTail already allows point queries with at=tail:
https://lab.nexedi.com/kirr/neo/blob/1193c44e/go/zodb/δtail.go#L202-206
https://lab.nexedi.com/kirr/neo/blob/1193c44e/go/zodb/δtail.go#L225-228
-
Kirill Smelkov authored
The codepath that sends pin messages to watchers on FUSE READ, similarly to what was shown in 339f1884, is also vulnerable to an "at out of bounds" panic if at=ΔFtail.tail:

    wcfs_test.py::test_wcfs_crash_old_data
    ---------------- live log call -----------------
    WARNING  ZODB.FileStorage:FileStorage.py:413 Ignoring index for /tmp/testdb_fs.nbSKXu/1.fs
    M: commit -> @at0 (03e5a31e5e5ef6bb)
    M: commit -> @at1 (03e5a31e5e63fa77)
    M:      f<0000000000000002>     [0]
    INFO     wcfs:__init__.py:293 starting for file:///tmp/testdb_fs.nbSKXu/1.fs ...
    I0120 16:50:22.136098  697106 wcfs.go:2393] start "/dev/shm/wcfs/93026d44ef96f87df2cc0e2e451c5aabee91b652" "file:///tmp/testdb_fs.nbSKXu/1.fs"
    I0120 16:50:22.136127  697106 wcfs.go:2399] (built with go1.17.6)
    W0120 16:50:22.136233  697106 storage.go:152] zodb: FIXME: open file:///tmp/testdb_fs.nbSKXu/1.fs: raw cache is not ready for invalidations -> NoCache forced
    INFO     wcfs:__init__.py:334 started pid697106 @ /dev/shm/wcfs/93026d44ef96f87df2cc0e2e451c5aabee91b652
    C: setup watch f<0000000000000002> @at1 (03e5a31e5e63fa77)  # pinok: {}

    panic: at out of bounds: at: @03e5a31e5e63fa77,  (tail, head] = (@03e5a31e5e63fa77, @03e5a31e5e63fa77]

    goroutine 7 [running]:
    lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata.panicf(...)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata/misc.go:47
    lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata.(*ΔFtail).BlkRevAt(0xc0000a5d40, {0x969718, 0xc000076140}, 0xc0001a22a0, 0xc0001c0200, 0x3e5a31e5e63fa77)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata/δftail.go:1077 +0xa45
    main.(*BigFile).readPinWatchers(0xc0001d0200, {0x969718, 0xc000076140}, 0x0, 0xffffffffffffffff)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1559 +0x2a5
    main.(*BigFile).readBlk(0xc0001d0200, {0x969718, 0xc000076140}, 0x0, {0xc000320000, 0x200000, 0x0})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1281 +0x4d2
    main.(*BigFile).Read.func1({0x969718, 0xc000076140})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1223 +0x71
    lab.nexedi.com/kirr/go123/xsync.(*WorkGroup).Go.func1()
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/go123/xsync/xsync.go:86 +0x68
    created by lab.nexedi.com/kirr/go123/xsync.(*WorkGroup).Go
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/go123/xsync/xsync.go:83 +0x92

    >>> Change history by file:

    f<0000000000000002>:
                                            0 1 2 3 4 5 6 7     a b c d e f g h
            @at0 (03e5a31e5e5ef6bb)
            @at1 (03e5a31e5e63fa77)         0
    ...

    @func
    def test_wcfs_crash_old_data():
        # start wcfs with ΔFtail/ΔBtail not covering that initial data.
        t = tDB(old_data=[{0:'a'}]); zf = t.zfile; at1 = t.head
        defer(t.close)

        f = t.open(zf)

        # ΔFtail coverage is currently (at1,at1]
        wl = t.openwatch()
        wl.watch(zf, at1, {})

        # wcfs is crashing on readPinWatcher -> ΔFtail.BlkRevAt with
        # "at out of bounds: at: @at1,  (tail,head] = (@at1,@at1]
        # because BlkRevAt(at=tail) query was disallowed.
    >   f.assertBlk(0, 'a')     # [0] becomes tracked

Still also crashing in test_wcfs_watch_setup.
-
Kirill Smelkov authored
Soon this test will also exercise functionality from the isolation protocol, and so it will no longer be basic. Move and rename test_wcfs_basic_invalidation_wo_dFtail_coverage -> test_wcfs_crash_old_data. Still crashing in test_wcfs_watch_setup.
-
Kirill Smelkov authored
This semantically moves initialization code from test_wcfs_basic_invalidation_wo_dFtail_coverage (see a7bf0311 "wcfs: Fix crash if on invalidation handledδZ needs to access ZODB") to tDB itself, and will be useful to exercise similar scenarios in other tests. Still crashing in test_wcfs_watch_setup.
-
Kirill Smelkov authored
This should hopefully exercise codepaths in wcfs.go a bit more for mistakes similar to a7bf0311 (wcfs: Fix crash if on invalidation handledδZ needs to access ZODB), where the code on the server side forgets to put zhead's transaction into context.

Currently, because watching @tail is disallowed, this leads to a panic triggered by test_wcfs_watch_setup:

    @at0 (03e59e3e606b89bb) -> @at1 (03e59e3e610692bb) -> @at2 (03e59e3e612a5811) -> @at3 (03e59e3e614fa9cc) -> @at4 (03e59e3e6189c3ee) -> @at5 (03e59e3e61af0baa)

    C: setup watch f<0000000000000002> @at0 (03e59e3e606b89bb)  # pinok: {0: @at0 (03e59e3e606b89bb), 2: @at0 (03e59e3e606b89bb), 3: @at0 (03e59e3e606b89bb), 5: @at0 (03e59e3e606b89bb)}

    panic: at out of bounds: at: @03e59e3e606b89bb,  (tail, head] = (@03e59e3e606b89bb, @03e59e3e61af0baa]

    goroutine 187 [running]:
    lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata.panicf(...)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata/misc.go:47
    lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata.(*ΔFtail).BlkRevAt(0xc000077d40, {0x969718, 0xc000062940}, 0xc0003060c0, 0x4174f4, 0x3e59e3e606b89bb)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata/δftail.go:1077 +0xa45
    main.(*WatchLink).setupWatch(0xc000108050, {0x969718, 0xc000062940}, 0x2, 0x3e59e3e606b89bb)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1754 +0xe3f
    main.(*WatchLink)._handleWatch(0x0, {0x969718, 0xc000062940}, {0xc00001c812, 0xa00000})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1973 +0x65
    main.(*WatchLink).handleWatch(0x74039b, {0x969718, 0xc000062940}, 0xc0000a4280, {0xc00001c812, 0x28})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1955 +0x10c
    main.(*WatchLink)._serve.func3({0x969718, 0xc000062940})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1944 +0x3c
    lab.nexedi.com/kirr/go123/xsync.(*WorkGroup).Go.func1()
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/go123/xsync/xsync.go:86 +0x68
    created by lab.nexedi.com/kirr/go123/xsync.(*WorkGroup).Go
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/go123/xsync/xsync.go:83 +0x92

    >>> Change history by file:

    f<0000000000000002>:
                                            0 1 2 3 4 5 6 7     a b c d e f g h
            @at0 (03e59e3e606b89bb)
            @at1 (03e59e3e610692bb)             2
            @at2 (03e59e3e612a5811)             2 3 4 5
            @at3 (03e59e3e614fa9cc)         0   2     5
            @at4 (03e59e3e6189c3ee)             2   4 5
            @at5 (03e59e3e61af0baa)               3   5

However, we will anyway need to allow setting up watches @tail next, so we will be fixing this and other errors in followup commits.

NOTE: we don't lose coverage for the case when ZBigFile is created after wcfs startup, because test_wcfs_watch_2files tests that scenario. ΔFtail/ΔBtail tests also exercise ZBigFile/BTree epochs (creation/deletion) well.
-
- 19 Jan, 2022 3 commits
-
-
Kirill Smelkov authored
tDB.commit always creates exactly one transaction, so wcfs should be expected to catch up with only that single transaction -> no need to loop. There is also no need to keep tDB._wc_zheadv, as we have information about all committed transactions in t.dFtail.
-
Kirill Smelkov authored
.commit is the only caller of ._wcsync. .commit is also the only place via which tests are intended to modify ZODB.
-
Kirill Smelkov authored
- .commit performs ZODB commit and synchronizes WCFS to database changes;
- ._commit performs ZODB commit without WCFS synchronization.

We will soon need ._commit to create initial revisions for ZBigFile while WCFS is not yet started.
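A rough sketch of how such a split could look inside the test harness (a hypothetical simplification for illustration; the real tDB in wcfs_test.py has more bookkeeping, and the zstor/wait_for helpers below are assumptions):

```python
import transaction

class tDB(object):
    """Hypothetical, simplified sketch of the test-harness commit helpers."""

    def __init__(self, zstor, wc):
        self.zstor = zstor   # ZODB storage under test (assumed attribute)
        self.wc    = wc      # handle to the WCFS server, or None if not started

    def _commit(self):
        # Plain ZODB commit without WCFS synchronization;
        # usable even before the WCFS server is started.
        transaction.commit()
        return self.zstor.lastTransaction()

    def _wcsync(self, head):
        # Wait until WCFS reports it has caught up to `head`
        # (placeholder for the real synchronization logic).
        self.wc.wait_for(head)

    def commit(self):
        # ZODB commit + synchronize WCFS to the committed state.
        head = self._commit()
        self._wcsync(head)
        return head
```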
-
- 18 Jan, 2022 1 commit
-
-
Kirill Smelkov authored
The invalidation logic is generally right, but invalidateBlk -> ΔFtail.BlkRevAt was being called with a ctx without transaction. As a result it was panicking as

    panic: transaction: no current transaction

    goroutine 41 [running]:
    lab.nexedi.com/kirr/neo/go/transaction.currentTxn({0x9696d8, 0xc0000d8080})
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/transaction/transaction.go:59 +0x77
    lab.nexedi.com/kirr/neo/go/transaction.Current(...)
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/transaction/api.go:206
    lab.nexedi.com/kirr/neo/go/zodb.(*Connection).checkTxnCtx(...)
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/connection.go:374
    lab.nexedi.com/kirr/neo/go/zodb.(*Connection).Get(0xc00010c640, {0x9696d8, 0xc0000d8080}, 0x4)
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/connection.go:331 +0x73
    lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata.(*ΔFtail).BlkRevAt(0xc000077d40, {0x9696d8, 0xc0000d8080}, 0xc000064f60, 0x0, 0x3e5983329bbd100)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/internal/zdata/δftail.go:1140 +0x39d
    main.(*BigFile).invalidateBlk.func1(0xc000164400, {0x9696d8, 0xc0000d8080}, 0xc0005a0000, 0x200000, 0x200000, {0xc0005a0000, 0x200000, 0x200000})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1089 +0xb8
    main.(*BigFile).invalidateBlk(0xc000164400, {0x9696d8, 0xc0000d8080}, 0x0)
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:1105 +0x3bb
    main.(*Root).handleδZ.func3({0x9696d8, 0xc0000d8080})
            /home/kirr/src/neo/src/lab.nexedi.com/nexedi/wendelin.core/wcfs/wcfs.go:898 +0x34
    lab.nexedi.com/kirr/go123/xsync.(*WorkGroup).Go.func1()
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/go123/xsync/xsync.go:86 +0x68
    created by lab.nexedi.com/kirr/go123/xsync.(*WorkGroup).Go
            /home/kirr/src/neo/src/lab.nexedi.com/kirr/go123/xsync/xsync.go:83 +0x92

on any new change to a tracked file block whose previous history is not covered by ΔFtail/ΔBtail.

Problem reported by @Francois.
-
- 12 Nov, 2021 1 commit
-
-
Kirill Smelkov authored
Otherwise, every time test.py/wcfs is run, several empty directories are left in /dev/shm/wcfs - each corresponding to a WCFS server that was automatically spawned and stopped at the end of the test. Over time this can accumulate to a large number; e.g. ~20000 such directories were left on the testnode during the last 6 months.
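A minimal sketch of the cleanup idea from the Python side (illustrative only; the actual commit may implement this elsewhere, e.g. in the server shutdown path, and the stop_and_cleanup helper is hypothetical): after the filesystem is unmounted, remove the now-empty per-server mountpoint directory.

```python
import errno
import os
import subprocess

def stop_and_cleanup(mntpt):
    """Unmount wcfs at mntpt and remove the leftover mountpoint directory."""
    subprocess.check_call(["fusermount", "-u", mntpt])
    try:
        os.rmdir(mntpt)            # the mountpoint is empty once unmounted
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise                  # already gone is fine; anything else is not
```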
-
- 28 Oct, 2021 7 commits
-
-
Kirill Smelkov authored
By using WCFS as mmap-overlay for base data(*). WCFS-mode is still opt-in, with the default remaining the old full user-space virtual memory manager mode as initially introduced in 2015. Wendelin.core should be draftly usable in WCFS mode now.

This patch is organized as follows:

- file_zodb.cpp provides mmap-overlay operations for WCFS implemented via the WCFS client library.
- file_zodb.py is adjusted accordingly to use WCFS if requested. Low-level things specific to gluing to file_zodb.cpp are moved to _file_zodb.pyx.
- the rest of the changes are drive-by adjustments accompanying the main ones.

(*) see the following patches for what mmap-overlay is:

- fae045cc (bigfile/virtmem: Introduce "mmap overlay" mode)
- 23362204 (bigfile/py: Allow PyBigFile backend to expose "mmap overlay" functionality)

Some preliminary history:

kirr/wendelin.core@01916f09    X Draft demo that reading data through wcfs works
kirr/wendelin.core@fd58082a    X Fix build on old GCC
kirr/wendelin.core@f622e751    X tests: Stop wcfs spawned during tests
kirr/wendelin.core@f118617b    X tests: Don't try to stop wcfs that is already exited
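For illustration, opting in could look roughly like this from a user program (a hedged sketch: the environment variable name WENDELIN_CORE_VIRTMEM and its value are assumptions to verify against wendelin.core's documentation, not something stated in this commit message):

```python
import os

# Assumed opt-in switch: serve bigfile reads through WCFS (mmap-overlay),
# keep writes going through the old user-space virtual memory manager.
os.environ.setdefault("WENDELIN_CORE_VIRTMEM", "r:wcfs+w:uvmm")

from wendelin.bigfile.file_zodb import ZBigFile  # noqa: E402

# ... open the database and use ZBigFile as usual; when the option is
#     honored, reads are served via WCFS mmap-overlay.
```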
-
Kirill Smelkov authored
This patch follows up on the previous patch, which added the server-side part of isolation protocol handling, and adds a client package that takes care of the WCFS isolation protocol details and provides to clients a simple interface to an isolated view of bigfile data on WCFS similar to regular files: given a particular revision of database @at, it provides synthetic read-only bigfile memory mappings with data corresponding to @at state, but using /head/bigfile/* most of the time to build and maintain the mappings.

The patch is organized as follows:

- wcfs.h and wcfs.cpp bring in usage documentation, internal overview and the main part of the implementation.
- wcfs/client/client_test.py is tests.
- The rest of the changes in wcfs/client/ are to support the implementation and tests.

Quoting package documentation for the reference:

---- 8< ----

Package wcfs provides WCFS client.

This client package takes care about WCFS isolation protocol details and provides to clients simple interface to isolated view of bigfile data on WCFS similar to regular files: given a particular revision of database @at, it provides synthetic read-only bigfile memory mappings with data corresponding to @at state, but using /head/bigfile/* most of the time to build and maintain the mappings.

For its data a mapping to bigfile X mostly reuses kernel cache for /head/bigfile/X with amount of data not associated with kernel cache for /head/bigfile/X being proportional to δ(bigfile/X, at..head). In the usual case where many client workers simultaneously serve requests, their database views are a bit outdated, but close to head, which means that in practice the kernel cache for /head/bigfile/* is being used almost 100% of the time.

A mapping for bigfile X@at is built from OS-level memory mappings of on-WCFS files as follows:

      ___          /@revA/bigfile/X
           __      /@revB/bigfile/X
      _            /@revC/bigfile/X
                   + ...
    ───  ─────  ──────────────────────────  ─────
                   /head/bigfile/X

where @revR mmaps are being dynamically added/removed by this client package to maintain X@at data view according to WCFS isolation protocol(*).

API overview

- `WCFS` represents filesystem-level connection to wcfs server.
- `Conn` represents logical connection that provides view of data on wcfs filesystem as of particular database state.
- `FileH` represent isolated file view under Conn.
- `Mapping` represents one memory mapping of FileH.

A path from WCFS to Mapping is as follows:

    WCFS.connect(at)                   -> Conn
    Conn.open(foid)                    -> FileH
    FileH.mmap([blk_start +blk_len))   -> Mapping

A connection can be resynced to another database view via Conn.resync(at').

Documentation for classes provides more thorough overview and API details.

--------
(*) see wcfs.go documentation for WCFS isolation protocol overview and details.

Wcfs client organization
~~~~~~~~~~~~~~~~~~~~~~~~

Wcfs client provides to its users isolated bigfile views backed by data on WCFS filesystem. In the absence of Isolation property, wcfs client would reduce to just directly using OS-level file wcfs/head/f for a bigfile f.

On the other hand there is a simple, but inefficient, way to support isolation: for @at database view of bigfile f - directly use OS-level file wcfs/@at/f. The latter works, but is very inefficient because OS-cache for f data is not shared in between two connections with @at1 and @at2 views. The cache is also lost when connection view of the database is resynced on transaction boundary.
To support isolation efficiently, wcfs client uses wcfs/head/f most of the time, but injects wcfs/@revX/f parts into mappings to maintain f@at view driven by pin messages that wcfs server sends to client in accordance to WCFS isolation protocol(*).

Wcfs server sends pin messages synchronously triggered by access to mmaped memory. That means that a client thread, that is accessing wcfs/head/f mmap, is completely blocked while wcfs server sends pins and waits to receive acks from all clients. In other words on-client handling of pins has to be done in separate thread, because wcfs server can also send pins to client that triggered the access.

Wcfs client implements pins handling in so-called "pinner" thread(+). The pinner thread receives pin requests from wcfs server via watchlink handle opened through wcfs/head/watch. For every pin request the pinner finds corresponding Mappings and injects wcfs/@revX/f parts via Mapping._remmapblk appropriately.

The same watchlink handle is used to send client-originated requests to wcfs server. The requests are sent to tell wcfs that client wants to observe a particular bigfile as of particular revision, or to stop watching it. Such requests originate from regular client threads - not pinner - via entry points like Conn.open, Conn.resync and FileH.close.

Every FileH maintains fileh._pinned {} with currently pinned blk -> rev. This dict is updated by pinner driven by pin messages, and is used when new fileh Mapping is created (FileH.mmap).

In wendelin.core a bigfile has semantic that it is infinite in size and reads as all zeros beyond region initialized with data. Memory-mapping of OS-level files can also go beyond file size, however accessing memory corresponding to file region after file.size triggers SIGBUS. To preserve wendelin.core semantic wcfs client mmaps-in zeros for Mapping regions after wcfs/head/f.size. For simplicity it is assumed that bigfiles only grow and never shrink. It is indeed currently so, but will have to be revisited if/when wendelin.core adds bigfile truncation. Wcfs client restats wcfs/head/f at every transaction boundary (Conn.resync) and remembers f.size in FileH._headfsize for use during one transaction(%).

--------
(*) see wcfs.go documentation for WCFS isolation protocol overview and details.
(+) currently, for simplicity, there is one pinner thread for each connection. In the future, for efficiency, it might be reworked to be one pinner thread that serves all connections simultaneously.
(%) see _headWait comments on how this has to be reworked.

Wcfs client locking organization

Wcfs client needs to synchronize regular user threads vs each other and vs pinner. A major lock Conn.atMu protects updates to changes to Conn's view of the database. Whenever atMu.W is taken - Conn.at is changing (Conn.resync), and contrary whenever atMu.R is taken - Conn.at is stable (roughly speaking Conn.resync is not running).

Similarly to wcfs.go(*) several locks that protect internal data structures are minor to Conn.atMu - they need to be taken only under atMu.R (to synchronize e.g. multiple fileh open running simultaneously), but do not need to be taken at all if atMu.W is taken. In data structures such locks are noted as follows

    sync::Mutex xMu;    // atMu.W | atMu.R + xMu

After atMu, Conn.filehMu protects registry of opened file handles (Conn._filehTab), and FileH.mmapMu protects registry of created Mappings (FileH.mmaps) and FileH.pinned.
Several locks are RWMutex instead of just Mutex not only to allow more concurrency, but, in the first place, for correctness: the pinner thread, being the core element in handling the WCFS isolation protocol, is effectively invoked synchronously from other threads via messages coming through the wcfs server. For example Conn.resync sends a watch request to the wcfs server and waits for the answer. The wcfs server, in turn, might send corresponding pin messages to the pinner and _wait_ for the answer before answering to resync:

     - - - - - -
    |   .···|·····.        ---->  = request
    |   pinner <------.↓   <····  = response
    |       |         |
    |  resync -------^↓    wcfs
    |   `····|·····
     - - - - - -
    client process

This creates the necessity to use RWMutex for locks that the pinner and other parts of the code could be using at the same time in synchronous scenarios similar to the above. These locks are:

- Conn.atMu
- Conn.filehMu

Note that FileH.mmapMu is a regular - not RW - mutex, since nothing in wcfs client calls into wcfs server via watchlink with mmapMu held.

The ordering of locks is:

    Conn.atMu > Conn.filehMu > FileH.mmapMu

The pinner takes the following locks:

- wconn.atMu.R
- wconn.filehMu.R
- fileh.mmapMu (to read .mmaps + write .pinned)

(*) see "Wcfs locking organization" in wcfs.go

Handling of fork

When a process calls fork, the OS copies its memory and creates a child process with only 1 thread. That child inherits file descriptors and memory mappings from the parent. To correctly continue using Conn, FileH and Mappings, the child must recreate the pinner thread and reconnect to wcfs via a reopened watchlink. The reason here is that without reconnection - by using the watchlink file descriptor inherited from the parent - the child would interfere in the parent-wcfs exchange and neither parent nor child could continue normal protocol communication with WCFS.

For simplicity, since fork is seldom used for things besides a followup exec, wcfs client currently takes the straightforward approach of disabling mappings and detaching from the WCFS server in the child right after fork. This ensures that there is no interference into the parent-wcfs exchange should the child decide not to exec and to continue running in the forked thread. Without this protection the interference might come even automatically via e.g. Python GC -> PyFileH.__del__ -> FileH.close -> message to WCFS.

----------------------------------------

Some preliminary history:

a8fa9178    X wcfs: move client tests into client/
990afac1    X wcfs/client: Package overview (draft)
3f83469c    X wcfs: client: Handle fork
0ed6b8b6    fixup! X wcfs: client: Handle fork
24378c46    X wcfs: client: Provide Conn.at()
-
Kirill Smelkov authored
Via custom isolation protocol that both server and clients must cooperatively follow. This is the core change that enables file cache to be practically shared while each client can still be provided with isolated view of the database.

This patch brings only server changes, tests + the minimum client bits to support the tests. The client library, that will implement isolation protocol on client side, will come next.

This patch is organized as follows:

- wcfs.go brings in description of the protocol, overview of how server implements that protocol and the implementation itself. See also notes.txt
- wcfs_test.py brings in tests for server implementation. tWCFS._abort_ontimeout had to be moved into nogil mode into wcfs_test.pyx to avoid deadlock on the GIL (see comments in wcfs_test.pyx for details).
- files added in wcfs/client/ are needed to provide client-side implementation of WatchLink - the message exchange protocol over opened head/watch file - for tests. Client-side watchlink implementation lives in wcfs/client/wcfs_watchlink.{h,cpp}. The other additions in wcfs/client/ are to support that and to expose the WatchLink to Python. Client-side bits are done right in C++ because the upcoming WCFS client library will be implemented in C++ to work in nogil mode in order to avoid deadlock on the GIL, because the client-side pinner thread might be woken-up synchronously by the WCFS server at any moment, including when another client thread already holds the GIL and is paused by WCFS.

Some preliminary history:

kirr/wendelin.core@9b4a42a3    X invalidation design draftly settled
kirr/wendelin.core@27d91d47    X δFtail settled
kirr/wendelin.core@c27c1940    X mmap over under pagefault to this mmapping works
kirr/wendelin.core@d36b171f    X ptrace when client is under pagefault or syscall won't work
kirr/wendelin.core@c1f5bb19    X notes on why lazy-invalidate approach was taken
kirr/wendelin.core@4fbdd270    X Proof that that it is possible to change mmapping while under pagefault to it
kirr/wendelin.core@33e0dfce    X ΔTail draftly done
kirr/wendelin.core@12628943    X make sure "bye" is always processed immediately - even if a handleWatch is currently blocked
kirr/wendelin.core@af0a64cb    X test for "bye" canceling blocked handlers
kirr/wendelin.core@996dc6a8    X Fix race in test
kirr/wendelin.core@43915fe9    X wcfs: Don't forbid simultaneous watch requests
kirr/wendelin.core@941dc54b    X wcfs: threading.Lock -> sync.Mutex
kirr/wendelin.core@d75b2304    X wcfs: Move _abort_ontimeout to pyx/nogil
kirr/wendelin.core@79234659    X Notes on why eagier invalidation was rejected
kirr/wendelin.core@f05271b1    X Test that sysread(/head/watch) can be interrupted
kirr/wendelin.core@5ba816da    X restore test_wcfs_watch_robust after f05271b1.
kirr/wendelin.core@4bd88564    X "Invalidation protocol" -> "Isolation protocol"
kirr/wendelin.core@f7b54ca4    X avoid fmt::vsprintf (now compils again with latest pygolang@master)
kirr/wendelin.core@0a8fcd9d    X wcfs/client: Move EOF -> pygolang
kirr/wendelin.core@153e02e6    X test_wcfs_watch_setup and test_wcfs_watch_setup_ahead work again
kirr/wendelin.core@17f98edc    X wcfs: client: os: Factor syserr -> string into _sysErrString
kirr/wendelin.core@7b0c301c    X wcfs: tests: Fix tFile.assertBlk not to segfault on a test failure
kirr/wendelin.core@b74dda09    X Start switching Track from Track(key) to Track(keycov)
kirr/wendelin.core@8b5d8523    X Move tracking of which blocks were accessed from wcfs to ΔFtail
-
Kirill Smelkov authored
Use ΔFtail.Track on every READ, and, upon receiving a ZODB invalidation, query the accumulated ΔFtail about which blocks of which files have been changed. Then invalidate those blocks in the OS file cache. See the documentation added to wcfs.go and notes.txt for details.

Now the filesystem is no longer stale: it provides a view of data that is up-to-date wrt changes on ZODB storage.

Some preliminary history:

kirr/wendelin.core@9b4a42a3    X invalidation design draftly settled
kirr/wendelin.core@27d91d47    X δFtail settled
kirr/wendelin.core@33e0dfce    X ΔTail draftly done
kirr/wendelin.core@822366a7    X keeping fd to root opened prevents the filesystem from being unmounted
kirr/wendelin.core@89ad3a79    X Don't keep ZBigFile activated during whole current transaction
kirr/wendelin.core@245511ac    X Give pointer on from where to get nxd-fuse.ko
kirr/wendelin.core@d1cd128c    X Hit FUSE-related deadlock
kirr/wendelin.core@d134ee44    X FUSE lookup deadlock should be hopefully fixed
kirr/wendelin.core@0e60e9ff    X wcfs: Don't noise ZWatcher trace logs with "select ..."
kirr/wendelin.core@bf9a7405    X No longer rely on ZODB cache invariant for invalidations
-
Kirill Smelkov authored
For WCFS to be efficient it will have to carefully preserve the OS cache on file invalidations. As a preparatory step, establish infrastructure for verifying the state of the OS file cache and start asserting on OS cache state in a couple of places. See the comments added to the tFile constructor that describe how OS cache state verification is set up.

Some preliminary history:

kirr/wendelin.core@8293025b    X Thoughts on how to avoid readahead touching pages of neighbour block
kirr/wendelin.core@3054e4a3    X not touching neighbour block works via setting MADV_RANDOM in last 1/4 of every block
kirr/wendelin.core@18362227    X #5 access still triggers read to #4 ?
kirr/wendelin.core@17dbf94e    X Provide mlock2 fallback for Ubuntu
kirr/wendelin.core@d134c0b9    X wcfs: test: try to live with only hard memlock limit adjusted
kirr/wendelin.core@c2423296    X Fix mlock2 build on Debian 8
-
Kirill Smelkov authored
Provide a filesystem view of in-ZODB ZBigFiles, but do not implement support for invalidations nor the isolation protocol yet. In particular, because ZODB invalidations are not yet handled, the filesystem does not update its data in accordance with ZODB updates, and instead provides a stale data view that corresponds to the state of ZODB at the time when wcfs was mounted.

The main parts of this patch are:

- wcfs/wcfs.go is the filesystem implementation itself together with an overview.
- wcfs/__init__.py is a python wrapper to spawn and interoperate with that filesystem.
- wcfs/wcfs_test.py is tests.

Some preliminary history:

kirr/wendelin.core@fe7efb94    X start of wcfs
kirr/wendelin.core@878b2787    X draft loading
kirr/wendelin.core@d58c71e8    X don't overalign end by 1 blksize if end is already aligned
kirr/wendelin.core@29c9f13d    X readBlk: Fix thinko in already case
kirr/wendelin.core@59552328    X wcfs: Care to disable OS polling on us
kirr/wendelin.core@c00d94c7    X workaround lack of exception chaining on Python2 with xdefer
kirr/wendelin.core@0398e23d    X bytearray turned out to be copying data
kirr/wendelin.core@7a837040    X print wcfs.py py-level traceback on SIGBUS (e.g. wcfs.go aborting due to bug/panic)
kirr/wendelin.core@661b871f    X make sure tests don't get stuck even if wcfs gets killed -9 ...
kirr/wendelin.core@2c043d29    X More effort to unmount failed wcfs.go
kirr/wendelin.core@1ccc4478    X Use `with gil` + regular py code instead of PyGILState_Ensure/PyGILState_Release/PyRun_SimpleString
kirr/wendelin.core@5dc9c791    X wcfs: Kill xdefer
kirr/wendelin.core@91e9eba8    X wcfs: test: Register tFile to tDB early
kirr/wendelin.core@a7138fef    X wcfs: mkdir /tmp/wcfs with sticky bit
kirr/wendelin.core@1eec76d0    X wcfs: try to set sticky for /tmp/wcfs even if the directory already exists
kirr/wendelin.core@c2c35851    X wcfs: tests: Factor-out waiting for a general condition to become true into waitfor
kirr/wendelin.core@78f36993    X wcfs: test: Fix thinko in getting /sys/fs/fuse/connection/<X> for wcfs
kirr/wendelin.core@bc9eb16f    X wcfs: tests: Don't use testmntpt everywhere
kirr/wendelin.core@6dec74e7    X wcfs: tests: Split tDB into -> tDB + tWCFS
kirr/wendelin.core@3a6bd764    X wcfs: tests: Run `fusermount -u` the second time if we had to kill wcfs
kirr/wendelin.core@112720f3    X wcfs: tests: Print which files are still opened on wcfs if `fusermount -u` fails
kirr/wendelin.core@bb40185b    X wcfs: Take $WENDELIN_CORE_WCFS_OPTIONS into account not only from under join
kirr/wendelin.core@03a9ef33    X wcfs: Remove credentials from zurl when computing wcfs mountpoint
kirr/wendelin.core@68ee5bdc    X wcfs: lsof tweaks
kirr/wendelin.core@21671879    X wcfs: Teach entrypoint frontend to handle subcommands: serve, status, stop
kirr/wendelin.core@b0642b80    X wcfs: Switch mountpoints from /tmp/wcfs/* to /dev/shm/*
kirr/wendelin.core@b0ca031f    X wcfs: Teach join/serve to start successfully even after unclean wcfs shutdown
kirr/wendelin.core@5bfa8cf8    X wcfs: Add start to spawn a Server that can be later stopped (draft)
kirr/wendelin.core@5fcec261    X wcfs: Run fusermount and friends with /bin:/usr/bin always on path
kirr/wendelin.core@669d7a20    fixup! X wcfs: Run fusermount and friends with /bin:/usr/bin always on path
kirr/wendelin.core@6b22f8c4    X wcfs: Teach start to start successfully even after unclean wcfs shutdown
kirr/wendelin.core@15389db0    X wcfs: Tune _fuse_unmount to include `fusermount -u` error message into raised exception
kirr/wendelin.core@153c002a    X wcfs: _fuse_unmount: Try first `kill -TERM` before `kill -QUIT` wcfs
kirr/wendelin.core@3244f3a6    X wcfs: lsof +D misbehaves - don't use it
kirr/wendelin.core@a126e709    X wcfs: Put client log into its own logger
kirr/wendelin.core@ac303d1e    X wcfs: tests: -v -> show only wcfs.py logs verbosely
kirr/wendelin.core@d671a9e9    X wcfs: Give more time to stop wcfs server
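A rough usage sketch from the Python side (hedged: wcfs.join comes from the wcfs/__init__.py wrapper mentioned above, but its exact signature and the mountpoint/close attribute names are assumptions for illustration):

```python
from wendelin import wcfs

# Mount (or attach to an already running) wcfs serving the given ZODB storage.
# The zurl identifies the storage, e.g. a FileStorage data file.
wc = wcfs.join("file:///path/to/1.fs")

# Files are exposed under the mountpoint; head/ gives the current view
# (stale at this stage of the history, since invalidations are not handled yet).
path = wc.mountpoint + "/head/bigfile/0000000000000002"
with open(path, "rb") as f:
    blk0 = f.read(2*1024*1024)   # read the first 2MB block

wc.close()
```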
-
Kirill Smelkov authored
Add initial stub for WCFS program and tests. WCFS functionality will be added step-by-step in follow-up commits.

Some preliminary history:

kirr/wendelin.core@0ae88a32    X .nxdtest: Verify Go bits with GOMAXPROCS=1,2,`nproc`
kirr/wendelin.core@23528eb4    X wcfs: make it to use go modules for dependencies
-
- 24 Oct, 2017 1 commit
-
-
Kirill Smelkov authored
Relicense to GPLv3+ with wide exception for all Free Software / Open Source projects + Business options.

Nexedi stack is licensed under Free Software licenses with various exceptions that cover three business cases:

- Free Software
- Proprietary Software
- Rebranding

As long as one intends to develop Free Software based on Nexedi stack, no license cost is involved. Developing proprietary software based on Nexedi stack may require a proprietary exception license. Rebranding Nexedi stack is prohibited unless rebranding license is acquired.

Through this licensing approach, Nexedi expects to encourage Free Software development without restrictions and at the same time create a framework for proprietary software to contribute to the long term sustainability of the Nexedi stack.

Please see https://www.nexedi.com/licensing for details, rationale and options.
-
- 03 Apr, 2015 2 commits
-
-
Kirill Smelkov authored
Exposes BigFile - this way users can define a BigFile backend in Python. Also exposed are BigFile handles, and VMA objects which are the results of mmapping.
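A minimal sketch of what defining a backend in Python looks like (hedged: the class/method names - BigFile with loadblk/storeblk overrides, fileh_open() and VMA via mmap - follow the wendelin.bigfile API as I recall it and should be checked against the module's documentation):

```python
from wendelin.bigfile import BigFile

class ZeroFile(BigFile):
    """Toy backend: every block reads as zeros; writes are discarded."""

    def __new__(cls, blksize):
        return BigFile.__new__(cls, blksize)

    def loadblk(self, blk, buf):
        # fill buf with the content of block #blk
        mv = memoryview(buf)
        mv[:] = b'\0' * len(mv)

    def storeblk(self, blk, buf):
        # persist block #blk from buf (no-op for this toy backend)
        pass

f   = ZeroFile(blksize=2*1024*1024)
fh  = f.fileh_open()        # BigFile handle
vma = fh.mmap(0, 4)         # map pages [0, 4) of the file
mem = memoryview(vma)       # VMA exposes its memory via the buffer interface
assert mem[0] == 0          # page fault -> loadblk fills the block with zeros
```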
-
Kirill Smelkov authored
This will be a module to allow users to program custom file-like backends and later memory-map the content of files from the backend. For now just start the module structure.
-