src/ZODB/scripts/fsrefs.py · 79078049c5529c409e6ea958870a3df14f2c8f29 · Kirill Smelkov / ZODB

fsrefs: Optimize IO (take 2) (#340) · 79078049
Kirill Smelkov authored Mar 29, 2021
* fsrefs: Optimize IO  (take 2)

Access objects in the order of their position in file instead of in the order
of their OID. This should give dramatical speedup when data are on HDD.

For example @perrinjerome reports that on a 73Go database it takes
almost 8h to run fsrefs (where on the same database, fstest takes 15
minutes) [1,2]. After the patch fsrefs took ~80 minutes to run on the same
database. In other words this is ~ 6x improvement.

Fsrefs has no tests. I tested it only lightly via generating a bit
corrupt database with deleted referred object(*), and it gives the same
output as unmodified fsrefs.

    oid 0x0 __main__.Object
    last updated: 1979-01-03 21:00:42.900001, tid=0x285cbacb70a3db3
    refers to invalid objects:
            oid 0x07 missing: '<unknown>'
            oid 0x07 object creation was undone: '<unknown>'

This "take 2" version is derived from https://github.com/zopefoundation/ZODB/pull/338
and only iterates objects in the order of their in-file position without
building complete references graph in-RAM, because that in-RAM graph would
consume ~12GB of memory.

Added pos2oid in-RAM index also consumes memory: for the 73GB database in
question fs._index takes ~700MB, while pos2oid takes ~2GB. In theory it could be less,
because we need only array of oid sorted by key(oid)=fs._index[oid]. However
array.array does not support sorting, and if we use plain list to keep just
[]oid, the memory consumption just for that list is ~5GB. Also because
list.sort(key=...) internally allocates memory for key array (and
list.sort(cmp=...) was removed from Python3), total memory consumption just to
produce list of []oid ordered by pos is ~10GB.
So without delving into C/Cython and/or manually sorting the array in Python (=
slow), using QQBTree seems to be the best out-of-the-box option for oid-by-pos index.

[1] nexedi/zodbtools!19 (comment 129480)
[2] nexedi/zodbtools!19 (comment 129551)

(*) test database generated via a bit modified gen_testdata.py from
zodbtools:

https://lab.nexedi.com/nexedi/zodbtools/blob/v0.0.0.dev8-28-g129afa6/zodbtools/test/gen_testdata.py

+

```diff
--- a/zodbtools/test/gen_testdata.py
+++ b/zodbtools/test/gen_testdata.py
@@ -229,7 +229,7 @@ def ext(subj): return {}
         # delete an object
         name = random.choice(list(root.keys()))
         obj = root[name]
-        root[name] = Object("%s%i*" % (name, i))
+#       root[name] = Object("%s%i*" % (name, i))
         # NOTE user/ext are kept empty on purpose - to also test this case
         commit(u"", u"predelete %s" % unpack64(obj._p_oid), {})
```

/cc @tim-one, @jeremyhylton, @jamadden
/reviewed-by @jamadden, @perrinjerome 
/reviewed-on https://github.com/zopefoundation/ZODB/pull/340
79078049
fsrefs.py 6.62 KB
Replace fsrefs.py