• Kirill Smelkov's avatar
    fsrefs: Optimize IO (take 2) (#340) · 79078049
    Kirill Smelkov authored
    * fsrefs: Optimize IO  (take 2)
    
    Access objects in the order of their position in file instead of in the order
    of their OID. This should give dramatical speedup when data are on HDD.
    
    For example @perrinjerome reports that on a 73Go database it takes
    almost 8h to run fsrefs (where on the same database, fstest takes 15
    minutes) [1,2]. After the patch fsrefs took ~80 minutes to run on the same
    database. In other words this is ~ 6x improvement.
    
    Fsrefs has no tests. I tested it only lightly via generating a bit
    corrupt database with deleted referred object(*), and it gives the same
    output as unmodified fsrefs.
    
        oid 0x0 __main__.Object
        last updated: 1979-01-03 21:00:42.900001, tid=0x285cbacb70a3db3
        refers to invalid objects:
                oid 0x07 missing: '<unknown>'
                oid 0x07 object creation was undone: '<unknown>'
    
    This "take 2" version is derived from https://github.com/zopefoundation/ZODB/pull/338
    and only iterates objects in the order of their in-file position without
    building complete references graph in-RAM, because that in-RAM graph would
    consume ~12GB of memory.
    
    Added pos2oid in-RAM index also consumes memory: for the 73GB database in
    question fs._index takes ~700MB, while pos2oid takes ~2GB. In theory it could be less,
    because we need only array of oid sorted by key(oid)=fs._index[oid]. However
    array.array does not support sorting, and if we use plain list to keep just
    []oid, the memory consumption just for that list is ~5GB. Also because
    list.sort(key=...) internally allocates memory for key array (and
    list.sort(cmp=...) was removed from Python3), total memory consumption just to
    produce list of []oid ordered by pos is ~10GB.
    So without delving into C/Cython and/or manually sorting the array in Python (=
    slow), using QQBTree seems to be the best out-of-the-box option for oid-by-pos index.
    
    [1] nexedi/zodbtools!19 (comment 129480)
    [2] nexedi/zodbtools!19 (comment 129551)
    
    (*) test database generated via a bit modified gen_testdata.py from
    zodbtools:
    
    https://lab.nexedi.com/nexedi/zodbtools/blob/v0.0.0.dev8-28-g129afa6/zodbtools/test/gen_testdata.py
    
    +
    
    ```diff
    --- a/zodbtools/test/gen_testdata.py
    +++ b/zodbtools/test/gen_testdata.py
    @@ -229,7 +229,7 @@ def ext(subj): return {}
             # delete an object
             name = random.choice(list(root.keys()))
             obj = root[name]
    -        root[name] = Object("%s%i*" % (name, i))
    +#       root[name] = Object("%s%i*" % (name, i))
             # NOTE user/ext are kept empty on purpose - to also test this case
             commit(u"", u"predelete %s" % unpack64(obj._p_oid), {})
    ```
    
    /cc @tim-one, @jeremyhylton, @jamadden
    /reviewed-by @jamadden, @perrinjerome 
    /reviewed-on https://github.com/zopefoundation/ZODB/pull/340
    79078049
fsrefs.py 6.62 KB