• Darrick J. Wong's avatar
    xfs: support btrees with overlapping intervals for keys · 2c813ad6
    Darrick J. Wong authored
    On a filesystem with both reflink and reverse mapping enabled, it's
    possible to have multiple rmap records referring to the same blocks on
    disk.  When overlapping intervals are possible, querying a classic
    btree to find all records intersecting a given interval is inefficient
    because we cannot use the left side of the search interval to filter
    out non-matching records the same way that we can use the existing
    btree key to filter out records coming after the right side of the
    search interval.  This will become important once we want to use the
    rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
    
    (For the non-overlapping case, we can perform such queries trivially
    by starting at the left side of the interval and walking the tree
    until we pass the right side.)
    
    Therefore, extend the btree code to come closer to supporting
    intervals as a first-class record attribute.  This involves widening
    the btree node's key space to store both the lowest key reachable via
    the node pointer (as the btree does now) and the highest key reachable
    via the same pointer and teaching the btree modifying functions to
    keep the highest-key records up to date.
    
    This behavior can be turned on via a new btree ops flag so that btrees
    that cannot store overlapping intervals don't pay the overhead costs
    in terms of extra code and disk format changes.
    
    When we're deleting a record in a btree that supports overlapped
    interval records and the deletion results in two btree blocks being
    joined, we defer updating the high/low keys until after all possible
    joining (at higher levels in the tree) have finished.  At this point,
    the btree pointers at all levels have been updated to remove the empty
    blocks and we can update the low and high keys.
    
    When we're doing this, we must be careful to update the keys of all
    node pointers up to the root instead of stopping at the first set of
    keys that don't need updating.  This is because it's possible for a
    single deletion to cause joining of multiple levels of tree, and so
    we need to update everything going back to the root.
    
    The diff_two_keys functions return < 0, 0, or > 0 if key1 is less than,
    equal to, or greater than key2, respectively.  This is consistent
    with the rest of the kernel and the C library.
    
    In btree_updkeys(), we need to evaluate the force_all parameter before
    running the key diff to avoid reading uninitialized memory when we're
    forcing a key update.  This happens when we've allocated an empty slot
    at level N + 1 to point to a new block at level N and we're in the
    process of filling out the new keys.
    Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
    Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
    2c813ad6
xfs_btree.c 119 KB