    btrfs: reserve delalloc metadata differently · c8eaeac7
    Josef Bacik authored
    With the per-inode block reserves we started refilling the reserve based
    on the calculated size of the outstanding csum bytes and extents for the
    inode, including the amount we were adding with the new operation.
    
    However, generic/224 exposed a problem with this approach.  With 1000
    files all writing at the same time we ended up with a bunch of bytes
    being reserved but unusable.
    
    When you write to a file we reserve space for the csum leaves for those
    bytes, the number of extent items required to cover those bytes, and a
    single transaction item for updating the inode at ordered extent finish
    for that range of bytes.  This is held until the ordered extent finishes
    and we release all of the reserved space.
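
    The per-write reservation described above can be sketched as a small
    model.  This is not the kernel's code: the constants and helper names
    (MODEL_ITEM_COST, csum_leaves(), write_reservation(), and so on) are
    hypothetical and only illustrate that the amount is "csum leaves +
    extent items + one inode update", each charged a worst-case cost.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model parameters; the real values depend on the filesystem. */
#define MODEL_ITEM_COST       (16384ULL * 8 * 2) /* assumed worst-case bytes to modify one tree item */
#define MODEL_SECTOR_SIZE     4096ULL            /* assumed bytes covered by one csum */
#define MODEL_CSUMS_PER_LEAF  4096ULL            /* assumed csums that fit in one leaf */
#define MODEL_MAX_EXTENT_SIZE (128ULL << 20)     /* assumed largest single extent */

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
    assert(d != 0);
    return (n + d - 1) / d;
}

/* Number of csum leaves needed to hold the checksums for `bytes` of data. */
static uint64_t csum_leaves(uint64_t bytes)
{
    uint64_t csums = div_round_up(bytes, MODEL_SECTOR_SIZE);

    return div_round_up(csums, MODEL_CSUMS_PER_LEAF);
}

/* Number of extent items needed to cover `bytes` of data. */
static uint64_t extent_items(uint64_t bytes)
{
    return div_round_up(bytes, MODEL_MAX_EXTENT_SIZE);
}

/* Metadata bytes held from the write until its ordered extent finishes. */
static uint64_t write_reservation(uint64_t bytes)
{
    /* csum leaves + extent items + one item for the inode update */
    uint64_t items = csum_leaves(bytes) + extent_items(bytes) + 1;

    return items * MODEL_ITEM_COST;
}
```

    Under these assumptions even a single 4KiB write holds three items'
    worth of worst-case metadata space until its ordered extent completes.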
    
    If a second write comes in at this point we would add a single
    reservation for the new outstanding extent and however many
    reservations are needed for the csum leaves.  We then find the delta
    between how much we have reserved and how much the new outstanding
    size requires, and attempt to reserve this delta.  If the first write
    finishes it will not release any space, because the space it had
    reserved for the initial write is still needed for the second write.
    However, some space would have been used, as we have added csums and
    extent items and dirtied the inode.  Our reserved space would be > 0
    but less than the total needed reserved space.
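
    The delta-based refill can be illustrated with a toy model.  The
    struct and function names below (inode_rsv, old_refill_request(),
    finish_write()) are hypothetical and the byte counts are arbitrary; the
    sketch only shows how an inode ends up with a shortfall that the next
    write has to make up on top of its own needs.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a per-inode reserve; not the kernel's structure. */
struct inode_rsv {
    uint64_t size;     /* bytes the outstanding csums and extents call for */
    uint64_t reserved; /* bytes actually held right now */
};

/*
 * Old-style refill: grow the target size first, then ask for the whole
 * gap between size and reserved.  Returns the number of bytes that
 * would be requested from the shared metadata space.
 */
static uint64_t old_refill_request(struct inode_rsv *r, uint64_t new_need)
{
    r->size += new_need;
    return r->size > r->reserved ? r->size - r->reserved : 0;
}

/*
 * A write finishes: `share` bytes are no longer needed in size, `used`
 * bytes were consumed from the reserve, and only whatever exceeds the
 * remaining size is given back.  Returns the number of bytes released.
 */
static uint64_t finish_write(struct inode_rsv *r, uint64_t share, uint64_t used)
{
    uint64_t released = 0;

    assert(r->size >= share && r->reserved >= used);
    r->size -= share;
    r->reserved -= used;
    if (r->reserved > r->size) {
        released = r->reserved - r->size;
        r->reserved = r->size;
    }
    return released;
}
```

    For example, with size = reserved = 500 after two writes, finishing the
    first with share = 100 and used = 250 releases nothing and leaves
    reserved = 250 against size = 400.  A further write needing 100 bytes
    then asks for 250, not 100.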
    
    This is just for a single inode; now consider generic/224.  This has
    1000 inodes writing in parallel to a very small file system, 1GiB.  In
    my testing this usually means we get about a 120MiB metadata area to
    work with, more than enough to allow the writes to continue, but not
    enough if all of the inodes are stuck trying to reserve the slack space
    while continuing to hold their leftovers from their initial writes.
    
    Fix this by pre-reserving _only_ the space we are currently trying to
    add.  Then once that is successful modify our inode's csum count and
    outstanding extents, and then add the newly reserved space to the inode's
    block_rsv.  This allows us to actually pass generic/224 without running
    out of metadata space.
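
    The new ordering can be sketched in the same toy style.  Again the
    names (space_pool, inode_rsv, reserve_for_write()) are hypothetical and
    this is not the patch itself; in particular, the real code can flush
    and retry rather than fail immediately.  The point is only the order:
    reserve the new amount first, update the inode's accounting second,
    and add the bytes to the inode's reserve last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; not the kernel's structures. */
struct space_pool {
    uint64_t free;     /* unreserved metadata bytes shared by all inodes */
};

struct inode_rsv {
    uint64_t size;     /* bytes the outstanding csums and extents call for */
    uint64_t reserved; /* bytes actually held right now */
};

/*
 * Reserve only what this operation adds.  An existing shortfall on the
 * inode (reserved < size) is not folded into the request.
 */
static bool reserve_for_write(struct space_pool *pool, struct inode_rsv *r,
                              uint64_t new_need)
{
    assert(pool && r);

    /* Step 1: pre-reserve just the new amount from the shared pool. */
    if (new_need > pool->free)
        return false;           /* the inode's accounting is untouched */
    pool->free -= new_need;

    /* Step 2: only after success, account for the new csums and extents. */
    r->size += new_need;

    /* Step 3: add the newly reserved bytes to the inode's reserve. */
    r->reserved += new_need;
    return true;
}
```

    With 150 bytes free and an inode at size = 400, reserved = 250, a
    write needing 100 bytes succeeds here, whereas a delta-based request
    would have asked for 250 and failed.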

    Signed-off-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: David Sterba <dsterba@suse.com>