• Josef Bacik's avatar
    btrfs: save drop_progress if we drop refs at all · aea6f028
    Josef Bacik authored
    Previously we only updated the drop_progress key if we were in the
    DROP_REFERENCE stage of snapshot deletion.  This is because the
    UPDATE_BACKREF stage checks the flags of the blocks it's converting to
    FULL_BACKREF, so if we go over a block we processed before it doesn't
    matter, we just don't do anything.
    
    The problem is in do_walk_down() we will go ahead and drop the roots
    reference to any blocks that we know we won't need to walk into.
    
    Given subvolume A and snapshot B.  The root of B points to all of the
    nodes that belong to A, so all of those nodes have a refcnt > 1.  If B
    did not modify those blocks it'll hit this condition in do_walk_down
    
    if (!wc->update_ref ||
        generation <= root->root_key.offset)
    	goto skip;
    
    and in "goto skip" we simply do a btrfs_free_extent() for that bytenr
    that we point at.
    
    Now assume we modified some data in B, and then took a snapshot of B and
    call it C.  C points to all the nodes in B, making every node the root
    of B points to have a refcnt > 1.  This assumes the root level is 2 or
    higher.
    
    We delete snapshot B, which does the above work in do_walk_down,
    free'ing our ref for nodes we share with A that we didn't modify.  Now
    we hit a node we _did_ modify, thus we own.  We need to walk down into
    this node and we set wc->stage == UPDATE_BACKREF.  We walk down to level
    0 which we also own because we modified data.  We can't walk any further
    down and thus now need to walk up and start the next part of the
    deletion.  Now walk_up_proc is supposed to put us back into
    DROP_REFERENCE, but there's an exception to this
    
    if (level < wc->shared_level)
    	goto out;
    
    we are at level == 0, and our shared_level == 1.  We skip out of this
    one and go up to level 1.  Since path->slots[1] < nritems we
    path->slots[1]++ and break out of walk_up_tree to stop our transaction
    and loop back around.  Now in btrfs_drop_snapshot we have this snippet
    
    if (wc->stage == DROP_REFERENCE) {
    	level = wc->level;
    	btrfs_node_key(path->nodes[level],
    		       &root_item->drop_progress,
    		       path->slots[level]);
    	root_item->drop_level = level;
    }
    
    our stage == UPDATE_BACKREF still, so we don't update the drop_progress
    key.  This is a problem because we would have done btrfs_free_extent()
    for the nodes leading up to our current position.  If we crash or
    unmount here and go to remount we'll start over where we were before and
    try to free our ref for blocks we've already freed, and thus abort()
    out.
    
    Fix this by keeping track of the last place we dropped a reference for
    our block in do_walk_down.  Then if wc->stage == UPDATE_BACKREF we know
    we'll start over from a place we meant to, and otherwise things continue
    to work as they did before.
    
    I have a complicated reproducer for this problem, without this patch
    we'll fail to fsck the fs when replaying the log writes log.  With this
    patch we can replay the whole log without any fsck or mount failures.
    
    The steps to reproduce this easily are sort of tricky, I had to add a
    couple of debug patches to the kernel in order to make it easy,
    basically I just needed to make sure we did actually commit the
    transaction every time we finished a walk_down_tree/walk_up_tree combo.
    
    The reproducer:
    
    1) Creates a base subvolume.
    2) Creates 100k files in the subvolume.
    3) Snapshots the base subvolume (snap1).
    4) Touches files 5000-6000 in snap1.
    5) Snapshots snap1 (snap2).
    6) Deletes snap1.
    
    I do this with dm-log-writes, and then replay to every FUA in the log
    and fsck the fs.
    Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
    [ copy reproducer steps ]
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    aea6f028
extent-tree.c 317 KB