• Darrick J. Wong's avatar
    xfs: force all buffers to be written during btree bulk load · 13ae04d8
    Darrick J. Wong authored
    While stress-testing online repair of btrees, I noticed periodic
    assertion failures from the buffer cache about buffers with incorrect
    DELWRI_Q state.  Looking further, I observed this race between the AIL
    trying to write out a btree block and repair zapping a btree block after
    the fact:
    
    AIL:    Repair0:
    
    pin buffer X
    delwri_queue:
    set DELWRI_Q
    add to delwri list
    
            stale buf X:
            clear DELWRI_Q
            does not clear b_list
            free space X
            commit
    
    delwri_submit   # oops
    
    Worse yet, I discovered that running the same repair over and over in a
    tight loop can result in a second race that cause data integrity
    problems with the repair:
    
    AIL:    Repair0:        Repair1:
    
    pin buffer X
    delwri_queue:
    set DELWRI_Q
    add to delwri list
    
            stale buf X:
            clear DELWRI_Q
            does not clear b_list
            free space X
            commit
    
                            find free space X
                            get buffer
                            rewrite buffer
                            delwri_queue:
                            set DELWRI_Q
                            already on a list, do not add
                            commit
    
                            BAD: committed tree root before all blocks written
    
    delwri_submit   # too late now
    
    I traced this to my own misunderstanding of how the delwri lists work,
    particularly with regards to the AIL's buffer list.  If a buffer is
    logged and committed, the buffer can end up on that AIL buffer list.  If
    btree repairs are run twice in rapid succession, it's possible that the
    first repair will invalidate the buffer and free it before the next time
    the AIL wakes up.  Marking the buffer stale clears DELWRI_Q from the
    buffer state without removing the buffer from its delwri list.  The
    buffer doesn't know which list it's on, so it cannot know which lock to
    take to protect the list for a removal.
    
    If the second repair allocates the same block, it will then recycle the
    buffer to start writing the new btree block.  Meanwhile, if the AIL
    wakes up and walks the buffer list, it will ignore the buffer because it
    can't lock it, and go back to sleep.
    
    When the second repair calls delwri_queue to put the buffer on the
    list of buffers to write before committing the new btree, it will set
    DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
    buffer list, it won't add it to the bulkload buffer's list.
    
    This is incorrect, because the bulkload caller relies on delwri_submit
    to ensure that all the buffers have been sent to disk /before/
    committing the new btree root pointer.  This ordering requirement is
    required for data consistency.
    
    Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
    drop it, so the next thread to walk through the btree will trip over a
    debug assertion on that flag.
    
    To fix this, create a new function that waits for the buffer to be
    removed from any other delwri lists before adding the buffer to the
    caller's delwri list.  By waiting for the buffer to clear both the
    delwri list and any potential delwri wait list, we can be sure that
    repair will initiate writes of all buffers and report all write errors
    back to userspace instead of committing the new structure.
    Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
    Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    13ae04d8
xfs_buf.h 12.2 KB