    xfs: always log the inode on unwritten extent conversion · 2e588a46
    Brian Foster authored
    The fsync() requirements for crash consistency on XFS are to flush file
    data and force any in-core inode updates to the log. We currently check
    whether the inode is pinned to identify whether the log needs to be
    forced, since a non-zero pin count generally represents an inode that
    has transactions awaiting a flush to the on-disk log.
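
The pin-count check described above can be modelled in a few lines. This is a simplified sketch, not the kernel code: `toy_inode`, `toy_log`, `toy_pin`, `toy_unpin` and `toy_fsync` are hypothetical names invented for illustration, and the real fsync path does considerably more (data flush, locking, log sequence numbers).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an in-core inode; NOT the real struct xfs_inode. */
struct toy_inode {
    int pincount;  /* transactions committed in memory but not yet on disk */
};

/* Toy log that only counts how often a force was requested. */
struct toy_log {
    int forces;
};

/* A transaction that dirtied the inode core commits: the inode gets pinned. */
static void toy_pin(struct toy_inode *ip)
{
    ip->pincount++;
}

/* The log write covering that transaction completes: the pin is dropped. */
static void toy_unpin(struct toy_inode *ip)
{
    assert(ip->pincount > 0);
    ip->pincount--;
}

/* fsync as described above: force the log only if the inode is pinned. */
static bool toy_fsync(const struct toy_inode *ip, struct toy_log *log)
{
    if (ip->pincount == 0)
        return false;   /* nothing appears to be outstanding */
    log->forces++;
    return true;
}
```

In this model, a modification that never calls `toy_pin()` is invisible to `toy_fsync()`, which is the gap the rest of this message describes.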
    
    This is not sufficient in all cases, however. Reports of xfstests test
    generic/311 failures on ppc64/s390x hosts have identified failures to
    fsync outstanding inode modifications due to the inode not being pinned
    at the time of the fsync. This occurs because certain bmap updates can
    complete by logging bmapbt buffers but without ever dirtying (and thus
    pinning) the core inode. The following is a specific incarnation of this
    problem:
    
    $ mount $dev /mnt -o noatime,nobarrier
    $ for i in $(seq 0 2 31); do \
            xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
        done
    $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
        hexdump /mnt/file; \
        ./xfstests-dev/src/godown /mnt
    ...
    0000000 0000 0000 0000 0000 0000 0000 0000 0000
    *
    0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
    *
    0014000 0000 0000 0000 0000 0000 0000 0000 0000
    *
    00f8000
    $ umount /mnt; mount ...
    $ hexdump /mnt/file
    0000000 0000 0000 0000 0000 0000 0000 0000 0000
    *
    00f8000
    
    In short, the unwritten extent conversion for the last write is lost
    even though an fsync completed before the filesystem was shut down.
    Note that this is impossible to reproduce on v5 supers due to
    unconditional time callbacks for di_changecount and highly difficult to
    reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
    frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
    reduces timer granularity enough to increase the odds that time updates
    are skipped and allows this to reproduce within a handful of attempts.
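
The effect of CONFIG_HZ on reproducibility can be illustrated with a small model. This is a sketch under stated assumptions: it assumes timestamps come from a clock that only advances once per timer tick, and that a timestamp update dirties the inode only when the new value differs from the stored one. `toy_coarse_ns` and `toy_update_time` are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round a time in nanoseconds down to the start of its tick (1/hz seconds). */
static uint64_t toy_coarse_ns(uint64_t now_ns, unsigned hz)
{
    uint64_t tick_ns = 1000000000ULL / hz;

    assert(hz > 0);
    return now_ns - (now_ns % tick_ns);
}

/*
 * Update a stored timestamp from the tick-granular clock.
 * Returns true if the stored value changed, i.e. the inode would be
 * dirtied (and therefore pinned) by the time update.
 */
static bool toy_update_time(uint64_t *stamp_ns, uint64_t now_ns, unsigned hz)
{
    uint64_t coarse = toy_coarse_ns(now_ns, hz);

    if (coarse == *stamp_ns)
        return false;   /* same tick: update skipped, inode stays clean */
    *stamp_ns = coarse;
    return true;
}
```

For two writes 3 ms apart, a 1 ms tick (hz = 1000) places them in different ticks, so the second write still dirties the inode. With a 10 ms tick (hz = 100) they can land in the same tick, and the second update is skipped.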
    
    To deal with this problem, unconditionally log the core in the unwritten
    extent conversion path. Fix up logflags after the extent conversion to
    keep the extent update code consistent with the other extent update
    helpers. This fixup is not necessary for the other (hole, delay) extent
    helpers because they execute in the block allocation codepath, which
    already logs the inode for other reasons (e.g., for di_nblocks).
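
The fix can be sketched with the same kind of toy model. The flag names and values below (`TOY_LOG_CORE`, `TOY_LOG_BMBT`) and all function names are stand-ins invented for this illustration; they are not the kernel's log flags or helpers, and the actual change lives in the unwritten extent conversion path of xfs_bmap.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in log flags for this sketch; values are illustrative only. */
enum {
    TOY_LOG_CORE = 1 << 0,  /* inode core dirtied in the transaction */
    TOY_LOG_BMBT = 1 << 1   /* only a bmap btree buffer was logged */
};

struct toy_inode {
    int pincount;
};

/* Commit a transaction: the inode is pinned only if its core was logged. */
static void toy_commit(struct toy_inode *ip, unsigned logflags)
{
    if (logflags & TOY_LOG_CORE)
        ip->pincount++;
}

/*
 * Unwritten extent conversion that happens to touch only a btree block.
 * always_log_core = false models the old behaviour, true models the fix.
 */
static unsigned toy_convert_unwritten(bool always_log_core)
{
    unsigned logflags = TOY_LOG_BMBT;

    if (always_log_core)
        logflags |= TOY_LOG_CORE;   /* unconditionally log the core */
    return logflags;
}

/* fsync forces the log only when it sees the inode pinned. */
static bool toy_fsync_forces_log(const struct toy_inode *ip)
{
    return ip->pincount > 0;
}
```

With `always_log_core` set, the conversion always leaves the inode pinned at commit, so a subsequent fsync forces the log even when nothing else in the core changed.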

    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
    
xfs_bmap.c 170 KB