• Josef Bacik's avatar
    btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range · d98da499
    Josef Bacik authored
    We hit a regression while rolling out 5.2 internally where we were
    hitting the following panic
    
      kernel BUG at mm/page-writeback.c:2659!
      RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
      Call Trace:
       __process_pages_contig+0x25a/0x350
       ? extent_clear_unlock_delalloc+0x43/0x70
       submit_compressed_extents+0x359/0x4d0
       normal_work_helper+0x15a/0x330
       process_one_work+0x1f5/0x3f0
       worker_thread+0x2d/0x3d0
       ? rescuer_thread+0x340/0x340
       kthread+0x111/0x130
       ? kthread_create_on_node+0x60/0x60
       ret_from_fork+0x1f/0x30
    
    This is happening because the page is not locked when doing
    clear_page_dirty_for_io.  Looking at the core dump it was because our
    async_extent had a ram_size of 24576 but our async_chunk range only
    spanned 20480, so we had a whole extra page in our ram_size for our
    async_extent.
    
    This happened because we try not to compress pages outside of our
    i_size, however a cleanup patch changed us to do
    
    actual_end = min_t(u64, i_size_read(inode), end + 1);
    
    which is problematic because i_size_read() can evaluate to different
    values in between checking and assigning.  So either an expanding
    truncate or a fallocate could increase our i_size while we're doing
    writeout and actual_end would end up being past the range we have
    locked.
    
    I confirmed this was what was happening by installing a debug kernel
    that had
    
      actual_end = min_t(u64, i_size_read(inode), end + 1);
      if (actual_end > end + 1) {
    	  printk(KERN_ERR "KABOOM\n");
    	  actual_end = end + 1;
      }
    
    and installing it onto 500 boxes of the tier that had been seeing the
    problem regularly.  Last night I got my debug message and no panic,
    confirming what I expected.
    
    [ dsterba: the assembly confirms a tiny race window:
    
        mov    0x20(%rsp),%rax
        cmp    %rax,0x48(%r15)           # read
        movl   $0x0,0x18(%rsp)
        mov    %rax,%r12
        mov    %r14,%rax
        cmovbe 0x48(%r15),%r12           # eval
    
      Where r15 is inode and 0x48 is offset of i_size.
    
      The original fix was to revert 62b37622 that would do an
      intermediate assignment and this would also avoid the doulble
      evaluation but is not future-proof, should the compiler merge the
      stores and call i_size_read anyway.
    
      There's a patch adding READ_ONCE to i_size_read but that's not being
      applied at the moment and we need to fix the bug. Instead, emulate
      READ_ONCE by two barrier()s that's what effectively happens. The
      assembly confirms single evaluation:
    
        mov    0x48(%rbp),%rax          # read once
        mov    0x20(%rsp),%rcx
        mov    $0x20,%edx
        cmp    %rax,%rcx
        cmovbe %rcx,%rax
        mov    %rax,(%rsp)
        mov    %rax,%rcx
        mov    %r14,%rax
    
      Where 0x48(%rbp) is inode->i_size stored to %eax.
    ]
    
    Fixes: 62b37622 ("btrfs: Remove isize local variable in compress_file_range")
    CC: stable@vger.kernel.org # v5.1+
    Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
    [ changelog updated ]
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    d98da499
inode.c 299 KB