1. 18 Oct, 2018 21 commits
    • Christoph Hellwig's avatar
      xfs: cancel COW blocks before swapext · 96987eea
      Christoph Hellwig authored
      We need to make sure we have no outstanding COW blocks before we swap
      extents, as there is nothing preventing us from having preallocated COW
      delalloc on either inode that swapext is called on.  That case can
      easily be reproduced by running generic/324 in always_cow mode:
      
      [  620.760572] XFS: Assertion failed: tip->i_delayed_blks == 0, file: fs/xfs/xfs_bmap_util.c, line: 1669
      [  620.761608] ------------[ cut here ]------------
      [  620.762171] kernel BUG at fs/xfs/xfs_message.c:102!
      [  620.762732] invalid opcode: 0000 [#1] SMP PTI
      [  620.763272] CPU: 0 PID: 24153 Comm: xfs_fsr Tainted: G        W         4.19.0-rc1+ #4182
      [  620.764203] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
      [  620.765202] RIP: 0010:assfail+0x20/0x28
      [  620.765646] Code: 31 ff e8 83 fc ff ff 0f 0b c3 48 89 f1 41 89 d0 48 c7 c6 48 ca 8d 82 48 89 fa 38
      [  620.767758] RSP: 0018:ffffc9000898bc10 EFLAGS: 00010202
      [  620.768359] RAX: 0000000000000000 RBX: ffff88012f14ba40 RCX: 0000000000000000
      [  620.769174] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828560d9
      [  620.769982] RBP: ffff88012f14b300 R08: 0000000000000000 R09: 0000000000000000
      [  620.770788] R10: 000000000000000a R11: f000000000000000 R12: ffffc9000898bc98
      [  620.771638] R13: ffffc9000898bc9c R14: ffff880130b5e2b8 R15: ffff88012a1fa2a8
      [  620.772504] FS:  00007fdc36e0fbc0(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000
      [  620.773475] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  620.774168] CR2: 00007fdc3604d000 CR3: 0000000132afc000 CR4: 00000000000006f0
      [  620.774978] Call Trace:
      [  620.775274]  xfs_swap_extent_forks+0x2a0/0x2e0
      [  620.775792]  xfs_swap_extents+0x38b/0xab0
      [  620.776256]  xfs_ioc_swapext+0x121/0x140
      [  620.776709]  xfs_file_ioctl+0x328/0xc90
      [  620.777154]  ? rcu_read_lock_sched_held+0x50/0x60
      [  620.777694]  ? xfs_iunlock+0x233/0x260
      [  620.778127]  ? xfs_setattr_nonsize+0x3be/0x6a0
      [  620.778647]  do_vfs_ioctl+0x9d/0x680
      [  620.779071]  ? ksys_fchown+0x47/0x80
      [  620.779552]  ksys_ioctl+0x35/0x70
      [  620.780040]  __x64_sys_ioctl+0x11/0x20
      [  620.780530]  do_syscall_64+0x4b/0x190
      [  620.780927]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  620.781467] RIP: 0033:0x7fdc364d0f07
      [  620.781900] Code: b3 66 90 48 8b 05 81 5f 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 28
      [  620.784044] RSP: 002b:00007ffe2a766038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [  620.784896] RAX: ffffffffffffffda RBX: 0000000000000025 RCX: 00007fdc364d0f07
      [  620.785667] RDX: 0000560296ca2fc0 RSI: 00000000c0c0586d RDI: 0000000000000005
      [  620.786398] RBP: 0000000000000025 R08: 0000000000001200 R09: 0000000000000000
      [  620.787283] R10: 0000000000000432 R11: 0000000000000246 R12: 0000000000000005
      [  620.788051] R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000006
      [  620.788927] Modules linked in:
      [  620.789340] ---[ end trace 9503b7417ffdbdb0 ]---
      [  620.790065] RIP: 0010:assfail+0x20/0x28
      [  620.790642] Code: 31 ff e8 83 fc ff ff 0f 0b c3 48 89 f1 41 89 d0 48 c7 c6 48 ca 8d 82 48 89 fa 38
      [  620.793038] RSP: 0018:ffffc9000898bc10 EFLAGS: 00010202
      [  620.793609] RAX: 0000000000000000 RBX: ffff88012f14ba40 RCX: 0000000000000000
      [  620.794317] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828560d9
      [  620.795025] RBP: ffff88012f14b300 R08: 0000000000000000 R09: 0000000000000000
      [  620.795778] R10: 000000000000000a R11: f000000000000000 R12: ffffc9000898bc98
      [  620.796675] R13: ffffc9000898bc9c R14: ffff880130b5e2b8 R15: ffff88012a1fa2a8
      [  620.797782] FS:  00007fdc36e0fbc0(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000
      [  620.798908] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  620.799594] CR2: 00007fdc3604d000 CR3: 0000000132afc000 CR4: 00000000000006f0
      [  620.800424] Kernel panic - not syncing: Fatal exception
      [  620.801191] Kernel Offset: disabled
      [  620.801597] ---[ end Kernel panic - not syncing: Fatal exception ]---
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      96987eea
    • Brian Foster's avatar
      xfs: clear ail delwri queued bufs on unmount of shutdown fs · efc3289c
      Brian Foster authored
      In the typical unmount case, the AIL is forced out by the unmount
      sequence before the xfsaild task is stopped. Since AIL items are
      removed on writeback completion, this means that the AIL
      ->ail_buf_list delwri queue has been drained. This is not always
      true in the shutdown case, however.
      
      It's possible for buffers to sit on a delwri queue for a period of
      time across submission attempts if said items are locked or have
      been relogged and pinned since first added to the queue. If the
      attempt to log such an item results in a log I/O error, the error
      processing can shutdown the fs, remove the item from the AIL, stale
      the buffer (dropping the LRU reference) and clear its delwri queue
      state. The latter bit means the buffer will be released from a
      delwri queue on the next submission attempt, but this might never
      occur if the filesystem has shutdown and the AIL is empty.
      
      This means that such buffers are held indefinitely by the AIL delwri
      queue across destruction of the AIL. Aside from being a memory leak,
      these buffers can also hold references to in-core perag structures.
      The latter problem manifests as a generic/475 failure, reproducing
      the following asserts at unmount time:
      
        XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0,
      	file: fs/xfs/xfs_mount.c, line: 151
        XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0,
      	file: fs/xfs/xfs_mount.c, line: 132
      
      To prevent this problem, clear the AIL delwri queue as a final step
      before xfsaild() exit. The !empty state should never occur in the
      normal case, so add an assert to catch unexpected problems going
      forward.
      
      [dgc: add comment explaining need for xfs_buf_delwri_cancel() after
       calling xfs_buf_delwri_submit_nowait().]
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      efc3289c
    • Carlos Maiolino's avatar
      xfs: use offsetof() in place of offset macros for __xfsstats · 26ca3901
      Carlos Maiolino authored
      Most offset macro mess is used in xfs_stats_format() only, and we can
      simply get the right offsets using offsetof(), instead of several macros
      to mark the offsets inside __xfsstats structure.
      
      Replace all XFSSTAT_END_* macros by a single helper macro to get the
      right offset into __xfsstats, and use this helper in xfs_stats_format()
      directly.
      
      The quota stats code, still looks a bit cleaner when using XFSSTAT_*
      macros, so, this patch also defines XFSSTAT_START_XQMSTAT and
      XFSSTAT_END_XQMSTAT locally to that code. This also should prevent
      offset mistakes when updates are done into __xfsstats.
      Signed-off-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      26ca3901
    • Carlos Maiolino's avatar
      xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat · 41657e55
      Carlos Maiolino authored
      The addition of FIBT, RMAP and REFCOUNT changed the offsets into
      __xfssats structure.
      
      This caused xqmstat_proc_show() to display garbage data via
      /proc/fs/xfs/xqmstat, once it relies on the offsets marked via macros.
      
      Fix it.
      
      Fixes: 00f4e4f9 xfs: add rmap btree stats infrastructure
      Fixes: aafc3c24 xfs: support the XFS_BTNUM_FINOBT free inode btree type
      Fixes: 46eeb521 xfs: introduce refcount btree definitions
      Signed-off-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      41657e55
    • Dave Chinner's avatar
      xfs: fix use-after-free race in xfs_buf_rele · 37fd1678
      Dave Chinner authored
      When looking at a 4.18 based KASAN use after free report, I noticed
      that racing xfs_buf_rele() may race on dropping the last reference
      to the buffer and taking the buffer lock. This was the symptom
      displayed by the KASAN report, but the actual issue that was
      reported had already been fixed in 4.19-rc1 by commit e339dd8d
      ("xfs: use sync buffer I/O for sync delwri queue submission").
      
      Despite this, I think there is still an issue with xfs_buf_rele()
      in this code:
      
              release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
              spin_lock(&bp->b_lock);
              if (!release) {
      .....
      
      If two threads race on the b_lock after both dropping a reference
      and one getting dropping the last reference so release = true, we
      end up with:
      
      CPU 0				CPU 1
      atomic_dec_and_lock()
      				atomic_dec_and_lock()
      				spin_lock(&bp->b_lock)
      spin_lock(&bp->b_lock)
      <spins>
      				<release = true bp->b_lru_ref = 0>
      				<remove from lists>
      				freebuf = true
      				spin_unlock(&bp->b_lock)
      				xfs_buf_free(bp)
      <gets lock, reading and writing freed memory>
      <accesses freed memory>
      spin_unlock(&bp->b_lock) <reads/writes freed memory>
      
      IOWs, we can't safely take bp->b_lock after dropping the hold
      reference because the buffer may go away at any time after we
      drop that reference. However, this can be fixed simply by taking the
      bp->b_lock before we drop the reference.
      
      It is safe to nest the pag_buf_lock inside bp->b_lock as the
      pag_buf_lock is only used to serialise against lookup in
      xfs_buf_find() and no other locks are held over or under the
      pag_buf_lock there. Make this clear by documenting the buffer lock
      orders at the top of the file.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      37fd1678
    • Allison Henderson's avatar
      xfs: Add attibute remove and helper functions · 068f985a
      Allison Henderson authored
      This patch adds xfs_attr_remove_args. These sub-routines remove
      the attributes specified in @args. We will use this later for setting
      parent pointers as a deferred attribute operation.
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      068f985a
    • Allison Henderson's avatar
      xfs: Add attibute set and helper functions · 2f3cd809
      Allison Henderson authored
      This patch adds xfs_attr_set_args and xfs_bmap_set_attrforkoff.
      These sub-routines set the attributes specified in @args.
      We will use this later for setting parent pointers as a deferred
      attribute operation.
      
      [dgc: remove attr fork init code from xfs_attr_set_args().]
      [dgc: xfs_attr_try_sf_addname() NULLs args.trans after commit.]
      [dgc: correct sf add error handling.]
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      2f3cd809
    • Allison Henderson's avatar
      xfs: Add helper function xfs_attr_try_sf_addname · 4c74a56b
      Allison Henderson authored
      This patch adds a subroutine xfs_attr_try_sf_addname
      used by xfs_attr_set.  This subrotine will attempt to
      add the attribute name specified in args in shortform,
      as well and perform error handling previously done in
      xfs_attr_set.
      
      This patch helps to pre-simplify xfs_attr_set for reviewing
      purposes and reduce indentation.  New function will be added
      in the next patch.
      
      [dgc: moved commit to helper function, too.]
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      4c74a56b
    • Allison Henderson's avatar
      xfs: Move fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h · e2421f0b
      Allison Henderson authored
      This patch moves fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
      since xfs_attr.c is in libxfs.  We will need these later in
      xfsprogs.
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      e2421f0b
    • Dave Chinner's avatar
      xfs: issue log message on user force shutdown · 56668a5c
      Dave Chinner authored
      The kernel only issues a log message that it's been shut down when
      the filesystem triggers a shutdown itself. Hence there is no trace
      in the log when a shutdown is triggered manually from userspace.
      This can make it hard to see sequence of events in the log when
      things go wrong, so make sure we always log a message when a
      shutdown is run.
      
      While there, clean up the logic flow so we don't have to continually
      check if the shutdown trigger was user initiated before logging
      shutdown messages.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      56668a5c
    • Darrick J. Wong's avatar
      xfs: fix buffer state management in xrep_findroot_block · 38b6238e
      Darrick J. Wong authored
      We don't handle buffer state properly in online repair's findroot
      routine.  If a buffer already has b_ops set, we don't ever want to touch
      that, and we don't want to call the read verifiers on a buffer that
      could be dirty (CRCs are only recomputed during log checkpoints).
      
      Therefore, be more careful about what we do with a buffer -- if someone
      else already attached ops that are not the ones for this btree type,
      just ignore the buffer.  We only attach our btree type's buf ops if it
      matches the magic/uuid and structure checks.
      
      We also modify xfs_buf_read_map to allow callers to set buffer ops on a
      DONE buffer with NULL ops so that repair doesn't leave behind buffers
      which won't have buffers attached to them.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      38b6238e
    • Darrick J. Wong's avatar
      xfs: always assign buffer verifiers when one is provided · 1aff5696
      Darrick J. Wong authored
      If a caller supplies buffer ops when trying to read a buffer and the
      buffer doesn't already have buf ops assigned, ensure that the ops are
      assigned to the buffer and the verifier is run on that buffer.
      
      Note that current XFS code is careful to assign buffer ops after a
      xfs_{trans_,}buf_read call in which ops were not supplied.  However, we
      should apply ops defensively in case there is ever a coding mistake; and
      an upcoming repair patch will need to be able to read a buffer without
      assigning buf ops.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      1aff5696
    • Darrick J. Wong's avatar
      xfs: xrep_findroot_block should reject root blocks with siblings · 1002ff45
      Darrick J. Wong authored
      In xrep_findroot_block, if we find a candidate root block with sibling
      pointers or sibling blocks on the same tree level, we should not return
      that block as a tree root because root blocks cannot have siblings.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      1002ff45
    • Adam Borowski's avatar
      xfs: add a define for statfs magic to uapi · dddde68b
      Adam Borowski authored
      Needed by userspace programs that call fstatfs().
      
      It'd be natural to publish XFS_SB_MAGIC in uapi, but while these two
      have identical values, they have different semantic meaning: one is
      an enum cookie meant for statfs, the other a signature of the
      on-disk format.
      Signed-off-by: default avatarAdam Borowski <kilobyte@angband.pl>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      dddde68b
    • Christoph Hellwig's avatar
      xfs: print dangling delalloc extents · 4831822f
      Christoph Hellwig authored
      Instead of just asserting that we have no delalloc space dangling
      in an inode that gets freed print the actual offenders for debug
      mode.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      4831822f
    • Christoph Hellwig's avatar
      xfs: fix fork selection in xfs_find_trim_cow_extent · 032dc923
      Christoph Hellwig authored
      We should want to write directly into the data fork for blocks that don't
      have an extent in the COW fork covering them yet.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      032dc923
    • Christoph Hellwig's avatar
    • Christoph Hellwig's avatar
    • Christoph Hellwig's avatar
      xfs: handle zeroing in xfs_file_iomap_begin_delay · 0365c5d6
      Christoph Hellwig authored
      We only need to allocate blocks for zeroing for reflink inodes,
      and for we currently have a special case for reflink files in
      the otherwise direct I/O path that I'd like to get rid of.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      0365c5d6
    • Christoph Hellwig's avatar
      xfs: remove suport for filesystems without unwritten extent flag · daa79bae
      Christoph Hellwig authored
      The option to enable unwritten extents was made default in 2003,
      removed from mkfs in 2007, and cannot be disabled in v5.  We also
      rely on it for a lot of common functionality, so filesystems without
      it will run a completely untested and buggy code path.  Enabling the
      support also is a simple bit flip using xfs_db, so legacy file
      systems can still be brought forward.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      daa79bae
    • Christoph Hellwig's avatar
      xfs: remove XFS_IO_INVALID · 97e5a6e6
      Christoph Hellwig authored
      The invalid state isn't any different from a hole, so merge the two
      states.  Use the more descriptive hole name, but keep it as the first
      value of the enum to catch uninitialized fields.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      97e5a6e6
  2. 06 Oct, 2018 2 commits
    • Dave Chinner's avatar
      xfs: fix data corruption w/ unaligned reflink ranges · b3998900
      Dave Chinner authored
      When reflinking sub-file ranges, a data corruption can occur when
      the source file range includes a partial EOF block. This shares the
      unknown data beyond EOF into the second file at a position inside
      EOF, exposing stale data in the second file.
      
      XFS only supports whole block sharing, but we still need to
      support whole file reflink correctly.  Hence if the reflink
      request includes the last block of the souce file, only proceed with
      the reflink operation if it lands at or past the destination file's
      current EOF. If it lands within the destination file EOF, reject the
      entire request with -EINVAL and make the caller go the hard way.
      
      This avoids the data corruption vector, but also avoids disruption
      of returning EINVAL to userspace for the common case of whole file
      cloning.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      b3998900
    • Dave Chinner's avatar
      xfs: fix data corruption w/ unaligned dedupe ranges · dceeb47b
      Dave Chinner authored
      A deduplication data corruption is Exposed by fstests generic/505 on
      XFS. It is caused by extending the block match range to include the
      partial EOF block, but then allowing unknown data beyond EOF to be
      considered a "match" to data in the destination file because the
      comparison is only made to the end of the source file. This corrupts
      the destination file when the source extent is shared with it.
      
      XFS only supports whole block dedupe, but we still need to appear to
      support whole file dedupe correctly.  Hence if the dedupe request
      includes the last block of the souce file, don't include it in the
      actual XFS dedupe operation. If the rest of the range dedupes
      successfully, then report the partial last block as deduped, too, so
      that userspace sees it as a successful dedupe rather than return
      EINVAL because we can't dedupe unaligned blocks.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      dceeb47b
  3. 05 Oct, 2018 3 commits
  4. 30 Sep, 2018 1 commit
    • Dave Chinner's avatar
      xfs: fix error handling in xfs_bmap_extents_to_btree · e55ec4dd
      Dave Chinner authored
      Commit 01239d77 ("xfs: fix a null pointer dereference in
      xfs_bmap_extents_to_btree") attempted to fix a null pointer
      dreference when a fuzzing corruption of some kind was found.
      This fix was flawed, resulting in assert failures like:
      
      XFS: Assertion failed: ifp->if_broot == NULL, file: fs/xfs/libxfs/xfs_bmap.c, line: 715
      .....
      Call Trace:
        xfs_bmap_extents_to_btree+0x6b9/0x7b0
        __xfs_bunmapi+0xae7/0xf00
        ? xfs_log_reserve+0x1c8/0x290
        xfs_reflink_remap_extent+0x20b/0x620
        xfs_reflink_remap_blocks+0x7e/0x290
        xfs_reflink_remap_range+0x311/0x530
        vfs_dedupe_file_range_one+0xd7/0xe0
        vfs_dedupe_file_range+0x15b/0x1a0
        do_vfs_ioctl+0x267/0x6c0
      
      The problem is that the error handling code now asserts that the
      inode fork is not in btree format before the error handling code
      undoes the modifications that put the fork back in extent format.
      Fix this by moving the assert back to after the xfs_iroot_realloc()
      call that returns the fork to extent format, and clean up the jump
      labels to be meaningful.
      
      Also, returning ENOSPC when xfs_btree_get_bufl() fails to
      instantiate the buffer that was allocated (the actual fix in the
      commit mentioned above) is incorrect. This is a fatal error - only
      an invalid block address or a filesystem shutdown can result in
      failing to get a buffer here.
      
      Hence change this to EFSCORRUPTED so that the higher layer knows
      this was a corruption related failure and should not treat it as an
      ENOSPC error.  This should result in a shutdown (via cancelling a
      dirty transaction) which is necessary as we do not attempt to clean
      up the (invalid) block that we have already allocated.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      e55ec4dd
  5. 29 Sep, 2018 13 commits
    • Brian Foster's avatar
      iomap: set page dirty after partial delalloc on mkwrite · 561295a3
      Brian Foster authored
      The iomap page fault mechanism currently dirties the associated page
      after the full block range of the page has been allocated. This
      leaves the page susceptible to delayed allocations without ever
      being set dirty on sub-page block sized filesystems.
      
      For example, consider a page fault on a page with one preexisting
      real (non-delalloc) block allocated in the middle of the page. The
      first iomap_apply() iteration performs delayed allocation on the
      range up to the preexisting block, the next iteration finds the
      preexisting block, and the last iteration attempts to perform
      delayed allocation on the range after the prexisting block to the
      end of the page. If the first allocation succeeds and the final
      allocation fails with -ENOSPC, iomap_apply() returns the error and
      iomap_page_mkwrite() fails to dirty the page having already
      performed partial delayed allocation. This eventually results in the
      page being invalidated without ever converting the delayed
      allocation to real blocks.
      
      This problem is reliably reproduced by generic/083 on XFS on ppc64
      systems (64k page size, 4k block size). It results in leaked
      delalloc blocks on inode reclaim, which triggers an assert failure
      in xfs_fs_destroy_inode() and filesystem accounting inconsistency.
      
      Move the set_page_dirty() call from iomap_page_mkwrite() to the
      actor callback, similar to how the buffer head implementation works.
      The actor callback is called iff ->iomap_begin() returns success, so
      ensures the page is dirtied as soon as possible after an allocation.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      561295a3
    • Brian Foster's avatar
      xfs: remove invalid log recovery first/last cycle check · ec2ed0b5
      Brian Foster authored
      One of the first steps of log recovery is to check for the special
      case of a zeroed log. If the first cycle in the log is zero or the
      tail portion of the log is zeroed, the head is set to the first
      instance of cycle 0. xlog_find_zeroed() includes a sanity check that
      enforces that the first cycle in the log must be 1 if the last cycle
      is 0. While this is true in most cases, the check is not totally
      valid because it doesn't consider the case where the filesystem
      crashed after a partial/out of order log buffer completion that
      wraps around the end of the physical log.
      
      For example, consider a filesystem that has completed most of the
      first cycle of the log, reaches the end of the physical log and
      splits the next single log buffer write into two in order to wrap
      around the end of the log. If these I/Os are reordered, the second
      (wrapped) I/O completes and the first happens to fail, the log is
      left in a state where the last cycle of the log is 0 and the first
      cycle is 2. This causes the xlog_find_zeroed() sanity check to fail
      and prevents the filesystem from mounting. This situation has been
      reproduced on particular systems via repeated runs of generic/475.
      
      This is an expected state that log recovery already knows how to
      deal with, however. Since the log is still partially zeroed, the
      head is detected correctly and points to a valid tail. The
      subsequent stale block detection clears blocks beyond the head up to
      the tail (within a maximum range), with the express purpose of
      clearing such out of order writes. As expected, this removes the out
      of order cycle 2 blocks at the physical start of the log.
      
      In other words, the only thing that prevents a clean mount and
      recovery of the filesystem in this scenario is the specific (last ==
      0 && first != 1) sanity check in xlog_find_zeroed(). Since the log
      head/tail are now independently validated via cycle, log record and
      CRC checks, this highly specific first cycle check is of dubious
      value. Remove it and rely on the higher level validation to
      determine whether log content is sane and recoverable.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      ec2ed0b5
    • Eric Sandeen's avatar
      xfs: validate inode di_forkoff · 339e1a3f
      Eric Sandeen authored
      Verify the inode di_forkoff, lifted from xfs_repair's
      process_check_inode_forkoff().
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      339e1a3f
    • Christoph Hellwig's avatar
      xfs: skip delalloc COW blocks in xfs_reflink_end_cow · f5f3f959
      Christoph Hellwig authored
      The iomap direct I/O code issues a single ->end_io call for the whole
      I/O request, and if some of the extents cowered needed a COW operation
      it will call xfs_reflink_end_cow over the whole range.
      
      When we do AIO writes we drop the iolock after doing the initial setup,
      but before the I/O completion.  Between dropping the lock and completing
      the I/O we can have a racing buffered write create new delalloc COW fork
      extents in the region covered by the outstanding direct I/O write, and
      thus see delalloc COW fork extents in xfs_reflink_end_cow.  As
      concurrent writes are fundamentally racy and no guarantees are given we
      can simply skip those.
      
      This can be easily reproduced with xfstests generic/208 in always_cow
      mode.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      f5f3f959
    • Eric Sandeen's avatar
      xfs: don't treat unknown di_flags2 as corruption in scrub · f369a13c
      Eric Sandeen authored
      xchk_inode_flags2() currently treats any di_flags2 values that the
      running kernel doesn't recognize as corruption, and calls
      xchk_ino_set_corrupt() if they are set.  However, it's entirely possible
      that these flags were set in some newer kernel and are quite valid,
      but ignored in this kernel.
      
      (Validators don't care one bit about unknown di_flags2.)
      
      Call xchk_ino_set_warning instead, because this may or may not actually
      indicate a problem.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      f369a13c
    • YueHaibing's avatar
      xfs: remove duplicated include from alloc.c · 2863c2eb
      YueHaibing authored
      Remove duplicated include xfs_alloc.h
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      2863c2eb
    • Christoph Hellwig's avatar
      xfs: don't bring in extents in xfs_bmap_punch_delalloc_range · 0065b541
      Christoph Hellwig authored
      This function is only used to punch out delayed allocations on I/O
      failure, which means we need to have read the extents earlier.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      0065b541
    • Dave Chinner's avatar
      xfs: fix transaction leak in xfs_reflink_allocate_cow() · df307077
      Dave Chinner authored
      When xfs_reflink_allocate_cow() allocates a transaction, it drops
      the ILOCK to perform the operation. This Introduces a race condition
      where another thread modifying the file can perform the COW
      allocation operation underneath us. This result in the retry loop
      finding an allocated block and jumping straight to the conversion
      code. It does not, however, cancel the transaction it holds and so
      this gets leaked. This results in a lockdep warning:
      
      ================================================
      WARNING: lock held when returning to user space!
      4.18.5 #1 Not tainted
      ------------------------------------------------
      worker/6123 is leaving the kernel with locks still held!
      1 lock held by worker/6123:
       #0: 000000009eab4f1b (sb_internal#2){.+.+}, at: xfs_trans_alloc+0x17c/0x220
      
      And eventually the filesystem deadlocks because it runs out of log
      space that is reserved by the leaked transaction and never gets
      released.
      
      The logic flow in xfs_reflink_allocate_cow() is a convoluted mess of
      gotos - it's no surprise that it has bug where the flow through
      several goto jumps then fails to clean up context from a non-obvious
      logic path. CLean up the logic flow and make sure every path does
      the right thing.
      Reported-by: default avatarAlexander Y. Fomichev <git.user@gmail.com>
      Tested-by: default avatarAlexander Y. Fomichev <git.user@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200981Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      [hch: slight refactor]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      df307077
    • Dave Chinner's avatar
      xfs: avoid lockdep false positives in xfs_trans_alloc · 8683edb7
      Dave Chinner authored
      We've had a few reports of lockdep tripping over memory reclaim
      context vs filesystem freeze "deadlocks". They all have looked
      to be false positives on analysis, but it seems that they are
      being tripped because we take freeze references before we run
      a GFP_KERNEL allocation for the struct xfs_trans.
      
      We can avoid this false positive vector just by re-ordering the
      operations in xfs_trans_alloc(). That is. we need allocate the
      structure before we take the freeze reference and enter the GFP_NOFS
      allocation context that follows the xfs_trans around. This prevents
      lockdep from seeing the GFP_KERNEL allocation inside the transaction
      context, and that prevents it from triggering the freeze level vs
      alloc context vs reclaim warnings.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      8683edb7
    • Brian Foster's avatar
      xfs: refactor xfs_buf_log_item reference count handling · 95808459
      Brian Foster authored
      The xfs_buf_log_item structure has a reference counter with slightly
      tricky semantics. In the common case, a buffer is logged and
      committed in a transaction, committed to the on-disk log (added to
      the AIL) and then finally written back and removed from the AIL. The
      bli refcount covers two potentially overlapping timeframes:
      
       1. the bli is held in an active transaction
       2. the bli is pinned by the log
      
      The caveat to this approach is that the reference counter does not
      purely dictate the lifetime of the bli. IOW, when a dirty buffer is
      physically logged and unpinned, the bli refcount may go to zero as
      the log item is inserted into the AIL. Only once the buffer is
      written back can the bli finally be freed.
      
      The above semantics means that it is not enough for the various
      refcount decrementing contexts to release the bli on decrement to
      zero. xfs_trans_brelse(), transaction commit (->iop_unlock()) and
      unpin (->iop_unpin()) must all drop the associated reference and
      make additional checks to determine if the current context is
      responsible for freeing the item.
      
      For example, if a transaction holds but does not dirty a particular
      bli, the commit may drop the refcount to zero. If the bli itself is
      clean, it is also not AIL resident and must be freed at this time.
      The same is true for xfs_trans_brelse(). If the transaction dirties
      a bli and then aborts or an unpin results in an abort due to a log
      I/O error, the last reference count holder is expected to explicitly
      remove the item from the AIL and release it (since an abort means
      filesystem shutdown and metadata writeback will never occur).
      
      This leads to fairly complex checks being replicated in a few
      different places. Since ->iop_unlock() and xfs_trans_brelse() are
      nearly identical, refactor the logic into a common helper that
      implements and documents the semantics in one place. This patch does
      not change behavior.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      95808459
    • Brian Foster's avatar
      xfs: clean up xfs_trans_brelse() · 23420d05
      Brian Foster authored
      xfs_trans_brelse() is a bit of a historical mess, similar to
      xfs_buf_item_unlock(). It is unnecessarily verbose, has snippets of
      commented out code, inconsistency with regard to stale items, etc.
      
      Clean up xfs_trans_brelse() to use similar logic and flow as
      xfs_buf_item_unlock() with regard to bli reference count handling.
      This patch makes no functional changes, but facilitates further
      refactoring of the common bli reference count handling code.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      23420d05
    • Brian Foster's avatar
      xfs: don't unlock invalidated buf on aborted tx commit · d9183105
      Brian Foster authored
      xfstests generic/388,475 occasionally reproduce assertion failures
      in xfs_buf_item_unpin() when the final bli reference is dropped on
      an invalidated buffer and the buffer is not locked as it is expected
      to be. Invalidated buffers should remain locked on transaction
      commit until the final unpin, at which point the buffer is removed
      from the AIL and the bli is freed since stale buffers are not
      written back.
      
      The assert failures are associated with filesystem shutdown,
      typically due to log I/O errors injected by the test. The
      problematic situation can occur if the shutdown happens to cause a
      race between an active transaction that has invalidated a particular
      buffer and an I/O error on a log buffer that contains the bli
      associated with the same (now stale) buffer.
      
      Both transaction and log contexts acquire a bli reference. If the
      transaction has already invalidated the buffer by the time the I/O
      error occurs and ends up aborting due to shutdown, the transaction
      and log hold the last two references to a stale bli. If the
      transaction cancel occurs first, it treats the buffer as non-stale
      due to the aborted state: the bli reference is dropped and the
      buffer is released/unlocked. The log buffer I/O error handling
      eventually calls into xfs_buf_item_unpin(), drops the final
      reference to the bli and treats it as stale. The buffer wasn't left
      locked by xfs_buf_item_unlock(), however, so the assert fails and
      the buffer is double unlocked. The latter problem is mitigated by
      the fact that the fs is shutdown and no further damage is possible.
      
      ->iop_unlock() of an invalidated buffer should behave consistently
      with respect to the bli refcount, regardless of aborted state. If
      the refcount remains elevated on commit, we know the bli is awaiting
      an unpin (since it can't be in another transaction) and will be
      handled appropriately on log buffer completion. If the final bli
      reference of an invalidated buffer is dropped in ->iop_unlock(), we
      can assume the transaction has aborted because invalidation implies
      a dirty transaction. In the non-abort case, the log would have
      acquired a bli reference in ->iop_pin() and prevented bli release at
      ->iop_unlock() time. In the abort case the item must be freed and
      buffer unlocked because it wasn't pinned by the log.
      
      Rework xfs_buf_item_unlock() to simplify the currently circuitous
      and duplicate logic and leave invalidated buffers locked based on
      bli refcount, regardless of aborted state. This ensures that a
      pinned, stale buffer is always found locked when eventually
      unpinned.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      d9183105
    • Brian Foster's avatar
      xfs: remove last of unnecessary xfs_defer_cancel() callers · d5a2e289
      Brian Foster authored
      Now that deferred operations are completely managed via
      transactions, it's no longer necessary to cancel the dfops in error
      paths that already cancel the associated transaction. There are a
      few such calls lingering throughout the codebase.
      
      Remove all remaining unnecessary calls to xfs_defer_cancel(). This
      leaves xfs_defer_cancel() calls in two places. The first is the call
      in the transaction cancel path itself, which facilitates this patch.
      The second is made via the xfs_defer_finish() error path to provide
      consistent error semantics with transaction commit. For example,
      xfs_trans_commit() expects an xfs_defer_finish() failure to clean up
      the dfops structure before it returns.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      d5a2e289