1. 14 Jul, 2022 10 commits
    • Dave Chinner's avatar
      xfs: add in-memory iunlink log item · 784eb7d8
      Dave Chinner authored
      Now that we have a clean operation to update the di_next_unlinked
      field of inode cluster buffers, we can easily defer this operation
      to transaction commit time so we can order the inode cluster buffer
      locking consistently.
      
      To do this, we introduce a new in-memory log item to track the
      unlinked list item modification that we are going to make. This
      follows the same observations as the in-memory double linked list
      used to track unlinked inodes in that the inodes on the list are
      pinned in memory and cannot go away, and hence we can simply
      reference them for the duration of the transaction without needing
      to take active references or pin them or look them up.
      
      This allows us to pass the xfs_inode to the transaction commit code
      along with the modification to be made, and then order the logged
      modifications via the ->iop_sort and ->iop_precommit operations
      for the new log item type. As this is an in-memory log item, it
      doesn't have formatting, CIL or AIL operational hooks - it exists
      purely to run the inode unlink modifications and is then removed
      from the transaction item list and freed once the precommit
      operation has run.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      784eb7d8
    • Dave Chinner's avatar
      xfs: add log item precommit operation · fad743d7
      Dave Chinner authored
      For inodes that are dirty, we have an attached cluster buffer that
      we want to use to track the dirty inode through the AIL.
      Unfortunately, locking the cluster buffer and adding it to the
      transaction when the inode is first logged in a transaction leads to
      buffer lock ordering inversions.
      
      The specific problem is ordering against the AGI buffer. When
      modifying unlinked lists, the buffer lock order is AGI -> inode
      cluster buffer as the AGI buffer lock serialises all access to the
      unlinked lists. Unfortunately, functionality like xfs_droplink()
      logs the inode before calling xfs_iunlink(), as do various directory
      manipulation functions. The inode can be logged way down in the
      stack as far as the bmapi routines and hence, without a major
      rewrite of lots of APIs there's no way we can avoid the inode being
      logged by something until after the AGI has been logged.
      
      As we are going to be using ordered buffers for inode AIL tracking,
      there isn't a need to actually lock that buffer against modification
      as all the modifications are captured by logging the inode item
      itself. Hence we don't actually need to join the cluster buffer into
      the transaction until just before it is committed. This means we do
      not perturb any of the existing buffer lock orders in transactions,
      and the inode cluster buffer is always locked last in a transaction
      that doesn't otherwise touch inode cluster buffers.
      
      We do this by introducing a precommit log item method.  This commit
      just introduces the mechanism; the inode item implementation is in
      followup commits.
      
      The precommit items need to be sorted into consistent order as we
      may be locking multiple items here. Hence if we have two dirty
      inodes in cluster buffers A and B, and some other transaction has
      two separate dirty inodes in the same cluster buffers, locking them
      in different orders opens us up to ABBA deadlocks. Hence we sort the
      items on the transaction based on the presence of a sort log item
      method.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      fad743d7
    • Dave Chinner's avatar
      xfs: combine iunlink inode update functions · 062efdb0
      Dave Chinner authored
      Combine the logging of the inode unlink list update into the
      calling function that looks up the buffer we end up logging. These
      do not need to be separate functions as they are both short, simple
      operations and there's only a single call path through them. This
      new function will end up being the core of the iunlink log item
      processing...
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      062efdb0
    • Dave Chinner's avatar
      xfs: clean up xfs_iunlink_update_inode() · 5301f870
      Dave Chinner authored
      We no longer need to have this function return the previous next
      agino value from the on-disk inode as we have it in the in-core
      inode now.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      5301f870
    • Dave Chinner's avatar
      xfs: double link the unlinked inode list · 2fd26cc0
      Dave Chinner authored
      Now we have forwards traversal via the incore inode in place, we now
      need to add back pointers to the incore inode to entirely replace
      the back reference cache. We use the same lookup semantics and
      constraints as for the forwards pointer lookups during unlinks, and
      so we can look up any inode in the unlinked list directly and update
      the list pointers, forwards or backwards, at any time.
      
      The only wrinkle in converting the unlinked list manipulations to
      use in-core previous pointers is that log recovery doesn't have the
      incore inode state built up so it can't just read in an inode and
      release it to finish off the unlink. Hence we need to modify the
      traversal in recovery to read one inode ahead before we
      release the inode at the head of the list. This populates the
      next->prev relationship sufficient to be able to replay the unlinked
      list and hence greatly simplify the runtime code.
      
      This recovery algorithm also requires that we actually remove inodes
      from the unlinked list one at a time as background inode
      inactivation will result in unlinked list removal racing with the
      building of the in-memory unlinked list state. We could serialise
      this by holding the AGI buffer lock when constructing the in memory
      state, but all that does is lockstep background processing with list
      building. It is much simpler to flush the inodegc immediately after
      releasing the inode so that it is unlinked immediately and there is
      no races present at all.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      2fd26cc0
    • Dave Chinner's avatar
      xfs: introduce xfs_iunlink_lookup · a83d5a8b
      Dave Chinner authored
      When an inode is on an unlinked list during normal operation, it is
      guaranteed to be pinned in memory as it is either referenced by the
      current unlink operation or it has a open file descriptor that
      references it and has it pinned in memory. Hence to look up an inode
      on the unlinked list, we can do a direct inode cache lookup and
      always expect the lookup to succeed.
      
      Add a function to do this lookup based on the agino that we use to
      link the chain of unlinked inodes together so we can begin the
      conversion the unlinked list manipulations to use in-memory inodes
      rather than inode cluster buffers and remove the backref cache.
      
      Use this lookup function to replace the on-disk inode buffer walk
      when removing inodes from the unlinked list with an in-core inode
      unlinked list walk.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      a83d5a8b
    • Dave Chinner's avatar
      xfs: refactor xlog_recover_process_iunlinks() · 04755d2e
      Dave Chinner authored
      For upcoming changes to the way inode unlinked list processing is
      done, the structure of recovery needs to change slightly. We also
      really need to untangle the messy error handling in list recovery
      so that actions like emptying the bucket on inode lookup failure
      are associated with the bucket list walk failing, not failing
      to look up the inode.
      
      Refactor the recovery code now to keep the re-organisation seperate
      to the algorithm changes.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      04755d2e
    • Dave Chinner's avatar
      xfs: track the iunlink list pointer in the xfs_inode · 4fcc94d6
      Dave Chinner authored
      Having direct access to the i_next_unlinked pointer in unlinked
      inodes greatly simplifies the processing of inodes on the unlinked
      list. We no longer need to look up the inode buffer just to find
      next inode in the list if the xfs_inode is in memory. These
      improvements will be realised over upcoming patches as other
      dependencies on the inode buffer for unlinked list processing are
      removed.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      4fcc94d6
    • Dave Chinner's avatar
      xfs: factor the xfs_iunlink functions · a4454cd6
      Dave Chinner authored
      Prep work that separates the locking that protects the unlinked list
      from the actual operations being performed. This also helps document
      the fact they are performing list insert  and remove operations. No
      functional code change.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      a4454cd6
    • Zhang Yi's avatar
      xfs: flush inode gc workqueue before clearing agi bucket · 04a98a03
      Zhang Yi authored
      In the procedure of recover AGI unlinked lists, if something bad
      happenes on one of the unlinked inode in the bucket list, we would call
      xlog_recover_clear_agi_bucket() to clear the whole unlinked bucket list,
      not the unlinked inodes after the bad one. If we have already added some
      inodes to the gc workqueue before the bad inode in the list, we could
      get below error when freeing those inodes, and finaly fail to complete
      the log recover procedure.
      
       XFS (ram0): Internal error xfs_iunlink_remove at line 2456 of file
       fs/xfs/xfs_inode.c.  Caller xfs_ifree+0xb0/0x360 [xfs]
      
      The problem is xlog_recover_clear_agi_bucket() clear the bucket list, so
      the gc worker fail to check the agino in xfs_verify_agino(). Fix this by
      flush workqueue before clearing the bucket.
      
      Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
      Signed-off-by: default avatarZhang Yi <yi.zhang@huawei.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      04a98a03
  2. 07 Jul, 2022 14 commits
  3. 01 Jul, 2022 1 commit
    • Darrick J. Wong's avatar
      xfs: prevent a UAF when log IO errors race with unmount · 7561cea5
      Darrick J. Wong authored
      KASAN reported the following use after free bug when running
      generic/475:
      
       XFS (dm-0): Mounting V5 Filesystem
       XFS (dm-0): Starting recovery (logdev: internal)
       XFS (dm-0): Ending recovery (logdev: internal)
       Buffer I/O error on dev dm-0, logical block 20639616, async page read
       Buffer I/O error on dev dm-0, logical block 20639617, async page read
       XFS (dm-0): log I/O error -5
       XFS (dm-0): Filesystem has been shut down due to log error (0x2).
       XFS (dm-0): Unmounting Filesystem
       XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
       ==================================================================
       BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
       Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136
      
       CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
       Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
       Call Trace:
        <TASK>
        dump_stack_lvl+0x34/0x44
        print_report.cold+0x2b8/0x661
        ? do_raw_spin_lock+0x246/0x270
        kasan_report+0xab/0x120
        ? do_raw_spin_lock+0x246/0x270
        do_raw_spin_lock+0x246/0x270
        ? rwlock_bug.part.0+0x90/0x90
        xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
        xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
        process_one_work+0x672/0x1040
        worker_thread+0x59b/0xec0
        ? __kthread_parkme+0xc6/0x1f0
        ? process_one_work+0x1040/0x1040
        ? process_one_work+0x1040/0x1040
        kthread+0x29e/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
      
       Allocated by task 154099:
        kasan_save_stack+0x1e/0x40
        __kasan_kmalloc+0x81/0xa0
        kmem_alloc+0x8d/0x2e0 [xfs]
        xlog_cil_init+0x1f/0x540 [xfs]
        xlog_alloc_log+0xd1e/0x1260 [xfs]
        xfs_log_mount+0xba/0x640 [xfs]
        xfs_mountfs+0xf2b/0x1d00 [xfs]
        xfs_fs_fill_super+0x10af/0x1910 [xfs]
        get_tree_bdev+0x383/0x670
        vfs_get_tree+0x7d/0x240
        path_mount+0xdb7/0x1890
        __x64_sys_mount+0x1fa/0x270
        do_syscall_64+0x2b/0x80
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
       Freed by task 154151:
        kasan_save_stack+0x1e/0x40
        kasan_set_track+0x21/0x30
        kasan_set_free_info+0x20/0x30
        ____kasan_slab_free+0x110/0x190
        slab_free_freelist_hook+0xab/0x180
        kfree+0xbc/0x310
        xlog_dealloc_log+0x1b/0x2b0 [xfs]
        xfs_unmountfs+0x119/0x200 [xfs]
        xfs_fs_put_super+0x6e/0x2e0 [xfs]
        generic_shutdown_super+0x12b/0x3a0
        kill_block_super+0x95/0xd0
        deactivate_locked_super+0x80/0x130
        cleanup_mnt+0x329/0x4d0
        task_work_run+0xc5/0x160
        exit_to_user_mode_prepare+0xd4/0xe0
        syscall_exit_to_user_mode+0x1d/0x40
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This appears to be a race between the unmount process, which frees the
      CIL and waits for in-flight iclog IO; and the iclog IO completion.  When
      generic/475 runs, it starts fsstress in the background, waits a few
      seconds, and substitutes a dm-error device to simulate a disk falling
      out of a machine.  If the fsstress encounters EIO on a pure data write,
      it will exit but the filesystem will still be online.
      
      The next thing the test does is unmount the filesystem, which tries to
      clean the log, free the CIL, and wait for iclog IO completion.  If an
      iclog was being written when the dm-error switch occurred, it can race
      with log unmounting as follows:
      
      Thread 1				Thread 2
      
      					xfs_log_unmount
      					xfs_log_clean
      					xfs_log_quiesce
      xlog_ioend_work
      <observe error>
      xlog_force_shutdown
      test_and_set_bit(XLOG_IOERROR)
      					xfs_log_force
      					<log is shut down, nop>
      					xfs_log_umount_write
      					<log is shut down, nop>
      					xlog_dealloc_log
      					xlog_cil_destroy
      					<wait for iclogs>
      spin_lock(&log->l_cilp->xc_push_lock)
      <KABOOM>
      
      Therefore, free the CIL after waiting for the iclogs to complete.  I
      /think/ this race has existed for quite a few years now, though I don't
      remember the ~2014 era logging code well enough to know if it was a real
      threat then or if the actual race was exposed only more recently.
      
      Fixes: ac983517 ("xfs: don't sleep in xlog_cil_force_lsn on shutdown")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      7561cea5
  4. 29 Jun, 2022 3 commits
    • Darrick J. Wong's avatar
      xfs: dont treat rt extents beyond EOF as eofblocks to be cleared · 8944c6fb
      Darrick J. Wong authored
      On a system with a realtime volume and a 28k realtime extent,
      generic/491 fails because the test opens a file on a frozen filesystem
      and closing it causes xfs_release -> xfs_can_free_eofblocks to
      mistakenly think that the the blocks of the realtime extent beyond EOF
      are posteof blocks to be freed.  Realtime extents cannot be partially
      unmapped, so this is pointless.  Worse yet, this triggers posteof
      cleanup, which stalls on a transaction allocation, which is why the test
      fails.
      
      Teach the predicate to account for realtime extents properly.
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      8944c6fb
    • Darrick J. Wong's avatar
      xfs: don't hold xattr leaf buffers across transaction rolls · e53bcffa
      Darrick J. Wong authored
      Now that we've established (again!) that empty xattr leaf buffers are
      ok, we no longer need to bhold them to transactions when we're creating
      new leaf blocks.  Get rid of the entire mechanism, which should simplify
      the xattr code quite a bit.
      
      The original justification for using bhold here was to prevent the AIL
      from trying to write the empty leaf block into the fs during the brief
      time that we release the buffer lock.  The reason for /that/ was to
      prevent recovery from tripping over the empty ondisk block.
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      e53bcffa
    • Darrick J. Wong's avatar
      xfs: empty xattr leaf header blocks are not corruption · 7be3bd88
      Darrick J. Wong authored
      TLDR: Revert commit 51e6104f ("xfs: detect empty attr leaf blocks in
      xfs_attr3_leaf_verify") because it was wrong.
      
      Every now and then we get a corruption report from the kernel or
      xfs_repair about empty leaf blocks in the extended attribute structure.
      We've long thought that these shouldn't be possible, but prior to 5.18
      one would shake loose in the recoveryloop fstests about once a month.
      
      A new addition to the xattr leaf block verifier in 5.19-rc1 makes this
      happen every 7 minutes on my testing cloud.  I added a ton of logging to
      detect any time we set the header count on an xattr leaf block to zero.
      This produced the following dmesg output on generic/388:
      
      XFS (sda4): ino 0x21fcbaf leaf 0x129bf78 hdcount==0!
      Call Trace:
       <TASK>
       dump_stack_lvl+0x34/0x44
       xfs_attr3_leaf_create+0x187/0x230
       xfs_attr_shortform_to_leaf+0xd1/0x2f0
       xfs_attr_set_iter+0x73e/0xa90
       xfs_xattri_finish_update+0x45/0x80
       xfs_attr_finish_item+0x1b/0xd0
       xfs_defer_finish_noroll+0x19c/0x770
       __xfs_trans_commit+0x153/0x3e0
       xfs_attr_set+0x36b/0x740
       xfs_xattr_set+0x89/0xd0
       __vfs_setxattr+0x67/0x80
       __vfs_setxattr_noperm+0x6e/0x120
       vfs_setxattr+0x97/0x180
       setxattr+0x88/0xa0
       path_setxattr+0xc3/0xe0
       __x64_sys_setxattr+0x27/0x30
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      So now we know that someone is creating empty xattr leaf blocks as part
      of converting a sf xattr structure into a leaf xattr structure.  The
      conversion routine logs any existing sf attributes in the same
      transaction that creates the leaf block, so we know this is a setxattr
      to a file that has no attributes at all.
      
      Next, g/388 calls the shutdown ioctl and cycles the mount to trigger log
      recovery.  I also augmented buffer item recovery to call ->verify_struct
      on any attr leaf blocks and complain if it finds a failure:
      
      XFS (sda4): Unmounting Filesystem
      XFS (sda4): Mounting V5 Filesystem
      XFS (sda4): Starting recovery (logdev: internal)
      XFS (sda4): xattr leaf daddr 0x129bf78 hdrcount == 0!
      Call Trace:
       <TASK>
       dump_stack_lvl+0x34/0x44
       xfs_attr3_leaf_verify+0x3b8/0x420
       xlog_recover_buf_commit_pass2+0x60a/0x6c0
       xlog_recover_items_pass2+0x4e/0xc0
       xlog_recover_commit_trans+0x33c/0x350
       xlog_recovery_process_trans+0xa5/0xe0
       xlog_recover_process_data+0x8d/0x140
       xlog_do_recovery_pass+0x19b/0x720
       xlog_do_log_recovery+0x62/0xc0
       xlog_do_recover+0x33/0x1d0
       xlog_recover+0xda/0x190
       xfs_log_mount+0x14c/0x360
       xfs_mountfs+0x517/0xa60
       xfs_fs_fill_super+0x6bc/0x950
       get_tree_bdev+0x175/0x280
       vfs_get_tree+0x1a/0x80
       path_mount+0x6f5/0xaa0
       __x64_sys_mount+0x103/0x140
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7fc61e241eae
      
      And a moment later, the _delwri_submit of the recovered buffers trips
      the same verifier and recovery fails:
      
      XFS (sda4): Metadata corruption detected at xfs_attr3_leaf_verify+0x393/0x420 [xfs], xfs_attr3_leaf block 0x129bf78
      XFS (sda4): Unmount and run xfs_repair
      XFS (sda4): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      00000010: 00 00 00 00 01 29 bf 78 00 00 00 00 00 00 00 00  .....).x........
      00000020: a5 1b d0 02 b2 9a 49 df 8e 9c fb 8d f8 31 3e 9d  ......I......1>.
      00000030: 00 00 00 00 02 1f cb af 00 00 00 00 10 00 00 00  ................
      00000040: 00 50 0f b0 00 00 00 00 00 00 00 00 00 00 00 00  .P..............
      00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      XFS (sda4): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x37f/0x3b0 [xfs] (fs/xfs/xfs_buf.c:1518).  Shutting down filesystem.
      XFS (sda4): Please unmount the filesystem and rectify the problem(s)
      XFS (sda4): log mount/recovery failed: error -117
      XFS (sda4): log mount failed
      
      I think I see what's going on here -- setxattr is racing with something
      that shuts down the filesystem:
      
      Thread 1				Thread 2
      --------				--------
      xfs_attr_sf_addname
      xfs_attr_shortform_to_leaf
      <create empty leaf>
      xfs_trans_bhold(leaf)
      xattri_dela_state = XFS_DAS_LEAF_ADD
      <roll transaction>
      					<flush log>
      					<shut down filesystem>
      xfs_trans_bhold_release(leaf)
      <discover fs is dead, bail>
      
      Thread 3
      --------
      <cycle mount, start recovery>
      xlog_recover_buf_commit_pass2
      xlog_recover_do_reg_buffer
      <replay empty leaf buffer from recovered buf item>
      xfs_buf_delwri_queue(leaf)
      xfs_buf_delwri_submit
      _xfs_buf_ioapply(leaf)
      xfs_attr3_leaf_write_verify
      <trip over empty leaf buffer>
      <fail recovery>
      
      As you can see, the bhold keeps the leaf buffer locked and thus prevents
      the *AIL* from tripping over the ichdr.count==0 check in the write
      verifier.  Unfortunately, it doesn't prevent the log from getting
      flushed to disk, which sets up log recovery to fail.
      
      So.  It's clear that the kernel has always had the ability to persist
      attr leaf blocks with ichdr.count==0, which means that it's part of the
      ondisk format now.
      
      Unfortunately, this check has been added and removed multiple times
      throughout history.  It first appeared in[1] kernel 3.10 as part of the
      early V5 format patches.  The check was later discovered to break log
      recovery and hence disabled[2] during log recovery in kernel 4.10.
      Simultaneously, the check was added[3] to xfs_repair 4.9.0 to try to
      weed out the empty leaf blocks.  This was still not correct because log
      recovery would recover an empty attr leaf block successfully only for
      regular xattr operations to trip over the empty block during of the
      block during regular operation.  Therefore, the check was removed
      entirely[4] in kernel 5.7 but removal of the xfs_repair check was
      forgotten.  The continued complaints from xfs_repair lead to us
      mistakenly re-adding[5] the verifier check for kernel 5.19.  Remove it
      once again.
      
      [1] 517c2220 ("xfs: add CRCs to attr leaf blocks")
      [2] 2e1d2337 ("xfs: ignore leaf attr ichdr.count in verifier
                         during log replay")
      [3] f7140161 ("xfs_repair: junk leaf attribute if count == 0")
      [4] f28cef9e ("xfs: don't fail verifier on empty attr3 leaf
                         block")
      [5] 51e6104f ("xfs: detect empty attr leaf blocks in
                         xfs_attr3_leaf_verify")
      
      Looking at the rest of the xattr code, it seems that files with empty
      leaf blocks behave as expected -- listxattr reports no attributes;
      getxattr on any xattr returns nothing as expected; removexattr does
      nothing; and setxattr can add attributes just fine.
      
      Original-bug: 517c2220 ("xfs: add CRCs to attr leaf blocks")
      Still-not-fixed-by: 2e1d2337 ("xfs: ignore leaf attr ichdr.count in verifier during log replay")
      Removed-in: f28cef9e ("xfs: don't fail verifier on empty attr3 leaf block")
      Fixes: 51e6104f ("xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      7be3bd88
  5. 26 Jun, 2022 4 commits
    • Darrick J. Wong's avatar
      xfs: clean up the end of xfs_attri_item_recover · f94e08b6
      Darrick J. Wong authored
      The end of this function could use some cleanup -- the EAGAIN
      conditionals make it harder to figure out what's going on with the
      disposal of xattri_leaf_bp, and the dual error/ret variables aren't
      needed.  Turn the EAGAIN case into a separate block documenting all the
      subtleties of recovering in the middle of an xattr update chain, which
      makes the rest of the prologue much simpler.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      f94e08b6
    • Darrick J. Wong's avatar
      xfs: always free xattri_leaf_bp when cancelling a deferred op · b822ea17
      Darrick J. Wong authored
      While running the following fstest with logged xattrs DISabled, I
      noticed the following:
      
      # FSSTRESS_AVOID="-z -f unlink=1 -f rmdir=1 -f creat=2 -f mkdir=2 -f
      getfattr=3 -f listfattr=3 -f attr_remove=4 -f removefattr=4 -f
      setfattr=20 -f attr_set=60" ./check generic/475
      
      INFO: task u9:1:40 blocked for more than 61 seconds.
            Tainted: G           O      5.19.0-rc2-djwx #rc2
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:u9:1            state:D stack:12872 pid:   40 ppid:     2 flags:0x00004000
      Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
      Call Trace:
       <TASK>
       __schedule+0x2db/0x1110
       schedule+0x58/0xc0
       schedule_timeout+0x115/0x160
       __down_common+0x126/0x210
       down+0x54/0x70
       xfs_buf_lock+0x2d/0xe0 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       xfs_buf_item_unpin+0x227/0x3a0 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       xfs_trans_committed_bulk+0x18e/0x320 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       xlog_cil_committed+0x2ea/0x360 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       xlog_cil_push_work+0x60f/0x690 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
       process_one_work+0x1df/0x3c0
       worker_thread+0x53/0x3b0
       kthread+0xea/0x110
       ret_from_fork+0x1f/0x30
       </TASK>
      
      This appears to be the result of shortform_to_leaf creating a new leaf
      buffer as part of adding an xattr to a file.  The new leaf buffer is
      held and attached to the xfs_attr_intent structure, but then the
      filesystem shuts down.  Instead of the usual path (which adds the attr
      to the held leaf buffer which releases the hold), we instead cancel the
      entire deferred operation.
      
      Unfortunately, xfs_attr_cancel_item doesn't release any attached leaf
      buffers, so we leak the locked buffer.  The CIL cannot do anything
      about that, and hangs.  Fix this by teaching it to release leaf buffers,
      and make XFS a little more careful about not leaving a dangling
      reference.
      
      The prologue of xfs_attri_item_recover is (in this author's opinion) a
      little hard to figure out, so I'll clean that up in the next patch.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      b822ea17
    • Kaixu Xia's avatar
      xfs: use invalidate_lock to check the state of mmap_lock · 82af8806
      Kaixu Xia authored
      We should use invalidate_lock and XFS_MMAPLOCK_SHARED to check the state
      of mmap_lock rw_semaphore in xfs_isilocked(), rather than i_rwsem and
      XFS_IOLOCK_SHARED.
      
      Fixes: 2433480a ("xfs: Convert to use invalidate_lock")
      Signed-off-by: default avatarKaixu Xia <kaixuxia@tencent.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      82af8806
    • Kaixu Xia's avatar
      xfs: factor out the common lock flags assert · ca76a761
      Kaixu Xia authored
      There are similar lock flags assert in xfs_ilock(), xfs_ilock_nowait(),
      xfs_iunlock(), thus we can factor it out into a helper that is clear.
      Signed-off-by: default avatarKaixu Xia <kaixuxia@tencent.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      ca76a761
  6. 23 Jun, 2022 2 commits
    • Dave Chinner's avatar
      xfs: introduce xfs_inodegc_push() · 5e672cd6
      Dave Chinner authored
      The current blocking mechanism for pushing the inodegc queue out to
      disk can result in systems becoming unusable when there is a long
      running inodegc operation. This is because the statfs()
      implementation currently issues a blocking flush of the inodegc
      queue and a significant number of common system utilities will call
      statfs() to discover something about the underlying filesystem.
      
      This can result in userspace operations getting stuck on inodegc
      progress, and when trying to remove a heavily reflinked file on slow
      storage with a full journal, this can result in delays measuring in
      hours.
      
      Avoid this problem by adding "push" function that expedites the
      flushing of the inodegc queue, but doesn't wait for it to complete.
      
      Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this
      mechanism so they don't block but still ensure that queued
      operations are expedited.
      
      Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
      Reported-by: default avatarChris Dunlop <chris@onthe.net.au>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      [djwong: fix _getquota_next to use _inodegc_push too]
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      5e672cd6
    • Dave Chinner's avatar
      xfs: bound maximum wait time for inodegc work · 7cf2b0f9
      Dave Chinner authored
      Currently inodegc work can sit queued on the per-cpu queue until
      the workqueue is either flushed of the queue reaches a depth that
      triggers work queuing (and later throttling). This means that we
      could queue work that waits for a long time for some other event to
      trigger flushing.
      
      Hence instead of just queueing work at a specific depth, use a
      delayed work that queues the work at a bound time. We can still
      schedule the work immediately at a given depth, but we no long need
      to worry about leaving a number of items on the list that won't get
      processed until external events prevail.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      7cf2b0f9
  7. 16 Jun, 2022 3 commits
    • Darrick J. Wong's avatar
      xfs: preserve DIFLAG2_NREXT64 when setting other inode attributes · e89ab76d
      Darrick J. Wong authored
      It is vitally important that we preserve the state of the NREXT64 inode
      flag when we're changing the other flags2 fields.
      
      Fixes: 9b7d16e3 ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      e89ab76d
    • Darrick J. Wong's avatar
      xfs: fix variable state usage · 10930b25
      Darrick J. Wong authored
      The variable @args is fed to a tracepoint, and that's the only place
      it's used.  This is fine for the kernel, but for userspace, tracepoints
      are #define'd out of existence, which results in this warning on gcc
      11.2:
      
      xfs_attr.c: In function ‘xfs_attr_node_try_addname’:
      xfs_attr.c:1440:42: warning: unused variable ‘args’ [-Wunused-variable]
       1440 |         struct xfs_da_args              *args = attr->xattri_da_args;
            |                                          ^~~~
      
      Clean this up.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      10930b25
    • Darrick J. Wong's avatar
      xfs: fix TOCTOU race involving the new logged xattrs control knob · f4288f01
      Darrick J. Wong authored
      I found a race involving the larp control knob, aka the debugging knob
      that lets developers enable logging of extended attribute updates:
      
      Thread 1			Thread 2
      
      echo 0 > /sys/fs/xfs/debug/larp
      				setxattr(REPLACE)
      				xfs_has_larp (returns false)
      				xfs_attr_set
      
      echo 1 > /sys/fs/xfs/debug/larp
      
      				xfs_attr_defer_replace
      				xfs_attr_init_replace_state
      				xfs_has_larp (returns true)
      				xfs_attr_init_remove_state
      
      				<oops, wrong DAS state!>
      
      This isn't a particularly severe problem right now because xattr logging
      is only enabled when CONFIG_XFS_DEBUG=y, and developers *should* know
      what they're doing.
      
      However, the eventual intent is that callers should be able to ask for
      the assistance of the log in persisting xattr updates.  This capability
      might not be required for /all/ callers, which means that dynamic
      control must work correctly.  Once an xattr update has decided whether
      or not to use logged xattrs, it needs to stay in that mode until the end
      of the operation regardless of what subsequent parallel operations might
      do.
      
      Therefore, it is an error to continue sampling xfs_globals.larp once
      xfs_attr_change has made a decision about larp, and it was not correct
      for me to have told Allison that ->create_intent functions can sample
      the global log incompat feature bitfield to decide to elide a log item.
      
      Instead, create a new op flag for the xfs_da_args structure, and convert
      all other callers of xfs_has_larp and xfs_sb_version_haslogxattrs within
      the attr update state machine to look for the operations flag.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      f4288f01
  8. 12 Jun, 2022 3 commits