1. 30 Mar, 2022 8 commits
    • xfs: drop async cache flushes from CIL commits. · 919edbad
      Dave Chinner authored
      Jan Kara reported a performance regression in dbench that he
      bisected down to commit bad77c37 ("xfs: CIL checkpoint
      flushes caches unconditionally").
      
      Whilst developing the journal flush/fua optimisations this cache flush
      was part of, it appeared to make a significant difference to
      performance. However, now that this patchset has settled and all the
      correctness issues have been fixed, there does not appear to be any
      significant performance benefit to asynchronous cache flushes.
      
      In fact, the opposite is true on some storage types and workloads,
      where additional cache flushes that can occur from fsync-heavy
      workloads have a measurable and significant impact on overall
      throughput.
      
      Local dbench testing shows little difference on dbench runs with
      sync vs async cache flushes on either fast or slow SSD storage, and
      no difference in streaming concurrent async transaction workloads
      like fs-mark.
      
      Fast NVMe storage:
      
      From `dbench -t 30`, CIL scale:
      
      clients		async			sync
      		BW	Latency		BW	Latency
      1		 935.18   0.855		 915.64   0.903
      8		2404.51   6.873		2341.77   6.511
      16		3003.42   6.460		2931.57   6.529
      32		3697.23   7.939		3596.28   7.894
      128		7237.43  15.495		7217.74  11.588
      512		5079.24  90.587		5167.08  95.822
      
      fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
      
      	create		chown		unlink
      async   1m41s		1m16s		2m03s
      sync	1m40s		1m19s		1m54s
      
      Slower SATA SSD storage:
      
      From `dbench -t 30`, CIL scale:
      
      clients		async			sync
      		BW	Latency		BW	Latency
      1		  78.59  15.792		  83.78  10.729
      8		 367.88  92.067		 404.63  59.943
      16		 564.51  72.524		 602.71  76.089
      32		 831.66 105.984		 870.26 110.482
      128		1659.76 102.969		1624.73  91.356
      512		2135.91 223.054		2603.07 161.160
      
      fsmark, 16 threads, create w/32k logbsize
      
      	create		unlink
      async   5m06s		4m15s
      sync	5m00s		4m22s
      
      And on Jan's test machine:
      
                         5.18-rc8-vanilla       5.18-rc8-patched
      Amean     1        71.22 (   0.00%)       64.94 *   8.81%*
      Amean     2        93.03 (   0.00%)       84.80 *   8.85%*
      Amean     4       150.54 (   0.00%)      137.51 *   8.66%*
      Amean     8       252.53 (   0.00%)      242.24 *   4.08%*
      Amean     16      454.13 (   0.00%)      439.08 *   3.31%*
      Amean     32      835.24 (   0.00%)      829.74 *   0.66%*
      Amean     64     1740.59 (   0.00%)     1686.73 *   3.09%*
      
      Performance and cache flush behaviour are restored to pre-regression
      levels.
      
      As such, we can now consider the async cache flush mechanism an
      unnecessary exercise in premature optimisation and hence we can
      now remove it and the infrastructure it requires completely.
      
      Fixes: bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally")
      Reported-and-tested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: shutdown during log recovery needs to mark the log shutdown · 5652ef31
      Dave Chinner authored
      When a checkpoint writeback is run by log recovery, corruption
      propagated from the log can result in writeback verifiers failing
      and calling xfs_force_shutdown() from
      xfs_buf_delwri_submit_buffers().
      
      This results in the mount being marked as shutdown, but the log does
      not get marked as shut down because:
      
              /*
               * If this happens during log recovery then we aren't using the runtime
               * log mechanisms yet so there's nothing to shut down.
               */
              if (!log || xlog_in_recovery(log))
                      return false;
      
      If there are other buffers that then fail (say due to detecting the
      mount shutdown), they will now hang in xfs_do_force_shutdown()
      waiting for the log to shut down like this:
      
        __schedule+0x30d/0x9e0
        schedule+0x55/0xd0
        xfs_do_force_shutdown+0x1cd/0x200
        ? init_wait_var_entry+0x50/0x50
        xfs_buf_ioend+0x47e/0x530
        __xfs_buf_submit+0xb0/0x240
        xfs_buf_delwri_submit_buffers+0xfe/0x270
        xfs_buf_delwri_submit+0x3a/0xc0
        xlog_do_recovery_pass+0x474/0x7b0
        ? do_raw_spin_unlock+0x30/0xb0
        xlog_do_log_recovery+0x91/0x140
        xlog_do_recover+0x38/0x1e0
        xlog_recover+0xdd/0x170
        xfs_log_mount+0x17e/0x2e0
        xfs_mountfs+0x457/0x930
        xfs_fs_fill_super+0x476/0x830
      
      xlog_force_shutdown() always needs to mark the log as shut down,
      regardless of whether recovery is in progress or not, so that
      multiple calls to xfs_force_shutdown() during recovery don't end
      up waiting for the log to be shut down like this.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
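      The behaviour change described in the commit above can be sketched
      with a toy model. This is an illustration only, not the kernel
      code: struct toy_log and both function names are hypothetical, and
      only the early-return condition mirrors the snippet quoted in the
      commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the log; not the kernel's struct xlog. */
struct toy_log {
	bool in_recovery;
	bool shutdown;
};

/* Old behaviour: bail out during recovery, so the log is never marked. */
bool toy_log_force_shutdown_old(struct toy_log *log)
{
	if (!log || log->in_recovery)
		return false;
	if (log->shutdown)
		return false;
	log->shutdown = true;
	return true;
}

/* Described fix: mark the log shut down whether or not recovery is running. */
bool toy_log_force_shutdown_new(struct toy_log *log)
{
	if (!log)
		return false;
	if (log->shutdown)
		return false;	/* already shut down, nothing more to do */
	log->shutdown = true;
	return true;
}
```

      With the old variant a caller that waits for log->shutdown to
      become true during recovery would wait forever, which is the hang
      shown in the stack trace.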
    • xfs: xfs_trans_commit() path must check for log shutdown · 3c4cb76b
      Dave Chinner authored
      If a shutdown races with xfs_trans_commit() and we have shut down the
      filesystem but not the log, we will still cancel the transaction.
      This can result in aborting dirty log items instead of committing and
      pinning them whilst the log is still running. Hence we can end up
      with dirty, unlogged metadata that isn't in the AIL in memory that
      can be flushed to disk via writeback clustering.
      
      This was discovered from a g/388 trace where an inode log item was
      having IO completed on it and it wasn't in the AIL, hence tripping
      asserts in xfs_ail_check(). Inode cluster writeback started long after
      the filesystem shutdown started, and long after the transaction
      containing the dirty inode was aborted and the log item marked
      XFS_LI_ABORTED. The inode was seen as dirty and unpinned, so it
      was flushed. IO completion tried to remove the inode from the AIL,
      at which point stuff went bad:
      
       XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
       XFS: Assertion failed: in_ail, file: fs/xfs/xfs_trans_ail.c, line: 67
       XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
       Workqueue: xfs-buf/pmem1 xfs_buf_ioend_work
       RIP: 0010:assfail+0x27/0x2d
       Call Trace:
        <TASK>
        xfs_ail_check+0xa8/0x180
        xfs_ail_delete_one+0x3b/0xf0
        xfs_buf_inode_iodone+0x329/0x3f0
        xfs_buf_ioend+0x1f8/0x530
        xfs_buf_ioend_work+0x15/0x20
        process_one_work+0x1ac/0x390
        worker_thread+0x56/0x3c0
        kthread+0xf6/0x120
        ret_from_fork+0x1f/0x30
        </TASK>
      
      xfs_trans_commit() needs to check log state for shutdown, not mount
      state. It cannot abort dirty log items while the log is still
      running as dirty items must remain pinned in memory until they are
      either committed to the journal or the log has shut down and they
      can be safely tossed away. Hence if the log has not shut down, the
      xfs_trans_commit() path must allow completed transactions to commit
      to the CIL and pin the dirty items even if a mount shutdown has
      started.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
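      A minimal sketch of the decision the commit above describes,
      assuming a toy state with two flags. The type and function names
      are hypothetical; the two flags stand in for what
      xfs_is_shutdown() and xlog_is_shutdown() would report.

```c
#include <stdbool.h>

enum toy_commit_action {
	TOY_COMMIT_TO_CIL,	/* commit and pin the dirty items */
	TOY_ABORT_ITEMS,	/* cancel the transaction, toss dirty items */
};

struct toy_state {
	bool mount_shutdown;	/* stand-in for xfs_is_shutdown() */
	bool log_shutdown;	/* stand-in for xlog_is_shutdown() */
};

/* Problematic check: keys off the mount state. */
enum toy_commit_action toy_commit_check_mount(const struct toy_state *s)
{
	return s->mount_shutdown ? TOY_ABORT_ITEMS : TOY_COMMIT_TO_CIL;
}

/* Described fix: only abort once the log itself has shut down. */
enum toy_commit_action toy_commit_check_log(const struct toy_state *s)
{
	return s->log_shutdown ? TOY_ABORT_ITEMS : TOY_COMMIT_TO_CIL;
}
```

      In the race window (mount shut down, log still running) the first
      check aborts dirty items while the second still commits and pins
      them.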
    • xfs: xfs_do_force_shutdown needs to block racing shutdowns · 41e63621
      Dave Chinner authored
      When we call xfs_force_shutdown(), the caller often expects the
      filesystem to be completely shut down when it returns. However,
      if we have racing xfs_force_shutdown() calls, the first caller sets
      the mount shutdown flag then goes to shut down the log. The second
      caller sees the mount shutdown flag and returns immediately - it
      does not wait for the log to be shut down.
      
      Unfortunately, xfs_force_shutdown() is used in some places that
      expect it to completely shut down the filesystem before it returns
      (e.g. xfs_trans_log_inode()). As such, returning before the log has
      been shut down leaves us in a place where the transaction failed to
      complete correctly but we still call xfs_trans_commit(). This
      situation arises because xfs_trans_log_inode() does not return an
      error and instead calls xfs_force_shutdown() to ensure that the
      transaction being committed is aborted.
      
      Unfortunately, we have a race condition where xfs_trans_commit()
      needs to check xlog_is_shutdown() because it can't abort log items
      before the log is shut down, but it needs to use xfs_is_shutdown()
      because xfs_force_shutdown() does not block waiting for the log to
      shut down.
      
      To fix this conundrum, first we make all calls to
      xfs_force_shutdown() block until the log is also shut down. This
      means we can then safely use xfs_force_shutdown() as a mechanism
      that ensures the currently running transaction will be aborted by
      xfs_trans_commit() regardless of the shutdown check it uses.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
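      The blocking behaviour described in the commit above can be
      sketched in userspace. This is an assumption-laden model: it uses
      a pthread mutex and condition variable in place of whatever wait
      primitive the kernel uses, and struct toy_fs, toy_fs_init() and
      toy_force_shutdown() are hypothetical names.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy filesystem state; not the kernel's struct xfs_mount. */
struct toy_fs {
	pthread_mutex_t lock;
	pthread_cond_t log_done;
	bool mount_shutdown;
	bool log_shutdown;
};

void toy_fs_init(struct toy_fs *fs)
{
	pthread_mutex_init(&fs->lock, NULL);
	pthread_cond_init(&fs->log_done, NULL);
	fs->mount_shutdown = false;
	fs->log_shutdown = false;
}

/* Returns true if this caller performed the shutdown, false if it lost the race. */
bool toy_force_shutdown(struct toy_fs *fs)
{
	pthread_mutex_lock(&fs->lock);
	if (fs->mount_shutdown) {
		/* Lost the race: block until the winner has shut down the log. */
		while (!fs->log_shutdown)
			pthread_cond_wait(&fs->log_done, &fs->lock);
		pthread_mutex_unlock(&fs->lock);
		return false;
	}
	fs->mount_shutdown = true;
	pthread_mutex_unlock(&fs->lock);

	/* The winner would shut the log down here, outside the lock. */

	pthread_mutex_lock(&fs->lock);
	fs->log_shutdown = true;
	pthread_cond_broadcast(&fs->log_done);
	pthread_mutex_unlock(&fs->lock);
	return true;
}
```

      Whichever caller returns, both flags are set by then, so a later
      commit path sees a fully shut down filesystem regardless of which
      flag it checks.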
    • xfs: log shutdown triggers should only shut down the log · b5f17bec
      Dave Chinner authored
      We've got a mess on our hands.
      
      1. xfs_trans_commit() cannot cancel transactions because the mount is
      shut down - that causes dirty, aborted, unlogged log items to sit
      unpinned in memory and potentially get written to disk before the
      log is shut down. Hence xfs_trans_commit() can only abort
      transactions when xlog_is_shutdown() is true.
      
      2. xfs_force_shutdown() is used in places to cause the current
      modification to be aborted via xfs_trans_commit() because it may be
      impractical or impossible to cancel the transaction directly, and
      hence xfs_trans_commit() must cancel transactions when
      xfs_is_shutdown() is true in this situation. But we can't do that
      because of #1.
      
      3. Log IO errors cause log shutdowns by calling xfs_force_shutdown()
      to shut down the mount and then the log from log IO completion.
      
      4. xfs_force_shutdown() can result in a log force being issued,
      which has to wait for log IO completion before it will mark the log
      as shut down. If #3 races with some other shutdown trigger that runs
      a log force, we rely on xfs_force_shutdown() silently ignoring #3
      and avoiding shutting down the log until the failed log force
      completes.
      
      5. To ensure #2 always works, we have to ensure that
      xfs_force_shutdown() does not return until the log is shut down.
      But in the case of #4, this will result in a deadlock because the
      log IO completion will block waiting for a log force to complete
      which is blocked waiting for log IO to complete....
      
      So the very first thing we have to do here to untangle this mess is
      dissociate log shutdown triggers from mount shutdowns. We already
      have xlog_force_shutdown(), which will atomically transition the
      log to a shutdown state. Due to internal asserts it cannot be called
      multiple times, but that was done simply because the only place that
      could call it was xfs_do_force_shutdown() (i.e. the mount shutdown!)
      and that could only call it once and once only.  So the first thing
      we do is remove the asserts.
      
      We then convert all the internal log shutdown triggers to call
      xlog_force_shutdown() directly instead of xfs_force_shutdown(). This
      allows the log shutdown triggers to shut down the log without
      needing to care about mount based shutdown constraints. This means
      we shut down the log independently of the mount and the mount may
      not notice this until its next attempt to read or modify metadata.
      At that point (e.g. xfs_trans_commit()) it will see that the log is
      shut down, error out and shut down the mount.
      
      To ensure that all the unmount behaviours and asserts track
      correctly as a result of a log shutdown, propagate the shutdown up
      to the mount if it is not already set. This keeps the mount and log
      state in sync, and saves a huge amount of hassle where code fails
      because of a log shutdown but only checks for mount shutdowns and
      hence ends up doing the wrong thing. Cleaning up that mess is
      an exercise for another day.
      
      This enables us to address the other problems noted above in
      followup patches.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
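      A rough sketch of the two properties the commit above asks of the
      log shutdown path: it can be called any number of times, and it
      propagates the shutdown up to the mount. The types and the
      function name are hypothetical, and C11 atomics stand in for the
      kernel's own flag handling.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_mount {
	atomic_bool shutdown;
};

struct toy_log {
	atomic_bool shutdown;
	struct toy_mount *mount;
};

/* Returns true only for the caller that performed the transition. */
bool toy_log_force_shutdown(struct toy_log *log)
{
	/* atomic_exchange returns the previous value of the flag. */
	if (atomic_exchange(&log->shutdown, true))
		return false;	/* repeat calls are harmless, no assert */

	/* Propagate the shutdown up so mount and log state stay in sync. */
	atomic_store(&log->mount->shutdown, true);
	return true;
}
```

      A log IO error handler can call this directly without first
      going through a mount-level shutdown.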
    • xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks · cd6f79d1
      Dave Chinner authored
      Brian reported a null pointer dereference failure during unmount in
      xfs/006. He tracked the problem down to the AIL being torn down
      before a log shutdown had completed and removed all the items from
      the AIL. The failure occurred in this path while unmount was
      proceeding in another task:
      
       xfs_trans_ail_delete+0x102/0x130 [xfs]
       xfs_buf_item_done+0x22/0x30 [xfs]
       xfs_buf_ioend+0x73/0x4d0 [xfs]
       xfs_trans_committed_bulk+0x17e/0x2f0 [xfs]
       xlog_cil_committed+0x2a9/0x300 [xfs]
       xlog_cil_process_committed+0x69/0x80 [xfs]
       xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
       xlog_force_shutdown+0xdf/0x150 [xfs]
       xfs_do_force_shutdown+0x5f/0x150 [xfs]
       xlog_ioend_work+0x71/0x80 [xfs]
       process_one_work+0x1c5/0x390
       worker_thread+0x30/0x350
       kthread+0xd7/0x100
       ret_from_fork+0x1f/0x30
      
      This is processing an EIO error to a log write, and it's
      triggering a force shutdown. This causes the log to be shut down,
      and then it is running attached iclog callbacks from the shutdown
      context. That means the fs and log have already been marked as
      xfs_is_shutdown/xlog_is_shutdown and so high level code will abort
      (e.g. xfs_trans_commit(), xfs_log_force(), etc) with an error
      because of shutdown.
      
      The umount would have been blocked waiting for a log force
      completion inside xfs_log_cover() -> xfs_sync_sb(). The first thing
      that has to happen for this situation to occur is for xfs_sync_sb()
      to exit without waiting for the iclog buffer to be committed to
      disk. The above trace is the completion routine for the iclog
      buffer, and it is shutting down the filesystem.
      
      xlog_state_shutdown_callbacks() does this:
      
      {
              struct xlog_in_core     *iclog;
              LIST_HEAD(cb_list);
      
              spin_lock(&log->l_icloglock);
              iclog = log->l_iclog;
              do {
                      if (atomic_read(&iclog->ic_refcnt)) {
                              /* Reference holder will re-run iclog callbacks. */
                              continue;
                      }
                      list_splice_init(&iclog->ic_callbacks, &cb_list);
      >>>>>>           wake_up_all(&iclog->ic_write_wait);
      >>>>>>           wake_up_all(&iclog->ic_force_wait);
              } while ((iclog = iclog->ic_next) != log->l_iclog);
      
              wake_up_all(&log->l_flush_wait);
              spin_unlock(&log->l_icloglock);
      
      >>>>>>  xlog_cil_process_committed(&cb_list);
      }
      
      This wakes any thread waiting on IO completion of the iclog (in this
      case the umount log force) before shutdown processes all the pending
      callbacks.  That means the xfs_sync_sb() waiting on a sync
      transaction in xfs_log_force() on iclog->ic_force_wait will get
      woken before the callbacks attached to that iclog are run. This
      results in xfs_sync_sb() returning an error, and so unmount unblocks
      and continues to run whilst the log shutdown is still in progress.
      
      Normally this is just fine because the force waiter has nothing to
      do with AIL operations. But in the case of this unmount path, the
      log force waiter goes on to tear down the AIL because the log is now
      shut down and so nothing ever blocks it again from the wait point in
      xfs_log_cover().
      
      Hence it's a race to see who gets to the AIL first - the unmount
      code or xlog_cil_process_committed() killing the superblock buffer.
      
      To fix this, we just have to change the order of processing in
      xlog_state_shutdown_callbacks() to run the callbacks before it wakes
      any task waiting on completion of the iclog.
      Reported-by: Brian Foster <bfoster@redhat.com>
      Fixes: aad7272a ("xfs: separate out log shutdown callback processing")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
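      The ordering problem in the commit above can be modelled in a few
      lines. This sketch is single-threaded and assumes the worst-case
      interleaving, where a woken waiter tears down the AIL immediately;
      struct toy_ail and all function names are hypothetical.

```c
#include <stdbool.h>

struct toy_ail {
	int items;
	bool torn_down;
};

/* Callback work: removes items from the AIL. Returns false if it ran too late. */
bool toy_run_callbacks(struct toy_ail *ail)
{
	if (ail->torn_down)
		return false;
	ail->items = 0;
	return true;
}

/* Waking the force waiter lets unmount continue and tear down the AIL. */
void toy_wake_waiters(struct toy_ail *ail)
{
	ail->torn_down = true;
}

/* Old order: waiters are woken while callbacks are still pending. */
bool toy_shutdown_callbacks_old(struct toy_ail *ail)
{
	toy_wake_waiters(ail);
	return toy_run_callbacks(ail);
}

/* Described fix: run the callbacks first, then wake the waiters. */
bool toy_shutdown_callbacks_new(struct toy_ail *ail)
{
	bool ok = toy_run_callbacks(ail);

	toy_wake_waiters(ail);
	return ok;
}
```

      In the real code the old ordering is a race rather than a certain
      failure; the model only shows why the new ordering cannot lose.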
    • xfs: shutdown in intent recovery has non-intent items in the AIL · ab9c81ef
      Dave Chinner authored
      generic/388 triggered a failure in RUI recovery due to a corrupted
      btree record and the system then locked up hard due to a subsequent
      assert failure while holding a spinlock cancelling intents:
      
       XFS (pmem1): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_trans.c:964).  Shutting down filesystem.
       XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
       XFS: Assertion failed: !xlog_item_is_intent(lip), file: fs/xfs/xfs_log_recover.c, line: 2632
       Call Trace:
        <TASK>
        xlog_recover_cancel_intents.isra.0+0xd1/0x120
        xlog_recover_finish+0xb9/0x110
        xfs_log_mount_finish+0x15a/0x1e0
        xfs_mountfs+0x540/0x910
        xfs_fs_fill_super+0x476/0x830
        get_tree_bdev+0x171/0x270
        ? xfs_init_fs_context+0x1e0/0x1e0
        xfs_fs_get_tree+0x15/0x20
        vfs_get_tree+0x24/0xc0
        path_mount+0x304/0xba0
        ? putname+0x55/0x60
        __x64_sys_mount+0x108/0x140
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Essentially, there's dirty metadata in the AIL from intent recovery
      transactions, so when we go to cancel the remaining intents we assume
      that all objects after the first non-intent log item in the AIL are
      not intents.
      
      This is not true. Intent recovery can log new intents to continue
      the operations the original intent could not complete in a single
      transaction. The new intents are committed before they are deferred,
      which means if the CIL commits in the background they will get
      inserted into the AIL at the head.
      
      Hence if we shut down the filesystem while processing intent
      recovery, the AIL may have new intents active at the current head.
      Hence this check:
      
                      /*
                       * We're done when we see something other than an intent.
                       * There should be no intents left in the AIL now.
                       */
                      if (!xlog_item_is_intent(lip)) {
      #ifdef DEBUG
                              for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
                                      ASSERT(!xlog_item_is_intent(lip));
      #endif
                              break;
                      }
      
      in both xlog_recover_process_intents() and
      xlog_recover_cancel_intents() is simply not valid. It was valid back
      when we only had EFI/EFD intents and didn't chain intents, but it
      hasn't been valid ever since intent recovery could create and commit
      new intents.
      
      Given that crashing the mount task like this pretty much prevents
      diagnosing what went wrong that led to the initial failure that
      triggered intent cancellation, just remove the checks altogether.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
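      To make the invalid assumption in the commit above concrete, here
      is a small sketch that tests it against a toy AIL represented as
      an array ordered from tail to head. The enum and function are
      hypothetical and are not taken from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_item_type { TOY_INTENT, TOY_BUF, TOY_INODE };

/*
 * Returns true if no intent appears after the first non-intent item,
 * which is the property the removed debug check asserted.
 */
bool toy_no_intents_after_first_non_intent(const enum toy_item_type *ail,
					   size_t n)
{
	size_t i = 0;

	/* Skip the leading run of intents recovered from the log. */
	while (i < n && ail[i] == TOY_INTENT)
		i++;

	/* Any intent past this point breaks the old assumption. */
	for (; i < n; i++) {
		if (ail[i] == TOY_INTENT)
			return false;
	}
	return true;
}
```

      An AIL holding a new intent at the head, committed by intent
      recovery itself, makes this return false, which is the case the
      assert tripped over.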
    • xfs: aborting inodes on shutdown may need buffer lock · d2d7c047
      Dave Chinner authored
      Most buffer io list operations are run with the bp->b_lock held, but
      xfs_iflush_abort() can be called without the buffer lock being held
      resulting in inodes being removed from the buffer list while other
      list operations are occurring. This causes problems with corrupted
      bp->b_io_list inode lists during filesystem shutdown, leading to
      traversals that never end, double removals from the AIL, etc.
      
      Fix this by passing the buffer to xfs_iflush_abort() if we have
      it locked. If the inode is attached to the buffer, we're going to
      have to remove it from the buffer list and we'd have to get the
      buffer off the inode log item to do that anyway.
      
      If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
      then we can determine if the inode has a log item and if it is
      attached to a buffer before we do anything else. If it does have an
      attached buffer, we can lock it safely (because the inode has a
      reference to it) and then perform the inode abort.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
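      The calling convention described in the commit above, pass the
      buffer when it is already locked and otherwise let the abort path
      find and lock it, can be sketched as follows. This is a userspace
      model with a pthread mutex standing in for the buffer lock; the
      types and toy_iflush_abort() are hypothetical, and buffer
      reference counting is left out.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_buf {
	pthread_mutex_t lock;
	int attached;		/* inodes on this buffer's list */
};

struct toy_inode {
	struct toy_buf *buf;	/* NULL when not attached to a buffer */
};

/*
 * Abort a flush. Pass the buffer if the caller already holds its lock,
 * or NULL to have the function find and lock the buffer itself.
 */
void toy_iflush_abort(struct toy_inode *ip, struct toy_buf *locked_bp)
{
	struct toy_buf *bp = locked_bp ? locked_bp : ip->buf;

	if (!bp)
		return;		/* not attached, no list to touch */

	if (!locked_bp)
		pthread_mutex_lock(&bp->lock);

	/* List manipulation only ever happens with the lock held. */
	if (ip->buf == bp) {
		bp->attached--;
		ip->buf = NULL;
	}

	if (!locked_bp)
		pthread_mutex_unlock(&bp->lock);
}
```

      Either way the list update runs under the buffer lock, so it
      cannot interleave with another list walker that holds the same
      lock.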
  2. 28 Mar, 2022 5 commits
    • xfs: don't report reserved bnobt space as available · 85bcfa26
      Darrick J. Wong authored
      On a modern filesystem, we don't allow userspace to allocate blocks for
      data storage from the per-AG space reservations, the user-controlled
      reservation pool that prevents ENOSPC in the middle of internal
      operations, or the internal per-AG set-aside that prevents unwanted
      filesystem shutdowns due to ENOSPC during a bmap btree split.
      
      Since we now consider freespace btree blocks as unavailable for
      allocation for data storage, we shouldn't report those blocks via statfs
      either.  This makes the numbers that we return via the statfs f_bavail
      and f_bfree fields a more conservative estimate of actual free space.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix overfilling of reserve pool · 82be38bc
      Darrick J. Wong authored
      Due to cycling of m_sb_lock, it's possible for multiple callers of
      xfs_reserve_blocks to race at changing the pool size, subtracting blocks
      from fdblocks, and actually putting them in the pool.  The result of all
      this is that we can overfill the reserve pool to hilarious levels.
      
      xfs_mod_fdblocks, when called with a positive value, already knows how
      to take freed blocks and either fill the reserve until it's full, or put
      them in fdblocks.  Use that instead of setting m_resblks_avail directly.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
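      The fill rule the commit above relies on, freed blocks top up the
      reserve pool until it is full and only the remainder goes to
      fdblocks, can be sketched like this. struct toy_pool and
      toy_free_blocks() are hypothetical; the field names only echo the
      counters mentioned in the commit message.

```c
#include <stdint.h>

struct toy_pool {
	uint64_t fdblocks;	/* free blocks outside the pool */
	uint64_t resblks;	/* target size of the reserve pool */
	uint64_t resblks_avail;	/* blocks currently in the pool */
};

/* Return freed blocks: fill the reserve first, then add to fdblocks. */
void toy_free_blocks(struct toy_pool *p, uint64_t freed)
{
	uint64_t deficit = 0;
	uint64_t to_pool;

	/* Guard against an already overfilled pool. */
	if (p->resblks_avail < p->resblks)
		deficit = p->resblks - p->resblks_avail;

	to_pool = freed < deficit ? freed : deficit;
	p->resblks_avail += to_pool;
	p->fdblocks += freed - to_pool;
}
```

      Because the pool is only ever topped up to its target, routing
      every refill through one function like this cannot overfill it.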
    • xfs: always succeed at setting the reserve pool size · 0baa2657
      Darrick J. Wong authored
      Nowadays, xfs_mod_fdblocks will always choose to fill the reserve pool
      with freed blocks before adding to fdblocks.  Therefore, we can change
      the behavior of xfs_reserve_blocks slightly -- setting the target size
      of the pool should always succeed, since a deficiency will eventually
      be made up as blocks get freed.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: remove infinite loop when reserving free block pool · 15f04fdc
      Darrick J. Wong authored
      Infinite loops in kernel code are scary.  Calls to xfs_reserve_blocks
      should be rare (people should just use the defaults!) so we really don't
      need to try so hard.  Simplify the logic here by removing the infinite
      loop.
      
      Cc: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't include bnobt blocks when reserving free block pool · c8c56825
      Darrick J. Wong authored
      xfs_reserve_blocks controls the size of the user-visible free space
      reserve pool.  Given the difference between the current and requested
      pool sizes, it will try to reserve free space from fdblocks.  However,
      the amount requested from fdblocks is also constrained by the amount of
      space that we think xfs_mod_fdblocks will give us.  If we forget to
      subtract m_allocbt_blks before calling xfs_mod_fdblocks, it will
      return ENOSPC and we'll hang the kernel at mount due to the infinite
      loop.
      
      In commit fd43cf60, we decided that xfs_mod_fdblocks should not hand
      out the "free space" used by the free space btrees, because some portion
      of the free space btrees holds space in reserve for future btree
      expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
      of blocks that it could request from xfs_mod_fdblocks was not updated to
      include m_allocbt_blks, so if space is extremely low, the caller hangs.
      
      Fix this by creating a function to estimate the number of blocks that
      can be reserved from fdblocks, which needs to exclude the set-aside and
      m_allocbt_blks.
      
      Found by running xfs/306 (which formats a single-AG 20MB filesystem)
      with an fstests configuration that specifies a 1k blocksize and a
      specially crafted log size that will consume 7/8 of the space (17920
      blocks, specifically) in that AG.
      
      Cc: Brian Foster <bfoster@redhat.com>
      Fixes: fd43cf60 ("xfs: set aside allocation btree blocks from block reservation")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
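      The estimate described in the commit above can be written out as
      simple arithmetic. This is a sketch under stated assumptions: both
      function names are hypothetical, the parameters are plain integers
      rather than the kernel's per-CPU counters, and the clamping is one
      plausible reading of "constrained by the amount of space that we
      think xfs_mod_fdblocks will give us".

```c
#include <stdint.h>

/* Blocks that could be handed out from fdblocks; never underflows. */
uint64_t toy_fdblocks_available(uint64_t fdblocks, uint64_t set_aside,
				uint64_t allocbt_blks)
{
	uint64_t held_back = set_aside + allocbt_blks;

	return fdblocks > held_back ? fdblocks - held_back : 0;
}

/* How many blocks to ask for when growing the pool toward `request`. */
uint64_t toy_reserve_request(uint64_t resblks, uint64_t request,
			     uint64_t available)
{
	uint64_t delta;

	if (request <= resblks)
		return 0;	/* pool is already at or above the target */

	delta = request - resblks;
	return delta < available ? delta : available;
}
```

      Leaving allocbt_blks out of the first function would overstate
      what is available, which is how the original code ended up asking
      for more than it could ever get.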
  3. 21 Mar, 2022 1 commit
  4. 20 Mar, 2022 7 commits
    • xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight · 01728b44
      Dave Chinner authored
      I've been chasing a recent resurgence in generic/388 recovery
      failure and/or corruption events. The events have largely been
      uninitialised inode chunks being tripped over in log recovery
      such as:
      
       XFS (pmem1): User initiated shutdown received.
       pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
       XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
       XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
       XFS (pmem1): Unmounting Filesystem
       XFS (pmem1): Mounting V5 Filesystem
       XFS (pmem1): Starting recovery (logdev: internal)
       XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
       XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
       XFS (pmem1): Unmount and run xfs_repair
       XFS (pmem1): First 128 bytes of corrupted metadata buffer:
       00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
       XFS (pmem1): log mount/recovery failed: error -117
       XFS (pmem1): log mount failed
      
      There have been isolated random other issues, too - xfs_repair fails
      because it finds some corruption in symlink blocks, rmap
      inconsistencies, etc - but they are nowhere near as common as the
      uninitialised inode chunk failure.
      
      The problem has clearly happened at runtime before recovery has run;
      I can see the ICREATE log item in the log shortly before the
      actively recovered range of the log. This means the ICREATE was
      definitely created and written to the log, but for some reason the
      tail of the log has been moved past the ordered buffer log item that
      tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
      log moving past the ICREATE log item before the inode chunk buffer
      is written to disk.
      
      Tracing the fsstress processes that are running when the filesystem
      shut down immediately pin-pointed the problem:
      
      user shutdown marks xfs_mount as shutdown
      
               godown-213341 [008]  6398.022871: console:              [ 6397.915392] XFS (pmem1): User initiated shutdown received.
      .....
      
      aild tries to push ordered inode cluster buffer
      
        xfsaild/pmem1-213314 [001]  6398.022974: xfs_buf_trylock:      dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
        xfsaild/pmem1-213314 [001]  6398.022976: xfs_ilock_nowait:     dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
      
      xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
      calls xfs_iflush_abort() to kill writeback of the inode.
      Inode is removed from AIL, drops cluster buffer reference.
      
        xfsaild/pmem1-213314 [001]  6398.022977: xfs_ail_delete:       dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
        xfsaild/pmem1-213314 [001]  6398.022978: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
      
      .....
      
      All inodes on cluster buffer are aborted, then the cluster buffer
      itself is aborted and removed from the AIL *without writeback*:
      
      xfsaild/pmem1-213314 [001]  6398.023011: xfs_buf_error_relse:  dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_ail_delete:       dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
      
      The inode buffer was at 7/20344 when it was removed from the AIL.
      
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_item_relse:   dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
      
      .....
      
      Userspace is still running, doing stuff. An fsstress process runs
      syncfs() or sync() and we end up in sync_fs_one_sb() which issues
      a log force. This pushes on the CIL:
      
              fsstress-213322 [001]  6398.024430: xfs_fs_sync_fs:       dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
                 <...>-194402 [001]  6398.024467: kmem_alloc:           size 176 flags 0x14 caller xlog_cil_push_work+0x9f
      
      And the CIL fills up iclogs with pending changes. This picks up
      the current tail from the AIL:
      
                 <...>-194402 [001]  6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags  caller xlog_write+0x149
                 <...>-194402 [001]  6398.024498: xlog_iclog_switch:    dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags  caller xlog_state_get_iclog_space+0x37e
                 <...>-194402 [001]  6398.024521: xlog_iclog_release:   dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags  caller xlog_write+0x5f9
                 <...>-194402 [001]  6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
      
      And it moves the tail of the log to 7/21000 from 7/20344. This
      *moves the tail of the log beyond the ICREATE transaction* that was
      at 7/20344 and pinned by the inode cluster buffer that was cancelled
      above.
      
      ....
      
               godown-213341 [008]  6398.027005: xfs_force_shutdown:   dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
                godown-213341 [008]  6398.027022: console:              [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
                godown-213341 [008]  6398.030551: console:              [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
      
      And finally the log itself is now shut down, stopping all further
      writes to the log. But this is too late to prevent the corruption
      that moving the tail of the log forwards after we start cancelling
      writeback causes.
      
      The fundamental problem here is that we are using the wrong shutdown
      checks for log items. We've long conflated mount shutdown with log
      shutdown state, and I started separating that recently with the
      atomic shutdown state changes in commit b36d4651 ("xfs: make
      forced shutdown processing atomic"). The changes in that commit
      series are directly responsible for being able to diagnose this
      issue because it clearly separated mount shutdown from log shutdown.
      
      Essentially, once we start cancelling writeback of log items and
      removing them from the AIL because the filesystem is shut down, we
      *cannot* update the journal because we may have cancelled the items
      that pin the tail of the log. That moves the tail of the log
      forwards without having written the metadata back, hence we have
      corrupt in-memory state, and writing to the journal propagates that
      to the on-disk state.
      
      What commit b36d4651 makes clear is that log item state needs to
      change relative to log shutdown, not mount shutdown. IOWs, anything
      that aborts metadata writeback needs to check log shutdown state
      because log items directly affect log consistency. Having them check
      mount shutdown state introduces the above race condition where we
      cancel metadata writeback before the log shuts down.
      
      To fix this, this patch works through all log items and converts
      shutdown checks to use xlog_is_shutdown() rather than
      xfs_is_shutdown(), so that we don't start aborting metadata
      writeback before we shut off journal writes.
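
The ordering problem described above can be illustrated with a small toy model. This is only a sketch, not kernel code: every structure and function name below (toy_mount, toy_log, toy_item and the toy_* helpers) is hypothetical, and it assumes that a pushed item is either written back or aborted in a single step.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these structures are hypothetical, not the kernel's. */
struct toy_log {
	bool	shutdown;	/* journal writes have been stopped */
	int	tail_lsn;	/* oldest LSN still needed by recovery */
};

struct toy_mount {
	bool		shutdown;	/* set first on a user shutdown */
	struct toy_log	log;
};

struct toy_item {
	int	lsn;		/* LSN at which the item pins the log tail */
	bool	in_ail;
	bool	written_back;
};

/* Old behaviour: abort writeback as soon as the mount is shut down. */
void toy_push_check_mount(struct toy_mount *mp, struct toy_item *ip)
{
	assert(mp && ip);
	if (!mp->shutdown)
		ip->written_back = true;
	ip->in_ail = false;	/* removed from the AIL either way */
}

/* New behaviour: only abort writeback once the log has shut down. */
void toy_push_check_log(struct toy_mount *mp, struct toy_item *ip)
{
	assert(mp && ip);
	if (!mp->log.shutdown)
		ip->written_back = true;
	ip->in_ail = false;
}

/* A journal write moves the tail forward if the item no longer pins it. */
void toy_log_write(struct toy_log *log, const struct toy_item *ip,
		   int new_tail)
{
	if (log->shutdown)
		return;		/* no further journal updates */
	if (!ip->in_ail)
		log->tail_lsn = new_tail;
}

/* True if the tail moved past an item whose metadata never hit disk. */
bool toy_is_corrupt(const struct toy_log *log, const struct toy_item *ip)
{
	return !ip->written_back && log->tail_lsn > ip->lsn;
}
```

In this model, checking the mount flag while the log is still live lets the tail move past an item that was never written back. Checking the log flag closes that window, because the abort path only runs once journal writes have stopped.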
      
      AFAICT, this race condition is a zero day IO error handling bug in
      XFS that dates back to the introduction of XLOG_IO_ERROR,
      XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      01728b44
    • Dave Chinner's avatar
      xfs: AIL should be log centric · 8eda8721
      Dave Chinner authored
      The AIL operates purely on log items, so it is a log centric
      subsystem. Divorce it from the xfs_mount and instead have it pass
      around xlog pointers.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      8eda8721
    • Dave Chinner's avatar
      xfs: log items should have a xlog pointer, not a mount · d86142dd
      Dave Chinner authored
      Log items belong to the log, not the xfs_mount. Convert the mount
      pointer in the log item to a xlog pointer in preparation for
      upcoming log centric changes to the log items.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      d86142dd
    • Dave Chinner's avatar
      xfs: async CIL flushes need pending pushes to be made stable · 70447e0a
      Dave Chinner authored
      When the AIL tries to flush the CIL, it relies on the CIL push
      ending up on stable storage without having to wait for and
      manipulate iclog state directly. However, if there is already a
      pending CIL push when the AIL tries to flush the CIL, it won't set
      the cil->xc_push_commit_stable flag and so the CIL push will not
      actively flush the commit record iclog.
      
      generic/530 when run on a single CPU test VM can trigger this fairly
      reliably. This test exercises unlinked inode recovery, and can
      result in inodes being pinned in memory by ongoing modifications to
      the inode cluster buffer to record unlinked list modifications. As a
      result, the first inode unlinked in a buffer can pin the tail of the
      log whilst the inode cluster buffer is pinned by the current
      checkpoint that has been pushed but isn't on stable storage because
      the cil->xc_push_commit_stable flag was not set. This results in
      the log/AIL effectively deadlocking until something triggers the
      commit record iclog to be pushed to stable storage (i.e. the
      periodic log worker calling xfs_log_force()).
      
      The fix is two-fold - first we should always set the
      cil->xc_push_commit_stable when xlog_cil_flush() is called,
      regardless of whether there is already a pending push or not.
      
      Second, if the CIL is empty, we should trigger an iclog flush to
      ensure that the iclogs of the last checkpoint have actually been
      submitted to disk as that checkpoint may not have been run under
      stable completion constraints.
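
The two parts of the fix can be sketched with a toy model of the CIL flush path. This is an illustration under stated assumptions, not the kernel implementation: struct toy_cil, its fields and both functions are hypothetical, with push_commit_stable standing in for cil->xc_push_commit_stable and iclog_flushes counting explicit iclog flushes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the CIL state described above. */
struct toy_cil {
	bool	push_pending;		/* a CIL push is already queued */
	bool	push_commit_stable;	/* models cil->xc_push_commit_stable */
	int	nr_items;		/* items currently in the CIL */
	int	iclog_flushes;		/* explicit iclog flushes issued */
};

/* Before the fix: the flag is only set when this call queues the push. */
void toy_cil_flush_old(struct toy_cil *cil)
{
	assert(cil);
	if (cil->nr_items == 0 || cil->push_pending)
		return;
	cil->push_pending = true;
	cil->push_commit_stable = true;
}

/* After the fix, as described in the commit message. */
void toy_cil_flush_new(struct toy_cil *cil)
{
	assert(cil);

	/* First: always ask for stable completion, pending push or not. */
	cil->push_commit_stable = true;
	if (cil->nr_items > 0)
		cil->push_pending = true;

	/* Second: with an empty CIL, flush the previous checkpoint's iclogs. */
	if (cil->nr_items == 0)
		cil->iclog_flushes++;
}
```

With a push already pending, the old variant returns without requesting stable completion, which is the stall described above. The new variant sets the flag regardless of whether a push is pending.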
      Reported-and-tested-by: Matthew Wilcox <willy@infradead.org>
      Fixes: 0020a190 ("xfs: AIL needs asynchronous CIL forcing")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      70447e0a
    • Dave Chinner's avatar
      xfs: xfs_ail_push_all_sync() stalls when racing with updates · 941fbdfd
      Dave Chinner authored
      xfs_ail_push_all_sync() has a loop like this:
      
      while max_ail_lsn {
      	prepare_to_wait(ail_empty)
      	target = max_ail_lsn
      	wake_up(ail_task);
      	schedule()
      }
      
      Which is designed to sleep until the AIL is emptied. When
      xfs_ail_update_finish() moves the tail of the log, it does:
      
      	if (list_empty(&ailp->ail_head))
      		wake_up_all(&ailp->ail_empty);
      
      So it will only wake up the sync push waiter when the AIL goes
      empty. If, by the time the push waiter has woken, the AIL has more
      in it, it will reset the target, wake the push task and go back to
      sleep.
      
      The problem here is that if the AIL is having items added to it
      when xfs_ail_push_all_sync() is called, then they may get inserted
      into the AIL at a LSN higher than the target LSN. At this point,
      xfsaild_push() will see that the target is X, the item LSNs are
      (X+N) and skip over them, hence never pushing them out.
      
      The result of this is that the AIL will not get emptied by the AIL push
      thread, hence xfs_ail_update_finish() will never see the AIL being
      empty even if it moves the tail. Hence xfs_ail_push_all_sync() never
      gets woken and hence cannot update the push target to capture the
      items beyond the current target on the AIL.
      
      This is a TOCTOU type of issue so the way to avoid it is to not
      use the push target at all for sync pushes. We know that a sync push
      is being requested by the fact the ail_empty wait queue is active,
      hence the xfsaild can just set the target to max_ail_lsn on every
      push that we see the wait queue active. Hence we no longer will
      leave items on the AIL that are beyond the LSN sampled at the start
      of a sync push.
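
The target selection described above can be sketched as follows. This is a toy model under stated assumptions rather than the xfsaild code: struct toy_ail and both helpers are hypothetical, and sync_waiters stands in for the ail_empty wait queue being active.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy AIL: only the state needed to pick a push target. */
struct toy_ail {
	long	target;		/* push target sampled by an earlier caller */
	long	max_lsn;	/* highest LSN currently in the AIL */
	bool	sync_waiters;	/* models the ail_empty wait queue being active */
};

/* Pick the LSN up to which this push pass will process items. */
long toy_push_target(const struct toy_ail *ail)
{
	assert(ail);
	/* A sync push is waiting: ignore the stale target, push everything. */
	if (ail->sync_waiters)
		return ail->max_lsn;
	return ail->target;
}

/* An item is pushed only if its LSN is at or below the chosen target. */
bool toy_item_will_push(const struct toy_ail *ail, long item_lsn)
{
	return item_lsn <= toy_push_target(ail);
}
```

An item inserted at X+N after the target X was sampled is skipped when only the stored target is consulted. It is picked up once the presence of a sync waiter makes the push use the current maximum LSN instead.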
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      941fbdfd
    • Dave Chinner's avatar
      xfs: check buffer pin state after locking in delwri_submit · dbd0f529
      Dave Chinner authored
      AIL flushing can get stuck here:
      
      [316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds.
      [316649.007807]       Not tainted 5.17.0-rc6-dgc+ #975
      [316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [316649.011720] task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
      [316649.014112] Call Trace:
      [316649.014841]  <TASK>
      [316649.015492]  __schedule+0x30d/0x9e0
      [316649.017745]  schedule+0x55/0xd0
      [316649.018681]  io_schedule+0x4b/0x80
      [316649.019683]  xfs_buf_wait_unpin+0x9e/0xf0
      [316649.021850]  __xfs_buf_submit+0x14a/0x230
      [316649.023033]  xfs_buf_delwri_submit_buffers+0x107/0x280
      [316649.024511]  xfs_buf_delwri_submit_nowait+0x10/0x20
      [316649.025931]  xfsaild+0x27e/0x9d0
      [316649.028283]  kthread+0xf6/0x120
      [316649.030602]  ret_from_fork+0x1f/0x30
      
      in the situation where flushing gets preempted between the unpin
      check and the buffer trylock under nowait conditions:
      
      	blk_start_plug(&plug);
      	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
      		if (!wait_list) {
      			if (xfs_buf_ispinned(bp)) {
      				pinned++;
      				continue;
      			}
      Here >>>>>>
      			if (!xfs_buf_trylock(bp))
      				continue;
      
      This means submission is stuck until something else triggers a log
      force to unpin the buffer.
      
      To get onto the delwri list to begin with, the buffer pin state has
      already been checked, and hence it's relatively rare we get a race
      between flushing and encountering a pinned buffer in delwri
      submission to begin with. Further, to increase the pin count the
      buffer has to be locked, so the only way we can hit this race
      without failing the trylock is to be preempted between the pincount
      check seeing zero and the trylock being run.
      
      Hence to avoid this problem, just invert the order of trylock vs
      pin check. We shouldn't hit that many pinned buffers here, so
      optimising away the trylock for pinned buffers should not matter for
      performance at all.
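
The inverted ordering can be sketched with a toy buffer. This is an illustration only: struct toy_buf and the toy_* helpers are hypothetical and single-threaded, and the sketch assumes, as stated above, that the pin count can only increase while the buffer lock is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy buffer: a lock flag and a pin count. */
struct toy_buf {
	bool	locked;
	int	pin_count;
};

bool toy_buf_trylock(struct toy_buf *bp)
{
	if (bp->locked)
		return false;
	bp->locked = true;
	return true;
}

void toy_buf_unlock(struct toy_buf *bp)
{
	bp->locked = false;
}

/*
 * Nowait submission check with the lock taken first. Returns true if the
 * buffer may be submitted (it is left locked for IO), false if it was
 * skipped. *pinned counts buffers skipped because they were pinned.
 */
bool toy_delwri_can_submit(struct toy_buf *bp, int *pinned)
{
	assert(bp && pinned);

	if (!toy_buf_trylock(bp))
		return false;

	/* Holding the lock, the pin count can no longer increase under us. */
	if (bp->pin_count > 0) {
		toy_buf_unlock(bp);
		(*pinned)++;
		return false;
	}
	return true;
}
```

Because the pin state is sampled only after the lock is held, a buffer that passes this check cannot become pinned before submission. That removes the window marked "Here" in the excerpt above.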
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      dbd0f529
    • Dave Chinner's avatar
      xfs: log worker needs to start before intent/unlink recovery · a9a4bc8c
      Dave Chinner authored
      After 963 iterations of generic/530, it deadlocked during recovery
      on a pinned inode cluster buffer like so:
      
      XFS (pmem1): Starting recovery (logdev: internal)
      INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
            Not tainted 5.17.0-rc6-dgc+ #975
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:kworker/8:0     state:D stack:13024 pid:306037 ppid:     2 flags:0x00004000
      Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       schedule_timeout+0x114/0x160
       __down+0x99/0xf0
       down+0x5e/0x70
       xfs_buf_lock+0x36/0xf0
       xfs_buf_find+0x418/0x850
       xfs_buf_get_map+0x47/0x380
       xfs_buf_read_map+0x54/0x240
       xfs_trans_read_buf_map+0x1bd/0x490
       xfs_imap_to_bp+0x4f/0x70
       xfs_iunlink_map_ino+0x66/0xd0
       xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
       xfs_iunlink_remove_inode+0xf2/0x1d0
       xfs_inactive_ifree+0x1a3/0x900
       xfs_inode_unlink+0xcc/0x210
       xfs_inodegc_worker+0x1ac/0x2f0
       process_one_work+0x1ac/0x390
       worker_thread+0x56/0x3c0
       kthread+0xf6/0x120
       ret_from_fork+0x1f/0x30
       </TASK>
      task:mount           state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       schedule_timeout+0x114/0x160
       __down+0x99/0xf0
       down+0x5e/0x70
       xfs_buf_lock+0x36/0xf0
       xfs_buf_find+0x418/0x850
       xfs_buf_get_map+0x47/0x380
       xfs_buf_read_map+0x54/0x240
       xfs_trans_read_buf_map+0x1bd/0x490
       xfs_imap_to_bp+0x4f/0x70
       xfs_iget+0x300/0xb40
       xlog_recover_process_one_iunlink+0x4c/0x170
       xlog_recover_process_iunlinks.isra.0+0xee/0x130
       xlog_recover_finish+0x57/0x110
       xfs_log_mount_finish+0xfc/0x1e0
       xfs_mountfs+0x540/0x910
       xfs_fs_fill_super+0x495/0x850
       get_tree_bdev+0x171/0x270
       xfs_fs_get_tree+0x15/0x20
       vfs_get_tree+0x24/0xc0
       path_mount+0x304/0xba0
       __x64_sys_mount+0x108/0x140
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
       </TASK>
      task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       io_schedule+0x4b/0x80
       xfs_buf_wait_unpin+0x9e/0xf0
       __xfs_buf_submit+0x14a/0x230
       xfs_buf_delwri_submit_buffers+0x107/0x280
       xfs_buf_delwri_submit_nowait+0x10/0x20
       xfsaild+0x27e/0x9d0
       kthread+0xf6/0x120
       ret_from_fork+0x1f/0x30
      
      We have the mount process waiting on an inode cluster buffer read,
      inodegc doing unlink waiting on the same inode cluster buffer, and
      the AIL push thread blocked in writeback waiting for the inode
      cluster buffer to become unpinned.
      
      What has happened here is that the AIL push thread has raced with
      the inodegc process modifying, committing and pinning the inode
      cluster buffer in xfs_buf_delwri_submit_buffers() here:
      
      	blk_start_plug(&plug);
      	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
      		if (!wait_list) {
      			if (xfs_buf_ispinned(bp)) {
      				pinned++;
      				continue;
      			}
      Here >>>>>>
      			if (!xfs_buf_trylock(bp))
      				continue;
      
      Basically, the AIL has found the buffer wasn't pinned and got the
      lock without blocking, but then the buffer was pinned. This implies
      the processing here was pre-empted between the pin check and the
      lock, because the pin count can only be increased while holding the
      buffer locked. Hence when it has gone to submit the IO, it has
      blocked waiting for the buffer to be unpinned.
      
      With all executing threads now waiting on the buffer to be unpinned,
      we normally get out of situations like this via the background log
      worker issuing a log force which will unpin stuck buffers like
      this. But at this point in recovery, we haven't started the log
      worker. In fact, the first thing we do after processing intents and
      unlinked inodes is *start the log worker*. IOWs, we start it too
      late to have it break deadlocks like this.
      
      Avoid this and any other similar deadlock vectors in intent and
      unlinked inode recovery by starting the log worker before we recover
      intents and unlinked inodes. This part of recovery runs as though
      the filesystem is fully active, so we really should have the same
      infrastructure running as we normally do at runtime.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      a9a4bc8c
  5. 14 Mar, 2022 6 commits
    • Darrick J. Wong's avatar
      xfs: constify xfs_name_dotdot · 744e6c8a
      Darrick J. Wong authored
      The symbol xfs_name_dotdot is a global variable that the xfs codebase
      uses here and there to look up directory dotdot entries.  Currently it's
      a non-const variable, which means that it's a mutable global variable.
      So far nobody's abused this to cause problems, but let's use the
      compiler to enforce that.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      744e6c8a
    • Darrick J. Wong's avatar
      xfs: constify the name argument to various directory functions · 996b2329
      Darrick J. Wong authored
      Various directory functions do not modify their @name parameter,
      so mark it const to make that clear.  This will enable us to mark
      the global xfs_name_dotdot variable as const to prevent mischief.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      996b2329
    • Darrick J. Wong's avatar
      xfs: reserve quota for target dir expansion when renaming files · 41667260
      Darrick J. Wong authored
      XFS does not reserve quota for directory expansion when renaming
      children into a directory.  This means that we don't reject the
      expansion with EDQUOT when we're at or near a hard limit, which means
      that unprivileged userspace can use rename() to exceed quota.
      
      Rename operations don't always expand the target directory, and we allow
      a rename to proceed with no space reservation if we don't need to add a
      block to the target directory to handle the addition.  Moreover, the
      unlink operation on the source directory generally does not expand the
      directory (you'd have to free a block and then cause a btree split) and
      it's probably of little consequence to leave the corner case that
      renaming a file out of a directory can increase its size.
      
      As with link and unlink, there is a further bug in that we do not
      trigger the blockgc workers to try to clear space when we're out of
      quota.
      
      Because rename is its own special tricky animal, we'll patch xfs_rename
      directly to reserve quota to the rename transaction.  We'll leave
      cleaning up the rest of xfs_rename for the metadata directory tree
      patchset.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      41667260
    • Darrick J. Wong's avatar
      xfs: reserve quota for dir expansion when linking/unlinking files · 871b9316
      Darrick J. Wong authored
      XFS does not reserve quota for directory expansion when linking or
      unlinking children from a directory.  This means that we don't reject
      the expansion with EDQUOT when we're at or near a hard limit, which
      means that unprivileged userspace can use link()/unlink() to exceed
      quota.
      
      The fix for this is nuanced -- link operations don't always expand the
      directory, and we allow a link to proceed with no space reservation if
      we don't need to add a block to the directory to handle the addition.
      Unlink operations generally do not expand the directory (you'd have to
      free a block and then cause a btree split) and we can defer the
      directory block freeing if there is no space reservation.
      
      Moreover, there is a further bug in that we do not trigger the blockgc
      workers to try to clear space when we're out of quota.
      
      To fix both cases, create a new xfs_trans_alloc_dir function that
      allocates the transaction, locks and joins the inodes, and reserves
      quota for the directory.  If there isn't sufficient space or quota,
      we'll switch the caller to reservationless mode.  This should prevent
      quota usage overruns with the least restriction in functionality.
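
The fallback to reservationless mode can be sketched with a toy quota. This is a minimal model under stated assumptions, not the real xfs_trans_alloc_dir: struct toy_quota and all toy_* functions, parameters and return conventions are hypothetical, and only the quota side is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy quota: blocks in use against a hard limit. */
struct toy_quota {
	long	used;
	long	hard_limit;
};

/* Try to reserve nblks; fail with -EDQUOT if that would exceed the limit. */
int toy_quota_reserve(struct toy_quota *q, long nblks)
{
	if (q->used + nblks > q->hard_limit)
		return -EDQUOT;
	q->used += nblks;
	return 0;
}

/*
 * Toy transaction setup: on a reservation failure, remember the error,
 * drop the reservation to zero and carry on in reservationless mode.
 */
int toy_trans_alloc_dir(struct toy_quota *q, long *resblks, int *nospace_error)
{
	int error;

	assert(q && resblks && nospace_error);
	*nospace_error = 0;

	error = toy_quota_reserve(q, *resblks);
	if (error == -EDQUOT || error == -ENOSPC) {
		*nospace_error = error;
		*resblks = 0;
		return 0;
	}
	return error;
}

/* A directory update that needs a new block fails without a reservation. */
int toy_dir_update(long resblks, bool needs_block, int nospace_error)
{
	if (needs_block && resblks == 0)
		return nospace_error;
	return 0;
}
```

In this sketch, an update that fits in existing directory blocks still succeeds at the quota limit. Only the expansion case is rejected, which matches the stated goal of preventing overruns with the least restriction in functionality.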
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      871b9316
    • Darrick J. Wong's avatar
      xfs: refactor user/group quota chown in xfs_setattr_nonsize · dd3b015d
      Darrick J. Wong authored
      Combine if tests to reduce the indentation levels of the quota chown
      calls in xfs_setattr_nonsize.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      dd3b015d
    • Darrick J. Wong's avatar
      xfs: use setattr_copy to set vfs inode attributes · e014f37d
      Darrick J. Wong authored
      Filipe Manana pointed out that XFS' behavior w.r.t. setuid/setgid
      revocation isn't consistent with btrfs[1] or ext4.  Those two
      filesystems use the VFS function setattr_copy to convey certain
      attributes from struct iattr into the VFS inode structure.
      
      Andrey Zhadchenko reported[2] that XFS uses the wrong user namespace to
      decide if it should clear setgid and setuid on a file attribute update.
      This is a second symptom of the problem that Filipe noticed.
      
      XFS, on the other hand, open-codes setattr_copy in xfs_setattr_mode,
      xfs_setattr_nonsize, and xfs_setattr_time.  Regrettably, setattr_copy is
      /not/ a simple copy function; it contains additional logic to clear the
      setgid bit when setting the mode, and XFS' version no longer matches.
      
      The VFS implements its own setuid/setgid stripping logic, which
      establishes consistent behavior.  It's a tad unfortunate that it's
      scattered across notify_change, should_remove_suid, and setattr_copy but
      XFS should really follow the Linux VFS.  Adapt XFS to use the VFS
      functions and get rid of the old functions.
      
      [1] https://lore.kernel.org/fstests/CAL3q7H47iNQ=Wmk83WcGB-KBJVOEtR9+qGczzCeXJ9Y2KCV25Q@mail.gmail.com/
      [2] https://lore.kernel.org/linux-xfs/20220221182218.748084-1-andrey.zhadchenko@virtuozzo.com/
      
      Fixes: 7fa294c8 ("userns: Allow chown and setgid preservation")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      e014f37d
  6. 09 Mar, 2022 2 commits
  7. 27 Feb, 2022 4 commits
  8. 26 Feb, 2022 7 commits