1. 20 Mar, 2022 3 commits
    • xfs: xfs_ail_push_all_sync() stalls when racing with updates · 941fbdfd
      Dave Chinner authored
      xfs_ail_push_all_sync() has a loop like this:
      
      while max_ail_lsn {
      	prepare_to_wait(ail_empty)
      	target = max_ail_lsn
      	wake_up(ail_task);
      	schedule()
      }
      
      This is designed to sleep until the AIL is emptied. When
      xfs_ail_update_finish() moves the tail of the log, it does:
      
      	if (list_empty(&ailp->ail_head))
      		wake_up_all(&ailp->ail_empty);
      
      So it will only wake up the sync push waiter when the AIL goes
      empty. If, by the time the push waiter has woken, the AIL has more
      in it, it will reset the target, wake the push task and go back to
      sleep.
      
      The problem here is that if the AIL is having items added to it
      when xfs_ail_push_all_sync() is called, then they may get inserted
      into the AIL at an LSN higher than the target LSN. At this point,
      xfsaild_push() will see that the target is X, the item LSNs are
      (X+N) and skip over them, hence never pushing them out.
      
      The result of this is that the AIL will not get emptied by the AIL
      push thread, hence xfs_ail_update_finish() will never see the AIL
      being empty even if it moves the tail. Hence xfs_ail_push_all_sync()
      never gets woken and hence cannot update the push target to capture
      the items beyond the current target LSN.
      
      This is a TOCTOU type of issue so the way to avoid it is to not
      use the push target at all for sync pushes. We know that a sync push
      is being requested by the fact the ail_empty wait queue is active,
      hence the xfsaild can just set the target to max_ail_lsn on every
      push where it sees the wait queue active. Hence we will no longer
      leave items on the AIL that are beyond the LSN sampled at the start
      of a sync push.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
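The stall described in the commit above can be illustrated with a small userspace model. This is only a sketch: `struct ail_model`, `ail_max_lsn()` and `ail_items_skipped()` are hypothetical names invented for illustration, not kernel code. It assumes a push pass skips every item whose LSN is above the target, and that the fix re-samples the target from the current maximum LSN whenever a sync waiter is active.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an AIL: just an array of item LSNs. */
struct ail_model {
	const unsigned long *item_lsns;
	size_t nr_items;
	unsigned long target;     /* push target sampled earlier by the sync pusher */
	bool sync_waiter_active;  /* models "the ail_empty wait queue is active" */
};

/* Highest LSN currently on the modelled AIL, or 0 if it is empty. */
static unsigned long ail_max_lsn(const struct ail_model *ail)
{
	unsigned long max = 0;

	for (size_t i = 0; i < ail->nr_items; i++)
		if (ail->item_lsns[i] > max)
			max = ail->item_lsns[i];
	return max;
}

/*
 * Count the items one push pass would leave behind. With use_fix set, an
 * active sync waiter makes the pusher ignore the stored (possibly stale)
 * target and use the current maximum LSN instead.
 */
static size_t ail_items_skipped(const struct ail_model *ail, bool use_fix)
{
	unsigned long target = ail->target;
	size_t skipped = 0;

	if (use_fix && ail->sync_waiter_active)
		target = ail_max_lsn(ail);

	for (size_t i = 0; i < ail->nr_items; i++)
		if (ail->item_lsns[i] > target)
			skipped++;
	return skipped;
}
```

With a target of 20 and items at LSNs 10, 20, 35 and 40, the unfixed pass leaves two items behind on every iteration, so the modelled AIL never empties; the fixed pass leaves none.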
    • xfs: check buffer pin state after locking in delwri_submit · dbd0f529
      Dave Chinner authored
      AIL flushing can get stuck here:
      
      [316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds.
      [316649.007807]       Not tainted 5.17.0-rc6-dgc+ #975
      [316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [316649.011720] task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
      [316649.014112] Call Trace:
      [316649.014841]  <TASK>
      [316649.015492]  __schedule+0x30d/0x9e0
      [316649.017745]  schedule+0x55/0xd0
      [316649.018681]  io_schedule+0x4b/0x80
      [316649.019683]  xfs_buf_wait_unpin+0x9e/0xf0
      [316649.021850]  __xfs_buf_submit+0x14a/0x230
      [316649.023033]  xfs_buf_delwri_submit_buffers+0x107/0x280
      [316649.024511]  xfs_buf_delwri_submit_nowait+0x10/0x20
      [316649.025931]  xfsaild+0x27e/0x9d0
      [316649.028283]  kthread+0xf6/0x120
      [316649.030602]  ret_from_fork+0x1f/0x30
      
      in the situation where flushing gets preempted between the unpin
      check and the buffer trylock under nowait conditions:
      
      	blk_start_plug(&plug);
      	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
      		if (!wait_list) {
      			if (xfs_buf_ispinned(bp)) {
      				pinned++;
      				continue;
      			}
      Here >>>>>>
      			if (!xfs_buf_trylock(bp))
      				continue;
      
      This means submission is stuck until something else triggers a log
      force to unpin the buffer.
      
      To get onto the delwri list in the first place, the buffer pin state
      has already been checked, and hence it's relatively rare that we get
      a race between flushing and encountering a pinned buffer in delwri
      submission. Further, to increase the pin count the
      buffer has to be locked, so the only way we can hit this race
      without failing the trylock is to be preempted between the pincount
      check seeing zero and the trylock being run.
      
      Hence to avoid this problem, just invert the order of trylock vs
      pin check. We shouldn't hit that many pinned buffers here, so
      optimising away the trylock for pinned buffers should not matter for
      performance at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
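The reordering described in the commit above (trylock first, pin check second) can be sketched in userspace as follows. `struct buf_model`, `buf_trylock()`, `buf_unlock()` and `submit_nowait()` are hypothetical stand-ins, not the kernel functions; the sketch only assumes, as the commit message states, that the pin count cannot rise while the lock is held.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer: only the state the race depends on. */
struct buf_model {
	int pin_count;
	bool locked;
};

static bool buf_trylock(struct buf_model *bp)
{
	if (bp->locked)
		return false;
	bp->locked = true;
	return true;
}

static void buf_unlock(struct buf_model *bp)
{
	bp->locked = false;
}

/*
 * Nowait submission pass with the inverted order: take the lock first,
 * then test the pin count. Because pinning requires the lock, a buffer
 * seen as unpinned here cannot become pinned before it is submitted.
 * Returns the number of buffers submitted; *pinned counts skipped ones.
 */
static size_t submit_nowait(struct buf_model *bufs, size_t n, size_t *pinned)
{
	size_t submitted = 0;

	*pinned = 0;
	for (size_t i = 0; i < n; i++) {
		struct buf_model *bp = &bufs[i];

		if (!buf_trylock(bp))
			continue;          /* someone else holds it, skip */
		if (bp->pin_count > 0) {
			buf_unlock(bp);    /* pinned: drop the lock and skip */
			(*pinned)++;
			continue;
		}
		submitted++;               /* stays locked, as if I/O were in flight */
	}
	return submitted;
}
```

The cost of this ordering is one extra lock/unlock pair for each pinned buffer, which the commit message argues is negligible because pinned buffers are rare on this path.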
    • xfs: log worker needs to start before intent/unlink recovery · a9a4bc8c
      Dave Chinner authored
      After 963 iterations of generic/530, it deadlocked during recovery
      on a pinned inode cluster buffer like so:
      
      XFS (pmem1): Starting recovery (logdev: internal)
      INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
            Not tainted 5.17.0-rc6-dgc+ #975
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:kworker/8:0     state:D stack:13024 pid:306037 ppid:     2 flags:0x00004000
      Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       schedule_timeout+0x114/0x160
       __down+0x99/0xf0
       down+0x5e/0x70
       xfs_buf_lock+0x36/0xf0
       xfs_buf_find+0x418/0x850
       xfs_buf_get_map+0x47/0x380
       xfs_buf_read_map+0x54/0x240
       xfs_trans_read_buf_map+0x1bd/0x490
       xfs_imap_to_bp+0x4f/0x70
       xfs_iunlink_map_ino+0x66/0xd0
       xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
       xfs_iunlink_remove_inode+0xf2/0x1d0
       xfs_inactive_ifree+0x1a3/0x900
       xfs_inode_unlink+0xcc/0x210
       xfs_inodegc_worker+0x1ac/0x2f0
       process_one_work+0x1ac/0x390
       worker_thread+0x56/0x3c0
       kthread+0xf6/0x120
       ret_from_fork+0x1f/0x30
       </TASK>
      task:mount           state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       schedule_timeout+0x114/0x160
       __down+0x99/0xf0
       down+0x5e/0x70
       xfs_buf_lock+0x36/0xf0
       xfs_buf_find+0x418/0x850
       xfs_buf_get_map+0x47/0x380
       xfs_buf_read_map+0x54/0x240
       xfs_trans_read_buf_map+0x1bd/0x490
       xfs_imap_to_bp+0x4f/0x70
       xfs_iget+0x300/0xb40
       xlog_recover_process_one_iunlink+0x4c/0x170
       xlog_recover_process_iunlinks.isra.0+0xee/0x130
       xlog_recover_finish+0x57/0x110
       xfs_log_mount_finish+0xfc/0x1e0
       xfs_mountfs+0x540/0x910
       xfs_fs_fill_super+0x495/0x850
       get_tree_bdev+0x171/0x270
       xfs_fs_get_tree+0x15/0x20
       vfs_get_tree+0x24/0xc0
       path_mount+0x304/0xba0
       __x64_sys_mount+0x108/0x140
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
       </TASK>
      task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       io_schedule+0x4b/0x80
       xfs_buf_wait_unpin+0x9e/0xf0
       __xfs_buf_submit+0x14a/0x230
       xfs_buf_delwri_submit_buffers+0x107/0x280
       xfs_buf_delwri_submit_nowait+0x10/0x20
       xfsaild+0x27e/0x9d0
       kthread+0xf6/0x120
       ret_from_fork+0x1f/0x30
      
      We have the mount process waiting on an inode cluster buffer read,
      inodegc doing unlink waiting on the same inode cluster buffer, and
      the AIL push thread blocked in writeback waiting for the inode
      cluster buffer to become unpinned.
      
      What has happened here is that the AIL push thread has raced with
      the inodegc process modifying, committing and pinning the inode
      cluster buffer in xfs_buf_delwri_submit_buffers() here:
      
      	blk_start_plug(&plug);
      	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
      		if (!wait_list) {
      			if (xfs_buf_ispinned(bp)) {
      				pinned++;
      				continue;
      			}
      Here >>>>>>
      			if (!xfs_buf_trylock(bp))
      				continue;
      
      Basically, the AIL has found the buffer wasn't pinned and got the
      lock without blocking, but then the buffer was pinned. This implies
      the processing here was pre-empted between the pin check and the
      lock, because the pin count can only be increased while holding the
      buffer locked. Hence when it has gone to submit the IO, it has
      blocked waiting for the buffer to be unpinned.
      
      With all executing threads now waiting on the buffer to be unpinned,
      we normally get out of situations like this via the background log
      worker issuing a log force which will unpin stuck buffers like
      this. But at this point in recovery, we haven't started the log
      worker. In fact, the first thing we do after processing intents and
      unlinked inodes is *start the log worker*. IOWs, we start it too
      late to have it break deadlocks like this.
      
      Avoid this and any other similar deadlock vectors in intent and
      unlinked inode recovery by starting the log worker before we recover
      intents and unlinked inodes. This part of recovery runs as though
      the filesystem is fully active, so we really should have the same
      infrastructure running as we normally do at runtime.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  2. 14 Mar, 2022 6 commits
    • xfs: constify xfs_name_dotdot · 744e6c8a
      Darrick J. Wong authored
      The symbol xfs_name_dotdot is a global variable that the xfs codebase
      uses here and there to look up directory dotdot entries.  Currently it's
      a non-const variable, which means that it's a mutable global variable.
      So far nobody's abused this to cause problems, but let's use the
      compiler to enforce that.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
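A minimal sketch of what constifying a global name and its consumers buys. The `struct name_model`, `name_dotdot` and `name_equal()` below are simplified, hypothetical stand-ins and not the actual XFS definitions. Once the global is `const`, every function that receives it must take a `const` pointer, and any attempted write through it fails at compile time.

```c
#include <string.h>

/* Hypothetical, trimmed-down name structure for illustration only. */
struct name_model {
	const unsigned char *name;
	int len;
};

/* const: any store through this symbol is now rejected by the compiler. */
static const struct name_model name_dotdot = {
	.name = (const unsigned char *)"..",
	.len = 2,
};

/*
 * The parameters are const pointers, so the function can be handed
 * name_dotdot without a cast and cannot modify either argument.
 */
static int name_equal(const struct name_model *a, const struct name_model *b)
{
	return a->len == b->len &&
	       memcmp(a->name, b->name, (size_t)a->len) == 0;
}
```

A statement such as `name_dotdot.len = 3;` would no longer compile, which is the kind of enforcement the commit message asks the compiler to provide.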
    • xfs: constify the name argument to various directory functions · 996b2329
      Darrick J. Wong authored
      Various directory functions do not modify their @name parameter,
      so mark it const to make that clear.  This will enable us to mark
      the global xfs_name_dotdot variable as const to prevent mischief.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: reserve quota for target dir expansion when renaming files · 41667260
      Darrick J. Wong authored
      XFS does not reserve quota for directory expansion when renaming
      children into a directory.  This means that we don't reject the
      expansion with EDQUOT when we're at or near a hard limit, which means
      that unprivileged userspace can use rename() to exceed quota.
      
      Rename operations don't always expand the target directory, and we allow
      a rename to proceed with no space reservation if we don't need to add a
      block to the target directory to handle the addition.  Moreover, the
      unlink operation on the source directory generally does not expand the
      directory (you'd have to free a block and then cause a btree split) and
      it's probably of little consequence to leave the corner case that
      renaming a file out of a directory can increase its size.
      
      As with link and unlink, there is a further bug in that we do not
      trigger the blockgc workers to try to clear space when we're out of
      quota.
      
      Because rename is its own special tricky animal, we'll patch xfs_rename
      directly to reserve quota for the rename transaction.  We'll leave
      cleaning up the rest of xfs_rename for the metadata directory tree
      patchset.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: reserve quota for dir expansion when linking/unlinking files · 871b9316
      Darrick J. Wong authored
      XFS does not reserve quota for directory expansion when linking or
      unlinking children from a directory.  This means that we don't reject
      the expansion with EDQUOT when we're at or near a hard limit, which
      means that unprivileged userspace can use link()/unlink() to exceed
      quota.
      
      The fix for this is nuanced -- link operations don't always expand the
      directory, and we allow a link to proceed with no space reservation if
      we don't need to add a block to the directory to handle the addition.
      Unlink operations generally do not expand the directory (you'd have to
      free a block and then cause a btree split) and we can defer the
      directory block freeing if there is no space reservation.
      
      Moreover, there is a further bug in that we do not trigger the blockgc
      workers to try to clear space when we're out of quota.
      
      To fix both cases, create a new xfs_trans_alloc_dir function that
      allocates the transaction, locks and joins the inodes, and reserves
      quota for the directory.  If there isn't sufficient space or quota,
      we'll switch the caller to reservationless mode.  This should prevent
      quota usage overruns with the least restriction in functionality.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
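The "reserve if possible, otherwise fall back to reservationless mode" behaviour described in the commit above can be modelled roughly as follows. All names here (`struct quota_model`, `quota_reserve()`, `dir_update_allowed()`) are hypothetical and the model ignores locking, transactions and free-space accounting; it only shows that an update without a reservation is allowed to proceed when it does not need a new directory block.

```c
#include <stdbool.h>

/* Hypothetical quota state: blocks used against a hard limit. */
struct quota_model {
	unsigned long used;
	unsigned long hard_limit;
};

/* Try to reserve blocks; refuse if that would exceed the hard limit. */
static bool quota_reserve(struct quota_model *q, unsigned long blocks)
{
	if (q->used > q->hard_limit || blocks > q->hard_limit - q->used)
		return false;
	q->used += blocks;
	return true;
}

/*
 * Model of a link/unlink style directory update. resblks is the worst-case
 * reservation; new_blocks is how many blocks the update really needs.
 * If the reservation fails, the update runs in reservationless mode and
 * is only allowed when it does not have to add a directory block.
 */
static bool dir_update_allowed(struct quota_model *q, unsigned long resblks,
			       unsigned long new_blocks)
{
	bool reserved = quota_reserve(q, resblks);

	if (!reserved && new_blocks > 0)
		return false;   /* would expand the directory past quota */
	return true;
}
```

In this model a user at 98 of 100 blocks can still link a file into a directory that has room, but cannot grow the directory, which matches the stated goal of preventing overruns with the least restriction in functionality.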
    • xfs: refactor user/group quota chown in xfs_setattr_nonsize · dd3b015d
      Darrick J. Wong authored
      Combine if tests to reduce the indentation levels of the quota chown
      calls in xfs_setattr_nonsize.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
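The kind of refactoring the commit above describes, combining nested tests into one condition, looks roughly like this. The function and parameter names are hypothetical and do not come from xfs_setattr_nonsize; the sketch only shows that the two forms make the same number of calls.

```c
#include <stdbool.h>

/* Before: two nested tests, one extra indentation level per test. */
static int chown_calls_nested(bool id_changed, bool quota_on)
{
	int calls = 0;

	if (id_changed) {
		if (quota_on) {
			calls++;   /* stands in for the quota chown call */
		}
	}
	return calls;
}

/* After: the same logic with the tests combined into one condition. */
static int chown_calls_combined(bool id_changed, bool quota_on)
{
	int calls = 0;

	if (quota_on && id_changed)
		calls++;           /* stands in for the quota chown call */
	return calls;
}
```

Both tests are free of side effects here, so their order inside the combined condition does not change behaviour.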
    • xfs: use setattr_copy to set vfs inode attributes · e014f37d
      Darrick J. Wong authored
      Filipe Manana pointed out that XFS' behavior w.r.t. setuid/setgid
      revocation isn't consistent with btrfs[1] or ext4.  Those two
      filesystems use the VFS function setattr_copy to convey certain
      attributes from struct iattr into the VFS inode structure.
      
      Andrey Zhadchenko reported[2] that XFS uses the wrong user namespace to
      decide if it should clear setgid and setuid on a file attribute update.
      This is a second symptom of the problem that Filipe noticed.
      
      XFS, on the other hand, open-codes setattr_copy in xfs_setattr_mode,
      xfs_setattr_nonsize, and xfs_setattr_time.  Regrettably, setattr_copy is
      /not/ a simple copy function; it contains additional logic to clear the
      setgid bit when setting the mode, and XFS' version no longer matches.
      
      The VFS implements its own setuid/setgid stripping logic, which
      establishes consistent behavior.  It's a tad unfortunate that it's
      scattered across notify_change, should_remove_suid, and setattr_copy but
      XFS should really follow the Linux VFS.  Adapt XFS to use the VFS
      functions and get rid of the old functions.
      
      [1] https://lore.kernel.org/fstests/CAL3q7H47iNQ=Wmk83WcGB-KBJVOEtR9+qGczzCeXJ9Y2KCV25Q@mail.gmail.com/
      [2] https://lore.kernel.org/linux-xfs/20220221182218.748084-1-andrey.zhadchenko@virtuozzo.com/
      
      Fixes: 7fa294c8 ("userns: Allow chown and setgid preservation")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
  3. 09 Mar, 2022 2 commits
  4. 27 Feb, 2022 4 commits
  5. 26 Feb, 2022 22 commits
  6. 25 Feb, 2022 3 commits