1. 16 Aug, 2021 14 commits
    • xfs: order CIL checkpoint start records · 68a74dca
      Dave Chinner authored
      Because log recovery depends on strictly ordered start records as
      well as strictly ordered commit records.
      
      This is a zero day bug in the way XFS writes pipelined transactions
      to the journal which is exposed by fixing the zero day bug that
      prevents the CIL from pipelining checkpoints. This re-introduces
      explicit concurrent commits back into the on-disk journal and hence
      out of order start records.
      
      The XFS journal commit code has never ordered start records and we
      have relied on strict commit record ordering for correct recovery
      ordering of concurrently written transactions. Unfortunately, root
      cause analysis uncovered the fact that log recovery uses the LSN of
      the start record for transaction commit processing. Hence, whilst
      the commits are processed in strict order by recovery, the LSNs
      associated with the commits can be out of order and so recovery may
      stamp incorrect LSNs into objects and/or misorder intents in the AIL
      for later processing. This can result in log recovery failures
      and/or on disk corruption, sometimes silent.
      
      Because this is a long standing log recovery issue, we can't just
      fix log recovery and call it good. This still leaves older kernels
      susceptible to recovery failures and corruption when replaying a log
      from a kernel that pipelines checkpoints. There is also the issue
      that in-memory ordering for AIL pushing and data integrity
      operations are based on checkpoint start LSNs, and if the start LSN
      is incorrect in the journal, it is also incorrect in memory.
      
      Hence there's really only one choice for fixing this zero-day bug:
      we need to strictly order checkpoint start records in ascending
      sequence order in the log, the same way we already strictly order
      commit records.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state() · caa80090
      Dave Chinner authored
      Now that we have a mechanism to guarantee that the callbacks
      attached to an iclog are owned by the context that attaches them
      until they drop their reference to the iclog via
      xlog_state_release_iclog(), we can attach callbacks to the iclog at
      any time we have an active reference to the iclog.
      
      xlog_state_get_iclog_space() always guarantees that the commit
      record will fit in the iclog it returns, so we can move this IO
      callback setting to xlog_cil_set_ctx_write_state(), record the
      commit iclog in the context and remove the need for the commit iclog
      to be returned by xlog_write() altogether.
      
      This, in turn, allows us to move the wakeup for ordered commit
      record writes up into xlog_cil_set_ctx_write_state(), too, because
      we have been guaranteed that this commit record will be physically
      located in the iclog before any waiting commit record at a higher
      sequence number will be granted iclog space.
      
      This further cleans up the post commit record write processing in
      the CIL push code, especially as xlog_state_release_iclog() will now
      clean up the context when shutdown errors occur.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: factor out log write ordering from xlog_cil_push_work() · bf034bc8
      Dave Chinner authored
      So we can use it for start record ordering as well as commit record
      ordering in future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass a CIL context to xlog_write() · c45aba40
      Dave Chinner authored
      Pass the CIL context to xlog_write() rather than a pointer to an LSN
      variable. Only the CIL checkpoint calls to xlog_write() need to know
      about the start LSN of the writes, so rework xlog_write to directly
      write the LSNs into the CIL context structure.
      
      This removes the commit_lsn variable from xlog_cil_push_work(), so
      now we only have to issue the commit record ordering wakeup from
      there.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: move xlog_commit_record to xfs_log_cil.c · 2ce82b72
      Dave Chinner authored
      It is only used by the CIL checkpoints, and is the counterpart to
      start record formatting and writing that is already local to
      xfs_log_cil.c.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: log head and tail aren't reliable during shutdown · 2562c322
      Dave Chinner authored
      I'm seeing assert failures from xlog_space_left() after a shutdown
      has begun that look like:
      
      XFS (dm-0): log I/O error -5
      XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1338 of file fs/xfs/xfs_log.c. Return address = xlog_ioend_work+0x64/0xc0
      XFS (dm-0): Log I/O Error Detected.
      XFS (dm-0): Shutting down filesystem. Please unmount the filesystem and rectify the problem(s)
      XFS (dm-0): xlog_space_left: head behind tail
      XFS (dm-0):   tail_cycle = 6, tail_bytes = 2706944
      XFS (dm-0):   GH   cycle = 6, GH   bytes = 1633867
      XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 1310
      ------------[ cut here ]------------
      Call Trace:
       xlog_space_left+0xc3/0x110
       xlog_grant_push_threshold+0x3f/0xf0
       xlog_grant_push_ail+0x12/0x40
       xfs_log_reserve+0xd2/0x270
       ? __might_sleep+0x4b/0x80
       xfs_trans_reserve+0x18b/0x260
      .....
      
      There are two things here. Firstly, after a shutdown, the log head
      and tail can be out of whack as things abort and release (or don't
      release) resources, so checking them for sanity doesn't make much
      sense. Secondly, xfs_log_reserve() can race with shutdown and so it
      can still fail like this even though it has already checked for a
      log shutdown before calling xlog_grant_push_ail().
      
      So, before ASSERT failing in xlog_space_left(), make sure we haven't
      already shut down....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: don't run shutdown callbacks on active iclogs · 502a01fa
      Dave Chinner authored
      When the log is shut down, it currently walks all the iclogs and runs
      callbacks that are attached to the iclogs, regardless of whether the
      iclog is queued for IO completion or not. This creates a problem for
      contexts attaching callbacks to iclogs in that a racing shutdown can
      run the callbacks even before the attaching context has finished
      processing the iclog and releasing it for IO submission.
      
      If the callback processing of the iclog frees the structure that is
      attached to the iclog, then this leads to a UAF scenario that can
      only be protected against by holding the icloglock from the point
      callbacks are attached through to the release of the iclog. While we
      currently do this, it is not practical or sustainable.
      
      Hence we need to make shutdown processing the responsibility of the
      context that holds active references to the iclog. We know that the
      contexts attaching callbacks to the iclog must have active
      references to the iclog, and that means they must be in either
      ACTIVE or WANT_SYNC states. xlog_state_do_callback() will skip over
      iclogs in these states -except- when the log is shut down.
      
      xlog_state_do_callback() checks the state of the iclogs while
      holding the icloglock, therefore the reference count/state change
      that occurs in xlog_state_release_iclog() after the callbacks are
      attached is atomic w.r.t. shutdown processing.
      
      We can't push the responsibility of callback cleanup onto the CIL
      context because we can have ACTIVE iclogs that have callbacks
      attached that have already been released. Hence we really need to
      internalise the cleanup of callbacks into xlog_state_release_iclog()
      processing.
      
      Indeed, we already have that internalisation via:
      
      xlog_state_release_iclog
        drop last reference
          ->SYNCING
        xlog_sync
          xlog_write_iclog
            if (log_is_shutdown)
              xlog_state_done_syncing()
      	  xlog_state_do_callback()
      	    <process shutdown on iclog that is now in SYNCING state>
      
      The problem is that xlog_state_release_iclog() aborts before doing
      anything if the log is already shut down. It assumes that the
      callbacks have already been cleaned up, and it doesn't need to do
      any cleanup.
      
      Hence the fix is to remove the xlog_is_shutdown() check from
      xlog_state_release_iclog() so that reference counts are correctly
      released from the iclogs, and when the reference count is zero we
      always transition to SYNCING if the log is shut down. Hence we'll
      always enter the xlog_sync() path in a shutdown and eventually end
      up erroring out the iclog IO and running xlog_state_do_callback() to
      process the callbacks attached to the iclog.
      
      This allows us to stop processing referenced ACTIVE/WANT_SYNC iclogs
      directly in the shutdown code, and in doing so gets rid of the UAF
      vector that currently exists. This then decouples the adding of
      callbacks to the iclogs from xlog_state_release_iclog() as we
      guarantee that xlog_state_release_iclog() will process the callbacks
      if the log has been shut down before xlog_state_release_iclog() has
      been called.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: separate out log shutdown callback processing · aad7272a
      Dave Chinner authored
      The iclog callback processing done during a forced log shutdown has
      different logic to normal runtime IO completion callback processing.
      Separate out the shutdown callbacks into their own function and call
      that from the shutdown code instead.
      
      We don't need this shutdown specific logic in the normal runtime
      completion code - we'll always run the shutdown version on shutdown,
      and it will do what shutdown needs regardless of whether there are
      racing IO completion callbacks scheduled or in progress. Hence we
      can also simplify the normal IO completion callpath and only abort
      if shutdown occurred while we actively were processing callbacks.
      
      Further, separating out the IO completion logic from the shutdown
      logic avoids callback race conditions from being triggered by log IO
      completion after a shutdown. IO completion will now only run
      callbacks on iclogs that are in the correct state for a callback to
      be run, avoiding the possibility of running callbacks on a
      referenced iclog that hasn't yet been submitted for IO.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: rework xlog_state_do_callback() · 8bb92005
      Dave Chinner authored
      Clean it up a bit by factoring and rearranging some of the code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: make forced shutdown processing atomic · b36d4651
      Dave Chinner authored
      The running of a forced shutdown is a bit of a mess. It does racy
      checks for XFS_MOUNT_SHUTDOWN in xfs_do_force_shutdown(), then
      does more racy checks in xfs_log_force_unmount() before finally
      setting XFS_MOUNT_SHUTDOWN and XLOG_IO_ERROR under the
      log->icloglock.
      
      Move the checking and setting of XFS_MOUNT_SHUTDOWN into
      xfs_do_force_shutdown() so we only process a shutdown once and once
      only. Serialise this with the mp->m_sb_lock spinlock so that the
      state change is atomic and won't race. Move all the mount specific
      shutdown state changes from xfs_log_force_unmount() to
      xfs_do_force_shutdown() so they are done atomically with setting
      XFS_MOUNT_SHUTDOWN.
      
      Then get rid of the racy xlog_is_shutdown() check from
      xlog_force_shutdown(), and gate the log shutdown on the
      test_and_set_bit(XLOG_IO_ERROR) test under the icloglock. This
      means that the log is shut down once and once only, and code that
      needs to prevent races with shutdown can do so by holding the
      icloglock and checking the return value of xlog_is_shutdown().
      
      This results in a predictable shutdown execution process - we set the
      shutdown flags once and process the shutdown once rather than the
      current "as many concurrent shutdowns as can race to the flag
      setting" situation we have now.
      
      Also, now that shutdown is atomic, always emit a stack trace when the
      error level for the filesystem is high enough. This means that we
      always get a stack trace when trying to diagnose the cause of
      shutdowns in the field, rather than just for SHUTDOWN_CORRUPT_INCORE
      cases.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: convert log flags to an operational state field · e1d06e5f
      Dave Chinner authored
      log->l_flags doesn't actually contain "flags" as such, it contains
      operational state information that can change at runtime. For the
      shutdown state, this at least should be an atomic bit because
      it is read without holding locks in many places and so using atomic
      bitops for the state field modifications makes sense.
      
      This allows us to use things like test_and_set_bit() on state
      changes (e.g. setting XLOG_TAIL_WARN) to avoid races in setting the
      state when we aren't holding locks.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: move recovery needed state updates to xfs_log_mount_finish · fd67d8a0
      Dave Chinner authored
      xfs_log_mount_finish() needs to know if recovery is needed or not to
      make decisions on whether to flush the log and AIL.  Move the
      handling of the NEED_RECOVERY state out to this function rather than
      needing a temporary variable to store this state over the call to
      xlog_recover_finish().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: XLOG_STATE_IOERROR must die · 5112e206
      Dave Chinner authored
      We don't need an iclog state field to tell us the log has been shut
      down. We can just check xlog_is_shutdown() instead. This avoids
      the need to have shutdown overwrite the current iclog state while
      it is actively in use by the log code, and hence the need to ensure
      that every iclog state check handles XLOG_STATE_IOERROR
      appropriately.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown() · 2039a272
      Dave Chinner authored
      Make it less shouty and a static inline before adding more calls
      through the log code.
      
      Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount)
      to use xlog_is_shutdown(log) as well.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  2. 11 Aug, 2021 2 commits
  3. 09 Aug, 2021 20 commits
  4. 06 Aug, 2021 4 commits
    • xfs: per-cpu deferred inode inactivation queues · ab23a776
      Dave Chinner authored
      Move inode inactivation to background work contexts so that it no
      longer runs in the context that releases the final reference to an
      inode. This allows a process that would otherwise block on
      inactivation to continue doing work while the filesystem processes
      the inactivation in the background.
      
      A typical demonstration of this is unlinking an inode with lots of
      extents. The extents are removed during inactivation, so this blocks
      the process that unlinked the inode from the directory structure. By
      moving the inactivation to the background process, the userspace
      application can keep working (e.g. unlinking the next inode in the
      directory) while the inactivation work on the previous inode is
      done by a different CPU.
      
      The implementation of the queue is relatively simple. We use a
      per-cpu lockless linked list (llist) to queue inodes for
      inactivation without requiring serialisation mechanisms, and a work
      item to allow the queue to be processed by a CPU bound worker
      thread. We also keep a count of the queue depth so that we can
      trigger work after a number of deferred inactivations have been
      queued.
      
      The use of a bound workqueue with a single work depth allows the
      workqueue to run one work item per CPU. We queue the work item on
      the CPU we are currently running on, and so this essentially gives
      us affine per-cpu worker threads for the per-cpu queues. This
      maintains the effective CPU affinity that occurs within XFS at the
      AG level due to all objects in a directory being local to an AG.
      Hence inactivation work tends to run on the same CPU that last
      accessed all the objects that inactivation accesses and this
      maintains hot CPU caches for unlink workloads.
      
      A depth of 32 inodes was chosen to match the number of inodes in an
      inode cluster buffer. This hopefully allows sequential
      allocation/unlink behaviours to defer inactivation of all the
      inodes in a single cluster buffer at a time, further helping
      maintain hot CPU and buffer cache accesses while running
      inactivations.
      
      A hard per-cpu queue throttle of 256 inodes has been set to avoid
      runaway queuing when inodes that take a long time to inactivate are
      being processed, for example when unlinking inodes with large
      numbers of extents that can take a lot of processing to free.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: tweak comments and tracepoints, convert opflags to state bits]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
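The queueing scheme in this commit (a lockless list plus a depth counter that triggers work at 32 items and throttles at 256) can be sketched in userspace with C11 atomics. This is an illustration, not the kernel's llist or workqueue code: every name below is hypothetical, a single queue stands in for the per-cpu instances, and the exact threshold comparisons are assumptions based only on the numbers quoted in the text.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define GC_KICK_DEPTH   32u   /* from the text: inodes per cluster buffer */
#define GC_THROTTLE_MAX 256u  /* from the text: hard per-cpu queue limit */

struct gc_node { struct gc_node *next; };

/* One queue per CPU in the real design; one instance is modelled here. */
struct gc_queue {
	_Atomic(struct gc_node *) head;
	atomic_uint items;
};

enum gc_action { GC_NONE, GC_KICK_WORKER, GC_THROTTLE };

/* Lock-free push (compare-and-swap loop), then decide what the caller does. */
static enum gc_action gc_queue_inode(struct gc_queue *q, struct gc_node *n)
{
	assert(q != NULL && n != NULL);
	struct gc_node *old = atomic_load(&q->head);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&q->head, &old, n));

	unsigned int depth = atomic_fetch_add(&q->items, 1) + 1;
	if (depth >= GC_THROTTLE_MAX)
		return GC_THROTTLE;       /* caller should wait for the worker */
	if (depth >= GC_KICK_DEPTH)
		return GC_KICK_WORKER;    /* enough queued: schedule the worker */
	return GC_NONE;
}

/* Worker side: detach the whole list in one atomic exchange. */
static struct gc_node *gc_take_all(struct gc_queue *q)
{
	assert(q != NULL);
	atomic_store(&q->items, 0);
	return atomic_exchange(&q->head, (struct gc_node *)NULL);
}
```

Pushing never takes a lock, and the worker empties the queue with a single exchange, so producers and the consumer only contend on the list head.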
    • xfs: detach dquots from inode if we don't need to inactivate it · 62af7d54
      Darrick J. Wong authored
      If we don't need to inactivate an inode, we can detach the dquots and
      move on to reclamation.  This isn't strictly required here; it's a
      preparation patch for deferred inactivation per reviewer request[1] to
      move the creation of xfs_inode_needs_inactivation into a separate
      change.  Eventually this !need_inactive chunk will turn into the code
      path for inodes that skip xfs_inactive and go straight to memory
      reclaim.
      
      [1] https://lore.kernel.org/linux-xfs/20210609012838.GW2945738@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: move xfs_inactive call to xfs_inode_mark_reclaimable · c6c2066d
      Darrick J. Wong authored
      Move the xfs_inactive call and all the other debugging checks and stats
      updates into xfs_inode_mark_reclaimable because most of them are
      implementation details of the inode cache.  This is preparation for
      deferred inactivation that is coming up.  We also move it around
      xfs_icache.c in preparation for deferred inactivation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: introduce all-mounts list for cpu hotplug notifications · 0ed17f01
      Dave Chinner authored
      The inode inactivation and CIL tracking percpu structures are
      per-xfs_mount structures. That means when we get a CPU dead
      notification, we need to then iterate all the per-cpu structure
      instances to process them. Rather than keeping linked lists of
      per-cpu structures in each subsystem, add a list of all xfs_mounts
      that the generic xfs_cpu_dead() function will iterate and call into
      each subsystem appropriately.
      
      This allows us to handle both per-mount and global XFS percpu state
      from xfs_cpu_dead(), and avoids the need to link subsystem
      structures that can be easily found from the xfs_mount into their
      own global lists.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: expand some comments about mount list setup ordering rules]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
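The design in this commit (one global list of mounts, walked by a generic CPU-dead handler that calls into each subsystem) can be sketched as follows. This is a simplified userspace model with hypothetical names throughout; counters stand in for the per-cpu structures, and the locking a real implementation would need around the list is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-mount state: counters stand in for percpu structures. */
struct mount_model {
	struct mount_model *next;     /* link in the global all-mounts list */
	int inodegc_cpu_dead_calls;
	int cil_cpu_dead_calls;
};

static struct mount_model *all_mounts;   /* head of the global list */

static void mount_list_add(struct mount_model *mp)
{
	assert(mp != NULL);
	mp->next = all_mounts;
	all_mounts = mp;
}

/* Per-subsystem handlers, reached via the mount, not via their own lists. */
static void inodegc_cpu_dead(struct mount_model *mp, unsigned int cpu)
{
	(void)cpu;
	mp->inodegc_cpu_dead_calls++;
}

static void cil_cpu_dead(struct mount_model *mp, unsigned int cpu)
{
	(void)cpu;
	mp->cil_cpu_dead_calls++;
}

/* Generic hotplug callback: visit every mount, call into each subsystem. */
static int cpu_dead(unsigned int cpu)
{
	int visited = 0;

	for (struct mount_model *mp = all_mounts; mp != NULL; mp = mp->next) {
		inodegc_cpu_dead(mp, cpu);
		cil_cpu_dead(mp, cpu);
		visited++;
	}
	return visited;
}
```

Adding a new subsystem with per-mount percpu state then only means adding one more call inside the loop, rather than maintaining another global list.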