1. 22 Jun, 2021 1 commit
  2. 21 Jun, 2021 17 commits
    • xfs: remove dead stale buf unpin handling code · e53d3aa0
      Brian Foster authored
      This code goes back to a time when transaction commits wrote
      directly to iclogs. The associated log items were pinned, written to
      the log, and then "uncommitted" if some part of the log write had
      failed. This uncommit sequence called an ->iop_unpin_remove()
      handler that was eventually folded into ->iop_unpin() via the remove
      parameter. The log subsystem has since changed significantly in that
      transactions commit to the CIL instead of direct to iclogs, though
      log items must still be aborted in the event of an eventual log I/O
      error. However, the context for a log item abort is now asynchronous
      from transaction commit, which means the committing transaction has
      been freed by this point in time and the transaction uncommit
      sequence of events is no longer relevant.
      
      Further, since stale buffers remain locked at transaction commit
      through unpin, we can be certain that the buffer is not associated
      with any transaction when the unpin callback executes. Remove this
      unused hunk of code and replace it with an assertion that the buffer
      is disassociated from transaction context.
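
      For illustration, a minimal sketch of the simplified unpin path, using
      names from fs/xfs/xfs_buf_item.c (a sketch of the idea, not the exact
      diff):

      	/* in xfs_buf_item_unpin() */
      	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
      	struct xfs_buf		*bp = bip->bli_buf;

      	/*
      	 * Stale buffers remain locked from transaction commit until
      	 * unpin, so no transaction can still be associated with the
      	 * buffer when the unpin callback runs.
      	 */
      	ASSERT(bp->b_transp == NULL);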
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: hold buffer across unpin and potential shutdown processing · 84d8949e
      Brian Foster authored
      The special processing used to simulate a buffer I/O failure on fs
      shutdown has a difficult to reproduce race that can result in a use
      after free of the associated buffer. Consider a buffer that has been
      committed to the on-disk log and thus is AIL resident. The buffer
      lands on the writeback delwri queue, but is subsequently locked,
      committed and pinned by another transaction before being submitted
      for I/O. At this point, the buffer is stuck on the delwri queue as it
      cannot be submitted for I/O until it is unpinned. A log checkpoint
      I/O failure occurs sometime later, which aborts the bli. The unpin
      handler is called with the aborted log item, drops the bli reference
      count, the pin count, and falls into the I/O failure simulation
      path.
      
      The potential problem here is that once the pin count falls to zero
      in ->iop_unpin(), xfsaild is free to retry delwri submission of the
      buffer at any time, before the unpin handler even completes. If
      delwri queue submission wins the race to the buffer lock, it
      observes the shutdown state and simulates the I/O failure itself.
      This releases both the bli and delwri queue holds and frees the
      buffer while xfs_buf_item_unpin() sits on xfs_buf_lock() waiting to
      run through the same failure sequence. This problem is rare and
      requires many iterations of fstest generic/019 (which simulates disk
      I/O failures) to reproduce.
      
      To avoid this problem, grab a hold on the buffer before the log item
      is unpinned if the associated item has been aborted and will require
      a simulated I/O failure. The hold is already required for the
      simulated I/O failure, so the ordering simply guarantees the unpin
      handler access to the buffer before it is unpinned and thus
      processed by the AIL. This particular ordering is required so long
      as the AIL does not acquire a reference on the bli, which is the
      long term solution to this problem.
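
      As a sketch of the ordering the fix establishes in xfs_buf_item_unpin()
      (simplified; the "aborted" test stands in for the real condition):

      	if (aborted)
      		xfs_buf_hold(bp);	/* keep bp alive across unpin */
      	if (atomic_dec_and_test(&bp->b_pin_count))
      		wake_up_all(&bp->b_waiters);

      	/*
      	 * xfsaild may now win the race to the buffer lock and simulate
      	 * the I/O failure itself, but our hold guarantees the buffer
      	 * cannot be freed until we are done with it.
      	 */
      	xfs_buf_lock(bp);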
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: force the log offline when log intent item recovery fails · 4e6b8270
      Darrick J. Wong authored
      If any part of log intent item recovery fails, we should shut down the
      log immediately to stop the log from writing a clean unmount record to
      disk, because the metadata is not consistent.  The inability to cancel a
      dirty transaction catches most of these cases, but there are a few
      things that have slipped through the cracks, such as ENOSPC from a
      transaction allocation, or runtime errors that result in cancellation of
      a non-dirty transaction.
      
      This solves some weird behaviors reported by customers where a system
      goes down, the first mount fails, the second succeeds, but then the fs
      goes down later because of inconsistent metadata.
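
      Conceptually, log recovery now forces a shutdown on any intent recovery
      failure, roughly like this (a sketch, not the exact code):

      	error = xlog_recover_process_intents(log);
      	if (error) {
      		/* don't let a clean unmount record reach the disk */
      		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
      		return error;
      	}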
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes · 81ed9475
      Darrick J. Wong authored
      During regular operation, the xfs_inactive operations create
      transactions with zero block reservation because in general we're
      freeing space, not asking for more.  The per-AG space reservations
      created at mount time enable us to handle expansions of the refcount
      btree without needing to reserve blocks to the transaction.
      
      Unfortunately, log recovery doesn't create the per-AG space reservations
      when intent items are being recovered.  This isn't an issue for intent
      item recovery itself because they explicitly request blocks, but any
      inode inactivation that can happen during log recovery uses the same
      xfs_inactive paths as regular runtime.  If a refcount btree expansion
      happens, the transaction will fail due to blk_res_used > blk_res, and we
      shut down the filesystem unnecessarily.
      
      Fix this problem by making per-AG reservations temporarily so that we
      can handle the inactivations, and releasing them at the end.  This
      brings the recovery environment closer to the runtime environment.
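
      The shape of the fix, sketched around intent recovery (simplified,
      error handling elided):

      	/* mimic the runtime environment while we recover intents */
      	error = xfs_fs_reserve_ag_blocks(mp);
      	if (error)
      		return error;
      	error = xlog_recover_process_intents(log);
      	/* drop the temporary per-AG reservations again */
      	xfs_fs_unreserve_ag_blocks(mp);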
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: shorten the shutdown messages to a single line · c06ad17c
      Darrick J. Wong authored
      Consolidate the shutdown messages to a single line containing the
      reason, the passed-in flags, the source of the shutdown, and the end
      result.  This means we now only have one line to look for when
      debugging, which is useful when the fs goes down while something else is
      flooding dmesg.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
    • xfs: print name of function causing fs shutdown instead of hex pointer · 3a1c3abe
      Darrick J. Wong authored
      In xfs_do_force_shutdown, print the symbolic name of the function that
      called us to shut down the filesystem instead of a raw hex pointer.
      This makes debugging a lot easier:
      
      XFS (sda): xfs_do_force_shutdown(0x2) called from line 2440 of file
      	fs/xfs/xfs_log.c. Return address = ffffffffa038bc38
      
      becomes:
      
      XFS (sda): xfs_do_force_shutdown(0x2) called from line 2440 of file
      	fs/xfs/xfs_log.c. Return address = xfs_trans_mod_sb+0x25
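
      The symbolic formatting comes from printk's %pS pointer format, so the
      change amounts to something like the following (illustrative only):

      	xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_LOGERROR,
      "%s(0x%x) called from line %d of file %s. Return address = %pS",
      		__func__, flags, lnnum, fname, __return_address);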
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
    • xfs: fix type mismatches in the inode reclaim functions · 10be350b
      Darrick J. Wong authored
      It's currently unlikely that we will ever end up with more than 4
      billion inodes waiting for reclamation, but the fs object code uses long
      int for object counts and we're certainly capable of generating that
      many.  Instead of truncating the internal counters, widen them and
      report the object counts correctly.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: separate primary inode selection criteria in xfs_iget_cache_hit · 77b4d286
      Darrick J. Wong authored
      During review of the v6 deferred inode inactivation patchset[1], Dave
      commented that _cache_hit should have a clear separation between inode
      selection criteria and actions performed on a selected inode.  Move a
      hunk to make this true, and compact the shrink cases in the function.
      
      [1] https://lore.kernel.org/linux-xfs/162310469340.3465262.504398465311182657.stgit@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: refactor the inode recycling code · ff7bebeb
      Darrick J. Wong authored
      Hoist the code in xfs_iget_cache_hit that restores the VFS inode state
      to an xfs_inode that was previously vfs-destroyed.  The next patch will
      add a new set of state flags, so we need the helper to avoid
      duplication.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: add iclog state trace events · 956f6daa
      Dave Chinner authored
      For the DEBUGS!
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      Dave Chinner authored
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This lead me to xfs_log_force_lsn() and looking at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
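
      The renamed interface then looks something like this (a sketch of the
      post-patch API; the exact typedef is an assumption):

      	/* CIL sequence numbers, not on-disk LSNs */
      	typedef uint64_t	xfs_csn_t;

      	int xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq,
      			uint flags, int *log_flushed);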
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Fix CIL throttle hang when CIL space used going backwards · 19f4e7cc
      Dave Chinner authored
      A hang with tasks stuck on the CIL hard throttle was reported and
      largely diagnosed by Donald Buczek, who discovered that it was a
      result of the CIL context space usage decrementing in committed
      transactions once the hard throttle limit had been hit and processes
      were already blocked.  This resulted in the CIL push not waking up
      those waiters because the CIL context was no longer over the hard
      throttle limit.
      
      The surprising aspect of this was the CIL space usage going
      backwards regularly enough to trigger this situation. Assumptions
      had been made in design that the relogging process would only
      increase the size of the objects in the CIL, and so that space would
      only increase.
      
      This change and commit message fixes the issue and documents the
      result of an audit of the triggers that can cause the CIL space to
      go backwards, how large the backwards steps tend to be, the
      frequency in which they occur, and what the impact on the CIL
      accounting code is.
      
      Even though the CIL ctx->space_used can go backwards, it will only
      do so if the log item is already logged to the CIL and contains a
      space reservation for its entire logged state. This is tracked by
      the shadow buffer state on the log item. If the item is not
      previously logged in the CIL it has no shadow buffer nor log vector,
      and hence the entire size of the logged item copied to the log
      vector is accounted to the CIL space usage. i.e.  it will always go
      up in this case.
      
      If the item has a log vector (i.e. already in the CIL) and the size
      decreases, then the existing log vector will be overwritten and the
      space usage will go down. This is the only condition where the space
      usage reduces, and it can only occur when an item is already tracked
      in the CIL. Hence we are safe from CIL space usage underruns as a
      result of log items decreasing in size when they are relogged.
      
      Typically this reduction in CIL usage occurs from metadata blocks
      being freed, such as when a btree block merge occurs or a directory
      entry/xattr entry is removed and the da-tree is reduced in size.
      This generally results in a reduction in size of around a single
      block in the CIL, but also tends to increase the number of log
      vectors because the parent and sibling nodes in the tree need to be
      updated when a btree block is removed. If a multi-level merge
      occurs, then we see reduction in size of 2+ blocks, but again the
      log vector count goes up.
      
      The other vector is inode fork size changes, which only log the
      current size of the fork and ignore the previously logged size when
      the fork is relogged. Hence if we are removing items from the inode
      fork (dir/xattr removal in shortform, extent record removal in
      extent form, etc) the relogged size of the inode fork can decrease.
      
      No other log items can decrease in size either because they are a
      fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
      an intent actually creates a new intent log item and doesn't relog
      the old item at all.) Hence the only two vectors for CIL context
      size reduction are relogging inode forks and marking buffers active
      in the CIL as stale.
      
      Long story short: the majority of the code does the right thing and
      handles the reduction in log item size correctly, and only the CIL
      hard throttle implementation is problematic and needs fixing. This
      patch makes that fix, as well as adds comments in the log item code
      that result in items shrinking in size when they are relogged as a
      clear reminder that this can and does happen frequently.
      
      The throttle fix is based upon the change Donald proposed, though it
      goes further to ensure that once the throttle is activated, it
      captures all tasks until the CIL push issues a wakeup, regardless of
      whether the CIL space used has gone back under the throttle
      threshold.
      
      This ensures that we prevent tasks reducing the CIL slightly under
      the throttle threshold and then making more changes that push it
      well over the throttle limit. This is achieved by checking if the
      throttle wait queue is already active as a condition of throttling.
      Hence once we start throttling, we continue to apply the throttle
      until the CIL context push wakes everything on the wait queue.
      
      We can use waitqueue_active() for the waitqueue manipulations and
      checks as they are all done under the ctx->xc_push_lock. Hence the
      waitqueue has external serialisation and we can safely peek inside
      the wait queue without holding the internal waitqueue locks.
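
      A sketch of the throttle check, as done under the CIL's xc_push_lock
      (simplified from the patch):

      	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
      	    waitqueue_active(&cil->xc_push_wait)) {
      		/* already throttling: park until the push wakes us */
      		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
      		return;
      	}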
      
      Many thanks to Donald for his diagnostic and analysis work to
      isolate the cause of this hang.
      Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: journal IO cache flush reductions · eef983ff
      Dave Chinner authored
      Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
      guarantee the ordering requirements the journal has w.r.t. metadata
      writeback. The two ordering constraints are:
      
      1. we cannot overwrite metadata in the journal until we guarantee
      that the dirty metadata has been written back in place and is
      stable.
      
      2. we cannot write back dirty metadata until it has been written to
      the journal and guaranteed to be stable (and hence recoverable) in
      the journal.
      
      The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
      causes the journal IO to issue a cache flush and wait for it to
      complete before issuing the write IO to the journal. Hence all
      completed metadata IO is guaranteed to be stable before the journal
      overwrites the old metadata.
      
      The ordering guarantees of #2 are provided by the REQ_FUA, which
      ensures the journal writes do not complete until they are on stable
      storage. Hence by the time the last journal IO in a checkpoint
      completes, we know that the entire checkpoint is on stable storage
      and we can unpin the dirty metadata and allow it to be written back.
      
      This is the mechanism by which ordering was first implemented in XFS
      way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
      ("Add support for drive write cache flushing") in the xfs-archive
      tree.
      
      A lot has changed since then, most notably we now use delayed
      logging to checkpoint the filesystem to the journal rather than
      write each individual transaction to the journal. Cache flushes on
      journal IO are necessary when individual transactions are wholly
      contained within a single iclog. However, CIL checkpoints are single
      transactions that typically span hundreds to thousands of individual
      journal writes, and so the requirements for device cache flushing
      have changed.
      
      That is, the ordering rules I state above apply to ordering of
      atomic transactions recorded in the journal, not to the journal IO
      itself. Hence we need to ensure metadata is stable before we start
      writing a new transaction to the journal (guarantee #1), and we need
      to ensure the entire transaction is stable in the journal before we
      start metadata writeback (guarantee #2).
      
      Hence we only need a REQ_PREFLUSH on the journal IO that starts a
      new journal transaction to provide #1, and it is not needed on any
      journal IO done within the context of that journal transaction.
      
      The CIL checkpoint already issues a cache flush before it starts
      writing to the log, so we no longer need the iclog IO to issue a
      REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
      to xlog_write(), we no longer need to mark the first iclog in
      the log write with REQ_PREFLUSH for this case. As an added bonus,
      this ordering mechanism works for both internal and external logs,
      meaning we can remove the explicit data device cache flushes from
      the iclog write code when using external logs.
      
      Given the new ordering semantics of commit records for the CIL, we
      need iclogs containing commit records to issue a REQ_PREFLUSH. We
      also require unmount records to do this. Hence for both
      XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
      to mark the first iclog being written with REQ_PREFLUSH.
      
      For both commit records and unmount records, we also want them
      immediately on stable storage, so we want to also mark the iclogs
      that contain these records to be marked REQ_FUA. That means if a
      record is split across multiple iclogs, they are all marked REQ_FUA
      and not just the last one so that when the transaction is completed
      all the parts of the record are on stable storage.
      
      And for external logs, unmount records need a pre-write data device
      cache flush similar to the CIL checkpoint cache pre-flush as the
      internal iclog write code does not do this implicitly anymore.
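
      At the iclog submission level the mechanism reduces to per-iclog flags
      translated into bio op flags, roughly (a sketch):

      	/* in xlog_write_iclog() */
      	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
      		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
      	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
      		iclog->ic_bio.bi_opf |= REQ_FUA;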
      
      As an optimisation, when the commit record lands in the same iclog
      as the journal transaction starts, we don't need to wait for
      anything and can simply use REQ_FUA to provide guarantee #2.  This
      means that for fsync() heavy workloads, the cache flush behaviour is
      completely unchanged and there is no degradation in performance as a
      result of optimising the multi-IO transaction case.
      
      The most notable sign that there is less IO latency on my test
      machine (nvme SSDs) is that the "noiclogs" rate has dropped
      substantially. This metric indicates that the CIL push is blocking
      in xlog_get_iclog_space() waiting for iclog IO completion to occur.
      With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
      every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
      is blocking waiting for log IO. With the changes in this patch, this
      drops to 1 noiclog event for every 100 iclog writes. Hence it is
      clear that log IO is completing much faster than it was previously,
      but it is also clear that for large iclog sizes, this isn't the
      performance limiting factor on this hardware.
      
      With smaller iclogs (32kB), however, there is a substantial
      difference. With the cache flush modifications, the journal is now
      running at over 4000 write IOPS, and the journal throughput is
      largely identical to the 256kB iclogs and the noiclog event rate
      stays low at about 1:50 iclog writes. The existing code tops out at
      about 2500 IOPS as the number of cache flushes dominate performance
      and latency. The noiclog event rate is about 1:4, and the
      performance variance is quite large as the journal throughput can
      fall to less than half the peak sustained rate when the cache flush
      rate prevents metadata writeback from keeping up and the log runs
      out of space and throttles reservations.
      
      As a result:
      
      	logbsize	fsmark create rate	rm -rf
      before	32kb		152851+/-5.3e+04	5m28s
      patched	32kb		221533+/-1.1e+04	5m24s
      
      before	256kb		220239+/-6.2e+03	4m58s
      patched	256kb		228286+/-9.2e+03	5m06s
      
      The rm -rf times are included because I ran them, but the
      differences are largely noise. This workload is largely metadata
      read IO latency bound and the changes to the journal cache flushing
      don't really make any noticeable difference to behaviour apart from
      a reduction in noiclog events from background CIL pushing.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: remove need_start_rec parameter from xlog_write() · 3468bb1c
      Dave Chinner authored
      The CIL push is the only call to xlog_write that sets this variable
      to true. The other callers don't need a start rec, and they tell
      xlog_write what to do by passing the type of ophdr they need written
      in the flags field. The need_start_rec parameter essentially tells
      xlog_write to write an extra ophdr with an XLOG_START_TRANS type,
      so get rid of the variable to do this and pass XLOG_START_TRANS as
      the flag value into xlog_write() from the CIL push.
      
      $ size fs/xfs/xfs_log.o*
        text	   data	    bss	    dec	    hex	filename
       27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
       27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: CIL checkpoint flushes caches unconditionally · bad77c37
      Dave Chinner authored
      Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
      guarantee the ordering requirements the journal has w.r.t. metadata
      writeback. The two ordering constraints are:
      
      1. we cannot overwrite metadata in the journal until we guarantee
      that the dirty metadata has been written back in place and is
      stable.
      
      2. we cannot write back dirty metadata until it has been written to
      the journal and guaranteed to be stable (and hence recoverable) in
      the journal.
      
      These rules apply to the atomic transactions recorded in the
      journal, not to the journal IO itself. Hence we need to ensure
      metadata is stable before we start writing a new transaction to the
      journal (guarantee #1), and we need to ensure the entire transaction
      is stable in the journal before we start metadata writeback
      (guarantee #2).
      
      The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
      being added to every iclog IO. This causes the journal IO to issue a
      cache flush and wait for it to complete before issuing the write IO
      to the journal. Hence all completed metadata IO is guaranteed to be
      stable before the journal overwrites the old metadata.
      
      However, for long running CIL checkpoints that might do a thousand
      journal IOs, we don't need every single one of these iclog IOs to
      issue a cache flush - the cache flush done before the first iclog is
      submitted is sufficient to cover the entire range in the log that
      the checkpoint will overwrite because the CIL space reservation
      guarantees the tail of the log (completed metadata) is already
      beyond the range of the checkpoint write.
      
      Hence we only need a full cache flush between closing off the CIL
      checkpoint context (i.e. when the push switches it out) and issuing
      the first journal IO. Rather than plumbing this through to the
      journal IO, we can start this cache flush the moment the CIL context
      is owned exclusively by the push worker. The cache flush can be in
      progress while we process the CIL ready for writing, hence
      reducing the latency of the initial iclog write. This is especially
      true for large checkpoints, where we might have to process hundreds
      of thousands of log vectors before we issue the first iclog write.
      In these cases, it is likely the cache flush has already been
      completed by the time we have built the CIL log vector chain.
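
      The change itself is small -- a sketch of the flush issued once the
      push worker owns the context exclusively (simplified):

      	/* in xlog_cil_push_work(), before processing the CIL */
      	blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);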
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: async blkdev cache flush · 0431d926
      Dave Chinner authored
      The new checkpoint cache flush mechanism requires us to issue an
      unconditional cache flush before we start a new checkpoint. We don't
      want to block for this if we can help it, and we have a fair chunk
      of CPU work to do between starting the checkpoint and issuing the
      first journal IO.
      
      Hence it makes sense to amortise the latency cost of the cache flush
      by issuing it asynchronously and then waiting for it only when we
      need to issue the first IO in the transaction.
      
      To do this, we need async cache flush primitives to submit the cache
      flush bio and to wait on it. The block layer has no such primitives
      for filesystems, so roll our own for the moment.
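
      A sketch of such a primitive (simplified; the endio helper simply
      completes "done", and the no-volatile-cache case is omitted):

      	void
      	xfs_flush_bdev_async(
      		struct bio		*bio,
      		struct block_device	*bdev,
      		struct completion	*done)
      	{
      		bio_init(bio, NULL, 0);
      		bio_set_dev(bio, bdev);
      		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
      		bio->bi_private = done;
      		bio->bi_end_io = xfs_flush_bdev_async_endio;
      		submit_bio(bio);
      	}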
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: remove xfs_blkdev_issue_flush · b5071ada
      Dave Chinner authored
      It's a one line wrapper around blkdev_issue_flush(). Just replace it
      with direct calls to blkdev_issue_flush().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  3. 18 Jun, 2021 6 commits
  4. 09 Jun, 2021 2 commits
  5. 08 Jun, 2021 13 commits
    • Merge tag 'rename-eofblocks-5.14_2021-06-08' of... · 68b2c8bc
      Darrick J. Wong authored
      Merge tag 'rename-eofblocks-5.14_2021-06-08' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2
      
      xfs: rename struct xfs_eofblocks
      
      In the old days, struct xfs_eofblocks was an optional parameter to the
      speculative post-EOF allocation garbage collector to narrow the scope of
      a scan to files fitting specific criteria.  Nowadays it is used for all
      other kinds of inode cache walks (reclaim, quotaoff, inactivation), so
      the name is no longer fitting.  Change the flag namespace and rename the
      structure to something more appropriate for what it does.
      
      v2: separate the inode cache walk flag namespace from eofblocks
      
      * tag 'rename-eofblocks-5.14_2021-06-08' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: rename struct xfs_eofblocks to xfs_icwalk
        xfs: change the prefix of XFS_EOF_FLAGS_* to XFS_ICWALK_FLAG_
    • Merge tag 'fix-inode-health-reports-5.14_2021-06-08' of... · 295abff2
      Darrick J. Wong authored
      Merge tag 'fix-inode-health-reports-5.14_2021-06-08' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2
      
      xfs: preserve inode health reports for longer
      
      This is a quick series to make sure that inode sickness reports stick
      around in memory for some amount of time.
      
      v2: rebase to 5.13-rc4
      v3: require explicit request to reclaim sick inodes, drop weird icache
          miss interaction with DONTCACHE
      
      * tag 'fix-inode-health-reports-5.14_2021-06-08' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: selectively keep sick inodes in memory
        xfs: drop IDONTCACHE on inodes when we mark them sick
        xfs: only reset incore inode health state flags when reclaiming an inode
    • xfs: rename struct xfs_eofblocks to xfs_icwalk · b26b2bf1
      Darrick J. Wong authored
      The xfs_eofblocks structure is no longer well-named -- nowadays it
      provides optional filtering criteria to any walk of the incore inode
      cache.  Only one of the cache walk goals has anything to do with
      clearing of speculative post-EOF preallocations, so change the name to
      be more appropriate.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: selectively keep sick inodes in memory · 9492750a
      Darrick J. Wong authored
      It's important that the filesystem retain its memory of sick inodes for
      a little while after problems are found so that reports can be collected
      about what was wrong.  Don't let inode reclamation free sick inodes
      unless we're unmounting or the fs already went down.
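
      A sketch of the resulting reclaim filter ("reclaim_sick_wanted" is an
      illustrative stand-in for the real unmount/shutdown condition):

      	/* in the reclaim inode grab path, under ip->i_flags_lock */
      	if (ip->i_sick && !reclaim_sick_wanted) {
      		/* keep the sick inode in memory for later reporting */
      		spin_unlock(&ip->i_flags_lock);
      		return false;
      	}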
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    • xfs: change the prefix of XFS_EOF_FLAGS_* to XFS_ICWALK_FLAG_ · 2d53f66b
      Darrick J. Wong authored
      In preparation for renaming struct xfs_eofblocks to struct xfs_icwalk,
      change the prefix of the existing XFS_EOF_FLAGS_* flags to
      XFS_ICWALK_FLAG_ and convert all the existing users.  This adds a degree
      of interface separation between the ioctl definitions and the incore
      parameters.  Since FLAGS_UNION is only used in xfs_icache.c, move it
      there as a private flag.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    • xfs: drop IDONTCACHE on inodes when we mark them sick · 7975e465
      Darrick J. Wong authored
      When we decide to mark an inode sick, clear the DONTCACHE flag so that
      the incore inode will be kept around until memory pressure forces it out
      of memory.  This increases the chances that the sick status will be
      caught by someone compiling a health report later on.
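
      Clearing the VFS flag looks roughly like this in the mark-sick path
      (a sketch):

      	spin_lock(&VFS_I(ip)->i_lock);
      	VFS_I(ip)->i_state &= ~I_DONTCACHE;
      	spin_unlock(&VFS_I(ip)->i_lock);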
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    • xfs: only reset incore inode health state flags when reclaiming an inode · 255794c7
      Darrick J. Wong authored
      While running some fuzz tests on inode metadata, I noticed that the
      filesystem health report (as provided by xfs_spaceman) failed to report
      the file corruption even when spaceman was run immediately after running
      xfs_scrub to detect the corruption.  That isn't the intended behavior;
      one ought to be able to run scrub to detect errors in the ondisk
      metadata and be able to access those reports for some time after the
      scrub.
      
      After running the same sequence through an instrumented kernel, I
      discovered the reason why -- scrub igets the file, scans it, marks it
      sick, and ireleases the inode.  When the VFS lets go of the incore
      inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
      inode before it moves to RECLAIM state, iget reinitializes the VFS
      state, clears the sick and checked masks, and hands back the inode.  At
      this point, the caller has the exact same incore inode, but with all the
      health state erased.
      
      In other words, we're erasing the incore inode's health state flags when
      we've decided NOT to sever the link between the incore inode and the
      ondisk inode.  This is wrong, so we need to remove the lines that zero
      the fields from xfs_iget_cache_hit.
      
      As a precaution, we add the same lines into xfs_reclaim_inode just after
      we sever the link between incore and ondisk inode.  Strictly speaking
      this isn't necessary because once an inode has gone through reclaim it
      must go through xfs_inode_alloc (which also zeroes the state) and
      xfs_iget is careful to check for mismatches between the inode it pulls
      out of the radix tree and the one it wants.
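
      The precautionary reset is then a pair of assignments in
      xfs_reclaim_inode() once the inode is unhooked from the radix tree
      (a sketch):

      	ip->i_sick = 0;
      	ip->i_checked = 0;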
      
      Fixes: 6772c1f1 ("xfs: track metadata health status")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    • Merge tag 'inode-walk-cleanups-5.14_2021-06-03' of... · ffc18582
      Darrick J. Wong authored
      Merge tag 'inode-walk-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2
      
      xfs: clean up incore inode walk functions
      
      This ambitious series aims to clean up redundant inode walk code in
      xfs_icache.c, hide implementation details of the quotaoff dquot release
      code, and eliminate indirect function calls from incore inode walks.
      
      The first thing it does is to move all the code that quotaoff calls to
      release dquots from all incore inodes into xfs_icache.c.  Next, it
      separates the goal of an inode walk from the actual radix tree tags that
      may or may not be involved and drops the kludgy XFS_ICI_NO_TAG thing.
      Finally, we split the speculative preallocation (blockgc) and quotaoff
      dquot release code paths into separate functions so that we can keep the
      implementations cohesive.
      
      Christoph suggested last cycle that we 'simply' change quotaoff not to
      allow deactivating quota entirely, but as these cleanups are to enable
      one major change in behavior (deferred inode inactivation) I do not want
      to add a second behavior change (quotaoff) as a dependency.
      
      To be blunt: Additional cleanups are not in scope for this series.
      
      Next, I made two observations about incore inode radix tree walks --
      since there's a 1:1 mapping between the walk goal and the per-inode
      processing function passed in, we can use the goal to make a direct call
      to the processing function.  Furthermore, the only caller to supply a
      nonzero iter_flags argument is quotaoff, and there's only one INEW flag.
      
      From that observation, I concluded that it's quite possible to remove
      two parameters from the xfs_inode_walk* function signatures -- the
      iter_flags, and the execute function pointer.  The middle of the series
      moves the INEW functionality into the one piece (quotaoff) that wants
      it, and removes the indirect calls.
      
      The final observation is that the inode reclaim walk loop is now almost
      the same as xfs_inode_walk, so it's silly to maintain two copies.  Merge
      the reclaim loop code into xfs_inode_walk.
      
      Lastly, refactor the per-ag radix tagging functions since there's
      duplicated code that can be consolidated.
      
      This series is a prerequisite for the next two patchsets, since deferred
      inode inactivation will add another inode radix tree tag and iterator
      function to xfs_inode_walk.
      
      v2: walk the vfs inode list when running quotaoff instead of the radix
          tree, then rework the (now completely internal) inode walk function
          to take the tag as the main parameter.
      v3: merge the reclaim loop into xfs_inode_walk, then consolidate the
          radix tree tagging functions
      v4: rebase to 5.13-rc4
      v5: combine with the quotaoff patchset, reorder functions to minimize
          forward declarations, split inode walk goals from radix tree tags
          to reduce conceptual confusion
      v6: start moving the inode cache code towards the xfs_icwalk prefix
      
      * tag 'inode-walk-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: refactor per-AG inode tagging functions
        xfs: merge xfs_reclaim_inodes_ag into xfs_inode_walk_ag
        xfs: pass struct xfs_eofblocks to the inode scan callback
        xfs: fix radix tree tag signs
        xfs: make the icwalk processing functions clean up the grab state
        xfs: clean up inode state flag tests in xfs_blockgc_igrab
        xfs: remove indirect calls from xfs_inode_walk{,_ag}
        xfs: remove iter_flags parameter from xfs_inode_walk_*
        xfs: move xfs_inew_wait call into xfs_dqrele_inode
        xfs: separate the dqrele_all inode grab logic from xfs_inode_walk_ag_grab
        xfs: pass the goal of the incore inode walk to xfs_inode_walk()
        xfs: rename xfs_inode_walk functions to xfs_icwalk
        xfs: move the inode walk functions further down
        xfs: detach inode dquots at the end of inactivation
        xfs: move the quotaoff dqrele inode walk into xfs_icache.c
      
      [djwong: added variable names to function declarations while fixing
      merge conflicts]
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • Merge tag 'assorted-fixes-5.14-1_2021-06-03' of... · 8b943d21
      Darrick J. Wong authored
      Merge tag 'assorted-fixes-5.14-1_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2
      
      xfs: assorted fixes for 5.14, part 1
      
      This branch contains the first round of various small fixes for 5.14.
      
      * tag 'assorted-fixes-5.14-1_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: don't take a spinlock unconditionally in the DIO fastpath
        xfs: mark xfs_bmap_set_attrforkoff static
        xfs: Remove redundant assignment to busy
        xfs: sort variable alphabetically to avoid repeated declaration
    • Merge tag 'unit-conversion-cleanups-5.14_2021-06-03' of... · f52edf6c
      Darrick J. Wong authored
      Merge tag 'unit-conversion-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2
      
      xfs: various unit conversions
      
      Crafting the realtime file extent size hint fixes revealed various
      opportunities to clean up unit conversions, so now that gets its own
      series.
      
      * tag 'unit-conversion-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: remove unnecessary shifts
        xfs: clean up open-coded fs block unit conversions
    • xfs: drop the AGI being passed to xfs_check_agi_freecount · 9ba0889e
      Dave Chinner authored
      From: Dave Chinner <dchinner@redhat.com>
      
      Stephen Rothwell reported this compiler warning from linux-next:
      
      fs/xfs/libxfs/xfs_ialloc.c: In function 'xfs_difree_finobt':
      fs/xfs/libxfs/xfs_ialloc.c:2032:20: warning: unused variable 'agi' [-Wunused-variable]
       2032 |  struct xfs_agi   *agi = agbp->b_addr;
      
      Which is fallout from agno -> perag conversions that were done in
      this function. xfs_check_agi_freecount() is the only user of "agi"
      in xfs_difree_finobt() now, and it only uses the agi to get the
      current free inode count. We hold that in the perag structure, so
      there's no need to directly reference the raw AGI to get this
      information.
      
      The btree cursor being passed to xfs_check_agi_freecount() has a
      reference to the perag being operated on, so use that directly in
      xfs_check_agi_freecount() rather than passing an AGI.
      
      Fixes: 7b13c515 ("xfs: use perag for ialloc btree cursors")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • Merge tag 'xfs-perag-conv-tag' of... · c3eabd36
      Darrick J. Wong authored
      Merge tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.14-merge2
      
      xfs: initial agnumber -> perag conversions for shrink
      
      If we want to use active references to the perag to be able to gate
      shrink removing AGs and hence perags safely, we've got a fair bit of
      work to do to actually use perags in all the places we need to.
      
      There's a lot of code that iterates ag numbers and then
      looks up perags from that, often multiple times for the same perag
      in the one operation. If we want to use reference counted perags for
      access control, then we need to convert all these uses to perag
      iterators, not agno iterators.
      
      [Patches 1-4]
      
      The first step of this is consolidating all the perag management -
      init, free, get, put, etc into a common location. This is spread all
      over the place right now, so move it all into libxfs/xfs_ag.[ch].
      This does expose kernel only bits of the perag to libxfs and hence
      userspace, so the structures and code are rearranged to minimise the
      number of ifdefs that need to be added to the userspace codebase.
      The perag iterator in xfs_icache.c is promoted to a first class API
      and expanded to the needs of the code as required.
      
      [Patches 5-10]
      
      These are the first basic perag iterator conversions and changes to
      pass the perag down the stack from those iterators where
      appropriate. A lot of this is obvious, simple changes, though in
      some places we stop passing the perag down the stack because the
      code enters into an as yet unconverted subsystem that still uses raw
      AGs.
      
      [Patches 11-16]
      
      These replace the agno passed in the btree cursor for per-ag btree
      operations with a perag that is passed to the cursor init function.
      The cursor takes its own reference to the perag, and the reference
      is dropped when the cursor is deleted. Hence we get reference
      coverage for the entire time the cursor is active, even if the code
      that initialised the cursor drops its reference before the cursor
      or any of its children (duplicates) have been deleted.
      
      The first patch adds the perag infrastructure for the cursor, the
      next four patches convert a btree cursor at a time, and the last
      removes the agno from the cursor once it is unused.
      
      [Patches 17-21]
      
      These patches are a demonstration of the simplifications and
      cleanups that come from plumbing the perag through interfaces that
      select and then operate on a specific AG. In this case the inode
      allocation algorithm does up to three walks across all AGs before it
      either allocates an inode or fails. Two of these walks are purely
      just to select the AG, and even then it doesn't guarantee inode
      allocation success so there's a third walk if the selected AG
      allocation fails.
      
      These patches collapse the selection and allocation into a single
      loop, simplifies the error handling because xfs_dir_ialloc() always
      returns ENOSPC if no AG was selected for inode allocation or we fail
      to allocate an inode in any AG, gets rid of xfs_dir_ialloc()
      wrapper, converts inode allocation to run entirely from a single
      perag instance, and then factors xfs_dialloc() into a much, much
      simpler loop which is easy to understand.
      
      Hence we end up with the same inode allocation logic, but it only
      needs two complete iterations at worst, makes AG selection and
      allocation atomic w.r.t. shrink and chops out over 100 lines of
      code from this hot code path.
      
      [Patch 22]
      
      Converts the unlink path to pass perags through it.
      
      There's more conversion work to be done, but this patchset gets
      through a large chunk of it in one hit. Most of the iterators are
      converted, so once this is solidified we can move on to converting
      these to active references for being able to free perags while the
      fs is still active.
      
      * tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (23 commits)
        xfs: remove xfs_perag_t
        xfs: use perag through unlink processing
        xfs: clean up and simplify xfs_dialloc()
        xfs: inode allocation can use a single perag instance
        xfs: get rid of xfs_dir_ialloc()
        xfs: collapse AG selection for inode allocation
        xfs: simplify xfs_dialloc_select_ag() return values
        xfs: remove agno from btree cursor
        xfs: use perag for ialloc btree cursors
        xfs: convert allocbt cursors to use perags
        xfs: convert refcount btree cursor to use perags
        xfs: convert rmap btree cursor to using a perag
        xfs: add a perag to the btree cursor
        xfs: pass perags around in fsmap data dev functions
        xfs: push perags through the ag reservation callouts
        xfs: pass perags through to the busy extent code
        xfs: convert secondary superblock walk to use perags
        xfs: convert xfs_iwalk to use perag references
        xfs: convert raw ag walks to use for_each_perag
        xfs: make for_each_perag... a first class citizen
        ...
    • Merge tag 'xfs-buf-bulk-alloc-tag' of... · ebf2e337
      Darrick J. Wong authored
      Merge tag 'xfs-buf-bulk-alloc-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.14-merge2
      
      xfs: buffer cache bulk page allocation
      
      This patchset makes use of the new bulk page allocation interface to
      reduce the overhead of allocating large numbers of pages in a
      loop.
      
      The first two patches are refactoring buffer memory allocation and
      converting the uncached buffer path to use the same page allocation
      path, followed by converting the page allocation path to use bulk
      allocation.
      
      The rest of the patches are then consolidation of the page
      allocation and freeing code to simplify the code and remove a chunk
      of unnecessary abstraction. This is largely based on a series of
      changes made by Christoph Hellwig.
      
      * tag 'xfs-buf-bulk-alloc-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
        xfs: merge xfs_buf_allocate_memory
        xfs: cleanup error handling in xfs_buf_get_map
        xfs: get rid of xb_to_gfp()
        xfs: simplify the b_page_count calculation
        xfs: remove ->b_offset handling for page backed buffers
        xfs: move page freeing into _xfs_buf_free_pages()
        xfs: merge _xfs_buf_get_pages()
        xfs: use alloc_pages_bulk_array() for buffers
        xfs: use xfs_buf_alloc_pages for uncached buffers
        xfs: split up xfs_buf_allocate_memory
  6. 07 Jun, 2021 1 commit