1. 04 May, 2022 22 commits
    • Allison Henderson's avatar
      xfs: Set up infrastructure for log attribute replay · fd920008
      Allison Henderson authored
      Currently attributes are modified directly across one or more
      transactions. But they are not logged or replayed in the event of an
      error. The goal of log attr replay is to enable logging and replaying
      of attribute operations using the existing delayed operations
      infrastructure.  This will later enable the attributes to become part of
      larger multi part operations that also must first be recorded to the
      log.  This is mostly of interest in the scheme of parent pointers which
      would need to maintain an attribute containing parent inode information
      any time an inode is moved, created, or removed.  Parent pointers would
      then be of interest to any feature that would need to quickly derive an
      inode path from the mount point. Online scrub, nfs lookups and fs grow
      or shrink operations are all features that could take advantage of this.
      
      This patch adds two new log item types for setting or removing
      attributes as deferred operations.  The xfs_attri_log_item will log an
      intent to set or remove an attribute.  The corresponding
      xfs_attrd_log_item holds a reference to the xfs_attri_log_item and is
      freed once the transaction is done.  Both log items use a generic
      xfs_attr_log_format structure that contains the attribute name, value,
      flags, inode, and an op_flag that indicates if the operations is a set
      or remove.
      
      [dchinner: added extra little bits needed for intent whiteouts]
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      fd920008
    • Allison Henderson's avatar
      xfs: Return from xfs_attr_set_iter if there are no more rmtblks to process · 9a39cdab
      Allison Henderson authored
      During an attr rename operation, blocks are saved for later removal
      as rmtblkno2. The rmtblkno is used in the case of needing to alloc
      more blocks if not enough were available.  However, in the case
      that no further blocks need to be added or removed, we can return as soon
      as xfs_attr_node_addname completes, rather than rolling the transaction
      with an -EAGAIN return.  This extra loop does not hurt anything right
      now, but it will be a problem later when we get into log items because
      we end up with an empty log transaction.  So, add a simple check to
      cut out the unneeded iteration.
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      9a39cdab
    • Allison Henderson's avatar
      xfs: Fix double unlock in defer capture code · 7b3ec2b2
      Allison Henderson authored
      The new deferred attr patch set uncovered a double unlock in the
      recent port of the defer ops capture and continue code.  During log
      recovery, we're allowed to hold buffers to a transaction that's being
      used to replay an intent item.  When we capture the resources as part
      of scheduling a continuation of an intent chain, we call xfs_buf_hold
      to retain our reference to the buffer beyond the transaction commit,
      but we do /not/ call xfs_trans_bhold to maintain the buffer lock.
      This means that xfs_defer_ops_continue needs to relock the buffers
      before xfs_defer_restore_resources joins then tothe new transaction.
      
      Additionally, the buffers should not be passed back via the dres
      structure since they need to remain locked unlike the inodes.  So
      simply set dr_bufs to zero after populating the dres structure.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandan.babu@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      7b3ec2b2
    • Dave Chinner's avatar
    • Dave Chinner's avatar
      Merge tag 'reflink-speedups-5.19_2022-04-28' of... · 166afc45
      Dave Chinner authored
      Merge tag 'reflink-speedups-5.19_2022-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.19-for-next
      
      xfs: fix reflink inefficiencies
      
      As Dave Chinner has complained about on IRC, there are a couple of
      things about reflink that are very inefficient.  First of all, we
      limited the size of all bunmapi operations to avoid flooding the log
      with defer ops in the worst case, but recent changes to the defer
      ops code have solved that problem, so get rid of the bunmapi length
      clamp.
      
      Second, the log reservations for reflink operations are far far
      larger than they need to be.  Shrink them to exactly what we need to
      handle each deferred RUI and CUI log item, and no more.  Also reduce
      logcount because we don't need 8 rolls per operation.  Introduce a
      transaction reservation compatibility layer to avoid changing the
      minimum log size calculations.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      166afc45
    • Dave Chinner's avatar
      Merge tag 'rmap-speedups-5.19_2022-04-28' of... · 956f1b8f
      Dave Chinner authored
      Merge tag 'rmap-speedups-5.19_2022-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.19-for-next
      
      xfs: fix rmap inefficiencies
      
      Reduce the performance impact of the reverse mapping btree when
      reflink is enabled by using the much faster non-overlapped btree
      lookup functions when we're searching the rmap index with a fully
      specified key.  If we find the exact record we're looking for,
      great!  We don't have to perform the full overlapped scan.  For
      filesystems with high sharing factors this reduces the xfs_scrub
      runtime by a good 15%%.
      
      This has been shown to reduce the fstests runtime for realtime rmap
      configurations by 30%%, since the lack of AGs severely limits
      scalability.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      956f1b8f
    • Dave Chinner's avatar
    • Dave Chinner's avatar
    • Dave Chinner's avatar
      xfs: validate v5 feature fields · f0f5f658
      Dave Chinner authored
      We don't check that the v4 feature flags taht v5 requires to be set
      are actually set anywhere. Do this check when we see that the
      filesystem is a v5 filesystem.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      f0f5f658
    • Dave Chinner's avatar
      xfs: set XFS_FEAT_NLINK correctly · dd0d2f97
      Dave Chinner authored
      While xfs_has_nlink() is not used in kernel, it is used in userspace
      (e.g. by xfs_db) so we need to set the XFS_FEAT_NLINK flag correctly
      in xfs_sb_version_to_features().
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      dd0d2f97
    • Dave Chinner's avatar
      xfs: validate inode fork size against fork format · 1eb70f54
      Dave Chinner authored
      xfs_repair catches fork size/format mismatches, but the in-kernel
      verifier doesn't, leading to null pointer failures when attempting
      to perform operations on the fork. This can occur in the
      xfs_dir_is_empty() where the in-memory fork format does not match
      the size and so the fork data pointer is accessed incorrectly.
      
      Note: this causes new failures in xfs/348 which is testing mode vs
      ftype mismatches. We now detect a regular file that has been changed
      to a directory or symlink mode as being corrupt because the data
      fork is for a symlink or directory should be in local form when
      there are only 3 bytes of data in the data fork. Hence the inode
      verify for the regular file now fires w/ -EFSCORRUPTED because
      the inode fork format does not match the format the corrupted mode
      says it should be in.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      1eb70f54
    • Dave Chinner's avatar
      xfs: detect self referencing btree sibling pointers · dc04db2a
      Dave Chinner authored
      To catch the obvious graph cycle problem and hence potential endless
      looping.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      dc04db2a
    • Dave Chinner's avatar
      xfs: intent item whiteouts · 0d227466
      Dave Chinner authored
      When we log modifications based on intents, we add both intent
      and intent done items to the modification being made. These get
      written to the log to ensure that the operation is re-run if the
      intent done is not found in the log.
      
      However, for operations that complete wholly within a single
      checkpoint, the change in the checkpoint is atomic and will never
      need replay. In this case, we don't need to actually write the
      intent and intent done items to the journal because log recovery
      will never need to manually restart this modification.
      
      Log recovery currently handles intent/intent done matching by
      inserting the intent into the AIL, then removing it when a matching
      intent done item is found. Hence for all the intent-based operations
      that complete within a checkpoint, we spend all that time parsing
      the intent/intent done items just to cancel them and do nothing with
      them.
      
      Hence it follows that the only time we actually need intents in the
      log is when the modification crosses checkpoint boundaries in the
      log and so may only be partially complete in the journal. Hence if
      we commit and intent done item to the CIL and the intent item is in
      the same checkpoint, we don't actually have to write them to the
      journal because log recovery will always cancel the intents.
      
      We've never really worried about the overhead of logging intents
      unnecessarily like this because the intents we log are generally
      very much smaller than the change being made. e.g. freeing an extent
      involves modifying at lease two freespace btree blocks and the AGF,
      so the EFI/EFD overhead is only a small increase in space and
      processing time compared to the overall cost of freeing an extent.
      
      However, delayed attributes change this cost equation dramatically,
      especially for inline attributes. In the case of adding an inline
      attribute, we only log the inode core and attribute fork at present.
      With delayed attributes, we now log the attr intent which includes
      the name and value, the inode core adn attr fork, and finally the
      attr intent done item. We increase the number of items we log from 1
      to 3, and the number of log vectors (regions) goes up from 3 to 7.
      Hence we tripple the number of objects that the CIL has to process,
      and more than double the number of log vectors that need to be
      written to the journal.
      
      At scale, this means delayed attributes cause a non-pipelined CIL to
      become CPU bound processing all the extra items, resulting in a > 40%
      performance degradation on 16-way file+xattr create worklaods.
      Pipelining the CIL (as per 5.15) reduces the performance degradation
      to 20%, but now the limitation is the rate at which the log items
      can be written to the iclogs and iclogs be dispatched for IO and
      completed.
      
      Even log IO completion is slowed down by these intents, because it
      now has to process 3x the number of items in the checkpoint.
      Processing completed intents is especially inefficient here, because
      we first insert the intent into the AIL, then remove it from the AIL
      when the intent done is processed. IOWs, we are also doing expensive
      operations in log IO completion we could completely avoid if we
      didn't log completed intent/intent done pairs.
      
      Enter log item whiteouts.
      
      When an intent done is committed, we can check to see if the
      associated intent is in the same checkpoint as we are currently
      committing the intent done to. If so, we can mark the intent log
      item with a whiteout and immediately free the intent done item
      rather than committing it to the CIL. We can basically skip the
      entire formatting and CIL insertion steps for the intent done item.
      
      However, we cannot remove the intent item from the CIL at this point
      because the unlocked per-cpu CIL item lists do not permit removal
      without holding the CIL context lock exclusively. Transaction commit
      only holds the context lock shared, hence the best we can do is mark
      the intent item with a whiteout so that the CIL push can release it
      rather than writing it to the log.
      
      This means we never write the intent to the log if the intent done
      has also been committed to the same checkpoint, but we'll always
      write the intent if the intent done has not been committed or has
      been committed to a different checkpoint. This will result in
      correct log recovery behaviour in all cases, without the overhead of
      logging unnecessary intents.
      
      This intent whiteout concept is generic - we can apply it to all
      intent/intent done pairs that have a direct 1:1 relationship. The
      way deferred ops iterate and relog intents mean that all intents
      currently have a 1:1 relationship with their done intent, and hence
      we can apply this cancellation to all existing intent/intent done
      implementations.
      
      For delayed attributes with a 16-way 64kB xattr create workload,
      whiteouts reduce the amount of journalled metadata from ~2.5GB/s
      down to ~600MB/s and improve the creation rate from 9000/s to
      14000/s.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      0d227466
    • Dave Chinner's avatar
      xfs: whiteouts release intents that are not in the AIL · 3512fc1e
      Dave Chinner authored
      When we release an intent that a whiteout applies to, it will not
      have been committed to the journal and so won't be in the AIL. Hence
      when we drop the last reference to the intent, we do not want to try
      to remove it from the AIL as that will trigger a filesystem
      shutdown. Hence make the removal of intents from the AIL conditional
      on them actually being in the AIL so we do the correct thing.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      3512fc1e
    • Dave Chinner's avatar
      xfs: add log item method to return related intents · c23ab603
      Dave Chinner authored
      To apply a whiteout to an intent item when an intent done item is
      committed, we need to be able to retrieve the intent item from the
      the intent done item. Add a log item op method for doing this, and
      wire all the intent done items up to it.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      c23ab603
    • Dave Chinner's avatar
      xfs: factor and move some code in xfs_log_cil.c · 22b1afc5
      Dave Chinner authored
      In preparation for adding support for intent item whiteouts.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      22b1afc5
    • Dave Chinner's avatar
      xfs: tag transactions that contain intent done items · bb7b1c9c
      Dave Chinner authored
      Intent whiteouts will require extra work to be done during
      transaction commit if the transaction contains an intent done item.
      
      To determine if a transaction contains an intent done item, we want
      to avoid having to walk all the items in the transaction to check if
      they are intent done items. Hence when we add an intent done item to
      a transaction, tag the transaction to indicate that it contains such
      an item.
      
      We don't tag the transaction when the defer ops is relogging an
      intent to move it forward in the log. Whiteouts will never apply to
      these cases, so we don't need to bother looking for them.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      bb7b1c9c
    • Dave Chinner's avatar
      xfs: add log item flags to indicate intents · f5b81200
      Dave Chinner authored
      We currently have a couple of helper functions that try to infer
      whether the log item is an intent or intent done item from the
      combinations of operations it supports.  This is incredibly fragile
      and not very efficient as it requires checking specific combinations
      of ops.
      
      We need to be able to identify intent and intent done items quickly
      and easily in upcoming patches, so simply add intent and intent done
      type flags to the log item ops flags. These are static flags to
      begin with, so intent items should have been typed like this from
      the start.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      f5b81200
    • Dave Chinner's avatar
      xfs: don't commit the first deferred transaction without intents · 5ddd658e
      Dave Chinner authored
      If the first operation in a string of defer ops has no intents,
      then there is no reason to commit it before running the first call
      to xfs_defer_finish_one(). This allows the defer ops to be used
      effectively for non-intent based operations without requiring an
      unnecessary extra transaction commit when first called.
      
      This fixes a regression in per-attribute modification transaction
      count when delayed attributes are not being used.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      5ddd658e
    • Dave Chinner's avatar
      xfs: hide log iovec alignment constraints · b2c28035
      Dave Chinner authored
      Callers currently have to round out the size of buffers to match the
      aligment constraints of log iovecs and xlog_write(). They should not
      need to know this detail, so introduce a new function to calculate
      the iovec length (for use in ->iop_size implementations). Also
      modify xlog_finish_iovec() to round up the length to the correct
      alignment so the callers don't need to do this, either.
      
      Convert the only user - inode forks - of this alignment rounding to
      use the new interface.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      b2c28035
    • Dave Chinner's avatar
      xfs: fix potential log item leak · c230a4a8
      Dave Chinner authored
      Ever since we added shadown format buffers to the log items, log
      items need to handle the item being released with shadow buffers
      attached. Due to the fact this requirement was added at the same
      time we added new rmap/reflink intents, we missed the cleanup of
      those items.
      
      In theory, this means shadow buffers can be leaked in a very small
      window when a shutdown is initiated. Testing with KASAN shows this
      leak does not happen in practice - we haven't identified a single
      leak in several years of shutdown testing since ~v4.8 kernels.
      
      However, the intent whiteout cleanup mechanism results in every
      cancelled intent in exactly the same state as this tiny race window
      creates and so if intents down clean up shadow buffers on final
      release we will leak the shadow buffer for just about every intent
      we create.
      
      Hence we start with this patch to close this condition off and
      ensure that when whiteouts start to be used we don't leak lots of
      memory.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      c230a4a8
    • Dave Chinner's avatar
      xfs: zero inode fork buffer at allocation · cb512c92
      Dave Chinner authored
      When we first allocate or resize an inline inode fork, we round up
      the allocation to 4 byte alingment to make journal alignment
      constraints. We don't clear the unused bytes, so we can copy up to
      three uninitialised bytes into the journal. Zero those bytes so we
      only ever copy zeros into the journal.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      cb512c92
  2. 28 Apr, 2022 10 commits
    • Darrick J. Wong's avatar
      xfs: rename xfs_*alloc*_log_count to _block_count · 6ed7e509
      Darrick J. Wong authored
      These functions return the maximum number of blocks that could be logged
      in a particular transaction.  "log count" is confusing since there's a
      separate concept of a log (operation) count in the reservation code, so
      let's change it to "block count" to be less confusing.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      6ed7e509
    • Darrick J. Wong's avatar
      xfs: rewrite xfs_reflink_end_cow to use intents · df2fd88f
      Darrick J. Wong authored
      Currently, the code that performs CoW remapping after a write has this
      odd behavior where it walks /backwards/ through the data fork to remap
      extents in reverse order.  Earlier, we rewrote the reflink remap
      function to use deferred bmap log items instead of trying to cram as
      much into the first transaction that we could.  Now do the same for the
      CoW remap code.  There doesn't seem to be any performance impact; we're
      just making better use of code that we added for the benefit of reflink.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      df2fd88f
    • Darrick J. Wong's avatar
      xfs: reduce transaction reservations with reflink · b037c4ee
      Darrick J. Wong authored
      Before to the introduction of deferred refcount operations, reflink
      would try to cram refcount btree updates into the same transaction as an
      allocation or a free event.  Mainline XFS has never actually done that,
      but we never refactored the transaction reservations to reflect that we
      now do all refcount updates in separate transactions.  Fix this to
      reduce the transaction reservation size even farther, so that between
      this patch and the previous one, we reduce the tr_write and tr_itruncate
      sizes by 66%.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      b037c4ee
    • Darrick J. Wong's avatar
      xfs: reduce the absurdly large log operation count · 4ecf9e7c
      Darrick J. Wong authored
      Back in the early days of reflink and rmap development I set the
      transaction reservation sizes to be overly generous for rmap+reflink
      filesystems, and a little under-generous for rmap-only filesystems.
      
      Since we don't need *eight* transaction rolls to handle three new log
      intent items, decrease the logcounts to what we actually need, and amend
      the shadow reservation computation function to reflect what we used to
      do so that the minimum log size doesn't change.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      4ecf9e7c
    • Darrick J. Wong's avatar
      xfs: report "max_resp" used for min log size computation · 918247ce
      Darrick J. Wong authored
      Move the tracepoint that computes the size of the transaction used to
      compute the minimum log size into xfs_log_get_max_trans_res so that we
      only have to compute this stuff once.
      
      Leave xfs_log_get_max_trans_res as a non-static function so that xfs_db
      can call it to report the results of the userspace computation of the
      same value to diagnose mkfs/kernel misinteractions.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      918247ce
    • Darrick J. Wong's avatar
      xfs: create shadow transaction reservations for computing minimum log size · 52d8ea4f
      Darrick J. Wong authored
      Every time someone changes the transaction reservation sizes, they
      introduce potential compatibility problems if the changes affect the
      minimum log size that we validate at mount time.  If the minimum log
      size gets larger (which should be avoided because doing so presents a
      serious risk of log livelock), filesystems created with old mkfs will
      not mount on a newer kernel; if the minimum size shrinks, filesystems
      created with newer mkfs will not mount on older kernels.
      
      Therefore, enable the creation of a shadow log reservation structure
      where we can "undo" the effects of tweaks when computing minimum log
      sizes.  These shadow reservations should never be used in practice, but
      they insulate us from perturbations in minimum log size.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      52d8ea4f
    • Darrick J. Wong's avatar
      xfs: remove a __xfs_bunmapi call from reflink · f1e6a8d7
      Darrick J. Wong authored
      This raw call isn't necessary since we can always remove a full delalloc
      extent.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      f1e6a8d7
    • Darrick J. Wong's avatar
      xfs: stop artificially limiting the length of bunmap calls · 4ed6435c
      Darrick J. Wong authored
      In commit e1a4e37c, we clamped the length of bunmapi calls on the
      data forks of shared files to avoid two failure scenarios: one where the
      extent being unmapped is so sparsely shared that we exceed the
      transaction reservation with the sheer number of refcount btree updates
      and EFI intent items; and the other where we attach so many deferred
      updates to the transaction that we pin the log tail and later the log
      head meets the tail, causing the log to livelock.
      
      We avoid triggering the first problem by tracking the number of ops in
      the refcount btree cursor and forcing a requeue of the refcount intent
      item any time we think that we might be close to overflowing.  This has
      been baked into XFS since before the original e1a4 patch.
      
      A recent patchset fixed the second problem by changing the deferred ops
      code to finish all the work items created by each round of trying to
      complete a refcount intent item, which eliminates the long chains of
      deferred items (27dad); and causing long-running transactions to relog
      their intent log items when space in the log gets low (74f4d).
      
      Because this clamp affects /any/ unmapping request regardless of the
      sharing factors of the component blocks, it degrades the performance of
      all large unmapping requests -- whereas with an unshared file we can
      unmap millions of blocks in one go, shared files are limited to
      unmapping a few thousand blocks at a time, which causes the upper level
      code to spin in a bunmapi loop even if it wasn't needed.
      
      This also eliminates one more place where log recovery behavior can
      differ from online behavior, because bunmapi operations no longer need
      to requeue.  The fstest generic/447 was created to test the old fix, and
      it still passes with this applied.
      
      Partial-revert-of: e1a4e37c ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
      Depends: 27dada07 ("xfs: change the order in which child and parent defer ops ar finished")
      Depends: 74f4d6a1 ("xfs: only relog deferred intent items if free space in the log gets low")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      4ed6435c
    • Darrick J. Wong's avatar
      xfs: count EFIs when deciding to ask for a continuation of a refcount update · c47260d4
      Darrick J. Wong authored
      A long time ago, I added to XFS the ability to use deferred reference
      count operations as part of a transaction chain.  This enabled us to
      avoid blowing out the transaction reservation when the blocks in a
      physical extent all had different reference counts because we could ask
      the deferred operation manager for a continuation, which would get us a
      clean transaction.
      
      The refcount code asks for a continuation when the number of refcount
      record updates reaches the point where we think that the transaction has
      logged enough full btree blocks due to refcount (and free space) btree
      shape changes and refcount record updates that we're in danger of
      overflowing the transaction.
      
      We did not previously count the EFIs logged to the refcount update
      transaction because the clamps on the length of a bunmap operation were
      sufficient to avoid overflowing the transaction reservation even in the
      worst case situation where every other block of the unmapped extent is
      shared.
      
      Unfortunately, the restrictions on bunmap length avoid failure in the
      worst case by imposing a maximum unmap length of ~3000 blocks, even for
      non-pathological cases.  This seriously limits performance when freeing
      large extents.
      
      Therefore, track EFIs with the same counter as refcount record updates,
      and use that information as input into when we should ask for a
      continuation.  This enables the next patch to drop the clumsy bunmap
      limitation.
      
      Depends: 27dada07 ("xfs: change the order in which child and parent defer ops ar finished")
      Depends: 74f4d6a1 ("xfs: only relog deferred intent items if free space in the log gets low")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      c47260d4
    • Darrick J. Wong's avatar
      xfs: speed up write operations by using non-overlapped lookups when possible · 1edf8056
      Darrick J. Wong authored
      Reverse mapping on a reflink-capable filesystem has some pretty high
      overhead when performing file operations.  This is because the rmap
      records for logically and physically adjacent extents might not be
      adjacent in the rmap index due to data block sharing.  As a result, we
      use expensive overlapped-interval btree search, which walks every record
      that overlaps with the supplied key in the hopes of finding the record.
      
      However, profiling data shows that when the index contains a record that
      is an exact match for a query key, the non-overlapped btree search
      function can find the record much faster than the overlapped version.
      Try the non-overlapped lookup first when we're trying to find the left
      neighbor rmap record for a given file mapping, which makes unwritten
      extent conversion and remap operations run faster if data block sharing
      is minimal in this part of the filesystem.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      1edf8056
  3. 27 Apr, 2022 3 commits
  4. 26 Apr, 2022 3 commits
  5. 21 Apr, 2022 2 commits
    • Dave Chinner's avatar
      Merge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux into xfs-5.19-for-next · a44a027a
      Dave Chinner authored
      xfs: Large extent counters
      
      The commit xfs: fix inode fork extent count overflow
      (3f8a4f1d) mentions that 10 billion
      data fork extents should be possible to create. However the
      corresponding on-disk field has a signed 32-bit type. Hence this
      patchset extends the per-inode data fork extent counter to 64 bits
      (out of which 48 bits are used to store the extent count).
      
      Also, XFS has an attribute fork extent counter which is 16 bits
      wide. A workload that,
      1. Creates 1 million 255-byte sized xattrs,
      2. Deletes 50% of these xattrs in an alternating manner,
      3. Tries to insert 400,000 new 255-byte sized xattrs
         causes the xattr extent counter to overflow.
      
      Dave tells me that there are instances where a single file has more
      than 100 million hardlinks. With parent pointers being stored in
      xattrs, we will overflow the signed 16-bits wide attribute extent
      counter when large number of hardlinks are created. Hence this
      patchset extends the on-disk field to 32-bits.
      
      The following changes are made to accomplish this,
      1. A 64-bit inode field is carved out of existing di_pad and
         di_flushiter fields to hold the 64-bit data fork extent counter.
      2. The existing 32-bit inode data fork extent counter will be used to
         hold the attribute fork extent counter.
      3. A new incompat superblock flag to prevent older kernels from mounting
         the filesystem.
      Signed-off-by: default avatarChandan Babu R <chandan.babu@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      a44a027a
    • Dave Chinner's avatar