1. 03 Jun, 2010 3 commits
    • xfs: skip writeback from reclaim context · 070ecdca
      Christoph Hellwig authored
      Allowing writeback from reclaim context causes massive problems with stack
      overflows as we can call into the writeback code which tends to be a heavy
      stack user both in the generic code and XFS from random contexts that
      perform memory allocations.
      
      Follow the example of btrfs (and in slightly different form ext4) and refuse
      to write out data from reclaim context.  This issue should really be handled
      by the VM so that we can tune better for this case, but until we get it
      sorted out there we have to hack around this in each filesystem with a
      complex writeback path.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix race in inode cluster freeing failing to stale inodes · 5b257b4a
      Dave Chinner authored
      When an inode cluster is freed, it needs to mark all inodes in memory as
      XFS_ISTALE before marking the buffer as stale. This is needed because the inodes
      have a different life cycle to the buffer, and once the buffer is torn down
      during transaction completion, we must ensure none of the inodes get written
      back (which is what XFS_ISTALE does).
      
      Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being
      marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these
      inodes either during inode reclaim or tail pushing on the AIL.  The buffer is
      read back, but no longer contains inodes and so triggers assert failures and
      shutdowns. This was reproducible with a run.dbench10 invocation from xfstests.
      
      There are two main causes of xfs_ifree_cluster() failing. The first is simple -
      it checks in-memory inodes it finds in the per-ag icache to see if they are
      clean without holding the flush lock. If they are clean it skips them
      completely. However, if an inode is flushed delwri, it will
      appear clean, but is not guaranteed to be written back until the flush lock has
      been dropped. Hence we may have raced on the clean check and the inode may
      actually be dirty. Hence always mark inodes found in memory stale before we
      check properly if they are clean.
      
      The second is more complex, and makes the first problem easier to hit.
      Basically the in-memory inode scan is done with full knowledge it can be racing
      with inode flushing and AIL tail pushing, which means that inodes that it can't
      get the flush lock on might not be attached to the buffer after the in-memory
      inode scan due to IO completion occurring. This is actually documented in the
      code as "needs better interlocking". i.e. this is a zero-day bug.
      
      Effectively, the in-memory scan must be done while the inode buffer is locked
      and IO cannot be issued on it while we do the in-memory inode scan. This
      ensures that inodes we couldn't get the flush lock on are guaranteed to be
      attached to the cluster buffer, so we can then catch all in-memory inodes and
      mark them stale.
      
      Now that the inode cluster buffer is locked before the in-memory scan is done,
      there is no need for the two-phase update of the in-memory inodes, so simplify
      the code into two loops and remove the allocation of the temporary buffer used
      to hold locked inodes across the phases.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  2. 28 May, 2010 11 commits
  3. 24 May, 2010 12 commits
    • xfs: Ensure inode allocation buffers are fully replayed · ccf7c23f
      Dave Chinner authored
      With delayed logging, we can get inode allocation buffers in the
      same transaction as inode unlink buffers. We don't currently mark inode
      allocation buffers in the log, so inode unlink buffers take
      precedence over allocation buffers.
      
      The result is that when they are combined into the same checkpoint,
      only the unlinked inode chain fields are replayed, resulting in
      uninitialised inode buffers being detected when the next inode
      modification is replayed.
      
      To fix this, we need to ensure that we do not set the inode buffer
      flag in the buffer log item format flags if the inode allocation has
      not already hit the log. To avoid requiring a change to log
      recovery, we really need to make this a modification that relies
      only on in-memory state.
      
      We can do this by checking during buffer log formatting (while the
      CIL cannot be flushed) if we are still in the same sequence when we
      commit the unlink transaction as the inode allocation transaction.
      If we are, then we do not add the inode buffer flag to the buffer
      log format item flags. This means the entire buffer will be
      replayed, not just the unlinked fields. We do this while
      CIL flushes are locked out to ensure that we don't race with the
      sequence numbers changing and hence fail to put the inode buffer
      flag in the buffer format flags when we really need to.
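
      The sequence check described above can be sketched roughly as follows.
      This is only an illustration, not the actual XFS code: the structure,
      the alloc_seq field, the BLF_INODE_BUF value and format_flags() are all
      hypothetical names, and the caller is assumed to hold whatever lock
      keeps CIL flushes out so that cur_seq cannot change underneath it.

```c
#include <stdint.h>

#define BLF_INODE_BUF 0x1u /* hypothetical value for the inode buffer flag */

/* Hypothetical in-memory state for one buffer log item. */
struct buf_log_item {
	uint64_t alloc_seq; /* CIL sequence that logged the inode allocation, 0 if none */
	unsigned flags;     /* format flags requested by the transaction */
};

/*
 * Decide which flags go into the buffer log format item. If the inode
 * allocation was logged in the checkpoint sequence that is still open,
 * drop the inode buffer flag so the whole buffer is replayed.
 */
static unsigned format_flags(const struct buf_log_item *bip, uint64_t cur_seq)
{
	unsigned flags = bip->flags;

	if ((flags & BLF_INODE_BUF) && bip->alloc_seq != 0 &&
	    bip->alloc_seq == cur_seq)
		flags &= ~BLF_INODE_BUF;
	return flags;
}
```

      Because the decision uses only in-memory state, nothing about the
      on-disk log format or recovery changes in this sketch.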
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: enable background pushing of the CIL · df806158
      Dave Chinner authored
      If we let the CIL grow without bound, it will grow large enough to violate
      recovery constraints (must be at least one complete transaction in the log at
      all times) or take forever to write out through the log buffers. Hence we need
      a check during asynchronous transactions as to whether the CIL needs to be
      pushed.
      
      We track the amount of log space the CIL consumes, so it is relatively simple
      to limit it on a pure size basis. Make the limit the minimum of just under half
      the log size (recovery constraint) or 8MB of log space (which is an awful lot
      of metadata).
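
      As a rough illustration of the size limit described above (not the
      actual kernel macro), the check could look like the sketch below. The
      names cil_space_limit() and cil_needs_push() are hypothetical, and
      plain "half the log" is used as a stand-in for "just under half".

```c
#include <stdint.h>

/* Cap taken from the commit message: 8MB of log space. */
#define CIL_SPACE_CAP (8u * 1024u * 1024u)

/* Limit is the smaller of (roughly) half the log size and the 8MB cap. */
static uint32_t cil_space_limit(uint32_t log_size)
{
	uint32_t half = log_size / 2;

	return half < CIL_SPACE_CAP ? half : CIL_SPACE_CAP;
}

/* Checked on each asynchronous commit; nonzero means the CIL should be pushed. */
static int cil_needs_push(uint32_t space_used, uint32_t log_size)
{
	return space_used > cil_space_limit(log_size);
}
```

      With a 10MB log the limit in this sketch is 5MB (the recovery
      constraint dominates); with a 64MB log it is the 8MB cap.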
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: forced unmounts need to push the CIL · 9da1ab18
      Dave Chinner authored
      If the filesystem is being shut down and there is no log error,
      the current code forces out the current log buffers. This code now needs
      to push the CIL before it forces out the log buffers to achieve the same
      result.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Introduce delayed logging core code · 71e330b5
      Dave Chinner authored
      The delayed logging code only changes in-memory structures and as
      such can be enabled and disabled with a mount option. Add the mount
      option and emit a warning that this is an experimental feature that
      should not be used in production yet.
      
      We also need infrastructure to track committed items that have not
      yet been written to the log. This is what the Committed Item List
      (CIL) is for.
      
      The log item also needs to be extended to track the current log
      vector, the associated memory buffer and its location in the Committed
      Item List. Extend the log item and log vector structures to enable
      this tracking.
      
      To maintain the current log format for transactions with delayed
      logging, we need to introduce a checkpoint transaction and a context
      for tracking each checkpoint from initiation to transaction
      completion.  This includes adding a log ticket for tracking the log
      space required/used by the context checkpoint.
      
      To track all the changes we need an io vector array per log item,
      rather than a single array for the entire transaction. Using the new
      log vector structure for this requires two passes - the first to
      allocate the log vector structures and chain them together, and the
      second to fill them out.  This log vector chain can then be passed
      to the CIL for formatting, pinning and insertion into the CIL.
      
      Formatting of the log vector chain is relatively simple - it's just
      a loop over the iovecs on each log vector, but it is made slightly
      more complex because we re-write the iovec after the copy to point
      back at the memory buffer we just copied into.
      
      This code also needs to pin log items. If the log item is not
      already tracked in this checkpoint context, then it needs to be
      pinned. Otherwise it is already pinned and we don't need to pin it
      again.
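
      The pinning rule above amounts to "pin once per checkpoint context".
      A minimal sketch, assuming a hypothetical per-item pointer to the
      context that currently tracks it (the real code may track membership
      differently):

```c
#include <stddef.h>

/* Hypothetical checkpoint context; only its identity matters here. */
struct checkpoint_ctx {
	int sequence;
};

/* Hypothetical log item state relevant to pinning. */
struct log_item {
	struct checkpoint_ctx *ctx; /* context tracking this item, or NULL */
	int pin_count;
};

/* Insert an item into a checkpoint, pinning it only on first insertion. */
static void cil_track_item(struct checkpoint_ctx *ctx, struct log_item *item)
{
	if (item->ctx == ctx)
		return; /* already tracked in this checkpoint, so already pinned */
	item->pin_count++;
	item->ctx = ctx;
}
```

      Relogging the same item within one checkpoint leaves the pin count
      unchanged; only a new checkpoint context adds another pin.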
      
      The only other complexity is calculating the amount of new log space
      the formatting has consumed. This needs to be accounted to the
      transaction in progress, and the accounting is made more complex
      because we need also to steal space from it for log metadata in the
      checkpoint transaction. Calculate all this at insert time and update
      all the tickets, counters, etc correctly.
      
      Once we've formatted all the log items in the transaction, attach
      the busy extents to the checkpoint context so the busy extents live
      until checkpoint completion and can be processed at that point in
      time. Transactions can then be freed at this point in time.
      
      Now we need to issue checkpoints - we are tracking the amount of log space
      used by the items in the CIL, so we can trigger background checkpoints when the
      space usage gets to a certain threshold. Otherwise, checkpoints need to be
      triggered when a log synchronisation point is reached - a log force event.
      
      Because the log write code already handles chained log vectors, writing the
      transaction is trivial, too. Construct a transaction header, add it
      to the head of the chain and write it into the log, then issue a
      commit record write. Then we can release the checkpoint log ticket
      and attach the context to the log buffer so it can be called during
      IO completion to complete the checkpoint.
      
      We also need to allow for synchronising multiple in-flight
      checkpoints. This is needed for two things - the first is to ensure
      that checkpoint commit records appear in the log in the correct
      sequence order (so they are replayed in the correct order). The
      second is so that xfs_log_force_lsn() operates correctly and only
      flushes and/or waits for the specific sequence it was provided with.
      
      To do this we need a wait variable and a list tracking the
      checkpoint commits in progress. We can walk this list and wait for
      the checkpoints to change state or complete easily, and this provides
      the necessary synchronisation for correct operation in both cases.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Delayed logging design documentation · a9a745da
      Dave Chinner authored
      Document the design of the delayed logging implementation. This
      includes assumptions made, dead ends followed, the reasoning behind
      the structuring of the code, the layout of various structures, how
      things fit together, traps and pit-falls avoided, etc. This is all
      too much to document in the code itself, so do it in a separate
      file.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Improve scalability of busy extent tracking · ed3b4d6c
      Dave Chinner authored
      When we free a metadata extent, we record it in the per-AG busy
      extent array so that it is not re-used before the freeing
      transaction hits the disk. This array is fixed size, so when it
      overflows we make further allocation transactions synchronous
      because we cannot track more freed extents until those transactions
      hit the disk and are completed. Under heavy mixed allocation and
      freeing workloads with large log buffers, we can overflow this array
      quite easily.
      
      Further, the array is sparsely populated, which means that inserts
      need to search for a free slot, and array searches often have to
      search many more slots than are actually used to check all the
      busy extents. Quite inefficient, really.
      
      To enable this aspect of extent freeing to scale better, we need
      a structure that can grow dynamically. While in other areas of
      XFS we have used radix trees, the extents being freed are at random
      locations on disk so are better suited to being indexed by an rbtree.
      
      So, use a per-AG rbtree indexed by block number to track busy
      extents.  This incurs a memory allocation when marking an extent
      busy, but should not occur too often in low memory situations. This
      should scale to an arbitrary number of extents so should not be a
      limitation for features such as in-memory aggregation of
      transactions.
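
      To illustrate the idea of a dynamically growing tree indexed by block
      number, here is a small sketch. It uses a plain unbalanced binary
      search tree as a stand-in for the kernel rbtree, and every name in it
      is hypothetical. It assumes the caller never inserts overlapping
      extents, which is what makes the overlap search a simple descent.

```c
#include <stdint.h>
#include <stdlib.h>

/* One busy extent: blocks [start, start + len). */
struct busy_extent {
	uint64_t start;
	uint64_t len;
	struct busy_extent *left, *right;
};

/* Mark an extent busy. Allocates a node; returns 0, or -1 if out of memory. */
static int busy_insert(struct busy_extent **root, uint64_t start, uint64_t len)
{
	struct busy_extent *n = calloc(1, sizeof(*n));

	if (!n)
		return -1;
	n->start = start;
	n->len = len;
	/* Descend by start block until an empty link is found. */
	while (*root)
		root = start < (*root)->start ? &(*root)->left : &(*root)->right;
	*root = n;
	return 0;
}

/* Return a busy extent overlapping [start, start + len), or NULL if none. */
static struct busy_extent *busy_search(struct busy_extent *node,
				       uint64_t start, uint64_t len)
{
	while (node) {
		if (start + len <= node->start)
			node = node->left;   /* query lies entirely below this extent */
		else if (start >= node->start + node->len)
			node = node->right;  /* query lies entirely above this extent */
		else
			return node;         /* ranges intersect */
	}
	return NULL;
}
```

      Unlike the fixed-size array, only extents that are actually busy
      occupy memory, and a lookup never scans unused slots.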
      
      However, there are still situations where we can't avoid allocating
      busy extents (such as allocation from the AGFL). To minimise the
      overhead of such occurrences, we need to avoid doing a synchronous
      log force while holding the AGF locked to ensure that the previous
      transactions are safely on disk before we use the extent. We can do
      this by marking the transaction doing the allocation as synchronous
      rather than issuing a log force.
      
      Because of the locking involved and the ordering of transactions,
      the synchronous transaction provides the same guarantees as a
      synchronous log force because it ensures that all the prior
      transactions are already on disk when the synchronous transaction
      hits the disk. i.e. it preserves the free->allocate order of the
      extent correctly in recovery.
      
      By doing this, we avoid holding the AGF locked while log writes are
      in progress, hence reducing the length of time the lock is held and
      therefore we increase the rate at which we can allocate and free
      from the allocation group, thereby increasing overall throughput.
      
      The only problem with this approach is that when a metadata buffer is
      marked stale (e.g. a directory block is removed), the buffer remains
      pinned and locked until the log goes to disk. The issue here is that
      if that stale buffer is reallocated in a subsequent transaction, the
      attempt to lock that buffer in the transaction will hang waiting for
      the log to go to disk to unlock and unpin the buffer. Hence if
      someone tries to lock a pinned, stale, locked buffer we need to
      push on the log to get it unlocked ASAP. Effectively we are trading
      off a guaranteed log force for a much less common trigger for log
      force to occur.
      
      Ideally we should not reallocate busy extents. That is a much more
      complex fix to the problem as it involves direct intervention in the
      allocation btree searches in many places. This is left to a future
      set of modifications.
      
      Finally, now that we track busy extents in allocated memory, we
      don't need the descriptors in the transaction structure to point to
      them. We can replace the complex busy chunk infrastructure with a
      simple linked list of busy extents. This allows us to remove a large
      chunk of code, making the overall change a net reduction in code
      size.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: make the log ticket ID available outside the log infrastructure · 955833cf
      Dave Chinner authored
      The ticket ID is needed to uniquely identify transactions when doing busy
      extent matching. Delayed logging changes the lifecycle of busy extents with
      respect to the transaction structure lifecycle. Hence we can no longer use
      the transaction structure as a means of determining the owner of the busy
      extent as it may be freed and reused while the busy extent is still active.
      
      This commit provides the infrastructure to access the xlog_tid_t held in the
      ticket from a transaction handle. This avoids the need for callers to peek
      into the transaction and log structures to find this out.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: clean up log ticket overrun debug output · 169a7b07
      Dave Chinner authored
      Push the error message output when a ticket overrun is detected
      into the ticket printing functions. Also remove the debug version
      of the code as the production version will still panic just as
      effectively on a debug kernel via the panic mask being set.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Clean up XFS_BLI_* flag namespace · c1155410
      Dave Chinner authored
      Clean up the buffer log format (XFS_BLI_*) flags because they have a
      polluted namespace. The XFS_BLI_ prefix is used for both in-memory
      and on-disk flag fields, but have overlapping values for different
      flags. Rename the buffer log format flags to use the XFS_BLF_*
      prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed
      flags.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: modify buffer item reference counting · 64fc35de
      Dave Chinner authored
      The buffer log item reference counts used to take references for every
      transaction, similar to the pin counting. This is symmetric (like the
      pin/unpin) with respect to transaction completion, but with delayed logging
      becomes asymmetric as the pinning becomes asymmetric w.r.t. transaction
      completion.
      
      To make both cases the same, allow the buffer pinning to take a reference to
      the buffer log item and always drop the reference the transaction has on it
      when being unlocked. This is balanced correctly because the unpin operation
      always drops a reference to the log item. Hence reference counting becomes
      symmetric w.r.t. item pinning as well as w.r.t active transactions and as a
      result the reference counting model remains consistent between normal and
      delayed logging.
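
      The reference counting model described above can be sketched as four
      paired operations. This is an illustrative model only; the structure
      and function names are hypothetical and carry none of the locking the
      real code needs.

```c
/* Hypothetical buffer log item with only the two counters of interest. */
struct buf_log_item {
	int refcount;
	int pin_count;
};

/* A transaction joining the buffer takes a reference... */
static void bli_trans_join(struct buf_log_item *bip)
{
	bip->refcount++;
}

/* ...and always drops it again when the buffer is unlocked. */
static void bli_trans_unlock(struct buf_log_item *bip)
{
	bip->refcount--;
}

/* Pinning takes its own reference on the log item. */
static void bli_pin(struct buf_log_item *bip)
{
	bip->pin_count++;
	bip->refcount++;
}

/* Unpinning always drops the reference that the pin took. */
static void bli_unpin(struct buf_log_item *bip)
{
	bip->pin_count--;
	bip->refcount--;
}
```

      In this model the count returns to zero whether each transaction pins
      the item (normal logging) or only the first transaction in a
      checkpoint does (delayed logging), because each pin is balanced by an
      unpin and each join by an unlock.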
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: allow log ticket allocation to take allocation flags · 3383ca57
      Dave Chinner authored
      Delayed logging currently requires ticket allocation to succeed, so
      we need to be able to sleep on allocation. It also should not allow
      memory allocation to recurse into the filesystem. Hence we need to
      pass allocation flags directing the type of allocation the caller
      requires.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Don't reuse the same transaction ID for duplicated transactions. · 524ee36f
      Dave Chinner authored
      The transaction ID is written into the log as the unique identifier
      for transactions during recovery. When duplicating a transaction, we
      reuse the log ticket, which means it has the same transaction ID as
      the previous transaction.
      
      Rather than regenerating a random transaction ID for the duplicated
      transaction, just add one to the current ID so that duplicated
      transactions can be easily spotted in the log and during recovery
      during problem diagnosis.
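
      In code terms the change is tiny. A minimal sketch, with a
      hypothetical ticket structure and helper name standing in for the
      real ones:

```c
#include <stdint.h>

/* Hypothetical log ticket holding the transaction ID written to the log. */
struct log_ticket {
	uint32_t tid;
};

/* When a transaction is duplicated and the ticket reused, bump the ID by one. */
static void ticket_new_tid_for_dup(struct log_ticket *tic)
{
	tic->tid++;
}
```

      A chain of duplicated transactions then shows up in the log as
      consecutive IDs rather than as one repeated ID.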
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
  4. 19 May, 2010 14 commits