  1. 09 Apr, 2018 1 commit
    • xfs: remove filestream item xfs_inode reference · 7fcd3efa
      Christoph Hellwig authored
      The filestreams allocator stores an xfs_fstrm_item structure in the MRU to
      cache inode number to agno mappings for a particular length of time.  Each
      xfs_fstrm_item contains the internal MRU structure, an inode pointer and
      agno value.
      
      The inode pointer stored in the xfs_fstrm_item is not referenced, however,
      which means the inode itself can be removed and reclaimed before the MRU
      item is freed. If this occurs, xfs_fstrm_free_func() can access freed or
      unrelated memory through xfs_fstrm_item->ip and crash.
      
      The obvious solution is to grab an inode reference for xfs_fstrm_item.
      The filestream mechanism only actually uses the inode pointer as a means
      to access the xfs_mount, however.  Rather than add unnecessary
      complexity, simplify the implementation to store an xfs_mount pointer in
      struct xfs_mru_cache, and pass it to the free callback.  This also
      requires updating the tracepoint class to take the associated data
      via parameters rather than from the inode, plus a minor hack that
      peeks at the MRU key to establish the inode number at free time.
      
      Based on debugging work and an earlier patch from Brian Foster, who
      also wrote most of this changelog.
      Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
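
The idea in the filestream commit above can be sketched in plain userspace C. This is a toy model, not the XFS code: every name here (`fake_mount`, `mru_cache`, `mru_item`, `fstrm_item`, `mru_insert`, `mru_flush`) is hypothetical. It only illustrates a cache that owns one context pointer and hands it to the free callback, so the cached item does not need to carry an inode pointer and the inode number is read back from the key.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the mount structure. */
struct fake_mount {
    unsigned int items_freed;   /* number of free callbacks seen */
    uint64_t last_freed_key;    /* key (inode number) seen at free time */
};

struct mru_item {
    uint64_t key;               /* inode number used as the cache key */
    struct mru_item *next;
};

/* The free callback gets the cache's context pointer plus the item. */
typedef void (*mru_free_fn)(void *data, struct mru_item *item);

struct mru_cache {
    struct mru_item *head;
    void *data;                 /* context pointer, set once at creation */
    mru_free_fn free_fn;
};

/* Cached item: embedded MRU item and an AG number, no inode pointer. */
struct fstrm_item {
    struct mru_item mru;        /* must stay the first member */
    uint32_t agno;
};

static void fstrm_free(void *data, struct mru_item *item)
{
    struct fake_mount *mp = data;

    /* The inode number comes from the key, not from a cached inode. */
    mp->items_freed++;
    mp->last_freed_key = item->key;
    free((struct fstrm_item *)item); /* valid: mru is the first member */
}

static int mru_insert(struct mru_cache *c, uint64_t key, uint32_t agno)
{
    struct fstrm_item *it = malloc(sizeof(*it));

    if (!it)
        return -1;
    it->mru.key = key;
    it->agno = agno;
    it->mru.next = c->head;
    c->head = &it->mru;
    return 0;
}

/* Free every item, newest first, passing the context to the callback. */
static void mru_flush(struct mru_cache *c)
{
    while (c->head) {
        struct mru_item *it = c->head;

        c->head = it->next;
        c->free_fn(c->data, it);
    }
}
```

Because the callback never dereferences an inode, it does not matter in this model whether the inode behind a key has already been reclaimed when the item is freed.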
  2. 03 Apr, 2018 1 commit
    • xfs: fix intent use-after-free on abort · 0612d116
      Dave Chinner authored
      When an intent is aborted during its initial commit through
      xfs_defer_trans_abort(), there is a use-after-free. The current
      report is for a RUI through this path in generic/388:
      
       Freed by task 6274:
        __kasan_slab_free+0x136/0x180
        kmem_cache_free+0xe7/0x4b0
        xfs_trans_free_items+0x198/0x2e0
        __xfs_trans_commit+0x27f/0xcc0
        xfs_trans_roll+0x17b/0x2a0
        xfs_defer_trans_roll+0x6ad/0xe60
        xfs_defer_finish+0x2a6/0x2140
        xfs_alloc_file_space+0x53a/0xf90
        xfs_file_fallocate+0x5c6/0xac0
        vfs_fallocate+0x2f5/0x930
        ioctl_preallocate+0x1dc/0x320
        do_vfs_ioctl+0xfe4/0x1690
      
      The problem is that the RUI has two active references - one in the
      current transaction, and another held by the defer_ops structure
      that is passed to the RUD (intent done) so that both the intent and
      the intent done structures are freed on commit of the intent done.
      
      Hence during abort, we need to release the intent item, because the
      defer_ops reference is released separately via ->abort_intent
      callback. Fix all the intent code to do this correctly.
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
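
The two-reference rule described in the intent commit above can be modelled with a small reference count. This is a toy sketch under stated assumptions, not the kernel implementation: `toy_intent`, `toy_intent_release`, `toy_trans_abort` and `toy_defer_abort_intent` are hypothetical names, and setting a flag stands in for actually freeing the object.

```c
#include <stdbool.h>

struct toy_intent {
    int refcount;   /* 2 at creation: one for the transaction, one for defer_ops */
    bool freed;     /* stands in for the object having been freed */
};

static void toy_intent_init(struct toy_intent *it)
{
    it->refcount = 2;
    it->freed = false;
}

/* Drop one reference; the object goes away only on the last drop. */
static void toy_intent_release(struct toy_intent *it)
{
    if (--it->refcount == 0)
        it->freed = true;
}

/* Transaction abort path: drop only the transaction's reference. */
static void toy_trans_abort(struct toy_intent *it)
{
    toy_intent_release(it);
}

/* Abort-intent callback path: drop the reference held by defer_ops. */
static void toy_defer_abort_intent(struct toy_intent *it)
{
    toy_intent_release(it);
}
```

In this model the object survives the transaction abort and is freed only once the second holder lets go, which is the ordering the commit message asks for.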
  3. 02 Apr, 2018 1 commit
  4. 29 Mar, 2018 1 commit
  5. 26 Mar, 2018 1 commit
  6. 24 Mar, 2018 17 commits
    • xfs: remove dead inode version setting code · fa4493f0
      Dave Chinner authored
      We can only get into the branch if CRCs are enabled, so there's no
      need to check inside the branch for CRCs being enabled....
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: catch inode allocation state mismatch corruption · ee457001
      Dave Chinner authored
      We recently came across a V4 filesystem causing memory corruption
      due to a newly allocated inode being set up twice and being added to
      the superblock inode list twice. From code inspection, the only way
      this could happen is if a newly allocated inode was not marked as
      free on disk (i.e. di_mode wasn't zero).
      
      Running the metadump on an upstream debug kernel fails during inode
      allocation like so:
      
      XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inode.c, line: 838
       ------------[ cut here ]------------
      kernel BUG at fs/xfs/xfs_message.c:114!
      invalid opcode: 0000 [#1] PREEMPT SMP
      CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      RIP: 0010:assfail+0x28/0x30
      RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
      RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
      RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
      RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
      R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
      FS:  00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
      Call Trace:
       xfs_ialloc+0x383/0x570
       xfs_dir_ialloc+0x6a/0x2a0
       xfs_create+0x412/0x670
       xfs_generic_create+0x1f7/0x2c0
       ? capable_wrt_inode_uidgid+0x3f/0x50
       vfs_mkdir+0xfb/0x1b0
       SyS_mkdir+0xcf/0xf0
       do_syscall_64+0x73/0x1a0
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      Extracting the inode number we crashed on from an event trace and
      looking at it with xfs_db:
      
      xfs_db> inode 184452204
      xfs_db> p
      core.magic = 0x494e
      core.mode = 0100644
      core.version = 2
      core.format = 2 (extents)
      core.nlinkv2 = 1
      core.onlink = 0
      .....
      
      Confirms that it is not a free inode on disk. xfs_repair
      also trips over this inode:
      
      .....
      zero length extent (off = 0, fsbno = 0) in ino 184452204
      correcting nextents for inode 184452204
      bad attribute fork in inode 184452204, would clear attr fork
      bad nblocks 1 for inode 184452204, would reset to 0
      bad anextents 1 for inode 184452204, would reset to 0
      imap claims in-use inode 184452204 is free, would correct imap
      would have cleared inode 184452204
      .....
      disconnected inode 184452204, would move to lost+found
      
      And so we have a situation where the directory structure and the
      inobt think the inode is free, but the inode on disk thinks it is
      still in use. Where this corruption came from is not possible to
      diagnose, but we can detect it and prevent the kernel from oopsing
      on lookup. The reproducer now results in:
      
      $ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
      mkdir: cannot create directory ‘/mnt/scratch/00’: File exists
      mkdir: cannot create directory ‘/mnt/scratch/01’: File exists
      mkdir: cannot create directory ‘/mnt/scratch/03’: Structure needs cleaning
      mkdir: cannot create directory ‘/mnt/scratch/04’: Input/output error
      mkdir: cannot create directory ‘/mnt/scratch/05’: Input/output error
      ....
      
      And this corruption shutdown:
      
      [   54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not marked free on disk
      [   54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x425/0x670
      [   54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #443
      [   54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [   54.852859] Call Trace:
      [   54.853531]  dump_stack+0x85/0xc5
      [   54.854385]  xfs_trans_cancel+0x197/0x1c0
      [   54.855421]  xfs_create+0x425/0x670
      [   54.856314]  xfs_generic_create+0x1f7/0x2c0
      [   54.857390]  ? capable_wrt_inode_uidgid+0x3f/0x50
      [   54.858586]  vfs_mkdir+0xfb/0x1b0
      [   54.859458]  SyS_mkdir+0xcf/0xf0
      [   54.860254]  do_syscall_64+0x73/0x1a0
      [   54.861193]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [   54.862492] RIP: 0033:0x7fb73bddf547
      [   54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
      [   54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73bddf547
      [   54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffdaa55449a
      [   54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a8670dd0
      [   54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 00000000000001ff
      [   54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffdaa553500
      [   54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1024 of file fs/xfs/xfs_trans.c.  Return address = ffffffff814cd050
      [   54.882790] XFS (loop0): Corruption of in-memory data detected.  Shutting down filesystem
      [   54.884597] XFS (loop0): Please umount the filesystem and rectify the problem(s)
      
      Note that this crash is only possible on V4 filesystems or V5
      filesystems mounted with the ikeep mount option. For all other V5
      filesystems, this problem cannot occur because we don't read inodes
      we are allocating from disk - we simply overwrite them with the new
      inode information.
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
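
The detection described in the inode allocation commit above amounts to comparing what the caller expects (a free inode) with what the on-disk inode claims. The following is a minimal sketch of that comparison, assuming a simplified inode with only a mode field; `toy_dinode` and `toy_check_inode_state` are hypothetical names and not the kernel functions. "Structure needs cleaning" in the reproducer output is the message for `EUCLEAN`, which the sketch uses for the corruption case.

```c
#include <errno.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117             /* fallback for platforms without EUCLEAN */
#endif
#define TOY_EFSCORRUPTED EUCLEAN

/* Hypothetical, heavily reduced on-disk inode. */
struct toy_dinode {
    uint16_t di_mode;           /* zero means the inode is free on disk */
};

/*
 * want_free != 0: the caller is allocating this inode, so it must be
 * free on disk. want_free == 0: the caller is looking up an existing
 * inode, so it must be in use.
 */
static int toy_check_inode_state(const struct toy_dinode *dip, int want_free)
{
    if (want_free) {
        if (dip->di_mode != 0)
            return -TOY_EFSCORRUPTED;   /* "free" inode still in use */
        return 0;
    }
    if (dip->di_mode == 0)
        return -ENOENT;                 /* lookup of a free inode */
    return 0;
}
```

With the inode from the `xfs_db` dump (`core.mode = 0100644`), an allocation attempt in this model returns the corruption error instead of proceeding to set the inode up a second time.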
    • xfs: xfs_scrub_iallocbt_xref_rmap_inodes should use xref_set_corrupt · b83e4c3c
      Darrick J. Wong authored
      In xfs_scrub_iallocbt_xref_rmap_inodes we're checking inodes against
      rmap records, so we should use xfs_scrub_btree_xref_set_corrupt if we
      encounter discrepancies here so that we know that it's a cross
      referencing error, not necessarily a corruption in the inobt itself.
      
      The userspace xfs_scrub program will try to repair outright corruptions
      in the agi/inobt prior to phase 3 so that the inode scan will proceed.
      If only a cross-referencing error is noted, the repair program defers
      the repair attempt until it can check the other space metadata at least
      once.
      
      It is therefore essential that the inobt scrubber can correctly
      distinguish between corruptions and "unable to cross-reference something
      else with this inobt".  The same reasoning applies to "xfs: record inode
      buf errors as a xref error in inobt scrubber".
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: flag inode corruption if parent ptr doesn't get us a real inode · 5927268f
      Darrick J. Wong authored
      If a directory's parent inode pointer doesn't point to an inode, the
      directory should be flagged as corrupt.  Enable IGET_UNTRUSTED here so
      that _iget will return -EINVAL if the inobt does not confirm that the
      inode is present and allocated and we can flag the directory corruption.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: don't accept inode buffers with suspicious unlinked chains · 6a96c565
      Darrick J. Wong authored
      When we're verifying inode buffers, sanity-check the unlinked pointer.
      We don't want to run the risk of trying to purge something that's
      obviously broken.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: move inode extent size hint validation to libxfs · 8bb82bc1
      Darrick J. Wong authored
      Extent size hint validation is used by scrub to decide if there's an
      error, and it will be used by repair to decide to remove the hint.
      Since these use the same validation functions, move them to libxfs.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: record inode buf errors as a xref error in inobt scrubber · 1b44a6ae
      Darrick J. Wong authored
      During the inode btree scrubs we try to confirm the freemask bits
      against the inode records.  If the inode buffer read fails, this is a
      cross-referencing error, not a corruption of the inode btree itself.
      Use the xref_process_error call here.  Found via core.version middlebit
      fuzz in xfs/415.
      
      The userspace xfs_scrub program will try to repair outright corruptions
      in the agi/inobt prior to phase 3 so that the inode scan will proceed.
      If only a cross-referencing error is noted, the repair program defers
      the repair attempt until it can check the other space metadata at least
      once.
      
      It is therefore essential that the inobt scrubber can correctly
      distinguish between corruptions and "unable to cross-reference something
      else with this inobt".
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: remove xfs_buf parameter from inode scrub methods · 7e56d9ea
      Darrick J. Wong authored
      Now that we no longer do raw inode buffer scrubbing, the bp parameter is
      no longer used anywhere we're dealing with an inode, so remove it and
      all the useless NULL parameters that go with it.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: inode scrubber shouldn't bother with raw checks · d0018ad8
      Darrick J. Wong authored
      The inode scrubber tries to _iget the inode prior to running checks.
      If that _iget call fails with corruption errors that's an automatic
      fail, regardless of whether it was the inode buffer read verifier,
      the ifork verifier, or the ifork formatter that errored out.
      
      Therefore, get rid of the raw mode scrub code because it's not needed.
      Found by trying to fix some test failures in xfs/379 and xfs/415.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: bmap scrubber should do rmap xref with bmap for sparse files · 5e777b62
      Darrick J. Wong authored
      When we're scanning an extent mapping inode fork, ensure that every rmap
      record for this ifork has a corresponding bmbt record too.  This
      (mostly) provides the ability to cross-reference rmap records with bmap
      data.  The rmap scrubber cannot do the xref on its own because that
      requires taking an ilock with the agf lock held, which violates our
      locking order rules (inode, then agf).
      
      Note that, due to the increased complexity, we only do this for forks
      that are in btree format, or for forks that should have data but
      suspiciously have zero extents, because the inode could have just had
      its iforks zapped by the inode repair code and now we need to reclaim
      the old extents.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: refactor inode buffer verifier error logging · 6edb1810
      Darrick J. Wong authored
      When the inode buffer verifier encounters an error, it's much more
      helpful to print a buffer from the offending inode instead of just the
      start of the inode chunk buffer.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: refactor inode verifier error logging · 90a58f95
      Darrick J. Wong authored
      Refactor some of the inode verifier failure logging call sites to use
      the new xfs_inode_verifier_error method which dumps the offending buffer
      as well as the code location of the failed check.  This trims the
      output, makes it clearer to the admin that repair must be run, and gives
      the developers more details to work from.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: refactor bmap record validation · 30b0984d
      Darrick J. Wong authored
      Refactor the bmap validator into a more complete helper that looks for
      extents that run off the end of the device, overflow into the next AG,
      or have invalid flag states.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
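
The three checks named in the bmap validation commit above can be illustrated with a small standalone validator. This is a sketch under simplifying assumptions, not the libxfs helper: block numbers are treated as plain linear device block numbers rather than XFS's AG-encoded form, `agblocks` is assumed to be non-zero, and all names (`toy_geom`, `toy_extent`, `toy_extent_valid`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum toy_ext_state { TOY_EXT_NORM = 0, TOY_EXT_UNWRITTEN = 1 };

struct toy_geom {
    uint64_t dblocks;   /* total blocks on the device */
    uint64_t agblocks;  /* blocks per allocation group, must be non-zero */
};

struct toy_extent {
    uint64_t start;     /* first block, linear device block number */
    uint64_t len;       /* length in blocks */
    int state;          /* one of enum toy_ext_state if valid */
};

static bool toy_extent_valid(const struct toy_geom *g,
                             const struct toy_extent *e)
{
    /* Assumption of this sketch: an empty extent is never valid. */
    if (e->len == 0)
        return false;

    /* Runs off the end of the device; written to avoid overflow. */
    if (e->start >= g->dblocks || e->len > g->dblocks - e->start)
        return false;

    /* First and last block must land in the same allocation group. */
    if (e->start / g->agblocks != (e->start + e->len - 1) / g->agblocks)
        return false;

    /* Only known flag states are accepted. */
    if (e->state != TOY_EXT_NORM && e->state != TOY_EXT_UNWRITTEN)
        return false;

    return true;
}
```

Keeping all checks in one helper means every caller rejects the same set of bad records, which is the point of the refactoring described in the commit.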
    • xfs: sanity-check the unused space before trying to use it · 6915ef35
      Darrick J. Wong authored
      In xfs_dir2_data_use_free, we examine on-disk metadata and ASSERT if
      it doesn't make sense.  Since a carefully crafted fuzzed image can cause
      the kernel to crash after blowing a bunch of assertions, let's move
      those checks into a validator function and rig everything up to return
      EFSCORRUPTED to userspace.  Found by lastbit fuzzing ltail.bestcount via
      xfs/391.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: detect agfl count corruption and reset agfl · a27ba260
      Brian Foster authored
      The struct xfs_agfl v5 header was originally introduced with
      unexpected padding that caused the AGFL to operate with one less
      slot than intended. The header has since been packed, but the fix
      left an incompatibility for users who upgrade from an old kernel
      with the unpacked header to a newer kernel with the packed header
      while the AGFL happens to wrap around the end. The newer kernel
      recognizes one extra slot at the physical end of the AGFL that the
      previous kernel did not. The new kernel will eventually attempt to
      allocate a block from that slot, which contains invalid data, and
      cause a crash.
      
      This condition can be detected by comparing the active range of the
      AGFL to the count. While this detects a padding mismatch, it can
      also trigger false positives for unrelated flcount corruption. Since
      we cannot distinguish a size mismatch due to padding from unrelated
      corruption, we can't trust the AGFL enough to simply repopulate the
      empty slot.
      
      Instead, avoid unnecessarily complex detection logic and use a
      solution that can handle any form of flcount corruption that slips
      through read verifiers: distrust the entire AGFL and reset it to an
      empty state. Any valid blocks within the AGFL are intentionally
      leaked. This requires xfs_repair to rectify (which was already
      necessary based on the state the AGFL was found in). The reset
      mitigates the side effect of the padding mismatch problem from a
      filesystem crash to a free space accounting inconsistency. The
      generic approach also means that this patch can be safely backported
      to kernels with or without a packed struct xfs_agfl.
      
      Check the AGF for an invalid freelist count on initial read from
      disk. If detected, set a flag on the xfs_perag to indicate that a
      reset is required before the AGFL can be used. In the first
      transaction that attempts to use a flagged AGFL, reset it to empty,
      warn the user about the inconsistency and allow the freelist fixup
      code to repopulate the AGFL with new blocks. The xfs_perag flag is
      cleared to eliminate the need for repeated checks on each block
      allocation operation.
      
      This allows kernels that include the packing fix commit 96f859d5
      ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
      to handle older unpacked AGFL formats without a filesystem crash.
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chiluk <chiluk+linuxxfs@indeed.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
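
The detection step in the AGFL commit above, "comparing the active range of the AGFL to the count", can be sketched as a standalone function. This is an illustration only, not the kernel code: `toy_agfl_needs_reset` and its parameters are hypothetical, and the slot counts 118 and 119 used in the usage note are example values for the unpacked and packed sizes.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * first/last: indices of the first and last active AGFL slots.
 * count:      number of free-list blocks recorded in the AGF.
 * size:       number of slots this kernel believes the AGFL has.
 *
 * Returns true when the recorded count cannot be trusted.
 */
static bool toy_agfl_needs_reset(uint32_t first, uint32_t last,
                                 uint32_t count, uint32_t size)
{
    uint32_t active;

    /* Indices or count outside the list are inconsistent outright. */
    if (first >= size || last >= size || count > size)
        return true;

    if (count == 0)
        active = 0;
    else if (last >= first)
        active = last - first + 1;          /* contiguous range */
    else
        active = size - first + last + 1;   /* range wraps past the end */

    return active != count;
}
```

For a wrapped list written with 118 slots (`first = 116`, `last = 1`, `count = 4`), a reader that assumes 119 slots computes an active range of 5 and flags the list, while a non-wrapped list gives the same answer under either size. That matches the commit's observation that the mismatch only shows up while the AGFL wraps around the end.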
    • xfs: unwind the try_again loop in xfs_log_force · 3e4da466
      Christoph Hellwig authored
      Instead split out a __xfs_log_force_lsn helper that gets called again
      with the already_slept flag set to true in case we had to sleep.
      
      This prepares for aio_fsync support.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: refactor xfs_log_force_lsn · 93806299
      Christoph Hellwig authored
      Use the smallest possible loop as a preamble to find the correct iclog
      buffer, and then use gotos for unwinding to straighten the code.
      
      Also fix the top-of-function comment while we're at it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  7. 15 Mar, 2018 6 commits
  8. 14 Mar, 2018 6 commits
  9. 12 Mar, 2018 6 commits
    • xfs: account only rmapbt-used blocks against rmapbt perag res · 0ab32086
      Brian Foster authored
      The rmapbt perag metadata reservation reserves blocks for the
      reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
      the agfl and perag accounting is updated as blocks are allocated
      from the allocation btrees, the reservation actually accounts blocks
      as they are allocated to (or freed from) the agfl rather than the
      rmapbt itself.
      
      While this works for blocks that are eventually used for the rmapbt,
      not all agfl blocks are destined for the rmapbt. Blocks that are
      allocated to the agfl (and thus "reserved" for the rmapbt) but then
      used by another structure lead to a growing inconsistency over time
      between the runtime tracking of rmapbt usage and actual rmapbt
      usage. Since the runtime tracking thinks all agfl blocks are rmapbt
      blocks, it essentially believes that less future reservation is
      required to satisfy the rmapbt than what is actually necessary.
      
      The inconsistency is rectified across mount cycles because the perag
      reservation is initialized based on the actual rmapbt usage at mount
      time. The problem, however, is that the excessive drain of the
      reservation at runtime opens a window to allocate blocks for other
      purposes that might be required for the rmapbt on a subsequent
      mount. This problem can be demonstrated by a simple test that runs
      an allocation workload to consume agfl blocks over time and then
      observe the difference in the agfl reservation requirement across an
      unmount/mount cycle:
      
        mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
        ...
        ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
        umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
        mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
      
      As the above tracepoints show, the reservation requirement reduces
      from 3194 blocks to 2956 blocks as the workload runs.  Without any
      other changes in the filesystem, the same reservation requirement
      jumps from 2956 to 3052 blocks over a umount/mount cycle.
      
      To address this divergence, update the RMAPBT reservation to account
      blocks used for the rmapbt only rather than all blocks filled into
      the agfl. This patch makes several high-level changes toward that
      end:
      
      1.) Reintroduce an AGFL reservation type to serve as an accounting
          no-op for blocks allocated to (or freed from) the AGFL.
      2.) Invoke RMAPBT usage accounting from the actual rmapbt block
          allocation path rather than the AGFL allocation path.
      
      The first change is required because agfl blocks are considered free
      blocks throughout their lifetime. The perag reservation subsystem is
      invoked unconditionally by the allocation subsystem, so we need a
      way to tell the perag subsystem (via the allocation subsystem) to
      not make any accounting changes for blocks filled into the AGFL.
      
      The second change causes the in-core RMAPBT reservation usage
      accounting to remain consistent with the on-disk state at all times
      and eliminates the risk of leaving the rmapbt reservation
      underfilled.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: rename agfl perag res type to rmapbt · 21592863
      Brian Foster authored
      The AGFL perag reservation type accounts all allocations that feed
      into (or are released from) the allocation group free list (agfl).
      The purpose of the reservation is to support worst case conditions
      for the reverse mapping btree (rmapbt). As such, the agfl
      reservation usage accounting only considers rmapbt usage when the
      in-core counters are initialized at mount time.
      
      This implementation inconsistency leads to divergence of the in-core
      and on-disk usage accounting over time. In preparation to resolve
      this inconsistency and adjust the AGFL reservation into an rmapbt
      specific reservation, rename the AGFL reservation type and
      associated accounting fields to something more rmapbt-specific. Also
      fix up a couple tracepoints that incorrectly use the AGFL
      reservation type to pass the agfl state of the associated extent
      where the raw reservation type is expected.
      
      Note that this patch does not change perag reservation behavior.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: account format bouncing into rmapbt swapext tx reservation · b3fed434
      Brian Foster authored
      The extent swap mechanism requires a unique implementation for
      rmapbt enabled filesystems. Because the rmapbt tracks extent owner
      information, extent swap must individually unmap and remap each
      extent between the two inodes.
      
      The rmapbt extent swap transaction block reservation currently
      accounts for the worst case bmapbt block and rmapbt block
      consumption based on the extent count of each inode. There is a
      corner case that exists due to the extent swap implementation that
      is not covered by this reservation, however.
      
      If one of the associated inodes is just over the max extent count
      used for extent format inodes (i.e., the inode is in btree format by
      a single extent), the unmap/remap cycle of the extent swap can
      bounce the inode between extent and btree format multiple times,
      almost as many times as there are extents in the inode (if the
      opposing inode happens to have one less, for example). Each back and
      forth cycle involves a block free and allocation, which isn't a
      problem except that the initial transaction reservation must
      account for the total number of block allocations performed by the
      chain of deferred operations. If not, a block reservation overrun
      occurs and the filesystem shuts down.
      
      Update the rmapbt extent swap block reservation to check for this
      situation and add some block reservation slop to ensure the entire
      operation succeeds. We'd likely never require reservation for both
      inodes, as fsr wouldn't defrag the file in that case, but the
      additional reservation is constrained by the data fork size, so be
      cautious and check for both.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: shutdown if block allocation overruns tx reservation · 3e78b9a4
      Brian Foster authored
      The ->t_blk_res_used field tracks how many blocks have been used in
      the current transaction. This should never exceed the block
      reservation (->t_blk_res) for a particular transaction. We currently
      assert this condition in the transaction block accounting code, but
      otherwise take no additional action should this situation occur.
      
      The overrun generally has no effect if space ends up being available
      and the associated transaction commits. If the transaction is
      duplicated, however, the current block usage is used to determine
      the remaining block reservation to be transferred to the new
      transaction. If usage exceeds reservation, this calculation
      underflows and creates a transaction with an invalid and excessive
      reservation. When the second transaction commits, the release of
      unused blocks corrupts the in-core free space counters. With lazy
      superblock accounting enabled, this inconsistency eventually
      trickles to the on-disk superblock and corrupts the filesystem.
      
      Replace the transaction block usage accounting assert with an
      explicit overrun check. If the transaction overruns the reservation,
      shut down the filesystem immediately to prevent corruption. Add a new
      assert to xfs_trans_dup() to catch any callers that might induce
      this invalid state in the future.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
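
The underflow described in the reservation overrun commit above is ordinary unsigned wraparound, which a few lines of C can demonstrate. This is a toy model with hypothetical names (`toy_trans`, `toy_trans_use_blocks`, `toy_remaining_unchecked`), assuming 32-bit unsigned counters; it is not the kernel's transaction code, and a boolean flag stands in for a filesystem shutdown.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_trans {
    uint32_t blk_res;       /* blocks reserved for this transaction */
    uint32_t blk_res_used;  /* blocks used so far */
    bool shutdown;          /* stands in for shutting the filesystem down */
};

/* What a duplicated transaction would inherit with no overrun check. */
static uint32_t toy_remaining_unchecked(uint32_t res, uint32_t used)
{
    return res - used;      /* wraps to a huge value when used > res */
}

/*
 * Account nblks against the reservation. On overrun, refuse the
 * request and flag a shutdown instead of recording the bad usage.
 */
static bool toy_trans_use_blocks(struct toy_trans *tp, uint32_t nblks)
{
    uint64_t used = (uint64_t)tp->blk_res_used + nblks;

    if (used > tp->blk_res) {
        tp->shutdown = true;
        return false;
    }
    tp->blk_res_used = (uint32_t)used;
    return true;
}
```

With a reservation of 10 blocks and 11 used, the unchecked subtraction yields `UINT32_MAX` rather than a negative number, which is the "invalid and excessive reservation" the commit message warns about. The explicit check keeps usage from ever exceeding the reservation, so the subtraction stays well defined.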
    • xfs: Rename xa_ elements to ail_ · 57e80956
      Matthew Wilcox authored
      This is a simple rename, except that xa_ail becomes ail_head.
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • inode: don't memset the inode address space twice · ae23395d
      Dave Chinner authored
      Noticed when looking at why cycling 600k inodes/s through the inode
      cache was taking a total of 8% cpu in memset() during inode
      initialisation.  There is no need to zero the inode.i_data structure
      twice.
      
      This increases single threaded bulkstat throughput from ~200,000
      inodes/s to ~220,000 inodes/s, so we save a substantial amount of
      CPU time per inode init by doing this.
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>