1. 15 Dec, 2023 10 commits
    • Darrick J. Wong's avatar
      xfs: create separate structures and code for u32 bitmaps · 6ece924b
      Darrick J. Wong authored
      Create a version of the xbitmap that handles 32-bit integer intervals
      and adapt the xfs_agblock_t bitmap to use it.  This reduces the size of
      the interval tree nodes from 48 to 36 bytes and enables us to use a more
      efficient slab (:0000040 instead of :0000048) which allows us to pack
      more nodes into a single slab page (102 vs 85).
      
      As a side effect, the users of these bitmaps no longer have to convert
      between u32 and u64 quantities just to use the bitmap; and the hairy
      overflow checking code in xagb_bitmap_test goes away.
      
      Later in this patchset we're going to add bitmaps for xfs_agino_t,
      xfs_rgblock_t, and xfs_dablk_t, so the increase in code size (5622 vs.
      9959 bytes) seems worth it.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      6ece924b
    • Darrick J. Wong's avatar
      xfs: constrain dirty buffers while formatting a staged btree · e069d549
      Darrick J. Wong authored
      Constrain the number of dirty buffers that are locked by the btree
      staging code at any given time by establishing a threshold at which we
      put them all on the delwri queue and push them to disk.  This limits
      memory consumption while writing out new btrees.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e069d549
    • Darrick J. Wong's avatar
      xfs: move btree bulkload record initialization to ->get_record implementations · 6dfeb0c2
      Darrick J. Wong authored
      When we're performing a bulk load of a btree, move the code that
      actually stores the btree record in the new btree block out of the
      generic code and into the individual ->get_record implementations.
      This is preparation for being able to store multiple records with a
      single indirect call.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      6dfeb0c2
    • Darrick J. Wong's avatar
      xfs: add debug knobs to control btree bulk load slack factors · a20ffa7d
      Darrick J. Wong authored
      Add some debug knobs so that we can control the leaf and node block
      slack when rebuilding btrees.
      
      For developers, it might be useful to construct btrees of various
      heights by crafting a filesystem with a certain number of records and
      then using repair+knobs to rebuild the index with a certain shape.
      Practically speaking, you'd only ever do that for extreme stress
      testing of the runtime code or the btree generator.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a20ffa7d
    • Darrick J. Wong's avatar
      xfs: read leaf blocks when computing keys for bulkloading into node blocks · 26de6462
      Darrick J. Wong authored
      When constructing a new btree, xfs_btree_bload_node needs to read the
      btree blocks for level N to compute the keyptrs for the blocks that will
      be loaded into level N+1.  The level N blocks must be formatted at that
      point.
      
      A subsequent patch will change the btree bulkloader to write new btree
      blocks in 256K chunks to moderate memory consumption if the new btree is
      very large.  As a consequence of that, it's possible that the buffers
      for lower level blocks might have been reclaimed by the time the node
      builder comes back to the block.
      
      Therefore, change xfs_btree_bload_node to read the lower level blocks
      to handle the reclaimed buffer case.  As a side effect, the read will
      increase the LRU refs, which will bias towards keeping new btree buffers
      in memory after the new btree commits.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      26de6462
    • Darrick J. Wong's avatar
      xfs: set XBF_DONE on newly formatted btree block that are ready for writing · c1e0f8e6
      Darrick J. Wong authored
      The btree bulkloading code calls xfs_buf_delwri_queue_here when it has
      finished formatting a new btree block and wants to queue it to be
      written to disk.  Once the new btree root has been committed, the blocks
      (and hence the buffers) will be accessible to the rest of the
      filesystem.  Mark each new buffer as DONE when adding it to the delwri
      list so that the next btree traversal can skip reloading the contents
      from disk.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      c1e0f8e6
    • Darrick J. Wong's avatar
      xfs: force all buffers to be written during btree bulk load · 13ae04d8
      Darrick J. Wong authored
      While stress-testing online repair of btrees, I noticed periodic
      assertion failures from the buffer cache about buffers with incorrect
      DELWRI_Q state.  Looking further, I observed this race between the AIL
      trying to write out a btree block and repair zapping a btree block after
      the fact:
      
      AIL:    Repair0:
      
      pin buffer X
      delwri_queue:
      set DELWRI_Q
      add to delwri list
      
              stale buf X:
              clear DELWRI_Q
              does not clear b_list
              free space X
              commit
      
      delwri_submit   # oops
      
      Worse yet, I discovered that running the same repair over and over in a
      tight loop can result in a second race that cause data integrity
      problems with the repair:
      
      AIL:    Repair0:        Repair1:
      
      pin buffer X
      delwri_queue:
      set DELWRI_Q
      add to delwri list
      
              stale buf X:
              clear DELWRI_Q
              does not clear b_list
              free space X
              commit
      
                              find free space X
                              get buffer
                              rewrite buffer
                              delwri_queue:
                              set DELWRI_Q
                              already on a list, do not add
                              commit
      
                              BAD: committed tree root before all blocks written
      
      delwri_submit   # too late now
      
      I traced this to my own misunderstanding of how the delwri lists work,
      particularly with regards to the AIL's buffer list.  If a buffer is
      logged and committed, the buffer can end up on that AIL buffer list.  If
      btree repairs are run twice in rapid succession, it's possible that the
      first repair will invalidate the buffer and free it before the next time
      the AIL wakes up.  Marking the buffer stale clears DELWRI_Q from the
      buffer state without removing the buffer from its delwri list.  The
      buffer doesn't know which list it's on, so it cannot know which lock to
      take to protect the list for a removal.
      
      If the second repair allocates the same block, it will then recycle the
      buffer to start writing the new btree block.  Meanwhile, if the AIL
      wakes up and walks the buffer list, it will ignore the buffer because it
      can't lock it, and go back to sleep.
      
      When the second repair calls delwri_queue to put the buffer on the
      list of buffers to write before committing the new btree, it will set
      DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
      buffer list, it won't add it to the bulkload buffer's list.
      
      This is incorrect, because the bulkload caller relies on delwri_submit
      to ensure that all the buffers have been sent to disk /before/
      committing the new btree root pointer.  This ordering requirement is
      required for data consistency.
      
      Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
      drop it, so the next thread to walk through the btree will trip over a
      debug assertion on that flag.
      
      To fix this, create a new function that waits for the buffer to be
      removed from any other delwri lists before adding the buffer to the
      caller's delwri list.  By waiting for the buffer to clear both the
      delwri list and any potential delwri wait list, we can be sure that
      repair will initiate writes of all buffers and report all write errors
      back to userspace instead of committing the new structure.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      13ae04d8
    • Dave Chinner's avatar
      xfs: initialise di_crc in xfs_log_dinode · 0573676f
      Dave Chinner authored
      Alexander Potapenko report that KMSAN was issuing these warnings:
      
      kmalloc-ed xlog buffer of size 512 : ffff88802fc26200
      kmalloc-ed xlog buffer of size 368 : ffff88802fc24a00
      kmalloc-ed xlog buffer of size 648 : ffff88802b631000
      kmalloc-ed xlog buffer of size 648 : ffff88802b632800
      kmalloc-ed xlog buffer of size 648 : ffff88802b631c00
      xlog_write_iovec: copying 12 bytes from ffff888017ddbbd8 to ffff88802c300400
      xlog_write_iovec: copying 28 bytes from ffff888017ddbbe4 to ffff88802c30040c
      xlog_write_iovec: copying 68 bytes from ffff88802fc26274 to ffff88802c300428
      xlog_write_iovec: copying 188 bytes from ffff88802fc262bc to ffff88802c30046c
      =====================================================
      BUG: KMSAN: uninit-value in xlog_write_iovec fs/xfs/xfs_log.c:2227
      BUG: KMSAN: uninit-value in xlog_write_full fs/xfs/xfs_log.c:2263
      BUG: KMSAN: uninit-value in xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532
       xlog_write_iovec fs/xfs/xfs_log.c:2227
       xlog_write_full fs/xfs/xfs_log.c:2263
       xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532
       xlog_cil_write_chain fs/xfs/xfs_log_cil.c:918
       xlog_cil_push_work+0x30f2/0x44e0 fs/xfs/xfs_log_cil.c:1263
       process_one_work kernel/workqueue.c:2630
       process_scheduled_works+0x1188/0x1e30 kernel/workqueue.c:2703
       worker_thread+0xee5/0x14f0 kernel/workqueue.c:2784
       kthread+0x391/0x500 kernel/kthread.c:388
       ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
      
      Uninit was created at:
       slab_post_alloc_hook+0x101/0xac0 mm/slab.h:768
       slab_alloc_node mm/slub.c:3482
       __kmem_cache_alloc_node+0x612/0xae0 mm/slub.c:3521
       __do_kmalloc_node mm/slab_common.c:1006
       __kmalloc+0x11a/0x410 mm/slab_common.c:1020
       kmalloc ./include/linux/slab.h:604
       xlog_kvmalloc fs/xfs/xfs_log_priv.h:704
       xlog_cil_alloc_shadow_bufs fs/xfs/xfs_log_cil.c:343
       xlog_cil_commit+0x487/0x4dc0 fs/xfs/xfs_log_cil.c:1574
       __xfs_trans_commit+0x8df/0x1930 fs/xfs/xfs_trans.c:1017
       xfs_trans_commit+0x30/0x40 fs/xfs/xfs_trans.c:1061
       xfs_create+0x15af/0x2150 fs/xfs/xfs_inode.c:1076
       xfs_generic_create+0x4cd/0x1550 fs/xfs/xfs_iops.c:199
       xfs_vn_create+0x4a/0x60 fs/xfs/xfs_iops.c:275
       lookup_open fs/namei.c:3477
       open_last_lookups fs/namei.c:3546
       path_openat+0x29ac/0x6180 fs/namei.c:3776
       do_filp_open+0x24d/0x680 fs/namei.c:3809
       do_sys_openat2+0x1bc/0x330 fs/open.c:1440
       do_sys_open fs/open.c:1455
       __do_sys_openat fs/open.c:1471
       __se_sys_openat fs/open.c:1466
       __x64_sys_openat+0x253/0x330 fs/open.c:1466
       do_syscall_x64 arch/x86/entry/common.c:51
       do_syscall_64+0x4f/0x140 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x63/0x6b arch/x86/entry/entry_64.S:120
      
      Bytes 112-115 of 188 are uninitialized
      Memory access of size 188 starts at ffff88802fc262bc
      
      This is caused by the struct xfs_log_dinode not having the di_crc
      field initialised. Log recovery never uses this field (it is only
      present these days for on-disk format compatibility reasons) and so
      it's value is never checked so nothing in XFS has caught this.
      
      Further, none of the uninitialised memory access warning tools have
      caught this (despite catching other uninit memory accesses in the
      struct xfs_log_dinode back in 2017!) until recently. Alexander
      annotated the XFS code to get the dump of the actual bytes that were
      detected as uninitialised, and from that report it took me about 30s
      to realise what the issue was.
      
      The issue was introduced back in 2016 and every inode that is logged
      fails to initialise this field. This is no actual bad behaviour
      caused by this issue - I find it hard to even classify it as a
      bug...
      Reported-and-tested-by: default avatarAlexander Potapenko <glider@google.com>
      Fixes: f8d55aa0 ("xfs: introduce inode log format object")
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      0573676f
    • Darrick J. Wong's avatar
      xfs: fix an off-by-one error in xreap_agextent_binval · c0e37f07
      Darrick J. Wong authored
      Overall, this function tries to find and invalidate all buffers for a
      given extent of space on the data device.  The inner for loop in this
      function tries to find all xfs_bufs for a given daddr.  The lengths of
      all possible cached buffers range from 1 fsblock to the largest needed
      to contain a 64k xattr value (~17fsb).  The scan is capped to avoid
      looking at anything buffer going past the given extent.
      
      Unfortunately, the loop continuation test is wrong -- max_fsbs is the
      largest size we want to scan, not one past that.  Put another way, this
      loop is actually 1-indexed, not 0-indexed.  Therefore, the continuation
      test should use <=, not <.
      
      As a result, online repairs of btree blocks fails to stale any buffers
      for btrees that are being torn down, which causes later assertions in
      the buffer cache when another thread creates a different-sized buffer.
      This happens in xfs/709 when allocating an inode cluster buffer:
      
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 3346128 at fs/xfs/xfs_message.c:104 assfail+0x3a/0x40 [xfs]
       CPU: 0 PID: 3346128 Comm: fsstress Not tainted 6.7.0-rc4-djwx #rc4
       RIP: 0010:assfail+0x3a/0x40 [xfs]
       Call Trace:
        <TASK>
        _xfs_buf_obj_cmp+0x4a/0x50
        xfs_buf_get_map+0x191/0xba0
        xfs_trans_get_buf_map+0x136/0x280
        xfs_ialloc_inode_init+0x186/0x340
        xfs_ialloc_ag_alloc+0x254/0x720
        xfs_dialloc+0x21f/0x870
        xfs_create_tmpfile+0x1a9/0x2f0
        xfs_rename+0x369/0xfd0
        xfs_vn_rename+0xfa/0x170
        vfs_rename+0x5fb/0xc30
        do_renameat2+0x52d/0x6e0
        __x64_sys_renameat2+0x4b/0x60
        do_syscall_64+0x3b/0xe0
        entry_SYSCALL_64_after_hwframe+0x46/0x4e
      
      A later refactoring patch in the online repair series fixed this by
      accident, which is why I didn't notice this until I started testing only
      the patches that are likely to end up in 6.8.
      
      Fixes: 1c7ce115 ("xfs: reap large AG metadata extents when possible")
      Signed-off-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      c0e37f07
    • Eric Sandeen's avatar
      xfs: short circuit xfs_growfs_data_private() if delta is zero · 84712492
      Eric Sandeen authored
      Although xfs_growfs_data() doesn't call xfs_growfs_data_private()
      if in->newblocks == mp->m_sb.sb_dblocks, xfs_growfs_data_private()
      further massages the new block count so that we don't i.e. try
      to create a too-small new AG.
      
      This may lead to a delta of "0" in xfs_growfs_data_private(), so
      we end up in the shrink case and emit the EXPERIMENTAL warning
      even if we're not changing anything at all.
      
      Fix this by returning straightaway if the block delta is zero.
      
      (nb: in older kernels, the result of entering the shrink case
      with delta == 0 may actually let an -ENOSPC escape to userspace,
      which is confusing for users.)
      
      Fixes: fb2fc172 ("xfs: support shrinking unused space in the last AG")
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      84712492
  2. 14 Dec, 2023 6 commits
  3. 13 Dec, 2023 1 commit
    • Darrick J. Wong's avatar
      xfs: recompute growfsrtfree transaction reservation while growing rt volume · 578bd4ce
      Darrick J. Wong authored
      While playing with growfs to create a 20TB realtime section on a
      filesystem that didn't previously have an rt section, I noticed that
      growfs would occasionally shut down the log due to a transaction
      reservation overflow.
      
      xfs_calc_growrtfree_reservation uses the current size of the realtime
      summary file (m_rsumsize) to compute the transaction reservation for a
      growrtfree transaction.  The reservations are computed at mount time,
      which means that m_rsumsize is zero when growfs starts "freeing" the new
      realtime extents into the rt volume.  As a result, the transaction is
      undersized and fails.
      
      Fix this by recomputing the transaction reservations every time we
      change m_rsumsize.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      578bd4ce
  4. 07 Dec, 2023 23 commits
    • Christoph Hellwig's avatar
      xfs: move xfs_ondisk.h to libxfs/ · 18793e05
      Christoph Hellwig authored
      Move xfs_ondisk.h to libxfs so that we can do the struct sanity checks
      in userspace libxfs as well.  This should allow us to retire the
      somewhat fragile xfs/122 test on xfstests.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Reviewed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      18793e05
    • Christoph Hellwig's avatar
      xfs: use static_assert to check struct sizes and offsets · c12c5039
      Christoph Hellwig authored
      Use the compiler-provided static_assert built-in from C11 instead of
      the kernel-specific BUILD_BUG_ON_MSG for the structure size and offset
      checks in xfs_ondisk.  This not only gives slightly nicer error messages
      in case things go south, but can also be trivially used as-is in
      userspace.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Reviewed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      c12c5039
    • Zhang Tianci's avatar
      xfs: extract xfs_da_buf_copy() helper function · fd45ddb9
      Zhang Tianci authored
      This patch does not modify logic.
      
      xfs_da_buf_copy() will copy one block from src xfs_buf to
      dst xfs_buf, and update the block metadata in dst directly.
      Signed-off-by: default avatarZhang Tianci <zhangtianci.1997@bytedance.com>
      Suggested-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      fd45ddb9
    • Zhang Tianci's avatar
      xfs: update dir3 leaf block metadata after swap · 5759aa4f
      Zhang Tianci authored
      xfs_da3_swap_lastblock() copy the last block content to the dead block,
      but do not update the metadata in it. We need update some metadata
      for some kinds of type block, such as dir3 leafn block records its
      blkno, we shall update it to the dead block blkno. Otherwise,
      before write the xfs_buf to disk, the verify_write() will fail in
      blk_hdr->blkno != xfs_buf->b_bn, then xfs will be shutdown.
      
      We will get this warning:
      
        XFS (dm-0): Metadata corruption detected at xfs_dir3_leaf_verify+0xa8/0xe0 [xfs], xfs_dir3_leafn block 0x178
        XFS (dm-0): Unmount and run xfs_repair
        XFS (dm-0): First 128 bytes of corrupted metadata buffer:
        00000000e80f1917: 00 80 00 0b 00 80 00 07 3d ff 00 00 00 00 00 00  ........=.......
        000000009604c005: 00 00 00 00 00 00 01 a0 00 00 00 00 00 00 00 00  ................
        000000006b6fb2bf: e4 44 e3 97 b5 64 44 41 8b 84 60 0e 50 43 d9 bf  .D...dDA..`.PC..
        00000000678978a2: 00 00 00 00 00 00 00 83 01 73 00 93 00 00 00 00  .........s......
        00000000b28b247c: 99 29 1d 38 00 00 00 00 99 29 1d 40 00 00 00 00  .).8.....).@....
        000000002b2a662c: 99 29 1d 48 00 00 00 00 99 49 11 00 00 00 00 00  .).H.....I......
        00000000ea2ffbb8: 99 49 11 08 00 00 45 25 99 49 11 10 00 00 48 fe  .I....E%.I....H.
        0000000069e86440: 99 49 11 18 00 00 4c 6b 99 49 11 20 00 00 4d 97  .I....Lk.I. ..M.
        XFS (dm-0): xfs_do_force_shutdown(0x8) called from line 1423 of file fs/xfs/xfs_buf.c.  Return address = 00000000c0ff63c1
        XFS (dm-0): Corruption of in-memory data detected.  Shutting down filesystem
        XFS (dm-0): Please umount the filesystem and rectify the problem(s)
      
      >From the log above, we know xfs_buf->b_no is 0x178, but the block's hdr record
      its blkno is 0x1a0.
      
      Fixes: 24df33b4 ("xfs: add CRC checking to dir2 leaf blocks")
      Signed-off-by: default avatarZhang Tianci <zhangtianci.1997@bytedance.com>
      Suggested-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      5759aa4f
    • Jiachen Zhang's avatar
      xfs: ensure logflagsp is initialized in xfs_bmap_del_extent_real · e6af9c98
      Jiachen Zhang authored
      In the case of returning -ENOSPC, ensure logflagsp is initialized by 0.
      Otherwise the caller __xfs_bunmapi will set uninitialized illegal
      tmp_logflags value into xfs log, which might cause unpredictable error
      in the log recovery procedure.
      
      Also, remove the flags variable and set the *logflagsp directly, so that
      the code should be more robust in the long run.
      
      Fixes: 1b24b633 ("xfs: move some more code into xfs_bmap_del_extent_real")
      Signed-off-by: default avatarJiachen Zhang <zhangjiachen.jaycee@bytedance.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      e6af9c98
    • Christoph Hellwig's avatar
      xfs: clean up xfs_fsops.h · 08e54ca4
      Christoph Hellwig authored
      Use struct types instead of typedefs so that the header can be included
      with pulling in the headers that define the typedefs, and remove the
      pointless externs.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      08e54ca4
    • Christoph Hellwig's avatar
      xfs: clean up the xfs_reserve_blocks interface · 646ddf0c
      Christoph Hellwig authored
      xfs_reserve_blocks has a very odd interface that can only be explained
      by it directly deriving from the IRIX fcntl handler back in the day.
      
      Split reporting out the reserved blocks out of xfs_reserve_blocks into
      the only caller that cares.  This means that the value reported from
      XFS_IOC_SET_RESBLKS isn't atomically sampled in the same critical
      section as when it was set anymore, but as the values could change
      right after setting them anyway that does not matter.  It does
      provide atomic sampling of both values for XFS_IOC_GET_RESBLKS now,
      though.
      
      Also pass a normal scalar integer value for the requested value instead
      of the pointless pointer.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      646ddf0c
    • Christoph Hellwig's avatar
      xfs: clean up the XFS_IOC_FSCOUNTS handler · c2c2620d
      Christoph Hellwig authored
      Split XFS_IOC_FSCOUNTS out of the main xfs_file_ioctl function, and
      merge the xfs_fs_counts helper into the ioctl handler.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      c2c2620d
    • Christoph Hellwig's avatar
      xfs: clean up the XFS_IOC_{GS}ET_RESBLKS handler · 64f08b15
      Christoph Hellwig authored
      The XFS_IOC_GET_RESBLKS and XFS_IOC_SET_RESBLKS already share a fair
      amount of code, and will share even more soon.  Move the logic for both
      of them out of the main xfs_file_ioctl function into a
      xfs_ioctl_getset_resblocks helper to share the code and prepare for
      additional changes.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      64f08b15
    • Bagas Sanjaya's avatar
      Documentation: xfs: consolidate XFS docs into its own subdirectory · 011f129f
      Bagas Sanjaya authored
      XFS docs are currently in upper-level Documentation/filesystems.
      Although these are currently 4 docs, they are already outstanding as
      a group and can be moved to its own subdirectory.
      
      Consolidate them into Documentation/filesystems/xfs/.
      Signed-off-by: default avatarBagas Sanjaya <bagasdotme@gmail.com>
      Reviewed-by: default avatarBill O'Donnell <bodonnel@redhat.com>
      Reviewed-by: default avatar"Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      011f129f
    • Shiyang Ruan's avatar
      mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind · fa422b35
      Shiyang Ruan authored
      Now, if we suddenly remove a PMEM device(by calling unbind) which
      contains FSDAX while programs are still accessing data in this device,
      e.g.:
      ```
       $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
       # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
       echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
      ```
      it could come into an unacceptable state:
        1. device has gone but mount point still exists, and umount will fail
             with "target is busy"
        2. programs will hang and cannot be killed
        3. may crash with NULL pointer dereference
      
      To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
      are going to remove the whole device, and make sure all related processes
      could be notified so that they could end up gracefully.
      
      This patch is inspired by Dan's "mm, dax, pmem: Introduce
      dev_pagemap_failure()"[1].  With the help of dax_holder and
      ->notify_failure() mechanism, the pmem driver is able to ask filesystem
      on it to unmap all files in use, and notify processes who are using
      those files.
      
      Call trace:
      trigger unbind
       -> unbind_store()
        -> ... (skip)
         -> devres_release_all()
          -> kill_dax()
           -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
            -> xfs_dax_notify_failure()
            `-> freeze_super()             // freeze (kernel call)
            `-> do xfs rmap
            ` -> mf_dax_kill_procs()
            `  -> collect_procs_fsdax()    // all associated processes
            `  -> unmap_and_kill()
            ` -> invalidate_inode_pages2_range() // drop file's cache
            `-> thaw_super()               // thaw (both kernel & user call)
      
      Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
      event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
      new dax mapping from being created.  Do not shutdown filesystem directly
      if configuration is not supported, or if failure range includes metadata
      area.  Make sure all files and processes(not only the current progress)
      are handled correctly.  Also drop the cache of associated files before
      pmem is removed.
      
      [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
      [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/Signed-off-by: default avatarShiyang Ruan <ruansy.fnst@fujitsu.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      fa422b35
    • Chandan Babu R's avatar
      Merge tag 'repair-auto-reap-space-reservations-6.8_2023-12-06' of... · 49391d13
      Chandan Babu R authored
      Merge tag 'repair-auto-reap-space-reservations-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: reserve disk space for online repairs
      
      Online repair fixes metadata structures by writing a new copy out to
      disk and atomically committing the new structure into the filesystem.
      For this to work, we need to reserve all the space we're going to need
      ahead of time so that the atomic commit transaction is as small as
      possible.  We also require the reserved space to be freed if the system
      goes down, or if we decide not to commit the repair, or if we reserve
      too much space.
      
      To keep the atomic commit transaction as small as possible, we would
      like to allocate some space and simultaneously schedule automatic
      reaping of the reserved space, even on log recovery.  EFIs are the
      mechanism to get us there, but we need to use them in a novel manner.
      Once we allocate the space, we want to hold on to the EFI (relogging as
      necessary) until we can commit or cancel the repair.  EFIs for written
      committed blocks need to go away, but unwritten or uncommitted blocks
      can be freed like normal.
      
      Earlier versions of this patchset directly manipulated the log items,
      but Dave thought that to be a layering violation.  For v27, I've
      modified the defer ops handling code to be capable of pausing a deferred
      work item.  Log intent items are created as they always have been, but
      paused items are pushed onto a side list when finishing deferred work
      items, and pushed back onto the transaction after that.  Log intent done
      item are not created for paused work.
      
      The second part adds a "stale" flag to the EFI so that the repair
      reservation code can dispose of an EFI the normal way, but without the
      space actually being freed.
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'repair-auto-reap-space-reservations-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: force small EFIs for reaping btree extents
        xfs: log EFIs for all btree blocks being used to stage a btree
        xfs: implement block reservation accounting for btrees we're staging
        xfs: remove unused fields from struct xbtree_ifakeroot
        xfs: automatic freeing of freshly allocated unwritten space
        xfs: remove __xfs_free_extent_later
        xfs: allow pausing of pending deferred work items
        xfs: don't append work items to logged xfs_defer_pending objects
      49391d13
    • Chandan Babu R's avatar
      Merge tag 'scrub-livelock-prevention-6.8_2023-12-06' of... · dec0224b
      Chandan Babu R authored
      Merge tag 'scrub-livelock-prevention-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: prevent livelocks in xchk_iget
      
      Prevent scrub from live locking in xchk_iget if there's a cycle in the
      inobt by allocating an empty transaction.
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'scrub-livelock-prevention-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: make xchk_iget safer in the presence of corrupt inode btrees
      dec0224b
    • Chandan Babu R's avatar
      Merge tag 'defer-elide-create-done-6.8_2023-12-06' of... · 9f334526
      Chandan Babu R authored
      Merge tag 'defer-elide-create-done-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: elide defer work ->create_done if no intent
      
      Christoph pointed out that the defer ops machinery doesn't need to call
      ->create_done if the deferred work item didn't generate a log intent
      item in the first place.  Let's clean that up and save an indirect call
      in the non-logged xattr update call path.
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'defer-elide-create-done-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: elide ->create_done calls for unlogged deferred work
        xfs: document what LARP means
      9f334526
    • Chandan Babu R's avatar
      Merge tag 'fix-rtmount-overflows-6.8_2023-12-06' of... · 47c460ef
      Chandan Babu R authored
      Merge tag 'fix-rtmount-overflows-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: fix realtime geometry integer overflows
      
      While reading through the realtime geometry support code in xfsprogs, I
      noticed a discrepancy between the sb_rextslog computation used when
      writing out the superblock during mkfs and the validation code used in
      xfs_repair.  This discrepancy would lead to system failure for a runt rt
      volume having more than 1 rt block but zero rt extents in length.  Most
      people aren't going to configure a 1M extent size for their 360k rt
      floppy disk volume, but I did!
      
      In the process of studying that code, it occurred to me that there is a
      second bug in the computation -- the use of highbit32 for a 64-bit
      value means that the upper 32 bits are not considered in the search for
      a high bit.  This causes the creation of a realtime summary file that is
      the wrong length.  If rextents is a multiple of U32_MAX then this will
      appear to work fine because highbit32 returns -1 for an input of 0; but
      for all other cases the rt summary is undersized, leading to failures.
      
      Fix the first problem by standardizing the computation with a helper in
      libxfs; and the second problem by correcting the computation.  This will
      cause any existing rt volumes larger than 2^32 blocks to fail validation
      but they probably were already crashing anyway.
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'fix-rtmount-overflows-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: don't allow overly small or large realtime volumes
        xfs: fix 32-bit truncation in xfs_compute_rextslog
        xfs: make rextslog computation consistent with mkfs
      47c460ef
    • Chandan Babu R's avatar
      Merge tag 'reconstruct-defer-cleanups-6.8_2023-12-06' of... · 34d38666
      Chandan Babu R authored
      Merge tag 'reconstruct-defer-cleanups-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: continue removing defer item boilerplate
      
      Now that we've restructured log intent item recovery to reconstruct the
      incore deferred work state, apply further cleanups to that code to
      remove boilerplate that is duplicated across all the _item.c files.
      Having done that, collapse a bunch of trivial helpers to reduce the
      overall call chain.  That enables us to refactor the relog code so that
      the ->relog_item implementations only have to know how to format the
      implementation-specific data encoded in an intent item and don't
      themselves have to handle the log item juggling.
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'reconstruct-defer-cleanups-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: move ->iop_relog to struct xfs_defer_op_type
        xfs: collapse the ->create_done functions
        xfs: hoist xfs_trans_add_item calls to defer ops functions
        xfs: clean out XFS_LI_DIRTY setting boilerplate from ->iop_relog
        xfs: use xfs_defer_create_done for the relogging operation
        xfs: hoist ->create_intent boilerplate to its callsite
        xfs: collapse the ->finish_item helpers
        xfs: hoist intent done flag setting to ->finish_item callsite
        xfs: don't set XFS_TRANS_HAS_INTENT_DONE when there's no ATTRD log item
      34d38666
    • Chandan Babu R's avatar
      Merge tag 'reconstruct-defer-work-6.8_2023-12-06' of... · 6b4ffe97
      Chandan Babu R authored
      Merge tag 'reconstruct-defer-work-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: log intent item recovery should reconstruct defer work state
      
      Long Li reported a KASAN report from a UAF when intent recovery fails:
      
       ==================================================================
       BUG: KASAN: slab-use-after-free in xfs_cui_release+0xb7/0xc0
       Read of size 4 at addr ffff888012575e60 by task kworker/u8:3/103
       CPU: 3 PID: 103 Comm: kworker/u8:3 Not tainted 6.4.0-rc7-next-20230619-00003-g94543a53f9a4-dirty #166
       Workqueue: xfs-cil/sda xlog_cil_push_work
       Call Trace:
        <TASK>
        dump_stack_lvl+0x50/0x70
        print_report+0xc2/0x600
        kasan_report+0xb6/0xe0
        xfs_cui_release+0xb7/0xc0
        xfs_cud_item_release+0x3c/0x90
        xfs_trans_committed_bulk+0x2d5/0x7f0
        xlog_cil_committed+0xaba/0xf20
        xlog_cil_push_work+0x1a60/0x2360
        process_one_work+0x78e/0x1140
        worker_thread+0x58b/0xf60
        kthread+0x2cd/0x3c0
        ret_from_fork+0x1f/0x30
        </TASK>
      
       Allocated by task 531:
        kasan_save_stack+0x22/0x40
        kasan_set_track+0x25/0x30
        __kasan_slab_alloc+0x55/0x60
        kmem_cache_alloc+0x195/0x5f0
        xfs_cui_init+0x198/0x1d0
        xlog_recover_cui_commit_pass2+0x133/0x5f0
        xlog_recover_items_pass2+0x107/0x230
        xlog_recover_commit_trans+0x3e7/0x9c0
        xlog_recovery_process_trans+0x140/0x1d0
        xlog_recover_process_ophdr+0x1a0/0x3d0
        xlog_recover_process_data+0x108/0x2d0
        xlog_recover_process+0x1f6/0x280
        xlog_do_recovery_pass+0x609/0xdb0
        xlog_do_log_recovery+0x84/0xe0
        xlog_do_recover+0x7d/0x470
        xlog_recover+0x25f/0x490
        xfs_log_mount+0x2dd/0x6f0
        xfs_mountfs+0x11ce/0x1e70
        xfs_fs_fill_super+0x10ec/0x1b20
        get_tree_bdev+0x3c8/0x730
        vfs_get_tree+0x89/0x2c0
        path_mount+0xecf/0x1800
        do_mount+0xf3/0x110
        __x64_sys_mount+0x154/0x1f0
        do_syscall_64+0x39/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
       Freed by task 531:
        kasan_save_stack+0x22/0x40
        kasan_set_track+0x25/0x30
        kasan_save_free_info+0x2b/0x40
        __kasan_slab_free+0x114/0x1b0
        kmem_cache_free+0xf8/0x510
        xfs_cui_item_free+0x95/0xb0
        xfs_cui_release+0x86/0xc0
        xlog_recover_cancel_intents.isra.0+0xf8/0x210
        xlog_recover_finish+0x7e7/0x980
        xfs_log_mount_finish+0x2bb/0x4a0
        xfs_mountfs+0x14bf/0x1e70
        xfs_fs_fill_super+0x10ec/0x1b20
        get_tree_bdev+0x3c8/0x730
        vfs_get_tree+0x89/0x2c0
        path_mount+0xecf/0x1800
        do_mount+0xf3/0x110
        __x64_sys_mount+0x154/0x1f0
        do_syscall_64+0x39/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
       The buggy address belongs to the object at ffff888012575dc8
        which belongs to the cache xfs_cui_item of size 432
       The buggy address is located 152 bytes inside of
        freed 432-byte region [ffff888012575dc8, ffff888012575f78)
      
       The buggy address belongs to the physical page:
       page:ffffea0000495d00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888012576208 pfn:0x12574
       head:ffffea0000495d00 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
       flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
       page_type: 0xffffffff()
       raw: 001fffff80010200 ffff888012092f40 ffff888014570150 ffff888014570150
       raw: ffff888012576208 00000000001e0010 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff888012575d00: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
        ffff888012575d80: fc fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
       >ffff888012575e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                              ^
        ffff888012575e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff888012575f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
       ==================================================================
      
      "If process intents fails, intent items left in AIL will be delete
      from AIL and freed in error handling, even intent items that have been
      recovered and created done items. After this, uaf will be triggered when
      done item committed, because at this point the released intent item will
      be accessed.
      
      xlog_recover_finish                     xlog_cil_push_work
      ----------------------------            ---------------------------
      xlog_recover_process_intents
        xfs_cui_item_recover//cui_refcount == 1
          xfs_trans_get_cud
          xfs_trans_commit
            <add cud item to cil>
        xfs_cui_item_recover
          <error occurred and return>
      xlog_recover_cancel_intents
        xfs_cui_release     //cui_refcount == 0
          xfs_cui_item_free //free cui
        <release other intent items>
      xlog_force_shutdown   //shutdown
                                     <...>
                                              <push items in cil>
                                              xlog_cil_committed
                                                xfs_cud_item_release
                                                  xfs_cui_release // UAF
      
      "Intent log items are created with a reference count of 2, one for the
      creator, and one for the intent done object. Log recovery explicitly
      drops the creator reference after it is inserted into the AIL, but it
      then processes the log item as if it also owns the intent-done reference.
      
      "The code in ->iop_recovery should assume that it passes the reference
      to the done intent, we can remove the intent item from the AIL after
      creating the done-intent, but if that code fails before creating the
      done-intent then it needs to release the intent reference by log recovery
      itself.
      
      "That way when we go to cancel the intent, the only intents we find in
      the AIL are the ones we know have not been processed yet and hence we
      can safely drop both the creator and the intent done reference from
      xlog_recover_cancel_intents().
      
      "Hence if we remove the intent from the list of intents that need to
      be recovered after we have done the initial recovery, we acheive two
      things:
      
      "1. the tail of the log can be moved forward with the commit of the
      done intent or new intent to continue the operation, and
      
      "2. We avoid the problem of trying to determine how many reference
      counts we need to drop from intent recovery cancelling because we
      never come across intents we've actually attempted recovery on."
      
      Restated: The cause of the UAF is that xlog_recover_cancel_intents
      thinks that it owns the refcount on any intent item in the AIL, and that
      it's always safe to release these intent items.  This is not true after
      the recovery function creates an log intent done item and points it at
      the log intent item because releasing the done item always releases the
      intent item.
      
      The runtime defer ops code avoids all this by tracking both the log
      intent and the intent done items, and releasing only the intent done
      item if both have been created.  Long Li proposed fixing this by adding
      state flags, but I have a more comprehensive fix.
      
      First, observe that the latter half of the intent _recover functions are
      nearly open-coded versions of the corresponding _finish_one function
      that uses an onstack deferred work item to single-step through the item.
      
      Second, notice that the recover function is not an exact match because
      of the odd behavior that unfinished recovered work items are relogged
      with separate log intent items instead of a single new log intent item,
      which is what the defer ops machinery does.
      
      Dave and I have long suspected that recovery should be reconstructing
      the defer work state from what's in the recovered intent item.  Now we
      finally have an excuse to refactor the code to do that.
      
      This series starts by fixing a resource leak in LARP recovery.  We fix
      the bug that Long Li reported by switching the intent recovery code to
      construct chains of xfs_defer_pending objects and then using the defer
      pending objects to track the intent/done item ownership.  Finally, we
      clean up the code to reconstruct the exact incore state, which means we
      can remove all the opencoded _recover code, which makes maintaining log
      items much easier.
      
      v2: minor changes per review comments
      v3: pick up more rvb tags, fix build errors
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'reconstruct-defer-work-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: move ->iop_recover to xfs_defer_op_type
        xfs: use xfs_defer_finish_one to finish recovered work items
        xfs: dump the recovered xattri log item if corruption happens
        xfs: recreate work items when recovering intent items
        xfs: transfer recovered intent item ownership in ->iop_recover
        xfs: pass the xfs_defer_pending object to iop_recover
        xfs: use xfs_defer_pending objects to recover intent items
        xfs: don't leak recovered attri intent items
      6b4ffe97
    • Darrick J. Wong's avatar
      xfs: force small EFIs for reaping btree extents · 3f3cec03
      Darrick J. Wong authored
      Introduce the concept of a defer ops barrier to separate consecutively
      queued pending work items of the same type.  With a barrier in place,
      the two work items will be tracked separately, and receive separate log
      intent items.  The goal here is to prevent reaping of old metadata
      blocks from creating unnecessarily huge EFIs that could then run the
      risk of overflowing the scrub transaction.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      3f3cec03
    • Darrick J. Wong's avatar
      xfs: log EFIs for all btree blocks being used to stage a btree · 6bb9ea8e
      Darrick J. Wong authored
      We need to log EFIs for every extent that we allocate for the purpose of
      staging a new btree so that if we fail then the blocks will be freed
      during log recovery.  Use the autoreaping mechanism provided by the
      previous patch to attach paused freeing work to the scrub transaction.
      We can then mark the EFIs stale if we decide to commit the new btree, or
      we can unpause the EFIs if we decide to abort the repair.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      6bb9ea8e
    • Darrick J. Wong's avatar
      xfs: implement block reservation accounting for btrees we're staging · be408417
      Darrick J. Wong authored
      Create a new xrep_newbt structure to encapsulate a fake root for
      creating a staged btree cursor as well as to track all the blocks that
      we need to reserve in order to build that btree.
      
      As for the particular choice of lowspace thresholds and btree block
      slack factors -- at this point one could say that the thresholds in
      online repair come from bulkload_estimate_ag_slack in xfs_repair[1].
      But that's not the entire story, since the offline btree rebuilding
      code in xfs_repair was merged as a retroport of the online btree code
      in this patchset!
      
      Before xfs_btree_staging.[ch] came along, xfs_repair determined the
      slack factor (aka the number of slots to leave unfilled in each new
      btree block) via open-coded logic in repair/phase5.c[2].  At that point
      the slack factors were arbitrary quantities per btree.  The rmapbt
      automatically left 10 slots free; everything else left zero.
      
      That had a noticeable effect on performance straight after mounting
      because adding records to /any/ btree would result in splits.  A few
      years ago when this patch was first written, Dave and I decided that
      repair should generate btree blocks that were 75% full unless space was
      tight, in which case it should try to fill the blocks to nearly full.
      We defined tight as ~10% free to avoid repair failures but settled on
      3/32 (~9%) to avoid div64.
      
      IOWs, we mostly pulled the thresholds out of thin air.  We've been
      QAing with those geometry numbers ever since. ;)
      
      Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/bulkload.c?h=v6.5.0#n114
      Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/phase5.c?h=v4.19.0#n1349Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      be408417
    • Darrick J. Wong's avatar
      xfs: remove unused fields from struct xbtree_ifakeroot · 4c8ecd1c
      Darrick J. Wong authored
      Remove these unused fields since nobody uses them.  They should have
      been removed years ago in a different cleanup series from Christoph
      Hellwig.
      
      Fixes: daf83964 ("xfs: move the per-fork nextents fields into struct xfs_ifork")
      Fixes: f7e67b20 ("xfs: move the fork format fields into struct xfs_ifork")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      4c8ecd1c
    • Darrick J. Wong's avatar
      xfs: automatic freeing of freshly allocated unwritten space · e3042be3
      Darrick J. Wong authored
      As mentioned in the previous commit, online repair wants to allocate
      space to write out a new metadata structure, and it also wants to hedge
      against system crashes during repairs by logging (and later cancelling)
      EFIs to free the space if we crash before committing the new data
      structure.
      
      Therefore, create a trio of functions to schedule automatic reaping of
      freshly allocated unwritten space.  xfs_alloc_schedule_autoreap creates
      a paused EFI representing the space we just allocated.  Once the
      allocations are made and the autoreaps scheduled, we can start writing
      to disk.
      
      If the writes succeed, xfs_alloc_cancel_autoreap marks the EFI work
      items as stale and unpauses the pending deferred work item.  Assuming
      that's done in the same transaction that commits the new structure into
      the filesystem, we guarantee that either the new object is fully
      visible, or that all the space gets reclaimed.
      
      If the writes succeed but only part of an extent was used, repair must
      call the same _cancel_autoreap function to kill the first EFI and then
      log a new EFI to free the unused space.  The first EFI is already
      committed, so it cannot be changed.
      
      For full extents that aren't used, xfs_alloc_commit_autoreap will
      unpause the EFI, which results in the space being freed during the next
      _defer_finish cycle.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e3042be3
    • Darrick J. Wong's avatar
      xfs: remove __xfs_free_extent_later · 4c88fef3
      Darrick J. Wong authored
      xfs_free_extent_later is a trivial helper, so remove it to reduce the
      amount of thinking required to understand the deferred freeing
      interface.  This will make it easier to introduce automatic reaping of
      speculative allocations in the next patch.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      4c88fef3