1. 07 Dec, 2023 26 commits
    • Chandan Babu R's avatar
      Merge tag 'defer-elide-create-done-6.8_2023-12-06' of... · 9f334526
      Chandan Babu R authored
      Merge tag 'defer-elide-create-done-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: elide defer work ->create_done if no intent
      
      Christoph pointed out that the defer ops machinery doesn't need to call
      ->create_done if the deferred work item didn't generate a log intent
      item in the first place.  Let's clean that up and save an indirect call
      in the non-logged xattr update call path.
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'defer-elide-create-done-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: elide ->create_done calls for unlogged deferred work
        xfs: document what LARP means
      9f334526
    • Chandan Babu R's avatar
      Merge tag 'fix-rtmount-overflows-6.8_2023-12-06' of... · 47c460ef
      Chandan Babu R authored
      Merge tag 'fix-rtmount-overflows-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: fix realtime geometry integer overflows
      
      While reading through the realtime geometry support code in xfsprogs, I
      noticed a discrepancy between the sb_rextslog computation used when
      writing out the superblock during mkfs and the validation code used in
      xfs_repair.  This discrepancy would lead to system failure for a runt rt
      volume having more than 1 rt block but zero rt extents in length.  Most
      people aren't going to configure a 1M extent size for their 360k rt
      floppy disk volume, but I did!
      
      In the process of studying that code, it occurred to me that there is a
      second bug in the computation -- the use of highbit32 for a 64-bit
      value means that the upper 32 bits are not considered in the search for
      a high bit.  This causes the creation of a realtime summary file that is
      the wrong length.  If rextents is a multiple of U32_MAX then this will
      appear to work fine because highbit32 returns -1 for an input of 0; but
      for all other cases the rt summary is undersized, leading to failures.
      
      Fix the first problem by standardizing the computation with a helper in
      libxfs; and the second problem by correcting the computation.  This will
      cause any existing rt volumes larger than 2^32 blocks to fail validation
      but they probably were already crashing anyway.
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'fix-rtmount-overflows-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: don't allow overly small or large realtime volumes
        xfs: fix 32-bit truncation in xfs_compute_rextslog
        xfs: make rextslog computation consistent with mkfs
      47c460ef
    • Chandan Babu R's avatar
      Merge tag 'reconstruct-defer-cleanups-6.8_2023-12-06' of... · 34d38666
      Chandan Babu R authored
      Merge tag 'reconstruct-defer-cleanups-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: continue removing defer item boilerplate
      
      Now that we've restructured log intent item recovery to reconstruct the
      incore deferred work state, apply further cleanups to that code to
      remove boilerplate that is duplicated across all the _item.c files.
      Having done that, collapse a bunch of trivial helpers to reduce the
      overall call chain.  That enables us to refactor the relog code so that
      the ->relog_item implementations only have to know how to format the
      implementation-specific data encoded in an intent item and don't
      themselves have to handle the log item juggling.
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'reconstruct-defer-cleanups-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: move ->iop_relog to struct xfs_defer_op_type
        xfs: collapse the ->create_done functions
        xfs: hoist xfs_trans_add_item calls to defer ops functions
        xfs: clean out XFS_LI_DIRTY setting boilerplate from ->iop_relog
        xfs: use xfs_defer_create_done for the relogging operation
        xfs: hoist ->create_intent boilerplate to its callsite
        xfs: collapse the ->finish_item helpers
        xfs: hoist intent done flag setting to ->finish_item callsite
        xfs: don't set XFS_TRANS_HAS_INTENT_DONE when there's no ATTRD log item
      34d38666
    • Chandan Babu R's avatar
      Merge tag 'reconstruct-defer-work-6.8_2023-12-06' of... · 6b4ffe97
      Chandan Babu R authored
      Merge tag 'reconstruct-defer-work-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA
      
      xfs: log intent item recovery should reconstruct defer work state
      
      Long Li reported a KASAN report from a UAF when intent recovery fails:
      
       ==================================================================
       BUG: KASAN: slab-use-after-free in xfs_cui_release+0xb7/0xc0
       Read of size 4 at addr ffff888012575e60 by task kworker/u8:3/103
       CPU: 3 PID: 103 Comm: kworker/u8:3 Not tainted 6.4.0-rc7-next-20230619-00003-g94543a53f9a4-dirty #166
       Workqueue: xfs-cil/sda xlog_cil_push_work
       Call Trace:
        <TASK>
        dump_stack_lvl+0x50/0x70
        print_report+0xc2/0x600
        kasan_report+0xb6/0xe0
        xfs_cui_release+0xb7/0xc0
        xfs_cud_item_release+0x3c/0x90
        xfs_trans_committed_bulk+0x2d5/0x7f0
        xlog_cil_committed+0xaba/0xf20
        xlog_cil_push_work+0x1a60/0x2360
        process_one_work+0x78e/0x1140
        worker_thread+0x58b/0xf60
        kthread+0x2cd/0x3c0
        ret_from_fork+0x1f/0x30
        </TASK>
      
       Allocated by task 531:
        kasan_save_stack+0x22/0x40
        kasan_set_track+0x25/0x30
        __kasan_slab_alloc+0x55/0x60
        kmem_cache_alloc+0x195/0x5f0
        xfs_cui_init+0x198/0x1d0
        xlog_recover_cui_commit_pass2+0x133/0x5f0
        xlog_recover_items_pass2+0x107/0x230
        xlog_recover_commit_trans+0x3e7/0x9c0
        xlog_recovery_process_trans+0x140/0x1d0
        xlog_recover_process_ophdr+0x1a0/0x3d0
        xlog_recover_process_data+0x108/0x2d0
        xlog_recover_process+0x1f6/0x280
        xlog_do_recovery_pass+0x609/0xdb0
        xlog_do_log_recovery+0x84/0xe0
        xlog_do_recover+0x7d/0x470
        xlog_recover+0x25f/0x490
        xfs_log_mount+0x2dd/0x6f0
        xfs_mountfs+0x11ce/0x1e70
        xfs_fs_fill_super+0x10ec/0x1b20
        get_tree_bdev+0x3c8/0x730
        vfs_get_tree+0x89/0x2c0
        path_mount+0xecf/0x1800
        do_mount+0xf3/0x110
        __x64_sys_mount+0x154/0x1f0
        do_syscall_64+0x39/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
       Freed by task 531:
        kasan_save_stack+0x22/0x40
        kasan_set_track+0x25/0x30
        kasan_save_free_info+0x2b/0x40
        __kasan_slab_free+0x114/0x1b0
        kmem_cache_free+0xf8/0x510
        xfs_cui_item_free+0x95/0xb0
        xfs_cui_release+0x86/0xc0
        xlog_recover_cancel_intents.isra.0+0xf8/0x210
        xlog_recover_finish+0x7e7/0x980
        xfs_log_mount_finish+0x2bb/0x4a0
        xfs_mountfs+0x14bf/0x1e70
        xfs_fs_fill_super+0x10ec/0x1b20
        get_tree_bdev+0x3c8/0x730
        vfs_get_tree+0x89/0x2c0
        path_mount+0xecf/0x1800
        do_mount+0xf3/0x110
        __x64_sys_mount+0x154/0x1f0
        do_syscall_64+0x39/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
       The buggy address belongs to the object at ffff888012575dc8
        which belongs to the cache xfs_cui_item of size 432
       The buggy address is located 152 bytes inside of
        freed 432-byte region [ffff888012575dc8, ffff888012575f78)
      
       The buggy address belongs to the physical page:
       page:ffffea0000495d00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888012576208 pfn:0x12574
       head:ffffea0000495d00 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
       flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
       page_type: 0xffffffff()
       raw: 001fffff80010200 ffff888012092f40 ffff888014570150 ffff888014570150
       raw: ffff888012576208 00000000001e0010 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff888012575d00: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
        ffff888012575d80: fc fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
       >ffff888012575e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                              ^
        ffff888012575e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff888012575f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
       ==================================================================
      
      "If process intents fails, intent items left in AIL will be delete
      from AIL and freed in error handling, even intent items that have been
      recovered and created done items. After this, uaf will be triggered when
      done item committed, because at this point the released intent item will
      be accessed.
      
      xlog_recover_finish                     xlog_cil_push_work
      ----------------------------            ---------------------------
      xlog_recover_process_intents
        xfs_cui_item_recover//cui_refcount == 1
          xfs_trans_get_cud
          xfs_trans_commit
            <add cud item to cil>
        xfs_cui_item_recover
          <error occurred and return>
      xlog_recover_cancel_intents
        xfs_cui_release     //cui_refcount == 0
          xfs_cui_item_free //free cui
        <release other intent items>
      xlog_force_shutdown   //shutdown
                                     <...>
                                              <push items in cil>
                                              xlog_cil_committed
                                                xfs_cud_item_release
                                                  xfs_cui_release // UAF
      
      "Intent log items are created with a reference count of 2, one for the
      creator, and one for the intent done object. Log recovery explicitly
      drops the creator reference after it is inserted into the AIL, but it
      then processes the log item as if it also owns the intent-done reference.
      
      "The code in ->iop_recovery should assume that it passes the reference
      to the done intent, we can remove the intent item from the AIL after
      creating the done-intent, but if that code fails before creating the
      done-intent then it needs to release the intent reference by log recovery
      itself.
      
      "That way when we go to cancel the intent, the only intents we find in
      the AIL are the ones we know have not been processed yet and hence we
      can safely drop both the creator and the intent done reference from
      xlog_recover_cancel_intents().
      
      "Hence if we remove the intent from the list of intents that need to
      be recovered after we have done the initial recovery, we acheive two
      things:
      
      "1. the tail of the log can be moved forward with the commit of the
      done intent or new intent to continue the operation, and
      
      "2. We avoid the problem of trying to determine how many reference
      counts we need to drop from intent recovery cancelling because we
      never come across intents we've actually attempted recovery on."
      
      Restated: The cause of the UAF is that xlog_recover_cancel_intents
      thinks that it owns the refcount on any intent item in the AIL, and that
      it's always safe to release these intent items.  This is not true after
      the recovery function creates an log intent done item and points it at
      the log intent item because releasing the done item always releases the
      intent item.
      
      The runtime defer ops code avoids all this by tracking both the log
      intent and the intent done items, and releasing only the intent done
      item if both have been created.  Long Li proposed fixing this by adding
      state flags, but I have a more comprehensive fix.
      
      First, observe that the latter half of the intent _recover functions are
      nearly open-coded versions of the corresponding _finish_one function
      that uses an onstack deferred work item to single-step through the item.
      
      Second, notice that the recover function is not an exact match because
      of the odd behavior that unfinished recovered work items are relogged
      with separate log intent items instead of a single new log intent item,
      which is what the defer ops machinery does.
      
      Dave and I have long suspected that recovery should be reconstructing
      the defer work state from what's in the recovered intent item.  Now we
      finally have an excuse to refactor the code to do that.
      
      This series starts by fixing a resource leak in LARP recovery.  We fix
      the bug that Long Li reported by switching the intent recovery code to
      construct chains of xfs_defer_pending objects and then using the defer
      pending objects to track the intent/done item ownership.  Finally, we
      clean up the code to reconstruct the exact incore state, which means we
      can remove all the opencoded _recover code, which makes maintaining log
      items much easier.
      
      v2: minor changes per review comments
      v3: pick up more rvb tags, fix build errors
      
      This has been lightly tested with fstests.  Enjoy!
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarChandan Babu R <chandanbabu@kernel.org>
      
      * tag 'reconstruct-defer-work-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: move ->iop_recover to xfs_defer_op_type
        xfs: use xfs_defer_finish_one to finish recovered work items
        xfs: dump the recovered xattri log item if corruption happens
        xfs: recreate work items when recovering intent items
        xfs: transfer recovered intent item ownership in ->iop_recover
        xfs: pass the xfs_defer_pending object to iop_recover
        xfs: use xfs_defer_pending objects to recover intent items
        xfs: don't leak recovered attri intent items
      6b4ffe97
    • Darrick J. Wong's avatar
      xfs: elide ->create_done calls for unlogged deferred work · 9c07bca7
      Darrick J. Wong authored
      Extended attribute updates use the deferred work machinery to manage
      state across a chain of smaller transactions.  All previous deferred
      work users have employed log intent items and log done items to manage
      restarting of interrupted operations, which means that ->create_intent
      sets dfp_intent to a log intent item and ->create_done uses that item to
      create a log intent done item.
      
      However, xattrs have used the INCOMPLETE flag to deal with the lack of
      recovery support for an interrupted transaction chain.  Log items are
      optional if the xattr update caller didn't set XFS_DA_OP_LOGGED to
      require a restartable sequence.
      
      In other words, ->create_intent can return NULL to say that there's no
      log intent item.  If that's the case, no log intent done item should be
      created.  Clean up xfs_defer_create_done not to do this, so that the
      ->create_done functions don't have to check for non-null dfp_intent
      themselves.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      9c07bca7
    • Darrick J. Wong's avatar
      xfs: don't allow overly small or large realtime volumes · e1429380
      Darrick J. Wong authored
      Don't allow realtime volumes that are less than one rt extent long.
      This has been broken across 4 LTS kernels with nobody noticing, so let's
      just disable it.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e1429380
    • Darrick J. Wong's avatar
      xfs: move ->iop_relog to struct xfs_defer_op_type · a49c708f
      Darrick J. Wong authored
      The only log items that need relogging are the ones created for deferred
      work operations, and the only part of the code base that relogs log
      items is the deferred work machinery.  Move the function pointers.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a49c708f
    • Darrick J. Wong's avatar
      xfs: document what LARP means · 94da54d5
      Darrick J. Wong authored
      Christoph requested a blurb somewhere explaining exactly what LARP
      means.  I don't know of a good place other than the source code (debug
      knobs aren't covered in Documentation/), so here it is.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      94da54d5
    • Darrick J. Wong's avatar
      xfs: fix 32-bit truncation in xfs_compute_rextslog · cf8f0e6c
      Darrick J. Wong authored
      It's quite reasonable that some customer somewhere will want to
      configure a realtime volume with more than 2^32 extents.  If they try to
      do this, the highbit32() call will truncate the upper bits of the
      xfs_rtbxlen_t and produce the wrong value for rextslog.  This in turn
      causes the rsumlevels to be wrong, which results in a realtime summary
      file that is the wrong length.  Fix that.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      cf8f0e6c
    • Darrick J. Wong's avatar
      xfs: make rextslog computation consistent with mkfs · a6a38f30
      Darrick J. Wong authored
      There's a weird discrepancy in xfsprogs dating back to the creation of
      the Linux port -- if there are zero rt extents, mkfs will set
      sb_rextents and sb_rextslog both to zero:
      
      	sbp->sb_rextslog =
      		(uint8_t)(rtextents ?
      			libxfs_highbit32((unsigned int)rtextents) : 0);
      
      However, that's not the check that xfs_repair uses for nonzero rtblocks:
      
      	if (sb->sb_rextslog !=
      			libxfs_highbit32((unsigned int)sb->sb_rextents))
      
      The difference here is that xfs_highbit32 returns -1 if its argument is
      zero.  Unfortunately, this means that in the weird corner case of a
      realtime volume shorter than 1 rt extent, xfs_repair will immediately
      flag a freshly formatted filesystem as corrupt.  Because mkfs has been
      writing ondisk artifacts like this for decades, we have to accept that
      as "correct".  TBH, zero rextslog for zero rtextents makes more sense to
      me anyway.
      
      Regrettably, the superblock verifier checks created in commit copied
      xfs_repair even though mkfs has been writing out such filesystems for
      ages.  Fix the superblock verifier to accept what mkfs spits out; the
      userspace version of this patch will have to fix xfs_repair as well.
      
      Note that the new helper leaves the zeroday bug where the upper 32 bits
      of sb_rextents is ripped off and fed to highbit32.  This leads to a
      seriously undersized rt summary file, which immediately breaks mkfs:
      
      $ hugedisk.sh foo /dev/sdc $(( 0x100000080 * 4096))B
      $ /sbin/mkfs.xfs -f /dev/sda -m rmapbt=0,reflink=0 -r rtdev=/dev/mapper/foo
      meta-data=/dev/sda               isize=512    agcount=4, agsize=1298176 blks
               =                       sectsz=512   attr=2, projid32bit=1
               =                       crc=1        finobt=1, sparse=1, rmapbt=0
               =                       reflink=0    bigtime=1 inobtcount=1 nrext64=1
      data     =                       bsize=4096   blocks=5192704, imaxpct=25
               =                       sunit=0      swidth=0 blks
      naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
      log      =internal log           bsize=4096   blocks=16384, version=2
               =                       sectsz=512   sunit=0 blks, lazy-count=1
      realtime =/dev/mapper/foo        extsz=4096   blocks=4294967424, rtextents=4294967424
      Discarding blocks...Done.
      mkfs.xfs: Error initializing the realtime space [117 - Structure needs cleaning]
      
      The next patch will drop support for rt volumes with fewer than 1 or
      more than 2^32-1 rt extents, since they've clearly been broken forever.
      
      Fixes: f8e566c0 ("xfs: validate the realtime geometry in xfs_validate_sb_common")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a6a38f30
    • Darrick J. Wong's avatar
      xfs: collapse the ->create_done functions · 8a9aa763
      Darrick J. Wong authored
      Move the meat of the ->create_done function helpers into ->create_done
      to reduce the amount of boilerplate.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      8a9aa763
    • Darrick J. Wong's avatar
      xfs: hoist xfs_trans_add_item calls to defer ops functions · b28852a5
      Darrick J. Wong authored
      Remove even more repeated boilerplate.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      b28852a5
    • Darrick J. Wong's avatar
      xfs: clean out XFS_LI_DIRTY setting boilerplate from ->iop_relog · 3e0958be
      Darrick J. Wong authored
      Hoist this dirty flag setting to the ->iop_relog callsite to reduce
      boilerplate.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      3e0958be
    • Darrick J. Wong's avatar
      xfs: use xfs_defer_create_done for the relogging operation · bd3a88f6
      Darrick J. Wong authored
      Now that we have a helper to handle creating a log intent done item and
      updating all the necessary state flags, use it to reduce boilerplate in
      the ->iop_relog implementations.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      bd3a88f6
    • Darrick J. Wong's avatar
      xfs: hoist ->create_intent boilerplate to its callsite · f3fd7f6f
      Darrick J. Wong authored
      Hoist the dirty flag setting code out of each ->create_intent
      implementation up to the callsite to reduce boilerplate further.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      f3fd7f6f
    • Darrick J. Wong's avatar
      xfs: collapse the ->finish_item helpers · e6e5299f
      Darrick J. Wong authored
      Each log item's ->finish_item function sets up a small amount of state
      and calls another function to do the work.  Collapse that other function
      into ->finish_item to reduce the call stack height.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e6e5299f
    • Darrick J. Wong's avatar
      xfs: move ->iop_recover to xfs_defer_op_type · db7ccc0b
      Darrick J. Wong authored
      Finish off the series by moving the intent item recovery function
      pointer to the xfs_defer_op_type struct, since this is really a deferred
      work function now.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      db7ccc0b
    • Darrick J. Wong's avatar
      xfs: hoist intent done flag setting to ->finish_item callsite · 3dd75c8d
      Darrick J. Wong authored
      Each log intent item's ->finish_item call chain inevitably includes some
      code to set the dirty flag of the transaction.  If there's an associated
      log intent done item, it also sets the item's dirty flag and the
      transaction's INTENT_DONE flag.  This is repeated throughout the
      codebase.
      
      Reduce the LOC by moving all that to xfs_defer_finish_one.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      3dd75c8d
    • Darrick J. Wong's avatar
      xfs: use xfs_defer_finish_one to finish recovered work items · e5f1a514
      Darrick J. Wong authored
      Get rid of the open-coded calls to xfs_defer_finish_one.  This also
      means that the recovery transaction takes care of cleaning up the dfp,
      and we have solved (I hope) all the ownership issues in recovery.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e5f1a514
    • Darrick J. Wong's avatar
      xfs: don't set XFS_TRANS_HAS_INTENT_DONE when there's no ATTRD log item · 172538be
      Darrick J. Wong authored
      XFS_TRANS_HAS_INTENT_DONE is a flag to the CIL that we've added a log
      intent done item to the transaction.  This enables an optimization
      wherein we avoid writing out log intent and log intent done items if
      they would have ended up in the same checkpoint.  This reduces writes to
      the ondisk log and speeds up recovery as a result.
      
      However, callers can use the defer ops machinery to modify xattrs
      without using the log items.  In this situation, there won't be an
      intent done item, so we do not need to set the flag.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      172538be
    • Darrick J. Wong's avatar
      xfs: dump the recovered xattri log item if corruption happens · a51489e1
      Darrick J. Wong authored
      If xfs_attri_item_recover receives a corruption error when it tries to
      finish a recovered log intent item, it should dump the log item for
      debugging, just like all the other log intent items.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a51489e1
    • Darrick J. Wong's avatar
      xfs: recreate work items when recovering intent items · e70fb328
      Darrick J. Wong authored
      Recreate work items for each xfs_defer_pending object when we are
      recovering intent items.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e70fb328
    • Darrick J. Wong's avatar
      xfs: transfer recovered intent item ownership in ->iop_recover · deb4cd8b
      Darrick J. Wong authored
      Now that we pass the xfs_defer_pending object into the intent item
      recovery functions, we know exactly when ownership of the sole refcount
      passes from the recovery context to the intent done item.  At that
      point, we need to null out dfp_intent so that the recovery mechanism
      won't release it.  This should fix the UAF problem reported by Long Li.
      
      Note that we still want to recreate the full deferred work state.  That
      will be addressed in the next patches.
      
      Fixes: 2e76f188 ("xfs: cancel intents immediately if process_intents fails")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      deb4cd8b
    • Darrick J. Wong's avatar
      xfs: pass the xfs_defer_pending object to iop_recover · a050acdf
      Darrick J. Wong authored
      Now that log intent item recovery recreates the xfs_defer_pending state,
      we should pass that into the ->iop_recover routines so that the intent
      item can finish the recreation work.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a050acdf
    • Darrick J. Wong's avatar
      xfs: use xfs_defer_pending objects to recover intent items · 03f7767c
      Darrick J. Wong authored
      One thing I never quite got around to doing is porting the log intent
      item recovery code to reconstruct the deferred pending work state.  As a
      result, each intent item open codes xfs_defer_finish_one in its recovery
      method, because that's what the EFI code did before xfs_defer.c even
      existed.
      
      This is a gross thing to have left unfixed -- if an EFI cannot proceed
      due to busy extents, we end up creating separate new EFIs for each
      unfinished work item, which is a change in behavior from what runtime
      would have done.
      
      Worse yet, Long Li pointed out that there's a UAF in the recovery code.
      The ->commit_pass2 function adds the intent item to the AIL and drops
      the refcount.  The one remaining refcount is now owned by the recovery
      mechanism (aka the log intent items in the AIL) with the intent of
      giving the refcount to the intent done item in the ->iop_recover
      function.
      
      However, if something fails later in recovery, xlog_recover_finish will
      walk the recovered intent items in the AIL and release them.  If the CIL
      hasn't been pushed before that point (which is possible since we don't
      force the log until later) then the intent done release will try to free
      its associated intent, which has already been freed.
      
      This patch starts to address this mess by having the ->commit_pass2
      functions recreate the xfs_defer_pending state.  The next few patches
      will fix the recovery functions.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      03f7767c
    • Darrick J. Wong's avatar
      xfs: don't leak recovered attri intent items · 07bcbdf0
      Darrick J. Wong authored
      If recovery finds an xattr log intent item calling for the removal of an
      attribute and the file doesn't even have an attr fork, we know that the
      removal is trivially complete.  However, we can't just exit the recovery
      function without doing something about the recovered log intent item --
      it's still on the AIL, and not logging an attrd item means it stays
      there forever.
      
      This has likely not been seen in practice because few people use LARP
      and the runtime code won't log the attri for a no-attrfork removexattr
      operation.  But let's fix this anyway.
      
      Also we shouldn't really be testing the attr fork presence until we've
      taken the ILOCK, though this doesn't matter much in recovery, which is
      single threaded.
      
      Fixes: fdaf1bb3 ("xfs: ATTR_REPLACE algorithm with LARP enabled needs rework")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      07bcbdf0
  2. 03 Dec, 2023 3 commits
  3. 02 Dec, 2023 5 commits
    • Linus Torvalds's avatar
      Merge tag 'powerpc-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 1b8af655
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix corruption of f0/vs0 during FP/Vector save, seen as userspace
         crashes when using io-uring workers (in particular with MariaDB)
      
       - Fix KVM_RUN potentially clobbering all host userspace FP/Vector
         registers
      
      Thanks to Timothy Pearson, Jens Axboe, and Nicholas Piggin.
      
      * tag 'powerpc-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        KVM: PPC: Book3S HV: Fix KVM_RUN clobbering FP/VEC user registers
        powerpc: Don't clobber f0/vs0 during fp|altivec register save
      1b8af655
    • Linus Torvalds's avatar
      Merge tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio · 17b17be2
      Linus Torvalds authored
      Pull vfio fixes from Alex Williamson:
      
       - Fix the lifecycle of a mutex in the pds variant driver such that a
         reset prior to opening the device won't find it uninitialized.
         Implement the release path to symmetrically destroy the mutex. Also
         switch a different lock from spinlock to mutex as the code path has
         the potential to sleep and doesn't need the spinlock context
         otherwise (Brett Creeley)
      
       - Fix an issue detected via randconfig where KVM tries to symbol_get an
         undeclared function. The symbol is temporarily declared
         unconditionally here, which resolves the problem and avoids churn
         relative to a series pending for the next merge window which resolves
         some of this symbol ugliness, but also fixes Kconfig dependencies
         (Sean Christopherson)
      
      * tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio:
        vfio: Drop vfio_file_iommu_group() stub to fudge around a KVM wart
        vfio/pds: Fix possible sleep while in atomic context
        vfio/pds: Fix mutex lock->magic != lock warning
      17b17be2
    • Linus Torvalds's avatar
      Merge tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · deb4b9dd
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
      
       - A fix for the Xen event driver setting the correct return value when
         experiencing an allocation failure
      
       - A fix for allocating space for a struct in the percpu area to not
         cross page boundaries (this one is for x86, a similar one for Arm was
         already in the pull request for rc3)
      
      * tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/events: fix error code in xen_bind_pirq_msi_to_irq()
        x86/xen: fix percpu vcpu_info allocation
      deb4b9dd
    • Linus Torvalds's avatar
      Merge tag 'probes-fixes-v6.7-rc3' of... · 669fc834
      Linus Torvalds authored
      Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
      
      Pull probes fixes from Masami Hiramatsu:
      
       - objpool: Fix objpool overrun case on memory/cache access delay
         especially on the big.LITTLE SoC. The objpool uses a copy of object
         slot index internal loop, but the slot index can be changed on
         another processor in parallel. In that case, the difference of 'head'
         local copy and the 'slot->last' index will be bigger than local slot
         size. In that case, we need to re-read the slot::head to update it.
      
       - kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since
         kretprobe_holder::rp is RCU managed, it should use
         rcu_assign_pointer() and rcu_dereference_check() correctly. Also
         adding __rcu tag for finding wrong usage by sparse.
      
       - rethook: Fix to use appropriate rcu API for rethook::handler. The
         same as kretprobe, rethook::handler is RCU managed and it should use
         rcu_assign_pointer() and rcu_dereference_check(). This also adds
         __rcu tag for finding wrong usage by sparse.
      
      * tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        rethook: Use __rcu pointer for rethook::handler
        kprobes: consistent rcu api usage for kretprobe holder
        lib: objpool: fix head overrun on RK3588 SBC
      669fc834
    • Linus Torvalds's avatar
      Merge tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 815fb87b
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "These fix issues in two cpufreq drivers, in the AMD P-state driver and
        in the power-capping DTPM framework.
      
        Specifics:
      
         - Fix the AMD P-state driver's EPP sysfs interface in the cases when
           the performance governor is in use (Ayush Jain)
      
         - Make the ->fast_switch() callback in the AMD P-state driver return
           the target frequency as expected (Gautham R. Shenoy)
      
         - Allow user space to control the range of frequencies to use via
           scaling_min_freq and scaling_max_freq when AMD P-state driver is in
           use (Wyes Karny)
      
         - Prevent power domains needed for wakeup signaling from being turned
           off during system suspend on Qualcomm systems and prevent
           performance states votes from runtime-suspended devices from being
           lost across a system suspend-resume cycle in qcom-cpufreq-nvmem
           (Stephan Gerhold)
      
         - Fix disabling the 792 Mhz OPP in the imx6q cpufreq driver for the
           i.MX6ULL types that can run at that frequency (Christoph
           Niedermaier)
      
         - Eliminate unnecessary and harmful conversions to uW from the DTPM
           (dynamic thermal and power management) framework (Lukasz Luba)"
      
      * tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq/amd-pstate: Only print supported EPP values for performance governor
        cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update
        powercap: DTPM: Fix unneeded conversions to micro-Watts
        cpufreq/amd-pstate: Fix the return value of amd_pstate_fast_switch()
        pmdomain: qcom: rpmpd: Set GENPD_FLAG_ACTIVE_WAKEUP
        cpufreq: qcom-nvmem: Preserve PM domain votes in system suspend
        cpufreq: qcom-nvmem: Enable virtual power domain devices
        cpufreq: imx6q: Don't disable 792 Mhz OPP unnecessarily
      815fb87b
  4. 01 Dec, 2023 6 commits
    • Linus Torvalds's avatar
      Merge tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · ce474ae7
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "This fixes a recently introduced build issue on ARM32 and a NULL
        pointer dereference in the ACPI backlight driver due to a design issue
        exposed by a recent change in the ACPI bus type code.
      
        Specifics:
      
         - Fix a recently introduced build issue on ARM32 platforms caused by
           an inadvertent header file breakage (Dave Jiang)
      
         - Eliminate questionable usage of acpi_driver_data() in the ACPI
           backlight cooling device code that leads to NULL pointer
           dereferences after recent ACPI core changes (Hans de Goede)"
      
      * tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: video: Use acpi_video_device for cooling-dev driver data
        ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
      ce474ae7
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 35f84584
      Linus Torvalds authored
      Pull arm64 fix from Catalin Marinas:
       "Fix a regression where the arm64 KPTI ends up enabled even on systems
        that don't need it"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: Avoid enabling KPTI unnecessarily
      35f84584
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 1a2b4185
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - Fix race conditions in device probe path
      
       - Handle ERR_PTR() returns in __iommu_domain_alloc() path
      
       - Update MAINTAINERS entry for Qualcom IOMMUs
      
       - Printk argument fix in device tree specific code
      
       - Several Intel VT-d fixes from Lu Baolu:
           - Do not support enforcing cache coherency for non-empty domains
           - Avoid devTLB invalidation if iommu is off
           - Disable PCI ATS in legacy passthrough mode
           - Support non-PCI devices when clearing context
           - Fix incorrect cache invalidation for mm notification
           - Add MTL to quirk list to skip TE disabling
           - Set variable intel_dirty_ops to static
      
      * tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu: Fix printk arg in of_iommu_get_resv_regions()
        iommu/vt-d: Set variable intel_dirty_ops to static
        iommu/vt-d: Fix incorrect cache invalidation for mm notification
        iommu/vt-d: Add MTL to quirk list to skip TE disabling
        iommu/vt-d: Make context clearing consistent with context mapping
        iommu/vt-d: Disable PCI ATS in legacy passthrough mode
        iommu/vt-d: Omit devTLB invalidation requests when TES=0
        iommu/vt-d: Support enforce_cache_coherency only for empty domains
        iommu: Avoid more races around device probe
        MAINTAINERS: list all Qualcomm IOMMU drivers in the QUALCOMM IOMMU entry
        iommu: Flow ERR_PTR out from __iommu_domain_alloc()
      1a2b4185
    • Linus Torvalds's avatar
      Merge tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 06a3c59f
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "No surprise here, including only a collection of HD-audio
        device-specific small fixes"
      
      * tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda: Disable power-save on KONTRON SinglePC
        ALSA: hda/realtek: Add supported ALC257 for ChromeOS
        ALSA: hda/realtek: Headset Mic VREF to 100%
        ALSA: hda: intel-nhlt: Ignore vbps when looking for DMIC 32 bps format
        ALSA: hda: cs35l56: Enable low-power hibernation mode on SPI
        ALSA: cs35l41: Fix for old systems which do not support command
        ALSA: hda: cs35l41: Remove unnecessary boolean state variable firmware_running
        ALSA: hda - Fix speaker and headset mic pin config for CHUWI CoreBook XPro
      06a3c59f
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm · b1e51588
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Weekly fixes, mostly amdgpu fixes with a scattering of nouveau, i915,
        and a couple of reverts. Hopefully it will quieten down in coming
        weeks.
      
        drm:
         - Revert unexport of prime helpers for fd/handle conversion
      
        dma_resv:
         - Do not double add fences in dma_resv_add_fence.
      
        gpuvm:
         - Fix GPUVM license identifier.
      
        i915:
         - Mark internal GSC engine with reserved uabi class
         - Take VGA converters into account in eDP probe
         - Fix intel_pre_plane_updates() call to ensure workarounds get applied
      
        panel:
         - Revert panel fixes as they require exporting device_is_dependent.
      
        nouveau:
         - fix oversized allocations in new vm path
         - fix zero-length array
         - remove a stray lock
      
        nt36523:
         - Fix error check for nt36523.
      
        amdgpu:
         - DMUB fix
         - DCN 3.5 fixes
         - XGMI fix
         - DCN 3.2 fixes
         - Vangogh suspend fix
         - NBIO 7.9 fix
         - GFX11 golden register fix
         - Backlight fix
         - NBIO 7.11 fix
         - IB test overflow fix
         - DCN 3.1.4 fixes
         - fix a runtime pm ref count
         - Retimer fix
         - ABM fix
         - DCN 3.1.5 fix
         - Fix AGP addressing
         - Fix possible memory leak in SMU error path
         - Make sure PME is enabled in D3
         - Fix possible NULL pointer dereference in debugfs
         - EEPROM fix
         - GC 9.4.3 fix
      
        amdkfd:
         - IP version check fix
         - Fix memory leak in pqm_uninit()"
      
      * tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm: (53 commits)
        Revert "drm/prime: Unexport helpers for fd/handle conversion"
        drm/amdgpu: Use another offset for GC 9.4.3 remap
        drm/amd/display: Fix some HostVM parameters in DML
        drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit
        drm/amdgpu: Update EEPROM I2C address for smu v13_0_0
        drm/amd/display: Allow DTBCLK disable for DCN35
        drm/amdgpu: Fix cat debugfs amdgpu_regs_didt causes kernel null pointer
        drm/amd: Enable PCIe PME from D3
        drm/amd/pm: fix a memleak in aldebaran_tables_init
        drm/amdgpu: fix AGP addressing when GART is not at 0
        drm/amd/display: update dcn315 lpddr pstate latency
        drm/amd/display: fix ABM disablement
        drm/amd/display: Fix black screen on video playback with embedded panel
        drm/amd/display: Fix conversions between bytes and KB
        drm/amdkfd: Use common function for IP version check
        drm/amd/display: Remove config update
        drm/amd/display: Update DCN35 clock table policy
        drm/amd/display: force toggle rate wa for first link training for a retimer
        drm/amdgpu: correct the amdgpu runtime dereference usage count
        drm/amd/display: Update min Z8 residency time to 2100 for DCN314
        ...
      b1e51588
    • Linus Torvalds's avatar
      Merge tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux · c9a925b7
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - Fix an issue with discontig page checking for IORING_SETUP_NO_MMAP
      
       - Fix an issue with not allowing IORING_SETUP_NO_MMAP also disallowing
         mmap'ed buffer rings
      
       - Fix an issue with deferred release of memory mapped pages
      
       - Fix a lockdep issue with IORING_SETUP_NO_MMAP
      
       - Use fget/fput consistently, even from our sync system calls. No real
         issue here, but if we were ever to allow closing io_uring descriptors
         it would be required. Let's play it safe and just use the full ref
         counted versions upfront. Most uses of io_uring are threaded anyway,
         and hence already doing the full version underneath.
      
      * tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux:
        io_uring: use fget/fput consistently
        io_uring: free io_buffer_list entries via RCU
        io_uring/kbuf: prune deferred locked cache when tearing down
        io_uring/kbuf: recycle freed mapped buffer ring entries
        io_uring/kbuf: defer release of mapped buffer rings
        io_uring: enable io_mem_alloc/free to be used in other parts
        io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAP
        io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAP
      c9a925b7