1. 19 Jan, 2022 1 commit
    • xfs: flush inodegc workqueue tasks before cancel · 6191cf3a
      Brian Foster authored
      The xfs_inodegc_stop() helper performs a high level flush of pending
      work on the percpu queues and then runs a cancel_work_sync() on each
      of the percpu work tasks to ensure all work has completed before
      returning.  While cancel_work_sync() waits for wq tasks to complete,
      it does not guarantee work tasks have started. This means that the
      _stop() helper can queue and instantly cancel a wq task without
      having completed the associated work. This can be observed by
      tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
      test:
      
      	xfs_destroy_inode: ... ino 0x83 ...
      	xfs_inode_set_need_inactive: ... ino 0x83 ...
      	xfs_inodegc_stop: ...
      	...
      	xfs_inodegc_start: ...
      	xfs_inodegc_worker: ...
      	xfs_inode_inactivating: ... ino 0x83 ...
      
      The first few lines show that the inode is removed and need inactive
      state set, but the inactivation work has not completed before the
      inodegc mechanism stops. The inactivation doesn't actually occur
      until the fs is unfrozen and the gc mechanism starts back up. Note
      that this test requires fsfreeze to reproduce because xfs_freeze
      indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().
      
      When this occurs, the workqueue try_to_grab_pending() logic first
      tries to steal the pending bit, which does not succeed because the
      bit has been set by queue_work_on(). Subsequently, it checks for
      association of a pool workqueue from the work item under the pool
      lock. This association is set at the point a work item is queued and
      cleared when dequeued for processing. If the association exists, the
      work item is removed from the queue and cancel_work_sync() returns
      true. If the pwq association is cleared, the remove attempt assumes
      the task is busy and retries (eventually returning false to the
      caller after waiting for the work task to complete).
      
      To avoid this race, we can flush each work item explicitly before
      cancel. However, since the _queue_all() already schedules each
      underlying work item, the workqueue level helpers are sufficient to
      achieve the same ordering effect. E.g., the inodegc enabled flag
      prevents scheduling any further work in the _stop() case. Use the
      drain_workqueue() helper in this particular case to make the intent
      a bit more self explanatory.
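
The ordering problem can be illustrated with a toy userspace model. This is only a sketch, not kernel code: `struct toy_work`, `toy_queue()`, `toy_flush()` and `toy_cancel()` are hypothetical stand-ins for a work item, queue_work_on(), flush_work() and cancel_work_sync() acting on an item that has not started yet.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a work item; not the kernel's work_struct. */
struct toy_work {
	bool	queued;		/* queued but not yet started */
	int	runs;		/* how many times the handler has run */
};

/* Models queue_work_on(): mark the item pending. */
static void toy_queue(struct toy_work *w)
{
	w->queued = true;
}

/* Models flush_work(): run the handler if the item is still queued. */
static void toy_flush(struct toy_work *w)
{
	if (w->queued) {
		w->queued = false;
		w->runs++;
	}
}

/*
 * Models cancel_work_sync() on an item that has not started: the item is
 * pulled off the queue without running. Returns true if it was pending.
 */
static bool toy_cancel(struct toy_work *w)
{
	bool was_queued = w->queued;

	w->queued = false;
	return was_queued;
}

/* The racy ordering: queue, then cancel straight away. */
static int toy_stop_racy(struct toy_work *w)
{
	toy_queue(w);
	toy_cancel(w);
	return w->runs;
}

/* The fixed ordering: queue, flush, then cancel. */
static int toy_stop_flushed(struct toy_work *w)
{
	toy_queue(w);
	toy_flush(w);
	toy_cancel(w);
	return w->runs;
}
```

In this model the racy ordering leaves `runs` at zero, while flushing first runs the handler once before the cancel.
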
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  2. 18 Jan, 2022 1 commit
  3. 17 Jan, 2022 3 commits
  4. 12 Jan, 2022 1 commit
    • xfs: fix online fsck handling of v5 feature bits on secondary supers · 4a9bca86
      Darrick J. Wong authored
      While I was auditing the code in xfs_repair that adds feature bits to
      existing V5 filesystems, I decided to have a look at how online fsck
      handles feature bits, and I found a few problems:
      
      1) ATTR2 is added to the primary super when an xattr is set on a file,
      but that isn't consistently propagated to secondary supers.  This isn't
      a corruption, merely a discrepancy that repair will fix if it ever has
      to restore the primary from a secondary.  Hence, if we find a mismatch
      on a secondary, this is a preen condition, not a corruption.
      
      2) There are more compat and ro_compat features now than there used to
      be, but we mask off the newer features from testing.  This means we
      ignore inconsistencies in the INOBTCOUNT and BIGTIME features, which is
      wrong.  Get rid of the masking and compare directly.
      
      3) NEEDSREPAIR, when set on a secondary, is ignored by everyone.  Hence
      a mismatch here should also be flagged for preening, and online repair
      should clear the flag.  Right now we ignore it due to (2).
      
      4) log_incompat features are ephemeral, since we can clear the feature
      bit as soon as the log no longer contains live records for a particular
      log feature.  As such, the only copy we care about is the one in the
      primary super.  If we find any bits set in the secondary super, we
      should flag that for preening, and clear the bits if the user elects to
      repair it.
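
A rough sketch of how points (2) to (4) might translate into a comparison routine. Everything here is an assumption made for illustration: `struct toy_sb`, `toy_check_secondary()`, the verdict enum and the bit value of `TOY_INCOMPAT_NEEDSREPAIR` are hypothetical, the ATTR2 case from point (1) is not modelled, and treating other incompat mismatches as corruption is an assumption rather than something the message states.

```c
#include <stdint.h>

/* Hypothetical bit value, chosen for illustration only. */
#define TOY_INCOMPAT_NEEDSREPAIR	(1u << 4)

enum toy_verdict { TOY_OK, TOY_PREEN, TOY_CORRUPT };

/* Hypothetical, cut-down view of the v5 feature fields in a superblock. */
struct toy_sb {
	uint32_t compat;
	uint32_t ro_compat;
	uint32_t incompat;
	uint32_t log_incompat;
};

static enum toy_verdict
toy_check_secondary(const struct toy_sb *pri, const struct toy_sb *sec)
{
	enum toy_verdict v = TOY_OK;

	/* (2) compat and ro_compat are compared directly, with no masking. */
	if (sec->compat != pri->compat || sec->ro_compat != pri->ro_compat)
		return TOY_CORRUPT;

	/* Assumption: incompat bits other than NEEDSREPAIR must match too. */
	if ((sec->incompat ^ pri->incompat) & ~TOY_INCOMPAT_NEEDSREPAIR)
		return TOY_CORRUPT;

	/* (3) a NEEDSREPAIR mismatch is only worth preening. */
	if ((sec->incompat ^ pri->incompat) & TOY_INCOMPAT_NEEDSREPAIR)
		v = TOY_PREEN;

	/* (4) any log_incompat bit set on a secondary is worth preening. */
	if (sec->log_incompat != 0)
		v = TOY_PREEN;

	return v;
}
```
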
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  5. 11 Jan, 2022 1 commit
    • xfs: take the ILOCK when readdir inspects directory mapping data · 65552b02
      Darrick J. Wong authored
      I was poking around in the directory code while diagnosing online fsck
      bugs, and noticed that xfs_readdir doesn't actually take the directory
      ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
      the data fork mappings and the VFS took i_rwsem (aka IOLOCK_SHARED) so
      we're protected against writer threads, but we really need to follow the
      locking model like we do in other places.
      
      To avoid unnecessarily cycling the ILOCK for fairly small directories,
      change the block/leaf _getdents functions to consume the ILOCK hold that
      the parent readdir function took to decide on a _getdents implementation.
      
      It is ok to cycle the ILOCK in readdir because the VFS takes the IOLOCK
      in the appropriate mode during lookups and writes, and we don't want to
      be holding the ILOCK when we copy directory entries to userspace in case
      there's a page fault.  We really only need it to protect against data
      fork lookups, like we do for other files.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  6. 06 Jan, 2022 5 commits
    • xfs: warn about inodes with project id of -1 · 7e937bb3
      Darrick J. Wong authored
      Inodes aren't supposed to have a project id of -1U (aka 4294967295) but
      the kernel hasn't always validated FSSETXATTR correctly.  Flag this as
      something for the sysadmin to check out.
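
As a minimal illustration of the value being tested, assuming a 32-bit project id: `TOY_PROJID_INVALID` and `toy_projid_needs_warning()` are hypothetical names, not the scrub code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* -1U as a 32-bit project id, i.e. 4294967295. */
#define TOY_PROJID_INVALID	((uint32_t)-1)

/* Returns true if the inode's project id should be flagged for review. */
static bool toy_projid_needs_warning(uint32_t projid)
{
	return projid == TOY_PROJID_INVALID;
}
```
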
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: hold quota inode ILOCK_EXCL until the end of dqalloc · eae44cb3
      Darrick J. Wong authored
      Online fsck depends on callers holding ILOCK_EXCL from the time they
      decide to update a block mapping until after they've updated the reverse
      mapping records to guarantee the stability of both mapping records.
      Unfortunately, the quota code drops ILOCK_EXCL at the first transaction
      roll in the dquot allocation process, which breaks that assertion.  This
      leads to sporadic failures in the online rmap repair code if the repair
      code grabs the AGF after bmapi_write maps a new block into the quota
      file's data fork but before it can finish the deferred rmap update.
      
      Fix this by rewriting the function to hold the ILOCK until after the
      transaction commit like all other bmap updates do, and get rid of the
      dqread wrapper that does nothing but complicate the codebase.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: Remove redundant assignment of mp · f4901a18
      Jiapeng Chong authored
      mp is being initialized to log->l_mp but this is never read
      as mp is overwritten later on. Remove the redundant
      assignment.
      
      Cleans up the following clang-analyzer warning:
      
      fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during
      its initialization is never read [clang-analyzer-deadcode.DeadStores].
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: reduce kvmalloc overhead for CIL shadow buffers · 8dc9384b
      Dave Chinner authored
      Oh, let me count the ways that the kvmalloc API sucks dog eggs.
      
      The problem is when we are logging lots of large objects, we hit
      kvmalloc really damn hard with costly order allocations, and
      behaviour utterly sucks:
      
           - 49.73% xlog_cil_commit
      	 - 31.62% kvmalloc_node
      	    - 29.96% __kmalloc_node
      	       - 29.38% kmalloc_large_node
      		  - 29.33% __alloc_pages
      		     - 24.33% __alloc_pages_slowpath.constprop.0
      			- 18.35% __alloc_pages_direct_compact
      			   - 17.39% try_to_compact_pages
      			      - compact_zone_order
      				 - 15.26% compact_zone
      				      5.29% __pageblock_pfn_to_page
      				      3.71% PageHuge
      				    - 1.44% isolate_migratepages_block
      					 0.71% set_pfnblock_flags_mask
      				   1.11% get_pfnblock_flags_mask
      			   - 0.81% get_page_from_freelist
      			      - 0.59% _raw_spin_lock_irqsave
      				 - do_raw_spin_lock
      				      __pv_queued_spin_lock_slowpath
      			- 3.24% try_to_free_pages
      			   - 3.14% shrink_node
      			      - 2.94% shrink_slab.constprop.0
      				 - 0.89% super_cache_count
      				    - 0.66% xfs_fs_nr_cached_objects
      				       - 0.65% xfs_reclaim_inodes_count
      					    0.55% xfs_perag_get_tag
      				   0.58% kfree_rcu_shrink_count
      			- 2.09% get_page_from_freelist
      			   - 1.03% _raw_spin_lock_irqsave
      			      - do_raw_spin_lock
      				   __pv_queued_spin_lock_slowpath
      		     - 4.88% get_page_from_freelist
      			- 3.66% _raw_spin_lock_irqsave
      			   - do_raw_spin_lock
      				__pv_queued_spin_lock_slowpath
      	    - 1.63% __vmalloc_node
      	       - __vmalloc_node_range
      		  - 1.10% __alloc_pages_bulk
      		     - 0.93% __alloc_pages
      			- 0.92% get_page_from_freelist
      			   - 0.89% rmqueue_bulk
      			      - 0.69% _raw_spin_lock
      				 - do_raw_spin_lock
      				      __pv_queued_spin_lock_slowpath
      	   13.73% memcpy_erms
      	 - 2.22% kvfree
      
      On this workload, that's almost a dozen CPUs all trying to compact
      and reclaim memory inside kvmalloc_node at the same time. Yet it is
      regularly falling back to vmalloc despite all that compaction, page
      and shrinker reclaim that direct reclaim is doing. Copying all the
      metadata is taking far less CPU time than allocating the storage!
      
      Direct reclaim should be considered extremely harmful.
      
      This is a high frequency, high throughput, CPU usage and latency
      sensitive allocation. We've got memory there, and we're using
      kvmalloc to allow memory allocation to avoid doing lots of work to
      try to do contiguous allocations.
      
      Except it still does *lots of costly work* that is unnecessary.
      
      Worse: the only way to avoid the slowpath page allocation trying to
      do compaction on costly allocations is to turn off direct reclaim
      (i.e. remove __GFP_DIRECT_RECLAIM from the gfp flags).
      
      Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
      GFP_KERNEL allocation context, so you only get kmalloc!". This
      cuts off the vmalloc fallback, and this leads to almost instant OOM
      problems which end up in filesystem deadlocks, shutdowns and/or
      kernel crashes.
      
      I want some basic kvmalloc behaviour:
      
      - kmalloc for a contiguous range with fail fast semantics - no
        compaction direct reclaim if the allocation enters the slow path.
      - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
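
The two-step behaviour listed above can be sketched in userspace as a "fail fast, then fall back" wrapper. This is a model under stated assumptions, not the patch itself: `toy_kvmalloc()`, `toy_fast_alloc()` and `toy_slow_alloc()` are hypothetical, with malloc() standing in for both kmalloc() and vmalloc(), and a 4096-byte cutoff invented to make the fast path fail for large requests.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook: returns NULL on failure. */
typedef void *(*toy_alloc_fn)(size_t size);

/*
 * Try the cheap contiguous allocator once, with no retries; if it fails,
 * fall back to the slower, non-contiguous allocator.
 */
static void *toy_kvmalloc(size_t size, toy_alloc_fn fast, toy_alloc_fn slow,
			  int *used_fallback)
{
	void *p = fast(size);

	*used_fallback = 0;
	if (!p) {
		*used_fallback = 1;
		p = slow(size);
	}
	return p;
}

/* Stand-in for a fail-fast kmalloc() that refuses large requests. */
static void *toy_fast_alloc(size_t size)
{
	return size <= 4096 ? malloc(size) : NULL;
}

/* Stand-in for vmalloc(). */
static void *toy_slow_alloc(size_t size)
{
	return malloc(size);
}
```

The point of the shape is that the first attempt never does expensive work on failure; all the patience is reserved for the fallback.
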
      
      The really, really stupid part about this is these kvmalloc() calls
      are run under memalloc_nofs task context, so all the allocations are
      always reduced to GFP_NOFS regardless of the fact that kvmalloc
      requires GFP_KERNEL to be passed in. IOWs, we're already telling
      kvmalloc to behave differently to the gfp flags we pass in, but it
      still won't allow vmalloc to be run with anything other than
      GFP_KERNEL.
      
      So, this patch open codes the kvmalloc() in the commit path to have
      the above described behaviour. The result is that we more than halve
      the CPU time spent doing kvmalloc() in this path, and the rate of
      transaction commits with 64kB objects in them more than doubles,
      i.e. we get a ~5x reduction in CPU usage per costly-sized kvmalloc()
      invocation and the profile looks like this:
      
        - 37.60% xlog_cil_commit
      	16.01% memcpy_erms
            - 8.45% __kmalloc
      	 - 8.04% kmalloc_order_trace
      	    - 8.03% kmalloc_order
      	       - 7.93% alloc_pages
      		  - 7.90% __alloc_pages
      		     - 4.05% __alloc_pages_slowpath.constprop.0
      			- 2.18% get_page_from_freelist
      			- 1.77% wake_all_kswapds
      ....
      				    - __wake_up_common_lock
      				       - 0.94% _raw_spin_lock_irqsave
      		     - 3.72% get_page_from_freelist
      			- 2.43% _raw_spin_lock_irqsave
            - 5.72% vmalloc
      	 - 5.72% __vmalloc_node_range
      	    - 4.81% __get_vm_area_node.constprop.0
      	       - 3.26% alloc_vmap_area
      		  - 2.52% _raw_spin_lock
      	       - 1.46% _raw_spin_lock
      	      0.56% __alloc_pages_bulk
            - 4.66% kvfree
      	 - 3.25% vfree
      	    - __vfree
      	       - 3.23% __vunmap
      		  - 1.95% remove_vm_area
      		     - 1.06% free_vmap_area_noflush
      			- 0.82% _raw_spin_lock
      		     - 0.68% _raw_spin_lock
      		  - 0.92% _raw_spin_lock
      	 - 1.40% kfree
      	    - 1.36% __free_pages
      	       - 1.35% __free_pages_ok
      		  - 1.02% _raw_spin_lock_irqsave
      
      It's worth noting that over 50% of the CPU time spent allocating
      these shadow buffers is now spent on spinlocks. So the shadow buffer
      allocation overhead is greatly reduced by getting rid of direct
      reclaim from kmalloc, and could probably be made even less costly if
      vmalloc() didn't use global spinlocks to protect its structures.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: sysfs: use default_groups in kobj_type · 219aac5d
      Greg Kroah-Hartman authored
      There are currently 2 ways to create a set of sysfs files for a
      kobj_type, through the default_attrs field, and the default_groups
      field.  Move the xfs sysfs code to use default_groups field which has
      been the preferred way since aa30f47c ("kobject: Add support for
      default attribute groups to kobj_type") so that we can soon get rid of
      the obsolete default_attrs field.
      
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: linux-xfs@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  7. 22 Dec, 2021 1 commit
    • xfs: prevent UAF in xfs_log_item_in_current_chkpt · f8d92a66
      Darrick J. Wong authored
      While I was running with KASAN and lockdep enabled, I stumbled upon a
      KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
      comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
      that the original patch to xfs_defer_finish_noroll should have done
      something to lock the CIL to prevent it from switching the CIL contexts
      while the predicate runs.
      
      For upper level code that needs to know if a given log item is new
      enough not to need relogging, add a new wrapper that takes the CIL
      context lock long enough to sample the current CIL context.  This is
      kind of racy in that the CIL can switch the contexts immediately after
      sampling, but that's ok because the consequence is that the defer ops
      code is a little slow to relog items.
      
       ==================================================================
       BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
       Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999
      
       CPU: 1 PID: 527999 Comm: fsstress Tainted: G      D      5.16.0-rc4-xfsx #rc4
       Call Trace:
        <TASK>
        dump_stack_lvl+0x45/0x59
        print_address_description.constprop.0+0x1f/0x140
        kasan_report.cold+0x83/0xdf
        xfs_log_item_in_current_chkpt+0x139/0x160
        xfs_defer_finish_noroll+0x3bb/0x1e30
        __xfs_trans_commit+0x6c8/0xcf0
        xfs_reflink_remap_extent+0x66f/0x10e0
        xfs_reflink_remap_blocks+0x2dd/0xa90
        xfs_file_remap_range+0x27b/0xc30
        vfs_dedupe_file_range_one+0x368/0x420
        vfs_dedupe_file_range+0x37c/0x5d0
        do_vfs_ioctl+0x308/0x1260
        __x64_sys_ioctl+0xa1/0x170
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f2c71a2950b
       Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
      ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
      f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
       RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
       RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
       RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
       RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
       R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
        </TASK>
      
       Allocated by task 464064:
        kasan_save_stack+0x1e/0x50
        __kasan_kmalloc+0x81/0xa0
        kmem_alloc+0xcd/0x2c0 [xfs]
        xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
        xlog_cil_push_work+0x141/0x13d0 [xfs]
        process_one_work+0x7f6/0x1380
        worker_thread+0x59d/0x1040
        kthread+0x3b0/0x490
        ret_from_fork+0x1f/0x30
      
       Freed by task 51:
        kasan_save_stack+0x1e/0x50
        kasan_set_track+0x21/0x30
        kasan_set_free_info+0x20/0x30
        __kasan_slab_free+0xed/0x130
        slab_free_freelist_hook+0x7f/0x160
        kfree+0xde/0x340
        xlog_cil_committed+0xbfd/0xfe0 [xfs]
        xlog_cil_process_committed+0x103/0x1c0 [xfs]
        xlog_state_do_callback+0x45d/0xbd0 [xfs]
        xlog_ioend_work+0x116/0x1c0 [xfs]
        process_one_work+0x7f6/0x1380
        worker_thread+0x59d/0x1040
        kthread+0x3b0/0x490
        ret_from_fork+0x1f/0x30
      
       Last potentially related work creation:
        kasan_save_stack+0x1e/0x50
        __kasan_record_aux_stack+0xb7/0xc0
        insert_work+0x48/0x2e0
        __queue_work+0x4e7/0xda0
        queue_work_on+0x69/0x80
        xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
        xlog_cil_force_seq+0x1b7/0x850 [xfs]
        xfs_log_force_seq+0x1c7/0x670 [xfs]
        xfs_file_fsync+0x7c1/0xa60 [xfs]
        __x64_sys_fsync+0x52/0x80
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       The buggy address belongs to the object at ffff88804ea5f600
        which belongs to the cache kmalloc-256 of size 256
       The buggy address is located 8 bytes inside of
        256-byte region [ffff88804ea5f600, ffff88804ea5f700)
       The buggy address belongs to the page:
       page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
       head:ffffea00013a9780 order:1 compound_mapcount:0
       flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
       raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
       raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ==================================================================
      
      Fixes: 4e919af7 ("xfs: periodically relog deferred intent items")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  8. 21 Dec, 2021 8 commits
    • xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list() · 6ed6356b
      Dan Carpenter authored
      The "bufsize" comes from the root user.  If "bufsize" is negative then,
      because of type promotion, neither of the validation checks at the start
      of the function is able to catch it:
      
      	if (bufsize < sizeof(struct xfs_attrlist) ||
      	    bufsize > XFS_XATTR_LIST_MAX)
      		return -EINVAL;
      
      This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in
      kvmalloc_node().  Fix this by changing the type from int to size_t.
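
The promotion issue can be reproduced in a small userspace sketch. `struct toy_attrlist` and `TOY_XATTR_LIST_MAX` are hypothetical stand-ins for the real structure and limit, and the explicit `(size_t)` cast in the int version only spells out the conversion the compiler applies implicitly when an int is compared with a sizeof() result.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct xfs_attrlist and XFS_XATTR_LIST_MAX. */
struct toy_attrlist {
	int count;
	int more;
	int offset[1];
};
#define TOY_XATTR_LIST_MAX	65536

/* int version: a negative bufsize slips past both checks. */
static int toy_check_int(int bufsize)
{
	/* Lower bound: compared as size_t, so -1 becomes a huge value. */
	if ((size_t)bufsize < sizeof(struct toy_attrlist))
		return -EINVAL;
	/* Upper bound: compared as int, so -1 is below the limit. */
	if (bufsize > TOY_XATTR_LIST_MAX)
		return -EINVAL;
	return 0;
}

/* size_t version: the same bit pattern is caught by the upper bound. */
static int toy_check_size(size_t bufsize)
{
	if (bufsize < sizeof(struct toy_attrlist) ||
	    bufsize > TOY_XATTR_LIST_MAX)
		return -EINVAL;
	return 0;
}
```

With these definitions, `toy_check_int(-1)` accepts the value while `toy_check_size((size_t)-1)` rejects it, which is the effect of changing the type.
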
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Fix comments mentioning xfs_ialloc · 132c460e
      Yang Xu authored
      Since kernel commit 1abcf261 ("xfs: move on-disk inode allocation out of xfs_ialloc()"),
      xfs_ialloc has been renamed to xfs_init_new_inode. So update this in comments.
      Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: check sb_meta_uuid for dabuf buffer recovery · 09654ed8
      Dave Chinner authored
      Got a report that a repeated crash test of a container host would
      eventually fail with a log recovery error preventing the system from
      mounting the root filesystem. It manifested as a directory leaf node
      corruption on writeback like so:
      
       XFS (loop0): Mounting V5 Filesystem
       XFS (loop0): Starting recovery (logdev: internal)
       XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
       XFS (loop0): Unmount and run xfs_repair
       XFS (loop0): First 128 bytes of corrupted metadata buffer:
       00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b  ........=.......
       00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc  .......X...)....
       00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23  ..x..~J}.S...G.#
       00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00  .........C......
       00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a  ................
       00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50  .5y....0.......P
       00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4  .@.......A......
       00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c  .b.......P!A....
       XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514).  Shutting down.
       XFS (loop0): Please unmount the filesystem and rectify the problem(s)
       XFS (loop0): log mount/recovery failed: error -117
       XFS (loop0): log mount failed
      
      Tracing indicated that we were recovering changes from a transaction
      at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57.
      That is, log recovery was overwriting a buffer with newer changes on
      disk than was in the transaction. Tracing indicated that we were
      hitting the "recovery immediately" case in
      xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the
      buffer.
      
      The code was extracting the LSN correctly, then ignoring it because
      the UUID in the buffer did not match the superblock UUID. The
      problem arises because the UUID check uses the wrong UUID - it
      should be checking the sb_meta_uuid, not sb_uuid. This filesystem
      has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the
      correct matching sb_meta_uuid in it, it's just the code checked it
      against the wrong superblock uuid.
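
A toy model of the mismatch, assuming 16-byte UUIDs: `struct toy_sb_uuids` and the two helper functions are hypothetical and only contrast comparing a buffer's UUID against the user-visible UUID with comparing it against the metadata UUID.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_UUID_LEN	16

/* Hypothetical, cut-down superblock carrying both UUIDs. */
struct toy_sb_uuids {
	unsigned char uuid[TOY_UUID_LEN];	/* user-visible UUID */
	unsigned char meta_uuid[TOY_UUID_LEN];	/* UUID stamped into metadata */
};

/* Buggy check: compares the buffer's UUID with the user-visible UUID. */
static bool toy_buf_owned_buggy(const struct toy_sb_uuids *sb,
				const unsigned char *buf_uuid)
{
	return memcmp(buf_uuid, sb->uuid, TOY_UUID_LEN) == 0;
}

/* Fixed check: compares against the metadata UUID instead. */
static bool toy_buf_owned_fixed(const struct toy_sb_uuids *sb,
				const unsigned char *buf_uuid)
{
	return memcmp(buf_uuid, sb->meta_uuid, TOY_UUID_LEN) == 0;
}
```

When the two UUIDs differ, only the fixed check recognises a buffer stamped with the metadata UUID, so only the fixed check would let the buffer's LSN be honoured.
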
      
      There is no corruption in the filesystem, and failing to recover the
      buffer due to a write verifier failure means the recovery bug did
      not propagate the corruption to disk. Hence there is no corruption
      before or after this bug has manifested, the impact is limited
      simply to an unmountable filesystem....
      
      This was missed back in 2015 during an audit of incorrect sb_uuid
      usage that resulted in commit fcfbe2c4 ("xfs: log recovery needs
      to validate against sb_meta_uuid") that fixed the magic32 buffers to
      validate against sb_meta_uuid instead of sb_uuid. It missed the
      magicda buffers....
      
      Fixes: ce748eaa ("xfs: create new metadata UUID field and incompat flag")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fix a bug in the online fsck directory leaf1 bestcount check · e5d1802c
      Darrick J. Wong authored
      When xfs_scrub encounters a directory with a leaf1 block, it tries to
      validate that the leaf1 block's bestcount (aka the best free count of
      each directory data block) is the correct size.  Previously, this author
      believed that comparing bestcount to the directory isize (since
      directory data blocks are under isize, and leaf/bestfree blocks are
      above it) was sufficient.
      
      Unfortunately during testing of online repair, it was discovered that it
      is possible to create a directory with a hole between the last directory
      block and isize.  The directory code seems to handle this situation just
      fine and xfs_repair doesn't complain, which effectively makes this quirk
      part of the disk format.
      
      Fix the check to work properly.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: only run COW extent recovery when there are no live extents · 7993f1a4
      Darrick J. Wong authored
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that the
      incore COW fork of inode (A) is now inconsistent with the ondisk
      metadata.  If one of those former COW extents is allocated and mapped
      into another file (B) and someone triggers a COW to the stale
      reservation in (A), A's dirty data will be written into (B) and once
      that's done, those blocks will be transferred to (A)'s data fork
      without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  In the first patch, we fixed the race condition in (2) so that
      (A) will always flush the COW fork.  In this second patch, we move the
      _recover_cow call to the initial mount call in (0) for safety.
      
      As mentioned previously, xfs_reflink_recover_cow walks the refcount
      btree looking for COW staging extents, and frees them.  This was
      intended to be run at mount time (when we know there are no live inodes)
      to clean up any leftover staging extents that may have been left behind
      during an unclean shutdown.  As a time "optimization" for readonly
      mounts, we deferred this to the ro->rw transition, not realizing that
      any failure to clean all COW forks during a rw->ro transition would
      result in catastrophic corruption.
      
      Therefore, remove this optimization and only run the recovery routine
      when we're guaranteed not to have any COW staging extents anywhere,
      which means we always run this at mount time.  While we're at it, move
      the callsite to xfs_log_mount_finish because any refcount btree
      expansion (however unlikely given that we're removing records from the
      right side of the index) must be fed by a per-AG reservation, which
      doesn't exist in its current location.
      
      Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't expose internal symlink metadata buffers to the vfs · 7b7820b8
      Darrick J. Wong authored
      Ian Kent reported that for inline symlinks, it's possible for
      vfs_readlink to hang on to the target buffer returned by
      _vn_get_link_inline long after it's been freed by xfs inode reclaim.
      This is a layering violation -- we should never expose XFS internals to
      the VFS.
      
      When the symlink has a remote target, we allocate a separate buffer,
      copy the internal information, and let the VFS manage the new buffer's
      lifetime.  Let's adapt the inline code paths to do this too.  It's
      less efficient, but fixes the layering violation and avoids the need to
      adapt the if_data lifetime to rcu rules.  Clearly I don't care about
      readlink benchmarks.
      
      As a side note, this fixes the minor locking violation where we can
      access the inode data fork without taking any locks; proper locking (and
      eliminating the possibility of having to switch inode_operations on a
      live inode) is essential to online repair coordinating repairs
      correctly.
      
      Reported-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
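The copy-out pattern described above can be sketched in userspace C. This is a minimal model, not the kernel code: `demo_inode`, `demo_get_link_copy()` and `demo_reclaim()` are hypothetical names invented for illustration, and `malloc()`/`free()` stand in for whatever allocator the real code uses. The point it shows is that the caller receives a private copy, so tearing down the filesystem's internal buffer cannot invalidate it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an inode whose symlink target is stored inline. */
struct demo_inode {
	char   *inline_target;	/* owned by the "filesystem"; freed at reclaim */
	size_t  target_len;	/* target length, excluding the trailing NUL */
};

/*
 * Return a private, NUL-terminated copy of the inline target.
 * The caller (the "VFS" in this model) owns the copy and must free() it.
 * Returns NULL if there is no target or the allocation fails.
 */
static char *demo_get_link_copy(const struct demo_inode *ip)
{
	char *copy;

	assert(ip != NULL);
	if (ip->inline_target == NULL)
		return NULL;

	copy = malloc(ip->target_len + 1);
	if (copy == NULL)
		return NULL;

	memcpy(copy, ip->inline_target, ip->target_len);
	copy[ip->target_len] = '\0';
	return copy;
}

/* Model inode reclaim: the filesystem tears down its internal buffer. */
static void demo_reclaim(struct demo_inode *ip)
{
	free(ip->inline_target);
	ip->inline_target = NULL;
	ip->target_len = 0;
}
```

In this model a caller may keep using the returned string after `demo_reclaim()` has run, which is the property the commit is after; the cost is one extra allocation and copy per readlink.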
    • Darrick J. Wong's avatar
      xfs: fix quotaoff mutex usage now that we don't support disabling it · 59d7fab2
      Darrick J. Wong authored
      Prior to commit 40b52225 ("xfs: remove support for disabling quota
      accounting on a mounted file system"), we used the quotaoff mutex to
      protect dquot operations against quotaoff trying to pull down dquots as
      part of disabling quota.
      
      Now that we only support turning off quota enforcement, the quotaoff
      mutex only protects changes in m_qflags/sb_qflags.  We don't need it to
      protect dquots, which means we can remove it from setqlimits and the
      dquot scrub code.  While we're at it, fix the function that forces
      quotacheck, since it should have been taking the quotaoff mutex.
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • Darrick J. Wong's avatar
      xfs: shut down filesystem if we xfs_trans_cancel with deferred work items · 47a6df7c
      Darrick J. Wong authored
      While debugging some very strange rmap corruption reports in connection
      with the online directory repair code, I root-caused the error to the
      following incorrect sequence:
      
      <start repair transaction>
      <expand directory, causing a deferred rmap to be queued>
      <roll transaction>
      <cancel transaction>
      
      Obviously, we should have committed the transaction instead of
      cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
      have warned us that we were throwing away work items that we already
      committed to performing.  This is not correct, and we need to shut down
      the filesystem.
      
      Change xfs_trans_cancel to complain in the loudest manner if we're
      cancelling any transaction with deferred work items attached.
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
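The rule described above, that cancelling a transaction which still carries deferred work must shut the filesystem down, can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: `demo_mount`, `demo_trans`, `demo_force_shutdown()` and `demo_trans_cancel()` are hypothetical names, and a simple counter stands in for the real list of deferred work items.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are not the kernel's structures. */
struct demo_mount {
	bool shut_down;			/* set once the "filesystem" is shut down */
};

struct demo_trans {
	struct demo_mount *mp;
	size_t             deferred_items;	/* work promised but not yet done */
};

/* Model a forced shutdown of the filesystem. */
static void demo_force_shutdown(struct demo_mount *mp)
{
	mp->shut_down = true;
}

/*
 * Cancel a transaction. If deferred work items are still attached, the
 * caller is discarding work it already committed to performing, so the
 * model shuts the filesystem down. Returns true if a shutdown was forced.
 */
static bool demo_trans_cancel(struct demo_trans *tp)
{
	bool forced = false;

	assert(tp != NULL && tp->mp != NULL);
	if (tp->deferred_items > 0) {
		demo_force_shutdown(tp->mp);
		forced = true;
	}
	tp->deferred_items = 0;
	return forced;
}
```

Cancelling a transaction with no deferred items leaves the mount untouched, while the roll-then-cancel sequence from the commit message (one queued rmap update, then a cancel) trips the shutdown path.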
  9. 12 Dec, 2021 14 commits
  10. 11 Dec, 2021 5 commits
    • Linus Torvalds's avatar
      Merge tag 'perf-tools-fixes-for-v5.16-2021-12-11' of... · bbdff6d5
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v5.16-2021-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Prevent out-of-bounds access to per sample registers.
      
       - Fix NULL vs IS_ERR_OR_NULL() checking on the python binding.
      
       - Intel PT fixes, half of those are one-liners:
            - Fix some PGE (packet generation enable/control flow packets) usage.
            - Fix sync state when a PSB (synchronization) packet is found.
            - Fix intel_pt_fup_event() assumptions about setting state type.
            - Fix state setting when receiving overflow (OVF) packet.
            - Fix next 'err' value, walking trace.
            - Fix missing 'instruction' events with 'q' option.
            - Fix error timestamp setting on the decoder error path.
      
      * tag 'perf-tools-fixes-for-v5.16-2021-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf python: Fix NULL vs IS_ERR_OR_NULL() checking
        perf intel-pt: Fix error timestamp setting on the decoder error path
        perf intel-pt: Fix missing 'instruction' events with 'q' option
        perf intel-pt: Fix next 'err' value, walking trace
        perf intel-pt: Fix state setting when receiving overflow (OVF) packet
        perf intel-pt: Fix intel_pt_fup_event() assumptions about setting state type
        perf intel-pt: Fix sync state when a PSB (synchronization) packet is found
        perf intel-pt: Fix some PGE (packet generation enable/control flow packets) usage
        perf tools: Prevent out-of-bounds access to registers
    • Linus Torvalds's avatar
      Merge tag 'block-5.16-2021-12-10' of git://git.kernel.dk/linux-block · eccea80b
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A few block fixes that should go into this release:
      
         - NVMe pull request:
              - set ana_log_size to 0 after freeing ana_log_buf (Hou Tao)
              - show subsys nqn for duplicate cntlids (Keith Busch)
              - disable namespace access for unsupported metadata (Keith
                Busch)
              - report write pointer for a full zone as zone start + zone len
                (Niklas Cassel)
              - fix use after free when disconnecting a reconnecting ctrl
                (Ruozhu Li)
              - fix a list corruption in nvmet-tcp (Sagi Grimberg)
      
         - Fix for a regression on DIO single bio async IO (Pavel)
      
         - ioprio seteuid fix (Davidlohr)
      
         - mtd fix that subsequently got reverted as it was broken, will get
           re-done and submitted for the next round
      
         - Two MD fixes via Song (Markus, zhangyue)"
      
      * tag 'block-5.16-2021-12-10' of git://git.kernel.dk/linux-block:
        Revert "mtd_blkdevs: don't scan partitions for plain mtdblock"
        block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2)
        md: fix double free of mddev->private in autorun_array()
        md: fix update super 1.0 on rdev size change
        nvmet-tcp: fix possible list corruption for unexpected command failure
        block: fix single bio async DIO error handling
        nvme: fix use after free when disconnecting a reconnecting ctrl
        nvme-multipath: set ana_log_size to 0 after free ana_log_buf
        mtd_blkdevs: don't scan partitions for plain mtdblock
        nvme: report write pointer for a full zone as zone start + zone len
        nvme: disable namespace access for unsupported metadata
        nvme: show subsys nqn for duplicate cntlids
    • Linus Torvalds's avatar
      Merge tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-block · f152165a
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "A few fixes that are all bound for stable:
      
         - Two syzbot reports for io-wq that turned out to be separate fixes,
           but ultimately very closely related
      
         - io_uring task_work running on cancelations"
      
      * tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-block:
        io-wq: check for wq exit after adding new worker task_work
        io_uring: ensure task_work gets run as part of cancelations
        io-wq: remove spurious bit clear on task_work addition
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · bd66be54
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "Two more I2C driver bugfixes"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: mpc: Use atomic read and fix break condition
        i2c: virtio: fix completion handling
    • Linus Torvalds's avatar
      Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · 2acdaf59
      Linus Torvalds authored
      Pull clk driver fixes from Stephen Boyd:
      
       - Fix qcom mux logic to look at the proper parent table member. Luckily
         this clk type isn't very common.
      
       - Don't kill clks on qcom systems that use Trion PLLs that are enabled
         out of the bootloader. We will simply skip programming the PLL rate
         if it's already done.
      
       - Use the proper clk_ops for the qcom sm6125 ICE clks.
      
       - Use module_platform_driver() in i.MX as it can be a module.
      
       - Fix a UAF in the versatile clk driver on an error path.
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: versatile: clk-icst: use after free on error path
        clk: qcom: sm6125-gcc: Swap ops of ice and apps on sdcc1
        clk: imx: use module_platform_driver
        clk: qcom: clk-alpha-pll: Don't reconfigure running Trion
        clk: qcom: regmap-mux: fix parent clock lookup