  1. 10 Feb, 2023 7 commits
    • xfs: drop firstblock constraints from allocation setup · 36b6ad2d
      Dave Chinner authored
      Now that xfs_alloc_vextent() does all the AGF deadlock prevention
      filtering for multiple allocations in a single transaction, we no
      longer need the allocation setup code to care about what AGs we
      might already have locked.
      
      Hence we can remove all the "nullfb" conditional logic in places
      like xfs_bmap_btalloc() and instead have them focus simply on
      setting up locality constraints. If the allocation fails due to
      AGF lock filtering in xfs_alloc_vextent, then we just fall back as
      we normally do to more relaxed allocation constraints.
      
      As a result, any allocation that allows AG scanning (i.e. not
      confined to a single AG) and does not force a worst case full
      filesystem scan will now be able to attempt allocation from AGs
      lower than that defined by tp->t_firstblock. This is because
      xfs_alloc_vextent() allows try-locking of the AGFs and hence enables
      low space algorithms to at least -try- to get space from AGs lower
      than the one that we have currently locked and allocated from. This
      is a significant improvement in the low space allocation algorithm.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: block reservation too large for minleft allocation · d5753847
      Dave Chinner authored
      When we enter xfs_bmbt_alloc_block() without having first allocated
      a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
      are doing something like unwritten extent conversion, the transaction
      block reservation is used as the minleft value.
      
      This works for operations like unwritten extent conversion, but it
      assumes that the block reservation is only for a BMBT split. This is
      not always true, and sometimes results in larger than necessary
      minleft values being set. We only actually need enough space for a
      btree split, something we already handle correctly in
      xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.
      
      We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
      calculate the number of blocks a BMBT split on this inode is going to
      require, not use the transaction block reservation that contains the
      maximum number of blocks this transaction may consume in it...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: prefer free inodes at ENOSPC over chunk allocation · f08f984c
      Dave Chinner authored
      When an XFS filesystem has free inodes in chunks already allocated
      on disk, it will still allocate new inode chunks if the target AG
      has no free inodes in it. Normally, this is a good idea as it
      preserves locality of all the inodes in a given directory.
      
      However, at ENOSPC this can lead to using the last few remaining
      free filesystem blocks to allocate a new chunk when there are many,
      many free inodes that could be allocated without consuming free
      space. This speeds up the consumption of the last few blocks, and
      inode create operations then return ENOSPC when there are free
      inodes available because we don't have enough blocks left in the
      filesystem for directory creation reservations to proceed.
      
      Hence when we are near ENOSPC, we should be attempting to preserve
      the remaining blocks for directory block allocation rather than
      using them for unnecessary inode chunk creation.
      
      This particular behaviour is exposed by xfs/294, when it drives to
      ENOSPC on empty file creation whilst there are still thousands of
      free inodes available for allocation in other AGs in the filesystem.
      
      Hence, when we are within 1% of ENOSPC, change the inode allocation
      behaviour to prefer to use existing free inodes over allocating new
      inode chunks, even though it results in poorer locality of the data
      set. It is more important for the allocations to be space efficient
      near ENOSPC than to have optimal locality for performance, so let's
      modify the inode AG selection code to reflect that fact.
      
      This allows generic/294 to not only pass with this allocator rework
      patchset, but to increase the number of post-ENOSPC empty inode
      allocations from ~600 to ~9080 before we hit ENOSPC on the
      directory create transaction reservation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
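The policy this commit message describes can be sketched in a few lines of C. This is only an illustration of the stated rule ("within 1% of ENOSPC, prefer existing free inodes"); the function names, the parameters, and the way the 1% threshold is computed are assumptions made for the example and are not taken from the XFS source.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical threshold check: treat the filesystem as "near ENOSPC"
 * when fewer than 1% of its blocks are free. How the kernel measures
 * this is not stated in the commit message; this is an assumption.
 */
bool near_enospc(uint64_t free_blocks, uint64_t total_blocks)
{
	return free_blocks < total_blocks / 100;
}

/*
 * Hypothetical policy helper: when the target AG has no free inodes,
 * decide whether to reuse free inodes from other AGs instead of
 * spending free blocks on a new inode chunk.
 */
bool prefer_existing_free_inodes(uint64_t free_blocks,
				 uint64_t total_blocks,
				 uint64_t free_inodes_in_other_ags)
{
	/* Only trade locality for space efficiency when space is tight. */
	return near_enospc(free_blocks, total_blocks) &&
	       free_inodes_in_other_ags > 0;
}
```

With plenty of free space the helper returns false, so locality-preserving chunk allocation stays the default. It only switches behaviour when blocks are scarce and free inodes exist elsewhere.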
    • xfs: fix low space alloc deadlock · 1dd0510f
      Dave Chinner authored
      I've recently encountered an ABBA deadlock with g/476. The upcoming
      changes seem to make this much easier to hit, but the underlying
      problem is a pre-existing one.
      
      Essentially, if we select an AG for allocation, then lock the AGF
      and then fail to allocate for some reason (e.g. minimum length
      requirements cannot be satisfied), then we drop out of the
      allocation with the AGF still locked.
      
      The caller then modifies the allocation constraints - usually
      loosening them up - and tries again. This can result in trying to
      access AGFs that are lower than the AGF we already have locked from
      the failed attempt. e.g. the failed attempt skipped several AGs
      before failing, so we have locked an AG higher than the start AG.
      Retrying the allocation from the start AG then causes us to violate
      AGF lock ordering and this can lead to deadlocks.
      
      The deadlock exists even if allocation succeeds - we can do
      followup allocations in the same transaction for BMBT blocks that
      aren't guaranteed to be in the same AG as the original, and can move
      into higher AGs. Hence we really need to move the tp->t_firstblock
      tracking down into xfs_alloc_vextent() where it can be set when we
      exit with a locked AG.
      
      xfs_alloc_vextent() can also check there if the requested
      allocation falls within the allowed range of AGs set by
      tp->t_firstblock. If we can't allocate within the range set, we have
      to fail the allocation. If we are allowed to do non-blocking AGF
      locking, we can ignore the AG locking order limitations as we can
      use try-locks for the first iteration over the requested AG range.
      
      This invalidates a set of post allocation asserts that check that
      the allocation is always above tp->t_firstblock if it is set.
      Because we can use try-locks to avoid the deadlock in some
      circumstances, having a pre-existing locked AGF doesn't always
      prevent allocation from lower order AGFs. Hence those ASSERTs need
      to be removed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
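The lock-ordering rule described in this commit message can be modelled as a small decision function. The sketch below is a simplified userspace model, not the logic of xfs_alloc_vextent() itself. The enum, the `NO_AG` sentinel, and the function name are all hypothetical, and AG numbers are plain integers.

```c
#include <stdbool.h>

/* Hypothetical sentinel: no AGF has been locked in this transaction yet. */
#define NO_AG (-1)

/* Hypothetical outcome of the ordering check for one target AG. */
enum ag_lock_mode {
	AG_LOCK_BLOCKING,	/* safe to take the AGF lock and wait for it */
	AG_LOCK_TRY_ONLY,	/* may only try-lock; skip the AG on failure */
	AG_LOCK_DENIED		/* must not lock; the allocation attempt fails */
};

/*
 * AGF locks must be taken in ascending AG order. An AG at or above the
 * first AG locked by this transaction can be locked normally. A lower
 * AG can only be attempted with a try-lock, and only if the caller
 * permits non-blocking locking; otherwise the attempt has to fail.
 */
enum ag_lock_mode ag_lock_mode_for(int first_locked_ag, int target_ag,
				   bool trylock_allowed)
{
	if (first_locked_ag == NO_AG || target_ag >= first_locked_ag)
		return AG_LOCK_BLOCKING;
	return trylock_allowed ? AG_LOCK_TRY_ONLY : AG_LOCK_DENIED;
}
```

A try-lock never waits, so it cannot take part in an ABBA cycle. That is why an allocation from a lower AG can succeed even when a higher AGF is already held.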
    • xfs: revert commit 8954c44f · dd07bb8b
      Darrick J. Wong authored
      The name passed into __xfs_xattr_put_listent is exactly namelen bytes
      long and not null-terminated.  Passing namelen+1 to the strscpy function
      
          strscpy(offset, (char *)name, namelen + 1);
      
      is therefore wrong.  Go back to the old code, which works fine because
      strncpy won't find a null in @name and stops after namelen bytes.  It
      really could be a memcpy call, but it worked for years.
      
      Reported-by: syzbot+898115bc6d7140437215@syzkaller.appspotmail.com
      Fixes: 8954c44f ("xfs: use strscpy() to instead of strncpy()")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
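The commit message notes that the copy "really could be a memcpy call". A minimal userspace sketch of that idea follows. The helper name `copy_counted_name` and its signature are hypothetical and do not reproduce the kernel function; the point is only that a name of exactly `namelen` bytes with no trailing NUL should be copied by length, with the terminator written separately.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel code: copy a name that is exactly
 * namelen bytes long and is NOT NUL-terminated, then terminate the
 * destination explicitly. dst must have room for namelen + 1 bytes.
 */
void copy_counted_name(char *dst, const unsigned char *name, size_t namelen)
{
	/* Read exactly namelen bytes; never touch name[namelen]. */
	memcpy(dst, name, namelen);

	/* The source has no terminator, so add one ourselves. */
	dst[namelen] = '\0';
}
```

A string-copy routine that is given `namelen + 1` as its size may read one byte past the end of the source while looking for a terminator. A length-based copy avoids that.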
    • xfs: make kobj_type structures constant · 2ee83335
      Thomas Weißschuh authored
      Since commit ee6d3dd4 ("driver core: make kobj_type constant.")
      the driver core allows the usage of const struct kobj_type.
      
      Take advantage of this to constify the structure definitions to prevent
      modification at runtime.
      Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: allow setting full range of panic tags · 167ce4cb
      Donald Douwsma authored
      xfs will not allow combining other panic masks with
      XFS_PTAG_VERIFIER_ERROR.
      
       # sysctl fs.xfs.panic_mask=511
       sysctl: setting key "fs.xfs.panic_mask": Invalid argument
       fs.xfs.panic_mask = 511
      
      Update to the maximum value that can be set to allow the full range of
      masks. Do this using a mask of possible values to prevent this happening
      again as suggested by Darrick.
      
      Fixes: d519da41 ("xfs: Introduce XFS_PTAG_VERIFIER_ERROR panic mask")
      Signed-off-by: Donald Douwsma <ddouwsma@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
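One way to derive the accepted range from "a mask of possible values", as the commit message suggests, is sketched below. The macro and function names are hypothetical. The sketch assumes nine single-bit tags, which is consistent with the value 511 in the sysctl example above, and it does not reproduce the kernel's actual definitions.

```c
#include <stdbool.h>

/* Hypothetical single-bit tags, numbered 0 through 8. */
#define DEMO_TAG(n)	(1u << (n))

/*
 * Build the mask from the individual tags. When a new tag is added to
 * this list, the accepted range grows with it automatically.
 */
#define DEMO_TAG_MASK	(DEMO_TAG(0) | DEMO_TAG(1) | DEMO_TAG(2) | \
			 DEMO_TAG(3) | DEMO_TAG(4) | DEMO_TAG(5) | \
			 DEMO_TAG(6) | DEMO_TAG(7) | DEMO_TAG(8))

/* A requested value is valid only if it sets no bits outside the mask. */
bool demo_panic_mask_valid(unsigned int value)
{
	return (value & ~DEMO_TAG_MASK) == 0;
}
```

Under these assumptions `DEMO_TAG_MASK` evaluates to 511, so combining the highest tag with any lower tags is accepted and 512 is rejected.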
  2. 05 Feb, 2023 10 commits
    • xfs: don't use BMBT btree split workers for IO completion · c85007e2
      Dave Chinner authored
      When we split a BMBT due to record insertion, we offload it to a
      worker thread because we can be deep in the stack when we try to
      allocate a new block for the BMBT. Allocation can use several
      kilobytes of stack (full memory reclaim, swap and/or IO path can
      end up on the stack during allocation) and we can already be several
      kilobytes deep in the stack when we need to split the BMBT.
      
      A recent workload demonstrated a deadlock in this BMBT split
      offload. It requires several things to happen at once:
      
      1. two inodes need a BMBT split at the same time, one must be
      unwritten extent conversion from IO completion, the other must be
      from extent allocation.
      
      2. there must be no xfs_alloc_wq worker threads available in the
      worker pool.
      
      3. There must be sustained severe memory shortages such that new
      kworker threads cannot be allocated to the xfs_alloc_wq pool for
      both threads that need split work to be run.
      
      4. The split work from the unwritten extent conversion must run
      first.
      
      5. when the BMBT block allocation runs from the split work, it must
      loop over all AGs and, for each AGF, either fail the trylock or
      find that the locked AGF has no space available for a single block
      allocation.
      
      6. The BMBT allocation must then attempt to lock the AGF that the
      second task queued to the rescuer thread already has locked before
      it finds an AGF it can allocate from.
      
      At this point, we have an ABBA deadlock between tasks queued on the
      xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
      holding the AGF lock can't be run by the rescuer thread until the
      task the rescuer thread is running gets the AGF lock....
      
      This is a highly improbable series of events, but there it is.
      
      There's a couple of ways to fix this, but the easiest way is to ensure
      that we only punt tasks with a locked AGF that holds enough space
      for the BMBT block allocations to the worker thread.
      
      This works for unwritten extent conversion in IO completion (which
      doesn't have a locked AGF and space reservations) because we have
      tight control over the IO completion stack. It is typically only 6
      functions deep when xfs_btree_split() is called because we've
      already offloaded the IO completion work to a worker thread and
      hence we don't need to worry about stack overruns here.
      
      The other place we can be called for a BMBT split without a
      preceding allocation is __xfs_bunmapi() when punching out the
      center of an existing extent. We don't remove extents in the IO
      path, so these operations don't tend to be called with a lot of
      stack consumed. Hence we don't really need to ship the split off to
      a worker thread in these cases, either.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fix confusing variable names in xfs_refcount_item.c · 01a3af22
      Darrick J. Wong authored
      Variable names in this code module are inconsistent and confusing.
      xfs_phys_extent describe physical mappings, so rename them "pmap".
      xfs_refcount_intents describe refcount intents, so rename them "ri".
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass refcount intent directly through the log intent code · 0b11553e
      Darrick J. Wong authored
      Pass the incore refcount intent through the CUI logging code instead of
      repeatedly boxing and unboxing parameters.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fix confusing variable names in xfs_rmap_item.c · ffaa196f
      Darrick J. Wong authored
      Variable names in this code module are inconsistent and confusing.
      xfs_map_extent describe file mappings, so rename them "map".
      xfs_rmap_intents describe block mapping intents, so rename them "ri".
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass rmap space mapping directly through the log intent code · 1534328b
      Darrick J. Wong authored
      Pass the incore rmap space mapping through the RUI logging code instead
      of repeatedly boxing and unboxing parameters.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fix confusing xfs_extent_item variable names · 578c714b
      Darrick J. Wong authored
      Change the name of all pointers to xfs_extent_item structures to "xefi"
      to make the name consistent and because the current selections ("new"
      and "free") mean other things in C.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass xfs_extent_free_item directly through the log intent code · 72ba4555
      Darrick J. Wong authored
      Pass the incore xfs_extent_free_item through the EFI logging code
      instead of repeatedly boxing and unboxing parameters.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: fix confusing variable names in xfs_bmap_item.c · f3ebac4c
      Darrick J. Wong authored
      Variable names in this code module are inconsistent and confusing.
      xfs_map_extent describe file mappings, so rename them "map".
      xfs_bmap_intents describe block mapping intents, so rename them "bi".
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass the xfs_bmbt_irec directly through the log intent code · ddccb81b
      Darrick J. Wong authored
      Instead of repeatedly boxing and unboxing the incore extent mapping
      structure as it passes through the BUI code, pass the pointer directly
      through.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: use strscpy() to instead of strncpy() · 8954c44f
      Xu Panda authored
      The implementation of strscpy() is more robust and safer.
      That's now the recommended way to copy NUL-terminated strings.
      Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  3. 29 Jan, 2023 6 commits
  4. 28 Jan, 2023 7 commits
    • Fix up more non-executable files marked executable · c9661827
      Linus Torvalds authored
      Joe found another DT file that shouldn't be executable, and that
      frustrated me enough that I went hunting with this script:
      
          git ls-files -s |
              grep '^100755' |
              cut -f2 |
              xargs grep -L '^#!'
      
      and that found another file that shouldn't have been marked executable
      either, despite being in the scripts directory.
      
      Maybe these two are the last ones at least for now.  But I'm sure we'll
      be back in a few years, fixing things up again.
      
      Fixes: 8c6789f4 ("ASoC: dt-bindings: Add Everest ES8326 audio CODEC")
      Fixes: 4d8e5cd2 ("locking/atomics: Fix scripts/atomic/ script permissions")
      Reported-by: Joe Perches <joe@perches.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag '6.2-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd · 2543fdbd
      Linus Torvalds authored
      Pull ksmbd server fixes from Steve French:
       "Four smb3 server fixes, all also for stable:
      
         - fix for signing bug
      
         - fix to more strictly check packet length
      
         - add a max connections parm to limit simultaneous connections
      
         - fix error message flood that can occur with newer Samba xattr
           format"
      
      * tag '6.2-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
        ksmbd: downgrade ndr version error message to debug
        ksmbd: limit pdu length size according to connection status
        ksmbd: do not sign response to session request for guest login
        ksmbd: add max connections parameter
    • Merge tag '6.2-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 5af6ce70
      Linus Torvalds authored
      Pull cifs fix from Steve French:
       "Fix for reconnect oops in smbdirect (RDMA), also is marked for stable"
      
      * tag '6.2-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: Fix oops due to uncleared server->smbd_conn in reconnect
    • Merge tag 'block-6.2-2023-01-27' of git://git.kernel.dk/linux · 90aaef4e
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Minor tweaks for this release:
      
         - NVMe pull request via Christoph:
              - Flush initial scan_work for async probe (Keith Busch)
              - Fix passthrough csi check (Keith Busch)
              - Fix nvme-fc initialization order (Ross Lagerwall)
      
         - Fix for tearing down non-started device in ublk (Ming)"
      
      * tag 'block-6.2-2023-01-27' of git://git.kernel.dk/linux:
        block: ublk: move ublk_chr_class destroying after devices are removed
        nvme: fix passthrough csi check
        nvme-pci: flush initial scan_work for async probe
        nvme-fc: fix initialization order
    • Merge tag 'io_uring-6.2-2023-01-27' of git://git.kernel.dk/linux · f851453b
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "Two small fixes for this release:
      
         - Sanitize how async prep is done for drain requests, so we ensure
           that it always gets done (Dylan)
      
         - A ring provided buffer recycling fix for multishot receive (me)"
      
      * tag 'io_uring-6.2-2023-01-27' of git://git.kernel.dk/linux:
        io_uring: always prep_async for drain requests
        io_uring/net: cache provided buffer group value for multishot receives
    • Merge tag 'hardening-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 28cca23d
      Linus Torvalds authored
      Pull hardening fixes from Kees Cook:
      
       - Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST
      
       - Reorganize gcc-plugin includes for GCC 13
      
       - Silence bcache memcpy run-time false positive warnings
      
      * tag 'hardening-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        bcache: Silence memcpy() run-time false positive warnings
        gcc-plugins: Reorganize gimple includes for GCC 13
        kunit: memcpy: Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST
    • Merge tag 'trace-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · d786f0fe
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix filter memory leak by calling ftrace_free_filter()
      
       - Initialize trace_printk() earlier so that ftrace_dump_on_oops shows
         data on early crashes.
      
       - Update the outdated instructions in scripts/tracing/ftrace-bisect.sh
      
       - Add lockdep_is_held() to fix lockdep warning
      
       - Add allocation failure check in create_hist_field()
      
       - Don't initialize pointer that gets set right away in enabled_monitors_write()
      
       - Update MAINTAINER entries
      
       - Fix help messages in Kconfigs
      
       - Fix kernel-doc header for update_preds()
      
      * tag 'trace-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        bootconfig: Update MAINTAINERS file to add tree and mailing list
        rv: remove redundant initialization of pointer ptr
        ftrace: Maintain samples/ftrace
        tracing/filter: fix kernel-doc warnings
        lib: Kconfig: fix spellos
        trace_events_hist: add check for return value of 'create_hist_field'
        tracing/osnoise: Use built-in RCU list checking
        tracing: Kconfig: Fix spelling/grammar/punctuation
        ftrace/scripts: Update the instructions for ftrace-bisect.sh
        tracing: Make sure trace_printk() can output as soon as it can be used
        ftrace: Export ftrace_free_filter() to modules
  5. 27 Jan, 2023 10 commits