1. 09 Mar, 2024 15 commits
    • nfs: fix UAF in direct writes · 17f46b80
      Josef Bacik authored
      In production we have been hitting the following warning consistently
      
      ------------[ cut here ]------------
      refcount_t: underflow; use-after-free.
      WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0
      Workqueue: nfsiod nfs_direct_write_schedule_work [nfs]
      RIP: 0010:refcount_warn_saturate+0x9c/0xe0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __warn+0x9f/0x130
       ? refcount_warn_saturate+0x9c/0xe0
       ? report_bug+0xcc/0x150
       ? handle_bug+0x3d/0x70
       ? exc_invalid_op+0x16/0x40
       ? asm_exc_invalid_op+0x16/0x20
       ? refcount_warn_saturate+0x9c/0xe0
       nfs_direct_write_schedule_work+0x237/0x250 [nfs]
       process_one_work+0x12f/0x4a0
       worker_thread+0x14e/0x3b0
       ? ZSTD_getCParams_internal+0x220/0x220
       kthread+0xdc/0x120
       ? __btf_name_valid+0xa0/0xa0
       ret_from_fork+0x1f/0x30
      
      This is because we're completing the nfs_direct_request twice in a row.
      
      The source of this is when we have our commit requests to submit, we
      process them and send them off, and then in the completion path for the
      commit requests we have
      
      if (nfs_commit_end(cinfo.mds))
      	nfs_direct_write_complete(dreq);
      
      However since we're submitting asynchronous requests we sometimes have
      one that completes before we submit the next one, so we end up calling
      complete on the nfs_direct_request twice.
      
      The only other place we use nfs_generic_commit_list() is in
      __nfs_commit_inode, which wraps this call in a
      
      nfs_commit_begin();
      nfs_commit_end();
      
      Which is a common pattern for this style of completion handling, one
      that is also repeated in the direct code with get_dreq()/put_dreq()
      calls around where we process events as well as in the completion paths.
      
      Fix this by using the same pattern for the commit requests.
      
      Before with my 200 node rocksdb stress running this warning would pop
      every 10ish minutes.  With my patch the stress test has been running for
      several hours without popping.
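
The begin/end pattern described above can be sketched in userspace C. This is only an illustration under stated assumptions: `commit_tracker`, `commit_begin()`, `commit_end()` and `submit_all()` are hypothetical stand-ins, not the kernel's `nfs_commit_begin()`/`nfs_commit_end()`, and each commit is made to finish immediately to mimic the racy case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the commit-tracking state; not the kernel struct. */
struct commit_tracker {
	atomic_int outstanding;  /* in-flight commits plus the submitter's hold */
	int completions;         /* how many times the request was completed */
};

/* Take a reference before submitting, or around the whole submit loop. */
static void commit_begin(struct commit_tracker *t)
{
	atomic_fetch_add(&t->outstanding, 1);
}

/* Drop a reference; true only for the caller that drops the last one. */
static bool commit_end(struct commit_tracker *t)
{
	return atomic_fetch_sub(&t->outstanding, 1) == 1;
}

static void complete_request(struct commit_tracker *t)
{
	t->completions++;
}

/* Completion path of a single commit request. */
static void commit_done(struct commit_tracker *t)
{
	if (commit_end(t))
		complete_request(t);
}

/*
 * Submit n commits that each finish before the next one is submitted.
 * Without the submitter's own begin/end pair, every early completion
 * sees the counter hit zero and completes the request again.
 */
static int submit_all(struct commit_tracker *t, int n, bool hold_submitter_ref)
{
	atomic_init(&t->outstanding, 0);
	t->completions = 0;

	if (hold_submitter_ref)
		commit_begin(t);

	for (int i = 0; i < n; i++) {
		commit_begin(t);
		commit_done(t);
	}

	if (hold_submitter_ref && commit_end(t))
		complete_request(t);

	return t->completions;
}
```

In this sketch, three back-to-back commits complete the request three times without the outer pair and exactly once with it, which is the behaviour the fix is after.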
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • nfs: properly protect nfs_direct_req fields · 09450135
      Josef Bacik authored
      We protect accesses to the nfs_direct_req fields with the dreq->lock
      everywhere except nfs_direct_commit_complete.  This isn't a huge deal,
      but it does lead to confusion, and we could potentially end up setting
      NFS_ODIRECT_RESCHED_WRITES in one thread where we've had an error in
      another.  Clean this up to properly protect ->error and ->flags in the
      commit completion path.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • NFS: enable nconnect for RDMA · b326df4a
      Trond Myklebust authored
      It appears that in certain cases, RDMA capable transports can benefit
      from the ability to establish multiple connections to increase their
      throughput. This patch therefore enables the use of the "nconnect" mount
      option for those use cases.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • NFSv4: nfs4_do_open() is incorrectly triggering state recovery · 04602539
      Trond Myklebust authored
      We're seeing spurious calls to nfs4_schedule_stateid_recovery() from
      nfs4_do_open() in situations where there is no trigger coming from the
      server.
      In theory the code path being triggered is supposed to notice that state
      recovery happened while we were processing the open call result from the
      server, before the open stateid is published. However in the years since
      that code was added, we've also added the 'session draining' mechanism,
      which ensures that the state recovery will wait until all the session
      slots have been returned. In nfs4_do_open() the session slot is only
      returned on exit of the function, so we don't need the legacy mechanism.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • NFS: avoid infinite loop in pnfs_update_layout. · 2fdbc200
      NeilBrown authored
      If pnfs_update_layout() is called on a file for which recovery has
      failed, it will enter a tight infinite loop.
      
      NFS_LAYOUT_INVALID_STID will be set, nfs4_select_rw_stateid() will
      return -EIO, and nfs4_schedule_stateid_recovery() will do nothing, so
      nfs4_client_recover_expired_lease() will not wait.  So the code will
      loop indefinitely.
      
      Break the loop by testing the validity of the open stateid at the top of
      the loop.
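
A minimal sketch of the loop shape, assuming a simplified model in which a failed recovery leaves the state permanently invalid and the retry changes nothing. All names here (`open_state_sketch`, `select_stateid_sketch()`, `update_layout_sketch()`, `SKETCH_SPUN_OUT`) are hypothetical; the iteration cap exists only so the unguarded variant terminates in a test.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_SPUN_OUT 1  /* hypothetical marker: the loop would have spun forever */

/* Hypothetical open state; valid == false models "recovery has failed". */
struct open_state_sketch {
	bool valid;
};

/* Stand-in for stateid selection: fails with -EIO on an invalid state. */
static int select_stateid_sketch(const struct open_state_sketch *st)
{
	return st->valid ? 0 : -EIO;
}

/*
 * Retry loop. With check_first set, validity is tested at the top of
 * the loop, so a permanently invalid state exits on the first pass.
 */
static int update_layout_sketch(const struct open_state_sketch *st,
				bool check_first, int max_iters, int *iters)
{
	*iters = 0;
	while (*iters < max_iters) {
		(*iters)++;
		if (check_first && !st->valid)
			return -EIO;
		if (select_stateid_sketch(st) == 0)
			return 0;
		/* Recovery is a no-op for a failed state, so nothing changes: retry. */
	}
	return SKETCH_SPUN_OUT;
}
```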
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • NFS: remove sync_mode test from nfs_writepage_locked() · 0b81371d
      NeilBrown authored
      nfs_writepage_locked() is only called from nfs_wb_folio() (since Commit
      12fc0a96 ("nfs: Remove writepage")) so ->sync_mode is always
      WB_SYNC_ALL.
      
      This means the test for WB_SYNC_NONE is dead code and can be removed.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • NFSv4.1/pnfs: fix NFS with TLS in pnfs · a35518ca
      Olga Kornievskaia authored
      Currently, even though xprtsec=tls is specified and used for operations
      to the MDS, any operations that go to a DS travel over an unencrypted
      connection. Additionally, if more than one DS can serve the data, the
      trunked connections are also unencrypted.
      
      In GETDEVICEINFO, we get an entry for the DS which carries a protocol
      type (which is TCP), then nfs4_set_ds_client() gets called with TCP
      instead of TCP with TLS.
      
      Currently, each trunked connection is created and uses clp->cl_hostname
      value which if TLS is used would get passed up in the handshake upcall,
      but instead we need to pass in the appropriate trunked address value.
      
      Fixes: c8407f2e ("NFS: Add an "xprtsec=" NFS mount option")
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • NFS: Fix an off by one in root_nfs_cat() · 698ad1a5
      Christophe JAILLET authored
      The intent is to check whether 'dest' is truncated. So, >= should be
      used instead of >, because strlcat() returns the combined length of
      'dest' and 'src', excluding the trailing NUL.
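
The boundary case can be shown with a small userspace sketch. `sketch_strlcat()` is a hypothetical local helper written to follow the usual strlcat() return convention (the length of the string it tried to create), and `cat_checked()` is an illustrative wrapper, not the kernel's root_nfs_cat().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical strlcat-like helper: appends src to dest within a buffer
 * of 'size' bytes and returns strlen(dest) + strlen(src) as they were
 * on entry, i.e. the length the full result would have had.
 */
static size_t sketch_strlcat(char *dest, const char *src, size_t size)
{
	size_t dlen = 0;
	size_t slen = strlen(src);

	while (dlen < size && dest[dlen] != '\0')
		dlen++;
	if (dlen == size)
		return size + slen;  /* dest is not terminated within size */

	if (slen < size - dlen) {
		memcpy(dest + dlen, src, slen + 1);  /* fits, including the NUL */
	} else {
		memcpy(dest + dlen, src, size - dlen - 1);
		dest[size - 1] = '\0';               /* truncated */
	}
	return dlen + slen;
}

/* Returns 0 on success, -1 if the result was truncated. */
static int cat_checked(char *dest, const char *src, size_t size)
{
	/* A return value equal to size means the NUL did not fit, hence >=. */
	if (sketch_strlcat(dest, src, size) >= size)
		return -1;
	return 0;
}
```

With an 8-byte buffer holding "abc", appending "defgh" yields a would-be length of exactly 8: the copy is truncated to "abcdefg", and a `>` comparison would miss it.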
      
      Fixes: 56463e50 ("NFS: Use super.c for NFSROOT mount option parsing")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • nfs: make the rpc_stat per net namespace · 1548036e
      Josef Bacik authored
      Now that we're exposing the rpc stats on a per-network namespace basis,
      move this struct into struct nfs_net and use that to make sure only the
      per-network namespace stats are exposed.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • nfs: expose /proc/net/sunrpc/nfs in net namespaces · d47151b7
      Josef Bacik authored
      We're using nfs mounts inside of containers in production and noticed
      that the nfs stats are not exposed in /proc.  This is a problem for us
      as we use these stats for monitoring, and have to do this awkward bind
      mount from the main host into the container in order to get to these
      stats.
      
      Add the rpc_proc_register call to the pernet operations entry and exit
      points so these stats can be exposed inside of network namespaces.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • sunrpc: add a struct rpc_stats arg to rpc_create_args · 2057a48d
      Josef Bacik authored
      We want to be able to have our rpc stats handled in a per network
      namespace manner, so add an option to rpc_create_args to specify a
      different rpc_stats struct instead of using the one on the rpc_program.
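
As a rough sketch of the idea, an optional stats pointer in the creation arguments can override the program-wide default. The struct and function names below are hypothetical simplifications and do not reproduce the real sunrpc definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stats block. */
struct rpc_stat_sketch {
	unsigned long calls;
};

/* Hypothetical program descriptor carrying the default (global) stats. */
struct rpc_program_sketch {
	struct rpc_stat_sketch *stats;
};

/* Hypothetical creation arguments with an optional per-namespace override. */
struct rpc_create_args_sketch {
	const struct rpc_program_sketch *program;
	struct rpc_stat_sketch *stats;  /* NULL means "use the program's stats" */
};

/* Pick the stats block a new client should account against. */
static struct rpc_stat_sketch *
select_stats_sketch(const struct rpc_create_args_sketch *args)
{
	return args->stats ? args->stats : args->program->stats;
}
```

Callers that leave the override unset keep the old behaviour, so only namespace-aware users need to change.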
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • nfs: remove unused NFS_CALL macro · edc99a2d
      Jeff Layton authored
      Nothing uses this, and thank goodness, as the syntax looks horrid.
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • NFSv4.1: add tracepoint to trunked nfs4_exchange_id calls · 7e5ae43b
      Olga Kornievskaia authored
      Add a tracepoint to track when the client sends EXCHANGE_ID to test
      a new transport for session trunking.
      
      nfs4_detect_session_trunking() tests for trunking and returns
      EINVAL if trunking can't be done, so add an EINVAL mapping to
      show_nfs4_status() in tracepoints.
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interrupt · fd5860ab
      Dave Wysochanski authored
      The loop inside nfs_netfs_issue_read() currently does not disable
      interrupts while iterating through pages in the xarray to submit
      for NFS read.  This is not safe though since after taking xa_lock,
      another page in the mapping could be processed for writeback inside
      an interrupt, and deadlock can occur.  The fix is simple and clean
      if we use xa_for_each_range(), which handles the iteration with RCU
      while reducing code complexity.
      
      The problem is easily reproduced with the following test:
       mount -o vers=3,fsc 127.0.0.1:/export /mnt/nfs
       dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1
       echo 3 > /proc/sys/vm/drop_caches
       dd if=/mnt/nfs/file1.bin of=/dev/null
       umount /mnt/nfs
      
      On the console with a lockdep-enabled kernel a message similar to
      the following will be seen:
      
       ================================
       WARNING: inconsistent lock state
       6.7.0-lockdbg+ #10 Not tainted
       --------------------------------
       inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
       test5/1708 [HC0[0]:SC0[0]:HE1:SE1] takes:
       ffff888127baa598 (&xa->xa_lock#4){+.?.}-{3:3}, at:
      nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
       {IN-SOFTIRQ-W} state was registered at:
         lock_acquire+0x144/0x380
         _raw_spin_lock_irqsave+0x4e/0xa0
         __folio_end_writeback+0x17e/0x5c0
         folio_end_writeback+0x93/0x1b0
         iomap_finish_ioend+0xeb/0x6a0
         blk_update_request+0x204/0x7f0
         blk_mq_end_request+0x30/0x1c0
         blk_complete_reqs+0x7e/0xa0
         __do_softirq+0x113/0x544
         __irq_exit_rcu+0xfe/0x120
         irq_exit_rcu+0xe/0x20
         sysvec_call_function_single+0x6f/0x90
         asm_sysvec_call_function_single+0x1a/0x20
         pv_native_safe_halt+0xf/0x20
         default_idle+0x9/0x20
         default_idle_call+0x67/0xa0
         do_idle+0x2b5/0x300
         cpu_startup_entry+0x34/0x40
         start_secondary+0x19d/0x1c0
         secondary_startup_64_no_verify+0x18f/0x19b
       irq event stamp: 176891
       hardirqs last  enabled at (176891): [<ffffffffa67a0be4>]
      _raw_spin_unlock_irqrestore+0x44/0x60
       hardirqs last disabled at (176890): [<ffffffffa67a0899>]
      _raw_spin_lock_irqsave+0x79/0xa0
       softirqs last  enabled at (176646): [<ffffffffa515d91e>]
      __irq_exit_rcu+0xfe/0x120
       softirqs last disabled at (176633): [<ffffffffa515d91e>]
      __irq_exit_rcu+0xfe/0x120
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&xa->xa_lock#4);
         <Interrupt>
           lock(&xa->xa_lock#4);
      
        *** DEADLOCK ***
      
       2 locks held by test5/1708:
        #0: ffff888127baa498 (&sb->s_type->i_mutex_key#22){++++}-{4:4}, at:
            nfs_start_io_read+0x28/0x90 [nfs]
        #1: ffff888127baa650 (mapping.invalidate_lock#3){.+.+}-{4:4}, at:
            page_cache_ra_unbounded+0xa4/0x280
      
       stack backtrace:
       CPU: 6 PID: 1708 Comm: test5 Kdump: loaded Not tainted 6.7.0-lockdbg+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
      04/01/2014
       Call Trace:
        dump_stack_lvl+0x5b/0x90
        mark_lock+0xb3f/0xd20
        __lock_acquire+0x77b/0x3360
        _raw_spin_lock+0x34/0x80
        nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
        netfs_begin_read+0x77f/0x980 [netfs]
        nfs_netfs_readahead+0x45/0x60 [nfs]
        nfs_readahead+0x323/0x5a0 [nfs]
        read_pages+0xf3/0x5c0
        page_cache_ra_unbounded+0x1c8/0x280
        filemap_get_pages+0x38c/0xae0
        filemap_read+0x206/0x5e0
        nfs_file_read+0xb7/0x140 [nfs]
        vfs_read+0x2a9/0x460
        ksys_read+0xb7/0x140
      
      Fixes: 000dbe0b ("NFS: Convert buffered read paths to use netfs when fscache is enabled")
      Suggested-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
  2. 28 Feb, 2024 11 commits
  3. 25 Feb, 2024 14 commits
    • Linux 6.8-rc6 · d206a76d
      Linus Torvalds authored
    • Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs · e231dbd4
      Linus Torvalds authored
      Pull bcachefs fixes from Kent Overstreet:
       "Some more mostly boring fixes, but some not
      
        User reported ones:
      
         - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
           performance bug; user reported an untar initially taking two
           seconds and then ~2 minutes
      
         - kill a __GFP_NOFAIL in the buffered read path; this was a leftover
           from the trickier fix to kill __GFP_NOFAIL in readahead, where we
           can't return errors (and have to silently truncate the read
           ourselves).
      
           bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
           filesystems because our folio state is just barely too big, 2MB
           hugepages cause us to exceed the 2 page threshold for GFP_NOFAIL.
      
           additionally, the flags argument was just buggy, we weren't
           supplying GFP_KERNEL previously (!)"
      
      * tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
        bcachefs: fix bch2_save_backtrace()
        bcachefs: Fix check_snapshot() memcpy
        bcachefs: Fix bch2_journal_flush_device_pins()
        bcachefs: fix iov_iter count underflow on sub-block dio read
        bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
        bcachefs: Kill __GFP_NOFAIL in buffered read path
        bcachefs: fix backpointer_to_text() when dev does not exist
    • bcachefs: fix bch2_save_backtrace() · 5197728f
      Kent Overstreet authored
      Missed a call in the previous fix.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • Merge tag 'docs-6.8-fixes3' of git://git.lwn.net/linux · 70ff1fe6
      Linus Torvalds authored
      Pull two documentation build fixes from Jonathan Corbet:
      
       - The XFS online fsck documentation uses incredibly deeply nested
         subsection and list nesting; that broke the PDF docs build. Tweak a
         parameter to tell LaTeX to allow the deeper nesting.
      
       - Fix a 6.8 PDF-build regression
      
      * tag 'docs-6.8-fixes3' of git://git.lwn.net/linux:
        docs: translations: use attribute to store current language
        docs: Instruct LaTeX to cope with deeper nesting
    • Merge tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · c46ac50e
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small USB fixes for 6.8-rc6 to resolve some reported
        problems. These include:
      
         - regression fixes with typec tcpm code as reported by many
      
         - cdnsp and cdns3 driver fixes
      
         - usb role setting code bugfixes
      
         - build fix for uhci driver
      
         - ncm gadget driver bugfix
      
         - MAINTAINERS entry update
      
        All of these have been in linux-next all week with no reported issues
        and there is at least one fix in here that is in Thorsten's regression
        list that is being tracked"
      
      * tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: typec: tpcm: Fix issues with power being removed during reset
        MAINTAINERS: Drop myself as maintainer of TYPEC port controller drivers
        usb: gadget: ncm: Avoid dropping datagrams of properly parsed NTBs
        Revert "usb: typec: tcpm: reset counter when enter into unattached state after try role"
        usb: gadget: omap_udc: fix USB gadget regression on Palm TE
        usb: dwc3: gadget: Don't disconnect if not started
        usb: cdns3: fix memory double free when handle zero packet
        usb: cdns3: fixed memory use after free at cdns3_gadget_ep_disable()
        usb: roles: don't get/set_role() when usb_role_switch is unregistered
        usb: roles: fix NULL pointer issue when put module's reference
        usb: cdnsp: fixed issue with incorrect detecting CDNSP family controllers
        usb: cdnsp: blocked some cdns3 specific code
        usb: uhci-grlib: Explicitly include linux/platform_device.h
    • Merge tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 1e592e95
      Linus Torvalds authored
      Pull tty/serial driver fixes from Greg KH:
       "Here are three small serial/tty driver fixes for 6.8-rc6 that resolve
        the following reported errors:
      
         - riscv hvc console driver fix that was reported by many
      
         - amba-pl011 serial driver fix for RS485 mode
      
         - stm32 serial driver fix for RS485 mode
      
        All of these have been in linux-next all week with no reported
        problems"
      
      * tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        serial: amba-pl011: Fix DMA transmission in RS485 mode
        serial: stm32: do not always set SER_RS485_RX_DURING_TX if RS485 is enabled
        tty: hvc: Don't enable the RISC-V SBI console by default
    • Merge tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1eee4ef3
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
      
       - Make sure clearing CPU buffers using VERW happens at the latest
         possible point in the return-to-userspace path, otherwise memory
         accesses after the VERW execution could cause data to land in CPU
         buffers again
      
      * tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        KVM/VMX: Move VERW closer to VMentry for MDS mitigation
        KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
        x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
        x86/entry_32: Add VERW just before userspace transition
        x86/entry_64: Add VERW just before userspace transition
        x86/bugs: Add asm helpers for executing VERW
    • Merge tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8c46ed37
      Linus Torvalds authored
      Pull irq fixes from Borislav Petkov:
      
       - Make sure GICv4 always gets initialized to prevent a kexec-ed kernel
         from silently failing to set it up
      
       - Do not call bus_get_dev_root() for the mbigen irqchip as it always
         returns NULL - use NULL directly
      
       - Fix hardware interrupt number truncation when assigning MSI
         interrupts
      
       - Correct sending end-of-interrupt messages to disabled interrupt
         lines on RISC-V PLIC
      
      * tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/gic-v3-its: Do not assume vPE tables are preallocated
        irqchip/mbigen: Don't use bus_get_dev_root() to find the parent
        PCI/MSI: Prevent MSI hardware interrupt number truncation
        irqchip/sifive-plic: Enable interrupt if needed before EOI
    • Merge tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 4ca0d989
      Linus Torvalds authored
      Pull erofs fix from Gao Xiang:
      
       - Fix page refcount leak when looking up specific inodes
         introduced by metabuf reworking
      
      * tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        erofs: fix refcount on the metabuf used for inode lookup
    • Merge tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 66a97c2e
      Linus Torvalds authored
      Pull RCU pathwalk fixes from Al Viro:
       "We still have some races in filesystem methods when exposed to RCU
        pathwalk. This series is a result of code audit (the second round of
        it) and it should deal with most of that stuff.
      
        Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate().
        Up to maintainers (a note for NTFS folks - when documentation says
        that a method may not block, it *does* imply that blocking allocations
        are to be avoided. Really)"
      
      [ More explanations for people who aren't familiar with the vagaries of
        RCU path walking: most of it is hidden from filesystems, but if a
        filesystem actively participates in the low-level path walking it
        needs to make sure the fields involved in that walk are RCU-safe.
      
        That "actively participate in low-level path walking" includes things
        like having its own ->d_hash()/->d_compare() routines, or by having
        its own directory permission function that doesn't just use the common
        helpers.  Having a ->d_revalidate() function will also have this issue.
      
        Note that instead of making everything RCU safe you can also choose to
        abort the RCU pathwalk if your operation cannot be done safely under
        RCU, but that obviously comes with a performance penalty. One common
        pattern is to allow the simple cases under RCU, and abort only if you
        need to do something more complicated.
      
        So not everything needs to be RCU-safe, and things like the inode etc
        that the VFS itself maintains obviously already are. But these fixes
        tend to be about properly RCU-delaying things like ->s_fs_info that
        are maintained by the filesystem and that got potentially released too
        early.   - Linus ]
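
The "allow the simple cases under RCU, abort otherwise" pattern can be sketched in userspace C. This is a hedged illustration: `inode_sketch`, `get_link_sketch()` and `walk_link_sketch()` are hypothetical, the slow-path target is made up, and only the use of -ECHILD as the "retry in non-RCU mode" signal mirrors the VFS convention.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical inode: cached_target is set when the link body is in memory. */
struct inode_sketch {
	const char *cached_target;
};

/*
 * Hypothetical ->get_link()-style method. In RCU mode it may not block,
 * so it serves only the cached case and returns -ECHILD for anything
 * else, asking the caller to retry outside RCU mode.
 */
static int get_link_sketch(const struct inode_sketch *inode, bool rcu_mode,
			   const char **out)
{
	if (inode->cached_target) {
		*out = inode->cached_target;  /* simple case: safe under RCU */
		return 0;
	}
	if (rcu_mode)
		return -ECHILD;               /* would need to block: bail out */

	*out = "/slow/path/target";           /* pretend we blocked and read it */
	return 0;
}

/* Caller side: try the fast walk first, fall back to the blocking one. */
static int walk_link_sketch(const struct inode_sketch *inode,
			    const char **out, int *retried)
{
	int err = get_link_sketch(inode, true, out);

	*retried = 0;
	if (err == -ECHILD) {
		*retried = 1;
		err = get_link_sketch(inode, false, out);
	}
	return err;
}
```

The cached case never leaves the fast path, while the uncached case pays for one retry, which is the performance trade-off mentioned above.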
      
      * tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        ext4_get_link(): fix breakage in RCU mode
        cifs_get_link(): bail out in unsafe case
        fuse: fix UAF in rcu pathwalks
        procfs: make freeing proc_fs_info rcu-delayed
        procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
        nfs: fix UAF on pathwalk running into umount
        nfs: make nfs_set_verifier() safe for use in RCU pathwalk
        afs: fix __afs_break_callback() / afs_drop_open_mmap() race
        hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
        exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
        affs: free affs_sb_info with kfree_rcu()
        rcu pathwalk: prevent bogus hard errors from may_lookup()
        fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
    • Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 9b243492
      Linus Torvalds authored
      Pull vfs fixes from Al Viro:
       "A couple of fixes - revert of regression from this cycle and a fix for
        erofs failure exit breakage (had been there since way back)"
      
      * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        erofs: fix handling kern_mount() failure
        Revert "get rid of DCACHE_GENOCIDE"
    • ext4_get_link(): fix breakage in RCU mode · 9fa8e282
      Al Viro authored
      1) errors from ext4_getblk() should not be propagated to caller
      unless we are really sure that we would've gotten the same error
      in non-RCU pathwalk.
      2) we leak buffer_heads if ext4_getblk() is successful, but bh is
      not uptodate.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cifs_get_link(): bail out in unsafe case · 0511fdb4
      Al Viro authored
      ->d_revalidate() bails out there, anyway.  It's not enough
      to prevent getting into ->get_link() in RCU mode, but that
      could happen only in a very contrived setup.  Not worth
      trying to do anything fancy here unless ->d_revalidate()
      stops kicking out of RCU mode at least in some cases.
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Acked-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fuse: fix UAF in rcu pathwalks · 053fc4f7
      Al Viro authored
      ->permission(), ->get_link() and ->inode_get_acl() might dereference
      ->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns
      as well) when called from rcu pathwalk.
      
      Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info
      and dropping ->user_ns rcu-delayed too.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>