1. 15 Aug, 2024 1 commit
    • Merge tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 1fb91896
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - extend tree-checker verification of directory item type
      
       - fix regression in page/folio and extent state tracking in xarray, the
         dirty status can get out of sync and can cause problems e.g. a hang
      
       - in send, detect last extent and allow to clone it instead of sending
         it as write, reduces amount of data transferred in the stream
      
       - fix checking extent references when cleaning deleted subvolumes
      
       - fix one more case in the extent map shrinker, let it run only in the
         kswapd context so it does not cause latency spikes during other
         operations
      
      * tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fix invalid mapping of extent xarray state
        btrfs: send: allow cloning non-aligned extent if it ends at i_size
        btrfs: only run the extent map shrinker from kswapd tasks
        btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type
        btrfs: check delayed refs when we're checking if a ref exists
  2. 14 Aug, 2024 6 commits
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · d07b4328
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "s390:
      
         - Fix failure to start guests with kvm.use_gisa=0
      
         - Panic if (un)share fails to maintain security.
      
        ARM:
      
         - Use kvfree() for the kvmalloc'd nested MMUs array
      
         - Set of fixes to address warnings in W=1 builds
      
         - Make KVM depend on assembler support for ARMv8.4
      
         - Fix for vgic-debug interface for VMs without LPIs
      
         - Actually check ID_AA64MMFR3_EL1.S1PIE in get-reg-list selftest
      
         - Minor code / comment cleanups for configuring PAuth traps
      
         - Take kvm->arch.config_lock to prevent destruction / initialization
           race for a vCPU's CPUIF which may lead to a UAF
      
        x86:
      
         - Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
      
         - Fix smatch issues
      
         - Small cleanups
      
         - Make x2APIC ID 100% readonly
      
         - Fix typo in uapi constant
      
        Generic:
      
         - Use synchronize_srcu_expedited() on irqfd shutdown"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
        KVM: SEV: uapi: fix typo in SEV_RET_INVALID_CONFIG
        KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
        KVM: eventfd: Use synchronize_srcu_expedited() on shutdown
        KVM: selftests: Add a testcase to verify x2APIC is fully readonly
        KVM: x86: Make x2APIC ID 100% readonly
        KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())
        KVM: x86: hyper-v: Remove unused inline function kvm_hv_free_pa_page()
        KVM: SVM: Fix an error code in sev_gmem_post_populate()
        KVM: SVM: Fix uninitialized variable bug
        KVM: arm64: vgic: Hold config_lock while tearing down a CPU interface
        KVM: selftests: arm64: Correct feature test for S1PIE in get-reg-list
        KVM: arm64: Tidying up PAuth code in KVM
        KVM: arm64: vgic-debug: Exit the iterator properly w/o LPI
        KVM: arm64: Enforce dependency on an ARMv8.4-aware toolchain
        s390/uv: Panic for set and remove shared access UVC errors
        KVM: s390: fix validity interception issue when gisa is switched off
        docs: KVM: Fix register ID of SPSR_FIQ
        KVM: arm64: vgic: fix unexpected unlock sparse warnings
        KVM: arm64: fix kdoc warnings in W=1 builds
        KVM: arm64: fix override-init warnings in W=1 builds
        ...
    • KVM: SEV: uapi: fix typo in SEV_RET_INVALID_CONFIG · 1c0e5881
      Amit Shah authored
      "INVALID" is misspelt in "SEV_RET_INAVLID_CONFIG". Since this is part of
      the UAPI, keep the current definition and add a new one with the fix.
      Fix-suggested-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Amit Shah <amit.shah@amd.com>
      Message-ID: <20240814083113.21622-1-amit@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
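
The compatibility pattern this commit message describes (keep the misspelt UAPI name, add a correctly spelt one with the same value) might look roughly like the sketch below. The `demo_` enum, its members, and the helper are hypothetical stand-ins for illustration, not the contents of the real SEV header.

```c
#include <assert.h>

/*
 * Hypothetical return codes, not the real SEV UAPI header. The misspelt
 * name stays because existing userspace may already build against it.
 */
enum demo_ret_code {
    DEMO_RET_SUCCESS = 0,
    DEMO_RET_INAVLID_CONFIG,    /* original, misspelt name: kept as-is */
    /* Correctly spelt alias that shares the value of the old name. */
    DEMO_RET_INVALID_CONFIG = DEMO_RET_INAVLID_CONFIG,
    DEMO_RET_NEXT,              /* later values keep their numbering */
};

/* Code written against either spelling observes the same value. */
int demo_is_invalid_config(int ret)
{
    return ret == DEMO_RET_INVALID_CONFIG;
}
```

Because the alias is assigned from the old enumerator, the values that follow it are numbered exactly as before, so nothing visible to existing binaries changes.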
    • KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) · 66155de9
      Sean Christopherson authored
      Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
      directly emulate instructions for ES/SNP, and instead the guest must
      explicitly request emulation.  Unless the guest explicitly requests
      emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
      SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
      
      But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
      because except for ES/SNP, doing so requires setting reserved bits in the
      SPTE, i.e. the SPTE can't be readable while also generating a #VC on
      writes.  Because KVM never creates MMIO SPTEs and jumps directly to
      emulation, the guest never gets a #VC.  And since KVM simply resumes the
      guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
      into an infinite #NPF loop if the vCPU attempts to write read-only memory.
      
      Disallow read-only memory for all VMs with protected state, i.e. for
      upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
      to support read-only memory, as TDX uses EPT Violation #VE to reflect the
      fault into the guest, e.g. KVM could configure read-only SPTEs with RX
      protections and SUPPRESS_VE=0.  But there is no strong use case for
      supporting read-only memslots on TDX, e.g. the main historical usage is
      to emulate option ROMs, but TDX disallows executing from shared memory.
      And if someone comes along with a legitimate, strong use case, the
      restriction can always be lifted for TDX.
      
      Don't bother trying to retroactively apply the restriction to SEV-ES
      VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
      possibly work for SEV-ES, i.e. disallowing such memslots really just
      means reporting an error to userspace instead of silently hanging vCPUs.
      Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
      isn't worth the marginal benefit it would provide userspace.
      
      Fixes: 26c44aa9 ("KVM: SEV: define VM types for SEV and SEV-ES")
      Fixes: 1dfe571c ("KVM: SEV: Add initial SEV-SNP support")
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Michael Roth <michael.roth@amd.com>
      Cc: Vishal Annapurve <vannapurve@google.com>
      Cc: Ackerly Tng <ackerleytng@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240809190319.1710470-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
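
The user-visible effect described in this commit message, an error at memslot creation instead of a silently hung vCPU, can be sketched as a flag check. All names below (`demo_vm`, `DEMO_MEM_READONLY`, both helpers) are hypothetical and only mirror the idea; this is not KVM's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_MEM_READONLY (1u << 1)   /* hypothetical "read-only memslot" flag */

/* Hypothetical VM descriptor: true for ES/SNP/TDX-like VM types. */
struct demo_vm {
    bool has_protected_state;
};

/* Read-only memslots are only offered to VMs without protected state. */
bool demo_vm_has_readonly_mem(const struct demo_vm *vm)
{
    return !vm->has_protected_state;
}

/* Reject the request up front rather than letting the guest loop on faults. */
int demo_check_memslot_flags(const struct demo_vm *vm, unsigned int flags)
{
    if ((flags & DEMO_MEM_READONLY) && !demo_vm_has_readonly_mem(vm))
        return -EINVAL;
    return 0;
}
```

In this sketch a default VM may still create read-only slots, while a VM flagged as having protected state gets `-EINVAL`.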
    • Merge tag 'selinux-pr-20240814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 9d590679
      Linus Torvalds authored
      Pull selinux fixes from Paul Moore:
      
       - Fix an xperms counting problem where we were adding to the xperms count
         even if we failed to add the xperm.
      
       - Propagate errors from avc_add_xperms_decision() back to the caller so
         that we can trigger the proper cleanup and error handling.
      
       - Revert our use of vma_is_initial_heap() in favor of our older logic
         as vma_is_initial_heap() doesn't correctly handle the no-heap case
         and it is causing issues with the SELinux process/execheap access
         control. While the older SELinux logic may not be perfect, it
         restores the expected user visible behavior.
      
         Hopefully we will be able to resolve the problem with the
         vma_is_initial_heap() macro with the mm folks, but we need to fix
         this in the meantime.
      
      * tag 'selinux-pr-20240814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: revert our use of vma_is_initial_heap()
        selinux: add the processing of the failure of avc_add_xperms_decision()
        selinux: fix potential counting error in avc_add_xperms_decision()
    • Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 4ac0f08f
      Linus Torvalds authored
      Pull vfs fixes from Christian Brauner:
       "VFS:
      
         - Fix the name of file lease slab cache. When file leases were split
           out of file locks the name of the file lock slab cache was used for
           the file leases slab cache as well.
      
         - Fix a typo in the take_fd() helper comment.
      
         - Fix infinite directory iteration for stable offsets in tmpfs.
      
         - When the icache is pruned all reclaimable inodes are marked with
           I_FREEING and other processes that try to lookup such inodes will
           block.
      
           But some filesystems like ext4 can trigger lookups in their inode
           evict callback causing deadlocks. Ext4 does such lookups if the
           ea_inode feature is used whereby a separate inode may be used to
           store xattrs.
      
           Introduce I_LRU_ISOLATING which pins the inode while its pages are
           reclaimed. This avoids inode deletion during inode_lru_isolate()
           avoiding the deadlock and evict is made to wait until
           I_LRU_ISOLATING is done.
      
        netfs:
      
         - Fault in smaller chunks for non-large folio mappings for
           filesystems that haven't been converted to large folios yet.
      
         - Fix the CONFIG_NETFS_DEBUG config option. The config option was
           renamed a short while ago and that introduced two minor issues.
           First, it depended on CONFIG_NETFS whereas it wants to depend on
           CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
           does. Second, the documentation for the config option wasn't fixed
           up.
      
         - Revert the removal of the PG_private_2 writeback flag as ceph is
           using it and fix how that flag is handled in netfs.
      
         - Fix DIO reads on 9p. A program watching a file on a 9p mount
           wouldn't see any changes in the size of the file being exported by
           the server if the file was changed directly in the source
           filesystem. Fix this by attempting to read the full size specified
           when a DIO read is requested.
      
         - Fix a NULL pointer dereference bug due to a data race where a
           cachefiles cookie was retired even though it was still in use.
           Check the cookie's n_accesses counter before discarding it.
      
        nsfs:
      
         - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
           the kernel is writing to userspace.
      
        pidfs:
      
         - Prevent the creation of pidfds for kthreads until we have a
           use-case for it and we know the semantics we want. Being able to
           get pidfds for kthreads also confuses userspace.
      
        squashfs:
      
         - Fix an uninitialized value bug reported by KMSAN caused by a
           corrupted symbolic link size read from disk. Check that the
           symbolic link size is not larger than expected"
      
      * tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        Squashfs: sanity check symbolic link size
        9p: Fix DIO read through netfs
        vfs: Don't evict inode under the inode lru traversing context
        netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
        netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
        file: fix typo in take_fd() comment
        pidfd: prevent creation of pidfds for kthreads
        netfs: clean up after renaming FSCACHE_DEBUG config
        libfs: fix infinite directory reads for offset dir
        nsfs: fix ioctl declaration
        fs/netfs/fscache_cookie: add missing "n_accesses" check
        filelock: fix name of file_lease slab cache
        netfs: Fault in smaller chunks for non-large folio mappings
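
The nsfs item in the pull message above hinges on the direction bits encoded in an ioctl number: `_IO()` declares no data transfer, while `_IOR()` declares that the kernel writes data back to userspace. The sketch below re-creates the asm-generic encoding with hypothetical `DEMO_` macros so it stays self-contained; the type and number values in the usage are placeholders, and some architectures use a different bit layout.

```c
#include <assert.h>

/* Demo copy of the asm-generic layout: nr(8) | type(8) | size(14) | dir(2). */
#define DEMO_IOC_NRSHIFT    0
#define DEMO_IOC_TYPESHIFT  8
#define DEMO_IOC_SIZESHIFT  16
#define DEMO_IOC_DIRSHIFT   30

#define DEMO_IOC_NONE   0u  /* no data transfer */
#define DEMO_IOC_WRITE  1u  /* userspace writes, kernel reads */
#define DEMO_IOC_READ   2u  /* kernel writes, userspace reads */

#define DEMO_IOC(dir, type, nr, size) \
    (((unsigned int)(dir)  << DEMO_IOC_DIRSHIFT)  | \
     ((unsigned int)(type) << DEMO_IOC_TYPESHIFT) | \
     ((unsigned int)(nr)   << DEMO_IOC_NRSHIFT)   | \
     ((unsigned int)(size) << DEMO_IOC_SIZESHIFT))

#define DEMO_IO(type, nr)       DEMO_IOC(DEMO_IOC_NONE, type, nr, 0)
#define DEMO_IOR(type, nr, t)   DEMO_IOC(DEMO_IOC_READ, type, nr, sizeof(t))

/* Extract the direction field from an encoded command. */
unsigned int demo_ioc_dir(unsigned int cmd)
{
    return cmd >> DEMO_IOC_DIRSHIFT;
}

/* Extract the argument size field (14 bits) from an encoded command. */
unsigned int demo_ioc_size(unsigned int cmd)
{
    return (cmd >> DEMO_IOC_SIZESHIFT) & 0x3fffu;
}
```

With the same type and number, the two declarations produce different command values, which is why changing the declaration changes the ioctl number that userspace passes.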
    • Merge tag 'bpf-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 02f8ca3d
      Linus Torvalds authored
      Pull bpf fixes from Alexei Starovoitov:
      
       - Fix bpftrace regression from Kyle Huey.
      
         Tracing bpf prog was called with perf_event input arguments causing
         bpftrace to produce garbage output.
      
       - Fix verifier crash in stacksafe() from Yonghong Song.
      
         Daniel Hodges reported a verifier crash when playing with sched-ext.
         The stack depth in the known verifier state was larger than the stack
         depth in the state being explored, causing an out-of-bounds access.
      
       - Fix update of freplace prog in prog_array from Leon Hwang.
      
         freplace prog type wasn't recognized correctly.
      
      * tag 'bpf-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        perf/bpf: Don't call bpf_overflow_handler() for tracing events
        selftests/bpf: Add a test to verify previous stacksafe() fix
        bpf: Fix a kernel verifier crash in stacksafe()
        bpf: Fix updating attached freplace prog in prog_array map
  3. 13 Aug, 2024 23 commits
    • Merge tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 6b0f8db9
      Linus Torvalds authored
      Pull execve fixes from Kees Cook:
      
       - binfmt_flat: Fix corruption when not offsetting data start
      
       - exec: Fix ToCToU between perm check and set-uid/gid usage
      
      * tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        exec: Fix ToCToU between perm check and set-uid/gid usage
        binfmt_flat: Fix corruption when not offsetting data start
    • exec: Fix ToCToU between perm check and set-uid/gid usage · f50733b4
      Kees Cook authored
      When opening a file for exec via do_filp_open(), permission checking is
      done against the file's metadata at that moment, and on success, a file
      pointer is passed back. Much later in the execve() code path, the file
      metadata (specifically mode, uid, and gid) is used to determine if/how
      to set the uid and gid. However, those values may have changed since the
      permissions check, meaning the execution may gain unintended privileges.
      
      For example, if a file could change permissions from executable and not
      set-id:
      
      ---------x 1 root root 16048 Aug  7 13:16 target
      
      to set-id and non-executable:
      
      ---S------ 1 root root 16048 Aug  7 13:16 target
      
      it is possible to gain root privileges when execution should have been
      disallowed.
      
      While this race condition is rare in real-world scenarios, it has been
      observed (and proven exploitable) when package managers are updating
      the setuid bits of installed programs. Such files start with being
      world-executable but then are adjusted to be group-exec with a set-uid
      bit. For example, "chmod o-x,u+s target" makes "target" executable only
      by uid "root" and gid "cdrom", while also becoming setuid-root:
      
      -rwxr-xr-x 1 root cdrom 16048 Aug  7 13:16 target
      
      becomes:
      
      -rwsr-xr-- 1 root cdrom 16048 Aug  7 13:16 target
      
      But racing the chmod means users without group "cdrom" membership can
      get the permission to execute "target" just before the chmod, and when
      the chmod finishes, the exec reaches bprm_fill_uid(), and performs the
      setuid to root, violating the expressed authorization of "only cdrom
      group members can setuid to root".
      
      Re-check that we still have execute permissions in case the metadata
      has changed. It would be better to keep a copy from the perm-check time,
      but until we can do that refactoring, the least-bad option is to do a
      full inode_permission() call (under inode lock). It is understood that
      this is safe against dead-locks, but hardly optimal.
      Reported-by: Marco Vanotti <mvanotti@google.com>
      Tested-by: Marco Vanotti <mvanotti@google.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Kees Cook <kees@kernel.org>
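
The race in this commit message can be modelled in userspace as a minimal sketch: execute permission is checked against one snapshot of the metadata, the set-uid decision is taken against a later one, and the fix re-checks permission at the later point. Every name here is hypothetical, and the permission model is deliberately simplified (owner/group/other bits only, no capabilities, ACLs, or supplementary groups).

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_S_ISUID     04000u
#define DEMO_S_IXUSR     00100u
#define DEMO_S_IXGRP     00010u
#define DEMO_S_IXOTH     00001u
#define DEMO_EXEC_DENIED (~0u)    /* sentinel: execution refused */

struct demo_inode { unsigned int mode, uid, gid; };
struct demo_cred  { unsigned int uid, gid; };

/* Simplified owner/group/other execute check. */
bool demo_may_exec(const struct demo_inode *ino, const struct demo_cred *cred)
{
    if (cred->uid == ino->uid)
        return (ino->mode & DEMO_S_IXUSR) != 0;
    if (cred->gid == ino->gid)
        return (ino->mode & DEMO_S_IXGRP) != 0;
    return (ino->mode & DEMO_S_IXOTH) != 0;
}

/*
 * Late stage of exec: pick the effective uid from the *current* metadata.
 * With recheck set, permission is verified again at this point, so a
 * mode change since the first check cannot grant the set-uid identity.
 */
unsigned int demo_exec_euid(const struct demo_inode *ino,
                            const struct demo_cred *cred, bool recheck)
{
    if (recheck && !demo_may_exec(ino, cred))
        return DEMO_EXEC_DENIED;
    return (ino->mode & DEMO_S_ISUID) ? ino->uid : cred->uid;
}
```

For example, a caller outside group "cdrom" passes the early check while the file is mode 0755. If the mode then becomes 04754, the call without the re-check yields euid 0, while the call with it is refused.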
    • perf/bpf: Don't call bpf_overflow_handler() for tracing events · 100bff23
      Kyle Huey authored
      The regressing commit is new in 6.10. It assumed that anytime event->prog
      is set bpf_overflow_handler() should be invoked to execute the attached bpf
      program. This assumption is false for tracing events, and as a result the
      regressing commit broke bpftrace by invoking the bpf handler with garbage
      inputs on overflow.
      
      Prior to the regression the overflow handlers formed a chain (of length 0,
      1, or 2) and perf_event_set_bpf_handler() (the !tracing case) added
      bpf_overflow_handler() to that chain, while perf_event_attach_bpf_prog()
      (the tracing case) did not. Both set event->prog. The chain of overflow
      handlers was replaced by a single overflow handler slot and a fixed call to
      bpf_overflow_handler() when appropriate. This modifies the condition there
      to check event->prog->type == BPF_PROG_TYPE_PERF_EVENT, restoring the
      previous behavior and fixing bpftrace.
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Reported-by: Joe Damato <jdamato@fastly.com>
      Closes: https://lore.kernel.org/lkml/ZpFfocvyF3KHaSzF@LQ3V64L9R2/
      Fixes: f11f10bf ("perf/bpf: Call BPF handler directly, not through overflow machinery")
      Cc: stable@vger.kernel.org
      Tested-by: Joe Damato <jdamato@fastly.com> # bpftrace
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240813151727.28797-1-jdamato@fastly.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
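
The condition change described in this commit message can be illustrated with a small sketch: a program pointer being set is not enough, since the program's type must also be the perf-event type before the overflow path runs it. The structs, enum, and function below are hypothetical stand-ins rather than the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical program types; only PERF_EVENT belongs on the overflow path. */
enum demo_prog_type {
    DEMO_PROG_TYPE_KPROBE,
    DEMO_PROG_TYPE_TRACEPOINT,
    DEMO_PROG_TYPE_PERF_EVENT,
};

struct demo_prog  { enum demo_prog_type type; };
struct demo_event { const struct demo_prog *prog; };

/*
 * Regressed check: "event->prog != NULL".
 * Restored behaviour: additionally require the perf-event program type,
 * so programs attached for tracing are not invoked with overflow inputs.
 */
bool demo_should_run_overflow_prog(const struct demo_event *event)
{
    return event->prog != NULL &&
           event->prog->type == DEMO_PROG_TYPE_PERF_EVENT;
}
```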
    • KVM: eventfd: Use synchronize_srcu_expedited() on shutdown · c9b35a6f
      Li RongQing authored
      When hot-unplugging a device which has many queues, the guest CPU has
      huge jitter, and unplugging is very slow.
      
      It turns out synchronize_srcu() in irqfd_shutdown() caused the guest
      jitter and unplugging latency, so replace synchronize_srcu() with
      synchronize_srcu_expedited() to accelerate the unplugging and reduce
      the guest OS jitter. This accelerates VM reboot too.
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Message-ID: <20240711121130.38917-1-lirongqing@baidu.com>
      [Call it just once in irqfd_resampler_shutdown. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag '6.11-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd · 6b4aa469
      Linus Torvalds authored
      Pull smb server fixes from Steve French:
       "Two smb3 server fixes for access denied problem on share path checks"
      
      * tag '6.11-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
        ksmbd: override fsids for smb2_query_info()
        ksmbd: override fsids for share path check
    • KVM: selftests: Add a testcase to verify x2APIC is fully readonly · 238d3d63
      Michal Luczaj authored
      Add a test to verify that userspace can't change a vCPU's x2APIC ID by
      abusing KVM_SET_LAPIC.  KVM models the x2APIC ID (and x2APIC LDR) as
      readonly, and silently ignores userspace attempts to change the x2APIC ID
      for backwards compatibility.
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      [sean: write changelog, add to existing test]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240802202941.344889-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Make x2APIC ID 100% readonly · 4b7c3f6d
      Sean Christopherson authored
      Ignore the userspace provided x2APIC ID when fixing up APIC state for
      KVM_SET_LAPIC, i.e. make the x2APIC fully readonly in KVM.  Commit
      a92e2543 ("KVM: x86: use hardware-compatible format for APIC ID
      register"), which added the fixup, didn't intend to allow userspace to
      modify the x2APIC ID.  In fact, that commit is when KVM first started
      treating the x2APIC ID as readonly, apparently to fix some race:
      
       static inline u32 kvm_apic_id(struct kvm_lapic *apic)
       {
      -       return (kvm_lapic_get_reg(apic, APIC_ID) >> 24) & 0xff;
      +       /* To avoid a race between apic_base and following APIC_ID update when
      +        * switching to x2apic_mode, the x2apic mode returns initial x2apic id.
      +        */
      +       if (apic_x2apic_mode(apic))
      +               return apic->vcpu->vcpu_id;
      +
      +       return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
       }
      
      Furthermore, KVM doesn't support delivering interrupts to vCPUs with a
      modified x2APIC ID, but KVM *does* return the modified value on a guest
      RDMSR and for KVM_GET_LAPIC.  I.e. no remotely sane setup can actually
      work with a modified x2APIC ID.
      
      Making the x2APIC ID fully readonly fixes a WARN in KVM's optimized map
      calculation, which expects the LDR to align with the x2APIC ID.
      
        WARNING: CPU: 2 PID: 958 at arch/x86/kvm/lapic.c:331 kvm_recalculate_apic_map+0x609/0xa00 [kvm]
        CPU: 2 PID: 958 Comm: recalc_apic_map Not tainted 6.4.0-rc3-vanilla+ #35
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014
        RIP: 0010:kvm_recalculate_apic_map+0x609/0xa00 [kvm]
        Call Trace:
         <TASK>
         kvm_apic_set_state+0x1cf/0x5b0 [kvm]
         kvm_arch_vcpu_ioctl+0x1806/0x2100 [kvm]
         kvm_vcpu_ioctl+0x663/0x8a0 [kvm]
         __x64_sys_ioctl+0xb8/0xf0
         do_syscall_64+0x56/0x80
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        RIP: 0033:0x7fade8b9dd6f
      
      Unfortunately, the WARN can still trigger for other CPUs than the current
      one by racing against KVM_SET_LAPIC, so remove it completely.
      Reported-by: Michal Luczaj <mhal@rbox.co>
      Closes: https://lore.kernel.org/all/814baa0c-1eaa-4503-129f-059917365e80@rbox.co
      Reported-by: Haoyu Wu <haoyuwu254@gmail.com>
      Closes: https://lore.kernel.org/all/20240126161633.62529-1-haoyuwu254@gmail.com
      Reported-by: syzbot+545f1326f405db4e1c3e@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/all/000000000000c2a6b9061cbca3c3@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240802202941.344889-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
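
As a rough illustration of "ignore the userspace provided x2APIC ID", the sketch below overwrites whatever ID userspace supplied with the vCPU's own ID and re-derives the logical destination (LDR) from it, so the two always agree. The struct and function names are hypothetical and the flow is simplified; it assumes the usual x2APIC logical ID layout (cluster in bits 31:16, one-hot position within a 16-CPU cluster in bits 15:0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the APIC state passed in by userspace. */
struct demo_lapic_state {
    uint32_t id;
    uint32_t ldr;
};

/* Derive the x2APIC logical ID from the x2APIC ID. */
uint32_t demo_x2apic_ldr(uint32_t id)
{
    return ((id >> 4) << 16) | (UINT32_C(1) << (id & 0xf));
}

/*
 * In x2APIC mode the ID is treated as read-only: the supplied value is
 * dropped and replaced by the vCPU ID, and the LDR is recomputed to match.
 * In xAPIC mode this sketch leaves the state untouched.
 */
void demo_fixup_set_lapic(struct demo_lapic_state *s, uint32_t vcpu_id,
                          bool x2apic_mode)
{
    if (!x2apic_mode)
        return;
    s->id = vcpu_id;
    s->ldr = demo_x2apic_ldr(vcpu_id);
}
```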
    • KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id()) · 15e1c3d6
      Isaku Yamahata authored
      Use this_cpu_ptr() instead of open coding the equivalent in various
      user return MSR helpers.
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Chao Gao <chao.gao@intel.com>
      Reviewed-by: Yuan Yao <yuan.yao@intel.com>
      [sean: massage changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
      Message-ID: <20240802201630.339306-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • btrfs: fix invalid mapping of extent xarray state · 6252690f
      Naohiro Aota authored
      In __extent_writepage_io(), we call btrfs_set_range_writeback() ->
      folio_start_writeback(), which clears PAGECACHE_TAG_DIRTY mark from the
      mapping xarray if the folio is not dirty. This worked fine before commit
      97713b1a ("btrfs: do not clear page dirty inside
      extent_write_locked_range()").
      
      After the commit, however, the folio is still dirty at this point, so the
      mapping DIRTY tag is not cleared anymore. Then, __extent_writepage_io()
      calls btrfs_folio_clear_dirty() to clear the folio's dirty flag. That
      results in the page being unlocked with a "strange" state. The page is not
      PageDirty, but the mapping tag is set as PAGECACHE_TAG_DIRTY.
      
      This strange state appears to cause a hang with the call trace below when
      running fstests generic/091 on a null_blk device. It is waiting for a folio
      lock.
      
      While I don't have an exact relation between this hang and the strange
      state, fixing the state also fixes the hang. And, that state is worth
      fixing anyway.
      
      This commit reorders btrfs_folio_clear_dirty() and
      btrfs_set_range_writeback() in __extent_writepage_io(), so that the
      PAGECACHE_TAG_DIRTY tag is properly removed from the xarray.
      
        [464.274] task:fsx             state:D stack:0     pid:3034  tgid:3034  ppid:2853   flags:0x00004002
        [464.286] Call Trace:
        [464.291]  <TASK>
        [464.295]  __schedule+0x10ed/0x6260
        [464.301]  ? __pfx___blk_flush_plug+0x10/0x10
        [464.308]  ? __submit_bio+0x37c/0x450
        [464.314]  ? __pfx___schedule+0x10/0x10
        [464.321]  ? lock_release+0x567/0x790
        [464.327]  ? __pfx_lock_acquire+0x10/0x10
        [464.334]  ? __pfx_lock_release+0x10/0x10
        [464.340]  ? __pfx_lock_acquire+0x10/0x10
        [464.347]  ? __pfx_lock_release+0x10/0x10
        [464.353]  ? do_raw_spin_lock+0x12e/0x270
        [464.360]  schedule+0xdf/0x3b0
        [464.365]  io_schedule+0x8f/0xf0
        [464.371]  folio_wait_bit_common+0x2ca/0x6d0
        [464.378]  ? folio_wait_bit_common+0x1cc/0x6d0
        [464.385]  ? __pfx_folio_wait_bit_common+0x10/0x10
        [464.392]  ? __pfx_filemap_get_folios_tag+0x10/0x10
        [464.400]  ? __pfx_wake_page_function+0x10/0x10
        [464.407]  ? __pfx___might_resched+0x10/0x10
        [464.414]  ? do_raw_spin_unlock+0x58/0x1f0
        [464.420]  extent_write_cache_pages+0xe49/0x1620 [btrfs]
        [464.428]  ? lock_acquire+0x435/0x500
        [464.435]  ? __pfx_extent_write_cache_pages+0x10/0x10 [btrfs]
        [464.443]  ? btrfs_do_write_iter+0x493/0x640 [btrfs]
        [464.451]  ? orc_find.part.0+0x1d4/0x380
        [464.457]  ? __pfx_lock_release+0x10/0x10
        [464.464]  ? __pfx_lock_release+0x10/0x10
        [464.471]  ? btrfs_do_write_iter+0x493/0x640 [btrfs]
        [464.478]  btrfs_writepages+0x1cc/0x460 [btrfs]
        [464.485]  ? __pfx_btrfs_writepages+0x10/0x10 [btrfs]
        [464.493]  ? is_bpf_text_address+0x6e/0x100
        [464.500]  ? kernel_text_address+0x145/0x160
        [464.507]  ? unwind_get_return_address+0x5e/0xa0
        [464.514]  ? arch_stack_walk+0xac/0x100
        [464.521]  do_writepages+0x176/0x780
        [464.527]  ? lock_release+0x567/0x790
        [464.533]  ? __pfx_do_writepages+0x10/0x10
        [464.540]  ? __pfx_lock_acquire+0x10/0x10
        [464.546]  ? __pfx_stack_trace_save+0x10/0x10
        [464.553]  ? do_raw_spin_lock+0x12e/0x270
        [464.560]  ? do_raw_spin_unlock+0x58/0x1f0
        [464.566]  ? _raw_spin_unlock+0x23/0x40
        [464.573]  ? wbc_attach_and_unlock_inode+0x3da/0x7d0
        [464.580]  filemap_fdatawrite_wbc+0x113/0x180
        [464.587]  ? prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
        [464.596]  __filemap_fdatawrite_range+0xaf/0xf0
        [464.603]  ? __pfx___filemap_fdatawrite_range+0x10/0x10
        [464.611]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.618]  ? kasan_quarantine_put+0xd7/0x1e0
        [464.625]  btrfs_start_ordered_extent+0x46f/0x570 [btrfs]
        [464.633]  ? __pfx_btrfs_start_ordered_extent+0x10/0x10 [btrfs]
        [464.642]  ? __clear_extent_bit+0x2c0/0x9d0 [btrfs]
        [464.650]  btrfs_lock_and_flush_ordered_range+0xc6/0x180 [btrfs]
        [464.659]  ? __pfx_btrfs_lock_and_flush_ordered_range+0x10/0x10 [btrfs]
        [464.669]  btrfs_read_folio+0x12a/0x1d0 [btrfs]
        [464.676]  ? __pfx_btrfs_read_folio+0x10/0x10 [btrfs]
        [464.684]  ? __pfx_filemap_add_folio+0x10/0x10
        [464.691]  ? __pfx___might_resched+0x10/0x10
        [464.698]  ? __filemap_get_folio+0x1c5/0x450
        [464.705]  prepare_uptodate_page+0x12e/0x4d0 [btrfs]
        [464.713]  prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
        [464.721]  ? fault_in_iov_iter_readable+0xd2/0x240
        [464.729]  btrfs_buffered_write+0x5bd/0x12f0 [btrfs]
        [464.737]  ? __pfx_btrfs_buffered_write+0x10/0x10 [btrfs]
        [464.745]  ? __pfx_lock_release+0x10/0x10
        [464.752]  ? generic_write_checks+0x275/0x400
        [464.759]  ? down_write+0x118/0x1f0
        [464.765]  ? up_write+0x19b/0x500
        [464.770]  btrfs_direct_write+0x731/0xba0 [btrfs]
        [464.778]  ? __pfx_btrfs_direct_write+0x10/0x10 [btrfs]
        [464.785]  ? __pfx___might_resched+0x10/0x10
        [464.792]  ? lock_acquire+0x435/0x500
        [464.798]  ? lock_acquire+0x435/0x500
        [464.804]  btrfs_do_write_iter+0x494/0x640 [btrfs]
        [464.811]  ? __pfx_btrfs_do_write_iter+0x10/0x10 [btrfs]
        [464.819]  ? __pfx___might_resched+0x10/0x10
        [464.825]  ? rw_verify_area+0x6d/0x590
        [464.831]  vfs_write+0x5d7/0xf50
        [464.837]  ? __might_fault+0x9d/0x120
        [464.843]  ? __pfx_vfs_write+0x10/0x10
        [464.849]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
        [464.856]  ? lock_release+0x567/0x790
        [464.862]  ksys_write+0xfb/0x1d0
        [464.867]  ? __pfx_ksys_write+0x10/0x10
        [464.873]  ? _raw_spin_unlock+0x23/0x40
        [464.879]  ? btrfs_getattr+0x4af/0x670 [btrfs]
        [464.886]  ? vfs_getattr_nosec+0x79/0x340
        [464.892]  do_syscall_64+0x95/0x180
        [464.898]  ? __do_sys_newfstat+0xde/0xf0
        [464.904]  ? __pfx___do_sys_newfstat+0x10/0x10
        [464.911]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.918]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [464.925]  ? do_syscall_64+0xa1/0x180
        [464.931]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.939]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.946]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [464.953]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
        [464.960]  ? do_syscall_64+0xa1/0x180
        [464.966]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
        [464.973]  ? trace_irq_enable.constprop.0+0xce/0x110
        [464.980]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [464.987]  ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
        [464.995]  ? trace_irq_enable.constprop.0+0xce/0x110
        [465.002]  ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
        [465.010]  ? do_syscall_64+0xa1/0x180
        [465.016]  ? lock_release+0x567/0x790
        [465.022]  ? __pfx_lock_acquire+0x10/0x10
        [465.028]  ? __pfx_lock_release+0x10/0x10
        [465.034]  ? trace_irq_enable.constprop.0+0xce/0x110
        [465.042]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [465.049]  ? do_syscall_64+0xa1/0x180
        [465.055]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [465.062]  ? do_syscall_64+0xa1/0x180
        [465.068]  ? syscall_exit_to_user_mode+0xac/0x2a0
        [465.075]  ? do_syscall_64+0xa1/0x180
        [465.081]  ? clear_bhb_loop+0x25/0x80
        [465.087]  ? clear_bhb_loop+0x25/0x80
        [465.093]  ? clear_bhb_loop+0x25/0x80
        [465.099]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
        [465.106] RIP: 0033:0x7f093b8ee784
        [465.111] RSP: 002b:00007ffc29d31b28 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
        [465.122] RAX: ffffffffffffffda RBX: 0000000000006000 RCX: 00007f093b8ee784
        [465.131] RDX: 000000000001de00 RSI: 00007f093b6ed200 RDI: 0000000000000003
        [465.141] RBP: 000000000001de00 R08: 0000000000006000 R09: 0000000000000000
        [465.150] R10: 0000000000023e00 R11: 0000000000000202 R12: 0000000000006000
        [465.160] R13: 0000000000023e00 R14: 0000000000023e00 R15: 0000000000000001
        [465.170]  </TASK>
        [465.174] INFO: lockdep is turned off.
      Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Fixes: 97713b1a ("btrfs: do not clear page dirty inside extent_write_locked_range()")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6252690f
    • Yue Haibing's avatar
      KVM: x86: hyper-v: Remove unused inline function kvm_hv_free_pa_page() · b098495e
      Yue Haibing authored
      There has been no caller in the tree since its introduction in commit
      b4f69df0 ("KVM: x86: Make Hyper-V emulation optional").
      Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
      Message-ID: <20240803113233.128185-1-yuehaibing@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b098495e
    • Phillip Lougher's avatar
      Squashfs: sanity check symbolic link size · 810ee43d
      Phillip Lougher authored
      Syzkaller reports a "KMSAN: uninit-value in pick_link" bug.
      
      This is caused by an uninitialised page, which is ultimately caused
      by a corrupted symbolic link size read from disk.
      
      The reason why the corrupted symlink size causes an uninitialised
      page is due to the following sequence of events:
      
      1. squashfs_read_inode() is called to read the symbolic
         link from disk.  This assigns the corrupted value
         3875536935 to inode->i_size.
      
      2. Later squashfs_symlink_read_folio() is called, which assigns
         this corrupted value to the length variable, which being a
         signed int, overflows producing a negative number.
      
      3. The following loop that fills in the page contents checks that
         the number of copied bytes is less than length, which being negative
         means the loop is skipped, producing an uninitialised page.
      
      This patch adds a sanity check which checks that the symbolic
      link size is not larger than expected.
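The sequence above can be reproduced in a small userspace sketch. This is not the squashfs code: the function names, the 4 KiB page size and the exact limit used by the check are assumptions made for illustration, and the narrowing conversion to int is implementation-defined in C (it wraps on common ABIs).

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/*
 * Hypothetical model of the fill loop: returns how many bytes would be
 * written into the page for a given on-disk symlink size.
 */
int sketch_bytes_filled(uint64_t i_size)
{
	/*
	 * Narrowing to int is implementation-defined; on common ABIs it
	 * wraps, so 3875536935 turns into a negative number.
	 */
	int length = (int)i_size;
	int bytes = 0;

	while (bytes < length) {
		int chunk = length - bytes;

		if (chunk > SKETCH_PAGE_SIZE)
			chunk = SKETCH_PAGE_SIZE;
		bytes += chunk; /* stands in for the real copy */
	}
	return bytes; /* 0 means the page was never written */
}

/* Hypothetical sanity check applied when the inode is read from disk. */
bool sketch_symlink_size_ok(uint64_t i_size)
{
	return i_size <= SKETCH_PAGE_SIZE;
}
```

With the corrupted size the loop body never runs, which is why rejecting the size up front is enough to avoid handing out an uninitialised page.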
      
      --
      Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
      Link: https://lore.kernel.org/r/20240811232821.13903-1-phillip@squashfs.org.uk
      Reported-by: Lizhi Xu <lizhi.xu@windriver.com>
      Reported-by: syzbot+24ac24ff58dc5b0d26b9@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/all/000000000000a90e8c061e86a76b@google.com/
      V2: fix spelling mistake.
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      810ee43d
    • Dominique Martinet's avatar
      9p: Fix DIO read through netfs · e3786b29
      Dominique Martinet authored
      If a program is watching a file on a 9p mount, it won't see any change in
      size if the file being exported by the server is changed directly in the
      source filesystem, presumably because 9p doesn't have change notifications,
      and because netfs skips the reads if the file is empty.
      
      Fix this by attempting to read the full size specified when a DIO read is
      requested (such as when 9p is operating in unbuffered mode) and dealing
      with a short read if the EOF was less than the expected read.
      
      To make this work, filesystems using netfslib must not set
      NETFS_SREQ_CLEAR_TAIL if performing a DIO read where that read hit the EOF.
      I don't want to mandatorily clear this flag in netfslib for DIO because,
      say, ceph might make a read from an object that is not completely filled,
      but does not reside at the end of file - and so we need to clear the
      excess.
      
      This can be tested by watching an empty file over 9p within a VM (such as
      in the ktest framework):
      
              while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < /host/tmp/foo
      
      then writing something into the empty file.  The watcher should immediately
      display the file content and break out of the loop.  Without this fix, it
      remains in the loop indefinitely.
      
      Fixes: 80105ed2 ("9p: Use netfslib read/write_iter")
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218916
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk
      cc: Eric Van Hensbergen <ericvh@kernel.org>
      cc: Latchesar Ionkov <lucho@ionkov.net>
      cc: Christian Schoenebeck <linux_oss@crudebyte.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: Ilya Dryomov <idryomov@gmail.com>
      cc: Steve French <sfrench@samba.org>
      cc: Paulo Alcantara <pc@manguebit.com>
      cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      cc: v9fs@lists.linux.dev
      cc: linux-afs@lists.infradead.org
      cc: ceph-devel@vger.kernel.org
      cc: linux-cifs@vger.kernel.org
      cc: linux-nfs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      e3786b29
    • Zhihao Cheng's avatar
      vfs: Don't evict inode under the inode lru traversing context · 2a062983
      Zhihao Cheng authored
      The inode reclaiming process (see prune_icache_sb()) first collects all
      reclaimable inodes and marks them with the I_FREEING flag. From that
      point on, other processes will be stuck if they try to get these inodes
      (see find_inode_fast()). The reclaiming process then destroys the
      inodes with dispose_list(). Some filesystems (e.g. ext4 with the
      ea_inode feature, ubifs with xattrs) may do an inode lookup in the inode
      eviction callback. If that lookup happens in the inode LRU traversal
      context, deadlocks may occur.
      
      Case 1: In ext4_evict_inode(), an ea inode lookup can happen if the
              ea_inode feature is enabled. The lookup gets stuck in the
              evicting context like this:
      
       1. File A has inode i_reg and an ea inode i_ea
       2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
       3. Then, the following two processes run like this:
      
          PA                              PB
       echo 2 > /proc/sys/vm/drop_caches
        shrink_slab
         prune_dcache_sb
         // i_reg is added into lru, lru->i_ea->i_reg
         prune_icache_sb
          list_lru_walk_one
           inode_lru_isolate
            i_ea->i_state |= I_FREEING // set inode state
           inode_lru_isolate
            __iget(i_reg)
            spin_unlock(&i_reg->i_lock)
            spin_unlock(lru_lock)
                                           rm file A
                                            i_reg->nlink = 0
            iput(i_reg) // i_reg->nlink is 0, do evict
             ext4_evict_inode
              ext4_xattr_delete_inode
               ext4_xattr_inode_dec_ref_all
                ext4_xattr_inode_iget
                 ext4_iget(i_ea->i_ino)
                  iget_locked
                   find_inode_fast
                    __wait_on_freeing_inode(i_ea) ----→ AA deadlock
          dispose_list // cannot be executed by prune_icache_sb
           wake_up_bit(&i_ea->i_state)
      
      Case 2: In ubifs_jnl_write_inode(), which writes a deleted inode, the
              file deleting process holds BASEHD's wbuf->io_mutex while
              getting the xattr inode. This can race with the inode
              reclaiming process, which may try to lock BASEHD's
              wbuf->io_mutex in the inode eviction function, leading to an
              ABBA deadlock as follows:
      
       1. File A has inode ia and an xattr (with inode ixa), regular file B has
          inode ib and an xattr.
       2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
       3. Then, the following three processes run like this:
      
              PA                PB                        PC
                      echo 2 > /proc/sys/vm/drop_caches
                       shrink_slab
                        prune_dcache_sb
                        // ib and ia are added into lru, lru->ixa->ib->ia
                        prune_icache_sb
                         list_lru_walk_one
                          inode_lru_isolate
                           ixa->i_state |= I_FREEING // set inode state
                          inode_lru_isolate
                           __iget(ib)
                           spin_unlock(&ib->i_lock)
                           spin_unlock(lru_lock)
                                                         rm file B
                                                          ib->nlink = 0
       rm file A
        iput(ia)
         ubifs_evict_inode(ia)
          ubifs_jnl_delete_inode(ia)
           ubifs_jnl_write_inode(ia)
            make_reservation(BASEHD) // Lock wbuf->io_mutex
            ubifs_iget(ixa->i_ino)
             iget_locked
              find_inode_fast
               __wait_on_freeing_inode(ixa)
                |          iput(ib) // ib->nlink is 0, do evict
                |           ubifs_evict_inode
                |            ubifs_jnl_delete_inode(ib)
                ↓             ubifs_jnl_write_inode
           ABBA deadlock ←-----make_reservation(BASEHD)
                         dispose_list // cannot be executed by prune_icache_sb
                          wake_up_bit(&ixa->i_state)
      
      Fix the possible deadlock by using a new inode state flag, I_LRU_ISOLATING,
      to pin the inode in memory while inode_lru_isolate() reclaims its pages,
      instead of taking an ordinary inode reference. This way inode deletion
      cannot be triggered from inode_lru_isolate(), which avoids the deadlock.
      evict() is made to wait for I_LRU_ISOLATING to be cleared before
      proceeding with inode cleanup.
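The difference between pinning with a reference and pinning with a state flag can be modelled in a single-threaded sketch. Everything here is hypothetical (the struct, the helper names and the bit value are not the kernel's), and the real code needs locking and a wait queue that this model leaves out.

```c
#include <stdbool.h>

#define SK_I_LRU_ISOLATING (1u << 0) /* sketch-local bit value */

struct sk_inode {
	unsigned int state;
	int refcount;
	int nlink;
	bool evicted;
};

/* Dropping the last reference to an unlinked inode triggers eviction. */
void sk_iput(struct sk_inode *inode)
{
	if (--inode->refcount == 0 && inode->nlink == 0)
		inode->evicted = true;
}

/* Old scheme: pin with a reference; the final sk_iput() may evict here. */
void sk_isolate_with_ref(struct sk_inode *inode, bool unlinked_meanwhile)
{
	inode->refcount++;
	if (unlinked_meanwhile)
		inode->nlink = 0; /* e.g. "rm file A" racing with reclaim */
	sk_iput(inode);
}

/* New scheme: pin with a state flag; no reference is taken or dropped. */
void sk_isolate_with_flag(struct sk_inode *inode, bool unlinked_meanwhile)
{
	inode->state |= SK_I_LRU_ISOLATING;
	if (unlinked_meanwhile)
		inode->nlink = 0;
	inode->state &= ~SK_I_LRU_ISOLATING;
}

/* An evicting task would wait until this returns true. */
bool sk_evict_may_proceed(const struct sk_inode *inode)
{
	return !(inode->state & SK_I_LRU_ISOLATING);
}
```

In the reference-based variant the eviction runs inside the LRU walk, which is the context where the lookups in both cases above get stuck. The flag-based variant never drops a reference there, so eviction is left to whoever drops the last real reference.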
      
      Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022
      Fixes: e50e5129 ("ext4: xattr-in-inode support")
      Fixes: 7959cf3a ("ubifs: journal: Handle xattrs like files")
      Cc: stable@vger.kernel.org
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com
      Reviewed-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jan Kara <jack@suse.cz>
      Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      2a062983
    • Filipe Manana's avatar
      btrfs: send: allow cloning non-aligned extent if it ends at i_size · 46a6e10a
      Filipe Manana authored
      If we find that an extent is shared but its end offset is not sector
      size aligned, then we don't clone it and issue write operations instead.
      This is because the reflink (remap_file_range) operation does not allow
      cloning unaligned ranges, except if the end offset of the range matches
      the i_size of the source and destination files (and the start offset is
      sector size aligned).
      
      While this is not incorrect because send can only guarantee that a file
      has the same data in the source and destination snapshots, it's not
      optimal and generates confusion and surprising behaviour for users.
      
      For example, running this test:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        # Use a file size not aligned to any possible sector size.
        file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes
        dd if=/dev/random of=$MNT/foo bs=$file_size count=1
        cp --reflink=always $MNT/foo $MNT/bar
      
        btrfs subvolume snapshot -r $MNT/ $MNT/snap
        rm -f /tmp/send-test
        btrfs send -f /tmp/send-test $MNT/snap
      
        umount $MNT
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        btrfs receive -vv -f /tmp/send-test $MNT
      
        xfs_io -r -c "fiemap -v" $MNT/snap/bar
      
        umount $MNT
      
      Gives the following result:
      
        (...)
        mkfile o258-7-0
        rename o258-7-0 -> bar
        write bar - offset=0 length=49152
        write bar - offset=49152 length=49152
        write bar - offset=98304 length=49152
        write bar - offset=147456 length=49152
        write bar - offset=196608 length=49152
        write bar - offset=245760 length=49152
        write bar - offset=294912 length=49152
        write bar - offset=344064 length=49152
        write bar - offset=393216 length=49152
        write bar - offset=442368 length=49152
        write bar - offset=491520 length=49152
        write bar - offset=540672 length=49152
        write bar - offset=589824 length=49152
        write bar - offset=638976 length=49152
        write bar - offset=688128 length=49152
        write bar - offset=737280 length=49152
        write bar - offset=786432 length=49152
        write bar - offset=835584 length=49152
        write bar - offset=884736 length=49152
        write bar - offset=933888 length=49152
        write bar - offset=983040 length=49152
        write bar - offset=1032192 length=16389
        chown bar - uid=0, gid=0
        chmod bar - mode=0644
        utimes bar
        utimes
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7
        /mnt/sdi/snap/bar:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..2055]:       26624..28679      2056   0x1
      
      There's no clone operation to clone extents from the file foo into file
      bar and fiemap confirms there's no shared flag (0x2000).
      
      So update send_write_or_clone() so that it proceeds with cloning if the
      source and destination ranges end at the i_size of the respective files.
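The alignment rule described above can be written down as a small predicate. This is a sketch, not the body of send_write_or_clone(): the helper names and parameters are hypothetical, and the sector size is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: is x a multiple of the power-of-two sector size? */
bool sk_aligned(uint64_t x, uint32_t sectorsize)
{
	return (x & (sectorsize - 1)) == 0;
}

/*
 * Hypothetical predicate: may the range be cloned rather than written?
 * Start offsets must be sector aligned. The end must be aligned too,
 * unless both ranges end exactly at the i_size of their files.
 */
bool sk_can_clone(uint64_t src_off, uint64_t dst_off, uint64_t len,
		  uint64_t src_i_size, uint64_t dst_i_size,
		  uint32_t sectorsize)
{
	uint64_t src_end = src_off + len;
	uint64_t dst_end = dst_off + len;

	if (!sk_aligned(src_off, sectorsize) || !sk_aligned(dst_off, sectorsize))
		return false;
	if (sk_aligned(src_end, sectorsize) && sk_aligned(dst_end, sectorsize))
		return true;
	return src_end == src_i_size && dst_end == dst_i_size;
}
```

For the test file above (1048581 bytes, assuming a 4096 byte sector size), a range starting at offset 0 with length 1048581 qualifies because it ends at i_size in both files, while an unaligned range that stops short of i_size does not.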
      
      After this change, the result of the test is:
      
        (...)
        mkfile o258-7-0
        rename o258-7-0 -> bar
        clone bar - source=foo source offset=0 offset=0 length=1048581
        chown bar - uid=0, gid=0
        chmod bar - mode=0644
        utimes bar
        utimes
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7
        /mnt/sdi/snap/bar:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..2055]:       26624..28679      2056 0x2001
      
      A test case for fstests will also follow up soon.
      
      Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      46a6e10a
    • Filipe Manana's avatar
      btrfs: only run the extent map shrinker from kswapd tasks · ae1e766f
      Filipe Manana authored
      Currently the extent map shrinker can be run by any task when attempting
      to allocate memory and there's enough memory pressure to trigger it.
      
      To avoid too much latency we stop iterating over extent maps and removing
      them once the task needs to reschedule. This logic was introduced in commit
      b3ebb9b7 ("btrfs: stop extent map shrinker if reschedule is needed").
      
      While that solved high latency problems for some use cases, it's still
      not enough: when too many tasks enter the extent map shrinker code,
      either due to memory allocations or because they are a kswapd task,
      we end up with a very high level of contention on some spin locks,
      namely:
      
      1) The fs_info->fs_roots_radix_lock spin lock, which we need to find
         roots to iterate over their inodes;
      
      2) The spin lock of the xarray used to track open inodes for a root
         (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to
         be a red black tree and the spin lock was root->inode_lock;
      
      3) The fs_info->delayed_iput_lock spin lock since the shrinker adds
         delayed iputs (calls btrfs_add_delayed_iput()).
      
      Instead of allowing the extent map shrinker to be run by any task, make
      it run only by kswapd tasks. This still solves the problem of running
      into OOM situations due to an unbounded extent map creation, which is
      simple to trigger by direct IO writes, as described in the changelog
      of commit 956a17d9 ("btrfs: add a shrinker for extent maps"), and
      by a similar case when doing buffered IO on files with a very large
      number of holes (keeping the file open and creating many holes, whose
      extent maps are only released when the file is closed).
      Reported-by: kzd <kzd@56709.net>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121
      Reported-by: Octavia Togami <octavia.togami@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/
      Fixes: 956a17d9 ("btrfs: add a shrinker for extent maps")
      CC: stable@vger.kernel.org # 6.10+
      Tested-by: kzd <kzd@56709.net>
      Tested-by: Octavia Togami <octavia.togami@gmail.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ae1e766f
    • Qu Wenruo's avatar
      btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type · 31723c95
      Qu Wenruo authored
      [REPORT]
      There is a bug report that the kernel is rejecting a mismatching inode mode
      and its dir item:
      
        [ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with
        dir: inode mode=040700 btrfs type=2 dir type=0
      
      [CAUSE]
      It looks like the inode mode is correct, while the dir item type
      0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all.
      
      This may be caused by a memory bit flip.
      
      [ENHANCEMENT]
      Although tree-checker is not able to do any cross-leaf verification, for
      this particular case we can at least reject any dir type with
      BTRFS_FT_UNKNOWN.
      
      So here we enhance the dir type check from [0, BTRFS_FT_MAX), to
      (0, BTRFS_FT_MAX).
      Although the existing corruption cannot be fixed just by such enhanced
      checking, it should prevent the same 0x2->0x0 bitflip for dir type from
      reaching disk in the future.
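The change of range from [0, BTRFS_FT_MAX) to (0, BTRFS_FT_MAX) amounts to one extra comparison. The sketch below uses local copies of the two constants; the value 9 for the upper bound is an assumption taken from current headers, and the function names are hypothetical rather than the tree-checker's own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch-local copies of the two constants named in the text. */
#define SK_BTRFS_FT_UNKNOWN 0
#define SK_BTRFS_FT_MAX     9 /* assumption: value from current headers */

/* Old check: accepts [0, SK_BTRFS_FT_MAX), so type 0 slips through. */
bool sk_dir_type_ok_old(uint8_t dir_type)
{
	return dir_type < SK_BTRFS_FT_MAX;
}

/* New check: accepts (0, SK_BTRFS_FT_MAX), rejecting the unknown type. */
bool sk_dir_type_ok_new(uint8_t dir_type)
{
	return dir_type != SK_BTRFS_FT_UNKNOWN && dir_type < SK_BTRFS_FT_MAX;
}
```

A dir item with type 2 flipped to 0, as in the report, passes the old predicate and fails the new one.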
      Reported-by: Kota <nospam@kota.moe>
      Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31723c95
    • Josef Bacik's avatar
      btrfs: check delayed refs when we're checking if a ref exists · 42fac187
      Josef Bacik authored
      In the patch 78c52d9e ("btrfs: check for refs on snapshot delete
      resume") I added some code to handle file systems that had been
      corrupted by a bug that incorrectly skipped updating the drop progress
      key while dropping a snapshot.  This code would check to see if we had
      already deleted our reference for a child block, and skip the deletion
      if we had already.
      
      Unfortunately there is a bug, as the check would only check the on-disk
      references.  I made an incorrect assumption that blocks in an already
      deleted snapshot that was having the deletion resume on mount wouldn't
      be modified.
      
      If we have 2 pending deleted snapshots that share blocks, we can easily
      modify the rules for a block.  Take the following example
      
      subvolume a exists, and subvolume b is a snapshot of subvolume a.  They
      share references to block 1.  Block 1 will have 2 full references, one
      for subvolume a and one for subvolume b, and it belongs to subvolume a
      (btrfs_header_owner(block 1) == subvolume a).
      
      When deleting subvolume a, we will drop our full reference for block 1,
      and because we are the owner we will drop our full reference for all of
      block 1's children, convert block 1 to FULL BACKREF, and add a shared
      reference to all of block 1's children.
      
      Then we will start the snapshot deletion of subvolume b.  We look up the
      extent info for block 1, which checks delayed refs and tells us that
      FULL BACKREF is set, so sets parent to the bytenr of block 1.  However
      because this is a resumed snapshot deletion, we call into
      check_ref_exists().  Because check_ref_exists() only looks at the disk,
      it doesn't find the shared backref for the child of block 1, and thus
      returns 0 and we skip deleting the reference for the child of block 1
      and continue.  This orphans the child of block 1.
      
      The fix is to look up the delayed refs, similar to what we do in
      btrfs_lookup_extent_info().  However we only care about whether the
      reference exists or not.  If we fail to find our reference on disk, go
      look up the bytenr in the delayed refs, and if it exists look for an
      existing ref in the delayed ref head.  If that exists then we know we
      can delete the reference safely and carry on.  If it doesn't exist we
      know we have to skip over this block.
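The lookup order described above (on-disk references first, then pending delayed references) can be modelled with two plain arrays. This is only a sketch: the struct and function names are hypothetical, and the real code walks the extent tree and the delayed ref heads under the appropriate locks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of a shared backref. */
struct sk_ref {
	uint64_t bytenr;
	uint64_t parent;
};

bool sk_find_ref(const struct sk_ref *refs, size_t n,
		 uint64_t bytenr, uint64_t parent)
{
	for (size_t i = 0; i < n; i++)
		if (refs[i].bytenr == bytenr && refs[i].parent == parent)
			return true;
	return false;
}

/*
 * Model of the fixed check: a reference counts as existing if it is
 * either already on disk or still queued as a delayed ref.
 */
bool sk_check_ref_exists(const struct sk_ref *on_disk, size_t n_disk,
			 const struct sk_ref *delayed, size_t n_delayed,
			 uint64_t bytenr, uint64_t parent)
{
	if (sk_find_ref(on_disk, n_disk, bytenr, parent))
		return true;
	return sk_find_ref(delayed, n_delayed, bytenr, parent);
}
```

In the example with subvolumes a and b, the shared backref for the child of block 1 only lives in the delayed list, so a disk-only check would report it as missing and the child would be orphaned.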
      
      This bug has existed since I introduced this fix, however it requires
      having multiple deleted snapshots pending when we unmount.  We noticed
      this in production because our shutdown path stops the container on the
      system, which deletes a bunch of subvolumes, and then reboots the box.
      This gives us plenty of opportunities to hit this issue.  Looking at the
      history we've seen this occasionally in production, but we had a big
      spike recently thanks to faster machines getting jobs with multiple
      subvolumes in the job.
      
      Chris Mason wrote a reproducer which does the following
      
      mount /dev/nvme4n1 /btrfs
      btrfs subvol create /btrfs/s1
      simoop -E -f 4k -n 200000 -z /btrfs/s1
      while(true) ; do
      	btrfs subvol snap /btrfs/s1 /btrfs/s2
      	simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
      	btrfs subvol snap /btrfs/s2 /btrfs/s3
      	btrfs balance start -dusage=80 /btrfs
      	btrfs subvol del /btrfs/s2 /btrfs/s3
      	umount /btrfs
      	btrfsck /dev/nvme4n1 || exit 1
      	mount /dev/nvme4n1 /btrfs
      done
      
      On the second loop this would fail consistently; with my patch it has
      been running for hours and hasn't failed.
      
      I also used dm-log-writes to capture the state of the failure so I could
      debug the problem.  Using the existing failure case to test my patch
      validated that it fixes the problem.
      
      Fixes: 78c52d9e ("btrfs: check for refs on snapshot delete resume")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      42fac187
    • Dan Carpenter's avatar
      KVM: SVM: Fix an error code in sev_gmem_post_populate() · cd2d0060
      Dan Carpenter authored
      The copy_from_user() function returns the number of bytes which it
      was not able to copy.  Return -EFAULT instead.
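The pitfall can be shown with a userspace stand-in. Nothing below is kernel code: sk_copy_from_user() is a hypothetical function that only mimics the "returns the number of bytes not copied" convention, and the two callers are illustrative, not the body of sev_gmem_post_populate().

```c
#include <errno.h>
#include <string.h>

/*
 * Userspace stand-in: copies at most "readable" bytes and returns the
 * number of bytes that could NOT be copied (0 on full success).
 */
unsigned long sk_copy_from_user(void *dst, const void *src,
				unsigned long n, unsigned long readable)
{
	unsigned long ok = n < readable ? n : readable;

	memcpy(dst, src, ok);
	return n - ok;
}

/* Buggy pattern: a positive leftover count escapes as the return value. */
int sk_populate_buggy(void *dst, const void *src,
		      unsigned long n, unsigned long readable)
{
	return (int)sk_copy_from_user(dst, src, n, readable);
}

/* Fixed pattern: any leftover is reported as -EFAULT. */
int sk_populate_fixed(void *dst, const void *src,
		      unsigned long n, unsigned long readable)
{
	if (sk_copy_from_user(dst, src, n, readable))
		return -EFAULT;
	return 0;
}
```

Callers that test for a negative return value would treat the positive count from the buggy variant as success, which is why the leftover has to be translated into an error code.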
      
      Fixes: dee5a47c ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Message-ID: <20240612115040.2423290-4-dan.carpenter@linaro.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cd2d0060
    • Paolo Bonzini's avatar
      Merge tag 'kvm-s390-master-6.11-1' of... · 696eb24a
      Paolo Bonzini authored
      Merge tag 'kvm-s390-master-6.11-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      Fix invalid gisa designation value when gisa is not in use.
      Panic if (un)share fails to maintain security.
      696eb24a
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-fixes-6.11-1' of... · 747cfbf1
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 6.11, round #1
      
       - Use kvfree() for the kvmalloc'd nested MMUs array
      
       - Set of fixes to address warnings in W=1 builds
      
       - Make KVM depend on assembler support for ARMv8.4
      
       - Fix for vgic-debug interface for VMs without LPIs
      
       - Actually check ID_AA64MMFR3_EL1.S1PIE in get-reg-list selftest
      
       - Minor code / comment cleanups for configuring PAuth traps
      
       - Take kvm->arch.config_lock to prevent destruction / initialization
         race for a vCPU's CPUIF which may lead to a UAF
      747cfbf1
    • Dan Carpenter's avatar
      KVM: SVM: Fix uninitialized variable bug · 92b6c2f0
      Dan Carpenter authored
      If snp_lookup_rmpentry() fails then "assigned" is printed in the error
      message but it was never initialized.  Initialize it to false.
      
      Fixes: dee5a47c ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Message-ID: <20240612115040.2423290-3-dan.carpenter@linaro.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      92b6c2f0
    • Yonghong Song's avatar
      selftests/bpf: Add a test to verify previous stacksafe() fix · 662c3e2d
      Yonghong Song authored
      Add a selftest that crashes without the previous patch and runs
      successfully with it. The new test is written in a way that mimics
      the original crash case:
        main_prog
          static_prog_1
            static_prog_2
      where static_prog_1 has different paths to static_prog_2, and some
      paths have stack allocated while others do not. A stacksafe() check
      in static_prog_2() triggered the crash.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240812214852.214037-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      662c3e2d
    • Yonghong Song's avatar
      bpf: Fix a kernel verifier crash in stacksafe() · bed2eb96
      Yonghong Song authored
      Daniel Hodges reported a kernel verifier crash when playing with sched-ext.
      Further investigation shows that the crash is due to invalid memory access
      in stacksafe(). More specifically, it is the following code:
      
          if (exact != NOT_EXACT &&
              old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
              cur->stack[spi].slot_type[i % BPF_REG_SIZE])
                  return false;
      
      The index 'i' iterates over old->allocated_stack.
      If cur->allocated_stack < old->allocated_stack, an out-of-bounds
      access will happen.
      
      To fix the issue add 'i >= cur->allocated_stack' check such that if
      the condition is true, stacksafe() should fail. Otherwise,
      cur->stack[spi].slot_type[i % BPF_REG_SIZE] memory access is legal.
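A reduced model of the guarded comparison is sketched below. The struct layout and names are simplified stand-ins for the verifier's state (hypothetical, with an sk_ prefix), and only the exact-match branch is modelled. The point is that the bound is tested before cur->stack is indexed.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_BPF_REG_SIZE 8

struct sk_stack_slot {
	uint8_t slot_type[SK_BPF_REG_SIZE];
};

struct sk_func_state {
	int allocated_stack; /* bytes, a multiple of SK_BPF_REG_SIZE */
	const struct sk_stack_slot *stack;
};

/* Returns true only if every byte of old's stack has a match in cur. */
bool sk_stack_exact_match(const struct sk_func_state *old,
			  const struct sk_func_state *cur)
{
	for (int i = 0; i < old->allocated_stack; i++) {
		int spi = i / SK_BPF_REG_SIZE;

		/*
		 * The bound check comes first, so cur->stack[spi] is
		 * never read past the end of cur's allocation.
		 */
		if (i >= cur->allocated_stack ||
		    old->stack[spi].slot_type[i % SK_BPF_REG_SIZE] !=
		    cur->stack[spi].slot_type[i % SK_BPF_REG_SIZE])
			return false;
	}
	return true;
}
```

Because || short-circuits, the slot comparison is skipped whenever i is outside cur's stack, and the states are simply reported as not matching.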
      
      Fixes: 2793a8b0 ("bpf: exact states comparison for iterator convergence checks")
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Reported-by: Daniel Hodges <hodgesd@meta.com>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240812214847.213612-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bed2eb96
  4. 12 Aug, 2024 10 commits
    • Leon Hwang's avatar
      bpf: Fix updating attached freplace prog in prog_array map · fdad456c
      Leon Hwang authored
      The commit f7866c35 ("bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT")
      fixed a NULL pointer dereference panic, but didn't fix the failure
      to update an attached freplace prog in a prog_array map.
      
      Since commit 1c123c56 ("bpf: Resolve fext program type when checking map compatibility"),
      freplace prog and its target prog are able to tail call each other.
      
      And the commit 3aac1ead ("bpf: Move prog->aux->linked_prog and trampoline into bpf_link on attach")
      sets prog->aux->dst_prog as NULL after attaching freplace prog to its
      target prog.
      
      After loading freplace the prog_array's owner type is BPF_PROG_TYPE_SCHED_CLS.
      Then, after attaching freplace its prog->aux->dst_prog is NULL.
      Then, while updating freplace in prog_array the bpf_prog_map_compatible()
      incorrectly returns false because resolve_prog_type() returns
      BPF_PROG_TYPE_EXT instead of BPF_PROG_TYPE_SCHED_CLS.
      After this patch the resolve_prog_type() returns BPF_PROG_TYPE_SCHED_CLS
      and update to prog_array can succeed.
      
      Fixes: f7866c35 ("bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT")
      Cc: Toke Høiland-Jørgensen <toke@redhat.com>
      Cc: Martin KaFai Lau <martin.lau@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
      Link: https://lore.kernel.org/r/20240728114612.48486-2-leon.hwang@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fdad456c
    • David Howells's avatar
      netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags · 7b589a9b
      David Howells authored
      The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used
      correctly.  The problem is that we try to set them up in the request
      initialisation, but the cache may still be in the process of setting up,
      and so the state may not be correct.  Further, we secondarily sample the
      cache state and make contradictory decisions later.
      
      The issue arises because we set up the cache resources, which allows the
      cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which
      triggers cache writing even if we didn't set the flags when allocating.
      
      Fix this in the following way:
      
       (1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in
           ->init_request() rather than trying to juggle that in
           netfs_alloc_request().
      
       (2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is
           to be done, then PG_private_2 is to be used rather than only setting
           it if we decide to cache and then having netfs_rreq_unlock_folios()
           set the non-PG_private_2 writeback-to-cache if it wasn't set.
      
       (3) Split netfs_rreq_unlock_folios() into two functions, one of which
           contains the deprecated code for using PG_private_2 to avoid
           accidentally doing the writeback path - and always use it if
           USE_PGPRIV2 is set.
      
       (4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always
           wait for PG_private_2.  This function is deprecated and only used by
           ceph anyway, and so label it so.
      
       (5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use
           fscache_operation_valid() on the cache_resources instead.  This has
           the advantage of picking up the result of netfs_begin_cache_read() and
           fscache_begin_write_operation() - which are called after the object is
           initialised and will wait for the cache to come to a usable state.
      
      Just reverting ae678317[1] isn't a sufficient fix, so this needs to be
      applied on top of that.  Without this as well, things like:
      
       rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: {
      
      and:
      
       WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386
      
      may happen, along with some UAFs due to PG_private_2 not getting used to
      wait on writeback completion.
      
      Fixes: 2ff1e975 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
      Reported-by: Max Kellermann <max.kellermann@ionos.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Ilya Dryomov <idryomov@gmail.com>
      cc: Xiubo Li <xiubli@redhat.com>
      cc: Hristo Venev <hristo@venev.name>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: Matthew Wilcox <willy@infradead.org>
      cc: ceph-devel@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      cc: linux-mm@kvack.org
      Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1]
      Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      7b589a9b
    • David Howells's avatar
      netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag" · 8e5ced78
      David Howells authored
      This reverts commit ae678317.
      
      Revert the patch that removes the deprecated use of PG_private_2 in
      netfslib for the moment as Ceph is actually still using this to track
      data copied to the cache.
      
      Fixes: ae678317 ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag")
      Reported-by: Max Kellermann <max.kellermann@ionos.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Ilya Dryomov <idryomov@gmail.com>
      cc: Xiubo Li <xiubli@redhat.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: Matthew Wilcox <willy@infradead.org>
      cc: ceph-devel@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      cc: linux-mm@kvack.org
      https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      8e5ced78
    • Mathias Krause's avatar
      file: fix typo in take_fd() comment · 86509e38
      Mathias Krause authored
      The explanatory comment above take_fd() contains a typo, fix that to not
      confuse readers.
      Signed-off-by: Mathias Krause <minipli@grsecurity.net>
      Link: https://lore.kernel.org/r/20240809135035.748109-1-minipli@grsecurity.net
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      86509e38
    • Christian Brauner's avatar
      pidfd: prevent creation of pidfds for kthreads · 3b5bbe79
      Christian Brauner authored
      It's currently possible to create pidfds for kthreads, but it is unclear
      what that is supposed to mean. Until we have use-cases for it and have
      figured out what behavior we want, block the creation of pidfds for
      kthreads.
      
      Link: https://lore.kernel.org/r/20240731-gleis-mehreinnahmen-6bbadd128383@brauner
      Fixes: 32fcb426 ("pid: add pidfd_open()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      3b5bbe79
    • Lukas Bulwahn's avatar
      netfs: clean up after renaming FSCACHE_DEBUG config · 889ced4c
      Lukas Bulwahn authored
      Commit 6b8e61472529 ("netfs: Rename CONFIG_FSCACHE_DEBUG to
      CONFIG_NETFS_DEBUG") renames the config, but introduces two issues: First,
      NETFS_DEBUG mistakenly depends on the non-existent config NETFS, whereas
      the actual intended config is called NETFS_SUPPORT. Second, the config
      renaming fails to adjust the documentation of the functionality of this
      config.
      
      Clean up those two points.
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
      Link: https://lore.kernel.org/r/20240731073902.69262-1-lukas.bulwahn@redhat.com
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      889ced4c
    • yangerkun's avatar
      libfs: fix infinite directory reads for offset dir · 64a7ce76
      yangerkun authored
      After we switched tmpfs dir operations from simple_dir_operations to
      simple_offset_dir_operations, every rename adds the new dentry to the
      dest dir's maple tree (&SHMEM_I(inode)->dir_offsets->mt) with a free
      key starting at octx->next_offset, and then sets next_offset to
      free key + 1. Combined with renames happening at the same time, this
      can make readdir loop forever, which fails generic/736 in xfstests
      (details below).
      
      1. create 5000 files (1 2 3...) under one dir
      2. call readdir (man 3 readdir) once, and get one entry
      3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry)
      4. loop over steps 2~3 until readdir returns nothing or we have looped
         too many times (tmpfs fails the test on the second condition)
      
      We use the same logic as commit 9b378f6a ("btrfs: fix infinite
      directory reads") to fix it: record last_index when the dir is opened,
      and do not emit entries whose index >= last_index. The file->private_data
      now used by offset dirs can be used directly for this, and we also update
      last_index when we llseek the dir file.
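
      The effect of the last_index bound can be reproduced with a small
      user-space simulation. This is only a sketch under simplifying
      assumptions: the demo_* names are hypothetical, the directory is a
      flat array rather than a maple tree, and the two renames of one entry
      are modelled as that entry receiving a fresh offset key twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX 64

struct demo_dir {
	long offsets[DEMO_MAX]; /* offset key currently held by each entry */
	size_t count;
	long next_offset;       /* next free key handed out on create/rename */
};

struct demo_file {
	long pos;        /* readdir cursor */
	long last_index; /* snapshot of next_offset taken at open time */
};

static void demo_dir_init(struct demo_dir *d, size_t n)
{
	assert(n <= DEMO_MAX);
	d->count = n;
	d->next_offset = 2; /* arbitrary first key for this model */
	for (size_t i = 0; i < n; i++)
		d->offsets[i] = d->next_offset++;
}

static void demo_open(const struct demo_dir *d, struct demo_file *f)
{
	f->pos = 0;
	f->last_index = d->next_offset;
}

/* A rename gives the entry a new key at the end of the key space. */
static void demo_rename(struct demo_dir *d, size_t idx)
{
	d->offsets[idx] = d->next_offset++;
}

/* Return the entry with the smallest key >= pos, or -1 when none is left.
 * With bounded == true, keys >= last_index are never emitted. */
static long demo_readdir(const struct demo_dir *d, struct demo_file *f,
			 bool bounded)
{
	long best = -1;

	for (size_t i = 0; i < d->count; i++) {
		long off = d->offsets[i];

		if (off < f->pos || (bounded && off >= f->last_index))
			continue;
		if (best < 0 || off < d->offsets[best])
			best = (long)i;
	}
	if (best >= 0)
		f->pos = d->offsets[best] + 1;
	return best;
}

/* Read one entry, rename it away and back, repeat; stop after cap rounds. */
static int demo_loop(size_t nfiles, bool bounded, int cap)
{
	struct demo_dir d;
	struct demo_file f;
	int rounds = 0;

	demo_dir_init(&d, nfiles);
	demo_open(&d, &f);
	while (rounds < cap) {
		long idx = demo_readdir(&d, &f, bounded);

		if (idx < 0)
			break;
		demo_rename(&d, (size_t)idx); /* entry -> "TEMPFILE" */
		demo_rename(&d, (size_t)idx); /* "TEMPFILE" -> entry */
		rounds++;
	}
	return rounds;
}
```

      In this model, demo_loop(5, false, 1000) only stops at the cap, whereas
      demo_loop(5, true, 1000) stops after each of the five entries has been
      returned once.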
      
      Fixes: a2e45955 ("shmem: stable directory offsets")
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Link: https://lore.kernel.org/r/20240731043835.1828697-1-yangerkun@huawei.com
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      [brauner: only update last_index after seek when offset is zero like Jan suggested]
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      64a7ce76
    • Christian Brauner's avatar
      nsfs: fix ioctl declaration · 42b0f8da
      Christian Brauner authored
      The kernel is writing an object of type __u64, so the ioctl has to be
      defined to _IOR(NSIO, 0x5, __u64) instead of _IO(NSIO, 0x5).
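
      The difference between the two macros is visible in the encoded
      command number itself. The sketch below uses the standard helpers from
      <linux/ioctl.h>; the DEMO_* names are hypothetical, and the value 0xb7
      for the NSIO magic is assumed to mirror <linux/nsfs.h> rather than
      being taken from this commit.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Assumption: 0xb7 mirrors the NSIO magic; redefined locally for the demo. */
#define DEMO_NSIO 0xb7

/* Old style: no direction and no payload size encoded in the number. */
#define DEMO_CMD_OLD _IO(DEMO_NSIO, 0x5)
/* New style: the kernel writes a __u64 that user space reads back. */
#define DEMO_CMD_NEW _IOR(DEMO_NSIO, 0x5, __u64)

/* Size of the payload encoded in an ioctl command number. */
static unsigned int demo_payload_size(unsigned int cmd)
{
	return _IOC_SIZE(cmd);
}

/* True if the command declares that the kernel copies data to user space. */
static int demo_copies_to_user(unsigned int cmd)
{
	return (_IOC_DIR(cmd) & _IOC_READ) != 0;
}

/* Caller-side guard: abort unless cmd describes a __u64 read. */
static void demo_require_u64_read(unsigned int cmd)
{
	assert(demo_copies_to_user(cmd));
	assert(demo_payload_size(cmd) == sizeof(__u64));
}
```

      Tools that decode ioctl numbers can only learn the direction and
      payload size from the _IOR() form, since the _IO() form encodes
      neither.
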
      Reported-by: Dmitry V. Levin <ldv@strace.io>
      Link: https://lore.kernel.org/r/20240730164554.GA18486@altlinux.org
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      42b0f8da
    • Max Kellermann's avatar
      fs/netfs/fscache_cookie: add missing "n_accesses" check · f71aa063
      Max Kellermann authored
      This fixes a NULL pointer dereference bug due to a data race which
      looks like this:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000008
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 33 PID: 16573 Comm: kworker/u97:799 Not tainted 6.8.7-cm4all1-hp+ #43
        Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018
        Workqueue: events_unbound netfs_rreq_write_to_cache_work
        RIP: 0010:cachefiles_prepare_write+0x30/0xa0
        Code: 57 41 56 45 89 ce 41 55 49 89 cd 41 54 49 89 d4 55 53 48 89 fb 48 83 ec 08 48 8b 47 08 48 83 7f 10 00 48 89 34 24 48 8b 68 20 <48> 8b 45 08 4c 8b 38 74 45 49 8b 7f 50 e8 4e a9 b0 ff 48 8b 73 10
        RSP: 0018:ffffb4e78113bde0 EFLAGS: 00010286
        RAX: ffff976126be6d10 RBX: ffff97615cdb8438 RCX: 0000000000020000
        RDX: ffff97605e6c4c68 RSI: ffff97605e6c4c60 RDI: ffff97615cdb8438
        RBP: 0000000000000000 R08: 0000000000278333 R09: 0000000000000001
        R10: ffff97605e6c4600 R11: 0000000000000001 R12: ffff97605e6c4c68
        R13: 0000000000020000 R14: 0000000000000001 R15: ffff976064fe2c00
        FS:  0000000000000000(0000) GS:ffff9776dfd40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000008 CR3: 000000005942c002 CR4: 00000000001706f0
        Call Trace:
         <TASK>
         ? __die+0x1f/0x70
         ? page_fault_oops+0x15d/0x440
         ? search_module_extables+0xe/0x40
         ? fixup_exception+0x22/0x2f0
         ? exc_page_fault+0x5f/0x100
         ? asm_exc_page_fault+0x22/0x30
         ? cachefiles_prepare_write+0x30/0xa0
         netfs_rreq_write_to_cache_work+0x135/0x2e0
         process_one_work+0x137/0x2c0
         worker_thread+0x2e9/0x400
         ? __pfx_worker_thread+0x10/0x10
         kthread+0xcc/0x100
         ? __pfx_kthread+0x10/0x10
         ret_from_fork+0x30/0x50
         ? __pfx_kthread+0x10/0x10
         ret_from_fork_asm+0x1b/0x30
         </TASK>
        Modules linked in:
        CR2: 0000000000000008
        ---[ end trace 0000000000000000 ]---
      
      This happened because fscache_cookie_state_machine() was slow and was
      still running while another process invoked fscache_unuse_cookie();
      this led to a fscache_cookie_lru_do_one() call, setting the
      FSCACHE_COOKIE_DO_LRU_DISCARD flag, which was picked up by
      fscache_cookie_state_machine(), withdrawing the cookie via
      cachefiles_withdraw_cookie(), clearing cookie->cache_priv.
      
      At the same time, yet another process invoked
      cachefiles_prepare_write(), which found a NULL pointer in this code
      line:
      
        struct cachefiles_object *object = cachefiles_cres_object(cres);
      
      The next line crashes, obviously:
      
        struct cachefiles_cache *cache = object->volume->cache;
      
      During cachefiles_prepare_write(), the "n_accesses" counter is
      non-zero (via fscache_begin_operation()).  The cookie must not be
      withdrawn until it drops to zero.
      
      The counter is checked by fscache_cookie_state_machine() before
      switching to FSCACHE_COOKIE_STATE_RELINQUISHING and
      FSCACHE_COOKIE_STATE_WITHDRAWING (in "case
      FSCACHE_COOKIE_STATE_FAILED"), but not for
      FSCACHE_COOKIE_STATE_LRU_DISCARDING ("case
      FSCACHE_COOKIE_STATE_ACTIVE").
      
      This patch adds the missing check.  With a non-zero access counter,
      the function returns and the next fscache_end_cookie_access() call
      will queue another fscache_cookie_state_machine() call to handle the
      still-pending FSCACHE_COOKIE_DO_LRU_DISCARD.
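
      The intended ordering can be sketched as a tiny user-space model. The
      demo_* names are hypothetical and the state machine is reduced to the
      single ACTIVE -> LRU_DISCARDING transition; this is not the fscache
      code, only an illustration of deferring the discard until the access
      counter reaches zero.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_state { DEMO_ACTIVE, DEMO_LRU_DISCARDING };

struct demo_cookie {
	enum demo_state state;
	int n_accesses;      /* operations currently using the cookie */
	bool do_lru_discard; /* pending request to discard from the LRU */
	bool has_cache_priv; /* stands in for cookie->cache_priv != NULL */
};

/* One pass of the reduced state machine; returns true if it transitioned. */
static bool demo_state_machine(struct demo_cookie *c)
{
	assert(c);
	if (c->state != DEMO_ACTIVE || !c->do_lru_discard)
		return false;
	/* The added check: leave the flag pending while accesses are live. */
	if (c->n_accesses != 0)
		return false;
	c->do_lru_discard = false;
	c->state = DEMO_LRU_DISCARDING;
	c->has_cache_priv = false; /* models withdrawing the cookie */
	return true;
}

static void demo_begin_access(struct demo_cookie *c)
{
	c->n_accesses++;
}

/* Ending the last access re-runs the state machine, modelling the requeue. */
static void demo_end_access(struct demo_cookie *c)
{
	assert(c->n_accesses > 0);
	if (--c->n_accesses == 0)
		demo_state_machine(c);
}
```

      While an access is in flight, demo_state_machine() returns without
      clearing has_cache_priv; the discard only happens from
      demo_end_access() once the counter has dropped to zero.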
      
      Fixes: 12bb21a2 ("fscache: Implement cookie user counting and resource pinning")
      Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: https://lore.kernel.org/r/20240729162002.3436763-2-dhowells@redhat.com
      cc: Jeff Layton <jlayton@kernel.org>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      cc: stable@vger.kernel.org
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      f71aa063
    • Omar Sandoval's avatar
      filelock: fix name of file_lease slab cache · 3f65f3c0
      Omar Sandoval authored
      When struct file_lease was split out from struct file_lock, the name of
      the file_lock slab cache was copied to the new slab cache for
      file_lease. This name conflict causes confusion in /proc/slabinfo and
      /sys/kernel/slab. In particular, it caused failures in drgn's test case
      for slab cache merging.
      
      Link: https://github.com/osandov/drgn/blob/9ad29fd86499eb32847473e928b6540872d3d59a/tests/linux_kernel/helpers/test_slab.py#L81
      Fixes: c69ff407 ("filelock: split leases out of struct file_lock")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/2d1d053da1cafb3e7940c4f25952da4f0af34e38.1722293276.git.osandov@fb.com
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      3f65f3c0