1. 23 Nov, 2022 2 commits
    • mm: khugepaged: allow page allocation fallback to eligible nodes · e031ff96
      Yang Shi authored
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      The khugepaged code picks the node with the most hits as the preferred
      node, and also tries to balance across nodes when several of them have
      the same hit count.  Conceptually it does the following:
          * If target_node <= last_target_node, iterate from
      last_target_node + 1 to MAX_NUMNODES (1024 on the default config)
          * If max_value == node_load[nid], then target_node = nid
      
      But there is a corner case, particularly for MADV_COLLAPSE, in which a
      non-existent node may be returned as the preferred node.
      
      Assume the system has 2 nodes, target_node is 0 and last_target_node
      is 1.  If the MADV_COLLAPSE path is hit, max_value may be 0, so the
      search may return 2 for target_node.  That node does not exist (it is
      offline), so the warning is triggered.
      
      The node balance was introduced by commit 9f1b868a ("mm: thp:
      khugepaged: add policy for finding target node") to satisfy
      "numactl --interleave=all".  But interleaving is a mere hint rather than
      something that has hard requirements.
      
      So use a nodemask to record the nodes that share the highest hit count,
      so that the hugepage allocation can fall back to those nodes.  Also
      remove __GFP_THISNODE, since it disallows fallback.  If the nodemask
      has only one node set, a single node has the most hits and the
      nodemask approach behaves like __GFP_THISNODE.
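
      The selection described above can be sketched in user space.  This is
      a toy model under stated assumptions, not the kernel code: the name
      toy_find_target_node(), the TOY_MAX_NODES bound and the 32-bit masks
      are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_NODES 8 /* hypothetical; the kernel bound is MAX_NUMNODES */

/*
 * Toy model of target-node selection with a fallback mask.
 * load[nid] is the hit count for node nid; bit nid of online_mask says
 * whether the node exists.  Returns the first node with the highest hit
 * count and stores in *alloc_mask every online node that ties for it.
 */
static int toy_find_target_node(const unsigned int load[TOY_MAX_NODES],
                                uint32_t online_mask, uint32_t *alloc_mask)
{
    unsigned int max_value = 0;
    int target_node = 0;
    int nid;

    assert(alloc_mask != NULL);
    *alloc_mask = 0;

    /* First node with the highest hit count; stays 0 if all counts are 0. */
    for (nid = 0; nid < TOY_MAX_NODES; nid++) {
        if (load[nid] > max_value) {
            max_value = load[nid];
            target_node = nid;
        }
    }

    /* Only online nodes may become allocation fallbacks. */
    for (nid = 0; nid < TOY_MAX_NODES; nid++) {
        if ((online_mask & (UINT32_C(1) << nid)) && load[nid] == max_value)
            *alloc_mask |= UINT32_C(1) << nid;
    }

    return target_node;
}
```

      With two online nodes and all-zero hit counts, the sketch returns
      node 0 with both online nodes in the mask, so the offline node 2 is
      never a candidate.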
      
      Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Suggested-by: Zach O'Keefe <zokeefe@google.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: fix extreme overreclaim and swap floods · f53af428
      Johannes Weiner authored
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
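
      The numbers in the trace can be checked with a few lines of
      arithmetic.  A minimal sketch, assuming 4 KiB pages (the trace does
      not state the page size); the helper names are made up for
      illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* How many times more pages were reclaimed than requested (rounded down). */
static uint64_t overreclaim_factor(uint64_t nr_reclaimed, uint64_t nr_to_reclaim)
{
    assert(nr_to_reclaim > 0);
    return nr_reclaimed / nr_to_reclaim;
}

/* Convert a page count to whole MiB under the assumed page size. */
static uint64_t pages_to_mib(uint64_t nr_pages)
{
    return nr_pages * ASSUMED_PAGE_SIZE / (1024ULL * 1024ULL);
}
```

      With the trace values, overreclaim_factor(4387406, 1047) gives 4190,
      matching or_factor, and the page counts convert to about 4 MiB
      requested against 17138 MiB (roughly 16.7 GiB) reclaimed.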
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance with their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefore must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
      
      The bailout code hasn't changed in years, so why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
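
      The difference between finishing the adjusted remainder and making a
      single nudge can be illustrated with a toy two-list loop.  This is a
      sketch loosely modelled on the description above, not the kernel
      code: all names are hypothetical, TOY_BATCH stands in for
      SWAP_CLUSTER_MAX, and every scanned page is assumed to be reclaimed.

```c
#include <assert.h>

#define TOY_BATCH 32UL /* stands in for SWAP_CLUSTER_MAX */

static unsigned long toy_min(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Stop the smaller list and rescale what is left of the larger one. */
static void toy_adjust(unsigned long *small_nr, unsigned long small_target,
                       unsigned long *big_nr, unsigned long big_target)
{
    unsigned long pct_left = *small_nr * 100 / (small_target + 1);
    unsigned long scanned = big_target - *big_nr;

    *small_nr = 0;
    *big_nr = big_target * (100 - pct_left) / 100;
    *big_nr -= toy_min(*big_nr, scanned);
}

/*
 * Toy reclaim loop over an anon and a file scan target.
 * finish_remainder != 0: after the goal is met, adjust once and then
 * scan the larger list to the end of its adjusted target.
 * finish_remainder == 0: re-check the goal after every batch, so only
 * one more batch is scanned after the adjustment.
 */
static unsigned long toy_shrink(unsigned long anon_target,
                                unsigned long file_target,
                                unsigned long goal, int finish_remainder)
{
    unsigned long nr_anon = anon_target, nr_file = file_target;
    unsigned long reclaimed = 0, scan;
    int adjusted = 0;

    assert(goal > 0);

    while (nr_anon || nr_file) {
        scan = toy_min(nr_anon, TOY_BATCH);
        nr_anon -= scan;
        reclaimed += scan;

        scan = toy_min(nr_file, TOY_BATCH);
        nr_file -= scan;
        reclaimed += scan;

        if (reclaimed < goal || adjusted)
            continue;

        /* Goal met and one list is exhausted: nothing left to balance. */
        if (!nr_anon || !nr_file)
            break;

        if (nr_file > nr_anon)
            toy_adjust(&nr_anon, anon_target, &nr_file, file_target);
        else
            toy_adjust(&nr_file, file_target, &nr_anon, anon_target);

        adjusted = finish_remainder;
    }

    return reclaimed;
}
```

      With an anon target of 100000, a file target of 64 and a goal of 32
      pages, the finish-the-remainder variant reclaims 51032 pages in this
      model, while the single-nudge variant stops at 96.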
      
      This fixes the extreme overreclaim problem.
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 08 Nov, 2022 22 commits
  3. 06 Nov, 2022 16 commits
    • Linux 6.1-rc4 · f0c4d9fc
      Linus Torvalds authored
    • Merge tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 16c7a368
      Linus Torvalds authored
      Pull cxl fixes from Dan Williams:
       "Several fixes for CXL region creation crashes, leaks and failures.
      
        This is mainly fallout from the original implementation of dynamic CXL
        region creation (instantiate new physical memory pools) that arrived
        in v6.0-rc1.
      
        Given the theme of "failures in the presence of pass-through decoders"
        this also includes new regression test infrastructure for that case.
      
        Summary:
      
         - Fix region creation crash with pass-through decoders
      
         - Fix region creation crash when decoder allocation fails
      
         - Fix region creation crash when scanning regions to enforce the
           increasing physical address order constraint that CXL mandates
      
         - Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
           1:1 memory-device-to-region associations.
      
         - Fix a memory leak for cxl_region objects when regions with active
           targets are deleted
      
         - Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
           emulated proximity domains.
      
         - Fix region creation failure for switch attached devices downstream
           of a single-port host-bridge
      
         - Fix false positive memory leak of cxl_region objects by recycling
           recently used region ids rather than freeing them
      
         - Add regression test infrastructure for a pass-through decoder
           configuration
      
         - Fix some mailbox payload handling corner cases"
      
      * tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl/region: Recycle region ids
        cxl/region: Fix 'distance' calculation with passthrough ports
        tools/testing/cxl: Add a single-port host-bridge regression config
        tools/testing/cxl: Fix some error exits
        cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
        cxl/region: Fix cxl_region leak, cleanup targets at region delete
        cxl/region: Fix region HPA ordering validation
        cxl/pmem: Use size_add() against integer overflow
        cxl/region: Fix decoder allocation crash
        ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set
        cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.
        cxl/region: Fix null pointer dereference due to pass through decoder commit
        cxl/mbox: Add a check on input payload size
    • Merge tag 'hwmon-for-v6.1-rc4' of... · aa529949
      Linus Torvalds authored
      Merge tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fixes from Guenter Roeck:
       "Fix two regressions:
      
         - Commit 54cc3dbf ("hwmon: (pmbus) Add regulator supply into
           macro") resulted in regulator undercount when disabling regulators.
           Revert it.
      
         - The thermal subsystem rework caused the scmi driver to no longer
           register with the thermal subsystem because index values no longer
           match. To fix the problem, the scmi driver now directly registers
           with the thermal subsystem, no longer through the hwmon core"
      
      * tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        Revert "hwmon: (pmbus) Add regulator supply into macro"
        hwmon: (scmi) Register explicitly with Thermal Framework
    • Merge tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 727ea09e
      Linus Torvalds authored
      Pull perf fixes from Borislav Petkov:
      
       - Add Cooper Lake's stepping to the PEBS guest/host events isolation
         fixed microcode revisions checking quirk
      
       - Update Icelake and Sapphire Rapids events constraints
      
       - Use the standard energy unit for Sapphire Rapids in RAPL
      
       - Fix the hw_breakpoint test to fail more gracefully on !SMP configs
      
      * tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]
        perf/x86/intel: Fix pebs event constraints for SPR
        perf/x86/intel: Fix pebs event constraints for ICL
        perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain
        perf/hw_breakpoint: test: Skip the test if dependencies unmet
    • Merge tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f6f52047
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
      
       - Add new Intel CPU models
      
       - Enforce that TDX guests are successfully loaded only on TDX hardware
         where virtualization exception (#VE) delivery on kernel memory is
         disabled because handling those in all possible cases is "essentially
         impossible"
      
       - Add the proper include to the syscall wrappers so that BTF can see
         the real pt_regs definition and not only the forward declaration
      
      * tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/cpu: Add several Intel server CPU model numbers
        x86/tdx: Panic on bad configs that #VE on "private" memory access
        x86/tdx: Prepare for using "INFO" call for a second purpose
        x86/syscall: Include asm/ptrace.h in syscall_wrapper header
    • Merge tag 'kbuild-fixes-v6.1-2' of... · 35697d81
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - Use POSIX-compatible grep options
      
       - Document git-related tips for reproducible builds
      
       - Fix a typo in the modpost rule
      
       - Suppress SIGPIPE error message from gcc-ar and llvm-ar
      
       - Fix segmentation fault in the menuconfig search
      
      * tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kconfig: fix segmentation fault in menuconfig search
        kbuild: fix SIGPIPE error message for AR=gcc-ar and AR=llvm-ar
        kbuild: fix typo in modpost
        Documentation: kbuild: Add description of git for reproducible builds
        kbuild: use POSIX-compatible grep option
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 089d1c31
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
      "ARM:
      
         - Fix the pKVM stage-1 walker erroneously using the stage-2 accessor
      
         - Correctly convert vcpu->kvm to a hyp pointer when generating an
           exception in a nVHE+MTE configuration
      
         - Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them
      
         - Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
      
         - Document the boot requirements for FGT when entering the kernel at
           EL1
      
        x86:
      
         - Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()
      
         - Make argument order consistent for kvcalloc()
      
         - Userspace API fixes for DEBUGCTL and LBRs"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Fix a typo about the usage of kvcalloc()
        KVM: x86: Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()
        KVM: VMX: Ignore guest CPUID for host userspace writes to DEBUGCTL
        KVM: VMX: Fold vmx_supported_debugctl() into vcpu_supported_debugctl()
        KVM: VMX: Advertise PMU LBRs if and only if perf supports LBRs
        arm64: booting: Document our requirements for fine grained traps with SME
        KVM: arm64: Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
        KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them
        KVM: arm64: Fix bad dereference on MTE-enabled systems
        KVM: arm64: Use correct accessor to parse stage-1 PTEs
    • Merge tag 'for-linus-6.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 6e8c78d3
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
       "One fix for silencing a smatch warning, and a small cleanup patch"
      
      * tag 'for-linus-6.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        x86/xen: simplify sysenter and syscall setup
        x86/xen: silence smatch warning in pmu_msr_chk_emulated()
    • Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 9761070d
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Fix a number of bugs, including some regressions, the most serious of
        which was one which would cause online resizes to fail with file
        systems with metadata checksums enabled.
      
        Also fix a warning caused by the newly added fortify string checker,
        plus some bugs that were found using fuzzed file systems"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: fix fortify warning in fs/ext4/fast_commit.c:1551
        ext4: fix wrong return err in ext4_load_and_init_journal()
        ext4: fix warning in 'ext4_da_release_space'
        ext4: fix BUG_ON() when directory entry has invalid rec_len
        ext4: update the backup superblock's at the end of the online resize
    • Merge tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 90153f92
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "One symlink handling fix and two fixes for multichannel issues with
        iterating channels, including for oplock breaks when leases are
        disabled"
      
      * tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: fix use-after-free on the link name
        cifs: avoid unnecessary iteration of tcp sessions
        cifs: always iterate smb sessions using primary channel
    • Merge tag 'trace-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 8391aa4b
      Linus Torvalds authored
      Pull tracing fixes for 6.1-rc3:
      
       - Fixed NULL pointer dereference in the ring buffer wake-waiters code
         for machines that have fewer CPUs than what nr_cpu_ids returns.
      
         The buffer array is of size nr_cpu_ids, but only the online CPUs get
         initialized.
      
       - Fixed use-after-free in ftrace_shutdown.
      
       - Fix accounting of whether a kprobe is enabled
      
       - Fix NULL pointer dereference on error path of fprobe rethook_alloc().
      
       - Fix unregistering of fprobe_kprobe_handler
      
       - Fix memory leak in kprobe test module
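
      The first fix in the list above concerns an array that is sized for
      every possible CPU but only populated for CPUs that came online.  A
      minimal user-space sketch of that guard, with hypothetical names and
      a made-up TOY_NR_CPU_IDS bound rather than the kernel structures:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPU_IDS 4 /* hypothetical stand-in for nr_cpu_ids */

struct toy_cpu_buffer {
    int waiters; /* number of tasks waiting on this per-CPU buffer */
};

struct toy_ring_buffer {
    /* One slot per possible CPU; slots for offline CPUs stay NULL. */
    struct toy_cpu_buffer *buffers[TOY_NR_CPU_IDS];
};

/* Wake all waiters on one CPU's buffer and return how many were woken. */
static int toy_wake_waiters(struct toy_ring_buffer *rb, int cpu)
{
    struct toy_cpu_buffer *cpu_buffer;
    int woken;

    assert(rb != NULL);
    assert(cpu >= 0 && cpu < TOY_NR_CPU_IDS);

    cpu_buffer = rb->buffers[cpu];
    if (!cpu_buffer)
        return 0; /* CPU was never initialized: nothing to wake */

    woken = cpu_buffer->waiters;
    cpu_buffer->waiters = 0;
    return woken;
}
```

      Without the NULL check, asking for a CPU whose slot was never
      initialized would dereference a NULL pointer.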
      
      * tag 'trace-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracing: kprobe: Fix memory leak in test_gen_kprobe/kretprobe_cmd()
        tracing/fprobe: Fix to check whether fprobe is registered correctly
        fprobe: Check rethook_alloc() return in rethook initialization
        kprobe: reverse kp->flags when arm_kprobe failed
        ftrace: Fix use-after-free for dynamic ftrace_ops
        ring-buffer: Check for NULL cpu_buffer in ring_buffer_wake_waiters()
    • Merge tag 'kvmarm-fixes-6.1-3' of... · f4298cac
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      * Fix the pKVM stage-1 walker erroneously using the stage-2 accessor
      
      * Correctly convert vcpu->kvm to a hyp pointer when generating
        an exception in a nVHE+MTE configuration
      
      * Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them
      
      * Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
      
      * Document the boot requirements for FGT when entering the kernel
        at EL1
    • Merge branch 'kvm-master' into HEAD · 14620149
      Paolo Bonzini authored
      x86:
      * Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()
      
      * Make argument order consistent for kvcalloc()
      
      * Userspace API fixes for DEBUGCTL and LBRs
    • ext4: fix fortify warning in fs/ext4/fast_commit.c:1551 · 0d043351
      Theodore Ts'o authored
      With the new fortify string system, rework the memcpy to avoid this
      warning:
      
      memcpy: detected field-spanning write (size 60) of single field "&raw_inode->i_generation" at fs/ext4/fast_commit.c:1551 (size 4)
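
      One common way to avoid this kind of warning is to address the copy
      from the start of the structure plus a field offset, so that the
      destination is the whole object rather than a single 4-byte field.
      A minimal sketch, assuming a made-up struct toy_inode layout and a
      hypothetical helper name; it is not the actual change to
      fast_commit.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout for illustration, not struct ext4_inode. */
struct toy_inode {
    uint32_t i_mode;
    uint32_t i_generation;
    uint32_t i_file_acl;
    uint8_t  i_tail[8];
};

/*
 * Copy everything from i_generation up to inode_len bytes.
 * The destination is expressed as "object base + offset" instead of
 * "&dst->i_generation", so the copy is bounded by the whole object.
 */
static void toy_copy_from_generation(struct toy_inode *dst,
                                     const struct toy_inode *src,
                                     size_t inode_len)
{
    const size_t off_gen = offsetof(struct toy_inode, i_generation);

    assert(inode_len >= off_gen && inode_len <= sizeof(*dst));
    memcpy((uint8_t *)dst + off_gen, (const uint8_t *)src + off_gen,
           inode_len - off_gen);
}
```

      Fields before i_generation are left untouched, and every field from
      i_generation onward is copied in one call.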
      
      Cc: stable@kernel.org
      Fixes: 54d9469b ("fortify: Add run-time WARN for cross-field memcpy()")
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix wrong return err in ext4_load_and_init_journal() · 9f2a1d9f
      Jason Yan authored
      The return value is wrong in ext4_load_and_init_journal(). The local
      variable 'err' needs to be initialized before the goto out. The original
      code in __ext4_fill_super() was fine because it had two return values,
      'ret' and 'err', and 'ret' was initialized to -EINVAL. After
      ext4_load_and_init_journal() was factored out, this code broke. Fix it
      by directly returning -EINVAL in the error handling path.
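
      The pattern can be shown with a small stand-alone function.  This is
      a sketch with hypothetical names and parameters, not the ext4 code:
      a later check jumps to the error label while 'err' still holds 0
      from an earlier successful step, so the label returns -EINVAL
      directly.

```c
#include <assert.h>
#include <errno.h>

/*
 * load_err: result of the first step (0 or a negative errno).
 * csum_ok:  whether a later validation step succeeded.
 */
static int toy_load_and_init_journal(int load_err, int csum_ok)
{
    int err = load_err;

    assert(load_err <= 0);
    if (err)
        return err;

    if (!csum_ok)
        goto out; /* err is still 0 here */

    return 0;

out:
    /* Cleanup would go here.  Returning err would report success. */
    return -EINVAL;
}
```

      Returning 'err' at the label would hand 0 back to the caller even
      though the later step failed.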
      
      Cc: stable@kernel.org
      Fixes: 9c1dd22d ("ext4: factor out ext4_load_and_init_journal()")
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221025040206.3134773-1-yanaijie@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix warning in 'ext4_da_release_space' · 1b8f787e
      Ye Bin authored
      Syzkaller reported the following issue:
      EXT4-fs (loop0): Free/Dirty block details
      EXT4-fs (loop0): free_blocks=0
      EXT4-fs (loop0): dirty_blocks=0
      EXT4-fs (loop0): Block reservation details
      EXT4-fs (loop0): i_reserved_data_blocks=0
      EXT4-fs warning (device loop0): ext4_da_release_space:1527: ext4_da_release_space: ino 18, to_free 1 with only 0 reserved data blocks
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 92 at fs/ext4/inode.c:1528 ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1524
      Modules linked in:
      CPU: 0 PID: 92 Comm: kworker/u4:4 Not tainted 6.0.0-syzkaller-09423-g493ffd66 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
      Workqueue: writeback wb_workfn (flush-7:0)
      RIP: 0010:ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1528
      RSP: 0018:ffffc900015f6c90 EFLAGS: 00010296
      RAX: 42215896cd52ea00 RBX: 0000000000000000 RCX: 42215896cd52ea00
      RDX: 0000000000000000 RSI: 0000000080000001 RDI: 0000000000000000
      RBP: 1ffff1100e907d96 R08: ffffffff816aa79d R09: fffff520002bece5
      R10: fffff520002bece5 R11: 1ffff920002bece4 R12: ffff888021fd2000
      R13: ffff88807483ecb0 R14: 0000000000000001 R15: ffff88807483e740
      FS:  0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005555569ba628 CR3: 000000000c88e000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       ext4_es_remove_extent+0x1ab/0x260 fs/ext4/extents_status.c:1461
       mpage_release_unused_pages+0x24d/0xef0 fs/ext4/inode.c:1589
       ext4_writepages+0x12eb/0x3be0 fs/ext4/inode.c:2852
       do_writepages+0x3c3/0x680 mm/page-writeback.c:2469
       __writeback_single_inode+0xd1/0x670 fs/fs-writeback.c:1587
       writeback_sb_inodes+0xb3b/0x18f0 fs/fs-writeback.c:1870
       wb_writeback+0x41f/0x7b0 fs/fs-writeback.c:2044
       wb_do_writeback fs/fs-writeback.c:2187 [inline]
       wb_workfn+0x3cb/0xef0 fs/fs-writeback.c:2227
       process_one_work+0x877/0xdb0 kernel/workqueue.c:2289
       worker_thread+0xb14/0x1330 kernel/workqueue.c:2436
       kthread+0x266/0x300 kernel/kthread.c:376
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
       </TASK>
      
      The above issue may happen as follows:
      ext4_da_write_begin
        ext4_create_inline_data
          ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
          ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA);
      __ext4_ioctl
        ext4_ext_migrate -> will lead to eh->eh_entries not zero, and set extent flag
      ext4_da_write_begin
        ext4_da_convert_inline_data_to_extent
          ext4_da_write_inline_data_begin
            ext4_da_map_blocks
              ext4_insert_delayed_block
      	  if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk))
      	    if (!ext4_es_scan_clu(inode, &ext4_es_is_mapped, lblk))
      	      ext4_clu_mapped(inode, EXT4_B2C(sbi, lblk)); -> will return 1
      	       allocated = true;
                ext4_es_insert_delayed_block(inode, lblk, allocated);
      ext4_writepages
        mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write); -> return -ENOSPC
        mpage_release_unused_pages(&mpd, give_up_on_write); -> give_up_on_write == 1
          ext4_es_remove_extent
            ext4_da_release_space(inode, reserved);
              if (unlikely(to_free > ei->i_reserved_data_blocks))
      	  -> to_free == 1  but ei->i_reserved_data_blocks == 0
      	  -> then trigger warning as above
      
      To solve the above issue, forbid migration of an inode that has inline data.
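
      A guard of this kind can be sketched as follows.  The flag values,
      the struct and the function name are hypothetical, and the choice of
      -EINVAL as the error code is an assumption; this is not the ext4
      implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_FLAG_EXTENTS     0x1u /* hypothetical flag bits */
#define TOY_FLAG_INLINE_DATA 0x2u

struct toy_inode_state {
    unsigned int flags;
};

/* Switch an inode to extent mapping unless that would be unsafe. */
static int toy_ext_migrate(struct toy_inode_state *inode)
{
    assert(inode != NULL);

    /* Already extent-mapped: nothing to migrate. */
    if (inode->flags & TOY_FLAG_EXTENTS)
        return -EINVAL;

    /* Inline data lives in the inode body: refuse to migrate. */
    if (inode->flags & TOY_FLAG_INLINE_DATA)
        return -EINVAL;

    inode->flags |= TOY_FLAG_EXTENTS;
    return 0;
}
```

      Rejecting the request up front keeps the extent flag from being set
      on an inode whose data is still stored inline.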
      
      Cc: stable@kernel.org
      Reported-by: syzbot+c740bb18df70ad00952e@syzkaller.appspotmail.com
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221018022701.683489-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>