1. 01 Nov, 2021 29 commits
    • Merge tag 'ras_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 158405e8
      Linus Torvalds authored
      Pull RAS updates from Borislav Petkov:
      
       - Get rid of a bunch of function pointers used in MCA land in favor of
         normal functions. This is in preparation for making the MCA code
         noinstr-aware
      
       - When the kernel copies data from user addresses and it encounters a
         machine check, a SIGBUS is sent to that process. Change this action
         to either an -EFAULT which is returned to the user or a short write,
         making the recovery action a lot more user-friendly
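
For user space, the practical consequence of the second item is that a write() touching poisoned memory can come back short or fail with EFAULT instead of the process being killed by SIGBUS. A minimal sketch of a caller that copes with both outcomes follows; write_all() is a hypothetical helper name chosen for illustration, not something taken from the series.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the whole buffer, tolerating short writes.
 * Returns the number of bytes written, or -1 with errno set (for
 * example EFAULT) if a write fails outright.
 */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* hard error such as EFAULT: report it */
        }
        if (n == 0)
            break;              /* no progress: stop instead of spinning */
        done += (size_t)n;      /* short write: retry the remainder */
    }
    return (ssize_t)done;
}
```

A caller would typically treat a -1 return with errno == EFAULT as "the source buffer is unusable" and recover at application level, which is the friendlier behaviour the pull message describes.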
      
      * tag 'ras_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mce: Sort mca_config members to get rid of unnecessary padding
        x86/mce: Get rid of the ->quirk_no_way_out() indirect call
        x86/mce: Get rid of msr_ops
        x86/mce: Get rid of machine_check_vector
        x86/mce: Get rid of the mce_severity function pointer
        x86/mce: Drop copyin special case for #MC
        x86/mce: Change to not send SIGBUS error during copy from user
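
The "Sort mca_config members" entry relies on a general C layout effect: ordering members from largest to smallest alignment removes interior padding. The sketch below illustrates that effect with two hypothetical structs; the member names are invented and do not reflect the real struct mca_config.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: small members interleaved with larger ones. */
struct cfg_unsorted {
    bool     flag_a;    /* 1 byte, then padding up to the next member's alignment */
    uint64_t big;
    bool     flag_b;    /* 1 byte, then padding again */
    uint32_t mid;
};

/* Same members, ordered by decreasing alignment. */
struct cfg_sorted {
    uint64_t big;
    uint32_t mid;
    bool     flag_a;    /* small members packed together at the end */
    bool     flag_b;
};

/* Bytes saved by sorting; the exact number depends on the ABI. */
size_t padding_saved(void)
{
    return sizeof(struct cfg_unsorted) - sizeof(struct cfg_sorted);
}
```

On common 64-bit ABIs the unsorted variant typically occupies 24 bytes and the sorted one 16, but the exact sizes are implementation-defined.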
    • Merge tag 'efi-next-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 93351d2c
      Linus Torvalds authored
      Pull EFI updates from Borislav Petkov:
       "The last EFI pull request which is forwarded through the tip tree, for
        v5.16. From now on, Ard will be sending stuff directly.
      
        Disable EFI runtime services by default on PREEMPT_RT, while adding
        the ability to re-enable them on demand by passing efi=runtime on the
        command line"
      
      * tag 'efi-next-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi: Allow efi=runtime
        efi: Disable runtime services on RT
    • Merge tag 'edac_updates_for_v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras · fe354159
      Linus Torvalds authored
      Pull EDAC updates from Borislav Petkov:
       "A small pile of EDAC updates which the autumn wind blew my way. :)
      
         - amd64_edac: Add support for three-rank interleaving mode which is
           present on AMD zen2 servers
      
         - The usual fixes and cleanups all over EDAC land"
      
      * tag 'edac_updates_for_v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
        EDAC/sb_edac: Fix top-of-high-memory value for Broadwell/Haswell
        EDAC/ti: Remove redundant error messages
        EDAC/amd64: Handle three rank interleaving mode
        EDAC/mc_sysfs: Print MC-scope sysfs counters unsigned
        EDAC/al_mc: Make use of the helper function devm_add_action_or_reset()
        EDAC/mc: Replace strcpy(), sprintf() and snprintf() with strscpy() or scnprintf()
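
The last shortlog entry swaps unbounded string copies for bounded ones. As a rough user-space illustration of why strscpy() is preferred, the sketch below mimics its documented contract (always NUL-terminate, report truncation with -E2BIG). The function name sketch_strscpy() is hypothetical and this is not the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * User-space sketch of strscpy()-like semantics (hypothetical name).
 * Returns the number of characters copied, not counting the NUL, or
 * -E2BIG if src did not fit. dst is NUL-terminated whenever size > 0.
 */
ssize_t sketch_strscpy(char *dst, const char *src, size_t size)
{
    size_t i;

    if (size == 0)
        return -E2BIG;

    /* Copy at most size - 1 characters, leaving room for the NUL. */
    for (i = 0; i < size - 1 && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';

    /* If the source has more characters left, the copy was truncated. */
    return src[i] == '\0' ? (ssize_t)i : -E2BIG;
}
```

Unlike strcpy(), the destination can never be overrun, and unlike a bare snprintf() the caller gets an explicit error on truncation rather than a would-be length.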
    • mm: fix mismerge of folio page flag manipulators · e6643593
      Linus Torvalds authored
      I had missed a semantic conflict between commit d389a4a8 ("mm: Add
      folio flag manipulation functions") from the folio tree, and commit
      eac96c3e ("mm: filemap: check if THP has hwpoisoned subpage for PMD
      page fault") that added a new set of page flags.
      
      My build tests had too many options enabled, which hid this issue.  But
      if you didn't have MEMORY_FAILURE or TRANSPARENT_HUGEPAGE enabled, you'd
      end up with build errors like this:
      
        include/linux/page-flags.h:806:29: error: macro "PAGEFLAG_FALSE" requires 2 arguments, but only 1 given
          806 | PAGEFLAG_FALSE(HasHWPoisoned)
              |                             ^
      
      due to the missing lowercase name used for folio function naming.
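
The error above is the ordinary preprocessor complaint about calling a two-parameter macro with one argument. A simplified, self-contained sketch of such a stub generator is shown here; the macro body is an assumption made for illustration and is much smaller than the real definitions in include/linux/page-flags.h.

```c
#include <stdbool.h>
#include <stddef.h>

struct page;    /* opaque stand-ins for the kernel types */
struct folio;

/*
 * Simplified two-argument stub generator: the uppercase name feeds the
 * Page*() helper, the lowercase name feeds the folio_test_*() helper.
 */
#define PAGEFLAG_FALSE(uname, lname)                                \
static inline bool folio_test_##lname(const struct folio *folio)    \
{ (void)folio; return false; }                                      \
static inline int Page##uname(const struct page *page)              \
{ (void)page; return 0; }

/* Passing only (HasHWPoisoned) would fail with "requires 2 arguments". */
PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
```

With both names supplied, configurations that compile the flag out still get always-false helpers under both the page and the folio naming schemes.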
      
      Fixes: 49f8275c ("Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Reported-by: Yang Shi <shy828301@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8cb1ae19
      Linus Torvalds authored
      Pull x86 fpu updates from Thomas Gleixner:
      
       - Cleanup of extable fixup handling to be more robust, which in turn
         allows making the FPU exception fixups more robust as well.
      
       - Change the return code for signal frame related failures from
         explicit error codes to a boolean fail/success as that's all the
         calling code evaluates.
      
       - A large refactoring of the FPU code to prepare for adding AMX
         support:
      
            - Disentangle the public header maze and remove especially the
              misnamed kitchen sink internal.h which is, despite its name,
              included all over the place.
      
            - Add a proper abstraction for the register buffer storage (struct
              fpstate) which allows to dynamically size the buffer at runtime
              by flipping the pointer to the buffer container from the default
              container which is embedded in task_struct::thread::fpu to a
              dynamically allocated container with a larger register buffer.
      
            - Convert the code over to the new fpstate mechanism.
      
            - Consolidate the KVM FPU handling by moving the FPU related code
              into the FPU core which reduces the number of exports and avoids
              adding even more exports when AMX has to be supported in KVM.
              This also removes duplicated code which was of course
              unnecessarily different and incomplete in the KVM copy.
      
            - Simplify the KVM FPU buffer handling by utilizing the new
              fpstate container and just switching the buffer pointer from the
              user space buffer to the KVM guest buffer when entering
              vcpu_run() and flipping it back when leaving the function. This
              cuts the memory requirements of a vCPU for FPU buffers in half
              and avoids pointless memory copy operations.
      
              This also solves the so far unresolved problem of adding AMX
              support because the current FPU buffer handling of KVM inflicted
              a circular dependency between adding AMX support to the core and
              to KVM. With the new scheme of switching fpstate AMX support can
              be added to the core code without affecting KVM.
      
            - Replace various variables with proper data structures so the
              extra information required for adding dynamically enabled FPU
              features (AMX) can be added in one place
      
       - Add AMX (Advanced Matrix eXtensions) support (finally):
      
         AMX is a large XSTATE component which is going to be available with
         Sapphire Rapids Xeon CPUs. The feature comes with an extra MSR
         (MSR_XFD) which allows to trap the (first) use of an AMX related
         instruction, which has two benefits:
      
          1) It allows the kernel to control access to the feature
      
          2) It allows the kernel to dynamically allocate the large register
             state buffer instead of burdening every task with the extra
             8K or larger state storage.
      
         It would have been great to gain this kind of control already with
         AVX512.
      
         The support comes with the following infrastructure components:
      
          1) arch_prctl() to
              - read the supported features (equivalent to XGETBV(0))
              - read the permitted features for a task
              - request permission for a dynamically enabled feature
      
             Permission is granted per process, inherited on fork() and
             cleared on exec(). The permission policy of the kernel is
             restricted to sigaltstack size validation, but the syscall
             obviously allows further restrictions via seccomp etc.
      
          2) A stronger sigaltstack size validation for sys_sigaltstack(2)
             which takes granted permissions and the potentially resulting
             larger signal frame into account. This mechanism can also be used
             to enforce factual sigaltstack validation independent of dynamic
             features to help with finding potential victims of the 2K
             sigaltstack size constant which is broken since AVX512 support
             was added.
      
          3) Exception handling for #NM traps to catch first use of an extended
             feature via a new cause MSR. If the exception was caused by the
             use of such a feature, the handler checks permission for that
             feature. If permission has not been granted, the handler sends a
             SIGILL like the #UD handler would do if the feature had
             been disabled in XCR0. If permission has been granted, then a new
             fpstate which fits the larger buffer requirement is allocated.
      
             In the unlikely case that this allocation fails, the handler
             sends SIGSEGV to the task. That's not elegant, but unavoidable as
             the other discussed options of preallocation or full per task
             permissions come with their own set of horrors for kernel and/or
             userspace. So this is the lesser of the evils and SIGSEGV caused
             by unexpected memory allocation failures is not a fundamentally
             new concept either.
      
             When allocation succeeds, the fpstate properties are filled in to
             reflect the extended feature set and the resulting sizes, the
             fpu::fpstate pointer is updated accordingly and the trap is
             disarmed for this task permanently.
      
          4) Enumeration and size calculations
      
          5) Trap switching via MSR_XFD
      
             The XFD (eXtended Feature Disable) MSR is context switched with
             the same life time rules as the FPU register state itself. The
             mechanism is keyed off with a static key which is default
             disabled so !AMX equipped CPUs have zero overhead. On AMX enabled
             CPUs the overhead is limited by comparing the task's XFD value
             with a per CPU shadow variable to avoid redundant MSR writes. In
             case of switching from an AMX using task to a non AMX using task
             or vice versa, the extra MSR write is obviously inevitable.
      
             All other places which need to be aware of the variable feature
             sets and resulting variable sizes are not affected at all because
             they retrieve the information (feature set, sizes) unconditionally
             from the fpstate properties.
      
          6) Enable the new AMX states
      
         Note, this is relatively new code despite the fact that AMX support
         has been in the works for more than a year now.
      
         The big refactoring of the FPU code, which allowed a proper
         integration, was started exactly 3 weeks ago. Refactoring of the
         existing FPU code and of the original AMX patches took a week and has
         been subject to extensive review and testing. The only fallout which
         has not been caught in review and testing right away was restricted
         to AMX enabled systems, which is completely irrelevant for anyone
         outside Intel and their early access program. There might be dragons
         lurking as usual, but so far the fine grained refactoring has held up
         and eventual yet undetected fallout is bisectable and should be
         easily addressable before the 5.16 release. Famous last words...
      
         Many thanks to Chang Bae and Dave Hansen for working hard on this and
         also to the various test teams at Intel who reserved extra capacity
         to follow the rapid development of this closely which provides the
         confidence level required to offer this rather large update for
         inclusion into 5.16-rc1
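
The first-use handling from item 3 of the infrastructure list (check permission, allocate a larger buffer, flip the fpstate pointer, disarm the trap) can be modelled in plain user-space C. The sketch below is only a toy model under stated assumptions: every type, name, size and bit value is hypothetical, and saving or restoring real register contents is left out entirely.

```c
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define FEAT_DYN      (1ull << 0)           /* hypothetical dynamic feature bit */
#define DEFAULT_SIZE  2048                  /* illustrative sizes only */
#define LARGE_SIZE    (DEFAULT_SIZE + 8192)

struct fpstate_model {
    size_t         size;    /* size of the register buffer */
    uint64_t       xfd;     /* features whose first use still traps */
    unsigned char *regs;    /* the register buffer itself */
};

struct task_model {
    uint64_t              perm;                     /* granted permissions */
    unsigned char         dflt_regs[DEFAULT_SIZE];
    struct fpstate_model  dflt;                     /* embedded default container */
    struct fpstate_model *fpstate;                  /* currently active container */
};

void task_init(struct task_model *t)
{
    t->perm = 0;
    t->dflt.size = DEFAULT_SIZE;
    t->dflt.xfd = FEAT_DYN;             /* trap armed for the dynamic feature */
    t->dflt.regs = t->dflt_regs;
    t->fpstate = &t->dflt;
}

/* Returns 0 on success, or the signal the task would receive. */
int handle_first_use(struct task_model *t, uint64_t feature)
{
    struct fpstate_model *bigger;

    if (!(t->fpstate->xfd & feature))
        return 0;                       /* trap already disarmed */
    if (!(t->perm & feature))
        return SIGILL;                  /* permission was never granted */

    bigger = calloc(1, sizeof(*bigger) + LARGE_SIZE);
    if (!bigger)
        return SIGSEGV;                 /* allocation failure */

    bigger->size = LARGE_SIZE;
    bigger->regs = (unsigned char *)(bigger + 1);
    bigger->xfd = t->fpstate->xfd & ~feature;   /* disarm for this task */

    if (t->fpstate != &t->dflt)
        free(t->fpstate);               /* drop an earlier dynamic container */
    t->fpstate = bigger;                /* flip the pointer */
    return 0;
}
```

In this model a task that never touches the feature keeps using the small embedded buffer, which is the memory saving the message attributes to the dynamic scheme.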
      
      * tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
        Documentation/x86: Add documentation for using dynamic XSTATE features
        x86/fpu: Include vmalloc.h for vzalloc()
        selftests/x86/amx: Add context switch test
        selftests/x86/amx: Add test cases for AMX state management
        x86/fpu/amx: Enable the AMX feature in 64-bit mode
        x86/fpu: Add XFD handling for dynamic states
        x86/fpu: Calculate the default sizes independently
        x86/fpu/amx: Define AMX state components and have it used for boot-time checks
        x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
        x86/fpu/xstate: Add fpstate_realloc()/free()
        x86/fpu/xstate: Add XFD #NM handler
        x86/fpu: Update XFD state where required
        x86/fpu: Add sanity checks for XFD
        x86/fpu: Add XFD state to fpstate
        x86/msr-index: Add MSRs for XFD
        x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
        x86/fpu: Reset permission and fpstate on exec()
        x86/fpu: Prepare fpu_clone() for dynamically enabled features
        x86/fpu/signal: Prepare for variable sigframe length
        x86/signal: Use fpu::__state_user_size for sigalt stack validation
        ...
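
The per-CPU shadow comparison mentioned for MSR_XFD switching is a common pattern for avoiding redundant hardware writes. A minimal user-space model is sketched below; the array, the write counter and all function names are assumptions for illustration, and the static key described in the message is omitted.

```c
#include <stdint.h>

#define NR_CPUS_MODEL 4                     /* hypothetical CPU count */

static uint64_t xfd_shadow[NR_CPUS_MODEL];  /* last value written per CPU */
static unsigned long msr_writes;            /* counts simulated MSR writes */

/* Stand-in for the expensive MSR write. */
static void wrmsr_xfd_model(int cpu, uint64_t val)
{
    xfd_shadow[cpu] = val;
    msr_writes++;
}

/* Called when a task with the given XFD value is switched in on cpu. */
void xfd_switch_in(int cpu, uint64_t task_xfd)
{
    /* Only touch the "MSR" when the value actually changes. */
    if (xfd_shadow[cpu] != task_xfd)
        wrmsr_xfd_model(cpu, task_xfd);
}

unsigned long xfd_msr_writes(void)
{
    return msr_writes;
}
```

Switching between two tasks with the same XFD value costs only a compare in this model; a write happens only when the incoming task's value differs from the shadow.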
    • Merge tag 'x86-apic-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7d20dd32
      Linus Torvalds authored
      Pull x86/apic update from Thomas Gleixner:
       "A single commit which reduces cache misses in __x2apic_send_IPI_mask()
        significantly by converting x86_cpu_to_logical_apicid() to an array
        instead of using per CPU storage.
      
        This reduces the cost for a full broadcast on a dual socket system
        with 256 CPUs from 33 down to 11 microseconds"
      
      * tag 'x86-apic-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/apic: Reduce cache line misses in __x2apic_send_IPI_mask()
    • Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9a7e0a90
      Linus Torvalds authored
      Pull scheduler updates from Thomas Gleixner:
      
       - Revert the printk format based wchan() symbol resolution as it can
         leak the raw value in case that the symbol is not resolvable.
      
       - Make wchan() more robust and work with all kinds of unwinders by
         enforcing that the task stays blocked while unwinding is in progress.
      
       - Prevent sched_fork() from accessing an invalid sched_task_group
      
       - Improve asymmetric packing logic
      
       - Extend scheduler statistics to RT and DL scheduling classes and add
         statistics for bandwidth burst to the SCHED_FAIR class.
      
       - Properly account SCHED_IDLE entities
      
       - Prevent a potential deadlock when initial priority is assigned to a
         newly created kthread. A recent change to plug a race between cpuset
         and __sched_setscheduler() introduced a new lock dependency which is
         now triggered. Break the lock dependency chain by moving the priority
         assignment to the thread function.
      
       - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
      
       - Improve idle balancing in general and especially for NOHZ enabled
         systems.
      
       - Provide proper interfaces for live patching so it does not have to
         fiddle with scheduler internals.
      
       - Add cluster aware scheduling support.
      
       - A small set of tweaks for RT (irqwork, wait_task_inactive(), various
         scheduler options and delaying mmdrop)
      
       - The usual small tweaks and improvements all over the place
      
      * tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
        sched/fair: Cleanup newidle_balance
        sched/fair: Remove sysctl_sched_migration_cost condition
        sched/fair: Wait before decaying max_newidle_lb_cost
        sched/fair: Skip update_blocked_averages if we are defering load balance
        sched/fair: Account update_blocked_averages in newidle_balance cost
        x86: Fix __get_wchan() for !STACKTRACE
        sched,x86: Fix L2 cache mask
        sched/core: Remove rq_relock()
        sched: Improve wake_up_all_idle_cpus() take #2
        irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
        irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
        irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
        sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
        sched: Add cluster scheduler level for x86
        sched: Add cluster scheduler level in core and related Kconfig for ARM64
        topology: Represent clusters of CPUs within a die
        sched: Disable -Wunused-but-set-variable
        sched: Add wrapper for get_wchan() to keep task blocked
        x86: Fix get_wchan() to support the ORC unwinder
        proc: Use task_is_running() for wchan in /proc/$pid/stat
        ...
    • Merge tag 'timers-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 57a315cd
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "Time, timers and timekeeping updates:
      
         - No core updates
      
         - No new clocksource/event driver
      
         - A large rework of the ARM architected timer driver to prepare for
           the upcoming ARMv8.6 support
      
         - Fix Kconfig options for Exynos MCT, Samsung PWM and TI DM timers
      
         - Address a namespace collision in the ARC sp804 timer driver"
      
      * tag 'timers-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        clocksource/drivers/timer-ti-dm: Select TIMER_OF
        clocksource/drivers/exynosy: Depend on sub-architecture for Exynos MCT and Samsung PWM
        clocksource/drivers/arch_arm_timer: Move workaround synchronisation around
        clocksource/drivers/arm_arch_timer: Fix masking for high freq counters
        clocksource/drivers/arm_arch_timer: Drop unnecessary ISB on CVAL programming
        clocksource/drivers/arm_arch_timer: Remove any trace of the TVAL programming interface
        clocksource/drivers/arm_arch_timer: Work around broken CVAL implementations
        clocksource/drivers/arm_arch_timer: Advertise 56bit timer to the core code
        clocksource/drivers/arm_arch_timer: Move MMIO timer programming over to CVAL
        clocksource/drivers/arm_arch_timer: Fix MMIO base address vs callback ordering issue
        clocksource/drivers/arm_arch_timer: Move drop _tval from erratum function names
        clocksource/drivers/arm_arch_timer: Move system register timer programming over to CVAL
        clocksource/drivers/arm_arch_timer: Extend write side of timer register accessors to u64
        clocksource/drivers/arm_arch_timer: Drop CNT*_TVAL read accessors
        clocksource/arm_arch_timer: Add build-time guards for unhandled register accesses
        clocksource/drivers/arc_timer: Eliminate redefined macro error
    • Merge tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 43aa0a19
      Linus Torvalds authored
      Pull objtool updates from Thomas Gleixner:
      
       - Improve retpoline code patching by separating it from alternatives
         which reduces memory footprint and allows to do better optimizations
         in the actual runtime patching.
      
       - Add proper retpoline support for x86/BPF
      
       - Address noinstr warnings in x86/kvm, lockdep and paravirtualization
         code
      
       - Add support to handle pv_ops[] indirect calls in the noinstr analysis
      
       - Classify symbols upfront and cache the result to avoid redundant
         str*cmp() invocations.
      
       - Add a CFI hash to reduce memory consumption which also reduces
         runtime on an allyesconfig by ~50%
      
       - Adjust XEN code to make objtool handling more robust and as a side
         effect to prevent text fragmentation due to placement of the
         hypercall page.
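
The message does not say how the CFI hash saves memory. One plausible reading, assumed here, is that identical CFI states are deduplicated so that many instructions can share a single copy. The sketch below shows that interning technique in isolation; the struct layout, table size and function names are hypothetical and unrelated to objtool's real data structures.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct cfi_model {              /* hypothetical, far smaller than the real state */
    int32_t cfa_offset;
    int32_t regs[4];
};

struct cfi_node {
    struct cfi_model cfi;
    struct cfi_node *next;
};

#define CFI_BUCKETS 64
static struct cfi_node *cfi_table[CFI_BUCKETS];

/* FNV-1a over the raw bytes; the struct has no padding to worry about. */
static unsigned int cfi_hash(const struct cfi_model *c)
{
    const unsigned char *p = (const unsigned char *)c;
    unsigned int h = 2166136261u;

    for (size_t i = 0; i < sizeof(*c); i++)
        h = (h ^ p[i]) * 16777619u;
    return h % CFI_BUCKETS;
}

/* Return the shared copy of *c, allocating one only if it is new. */
const struct cfi_model *cfi_intern(const struct cfi_model *c)
{
    unsigned int b = cfi_hash(c);
    struct cfi_node *n;

    for (n = cfi_table[b]; n; n = n->next)
        if (!memcmp(&n->cfi, c, sizeof(*c)))
            return &n->cfi;             /* seen before: reuse it */

    n = malloc(sizeof(*n));
    if (!n)
        return NULL;
    n->cfi = *c;
    n->next = cfi_table[b];
    cfi_table[b] = n;
    return &n->cfi;
}
```

Each instruction record then needs only a pointer to the interned state instead of a full embedded copy, which is where the saving would come from under this assumption.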
      
      * tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
        bpf,x86: Respect X86_FEATURE_RETPOLINE*
        bpf,x86: Simplify computing label offsets
        x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
        x86/alternative: Add debug prints to apply_retpolines()
        x86/alternative: Try inline spectre_v2=retpoline,amd
        x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
        x86/alternative: Implement .retpoline_sites support
        x86/retpoline: Create a retpoline thunk array
        x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
        x86/asm: Fixup odd GEN-for-each-reg.h usage
        x86/asm: Fix register order
        x86/retpoline: Remove unused replacement symbols
        objtool,x86: Replace alternatives with .retpoline_sites
        objtool: Shrink struct instruction
        objtool: Explicitly avoid self modifying code in .altinstr_replacement
        objtool: Classify symbols
        objtool: Support pv_opsindirect calls for noinstr
        x86/xen: Rework the xen_{cpu,irq,mmu}_opsarrays
        x86/xen: Mark xen_force_evtchn_callback() noinstr
        x86/xen: Make irq_disable() noinstr
        ...
    • Merge tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 595b28fb
      Linus Torvalds authored
      Pull locking updates from Thomas Gleixner:
      
       - Move futex code into kernel/futex/ and split up the kitchen sink into
         separate files to make integration of sys_futex_waitv() simpler.
      
       - Add a new sys_futex_waitv() syscall which allows waiting on multiple
         futexes.
      
         The main use case is emulating Windows' WaitForMultipleObjects which
         allows Wine to improve the performance of Windows games. Also native
         Linux games can benefit from this interface as this is a common wait
         pattern for this kind of application.
      
       - Add context to ww_mutex_trylock() to provide a path for i915 to
         rework their eviction code step by step without making lockdep upset
         until the final steps of rework are completed. It's also useful for
         regulator and TTM to avoid dropping locks in the non contended path.
      
       - Lockdep and might_sleep() cleanups and improvements
      
       - A few improvements for the RT substitutions.
      
       - The usual small improvements and cleanups.
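
For readers who want to try the new syscall from user space, a minimal sketch of a raw futex_waitv() call is given below. It assumes Linux and a C library that may not yet ship a wrapper: the struct is declared by hand to mirror the uAPI layout, the fallback syscall number 449 is the value used on x86-64 and in the generic syscall table, and wait_on_two() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef SYS_futex_waitv
#define SYS_futex_waitv 449         /* assumption: x86-64 / generic table */
#endif

#define SKETCH_FUTEX_32 2           /* mirrors FUTEX_32 from the uAPI header */

/* Hand-written mirror of struct futex_waitv from the uAPI header. */
struct sketch_futex_waitv {
    uint64_t val;                   /* expected value of the futex word */
    uint64_t uaddr;                 /* address of the 32-bit futex word */
    uint32_t flags;
    uint32_t reserved;              /* must be zero */
};

/*
 * Wait until one of two futex words is woken. Returns the index of the
 * woken futex, or -1 with errno set (EAGAIN if a word did not hold the
 * expected value, ENOSYS on kernels without the syscall).
 */
long wait_on_two(uint32_t *a, uint32_t expect_a,
                 uint32_t *b, uint32_t expect_b)
{
    struct sketch_futex_waitv w[2] = {
        { .val = expect_a, .uaddr = (uintptr_t)a, .flags = SKETCH_FUTEX_32 },
        { .val = expect_b, .uaddr = (uintptr_t)b, .flags = SKETCH_FUTEX_32 },
    };

    /* NULL timeout means wait indefinitely; the clock id is then unused. */
    return syscall(SYS_futex_waitv, w, 2, 0, NULL, CLOCK_MONOTONIC);
}
```

Note that the call blocks when both words hold their expected values, so another thread has to issue a futex wake on one of the addresses for it to return.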
      
      * tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
        locking: Remove spin_lock_flags() etc
        locking/rwsem: Fix comments about reader optimistic lock stealing conditions
        locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()
        locking/rwsem: Disable preemption for spinning region
        docs: futex: Fix kernel-doc references
        futex: Fix PREEMPT_RT build
        futex2: Documentation: Document sys_futex_waitv() uAPI
        selftests: futex: Test sys_futex_waitv() wouldblock
        selftests: futex: Test sys_futex_waitv() timeout
        selftests: futex: Add sys_futex_waitv() test
        futex,arm: Wire up sys_futex_waitv()
        futex,x86: Wire up sys_futex_waitv()
        futex: Implement sys_futex_waitv()
        futex: Simplify double_lock_hb()
        futex: Split out wait/wake
        futex: Split out requeue
        futex: Rename mark_wake_futex()
        futex: Rename: match_futex()
        futex: Rename: hb_waiter_{inc,dec,pending}()
        futex: Split out PI futex
        ...
    • Merge tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 91e1c99e
      Linus Torvalds authored
      Pull perf updates from Thomas Gleixner:
       "Core:
      
         - Allow ftrace to instrument parts of the perf core code
      
         - Add a new mem_hops field to perf_mem_data_src which allows to
           represent intra-node/package or inter-node/off-package details to
           prepare for next generation systems which have more hierarchy
           within the node/package level.
      
        Tools:
      
         - Update for the new mem_hops field in perf_mem_data_src
      
        Arch:
      
         - A set of constraints fixes for the Intel uncore PMU
      
         - The usual set of small fixes and improvements for x86 and PPC"
      
      * tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings
        powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses
        tools/perf: Add mem_hops field in perf_mem_data_src structure
        perf: Add mem_hops field in perf_mem_data_src structure
        perf: Add comment about current state of PERF_MEM_LVL_* namespace and remove an extra line
        perf/core: Allow ftrace for functions in kernel/event/core.c
        perf/x86: Add new event for AUX output counter index
        perf/x86: Add compiler barrier after updating BTS
        perf/x86/intel/uncore: Fix Intel SPR M3UPI event constraints
        perf/x86/intel/uncore: Fix Intel SPR M2PCIE event constraints
        perf/x86/intel/uncore: Fix Intel SPR IIO event constraints
        perf/x86/intel/uncore: Fix Intel SPR CHA event constraints
        perf/x86/intel/uncore: Fix Intel ICX IIO event constraints
        perf/x86/intel/uncore: Fix invalid unit check
        perf/x86/intel/uncore: Support extra IMC channel on Ice Lake server
    • Merge tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5a47ebe9
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "Updates for the interrupt subsystem:
      
        Core changes:
      
         - Prevent a potential deadlock when initial priority is assigned to a
           newly created interrupt thread. A recent change to plug a race
           between cpuset and __sched_setscheduler() introduced a new lock
           dependency which is now triggered. Break the lock dependency chain
           by moving the priority assignment to the thread function.
      
         - A couple of small updates to make the irq core RT safe.
      
         - Confine the irq_cpu_online/offline() API to the only remaining
           unfixable user, Cavium Octeon, so that it does not grow new usage.
      
         - A small documentation update
      
        Driver changes:
      
         - A large cross architecture rework to move irq_enter/exit() into the
           architecture code to make addressing the NOHZ_FULL/RCU issues
           simpler.
      
         - The obligatory new irq chip driver for Microchip EIC
      
         - Modularize a few irq chip drivers
      
         - Expand usage of devm_*() helpers throughout the driver code
      
         - The usual small fixes and improvements all over the place"
      
      * tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
        h8300: Fix linux/irqchip.h include mess
        dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
        MIPS: irq: Avoid an unused-variable error
        genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
        irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
        MIPS: loongson64: Drop call to irq_cpu_offline()
        irq: remove handle_domain_{irq,nmi}()
        irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
        irq: riscv: perform irqentry in entry code
        irq: openrisc: perform irqentry in entry code
        irq: csky: perform irqentry in entry code
        irq: arm64: perform irqentry in entry code
        irq: arm: perform irqentry in entry code
        irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
        irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
        irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
        irq: add generic_handle_arch_irq()
        irq: unexport handle_irq_desc()
        irq: simplify handle_domain_{irq,nmi}()
        irq: mips: simplify do_domain_IRQ()
        ...
    • Merge tag 'for-5.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 037c50bf
      Linus Torvalds authored
      Pull btrfs updates from David Sterba:
       "The updates this time are more under the hood and enhancing existing
        features (subpage with compression and zoned namespaces).
      
        Performance related:
      
         - misc small inode logging improvements (+3% throughput, -11% latency
           on sample dbench workload)
      
         - more efficient directory logging: bulk item insertion, fewer tree
           searches and less locking
      
         - speed up bulk insertion of items into a b-tree, which is used when
           logging directories, when running delayed items for directories
           (fsync and transaction commits) and when running the slow path
           (full sync) of an fsync (bulk creation run time -4%, deletion -12%)
      
        Core:
      
         - continued subpage support
            - make defragmentation work
            - make compression write work
      
         - zoned mode
            - support ZNS (zoned namespaces), zone capacity is number of
              usable blocks in each zone
            - add dedicated block group (zoned) for relocation, to prevent
              out of order writes in some cases
            - greedy block group reclaim, pick the ones with least usable
              space first
      
         - preparatory work for send protocol updates
      
         - error handling improvements
      
         - cleanups and refactoring
      
        Fixes:
      
         - lockdep warnings
            - in show_devname callback, on seeding device
            - device delete on loop device due to conversions to workqueues
      
         - fix deadlock between chunk allocation and chunk btree modifications
      
         - fix tracking of missing device count and status"
      
      * tag 'for-5.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (140 commits)
        btrfs: remove root argument from check_item_in_log()
        btrfs: remove root argument from add_link()
        btrfs: remove root argument from btrfs_unlink_inode()
        btrfs: remove root argument from drop_one_dir_item()
        btrfs: clear MISSING device status bit in btrfs_close_one_device
        btrfs: call btrfs_check_rw_degradable only if there is a missing device
        btrfs: send: prepare for v2 protocol
        btrfs: fix comment about sector sizes supported in 64K systems
        btrfs: update device path inode time instead of bd_inode
        fs: export an inode_update_time helper
        btrfs: fix deadlock when defragging transparent huge pages
        btrfs: sysfs: convert scnprintf and snprintf to sysfs_emit
        btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE
        btrfs: update comments for chunk allocation -ENOSPC cases
        btrfs: fix deadlock between chunk allocation and chunk btree modifications
        btrfs: zoned: use greedy gc for auto reclaim
        btrfs: check-integrity: stop storing the block device name in btrfsic_dev_state
        btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
        btrfs: add a btrfs_get_dev_args_from_path helper
        btrfs: handle device lookup with btrfs_dev_lookup_args
        ...
      037c50bf
    • btrfs: fix lzo_decompress_bio() kmap leakage · 2cf3f813
      Linus Torvalds authored
      Commit ccaa66c8 reinstated the kmap/kunmap that had been dropped in
      commit 8c945d32 ("btrfs: compression: drop kmap/kunmap from lzo").
      
      However, it seems to have done so incorrectly due to the change not
      reverting cleanly, and lzo_decompress_bio() ended up not having a
      matching "kunmap()" to the "kmap()" that was put back.
      
      Also, any assert that the page pointer is not NULL should be before the
      kmap() of said pointer, since otherwise you'd just oops in the kmap()
      before the assert would even trigger.
      
      I noticed this when trying to verify my btrfs merge, and things not
      adding up.  I'm doing this fixup before re-doing my merge, because this
      commit needs to also be backported to 5.15 (after verification from the
      btrfs people).
      
      Fixes: ccaa66c8 ("Revert 'btrfs: compression: drop kmap/kunmap from lzo'")
      Cc: David Sterba <dsterba@suse.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2cf3f813
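The commit message above makes two points: every kmap() needs a matching kunmap(), and a NULL assert on the page belongs before the mapping call. A minimal userspace sketch of that ordering follows. It is not the btrfs code: `fake_page`, `fake_kmap()`, `fake_kunmap()` and `read_first_byte()` are hypothetical stand-ins that only count mappings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; map_count tracks live mappings. */
struct fake_page {
	int map_count;
	char data[64];
};

/* Stand-in for kmap(): touches the page, so a NULL page would fault here. */
static char *fake_kmap(struct fake_page *page)
{
	page->map_count++;
	return page->data;
}

/* Stand-in for kunmap(): releases one mapping taken by fake_kmap(). */
static void fake_kunmap(struct fake_page *page)
{
	page->map_count--;
}

/* Reads the first byte of the page, keeping it mapped only while needed. */
static int read_first_byte(struct fake_page *page)
{
	char *kaddr;
	int value;

	assert(page != NULL);		/* check before mapping, not after */
	kaddr = fake_kmap(page);
	value = kaddr[0];
	fake_kunmap(page);		/* every map has a matching unmap */
	return value;
}
```

After `read_first_byte()` returns, `map_count` is back to zero. A leaked mapping of the kind described above would leave it at one.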
    • Merge tag 'exfat-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat · 9c6e8d52
      Linus Torvalds authored
      Pull exfat fix from Namjae Jeon:
       "Fix ->i_blocks truncation issue caused by wrong 32bit mask"
      
      * tag 'exfat-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
        exfat: fix incorrect loading of i_blocks for large files
      9c6e8d52
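The pull message above attributes the ->i_blocks truncation to a wrong 32-bit mask. The log does not show the patch, so the following is only a sketch of one common way such a bug arises in C: a round-up mask computed in 32 bits and then applied to a 64-bit size. The function names are hypothetical, and the cluster size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy variant: the mask is computed in 32 bits, then zero-extended. */
static uint64_t round_up_narrow_mask(uint64_t size, uint32_t cluster_size)
{
	uint32_t mask = ~(cluster_size - 1);	/* upper 32 bits are lost */

	return (size + (cluster_size - 1)) & mask;
}

/* Fixed variant: widen before inverting so the mask keeps all 64 bits. */
static uint64_t round_up_wide_mask(uint64_t size, uint32_t cluster_size)
{
	uint64_t mask = ~((uint64_t)cluster_size - 1);

	/* The mask trick is only valid for power-of-two cluster sizes. */
	assert(cluster_size != 0 && (cluster_size & (cluster_size - 1)) == 0);
	return (size + (cluster_size - 1)) & mask;
}
```

For sizes below 4 GiB both variants agree. Above that, the narrow mask silently clears the high bits of the result.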
    • Merge tag 'erofs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 67a135b8
      Linus Torvalds authored
      Pull erofs updates from Gao Xiang:
       "There are some new features available for this cycle. Firstly, EROFS
        LZMA algorithm support, specifically called MicroLZMA, is available as
        an option for embedded devices, LiveCDs and/or as the secondary
        auxiliary compression algorithm besides the primary algorithm in one
        file.
      
        In order to better support the LZMA fixed-sized output compression,
        especially for 4KiB pcluster size (which has lowest memory pressure
        thus useful for memory-sensitive scenarios), Lasse introduced a new
        LZMA header/container format called MicroLZMA to minimize the original
        LZMA1 header (for example, we don't need to waste 4-byte dictionary
        size and another 8-byte uncompressed size, which can be calculated by
        fs directly, for each pcluster) and enable EROFS fixed-sized output
        compression.
      
        Note that MicroLZMA can later be used by things other than EROFS
        where wasting a minimal amount of space on headers is important,
        and it is only compiled when XZ_DEC_MICROLZMA is enabled.
        MicroLZMA is supported by the latest upstream XZ embedded [1] &
        XZ utils [2]; this applies the latest related XZ embedded upstream
        patches by the XZ author Lasse.
      
        Secondly, multiple devices are also supported in this cycle, a
        feature designed for multi-layer container images. By working together with
        inter-layer data deduplication and compression, we can achieve the
        next high-performance container image solution. Our team will announce
        the new Nydus container image service [3] implementation with new RAFS
        v6 (EROFS-compatible) format in Open Source Summit 2021 China [4]
        soon.
      
        Besides, the secondary compression head support and readmore
        decompression strategy are also included in this cycle. There are also
        some minor bugfixes and cleanups, as always.
      
        Summary:
      
         - support multiple devices for multi-layer container images;
      
         - support the secondary compression head;
      
         - support readmore decompression strategy;
      
         - support new LZMA algorithm (specifically called MicroLZMA);
      
         - some bugfixes & cleanups"
      
      * tag 'erofs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        erofs: don't trigger WARN() when decompression fails
        erofs: get rid of ->lru usage
        erofs: lzma compression support
        erofs: rename some generic methods in decompressor
        lib/xz, lib/decompress_unxz.c: Fix spelling in comments
        lib/xz: Add MicroLZMA decoder
        lib/xz: Move s->lzma.len = 0 initialization to lzma_reset()
        lib/xz: Validate the value before assigning it to an enum variable
        lib/xz: Avoid overlapping memcpy() with invalid input with in-place decompression
        erofs: introduce readmore decompression strategy
        erofs: introduce the secondary compression head
        erofs: get compression algorithms directly on mapping
        erofs: add multiple device support
        erofs: decouple basic mount options from fs_context
        erofs: remove the fast path of per-CPU buffer decompression
      67a135b8
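The erofs message above says the LZMA1 header would cost a 4-byte dictionary size plus an 8-byte uncompressed size for each pcluster. The arithmetic can be sketched as follows. The field sizes come from the message, while the function names and the parts-per-million framing are illustrative assumptions. The actual MicroLZMA layout is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

enum {
	LZMA1_DICT_SIZE_BYTES = 4,	/* dictionary size field, per the message */
	LZMA1_UNCOMP_SIZE_BYTES = 8,	/* uncompressed size field, per the message */
};

/* Header bytes avoided across an image made of nr_pclusters pclusters. */
static uint64_t header_bytes_saved(uint64_t nr_pclusters)
{
	return nr_pclusters * (LZMA1_DICT_SIZE_BYTES + LZMA1_UNCOMP_SIZE_BYTES);
}

/* The same saving in parts per million of the pcluster size (rounded down). */
static uint64_t header_overhead_ppm(uint32_t pcluster_size)
{
	assert(pcluster_size != 0);
	return (uint64_t)(LZMA1_DICT_SIZE_BYTES + LZMA1_UNCOMP_SIZE_BYTES) *
	       1000000u / pcluster_size;
}
```

With 4KiB pclusters this comes to 12 bytes out of every 4096, or roughly 0.29% of the compressed data.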
    • Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt · cd3e8ea8
      Linus Torvalds authored
      Pull fscrypt updates from Eric Biggers:
       "Some cleanups for fs/crypto/:
      
         - Allow 256-bit master keys with AES-256-XTS
      
         - Improve documentation and comments
      
         - Remove unneeded field fscrypt_operations::max_namelen"
      
      * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
        fscrypt: improve a few comments
        fscrypt: allow 256-bit master keys with AES-256-XTS
        fscrypt: improve documentation for inline encryption
        fscrypt: clean up comments in bio.c
        fscrypt: remove fscrypt_operations::max_namelen
      cd3e8ea8
    • Merge tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block · 19901165
      Linus Torvalds authored
      Pull block inode sync updates from Jens Axboe:
       "This contains improvements to how bdev inode syncing is handled,
        unifying the API"
      
      * tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block:
        block: simplify the block device syncing code
        ntfs3: use sync_blockdev_nowait
        fat: use sync_blockdev_nowait
        btrfs: use sync_blockdev
        xen-blkback: use sync_blockdev
        block: remove __sync_blockdev
        fs: remove __sync_filesystem
      19901165
    • Merge tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block · b6773cdb
      Linus Torvalds authored
      Pull kiocb->ki_complete() cleanup from Jens Axboe:
       "This removes the res2 argument from kiocb->ki_complete().
      
        Only the USB gadget code used it, everybody else passes 0. The USB
        guys checked the user gadget code they could find, and everybody just
        uses res as expected for the async interface"
      
      * tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block:
        fs: get rid of the res2 iocb->ki_complete argument
        usb: remove res2 argument from gadget code completions
      b6773cdb
    • Merge tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block · 71ae4262
      Linus Torvalds authored
      Pull QUEUE_FLAG_SCSI_PASSTHROUGH removal from Jens Axboe:
       "This contains a series leading to the removal of the
        QUEUE_FLAG_SCSI_PASSTHROUGH queue flag"
      
      * tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block:
        block: remove blk_{get,put}_request
        block: remove QUEUE_FLAG_SCSI_PASSTHROUGH
        block: remove the initialize_rq_fn blk_mq_ops method
        scsi: add a scsi_alloc_request helper
        bsg-lib: initialize the bsg_job in bsg_transport_sg_io_fn
        nfsd/blocklayout: use ->get_unique_id instead of sending SCSI commands
        sd: implement ->get_unique_id
        block: add a ->get_unique_id method
      71ae4262
    • Merge tag 'for-5.16/cdrom-2021-10-29' of git://git.kernel.dk/linux-block · 737f1cd8
      Linus Torvalds authored
      Pull CDROM updates from Jens Axboe:
       "On behalf of Phillip, here are the CDROM updates for the 5.16-rc1
        merge window:
      
         - Add ioctl for improved media change detection (Lukas)
      
         - Reformat some documentation (Phillip)
      
         - Redundant variable removal (luo)"
      
      * tag 'for-5.16/cdrom-2021-10-29' of git://git.kernel.dk/linux-block:
        cdrom: Remove redundant variable and its assignment
        cdrom: docs: reformat table in Documentation/userspace-api/ioctl/cdrom.rst
        drivers/cdrom: improved ioctl for media change detection
      737f1cd8
    • Merge tag 'for-5.16/scsi-ma-2021-10-29' of git://git.kernel.dk/linux-block · fcaec17b
      Linus Torvalds authored
      Pull SCSI multi-actuator support from Jens Axboe:
       "This adds SCSI support for the recently merged block multi-actuator
        support. Since this was sitting on top of the block tree, the SCSI
        side asked me to queue it up."
      
      * tag 'for-5.16/scsi-ma-2021-10-29' of git://git.kernel.dk/linux-block:
        doc: Fix typo in request queue sysfs documentation
        doc: document sysfs queue/independent_access_ranges attributes
        libata: support concurrent positioning ranges log
        scsi: sd: add concurrent positioning ranges support
      fcaec17b
    • Merge tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-block · 3f01727f
      Linus Torvalds authored
      Pull bdev size cleanups from Jens Axboe:
       "Clean up the bdev size handling with new bdev_nr_bytes() helper"
      
      * tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-block: (34 commits)
        partitions/ibm: use bdev_nr_sectors instead of open coding it
        partitions/efi: use bdev_nr_bytes instead of open coding it
        block/ioctl: use bdev_nr_sectors and bdev_nr_bytes
        block: cache inode size in bdev
        udf: use sb_bdev_nr_blocks
        reiserfs: use sb_bdev_nr_blocks
        ntfs: use sb_bdev_nr_blocks
        jfs: use sb_bdev_nr_blocks
        ext4: use sb_bdev_nr_blocks
        block: add a sb_bdev_nr_blocks helper
        block: use bdev_nr_bytes instead of open coding it in blkdev_fallocate
        squashfs: use bdev_nr_bytes instead of open coding it
        reiserfs: use bdev_nr_bytes instead of open coding it
        pstore/blk: use bdev_nr_bytes instead of open coding it
        ntfs3: use bdev_nr_bytes instead of open coding it
        nilfs2: use bdev_nr_bytes instead of open coding it
        nfs/blocklayout: use bdev_nr_bytes instead of open coding it
        jfs: use bdev_nr_bytes instead of open coding it
        hfsplus: use bdev_nr_sectors instead of open coding it
        hfs: use bdev_nr_sectors instead of open coding it
        ...
      3f01727f
    • Merge tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block · 8d1f0177
      Linus Torvalds authored
      Pull io_uring updates from Jens Axboe:
       "Light on new features - basically just the hybrid mode support.
      
        Outside of that it's just fixes, cleanups, and performance
        improvements.
      
        In detail:
      
         - Add ring related information to the fdinfo output (Hao)
      
         - Hybrid async mode (Hao)
      
         - Support for batched issue on block (me)
      
         - sqe error trace improvement (me)
      
         - IOPOLL efficiency improvements (Pavel)
      
         - submit state cleanups and improvements (Pavel)
      
         - Completion side improvements (Pavel)
      
         - Drain improvements (Pavel)
      
         - Buffer selection cleanups (Pavel)
      
         - Fixed file node improvements (Pavel)
      
         - io-wq setup cancelation fix (Pavel)
      
         - Various other performance improvements and cleanups (Pavel)
      
         - Misc fixes (Arnd, Bixuan, Changcheng, Hao, me, Noah)"
      
      * tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block: (97 commits)
        io-wq: remove worker to owner tw dependency
        io_uring: harder fdinfo sq/cq ring iterating
        io_uring: don't assign write hint in the read path
        io_uring: clusterise ki_flags access in rw_prep
        io_uring: kill unused param from io_file_supports_nowait
        io_uring: clean up timeout async_data allocation
        io_uring: don't try io-wq polling if not supported
        io_uring: check if opcode needs poll first on arming
        io_uring: clean iowq submit work cancellation
        io_uring: clean io_wq_submit_work()'s main loop
        io-wq: use helper for worker refcounting
        io_uring: implement async hybrid mode for pollable requests
        io_uring: Use ERR_CAST() instead of ERR_PTR(PTR_ERR())
        io_uring: split logic of force_nonblock
        io_uring: warning about unused-but-set parameter
        io_uring: inform block layer of how many requests we are submitting
        io_uring: simplify io_file_supports_nowait()
        io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags
        io_uring: arm poll for non-nowait files
        fs/io_uring: Prioritise checking faster conditions first in io_write
        ...
      8d1f0177
    • Merge tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block · 643a7234
      Linus Torvalds authored
      Pull block driver updates from Jens Axboe:
      
       - paride driver cleanups (Christoph)
      
       - Remove cryptoloop support (Christoph)
      
       - null_blk poll support (me)
      
       - Now that add_disk() supports proper error handling, add it to various
         drivers (Luis)
      
       - Make ataflop actually work again (Michael)
      
       - s390 dasd fixes (Stefan, Heiko)
      
       - nbd fixes (Yu, Ye)
      
       - Remove redundant wq flush in mtip32xx (Christophe)
      
       - NVMe updates
            - fix a multipath partition scanning deadlock (Hannes Reinecke)
            - generate uevent once a multipath namespace is operational again
              (Hannes Reinecke)
            - support unique discovery controller NQNs (Hannes Reinecke)
            - fix use-after-free when a port is removed (Israel Rukshin)
            - clear shadow doorbell memory on resets (Keith Busch)
            - use struct_size (Len Baker)
            - add error handling support for add_disk (Luis Chamberlain)
            - limit the maximal queue size for RDMA controllers (Max Gurtovoy)
            - use a few more symbolic names (Max Gurtovoy)
            - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy)
            - add support for ->map_queues on FC (Saurav Kashyap)
            - support the current discovery subsystem entry (Hannes Reinecke)
            - use flex_array_size and struct_size (Len Baker)
      
       - bcache fixes (Christoph, Coly, Chao, Lin, Qing)
      
       - MD updates (Christoph, Guoqing, Xiao)
      
       - Misc fixes (Dan, Ding, Jiapeng, Shin'ichiro, Ye)
      
      * tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block: (117 commits)
        null_blk: Fix handling of submit_queues and poll_queues attributes
        block: ataflop: Fix warning comparing pointer to 0
        bcache: replace snprintf in show functions with sysfs_emit
        bcache: move uapi header bcache.h to bcache code directory
        nvmet: use flex_array_size and struct_size
        nvmet: register discovery subsystem as 'current'
        nvmet: switch check for subsystem type
        nvme: add new discovery log page entry definitions
        block: ataflop: more blk-mq refactoring fixes
        block: remove support for cryptoloop and the xor transfer
        mtd: add add_disk() error handling
        rnbd: add error handling support for add_disk()
        um/drivers/ubd_kern: add error handling support for add_disk()
        m68k/emu/nfblock: add error handling support for add_disk()
        xen-blkfront: add error handling support for add_disk()
        bcache: add error handling support for add_disk()
        dm: add add_disk() error handling
        block: aoe: fixup coccinelle warnings
        nvmet: use struct_size over open coded arithmetic
        nvme: drop scan_lock and always kick requeue list when removing namespaces
        ...
      643a7234
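Several NVMe items above replace open-coded size arithmetic with struct_size and flex_array_size. The idea can be sketched in userspace as follows, assuming GCC or Clang for the overflow builtins. `struct entry_list` and `entry_list_size()` are hypothetical. The kernel helper is a macro with a different signature, and the part mirrored here is that the size saturates to SIZE_MAX instead of wrapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record with a trailing flexible array member. */
struct entry_list {
	uint32_t count;
	uint64_t entries[];
};

/* Bytes needed for the header plus n entries, or SIZE_MAX on overflow. */
static size_t entry_list_size(size_t n)
{
	size_t array_bytes, total;

	if (__builtin_mul_overflow(n, sizeof(uint64_t), &array_bytes))
		return SIZE_MAX;
	if (__builtin_add_overflow(sizeof(struct entry_list), array_bytes, &total))
		return SIZE_MAX;
	return total;
}
```

Because the result saturates, an allocator asked for `entry_list_size(n)` bytes fails cleanly on an absurd `n` rather than returning a buffer that is too small.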
    • Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block · 33c8846c
      Linus Torvalds authored
      Pull block updates from Jens Axboe:
      
       - mq-deadline accounting improvements (Bart)
      
       - blk-wbt timer fix (Andrea)
      
       - Untangle the block layer includes (Christoph)
      
       - Rework the poll support to be bio based, which will enable adding
         support for polling for bio based drivers (Christoph)
      
       - Block layer core support for multi-actuator drives (Damien)
      
       - blk-crypto improvements (Eric)
      
       - Batched tag allocation support (me)
      
       - Request completion batching support (me)
      
       - Plugging improvements (me)
      
       - Shared tag set improvements (John)
      
       - Concurrent queue quiesce support (Ming)
      
       - Cache bdev in ->private_data for block devices (Pavel)
      
       - bdev dio improvements (Pavel)
      
       - Block device invalidation and block size improvements (Xie)
      
       - Various cleanups, fixes, and improvements (Christoph, Jackie,
         Masahira, Tejun, Yu, Pavel, Zheng, me)
      
      * tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
        blk-mq-debugfs: Show active requests per queue for shared tags
        block: improve readability of blk_mq_end_request_batch()
        virtio-blk: Use blk_validate_block_size() to validate block size
        loop: Use blk_validate_block_size() to validate block size
        nbd: Use blk_validate_block_size() to validate block size
        block: Add a helper to validate the block size
        block: re-flow blk_mq_rq_ctx_init()
        block: prefetch request to be initialized
        block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
        block: add rq_flags to struct blk_mq_alloc_data
        block: add async version of bio_set_polled
        block: kill DIO_MULTI_BIO
        block: kill unused polling bits in __blkdev_direct_IO()
        block: avoid extra iter advance with async iocb
        block: Add independent access ranges support
        blk-mq: don't issue request directly in case that current is to be blocked
        sbitmap: silence data race warning
        blk-cgroup: synchronize blkg creation against policy deactivation
        block: refactor bio_iov_bvec_set()
        block: add single bio async direct IO helper
        ...
      33c8846c
    • Merge tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux · 9ac21142
      Linus Torvalds authored
      Pull file locking updates from Jeff Layton:
       "Most of this is just follow-on cleanup work of documentation and
        comments from the mandatory locking removal in v5.15.
      
        The only real functional change is that LOCK_MAND flock() support is
        also being removed, as it has basically been non-functional since the
        v2.5 days"
      
      * tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
        fs: remove leftover comments from mandatory locking removal
        locks: remove changelog comments
        docs: fs: locks.rst: update comment about mandatory file locking
        Documentation: remove reference to now removed mandatory-locking doc
        locks: remove LOCK_MAND flock lock support
      9ac21142
    • Merge tag 'tpmdd-next-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd · ad98a924
      Linus Torvalds authored
      Pull tpm updates from Jarkko Sakkinen:
       "Only bug fixes"
      
      * tag 'tpmdd-next-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
        tpm_tis_spi: Add missing SPI ID
        tpm: fix Atmel TPM crash caused by too frequent queries
        tpm: Check for integer overflow in tpm2_map_response_body()
        tpm: tis: Kconfig: Add helper dependency on COMPILE_TEST
      ad98a924
    • Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache · 49f8275c
      Linus Torvalds authored
      Pull memory folios from Matthew Wilcox:
       "Add memory folios, a new type to represent either order-0 pages or the
        head page of a compound page. This should be enough infrastructure to
        support filesystems converting from pages to folios.
      
        The point of all this churn is to allow filesystems and the page cache
        to manage memory in larger chunks than PAGE_SIZE. The original plan
        was to use compound pages like THP does, but I ran into problems with
        some functions expecting only a head page while others expect the
        precise page containing a particular byte.
      
        The folio type allows a function to declare that it's expecting only a
        head page. Almost incidentally, this allows us to remove various calls
        to VM_BUG_ON(PageTail(page)) and compound_head().
      
        This converts just parts of the core MM and the page cache. For 5.17,
        we intend to convert various filesystems (XFS and AFS are ready; other
        filesystems may make it) and also convert more of the MM and page
        cache to folios. For 5.18, multi-page folios should be ready.
      
        The multi-page folios offer some improvement to some workloads. The
        80% win is real, but appears to be an artificial benchmark (postgres
        startup, which isn't a serious workload). Real workloads (eg building
        the kernel, running postgres in a steady state, etc) seem to benefit
        between 0-10%. I haven't heard of any performance losses as a result
        of this series. Nobody has done any serious performance tuning; I
        imagine that tweaking the readahead algorithm could provide some more
        interesting wins. There are also other places where we could choose to
        create large folios and currently do not, such as writes that are
        larger than PAGE_SIZE.
      
        I'd like to thank all my reviewers who've offered review/ack tags:
        Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes
        Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil
        Babka, William Kucharski, Yu Zhao and Zi Yan.
      
        I'd also like to thank those who gave feedback I incorporated but
        haven't offered up review tags for this part of the series: Nick
        Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard,
        Hugh Dickins, and probably a few others who I forget"
      
      * tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits)
        mm/writeback: Add folio_write_one
        mm/filemap: Add FGP_STABLE
        mm/filemap: Add filemap_get_folio
        mm/filemap: Convert mapping_get_entry to return a folio
        mm/filemap: Add filemap_add_folio()
        mm/filemap: Add filemap_alloc_folio
        mm/page_alloc: Add folio allocation functions
        mm/lru: Add folio_add_lru()
        mm/lru: Convert __pagevec_lru_add_fn to take a folio
        mm: Add folio_evictable()
        mm/workingset: Convert workingset_refault() to take a folio
        mm/filemap: Add readahead_folio()
        mm/filemap: Add folio_mkwrite_check_truncate()
        mm/filemap: Add i_blocks_per_folio()
        mm/writeback: Add folio_redirty_for_writepage()
        mm/writeback: Add folio_account_redirty()
        mm/writeback: Add folio_clear_dirty_for_io()
        mm/writeback: Add folio_cancel_dirty()
        mm/writeback: Add folio_account_cleaned()
        mm/writeback: Add filemap_dirty_folio()
        ...
      49f8275c
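The folio message above explains that the new type lets a function declare that it expects only a head page, which removes the need for tail-page checks and compound_head() calls in callees. A simplified userspace model of that idea follows. It does not reproduce the kernel's struct page or struct folio layout: the `head` and `order` fields and both `_sketch` functions are hypothetical, and the folio is modeled as a by-value handle purely to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical page model. */
struct page {
	struct page *head;	/* NULL for a head or order-0 page */
	unsigned int order;	/* meaningful on head pages only */
};

/* A folio handle never refers to a tail page; the type carries that promise. */
struct folio {
	struct page *head;
};

/* Resolve any page, head or tail, to its folio once at the boundary. */
static struct folio page_folio_sketch(struct page *page)
{
	struct folio folio = { .head = page->head ? page->head : page };

	return folio;
}

/* Callees taking a folio need no tail-page handling of their own. */
static size_t folio_nr_pages_sketch(struct folio folio)
{
	assert(folio.head->head == NULL);
	return (size_t)1 << folio.head->order;
}
```

The conversion from page to folio happens in one place, so functions that accept a `struct folio` can rely on the head-page invariant instead of re-checking it.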
  2. 31 Oct, 2021 11 commits