1. 11 Mar, 2022 12 commits
    • parisc/unaligned: Rewrite inline assembly of emulate_ldh() · f85b2af1
      Helge Deller authored
      Convert to use real temp variables instead of clobbering processor
      registers.
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc/unaligned: Use EFAULT fixup handler in unaligned handlers · d1434e03
      Helge Deller authored
      Convert the inline assembly code to use the automatic EFAULT exception
      handler. With that the fixup code can be dropped.
      
      The other change is to allow double-word only when a 64-bit kernel is
      used instead of depending on CONFIG_PA20.
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Reduce code size by optimizing get_current() function calls · 8278cc16
      Helge Deller authored
      The get_current() code uses the mfctl() macro to get the pointer to the
      current task struct from %cr30. The mfctl() macro is marked volatile,
      which is basically correct, because mfctl() is also used to read values
      that do change, e.g. the current internal timer or interrupt flags.
      
      But specifically the task struct pointer (%cr30) doesn't change over
      time when the kernel executes code for a task.
      
      So, by dropping the volatile when retrieving %cr30 the compiler is now
      able to get this value only once and optimize the generated code a lot.
      
      A bloat-o-meter comparison shows that this patch saves ~5kB kernel code
      on a 32-bit kernel and ~6kB kernel code on a 64-bit kernel.
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Use constants to encode the space registers like SR_KERNEL · 360bd6c6
      Helge Deller authored
      Use the provided space register constants instead of hardcoded values.
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Use SR_USER and SR_KERNEL in get_user() and put_user() · 5613a930
      Helge Deller authored
      Instead of hardcoding the space registers as strings, use the SR_USER
      and SR_KERNEL constants to form the space register in the access
      functions.
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Add defines for various space register · 46b4016f
      Helge Deller authored
      Provide defines for space registers (SR_KERNEL, SR_USER, ...) which
      should be used instead of hardcoding the values.
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Always use the self-extracting kernel feature · b9f50eea
      Helge Deller authored
      This patch drops the CONFIG_PARISC_SELF_EXTRACT option.
      
      The palo boot loader is able to decompress a kernel which was compressed
      with gzip. That possibility was useful when the Linux kernel
      self-extracting feature wasn't implemented yet.
      
      Besides the fact that the self-extracting feature offers much better
      compression rates, self-extracting kernels have been supported since
      kernel v4.14, so now it's really time to get rid of that old option and
      always use the self-extractor.
      Signed-off-by: Helge Deller <deller@gmx.de>
    • video/fbdev/stifb: Implement the stifb_fillrect() function · 9c379c65
      Helge Deller authored
      The stifb driver (for Artist/HCRX graphics on PA-RISC) was missing
      the fillrect function.
      Tested on a 715/64 PA-RISC machine and in qemu.
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Add vDSO support · df24e178
      Helge Deller authored
      Add minimal vDSO support, which provides the signal trampoline helpers,
      but none of the userspace syscall helpers like time wrappers.
      
      The big benefit of this vDSO implementation is that we no longer need
      an executable stack. PA-RISC is one of the last architectures where an
      executable stack was needed in order to implement the signal
      trampolines by putting assembly instructions on the stack, which then
      get executed. Instead, the kernel will provide the relevant code in the
      vDSO page and only put the pointers to the signal information on the
      stack.
      
      By dropping the need for executable stacks we avoid running into issues
      with applications which want non executable stacks for security reasons.
      Additionally, alternative stacks on memory areas without exec
      permissions are supported too.
      
      This code is based on an initial implementation by Randolph Chung from 2006:
      https://lore.kernel.org/linux-parisc/4544A34A.6080700@tausq.org/
      
      I did the porting and lifted the code to the current code base. Dave fixed
      the unwind code so that gdb and glibc are able to backtrace through the
      code. An additional patch to gdb will be pushed upstream by Dave.
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Dave Anglin <dave.anglin@bell.net>
      Cc: Randolph Chung <randolph@tausq.org>
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Simplify fast path for non-access data TLB faults · 14615ecc
      John David Anglin authored
      With the latest cache fix for non-access faults and the support for
      non-access faults (code 17) in handle_interruption, we can remove
      the fast path emulation for fdc, fic, pdc, lpa, probe and probei
      instructions.
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Fix handling off probe non-access faults · e00b0a2a
      John David Anglin authored
      Currently, the parisc kernel does not fully support non-access TLB
      fault handling for probe instructions. In the fast path, we set the
      target register to zero if it is not a shadowed register. The slow
      path is not implemented, so we call do_page_fault. The architecture
      indicates that non-access faults should not cause a page fault from
      disk.
      
      This change adds code to provide non-access fault support for
      probe instructions. It also modifies the handling of faults on
      userspace so that if the address lies in a valid VMA and the access
      type matches that for the VMA, the probe target register is set to
      one. Otherwise, the target register is set to zero.
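
      The userspace rule described above can be illustrated with a small
      userspace model. This is only a sketch and not code from the patch:
      struct vma_model, enum probe_access and probe_result() are hypothetical
      names, and a real VMA lookup in the kernel is considerably more
      involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a VMA: [start, end) plus permissions. */
struct vma_model {
    unsigned long start;
    unsigned long end;
    bool readable;
    bool writable;
};

enum probe_access { PROBE_READ, PROBE_WRITE };

/*
 * Model of the value placed in the probe target register for a userspace
 * address: 1 if the address lies in a VMA whose permissions match the
 * requested access type, 0 otherwise.
 */
static int probe_result(const struct vma_model *vmas, size_t n,
                        unsigned long addr, enum probe_access access)
{
    for (size_t i = 0; i < n; i++) {
        const struct vma_model *v = &vmas[i];

        if (addr < v->start || addr >= v->end)
            continue;
        if (access == PROBE_READ)
            return v->readable ? 1 : 0;
        return v->writable ? 1 : 0;
    }
    return 0; /* no VMA covers the address */
}
```

      In this model the result depends only on the mapping and its
      permissions, not on whether the page is currently present in memory.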
      
      This was done to make probe instructions more useful for userspace.
      Probe instructions are not very useful if they set the target register
      to zero whenever a page is not present in memory. Nominally, the
      purpose of the probe instruction is to determine whether read or write
      access to a given address is allowed.
      
      This fixes a problem in function pointer comparison noticed in the
      glibc testsuite (stdio-common/tst-vfprintf-user-type). The same
      problem is likely in glibc (_dl_lookup_address).
      
      V2 adds flush and lpa instruction support to handle_nadtlb_fault.
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Fix non-access data TLB cache flush faults · f839e5f1
      John David Anglin authored
      When a page is not present, we get non-access data TLB faults from
      the fdc and fic instructions in flush_user_dcache_range_asm and
      flush_user_icache_range_asm. When these occur, the cache line is
      not invalidated and potentially we get memory corruption. The
      problem was hidden by the nullification of the flush instructions.
      
      These faults also affect performance. With pa8800/pa8900 processors,
      there will be 32 faults per 4 KB page since the cache line is 128
      bytes.  There will be more faults with earlier processors.
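
      The fault count above is simply the number of cache lines per page.
      As a minimal sketch (flush_faults_per_page() is a hypothetical helper,
      not a kernel function):

```c
#include <stddef.h>

/*
 * Number of cache lines in one page, i.e. the number of per-line flush
 * instructions (and thus potential non-access faults) for a page that is
 * not present.
 */
static size_t flush_faults_per_page(size_t page_size, size_t cache_line_size)
{
    return page_size / cache_line_size;
}
```

      With a 4096-byte page and a 128-byte line this gives 32. A smaller
      line size, such as an assumed 64 bytes, gives 64, which is why
      processors with shorter cache lines see more faults.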
      
      The problem is fixed by using flush_cache_pages(). It does the flush
      using a tmp alias mapping.
      
      The flush_cache_pages() call in flush_cache_range() flushed too
      large a range.
      
      V2: Remove unnecessary preempt_disable() and preempt_enable() calls.
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
  2. 06 Mar, 2022 5 commits
    • Linux 5.17-rc7 · ffb217a1
      Linus Torvalds authored
    • Merge tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 3ee65c0f
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
       "A few more fixes for various problems that have user visible effects
        or seem to be urgent:
      
         - fix corruption when combining DIO and non-blocking io_uring over
           multiple extents (seen on MariaDB)
      
         - fix relocation crash due to premature return from commit
      
         - fix quota deadlock between rescan and qgroup removal
      
         - fix item data bounds checks in tree-checker (found on a fuzzed
           image)
      
         - fix fsync of prealloc extents after EOF
      
         - add missing run of delayed items after unlink during log replay
      
         - don't start relocation until snapshot drop is finished
      
         - fix reversed condition for subpage writers locking
      
         - fix warning on page error"
      
      * tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fallback to blocking mode when doing async dio over multiple extents
        btrfs: add missing run of delayed items after unlink during log replay
        btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
        btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
        btrfs: do not start relocation until in progress drops are done
        btrfs: tree-checker: use u64 for item data end to avoid overflow
        btrfs: do not WARN_ON() if we have PageError set
        btrfs: fix lost prealloc extents beyond eof after full fsync
        btrfs: subpage: fix a wrong check on subpage->writers
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · f81664f7
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "x86 guest:
      
         - Tweaks to the paravirtualization code, to avoid using them when
           they're pointless or harmful
      
        x86 host:
      
         - Fix for SRCU lockdep splat
      
         - Brown paper bag fix for the propagation of errno"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
        KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
        KVM: x86: Yield to IPI target vCPU only if it is busy
        x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
        x86/kvm: Don't waste memory if kvmclock is disabled
        x86/kvm: Don't use PV TLB/yield when mwait is advertised
    • Merge tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 9bdeaca1
      Linus Torvalds authored
      Pull powerpc fix from Michael Ellerman:
       "Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set.
      
        Thanks to Murilo Opsfelder Araujo, and Erhard F"
      
      * tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set
    • Merge tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · f40a33f5
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix sorting on old "cpu" value in histograms
      
       - Fix return value of __setup() boot parameter handlers
      
      * tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        tracing: Fix return value of __setup handlers
        tracing/histogram: Fix sorting on old "cpu" value
  3. 05 Mar, 2022 13 commits
  4. 04 Mar, 2022 10 commits
    • Merge tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 07ebd38a
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
      
       - Fixes for a handful of KASAN-related crashes.
      
       - A fix to avoid a crash during boot for SPARSEMEM &&
         !SPARSEMEM_VMEMMAP configurations.
      
       - A fix to stop reporting some incorrect errors under DEBUG_VIRTUAL.
      
       - A fix for the K210's device tree to properly populate the interrupt
         map, so hart1 will get interrupts again.
      
      * tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: dts: k210: fix broken IRQs on hart1
        riscv: Fix kasan pud population
        riscv: Move high_memory initialization to setup_bootmem
        riscv: Fix config KASAN && DEBUG_VIRTUAL
        riscv: Fix DEBUG_VIRTUAL false warnings
        riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP
        riscv: Fix is_linear_mapping with recent move of KASAN region
    • Merge tag 'iommu-fixes-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 3f509f59
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - Fix a double list_add() in Intel VT-d code
      
       - Add missing put_device() in Tegra SMMU driver
      
       - Two AMD IOMMU fixes:
           - Memory leak in IO page-table freeing code
           - Add missing recovery from event-log overflow
      
      * tag 'iommu-fixes-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/tegra-smmu: Fix missing put_device() call in tegra_smmu_find
        iommu/vt-d: Fix double list_add when enabling VMD in scalable mode
        iommu/amd: Fix I/O page table memory leak
        iommu/amd: Recover from event log overflow
    • Merge tag 'thermal-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · a4ffdb61
      Linus Torvalds authored
      Pull thermal control fix from Rafael Wysocki:
       "Fix NULL pointer dereference in the thermal netlink interface (Nicolas
        Cavallari)"
      
      * tag 'thermal-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thermal: core: Fix TZ_GET_TRIP NULL pointer dereference
    • Merge tag 'sound-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 8d670948
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Hopefully the last PR for 5.17, including just a few small changes:
        an additional fix for ASoC ops boundary check and other minor
        device-specific fixes"
      
      * tag 'sound-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: intel_hdmi: Fix reference to PCM buffer address
        ASoC: cs4265: Fix the duplicated control name
        ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min
    • Merge tag 'drm-fixes-2022-03-04' of git://anongit.freedesktop.org/drm/drm · c4fc118a
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Things are quieting down as expected, just a small set of fixes, i915,
        exynos, amdgpu, vrr, bridge and hdlcd. Nothing scary at all.
      
        i915:
         - Fix GuC SLPC unset command
         - Fix misidentification of some Apple MacBook Pro laptops as Jasper Lake
      
        amdgpu:
         - Suspend regression fix
      
        exynos:
         - irq handling fixes
         - Fix two regressions to TE-gpio handling
      
        arm/hdlcd:
         - Select DRM_GEM_CMA_HELPER for HDLCD
      
        bridge:
         - ti-sn65dsi86: Properly undo autosuspend
      
        vrr:
         - Fix potential NULL-pointer deref"
      
      * tag 'drm-fixes-2022-03-04' of git://anongit.freedesktop.org/drm/drm:
        drm/amdgpu: fix suspend/resume hang regression
        drm/vrr: Set VRR capable prop only if it is attached to connector
        drm/arm: arm hdlcd select DRM_GEM_CMA_HELPER
        drm/bridge: ti-sn65dsi86: Properly undo autosuspend
        drm/i915: s/JSP2/ICP2/ PCH
        drm/i915/guc/slpc: Correct the param count for unset param
        drm/exynos: Search for TE-gpio in DSI panel's node
        drm/exynos: Don't fail if no TE-gpio is defined for DSI driver
        drm/exynos: gsc: Use platform_get_irq() to get the interrupt
        drm/exynos/fimc: Use platform_get_irq() to get the interrupt
        drm/exynos/exynos_drm_fimd: Use platform_get_irq_byname() to get the interrupt
        drm/exynos: mixer: Use platform_get_irq() to get the interrupt
        drm/exynos/exynos7_drm_decon: Use platform_get_irq_byname() to get the interrupt
    • Merge tag 'pinctrl-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 0b7344a6
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
       "These two fixes should fix the issues seen on the OrangePi, first we
        needed the correct offset when calling pinctrl_gpio_direction(), and
        fixing that made a lockdep issue explode in our face. Both now fixed"
      
      * tag 'pinctrl-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: sunxi: Use unique lockdep classes for IRQs
        pinctrl-sunxi: sunxi_pinctrl_gpio_direction_in/output: use correct offset
    • tracing: Fix return value of __setup handlers · 1d02b444
      Randy Dunlap authored
      __setup() handlers should generally return 1 to indicate that the
      boot options have been handled.
      
      Using invalid option values causes the entire kernel boot option
      string to be reported as Unknown and added to init's environment
      strings, polluting it.
      
        Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc6
          kprobe_event=p,syscall_any,$arg1 trace_options=quiet
          trace_clock=jiffies", will be passed to user space.
      
       Run /sbin/init as init process
         with arguments:
           /sbin/init
         with environment:
           HOME=/
           TERM=linux
           BOOT_IMAGE=/boot/bzImage-517rc6
           kprobe_event=p,syscall_any,$arg1
           trace_options=quiet
           trace_clock=jiffies
      
      Return 1 from the __setup() handlers so that init's environment is not
      polluted with kernel boot options.
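
      The effect of the return value can be shown with a simplified
      userspace model of boot option dispatch. This is a sketch only, not
      the kernel's parsing code: struct setup_model, dispatch_options() and
      the two stub handlers are hypothetical names.

```c
#include <string.h>

/* Hypothetical model of a __setup() entry: option prefix plus handler. */
struct setup_model {
    const char *prefix;              /* e.g. "trace_clock=" */
    int (*handler)(const char *val); /* nonzero means "option handled" */
};

/* Stub handlers modelling the behaviour before and after the fix. */
static int handler_returns_0(const char *val) { (void)val; return 0; }
static int handler_returns_1(const char *val) { (void)val; return 1; }

/*
 * Model of boot option dispatch: an option whose matching handler returns
 * 0 is treated as unknown and appended to init's environment.
 * Returns the number of options that ended up in env[].
 */
static int dispatch_options(const struct setup_model *tbl, size_t ntbl,
                            const char *const *opts, size_t nopts,
                            const char *env[], size_t env_cap)
{
    size_t nenv = 0;

    for (size_t i = 0; i < nopts; i++) {
        int handled = 0;

        for (size_t j = 0; j < ntbl && !handled; j++) {
            size_t len = strlen(tbl[j].prefix);

            if (strncmp(opts[i], tbl[j].prefix, len) == 0)
                handled = tbl[j].handler(opts[i] + len);
        }
        if (!handled && nenv < env_cap)
            env[nenv++] = opts[i];
    }
    return (int)nenv;
}
```

      With handlers that return 0, both "trace_options=quiet" and
      "trace_clock=jiffies" end up in env[] in this model. With handlers
      that return 1, env[] stays empty.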
      
      Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
      Link: https://lkml.kernel.org/r/20220303031744.32356-1-rdunlap@infradead.org
      
      Cc: stable@vger.kernel.org
      Fixes: 7bcfaf54 ("tracing: Add trace_options kernel command line parameter")
      Fixes: e1e232ca ("tracing: Add trace_clock=<clock> kernel parameter")
      Fixes: 970988e1 ("tracing/kprobe: Add kprobe_event= boot parameter")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls · 0708a0af
      Daniel Borkmann authored
      syzkaller was recently triggering an oversized kvmalloc() warning via
      xdp_umem_create().
      
      The triggered warning was added back in 7661809d ("mm: don't allow
      oversized kvmalloc() calls"). The rationale for the warning for huge
      kvmalloc sizes was as a reaction to a security bug where the size was
      more than UINT_MAX but not everything was prepared to handle unsigned
      long sizes.
      
      Anyway, the AF_XDP related call trace from this syzkaller report was:
      
        kvmalloc include/linux/mm.h:806 [inline]
        kvmalloc_array include/linux/mm.h:824 [inline]
        kvcalloc include/linux/mm.h:829 [inline]
        xdp_umem_pin_pages net/xdp/xdp_umem.c:102 [inline]
        xdp_umem_reg net/xdp/xdp_umem.c:219 [inline]
        xdp_umem_create+0x6a5/0xf00 net/xdp/xdp_umem.c:252
        xsk_setsockopt+0x604/0x790 net/xdp/xsk.c:1068
        __sys_setsockopt+0x1fd/0x4e0 net/socket.c:2176
        __do_sys_setsockopt net/socket.c:2187 [inline]
        __se_sys_setsockopt net/socket.c:2184 [inline]
        __x64_sys_setsockopt+0xb5/0x150 net/socket.c:2184
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Björn mentioned that requests for >2GB allocation can still be valid:
      
        The structure that is being allocated is the page-pinning accounting.
        AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but
        still fewer than what memcg allows (PAGE_COUNTER_MAX is a LONG_MAX/
        PAGE_SIZE on 64 bit systems). [...]
      
        I could just change from U32_MAX to INT_MAX, but as I stated earlier
        that has a hacky feeling to it. [...] From my perspective, the code
        isn't broken, with the memcg limits in consideration. [...]
      
      Linus says:
      
        [...] Pretty much every time this has come up, the kernel warning has
        shown that yes, the code was broken and there really wasn't a reason
        for doing allocations that big.
      
        Of course, some people would be perfectly fine with the allocation
        failing, they just don't want the warning. I didn't want __GFP_NOWARN
        to shut it up originally because I wanted people to see all those
        cases, but these days I think we can just say "yeah, people can shut
        it up explicitly by saying 'go ahead and fail this allocation, don't
        warn about it'".
      
        So enough time has passed that by now I'd certainly be ok with [it].
      
      Thus allow call-sites to silence such userspace triggered splats if the
      allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call
      to kvcalloc() this is already the case, so nothing else needed there.
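
      The intended behaviour can be sketched as a userspace model. It is an
      illustration under assumptions, not the actual mm code:
      kvmalloc_model(), MODEL_GFP_NOWARN and warnings_emitted are
      hypothetical names, and malloc() stands in for the real allocator.

```c
#include <limits.h>
#include <stdlib.h>

#define MODEL_GFP_NOWARN 0x1u /* hypothetical stand-in for __GFP_NOWARN */

/* Counts warnings; the kernel would warn only once, this model counts all. */
static unsigned int warnings_emitted;

/*
 * Model of an oversized-allocation check: requests above INT_MAX always
 * fail, but the warning is skipped when the caller passed the NOWARN flag.
 */
static void *kvmalloc_model(size_t size, unsigned int flags)
{
    if (size > (size_t)INT_MAX) {
        if (!(flags & MODEL_GFP_NOWARN))
            warnings_emitted++; /* stands in for a kernel warning */
        return NULL;
    }
    return malloc(size);
}
```

      In both cases the oversized request fails. The flag only controls
      whether the failure is reported as a warning.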
      
      Fixes: 7661809d ("mm: don't allow oversized kvmalloc() calls")
      Reported-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
      Cc: Björn Töpel <bjorn@kernel.org>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com
      Link: https://lore.kernel.org/bpf/20211201202905.b9892171e3f5b9a60f9da251@linux-foundation.org
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • btrfs: fallback to blocking mode when doing async dio over multiple extents · ca93e44b
      Filipe Manana authored
      Some users recently reported that MariaDB was getting a read corruption
      when using io_uring on top of btrfs. This started to happen in 5.16,
      after commit 51bd9563 ("btrfs: fix deadlock due to page faults
      during direct IO reads and writes"). That changed btrfs to use the new
      iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
      iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
      corresponds to a memory mapped file region. That type of scenario is
      exercised by test case generic/647 from fstests.
      
      For this MariaDB scenario, we attempt to read 16K from file offset X
      using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
      with a size of 4K, and what happens is the following:
      
      1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
      
      2) iomap creates a struct iomap_dio object, its reference count is
         initialized to 1 and its ->size field is initialized to 0;
      
      3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
         the first 4K extent, and sets up an iomap for this extent consisting
         of a single page;
      
      4) At iomap_dio_bio_iter(), we are able to access the first page of the
         buffer (struct iov_iter) with bio_iov_iter_get_pages() without
         triggering a page fault;
      
      5) iomap submits a bio for this 4K extent
         (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
         the refcount on the struct iomap_dio object to 2. The ->size field
         of the struct iomap_dio object is incremented to 4K;
      
      6) iomap calls btrfs_dio_iomap_begin() again, this time with a file
         offset of X + 4K. There we set up an iomap for the next extent
         that also has a size of 4K;
      
      7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
         which tries to access the next page (2nd page) of the buffer.
         This triggers a page fault and returns -EFAULT;
      
      8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
         to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
         the struct iomap_dio object has a ->size value of 4K (we submitted
         a bio for an extent already). The 'wait_for_completion' variable
         is not set to true, because our iocb has IOCB_NOWAIT set;
      
      9) At the bottom of __iomap_dio_rw(), we decrement the reference count
         of the struct iomap_dio object from 2 to 1. Because we were not
         the only ones holding a reference on it and 'wait_for_completion' is
         set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
         just returns it up the callchain, up to io_uring;
      
      10) The bio submitted for the first extent (step 5) completes and its
          bio endio function, iomap_dio_bio_end_io(), decrements the last
          reference on the struct iomap_dio object, resulting in calling
          iomap_dio_complete_work() -> iomap_dio_complete().
      
      11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
          and return 4K (the amount of io done) to iomap_dio_complete_work();
      
      12) iomap_dio_complete_work() calls the iocb completion callback,
          iocb->ki_complete() with a second argument value of 4K (total io
          done) and the iocb with the adjusted ki_pos of X + 4K. This results
          in completing the read request for io_uring, leaving it with a
          result of 4K bytes read, and only the first page of the buffer
          filled in, while the remaining 3 pages, corresponding to the other
          3 extents, were not filled;
      
      13) For the application, the result is unexpected because if we ask
          to read N bytes, it expects to get N bytes read as long as those
          N bytes don't cross the EOF (i_size).
      
      MariaDB reports this as an error, as it's not expecting a short read,
      since it knows it's asking for read operations fully within the i_size
      boundary. This is typical in many applications, but it may also be
      questionable if they should react to such short reads by issuing more
      read calls to get the remaining data. Nevertheless, the short read
      happened due to a change in btrfs regarding how it deals with page
      faults while in the middle of a read operation, and there's no reason
      why btrfs can't have the previous behaviour of returning the whole data
      that was requested by the application.
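
      The short read in the steps above can be reproduced with a small
      userspace model of the byte accounting. This is a sketch under
      assumptions, not iomap or btrfs code: dio_read_model() and
      MODEL_EXTENT_SIZE are hypothetical, and the model covers only the
      ->size bookkeeping, not bios or reference counts.

```c
#include <errno.h>

#define MODEL_EXTENT_SIZE 4096L /* each extent covers one 4K page */

/*
 * Model of a NOWAIT direct IO read over nr_extents extents with partial
 * completion allowed. Buffer page i can be accessed without a page fault
 * only if i < faulted_in_pages. Returns the byte count reported to the
 * caller, or -EFAULT if the fault hits before anything was submitted.
 */
static long dio_read_model(int nr_extents, int faulted_in_pages)
{
    long size = 0; /* mirrors the ->size field of struct iomap_dio */

    for (int i = 0; i < nr_extents; i++) {
        if (i >= faulted_in_pages) {
            if (size == 0)
                return -EFAULT; /* nothing submitted yet */
            break;              /* error reset to 0: partial result kept */
        }
        size += MODEL_EXTENT_SIZE; /* bio submitted for this extent */
    }
    return size; /* value passed to the completion callback */
}
```

      For the MariaDB scenario (4 extents, only the first buffer page
      faulted in) the model reports 4096 bytes for a 16K request, which is
      the unexpected short read. With all 4 pages faulted in it reports the
      full 16384 bytes.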
      
      The problem can also be triggered with the following simple program:
      
        /* Get O_DIRECT */
        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE
        #endif
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <errno.h>
        #include <string.h>
        #include <liburing.h>
      
        int main(int argc, char *argv[])
        {
            char *foo_path;
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            struct iovec iovec;
            int fd;
            long pagesize;
            void *write_buf;
            void *read_buf;
            ssize_t ret;
            int i;
      
            if (argc != 2) {
                fprintf(stderr, "Use: %s <directory>\n", argv[0]);
                return 1;
            }
      
            foo_path = malloc(strlen(argv[1]) + 5);
            if (!foo_path) {
                fprintf(stderr, "Failed to allocate memory for file path\n");
                return 1;
            }
            strcpy(foo_path, argv[1]);
            strcat(foo_path, "/foo");
      
            /*
             * Create file foo with 2 extents, each with a size matching
             * the page size. Then allocate a buffer to read both extents
             * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
             * the read with io_uring, access the first page of the buffer
             * to fault it in, so that during the read we only trigger a
             * page fault when accessing the second page of the buffer.
             */
             fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                      O_DIRECT, 0666);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to create file 'foo': %s (errno %d)\n",
                         strerror(errno), errno);
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate write buffer\n");
                 return 1;
             }
      
             memset(write_buf, 0xab, pagesize);
             memset(write_buf + pagesize, 0xcd, pagesize);
      
             /* Create 2 extents, each with a size matching page size. */
             for (i = 0; i < 2; i++) {
                 ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                              i * pagesize);
                 if (ret != pagesize) {
                     fprintf(stderr,
                           "Failed to write to file, ret = %ld errno %d (%s)\n",
                            ret, errno, strerror(errno));
                     return 1;
                 }
                 ret = fsync(fd);
                 if (ret != 0) {
                     fprintf(stderr, "Failed to fsync file\n");
                     return 1;
                 }
             }
      
             close(fd);
             fd = open(foo_path, O_RDONLY | O_DIRECT);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to open file 'foo': %s (errno %d)\n",
                         strerror(errno), errno);
                 return 1;
             }
      
             ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate read buffer\n");
                 return 1;
             }
      
             /*
              * Fault in only the first page of the read buffer.
              * We want to trigger a page fault for the 2nd page of the
              * read buffer during the read operation with io_uring
              * (O_DIRECT and IOCB_NOWAIT).
              */
             memset(read_buf, 0, 1);
      
             ret = io_uring_queue_init(1, &ring, 0);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create io_uring queue\n");
                 return 1;
             }
      
             sqe = io_uring_get_sqe(&ring);
             if (!sqe) {
                 fprintf(stderr, "Failed to get io_uring sqe\n");
                 return 1;
             }
      
             iovec.iov_base = read_buf;
             iovec.iov_len = 2 * pagesize;
             io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
      
             ret = io_uring_submit_and_wait(&ring, 1);
             if (ret != 1) {
                 fprintf(stderr,
                         "Failed at io_uring_submit_and_wait()\n");
                 return 1;
             }
      
             ret = io_uring_wait_cqe(&ring, &cqe);
             if (ret < 0) {
                 fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
                 return 1;
             }
      
             printf("io_uring read result for file foo:\n\n");
             printf("  cqe->res == %d (expected %ld)\n", cqe->res, 2 * pagesize);
             printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
                    memcmp(read_buf, write_buf, 2 * pagesize));
      
             io_uring_cqe_seen(&ring, cqe);
             io_uring_queue_exit(&ring);
      
             return 0;
        }
      
      When running it on an unpatched kernel:
      
        $ gcc io_uring_test.c -luring
        $ mkfs.btrfs -f /dev/sda
        $ mount /dev/sda /mnt/sda
        $ ./a.out /mnt/sda
        io_uring read result for file foo:
      
          cqe->res == 4096 (expected 8192)
          memcmp(read_buf, write_buf) == -205 (expected 0)
      
      After this patch, the read always returns 8192 bytes, with the buffer
      filled with the correct data. Although the reproducer always triggers
      the bug in my test VMs, it may be less reliable in other environments.
      That is the case if the bio for the first extent completes and
      decrements the reference on the struct iomap_dio object before we do
      the atomic_dec_and_test() on the reference at __iomap_dio_rw().
      
      Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
      whenever we try to satisfy a non-blocking IO request (IOMAP_NOWAIT flag
      set) over a range that spans multiple extents (or a mix of extents and
      holes). This avoids returning success to the caller when we only did
      partial IO. Partial IO is not optimal for writes, and for reads it is
      actually incorrect, because the caller doesn't expect to get fewer bytes
      read than it requested (unless EOF is crossed), as previously mentioned.
      This is also the behaviour that xfs follows
      (xfs_direct_write_iomap_begin()), even though it doesn't use
      IOMAP_DIO_PARTIAL.
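
      The rule described above can be modelled in user space roughly as
      follows. This is a simplified sketch, not the btrfs code: struct
      extent_model and nowait_range_check() are hypothetical names used only
      for illustration, and the model assumes the extent covering the start
      of the request has already been looked up.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the extent that covers the request start. */
struct extent_model {
    size_t start;   /* file offset where the extent begins */
    size_t len;     /* length of the extent in bytes */
};

/*
 * Decide whether a request for [offset, offset + length) may proceed.
 * Returns 0 if it may, or -EAGAIN if the request is non-blocking and
 * reaches past the end of the extent, in which case only partial IO
 * would be possible without blocking.
 */
static int nowait_range_check(const struct extent_model *em,
                              size_t offset, size_t length, bool nowait)
{
    /* The extent must contain the first byte of the request. */
    assert(offset >= em->start && offset < em->start + em->len);

    /* Bytes of the request that this single extent can satisfy. */
    size_t avail = em->start + em->len - offset;

    if (nowait && length > avail)
        return -EAGAIN;

    return 0;
}
```

      With a 4096 byte extent at offset 0, a non-blocking request for 8192
      bytes gets -EAGAIN in this model, while the same request without the
      non-blocking flag, or a non-blocking request for 4096 bytes, returns 0.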
      
      A test case for fstests will follow soon.
      
      Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ca93e44b
    • Niklas Cassel's avatar
      riscv: dts: k210: fix broken IRQs on hart1 · 74583f1b
      Niklas Cassel authored
      Commit 67d96729 ("riscv: Update Canaan Kendryte K210 device tree")
      incorrectly removed two entries from the PLIC interrupt-controller node's
      interrupts-extended property.
      
      The PLIC driver cannot know the mapping between hart contexts and hart ids,
      so this information has to be provided by device tree, as specified by the
      PLIC device tree binding.
      
      The PLIC driver uses the interrupts-extended property, and initializes the
      hart context registers in the exact same order as provided by the
      interrupts-extended property.
      
      In other words, if we don't specify the S-mode interrupts, the PLIC driver
      will simply initialize the hart0 S-mode hart context with the hart1 M-mode
      configuration. It is therefore essential to specify the S-mode IRQs even
      though the system itself will only ever be running in M-mode.
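
      The positional mapping described above can be illustrated with a small
      model. This is a sketch and not the PLIC driver code: struct ctx_entry,
      broken_list, fixed_list and entry_for_context() are hypothetical names,
      and the hardware context order (hart0 M-mode, hart0 S-mode, hart1
      M-mode, hart1 S-mode) is assumed from the description in this message.

```c
#include <assert.h>
#include <stddef.h>

enum priv_mode { MODE_M, MODE_S };

/* One entry of a modelled interrupts-extended list. */
struct ctx_entry {
    int hart;
    enum priv_mode mode;
};

/* List with the S-mode entries removed: only two entries remain. */
static const struct ctx_entry broken_list[] = {
    { 0, MODE_M }, { 1, MODE_M },
};

/* List with the S-mode entries present, one entry per hart context. */
static const struct ctx_entry fixed_list[] = {
    { 0, MODE_M }, { 0, MODE_S }, { 1, MODE_M }, { 1, MODE_S },
};

/*
 * Return the entry that a purely position-based consumer applies to
 * hardware context number ctx, or NULL if the list has no such entry.
 */
static const struct ctx_entry *
entry_for_context(const struct ctx_entry *list, size_t count, size_t ctx)
{
    assert(list != NULL);
    return ctx < count ? &list[ctx] : NULL;
}
```

      In this model, context 1 (assumed to be hart0 S-mode) receives the
      hart1 M-mode entry when broken_list is used, and context 2 (assumed to
      be hart1 M-mode) receives no entry at all. With fixed_list every
      context receives the entry that matches it.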
      
      Re-add the S-mode interrupts, so that we get working IRQs on hart1 again.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 67d96729 ("riscv: Update Canaan Kendryte K210 device tree")
      Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
      74583f1b