1. 07 Mar, 2022 11 commits
  2. 06 Mar, 2022 5 commits
    • Linux 5.17-rc7 · ffb217a1
      Linus Torvalds authored
    • Merge tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 3ee65c0f
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
       "A few more fixes for various problems that have user visible effects
        or seem to be urgent:
      
         - fix corruption when combining DIO and non-blocking io_uring over
           multiple extents (seen on MariaDB)
      
         - fix relocation crash due to premature return from commit
      
         - fix quota deadlock between rescan and qgroup removal
      
         - fix item data bounds checks in tree-checker (found on a fuzzed
           image)
      
         - fix fsync of prealloc extents after EOF
      
         - add missing run of delayed items after unlink during log replay
      
         - don't start relocation until snapshot drop is finished
      
         - fix reversed condition for subpage writers locking
      
         - fix warning on page error"
      
      * tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fallback to blocking mode when doing async dio over multiple extents
        btrfs: add missing run of delayed items after unlink during log replay
        btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
        btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
        btrfs: do not start relocation until in progress drops are done
        btrfs: tree-checker: use u64 for item data end to avoid overflow
        btrfs: do not WARN_ON() if we have PageError set
        btrfs: fix lost prealloc extents beyond eof after full fsync
        btrfs: subpage: fix a wrong check on subpage->writers
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · f81664f7
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "x86 guest:
      
         - Tweaks to the paravirtualization code, to avoid using them when
           they're pointless or harmful
      
        x86 host:
      
         - Fix for SRCU lockdep splat
      
         - Brown paper bag fix for the propagation of errno"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
        KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
        KVM: x86: Yield to IPI target vCPU only if it is busy
        x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
        x86/kvm: Don't waste memory if kvmclock is disabled
        x86/kvm: Don't use PV TLB/yield when mwait is advertised
    • Merge tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 9bdeaca1
      Linus Torvalds authored
      Pull powerpc fix from Michael Ellerman:
       "Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set.
      
        Thanks to Murilo Opsfelder Araujo, and Erhard F"
      
      * tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set
    • Merge tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · f40a33f5
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix sorting on old "cpu" value in histograms
      
       - Fix return value of __setup() boot parameter handlers
      
      * tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        tracing: Fix return value of __setup handlers
        tracing/histogram: Fix sorting on old "cpu" value
  3. 05 Mar, 2022 14 commits
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · dcde98da
      Linus Torvalds authored
      Pull input updates from Dmitry Torokhov:
      
       - a fixup for Goodix touchscreen driver allowing it to work on certain
         Cherry Trail devices
      
       - a fix for imbalanced enable/disable regulator in Elan touchpad driver
         that became apparent when used with Asus TF103C 2-in-1 dock
      
       - a couple new input keycodes used on newer keyboards
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        HID: add mapping for KEY_ALL_APPLICATIONS
        HID: add mapping for KEY_DICTATE
        Input: elan_i2c - fix regulator enable count imbalance after suspend/resume
        Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power()
        Input: goodix - workaround Cherry Trail devices with a bogus ACPI Interrupt() resource
        Input: goodix - use the new soc_intel_is_byt() helper
        Input: samsung-keypad - properly state IOMEM dependency
    • Merge branch 'akpm' (patches from Andrew) · 0014404f
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "8 patches.
      
        Subsystems affected by this patch series: mm (hugetlb, pagemap, and
        userfaultfd), memfd, selftests, and kconfig"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        configs/debug: set CONFIG_DEBUG_INFO=y properly
        proc: fix documentation and description of pagemap
        kselftest/vm: fix tests build with old libc
        memfd: fix F_SEAL_WRITE after shmem huge page allocated
        mm: fix use-after-free when anon vma name is used after vma is freed
        mm: prevent vm_area_struct::anon_name refcount saturation
        mm: refactor vm_area_struct::anon_vma_name usage code
        selftests/vm: cleanup hugetlb file after mremap test
    • Merge tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · f9026e19
      Linus Torvalds authored
      Pull s390 fixes from Vasily Gorbik:
      
       - Fix HAVE_DYNAMIC_FTRACE_WITH_ARGS implementation by providing correct
         switching between ftrace_caller/ftrace_regs_caller and supplying
         pt_regs only when ftrace_regs_caller is activated.
      
       - Fix exception table sorting.
      
       - Fix breakage of kdump tooling by preserving metadata it cannot
         function without.
      
      * tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/extable: fix exception table sorting
        s390/ftrace: fix arch_ftrace_get_regs implementation
        s390/ftrace: fix ftrace_caller/ftrace_regs_caller generation
        s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE
    • configs/debug: set CONFIG_DEBUG_INFO=y properly · d1eff16d
      Qian Cai authored
      CONFIG_DEBUG_INFO can't be set by user directly, so set
      CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y instead.
      
      Otherwise, we end up with no debuginfo in vmlinux which is a big no-no
      for kernel debugging.
      
      Link: https://lkml.kernel.org/r/20220301202920.18488-1-quic_qiancai@quicinc.com
      Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: fix documentation and description of pagemap · dd21bfa4
      Yun Zhou authored
      Bit 57 has been exported for uffd-wp write-protected pages since commit
      fb8e37f3 ("mm/pagemap: export uffd-wp protection information"), so
      update the documentation and description to avoid unnecessary confusion.
      
      Link: https://lkml.kernel.org/r/20220301044538.3042713-1-yun.zhou@windriver.com
      Fixes: fb8e37f3 ("mm/pagemap: export uffd-wp protection information")
      Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
      Cc: Florian Schmidt <florian.schmidt@nutanix.com>
      Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kselftest/vm: fix tests build with old libc · b773827e
      Chengming Zhou authored
      Building the vm tests on debian10 (GLIBC 2.28) fails with:
      
          userfaultfd.c: In function `userfaultfd_pagemap_test':
          userfaultfd.c:1393:37: error: `MADV_PAGEOUT' undeclared (first use
          in this function); did you mean `MADV_RANDOM'?
            if (madvise(area_dst, test_pgsize, MADV_PAGEOUT))
                                               ^~~~~~~~~~~~
                                               MADV_RANDOM
      
      This patch includes the newer definitions from the UAPI linux/mman.h,
      which fixes the tests build on systems whose glibc sys/mman.h lacks
      these definitions.
      
      Link: https://lkml.kernel.org/r/20220227055330.43087-2-zhouchengming@bytedance.com
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memfd: fix F_SEAL_WRITE after shmem huge page allocated · f2b277c4
      Hugh Dickins authored
      Wangyong reports: after enabling tmpfs filesystem to support transparent
      hugepage with the following command:
      
        echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
      
      the docker program tries to add F_SEAL_WRITE through the following
      command, but it fails unexpectedly with errno EBUSY:
      
        fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.
      
      That is because memfd_tag_pins() and memfd_wait_for_pins() were never
      updated for shmem huge pages: checking page_mapcount() against
      page_count() is hopeless on THP subpages - they need to check
      total_mapcount() against page_count() on THP heads only.
      
      Make memfd_tag_pins() (compared > 1) as strict as memfd_wait_for_pins()
      (compared != 1): either can be justified, but given the non-atomic
      total_mapcount() calculation, it is better now to be strict.  Bear in
      mind that total_mapcount() itself scans all of the THP subpages, when
      choosing to take an XA_CHECK_SCHED latency break.
      
      Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
      page has been swapped out since memfd_tag_pins(), then its refcount must
      have fallen, and so it can safely be untagged.
      
      Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Reported-by: wangyong <wang.yong12@zte.com.cn>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: CGEL ZTE <cgel.zte@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix use-after-free when anon vma name is used after vma is freed · 942341dc
      Suren Baghdasaryan authored
      When adjacent vmas are being merged it can result in the vma that was
      originally passed to madvise_update_vma being destroyed.  In the current
      implementation, the name parameter passed to madvise_update_vma points
      directly to vma->anon_name and it is used after the call to vma_merge.
      In the cases when vma_merge merges the original vma and destroys it,
      this might result in UAF.  For that the original vma would have to hold
      the anon_vma_name with the last reference.  The following vma would need
      to contain a different anon_vma_name object with the same string.  Such
      a scenario is shown below:
      
      madvise_vma_behavior(vma)
        madvise_update_vma(vma, ..., anon_name == vma->anon_name)
          vma_merge(vma)
            __vma_adjust(vma) <-- merges vma with adjacent one
              vm_area_free(vma) <-- frees the original vma
          replace_vma_anon_name(anon_name) <-- UAF of vma->anon_name
      
      Fix this by raising the name refcount and stabilizing it.
      
      Link: https://lkml.kernel.org/r/20220224231834.1481408-3-surenb@google.com
      Link: https://lkml.kernel.org/r/20220223153613.835563-3-surenb@google.com
      Fixes: 9a10064f ("mm: add a field to store names for private anonymous memory")
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: syzbot+aa7b3d4b35f9dc46a366@syzkaller.appspotmail.com
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Gladkov <legion@kernel.org>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Colin Cross <ccross@google.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: prevent vm_area_struct::anon_name refcount saturation · 96403e11
      Suren Baghdasaryan authored
      In a deep process chain with many vmas, the anon_vma_name refcount could
      grow really high.  With
      default sysctl_max_map_count (64k) and default pid_max (32k) the max
      number of vmas in the system is 2147450880 and the refcounter has
      headroom of 1073774592 before it reaches REFCOUNT_SATURATED
      (3221225472).
      
      Therefore it's unlikely that an anonymous name refcounter will overflow
      with these defaults.  Currently the max for pid_max is PID_MAX_LIMIT
      (4194304) and for sysctl_max_map_count it's INT_MAX (2147483647).  In
      this configuration anon_vma_name refcount overflow becomes theoretically
      possible (that would still require heavy sharing of that anon_vma_name
      between processes).
      
      kref refcounting interface used in anon_vma_name structure will detect a
      counter overflow when it reaches REFCOUNT_SATURATED value but will only
      generate a warning and freeze the ref counter.  This would lead to the
      refcounted object never being freed.  A determined attacker could leak
      memory like that but it would be a rather expensive and inefficient way
      to do so.
      
      To ensure anon_vma_name refcount does not overflow, stop anon_vma_name
      sharing when the refcount reaches REFCOUNT_MAX (2147483647), which still
      leaves INT_MAX/2 (1073741823) values before the counter reaches
      REFCOUNT_SATURATED.  This should provide enough headroom for raising the
      refcounts temporarily.
      
      Link: https://lkml.kernel.org/r/20220223153613.835563-2-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Gladkov <legion@kernel.org>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Colin Cross <ccross@google.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: refactor vm_area_struct::anon_vma_name usage code · 5c26f6ac
      Suren Baghdasaryan authored
      Avoid mixing strings and their anon_vma_name referenced pointers by
      using struct anon_vma_name whenever possible.  This simplifies the code
      and allows easier sharing of anon_vma_name structures when they
      represent the same name.
      
      [surenb@google.com: fix comment]
      
      Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Alexey Gladkov <legion@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests/vm: cleanup hugetlb file after mremap test · ff712a62
      Mike Kravetz authored
      The hugepage-mremap test will create a file in a hugetlb filesystem.  In
      a default 'run_vmtests' run, the file will contain all the hugetlb
      pages.  After the test, the file remains and there are no free hugetlb
      pages for subsequent tests.  This causes those hugetlb tests to fail.
      
      Change hugepage-mremap to take the name of the hugetlb file as an
      argument.  Unlink the file within the test, and just to be sure remove
      the file in the run_vmtests script.
      
      Link: https://lkml.kernel.org/r/20220201033459.156944-1-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bfq: fix use-after-free in bfq_dispatch_request · ab552fcb
      Zhang Wensheng authored
      KASAN reports a use-after-free when running a normal scsi-mq test:
      
      [69832.239032] ==================================================================
      [69832.241810] BUG: KASAN: use-after-free in bfq_dispatch_request+0x1045/0x44b0
      [69832.243267] Read of size 8 at addr ffff88802622ba88 by task kworker/3:1H/155
      [69832.244656]
      [69832.245007] CPU: 3 PID: 155 Comm: kworker/3:1H Not tainted 5.10.0-10295-g576c6382529e #8
      [69832.246626] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [69832.249069] Workqueue: kblockd blk_mq_run_work_fn
      [69832.250022] Call Trace:
      [69832.250541]  dump_stack+0x9b/0xce
      [69832.251232]  ? bfq_dispatch_request+0x1045/0x44b0
      [69832.252243]  print_address_description.constprop.6+0x3e/0x60
      [69832.253381]  ? __cpuidle_text_end+0x5/0x5
      [69832.254211]  ? vprintk_func+0x6b/0x120
      [69832.254994]  ? bfq_dispatch_request+0x1045/0x44b0
      [69832.255952]  ? bfq_dispatch_request+0x1045/0x44b0
      [69832.256914]  kasan_report.cold.9+0x22/0x3a
      [69832.257753]  ? bfq_dispatch_request+0x1045/0x44b0
      [69832.258755]  check_memory_region+0x1c1/0x1e0
      [69832.260248]  bfq_dispatch_request+0x1045/0x44b0
      [69832.261181]  ? bfq_bfqq_expire+0x2440/0x2440
      [69832.262032]  ? blk_mq_delay_run_hw_queues+0xf9/0x170
      [69832.263022]  __blk_mq_do_dispatch_sched+0x52f/0x830
      [69832.264011]  ? blk_mq_sched_request_inserted+0x100/0x100
      [69832.265101]  __blk_mq_sched_dispatch_requests+0x398/0x4f0
      [69832.266206]  ? blk_mq_do_dispatch_ctx+0x570/0x570
      [69832.267147]  ? __switch_to+0x5f4/0xee0
      [69832.267898]  blk_mq_sched_dispatch_requests+0xdf/0x140
      [69832.268946]  __blk_mq_run_hw_queue+0xc0/0x270
      [69832.269840]  blk_mq_run_work_fn+0x51/0x60
      [69832.278170]  process_one_work+0x6d4/0xfe0
      [69832.278984]  worker_thread+0x91/0xc80
      [69832.279726]  ? __kthread_parkme+0xb0/0x110
      [69832.280554]  ? process_one_work+0xfe0/0xfe0
      [69832.281414]  kthread+0x32d/0x3f0
      [69832.282082]  ? kthread_park+0x170/0x170
      [69832.282849]  ret_from_fork+0x1f/0x30
      [69832.283573]
      [69832.283886] Allocated by task 7725:
      [69832.284599]  kasan_save_stack+0x19/0x40
      [69832.285385]  __kasan_kmalloc.constprop.2+0xc1/0xd0
      [69832.286350]  kmem_cache_alloc_node+0x13f/0x460
      [69832.287237]  bfq_get_queue+0x3d4/0x1140
      [69832.287993]  bfq_get_bfqq_handle_split+0x103/0x510
      [69832.289015]  bfq_init_rq+0x337/0x2d50
      [69832.289749]  bfq_insert_requests+0x304/0x4e10
      [69832.290634]  blk_mq_sched_insert_requests+0x13e/0x390
      [69832.291629]  blk_mq_flush_plug_list+0x4b4/0x760
      [69832.292538]  blk_flush_plug_list+0x2c5/0x480
      [69832.293392]  io_schedule_prepare+0xb2/0xd0
      [69832.294209]  io_schedule_timeout+0x13/0x80
      [69832.295014]  wait_for_common_io.constprop.1+0x13c/0x270
      [69832.296137]  submit_bio_wait+0x103/0x1a0
      [69832.296932]  blkdev_issue_discard+0xe6/0x160
      [69832.297794]  blk_ioctl_discard+0x219/0x290
      [69832.298614]  blkdev_common_ioctl+0x50a/0x1750
      [69832.304715]  blkdev_ioctl+0x470/0x600
      [69832.305474]  block_ioctl+0xde/0x120
      [69832.306232]  vfs_ioctl+0x6c/0xc0
      [69832.306877]  __se_sys_ioctl+0x90/0xa0
      [69832.307629]  do_syscall_64+0x2d/0x40
      [69832.308362]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [69832.309382]
      [69832.309701] Freed by task 155:
      [69832.310328]  kasan_save_stack+0x19/0x40
      [69832.311121]  kasan_set_track+0x1c/0x30
      [69832.311868]  kasan_set_free_info+0x1b/0x30
      [69832.312699]  __kasan_slab_free+0x111/0x160
      [69832.313524]  kmem_cache_free+0x94/0x460
      [69832.314367]  bfq_put_queue+0x582/0x940
      [69832.315112]  __bfq_bfqd_reset_in_service+0x166/0x1d0
      [69832.317275]  bfq_bfqq_expire+0xb27/0x2440
      [69832.318084]  bfq_dispatch_request+0x697/0x44b0
      [69832.318991]  __blk_mq_do_dispatch_sched+0x52f/0x830
      [69832.319984]  __blk_mq_sched_dispatch_requests+0x398/0x4f0
      [69832.321087]  blk_mq_sched_dispatch_requests+0xdf/0x140
      [69832.322225]  __blk_mq_run_hw_queue+0xc0/0x270
      [69832.323114]  blk_mq_run_work_fn+0x51/0x60
      [69832.323942]  process_one_work+0x6d4/0xfe0
      [69832.324772]  worker_thread+0x91/0xc80
      [69832.325518]  kthread+0x32d/0x3f0
      [69832.326205]  ret_from_fork+0x1f/0x30
      [69832.326932]
      [69832.338297] The buggy address belongs to the object at ffff88802622b968
      [69832.338297]  which belongs to the cache bfq_queue of size 512
      [69832.340766] The buggy address is located 288 bytes inside of
      [69832.340766]  512-byte region [ffff88802622b968, ffff88802622bb68)
      [69832.343091] The buggy address belongs to the page:
      [69832.344097] page:ffffea0000988a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802622a528 pfn:0x26228
      [69832.346214] head:ffffea0000988a00 order:2 compound_mapcount:0 compound_pincount:0
      [69832.347719] flags: 0x1fffff80010200(slab|head)
      [69832.348625] raw: 001fffff80010200 ffffea0000dbac08 ffff888017a57650 ffff8880179fe840
      [69832.354972] raw: ffff88802622a528 0000000000120008 00000001ffffffff 0000000000000000
      [69832.356547] page dumped because: kasan: bad access detected
      [69832.357652]
      [69832.357970] Memory state around the buggy address:
      [69832.358926]  ffff88802622b980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [69832.360358]  ffff88802622ba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [69832.361810] >ffff88802622ba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [69832.363273]                       ^
      [69832.363975]  ffff88802622bb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
      [69832.375960]  ffff88802622bb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [69832.377405] ==================================================================
      
      In the bfq_dispatch_request() function, the following call chain is
      possible:
      
      bfq_dispatch_request
      	__bfq_dispatch_request
      		bfq_select_queue
      			bfq_bfqq_expire
      				__bfq_bfqd_reset_in_service
      					bfq_put_queue
      						kmem_cache_free

      In this call chain, in_serv_queue has been expired and meets the
      conditions to be freed, so by the time control returns to
      bfq_dispatch_request() the memory that in_serv_queue points to has been
      released. Computing idle_timer_disabled then reads the flags value
      through that pointer, which is the use-after-free.

      Fix the problem by checking in_serv_queue == bfqd->in_service_queue and
      only computing idle_timer_disabled when the two are equal. If the memory
      that in_serv_queue points to has been released, this check avoids the
      use-after-free. And if in_serv_queue has been expired or finished,
      idle_timer_disabled stays false, which has no effect on
      bfq_update_dispatch_stats.

      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
      Link: https://lore.kernel.org/r/20220303070334.3020168-1-zhangwensheng5@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set · 58dbe9b3
      Murilo Opsfelder Araujo authored
      The following build failure occurs when CONFIG_PPC_64S_HASH_MMU is not
      set:
      
          arch/powerpc/kernel/setup_64.c: In function ‘setup_per_cpu_areas’:
          arch/powerpc/kernel/setup_64.c:811:21: error: ‘mmu_linear_psize’ undeclared (first use in this function); did you mean ‘mmu_virtual_psize’?
            811 |                 if (mmu_linear_psize == MMU_PAGE_4K)
                |                     ^~~~~~~~~~~~~~~~
                |                     mmu_virtual_psize
          arch/powerpc/kernel/setup_64.c:811:21: note: each undeclared identifier is reported only once for each function it appears in
      
      Move the declaration of mmu_linear_psize outside of
      CONFIG_PPC_64S_HASH_MMU ifdef.
      
      After the above is fixed, it fails later with the following error:
      
          ld: arch/powerpc/kexec/file_load_64.o: in function `.arch_kexec_kernel_image_probe':
          file_load_64.c:(.text+0x1c1c): undefined reference to `.add_htab_mem_range'
      
      Fix that, too, by conditioning add_htab_mem_range() symbol to
      CONFIG_PPC_64S_HASH_MMU.
      
      Fixes: 387e220a ("powerpc/64s: Move hash MMU support code under CONFIG_PPC_64S_HASH_MMU")
      Reported-by: Erhard F. <erhard_f@mailbox.org>
      Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215567
      Link: https://lore.kernel.org/r/20220301204743.45133-1-muriloo@linux.ibm.com
    • Merge tag 'block-5.17-2022-03-04' of git://git.kernel.dk/linux-block · ac84e82f
      Linus Torvalds authored
      Pull block fix from Jens Axboe:
       "Just a small UAF fix for blktrace"
      
      * tag 'block-5.17-2022-03-04' of git://git.kernel.dk/linux-block:
        blktrace: fix use after free for struct blk_trace
  4. 04 Mar, 2022 10 commits
    • Merge tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 07ebd38a
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
      
       - Fixes for a handful of KASAN-related crashes.
      
       - A fix to avoid a crash during boot for SPARSEMEM &&
         !SPARSEMEM_VMEMMAP configurations.
      
       - A fix to stop reporting some incorrect errors under DEBUG_VIRTUAL.
      
       - A fix for the K210's device tree to properly populate the interrupt
         map, so hart1 will get interrupts again.
      
      * tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: dts: k210: fix broken IRQs on hart1
        riscv: Fix kasan pud population
        riscv: Move high_memory initialization to setup_bootmem
        riscv: Fix config KASAN && DEBUG_VIRTUAL
        riscv: Fix DEBUG_VIRTUAL false warnings
        riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP
        riscv: Fix is_linear_mapping with recent move of KASAN region
      07ebd38a
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 3f509f59
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - Fix a double list_add() in Intel VT-d code
      
       - Add missing put_device() in Tegra SMMU driver
      
       - Two AMD IOMMU fixes:
           - Memory leak in IO page-table freeing code
           - Add missing recovery from event-log overflow
      
      * tag 'iommu-fixes-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/tegra-smmu: Fix missing put_device() call in tegra_smmu_find
        iommu/vt-d: Fix double list_add when enabling VMD in scalable mode
        iommu/amd: Fix I/O page table memory leak
        iommu/amd: Recover from event log overflow
      3f509f59
    • Linus Torvalds's avatar
      Merge tag 'thermal-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · a4ffdb61
      Linus Torvalds authored
      Pull thermal control fix from Rafael Wysocki:
       "Fix NULL pointer dereference in the thermal netlink interface (Nicolas
        Cavallari)"
      
      * tag 'thermal-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thermal: core: Fix TZ_GET_TRIP NULL pointer dereference
      a4ffdb61
    • Linus Torvalds's avatar
      Merge tag 'sound-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 8d670948
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Hopefully the last PR for 5.17, including just a few small changes:
        an additional fix for ASoC ops boundary check and other minor
        device-specific fixes"
      
      * tag 'sound-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: intel_hdmi: Fix reference to PCM buffer address
        ASoC: cs4265: Fix the duplicated control name
        ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min
      8d670948
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2022-03-04' of git://anongit.freedesktop.org/drm/drm · c4fc118a
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Things are quieting down as expected, just a small set of fixes, i915,
        exynos, amdgpu, vrr, bridge and hdlcd. Nothing scary at all.
      
        i915:
         - Fix GuC SLPC unset command
         - Fix misidentification of some Apple MacBook Pro laptops as Jasper Lake
      
        amdgpu:
         - Suspend regression fix
      
        exynos:
         - irq handling fixes
         - Fix two regressions to TE-gpio handling
      
        arm/hdlcd:
         - Select DRM_GEM_CMA_HELPER for HDLCD
      
        bridge:
         - ti-sn65dsi86: Properly undo autosuspend
      
        vrr:
         - Fix potential NULL-pointer deref"
      
      * tag 'drm-fixes-2022-03-04' of git://anongit.freedesktop.org/drm/drm:
        drm/amdgpu: fix suspend/resume hang regression
        drm/vrr: Set VRR capable prop only if it is attached to connector
        drm/arm: arm hdlcd select DRM_GEM_CMA_HELPER
        drm/bridge: ti-sn65dsi86: Properly undo autosuspend
        drm/i915: s/JSP2/ICP2/ PCH
        drm/i915/guc/slpc: Correct the param count for unset param
        drm/exynos: Search for TE-gpio in DSI panel's node
        drm/exynos: Don't fail if no TE-gpio is defined for DSI driver
        drm/exynos: gsc: Use platform_get_irq() to get the interrupt
        drm/exynos/fimc: Use platform_get_irq() to get the interrupt
        drm/exynos/exynos_drm_fimd: Use platform_get_irq_byname() to get the interrupt
        drm/exynos: mixer: Use platform_get_irq() to get the interrupt
        drm/exynos/exynos7_drm_decon: Use platform_get_irq_byname() to get the interrupt
      c4fc118a
    • Linus Torvalds's avatar
      Merge tag 'pinctrl-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 0b7344a6
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
       "These two fixes should fix the issues seen on the OrangePi, first we
        needed the correct offset when calling pinctrl_gpio_direction(), and
        fixing that made a lockdep issue explode in our face. Both now fixed"
      
      * tag 'pinctrl-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: sunxi: Use unique lockdep classes for IRQs
        pinctrl-sunxi: sunxi_pinctrl_gpio_direction_in/output: use correct offset
      0b7344a6
    • Randy Dunlap's avatar
      tracing: Fix return value of __setup handlers · 1d02b444
      Randy Dunlap authored
      __setup() handlers should generally return 1 to indicate that the
      boot options have been handled.
      
      Using invalid option values causes the entire kernel boot option
      string to be reported as Unknown and added to init's environment
      strings, polluting it.
      
        Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc6
          kprobe_event=p,syscall_any,$arg1 trace_options=quiet
          trace_clock=jiffies", will be passed to user space.
      
       Run /sbin/init as init process
         with arguments:
           /sbin/init
         with environment:
           HOME=/
           TERM=linux
           BOOT_IMAGE=/boot/bzImage-517rc6
           kprobe_event=p,syscall_any,$arg1
           trace_options=quiet
           trace_clock=jiffies
      
      Return 1 from the __setup() handlers so that init's environment is not
      polluted with kernel boot options.
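
      As an illustration only, the effect of the return value can be modelled
      in user space. The sketch below is not the kernel's parser; the table
      type, the two handlers and dispatch_option() are hypothetical names,
      and it assumes that a nonzero return marks the option as consumed while
      0 leaves it to be reported as unknown and handed to init:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of a boot option handler table. */
typedef int (*setup_fn)(const char *value);

struct setup_entry {
    const char *prefix;   /* e.g. "trace_clock=" */
    setup_fn handler;
};

/* Returns 1: the option counts as consumed, even if the value is invalid. */
static int handler_returns_one(const char *value)
{
    (void)value;
    return 1;
}

/* Returns 0: the option is treated as not handled. */
static int handler_returns_zero(const char *value)
{
    (void)value;
    return 0;
}

/*
 * Returns 1 if some handler claimed the option, 0 if the option would be
 * reported as unknown and forwarded to init in this model.
 */
static int dispatch_option(const struct setup_entry *table, size_t n,
                           const char *option)
{
    size_t i;

    for (i = 0; i < n; i++) {
        size_t len = strlen(table[i].prefix);

        if (strncmp(option, table[i].prefix, len) == 0 &&
            table[i].handler(option + len))
            return 1;
    }
    return 0;
}
```

      With handler_returns_one in the table, dispatch_option() reports
      trace_clock=jiffies as handled. With handler_returns_zero it does not,
      which corresponds to the option leaking into init's environment.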
      
      Link: https://lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
      Link: https://lkml.kernel.org/r/20220303031744.32356-1-rdunlap@infradead.org
      
      Cc: stable@vger.kernel.org
      Fixes: 7bcfaf54 ("tracing: Add trace_options kernel command line parameter")
      Fixes: e1e232ca ("tracing: Add trace_clock=<clock> kernel parameter")
      Fixes: 970988e1 ("tracing/kprobe: Add kprobe_event= boot parameter")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      1d02b444
    • Daniel Borkmann's avatar
      mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls · 0708a0af
      Daniel Borkmann authored
      syzkaller was recently triggering an oversized kvmalloc() warning via
      xdp_umem_create().
      
      The triggered warning was added back in 7661809d ("mm: don't allow
      oversized kvmalloc() calls"). The rationale for the warning for huge
      kvmalloc sizes was as a reaction to a security bug where the size was
      more than UINT_MAX but not everything was prepared to handle unsigned
      long sizes.
      
      Anyway, the AF_XDP related call trace from this syzkaller report was:
      
        kvmalloc include/linux/mm.h:806 [inline]
        kvmalloc_array include/linux/mm.h:824 [inline]
        kvcalloc include/linux/mm.h:829 [inline]
        xdp_umem_pin_pages net/xdp/xdp_umem.c:102 [inline]
        xdp_umem_reg net/xdp/xdp_umem.c:219 [inline]
        xdp_umem_create+0x6a5/0xf00 net/xdp/xdp_umem.c:252
        xsk_setsockopt+0x604/0x790 net/xdp/xsk.c:1068
        __sys_setsockopt+0x1fd/0x4e0 net/socket.c:2176
        __do_sys_setsockopt net/socket.c:2187 [inline]
        __se_sys_setsockopt net/socket.c:2184 [inline]
        __x64_sys_setsockopt+0xb5/0x150 net/socket.c:2184
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Björn mentioned that requests for >2GB allocation can still be valid:
      
        The structure that is being allocated is the page-pinning accounting.
        AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but
        still fewer than what memcg allows (PAGE_COUNTER_MAX is a LONG_MAX/
        PAGE_SIZE on 64 bit systems). [...]
      
        I could just change from U32_MAX to INT_MAX, but as I stated earlier
        that has a hacky feeling to it. [...] From my perspective, the code
        isn't broken, with the memcg limits in consideration. [...]
      
      Linus says:
      
        [...] Pretty much every time this has come up, the kernel warning has
        shown that yes, the code was broken and there really wasn't a reason
        for doing allocations that big.
      
        Of course, some people would be perfectly fine with the allocation
        failing, they just don't want the warning. I didn't want __GFP_NOWARN
        to shut it up originally because I wanted people to see all those
        cases, but these days I think we can just say "yeah, people can shut
        it up explicitly by saying 'go ahead and fail this allocation, don't
        warn about it'".
      
        So enough time has passed that by now I'd certainly be ok with [it].
      
      Thus allow call-sites to silence such userspace triggered splats if the
      allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call
      to kvcalloc() this is already the case, so nothing else needed there.
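
      A minimal user-space sketch of the resulting behaviour, assuming the
      size limit is INT_MAX. The names sketch_kvmalloc(), SKETCH_GFP_NOWARN
      and sketch_warn_count are hypothetical, the flag value is not the
      kernel's, and malloc() stands in for the real allocator:

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag bit for this sketch; not the kernel's real value. */
#define SKETCH_GFP_NOWARN 0x1u

/* Counts how many warnings the sketch would have emitted. */
static unsigned int sketch_warn_count;

static void *sketch_kvmalloc(size_t size, unsigned int flags)
{
    /* Oversized requests always fail; only the warning is optional. */
    if (size > (size_t)INT_MAX) {
        if (!(flags & SKETCH_GFP_NOWARN))
            sketch_warn_count++;
        return NULL;
    }
    return malloc(size);
}
```

      The point of the sketch is that the flag changes only whether a warning
      is emitted. The oversized request fails in both cases.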
      
      Fixes: 7661809d ("mm: don't allow oversized kvmalloc() calls")
      Reported-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
      Cc: Björn Töpel <bjorn@kernel.org>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com
      Link: https://lore.kernel.org/bpf/20211201202905.b9892171e3f5b9a60f9da251@linux-foundation.org
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0708a0af
    • Filipe Manana's avatar
      btrfs: fallback to blocking mode when doing async dio over multiple extents · ca93e44b
      Filipe Manana authored
      Some users recently reported that MariaDB was getting a read corruption
      when using io_uring on top of btrfs. This started to happen in 5.16,
      after commit 51bd9563 ("btrfs: fix deadlock due to page faults
      during direct IO reads and writes"). That changed btrfs to use the new
      iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
      iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
      corresponds to a memory mapped file region. That type of scenario is
      exercised by test case generic/647 from fstests.
      
      For this MariaDB scenario, we attempt to read 16K from file offset X
      using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
      with a size of 4K, and what happens is the following:
      
      1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
      
      2) iomap creates a struct iomap_dio object, its reference count is
         initialized to 1 and its ->size field is initialized to 0;
      
      3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
         the first 4K extent, and sets up an iomap for this extent consisting
         of a single page;
      
      4) At iomap_dio_bio_iter(), we are able to access the first page of the
         buffer (struct iov_iter) with bio_iov_iter_get_pages() without
         triggering a page fault;
      
      5) iomap submits a bio for this 4K extent
         (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
         the refcount on the struct iomap_dio object to 2. The ->size field
         of the struct iomap_dio object is incremented to 4K;
      
      6) iomap calls btrfs_dio_iomap_begin() again, this time with a file
         offset of X + 4K. There we set up an iomap for the next extent
         that also has a size of 4K;
      
      7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
         which tries to access the next page (2nd page) of the buffer.
         This triggers a page fault and returns -EFAULT;
      
      8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
         to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
         the struct iomap_dio object has a ->size value of 4K (we submitted
         a bio for an extent already). The 'wait_for_completion' variable
         is not set to true, because our iocb has IOCB_NOWAIT set;
      
      9) At the bottom of __iomap_dio_rw(), we decrement the reference count
         of the struct iomap_dio object from 2 to 1. Because we were not
         the only ones holding a reference on it and 'wait_for_completion' is
         set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
         just returns it up the callchain, up to io_uring;
      
      10) The bio submitted for the first extent (step 5) completes and its
          bio endio function, iomap_dio_bio_end_io(), decrements the last
          reference on the struct iomap_dio object, resulting in calling
          iomap_dio_complete_work() -> iomap_dio_complete().
      
      11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
          and return 4K (the amount of io done) to iomap_dio_complete_work();
      
      12) iomap_dio_complete_work() calls the iocb completion callback,
          iocb->ki_complete() with a second argument value of 4K (total io
          done) and the iocb with the adjusted ki_pos of X + 4K. This results
          in completing the read request for io_uring, leaving it with a
          result of 4K bytes read, and only the first page of the buffer
          filled in, while the remaining 3 pages, corresponding to the other
          3 extents, were not filled;
      
      13) For the application, the result is unexpected because if we ask
          to read N bytes, it expects to get N bytes read as long as those
          N bytes don't cross the EOF (i_size).
      
      MariaDB reports this as an error, as it's not expecting a short read,
      since it knows it's asking for read operations fully within the i_size
      boundary. This is typical in many applications, but it may also be
      questionable if they should react to such short reads by issuing more
      read calls to get the remaining data. Nevertheless, the short read
      happened due to a change in btrfs regarding how it deals with page
      faults while in the middle of a read operation, and there's no reason
      why btrfs can't have the previous behaviour of returning the whole data
      that was requested by the application.
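
      For applications that do choose to handle short reads, the usual
      pattern is a retry loop. The sketch below is an assumption about how
      such a caller could look; it uses plain pread() rather than io_uring,
      and pread_full() is a hypothetical helper, not something MariaDB or
      liburing provides:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read exactly len bytes at offset off unless EOF or an error occurs.
 * Returns the number of bytes read, or -1 with errno set.
 */
static ssize_t pread_full(int fd, void *buf, size_t len, off_t off)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done,
                          off + (off_t)done);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted, retry */
            return -1;
        }
        if (n == 0)
            break;          /* EOF */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

      A short read of one page followed by a second call for the rest would
      fill the whole buffer here. The fix in this patch makes that extra
      round trip unnecessary for btrfs.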
      
      The problem can also be triggered with the following simple program:
      
        /* Get O_DIRECT */
        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE
        #endif
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <errno.h>
        #include <string.h>
        #include <liburing.h>
      
        int main(int argc, char *argv[])
        {
            char *foo_path;
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            struct iovec iovec;
            int fd;
            long pagesize;
            void *write_buf;
            void *read_buf;
            ssize_t ret;
            int i;
      
            if (argc != 2) {
                fprintf(stderr, "Use: %s <directory>\n", argv[0]);
                return 1;
            }
      
            foo_path = malloc(strlen(argv[1]) + 5);
            if (!foo_path) {
                fprintf(stderr, "Failed to allocate memory for file path\n");
                return 1;
            }
            strcpy(foo_path, argv[1]);
            strcat(foo_path, "/foo");
      
            /*
             * Create file foo with 2 extents, each with a size matching
             * the page size. Then allocate a buffer to read both extents
             * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
             * the read with io_uring, access the first page of the buffer
             * to fault it in, so that during the read we only trigger a
             * page fault when accessing the second page of the buffer.
             */
             fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                      O_DIRECT, 0666);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to create file 'foo': %s (errno %d)\n",
                         strerror(errno), errno);
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate write buffer\n");
                 return 1;
             }
      
             memset(write_buf, 0xab, pagesize);
             memset(write_buf + pagesize, 0xcd, pagesize);
      
             /* Create 2 extents, each with a size matching page size. */
             for (i = 0; i < 2; i++) {
                 ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                              i * pagesize);
                 if (ret != pagesize) {
                     fprintf(stderr,
                           "Failed to write to file, ret = %ld errno %d (%s)\n",
                            ret, errno, strerror(errno));
                     return 1;
                 }
                 ret = fsync(fd);
                 if (ret != 0) {
                     fprintf(stderr, "Failed to fsync file\n");
                     return 1;
                 }
             }
      
             close(fd);
             fd = open(foo_path, O_RDONLY | O_DIRECT);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to open file 'foo': %s (errno %d)\n",
                         strerror(errno), errno);
                 return 1;
             }
      
             ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate read buffer\n");
                 return 1;
             }
      
             /*
              * Fault in only the first page of the read buffer.
              * We want to trigger a page fault for the 2nd page of the
              * read buffer during the read operation with io_uring
              * (O_DIRECT and IOCB_NOWAIT).
              */
             memset(read_buf, 0, 1);
      
             ret = io_uring_queue_init(1, &ring, 0);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create io_uring queue\n");
                 return 1;
             }
      
             sqe = io_uring_get_sqe(&ring);
             if (!sqe) {
                 fprintf(stderr, "Failed to get io_uring sqe\n");
                 return 1;
             }
      
             iovec.iov_base = read_buf;
             iovec.iov_len = 2 * pagesize;
             io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
      
             ret = io_uring_submit_and_wait(&ring, 1);
             if (ret != 1) {
                 fprintf(stderr,
                         "Failed at io_uring_submit_and_wait()\n");
                 return 1;
             }
      
             ret = io_uring_wait_cqe(&ring, &cqe);
             if (ret < 0) {
                 fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
                 return 1;
             }
      
             printf("io_uring read result for file foo:\n\n");
             printf("  cqe->res == %d (expected %ld)\n", cqe->res, 2 * pagesize);
             printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
                    memcmp(read_buf, write_buf, 2 * pagesize));
      
             io_uring_cqe_seen(&ring, cqe);
             io_uring_queue_exit(&ring);
      
             return 0;
        }
      
      When running it on an unpatched kernel:
      
        $ gcc io_uring_test.c -luring
        $ mkfs.btrfs -f /dev/sda
        $ mount /dev/sda /mnt/sda
        $ ./a.out /mnt/sda
        io_uring read result for file foo:
      
          cqe->res == 4096 (expected 8192)
          memcmp(read_buf, write_buf) == -205 (expected 0)
      
      After this patch, the read always returns 8192 bytes, with the buffer
      filled with the correct data. Although that reproducer always triggers
      the bug in my test vms, it may be less reliable in other environments:
      the bug is not hit if the bio for the first extent completes and
      decrements the reference on the struct iomap_dio object before we do
      the atomic_dec_and_test() on the reference at __iomap_dio_rw().
      
      Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
      whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
      set) over a range that spans multiple extents (or a mix of extents and
      holes). This avoids returning success to the caller when we only did
      partial IO, which is not optimal for writes and for reads it's actually
      incorrect, as the caller doesn't expect to get fewer bytes read than it has
      requested (unless EOF is crossed), as previously mentioned. This is also
      the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
      even though it doesn't use IOMAP_DIO_PARTIAL.
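
      The rule described above can be sketched as a standalone check. This
      is only a model of the decision, not the btrfs code:
      sketch_dio_begin_check() and SKETCH_IOMAP_NOWAIT are hypothetical
      names, and the flag value is not the kernel's:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bit for this sketch; not the kernel's real value. */
#define SKETCH_IOMAP_NOWAIT 0x1u

/*
 * mapped_len: bytes covered by the extent (or hole) found at the start
 *             of the request.
 * req_len:    bytes the caller asked for.
 * Returns 0 to proceed, or -EAGAIN so the caller retries in blocking mode.
 */
static int sketch_dio_begin_check(uint64_t mapped_len, uint64_t req_len,
                                  unsigned int flags)
{
    if ((flags & SKETCH_IOMAP_NOWAIT) && mapped_len < req_len)
        return -EAGAIN;
    return 0;
}
```

      For the MariaDB case, a 16K non-blocking read that starts in a 4K
      extent yields -EAGAIN in this model, while the same request in
      blocking mode proceeds.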
      
      A test case for fstests will follow soon.
      
      Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ca93e44b
    • Niklas Cassel's avatar
      riscv: dts: k210: fix broken IRQs on hart1 · 74583f1b
      Niklas Cassel authored
      Commit 67d96729 ("riscv: Update Canaan Kendryte K210 device tree")
      incorrectly removed two entries from the PLIC interrupt-controller node's
      interrupts-extended property.
      
      The PLIC driver cannot know the mapping between hart contexts and hart ids,
      so this information has to be provided by device tree, as specified by the
      PLIC device tree binding.
      
      The PLIC driver uses the interrupts-extended property, and initializes the
      hart context registers in the exact same order as provided by the
      interrupts-extended property.
      
      In other words, if we don't specify the S-mode interrupts, the PLIC driver
      will simply initialize the hart0 S-mode hart context with the hart1 M-mode
      configuration. It is therefore essential to specify the S-mode IRQs even
      though the system itself will only ever be running in M-mode.
      
      Re-add the S-mode interrupts, so that we get working IRQs on hart1 again.
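
      The ordering rule can be illustrated with a small model. It assumes,
      as the description above implies, that entry i of interrupts-extended
      configures hart context i; struct sketch_irq_target and
      sketch_context_for() are hypothetical names and this is not the PLIC
      driver's code:

```c
#include <stddef.h>

/* One interrupts-extended entry: which hart and privilege mode it targets. */
struct sketch_irq_target {
    int hart;
    char mode;   /* 'M' or 'S' */
};

/*
 * Model of the ordering rule: entry i of interrupts-extended configures
 * PLIC hart context i. Returns the context index that ends up holding the
 * configuration for (hart, mode), or -1 if there is no such entry.
 */
static int sketch_context_for(const struct sketch_irq_target *entries,
                              size_t n, int hart, char mode)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (entries[i].hart == hart && entries[i].mode == mode)
            return (int)i;
    }
    return -1;
}
```

      With all four entries present (hart0 M, hart0 S, hart1 M, hart1 S),
      the hart1 M-mode configuration lands in context 2. With the S-mode
      entries missing it lands in context 1, which is the hart0 S-mode
      context, matching the breakage described above.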
      
      Cc: <stable@vger.kernel.org>
      Fixes: 67d96729 ("riscv: Update Canaan Kendryte K210 device tree")
      Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
      74583f1b