Commits · cad347fab142bcb9bebc125b5ba0c1e52ce74fdc · Kirill Smelkov / linux

27 May, 2021 8 commits

KVM: selftests: add a memslot-related performance benchmark · cad347fa

Maciej S. Szmigiero authored Apr 13, 2021

This benchmark contains the following tests:
* Map test, where the host unmaps guest memory while the guest writes to
it (maps it).

The test is designed in a way to make the unmap operation on the host
take a negligible amount of time in comparison with the mapping
operation in the guest.

The test area is actually split in two: the first half is being mapped
by the guest while the second half in being unmapped by the host.
Then a guest <-> host sync happens and the areas are reversed.

* Unmap test which is broadly similar to the above map test, but it is
designed in an opposite way: to make the mapping operation in the guest
take a negligible amount of time in comparison with the unmap operation
on the host.
This test is available in two variants: with per-page unmap operation
or a chunked one (using 2 MiB chunk size).

* Move active area test which involves moving the last (highest gfn)
memslot a bit back and forth on the host while the guest is
concurrently writing around the area being moved (including over the
moved memslot).

* Move inactive area test which is similar to the previous move active
area test, but now guest writes all happen outside of the area being
moved.

* Read / write test in which the guest writes to the beginning of each
page of the test area while the host writes to the middle of each such
page.
Then each side checks the values the other side has written.
This particular test is not expected to give different results depending
on particular memslots implementation, it is meant as a rough sanity
check and to provide insight on the spread of test results expected.

Each test performs its operation in a loop until a test period ends
(this is 5 seconds by default, but it is configurable).
Then the total count of loops done is divided by the actual elapsed
time to give the test result.

The tests have a configurable memslot cap with the "-s" test option, by
default the system maximum is used.
Each test is repeated a particular number of times (by default 20
times), the best result achieved is printed.

The test memory area is divided equally between memslots, the reminder
is added to the last memslot.
The test area size does not depend on the number of memslots in use.

The tests also measure the time that it took to add all these memslots.
The best result from the tests that use the whole test area is printed
after all the requested tests are done.

In general, these tests are designed to use as much memory as possible
(within reason) while still doing 100+ loops even on high memslot counts
with the default test length.
Increasing the test runtime makes it increasingly more likely that some
event will happen on the system during the test run, which might lower
the test result.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Message-Id: <8d31bb3d92bc8fa33a9756fa802ee14266ab994e.1618253574.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

cad347fa

KVM: selftests: Keep track of memslots more efficiently · 22721a56

Maciej S. Szmigiero authored Apr 13, 2021

The KVM selftest framework was using a simple list for keeping track of
the memslots currently in use.
This resulted in lookups and adding a single memslot being O(n), the
later due to linear scanning of the existing memslot set to check for
the presence of any conflicting entries.

Before this change, benchmarking high count of memslots was more or less
impossible as pretty much all the benchmark time was spent in the
selftest framework code.

We can simply use a rbtree for keeping track of both of gfn and hva.
We don't need an interval tree for hva here as we can't have overlapping
memslots because we allocate a completely new memory chunk for each new
memslot.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Message-Id: <b12749d47ee860468240cf027412c91b76dbe3db.1618253574.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

22721a56

selftests: kvm: fix potential issue with ELF loading · a13534d6

Paolo Bonzini authored May 24, 2021

vm_vaddr_alloc() sets up GVA to GPA mapping page by page; therefore, GPAs
may not be continuous if same memslot is used for data and page table allocation.

kvm_vm_elf_load() however expects a continuous range of HVAs (and thus GPAs)
because it does not try to read file data page by page. Fix this mismatch
by allocating memory in one step.
Reported-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a13534d6

selftests: kvm: make allocation of extra memory take effect · 39fe2fc9

Zhenzhong Duan authored May 12, 2021

The extra memory pages is missed to be allocated during VM creating.
perf_test_util and kvm_page_table_test use it to alloc extra memory
currently.

Fix it by adding extra_mem_pages to the total memory calculation before
allocate.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Message-Id: <20210512043107.30076-1-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

39fe2fc9

KVM: X86: hyper-v: Task srcu lock when accessing kvm_memslots() · da6d63a0

Wanpeng Li authored May 18, 2021

   WARNING: suspicious RCU usage
   5.13.0-rc1 #4 Not tainted
   -----------------------------
   ./include/linux/kvm_host.h:710 suspicious rcu_dereference_check() usage!

  other info that might help us debug this:

  rcu_scheduler_active = 2, debug_locks = 1
   1 lock held by hyperv_clock/8318:
    #0: ffffb6b8cb05a7d8 (&hv->hv_lock){+.+.}-{3:3}, at: kvm_hv_invalidate_tsc_page+0x3e/0xa0 [kvm]

  stack backtrace:
  CPU: 3 PID: 8318 Comm: hyperv_clock Not tainted 5.13.0-rc1 #4
  Call Trace:
   dump_stack+0x87/0xb7
   lockdep_rcu_suspicious+0xce/0xf0
   kvm_write_guest_page+0x1c1/0x1d0 [kvm]
   kvm_write_guest+0x50/0x90 [kvm]
   kvm_hv_invalidate_tsc_page+0x79/0xa0 [kvm]
   kvm_gen_update_masterclock+0x1d/0x110 [kvm]
   kvm_arch_vm_ioctl+0x2a7/0xc50 [kvm]
   kvm_vm_ioctl+0x123/0x11d0 [kvm]
   __x64_sys_ioctl+0x3ed/0x9d0
   do_syscall_64+0x3d/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

kvm_memslots() will be called by kvm_write_guest(), so we should take the srcu lock.

Fixes: e880c6ea (KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs)
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-4-git-send-email-wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

da6d63a0

KVM: X86: Fix vCPU preempted state from guest's point of view · 1eff0ada

Wanpeng Li authored May 18, 2021

Commit 66570e96 (kvm: x86: only provide PV features if enabled in guest's
CPUID) avoids to access pv tlb shootdown host side logic when this pv feature
is not exposed to guest, however, kvm_steal_time.preempted not only leveraged
by pv tlb shootdown logic but also mitigate the lock holder preemption issue.
From guest's point of view, vCPU is always preempted since we lose the reset
of kvm_steal_time.preempted before vmentry if pv tlb shootdown feature is not
exposed. This patch fixes it by clearing kvm_steal_time.preempted before
vmentry.

Fixes: 66570e96 (kvm: x86: only provide PV features if enabled in guest's CPUID)
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-3-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1eff0ada

KVM: X86: Bail out of direct yield in case of under-committed scenarios · 72b268a8

Wanpeng Li authored May 18, 2021

In case of under-committed scenarios, vCPUs can be scheduled easily;
kvm_vcpu_yield_to adds extra overhead, and it is also common to see
when vcpu->ready is true but yield later failing due to p->state is
TASK_RUNNING.

Let's bail out in such scenarios by checking the length of current cpu
runqueue, which can be treated as a hint of under-committed instead of
guarantee of accuracy. 30%+ of directed-yield attempts can now avoid
the expensive lookups in kvm_sched_yield() in an under-committed scenario.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

72b268a8

KVM: PPC: exit halt polling on need_resched() · 6bd5b743

Wanpeng Li authored May 18, 2021

This is inspired by commit 262de410 (kvm: exit halt polling on
need_resched() as well). Due to PPC implements an arch specific halt
polling logic, we have to the need_resched() check there as well. This
patch adds a helper function that can be shared between book3s and generic
halt-polling loops.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Venkatesh Srinivas <venkateshs@chromium.org>
Cc: Jim Mattson <jmattson@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-1-git-send-email-wanpengli@tencent.com>
[Make the function inline. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

6bd5b743

24 May, 2021 3 commits

KVM: SVM: make the avic parameter a bool · 28a4aa11

Paolo Bonzini authored May 24, 2021

Make it consistent with kvm_intel.enable_apicv.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

28a4aa11

KVM: VMX: Drop unneeded CONFIG_X86_LOCAL_APIC check · 377872b3

Vitaly Kuznetsov authored May 18, 2021

CONFIG_X86_LOCAL_APIC is always on when CONFIG_KVM (on x86) since
commit e42eef4b ("KVM: add X86_LOCAL_APIC dependency").
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210518144339.1987982-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>

377872b3

KVM: SVM: Drop unneeded CONFIG_X86_LOCAL_APIC check · 778a136e

Vitaly Kuznetsov authored May 18, 2021

AVIC dependency on CONFIG_X86_LOCAL_APIC is dead code since
commit e42eef4b ("KVM: add X86_LOCAL_APIC dependency").
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210518144339.1987982-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>

778a136e

17 May, 2021 1 commit

Merge tag 'kvmarm-fixes-5.13-1' of... · a4345a7c

Paolo Bonzini authored May 17, 2021

Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.13, take #1

- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code

a4345a7c

15 May, 2021 7 commits

KVM: arm64: Fix debug register indexing · cb853ded

Marc Zyngier authored May 14, 2021

Commit 03fdfb26 ("KVM: arm64: Don't write junk to sysregs on
reset") flipped the register number to 0 for all the debug registers
in the sysreg table, hereby indicating that these registers live
in a separate shadow structure.

However, the author of this patch failed to realise that all the
accessors are using that particular index instead of the register
encoding, resulting in all the registers hitting index 0. Not quite
a valid implementation of the architecture...

Address the issue by fixing all the accessors to use the CRm field
of the encoding, which contains the debug register index.

Fixes: 03fdfb26 ("KVM: arm64: Don't write junk to sysregs on reset")
Reported-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org

cb853ded

KVM: arm64: Commit pending PC adjustemnts before returning to userspace · 26778aaa

Marc Zyngier authored May 06, 2021

KVM currently updates PC (and the corresponding exception state)
using a two phase approach: first by setting a set of flags,
then by converting these flags into a state update when the vcpu
is about to enter the guest.

However, this creates a disconnect with userspace if the vcpu thread
returns there with any exception/PC flag set. In this case, the exposed
context is wrong, as userspace doesn't have access to these flags
(they aren't architectural). It also means that these flags are
preserved across a reset, which isn't expected.

To solve this problem, force an explicit synchronisation of the
exception state on vcpu exit to userspace. As an optimisation
for nVHE systems, only perform this when there is something pending.
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Tested-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org # 5.11

26778aaa

KVM: arm64: Move __adjust_pc out of line · f5e30680

Marc Zyngier authored May 06, 2021

In order to make it easy to call __adjust_pc() from the EL1 code
(in the case of nVHE), rename it to __kvm_adjust_pc() and move
it out of line.

No expected functional change.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Tested-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org # 5.11

f5e30680

KVM: arm64: Mark the host stage-2 memory pools static · 3fdc15fe

Quentin Perret authored May 14, 2021

The host stage-2 memory pools are not used outside of mem_protect.c,
mark them static.

Fixes: 1025c8c0 ("KVM: arm64: Wrap the host with a stage 2")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210514085640.3917886-3-qperret@google.com

3fdc15fe

KVM: arm64: Mark pkvm_pgtable_mm_ops static · eaa9b88d

Quentin Perret authored May 14, 2021

It is not used outside of setup.c, mark it static.

Fixes:f320bc74 ("KVM: arm64: Prepare the creation of s1 mappings at EL2")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210514085640.3917886-2-qperret@google.com

eaa9b88d

KVM: arm64: Fix boolreturn.cocci warnings · fcb82839

kernel test robot authored Apr 27, 2021

arch/arm64/kvm/mmu.c:1114:9-10: WARNING: return of 0/1 in function 'kvm_age_gfn' with return type bool
arch/arm64/kvm/mmu.c:1084:9-10: WARNING: return of 0/1 in function 'kvm_set_spte_gfn' with return type bool
arch/arm64/kvm/mmu.c:1127:9-10: WARNING: return of 0/1 in function 'kvm_test_age_gfn' with return type bool
arch/arm64/kvm/mmu.c:1070:9-10: WARNING: return of 0/1 in function 'kvm_unmap_gfn_range' with return type bool

Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci

Fixes: cd4c7183 ("KVM: arm64: Convert to the gfn-based MMU notifier callbacks")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210426223357.GA45871@cd4295a34ed8

fcb82839

Revert "irqbypass: do not start cons/prod when failed connect" · e44b49f6

Zhu Lingshan authored May 08, 2021

This reverts commit a979a6aa.

The reverted commit may cause VM freeze on arm64 with GICv4,
where stopping a consumer is implemented by suspending the VM.
Should the connect fail, the VM will not be resumed, which
is a bit of a problem.

It also erroneously calls the producer destructor unconditionally,
which is unexpected.
Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
[maz: tags and cc-stable, commit message update]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Fixes: a979a6aa ("irqbypass: do not start cons/prod when failed connect")
Link: https://lore.kernel.org/r/3a2c66d6-6ca0-8478-d24b-61e8e3241b20@hisilicon.com
Link: https://lore.kernel.org/r/20210508071152.722425-1-lingshan.zhu@intel.com
Cc: stable@vger.kernel.org

e44b49f6

09 May, 2021 10 commits

Linux 5.13-rc1 · 6efb943b
Linus Torvalds authored May 09, 2021

6efb943b

fbmem: fix horribly incorrect placement of __maybe_unused · 6dae40ae

Linus Torvalds authored May 09, 2021

Commit b9d79e4c ("fbmem: Mark proc_fb_seq_ops as __maybe_unused")
places the '__maybe_unused' in an entirely incorrect location between
the "struct" keyword and the structure name.

It's a wonder that gcc accepts that silently, but clang quite reasonably
warns about it:

    drivers/video/fbdev/core/fbmem.c:736:21: warning: attribute declaration must precede definition [-Wignored-attributes]
    static const struct __maybe_unused seq_operations proc_fb_seq_ops = {
                        ^

Fix it.

Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6dae40ae

Merge tag 'drm-next-2021-05-10' of git://anongit.freedesktop.org/drm/drm · efc58a96

Linus Torvalds authored May 09, 2021

Pull drm fixes from Dave Airlie:
 "Bit later than usual, I queued them all up on Friday then promptly
  forgot to write the pull request email. This is mainly amdgpu fixes,
  with some radeon/msm/fbdev and one i915 gvt fix thrown in.

  amdgpu:
   - MPO hang workaround
   - Fix for concurrent VM flushes on vega/navi
   - dcefclk is not adjustable on navi1x and newer
   - MST HPD debugfs fix
   - Suspend/resumes fixes
   - Register VGA clients late in case driver fails to load
   - Fix GEM leak in user framebuffer create
   - Add support for polaris12 with 32 bit memory interface
   - Fix duplicate cursor issue when using overlay
   - Fix corruption with tiled surfaces on VCN3
   - Add BO size and stride check to fix BO size verification

  radeon:
   - Fix off-by-one in power state parsing
   - Fix possible memory leak in power state parsing

  msm:
   - NULL ptr dereference fix

  fbdev:
   - procfs disabled warning fix

  i915:
   - gvt: Fix a possible division by zero in vgpu display rate
     calculation"

* tag 'drm-next-2021-05-10' of git://anongit.freedesktop.org/drm/drm:
  drm/amdgpu: Use device specific BO size & stride check.
  drm/amdgpu: Init GFX10_ADDR_CONFIG for VCN v3 in DPG mode.
  drm/amd/pm: initialize variable
  drm/radeon: Avoid power table parsing memory leaks
  drm/radeon: Fix off-by-one power_state index heap overwrite
  drm/amd/display: Fix two cursor duplication when using overlay
  drm/amdgpu: add new MC firmware for Polaris12 32bit ASIC
  fbmem: Mark proc_fb_seq_ops as __maybe_unused
  drm/msm/dpu: Delete bonkers code
  drm/i915/gvt: Prevent divided by zero when calculating refresh rate
  amdgpu: fix GEM obj leak in amdgpu_display_user_framebuffer_create
  drm/amdgpu: Register VGA clients after init can no longer fail
  drm/amdgpu: Handling of amdgpu_device_resume return value for graceful teardown
  drm/amdgpu: fix r initial values
  drm/amd/display: fix wrong statement in mst hpd debugfs
  amdgpu/pm: set pp_dpm_dcefclk to readonly on NAVI10 and newer gpus
  amdgpu/pm: Prevent force of DCEFCLK on NAVI10 and SIENNA_CICHLID
  drm/amdgpu: fix concurrent VM flushes on Vega/Navi v2
  drm/amd/display: Reject non-zero src_y and src_x for video planes

efc58a96

Merge tag 'block-5.13-2021-05-09' of git://git.kernel.dk/linux-block · 506c3079

Linus Torvalds authored May 09, 2021

Pull block fix from Jens Axboe:
 "Turns out the bio max size change still has issues, so let's get it
  reverted for 5.13-rc1. We'll shake out the issues there and defer it
  to 5.14 instead"

* tag 'block-5.13-2021-05-09' of git://git.kernel.dk/linux-block:
  Revert "bio: limit bio max size"

506c3079

Merge tag '5.13-rc-smb3-part3' of git://git.samba.org/sfrench/cifs-2.6 · 0a55a1fb

Linus Torvalds authored May 09, 2021

Pull cifs fixes from Steve French:
 "Three small SMB3 chmultichannel related changesets (also for stable)
  from the SMB3 test event this week.

  The other fixes are still in review/testing"

* tag '5.13-rc-smb3-part3' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: if max_channels set to more than one channel request multichannel
  smb3: do not attempt multichannel to server which does not support it
  smb3: when mounting with multichannel include it in requested capabilities

0a55a1fb

Merge tag 'sched-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9819f682

Linus Torvalds authored May 09, 2021

Pull scheduler fixes from Thomas Gleixner:
 "A set of scheduler updates:

   - Prevent PSI state corruption when schedule() races with cgroup
     move.

     A recent commit combined two PSI callbacks to reduce the number of
     cgroup tree updates, but missed that schedule() can drop rq::lock
     for load balancing, which opens the race window for
     cgroup_move_task() which then observes half updated state.

     The fix is to solely use task::ps_flags instead of looking at the
     potentially mismatching scheduler state

   - Prevent an out-of-bounds access in uclamp caused bu a rounding
     division which can lead to an off-by-one error exceeding the
     buckets array size.

   - Prevent unfairness caused by missing load decay when a task is
     attached to a cfs runqueue.

     The old load of the task was attached to the runqueue and never
     removed. Fix it by enforcing the load update through the hierarchy
     for unthrottled run queue instances.

   - A documentation fix fot the 'sched_verbose' command line option"

* tag 'sched-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix unfairness caused by missing load decay
  sched: Fix out-of-bound access in uclamp
  psi: Fix psi state corruption when schedule() races with cgroup move
  sched,doc: sched_debug_verbose cmdline should be sched_verbose

9819f682

Merge tag 'locking-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 732a27a0

Linus Torvalds authored May 09, 2021

Pull locking fixes from Thomas Gleixner:
 "A set of locking related fixes and updates:

   - Two fixes for the futex syscall related to the timeout handling.

     FUTEX_LOCK_PI does not support the FUTEX_CLOCK_REALTIME bit and
     because it's not set the time namespace adjustment for clock
     MONOTONIC is applied wrongly.

     FUTEX_WAIT cannot support the FUTEX_CLOCK_REALTIME bit because its
     always a relative timeout.

   - Cleanups in the futex syscall entry points which became obvious
     when the two timeout handling bugs were fixed.

   - Cleanup of queued_write_lock_slowpath() as suggested by Linus

   - Fixup of the smp_call_function_single_async() prototype"

* tag 'locking-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Make syscall entry points less convoluted
  futex: Get rid of the val2 conditional dance
  futex: Do not apply time namespace adjustment on FUTEX_LOCK_PI
  Revert 337f1304 ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op")
  locking/qrwlock: Cleanup queued_write_lock_slowpath()
  smp: Fix smp_call_function_single_async prototype

732a27a0

Merge tag 'perf_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 85bbba1c

Linus Torvalds authored May 09, 2021

Pull x86 perf fix from Borislav Petkov:
 "Handle power-gating of AMD IOMMU perf counters properly when they are
  used"

* tag 'perf_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/events/amd/iommu: Fix invalid Perf result due to IOMMU PMC power-gating

85bbba1c

Merge tag 'x86_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · dd3e4012

Linus Torvalds authored May 09, 2021

Pull x86 fixes from Borislav Petkov:
 "A bunch of things accumulated for x86 in the last two weeks:

   - Fix guest vtime accounting so that ticks happening while the guest
     is running can also be accounted to it. Along with a consolidation
     to the guest-specific context tracking helpers.

   - Provide for the host NMI handler running after a VMX VMEXIT to be
     able to run on the kernel stack correctly.

   - Initialize MSR_TSC_AUX when RDPID is supported and not RDTSCP (virt
     relevant - real hw supports both)

   - A code generation improvement to TASK_SIZE_MAX through the use of
     alternatives

   - The usual misc and related cleanups and improvements"

* tag 'x86_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM: x86: Consolidate guest enter/exit logic to common helpers
  context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain
  context_tracking: Consolidate guest enter/exit wrappers
  sched/vtime: Move guest enter/exit vtime accounting to vtime.h
  sched/vtime: Move vtime accounting external declarations above inlines
  KVM: x86: Defer vtime accounting 'til after IRQ handling
  context_tracking: Move guest exit vtime accounting to separate helpers
  context_tracking: Move guest exit context tracking to separate helpers
  KVM/VMX: Invoke NMI non-IST entry instead of IST entry
  x86/cpu: Remove write_tsc() and write_rdtscp_aux() wrappers
  x86/cpu: Initialize MSR_TSC_AUX if RDTSCP *or* RDPID is supported
  x86/resctrl: Fix init const confusion
  x86: Delete UD0, UD1 traces
  x86/smpboot: Remove duplicate includes
  x86/cpu: Use alternative to generate the TASK_SIZE_MAX constant

dd3e4012

Revert "bio: limit bio max size" · 35c820e7

Jens Axboe authored May 08, 2021

This reverts commit cd2c7545.

Alex reports that the commit causes corruption with LUKS on ext4. Revert
it for now so that this can be investigated properly.

Link: https://lore.kernel.org/linux-block/1620493841.bxdq8r5haw.none@localhost/Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

35c820e7

08 May, 2021 11 commits

Merge tag 'riscv-for-linus-5.13-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · b7415964

Linus Torvalds authored May 08, 2021

Pull RISC-V fixes from Palmer Dabbelt:

 - A fix to avoid over-allocating the kernel's mapping on !MMU systems,
   which could lead to up to 2MiB of lost memory

 - The SiFive address extension errata only manifest on rv64, they are
   now disabled on rv32 where they are unnecessary

 - A pair of late-landing cleanups

* tag 'riscv-for-linus-5.13-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: remove unused handle_exception symbol
  riscv: Consistify protect_kernel_linear_mapping_text_rodata() use
  riscv: enable SiFive errata CIP-453 and CIP-1200 Kconfig only if CONFIG_64BIT=y
  riscv: Only extend kernel reservation if mapped read-only

b7415964

drm/i915/display: fix compiler warning about array overrun · fec4d427

Linus Torvalds authored May 08, 2021

intel_dp_check_mst_status() uses a 14-byte array to read the DPRX Event
Status Indicator data, but then passes that buffer at offset 10 off as
an argument to drm_dp_channel_eq_ok().

End result: there are only 4 bytes remaining of the buffer, yet
drm_dp_channel_eq_ok() wants a 6-byte buffer.  gcc-11 correctly warns
about this case:

  drivers/gpu/drm/i915/display/intel_dp.c: In function ‘intel_dp_check_mst_status’:
  drivers/gpu/drm/i915/display/intel_dp.c:3491:22: warning: ‘drm_dp_channel_eq_ok’ reading 6 bytes from a region of size 4 [-Wstringop-overread]
   3491 |                     !drm_dp_channel_eq_ok(&esi[10], intel_dp->lane_count)) {
        |                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  drivers/gpu/drm/i915/display/intel_dp.c:3491:22: note: referencing argument 1 of type ‘const u8 *’ {aka ‘const unsigned char *’}
  In file included from drivers/gpu/drm/i915/display/intel_dp.c:38:
  include/drm/drm_dp_helper.h:1466:6: note: in a call to function ‘drm_dp_channel_eq_ok’
   1466 | bool drm_dp_channel_eq_ok(const u8 link_status[DP_LINK_STATUS_SIZE],
        |      ^~~~~~~~~~~~~~~~~~~~
       6:14 elapsed

This commit just extends the original array by 2 zero-initialized bytes,
avoiding the warning.

There may be some underlying bug in here that caused this confusion, but
this is at least no worse than the existing situation that could use
random data off the stack.

Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fec4d427

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 07db0563

Linus Torvalds authored May 08, 2021

Pull more SCSI updates from James Bottomley:
 "This is a set of minor fixes in various drivers (qla2xxx, ufs,
  scsi_debug, lpfc) one doc fix and a fairly large update to the fnic
  driver to remove the open coded iteration functions in favour of the
  scsi provided ones"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: fnic: Use scsi_host_busy_iter() to traverse commands
  scsi: fnic: Kill 'exclude_id' argument to fnic_cleanup_io()
  scsi: scsi_debug: Fix cmd_per_lun, set to max_queue
  scsi: ufs: core: Narrow down fast path in system suspend path
  scsi: ufs: core: Cancel rpm_dev_flush_recheck_work during system suspend
  scsi: ufs: core: Do not put UFS power into LPM if link is broken
  scsi: qla2xxx: Prevent PRLI in target mode
  scsi: qla2xxx: Add marginal path handling support
  scsi: target: tcmu: Return from tcmu_handle_completions() if cmd_id not found
  scsi: ufs: core: Fix a typo in ufs-sysfs.c
  scsi: lpfc: Fix bad memory access during VPD DUMP mailbox command
  scsi: lpfc: Fix DMA virtual address ptr assignment in bsg
  scsi: lpfc: Fix illegal memory access on Abort IOCBs
  scsi: blk-mq: Fix build warning when making htmldocs

07db0563

Merge tag 'kbuild-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · 0f979d81

Linus Torvalds authored May 08, 2021

Pull more Kbuild updates from Masahiro Yamada:

 - Convert sh and sparc to use generic shell scripts to generate the
   syscall headers

 - refactor .gitignore files

 - Update kernel/config_data.gz only when the content of the .config
   is really changed, which avoids the unneeded re-link of vmlinux

 - move "remove stale files" workarounds to scripts/remove-stale-files

 - suppress unused-but-set-variable warnings by default for Clang
   as well

 - fix locale setting LANG=C to LC_ALL=C

 - improve 'make distclean'

 - always keep intermediate objects from scripts/link-vmlinux.sh

 - move IF_ENABLED out of <linux/kconfig.h> to make it self-contained

 - misc cleanups

* tag 'kbuild-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (25 commits)
  linux/kconfig.h: replace IF_ENABLED() with PTR_IF() in <linux/kernel.h>
  kbuild: Don't remove link-vmlinux temporary files on exit/signal
  kbuild: remove the unneeded comments for external module builds
  kbuild: make distclean remove tag files in sub-directories
  kbuild: make distclean work against $(objtree) instead of $(srctree)
  kbuild: refactor modname-multi by using suffix-search
  kbuild: refactor fdtoverlay rule
  kbuild: parameterize the .o part of suffix-search
  arch: use cross_compiling to check whether it is a cross build or not
  kbuild: remove ARCH=sh64 support from top Makefile
  .gitignore: prefix local generated files with a slash
  kbuild: replace LANG=C with LC_ALL=C
  Makefile: Move -Wno-unused-but-set-variable out of GCC only block
  kbuild: add a script to remove stale generated files
  kbuild: update config_data.gz only when the content of .config is changed
  .gitignore: ignore only top-level modules.builtin
  .gitignore: move tags and TAGS close to other tag files
  kernel/.gitgnore: remove stale timeconst.h and hz.bc
  usr/include: refactor .gitignore
  genksyms: fix stale comment
  ...

0f979d81

smb3: if max_channels set to more than one channel request multichannel · c1f8a398

Steve French authored May 07, 2021

Mounting with "multichannel" is obviously implied if user requested
more than one channel on mount (ie mount parm max_channels>1).
Currently both have to be specified. Fix that so that if max_channels
is greater than 1 on mount, enable multichannel rather than silently
falling back to non-multichannel.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
Cc: <stable@vger.kernel.org> # v5.11+
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>

c1f8a398

smb3: do not attempt multichannel to server which does not support it · 9c2dc11d

Steve French authored May 07, 2021

We were ignoring CAP_MULTI_CHANNEL in the server response - if the
server doesn't support multichannel we should not be attempting it.

See MS-SMB2 section 3.2.5.2
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-By: Tom Talpey <tom@talpey.com>
Cc: <stable@vger.kernel.org> # v5.8+
Signed-off-by: Steve French <stfrench@microsoft.com>

9c2dc11d

Merge tag 'powerpc-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · ab159ac5

Linus Torvalds authored May 08, 2021

Pull powerpc updates and fixes from Michael Ellerman:
 "A bit of a mixture of things, tying up some loose ends.

  There's the removal of the nvlink code, which dependend on a commit in
  the vfio tree. Then the enablement of huge vmalloc which was in next
  for a few weeks but got dropped due to conflicts. And there's also a
  few fixes.

  Summary:

   - Remove the nvlink support now that it's only user has been removed.

   - Enable huge vmalloc mappings for Radix MMU (P9).

   - Fix KVM conversion to gfn-based MMU notifier callbacks.

   - Fix a kexec/kdump crash with hot plugged CPUs.

   - Fix boot failure on 32-bit with CONFIG_STACKPROTECTOR.

   - Restore alphabetic order of the selects under CONFIG_PPC.

  Thanks to: Christophe Leroy, Christoph Hellwig, Nicholas Piggin,
  Sandipan Das, and Sourabh Jain"

* tag 'powerpc-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacks
  powerpc/kconfig: Restore alphabetic order of the selects under CONFIG_PPC
  powerpc/32: Fix boot failure with CONFIG_STACKPROTECTOR
  powerpc/powernv/memtrace: Fix dcache flushing
  powerpc/kexec_file: Use current CPU info while setting up FDT
  powerpc/64s/radix: Enable huge vmalloc mappings
  powerpc/powernv: remove the nvlink support

ab159ac5

smb3: when mounting with multichannel include it in requested capabilities · 679971e7

Steve French authored May 07, 2021

In the SMB3/SMB3.1.1 negotiate protocol request, we are supposed to
advertise CAP_MULTICHANNEL capability when establishing multiple
channels has been requested by the user doing the mount. See MS-SMB2
sections 2.2.3 and 3.2.5.2

Without setting it there is some risk that multichannel could fail
if the server interpreted the field strictly.
Reviewed-By: Tom Talpey <tom@talpey.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: <stable@vger.kernel.org> # v5.8+
Signed-off-by: Steve French <stfrench@microsoft.com>

679971e7

Merge tag 'net-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · fc858a52

Linus Torvalds authored May 08, 2021

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.13-rc1, including fixes from bpf, can and
  netfilter trees. Self-contained fixes, nothing risky.

  Current release - new code bugs:

   - dsa: ksz: fix a few bugs found by static-checker in the new driver

   - stmmac: fix frame preemption handshake not triggering after
     interface restart

  Previous releases - regressions:

   - make nla_strcmp handle more then one trailing null character

   - fix stack OOB reads while fragmenting IPv4 packets in openvswitch
     and net/sched

   - sctp: do asoc update earlier in sctp_sf_do_dupcook_a

   - sctp: delay auto_asconf init until binding the first addr

   - stmmac: clear receive all(RA) bit when promiscuous mode is off

   - can: mcp251x: fix resume from sleep before interface was brought up

  Previous releases - always broken:

   - bpf: fix leakage of uninitialized bpf stack under speculation

   - bpf: fix masking negation logic upon negative dst register

   - netfilter: don't assume that skb_header_pointer() will never fail

   - only allow init netns to set default tcp cong to a restricted algo

   - xsk: fix xp_aligned_validate_desc() when len == chunk_size to avoid
     false positive errors

   - ethtool: fix missing NLM_F_MULTI flag when dumping

   - can: m_can: m_can_tx_work_queue(): fix tx_skb race condition

   - sctp: fix a SCTP_MIB_CURRESTAB leak in sctp_sf_do_dupcook_b

   - bridge: fix NULL-deref caused by a races between assigning
     rx_handler_data and setting the IFF_BRIDGE_PORT bit

  Latecomer:

   - seg6: add counters support for SRv6 Behaviors"

* tag 'net-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits)
  atm: firestream: Use fallthrough pseudo-keyword
  net: stmmac: Do not enable RX FIFO overflow interrupts
  mptcp: fix splat when closing unaccepted socket
  i40e: Remove LLDP frame filters
  i40e: Fix PHY type identifiers for 2.5G and 5G adapters
  i40e: fix the restart auto-negotiation after FEC modified
  i40e: Fix use-after-free in i40e_client_subtask()
  i40e: fix broken XDP support
  netfilter: nftables: avoid potential overflows on 32bit arches
  netfilter: nftables: avoid overflows in nft_hash_buckets()
  tcp: Specify cmsgbuf is user pointer for receive zerocopy.
  mlxsw: spectrum_mr: Update egress RIF list before route's action
  net: ipa: fix inter-EE IRQ register definitions
  can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
  can: mcp251x: fix resume from sleep before interface was brought up
  can: mcp251xfd: mcp251xfd_probe(): add missing can_rx_offload_del() in error path
  can: mcp251xfd: mcp251xfd_probe(): fix an error pointer dereference in probe
  netfilter: nftables: Fix a memleak from userdata error path in new objects
  netfilter: remove BUG_ON() after skb_header_pointer()
  netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check
  ...

fc858a52

linux/kconfig.h: replace IF_ENABLED() with PTR_IF() in <linux/kernel.h> · 0ab1438b

Masahiro Yamada authored May 06, 2021

<linux/kconfig.h> is included from all the kernel-space source files,
including C, assembly, linker scripts. It is intended to contain a
minimal set of macros to evaluate CONFIG options.

IF_ENABLED() is an intruder here because (x ? y : z) is C code, which
should not be included from assembly files or linker scripts.

Also, <linux/kconfig.h> is no longer self-contained because NULL is
defined in <linux/stddef.h>.

Move IF_ENABLED() out to <linux/kernel.h> as PTR_IF(). PTF_IF()
takes the general boolean expression instead of a CONFIG option
so that it fits better in <linux/kernel.h>.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>

0ab1438b

Merge branch 'master' into next · f96271ce

Michael Ellerman authored May 08, 2021

Merge master back into next, this allows us to resolve some conflicts in
arch/powerpc/Kconfig, and also re-sort the symbols under config PPC so
that they are in alphabetical order again.

f96271ce