1. 08 Jul, 2020 5 commits
  2. 06 Jul, 2020 3 commits
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-fixes-5.8-3' of... · 8038a922
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
      
      KVM/arm fixes for 5.8, take #3
      
      - Disable preemption on context-switching PMU EL0 state happening
        on system register trap
      - Don't clobber X0 when tearing down KVM via a soft reset (kexec)
      8038a922
    • Andrew Scull's avatar
      KVM: arm64: Stop clobbering x0 for HVC_SOFT_RESTART · b9e10d4a
      Andrew Scull authored
      HVC_SOFT_RESTART is given values for x0-2 that it should installed
      before exiting to the new address so should not set x0 to stub HVC
      success or failure code.
      
      Fixes: af42f204 ("arm64: hyp-stub: Zero x0 on successful stub handling")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAndrew Scull <ascull@google.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20200706095259.1338221-1-ascull@google.com
      b9e10d4a
    • Marc Zyngier's avatar
      KVM: arm64: PMU: Fix per-CPU access in preemptible context · 146f76cc
      Marc Zyngier authored
      Commit 07da1ffa ("KVM: arm64: Remove host_cpu_context
      member from vcpu structure") has, by removing the host CPU
      context pointer, exposed that kvm_vcpu_pmu_restore_guest
      is called in preemptible contexts:
      
      [  266.932442] BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-aar/779
      [  266.939721] caller is debug_smp_processor_id+0x20/0x30
      [  266.944157] CPU: 2 PID: 779 Comm: qemu-system-aar Tainted: G            E     5.8.0-rc3-00015-g8d4aa58b2fe3 #1374
      [  266.954268] Hardware name: amlogic w400/w400, BIOS 2020.04 05/22/2020
      [  266.960640] Call trace:
      [  266.963064]  dump_backtrace+0x0/0x1e0
      [  266.966679]  show_stack+0x20/0x30
      [  266.969959]  dump_stack+0xe4/0x154
      [  266.973338]  check_preemption_disabled+0xf8/0x108
      [  266.977978]  debug_smp_processor_id+0x20/0x30
      [  266.982307]  kvm_vcpu_pmu_restore_guest+0x2c/0x68
      [  266.986949]  access_pmcr+0xf8/0x128
      [  266.990399]  perform_access+0x8c/0x250
      [  266.994108]  kvm_handle_sys_reg+0x10c/0x2f8
      [  266.998247]  handle_exit+0x78/0x200
      [  267.001697]  kvm_arch_vcpu_ioctl_run+0x2ac/0xab8
      
      Note that the bug was always there, it is only the switch to
      using percpu accessors that made it obvious.
      The fix is to wrap these accesses in a preempt-disabled section,
      so that we sample a coherent context on trap from the guest.
      
      Fixes: 435e53fb ("arm64: KVM: Enable VHE support for :G/:H perf event modifiers")
      Cc:: Andrew Murray <amurray@thegoodpenguin.co.uk>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      146f76cc
  3. 03 Jul, 2020 3 commits
  4. 02 Jul, 2020 1 commit
  5. 01 Jul, 2020 1 commit
  6. 30 Jun, 2020 1 commit
    • Paolo Bonzini's avatar
      KVM: x86: bit 8 of non-leaf PDPEs is not reserved · 5ecad245
      Paolo Bonzini authored
      Bit 8 would be the "global" bit, which does not quite make sense for non-leaf
      page table entries.  Intel ignores it; AMD ignores it in PDEs and PDPEs, but
      reserves it in PML4Es.
      
      Probably, earlier versions of the AMD manual documented it as reserved in PDPEs
      as well, and that behavior made it into KVM as well as kvm-unit-tests; fix it.
      
      Cc: stable@vger.kernel.org
      Reported-by: default avatarNadav Amit <namit@vmware.com>
      Fixes: a0c0feb5 ("KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD", 2014-09-03)
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5ecad245
  7. 29 Jun, 2020 1 commit
    • Wanpeng Li's avatar
      KVM: X86: Fix async pf caused null-ptr-deref · 9d3c447c
      Wanpeng Li authored
      Syzbot reported that:
      
        CPU: 1 PID: 6780 Comm: syz-executor153 Not tainted 5.7.0-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:__apic_accept_irq+0x46/0xb80
        Call Trace:
         kvm_arch_async_page_present+0x7de/0x9e0
         kvm_check_async_pf_completion+0x18d/0x400
         kvm_arch_vcpu_ioctl_run+0x18bf/0x69f0
         kvm_vcpu_ioctl+0x46a/0xe20
         ksys_ioctl+0x11a/0x180
         __x64_sys_ioctl+0x6f/0xb0
         do_syscall_64+0xf6/0x7d0
         entry_SYSCALL_64_after_hwframe+0x49/0xb3
      
      The testcase enables APF mechanism in MSR_KVM_ASYNC_PF_EN with ASYNC_PF_INT
      enabled w/o setting MSR_KVM_ASYNC_PF_INT before, what's worse, interrupt
      based APF 'page ready' event delivery depends on in kernel lapic, however,
      we didn't bail out when lapic is not in kernel during guest setting
      MSR_KVM_ASYNC_PF_EN which causes the null-ptr-deref in host later.
      This patch fixes it.
      
      Reported-by: syzbot+1bf777dfdde86d64b89b@syzkaller.appspotmail.com
      Fixes: 2635b5c4 (KVM: x86: interrupt based APF 'page ready' event delivery)
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Message-Id: <1593426391-8231-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      9d3c447c
  8. 24 Jun, 2020 1 commit
  9. 23 Jun, 2020 5 commits
    • Marc Zyngier's avatar
      KVM: arm64: vgic-v4: Plug race between non-residency and v4.1 doorbell · a3f574cd
      Marc Zyngier authored
      When making a vPE non-resident because it has hit a blocking WFI,
      the doorbell can fire at any time after the write to the RD.
      Crucially, it can fire right between the write to GICR_VPENDBASER
      and the write to the pending_last field in the its_vpe structure.
      
      This means that we would overwrite pending_last with stale data,
      and potentially not wakeup until some unrelated event (such as
      a timer interrupt) puts the vPE back on the CPU.
      
      GICv4 isn't affected by this as we actively mask the doorbell on
      entering the guest, while GICv4.1 automatically manages doorbell
      delivery without any hypervisor-driven masking.
      
      Use the vpe_lock to synchronize such update, which solves the
      problem altogether.
      
      Fixes: ae699ad3 ("irqchip/gic-v4.1: Move doorbell management to the GICv4 abstraction layer")
      Reported-by: default avatarZenghui Yu <yuzenghui@huawei.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      a3f574cd
    • Sean Christopherson's avatar
      KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkru · e4553b49
      Sean Christopherson authored
      Remove vcpu_vmx.host_pkru, which got left behind when PKRU support was
      moved to common x86 code.
      
      No functional change intended.
      
      Fixes: 37486135 ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c")
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200617034123.25647-1-sean.j.christopherson@intel.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e4553b49
    • Marcelo Tosatti's avatar
      KVM: x86: allow TSC to differ by NTP correction bounds without TSC scaling · 26769f96
      Marcelo Tosatti authored
      The Linux TSC calibration procedure is subject to small variations
      (its common to see +-1 kHz difference between reboots on a given CPU, for example).
      
      So migrating a guest between two hosts with identical processor can fail, in case
      of a small variation in calibrated TSC between them.
      
      Without TSC scaling, the current kernel interface will either return an error
      (if user_tsc_khz <= tsc_khz) or enable TSC catchup mode.
      
      This change enables the following TSC tolerance check to
      accept KVM_SET_TSC_KHZ within tsc_tolerance_ppm (which is 250ppm by default).
      
              /*
               * Compute the variation in TSC rate which is acceptable
               * within the range of tolerance and decide if the
               * rate being applied is within that bounds of the hardware
               * rate.  If so, no scaling or compensation need be done.
               */
              thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm);
              thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm);
              if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) {
                      pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n", user_tsc_khz, thresh_lo, thresh_hi);
                      use_scaling = 1;
              }
      
      NTP daemon in the guest can correct this difference (NTP can correct upto 500ppm).
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      
      Message-Id: <20200616114741.GA298183@fuller.cnet>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      26769f96
    • Xiaoyao Li's avatar
      KVM: X86: Fix MSR range of APIC registers in X2APIC mode · bf10bd0b
      Xiaoyao Li authored
      Only MSR address range 0x800 through 0x8ff is architecturally reserved
      and dedicated for accessing APIC registers in x2APIC mode.
      
      Fixes: 0105d1a5 ("KVM: x2apic interface to lapic")
      Signed-off-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Message-Id: <20200616073307.16440-1-xiaoyao.li@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bf10bd0b
    • Sean Christopherson's avatar
      KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL · bf09fb6c
      Sean Christopherson authored
      Remove support for context switching between the guest's and host's
      desired UMWAIT_CONTROL.  Propagating the guest's value to hardware isn't
      required for correct functionality, e.g. KVM intercepts reads and writes
      to the MSR, and the latency effects of the settings controlled by the
      MSR are not architecturally visible.
      
      As a general rule, KVM should not allow the guest to control power
      management settings unless explicitly enabled by userspace, e.g. see
      KVM_CAP_X86_DISABLE_EXITS.  E.g. Intel's SDM explicitly states that C0.2
      can improve the performance of SMT siblings.  A devious guest could
      disable C0.2 so as to improve the performance of their workloads at the
      detriment to workloads running in the host or on other VMs.
      
      Wholesale removal of UMWAIT_CONTROL context switching also fixes a race
      condition where updates from the host may cause KVM to enter the guest
      with the incorrect value.  Because updates are are propagated to all
      CPUs via IPI (SMP function callback), the value in hardware may be
      stale with respect to the cached value and KVM could enter the guest
      with the wrong value in hardware.  As above, the guest can't observe the
      bad value, but it's a weird and confusing wart in the implementation.
      
      Removal also fixes the unnecessary usage of VMX's atomic load/store MSR
      lists.  Using the lists is only necessary for MSRs that are required for
      correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on
      old hardware, or for MSRs that need to-the-uop precision, e.g. perf
      related MSRs.  For UMWAIT_CONTROL, the effects are only visible in the
      kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in
      vcpu_vmx_run().  Using the atomic lists is undesirable as they are more
      expensive than direct RDMSR/WRMSR.
      
      Furthermore, even if giving the guest control of the MSR is legitimate,
      e.g. in pass-through scenarios, it's not clear that the benefits would
      outweigh the overhead.  E.g. saving and restoring an MSR across a VMX
      roundtrip costs ~250 cycles, and if the guest diverged from the host
      that cost would be paid on every run of the guest.  In other words, if
      there is a legitimate use case then it should be enabled by a new
      per-VM capability.
      
      Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can
      correctly expose other WAITPKG features to the guest, e.g. TPAUSE,
      UMWAIT and UMONITOR.
      
      Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
      Cc: stable@vger.kernel.org
      Cc: Jingqi Liu <jingqi.liu@intel.com>
      Cc: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bf09fb6c
  10. 22 Jun, 2020 7 commits
    • Sean Christopherson's avatar
      KVM: nVMX: Plumb L2 GPA through to PML emulation · 2dbebf7a
      Sean Christopherson authored
      Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all
      intents and purposes is vmx_write_pml_buffer(), instead of having the
      latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS.  If the dirty bit
      update is the result of KVM emulation (rare for L2), then the GPA in the
      VMCS may be stale and/or hold a completely unrelated GPA.
      
      Fixes: c5f983f6 ("nVMX: Implement emulated Page Modification Logging")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2dbebf7a
    • Vitaly Kuznetsov's avatar
      KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic() · 312d16c7
      Vitaly Kuznetsov authored
      translate_gpa() returns a GPA, assigning it to 'real_gfn' seems obviously
      wrong. There is no real issue because both 'gpa_t' and 'gfn_t' are u64 and
      we don't use the value in 'real_gfn' as a GFN, we do
      
       real_gfn = gpa_to_gfn(real_gfn);
      
      instead. 'If you see a "buffalo" sign on an elephant's cage, do not trust
      your eyes', but let's fix it for good.
      
      No functional change intended.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200622151435.752560-1-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      312d16c7
    • Paolo Bonzini's avatar
      KVM: LAPIC: ensure APIC map is up to date on concurrent update requests · 44d52717
      Paolo Bonzini authored
      The following race can cause lost map update events:
      
               cpu1                            cpu2
      
                                      apic_map_dirty = true
        ------------------------------------------------------------
                                      kvm_recalculate_apic_map:
                                           pass check
                                               mutex_lock(&kvm->arch.apic_map_lock);
                                               if (!kvm->arch.apic_map_dirty)
                                           and in process of updating map
        -------------------------------------------------------------
          other calls to
             apic_map_dirty = true         might be too late for affected cpu
        -------------------------------------------------------------
                                           apic_map_dirty = false
        -------------------------------------------------------------
          kvm_recalculate_apic_map:
          bail out on
            if (!kvm->arch.apic_map_dirty)
      
      To fix it, record the beginning of an update of the APIC map in
      apic_map_dirty.  If another APIC map change switches apic_map_dirty
      back to DIRTY during the update, kvm_recalculate_apic_map should not
      make it CLEAN, and the other caller will go through the slow path.
      Reported-by: default avatarIgor Mammedov <imammedo@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      44d52717
    • Igor Mammedov's avatar
      kvm: lapic: fix broken vcpu hotplug · af28dfac
      Igor Mammedov authored
      Guest fails to online hotplugged CPU with error
        smpboot: do_boot_cpu failed(-1) to wakeup CPU#4
      
      It's caused by the fact that kvm_apic_set_state(), which used to call
      recalculate_apic_map() unconditionally and pulled hotplugged CPU into
      apic map, is updating map conditionally on state changes.  In this case
      the APIC map is not considered dirty and the is not updated.
      
      Fix the issue by forcing unconditional update from kvm_apic_set_state(),
      like it used to be.
      
      Fixes: 4abaffce ("KVM: LAPIC: Recalculate apic map in batch")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIgor Mammedov <imammedo@redhat.com>
      Message-Id: <20200622160830.426022-1-imammedo@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      af28dfac
    • Andrew Jones's avatar
      KVM: arm64: pvtime: Ensure task delay accounting is enabled · a25e9102
      Andrew Jones authored
      Ensure we're actually accounting run_delay before we claim that we'll
      expose it to the guest. If we're not, then we just pretend like steal
      time isn't supported in order to avoid any confusion.
      Signed-off-by: default avatarAndrew Jones <drjones@redhat.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20200622142710.18677-1-drjones@redhat.com
      a25e9102
    • Steven Price's avatar
      KVM: arm64: Fix kvm_reset_vcpu() return code being incorrect with SVE · 66b7e05d
      Steven Price authored
      If SVE is enabled then 'ret' can be assigned the return value of
      kvm_vcpu_enable_sve() which may be 0 causing future "goto out" sites to
      erroneously return 0 on failure rather than -EINVAL as expected.
      
      Remove the initialisation of 'ret' and make setting the return value
      explicit to avoid this situation in the future.
      
      Fixes: 9a3cdf26 ("KVM: arm64/sve: Allow userspace to enable SVE for vcpus")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarJames Morse <james.morse@arm.com>
      Signed-off-by: default avatarSteven Price <steven.price@arm.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20200617105456.28245-1-steven.price@arm.com
      66b7e05d
    • Alexandru Elisei's avatar
      KVM: arm64: Annotate hyp NMI-related functions as __always_inline · 7733306b
      Alexandru Elisei authored
      The "inline" keyword is a hint for the compiler to inline a function.  The
      functions system_uses_irq_prio_masking() and gic_write_pmr() are used by
      the code running at EL2 on a non-VHE system, so mark them as
      __always_inline to make sure they'll always be part of the .hyp.text
      section.
      
      This fixes the following splat when trying to run a VM:
      
      [   47.625273] Kernel panic - not syncing: HYP panic:
      [   47.625273] PS:a00003c9 PC:0000ca0b42049fc4 ESR:86000006
      [   47.625273] FAR:0000ca0b42049fc4 HPFAR:0000000010001000 PAR:0000000000000000
      [   47.625273] VCPU:0000000000000000
      [   47.647261] CPU: 1 PID: 217 Comm: kvm-vcpu-0 Not tainted 5.8.0-rc1-ARCH+ #61
      [   47.654508] Hardware name: Globalscale Marvell ESPRESSOBin Board (DT)
      [   47.661139] Call trace:
      [   47.663659]  dump_backtrace+0x0/0x1cc
      [   47.667413]  show_stack+0x18/0x24
      [   47.670822]  dump_stack+0xb8/0x108
      [   47.674312]  panic+0x124/0x2f4
      [   47.677446]  panic+0x0/0x2f4
      [   47.680407] SMP: stopping secondary CPUs
      [   47.684439] Kernel Offset: disabled
      [   47.688018] CPU features: 0x240402,20002008
      [   47.692318] Memory Limit: none
      [   47.695465] ---[ end Kernel panic - not syncing: HYP panic:
      [   47.695465] PS:a00003c9 PC:0000ca0b42049fc4 ESR:86000006
      [   47.695465] FAR:0000ca0b42049fc4 HPFAR:0000000010001000 PAR:0000000000000000
      [   47.695465] VCPU:0000000000000000 ]---
      
      The instruction abort was caused by the code running at EL2 trying to fetch
      an instruction which wasn't mapped in the EL2 translation tables. Using
      objdump showed the two functions as separate symbols in the .text section.
      
      Fixes: 85738e05 ("arm64: kvm: Unmask PMR before entering guest")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAlexandru Elisei <alexandru.elisei@arm.com>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Acked-by: default avatarJames Morse <james.morse@arm.com>
      Link: https://lore.kernel.org/r/20200618171254.1596055-1-alexandru.elisei@arm.com
      7733306b
  11. 19 Jun, 2020 1 commit
    • Vitaly Kuznetsov's avatar
      Revert "KVM: VMX: Micro-optimize vmexit time when not exposing PMU" · 49097762
      Vitaly Kuznetsov authored
      Guest crashes are observed on a Cascade Lake system when 'perf top' is
      launched on the host, e.g.
      
       BUG: unable to handle kernel paging request at fffffe0000073038
       PGD 7ffa7067 P4D 7ffa7067 PUD 7ffa6067 PMD 7ffa5067 PTE ffffffffff120
       Oops: 0000 [#1] SMP PTI
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.18.0+ #380
      ...
       Call Trace:
        serial8250_console_write+0xfe/0x1f0
        call_console_drivers.constprop.0+0x9d/0x120
        console_unlock+0x1ea/0x460
      
      Call traces are different but the crash is imminent. The problem was
      blindly bisected to the commit 041bc42c ("KVM: VMX: Micro-optimize
      vmexit time when not exposing PMU"). It was also confirmed that the
      issue goes away if PMU is exposed to the guest.
      
      With some instrumentation of the guest we can see what is being switched
      (when we do atomic_switch_perf_msrs()):
      
       vmx_vcpu_run: switching 2 msrs
       vmx_vcpu_run: switching MSR38f guest: 70000000d host: 70000000f
       vmx_vcpu_run: switching MSR3f1 guest: 0 host: 2
      
      The current guess is that PEBS (MSR_IA32_PEBS_ENABLE, 0x3f1) is to blame.
      Regardless of whether PMU is exposed to the guest or not, PEBS needs to
      be disabled upon switch.
      
      This reverts commit 041bc42c.
      Reported-by: default avatarMaxime Coquelin <maxime.coquelin@redhat.com>
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200619094046.654019-1-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      49097762
  12. 18 Jun, 2020 1 commit
    • Christian Borntraeger's avatar
      KVM: s390: reduce number of IO pins to 1 · 77491129
      Christian Borntraeger authored
      The current number of KVM_IRQCHIP_NUM_PINS results in an order 3
      allocation (32kb) for each guest start/restart. This can result in OOM
      killer activity even with free swap when the memory is fragmented
      enough:
      
      kernel: qemu-system-s39 invoked oom-killer: gfp_mask=0x440dc0(GFP_KERNEL_ACCOUNT|__GFP_COMP|__GFP_ZERO), order=3, oom_score_adj=0
      kernel: CPU: 1 PID: 357274 Comm: qemu-system-s39 Kdump: loaded Not tainted 5.4.0-29-generic #33-Ubuntu
      kernel: Hardware name: IBM 8562 T02 Z06 (LPAR)
      kernel: Call Trace:
      kernel: ([<00000001f848fe2a>] show_stack+0x7a/0xc0)
      kernel:  [<00000001f8d3437a>] dump_stack+0x8a/0xc0
      kernel:  [<00000001f8687032>] dump_header+0x62/0x258
      kernel:  [<00000001f8686122>] oom_kill_process+0x172/0x180
      kernel:  [<00000001f8686abe>] out_of_memory+0xee/0x580
      kernel:  [<00000001f86e66b8>] __alloc_pages_slowpath+0xd18/0xe90
      kernel:  [<00000001f86e6ad4>] __alloc_pages_nodemask+0x2a4/0x320
      kernel:  [<00000001f86b1ab4>] kmalloc_order+0x34/0xb0
      kernel:  [<00000001f86b1b62>] kmalloc_order_trace+0x32/0xe0
      kernel:  [<00000001f84bb806>] kvm_set_irq_routing+0xa6/0x2e0
      kernel:  [<00000001f84c99a4>] kvm_arch_vm_ioctl+0x544/0x9e0
      kernel:  [<00000001f84b8936>] kvm_vm_ioctl+0x396/0x760
      kernel:  [<00000001f875df66>] do_vfs_ioctl+0x376/0x690
      kernel:  [<00000001f875e304>] ksys_ioctl+0x84/0xb0
      kernel:  [<00000001f875e39a>] __s390x_sys_ioctl+0x2a/0x40
      kernel:  [<00000001f8d55424>] system_call+0xd8/0x2c8
      
      As far as I can tell s390x does not use the iopins as we bail our for
      anything other than KVM_IRQ_ROUTING_S390_ADAPTER and the chip/pin is
      only used for KVM_IRQ_ROUTING_IRQCHIP. So let us use a small number to
      reduce the memory footprint.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: default avatarCornelia Huck <cohuck@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200617083620.5409-1-borntraeger@de.ibm.com
      77491129
  13. 15 Jun, 2020 4 commits
  14. 14 Jun, 2020 4 commits
    • Linus Torvalds's avatar
      Linux 5.8-rc1 · b3a9e3b9
      Linus Torvalds authored
      b3a9e3b9
    • Linus Torvalds's avatar
      Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux · 4a87b197
      Linus Torvalds authored
      Pull SafeSetID update from Micah Morton:
       "Add additional LSM hooks for SafeSetID
      
        SafeSetID is capable of making allow/deny decisions for set*uid calls
        on a system, and we want to add similar functionality for set*gid
        calls.
      
        The work to do that is not yet complete, so probably won't make it in
        for v5.8, but we are looking to get this simple patch in for v5.8
        since we have it ready.
      
        We are planning on the rest of the work for extending the SafeSetID
        LSM being merged during the v5.9 merge window"
      
      * tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux:
        security: Add LSM hooks to set*gid syscalls
      4a87b197
    • Thomas Cedeno's avatar
      security: Add LSM hooks to set*gid syscalls · 39030e13
      Thomas Cedeno authored
      The SafeSetID LSM uses the security_task_fix_setuid hook to filter
      set*uid() syscalls according to its configured security policy. In
      preparation for adding analagous support in the LSM for set*gid()
      syscalls, we add the requisite hook here. Tested by putting print
      statements in the security_task_fix_setgid hook and seeing them get hit
      during kernel boot.
      Signed-off-by: default avatarThomas Cedeno <thomascedeno@google.com>
      Signed-off-by: default avatarMicah Morton <mortonm@chromium.org>
      39030e13
    • Linus Torvalds's avatar
      Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 9d645db8
      Linus Torvalds authored
      Pull btrfs updates from David Sterba:
       "This reverts the direct io port to iomap infrastructure of btrfs
        merged in the first pull request. We found problems in invalidate page
        that don't seem to be fixable as regressions or without changing iomap
        code that would not affect other filesystems.
      
        There are four reverts in total, but three of them are followup
        cleanups needed to revert a43a67a2 cleanly. The result is the
        buffer head based implementation of direct io.
      
        Reverts are not great, but under current circumstances I don't see
        better options"
      
      * tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        Revert "btrfs: switch to iomap_dio_rw() for dio"
        Revert "fs: remove dio_end_io()"
        Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
        Revert "btrfs: split btrfs_direct_IO to read and write part"
      9d645db8
  15. 13 Jun, 2020 2 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 96144c58
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix cfg80211 deadlock, from Johannes Berg.
      
       2) RXRPC fails to send norigications, from David Howells.
      
       3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
          Geliang Tang.
      
       4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.
      
       5) The ucc_geth driver needs __netdev_watchdog_up exported, from
          Valentin Longchamp.
      
       6) Fix hashtable memory leak in dccp, from Wang Hai.
      
       7) Fix how nexthops are marked as FDB nexthops, from David Ahern.
      
       8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.
      
       9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.
      
      10) Fix link speed reporting in iavf driver, from Brett Creeley.
      
      11) When a channel is used for XSK and then reused again later for XSK,
          we forget to clear out the relevant data structures in mlx5 which
          causes all kinds of problems. Fix from Maxim Mikityanskiy.
      
      12) Fix memory leak in genetlink, from Cong Wang.
      
      13) Disallow sockmap attachments to UDP sockets, it simply won't work.
          From Lorenz Bauer.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
        net: ethernet: ti: ale: fix allmulti for nu type ale
        net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
        net: atm: Remove the error message according to the atomic context
        bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
        libbpf: Support pre-initializing .bss global variables
        tools/bpftool: Fix skeleton codegen
        bpf: Fix memlock accounting for sock_hash
        bpf: sockmap: Don't attach programs to UDP sockets
        bpf: tcp: Recv() should return 0 when the peer socket is closed
        ibmvnic: Flush existing work items before device removal
        genetlink: clean up family attributes allocations
        net: ipa: header pad field only valid for AP->modem endpoint
        net: ipa: program upper nibbles of sequencer type
        net: ipa: fix modem LAN RX endpoint id
        net: ipa: program metadata mask differently
        ionic: add pcie_print_link_status
        rxrpc: Fix race between incoming ACK parser and retransmitter
        net/mlx5: E-Switch, Fix some error pointer dereferences
        net/mlx5: Don't fail driver on failure to create debugfs
        net/mlx5e: CT: Fix ipv6 nat header rewrite actions
        ...
      96144c58
    • David Sterba's avatar
      Revert "btrfs: switch to iomap_dio_rw() for dio" · 55e20bd1
      David Sterba authored
      This reverts commit a43a67a2.
      
      This patch reverts the main part of switching direct io implementation
      to iomap infrastructure. There's a problem in invalidate page that
      couldn't be solved as regression in this development cycle.
      
      The problem occurs when buffered and direct io are mixed, and the ranges
      overlap. Although this is not recommended, filesystems implement
      measures or fallbacks to make it somehow work. In this case, fallback to
      buffered IO would be an option for btrfs (this already happens when
      direct io is done on compressed data), but the change would be needed in
      the iomap code, bringing new semantics to other filesystems.
      
      Another problem arises when again the buffered and direct ios are mixed,
      invalidation fails, then -EIO is set on the mapping and fsync will fail,
      though there's no real error.
      
      There have been discussions how to fix that, but revert seems to be the
      least intrusive option.
      
      Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      55e20bd1