1. 19 Sep, 2018 9 commits
    • KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Sean Christopherson authored
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running e.g.
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as broken in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
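      Illustrative sketch (not part of the commit): the choice described in
      the message above, between a zero-valued preemption timer and the older
      self-IPI, can be modelled as a toy in Python. `ToyVcpu`, its fields and
      `request_immediate_exit` are hypothetical names used only here; the
      real logic lives in KVM's C code and differs in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyVcpu:
    """Toy stand-in for a vCPU; every field name here is hypothetical."""
    has_preemption_timer: bool
    timer_value: Optional[int] = None   # value programmed for the next VMEnter
    self_ipi_sent: bool = False


def request_immediate_exit(vcpu: ToyVcpu) -> str:
    """Pick the mechanism that forces a VMExit right after the next VMEnter."""
    if vcpu.has_preemption_timer:
        # A timer value of 0 exits before any guest instruction executes.
        vcpu.timer_value = 0
        return "preemption-timer"
    # Fallback: the self-IPI, which has no delivery-time guarantee when nested.
    vcpu.self_ipi_sent = True
    return "self-ipi"
```

      The fallback branch is kept because the message says the timer is used
      only "if it's supported".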
    • KVM: VMX: modify preemption timer bit only when arming timer · f459a707
      Sean Christopherson authored
      Provide a singular location where the VMX preemption timer bit is
      set/cleared so that future usages of the preemption timer can ensure
      the VMCS bit is up-to-date without having to modify unrelated code
      paths.  For example, the preemption timer can be used to force an
      immediate VMExit.  Cache the status of the timer to avoid redundant
      VMREAD and VMWRITE, e.g. if the timer stays armed across multiple
      VMEnters/VMExits.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
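      Illustrative sketch (not part of the commit): the caching idea can be
      shown with a toy control word that counts writes. `ToyVmcs` and
      `TimerArmer` are hypothetical names; the only hardware detail assumed
      is that the timer is enabled by bit 6 of the pin-based VM-execution
      controls.

```python
PREEMPTION_TIMER_BIT = 1 << 6   # "activate VMX-preemption timer" pin control


class ToyVmcs:
    """Toy control word that counts how often it is written."""

    def __init__(self) -> None:
        self.pin_controls = 0
        self.writes = 0

    def write_pin_controls(self, value: int) -> None:
        self.pin_controls = value
        self.writes += 1


class TimerArmer:
    """Single place that sets or clears the timer bit, with a cached state."""

    def __init__(self, vmcs: ToyVmcs) -> None:
        self.vmcs = vmcs
        self.armed = False          # cached copy of the bit

    def set_armed(self, want: bool) -> None:
        if want == self.armed:
            return                  # state unchanged: skip the write entirely
        if want:
            value = self.vmcs.pin_controls | PREEMPTION_TIMER_BIT
        else:
            value = self.vmcs.pin_controls & ~PREEMPTION_TIMER_BIT
        self.vmcs.write_pin_controls(value)
        self.armed = want
```

      Arming the timer on several consecutive entries costs one write in
      this model, which is the saving the message describes.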
    • KVM: VMX: immediately mark preemption timer expired only for zero value · 4c008127
      Sean Christopherson authored
      A VMX preemption timer value of '0' at the time of VMEnter is
      architecturally guaranteed to cause a VMExit prior to the CPU
      executing any instructions in the guest.  This architectural
      definition is in place to ensure that a previously expired timer
      is correctly recognized by the CPU as it is possible for the timer
      to reach zero and not trigger a VMexit due to a higher priority
      VMExit being signalled instead, e.g. a pending #DB that morphs into
      a VMExit.
      
      Whether by design or coincidence, commit f4124500 ("KVM: nVMX:
      Fully emulate preemption timer") special cased timer values of '0'
      and '1' to ensure prompt delivery of the VMExit.  Unlike '0', a
      timer value of '1' has no architectural guarantees regarding
      when it is delivered.
      
      Modify the timer emulation to trigger immediate VMExit if and only
      if the timer value is '0', and document precisely why '0' is special.
      Do this even if calibration of the virtual TSC failed, i.e. VMExit
      will occur immediately regardless of the frequency of the timer.
      Making only '0' a special case gives KVM leeway to be more aggressive
      in ensuring the VMExit is injected prior to executing instructions in
      the nested guest, and also eliminates any ambiguity as to why '1' is
      a special case, e.g. why wasn't the threshold for a "short timeout"
      set to 10, 100, 1000, etc...
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
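      Illustrative sketch (not part of the commit): the rule "only '0' is
      special, even if TSC calibration failed" can be written as a small
      function. The name `emulated_timeout_ns`, the return convention and
      the rate shift of 5 are assumptions made for this sketch, not values
      taken from the commit.

```python
from typing import Optional

EMULATED_RATE_SHIFT = 5   # assumed: timer ticks once per 2**5 TSC cycles


def emulated_timeout_ns(timer_value: int, virtual_tsc_khz: int) -> Optional[int]:
    """Return 0 for an immediate exit, None if no timer can be armed,
    otherwise the delay in nanoseconds before the emulated timer fires."""
    if timer_value == 0:
        # Architecturally guaranteed: exit before any guest instruction,
        # regardless of the timer frequency.
        return 0
    if virtual_tsc_khz == 0:
        # Calibration failed, so a non-zero value cannot be converted.
        return None
    tsc_cycles = timer_value << EMULATED_RATE_SHIFT
    return tsc_cycles * 1_000_000 // virtual_tsc_khz
```

      A value of '1' takes the ordinary conversion path here and gets no
      special treatment.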
    • KVM: SVM: Switch to bitmap_zalloc() · a101c9d6
      Andy Shevchenko authored
      Switch to bitmap_zalloc() to show clearly what we are allocating.
      Besides that, it returns a pointer of bitmap type instead of an opaque void *.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM/MMU: Fix comment in walk_shadow_page_lockless_end() · 9a984586
      Tianyu Lan authored
      kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page().
      This patch fixes the comment accordingly.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: selftests: use -pthread instead of -lpthread · 6bd317d3
      Lei Yang authored
      I ran into the following error:
      
      testing/selftests/kvm/dirty_log_test.c:285: undefined reference to `pthread_create'
      testing/selftests/kvm/dirty_log_test.c:297: undefined reference to `pthread_join'
      collect2: error: ld returned 1 exit status
      
      My gcc version is 4.8.4.
      Using "-pthread" instead would work everywhere.
      Signed-off-by: Lei Yang <Lei.Yang@windriver.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: don't reset root in kvm_mmu_setup() · 83b20b28
      Wei Yang authored
      Here is the code path which shows kvm_mmu_setup() is invoked after
      kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in this code path,
      this means the root_hpa and prev_roots are guaranteed to be invalid. And
      it is not necessary to reset it again.
      
          kvm_vm_ioctl_create_vcpu()
              kvm_arch_vcpu_create()
                  vmx_create_vcpu()
                      kvm_vcpu_init()
                          kvm_arch_vcpu_init()
                              kvm_mmu_create()
              kvm_arch_vcpu_setup()
                  kvm_mmu_setup()
                      kvm_init_mmu()
      
      This patch sets reset_roots to false in kvm_mmu_setup().

      Fixes: 50c28f21
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: mmu: Don't read PDPTEs when paging is not enabled · d35b34a9
      Junaid Shahid authored
      kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and
      CR4.PAE = 1.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
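      Illustrative sketch (not part of the commit): the condition under
      which PDPTEs are meaningful can be written as a predicate over the
      control registers. The bit positions are the architectural ones
      (CR0.PG is bit 31, CR4.PAE is bit 5, EFER.LMA is bit 10). The function
      name is hypothetical, and the long-mode exclusion is an addition of
      this sketch that the commit message does not mention.

```python
X86_CR0_PG = 1 << 31    # paging enabled
X86_CR4_PAE = 1 << 5    # physical address extension
EFER_LMA = 1 << 10      # long mode active


def should_read_pdptes(cr0: int, cr4: int, efer: int) -> bool:
    """PDPTEs are only used for PAE paging: PG=1, PAE=1, not long mode."""
    paging = bool(cr0 & X86_CR0_PG)
    pae = bool(cr4 & X86_CR4_PAE)
    long_mode = bool(efer & EFER_LMA)
    return paging and pae and not long_mode
```

      The case named in the message, CR0.PG = 0 with CR4.PAE = 1, evaluates
      to False.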
    • x86/kvm/lapic: always disable MMIO interface in x2APIC mode · d1766202
      Vitaly Kuznetsov authored
      When VMX is used with flexpriority disabled (because of no support or
      if disabled with module parameter) MMIO interface to lAPIC is still
      available in x2APIC mode while it shouldn't be (kvm-unit-tests):
      
      PASS: apic_disable: Local apic enabled in x2APIC mode
      PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set
      FAIL: apic_disable: *0xfee00030: 50014
      
      The issue appears because we basically do nothing while switching to
      x2APIC mode when APIC access page is not used. apic_mmio_{read,write}
      only check if lAPIC is disabled before proceeding to actual write.
      
      When APIC access is virtualized we correctly manipulate VMX controls
      in vmx_set_virtual_apic_mode() and we don't get vmexits from memory writes
      in x2APIC mode so there's no issue.
      
      Disabling MMIO interface seems to be easy. The question is: what do we
      do with these reads and writes? If we add apic_x2apic_mode() check to
      apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will
      go to userspace. When lAPIC is in kernel, Qemu uses this interface to
      inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c). This
      somehow works with disabled lAPIC but when we're in xAPIC mode we will
      get a real injected MSI from every write to lAPIC. Not good.
      
      The simplest solution seems to be to just ignore writes to the region
      and return ~0 for all reads when we're in x2APIC mode. This is what this
      patch does. However, this approach is inconsistent with what currently
      happens when flexpriority is enabled: we allocate APIC access page and
      create KVM memory region so in x2APIC modes all reads and writes go to
      this pre-allocated page which is, btw, the same for all vCPUs.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
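      Illustrative sketch (not part of the commit): the chosen behaviour,
      "ignore writes and return ~0 for all reads in x2APIC mode", modelled
      with a toy register file. `ToyLapic` and its methods are hypothetical
      names; the kernel's handlers are written in C and have different
      signatures.

```python
class ToyLapic:
    """Toy local APIC with an MMIO window and an x2APIC mode switch."""

    def __init__(self) -> None:
        self.x2apic_mode = False
        self.regs = {}              # offset -> 32-bit register value

    def mmio_read(self, offset: int, length: int = 4) -> int:
        if self.x2apic_mode:
            # MMIO is disabled: every read returns all ones (~0).
            return (1 << (8 * length)) - 1
        return self.regs.get(offset, 0)

    def mmio_write(self, offset: int, value: int) -> None:
        if self.x2apic_mode:
            return                  # MMIO is disabled: drop the write
        self.regs[offset] = value & 0xFFFFFFFF
```

      With this model, the read of offset 0x30 that the unit test reports as
      50014 would return 0xffffffff once x2APIC mode is on.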
  2. 18 Sep, 2018 2 commits
  3. 16 Sep, 2018 2 commits
  4. 15 Sep, 2018 8 commits
  5. 14 Sep, 2018 19 commits
    • Merge tag 'devicetree-fixes-for-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · 090b75bc
      Linus Torvalds authored
      Pull DeviceTree fix from Rob Herring:
       "One regression for a 20 year old PowerMac:
      
         - Fix a regression on systems having a DT without any phandles which
           happens on a PowerMac G3"
      
      * tag 'devicetree-fixes-for-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        of: fix phandle cache creation for DTs with no phandles
    • Merge tag 'for-linus-4.19c-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · d7c02680
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
       "This contains some minor cleanups and fixes:
      
         - a new knob for controlling scrubbing of pages returned by the Xen
           balloon driver to the Xen hypervisor to address a boot performance
           issue seen in large guests booted pre-ballooned
      
         - a fix of a regression in the gntdev driver which made it impossible
           to use fully virtualized guests (HVM guests) with a 4.19 based dom0
      
         - a fix in Xen cpu hotplug functionality which could be triggered by
           wrong admin commands (setting number of active vcpus to 0)
      
        One further note: the patches have all been under test for several
        days in another branch. This branch has been rebased in order to avoid
        merge conflicts"
      
      * tag 'for-linus-4.19c-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/gntdev: fix up blockable calls to mn_invl_range_start
        xen: fix GCC warning and remove duplicate EVTCHN_ROW/EVTCHN_COL usage
        xen: avoid crash in disable_hotplug_cpu
        xen/balloon: add runtime control for scrubbing ballooned out pages
        xen/manage: don't complain about an empty value in control/sysrq node
    • Merge tag 'xtensa-20180914' of git://github.com/jcmvbkbc/linux-xtensa · eae4f885
      Linus Torvalds authored
      Pull Xtensa fixes and cleanups from Max Filippov:
      
       - don't allocate memory in platform_setup as the memory allocator is
         not initialized at that point yet;
      
       - remove unnecessary ifeq KBUILD_SRC from arch/xtensa/Makefile;
      
       - enable SG chaining in arch/xtensa/Kconfig.
      
      * tag 'xtensa-20180914' of git://github.com/jcmvbkbc/linux-xtensa:
        xtensa: enable SG chaining in Kconfig
        xtensa: remove unnecessary KBUILD_SRC ifeq conditional
        xtensa: ISS: don't allocate memory in platform_setup
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 3e153256
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "The trickle of arm64 fixes continues to come in.
      
        Nothing that's the end of the world, but we've got a fix for PCI IO
        port accesses, an accidental naked "asm goto" and a fix to the
        vmcoreinfo PT_NOTE merged this time around which we'd like to get
        sorted before it becomes ABI.
      
         - Fix ioport_map() mapping the wrong physical address for some I/O
           BARs
      
         - Remove direct use of "asm goto", since some compilers don't like
           that
      
         - Ensure kimage_voffset is always present in vmcoreinfo PT_NOTE"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        asm-generic: io: Fix ioport_map() for !CONFIG_GENERIC_IOMAP && CONFIG_INDIRECT_PIO
        arm64: kernel: arch_crash_save_vmcoreinfo() should depend on CONFIG_CRASH_CORE
        arm64: jump_label.h: use asm_volatile_goto macro instead of "asm goto"
    • NFS: Don't open code clearing of delegation state · 9f0c5124
      Trond Myklebust authored
      Add a helper for the case when the nfs4 open state has been set to use
      a delegation stateid, and we want to revert to using the open stateid.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFSv4.1 fix infinite loop on I/O. · 994b15b9
      Trond Myklebust authored
      The previous fix broke recovery of delegated stateids because it assumes
      that if we did not mark the delegation as suspect, then the delegation has
      effectively been revoked, and so it removes that delegation irrespectively
      of whether or not it is valid and still in use. While this is "mostly
      harmless" for ordinary I/O, we've seen pNFS fail with LAYOUTGET spinning
      in an infinite loop while complaining that we're using an invalid stateid
      (in this case the all-zero stateid).
      
      What we rather want to do here is ensure that the delegation is always
      correctly marked as needing testing when that is the case. So we want
      to close the loophole offered by nfs4_schedule_stateid_recovery(),
      which marks the state as needing to be reclaimed, but not the
      delegation that may be backing it.
      
      Fixes: 0e3d3e5d ("NFSv4.1 fix infinite loop on IO BAD_STATEID error")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org # v4.11+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFSv4: Fix a tracepoint Oops in initiate_file_draining() · 2edaead6
      Trond Myklebust authored
      Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
      change the test in the tracepoint.
      
      Fixes: ce5624f7 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org # v4.17+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
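      Illustrative sketch (not part of the commit): a value that "can be
      NULL or an ERR_PTR()" follows the kernel convention of encoding a
      negative errno in the topmost 4095 addresses. The model below assumes
      64-bit pointers and mirrors that convention in Python. It shows one
      way such a test can be phrased, not the exact change made to the
      tracepoint.

```python
MAX_ERRNO = 4095
PTR_MASK = (1 << 64) - 1            # assumes 64-bit pointers


def err_ptr(err: int) -> int:
    """Encode a positive errno as an error pointer (model of ERR_PTR(-err))."""
    return (-err) & PTR_MASK


def is_err(ptr: int) -> bool:
    """True if ptr lies in the top MAX_ERRNO addresses."""
    return ptr >= ((-MAX_ERRNO) & PTR_MASK)


def is_err_or_null(ptr: int) -> bool:
    """True if ptr is NULL (0) or an encoded error; unsafe to dereference."""
    return ptr == 0 or is_err(ptr)
```

      A plain NULL test would let an encoded error through, which is why
      both cases need checking before the value is dereferenced.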
    • pNFS: Ensure we return the error if someone kills a waiting layoutget · d03360aa
      Trond Myklebust authored
      If someone interrupts a wait on one or more outstanding layoutgets in
      pnfs_update_layout() then return the ERESTARTSYS/EINTR error.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFSv4: Fix a tracepoint Oops in initiate_file_draining() · 2a534a74
      Trond Myklebust authored
      Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
      change the test in the tracepoint.
      
      Fixes: ce5624f7 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org # v4.17+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Merge tag 'dmaengine-fix-4.19-rc4' of git://git.infradead.org/users/vkoul/slave-dma · f3c0b8ce
      Linus Torvalds authored
      Pull dmaengine fix from Vinod Koul:
       "Fix the mic_x100_dma driver to use devm_kzalloc for driver memory, so
        that it is freed properly when it unregisters from dmaengine using
        managed API"
      
      * tag 'dmaengine-fix-4.19-rc4' of git://git.infradead.org/users/vkoul/slave-dma:
        dmaengine: mic_x100_dma: use devm_kzalloc to fix an issue
    • Merge tag 'usb-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 1abc088a
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a number of small USB driver fixes for -rc4.
      
        The usual suspects of gadget, xhci, and dwc2/3 are in here, along with
        some reverts of reported problem changes, and a number of build
        documentation warning fixes. Full details are in the shortlog.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'usb-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (28 commits)
        Revert "cdc-acm: implement put_char() and flush_chars()"
        usb: Change usb_of_get_companion_dev() place to usb/common
        usb: xhci: fix interrupt transfer error happened on MTK platforms
        usb: cdc-wdm: Fix a sleep-in-atomic-context bug in service_outstanding_interrupt()
        usb: misc: uss720: Fix two sleep-in-atomic-context bugs
        usb: host: u132-hcd: Fix a sleep-in-atomic-context bug in u132_get_frame()
        usb: Avoid use-after-free by flushing endpoints early in usb_set_interface()
        linux/mod_devicetable.h: fix kernel-doc missing notation for typec_device_id
        usb/typec: fix kernel-doc notation warning for typec_match_altmode
        usb: Don't die twice if PCI xhci host is not responding in resume
        usb: mtu3: fix error of xhci port id when enable U3 dual role
        usb: uas: add support for more quirk flags
        USB: Add quirk to support DJI CineSSD
        usb: typec: fix kernel-doc parameter warning
        usb/dwc3/gadget: fix kernel-doc parameter warning
        USB: yurex: Check for truncation in yurex_read()
        USB: yurex: Fix buffer over-read in yurex_write()
        usb: host: xhci-plat: Iterate over parent nodes for finding quirks
        xhci: Fix use after free for URB cancellation on a reallocated endpoint
        USB: add quirk for WORLDE Controller KS49 or Prodipe MIDI 49C USB controller
        ...
    • Merge tag 'tty-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · c284cf06
      Linus Torvalds authored
      Pull tty fixes from Greg KH:
       "Here are three small HVC tty driver fixes to resolve a reported
        regression from 4.19-rc1.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'tty-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        tty: hvc: hvc_write() fix break condition
        tty: hvc: hvc_poll() fix read loop batching
        tty: hvc: hvc_poll() fix read loop hang
    • Merge tag 'staging-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 45d9ab8a
      Linus Torvalds authored
      Pull staging/IIO driver fixes from Greg KH:
       "Here are a few small staging and iio driver fixes for -rc4.
      
        Nothing major, just a few small bugfixes for some reported issues, and
        a MAINTAINERS file update for the fbtft drivers.
      
        We also re-enable the building of the erofs filesystem as the XArray
        patches that were causing it to break never got merged in the -rc1
        cycle, so there's no reason it can't be turned back on for now. The
        problem that was previously there is now being handled in the Xarray
        tree at the moment, so it will not hit us again in the future.
      
        All of these patches have been in linux-next with no reported issues"
      
      * tag 'staging-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: vboxvideo: Change address of scanout buffer on page-flip
        staging: vboxvideo: Fix IRQs no longer working
        staging: gasket: TODO: re-implement using UIO
        staging/fbtft: Update TODO and mailing lists
        staging: erofs: rename superblock flags (MS_xyz -> SB_xyz)
        iio: imu: st_lsm6dsx: take into account ts samples in wm configuration
        Revert "iio: temperature: maxim_thermocouple: add MAX31856 part"
        Revert "staging: erofs: disable compiling temporarile"
        MAINTAINERS: Switch a maintainer for drivers/staging/gasket
        staging: wilc1000: revert "fix TODO to compile spi and sdio components in single module"
    • Merge tag 'char-misc-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 319cbacf
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are a small handful of char/misc driver fixes for 4.19-rc4.
      
        All of them are simple, resolving reported problems in a few drivers.
        Full details are in the shortlog.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'char-misc-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        firmware: Fix security issue with request_firmware_into_buf()
        vmbus: don't return values for uninitalized channels
        fpga: dfl: fme: fix return value check in in pr_mgmt_init()
        misc: hmc6352: fix potential Spectre v1
        Tools: hv: Fix a bug in the key delete code
        misc: ibmvsm: Fix wrong assignment of return code
        android: binder: fix the race mmap and alloc_new_buf_locked
        mei: bus: need to unlink client before freeing
        mei: bus: fix hw module get/put balance
        mei: fix use-after-free in mei_cl_write
        mei: ignore not found client in the enumeration
    • Revert "x86/mm/legacy: Populate the user page-table with user pgd's" · 61a6bd83
      Joerg Roedel authored
      This reverts commit 1f40a46c.
      
      It turned out that this patch is not sufficient to enable PTI on 32 bit
      systems with legacy 2-level page-tables. In this paging mode the huge-page
      PTEs are in the top-level page-table directory, where also the mirroring to
      the user-space page-table happens. So every huge PTE exists twice, in the
      kernel and in the user page-table.
      
      That means that accessed/dirty bits need to be fetched from two PTEs in
      this mode to be safe, but this is not trivial to implement because it needs
      changes to generic code just for the sake of enabling PTI with 32-bit
      legacy paging. As all systems that need PTI should support PAE anyway,
      remove support for PTI when 32-bit legacy paging is used.
      
      Fixes: 7757d607 ('x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32')
      Reported-by: Meelis Roos <mroos@linux.ee>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: https://lkml.kernel.org/r/1536922754-31379-1-git-send-email-joro@8bytes.org
    • xen/gntdev: fix up blockable calls to mn_invl_range_start · 58a57569
      Michal Hocko authored
      Patch series "mmu_notifiers follow ups".
      
      Tetsuo has noticed some fallouts from 93065ac7 ("mm, oom: distinguish
      blockable mode for mmu notifiers").  One of them has been fixed and picked
      up by AMD/DRM maintainer [1].  XEN issue is fixed by patch 1.  I have also
      clarified expectations about blockable semantic of invalidate_range_end.
      Finally the last patch removes MMU_INVALIDATE_DOES_NOT_BLOCK which is no
      longer used nor needed.
      
      [1] http://lkml.kernel.org/r/20180824135257.GU29735@dhcp22.suse.cz
      
      This patch (of 3):
      
      93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers") has
      introduced blockable parameter to all mmu_notifiers and the notifier has
      to back off when called in !blockable case and it could block down the
      road.
      
      The above commit implemented that for mn_invl_range_start but both
      in_range checks are done unconditionally regardless of the blockable mode
      and as such they would fail all the time for regular calls.  Fix this by
      checking blockable parameter as well.
      
      Once we are there we can remove the stale TODO.  The lock has to be
      sleepable because we wait for completion down in gnttab_unmap_refs_sync.
      
      Link: http://lkml.kernel.org/r/20180827112623.8992-2-mhocko@kernel.org
      Fixes: 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
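      Illustrative sketch (not part of the commit): the fix amounts to
      backing off only when the call is non-blockable and a mapping really
      overlaps the invalidated range. The toy below uses (start, end) tuples
      for mappings. `overlaps` and `invalidate_range_start` are hypothetical
      names, and the real notifier works on kernel structures under a lock.

```python
import errno
from typing import List, Tuple

Range = Tuple[int, int]             # (start, end), end exclusive


def overlaps(mapping: Range, start: int, end: int) -> bool:
    """True if the mapping intersects [start, end)."""
    return mapping[0] < end and mapping[1] > start


def invalidate_range_start(maps: List[Range], start: int, end: int,
                           blockable: bool) -> Tuple[int, List[Range]]:
    """Return (status, mappings that were unmapped)."""
    unmapped: List[Range] = []
    for m in maps:
        if not overlaps(m, start, end):
            continue
        if not blockable:
            # Unmapping may sleep, so refuse instead of blocking.
            return -errno.EAGAIN, unmapped
        unmapped.append(m)          # blockable caller: do the work
    return 0, unmapped
```

      A regular (blockable) call succeeds and unmaps the overlapping ranges.
      Only a non-blockable call that hits an overlap returns -EAGAIN.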
    • xen: fix GCC warning and remove duplicate EVTCHN_ROW/EVTCHN_COL usage · 4dca864b
      Josh Abraham authored
      This patch removes duplicate macro usage in events_base.c.
      
      It also fixes gcc warning:
      variable ‘col’ set but not used [-Wunused-but-set-variable]
      Signed-off-by: Joshua Abraham <j.abraham1776@gmail.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
    • xen: avoid crash in disable_hotplug_cpu · 3366cdb6
      Olaf Hering authored
      The command 'xl vcpu-set 0 0', issued in dom0, will crash dom0:
      
      BUG: unable to handle kernel NULL pointer dereference at 00000000000002d8
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 7 PID: 65 Comm: xenwatch Not tainted 4.19.0-rc2-1.ga9462db-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: Intel Corporation S5520UR/S5520UR, BIOS S5500.86B.01.00.0050.050620101605 05/06/2010
      RIP: e030:device_offline+0x9/0xb0
      Code: 77 24 00 e9 ce fe ff ff 48 8b 13 e9 68 ff ff ff 48 8b 13 e9 29 ff ff ff 48 8b 13 e9 ea fe ff ff 90 66 66 66 66 90 41 54 55 53 <f6> 87 d8 02 00 00 01 0f 85 88 00 00 00 48 c7 c2 20 09 60 81 31 f6
      RSP: e02b:ffffc90040f27e80 EFLAGS: 00010203
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff8801f3800000 RSI: ffffc90040f27e70 RDI: 0000000000000000
      RBP: 0000000000000000 R08: ffffffff820e47b3 R09: 0000000000000000
      R10: 0000000000007ff0 R11: 0000000000000000 R12: ffffffff822e6d30
      R13: dead000000000200 R14: dead000000000100 R15: ffffffff8158b4e0
      FS:  00007ffa595158c0(0000) GS:ffff8801f39c0000(0000) knlGS:0000000000000000
      CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000002d8 CR3: 00000001d9602000 CR4: 0000000000002660
      Call Trace:
       handle_vcpu_hotplug_event+0xb5/0xc0
       xenwatch_thread+0x80/0x140
       ? wait_woken+0x80/0x80
       kthread+0x112/0x130
       ? kthread_create_worker_on_cpu+0x40/0x40
       ret_from_fork+0x3a/0x50
      
      This happens because handle_vcpu_hotplug_event is called twice. In the
      first iteration cpu_present is still true, in the second iteration
      cpu_present is false which causes get_cpu_device to return NULL.
      In case of cpu#0, cpu_online is apparently always true.
      
      Fix this crash by checking if the cpu can be hotplugged, which is false
      for a cpu that was just removed.
      
      Also check if the cpu was actually offlined by device_remove, otherwise
      leave the cpu_present state as it is.
      
      Rearrange the code to do all work with device_hotplug_lock held.
      Signed-off-by: Olaf Hering <olaf@aepfle.de>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
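      Illustrative sketch (not part of the commit): the double invocation
      and the two checks described above, as a toy state machine. `ToyCpu`
      and its fields are hypothetical. "Present" stands in for "has a CPU
      device that can be hotplugged", which simplifies the kernel's actual
      test.

```python
from dataclasses import dataclass


@dataclass
class ToyCpu:
    present: bool = True
    online: bool = True
    can_offline: bool = True        # False models CPU 0, which stays online


def disable_hotplug_cpu(cpu: ToyCpu) -> str:
    # A CPU that was already removed has no device; bail out instead of
    # handing a missing device to the offline step.
    if not cpu.present:
        return "skipped"
    if cpu.online and cpu.can_offline:
        cpu.online = False
    # Only drop the present state if the CPU really went offline.
    if not cpu.online:
        cpu.present = False
        return "removed"
    return "still-online"
```

      Calling the function twice on the same CPU yields "removed" and then
      "skipped", which corresponds to the second event being ignored rather
      than dereferencing a NULL device.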
    • xen/balloon: add runtime control for scrubbing ballooned out pages · 197ecb38
      Marek Marczykowski-Górecki authored
      Scrubbing pages on initial balloon down can take some time, especially
      in nested virtualization case (nested EPT is slow). When HVM/PVH guest is
      started with memory= significantly lower than maxmem=, all the extra
      pages will be scrubbed before returning to Xen. But since most of them
      weren't used at all at that point, Xen needs to populate them first
      (from populate-on-demand pool). In nested virt case (Xen inside KVM)
      this slows down the guest boot by 15-30s with just 1.5GB needed to be
      returned to Xen.
      
      Add a runtime parameter to enable/disable scrubbing, so that it can be
      disabled initially and re-enabled during boot (for example in initramfs).
      Such usage relies on assumption that a) most pages ballooned out during
      initial boot weren't used at all, and b) even if they were, very few
      secrets are in the guest at that time (before any serious userspace
      kicks in).
      Convert CONFIG_XEN_SCRUB_PAGES to CONFIG_XEN_SCRUB_PAGES_DEFAULT (also
      enabled by default), controlling default value for the new runtime
      switch.
      Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
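      Illustrative sketch (not part of the commit): a build-time default
      feeding a runtime switch, as the message describes.
      `SCRUB_PAGES_DEFAULT` stands in for CONFIG_XEN_SCRUB_PAGES_DEFAULT,
      and `ToyBalloon` is a hypothetical name. How the real switch is
      exposed to userspace is not modelled here.

```python
SCRUB_PAGES_DEFAULT = True          # stands in for the Kconfig default


class ToyBalloon:
    """Toy balloon driver whose scrubbing can be toggled at runtime."""

    def __init__(self, scrub_pages: bool = SCRUB_PAGES_DEFAULT) -> None:
        self.scrub_pages = scrub_pages      # the runtime switch

    def release_page(self, page: bytearray) -> bytearray:
        """Hand a page back to the hypervisor, zeroing it first if enabled."""
        if self.scrub_pages:
            page[:] = bytes(len(page))      # overwrite contents with zeros
        return page
```

      Early boot could clear `scrub_pages` to skip the costly zeroing of
      never-used pages and set it again later, which is the usage pattern
      the message suggests.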