15 Aug, 2018 (39 commits)
    • cpu/hotplug: Split do_cpu_down() · 6beba29c
      Thomas Gleixner authored
      commit cc1fe215 upstream
      
      Split out the inner workings of do_cpu_down() to allow reuse of that
      function for the upcoming SMT disabling mechanism.
      
      No functional change.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6beba29c
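      A minimal sketch of the resulting split, following the shape of upstream
      cc1fe215 (details may differ slightly in the stable backport):

      static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
      {
              if (cpu_hotplug_disabled)
                      return -EBUSY;
              return _cpu_down(cpu, 0, target);
      }

      static int do_cpu_down(unsigned int cpu, enum cpuhp_state target)
      {
              int err;

              cpu_maps_update_begin();
              err = cpu_down_maps_locked(cpu, target);
              cpu_maps_update_done();
              return err;
      }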
    • cpu/hotplug: Make bringup/teardown of smp threads symmetric · 26a6dcc7
      Thomas Gleixner authored
      commit c4de6569 upstream
      
      The asymmetry caused a warning to trigger if the bootup was stopped in state
      CPUHP_AP_ONLINE_IDLE. The warning no longer triggers as kthread_park() can
      now be invoked on already or still parked threads. But there is still no
      reason to have this be asymmetric.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      26a6dcc7
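      For illustration, a hedged sketch of the now-symmetric hotplug state
      entry (shape of upstream c4de6569; the teardown callback was NULL
      before):

              [CPUHP_AP_SMPBOOT_THREADS] = {
                      .name            = "smpboot/threads:online",
                      .startup.single  = smpboot_unpark_threads,
                      .teardown.single = smpboot_park_threads,  /* was NULL */
              },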
    • x86/topology: Provide topology_smt_supported() · 8eb28605
      Thomas Gleixner authored
      commit f048c399 upstream
      
      Provide information whether SMT is supported by the CPUs. Preparatory patch
      for the SMT control mechanism.
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8eb28605
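      The helper is small; a sketch matching the shape of upstream f048c399:

      /* SMT is supported when the CPU reports more than one thread per core */
      bool topology_smt_supported(void)
      {
              return smp_num_siblings > 1;
      }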
    • x86/smp: Provide topology_is_primary_thread() · b84b9184
      Thomas Gleixner authored
      commit 6a4d2657 upstream
      
      If the CPU is supporting SMT then the primary thread can be found by
      checking the lower APIC ID bits for zero. smp_num_siblings is used to build
      the mask for the APIC ID bits which need to be taken into account.
      
      This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
      than the initial APIC ID in CPUID. But according to AMD the lower bits have
      to be consistent. Intel gave a tentative confirmation as well.
      
      Preparatory patch to support disabling SMT at boot/runtime.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b84b9184
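      A sketch of the check described above (shape of the upstream helper;
      fls() isolates the SMT bits of the APIC ID):

      bool apic_id_is_primary_thread(unsigned int apicid)
      {
              u32 mask;

              if (smp_num_siblings == 1)
                      return true;
              /* Isolate the SMT bit(s) in the APICID and check for 0 */
              mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
              return !(apicid & mask);
      }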
    • sched/smt: Update sched_smt_present at runtime · fc890e9b
      Peter Zijlstra authored
      commit ba2591a5 upstream
      
      The static key sched_smt_present is only updated at boot time when SMT
      siblings have been detected. Booting with maxcpus=1 and bringing the
      siblings online after boot rebuilds the scheduling domains correctly but
      does not update the static key, so the SMT code is not enabled.
      
      Let the key be updated in the scheduler CPU hotplug code to fix this.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fc890e9b
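      A hedged sketch of the hotplug-time update (shape of upstream ba2591a5),
      inside the scheduler's CPU activation callback:

      int sched_cpu_activate(unsigned int cpu)
      {
      #ifdef CONFIG_SCHED_SMT
              /*
               * The key must be re-evaluated on every hotplug event, because
               * at boot time SMT might be disabled when maxcpus= limits the
               * number of booted CPUs.
               */
              if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
                      static_branch_enable_cpuslocked(&sched_smt_present);
      #endif
              /* ... existing activation work ... */
              return 0;
      }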
    • x86/bugs: Move the l1tf function and define pr_fmt properly · 202f9056
      Konrad Rzeszutek Wilk authored
      commit 56563f53 upstream
      
      The pr_warn in l1tf_select_mitigation would have used the prior pr_fmt
      which was defined as "Spectre V2 : ".
      
      Move the function to be past SSBD and also define the pr_fmt.
      
      Fixes: 17dbca11 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      202f9056
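      A sketch of the result, assuming the usual bugs.c pr_fmt idiom:

      #undef pr_fmt
      #define pr_fmt(fmt)     "L1TF: " fmt

      static void __init l1tf_select_mitigation(void)
      {
              /* pr_warn() here now prints with the "L1TF: " prefix */
      }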
    • x86/speculation/l1tf: Limit swap file size to MAX_PA/2 · 4bb1a8d8
      Andi Kleen authored
      commit 377eeaa8 upstream
      
      For the L1TF workaround it's necessary to limit the swap file size to below
      MAX_PA/2, so that the inverted higher bits of the swap offset never point
      to valid memory.
      
      Add a mechanism for the architecture to override the swap file size check
      in swapfile.c and add an x86-specific max swapfile check function that
      enforces that limit.
      
      The check is only enabled if the CPU is vulnerable to L1TF.
      
      In VMs with 42bit MAX_PA the typical limit is now 2TB; on a native system
      with 46bit PA it is 32TB. The limit is only per individual swap file, so
      it's always possible to exceed these limits with multiple swap files or
      partitions.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4bb1a8d8
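      A hedged sketch of the x86 override (shape of upstream 377eeaa8; the
      weak default in swapfile.c just returns generic_max_swapfile_size()):

      unsigned long max_swapfile_size(void)
      {
              unsigned long pages = generic_max_swapfile_size();

              if (boot_cpu_has_bug(X86_BUG_L1TF)) {
                      /* Limit swap file size to MAX_PA/2 for the L1TF workaround */
                      unsigned long l1tf_limit = l1tf_pfn_limit() + 1;

                      pages = min_t(unsigned long, l1tf_limit, pages);
              }
              return pages;
      }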
    • x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings · a4116334
      Andi Kleen authored
      commit 42e4089c upstream
      
      For L1TF, PROT_NONE mappings are protected by inverting the PFN in the page
      table entry. This sets the high bits in the CPU's address space, thus
      making sure that an unmapped entry does not point to valid cached memory.
      
      Some server system BIOSes put the MMIO mappings high up in the physical
      address space. If such a high mapping was exposed to unprivileged users,
      they could attack low memory by setting such a mapping to PROT_NONE. This
      could happen through a special device driver which is not access
      protected. Normal /dev/mem is of course access protected.
      
      To avoid this, forbid PROT_NONE mappings or mprotect for high MMIO mappings.
      
      Valid page mappings are allowed because the system is then unsafe anyway.
      
      It's not expected that users commonly use PROT_NONE on MMIO. But to
      minimize any impact, this is only enforced if the mapping actually refers to
      a high MMIO address (defined as the MAX_PA-1 bit being set); the check is
      also skipped for root.
      
      For mmaps this is straightforward and can be handled in vm_insert_pfn() and
      in remap_pfn_range().
      
      For mprotect it's a bit trickier. At the point where the actual PTEs are
      accessed, a lot of state has been changed and it would be difficult to undo
      on an error. Since this is an uncommon case, use a separate early page table
      walk pass for MMIO PROT_NONE mappings that checks for this condition
      early. For non-MMIO and non-PROT_NONE mappings there are no changes.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a4116334
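      A hedged sketch of the core check (shape of upstream 42e4089c; both the
      mmap and mprotect paths funnel into it):

      bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
      {
              if (!boot_cpu_has_bug(X86_BUG_L1TF))
                      return true;
              if (!__pte_needs_invert(pgprot_val(prot)))
                      return true;    /* not PROT_NONE, nothing to check */
              /* If it's real memory always allow */
              if (pfn_valid(pfn))
                      return true;
              /* high MMIO PFN and not root: refuse the mapping */
              if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
                      return false;
              return true;
      }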
    • x86/speculation/l1tf: Add sysfs reporting for l1tf · 3d98de69
      Andi Kleen authored
      commit 17dbca11 upstream
      
      L1TF core kernel workarounds are cheap and normally always enabled. However,
      they should still be reported in sysfs if the system is vulnerable or
      mitigated. Add the necessary CPU feature/bug bits.
      
      - Extend the existing checks for Meltdown to determine if the system is
        vulnerable. All CPUs which are not vulnerable to Meltdown are also not
        vulnerable to L1TF
      
      - Check for 32bit non-PAE and emit a warning, as there is no practical way
        to mitigate due to the limited physical address bits
      
      - If the system has more than MAX_PA/2 physical memory the page inversion
        workarounds don't protect the system against the L1TF attack anymore,
        because an inverted physical address will also point to valid
        memory. Print a warning in this case and report that the system is
        vulnerable.
      
      Add a function which returns the PFN limit for the L1TF mitigation, which
      will be used in follow up patches for sanity and range checks.
      
      [ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d98de69
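      The PFN limit helper mentioned above, as a hedged sketch (half the
      physical address space, expressed as a last-valid PFN):

      static inline unsigned long long l1tf_pfn_limit(void)
      {
              return BIT_ULL(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1;
      }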
    • x86/speculation/l1tf: Make sure the first page is always reserved · aff6fe17
      Andi Kleen authored
      commit 10a70416 upstream
      
      The L1TF workaround doesn't make any attempt to mitigate speculative accesses
      to the first physical page for zeroed PTEs. Normally it only contains some
      data from the early real mode BIOS.
      
      It's not entirely clear that the first page is reserved in all
      configurations, so add an extra reservation call to make sure it is really
      reserved. In most configurations (e.g. with the standard reservations)
      it's likely a nop.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aff6fe17
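      The reservation itself is a single call in the x86 setup path; a sketch:

              /*
               * Make sure page 0 is always reserved because on systems with
               * L1TF its contents can be leaked to user processes.
               */
              memblock_reserve(0, PAGE_SIZE);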
    • x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation · 8c35b2fc
      Andi Kleen authored
      commit 6b28baca upstream
      
      When PTEs are set to PROT_NONE the kernel just clears the Present bit and
      preserves the PFN, which creates attack surface for L1TF speculation
      attacks.
      
      This is important inside guests, because L1TF speculation bypasses physical
      page remapping. While the host has its own mitigations preventing leaking
      data from other VMs into the guest, this would still risk leaking the wrong
      page inside the current guest.
      
      This uses the same technique as Linus' swap entry patch: while an entry
      is in PROTNONE state, invert the complete PFN part of it. This ensures
      that the highest bit will point to non-existing memory.
      
      The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE,
      and pte/pmd/pud_pfn undo it.
      
      This assumes that no code path touches the PFN part of a PTE directly
      without using these primitives.
      
      This doesn't handle the case that MMIO is on the top of the CPU physical
      memory. If such an MMIO region was exposed by an unprivileged driver for
      mmap it would be possible to attack some real memory. However this
      situation is all rather unlikely.
      
      For 32bit non-PAE the inversion is not done because there are really not
      enough bits to protect anything.
      
      Q: Why does the guest need to be protected when the HyperVisor already has
         L1TF mitigations?
      
      A: Here's an example:
      
         Physical pages 1 and 2 get mapped into a guest as
         GPA 1 -> PA 2
         GPA 2 -> PA 1
         through EPT.
      
         The L1TF speculation ignores the EPT remapping.
      
         Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
         they belong to different users and should be isolated.
      
         A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
         gets read access to the underlying physical page, which in this case
         points to PA 2; so it can read process B's data, if it happened to be in
         L1, and isolation inside the guest is broken.
      
         There's nothing the hypervisor can do about this. This mitigation has to
         be done in the guest itself.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c35b2fc
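      A hedged sketch of the inversion primitives (shape of the upstream
      pgtable-invert helpers):

      /* A PTE needs inverting when it is PROT_NONE: PROTNONE set, PRESENT clear */
      static inline bool __pte_needs_invert(u64 val)
      {
              return (val & (_PAGE_PRESENT | _PAGE_PROTNONE)) == _PAGE_PROTNONE;
      }

      /* XOR mask applied to the PFN bits of such a PTE */
      static inline u64 protnone_mask(u64 val)
      {
              return __pte_needs_invert(val) ? ~0ull : 0;
      }

      static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
      {
              /* Invert the PFN part when a PTE transitions to or from PROT_NONE */
              if (__pte_needs_invert(oldval) != __pte_needs_invert(val))
                      val = (val & ~mask) | (~val & mask);
              return val;
      }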
    • x86/speculation/l1tf: Protect swap entries against L1TF · 83ef7e8c
      Linus Torvalds authored
      commit 2f22b4cd upstream
      
      With L1 terminal fault the CPU speculates into unmapped PTEs, and the
      resulting side effects allow reading the memory the PTE is pointing to,
      if its value is still in the L1 cache.
      
      For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
      them.
      
      To protect against L1TF it must be ensured that the swap entry is not
      pointing to valid memory, which requires setting higher bits (between bit
      36 and bit 45) that are inside the CPU's physical address space, but outside
      any real memory.
      
      To do this invert the offset to make sure the higher bits are always set,
      as long as the swap file is not too big.
      
      Note there is no workaround for 32bit !PAE, or on systems which have more
      than MAX_PA/2 worth of memory. The latter case is very unlikely to happen on
      real systems.
      
      [ AK: Updated description and minor tweaks. Split out from the original
            patch ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      83ef7e8c
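      A hedged sketch of the inverted encoding (following the shape of the
      upstream macros; the SWP_* constants are those of the previous patch in
      this series). The offset is stored bit-flipped, so the high,
      physical-address bits of a swap PTE are set for any offset that fits in
      the swap file; decoding applies the inversion again to recover it:

      #define __swp_type(x)   ((x).val >> (64 - SWP_TYPE_BITS))
      #define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
      #define __swp_entry(type, offset) ((swp_entry_t) { \
              (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
              | ((unsigned long)(type) << (64 - SWP_TYPE_BITS)) })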
    • x86/speculation/l1tf: Change order of offset/type in swap entry · 39991a7a
      Linus Torvalds authored
      commit bcd11afa upstream
      
      If pages are swapped out, the swap entry is stored in the corresponding
      PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
      on PTE entries which have the present bit set and would treat the swap
      entry as a physical address (PFN). To mitigate that the upper bits of the PTE
      must be set so the PTE points to non-existent memory.
      
      The swap entry stores the type and the offset of a swapped out page in the
      PTE. type is stored in bits 9-13 and offset in bits 14-63. The hardware
      ignores the bits beyond the physical address space limit, so to make the
      mitigation effective it's required to start 'offset' at the lowest possible
      bit so that even large swap offsets do not reach into the physical address
      space limit bits.
      
      Move offset to bit 9-58 and type to bit 59-63 which are the bits that
      hardware generally doesn't care about.
      
      That, in turn, means that if you are on a desktop chip with only 40 bits of
      physical addressing, now that the offset starts at bit 9, there needs to be
      30 bits of offset actually *in use* until bit 39 ends up being set, which
      means when inverted it will again point into existing memory.
      
      So that's 4 terabytes of swap space (because the offset is counted in pages,
      so 30 bits of offset is 42 bits of actual coverage). With bigger physical
      addressing, that obviously grows further, until the limit of the offset is
      hit (at 50 bits of offset - 62 bits of actual swap file coverage).
      
      This is a preparatory change for the actual swap entry inversion to protect
      against L1TF.
      
      [ AK: Updated description and minor tweaks. Split into two parts ]
      [ tglx: Massaged changelog ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      39991a7a
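      A hedged sketch of the reordered, not-yet-inverted layout this patch
      prepares (shape of the upstream macros; _PAGE_BIT_PROTNONE is bit 8, so
      the offset begins at bit 9):

      #define SWP_TYPE_BITS           5
      #define SWP_OFFSET_FIRST_BIT    (_PAGE_BIT_PROTNONE + 1)
      /* Extract/encode the offset by shifting it all the way up, then down */
      #define SWP_OFFSET_SHIFT        (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

      #define __swp_type(x)   ((x).val >> (64 - SWP_TYPE_BITS))       /* bits 59-63 */
      #define __swp_offset(x) (((x).val << SWP_TYPE_BITS) >> SWP_OFFSET_SHIFT)
      #define __swp_entry(type, offset) ((swp_entry_t) { \
              ((unsigned long)(type) << (64 - SWP_TYPE_BITS)) \
              | ((unsigned long)(offset) << SWP_OFFSET_FIRST_BIT) })  /* bits 9-58 */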
    • x86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT · aefe1861
      Andi Kleen authored
      commit 50896e18 upstream
      
      L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
      speculates on PTE entries which do not have the PRESENT bit set, if the
      content of the resulting physical address is available in the L1D cache.
      
      The OS side mitigation makes sure that a !PRESENT PTE entry points to a
      physical address outside the actually existing and cacheable memory
      space. This is achieved by inverting the upper bits of the PTE. Due to the
      address space limitations this only works for 64bit and 32bit PAE kernels,
      but not for 32bit non-PAE.
      
      This mitigation applies to both host and guest kernels, but in case of a
      64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
      the PAE address space (44bit) is not enough if the host has more than 43
      bits of populated memory address space, because the speculation treats the
      PTE content as a physical host address bypassing EPT.
      
      The host (hypervisor) protects itself against the guest by flushing L1D as
      needed, but pages inside the guest are not protected against attacks from
      other processes inside the same guest.
      
      For the guest the inverted PTE mask has to match the host to provide the
      full protection for all pages the host could possibly map into the
      guest. The hosts populated address space is not known to the guest, so the
      mask must cover the possible maximal host address space, i.e. 52 bit.
      
      On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
      is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
      the mask to be below what the host could possibly use for physical pages.
      
      The L1TF PROT_NONE protection code uses the PTE masks to determine which
      bits to invert to make sure the higher bits are set for unmapped entries to
      prevent L1TF speculation attacks against EPT inside guests.
      
      In order to invert all bits that could be used by the host, increase
      __PHYSICAL_PAGE_SHIFT to 52 to match 64bit.
      
      The real limit for a 32bit PAE kernel is still 44 bits because all Linux
      PTEs are created from unsigned long PFNs, so they cannot be higher than 44
      bits on a 32bit kernel. So these extra PFN bits should never be set. The
      only users of this macro are using it to look at PTEs, so it's safe.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aefe1861
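      The change itself is a one-liner; a hedged sketch (the changelog's
      __PHYSICAL_PAGE_SHIFT corresponds to the PAE physical mask shift in
      page_32_types.h):

      -#define __PHYSICAL_MASK_SHIFT   44
      +#define __PHYSICAL_MASK_SHIFT   52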
    • x86/irqflags: Provide a declaration for native_save_fl · 688e5a20
      Nick Desaulniers authored
      commit 208cbb32 upstream.
      
      It was reported that commit d0a8d937 causes users of gcc < 4.9
      to observe -Werror=missing-prototypes errors.
      
      Indeed, it seems that:
      extern inline unsigned long native_save_fl(void) { return 0; }
      
      compiled with -Werror=missing-prototypes produces this warning in gcc <
      4.9, but not gcc >= 4.9.
      
      Fixes: d0a8d937 ("x86/paravirt: Make native_save_fl() extern inline").
      Reported-by: David Laight <david.laight@aculab.com>
      Reported-by: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: jgross@suse.com
      Cc: kstewart@linuxfoundation.org
      Cc: gregkh@linuxfoundation.org
      Cc: boris.ostrovsky@oracle.com
      Cc: astrachan@google.com
      Cc: mka@chromium.org
      Cc: arnd@arndb.de
      Cc: tstellar@redhat.com
      Cc: sedat.dilek@gmail.com
      Cc: David.Laight@aculab.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180803170550.164688-1-ndesaulniers@google.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      688e5a20
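      A sketch of the fix: the extern inline definition gains a matching
      declaration (asm body as in arch/x86/include/asm/irqflags.h):

      extern unsigned long native_save_fl(void);

      extern inline unsigned long native_save_fl(void)
      {
              unsigned long flags;

              asm volatile("# __raw_save_flags\n\t"
                           "pushf ; pop %0"
                           : "=rm" (flags)
                           : /* no input */
                           : "memory");
              return flags;
      }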
    • kprobes/x86: Fix %p uses in error messages · bf6ac84a
      Masami Hiramatsu authored
      commit 0ea06330 upstream.
      
      Remove all %p uses in error messages in kprobes/x86.
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jon Medhurst <tixy@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Richter <tmricht@linux.ibm.com>
      Cc: Tobin C . Harding <me@tobin.cc>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: acme@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: brueckner@linux.vnet.ibm.com
      Cc: linux-arch@vger.kernel.org
      Cc: rostedt@goodmis.org
      Cc: schwidefsky@de.ibm.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/lkml/152491902310.9916.13355297638917767319.stgit@devbox
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf6ac84a
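      The pattern of the change, as a hedged illustration (message text
      approximate, not the exact strings):

      -       pr_err("Unrecoverable kprobe detected at %p.\n", p->addr);
      +       pr_err("Unrecoverable kprobe detected.\n");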
    • x86/speculation: Protect against userspace-userspace spectreRSB · f374b559
      Jiri Kosina authored
      commit fdf82a78 upstream.
      
      The article "Spectre Returns! Speculation Attacks using the Return Stack
      Buffer" [1] describes two new (sub-)variants of spectrev2-like attacks,
      making use solely of the RSB contents even on CPUs that don't fallback to
      BTB on RSB underflow (Skylake+).
      
      Mitigate userspace-userspace attacks by unconditionally filling the RSB on
      context switch when the generic spectrev2 mitigation has been enabled.
      
      [1] https://arxiv.org/pdf/1807.07940.pdf
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1807261308190.997@cbobk.fhfr.pm
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f374b559
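      A sketch of the change in spectre_v2_select_mitigation(): the RSB-fill
      feature is now forced whenever the generic mitigation is on, rather
      than only for Skylake-era CPUs (comment follows the upstream wording):

              /*
               * If spectre v2 protection has been enabled, unconditionally fill
               * RSB during a context switch; this protects against two
               * independent issues:
               *      - RSB underflow (and switch to BTB) on Skylake+
               *      - SpectreRSB variant of spectre v2 on X86_BUG_SPECTRE_V2 CPUs
               */
              setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
              pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");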
    • x86/paravirt: Fix spectre-v2 mitigations for paravirt guests · f7a0871d
      Peter Zijlstra authored
      commit 5800dc5c upstream.
      
      Nadav reported that on guests we're failing to rewrite the indirect
      calls to CALLEE_SAVE paravirt functions. In particular the
      pv_queued_spin_unlock() call is left unpatched and that is all over the
      place. This obviously wrecks Spectre-v2 mitigation (for paravirt
      guests) which relies on not actually having indirect calls around.
      
      The reason is an incorrect clobber test in paravirt_patch_call(): this
      function rewrites an indirect call with a direct call to the _SAME_
      function, so there is no possible way the clobbers can differ because
      of this.
      
      Therefore remove this clobber check. Also put WARNs on the other patch
      failure case (not enough room for the instruction) which I've not seen
      trigger in my (limited) testing.
      
      Three live kernel image disassemblies for lock_sock_nested (as a small
      function that illustrates the problem nicely). PRE is the current
      situation for guests, POST is with this patch applied and NATIVE is with
      or without the patch for !guests.
      
      PRE:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    callq  *0xffffffff822299e8
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      POST:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    callq  0xffffffff810a0c20 <__raw_callee_save___pv_queued_spin_unlock>
         0xffffffff817be9a5 <+53>:    xchg   %ax,%ax
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063aa0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      NATIVE:
      
      (gdb) disassemble lock_sock_nested
      Dump of assembler code for function lock_sock_nested:
         0xffffffff817be970 <+0>:     push   %rbp
         0xffffffff817be971 <+1>:     mov    %rdi,%rbp
         0xffffffff817be974 <+4>:     push   %rbx
         0xffffffff817be975 <+5>:     lea    0x88(%rbp),%rbx
         0xffffffff817be97c <+12>:    callq  0xffffffff819f7160 <_cond_resched>
         0xffffffff817be981 <+17>:    mov    %rbx,%rdi
         0xffffffff817be984 <+20>:    callq  0xffffffff819fbb00 <_raw_spin_lock_bh>
         0xffffffff817be989 <+25>:    mov    0x8c(%rbp),%eax
         0xffffffff817be98f <+31>:    test   %eax,%eax
         0xffffffff817be991 <+33>:    jne    0xffffffff817be9ba <lock_sock_nested+74>
         0xffffffff817be993 <+35>:    movl   $0x1,0x8c(%rbp)
         0xffffffff817be99d <+45>:    mov    %rbx,%rdi
         0xffffffff817be9a0 <+48>:    movb   $0x0,(%rdi)
         0xffffffff817be9a3 <+51>:    nopl   0x0(%rax)
         0xffffffff817be9a7 <+55>:    pop    %rbx
         0xffffffff817be9a8 <+56>:    pop    %rbp
         0xffffffff817be9a9 <+57>:    mov    $0x200,%esi
         0xffffffff817be9ae <+62>:    mov    $0xffffffff817be993,%rdi
         0xffffffff817be9b5 <+69>:    jmpq   0xffffffff81063ae0 <__local_bh_enable_ip>
         0xffffffff817be9ba <+74>:    mov    %rbp,%rdi
         0xffffffff817be9bd <+77>:    callq  0xffffffff817be8c0 <__lock_sock>
         0xffffffff817be9c2 <+82>:    jmp    0xffffffff817be993 <lock_sock_nested+35>
      End of assembler dump.
      
      
      Fixes: 63f70270 ("[PATCH] i386: PARAVIRT: add common patching machinery")
      Fixes: 3010a066 ("x86/paravirt, objtool: Annotate indirect calls")
      Reported-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f7a0871d
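      A hedged sketch of the fixed patcher (shape of upstream 5800dc5c): the
      clobber comparison is gone, and too-short call sites now WARN instead of
      silently staying indirect:

      unsigned paravirt_patch_call(void *insnbuf, const void *target,
                                   u16 tgt_clobbers, unsigned long addr,
                                   u16 site_clobbers, unsigned len)
      {
              struct branch *b = insnbuf;
              unsigned long delta = (unsigned long)target - (addr + 5);

              /* No "tgt_clobbers & ~site_clobbers" bail-out anymore: the
               * direct call targets the very same function, so the clobbers
               * cannot differ. */
              if (len < 5) {
      #ifdef CONFIG_RETPOLINE
                      WARN_ONCE(1, "Failing to patch indirect CALL in %ps\n",
                                (void *)addr);
      #endif
                      return len;     /* call too long for patch site */
              }

              b->opcode = 0xe8; /* call */
              b->delta = delta;
              BUILD_BUG_ON(sizeof(*b) != 5);

              return 5;
      }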
    • ARM: dts: imx6sx: fix irq for pcie bridge · 93bf5456
      Oleksij Rempel authored
      commit 1bcfe056 upstream.
      
      Use the correct IRQ line for the MSI controller in the PCIe host
      controller. Apparently a different IRQ line is used compared to other
      i.MX6 variants. Without this change MSI IRQs aren't properly propagated
      to the upstream interrupt controller.
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
      Fixes: b1d17f68 ("ARM: dts: imx: add initial imx6sx device tree source")
      Signed-off-by: Shawn Guo <shawnguo@kernel.org>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      93bf5456
    • Bluetooth: hci_serdev: Init hci_uart proto_lock to avoid oops · 4a53c4e8
      Lukas Wunner authored
      commit d73e1728 upstream.
      
      John Stultz reports a boot time crash with the HiKey board (which uses
      hci_serdev) occurring in hci_uart_tx_wakeup().  That function is
      contained in hci_ldisc.c, but also called from the newer hci_serdev.c.
      It acquires the proto_lock in struct hci_uart and it turns out that we
      forgot to init the lock in the serdev code path, thus causing the crash.
      
      John bisected the crash to commit 67d2f878 ("Bluetooth: hci_ldisc:
      Allow sleeping while proto locks are held"), but the issue was present
      before and the commit merely exposed it.  (Perhaps by luck, the crash
      did not occur with rwlocks.)
      
      Init the proto_lock in the serdev code path to avoid the oops.
      
      Stack trace for posterity:
      
      Unable to handle kernel read from unreadable memory at 406f127000
      [000000406f127000] user address but active_mm is swapper
      Internal error: Oops: 96000005 [#1] PREEMPT SMP
      Hardware name: HiKey Development Board (DT)
      Call trace:
       hci_uart_tx_wakeup+0x38/0x148
       hci_uart_send_frame+0x28/0x38
       hci_send_frame+0x64/0xc0
       hci_cmd_work+0x98/0x110
       process_one_work+0x134/0x330
       worker_thread+0x130/0x468
       kthread+0xf8/0x128
       ret_from_fork+0x10/0x18
      
      Link: https://lkml.org/lkml/2017/11/15/908
      Reported-and-tested-by: John Stultz <john.stultz@linaro.org>
      Cc: Ronald Tschalär <ronald@innovation.ch>
      Cc: Rob Herring <rob.herring@linaro.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a53c4e8
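      A sketch of the one-line fix in hci_uart_register_device() (surrounding
      code abridged):

      int hci_uart_register_device(struct hci_uart *hu,
                                   const struct hci_uart_proto *p)
      {
              int err;

              percpu_init_rwsem(&hu->proto_lock);     /* the missing init */

              err = serdev_device_open(hu->serdev);
              if (err)
                      return err;
              /* ... rest of registration ... */
              return 0;
      }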
    • Bluetooth: hci_ldisc: Allow sleeping while proto locks are held. · f6ec33f6
      Ronald Tschalär authored
      commit 67d2f878 upstream.
      
      Commit dec2c928 ("Bluetooth: hci_ldisc:
      Use rwlocking to avoid closing proto races") introduced locks in
      hci_ldisc that are held while calling the proto functions. These locks
      are rwlocks, and hence do not allow sleeping while they are held.
      However, the proto functions that hci_bcm registers use mutexes and
      hence need to be able to sleep.
      
      In more detail: hci_uart_tty_receive() and hci_uart_dequeue() both
      acquire the rwlock, after which they call proto->recv() and
      proto->dequeue(), respectively. In the case of hci_bcm these point to
      bcm_recv() and bcm_dequeue(). The latter both acquire the
      bcm_device_lock, which is a mutex, so doing so results in a call to
      might_sleep(). But since we're holding a rwlock in hci_ldisc, that
      results in the following BUG (this for the dequeue case - a similar
      one for the receive case is omitted for brevity):
      
        BUG: sleeping function called from invalid context at kernel/locking/mutex.c
        in_atomic(): 1, irqs_disabled(): 0, pid: 7303, name: kworker/7:3
        INFO: lockdep is turned off.
        CPU: 7 PID: 7303 Comm: kworker/7:3 Tainted: G        W  OE   4.13.2+ #17
        Hardware name: Apple Inc. MacBookPro13,3/Mac-A5C67F76ED83108C, BIOS MBP133.8
        Workqueue: events hci_uart_write_work [hci_uart]
        Call Trace:
         dump_stack+0x8e/0xd6
         ___might_sleep+0x164/0x250
         __might_sleep+0x4a/0x80
         __mutex_lock+0x59/0xa00
         ? lock_acquire+0xa3/0x1f0
         ? lock_acquire+0xa3/0x1f0
         ? hci_uart_write_work+0xd3/0x160 [hci_uart]
         mutex_lock_nested+0x1b/0x20
         ? mutex_lock_nested+0x1b/0x20
         bcm_dequeue+0x21/0xc0 [hci_uart]
         hci_uart_write_work+0xe6/0x160 [hci_uart]
         process_one_work+0x253/0x6a0
         worker_thread+0x4d/0x3b0
         kthread+0x133/0x150
      
      We can't replace the mutex in hci_bcm, because there are other calls
      there that might sleep. Therefore this replaces the rwlocks in
      hci_ldisc with rw_semaphores (which allow sleeping). This is a safer
      approach anyway as it reduces the restrictions on the proto callbacks.
      Also, because acquiring the write-lock is very rare compared to acquiring
      the read-lock, the percpu variant of rw_semaphore is used.
      
      Lastly, because hci_uart_tx_wakeup() may be called from an IRQ context,
      we can't block (sleep) while trying to acquire the read lock there, so we
      use the trylock variant.
      Signed-off-by: Ronald Tschalär <ronald@innovation.ch>
      Reviewed-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6ec33f6
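      A hedged sketch of the trylock pattern in the wakeup path (shape of the
      upstream change, details abridged):

      int hci_uart_tx_wakeup(struct hci_uart *hu)
      {
              /* May run in IRQ context: never sleep here, only trylock */
              if (!percpu_down_read_trylock(&hu->proto_lock))
                      return 0;

              if (!test_bit(HCI_UART_PROTO_READY, &hu->flags))
                      goto no_schedule;

              /* ... queue the transmit work ... */

      no_schedule:
              percpu_up_read(&hu->proto_lock);
              return 0;
      }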
    • phy: phy-mtk-tphy: use auto instead of force to bypass utmi signals · 4290940d
      Chunfeng Yun authored
      commit 00c0092c upstream.
      
      When the system is running, if the usb2 phy is forced to bypass utmi
      signals, all PLLs will be turned off and it can't detect device connection
      anymore. So replace force mode with auto mode, which can bypass utmi
      signals automatically if no device is attached, for the normal flow.
      But keep the force mode to fix the RX sensitivity degradation issue.
      Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com>
      Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4290940d
    • mtd: nand: qcom: Add a NULL check for devm_kasprintf() · 2424869e
      Fabio Estevam authored
      commit 069f0534 upstream.
      
      devm_kasprintf() may fail, so we should add a NULL check
      and propagate an error on failure.
      Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
      Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2424869e
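      The pattern, as a sketch (shape of upstream 069f0534):

              mtd->name = devm_kasprintf(dev, GFP_KERNEL, "qcom_nand.%d", host->cs);
              if (!mtd->name)
                      return -ENOMEM;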
    • fix __legitimize_mnt()/mntput() race · e5751c84
      Al Viro authored
      commit 119e1ef8 upstream.
      
      __legitimize_mnt() has two problems - one is that in case of success
      the check of mount_lock is not ordered wrt the preceding increment of
      refcount, making it possible to have a successful __legitimize_mnt()
      on one CPU just before the otherwise final mntput() on another,
      with __legitimize_mnt() not seeing mntput() taking the lock and
      mntput() not seeing the increment done by __legitimize_mnt().
      Solved by a pair of barriers.
      
      Another is that failure of __legitimize_mnt() on the second
      read_seqretry() leaves us with reference that'll need to be
      dropped by caller; however, if that races with final mntput()
      we can end up with caller dropping rcu_read_lock() and doing
      mntput() to release that reference - with the first mntput()
      having freed the damn thing just as rcu_read_lock() had been
      dropped.  Solution: in "do mntput() yourself" failure case
      grab mount_lock, check if MNT_DOOMED has been set by racing
      final mntput() that has missed our increment and if it has -
      undo the increment and treat that as "failure, caller doesn't
      need to drop anything" case.
      
      It's not easy to hit - the final mntput() has to come right
      after the first read_seqretry() in __legitimize_mnt() *and*
      manage to miss the increment done by __legitimize_mnt() before
      the second read_seqretry() in there.  The things that are almost
      impossible to hit on bare hardware are not impossible on SMP
      KVM, though...
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Fixes: 48a066e7 ("RCU'd vfsmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5751c84
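      A hedged sketch of the fixed slow path (shape of upstream 119e1ef8):

      int __legitimize_mnt(struct vfsmount *bastard, unsigned seq)
      {
              struct mount *mnt;
              if (read_seqretry(&mount_lock, seq))
                      return 1;
              if (bastard == NULL)
                      return 0;
              mnt = real_mount(bastard);
              mnt_add_count(mnt, 1);
              smp_mb();       /* pairs with the barrier in mntput_no_expire() */
              if (likely(!read_seqretry(&mount_lock, seq)))
                      return 0;
              if (bastard->mnt_flags & MNT_SYNC_UMOUNT) {
                      mnt_add_count(mnt, -1);
                      return 1;
              }
              lock_mount_hash();
              if (unlikely(bastard->mnt_flags & MNT_DOOMED)) {
                      /* racing final mntput() missed our increment: undo it */
                      mnt_add_count(mnt, -1);
                      unlock_mount_hash();
                      return 1;
              }
              unlock_mount_hash();
              return -1;      /* caller will do the mntput() */
      }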
    • fix mntput/mntput race · d9d46a22
      Al Viro authored
      commit 9ea0a46c upstream.
      
      mntput_no_expire() does the calculation of total refcount under mount_lock;
      unfortunately, the decrement (as well as all increments) are done outside
      of it, leading to false positives in the "are we dropping the last reference"
      test.  Consider the following situation:
      	* mnt is a lazy-umounted mount, kept alive by two opened files.  One
      of those files gets closed.  Total refcount of mnt is 2.  On CPU 42
      mntput(mnt) (called from __fput()) drops one reference, decrementing component
      	* After it has looked at component #0, the process on CPU 0 does
      mntget(), incrementing component #0, gets preempted and gets to run again -
      on CPU 69.  There it does mntput(), which drops the reference (component #69)
      and proceeds to spin on mount_lock.
      	* On CPU 42 our first mntput() finishes counting.  It observes the
      decrement of component #69, but not the increment of component #0.  As the
      result, the total it gets is not 1 as it should've been - it's 0.  At which
      point we decide that vfsmount needs to be killed and proceed to free it and
      shut the filesystem down.  However, there's still another opened file
      on that filesystem, with reference to (now freed) vfsmount, etc. and we are
      screwed.
      
      It's not a wide race, but it can be reproduced with artificial slowdown of
      the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.
      
      Fix consists of moving the refcount decrement under mount_lock; the tricky
      part is that we want (and can) keep the fast case (i.e. mount that still
      has non-NULL ->mnt_ns) entirely out of mount_lock.  All places that zero
      mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
      before that mntput().  IOW, if mntput() observes (under rcu_read_lock())
      a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
      be dropped.
      Reported-by: Jann Horn <jannh@google.com>
      Tested-by: Jann Horn <jannh@google.com>
      Fixes: 48a066e7 ("RCU'd vfsmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9d46a22
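      A hedged sketch of the fixed fast/slow split (shape of upstream
      9ea0a46c, teardown details abridged):

      static void mntput_no_expire(struct mount *mnt)
      {
              rcu_read_lock();
              if (likely(READ_ONCE(mnt->mnt_ns))) {
                      /* fast path: cannot be the last reference, see changelog */
                      mnt_add_count(mnt, -1);
                      rcu_read_unlock();
                      return;
              }
              lock_mount_hash();
              /*
               * Decrement under mount_lock so the total-count loop cannot miss
               * it; pairs with the barrier in __legitimize_mnt().
               */
              smp_mb();
              mnt_add_count(mnt, -1);
              if (mnt_get_count(mnt)) {
                      rcu_read_unlock();
                      unlock_mount_hash();
                      return;
              }
              /* ... really the last reference: tear the mount down ... */
      }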
    • make sure that __dentry_kill() always invalidates d_seq, unhashed or not · d5426a38
      Al Viro authored
      commit 4c0d7cd5 upstream.
      
      RCU pathwalk relies upon the assumption that anything that changes
      ->d_inode of a dentry will invalidate its ->d_seq.  That's almost
      true - the one exception is that the final dput() of already unhashed
      dentry does *not* touch ->d_seq at all.  Unhashing does, though,
      so for anything we'd found by RCU dcache lookup we are fine.
      Unfortunately, we can *start* with an unhashed dentry or jump into
      it.
      
      We could try and be careful in the (few) places where that could
      happen.  Or we could just make the final dput() invalidate the damn
      thing, unhashed or not.  The latter is much simpler and easier to
      backport, so let's do it that way.
      Reported-by: "Dae R. Jeong" <threeearcat@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d5426a38
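      A hedged sketch of the idea (shape of upstream 4c0d7cd5): hash removal
      is split from seqcount invalidation, and the kill path invalidates
      unconditionally, so even an already-unhashed dentry gets ->d_seq bumped:

      static void ___d_drop(struct dentry *dentry)
      {
              /* remove from the hash chain, if hashed */
      }

      void __d_drop(struct dentry *dentry)
      {
              ___d_drop(dentry);
              write_seqcount_invalidate(&dentry->d_seq);
      }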
    • root dentries need RCU-delayed freeing · abfc0ec6
      Al Viro authored
      commit 90bad5e0 upstream.
      
      Since mountpoint crossing can happen without leaving lazy mode,
      root dentries do need the same protection against having their
      memory freed without RCU delay as everything else in the tree.
      
      It's partially hidden by RCU delay between detaching from the
      mount tree and dropping the vfsmount reference, but the starting
      point of pathwalk can be on an already detached mount, in which
      case umount-caused RCU delay has already passed by the time the
      lazy pathwalk grabs rcu_read_lock().  If the starting point
      happens to be at the root of that vfsmount *and* that vfsmount
      covers the entire filesystem, we get trouble.
      
      Fixes: 48a066e7 ("RCU'd vfsmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      abfc0ec6
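      A hedged sketch of the fix (shape of upstream 90bad5e0): mark root
      dentries RCU-accessed so dentry_free() takes the call_rcu() path:

      struct dentry *d_make_root(struct inode *root_inode)
      {
              struct dentry *res = NULL;

              if (root_inode) {
                      res = __d_alloc(root_inode->i_sb, NULL);
                      if (res) {
                              /* reachable in RCU mode: free via RCU delay */
                              res->d_flags |= DCACHE_RCUACCESS;
                              d_instantiate(res, root_inode);
                      } else {
                              iput(root_inode);
                      }
              }
              return res;
      }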
    • init: rename and re-order boot_cpu_state_init() · b7722f4a
      Linus Torvalds authored
      commit b5b1404d upstream.
      
      This is purely a preparatory patch for upcoming changes during the 4.19
      merge window.
      
      We have a function called "boot_cpu_state_init()" that isn't really
      about the bootup cpu state: that is done much earlier by the similarly
      named "boot_cpu_init()" (note lack of "state" in name).
      
      This function initializes some hotplug CPU state, and needs to run after
      the percpu data has been properly initialized.  It even has a comment to
      that effect.
      
      Except it _doesn't_ actually run after the percpu data has been properly
      initialized.  On x86 it happens to do that, but on at least arm and
      arm64, the percpu base pointers are initialized by the arch-specific
      'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().
      
      This had some unexpected results, and in particular we have a patch
      pending for the merge window that did the obvious cleanup of using
      'this_cpu_write()' in the cpu hotplug init code:
      
        -       per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
        +       this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
      
      which is obviously the right thing to do.  Except because of the
      ordering issue, it actually failed miserably and unexpectedly on arm64.
      
      So this just fixes the ordering, and changes the name of the function to
      be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
      hotplug state, because the core CPU state was supposed to have already
      been done earlier.
      
      Marked for stable, since the (not yet merged) patch that will show this
      problem is marked for stable.
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com>
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b7722f4a
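      A sketch of the fixed ordering in start_kernel() (shape of upstream
      b5b1404d):

              setup_per_cpu_areas();
              smp_prepare_boot_cpu();   /* arch hook; sets up percpu base pointers */
              boot_cpu_hotplug_init();  /* renamed from boot_cpu_state_init(); now
                                           safe, percpu areas are live */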
    • scsi: qla2xxx: Fix memory leak for allocating abort IOCB · fa085d77
      Quinn Tran authored
      commit 5e53be8e upstream.
      
      In the case of IOCB QFull, Initiator code can leave behind a stale pointer
      to an SRB structure on the outstanding command array.
      
      Fixes: 82de802a ("scsi: qla2xxx: Preparation for Target MQ.")
      Cc: stable@vger.kernel.org #v4.16+
      Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
      Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fa085d77
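      For illustration only, a heavily hedged sketch of the pattern (variable
      and label names assumed, not the driver's exact code): on a QFull
      queuing error, release the claimed slot so no stale SRB pointer remains
      in the outstanding command array.

      queuing_error:
              if (handle)
                      req->outstanding_cmds[handle] = NULL;
              /* ... existing error-path cleanup ... */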
    • scsi: sr: Avoid that opening a CD-ROM hangs with runtime power management enabled · 71b7ca57
      Bart Van Assche authored
      commit 1214fd7b upstream.
      
      Surround scsi_execute() calls with scsi_autopm_get_device() and
      scsi_autopm_put_device(). Note: removing sr_mutex protection from the
      scsi_cd_get() and scsi_cd_put() calls is safe because the purpose of
      sr_mutex is to serialize cdrom_*() calls.
      
      This patch avoids that complaints similar to the following appear in the
      kernel log if runtime power management is enabled:
      
      INFO: task systemd-udevd:650 blocked for more than 120 seconds.
           Not tainted 4.18.0-rc7-dbg+ #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      systemd-udevd   D28176   650    513 0x00000104
      Call Trace:
      __schedule+0x444/0xfe0
      schedule+0x4e/0xe0
      schedule_preempt_disabled+0x18/0x30
      __mutex_lock+0x41c/0xc70
      mutex_lock_nested+0x1b/0x20
      __blkdev_get+0x106/0x970
      blkdev_get+0x22c/0x5a0
      blkdev_open+0xe9/0x100
      do_dentry_open.isra.19+0x33e/0x570
      vfs_open+0x7c/0xd0
      path_openat+0x6e3/0x1120
      do_filp_open+0x11c/0x1c0
      do_sys_open+0x208/0x2d0
      __x64_sys_openat+0x59/0x70
      do_syscall_64+0x77/0x230
      entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Maurizio Lombardi <mlombard@redhat.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: <stable@vger.kernel.org>
      Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      71b7ca57
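      A sketch of the pattern (shape of upstream 1214fd7b; cmd, sshdr, timeout
      and retries are surrounding-context variables):

              scsi_autopm_get_device(sdev);           /* wake the device */
              ret = scsi_execute(sdev, cmd, DMA_NONE, NULL, 0, NULL,
                                 &sshdr, timeout, retries, 0, 0, NULL);
              scsi_autopm_put_device(sdev);           /* allow it to suspend again */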
    • xen/netfront: don't cache skb_shinfo() · 6e32aa71
      Juergen Gross authored
      commit d472b3a6 upstream.
      
      skb_shinfo() can change when calling __pskb_pull_tail(): Don't cache
      its return value.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Wei Liu <wei.liu2@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6e32aa71
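      The pattern, before and after, as a hedged sketch:

      -       struct skb_shared_info *shinfo = skb_shinfo(skb);
      -       __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
      -       while (shinfo->nr_frags) { ... }        /* may be stale! */
      +       __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
      +       while (skb_shinfo(skb)->nr_frags) { ... }  /* re-evaluated */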
    • stop_machine: Disable preemption after queueing stopper threads · b6f194b6
      Isaac J. Manjarres authored
      commit 2610e889 upstream.
      
      This commit:
      
        9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      
      does not fully address the race condition that can occur
      as follows:
      
      On one CPU, call it CPU 3, thread 1 invokes
      cpu_stop_queue_two_works(2, 3,...), and the execution is such
      that thread 1 queues the works for migration/2 and migration/3,
      and is preempted after releasing the locks for migration/2 and
      migration/3, but before waking the threads.
      
      Then, on CPU 2, a kworker, call it thread 2, is running,
      and it invokes cpu_stop_queue_two_works(1, 2,...), such that
      thread 2 queues the works for migration/1 and migration/2.
      Meanwhile, on CPU 3, thread 1 resumes execution, and wakes
      migration/2 and migration/3. This means that when CPU 2
      releases the locks for migration/1 and migration/2, but before
      it wakes those threads, it can be preempted by migration/2.
      
      If thread 2 is preempted by migration/2, then migration/2 will
      execute the first work item successfully, since migration/3
      was woken up by CPU 3, but when it goes to execute the second
      work item, it disables preemption, calls multi_cpu_stop(),
      and thus, CPU 2 will wait forever for migration/1, which should
      have been woken up by thread 2. However migration/1 cannot be
      woken up by thread 2, since it is a kworker, so it is affine to
      CPU 2, but CPU 2 is running migration/2 with preemption
      disabled, so thread 2 will never run.
      
      Disable preemption after queueing works for stopper threads
      to ensure that the operation of queueing the works and waking
      the stopper threads is atomic.
      Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: gregkh@linuxfoundation.org
      Cc: matt@codeblueprint.co.uk
      Fixes: 9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      Link: http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isaacm@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b6f194b6
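      A hedged sketch of the fix in cpu_stop_queue_two_works() (shape of
      upstream 2610e889; queueing details abridged). Preemption is disabled
      across both queueing and waking, so a woken stopper cannot preempt the
      caller in between:

              preempt_disable();
              raw_spin_lock_irq(&stopper1->lock);
              raw_spin_lock_nested(&stopper2->lock, SINGLE_DEPTH_NESTING);
              /* ... queue both works, collecting wakeups in wakeq ... */
              raw_spin_unlock(&stopper2->lock);
              raw_spin_unlock_irq(&stopper1->lock);

              wake_up_q(&wakeq);      /* still non-preemptible here */
              preempt_enable();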
    • Mark HI and TASKLET softirq synchronous · 3eb86ff3
      Linus Torvalds authored
      commit 3c53776e upstream.
      
      Way back in 4.9, we committed 4cd13c21 ("softirq: Let ksoftirqd do
      its job"), and ever since we've had small nagging issues with it.  For
      example, we've had:
      
        1ff68820 ("watchdog: core: make sure the watchdog_worker is not deferred")
        8d5755b3 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
        217f6974 ("net: busy-poll: allow preemption in sk_busy_loop()")
      
      all of which worked around some of the effects of that commit.
      
      The DVB people have also complained that the commit causes excessive USB
      URB latencies, which seems to be due to the USB code using tasklets to
      schedule USB traffic.  This seems to be an issue mainly when already
      living on the edge, but waiting for ksoftirqd to handle it really does
      seem to cause excessive latencies.
      
      Now Hanna Hawa reports that this issue isn't just limited to USB URB and
      DVB, but also causes timeout problems for the Marvell SoC team:
      
       "I'm facing kernel panic issue while running raid 5 on sata disks
        connected to Macchiatobin (Marvell community board with Armada-8040
        SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine
        and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2)
        uses a tasklet to clean the done descriptors from the queue"
      
      The latency problem causes a panic:
      
        mv_xor_v2 f0400000.xor: dma_sync_wait: timeout!
        Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction
      
      We've discussed simply just reverting the original commit entirely, and
      also much more involved solutions (with per-softirq threads etc).  This
      patch is intentionally stupid and fairly limited, because the issue
      still remains, and the other solutions either got sidetracked or had
      other issues.
      
      We should probably also consider the timer softirqs to be synchronous
      and not be delayed to ksoftirqd (since they were the issue with the
      earlier watchdog problems), but that should be done as a separate patch.
      This patch covers only the tasklet cases.
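      
      The mechanism is a small mask check in kernel/softirq.c: pending HI
      and TASKLET softirqs now make ksoftirqd_running() report false, so
      they are processed synchronously instead of being deferred to the
      thread. Approximately (following the upstream patch; details may
      differ slightly in stable backports):
      
        /* Softirqs in this mask are run immediately and are never
         * deferred to ksoftirqd, even while ksoftirqd is running. */
        #define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))
        
        static bool ksoftirqd_running(unsigned long pending)
        {
                struct task_struct *tsk = __this_cpu_read(ksoftirqd);
        
                if (pending & SOFTIRQ_NOW_MASK)
                        return false;
                return tsk && (tsk->state == TASK_RUNNING);
        }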
      Reported-and-tested-by: Hanna Hawa <hannah@marvell.com>
      Reported-and-tested-by: Josef Griebichler <griebichler.josef@gmx.at>
      Reported-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3eb86ff3
    • Andrey Konovalov's avatar
      kasan: add no_sanitize attribute for clang builds · d8b880fc
      Andrey Konovalov authored
      commit 12c8f25a upstream.
      
      KASAN uses the __no_sanitize_address macro to disable instrumentation of
      particular functions.  Right now it's defined only for GCC builds, which
      causes false positives when clang is used.
      
      This patch adds a definition for clang.
      
      Note that clang revision 329612 or higher is required.
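      
      The definition itself is a one-liner in include/linux/compiler-clang.h;
      clang spells the attribute differently from GCC's no_sanitize_address.
      Roughly:
      
        /* Clang uses the generic no_sanitize attribute with a string
         * argument, rather than GCC's dedicated no_sanitize_address. */
        #define __no_sanitize_address __attribute__((no_sanitize("address")))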
      
      [andreyknvl@google.com: remove redundant #ifdef CONFIG_KASAN check]
        Link: http://lkml.kernel.org/r/c79aa31a2a2790f6131ed607c58b0dd45dd62a6c.1523967959.git.andreyknvl@google.com
      Link: http://lkml.kernel.org/r/4ad725cc903f8534f8c8a60f0daade5e3d674f8d.1523554166.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Paul Lawrence <paullawrence@google.com>
      Cc: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Sodagudi Prasad <psodagud@codeaurora.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8b880fc
    • Ming Lei's avatar
      scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity · 70b522f1
      Ming Lei authored
      commit b5b6e8c8 upstream.
      
      Since commit 84676c1f ("genirq/affinity: assign vectors to all
      possible CPUs") it is possible to end up in a scenario where only
      offline CPUs are mapped to an interrupt vector.
      
      This is only an issue for the legacy I/O path since with blk-mq/scsi-mq
      an I/O can't be submitted to a hardware queue if the queue isn't mapped
      to an online CPU.
      
      Fix this issue by forcing virtio-scsi to use blk-mq.
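      
      Concretely, the driver opts out of the legacy path by setting the
      force_blk_mq flag (introduced by the companion scsi core patch) in
      its host template. An abridged sketch, with most methods omitted:
      
        static struct scsi_host_template virtscsi_host_template = {
                .module         = THIS_MODULE,
                .name           = "Virtio SCSI HBA",
                /* The legacy request path cannot cope with vectors that
                 * are mapped only to offline CPUs, so require blk-mq. */
                .force_blk_mq   = 1,
        };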
      
      [mkp: commit desc]
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: James Bottomley <james.bottomley@hansenpartnership.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Don Brace <don.brace@microsemi.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Fixes: 84676c1f ("genirq/affinity: assign vectors to all possible CPUs")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      70b522f1
    • Ming Lei's avatar
      scsi: core: introduce force_blk_mq · 3ad2e6f4
      Ming Lei authored
      commit 2f31115e upstream.
      
      This patch introduces 'force_blk_mq' to the scsi_host_template so that
      drivers that have no desire to support the legacy I/O path can signal
      blk-mq-only support.
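      
      The mechanism amounts to a new flag bit plus one check when the host
      is allocated; approximately:
      
        /* include/scsi/scsi_host.h: new bit in struct scsi_host_template */
        unsigned force_blk_mq:1;
        
        /* drivers/scsi/hosts.c: scsi_host_alloc() honours the flag */
        shost->use_blk_mq = scsi_use_blk_mq || shost->hostt->force_blk_mq;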
      
      [mkp: commit desc]
      
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: James Bottomley <james.bottomley@hansenpartnership.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Don Brace <don.brace@microsemi.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ad2e6f4
    • Ming Lei's avatar
      scsi: hpsa: fix selection of reply queue · 1edd825c
      Ming Lei authored
      commit 8b834bff upstream.
      
      Since commit 84676c1f ("genirq/affinity: assign vectors to all
      possible CPUs") we could end up with an MSI-X vector that did not have
      any online CPUs mapped. This would lead to I/O hangs since there was no
      CPU to receive the completion.
      
      Retrieve IRQ affinity information using pci_irq_get_affinity() and use
      this mapping to choose a reply queue.
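      
      The shape of the fix is a per-CPU reply-queue map built from the
      affinity masks at init time; a sketch, simplified from the patch,
      with the fallback handling condensed:
      
        static void hpsa_setup_reply_map(struct ctlr_info *h)
        {
                const struct cpumask *mask;
                unsigned int queue, cpu;
        
                for (queue = 0; queue < h->msix_vectors; queue++) {
                        mask = pci_irq_get_affinity(h->pdev, queue);
                        if (!mask)
                                goto fallback;
        
                        for_each_cpu(cpu, mask)
                                h->reply_map[cpu] = queue;
                }
                return;
        
        fallback:
                /* No affinity info: route all completions to queue 0. */
                for_each_possible_cpu(cpu)
                        h->reply_map[cpu] = 0;
        }
      
      The submission path then picks the reply queue via
      h->reply_map[raw_smp_processor_id()], so a completion always lands
      on a vector whose mask contains the (online) submitting CPU.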
      
      [mkp: tweaked commit desc]
      
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
      Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
      Cc: Christoph Hellwig <hch@lst.de>,
      Cc: Don Brace <don.brace@microsemi.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Artem Bityutskiy <artem.bityutskiy@intel.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Fixes: 84676c1f ("genirq/affinity: assign vectors to all possible CPUs")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Tested-by: Don Brace <don.brace@microsemi.com>
      Tested-by: Artem Bityutskiy <artem.bityutskiy@intel.com>
      Acked-by: Don Brace <don.brace@microsemi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1edd825c
    • John David Anglin's avatar
      parisc: Define mb() and add memory barriers to assembler unlock sequences · 2e56b37b
      John David Anglin authored
      commit fedb8da9 upstream.
      
      For years I thought all parisc machines executed loads and stores in
      order. However, Jeff Law recently indicated on gcc-patches that this is
      not correct. There are various degrees of out-of-order execution all the
      way back to the PA7xxx processor series (hit-under-miss). The PA8xxx
      series has full out-of-order execution for integer operations as
      well as for loads and stores.
      
      This is described in the following article:
      http://web.archive.org/web/20040214092531/http://www.cpus.hp.com/technical_references/advperf.shtml
      
      For this reason, we need to define mb() and insert a memory barrier
      before the store that unlocks a spinlock. This ensures that all memory
      accesses are complete prior to unlocking. The ldcw instruction performs
      the same function on entry.
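      
      A sketch of the resulting barrier definitions ("sync" executes as a
      nop on machines that do perform memory references in order, so this
      is safe everywhere):
      
        /* arch/parisc/include/asm/barrier.h */
        #define synchronize_caches() __asm__ __volatile__ ("sync" : : : "memory")
        
        #ifdef CONFIG_SMP
        #define mb()    do { synchronize_caches(); } while (0)
        #define rmb()   mb()
        #define wmb()   mb()
        #endif
      
      and, in the assembler unlock sequences, a "sync" is issued before
      the store that releases the lock word, e.g. (operands illustrative):
      
        sync
        stw     %r20, 0(%sr2,%r20)      /* release the lock word */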
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Cc: stable@vger.kernel.org # 4.0+
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e56b37b
    • Helge Deller's avatar
      parisc: Enable CONFIG_MLONGCALLS by default · 9ffedb10
      Helge Deller authored
      commit 66509a27 upstream.
      
      Enable the -mlong-calls compiler option by default, because otherwise,
      in most cases, linking the vmlinux binary fails due to truncation of
      R_PARISC_PCREL22F
      relocations. This fixes building the 64-bit defconfig.
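      
      The change is confined to Kconfig; roughly, the option now simply
      defaults to y (the architecture Makefile already feeds it into the
      compiler flags):
      
        # arch/parisc/Kconfig
        config MLONGCALLS
                def_bool y
        
        # arch/parisc/Makefile (pre-existing)
        cflags-$(CONFIG_MLONGCALLS) += -mlong-calls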
      
      Cc: stable@vger.kernel.org # 4.0+
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9ffedb10
  2. 09 Aug, 2018 1 commit