1. 02 Mar, 2018 1 commit
    • John David Anglin's avatar
      parisc: Fix ordering of cache and TLB flushes · 0adb24e0
      John David Anglin authored
      The change to flush_kernel_vmap_range() wasn't sufficient to avoid the
      SMP stalls.  The problem is some drivers call these routines with
      interrupts disabled.  Interrupts need to be enabled for flush_tlb_all()
      and flush_cache_all() to work.  This version adds checks to ensure
      interrupts are not disabled before calling routines that need IPI
      interrupts.  When interrupts are disabled, we now drop into slower code.
      
      The attached change fixes the ordering of cache and TLB flushes in
      several cases.  When we flush the cache using the existing PTE/TLB
      entries, we need to flush the TLB after doing the cache flush.  We don't
      need to do this when we flush the entire instruction and data caches as
      these flushes don't use the existing TLB entries.  The same is true for
      tmpalias region flushes.
      
      The flush_kernel_vmap_range() and invalidate_kernel_vmap_range()
      routines have been updated.
      
      Secondly, we added a new purge_kernel_dcache_range_asm() routine to
      pacache.S and use it in invalidate_kernel_vmap_range().  Nominally,
      purges are faster than flushes as the cache lines don't have to be
      written back to memory.
      
      Hopefully, this is sufficient to resolve the remaining problems due to
      cache speculation.  So far, testing indicates that this is the case.  I
      did work up a patch using tmpalias flushes, but there is a performance
      hit because we need the physical address for each page, and we also need
      to sequence access to the tmpalias flush code.  This increases the
      probability of stalls.
      
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Cc: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      0adb24e0
  2. 28 Jan, 2018 8 commits
  3. 27 Jan, 2018 2 commits
    • Thomas Gleixner's avatar
      hrtimer: Reset hrtimer cpu base proper on CPU hotplug · d5421ea4
      Thomas Gleixner authored
      The hrtimer interrupt code contains a hang detection and mitigation
      mechanism, which prevents that a long delayed hrtimer interrupt causes a
      continous retriggering of interrupts which prevent the system from making
      progress. If a hang is detected then the timer hardware is programmed with
      a certain delay into the future and a flag is set in the hrtimer cpu base
      which prevents newly enqueued timers from reprogramming the timer hardware
      prior to the chosen delay. The subsequent hrtimer interrupt after the delay
      clears the flag and resumes normal operation.
      
      If such a hang happens in the last hrtimer interrupt before a CPU is
      unplugged then the hang_detected flag is set and stays that way when the
      CPU is plugged in again. At that point the timer hardware is not armed and
      it cannot be armed because the hang_detected flag is still active, so
      nothing clears that flag. As a consequence the CPU does not receive hrtimer
      interrupts and no timers expire on that CPU which results in RCU stalls and
      other malfunctions.
      
      Clear the flag along with some other less critical members of the hrtimer
      cpu base to ensure starting from a clean state when a CPU is plugged in.
      
      Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
      root cause of that hard to reproduce heisenbug. Once understood it's
      trivial and certainly justifies a brown paperbag.
      
      Fixes: 41d2e494 ("hrtimer: Tune hrtimer_interrupt hang logic")
      Reported-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Sewior <bigeasy@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
      d5421ea4
    • H. Peter Anvin's avatar
      x86: Mark hpa as a "Designated Reviewer" for the time being · 8a95b74d
      H. Peter Anvin authored
      Due to some unfortunate events, I have not been directly involved in
      the x86 kernel patch flow for a while now.  I have also not been able
      to ramp back up by now like I had hoped to, and after reviewing what I
      will need to work on both internally at Intel and elsewhere in the near
      term, it is clear that I am not going to be able to ramp back up until
      late 2018 at the very earliest.
      
      It is not acceptable to not recognize that this load is currently
      taken by Ingo and Thomas without my direct participation, so I mark
      myself as R: (designated reviewer) rather than M: (maintainer) until
      further notice.  This is in fact recognizing the de facto situation
      for the past few years.
      
      I have obviously no intention of going away, and I will do everything
      within my power to improve Linux on x86 and x86 for Linux.  This,
      however, puts credit where it is due and reflects a change of focus.
      
      This patch also removes stale entries for portions of the x86
      architecture which have not been maintained separately from arch/x86
      for a long time.  If there is a reason to re-introduce them then that
      can happen later.
      Signed-off-by: default avatarH. Peter Anvin <h.peter.anvin@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Bruce Schlobohm <bruce.schlobohm@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20180125195934.5253-1-hpa@zytor.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8a95b74d
  4. 26 Jan, 2018 13 commits
  5. 25 Jan, 2018 11 commits
    • Lyude Paul's avatar
      drm/nouveau: Move irq setup/teardown to pci ctor/dtor · 0fd189a9
      Lyude Paul authored
      For a while we've been having issues with seemingly random interrupts
      coming from nvidia cards when resuming them. Originally the fix for this
      was thought to be just re-arming the MSI interrupt registers right after
      re-allocating our IRQs, however it seems a lot of what we do is both
      wrong and not even nessecary.
      
      This was made apparent by what appeared to be a regression in the
      mainline kernel that started introducing suspend/resume issues for
      nouveau:
      
              a0c9259d (irq/matrix: Spread interrupts on allocation)
      
      After this commit was introduced, we started getting interrupts from the
      GPU before we actually re-allocated our own IRQ (see references below)
      and assigned the IRQ handler. Investigating this turned out that the
      problem was not with the commit, but the fact that nouveau even
      free/allocates it's irqs before and after suspend/resume.
      
      For starters: drivers in the linux kernel haven't had to handle
      freeing/re-allocating their IRQs during suspend/resume cycles for quite
      a while now. Nouveau seems to be one of the few drivers left that still
      does this, despite the fact there's no reason we actually need to since
      disabling interrupts from the device side should be enough, as the
      kernel is already smart enough to know to disable host-side interrupts
      for us before going into suspend. Since we were tearing down our IRQs by
      hand however, that means there was a short period during resume where
      interrupts could be received before we re-allocated our IRQ which would
      lead to us getting an unhandled IRQ. Since we never handle said IRQ and
      re-arm the interrupt registers, this would cause us to miss all of the
      interrupts from the GPU and cause our init process to start timing out
      on anything requiring interrupts.
      
      So, since this whole setup/teardown every suspend/resume cycle is
      useless anyway, move irq setup/teardown into the pci subdev's ctor/dtor
      functions instead so they're only called at driver load and driver
      unload. This should fix most of the issues with pending interrupts on
      resume, along with getting suspend/resume for nouveau to work again.
      
      As well, this probably means we can also just remove the msi rearm call
      inside nvkm_pci_init(). But since our main focus here is to fix
      suspend/resume before 4.15, we'll save that for a later patch.
      Signed-off-by: default avatarLyude Paul <lyude@redhat.com>
      Cc: Karol Herbst <kherbst@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      0fd189a9
    • Nicolas Dichtel's avatar
      net: don't call update_pmtu unconditionally · f15ca723
      Nicolas Dichtel authored
      Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
      "BUG: unable to handle kernel NULL pointer dereference at           (null)"
      
      Let's add a helper to check if update_pmtu is available before calling it.
      
      Fixes: 52a589d5 ("geneve: update skb dst pmtu on tx path")
      Fixes: a93bf0ff ("vxlan: update skb dst pmtu on tx path")
      CC: Roman Kapl <code@rkapl.cz>
      CC: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f15ca723
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 6e20630e
      Linus Torvalds authored
      Pull KVM fixes from Radim Krčmář:
       "Fix races and a potential use after free in the s390 cmma migration
        code"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: s390: add proper locking for CMMA migration bitmap
      6e20630e
    • Linus Torvalds's avatar
      Merge tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 525273fb
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "It's been reported recently that readdir can list stale entries under
        some conditions. Fix it."
      
      * tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        Btrfs: fix stale entries in readdir
      525273fb
    • Dan Streetman's avatar
      net: tcp: close sock if net namespace is exiting · 4ee806d5
      Dan Streetman authored
      When a tcp socket is closed, if it detects that its net namespace is
      exiting, close immediately and do not wait for FIN sequence.
      
      For normal sockets, a reference is taken to their net namespace, so it will
      never exit while the socket is open.  However, kernel sockets do not take a
      reference to their net namespace, so it may begin exiting while the kernel
      socket is still open.  In this case if the kernel socket is a tcp socket,
      it will stay open trying to complete its close sequence.  The sock's dst(s)
      hold a reference to their interface, which are all transferred to the
      namespace's loopback interface when the real interfaces are taken down.
      When the namespace tries to take down its loopback interface, it hangs
      waiting for all references to the loopback interface to release, which
      results in messages like:
      
      unregister_netdevice: waiting for lo to become free. Usage count = 1
      
      These messages continue until the socket finally times out and closes.
      Since the net namespace cleanup holds the net_mutex while calling its
      registered pernet callbacks, any new net namespace initialization is
      blocked until the current net namespace finishes exiting.
      
      After this change, the tcp socket notices the exiting net namespace, and
      closes immediately, releasing its dst(s) and their reference to the
      loopback interface, which lets the net namespace continue exiting.
      
      Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811Signed-off-by: default avatarDan Streetman <ddstreet@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4ee806d5
    • Peter Zijlstra's avatar
      perf/x86: Fix perf,x86,cpuhp deadlock · efe951d3
      Peter Zijlstra authored
      More lockdep gifts, a 5-way lockup race:
      
      	perf_event_create_kernel_counter()
      	  perf_event_alloc()
      	    perf_try_init_event()
      	      x86_pmu_event_init()
      		__x86_pmu_event_init()
      		  x86_reserve_hardware()
       #0		    mutex_lock(&pmc_reserve_mutex);
      		    reserve_ds_buffer()
       #1		      get_online_cpus()
      
      	perf_event_release_kernel()
      	  _free_event()
      	    hw_perf_event_destroy()
      	      x86_release_hardware()
       #0		mutex_lock(&pmc_reserve_mutex)
      		release_ds_buffer()
       #1		  get_online_cpus()
      
       #1	do_cpu_up()
      	  perf_event_init_cpu()
       #2	    mutex_lock(&pmus_lock)
       #3	    mutex_lock(&ctx->mutex)
      
      	sys_perf_event_open()
      	  mutex_lock_double()
       #3	    mutex_lock(ctx->mutex)
       #4	    mutex_lock_nested(ctx->mutex, 1);
      
      	perf_try_init_event()
       #4	  mutex_lock_nested(ctx->mutex, 1)
      	  x86_pmu_event_init()
      	    intel_pmu_hw_config()
      	      x86_add_exclusive()
       #0		mutex_lock(&pmc_reserve_mutex)
      
      Fix it by using ordering constructs instead of locking.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      efe951d3
    • Peter Zijlstra's avatar
      perf/core: Fix ctx::mutex deadlock · 0c7296ca
      Peter Zijlstra authored
      Lockdep noticed the following 3-way lockup scenario:
      
      	sys_perf_event_open()
      	  perf_event_alloc()
      	    perf_try_init_event()
       #0	      ctx = perf_event_ctx_lock_nested(1)
      	      perf_swevent_init()
      		swevent_hlist_get()
       #1		  mutex_lock(&pmus_lock)
      
      	perf_event_init_cpu()
       #1	  mutex_lock(&pmus_lock)
       #2	  mutex_lock(&ctx->mutex)
      
      	sys_perf_event_open()
      	  mutex_lock_double()
       #2	   mutex_lock()
       #0	   mutex_lock_nested()
      
      And while we need that perf_event_ctx_lock_nested() for HW PMUs such
      that they can iterate the sibling list, trying to match it to the
      available counters, the software PMUs need do no such thing. Exclude
      them.
      
      In particular the swevent triggers the above invertion, while the
      tpevent PMU triggers a more elaborate one through their event_mutex.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0c7296ca
    • Peter Zijlstra's avatar
      perf/core: Fix another perf,trace,cpuhp lock inversion · 43fa87f7
      Peter Zijlstra authored
      Lockdep noticed the following 3-way lockup race:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1              mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                  static_key_enable()
      
       #2	do_cpu_up()
      	  perf_event_init_cpu()
       #3	    mutex_lock(&pmus_lock)
       #4	    mutex_lock(&ctx->mutex)
      
      	perf_ioctl()
       #4	  ctx = perf_event_ctx_lock()
      	  _perf_iotcl()
      	    ftrace_profile_set_filter()
       #0	      mutex_lock(&event_mutex)
      
      Fudge it for now by noting that the tracepoint state does not depend
      on the event <-> context relation. Ugly though :/
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      43fa87f7
    • Peter Zijlstra's avatar
      perf/core: Fix lock inversion between perf,trace,cpuhp · 82d94856
      Peter Zijlstra authored
      Lockdep gifted us with noticing the following 4-way lockup scenario:
      
              perf_trace_init()
       #0       mutex_lock(&event_mutex)
                perf_trace_event_init()
                  perf_trace_event_reg()
                    tp_event->class->reg() := tracepoint_probe_register
       #1             mutex_lock(&tracepoints_mutex)
                        trace_point_add_func()
       #2                 static_key_enable()
      
       #2     do_cpu_up()
                perf_event_init_cpu()
       #3         mutex_lock(&pmus_lock)
       #4         mutex_lock(&ctx->mutex)
      
              perf_event_task_disable()
                mutex_lock(&current->perf_event_mutex)
       #4       ctx = perf_event_ctx_lock()
       #5       perf_event_for_each_child()
      
              do_exit()
                task_work_run()
                  __fput()
                    perf_release()
                      perf_event_release_kernel()
       #4               mutex_lock(&ctx->mutex)
       #5               mutex_lock(&event->child_mutex)
                        free_event()
                          _free_event()
                            event->destroy() := perf_trace_destroy
       #0                     mutex_lock(&event_mutex);
      
      Fix that by moving the free_event() out from under the locks.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      82d94856
    • Dave Airlie's avatar
      Merge tag 'drm-misc-fixes-2018-01-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes · 7e3f8e91
      Dave Airlie authored
      Two vc4 fixes that were applied in the last day.
      One fixes a NULL dereference, and the other fixes
      a flickering bug.
      
      Cc: Eric Anholt <eric@anholt.net>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      
      * tag 'drm-misc-fixes-2018-01-24' of git://anongit.freedesktop.org/drm/drm-misc:
        drm/vc4: Fix NULL pointer dereference in vc4_save_hang_state()
        drm/vc4: Flush the caches before the bin jobs, as well.
      7e3f8e91
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 5b7d2796
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Avoid negative netdev refcount in error flow of xfrm state add, from
          Aviad Yehezkel.
      
       2) Fix tcpdump decoding of IPSEC decap'd frames by filling in the
          ethernet header protocol field in xfrm{4,6}_mode_tunnel_input().
          From Yossi Kuperman.
      
       3) Fix a syzbot triggered skb_under_panic in pppoe having to do with
          failing to allocate an appropriate amount of headroom. From
          Guillaume Nault.
      
       4) Fix memory leak in vmxnet3 driver, from Neil Horman.
      
       5) Cure out-of-bounds packet memory access in em_nbyte EMATCH module,
          from Wolfgang Bumiller.
      
       6) Restrict what kinds of sockets can be bound to the KCM multiplexer
          and also disallow when another layer has attached to the socket and
          made use of sk_user_data. From Tom Herbert.
      
       7) Fix use before init of IOTLB in vhost code, from Jason Wang.
      
       8) Correct STACR register write bit definition in IBM emac driver, from
          Ivan Mikhaylov.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net/ibm/emac: wrong bit is used for STA control register write
        net/ibm/emac: add 8192 rx/tx fifo size
        vhost: do not try to access device IOTLB when not initialized
        vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
        i40e: flower: check if TC offload is enabled on a netdev
        qed: Free reserved MR tid
        qed: Remove reserveration of dpi for kernel
        kcm: Check if sk_user_data already set in kcm_attach
        kcm: Only allow TCP sockets to be attached to a KCM mux
        net: sched: fix TCF_LAYER_LINK case in tcf_get_base_ptr
        net: sched: em_nbyte: don't add the data offset twice
        mlxsw: spectrum_router: Don't log an error on missing neighbor
        vmxnet3: repair memory leak
        ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL
        pppoe: take ->needed_headroom of lower device into account on xmit
        xfrm: fix boolean assignment in xfrm_get_type_offload
        xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version
        xfrm: fix error flow in case of add state fails
        xfrm: Add SA to hardware at the end of xfrm_state_construct()
      5b7d2796
  6. 24 Jan, 2018 5 commits