Commits · 612b72862b4dd7f3f5e42651522daac6733b8ea6 · nexedi / linux

18 Feb, 2019 4 commits

nvme-pci: Simplify interrupt allocation · 612b7286

Ming Lei authored Feb 16, 2019

The NVME PCI driver contains a tedious mechanism for interrupt
allocation, which is necessary to adjust the number and size of interrupt
sets to the maximum available number of interrupts which depends on the
underlying PCI capabilities and the available CPU resources.

It works around the former short comings of the PCI and core interrupt
allocation mechanims in combination with interrupt sets.

The PCI interrupt allocation function allows to provide a maximum and a
minimum number of interrupts to be allocated and tries to allocate as
many as possible. This worked without driver interaction as long as there
was only a single set of interrupts to handle.

With the addition of support for multiple interrupt sets in the generic
affinity spreading logic, which is invoked from the PCI interrupt
allocation, the adaptive loop in the PCI interrupt allocation did not
work for multiple interrupt sets. The reason is that depending on the
total number of interrupts which the PCI allocation adaptive loop tries
to allocate in each step, the number and the size of the interrupt sets
need to be adapted as well. Due to the way the interrupt sets support was
implemented there was no way for the PCI interrupt allocation code or the
core affinity spreading mechanism to invoke a driver specific function
for adapting the interrupt sets configuration.

As a consequence the driver had to implement another adaptive loop around
the PCI interrupt allocation function and calling that with maximum and
minimum interrupts set to the same value. This ensured that the
allocation either succeeded or immediately failed without any attempt to
adjust the number of interrupts in the PCI code.

The core code now allows drivers to provide a callback to recalculate the
number and the size of interrupt sets during PCI interrupt allocation,
which in turn allows the PCI interrupt allocation function to be called
in the same way as with a single set of interrupts. The PCI code handles
the adaptive loop and the interrupt affinity spreading mechanism invokes
the driver callback to adapt the interrupt set configuration to the
current loop value. This replaces the adaptive loop in the driver
completely.

Implement the NVME specific callback which adjusts the interrupt sets
configuration and remove the adaptive allocation loop.

[ tglx: Simplify the callback further and restore the dropped adjustment of
  	number of sets ]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Cc: Keith Busch <keith.busch@intel.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
Link: https://lkml.kernel.org/r/20190216172228.602546658@linutronix.de

612b7286

genirq/affinity: Add new callback for (re)calculating interrupt sets · c66d4bd1

Ming Lei authored Feb 16, 2019

The interrupt affinity spreading mechanism supports to spread out
affinities for one or more interrupt sets. A interrupt set contains one or
more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queus of multiqueue block
devices.

The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilites and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.

The driver passes initial configuration for the interrupt allocation via a
pointer to struct irq_affinity.

Right now the allocation mechanism is complex as it requires to have a loop
in the driver to determine the maximum number of interrupts which are
provided by the PCI capabilities and the underlying CPU resources.  This
loop would have to be replicated in every driver which wants to utilize
this mechanism. That's unwanted code duplication and error prone.

In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and their
size, in the core code. As the core code does not have any knowledge about the
underlying device, a driver specific callback is required in struct
irq_affinity, which can be invoked by the core code. The callback gets the
number of available interupts as an argument, so the driver can calculate the
corresponding number and size of interrupt sets.

At the moment the struct irq_affinity pointer which is handed in from the
driver and passed through to several core functions is marked 'const', but for
the callback to be able to modify the data in the struct it's required to
remove the 'const' qualifier.

Add the optional callback to struct irq_affinity, which allows drivers to
recalculate the number and size of interrupt sets and remove the 'const'
qualifier.

For simple invocations, which do not supply a callback, a default callback
is installed, which just sets nr_sets to 1 and transfers the number of
spreadable vectors to the set_size array at index 0.

This is for now guarded by a check for nr_sets != 0 to keep the NVME driver
working until it is converted to the callback mechanism.

To make sure that the driver configuration is correct under all circumstances
the callback is invoked even when there are no interrupts for queues left,
i.e. the pre/post requirements already exhaust the numner of available
interrupts.

At the PCI layer irq_create_affinity_masks() has to be invoked even for the
case where the legacy interrupt is used. That ensures that the callback is
invoked and the device driver can adjust to that situation.

[ tglx: Fixed the simple case (no sets required). Moved the sanity check
  	for nr_sets after the invocation of the callback so it catches
  	broken drivers. Fixed the kernel doc comments for struct
  	irq_affinity and de-'This patch'-ed the changelog ]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Cc: Keith Busch <keith.busch@intel.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de

c66d4bd1

genirq/affinity: Store interrupt sets size in struct irq_affinity · 9cfef55b

Ming Lei authored Feb 16, 2019

The interrupt affinity spreading mechanism supports to spread out
affinities for one or more interrupt sets. A interrupt set contains one
or more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queus of multiqueue block
devices.

The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilites and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.

The driver passes initial configuration for the interrupt allocation via
a pointer to struct irq_affinity.

Right now the allocation mechanism is complex as it requires to have a
loop in the driver to determine the maximum number of interrupts which
are provided by the PCI capabilities and the underlying CPU resources.
This loop would have to be replicated in every driver which wants to
utilize this mechanism. That's unwanted code duplication and error
prone.

In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and
their size, in the core code. As the core code does not have any
knowledge about the underlying device, a driver specific callback will
be added to struct affinity_desc, which will be invoked by the core
code. The callback will get the number of available interupts as an
argument, so the driver can calculate the corresponding number and size
of interrupt sets.

To support this, two modifications for the handling of struct irq_affinity
are required:

1) The (optional) interrupt sets size information is contained in a
   separate array of integers and struct irq_affinity contains a
   pointer to it.

   This is cumbersome and as the maximum number of interrupt sets is small,
   there is no reason to have separate storage. Moving the size array into
   struct affinity_desc avoids indirections and makes the code simpler.

2) At the moment the struct irq_affinity pointer which is handed in from
   the driver and passed through to several core functions is marked
   'const'.

   With the upcoming callback to recalculate the number and size of
   interrupt sets, it's necessary to remove the 'const'
   qualifier. Otherwise the callback would not be able to update the data.

Implement #1 and store the interrupt sets size in 'struct irq_affinity'.

No functional change.

[ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
  	source. Fixed the kernel doc comments for struct irq_affinity and
  	de-'This patch'-ed the changelog ]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Cc: Keith Busch <keith.busch@intel.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de

9cfef55b

genirq/affinity: Code consolidation · 0145c30e

Thomas Gleixner authored Feb 16, 2019

All information and calculations in the interrupt affinity spreading code
is strictly unsigned int. Though the code uses int all over the place.

Convert it over to unsigned int.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Cc: Keith Busch <keith.busch@intel.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
Link: https://lkml.kernel.org/r/20190216172228.336424556@linutronix.de

0145c30e

14 Feb, 2019 1 commit
- Merge branch 'linus' into irq/core · d869f866
  Thomas Gleixner authored Feb 14, 2019
```
Pick up upstream changes to avoid conflicts for pending patches.
```
  d869f866
13 Feb, 2019 1 commit

genirq: Add missing documentation for tot_count · 030fc443

Waiman Long authored Feb 12, 2019

Commit:

  1136b072 ("genirq: Avoid summation loops for /proc/stat")

adds a new irq_desc::tot_count field, without documenting it.
Add the missing piece of documentation.
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1549983253-19107-1-git-send-email-longman@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

030fc443

10 Feb, 2019 5 commits

softirq: Don't skip softirq execution when softirq thread is parking · 1342d808

Matthias Kaehlcke authored Jan 28, 2019

When a CPU is unplugged the kernel threads of this CPU are parked (see
smpboot_park_threads()). kthread_park() is used to mark each thread as
parked and wake it up, so it can complete the process of parking itselfs
(see smpboot_thread_fn()).

If local softirqs are pending on interrupt exit invoke_softirq() is called
to process the softirqs, however it skips processing when the softirq
kernel thread of the local CPU is scheduled to run. The softirq kthread is
one of the threads that is parked when a CPU is unplugged. Parking the
kthread wakes it up, however only to complete the parking process, not to
process the pending softirqs. Hence processing of softirqs at the end of an
interrupt is skipped, but not done elsewhere, which can result in warnings
about pending softirqs when a CPU is unplugged:

/sys/devices/system/cpu # echo 0 > cpu4/online
[ ... ] NOHZ: local_softirq_pending 02
[ ... ] NOHZ: local_softirq_pending 202
[ ... ] CPU4: shutdown
[ ... ] psci: CPU4 killed.

Don't skip processing of softirqs at the end of an interrupt when the
softirq thread of the CPU is parking.
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Stephen Boyd <swboyd@chromium.org>
Link: https://lkml.kernel.org/r/20190128234625.78241-3-mka@chromium.org

1342d808

kthread: Add __kthread_should_park() · 0121805d

Matthias Kaehlcke authored Jan 28, 2019

kthread_should_park() is used to check if the calling kthread ('current')
should park, but there is no function to check whether an arbitrary kthread
should be parked. The latter is required to plug a CPU hotplug race vs. a
parking ksoftirqd thread.

The new __kthread_should_park() receives a task_struct as parameter to
check if the corresponding kernel thread should be parked.

Call __kthread_should_park() from kthread_should_park() to avoid code
duplication.
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Stephen Boyd <swboyd@chromium.org>
Link: https://lkml.kernel.org/r/20190128234625.78241-2-mka@chromium.org

0121805d

proc/stat: Make the interrupt statistics more efficient · c2da3f1b

Thomas Gleixner authored Feb 08, 2019

Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.

The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.

The interrupt core provides now a per interrupt summary counter which can
be used to avoid the summation loops completely except for interrupts
marked PER_CPU which are only a small fraction of the interrupt space if at
all.

Another simplification is to iterate only over the active interrupts and
skip the potentially large gaps in the interrupt number space and just
print zeros for the gaps without going into the interrupt core in the first
place.

Waiman provided test results from a 4-socket IvyBridge-EX system (60-core
120-thread, 3016 irqs) excuting a test program which reads /proc/stat
50,000 times:

   Before: 18.436s (sys 18.380s)
   After:   3.769s (sys  3.742s)
Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20190208135021.013828701@linutronix.de

c2da3f1b

genirq: Avoid summation loops for /proc/stat · 1136b072

Thomas Gleixner authored Feb 08, 2019

Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.

The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.

This can be largely avoided for interrupts which are not marked as
'PER_CPU' interrupts by simply adding a per interrupt summation counter
which is incremented along with the per interrupt per cpu counter.

The PER_CPU interrupts need to avoid that and use only per cpu accounting
because they share the interrupt number and the interrupt descriptor and
concurrent updates would conflict or require unwanted synchronization.
Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de


8<-------------

v2: Undo the unintentional layout change of struct irq_desc.

 include/linux/irqdesc.h |    1 +
 kernel/irq/chip.c       |   12 ++++++++++--
 kernel/irq/internals.h  |    8 +++++++-
 kernel/irq/irqdesc.c    |    7 ++++++-
 4 files changed, 24 insertions(+), 4 deletions(-)

1136b072

genirq/affinity: Move allocation of 'node_to_cpumask' to irq_build_affinity_masks() · 347253c4

Ming Lei authored Jan 25, 2019

'node_to_cpumask' is just one temparay variable for irq_build_affinity_masks(),
so move it into irq_build_affinity_masks().

No functioanl change.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Link: https://lkml.kernel.org/r/20190125095347.17950-2-ming.lei@redhat.com

347253c4

07 Feb, 2019 13 commits

Merge tag 'platform-drivers-x86-v5.0-2' of git://git.infradead.org/linux-platform-drivers-x86 · 74e96711

Linus Torvalds authored Feb 07, 2019

Pull x86 platform driver fixlet from Darren Hart:
 "Correct Documentation/ABI 4.21 KernelVersion to 5.0"

* tag 'platform-drivers-x86-v5.0-2' of git://git.infradead.org/linux-platform-drivers-x86:
  Documentation/ABI: Correct mlxreg-io KernelVersion for 5.0

74e96711

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · e303a067

Linus Torvalds authored Feb 07, 2019

Pull KVM fixes from Paolo Bonzini:
 "Three security fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
  KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
  kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)

e303a067

Merge tag 'nfsd-5.0-1' of git://linux-nfs.org/~bfields/linux · ee6c0737

Linus Torvalds authored Feb 07, 2019

Pull nfsd fixes from Bruce Fields:
 "Two small nfsd bugfixes for 5.0, for an RDMA bug and a file clone bug"

* tag 'nfsd-5.0-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: Remove max_sge check at connect time
  nfsd: Fix error return values for nfsd4_clone_file_range()

ee6c0737

Merge tag 'for-5.0/dm-fixes-2' of... · 8b5cdbe5

Linus Torvalds authored Feb 07, 2019

Merge tag 'for-5.0/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:
 "Both of these fixes address issues in changes merged for 5.0-rc4:

   - Fix DM core's missing memory barrier before waitqueue_active()
     calls.

   - Fix DM core's clone_bio() to work when cloning a subset of a bio
     with an integrity payload; bio_integrity_trim() wasn't getting
     called due to bio_trim()'s early return"

* tag 'for-5.0/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: don't use bio_trim() afterall
  dm: add memory barrier before waitqueue_active

8b5cdbe5

KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221) · ecec7688

Peter Shier authored Oct 11, 2018

Bugzilla: 1671904

There are multiple code paths where an hrtimer may have been started to
emulate an L1 VMX preemption timer that can result in a call to free_nested
without an intervening L2 exit where the hrtimer is normally
cancelled. Unconditionally cancel in free_nested to cover all cases.

Embargoed until Feb 7th 2019.
Signed-off-by: Peter Shier <pshier@google.com>
Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Cc: stable@kernel.org
Message-Id: <20181011184646.154065-1-pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ecec7688

KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222) · 353c0956

Paolo Bonzini authored Jan 29, 2019

Bugzilla: 1671930

Emulation of certain instructions (VMXON, VMCLEAR, VMPTRLD, VMWRITE with
memory operand, INVEPT, INVVPID) can incorrectly inject a page fault
when passed an operand that points to an MMIO address.  The page fault
will use uninitialized kernel stack memory as the CR2 and error code.

The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR
exit to userspace; however, it is not an easy fix, so for now just
ensure that the error code and CR2 are zero.

Embargoed until Feb 7th 2019.
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Cc: stable@kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

353c0956

kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) · cfa39381

Jann Horn authored Jan 26, 2019

kvm_ioctl_create_device() does the following:

1. creates a device that holds a reference to the VM object (with a borrowed
   reference, the VM's refcount has not been bumped yet)
2. initializes the device
3. transfers the reference to the device to the caller's file descriptor table
4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
   reference

The ownership transfer in step 3 must not happen before the reference to the VM
becomes a proper, non-borrowed reference, which only happens in step 4.
After step 3, an attacker can close the file descriptor and drop the borrowed
reference, which can cause the refcount of the kvm object to drop to zero.

This means that we need to grab a reference for the device before
anon_inode_getfd(), otherwise the VM can disappear from under us.

Fixes: 852b6d57 ("kvm: add device control API")
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

cfa39381

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · d47e3da1

Linus Torvalds authored Feb 07, 2019

Pull HID fix from Jiri Kosina:
 "A fix for a bug in hid-debug that can lock up the kernel in infinite
  loop (CVE-2019-3819), from Vladis Dronov"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: debug: fix the ring buffer implementation

d47e3da1

Merge tag 'sound-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 6f64e3a4

Linus Torvalds authored Feb 07, 2019

Pull sound fixes from Takashi Iwai:
 "A collection of a few small fixes.

  The most significant one is the fix for the possible race at loading
  HD-audio drivers. This has been present for long time and surfaced
  only in a rare occasion, but finally spotted out"

* tag 'sound-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/ca0132 - Fix build error without CONFIG_PCI
  ALSA: compress: Fix stop handling on compressed capture streams
  ALSA: usb-audio: Add support for new T+A USB DAC
  ALSA: hda - Serialize codec registrations
  ALSA: hda/realtek - Use a common helper for hp pin reference
  ALSA: hda/realtek - Fix lose hp_pins for disable auto mute
  ALSA: hda/realtek - Headset microphone support for System76 darp5

6f64e3a4

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost · b0314565

Linus Torvalds authored Feb 07, 2019

Pull virtio fixes from Michael Tsirkin:
 "A small fix for a uapi header, and a fix for VDPA for non-x86 guests"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio: drop internal struct from UAPI
  virtio: support VIRTIO_F_ORDER_PLATFORM

b0314565

Merge tag 'trace-v5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 4879f116

Linus Torvalds authored Feb 07, 2019

Pull tracing fixes from Steven Rostedt:
 "This has two fixes for uprobe code.

   - Cut and paste fix to have uprobe printks say "uprobe" and not
     "kprobe"

   - Add terminating '\0' byte when copying function arguments"

* tag 'trace-v5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/uprobes: Fix output for multiple string arguments
  tracing: uprobes: Fix typo in pr_fmt string

4879f116

Merge tag 'fuse-fixes-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse · 076a3f55

Linus Torvalds authored Feb 07, 2019

Pull fuse fixes from Miklos Szeredi:
 "A fix for a CUSE regression introduced in v4.20, as well as fixes for
  a couple of old bugs"

* tag 'fuse-fixes-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: decrement NR_WRITEBACK_TEMP on the right page
  fuse: call pipe_buf_release() under pipe lock
  cuse: fix ioctl
  fuse: handle zero sized retrieve correctly

076a3f55

Merge tag 'pinctrl-v5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · b66bc777

Linus Torvalds authored Feb 07, 2019

Pull pin control fixes from Linus Walleij:

 - Mediatek Kconfig fix

 - Sunxi regulator, IRQ banks and pin base fixup

 - Intel Cherryview Strago DMI workaround

 - Potential regmap problem on mcp23s08

* tag 'pinctrl-v5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: sunxi: Correct number of IRQ banks on H6 main pin controller
  pinctrl: mcp23s08: spi: Fix regmap allocation for mcp23s18
  pinctrl: cherryview: fix Strago DMI workaround
  pinctrl: sunxi: Consider pin_base when calculating regulator array index
  pinctrl: sunxi: Fix and simplify pin bank regulator handling
  pinctrl: mediatek: fix Kconfig build errors for moore core

b66bc777

06 Feb, 2019 5 commits

dm: don't use bio_trim() afterall · fa8db494

Mike Snitzer authored Feb 05, 2019

bio_trim() has an early return, which makes it _not_ idempotent, if the
offset is 0 and the bio's bi_size already matches the requested size.
Prior to DM, all users of bio_trim() were fine with this.  But DM has
exposed the fact that bio_trim()'s early return is incompatible with a
cloned bio whose integrity payload must be trimmed via
bio_integrity_trim().

Fix this by reverting DM back to doing the equivalent of bio_trim() but
in an idempotent manner (so bio_integrity_trim is always performed).

Follow-on work is needed to assess what benefit bio_trim()'s early
return is providing to its existing callers.
Reported-by: Milan Broz <gmazyland@gmail.com>
Fixes: 57c36519 ("dm: fix clone_bio() to trigger blk_recount_segments()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

fa8db494

dm: add memory barrier before waitqueue_active · 645efa84

Mikulas Patocka authored Feb 05, 2019

Block core changes to switch bio-based IO accounting to be percpu had a
side-effect of altering DM core to now rely on calling waitqueue_active
(in both bio-based and request-based) to check if another task is in
dm_wait_for_completion().

A memory barrier is needed before calling waitqueue_active().  DM core
doesn't piggyback on a preceding memory barrier so it must explicitly
use its own.

For more details on why using waitqueue_active() without a preceding
barrier is unsafe, please see the comment before the waitqueue_active()
definition in include/linux/wait.h.

Add the missing memory barrier by switching to using wq_has_sleeper().

Fixes: 6f757231 ("dm: remove the pending IO accounting")
Fixes: c4576aed ("dm: fix request-based dm's use of dm_wait_for_completion")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

645efa84

svcrdma: Remove max_sge check at connect time · e248aa7b

Chuck Lever authored Jan 25, 2019

Two and a half years ago, the client was changed to use gathered
Send for larger inline messages, in commit 655fec69 ("xprtrdma:
Use gathered Send for large inline messages"). Several fixes were
required because there are a few in-kernel device drivers whose
max_sge is 3, and these were broken by the change.

Apparently my memory is going, because some time later, I submitted
commit 25fd86ec ("svcrdma: Don't overrun the SGE array in
svc_rdma_send_ctxt"), and after that, commit f3c1fd0e ("svcrdma:
Reduce max_send_sges"). These too incorrectly assumed in-kernel
device drivers would have more than a few Send SGEs available.

The fix for the server side is not the same. This is because the
fundamental problem on the server is that, whether or not the client
has provisioned a chunk for the RPC reply, the server must squeeze
even the most complex RPC replies into a single RDMA Send. Failing
in the send path because of Send SGE exhaustion should never be an
option.

Therefore, instead of failing when the send path runs out of SGEs,
switch to using a bounce buffer mechanism to handle RPC replies that
are too complex for the device to send directly. That allows us to
remove the max_sge check to enable drivers with small max_sge to
work again.
Reported-by: Don Dutile <ddutile@redhat.com>
Fixes: 25fd86ec ("svcrdma: Don't overrun the SGE array in ...")
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

e248aa7b

nfsd: Fix error return values for nfsd4_clone_file_range() · e3fdc89c

Trond Myklebust authored Jan 21, 2019

If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
currently clobber all errors returned by vfs_clone_file_range() and
replace them with EINVAL.

Fixes: 42ec3d4c ("vfs: make remap_file_range functions take and...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

e3fdc89c

ALSA: hda/ca0132 - Fix build error without CONFIG_PCI · c97617a8

Takashi Iwai authored Feb 05, 2019

A call of pci_iounmap() call without CONFIG_PCI leads to a build error
on some architectures. We tried to address this and add a check of
IS_ENABLED(CONFIG_PCI), but this still doesn't seem enough for sh.
Ideally we should fix it globally, it's really a corner case, so let's
paper over it with a simpler ifdef.

Fixes: 1e73359a ("ALSA: hda/ca0132 - make pci_iounmap() call conditional")
Reported-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>

c97617a8

05 Feb, 2019 3 commits

ALSA: compress: Fix stop handling on compressed capture streams · 4f2ab5e1

Charles Keepax authored Feb 05, 2019

It is normal user behaviour to start, stop, then start a stream
again without closing it. Currently this works for compressed
playback streams but not capture ones.

The states on a compressed capture stream go directly from OPEN to
PREPARED, unlike a playback stream which moves to SETUP and waits
for a write of data before moving to PREPARED. Currently however,
when a stop is sent the state is set to SETUP for both types of
streams. This leaves a capture stream in the situation where a new
start can't be sent as that requires the state to be PREPARED and
a new set_params can't be sent as that requires the state to be
OPEN. The only option being to close the stream, and then reopen.

Correct this issues by allowing snd_compr_drain_notify to set the
state depending on the stream direction, as we already do in
set_params.

Fixes: 49bb6402 ("ALSA: compress_core: Add support for capture streams")
Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>

4f2ab5e1

virtio: drop internal struct from UAPI · 9c0644ee

Michael S. Tsirkin authored Feb 01, 2019

There's no reason to expose struct vring_packed in UAPI - if we do we
won't be able to change or drop it, and it's not part of any interface.

Let's move it to virtio_ring.c

Cc: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

9c0644ee

ALSA: usb-audio: Add support for new T+A USB DAC · 3bff2407

Udo Eberhardt authored Feb 05, 2019

This patch adds the T+A VID to the generic check in order to enable
native DSD support for T+A devices. This works with the new T+A USB
DAC model SD3100HV and will also work with future devices which
support the XMOS/Thesycon style DSD format.
Signed-off-by: Udo Eberhardt <udo.eberhardt@thesycon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>

3bff2407

03 Feb, 2019 6 commits

Linux 5.0-rc5 · 8834f560
Linus Torvalds authored Feb 03, 2019

8834f560

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 24b888d8

Linus Torvalds authored Feb 03, 2019

Pull x86 fixes from Thomas Gleixner:
 "A few updates for x86:

   - Fix an unintended sign extension issue in the fault handling code

   - Rename the new resource control config switch so it's less
     confusing

   - Avoid setting up EFI info in kexec when the EFI runtime is
     disabled.

   - Fix the microcode version check in the AMD microcode loader so it
     only loads higher version numbers and never downgrades

   - Set EFER.LME in the 32bit trampoline before returning to long mode
     to handle older AMD/KVM behaviour properly.

   - Add Darren and Andy as x86/platform reviewers"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Avoid confusion over the new X86_RESCTRL config
  x86/kexec: Don't setup EFI info if EFI runtime is not enabled
  x86/microcode/amd: Don't falsely trick the late loading mechanism
  MAINTAINERS: Add Andy and Darren as arch/x86/platform/ reviewers
  x86/fault: Fix sign-extend unintended sign extension
  x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode
  x86/cpu: Add Atom Tremont (Jacobsville)

24b888d8

Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · cc6810e3

Linus Torvalds authored Feb 03, 2019

Pull cpu hotplug fixes from Thomas Gleixner:
 "Two fixes for the cpu hotplug machinery:

   - Replace the overly clever 'SMT disabled by BIOS' detection logic as
     it breaks KVM scenarios and prevents speculation control updates
     when the Hyperthreads are brought online late after boot.

   - Remove a redundant invocation of the speculation control update
     function"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
  x86/speculation: Remove redundant arch_smt_update() invocation

cc6810e3

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 58f6d428

Linus Torvalds authored Feb 03, 2019

Pull perf fixes from Thomas Gleixner:
 "A pile of perf updates:

   - Fix broken sanity check in the /proc/sys/kernel/perf_cpu_time_max_percent
     write handler

   - Cure a perf script crash which caused by an unitinialized data
     structure

   - Highlight the hottest instruction in perf top and not a random one

   - Cure yet another clang issue when building perf python

   - Handle topology entries with no CPU correctly in the tools

   - Handle perf data which contains both tracepoints and performance
     counter entries correctly.

   - Add a missing NULL pointer check in perf ordered_events_free()"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf script: Fix crash when processing recorded stat data
  perf top: Fix wrong hottest instruction highlighted
  perf tools: Handle TOPOLOGY headers with no CPU
  perf python: Remove -fstack-clash-protection when building with some clang versions
  perf core: Fix perf_proc_update_handler() bug
  perf script: Fix crash with printing mixed trace point and other events
  perf ordered_events: Fix crash in ordered_events__free

58f6d428

Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 89401be6

Linus Torvalds authored Feb 03, 2019

Pull EFI fix from Thomas Gleixner:
 "The dump info for the efi page table debugging lacks a terminator
  which causes the kernel to crash when the debugfile is read"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/arm64: Fix debugfs crash by adding a terminator for ptdump marker

89401be6

Merge tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 312b3a93

Linus Torvalds authored Feb 03, 2019

Pull btrfs fixes from David Sterba:

 - regression fix: transaction commit can run away due to delayed ref
   waiting heuristic, this is not necessary now because of the proper
   reservation mechanism introduced in 5.0

 - regression fix: potential crash due to use-before-check of an ERR_PTR
   return value

 - fix for transaction abort during transaction commit that needs to
   properly clean up pending block groups

 - fix deadlock during b-tree node/leaf splitting, when this happens on
   some of the fundamental trees, we must prevent new tree block
   allocation to re-enter indirectly via the block group flushing path

 - potential memory leak after errors during mount

* tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: On error always free subvol_name in btrfs_mount
  btrfs: clean up pending block groups when transaction commit aborts
  btrfs: fix potential oops in device_list_add
  btrfs: don't end the transaction for delayed refs in throttle
  Btrfs: fix deadlock when allocating tree block during leaf/node split

312b3a93

02 Feb, 2019 2 commits

Merge tag 'devicetree-fixes-for-5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · 12491ed3

Linus Torvalds authored Feb 02, 2019

Pull Devicetree fix from Rob Herring:
 "A single fix for building DT bindings in-tree"

* tag 'devicetree-fixes-for-5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: Fix dt_binding_check target for in tree builds

12491ed3

Merge tag 'riscv-for-linus-5.0-rc5' of... · 74b13e7e

Linus Torvalds authored Feb 02, 2019

Merge tag 'riscv-for-linus-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux

Pull RISC-V fixes from Palmer Dabbelt:
 "This contains a handful of mostly-independent patches:

   - make our port respect TIF_NEED_RESCHED, which fixes
     CONFIG_PREEMPT=y kernels

   - fix double-put of OF nodes

   - fix a misspelling of target in our Kconfig

   - generic PCIe is enabled in our defconfig

   - fix our SBI early console to properly handle line
     endings

   - fix max_low_pfn being counted in PFNs

   - a change to TASK_UNMAPPED_BASE to match what other
     arches do

  This has passed my standard 'boot Fedora' flow"

* tag 'riscv-for-linus-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
  riscv: Adjust mmap base address at a third of task size
  riscv: fixup max_low_pfn with PFN_DOWN.
  tty/serial: use uart_console_write in the RISC-V SBL early console
  RISC-V: defconfig: Add CRYPTO_DEV_VIRTIO=y
  RISC-V: defconfig: Enable Generic PCIE by default
  RISC-V: defconfig: Move CONFIG_PCI{,E_XILINX}
  RISC-V: Kconfig: fix spelling mistake "traget" -> "target"
  RISC-V: asm/page.h: fix spelling mistake "CONFIG_64BITS" -> "CONFIG_64BIT"
  RISC-V: fix bad use of of_node_put
  RISC-V: Add _TIF_NEED_RESCHED check for kernel thread when CONFIG_PREEMPT=y

74b13e7e