Commits · 69eea95c48857c9dfcac120d6acea43027627b28 · Kirill Smelkov / linux

27 Nov, 2015 14 commits

s390/pci_dma: fix DMA table corruption with > 4 TB main memory · 69eea95c

Gerald Schaefer authored Nov 16, 2015

DMA addresses returned from map_page() are calculated by using an iommu
bitmap plus a start_dma offset. The size of this bitmap is based on the main
memory size. If we have more than (4 TB - start_dma) main memory, the DMA
address calculation will also produce addresses > 4 TB. Such addresses
cannot be inserted in the 3-level DMA page table; instead, the entries
modulo 4 TB will be overwritten.

Fix this by restricting the iommu bitmap size to (4 TB - start_dma).
Also set zdev->end_dma to the actual end address of the usable
range, instead of the theoretical maximum as reported by the hardware,
which fixes a sanity check in dma_map() and also the IOMMU API domain
geometry aperture calculation.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
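Illustration: a minimal sketch of the size restriction described in this message, in plain user-space C. The 4 TB limit comes from the message; the names DMA_TABLE_LIMIT, iommu_range_size() and usable_end_dma() are hypothetical and not the kernel's, and start_dma is assumed to lie below 4 TB.

```c
#include <stdint.h>

/* Address limit of the 3-level DMA page table, per the commit message. */
#define DMA_TABLE_LIMIT (4ULL << 40) /* 4 TB */

/*
 * Hypothetical helper: size of the address range the iommu bitmap may
 * cover. It is the main memory size, capped so that start_dma + size
 * never exceeds the table limit. Assumes start_dma < DMA_TABLE_LIMIT.
 */
uint64_t iommu_range_size(uint64_t mem_size, uint64_t start_dma)
{
	uint64_t max_size = DMA_TABLE_LIMIT - start_dma;

	return mem_size < max_size ? mem_size : max_size;
}

/* Last usable DMA address: the actual end of the range, not a hardware maximum. */
uint64_t usable_end_dma(uint64_t mem_size, uint64_t start_dma)
{
	return start_dma + iommu_range_size(mem_size, start_dma) - 1;
}
```

With 6 TB of main memory and a start_dma of 4 GB, the range is capped at 4 TB minus 4 GB, so the highest address handed out stays just below 4 TB.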

s390: get_user_pages_fast() might sleep · 40612351

David Hildenbrand authored Oct 15, 2015

Let's annotate it correctly, so that we directly get a warning if
we ever use it in an atomic/preempt_disable/spinlock environment.

Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/spinlock: avoid diagnose loop · db1c4515

Martin Schwidefsky authored Nov 12, 2015

The spinlock implementation calls the diagnose 0x9c / 0x44 immediately
if the SIGP sense running reported the target CPU as not running.

The diagnose 0x9c is a hint to the hypervisor to schedule the target
CPU in preference to the source CPU that issued the diagnose. It can
happen that on return from the diagnose the target CPU has not been
scheduled yet, e.g. if the target logical CPU is on another physical
CPU and the hypervisor did not want to migrate the logical CPU.

Avoid the immediate repeat of the diagnose instruction; instead, run
the retry loop before the next invocation of diagnose 0x9c.

Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
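Illustration: a user-space simulation of the wait loop shape described in this message, not the kernel's spinlock code. Every name here (sim_lock, SPIN_RETRY, sim_yield_hint() and so on) is hypothetical, the yield hint is only counted, and the lock owner is assumed to be reported as not running throughout.

```c
#include <stdbool.h>

#define SPIN_RETRY 1000 /* hypothetical retry budget per round */

/* Simulated lock state; every field is an assumption for illustration. */
struct sim_lock {
	int free_after;  /* lock becomes free after this many polls */
	int polls;       /* polls performed so far */
	int yield_hints; /* stand-in for issued "diagnose 0x9c" hints */
};

/* One poll of the lock word: true once the simulated owner has released it. */
bool sim_lock_is_free(struct sim_lock *l)
{
	return ++l->polls > l->free_after;
}

/* Stand-in for the yield hint; it only counts how often it was issued. */
void sim_yield_hint(struct sim_lock *l)
{
	l->yield_hints++;
}

/* Wait for the lock: one hint, then a full retry loop before the next hint. */
void sim_lock_wait(struct sim_lock *l)
{
	for (;;) {
		sim_yield_hint(l);
		for (int count = SPIN_RETRY; count > 0; count--)
			if (sim_lock_is_free(l))
				return;
	}
}
```

For a lock that frees up after 2500 polls, this loop issues three hints. Issuing the hint again on every failed poll would instead issue one hint per poll.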

s390/dump: cleanup CPU save area handling · 1a2c5840

Martin Schwidefsky authored Oct 29, 2015

Introduce save_area_alloc(), save_area_boot_cpu(), save_area_add_regs()
and save_area_add_vxrs() to deal with storing the CPU state in case of a
system dump. Remove struct save_area and save_area_ext, and create a new
struct save_area as a local definition to arch/s390/kernel/crash_dump.c.
Copy each individual field from the hardware status area to the save area,
storing the minimum of required data.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/dump: rework CPU register dump code · 1a36a39e

Martin Schwidefsky authored Oct 29, 2015

To collect the CPU registers of the crashed system, allocate a single
page with memblock_alloc_base() and use it as a copy buffer. Replace the
stop-and-store-status sigp with a store-status-at-address sigp in
smp_save_dump_cpus() and smp_store_status(). In both cases the target
CPU is already stopped and store-status-at-address avoids the detour
via the absolute zero page.

For kexec simplify s390_reset_system and call store_status() before
the prefix register of the boot CPU has been set to zero. Use STPX
to store the prefix register and remove dump_prefix_page.

Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/dump: remove SAVE_AREA_BASE · f08b8414

Martin Schwidefsky authored Oct 23, 2015

Replace the SAVE_AREA_BASE offset calculations in reipl.S with the
assembler constant for the location of each register status area.

Use __LC_FPREGS_SAVE_AREA instead of SAVE_AREA_BASE in the three
remaining code locations and remove the definition of SAVE_AREA_BASE.

Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/kvm: remove dependency on struct save_area definition · d9a3a09a

Martin Schwidefsky authored Oct 23, 2015

Replace the offsets based on the struct save_area with the offset
constants from asm-offsets.c based on the struct _lowcore.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/zcore: simplify memcpy_hsa · 019d6bec

Martin Schwidefsky authored Oct 12, 2015

Replace the three part copy logic in memcpy_hsa with a single loop
around sclp_sdias_copy with appropriate offset and size calculations,
and inline memcpy_hsa into memcpy_hsa_user and memcpy_hsa_kernel.

Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
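Illustration: a sketch of a single copy loop with per-iteration offset and size calculations, as the message describes. This is a user-space stand-in, not the zcore code: sim_hsa, sim_read_block() and sim_memcpy_hsa() are hypothetical names, and the source is assumed to be readable only in whole 4096-byte blocks.

```c
#include <stddef.h>
#include <string.h>

#define BLK_SIZE 4096 /* assumed block granularity of the source */

/* Simulated block-addressed source standing in for the HSA (assumption). */
unsigned char sim_hsa[4 * BLK_SIZE];

/* Hypothetical stand-in for the block copy primitive: reads one whole block. */
int sim_read_block(unsigned char *buf, size_t blk)
{
	if (blk >= sizeof(sim_hsa) / BLK_SIZE)
		return -1;
	memcpy(buf, sim_hsa + blk * BLK_SIZE, BLK_SIZE);
	return 0;
}

/* Copy count bytes starting at byte offset src, using one loop. */
int sim_memcpy_hsa(unsigned char *dest, size_t src, size_t count)
{
	unsigned char buf[BLK_SIZE];

	while (count) {
		size_t offset = src % BLK_SIZE;   /* start inside this block */
		size_t bytes = BLK_SIZE - offset; /* rest of this block */

		if (bytes > count)
			bytes = count;
		if (sim_read_block(buf, src / BLK_SIZE))
			return -1;
		memcpy(dest, buf + offset, bytes);
		dest += bytes;
		src += bytes;
		count -= bytes;
	}
	return 0;
}
```

The unaligned head, the aligned middle and the partial tail are all handled by the same loop body, which is why no separate three-part logic is needed in this sketch.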

s390/dump: streamline oldmem copy functions · df9694c7

Martin Schwidefsky authored Oct 12, 2015

Introduce two copy functions for the memory of the dumped system,
copy_oldmem_kernel() to copy to the virtual kernel address space
and copy_oldmem_user() to copy to user space.

Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/kdump: remove code to create ELF notes in the crashed system · 8a07dd02

Martin Schwidefsky authored Oct 14, 2015

The s390 architecture can store the CPU registers of the crashed system
after the kdump kernel has been started and this is the preferred way.
Remove the remaining code fragments that deal with storing CPU registers
while the crashed system is still active.

Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/zcore: remove /sys/kernel/debug/zcore/mem · ffa52d02

Martin Schwidefsky authored Oct 28, 2015

New versions of the SCSI dumper use the /dev/vmcore interface instead
of zcore mem. Remove the outdated interface.

Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/zcore: copy vector registers into the image data · bbfed511

Martin Schwidefsky authored Oct 15, 2015

The /sys/kernel/debug/zcore/mem interface delivers the memory of the
old system with the CPU registers stored to the assigned locations in
each prefix page.

For the vector registers the prefix page of each CPU has an address of
a 1024 byte save area at 0x11b0. But the /sys/kernel/debug/zcore/mem
interface fails to copy the vector registers saved at boot of the zfcpdump
kernel into the dump image.

Copy the saved vector registers of a CPU to the output buffer if the
memory area that is read via /sys/kernel/debug/zcore/mem intersects
with the vector register save area of this CPU.

Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
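Illustration: the intersection test mentioned in this message can be sketched as a generic range overlap helper. The name sim_overlap() and its parameters are hypothetical; the addresses in the usage note are made up and are not the real save area locations.

```c
#include <stddef.h>

/*
 * Hypothetical helper: overlap of [a, a + a_len) and [b, b + b_len).
 * Returns the number of overlapping bytes and stores the first
 * overlapping address in *start; returns 0 if the ranges are disjoint.
 */
size_t sim_overlap(size_t a, size_t a_len, size_t b, size_t b_len,
		   size_t *start)
{
	size_t lo = a > b ? a : b;
	size_t end_a = a + a_len;
	size_t end_b = b + b_len;
	size_t hi = end_a < end_b ? end_a : end_b;

	if (lo >= hi)
		return 0;
	*start = lo;
	return hi - lo;
}
```

For a read window of 0x1000 bytes at 0x1000 and a 1024 byte area at 0x1800, the helper reports 0x400 overlapping bytes starting at 0x1800; only that part would need to be patched into the output buffer.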

s390/zcore: remove invalid kfree in init_cpu_info · 4c5b03b6

Martin Schwidefsky authored Oct 09, 2015

The extended save area for the boot CPU has been allocated by
smp_save_dump_cpus() with memblock_alloc() and may not be freed
with kfree().

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390/zcrypt: Fix AP queue handling if queue is full · 2bc53b80

Ingo Tuchscherer authored Nov 27, 2015

When the AP queue depth was reached, additional requests were
ignored. These requests were stuck in the request queue.

The AP queue handling now pushes the next waiting request into the
queue after fetching a previously serviced and finished reply.

Signed-off-by: Ingo Tuchscherer <ingo.tuchscherer@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Harald Freudenberger <freude@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
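Illustration: a counter-based simulation of the behaviour described in this message, where fetching a reply frees a slot that is immediately refilled from the waiting requests. It is not the zcrypt driver code; sim_ap_queue, QUEUE_DEPTH and both functions are hypothetical, and the queue depth of 2 is an arbitrary assumption.

```c
#define QUEUE_DEPTH 2 /* assumed hardware queue depth */

/* Simulated AP queue state; all names here are illustrative assumptions. */
struct sim_ap_queue {
	int in_flight; /* requests currently in the hardware queue */
	int waiting;   /* requests parked in the software request queue */
	int completed; /* replies fetched so far */
};

/* Submit a request: enqueue it if there is room, otherwise park it. */
void sim_ap_submit(struct sim_ap_queue *q)
{
	if (q->in_flight < QUEUE_DEPTH)
		q->in_flight++;
	else
		q->waiting++;
}

/* Fetch one finished reply, then push the next waiting request, if any. */
void sim_ap_receive(struct sim_ap_queue *q)
{
	if (q->in_flight == 0)
		return;
	q->in_flight--;
	q->completed++;
	if (q->waiting > 0) {
		q->waiting--;
		q->in_flight++;
	}
}
```

Without the refill step at the end of sim_ap_receive(), parked requests would stay parked once the queue had been full, which is the stuck state the message describes.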

25 Nov, 2015 5 commits

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 78c4a49a

Linus Torvalds authored Nov 25, 2015

Pull vfs fixes from Al Viro:
 "A couple of fixes for sendfile lockups caught by Dmitry + a fix for
  ancient sysvfs symlink breakage"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Avoid softlockups with sendfile(2)
  vfs: Make sendfile(2) killable even better
  fix sysvfs symlinks

Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 9b81d512

Linus Torvalds authored Nov 25, 2015

Pull more block layer fixes from Jens Axboe:
 "I wasn't going to send off a new pull before next week, but the blk
  flush fix from Jan from the other day introduced a regression.  It's
  rare enough not to have hit during testing, since it requires both a
  device that rejects the first flush, and bad timing while it does
  that.  But since someone did hit it, let's get the revert into 4.4-rc3
  so we don't have a released rc with that known issue.

  Apart from that revert, three other fixes:

   - From Christoph, a fix for a missing unmap in NVMe request
     preparation.

   - An NVMe fix from Nishanth that fixes data corruption on powerpc.

   - Also from Christoph, fix a list_del() attempt on blk-mq that didn't
     have a matching list_add() at timer start"

* 'for-linus' of git://git.kernel.dk/linux-block:
  Revert "blk-flush: Queue through IO scheduler when flush not required"
  block: fix blk_abort_request for blk-mq drivers
  nvme: add missing unmaps in nvme_queue_rq
  NVMe: default to 4k device page size

Revert "blk-flush: Queue through IO scheduler when flush not required" · dcd8376c

Jens Axboe authored Nov 25, 2015

This reverts commit 1b2ff19e.

Jan writes:

--

Thanks for report! After some investigation I found out we allocate
elevator specific data in __get_request() only for non-flush requests. And
this is actually required since the flush machinery uses the space in
struct request for something else. Doh. So my patch is just wrong and not
easy to fix since at the time __get_request() is called we are not sure
whether the flush machinery will be used in the end. Jens, please revert
1b2ff19e. Thanks!

I'm somewhat surprised that you can reliably hit the race where flushing
gets disabled for the device just while the request is in flight. But I
guess during boot it makes some sense.

--

So let's just revert it; we can fix the queue run manually after the
fact. This race is rare enough that it didn't trigger in testing, since
it requires the specific disable-while-in-flight scenario.

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 4cf193b4

Linus Torvalds authored Nov 25, 2015

Pull KVM fixes from Paolo Bonzini:
 "Bug fixes for all architectures.  Nothing really stands out"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
  KVM: nVMX: remove incorrect vpid check in nested invvpid emulation
  arm64: kvm: report original PAR_EL1 upon panic
  arm64: kvm: avoid %p in __kvm_hyp_panic
  KVM: arm/arm64: vgic: Trust the LR state for HW IRQs
  KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active
  KVM: arm/arm64: Fix preemptible timer active state crazyness
  arm64: KVM: Add workaround for Cortex-A57 erratum 834220
  arm64: KVM: Fix AArch32 to AArch64 register mapping
  ARM/arm64: KVM: test properly for a PTE's uncachedness
  KVM: s390: fix wrong lookup of VCPUs by array index
  KVM: s390: avoid memory overwrites on emergency signal injection
  KVM: Provide function for VCPU lookup by id
  KVM: s390: fix pfmf intercept handler
  KVM: s390: enable SIMD only when no VCPUs were created
  KVM: x86: request interrupt window when IRQ chip is split
  KVM: x86: set KVM_REQ_EVENT on local interrupt request from user space
  KVM: x86: split kvm_vcpu_ready_for_interrupt_injection out of dm_request_for_irq_injection
  KVM: x86: fix interrupt window handling in split IRQ chip case
  MIPS: KVM: Uninit VCPU in vcpu_create error path
  MIPS: KVM: Fix CACHE immediate offset sign extension
  ...

KVM: nVMX: remove incorrect vpid check in nested invvpid emulation · b2467e74

Haozhong Zhang authored Nov 25, 2015

This patch removes the vpid check when emulating nested invvpid
instruction of type all-contexts invalidation. The existing code is
incorrect because:
 (1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate
     Translations Based on VPID", invvpid instruction does not check
     vpid in the invvpid descriptor when its type is all-contexts
     invalidation.
 (2) According to the same document, invvpid of type all-contexts
     invalidation does not require that there is an active VMCS, so
     get_vmcs12() in the existing code may result in a NULL-pointer
     dereference. In practice, it can crash both KVM itself and L1
     hypervisors that use invvpid (e.g. Xen).

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

24 Nov, 2015 21 commits

block: fix blk_abort_request for blk-mq drivers · 55ce0da1

Christoph Hellwig authored Oct 30, 2015

We only added the request to the request list for the !blk-mq case,
so we should only delete it in that case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

nvme: add missing unmaps in nvme_queue_rq · bf508e91

Christoph Hellwig authored Oct 16, 2015

When we fail various metadata related operations in nvme_queue_rq we
need to unmap the data SGL.

Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: default to 4k device page size · c5c9f25b

Nishanth Aravamudan authored Nov 24, 2015

We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures and in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.

With the functionally equivalent v3 of this patch, our hardware test
exerciser survives when using 32-bit DMA; without the patch, the kernel
will BUG within a few minutes.

Signed-off-by: Nishanth Aravamudan <nacc at linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
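Illustration: the page size mismatch in this message comes down to simple arithmetic on the example length 0xF000. The helper name len_is_page_multiple() is hypothetical, and it assumes the page size is a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if len is a whole multiple of page_size (page_size must be a power of two). */
bool len_is_page_multiple(uint64_t len, uint64_t page_size)
{
	return (len & (page_size - 1)) == 0;
}
```

A length of 0xF000 is 15 pages of 4K, so it passes for a 4K page size, but it is 7.5 pages of 8K and therefore fails for an 8K device page size, which matches the BUG_ON scenario described above.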

Merge tag 'dm-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · 6ffeba96

Linus Torvalds authored Nov 24, 2015

Pull device mapper fixes from Mike Snitzer:
 "Two fixes for 4.4-rc1's DM ioctl changes that introduced the potential
  for infinite recursion on ioctl (with DM multipath).

  And four stable fixes:

   - A DM thin-provisioning fix to restore 'error_if_no_space' setting
     when a thin-pool is made writable again (after having been out of
     space).

   - A DM thin-provisioning fix to properly advertise discard support
     for thin volumes that are stacked on a thin-pool whose underlying
     data device doesn't support discards.

   - A DM ioctl fix to allow ctrl-c to break out of an ioctl retry loop
     when DM multipath is configured to 'queue_if_no_path'.

   - A DM crypt fix for a possible hang on dm-crypt device removal"

* tag 'dm-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin: fix regression in advertised discard limits
  dm crypt: fix a possible hang due to race condition on exit
  dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path
  dm: do not reuse dm_blk_ioctl block_device input as local variable
  dm: fix ioctl retry termination with signal
  dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition

pidns: fix NULL dereference in __task_pid_nr_ns() · 81b1a832

Eric Dumazet authored Nov 24, 2015

I got a crash during a "perf top" session that was caused by a race in
__task_pid_nr_ns() :

pid_nr_ns() was inlined, but apparently the compiler chose to read
task->pids[type].pid twice, and the pid->level dereference crashed
because we got a NULL pointer at the second read :

    if (pid && ns->level <= pid->level) { // CRASH

Just use RCU API properly to solve this race, and not worry about "perf
top" crashing hosts :(

get_task_pid() can benefit from the same fix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
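Illustration: a user-space analogue of the double-read problem in this message, using C11 atomics rather than the kernel RCU API. The names sim_pid, sim_task_pid and sim_pid_nr() are hypothetical. The sketch only shows the "load once, then use the local" pattern; it does not model the object lifetime guarantees that RCU provides.

```c
#include <stdatomic.h>
#include <stddef.h>

struct sim_pid {
	int level;
	int nr;
};

/* Shared pointer that another thread may reset to NULL at any time. */
_Atomic(struct sim_pid *) sim_task_pid;

/*
 * Load the shared pointer exactly once into a local and use only the
 * local afterwards, so the NULL check and the dereference see the same
 * value. Returns 0 if there is no pid or it is not visible at ns_level.
 */
int sim_pid_nr(int ns_level)
{
	struct sim_pid *pid = atomic_load_explicit(&sim_task_pid,
						   memory_order_acquire);

	if (pid && ns_level <= pid->level)
		return pid->nr;
	return 0;
}
```

If the shared pointer were read once for the NULL check and a second time for the level comparison, a concurrent reset between the two reads could produce exactly the crash quoted above.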

Merge tag 'kvm-arm-for-v4.4-rc3' of... · 8bd142c0

Paolo Bonzini authored Nov 24, 2015

Merge tag 'kvm-arm-for-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master

KVM/ARM Fixes for v4.4-rc3.

Includes some timer fixes, properly unmapping PTEs, an errata fix, and two
tweaks to the EL2 panic code.

Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 4ce01c51

Linus Torvalds authored Nov 24, 2015

Pull block layer fixes from Jens Axboe:
 "A round of fixes/updates for the current series.

  This looks a little bigger than it is, but that's mainly because we
  pushed the lightnvm enabled null_blk change out of the merge window so
  it could be updated a bit.  The rest of the volume is also mostly
  lightnvm.  In particular:

   - Lightnvm.  Various fixes, additions, updates from Matias and
     Javier, as well as from Wenwei Tao.

   - NVMe:
        - Fix for potential arithmetic overflow from Keith.
        - Also from Keith, ensure that we reap pending completions from
          a completion queue before deleting it.  Fixes kernel crashes
          when resetting a device with IO pending.
        - Various little lightnvm related tweaks from Matias.

   - Fixup flushes to go through the IO scheduler, for the cases where a
     flush is not required.  Fixes a case in CFQ where we would be
     idling and not see this request, hence not break the idling.  From
     Jan Kara.

   - Use list_{first,prev,next} in elevator.c for cleaner code.  From
     Geliang Tang.

   - Fix for a warning trigger on btrfs and raid on single queue blk-mq
     devices, where we would flush plug callbacks with preemption
     disabled.  From me.

   - A mac partition validation fix from Kees Cook.

   - Two merge fixes from Ming, marked stable.  A third part is adding a
     new warning so we'll notice this quicker in the future, if we screw
     up the accounting.

   - Cleanup of thread name/creation in mtip32xx from Rasmus Villemoes"

* 'for-linus' of git://git.kernel.dk/linux-block: (32 commits)
  blk-merge: warn if figured out segment number is bigger than nr_phys_segments
  blk-merge: fix blk_bio_segment_split
  block: fix segment split
  blk-mq: fix calling unplug callbacks with preempt disabled
  mac: validate mac_partition is within sector
  mtip32xx: use formatting capability of kthread_create_on_node
  NVMe: reap completion entries when deleting queue
  lightnvm: add free and bad lun info to show luns
  lightnvm: keep track of block counts
  nvme: lightnvm: use admin queues for admin cmds
  lightnvm: missing free on init error
  lightnvm: wrong return value and redundant free
  null_blk: do not del gendisk with lightnvm
  null_blk: use device addressing mode
  null_blk: use ppa_cache pool
  NVMe: Fix possible arithmetic overflow for max segments
  blk-flush: Queue through IO scheduler when flush not required
  null_blk: register as a LightNVM device
  elevator: use list_{first,prev,next}_entry
  lightnvm: cleanup queue before target removal
  ...

arm64: kvm: report original PAR_EL1 upon panic · fbb4574c

Mark Rutland authored Nov 16, 2015

If we call __kvm_hyp_panic while a guest context is active, we call
__restore_sysregs before acquiring the system register values for the
panic, in the process throwing away the PAR_EL1 value at the point of
the panic.

This patch modifies __kvm_hyp_panic to stash the PAR_EL1 value prior to
restoring host register values, enabling us to report the original
values at the point of the panic.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>

arm64: kvm: avoid %p in __kvm_hyp_panic · 1d7a4e31

Mark Rutland authored Nov 16, 2015

Currently __kvm_hyp_panic uses %p for values which are not pointers,
such as the ESR value. This can confusingly lead to "(null)" being
printed for the value.

Use %x instead, and only use %p for host pointers.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
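Illustration: a small user-space sketch of formatting a non-pointer register value as hex instead of as a pointer. The function name sim_format_esr() and the output layout are hypothetical; this uses the C library's snprintf(), not the kernel's printk formatting.

```c
#include <stdio.h>

/* Format a 32-bit syndrome value as fixed-width hex rather than as a pointer. */
int sim_format_esr(char *buf, size_t len, unsigned int esr)
{
	return snprintf(buf, len, "ESR: 0x%08x", esr);
}
```

With %x, a value of zero is still shown as a number (0x00000000) rather than as a special null-pointer string, which keeps the panic output unambiguous.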

KVM: arm/arm64: vgic: Trust the LR state for HW IRQs · 9f958c11

Christoffer Dall authored Nov 24, 2015

We were probing the physical distributor state for the active state of a
HW virtual IRQ, because we had seen evidence that the LR state was not
cleared when the guest deactivated a virtual interrupt.

However, this issue turned out to be a software bug in the GIC, which
was solved by: 84aab5e68c2a5e1e18d81ae8308c3ce25d501b29
(KVM: arm/arm64: arch_timer: Preserve physical dist. active
state on LR.active, 2015-11-24)

Therefore, get rid of the complexities and just look at the LR.

Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>

KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active · 0e3dfda9

Christoffer Dall authored Nov 24, 2015

We were incorrectly removing the active state from the physical
distributor on the timer interrupt when the timer output level was
deasserted.  We shouldn't be doing this without considering the virtual
interrupt's active state, because the architecture requires that when an
LR has the HW bit set and the pending or active bits set, then the
physical interrupt must also have the corresponding bits set.

This addresses an issue where we have been observing an inconsistency
between the LR state and the physical distributor state where the LR
state was active and the physical distributor was not active, which
shouldn't happen.

Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>

KVM: arm/arm64: Fix preemptible timer active state crazyness · 7e16aa81

Christoffer Dall authored Nov 24, 2015

We were setting the physical active state on the GIC distributor in a
preemptible section, which could cause us to set the active state on
a different physical CPU from the one we were actually going to run on;
havoc ensues.

Since we are no longer descheduling/scheduling soft timers in the
flush/sync timer functions, simply move the timer flush into a
non-preemptible section.

Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>

arm64: KVM: Add workaround for Cortex-A57 erratum 834220 · 498cd5c3

Marc Zyngier authored Nov 16, 2015

Cortex-A57 parts up to r1p2 can misreport Stage 2 translation faults
when a Stage 1 permission fault or device alignment fault should
have been reported.

This patch implements the workaround (which is to validate that the
Stage-1 translation actually succeeds) by using code patching.

Cc: stable@vger.kernel.org
Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>

arm64: KVM: Fix AArch32 to AArch64 register mapping · c0f09634

Marc Zyngier authored Nov 16, 2015

When running a 32bit guest under a 64bit hypervisor, the ARMv8
architecture defines a mapping of the 32bit registers in the 64bit
space. This includes banked registers that are being demultiplexed
over the 64bit ones.

On exceptions caused by an operation involving a 32bit register, the
HW exposes the register number in the ESR_EL2 register. It was so
far understood that SW had to distinguish between AArch32 and AArch64
accesses (based on the current AArch32 mode and register number).

It turns out that I misinterpreted the ARM ARM, and the clue is in
D1.20.1: "For some exceptions, the exception syndrome given in the
ESR_ELx identifies one or more register numbers from the issued
instruction that generated the exception. Where the exception is
taken from an Exception level using AArch32 these register numbers
give the AArch64 view of the register."

Which means that the HW is already giving us the translated version,
and that we shouldn't try to interpret it at all (for example, doing
an MMIO operation from the IRQ mode using the LR register leads to
very unexpected behaviours).

The fix is thus not to perform a call to vcpu_reg32() at all from
vcpu_reg(), and use whatever register number is supplied directly.
The only case we need to find out about the mapping is when we
actively generate a register access, which only occurs when injecting
a fault in a guest.

Cc: stable@vger.kernel.org
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>

ARM/arm64: KVM: test properly for a PTE's uncachedness · e6fab544

Ard Biesheuvel authored Nov 10, 2015

The open coded tests for checking whether a PTE maps a page as
uncached use a flawed '(pte_val(xxx) & CONST) != CONST' pattern,
which is not guaranteed to work since the type of a mapping is
not a set of mutually exclusive bits.

For HYP mappings, the type is an index into the MAIR table (i.e., the
index itself does not contain any information whatsoever about the
type of the mapping), and for stage-2 mappings it is a bit field where
normal memory and device types are defined as follows:

    #define MT_S2_NORMAL            0xf
    #define MT_S2_DEVICE_nGnRE      0x1

I.e., masking *and* comparing with the latter matches on the former,
and we have been getting lucky merely because the S2 device mappings
also have the PTE_UXN bit set, or we would misidentify memory mappings
as device mappings.

Since the unmap_range() code path (which contains one instance of the
flawed test) is used both for HYP mappings and stage-2 mappings, and
considering the difference between the two, it is non-trivial to fix
this by rewriting the tests in place, as it would involve passing
down the type of mapping through all the functions.

However, since HYP mappings and stage-2 mappings both deal with host
physical addresses, we can simply check whether the mapping is backed
by memory that is managed by the host kernel, and only perform the
D-cache maintenance if this is the case.

Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
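Illustration: the flawed mask-and-compare pattern from this message, applied to the two stage-2 attribute values it quotes. The function names are hypothetical and the functions work on the bare attribute value, not on a real PTE. The exact comparison is shown only for contrast; the commit itself takes a different route and checks whether the mapping is backed by host-managed memory.

```c
#include <stdbool.h>

/* Stage-2 memory attribute values quoted in the commit message. */
#define MT_S2_NORMAL       0xf
#define MT_S2_DEVICE_nGnRE 0x1

/* Flawed pattern: mask with the device constant, then compare with it. */
bool flawed_is_device(unsigned int memattr)
{
	return (memattr & MT_S2_DEVICE_nGnRE) == MT_S2_DEVICE_nGnRE;
}

/* Comparing the whole attribute value tells the two types apart. */
bool exact_is_device(unsigned int memattr)
{
	return memattr == MT_S2_DEVICE_nGnRE;
}
```

Because 0xf has bit 0 set, flawed_is_device(MT_S2_NORMAL) returns true, so normal memory is taken for a device mapping unless some other bit happens to disambiguate.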

blk-merge: warn if figured out segment number is bigger than nr_phys_segments · 12e57f59

Ming Lei authored Nov 24, 2015

We have seen lots of reports of this kind of issue, so add a
warning in blk-merge so that it triggers easily and we no longer
depend on a warning/bug from drivers.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blk-merge: fix blk_bio_segment_split · 02e70742

Ming Lei authored Nov 24, 2015

Commit bdced438 (block: setup bi_phys_segments after
splitting) introduced the computation of bio->bi_phys_segments
during bio splitting.

Unfortunately, neither bio->bi_seg_front_size nor bio->bi_seg_back_size
is computed, so too many physical segments may be obtained
for one request, since both are used to check whether a segment
can span two bios.

This patch fixes the issue by computing the two variables in
blk_bio_segment_split().

Fixes: bdced438(block: setup bi_phys_segments after splitting)
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

block: fix segment split · 578270bf

Ming Lei authored Nov 24, 2015

Inside blk_bio_segment_split(), the previous bvec pointer (bvprvp)
always points to the iterator local variable, which is obviously
wrong, so fix it by pointing it to the local variable 'bvprv'.

Fixes: 5014c311(block: fix bogus compiler warnings in blk-merge.c)
Cc: stable@kernel.org #4.3
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
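Illustration: a user-space sketch of the "pointer to the previous element" pattern that this message is about. It is not the block layer code; sim_vec and sim_count_segments() are hypothetical, and the contiguity rule used here is a simplification.

```c
#include <stddef.h>

struct sim_vec {
	unsigned long start;
	unsigned long len;
};

/*
 * Count segments in v[0..n): a vector that begins where the previous one
 * ended extends the current segment. The previous vector is saved in its
 * own local copy (prev); pointing prevp at the loop variable cur instead
 * would compare each vector with itself.
 */
size_t sim_count_segments(const struct sim_vec *v, size_t n)
{
	struct sim_vec prev = {0, 0}, cur;
	const struct sim_vec *prevp = NULL;
	size_t segs = 0;

	for (size_t i = 0; i < n; i++) {
		cur = v[i];
		if (!prevp || prevp->start + prevp->len != cur.start)
			segs++;
		prev = cur;    /* keep a copy of the vector just handled */
		prevp = &prev; /* not &cur, which changes on the next pass */
	}
	return segs;
}
```

With prevp pointing at the iterator variable, the contiguity check would never see the real predecessor, and contiguous vectors would be counted as separate segments.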

vfs: Avoid softlockups with sendfile(2) · c2489e07

Jan Kara authored Nov 23, 2015

The following test program from Dmitry can cause softlockups or RCU
stalls as it copies 1GB from tmpfs into eventfd and we don't have any
scheduling point at that path in sendfile(2) implementation:

        int r1 = eventfd(0, 0);
        int r2 = memfd_create("", 0);
        unsigned long n = 1<<30;
        fallocate(r2, 0, 0, n);
        sendfile(r1, r2, 0, n);

Add cond_resched() into __splice_from_pipe() to fix the problem.

CC: Dmitry Vyukov <dvyukov@google.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

vfs: Make sendfile(2) killable even better · c725bfce

Jan Kara authored Nov 23, 2015

Commit 296291cd (mm: make sendfile(2) killable) fixed an issue where
sendfile(2) was doing a lot of tiny writes into a filesystem and thus
was unkillable for a long time. However sendfile(2) can be (mis)used to
issue lots of writes into an arbitrary file descriptor such as eventfd or
similar special file descriptors which never hit the standard filesystem
write path and thus are still unkillable. E.g. the following example
from Dmitry burns CPU for ~16s on my test system without possibility to
be killed:

        int r1 = eventfd(0, 0);
        int r2 = memfd_create("", 0);
        unsigned long n = 1<<30;
        fallocate(r2, 0, 0, n);
        sendfile(r1, r2, 0, n);

There are actually quite a few tests for pending signals in sendfile
code; however, since data to write is always available, none of them seems to
trigger. So fix the problem by adding a test for pending signal into
splice_from_pipe_next() also before the loop waiting for pipe buffers to
be available. This should fix all the lockup issues with sendfile of the
do-ton-of-tiny-writes nature.

CC: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
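Illustration: a user-space sketch of a copy loop that tests for a pending signal before every chunk, even when data is always available. It is not the splice code; sim_signal_pending, sim_write_chunk() and sim_copy_chunks() are hypothetical, and the flag would normally be set from a signal handler.

```c
#include <signal.h>
#include <stddef.h>

/* Set from a signal handler in a real program; here it is set by hand. */
volatile sig_atomic_t sim_signal_pending;

/* Stand-in for a tiny write that always succeeds in full. */
size_t sim_write_chunk(size_t len)
{
	return len;
}

/*
 * Push total bytes in chunk-sized writes and check for a pending signal
 * before every chunk, so that a long run of tiny writes can still be
 * interrupted. Returns the number of bytes written so far.
 */
size_t sim_copy_chunks(size_t total, size_t chunk)
{
	size_t done = 0;

	while (done < total) {
		size_t len = total - done < chunk ? total - done : chunk;

		if (sim_signal_pending || len == 0)
			break;
		done += sim_write_chunk(len);
	}
	return done;
}
```

The check sits at the top of the loop, before any waiting or writing, so it is reached on every pass regardless of whether the source ever runs dry.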

fix sysvfs symlinks · 0ebf7f10

Al Viro authored Nov 23, 2015

The thing got broken back in 2002 - sysvfs does *not* have inline
symlinks; even short ones have bodies stored in the first block
of file.  sysv_symlink() handles that correctly; unfortunately,
attempting to look an existing symlink up will end up confusing
them for inline symlinks, and interpret the block number containing
the body as the body itself.

Nobody has noticed until now, which says something about the level
of testing sysvfs gets ;-/

Cc: stable@vger.kernel.org # all of them, not that anyone cared
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
