Commits · d477f8c202d1f0d4791ab1263ca7657bbe5cf79e · Kirill Smelkov / linux

12 Jun, 2019 1 commit

cpuset: restore sanity to cpuset_cpus_allowed_fallback() · d477f8c2

Joel Savitz authored Jun 12, 2019

In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,fffffff
	Cpus_allowed_list:	0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

	# taskset -p 4 $$
	pid 2272's current affinity mask: ffffffffffffffff
	pid 2272's new affinity mask: 4

	# echo off > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -3
	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
	[ 2195.872700] IRQ 114: no longer affine to CPU2
	[ 2195.879128] smpboot: CPU 2 is now offline

	# echo on > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -1
	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:   ffffffff,fffffffb
	Cpus_allowed_list:      0-1,3-63

This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on
the cpu it was originally constrained to!

Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:

	# taskset -p f000000000 $$
	# grep -i cpu /proc/$$/status
	Cpus_allowed:	000000f0,00000000
	Cpus_allowed_list:	36-39

A double toggle of smt results in the following behavior:

	# echo off > /sys/devices/system/cpu/smt/control
	# echo on > /sys/devices/system/cpu/smt/control
	# grep -i cpus /proc/$$/status
	Cpus_allowed:	ffffff00,ffffffff
	Cpus_allowed_list:	0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,ffffffff
	Cpus_allowed_list:	0-63

The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().
Suggested-by: Waiman Long <longman@redhat.com>
Suggested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Acked-by: Phil Auld <pauld@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

d477f8c2

10 Jun, 2019 1 commit

cgroup: Fix css_task_iter_advance_css_set() cset skip condition · c596687a

Tejun Heo authored Jun 10, 2019

While adding handling for dying task group leaders c03cd773
("cgroup: Include dying leaders with live threads in PROCS
iterations") added an inverted cset skip condition to
css_task_iter_advance_css_set().  It should skip cset if it's
completely empty but was incorrectly testing for the inverse condition
for the dying_tasks list.  Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: c03cd773 ("cgroup: Include dying leaders with live threads in PROCS iterations")
Reported-by: syzbot+d4bba5ccd4f9a2a68681@syzkaller.appspotmail.com

c596687a

05 Jun, 2019 1 commit

cgroup: css_task_iter_skip()'d iterators must be advanced before accessed · cee0c33c

Tejun Heo authored Jun 05, 2019

b636fd38 ("cgroup: Implement css_task_iter_skip()") introduced
css_task_iter_skip() which is used to fix task iterations skipping
dying threadgroup leaders with live threads.  Skipping is implemented
as a subportion of full advancing but css_task_iter_next() forgot to
fully advance a skipped iterator before determining the next task to
visit causing it to return invalid task pointers.

Fix it by making css_task_iter_next() fully advance the iterator if it
has been skipped since the previous iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: syzbot
Link: http://lkml.kernel.org/r/00000000000097025d058a7fd785@google.com
Fixes: b636fd38 ("cgroup: Implement css_task_iter_skip()")

cee0c33c

31 May, 2019 3 commits

cgroup: Include dying leaders with live threads in PROCS iterations · c03cd773

Tejun Heo authored May 31, 2019

CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
this means that a process with dying leader and live threads will be
skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
which is confusing to say the least.

Fix it by making cset track dying tasks and include dying leaders with
live threads in PROCS iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>

c03cd773

cgroup: Implement css_task_iter_skip() · b636fd38

Tejun Heo authored May 31, 2019

When a task is moved out of a cset, task iterators pointing to the
task are advanced using the normal css_task_iter_advance() call. This
is fine but we'll be tracking dying tasks on csets and thus moving
tasks from cset->tasks to (to be added) cset->dying_tasks. When we
remove a task from cset->tasks, if we advance the iterators, they may
move over to the next cset before we had the chance to add the task
back on the dying list, which can allow the task to escape iteration.

This patch separates out skipping from advancing. Skipping only moves
the affected iterators to the next pointer rather than fully advancing
it and the following advancing will recognize that the cursor has
already been moved forward and do the rest of advancing. This ensures
that when a task moves from one list to another in its cset, as long
as it moves in the right direction, it's always visible to iteration.

This doesn't cause any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>

b636fd38

cgroup: Call cgroup_release() before __exit_signal() · 6b115bf5

Tejun Heo authored May 31, 2019

cgroup_release() calls cgroup_subsys->release() which is used by the
pids controller to uncharge its pid.  We want to use it to manage
iteration of dying tasks which requires putting it before
__unhash_process().  Move cgroup_release() above __exit_signal().
While this makes it uncharge before the pid is freed, pid is RCU freed
anyway and the window is very narrow.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>

6b115bf5

30 May, 2019 1 commit

docs cgroups: add another example size for hugetlb · 8cfeb385

Odin Ugedal authored May 30, 2019

Add another example to clarify that HugePages smaller than 1MB will
be displayed using "KB", with an uppercased K (eg. 20KB), and not the
normal SI prefix kilo (small k).

Because of a misunderstanding/copy-paste error inside runc
(see https://github.com/opencontainers/runc/pull/2065), it tried
accessing the cgroup control file of a 64kB HugePage using
"hugetlb.64kB._____" instead of the correct "hugetlb.64KB._____".

Adding a new example will make it clear how sizes smaller than 1MB are
handled.
Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

8cfeb385

29 May, 2019 1 commit

cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css() · 18fa84a2

Tejun Heo authored May 29, 2019

A PF_EXITING task can stay associated with an offline css.  If such
task calls task_get_css(), it can get stuck indefinitely.  This can be
triggered by BSD process accounting which writes to a file with
PF_EXITING set when racing against memcg disable as in the backtrace
at the end.

After this change, task_get_css() may return a css which was already
offline when the function was called.  None of the existing users are
affected by this change.

  INFO: rcu_sched self-detected stall on CPU
  INFO: rcu_sched detected stalls on CPUs/tasks:
  ...
  NMI backtrace for cpu 0
  ...
  Call Trace:
   <IRQ>
   dump_stack+0x46/0x68
   nmi_cpu_backtrace.cold.2+0x13/0x57
   nmi_trigger_cpumask_backtrace+0xba/0xca
   rcu_dump_cpu_stacks+0x9e/0xce
   rcu_check_callbacks.cold.74+0x2af/0x433
   update_process_times+0x28/0x60
   tick_sched_timer+0x34/0x70
   __hrtimer_run_queues+0xee/0x250
   hrtimer_interrupt+0xf4/0x210
   smp_apic_timer_interrupt+0x56/0x110
   apic_timer_interrupt+0xf/0x20
   </IRQ>
  RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
  ...
   btrfs_file_write_iter+0x31b/0x563
   __vfs_write+0xfa/0x140
   __kernel_write+0x4f/0x100
   do_acct_process+0x495/0x580
   acct_process+0xb9/0xdb
   do_exit+0x748/0xa00
   do_group_exit+0x3a/0xa0
   get_signal+0x254/0x560
   do_signal+0x23/0x5c0
   exit_to_usermode_loop+0x5d/0xa0
   prepare_exit_to_usermode+0x53/0x80
   retint_user+0x8/0x8
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.2+
Fixes: ec438699 ("cgroup, block: implement task_get_css() and use it in bio_associate_current()")

18fa84a2

28 May, 2019 3 commits

Merge tag 'pinctrl-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 9fb67d64

Linus Torvalds authored May 28, 2019

Pull pin control fixes from Linus Walleij:
 "The commits that stand out are the Intel fixes that arrived during the
  merge window and I got relayed by pull request from Andy.

  Apart from that a minor Kconfig noise.

   - Interrupt clearing fix for the Intel pin controllers affecting
     touchpads on some laptops.

   - Compile Kconfig fix for the STMFX expander pin controller"

* tag 'pinctrl-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: stmfx: Fix compile issue when CONFIG_OF_GPIO is not defined
  pinctrl: intel: Clear interrupt status in mask/unmask callback
  pinctrl: intel: Use GENMASK() consistently

9fb67d64

Merge tag 'gpio-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio · ca6584a3

Linus Torvalds authored May 28, 2019

Pull GPIO fix from Linus Walleij:
 "Fix a build error in gpio-adp5588"

* tag 'gpio-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio: fix gpio-adp5588 build errors

ca6584a3

ia64: fix build errors by exporting paddr_to_nid() · 9a626c4a

Randy Dunlap authored May 28, 2019

Fix build errors on ia64 when DISCONTIGMEM=y and NUMA=y by
exporting paddr_to_nid().

Fixes these build errors:

ERROR: "paddr_to_nid" [sound/core/snd-pcm.ko] undefined!
ERROR: "paddr_to_nid" [net/sunrpc/sunrpc.ko] undefined!
ERROR: "paddr_to_nid" [fs/cifs/cifs.ko] undefined!
ERROR: "paddr_to_nid" [drivers/video/fbdev/core/fb.ko] undefined!
ERROR: "paddr_to_nid" [drivers/usb/mon/usbmon.ko] undefined!
ERROR: "paddr_to_nid" [drivers/usb/core/usbcore.ko] undefined!
ERROR: "paddr_to_nid" [drivers/md/raid1.ko] undefined!
ERROR: "paddr_to_nid" [drivers/md/dm-mod.ko] undefined!
ERROR: "paddr_to_nid" [drivers/md/dm-crypt.ko] undefined!
ERROR: "paddr_to_nid" [drivers/md/dm-bufio.ko] undefined!
ERROR: "paddr_to_nid" [drivers/ide/ide-core.ko] undefined!
ERROR: "paddr_to_nid" [drivers/ide/ide-cd_mod.ko] undefined!
ERROR: "paddr_to_nid" [drivers/gpu/drm/drm.ko] undefined!
ERROR: "paddr_to_nid" [drivers/char/agp/agpgart.ko] undefined!
ERROR: "paddr_to_nid" [drivers/block/nbd.ko] undefined!
ERROR: "paddr_to_nid" [drivers/block/loop.ko] undefined!
ERROR: "paddr_to_nid" [drivers/block/brd.ko] undefined!
ERROR: "paddr_to_nid" [crypto/ccm.ko] undefined!
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: linux-ia64@vger.kernel.org
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9a626c4a

27 May, 2019 1 commit

Merge tag 'intel-pinctrl-v5.2-2' of... · b1fa7d85

Linus Walleij authored May 27, 2019

Merge tag 'intel-pinctrl-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pinctrl/intel into fixes

intel-pinctrl for v5.2-2

Fix a laggish ELAN touchpad responsiveness due to an odd interrupt masking.

The following is an automated git shortlog grouped by driver:

intel:
 -  Clear interrupt status in mask/unmask callback
 -  Use GENMASK() consistently

b1fa7d85

26 May, 2019 6 commits

Linux 5.2-rc2 · cd6c84d8
Linus Torvalds authored May 26, 2019

cd6c84d8

Merge tag 'trace-v5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · c5b44095

Linus Torvalds authored May 26, 2019

Pull tracing warning fix from Steven Rostedt:
 "Make the GCC 9 warning for sub struct memset go away.

  GCC 9 now warns about calling memset() on partial structures when it
  goes across multiple fields. This adds a helper for the place in
  tracing that does this type of clearing of a structure"

* tag 'trace-v5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Silence GCC 9 array bounds warning

c5b44095

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 862f0a32

Linus Torvalds authored May 26, 2019

Pull KVM fixes from Paolo Bonzini:
 "The usual smattering of fixes and tunings that came in too late for
  the merge window, but should not wait four months before they appear
  in a release.

  I also travelled a bit more than usual in the first part of May, which
  didn't help with picking up patches and reports promptly"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (33 commits)
  KVM: x86: fix return value for reserved EFER
  tools/kvm_stat: fix fields filter for child events
  KVM: selftests: Wrap vcpu_nested_state_get/set functions with x86 guard
  kvm: selftests: aarch64: compile with warnings on
  kvm: selftests: aarch64: fix default vm mode
  kvm: selftests: aarch64: dirty_log_test: fix unaligned memslot size
  KVM: s390: fix memory slot handling for KVM_SET_USER_MEMORY_REGION
  KVM: x86/pmu: do not mask the value that is written to fixed PMUs
  KVM: x86/pmu: mask the result of rdpmc according to the width of the counters
  x86/kvm/pmu: Set AMD's virt PMU version to 1
  KVM: x86: do not spam dmesg with VMCS/VMCB dumps
  kvm: Check irqchip mode before assign irqfd
  kvm: svm/avic: fix off-by-one in checking host APIC ID
  KVM: selftests: do not blindly clobber registers in guest asm
  KVM: selftests: Remove duplicated TEST_ASSERT in hyperv_cpuid.c
  KVM: LAPIC: Expose per-vCPU timer_advance_ns to userspace
  KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow
  kvm: vmx: Fix -Wmissing-prototypes warnings
  KVM: nVMX: Fix using __this_cpu_read() in preemptible context
  kvm: fix compilation on s390
  ...

862f0a32

Merge tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random · 128f2bfa

Linus Torvalds authored May 26, 2019

Pull /dev/random fix from Ted Ts'o:
 "Fix a soft lockup regression when reading from /dev/random in early
  boot"

* tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
  random: fix soft lockup when trying to read from an uninitialized blocking pool

128f2bfa

random: fix soft lockup when trying to read from an uninitialized blocking pool · 58be0106

Theodore Ts'o authored May 22, 2019

Fixes: eb9d1bf0: "random: only read from /dev/random after its pool has received 128 bits"
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

58be0106

tracing: Silence GCC 9 array bounds warning · 0c97bf86

Miguel Ojeda authored May 23, 2019

Starting with GCC 9, -Warray-bounds detects cases when memset is called
starting on a member of a struct but the size to be cleared ends up
writing over further members.

Such a call happens in the trace code to clear, at once, all members
after and including `seq` on struct trace_iterator:

    In function 'memset',
        inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
    ./include/linux/string.h:344:9: warning: '__builtin_memset' offset
    [8505, 8560] from the object at 'iter' is out of the bounds of
    referenced subobject 'seq' with type 'struct trace_seq' at offset
    4368 [-Warray-bounds]
      344 |  return __builtin_memset(p, c, size);
          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to avoid GCC complaining about it, we compute the address
ourselves by adding the offsetof distance instead of referring
directly to the member.

Since there are two places doing this clear (trace.c and trace_kdb.c),
take the chance to move the workaround into a single place in
the internal header.

Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.comSigned-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
[ Removed unnecessary parenthesis around "iter" ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

0c97bf86

25 May, 2019 5 commits

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 35efb51e

Linus Torvalds authored May 25, 2019

Pull ext4 fixes from Ted Ts'o:
 "Bug fixes (including a regression fix) for ext4"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix dcache lookup of !casefolded directories
  ext4: do not delete unlinked inode from orphan list on failed truncate
  ext4: wait for outstanding dio during truncate in nojournal mode
  ext4: don't perform block validity checks on the journal inode

35efb51e

Merge tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · b2ad8136

Linus Torvalds authored May 25, 2019

Pull libnvdimm fixes from Dan Williams:

 - Fix a regression that disabled device-mapper dax support

 - Remove unnecessary hardened-user-copy overhead (>30%) for dax
   read(2)/write(2).

 - Fix some compilation warnings.

* tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
  dax: Arrange for dax_supported check to span multiple devices
  libnvdimm: Fix compilation warnings with W=1

b2ad8136

Merge tag 'trace-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · a2c48d98

Linus Torvalds authored May 25, 2019

Pull tracing fixes from Steven Rostedt:
 "Tom Zanussi sent me some small fixes and cleanups to the histogram
  code and I forgot to incorporate them.

  I also added a small clean up patch that was sent to me a while ago
  and I just noticed it"

* tag 'trace-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  kernel/trace/trace.h: Remove duplicate header of trace_seq.h
  tracing: Add a check_val() check before updating cond_snapshot() track_val
  tracing: Check keys for variable references in expressions too
  tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts

a2c48d98

ext4: fix dcache lookup of !casefolded directories · 66883da1

Gabriel Krisman Bertazi authored May 24, 2019

Found by visual inspection, this wasn't caught by my xfstest, since it's
effect is ignoring positive dentries in the cache the fallback just goes
to the disk.  it was introduced in the last iteration of the
case-insensitive patch.

d_compare should return 0 when the entries match, so make sure we are
correctly comparing the entire string if the encoding feature is set and
we are on a case-INsensitive directory.

Fixes: b886ee3e ("ext4: Support case-insensitive file name lookups")
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

66883da1

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 2409207a

Linus Torvalds authored May 24, 2019

Pull SCSI fixes from James Bottomley:
 "This is the same set of patches sent in the merge window as the final
  pull except that Martin's read only rework is replaced with a simple
  revert of the original change that caused the regression.

  Everything else is an obvious fix or small cleanup"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  Revert "scsi: sd: Keep disk read-only when re-reading partition"
  scsi: bnx2fc: fix incorrect cast to u64 on shift operation
  scsi: smartpqi: Reporting unhandled SCSI errors
  scsi: myrs: Fix uninitialized variable
  scsi: lpfc: Update lpfc version to 12.2.0.2
  scsi: lpfc: add check for loss of ndlp when sending RRQ
  scsi: lpfc: correct rcu unlock issue in lpfc_nvme_info_show
  scsi: lpfc: resolve lockdep warnings
  scsi: qedi: remove set but not used variables 'cdev' and 'udev'
  scsi: qedi: remove memset/memcpy to nfunc and use func instead
  scsi: qla2xxx: Add cleanup for PCI EEH recovery

2409207a

24 May, 2019 17 commits

Merge tag 'for-linus-20190524' of git://git.kernel.dk/linux-block · 7fbc78e3

Linus Torvalds authored May 24, 2019

Pull block fixes from Jens Axboe:

 - NVMe pull request from Keith, with fixes from a few folks.

 - bio and sbitmap before atomic barrier fixes (Andrea)

 - Hang fix for blk-mq freeze and unfreeze (Bob)

 - Single segment count regression fix (Christoph)

 - AoE now has a new maintainer

 - tools/io_uring/ Makefile fix, and sync with liburing (me)

* tag 'for-linus-20190524' of git://git.kernel.dk/linux-block: (23 commits)
  tools/io_uring: sync with liburing
  tools/io_uring: fix Makefile for pthread library link
  blk-mq: fix hang caused by freeze/unfreeze sequence
  block: remove the bi_seg_{front,back}_size fields in struct bio
  block: remove the segment size check in bio_will_gap
  block: force an unlimited segment size on queues with a virt boundary
  block: don't decrement nr_phys_segments for physically contigous segments
  sbitmap: fix improper use of smp_mb__before_atomic()
  bio: fix improper use of smp_mb__before_atomic()
  aoe: list new maintainer for aoe driver
  nvme-pci: use blk-mq mapping for unmanaged irqs
  nvme: update MAINTAINERS
  nvme: copy MTFA field from identify controller
  nvme: fix memory leak for power latency tolerance
  nvme: release namespace SRCU protection before performing controller ioctls
  nvme: merge nvme_ns_ioctl into nvme_ioctl
  nvme: remove the ifdef around nvme_nvm_ioctl
  nvme: fix srcu locking on error return in nvme_get_ns_from_disk
  nvme: Fix known effects
  nvme-pci: Sync queues on reset
  ...

7fbc78e3

Merge tag 'linux-kselftest-5.2-rc2' of... · 7f8b40e3

Linus Torvalds authored May 24, 2019

Merge tag 'linux-kselftest-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull Kselftest fixes from Shuah Khan:

 - Two fixes to regressions introduced in kselftest Makefile test run
   output refactoring work (Kees Cook)

 - Adding Atom support to syscall_arg_fault test (Tong Bo)

* tag 'linux-kselftest-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/timers: Add missing fflush(stdout) calls
  selftests: Remove forced unbuffering for test running
  selftests/x86: Support Atom for syscall_arg_fault test

7f8b40e3

Merge tag 'devicetree-fixes-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · e7bd3e24

Linus Torvalds authored May 24, 2019

Pull Devicetree fixes from Rob Herring:

 - Update checkpatch.pl to use DT vendor-prefixes.yaml

 - Fix DT binding references to files converted to DT schema

 - Clean-up Arm CPU binding examples to match schema

 - Add Sifive block versioning scheme documentation

 - Pass binding directory base to validation tools for reference lookups

* tag 'devicetree-fixes-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  checkpatch.pl: Update DT vendor prefix check
  dt: bindings: mtd: replace references to nand.txt with nand-controller.yaml
  dt-bindings: interrupt-controller: arm,gic: Fix schema errors in example
  dt-bindings: arm: Clean up CPU binding examples
  dt: fix refs that were renamed to json with the same file name
  dt-bindings: Pass binding directory to validation tools
  dt-bindings: sifive: describe sifive-blocks versioning

e7bd3e24

Merge tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core · 86c2f5d6

Linus Torvalds authored May 24, 2019

Pule more SPDX updates from Greg KH:
 "Here is another set of reviewed patches that adds SPDX tags to
  different kernel files, based on a set of rules that are being used to
  parse the comments to try to determine that the license of the file is
  "GPL-2.0-or-later".

  Only the "obvious" versions of these matches are included here, a
  number of "non-obvious" variants of text have been found but those
  have been postponed for later review and analysis.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers"

* tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (85 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 125
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 123
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 122
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 121
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 120
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 119
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 116
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 113
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 112
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 111
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 110
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 106
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 105
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 102
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
  ...

86c2f5d6

locking/lock_events: Use this_cpu_add() when necessary · 51816e9e

Waiman Long authored May 24, 2019

The kernel test robot has reported that the use of __this_cpu_add()
causes bug messages like:

  BUG: using __this_cpu_add() in preemptible [00000000] code: ...

Given the imprecise nature of the count and the possibility of resetting
the count and doing the measurement again, this is not really a big
problem to use the unprotected __this_cpu_*() functions.

To make the preemption checking code happy, the this_cpu_*() functions
will be used if CONFIG_DEBUG_PREEMPT is defined.

The imprecise nature of the locking counts are also documented with
the suggestion that we should run the measurement a few times with the
counts reset in between to get a better picture of what is going on
under the hood.

Fixes: a8654596 ("locking/rwsem: Enable lock event counting")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

51816e9e

KVM: x86: fix return value for reserved EFER · 66f61c92

Paolo Bonzini authored May 24, 2019

Commit 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for
host-initiated writes", 2019-04-02) introduced a "return false" in a
function returning int, and anyway set_efer has a "nonzero on error"
conventon so it should be returning 1.
Reported-by: Pavel Machek <pavel@denx.de>
Fixes: 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

66f61c92

tools/kvm_stat: fix fields filter for child events · 883d25e7

Stefan Raspl authored Apr 21, 2019

The fields filter would not work with child fields, as the respective
parents would not be included. No parents displayed == no childs displayed.
To reproduce, run on s390 (would work on other platforms, too, but would
require a different filter name):
- Run 'kvm_stat -d'
- Press 'f'
- Enter 'instruct'
Notice that events like instruction_diag_44 or instruction_diag_500 are not
displayed - the output remains empty.
With this patch, we will filter by matching events and their parents.
However, consider the following example where we filter by
instruction_diag_44:

  kvm statistics - summary
                   regex filter: instruction_diag_44
   Event                                         Total %Total CurAvg/s
   exit_instruction                                276  100.0       12
     instruction_diag_44                           256   92.8       11
   Total                                           276              12

Note that the parent ('exit_instruction') displays the total events, but
the childs listed do not match its total (256 instead of 276). This is
intended (since we're filtering all but one child), but might be confusing
on first sight.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

883d25e7

KVM: selftests: Wrap vcpu_nested_state_get/set functions with x86 guard · c7957206

Thomas Huth authored May 23, 2019

struct kvm_nested_state is only available on x86 so far. To be able
to compile the code on other architectures as well, we need to wrap
the related code with #ifdefs.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

c7957206

kvm: selftests: aarch64: compile with warnings on · 98e68344

Andrew Jones authored May 23, 2019

aarch64 fixups needed to compile with warnings as errors.
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

98e68344

kvm: selftests: aarch64: fix default vm mode · 55eda003

Andrew Jones authored May 23, 2019

VM_MODE_P52V48_4K is not a valid mode for AArch64. Replace its
use in vm_create_default() with a mode that works and represents
a good AArch64 default. (We didn't ever see a problem with this
because we don't have any unit tests using vm_create_default(),
but it's good to get it fixed in advance.)
Reported-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

55eda003

kvm: selftests: aarch64: dirty_log_test: fix unaligned memslot size · bffed38d

Andrew Jones authored May 23, 2019

The memory slot size must be aligned to the host's page size. When
testing a guest with a 4k page size on a host with a 64k page size,
then 3 guest pages are not host page size aligned. Since we just need
a nearly arbitrary number of extra pages to ensure the memslot is not
aligned to a 64 host-page boundary for this test, then we can use
16, as that's 64k aligned, but not 64 * 64k aligned.

Fixes: 76d58e0f ("KVM: fix KVM_CLEAR_DIRTY_LOG for memory slots of unaligned size", 2019-04-17)
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

bffed38d

KVM: s390: fix memory slot handling for KVM_SET_USER_MEMORY_REGION · 19ec166c

Christian Borntraeger authored May 24, 2019

kselftests exposed a problem in the s390 handling for memory slots.
Right now we only do proper memory slot handling for creation of new
memory slots. Neither MOVE, nor DELETION are handled properly. Let us
implement those.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

19ec166c

KVM: x86/pmu: do not mask the value that is written to fixed PMUs · 2924b521

Paolo Bonzini authored May 20, 2019

According to the SDM, for MSR_IA32_PERFCTR0/1 "the lower-order 32 bits of
each MSR may be written with any value, and the high-order 8 bits are
sign-extended according to the value of bit 31", but the fixed counters
in real hardware are limited to the width of the fixed counters ("bits
beyond the width of the fixed-function counter are reserved and must be
written as zeros"). Fix KVM to do the same.
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2924b521

KVM: x86/pmu: mask the result of rdpmc according to the width of the counters · 0e6f467e

Paolo Bonzini authored May 20, 2019

This patch will simplify the changes in the next, by enforcing the
masking of the counters to RDPMC and RDMSR.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

0e6f467e

x86/kvm/pmu: Set AMD's virt PMU version to 1 · a80c4ec1

Borislav Petkov authored May 08, 2019

After commit:

  672ff6cf ("KVM: x86: Raise #GP when guest vCPU do not support PMU")

my AMD guests started #GPing like this:

  general protection fault: 0000 [#1] PREEMPT SMP
  CPU: 1 PID: 4355 Comm: bash Not tainted 5.1.0-rc6+ #3
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  RIP: 0010:x86_perf_event_update+0x3b/0xa0

with Code: pointing to RDPMC. It is RDPMC because the guest has the
hardware watchdog CONFIG_HARDLOCKUP_DETECTOR_PERF enabled which uses
perf. Instrumenting kvm_pmu_rdpmc() some, showed that it fails due to:

  if (!pmu->version)
  	return 1;

which the above commit added. Since AMD's PMU leaves the version at 0,
that causes the #GP injection into the guest.

Set pmu->version arbitrarily to 1 and move it above the non-applicable
struct kvm_pmu members.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Mihai Carabas <mihai.carabas@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Fixes: 672ff6cf ("KVM: x86: Raise #GP when guest vCPU do not support PMU")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a80c4ec1

KVM: x86: do not spam dmesg with VMCS/VMCB dumps · 6f2f8453

Paolo Bonzini authored May 20, 2019

Userspace can easily set up invalid processor state in such a way that
dmesg will be filled with VMCS or VMCB dumps.  Disable this by default
using a module parameter.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

6f2f8453

kvm: Check irqchip mode before assign irqfd · 654f1f13

Peter Xu authored May 05, 2019

When assigning kvm irqfd we didn't check the irqchip mode but we allow
KVM_IRQFD to succeed with all the irqchip modes.  However it does not
make much sense to create irqfd even without the kernel chips.  Let's
provide a arch-dependent helper to check whether a specific irqfd is
allowed by the arch.  At least for x86, it should make sense to check:

- when irqchip mode is NONE, all irqfds should be disallowed, and,

- when irqchip mode is SPLIT, irqfds that are with resamplefd should
  be disallowed.

For either of the case, previously we'll silently ignore the irq or
the irq ack event if the irqchip mode is incorrect.  However that can
cause misterious guest behaviors and it can be hard to triage.  Let's
fail KVM_IRQFD even earlier to detect these incorrect configurations.

CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Radim Krčmář <rkrcmar@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

654f1f13