Commits · 3e41871046bfe0ba7d122a1f14f0c1db2dca0256 · nexedi / linux

18 Aug, 2015 22 commits

blkcg: make blkcg_activate_policy() allow NULL ->pd_init_fn · 3e418710

Tejun Heo authored Aug 18, 2015

blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
doesn't.  As both in-kernel policies implement ->pd_init_fn, it
currently doesn't break anything.  Update blkcg_activate_policy() so
that its behavior is consistent with blkg_create().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

3e418710

blkcg: restructure blkg_policy_data allocation in blkcg_activate_policy() · 4c55f4f9

Tejun Heo authored Aug 18, 2015

When a policy gets activated, it needs to allocate and install its
policy data on all existing blkg's (blkcg_gq's).  Because blkg
iteration is protected by a spinlock, it currently counts the total
number of blkg's in the system, allocates the matching number of
policy data on a list and installs them during a single iteration.

This can be simplified by using speculative GFP_NOWAIT allocations
while iterating and falling back to a preallocated policy data on
failure.  If the preallocated one has already been consumed, it
releases the lock, preallocate with GFP_KERNEL and then restarts the
iteration.  This can be a bit more expensive than before but policy
activation is a very cold path and shouldn't matter.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

4c55f4f9

blkcg: remove unnecessary blkcg_root handling from css_alloc/free paths · bc915e61

Tejun Heo authored Aug 18, 2015

blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
bypasses policy data and blkcg freeing for blkcg_root.  There's no
reason to to treat policy data any differently for blkcg_root.  If the
root css gets allocated after policies are registered, policy
registration path will add policy data; otherwise, the alloc path
will.  The free path isn't never invoked for root csses.

This patch removes the unnecessary special handling of blkcg_root from
css_alloc/free paths.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

bc915e61

blkcg: use blkg_free() in blkcg_init_queue() failure path · 994b7832

Tejun Heo authored Aug 18, 2015

When blkcg_init_queue() fails midway after creating a new blkg, it
performs kfree() directly; however, this doesn't free the policy data
areas.  Make it use blkg_free() instead.  In turn, blkg_free() is
updated to handle root request_list special case.

While this fixes a possible memory leak, it's on an unlikely failure
path of an already cold path and the size leaked per occurrence is
miniscule too.  I don't think it needs to be tagged for -stable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

994b7832

blkcg: remove unnecessary request_list->blkg NULL test in blk_put_rl() · 401efbf8

Tejun Heo authored Aug 18, 2015

Since ec13b1d6 ("blkcg: always create the blkcg_gq for the root
blkcg"), a request_list always has its blkg associated.  Drop
unnecessary rl->blkg NULL test from blk_put_rl().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

401efbf8

cfq-iosched: charge async IOs to the appropriate blkcg's instead of the root · 60a83707

Tejun Heo authored Aug 18, 2015

Up until now, all async IOs were queued to async queues which are
shared across the whole request_queue, which means that blkcg resource
control is completely void on async IOs including all writeback IOs.
It was done this way because writeback didn't support writeback and
there was no way of telling which writeback IO belonged to which
cgroup; however, writeback recently became cgroup aware and writeback
bio's are sent down properly tagged with the blkcg's to charge them
against.

This patch makes async cfq_queues per-cfq_cgroup instead of
per-cfq_data so that each async IO is charged to the blkcg that it was
tagged for instead of unconditionally attributing it to root.

* cfq_data->async_cfqq and ->async_idle_cfqq are moved to cfq_group
  and alloc / destroy paths are updated accordingly.

* cfq_link_cfqq_cfqg() no longer overrides @cfqg to root for async
  queues.

* check_blkcg_changed() now also invalidates async queues as they no
  longer stay the same across cgroups.

After this patch, cfq's proportional IO control through blkio.weight
works correctly when cgroup writeback is in use.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

60a83707

cfq-iosched: fold cfq_find_alloc_queue() into cfq_get_queue() · d4aad7ff

Tejun Heo authored Aug 18, 2015

cfq_find_alloc_queue() checks whether a queue actually needs to be
allocated, which is unnecessary as its sole caller, cfq_get_queue(),
only calls it if so.  Also, the oom queue fallback logic is scattered
between cfq_get_queue() and cfq_find_alloc_queue().  There really
isn't much going on in the latter and things can be made simpler by
folding it into cfq_get_queue().

This patch collapses cfq_find_alloc_queue() into cfq_get_queue().  The
change is fairly straight-forward with one exception - async_cfqq is
now initialized to NULL and the "!is_sync" test in the last if
conditional is replaced with "async_cfqq" test.  This is because gcc
(5.1.1) gets confused for some reason and warns that async_cfqq may be
used uninitialized otherwise.  Oh well, the code isn't necessarily
worse this way.

This patch doesn't cause any functional difference.

v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

d4aad7ff

cfq-iosched: move cfq_group determination from cfq_find_alloc_queue() to cfq_get_queue() · 322731ed

Tejun Heo authored Aug 18, 2015

This is necessary for making async cfq_cgroups per-cfq_group instead
of per-cfq_data.  While this change makes cfq_get_queue() perform RCU
locking and look up cfq_group even when it reuses async queue, the
extra overhead is extremely unlikely to be noticeable given that this
is already sitting behind cic->cfqq[] cache and the overall cost of
cfq operation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

322731ed

cfq-iosched: remove @gfp_mask from cfq_find_alloc_queue() · 2da8de0b

Tejun Heo authored Aug 18, 2015

Even when allocations fail, cfq_find_alloc_queue() always returns a
valid cfq_queue by falling back to the oom cfq_queue.  As such, there
isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT
is set.  GFP_NOWAIT allocations don't fail often and even when they do
the degraded behavior is acceptable and temporary.

After all, the only reason get_request(), which ultimately determines
the gfp_mask, cares about __GFP_WAIT is to guarantee request
allocation, assuming IO forward progress, for callers which are
willing to wait.  There's no reason for cfq_find_alloc_queue() to
behave differently on __GFP_WAIT when it already has a fallback
mechanism.

Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes
to its callers.  This simplifies the function quite a bit and will
help making async queues per-cfq_group.

v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

2da8de0b

blkcg, cfq-iosched: use GFP_NOWAIT instead of GFP_ATOMIC for non-critical allocations · d93a11f1

Tejun Heo authored Aug 18, 2015

blkcg performs several allocations to track IOs per cgroup and enforce
resource control.  Most of these allocations are performed lazily on
demand in the IO path and thus can't involve reclaim path.  Currently,
these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
with occassional failures of these allocations by punting IOs to the
root cgroup and there's no reason to reach into the emergency reserve.

This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
allocations.

* bdi_writeback_congested and blkcg_gq allocations in blkg_create().

* radix tree node allocations for blkcg->blkg_tree.

* cfq_queue allocation on ioprio changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

d93a11f1

cfq-iosched: minor cleanups · 563180a4

Tejun Heo authored Aug 18, 2015

* Some were accessing cic->cfqq[] directly.  Always use cic_to_cfqq()
  and cic_set_cfqq().

* check_ioprio_changed() doesn't need to verify cfq_get_queue()'s
  return for NULL.  It's always non-NULL.  Simplify accordingly.

This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

563180a4

cfq-iosched: fix oom cfq_queue ref leak in cfq_set_request() · bce6133b

Tejun Heo authored Aug 18, 2015

If the cfq_queue cached in cfq_io_cq is the oom one, cfq_set_request()
replaces it by invoking cfq_get_queue() again without putting the oom
queue leaking the reference it was holding.  While oom queues are not
released through reference counting, they're still reference counted
and this can theoretically lead to the reference count overflowing and
incorrectly invoke the usual release path on it.

Fix it by making cfq_set_request() put the ref it was holding.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

bce6133b

cfq-iosched: fix async oom queue handling · 95e5d6f6

Tejun Heo authored Aug 18, 2015

Async cfqq's (cfq_queue's) are shared across cfq_data.  When
cfq_get_queue() obtains a new queue from cfq_find_alloc_queue(), it
stashes the pointer in cfq_data and reuses it from then on; however,
the function doesn't consider that cfq_find_alloc_queue() may return
the oom_cfqq under memory pressure and installs the returned queue
unconditionally.

If the oom_cfqq is installed as an async cfqq, cfq_set_request() will
continue calling cfq_get_queue() hoping to replace it with a proper
queue; however, cfq_get_queue() will keep returning the cached queue
for the slot - the oom_cfqq.

Fix it by skipping caching if the queue is the oom one.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

95e5d6f6

cfq-iosched: simplify control flow in cfq_get_queue() · 4ebc1c61

Tejun Heo authored Aug 18, 2015

cfq_get_queue()'s control flow looks like the following.

	async_cfqq = NULL;
	cfqq = NULL;

	if (!is_sync) {
		...
		async_cfqq = ...;
		cfqq = *async_cfqq;
	}

	if (!cfqq)
		cfqq = ...;

	if (!is_sync && !(*async_cfqq))
		...;

The only thing the local variable init, the second if, and the
async_cfqq test in the third if achieves is to skip cfqq creation and
installation if *async_cfqq was already non-NULL.  This is needlessly
complicated with different tests examining the same condition.
Simplify it to the following.

	if (!is_sync) {
		...
		async_cfqq = ...;
		cfqq = *async_cfqq;
		if (cfqq)
			goto out;
	}

	cfqq = ...;

	if (!is_sync)
		...;
 out:
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

4ebc1c61

writeback: update writeback tracepoints to report cgroup · 5634cc2a

Tejun Heo authored Aug 18, 2015

The following tracepoints are updated to report the cgroup used during
cgroup writeback.

* writeback_write_inode[_start]
* writeback_queue
* writeback_exec
* writeback_start
* writeback_written
* writeback_wait
* writeback_nowork
* writeback_wake_background
* wbc_writepage
* writeback_queue_io
* bdi_dirty_ratelimit
* balance_dirty_pages
* writeback_sb_inodes_requeue
* writeback_single_inode[_start]

Note that writeback_bdi_register is separated out from writeback_class
as reporting cgroup doesn't make sense to it.  Tracepoints which take
bdi are updated to take bdi_writeback instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

5634cc2a

kernfs: implement kernfs_path_len() · 9acee9c5

Tejun Heo authored Aug 18, 2015

Add a function to determine the path length of a kernfs node.  This
for now will be used by writeback tracepoint updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

9acee9c5

writeback: explain why @inode is allowed to be NULL for inode_congested() · 60292bcc

Tejun Heo authored Aug 18, 2015

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

60292bcc

writeback: remove wb_writeback_work->single_wait/done · 8a1270cd

Tejun Heo authored Aug 18, 2015

wb_writeback_work->single_wait/done are used for the wait mechanism
for synchronous wb_work (wb_writeback_work) items which are issued
when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
wb_work items; however, there's no reason to use a separate wait
mechanism for this.  bdi_split_work_to_wbs() can simply use on-stack
fallback wb_work item and separate wb_completion to wait for it.

This patch removes wb_work->single_wait/done and the related code and
make bdi_split_work_to_wbs() use on-stack fallback wb_work and
wb_completion instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

8a1270cd

writeback: bdi_for_each_wb() iteration is memcg ID based not blkcg · 1ed8d48c

Tejun Heo authored Aug 18, 2015

wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
earlier implementation, wb's were keyed by blkcg ID.
bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
allows iterations to start from an arbitrary ID which is used to
interrupt and resume iterations.

Unfortunately, while changing wb to be keyed by memcg ID instead of
blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
are keyed by blkcg ID.  This doesn't affect iterations which don't get
interrupted but bdi_split_work_to_wbs() makes use of iteration
resuming on allocation failures and thus may incorrectly skip or
repeat wb's.

Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
blkcg IDs and updating bdi_split_work_to_wbs() accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

1ed8d48c

Merge branch 'for-4.3-unified-base' of... · 11743ee0

Jens Axboe authored Aug 18, 2015

Merge branch 'for-4.3-unified-base' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-4.3/blkcg

11743ee0

cgroup: introduce cgroup_subsys->legacy_name · 3e1d2eed

Tejun Heo authored Aug 18, 2015

This allows cgroup subsystems to use a different name on the unified
hierarchy.  cgroup_subsys->name is used on the unified hierarchy,
->legacy_name elsewhere.  If ->legacy_name is not explicitly set, it's
automatically set to ->name and the userland visible behavior remains
unchanged.

v2: Make parse_cgroupfs_options() only consider ->legacy_name as mount
    options are used only on legacy hierarchies.  Suggested by Li
    Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org

3e1d2eed

cgroup: don't print subsystems for the default hierarchy · d98817d4

Tejun Heo authored Aug 18, 2015

It doesn't make sense to print subsystems on mount option or
/proc/PID/cgroup for the default hierarchy.

* cgroup.controllers file at the root of the default hierarchy lists
  the currently attached controllers.

* The default hierarchy is catch-all for unmounted subsystems.

* The default hierarchy doesn't accept any mount options.

Suppress subsystem printing on mount options and /proc/PID/cgroup for
the default hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org

d98817d4

11 Aug, 2015 1 commit

cgroup: make cftype->private a unsigned long · 731a981e

Tejun Heo authored Aug 11, 2015

It's pretty unusual to have an int as a private data field and it
makes it impossible to carray a pointer value through it.  Let's make
it an unsigned long.  AFAICS, this shouldn't break anything.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

731a981e

05 Aug, 2015 1 commit

cgroup: export cgrp_dfl_root · d0ec4230

Tejun Heo authored Aug 05, 2015

While cgroup subsystems can't be modules, blkcg supports dynamically
loadable policies which interact with cgroup core.  Export
cgrp_dfl_root so that cgroup_on_dfl() can be used in those modules.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

d0ec4230

04 Aug, 2015 1 commit

cgroup: define controller file conventions · 6abc8ca1

Tejun Heo authored Aug 04, 2015

Traditionally, each cgroup controller implemented whatever interface
it wanted leading to interfaces which are widely inconsistent.
Examining the requirements of the controllers readily yield that there
are only a few control schemes shared among all.

Two major controllers already had to implement new interface for the
unified hierarchy due to significant structural changes.  Let's take
the chance to establish common conventions throughout all controllers.

This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight
based control knobs and documents the conventions that controllers
should follow on the unified hierarchy.  Except for io.weight knob,
all existing unified hierarchy knobs are already compliant.  A
follow-up patch will update io.weight.

v2: Added descriptions of min, low and high knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>

6abc8ca1

27 Jul, 2015 1 commit

Merge branch 'stable/for-jens-4.2' of... · e162b219

Jens Axboe authored Jul 27, 2015

Merge branch 'stable/for-jens-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus

Konrad writes:

"There are three bugs that have been found in the xen-blkfront (and
backend). Two of them have the stable tree CC-ed. They have been found
where an guest is migrating to a host that is missing
'feature-persistent' support (from one that has it enabled). We end up
hitting an BUG() in the driver code."

e162b219

26 Jul, 2015 9 commits

Linux 4.2-rc4 · cbfe8fa6
Linus Torvalds authored Jul 26, 2015

cbfe8fa6

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2579d019

Linus Torvalds authored Jul 26, 2015

Pull perf fix from Thomas Gleixner:
 "A single fix for the intel cqm perf facility to prevent IPIs from
  interrupt context"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/cqm: Return cached counter value from IRQ context

2579d019

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 28003486

Linus Torvalds authored Jul 26, 2015

Pull x86 fixes from Thomas Gleixner:
 "This update contains:

   - the manual revert of the SYSCALL32 changes which caused a
     regression

   - a fix for the MPX vma handling

   - three fixes for the ioremap 'is ram' checks.

   - PAT warning fixes

   - a trivial fix for the size calculation of TLB tracepoints

   - handle old EFI structures gracefully

  This also contains a PAT fix from Jan plus a revert thereof.  Toshi
  explained why the code is correct"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/pat: Revert 'Adjust default caching mode translation tables'
  x86/asm/entry/32: Revert 'Do not use R9 in SYSCALL32' commit
  x86/mm: Fix newly introduced printk format warnings
  mm: Fix bugs in region_is_ram()
  x86/mm: Remove region_is_ram() call from ioremap
  x86/mm: Move warning from __ioremap_check_ram() to the call site
  x86/mm/pat, drivers/media/ivtv: Move the PAT warning and replace WARN() with pr_warn()
  x86/mm/pat, drivers/infiniband/ipath: Replace WARN() with pr_warn()
  x86/mm/pat: Adjust default caching mode translation tables
  x86/fpu: Disable dependent CPU features on "noxsave"
  x86/mpx: Do not set ->vm_ops on MPX VMAs
  x86/mm: Add parenthesis for TLB tracepoint size calculation
  efi: Handle memory error structures produced based on old versions of standard

28003486

x86/mm/pat: Revert 'Adjust default caching mode translation tables' · 1a4e8795

Thomas Gleixner authored Jul 26, 2015

Toshi explains:

"No, the default values need to be set to the fallback types,
i.e. minimal supported mode. For WC and WT, UC is the fallback type.

When PAT is disabled, pat_init() does update the tables below to
enable WT per the default BIOS setup. However, when PAT is enabled,
but CPU has PAT -errata, WT falls back to UC per the default values."

Revert: ca1fec58 'x86/mm/pat: Adjust default caching mode translation tables'
Requested-by: Toshi Kani <toshi.kani@hp.com>
Cc: Jan Beulich <jbeulich@suse.de>
Link: http://lkml.kernel.org/r/1437577776.3214.252.camel@hp.comSigned-off-by: Thomas Gleixner <tglx@linutronix.de>

1a4e8795

perf/x86/intel/cqm: Return cached counter value from IRQ context · 2c534c0d

Matt Fleming authored Jul 21, 2015

Peter reported the following potential crash which I was able to
reproduce with his test program,

[  148.765788] ------------[ cut here ]------------
[  148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260()
[  148.765797] Modules linked in:
[  148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4
[  148.765803]  ffffffff81cdc398 ffff88085f105950 ffffffff818bdfd5 0000000000000007
[  148.765805]  0000000000000000 ffff88085f105990 ffffffff810e413a 0000000000000000
[  148.765807]  ffffffff82301080 0000000000000022 ffffffff8107f640 ffffffff8107f640
[  148.765809] Call Trace:
[  148.765810]  <NMI>  [<ffffffff818bdfd5>] dump_stack+0x45/0x57
[  148.765818]  [<ffffffff810e413a>] warn_slowpath_common+0x8a/0xc0
[  148.765822]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[  148.765824]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[  148.765825]  [<ffffffff810e422a>] warn_slowpath_null+0x1a/0x20
[  148.765827]  [<ffffffff811613f6>] smp_call_function_many+0xb6/0x260
[  148.765829]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[  148.765831]  [<ffffffff81161748>] on_each_cpu_mask+0x28/0x60
[  148.765832]  [<ffffffff8107f6ef>] intel_cqm_event_count+0x7f/0xe0
[  148.765836]  [<ffffffff811cdd35>] perf_output_read+0x2a5/0x400
[  148.765839]  [<ffffffff811d2e5a>] perf_output_sample+0x31a/0x590
[  148.765840]  [<ffffffff811d333d>] ? perf_prepare_sample+0x26d/0x380
[  148.765841]  [<ffffffff811d3497>] perf_event_output+0x47/0x60
[  148.765843]  [<ffffffff811d36c5>] __perf_event_overflow+0x215/0x240
[  148.765844]  [<ffffffff811d4124>] perf_event_overflow+0x14/0x20
[  148.765847]  [<ffffffff8107e7f4>] intel_pmu_handle_irq+0x1d4/0x440
[  148.765849]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765853]  [<ffffffff81219bad>] ? vunmap_page_range+0x19d/0x2f0
[  148.765854]  [<ffffffff81219d11>] ? unmap_kernel_range_noflush+0x11/0x20
[  148.765859]  [<ffffffff814ce6fe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
[  148.765863]  [<ffffffff8109e5db>] ? native_apic_msr_write+0x2b/0x30
[  148.765865]  [<ffffffff8109e44d>] ? x2apic_send_IPI_self+0x1d/0x20
[  148.765869]  [<ffffffff81065135>] ? arch_irq_work_raise+0x35/0x40
[  148.765872]  [<ffffffff811c8d86>] ? irq_work_queue+0x66/0x80
[  148.765875]  [<ffffffff81075306>] perf_event_nmi_handler+0x26/0x40
[  148.765877]  [<ffffffff81063ed9>] nmi_handle+0x79/0x100
[  148.765879]  [<ffffffff81064422>] default_do_nmi+0x42/0x100
[  148.765880]  [<ffffffff81064563>] do_nmi+0x83/0xb0
[  148.765884]  [<ffffffff818c7c0f>] end_repeat_nmi+0x1e/0x2e
[  148.765886]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765888]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765890]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765891]  <<EOE>>  [<ffffffff8110ab66>] finish_task_switch+0x156/0x210
[  148.765898]  [<ffffffff818c1671>] __schedule+0x341/0x920
[  148.765899]  [<ffffffff818c1c87>] schedule+0x37/0x80
[  148.765903]  [<ffffffff810ae1af>] ? do_page_fault+0x2f/0x80
[  148.765905]  [<ffffffff818c1f4a>] schedule_user+0x1a/0x50
[  148.765907]  [<ffffffff818c666c>] retint_careful+0x14/0x32
[  148.765908] ---[ end trace e33ff2be78e14901 ]---

The CQM task events are not safe to be called from within interrupt
context because they require performing an IPI to read the counter value
on all sockets. And performing IPIs from within IRQ context is a
"no-no".

Make do with the last read counter value currently event in
event->count when we're invoked in this context.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Will Auld <will.auld@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-matt@codeblueprint.co.ukSigned-off-by: Thomas Gleixner <tglx@linutronix.de>

2c534c0d

Merge tag 'usb-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 26ae19a3

Linus Torvalds authored Jul 25, 2015

Pull USB fixes from Greg KH:
 "Here's a few USB and PHY fixes for 4.2-rc4.

  Nothing major, the shortlog has the full details.

  All of these have been in linux-next successfully"

* tag 'usb-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (21 commits)
  USB: OHCI: fix bad #define in ohci-tmio.c
  cdc-acm: Destroy acm_minors IDR on module exit
  usb-storage: Add ignore-device quirk for gm12u320 based usb mini projectors
  usb-storage: ignore ZTE MF 823 card reader in mode 0x1225
  USB: OHCI: Fix race between ED unlink and URB submission
  usb: core: lpm: set lpm_capable for root hub device
  xhci: do not report PLC when link is in internal resume state
  xhci: prevent bus_suspend if SS port resuming in phase 1
  xhci: report U3 when link is in resume state
  xhci: Calculate old endpoints correctly on device reset
  usb: xhci: Bugfix for NULL pointer deference in xhci_endpoint_init() function
  xhci: Workaround to get D3 working in Intel xHCI
  xhci: call BIOS workaround to enable runtime suspend on Intel Braswell
  usb: dwc3: Reset the transfer resource index on SET_INTERFACE
  usb: gadget: udc: core: Fix argument of dma_map_single for IOMMU
  usb: gadget: mv_udc_core: fix phy_regs I/O memory leak
  usb: ulpi: ulpi_init should be executed in subsys_initcall
  phy: berlin-usb: fix divider for BG2
  phy: berlin-usb: fix divider for BG2CD
  phy/pxa: add HAS_IOMEM dependency
  ...

26ae19a3

Merge tag 'tty-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 82b35f37

Linus Torvalds authored Jul 25, 2015

Pull tty/serial driver fixes from Greg KH:
 "Here are a number of small serial and tty fixes for reported issues.

  All have been in linux-next successfully"

* tag 'tty-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: vt: Fix !TASK_RUNNING diagnostic warning from paste_selection()
  serial: core: Fix crashes while echoing when closing
  m32r: Add ioreadXX/iowriteXX big-endian mmio accessors
  Revert "serial: imx: initialized DMA w/o HW flow enabled"
  sc16is7xx: fix FIFO address of secondary UART
  sc16is7xx: fix Kconfig dependencies
  serial: etraxfs-uart: Fix release etraxfs_uart_ports
  tty/vt: Fix the memory leak in visual_init
  serial: amba-pl011: Fix devm_ioremap_resource return value check
  n_tty: signal and flush atomically

82b35f37

Merge tag 'staging-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · b0de415a

Linus Torvalds authored Jul 25, 2015

Pull staging driver fixes from Greg KH:
 "Here are a number of iio and staging driver fixes for reported issues
  for 4.2-rc4.

  All have been in linux-next for a while with no problems"

* tag 'staging-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (34 commits)
  iio:light:stk3310: make endianness independent of host
  iio:light:stk3310: move device register to end of probe
  iio: mma8452: use iio event type IIO_EV_TYPE_MAG
  iio: mcp320x: Fix NULL pointer dereference
  iio: adc: vf610: fix the adc register read fail issue
  iio: mlx96014: Replace offset sign
  iio: magnetometer: mmc35240: fix SET/RESET sequence
  iio: magnetometer: mmc35240: Fix SET/RESET mask
  iio: magnetometer: mmc35240: Fix crash in pm suspend
  iio:magnetometer:bmc150_magn: output intended variable
  iio:magnetometer:bmc150_magn: add regmap dependency
  staging: vt6656: check ieee80211_bss_conf bssid not NULL
  staging: vt6655: check ieee80211_bss_conf bssid not NULL
  iio: tmp006: Check channel info on write
  iio: sx9500: Add missing init in sx9500_buffer_pre{en,dis}able()
  iio:light:ltr501: fix regmap dependency
  iio:light:ltr501: fix variable in ltr501_init
  iio: sx9500: fix bug in compensation code
  iio: sx9500: rework error handling of raw readings
  iio: magnetometer: mmc35240: fix available sampling frequencies
  ...

b0de415a

Merge tag 'char-misc-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · e433b656

Linus Torvalds authored Jul 25, 2015

Pull char/misc driver fixes from Greg KH:
 "Here are some char and misc driver fixes for reported issues.

  One parport patch is reverted as it was incorrect, thanks to testing
  by the 0-day bot"

* tag 'char-misc-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  parport: Revert "parport: fix memory leak"
  mei: prevent unloading mei hw modules while the device is opened.
  misc: mic: scif bug fix for vmalloc_to_page crash
  parport: fix freeing freed memory
  parport: fix memory leak
  parport: fix error handling

e433b656

25 Jul, 2015 5 commits

parport: Revert "parport: fix memory leak" · 4e5a74f1

Sudip Mukherjee authored Jul 25, 2015

This reverts commit 23c40591 ("parport: fix memory leak")

par_dev->state was already being removed in parport_unregister_device().
Reported-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

4e5a74f1

Merge tag 'trace-v4.2-rc2-fix3' of... · 763e326c

Linus Torvalds authored Jul 25, 2015

Merge tag 'trace-v4.2-rc2-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fix from Steven Rostedt:
 "Back in 3.16 the ftrace code was redesigned and cleaned up to remove
  the double iteration list (one for registered ftrace ops, and one for
  registered "global" ops), to just use one list.  That simplified the
  code but also broke the function tracing filtering on pid.

  This updates the code to handle the filtering again with the new
  logic"

* tag 'trace-v4.2-rc2-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix breakage of set_ftrace_pid

763e326c

Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm · 45200838

Linus Torvalds authored Jul 25, 2015

Pull libnvdimm fix from Dan Williams:
 "A minor fix for the libnvdimm subsystem.

  This is not critical.  The problem can be worked around in userspace
  by putting the namespace temporarily into raw mode
  (ndctl_namespace_set_raw_mode() from libndctl), but that is awkward
  for management utilities.

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm:
  libnvdimm: fix namespace seed creation

45200838

Merge tag 'md/4.2-fixes' of git://neil.brown.name/md · aca105a6

Linus Torvalds authored Jul 25, 2015

Pull md fixes from Neil Brown:
 "Some md fixes for 4.2

  Several are tagged for -stable.
  A few aren't because they are not very, serious or because they are in
  the 'experimental' cluster code"

* tag 'md/4.2-fixes' of git://neil.brown.name/md:
  md/raid5: clear R5_NeedReplace when no longer needed.
  Fix read-balancing during node failure
  md-cluster: fix bitmap sub-offset in bitmap_read_sb
  md: Return error if request_module fails and returns positive value
  md: Skip cluster setup in case of error while reading bitmap
  md/raid1: fix test for 'was read error from last working device'.
  md: Skip cluster setup for dm-raid
  md: flush ->event_work before stopping array.
  md/raid10: always set reshape_safe when initializing reshape_position.
  md/raid5: avoid races when changing cache size.

aca105a6

Merge tag 'for-linus-20150724' of git://git.infradead.org/linux-mtd · 32fd3d4a

Linus Torvalds authored Jul 25, 2015

Pull MTD fixes from Brian Norris:
 "Two trivial updates.  I meant to send these much earlier, but I've
  been preoccupied.

   - Add MAINTAINERS entry for diskonchip g3 driver

   - Fix an overlooked conflict in bitfield value assignments

  The latter update is a bit overdue, but there's no reason to wait any
  longer"

* tag 'for-linus-20150724' of git://git.infradead.org/linux-mtd:
  mtd: nand: Fix NAND_USE_BOUNCE_BUFFER flag conflict
  MAINTAINERS: mtd: docg3: add docg3 maintainer

32fd3d4a