1. 16 Jan, 2015 6 commits
    • Merge branches 'doc.2015.01.07a', 'fixes.2015.01.15a', 'preempt.2015.01.06a',... · 78e691f4
      Paul E. McKenney authored
      Merge branches 'doc.2015.01.07a', 'fixes.2015.01.15a', 'preempt.2015.01.06a', 'srcu.2015.01.06a', 'stall.2015.01.16a' and 'torture.2015.01.11a' into HEAD
      
      doc.2015.01.07a: Documentation updates.
      fixes.2015.01.15a: Miscellaneous fixes.
      preempt.2015.01.06a: Changes to handling of lists of preempted tasks.
      srcu.2015.01.06a: SRCU updates.
      stall.2015.01.16a: RCU CPU stall-warning updates and fixes.
      torture.2015.01.11a: RCU torture-test updates and fixes.
    • rcu: Initialize tiny RCU stall-warning timeouts at boot · 630181c4
      Paul E. McKenney authored
      The current tiny RCU stall-warning code assumes that the jiffies
      counter starts at zero.  However, it is sometimes initialized to other
      values, for example, -30,000.  This commit therefore changes rcu_init()
      to invoke reset_cpu_stall_ticks() for both flavors of RCU to initialize
      the stall-warning times properly at boot.
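A stand-alone sketch of the idea (illustrative only, not the kernel code; reset_stall_deadline() is a hypothetical stand-in for reset_cpu_stall_ticks()): the stall deadline must be computed relative to the current jiffies value and compared wrap-safely, because jiffies may begin at a large value such as (unsigned long)-30000.

```c
/* Hypothetical stand-in for reset_cpu_stall_ticks(): compute the next
 * stall-check deadline relative to the current jiffies value rather
 * than assuming the counter starts at zero. */
unsigned long reset_stall_deadline(unsigned long jiffies_now,
                                   unsigned long timeout)
{
        return jiffies_now + timeout;
}

/* Wrap-safe "has the deadline passed?" test, in the style of the
 * kernel's time_after() macro. */
int deadline_passed(unsigned long jiffies_now, unsigned long deadline)
{
        return (long)(jiffies_now - deadline) >= 0;
}
```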
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Fix RCU CPU stall detection in tiny implementation · ec1fe396
      Miroslav Benes authored
      The tiny RCU CPU stall detection depends on *rcp->curtail not being
      NULL. It is however a tail pointer and thus NULL by definition. Instead we
      should check rcp->rcucblist for the presence of pending callbacks which
      need to be processed. With this fix, INFO about the stall is printed
      and jiffies_stall (the jiffies value of the next stall check) is
      correctly updated.
      
      Note that the check for pending callbacks is necessary to avoid
      spurious warnings if there are no pending callbacks.
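The tail-pointer issue can be sketched in stand-alone C (field names mirror tiny RCU's rcu_ctrlblk, but this is an illustration, not the kernel code): curtail always points at the NULL pointer that terminates the list, so *curtail is NULL by definition and can never indicate pending work; the head pointer must be checked instead.

```c
#include <stddef.h>

struct cb {
        struct cb *next;
};

struct ctrlblk {
        struct cb *rcucblist;   /* head of the callback list */
        struct cb **curtail;    /* points at the terminating NULL */
};

void cblist_init(struct ctrlblk *rcp)
{
        rcp->rcucblist = NULL;
        rcp->curtail = &rcp->rcucblist;
}

void cblist_enqueue(struct ctrlblk *rcp, struct cb *c)
{
        c->next = NULL;
        *rcp->curtail = c;
        rcp->curtail = &c->next;
}

/* Buggy check: *curtail is always NULL, so this never fires. */
int have_callbacks_buggy(struct ctrlblk *rcp)
{
        return *rcp->curtail != NULL;
}

/* Fixed check: look at the head of the list instead. */
int have_callbacks_fixed(struct ctrlblk *rcp)
{
        return rcp->rcucblist != NULL;
}
```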
      Signed-off-by: Miroslav Benes <mbenes@suse.cz>
      [ paulmck: Fused identical "if" statements, ported to -rcu. ]
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Add GP-kthread-starvation checks to CPU stall warnings · fb81a44b
      Paul E. McKenney authored
      This commit adds a message that is printed if the relevant grace-period
      kthread has not been able to run for the two seconds preceding the
      stall warning.  (The two seconds is double the maximum interval between
      successive bouts of quiescent-state forcing.)
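The check described above can be sketched as stand-alone C (an illustration of the described logic, not the kernel code; HZ and the function name are assumptions for this example): a wrap-safe comparison reports possible starvation once the kthread has made no progress for more than twice the maximum forcing interval.

```c
#define HZ 1000  /* assumed tick rate for this example */

/* True iff jiffies_now is more than 2*HZ past the grace-period
 * kthread's last recorded activity, using wrap-safe arithmetic. */
int gp_kthread_starving(unsigned long jiffies_now, unsigned long last_activity)
{
        return (long)(jiffies_now - last_activity) > 2 * HZ;
}
```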
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors · 5cd37193
      Paul E. McKenney authored
      Although cond_resched_rcu_qs() only applies to TASKS_RCU, it is used
      in places where it would be useful for it to apply to the normal RCU
      flavors, rcu_preempt, rcu_sched, and rcu_bh.  This is especially the
      case for workloads that aggressively overload the system, particularly
      those that generate large numbers of RCU updates on systems running
      NO_HZ_FULL CPUs.  This commit therefore communicates quiescent states
      from cond_resched_rcu_qs() to the normal RCU flavors.
      
      Note that it is unfortunately necessary to leave the old ->passed_quiesce
      mechanism in place to allow quiescent states that apply to only one
      flavor to be recorded.  (Yes, we could decrement ->rcu_qs_ctr_snap in
      that case, but that is not so good for debugging of RCU internals.)
      In addition, if one of the RCU flavors' grace periods has stalled,
      this will invoke rcu_momentary_dyntick_idle(), resulting in a
      heavy-weight quiescent state visible from other CPUs.
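The counter-and-snapshot scheme described above can be modeled in stand-alone C (field names follow the commit text's ->rcu_qs_ctr_snap, but this is an illustrative model, not the kernel code): cond_resched_rcu_qs() bumps a per-CPU counter, the grace-period machinery snapshots it at grace-period start, and any later change proves the CPU passed through a quiescent state.

```c
struct cpu_qs {
        unsigned long rcu_qs_ctr;       /* bumped by cond_resched_rcu_qs() */
        unsigned long rcu_qs_ctr_snap;  /* snapshot taken at GP start */
};

/* Take the grace-period-start snapshot of the CPU's counter. */
void gp_start_snapshot(struct cpu_qs *cq)
{
        cq->rcu_qs_ctr_snap = cq->rcu_qs_ctr;
}

/* Model of cond_resched_rcu_qs(): report a quiescent state. */
void cond_resched_rcu_qs_model(struct cpu_qs *cq)
{
        cq->rcu_qs_ctr++;
}

/* Any counter movement since the snapshot proves a quiescent state. */
int cpu_passed_qs(struct cpu_qs *cq)
{
        return cq->rcu_qs_ctr != cq->rcu_qs_ctr_snap;
}
```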
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Merge commit from Sasha Levin fixing a bug where __this_cpu()
        was used in preemptible code. ]
    • rcu: Optionally run grace-period kthreads at real-time priority · a94844b2
      Paul E. McKenney authored
      Recent testing has shown that under heavy load, running RCU's grace-period
      kthreads at real-time priority can improve performance (according to 0day
      test robot) and reduce the incidence of RCU CPU stall warnings.  However,
      most systems do just fine with the default non-realtime priorities for
      these kthreads, and it does not make sense to expose the entire user
      base to any risk stemming from this change, given that this change is
      of use only to a few users running extremely heavy workloads.
      
      Therefore, this commit allows users to specify realtime priorities
      for the grace-period kthreads, but leaves them running SCHED_OTHER
      by default.  The realtime priority may be specified at build time
      via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the
      rcutree.kthread_prio parameter.  Either way, 0 says to continue the
      default SCHED_OTHER behavior, and values from 1 to 99 specify the
      corresponding SCHED_FIFO priority.  Note that a value of 0 is not
      permitted when the RCU_BOOST Kconfig parameter is specified.
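The parameter's semantics can be made concrete with a small stand-alone sketch (the validation below is an assumed illustration of the stated rules, not the kernel's actual parameter handling): 0 keeps SCHED_OTHER, 1 to 99 select that SCHED_FIFO priority, and RCU_BOOST builds require a nonzero value.

```c
enum policy { POLICY_OTHER, POLICY_FIFO };

/* Illustrative validation of the kthread_prio rules described above:
 * out-of-range values fall back to the default, and RCU_BOOST forces
 * at least priority 1. */
int resolve_kthread_prio(int requested, int rcu_boost, enum policy *pol)
{
        if (requested < 0 || requested > 99)
                requested = 0;  /* out of range: default SCHED_OTHER */
        if (rcu_boost && requested == 0)
                requested = 1;  /* RCU_BOOST requires a realtime priority */
        *pol = requested ? POLICY_FIFO : POLICY_OTHER;
        return requested;
}
```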
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  2. 14 Jan, 2015 2 commits
    • ksoftirqd: Use new cond_resched_rcu_qs() function · 60479676
      Paul E. McKenney authored
      Simplify run_ksoftirqd() by using the new cond_resched_rcu_qs() function
      that conditionally reschedules, but unconditionally supplies an RCU
      quiescent state.  This commit is separate from the previous commit by
      Calvin Owens because Calvin's approach can be backported, while this
      commit cannot be.  The reason that this commit cannot be backported is
      that cond_resched_rcu_qs() does not always provide the needed quiescent
      state in earlier kernels.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • ksoftirqd: Enable IRQs and call cond_resched() before poking RCU · 28423ad2
      Calvin Owens authored
      While debugging an issue with excessive softirq usage, I encountered the
      following note in commit 3e339b5d ("softirq: Use hotplug thread
      infrastructure"):
      
          [ paulmck: Call rcu_note_context_switch() with interrupts enabled. ]
      
      ...but despite this note, the patch still calls RCU with IRQs disabled.
      
      This seemingly innocuous change caused a significant regression in softirq
      CPU usage on the sending side of a large TCP transfer (~1 GB/s): when
      introducing 0.01% packet loss, the softirq usage would jump to around 25%,
      spiking as high as 50%. Before the change, the usage would never exceed 5%.
      
      Moving the call to rcu_note_context_switch() after the cond_resched()
      call, as it was originally before the hotplug patch, completely
      eliminated this problem.
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  3. 11 Jan, 2015 12 commits
  4. 07 Jan, 2015 2 commits
  5. 06 Jan, 2015 18 commits
    • rcu: Handle gpnum/completed wrap while dyntick idle · e3663b10
      Paul E. McKenney authored
      Subtle race conditions can result if a CPU stays in dyntick-idle mode
      long enough for the ->gpnum and ->completed fields to wrap.  For
      example, consider the following sequence of events:
      
      o	CPU 1 encounters a quiescent state while waiting for grace period
      	5 to complete, but then enters dyntick-idle mode.
      
      o	While CPU 1 is in dyntick-idle mode, the grace-period counters
      	wrap around so that the grace period number is now 4.
      
      o	Just as CPU 1 exits dyntick-idle mode, grace period 4 completes
      	and grace period 5 begins.
      
      o	The quiescent state that CPU 1 passed through during the old
      	grace period 5 looks like it applies to the new grace period
      	5.  Therefore, the new grace period 5 completes without CPU 1
      	having passed through a quiescent state.
      
      This could clearly be a fatal surprise to any long-running RCU read-side
      critical section that happened to be running on CPU 1 at the time.  At one
      time, this was not a problem, given that it takes significant time for
      the grace-period counters to overflow even on 32-bit systems.  However,
      with the advent of NO_HZ_FULL and SMP embedded systems, arbitrarily long
      idle periods are now becoming quite feasible.  It is therefore time to
      close this race.
      
      This commit therefore avoids this race condition by having the
      quiescent-state forcing code detect when a CPU is falling too far
      behind, and setting a new rcu_data field ->gpwrap when this happens.
      Whenever this new ->gpwrap field is set, the CPU's ->gpnum and ->completed
      fields are known to be untrustworthy, and can be ignored, along with
      any associated quiescent states.
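The wrap detection can be sketched in stand-alone C (an illustration of the described check, not the kernel code; the ULONG_MAX/4 margin and the ULONG_CMP_LT() comparison style match the kernel's conventions): a CPU whose last-seen grace-period number has fallen far enough behind the current one is flagged, because its counters may have wrapped while it was idle.

```c
#include <limits.h>

/* Wrap-safe "a < b" for grace-period counters. */
#define ULONG_CMP_LT(a, b)  ((long)((a) - (b)) < 0)

/* True iff the CPU has fallen so far behind the current grace-period
 * number that its ->gpnum/->completed values can no longer be trusted
 * (the condition under which ->gpwrap would be set). */
int gp_counters_wrapped(unsigned long cpu_gpnum, unsigned long current_gpnum)
{
        return ULONG_CMP_LT(cpu_gpnum + ULONG_MAX / 4, current_gpnum);
}
```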
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Improve diagnostics for spurious RCU CPU stall warnings · 6ccd2ecd
      Paul E. McKenney authored
      The current RCU CPU stall warning code will print "Stall ended before
      state dump start" any time that the stall-warning code is triggered on
      a CPU that has already reported a quiescent state for the current grace
      period and if all quiescent states have been reported for the current
      grace period.  However, a true stall can result in these symptoms, for
      example, by preventing RCU's grace-period kthreads from ever running.
      
      This commit therefore checks for this condition, reporting the end of
      the stall only if one of the grace-period counters has actually advanced.
      Otherwise, it reports the last time that the grace-period kthread made
      meaningful progress.  (In normal situations, the grace-period kthread
      should make meaningful progress at least every jiffies_till_next_fqs
      jiffies.)
      Reported-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Miroslav Benes <mbenes@suse.cz>
    • rcu: Make RCU_CPU_STALL_INFO include number of fqs attempts · fc908ed3
      Paul E. McKenney authored
      One way that an RCU CPU stall warning can happen is if the grace-period
      kthread is not allowed to execute.  One proxy for this kthread's
      forward progress is the number of force-quiescent-state (fqs) scans.
      This commit therefore adds the number of fqs scans to the RCU CPU stall
      warning printouts when CONFIG_RCU_CPU_STALL_INFO=y.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcutorture: Add checks for stall ending before dump start · e9408e4f
      Paul E. McKenney authored
      The current rcutorture scripting checks for actual stalls (via the "Call
      Trace:" check), but fails to spot the case where a stall ends just as it
      is being detected.  This commit therefore adds a check for this case.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Set default to RCU_CPU_STALL_INFO=y · 68158fe2
      Paul E. McKenney authored
      The RCU_CPU_STALL_INFO code has been in for quite some time, and has
      proven reliable.  This commit therefore enables it by default.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Make SRCU optional by using CONFIG_SRCU · 83fe27ea
      Pranith Kumar authored
      SRCU need not be compiled into all kernels.  For tinification
      efforts, it is desirable not to compile SRCU unless it is actually
      needed.
      
      This patch makes compiling SRCU optional by introducing a new
      Kconfig option, CONFIG_SRCU, which is selected when any of the
      components making use of SRCU are selected.
      
      If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
      
         text    data     bss     dec     hex filename
         2007       0       0    2007     7d7 kernel/rcu/srcu.o
      
      Size of arch/powerpc/boot/zImage changes from
      
         text    data     bss     dec     hex filename
       831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
       829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after
      
      so the savings are about ~2000 bytes.
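The selection mechanism can be sketched as the following fragments (illustrative only; the user option shown is hypothetical, not an actual kernel Kconfig entry): each subsystem that needs SRCU selects CONFIG_SRCU, and the build system compiles srcu.o only when that option is set.

```
# Kconfig: SRCU itself has no prompt; users select it.
config SRCU
	bool

config SOME_SRCU_USER          # hypothetical subsystem needing SRCU
	bool "Example SRCU user"
	select SRCU

# kernel/rcu/Makefile: srcu.o is built only when CONFIG_SRCU=y.
obj-$(CONFIG_SRCU) += srcu.o
```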
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Josh Triplett <josh@joshtriplett.org>
      CC: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
    • rcu: Combine DEFINE_SRCU() and DEFINE_STATIC_SRCU() · 9735af5c
      Paul E. McKenney authored
      The DEFINE_SRCU() and DEFINE_STATIC_SRCU() definitions are quite
      similar, so this commit combines them, saving a bit of code and removing
      redundancy.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Expand SRCU ->completed to 64 bits · a5c198f4
      Paul E. McKenney authored
      When rcutorture used only the low-order 32 bits of the grace-period
      number, it was not a problem for SRCU to use a 32-bit completed field.
      However, rcutorture now uses the full 64 bits on 64-bit systems, so
      this commit converts SRCU's ->completed field to unsigned long so as to
      provide 64 bits on 64-bit systems.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Remove redundant callback-list initialization · ab954c16
      Paul E. McKenney authored
      The RCU callback lists are initialized in both rcu_boot_init_percpu_data()
      and rcu_init_percpu_data().  The former is intended for initializing
      immutable data, so this commit removes the initialization from
      rcu_boot_init_percpu_data() and leaves it in rcu_init_percpu_data().
      This change prepares for permitting callbacks to be queued very early
      in boot.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't scan root rcu_node structure for stalled tasks · 6cd534ef
      Paul E. McKenney authored
      Now that blocked tasks are no longer migrated to the root rcu_node
      structure, there is no need to scan the root rcu_node structure for
      blocked tasks stalling the current grace period.  This commit therefore
      removes this scan.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Revert "Allow post-unlock reference for rt_mutex" to avoid priority-inversion · abaf3f9d
      Lai Jiangshan authored
      The patch dfeb9765 ("Allow post-unlock reference for rt_mutex")
      ensured that RCU priority boosting remained safe even when the
      rt_mutex held a post-unlock reference.
      
      But an rt_mutex allowing a post-unlock reference is definitely a bug,
      and it was fixed by commit 27e35715 ("rtmutex: Plug slow unlock
      race").  That fix made the previous patch (dfeb9765) unnecessary.
      
      Even worse, the priority inversion introduced by the previous patch
      still exists:
      
      rcu_read_unlock_special() {
      	rt_mutex_unlock(&rnp->boost_mtx);
      	/* Priority inversion:
      	 * the current task has just been deboosted, so it can be
      	 * preempted immediately as a low-priority task and may wait
      	 * a long time before being rescheduled.  Meanwhile, the
      	 * rcu-booster waits on this low-priority task and sleeps,
      	 * so the booster cannot work as expected.
      	 */
      	complete(&rnp->boost_completion);
      }
      
      Just revert the patch to avoid it.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Note quiescent state when CPU goes offline · 3ba4d0e0
      Paul E. McKenney authored
      The rcu_cleanup_dead_cpu() function (called after a CPU has gone
      completely offline) has not reported a quiescent state because there
      was probably at least one synchronize_rcu() between the time the CPU
      went offline and the CPU_DEAD notifier, and this would have detected
      the CPU's offline state via quiescent-state forcing.  However, the plan
      is for CPUs to take themselves offline, at which point it makes sense
      for them to report their own quiescent state.  This commit makes this
      change in preparation for the new CPU-hotplug setup.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't bother affinitying rcub kthreads away from offline CPUs · 5d0b0249
      Paul E. McKenney authored
      When rcu_boost_kthread_setaffinity() sees that all CPUs for a given
      rcu_node structure are now offline, it sets the affinity of the
      corresponding RCU-boost ("rcub") kthread away from those CPUs.  This
      is pointless
      because the kthread cannot run on those offline CPUs in any case.
      This commit therefore removes this unneeded code.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't initiate RCU priority boosting on root rcu_node · 1be0085b
      Paul E. McKenney authored
      Because there are no longer any preempted tasks on the root rcu_node, and
      because there is no longer ever an rcub kthread for the root rcu_node,
      this commit drops the code in force_qs_rnp() that attempts to awaken
      the non-existent root rcub kthread.  This is strictly a performance
      enhancement, removing a root rcu_node ->lock acquisition and release
      along with some tests in rcu_initiate_boost(), ending with the test that
      notes that there is no rcub kthread.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't spawn rcub kthreads on root rcu_node structure · 3e9f5c70
      Paul E. McKenney authored
      Now that offlining CPUs no longer moves leaf rcu_node structures'
      ->blkd_tasks lists to the root, there is no way for the root rcu_node
      structure's ->blkd_tasks list to be nonempty, unless the root node is also
      the sole leaf node.  This commit therefore refrains from creating an rcub
      kthread for the root rcu_node structure unless it is also the sole leaf.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Make use of rcu_preempt_has_tasks() · 96e92021
      Paul E. McKenney authored
      Given that there is now an rcu_preempt_has_tasks() function that checks
      to see if the ->blkd_tasks list is non-empty, this commit makes use of it.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Shorten irq-disable region in rcu_cleanup_dead_cpu() · a8f4cbad
      Paul E. McKenney authored
      Now that we are not migrating callbacks, there is no need to hold the
      ->orphan_lock across the ->qsmaskinit bit-clearing process.
      This commit therefore releases ->orphan_lock immediately after adopting
      the orphaned RCU callbacks.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't migrate blocked tasks even if all corresponding CPUs offline · d19fb8d1
      Paul E. McKenney authored
      When the last CPU associated with a given leaf rcu_node structure
      goes offline, something must be done about the tasks queued on that
      rcu_node structure.  Each of these tasks has been preempted on one of
      the leaf rcu_node structure's CPUs while in an RCU read-side critical
      section that it has not yet exited.  Handling these tasks is the job of
      rcu_preempt_offline_tasks(), which migrates them from the leaf rcu_node
      structure to the root rcu_node structure.
      
      Unfortunately, this migration has to be done one task at a time because
      each task's allegiance must be shifted from the original leaf rcu_node to
      the root, so that future attempts to deal with these tasks will acquire
      the root rcu_node structure's ->lock rather than that of the leaf.
      Worse yet, this migration must be done with interrupts disabled, which
      is not so good for realtime response, especially given that there is
      no bound on the number of tasks on a given rcu_node structure's list.
      (OK, OK, there is a bound, it is just that it is unreasonably large,
      especially on 64-bit systems.)  This was not considered a problem back
      when rcu_preempt_offline_tasks() was first written because realtime
      systems were assumed not to do CPU-hotplug operations while real-time
      applications were running.  This assumption has proved of dubious validity
      given that people are starting to run multiple realtime applications
      on a single SMP system and that it is common practice to offline then
      online a CPU before starting its real-time application in order to clear
      extraneous processing off of that CPU.  So we now need CPU hotplug
      operations to avoid undue latencies.
      
      This commit therefore avoids migrating these tasks, instead letting
      them be dequeued one by one from the original leaf rcu_node structure
      by rcu_read_unlock_special().  This means that the clearing of bits
      from the upper-level rcu_node structures must be deferred until the
      last such task has been dequeued, because otherwise subsequent grace
      periods won't wait on them.  This commit has the beneficial side effect
      of simplifying the CPU-hotplug code for TREE_PREEMPT_RCU, especially in
      CONFIG_RCU_BOOST builds.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>