1. 22 Jul, 2022 10 commits
  2. 19 Jul, 2022 20 commits
    • rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty · 0578e14c
      Zqiang authored
      Currently, if the 'rcu_nocb_poll' kernel boot parameter is enabled, all
      rcuog kthreads enter polling mode.  However, if all of the rcuo kthreads
      in a given group correspond to CPUs that have been de-offloaded, the
      corresponding rcuog kthread will nonetheless still wake up periodically,
      unnecessarily consuming power and perturbing workloads.  Fortunately,
      this situation is easily detected by the fact that the rcuog kthread's
      CPU's rcu_data structure's ->nocb_head_rdp list is empty.
      
      This commit saves power and avoids unnecessarily perturbing workloads
      by putting an rcuog kthread to sleep during any time period when all of
      its rcuo kthreads' CPUs are de-offloaded.
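      
      A minimal sketch of the idea (not the verbatim patch; my_rdp and
      ->nocb_head_rdp are from kernel/rcu/tree_nocb.h):
      
      	/*
      	 * In nocb_gp_wait(): poll only while this rcuog kthread still
      	 * has rdps to serve; otherwise sleep until explicitly woken.
      	 */
      	if (rcu_nocb_poll && !list_empty(&my_rdp->nocb_head_rdp)) {
      		/* Polling: wake up periodically to scan for work. */
      	} else {
      		/* All served CPUs de-offloaded: sleep indefinitely. */
      	}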
      Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu/nocb: Add option to opt rcuo kthreads out of RT priority · 8f489b4d
      Uladzislau Rezki (Sony) authored
      This commit introduces a RCU_NOCB_CPU_CB_BOOST Kconfig option that
      prevents rcuo kthreads from running at real-time priority, even in
      kernels built with RCU_BOOST.  This capability is important to devices
      needing low-latency (as in a few milliseconds) response from expedited
      RCU grace periods, but which are not running a classic real-time workload.
      On such devices, permitting the rcuo kthreads to run at real-time priority
      results in unacceptable latencies imposed on the application tasks,
      which run as SCHED_OTHER.
      
      See for example the following trace output:
      
      <snip>
      <...>-60 [006] d..1 2979.028717: rcu_batch_start: rcu_preempt CBs=34619 bl=270
      <snip>
      
      If that rcuop kthread were permitted to run at real-time SCHED_FIFO
      priority, it would monopolize its CPU for hundreds of milliseconds
      while invoking those 34619 RCU callback functions, which would cause an
      unacceptably long latency spike for many application stacks on Android
      platforms.
      
      However, some existing real-time workloads require that callback
      invocation run at SCHED_FIFO priority, for example, those running on
      systems with heavy SCHED_OTHER background loads.  (It is the real-time
      system's administrator's responsibility to make sure that important
      real-time tasks run at a higher priority than do RCU's kthreads.)
      
      Therefore, this new RCU_NOCB_CPU_CB_BOOST Kconfig option defaults to
      "y" on kernels built with PREEMPT_RT and defaults to "n" otherwise.
      The effect is to preserve current behavior for real-time systems, but for
      other systems to allow expedited RCU grace periods to run with real-time
      priority while continuing to invoke RCU callbacks as SCHED_OTHER.
      
      As you would expect, this RCU_NOCB_CPU_CB_BOOST Kconfig option has no
      effect except on CPUs with offloaded RCU callbacks.
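      
      The new option looks roughly like this (a sketch; the exact prompt and
      help text live in kernel/rcu/Kconfig):
      
      	config RCU_NOCB_CPU_CB_BOOST
      		bool "Offload RCU callback from real-time kthread"
      		depends on RCU_NOCB_CPU && RCU_BOOST
      		default y if PREEMPT_RT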
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread() · 51038506
      Zqiang authored
      Callbacks are invoked in RCU kthreads when callbacks are offloaded
      (rcu_nocbs boot parameter) or when RCU's softirq handler has been
      offloaded to rcuc kthreads (use_softirq==0).  The current code allows
      for the rcu_nocbs case but not the use_softirq case.  This commit adds
      support for the use_softirq case.
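      
      A rough sketch of the resulting check (an assumption, not the verbatim
      patch; both fields exist in struct rcu_data):
      
      	/* Is current one of RCU's callback-invoking kthreads? */
      	return rdp->rcu_cpu_kthread_task == current ||
      	       rdp->nocb_cb_kthread == current;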
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu/nocb: Add an option to offload all CPUs on boot · b37a667c
      Joel Fernandes authored
      Systems built with CONFIG_RCU_NOCB_CPU=y but booted without either
      the rcu_nocbs= or rcu_nohz_full= kernel-boot parameters will not have
      callback offloading on any of the CPUs, nor can any of the CPUs be
      switched to enable callback offloading at runtime.  Although this is
      intentional, it would be nice to have a way to offload all the CPUs
      without having to make random bootloaders specify either the rcu_nocbs=
      or the rcu_nohz_full= kernel-boot parameters.
      
      This commit therefore provides a new CONFIG_RCU_NOCB_CPU_DEFAULT_ALL
      Kconfig option that switches the default so as to offload callback
      processing on all of the CPUs.  This default can still be overridden
      using the rcu_nocbs= and rcu_nohz_full= kernel-boot parameters.
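      
      A sketch of the new option (the exact wording in kernel/rcu/Kconfig may
      differ):
      
      	config RCU_NOCB_CPU_DEFAULT_ALL
      		bool "Offload RCU callback processing from all CPUs by default"
      		depends on RCU_NOCB_CPU
      		default n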
      Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
      Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
      (In v4.1, fixed issues with CONFIG maze reported by kernel test robot).
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call · 3a5761dc
      Zqiang authored
      If spawning of the rcuog or rcuo[p] kthreads fails, the offloaded rdp
      needs to be explicitly deoffloaded, otherwise the target rdp is still
      considered offloaded even though nothing actually handles its callbacks.
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order · 24a57aff
      Zqiang authored
      In case of failure to spawn either the rcuog or the rcuo[p] kthread for
      a given rdp, rcu_nocb_rdp_deoffload() needs to be called with both the
      hotplug lock and the barrier_mutex held.  However, the cpus write lock
      is already held when rcutree_prepare_cpu() is called, so calling
      rcu_nocb_rdp_deoffload() from there while taking only the barrier_mutex
      would result in a lock inversion against rcu_nocb_cpu_deoffload(),
      which acquires the two locks in the reverse order.
      
      Simply solve this by inverting the locking order inside
      rcu_nocb_cpu_[de]offload().  This will also be a prerequisite for
      toggling NOCB states via cpusets.
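      
      The resulting order is sketched below (an assumption about the exact
      code shape; both paths now take the hotplug lock first):
      
      	cpus_read_lock();			/* hotplug lock first */
      	mutex_lock(&rcu_state.barrier_mutex);	/* then barrier_mutex */
      	...
      	mutex_unlock(&rcu_state.barrier_mutex);
      	cpus_read_unlock();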
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu/nocb: Add/del rdp to iterate from rcuog itself · 1598f4a4
      Frederic Weisbecker authored
      NOCB rdps are part of a group whose list is iterated by the
      corresponding rdp leader.
      
      This list is RCU traversed because an rdp can be either added or
      deleted concurrently. Upon addition, a new iteration to the list after
      a synchronization point (a pair of LOCK/UNLOCK ->nocb_gp_lock) is forced
      to make sure:
      
      1) we didn't miss a new element added in the middle of an iteration
      2) we didn't ignore a whole subset of the list due to an element being
         quickly deleted and then re-added.
      3) we prevent from probably other surprises...
      
      Although this layout is expected to be safe, it doesn't help anybody
      to sleep well.
      
      Instead, simplify the NOCB state toggling by moving the list
      modification from the nocb (de-)offloading workqueue to the rcuog
      kthreads themselves.
      
      Whenever the rdp leader is expected to (re-)set the SEGCBLIST_KTHREAD_GP
      flag of a target rdp, the latter is queued so that the leader handles
      the flag flip along with adding the target rdp to, or deleting it from,
      the list to iterate.  This way, the list modification and the list
      iteration happen on the same kthread, so these operations cannot race
      with each other.
      
      As a bonus, the flags for each rdp don't need to be checked locklessly
      before each iteration, which is one less opportunity to produce
      nightmares.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Zqiang <qiang1.zhang@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu/tree: Add comment to describe GP-done condition in fqs loop · a03ae49c
      Neeraj Upadhyay authored
      Add a comment to explain why the !rcu_preempt_blocked_readers_cgp()
      condition on the root rnp node is required by the GP-completion check
      in rcu_gp_fqs_loop().
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs() · 9bdb5b3a
      Paul E. McKenney authored
      This commit saves a line of code by initializing the rcu_gp_fqs()
      function's first_gp_fqs local variable in its declaration.
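      
      The change amounts to this (a sketch of the diff):
      
      	-	bool first_gp_fqs;
      	-	first_gp_fqs = true;
      	+	bool first_gp_fqs = true;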
      Reported-by: Frederic Weisbecker <frederic@kernel.org>
      Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu/kvfree: Remove useless monitor_todo flag · 82d26c36
      Joel Fernandes (Google) authored
      The monitor_todo flag is not needed, because the work_struct already
      tracks whether work is pending.  Just rely on that, using the
      schedule_delayed_work() helper, which does nothing when the work is
      already queued.
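      
      A sketch of the resulting pattern (krcp->monitor_work and
      KFREE_DRAIN_JIFFIES are from kernel/rcu/tree.c):
      
      	/*
      	 * schedule_delayed_work() is a no-op when the work is already
      	 * pending, so no separate monitor_todo flag is needed.
      	 */
      	schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);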
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu: Cleanup RCU urgency state for offline CPU · e2bb1288
      Zqiang authored
      When a CPU is slow to provide a quiescent state for a given grace
      period, RCU takes steps to encourage that CPU to get with the
      quiescent-state program in a more timely fashion.  These steps
      include these flags in the rcu_data structure:
      
      1.	->rcu_urgent_qs, which causes the scheduling-clock interrupt to
      	request an otherwise pointless context switch from the scheduler.
      
      2.	->rcu_need_heavy_qs, which causes both cond_resched() and RCU's
      	context-switch hook to do an immediate momentary quiescent state.
      
      3.	->rcu_forced_tick, which causes the scheduler-clock tick to
      	be enabled even on nohz_full CPUs with only one runnable task.
      
      These flags are of course cleared once the corresponding CPU has passed
      through a quiescent state.  However, if that quiescent state is the CPU
      going offline, the flags are not cleared, which means that when the CPU
      comes back online, it will needlessly consume additional CPU time and
      incur additional latency, which constitutes a minor but very real
      performance bug.
      
      This commit therefore adds the call to rcu_disable_urgency_upon_qs()
      that clears these flags to the CPU-hotplug offlining code path.
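      
      A sketch of the fix (an assumption about the exact hook; at the time of
      this commit the offlining path ran through rcu_report_dead()):
      
      	/*
      	 * Going offline is this CPU's quiescent state, so stop
      	 * prodding it for one.
      	 */
      	rcu_disable_urgency_upon_qs(rdp);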
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu: tiny: Record kvfree_call_rcu() call stack for KASAN · 800d6acf
      Johannes Berg authored
      When running KASAN with Tiny RCU (e.g. under ARCH=um, where
      a working KASAN patch is now available), we don't get any
      information on the original kfree_rcu() (or similar) caller
      when a problem is reported, as Tiny RCU doesn't record this.
      
      Add the recording, which requires pulling kvfree_call_rcu() out of
      line for the KASAN case, since the recording function
      (kasan_record_aux_stack_noalloc()) is not exported, nor can kasan.h
      be included into rcutiny.h.
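      
      A sketch of the out-of-line version (an assumption about the exact
      patch; the helper name __kvfree_call_rcu is illustrative):
      
      	void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
      	{
      		if (head) {
      			/* For kvfree_rcu(), func encodes the rcu_head offset. */
      			void *ptr = (void *) head - (unsigned long) func;
      
      			kasan_record_aux_stack_noalloc(ptr);
      		}
      		__kvfree_call_rcu(head, func);
      	}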
      
      Without KASAN, the patch has no size impact (ARCH=um kernel):
          text       data         bss         dec        hex    filename
       6151515    4423154    33148520    43723189    29b29b5    linux
       6151515    4423154    33148520    43723189    29b29b5    linux + patch
      
      With KASAN, the impact on my build was minimal:
          text       data         bss         dec        hex    filename
      13915539    7388050    33282304    54585893    340ea25    linux
      13911266    7392114    33282304    54585684    340e954    linux + patch
         -4273      +4064         +-0        -209
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • locking/csd_lock: Change csdlock_debug from early_param to __setup · 9c9b26b0
      Chen Zhongjin authored
      The csdlock_debug kernel-boot parameter is parsed by the
      early_param() function csdlock_debug().  If set, csdlock_debug()
      invokes static_branch_enable() to enable csd_lock_wait feature, which
      triggers a panic on arm64 for kernels built with CONFIG_SPARSEMEM=y and
      CONFIG_SPARSEMEM_VMEMMAP=n.
      
      With CONFIG_SPARSEMEM_VMEMMAP=n, __nr_to_section is called in
      static_key_enable() and returns NULL, resulting in a NULL dereference
      because mem_section is initialized only later in sparse_init().
      
      This is also a problem for powerpc because early_param() functions
      are invoked earlier than jump_label_init(), also resulting in
      static_key_enable() failures.  These failures cause the warning "static
      key 'xxx' used before call to jump_label_init()".
      
      Thus, early_param() is too early for csd_lock_wait to run
      static_branch_enable(), so this commit changes it to __setup() in order
      to fix these failures.
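      
      The change itself is tiny (a sketch of the diff):
      
      	-early_param("csdlock_debug", csdlock_debug);
      	+__setup("csdlock_debug=", csdlock_debug);
      
      __setup() handlers run later during boot, after both sparse_init()
      and jump_label_init(), which is late enough for static_branch_enable()
      to work.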
      
      Fixes: 8d0968cc ("locking/csd_lock: Add boot parameter for controlling CSD lock debugging")
      Cc: stable@vger.kernel.org
      Reported-by: Chen jingwen <chenjingwen6@huawei.com>
      Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu: Forbid RCU_STRICT_GRACE_PERIOD in TINY_RCU kernels · b3ade95b
      Paul E. McKenney authored
      The RCU_STRICT_GRACE_PERIOD Kconfig option does nothing in kernels
      built with CONFIG_TINY_RCU=y, so this commit adjusts the dependencies
      to disallow this combination.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu: Immediately boost preempted readers for strict grace periods · 70a82c3c
      Zqiang authored
      The intent of the CONFIG_RCU_STRICT_GRACE_PERIOD Kconfig option is to
      cause normal grace periods to complete quickly in order to better catch
      errors resulting from improperly leaking pointers from RCU read-side
      critical sections.  However, kernels built with this option enabled still
      wait for some hundreds of milliseconds before boosting RCU readers that
      have been preempted within their current critical section.  The value
      of this delay is set by the CONFIG_RCU_BOOST_DELAY Kconfig option,
      which defaults to 500 milliseconds.
      
      This commit therefore causes kernels built with strict grace periods
      to ignore CONFIG_RCU_BOOST_DELAY.  This causes rcu_initiate_boost()
      to start boosting immediately after all CPUs on a given leaf rcu_node
      structure have passed through their quiescent states.
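      
      A sketch of the effect in rcu_initiate_boost() (an assumption about the
      exact condition):
      
      	/* In strict kernels, don't wait out rnp->boost_time. */
      	if (rnp->qsmask == 0 &&
      	    (ULONG_CMP_GE(jiffies, rnp->boost_time) ||
      	     IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))) {
      		/* ... wake up this leaf's boost kthread ... */
      	}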
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu: Add rnp->cbovldmask check in rcutree_migrate_callbacks() · 52c1d81e
      Zqiang authored
      Currently, the rcu_node structure's ->cbovldmask field is set in call_rcu()
      when a given CPU is suffering from callback overload.  But if that CPU
      goes offline, the outgoing CPU's callbacks are migrated to the running
      CPU, which is likely to overload the running CPU.  However, that CPU's
      bit in its leaf rcu_node structure's ->cbovldmask field remains zero.
      
      Initially, this is OK because the outgoing CPU's bit remains set.
      However, that bit will be cleared at the next end of a grace period,
      at which time it is quite possible that the running CPU will still
      be overloaded.  If the running CPU invokes call_rcu(), then overload
      will be checked for and the bit will be set.  Except that there is no
      guarantee that the running CPU will invoke call_rcu(), in which case the
      next grace period will fail to take the running CPU's overload condition
      into account.  Plus, because the bit is not set, the end of the grace
      period won't check for overload on this CPU.
      
      This commit therefore adds a call to check_cb_ovld_locked() in
      rcutree_migrate_callbacks() to set the running CPU's ->cbovldmask bit
      appropriately.
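      
      Sketch of the addition (an assumption: my_rnp is the running CPU's leaf
      rcu_node structure, already locked by the caller):
      
      	/* Account for the callbacks just migrated to this CPU. */
      	check_cb_ovld_locked(my_rdp, my_rnp);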
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu: Avoid tracing a few functions executed in stop machine · 48f8070f
      Patrick Wang authored
      Stop-machine recently started calling additional functions while waiting:
      
      ----------------------------------------------------------------
      Former stop machine wait loop:
      do {
          cpu_relax(); => macro
          ...
      } while (curstate != STOPMACHINE_EXIT);
      -----------------------------------------------------------------
      Current stop machine wait loop:
      do {
          stop_machine_yield(cpumask); => function (notraced)
          ...
          touch_nmi_watchdog(); => function (notraced, inside calls also notraced)
          ...
          rcu_momentary_dyntick_idle(); => function (notraced, inside calls traced)
      } while (curstate != MULTI_STOP_EXIT);
      ------------------------------------------------------------------
      
      These functions (and the functions that they call) must be marked
      notrace to prevent them from being updated while they are executing.
      The consequences of failing to mark these functions can be severe:
      
        rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
        rcu: 	1-...!: (0 ticks this GP) idle=14f/1/0x4000000000000000 softirq=3397/3397 fqs=0
        rcu: 	3-...!: (0 ticks this GP) idle=ee9/1/0x4000000000000000 softirq=5168/5168 fqs=0
        	(detected by 0, t=8137 jiffies, g=5889, q=2 ncpus=4)
        Task dump for CPU 1:
        task:migration/1     state:R  running task     stack:    0 pid:   19 ppid:     2 flags:0x00000000
        Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174
        Call Trace:
        Task dump for CPU 3:
        task:migration/3     state:R  running task     stack:    0 pid:   29 ppid:     2 flags:0x00000000
        Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174
        Call Trace:
        rcu: rcu_preempt kthread timer wakeup didn't happen for 8136 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
        rcu: 	Possible timer handling issue on cpu=2 timer-softirq=594
        rcu: rcu_preempt kthread starved for 8137 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
        rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
        rcu: RCU grace-period kthread stack dump:
        task:rcu_preempt     state:I stack:    0 pid:   14 ppid:     2 flags:0x00000000
        Call Trace:
          schedule+0x56/0xc2
          schedule_timeout+0x82/0x184
          rcu_gp_fqs_loop+0x19a/0x318
          rcu_gp_kthread+0x11a/0x140
          kthread+0xee/0x118
          ret_from_exception+0x0/0x14
        rcu: Stack dump where RCU GP kthread last ran:
        Task dump for CPU 2:
        task:migration/2     state:R  running task     stack:    0 pid:   24 ppid:     2 flags:0x00000000
        Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174
        Call Trace:
      
      This commit therefore marks these functions notrace:
       rcu_preempt_deferred_qs()
       rcu_preempt_need_deferred_qs()
       rcu_preempt_deferred_qs_irqrestore()
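      
      For example (a sketch; the function lives in kernel/rcu/tree_plugin.h):
      
      	-static void rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
      	+static notrace void rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)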
      
      [ paulmck: Apply feedback from Neeraj Upadhyay. ]
      Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
      Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • rcu: Decrease FQS scan wait time in case of callback overloading · fb77dccf
      Paul E. McKenney authored
      The force-quiesce-state loop function rcu_gp_fqs_loop() checks for
      callback overloading and does an immediate initial scan for idle CPUs
      if so.  However, subsequent rescans will be carried out at as leisurely a
      rate as they always are, as specified by the rcutree.jiffies_till_next_fqs
      module parameter.  It might be tempting to just continue immediately
      rescanning, but this turns the RCU grace-period kthread into a CPU hog.
      It might also be tempting to reduce the time between rescans to a single
      jiffy, but this can be problematic on larger systems.
      
      This commit therefore divides the normal time between rescans by three,
      rounding up.  Thus a small system running at HZ=1000 that is suffering
      from callback overload will wait only one jiffy instead of the normal
      three between rescans.
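      
      In rcu_gp_fqs_loop(), this looks roughly as follows (a sketch):
      
      	/* Under callback overload, rescan sooner, but not continuously. */
      	if (rcu_state.cbovld)
      		j = (j + 2) / 3;  /* Divide wait time by three, rounding up. */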
      
      [ paulmck: Apply Neeraj Upadhyay feedback. ]
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    • srcu: Make expedited RCU grace periods block even less frequently · 4f2bfd94
      Neeraj Upadhyay authored
      The purpose of commit 282d8998 ("srcu: Prevent expedited GPs
      and blocking readers from consuming CPU") was to prevent a long
      series of never-blocking expedited SRCU grace periods from blocking
      kernel-live-patching (KLP) progress.  Although it was successful, it also
      resulted in excessive boot times on certain embedded workloads running
      under qemu with the "-bios QEMU_EFI.fd" command line.  Here "excessive"
      means increasing the boot time up into the three-to-four minute range.
      This increase in boot time was due to the more than 6000 back-to-back
      invocations of synchronize_rcu_expedited() within the KVM host OS, which
      in turn resulted from qemu's emulation of a long series of MMIO accesses.
      
      Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace
      periods") did not significantly help this particular use case.
      
      Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the
      value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values
      of non-sleeping per phase counts on a system with preemption enabled,
      and observed the following boot times:
      
      +--------------------------+----------------+
      | SRCU_MAX_NODELAY_PHASE   | Boot time (s)  |
      +--------------------------+----------------+
      | 100                      | 30.053         |
      | 150                      | 25.151         |
      | 200                      | 20.704         |
      | 250                      | 15.748         |
      | 500                      | 11.401         |
      | 1000                     | 11.443         |
      | 10000                    | 11.258         |
      | 1000000                  | 11.154         |
      +--------------------------+----------------+
      
      Analysis of the experimental results shows additional improvement with
      CPU-bound delays approaching one jiffy in duration.  This improvement was
      also seen when the number of per-phase iterations was scaled to one jiffy.
      
      This commit therefore scales per-grace-period phase number of non-sleeping
      polls so that non-sleeping polls extend for about one jiffy. In addition,
      the delay-calculation call to srcu_get_delay() in srcu_gp_end() is
      replaced with a simple check for an expedited grace period.  This change
      schedules callback invocation immediately after expedited grace periods
      complete, which results in greatly improved boot times.  Testing done
      by Marc and Zhangfei confirms that this change recovers most of the
      performance degradation in boottime; for CONFIG_HZ_250 configuration,
      specifically, boot times improve from 3m50s to 41s on Marc's setup;
      and from 2m40s to ~9.7s on Zhangfei's setup.
      
      In addition to the changes to the default per-phase delays, this
      change adds three new kernel parameters: srcutree.srcu_max_nodelay,
      srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay.
      These allow users to configure the SRCU grace-period scanning delays so
      as to react more quickly to additional use cases.
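      
      For example, the scanning delays can be tuned on the kernel command
      line (the values here are purely illustrative):
      
      	srcutree.srcu_max_nodelay_phase=1000000 srcutree.srcu_retry_check_delay=5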
      
      Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods")
      Fixes: 282d8998 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
      Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
      Reported-by: yueluck <yueluck@163.com>
      Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
      Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • srcu: Block less aggressively for expedited grace periods · 8f870e6e
      Paul E. McKenney authored
      Commit 282d8998 ("srcu: Prevent expedited GPs and blocking readers
      from consuming CPU") fixed a problem where a long-running expedited SRCU
      grace period could block kernel live patching.  It did so by giving up
      on expediting once a given SRCU expedited grace period grew too old.
      
      Unfortunately, this added excessive delays to boots of virtual embedded
      systems specifying "-bios QEMU_EFI.fd" to qemu.  This commit therefore
      makes the transition away from expediting less aggressive, increasing
      the per-grace-period phase number of non-sleeping polls of readers from
      one to three and increasing the required grace-period age from one jiffy
      (actually from zero to one jiffies) to two jiffies (actually from one
      to two jiffies).
      
      Fixes: 282d8998 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
      Reported-by: chenxiang (M) <chenxiang66@hisilicon.com>
      Cc: Shameerali Kolothum Thodi  <shameerali.kolothum.thodi@huawei.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
  3. 21 Jun, 2022 10 commits
    • refscale: Convert test_lock spinlock to raw_spinlock · 7bf336fb
      Zqiang authored
      In kernels built with CONFIG_PREEMPT_RT=y, spinlocks are replaced by
      rt_mutex, which can sleep.  This means that acquiring a non-raw spinlock
      in a critical section where preemption is disabled can trigger the
      following BUG:
      
      BUG: scheduling while atomic: ref_scale_reade/76/0x00000002
      Preemption disabled at:
      ref_lock_section+0x16/0x80
      Call Trace:
      <TASK>
      dump_stack_lvl+0x5b/0x82
      dump_stack+0x10/0x12
      __schedule_bug.cold+0x9c/0xad
      __schedule+0x839/0xc00
      schedule_rtlock+0x22/0x40
      rtlock_slowlock_locked+0x460/0x1350
      rt_spin_lock+0x61/0xe0
      ref_lock_section+0x29/0x80
      rcu_scale_one_reader+0x52/0x60
      ref_scale_reader+0x28d/0x490
      kthread+0x128/0x150
      ret_from_fork+0x22/0x30
      </TASK>
      
      This commit therefore converts this spinlock to a raw_spinlock.
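      
      The conversion is essentially this (a sketch; kernel/rcu/refscale.c):
      
      	-static DEFINE_SPINLOCK(test_lock);
      	+static DEFINE_RAW_SPINLOCK(test_lock);
      
      with the spin_lock()/spin_unlock() calls in ref_lock_section()
      becoming raw_spin_lock()/raw_spin_unlock().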
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcutorture: Handle failure of memory allocation functions · 1a5ca5e0
      Li Qiong authored
      This commit adds warnings for allocation failure during the mem_dump_obj()
      tests.  It also terminates these tests upon such failure.
      Signed-off-by: Li Qiong <liqiong@nfschina.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcutorture: Fix ksoftirqd boosting timing and iteration · 3002153a
      Frederic Weisbecker authored
      The RCU priority boosting can fail in two situations:
      
      1) If nr_cpus= is greater than maxcpus=, that is, if the total number
      of CPUs exceeds the number brought online at boot, then torture_onoff()
      may later bring up CPUs that weren't online at boot.  Because rcutorture
      initialization boosts only the ksoftirqd kthreads of the CPUs that were
      online at boot, the CPUs brought online later by torture_onoff() won't
      benefit from the boost, making RCU priority boosting fail.
      
      2) The ksoftirqd kthreads are boosted after the creation of
      rcu_torture_boost() kthreads, which opens a window large enough for these
      rcu_torture_boost() kthreads to wait (despite running at FIFO priority)
      for ksoftirqds that are still running at SCHED_NORMAL priority.
      
      The issues can trigger for example with:
      
      	./kvm.sh --configs TREE01 --kconfig "CONFIG_RCU_BOOST=y"
      
      	[   34.968561] rcu-torture: !!!
      	[   34.968627] ------------[ cut here ]------------
      	[   35.014054] WARNING: CPU: 4 PID: 114 at kernel/rcu/rcutorture.c:1979 rcu_torture_stats_print+0x5ad/0x610
      	[   35.052043] Modules linked in:
      	[   35.069138] CPU: 4 PID: 114 Comm: rcu_torture_sta Not tainted 5.18.0-rc1 #1
      	[   35.096424] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
      	[   35.154570] RIP: 0010:rcu_torture_stats_print+0x5ad/0x610
      	[   35.198527] Code: 63 1b 02 00 74 02 0f 0b 48 83 3d 35 63 1b 02 00 74 02 0f 0b 48 83 3d 21 63 1b 02 00 74 02 0f 0b 48 83 3d 0d 63 1b 02 00 74 02 <0f> 0b 83 eb 01 0f 8e ba fc ff ff 0f 0b e9 b3 fc ff f82
      	[   37.251049] RSP: 0000:ffffa92a0050bdf8 EFLAGS: 00010202
      	[   37.277320] rcu: De-offloading 8
      	[   37.290367] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
      	[   37.290387] RDX: 0000000000000000 RSI: 00000000ffffbfff RDI: 00000000ffffffff
      	[   37.290398] RBP: 000000000000007b R08: 0000000000000000 R09: c0000000ffffbfff
      	[   37.290407] R10: 000000000000002a R11: ffffa92a0050bc18 R12: ffffa92a0050be20
      	[   37.290417] R13: ffffa92a0050be78 R14: 0000000000000000 R15: 000000000001bea0
      	[   37.290427] FS:  0000000000000000(0000) GS:ffff96045eb00000(0000) knlGS:0000000000000000
      	[   37.290448] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	[   37.290460] CR2: 0000000000000000 CR3: 000000001dc0c000 CR4: 00000000000006e0
      	[   37.290470] Call Trace:
      	[   37.295049]  <TASK>
      	[   37.295065]  ? preempt_count_add+0x63/0x90
      	[   37.295095]  ? _raw_spin_lock_irqsave+0x12/0x40
      	[   37.295125]  ? rcu_torture_stats_print+0x610/0x610
      	[   37.295143]  rcu_torture_stats+0x29/0x70
      	[   37.295160]  kthread+0xe3/0x110
      	[   37.295176]  ? kthread_complete_and_exit+0x20/0x20
      	[   37.295193]  ret_from_fork+0x22/0x30
      	[   37.295218]  </TASK>
      
      Fix this by boosting the ksoftirqd kthreads from the boosting hotplug
      callback itself, before the boosting kthreads are created.
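      
      A sketch of that boosting (an assumption about the exact priority
      value):
      
      	struct sched_param sp;
      	struct task_struct *t = per_cpu(ksoftirqd, cpu);
      
      	if (!WARN_ON_ONCE(!t)) {
      		sp.sched_priority = 2;
      		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
      	}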
      
      Fixes: ea6d962e ("rcutorture: Judge RCU priority boosting on grace periods, not callbacks")
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • torture: Create kvm-check-branches.sh output in proper location · 148df92f
      Paul E. McKenney authored
      Currently, kvm-check-branches.sh causes each kvm.sh invocation to create
      a separate date-stamped directory, then, after that invocation completes,
      moves it into the *-group/NNNN directory.  This works, but makes it more
      difficult to monitor an ongoing run.  This commit therefore uses the
      kvm.sh --datestamp argument to make kvm.sh put the output in the right
      place to start with, and also dispenses with the additional level of
      datestamping.  (Those wanting datestamps can find them in the log files.)
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcuscale: Fix smp_processor_id()-in-preemptible warnings · 92366810
      Zqiang authored
      Systems built with CONFIG_DEBUG_PREEMPT=y can trigger the following
      BUG while running the rcuscale performance test:
      
      BUG: using smp_processor_id() in preemptible [00000000] code: rcu_scale_write/69
      CPU: 0 PID: 66 Comm: rcu_scale_write Not tainted 5.18.0-rc7-next-20220517-yoctodev-standard+
      caller is debug_smp_processor_id+0x17/0x20
      Call Trace:
      <TASK>
      dump_stack_lvl+0x49/0x5e
      dump_stack+0x10/0x12
      check_preemption_disabled+0xdf/0xf0
      debug_smp_processor_id+0x17/0x20
      rcu_scale_writer+0x2b5/0x580
      kthread+0x177/0x1b0
      ret_from_fork+0x22/0x30
      </TASK>
      
      Reproduction method:
      runqemu kvm slirp nographic qemuparams="-m 4096 -smp 8" bootparams="isolcpus=2,3
      nohz_full=2,3 rcu_nocbs=2,3 rcutree.dump_tree=1 rcuscale.shutdown=false
      rcuscale.gp_async=true" -d
      
      The problem is that the rcu_scale_writer() kthreads fail to set the
      PF_NO_SETAFFINITY flag, which causes is_percpu_thread() to assume
      that each kthread's affinity might change at any time, thus triggering
      the BUG noted above.
      
      This commit therefore causes rcu_scale_writer() to set PF_NO_SETAFFINITY
      in its kthread's ->flags field, thus preventing this BUG.
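      
      A sketch of the fix inside rcu_scale_writer() (an assumption about the
      exact placement):
      
      	/* Affinity was fixed at creation time, so say so. */
      	current->flags |= PF_NO_SETAFFINITY;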
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcutorture: Make failure indication note reader-batch overflow · 8c0666d3
      Paul E. McKenney authored
      The loop scanning the pipesummary[] array currently skips the last
      element, which means that the diagnostics ignore those rarest of
      situations, namely where some readers persist across more than ten
      grace periods, but all other readers avoid spanning a full grace period.
      This commit therefore adjusts the scan to include the last element of
      this array.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • torture: Adjust to again produce debugging information · 5c92d750
      Paul E. McKenney authored
      A recent change to the DEBUG_INFO Kconfig option means that simply adding
      CONFIG_DEBUG_INFO=y to the .config file and running "make oldconfig" no
      longer works.  It is instead necessary to add CONFIG_DEBUG_INFO_NONE=n
      and (for example) CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y.
      This combination will then result in CONFIG_DEBUG_INFO being selected.
      
      This commit therefore updates the Kconfig options produced in response
      to the kvm.sh --gdb, --kasan, and --kcsan command-line parameters.
      
      Fixes: f9b3cd24 ("Kconfig.debug: make DEBUG_INFO selectable from a choice")
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcutorture: Fix memory leak in rcu_test_debug_objects() · 98ea2032
      Zqiang authored
      The kernel memory leak detector located the following:
      
      unreferenced object 0xffff95d941135b50 (size 16):
        comm "swapper/0", pid 1, jiffies 4294667610 (age 1367.451s)
        hex dump (first 16 bytes):
          f0 c6 c2 bd d9 95 ff ff 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000bc81d9b1>] kmem_cache_alloc_trace+0x2f6/0x500
          [<00000000d28be229>] rcu_torture_init+0x1235/0x1354
          [<0000000032c3acd9>] do_one_initcall+0x51/0x210
          [<000000003c117727>] kernel_init_freeable+0x205/0x259
          [<000000003961f965>] kernel_init+0x1a/0x120
          [<000000001998f890>] ret_from_fork+0x22/0x30
      
      This is caused by the rcu_test_debug_objects() function allocating an
      rcu_head structure, then failing to free it.  This commit therefore adds
      the needed kfree() after the last use of this structure.
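      
      Sketch of the fix (rhp is the locally allocated pointer in
      rcu_test_debug_objects()):
      
      	rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
      	...
      	kfree(rhp);  /* Added: free the test rcu_head after its last use. */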
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcutorture: Simplify rcu_torture_read_exit_child() loop · d984114e
      Paul E. McKenney authored
      The existing code contains an implicit manual loop that obscures the
      flow and requires an extra control variable.  This commit makes this
      implicit loop explicit, thus saving several lines of code.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu/torture: Change order of warning and trace dump · 14c0017c
      Anna-Maria Behnsen authored
      Dumping a big ftrace buffer can itself lead to an RCU stall, in which
      case both the ftrace buffer and the stall information need to be
      printed.  When there is additionally a WARN_ON() describing the reason
      for the ftrace-buffer dump, and that WARN_ON() is executed _after_ the
      dump, its information gets lost in the middle of the RCU stall output.
      
      Therefore, print the WARN_ON() message before dumping the ftrace buffer
      in rcu_torture_writer().
      
      [ paulmck: Add tracing_off() to avoid cruft from WARN(). ]
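      
      Sketch of the resulting order (an assumption about the exact call
      site):
      
      	tracing_off();			/* Avoid cruft from the WARN() itself. */
      	WARN(1, "...reason for the dump...");
      	rcu_ftrace_dump(DUMP_ALL);	/* Dump the already-frozen buffer. */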
      Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
      Reviewed-by: Benedikt Spranger <b.spranger@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>