1. 01 Sep, 2022 2 commits
• Merge branches 'doc.2022.08.31b', 'fixes.2022.08.31b', 'kvfree.2022.08.31b',... · 5c0ec490
      Paul E. McKenney authored
      Merge branches 'doc.2022.08.31b', 'fixes.2022.08.31b', 'kvfree.2022.08.31b', 'nocb.2022.09.01a', 'poll.2022.08.31b', 'poll-srcu.2022.08.31b' and 'tasks.2022.08.31b' into HEAD
      
      doc.2022.08.31b: Documentation updates
      fixes.2022.08.31b: Miscellaneous fixes
      kvfree.2022.08.31b: kvfree_rcu() updates
      nocb.2022.09.01a: NOCB CPU updates
      poll.2022.08.31b: Full-oldstate RCU polling grace-period API
      poll-srcu.2022.08.31b: Polled SRCU grace-period updates
      tasks.2022.08.31b: Tasks RCU updates
• rcutorture: Use the barrier operation specified by cur_ops · 48297a22
      Zqiang authored
      The rcutorture_oom_notify() function unconditionally invokes
      rcu_barrier(), which is OK when the rcutorture.torture_type value is
      "rcu", but unhelpful otherwise.  The purpose of these barrier calls is to
      wait for all outstanding callback-flooding callbacks to be invoked before
      cleaning up their data.  Using the wrong barrier function therefore
      risks arbitrary memory corruption.  Thus, this commit changes these
      rcu_barrier() calls into cur_ops->cb_barrier() to make things work when
      torturing non-vanilla flavors of RCU.
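
A minimal sketch of the pattern, assuming rcutorture's cur_ops operations
vector; the NULL check is a defensive assumption rather than a claim about
the upstream diff:

<snip>
	/* Wait for the flavor under test's outstanding callbacks. */
	if (cur_ops->cb_barrier)
		cur_ops->cb_barrier();	/* e.g., srcu_barrier() for "srcu" */
<snip>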
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  2. 31 Aug, 2022 38 commits
• rcu-tasks: Make RCU Tasks Trace check for userspace execution · 528262f5
      Zqiang authored
      Userspace execution is a valid quiescent state for RCU Tasks Trace,
      but the scheduling-clock interrupt does not currently report such
      quiescent states.
      
      Of course, the scheduling-clock interrupt is not strictly speaking
      userspace execution.  However, the only way that this code is not
      in a quiescent state is if something invoked rcu_read_lock_trace(),
      and that would be reflected in the ->trc_reader_nesting field in
      the task_struct structure.  Furthermore, this field is checked by
      rcu_tasks_trace_qs(), which is invoked by rcu_tasks_qs() which is in
      turn invoked by rcu_note_voluntary_context_switch() in kernels building
      at least one of the RCU Tasks flavors.  It is therefore safe to invoke
rcu_tasks_trace_qs() from rcu_sched_clock_irq().
      
      But rcu_tasks_qs() also invokes rcu_tasks_classic_qs() for RCU
      Tasks, which lacks the read-side markers provided by RCU Tasks Trace.
      This raises the possibility that an RCU Tasks grace period could start
      after the interrupt from userspace execution, but before the call to
      rcu_sched_clock_irq().  However, it turns out that this is safe because
      the RCU Tasks grace period waits for an RCU grace period, which will
      wait for the entire scheduling-clock interrupt handler, including any
      RCU Tasks read-side critical section that this handler might contain.
      
      This commit therefore updates the rcu_sched_clock_irq() function's
      check for usermode execution and its call to rcu_tasks_classic_qs()
      to instead check for both usermode execution and interrupt from idle,
      and to instead call rcu_note_voluntary_context_switch().  This
consolidates code and provides faster RCU Tasks Trace
      reporting of quiescent states in kernels that do scheduling-clock
      interrupts for userspace execution.
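
A condensed sketch of the consolidated check, assuming the surrounding
rcu_sched_clock_irq() context (illustrative, not the exact upstream diff):

<snip>
	/* Userspace and idle interrupts are both quiescent states, so
	 * report them via the voluntary-context-switch path, which covers
	 * RCU Tasks and RCU Tasks Trace alike. */
	if (user || rcu_is_cpu_rrupt_from_idle())
		rcu_note_voluntary_context_switch(current);
<snip>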
      
      [ paulmck: Consolidate checks into rcu_sched_clock_irq(). ]
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu-tasks: Ensure RCU Tasks Trace loops have quiescent states · d6ad6063
      Paul E. McKenney authored
      The RCU Tasks Trace grace-period kthread loops across all CPUs, and
      there can be quite a few CPUs, with some commercially available systems
      sporting well over a thousand of them.  Some of these loops can feature
      IPIs, which can take some time.  This commit therefore places a call to
      cond_resched_tasks_rcu_qs() in each such loop.
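
An illustrative sketch of the pattern; check_one_cpu() is a hypothetical
stand-in for the per-CPU work, which may include an IPI:

<snip>
	int cpu;

	for_each_online_cpu(cpu) {
		check_one_cpu(cpu);		/* hypothetical per-CPU scan step */
		cond_resched_tasks_rcu_qs();	/* report a QS and maybe yield */
	}
<snip>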
      
Link: https://docs.google.com/document/d/1V0YnG1HTWMt9WHJjroiJL9lf-hMrud4v8Fn3fhyY0cI/edit?usp=sharing
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() · fcd53c8a
      Zqiang authored
      Kernels built with CONFIG_PROVE_RCU=y and CONFIG_DEBUG_LOCK_ALLOC=y
      attempt to emit a warning when the synchronize_rcu_tasks_generic()
      function is called during early boot while the rcu_scheduler_active
variable is RCU_SCHEDULER_INACTIVE.  However, the warning is not
actually printed because debug_lockdep_rcu_enabled() returns
false, exactly because the rcu_scheduler_active variable is still equal
      to RCU_SCHEDULER_INACTIVE.
      
      This commit therefore replaces RCU_LOCKDEP_WARN() with WARN_ONCE()
      to force these warnings to actually be printed.
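
A before/after sketch, with the condition and message paraphrased rather
than copied from the upstream diff:

<snip>
	/* Before: suppressed during early boot, because lockdep-RCU is not
	 * yet enabled while rcu_scheduler_active is RCU_SCHEDULER_INACTIVE. */
	RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
			 "synchronize_rcu_tasks() called too soon");

	/* After: prints unconditionally, and only once. */
	WARN_ONCE(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
		  "synchronize_rcu_tasks() called too soon");
<snip>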
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• srcu: Make Tiny SRCU use full-sized grace-period counters · 5fe89191
      Paul E. McKenney authored
      This commit makes Tiny SRCU use full-sized grace-period counters to
      further avoid counter-wrap issues when using polled grace-period APIs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• srcu: Make Tiny SRCU poll_state_synchronize_srcu() more precise · de3f2671
      Paul E. McKenney authored
      This commit applies the more-precise grace-period-state check used by
      rcu_seq_done_exact() to poll_state_synchronize_srcu().  This is important
      because Tiny SRCU uses a 16-bit counter, which can wrap quite quickly.
      If counter wrap continues to be a problem, then expanding ->srcu_idx
      and ->srcu_idx_max to 32 bits might be warranted.
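
A hedged sketch of the idea behind the more-precise check, using
illustrative names and a bare 16-bit counter rather than the kernel's
exact rcu_seq_done_exact() expression:

<snip>
	/* Treat the cookie as "done" only while the current sequence value
	 * lies at or within half the counter space past it, so a counter
	 * that has wrapped most of the way around is not misread as done. */
	static bool seq_done_exact(unsigned short cur, unsigned short cookie)
	{
		unsigned short delta = cur - cookie;	/* wraps modulo 2^16 */

		return delta < 0x8000;
	}
<snip>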
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• srcu: Add GP and maximum requested GP to Tiny SRCU rcutorture output · d66e4cf9
      Paul E. McKenney authored
This commit adds the ->srcu_idx and ->srcu_idx_max fields to the Tiny
      SRCU rcutorture output for additional diagnostics.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Make "srcud" option also test polled grace-period API · 599d97e3
      Paul E. McKenney authored
      This commit brings the "srcud" (dynamically allocated) SRCU test in line
      with the "srcu" (statically allocated) test, so that both test the full
      SRCU polled grace-period API.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Limit read-side polling-API testing · 967c298d
      Paul E. McKenney authored
      RCU's polled grace-period API is reasonably lightweight, but still
      contains heavyweight memory barriers.  This commit therefore limits
      testing of this API from rcutorture's readers in order to avoid the
      false negatives that these heavyweight operations could provoke.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Add functions to compare grace-period state values · 18538248
      Paul E. McKenney authored
      This commit adds same_state_synchronize_rcu() and
      same_state_synchronize_rcu_full() functions to compare grace-period state
      values, for example, those obtained from get_state_synchronize_rcu()
      and get_state_synchronize_rcu_full().  These functions allow small
      structures to omit these state values by placing them in list headers for
      lists containing structures with the same token value.  Presumably the
      per-structure list pointers are the same ones used to link the structures
      into whatever reader-accessible data structure was used.
      
      This commit also adds both NUM_ACTIVE_RCU_POLL_OLDSTATE and
      NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE, which define the maximum number of
      distinct unsigned long values and rcu_gp_oldstate values, respectively,
      corresponding to not-yet-completed grace periods.  These values can be
      used to size arrays of the list headers described above.
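
A hedged usage sketch of batching by grace-period cookie; the structure,
array, and helper here are illustrative, not from the kernel:

<snip>
	struct gp_batch {
		unsigned long oldstate;	/* from get_state_synchronize_rcu() */
		struct list_head items;	/* structures awaiting this GP */
	};

	static struct gp_batch batches[NUM_ACTIVE_RCU_POLL_OLDSTATE];

	/* Find the batch already tracking this cookie, if any. */
	static struct gp_batch *batch_for(unsigned long cookie)
	{
		int i;

		for (i = 0; i < NUM_ACTIVE_RCU_POLL_OLDSTATE; i++)
			if (same_state_synchronize_rcu(batches[i].oldstate, cookie))
				return &batches[i];
		return NULL;	/* caller starts a new batch in a free slot */
	}
<snip>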
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Expand rcu_torture_write_types() first "if" statement · 5d7801f2
      Paul E. McKenney authored
      This commit expands the rcu_torture_write_types() function's first "if"
      condition and body, placing one element per line, in order to make the
      compiler's error messages more helpful.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Use 1-suffixed variable in rcu_torture_write_types() check · cc8faf5b
      Paul E. McKenney authored
      This commit changes the use of gp_poll_exp to gp_poll_exp1 in the first
      check in rcu_torture_write_types().  No functional effect, but consistency
      is a good thing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Make synchronize_rcu() fastpath update only boot-CPU counters · d761de8a
      Paul E. McKenney authored
      Large systems can have hundreds of rcu_node structures, and updating
      counters in each of them might slow down booting.  This commit therefore
      updates only the counters in those rcu_node structures corresponding
      to the boot CPU, up to and including the root rcu_node structure.
      
      The counters for the remaining rcu_node structures are updated by the
      rcu_scheduler_starting() function, which executes just before the first
      non-boot kthread is spawned.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Adjust rcu_poll_need_2gp() for rcu_gp_oldstate field removal · b3cdd0a7
      Paul E. McKenney authored
      Now that rcu_gp_oldstate can accurately track both normal and
      expedited grace periods regardless of system state, rcutorture's
      rcu_poll_need_2gp() function need only call for a second grace period
for the old single-unsigned-long grace-period polling APIs.
      This commit therefore adjusts rcu_poll_need_2gp() accordingly.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Remove ->rgos_polled field from rcu_gp_oldstate structure · 7ecef087
      Paul E. McKenney authored
      Because both normal and expedited grace periods increment their respective
      counters on their pre-scheduler early boot fastpaths, the rcu_gp_oldstate
      structure no longer needs its ->rgos_polled field.  This commit therefore
      removes this field, shrinking this structure so that it is the same size
      as an rcu_head structure.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Make synchronize_rcu_expedited() fast path update .expedited_sequence · 43ff97cc
      Paul E. McKenney authored
      This commit causes the early boot single-CPU synchronize_rcu_expedited()
      fastpath to update the rcu_state structure's ->expedited_sequence
      counter.  This will allow the full-state polled grace-period APIs to
      detect all expedited grace periods without the need to track the special
      combined polling-only counter, which is another step towards removing
the ->rgos_polled field from the rcu_gp_oldstate structure, thereby reducing its
      size by one third.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Remove expedited grace-period fast-path forward-progress helper · e8755d2b
      Paul E. McKenney authored
      Now that the expedited grace-period fast path can only happen during
      the pre-scheduler portion of early boot, this fast path can no longer
block run-time RCU Tasks Trace grace periods.  This commit therefore removes
      the conditional cond_resched() invocation.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Make synchronize_rcu() fast path update ->gp_seq counters · 910e1209
      Paul E. McKenney authored
      This commit causes the early boot single-CPU synchronize_rcu() fastpath to
      update the rcu_state and rcu_node structures' ->gp_seq and ->gp_seq_needed
      counters.  This will allow the full-state polled grace-period APIs to
      detect all normal grace periods without the need to track the special
      combined polling-only counter, which is a step towards removing the
->rgos_polled field from the rcu_gp_oldstate structure, thereby reducing its size
      by one third.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu-tasks: Remove grace-period fast-path rcu-tasks helper · 5f11bad6
      Paul E. McKenney authored
      Now that the grace-period fast path can only happen during the
      pre-scheduler portion of early boot, this fast path can no longer block
      run-time RCU Tasks and RCU Tasks Trace grace periods.  This commit
      therefore removes the conditional cond_resched_tasks_rcu_qs() invocation.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Set rcu_data structures' initial ->gpwrap value to true · a5d1b0b6
      Paul E. McKenney authored
It would be good to reduce the size of the rcu_gp_oldstate structure
      from three unsigned long instances to two, but this requires that the
      boot-time optimized grace periods update the various ->gp_seq fields.
      Updating these fields in the rcu_state structure and in all of the
      rcu_node structures is at least semi-reasonable, but updating them in
      all of the rcu_data structures is a bridge too far.  This means that if
      there are too many early boot-time grace periods, the ->gp_seq field in
      the rcu_data structure cannot be trusted.  This commit therefore sets
each rcu_data structure's initial ->gpwrap field to true, providing the
necessary impetus for a suitable level of distrust.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Disable run-time single-CPU grace-period optimization · 258f887a
      Paul E. McKenney authored
      The run-time single-CPU grace-period optimization applies only to
      kernels built with CONFIG_SMP=y && CONFIG_PREEMPTION=y that are running
      on a single-CPU system.  But a kernel intended for a single-CPU system
      should instead be built with CONFIG_SMP=n, and in any case, single-CPU
      systems running Linux no longer appear to be the common case.  Plus this
      optimization results in the rcu_gp_oldstate structure being half again
      larger than it needs to be.
      
      This commit therefore disables the run-time single-CPU grace-period
      optimization, so that this optimization applies only during the
      pre-scheduler portion of the boot sequence.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Add full-sized polling for cond_sync_exp_full() · 8df13f01
      Paul E. McKenney authored
      The cond_synchronize_rcu_expedited() API compresses the combined expedited and
      normal grace-period states into a single unsigned long, which conserves
      storage, but can miss grace periods in certain cases involving overlapping
      normal and expedited grace periods.  Missing the occasional grace period
      is usually not a problem, but there are use cases that care about each
      and every grace period.
      
      This commit therefore adds yet another member of the full-state RCU
grace-period polling API, which is the cond_synchronize_rcu_expedited_full()
      function.  This uses up to three times the storage (rcu_gp_oldstate
      structure instead of unsigned long), but is guaranteed not to miss
      grace periods.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Add full-sized polling for cond_sync_full() · b6fe4917
      Paul E. McKenney authored
      The cond_synchronize_rcu() API compresses the combined expedited and
      normal grace-period states into a single unsigned long, which conserves
      storage, but can miss grace periods in certain cases involving overlapping
      normal and expedited grace periods.  Missing the occasional grace period
      is usually not a problem, but there are use cases that care about each
      and every grace period.
      
      This commit therefore adds yet another member of the full-state RCU
      grace-period polling API, which is the cond_synchronize_rcu_full()
      function.  This uses up to three times the storage (rcu_gp_oldstate
      structure instead of unsigned long), but is guaranteed not to miss
      grace periods.
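
A hedged usage sketch; do_other_work() stands in for whatever the caller
does between capturing the cookie and waiting:

<snip>
	struct rcu_gp_oldstate rgos;

	get_state_synchronize_rcu_full(&rgos);
	do_other_work();			/* hypothetical intervening work */
	cond_synchronize_rcu_full(&rgos);	/* no-op if a full GP already elapsed */
<snip>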
      
      [ paulmck: Apply feedback from kernel test robot and Julia Lawall. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Remove blank line from poll_state_synchronize_rcu() docbook header · f21e0143
      Paul E. McKenney authored
      This commit removes the blank line preceding the oldstate parameter to
      the docbook header for the poll_state_synchronize_rcu() function and
      marks uses of this parameter later in that header.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Add full-sized polling for start_poll_expedited() · 6c502b14
      Paul E. McKenney authored
      The start_poll_synchronize_rcu_expedited() API compresses the combined
      expedited and normal grace-period states into a single unsigned long,
      which conserves storage, but can miss grace periods in certain cases
      involving overlapping normal and expedited grace periods.  Missing the
      occasional grace period is usually not a problem, but there are use
      cases that care about each and every grace period.
      
      This commit therefore adds yet another member of the
      full-state RCU grace-period polling API, which is the
      start_poll_synchronize_rcu_expedited_full() function.  This uses up to
      three times the storage (rcu_gp_oldstate structure instead of unsigned
      long), but is guaranteed not to miss grace periods.
      
      [ paulmck: Apply feedback from kernel test robot and Julia Lawall. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Add full-sized polling for start_poll() · 76ea3641
      Paul E. McKenney authored
      The start_poll_synchronize_rcu() API compresses the combined expedited and
      normal grace-period states into a single unsigned long, which conserves
      storage, but can miss grace periods in certain cases involving overlapping
      normal and expedited grace periods.  Missing the occasional grace period
      is usually not a problem, but there are use cases that care about each
      and every grace period.
      
      This commit therefore adds the next member of the full-state RCU
      grace-period polling API, namely the start_poll_synchronize_rcu_full()
      function.  This uses up to three times the storage (rcu_gp_oldstate
      structure instead of unsigned long), but is guaranteed not to miss
      grace periods.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Verify long-running reader prevents full polling from completing · f4754ad2
      Paul E. McKenney authored
      This commit adds full-state polling checks to accompany the old-style
      polling checks in the rcu_torture_one_read() function.  If a polling
      cycle within an RCU reader completes, a WARN_ONCE() is triggered.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Remove redundant RTWS_DEF_FREE check · 37d6ade3
      Paul E. McKenney authored
This check does nothing because the rcu_torture_writer_state value at
this point in the code is guaranteed to be RTWS_REPLACE.  This commit
therefore removes this check.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Verify RCU reader prevents full polling from completing · d594231a
      Paul E. McKenney authored
      This commit adds a test to rcu_torture_writer() that verifies that a
      ->get_gp_state_full() and ->poll_gp_state_full() polled grace-period
      sequence does not claim that a grace period elapsed within the confines
      of the corresponding read-side critical section.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Allow per-RCU-flavor polled double-GP check · ed7d2f1a
      Paul E. McKenney authored
      Only vanilla RCU needs a double grace period for its compressed
      polled grace-period old-state cookie.  This commit therefore adds an
      rcu_torture_ops per-flavor function ->poll_need_2gp to allow this check
      to be adapted to the RCU flavor under test.  A NULL pointer for this
      function says that doubled grace periods are never needed.
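
A minimal sketch of how the hook might be wired up; the field name follows
this commit, while the predicate body is illustrative:

<snip>
	static bool rcu_poll_need_2gp(bool poll, bool poll_full)
	{
		return poll;	/* only the compressed single-cookie APIs */
	}

	static struct rcu_torture_ops rcu_ops = {
		/* ... */
		.poll_need_2gp = rcu_poll_need_2gp,	/* NULL would mean "never" */
	};
<snip>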
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcutorture: Abstract synchronous and polled API testing · ccb42229
      Paul E. McKenney authored
This commit abstracts a do_rtws_sync() function that does synchronous
grace-period testing, but that also tests each of the normal and
full-state variants of the polled API 25% of the time.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Add full-sized polling for get_state() · 3fdefca9
      Paul E. McKenney authored
      The get_state_synchronize_rcu() API compresses the combined expedited and
      normal grace-period states into a single unsigned long, which conserves
      storage, but can miss grace periods in certain cases involving overlapping
      normal and expedited grace periods.  Missing the occasional grace period
      is usually not a problem, but there are use cases that care about each
      and every grace period.
      
      This commit therefore adds the next member of the full-state RCU
      grace-period polling API, namely the get_state_synchronize_rcu_full()
      function.  This uses up to three times the storage (rcu_gp_oldstate
      structure instead of unsigned long), but is guaranteed not to miss
      grace periods.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Add full-sized polling for get_completed*() and poll_state*() · 91a967fd
      Paul E. McKenney authored
      The get_completed_synchronize_rcu() and poll_state_synchronize_rcu()
      APIs compress the combined expedited and normal grace-period states into a
      single unsigned long, which conserves storage, but can miss grace periods
      in certain cases involving overlapping normal and expedited grace periods.
      Missing the occasional grace period is usually not a problem, but there
      are use cases that care about each and every grace period.
      
      This commit therefore adds the first members of the full-state RCU
      grace-period polling API, namely the get_completed_synchronize_rcu_full()
and poll_state_synchronize_rcu_full() functions.  These use up to three
times the storage (rcu_gp_oldstate structure instead of unsigned long),
but are guaranteed not to miss grace periods, at least in situations
      where the single-CPU grace-period optimization does not apply.
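
A hedged usage sketch; defer_free() is a hypothetical fallback for the
not-yet-expired case:

<snip>
	struct rcu_gp_oldstate rgos;

	get_state_synchronize_rcu_full(&rgos);
	/* ... unlink item from all reader-visible structures ... */
	if (poll_state_synchronize_rcu_full(&rgos))
		kfree(item);		/* a full grace period has elapsed */
	else
		defer_free(item);	/* hypothetical: recheck the cookie later */
<snip>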
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu/nocb: Add CPU number to CPU-{,de}offload failure messages · 638dce22
      Paul E. McKenney authored
      Offline CPUs cannot be offloaded or deoffloaded.  Any attempt to offload
      or deoffload an offline CPU causes a message to be printed on the console,
      which is good, but this message does not contain the CPU number, which
      is bad.  Such a CPU number can be helpful when debugging, as it gives a
      clear indication that the CPU in question is in fact offline.  This commit
      therefore adds the CPU number to the CPU-{,de}offload failure messages.
      
      Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu/nocb: Choose the right rcuog/rcuop kthreads to output · 5334da2a
      Zqiang authored
      The show_rcu_nocb_gp_state() function is supposed to dump out the rcuog
      kthread and the show_rcu_nocb_state() function is supposed to dump out
      the rcuo[ps] kthread.  Currently, both do a mixture, which is not optimal
      for debugging, even though it does not affect functionality.
      
      This commit therefore adjusts these two functions to focus on their
      respective kthreads.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval · 51824b78
      Uladzislau Rezki (Sony) authored
      Currently the monitor work is scheduled with a fixed interval of HZ/20,
      which is roughly 50 milliseconds. The drawback of this approach is
low utilization of the 512 page slots in scenarios with infrequent
kvfree_rcu() calls.  For example, on an Android system:
      
      <snip>
        kworker/3:3-507     [003] ....   470.286305: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=6
        kworker/6:1-76      [006] ....   470.416613: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000ea0d6556 nr_records=1
        kworker/6:1-76      [006] ....   470.416625: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000003e025849 nr_records=9
        kworker/3:3-507     [003] ....   471.390000: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000815a8713 nr_records=48
        kworker/1:1-73      [001] ....   471.725785: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000fda9bf20 nr_records=3
        kworker/1:1-73      [001] ....   471.725833: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000a425b67b nr_records=76
        kworker/0:4-1411    [000] ....   472.085673: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007996be9d nr_records=1
        kworker/0:4-1411    [000] ....   472.085728: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=5
        kworker/6:1-76      [006] ....   472.260340: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000065630ee4 nr_records=102
      <snip>
      
      In many cases, out of 512 slots, fewer than 10 were actually used.
In order to improve batching and make utilization more efficient, this
commit sets the drain interval to a fixed 5 seconds.  Floods are
detected when a page fills quickly, and in that case, the reclaim work
is re-scheduled for the next scheduling-clock tick (jiffy).
      
      After this change:
      
      <snip>
        kworker/7:1-371     [007] ....  5630.725708: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000005ab0ffb3 nr_records=121
        kworker/7:1-371     [007] ....  5630.989702: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000060c84761 nr_records=47
        kworker/7:1-371     [007] ....  5630.989714: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000000babf308 nr_records=510
        kworker/7:1-371     [007] ....  5631.553790: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000bb7bd0ef nr_records=169
        kworker/7:1-371     [007] ....  5631.553808: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000044c78753 nr_records=510
        kworker/5:6-9428    [005] ....  5631.746102: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d98519aa nr_records=123
        kworker/4:7-9434    [004] ....  5632.001758: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000526c9d44 nr_records=322
        kworker/4:7-9434    [004] ....  5632.002073: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000002c6a8afa nr_records=185
        kworker/7:1-371     [007] ....  5632.277515: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007f4a962f nr_records=510
      <snip>
      
Here, in all but one of the cases, more than one hundred slots were used,
representing an order-of-magnitude improvement.
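
A hedged sketch of the resulting scheduling policy; the helper and flood
flag are illustrative, while the 5-second KFREE_DRAIN_JIFFIES value is
the one this commit establishes:

<snip>
	#define KFREE_DRAIN_JIFFIES	(5 * HZ)

	static void schedule_monitor(struct kfree_rcu_cpu *krcp, bool flood)
	{
		/* Batch for 5 seconds normally; drain on the next jiffy
		 * when a page fills quickly. */
		schedule_delayed_work(&krcp->monitor_work,
				      flood ? 1 : KFREE_DRAIN_JIFFIES);
	}
<snip>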
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu/kfree: Fix kfree_rcu_shrink_count() return value · 38269096
      Joel Fernandes (Google) authored
      As per the comments in include/linux/shrinker.h, .count_objects callback
      should return the number of freeable items, but if there are no objects
      to free, SHRINK_EMPTY should be returned. The only time 0 is returned
      should be when we are unable to determine the number of objects, or the
      cache should be skipped for another reason.
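
A sketch of the resulting convention; count_queued_objects() is a
hypothetical stand-in for the shrinker's real accounting:

<snip>
	static unsigned long
	kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
	{
		unsigned long count = count_queued_objects();

		return count ? count : SHRINK_EMPTY;	/* never 0 for "empty" */
	}
<snip>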
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Back off upon fill_page_cache_func() allocation failure · 093590c1
      Michal Hocko authored
The fill_page_cache_func() function allocates a couple of pages to store
kvfree_rcu_bulk_data structures.  This is a lightweight (GFP_NORETRY)
allocation which can fail under memory pressure.  The function will,
however, keep retrying even when the previous attempt has failed.
      
      This retrying is in theory correct, but in practice the allocation is
      invoked from workqueue context, which means that if the memory reclaim
      gets stuck, these retries can hog the worker for quite some time.
      Although the workqueues subsystem automatically adjusts concurrency, such
      adjustment is not guaranteed to happen until the worker context sleeps.
      And the fill_page_cache_func() function's retry loop is not guaranteed
      to sleep (see the should_reclaim_retry() function).
      
      And we have seen this function cause workqueue lockups:
      
      kernel: BUG: workqueue lockup - pool cpus=93 node=1 flags=0x1 nice=0 stuck for 32s!
      [...]
      kernel: pool 74: cpus=37 node=0 flags=0x1 nice=0 hung=32s workers=2 manager: 2146
      kernel:   pwq 498: cpus=249 node=1 flags=0x1 nice=0 active=4/256 refcnt=5
      kernel:     in-flight: 1917:fill_page_cache_func
      kernel:     pending: dbs_work_handler, free_work, kfree_rcu_monitor
      
      Originally, we thought that the root cause of this lockup was several
      retries with direct reclaim, but this is not yet confirmed.  Furthermore,
      we have seen similar lockups without any heavy memory pressure.  This
      suggests that there are other factors contributing to these lockups.
However, it is not really clear that endless retries are desirable.
      
      So let's make the fill_page_cache_func() function back off after
      allocation failure.
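
A hedged sketch of the backoff; stash_page() is a hypothetical per-CPU
cache insertion, and the real function's bookkeeping is elided:

<snip>
	for (i = 0; i < nr_pages; i++) {
		page = (void *)__get_free_page(GFP_KERNEL | __GFP_NORETRY |
					       __GFP_NOWARN);
		if (!page)
			break;		/* back off; a later invocation refills */
		if (!stash_page(krcp, page))
			free_page((unsigned long)page);
	}
<snip>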
      
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
• rcu: Exclude outgoing CPU when it is the last to leave · 7634b1ea
      Paul E. McKenney authored
      The rcu_boost_kthread_setaffinity() function removes the outgoing CPU
      from the set_cpus_allowed() mask for the corresponding leaf rcu_node
      structure's rcub priority-boosting kthread.  Except that if the outgoing
      CPU will leave that structure without any online CPUs, the mask is set
      to the housekeeping CPU mask from housekeeping_cpumask().  Which is fine
      unless the outgoing CPU happens to be a housekeeping CPU.
      
      This commit therefore removes the outgoing CPU from the housekeeping mask.
      This would of course be problematic if the outgoing CPU was the last
      online housekeeping CPU, but in that case you are in a world of hurt
      anyway.  If someone comes up with a valid use case for a system needing
      all the housekeeping CPUs to be offline, further adjustments can be made.
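
A hedged sketch of the adjusted fallback path; variable names are
illustrative, and the real function also handles the case where there
is no outgoing CPU:

<snip>
	/* cm already holds the candidate CPUs for this leaf rcu_node. */
	cpumask_and(cm, cm, housekeeping_cpumask(HK_TYPE_RCU));
	if (cpumask_empty(cm)) {
		cpumask_copy(cm, housekeeping_cpumask(HK_TYPE_RCU));
		cpumask_clear_cpu(outgoingcpu, cm);	/* the new exclusion */
	}
	set_cpus_allowed_ptr(rnp->boost_kthread_task, cm);
<snip>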
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>