    rcu/exp: Remove rcu_par_gp_wq · 23da2ad6
    Frederic Weisbecker authored
    TREE04 running on short iterations can produce writer stalls of the
    following kind:
    
     ??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0
     task:rcu_torture_wri state:D stack:14568 pid:83    ppid:2      flags:0x00004000
     Call Trace:
      <TASK>
      __schedule+0x2de/0x850
      ? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0
      schedule+0x4f/0x90
      synchronize_rcu_expedited+0x430/0x670
      ? __pfx_autoremove_wake_function+0x10/0x10
      ? __pfx_synchronize_rcu_expedited+0x10/0x10
      do_rtws_sync.constprop.0+0xde/0x230
      rcu_torture_writer+0x4b4/0xcd0
      ? __pfx_rcu_torture_writer+0x10/0x10
      kthread+0xc7/0xf0
      ? __pfx_kthread+0x10/0x10
      ret_from_fork+0x2f/0x50
      ? __pfx_kthread+0x10/0x10
      ret_from_fork_asm+0x1b/0x30
      </TASK>
    
    Waiting for an expedited grace period and polling for an expedited
    grace period are both operations that internally rely on the same
    workqueue to perform the necessary asynchronous work.
    
    However, a dependency chain is involved between those two operations,
    as depicted below:
    
           ====== CPU 0 =======                          ====== CPU 1 =======
    
                                                         synchronize_rcu_expedited()
                                                             exp_funnel_lock()
                                                                 mutex_lock(&rcu_state.exp_mutex);
        start_poll_synchronize_rcu_expedited
            queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
                                                             synchronize_rcu_expedited_queue_work()
                                                                 queue_work(rcu_gp_wq, &rew->rew_work);
                                                             wait_event() // A, wait for &rew->rew_work completion
                                                             mutex_unlock() // B
        //======> switch to kworker
    
        sync_rcu_do_polled_gp() {
            synchronize_rcu_expedited()
                exp_funnel_lock()
                    mutex_lock(&rcu_state.exp_mutex); // C, wait B
                    ....
        } // D
    
    Since workqueues are usually implemented on top of several kworkers
    handling the queue concurrently, the above situation wouldn't deadlock
    most of the time, because A then doesn't depend on D. But under memory
    stress, a single kworker may end up handling all the works alone, in
    a serialized way. In that case the above layout becomes a problem,
    because A then waits for D, closing a circular dependency:
    
    	A -> D -> C -> B -> A
    
    This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed,
    synchronize_rcu_expedited() is otherwise implemented on top of a
    kthread worker, while polling still relies on the rcu_gp_wq
    workqueue, which breaks the above circular dependency chain.
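    A sketch of the configuration split in the pre-patch tree_exp.h
    queueing helper (identifiers are believed to match the pre-patch
    code, but treat this fragment as illustrative rather than a quote):

```c
/* Illustrative sketch of the pre-patch queueing split in tree_exp.h.
 * With CONFIG_RCU_EXP_KTHREAD=y the expedited work goes to a dedicated
 * kthread worker; with =n it shares rcu_gp_wq with the polling work,
 * enabling the circular dependency described above. */
#ifdef CONFIG_RCU_EXP_KTHREAD
static void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
{
	kthread_queue_work(rcu_exp_gp_kworker, &rew->rew_work);
}
#else /* !CONFIG_RCU_EXP_KTHREAD */
static void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
{
	queue_work(rcu_gp_wq, &rew->rew_work);
}
#endif
```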
    
    Fix this by making expedited grace periods always rely on kthread
    workers. The workqueue-based implementation is essentially a
    duplicate anyway, now that the per-node initialization is performed
    by per-node kthread workers.
    
    Meanwhile, the CONFIG_RCU_EXP_KTHREAD switch is still kept around
    to manage the scheduler policy of these kthread workers.
    Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
    Reported-by: Thomas Gleixner <tglx@linutronix.de>
    Suggested-by: Joel Fernandes <joel@joelfernandes.org>
    Suggested-by: Paul E. McKenney <paulmck@kernel.org>
    Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>