    rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes() · 28319d6d
    Frederic Weisbecker authored
    RCU Tasks and PID-namespace unshare can interact in do_exit() in a
    complicated circular dependency:
    
    1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
       that every subsequent child of TASK A will belong to. But TASK A
       doesn't itself belong to that new PID namespace.
    
    2) TASK A forks() and creates TASK B. TASK A stays attached to its PID
       namespace (let's say PID_NS1) and TASK B is the first task belonging
       to the new PID namespace created by unshare() (let's call it PID_NS2).
    
    3) Since TASK B is the first task attached to PID_NS2, it becomes the
       PID_NS2 child reaper.
    
    4) TASK A forks() again and creates TASK C which gets attached to PID_NS2.
       Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
       TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.
    
    5) TASK B exits and since it is the child reaper for PID_NS2, it has to
       kill all other tasks attached to PID_NS2, and wait for all of them to
       die before getting reaped itself (zap_pid_ns_processes()).
    
    6) TASK A calls synchronize_rcu_tasks() which leads to
       synchronize_srcu(&tasks_rcu_exit_srcu).
    
    7) TASK B is waiting for TASK C to get reaped. But TASK B is under a
       tasks_rcu_exit_srcu SRCU critical section (exit_notify() is between
       exit_tasks_rcu_start() and exit_tasks_rcu_finish()), blocking TASK A.
    
    8) TASK C exits and, since TASK A is its parent, waits for TASK A to
       reap it. But TASK A can't, because it waits for TASK B, which waits
       for TASK C.
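
    The namespace topology of steps 1) to 5) can be set up from userspace
    roughly as follows. This is only a sketch under assumptions: it needs
    Linux and CAP_SYS_ADMIN for unshare(CLONE_NEWPID), the helper names
    (spawn_sleeper(), task_a(), run_pidns_scenario()) are made up for
    illustration, and it does not trigger synchronize_rcu_tasks() (step 6),
    which happens inside the kernel. It is therefore not a reproducer for
    the deadlock, just the process layout.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that sleeps until it is killed. Returns its PID or -1. */
static pid_t spawn_sleeper(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		for (;;)
			pause();
	}
	return pid;
}

/* Plays TASK A in a throwaway process. Exits with 0 or with an errno value. */
static void task_a(void)
{
	pid_t b, c;

	/* Step 1: future children of TASK A land in a new namespace (PID_NS2). */
	if (unshare(CLONE_NEWPID) < 0)
		_exit(errno);

	/* Steps 2-3: TASK B is the first task of PID_NS2, so its child reaper. */
	b = spawn_sleeper();
	if (b < 0)
		_exit(errno);

	/* Step 4: TASK C lives in PID_NS2 but its parent is still TASK A. */
	c = spawn_sleeper();
	if (c < 0) {
		int err = errno;

		kill(b, SIGKILL);
		waitpid(b, NULL, 0);
		_exit(err);
	}

	/* Step 5: TASK B exits and zaps PID_NS2, which kills TASK C. */
	kill(b, SIGKILL);

	/*
	 * TASK B cannot finish exiting before TASK C has been reaped, so
	 * reap children in whatever order they become available.
	 */
	while (waitpid(-1, NULL, 0) > 0)
		;
	_exit(0);
}

/* Returns 0 on success, or a negative errno value (e.g. -EPERM). */
int run_pidns_scenario(void)
{
	int status;
	pid_t helper = fork();

	if (helper < 0)
		return -errno;
	if (helper == 0)
		task_a();	/* never returns */
	if (waitpid(helper, &status, 0) < 0)
		return -errno;
	return WIFEXITED(status) ? -WEXITSTATUS(status) : -ECHILD;
}
```

    Running TASK A in a forked helper keeps the caller's own namespace
    setup untouched. Note that waiting for TASK B alone before TASK C would
    already hang in userspace, which mirrors the dependency in steps 5) and 8).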
    
    Pid_namespace semantics can hardly be changed at this point. But the
    coverage of tasks_rcu_exit_srcu can be reduced instead.
    
    The current task is assumed not to be concurrently reapable at this
    stage of exit_notify() and therefore tasks_rcu_exit_srcu can be
    temporarily relaxed without breaking its constraints, providing a way
    out of the deadlock scenario.
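
    The idea of relaxing tasks_rcu_exit_srcu while the child reaper sleeps
    can be modelled in plain userspace C. This is a toy model under loud
    assumptions, not the actual diff (which is not reproduced here): the
    SRCU read side is reduced to a counter, exit_tasks_rcu_stop() is an
    assumed name for the "relax" step, and zap_wait_model() is a
    hypothetical stand-in for the reaper's wait loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the exiting task's tasks_rcu_exit_srcu read-side nesting. */
static int srcu_readers;

static void exit_tasks_rcu_start(void) { srcu_readers++; }
static void exit_tasks_rcu_stop(void)  { srcu_readers--; }	/* assumed name */

/* A synchronize_srcu() caller can only proceed once no reader is left. */
static bool srcu_gp_can_complete(void) { return srcu_readers == 0; }

/*
 * Model of the child reaper (TASK B) waiting for `pending` namespace
 * tasks to be reaped by their parent (TASK A). Each iteration stands for
 * one sleep: TASK A reaps one task only if its grace period can complete.
 * Returns true if the loop terminates, false if it would deadlock.
 */
static bool zap_wait_model(int pending, bool relax)
{
	bool done = true;

	exit_tasks_rcu_start();		/* entered earlier on the exit path */
	while (pending > 0) {
		bool progress;

		if (relax)
			exit_tasks_rcu_stop();	/* leave the critical section */
		progress = srcu_gp_can_complete();	/* TASK A runs here */
		if (relax)
			exit_tasks_rcu_start();	/* re-enter before continuing */
		if (!progress) {
			done = false;		/* TASK A is stuck: deadlock */
			break;
		}
		pending--;			/* TASK A reaped one task */
	}
	exit_tasks_rcu_stop();		/* stands in for exit_tasks_rcu_finish() */
	return done;
}
```

    In this model, zap_wait_model(1, false) reports a deadlock while
    zap_wait_model(1, true) terminates, and the reader count is balanced in
    both cases.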
    
    [ paulmck: Fix build failure by adding additional declaration. ]
    
    Fixes: 3f95aa81 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting")
    Reported-by: Pengfei Xu <pengfei.xu@intel.com>
    Suggested-by: Boqun Feng <boqun.feng@gmail.com>
    Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
    Suggested-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Eric W . Biederman <ebiederm@xmission.com>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>