    cgroup: introduce cgroup.kill · 661ee628
    Christian Brauner authored
    Introduce the cgroup.kill file. It does what it says on the tin and
    allows a caller to kill a cgroup by writing "1" into cgroup.kill.
    The file is available in non-root cgroups.
    
    Killing cgroups is a process directed operation, i.e. the whole
    thread-group is affected. Consequently trying to write to cgroup.kill in
    threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
    aligns with cgroup.procs where reads in threaded-cgroups are rejected
    with EOPNOTSUPP.
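
    From userspace, the interface described above could be exercised roughly
    as follows. This is a minimal sketch, not part of the patch: the helper
    name cgroup_kill() and the fixed-size path buffer are assumptions made
    for illustration, and the caller passes the cgroup directory (for
    example a delegated cgroup under /sys/fs/cgroup).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: kill all processes in the cgroup at cgroup_dir by
 * writing "1" to its cgroup.kill file.
 * Returns 0 on success or a negative errno value, e.g. -EOPNOTSUPP when
 * the cgroup is threaded, or -ENOENT when the file does not exist (root
 * cgroup, older kernel, wrong path).
 */
static int cgroup_kill(const char *cgroup_dir)
{
    char path[4096];
    ssize_t written;
    int fd, len, ret = 0;

    len = snprintf(path, sizeof(path), "%s/cgroup.kill", cgroup_dir);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -ENAMETOOLONG;

    /* The file is write-only, so open it for writing only. */
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;

    written = write(fd, "1", 1);
    if (written < 0)
        ret = -errno;
    else if (written != 1)
        ret = -EIO;

    close(fd);
    return ret;
}
```

    A caller would typically treat -EOPNOTSUPP as "this is a threaded
    cgroup, kill its domain cgroup instead" rather than as a hard failure.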
    
    The cgroup.kill file is write-only since killing a cgroup is an event,
    not a state, which makes it different from e.g. the freezer, where a
    cgroup transitions between two states.
    
    As with all new cgroup features cgroup.kill is recursive by default.
    
    Killing a cgroup is protected against concurrent migrations through the
    cgroup mutex. To protect against forkbombs and to mitigate the effect of
    racing forks, a new CGRP_KILL flag is introduced. It is protected by the
    css set lock, set prior to killing a cgroup, and unset after the cgroup
    has been killed. We can then check in cgroup_post_fork(), where we
    already hold the css set lock, whether the cgroup is currently being
    killed. If so, we send the child a SIGKILL immediately, taking it down
    as soon as it returns to userspace. To make the killing of the child
    semantically clean, it is killed after all cgroup attachment operations
    have been finalized.
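
    The flag handshake described above can be modelled in userspace as
    follows. This is a simplified sketch and not the kernel code: the
    model_* names are hypothetical, a pthread mutex stands in for the css
    set lock, and a boolean stands in for both CGRP_KILL and the pending
    SIGKILL.

```c
#include <pthread.h>
#include <stdbool.h>

struct model_cgroup {
    pthread_mutex_t lock;  /* stands in for the css set lock */
    bool being_killed;     /* stands in for the CGRP_KILL flag */
};

struct model_task {
    bool sigkill_pending;  /* stands in for a queued SIGKILL */
};

/* The killer sets the flag before it starts and clears it when done. */
static void model_set_kill_flag(struct model_cgroup *cg, bool on)
{
    pthread_mutex_lock(&cg->lock);
    cg->being_killed = on;
    pthread_mutex_unlock(&cg->lock);
}

/*
 * Runs for every new child once its cgroup attachment is complete.
 * A child that was forked while the kill is in progress is marked for
 * SIGKILL, so it cannot escape the kill by racing with it.
 */
static void model_post_fork(struct model_cgroup *cg, struct model_task *child)
{
    bool kill;

    pthread_mutex_lock(&cg->lock);
    kill = cg->being_killed;
    pthread_mutex_unlock(&cg->lock);

    if (kill)
        child->sigkill_pending = true;
}
```

    The point of the model is only the ordering: because the flag is read
    under the same lock that protects its updates, a fork either completes
    before the kill starts and is found by the killer, or observes the flag
    and is taken down in the post-fork check.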
    
    There are various use-cases of this interface:
    - Containers usually have a conservative layout where each container
      has a delegated cgroup. For such layouts there is a 1:1 mapping
      between container and cgroup. If the container in addition uses a
      separate pid namespace then killing a container usually becomes a
      simple kill -9 <container-init-pid> from an ancestor pid namespace.
      However, there are quite a few scenarios where that isn't true. For
      example, some containers share their cgroup on purpose with other
      processes that are supposed to be bound to the lifetime of the
      container but are not in the same pidns as the container. Another
      example is containers that are in a delegated cgroup but share the
      pid namespace with the host or other containers.
    - Service managers such as systemd use cgroups to group and organize
      processes belonging to a service. They usually rely on a recursive
      algorithm now to kill a service. With cgroup.kill this becomes a
      simple write to cgroup.kill.
    - Userspace OOM implementations can make good use of this feature to
      efficiently take down whole cgroups quickly.
    - The kill program can gain a new
      kill --cgroup /sys/fs/cgroup/delegated
      flag to take down cgroups.
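
    For contrast, one pass of the iterative approach that a single write to
    cgroup.kill replaces might look like the sketch below. It is an
    illustration only and not taken from any particular service manager;
    the helper name kill_cgroup_procs_once() is hypothetical. A real
    implementation has to repeat the pass until cgroup.procs is empty and
    recurse into child cgroups, because processes can fork between passes.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: read every PID listed in <cgroup_dir>/cgroup.procs
 * and send it SIGKILL. This is a single, non-recursive pass.
 * Returns the number of PIDs signalled or a negative errno value.
 */
static int kill_cgroup_procs_once(const char *cgroup_dir)
{
    char path[4096];
    FILE *procs;
    long pid;
    int len, killed = 0;

    len = snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -ENAMETOOLONG;

    procs = fopen(path, "r");
    if (!procs)
        return -errno;

    /* cgroup.procs lists one PID per line. */
    while (fscanf(procs, "%ld", &pid) == 1) {
        if (pid > 0 && kill((pid_t)pid, SIGKILL) == 0)
            killed++;
    }

    fclose(procs);
    return killed;
}
```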
    
    A few observations about the semantics:
    - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
      not specified we are not taking cgroup mutex meaning the cgroup can be
      killed while a process in that cgroup is forking.
      If the kill request happens right before cgroup_can_fork() and before
      the parent grabs its siglock the parent is guaranteed to see the
      pending SIGKILL. In addition we perform another check in
      cgroup_post_fork() whether the cgroup is being killed and if so take
      down the child (see above). This is robust enough and protects against
      forkbombs. If userspace really really wants to have stricter
      protection the simple solution would be to grab the write side of the
      cgroup threadgroup rwsem which will force all ongoing forks to
      complete before killing starts. We concluded that this is not
      necessary as the semantics for concurrent forking should simply align
      with freezer where a similar check as cgroup_post_fork() is performed.
    
      For all other cases CLONE_INTO_CGROUP is required. In this case we
      will grab the cgroup mutex so the cgroup can't be killed while we
      fork. Once we're done with the fork and have dropped cgroup mutex we
      are visible and will be found by any subsequent kill request.
    - We obviously don't kill kthreads. This means a cgroup that has a
      kthread will not become empty after killing and consequently no
      unpopulated event will be generated. The assumption is that kthreads
      should be in the root cgroup only anyway so this is not an issue.
    - We skip killing tasks that already have pending fatal signals.
    - Freezer doesn't care about tasks in different pid namespaces, i.e. if
      you have two tasks in different pid namespaces the cgroup would still
      be frozen. The cgroup.kill mechanism consequently behaves the same
      way, i.e. we kill all processes and ignore in which pid namespace they
      exist.
    - If the caller is located in a cgroup that is killed the caller will
      obviously be killed as well.
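
    Since a cgroup containing a kthread never becomes empty, a caller that
    waits for the kill to finish may want to look at the "populated" key in
    the cgroup's cgroup.events file instead of assuming success. A minimal
    sketch of parsing that key is shown below; the helper name
    parse_populated() is hypothetical, and reading the file contents into a
    buffer is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: given the text of a cgroup.events file, return the
 * value of its "populated" key (0 or 1), or -1 if the key is missing.
 */
static int parse_populated(const char *events_text)
{
    const char *line = events_text;

    while (line && *line) {
        int value;

        /* Each line has the form "<key> <value>". */
        if (sscanf(line, "populated %d", &value) == 1)
            return value ? 1 : 0;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

    A result of 1 some time after the kill suggests that something which
    cannot be killed this way, such as a kthread, is still in the cgroup or
    one of its descendants.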
    
    Link: https://lore.kernel.org/r/20210503143922.3093755-1-brauner@kernel.org
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: cgroups@vger.kernel.org
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Reviewed-by: Serge Hallyn <serge@hallyn.com>
    Acked-by: Roman Gushchin <guro@fb.com>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>