1. 16 Mar, 2016 1 commit
    • cgroup: ignore css_sets associated with dead cgroups during migration · 2b021cbf
      Tejun Heo authored
      Before 2e91fa7f ("cgroup: keep zombies associated with their
      original cgroups"), all dead tasks were associated with init_css_set.
      If a zombie task is requested for migration, while migration prep
      operations would still be performed on init_css_set, the actual
      migration would ignore zombie tasks.  As init_css_set is always valid,
      this worked fine.
      
      However, after 2e91fa7f, a zombie task stays with the css_set it was
      associated with at the time of death.  Let's say a task T is associated
      with cgroup A on hierarchy H-1 and cgroup B on hierarchy H-2.  After T
      becomes a zombie, it would still remain associated with A and B.  If A
      only contains zombie tasks, it can be removed.  On removal, A gets
      marked offline but stays pinned until all zombies are drained.  At
      this point, if migration is initiated on T to a cgroup C on hierarchy
      H-2, migration path would try to prepare T's css_set for migration and
      trigger the following.
      
       WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
       CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
       ...
       Call Trace:
        [<ffffffff8127e63c>] dump_stack+0x4e/0x82
        [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
        [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
        [<ffffffff810c33e1>] cgroup_get+0x121/0x160
        [<ffffffff810c349b>] link_css_set+0x7b/0x90
        [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
        [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
        [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
        [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
        [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
        [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
        [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
        [<ffffffff81151673>] __vfs_write+0x23/0xe0
        [<ffffffff81152494>] vfs_write+0xa4/0x1a0
        [<ffffffff811532d4>] SyS_write+0x44/0xa0
        [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It doesn't make sense to prepare migration for css_sets pointing to
      dead cgroups as they are guaranteed to contain only zombies which are
      ignored later during migration.  This patch makes the cgroup destruction
      path mark all affected css_sets as dead and updates the migration path
      to ignore them during preparation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 2e91fa7f ("cgroup: keep zombies associated with their original cgroups")
      Cc: stable@vger.kernel.org # v4.4+
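The idea in the commit above can be sketched as a small userspace model. This is not the kernel code: `struct css_set` here is a toy with a hypothetical `dead` flag and a `prepared` marker, and the function names are invented for illustration. It only shows the shape of the fix: destruction marks the affected sets, and preparation skips them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's css_set; only what the sketch needs. */
struct css_set {
	bool dead;      /* set by the (modelled) cgroup destruction path */
	bool prepared;  /* set when migration prep has processed this set */
};

/* Model of cgroup destruction: mark every css_set linked to the cgroup. */
static void mark_csets_dead(struct css_set **linked, size_t n)
{
	for (size_t i = 0; i < n; i++)
		linked[i]->dead = true;
}

/* Model of migration prep: skip dead css_sets, return how many were prepared. */
static size_t prepare_migration(struct css_set **srcs, size_t n)
{
	size_t prepared = 0;

	for (size_t i = 0; i < n; i++) {
		assert(srcs[i] != NULL);
		if (srcs[i]->dead)
			continue;   /* only zombies left, nothing to migrate */
		srcs[i]->prepared = true;
		prepared++;
	}
	return prepared;
}
```

In this model a css_set pointing at a removed cgroup is never touched by preparation, so nothing tries to take a reference on the dead cgroup.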
  2. 11 Mar, 2016 1 commit
  3. 08 Mar, 2016 5 commits
    • cgroup: implement cgroup_subsys->implicit_on_dfl · f6d635ad
      Tejun Heo authored
      Some controllers, perf_event for now and possibly freezer in the
      future, don't really make sense to control explicitly through
      "cgroup.subtree_control".  For example, the primary role of perf_event
      is identifying the cgroups of tasks; however, because the controller
      also keeps a small amount of state per cgroup, it can't be replaced
      with simple cgroup membership tests.
      
      This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
      the controller is implicitly enabled on all cgroups on the v2
      hierarchy so that utility type controllers such as perf_event can be
      enabled and function transparently.
      
      An implicit controller doesn't show up in "cgroup.controllers" or
      "cgroup.subtree_control", is exempt from the no-internal-process rule,
      and can be stolen from the default hierarchy even if there are non-root
      csses.
      
      v2: Reimplemented on top of the recent updates to css handling and
          subsystem rebinding.  Rebinding implicit subsystems is now a
          simple matter of exempting it from the busy subsystem check.
      Signed-off-by: Tejun Heo <tj@kernel.org>
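A minimal sketch of how an implicit controller might be folded into mask calculations, assuming controllers are tracked as bits in a 16-bit mask. The subsystem IDs, the table and all three helper names are hypothetical. Only the `implicit_on_dfl` flag and the two interface file names come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subsystem IDs for the sketch. */
enum { SS_CPU, SS_MEMORY, SS_PERF_EVENT, SS_COUNT };

struct subsys {
	bool implicit_on_dfl;
};

static const struct subsys subsys_table[SS_COUNT] = {
	[SS_CPU]        = { .implicit_on_dfl = false },
	[SS_MEMORY]     = { .implicit_on_dfl = false },
	[SS_PERF_EVENT] = { .implicit_on_dfl = true },
};

/* Mask of all controllers flagged as implicit. */
static uint16_t implicit_mask(void)
{
	uint16_t mask = 0;

	assert(SS_COUNT <= 16);
	for (int i = 0; i < SS_COUNT; i++)
		if (subsys_table[i].implicit_on_dfl)
			mask |= (uint16_t)(1u << i);
	return mask;
}

/* Controllers effectively enabled: explicit ones plus all implicit ones. */
static uint16_t effective_mask(uint16_t subtree_control)
{
	return subtree_control | implicit_mask();
}

/* Controllers listed in "cgroup.controllers": implicit ones are hidden. */
static uint16_t visible_mask(uint16_t available)
{
	return available & (uint16_t)~implicit_mask();
}
```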
    • cgroup: use css_set->mg_dst_cgrp for the migration target cgroup · e4857982
      Tejun Heo authored
      Migration can be multi-target on the default hierarchy when a
      controller is enabled - processes belonging to each child cgroup have
      to be moved to the child cgroup itself to refresh css association.
      
      This isn't a problem for cgroup_migrate_add_src() as each source
      css_set still maps to single source and target cgroups; however,
      cgroup_migrate_prepare_dst() is called once after all source css_sets
      are added and thus might not have a single destination cgroup.  This
      is currently worked around by specifying NULL for @dst_cgrp and using
      the source's default cgroup as destination as the only multi-target
      migration in use is self-targeting.  While this works, it's subtle
      and clunky.
      
      As all target cgroups are already specified while preparing the source
      css_sets, this clunkiness can easily be removed by recording the
      target cgroup in each source css_set.  This patch adds
      css_set->mg_dst_cgrp which is recorded on cgroup_migrate_add_src() and
      used by cgroup_migrate_prepare_dst().  This also makes migration code
      ready for arbitrary multi-target migration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
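The recording step can be illustrated with a toy model. The structures below are drastically simplified, and `mg_src_cgrp` and `resolved_dst` are assumptions added for the example. Only `mg_dst_cgrp` is named in the commit message. The point is that the prepare step no longer needs a single global destination.

```c
#include <assert.h>
#include <stddef.h>

struct cgroup { const char *name; };

struct css_set {
	struct cgroup *mg_src_cgrp;   /* where the tasks currently are */
	struct cgroup *mg_dst_cgrp;   /* where they should end up */
	struct cgroup *resolved_dst;  /* filled in by the prepare step */
};

/* Model of adding a source: remember the target alongside the source. */
static void migrate_add_src(struct css_set *cset, struct cgroup *src,
			    struct cgroup *dst)
{
	cset->mg_src_cgrp = src;
	cset->mg_dst_cgrp = dst;
}

/* Model of prepare: each source set uses the destination recorded on it. */
static void migrate_prepare_dst(struct css_set **csets, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(csets[i]->mg_dst_cgrp != NULL);
		csets[i]->resolved_dst = csets[i]->mg_dst_cgrp;
	}
}
```

With per-set targets, a self-targeting migration is just the case where each set records its own source cgroup as the destination.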
    • cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup · 37ff9f8f
      Tejun Heo authored
      On the default hierarchy, a migration can be multi-source and/or
      multi-destination.  cgroup_taskset_migrate() used to incorrectly
      assume a single destination cgroup but the bug has been fixed by
      1f7dd3e5 ("cgroup: fix handling of multi-destination migration
      from subtree_control enabling").
      
      Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used
      to determine which subsystems are affected or which cgroup_root the
      migration is taking place in.  As such, @dst_cgrp is misleading.  This
      patch replaces @dst_cgrp with @root.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: move migration destination verification out of cgroup_migrate_prepare_dst() · 6c694c88
      Tejun Heo authored
      cgroup_migrate_prepare_dst() verifies whether the destination cgroup
      is allowable; however, the test doesn't really belong there.  It's too
      deep and common in the stack and as a result the test itself is gated
      by another test.
      
      Separate the test out into cgroup_may_migrate_to() and update
      cgroup_attach_task() and cgroup_transfer_tasks() to perform the test
      directly.  This doesn't cause any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
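The commit message does not spell out what the destination test checks. The sketch below assumes it is the default hierarchy's no-internal-process rule: a non-root cgroup that distributes controllers to its children may not hold processes. Under that assumption, hoisting the check into the callers might look like this. `struct cgroup`, `may_migrate_to()` and `attach_task()` are simplified stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cgroup {
	struct cgroup *parent;     /* NULL for a hierarchy root */
	bool on_dfl;               /* true on the default (v2) hierarchy */
	uint16_t subtree_control;  /* controllers enabled for children */
};

/* Assumed rule: legacy hierarchies, roots, and cgroups with an empty
 * subtree_control are always acceptable destinations. */
static bool may_migrate_to(const struct cgroup *dst)
{
	assert(dst != NULL);
	return !dst->on_dfl || !dst->parent || !dst->subtree_control;
}

/* Model of an attach entry point performing the check up front. */
static int attach_task(const struct cgroup *dst)
{
	if (!may_migrate_to(dst))
		return -EBUSY;
	/* source/destination preparation and the actual move would follow */
	return 0;
}
```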
    • cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses() · 58cdb1ce
      Tejun Heo authored
      cgroup_update_dfl_csses() should move each task in the subtree to
      self; however, it was incorrectly calling cgroup_migrate_add_src()
      with the root of the subtree as @dst_cgrp.  Fortunately,
      cgroup_migrate_add_src() currently uses @dst_cgrp only to determine
      the hierarchy and the bug doesn't cause any actual breakages.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  4. 05 Mar, 2016 1 commit
  5. 04 Mar, 2016 1 commit
  6. 03 Mar, 2016 19 commits
    • cgroup: update css iteration in cgroup_update_dfl_csses() · 54962604
      Tejun Heo authored
      The existing sequences of operations ensure that the offlining csses
      are drained before cgroup_update_dfl_csses(), so even though
      cgroup_update_dfl_csses() uses css_for_each_descendant_pre() to walk
      the target cgroups, it doesn't end up operating on dead cgroups.
      Also, the function explicitly excludes the subtree root from
      operation.
      
      This is fragile and inconsistent with the rest of css update
      operations.  This patch updates cgroup_update_dfl_csses() to use
      cgroup_for_each_live_descendant_pre() instead and include the subtree
      root.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: allocate 2x cgrp_cset_links when setting up a new root · 04313591
      Tejun Heo authored
      During prep, cgroup_setup_root() allocates cgrp_cset_links matching
      the number of existing css_sets to later link the new root.  This is
      fine for now as the only operation which can happen in between is
      rebind_subsystems() and rebinding of empty subsystems doesn't create
      new css_sets.
      
      However, while not yet allowed, with the recent reimplementation,
      rebind_subsystems() can rebind subsystems with descendant csses and
      thus can create new css_sets.  This patch makes cgroup_setup_root()
      allocate twice as many links as there are existing css_sets so that
      later use of live subsystem rebinding doesn't blow up.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask · 5ced2518
      Tejun Heo authored
      cgroup_calc_subtree_ss_mask() currently takes @cgrp and
      @subtree_control.  @cgrp is used for two purposes - to decide whether
      it's for the default hierarchy and to determine the mask of available
      subsystems.  The former doesn't matter as the results are the same
      regardless.  The latter can be specified directly through a subsystem
      mask.
      
      This patch makes cgroup_calc_subtree_ss_mask() perform the same
      calculations for both default and legacy hierarchies and take
      @this_ss_mask for available subsystems.  @cgrp is no longer used and
      dropped.  This is to allow using the function in contexts where
      available controllers can't be decided from the cgroup.
      
      v2: cgroup_refresh_subtree_ss_mask() is removed by a previous patch.
          Updated accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
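A rough sketch of a mask calculation that takes the available subsystems as a plain mask rather than deriving them from a cgroup. It assumes the calculation closes the requested mask over controller dependencies and then clips it to what is available. The dependency table and subsystem IDs are hypothetical. Only the two parameter names come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subsystem IDs and dependency table for the sketch. */
enum { SS_IO, SS_MEMORY, SS_PIDS, SS_COUNT };

static const uint16_t depends_on[SS_COUNT] = {
	[SS_IO] = 1u << SS_MEMORY,   /* assume io needs memory */
};

/* Expand @subtree_control by dependencies, limited to @this_ss_mask. */
static uint16_t calc_subtree_ss_mask(uint16_t subtree_control,
				     uint16_t this_ss_mask)
{
	uint16_t cur = subtree_control;

	assert(SS_COUNT <= 16);
	for (;;) {
		uint16_t next = cur;

		for (int i = 0; i < SS_COUNT; i++)
			if (cur & (1u << i))
				next |= depends_on[i];
		next &= this_ss_mask;   /* unavailable controllers drop out */
		if (next == cur)
			break;          /* fixed point reached */
		cur = next;
	}
	return cur;
}
```

Because no cgroup is involved, the same function can be called from contexts where only a mask of candidate controllers is known.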
    • cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends · 334c3679
      Tejun Heo authored
      rebind_subsystems() open codes quite a bit of css and interface file
      manipulations.  It tries to be fail-safe but doesn't quite achieve it.
      It can be greatly simplified by using the new css management helpers.
      This patch reimplements rebind_subsystems() using
      cgroup_apply_control() and friends.
      
      * The half-baked rollback on file creation failure is dropped.  It is
        an extremely cold path, failure isn't critical, and, aside from
        kernel bugs, the only reason it can fail is memory allocation
        failure which pretty much doesn't happen for small allocations.
      
      * As cgroup_apply_control_disable() is now used to clean up root
        cgroup on rebind, make sure that it doesn't end up killing root
        csses.
      
      * All callers of rebind_subsystems() are updated to use
        cgroup_lock_and_drain_offline() as the apply_control functions
        require drained subtree.
      
      * This leaves cgroup_refresh_subtree_ss_mask() without any user.
        Removed.
      
      * css_populate_dir() and css_clear_dir() no longer need the
        @cgrp_override parameter.  Dropped.
      
      * While at it, add WARN_ON() to rebind_subsystems() calls which are
        expected to always succeed just in case.
      
      While the rules visible to userland aren't changed, this
      reimplementation not only simplifies rebind_subsystems() but also
      allows it to disable and enable csses recursively.  This can be used
      to implement more flexible rebinding.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: use cgroup_apply_enable_control() in cgroup creation path · 03970d3c
      Tejun Heo authored
      cgroup_create() manually updates control masks and creates child csses
      which cgroup_mkdir() then manually populates.  Both can be simplified
      by using cgroup_apply_control_enable() and friends.  The only catch is
      that it calls css_populate_dir() with NULL cgroup->kn during
      cgroup_create().  This is worked around by making the function noop on
      NULL kn.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: combine cgroup_mutex locking and offline css draining · 945ba199
      Tejun Heo authored
      cgroup_drain_offline() is used to wait for csses being offlined to
      uninstall themselves from the cgroup->subsys[] array so that new csses can be
      installed.  The function's only user, cgroup_subtree_control_write(),
      calls it after performing some checks and restarts the whole process
      via restart_syscall() if draining has to release cgroup_mutex to wait.
      
      This can be simplified by draining before other synchronized
      operations so that there's nothing to restart.  This patch converts
      cgroup_drain_offline() to cgroup_lock_and_drain_offline() which
      performs both locking and draining and updates cgroup_kn_lock_live()
      to use it instead of plain cgroup_mutex locking if requested.  The
      combined locking and draining operation is easier to use and less
      error-prone.
      
      While at it, add WARNs in the control_apply functions which trigger if
      the subtree isn't properly drained.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write() · f7b2814b
      Tejun Heo authored
      Factor out cgroup_{apply|finalize}_control() so that control mask
      update can be done in several simple steps.  This patch doesn't
      introduce behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: introduce cgroup_{save|propagate|restore}_control() · 15a27c36
      Tejun Heo authored
      While controllers are being enabled and disabled in
      cgroup_subtree_control_write(), the original subsystem masks are
      stashed in local variables so that they can be restored if the
      operation fails in the middle.
      
      This patch adds dedicated fields to struct cgroup to be used instead
      of the local variables and implements functions to stash the current
      values, propagate the changes and restore them recursively.  Combined
      with the previous changes, this makes subsystem management operations
      fully recursive and modularized.  This will be used to expand cgroup
      core functionalities.
      
      While at it, remove now unused @css_enable and @css_disable from
      cgroup_subtree_control_write().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
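The save, propagate and restore steps can be sketched on a toy tree. This assumes a single mask per cgroup and a fixed-size child array. The field name `old_subtree_control` and the rule that a child keeps only what its parent still passes down are assumptions of the model, not a statement of the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

struct cgroup {
	uint16_t subtree_control;
	uint16_t old_subtree_control;   /* stash used for rollback */
	struct cgroup *children[MAX_CHILDREN];
	size_t nr_children;
};

/* Stash the current mask on @cgrp and every descendant. */
static void save_control(struct cgroup *cgrp)
{
	assert(cgrp->nr_children <= MAX_CHILDREN);
	cgrp->old_subtree_control = cgrp->subtree_control;
	for (size_t i = 0; i < cgrp->nr_children; i++)
		save_control(cgrp->children[i]);
}

/* Push a change at @cgrp down: children keep only what the parent passes. */
static void propagate_control(struct cgroup *cgrp)
{
	for (size_t i = 0; i < cgrp->nr_children; i++) {
		struct cgroup *child = cgrp->children[i];

		child->subtree_control &= cgrp->subtree_control;
		propagate_control(child);
	}
}

/* Roll the whole subtree back to the stashed masks. */
static void restore_control(struct cgroup *cgrp)
{
	cgrp->subtree_control = cgrp->old_subtree_control;
	for (size_t i = 0; i < cgrp->nr_children; i++)
		restore_control(cgrp->children[i]);
}
```

A caller would save, modify the root of the subtree, propagate, attempt to apply, and restore if applying fails.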
    • cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive · ce3f1d9d
      Tejun Heo authored
      The three factored out css management operations -
      cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() -
      only depend on the current state of the target cgroups and are
      idempotent, and thus can be easily made to operate on the subtree
      instead of the immediate children.
      
      This patch introduces the iterators which walk live subtree and
      converts the three functions to operate on the subtree including self
      instead of the children.  While this leads to spurious walking and is
      slightly more expensive, it will allow them to be used for a wider scope
      of operations.
      
      Note that cgroup_drain_offline() now tests for whether a css is dying
      before trying to drain it.  This is to avoid trying to drain live
      csses as there can be a mix of live and dying csses in a subtree, unlike
      among children of the same parent.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
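A live-subtree walk that includes the starting cgroup can be modelled as a recursive pre-order traversal. The sketch assumes a dead cgroup has no live descendants, so its whole subtree is skipped. The structure, the callback style and the names are illustrative, and the kernel uses iterator macros rather than a function like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

struct cgroup {
	bool dead;
	int applied;   /* how many times the operation ran on this cgroup */
	struct cgroup *children[MAX_CHILDREN];
	size_t nr_children;
};

/* Pre-order walk over @cgrp and its live descendants.
 * Returns the number of cgroups visited. */
static size_t for_each_live_descendant_pre(struct cgroup *cgrp,
					   void (*fn)(struct cgroup *))
{
	size_t visited = 1;

	assert(cgrp->nr_children <= MAX_CHILDREN);
	if (cgrp->dead)
		return 0;   /* model assumption: nothing live below a dead cgroup */
	fn(cgrp);
	for (size_t i = 0; i < cgrp->nr_children; i++)
		visited += for_each_live_descendant_pre(cgrp->children[i], fn);
	return visited;
}

/* Example operation applied during the walk. */
static void apply_op(struct cgroup *cgrp)
{
	cgrp->applied++;
}
```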
    • cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write() · bdb53bd7
      Tejun Heo authored
      Factor out css enabling and showing into cgroup_apply_control_enable().
      
      * Nest subsystem walk inside child walk.  The child walk will later be
        converted to subtree walk which is a bit more expensive.
      
      * Instead of operating on the differential masks @css_enable, simply
        enable or show csses according to the current cgroup_control() and
        cgroup_ss_mask().  This leads to the same result and is simpler and
        more robust.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
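The difference between working from a differential mask and working from the current target state can be shown with a toy model in which a cgroup just records which csses exist. The field and function names are hypothetical. The function compares the wanted mask with what is already there, so calling it again is harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SUBSYS 4

struct cgroup {
	uint16_t csses;   /* bit i set: a css for subsystem i exists */
};

/* Bring @cgrp in line with @ss_mask; returns how many csses were created. */
static unsigned apply_control_enable(struct cgroup *cgrp, uint16_t ss_mask)
{
	unsigned created = 0;

	assert(cgrp != NULL);
	for (int ssid = 0; ssid < NR_SUBSYS; ssid++) {
		uint16_t bit = (uint16_t)(1u << ssid);

		if (!(ss_mask & bit) || (cgrp->csses & bit))
			continue;   /* not wanted, or already there */
		cgrp->csses |= bit;
		created++;
	}
	return created;
}
```

No separate "what changed" mask has to be threaded through the caller, which is the simplification the commit describes.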
    • cgroup: factor out cgroup_apply_control_disable() from cgroup_subtree_control_write() · 12b3bb6a
      Tejun Heo authored
      Factor out css disabling and hiding into cgroup_apply_control_disable().
      
      * Nest subsystem walk inside child walk.  The child walk will later be
        converted to subtree walk which is a bit more expensive.
      
      * Instead of operating on the differential masks @css_enable and
        @css_disable, simply disable or hide csses according to the current
        cgroup_control() and cgroup_ss_mask().  This leads to the same
        result and is simpler and more robust.
      
      * This allows error handling path to share the same code.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: factor out cgroup_drain_offline() from cgroup_subtree_control_write() · 1b9b96a1
      Tejun Heo authored
      Factor out async css offline draining into cgroup_drain_offline().
      
      * Nest subsystem walk inside child walk.  The child walk will later be
        converted to subtree walk which is a bit more expensive.
      
      * Relocate the draining above subsystem mask preparation, which
        doesn't create any behavior differences but helps further
        refactoring.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: introduce cgroup_control() and cgroup_ss_mask() · 5531dc91
      Tejun Heo authored
      Whether a controller is enabled and visible on a non-root cgroup is
      determined by subtree_control and subtree_ss_mask of the parent
      cgroup.  For a root cgroup, it is determined by the type of the
      hierarchy and which controllers are attached to it.  Deciding the
      above on each usage is fragile and unnecessarily complicates the users.
      
      This patch introduces cgroup_control() and cgroup_ss_mask() which
      calculate and return the [visibly] enabled subsystem mask for the
      specified cgroup and converts the existing usages.
      
      * cgroup_e_css() is restructured for simplicity.
      
      * cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no
        longer need to distinguish root and non-root cases.
      
      * With cgroup_control(), cgroup_controllers_show() can now handle both
        root and non-root cases.  cgroup_root_controllers_show() is removed.
      
      v2: cgroup_control() updated to yield the correct result on v1
          hierarchies too.  cgroup_subtree_control_write() converted.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
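A simplified model of the two helpers, assuming each cgroup carries a parent pointer and a pointer to its root. For a root cgroup the sketch returns the root's attached-controller mask directly and ignores any hierarchy-type adjustment. That shortcut is an assumption of the example, as are the structure layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cgroup_root {
	uint16_t subsys_mask;   /* controllers attached to this hierarchy */
};

struct cgroup {
	struct cgroup *parent;      /* NULL for the root cgroup */
	struct cgroup_root *root;
	uint16_t subtree_control;
	uint16_t subtree_ss_mask;
};

/* Controllers visibly enabled on @cgrp. */
static uint16_t cgroup_control(const struct cgroup *cgrp)
{
	assert(cgrp->root != NULL);
	if (cgrp->parent)
		return cgrp->parent->subtree_control;
	return cgrp->root->subsys_mask;
}

/* Controllers enabled on @cgrp, including ones not shown to userland. */
static uint16_t cgroup_ss_mask(const struct cgroup *cgrp)
{
	assert(cgrp->root != NULL);
	if (cgrp->parent)
		return cgrp->parent->subtree_ss_mask;
	return cgrp->root->subsys_mask;
}
```

Callers no longer branch on root versus non-root. They ask the helper and get a mask either way.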
    • cgroup: factor out cgroup_create() out of cgroup_mkdir() · a5bca215
      Tejun Heo authored
      We're in the process of refactoring cgroup and css management paths to
      separate them out to eventually allow cgroups which aren't visible
      through cgroup fs.  This patch factors out cgroup_create() out of
      cgroup_mkdir().  cgroup_create() contains all internal object creation
      and initialization.  cgroup_mkdir() uses cgroup_create() to create the
      internal cgroup and adds interface directory and file creation.
      
      This patch doesn't cause any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: reorder operations in cgroup_mkdir() · 195e9b6c
      Tejun Heo authored
      Currently, operations to initialize internal objects and create
      interface directory and files are intermixed in cgroup_mkdir().  We're
      in the process of refactoring cgroup and css management paths to
      separate them out to eventually allow cgroups which aren't visible
      through cgroup fs.
      
      This patch reorders operations inside cgroup_mkdir() so that interface
      directory and file handling comes after internal object
      initialization.  This will enable further refactoring.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: explicitly track whether a cgroup_subsys_state is visible to userland · 88cb04b9
      Tejun Heo authored
      Currently, whether a css (cgroup_subsys_state) has its interface files
      created is not tracked and is assumed to change together with the owning
      cgroup's lifecycle.  cgroup directory and interface creation is being
      separated out from internal object creation to help refactoring and
      eventually allow cgroups which are not visible through cgroupfs.
      
      This patch adds CSS_VISIBLE to track whether a css has its interface
      files created and performs management operations only when necessary,
      which helps decouple interface file handling from internal object
      lifecycle.  After this patch, all css interface file management
      functions can be called regardless of the current state and will
      achieve the expected result.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
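The effect of tracking visibility explicitly can be sketched with a flag that makes populate and clear safe to call in any state. The bit value of `CSS_VISIBLE`, the counters and the reduced `struct css` are assumptions for illustration. The real functions create and remove kernfs files rather than bump counters.

```c
#include <assert.h>
#include <stddef.h>

#define CSS_VISIBLE (1u << 0)   /* bit position is an assumption of the sketch */

struct css {
	unsigned flags;
	int creations;   /* times interface files were actually created */
	int removals;    /* times they were actually removed */
};

/* Create interface files unless they already exist. */
static int css_populate_dir(struct css *css)
{
	assert(css != NULL);
	if (css->flags & CSS_VISIBLE)
		return 0;   /* already populated */
	css->creations++;
	css->flags |= CSS_VISIBLE;
	return 0;
}

/* Remove interface files if they exist. */
static void css_clear_dir(struct css *css)
{
	assert(css != NULL);
	if (!(css->flags & CSS_VISIBLE))
		return;     /* nothing to remove */
	css->flags &= ~CSS_VISIBLE;
	css->removals++;
}
```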
    • cgroup: separate out interface file creation from css creation · 6cd0f5bb
      Tejun Heo authored
      Currently, interface files are created when a css is created depending
      on whether @visible is set.  This patch separates out the two into
      separate steps to help code refactoring and eventually allow cgroups
      which aren't visible through cgroup fs.
      
      Move css_populate_dir() out of create_css() and drop @visible.  While
      at it, rename the function to css_create() for consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: suppress spurious de-populated events · 20b454a6
      Tejun Heo authored
      During task migration, tasks may transfer between two css_sets which
      are associated with the same cgroup.  If those tasks are the only
      tasks in the cgroup, this currently triggers a spurious de-populated
      event on the cgroup.
      
      Fix it by bumping up populated count before bumping it down during
      migration to ensure that it doesn't reach zero spuriously.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
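Why the ordering matters can be seen in a small counter model. It assumes a cgroup keeps a count of populated css_sets and emits an event whenever that count drops to zero. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct cgroup {
	int populated_cnt;        /* populated css_sets linked to this cgroup */
	int depopulated_events;   /* notifications a watcher would have seen */
};

/* Adjust the count and emit an event when it reaches zero. */
static void update_populated(struct cgroup *cgrp, bool populated)
{
	if (populated) {
		cgrp->populated_cnt++;
		return;
	}
	assert(cgrp->populated_cnt > 0);
	if (--cgrp->populated_cnt == 0)
		cgrp->depopulated_events++;
}

/* Old order: the source set empties before the destination fills. */
static void move_down_then_up(struct cgroup *cgrp)
{
	update_populated(cgrp, false);
	update_populated(cgrp, true);
}

/* Fixed order: the destination fills first, so the count never hits zero. */
static void move_up_then_down(struct cgroup *cgrp)
{
	update_populated(cgrp, true);
	update_populated(cgrp, false);
}
```

Both orders leave the count unchanged. Only the first one passes through zero on the way.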
    • cgroup: re-hash init_css_set after subsystems are initialized · 2378d8b8
      Tejun Heo authored
      css_sets are hashed by their subsys[] contents and in cgroup_init()
      init_css_set is hashed early, before subsystem inits, when all entries
      in its subsys[] are NULL, so that cgroup_dfl_root initialization can
      find and link to it.  As subsystems are initialized,
      init_css_set.subsys[] is filled up but the hashing is never updated
      making init_css_set hashed in the wrong place.  While incorrect, this
      doesn't cause a critical failure as css_set management code would
      create an identical css_set dynamically.
      
      Fix it by rehashing init_css_set after subsystems are initialized.
      While at it, drop unnecessary @key local variable.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
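The stale-bucket problem can be reproduced with a toy chained hash table keyed by the contents of `subsys[]`. Integers stand in for css pointers, and the hash function, table size and helper names are all made up for the example. An entry inserted while its key is all zeros stays in the old bucket after the key changes, so lookups by the new key miss it until it is rehashed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_SUBSYS  3
#define NR_BUCKETS 8

struct css_set {
	int subsys[NR_SUBSYS];   /* stand-ins for css pointers, 0 means NULL */
	struct css_set *next;    /* bucket chain */
};

static struct css_set *buckets[NR_BUCKETS];

static unsigned hash_key(const int subsys[NR_SUBSYS])
{
	unsigned key = 0;

	for (int i = 0; i < NR_SUBSYS; i++)
		key = key * 31u + (unsigned)subsys[i];
	return key % NR_BUCKETS;
}

static void hash_add(struct css_set *cset)
{
	unsigned b = hash_key(cset->subsys);

	cset->next = buckets[b];
	buckets[b] = cset;
}

/* Unlink @cset from whichever bucket currently holds it. */
static void hash_del(struct css_set *cset)
{
	for (unsigned b = 0; b < NR_BUCKETS; b++)
		for (struct css_set **pp = &buckets[b]; *pp; pp = &(*pp)->next)
			if (*pp == cset) {
				*pp = cset->next;
				cset->next = NULL;
				return;
			}
}

static struct css_set *find(const int subsys[NR_SUBSYS])
{
	struct css_set *c = buckets[hash_key(subsys)];

	for (; c; c = c->next)
		if (memcmp(c->subsys, subsys, sizeof(c->subsys)) == 0)
			return c;
	return NULL;
}

/* Move @cset to the bucket matching its current contents. */
static void rehash(struct css_set *cset)
{
	hash_del(cset);
	hash_add(cset);
}
```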
  7. 01 Mar, 2016 1 commit
    • cgroup: reset css on destruction · fa06235b
      Vladimir Davydov authored
      An associated css can be around for quite a while after a cgroup
      directory has been removed. In general, it makes sense to reset it to
      defaults so as not to worry about any remnants. For instance, memory
      cgroup needs to reset memory.low, otherwise pages charged to a dead
      cgroup might never get reclaimed. There's the ->css_reset callback, which
      would fit perfectly for the purpose. Currently, it's only called when a
      subsystem is disabled in the unified hierarchy and there are other
      subsystems dependent on it. Let's call it on css destruction as well.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
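A minimal sketch of calling an optional reset callback when a css is torn down, assuming a callback table per subsystem. `destroy_css()`, `mem_css_reset()` and the `low` field are illustrative stand-ins. Only the `->css_reset` callback name and the memory.low example come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct css {
	bool online;
	unsigned long low;   /* stand-in for a setting such as memory.low */
};

struct subsys {
	void (*css_reset)(struct css *css);   /* optional callback */
};

/* Example reset: return the protection setting to its default. */
static void mem_css_reset(struct css *css)
{
	css->low = 0;
}

/* Model of css destruction: take it offline, then reset it to defaults. */
static void destroy_css(struct css *css, const struct subsys *ss)
{
	assert(css != NULL && ss != NULL);
	css->online = false;
	if (ss->css_reset)
		ss->css_reset(css);
}
```

In this model a css that lingers after its directory is removed no longer carries a non-default setting.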
  8. 27 Feb, 2016 1 commit
  9. 23 Feb, 2016 10 commits