1. 13 Feb, 2014 7 commits
    • cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem · 96d365e0
      Tejun Heo authored
      Currently there are two ways to walk tasks of a cgroup -
      css_task_iter_start/next/end() and css_scan_tasks().  The latter
      builds on the former but allows blocking while iterating.
      Unfortunately, the way css_scan_tasks() is implemented is rather
      nasty.  It uses a priority heap of pointers to extract some number of
      tasks in task creation order, loops over them invoking the callback,
      and repeats that until it reaches the end.  It either requires a
      preallocated heap or may fail under memory pressure and, while
      unlikely to be problematic, its complexity is O(N^2).
      
      We're gonna convert all css_scan_tasks() users to
      css_task_iter_start/next/end() and remove css_scan_tasks().  As
      css_scan_tasks() users may block, let's convert css_set_lock to a
      rwsem so that they can block while css_task_iter_*() is in progress.
      
      While this does increase the chance of possible deadlock scenarios,
      given the current usage, the probability is relatively low, and even
      if that happens, the right thing to do is updating the iteration in
      the similar way to css iterators so that it can handle blocking.
      
      Most conversions are trivial; however, task_cgroup_path() now expects
      to be called with css_set_rwsem locked instead of locking itself.
      This is because the function is called with RCU read lock held and
      rwsem locking should nest outside RCU read lock.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      96d365e0
    • cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks() · e406d1cf
      Tejun Heo authored
      Reimplement cgroup_transfer_tasks() so that it repeatedly fetches the
      first task in the cgroup and then transfers it.  This achieves the same
      result without using css_scan_tasks() which is scheduled to be
      removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      e406d1cf
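The "fetch the first task, transfer it, repeat" loop described in the commit above can be sketched in plain userspace C. This is only an illustration under simplifying assumptions: `struct task`, `struct group`, and the helper names are hypothetical stand-ins, not the kernel's data structures, and all locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real struct layouts. */
struct task {
    int pid;
    struct task *next;
};

struct group {
    struct task *head;   /* singly linked list of member tasks */
};

/* Detach and return the first task of @g, or NULL if @g is empty. */
static struct task *group_pop_first(struct group *g)
{
    struct task *t = g->head;

    if (t) {
        g->head = t->next;
        t->next = NULL;
    }
    return t;
}

/* Attach @t, which must not be on any list, at the front of @g. */
static void group_push(struct group *g, struct task *t)
{
    assert(t->next == NULL);
    t->next = g->head;
    g->head = t;
}

/*
 * Move every task from @from to @to by repeatedly fetching the first
 * remaining task.  No snapshot or heap of tasks is needed, and the loop
 * ends once @from is empty.  Returns the number of tasks moved.
 */
static int transfer_tasks(struct group *to, struct group *from)
{
    int moved = 0;
    struct task *t;

    while ((t = group_pop_first(from)) != NULL) {
        group_push(to, t);
        moved++;
    }
    return moved;
}
```

Because each iteration only looks at the current head of the source list, nothing has to be allocated up front, which is the property that makes the heap-based scan unnecessary.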
    • cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count() · 07bc356e
      Tejun Heo authored
      cgroup_task_count() read-locks css_set_lock and walks all tasks to
      count them and then returns the result.  The only thing all the users
      want is determining whether the cgroup is empty or not.  This patch
      implements cgroup_has_tasks() which tests whether cgroup->cset_links
      is empty, replaces all cgroup_task_count() usages and unexports it.
      
      Note that the test isn't synchronized.  This is the same as before.
      The test has always been racy.
      
      This will help planned css_set locking update.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      07bc356e
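The difference between counting members and testing for emptiness, as described in the commit above, can be sketched as follows. This is a simplified userspace illustration: the list type is a minimal imitation of the kernel's `list_head`, `members` merely plays the role of `cgroup->cset_links`, and the function names are hypothetical. Like the kernel test, the check is unsynchronized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct list_node {
    struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    assert(n != head);
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static bool list_is_empty(const struct list_node *head)
{
    return head->next == head;
}

/* Hypothetical stand-in for a cgroup. */
struct group {
    struct list_node members;
};

/* Old style: walk every entry just to learn how many there are, O(N). */
static int group_task_count(const struct group *g)
{
    int count = 0;
    const struct list_node *pos;

    for (pos = g->members.next; pos != &g->members; pos = pos->next)
        count++;
    return count;
}

/* New style: callers only need "empty or not", which is O(1). */
static bool group_has_tasks(const struct group *g)
{
    return !list_is_empty(&g->members);
}
```

A caller that previously wrote `group_task_count(g) > 0` gets the same answer from `group_has_tasks(g)` without walking the list.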
    • cgroup: relocate cgroup_enable_task_cg_lists() · afeb0f9f
      Tejun Heo authored
      Move it above so that prototype isn't necessary.  Let's also move the
      definition of use_task_css_set_links next to it.
      
      This is purely cosmetic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      afeb0f9f
    • cgroup: enable task_cg_lists on the first cgroup mount · 56fde9e0
      Tejun Heo authored
      Tasks are not linked on their css_sets until cgroup task iteration is
      actually used.  This is to avoid incurring overhead on the fork and
      exit paths for systems which have cgroup compiled in but don't use it.
           
      This lazy binding also affects the task migration path.  It has to be
      careful so that it doesn't link tasks to css_sets when task_cg_lists
      linking is not enabled yet.  Unfortunately, this conditional linking
      in the migration path interferes with planned migration updates.
      
      This patch moves the lazy binding a bit earlier, to the first cgroup
      mount.  It's a clear indication that cgroup is being used on the
      system and task_cg_lists linking is highly likely to be enabled soon
      anyway through "tasks" and "cgroup.procs" files.
      
      This allows cgroup_task_migrate() to always link @tsk->cg_list.  Note
      that it may still race with cgroup_post_fork() but who wins that race
      is inconsequential.
      
      While at it, make use_task_css_set_links a bool, add sanity checks in
      cgroup_enable_task_cg_lists() and css_task_iter_start(), and update
      the former so that it's guaranteed, and assumes, to run only once.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      56fde9e0
    • cgroup: drop CGRP_ROOT_SUBSYS_BOUND · 35585573
      Tejun Heo authored
      Before kernfs conversion, due to the way super_block lookup works,
      cgroup roots were created and made visible before being fully
      initialized.  This in turn required a special flag to mark that the
      root hasn't been fully initialized so that the destruction path can
      tell fully bound ones from half initialized.
      
      That flag is CGRP_ROOT_SUBSYS_BOUND and no longer necessary after the
      kernfs conversion as the lookup and creation of new root are atomic
      w.r.t. cgroup_mutex.  This patch removes the flag and passes the
      requested subsystem mask to cgroup_setup_root() so that it can set the
      respective mask bits as subsystems are bound.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      35585573
    • cgroup: disallow xattr, release_agent and name if sane_behavior · d3ba07c3
      Tejun Heo authored
      Disallow more mount options if sane_behavior.  Note that xattr used to
      generate warning.
      
      While at it, simplify option check in cgroup_mount() and update
      sane_behavior comment in cgroup.h.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      d3ba07c3
  2. 12 Feb, 2014 9 commits
    • sun4M: add include of slab.h for kzalloc · a755180b
      Stephen Rothwell authored
      This was being included implicitly via cgroup.h's inclusion of xattr.h
      (which has now been removed).
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a755180b
    • cgroup: remove cgroupfs_root->refcnt · 776f02fa
      Tejun Heo authored
      Currently, cgroupfs_root and its ->top_cgroup are separately reference
      counted and the latter's is ignored.  There's no reason to do this
      separately.  This patch removes cgroupfs_root->refcnt and destroys
      cgroupfs_root when the top_cgroup is released.
      
      * cgroup_put() updated to ignore cgroup_is_dead() test for top
        cgroups.  cgroup_free_fn() updated to handle root destruction when
        releasing a top cgroup.
      
      * As root destruction is now bounced through cgroup destruction, it is
        asynchronous.  Update cgroup_mount() so that it waits for pending
        release which is currently implemented using msleep().  Converting
        this to proper wait_queue isn't hard but likely unnecessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      776f02fa
    • cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t · 3c9c825b
      Tejun Heo authored
      root->number_of_cgroups is currently an integer protected with
      cgroup_mutex.  Except for sanity checks and proc reporting, the only
      place it's used is to check whether the root has any child during
      remount; however, this is a bit flawed as the counter is not
      decremented when the cgroup is unlinked but when it's released,
      meaning that there could be an extended period where all cgroups are
      removed but remount is still not allowed because some internal objects
      are lingering.  While not perfect either, it'd be better to use
      emptiness test on root->top_cgroup.children.
      
      This patch updates cgroup_remount() to test top_cgroup's children
      instead, which leaves statistics printing in proc, implemented in
      proc_cgroupstats_show(), as the only actual use of number_of_cgroups.
      Let's shorten its name and make it an atomic_t so that we don't have
      to worry about its synchronization.  It's purely auxiliary at this
      point.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3c9c825b
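The split described in the commit above, where the remount check looks at linked children while the counter only feeds statistics, might be modelled roughly as below. This is a userspace sketch using C11 atomics in place of the kernel's `atomic_t`. The `demo_*` names, the plain `nr_children` field (standing in for an emptiness test on `top_cgroup.children`), and the initial counter value of 0 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for cgroupfs_root. */
struct demo_root {
    atomic_int nr_cgrps;  /* statistics only; decremented at release time */
    int nr_children;      /* linked children; decremented at unlink time
                             (real code would protect this with a mutex) */
};

static void demo_root_init(struct demo_root *r)
{
    atomic_init(&r->nr_cgrps, 0);
    r->nr_children = 0;
}

static void demo_created(struct demo_root *r)
{
    atomic_fetch_add(&r->nr_cgrps, 1);
    r->nr_children++;
}

/* The cgroup leaves the hierarchy, but its object may linger. */
static void demo_unlinked(struct demo_root *r)
{
    assert(r->nr_children > 0);
    r->nr_children--;
}

/* The lingering object is finally freed. */
static void demo_released(struct demo_root *r)
{
    atomic_fetch_sub(&r->nr_cgrps, 1);
}

/* Remount depends on visible children, not on lingering objects. */
static bool demo_remount_allowed(const struct demo_root *r)
{
    return r->nr_children == 0;
}

/* Statistics reporting can read the counter without any lock. */
static int demo_stats_nr_cgrps(struct demo_root *r)
{
    return atomic_load(&r->nr_cgrps);
}
```

Between `demo_unlinked()` and `demo_released()` the counter is still nonzero while remount is already allowed, which is the window the commit message describes.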
    • cgroup: remove cgroup->name · e61734c5
      Tejun Heo authored
      cgroup->name handling became quite complicated over time involving
      dedicated struct cgroup_name for RCU protection.  Now that cgroup is
      on kernfs, we can drop all of it and simply use kernfs_name/path() and
      friends.  Replace cgroup->name and all related code with kernfs
      name/path constructs.
      
      * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
        of kernfs counterparts, which involves semantic changes.
        pr_cont_cgroup_name() and pr_cont_cgroup_path() added.
      
      * cgroup->name handling dropped from cgroup_rename().
      
      * All users of cgroup_name/path() updated to the new semantics.  Users
        which were formatting the string just to printk them are converted
        to use pr_cont_cgroup_name/path() instead, which simplifies things
        quite a bit.  As cgroup_name() no longer requires RCU read lock
        around it, RCU lockings which were protecting only cgroup_name() are
        removed.
      
      v2: Comment above oom_info_lock updated as suggested by Michal.
      
      v3: dummy_top doesn't have a kn associated and
          pr_cont_cgroup_name/path() ended up calling the matching kernfs
          functions with NULL kn leading to oops.  Test for NULL kn and
          print "/" if so.  This issue was reported by Fengguang Wu.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      e61734c5
    • cgroup: make cgroup hold onto its kernfs_node · 6f30558f
      Tejun Heo authored
      cgroup currently releases its kernfs_node when it gets removed.  While
      not buggy, this makes cgroup->kn access rules more complicated than
      necessary and leads to things like get/put protection around
      kernfs_remove() in cgroup_destroy_locked().  In addition, we want to
      use kernfs_name/path() and friends but also want to be able to
      determine a cgroup's name between removal and release.
      
      This patch makes cgroup hold onto its kernfs_node until freed so that
      cgroup->kn is always accessible.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      6f30558f
    • cgroup: simplify dynamic cftype addition and removal · 21a2d343
      Tejun Heo authored
      Dynamic cftype addition and removal using cgroup_add/rm_cftypes()
      respectively has been quite hairy due to vfs i_mutex.  As i_mutex
      nests outside cgroup_mutex, cgroup_mutex has to be released and
      regrabbed on each iteration through the hierarchy complicating the
      process.  Now that i_mutex is no longer in play, it can be simplified.
      
      * Just holding cgroup_tree_mutex is enough.  No need to meddle with
        cgroup_mutex.
      
      * No reason to play the unlock - relock - check serial_nr dance.
        Everything can be done atomically while holding cgroup_tree_mutex.
      
      * cgroup_cfts_prepare() is replaced with direct locking of
        cgroup_tree_mutex.
      
      * cgroup_cfts_commit() no longer fiddles with locking.  It just
        applies the cftypes change to the existing cgroups in the hierarchy.
        Renamed to cgroup_cfts_apply().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      21a2d343
    • cgroup: remove cftype_set · 0adb0704
      Tejun Heo authored
      cftype_set was added primarily to allow registering the same cftype
      array more than once for different subsystems.  Nobody uses or needs
      such a thing and it's already broken because each cftype has ->ss
      pointer which is initialized during registration.
      
      Let's add list_head ->node to cftype and use the first cftype entry in
      the array to link them instead of allocating separate cftype_set.
      While at it, trigger WARN if cft seems previously initialized during
      registration.
      
      This simplifies cftype handling a bit.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      0adb0704
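The idea in the commit above, linking a whole cftype array through a list node embedded in its first entry instead of allocating a separate wrapper object, can be sketched in userspace C. The struct layouts, the NULL-name array terminator, the `-1` error return (standing in for the WARN), and the `demo_*` function names are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct list_node {
    struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    assert(n != head);
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Hypothetical stand-ins for cgroup_subsys and cftype. */
struct subsys {
    const char *name;
    struct list_node cfts;   /* list of registered cftype arrays */
};

struct cftype {
    const char *name;        /* NULL name terminates the array (assumption) */
    struct subsys *ss;       /* set during registration */
    struct list_node node;   /* only meaningful in the first array entry */
};

/* Register @cfts with @ss; returns 0 on success, -1 if already registered. */
static int demo_add_cftypes(struct subsys *ss, struct cftype *cfts)
{
    struct cftype *cft;

    if (cfts[0].name == NULL)
        return 0;            /* empty array: nothing to do */
    if (cfts[0].ss != NULL)
        return -1;           /* looks previously initialized */

    for (cft = cfts; cft->name != NULL; cft++)
        cft->ss = ss;

    /* No separate allocation: the first entry carries the link. */
    list_add_tail(&cfts[0].node, &ss->cfts);
    return 0;
}

static int demo_count_arrays(const struct subsys *ss)
{
    int count = 0;
    const struct list_node *pos;

    for (pos = ss->cfts.next; pos != &ss->cfts; pos = pos->next)
        count++;
    return count;
}
```

Since registration never allocates memory in this scheme, the only remaining failure mode in the sketch is registering the same array twice.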
    • cgroup: relocate cgroup_rm_cftypes() · 80b13586
      Tejun Heo authored
      cftype handling is about to be revamped.  Relocate cgroup_rm_cftypes()
      above cgroup_add_cftypes() in preparation.  This is pure relocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      80b13586
    • cgroup: warn if "xattr" is specified with "sane_behavior" · 86bf4b68
      Tejun Heo authored
      Mount option "xattr" is no longer necessary as it's enabled by default
      on kernfs.  Warn if "xattr" is specified with "sane_behavior" so that
      the option can be removed in the future.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      86bf4b68
  3. 11 Feb, 2014 15 commits
    • cgroup: convert to kernfs · 2bd59d48
      Tejun Heo authored
      cgroup filesystem code was derived from the original sysfs
      implementation which was heavily intertwined with vfs objects and
      locking with the goal of re-using the existing vfs infrastructure.
      That experiment turned out rather disastrous and sysfs switched, a
      long time ago, to distributed filesystem model where a separate
      representation is maintained which is queried by vfs.  Unfortunately,
      cgroup stuck with the failed experiment all these years and
      accumulated even more problems over time.
      
      Locking and object lifetime management being entangled with vfs is
      probably the most egregious.  vfs was never designed to be misused like
      this and cgroup ends up jumping through various convoluted hoops to
      make things work.  Even then, operations across multiple cgroups can't
      be done safely as it'll deadlock with rename locking.
      
      Recently, kernfs was separated out from sysfs so that it can be used by
      users other than sysfs.  This patch converts cgroup to use kernfs,
      which will bring the following benefits.
      
      * Separation from vfs internals.  Locking and object lifetime
        management is contained in cgroup proper making things a lot
        simpler.  This removes significant amount of locking convolutions,
        hairy object lifetime rules and the restriction on multi-cgroup
        operations.
      
      * Can drop a lot of code to implement filesystem interface as most are
        provided by kernfs.
      
      * Proper "severing" semantics, which allows controllers to not worry
        about lingering file accesses after offline.
      
      While the preceding patches did as much as possible to make the
      transition less painful, large part of the conversion has to be one
      discrete step making this patch rather large.  The rest of the commit
      message lists notable changes in different areas.
      
      Overall
      -------
      
      * vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
        cgroupfs_root->sb w/ ->kf_root.
      
      * All dentry accessors are removed.  Helpers to map from kernfs
        constructs are added.
      
      * All vfs plumbing around dentry, inode and bdi removed.
      
      * cgroup_mount() now directly looks for matching root and then
        proceeds to create a new one if not found.
      
      Synchronization and object lifetime
      -----------------------------------
      
      * vfs inode locking removed.  Among other things, this removes the
        need for the convolution in cgroup_cfts_commit().  Future patches
        will further simplify it.
      
      * vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
        cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
        root->refcnt and when it reaches zero proceeds to destroy it thus
        merging cgroup_put_root() and the former cgroup_kill_sb().
        Similarly, cgroup_put() now directly schedules cgroup_free_rcu()
        when refcnt reaches zero.
      
      * Unlike before, kernfs objects don't hold onto cgroup objects.  When
        cgroup destroys a kernfs node, all existing operations are drained
        and the association is broken immediately.  The same for
        cgroupfs_roots and mounts.
      
      * All operations which come through kernfs guarantee that the
        associated cgroup is and stays valid for the duration of operation;
        however, there are two paths which need to find out the associated
        cgroup from dentry without going through kernfs -
        css_tryget_from_dir() and cgroupstats_build().  For these two,
        kernfs_node->priv is RCU managed so that they can dereference it
        under RCU read lock.
      
      File and directory handling
      ---------------------------
      
      * File and directory operations converted to kernfs_ops and
        kernfs_syscall_ops.
      
      * xattrs is implicitly supported by kernfs.  No need to worry about it
        from cgroup.  This means that "xattr" mount option is no longer
        necessary.  A future patch will add a deprecated warning message
        when sane_behavior.
      
      * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
        private copy of one of the kernfs_ops to set its atomic_write_len.
        cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
        to handle it.
      
      * cftype->lockdep_key added so that kernfs lockdep annotation can be
        per cftype.
      
      * Individual file entries and open states are now managed by kernfs.
        No need to worry about them from cgroup.  cfent, cgroup_open_file
        and their friends are removed.
      
      * kernfs_nodes are created deactivated and kernfs_activate()
        invocations added to places where creation of new nodes is
        committed.
      
      * cgroup_rmdir() uses kernfs_[un]break_active_protection() for
        self-removal.
      
      v2: - Li pointed out in an earlier patch that specifying "name="
            during mount without subsystem specification should succeed if
            there's an existing hierarchy with a matching name although it
            should fail with -EINVAL if a new hierarchy should be created.
            Prior to the conversion, this used to be handled by deferring
            failure from NULL return from cgroup_root_from_opts(), which was
            necessary because root was being created before checking for
            existing ones.  Note that cgroup_root_from_opts() returned an
            ERR_PTR() value for error conditions which require immediate
            mount failure.
      
            As we now have separate search and creation steps, deferring
            failure from cgroup_root_from_opts() is no longer necessary.
            cgroup_root_from_opts() is updated to always return ERR_PTR()
            value on failure.
      
          - The logic to match existing roots is updated so that a mount
            attempt with a matching name but different subsys_mask is
            rejected.  This was handled by a separate matching loop under
            the comment "Check for name clashes with existing mounts" but
            got lost during conversion.  Merge the check into the main
            search loop.
      
          - Add __rcu __force casting in RCU_INIT_POINTER() in
            cgroup_destroy_locked() to avoid the sparse address space
            warning reported by kbuild test bot.  Maybe we want an explicit
            interface to use kn->priv as RCU protected pointer?
      
      v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      2bd59d48
    • cgroup: relocate functions in preparation of kernfs conversion · f2e85d57
      Tejun Heo authored
      Relocate cgroup_init/exit_root_id(), cgroup_free_root(),
      cgroup_kill_sb() and cgroup_file_name() in preparation of kernfs
      conversion.
      
      These are pure relocations to make kernfs conversion easier to follow.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      f2e85d57
    • cgroup: misc preps for kernfs conversion · 59f5296b
      Tejun Heo authored
      * Un-inline seq_css().  After kernfs conversion, the function will
        need to dereference internal data structures.
      
      * Add cgroup_get/put_root() and replace direct super_block->s_active
        manipulations with them.  These will be converted to kernfs_root
        refcnting.
      
      * Add cgroup_get/put() and replace dget/put() on cgrp->dentry with
        them.  These will be converted to kernfs refcnting.
      
      * Update current_css_set_cg_links_read() to use cgroup_name() instead
        of reaching into the dentry name.  The end result is the same.
      
      These changes don't make functional differences but will make
      transition to kernfs easier.
      
      v2: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      59f5296b
    • cgroup: introduce cgroup_ino() · b1664924
      Tejun Heo authored
      mm/memory-failure.c::hwpoison_filter_task() has been reaching into
      cgroup to extract the associated ino to be used as a filtering
      criterion.  This is an implementation detail which shouldn't be
      depended upon from outside cgroup proper and is about to change with
      the scheduled kernfs conversion.
      
      This patch introduces a proper interface to determine the associated
      ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
      instead of reaching directly into cgroup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      b1664924
    • cgroup: introduce cgroup_init/exit_cftypes() · 2da440a2
      Tejun Heo authored
      Factor out cft->ss initialization into cgroup_init_cftypes() from
      cgroup_add_cftypes() and add cft->ss clearing to cgroup_rm_cftypes()
      through cgroup_exit_cftypes().
      
      This doesn't make any meaningful difference now but the two new
      functions will be expanded during kernfs transition.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      2da440a2
    • cgroup: update the meaning of cftype->max_write_len · 5f469907
      Tejun Heo authored
      cftype->max_write_len is used to extend the maximum size of writes.
      It's interpreted in such a way that the actual maximum size is one
      less than the specified value.  The default size is defined by
      CGROUP_LOCAL_BUFFER_SIZE.  Its interpretation is quite confusing - its
      value is decremented by 1 and then compared for equality with max
      size, which means that the actual default size is
      CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars.
      
      There's no point in having a limit that low.  Update its definition so
      that it means the actual string length sans termination and anything
      below PAGE_SIZE-1 is treated as PAGE_SIZE-1.
      
      .max_write_len for "release_agent" is updated to PATH_MAX-1 and
      cgroup_release_agent_write() is updated so that the redundant strlen()
      check is removed and it uses strlcpy() instead of strcpy().
      .max_write_len initializations in blk-throttle.c and cfq-iosched.c are
      no longer necessary and removed.  The one in cpuset is kept unchanged
      as it's an approximated value to begin with.
      
      This will also make transition to kernfs smoother.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      5f469907
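The new meaning of `max_write_len` in the commit above (the actual string length without the terminator, with anything below PAGE_SIZE-1 treated as PAGE_SIZE-1) can be sketched as a small helper. The 4 KiB page size and the function names are assumptions for illustration. The value 64 for `CGROUP_LOCAL_BUFFER_SIZE` is inferred from the "- 2, which is 62" arithmetic in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u          /* assumption: 4 KiB pages */
#define CGROUP_LOCAL_BUFFER_SIZE 64   /* implied by "- 2, which is 62" */

/* Old behaviour: the effective default limit was the buffer size minus 2. */
static_assert(CGROUP_LOCAL_BUFFER_SIZE - 2 == 62,
              "old default limit was 62 chars");

/*
 * New behaviour: @max_write_len is the string length without the
 * terminator, and anything below PAGE_SIZE-1 is raised to PAGE_SIZE-1.
 */
static size_t demo_effective_max_write_len(size_t max_write_len)
{
    const size_t floor = DEMO_PAGE_SIZE - 1;

    return max_write_len < floor ? floor : max_write_len;
}

/* Would a write of @len characters be accepted under the new rule? */
static bool demo_write_len_ok(size_t max_write_len, size_t len)
{
    return len <= demo_effective_max_write_len(max_write_len);
}
```

Under this rule a file that leaves the field unset accepts writes of up to 4095 characters, while a file asking for more than a page, such as one sized for a path, keeps its larger limit.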
    • cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes() · de00ffa5
      Tejun Heo authored
      Currently, cgroup_subsys->base_cftypes registration is different from
      dynamic cftypes registration.  Instead of going through
      cgroup_add_cftypes(), cgroup_init_subsys() invokes
      cgroup_init_cftsets() which makes use of cgroup_subsys->base_cftset
      which doesn't involve dynamic allocation.
      
      While avoiding dynamic allocation is somewhat nice, having two
      separate paths for cftypes registration is nasty, especially as we're
      planning to add more operations during cftypes registration.
      
      This patch drops cgroup_init_cftsets() and cgroup_subsys->base_cftset
      and registers base_cftypes using cgroup_add_cftypes().  This is done
      as a separate step in cgroup_init() instead of a part of
      cgroup_init_subsys().  This is because cgroup_init_subsys() can be
      called very early during boot when kmalloc() isn't available yet.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      de00ffa5
    • cgroup: update cgroup name handling · 8d7e6fb0
      Tejun Heo authored
      Straightforward updates to cgroup name handling in preparation of
      kernfs conversion.
      
      * cgroup_alloc_name() is updated to take const char * instead of
        dentry * for name source.
      
      * cgroup name formatting is separated out into cgroup_file_name().
        While at it, buffer length protection is added.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      8d7e6fb0
    • cgroup: factor out cgroup_setup_root() from cgroup_mount() · d427dfeb
      Tejun Heo authored
      Factor out new root initialization into cgroup_setup_root() from
      cgroup_mount().  This makes it easier to follow and will ease kernfs
      conversion.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      d427dfeb
    • cgroup: restructure locking and error handling in cgroup_mount() · 8e30e2b8
      Tejun Heo authored
      cgroup is scheduled to be converted to kernfs.  After conversion,
      cgroup_mount() won't use the sget() machinery for finding out existing
      super_blocks but instead would do that directly.  It'll search the
      existing cgroupfs_roots for a matching one and create a new one iff a
      match doesn't exist.  To ease such conversion, this patch restructures
      locking and error handling of the function.
      
      cgroup_tree_mutex and cgroup_mutex are grabbed from the get-go and
      held until return.  For now, due to the way vfs locks nest outside
      cgroup mutexes, the two cgroup mutexes are temporarily dropped across
      sget() and inode mutex locking, which looks quite ridiculous; however,
      these will be removed through kernfs conversion and structuring the
      code this way makes the conversion less painful.
      
      The error goto labels are consolidated to two.  This looks unwieldy
      now but the next patch will factor out creation of new root into a
      separate function with accompanying error handling and it'll look a
      lot better.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      8e30e2b8
    • cgroup: release cgroup_mutex over file removals · 4ac06017
      Tejun Heo authored
      Now that cftypes and all tree modification operations are protected by
      cgroup_tree_mutex, we can drop cgroup_mutex while deleting files and
      directories.  Drop cgroup_mutex over removals.
      
      This doesn't make any noticeable difference now but is to help kernfs
      conversion.  In kernfs, removals are sync points which drain in-flight
      operations.  As those operations would grab cgroup_mutex, trying to
      delete under cgroup_mutex would deadlock.  This can be resolved by
      just holding the outer cgroup_tree_mutex which nests outside both
      kernfs active reference and cgroup_mutex.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      4ac06017
    • cgroup: introduce cgroup_tree_mutex · ace2bee8
      Tejun Heo authored
      Currently cgroup uses combination of inode->i_mutex'es and
      cgroup_mutex for synchronization.  With the scheduled kernfs
      conversion, i_mutex'es will be removed.  Unfortunately, just using
      cgroup_mutex isn't possible.  All kernfs file and syscall operations,
      most of which require grabbing cgroup_mutex, will be called with
      kernfs active ref held and, if we try to perform kernfs removals under
      cgroup_mutex, it can deadlock as kernfs_remove() tries to drain the
      target node.
      
      Let's introduce a new outer mutex, cgroup_tree_mutex, which protects
      stuff used during hierarchy changing operations - cftypes and all the
      operations which may affect the cgroupfs.  It also covers css
      association and iteration.  This allows cgroup_css(), for_each_css()
      and other css iterators to be called under cgroup_tree_mutex.  The new
      mutex will nest above both kernfs's active ref protection and
      cgroup_mutex.  By protecting tree modifications with a separate outer
      mutex, we can get rid of the aforementioned deadlock condition.
      
      Actual file additions and removals now require cgroup_tree_mutex
      instead of cgroup_mutex.  Currently, cgroup_tree_mutex is never used
      without cgroup_mutex; however, we'll soon add hierarchy modification
      sections which are only protected by cgroup_tree_mutex.  In the
      future, we might want to make the locking more granular by better
      splitting the coverages of the two mutexes.  For now, this should do.
      
      v2: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      ace2bee8
    • cgroup: improve css_from_dir() into css_tryget_from_dir() · 5a17f543
      Tejun Heo authored
      css_from_dir() returns the matching css (cgroup_subsys_state) given a
      dentry and subsystem.  The function doesn't pin the css before
      returning and requires the caller to hold the RCU read lock or
      cgroup_mutex and to handle pinning itself.
      
      Given that users of the function are likely to want to pin the
      returned css (both existing users do) and that getting and putting
      css's are very cheap, there's no reason for the interface to be tricky
      like this.
      
      Rename css_from_dir() to css_tryget_from_dir() and make it try to pin
      the found css and return it only if pinning succeeded.  The callers
      are updated so that they no longer do RCU locking and pinning around
      the function and just use the returned css.
      
      This will also ease converting cgroup to kernfs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      5a17f543
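The "look up, then return only if pinning succeeded" interface described in this entry can be sketched with C11 atomics. This is a simplified illustration under stated assumptions: `struct obj`, `obj_tryget()`, `obj_put()` and `obj_tryget_from_table()` are hypothetical names, the plain counter is a stand-in for the kernel's css reference counting (which is implemented differently), and a table index replaces the dentry lookup.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object; a refcnt of 0 means it is dead or being destroyed. */
struct obj {
	atomic_int refcnt;
};

/* Take a reference only if the object is still alive. */
bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old > 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference taken by obj_tryget(). */
void obj_put(struct obj *o)
{
	atomic_fetch_sub(&o->refcnt, 1);
}

/*
 * Lookup and pinning combined: the caller gets either a pinned object
 * or NULL, and never has to handle pinning itself.
 */
struct obj *obj_tryget_from_table(struct obj **table, size_t n, size_t idx)
{
	if (idx >= n || table[idx] == NULL)
		return NULL;
	return obj_tryget(table[idx]) ? table[idx] : NULL;
}
```

A caller simply checks the return value for NULL and calls `obj_put()` when done, which mirrors how the updated callers in this commit are described as using the returned css directly.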
    • Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15 · 398f8787
      Tejun Heo authored
      Pull for-3.14-fixes to receive 0ab02ca8 ("cgroup: protect
      modifications to cgroup_idr with cgroup_mutex") prior to kernfs
      conversion series to avoid non-trivial conflicts.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      398f8787
    • cgroup: protect modifications to cgroup_idr with cgroup_mutex · 0ab02ca8
      Li Zefan authored
      Setup cgroupfs like this:
        # mount -t cgroup -o cpuacct xxx /cgroup
        # mkdir /cgroup/sub1
        # mkdir /cgroup/sub2
      
      Then run these two commands:
        # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /cgroup/sub1/tmp; } &
        # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /cgroup/sub2/tmp; } &
      
      After a few seconds you may see this warning:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
      idr_remove called for id=6 which is not allocated.
      ...
      Call Trace:
       [<ffffffff8156063c>] dump_stack+0x7a/0x96
       [<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0
       [<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50
       [<ffffffff81300aa7>] sub_remove+0x87/0x1b0
       [<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0
       [<ffffffff81300bf5>] idr_remove+0x25/0xd0
       [<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0
       [<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0
       [<ffffffff8107cdbc>] process_one_work+0x26c/0x550
       [<ffffffff8107eefe>] worker_thread+0x12e/0x3b0
       [<ffffffff81085f96>] kthread+0xe6/0xf0
       [<ffffffff81570bac>] ret_from_fork+0x7c/0xb0
      ---[ end trace 2d1577ec10cf80d0 ]---
      
      This happens because allocating and removing cgroup IDs is not
      properly synchronized.
      
      The bug was introduced when we converted cgroup_ida to cgroup_idr.
      While synchronization is already done inside ida_simple_{get,remove}(),
      users are responsible for synchronizing concurrent calls to
      idr_{alloc,remove}().
      
      tj: Refreshed on top of b58c8998 ("cgroup: fix error return from
      cgroup_create()").
      
      Fixes: 4e96ee8e ("cgroup: convert cgroup_ida to cgroup_idr")
      Cc: <stable@vger.kernel.org> #3.12+
      Reported-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      0ab02ca8
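The shape of this fix, an allocator that does no locking of its own being wrapped by a lock the caller holds, can be sketched as follows. This is a user-space illustration only: `raw_id_alloc()` and `raw_id_remove()` are hypothetical stand-ins for an unsynchronized ID allocator, and `id_lock` plays the role that cgroup_mutex plays in the commit.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_IDS 64

/* Unsynchronized core: callers must serialize access themselves. */
static bool id_used[MAX_IDS];

static int raw_id_alloc(void)
{
	/* IDs start at 1; return the lowest free one, or -1 if none is left. */
	for (int id = 1; id < MAX_IDS; id++) {
		if (!id_used[id]) {
			id_used[id] = true;
			return id;
		}
	}
	return -1;
}

static bool raw_id_remove(int id)
{
	if (id < 1 || id >= MAX_IDS || !id_used[id])
		return false;	/* removing an ID that is not allocated */
	id_used[id] = false;
	return true;
}

/* Hypothetical lock standing in for the mutex that now guards the IDs. */
static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;

int id_alloc_locked(void)
{
	int id;

	pthread_mutex_lock(&id_lock);
	id = raw_id_alloc();
	pthread_mutex_unlock(&id_lock);
	return id;
}

bool id_remove_locked(int id)
{
	bool ok;

	pthread_mutex_lock(&id_lock);
	ok = raw_id_remove(id);
	pthread_mutex_unlock(&id_lock);
	return ok;
}
```

Without the lock, two threads running the allocation and removal paths at the same time could corrupt the shared table, which is the kind of race the warning above points to.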
  4. 08 Feb, 2014 9 commits
    • Merge branch 'driver-core-next' into cgroup/for-3.15 · f7cef064
      Tejun Heo authored
      Pending kernfs conversion depends on kernfs improvements in
      driver-core-next.  Pull it into for-3.15.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f7cef064
    • Merge branch 'for-3.14-fixes' into for-3.15 · 1a698a4a
      Tejun Heo authored
      Pending kernfs conversion depends on fixes in for-3.14-fixes.  Pull it
      into for-3.15.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      1a698a4a
    • cgroup: remove cgroup_root_mutex · 3417ae1f
      Tejun Heo authored
      cgroup_root_mutex was added to avoid deadlock involving namespace_sem
      via cgroup_show_options().  It added a lot of overhead for such a small
      purpose and, because it's nested under cgroup_mutex, it has very
      limited usefulness.  The previous patch made cgroup_show_options() not
      use cgroup_root_mutex, so nobody needs it anymore.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3417ae1f
    • cgroup: update locking in cgroup_show_options() · 69e943b7
      Tejun Heo authored
      cgroup_show_options() grabs cgroup_root_mutex to keep the options from
      changing while printing; however, holding root_mutex or not doesn't
      really make much difference for the function.  subsys_mask can be
      atomically tested and most of the options aren't allowed to change
      anyway once mounted.
      
      The only field which needs synchronization is ->release_agent_path.
      This patch introduces a dedicated spinlock to synchronize accesses to
      the field and drops cgroup_root_mutex locking from
      cgroup_show_options().  The next patch will remove cgroup_root_mutex.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      69e943b7
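The idea of guarding a single string field with its own small lock, rather than a large mutex, can be sketched in portable C11. This is an illustration under assumptions, not the kernel implementation: the spinlock is built from `atomic_flag`, and `agent_path`, `set_agent_path()` and `show_agent_path()` are hypothetical names standing in for ->release_agent_path and its writer and reader.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define AGENT_PATH_LEN 256

/* Dedicated lock protecting only agent_path (hypothetical names). */
static atomic_flag path_lock = ATOMIC_FLAG_INIT;
static char agent_path[AGENT_PATH_LEN];

static void path_lock_acquire(void)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&path_lock, memory_order_acquire))
		;
}

static void path_lock_release(void)
{
	atomic_flag_clear_explicit(&path_lock, memory_order_release);
}

/* Writer: replace the path under the dedicated lock. */
void set_agent_path(const char *path)
{
	path_lock_acquire();
	snprintf(agent_path, sizeof(agent_path), "%s", path);
	path_lock_release();
}

/* Reader: copy a consistent snapshot out, then work on the private copy. */
size_t show_agent_path(char *buf, size_t len)
{
	if (len == 0)
		return 0;
	path_lock_acquire();
	snprintf(buf, len, "%s", agent_path);
	path_lock_release();
	return strlen(buf);
}
```

The reader holds the lock only for the duration of the copy, so printing the option never needs a heavier lock that would nest under other locks.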
    • cgroup: rename cgroup_subsys->subsys_id to ->id · aec25020
      Tejun Heo authored
      It's no longer referenced outside cgroup core, so renaming is easy.
      Let's rename it for consistency & brevity.
      
      This patch is pure rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      aec25020
    • cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Tejun Heo authored
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  As cgroup.h is widely
        included throughout various subsystems, it doesn't and shouldn't
        lay claim to such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the following.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      073219e9
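Generating the IDs and names in one place, so that controllers no longer set them by hand, can be sketched with a single macro list. This is a standalone illustration that assumes a three-entry list: `SUBSYS_LIST`, `struct subsys`, `subsys_table`, `subsys_names` and `subsys_init_early()` are hypothetical names, and only the _cgrp_id and _cgrp_subsys suffixes come from the commit message.

```c
#include <string.h>

/* One list drives every definition below (hypothetical three-entry list). */
#define SUBSYS_LIST(X)	\
	X(cpu)		\
	X(memory)	\
	X(perf_event)

/* IDs: cpu_cgrp_id, memory_cgrp_id, perf_event_cgrp_id. */
#define DEFINE_ID(x) x##_cgrp_id,
enum subsys_id { SUBSYS_LIST(DEFINE_ID) SUBSYS_COUNT };
#undef DEFINE_ID

struct subsys {
	int id;
	const char *name;
};

/* One object per entry; controllers would fill in only their callbacks. */
#define DEFINE_SUBSYS(x) struct subsys x##_cgrp_subsys;
SUBSYS_LIST(DEFINE_SUBSYS)
#undef DEFINE_SUBSYS

/* Contiguous table indexed by ID; no explicit size is needed. */
#define SUBSYS_PTR(x) [x##_cgrp_id] = &x##_cgrp_subsys,
static struct subsys *subsys_table[] = { SUBSYS_LIST(SUBSYS_PTR) };
#undef SUBSYS_PTR

/* Names derived from the identifiers, so the two cannot drift apart. */
#define SUBSYS_NAME(x) [x##_cgrp_id] = #x,
static const char *subsys_names[] = { SUBSYS_LIST(SUBSYS_NAME) };
#undef SUBSYS_NAME

/* The core assigns ->id and ->name; nothing is left to mismatch. */
void subsys_init_early(void)
{
	for (int i = 0; i < SUBSYS_COUNT; i++) {
		subsys_table[i]->id = i;
		subsys_table[i]->name = subsys_names[i];
	}
}
```

Because the ID, the object name and the visible name all derive from the same token, the "initialize it and then BUG if they don't match" check described above has nothing left to verify.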
    • cgroup: drop module support · 3ed80a62
      Tejun Heo authored
      With module support dropped from net_prio, no controller is using
      cgroup module support.  None of the actual resource controllers can be
      built as a module and we aren't gonna add new controllers which don't
      control resources.  This patch drops module support from cgroup.
      
      * cgroup_[un]load_subsys() and cgroup_subsys->module removed.
      
      * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
        cgroup_subsys.h now uses IS_ENABLED() directly.
      
      * enum cgroup_subsys_id now exactly matches the list of enabled
        controllers as ordered in cgroup_subsys.h.
      
      * cgroup_subsys[] is now a contiguously occupied array.  Size
        specification is no longer necessary and dropped.
      
      * for_each_builtin_subsys() is removed and for_each_subsys() is
        updated to not require any locking.
      
      * module ref handling is removed from rebind_subsystems().
      
      * Module related comments dropped.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      
      v3: Added {} around the if (need_forkexit_callback) block in
          cgroup_post_fork() for readability as suggested by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3ed80a62
    • cgroup: make CONFIG_CGROUP_NET_PRIO bool and drop unnecessary init_netclassid_cgroup() · af636337
      Tejun Heo authored
      net_prio is the only cgroup which is allowed to be built as a module.
      The savings from allowing one controller to be built as a module are
      tiny especially given that cgroup module support itself adds quite a
      bit of complexity.
      
      Given that none of the other controllers has much chance of being made a
      module and that we're unlikely to add new modular controllers, the
      added complexity is simply not justifiable.
      
      As a first step to drop cgroup module support, this patch changes the
      config option to bool from tristate and drops module related code from
      it.
      
      Also, while an earlier commit fe1217c4 ("net: net_cls: move
      cgroupfs classid handling into core") dropped module support from
      net_cls cgroup, it retained a call to cgroup_load_subsys(), which is
      a noop for built-in controllers.  Drop it along with
      init_netclassid_cgroup().
      
      v2: Removed modular version of task_netprioidx() in
          include/net/netprio_cgroup.h as suggested by Li Zefan.
      
      v3: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").  net_cls cgroup part is mostly
          dropped except for removal of init_netclassid_cgroup().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      af636337
    • cgroup: fix locking in cgroup_cfts_commit() · 48573a89
      Tejun Heo authored
      cgroup_cfts_commit() walks the cgroup hierarchy that the target
      subsystem is attached to and tries to apply the file changes.  Due to
      the convolution with inode locking, it can't keep cgroup_mutex locked
      while iterating.  It currently holds only RCU read lock around the
      actual iteration and then pins the found cgroup using dget().
      
      Unfortunately, this is incorrect.  Although the iteration does check
      cgroup_is_dead() before invoking dget(), there's nothing which
      prevents the dentry from going away in between.  Note that this is
      different from the usual css iterations where css_tryget() is used to
      pin the css - css_tryget() tests whether the css can be pinned and
      fails if not.
      
      The problem can be solved by simply holding cgroup_mutex instead of
      RCU read lock around the iteration, which actually reduces LOC.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      48573a89