1. 13 May, 2014 6 commits
    • cgroup: fix offlining child waiting in cgroup_subtree_control_write() · 0cee8b77
      Tejun Heo authored
      cgroup_subtree_control_write() waits for offline to complete
      child-by-child before enabling a controller; however, it has a couple
      of bugs.
      
      * It doesn't initialize the wait_queue_t.  This can lead to an infinite
        hang on the following schedule() among other things.
      
      * It forgets to pin the child before releasing cgroup_tree_mutex and
        performing schedule().  The child may already be gone by the time it
        wakes up and invokes finish_wait().  Pin the child being waited on.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • Merge branch 'for-3.15-fixes' of... · f21a4f75
      Tejun Heo authored
      Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.16
      
      Pull to receive e37a06f1 ("cgroup: fix the retry path of
      cgroup_mount()") to avoid unnecessary conflicts with planned
      cgroup_tree_mutex removal and also to be able to remove the temp fix
      added by 36c38fb7 ("blkcg: use trylock on blkcg_pol_mutex in
      blkcg_reset_stats()") afterwards.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: fix rcu_read_lock() leak in update_if_frozen() · 36e9d2eb
      Tejun Heo authored
      While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
      replace freezer->lock with freezer_mutex") introduced a bug in
      update_if_frozen() where it returns with rcu_read_lock() held.  Fix it
      by adding rcu_read_unlock() before returning.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
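The class of bug described above, an early return that skips the matching unlock, can be illustrated with a toy model. This is a Python sketch, not kernel code: `ReadSection`, `update_buggy` and `update_fixed` are hypothetical names, and the nesting counter only stands in for an RCU read-side critical section.

```python
class ReadSection:
    """Toy stand-in for an RCU read-side critical section (hypothetical)."""

    def __init__(self):
        self.depth = 0  # lock() calls not yet matched by unlock()

    def lock(self):
        self.depth += 1

    def unlock(self):
        assert self.depth > 0, "unlock without matching lock"
        self.depth -= 1


def update_buggy(rs, children_frozen):
    """Early return leaves the section held, mirroring the reported leak."""
    rs.lock()
    for frozen in children_frozen:
        if not frozen:
            return False  # BUG: returns with the read section still held
    rs.unlock()
    return True


def update_fixed(rs, children_frozen):
    """Same logic, with the unlock added before the early return."""
    rs.lock()
    for frozen in children_frozen:
        if not frozen:
            rs.unlock()
            return False
    rs.unlock()
    return True
```

After `update_buggy()` takes the early return, `depth` stays at 1, while `update_fixed()` always leaves it at 0.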
    • Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.16 · d39ea871
      Tejun Heo authored
      Pull to receive percpu_ref_tryget[_live]() changes.  Planned cgroup
      changes will make use of them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup_freezer: replace freezer->lock with freezer_mutex · e5ced8eb
      Tejun Heo authored
      After 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it
      to css_set_rwsem"), css task iterators require sleepable context as
      they may block on css_set_rwsem.  I missed that cgroup_freezer was
      iterating tasks under IRQ-safe spinlock freezer->lock.  This leads to
      errors like the following on freezer state reads and transitions.
      
        BUG: sleeping function called from invalid context at /work/os/work/kernel/locking/rwsem.c:20
        in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
        5 locks held by bash/462:
         #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
         #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
         #2:  (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
         #3:  (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
         #4:  (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
        Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460
      
        CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
         ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
         ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
         0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
        Call Trace:
         [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
         [<ffffffff810cf4f2>] __might_sleep+0x162/0x260
         [<ffffffff81d05974>] down_read+0x24/0x60
         [<ffffffff81133e87>] css_task_iter_start+0x27/0x70
         [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
         [<ffffffff81135a16>] freezer_write+0xf6/0x1e0
         [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
         [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f0756>] vfs_write+0xb6/0x1c0
         [<ffffffff811f121d>] SyS_write+0x4d/0xc0
         [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b
      
      freezer->lock used to be used in hot paths but that time is long gone
      and there's no reason for the lock to be an IRQ-safe spinlock or even
      per-cgroup.  In fact, given that a cgroup may contain a large
      number of tasks, it's not a good idea to iterate over them while
      holding an IRQ-safe spinlock.
      
      Let's simplify locking by replacing per-cgroup freezer->lock with
      global freezer_mutex.  This also makes the comments explaining the
      intricacies of policy inheritance and the locking around it
      unnecessary, as the states are protected by a common mutex.
      
      The conversion is mostly straightforward.  The following are worth
      mentioning.
      
      * freezer_css_online() no longer needs double locking.
      
      * freezer_attach() now performs propagation simply while holding
        freezer_mutex.  update_if_frozen() race no longer exists and the
        comment is removed.
      
      * freezer_fork() now tests whether the task is in root cgroup using
        the new task_css_is_root() without doing rcu_read_lock/unlock().  If
        not, it grabs freezer_mutex and performs the operation.
      
      * freezer_read() and freezer_change_state() grab freezer_mutex across
        the whole operation and pin the css while iterating so that each
        descendant processing happens in sleepable context.
      
      Fixes: 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: introduce task_css_is_root() · 5024ae29
      Tejun Heo authored
      Determining the css of a task usually requires RCU read lock as that's
      the only thing which keeps the returned css accessible till its
      reference is acquired; however, testing whether a task belongs to the
      root can be performed without dereferencing the returned css by
      comparing the returned pointer against the root one in init_css_set[]
      which never changes.
      
      Implement task_css_is_root() which can be invoked in any context.
      This will be used by the scheduled cgroup_freezer change.
      
      v2: cgroup no longer supports modular controllers.  No need to export
          init_css_set.  Pointed out by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
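The idea of testing root membership by comparing a pointer, without ever dereferencing it, can be sketched with object identity. This is a toy Python model under stated assumptions: `Css`, `Task` and `ROOT_CSS` are hypothetical stand-ins, and only the function name is borrowed from the commit.

```python
class Css:
    """Toy css object (hypothetical)."""

    def __init__(self, name):
        self.name = name


# Plays the role of the root css, which never changes.
ROOT_CSS = Css("root")


class Task:
    def __init__(self, css):
        # In the real kernel this association may change concurrently.
        self.css = css


def task_css_is_root(task):
    # Only compares object identity and never reads a field of task.css,
    # so nothing has to keep the referenced object alive.
    return task.css is ROOT_CSS
```

A css that merely looks like the root (same name, different object) does not match, because the comparison is on identity rather than content.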
  2. 09 May, 2014 2 commits
  3. 07 May, 2014 1 commit
  4. 06 May, 2014 2 commits
  5. 05 May, 2014 3 commits
    • kernel/cgroup.c: fix 2 kernel-doc warnings · 60106946
      Fabian Frederick authored
      Fix typo and variable name.
      
      tj: Updated @cgrp argument description in cgroup_destroy_css_killed()
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() · 36c38fb7
      Tejun Heo authored
      During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
      which nests above both the kernfs s_active protection and cgroup_mutex
      is added to synchronize cgroup file type operations as cgroup_mutex
      needed to be grabbed from some file operations and thus can't be put
      above s_active protection.
      
      While this arrangement mostly worked for cgroup, this triggered the
      following lockdep warning.
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G        W
        -------------------------------------------------------
        trinity-c173/9024 is trying to acquire lock:
        (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
      
        but task is already holding lock:
        (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (s_active#89){++++.+}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
        kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
        cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
        cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
        rebind_subsystems (kernel/cgroup.c:1144)
        cgroup_setup_root (kernel/cgroup.c:1568)
        cgroup_mount (kernel/cgroup.c:1716)
        mount_fs (fs/super.c:1094)
        vfs_kern_mount (fs/namespace.c:899)
        do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
        SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        -> #1 (cgroup_tree_mutex){+.+.+.}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
        blkcg_policy_register (block/blk-cgroup.c:1106)
        throtl_init (block/blk-throttle.c:1694)
        do_one_initcall (init/main.c:789)
        kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
        kernel_init (init/main.c:935)
        ret_from_fork (arch/x86/kernel/entry_64.S:552)
      
        -> #0 (blkcg_pol_mutex){+.+.+.}:
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        other info that might help us debug this:
      
        Chain exists of:
        blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(s_active#89);
      				 lock(cgroup_tree_mutex);
      				 lock(s_active#89);
          lock(blkcg_pol_mutex);
      
         *** DEADLOCK ***
      
        4 locks held by trinity-c173/9024:
        #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
        #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
        #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
        #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        stack backtrace:
        CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G        W     3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
         ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
         ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
         ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
        Call Trace:
        dump_stack (lib/dump_stack.c:52)
        print_circular_bug (kernel/locking/lockdep.c:1216)
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
      
      This is a highly unlikely but valid circular dependency between "echo
      1 > blkcg.reset_stats" and cfq module [un]loading.  cgroup is going
      through further locking update which will remove this complication but
      for now let's use trylock on blkcg_pol_mutex and retry the file
      operation if the trylock fails.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com
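The trylock-and-retry approach described above can be sketched as follows. This is a Python toy model, not the kernel change: `pol_mutex` only stands in for blkcg_pol_mutex, and `reset_stats_once`, `reset_stats` and `max_attempts` are hypothetical. The point it illustrates is that the operation never blocks on the lock while holding whatever it already holds; on failure it backs out completely and is retried from the start.

```python
import threading

pol_mutex = threading.Lock()  # stands in for blkcg_pol_mutex (toy model)


def reset_stats_once(stats):
    """Try the lock without blocking; report whether the reset ran."""
    if not pol_mutex.acquire(blocking=False):
        return False  # caller backs out and retries the whole operation
    try:
        for key in stats:
            stats[key] = 0
        return True
    finally:
        pol_mutex.release()


def reset_stats(stats, max_attempts=100):
    """Retry the whole operation until the trylock succeeds."""
    for _ in range(max_attempts):
        if reset_stats_once(stats):
            return True
    return False
```

Because a failed attempt touches nothing, retrying is safe: the statistics are either fully reset or left exactly as they were.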
    • device_cgroup: check if exception removal is allowed · d2c2b11c
      Aristeu Rozanski authored
      [PATCH v3 1/2] device_cgroup: check if exception removal is allowed
      
      When the device cgroup hierarchy was introduced in
      	bd2953eb - devcg: propagate local changes down the hierarchy
      
      a specific case was overlooked. Consider the hierarchy below:
      
      	A	default policy: ALLOW, exceptions will deny access
      	 \
      	  B	default policy: ALLOW, exceptions will deny access
      
      There's no need to verify when a new exception is added to B because
      in this case exceptions will deny access to further devices, which is
      always fine. Hierarchy in device cgroup only makes sure B won't have
      more access than A.
      
      But when an exception is removed (by writing devices.allow), it isn't
      checked if the user is in fact removing an inherited exception from A,
      thus giving more access to B.
      
      Example:
      
      	# echo 'a' >A/devices.allow
      	# echo 'c 1:3 rw' >A/devices.deny
      	# echo $$ >A/B/tasks
      	# echo >/dev/null
      	-bash: /dev/null: Operation not permitted
      	# echo 'c 1:3 w' >A/B/devices.allow
      	# echo >/dev/null
      	#
      
      This shouldn't be allowed and this patch fixes it by making sure to never allow
      exceptions in this case to be removed if the exception is partially or fully
      present on the parent.
      
      v3: missing '*' in function description
      v2: improved log message and formatting fixes
      
      Cc: cgroups@vger.kernel.org
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
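The rule stated above, that an exception may not be removed from a child if it is partially or fully present on the parent, can be sketched as an overlap test. This is a simplified Python model under stated assumptions: `DevException`, `overlaps` and `may_remove_exception` are hypothetical names, the "all devices" type is not handled, and it covers only the default-ALLOW case from the example.

```python
from collections import namedtuple

# Toy exception record; field names are illustrative, not the kernel's.
DevException = namedtuple("DevException", "dev_type major minor access")


def _num_matches(a, b):
    """'*' is a wildcard that matches any major/minor number."""
    return a == "*" or b == "*" or a == b


def overlaps(a, b):
    """True if the two exceptions share at least one device and access bit."""
    return (
        a.dev_type == b.dev_type
        and _num_matches(a.major, b.major)
        and _num_matches(a.minor, b.minor)
        and bool(set(a.access) & set(b.access))
    )


def may_remove_exception(parent_exceptions, ex):
    """Refuse removal if the parent still has the exception, even partially."""
    return not any(overlaps(p, ex) for p in parent_exceptions)
```

With A holding `DevException("c", 1, 3, "rw")`, an attempt to remove `DevException("c", 1, 3, "w")` from B is refused, which corresponds to the `echo 'c 1:3 w' >A/B/devices.allow` step in the example.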
  6. 04 May, 2014 7 commits
    • device_cgroup: fix the comment format for recently added functions · f5f3cf6f
      Aristeu Rozanski authored
      Moving more extensive explanations to the end of the comment.
      
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup, memcg: implement css->id and convert css_from_id() to use it · 15a4c835
      Tejun Heo authored
      Until now, cgroup->id has been used to identify all the associated
      csses and css_from_id() takes cgroup ID and returns the matching css
      by looking up the cgroup and then dereferencing the css associated
      with it; however, now that the lifetimes of cgroup and css are
      separate, this is incorrect and breaks on the unified hierarchy when a
      controller is disabled and enabled back again before the previous
      instance is released.
      
      This patch adds css->id which is a subsystem-unique ID and converts
      css_from_id() to look up by the new css->id instead.  memcg is the
      only user of css_from_id() and is also converted to use css->id instead.
      
      For traditional hierarchies, this shouldn't make any functional
      difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
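Why a per-css ID is needed once cgroup and css lifetimes are separate can be shown with a toy model. This Python sketch is illustrative only: `Subsys`, `Css`, `register` and `release` are hypothetical, and the dictionary merely stands in for an ID-to-object lookup table.

```python
import itertools


class Subsys:
    """Toy subsystem owning its own ID space (illustrative)."""

    def __init__(self, name):
        self.name = name
        self._next_id = itertools.count(1)  # 0 is left unused
        self._by_id = {}

    def register(self, css):
        css_id = next(self._next_id)
        self._by_id[css_id] = css
        return css_id

    def release(self, css):
        del self._by_id[css.id]

    def css_from_id(self, css_id):
        return self._by_id.get(css_id)


class Css:
    """Toy css that gets a subsystem-unique ID when created."""

    def __init__(self, subsys, cgroup_id):
        self.subsys = subsys
        self.cgroup_id = cgroup_id
        self.id = subsys.register(self)


memory = Subsys("memory")
old = Css(memory, cgroup_id=7)
new = Css(memory, cgroup_id=7)  # controller re-enabled before `old` is released
```

Both instances share cgroup ID 7, so a lookup keyed on the cgroup ID cannot tell them apart, whereas `memory.css_from_id(old.id)` and `memory.css_from_id(new.id)` each return the intended instance.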
    • cgroup: update init_css() into init_and_link_css() · ddfcadab
      Tejun Heo authored
      init_css() takes the cgroup the new css belongs to as an argument and
      initializes the new css's ->cgroup and ->parent pointers but doesn't
      acquire the matching reference counts.  After the previous patch,
      create_css() puts init_css() and reference acquisition right next to
      each other.  Let's move reference acquisition into init_css() and
      rename the function to init_and_link_css().  This makes sense and is
      easier to follow.  This makes the root csses hold a reference on
      cgrp_dfl_root.cgrp, which is harmless.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: use RCU free in create_css() failure path · a2bed820
      Tejun Heo authored
      Currently, when create_css() fails in the middle, the half-initialized
      css is freed by invoking cgroup_subsys->css_free() directly.  This
      patch updates the function so that it invokes RCU free path instead.
      As the RCU free path puts the parent css and owning cgroup, their
      references are now acquired right after a new css is successfully
      allocated.
      
      This doesn't make any visible difference now but is to enable
      implementing css->id and RCU protected lookup by such IDs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: protect cgroup_root->cgroup_idr with a spinlock · 6fa4918d
      Tejun Heo authored
      Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which
      ends up requiring cgroup_put() to be invoked under sleepable context.
      This is okay for now but is an unusual requirement and we'll soon add
      css->id which will have the same problem but won't be able to simply
      grab cgroup_mutex as removal will have to happen from css_release()
      which can't sleep.
      
      Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers
      which protect the idr operations with the lock and use them for
      cgroup_root->cgroup_idr.  cgroup_put() no longer needs to grab
      cgroup_mutex and css_from_id() is updated to always require RCU read
      lock instead of either RCU read lock or cgroup_mutex, which doesn't
      affect the existing users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup, memcg: allocate cgroup ID from 1 · 7d699ddb
      Tejun Heo authored
      Currently, cgroup->id is allocated from 0, which is always assigned to
      the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
      invalid IDs and ends up incrementing all IDs by one.
      
      It's reasonable to reserve 0 for special purposes.  This patch updates
      cgroup core so that ID 0 is not used and the root cgroups get ID 1.
      The ID incrementing is removed from memcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: make flags and subsys_masks unsigned int · 69dfa00c
      Tejun Heo authored
      There's no reason to use atomic bitops for cgroup_subsys_state->flags,
      cgroup_root->flags and various subsys_masks.  This patch updates those
      to use bitwise and/or operations instead and converts them from
      unsigned long to unsigned int.
      
      This makes the fields occupy (marginally) smaller space and makes it
      clear that they don't require atomicity.
      
      This patch doesn't cause any behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  7. 25 Apr, 2014 10 commits
    • cgroup: Use more current logging style · ed3d261b
      Joe Perches authored
      Use pr_fmt and remove embedded prefixes.
      Realign modified multi-line statements to open parenthesis.
      Convert embedded function name to "%s: ", __func__
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: replace pr_warning with preferred pr_warn · a2a1f9ea
      Jianyu Zhan authored
      As suggested by scripts/checkpatch.pl, substitute all pr_warning()
      with pr_warn().
      
      No functional change.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: remove orphaned cgroup_pidlist_seq_operations · f8719ccf
      Jianyu Zhan authored
      6612f05b ("cgroup: unify pidlist and other file handling")
      removed the only user of cgroup_pidlist_seq_operations,
      cgroup_pidlist_open().
      
      This patch removes it.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: clean up obsolete comment for parse_cgroupfs_options() · 2f0edc04
      Jianyu Zhan authored
      1d5be6b2 ("cgroup: move module ref handling into
      rebind_subsystems()") made parse_cgroupfs_options() no longer take
      refcounts on subsystems.
      
      And the unified hierarchy means parse_cgroupfs_options() no longer
      needs to be called with cgroup_mutex held to protect cgroup_subsys[].
      
      So this patch removes BUG_ON() and the comment.  As the comment
      doesn't contain useful information afterwards, the whole comment is
      removed.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: add documentation about unified hierarchy · 65731578
      Tejun Heo authored
      Unified hierarchy will be the new version of cgroup interface.  This
      patch adds Documentation/cgroups/unified-hierarchy.txt which describes
      the design and rationales of unified hierarchy.
      
      v2: Grammatical updates as per Randy Dunlap's review.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
    • cgroup: implement cgroup.populated for the default hierarchy · 842b597e
      Tejun Heo authored
      cgroup users often need a way to determine when a cgroup's
      subhierarchy becomes empty so that it can be cleaned up.  cgroup
      currently provides release_agent for it; unfortunately, this mechanism
      is riddled with issues.
      
      * It delivers events by forking and execing a userland binary
        specified as the release_agent.  This is a long deprecated method of
        notification delivery.  It's extremely heavy, slow and cumbersome to
        integrate with larger infrastructure.
      
      * There is a single monitoring point at the root.  There's no way to
        delegate management of a subtree.
      
      * The event isn't recursive.  It triggers when a cgroup doesn't have
        any tasks or child cgroups.  Events for internal nodes trigger only
        after all children are removed.  This again makes it impossible to
        delegate management of a subtree.
      
      * Events are filtered from the kernel side.  "notify_on_release" file
        is used to subscribe to or suppress release event.  This is
        unnecessarily complicated and probably done this way because event
        delivery itself was expensive.
      
      This patch implements interface file "cgroup.populated" which can be
      used to monitor whether the cgroup's subhierarchy has tasks in it or
      not.  Its value is 0 if there is no task in the cgroup and its
      descendants; otherwise, 1, and kernfs_notify() notification is
      triggered when the value changes, which can be monitored through poll
      and [di]notify.
      
      This is a lot lighter and simpler and trivially allows delegating
      management of subhierarchy - subhierarchy monitoring can block further
      propagation simply by putting itself or another process in the root of
      the subhierarchy and monitoring events that it's interested in from there
      without interfering with monitoring higher in the tree.
      
      v2: Patch description updated as per Serge.
      
      v3: "cgroup.subtree_populated" renamed to "cgroup.populated".  The
          subtree_ prefix was a bit confusing because
          "cgroup.subtree_control" uses it to denote the tree rooted at the
          cgroup sans the cgroup itself while the populated state includes
          the cgroup itself.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Lennart Poettering <lennart@poettering.net>
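The recursive populated state and its change notifications can be modelled with a per-node counter that is propagated towards the root only while the 0/1 value actually flips. This is a Python sketch under stated assumptions: the class, `populated_cnt`, `_update` and the `events` list (which stands in for notification delivery) are illustrative, and the commit message does not confirm that the kernel uses this exact bookkeeping.

```python
class Cgroup:
    """Toy cgroup node tracking whether its subhierarchy has tasks."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.nr_tasks = 0
        self.populated_cnt = 0  # self (if it has tasks) + populated children
        self.events = []        # recorded value changes (stands in for notify)

    @property
    def populated(self):
        return 1 if self.populated_cnt else 0

    def _update(self, became_populated):
        node = self
        while node is not None:
            before = node.populated
            node.populated_cnt += 1 if became_populated else -1
            if node.populated == before:
                break  # value unchanged here, so ancestors are unaffected
            node.events.append(node.populated)
            node = node.parent

    def add_task(self):
        self.nr_tasks += 1
        if self.nr_tasks == 1:
            self._update(True)

    def remove_task(self):
        assert self.nr_tasks > 0
        self.nr_tasks -= 1
        if self.nr_tasks == 0:
            self._update(False)
```

In a chain root - A - B, adding a task to B flips all three nodes to 1 and records one event on each; later emptying B while A still has a task records an event on B only.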
    • Merge branch 'driver-core-next' of... · 50bce01b
      Tejun Heo authored
      Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core into for-3.16
      
      Pull in driver-core-next to receive kernfs_notify() updates which will
      be used by the planned "cgroup.populated" implementation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • kobject: Make support for uevent_helper optional. · 86d56134
      Michael Marineau authored
      Support for uevent_helper, aka hotplug, is not required on many systems
      these days but it can still be enabled via sysfs or sysctl.
      Reported-by: Darren Shepherd <darren.s.shepherd@gmail.com>
      Signed-off-by: Michael Marineau <mike@marineau.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: make kernfs_notify() trigger inotify events too · d911d987
      Tejun Heo authored
      kernfs_notify() is used to indicate either new data is available or
      the content of a file has changed.  It currently only triggers poll,
      which may not be the most convenient mechanism, especially when there
      are a lot of files to monitor.  Let's hook it up to fsnotify too so that
      the events can be monitored via inotify as well.
      
      fsnotify_modify() requires file * but kernfs_notify() doesn't have any
      specific file associated; however, we can walk all super_blocks
      associated with a kernfs_root and as kernfs always associates one ino
      with an inode and one dentry with an inode, it's trivial to look up the
      dentry associated with a given kernfs_node.  As any active monitor
      would pin dentry, just looking up existing dentry is enough.  This
      patch looks up the dentry associated with the specified kernfs_node
      and generates events equivalent to fsnotify_modify().
      
      Note that as fsnotify doesn't provide fsnotify_modify() equivalent
      which can be called with dentry, kernfs_notify() directly calls
      fsnotify_parent() and fsnotify().  It might be better to add a wrapper
      in fsnotify.h instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: implement kernfs_root->supers list · 7d568a83
      Tejun Heo authored
      Currently, there's no way to find out which super_blocks are
      associated with a given kernfs_root.  Let's implement it - the planned
      inotify extension to kernfs_notify() needs it.
      
      Make kernfs_super_info point back to the super_block and chain it at
      kernfs_root->supers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 23 Apr, 2014 9 commits
    • cgroup: implement dynamic subtree controller enable/disable on the default hierarchy · f8f22e53
      Tejun Heo authored
      cgroup is switching away from multiple hierarchies and will use one
      unified default hierarchy where controllers can be dynamically enabled
      and disabled per subtree.  The default hierarchy will serve as the
      unified hierarchy to which all controllers are attached and a css on
      the default hierarchy would need to also serve the tasks of descendant
      cgroups which don't have the controller enabled - ie. the tree may be
      collapsed from leaf towards root when viewed from specific
      controllers.  This has been implemented through effective css in the
      previous patches.
      
      This patch finally implements dynamic subtree controller
      enable/disable on the default hierarchy via a new knob -
      "cgroup.subtree_control" which controls which controllers are enabled
      on the child cgroups.  Let's assume a hierarchy like the following.
      
        root - A - B - C
                     \ D
      
      root's "cgroup.subtree_control" determines which controllers are
      enabled on A.  A's on B.  B's on C and D.  This coincides with the
      fact that controllers on the immediate sub-level are used to
      distribute the resources of the parent.  In fact, it's natural to
      assume that resource control knobs of a child belong to its parent.
      Enabling a controller in "cgroup.subtree_control" declares that
      distribution of the respective resources of the cgroup will be
      controlled.  Note that this means that controller enable states are
      shared among siblings.
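The relationship just described, where a parent's "cgroup.subtree_control" decides what is enabled on its children, can be sketched with a toy tree. This is a Python model under stated assumptions: the `Cgroup` class and `enabled_controllers()` are hypothetical, the controller names are only examples, and treating the root as having every controller is a simplification.

```python
class Cgroup:
    """Toy node; subtree_control is the set of controllers enabled on children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.subtree_control = set()

    def enabled_controllers(self, all_controllers):
        # Simplifying assumption: the root has every controller. Any other
        # cgroup gets exactly what its parent's subtree_control enables.
        if self.parent is None:
            return set(all_controllers)
        return set(self.parent.subtree_control)


ALL = {"memory", "cpu"}
root = Cgroup("root")
a = Cgroup("A", root)
b = Cgroup("B", a)
c = Cgroup("C", b)
d = Cgroup("D", b)

root.subtree_control = {"memory", "cpu"}
a.subtree_control = {"memory"}
b.subtree_control = {"memory"}
```

Here C and D necessarily see the same set, because both read it from B, which is the sense in which enable states are shared among siblings.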
      
      The default hierarchy has an extra restriction - only cgroups which
      don't contain any task may have controllers enabled in
      "cgroup.subtree_control".  Combined with the other properties of the
      default hierarchy, this guarantees that, from the view point of
      controllers, tasks are only on the leaf cgroups.  In other words, only
      leaf csses may contain tasks.  This rules out situations where child
      cgroups compete against internal tasks of the parent, which is a
      competition between two different types of entities without any clear
      way to determine resource distribution between the two.  Different
      controllers handle it differently and all the implemented behaviors
      are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
      structural constraint imposed by cgroup core removes the burden
      from controller implementations and enables showing one consistent
      behavior across all controllers.
      
      When a controller is enabled or disabled, css associations for the
      controller in the subtrees of each child should be updated.  After
      enabling, the whole subtree of a child should point to the new css of
      the child.  After disabling, the whole subtree of a child should point
      to the cgroup's css.  This is implemented by first updating cgroup
      states such that cgroup_e_css() result points to the appropriate css
      and then invoking cgroup_update_dfl_csses() which migrates all tasks
      in the affected subtrees to the self cgroup on the default hierarchy.
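
      The effect of enabling a controller on effective css lookup can be
      sketched with a small Python toy model.  This only mimics the idea
      behind cgroup_e_css() as described above; the e_css() helper and the
      dictionaries are hypothetical and each css is represented simply by
      the name of the cgroup that owns it.

```python
# Toy model of the example hierarchy: root - A - B - {C, D}.
parent = {"A": "root", "B": "A", "C": "B", "D": "B"}

# "memory" is enabled on A and B, but B has not enabled it for its children.
subtree_control = {"root": {"memory"}, "A": {"memory"}, "B": set()}


def e_css(cgrp, ctrl):
    """Name of the nearest self-or-ancestor cgroup owning a css for ctrl."""
    # A cgroup owns a css for ctrl iff its parent's subtree_control lists ctrl.
    # The root is assumed to always own one.
    while cgrp != "root" and ctrl not in subtree_control.get(parent[cgrp], set()):
        cgrp = parent[cgrp]
    return cgrp


before = {c: e_css(c, "memory") for c in ("C", "D")}

# Equivalent of writing "+memory" to B's "cgroup.subtree_control".
subtree_control["B"].add("memory")

after = {c: e_css(c, "memory") for c in ("C", "D")}
```

      Before the write both C and D resolve to B's css; afterwards each
      resolves to its own, which is why the tasks in the affected subtrees
      need to be re-migrated.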
      
      * When read, "cgroup.subtree_control" lists all the currently enabled
        controllers on the children of the cgroup.
      
      * White-space separated list of controller names prefixed with either
        '+' or '-' can be written to "cgroup.subtree_control".  The
        controllers prefixed with '+' are enabled and the ones prefixed
        with '-' are disabled.
      
      * A controller can be enabled iff the parent's
        "cgroup.subtree_control" enables it and disabled iff no child's
        "cgroup.subtree_control" has it enabled.
      
      * If a cgroup has tasks, no controller can be enabled via
        "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
        some controllers enabled, tasks can't be migrated into the cgroup.
      
      * All controllers which aren't bound on other hierarchies are
        automatically associated with the root cgroup of the default
        hierarchy.  All the controllers which are bound to the default
        hierarchy are listed in the read-only file "cgroup.controllers" in
        the root directory.
      
      * "cgroup.controllers" in all non-root cgroups is a read-only file whose
        content is equal to that of "cgroup.subtree_control" of the parent.
        This indicates which controllers can be used in the cgroup's
        "cgroup.subtree_control".
      
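      The write rules listed above can be sketched as a Python toy model.
      This is not the kernel implementation: parse_subtree_control() and
      check_write() are hypothetical helpers, the reason strings are made
      up, and both the order of the checks and the choice to let a later
      token override an earlier one for the same controller are
      assumptions of the sketch.

```python
def parse_subtree_control(buf):
    """Split a write into (enable, disable) sets of controller names."""
    enable, disable = set(), set()
    for tok in buf.split():
        if len(tok) < 2 or tok[0] not in ("+", "-"):
            raise ValueError("invalid token: %r" % tok)
        name = tok[1:]
        if tok[0] == "+":
            enable.add(name)
            disable.discard(name)
        else:
            disable.add(name)
            enable.discard(name)
    return enable, disable


def check_write(available, child_controls, has_tasks, enable, disable):
    """Return a reason string if the write must be rejected, else None.

    available      -- controllers the parent's subtree_control enables
    child_controls -- list of the children's subtree_control sets
    has_tasks      -- whether the cgroup itself contains tasks
    """
    if enable and has_tasks:
        return "cgroup has tasks"
    if not enable <= available:
        return "controller not enabled by parent"
    if any(disable & sc for sc in child_controls):
        return "controller still enabled in a child"
    return None
```

      For example, parse_subtree_control("+memory -blkio") yields
      ({"memory"}, {"blkio"}), and check_write() rejects the enable part
      if the cgroup still contains tasks.
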
      This is still experimental and there are some holes, one of which is
      that ->can_attach() failure during cgroup_update_dfl_csses() may leave
      the cgroups in an undefined state.  The issues will be addressed by
      future patches.
      
      v2: Non-root cgroups now also have "cgroup.controllers".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      f8f22e53
    • cgroup: prepare migration path for unified hierarchy · f817de98
      Tejun Heo authored
      Unified hierarchy implementation would require re-migrating tasks onto
      the same cgroup on the default hierarchy to reflect updated effective
      csses.  Update cgroup_migrate_prepare_dst() so that it accepts NULL as
      the destination cgrp.  When NULL is specified, the destination is
      considered to be the cgroup on the default hierarchy associated with
      each css_set.
      
      After this change, the identity check in cgroup_migrate_add_src()
      isn't sufficient for noop detection as the associated csses may change
      without any cgroup association changing.  The only way to tell whether
      a migration is noop or not is testing whether the source and
      destination csets are identical.  The noop check in
      cgroup_migrate_add_src() is removed and cset identity test is added to
      cgroup_migrate_prepare_dst().  If it's detected that source and
      destination csets are identical, the cset is removed from
      @preloaded_csets and all the migration nodes are cleared which makes
      cgroup_migrate() ignore the cset.
      
      Also, make the function append the destination css_sets to
      @preloaded_list so that destination css_sets always come after source
      css_sets.
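
      The noop detection and list ordering described above can be sketched
      as a Python toy model.  It is only an illustration under simplifying
      assumptions: css_sets are represented by plain strings, prepare_dst()
      and dst_of are hypothetical names, and the real function operates on
      linked lists rather than returning a new list.

```python
def prepare_dst(preloaded, dst_of):
    """Drop noop sources and append destination csets after the sources.

    preloaded -- list of source csets (strings in this toy model)
    dst_of    -- callable mapping a source cset to its destination cset
    """
    pairs = [(src, dst_of(src)) for src in preloaded]

    # A source whose destination is the very same cset needs no migration.
    kept = [src for src, dst in pairs if src != dst]

    # Destinations are appended once each, always after all the sources.
    dsts = []
    for src, dst in pairs:
        if src != dst and dst not in kept and dst not in dsts:
            dsts.append(dst)
    return kept + dsts


mapping = {"cs1": "cs9", "cs2": "cs2", "cs3": "cs9"}
result = prepare_dst(["cs1", "cs2", "cs3"], mapping.get)
```

      Here "cs2" maps to itself and is dropped, while "cs9" is appended a
      single time after the remaining sources.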
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      f817de98
    • cgroup: update subsystem rebind restrictions · 7fd8c565
      Tejun Heo authored
      Because the default root couldn't have any non-root csses attached to
      it, rebinding away from it was always allowed; however, the default
      hierarchy will soon host the unified hierarchy and have non-root csses
      so the rebind restrictions need to be updated accordingly.
      
      Instead of special casing rebinding from the default hierarchy and
      then checking whether the source hierarchy has children cgroups, which
      implies non-root csses for !dfl hierarchies, simply check whether the
      source hierarchy has non-root csses for the subsystem using
      css_next_child().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      7fd8c565
    • cgroup: add css_set->dfl_cgrp · 6803c006
      Tejun Heo authored
      To implement the unified hierarchy behavior, we'll need to be able to
      determine the associated cgroup on the default hierarchy from css_set.
      Let's add css_set->dfl_cgrp so that it can be accessed conveniently
      and efficiently.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      6803c006
    • cgroup: allow cgroup creation and suppress automatic css creation in the unified hierarchy · bd53d617
      Tejun Heo authored
      Now that effective css handling has been added and iterators updated
      accordingly, it's safe to allow cgroup creation in the default
      hierarchy.  Unblock cgroup creation in the default hierarchy.
      
      As the default hierarchy will implement explicit enabling and
      disabling of controllers on each cgroup, suppress automatic css
      enabling on cgroup creation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      bd53d617
    • cgroup: cgroup->subsys[] should be cleared after the css is offlined · e3297803
      Tejun Heo authored
      After a css finishes offlining, offline_css() mistakenly performs
      RCU_INIT_POINTER(css->cgroup->subsys[ss->id], css) which just sets the
      cgroup->subsys[] pointer to the current value.  The intention was to
      clear it after offline is complete, not reassign the same value.
      
      Update it to assign NULL instead of the current value.  This makes
      cgroup_css() return NULL once offline is complete.  All the
      existing users of the function either can handle NULL return already
      or guarantee that the css doesn't get offlined.
      
      While this is a bugfix, as css lifetime is currently tied to the
      cgroup it belongs to, this bug doesn't cause any actual problems.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      e3297803
    • cgroup: teach css_task_iter about effective csses · 3ebb2b6e
      Tejun Heo authored
      Currently, css_task_iter iterates tasks associated with a css by
      visiting each css_set associated with the owning cgroup and walking
      tasks of each of them.  This works fine for !unified hierarchies as
      each cgroup has its own css for each associated subsystem on the
      hierarchy; however, on the planned unified hierarchy, a cgroup may not
      have csses associated and its tasks would be considered associated
      with the matching css of the nearest ancestor which has the subsystem
      enabled.
      
      This means that on the default unified hierarchy, just walking all
      tasks associated with a cgroup isn't enough to walk all tasks which
      are associated with the specified css.  If any of its children doesn't
      have the matching css enabled, task iteration should also include all
      tasks from the subtree.  We already added cgroup->e_csets[] to list
      all css_sets effectively associated with a given css; walking the
      css_sets on that list instead achieves such iteration.
      
      This patch updates css_task_iter iteration such that it walks css_sets
      on cgroup->e_csets[] instead of cgroup->cset_links if iteration is
      requested on a non-dummy css.  Thanks to the previous iteration
      update, this change can be achieved with the addition of
      css_task_iter->ss and minimal updates to css_advance_task_iter() and
      css_task_iter_start().
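
      Which tasks count as associated with a css on the unified hierarchy
      can be sketched with a Python toy model.  The kernel walks the
      precomputed cgroup->e_csets[] list; this sketch instead recomputes
      the same set by descending into children that lack their own css,
      purely for illustration.  All names and task ids are hypothetical.

```python
# Toy hierarchy for a single controller: A - {B1 - {C, D}, B2}.
children = {"A": ["B1", "B2"], "B1": ["C", "D"], "B2": [], "C": [], "D": []}

# A enabled the controller for its children; B1 did not, so C and D
# have no css of their own and fall back to B1's.
has_css = {"A": True, "B1": True, "B2": True, "C": False, "D": False}

# Hypothetical task ids directly attached to each cgroup.
tasks = {"A": [], "B1": [10], "B2": [20], "C": [30, 31], "D": [40]}


def effective_tasks(cgrp):
    """Tasks effectively associated with the css owned by cgrp."""
    out = list(tasks[cgrp])
    for child in children[cgrp]:
        # A child without its own css contributes its whole subtree.
        if not has_css[child]:
            out.extend(effective_tasks(child))
    return out
```

      In this model, iterating the css of B1 also visits the tasks of C
      and D, whereas iterating the css of A visits none because both of
      its children own a css.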
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3ebb2b6e
    • cgroup: reorganize css_task_iter · 0f0a2b4f
      Tejun Heo authored
      This patch reorganizes css_task_iter so that adding effective css
      support is easier.
      
      * s/->cset_link/->cset_pos/ and s/->task/->task_pos/ for consistency
      
      * ->origin_css is used to determine whether the iteration reached the
        last css_set.  Replace it with explicit ->cset_head so that
        css_advance_task_iter() doesn't have to know the termination
        condition directly.
      
      * css_task_iter_next() currently assumes that it's walking a list of
        cgrp_cset_link and reaches into the current cset through the current
        link to determine the termination conditions for task walking.  As
        this won't always be true for effective css walking, add
        ->tasks_head and ->mg_tasks_head and use them to control task
        walking so that css_task_iter_next() doesn't have to know how
        css_sets are being walked.
      
      This patch doesn't make any behavior changes.  The iteration logic
      stays unchanged after the patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      0f0a2b4f
    • cgroup: make css_next_child() skip missing csses · 3b281afb
      Tejun Heo authored
      css_next_child() walks the children of the specified css.  It does
      this by finding the next cgroup and then returning the requested css.
      On the default unified hierarchy, a cgroup may not have a css
      associated with it even if the hierarchy has the subsystem enabled.
      This patch updates css_next_child() so that it skips children without
      the requested css associated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3b281afb