1. 13 May, 2014 22 commits
    • Tejun Heo's avatar
      cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock() · a9746d8d
      Tejun Heo authored
      cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write()
      share the logic to break active protection so that they can grab
      cgroup_tree_mutex which nests above active protection and/or remove
      self.  Factor out this logic into cgroup_kn_lock_live() and
      cgroup_kn_unlock().
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      a9746d8d
    • Tejun Heo's avatar
      cgroup: move cgroup->kn->priv clearing to cgroup_rmdir() · cfc79d5b
      Tejun Heo authored
      The ->priv field of a cgroup directory kernfs_node points back to the
      cgroup.  This field is RCU cleared in cgroup_destroy_locked() for
      non-kernfs accesses from css_tryget_from_dir() and
      cgroupstats_build().
      
      As these are only applicable to cgroups which finished creation
      successfully and fully initialized cgroups are always removed by
      cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir().
      
      This will help simplifying cgroup locking and shouldn't introduce any
      behavior difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      cfc79d5b
    • Tejun Heo's avatar
      cgroup: grab cgroup_mutex earlier in cgroup_subtree_control_write() · ddab2b6e
      Tejun Heo authored
      Move cgroup_lock_live_group() invocation upwards to right below
      cgroup_tree_mutex in cgroup_subtree_control_write().  This is to help
      the planned locking simplification.
      
      This doesn't make any userland-visible behavioral changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      ddab2b6e
    • Tejun Heo's avatar
      cgroup: collapse cgroup_create() into croup_mkdir() · b3bfd983
      Tejun Heo authored
      cgroup_mkdir() is the sole user of cgroup_create().  Let's collapse
      the latter into the former.  This will help simplifying locking.
      While at it, remove now stale comment about inode locking.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      b3bfd983
    • Tejun Heo's avatar
      cgroup: reorganize cgroup_create() · ba0f4d76
      Tejun Heo authored
      Reorganize cgroup_create() so that all paths share unlock out path.
      
      * All err_* labels are renamed to out_* as they're now shared by both
        success and failure paths.
      
      * @err renamed to @ret for the similar reason as above and so that
        it's more consistent with other functions.
      
      * cgroup memory allocation moved after locking so that freeing failed
        cgroup happens before unlocking.  While this moves more code inside
        critical section, memory allocations inside cgroup locking are
        already pretty common and this is unlikely to make any noticeable
        difference.
      
      * While at it, replace a stray @parent->root dereference with @root.
      
      This reorganization will help simplifying locking.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      ba0f4d76
    • Tejun Heo's avatar
      cgroup: remove cgroup->control_kn · b7fc5ad2
      Tejun Heo authored
      Now that cgroup_subtree_control_write() has access to the associated
      kernfs_open_file and thus the kernfs_node, there's no need to cache it
      in cgroup->control_kn on creation.  Remove cgroup->control_kn and use
      @of->kn directly.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      b7fc5ad2
    • Tejun Heo's avatar
      cgroup: convert "tasks" and "cgroup.procs" handle to use cftype->write() · acbef755
      Tejun Heo authored
      cgroup_tasks_write() and cgroup_procs_write() are currently using
      cftype->write_u64().  This patch converts them to use cftype->write()
      instead.  This allows access to the associated kernfs_open_file which
      will be necessary to implement the planned kernfs active protection
      manipulation for these files.
      
      This shifts buffer parsing to attach_task_by_pid() and makes it return
      @nbytes on success.  Let's rename it to __cgroup_procs_write() to
      clearly indicate that this is a write handler implementation.
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      acbef755
    • Tejun Heo's avatar
      cgroup: replace cftype->trigger() with cftype->write() · 6770c64e
      Tejun Heo authored
      cftype->trigger() is pointless.  It's trivial to ignore the input
      buffer from a regular ->write() operation.  Convert all ->trigger()
      users to ->write() and remove ->trigger().
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      6770c64e
    • Tejun Heo's avatar
      cgroup: replace cftype->write_string() with cftype->write() · 451af504
      Tejun Heo authored
      Convert all cftype->write_string() users to the new cftype->write()
      which maps directly to kernfs write operation and has full access to
      kernfs and cgroup contexts.  The conversions are mostly mechanical.
      
      * @css and @cft are accessed using of_css() and of_cft() accessors
        respectively instead of being specified as arguments.
      
      * Should return @nbytes on success instead of 0.
      
      * @buf is not trimmed automatically.  Trim if necessary.  Note that
        blkcg and netprio don't need this as the parsers already handle
        whitespaces.
      
      cftype->write_string() has no user left after the conversions and
      removed.
      
      While at it, remove unnecessary local variable @p in
      cgroup_subtree_control_write() and stale comment about
      CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
      
      This patch doesn't introduce any visible behavior changes.
      
      v2: netprio was missing from conversion.  Converted.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarAristeu Rozanski <arozansk@redhat.com>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      451af504
    • Tejun Heo's avatar
      cgroup: implement cftype->write() · b4168640
      Tejun Heo authored
      During the recent conversion to kernfs, cftype's seq_file operations
      are updated so that they are directly mapped to kernfs operations and
      thus can fully access the associated kernfs and cgroup contexts;
      however, write path hasn't seen similar updates and none of the
      existing write operations has access to, for example, the associated
      kernfs_open_file.
      
      Let's introduce a new operation cftype->write() which maps directly to
      the kernfs write operation and has access to all the arguments and
      contexts.  This will replace ->write_string() and ->trigger() and ease
      manipulation of kernfs active protection from cgroup file operations.
      
      Two accessors - of_cft() and of_css() - are introduced to enable
      accessing the associated cgroup context from cftype->write() which
      only takes kernfs_open_file for the context information.  The
      accessors for seq_file operations - seq_cft() and seq_css() - are
      rewritten to wrap the of_ accessors.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      b4168640
    • Tejun Heo's avatar
      cgroup: rename css_tryget*() to css_tryget_online*() · ec903c0c
      Tejun Heo authored
      Unlike the more usual refcnting, what css_tryget() provides is the
      distinction between online and offline csses instead of protection
      against upping a refcnt which already reached zero.  cgroup is
      planning to provide actual tryget which fails if the refcnt already
      reached zero.  Let's rename the existing trygets so that they clearly
      indicate that they're onliness.
      
      I thought about keeping the existing names as-are and introducing new
      names for the planned actual tryget; however, given that each
      controller participates in the synchronization of the online state, it
      seems worthwhile to make it explicit that these functions are about
      on/offline state.
      
      Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
      to css_tryget_online_from_dir().  This is pure rename.
      
      v2: cgroup_freezer grew new usages of css_tryget().  Update
          accordingly.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      ec903c0c
    • Tejun Heo's avatar
      cgroup: use release_agent_path_lock in cgroup_release_agent_show() · 46cfeb04
      Tejun Heo authored
      release_path is now protected by release_agent_path_lock to allow
      accessing it without grabbing cgroup_mutex; however,
      cgroup_release_agent_show() was still grabbing cgroup_mutex.  Let's
      convert it to release_agent_path_lock so that we don't have to worry
      about this one for the planned locking updates.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      46cfeb04
    • Tejun Heo's avatar
      cgroup: use restart_syscall() for retries after offline waits in cgroup_subtree_control_write() · 7d331fa9
      Tejun Heo authored
      After waiting for a child to finish offline,
      cgroup_subtree_control_write() jumps up to retry from after the input
      parsing and active protection breaking.  This retry makes the
      scheduled locking update - removal of cgroup_tree_mutex - more
      difficult.  Let's simplify it by returning with restart_syscall() for
      retries.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      7d331fa9
    • Tejun Heo's avatar
      cgroup: update and fix parsing of "cgroup.subtree_control" · d37167ab
      Tejun Heo authored
      I was confused that strsep() was equivalent to strtok_r() in skipping
      over consecutive delimiters.  strsep() just splits at the first
      occurrence of one of the delimiters which makes the parsing very
      inflexible, which makes allowing multiple whitespace chars as
      delimters kinda moot.  Let's just be consistently strict and require
      list of tokens separated by spaces.  This is what
      Documentation/cgroups/unified-hierarchy.txt describes too.
      
      Also, parsing may access beyond the end of the string if the string
      ends with spaces or is zero-length.  Make sure it skips zero-length
      tokens.  Note that this also ensures that the parser doesn't puke on
      multiple consecutive spaces.
      
      v2: Add zero-length token skipping.
      
      v3: Added missing space after "==".  Spotted by Li.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      d37167ab
    • Tejun Heo's avatar
      cgroup: css_release() shouldn't clear cgroup->subsys[] · 0ab7a60d
      Tejun Heo authored
      c1a71504 ("cgroup: don't recycle cgroup id until all csses' have
      been destroyed") made cgroup ID persist until a cgroup is released and
      add cgroup->subsys[] clearing to css_release() so that css_from_id()
      doesn't return a css which has already been released which happens
      before cgroup release; however, the right change here was updating
      offline_css() to clear cgroup->subsys[] which was done by e3297803
      ("cgroup: cgroup->subsys[] should be cleared after the css is
      offlined") instead of clearing it from css_release().
      
      We're now clearing cgroup->subsys[] twice.  This is okay for
      traditional hierarchies as a css's lifetime is the same as its
      cgroup's; however, this confuses unified hierarchy and turning on and
      off a controller repeatedly using "cgroup.subtree_control" can lead to
      an oops like the following which happens because cgroup->subsys[] is
      incorrectly cleared asynchronously by css_release().
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000000 08
       IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0
       PGD 1170d067 PUD f0ab067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       Modules linked in:
       CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000
       RIP: 0010:[<ffffffff81130c11>]  [<ffffffff81130c11>] kill_css+0x21/0x1c0
       RSP: 0018:ffff88000e199dc8  EFLAGS: 00010202
       RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
       RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98
       RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000
       R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001
       R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00
       FS:  00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0
       Stack:
        ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80
        ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48
        ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002
       Call Trace:
        [<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00
        [<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0
        [<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170
        [<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0
        [<ffffffff811f35ad>] SyS_write+0x4d/0xc0
        [<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b
       Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff
       RIP  [<ffffffff81130c11>] kill_css+0x21/0x1c0
        RSP <ffff88000e199dc8>
       CR2: 0000000000000008
       ---[ end trace e7aae1f877c4e1b4 ]---
      
      Remove the unnecessary cgroup->subsys[] clearing from css_release().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      0ab7a60d
    • Tejun Heo's avatar
      cgroup: cgroup_idr_lock should be bh · 54504e97
      Tejun Heo authored
      cgroup_idr_remove() can be invoked from bh leading to lockdep
      detecting possible AA deadlock (IN_BH/ON_BH).  Make the lock bh-safe.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      54504e97
    • Tejun Heo's avatar
      cgroup: fix offlining child waiting in cgroup_subtree_control_write() · 0cee8b77
      Tejun Heo authored
      cgroup_subtree_control_write() waits for offline to complete
      child-by-child before enabling a controller; however, it has a couple
      bugs.
      
      * It doesn't initialize the wait_queue_t.  This can lead to infinite
        hang on the following schedule() among other things.
      
      * It forgets to pin the child before releasing cgroup_tree_mutex and
        performing schedule().  The child may already be gone by the time it
        wakes up and invokes finish_wait().  Pin the child being waited on.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      0cee8b77
    • Tejun Heo's avatar
      Merge branch 'for-3.15-fixes' of... · f21a4f75
      Tejun Heo authored
      Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.16
      
      Pull to receive e37a06f1 ("cgroup: fix the retry path of
      cgroup_mount()") to avoid unnecessary conflicts with planned
      cgroup_tree_mutex removal and also to be able to remove the temp fix
      added by 36c38fb7 ("blkcg: use trylock on blkcg_pol_mutex in
      blkcg_reset_stats()") afterwards.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      f21a4f75
    • Tejun Heo's avatar
      cgroup: fix rcu_read_lock() leak in update_if_frozen() · 36e9d2eb
      Tejun Heo authored
      While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
      replace freezer->lock with freezer_mutex") introduced a bug in
      update_if_frozen() where it returns with rcu_read_lock() held.  Fix it
      by adding rcu_read_unlock() before returning.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      36e9d2eb
    • Tejun Heo's avatar
      Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.16 · d39ea871
      Tejun Heo authored
      Pull to receive percpu_ref_tryget[_live]() changes.  Planned cgroup
      changes will make use of them.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      d39ea871
    • Tejun Heo's avatar
      cgroup_freezer: replace freezer->lock with freezer_mutex · e5ced8eb
      Tejun Heo authored
      After 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it
      to css_set_rwsem"), css task iterators requires sleepable context as
      it may block on css_set_rwsem.  I missed that cgroup_freezer was
      iterating tasks under IRQ-safe spinlock freezer->lock.  This leads to
      errors like the following on freezer state reads and transitions.
      
        BUG: sleeping function called from invalid context at /work
       /os/work/kernel/locking/rwsem.c:20
        in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
        5 locks held by bash/462:
         #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
         #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
         #2:  (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
         #3:  (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
         #4:  (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
        Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460
      
        CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
         ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
         ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
         0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
        Call Trace:
         [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
         [<ffffffff810cf4f2>] __might_sleep+0x162/0x260
         [<ffffffff81d05974>] down_read+0x24/0x60
         [<ffffffff81133e87>] css_task_iter_start+0x27/0x70
         [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
         [<ffffffff81135a16>] freezer_write+0xf6/0x1e0
         [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
         [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f0756>] vfs_write+0xb6/0x1c0
         [<ffffffff811f121d>] SyS_write+0x4d/0xc0
         [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b
      
      freezer->lock used to be used in hot paths but that time is long gone
      and there's no reason for the lock to be IRQ-safe spinlock or even
      per-cgroup.  In fact, given the fact that a cgroup may contain large
      number of tasks, it's not a good idea to iterate over them while
      holding IRQ-safe spinlock.
      
      Let's simplify locking by replacing per-cgroup freezer->lock with
      global freezer_mutex.  This also makes the comments explaining the
      intricacies of policy inheritance and the locking around it as the
      states are protected by a common mutex.
      
      The conversion is mostly straight-forward.  The followings are worth
      mentioning.
      
      * freezer_css_online() no longer needs double locking.
      
      * freezer_attach() now performs propagation simply while holding
        freezer_mutex.  update_if_frozen() race no longer exists and the
        comment is removed.
      
      * freezer_fork() now tests whether the task is in root cgroup using
        the new task_css_is_root() without doing rcu_read_lock/unlock().  If
        not, it grabs freezer_mutex and performs the operation.
      
      * freezer_read() and freezer_change_state() grab freezer_mutex across
        the whole operation and pin the css while iterating so that each
        descendant processing happens in sleepable context.
      
      Fixes: 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      e5ced8eb
    • Tejun Heo's avatar
      cgroup: introduce task_css_is_root() · 5024ae29
      Tejun Heo authored
      Determining the css of a task usually requires RCU read lock as that's
      the only thing which keeps the returned css accessible till its
      reference is acquired; however, testing whether a task belongs to the
      root can be performed without dereferencing the returned css by
      comparing the returned pointer against the root one in init_css_set[]
      which never changes.
      
      Implement task_css_is_root() which can be invoked in any context.
      This will be used by the scheduled cgroup_freezer change.
      
      v2: cgroup no longer supports modular controllers.  No need to export
          init_css_set.  Pointed out by Li.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      5024ae29
  2. 09 May, 2014 2 commits
  3. 07 May, 2014 1 commit
  4. 06 May, 2014 2 commits
  5. 05 May, 2014 3 commits
    • Fabian Frederick's avatar
      kernel/cgroup.c: fix 2 kernel-doc warnings · 60106946
      Fabian Frederick authored
      Fix typo and variable name.
      
      tj: Updated @cgrp argument description in cgroup_destroy_css_killed()
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      60106946
    • Tejun Heo's avatar
      blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() · 36c38fb7
      Tejun Heo authored
      During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
      which nests above both the kernfs s_active protection and cgroup_mutex
      is added to synchronize cgroup file type operations as cgroup_mutex
      needed to be grabbed from some file operations and thus can't be put
      above s_active protection.
      
      While this arrangement mostly worked for cgroup, this triggered the
      following lockdep warning.
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G        W
        -------------------------------------------------------
        trinity-c173/9024 is trying to acquire lock:
        (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
      
        but task is already holding lock:
        (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (s_active#89){++++.+}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
        kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
        cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
        cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
        rebind_subsystems (kernel/cgroup.c:1144)
        cgroup_setup_root (kernel/cgroup.c:1568)
        cgroup_mount (kernel/cgroup.c:1716)
        mount_fs (fs/super.c:1094)
        vfs_kern_mount (fs/namespace.c:899)
        do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
        SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        -> #1 (cgroup_tree_mutex){+.+.+.}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
        blkcg_policy_register (block/blk-cgroup.c:1106)
        throtl_init (block/blk-throttle.c:1694)
        do_one_initcall (init/main.c:789)
        kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
        kernel_init (init/main.c:935)
        ret_from_fork (arch/x86/kernel/entry_64.S:552)
      
        -> #0 (blkcg_pol_mutex){+.+.+.}:
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        other info that might help us debug this:
      
        Chain exists of:
        blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(s_active#89);
      				 lock(cgroup_tree_mutex);
      				 lock(s_active#89);
          lock(blkcg_pol_mutex);
      
         *** DEADLOCK ***
      
        4 locks held by trinity-c173/9024:
        #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
        #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
        #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
        #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        stack backtrace:
        CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G        W     3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
         ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
         ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
         ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
        Call Trace:
        dump_stack (lib/dump_stack.c:52)
        print_circular_bug (kernel/locking/lockdep.c:1216)
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
      
      This is a highly unlikely but valid circular dependency between "echo
      1 > blkcg.reset_stats" and cfq module [un]loading.  cgroup is going
      through further locking update which will remove this complication but
      for now let's use trylock on blkcg_pol_mutex and retry the file
      operation if the trylock fails.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com
      36c38fb7
    • Aristeu Rozanski's avatar
      device_cgroup: check if exception removal is allowed · d2c2b11c
      Aristeu Rozanski authored
      [PATCH v3 1/2] device_cgroup: check if exception removal is allowed
      
      When the device cgroup hierarchy was introduced in
      	bd2953eb - devcg: propagate local changes down the hierarchy
      
      a specific case was overlooked. Consider the hierarchy bellow:
      
      	A	default policy: ALLOW, exceptions will deny access
      	 \
      	  B	default policy: ALLOW, exceptions will deny access
      
      There's no need to verify when an new exception is added to B because
      in this case exceptions will deny access to further devices, which is
      always fine. Hierarchy in device cgroup only makes sure B won't have
      more access than A.
      
      But when an exception is removed (by writing devices.allow), it isn't
      checked if the user is in fact removing an inherited exception from A,
      thus giving more access to B.
      
      Example:
      
      	# echo 'a' >A/devices.allow
      	# echo 'c 1:3 rw' >A/devices.deny
      	# echo $$ >A/B/tasks
      	# echo >/dev/null
      	-bash: /dev/null: Operation not permitted
      	# echo 'c 1:3 w' >A/B/devices.allow
      	# echo >/dev/null
      	#
      
      This shouldn't be allowed and this patch fixes it by making sure to never allow
      exceptions in this case to be removed if the exception is partially or fully
      present on the parent.
      
      v3: missing '*' in function description
      v2: improved log message and formatting fixes
      
      Cc: cgroups@vger.kernel.org
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAristeu Rozanski <arozansk@redhat.com>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      d2c2b11c
  6. 04 May, 2014 7 commits
    • Aristeu Rozanski's avatar
      device_cgroup: fix the comment format for recently added functions · f5f3cf6f
      Aristeu Rozanski authored
      Moving more extensive explanations to the end of the comment.
      
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarAristeu Rozanski <arozansk@redhat.com>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      f5f3cf6f
    • Tejun Heo's avatar
      cgroup, memcg: implement css->id and convert css_from_id() to use it · 15a4c835
      Tejun Heo authored
      Until now, cgroup->id has been used to identify all the associated
      csses and css_from_id() takes cgroup ID and returns the matching css
      by looking up the cgroup and then dereferencing the css associated
      with it; however, now that the lifetimes of cgroup and css are
      separate, this is incorrect and breaks on the unified hierarchy when a
      controller is disabled and enabled back again before the previous
      instance is released.
      
      This patch adds css->id which is a subsystem-unique ID and converts
      css_from_id() to look up by the new css->id instead.  memcg is the
      only user of css_from_id() and also converted to use css->id instead.
      
      For traditional hierarchies, this shouldn't make any functional
      difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      15a4c835
    • Tejun Heo's avatar
      cgroup: update init_css() into init_and_link_css() · ddfcadab
      Tejun Heo authored
      init_css() takes the cgroup the new css belongs to as an argument and
      initializes the new css's ->cgroup and ->parent pointers but doesn't
      acquire the matching reference counts.  After the previous patch,
      create_css() puts init_css() and reference acquisition right next to
      each other.  Let's move reference acquistion into init_css() and
      rename the function to init_and_link_css().  This makes sense and is
      easier to follow.  This makes the root csses to hold a reference on
      cgrp_dfl_root.cgrp, which is harmless.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      ddfcadab
    • Tejun Heo's avatar
      cgroup: use RCU free in create_css() failure path · a2bed820
      Tejun Heo authored
      Currently, when create_css() fails in the middle, the half-initialized
      css is freed by invoking cgroup_subsys->css_free() directly.  This
      patch updates the function so that it invokes RCU free path instead.
      As the RCU free path puts the parent css and owning cgroup, their
      references are now acquired right after a new css is successfully
      allocated.
      
      This doesn't make any visible difference now but is to enable
      implementing css->id and RCU protected lookup by such IDs.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      a2bed820
    • Tejun Heo's avatar
      cgroup: protect cgroup_root->cgroup_idr with a spinlock · 6fa4918d
      Tejun Heo authored
      Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which
      ends up requiring cgroup_put() to be invoked under sleepable context.
      This is okay for now but is an unusual requirement and we'll soon add
      css->id which will have the same problem but won't be able to simply
      grab cgroup_mutex as removal will have to happen from css_release()
      which can't sleep.
      
      Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers
      which protects the idr operations with the lock and use them for
      cgroup_root->cgroup_idr.  cgroup_put() no longer needs to grab
      cgroup_mutex and css_from_id() is updated to always require RCU read
      lock instead of either RCU read lock or cgroup_mutex, which doesn't
      affect the existing users.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      6fa4918d
    • Tejun Heo's avatar
      cgroup, memcg: allocate cgroup ID from 1 · 7d699ddb
      Tejun Heo authored
      Currently, cgroup->id is allocated from 0, which is always assigned to
      the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
      invalid IDs and ends up incrementing all IDs by one.
      
      It's reasonable to reserve 0 for special purposes.  This patch updates
      cgroup core so that ID 0 is not used and the root cgroups get ID 1.
      The ID incrementing is removed form memcg.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      7d699ddb
    • Tejun Heo's avatar
      cgroup: make flags and subsys_masks unsigned int · 69dfa00c
      Tejun Heo authored
      There's no reason to use atomic bitops for cgroup_subsys_state->flags,
      cgroup_root->flags and various subsys_masks.  This patch updates those
      to use bitwise and/or operations instead and converts them form
      unsigned long to unsigned int.
      
      This makes the fields occupy (marginally) smaller space and makes it
      clear that they don't require atomicity.
      
      This patch doesn't cause any behavior difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      69dfa00c
  7. 25 Apr, 2014 3 commits