    memcg: relax memcg iter caching · 5f578161
    Michal Hocko authored
    Now that the per-node-zone-priority iterator caches memory cgroups
    rather than their css ids, we have to be careful to remove them from
    the iterator when they are on their way out; otherwise they might live
    for an unbounded amount of time even though their group is already
    gone (until global/targeted reclaim walks the zone under that priority,
    finds out the group is dead and lets it find its final rest).
    
    We can fix this issue by relaxing the rules for the last_visited memcg.
    Instead of taking a reference to the css before it is stored into
    iter->last_visited, we can just store its pointer and track the number
    of removed groups in each memcg's subhierarchy.
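    
    In data-structure terms this amounts to roughly the following (a sketch
    based on the description here; the exact layout in memcontrol.c may
    differ slightly):
    
     struct mem_cgroup_reclaim_iter {
     	/* bare pointer, no css reference held while cached */
     	struct mem_cgroup *last_visited;
     	/* root's dead_count at the time last_visited was cached */
     	unsigned long last_dead_count;
     	unsigned int generation;
     };
    
     struct mem_cgroup {
     	...
     	/* number of groups removed from this memcg's subhierarchy */
     	atomic_t dead_count;
     	...
     };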
    
    This number is stored into the iterator every time a memcg is cached.
    If the iter count doesn't match the current walker root's count, we
    start from the root again.  The group counter is incremented up the
    hierarchy every time a group is removed.
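    
    The removal-side bookkeeping then looks roughly like this (a sketch of
    mem_cgroup_invalidate_reclaim_iterators, which is called from
    mem_cgroup_css_offline as shown further below; the real function also
    has to cover the root memcg, which parent_mem_cgroup() never returns):
    
     static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg)
     {
     	struct mem_cgroup *parent = memcg;
     
     	/*
     	 * Bump the counter in every ancestor so that iterators which
     	 * cached a group from this subhierarchy notice the removal on
     	 * their next walk and restart from their root.
     	 */
     	while ((parent = parent_mem_cgroup(parent)))
     		atomic_inc(&parent->dead_count);
     }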
    
    The iter_lock can be dropped because racing iterators cannot leak the
    reference anymore as the reference count is not elevated for
    last_visited when it is cached.
    
    Locking rules got a bit more complicated by this change, though.  The
    iterator primarily relies on the RCU read lock, which makes sure that
    once we see a valid last_visited pointer it stays valid for the whole
    RCU walk.  smp_rmb makes sure that dead_count is read before
    last_visited and last_dead_count, while smp_wmb makes sure that
    last_visited is updated before last_dead_count, so an up-to-date
    last_dead_count cannot point to an outdated last_visited.  css_tryget
    then makes sure that last_visited is still alive in case the iteration
    races with the removal of the cached group (the css is invalidated
    before mem_cgroup_css_offline increments dead_count).
    
    In short:
    mem_cgroup_iter
     rcu_read_lock()
     dead_count = atomic_read(parent->dead_count)
     smp_rmb()
     if (dead_count != iter->last_dead_count)
     	last_visited POSSIBLY INVALID -> last_visited = NULL
     if (!css_tryget(iter->last_visited))
     	last_visited DEAD -> last_visited = NULL
     next = find_next(last_visited)
     css_tryget(next)
     css_put(last_visited) 	// css would be invalidated and parent->dead_count
     			// incremented if this was the last reference
     iter->last_visited = next
     smp_wmb()
     iter->last_dead_count = dead_count
     rcu_read_unlock()
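    
    In C, the cache-validation and update steps of mem_cgroup_iter map onto
    roughly the following (a simplified sketch: zone_iter() stands in for
    looking up the per-zone-per-priority slot and find_next_memcg() for the
    css walk that takes a reference on the returned group; neither name is
    the real helper in memcontrol.c):
    
     struct mem_cgroup *memcg, *last_visited = NULL;
     struct mem_cgroup_reclaim_iter *iter;
     unsigned long dead_count;
     
     rcu_read_lock();
     iter = zone_iter(root, reclaim);
     
     /* read the root's dead_count before the cached fields */
     dead_count = atomic_read(&root->dead_count);
     smp_rmb();
     last_visited = iter->last_visited;
     if (last_visited &&
         (dead_count != iter->last_dead_count ||
          !css_tryget(&last_visited->css)))
     	last_visited = NULL;	/* stale or already dead, restart */
     
     memcg = find_next_memcg(root, last_visited);	/* css_tryget(next) inside */
     
     if (last_visited)
     	css_put(&last_visited->css);
     
     /* publish the new cache entry before the dead_count snapshot */
     iter->last_visited = memcg;
     smp_wmb();
     iter->last_dead_count = dead_count;
     rcu_read_unlock();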
    
    cgroup_rmdir
     cgroup_destroy_locked
      atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
       mem_cgroup_css_offline
        mem_cgroup_invalidate_reclaim_iterators
         while(parent = parent_mem_cgroup)
         	atomic_inc(parent->dead_count)
      css_put(css) // last reference held by cgroup core
    
    Spotted by Ying Han.
    
    Original idea from Johannes Weiner.
    
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko <mhocko@suse.cz>
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Ying Han <yinghan@google.com>
    Cc: Li Zefan <lizefan@huawei.com>
    Cc: Tejun Heo <htejun@gmail.com>
    Cc: Glauber Costa <glommer@parallels.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>