    memcg: prohibit unconditional exceeding the limit of dying tasks · a4ebf1b6
    Vasily Averin authored
    Memory cgroup charging allows killed or exiting tasks to exceed the hard
    limit.  It is assumed that the amount of memory charged by those tasks
    is bounded and that most of the memory will get released while the task
    is exiting.  This resembles the heuristic for the global OOM situation,
    where tasks get access to memory reserves.  There is no global memory
    shortage at the memcg level, so the memcg heuristic is more relaxed.
    
    The above assumption is overly optimistic though.  E.g.  vmalloc can
    scale to really large requests and the heuristic would allow that.  We
    used to have an early break in the vmalloc allocator for killed tasks
    but this has been reverted by commit b8c8a338 ("Revert "vmalloc:
    back off when the current task is killed"").  There are likely other
    similar code paths which do not check for fatal signals in an
    allocation-and-charge loop.  Also, some kernel objects charged to a
    memcg are not bound to a process lifetime.
    
    It has been observed that it is not really hard to trigger these
    bypasses and cause a global OOM situation.
    
    One potential way to address these runaways would be to limit the amount
    of excess (similar to the global OOM with limited oom reserves).  This
    is certainly possible but it is not really clear how much of an excess
    is desirable and still protects from global OOMs as that would have to
    consider the overall memcg configuration.
    
    This patch addresses the problem by removing the heuristic altogether.
    Bypass is only allowed for requests which either cannot fail or where
    the failure is not desirable, while the excess should still be limited
    (e.g.  atomic requests).  Implementation-wise, a killed or dying task
    fails to charge if it has passed the OOM killer stage.  That should
    give all forms of reclaim a chance to restore the limit before the
    failure (ENOMEM) and tell the caller to back off.
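
    The charging rule described above can be illustrated with a small
    userspace model.  This is only a sketch under stated assumptions, not
    the code in memcontrol.c: every model_* name, the MODEL_* flags, and
    the way reclaim and the OOM stage are reduced to simple counters are
    hypothetical simplifications.  In particular, the model assumes that
    the OOM stage counts as "passed" for a dying task even when nothing
    was freed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for task and memcg state; not kernel types. */
struct model_task {
	bool oom_victim;
	bool fatal_signal_pending;
	bool exiting;
};

struct model_memcg {
	unsigned long usage;        /* pages currently charged */
	unsigned long limit;        /* hard limit in pages */
	unsigned long reclaimable;  /* pages one reclaim pass can free */
	unsigned long oom_freeable; /* pages one OOM kill can free */
};

/* Hypothetical request flags standing in for "cannot fail" and "atomic". */
enum { MODEL_NOFAIL = 1u << 0, MODEL_ATOMIC = 1u << 1 };

/* Same role as the renamed helper: it only reports the task state. */
static bool model_task_is_dying(const struct model_task *t)
{
	return t->oom_victim || t->fatal_signal_pending || t->exiting;
}

/* Drain a pool of freeable pages from the group; return pages freed. */
static unsigned long model_free(struct model_memcg *cg, unsigned long *pool)
{
	unsigned long freed = *pool < cg->usage ? *pool : cg->usage;

	cg->usage -= freed;
	*pool = 0;
	return freed;
}

/* Returns 0 on success (possibly over the limit) or -ENOMEM. */
static int model_charge(struct model_memcg *cg, const struct model_task *t,
			unsigned int flags, unsigned long nr_pages)
{
	bool passed_oom = false;

	for (;;) {
		if (cg->usage + nr_pages <= cg->limit) {
			cg->usage += nr_pages;
			return 0;
		}
		/* Atomic requests may bypass the limit. */
		if (flags & MODEL_ATOMIC)
			break;
		/* Reclaim gets a chance to restore the limit first. */
		if (model_free(cg, &cg->reclaimable) > 0)
			continue;
		/* A dying task that already passed the OOM stage fails. */
		if (passed_oom && model_task_is_dying(t))
			goto nomem;
		/* OOM stage: retry if it made progress or the task is dying. */
		if (model_free(cg, &cg->oom_freeable) > 0 ||
		    model_task_is_dying(t)) {
			passed_oom = true;
			continue;
		}
nomem:
		if (!(flags & MODEL_NOFAIL))
			return -ENOMEM;
		break;
	}
	/* Forced charge: only reached for atomic or cannot-fail requests. */
	cg->usage += nr_pages;
	return 0;
}
```

    In this model a dying task in a full group with nothing left to
    reclaim gets -ENOMEM instead of exceeding the limit, while the same
    request with MODEL_NOFAIL or MODEL_ATOMIC is still charged over the
    limit.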
    
    In addition, this patch renames the should_force_charge() helper to
    task_is_dying() because its use is no longer associated with forced
    charging.
    
    This patch depends on pagefault_out_of_memory() not triggering
    out_of_memory(), because otherwise a memcg failure can unwind to
    VM_FAULT_OOM and trigger the global OOM killer.
    
    Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Uladzislau Rezki <urezki@gmail.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcontrol.c 191 KB