    mm, memcg: cg2 memory{.swap,}.peak write handlers · c6f53ed8
    David Finkel authored
    Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
    
    
    This patch (of 2):
    
    Other mechanisms for querying the peak memory usage of either a process or
    v1 memory cgroup allow for resetting the high watermark.  Restore parity
    with those mechanisms, but with a less racy API.
    
    For example:
     - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
       the high watermark.
     - Writing "5" to the clear_refs pseudo-file in a process's proc
       directory resets the peak RSS.
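
    As a rough illustration of the second mechanism, the userspace sketch
    below resets the calling process's peak RSS and parses the VmHWM field
    out of /proc/<pid>/status text.  The helper names are hypothetical and
    error handling is minimal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract VmHWM (peak RSS, in kB) from the text of /proc/<pid>/status.
 * Returns -1 if no VmHWM line is present.  Hypothetical helper. */
long parse_vm_hwm_kb(const char *status_text)
{
    const char *p = status_text;
    long kb;

    assert(status_text != NULL);
    while (p && *p) {
        /* p always points at the start of a line here. */
        if (strncmp(p, "VmHWM:", 6) == 0 && sscanf(p + 6, "%ld", &kb) == 1)
            return kb;
        p = strchr(p, '\n');
        if (p)
            p++;
    }
    return -1;
}

/* Reset the calling process's peak RSS by writing "5" to clear_refs.
 * Returns 0 on success, -1 on failure.  Hypothetical helper. */
int reset_own_peak_rss(void)
{
    FILE *f = fopen("/proc/self/clear_refs", "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs("5", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

    Note that this reset is process-wide: every reader of VmHWM sees the
    new baseline, which is the kind of shared state the FD-local design
    avoids.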
    
    This change is an evolution of a previous patch, which mostly copied the
    cgroup v1 behavior.  However, there were concerns about races/ownership
    issues with a global reset, so this change instead makes the reset
    file-descriptor-local.
    
    Writing any non-empty string to the memory.peak or memory.swap.peak
    pseudo-file resets the high watermark to the current usage for subsequent
    reads through that same FD.
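
    From userspace, this could look roughly like the sketch below.  The
    helper names are hypothetical, the path is supplied by the caller, and
    the string written is arbitrary since any non-empty write triggers the
    reset.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Open a peak file (for example memory.peak inside a cgroup directory)
 * read-write, so the same FD can be used for both reset and read. */
int peak_open(const char *path)
{
    return open(path, O_RDWR | O_CLOEXEC);
}

/* Reset this FD's view of the peak.  Returns 0 on success, -1 on error. */
int peak_reset(int fd)
{
    assert(fd >= 0);
    return write(fd, "reset\n", 6) == 6 ? 0 : -1;
}

/* Read the peak as seen through this FD, in bytes.  Returns -1 on error.
 * pread() at offset 0 avoids having to seek back between reads. */
long long peak_read(int fd)
{
    char buf[64];
    ssize_t n;

    assert(fd >= 0);
    n = pread(fd, buf, sizeof(buf) - 1, 0);
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    return strtoll(buf, NULL, 10);
}
```

    A long-lived worker would keep one FD open, call peak_reset() before
    each work-item and peak_read() after it, without disturbing the peak
    seen by other readers of the same file.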
    
    Notably, following Johannes's suggestion, this implementation moves the
    O(FDs that have written) work onto the FD write(2) path.  The
    page-allocation path only gains one additional watermark to
    conditionally bump per hierarchy level in the page counter.
    
    Additionally, this takes Longman's suggestion of nesting the
    page-charging-path checks for the two watermarks to reduce the number of
    common-case comparisons.
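
    The division of work between the charge path and the write path can be
    modelled in userspace roughly as follows.  This is a simplified sketch,
    not the kernel code: the struct and function names are hypothetical,
    locking and atomic accesses are omitted, and the per-FD bookkeeping is
    an assumption about how the write path might be organized.

```c
#include <assert.h>
#include <stddef.h>

#define PEAK_UNSET (-1L)

/* Simplified stand-in for one hierarchy level of the page counter. */
struct counter {
    long usage;            /* pages currently charged */
    long watermark;        /* all-time peak, never reset */
    long local_watermark;  /* peak since the most recent reset */
    struct counter *parent;
};

/* Per-FD state; value stays PEAK_UNSET until the FD is written to. */
struct fd_peak {
    long value;
    struct fd_peak *next;  /* links the FDs that have written */
};

/* Charge path: the global check is nested inside the local one, so the
 * common case (no new local peak) costs one comparison per level. */
void charge(struct counter *c, long nr_pages)
{
    for (; c; c = c->parent) {
        c->usage += nr_pages;
        if (c->usage > c->local_watermark) {
            c->local_watermark = c->usage;
            if (c->usage > c->watermark)
                c->watermark = c->usage;
        }
    }
}

void uncharge(struct counter *c, long nr_pages)
{
    for (; c; c = c->parent) {
        assert(nr_pages <= c->usage);
        c->usage -= nr_pages;
    }
}

/* Write path: O(FDs that have written).  Peers record the peak they have
 * seen so far before the shared local watermark is pulled down. */
void reset_from_fd(struct counter *c, struct fd_peak **writers,
                   struct fd_peak *self)
{
    struct fd_peak *p;

    for (p = *writers; p; p = p->next)
        if (p->value < c->local_watermark)
            p->value = c->local_watermark;
    c->local_watermark = c->usage;
    if (self->value == PEAK_UNSET) {
        self->next = *writers;
        *writers = self;
    }
    self->value = c->usage;
}

/* Read path: FDs that never wrote see the all-time peak. */
long read_peak(const struct counter *c, const struct fd_peak *self)
{
    if (self->value == PEAK_UNSET)
        return c->watermark;
    return self->value > c->local_watermark ? self->value
                                            : c->local_watermark;
}
```

    Because a reset only ever lowers local_watermark, it never exceeds
    watermark, which is what makes nesting the second comparison inside
    the first one safe.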
    
    This behavior is particularly useful for work scheduling systems that need
    to track memory usage of worker processes/cgroups per-work-item.  Since
    memory can't be squeezed like CPU can (the OOM-killer has opinions), these
    systems need to track the peak memory usage to compute system/container
    fullness when binpacking work-items.
    
    Most notably, Vimeo's use-case involves a system that's doing global
    binpacking across many Kubernetes pods/containers, and while we can use
    PSI for some local decisions about overload, we strive to avoid packing
    workloads too tightly in the first place.  To facilitate this, we track
    the peak memory usage.  However, since we run with long-lived workers (to
    amortize startup costs) we need a way to track the high watermark while a
    work-item is executing.  Polling runs the risk of missing short spikes
    that last for timescales below the polling interval, and peak memory
    tracking at the cgroup level is otherwise perfect for this use-case.
    
    As this data is used to ensure that binpacked work ends up with sufficient
    headroom, this use-case mostly avoids the inaccuracies surrounding
    reclaimable memory.
    
    Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com
    Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com
    Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com
    Signed-off-by: David Finkel <davidf@vimeo.com>
    Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
    Suggested-by: Waiman Long <longman@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Michal Koutný <mkoutny@suse.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Shakeel Butt <shakeel.butt@linux.dev>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>