    mm: memcg: use rstat for non-hierarchical stats · f82e6bf9
    Yosry Ahmed authored
    Currently, memcg uses rstat to maintain aggregated hierarchical stats. 
    Counters are maintained for hierarchical stats at each memcg.  Rstat
    tracks which cgroups have updates on which cpus to keep those counters
    fresh on the read-side.
    
    Non-hierarchical stats are currently not covered by rstat.  Their per-cpu
    counters are summed up on every read, which is expensive.  The original
    implementation did the same.  At some point before rstat, non-hierarchical
    aggregated counters were introduced by commit a983b5eb ("mm:
    memcontrol: fix excessive complexity in memory.stat reporting").  However,
    those counters were updated on the performance critical write-side, which
    caused regressions, so they were later removed by commit 815744d7
    ("mm: memcontrol: don't batch updates of local VM stats and events").  See
    [1] for more detailed history.
    
    Kernel versions in between a983b5eb & 815744d7 (a year and a half)
    enjoyed cheap reads of non-hierarchical stats, specifically on cgroup v1. 
    When moving to more recent kernels, a performance regression for reading
    non-hierarchical stats is observed.
    
    Now that we have rstat, we know exactly which percpu counters have updates
    for each stat.  We can maintain non-hierarchical counters again, making
    reads much more efficient, without affecting the performance critical
    write-side.  Hence, add non-hierarchical (i.e. local) counters for the
    stats, and extend rstat flushing to keep those up-to-date.
    
    A caveat is that we now need a stats flush before reading
    local/non-hierarchical stats through {memcg/lruvec}_page_state_local() or
    memcg_events_local(), where we previously only needed a flush to read
    hierarchical stats.  Most contexts reading non-hierarchical stats are
    already doing a flush; add a flush to the only context missing one,
    count_shadow_nodes().
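
The scheme and the caveat above can be illustrated with a toy model. This is
only a sketch under simplifying assumptions: `ToyMemcg`, `mod_stat()` and
`flush()` are hypothetical names, each memcg tracks its own set of updated
cpus, and a flush handles one memcg at a time. The kernel's rstat code is
organized differently; the sketch only shows why a local counter is cheap to
read once flushing maintains it, and why a read without a flush can be stale.

```python
class ToyMemcg:
    """Toy stand-in for a memcg; illustrative only, not the kernel layout."""

    def __init__(self, parent=None, nr_cpus=4):
        self.parent = parent
        self.percpu = [0] * nr_cpus  # per-cpu deltas not yet folded in
        self.updated_cpus = set()    # cpus that have pending deltas
        self.local = 0               # non-hierarchical: this memcg only
        self.hier = 0                # hierarchical: this memcg + descendants

    def mod_stat(self, cpu, delta):
        # Write side stays cheap: touch one per-cpu slot, remember the cpu.
        self.percpu[cpu] += delta
        self.updated_cpus.add(cpu)

    def flush(self):
        # Fold only the cpus known to have pending updates.
        pending = sum(self.percpu[cpu] for cpu in self.updated_cpus)
        for cpu in self.updated_cpus:
            self.percpu[cpu] = 0
        self.updated_cpus.clear()
        # The same delta feeds the local counter and every ancestor's
        # hierarchical counter.
        self.local += pending
        node = self
        while node is not None:
            node.hier += pending
            node = node.parent


root = ToyMemcg()
child = ToyMemcg(parent=root)
child.mod_stat(cpu=0, delta=5)
child.mod_stat(cpu=3, delta=2)
stale = child.local  # read before any flush
child.flush()
print(stale, child.local, root.hier)  # prints "0 7 7"
```

The read of `child.local` before `flush()` returns the stale value, which
mirrors why a flush is now needed before reading local stats.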
    
    With this patch, reading memory.stat from 1000 memcgs is more than 3x faster on a
    machine with 256 cpus on cgroup v1:
    
     # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
     # time cat /sys/fs/cgroup/memory/cg*/memory.stat > /dev/null
    
    Before:
     real	 0m0.125s
     user	 0m0.005s
     sys	 0m0.120s
    
    After:
     real	 0m0.032s
     user	 0m0.005s
     sys	 0m0.027s
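
As a quick check of the numbers above, the ratio can be computed directly
from the reported timings. The helper name `speedup()` is only illustrative.

```python
def speedup(before_s, after_s):
    """Return how many times faster the 'after' run is."""
    return before_s / after_s


real = speedup(0.125, 0.032)     # wall-clock time
sys_time = speedup(0.120, 0.027)  # time spent in the kernel
print(f"real: {real:.2f}x, sys: {sys_time:.2f}x")  # prints "real: 3.91x, sys: 4.44x"
```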
    
    To make sure there are no regressions on cgroup v2, I ran an artificial
    reclaim/refault stress test [2] that creates (NR_CPUS * 2) cgroups,
    assigns them limits, runs a worker process in each cgroup that allocates
    tmpfs memory equal to quadruple the limit (to invoke reclaim
    continuously), and then reads back the entire file (to invoke refaults). 
    All workers are run in parallel, and zram is used as a swapping backend. 
    Both reclaim and refault have conditional stats flushing.  I ran this on a
    machine with 112 cpus, once on mm-unstable, and once on mm-unstable with
    this patch reverted.
    
    (1) A few runs without this patch:
    
     # time ./stress_reclaim_refault.sh
     real 0m9.949s
     user 0m0.496s
     sys 14m44.974s
    
     # time ./stress_reclaim_refault.sh
     real 0m10.049s
     user 0m0.486s
     sys 14m55.791s
    
     # time ./stress_reclaim_refault.sh
     real 0m9.984s
     user 0m0.481s
     sys 14m53.841s
    
    (2) A few runs with this patch:
    
     # time ./stress_reclaim_refault.sh
     real 0m9.885s
     user 0m0.486s
     sys 14m48.753s
    
     # time ./stress_reclaim_refault.sh
     real 0m9.903s
     user 0m0.495s
     sys 14m48.339s
    
     # time ./stress_reclaim_refault.sh
     real 0m9.861s
     user 0m0.507s
     sys 14m49.317s
    
    No regressions are observed with this patch.  There is actually a very
    slight improvement.  My guess is that we avoid the percpu loop in
    count_shadow_nodes() when calling lruvec_page_state_local(), but I
    could not prove this using perf, so it is probably in the noise.
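
For reference, the six runs above can be averaged with a short script. This
is a sketch, not part of the original test; `to_seconds()` and `summarize()`
are hypothetical helpers, and the timings are copied from the runs listed in
(1) and (2).

```python
import re
from statistics import mean

RUNS = {
    "without": {"real": ["0m9.949s", "0m10.049s", "0m9.984s"],
                "sys": ["14m44.974s", "14m55.791s", "14m53.841s"]},
    "with": {"real": ["0m9.885s", "0m9.903s", "0m9.861s"],
             "sys": ["14m48.753s", "14m48.339s", "14m49.317s"]},
}


def to_seconds(t):
    """Convert time(1)-style output such as '14m44.974s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", t)
    if m is None:
        raise ValueError(f"unrecognized time format: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def summarize(runs):
    """Return the mean of each metric, in seconds, for each configuration."""
    return {cfg: {metric: mean(to_seconds(t) for t in times)
                  for metric, times in metrics.items()}
            for cfg, metrics in runs.items()}


summary = summarize(RUNS)
for metric in ("real", "sys"):
    before = summary["without"][metric]
    after = summary["with"][metric]
    change = (after - before) / before * 100
    print(f"{metric}: {before:.3f}s -> {after:.3f}s ({change:+.2f}%)")
```

With three runs per configuration, differences of about 1% in real time and
well under 1% in sys time are consistent with the reading that the change is
within noise.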
    
    [1] https://lore.kernel.org/lkml/20230725201811.GA1231514@cmpxchg.org/
    [2] https://lore.kernel.org/lkml/CAJD7tkb17x=qwoO37uxyYXLEUVp15BQKR+Xfh7Sg9Hx-wTQ_=w@mail.gmail.com/
    
    Link: https://lkml.kernel.org/r/20230803185046.1385770-1-yosryahmed@google.com
    Link: https://lkml.kernel.org/r/20230726153223.821757-2-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>