1. 12 Sep, 2013 15 commits
    • Johannes Weiner's avatar
      arch: mm: remove obsolete init OOM protection · 94bce453
      Johannes Weiner authored
      The memcg code can trap tasks in the context of the failing allocation
      until an OOM situation is resolved.  They can hold all kinds of locks
      (fs, mm) at this point, which makes it prone to deadlocking.
      
      This series converts memcg OOM handling into a two step process that is
      started in the charge context, but any waiting is done after the fault
      stack is fully unwound.
      
      Patches 1-4 prepare architecture handlers to support the new memcg
      requirements, but in doing so they also remove old cruft and unify
      out-of-memory behavior across architectures.
      
      Patch 5 disables the memcg OOM handling for syscalls, readahead, kernel
      faults, because they can gracefully unwind the stack with -ENOMEM.  OOM
      handling is restricted to user triggered faults that have no other
      option.
      
      Patch 6 reworks memcg's hierarchical OOM locking to make it a little
      more obvious wth is going on in there: reduce locked regions, rename
      locking functions, reorder and document.
      
      Patch 7 implements the two-part OOM handling such that tasks are never
      trapped with the full charge stack in an OOM situation.
      
      This patch:
      
      Back before smart OOM killing, when faulting tasks were killed directly on
      allocation failures, the arch-specific fault handlers needed special
      protection for the init process.
      
      Now that all fault handlers call into the generic OOM killer (see commit
      609838cf: "mm: invoke oom-killer from remaining unconverted page
      fault handlers"), which already provides init protection, the
      arch-specific leftovers can be removed.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc bits]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      94bce453
    • Andrew Morton's avatar
      memcg: trivial cleanups · f894ffa8
      Andrew Morton authored
      Clean up some mess made by the "Soft limit rework" series, and a few other
      things.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f894ffa8
    • Michal Hocko's avatar
      memcg, vmscan: do not fall into reclaim-all pass too quickly · e975de99
      Michal Hocko authored
      shrink_zone starts with soft reclaim pass first and then falls back to
      regular reclaim if nothing has been scanned.  This behavior is natural
      but there is a catch.  Memcg iterators, when used with the reclaim
      cookie, are designed to help to prevent from over reclaim by
      interleaving reclaimers (per node-zone-priority) so the tree walk might
      miss many (even all) nodes in the hierarchy e.g.  when there are direct
      reclaimers racing with each other or with kswapd in the global case or
      multiple allocators reaching the limit for the target reclaim case.  To
      make it even more complicated, targeted reclaim doesn't do the whole
      tree walk because it stops reclaiming once it reclaims sufficient pages.
      As a result groups over the limit might be missed, thus nothing is
      scanned, and reclaim would fall back to the reclaim all mode.
      
      This patch checks for the incomplete tree walk in shrink_zone.  If no
      group has been visited and the hierarchy is soft reclaimable then we
      must have missed some groups, in which case the __shrink_zone is called
      again.  This doesn't guarantee there will be some progress of course
      because the current reclaimer might be still racing with others but it
      would at least give a chance to start the walk without a big risk of
      reclaim latencies.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e975de99
    • Michal Hocko's avatar
      memcg: track all children over limit in the root · 1be171d6
      Michal Hocko authored
      Children in soft limit excess are currently tracked up the hierarchy in
      memcg->children_in_excess.  Nevertheless there still might exist tons of
      groups that are not in hierarchy relation to the root cgroup (e.g.  all
      first level groups if root_mem_cgroup->use_hierarchy == false).
      
      As the whole tree walk has to be done when the iteration starts at
      root_mem_cgroup the iterator should be able to skip the walk if there is
      no child above the limit without iterating them.  This can be done
      easily if the root tracks all children rather than only hierarchical
      children.  This is done by this patch which updates root_mem_cgroup
      children_in_excess if root_mem_cgroup->use_hierarchy == false so the
      root knows about all children in excess.
      
      Please note that this is not an issue for inner memcgs which have
      use_hierarchy == false because then only the single group is visited so
      no special optimization is necessary.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1be171d6
    • Michal Hocko's avatar
      memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything · e839b6a1
      Michal Hocko authored
      mem_cgroup_should_soft_reclaim controls whether soft reclaim pass is
      done and it always says yes currently.  Memcg iterators are clever to
      skip nodes that are not soft reclaimable quite efficiently but
      mem_cgroup_should_soft_reclaim can be more clever and do not start the
      soft reclaim pass at all if it knows that nothing would be scanned
      anyway.
      
      In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
      the target group of the reclaim and allow the pass only if the whole
      subtree wouldn't be skipped.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e839b6a1
    • Michal Hocko's avatar
      memcg: track children in soft limit excess to improve soft limit · 7d910c05
      Michal Hocko authored
      Soft limit reclaim has to check the whole reclaim hierarchy while doing
      the first pass of the reclaim.  This leads to a higher system time which
      can be visible especially when there are many groups in the hierarchy.
      
      This patch adds a per-memcg counter of children in excess.  It also
      restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit for a
      proper batching.
      
      If a group crosses soft limit for the first time it increases parent's
      children_in_excess up the hierarchy.  The similarly if a group gets below
      the limit it will decrease the counter.  The transition phase is recorded
      in soft_contributed flag.
      
      mem_cgroup_soft_reclaim_eligible then uses this information to better
      decide whether to skip the node or the whole subtree.  The rule is simple.
       Skip the node with a children in excess or skip the whole subtree
      otherwise.
      
      This has been tested by a stream IO (dd if=/dev/zero of=file with
      4*MemTotal size) which is quite sensitive to overhead during reclaim.  The
      load is running in a group with soft limit set to 0 and without any limit.
       Apart from that there was a hierarchy with ~500, 2k and 8k groups (two
      groups on each level) without any pages in them.  base denotes to the
      kernel on which the whole series is based on, rework is the kernel before
      this patch and reworkoptim is with this patch applied:
      
      * Run with soft limit set to 0
      Elapsed
      0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3
      0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3
      0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3
      System
      0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3
      0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3
      0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3
      Elapsed
      0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3
      0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3
      0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3
      System
      2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3
      2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3
      2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3
      Elapsed
      2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3
      2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3
      2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3
      System
      8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3
      8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3
      8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3
      Elapsed
      8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3
      8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3
      8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3
      
      System time is increased by 30-40% but it is reduced a lot comparing to
      kernel without this patch.  The higher time can be explained by the fact
      that the original soft reclaim scanned at priority 0 so it was much more
      effective for this workload (which is basically touch once and writeback).
       The Elapsed time looks better though (~20%).
      
      * Run with no soft limit set
      System
      0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3
      0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3
      0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3
      Elapsed
      0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3
      0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3
      0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3
      System
      0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3
      0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3
      0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3
      Elapsed
      0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3
      0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3
      0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3
      System
      2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3
      2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3
      2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3
      Elapsed
      2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3
      2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3
      2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3
      System
      8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3
      8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3
      8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3
      Elapsed
      8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3
      8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3
      8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3
      
      Both System and Elapsed are in stdev with the base kernel for all
      configurations except for 8k where both System and Elapsed are up by 35%.
      I do not have a good explanation for this because there is no soft reclaim
      pass going on as no group is above the limit which is checked in
      mem_cgroup_should_soft_reclaim.
      
      Then I have tested kernel build with the same configuration to see the
      behavior with a more general behavior.
      
      * Soft limit set to 0 for the build
      System
      0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3
      0-0-limit/rework min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3
      0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3
      Elapsed
      0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3
      0-0-limit/rework min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3
      0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3
      System
      0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3
      0.5k-0-limit/rework min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3
      0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3
      Elapsed
      0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3
      0.5k-0-limit/rework min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3
      0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3
      System
      2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3
      2k-0-limit/rework min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3
      2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3
      Elapsed
      2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3
      2k-0-limit/rework min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3
      2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3
      System
      8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3
      8k-0-limit/rework min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3
      8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3
      Elapsed
      8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3
      8k-0-limit/rework min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3
      8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3
      
      Apart from 8k the system time is comparable with the base kernel while
      Elapsed is up to 20% better with all configurations.
      
      * No soft limit set
      System
      0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3
      0-no-limit/rework min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3
      0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3
      Elapsed
      0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3
      0-no-limit/rework min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3
      0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3
      System
      0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3
      0.5k-no-limit/rework min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3
      0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3
      Elapsed
      0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3
      0.5k-no-limit/rework min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3
      0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3
      System
      2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3
      2k-no-limit/rework min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3
      2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3
      Elapsed
      2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3
      2k-no-limit/rework min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3
      2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3
      System
      8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3
      8k-no-limit/rework min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3
      8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3
      Elapsed
      8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3
      8k-no-limit/rework min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3
      8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3
      
      and these numbers look good as well.  System time is around 100%
      (suprisingly better for the 8k case) and Elapsed is copies that trend.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7d910c05
    • Michal Hocko's avatar
      memcg: enhance memcg iterator to support predicates · de57780d
      Michal Hocko authored
      The caller of the iterator might know that some nodes or even subtrees
      should be skipped but there is no way to tell iterators about that so the
      only choice left is to let iterators to visit each node and do the
      selection outside of the iterating code.  This, however, doesn't scale
      well with hierarchies with many groups where only few groups are
      interesting.
      
      This patch adds mem_cgroup_iter_cond variant of the iterator with a
      callback which gets called for every visited node.  There are three
      possible ways how the callback can influence the walk.  Either the node is
      visited, it is skipped but the tree walk continues down the tree or the
      whole subtree of the current group is skipped.
      
      [hughd@google.com: fix memcg-less page reclaim]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de57780d
    • Michal Hocko's avatar
      vmscan, memcg: do softlimit reclaim also for targeted reclaim · a5b7c87f
      Michal Hocko authored
      Soft reclaim has been done only for the global reclaim (both background
      and direct).  Since "memcg: integrate soft reclaim tighter with zone
      shrinking code" there is no reason for this limitation anymore as the soft
      limit reclaim doesn't use any special code paths and it is a part of the
      zone shrinking code which is used by both global and targeted reclaims.
      
      From the semantic point of view it is natural to consider soft limit
      before touching all groups in the hierarchy tree which is touching the
      hard limit because soft limit tells us where to push back when there is a
      memory pressure.  It is not important whether the pressure comes from the
      limit or imbalanced zones.
      
      This patch simply enables soft reclaim unconditionally in
      mem_cgroup_should_soft_reclaim so it is enabled for both global and
      targeted reclaim paths.  mem_cgroup_soft_reclaim_eligible needs to learn
      about the root of the reclaim to know where to stop checking soft limit
      state of parents up the hierarchy.  Say we have
      
      A (over soft limit)
       \
        B (below s.l., hit the hard limit)
       / \
      C   D (below s.l.)
      
      B is the source of the outside memory pressure now for D but we shouldn't
      soft reclaim it because it is behaving well under B subtree and we can
      still reclaim from C (pressumably it is over the limit).
      mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
      hierarchy at B (root of the memory pressure).
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarGlauber Costa <glommer@openvz.org>
      Reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5b7c87f
    • Michal Hocko's avatar
      memcg: get rid of soft-limit tree infrastructure · e883110a
      Michal Hocko authored
      Now that the soft limit is integrated to the reclaim directly the whole
      soft-limit tree infrastructure is not needed anymore.  Rip it out.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarGlauber Costa <glommer@openvz.org>
      Reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e883110a
    • Michal Hocko's avatar
      memcg, vmscan: integrate soft reclaim tighter with zone shrinking code · 3b38722e
      Michal Hocko authored
      This patchset is sitting out of tree for quite some time without any
      objections.  I would be really happy if it made it into 3.12.  I do not
      want to push it too hard but I think this work is basically ready and
      waiting more doesn't help.
      
      The basic idea is quite simple.  Pull soft reclaim into shrink_zone in the
      first step and get rid of the previous soft reclaim infrastructure.
      shrink_zone is done in two passes now.  First it tries to do the soft
      limit reclaim and it falls back to reclaim-all mode if no group is over
      the limit or no pages have been scanned.  The second pass happens at the
      same priority so the only time we waste is the memcg tree walk which has
      been updated in the third step to have only negligible overhead.
      
      As a bonus we will get rid of a _lot_ of code by this and soft reclaim
      will not stand out like before when it wasn't integrated into the zone
      shrinking code and it reclaimed at priority 0 (the testing results show
      that some workloads suffers from such an aggressive reclaim).  The clean
      up is in a separate patch because I felt it would be easier to review that
      way.
      
      The second step is soft limit reclaim integration into targeted reclaim.
      It should be rather straight forward.  Soft limit has been used only for
      the global reclaim so far but it makes sense for any kind of pressure
      coming from up-the-hierarchy, including targeted reclaim.
      
      The third step (patches 4-8) addresses the tree walk overhead by enhancing
      memcg iterators to enable skipping whole subtrees and tracking number of
      over soft limit children at each level of the hierarchy.  This information
      is updated same way the old soft limit tree was updated (from
      memcg_check_events) so we shouldn't see an additional overhead.  In fact
      mem_cgroup_update_soft_limit is much simpler than tree manipulation done
      previously.
      
      __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
      mem_cgroup_iter so the decision whether a particular group should be
      visited is done at the iterator level which allows us to decide to skip
      the whole subtree as well (if there is no child in excess).  This reduces
      the tree walk overhead considerably.
      
      * TEST 1
      ========
      
      My primary test case was a parallel kernel build with 2 groups (make is
      running with -j8 with a distribution .config in a separate cgroup without
      any hard limit) on a 32 CPU machine booted with 1GB memory and both builds
      run taskset to Node 0 cpus.
      
      I was mostly interested in 2 setups.  Default - no soft limit set and -
      and 0 soft limit set to both groups.  The first one should tell us whether
      the rework regresses the default behavior while the second one should show
      us improvements in an extreme case where both workloads are always over
      the soft limit.
      
      /usr/bin/time -v has been used to collect the statistics and each
      configuration had 3 runs after fresh boot without any other load on the
      system.
      
      base is mmotm-2013-07-18-16-40
      rework all 8 patches applied on top of base
      
      * No-limit
      User
      no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
      no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
      System
      no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
      no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
      Elapsed
      no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
      no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6
      
      The results are within noise. Elapsed time has a bigger variance but the
      average looks good.
      
      * 0-limit
      User
      0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
      0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
      System
      0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
      0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
      Elapsed
      0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
      0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6
      
      The improvement is really huge here (even bigger than with my previous
      testing and I suspect that this highly depends on the storage).  Page
      fault statistics tell us at least part of the story:
      
      Minor
      0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
      0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
      Major
      0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
      0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6
      
      Same as with my previous testing Minor faults are more or less within
      noise but Major fault count is way bellow the base kernel.
      
      While this looks as a nice win it is fair to say that 0-limit
      configuration is quite artificial. So I was playing with 0-no-limit
      loads as well.
      
      * TEST 2
      ========
      
      The following results are from 2 groups configuration on a 16GB machine
      (single NUMA node).
      
      - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with
        2*TotalMem with 0 soft limit.
      - B running a mem_eater which consumes TotalMem-1G without any limit. The
        mem_eater consumes the memory in 100 chunks with 1s nap after each
        mmap+poppulate so that both loads have chance to fight for the memory.
      
      The expected result is that B shouldn't be reclaimed and A shouldn't see
      a big dropdown in elapsed time.
      
      User
      base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
      rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
      System
      base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
      rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
      Elapsed
      base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
      rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3
      
      System time improved slightly as well as Elapsed. My previous testing
      has shown worse numbers but this again seem to depend on the storage
      speed.
      
      My theory is that the writeback doesn't catch up and prio-0 soft reclaim
      falls into wait on writeback page too often in the base kernel. The
      patched kernel doesn't do that because the soft reclaim is done from the
      kswapd/direct reclaim context. This can be seen on the following graph
      nicely. The A's group usage_in_bytes regurarly drops really low very often.
      
      All 3 runs
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
      resp. a detail of the single run
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png
      
      mem_eater seems to be doing better as well. It gets to the full
      allocation size faster as can be seen on the following graph:
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png
      
      /proc/meminfo collected during the test also shows that rework kernel
      hasn't swapped that much (well almost not at all):
      base: max: 123900 K avg: 56388.29 K
      rework: max: 300 K avg: 128.68 K
      
      kswapd and direct reclaim statistics are of no use unfortunatelly because
      soft reclaim is not accounted properly as the counters are hidden by
      global_reclaim() checks in the base kernel.
      
      * TEST 3
      ========
      
      Another test was the same configuration as TEST2 except the stream IO was
      replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as
      in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB
      left.
      
      Kbuild did better with the rework kernel here as well:
      User
      base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
      rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
      System
      base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
      rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
      Elapsed
      base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
      rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
      Minor
      base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
      rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
      Major
      base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
      rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3
      
      Again we can see a significant improvement in Elapsed (it also seems to
      be more stable), there is a huge dropdown for the Major page faults and
      much more swapping:
      base: max: 583736 K avg: 112547.43 K
      rework: max: 4012 K avg: 124.36 K
      
      Graphs from all three runs show the variability of the kbuild quite
      nicely.  It even seems that it took longer after every run with the base
      kernel which would be quite surprising as the source tree for the build is
      removed and caches are dropped after each run so the build operates on a
      freshly extracted sources everytime.
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png
      
      My other testing shows that this is just a matter of timing and other runs
      behave differently the std for Elapsed time is similar ~50.  Example of
      other three runs:
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png
      
      So to wrap this up.  The series is still doing good and improves the soft
      limit.
      
      The testing results for bunch of cgroups with both stream IO and kbuild
      loads can be found in "memcg: track children in soft limit excess to
      improve soft limit".
      
      This patch:
      
      Memcg soft reclaim has been traditionally triggered from the global
      reclaim paths before calling shrink_zone.  mem_cgroup_soft_limit_reclaim
      then picked up a group which exceeds the soft limit the most and reclaimed
      it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
      
      The infrastructure requires per-node-zone trees which hold over-limit
      groups and keep them up-to-date (via memcg_check_events) which is not cost
      free.  Although this overhead hasn't turned out to be a bottle neck the
      implementation is suboptimal because mem_cgroup_update_tree has no idea
      which zones consumed memory over the limit so we could easily end up
      having a group on a node-zone tree having only few pages from that
      node-zone.
      
      This patch doesn't try to fix node-zone trees management because it seems
      that integrating soft reclaim into zone shrinking sounds much easier and
      more appropriate for several reasons.  First of all 0 priority reclaim was
      a crude hack which might lead to big stalls if the group's LRUs are big
      and hard to reclaim (e.g.  a lot of dirty/writeback pages).  Soft reclaim
      should be applicable also to the targeted reclaim which is awkward right
      now without additional hacks.  Last but not least the whole infrastructure
      eats quite some code.
      
      After this patch shrink_zone is done in 2 passes.  First it tries to do
      the soft reclaim if appropriate (only for global reclaim for now to keep
      compatible with the original state) and fall back to ignoring soft limit
      if no group is eligible to soft reclaim or nothing has been scanned during
      the first pass.  Only groups which are over their soft limit or any of
      their parents up the hierarchy is over the limit are considered eligible
      during the first pass.
      
      Soft limit tree which is not necessary anymore will be removed in the
      follow up patch to make this patch smaller and easier to review.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarGlauber Costa <glommer@openvz.org>
      Reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b38722e
    • Li Zefan's avatar
      memcg: remove redundant code in mem_cgroup_force_empty_write() · c33bd835
      Li Zefan authored
      vfs guarantees the cgroup won't be destroyed, so it's redundant to get a
      css reference.
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c33bd835
    • Linus Torvalds's avatar
      Bye, bye, WfW flag · d5d04bb4
      Linus Torvalds authored
      This reverts the Linux for Workgroups thing.  And no, before somebody
      asks, we're not doing Linux95.  Not for a few years, at least.
      
      Sure, the flag added some color to the logo, and could have remained as
      a testament to my leet gimp skills.  But no.  And I'll do this early, to
      avoid the chance of forgetting when I'm doing the actual rc1 release on
      the road.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5d04bb4
    • Linus Torvalds's avatar
      Merge tag 'ecryptfs-3.12-rc1-crypt-ctx' of... · 1ae276a9
      Linus Torvalds authored
      Merge tag 'ecryptfs-3.12-rc1-crypt-ctx' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
      
      Pull eCryptfs fixes from Tyler Hicks:
       "Two small fixes to the code that initializes the per-file crypto
        contexts"
      
      * tag 'ecryptfs-3.12-rc1-crypt-ctx' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
        ecryptfs: avoid ctx initialization race
        ecryptfs: remove check for if an array is NULL
      1ae276a9
    • Linus Torvalds's avatar
      Merge branch 'for-v3.12-fix' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping · 3b38f56c
      Linus Torvalds authored
      Pull DMA-mapping fix from Marek Szyprowski:
       "A build bugfix for the device tree support for reserved memory
        regions.  Due to superfluous include the common code failed to build
        on ARM64 and MIPS architectures.
      
        The patch that caused the build break has lived at linux-next for
        about two weeks and noone noticed the issue, what convinced me that
        everything was ok"
      
      * 'for-v3.12-fix' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
        drivers: of: fix build break if asm/dma-contiguous.h is missing
      3b38f56c
    • Linus Torvalds's avatar
      Merge tag 'for-3.12' of git://git.linaro.org/people/sumitsemwal/linux-dma-buf · b3b75684
      Linus Torvalds authored
      Pull dma-buf updates from Sumit Semwal:
       "Yet another small one - dma-buf framework now supports size discovery
        of the buffer via llseek"
      
      * tag 'for-3.12' of git://git.linaro.org/people/sumitsemwal/linux-dma-buf:
        dma-buf: Expose buffer size to userspace (v2)
        dma-buf: Check return value of anon_inode_getfile
      b3b75684
  2. 11 Sep, 2013 25 commits