Commit 4f9b16a6 authored by Mel Gorman, committed by Linus Torvalds

mm: disable zone_reclaim_mode by default

When it was introduced, zone_reclaim_mode made sense because NUMA
distances carried a heavy penalty and workloads were generally
partitioned to fit into a NUMA node.  NUMA machines are now common, but
few workloads are NUMA-aware.  It is routine to see major performance
degradation caused by zone_reclaim_mode being enabled, yet relatively
few users can identify the problem.

Those that require zone_reclaim_mode are likely to be able to detect
when it needs to be enabled and tune appropriately, so let's have a
sensible default for the bulk of users.

This patch (of 2):

zone_reclaim_mode causes processes to prefer reclaiming memory from the
local node instead of spilling over to other nodes.  This made sense
initially when NUMA machines were almost exclusively HPC and the
workload was partitioned into nodes.  The NUMA penalties were
sufficiently high to justify reclaiming the memory.  On current machines
and workloads it is often the case that zone_reclaim_mode destroys
performance but not all users know how to detect this.  Favour the
common case and disable it by default.  Users that are sophisticated
enough to know they need zone_reclaim_mode will detect it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 944d9fec
@@ -772,16 +772,17 @@ This is value ORed together of
 2 = Zone reclaim writes dirty pages out
 4 = Zone reclaim swaps pages
-zone_reclaim_mode is set during bootup to 1 if it is determined that pages
-from remote zones will cause a measurable performance reduction. The
-page allocator will then reclaim easily reusable pages (those page
-cache pages that are currently not used) before allocating off node pages.
-It may be beneficial to switch off zone reclaim if the system is
-used for a file server and all of memory should be used for caching files
-from disk. In that case the caching effect is more important than
+zone_reclaim_mode is disabled by default. For file servers or workloads
+that benefit from having their data cached, zone_reclaim_mode should be
+left disabled as the caching effect is likely to be more important than
 data locality.
+zone_reclaim may be enabled if it's known that the workload is partitioned
+such that each partition fits within a NUMA node and that accessing remote
+memory would cause a measurable performance reduction. The page allocator
+will then reclaim easily reusable pages (those page cache pages that are
+currently not used) before allocating off node pages.
 Allowing zone reclaim to write out pages stops processes that are
 writing large amounts of data from dirtying pages on other nodes. Zone
 reclaim will write out dirty pages if a zone fills up and so effectively
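
The documentation hunk above describes zone_reclaim_mode as a value ORed together from individual bits. As a rough illustration only, the userspace sketch below decodes such a value and reads the current setting from /proc/sys/vm/zone_reclaim_mode. The ZRM_* names and both helper functions are hypothetical and are not part of this patch. Bit value 1 ("zone reclaim on") is taken from the surrounding kernel documentation rather than from the lines shown in the hunk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for this sketch; the values mirror the documentation. */
enum {
	ZRM_RECLAIM = 1,	/* zone reclaim on */
	ZRM_WRITE   = 2,	/* zone reclaim writes dirty pages out */
	ZRM_SWAP    = 4,	/* zone reclaim swaps pages */
};

/* Append a flag name to buf, separated by '|' from any earlier names. */
static void append_flag(char *buf, size_t len, const char *name)
{
	if (buf[0] != '\0')
		strncat(buf, "|", len - strlen(buf) - 1);
	strncat(buf, name, len - strlen(buf) - 1);
}

/* Describe a zone_reclaim_mode value; 0 is the default after this patch. */
const char *describe_zone_reclaim_mode(int mode, char *buf, size_t len)
{
	assert(len > 0);
	buf[0] = '\0';
	if (mode & ZRM_RECLAIM)
		append_flag(buf, len, "reclaim");
	if (mode & ZRM_WRITE)
		append_flag(buf, len, "write");
	if (mode & ZRM_SWAP)
		append_flag(buf, len, "swap");
	if (buf[0] == '\0')
		append_flag(buf, len, "disabled");
	return buf;
}

/* Read the current setting from procfs; returns -1 if it cannot be read. */
int read_zone_reclaim_mode(void)
{
	int mode = -1;
	FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");

	if (!f)
		return -1;
	if (fscanf(f, "%d", &mode) != 1)
		mode = -1;
	fclose(f);
	return mode;
}
```

A caller could pass the result of read_zone_reclaim_mode() to describe_zone_reclaim_mode() to check whether a system still has zone reclaim enabled, for example through an explicit sysctl setting.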
@@ -21,7 +21,8 @@
 #define PENALTY_FOR_NODE_WITH_CPUS 255
 /*
- * Distance above which we begin to use zone reclaim
+ * Nodes within this distance are eligible for reclaim by zone_reclaim() when
+ * zone_reclaim_mode is enabled.
  */
 #define RECLAIM_DISTANCE 15
@@ -9,12 +9,8 @@ struct device_node;
 #ifdef CONFIG_NUMA
 /*
- * Before going off node we want the VM to try and reclaim from the local
- * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
- * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
- * 20, we never reclaim and go off node straight away.
- *
- * To fix this we choose a smaller value of RECLAIM_DISTANCE.
+ * If zone_reclaim_mode is enabled, a RECLAIM_DISTANCE of 10 will mean that
+ * all zones on all nodes will be eligible for zone_reclaim().
  */
 #define RECLAIM_DISTANCE 10
@@ -58,7 +58,8 @@ int arch_update_cpu_topology(void);
 /*
  * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
  * (in whatever arch specific measurement units returned by node_distance())
- * then switch on zone reclaim on boot.
+ * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim()
+ * on nodes within this distance.
  */
 #define RECLAIM_DISTANCE 30
 #endif
@@ -1860,8 +1860,6 @@ static void __paginginit init_zone_allows_reclaim(int nid)
 	for_each_node_state(i, N_MEMORY)
 		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
 			node_set(i, NODE_DATA(nid)->reclaim_nodes);
-		else
-			zone_reclaim_mode = 1;
 }
 #else /* CONFIG_NUMA */
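
The effect of the last hunk can be modelled in userspace. The sketch below is not kernel code and makes several assumptions: MAX_NODES, the distance table layout and the function name are hypothetical, and RECLAIM_DISTANCE uses the value 30 that appears in one of the headers above. Like the patched loop, it only records which nodes lie within RECLAIM_DISTANCE of a given node. Unlike the pre-patch code, finding a more distant node has no side effect on zone_reclaim_mode.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 8		/* hypothetical limit for this sketch */
#define RECLAIM_DISTANCE 30	/* value shown in one of the headers above */

/*
 * Return a bitmask with bit i set when node i is within RECLAIM_DISTANCE
 * of node nid, mirroring the node_set() call kept by the patch.
 */
uint64_t reclaim_nodes_mask(int dist[MAX_NODES][MAX_NODES],
			    int nr_nodes, int nid)
{
	uint64_t mask = 0;

	assert(nr_nodes > 0 && nr_nodes <= MAX_NODES);
	assert(nid >= 0 && nid < nr_nodes);
	for (int i = 0; i < nr_nodes; i++)
		if (dist[nid][i] <= RECLAIM_DISTANCE)
			mask |= UINT64_C(1) << i;
	return mask;
}
```

With a three-node table in which nodes 0 and 1 are 20 apart and node 2 is 40 away from both, node 0 gets a mask covering nodes 0 and 1, while node 2 gets a mask covering only itself.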