Commit c1859213 authored by Andrew Morton, committed by Jaroslav Kysela

[PATCH] Add /proc/sys/vm/lower_zone_protection

This allows us to control the aggressiveness of the lower-zone defense
algorithm: the `incremental min'.  For workloads which use a serious
amount of mlocked memory, a few megabytes of protection is not enough.

So the `lower_zone_protection' tunable allows the administrator to
increase the amount of protection which lower zones receive against
allocations which _could_ use higher zones.

The default value of lower_zone_protection is zero, giving unchanged
behaviour.  We should not normally make large amounts of memory
unavailable for pagecache just in case someone mlocks many hundreds of
megabytes.
parent 20b96b52
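
For illustration, a minimal userspace sketch (not part of the patch)
that raises the tunable through the /proc file this patch adds; the
value 100 follows the documentation's suggested setting:

#include <stdio.h>

int main(void)
{
        /* Path created by this patch; units are roughly megabytes. */
        FILE *f = fopen("/proc/sys/vm/lower_zone_protection", "w");

        if (f == NULL) {
                perror("lower_zone_protection");
                return 1;
        }
        fprintf(f, "%d\n", 100);
        fclose(f);
        return 0;
}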
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -989,42 +989,58 @@ for writeout by the pdflush daemons.  It is expressed in 100'ths of a second.
 Data which has been dirty in-memory for longer than this interval will be
 written out next time a pdflush daemon wakes up.
 
-kswapd
-------
+lower_zone_protection
+---------------------
 
-Kswapd is the kernel swap out daemon.  That is, kswapd is that piece of the
-kernel that frees memory when it gets fragmented or full.  Since every system
-is different, you'll probably want some control over this piece of the system.
+For some specialised workloads on highmem machines it is dangerous for
+the kernel to allow process memory to be allocated from the "lowmem"
+zone.  This is because that memory could then be pinned via the mlock()
+system call, or by unavailability of swapspace.
 
-The file contains three numbers:
+And on large highmem machines this lack of reclaimable lowmem memory
+can be fatal.
 
-tries_base
-----------
+So the Linux page allocator has a mechanism which prevents allocations
+which _could_ use highmem from using too much lowmem.  This means that
+a certain amount of lowmem is defended from the possibility of being
+captured into pinned user memory.
 
-The maximum number of pages kswapd tries to free in one round is calculated
-from this number.  Usually this number will be divided by 4 or 8 (see
-mm/vmscan.c), so it isn't as big as it looks.
+(The same argument applies to the old 16 megabyte ISA DMA region.  This
+mechanism will also defend that region from allocations which could use
+highmem or lowmem).
 
-When you need to increase the bandwidth to/from swap, you'll want to increase
-this number.
+The `lower_zone_protection' tunable determines how aggressive the kernel is
+in defending these lower zones.  The default value is zero - no
+protection at all.
 
-tries_min
----------
+If you have a machine which uses highmem or ISA DMA and your
+applications are using mlock(), or if you are running with no swap then
+you probably should increase the lower_zone_protection setting.
 
-This is the minimum number of times kswapd tries to free a page each time it
-is called.  Basically it's just there to make sure that kswapd frees some pages
-even when it's being called with minimum priority.
+The units of this tunable are fairly vague.  It is approximately equal
+to "megabytes".  So setting lower_zone_protection=100 will protect around 100
+megabytes of the lowmem zone from user allocations.  It will also make
+those 100 megabytes unavailable for use by applications and by
+pagecache, so there is a cost.
 
-swap_cluster
-------------
+The effects of this tunable may be observed by monitoring
+/proc/meminfo:LowFree.  Write a single huge file and observe the point
+at which LowFree ceases to fall.
 
-This is probably the greatest influence on system performance.
+A reasonable value for lower_zone_protection is 100.
 
-swap_cluster is the number of pages kswapd writes in one turn.  You'll want
-this value to be large so that kswapd does its I/O in large chunks and the
-disk doesn't have to seek as often, but you don't want it to be too large
-since that would flood the request queue.
+page-cluster
+------------
+
+page-cluster controls the number of pages which are written to swap in
+a single attempt.  The swap I/O size.
+
+It is a logarithmic value - setting it to zero means "1 page", setting
+it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+
+The default value is three (eight pages at a time).  There may be some
+small benefits in tuning this to a different value if your workload is
+swap-intensive.
 
 overcommit_memory
 -----------------
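
The LowFree observation suggested in the new vm.txt text above can be
scripted.  A minimal sketch (not part of the patch) which prints the
LowFree line from /proc/meminfo; the "LowFree:" field-name prefix is an
assumption about the meminfo line format:

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[128];
        FILE *f = fopen("/proc/meminfo", "r");

        if (f == NULL) {
                perror("/proc/meminfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f) != NULL) {
                /* e.g. "LowFree:        123456 kB" */
                if (strncmp(line, "LowFree:", 8) == 0) {
                        fputs(line, stdout);
                        break;
                }
        }
        fclose(f);
        return 0;
}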
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -154,6 +154,7 @@ enum
        VM_PAGEBUF=17,          /* struct: Control pagebuf parameters */
        VM_HUGETLB_PAGES=18,    /* int: Number of available Huge Pages */
        VM_SWAPPINESS=19,       /* Tendency to steal mapped memory */
+       VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */
 };
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -53,6 +53,7 @@ extern int core_uses_pid;
 extern char core_pattern[];
 extern int cad_pid;
 extern int pid_max;
+extern int sysctl_lower_zone_protection;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -310,8 +311,13 @@ static ctl_table vm_table[] = {
         0644, NULL, &proc_dointvec_minmax, &sysctl_intvec, NULL, &zero,
         &one_hundred },
 #ifdef CONFIG_HUGETLB_PAGE
-       {VM_HUGETLB_PAGES, "nr_hugepages", &htlbpage_max, sizeof(int), 0644, NULL, &hugetlb_sysctl_handler},
+       {VM_HUGETLB_PAGES, "nr_hugepages", &htlbpage_max, sizeof(int), 0644,
+        NULL, &hugetlb_sysctl_handler},
 #endif
+       {VM_LOWER_ZONE_PROTECTION, "lower_zone_protection",
+        &sysctl_lower_zone_protection, sizeof(sysctl_lower_zone_protection),
+        0644, NULL, &proc_dointvec_minmax, &sysctl_intvec, NULL, &zero,
+        NULL, },
        {0}
 };
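
Besides the /proc file, the table entry above also makes the value
reachable through the binary sysctl(2) interface under the new
VM_LOWER_ZONE_PROTECTION id.  A read-side sketch, assuming the glibc
sysctl() wrapper of this era (the /proc route is the simpler one):

#include <stdio.h>
#include <sys/sysctl.h>         /* glibc wrapper for sysctl(2) */

#ifndef VM_LOWER_ZONE_PROTECTION
#define VM_LOWER_ZONE_PROTECTION 20     /* added by this patch */
#endif

int main(void)
{
        int name[] = { CTL_VM, VM_LOWER_ZONE_PROTECTION };
        int value = 0;
        size_t len = sizeof(value);

        if (sysctl(name, 2, &value, &len, NULL, 0) != 0) {
                perror("sysctl");
                return 1;
        }
        printf("lower_zone_protection = %d\n", value);
        return 0;
}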
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -38,7 +38,7 @@
 unsigned long totalram_pages;
 unsigned long totalhigh_pages;
 int nr_swap_pages;
 int numnodes = 1;
+int sysctl_lower_zone_protection = 0;
 
 /*
  * Used by page_zone() to look up the address of the struct zone whose
@@ -470,6 +470,7 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
                        if (page)
                                return page;
                }
+               min += z->pages_low * sysctl_lower_zone_protection;
        }
 
        /* we're somewhat low on memory, failed to find what we needed */
@@ -492,6 +493,7 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
                        if (page)
                                return page;
                }
+               min += local_min * sysctl_lower_zone_protection;
        }
 
        /* here we're in the low on memory slow path */
@@ -529,6 +531,7 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
                        if (page)
                                return page;
                }
+               min += z->pages_low * sysctl_lower_zone_protection;
        }
 
        /*
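
To see what the three added lines do, consider a toy model (not part of
the patch) of the allocator's zone walk.  The zone names and watermark
numbers below are hypothetical; the point is that every zone an
allocation falls past inflates `min' by pages_low *
sysctl_lower_zone_protection, making each lower zone progressively
harder to take pages from:

#include <stdio.h>

struct zone {
        const char *name;
        long free_pages;
        long pages_low;
};

int main(void)
{
        /* Hypothetical fallback order for a highmem-capable
         * allocation: HighMem first, then Normal, then DMA. */
        struct zone zones[] = {
                { "HighMem",   100, 1000 },     /* nearly exhausted */
                { "Normal",  30000,  500 },
                { "DMA",      3000,  100 },
        };
        int sysctl_lower_zone_protection = 100;
        long min = 0;
        int i;

        for (i = 0; i < 3; i++) {
                struct zone *z = &zones[i];

                min += z->pages_low;    /* the plain incremental min */
                if (z->free_pages > min)
                        printf("%-7s: allocate here (free=%ld min=%ld)\n",
                               z->name, z->free_pages, min);
                else
                        printf("%-7s: skipped      (free=%ld min=%ld)\n",
                               z->name, z->free_pages, min);
                /* the line this patch adds: falling past this zone
                 * raises the bar for the next (lower) zone */
                min += z->pages_low * sysctl_lower_zone_protection;
        }
        return 0;
}

With sysctl_lower_zone_protection at zero the Normal zone would satisfy
this request; at 100 it is defended, and the allocator falls through to
its reclaim path instead.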