[PATCH] cpusets - big numa cpu and memory placement

This my cpuset patch, with the following changes in the last two weeks: 1) Updated to 2.6.8.1-mm1 2) [Simon Derr <Simon.Derr@bull.net>] Fix new cpuset to begin empty, not copied from parent. Needed to avoid breaking exclusive property. 3) [Dinakar Guniguntala <dino@in.ibm.com>] Finish initializing top cpuset from cpu_possible_map after smp_init() called. 4) [Paul Jackson <pj@sgi.com>] Check on each call to __alloc_pages() if the current tasks cpuset mems_allowed has changed. Use a cpuset generation number, bumped on any cpuset memory placement change, to make this check efficient. Update the tasks mems_allowed from its cpuset, if the cpuset has changed. 5) [Paul Jackson <pj@sgi.com>] If a task is moved to another cpuset, then update its cpus_allowed, using set_cpus_allowed(). 6) [Paul Jackson <pj@sgi.com>] Update Documentation/cpusets.txt to reflect above changes (4) and (5). I continue to recommend the following patch for inclusion in your 2.6.9-*mm series, when that opens. It provides an important facility for high performance computing on large systems. Simon Derr of Bull (France) and myself are the primary authors. Erich Focht has indicated that NEC is also a potential user of this patch on the TX-7 NUMA machines, and that he "would very much welcome the inclusion of cpusets." I offer this update to lkml, in order to invite continued feedback. The one prerequiste patch for this cpuset patch was just posted before this one. That was a patch to provide a new bitmap list format, of which cpusets is the first user. This patch has been built on top of 2.6.8.1-mm1, for the arch's: i386 x86_64 sparc ia64 powerpc-405 powerpc-750 sparc64 with and without CONFIG_CPUSET. It has been booted and tested on ia64 (sn2_defconfig, SN2 hardware). The 'alpha' arch also built, except for what seems to be an unrelated toolchain problem (crosstool ld sigsegv) in the final link step. === Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks. Cpusets constrain the CPU and Memory placement of tasks to only the processor and memory resources within a tasks current cpuset. They form a nested hierarchy visible in a virtual file system. These are the essential hooks, beyond what is already present, required to manage dynamic job placement on large systems. Cpusets require small kernel hooks in init, exit, fork, mempolicy, sched_setaffinity, page_alloc and vmscan. And they require a "struct cpuset" pointer, a cpuset_mems_generation, and a "mems_allowed" nodemask_t (to go along with the "cpus_allowed" cpumask_t that's already there) in each task struct. These hooks: 1) establish and propagate cpusets, 2) enforce CPU placement in sched_setaffinity, 3) enforce Memory placement in mbind and sys_set_mempolicy, 4) restrict page allocation and scanning to mems_allowed, and 5) restrict migration and set_cpus_allowed to cpus_allowed. The other required hook, restricting task scheduling to CPUs in a tasks cpus_allowed mask, is already present. Cpusets extend the usefulness of, the existing placement support that was added to Linux 2.6 kernels: sched_setaffinity() for CPU placement, and mbind() and set_mempolicy() for memory placement. On smaller or dedicated use systems, the existing calls are often sufficient. On larger NUMA systems, running more than one, performance critical, job, it is necessary to be able to manage jobs in their entirety. This includes providing a job with exclusive CPU and memory that no other job can use, and being able to list all tasks currently in a cpuset. A given job running within a cpuset, would likely use the existing placement calls to manage its CPU and memory placement in more detail. Cpusets are named, nested sets of CPUs and Memory Nodes. Each cpuset is represented by a directory in the cpuset virtual file system, normally mounted at /dev/cpuset. Each cpuset directory provides the following files, which can be read and written: cpus: List of CPUs allowed to tasks in that cpuset. mems: List of Memory Nodes allowed to tasks in that cpuset. tasks: List of pid's of tasks in that cpuset. cpu_exclusive: Flag (0 or 1) - if set, cpuset has exclusive use of its CPUs (no sibling or cousin cpuset may overlap CPUs). mem_exclusive: Flag (0 or 1) - if set, cpuset has exclusive use of its Memory Nodes (no sibling or cousin may overlap). notify_on_release: Flag (0 or 1) - if set, then /sbin/cpuset_release_agent will be invoked, with the name (/dev/cpuset relative path) of that cpuset in argv[1], when the last user of it (task or child cpuset) goes away. This supports automatic cleanup of abandoned cpusets. In addition one new filetype is added to the /proc file system: /proc/<pid>/cpuset: For each task (pid), list its cpuset path, relative to the root of the cpuset file system. This file is read-only. New cpusets are created using 'mkdir' (at the shell or in C). Old ones are removed using 'rmdir'. The above files are accessed using read(2) and write(2) system calls, or shell commands such as 'cat' and 'echo'. The CPUs and Memory Nodes in a given cpuset are always a subset of its parent. The root cpuset has all possible CPUs and Memory Nodes in the system. A cpuset may be exclusive (cpu or memory) only if its parent is similarly exclusive. See further Documentation/cpusets.txt, at the top of the following patch. /proc interface: It is useful, when learning and making new uses of cpusets and placement to be able to see what are the current value of a tasks cpus_allowed and mems_allowed, which are the actual placement used by the kernel scheduler and memory allocator. The cpus_allowed and mems_allowed values are needed by user space apps that are micromanaging placement, such as when moving an app to a obtained by that app within its cpuset using sched_setaffinity, mbind and set_mempolicy. The cpus_allowed value is also available via the sched_getaffinity system call. But since the entire rest of the cpuset API, including the display of mems_allowed added here, is via an ascii style presentation in /proc and /dev/cpuset, it is worth the extra couple lines of code to display cpus_allowed in the same way. This patch adds the display of these two fields to the 'status' file in the /proc/<pid> directory of each task. The fields are only added if CONFIG_CPUSETS is enabled (which is also needed to define the mems_allowed field of each task). The new output lines look like: $ tail -2 /proc/1/status Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff Mems_allowed: ffffffff,ffffffff Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Simon Derr <simon.derr@bull.net> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] cpusets - big numa cpu and memory placement
This my cpuset patch, with the following changes in the last two weeks: 1) Updated to 2.6.8.1-mm1 2) [Simon Derr <Simon.Derr@bull.net>] Fix new cpuset to begin empty, not copied from parent. Needed to avoid breaking exclusive property. 3) [Dinakar Guniguntala <dino@in.ibm.com>] Finish initializing top cpuset from cpu_possible_map after smp_init() called. 4) [Paul Jackson <pj@sgi.com>] Check on each call to __alloc_pages() if the current tasks cpuset mems_allowed has changed. Use a cpuset generation number, bumped on any cpuset memory placement change, to make this check efficient. Update the tasks mems_allowed from its cpuset, if the cpuset has changed. 5) [Paul Jackson <pj@sgi.com>] If a task is moved to another cpuset, then update its cpus_allowed, using set_cpus_allowed(). 6) [Paul Jackson <pj@sgi.com>] Update Documentation/cpusets.txt to reflect above changes (4) and (5). I continue to recommend the following patch for inclusion in your 2.6.9-*mm series, when that opens. It provides an important facility for high performance computing on large systems. Simon Derr of Bull (France) and myself are the primary authors. Erich Focht has indicated that NEC is also a potential user of this patch on the TX-7 NUMA machines, and that he "would very much welcome the inclusion of cpusets." I offer this update to lkml, in order to invite continued feedback. The one prerequiste patch for this cpuset patch was just posted before this one. That was a patch to provide a new bitmap list format, of which cpusets is the first user. This patch has been built on top of 2.6.8.1-mm1, for the arch's: i386 x86_64 sparc ia64 powerpc-405 powerpc-750 sparc64 with and without CONFIG_CPUSET. It has been booted and tested on ia64 (sn2_defconfig, SN2 hardware). The 'alpha' arch also built, except for what seems to be an unrelated toolchain problem (crosstool ld sigsegv) in the final link step. === Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks. Cpusets constrain the CPU and Memory placement of tasks to only the processor and memory resources within a tasks current cpuset. They form a nested hierarchy visible in a virtual file system. These are the essential hooks, beyond what is already present, required to manage dynamic job placement on large systems. Cpusets require small kernel hooks in init, exit, fork, mempolicy, sched_setaffinity, page_alloc and vmscan. And they require a "struct cpuset" pointer, a cpuset_mems_generation, and a "mems_allowed" nodemask_t (to go along with the "cpus_allowed" cpumask_t that's already there) in each task struct. These hooks: 1) establish and propagate cpusets, 2) enforce CPU placement in sched_setaffinity, 3) enforce Memory placement in mbind and sys_set_mempolicy, 4) restrict page allocation and scanning to mems_allowed, and 5) restrict migration and set_cpus_allowed to cpus_allowed. The other required hook, restricting task scheduling to CPUs in a tasks cpus_allowed mask, is already present. Cpusets extend the usefulness of, the existing placement support that was added to Linux 2.6 kernels: sched_setaffinity() for CPU placement, and mbind() and set_mempolicy() for memory placement. On smaller or dedicated use systems, the existing calls are often sufficient. On larger NUMA systems, running more than one, performance critical, job, it is necessary to be able to manage jobs in their entirety. This includes providing a job with exclusive CPU and memory that no other job can use, and being able to list all tasks currently in a cpuset. A given job running within a cpuset, would likely use the existing placement calls to manage its CPU and memory placement in more detail. Cpusets are named, nested sets of CPUs and Memory Nodes. Each cpuset is represented by a directory in the cpuset virtual file system, normally mounted at /dev/cpuset. Each cpuset directory provides the following files, which can be read and written: cpus: List of CPUs allowed to tasks in that cpuset. mems: List of Memory Nodes allowed to tasks in that cpuset. tasks: List of pid's of tasks in that cpuset. cpu_exclusive: Flag (0 or 1) - if set, cpuset has exclusive use of its CPUs (no sibling or cousin cpuset may overlap CPUs). mem_exclusive: Flag (0 or 1) - if set, cpuset has exclusive use of its Memory Nodes (no sibling or cousin may overlap). notify_on_release: Flag (0 or 1) - if set, then /sbin/cpuset_release_agent will be invoked, with the name (/dev/cpuset relative path) of that cpuset in argv[1], when the last user of it (task or child cpuset) goes away. This supports automatic cleanup of abandoned cpusets. In addition one new filetype is added to the /proc file system: /proc/<pid>/cpuset: For each task (pid), list its cpuset path, relative to the root of the cpuset file system. This file is read-only. New cpusets are created using 'mkdir' (at the shell or in C). Old ones are removed using 'rmdir'. The above files are accessed using read(2) and write(2) system calls, or shell commands such as 'cat' and 'echo'. The CPUs and Memory Nodes in a given cpuset are always a subset of its parent. The root cpuset has all possible CPUs and Memory Nodes in the system. A cpuset may be exclusive (cpu or memory) only if its parent is similarly exclusive. See further Documentation/cpusets.txt, at the top of the following patch. /proc interface: It is useful, when learning and making new uses of cpusets and placement to be able to see what are the current value of a tasks cpus_allowed and mems_allowed, which are the actual placement used by the kernel scheduler and memory allocator. The cpus_allowed and mems_allowed values are needed by user space apps that are micromanaging placement, such as when moving an app to a obtained by that app within its cpuset using sched_setaffinity, mbind and set_mempolicy. The cpus_allowed value is also available via the sched_getaffinity system call. But since the entire rest of the cpuset API, including the display of mems_allowed added here, is via an ascii style presentation in /proc and /dev/cpuset, it is worth the extra couple lines of code to display cpus_allowed in the same way. This patch adds the display of these two fields to the 'status' file in the /proc/<pid> directory of each task. The fields are only added if CONFIG_CPUSETS is enabled (which is also needed to define the mems_allowed field of each task). The new output lines look like: $ tail -2 /proc/1/status Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff Mems_allowed: ffffffff,ffffffff Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Simon Derr <simon.derr@bull.net> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
3a978e55 · Paul Jackson · Linus Torvalds · 794c8de9 · 3a978e55 · 3a978e55
Commit 3a978e55 authored Mar 09, 2005 by Paul Jackson Committed by Linus Torvalds Mar 09, 2005
15 changed files
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -73,6 +73,7 @@
 #include <linux/highmem.h>
 #include <linux/file.h>
 #include <linux/times.h>
+#include <linux/cpuset.h>

 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -300,6 +301,7 @@ int proc_pid_status(struct task_struct *task, char * buffer)
 	}
 	buffer = task_sig(task, buffer);
 	buffer = task_cap(task, buffer);
+	buffer = cpuset_task_status_allowed(task, buffer);
 #if defined(CONFIG_ARCH_S390)
 	buffer = task_show_regs(task, buffer);
 #endif

--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -33,6 +33,7 @@
 #include <linux/security.h>
 #include <linux/ptrace.h>
 #include <linux/seccomp.h>
+#include <linux/cpuset.h>
 #include "internal.h"

 /*
@@ -68,6 +69,9 @@ enum pid_directory_inos {
 #ifdef CONFIG_SCHEDSTATS
 	PROC_TGID_SCHEDSTAT,
 #endif
+#ifdef CONFIG_CPUSETS
+	PROC_TGID_CPUSET,
+#endif
 #ifdef CONFIG_SECURITY
 	PROC_TGID_ATTR,
 	PROC_TGID_ATTR_CURRENT,
@@ -102,6 +106,9 @@ enum pid_directory_inos {
 #ifdef CONFIG_SCHEDSTATS
 	PROC_TID_SCHEDSTAT,
 #endif
+#ifdef CONFIG_CPUSETS
+	PROC_TID_CPUSET,
+#endif
 #ifdef CONFIG_SECURITY
 	PROC_TID_ATTR,
 	PROC_TID_ATTR_CURRENT,
@@ -152,6 +159,9 @@ static struct pid_entry tgid_base_stuff[] = {
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	E(PROC_TGID_SCHEDSTAT, "schedstat", S_IFREG|S_IRUGO),
+#endif
+#ifdef CONFIG_CPUSETS
+	E(PROC_TGID_CPUSET,    "cpuset",  S_IFREG|S_IRUGO),
 #endif
 	E(PROC_TGID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO),
 	E(PROC_TGID_OOM_ADJUST,"oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
@@ -185,6 +195,9 @@ static struct pid_entry tid_base_stuff[] = {
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	E(PROC_TID_SCHEDSTAT, "schedstat",S_IFREG|S_IRUGO),
+#endif
+#ifdef CONFIG_CPUSETS
+	E(PROC_TID_CPUSET,     "cpuset",  S_IFREG|S_IRUGO),
 #endif
 	E(PROC_TID_OOM_SCORE,  "oom_score",S_IFREG|S_IRUGO),
 	E(PROC_TID_OOM_ADJUST, "oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
@@ -1556,6 +1569,12 @@ static struct dentry *proc_pident_lookup(struct inode *dir,
 			inode->i_fop = &proc_info_file_operations;
 			ei->op.proc_read = proc_pid_schedstat;
 			break;
+#endif
+#ifdef CONFIG_CPUSETS
+		case PROC_TID_CPUSET:
+		case PROC_TGID_CPUSET:
+			inode->i_fop = &proc_cpuset_operations;
+			break;
 #endif
 		case PROC_TID_OOM_SCORE:
 		case PROC_TGID_OOM_SCORE:

--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
+#ifndef _LINUX_CPUSET_H
+#define _LINUX_CPUSET_H
+/*
+ *  cpuset interface
+ *
+ *  Copyright (C) 2003 BULL SA
+ *  Copyright (C) 2004 Silicon Graphics, Inc.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <linux/nodemask.h>
+
+#ifdef CONFIG_CPUSETS
+
+extern int cpuset_init(void);
+extern void cpuset_init_smp(void);
+extern void cpuset_fork(struct task_struct *p);
+extern void cpuset_exit(struct task_struct *p);
+extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
+void cpuset_init_current_mems_allowed(void);
+void cpuset_update_current_mems_allowed(void);
+void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
+int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_zone_allowed(struct zone *z);
+extern struct file_operations proc_cpuset_operations;
+extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
+
+#else /* !CONFIG_CPUSETS */
+
+static inline int cpuset_init(void) { return 0; }
+static inline void cpuset_init_smp(void) {}
+static inline void cpuset_fork(struct task_struct *p) {}
+static inline void cpuset_exit(struct task_struct *p) {}
+
+static inline const cpumask_t cpuset_cpus_allowed(struct task_struct *p)
+{
+	return cpu_possible_map;
+}
+
+static inline void cpuset_init_current_mems_allowed(void) {}
+static inline void cpuset_update_current_mems_allowed(void) {}
+static inline void cpuset_restrict_to_mems_allowed(unsigned long *nodes) {}
+
+static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+{
+	return 1;
+}
+
+static inline int cpuset_zone_allowed(struct zone *z)
+{
+	return 1;
+}
+
+static inline char *cpuset_task_status_allowed(struct task_struct *task,
+							char *buffer)
+{
+	return buffer;
+}
+
+#endif /* !CONFIG_CPUSETS */
+
+#endif /* _LINUX_CPUSET_H */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -14,6 +14,7 @@
 #include <linux/thread_info.h>
 #include <linux/cpumask.h>
 #include <linux/errno.h>
+#include <linux/nodemask.h>

 #include <asm/system.h>
 #include <asm/semaphore.h>
@@ -514,6 +515,7 @@ extern void cpu_attach_domain(struct sched_domain *sd, int cpu);

 struct io_context;			/* See blkdev.h */
 void exit_io_context(void);
+struct cpuset;

 #define NGROUPS_SMALL		32
 #define NGROUPS_PER_BLOCK	((int)(PAGE_SIZE / sizeof(gid_t)))
@@ -712,6 +714,11 @@ struct task_struct {
  	struct mempolicy *mempolicy;
 	short il_next;
 #endif
+#ifdef CONFIG_CPUSETS
+	struct cpuset *cpuset;
+	nodemask_t mems_allowed;
+	int cpuset_mems_generation;
+#endif
 };

 static inline pid_t process_group(struct task_struct *tsk)

--- a/init/Kconfig
+++ b/init/Kconfig
@@ -237,6 +237,16 @@ config IKCONFIG_PROC
 	  This option enables access to the kernel configuration file
 	  through /proc/config.gz.

+config CPUSETS
+	bool "Cpuset support"
+	depends on SMP
+	help
+	  This options will let you create and manage CPUSET's which
+	  allow dynamically partitioning a system into sets of CPUs and
+	  Memory Nodes and assigning tasks to run only within those sets.
+	  This is primarily useful on large SMP or NUMA systems.
+
+	  Say N if unsure.

 menuconfig EMBEDDED
 	bool "Configure standard kernel features (for small systems)"

--- a/init/main.c
+++ b/init/main.c
@@ -41,6 +41,7 @@
 #include <linux/kallsyms.h>
 #include <linux/writeback.h>
 #include <linux/cpu.h>
+#include <linux/cpuset.h>
 #include <linux/efi.h>
 #include <linux/unistd.h>
 #include <linux/rmap.h>
@@ -515,6 +516,8 @@ asmlinkage void __init start_kernel(void)
 #ifdef CONFIG_PROC_FS
 	proc_root_init();
 #endif
+	cpuset_init();
+
 	check_bugs();

 	acpi_early_init(); /* before LAPIC and SMP init */
@@ -656,6 +659,8 @@ static int init(void * unused)
 	smp_init();
 	sched_init_smp();

+	cpuset_init_smp();
+
 	/*
 	 * Do this before initcalls, because some drivers want to access
 	 * firmware files.

--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_KALLSYMS) += kallsyms.o
 obj-$(CONFIG_PM) += power/
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_COMPAT) += compat.o
+obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_IKCONFIG) += configs.o
 obj-$(CONFIG_IKCONFIG_PROC) += configs.o
 obj-$(CONFIG_STOP_MACHINE) += stop_machine.o

--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -25,6 +25,7 @@
 #include <linux/mount.h>
 #include <linux/proc_fs.h>
 #include <linux/mempolicy.h>
+#include <linux/cpuset.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -820,6 +821,7 @@ fastcall NORET_TYPE void do_exit(long code)
 	__exit_fs(tsk);
 	exit_namespace(tsk);
 	exit_thread();
+	cpuset_exit(tsk);
 	exit_keys(tsk);

 	if (group_dead && tsk->signal->leader)

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -29,6 +29,7 @@
 #include <linux/mman.h>
 #include <linux/fs.h>
 #include <linux/cpu.h>
+#include <linux/cpuset.h>
 #include <linux/security.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
@@ -1069,6 +1070,8 @@ static task_t *copy_process(unsigned long clone_flags,
 	if (unlikely(p->ptrace & PT_PTRACED))
 		__ptrace_link(p, current->parent);

+	cpuset_fork(p);
+
 	attach_pid(p, PIDTYPE_PID, p->pid);
 	attach_pid(p, PIDTYPE_TGID, p->tgid);
 	if (thread_group_leader(p)) {

--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -40,6 +40,7 @@
 #include <linux/timer.h>
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
+#include <linux/cpuset.h>
 #include <linux/percpu.h>
 #include <linux/kthread.h>
 #include <linux/seq_file.h>
@@ -3539,6 +3540,7 @@ long sched_setaffinity(pid_t pid, cpumask_t new_mask)
 {
 	task_t *p;
 	int retval;
+	cpumask_t cpus_allowed;

 	lock_cpu_hotplug();
 	read_lock(&tasklist_lock);
@@ -3563,6 +3565,8 @@ long sched_setaffinity(pid_t pid, cpumask_t new_mask)
 			!capable(CAP_SYS_NICE))
 		goto out_unlock;

+	cpus_allowed = cpuset_cpus_allowed(p);
+	cpus_and(new_mask, new_mask, cpus_allowed);
 	retval = set_cpus_allowed(p, new_mask);

 out_unlock:
@@ -4240,7 +4244,7 @@ static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *tsk)

 	/* No more Mr. Nice Guy. */
 	if (dest_cpu == NR_CPUS) {
-		cpus_setall(tsk->cpus_allowed);
+		tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
 		dest_cpu = any_online_cpu(tsk->cpus_allowed);

 		/*

--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -67,6 +67,7 @@
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/nodemask.h>
+#include <linux/cpuset.h>
 #include <linux/gfp.h>
 #include <linux/slab.h>
 #include <linux/string.h>
@@ -167,6 +168,10 @@ static int get_nodes(unsigned long *nodes, unsigned long __user *nmask,
 	if (copy_from_user(nodes, nmask, nlongs*sizeof(unsigned long)))
 		return -EFAULT;
 	nodes[nlongs-1] &= endmask;
+	/* Update current mems_allowed */
+	cpuset_update_current_mems_allowed();
+	/* Ignore nodes not set in current->mems_allowed */
+	cpuset_restrict_to_mems_allowed(nodes);
 	return mpol_check_policy(mode, nodes);
 }

@@ -655,8 +660,10 @@ static struct zonelist *zonelist_policy(unsigned gfp, struct mempolicy *policy)
 		break;
 	case MPOL_BIND:
 		/* Lower zones don't get a policy applied */
+		/* Careful: current->mems_allowed might have moved */
 		if (gfp >= policy_zone)
-			return policy->v.zonelist;
+			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
+				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -747,6 +754,8 @@ alloc_page_vma(unsigned gfp, struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol = get_vma_policy(vma, addr);

+	cpuset_update_current_mems_allowed();
+
 	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
 		unsigned nid;
 		if (vma) {
@@ -784,6 +793,8 @@ struct page *alloc_pages_current(unsigned gfp, unsigned order)
 {
 	struct mempolicy *pol = current->mempolicy;

+	if (!in_interrupt())
+		cpuset_update_current_mems_allowed();
 	if (!pol || in_interrupt())
 		pol = &default_policy;
 	if (pol->policy == MPOL_INTERLEAVE)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -31,6 +31,7 @@
 #include <linux/topology.h>
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
+#include <linux/cpuset.h>
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>

@@ -765,6 +766,9 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 				       classzone_idx, 0, 0))
 			continue;

+		if (!cpuset_zone_allowed(z))
+			continue;
+
 		page = buffered_rmqueue(z, order, gfp_mask);
 		if (page)
 			goto got_pg;
@@ -783,6 +787,9 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 				       gfp_mask & __GFP_HIGH))
 			continue;

+		if (!cpuset_zone_allowed(z))
+			continue;
+
 		page = buffered_rmqueue(z, order, gfp_mask);
 		if (page)
 			goto got_pg;
@@ -792,6 +799,8 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) && !in_interrupt()) {
 		/* go through the zonelist yet again, ignoring mins */
 		for (i = 0; (z = zones[i]) != NULL; i++) {
+			if (!cpuset_zone_allowed(z))
+				continue;
 			page = buffered_rmqueue(z, order, gfp_mask);
 			if (page)
 				goto got_pg;
@@ -831,6 +840,9 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 					       gfp_mask & __GFP_HIGH))
 				continue;

+			if (!cpuset_zone_allowed(z))
+				continue;
+
 			page = buffered_rmqueue(z, order, gfp_mask);
 			if (page)
 				goto got_pg;
@@ -847,6 +859,9 @@ __alloc_pages(unsigned int gfp_mask, unsigned int order,
 					       classzone_idx, 0, 0))
 				continue;

+			if (!cpuset_zone_allowed(z))
+				continue;
+
 			page = buffered_rmqueue(z, order, gfp_mask);
 			if (page)
 				goto got_pg;
@@ -1490,6 +1505,7 @@ void __init build_all_zonelists(void)
 	for_each_online_node(i)
 		build_zonelists(NODE_DATA(i));
 	printk("Built %i zonelists\n", num_online_nodes());
+	cpuset_init_current_mems_allowed();
 }

 /*

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -30,6 +30,7 @@
 #include <linux/rmap.h>
 #include <linux/topology.h>
 #include <linux/cpu.h>
+#include <linux/cpuset.h>
 #include <linux/notifier.h>
 #include <linux/rwsem.h>

@@ -865,6 +866,9 @@ shrink_caches(struct zone **zones, struct scan_control *sc)
 		if (zone->present_pages == 0)
 			continue;

+		if (!cpuset_zone_allowed(zone))
+			continue;
+
 		zone->temp_priority = sc->priority;
 		if (zone->prev_priority > sc->priority)
 			zone->prev_priority = sc->priority;
@@ -908,6 +912,9 @@ int try_to_free_pages(struct zone **zones,
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];

+		if (!cpuset_zone_allowed(zone))
+			continue;
+
 		zone->temp_priority = DEF_PRIORITY;
 		lru_pages += zone->nr_active + zone->nr_inactive;
 	}
@@ -948,8 +955,14 @@ int try_to_free_pages(struct zone **zones,
 			blk_congestion_wait(WRITE, HZ/10);
 	}
 out:
-	for (i = 0; zones[i] != 0; i++)
-		zones[i]->prev_priority = zones[i]->temp_priority;
+	for (i = 0; zones[i] != 0; i++) {
+		struct zone *zone = zones[i];
+
+		if (!cpuset_zone_allowed(zone))
+			continue;
+
+		zone->prev_priority = zone->temp_priority;
+	}
 	return ret;
 }

@@ -1210,6 +1223,8 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	if (!cpuset_zone_allowed(zone))
+		return;
 	if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
 		return;
 	wake_up_interruptible(&zone->zone_pgdat->kswapd_wait);