sched/numa: Avoid migrating task to CPU-less node

In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA nodes. But if the number of the hint page faults on a PMEM node is the max for a task, The current NUMA balancing policy may try to place the task on the PMEM node instead of DRAM node. This is unreasonable, because there's no CPU in PMEM NUMA nodes. To fix this, CPU-less nodes are ignored when searching the migration target node for a task in this patch. To test the patch, we run a workload that accesses more memory in PMEM node than memory in DRAM node. Without the patch, the PMEM node will be chosen as preferred node in task_numa_placement(). While the DRAM node will be chosen instead with the patch. Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com

sched/numa: Avoid migrating task to CPU-less node
In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA nodes. But if the number of the hint page faults on a PMEM node is the max for a task, The current NUMA balancing policy may try to place the task on the PMEM node instead of DRAM node. This is unreasonable, because there's no CPU in PMEM NUMA nodes. To fix this, CPU-less nodes are ignored when searching the migration target node for a task in this patch. To test the patch, we run a workload that accesses more memory in PMEM node than memory in DRAM node. Without the patch, the PMEM node will be chosen as preferred node in task_numa_placement(). While the DRAM node will be chosen instead with the patch. Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com
5c7b1aaf · Huang Ying · Peter Zijlstra · 0fb3978b · 5c7b1aaf
Commit 5c7b1aaf authored Feb 14, 2022 by Huang Ying Committed by Peter Zijlstra Feb 16, 2022
Show whitespace changes
Inline Side-by-side

Showing with 20 additions and 5 deletions

kernel/sched/fair.c kernel/sched/fair.c +20 -5

No files found.
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1989,7 +1989,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	ng = deref_curr_numa_group(p);
 	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
-		for_each_online_node(nid) {
+		for_each_node_state(nid, N_CPU) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;

@@ -2087,13 +2087,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
 	unsigned long faults, max_faults = 0;
 	int nid, active_nodes = 0;

-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults > max_faults)
 			max_faults = faults;
 	}

-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults * ACTIVE_NODE_FRACTION > max_faults)
 			active_nodes++;
@@ -2247,7 +2247,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)

 		dist = sched_max_numa_distance;

-		for_each_online_node(node) {
+		for_each_node_state(node, N_CPU) {
 			score = group_weight(p, node, dist);
 			if (score > max_score) {
 				max_score = score;
@@ -2266,7 +2266,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 	 * inside the highest scoring group of nodes. The nodemask tricks
 	 * keep the complexity of the search down.
 	 */
-	nodes = node_online_map;
+	nodes = node_states[N_CPU];
 	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
 		unsigned long max_faults = 0;
 		nodemask_t max_group = NODE_MASK_NONE;
@@ -2405,6 +2405,21 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}

+	/* Cannot migrate task to CPU-less node */
+	if (!node_state(max_nid, N_CPU)) {
+		int near_nid = max_nid;
+		int distance, near_distance = INT_MAX;
+
+		for_each_node_state(nid, N_CPU) {
+			distance = node_distance(max_nid, nid);
+			if (distance < near_distance) {
+				near_nid = nid;
+				near_distance = distance;
+			}
+		}
+		max_nid = near_nid;
+	}
+
 	if (ng) {
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);