sched/numa: Fix NUMA topology for systems with CPU-less nodes

The NUMA topology parameters (sched_numa_topology_type, sched_domains_numa_levels, and sched_max_numa_distance, etc.) identified by scheduler may be wrong for systems with CPU-less nodes. For example, the ACPI SLIT of a system with CPU-less persistent memory (Intel Optane DCPMM) nodes is as follows, [000h 0000 4] Signature : "SLIT" [System Locality Information Table] [004h 0004 4] Table Length : 0000042C [008h 0008 1] Revision : 01 [009h 0009 1] Checksum : 59 [00Ah 0010 6] Oem ID : "XXXX" [010h 0016 8] Oem Table ID : "XXXXXXX" [018h 0024 4] Oem Revision : 00000001 [01Ch 0028 4] Asl Compiler ID : "INTL" [020h 0032 4] Asl Compiler Revision : 20091013 [024h 0036 8] Localities : 0000000000000004 [02Ch 0044 4] Locality 0 : 0A 15 11 1C [030h 0048 4] Locality 1 : 15 0A 1C 11 [034h 0052 4] Locality 2 : 11 1C 0A 1C [038h 0056 4] Locality 3 : 1C 11 1C 0A While the `numactl -H` output is as follows, available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 64136 MB node 0 free: 5981 MB node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 1 size: 64466 MB node 1 free: 10415 MB node 2 cpus: node 2 size: 253952 MB node 2 free: 253920 MB node 3 cpus: node 3 size: 253952 MB node 3 free: 253951 MB node distances: node 0 1 2 3 0: 10 21 17 28 1: 21 10 28 17 2: 17 28 10 28 3: 28 17 28 10 In this system, there are only 2 sockets. In each memory controller, both DRAM and PMEM DIMMs are installed. Although the physical NUMA topology is simple, the logical NUMA topology becomes a little complex. Because both the distance(0, 1) and distance (1, 3) are less than the distance (0, 3), it appears that node 1 sits between node 0 and node 3. And the whole system appears to be a glueless mesh NUMA topology type. But it's definitely not, there is even no CPU in node 3. This isn't a practical problem now yet. Because the PMEM nodes (node 2 and node 3 in example system) are offlined by default during system boot. So init_numa_topology_type() called during system boot will ignore them and set sched_numa_topology_type to NUMA_DIRECT. And init_numa_topology_type() is only called at runtime when a CPU of a never-onlined-before node gets plugged in. And there's no CPU in the PMEM nodes. But it appears better to fix this to make the code more robust. To test the potential problem. We have used a debug patch to call init_numa_topology_type() when the PMEM node is onlined (in __set_migration_target_nodes()). With that, the NUMA parameters identified by scheduler is as follows, sched_numa_topology_type: NUMA_GLUELESS_MESH sched_domains_numa_levels: 4 sched_max_numa_distance: 28 To fix the issue, the CPU-less nodes are ignored when the NUMA topology parameters are identified. Because a node may become CPU-less or not at run time because of CPU hotplug, the NUMA topology parameters need to be re-initialized at runtime for CPU hotplug too. With the patch, the NUMA parameters identified for the example system above is as follows, sched_numa_topology_type: NUMA_DIRECT sched_domains_numa_levels: 2 sched_max_numa_distance: 21 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com

sched/numa: Fix NUMA topology for systems with CPU-less nodes
The NUMA topology parameters (sched_numa_topology_type, sched_domains_numa_levels, and sched_max_numa_distance, etc.) identified by scheduler may be wrong for systems with CPU-less nodes. For example, the ACPI SLIT of a system with CPU-less persistent memory (Intel Optane DCPMM) nodes is as follows, [000h 0000 4] Signature : "SLIT" [System Locality Information Table] [004h 0004 4] Table Length : 0000042C [008h 0008 1] Revision : 01 [009h 0009 1] Checksum : 59 [00Ah 0010 6] Oem ID : "XXXX" [010h 0016 8] Oem Table ID : "XXXXXXX" [018h 0024 4] Oem Revision : 00000001 [01Ch 0028 4] Asl Compiler ID : "INTL" [020h 0032 4] Asl Compiler Revision : 20091013 [024h 0036 8] Localities : 0000000000000004 [02Ch 0044 4] Locality 0 : 0A 15 11 1C [030h 0048 4] Locality 1 : 15 0A 1C 11 [034h 0052 4] Locality 2 : 11 1C 0A 1C [038h 0056 4] Locality 3 : 1C 11 1C 0A While the `numactl -H` output is as follows, available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 64136 MB node 0 free: 5981 MB node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 1 size: 64466 MB node 1 free: 10415 MB node 2 cpus: node 2 size: 253952 MB node 2 free: 253920 MB node 3 cpus: node 3 size: 253952 MB node 3 free: 253951 MB node distances: node 0 1 2 3 0: 10 21 17 28 1: 21 10 28 17 2: 17 28 10 28 3: 28 17 28 10 In this system, there are only 2 sockets. In each memory controller, both DRAM and PMEM DIMMs are installed. Although the physical NUMA topology is simple, the logical NUMA topology becomes a little complex. Because both the distance(0, 1) and distance (1, 3) are less than the distance (0, 3), it appears that node 1 sits between node 0 and node 3. And the whole system appears to be a glueless mesh NUMA topology type. But it's definitely not, there is even no CPU in node 3. This isn't a practical problem now yet. Because the PMEM nodes (node 2 and node 3 in example system) are offlined by default during system boot. So init_numa_topology_type() called during system boot will ignore them and set sched_numa_topology_type to NUMA_DIRECT. And init_numa_topology_type() is only called at runtime when a CPU of a never-onlined-before node gets plugged in. And there's no CPU in the PMEM nodes. But it appears better to fix this to make the code more robust. To test the potential problem. We have used a debug patch to call init_numa_topology_type() when the PMEM node is onlined (in __set_migration_target_nodes()). With that, the NUMA parameters identified by scheduler is as follows, sched_numa_topology_type: NUMA_GLUELESS_MESH sched_domains_numa_levels: 4 sched_max_numa_distance: 28 To fix the issue, the CPU-less nodes are ignored when the NUMA topology parameters are identified. Because a node may become CPU-less or not at run time because of CPU hotplug, the NUMA topology parameters need to be re-initialized at runtime for CPU hotplug too. With the patch, the NUMA parameters identified for the example system above is as follows, sched_numa_topology_type: NUMA_DIRECT sched_domains_numa_levels: 2 sched_max_numa_distance: 21 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com
0fb3978b · Huang Ying · Peter Zijlstra · 1087ad4e · 0fb3978b · 0fb3978b
Commit 0fb3978b authored Feb 14, 2022 by Huang Ying Committed by Peter Zijlstra Feb 16, 2022
Showing with 137 additions and 95 deletions

kernel/sched/core.c kernel/sched/core.c +4 -1

kernel/sched/fair.c kernel/sched/fair.c +8 -7

kernel/sched/sched.h kernel/sched/sched.h +4 -2

kernel/sched/topology.c kernel/sched/topology.c +121 -85

No files found.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9052,6 +9052,7 @@ int sched_cpu_activate(unsigned int cpu)
 	set_cpu_active(cpu, true);

 	if (sched_smp_initialized) {
+		sched_update_numa(cpu, true);
 		sched_domains_numa_masks_set(cpu);
 		cpuset_cpu_active();
 	}
@@ -9130,10 +9131,12 @@ int sched_cpu_deactivate(unsigned int cpu)
 	if (!sched_smp_initialized)
 		return 0;

+	sched_update_numa(cpu, false);
 	ret = cpuset_cpu_inactive(cpu);
 	if (ret) {
 		balance_push_set(cpu, false);
 		set_cpu_active(cpu, true);
+		sched_update_numa(cpu, true);
 		return ret;
 	}
 	sched_domains_numa_masks_clear(cpu);
@@ -9236,7 +9239,7 @@ int sched_cpu_dying(unsigned int cpu)

 void __init sched_init_smp(void)
 {
-	sched_init_numa();
+	sched_init_numa(NUMA_NO_NODE);

 	/*
 	 * There's no userspace yet to cause hotplug operations; hence all the

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1259,10 +1259,10 @@ static bool numa_is_active_node(int nid, struct numa_group *ng)

 /* Handle placement on systems where not all nodes are directly connected. */
 static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
-					int maxdist, bool task)
+					int lim_dist, bool task)
 {
 	unsigned long score = 0;
-	int node;
+	int node, max_dist;

 	/*
 	 * All nodes are directly connected, and the same distance
@@ -1271,6 +1271,8 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 	if (sched_numa_topology_type == NUMA_DIRECT)
 		return 0;

+	/* sched_max_numa_distance may be changed in parallel. */
+	max_dist = READ_ONCE(sched_max_numa_distance);
 	/*
 	 * This code is called for each node, introducing N^2 complexity,
 	 * which should be ok given the number of nodes rarely exceeds 8.
@@ -1283,7 +1285,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * The furthest away nodes in the system are not interesting
 		 * for placement; nid was already counted.
 		 */
-		if (dist == sched_max_numa_distance || node == nid)
+		if (dist >= max_dist || node == nid)
 			continue;

 		/*
@@ -1293,8 +1295,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * "hoplimit", only nodes closer by than "hoplimit" are part
 		 * of each group. Skip other nodes.
 		 */
-		if (sched_numa_topology_type == NUMA_BACKPLANE &&
-					dist >= maxdist)
+		if (sched_numa_topology_type == NUMA_BACKPLANE && dist >= lim_dist)
 			continue;

 		/* Add up the faults from nearby nodes. */
@@ -1312,8 +1313,8 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * This seems to result in good task placement.
 		 */
 		if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
-			faults *= (sched_max_numa_distance - dist);
-			faults /= (sched_max_numa_distance - LOCAL_DISTANCE);
+			faults *= (max_dist - dist);
+			faults /= (max_dist - LOCAL_DISTANCE);
 		}

 		score += faults;

--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1662,12 +1662,14 @@ enum numa_topology_type {
 extern enum numa_topology_type sched_numa_topology_type;
 extern int sched_max_numa_distance;
 extern bool find_numa_distance(int distance);
-extern void sched_init_numa(void);
+extern void sched_init_numa(int offline_node);
+extern void sched_update_numa(int cpu, bool online);
 extern void sched_domains_numa_masks_set(unsigned int cpu);
 extern void sched_domains_numa_masks_clear(unsigned int cpu);
 extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
 #else
-static inline void sched_init_numa(void) { }
+static inline void sched_init_numa(int offline_node) { }
+static inline void sched_update_numa(int cpu, bool online) { }
 static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
 static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
 static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c