[PATCH] sched: scheduler domain support
From: Nick Piggin <piggin@cyberone.com.au>

This is the core sched domains patch.  It can handle any number of levels
in a scheduling hierarchy, and allows architectures to easily customize
how the scheduler behaves.  It also provides the progressive balancing
backoff needed by SGI on their large systems (although they have not yet
tested it).

It is built on top of (well, uses ideas from) my previous SMP/NUMA work,
and gets results very similar to it when using the default scheduling
description.

Benchmarks
==========

Martin was seeing I think 10-20% better system times in kernbench on the
32-way.  I was seeing improvements in dbench, tbench, kernbench, reaim and
hackbench on a 16-way NUMAQ.  Hackbench in fact had a non-linear element
which is all but eliminated.  Large improvements in volanomark.

Cross-node task migration was decreased in all of the above benchmarks,
sometimes by a factor of 100!!  Cross-CPU migration was also generally
decreased.  See this post:
http://groups.google.com.au/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=a406c910b30cbac4&seekm=UAdQ.3hj.5%40gated-at.bofh.it#link2

Results on a hyperthreading P4 are equivalent to Ingo's shared runqueues
patch (which is a big improvement).

Some examples on the 16-way NUMAQ (this is slightly older sched domain
code):

http://www.kerneltrap.org/~npiggin/w26/hbench.png
http://www.kerneltrap.org/~npiggin/w26/vmark.html


From: Jes Sorensen <jes@wildopensource.com>

Tiny patch to make -mm3 compile on a NUMA box with NR_CPUS > BITS_PER_LONG.


From: "Martin J. Bligh" <mbligh@aracnet.com>

Fix a minor nit with the find_busiest_group code.  No functional change,
but it makes the code simpler and clearer.  This patch does two things:
it adds some more expansive comments, and it removes this if clause:

	if (*imbalance < SCHED_LOAD_SCALE &&
			max_load - this_load > SCHED_LOAD_SCALE)
		*imbalance = SCHED_LOAD_SCALE;

If we remove the scaling factor, we're basically conditionally doing:

	if (*imbalance < 1)
		*imbalance = 1;

Which is pointless, as the very next thing we do is remove the scaling
factor, rounding up to the nearest integer as we do:

	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;

Thus the if statement is redundant, and only makes the code harder to read
;-)  (A small standalone illustration of this arithmetic appears at the
end of this changelog.)


From: Rick Lindsley <ricklind@us.ibm.com>

In find_busiest_group(), after we exit the do/while loop, we select our
imbalance.  But max_load, avg_load, and this_load are all unsigned, so
min(x,y) will make a bad choice if max_load < avg_load < this_load (that
is, a choice between two "negative" [very large] numbers).

Unfortunately, there is also a bug when max_load never gets changed from
zero (look in the loop and think about what happens if the only load on
the machine is being created by cpu groups of which we are a member).
Together, these are a recipe for some really bogus values of imbalance.

Even if you fix the max_load == 0 bug, there will still be times when
avg_load - this_load is negative (thus very large) and you'll make the
decision to move stuff when you shouldn't have.

This patch allows this_load to set max_load, which, if I understand the
logic properly, is correct.  With this patch applied, the algorithm is
*much* more conservative ... maybe *too* conservative, but that's for
another round of testing ...  (A standalone sketch of the unsigned
underflow appears at the end of this changelog.)


From: Ingo Molnar <mingo@elte.hu>

sched-find-busiest-fix
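
Illustration of the redundant clamp Martin removed.  This is a standalone
sketch, not the kernel code; the value of SCHED_LOAD_SHIFT (7, i.e.
SCHED_LOAD_SCALE == 128) is assumed here for illustration only:

	#include <stdio.h>

	#define SCHED_LOAD_SHIFT	7	/* assumed value, for illustration */
	#define SCHED_LOAD_SCALE	(1UL << SCHED_LOAD_SHIFT)

	int main(void)
	{
		/* a fractional imbalance: less than one task's worth of load */
		unsigned long imbalance = 50;

		/*
		 * The removed clause would have bumped this up to
		 * SCHED_LOAD_SCALE (exactly one task).  The ceiling division
		 * below already turns any nonzero fractional imbalance into
		 * one task, so the clamp changed nothing.
		 */
		unsigned long tasks =
			(imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;

		printf("%lu\n", tasks);	/* prints 1, with or without the clamp */
		return 0;
	}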
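
Rick's unsigned underflow can likewise be seen in a few lines.  Again this
is a standalone sketch with assumed values and a simplified imbalance
expression, not the actual find_busiest_group() code:

	#include <stdio.h>

	#define SCHED_LOAD_SHIFT	7	/* assumed value, for illustration */
	#define SCHED_LOAD_SCALE	(1UL << SCHED_LOAD_SHIFT)
	#define min(a, b)		((a) < (b) ? (a) : (b))

	int main(void)
	{
		/*
		 * Pretend all of the machine's load sits in groups this CPU
		 * belongs to: max_load is never raised above zero, yet
		 * this_load is large.
		 */
		unsigned long this_load = 3 * SCHED_LOAD_SCALE;
		unsigned long avg_load  = 2 * SCHED_LOAD_SCALE;
		unsigned long max_load  = 0;
		unsigned long imbalance;

		/*
		 * Simplified version of the imbalance selection: both
		 * subtractions wrap around because the operands are unsigned,
		 * so min() picks between two enormous values and we decide to
		 * migrate tasks that should not move.
		 */
		imbalance = min(max_load - avg_load, avg_load - this_load);
		printf("bogus imbalance: %lu\n", imbalance);

		/*
		 * The fix described above lets this_load set max_load, so a
		 * group that is no busier than our own is not treated as a
		 * source of imbalance -- the more conservative behaviour the
		 * changelog mentions.
		 */
		return 0;
	}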