[PATCH] sched: improve wakeup-affinity

David Mosberger noticed bw_pipe was way down on sched-domains kernels on
SMP systems.

That is due to two things: first, the previous wake-affine logic would
*always* move a pipe wakee onto the waker's CPU. With the scheduler rework,
this was toned down a lot (but extended to all types of wakeups).

One of the ways this was damped was with the logic: don't move the wakee if
its CPU is relatively idle compared to the waker's CPU. Without this, some
workloads would pile everything up onto a few CPUs and get lots of idle
time.

However, the fix was a bit of a blunt hack: if the wakee runqueue was below
50% busy, and the waker's was above 50% busy, we wouldn't do the move. I
think a better way to capture it is what this patch does: if the wakee
runqueue is below 100% busy, and the sum of the two runqueue's loads is
above 100% busy, and the wakee runqueue is less busy than the waker
runqueue (ie. CPU utilisation would drop if we do the move), then we don't
do the move.

After I fixed this, I found things were still getting bounced around quite
a bit. The reason is that we were attempting very aggressive idle balancing
in order to cut down idle time in a dbt2-pgsql workload, which is
particularly sensitive to idle.

After having Mark Wong (markw@osdl.org) retest this load with this patch,
it looks like we don't need to be so aggressive. I'm glad to be rid of this
because it never sat too well with me. We should see slightly lower cost of
schedule and slightly improved cache impact with this change too.

Mark said:
---
This looks pretty good:

 metric  kernel
 2334    2.6.7-rc2
 2298    2.6.7-rc2-mm2
 2329    2.6.7-rc2-mm2-sched-more-wakeaffine
---

ie. within the noise.

David said:
---
Oooh, me likeee!

Host       OS             Pipe  AF UNIX
---------  -------------  ----  -------
caldera.h  Linux 2.6.6    3424  2057     (plain 2.6.6)
caldera.h  Linux 2.6.7-r  333.  1402     (original 2.6.7-rc1)
caldera.h  Linux 2.6.7-r  3086  4301     (2.6.7-rc1 with your patch)

Pipe-bandwidth is still down about 10% but that may be due to unrelated
changes (or perhaps warmup effects?). The AF UNIX bandwidth is just
mindboggling. Moreover, with your patch 2.6.7-rc1 shows better
context-switch times and lower communication latencies (more like the
numbers you're getting on UP).

So it seems like the overall balance of keeping things on the same CPU vs.
distributing them across CPUs is improved.
---

I also ran some tests on the NUMAQ. kernbench, dbench, hackbench, reaim
were much the same. tbench was improved, very much so when clients <
NR_CPU.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
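
For illustration only, the old and new "don't move the wakee" conditions
described in the changelog can be sketched as two predicates. This is not
the code from the patch: FULL_LOAD and both function names are hypothetical,
and the loads are assumed to be expressed in a unit where FULL_LOAD means
"100% busy".

```c
#include <stdbool.h>

/* Hypothetical load unit: FULL_LOAD stands for a runqueue that is 100% busy. */
#define FULL_LOAD 128UL

/*
 * Old rule as described in the changelog: refuse the move when the wakee
 * runqueue is below 50% busy and the waker runqueue is above 50% busy.
 */
static bool old_keep_wakee(unsigned long wakee_load, unsigned long waker_load)
{
	return wakee_load < FULL_LOAD / 2 && waker_load > FULL_LOAD / 2;
}

/*
 * New rule as described in the changelog: refuse the move only when
 *  - the wakee runqueue is below 100% busy,
 *  - the combined load of both runqueues is above 100% busy, and
 *  - the wakee runqueue is less busy than the waker runqueue,
 * ie. when moving the wakee would lower overall CPU utilisation.
 */
static bool new_keep_wakee(unsigned long wakee_load, unsigned long waker_load)
{
	return wakee_load < FULL_LOAD &&
	       wakee_load + waker_load > FULL_LOAD &&
	       wakee_load < waker_load;
}
```

Under these assumptions, a lightly loaded pair such as wakee_load = 10 and
waker_load = 70 is held in place by the old rule but may be moved under the
new one, because the combined load still fits on one CPU. That is the
pipe-style case the changelog is concerned with.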