[PATCH] ia32 IRQ distribution rework
Patch from "Kamble, Nitin A" <nitin.a.kamble@intel.com> Hello All, We were looking at the performance impact of the IRQ routing from the 2.5.52 Linux kernel. This email includes some of our findings about the way the interrupts are getting moved in the 2.5.52 kernel. Also there is discussion and a patch for a new implementation. Let me know what you think at nitin.a.kamble@intel.com Current implementation: ====================== We have found that the existing implementation works well on IA32 SMP systems with light load of interrupts. Also we noticed that it is not working that well under heavy interrupt load conditions on these SMP systems. The observations are: * Interrupt load of each IRQ is getting balanced on CPUs independent of load of other IRQs. Also the current implementation moves the IRQs randomly. This works well when the interrupt load is light. But we start seeing imbalance of interrupt load with existence of multiple heavy interrupt sources. Frequently multiple heavily loaded IRQs gets moved to a single CPU while other CPUs stay very lightly loaded. To achieve a good interrupts load balance, it is important to consider the load of all the interrupts together. This further can be explained with an example of 4 CPUs and 4 heavy interrupt sources. With the existing random movement approach, the chance of each of these heavy interrupt sources moving to separate CPUs is: (4/4)*(3/4)*(2/4)*(1/4) = 3/16. It means 13/16 = 81.25% of the time the situation is, some CPUs are very lightly loaded and some are loaded with multiple heavy interrupts. This causes the interrupt load imbalance and results in less performance. In a case of 2 CPUs and 2 heavily loaded interrupt sources, this imbalance happens 1/2 = 50% of the times. This issue becomes more and more severe with increasing number of heavy interrupt sources. * Another interesting observation is: We cannot see the imbalance of the interrupt load from /proc/interrupts. (/proc/interrupts shows the cumulative load of interrupts on all CPUs.) If the interrupt load is imbalanced and this imbalance is getting rotated among CPUs continuously, then /proc/interrupts will still show that the interrupt load is going to processors very evenly. Currently at the frequency (HZ/50) at which IRQs are moved across CPUs, it is not possible to see any interrupt load imbalance happening. * We have also found that, in certain cases the static IRQ binding performs better than the existing kernel distribution of interrupt load. The reason is, in a well-balanced interrupt load situations, these interrupts are unnecessarily getting frequently moved across CPUs. This adds an extra overhead; also it takes off the CPU cache warmth benefits. This came out from the performance measurements done on a 4-way HT (8 logical processors) Pentium 4 Xeon system running 8 copies of netperf. The 4 NICs in the system taking different IRQs generated sizable interrupt load with the help of connected clients. Here the netperf transactions/sec throughput numbers observed are: IRQs nicely manually bound to CPUs: 56.20K The current kernel implementation of IRQ movement: 50.05K ----------------------- The static binding of IRQs has performed 12.28% better than the current IRQ movement implemented in the kernel. * The current implementation does not distinguish siblings from the HT (Hyper-Threading(tm)) enabled CPUs. It will be beneficial to balance the interrupt load with respect to processor packages first, and then among logical CPUs inside processor packages. For example if we have 2 heavy interrupt sources and 2 processor packages (4 logical CPUs); Assigning both the heavy interrupt sources in different processor packages is better, it will use different execution resources from the different processor packages. New revised implementation: ========================== We also have been working on a new implementation. The following points are in main focus. * At any moment heavily loaded IRQs are distributed to different CPUs to achieve as much balance as possible. * Lightly loaded interrupt sources are ignored from the load balancing, as they do not cause considerable imbalance. * When the heavy interrupt sources are balanced, they are not moved around. This also helps in keeping the CPU caches warm. * It has been made HT aware. While distributing the load, the load on a processor package to which the logical CPUs belong to is also considered. * In the situations of few (lesser than num_cpus) heavy interrupt sources, it is not possible to balance them evenly. In such case the existing code has been reused to move the interrupts. The randomness from the original code has been removed. * The time interval for redistribution has been made flexible. It varies as the system interrupt load changes. * A new kernel_thread is introduced to do the load balancing calculations for all the interrupt sources. It keeps the balanace_maps ready for interrupt handlers, keeping the overhead in the interrupt handling to minimum. * It allows the disabling of the IRQ distribution from the boot loader command line, if anybody wants to do it for any reason. * The algorithm also takes into account the static binding of interrupts to CPUs that user imposes from the /proc/irq/{n}/smp_affinity interface. Throughput numbers with the netperf setup for the new implementation: Current kernel IRQ balance implementation: 50.02K transactions/sec The new IRQ balance implementation: 56.01K transactions/sec --------------------- The performance improvement on P4 Xeon of 11.9% is observed. The new IRQ balance implementation also shows little performance improvement on P6 (Pentium II, III) systems. On a P6 system the netperf throughput numbers are: Current kernel IRQ balance implementation: 36.96K transactions/sec The new IRQ balance implementation: 37.65K transactions/sec --------------------- Here the performance improvement on P6 system of about 2% is observed. --------------------- Andrew Theurer <habanero@us.ibm.com> did some testing of this patch on a quad P4: I got a chance to run the NetBench benchmark with your patch on 2.5.54-mjb2 kernel. NetBench measures SMB/CIFS performance by using several SMB clients (in this case 44 Windows 2000 systems), sending SMB requests to a Linux server running Samba 2.2.3a+sendfile. Result is in throughput, Mbps. Generally the network traffic on the server is 60% recv, 40% tx. I believe we have very similar systems. Mine is a 4 x 1.6 GHz, 1 MB L3 P4 Xeon with 4 GB DDR memory (3.2 GB/sec I believe). The chipset is "Summit". I also have more than one Intel e1000 adapters. I decided to run a few configurations, first with just one adapter, with and without HT support in the kernel (acpi=off), then add another adapter and test again with/without HT. Here are the results: 4P, no HT, 1 x e1000, no kirq: 1214 Mbps, 4% idle 4P, no HT, 1 x e1000, kirq: 1223 Mbps, 4% idle, +0.74% I suppose we didn't see much of an improvement here because we never run into the situation where more than one interrupt with a high rate is routed to a single CPU on irq_balance. 4P, HT, 1 x e1000, no kirq: 1214 Mbps, 25% idle 4P, HT, 1 x e1000, kirq: 1220 Mbps, 30% idle, +0.49% Again, not much of a difference just yet, but lots of idle time. We may have reached the limit at which one logical CPU can process interrupts for an e1000 adapter. There are other things I can probably do to help this, like int delay, and NAPI, which I will get to eventually. 4P, HT, 2 x e1000, no kirq: 1269 Mbps, 23% idle 4P, HT, 2 x e1000, kirq: 1329 Mbps, 18% idle +4.7% OK, almost 5% better! Probably has to do with a couple of things; the fact that your code does not route two different interrupts to the same core/different logical cpus (quite obvious by looking at /proc/interrupts), and that more than one interrupt does not go to the same cpu if possible. I suspect irq_balance did some of those [bad] things some of the time, and we observed a bottleneck in int processing that was lower than with kirq. I don't think all of the idle time is because of a int processing bottleneck. I'm just not sure what it is yet :) Hopefully something will become obvious to me... Overall I like the way it works, and I believe it can be tweaked to work with NUMA when necessary. I hope to have access to a specweb system on a NUMA box soon, so we can verify that.
Showing
This diff is collapsed.
Please register or sign in to comment