Commit 08f16f8f authored by Andrew Morton's avatar Andrew Morton Committed by Linus Torvalds

[PATCH] ia32 IRQ distribution rework

Patch from "Kamble, Nitin A" <nitin.a.kamble@intel.com>

Hello All,

  We were looking at the performance impact of the IRQ routing from
the 2.5.52 Linux kernel. This email includes some of our findings
about the way the interrupts are getting moved in the 2.5.52 kernel.
Also there is discussion and a patch for a new implementation. Let
me know what you think at nitin.a.kamble@intel.com

Current implementation:
======================
We have found that the existing implementation works well on IA32
SMP systems with light load of interrupts. Also we noticed that it
is not working that well under heavy interrupt load conditions on
these SMP systems. The observations are:

* Interrupt load of each IRQ is getting balanced on CPUs independent
of load of other IRQs. Also the current implementation moves the
IRQs randomly. This works well when the interrupt load is light. But
we start seeing imbalance of interrupt load with existence of
multiple heavy interrupt sources. Frequently multiple heavily loaded
IRQs gets moved to a single CPU while other CPUs stay very lightly
loaded. To achieve a good interrupts load balance, it is important to
consider the load of all the interrupts together.
    This further can be explained with an example of 4 CPUs and 4
heavy interrupt sources. With the existing random movement approach,
the chance of each of these heavy interrupt sources moving to separate
CPUs is: (4/4)*(3/4)*(2/4)*(1/4) = 3/16. It means 13/16 = 81.25% of
the time the situation is, some CPUs are very lightly loaded and some
are loaded with multiple heavy interrupts. This causes the interrupt
load imbalance and results in less performance. In a case of 2 CPUs
and 2 heavily loaded interrupt sources, this imbalance happens
1/2 = 50% of the times. This issue becomes more and more severe with
increasing number of heavy interrupt sources.

* Another interesting observation is: We cannot see the imbalance
of the interrupt load from /proc/interrupts. (/proc/interrupts shows
the cumulative load of interrupts on all CPUs.) If the interrupt load
is imbalanced and this imbalance is getting rotated among CPUs
continuously, then /proc/interrupts will still show that the interrupt
load is going to processors very evenly. Currently at the frequency
(HZ/50) at which IRQs are moved across CPUs, it is not possible to
see any interrupt load imbalance happening.

* We have also found that, in certain cases the static IRQ binding
performs better than the existing kernel distribution of interrupt
load. The reason is, in a well-balanced interrupt load situations,
these interrupts are unnecessarily getting frequently moved across
CPUs. This adds an extra overhead; also it takes off the CPU cache
warmth benefits.
  This came out from the performance measurements done on a 4-way HT
(8 logical processors) Pentium 4 Xeon system running 8 copies of
netperf. The 4 NICs in the system taking different IRQs generated
sizable interrupt load with the help of connected clients.

Here the netperf transactions/sec throughput numbers observed are:

IRQs nicely manually bound to CPUs: 56.20K
The current kernel implementation of IRQ movement: 50.05K
 -----------------------
 The static binding of IRQs has performed 12.28% better than the
current IRQ movement implemented in the kernel.

* The current implementation does not distinguish siblings from the
HT (Hyper-Threading(tm)) enabled CPUs. It will be beneficial to
balance the interrupt load with respect to processor packages first,
and then among logical CPUs inside processor packages.
  For example if we have 2 heavy interrupt sources and 2 processor
packages (4 logical CPUs); Assigning both the heavy interrupt sources
in different processor packages is better, it will use different
execution resources from the different processor packages.



New revised implementation:
==========================
We also have been working on a new implementation. The following
points are in main focus.

* At any moment heavily loaded IRQs are distributed to different
CPUs to achieve as much balance as possible.

* Lightly loaded interrupt sources are ignored from the load
balancing, as they do not cause considerable imbalance.

* When the heavy interrupt sources are balanced, they are not moved
around. This also helps in keeping the CPU caches warm.

* It has been made HT aware. While distributing the load, the load
on a processor package to which the logical CPUs belong to is also
considered.

* In the situations of few (lesser than num_cpus) heavy interrupt
sources, it is not possible to balance them evenly. In such case
the existing code has been reused to move the interrupts. The
randomness from the original code has been removed.

* The time interval for redistribution has been made flexible. It
varies as the system interrupt load changes.

* A new kernel_thread is introduced to do the load balancing
calculations for all the interrupt sources. It keeps the balanace_maps
ready for interrupt handlers, keeping the overhead in the interrupt
handling to minimum.

* It allows the disabling of the IRQ distribution from the boot loader
command line, if anybody wants to do it for any reason.

* The algorithm also takes into account the static binding of
interrupts to CPUs that user imposes from the
/proc/irq/{n}/smp_affinity interface.


Throughput numbers with the netperf setup for the new implementation:

Current kernel IRQ balance implementation: 50.02K transactions/sec
The new IRQ balance implementation: 56.01K transactions/sec
 ---------------------
  The performance improvement on P4 Xeon of 11.9% is observed.

The new IRQ balance implementation also shows little performance
improvement on P6 (Pentium II, III) systems.

On a P6 system the netperf throughput numbers are:
Current kernel IRQ balance implementation: 36.96K transactions/sec
The new IRQ balance implementation: 37.65K transactions/sec
 ---------------------
Here the performance improvement on P6 system of about 2% is observed.


 ---------------------

Andrew Theurer <habanero@us.ibm.com> did some testing of this patch on a quad
P4:


I got a chance to run the NetBench benchmark with your patch on 2.5.54-mjb2
kernel.  NetBench measures SMB/CIFS performance by using several SMB
clients  (in this case 44 Windows 2000 systems), sending SMB requests to a
Linux  server running Samba 2.2.3a+sendfile.  Result is in throughput,
Mbps.   Generally the network traffic on the server is 60% recv, 40% tx.

I believe we have very similar systems.  Mine is a 4 x 1.6 GHz, 1 MB L3 P4
Xeon with 4 GB DDR memory (3.2 GB/sec I believe).  The chipset is "Summit".
 I also have more than one Intel e1000 adapters.

I decided to run a few configurations, first with just one adapter, with
and  without HT support in the kernel (acpi=off), then add another adapter
and  test again with/without HT.

Here are the results:

4P, no HT, 1 x e1000, no kirq:	1214 Mbps, 4% idle
4P, no HT, 1 x e1000, kirq:		1223 Mbps, 4% idle,		+0.74%

I suppose we didn't see much of an improvement here because we never run
into  the situation where more than one interrupt with a high rate is
routed to a  single CPU on irq_balance.

4P, HT, 1 x e1000, no kirq:	1214 Mbps, 25% idle
4P, HT, 1 x e1000, kirq:	1220 Mbps, 30% idle,			+0.49%

Again, not much of a difference just yet, but lots of idle time.  We may
have  reached the limit at which one logical CPU can process interrupts for
an  e1000 adapter.  There are other things I can probably do to help this,
like  int delay, and NAPI, which I will get to eventually.

4P, HT, 2 x e1000, no kirq:	1269 Mbps, 23% idle
4P, HT, 2 x e1000, kirq:	1329 Mbps, 18% idle			+4.7%

OK, almost 5% better!  Probably has to do with a couple of things; the fact
that your code does not route two different interrupts to the same
core/different logical cpus (quite obvious by looking at /proc/interrupts),
and that more than one interrupt does not go to the same cpu if possible.
I  suspect irq_balance did some of those [bad] things some of the time, and
we  observed a bottleneck in int processing that was lower than with kirq.

I don't think all of the idle time is because of a int processing
bottleneck.   I'm just not sure what it is yet :)  Hopefully something will
become obvious  to me...

Overall I like the way it works, and I believe it can be tweaked to work
with  NUMA when necessary.  I hope to have access to a specweb system on a
NUMA box  soon, so we can verify that.
parent 50d49a05
......@@ -351,6 +351,8 @@ running once the system is up.
hugepages= [HW,IA-32,IA-64] Maximal number of HugeTLB pages.
noirqbalance [IA-32,SMP,KNL] Disable kernel irq balancing
i8042_direct [HW] Non-translated mode
i8042_dumbkbd
i8042_noaux
......
......@@ -207,19 +207,34 @@ static void set_ioapic_affinity (unsigned int irq, unsigned long mask)
spin_unlock_irqrestore(&ioapic_lock, flags);
}
#if CONFIG_SMP
typedef struct {
unsigned int cpu;
unsigned long timestamp;
} ____cacheline_aligned irq_balance_t;
static irq_balance_t irq_balance[NR_IRQS] __cacheline_aligned
= { [ 0 ... NR_IRQS-1 ] = { 0, 0 } };
#if defined(CONFIG_SMP)
# include <asm/processor.h> /* kernel_thread() */
# include <linux/kernel_stat.h> /* kstat */
# include <linux/slab.h> /* kmalloc() */
# include <linux/timer.h> /* time_after() */
# if CONFIG_BALANCED_IRQ_DEBUG
# define TDprintk(x...) do { printk("<%ld:%s:%d>: ", jiffies, __FILE__, __LINE__); printk(x); } while (0)
# define Dprintk(x...) do { TDprintk(x); } while (0)
# else
# define TDprintk(x...)
# define Dprintk(x...)
# endif
extern unsigned long irq_affinity [NR_IRQS];
unsigned long __cacheline_aligned irq_balance_mask [NR_IRQS];
static int irqbalance_disabled __initdata = 0;
static int physical_balance = 0;
#endif
struct irq_cpu_info {
unsigned long * last_irq;
unsigned long * irq_delta;
unsigned long irq;
} irq_cpu_data[NR_CPUS];
#define CPU_IRQ(cpu) (irq_cpu_data[cpu].irq)
#define LAST_CPU_IRQ(cpu,irq) (irq_cpu_data[cpu].last_irq[irq])
#define IRQ_DELTA(cpu,irq) (irq_cpu_data[cpu].irq_delta[irq])
#define IDLE_ENOUGH(cpu,now) \
(idle_cpu(cpu) && ((now) - irq_stat[(cpu)].idle_timestamp > 1))
......@@ -227,9 +242,223 @@ extern unsigned long irq_affinity [NR_IRQS];
#define IRQ_ALLOWED(cpu,allowed_mask) \
((1 << cpu) & (allowed_mask))
#if CONFIG_SMP
#define CPU_TO_PACKAGEINDEX(i) \
((physical_balance && i > cpu_sibling_map[i]) ? cpu_sibling_map[i] : i)
#define MAX_BALANCED_IRQ_INTERVAL (5*HZ)
#define MIN_BALANCED_IRQ_INTERVAL (HZ/2)
#define BALANCED_IRQ_MORE_DELTA (HZ/10)
#define BALANCED_IRQ_LESS_DELTA (HZ)
long balanced_irq_interval = MAX_BALANCED_IRQ_INTERVAL;
static inline void balance_irq(int cpu, int irq);
static inline void rotate_irqs_among_cpus(unsigned long useful_load_threshold)
{
int i, j;
Dprintk("Rotating IRQs among CPUs.\n");
for (i = 0; i < NR_CPUS; i++) {
for (j = 0; cpu_online(i) && (j < NR_IRQS); j++) {
if (!irq_desc[j].action)
continue;
/* Is it a significant load ? */
if (IRQ_DELTA(CPU_TO_PACKAGEINDEX(i),j) < useful_load_threshold)
continue;
balance_irq(i, j);
}
}
balanced_irq_interval = max((long)MIN_BALANCED_IRQ_INTERVAL,
balanced_irq_interval - BALANCED_IRQ_LESS_DELTA);
return;
}
static void do_irq_balance(void)
{
int i, j;
unsigned long max_cpu_irq = 0, min_cpu_irq = (~0);
unsigned long move_this_load = 0;
int max_loaded = 0, min_loaded = 0;
unsigned long useful_load_threshold = balanced_irq_interval + 10;
int selected_irq;
int tmp_loaded, first_attempt = 1;
unsigned long tmp_cpu_irq;
unsigned long imbalance = 0;
unsigned long allowed_mask;
unsigned long target_cpu_mask;
for (i = 0; i < NR_CPUS; i++) {
int package_index;
CPU_IRQ(i) = 0;
if (!cpu_online(i))
continue;
package_index = CPU_TO_PACKAGEINDEX(i);
for (j = 0; j < NR_IRQS; j++) {
unsigned long value_now, delta;
/* Is this an active IRQ? */
if (!irq_desc[j].action)
continue;
if ( package_index == i )
IRQ_DELTA(package_index,j) = 0;
/* Determine the total count per processor per IRQ */
value_now = (unsigned long) kstat_cpu(i).irqs[j];
/* Determine the activity per processor per IRQ */
delta = value_now - LAST_CPU_IRQ(i,j);
/* Update last_cpu_irq[][] for the next time */
LAST_CPU_IRQ(i,j) = value_now;
/* Ignore IRQs whose rate is less than the clock */
if (delta < useful_load_threshold)
continue;
/* update the load for the processor or package total */
IRQ_DELTA(package_index,j) += delta;
/* Keep track of the higher numbered sibling as well */
if (i != package_index)
CPU_IRQ(i) += delta;
/*
* We have sibling A and sibling B in the package
*
* cpu_irq[A] = load for cpu A + load for cpu B
* cpu_irq[B] = load for cpu B
*/
CPU_IRQ(package_index) += delta;
}
}
/* Find the least loaded processor package */
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_online(i))
continue;
if (physical_balance && i > cpu_sibling_map[i])
continue;
if (min_cpu_irq > CPU_IRQ(i)) {
min_cpu_irq = CPU_IRQ(i);
min_loaded = i;
}
}
max_cpu_irq = ULONG_MAX;
tryanothercpu:
/* Look for heaviest loaded processor.
* We may come back to get the next heaviest loaded processor.
* Skip processors with trivial loads.
*/
tmp_cpu_irq = 0;
tmp_loaded = -1;
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_online(i))
continue;
if (physical_balance && i > cpu_sibling_map[i])
continue;
if (max_cpu_irq <= CPU_IRQ(i))
continue;
if (tmp_cpu_irq < CPU_IRQ(i)) {
tmp_cpu_irq = CPU_IRQ(i);
tmp_loaded = i;
}
}
if (tmp_loaded == -1) {
/* In the case of small number of heavy interrupt sources,
* loading some of the cpus too much. We use Ingo's original
* approach to rotate them around.
*/
if (!first_attempt && imbalance >= useful_load_threshold) {
rotate_irqs_among_cpus(useful_load_threshold);
return;
}
goto not_worth_the_effort;
}
first_attempt = 0; /* heaviest search */
max_cpu_irq = tmp_cpu_irq; /* load */
max_loaded = tmp_loaded; /* processor */
imbalance = (max_cpu_irq - min_cpu_irq) / 2;
Dprintk("max_loaded cpu = %d\n", max_loaded);
Dprintk("min_loaded cpu = %d\n", min_loaded);
Dprintk("max_cpu_irq load = %ld\n", max_cpu_irq);
Dprintk("min_cpu_irq load = %ld\n", min_cpu_irq);
Dprintk("load imbalance = %lu\n", imbalance);
/* if imbalance is less than approx 10% of max load, then
* observe diminishing returns action. - quit
*/
if (imbalance < (max_cpu_irq >> 3)) {
Dprintk("Imbalance too trivial\n");
goto not_worth_the_effort;
}
tryanotherirq:
/* if we select an IRQ to move that can't go where we want, then
* see if there is another one to try.
*/
move_this_load = 0;
selected_irq = -1;
for (j = 0; j < NR_IRQS; j++) {
/* Is this an active IRQ? */
if (!irq_desc[j].action)
continue;
if (imbalance <= IRQ_DELTA(max_loaded,j))
continue;
/* Try to find the IRQ that is closest to the imbalance
* without going over.
*/
if (move_this_load < IRQ_DELTA(max_loaded,j)) {
move_this_load = IRQ_DELTA(max_loaded,j);
selected_irq = j;
}
}
if (selected_irq == -1) {
goto tryanothercpu;
}
imbalance = move_this_load;
#define IRQ_BALANCE_INTERVAL (HZ/50)
/* For physical_balance case, we accumlated both load
* values in the one of the siblings cpu_irq[],
* to use the same code for physical and logical processors
* as much as possible.
*
* NOTE: the cpu_irq[] array holds the sum of the load for
* sibling A and sibling B in the slot for the lowest numbered
* sibling (A), _AND_ the load for sibling B in the slot for
* the higher numbered sibling.
*
* We seek the least loaded sibling by making the comparison
* (A+B)/2 vs B
*/
if (physical_balance && (CPU_IRQ(min_loaded) >> 1) > CPU_IRQ(cpu_sibling_map[min_loaded]))
min_loaded = cpu_sibling_map[min_loaded];
allowed_mask = cpu_online_map & irq_affinity[selected_irq];
target_cpu_mask = 1 << min_loaded;
if (target_cpu_mask & allowed_mask) {
irq_desc_t *desc = irq_desc + selected_irq;
Dprintk("irq = %d moved to cpu = %d\n", selected_irq, min_loaded);
/* mark for change destination */
spin_lock(&desc->lock);
irq_balance_mask[selected_irq] = target_cpu_mask;
spin_unlock(&desc->lock);
/* Since we made a change, come back sooner to
* check for more variation.
*/
balanced_irq_interval = max((long)MIN_BALANCED_IRQ_INTERVAL,
balanced_irq_interval - BALANCED_IRQ_LESS_DELTA);
return;
}
goto tryanotherirq;
not_worth_the_effort:
/* if we did not find an IRQ to move, then adjust the time interval upward */
balanced_irq_interval = min((long)MAX_BALANCED_IRQ_INTERVAL,
balanced_irq_interval + BALANCED_IRQ_MORE_DELTA);
Dprintk("IRQ worth rotating not found\n");
return;
}
static unsigned long move(int curr_cpu, unsigned long allowed_mask, unsigned long now, int direction)
{
......@@ -257,34 +486,113 @@ static unsigned long move(int curr_cpu, unsigned long allowed_mask, unsigned lon
return cpu;
}
static inline void balance_irq(int irq)
static inline void balance_irq (int cpu, int irq)
{
irq_balance_t *entry = irq_balance + irq;
unsigned long now = jiffies;
unsigned long allowed_mask;
unsigned int new_cpu;
if (no_balance_irq)
return;
if (unlikely(time_after(now, entry->timestamp + IRQ_BALANCE_INTERVAL))) {
unsigned long allowed_mask;
unsigned int new_cpu;
int random_number;
allowed_mask = cpu_online_map & irq_affinity[irq];
new_cpu = move(cpu, allowed_mask, now, 1);
if (cpu != new_cpu) {
irq_desc_t *desc = irq_desc + irq;
spin_lock(&desc->lock);
irq_balance_mask[irq] = cpu_to_logical_apicid(new_cpu);
spin_unlock(&desc->lock);
}
}
int balanced_irq(void *unused)
{
int i;
unsigned long prev_balance_time = jiffies;
long time_remaining = balanced_irq_interval;
daemonize();
sigfillset(&current->blocked);
sprintf(current->comm, "kirqd");
/* push everything to CPU 0 to give us a starting point. */
for (i = 0 ; i < NR_IRQS ; i++)
irq_balance_mask[i] = 1 << 0;
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
time_remaining = schedule_timeout(time_remaining);
if (time_after(jiffies, prev_balance_time+balanced_irq_interval)) {
Dprintk("balanced_irq: calling do_irq_balance() %lu\n", jiffies);
do_irq_balance();
prev_balance_time = jiffies;
time_remaining = balanced_irq_interval;
}
}
}
rdtscl(random_number);
random_number &= 1;
static int __init balanced_irq_init(void)
{
int i;
struct cpuinfo_x86 *c;
c = &boot_cpu_data;
if (irqbalance_disabled)
return 0;
/* Enable physical balance only if more than 1 physical processor is present */
if (smp_num_siblings > 1 && cpu_online_map >> 2)
physical_balance = 1;
allowed_mask = cpu_online_map & irq_affinity[irq];
entry->timestamp = now;
new_cpu = move(entry->cpu, allowed_mask, now, random_number);
if (entry->cpu != new_cpu) {
entry->cpu = new_cpu;
set_ioapic_affinity(irq, cpu_to_logical_apicid(new_cpu));
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_online(i))
continue;
irq_cpu_data[i].irq_delta = kmalloc(sizeof(unsigned long) * NR_IRQS, GFP_KERNEL);
irq_cpu_data[i].last_irq = kmalloc(sizeof(unsigned long) * NR_IRQS, GFP_KERNEL);
if (irq_cpu_data[i].irq_delta == NULL || irq_cpu_data[i].last_irq == NULL) {
printk(KERN_ERR "balanced_irq_init: out of memory");
goto failed;
}
memset(irq_cpu_data[i].irq_delta,0,sizeof(unsigned long) * NR_IRQS);
memset(irq_cpu_data[i].last_irq,0,sizeof(unsigned long) * NR_IRQS);
}
printk(KERN_INFO "Starting balanced_irq\n");
if (kernel_thread(balanced_irq, NULL, CLONE_KERNEL) >= 0)
return 0;
else
printk(KERN_ERR "balanced_irq_init: failed to spawn balanced_irq");
failed:
for (i = 0; i < NR_CPUS; i++) {
if(irq_cpu_data[i].irq_delta)
kfree(irq_cpu_data[i].irq_delta);
if(irq_cpu_data[i].last_irq)
kfree(irq_cpu_data[i].last_irq);
}
return 0;
}
static int __init irqbalance_disable(char *str)
{
irqbalance_disabled = 1;
return 0;
}
__setup("noirqbalance", irqbalance_disable);
static void set_ioapic_affinity (unsigned int irq, unsigned long mask);
static inline void move_irq(int irq)
{
/* note - we hold the desc->lock */
if (unlikely(irq_balance_mask[irq])) {
set_ioapic_affinity(irq, irq_balance_mask[irq]);
irq_balance_mask[irq] = 0;
}
}
__initcall(balanced_irq_init);
#else /* !SMP */
static inline void balance_irq(int irq) { }
#endif
static inline void move_irq(int irq) { }
#endif /* defined(CONFIG_SMP) */
/*
* support for broken MP BIOSs, enables hand-redirection of PIRQ0-7 to
......@@ -1307,7 +1615,7 @@ static unsigned int startup_edge_ioapic_irq(unsigned int irq)
*/
static void ack_edge_ioapic_irq(unsigned int irq)
{
balance_irq(irq);
move_irq(irq);
if ((irq_desc[irq].status & (IRQ_PENDING | IRQ_DISABLED))
== (IRQ_PENDING | IRQ_DISABLED))
mask_IO_APIC_irq(irq);
......@@ -1347,7 +1655,7 @@ static void end_level_ioapic_irq (unsigned int irq)
unsigned long v;
int i;
balance_irq(irq);
move_irq(irq);
/*
* It appears there is an erratum which affects at least version 0x11
* of I/O APIC (that's the 82093AA and cores integrated into various
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment