Andrew Morton authored
From: Christoph Hellwig <hch@infradead.org>

Originally by David Mosberger, testing by Roger Luethi. From the ia64
tree.

Basically, it avoids going to memory all the time. What this does is
make life a lot easier for gcc, so it can actually do a decent amount of
optimization. The restructuring is clearly less important for
out-of-order CPUs, but even there it gives some benefits.

More specifically, the loop is now structured to operate one "unsigned
long" at a time, rather than one bit at a time. Of course, you still
need to process all the bits, but most of the relevant state in the
inner loop can be kept in registers.

Roger Luethi measured the routine on a bunch of different machines
(mostly x86, IIRC: P5, P6, Crusoe, Athlons) and performance improved
there, too (and it should definitely improve performance on any
RISC-like architecture).

Roger's benchmarking results (vs number of fd's):

                                         File                 TCP
Number of fd's:                      10    250    500     10    250    500
UP, Pentium MMX 233MHz original     8.2  108.5  212.8   11.0  180.0  356.5
UP, Pentium MMX 233MHz w/patch      7.4   87.6  171.1   10.4  163.6  323.4
MP, Pentium MMX 233MHz original    15.7  283.8  562.8   18.9  354.4  705.5
MP, Pentium MMX 233MHz w/patch     14.6  255.6  506.5   17.8  332.8  664.1
UP, Athlon 1394 MHz original        1.3   13.4   26.1    1.9   24.7   48.6
UP, Athlon 1394 MHz w/patch         1.2   11.0   21.5    1.6   22.3   43.8
MP, Athlon 1394 MHz original        1.6   22.4   44.6    1.9   30.9   60.5
MP, Athlon 1394 MHz w/patch         1.5   21.2   41.7    1.9   30.2   59.6
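
For illustration only, the difference between the two loop shapes can be
sketched in plain userspace C. This is not the code from the patch: the
names collect_bitwise, collect_wordwise and BITS_PER_WORD are
hypothetical, and the sketch scans a single bitmap and records the set
bit positions, whereas the real routine does its own per-descriptor work.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper macro: number of bits in one unsigned long. */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Bit-at-a-time shape: every iteration indexes back into the array,
 * so the word is re-read from memory for each bit tested.
 * Writes the positions of set bits to out[] and returns their count.
 */
static size_t collect_bitwise(const unsigned long *bits, size_t nbits,
                              size_t *out)
{
    size_t hits = 0;

    assert(bits != NULL || nbits == 0);
    for (size_t i = 0; i < nbits; i++) {
        if (bits[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))
            out[hits++] = i;
    }
    return hits;
}

/*
 * Word-at-a-time shape: each unsigned long is loaded once into a
 * local variable, empty words are skipped outright, and the inner
 * loop only shifts and tests that local copy.
 */
static size_t collect_wordwise(const unsigned long *bits, size_t nbits,
                               size_t *out)
{
    size_t hits = 0;

    assert(bits != NULL || nbits == 0);
    for (size_t base = 0; base < nbits; base += BITS_PER_WORD) {
        unsigned long word = bits[base / BITS_PER_WORD];

        /* Stop as soon as no set bits remain in this word. */
        for (size_t j = 0; word != 0 && base + j < nbits;
             j++, word >>= 1) {
            if (word & 1UL)
                out[hits++] = base + j;
        }
    }
    return hits;
}
```

Both functions report the same positions in the same order; the second
one simply gives the compiler a local variable it can keep in a register
across the inner loop, which is the effect the description above
attributes to the restructuring.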
57a54189