-
bill.irwin@oracle.com authored
Convert mapping->tree_lock to an rwlock. with: dd if=/dev/zero of=foo bs=1 count=2M 0.80s user 4.15s system 99% cpu 4.961 total dd if=/dev/zero of=foo bs=1 count=2M 0.73s user 4.26s system 100% cpu 4.987 total dd if=/dev/zero of=foo bs=1 count=2M 0.79s user 4.25s system 100% cpu 5.034 total dd if=foo of=/dev/null bs=1 0.80s user 3.12s system 99% cpu 3.928 total dd if=foo of=/dev/null bs=1 0.77s user 3.15s system 100% cpu 3.914 total dd if=foo of=/dev/null bs=1 0.92s user 3.02s system 100% cpu 3.935 total (3.926: 1.87 usecs) without: dd if=/dev/zero of=foo bs=1 count=2M 0.85s user 3.92s system 99% cpu 4.780 total dd if=/dev/zero of=foo bs=1 count=2M 0.78s user 4.02s system 100% cpu 4.789 total dd if=/dev/zero of=foo bs=1 count=2M 0.82s user 3.94s system 99% cpu 4.763 total dd if=/dev/zero of=foo bs=1 count=2M 0.71s user 4.10s system 99% cpu 4.810 tota dd if=foo of=/dev/null bs=1 0.76s user 2.68s system 100% cpu 3.438 total dd if=foo of=/dev/null bs=1 0.74s user 2.72s system 99% cpu 3.465 total dd if=foo of=/dev/null bs=1 0.67s user 2.82s system 100% cpu 3.489 total dd if=foo of=/dev/null bs=1 0.70s user 2.62s system 99% cpu 3.326 total (3.430: 1.635 usecs) So on a P4, the additional cost of the rwlock is ~240 nsecs for a one-byte-write(). On the other hand: From: Peter Chubb <peterc@gelato.unsw.edu.au> As part of the Gelato scalability focus group, we've been running OSDL's Re-AIM7 benchmark with an I/O intensive load with varying numbers of processors. The current kernel shows severe contention on the tree_lock in the address space structure when running on tmpfs or ext2 on a RAM disk. Lockstat output for a 12-way: SPINLOCKS HOLD WAIT UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME 5.5% 0.4us(3177us) 28us( 20ms)(44.2%) 131821954 94.5% 5.5% 0.00% *TOTAL* 72.3% 13.1% 0.5us( 9.5us) 29us( 20ms)(42.5%) 50542055 86.9% 13.1% 0% find_lock_page+0x30 23.8% 0% 385us(3177us) 0us 23235 100% 0% 0% exit_mmap+0x50 11.5% 0.82% 0.1us( 101us) 17us(5670us)( 1.6%) 50665658 99.2% 0.82% 0% dnotify_parent+0x70 Replacing the spinlock with a multi-reader lock fixes this problem, without unduly affecting anything else. Here are the benchmark results (jobs per minute at a 50-client level, average of 5 runs, standard deviation in parens) on an HP Olympia with 3 cells, 12 processors, and dnotify turned off (after this spinlock, the spinlock in dnotify_parent is the worst contended for this workload). tmpfs............... ext2............... #CPUs spinlock rwlock spinlock rwlock 1 7556(15) 7588(17) +0.42% 3744(20) 3791(16) +1.25% 2 13743(31) 13791(33) +0.35% 6405(30) 6413(24) +0.12% 4 23334(111) 22881(154) -2% 9648(51) 9595(50) -0.55% 8 33580(240) 36163(190) +7.7% 13183(63) 13070(68) -0.85% 12 28748(170) 44064(238)+53% 12681(49) 14504(105)+14% And on a pentium3 single processsor: 1 4177(4) 4169(2) -0.2% 3811(4) 3820(3) +0.23% I'm not sure what's happening in the 4-processor case. The important thing to note is that with a spinlock, the benchmark shows worse performance for a 12 than for an 8-way box; with the patch, the 12 way performs better, as expected. We've done some runs with 16-way as well; without the patch below, the 16-way performs worse than the 12-way. It's a tricky tradeoff, but large-smp is hurt a lot more by the spinlocks than small-smp is by the rwlocks. And I don't think we really want to implement compile-time either-or-locks. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1eeae015