mm/truncate.c · 1eeae0158ecd0535a2bc257a53d3472cc37ceb15 · Kirill Smelkov / linux

[PATCH] make mapping->tree_lock an rwlock · 1eeae015
bill.irwin@oracle.com authored Mar 04, 2005
Convert mapping->tree_lock to an rwlock.

With the patch:

dd if=/dev/zero of=foo bs=1 count=2M  0.80s user 4.15s system 99% cpu 4.961 total
dd if=/dev/zero of=foo bs=1 count=2M  0.73s user 4.26s system 100% cpu 4.987 total
dd if=/dev/zero of=foo bs=1 count=2M  0.79s user 4.25s system 100% cpu 5.034 total

dd if=foo of=/dev/null bs=1  0.80s user 3.12s system 99% cpu 3.928 total
dd if=foo of=/dev/null bs=1  0.77s user 3.15s system 100% cpu 3.914 total
dd if=foo of=/dev/null bs=1  0.92s user 3.02s system 100% cpu 3.935 total

(3.926: 1.87 usecs)

Without the patch:

dd if=/dev/zero of=foo bs=1 count=2M  0.85s user 3.92s system 99% cpu 4.780 total
dd if=/dev/zero of=foo bs=1 count=2M  0.78s user 4.02s system 100% cpu 4.789 total
dd if=/dev/zero of=foo bs=1 count=2M  0.82s user 3.94s system 99% cpu 4.763 total
dd if=/dev/zero of=foo bs=1 count=2M  0.71s user 4.10s system 99% cpu 4.810 total

dd if=foo of=/dev/null bs=1  0.76s user 2.68s system 100% cpu 3.438 total
dd if=foo of=/dev/null bs=1  0.74s user 2.72s system 99% cpu 3.465 total
dd if=foo of=/dev/null bs=1  0.67s user 2.82s system 100% cpu 3.489 total
dd if=foo of=/dev/null bs=1  0.70s user 2.62s system 99% cpu 3.326 total

(3.430: 1.635 usecs)
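
The parenthesized figures are the mean wall-clock time of the read runs
divided by the number of one-byte transfers.  A minimal sketch of that
arithmetic, assuming count=2M means 2 * 1024 * 1024 transfers (the function
names are illustrative and not part of the patch):

```c
#include <assert.h>
#include <stdio.h>

/* dd ... bs=1 count=2M moves 2 * 1024 * 1024 bytes, one byte per call. */
#define TRANSFERS (2L * 1024L * 1024L)

/* Average cost of one transfer in microseconds, given total wall time in seconds. */
double usecs_per_transfer(double total_seconds)
{
    assert(total_seconds >= 0.0);
    return total_seconds * 1e6 / (double)TRANSFERS;
}

/* Print both per-transfer costs and their difference in nanoseconds. */
void print_overhead(double with_secs, double without_secs)
{
    double with_us = usecs_per_transfer(with_secs);
    double without_us = usecs_per_transfer(without_secs);

    printf("with:    %.3f usecs\n", with_us);
    printf("without: %.3f usecs\n", without_us);
    printf("delta:   %.0f nsecs\n", (with_us - without_us) * 1000.0);
}
```

Calling print_overhead(3.926, 3.430) from a main() reproduces the two
per-transfer figures and a difference of a little under 240 nsecs.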


So on a P4, the additional cost of the rwlock is ~240 nsecs per one-byte
read() (1.87 vs. 1.635 usecs, from the read runs above).  On the other hand:

From: Peter Chubb <peterc@gelato.unsw.edu.au>

  As part of the Gelato scalability focus group, we've been running OSDL's
  Re-AIM7 benchmark with an I/O intensive load with varying numbers of
  processors.  The current kernel shows severe contention on the tree_lock in
  the address space structure when running on tmpfs or ext2 on a RAM disk.


  Lockstat output for a 12-way:

  SPINLOCKS         HOLD            WAIT
    UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

          5.5%  0.4us(3177us)   28us(  20ms)(44.2%) 131821954 94.5%  5.5% 0.00%  *TOTAL*

   72.3% 13.1%  0.5us( 9.5us)   29us(  20ms)(42.5%)  50542055 86.9% 13.1%    0%  find_lock_page+0x30
   23.8%    0%  385us(3177us)    0us                    23235  100%    0%    0%  exit_mmap+0x50
   11.5% 0.82%  0.1us( 101us)   17us(5670us)( 1.6%)  50665658 99.2% 0.82%    0%  dnotify_parent+0x70


  Replacing the spinlock with a multi-reader lock fixes this problem,
  without unduly affecting anything else.
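
A userspace sketch of the idea, using POSIX pthread_rwlock_t rather than the
kernel's own locking primitives.  The toy_mapping structure and its helpers
are hypothetical stand-ins, not code from the patch: lookups take the lock
shared so they no longer serialize against each other, while insertions
still take it exclusively.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NSLOTS 64

/* Hypothetical stand-in for an address space: a small index guarded by a
 * reader/writer lock instead of a mutual-exclusion lock. */
struct toy_mapping {
    pthread_rwlock_t tree_lock;
    void *slots[NSLOTS];
};

/* Returns 0 on success, or the pthread error code. */
int toy_mapping_init(struct toy_mapping *m)
{
    assert(m != NULL);
    for (size_t i = 0; i < NSLOTS; i++)
        m->slots[i] = NULL;
    return pthread_rwlock_init(&m->tree_lock, NULL);
}

/* Lookup: any number of threads may hold the read side at once. */
void *toy_lookup(struct toy_mapping *m, size_t index)
{
    void *page = NULL;

    pthread_rwlock_rdlock(&m->tree_lock);
    if (index < NSLOTS)
        page = m->slots[index];
    pthread_rwlock_unlock(&m->tree_lock);
    return page;
}

/* Insert: the write side excludes all readers and other writers.
 * Returns 0 on success, -1 if the index is out of range or occupied. */
int toy_insert(struct toy_mapping *m, size_t index, void *page)
{
    int ret = -1;

    pthread_rwlock_wrlock(&m->tree_lock);
    if (index < NSLOTS && m->slots[index] == NULL) {
        m->slots[index] = page;
        ret = 0;
    }
    pthread_rwlock_unlock(&m->tree_lock);
    return ret;
}
```

The lockstat output above shows the contention at find_lock_page, a lookup
path, which is why letting lookups share the lock helps at high CPU counts.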

  Here are the benchmark results (jobs per minute at a 50-client level, average
  of 5 runs, standard deviation in parens) on an HP Olympia with 3 cells, 12
  processors, and dnotify turned off (after this spinlock, the spinlock in
  dnotify_parent is the worst contended for this workload).

          tmpfs..........................    ext2...........................
  #CPUs   spinlock     rwlock      change    spinlock     rwlock      change
      1    7556(15)     7588(17)   +0.42%     3744(20)     3791(16)   +1.25%
      2   13743(31)    13791(33)   +0.35%     6405(30)     6413(24)   +0.12%
      4   23334(111)   22881(154)  -2%        9648(51)     9595(50)   -0.55%
      8   33580(240)   36163(190)  +7.7%     13183(63)    13070(68)   -0.85%
     12   28748(170)   44064(238)  +53%      12681(49)    14504(105)  +14%

  And on a Pentium 3 single processor:
      1    4177(4)      4169(2)    -0.2%      3811(4)      3820(3)    +0.23%

  I'm not sure what's happening in the 4-processor case.  The important thing to
  note is that with a spinlock, the benchmark shows worse performance for a
  12-way than for an 8-way box; with the patch, the 12-way performs better, as
  expected.  We've done some runs with 16-way as well; without the patch below,
  the 16-way performs worse than the 12-way.
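
The percentages in the table are the relative change in jobs per minute from
spinlock to rwlock.  A small helper (the name is illustrative) makes both
that comparison and the 8-way to 12-way regression easy to check:

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

On tmpfs at 12 CPUs, percent_change(28748, 44064) is about 53.3, matching
the +53% in the table.  With the spinlock, going from 8 to 12 CPUs gives
percent_change(33580, 28748), about -14.4, which is the scaling regression
described above.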


It's a tricky tradeoff, but large-smp is hurt a lot more by the spinlocks than
small-smp is by the rwlocks.  And I don't think we really want to implement
compile-time either-or-locks.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>