    [PATCH] make mapping->tree_lock an rwlock · 1eeae015
    bill.irwin@oracle.com authored
    Convert mapping->tree_lock to an rwlock.
    
    With the patch:
    
    dd if=/dev/zero of=foo bs=1 count=2M  0.80s user 4.15s system 99% cpu 4.961 total
    dd if=/dev/zero of=foo bs=1 count=2M  0.73s user 4.26s system 100% cpu 4.987 total
    dd if=/dev/zero of=foo bs=1 count=2M  0.79s user 4.25s system 100% cpu 5.034 total
    
    dd if=foo of=/dev/null bs=1  0.80s user 3.12s system 99% cpu 3.928 total
    dd if=foo of=/dev/null bs=1  0.77s user 3.15s system 100% cpu 3.914 total
    dd if=foo of=/dev/null bs=1  0.92s user 3.02s system 100% cpu 3.935 total
    
    (3.926: 1.87 usecs)
    
    Without the patch:
    
    dd if=/dev/zero of=foo bs=1 count=2M  0.85s user 3.92s system 99% cpu 4.780 total
    dd if=/dev/zero of=foo bs=1 count=2M  0.78s user 4.02s system 100% cpu 4.789 total
    dd if=/dev/zero of=foo bs=1 count=2M  0.82s user 3.94s system 99% cpu 4.763 total
    dd if=/dev/zero of=foo bs=1 count=2M  0.71s user 4.10s system 99% cpu 4.810 total
    
    dd if=foo of=/dev/null bs=1  0.76s user 2.68s system 100% cpu 3.438 total
    dd if=foo of=/dev/null bs=1  0.74s user 2.72s system 99% cpu 3.465 total
    dd if=foo of=/dev/null bs=1  0.67s user 2.82s system 100% cpu 3.489 total
    dd if=foo of=/dev/null bs=1  0.70s user 2.62s system 99% cpu 3.326 total
    
    (3.430: 1.635 usecs)
    
    
    So on a P4, the additional cost of the rwlock is ~240 nsecs for a
    one-byte read().  On the other hand:
    
    From: Peter Chubb <peterc@gelato.unsw.edu.au>
    
      As part of the Gelato scalability focus group, we've been running OSDL's
      Re-AIM7 benchmark with an I/O intensive load with varying numbers of
      processors.  The current kernel shows severe contention on the tree_lock in
      the address space structure when running on tmpfs or ext2 on a RAM disk.
    
    
      Lockstat output for a 12-way:
    
      SPINLOCKS         HOLD            WAIT
        UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
    
              5.5%  0.4us(3177us)   28us(  20ms)(44.2%) 131821954 94.5%  5.5% 0.00%  *TOTAL*
    
       72.3% 13.1%  0.5us( 9.5us)   29us(  20ms)(42.5%)  50542055 86.9% 13.1%    0%  find_lock_page+0x30
       23.8%    0%  385us(3177us)    0us                    23235  100%    0%    0%  exit_mmap+0x50
       11.5% 0.82%  0.1us( 101us)   17us(5670us)( 1.6%)  50665658 99.2% 0.82%    0%  dnotify_parent+0x70
    
    
      Replacing the spinlock with a multi-reader lock fixes this problem,
      without unduly affecting anything else.
    
      Here are the benchmark results (jobs per minute at a 50-client level, average
      of 5 runs, standard deviation in parens) on an HP Olympia with 3 cells, 12
      processors, and dnotify turned off (after this spinlock, the spinlock in
      dnotify_parent is the worst contended for this workload).
    
      #CPUs   tmpfs                              ext2
              spinlock     rwlock      change    spinlock     rwlock      change
          1    7556(15)     7588(17)   +0.42%     3744(20)     3791(16)   +1.25%
          2   13743(31)    13791(33)   +0.35%     6405(30)     6413(24)   +0.12%
          4   23334(111)   22881(154)  -2%        9648(51)     9595(50)   -0.55%
          8   33580(240)   36163(190)  +7.7%     13183(63)    13070(68)   -0.85%
         12   28748(170)   44064(238)  +53%      12681(49)    14504(105)  +14%
    
      And on a Pentium 3 single processor:
          1    4177(4)      4169(2)    -0.2%      3811(4)      3820(3)    +0.23%
    
      I'm not sure what's happening in the 4-processor case.  The important thing to
      note is that with a spinlock, the benchmark shows worse performance for a 12
      than for an 8-way box; with the patch, the 12 way performs better, as
      expected.  We've done some runs with 16-way as well; without the patch below,
      the 16-way performs worse than the 12-way.
    
    
    It's a tricky tradeoff, but large-smp is hurt a lot more by the spinlocks than
    small-smp is by the rwlocks.  And I don't think we really want to implement
    compile-time either-or-locks.
    
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>