    [PATCH] support "requeueing" futexes · 7149345c
    Ingo Molnar authored
    This addresses a futex-related SMP scalability problem of
    glibc. A number of regressions have been reported to the NPTL mailing list
    when going to many CPUs, for applications that use condition variables and
    the pthread_cond_broadcast() API call. Using this functionality, test code
    shows a slowdown from 0.12 seconds runtime to over 237 seconds (!) runtime
    on 4-CPU systems.
    
    pthread condition variables use two futex-backed mutex-like locks: an
    internal one for the glibc CV state itself, and a user-supplied mutex
    which the API guarantees to take in certain codepaths. (Unfortunately the
    user-supplied mutex cannot be used to protect the CV state, so we've got
    to deal with two locks.)
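
    For reference, a minimal user-level sketch of the pattern in question
    (plain POSIX API, not glibc internals): the condition variable carries
    its own internal futex-backed lock, while pthread_cond_wait() and
    pthread_cond_broadcast() also operate on the caller's mutex:

        #include <pthread.h>

        static pthread_mutex_t user_mutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cond       = PTHREAD_COND_INITIALIZER;
        static int ready;

        /* waiter: pthread_cond_wait() drops user_mutex while blocking and
         * re-acquires it before returning - that re-acquisition is where
         * the woken threads serialize. */
        void *waiter(void *arg)
        {
                pthread_mutex_lock(&user_mutex);
                while (!ready)
                        pthread_cond_wait(&cond, &user_mutex);
                pthread_mutex_unlock(&user_mutex);
                return NULL;
        }

        /* broadcaster: wakes every current waiter at once. */
        void broadcaster(void)
        {
                pthread_mutex_lock(&user_mutex);
                ready = 1;
                pthread_cond_broadcast(&cond);
                pthread_mutex_unlock(&user_mutex);
        }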
    
    The cause of the slowdown is a 'swarm effect': if lots of threads are
    blocked on a condition variable, and pthread_cond_broadcast() is done,
    then glibc first does a FUTEX_WAKE on the cv-internal mutex, and each woken
    thread then does a mutex_down() on the user-supplied mutex. I.e. a swarm of
    threads is created, all of which race to serialize on the user-supplied
    mutex. The more threads
    are used, the more likely it becomes that the scheduler will balance them
    over to other CPUs - where they just schedule, try to lock the mutex, and
    go to sleep. This 'swarm effect' is purely technical, a side-effect of
    glibc's use of futexes, and the imperfect coupling of the two locks.
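
    At the futex level, the old broadcast thus amounts to a plain wake-all.
    Here is a rough sketch against the raw futex syscall, with cv_futex
    standing in for glibc's CV-internal futex word (the real glibc code
    differs in the details):

        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <limits.h>

        static int cv_futex;    /* stand-in for the CV-internal futex word */

        /* old behaviour: wake *all* waiters blocked on the CV futex; each
         * of them then goes on to contend for the user-supplied mutex. */
        static long broadcast_wake_all(void)
        {
                return syscall(SYS_futex, &cv_futex, FUTEX_WAKE, INT_MAX,
                               NULL, NULL, 0);
        }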
    
    The solution to this problem is to not wake up the swarm of threads, but
    to 'requeue' them from the CV-internal mutex to the user-supplied mutex. The
    attached patch adds the FUTEX_REQUEUE feature: FUTEX_REQUEUE requeues N
    threads from futex address A to futex address B.
    
    This way glibc can wake up a single thread (which will take the
    user-mutex), and can requeue the rest, with a single system-call.
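
    Sketched the same way (with mutex_futex standing in for the futex word
    of the user-supplied mutex), the requeue-based broadcast becomes a
    single call:

        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <limits.h>

        static int cv_futex;     /* stand-in for the CV-internal futex word  */
        static int mutex_futex;  /* stand-in for the user mutex's futex word */

        /* new behaviour: wake exactly one waiter and requeue up to INT_MAX
         * of the remaining waiters from cv_futex onto mutex_futex; they are
         * then woken one at a time as the mutex gets released. */
        static long broadcast_requeue(void)
        {
                return syscall(SYS_futex, &cv_futex, FUTEX_REQUEUE,
                               1,                       /* waiters to wake  */
                               (unsigned long)INT_MAX,  /* requeue limit    */
                               &mutex_futex, 0);
        }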
    
    Ulrich Drepper has implemented FUTEX_REQUEUE support in glibc, and a
    number of people have tested it over the past couple of weeks. Here are
    the measurements done by Saurabh Desai:
    
    System: 4xPIII 700MHz
    
     ./cond-perf -r 100 -n 200:        1p       2p         4p
     Default NPTL:                 0.120s   0.211s   237.407s
     requeue NPTL:                 0.124s   0.156s     0.040s
    
     ./cond-perf -r 1000 -n 100:
     Default NPTL:                 0.276s   0.412s     0.530s
     requeue NPTL:                 0.349s   0.503s     0.550s
    
     ./pp -v -n 128 -i 1000 -S 32768:
     Default NPTL: 128 games in    1.111s   1.270s    16.894s
     requeue NPTL: 128 games in    1.111s   1.959s     2.426s
    
     ./pp -v -n 1024 -i 10 -S 32768:
     Default NPTL: 1024 games in   0.181s   0.394s     incomplete after 2m+
     requeue NPTL: 1024 games in   0.166s   0.254s     0.341s
    
    The speedup with an increasing number of threads is quite significant; in
    the 128-thread case it's more than 8x. In the cond-perf test on 4 CPUs
    it's almost infinitely faster than the 'swarm of threads' catastrophe
    triggered by the old code.