    epoll: atomically remove wait entry on wake up · 412895f0
    Roman Penyaev authored
    This patch does two things:
    
     - fixes a lost wakeup introduced by commit 339ddb53 ("fs/epoll:
       remove unnecessary wakeups of nested epoll")
    
     - improves performance for events delivery.
    
    The description of the problem is the following: if N (> 1) threads
    are waiting on ep->wq for new events and M (> 1) events come in, it is
    quite likely that more than one wakeup hits the same wait queue entry,
    because there is quite a big window between the
    __add_wait_queue_exclusive() call and the following
    __remove_wait_queue() call in the ep_poll() function.
    
    This can lead to lost wakeups, because the thread that was woken up
    may not handle all the events in ->rdllist.  (The problem is described
    in more detail here: https://lkml.org/lkml/2019/10/7/905)
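    
    To make the window concrete, here is a rough sketch of the pre-patch
    wait path in ep_poll().  Locking, bookkeeping and error handling are
    omitted and variable names are approximate, so this is an illustration
    of the pattern, not the literal kernel code:
    
        init_waitqueue_entry(&wait, current);        /* callback: default_wake_function() */
        __add_wait_queue_exclusive(&ep->wq, &wait);  /* under the waitqueue lock */
    
        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (ep_events_available(ep) || signal_pending(current))
                        break;
                if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
                        break;                       /* timed out */
        }
        __set_current_state(TASK_RUNNING);
    
        /*
         * The wait entry is still linked on ep->wq until the line below
         * runs, so a second exclusive wakeup arriving in this window is
         * consumed by this already running thread instead of waking the
         * next waiter.
         */
        __remove_wait_queue(&ep->wq, &wait);         /* under the waitqueue lock */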
    
    The idea of the current patch is to use init_wait() instead of
    init_waitqueue_entry().
    
    Internally init_wait() sets autoremove_wake_function as the wakeup
    callback, which removes the wait entry from the list atomically (under
    the wq lock), so the next incoming wakeup hits the next wait entry in
    the wait queue, preventing lost wakeups.
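    
    For reference, the callback installed by init_wait() looks essentially
    like the following (paraphrased from the kernel's wait-queue code; the
    exact location and signature may differ between kernel versions):
    
        int autoremove_wake_function(struct wait_queue_entry *wq_entry,
                                     unsigned mode, int sync, void *key)
        {
                int ret = default_wake_function(wq_entry, mode, sync, key);
    
                /*
                 * The waker still holds the waitqueue lock here, so the
                 * unlink is atomic with respect to other wakeups: once the
                 * task has been woken, the entry is gone from ep->wq.
                 */
                if (ret)
                        list_del_init(&wq_entry->entry);
    
                return ret;
        }
    
    Because the entry unlinks itself at wakeup time, ep_poll() only has to
    take the lock and remove the entry itself when no wakeup happened
    (e.g. on timeout or signal).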
    
    The problem is reliably reproduced by the epoll60 test case [1].
    
    Wait entry removal on wakeup also has performance benefits, because
    there is no need to take ep->lock and remove the wait entry from the
    queue after a successful wakeup.  Here is the timing output of the
    epoll60 test case:
    
      With explicit wakeup from ep_scan_ready_list() (the state of the
      code prior to 339ddb53):
    
        real    0m6.970s
        user    0m49.786s
        sys     0m0.113s
    
      After this patch:
    
        real    0m5.220s
        user    0m36.879s
        sys     0m0.019s
    
    The other test case is stress-epoll [2], where one thread consumes
    all the events and the other threads produce many events:
    
      With explicit wakeup from ep_scan_ready_list() (the state of the
      code prior to 339ddb53):
    
        threads  events/ms  run-time ms
              8       5427         1474
             16       6163         2596
             32       6824         4689
             64       7060         9064
            128       6991        18309
    
      After this patch:
    
        threads  events/ms  run-time ms
              8       5598         1429
             16       7073         2262
             32       7502         4265
             64       7640         8376
            128       7634        16767
    
     (The "events/ms" column is the event bandwidth, so higher is better;
      the "run-time ms" column is the overall time spent running the
      benchmark, so lower is better.)
    
    [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
    [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
    
    Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Jason Baron <jbaron@akamai.com>
    Cc: Khazhismel Kumykov <khazhy@google.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Heiher <r@hev.cc>
    Cc: <stable@vger.kernel.org>
    Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>