MDEV-23369 False sharing in page_hash_latch::read_lock_wait()

MDEV-22871 refactored the InnoDB buf_pool.page_hash to use a simple rw-lock implementation that avoids a spinloop between non-contended read-lock requests, simply using std::atomic::fetch_add() for the lock acquisition.

Alas, in a write-heavy stress test on a 56-core system with 1,000 concurrent client connections, the server would intermittently stop processing any transactions. The reason turned out to be false sharing: many hash table cells, and therefore many unrelated pages, were covered by the same page_hash_latch.

Attaching a debugger to the server during one such hang revealed that 22 of the 1,033 threads were polling in page_hash_latch::read_lock_wait() on the same object, which appeared to be in unlocked state (no readers or writers). All 22 requests were for accessing an undo log page, each with a distinct page number.

To eliminate such false sharing, we will make buf_pool.page_hash.array contain one page_hash_latch per CPU data cache line. On AMD64, this will grow the array by a factor of 8/7, or about 14%. For a 50GiB buffer pool of 16KiB pages, the buf_pool.page_hash.array would grow from 25MiB to 28.6MiB. On other instruction set architectures, the incurred memory overhead may be smaller.

Thanks to Vladislav Vaintroub for noticing this anomaly.
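The memory figures above can be checked with a short compile-time calculation. This is only a sketch under stated assumptions: a 64-byte cache line, 8-byte array cells, and one payload cell per buffer pool page (inferred from the 25MiB figure, not confirmed against the source). The names cache_line_size, cell_size and padded_bytes are hypothetical and do not appear in InnoDB.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed AMD64 values; hypothetical names, not taken from the InnoDB source.
constexpr std::size_t cache_line_size = 64;  // bytes per L1 data cache line
constexpr std::size_t cell_size = 8;         // bytes per array cell (one pointer)

// Each cache line holds 8 cells: 1 reserved for the latch, 7 for payload.
constexpr std::size_t slots_per_line = cache_line_size / cell_size;
constexpr std::size_t payload_per_line = slots_per_line - 1;

/** Bytes needed to store n_cells payload cells with one latch per cache line */
constexpr std::uint64_t padded_bytes(std::uint64_t n_cells)
{
  return (n_cells + payload_per_line - 1) / payload_per_line * cache_line_size;
}

constexpr std::uint64_t KiB = 1024, MiB = 1024 * KiB, GiB = 1024 * MiB;

// 50GiB buffer pool of 16KiB pages, assuming one payload cell per page.
constexpr std::uint64_t n_pages = 50 * GiB / (16 * KiB);

static_assert(n_pages == 3276800, "number of pages");
static_assert(n_pages * cell_size == 25 * MiB, "unpadded payload is 25MiB");
// 29,959,360 bytes is roughly 28.57MiB, i.e. 8/7 of 25MiB.
static_assert(padded_bytes(n_pages) == 29959360, "padded array size");
```

With these assumptions the padded size is 8/7 of the payload size, which matches the growth from 25MiB to about 28.6MiB quoted in the message.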

Commit c12d24e2 authored Aug 02, 2020 by Marko Mäkelä

Showing with 2 additions and 1 deletion

storage/innobase/include/buf0buf.h +2 -1

--- a/storage/innobase/include/buf0buf.h
+++ b/storage/innobase/include/buf0buf.h
@@ -1824,7 +1824,8 @@ class buf_pool_t
  {
    /** Number of array[] elements per page_hash_latch.
    Must be one less than a power of 2. */
-    static constexpr size_t ELEMENTS_PER_LATCH= 1023;
+    static constexpr size_t ELEMENTS_PER_LATCH= CPU_LEVEL1_DCACHE_LINESIZE /
+      sizeof(void*) - 1;

    /** number of payload elements in array[] */
    Atomic_relaxed<ulint> n_cells;
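
To illustrate how the new ELEMENTS_PER_LATCH value could interleave latches and payload cells, here is a minimal sketch. It assumes a 64-byte cache line in place of CPU_LEVEL1_DCACHE_LINESIZE, and the helpers pad() and latch_slot() are hypothetical names written for this example; they are not claimed to match the actual buf0buf.h code.

```cpp
#include <cstddef>

// Assumption: 64-byte cache line, standing in for CPU_LEVEL1_DCACHE_LINESIZE.
constexpr std::size_t CACHE_LINE = 64;

// Payload cells per latch: a cache line of pointer-sized slots, minus the
// one slot that holds the latch itself (7 on a 64-bit target).
constexpr std::size_t ELEMENTS_PER_LATCH = CACHE_LINE / sizeof(void*) - 1;

// "Must be one less than a power of 2": a group of latch plus payload then
// spans a power-of-2 number of slots.
static_assert(((ELEMENTS_PER_LATCH + 1) & ELEMENTS_PER_LATCH) == 0,
              "ELEMENTS_PER_LATCH must be one less than a power of 2");

/** Hypothetical helper: slot of payload index h in the padded array.
One latch slot is skipped at the start of every group. */
constexpr std::size_t pad(std::size_t h)
{
  return 1 + h / ELEMENTS_PER_LATCH + h;
}

/** Hypothetical helper: slot of the latch that covers payload index h,
which is the first slot of the group containing pad(h). */
constexpr std::size_t latch_slot(std::size_t h)
{
  return h / ELEMENTS_PER_LATCH * (ELEMENTS_PER_LATCH + 1);
}

// The first payload cell follows the first latch.
static_assert(pad(0) == 1 && latch_slot(0) == 0, "first group");
// The next group starts with its own latch in the following cache line.
static_assert(latch_slot(ELEMENTS_PER_LATCH) == ELEMENTS_PER_LATCH + 1,
              "second group latch");
static_assert(pad(ELEMENTS_PER_LATCH) == ELEMENTS_PER_LATCH + 2,
              "second group payload");
```

In this layout, every latch shares a cache line only with the cells it protects, provided the array itself starts on a cache line boundary.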