    MDEV-15053 Reduce buf_pool_t::mutex contention · b1ab211d
    Marko Mäkelä authored
    User-visible changes: The INFORMATION_SCHEMA views INNODB_BUFFER_PAGE
    and INNODB_BUFFER_PAGE_LRU will report a dummy value FLUSH_TYPE=0
    and will no longer report the PAGE_STATE value READY_FOR_USE.
    
    We will remove some fields from buf_page_t and move much code to
    member functions of buf_pool_t and buf_page_t, so that the access
    rules of data members can be enforced consistently.
    
    Evicting or adding pages in buf_pool.LRU will remain covered by
    buf_pool.mutex.
    
    Evicting or adding pages in buf_pool.page_hash will remain
    covered by both buf_pool.mutex and the buf_pool.page_hash X-latch.
    
    After this fix, buf_pool.page_hash lookups can entirely
    avoid acquiring buf_pool.mutex, only relying on
    buf_pool.hash_lock_get() S-latch.
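
    The locking rules above can be illustrated with a minimal sketch. It is
    not the InnoDB code: PageHash, Page, lookup_and_fix() and the use of
    std::shared_mutex are hypothetical stand-ins, and the partition count
    is an arbitrary assumption. Modifications take the pool mutex and
    the exclusive partition latch; lookups take only the shared latch.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

// Hypothetical page descriptor; fix_count pins the page against eviction.
struct Page {
  uint64_t id;
  std::atomic<uint32_t> fix_count{0};
};

// Hypothetical partitioned hash table with one latch per partition.
class PageHash {
  static constexpr std::size_t N_PARTS = 16;  // arbitrary for this sketch
  struct Part {
    std::shared_mutex latch;
    std::unordered_map<uint64_t, Page*> map;
  };
  std::array<Part, N_PARTS> parts;
  std::mutex pool_mutex;  // plays the role of buf_pool.mutex

  Part& part(uint64_t id) { return parts[id % N_PARTS]; }

public:
  // Adding a page: pool mutex plus the exclusive partition latch.
  void insert(Page* p) {
    std::lock_guard<std::mutex> g(pool_mutex);
    Part& pt = part(p->id);
    std::unique_lock<std::shared_mutex> x(pt.latch);
    pt.map.emplace(p->id, p);
  }

  // Evicting a page: same two locks; refuse if the page is pinned.
  bool evict(uint64_t id) {
    std::lock_guard<std::mutex> g(pool_mutex);
    Part& pt = part(id);
    std::unique_lock<std::shared_mutex> x(pt.latch);
    auto it = pt.map.find(id);
    if (it == pt.map.end() || it->second->fix_count.load() != 0)
      return false;
    pt.map.erase(it);
    return true;
  }

  // Lookup: shared partition latch only, never the pool mutex.
  // The page is pinned before the latch is released.
  Page* lookup_and_fix(uint64_t id) {
    Part& pt = part(id);
    std::shared_lock<std::shared_mutex> s(pt.latch);
    auto it = pt.map.find(id);
    if (it == pt.map.end())
      return nullptr;
    it->second->fix_count.fetch_add(1);
    return it->second;
  }
};
```

    Because eviction needs the exclusive latch, a reader that pins the page
    under the shared latch cannot race with the page being removed.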
    
    Similarly, buf_flush_check_neighbors() will rely solely on
    buf_pool.mutex, no buf_pool.page_hash latch at all.
    
    The buf_pool.mutex is rather contended in I/O heavy benchmarks,
    especially when the workload does not fit in the buffer pool.
    
    The first attempt to alleviate the contention was the
    buf_pool_t::mutex split in
    commit 4ed7082e
    which introduced buf_block_t::mutex, which we are now removing.
    
    Later, multiple instances of buf_pool_t were introduced
    in commit c18084f7
    and recently removed by us in
    commit 1a6f708e (MDEV-15058).
    
    UNIV_BUF_DEBUG: Remove. This option to enable some buffer pool
    related debugging in otherwise non-debug builds has not been used
    for years. Instead, we have been using UNIV_DEBUG, which is enabled
    in CMAKE_BUILD_TYPE=Debug.
    
    buf_block_t::mutex, buf_pool_t::zip_mutex: Remove. We can mainly rely on
    std::atomic and the buf_pool.page_hash latches, and in some cases
    depend on buf_pool.mutex or buf_pool.flush_list_mutex just like before.
    We must always release buf_block_t::lock before invoking
    unfix() or io_unfix(), to prevent a glitch where a block that was
    added to the buf_pool.free list would appear X-latched. See
    commit c5883deb for how this glitch
    was finally caught in a debug environment.
    
    We move some buf_pool_t::page_hash specific code from the
    ha and hash modules to buf_pool, for improved readability.
    
    buf_pool_t::close(): Assert that all blocks are clean, except
    on aborted startup or crash-like shutdown.
    
    buf_pool_t::validate(): No longer attempt to validate
    n_flush[] against the number of BUF_IO_WRITE fixed blocks,
    because buf_page_t::flush_type no longer exists.
    
    buf_pool_t::watch_set(): Replaces buf_pool_watch_set().
    Reduce mutex contention by separating the buf_pool.watch[]
    allocation and the insert into buf_pool.page_hash.
    
    buf_pool_t::page_hash_lock<bool exclusive>(): Acquire a
    buf_pool.page_hash latch.
    Replaces and extends buf_page_hash_lock_s_confirm()
    and buf_page_hash_lock_x_confirm().
    
    buf_pool_t::READ_AHEAD_PAGES: Renamed from BUF_READ_AHEAD_PAGES.
    
    buf_pool_t::curr_size, old_size, read_ahead_area, n_pend_reads:
    Use Atomic_counter.
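
    For readers unfamiliar with this kind of counter, here is a minimal
    sketch of the idea, assuming a thin wrapper around std::atomic that
    uses relaxed memory ordering. RelaxedCounter is a hypothetical name
    and does not reproduce the actual Atomic_counter class.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counter: atomic updates, no ordering guarantees with
// respect to other memory accesses (sufficient for statistics).
template <typename T>
class RelaxedCounter {
  std::atomic<T> v;

public:
  explicit RelaxedCounter(T init = 0) : v(init) {}

  // Pre-increment and pre-decrement return the new value.
  T operator++() { return v.fetch_add(1, std::memory_order_relaxed) + 1; }
  T operator--() { return v.fetch_sub(1, std::memory_order_relaxed) - 1; }

  // Reading is a relaxed load.
  operator T() const { return v.load(std::memory_order_relaxed); }
};
```

    Such a counter can be updated without holding any mutex, at the cost
    that a reader may observe a slightly stale value.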
    
    buf_pool_t::running_out(): Replaces buf_LRU_buf_pool_running_out().
    
    buf_pool_t::LRU_remove(): Remove a block from the LRU list
    and return its predecessor. Incorporates buf_LRU_adjust_hp(),
    which was removed.
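
    A minimal sketch of removing a node while adjusting a hazard pointer,
    assuming an intrusive doubly linked list that is scanned from the tail
    towards the head. LruList, Node and hp are hypothetical names; the
    real list keeps more than one such pointer.

```cpp
#include <cassert>

// Hypothetical intrusive list node.
struct Node {
  Node* prev = nullptr;
  Node* next = nullptr;
};

struct LruList {
  Node* head = nullptr;  // most recently used
  Node* tail = nullptr;  // least recently used
  Node* hp = nullptr;    // hazard pointer: where a scan will resume

  void push_front(Node* n) {
    n->prev = nullptr;
    n->next = head;
    if (head) head->prev = n; else tail = n;
    head = n;
  }

  // Unlink n and return its predecessor. If a scan was about to resume
  // at n, move the hazard pointer to the predecessor first, so that the
  // scan never dereferences a removed node.
  Node* remove(Node* n) {
    Node* pred = n->prev;
    if (hp == n) hp = pred;
    if (pred) pred->next = n->next; else head = n->next;
    if (n->next) n->next->prev = pred; else tail = pred;
    n->prev = n->next = nullptr;
    return pred;
  }
};
```

    Folding the adjustment into the removal means that no caller can
    forget it.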
    
    buf_page_get_gen(): Remove a redundant call of fsp_is_system_temporary(),
    for mode == BUF_GET_IF_IN_POOL_OR_WATCH, which is only used by
    BTR_DELETE_OP (purge), which is never invoked on temporary tables.
    
    buf_free_from_unzip_LRU_list_batch(): Avoid redundant assignments.
    
    buf_LRU_free_from_unzip_LRU_list(): Simplify the loop condition.
    
    buf_LRU_free_page(): Clarify the function comment.
    
    buf_flush_check_neighbor(), buf_flush_check_neighbors():
    Rewrite the construction of the page hash range. We will hold
    the buf_pool.mutex for up to buf_pool.read_ahead_area (at most 64)
    consecutive lookups of buf_pool.page_hash.
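
    The range construction can be sketched as follows, assuming that the
    area is a power of two and that the range is clamped to the size of
    the tablespace. neighbor_range() and PageRange are hypothetical names,
    and the real code may apply further limits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Half-open range [low, high) of page numbers.
struct PageRange {
  uint32_t low;
  uint32_t high;
};

// Hypothetical helper: the aligned block of `area` pages that contains
// page_no, cut off at the end of the tablespace.
inline PageRange neighbor_range(uint32_t page_no, uint32_t area,
                                uint32_t space_size) {
  assert(area != 0 && (area & (area - 1)) == 0);  // power of two
  const uint32_t low = page_no & ~(area - 1);
  const uint32_t high = std::min(low + area, space_size);
  return {low, high};
}
```

    With an area of 64, the range never covers more than 64 page numbers,
    which bounds the number of lookups made while holding the mutex.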
    
    buf_flush_page_and_try_neighbors(): Remove.
    Merge it into its only callers, and remove redundant operations in
    buf_flush_LRU_list_batch().
    
    buf_read_ahead_random(), buf_read_ahead_linear(): Rewrite.
    Do not acquire buf_pool.mutex, and iterate directly with page_id_t.
    
    ut_2_power_up(): Remove. my_round_up_to_next_power() is inlined
    and avoids any loops.
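
    For illustration, one common loop-free way to round a 32-bit value up
    to the next power of two is shown below. round_up_pow2() is a
    hypothetical name, and this sketch makes no claim to reproduce the
    body of my_round_up_to_next_power().

```cpp
#include <cassert>
#include <cstdint>

// Round v up to the next power of two (v itself if already a power of
// two). Smears the highest set bit of v-1 into all lower bits, then
// adds one. Not meaningful for v == 0 or v > 2^31.
inline uint32_t round_up_pow2(uint32_t v) {
  v--;
  v |= v >> 1;
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  return v + 1;
}
```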
    
    fil_page_get_prev(), fil_page_get_next(), fil_addr_is_null(): Remove.
    
    buf_flush_page(): Add a fil_space_t* parameter. Minimize the
    buf_pool.mutex hold time. buf_pool.n_flush[] is no longer updated
    atomically with the io_fix, and we will protect most buf_block_t
    fields with buf_block_t::lock. The function
    buf_flush_write_block_low() is removed and merged here.
    
    buf_page_init_for_read(): Use static linkage. Initialize the newly
    allocated block and acquire the exclusive buf_block_t::lock while not
    holding any mutex.
    
    IORequest::IORequest(): Remove the body. We only need to invoke
    set_punch_hole() in buf_flush_page() and nowhere else.
    
    buf_page_t::flush_type: Remove. Replaced by IORequest::flush_type.
    This field is only used during a fil_io() call.
    That function already takes IORequest as a parameter, so we had
    better carry the rarely changing field there.
    
    buf_block_t::init(): Replaces buf_page_init().
    
    buf_page_t::init(): Replaces buf_page_init_low().
    
    buf_block_t::initialise(): Initialise many fields, but
    keep the buf_page_t::state(). Both buf_pool_t::validate() and
    buf_page_optimistic_get() require that buf_page_t::in_file()
    be protected atomically with buf_page_t::in_page_hash
    and buf_page_t::in_LRU_list.
    
    buf_page_optimistic_get(): Now that buf_block_t::mutex
    no longer exists, we must check buf_page_t::io_fix()
    after acquiring the buf_pool.page_hash lock, to detect
    whether buf_page_init_for_read() has been initiated.
    We will also check the io_fix() before acquiring hash_lock
    in order to avoid unnecessary computation.
    The field buf_block_t::modify_clock (protected by buf_block_t::lock)
    allows buf_page_optimistic_get() to validate the block.
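
    The order of checks described above can be sketched as follows. This
    is a simplified model under stated assumptions, not the actual
    function: Pool, Block, IoFix and optimistic_get() are hypothetical,
    a single std::shared_mutex stands in for the buf_pool.page_hash
    latch, and only a shared block latch is requested.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <shared_mutex>

enum class IoFix { NONE, READ, WRITE };

struct Block {
  std::shared_mutex lock;                 // stands in for buf_block_t::lock
  uint64_t modify_clock = 0;              // protected by lock
  std::atomic<IoFix> io_fix{IoFix::NONE};
  std::atomic<uint32_t> fix_count{0};     // buffer-fix (pin) count
};

struct Pool {
  std::shared_mutex hash_latch;           // stands in for the page_hash latch
};

// On success, returns true with b pinned and b.lock held in shared mode.
bool optimistic_get(Pool& pool, Block& b, uint64_t saved_clock) {
  // Cheap early exit before touching any latch.
  if (b.io_fix.load() != IoFix::NONE)
    return false;
  {
    std::shared_lock<std::shared_mutex> s(pool.hash_latch);
    // Re-check under the latch: a read may have been initiated meanwhile.
    if (b.io_fix.load() != IoFix::NONE)
      return false;
    b.fix_count.fetch_add(1);             // pin before releasing the latch
  }
  b.lock.lock_shared();
  // The block is only valid if nobody changed it since we saved the clock.
  if (b.modify_clock != saved_clock) {
    b.lock.unlock_shared();
    b.fix_count.fetch_sub(1);
    return false;
  }
  return true;
}
```

    A caller that gets false is expected to fall back to a regular
    page lookup.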
    
    buf_page_t::real_size: Remove. It was only used while flushing
    pages of page_compressed tables.
    
    buf_page_encrypt(): Add an output parameter that allows us to eliminate
    buf_page_t::real_size. Replace a condition with debug assertion.
    
    buf_page_should_punch_hole(): Remove.
    
    buf_dblwr_t::add_to_batch(): Replaces buf_dblwr_add_to_batch().
    Add the parameter size (to replace buf_page_t::real_size).
    
    buf_dblwr_t::write_single_page(): Replaces buf_dblwr_write_single_page().
    Add the parameter size (to replace buf_page_t::real_size).
    
    fil_system_t::detach(): Replaces fil_space_detach().
    Ensure that fil_validate() will not be violated even if
    fil_system.mutex is released and reacquired.
    
    fil_node_t::complete_io(): Renamed from fil_node_complete_io().
    
    fil_node_t::close_to_free(): Replaces fil_node_close_to_free().
    Avoid invoking fil_node_t::close() because fil_system.n_open
    has already been decremented in fil_system_t::detach().
    
    BUF_BLOCK_READY_FOR_USE: Remove. Directly use BUF_BLOCK_MEMORY.
    
    BUF_BLOCK_ZIP_DIRTY: Remove. Directly use BUF_BLOCK_ZIP_PAGE,
    and distinguish dirty pages by buf_page_t::oldest_modification().
    
    BUF_BLOCK_POOL_WATCH: Remove. Use BUF_BLOCK_NOT_USED instead.
    This state was only being used for buf_page_t that are in
    buf_pool.watch.
    
    buf_pool_t::watch[]: Remove pointer indirection.
    
    buf_page_t::in_flush_list: Remove. It was set if and only if
    buf_page_t::oldest_modification() is nonzero.
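
    The idea of deriving a flag instead of storing it can be sketched as
    follows. PageDesc and its members are hypothetical, and the sketch
    ignores the locking that protects the field in the real code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical page descriptor: dirtiness is derived from the LSN of
// the oldest unflushed modification, so no separate flag can go stale.
class PageDesc {
  uint64_t oldest_modification_ = 0;  // 0 means the page is clean

public:
  uint64_t oldest_modification() const { return oldest_modification_; }

  // What a separate in_flush_list flag used to express.
  bool in_flush_list() const { return oldest_modification_ != 0; }

  // Record the first modification; later ones keep the oldest LSN.
  void set_modified(uint64_t lsn) {
    assert(lsn != 0);
    if (oldest_modification_ == 0)
      oldest_modification_ = lsn;
  }

  // Called once the page has been written out.
  void clear_modified() { oldest_modification_ = 0; }
};
```

    The same reasoning applies to a dedicated "dirty" page state: it can
    be replaced by a test of oldest_modification().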
    
    buf_page_decrypt_after_read(), buf_corrupt_page_release(),
    buf_page_check_corrupt(): Change the const fil_space_t* parameter
    to const fil_node_t& so that we can report the correct file name.
    
    buf_page_monitor(): Declare as an ATTRIBUTE_COLD global function.
    
    buf_page_io_complete(): Split to buf_page_read_complete() and
    buf_page_write_complete().
    
    buf_dblwr_t::in_use: Remove.
    
    buf_dblwr_t::buf_block_array: Add IORequest::flush_t.
    
    buf_dblwr_sync_datafiles(): Remove. It was a useless wrapper of
    os_aio_wait_until_no_pending_writes().
    
    buf_flush_write_complete(): Declare static, not global.
    Add the parameter IORequest::flush_t.
    
    buf_flush_freed_page(): Simplify the code.
    
    recv_sys_t::flush_lru: Renamed from flush_type and changed to bool.
    
    fil_read(), fil_write(): Replaced with direct use of fil_io().
    
    fil_buffering_disabled(): Remove. Check srv_file_flush_method directly.
    
    fil_mutex_enter_and_prepare_for_io(): Return the resolved
    fil_space_t* to avoid a duplicated lookup in the caller.
    
    fil_report_invalid_page_access(): Clean up the parameters.
    
    fil_io(): Return fil_io_t, which comprises fil_node_t and error code.
    Always invoke fil_space_t::acquire_for_io() and let either the
    sync=true caller or fil_aio_callback() invoke
    fil_space_t::release_for_io().
    
    fil_aio_callback(): Rewrite to replace buf_page_io_complete().
    
    fil_check_pending_operations(): Remove a parameter, and remove some
    redundant lookups.
    
    fil_node_close_to_free(): Wait for n_pending==0. Because we no longer
    do an extra lookup of the tablespace between fil_io() and the
    completion of the operation, we must give fil_node_t::complete_io() a
    chance to decrement the counter.
    
    fil_close_tablespace(): Remove unused parameter trx, and document
    that this is only invoked during the error handling of IMPORT TABLESPACE.
    
    row_import_discard_changes(): Merged with the only caller,
    row_import_cleanup(). Do not lock up the data dictionary while
    invoking fil_close_tablespace().
    
    logs_empty_and_mark_files_at_shutdown(): Do not invoke
    fil_close_all_files(), to avoid a !needs_flush assertion failure
    on fil_node_t::close().
    
    innodb_shutdown(): Invoke os_aio_free() before fil_close_all_files().
    
    fil_close_all_files(): Invoke fil_flush_file_spaces()
    to ensure proper durability.
    
    thread_pool::unbind(): Fix a crash that would occur on Windows
    after srv_thread_pool->disable_aio() and os_file_close().
    This fix was submitted by Vladislav Vaintroub.
    
    Thanks to Matthias Leich and Axel Schwenke for extensive testing,
    Vladislav Vaintroub for helpful comments, and Eugene Kosov for a review.