    MDEV-23855: Improve InnoDB log checkpoint performance · 3a9a3be1
    Marko Mäkelä authored
    After MDEV-15053, MDEV-22871, MDEV-23399 shifted the scalability
    bottleneck, log checkpoints became a new bottleneck.
    
    If innodb_io_capacity is set low or innodb_max_dirty_pages_pct_lwm is
    set high and the workload fits in the buffer pool, the page cleaner
    thread will perform very little flushing. When we reach the capacity
    of the circular redo log file ib_logfile0 and must initiate a checkpoint,
    some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF,
    then flushing would continue at the innodb_io_capacity rate, and
    writers would be throttled.)
    
    We have the best chance of advancing the checkpoint LSN immediately
    after a page flush batch has been completed. Hence, it is best to
    perform checkpoints after every batch in the page cleaner thread,
    attempting to run once per second.
    
    By initiating high-priority flushing in the page cleaner as early
    as possible, we aim to make the throughput more stable.
    
    The function buf_flush_wait_flushed() used to sleep for 10ms, hoping
    that the page cleaner thread would do something during that time.
    The observed end result was that a large number of threads that called
    log_free_check() would end up sleeping while nothing useful was happening.
    
    We will revise the design so that in the default innodb_flush_sync=ON
    mode, buf_flush_wait_flushed() will wake up the page cleaner thread
    to perform the necessary flushing, and it will wait for a signal from
    the page cleaner thread.
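
    The handshake described above can be sketched with two condition
    variables. This is only an illustration under stated assumptions:
    FlushState, page_cleaner() and wait_flushed() are hypothetical names,
    and the "flush" merely advances a counter instead of writing pages.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

using lsn_t = std::uint64_t;

// Hypothetical stand-in for the state shared with the page cleaner.
struct FlushState {
  std::mutex mutex;                    // plays the role of buf_pool.flush_list_mutex
  std::condition_variable do_flush;    // wakes up the page cleaner
  std::condition_variable done_flush;  // wakes up waiting writers
  lsn_t sync_lsn = 0;                  // requested flush target; 0 means none
  lsn_t flushed_lsn = 0;               // everything up to here has been flushed
  bool shutdown = false;
};

// Page cleaner loop: sleep until a target is set, then flush in batches
// and signal the waiters after every batch.
void page_cleaner(FlushState &s, lsn_t batch) {
  assert(batch > 0);
  std::unique_lock<std::mutex> lk(s.mutex);
  for (;;) {
    s.do_flush.wait(lk, [&] { return s.shutdown || s.sync_lsn != 0; });
    if (s.shutdown) return;
    while (s.flushed_lsn < s.sync_lsn) {
      lk.unlock();
      // A real implementation would write out a batch of pages here.
      lk.lock();
      s.flushed_lsn += batch;
      s.done_flush.notify_all();
    }
    s.sync_lsn = 0;  // target reached: leave the high-priority mode
  }
}

// Writer side: instead of sleeping for a fixed time, raise the target,
// wake up the page cleaner, and wait until our own limit is satisfied.
void wait_flushed(FlushState &s, lsn_t target) {
  assert(target != 0);
  std::unique_lock<std::mutex> lk(s.mutex);
  if (s.flushed_lsn >= target) return;
  if (s.sync_lsn < target) s.sync_lsn = target;
  s.do_flush.notify_one();
  s.done_flush.wait(lk, [&] { return s.flushed_lsn >= target; });
}
```

    A writer that calls wait_flushed(s, 250) while page_cleaner(s, 100)
    runs in another thread resumes as soon as flushed_lsn reaches at
    least 250, without any fixed-length sleep.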
    
    If innodb_io_capacity is set to a low value (causing the page cleaner to
    throttle its work), a write workload would initially perform well, until
    the capacity of the circular ib_logfile0 is reached and log_free_check()
    starts to trigger checkpoints. At that point, the extra waiting in
    buf_flush_wait_flushed() will start reducing throughput.
    
    The page cleaner thread will also initiate log checkpoints after each
    buf_flush_lists() call, because that is the best point in time for
    the checkpoint LSN to advance by the maximum amount.
    
    Even in 'furious flushing' mode we invoke buf_flush_lists() with
    innodb_io_capacity_max pages at a time, and at the start of each
    batch (in the log_flush() callback function that runs in a separate
    task) we will invoke os_aio_wait_until_no_pending_writes(). This
    tweak allows the checkpoint to advance in smaller steps and
    significantly reduces the maximum latency. On an Intel Optane 960
    NVMe SSD on Linux, the maximum latency dropped from 4.6 seconds to
    74 milliseconds. On Microsoft Windows with a slower SSD, it dropped
    from more than 180 seconds to 0.6 seconds.
    
    We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity pages
    per second whenever the dirty proportion of buffer pool pages exceeds
    innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try
    to make page_cleaner_flush_pages_recommendation() more consistent and
    predictable: if we are below innodb_adaptive_flushing_lwm, let us flush
    pages according to the return value of af_get_pct_for_dirty().
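
    The innodb_adaptive_flushing=OFF rule amounts to a single threshold
    check. A minimal sketch follows; pages_to_flush_per_second() is a
    hypothetical helper that is not part of InnoDB, and returning 0 below
    the threshold is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the parameters mirror the configuration variables
// innodb_max_dirty_pages_pct_lwm and innodb_io_capacity.
std::uint64_t pages_to_flush_per_second(double dirty_pct,
                                        double max_dirty_pages_pct_lwm,
                                        std::uint64_t io_capacity) {
  assert(dirty_pct >= 0.0 && dirty_pct <= 100.0);
  // Flush at the configured rate only while the dirty proportion of the
  // buffer pool exceeds the low-water mark; otherwise stay idle
  // (the idle case is an assumption of this sketch).
  return dirty_pct > max_dirty_pages_pct_lwm ? io_capacity : 0;
}
```

    With a low-water mark of 0, any nonzero dirty proportion triggers
    flushing in this sketch.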
    
    innodb_max_dirty_pages_pct_lwm: Revert the change of the default value
    that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0
    guarantees that a shutdown of an idle server will be fast. Users might
    be surprised if normal shutdown suddenly became slower when upgrading
    within a GA release series.
    
    innodb_checkpoint_usec: Remove. The master task will no longer perform
    periodic log checkpoints. It is the duty of the page cleaner thread.
    
    log_sys.max_modified_age: Remove. The current span of the
    buf_pool.flush_list expressed in LSN only matters for adaptive
    flushing (outside the 'furious flushing' condition).
    For the correctness of checkpoints, the only thing that matters is
    the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn).
    This run-time constant was also reported as log_max_modified_age_sync.
    
    log_sys.max_checkpoint_age_async: Remove. This does not serve any
    purpose, because the checkpoints will now be triggered by the page
    cleaner thread. We will retain the log_sys.max_checkpoint_age limit
    for engaging 'furious flushing'.
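
    The checkpoint age and the one limit that is retained can be
    expressed as follows. This is a sketch only: LogState is a
    hypothetical plain struct standing in for log_sys, and the strict
    comparison is an assumption.

```cpp
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;

// Hypothetical snapshot of the relevant log_sys fields.
struct LogState {
  lsn_t lsn;                  // current end of the redo log
  lsn_t last_checkpoint_lsn;  // LSN of the latest checkpoint
  lsn_t max_checkpoint_age;   // limit for engaging 'furious flushing'
};

// Checkpoint age as defined above: lsn - last_checkpoint_lsn.
lsn_t checkpoint_age(const LogState &log) {
  assert(log.lsn >= log.last_checkpoint_lsn);
  return log.lsn - log.last_checkpoint_lsn;
}

// Whether the checkpoint age exceeds the retained limit
// (using > rather than >= is an assumption of this sketch).
bool needs_furious_flushing(const LogState &log) {
  return checkpoint_age(log) > log.max_checkpoint_age;
}
```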
    
    page_cleaner.slot: Remove. It turns out that
    page_cleaner_slot.flush_list_time was duplicating
    page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass
    was duplicating page_cleaner.flush_pass.
    Likewise, there were some redundant monitor counters, because the
    page cleaner thread no longer performs any buf_pool.LRU flushing, and
    because there is only one buf_flush_page_cleaner thread.
    
    buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex.
    
    buf_pool_t::get_oldest_modification(): Add a parameter to specify the
    return value when no persistent data pages are dirty. Require the
    caller to hold buf_pool.flush_list_mutex.
    
    log_buf_pool_get_oldest_modification(): Take the fall-back LSN
    as a parameter. All callers will also invoke log_sys.get_lsn().
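
    The fall-back parameter can be illustrated with a simplified flush
    list. FlushList and add_dirty() are hypothetical; the sketch assumes
    that the newest dirty page sits at the front, ignores the distinction
    between persistent and temporary pages, and leaves the locking to the
    caller, as the new interface requires.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

using lsn_t = std::uint64_t;

// Hypothetical, simplified model of buf_pool.flush_list.
struct FlushList {
  // oldest_modification of each dirty page, newest first, oldest last.
  std::deque<lsn_t> oldest_modification;

  // Register a newly dirtied page; LSNs must not go backwards.
  void add_dirty(lsn_t lsn) {
    assert(oldest_modification.empty() || lsn >= oldest_modification.front());
    oldest_modification.push_front(lsn);
  }

  // Return the oldest modification LSN, or the caller-supplied
  // fall-back value when nothing is dirty. The caller is expected
  // to hold the mutex that protects the list.
  lsn_t get_oldest_modification(lsn_t fallback) const {
    return oldest_modification.empty() ? fallback
                                       : oldest_modification.back();
  }
};
```

    A caller that passes the current LSN as the fall-back obtains the
    current LSN when no page is dirty, so that a checkpoint can advance
    all the way to the end of the log.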
    
    log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed().
    
    buf_flush_wait_flushed(): Implement two limits. If not enough of the
    buffer pool has been flushed, signal the page cleaner (unless
    innodb_flush_sync=OFF) and wait for the page cleaner to complete.
    If the page cleaner thread is not running (which can be the case
    during shutdown), initiate the flush and wait for it directly.
    
    buf_flush_ahead(): If innodb_flush_sync=ON (the default),
    submit a new buf_flush_sync_lsn target for the page cleaner
    but do not wait for the flushing to finish.
    
    log_get_capacity(), log_get_max_modified_age_async(): Remove, to make
    it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes.
    
    page_cleaner_flush_pages_recommendation(): Protect all access to
    buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there
    were some race conditions in the calculation.
    
    buf_flush_sync_for_checkpoint(): New function to process
    buf_flush_sync_lsn in the page cleaner thread. At the end of
    each batch, we try to wake up any blocked buf_flush_wait_flushed().
    If everything up to buf_flush_sync_lsn has been flushed, we will
    reset buf_flush_sync_lsn=0. The page cleaner thread will keep
    'furious flushing' until the limit is reached. Any threads that
    are waiting in buf_flush_wait_flushed() will be able to resume
    as soon as their own limit has been satisfied.
    
    buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not
    sleep as long as it is set. Do not update any page_cleaner statistics
    for this special mode of operation. In the normal mode
    (buf_flush_sync_lsn is not set for innodb_flush_sync=ON),
    try to wake up once per second. No longer check whether
    srv_inc_activity_count() has been called. After each batch,
    try to perform a log checkpoint, because the best chances for
    the checkpoint LSN to advance by the maximum amount are upon
    completing a flushing batch.
    
    log_t: Move buf_free and max_buf_free so that they possibly share
    a cache line with log_sys.mutex.
    
    log_margin_checkpoint_age(): Simplify the logic, and replace
    a 0.1-second sleep with a call to buf_flush_wait_flushed() to
    initiate flushing. Moved to the same compilation unit
    as the only caller.
    
    log_close(): Clean up the calculations. (Should be no functional
    change.) Return whether flush-ahead is needed. Moved to the same
    compilation unit as the only caller.
    
    mtr_t::finish_write(): Return whether flush-ahead is needed.
    
    mtr_t::commit(): Invoke buf_flush_ahead() when needed. Let us avoid
    external calls in mtr_t::commit() and make the logic easier to follow
    by having related code in a single compilation unit. Also, we will
    invoke srv_stats.log_write_requests.inc() only once per
    mini-transaction commit, while not holding mutexes.
    
    log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age.
    Upon reaching log_sys.max_checkpoint_age where we must wait to prevent
    the log from getting corrupted, let us wait for at most 1MiB of LSN
    at a time, before rechecking the condition. This should allow writers
    to proceed even if the redo log capacity has been reached and
    'furious flushing' is in progress. We no longer care about
    log_sys.max_modified_age_sync or log_sys.max_modified_age_async.
    The log_sys.max_modified_age_sync could be a relic from the time when
    there was a srv_master_thread that wrote dirty pages to data files.
    Also, we no longer have any log_sys.max_checkpoint_age_async limit,
    because log checkpoints will now be triggered by the page cleaner
    thread upon completing buf_flush_lists().
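
    Waiting for at most 1MiB of LSN at a time could look roughly like
    the following. This is a sketch under stated assumptions, not the
    actual log_checkpoint_margin() logic: next_wait_target() is a
    hypothetical helper, and returning 0 for "no wait needed" is a
    convention of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;

constexpr lsn_t kStep = lsn_t{1} << 20;  // 1 MiB of LSN per round

// Returns the LSN up to which a writer should wait for flushing in this
// round, or 0 if the checkpoint age is within the limit.
lsn_t next_wait_target(lsn_t lsn, lsn_t checkpoint_lsn,
                       lsn_t max_checkpoint_age) {
  assert(lsn >= checkpoint_lsn);
  const lsn_t age = lsn - checkpoint_lsn;
  if (age <= max_checkpoint_age) return 0;
  // Only ask for a bounded amount of progress, then recheck the
  // condition, so that writers can resume as early as possible.
  const lsn_t excess = age - max_checkpoint_age;
  return checkpoint_lsn + std::min(excess, kStep);
}
```

    A caller would loop: compute the target, wait for that much to be
    flushed, let the checkpoint advance, and recheck until the function
    returns 0.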
    
    log_set_capacity(): Simplify the calculations of the limit
    (no functional change).
    
    log_checkpoint_low(): Split from log_checkpoint(). Moved to the
    same compilation unit as the caller.
    
    log_make_checkpoint(): Only wait for everything up to the current
    LSN to be flushed.
    
    create_log_file(): After checkpoint, invoke log_write_up_to()
    to ensure that the FILE_CHECKPOINT record has been written.
    This avoids ut_ad(!srv_log_file_created) in create_log_file_rename().
    
    srv_start(): Do not call recv_recovery_from_checkpoint_start()
    if the log has just been created. Set fil_system.space_id_reuse_warned
    before dict_boot() has been executed, and clear it after recovery
    has finished.
    
    dict_boot(): Initialize fil_system.max_assigned_id.
    
    srv_check_activity(): Remove. The activity count is counting transaction
    commits and therefore mostly interesting for the purge of history.
    
    BtrBulk::insert(): Do not explicitly wake up the page cleaner,
    but do invoke srv_inc_activity_count(), because that counter is
    still being used in buf_load_throttle_if_needed() for some
    heuristics. (It might be cleaner to execute buf_load() in the
    page cleaner thread!)
    
    Reviewed by: Vladislav Vaintroub