1. 15 Oct, 2020 5 commits
    • MDEV-23399: Performance regression with write workloads · 7cffb5f6
      Marko Mäkelä authored
      The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
      the performance bottleneck to the page flushing.
      
      The configuration parameters will be changed as follows:
      
      innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
      innodb_lru_scan_depth=1536 (old: 1024)
      innodb_max_dirty_pages_pct=90 (old: 75)
      innodb_max_dirty_pages_pct_lwm=75 (old: 0)
      
      Note: The parameter innodb_lru_scan_depth will only affect LRU
      eviction of buffer pool pages when a new page is being allocated. The
      page cleaner thread will no longer evict any pages. It used to
      guarantee that some pages will remain free in the buffer pool. Now, we
      perform that eviction 'on demand' in buf_LRU_get_free_block().
      The parameter innodb_lru_scan_depth (srv_LRU_scan_depth) is used as follows:
       * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
       * As a buf_pool.free limit in buf_LRU_list_batch() for terminating
         the flushing that is initiated e.g., by buf_LRU_get_free_block()
      The parameter also used to serve as an initial limit for unzip_LRU
      eviction (evicting uncompressed page frames while retaining
      ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
      of 100 or unlimited for invoking buf_LRU_scan_and_free_block().
      
      The status variables will be changed as follows:
      
      innodb_buffer_pool_pages_flushed: This also includes the count of
      innodb_buffer_pool_pages_LRU_flushed and should work reliably,
      updated one by one in buf_flush_page() to give more real-time
      statistics. The function buf_flush_stats(), which we are removing,
      was not called in every code path. For both counters, we will use
      regular variables that are incremented in a critical section of
      buf_pool.mutex. Note that show_innodb_vars() directly links to the
      variables, and reads of the counters will *not* be protected by
      buf_pool.mutex, so you cannot get a consistent snapshot of both variables.
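
      As a rough illustration of the counter scheme described above (not
      the actual InnoDB code), the following sketch keeps two regular
      counters that are only incremented inside a critical section.
      FlushStats, note_flush() and the member names are hypothetical
      stand-ins for the server's variables and for buf_pool.mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for the two status counters.
struct FlushStats {
  std::mutex mutex;                   // plays the role of buf_pool.mutex
  std::size_t pages_flushed = 0;      // every page write, LRU ones included
  std::size_t pages_lru_flushed = 0;  // only writes that belong to an LRU flush

  // Called once per written page, so the counters advance one by one.
  void note_flush(bool lru) {
    std::lock_guard<std::mutex> guard(mutex);
    ++pages_flushed;
    if (lru)
      ++pages_lru_flushed;
    // Holds inside the critical section; an unlocked reader that samples
    // the two counters at different moments cannot rely on seeing a
    // matching pair.
    assert(pages_lru_flushed <= pages_flushed);
  }
};
```

      A reader that samples the two members without taking the mutex may
      observe values from different moments, which is the inconsistency
      the note above warns about.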
      
      The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
      removed, because the page cleaner no longer deals with writing or
      evicting least recently used pages, and because the single-page writes
      have been removed:
      * buffer_LRU_batch_flush_avg_time_slot
      * buffer_LRU_batch_flush_avg_time_thread
      * buffer_LRU_batch_flush_avg_time_est
      * buffer_LRU_batch_flush_avg_pass
      * buffer_LRU_single_flush_scanned
      * buffer_LRU_single_flush_num_scan
      * buffer_LRU_single_flush_scanned_per_call
      
      When moving to a single buffer pool instance in MDEV-15058, we missed
      some opportunity to simplify the buf_flush_page_cleaner thread. It was
      unnecessarily using a mutex and some complex data structures, even
      though we always have a single page cleaner thread.
      
      Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
      and 'shutdown' modes where it was waiting to be triggered by some
      other thread, adding unnecessary latency and potential for hangs in
      relatively rarely executed startup or shutdown code.
      
      The page cleaner was also running two kinds of batches in an
      interleaved fashion: "LRU flush" (writing out some least recently used
      pages and evicting them on write completion) and the normal batches
      that aim to increase the MIN(oldest_modification) in the buffer pool,
      to help the log checkpoint advance.
      
      The buf_pool.flush_list flushing was being blocked by
      buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
      of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
      been persistently written to the redo log, we would trigger a log
      flush and then resume the page flushing. This would unnecessarily
      limit the performance of the page cleaner thread and trigger the
      infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
      The settings might not be optimal" that were suppressed in
      commit d1ab8903 unless log_warnings>2.
      
      Our revised algorithm will make log_sys.get_flushed_lsn() advance at
      the start of buf_flush_lists(), and then execute a 'best effort' to
      write out all pages. The flush batches will skip pages that were modified
      since the log was written, or are currently exclusively locked.
      The MDEV-13670 message "page_cleaner: 1000ms intended loop took"
      will be removed, because by design, the buf_flush_page_cleaner() should
      not be blocked during a batch for extended periods of time.
      
      We will remove the single-page flushing altogether. Related to this,
      the debug parameter innodb_doublewrite_batch_size will be removed,
      because all of the doublewrite buffer will be used for flushing
      batches. If a page needs to be evicted from the buffer pool and all
      100 least recently used pages in the buffer pool have unflushed
      changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
      write out and evict innodb_lru_flush_size pages. At most one thread
      will execute buf_flush_lists() in buf_LRU_get_free_block(); other
      threads will wait for that LRU flushing batch to finish.
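
      The "at most one thread flushes, the others wait" rule can be
      sketched with a flag and a condition variable. This is only an
      illustration under assumed names (LruFlushGate, flush_or_wait());
      in the server the corresponding state would be
      buf_pool_t::n_flush_LRU and buf_pool_t::done_flush_LRU.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical gate that lets one caller run a batch while others wait.
class LruFlushGate {
  std::mutex mutex_;
  std::condition_variable done_;  // signalled when no batch is running
  bool running_ = false;
  std::size_t batches_ = 0;

public:
  // Returns true if this caller executed the batch, or false if it only
  // waited for a batch that another thread was already running.
  template <typename Batch>
  bool flush_or_wait(Batch batch) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (running_) {
      done_.wait(lock, [this] { return !running_; });
      return false;
    }
    running_ = true;
    lock.unlock();
    batch();  // the I/O happens without holding the mutex
    lock.lock();
    assert(running_);
    running_ = false;
    ++batches_;
    lock.unlock();
    done_.notify_all();
    return true;
  }

  std::size_t batches() {
    std::lock_guard<std::mutex> guard(mutex_);
    return batches_;
  }
};
```

      A caller that gets false back would rescan for a free block instead
      of starting a second batch.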
      
      To improve concurrency, we will replace the InnoDB ib_mutex_t and
      os_event_t with native mutexes and condition variables in this area of code.
      Most notably, this means that the buffer pool mutex (buf_pool.mutex)
      is no longer instrumented via any InnoDB interfaces. It will continue
      to be instrumented via PERFORMANCE_SCHEMA.
      
      For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
      declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
      sections of buf_pool.flush_list_mutex should be shorter than those for
      buf_pool.mutex, because in the worst case, they cover a linear scan of
      buf_pool.flush_list, while the worst case of a critical section of
      buf_pool.mutex covers a linear scan of the potentially much longer
      buf_pool.LRU list.
      
      mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
      with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
      instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().
      
      buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
      Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
      The number of active flush operations.
      
      buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
      instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
      and SAFE_MUTEX instrumentation.
      
      buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.
      
      buf_pool_t::done_flush_list: Condition variable for !n_flush_list.
      
      buf_pool_t::do_flush_list: Condition variable to wake up the
      buf_flush_page_cleaner when a log checkpoint needs to be written
      or the server is being shut down. Replaces buf_flush_event.
      We will keep using timed waits (the page cleaner thread will wake
      _at least_ once per second), because the calculations for
      innodb_adaptive_flushing depend on fixed time intervals.
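
      A minimal sketch of such a timed wait, assuming standard C++
      primitives instead of the server's mysql_cond_t wrappers. The
      names CleanerSignal, request() and wait_for_work() are invented
      for the example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical wake-up channel for a page-cleaner-like thread.
struct CleanerSignal {
  std::mutex mutex;
  std::condition_variable do_flush_list;
  bool requested = false;

  // Called by a thread that needs a checkpoint or a shutdown flush.
  void request() {
    {
      std::lock_guard<std::mutex> guard(mutex);
      requested = true;
    }
    do_flush_list.notify_one();
  }

  // Returns true if woken by request(), false if the timeout elapsed.
  // Either way the caller gets to run, so with a 1-second timeout the
  // loop executes at least once per second.
  bool wait_for_work(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex);
    bool woken = do_flush_list.wait_for(lock, timeout,
                                        [this] { return requested; });
    assert(lock.owns_lock());
    requested = false;
    return woken;
  }
};
```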
      
      buf_dblwr: Allocate statically, and move all code to member functions.
      Use a native mutex and condition variable. Remove code to deal with
      single-page flushing.
      
      buf_dblwr_check_block(): Make the check debug-only. We were spending
      a significant amount of execution time in page_simple_validate_new().
      
      flush_counters_t::unzip_LRU_evicted: Remove.
      
      IORequest: Make more members const. FIXME: m_fil_node should be removed.
      
      buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
      (which we are removing).
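
      One way a mutex-free flush target could look is sketched below.
      The monotonic "only ever raise the target" policy and the function
      name request_sync_flush() are assumptions made for the example;
      the commit itself only states that std::atomic replaces
      page_cleaner.mutex.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;

// Raise the target LSN to at least `lsn`; never lower it.
// `target` is a hypothetical stand-in for buf_flush_sync_lsn.
void request_sync_flush(std::atomic<lsn_t>& target, lsn_t lsn) {
  lsn_t current = target.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak reloads `current`, so the loop
  // stops as soon as some thread has installed a value >= lsn.
  while (current < lsn &&
         !target.compare_exchange_weak(current, lsn,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
  }
  assert(target.load(std::memory_order_relaxed) >= lsn);
}
```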
      
      page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.
      
      pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().
      
      recv_writer_thread: Remove. Recovery works just fine without it, if we
      simply invoke buf_flush_sync() at the end of each batch in
      recv_sys_t::apply().
      
      recv_recovery_from_checkpoint_finish(): Remove. We can simply call
      recv_sys.debug_free() directly.
      
      srv_started_redo: Replaces srv_start_state.
      
      SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
      can communicate with the normal page cleaner loop via the new function
      flush_buffer_pool().
      
      buf_flush_remove(): Assert that the calling thread is holding
      buf_pool.flush_list_mutex. This removes unnecessary mutex operations
      from buf_flush_remove_pages() and buf_flush_dirty_pages(),
      which replace buf_LRU_flush_or_remove_pages().
      
      buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
      interface. Return the number of flushed pages. Clarified comments and
      renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
      buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
      function, which was their only caller, and remove two unnecessary
      buf_pool.mutex release/re-acquisition cycles that we used to perform around
      the buf_flush_batch() call. At the start, if not all log has been
      durably written, wait for a background task to do it, or start a new
      task to do it. This allows the log write to run concurrently with our
      page flushing batch. Any pages that were skipped due to too recent
      FIL_PAGE_LSN or due to them being latched by a writer should be flushed
      during the next batch, unless there are further modifications to those
      pages. It is possible that a page that we must flush due to small
      oldest_modification also carries a recent FIL_PAGE_LSN or is being
      constantly modified. In the worst case, all writers would then end up
      waiting in log_free_check() to allow the flushing and the checkpoint
      to complete.
      
      buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
      Cache the last looked up tablespace. If neighbor flushing is not applicable,
      invoke buf_flush_page() directly, avoiding a page lookup in between.
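
      The "cache the last looked up tablespace" idea can be sketched as
      a one-entry cache in front of a registry lookup. Space,
      SpaceLookup and the use of std::unordered_map are illustrative
      assumptions, not the actual fil_space_t interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct Space { std::uint32_t id; };  // hypothetical tablespace handle

// Remembers the most recent tablespace so that consecutive pages of the
// same tablespace do not trigger repeated registry lookups.
class SpaceLookup {
  const std::unordered_map<std::uint32_t, Space>& registry_;
  const Space* last_ = nullptr;
  std::size_t lookups_ = 0;

public:
  explicit SpaceLookup(const std::unordered_map<std::uint32_t, Space>& r)
      : registry_(r) {}

  const Space* get(std::uint32_t id) {
    if (last_ && last_->id == id)
      return last_;  // cache hit: no registry access
    ++lookups_;
    auto it = registry_.find(id);
    last_ = it == registry_.end() ? nullptr : &it->second;
    assert(!last_ || last_->id == id);
    return last_;
  }

  std::size_t lookups() const { return lookups_; }
};
```

      Because a batch tends to visit many pages of the same tablespace
      in a row, most calls would be cache hits.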
      
      buf_flush_space(): Auxiliary function to look up a tablespace for
      page flushing.
      
      buf_flush_page(): Defer the computation of space->full_crc32(). Never
      call log_write_up_to(), but instead skip persistent pages whose latest
      modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
      pages on which we cannot acquire a shared latch without waiting.
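
      The two skip conditions can be sketched as follows, using
      std::shared_mutex as a stand-in for the page latch. Page,
      FlushResult and try_flush_page() are hypothetical names, and the
      order of the two checks is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <shared_mutex>

using lsn_t = std::uint64_t;

struct Page {
  lsn_t page_lsn = 0;       // stand-in for FIL_PAGE_LSN
  std::shared_mutex latch;  // stand-in for the page latch
};

enum class FlushResult { Written, SkippedLatched, SkippedLogBehind };

// Best-effort flush: never waits for a latch and never forces a log write.
template <typename Write>
FlushResult try_flush_page(Page& page, lsn_t flushed_lsn, Write write) {
  if (!page.latch.try_lock_shared())
    return FlushResult::SkippedLatched;  // a writer holds the latch
  if (page.page_lsn > flushed_lsn) {
    // The redo log for the latest change is not durable yet.
    page.latch.unlock_shared();
    return FlushResult::SkippedLogBehind;
  }
  assert(page.page_lsn <= flushed_lsn);
  write(page);
  page.latch.unlock_shared();
  return FlushResult::Written;
}
```

      A skipped page simply stays dirty and is expected to be picked up
      by a later batch.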
      
      buf_flush_try_neighbors(): Do not bother checking buf_fix_count
      because buf_flush_page() will no longer wait for the page latch.
      Take the tablespace as a parameter, and only execute this function
      when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().
      
      buf_flush_relocate_on_flush_list(): Declare as cold, and push down
      a condition from the callers.
      
      buf_flush_check_neighbor(): Take id.fold() as a parameter.
      
      buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
      because the flushing batch will skip pages whose modifications have
      not yet been written to the log or were latched for modification.
      
      buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.
      
      buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
      the counters, and report n->evicted.
      Cache the last looked up tablespace. If neighbor flushing is not applicable,
      invoke buf_flush_page() directly, avoiding a page lookup in between.
      
      buf_do_LRU_batch(): Return the number of pages flushed.
      
      buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
      adaptive hash index entries are pointing to the block.
      
      buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
      will no longer perform any useful work for us, and we do not want it
      to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
      writes out and evicts at most innodb_lru_flush_size pages. (The
      function buf_do_LRU_batch() may complete after writing fewer pages if
      more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
      Eliminate some mutex release-acquire cycles, and wait for the LRU
      flush batch to complete before rescanning.
      
      buf_LRU_check_size_of_non_data_objects(): Simplify the code.
      
      buf_page_write_complete(): Remove the parameter evict, and always
      evict pages that were part of an LRU flush.
      
      buf_page_create(): Take a pre-allocated page as a parameter.
      
      buf_pool_t::free_block(): Free a pre-allocated block.
      
      recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
      while not holding recv_sys.mutex. During page allocation, we may
      initiate a page flush, which in turn may initiate a log flush, which
      would require acquiring log_sys.mutex, which should always be acquired
      before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
      not be holding recv_sys.mutex while allocating a buffer pool block.
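
      The ordering constraint amounts to "allocate first, lock second".
      A minimal sketch, with Recovery, Block and recover_page() as
      invented names and a plain std::mutex standing in for
      recv_sys.mutex:

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct Block { int page_no = -1; };  // hypothetical buffer pool block

struct Recovery {
  std::mutex mutex;  // plays the role of recv_sys.mutex
  std::vector<std::unique_ptr<Block>> applied;

  // `allocate` may block, flush pages, or take other mutexes, so it is
  // invoked before this object's mutex is acquired.
  template <typename Alloc>
  void recover_page(int page_no, Alloc allocate) {
    std::unique_ptr<Block> block = allocate();  // mutex not held here
    assert(block != nullptr);
    std::lock_guard<std::mutex> guard(mutex);
    block->page_no = page_no;
    applied.push_back(std::move(block));
  }
};
```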
      
      BtrBulk::logFreeCheck(): Skip a redundant condition.
      
      row_undo_step(): Do not invoke srv_inc_activity_count() for every row
      that is being rolled back. It should suffice to invoke the function in
      trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
      rollback completes.
      
      sync_check_enable(): Remove. We will enable innodb_sync_debug from the
      very beginning.
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23399: Remove buf_pool.flush_rbt · 46b1f500
      Marko Mäkelä authored
      Normally, buf_pool.flush_list must be sorted by
      buf_page_t::oldest_modification, so that log_checkpoint()
      can choose MIN(oldest_modification) as the checkpoint LSN.
      
      During recovery, buf_pool.flush_rbt used to guarantee the
      ordering. However, we can allow the buf_pool.flush_list to
      be in an arbitrary order during recovery, and simply ensure
      that it is in the correct order by the time a log checkpoint
      needs to be executed.
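
      A toy model of this relaxation, assuming the list is represented
      by the oldest_modification values of its pages and that the
      normal order is newest first. FlushList and its member functions
      are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using lsn_t = std::uint64_t;

struct FlushList {
  std::vector<lsn_t> oldest_modification;  // one entry per dirty page

  // During recovery, pages are appended in whatever order they are applied.
  void add_during_recovery(lsn_t lsn) { oldest_modification.push_back(lsn); }

  // True if the list is in descending LSN order, as a checkpoint expects.
  bool ordered_for_checkpoint() const {
    return std::is_sorted(oldest_modification.begin(),
                          oldest_modification.end(),
                          std::greater<lsn_t>());
  }

  // Writing out everything leaves an empty list, which is trivially ordered.
  std::size_t flush_all() {
    std::size_t n = oldest_modification.size();
    oldest_modification.clear();
    assert(ordered_for_checkpoint());
    return n;
  }
};
```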
      
      recv_sys_t::apply(): To keep it simple, we will always flush the
      buffer pool at the end of each batch.
      
      Note that log_checkpoint() will invoke recv_sys_t::apply() in case
      a checkpoint is initiated during the last batch of recovery,
      when we already allow writes to data pages and the redo log.
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23399: Remove recv_writer_thread · b535a790
      Marko Mäkelä authored
      Recovery works just fine without a separate thread whose only
      task is to tell the page cleaner thread to do its job.
      
      recv_sys_t::apply(): Flush the buffer pool at the end of each batch.
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23399 preparation: Remove buf_pool.zip_clean · fa70c146
      Marko Mäkelä authored
      The debug data structure may have been useful during the development of
      ROW_FORMAT=COMPRESSED page frames. Let us simplify code by removing it.
    • MDEV-23190 after-merge fix: remove unused code · 308f8350
      Marko Mäkelä authored
      The merge commit 4d4865de introduced
      fil_space_t::max_page_number_of_io() with no callers.
  2. 14 Oct, 2020 1 commit
    • Travis-CI: Use new Ubuntu 20.04 as base, streamline and document · cea6a666
      Otto Kekäläinen authored
      Simplify Travis-CI file and extend inline comments.
      
      Upgrade to using Ubuntu 20.04 (Focal) as the baseline distro version
      now that Travis-CI has made it available. Drop Xenial and all the
      excess repositories Xenial needed. Now we only have Focal and one Bionic
      build to keep things simple and streamlined.
      
      Keep GCC-7/Clang-7 as the older compiler, and start using GCC-10
      and Clang-10 as the newer compiler. Assume that if both of them
      build OK, then the intermediate versions would be OK as well.
      
      Print 'apt-cache policy' to make it transparent in build logs what
      repositories were used for build dependencies.
      
      Remove temporary workaround from homebrew install step as Travis-CI has
      fixed the original issue.
      
      Revert ignoring results from the build that previously failed on the test
      main.thread_pool_info as MDEV-20372 is not fixed.
      
      Keep arm64 failures ignored due to MDEV-23955.
      
      Allow failures for the test main.column_compression 'innodb' due
      to MDEV-23954.
  3. 09 Oct, 2020 1 commit
  4. 08 Oct, 2020 1 commit
  5. 07 Oct, 2020 2 commits
  6. 05 Oct, 2020 7 commits
  7. 02 Oct, 2020 4 commits
  8. 01 Oct, 2020 5 commits
  9. 30 Sep, 2020 10 commits
  10. 29 Sep, 2020 4 commits