  1. 30 Oct, 2020 1 commit
  2. 29 Oct, 2020 11 commits
  3. 28 Oct, 2020 4 commits
    • MDEV-18323 Convert MySQL JSON type to MariaDB TEXT in mysql_upgrade · f6549e95
      Vicențiu Ciorbaru authored
      This patch solves two key problems.
      1. There is a type number clash between MySQL and MariaDB. The number
         245, used for MariaDB Virtual Fields is the same as MySQL's JSON.
         This leads to corrupt FRM errors if unhandled. The code checks the
         FRM table version number and, if it is 5.7 or later (but below
         10.0), assumes it is dealing with a MySQL table with the JSON
         datatype.
      2. The MySQL JSON datatype uses a proprietary format to pack JSON data.
         The patch introduces a datatype plugin which parses the format and
         converts it to its string representation.
      
      The intended conversion path is to only use the JSON datatype within
      ALTER TABLE <table> FORCE, to force a table recreate. This happens
      during mysql_upgrade or via a direct ALTER TABLE <table> FORCE.
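
      A minimal sketch of the version check described in point 1. All names
      are hypothetical. It assumes that the FRM version uses the MMmmpp
      encoding (5.7.30 is 50730) and that "5.7+ (until 10.0+)" means the
      half-open range [50700, 100000). Only the type code 245 and the two
      version bounds come from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; only the type code 245 and the version bounds are
// taken from the commit message.
enum class Type245 { MariaDBVirtualField, MySQLJson };

// Assumption: "5.7+ (until 10.0+)" is the half-open range [50700, 100000)
// in MMmmpp encoding.
constexpr bool looks_like_mysql_57_or_later(std::uint32_t frm_version) {
  return frm_version >= 50700 && frm_version < 100000;
}

// Decide how to interpret a column whose stored type code is 245.
constexpr Type245 classify_type_245(std::uint32_t frm_version) {
  return looks_like_mysql_57_or_later(frm_version)
             ? Type245::MySQLJson
             : Type245::MariaDBVirtualField;
}

// Usage example: a table written by 5.7.30 versus one written by 10.5.8.
inline void classify_example() {
  assert(classify_type_245(50730) == Type245::MySQLJson);
  assert(classify_type_245(100508) == Type245::MariaDBVirtualField);
}
```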
    • cleanup: Static_binary_string need not take non-const double parameter · 85c686e2
      Vicențiu Ciorbaru authored
      Convert the parameter to const as the function won't modify the pointer
      value.
    • MDEV-24037 Use NtFlushBuffersFileEx(FLUSH_FLAGS_FILE_DATA_SYNC_ONLY) on Windows · 9478368d
      Vladislav Vaintroub authored
      This avoids flushing file metadata on NTFS, and writing to <drive>:\$Log
      file. With heavy write workload this can consume up to 1/3 of the
      server's IO bandwidth.
      
      Reviewed by: Marko
    • MDEV-24015: SQL Error (1038): Out of sort memory when enough memory for the sort buffer is provided · db56f9b8
      Varun Gupta authored
      For a correlated subquery filesort is executed multiple times.
      During each execution, sortlength() computed total sort key length in
      Sort_keys::sort_length, without resetting it first.
      
      Eventually Sort_keys::sort_length got larger than @@sort_buffer_size, which
      caused filesort() to be aborted with error.
      
      Fixed by making sortlength() compute lengths only during the first
      invocation. Subsequent invocations return pre-computed values.
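
      The difference between the two behaviours can be sketched as follows.
      SortKeysSketch and both functions are hypothetical stand-ins, not the
      server code. Only the member name sort_length is taken from the text
      above, and the "computed" flag is an assumption about how the first
      invocation could be remembered.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Sort_keys.
struct SortKeysSketch {
  std::vector<std::size_t> key_lengths;  // per-key byte lengths
  std::size_t sort_length = 0;           // total sort key length
  bool computed = false;                 // assumption: set on first call
};

// Buggy pattern: adds to the total on every call without resetting it,
// so the value grows with each execution of the subquery.
inline std::size_t sortlength_accumulating(SortKeysSketch &keys) {
  for (std::size_t len : keys.key_lengths) keys.sort_length += len;
  return keys.sort_length;
}

// Fixed pattern: compute during the first invocation only, then return
// the pre-computed value.
inline std::size_t sortlength_once(SortKeysSketch &keys) {
  if (!keys.computed) {
    keys.sort_length = 0;
    for (std::size_t len : keys.key_lengths) keys.sort_length += len;
    keys.computed = true;
  }
  return keys.sort_length;
}

// Usage example: three executions of the same sort.
inline void sortlength_example() {
  SortKeysSketch keys;
  keys.key_lengths = {8, 4, 20};
  for (int i = 0; i < 3; i++) assert(sortlength_once(keys) == 32);
}
```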
  4. 27 Oct, 2020 2 commits
  5. 26 Oct, 2020 10 commits
    • MDEV-23855: Use normal mutex for log_sys.mutex, log_sys.flush_order_mutex · c27e53f4
      Marko Mäkelä authored
      With an unreasonably small innodb_log_file_size, the page cleaner
      thread would frequently acquire log_sys.flush_order_mutex and spend
      a significant portion of CPU time spinning on that mutex when
      determining the checkpoint LSN.
    • MDEV-23855: Implement asynchronous doublewrite · a5a2ef07
      Marko Mäkelä authored
      Synchronous writes and calls to fdatasync(), fsync() or
      FlushFileBuffers() would ruin performance. So, let us
      submit asynchronous writes for the doublewrite buffer.
      We submit a single request for the likely case that the
      two doublewrite buffers are contiguous in the system tablespace.
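
      A rough sketch of the "single request when contiguous" decision.
      WriteRequest and plan_doublewrite() are hypothetical names, and the
      page-unit arithmetic is an assumption made for illustration; the
      actual request submission is not modelled here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one asynchronous write, in page units.
struct WriteRequest {
  std::uint32_t first_page;
  std::uint32_t n_pages;
};

// Plan the writes for two doublewrite blocks of block_pages pages each,
// starting at pages block1 and block2. If block2 immediately follows
// block1, one request covers both blocks; otherwise two are needed.
inline std::vector<WriteRequest> plan_doublewrite(std::uint32_t block1,
                                                  std::uint32_t block2,
                                                  std::uint32_t block_pages) {
  assert(block_pages > 0);
  if (block1 + block_pages == block2)
    return {WriteRequest{block1, 2 * block_pages}};
  return {WriteRequest{block1, block_pages},
          WriteRequest{block2, block_pages}};
}
```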
      
      buf_dblwr_t::flush_buffered_writes_completed(): The completion callback
      of buf_dblwr_t::flush_buffered_writes().
      
      os_aio_wait_until_no_pending_writes(): Also wait for doublewrite batches.
      
      buf_dblwr_t::element::space: Remove. We can simply use
      element::request.node->space instead.
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23399 fixup: Interleaved doublewrite batches · ef3f71fa
      Marko Mäkelä authored
      Author: Vladislav Vaintroub
    • MDEV-16264 fixup: Clean up asynchronous I/O · 8cb01c51
      Marko Mäkelä authored
      os_aio_userdata_t: Remove. It was basically duplicating IORequest.
      
      buf_page_write_complete(): Take only IORequest as a parameter.
      
      os_aio_func(), pfs_os_aio_func(): Replaced with os_aio() that has
      no redundant parameters. There is only one caller, so there is no
      point in passing __FILE__, __LINE__ as parameters.
    • MDEV-23855: Shrink fil_space_t · 118e258a
      Marko Mäkelä authored
      Merge n_pending_ios, n_pending_ops to std::atomic<uint32_t> n_pending.
      Change some more fil_space_t members to uint32_t to reduce
      the memory footprint.
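
      To illustrate the idea only: the two structs below are hypothetical
      and do not reproduce the real fil_space_t layout. The "before" member
      types are an assumption. The sketch shows one 32-bit atomic counter
      replacing two word-sized counters, plus narrower plain members.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical "before" layout: word-sized members, two pending counters.
struct SpaceBefore {
  std::size_t id;
  std::size_t flags;
  std::size_t n_pending_ios;
  std::size_t n_pending_ops;
};

// Hypothetical "after" layout: 32-bit members and one merged counter.
struct SpaceAfter {
  std::uint32_t id = 0;
  std::uint32_t flags = 0;
  std::atomic<std::uint32_t> n_pending{0};

  // Register one pending operation of either kind.
  void acquire() { n_pending.fetch_add(1, std::memory_order_acquire); }

  // Unregister it again; the counter must not underflow.
  void release() {
    std::uint32_t prev = n_pending.fetch_sub(1, std::memory_order_release);
    assert(prev > 0);
    (void)prev;
  }

  bool is_idle() const {
    return n_pending.load(std::memory_order_acquire) == 0;
  }
};

static_assert(sizeof(SpaceAfter) < sizeof(SpaceBefore),
              "the merged layout should be smaller");
```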
      
      fil_space_t::add(), fil_ibd_create(): Attach the already opened
      handle to the tablespace, and enforce the fil_system.n_open limit.
      
      dict_boot(): Initialize fil_system.max_assigned_id.
      
      srv_boot(): Call srv_thread_pool_init() before anything else,
      so that files should be opened in the correct mode on Windows.
      
      fil_ibd_create(): Create the file in OS_FILE_AIO mode, just like
      fil_node_open_file_low() does it.
      
      dict_table_t::is_accessible(): Replaces fil_table_accessible().
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23855: Remove fil_system.LRU and reduce fil_system.mutex contention · 45ed9dd9
      Marko Mäkelä authored
      Also fixes MDEV-23929: innodb_flush_neighbors is not being ignored
      for system tablespace on SSD
      
      When the maximum configured number of files is exceeded, InnoDB will
      close data files. We used to maintain a fil_system.LRU list and
      a counter fil_node_t::n_pending to achieve this, at the huge cost
      of multiple fil_system.mutex operations per I/O operation.
      
      fil_node_open_file_low(): Implement a FIFO replacement policy:
      The last opened file will be moved to the end of fil_system.space_list,
      and files will be closed from the start of the list. However, we will
      not move tablespaces in fil_system.space_list while
      i_s_tablespaces_encryption_fill_table() is executing
      (producing output for INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION)
      because it may cause information of some tablespaces to go missing.
      We also avoid this in mariabackup --backup because datafiles_iter_next()
      assumes that the ordering is not changed.
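
      The FIFO policy can be sketched as below. SpaceListSketch is a
      hypothetical model, not the fil_system code: it keeps every tablespace
      in one list, moves a newly opened one to the end, and closes open ones
      from the start. The may_reorder flag is an assumed stand-in for the
      two cases above in which the list must not be reordered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

struct Space {
  std::string name;
  bool is_open;
};

class SpaceListSketch {
 public:
  explicit SpaceListSketch(std::size_t max_open) : max_open_(max_open) {}

  void add(const std::string &name) { spaces_.push_back(Space{name, false}); }

  // Open a known tablespace; returns the names of the files that were
  // closed to stay within the limit. Opening an open file does nothing.
  std::vector<std::string> open(const std::string &name,
                                bool may_reorder = true) {
    std::vector<std::string> closed;
    auto it = std::find_if(spaces_.begin(), spaces_.end(),
                           [&](const Space &s) { return s.name == name; });
    assert(it != spaces_.end());
    if (it->is_open) return closed;
    // Close open files from the start of the list until there is room.
    for (auto victim = spaces_.begin();
         n_open_ >= max_open_ && victim != spaces_.end(); ++victim) {
      if (victim->is_open) {
        victim->is_open = false;
        --n_open_;
        closed.push_back(victim->name);
      }
    }
    it->is_open = true;
    ++n_open_;
    // The last opened file goes to the end, unless reordering is unsafe.
    if (may_reorder) spaces_.splice(spaces_.end(), spaces_, it);
    return closed;
  }

  std::vector<std::string> order() const {
    std::vector<std::string> names;
    for (const Space &s : spaces_) names.push_back(s.name);
    return names;
  }

 private:
  std::list<Space> spaces_;
  std::size_t max_open_;
  std::size_t n_open_ = 0;
};
```

      Unlike an LRU list, nothing is updated when an already open file is
      accessed, which is what keeps the per-I/O cost low in this model.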
      
      IORequest: Fold more parameters to IORequest::type.
      
      fil_space_t::io(): Replaces fil_io().
      
      fil_space_t::flush(): Replaces fil_flush().
      
      OS_AIO_IBUF: Remove. We will always issue synchronous reads of the
      change buffer pages in buf_read_page_low().
      
      We will always ignore some errors for background reads.
      
      This should reduce fil_system.mutex contention a little.
      
      fil_node_t::complete_write(): Replaces fil_node_t::complete_io().
      On both read and write completion, fil_space_t::release_for_io()
      will have to be called.
      
      fil_space_t::io(): Do not acquire fil_system.mutex in the normal
      code path.
      
      xb_delta_open_matching_space(): Do not try to open the system tablespace
      which was already opened. This fixes a file sharing violation in
      mariabackup --prepare --incremental.
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23855: Improve InnoDB log checkpoint performance · 3a9a3be1
      Marko Mäkelä authored
      After MDEV-15053, MDEV-22871, MDEV-23399 shifted the scalability
      bottleneck, log checkpoints became a new bottleneck.
      
      If innodb_io_capacity is set low or innodb_max_dirty_pages_pct_lwm is
      set high and the workload fits in the buffer pool, the page cleaner
      thread will perform very little flushing. When we reach the capacity
      of the circular redo log file ib_logfile0 and must initiate a checkpoint,
      some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF,
      then flushing would continue at the innodb_io_capacity rate, and
      writers would be throttled.)
      
      We have the best chance of advancing the checkpoint LSN immediately
      after a page flush batch has been completed. Hence, it is best to
      perform checkpoints after every batch in the page cleaner thread,
      attempting to run once per second.
      
      By initiating high-priority flushing in the page cleaner as early
      as possible, we aim to make the throughput more stable.
      
      The function buf_flush_wait_flushed() used to sleep for 10ms, hoping
      that the page cleaner thread would do something during that time.
      The observed end result was that a large number of threads that call
      log_free_check() would end up sleeping while nothing useful is happening.
      
      We will revise the design so that in the default innodb_flush_sync=ON
      mode, buf_flush_wait_flushed() will wake up the page cleaner thread
      to perform the necessary flushing, and it will wait for a signal from
      the page cleaner thread.
      
      If innodb_io_capacity is set to a low value (causing the page cleaner to
      throttle its work), a write workload would initially perform well, until
      the capacity of the circular ib_logfile0 is reached and log_free_check()
      will trigger checkpoints. At that point, the extra waiting in
      buf_flush_wait_flushed() will start reducing throughput.
      
      The page cleaner thread will also initiate log checkpoints after each
      buf_flush_lists() call, because that is the best point of time for
      the checkpoint LSN to advance by the maximum amount.
      
      Even in 'furious flushing' mode we invoke buf_flush_lists() with
      innodb_io_capacity_max pages at a time, and at the start of each
      batch (in the log_flush() callback function that runs in a separate
      task) we will invoke os_aio_wait_until_no_pending_writes(). This
      tweak allows the checkpoint to advance in smaller steps and
      significantly reduces the maximum latency. On an Intel Optane 960
      NVMe SSD on Linux, it reduced from 4.6 seconds to 74 milliseconds.
      On Microsoft Windows with a slower SSD, it reduced from more than
      180 seconds to 0.6 seconds.
      
      We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity
      per second whenever the dirty proportion of buffer pool pages exceeds
      innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try
      to make page_cleaner_flush_pages_recommendation() more consistent and
      predictable: if we are below innodb_adaptive_flushing_lwm, let us flush
      pages according to the return value of af_get_pct_for_dirty().
      
      innodb_max_dirty_pages_pct_lwm: Revert the change of the default value
      that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0
      guarantees that a shutdown of an idle server will be fast. Users might
      be surprised if normal shutdown suddenly became slower when upgrading
      within a GA release series.
      
      innodb_checkpoint_usec: Remove. The master task will no longer perform
      periodic log checkpoints. It is the duty of the page cleaner thread.
      
      log_sys.max_modified_age: Remove. The current span of the
      buf_pool.flush_list expressed in LSN only matters for adaptive
      flushing (outside the 'furious flushing' condition).
      For the correctness of checkpoints, the only thing that matters is
      the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn).
      This run-time constant was also reported as log_max_modified_age_sync.
      
      log_sys.max_checkpoint_age_async: Remove. This does not serve any
      purpose, because the checkpoints will now be triggered by the page
      cleaner thread. We will retain the log_sys.max_checkpoint_age limit
      for engaging 'furious flushing'.
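
      A small sketch of the two quantities discussed above. LogSketch is a
      hypothetical struct whose field names mirror the log_sys members named
      in the text. Whether the comparison against max_checkpoint_age is
      strict is an assumption.

```cpp
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;

// Hypothetical snapshot of the relevant log_sys fields.
struct LogSketch {
  lsn_t lsn;                  // current end of the log
  lsn_t last_checkpoint_lsn;  // LSN of the latest checkpoint
  lsn_t max_checkpoint_age;   // limit for engaging 'furious flushing'
};

// Checkpoint age as defined in the text: lsn - last_checkpoint_lsn.
inline lsn_t checkpoint_age(const LogSketch &log) {
  assert(log.lsn >= log.last_checkpoint_lsn);
  return log.lsn - log.last_checkpoint_lsn;
}

// Assumption: furious flushing is engaged once the age exceeds the limit.
inline bool needs_furious_flushing(const LogSketch &log) {
  return checkpoint_age(log) > log.max_checkpoint_age;
}
```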
      
      page_cleaner.slot: Remove. It turns out that
      page_cleaner_slot.flush_list_time was duplicating
      page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass
      was duplicating page_cleaner.flush_pass.
      Likewise, there were some redundant monitor counters, because the
      page cleaner thread no longer performs any buf_pool.LRU flushing, and
      because there only is one buf_flush_page_cleaner thread.
      
      buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex.
      
      buf_pool_t::get_oldest_modification(): Add a parameter to specify the
      return value when no persistent data pages are dirty. Require the
      caller to hold buf_pool.flush_list_mutex.
      
      log_buf_pool_get_oldest_modification(): Take the fall-back LSN
      as a parameter. All callers will also invoke log_sys.get_lsn().
      
      log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed().
      
      buf_flush_wait_flushed(): Implement two limits. If not enough buffer pool
      has been flushed, signal the page cleaner (unless innodb_flush_sync=OFF)
      and wait for the page cleaner to complete. If the page cleaner
      thread is not running (which can be the case during shutdown),
      initiate the flush and wait for it directly.
      
      buf_flush_ahead(): If innodb_flush_sync=ON (the default),
      submit a new buf_flush_sync_lsn target for the page cleaner
      but do not wait for the flushing to finish.
      
      log_get_capacity(), log_get_max_modified_age_async(): Remove, to make
      it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes.
      
      page_cleaner_flush_pages_recommendation(): Protect all access to
      buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there
      were some race conditions in the calculation.
      
      buf_flush_sync_for_checkpoint(): New function to process
      buf_flush_sync_lsn in the page cleaner thread. At the end of
      each batch, we try to wake up any blocked buf_flush_wait_flushed().
      If everything up to buf_flush_sync_lsn has been flushed, we will
      reset buf_flush_sync_lsn=0. The page cleaner thread will keep
      'furious flushing' until the limit is reached. Any threads that
      are waiting in buf_flush_wait_flushed() will be able to resume
      as soon as their own limit has been satisfied.
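
      The target handling can be modelled without threads as below.
      SyncTargetSketch is hypothetical. The rule that a new request only
      ever raises the target is an assumption; the reset to 0 and the
      per-waiter limit follow the description above. Mutex protection and
      signalling are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;

// Hypothetical single-threaded model of buf_flush_sync_lsn.
struct SyncTargetSketch {
  lsn_t sync_lsn = 0;  // 0 means that no target is set

  // A writer asks for everything up to target to be flushed.
  // Assumption: a request can raise the target but never lower it.
  void request(lsn_t target) {
    assert(target != 0);
    if (target > sync_lsn) sync_lsn = target;
  }

  // Called at the end of each flush batch with the LSN up to which all
  // pages have been written. Clears the target once it has been reached.
  void batch_completed(lsn_t flushed_up_to) {
    if (sync_lsn != 0 && flushed_up_to >= sync_lsn) sync_lsn = 0;
  }

  // The page cleaner keeps flushing without sleeping while this holds.
  bool furious() const { return sync_lsn != 0; }
};

// A waiter may resume as soon as its own limit has been satisfied, even
// if the overall target has not been reached yet.
inline bool waiter_may_resume(lsn_t own_limit, lsn_t flushed_up_to) {
  return flushed_up_to >= own_limit;
}
```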
      
      buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not
      sleep as long as it is set. Do not update any page_cleaner statistics
      for this special mode of operation. In the normal mode
      (buf_flush_sync_lsn is not set for innodb_flush_sync=ON),
      try to wake up once per second. No longer check whether
      srv_inc_activity_count() has been called. After each batch,
      try to perform a log checkpoint, because the best chances for
      the checkpoint LSN to advance by the maximum amount are upon
      completing a flushing batch.
      
      log_t: Move buf_free, max_buf_free possibly to the same cache line
      with log_sys.mutex.
      
      log_margin_checkpoint_age(): Simplify the logic, and replace
      a 0.1-second sleep with a call to buf_flush_wait_flushed() to
      initiate flushing. Moved to the same compilation unit
      with the only caller.
      
      log_close(): Clean up the calculations. (Should be no functional
      change.) Return whether flush-ahead is needed. Moved to the same
      compilation unit with the only caller.
      
      mtr_t::finish_write(): Return whether flush-ahead is needed.
      
      mtr_t::commit(): Invoke buf_flush_ahead() when needed. Let us avoid
      external calls in mtr_t::commit() and make the logic easier to follow
      by having related code in a single compilation unit. Also, we will
      invoke srv_stats.log_write_requests.inc() only once per
      mini-transaction commit, while not holding mutexes.
      
      log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age.
      Upon reaching log_sys.max_checkpoint_age where we must wait to prevent
      the log from getting corrupted, let us wait for at most 1MiB of LSN
      at a time, before rechecking the condition. This should allow writers
      to proceed even if the redo log capacity has been reached and
      'furious flushing' is in progress. We no longer care about
      log_sys.max_modified_age_sync or log_sys.max_modified_age_async.
      The log_sys.max_modified_age_sync could be a relic from the time when
      there was a srv_master_thread that wrote dirty pages to data files.
      Also, we no longer have any log_sys.max_checkpoint_age_async limit,
      because log checkpoints will now be triggered by the page cleaner
      thread upon completing buf_flush_lists().
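
      The "at most 1MiB of LSN at a time" stepping can be sketched as
      follows. Both functions are hypothetical. The step is measured here
      from the oldest LSN that is not yet flushed, and each wait is assumed
      to reach its target exactly; both are simplifications.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;

constexpr lsn_t kMaxWaitStep = lsn_t{1} << 20;  // 1 MiB of LSN

// Next LSN to wait for: the required LSN, capped to one step beyond the
// oldest LSN that has not been flushed yet.
inline lsn_t next_wait_target(lsn_t oldest_lsn, lsn_t required_lsn) {
  assert(oldest_lsn <= required_lsn);
  return std::min(required_lsn, oldest_lsn + kMaxWaitStep);
}

// Number of wait-and-recheck rounds needed, assuming that every wait
// advances the oldest LSN exactly to the target.
inline unsigned rounds_needed(lsn_t oldest_lsn, lsn_t required_lsn) {
  unsigned rounds = 0;
  while (oldest_lsn < required_lsn) {
    oldest_lsn = next_wait_target(oldest_lsn, required_lsn);
    ++rounds;
  }
  return rounds;
}
```

      Rechecking the condition after every step is what lets a writer
      proceed as soon as enough log capacity has been freed.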
      
      log_set_capacity(): Simplify the calculations of the limit
      (no functional change).
      
      log_checkpoint_low(): Split from log_checkpoint(). Moved to the
      same compilation unit with the caller.
      
      log_make_checkpoint(): Only wait for everything to be flushed until
      the current LSN.
      
      create_log_file(): After checkpoint, invoke log_write_up_to()
      to ensure that the FILE_CHECKPOINT record has been written.
      This avoids ut_ad(!srv_log_file_created) in create_log_file_rename().
      
      srv_start(): Do not call recv_recovery_from_checkpoint_start()
      if the log has just been created. Set fil_system.space_id_reuse_warned
      before dict_boot() has been executed, and clear it after recovery
      has finished.
      
      dict_boot(): Initialize fil_system.max_assigned_id.
      
      srv_check_activity(): Remove. The activity count is counting transaction
      commits and therefore mostly interesting for the purge of history.
      
      BtrBulk::insert(): Do not explicitly wake up the page cleaner,
      but do invoke srv_inc_activity_count(), because that counter is
      still being used in buf_load_throttle_if_needed() for some
      heuristics. (It might be cleaner to execute buf_load() in the
      page cleaner thread!)
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23399 fixup: Assertion bpage->in_file() failed · bd67cb92
      Marko Mäkelä authored
      buf_flush_remove_pages(), buf_flush_dirty_pages(): Because
      buf_page_t::state() is protected by buf_pool.mutex, which we
      are not holding, the state may be BUF_BLOCK_REMOVE_HASH when
      the page is being relocated. Let us relax these assertions
      similar to buf_flush_validate_low().
      
      The other in_file() assertions in buf0flu.cc look valid.
    • Cleanup: Speed up mariabackup --prepare · 59a0236d
      Marko Mäkelä authored
      srv_start(): Avoid trx_lists_init_at_db_start() for normal
      mariabackup --prepare without --export.
    • MDEV-23399 fixup: Avoid crash on Mariabackup shutdown · 5999d512
      Marko Mäkelä authored
      innodb_preshutdown(): Terminate the encryption threads before
      the page cleaner thread can be shut down.
      
      innodb_shutdown(): Always wait for the encryption threads and
      page cleaner to shut down.
      
      srv_shutdown_all_bg_threads(): Wait for the encryption threads and
      the page cleaner to shut down. (After an aborted startup,
      innodb_shutdown() would not be called.)
      
      row_get_background_drop_list_len_low(): Remove.
      
      os_thread_count: Remove. Alternatively, at the end of
      srv_shutdown_all_bg_threads() we could try to wait longer
      for the count to reach 0. On some platforms, an assertion
      os_thread_count==0 could fail even after a small delay,
      even though in the core dump all threads would have exited.
      
      srv_shutdown_threads(): Renamed from srv_shutdown_all_bg_threads().
      Do not wait for the page cleaner to shut down, because the later
      innodb_shutdown(), which may invoke
      logs_empty_and_mark_files_at_shutdown(), assumes that it exists.
  6. 24 Oct, 2020 6 commits
  7. 23 Oct, 2020 2 commits
  8. 22 Oct, 2020 4 commits