  1. 26 Aug, 2021 1 commit
  2. 11 Aug, 2021 1 commit
  3. 24 Jul, 2021 1 commit
  4. 22 Jul, 2021 1 commit
    • MDEV-26110: Do not rely on alignment on static allocation · 82d59945
      Marko Mäkelä authored
      It is implementation-defined whether alignment requirements
      that are larger than std::max_align_t (typically 8 or 16 bytes)
      will be honored by the compiler and linker.
      
      It turns out that on IBM AIX, both alignas() and MY_ALIGNED()
      only guarantee alignment up to 16 bytes.
      
      For some data structures, specifying alignment to the CPU
      cache line size (typically 64 or 128 bytes) is a mere performance
      optimization, and we do not really care whether the requested
      alignment is guaranteed.
      
      But, for the correct operation of direct I/O, we do require that
      the buffers be aligned at a block size boundary.
      
      field_ref_zero: Define as a pointer, not an array.
      For innochecksum, we can make this point to unaligned memory;
      for anything else, we will allocate an aligned buffer from the heap.
      This buffer will be used for overwriting freed data pages when
      innodb_immediate_scrub_data_uncompressed=ON. And exactly that code
      hit an assertion failure on AIX, in the test innodb.innodb_scrub.
      
      log_sys.checkpoint_buf: Define as a pointer to aligned memory
      that is allocated from heap.
      
      log_t::file::write_header_durable(): Reuse log_sys.checkpoint_buf
      instead of trying to allocate an aligned buffer from the stack.
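
The heap-allocation approach described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper names and the 4096-byte block size are assumptions, and it relies on the C++17 aligned form of operator new rather than whatever allocator InnoDB actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Hypothetical block size for direct I/O; the real value depends on the device.
constexpr std::size_t kBlockSize = 4096;

// Allocate a zero-filled heap buffer whose address is a multiple of
// `alignment` (which must be a power of two). Unlike alignas() on a static
// array, the aligned operator new is required to honor the request or throw.
inline unsigned char *alloc_aligned_zeroed(std::size_t size,
                                           std::size_t alignment)
{
  void *p = ::operator new(size, std::align_val_t(alignment));
  std::memset(p, 0, size);
  return static_cast<unsigned char *>(p);
}

// Release a buffer obtained from alloc_aligned_zeroed() with the same alignment.
inline void free_aligned(unsigned char *p, std::size_t alignment)
{
  ::operator delete(p, std::align_val_t(alignment));
}

// Check the property that direct I/O depends on.
inline bool is_aligned(const void *p, std::size_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A caller would allocate the buffer once at startup, keep only the pointer (as field_ref_zero and log_sys.checkpoint_buf do after this change), and free it at shutdown.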
  5. 15 Jul, 2021 1 commit
  6. 31 May, 2021 1 commit
  7. 29 May, 2021 1 commit
  8. 18 May, 2021 1 commit
    • MDEV-25710: Dead code os_file_opendir() in the server · 08b6fd93
      Marko Mäkelä authored
      The functions fil_file_readdir_next_file(), os_file_opendir(),
      os_file_closedir() became dead code in the server in MariaDB 10.4.0
      with commit 09af00cb (the removal of
      the crash recovery logic for the TRUNCATE TABLE implementation that
      was replaced in MDEV-13564).
      
      os_file_opendir(), os_file_closedir(): Define as macros.
  9. 11 Apr, 2021 2 commits
  10. 01 Apr, 2021 1 commit
    • MDEV-24197: Add "innodb_force_recovery" for "mariabackup --prepare" · 5bc5ecce
      Srinidhi Kaushik authored
      During the prepare phase of restoring backups, "mariabackup" does
      not seem to allow (or recognize) the option "innodb_force_recovery"
      for the embedded InnoDB server instance that it starts.
      
      If page corruption is observed during page recovery, the prepare step
      fails. While this is indeed the correct behavior ideally, allowing
      this option to be set in case of emergencies might be useful when
      the current backup is the only copy available. Some error messages
      during "--prepare" suggest setting "innodb_force_recovery" to 1:
      
        [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
      
      For backwards compatibility, "mariabackup --innobackupex --apply-log"
      should also have this option.
      Signed-off-by: Srinidhi Kaushik <shrinidhi.kaushik@gmail.com>
  11. 20 Mar, 2021 1 commit
  12. 09 Mar, 2021 2 commits
  13. 05 Mar, 2021 1 commit
  14. 08 Feb, 2021 1 commit
    • Added 'const' to arguments in get_one_option and find_typeset() · 5d6ad2ad
      Monty authored
      One should not change the program arguments!
      This change also reduces warnings from the icc compiler.
      
      Almost all changes are just syntax changes (adding const to
      'get_one_option' function declarations).
      
      Other changes:
      - Added a few casts of 'argument' from 'const char*' to 'char *'. This
        was mainly in calls to 'external' functions we don't have control of.
      - Ensure that all resets of 'password command line argument' are similar.
        (In almost all cases it was just adding a comment and a cast)
      - In mysqlbinlog.cc and mysqld.cc there were a few cases that changed
        the command line argument. These places were changed to instead allocate
        the option in a MEM_ROOT to avoid changing the argument. Some of this
        code was changed to ensure that different programs parse options the
        same way. Added a test case for the changes in mysqlbinlog.cc
      - Changed a few variables that took their value from command line options
        from 'char *' to 'const char *'.
  15. 06 Jan, 2021 1 commit
    • MDEV-24537 innodb_max_dirty_pages_pct_lwm=0 lost its special meaning · a9933105
      Marko Mäkelä authored
      In commit 3a9a3be1 (MDEV-23855)
      some previous logic was replaced with the condition
      dirty_pct < srv_max_dirty_pages_pct_lwm, which caused
      the default value of the parameter innodb_max_dirty_pages_pct_lwm=0
      to lose its special meaning: 'refer to innodb_max_dirty_pages_pct instead'.
      
      This implicit special meaning was visible in the function
      af_get_pct_for_dirty(), which was removed in
      commit f0c295e2 (MDEV-24369).
      
      page_cleaner_flush_pages_recommendation(): Restore the special
      meaning that was removed in MDEV-24369.
      
      buf_flush_page_cleaner(): If srv_max_dirty_pages_pct_lwm==0.0,
      refer to srv_max_buf_pool_modified_pct. This fixes the observed
      performance regression due to excessive page flushing.
      
      buf_pool_t::page_cleaner_wakeup(): Revise the wakeup condition.
      
      innodb_init(): Do initialize srv_max_io_capacity in Mariabackup.
      It was previously constantly 0, which caused mariadb-backup --prepare
      to hang in buf_flush_sync(), making no progress.
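
The regression and the restored special meaning can be illustrated with a small sketch. The function names are hypothetical, percentages are modeled as plain doubles, and the real logic in buf_flush_page_cleaner() is more involved than this.

```cpp
#include <cassert>

// The condition described for commit 3a9a3be1: with lwm == 0 the comparison
// "dirty_pct < lwm" can never be true, so the page cleaner never considers
// the buffer pool clean enough to stay idle.
inline bool below_threshold_regressed(double dirty_pct, double lwm)
{
  return dirty_pct < lwm;
}

// Sketch of the restored behavior: lwm == 0 means
// "refer to innodb_max_dirty_pages_pct instead".
inline bool below_threshold_restored(double dirty_pct, double lwm,
                                     double max_pct)
{
  const double threshold = lwm > 0.0 ? lwm : max_pct;
  return dirty_pct < threshold;
}
```

With lwm=0 and a 10% dirty ratio, the first function reports that flushing is still needed, which matches the excessive page flushing that was observed; the second defers to the upper limit.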
  16. 16 Dec, 2020 1 commit
  17. 14 Dec, 2020 1 commit
    • MDEV-24313 (2 of 2): Silently ignored innodb_use_native_aio=1 · f24b7383
      Marko Mäkelä authored
      In commit 5e62b6a5 (MDEV-16264)
      the logic of os_aio_init() was changed so that it will never fail,
      but instead automatically disable innodb_use_native_aio (which is
      enabled by default) if the io_setup() system call would fail due
      to resource limits being exceeded. This is questionable, especially
      because falling back to simulated AIO may lead to significantly
      reduced performance.
      
      srv_n_file_io_threads, srv_n_read_io_threads, srv_n_write_io_threads:
      Change the data type from ulong to uint.
      
      os_aio_init(): Remove the parameters, and actually return an error code.
      
      thread_pool::configure_aio(): Do not silently fall back to simulated AIO.
      
      Reviewed by: Vladislav Vaintroub
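
The difference between failing loudly and falling back silently might look roughly like the sketch below. The enum, the function name and the callback are all hypothetical; the callback merely stands in for the io_setup() system call and is not the signature of os_aio_init().

```cpp
#include <cassert>
#include <functional>

enum class aio_init_result { ok, native_aio_failed };

// try_native_setup is a hypothetical stand-in for io_setup(): it returns
// false when the kernel refuses the request, e.g. due to resource limits.
inline aio_init_result aio_init_sketch(
    bool use_native_aio, const std::function<bool()> &try_native_setup)
{
  if (use_native_aio && !try_native_setup())
    // Report the failure to the caller instead of quietly switching to
    // simulated AIO, which could hide a significant performance loss.
    return aio_init_result::native_aio_failed;
  return aio_init_result::ok;
}
```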
  18. 11 Dec, 2020 1 commit
    • MDEV-24391 heap-use-after-free in fil_space_t::flush_low() · 8677c14e
      Marko Mäkelä authored
      We observed a race condition that involved two threads
      executing fil_flush_file_spaces() and one thread
      executing fil_delete_tablespace(). After one of the
      fil_flush_file_spaces() threads observed that
      space.needs_flush_not_stopping() is set and was
      releasing the fil_system.mutex, the other fil_flush_file_spaces()
      thread would complete the execution of fil_space_t::flush_low() on
      the same tablespace. Then, fil_delete_tablespace() would
      destroy the object, because the value of fil_space_t::n_pending
      did not prevent that. Finally, the first fil_flush_file_spaces()
      thread would resume execution and invoke fil_space_t::flush_low() on
      the freed object.
      
      This race condition was introduced in
      commit 118e258a of MDEV-23855.
      
      fil_space_t::flush(): Add a template parameter that indicates
      whether the caller is holding a reference to prevent the
      tablespace from being freed.
      
      buf_dblwr_t::flush_buffered_writes_completed(),
      row_quiesce_table_start(): Acquire a reference for the duration
      of the fil_space_t::flush_low() operation. It should be impossible
      for the object to be freed in these code paths, but we want to
      satisfy the debug assertions.
      
      fil_space_t::flush_low(): Do not increment or decrement the
      reference count, but instead assert that the caller is holding
      a reference.
      
      fil_space_extend_must_retry(), fil_flush_file_spaces():
      Acquire a reference before releasing fil_system.mutex.
      This is what will fix the race condition.
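
The essence of the fix, taking a reference while the mutex is still held, can be sketched as below. All type and function names are hypothetical simplifications; in particular, clearing needs_flush under the mutex is a shortcut of this sketch and not necessarily what fil_flush_file_spaces() does.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

struct space_sketch
{
  std::atomic<uint32_t> n_pending{0}; // reference count
  bool needs_flush = true;            // protected by the system mutex
  int flush_count = 0;

  bool is_referenced() const { return n_pending.load() != 0; }

  // Like the revised flush_low(): does not touch the reference count,
  // only asserts that the caller holds a reference.
  void flush_low()
  {
    assert(is_referenced());
    ++flush_count;
  }
};

inline bool flush_if_needed(std::mutex &sys_mutex, space_sketch &space)
{
  {
    std::lock_guard<std::mutex> guard(sys_mutex);
    if (!space.needs_flush)
      return false;
    space.needs_flush = false;
    // Acquire the reference BEFORE releasing the mutex, so that a
    // concurrent deleter cannot free the object in between.
    space.n_pending.fetch_add(1);
  }
  space.flush_low();
  space.n_pending.fetch_sub(1);
  return true;
}

// A deleter would only be allowed to free an unreferenced object.
inline bool may_free(std::mutex &sys_mutex, const space_sketch &space)
{
  std::lock_guard<std::mutex> guard(sys_mutex);
  return !space.is_referenced();
}
```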
  19. 04 Dec, 2020 1 commit
    • MDEV-24340 Unique final message of InnoDB during shutdown · 1eb59c30
      Marko Mäkelä authored
      innobase_space_shutdown(): Remove. We want this step to be executed
      before the message "InnoDB: Shutdown completed; log sequence number "
      is output by innodb_shutdown(). It used to be executed after that step.
      
      innodb_shutdown(): Duplicate the code that used to live in
      innobase_space_shutdown().
      
      innobase_init_abort(): Merge with innobase_space_shutdown().
  20. 01 Dec, 2020 2 commits
    • MDEV-22929 MariaBackup option to report and/or continue when corruption is encountered · e30a05f4
      Vlad Lesin authored
      Post-push Windows compilation errors fix.
    • MDEV-22929 MariaBackup option to report and/or continue when corruption is encountered · e6b3e38d
      Vlad Lesin authored
      The new option --log-innodb-page-corruption is introduced.
      
      When this option is set, the backup is not interrupted if a corrupted
      InnoDB page is detected. Instead, it logs all corrupted pages that it
      finds to the innodb_corrupted_pages file in the backup directory and
      finishes with an error.
      
      For an incremental backup, corrupted pages are also copied to the .delta
      file, because we cannot do an LSN check for such pages during backup.
      innodb_corrupted_pages will also be created in the incremental backup
      directory.
      
      During --prepare, the list of corrupted pages is read from the file just
      after the redo log is applied, and each page in the list is checked for
      whether it is allocated in its tablespace. If it is not allocated, it is
      zeroed out, flushed to the tablespace and removed from the list. If all
      pages are removed from the list, --prepare finishes successfully and the
      innodb_corrupted_pages file is removed from the backup directory.
      Otherwise, --prepare finishes with an error message, and
      innodb_corrupted_pages contains the list of pages that were detected as
      corrupted during backup and are allocated in their tablespaces. This
      means that the backup directory contains corrupted InnoDB pages, and the
      backup cannot be considered consistent.
      
      For incremental --prepare, corrupted pages from .delta files are applied
      to the base backup, innodb_corrupted_pages is read from both the base and
      incremental directories, and the same procedure is applied to the
      corrupted pages list as for a full --prepare. The innodb_corrupted_pages
      file is modified or removed only in the base directory.
      
      If DDL happens during backup, it is also processed at the end of backup
      to have correct tablespace names in innodb_corrupted_pages.
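
The --prepare handling of the corrupted pages list described above can be sketched as a simple filter. The struct, the function name and both callbacks are hypothetical; the real code works on the innodb_corrupted_pages file and on actual tablespaces.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

struct page_ref { uint32_t space_id; uint32_t page_no; };

// Walk the list of pages reported as corrupted during backup.
// Unallocated pages are zeroed out and dropped from the list; allocated
// ones stay. Returns true if the list became empty, i.e. the prepare step
// may finish successfully.
inline bool resolve_corrupted_pages(
    std::vector<page_ref> &corrupted,
    const std::function<bool(const page_ref &)> &is_allocated,
    const std::function<void(const page_ref &)> &zero_and_flush)
{
  std::vector<page_ref> remaining;
  for (const page_ref &p : corrupted)
  {
    if (is_allocated(p))
      remaining.push_back(p); // genuinely corrupted data: keep reporting it
    else
      zero_and_flush(p);      // garbage in a free page: safe to overwrite
  }
  corrupted.swap(remaining);
  return corrupted.empty();
}
```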
  21. 29 Oct, 2020 1 commit
  22. 26 Oct, 2020 5 commits
    • MDEV-23855: Use normal mutex for log_sys.mutex, log_sys.flush_order_mutex · c27e53f4
      Marko Mäkelä authored
      With an unreasonably small innodb_log_file_size, the page cleaner
      thread would frequently acquire log_sys.flush_order_mutex and spend
      a significant portion of CPU time spinning on that mutex when
      determining the checkpoint LSN.
    • MDEV-23855: Shrink fil_space_t · 118e258a
      Marko Mäkelä authored
      Merge n_pending_ios, n_pending_ops to std::atomic<uint32_t> n_pending.
      Change some more fil_space_t members to uint32_t to reduce
      the memory footprint.
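
One way to fold counters and state into a single std::atomic<uint32_t> is sketched below. The class name and the choice of the top bit as a "stopping" flag are assumptions made for illustration; the commit message does not spell out the bit layout of n_pending.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

class pending_word
{
  // Assumption: the most significant bit marks an object being stopped,
  // and the remaining bits count pending operations.
  static constexpr uint32_t STOPPING = 1U << 31;
  std::atomic<uint32_t> n_pending{0};

public:
  // Try to register a pending operation; refuse once stopping is set.
  bool acquire()
  {
    uint32_t old = n_pending.load(std::memory_order_relaxed);
    do
    {
      if (old & STOPPING)
        return false;
    }
    while (!n_pending.compare_exchange_weak(old, old + 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed));
    return true;
  }

  void release() { n_pending.fetch_sub(1, std::memory_order_release); }

  void set_stopping() { n_pending.fetch_or(STOPPING); }

  uint32_t pending() const { return n_pending.load() & ~STOPPING; }

  bool is_stopping() const { return (n_pending.load() & STOPPING) != 0; }
};
```

Keeping the flag and the count in one word lets a single atomic operation both check the state and take a reference, which is what makes a separate mutex unnecessary on that path.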
      
      fil_space_t::add(), fil_ibd_create(): Attach the already opened
      handle to the tablespace, and enforce the fil_system.n_open limit.
      
      dict_boot(): Initialize fil_system.max_assigned_id.
      
      srv_boot(): Call srv_thread_pool_init() before anything else,
      so that files should be opened in the correct mode on Windows.
      
      fil_ibd_create(): Create the file in OS_FILE_AIO mode, just like
      fil_node_open_file_low() does it.
      
      dict_table_t::is_accessible(): Replaces fil_table_accessible().
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23855: Remove fil_system.LRU and reduce fil_system.mutex contention · 45ed9dd9
      Marko Mäkelä authored
      Also fixes MDEV-23929: innodb_flush_neighbors is not being ignored
      for system tablespace on SSD
      
      When the maximum configured number of files is exceeded, InnoDB will
      close data files. We used to maintain a fil_system.LRU list and
      a counter fil_node_t::n_pending to achieve this, at the huge cost
      of multiple fil_system.mutex operations per I/O operation.
      
      fil_node_open_file_low(): Implement a FIFO replacement policy:
      The last opened file will be moved to the end of fil_system.space_list,
      and files will be closed from the start of the list. However, we will
      not move tablespaces in fil_system.space_list while
      i_s_tablespaces_encryption_fill_table() is executing
      (producing output for INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION)
      because it may cause information of some tablespaces to go missing.
      We also avoid this in mariabackup --backup because datafiles_iter_next()
      assumes that the ordering is not changed.
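
The FIFO replacement policy can be sketched with a plain list. The class below is a hypothetical model that tracks integer file ids; it leaves out the exceptions for INFORMATION_SCHEMA queries and mariabackup --backup that the text mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

class open_file_fifo
{
  std::list<int> order_; // front = first candidate to be closed
  std::size_t limit_;    // maximum number of simultaneously open files

public:
  explicit open_file_fifo(std::size_t limit) : limit_(limit) {}

  // Record that file `id` was just opened. The file goes to the end of the
  // list; if the limit would be exceeded, files are closed from the start.
  // Returns the ids that had to be closed.
  std::vector<int> opened(int id)
  {
    order_.remove(id);
    std::vector<int> closed;
    while (!order_.empty() && order_.size() >= limit_)
    {
      closed.push_back(order_.front());
      order_.pop_front();
    }
    order_.push_back(id);
    return closed;
  }

  std::size_t open_count() const { return order_.size(); }
};
```

Unlike an LRU list, nothing needs to be updated on each I/O operation, only when a file is opened, which is where the saving in mutex operations comes from.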
      
      IORequest: Fold more parameters to IORequest::type.
      
      fil_space_t::io(): Replaces fil_io().
      
      fil_space_t::flush(): Replaces fil_flush().
      
      OS_AIO_IBUF: Remove. We will always issue synchronous reads of the
      change buffer pages in buf_read_page_low().
      
      We will always ignore some errors for background reads.
      
      This should reduce fil_system.mutex contention a little.
      
      fil_node_t::complete_write(): Replaces fil_node_t::complete_io().
      On both read and write completion, fil_space_t::release_for_io()
      will have to be called.
      
      fil_space_t::io(): Do not acquire fil_system.mutex in the normal
      code path.
      
      xb_delta_open_matching_space(): Do not try to open the system tablespace
      which was already opened. This fixes a file sharing violation in
      mariabackup --prepare --incremental.
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23855: Improve InnoDB log checkpoint performance · 3a9a3be1
      Marko Mäkelä authored
      After MDEV-15053, MDEV-22871, MDEV-23399 shifted the scalability
      bottleneck, log checkpoints became a new bottleneck.
      
      If innodb_io_capacity is set low or innodb_max_dirty_pages_pct_lwm is
      set high and the workload fits in the buffer pool, the page cleaner
      thread will perform very little flushing. When we reach the capacity
      of the circular redo log file ib_logfile0 and must initiate a checkpoint,
      some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF,
      then flushing would continue at the innodb_io_capacity rate, and
      writers would be throttled.)
      
      We have the best chance of advancing the checkpoint LSN immediately
      after a page flush batch has been completed. Hence, it is best to
      perform checkpoints after every batch in the page cleaner thread,
      attempting to run once per second.
      
      By initiating high-priority flushing in the page cleaner as early
      as possible, we aim to make the throughput more stable.
      
      The function buf_flush_wait_flushed() used to sleep for 10ms, hoping
      that the page cleaner thread would do something during that time.
      The observed end result was that a large number of threads that call
      log_free_check() would end up sleeping while nothing useful is happening.
      
      We will revise the design so that in the default innodb_flush_sync=ON
      mode, buf_flush_wait_flushed() will wake up the page cleaner thread
      to perform the necessary flushing, and it will wait for a signal from
      the page cleaner thread.
      
      If innodb_io_capacity is set to a low value (causing the page cleaner to
      throttle its work), a write workload would initially perform well, until
      the capacity of the circular ib_logfile0 is reached and log_free_check()
      will trigger checkpoints. At that point, the extra waiting in
      buf_flush_wait_flushed() will start reducing throughput.
      
      The page cleaner thread will also initiate log checkpoints after each
      buf_flush_lists() call, because that is the best point of time for
      the checkpoint LSN to advance by the maximum amount.
      
      Even in 'furious flushing' mode we invoke buf_flush_lists() with
      innodb_io_capacity_max pages at a time, and at the start of each
      batch (in the log_flush() callback function that runs in a separate
      task) we will invoke os_aio_wait_until_no_pending_writes(). This
      tweak allows the checkpoint to advance in smaller steps and
      significantly reduces the maximum latency. On an Intel Optane 960
      NVMe SSD on Linux, it reduced from 4.6 seconds to 74 milliseconds.
      On Microsoft Windows with a slower SSD, it reduced from more than
      180 seconds to 0.6 seconds.
      
      We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity
      per second whenever the dirty proportion of buffer pool pages exceeds
      innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try
      to make page_cleaner_flush_pages_recommendation() more consistent and
      predictable: if we are below innodb_adaptive_flushing_lwm, let us flush
      pages according to the return value of af_get_pct_for_dirty().
      
      innodb_max_dirty_pages_pct_lwm: Revert the change of the default value
      that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0
      guarantees that a shutdown of an idle server will be fast. Users might
      be surprised if normal shutdown suddenly became slower when upgrading
      within a GA release series.
      
      innodb_checkpoint_usec: Remove. The master task will no longer perform
      periodic log checkpoints. It is the duty of the page cleaner thread.
      
      log_sys.max_modified_age: Remove. The current span of the
      buf_pool.flush_list expressed in LSN only matters for adaptive
      flushing (outside the 'furious flushing' condition).
      For the correctness of checkpoints, the only thing that matters is
      the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn).
      This run-time constant was also reported as log_max_modified_age_sync.
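
The checkpoint age mentioned above is a simple difference of two LSNs. The sketch below uses hypothetical free functions, and treating the limit as inclusive is an assumption of this example.

```cpp
#include <cassert>
#include <cstdint>

using lsn_t = uint64_t;

// Checkpoint age as given in the text: log_sys.lsn - log_sys.last_checkpoint_lsn.
inline lsn_t checkpoint_age(lsn_t lsn, lsn_t last_checkpoint_lsn)
{
  assert(lsn >= last_checkpoint_lsn);
  return lsn - last_checkpoint_lsn;
}

// Assumption: reaching the limit (not only exceeding it) engages
// 'furious flushing'.
inline bool checkpoint_age_limit_reached(lsn_t lsn, lsn_t last_checkpoint_lsn,
                                         lsn_t max_checkpoint_age)
{
  return checkpoint_age(lsn, last_checkpoint_lsn) >= max_checkpoint_age;
}
```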
      
      log_sys.max_checkpoint_age_async: Remove. This does not serve any
      purpose, because the checkpoints will now be triggered by the page
      cleaner thread. We will retain the log_sys.max_checkpoint_age limit
      for engaging 'furious flushing'.
      
      page_cleaner.slot: Remove. It turns out that
      page_cleaner_slot.flush_list_time was duplicating
      page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass
      was duplicating page_cleaner.flush_pass.
      Likewise, there were some redundant monitor counters, because the
      page cleaner thread no longer performs any buf_pool.LRU flushing, and
      because there only is one buf_flush_page_cleaner thread.
      
      buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex.
      
      buf_pool_t::get_oldest_modification(): Add a parameter to specify the
      return value when no persistent data pages are dirty. Require the
      caller to hold buf_pool.flush_list_mutex.
      
      log_buf_pool_get_oldest_modification(): Take the fall-back LSN
      as a parameter. All callers will also invoke log_sys.get_lsn().
      
      log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed().
      
      buf_flush_wait_flushed(): Implement two limits. If not enough buffer pool
      has been flushed, signal the page cleaner (unless innodb_flush_sync=OFF)
      and wait for the page cleaner to complete. If the page cleaner
      thread is not running (which can be the case during shutdown),
      initiate the flush and wait for it directly.
      
      buf_flush_ahead(): If innodb_flush_sync=ON (the default),
      submit a new buf_flush_sync_lsn target for the page cleaner
      but do not wait for the flushing to finish.
      
      log_get_capacity(), log_get_max_modified_age_async(): Remove, to make
      it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes.
      
      page_cleaner_flush_pages_recommendation(): Protect all access to
      buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there
      were some race conditions in the calculation.
      
      buf_flush_sync_for_checkpoint(): New function to process
      buf_flush_sync_lsn in the page cleaner thread. At the end of
      each batch, we try to wake up any blocked buf_flush_wait_flushed().
      If everything up to buf_flush_sync_lsn has been flushed, we will
      reset buf_flush_sync_lsn=0. The page cleaner thread will keep
      'furious flushing' until the limit is reached. Any threads that
      are waiting in buf_flush_wait_flushed() will be able to resume
      as soon as their own limit has been satisfied.
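
The wait-and-signal protocol between writers and the page cleaner can be sketched with a mutex and a condition variable. The class and member names are hypothetical, and the sketch omits the step of waking up the page cleaner itself.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

class flush_progress
{
  std::mutex mutex_;
  std::condition_variable done_;
  uint64_t flushed_lsn_ = 0; // everything up to here has been written
  uint64_t sync_lsn_ = 0;    // current flushing target; 0 means none

public:
  // Writer side: request flushing up to `target` and block until the page
  // cleaner reports that this much has been flushed.
  void wait_flushed(uint64_t target)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    if (flushed_lsn_ >= target)
      return;
    if (sync_lsn_ < target)
      sync_lsn_ = target;
    done_.wait(lock, [&] { return flushed_lsn_ >= target; });
  }

  // Page cleaner side: called at the end of each batch. Waiters whose own
  // limit is satisfied resume even if the overall target is not reached.
  void batch_completed(uint64_t flushed_up_to)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (flushed_up_to > flushed_lsn_)
        flushed_lsn_ = flushed_up_to;
      if (flushed_lsn_ >= sync_lsn_)
        sync_lsn_ = 0; // target reached: leave the special flushing mode
    }
    done_.notify_all();
  }

  uint64_t sync_lsn()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return sync_lsn_;
  }
};
```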
      
      buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not
      sleep as long as it is set. Do not update any page_cleaner statistics
      for this special mode of operation. In the normal mode
      (buf_flush_sync_lsn is not set for innodb_flush_sync=ON),
      try to wake up once per second. No longer check whether
      srv_inc_activity_count() has been called. After each batch,
      try to perform a log checkpoint, because the best chances for
      the checkpoint LSN to advance by the maximum amount are upon
      completing a flushing batch.
      
      log_t: Move buf_free, max_buf_free possibly to the same cache line
      with log_sys.mutex.
      
      log_margin_checkpoint_age(): Simplify the logic, and replace
      a 0.1-second sleep with a call to buf_flush_wait_flushed() to
      initiate flushing. Moved to the same compilation unit
      with the only caller.
      
      log_close(): Clean up the calculations. (Should be no functional
      change.) Return whether flush-ahead is needed. Moved to the same
      compilation unit with the only caller.
      
      mtr_t::finish_write(): Return whether flush-ahead is needed.
      
      mtr_t::commit(): Invoke buf_flush_ahead() when needed. Let us avoid
      external calls in mtr_t::commit() and make the logic easier to follow
      by having related code in a single compilation unit. Also, we will
      invoke srv_stats.log_write_requests.inc() only once per
      mini-transaction commit, while not holding mutexes.
      
      log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age.
      Upon reaching log_sys.max_checkpoint_age where we must wait to prevent
      the log from getting corrupted, let us wait for at most 1MiB of LSN
      at a time, before rechecking the condition. This should allow writers
      to proceed even if the redo log capacity has been reached and
      'furious flushing' is in progress. We no longer care about
      log_sys.max_modified_age_sync or log_sys.max_modified_age_async.
      The log_sys.max_modified_age_sync could be a relic from the time when
      there was a srv_master_thread that wrote dirty pages to data files.
      Also, we no longer have any log_sys.max_checkpoint_age_async limit,
      because log checkpoints will now be triggered by the page cleaner
      thread upon completing buf_flush_lists().
      
      log_set_capacity(): Simplify the calculations of the limit
      (no functional change).
      
      log_checkpoint_low(): Split from log_checkpoint(). Moved to the
      same compilation unit with the caller.
      
      log_make_checkpoint(): Only wait for everything to be flushed until
      the current LSN.
      
      create_log_file(): After checkpoint, invoke log_write_up_to()
      to ensure that the FILE_CHECKPOINT record has been written.
      This avoids ut_ad(!srv_log_file_created) in create_log_file_rename().
      
      srv_start(): Do not call recv_recovery_from_checkpoint_start()
      if the log has just been created. Set fil_system.space_id_reuse_warned
      before dict_boot() has been executed, and clear it after recovery
      has finished.
      
      dict_boot(): Initialize fil_system.max_assigned_id.
      
      srv_check_activity(): Remove. The activity count is counting transaction
      commits and therefore mostly interesting for the purge of history.
      
      BtrBulk::insert(): Do not explicitly wake up the page cleaner,
      but do invoke srv_inc_activity_count(), because that counter is
      still being used in buf_load_throttle_if_needed() for some
      heuristics. (It might be cleaner to execute buf_load() in the
      page cleaner thread!)
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-23399 fixup: Avoid crash on Mariabackup shutdown · 5999d512
      Marko Mäkelä authored
      innodb_preshutdown(): Terminate the encryption threads before
      the page cleaner thread can be shut down.
      
      innodb_shutdown(): Always wait for the encryption threads and
      page cleaner to shut down.
      
      srv_shutdown_all_bg_threads(): Wait for the encryption threads and
      the page cleaner to shut down. (After an aborted startup,
      innodb_shutdown() would not be called.)
      
      row_get_background_drop_list_len_low(): Remove.
      
      os_thread_count: Remove. Alternatively, at the end of
      srv_shutdown_all_bg_threads() we could try to wait longer
      for the count to reach 0. On some platforms, an assertion
      os_thread_count==0 could fail even after a small delay,
      even though in the core dump all threads would have exited.
      
      srv_shutdown_threads(): Renamed from srv_shutdown_all_bg_threads().
      Do not wait for the page cleaner to shut down, because the later
      innodb_shutdown(), which may invoke
      logs_empty_and_mark_files_at_shutdown(), assumes that it exists.
  23. 23 Oct, 2020 1 commit
    • MDEV-20755 InnoDB: Database page corruption on disk or a failed file read of... · 985ede92
      Vlad Lesin authored
      MDEV-20755 InnoDB: Database page corruption on disk or a failed file read of tablespace upon prepare of mariabackup incremental backup
      
      The problem:
      
      When an incremental backup is taken, delta files are created for InnoDB
      tables that are marked as new tables during InnoDB DDL tracking. When
      xb_delta_open_matching_space() tries to open such a tablespace during
      prepare, the tablespace is "created", i.e.
      xb_space_create_file() is invoked, instead of being opened, even if
      a tablespace with the same name exists in the base backup directory.
      
      xb_space_create_file() writes the page 0 header of the tablespace.
      This header does not contain crypt data, as mariabackup does not have
      any information about crypt data in delta file metadata for
      tablespaces.
      
      After the delta file is applied, the recovery process is started. As the
      order of recovery for different pages is not defined, the crypt data
      redo log event may be executed after some other page is read for
      recovery. When a page is read for recovery, it is decrypted using the
      crypt data stored in the tablespace header in page 0. If there is no
      crypt data, the page is not decrypted and does not pass the corruption
      test.
      
      This causes an error in incremental backup --prepare for encrypted
      tablespaces.
      
      The error is intermittent because the crypt data redo log event updates
      crypt data on page 0, and recovery for different pages can be executed in
      undefined order.
      
      The fix:
      
      When a delta file is created, the corresponding write filter copies only
      the pages whose LSN is greater than some incremental LSN. When a new file
      is created during incremental backup, the LSN of all its pages must be
      greater than the incremental LSN, so there is no need to create a delta
      for such a table; we can just copy it completely.
      
      The fix is to copy the whole file that the InnoDB DDL tracker tracked
      during the incremental backup, and to copy it to the base directory during --prepare
      instead of delta applying.
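
The reasoning behind the fix can be illustrated with a sketch of the write filter. The struct and function names are hypothetical; only the rule "copy pages whose LSN is greater than the incremental LSN" comes from the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

struct page_info { uint32_t page_no; uint64_t lsn; };

// Select the pages that would go into a delta file: those modified after
// the incremental LSN.
inline std::vector<page_info> pages_for_delta(
    const std::vector<page_info> &pages, uint64_t incremental_lsn)
{
  std::vector<page_info> out;
  std::copy_if(pages.begin(), pages.end(), std::back_inserter(out),
               [incremental_lsn](const page_info &p)
               { return p.lsn > incremental_lsn; });
  return out;
}
```

For a file created during the incremental backup, every page passes the filter, so the delta would contain the whole file anyway and a plain copy loses nothing.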
      
      There is also DBUG_EXECUTE_IF() in innodb code to avoid writing redo log
      record for crypt data updating on page 0 to make the test case stable.
      
      Note:
      
      The issue is not reproducible in 10.5 as optimized DDLs are deprecated
      in 10.5. But the fix is still useful because it reduces the
      data copy size during backup, as a delta file contains some extra info.
      The test case should be removed for 10.5 as it will always pass.
  24. 20 Oct, 2020 1 commit
    • MDEV-21951: mariabackup SST fail if data-directory have lost+found directory · 888010d9
      Julius Goryavsky authored
      To fix this, it is necessary to add an option to exclude the
      database with the name "lost+found" from processing (the database
      name will be checked by the check_if_skip_database_by_path() or
      by the check_if_skip_database() function, and as a result
      "lost+found" will be skipped).
      
      In addition, it is necessary to slightly modify the verification
      logic in the check_if_skip_database() function.
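
A name check of the kind described might look like the sketch below. The helper is hypothetical and is not the code of check_if_skip_database() or check_if_skip_database_by_path(); it assumes the path carries no trailing separator.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if the last component of `path` is the
// "lost+found" directory that some file systems create at their root.
inline bool is_lost_and_found(const std::string &path)
{
  const std::string::size_type slash = path.find_last_of("/\\");
  const std::string name =
      slash == std::string::npos ? path : path.substr(slash + 1);
  return name == "lost+found";
}
```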
      
      Also added a new test galera_sst_mariabackup_lost_found.test
  25. 19 Oct, 2020 1 commit
    • MDEV-23982: Mariabackup hangs on backup · 1066312a
      Marko Mäkelä authored
      MDEV-13318 introduced a condition to Mariabackup that can cause it to
      hang if the server goes idle after writing a log block that has no
      payload after the 12-byte header. Normal recovery in log0recv.cc would
      allow blocks with exactly 12 bytes of length, and only reject blocks
      where the length field is shorter than that.
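
The boundary case can be made concrete with a small sketch. The constant and function names are hypothetical; the 12-byte header size is taken from the text, and the exact form of the stricter Mariabackup condition is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Header size of a log block, as stated in the commit message.
constexpr std::size_t kLogBlockHeaderSize = 12;

// Recovery-style check: a block consisting of just the header is valid.
inline bool block_length_accepted(std::size_t data_len)
{
  return data_len >= kLogBlockHeaderSize;
}

// Assumed shape of the stricter check: it rejects a header-only block, so
// a reader that retries until the block is "valid" never makes progress
// on an idle server.
inline bool block_length_accepted_strict(std::size_t data_len)
{
  return data_len > kLogBlockHeaderSize;
}
```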
  26. 16 Oct, 2020 1 commit
    • Fixup 9028cc6b · a0113683
      Marko Mäkelä authored
      We forgot to change innodb_autoextend_increment from ULONG to
      UINT (always 32-bit) in Mariabackup.
  27. 15 Oct, 2020 2 commits
    • Cleanup: Make InnoDB page numbers uint32_t · 9028cc6b
      Marko Mäkelä authored
      InnoDB stores a 32-bit page number in page headers and in some
      data structures, such as FIL_ADDR (consisting of a 32-bit page number
      and a 16-bit byte offset within a page). For better compile-time
      error detection and to reduce the memory footprint in some data
      structures, let us use a uint32_t for the page number, instead
      of ulint (size_t) which can be 64 bits.
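
The memory-footprint argument can be checked with two illustrative structs. Both are hypothetical stand-ins and do not reproduce the actual FIL_ADDR definition; the field names are chosen only to mirror the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-width fields matching the on-disk format: a 32-bit page number and
// a 16-bit byte offset within the page.
struct fil_addr_narrow
{
  uint32_t page;
  uint16_t boffset;
};

// The same information held in size_t fields, which are 64 bits wide on
// typical 64-bit platforms.
struct fil_addr_wide
{
  std::size_t page;
  std::size_t boffset;
};

static_assert(sizeof(fil_addr_narrow) <= sizeof(fil_addr_wide),
              "fixed-width fields never take more space than size_t fields");
static_assert(sizeof(fil_addr_narrow::page) == 4,
              "a page number is exactly 32 bits");
```

On a common 64-bit ABI the narrow struct occupies 8 bytes against 16 for the wide one, and assigning a 64-bit value to the 32-bit field can be flagged by compiler narrowing warnings.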
    • MDEV-23399: Performance regression with write workloads · 7cffb5f6
      Marko Mäkelä authored
      The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
      the performance bottleneck to the page flushing.
      
      The configuration parameters will be changed as follows:
      
      innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
      innodb_lru_scan_depth=1536 (old: 1024)
      innodb_max_dirty_pages_pct=90 (old: 75)
      innodb_max_dirty_pages_pct_lwm=75 (old: 0)
      
      Note: The parameter innodb_lru_scan_depth will only affect LRU
      eviction of buffer pool pages when a new page is being allocated. The
      page cleaner thread will no longer evict any pages. It used to
      guarantee that some pages will remain free in the buffer pool. Now, we
      perform that eviction 'on demand' in buf_LRU_get_free_block().
      The parameter innodb_lru_scan_depth (srv_LRU_scan_depth) is used as follows:
       * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
       * As a buf_pool.free limit in buf_LRU_list_batch() for terminating
         the flushing that is initiated e.g., by buf_LRU_get_free_block()
      The parameter also used to serve as an initial limit for unzip_LRU
      eviction (evicting uncompressed page frames while retaining
      ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
      of 100 or unlimited for invoking buf_LRU_scan_and_free_block().
      
      The status variables will be changed as follows:
      
      innodb_buffer_pool_pages_flushed: This also includes the count of
      innodb_buffer_pool_pages_LRU_flushed and should work reliably,
      updated one by one in buf_flush_page() to give more real-time
      statistics. The function buf_flush_stats(), which we are removing,
      was not called in every code path. For both counters, we will use
      regular variables that are incremented in a critical section of
      buf_pool.mutex. Note that show_innodb_vars() directly links to the
      variables, and reads of the counters will *not* be protected by
      buf_pool.mutex, so you cannot get a consistent snapshot of both variables.
      
      The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
      removed, because the page cleaner no longer deals with writing or
      evicting least recently used pages, and because the single-page writes
      have been removed:
      * buffer_LRU_batch_flush_avg_time_slot
      * buffer_LRU_batch_flush_avg_time_thread
      * buffer_LRU_batch_flush_avg_time_est
      * buffer_LRU_batch_flush_avg_pass
      * buffer_LRU_single_flush_scanned
      * buffer_LRU_single_flush_num_scan
      * buffer_LRU_single_flush_scanned_per_call
      
      When moving to a single buffer pool instance in MDEV-15058, we missed
      some opportunity to simplify the buf_flush_page_cleaner thread. It was
      unnecessarily using a mutex and some complex data structures, even
      though we always have a single page cleaner thread.
      
      Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
      and 'shutdown' modes where it was waiting to be triggered by some
      other thread, adding unnecessary latency and potential for hangs in
      relatively rarely executed startup or shutdown code.
      
      The page cleaner was also running two kinds of batches in an
      interleaved fashion: "LRU flush" (writing out some least recently used
      pages and evicting them on write completion) and the normal batches
      that aim to increase the MIN(oldest_modification) in the buffer pool,
      to help the log checkpoint advance.
      
      The buf_pool.flush_list flushing was being blocked by
      buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
      of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
      been persistently written to the redo log, we would trigger a log
      flush and then resume the page flushing. This would unnecessarily
      limit the performance of the page cleaner thread and trigger the
      infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
      The settings might not be optimal" that were suppressed in
      commit d1ab8903 unless log_warnings>2.
      
      Our revised algorithm will make log_sys.get_flushed_lsn() advance at
      the start of buf_flush_lists(), and then execute a 'best effort' to
      write out all pages. The flush batches will skip pages that were modified
      since the log was written, or are currently exclusively locked.
      The MDEV-13670 message "page_cleaner: 1000ms intended loop took"
      will be removed, because by design, the buf_flush_page_cleaner() should
      not be blocked during a batch for extended periods of time.
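
      The 'best effort' batch can be pictured roughly as below. This is a
      simplified sketch under stated assumptions: the dirty_page struct and
      flush_batch() are hypothetical, the exclusive latch is modelled as a
      plain flag, and the actual write is omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a dirty page in the flush list.
struct dirty_page
{
  std::uint32_t id;
  std::uint64_t page_lsn;  // LSN of the latest modification
  bool x_latched;          // currently exclusively latched by a writer
  bool dirty;
};

// Write out every page that can be written without waiting; skip the rest.
// flushed_lsn is the LSN up to which the redo log is durably written.
// Returns the number of pages written in this batch.
std::size_t flush_batch(std::vector<dirty_page>& pages,
                        std::uint64_t flushed_lsn)
{
  std::size_t written = 0;
  for (dirty_page& p : pages)
  {
    if (!p.dirty)
      continue;
    if (p.x_latched)
      continue;  // do not wait for the writer; retry in a later batch
    if (p.page_lsn > flushed_lsn)
      continue;  // the log for this page is not durable yet
    p.dirty = false;  // a real implementation would submit the write here
    ++written;
  }
  return written;
}
```

      Skipped pages stay dirty, so a later batch picks them up once the log
      has advanced or the latch has been released.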
      
      We will remove the single-page flushing altogether. Related to this,
      the debug parameter innodb_doublewrite_batch_size will be removed,
      because all of the doublewrite buffer will be used for flushing
      batches. If a page needs to be evicted from the buffer pool and all
      100 least recently used pages in the buffer pool have unflushed
      changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
      write out and evict innodb_lru_flush_size pages. At most one thread
      will execute buf_flush_lists() in buf_LRU_get_free_block(); other
      threads will wait for that LRU flushing batch to finish.
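
      The "at most one thread flushes, the others wait" rule could be
      sketched with a standard mutex and condition variable as follows. The
      class and member names are hypothetical, and error handling (for
      example, a batch that throws) is left out.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical gate that lets one thread run a flush batch at a time.
class lru_flush_gate
{
  std::mutex m;
  std::condition_variable done_flush;
  bool flush_active = false;
  unsigned long batches = 0;

public:
  // Returns true if the calling thread ran the batch itself, or false if
  // it only waited for a batch that another thread was already running.
  template <typename Batch>
  bool flush_or_wait(Batch&& run_batch)
  {
    std::unique_lock<std::mutex> lock(m);
    if (flush_active)
    {
      done_flush.wait(lock, [this] { return !flush_active; });
      return false;
    }
    flush_active = true;
    lock.unlock();
    run_batch();  // the batch itself runs without holding the mutex
    lock.lock();
    flush_active = false;
    ++batches;
    lock.unlock();
    done_flush.notify_all();
    return true;
  }

  unsigned long batch_count()
  {
    std::lock_guard<std::mutex> lock(m);
    return batches;
  }
};
```

      A waiting thread would rescan for a free block after flush_or_wait()
      returns false, since the finished batch may have evicted pages.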
      
      To improve concurrency, we will replace the InnoDB ib_mutex_t and
      os_event_t with native mutexes and condition variables in this area of code.
      Most notably, this means that the buffer pool mutex (buf_pool.mutex)
      is no longer instrumented via any InnoDB interfaces. It will continue
      to be instrumented via PERFORMANCE_SCHEMA.
      
      For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
      declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
      sections of buf_pool.flush_list_mutex should be shorter than those for
      buf_pool.mutex, because in the worst case, they cover a linear scan of
      buf_pool.flush_list, while the worst case of a critical section of
      buf_pool.mutex covers a linear scan of the potentially much longer
      buf_pool.LRU list.
      
      mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
      with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
      instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().
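
      To illustrate what an ownership predicate offers over an assert-only
      macro, here is a small sketch built on standard C++ primitives. It is
      not the SAFE_MUTEX implementation; the owned_mutex class is a
      hypothetical stand-in.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical mutex wrapper that records which thread holds it.
class owned_mutex
{
  std::mutex m;
  std::atomic<std::thread::id> owner{std::thread::id()};

public:
  void lock()
  {
    m.lock();
    owner.store(std::this_thread::get_id(), std::memory_order_relaxed);
  }

  void unlock()
  {
    owner.store(std::thread::id(), std::memory_order_relaxed);
    m.unlock();
  }

  // Predicate form: can be combined with other conditions in an assertion,
  // for example assert(flag || mutex.is_owner()).
  bool is_owner() const
  {
    return owner.load(std::memory_order_relaxed) ==
           std::this_thread::get_id();
  }
};
```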
      
      buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
      Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
      The number of active flush operations.
      
      buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
      instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
      and SAFE_MUTEX instrumentation.
      
      buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.
      
      buf_pool_t::done_flush_list: Condition variable for !n_flush_list.
      
      buf_pool_t::do_flush_list: Condition variable to wake up the
      buf_flush_page_cleaner when a log checkpoint needs to be written
      or the server is being shut down. Replaces buf_flush_event.
      We will keep using timed waits (the page cleaner thread will wake
      _at least_ once per second), because the calculations for
      innodb_adaptive_flushing depend on fixed time intervals.
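
      A timed wait of this kind might look like the sketch below, using
      std::condition_variable. The cleaner_wakeup class is hypothetical; the
      one-second default mirrors the interval mentioned above.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical wake-up channel for a background flushing thread.
class cleaner_wakeup
{
  std::mutex m;
  std::condition_variable do_flush;
  bool requested = false;

public:
  // Called by other threads, e.g. when a checkpoint is needed.
  void request()
  {
    {
      std::lock_guard<std::mutex> lock(m);
      requested = true;
    }
    do_flush.notify_one();
  }

  // Sleep until a request arrives or the timeout expires, whichever comes
  // first. Returns true if woken by a request, false on timeout.
  bool wait(std::chrono::milliseconds timeout = std::chrono::seconds(1))
  {
    std::unique_lock<std::mutex> lock(m);
    const bool woken =
        do_flush.wait_for(lock, timeout, [this] { return requested; });
    requested = false;
    return woken;
  }
};
```

      The timeout bounds the sleep, so the loop body runs at least once per
      interval even when nobody calls request().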
      
      buf_dblwr: Allocate statically, and move all code to member functions.
      Use a native mutex and condition variable. Remove code to deal with
      single-page flushing.
      
      buf_dblwr_check_block(): Make the check debug-only. We were spending
      a significant amount of execution time in page_simple_validate_new().
      
      flush_counters_t::unzip_LRU_evicted: Remove.
      
      IORequest: Make more members const. FIXME: m_fil_node should be removed.
      
      buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
      (which we are removing).
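
      One way to maintain such a target without a mutex is a
      compare-and-swap loop that only ever raises the value. The sketch below
      assumes "raise to at least this LSN" semantics, which the message does
      not spell out; the function name is hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Raise the target to at least lsn, never lowering it.
// Returns true if this call changed the target.
inline bool request_sync_up_to(std::atomic<std::uint64_t>& target,
                               std::uint64_t lsn)
{
  std::uint64_t current = target.load(std::memory_order_relaxed);
  while (current < lsn)
  {
    // On failure, current is reloaded with the latest value and the loop
    // condition is checked again.
    if (target.compare_exchange_weak(current, lsn,
                                     std::memory_order_release,
                                     std::memory_order_relaxed))
      return true;
  }
  return false;
}
```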
      
      page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.
      
      pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().
      
      recv_writer_thread: Remove. Recovery works just fine without it, if we
      simply invoke buf_flush_sync() at the end of each batch in
      recv_sys_t::apply().
      
      recv_recovery_from_checkpoint_finish(): Remove. We can simply call
      recv_sys.debug_free() directly.
      
      srv_started_redo: Replaces srv_start_state.
      
      SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
      can communicate with the normal page cleaner loop via the new function
      flush_buffer_pool().
      
      buf_flush_remove(): Assert that the calling thread is holding
      buf_pool.flush_list_mutex. This removes unnecessary mutex operations
      from buf_flush_remove_pages() and buf_flush_dirty_pages(),
      which replace buf_LRU_flush_or_remove_pages().
      
      buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
      interface. Return the number of flushed pages. Clarified comments and
      renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
      buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
      function, which was their only caller, and remove 2 unnecessary
      buf_pool.mutex release/re-acquisition cycles that we used to perform around
      the buf_flush_batch() call. At the start, if not all log has been
      durably written, wait for a background task to do it, or start a new
      task to do it. This allows the log write to run concurrently with our
      page flushing batch. Any pages that were skipped due to too recent
      FIL_PAGE_LSN or due to them being latched by a writer should be flushed
      during the next batch, unless there are further modifications to those
      pages. It is possible that a page that we must flush due to small
      oldest_modification also carries a recent FIL_PAGE_LSN or is being
      constantly modified. In the worst case, all writers would then end up
      waiting in log_free_check() to allow the flushing and the checkpoint
      to complete.
      
      buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
      Cache the last looked up tablespace. If neighbor flushing is not applicable,
      invoke buf_flush_page() directly, avoiding a page lookup in between.
      
      buf_flush_space(): Auxiliary function to look up a tablespace for
      page flushing.
      
      buf_flush_page(): Defer the computation of space->full_crc32(). Never
      call log_write_up_to(), but instead skip persistent pages whose latest
      modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
      pages on which we cannot acquire a shared latch without waiting.
      
      buf_flush_try_neighbors(): Do not bother checking buf_fix_count
      because buf_flush_page() will no longer wait for the page latch.
      Take the tablespace as a parameter, and only execute this function
      when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().
      
      buf_flush_relocate_on_flush_list(): Declare as cold, and push down
      a condition from the callers.
      
      buf_flush_check_neighbor(): Take id.fold() as a parameter.
      
      buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
      because the flushing batch will skip pages whose modifications have
      not yet been written to the log or were latched for modification.
      
      buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.
      
      buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
      the counters, and report n->evicted.
      Cache the last looked up tablespace. If neighbor flushing is not applicable,
      invoke buf_flush_page() directly, avoiding a page lookup in between.
      
      buf_do_LRU_batch(): Return the number of pages flushed.
      
      buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
      adaptive hash index entries are pointing to the block.
      
      buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
      will no longer perform any useful work for us, and we do not want it
      to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
      writes out and evicts at most innodb_lru_flush_size pages. (The
      function buf_do_LRU_batch() may complete after writing fewer pages if
      more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
      Eliminate some mutex release-acquire cycles, and wait for the LRU
      flush batch to complete before rescanning.
      
      buf_LRU_check_size_of_non_data_objects(): Simplify the code.
      
      buf_page_write_complete(): Remove the parameter evict, and always
      evict pages that were part of an LRU flush.
      
      buf_page_create(): Take a pre-allocated page as a parameter.
      
      buf_pool_t::free_block(): Free a pre-allocated block.
      
      recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
      while not holding recv_sys.mutex. During page allocation, we may
      initiate a page flush, which in turn may initiate a log flush, which
      would require acquiring log_sys.mutex, which should always be acquired
      before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
      not be holding recv_sys.mutex while allocating a buffer pool block.
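
      The preallocation pattern can be sketched like this. All names are
      hypothetical; allocate_block() stands for any allocation that may need
      to take other locks internally.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct block { unsigned char frame[64]; };

// Hypothetical recovery state protected by its own mutex.
struct recovery_state
{
  std::mutex mutex;
  std::size_t pending = 0;                   // pages still to be recovered
  std::vector<std::unique_ptr<block>> used;  // blocks handed to recovery
};

// Stand-in for a buffer pool allocation, which in a real system may flush
// pages and write log, taking other mutexes along the way.
inline std::unique_ptr<block> allocate_block()
{
  return std::make_unique<block>();
}

// Allocate first, then take the recovery mutex. The allocation therefore
// never runs while the recovery mutex is held, so the lock order cannot be
// inverted.
inline bool recover_one(recovery_state& r)
{
  std::unique_ptr<block> free_block = allocate_block();
  std::lock_guard<std::mutex> lock(r.mutex);
  if (r.pending == 0)
    return false;  // the unused preallocated block is freed on return
  --r.pending;
  r.used.push_back(std::move(free_block));
  return true;
}
```

      The cost is an occasional allocation that goes unused, which is why a
      function to free a pre-allocated block is needed.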
      
      BtrBulk::logFreeCheck(): Skip a redundant condition.
      
      row_undo_step(): Do not invoke srv_inc_activity_count() for every row
      that is being rolled back. It should suffice to invoke the function in
      trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
      rollback completes.
      
      sync_check_enable(): Remove. We will enable innodb_sync_debug from the
      very beginning.
      
      Reviewed by: Vladislav Vaintroub
      7cffb5f6
  28. 30 Sep, 2020 1 commit
    • MDEV-16264 fixup: Remove unused code and data · a9550c47
      Marko Mäkelä authored
      LATCH_ID_OS_AIO_READ_MUTEX,
      LATCH_ID_OS_AIO_WRITE_MUTEX,
      LATCH_ID_OS_AIO_LOG_MUTEX,
      LATCH_ID_OS_AIO_IBUF_MUTEX,
      LATCH_ID_OS_AIO_SYNC_MUTEX: Remove. The tpool is not instrumented.
      
      lock_set_timeout_event(): Remove.
      
      srv_sys_mutex_key, srv_sys_t::mutex, SYNC_THREADS: Remove.
      
      srv_slot_t::suspended: Remove. We only ever assigned this data member
      true, so it is redundant.
      
      ib_wqueue_wait(), ib_wqueue_timedwait(): Remove.
      
      os_thread_join(): Remove.
      
      os_thread_create(), os_thread_exit(): Remove redundant parameters.
      
      These were missed in commit 5e62b6a5.
      a9550c47
  29. 21 Sep, 2020 2 commits
    • 407d170c
    • MDEV-23711 make mariabackup innodb redo log read error message more clear · 0a224edc
      Vlad Lesin authored
      log_group_read_log_seg() returns an error when:
      
      1) The calculated log block number does not correspond to the read log
      block number. This can be caused by:
        a) Garbage or an incompletely written log block. We can exclude this
        case by checking the log block checksum if it is enabled (see
        innodb-log-checksums; an encrypted log block always contains a checksum).
        b) The log block is overwritten. In this case the checksum will be correct
        and the read log block number will be greater than the requested one.
      
      2) The log block length is wrong. In this case recv_sys->found_corrupt_log
      is set.
      
      3) The redo log block checksum is wrong. In this case the InnoDB code
      writes messages to the error log with the following prefix: "Invalid log
      block checksum."
      
      The fix processes all the cases above.
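
      The cases above can be summarized as a small classification function.
      This is only a sketch of the decision order described in this message;
      the enum, struct, and field names are hypothetical and do not mirror
      the actual mariabackup code.

```cpp
#include <cstdint>

enum class log_read_status
{
  ok,
  garbage_or_partial_block,  // case 1a
  block_overwritten,         // case 1b
  wrong_length,              // case 2
  wrong_checksum             // case 3
};

// Hypothetical summary of what was found when reading one log block.
struct log_block_info
{
  std::uint32_t expected_no;  // block number calculated from the LSN
  std::uint32_t read_no;      // block number stored in the block
  bool checksum_ok;
  bool length_ok;
};

inline log_read_status classify(const log_block_info& b)
{
  if (b.read_no != b.expected_no)
  {
    // A valid checksum with a larger block number means the block was
    // overwritten by newer log; anything else is treated as garbage.
    if (b.checksum_ok && b.read_no > b.expected_no)
      return log_read_status::block_overwritten;
    return log_read_status::garbage_or_partial_block;
  }
  if (!b.length_ok)
    return log_read_status::wrong_length;
  if (!b.checksum_ok)
    return log_read_status::wrong_checksum;
  return log_read_status::ok;
}
```

      Distinguishing the cases lets each one produce its own error message
      instead of a single generic read failure.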
      0a224edc
  30. 17 Sep, 2020 1 commit
    • MDEV-19935 Create unified CRC-32 interface · ccbe6bb6
      Vladislav Vaintroub authored
      Add CRC32C code to mysys. The x86-64 implementation uses PCLMULQDQ in addition to the CRC32 instruction,
      following the Intel whitepaper, and is ported from the RocksDB code.
      
      Optimized ARM and POWER CRC32 were already present in mysys.
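
      For reference, a portable bit-at-a-time CRC-32C (Castagnoli polynomial,
      reflected form 0x82F63B78) is sketched below. It is not the optimized
      mysys code; the function name and signature are hypothetical. A
      hardware-accelerated implementation should return the same values.

```cpp
#include <cstddef>
#include <cstdint>

// Software CRC-32C. Pass 0 as crc for the first chunk, or the previous
// return value to continue a running checksum over further data.
inline std::uint32_t crc32c_sw(std::uint32_t crc, const void* data,
                               std::size_t len)
{
  const unsigned char* p = static_cast<const unsigned char*>(data);
  crc = ~crc;
  while (len--)
  {
    crc ^= *p++;
    for (int bit = 0; bit < 8; bit++)
      // Shift right by one bit; XOR in the polynomial if the low bit was set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return ~crc;
}
```

      The standard check value for the ASCII string "123456789" is
      0xE3069283, which is a convenient test for any optimized variant.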
      ccbe6bb6
  31. 19 Aug, 2020 1 commit
    • MDEV-23475 InnoDB performance regression for write-heavy workloads · 309302a3
      Marko Mäkelä authored
      In commit fe39d02f (MDEV-20638)
      we removed some wake-up signaling of the master thread that should
      have been there, to ensure a steady log checkpointing workload.
      
      Common sense suggests that the commit omitted some necessary calls
      to srv_inc_activity_count(). But, an attempt to add the call to
      trx_flush_log_if_needed_low() as well as to reinstate the function
      innobase_active_small() did not restore the performance for the
      case where sync_binlog=1 is set.
      
      Therefore, we will revert the entire commit in MariaDB Server 10.2.
      In MariaDB Server 10.5, adding a srv_inc_activity_count() call to
      trx_flush_log_if_needed_low() did restore the performance, so we
      will not revert MDEV-20638 across all versions.
      309302a3