  1. 27 Sep, 2024 1 commit
  2. 14 Sep, 2024 1 commit
    • mtr_t::log_file_op(): Fix -Wnonnull · 4010dff0
      Marko Mäkelä authored
      GCC 12.2.0 could issue -Wnonnull for an unreachable call to
      strlen(new_path).  Let us prevent that by replacing the condition
      (type == FILE_RENAME) with the equivalent (new_path).
      This should also optimize the generated code, because the lifetime
      of the parameter "type" will be reduced.
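
      A minimal sketch of the idea, not the actual mtr_t::log_file_op() code.
      The enum values, the helper name and the byte layout are hypothetical;
      the point is only that testing the pointer itself guards the strlen()
      call directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical record types, for illustration only.
enum file_op_type { FILE_CREATE, FILE_RENAME, FILE_DELETE };

// Returns the number of path bytes that a record would carry.
// Assumption: new_path is non-null exactly when type == FILE_RENAME.
inline std::size_t logged_path_bytes(file_op_type type, const char *path,
                                     const char *new_path)
{
  assert((type == FILE_RENAME) == (new_path != nullptr));
  (void) type;                          // unused when NDEBUG is defined
  std::size_t len= std::strlen(path);
  if (new_path)                         // instead of: if (type == FILE_RENAME)
    len+= 1 + std::strlen(new_path);    // separator byte plus the new name
  return len;
}
```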
  3. 07 Jun, 2024 1 commit
  4. 26 Mar, 2024 1 commit
  5. 22 Mar, 2024 1 commit
    • MDEV-33515 log_sys.lsn_lock causes excessive context switching · bf0b82d2
      Marko Mäkelä authored
      The log_sys.lsn_lock is a very contended resource with a small
      critical section in log_sys.append_prepare(). On many processor
      microarchitectures, replacing the system call based log_sys.lsn_lock
      with a pure spin lock would fare worse during high concurrency workloads,
      wasting a significant amount of CPU cycles in the spin loop.
      
      On other microarchitectures, we would see a significant amount of time
      being spent in native_queued_spin_lock_slowpath() in the Linux kernel,
      plus context switching between user and kernel address space. This was
      pointed out by Steve Shaw from Intel Corporation.
      
      Depending on the workload and the hardware implementation, it may be
      useful to use a pure spin lock in log_sys.append_prepare().
      We will introduce a parameter. The statement
      
      	SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=50;
      
      would enable a spin lock that will execute that many MY_RELAX_CPU()
      operations (such as the x86 PAUSE instruction) between successive
      attempts of acquiring the spin lock. The use of a system call based
      log_sys.lsn_lock (which is the default setting) can be enabled by
      
      	SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=0;
      
      This patch will also introduce #ifdef LOG_LATCH_DEBUG
      (part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON) for more accurate
      tracking of log_sys.latch ownership and reorganize the fields of
      log_sys to improve the locality of reference and to reduce the
      chances of false sharing.
      
      When a spin lock is being used, it will be maintained in the
      most significant bit of log_sys.buf_free. This is useful, because that is
      one of the fields that is covered by the lock. For IA-32 or AMD64, we
      implement the spin lock specially via log_t::lsn_lock_bts(), employing the
      i386 LOCK BTS instruction. A straightforward std::atomic::fetch_or() would
      translate into an inefficient loop around LOCK CMPXCHG.
      
      mtr_t::spin_wait_delay: The value of innodb_log_spin_wait_delay.
      
      mtr_t::finisher: Pointer to the currently used mtr_t::finish_write()
      implementation. This avoids introducing conditional branches.
      We no longer invoke log_sys.is_pmem() at the mini-transaction level,
      but we would do that in log_write_up_to().
      
      mtr_t::finisher_update(): Update finisher when spin_wait_delay is
      changed from or to 0 (the spin lock is changed to log_sys.lsn_lock or
      vice versa).
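
      The spin lock described above can be sketched as follows. This is a
      portable illustration under stated assumptions, not the InnoDB code:
      the class name, relax_cpu() and the member names are hypothetical, and
      it uses std::atomic::fetch_or(), which the message above notes is less
      efficient on x86 than the LOCK BTS based log_t::lsn_lock_bts().

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical stand-in for MY_RELAX_CPU(): PAUSE on x86 with GCC or clang,
// otherwise yield to the scheduler.
inline void relax_cpu() noexcept
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  __builtin_ia32_pause();
#else
  std::this_thread::yield();
#endif
}

// An offset whose most significant bit doubles as a spin lock.
class locked_offset
{
  static constexpr std::size_t LOCK= ~(~std::size_t{0} >> 1); // top bit only
  std::atomic<std::size_t> word{0};
public:
  // Spin until the lock bit was clear before our fetch_or(), executing
  // spin_wait_delay relax operations between successive attempts.
  void lock(unsigned spin_wait_delay) noexcept
  {
    while (word.fetch_or(LOCK, std::memory_order_acquire) & LOCK)
      for (unsigned i= spin_wait_delay; i--; )
        relax_cpu();
  }
  bool is_locked() const noexcept
  { return (word.load(std::memory_order_relaxed) & LOCK) != 0; }
  // The protected offset, with the lock bit masked out.
  std::size_t get() const noexcept
  { return word.load(std::memory_order_relaxed) & ~LOCK; }
  // Publish a new offset and release the lock with a single store.
  void set_and_unlock(std::size_t offset) noexcept
  {
    assert(!(offset & LOCK));
    word.store(offset, std::memory_order_release);
  }
};
```

      Keeping the lock bit in the same word as the protected value means that
      releasing the lock and publishing the new offset is one store.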
  6. 27 Feb, 2024 1 commit
    • MDEV-33011 mariabackup --backup: FATAL ERROR: ... Can't open datafile cool_down/t3 · 96966976
      mariadb-DebarunBanerjee authored
      The root cause is that a file operation is written to the write-ahead
      log (WAL) even when the actual operation fails afterwards. This leaves
      a log entry for an operation that would always fail. I could reproduce
      both the backup error and the InnoDB recovery failure by exploiting
      this weakness.
      
      We follow the WAL protocol for file rename operations: once logged, the
      operation must eventually complete successfully, or the result is a
      major catastrophe. Right now, a failed rename is handled as a normal
      error, and that is the problem.
      
      I created a patch to address a RENAME operation to a nonexistent schema,
      where the destination schema directory is missing. The patch checks for
      the missing schema before logging, in an attempt to avoid a failure
      after the WAL log is written/flushed. I also checked that the schema
      cannot be dropped and that there cannot be any race with another rename
      to the same file. This is protected by the MDL lock in SQL today.
      
      The patch should thus be a good improvement over the current situation
      and solves the issue at hand.
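
      The "check before logging" step could look roughly like this. It is a
      sketch using std::filesystem (C++17) and a hypothetical helper name;
      the actual patch uses InnoDB's own file routines.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>

namespace fs= std::filesystem;

// Hypothetical helper: returns true if the directory that would contain
// new_path exists, so that a rename to new_path cannot fail merely because
// the destination schema directory is missing. Intended to be called
// before the rename is written to the log.
inline bool rename_target_dir_exists(const fs::path &new_path)
{
  assert(new_path.has_filename());
  std::error_code ec;   // errors are reported as "does not exist"
  return fs::is_directory(new_path.parent_path(), ec);
}
```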
  7. 20 Feb, 2024 2 commits
    • Cleanup: Remove OS_FILE_ON_ERROR_NO_EXIT · 3dd7b0a8
      Marko Mäkelä authored
      Ever since commit 412ee033
      or commit a440d6ed
      InnoDB should generally not abort when failing to open or create files.
      In Datafile::open_or_create() we had failed to set the flag
      to avoid abort() on failure, but everywhere else we were setting it.
      
      We may still call abort() via os_file_handle_error().
      
      Reviewed by: Vladislav Vaintroub
    • MDEV-33379 innodb_log_file_buffering=OFF causes corruption on bcachefs · 7f7329f0
      Marko Mäkelä authored
      Apparently, invoking fcntl(fd, F_SETFL, O_DIRECT) will lead to
      unexpected behaviour on Linux bcachefs and possibly other file systems,
      depending on the operating system version. So, let us avoid doing that,
      and instead just attempt to pass the O_DIRECT flag to open(). This should
      make us compatible with NetBSD, IBM AIX, as well as Solaris and its
      derivatives.
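
      A rough sketch of passing O_DIRECT to open() instead of setting it
      afterwards with fcntl(). The helper name and the fallback policy (retry
      without O_DIRECT only on EINVAL) are assumptions for illustration, not
      the actual os_file_create_func() logic.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: request O_DIRECT at open() time and fall back to
// buffered I/O if the file system rejects the flag. It never calls
// fcntl(fd, F_SETFL, O_DIRECT). *direct reports which mode was obtained.
inline int open_maybe_direct(const char *name, int flags, bool *direct)
{
  *direct= false;
#ifdef O_DIRECT
  int fd= open(name, flags | O_DIRECT, 0660);
  if (fd >= 0)
  {
    *direct= true;
    return fd;
  }
  if (errno != EINVAL)
    return -1;                  // a real error, such as ENOENT or EACCES
#endif
  return open(name, flags, 0660);
}
```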
      
      This fix does not change the fact that we had only implemented
      innodb_log_file_buffering=OFF on systems where we can determine the
      physical block size (typically 512 or 4096 bytes).
      Currently, those operating systems are Linux and Microsoft Windows.
      
      HAVE_FCNTL_DIRECT, os_file_set_nocache(): Remove.
      
      OS_FILE_OVERWRITE, OS_FILE_CREATE_PATH: Remove (never used parameters).
      
      os_file_log_buffered(), os_file_log_maybe_unbuffered(): Helper functions.
      
      os_file_create_simple_func(): When applicable, initially attempt to
      open files in O_DIRECT mode.
      
      os_file_create_func(): When applicable, initially attempt to
      open files in O_DIRECT mode.
      For type==OS_LOG_FILE && create_mode != OS_FILE_CREATE
      we will first invoke stat(2) on the file name to find out if the size
      is compatible with O_DIRECT. If create_mode == OS_FILE_CREATE, we will
      invoke fstat(2) on the created log file afterwards, and may close and
      reopen the file in O_DIRECT mode if applicable.
      
      create_temp_file(): Support O_DIRECT. This is only used if O_TMPFILE is
      available and innodb_disable_sort_file_cache=ON (non-default value).
      Notably, that setting never worked on Microsoft Windows.
      
      row_merge_file_create_mode(): Split from row_merge_file_create_low().
      Create a temporary file in the specified mode.
      
      Reviewed by: Vladislav Vaintroub
  8. 19 Jan, 2024 1 commit
    • MDEV-33095 innodb_flush_method=O_DIRECT creates excessive errors on Solaris · a6290a5b
      Marko Mäkelä authored
      The directio(3C) function on Solaris is supported on NFS and UFS
      while the majority of users should be on ZFS, which is a copy-on-write
      file system that implements transparent compression and therefore
      cannot support unbuffered I/O.
      
      Let us remove the call to directio() and simply treat
      innodb_flush_method=O_DIRECT in the same way as the previous
      default value innodb_flush_method=fsync on Solaris. Also, let us
      remove some dead code around calls to os_file_set_nocache() on
      platforms where fcntl(2) is not usable with O_DIRECT.
      
      On IBM AIX, O_DIRECT is not documented for fcntl(2), only for open(2).
  9. 10 Jan, 2024 1 commit
    • MDEV-33112 innodb_undo_log_truncate=ON is blocking page write · 3613fb2a
      Marko Mäkelä authored
      When innodb_undo_log_truncate=ON causes an InnoDB undo tablespace
      to be truncated, we must guarantee that the undo tablespace will
      be rebuilt atomically: After mtr_t::commit_shrink() has durably
      written the mini-transaction that rebuilds the undo tablespace,
      we must not write any old pages to the tablespace.
      
      To guarantee this, in trx_purge_truncate_history() we used to
      traverse the entire buf_pool.flush_list in order to acquire
      exclusive latches on all pages for the undo tablespace that
      reside in the buffer pool, so that those pages cannot be written
      and will be evicted during mtr_t::commit_shrink(). But, this
      traversal may interfere with the page writing activity of
      buf_flush_page_cleaner(). It would be better to lazily discard
      the old pages of the truncated undo tablespace.
      
      fil_space_t::is_being_truncated, fil_space_t::clear_stopping(): Remove.
      
      fil_space_t::create_lsn: A new field, identifying the LSN of the
      latest rebuild of a tablespace.
      
      buf_page_t::flush(), buf_flush_try_neighbors(): Evict pages whose
      FIL_PAGE_LSN is below fil_space_t::create_lsn.
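
      The lazy eviction rule can be sketched as a single comparison. The
      struct and function names below are hypothetical stand-ins for
      fil_space_t, buf_page_t and the check made when flushing.

```cpp
#include <cassert>
#include <cstdint>

using lsn_t= uint64_t;

struct space_sketch { lsn_t create_lsn; };  // LSN of the latest rebuild
struct page_sketch  { lsn_t page_lsn; };    // stand-in for FIL_PAGE_LSN

// A page that was last modified before the tablespace was rebuilt belongs
// to the old incarnation of the file: discard it instead of writing it.
inline bool must_discard(const page_sketch &page, const space_sketch &space)
{
  return page.page_lsn < space.create_lsn;
}
```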
      
      mtr_t::commit_shrink(): Update fil_space_t::create_lsn and
      fil_space_t::size right before the log is durably written and the
      tablespace file is being truncated.
      
      fsp_page_create(), trx_purge_truncate_history(): Simplify the logic.
      
      Reviewed by: Thirunarayanan Balathandayuthapani, Vladislav Lesin
      Performance tested by: Axel Schwenke
      Correctness tested by: Matthias Leich
  10. 14 Dec, 2023 1 commit
    • MDEV-32939 If tables are frequently created, renamed, dropped, a backup cannot be restored · f21a6cbf
      Marko Mäkelä authored
      During mariadb-backup --backup, a table could be renamed, created and
      dropped. We could have both oldname.ibd and oldname.new, and one of
      the files would be deleted before the InnoDB recovery starts. The desired
      end result would be that we will recover both oldname.ibd and newname.ibd.
      
      During normal crash recovery, at most one file operation (create, rename,
      delete) may need to be replayed from the write-ahead log before the
      DDL recovery starts.
      
      deferred_spaces.create(): In mariadb-backup --prepare, try to create the
      file in case it does not exist.
      
      fil_name_process(): Display a message about files that were not found not only
      if innodb_force_recovery is set, but also in mariadb-backup --prepare.
      If we are processing a FILE_RENAME for a tablespace whose recovery is
      deferred, suppress the message and adjust the file name in case
      fil_ibd_load() returns FIL_LOAD_NOT_FOUND or FIL_LOAD_DEFER.
      
      fil_ibd_load(): Remove a redundant file name comparison.
      The caller already compared that the file names are different.
      We used to wrongly return FIL_LOAD_OK instead of FIL_LOAD_ID_CHANGED
      if only the schema name differed, such as a/t1.ibd and b/t1.ibd.
      
      Tested by: Matthias Leich
      Reviewed by: Thirunarayanan Balathandayuthapani
  11. 21 Nov, 2023 1 commit
    • MDEV-32374 log_sys.lsn_lock is a performance hog · 7443ad1c
      Marko Mäkelä authored
      The log_sys.lsn_lock that was introduced in
      commit a635c406
      had better be located in the same cache line with log_sys.latch
      so that log_t::append_prepare() needs to modify only two first
      cache lines where log_sys is stored.
      
      log_t::lsn_lock: On Linux, change the type from pthread_mutex_t to
      something that may be as small as 32 bits, to pack more data members
      in the same cache line. On Microsoft Windows, CRITICAL_SECTION works
      better.
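
      To illustrate why a small lock type helps, here is a sketch of packing
      a latch and a 32-bit lock into one cache line. The 64-byte line size,
      the struct and its member layout are assumptions; the real log_t has
      more members and spans more than one cache line.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t CACHE_LINE= 64;   // assumed cache line size

struct alignas(CACHE_LINE) log_hot_fields
{
  std::atomic<uint32_t> latch;          // stand-in for log_sys.latch
  std::atomic<uint32_t> lsn_lock;       // a lock as small as 32 bits
  std::atomic<uint64_t> lsn;
  std::atomic<std::size_t> buf_free;
};

// Both locks, and the fields they cover, share the first cache line.
static_assert(offsetof(log_hot_fields, lsn_lock) / CACHE_LINE ==
              offsetof(log_hot_fields, latch) / CACHE_LINE,
              "latch and lsn_lock must share a cache line");
static_assert(sizeof(log_hot_fields) == CACHE_LINE,
              "the hot fields must fit in one cache line");
```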
      
      log_t::check_flush_or_checkpoint_: Renamed to need_checkpoint.
      There is no need to pause all writer threads in log_free_check() when
      we only need to write log_sys.buf to ib_logfile0. That will be done in
      mtr_t::commit().
      
      log_t::append_prepare_wait(): Make the member function non-static
      to simplify the call interface, and add a parameter for the LSN.
      
      log_t::append_prepare(): Invoke append_prepare_wait() at most once.
      Only set_check_for_checkpoint() if a log checkpoint needs to
      be written. If the log buffer needs to be written, we will take care
      of it ourselves later in our caller. This will reduce interference
      with log_free_check() in other threads.
      
      mtr_t::commit(): Call log_write_up_to() if needed.
      
      log_t::get_write_target(): Return a log_write_up_to() target
      to mtr_t::commit().
      
      buf_flush_ahead(): If we are in furious flushing, call
      log_sys.set_check_for_checkpoint() so that all writers will wait
      in log_free_check() until the checkpoint is done. Otherwise,
      the test innodb.insert_into_empty could occasionally report
      an error "Crash recovery is broken".
      
      log_check_margins(): Replaced by log_free_check().
      
      log_flush_margin(): Removed. This is part of mtr_t::commit()
      and other operations that write log.
      
      log_t::create(), log_t::attach(): Guarantee that buf_free < max_buf_free
      will always hold on PMEM, to satisfy an assumption of
      log_t::get_write_target().
      
      log_write_up_to(): Assert lsn!=0. Such calls are not incorrect, but it
      is cheaper to test that single unlikely condition in mtr_t::commit()
      rather than test several conditions in log_write_up_to().
      
      innodb_drop_database(), unlock_and_close_files(): Check the LSN before
      calling log_write_up_to().
      
      ha_innobase::commit_inplace_alter_table(): Remove redundant calls to
      log_write_up_to() after calling unlock_and_close_files().
      
      Reviewed by: Vladislav Vaintroub
      Stress tested by: Matthias Leich
      Performance tested by: Steve Shaw
  12. 17 Nov, 2023 1 commit
    • MDEV-32027 Opening all .ibd files on InnoDB startup can be slow · eb1f8b29
      Marko Mäkelä authored
      dict_find_max_space_id(): Return SELECT MAX(SPACE) FROM SYS_TABLES.
      
      dict_check_tablespaces_and_store_max_id(): In the normal case
      (no encryption plugin has been loaded and the change buffer is empty),
      invoke dict_find_max_space_id() and do not open any .ibd files.
      If a std::set<uint32_t> has been specified, open the files whose
      tablespace ID is mentioned. Else, open all data files that are identified
      by SYS_TABLES records.
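
      The selection logic can be sketched on an in-memory list of records.
      The record type, the function names and the precedence between the
      "normal case" and an explicit ID set are assumptions made for this
      illustration; the real code reads SYS_TABLES.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

struct sys_tables_rec { uint32_t space; const char *name; };  // hypothetical

// Equivalent of SELECT MAX(SPACE) FROM SYS_TABLES (0 if there are no rows).
inline uint32_t find_max_space_id(const std::vector<sys_tables_rec> &recs)
{
  uint32_t max_id= 0;
  for (const sys_tables_rec &r : recs)
    max_id= std::max(max_id, r.space);
  return max_id;
}

// Returns the tablespace IDs whose files would be opened:
// only the requested ones if a set is given, none in the normal case,
// and all of them otherwise.
inline std::vector<uint32_t>
spaces_to_open(const std::vector<sys_tables_rec> &recs, bool normal_case,
               const std::set<uint32_t> *wanted= nullptr)
{
  std::vector<uint32_t> ids;
  if (!wanted && normal_case)
    return ids;                         // only the maximum ID is needed
  for (const sys_tables_rec &r : recs)
    if (!wanted || wanted->count(r.space))
      ids.push_back(r.space);
  return ids;
}
```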
      
      fil_ibd_open(): Remove a call to os_file_get_last_error() that can
      report a misleading error, such as EINVAL inside my_realpath() that is
      not an actual error. This could be invoked when a data file is found
      but the FSP_SPACE_FLAGS are incorrect, such as is the case for
      table test.td in
      ./mtr --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 innodb.table_flags
      
      buf_load(): If any tablespaces could not be found, invoke
      dict_check_tablespaces_and_store_max_id() on the missing tablespaces.
      
      dict_load_tablespace(): Try to load the tablespace unless it was found
      to be futile. This fixes failures related to FTS_*.ibd files for
      FULLTEXT INDEX.
      
      btr_cur_t::search_leaf(): Prevent a crash when the tablespace
      does not exist. This was caught by the test innodb_fts.fts_concurrent_insert
      when the change to dict_load_tablespace() was not present.
      
      We modify a few tests to ensure that tables will not be loaded at startup.
      For some fault injection tests this means that the corrupted tables
      will not be loaded, because dict_load_tablespace() would perform stricter
      checks than dict_check_tablespaces_and_store_max_id().
      
      Tested by: Matthias Leich
      Reviewed by: Thirunarayanan Balathandayuthapani
  13. 04 Nov, 2023 1 commit
  14. 31 Oct, 2023 1 commit
  15. 26 Oct, 2023 1 commit
    • MDEV-31826 InnoDB may fail to recover after being killed in fil_delete_tablespace() · 39e3ca8b
      Marko Mäkelä authored
      InnoDB was violating the write-ahead-logging protocol when a file
      was being deleted, like this:
      
      1. fil_delete_tablespace() set the fil_space_t::STOPPING flag
      2. The buf_flush_page_cleaner() thread discards some changed pages for
      this tablespace and advances the log checkpoint a little.
      3. The server process is killed before fil_delete_tablespace() wrote
      a FILE_DELETE record.
      4. Recovery will try to apply log to pages of the tablespace, because
      there was no FILE_DELETE record. This will fail, because some pages
      that had been modified since the latest checkpoint had not been written
      by the page cleaner.
      
      Page writes must not be stopped before a FILE_DELETE record has been
      durably written.
      
      fil_space_t::drop(): Replaces fil_space_t::check_pending_operations().
      Add the parameter detached_handle, and return a tablespace pointer
      if this thread was the first one to stop I/O on the tablespace.
      
      mtr_t::commit_file(): Remove the parameter detached_handle, and
      move some handling to fil_space_t::drop().
      
      fil_space_t: STOPPING_READS, STOPPING_WRITES: Separate flags for STOPPING.
      We want to stop reads (and encryption) before stopping page writes.
      
      fil_space_t::is_stopping_writes(), fil_space_t::get_for_write():
      Special accessors for the write path.
      
      fil_space_t::flush_low(): Ignore the STOPPING_READS flag and only
      stop if STOPPING_WRITES is set, to avoid an infinite loop in
      fil_flush_file_spaces(), which was occasionally reproduced by
      running the test encryption.create_or_replace.
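
      The two-stage stop can be sketched with two flag bits in one atomic
      word. The class name, the bit positions and the method names are
      hypothetical; the sketch only shows the ordering: reads stop first,
      and writes stop only after the FILE_DELETE record is durable.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

class space_state
{
  static constexpr uint32_t STOPPING_READS= 1U << 31;   // assumed bit values
  static constexpr uint32_t STOPPING_WRITES= 1U << 30;
  std::atomic<uint32_t> flags{0};
public:
  // Returns true if this caller was the first one to stop reads.
  bool stop_reads() noexcept
  { return !(flags.fetch_or(STOPPING_READS) & STOPPING_READS); }
  // To be called only after the FILE_DELETE record was durably written.
  void stop_writes() noexcept
  {
    assert(is_stopping_reads());
    flags.fetch_or(STOPPING_WRITES);
  }
  bool is_stopping_reads() const noexcept
  { return (flags.load() & STOPPING_READS) != 0; }
  // The write path consults only this flag, so page writes continue
  // while only STOPPING_READS is set.
  bool is_stopping_writes() const noexcept
  { return (flags.load() & STOPPING_WRITES) != 0; }
};
```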
      
      Reviewed by: Vladislav Lesin
      Tested by: Matthias Leich
  16. 18 Oct, 2023 2 commits
    • MDEV-32511: Race condition between checkpoint and page write · cfd17881
      Marko Mäkelä authored
      fil_aio_callback(): Invoke fil_node_t::complete_write() before
      releasing any page latch, so that in case a log checkpoint is
      executed roughly concurrently with the first write into a file
      since the previous checkpoint, we will not miss a fdatasync()
      or fsync() call to make the write durable.
    • MDEV-32511 Assertion !os_aio_pending_writes() failed · bf7c6fc2
      Marko Mäkelä authored
      In MemorySanitizer builds of 10.10 and 10.11, we would rather often
      have the assertion fail in innodb_init() during mariadb-backup --prepare.
      The assertion could also fail during InnoDB startup, but less often.
      
      Before commit 685d958e in 10.8 the
      log file cleanup after a successfully applied backup is different,
      and the os_aio_pending_writes() assertion is in srv0start.cc.
      
      IORequest::write_complete(): Invoke node->complete_write() before
      releasing the page latch, so that a log checkpoint that is about to
      execute concurrently will not miss a fdatasync() or fsync() on the
      file, in case this was the first write since the last such call.
      
      create_log_file(), srv_start(): Replace the debug assertion with
      a debug check. For all intents and purposes, all writes could have
      been completed but some write_io_callback() may not have invoked
      io_slots::release() yet.
  17. 13 Oct, 2023 1 commit
  18. 12 Oct, 2023 2 commits
    • MDEV-18200 MariaBackup full backup failed with InnoDB: Failing assertion: success · 3b38c2f3
      Daniel Black authored
      There are many filesystem related errors that can occur with
      MariaBackup. These are already output to stderr with a good description of
      the error. Many of these are permission or resource (file descriptor)
      limits where the assertion and resulting core crash doesn't offer
      developers anything more than the log message. To the user, assertions
      and core crashes come across as poor error handling.
      
      As such we return an error and handle this all the way up the stack.
    • MDEV-18200 MariaBackup full backup failed with InnoDB: Failing assertion: success · c79ca7c7
      Daniel Black authored
      There are many filesystem related errors that can occur with
      MariaBackup. These are already output to stderr with a good description of
      the error. Many of these are permission or resource (file descriptor)
      limits where the assertion and resulting core crash doesn't offer
      developers anything more than the log message. To the user, assertions
      and core crashes come across as poor error handling.
      
      As such we return an error and handle this all the way up the stack.
  19. 25 Sep, 2023 1 commit
    • MDEV-18200 MariaBackup full backup failed with InnoDB: Failing assertion: success · ca66a2cb
      Daniel Black authored
      There are many filesystem related errors that can occur with
      MariaBackup. These are already output to stderr with a good description of
      the error. Many of these are permission or resource (file descriptor)
      limits where the assertion and resulting core crash doesn't offer
      developers anything more than the log message. To the user, assertions
      and core crashes come across as poor error handling.
      
      As such we return an error and handle this all the way up the stack.
  20. 01 Aug, 2023 1 commit
    • MDEV-27593: Crashing on I/O error is unhelpful · 72928e64
      Marko Mäkelä authored
      buf_page_t::write_complete(), buf_page_write_complete(),
      IORequest::write_complete(): Add a parameter for passing
      an error code. If an error occurred, we will release the
      io-fix, buffer-fix and page latch but not reset the
      oldest_modification field. The block would remain in
      buf_pool.LRU and possibly buf_pool.flush_list, to be written
      again later, by buf_flush_page_cleaner(). If all page writes
      start consistently failing, all write threads should eventually
      hang in log_free_check() because the log checkpoint cannot
      be advanced to make room in the circular write-ahead-log ib_logfile0.
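
      The write-completion rule can be sketched like this. The struct and
      function names are hypothetical; the sketch only shows that a failed
      write releases the I/O fix but leaves the page marked dirty, so that
      it will be written again later.

```cpp
#include <cassert>
#include <cstdint>

struct block_sketch
{
  uint64_t oldest_modification;   // 0 means that the page is clean
  bool io_fixed;                  // a write is in progress
};

// err is 0 on success, or an error code reported by the I/O layer.
inline void write_complete(block_sketch &block, int err)
{
  block.io_fixed= false;          // always release the I/O fix
  if (!err)
    block.oldest_modification= 0; // only a successful write cleans the page
}

inline bool needs_flush(const block_sketch &block)
{
  return block.oldest_modification != 0;
}
```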
      
      IORequest::read_complete(): Add a parameter for passing
      an error code. If a read operation fails, we report the error
      and discard the page, just like we would do if the page checksum
      was not validated or the page could not be decrypted.
      This only affects asynchronous reads, due to linear or random read-ahead
      or crash recovery. When buf_page_get_low() invokes buf_read_page(),
      that will be a synchronous read, not involving this code.
      
      This was tested by randomly injecting errors in
      write_io_callback() and read_io_callback(), like this:
      
        if (!ut_rnd_interval(100))
          cb->m_err= 42;
  21. 05 Jul, 2023 1 commit
  22. 01 Jun, 2023 1 commit
  23. 31 May, 2023 1 commit
  24. 19 May, 2023 3 commits
    • MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress · f2c17cc9
      Marko Mäkelä authored
      This is a 10.6 port of commit 2f9e2647
      from MariaDB Server 10.9 that is missing some optimization due to a
      more complex redo log format and recovery logic
      (which was simplified in commit 685d958e).
      
      The progress reporting of InnoDB crash recovery was rather intermittent.
      Nothing was reported during the single-threaded log record parsing, which
      could consume minutes when parsing a large log. During log application,
      there only was progress reporting in background threads that would be
      invoked on data page read completion.
      
      The progress reporting here will be detailed like this:
      
      InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
      InnoDB: Read redo log up to LSN=1963895808
      InnoDB: Multi-batch recovery needed at LSN 2534560930
      InnoDB: Read redo log up to LSN=3312233472
      InnoDB: Read redo log up to LSN=1599646720
      InnoDB: Read redo log up to LSN=2160831488
      InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages
      InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages
      InnoDB: Read redo log up to LSN=3195776000
      InnoDB: Read redo log up to LSN=3687099392
      InnoDB: Read redo log up to LSN=4165315584
      InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages
      InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages
      InnoDB: Read redo log up to LSN=4508724224
      InnoDB: Read redo log up to LSN=5094550528
      InnoDB: To recover: 205230 pages
      
      The previous messages "Starting a batch to recover" or
      "Starting a final batch to recover" will be replaced by
      "To recover: ... pages" messages.
      
      If a batch lasts longer than 15 seconds, then there will be
      progress reports every 15 seconds, showing the number of remaining pages.
      For the non-final batch, the "To recover:" message includes two end LSNs:
      that of the batch, and that of the recovered log. This is the primary measure
      of progress. The batch will end once the number of pages to recover
      reaches 0.
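
      The 15-second reporting interval can be sketched with a small helper
      around std::chrono::steady_clock. The class is hypothetical and takes
      the current time as a parameter so that it can be exercised without
      waiting.

```cpp
#include <cassert>
#include <chrono>

class progress_throttle
{
public:
  using clock_type= std::chrono::steady_clock;
  explicit progress_throttle(clock_type::time_point start,
                             std::chrono::seconds every=
                             std::chrono::seconds(15))
    : last(start), interval(every) {}
  // Returns true at most once per interval; the caller then prints a
  // "To recover: ... pages" style message.
  bool due(clock_type::time_point now)
  {
    if (now - last < interval)
      return false;
    last= now;
    return true;
  }
private:
  clock_type::time_point last;
  std::chrono::seconds interval;
};
```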
      
      If recovery is possible in a single batch, the output will look like this,
      with a shorter "To recover:" message that counts only the remaining pages:
      
      InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
      InnoDB: Read redo log up to LSN=1984539648
      InnoDB: Read redo log up to LSN=2710875136
      InnoDB: Read redo log up to LSN=3358895104
      InnoDB: Read redo log up to LSN=3965299712
      InnoDB: Read redo log up to LSN=4557417472
      InnoDB: Read redo log up to LSN=5219527680
      InnoDB: To recover: 450915 pages
      
      We will also speed up recovery by improving the memory management and
      implementing multi-threaded recovery of data pages that will not need
      to be read into the buffer pool ("fake read"). Log application in the
      "fake read" threads will be protected by an atomic being_recovered field
      and exclusive buf_page_t::lock.
      
      Recovery will reserve for data pages two thirds of the buffer pool,
      or 256 pages, whichever is smaller. Previously, we could only use at most
      one third of the buffer pool for buffered log records. This would typically
      mean that with large buffer pools, recovery unnecessarily consisted of
      multiple batches.
      
      If recovery runs out of memory, it will "roll back" or "rewind" the current
      mini-transaction. The recv_sys.recovered_lsn and recv_sys.pages
      will correspond to the "out of memory LSN", at the end of the previous
      complete mini-transaction.
      
      If recovery runs out of memory while executing the final recovery batch,
      we can simply invoke recv_sys.apply(false) to make room, and resume
      parsing.
      
      If recovery runs out of memory before the final batch, we will
      scan the redo log to the end and check for any missing or inconsistent
      files. In this version of the patch, we will throw away any previously
      buffered recv_sys.pages and rescan the log from the checkpoint onwards.
      
      recv_sys_t::pages_it: A cached iterator to recv_sys.pages.
      
      recv_sys_t::is_memory_exhausted(): Remove. We will have out-of-memory
      handling deep inside recv_sys_t::parse().
      
      recv_sys_t::rewind(), page_recv_t::recs_t::rewind():
      Remove all log starting with a specific LSN.
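
      A sketch of such a rewind on a simplified container. The types are
      hypothetical, and the sketch assumes that the records of each page are
      stored in ascending LSN order.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using lsn_t= uint64_t;
struct log_rec { lsn_t lsn; };
// page identifier -> buffered records, oldest first
using page_map= std::map<uint64_t, std::vector<log_rec>>;

// Remove all buffered records whose LSN is at least lsn, and drop the
// entries of pages that are left without any records.
inline void rewind(page_map &pages, lsn_t lsn)
{
  for (auto it= pages.begin(); it != pages.end(); )
  {
    std::vector<log_rec> &recs= it->second;
    while (!recs.empty() && recs.back().lsn >= lsn)
      recs.pop_back();
    if (recs.empty())
      it= pages.erase(it);
    else
      ++it;
  }
}
```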
      
      IORequest::write_complete(), IORequest::read_complete():
      Replaces fil_aio_callback().
      
      read_io_callback(), write_io_callback(): Replaces io_callback().
      
      IORequest::fake_read_complete(), fake_io_callback(), os_fake_read():
      Process a "fake read" request for concurrent recovery.
      
      recv_sys_t::apply_batch(): Choose a number of successive pages
      for a recovery batch.
      
      recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a
      page whose recovery is not in progress. Log application threads
      will not invoke this; they will only set being_recovered=-1 to indicate
      that the entry is no longer needed.
      
      recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries.
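
      The mark-then-sweep scheme can be sketched as follows. The types are
      hypothetical; only being_recovered=-1 is taken from the text, the
      meaning of the other values is an assumption, and the field is a plain
      integer here although it is described as atomic above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

struct page_recv_sketch
{
  // -1: no longer needed; other values (assumed): 0 = pending, 1 = in progress
  int8_t being_recovered;
};

using recv_map= std::map<uint64_t, page_recv_sketch>;

// Erase all entries that were marked as no longer needed.
// Returns the number of erased entries.
inline std::size_t garbage_collect(recv_map &pages)
{
  std::size_t erased= 0;
  for (auto it= pages.begin(); it != pages.end(); )
    if (it->second.being_recovered < 0)
    {
      it= pages.erase(it);
      ++erased;
    }
    else
      ++it;
  return erased;
}
```

      Letting worker threads only mark entries, and erasing them in one
      place, keeps iterators held by other threads valid during a batch.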
      
      recv_sys_t::wait_for_pool(): Wait for some space to become available
      in the buffer pool.
      
      mlog_init_t::mark_ibuf_exist(): Avoid calls to
      recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low().
      Such calls would lead to double locking of recv_sys.mutex, which
      depending on implementation could cause a deadlock. We will use
      lower-level calls to look up index pages.
      
      buf_LRU_block_remove_hashed(): Disable consistency checks for freed
      ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage.
      This fixes an occasional failure of the test
      innodb.innodb_bulk_create_index_debug.
      
      Tested by: Matthias Leich
    • MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress · 2f9e2647
      Marko Mäkelä authored
      The progress reporting of InnoDB crash recovery was rather intermittent.
      Nothing was reported during the single-threaded log record parsing, which
      could consume minutes when parsing a large log. During log application,
      there only was progress reporting in background threads that would be
      invoked on data page read completion.
      
      The progress reporting here will be detailed like this:
      
      InnoDB: Starting crash recovery from checkpoint LSN=503549688
      InnoDB: Parsed redo log up to LSN=1990840177; to recover: 124806 pages
      InnoDB: Parsed redo log up to LSN=2729777071; to recover: 186123 pages
      InnoDB: Parsed redo log up to LSN=3488599173; to recover: 248397 pages
      InnoDB: Parsed redo log up to LSN=4177856618; to recover: 306469 pages
      InnoDB: Multi-batch recovery needed at LSN 4189599815
      InnoDB: End of log at LSN=4483551634
      InnoDB: To recover: LSN 4189599815/4483551634; 307490 pages
      InnoDB: To recover: LSN 4189599815/4483551634; 197159 pages
      InnoDB: To recover: LSN 4189599815/4483551634; 67623 pages
      InnoDB: Parsed redo log up to LSN=4353924218; to recover: 102083 pages
      ...
      InnoDB: log sequence number 4483551634 ...
      
      The previous messages "Starting a batch to recover" or
      "Starting a final batch to recover" will be replaced by
      "To recover: ... pages" messages.
      
      If a batch lasts longer than 15 seconds, then there will be
      progress reports every 15 seconds, showing the number of remaining pages.
      For the non-final batch, the "To recover:" message includes two end LSNs:
      that of the batch, and that of the recovered log. This is the primary measure
      of progress. The batch will end once the number of pages to recover
      reaches 0.
      
      If recovery is possible in a single batch, the output will look like this,
      with a shorter "To recover:" message that counts only the remaining pages:
      
      InnoDB: Starting crash recovery from checkpoint LSN=503549688
      InnoDB: Parsed redo log up to LSN=1998701027; to recover: 125560 pages
      InnoDB: Parsed redo log up to LSN=2734136874; to recover: 186446 pages
      InnoDB: Parsed redo log up to LSN=3499505504; to recover: 249378 pages
      InnoDB: Parsed redo log up to LSN=4183247844; to recover: 306964 pages
      InnoDB: End of log at LSN=4483551634
      ...
      InnoDB: To recover: 331797 pages
      ...
      InnoDB: log sequence number 4483551634 ...
      
      We will also speed up recovery by improving the memory management and
      implementing multi-threaded recovery of data pages that will not need
      to be read into the buffer pool ("fake read"). Log application in the
      "fake read" threads will be protected by an atomic being_recovered field
      and exclusive buf_page_t::latch.
      
      Recovery will reserve for data pages two thirds of the buffer pool,
      or 256 pages, whichever is smaller. Previously, we could only use at most
      one third of the buffer pool for buffered log records. This would typically
      mean that with large buffer pools, recovery unnecessarily consisted of
      multiple batches.
      
      If recovery runs out of memory, it will "roll back" or "rewind" the current
      mini-transaction. The recv_sys.lsn and recv_sys.pages will correspond
      to the "out of memory LSN", at the end of the previous complete
      mini-transaction.
      
      If recovery runs out of memory while executing the final recovery batch,
      we can simply invoke recv_sys.apply(false) to make room, and resume
      parsing.
      
      If recovery runs out of memory before the final batch, we will scan
      the redo log to the end (recv_sys.scanned_lsn) and check for any missing
      or inconsistent files. If recv_init_crash_recovery_spaces() does not
      report any potentially missing tablespaces, we can make use of the
      already stored recv_sys.pages and only rewind to the "out of memory LSN".
      Else, we must keep parsing and invoking recv_validate_tablespace()
      until an error has been found or everything has been resolved, and
      ultimately rewind to the checkpoint LSN.
      
      recv_sys_t::pages_it: A cached iterator to recv_sys.pages
      
      recv_sys_t::parse_mtr(): Remove an ATTRIBUTE_NOINLINE that would
      prevent tail call optimization in recv_sys_t::parse_pmem().
      
      recv_sys_t::parse(), recv_sys_t::parse_mtr(), recv_sys_t::parse_pmem():
      Add template<bool store> parameter. Redo log record parsing
      (store=false) is better specialized from store=true
      (with bool if_exists) so that we can avoid some conditional branches
      in frequently invoked low-level code.
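
The effect of a template<bool store> parameter can be illustrated with a toy parser. MiniParser and its members are hypothetical; only the idea of selecting "parse only" versus "parse and store" at compile time is taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of a parser with a compile-time "store" switch.
struct MiniParser {
  std::vector<int> stored;  // records kept when store=true
  std::size_t parsed = 0;   // records seen in either mode

  template <bool store>
  void parse(const std::vector<int>& records) {
    for (int r : records) {
      ++parsed;
      // `store` is a compile-time constant, so the compiler can drop this
      // branch entirely from the parse<false> instantiation.
      if (store) stored.push_back(r);
    }
  }
};
```

Each instantiation is a separate function, so the hot loop of the parse-only variant carries no run-time test of the flag.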
      
      recv_sys_t::is_memory_exhausted(): Remove. The special parse() status
      GOT_OOM will report an out-of-memory situation at the low level.
      
      recv_sys_t::rewind(), page_recv_t::recs_t::rewind():
      Remove all log records starting with a specific LSN.
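
A minimal sketch of such a rewind, assuming buffered records are kept per page in ascending LSN order. The types Rec and PageMap and the function rewind_from() are hypothetical stand-ins, not the actual recv_sys data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

typedef std::uint64_t lsn_t;

struct Rec {
  lsn_t lsn;  // start LSN of the buffered record
};

// page number -> records in ascending LSN order (assumption of this sketch)
typedef std::map<std::uint32_t, std::vector<Rec>> PageMap;

// Discards every record whose LSN is >= from_lsn; pages left without any
// records are removed from the map.
void rewind_from(PageMap& pages, lsn_t from_lsn) {
  for (PageMap::iterator it = pages.begin(); it != pages.end();) {
    std::vector<Rec>& recs = it->second;
    std::vector<Rec>::iterator cut = std::lower_bound(
        recs.begin(), recs.end(), from_lsn,
        [](const Rec& r, lsn_t l) { return r.lsn < l; });
    recs.erase(cut, recs.end());
    if (recs.empty())
      it = pages.erase(it);
    else
      ++it;
  }
}
```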
      
      recv_scan_log(): Separate some code for only parsing, not storing log.
      In rewound_lsn, remember the LSN at which last_phase=false recovery
      ran out of memory. This is where the next call to recv_scan_log()
      will resume storing the log. This replaces recv_sys.last_stored_lsn.
      
      recv_sys_t::parse(): Evaluate the template parameter store in a few more
      cases, to allow dead code to be eliminated at compile time.
      
      recv_sys_t::scanned_lsn: The end of the log found by recv_scan_log().
      The special value 1 means that recv_sys has been initialized but
      no log has been parsed.
      
      IORequest::write_complete(), IORequest::read_complete():
      Replaces fil_aio_callback().
      
      read_io_callback(), write_io_callback(): Replaces io_callback().
      
      IORequest::fake_read_complete(), fake_io_callback(), os_fake_read():
      Process a "fake read" request for concurrent recovery.
      
      recv_sys_t::apply_batch(): Choose a number of successive pages
      for a recovery batch.
      
      recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a
      page whose recovery is not in progress. Log application threads
      will not invoke this; they will only set being_recovered=-1 to indicate
      that the entry is no longer needed.
      
      recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries.
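
The division of labour between erase() and garbage_collect() can be sketched as "mark in the worker, erase in the owner". PageEntry, EntryMap and mark_done() are hypothetical; only the being_recovered=-1 convention comes from the text above, and the meaning assigned to the values 0 and 1 is an assumption of this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical per-page entry.
struct PageEntry {
  // Assumed encoding: 0 = idle, 1 = being recovered, -1 = no longer needed
  std::atomic<int> being_recovered{0};
};

typedef std::map<std::uint32_t, PageEntry> EntryMap;

// Log application threads never erase from the map; they only mark.
void mark_done(PageEntry& e) {
  e.being_recovered.store(-1, std::memory_order_release);
}

// The thread that owns the map later removes all marked entries.
std::size_t garbage_collect(EntryMap& m) {
  std::size_t removed = 0;
  for (EntryMap::iterator it = m.begin(); it != m.end();) {
    if (it->second.being_recovered.load(std::memory_order_acquire) == -1) {
      it = m.erase(it);
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```

Keeping structural changes to the map in a single thread means the workers need no lock on the container itself.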
      
      recv_sys_t::wait_for_pool(): Wait for some space to become available
      in the buffer pool.
      
      mlog_init_t::mark_ibuf_exist(): Avoid calls to
      recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low().
      Such calls would lead to double locking of recv_sys.mutex, which
      depending on implementation could cause a deadlock. We will use
      lower-level calls to look up index pages.
      
      buf_LRU_block_remove_hashed(): Disable consistency checks for freed
      ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage.
      This fixes an occasional failure of the test
      innodb.innodb_bulk_create_index_debug.
      
      Tested by: Matthias Leich
      2f9e2647
    • Vlad Lesin's avatar
      MDEV-31256 fil_node_open_file() releases fil_system.mutex allowing other... · 54227847
      Vlad Lesin authored
      MDEV-31256 fil_node_open_file() releases fil_system.mutex allowing other thread to open its file node
      
      There is a window between the mutex_exit(&fil_system.mutex) and
      mutex_enter(&fil_system.mutex) calls in fil_node_open_file(). During this
      window another thread can open the node, and the ut_ad(!node->is_open())
      assertion in fil_node_open_file_low() can fail.
      
      The fix is not to open the node if it was already opened by another thread.
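
The pattern behind this fix is to re-check the state after re-acquiring a mutex that was temporarily released. The sketch below is hypothetical (Node, open_node() and the slow_step callback are not InnoDB names); the callback stands in for whatever happens while the mutex is not held.

```cpp
#include <cassert>
#include <mutex>

struct Node {
  bool open = false;
  int open_count = 0;  // how many times the (simulated) file was opened
};

// The caller holds the mutex through `lock`. It is released around
// slow_step, so the node state must be checked again afterwards.
// Returns true if this call opened the node.
template <typename SlowStep>
bool open_node(Node& node, std::unique_lock<std::mutex>& lock,
               SlowStep slow_step) {
  if (node.open) return false;
  lock.unlock();
  slow_step();  // another thread may open the node during this window
  lock.lock();
  if (node.open) return false;  // re-check: someone else opened it
  node.open = true;
  ++node.open_count;
  return true;
}
```

Without the second check, the node would be opened twice whenever another thread wins the race inside the window.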
      54227847
  25. 26 Apr, 2023 1 commit
    • Marko Mäkelä's avatar
      MDEV-31132 Deadlock between DDL and purge of InnoDB history · 5740638c
      Marko Mäkelä authored
      log_free_check(): Assert that the caller must not hold
      exclusive lock_sys.latch. This was the case for calls from
      ibuf_delete_for_discarded_space(). This caused a deadlock with
      another thread that would be holding a latch on a dirty page
      that would need to be written so that the checkpoint would advance
      and log_free_check() could return. That other thread was waiting
      for a shared lock_sys.latch.
      
      fil_delete_tablespace(): Do not invoke ibuf_delete_for_discarded_space()
      because in DDL operations, we will be holding exclusive lock_sys.latch.
      
      trx_t::commit(std::vector<pfs_os_file_t>&), innodb_drop_database(),
      row_purge_remove_clust_if_poss_low(), row_undo_ins_remove_clust_rec(),
      row_discard_tablespace_for_mysql():
      Invoke ibuf_delete_for_discarded_space() on the deleted tablespaces after
      releasing all latches.
      5740638c
  26. 19 Apr, 2023 1 commit
    • Marko Mäkelä's avatar
      MDEV-31080 fil_validate() failures during deferred tablespace recovery · 0cda0e4e
      Marko Mäkelä authored
      fil_space_t::create(), fil_space_t::add(): Expect the caller to
      acquire and release fil_system.mutex. In this way, creating a tablespace
      and adding the first (usually only) data file will be atomic.
      
      recv_sys_t::recover_deferred(): Correctly protect some changes by
      holding fil_system.mutex.
      
      Tested by: Matthias Leich
      0cda0e4e
  27. 14 Apr, 2023 2 commits
    • Vlad Lesin's avatar
      MDEV-31049 fil_delete_tablespace() returns wrong file handle if tablespace was... · 71f16c83
      Vlad Lesin authored
      MDEV-31049 fil_delete_tablespace() returns wrong file handle if tablespace was closed by parallel thread
      
      fil_delete_tablespace() stores the file handle in a local variable and calls
      mtr_t::commit_file()=>fil_system_t::detach(..., detach_handle=true), which
      sets space->chain.start->handle = OS_FILE_CLOSED. fil_system_t::detach()
      is invoked under fil_system.mutex.
      
      But before the mutex is acquired, some parallel thread can change
      space->chain.start->handle. fil_delete_tablespace() then returns the value
      stored in the local variable, which is the wrong value.
      
      The file handle can be closed, for example, from buf_flush_space() when the
      limit of innodb_open_files is exceeded and fil_space_t::get() causes a
      fil_space_t::try_to_close() call.
      
      fil_space_t::try_to_close() is executed under fil_system.mutex, and
      mtr_t::commit_file() locks it for the fil_system_t::detach() call.
      fil_system_t::detach() returns the detached file handle if its argument
      detach_handle is true. The fix is to let mtr_t::commit_file() pass
      that detached file handle to fil_delete_tablespace().
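
The difference between reading the handle early and taking it under the mutex can be shown with a reduced model. Space, FILE_CLOSED and the member functions below are hypothetical stand-ins for the real fil_system structures.

```cpp
#include <cassert>
#include <mutex>

typedef int file_handle;
const file_handle FILE_CLOSED = -1;  // stand-in for OS_FILE_CLOSED

struct Space {
  std::mutex mutex;
  file_handle handle = FILE_CLOSED;

  // Reads and clears the handle in one critical section, so the returned
  // value is exactly what was detached.
  file_handle detach() {
    std::lock_guard<std::mutex> g(mutex);
    file_handle h = handle;
    handle = FILE_CLOSED;
    return h;
  }

  // Models another thread closing the file, for example because the
  // open-files limit was exceeded.
  void close_by_other_thread() {
    std::lock_guard<std::mutex> g(mutex);
    handle = FILE_CLOSED;
  }
};
```

A copy of the handle taken before the critical section can be stale by the time detach() runs; the value returned by detach() cannot.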
      71f16c83
    • Vlad Lesin's avatar
      MDEV-30775 Performance regression in fil_space_t::try_to_close() introduced in MDEV-23855 · 0cca8166
      Vlad Lesin authored
      Post-push fix.
      
      The 10.5 fix for MDEV-30775 inserts a just-opened tablespace right after
      the element that fil_system.space_list_last_opened points to.
      
      In MDEV-25223 fil_system_t::space_list was changed from UT_LIST to
      ilist. ilist<...>::insert(iterator pos, reference value) inserts the
      element into the list before pos.
      
      But this was not taken into account during the 10.5->10.6 merge in
      85cbfaef, and the fix
      did not work properly, i.e. it inserted the just-opened tablespace at the
      position preceding fil_system.space_list_last_opened.
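
std::list::insert() follows the same "insert before pos" convention, so it can be used to illustrate the pitfall. The helper insert_after() is a hypothetical name; the sketch uses std::list rather than the ilist class mentioned above.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

typedef std::list<std::string> SpaceList;

// insert(pos, v) places v *before* pos. To place it after pos, the
// iterator has to be advanced first.
SpaceList::iterator insert_after(SpaceList& l, SpaceList::iterator pos,
                                 const std::string& v) {
  return l.insert(std::next(pos), v);
}
```

Calling l.insert(pos, v) directly with an iterator to the "last opened" element would place the new element in front of it, which is the behaviour this post-push fix corrects.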
      0cca8166
  28. 12 Apr, 2023 1 commit
    • Thirunarayanan Balathandayuthapani's avatar
      MDEV-29273 Race condition between drop table and closing of table · 2ddfb838
      - This issue is caused by a race condition between the drop thread
      and fil_encrypt_thread. fil_encrypt_thread closes
      the tablespace if the number of opened files
      exceeds innodb_open_files. fil_node_open_file()
      closes tablespaces that are open and
      have no pending operations. At that time, InnoDB drop tries
      to write the redo log for the file delete operation,
      and it fails with a bad file descriptor error.
      
      - When trying to close the file, InnoDB should check
      whether the table is going to be dropped.
      2ddfb838
  29. 28 Mar, 2023 1 commit
    • Marko Mäkelä's avatar
      MDEV-30936 fixup · 402f36dd
      Marko Mäkelä authored
      fil_space_t::~fil_space_t(): Invoke ut_free(name) because
      doing so in the callers would trip MSAN_OPTIONS=poison_in_dtor=1
      402f36dd
  30. 27 Mar, 2023 1 commit
    • Vlad Lesin's avatar
      MDEV-29050 mariabackup issues error messages during InnoDB tablespaces export... · 4c226c18
      Vlad Lesin authored
      MDEV-29050 mariabackup issues error messages during InnoDB tablespaces export on partial backup preparing
      
      The solution is to suppress error messages for missing tablespaces if
      mariabackup is launched with "--prepare --export" options.
      
      "mariabackup --prepare --export" invokes itself with the --mysqld parameter.
      If the parameter is set, then it starts the server to feed "FLUSH TABLES ...
      FOR EXPORT;" queries for exported tablespaces. This is a "normal" server
      start, which is why a new srv_operation value is introduced.
      
      Reviewed by Marko Makela.
      4c226c18
  31. 10 Mar, 2023 1 commit
    • Vlad Lesin's avatar
      MDEV-30775 Performance regression in fil_space_t::try_to_close() introduced in MDEV-23855 · 7d6b3d40
      Vlad Lesin authored
      fil_node_open_file_low() tries to close files from the top of
      fil_system.space_list if the number of opened files is exceeded.
      
      It invokes fil_space_t::try_to_close(), which iterates the list searching
      for the first opened space. Then it just closes the space, leaving it in
      the same position in fil_system.space_list.
      
      When many files are being opened, such as during 'SHOW TABLE STATUS ...'
      execution, and the limit on the number of opened files is reached,
      fil_space_t::try_to_close() iterates over more and more closed spaces before
      reaching any opened space on each fil_node_open_file_low() call. This
      causes a performance regression if the number of spaces is large enough.
      
      The fix is to keep opened spaces at the top of fil_system.space_list,
      and move closed spaces to the end of the list.
      
      For this purpose the fil_space_t::space_list_last_opened pointer is
      introduced. It points to the last inserted opened space in
      fil_space_t::space_list. When a space is opened, it is inserted at the
      position just after the one the pointer points to in
      fil_space_t::space_list, to preserve the logic introduced in MDEV-23855.
      Any closed space is added to the end of fil_space_t::space_list.
      
      As opened spaces are located at the top of fil_space_t::space_list,
      fil_space_t::try_to_close() finds an opened space faster.
      
      Opened and closed spaces can still be mixed in
      fil_space_t::space_list if fil_system.freeze_space_list was set during
      fil_node_open_file_low() execution. But this should not cause any error,
      as fil_space_t::try_to_close() still iterates over the spaces in the list.
      
      There is no need for a test case for the fix, as it does not change any
      functionality, but only fixes a performance regression.
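
The list discipline described in this commit can be sketched with std::list::splice(). This is a simplified model, not the InnoDB code: Space and SpaceList are hypothetical, and the sketch assumes that opened spaces always form a contiguous run at the front of the list.

```cpp
#include <cassert>
#include <iterator>
#include <list>

struct Space {
  int id;
  bool open;
};

struct SpaceList {
  typedef std::list<Space>::iterator iterator;
  std::list<Space> list;
  iterator last_opened = list.end();  // end() means "no opened space yet"

  // Moves a newly opened space right after the last opened one.
  void mark_opened(iterator it) {
    it->open = true;
    iterator pos =
        last_opened == list.end() ? list.begin() : std::next(last_opened);
    list.splice(pos, list, it);  // splice inserts before pos
    last_opened = it;
  }

  // Moves a closed space to the end of the list.
  void mark_closed(iterator it) {
    it->open = false;
    if (it == last_opened)
      last_opened = it == list.begin() ? list.end() : std::prev(it);
    list.splice(list.end(), list, it);
  }

  // Closes the first opened space; returns its id, or -1 if none is open.
  int try_to_close() {
    for (iterator it = list.begin(); it != list.end(); ++it)
      if (it->open) {
        int id = it->id;
        mark_closed(it);
        return id;
      }
    return -1;
  }
};
```

Because opened spaces stay at the front, try_to_close() normally stops at the first element instead of walking past a long run of closed spaces.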
      7d6b3d40
  32. 24 Jan, 2023 1 commit
    • Marko Mäkelä's avatar
      MDEV-30400 Assertion height == btr_page_get_level(...) on INSERT · de4030e4
      Marko Mäkelä authored
      This also fixes part of MDEV-29835 Partial server freeze
      which is caused by violations of the latching order that was
      defined in https://dev.mysql.com/worklog/task/?id=6326
      (WL#6326: InnoDB: fix index->lock contention). Unless the
      current thread is holding an exclusive dict_index_t::lock,
      it must acquire page latches in a strict parent-to-child,
      left-to-right order. Not all cases of MDEV-29835 are fixed yet.
      Failure to follow the correct latching order will cause deadlocks
      of threads due to lock order inversion.
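
The ordering rule can be stated as a predicate over a sequence of latch acquisitions. The sketch below is purely illustrative: PageRef, its numeric position field and follows_latching_order() are hypothetical, and real B-tree pages are linked by sibling pointers rather than numbered left to right.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a page: its level in the tree (0 = leaf)
// and an assumed left-to-right position within that level.
struct PageRef {
  unsigned level;
  unsigned position;
};

// Returns true if every acquisition moves either down to a lower level
// (parent to child) or to the right within the same level.
bool follows_latching_order(const std::vector<PageRef>& seq) {
  for (std::size_t i = 1; i < seq.size(); ++i) {
    const PageRef& prev = seq[i - 1];
    const PageRef& cur = seq[i];
    const bool child = cur.level < prev.level;
    const bool right_sibling =
        cur.level == prev.level && cur.position > prev.position;
    if (!child && !right_sibling) return false;
  }
  return true;
}
```

If all threads respect one such total order, no two of them can wait for each other's page latches in a cycle.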
      
      As part of these changes, the BTR_MODIFY_TREE mode is modified
      so that an Update latch (U a.k.a. SX) will be acquired on the
      root page, and eXclusive latches (X) will be acquired on all pages
      leading to the leaf page, as well as any left and right siblings
      of the pages along the path. The DEBUG_SYNC test innodb.innodb_wl6326
      will be removed, because at the time the DEBUG_SYNC point is hit,
      the thread is actually holding several page latches that will be
      blocking a concurrent SELECT statement.
      
      We also remove double bookkeeping that was caused by excessive
      information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo
      store information of latched pages, and ensure that
      mtr_memo_slot_t::object is never a null pointer.
      The tree_blocks[] and tree_savepoints[] were redundant.
      
      buf_page_get_low(): If innodb_change_buffering_debug=1, to avoid
      a hang, do not try to evict blocks if we are holding a latch on
      a modified page. The test innodb.innodb-change-buffer-recovery
      will be removed, because change buffering may no longer be forced
      by debug injection when the change buffer comprises multiple pages.
      Remove a debug assertion that could fail when
      innodb_change_buffering_debug=1 fails to evict a page.
      For other cases, the assertion is redundant, because we already
      checked that right after the got_block: label. The test
      innodb.innodb-change-buffering-recovery will be removed, because
      due to this change, we will be unable to evict the desired page.
      
      mtr_t::lock_register(): Register a change of a page latch
      on an unmodified buffer-fixed block.
      
      mtr_t::x_latch_at_savepoint(), mtr_t::sx_latch_at_savepoint():
      Replaced by the use of mtr_t::upgrade_buffer_fix(), which now
      also handles RW_S_LATCH.
      
      mtr_t::set_modified(): For temporary tables, invoke
      buf_page_t::set_modified() here and not in mtr_t::commit().
      We will never set the MTR_MEMO_MODIFY flag on other than
      persistent data pages, nor set mtr_t::m_modifications when
      temporary data pages are modified.
      
      mtr_t::commit(): Only invoke the buf_flush_note_modification() loop
      if persistent data pages were modified.
      
      mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo.
      This avoids many redundant entries in mtr_t::m_memo, as well as
      redundant calls to buf_page_get_gen() for blocks that had already
      been looked up in a mini-transaction.
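
Looking up an already latched page in a per-mini-transaction memo can be sketched as follows. MiniMtr, Block, MemoSlot and the LatchType values are hypothetical simplifications; the real mtr_t::m_memo stores more state than this.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Block {
  std::uint64_t page_id;
};

enum LatchType { S_LATCH = 1, X_LATCH = 2, SX_LATCH = 4 };

struct MemoSlot {
  Block* block;  // never null in this sketch
  LatchType type;
};

class MiniMtr {
public:
  // Records that this mini-transaction holds a latch on the block.
  void memo_push(Block* b, LatchType t) { memo_.push_back(MemoSlot{b, t}); }

  // Returns the block if it is already latched in the requested mode,
  // searching the most recently added slots first; otherwise nullptr.
  Block* get_already_latched(std::uint64_t page_id, LatchType t) const {
    for (std::vector<MemoSlot>::const_reverse_iterator it = memo_.rbegin();
         it != memo_.rend(); ++it)
      if (it->block->page_id == page_id && it->type == t) return it->block;
    return nullptr;
  }

private:
  std::vector<MemoSlot> memo_;
};
```

A hit avoids both a second lookup in the buffer pool and a duplicate memo entry for the same page.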
      
      btr_get_latched_root(): Return a pointer to an already latched root page.
      This replaces btr_root_block_get() in cases where the mini-transaction
      has already latched the root page.
      
      btr_page_get_parent(): Fetch a parent page that was already latched
      in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched().
      If needed, upgrade the root page U latch to X.
      This avoids bloating mtr_t::m_memo as well as performing redundant
      buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for
      B-tree defragmentation, we will invoke btr_cur_search_to_nth_level().
      
      btr_cur_search_to_nth_level(): This will only be used for non-leaf
      (level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE
      or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be
      removed altogether, or retained for the case of
      CHECK TABLE without QUICK.
      
      btr_cur_t::left_block: Remove. btr_pcur_move_backward_from_page()
      can retrieve the left sibling from the end of mtr_t::m_memo.
      
      btr_cur_t::open_leaf(): Some clean-up.
      
      btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
      for searches to level=0 (the leaf level). We will never release
      parent page latches before acquiring leaf page latches. If we need to
      temporarily release the level=1 page latch in the BTR_SEARCH_PREV or
      BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the
      child node pointer so that we will land on the correct leaf page.
      
      btr_cur_t::pessimistic_search_leaf(): Implement new BTR_MODIFY_TREE
      latching logic in the case that page splits or merges will be needed.
      The parent pages (and their siblings) should already be latched on
      the first dive to the leaf and be present in mtr_t::m_memo; there
      should be no need for BTR_CONT_MODIFY_TREE. This pre-latching almost
      suffices; it must be revised in MDEV-29835 and work-arounds removed
      for cases where mtr_t::get_already_latched() fails to find a block.
      
      rtr_search_to_nth_level(): A SPATIAL INDEX version of
      btr_cur_search_to_nth_level() that can search to any level
      (including the leaf level).
      
      rtr_search_leaf(), rtr_insert_leaf(): Wrappers for
      rtr_search_to_nth_level().
      
      rtr_search(): Replaces rtr_pcur_open().
      
      rtr_latch_leaves(): Replaces btr_cur_latch_leaves(). Note that unlike
      in the B-tree code, there is no error handling in case the sibling
      pages are corrupted.
      
      rtr_cur_restore_position(): Remove an unused constant parameter.
      
      btr_pcur_open_on_user_rec(): Remove the constant parameter
      mode=PAGE_CUR_GE.
      
      row_ins_clust_index_entry_low(): Use a new
      mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page
      when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC.
      
      BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove.
      
      BTR_CONT_MODIFY_TREE: Note that this is only used by
      rtr_search_to_nth_level().
      
      btr_pcur_optimistic_latch_leaves(): Replaces
      btr_cur_optimistic_latch_leaves().
      
      ibuf_delete_rec(): Acquire exclusive ibuf.index->lock in order
      to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV).
      
      btr_blob_log_check_t(): Acquire a U latch on the root page,
      so that btr_page_alloc() in btr_store_big_rec_extern_fields()
      will avoid a deadlock.
      
      btr_store_big_rec_extern_fields(): Assert that the root page latch
      is being held.
      
      Tested by: Matthias Leich
      Reviewed by: Vladislav Lesin
      de4030e4
  33. 30 Nov, 2022 2 commits