An error occurred fetching the project authors.
- 27 Sep, 2024 1 commit
-
-
Marko Mäkelä authored
In some places, there were redundant comparisons against TRX_SYS_SPACE or SRV_TMP_SPACE_ID. The temporary tablespace is never the subject of log-based recovery.
-
- 14 Sep, 2024 1 commit
-
-
Marko Mäkelä authored
GCC 12.2.0 could issue -Wnonnull for an unreachable call to strlen(new_path). Let us prevent that by replacing the condition (type == FILE_RENAME) with the equivalent (new_path). This should also optimize the generated code, because the life time of the parameter "type" will be reduced.
-
- 07 Jun, 2024 1 commit
-
-
Thirunarayanan Balathandayuthapani authored
number of non-user tablespace. fil_space_t::try_to_close(): Don't try to close the tablespace which is acquired by the caller of the function Added the suppression message in open_files_limit test case
-
- 26 Mar, 2024 1 commit
-
-
Thirunarayanan Balathandayuthapani authored
- Background encryption threads wait for stop flag to exit early from the tablespace. Alter operation fails to set the stop flag before waiting for the encryption thread to stop using the tablespace.
-
- 22 Mar, 2024 1 commit
-
-
Marko Mäkelä authored
The log_sys.lsn_lock is a very contended resource with a small critical section in log_sys.append_prepare(). On many processor microarchitectures, replacing the system call based log_sys.lsn_lock with a pure spin lock would fare worse during high concurrency workloads, wasting a significant amount of CPU cycles in the spin loop. On other microarchitectures, we would see a significant amount of time being spent in native_queued_spin_lock_slowpath() in the Linux kernel, plus context switching between user and kernel address space. This was pointed out by Steve Shaw from Intel Corporation. Depending on the workload and the hardware implementation, it may be useful to use a pure spin lock in log_sys.append_prepare(). We will introduce a parameter. The statement SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=50; would enable a spin lock that will execute that many MY_RELAX_CPU() operations (such as the x86 PAUSE instruction) between successive attempts of acquiring the spin lock. The use of a system call based log_sys.lsn_lock (which is the default setting) can be enabled by SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=0; This patch will also introduce #ifdef LOG_LATCH_DEBUG (part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON) for more accurate tracking of log_sys.latch ownership and reorganize the fields of log_sys to improve the locality of reference and to reduce the chances of false sharing. When a spin lock is being used, it will be maintained in the most significant bit of log_sys.buf_free. This is useful, because that is one of the fields that is covered by the lock. For IA-32 or AMD64, we implement the spin lock specially via log_t::lsn_lock_bts(), employing the i386 LOCK BTS instruction. A straightforward std::atomic::fetch_or() would translate into an inefficient loop around LOCK CMPXCHG. mtr_t::spin_wait_delay: The value of innodb_log_spin_wait_delay. mtr_t::finisher: Pointer to the currently used mtr_t::finish_write() implementation. This allows to avoid introducing conditional branches. We no longer invoke log_sys.is_pmem() at the mini-transaction level, but we would do that in log_write_up_to(). mtr_t::finisher_update(): Update finisher when spin_wait_delay is changed from or to 0 (the spin lock is changed to log_sys.lsn_lock or vice versa).
-
- 27 Feb, 2024 1 commit
-
-
mariadb-DebarunBanerjee authored
The root cause is the WAL logging of file operation when the actual operation fails afterwards. It creates a situation with a log entry for a operation that would always fail. I could simulate both the backup scenario error and Innodb recovery failure exploiting the weakness. We are following WAL for file rename operation and once logged the operation must eventually complete successfully, or it is a major catastrophe. Right now, we fail for rename and handle it as normal error and it is the problem. I created a patch to address RENAME operation to a non existing schema where the destination schema directory is missing. The patch checks for the missing schema before logging in an attempt to avoid the failure after WAL log is written/flushed. I also checked that the schema cannot be dropped or there cannot be any race with other rename to the same file. This is protected by the MDL lock in SQL today. The patch should this be a good improvement over the current situation and solves the issue at hand.
-
- 20 Feb, 2024 2 commits
-
-
Marko Mäkelä authored
Ever since commit 412ee033 or commit a440d6ed InnoDB should generally not abort when failing to open or create files. In Datafile::open_or_create() we had failed to set the flag to avoid abort() on failure, but everywhere else we were setting it. We may still call abort() via os_file_handle_error(). Reviewed by: Vladislav Vaintroub
-
Marko Mäkelä authored
Apparently, invoking fcntl(fd, F_SETFL, O_DIRECT) will lead to unexpected behaviour on Linux bcachefs and possibly other file systems, depending on the operating system version. So, let us avoid doing that, and instead just attempt to pass the O_DIRECT flag to open(). This should make us compatible with NetBSD, IBM AIX, as well as Solaris and its derivatives. This fix does not change the fact that we had only implemented innodb_log_file_buffering=OFF on systems where we can determine the physical block size (typically 512 or 4096 bytes). Currently, those operating systems are Linux and Microsoft Windows. HAVE_FCNTL_DIRECT, os_file_set_nocache(): Remove. OS_FILE_OVERWRITE, OS_FILE_CREATE_PATH: Remove (never used parameters). os_file_log_buffered(), os_file_log_maybe_unbuffered(): Helper functions. os_file_create_simple_func(): When applicable, initially attempt to open files in O_DIRECT mode. os_file_create_func(): When applicable, initially attempt to open files in O_DIRECT mode. For type==OS_LOG_FILE && create_mode != OS_FILE_CREATE we will first invoke stat(2) on the file name to find out if the size is compatible with O_DIRECT. If create_mode == OS_FILE_CREATE, we will invoke fstat(2) on the created log file afterwards, and may close and reopen the file in O_DIRECT mode if applicable. create_temp_file(): Support O_DIRECT. This is only used if O_TMPFILE is available and innodb_disable_sort_file_cache=ON (non-default value). Notably, that setting never worked on Microsoft Windows. row_merge_file_create_mode(): Split from row_merge_file_create_low(). Create a temporary file in the specified mode. Reviewed by: Vladislav Vaintroub
-
- 19 Jan, 2024 1 commit
-
-
Marko Mäkelä authored
The directio(3C) function on Solaris is supported on NFS and UFS while the majority of users should be on ZFS, which is a copy-on-write file system that implements transparent compression and therefore cannot support unbuffered I/O. Let us remove the call to directio() and simply treat innodb_flush_method=O_DIRECT in the same way as the previous default value innodb_flush_method=fsync on Solaris. Also, let us remove some dead code around calls to os_file_set_nocache() on platforms where fcntl(2) is not usable with O_DIRECT. On IBM AIX, O_DIRECT is not documented for fcntl(2), only for open(2).
-
- 10 Jan, 2024 1 commit
-
-
Marko Mäkelä authored
When innodb_undo_log_truncate=ON causes an InnoDB undo tablespace to be truncated, we must guarantee that the undo tablespace will be rebuilt atomically: After mtr_t::commit_shrink() has durably written the mini-transaction that rebuilds the undo tablespace, we must not write any old pages to the tablespace. To guarantee this, in trx_purge_truncate_history() we used to traverse the entire buf_pool.flush_list in order to acquire exclusive latches on all pages for the undo tablespace that reside in the buffer pool, so that those pages cannot be written and will be evicted during mtr_t::commit_shrink(). But, this traversal may interfere with the page writing activity of buf_flush_page_cleaner(). It would be better to lazily discard the old pages of the truncated undo tablespace. fil_space_t::is_being_truncated, fil_space_t::clear_stopping(): Remove. fil_space_t::create_lsn: A new field, identifying the LSN of the latest rebuild of a tablespace. buf_page_t::flush(), buf_flush_try_neighbors(): Evict pages whose FIL_PAGE_LSN is below fil_space_t::create_lsn. mtr_t::commit_shrink(): Update fil_space_t::create_lsn and fil_space_t::size right before the log is durably written and the tablespace file is being truncated. fsp_page_create(), trx_purge_truncate_history(): Simplify the logic. Reviewed by: Thirunarayanan Balathandayuthapani, Vladislav Lesin Performance tested by: Axel Schwenke Correctness tested by: Matthias Leich
-
- 14 Dec, 2023 1 commit
-
-
Marko Mäkelä authored
During mariadb-backup --backup, a table could be renamed, created and dropped. We could have both oldname.ibd and oldname.new, and one of the files would be deleted before the InnoDB recovery starts. The desired end result would be that we will recover both oldname.ibd and newname.ibd. During normal crash recovery, at most one file operation (create, rename, delete) may require to be replayed from the write-ahead log before the DDL recovery starts. deferred_spaces.create(): In mariadb-backup --prepare, try to create the file in case it does not exist. fil_name_process(): Display a message about not found files not only if innodb_force_recovery is set, but also in mariadb-backup --prepare. If we are processing a FILE_RENAME for a tablespace whose recovery is deferred, suppress the message and adjust the file name in case fil_ibd_load() returns FIL_LOAD_NOT_FOUND or FIL_LOAD_DEFER. fil_ibd_load(): Remove a redundant file name comparison. The caller already compared that the file names are different. We used to wrongly return FIL_LOAD_OK instead of FIL_LOAD_ID_CHANGED if only the schema name differed, such as a/t1.ibd and b/t1.ibd. Tested by: Matthias Leich Reviewed by: Thirunarayanan Balathandayuthapani
-
- 21 Nov, 2023 1 commit
-
-
Marko Mäkelä authored
The log_sys.lsn_lock that was introduced in commit a635c406 had better be located in the same cache line with log_sys.latch so that log_t::append_prepare() needs to modify only two first cache lines where log_sys is stored. log_t::lsn_lock: On Linux, change the type from pthread_mutex_t to something that may be as small as 32 bits, to pack more data members in the same cache line. On Microsoft Windows, CRITICAL_SECTION works better. log_t::check_flush_or_checkpoint_: Renamed to need_checkpoint. There is no need to pause all writer threads in log_free_check() when we only need to write log_sys.buf to ib_logfile0. That will be done in mtr_t::commit(). log_t::append_prepare_wait(): Make the member function non-static to simplify the call interface, and add a parameter for the LSN. log_t::append_prepare(): Invoke append_prepare_wait() at most once. Only set_check_for_checkpoint() if a log checkpoint needs to be written. If the log buffer needs to be written, we will take care of it ourselves later in our caller. This will reduce interference with log_free_check() in other threads. mtr_t::commit(): Call log_write_up_to() if needed. log_t::get_write_target(): Return a log_write_up_to() target to mtr_t::commit(). buf_flush_ahead(): If we are in furious flushing, call log_sys.set_check_for_checkpoint() so that all writers will wait in log_free_check() until the checkpoint is done. Otherwise, the test innodb.insert_into_empty could occasionally report an error "Crash recovery is broken". log_check_margins(): Replaced by log_free_check(). log_flush_margin(): Removed. This is part of mtr_t::commit() and other operations that write log. log_t::create(), log_t::attach(): Guarantee that buf_free < max_buf_free will always hold on PMEM, to satisfy an assumption of log_t::get_write_target(). log_write_up_to(): Assert lsn!=0. Such calls are not incorrect, but it is cheaper to test that single unlikely condition in mtr_t::commit() rather than test several conditions in log_write_up_to(). innodb_drop_database(), unlock_and_close_files(): Check the LSN before calling log_write_up_to(). ha_innobase::commit_inplace_alter_table(): Remove redundant calls to log_write_up_to() after calling unlock_and_close_files(). Reviewed by: Vladislav Vaintroub Stress tested by: Matthias Leich Performance tested by: Steve Shaw
-
- 17 Nov, 2023 1 commit
-
-
Marko Mäkelä authored
dict_find_max_space_id(): Return SELECT MAX(SPACE) FROM SYS_TABLES. dict_check_tablespaces_and_store_max_id(): In the normal case (no encryption plugin has been loaded and the change buffer is empty), invoke dict_find_max_space_id() and do not open any .ibd files. If a std::set<uint32_t> has been specified, open the files whose tablespace ID is mentioned. Else, open all data files that are identified by SYS_TABLES records. fil_ibd_open(): Remove a call to os_file_get_last_error() that can report a misleading error, such as EINVAL inside my_realpath() that is not an actual error. This could be invoked when a data file is found but the FSP_SPACE_FLAGS are incorrect, such as is the case for table test.td in ./mtr --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 innodb.table_flags buf_load(): If any tablespaces could not be found, invoke dict_check_tablespaces_and_store_max_id() on the missing tablespaces. dict_load_tablespace(): Try to load the tablespace unless it was found to be futile. This fixes failures related to FTS_*.ibd files for FULLTEXT INDEX. btr_cur_t::search_leaf(): Prevent a crash when the tablespace does not exist. This was caught by the test innodb_fts.fts_concurrent_insert when the change to dict_load_tablespaces() was not present. We modify a few tests to ensure that tables will not be loaded at startup. For some fault injection tests this means that the corrupted tables will not be loaded, because dict_load_tablespace() would perform stricter checks than dict_check_tablespaces_and_store_max_id(). Tested by: Matthias Leich Reviewed by: Thirunarayanan Balathandayuthapani
-
- 04 Nov, 2023 1 commit
-
-
Marko Mäkelä authored
fil_space_t::drop(): If the caller is not interested in a detached handle, close it immediately.
-
- 31 Oct, 2023 1 commit
-
-
Marko Mäkelä authored
fil_delete_tablespace(): Invoke fil_space_free_low() directly. This fixes up commit 39e3ca8b
-
- 26 Oct, 2023 1 commit
-
-
Marko Mäkelä authored
InnoDB was violating the write-ahead-logging protocol when a file was being deleted, like this: 1. fil_delete_tablespace() set the fil_space_t::STOPPING flag 2. The buf_flush_page_cleaner() thread discards some changed pages for this tablespace advances the log checkpoint a little. 3. The server process is killed before fil_delete_tablespace() wrote a FILE_DELETE record. 4. Recovery will try to apply log to pages of the tablespace, because there was no FILE_DELETE record. This will fail, because some pages that had been modified since the latest checkpoint had not been written by the page cleaner. Page writes must not be stopped before a FILE_DELETE record has been durably written. fil_space_t::drop(): Replaces fil_space_t::check_pending_operations(). Add the parameter detached_handle, and return a tablespace pointer if this thread was the first one to stop I/O on the tablespace. mtr_t::commit_file(): Remove the parameter detached_handle, and move some handling to fil_space_t::drop(). fil_space_t: STOPPING_READS, STOPPING_WRITES: Separate flags for STOPPING. We want to stop reads (and encryption) before stopping page writes. fil_space_t::is_stopping_writes(), fil_space_t::get_for_write(): Special accessors for the write path. fil_space_t::flush_low(): Ignore the STOPPING_READS flag and only stop if STOPPING_WRITES is set, to avoid an infinite loop in fil_flush_file_spaces(), which was occasionally repeated by running the test encryption.create_or_replace. Reviewed by: Vladislav Lesin Tested by: Matthias Leich
-
- 18 Oct, 2023 2 commits
-
-
Marko Mäkelä authored
fil_aio_callback(): Invoke fil_node_t::complete_write() before releasing any page latch, so that in case a log checkpoint is executed roughly concurrently with the first write into a file since the previous checkpoint, we will not miss a fdatasync() or fsync() call to make the write durable.
-
Marko Mäkelä authored
In MemorySanitizer builds of 10.10 and 10.11, we would rather often have the assertion fail in innodb_init() during mariadb-backup --prepare. The assertion could also fail during InnoDB startup, but less often. Before commit 685d958e in 10.8 the log file cleanup after a successfully applied backup is different, and the os_aio_pending_writes() assertion is in srv0start.cc. IORequest::write_complete(): Invoke node->complete_write() before releasing the page latch, so that a log checkpoint that is about to execute concurrently will not miss a fdatasync() or fsync() on the file, in case this was the first write since the last such call. create_log_file(), srv_start(): Replace the debug assertion with a debug check. For all intents and purposes, all writes could have been completed but some write_io_callback() may not have invoked io_slots::release() yet.
-
- 13 Oct, 2023 1 commit
-
-
Thirunarayanan Balathandayuthapani authored
MDEV-31098 InnoDB Recovery doesn't display encryption message when no encryption configuration passed - InnoDB fails to report the error when encryption configuration wasn't passed. This patch addresses the issue by adding the error while loading the tablespace and deferring the tablespace creation.
-
- 12 Oct, 2023 2 commits
-
-
Daniel Black authored
There are many filesystem related errors that can occur with MariaBackup. These already outputed to stderr with a good description of the error. Many of these are permission or resource (file descriptor) limits where the assertion and resulting core crash doesn't offer developers anything more than the log message. To the user, assertions and core crashes come across as poor error handling. As such we return an error and handle this all the way up the stack.
-
Daniel Black authored
There are many filesystem related errors that can occur with MariaBackup. These already outputed to stderr with a good description of the error. Many of these are permission or resource (file descriptor) limits where the assertion and resulting core crash doesn't offer developers anything more than the log message. To the user, assertions and core crashes come across as poor error handling. As such we return an error and handle this all the way up the stack.
-
- 25 Sep, 2023 1 commit
-
-
Daniel Black authored
There are many filesystem related errors that can occur with MariaBackup. These already outputed to stderr with a good description of the error. Many of these are permission or resource (file descriptor) limits where the assertion and resulting core crash doesn't offer developers anything more than the log message. To the user, assertions and core crashes come across as poor error handling. As such we return an error and handle this all the way up the stack.
-
- 01 Aug, 2023 1 commit
-
-
Marko Mäkelä authored
buf_page_t::write_complete(), buf_page_write_complete(), IORequest::write_complete(): Add a parameter for passing an error code. If an error occurred, we will release the io-fix, buffer-fix and page latch but not reset the oldest_modification field. The block would remain in buf_pool.LRU and possibly buf_pool.flush_list, to be written again later, by buf_flush_page_cleaner(). If all page writes start consistently failing, all write threads should eventually hang in log_free_check() because the log checkpoint cannot be advanced to make room in the circular write-ahead-log ib_logfile0. IORequest::read_complete(): Add a parameter for passing an error code. If a read operation fails, we report the error and discard the page, just like we would do if the page checksum was not validated or the page could not be decrypted. This only affects asynchronous reads, due to linear or random read-ahead or crash recovery. When buf_page_get_low() invokes buf_read_page(), that will be a synchronous read, not involving this code. This was tested by randomly injecting errors in write_io_callback() and read_io_callback(), like this: if (!ut_rnd_interval(100)) cb->m_err= 42;
-
- 05 Jul, 2023 1 commit
-
-
Marko Mäkelä authored
fil_node_open_file_low(): Always acquire an advisory lock on the system tablespace. Originally, we already did this in SysTablespace::open_file(), but SysTablespace::open_or_create() would release those locks when it is closing the file handles. This is a 10.5+ specific follow up to commit 0ee1082b (MDEV-28495). Thanks to Daniel Black for verifying this bug.
-
- 01 Jun, 2023 1 commit
-
-
Marko Mäkelä authored
fil_space_t::add(): If a file handle was passed, invoke fil_node_t::find_metadata() before releasing fil_system.mutex. The call was moved from fil_ibd_create(). This is a 10.5 version of commit e3b06156 from 10.6.
-
- 31 May, 2023 1 commit
-
-
Marko Mäkelä authored
fil_ibd_create(): Hold fil_system.mutex until fil_node_t::find_metadata() has completed, so that node->handle cannot be closed by a concurrent thread. This race condition was introduced in commit 10dd290b (MDEV-17380). Tested by: Matthias Leich
-
- 19 May, 2023 3 commits
-
-
Marko Mäkelä authored
This is a 10.6 port of commit 2f9e2647 from MariaDB Server 10.9 that is missing some optimization due to a more complex redo log format and recovery logic (which was simplified in commit 685d958e). The progress reporting of InnoDB crash recovery was rather intermittent. Nothing was reported during the single-threaded log record parsing, which could consume minutes when parsing a large log. During log application, there only was progress reporting in background threads that would be invoked on data page read completion. The progress reporting here will be detailed like this: InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799 InnoDB: Read redo log up to LSN=1963895808 InnoDB: Multi-batch recovery needed at LSN 2534560930 InnoDB: Read redo log up to LSN=3312233472 InnoDB: Read redo log up to LSN=1599646720 InnoDB: Read redo log up to LSN=2160831488 InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages InnoDB: Read redo log up to LSN=3195776000 InnoDB: Read redo log up to LSN=3687099392 InnoDB: Read redo log up to LSN=4165315584 InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages InnoDB: Read redo log up to LSN=4508724224 InnoDB: Read redo log up to LSN=5094550528 InnoDB: To recover: 205230 pages The previous messages "Starting a batch to recover" or "Starting a final batch to recover" will be replaced by "To recover: ... pages" messages. If a batch lasts longer than 15 seconds, then there will be progress reports every 15 seconds, showing the number of remaining pages. For the non-final batch, the "To recover:" message includes two end LSN: that of the batch, and of the recovered log. This is the primary measure of progress. The batch will end once the number of pages to recover reaches 0. If recovery is possible in a single batch, the output will look like this, with a shorter "To recover:" message that counts only the remaining pages: InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799 InnoDB: Read redo log up to LSN=1984539648 InnoDB: Read redo log up to LSN=2710875136 InnoDB: Read redo log up to LSN=3358895104 InnoDB: Read redo log up to LSN=3965299712 InnoDB: Read redo log up to LSN=4557417472 InnoDB: Read redo log up to LSN=5219527680 InnoDB: To recover: 450915 pages We will also speed up recovery by improving the memory management and implementing multi-threaded recovery of data pages that will not need to be read into the buffer pool ("fake read"). Log application in the "fake read" threads will be protected by an atomic being_recovered field and exclusive buf_page_t::lock. Recovery will reserve for data pages two thirds of the buffer pool, or 256 pages, whichever is smaller. Previously, we could only use at most one third of the buffer pool for buffered log records. This would typically mean that with large buffer pools, recovery unnecessary consisted of multiple batches. If recovery runs out of memory, it will "roll back" or "rewind" the current mini-transaction. The recv_sys.recovered_lsn and recv_sys.pages will correspond to the "out of memory LSN", at the end of the previous complete mini-transaction. If recovery runs out of memory while executing the final recovery batch, we can simply invoke recv_sys.apply(false) to make room, and resume parsing. If recovery runs out of memory before the final batch, we will scan the redo log to the end and check for any missing or inconsistent files. In this version of the patch, we will throw away any previously buffered recv_sys.pages and rescan the log from the checkpoint onwards. recv_sys_t::pages_it: A cached iterator to recv_sys.pages. recv_sys_t::is_memory_exhausted(): Remove. We will have out-of-memory handling deep inside recv_sys_t::parse(). recv_sys_t::rewind(), page_recv_t::recs_t::rewind(): Remove all log starting with a specific LSN. IORequest::write_complete(), IORequest::read_complete(): Replaces fil_aio_callback(). read_io_callback(), write_io_callback(): Replaces io_callback(). IORequest::fake_read_complete(), fake_io_callback(), os_fake_read(): Process a "fake read" request for concurrent recovery. recv_sys_t::apply_batch(): Choose a number of successive pages for a recovery batch. recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a page whose recovery is not in progress. Log application threads will not invoke this; they will only set being_recovered=-1 to indicate that the entry is no longer needed. recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries. recv_sys_t::wait_for_pool(): Wait for some space to become available in the buffer pool. mlog_init_t::mark_ibuf_exist(): Avoid calls to recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low(). Such calls would lead to double locking of recv_sys.mutex, which depending on implementation could cause a deadlock. We will use lower-level calls to look up index pages. buf_LRU_block_remove_hashed(): Disable consistency checks for freed ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage. This fixes an occasional failure of the test innodb.innodb_bulk_create_index_debug. Tested by: Matthias Leich
-
Marko Mäkelä authored
The progress reporting of InnoDB crash recovery was rather intermittent. Nothing was reported during the single-threaded log record parsing, which could consume minutes when parsing a large log. During log application, there only was progress reporting in background threads that would be invoked on data page read completion. The progress reporting here will be detailed like this: InnoDB: Starting crash recovery from checkpoint LSN=503549688 InnoDB: Parsed redo log up to LSN=1990840177; to recover: 124806 pages InnoDB: Parsed redo log up to LSN=2729777071; to recover: 186123 pages InnoDB: Parsed redo log up to LSN=3488599173; to recover: 248397 pages InnoDB: Parsed redo log up to LSN=4177856618; to recover: 306469 pages InnoDB: Multi-batch recovery needed at LSN 4189599815 InnoDB: End of log at LSN=4483551634 InnoDB: To recover: LSN 4189599815/4483551634; 307490 pages InnoDB: To recover: LSN 4189599815/4483551634; 197159 pages InnoDB: To recover: LSN 4189599815/4483551634; 67623 pages InnoDB: Parsed redo log up to LSN=4353924218; to recover: 102083 pages ... InnoDB: log sequence number 4483551634 ... The previous messages "Starting a batch to recover" or "Starting a final batch to recover" will be replaced by "To recover: ... pages" messages. If a batch lasts longer than 15 seconds, then there will be progress reports every 15 seconds, showing the number of remaining pages. For the non-final batch, the "To recover:" message includes two end LSN: that of the batch, and of the recovered log. This is the primary measure of progress. The batch will end once the number of pages to recover reaches 0. If recovery is possible in a single batch, the output will look like this, with a shorter "To recover:" message that counts only the remaining pages: InnoDB: Starting crash recovery from checkpoint LSN=503549688 InnoDB: Parsed redo log up to LSN=1998701027; to recover: 125560 pages InnoDB: Parsed redo log up to LSN=2734136874; to recover: 186446 pages InnoDB: Parsed redo log up to LSN=3499505504; to recover: 249378 pages InnoDB: Parsed redo log up to LSN=4183247844; to recover: 306964 pages InnoDB: End of log at LSN=4483551634 ... InnoDB: To recover: 331797 pages ... InnoDB: log sequence number 4483551634 ... We will also speed up recovery by improving the memory management and implementing multi-threaded recovery of data pages that will not need to be read into the buffer pool ("fake read"). Log application in the "fake read" threads will be protected by an atomic being_recovered field and exclusive buf_page_t::latch. Recovery will reserve for data pages two thirds of the buffer pool, or 256 pages, whichever is smaller. Previously, we could only use at most one third of the buffer pool for buffered log records. This would typically mean that with large buffer pools, recovery unnecessary consisted of multiple batches. If recovery runs out of memory, it will "roll back" or "rewind" the current mini-transaction. The recv_sys.lsn and recv_sys.pages will correspond to the "out of memory LSN", at the end of the previous complete mini-transaction. If recovery runs out of memory while executing the final recovery batch, we can simply invoke recv_sys.apply(false) to make room, and resume parsing. If recovery runs out of memory before the final batch, we will scan the redo log to the end (recv_sys.scanned_lsn) and check for any missing or inconsistent files. If recv_init_crash_recovery_spaces() does not report any potentially missing tablespaces, we can make use of the already stored recv_sys.pages and only rewind to the "out of memory LSN". Else, we must keep parsing and invoking recv_validate_tablespace() until an error has been found or everything has been resolved, and ultimatily rewind to to the checkpoint LSN. recv_sys_t::pages_it: A cached iterator to recv_sys.pages recv_sys_t::parse_mtr(): Remove an ATTRIBUTE_NOINLINE that would prevent tail call optimization in recv_sys_t::parse_pmem(). recv_sys_t::parse(), recv_sys_t::parse_mtr(), recv_sys_t::parse_pmem(): Add template<bool store> parameter. Redo log record parsing (store=false) is better specialized from store=true (with bool if_exists) so that we can avoid some conditional branches in frequently invoked low-level code. recv_sys_t::is_memory_exhausted(): Remove. The special parse() status GOT_OOM will report out-of-memory situation at the low level. recv_sys_t::rewind(), page_recv_t::recs_t::rewind(): Remove all log starting with a specific LSN. recv_scan_log(): Separate some code for only parsing, not storing log. In rewound_lsn, remember the LSN at which last_phase=false recovery ran out of memory. This is where the next call to recv_scan_log() will resume storing the log. This replaces recv_sys.last_stored_lsn. recv_sys_t::parse(): Evaluate the template parameter store in a few more cases, to allow dead code to be eliminated at compile time. recv_sys_t::scanned_lsn: The end of the log found by recv_scan_log(). The special value 1 means that recv_sys has been initialized but no log has been parsed. IORequest::write_complete(), IORequest::read_complete(): Replaces fil_aio_callback(). read_io_callback(), write_io_callback(): Replaces io_callback(). IORequest::fake_read_complete(), fake_io_callback(), os_fake_read(): Process a "fake read" request for concurrent recovery. recv_sys_t::apply_batch(): Choose a number of successive pages for a recovery batch. recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a page whose recovery is not in progress. Log application threads will not invoke this; they will only set being_recovered=-1 to indicate that the entry is no longer needed. recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries. recv_sys_t::wait_for_pool(): Wait for some space to become available in the buffer pool. mlog_init_t::mark_ibuf_exist(): Avoid calls to recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low(). Such calls would lead to double locking of recv_sys.mutex, which depending on implementation could cause a deadlock. We will use lower-level calls to look up index pages. buf_LRU_block_remove_hashed(): Disable consistency checks for freed ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage. This fixes an occasional failure of the test innodb.innodb_bulk_create_index_debug. Tested by: Matthias Leich
-
Vlad Lesin authored
MDEV-31256 fil_node_open_file() releases fil_system.mutex allowing other thread to open its file node There is room between mutex_exit(&fil_system.mutex) and mutex_enter(&fil_system.mutex) calls in fil_node_open_file(). During this room another thread can open the node, and ut_ad(!node->is_open()) assertion in fil_node_open_file_low() can fail. The fix is not to open node if it was already opened by another thread.
-
- 26 Apr, 2023 1 commit
-
-
Marko Mäkelä authored
log_free_check(): Assert that the caller must not hold exclusive lock_sys.latch. This was the case for calls from ibuf_delete_for_discarded_space(). This caused a deadlock with another thread that would be holding a latch on a dirty page that would need to be written so that the checkpoint would advance and log_free_check() could return. That other thread was waiting for a shared lock_sys.latch. fil_delete_tablespace(): Do not invoke ibuf_delete_for_discarded_space() because in DDL operations, we will be holding exclusive lock_sys.latch. trx_t::commit(std::vector<pfs_os_file_t>&), innodb_drop_database(), row_purge_remove_clust_if_poss_low(), row_undo_ins_remove_clust_rec(), row_discard_tablespace_for_mysql(): Invoke ibuf_delete_for_discarded_space() on the deleted tablespaces after releasing all latches.
-
- 19 Apr, 2023 1 commit
-
-
Marko Mäkelä authored
fil_space_t::create(), fil_space_t::add(): Expect the caller to acquire and release fil_system.mutex. In this way, creating a tablespace and adding the first (usually only) data file will be atomic. recv_sys_t::recover_deferred(): Correctly protect some changes by holding fil_system.mutex. Tested by: Matthias Leich
-
- 14 Apr, 2023 2 commits
-
-
Vlad Lesin authored
MDEV-31049 fil_delete_tablespace() returns wrong file handle if tablespace was closed by parallel thread fil_delete_tablespace() stores file handle in local variable and calls mtr_t::commit_file()=>fil_system_t::detach(..., detach_handle=true), which sets space->chain.start->handle = OS_FILE_CLOSED. fil_system_t::detach() is invoked under fil_system.mutex. But before the mutex is acquired some parallel thread can change space->chain.start->handle. fil_delete_tablespace() returns value, stored in local variable, i.e. wrong value. File handle can be closed, for example, from buf_flush_space() when the limit of innodb_open_files exceded and fil_space_t::get() causes fil_space_t::try_to_close() call. fil_space_t::try_to_close() is executed under fil_system.mutex. And mtr_t::commit_file() locks it for fil_system_t::detach() call. fil_system_t::detach() returns detached file handle if its argument detach_handle is true. The fix is to let mtr_t::commit_file() to pass that detached file handle to fil_delete_tablespace().
-
Vlad Lesin authored
Post-push fix. 10.5 MDEV-30775 fix inserts just opened tablespace just after the element which fil_system.space_list_last_opened points to. In MDEV-25223 fil_system_t::space_list was changed from UT_LIST to ilist. ilist<...>::insert(iterator pos, reference value) inserts element to list before pos. But it was not taken into account during 10.5->10.6 merge in 85cbfaef, and the fix does not work properly, i.e. it inserted just opened tablespace to the position preceding fil_system.space_list_last_opened.
-
- 12 Apr, 2023 1 commit
-
-
Thirunarayanan Balathandayuthapani authored
- This issue caused by race condition between drop thread and fil_encrypt_thread. fil_encrypt_thread closes the tablespace if the number of opened files exceeds innodb_open_files. fil_node_open_file() closes the tablespace which are open and it doesn't have pending operations. At that time, InnoDB drop tries to write the redo log for the file delete operation. It throws the bad file descriptor error. - When trying to close the file, InnoDB should check whether the table is going to be dropped.
-
- 28 Mar, 2023 1 commit
-
-
Marko Mäkelä authored
fil_space_t::~fil_space_t(): Invoke ut_free(name) because doing so in the callers would trip MSAN_OPTIONS=poison_in_dtor=1
-
- 27 Mar, 2023 1 commit
-
-
Vlad Lesin authored
MDEV-29050 mariabackup issues error messages during InnoDB tablespaces export on partial backup preparing The solution is to suppress error messages for missing tablespaces if mariabackup is launched with "--prepare --export" options. "mariabackup --prepare --export" invokes itself with --mysqld parameter. If the parameter is set, then it starts server to feed "FLUSH TABLES ... FOR EXPORT;" queries for exported tablespaces. This is "normal" server start, that's why new srv_operation value is introduced. Reviewed by Marko Makela.
-
- 10 Mar, 2023 1 commit
-
-
Vlad Lesin authored
fil_node_open_file_low() tries to close files from the top of fil_system.space_list if the number of opened files is exceeded. It invokes fil_space_t::try_to_close(), which iterates the list searching for the first opened space. Then it just closes the space, leaving it in the same position in fil_system.space_list. On heavy files opening, like during 'SHOW TABLE STATUS ...' execution, if the number of opened files limit is reached, fil_space_t::try_to_close() iterates more and more closed spaces before reaching any opened space for each fil_node_open_file_low() call. What causes performance regression if the number of spaces is big enough. The fix is to keep opened spaces at the top of fil_system.space_list, and move closed files at the end of the list. For this purpose fil_space_t::space_list_last_opened pointer is introduced. It points to the last inserted opened space in fil_space_t::space_list. When space is opened, it's inserted to the position just after the pointer points to in fil_space_t::space_list to preserve the logic, inroduced in MDEV-23855. Any closed space is added to the end of fil_space_t::space_list. As opened spaces are located at the top of fil_space_t::space_list, fil_space_t::try_to_close() finds opened space faster. There can be the case when opened and closed spaces are mixed in fil_space_t::space_list if fil_system.freeze_space_list was set during fil_node_open_file_low() execution. But this should not cause any error, as fil_space_t::try_to_close() still iterates spaces in the list. There is no need in any test case for the fix, as it does not change any functionality, but just fixes performance regression.
-
- 24 Jan, 2023 1 commit
-
-
Marko Mäkelä authored
This also fixes part of MDEV-29835 Partial server freeze which is caused by violations of the latching order that was defined in https://dev.mysql.com/worklog/task/?id=6326 (WL#6326: InnoDB: fix index->lock contention). Unless the current thread is holding an exclusive dict_index_t::lock, it must acquire page latches in a strict parent-to-child, left-to-right order. Not all cases of MDEV-29835 are fixed yet. Failure to follow the correct latching order will cause deadlocks of threads due to lock order inversion. As part of these changes, the BTR_MODIFY_TREE mode is modified so that an Update latch (U a.k.a. SX) will be acquired on the root page, and eXclusive latches (X) will be acquired on all pages leading to the leaf page, as well as any left and right siblings of the pages along the path. The DEBUG_SYNC test innodb.innodb_wl6326 will be removed, because at the time the DEBUG_SYNC point is hit, the thread is actually holding several page latches that will be blocking a concurrent SELECT statement. We also remove double bookkeeping that was caused due to excessive information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo store information of latched pages, and ensure that mtr_memo_slot_t::object is never a null pointer. The tree_blocks[] and tree_savepoints[] were redundant. buf_page_get_low(): If innodb_change_buffering_debug=1, to avoid a hang, do not try to evict blocks if we are holding a latch on a modified page. The test innodb.innodb-change-buffer-recovery will be removed, because change buffering may no longer be forced by debug injection when the change buffer comprises multiple pages. Remove a debug assertion that could fail when innodb_change_buffering_debug=1 fails to evict a page. For other cases, the assertion is redundant, because we already checked that right after the got_block: label. The test innodb.innodb-change-buffering-recovery will be removed, because due to this change, we will be unable to evict the desired page. mtr_t::lock_register(): Register a change of a page latch on an unmodified buffer-fixed block. mtr_t::x_latch_at_savepoint(), mtr_t::sx_latch_at_savepoint(): Replaced by the use of mtr_t::upgrade_buffer_fix(), which now also handles RW_S_LATCH. mtr_t::set_modified(): For temporary tables, invoke buf_page_t::set_modified() here and not in mtr_t::commit(). We will never set the MTR_MEMO_MODIFY flag on other than persistent data pages, nor set mtr_t::m_modifications when temporary data pages are modified. mtr_t::commit(): Only invoke the buf_flush_note_modification() loop if persistent data pages were modified. mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo. This avoids many redundant entries in mtr_t::m_memo, as well as redundant calls to buf_page_get_gen() for blocks that had already been looked up in a mini-transaction. btr_get_latched_root(): Return a pointer to an already latched root page. This replaces btr_root_block_get() in cases where the mini-transaction has already latched the root page. btr_page_get_parent(): Fetch a parent page that was already latched in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched(). If needed, upgrade the root page U latch to X. This avoids bloating mtr_t::m_memo as well as performing redundant buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for B-tree defragmentation, we will invoke btr_cur_search_to_nth_level(). btr_cur_search_to_nth_level(): This will only be used for non-leaf (level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be removed altogether, or retained for the case of CHECK TABLE without QUICK. btr_cur_t::left_block: Remove. btr_pcur_move_backward_from_page() can retrieve the left sibling from the end of mtr_t::m_memo. btr_cur_t::open_leaf(): Some clean-up. btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level() for searches to level=0 (the leaf level). We will never release parent page latches before acquiring leaf page latches. If we need to temporarily release the level=1 page latch in the BTR_SEARCH_PREV or BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the child node pointer so that we will land on the correct leaf page. btr_cur_t::pessimistic_search_leaf(): Implement new BTR_MODIFY_TREE latching logic in the case that page splits or merges will be needed. The parent pages (and their siblings) should already be latched on the first dive to the leaf and be present in mtr_t::m_memo; there should be no need for BTR_CONT_MODIFY_TREE. This pre-latching almost suffices; it must be revised in MDEV-29835 and work-arounds removed for cases where mtr_t::get_already_latched() fails to find a block. rtr_search_to_nth_level(): A SPATIAL INDEX version of btr_search_to_nth_level() that can search to any level (including the leaf level). rtr_search_leaf(), rtr_insert_leaf(): Wrappers for rtr_search_to_nth_level(). rtr_search(): Replaces rtr_pcur_open(). rtr_latch_leaves(): Replaces btr_cur_latch_leaves(). Note that unlike in the B-tree code, there is no error handling in case the sibling pages are corrupted. rtr_cur_restore_position(): Remove an unused constant parameter. btr_pcur_open_on_user_rec(): Remove the constant parameter mode=PAGE_CUR_GE. row_ins_clust_index_entry_low(): Use a new mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC. BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove. BTR_CONT_MODIFY_TREE: Note that this is only used by rtr_search_to_nth_level(). btr_pcur_optimistic_latch_leaves(): Replaces btr_cur_optimistic_latch_leaves(). ibuf_delete_rec(): Acquire exclusive ibuf.index->lock in order to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV). btr_blob_log_check_t(): Acquire a U latch on the root page, so that btr_page_alloc() in btr_store_big_rec_extern_fields() will avoid a deadlock. btr_store_big_rec_extern_fields(): Assert that the root page latch is being held. Tested by: Matthias Leich Reviewed by: Vladislav Lesin
-
- 30 Nov, 2022 2 commits
-
-
Marko Mäkelä authored
os_file_read(): Merged with os_file_read_no_error_handling(). Crashing on a partial page read is as unhelpful as crashing on a corrupted page read (commit 0b47c126). Report the file name if it is available via IORequest.
-
Marko Mäkelä authored
fil_space_t::prepare_acquired(): Do not attempt to extend (or shrink) files that will be processed by recv_sys_t::recover_deferred().
-