An error occurred fetching the project authors.
- 18 Apr, 2023 1 commit
-
-
Marko Mäkelä authored
trx_assign_rseg_low(): Simplify the debug check. trx_rseg_t::reinit(): Reset the skip_allocation() flag. This logic was broken in the merge commit 3e2ad0e9 of commit 0de3be8c (that is, innodb_undo_log_truncate=ON would never be "completed"). Tested by: Matthias Leich
-
- 16 Mar, 2023 1 commit
-
-
Marko Mäkelä authored
lock_sec_rec_some_has_impl(): Remove a harmful condition that caused the performance regression and should not have been added in commit b6e41e38 in the first place. Locking transactions that have not modified any persistent tables can carry the transaction identifier 0. trx_t::max_inactive_id: A cache for trx_sys_t::find_same_or_older(). The value is not reset on transaction commit so that previous results can be reused for subsequent transactions. The smallest active transaction ID can only increase over time, not decrease. trx_sys_t::find_same_or_older(): Remember the maximum previous id for which rw_trx_hash.iterate() returned false, to avoid redundant iterations. lock_sec_rec_read_check_and_lock(): Add an early return in case we are already holding a covering table lock. lock_rec_convert_impl_to_expl(): Add a template parameter to avoid a redundant run-time check on whether the index is secondary. lock_rec_convert_impl_to_expl_for_trx(): Move some code from lock_rec_convert_impl_to_expl(), to reduce code duplication due to the added template parameter. Reviewed by: Vladislav Lesin Tested by: Matthias Leich
-
- 02 Mar, 2023 1 commit
-
-
Thirunarayanan Balathandayuthapani authored
- InnoDB fails to reset the check_foreigns and check_unique_secondary in trx_t::free(), trx_t::commit_cleanup(). This lead to bulk insert in internal innodb fts table operation.
-
- 24 Feb, 2023 1 commit
-
-
Marko Mäkelä authored
It is not safe to invoke trx_purge_free_segment() or execute innodb_undo_log_truncate=ON before all undo log records in the rollback segment has been processed. A prominent failure that would occur due to premature freeing of undo log pages is that trx_undo_get_undo_rec() would crash when trying to copy an undo log record to fetch the previous version of a record. If trx_undo_get_undo_rec() was not invoked in the unlucky time frame, then the symptom would be that some committed transaction history is never removed. This would be detected by CHECK TABLE...EXTENDED that was impleented in commit ab019010. Such a garbage collection leak should be possible even when using innodb_undo_log_truncate=OFF, just involving trx_purge_free_segment(). trx_rseg_t::needs_purge: Change the type from Boolean to a transaction identifier, noting the most recent non-purged transaction, or 0 if everything has been purged. On transaction start, we initialize this to 1 more than the transaction start ID. On recovery, the field may be adjusted to the transaction end ID (TRX_UNDO_TRX_NO) if it is larger. The field TRX_UNDO_NEEDS_PURGE becomes write-only; only some debug assertions that would validate the value. The field reflects the old inaccurate Boolean field trx_rseg_t::needs_purge. trx_undo_mem_create_at_db_start(), trx_undo_lists_init(), trx_rseg_mem_restore(): Remove the parameter max_trx_id. Instead, store the maximum in trx_rseg_t::needs_purge, where trx_rseg_array_init() will find it. trx_purge_free_segment(): Contiguously hold a lock on trx_rseg_t to prevent any concurrent allocation of undo log. trx_purge_truncate_rseg_history(): Only invoke trx_purge_free_segment() if the rollback segment is empty and there are no pending transactions associated with it. trx_purge_truncate_history(): Only proceed with innodb_undo_log_truncate=ON if trx_rseg_t::needs_purge indicates that all history has been purged. Tested by: Matthias Leich
-
- 09 Feb, 2023 1 commit
-
-
Vicențiu Ciorbaru authored
This patch is the result of running run-clang-tidy -fix -header-filter=.* -checks='-*,modernize-use-equals-default' . Code style changes have been done on top. The result of this change leads to the following improvements: 1. Binary size reduction. * For a -DBUILD_CONFIG=mysql_release build, the binary size is reduced by ~400kb. * A raw -DCMAKE_BUILD_TYPE=Release reduces the binary size by ~1.4kb. 2. Compiler can better understand the intent of the code, thus it leads to more optimization possibilities. Additionally it enabled detecting unused variables that had an empty default constructor but not marked so explicitly. Particular change required following this patch in sql/opt_range.cc result_keys, an unused template class Bitmap now correctly issues unused variable warnings. Setting Bitmap template class constructor to default allows the compiler to identify that there are no side-effects when instantiating the class. Previously the compiler could not issue the warning as it assumed Bitmap class (being a template) would not be performing a NO-OP for its default constructor. This prevented the "unused variable warning".
-
- 17 Nov, 2022 1 commit
-
-
Marko Mäkelä authored
btr_cur_t: Zero-initialize all fields in the default constructor. btr_cur_t::index: Remove; it duplicated page_cur.index. Many functions: Remove arguments that were duplicating page_cur_t::index and page_cur_t::block. page_cur_open_level(), btr_pcur_open_level(): Replaces btr_cur_open_at_index_side() for dict_stats_analyze_index(). At the end, release all latches except the dict_index_t::lock and the buf_page_t::lock on the requested page. dict_stats_analyze_index(): Rely on mtr_t::rollback_to_savepoint() to release all uninteresting page latches. btr_search_guess_on_hash(): Simplify the logic, and invoke mtr_t::rollback_to_savepoint(). We will use plain C++ std::vector<mtr_memo_slot_t> for mtr_t::m_memo. In this way, we can avoid setting mtr_memo_slot_t::object to nullptr and instead just remove garbage from m_memo. mtr_t::rollback_to_savepoint(): Shrink the vector. We will be needing this in dict_stats_analyze_index(), where we will release page latches and only retain the index->lock in mtr_t::m_memo. mtr_t::release_last_page(): Release the last acquired page latch. Replaces btr_leaf_page_release(). mtr_t::release(const buf_block_t&): Release a single page latch. Used in btr_pcur_move_backward_from_page(). mtr_t::memo_release(): Replaced with mtr_t::release(). mtr_t::upgrade_buffer_fix(): Acquire a latch for a buffer-fixed page. This replaces the double bookkeeping in btr_cur_t::open_leaf(). Reviewed by: Vladislav Lesin
-
- 25 Oct, 2022 1 commit
-
-
Vladislav Vaintroub authored
Non-blocking log_write_upto (MDEV-24341) was only designed for the client connections. Fix, so it is not be triggered for any system THD. Previously, an incomplete solution only excluded Innodb purge THDs, but not the slave for example. The hang in MDEV still remains somewhat a mystery though, it is not immediately clear how exactly condition variable can become corrupted. But it is clear that it can be avoided.
-
- 24 Oct, 2022 1 commit
-
-
Vlad Lesin authored
The lock is created during page splitting after moving records and locks(lock_move_rec_list_(start|end)()) to the new page, and inheriting the locks to the supremum of left page from the successor of the infimum on right page. There is no need in such inheritance for READ COMMITTED isolation level and not-gap locks, so the fix is to add the corresponding condition in gap lock inheritance function. One more fix is to forbid gap lock inheritance if XA was prepared. Use the most significant bit of trx_t::n_ref to indicate that gap lock inheritance is forbidden. This fix is based on mysql/mysql-server@b063e52a8367dc9d5ed418e7f6d96400867e9f43
-
- 21 Oct, 2022 2 commits
-
-
Vlad Lesin authored
Suppose we have two transactions, trx 1 and trx 2. trx 2 does deadlock resolving from lock_wait(), it sets victim->lock.was_chosen_as_deadlock_victim=true for trx 1, but has not yet invoked lock_cancel_waiting_and_release(). trx 1 checks the flag in lock_trx_handle_wait(), and starts rollback from row_mysql_handle_errors(). It can change trx->lock.wait_thr and trx->state as it holds trx_t::mutex, but trx 2 has not yet requested it, as lock_cancel_waiting_and_release() has not yet been called. After that trx 1 tries to release locks in trx_t::rollback_low(), invoking trx_t::rollback_finish(). lock_release() is blocked on try to acquire lock_sys.rd_lock(SRW_LOCK_CALL) in lock_release_try(), as lock_sys is blocked by trx 2, as deadlock resolution works under lock_sys.wr_lock(SRW_LOCK_CALL), see Deadlock::report() for details. trx 2 executes lock_cancel_waiting_and_release() for deadlock victim, i. e. for trx 1. lock_cancel_waiting_and_release() contains some trx->lock.wait_thr and trx->state assertions, which will fail, because trx 1 has changed them during rollback execution. So, according to the above scenario, it's legal to have trx->lock.wait_thr==0 and trx->state!=TRX_STATE_ACTIVE in lock_cancel_waiting_and_release(), if it was invoked from Deadlock::report(), and the fix is just in the assertion conditions changing. The fix is just in changing assertion condition. There is also lock_wait() cleanup around trx->error_state. If trx->error_state can be changed not by the owned thread, it must be protected with lock_sys.wait_mutex, as lock_wait() uses trx->lock.cond along with that mutex. Also if trx->error_state was changed before lock_sys.wait_mutex acquision, then it could be reset with the following code, what is wrong. Also we need to check trx->error_state before entering waiting loop, otherwise it can be the case when trx->error_state was set before lock_sys.wait_mutex acquision, but the thread will be waiting on trx->lock.cond.
-
Marko Mäkelä authored
Until now, the attribute EXTENDED of CHECK TABLE was ignored by InnoDB, and InnoDB only counted the records in each index according to the current read view. Unless the attribute QUICK was specified, the function btr_validate_index() would be invoked to validate the B-tree structure (the sibling and child links between index pages). The EXTENDED check will not only count all index records according to the current read view, but also ensure that any delete-marked records in the clustered index are waiting for the purge of history, and that all secondary index records point to a version of the clustered index record that is waiting for the purge of history. In other words, no index may contain orphan records. Normal MVCC reads and the non-EXTENDED version of CHECK TABLE would ignore these orphans. Unpurged records merely result in warnings (at most one per index), not errors, and no indexes will be flagged as corrupted due to such garbage. It will remain possible to SELECT data from such indexes or tables (which will skip such records) or to rebuild the table to reclaim some space. We introduce purge_sys.end_view that will be (almost) a copy of purge_sys.view at the end of a batch of purging committed transaction history. It is not an exact copy, because if the size of a purge batch is limited by innodb_purge_batch_size, some records that purge_sys.view would allow to be purged will be left over for subsequent batches. The purge_sys.view is relevant in the purge of committed transaction history, to determine if records are safe to remove. The new purge_sys.end_view is relevant in MVCC operations and in CHECK TABLE ... EXTENDED. It tells which undo log records are safe to access (have not been discarded at the end of a purge batch). purge_sys.clone_oldest_view<true>(): In trx_lists_init_at_db_start(), clone the oldest read view similar to purge_sys_t::clone_end_view() so that CHECK TABLE ... EXTENDED will not report bogus failures between InnoDB restart and the completed purge of committed transaction history. purge_sys_t::is_purgeable(): Replaces purge_sys_t::changes_visible() in the case that purge_sys.latch will not be held by the caller. Among other things, this guards access to BLOBs. It is not safe to dereference any BLOBs of a delete-marked purgeable record, because they may have already been freed. purge_sys_t::view_guard::view(): Return a reference to purge_sys.view that will be protected by purge_sys.latch, held by purge_sys_t::view_guard. purge_sys_t::end_view_guard::view(): Return a reference to purge_sys.end_view while it is protected by purge_sys.end_latch. Whenever a thread needs to retrieve an older version of a clustered index record, it will hold a page latch on the clustered index page and potentially also on a secondary index page that points to the clustered index page. If these pages contain purgeable records that would be accessed by a currently running purge batch, the progress of the purge batch would be blocked by the page latches. Hence, it is safe to make a copy of purge_sys.end_view while holding an index page latch, and consult the copy of the view to determine whether a record should already have been purged. btr_validate_index(): Remove a redundant check. row_check_index_match(): Check if a secondary index record and a version of a clustered index record match each other. row_check_index(): Replaces row_scan_index_for_mysql(). Count the records in each index directly, duplicating the relevant logic from row_search_mvcc(). Initialize check_table_extended_view for CHECK ... EXTENDED while holding an index leaf page latch. If we encounter an orphan record, the copy of purge_sys.end_view that we make is safe for visibility checks, and trx_undo_get_undo_rec() will check for the safety to access each undo log record. Should that check fail, we should return DB_MISSING_HISTORY to report a corrupted index. The EXTENDED check tries to match each secondary index record with every available clustered index record version, by duplicating the logic of row_vers_build_for_consistent_read() and invoking trx_undo_prev_version_build() directly. Before invoking row_check_index_match() on delete-marked clustered index record versions, we will consult purge_sys.is_purgeable() in order to avoid accessing freed BLOBs. We will always check that the DB_TRX_ID or PAGE_MAX_TRX_ID does not exceed the global maximum. Orphan secondary index records will be flagged only if everything up to PAGE_MAX_TRX_ID has been purged. We warn also about clustered index records whose nonzero DB_TRX_ID should have been reset in purge or rollback. trx_set_rw_mode(): Move an assertion from ReadView::set_creator_trx_id(). trx_undo_prev_version_build(): Remove two debug-only parameters, and return an error code instead of a Boolean. trx_undo_get_undo_rec(): Return a pointer to the undo log record, or nullptr if one cannot be retrieved. Instead of consulting the purge_sys.view, consult the purge_sys.end_view to determine which records can be accessed. trx_undo_get_rec_if_purgeable(): A variant of trx_undo_get_undo_rec() that will consult purge_sys.view instead of purge_sys.end_view. TRX_UNDO_CHECK_PURGEABILITY: A new parameter to trx_undo_prev_version_build(), passed by row_vers_old_has_index_entry() so that purge_sys.view instead of purge_sys.end_view will be consulted to determine whether a secondary index record may be safely purged. row_upd_changes_disowned_external(): Remove. This should be more expensive than briefly latching purge_sys in trx_undo_prev_version_build() (which may make use of transactional memory). row_sel_reset_old_vers_heap(): New function, split from row_sel_build_prev_vers_for_mysql(). row_sel_build_prev_vers_for_mysql(): Reorder some parameters to simplify the call to row_sel_reset_old_vers_heap(). row_search_for_mysql(): Replaced with direct calls to row_search_mvcc(). sel_node_get_nth_plan(): Define inline in row0sel.h open_step(): Define at the call site, in simplified form. sel_node_reset_cursor(): Merged with the only caller open_step(). --- ReadViewBase::check_trx_id_sanity(): Remove. Let us handle "future" DB_TRX_ID in a more meaningful way: row_sel_clust_sees(): Return DB_SUCCESS if the record is visible, DB_SUCCESS_LOCKED_REC if it is invisible, and DB_CORRUPTION if the DB_TRX_ID is in the future. row_undo_mod_must_purge(), row_undo_mod_clust(): Silently ignore corrupted DB_TRX_ID. We are in ROLLBACK, and we should have noticed that corruption when we were about to modify the record in the first place (leading us to refuse the operation). row_vers_build_for_consistent_read(): Return DB_CORRUPTION if DB_TRX_ID is in the future. Tested by: Matthias Leich Reviewed by: Vladislav Lesin
-
- 03 Oct, 2022 1 commit
-
-
Vlad Lesin authored
MDEV-29575 Access to innodb_trx, innodb_locks and innodb_lock_waits along with detached XA's can cause SIGSEGV trx->mysql_thd can be zeroed-out between thd_get_thread_id() and thd_query_safe() calls in fill_trx_row(). trx_disconnect_prepared() zeroes out trx->mysql_thd. And this can cause null pointer dereferencing in fill_trx_row(). fill_trx_row() is invoked from fetch_data_into_cache() under trx_sys.mutex. Bug fix is in reseting trx_t::mysql_thd in trx_disconnect_prepared() under trx_sys.mutex lock too. MTR test case can't be created for the fix, as we need to wait for trx_t::mysql_thd reseting in fill_trx_row() after trx_t::mysql_thd was checked for null while trx_sys.mutex is held. But trx_t::mysql_thd must be reset in trx_disconnect_prepared() under trx_sys.mutex. There will be deadlock.
-
- 20 Sep, 2022 1 commit
-
-
Marko Mäkelä authored
-
- 24 Aug, 2022 1 commit
-
-
Vlad Lesin authored
The issue is that trx_t::lock.was_chosen_as_deadlock_victim can be reset before the transaction check it and set trx_t::error_state. The fix is to reset trx_t::lock.was_chosen_as_deadlock_victim only in trx_t::commit_in_memory(), which is invoked on full rollback. There is also no need to have separate bit in trx_t::lock.was_chosen_as_deadlock_victim to flag transaction it was chosen as a victim of Galera conflict resolution, the same variable can be used for both cases except debug build. For debug build we need to distinguish deadlock and Galera's abort victims for debug checks. Also there is no need to check for deadlock in lock_table_enqueue_waiting() for Galera as the coresponding check presents in lock_wait(). Local variable "error_state" in lock_wait() was replaced with trx->error_state, because before the replace lock_sys_t::cancel<false>(trx, lock) and lock_sys.deadlock_check() could change trx->error_state, which then could be overwritten with the local "error_state" variable value. The lock_wait_suspend_thread_enter DEBUG_SYNC point name is misleading, because lock_wait_suspend_thread was eliminated in e71e6133. It was renamed to lock_wait_start. Reviewed by: Marko Mäkelä, Jan Lindström.
-
- 06 Jun, 2022 1 commit
-
-
Marko Mäkelä authored
The approach to handling corruption that was chosen by Oracle in commit 177d8b0c is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.
-
- 26 Apr, 2022 1 commit
-
-
Marko Mäkelä authored
MDEV-26217 Failing assertion: list.count > 0 in ut_list_remove or Assertion `lock->trx == this' failed in dberr_t trx_t::drop_table This follows up the previous fix in commit c3c53926 (MDEV-26554). ha_innobase::delete_table(): Work around the insufficient metadata locking (MDL) during DML operations by acquiring exclusive InnoDB table locks on all child tables. Previously, this was only done on TRUNCATE and ALTER. ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke lock_update_delete() during change buffer operations. The revised trx_t::commit(std::vector<pfs_os_file_t>&) will hold exclusive lock_sys.latch while invoking fil_delete_tablespace(), which in turn may invoke ibuf_delete_rec(). dict_index_t::has_locking(): A new predicate, replacing the dummy !dict_table_is_locking_disabled(index->table). Used for skipping lock operations during ibuf_delete_rec(). trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks and remove the table from the cache while holding exclusive lock_sys.latch. trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds. trx_t::commit(): Reset dict_operation before invoking commit_in_memory() via commit_persist(). lock_release_on_drop(): Release locks while lock_sys.latch is exclusively locked. lock_table(): Add a parameter for a pointer to the table. We must not dereference the table before a lock_sys.latch has been acquired. If the pointer to the table does not match the table at that point, the table is invalid and DB_DEADLOCK will be returned. row_ins_foreign_check_on_constraint(): Improve the checks. Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed before commit c5fd9aa5 (MDEV-25919). row_upd_check_references_constraints(), wsrep_row_upd_check_foreign_constraints(): Simplify checks.
-
- 25 Apr, 2022 1 commit
-
-
Thirunarayanan Balathandayuthapani authored
- InnoDB DDL results in `Duplicate entry' if concurrent DML throws duplicate key error. The following scenario explains the problem connection con1: ALTER TABLE t1 FORCE; connection con2: INSERT INTO t1(pk, uk) VALUES (2, 2), (3, 2); In connection con2, InnoDB throws the 'DUPLICATE KEY' error because of unique index. Alter operation will throw the error when applying the concurrent DML log. - Inserting the duplicate key for unique index logs the insert operation for online ALTER TABLE. When insertion fails, transaction does rollback and it leads to logging of delete operation for online ALTER TABLE. While applying the insert log entries, alter operation encounters 'DUPLICATE KEY' error. - To avoid the above fake duplicate scenario, InnoDB should not write any log for online ALTER TABLE before DML transaction commit. - User thread which does DML can apply the online log if InnoDB ran out of online log and index is marked as completed. Set online log error if apply phase encountered any error. It can also clear all other indexes log, marks the newly added indexes as corrupted. - Removed the old online code which was a part of DML operations commit_inplace_alter_table() : Does apply the online log for the last batch of secondary index log and does frees the log for the completed index. trx_t::apply_online_log: Set to true while writing the undo log if the modified table has active DDL trx_t::apply_log(): Apply the DML changes to online DDL tables dict_table_t::is_active_ddl(): Returns true if the table has an active DDL dict_index_t::online_log_make_dummy(): Assign dummy value for clustered index online log to indicate the secondary indexes are being rebuild. dict_index_t::online_log_is_dummy(): Check whether the online log has dummy value ha_innobase_inplace_ctx::log_failure(): Handle the apply log failure for online DDL transaction row_log_mark_other_online_index_abort(): Clear out all other online index log after encountering the error during row_log_apply() row_log_get_error(): Get the error happened during row_log_apply() row_log_online_op(): Does apply the online log if index is completed and ran out of memory. Returns false if apply log fails UndorecApplier: Introduced a class to maintain the undo log record, latched undo buffer page, parse the undo log record, maintain the undo record type, info bits and update vector UndorecApplier::get_old_rec(): Get the correct version of the clustered index record that was modified by the current undo log record UndorecApplier::clear_undo_rec(): Clear the undo log related information after applying the undo log record UndorecApplier::log_update(): Handle the update, delete undo log and apply it on online indexes UndorecApplier::log_insert(): Handle the insert undo log and apply it on online indexes UndorecApplier::is_same(): Check whether the given roll pointer is generated by the current undo log record information trx_t::rollback_low(): Set apply_online_log for the transaction after partially rollbacked transaction has any active DDL prepare_inplace_alter_table_dict(): After allocating the online log, InnoDB does create fulltext common tables. Fulltext index doesn't allow the index to be online. So removed the dead code of online log removal Thanks to Marko Mäkelä for providing the initial prototype and Matthias Leich for testing the issue patiently.
-
- 14 Apr, 2022 1 commit
-
-
Marko Mäkelä authored
The element mutex is unnecessarily large. The PERFORMANCE_SCHEMA instrumentation was not even enabled.
-
- 24 Feb, 2022 1 commit
-
-
Krunal Bauskar authored
- In 10.6, trx_rseg_t mutex was ported to use latch. As part of this porting profiling of the patch was removed. This patch reenables it given that the said latch continues to occupy the top-slots in the contention list.
-
- 18 Nov, 2021 1 commit
-
-
Marko Mäkelä authored
buf_page_t::frame: Moved from buf_block_t::frame. All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED pages will have frame=nullptr, while all 'fat' buf_block_t will have a non-null frame pointing to aligned innodb_page_size bytes. This eliminates the need for separate states for BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE. buf_page_t::lock: Moved from buf_block_t::lock. That is, all block descriptors will have a page latch. The IO_PIN state that was used for discarding or creating the uncompressed page frame of a ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix and page X-latch. page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status of buf_page_t with a single std::atomic<uint32_t>. All modifications will use store(), fetch_add(), fetch_sub(). This space was previously wasted to alignment on 64-bit systems. We will use the following encoding that combines a state (partly read-fix or write-fix) and a buffer-fix count: buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED) buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY) buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH) buf_page_t::FREED=3 + fix: pages marked as freed in the file buf_page_t::UNFIXED=1U<<29 + fix: normal pages buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite) buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched) buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched) buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite) buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch. buf_page_t::read_complete(): Renamed from buf_page_read_complete(). Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch. buf_page_t::can_relocate(): If the page latch is being held or waited for, or the block is buffer-fixed or io-fixed, return false. (The condition on the page latch is new.) Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we will acquire the page latch before fix(), and unfix() before unlocking. buf_page_t::flush(): Replaces buf_flush_page(). Optimize the handling of FREED pages. buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held by the caller. buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates. buf_page_get_low(): Ignore guesses that are read-fixed because they may not yet be registered in buf_pool.page_hash and buf_pool.LRU. buf_page_optimistic_get(): Acquire latch before buffer-fixing. buf_page_make_young(): Leave read-fixed blocks alone, because they might not be registered in buf_pool.LRU yet. recv_sys_t::recover_deferred(), recv_sys_t::recover_low(): Possibly fix MDEV-26326, by holding a page X-latch instead of only buffer-fixing the page.
-
- 22 Oct, 2021 1 commit
-
-
Marko Mäkelä authored
This implements memory transaction support for: * Intel Restricted Transactional Memory (RTM), also known as TSX-NI (Transactional Synchronization Extensions New Instructions) * POWER v2.09 Hardware Trace Monitor (HTM) on GNU/Linux transactional_lock_guard, transactional_shared_lock_guard: RAII lock guards that try to elide the lock acquisition when transactional memory is available. buf_pool.page_hash: Try to elide latches whenever feasible. Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED tables, this is not always possible. In buf_page_get_low(), memory transactions only work reasonably well for validating a guessed block address. TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards that try to elide lock_sys.latch and related latches.
-
- 21 Oct, 2021 1 commit
-
-
Marko Mäkelä authored
trx_commit_in_memory(): Do not release the rseg reference before trx_undo_commit_cleanup() has been invoked and the current transaction is truly done with the rollback segment. The purpose of the reference count is to prevent data races with trx_purge_truncate_history(). This is based on mysql/mysql-server@ac79aa1522f33e6eb912133a81fa2614db764c9c.
-
- 18 Oct, 2021 1 commit
-
-
Marko Mäkelä authored
The purpose of non-exclusive locks in a transaction is to guarantee that the records covered by those locks must remain in that way until the transaction is committed. (The purpose of gap locks is to ensure that a record that was nonexistent will remain that way.) Once a transaction has reached the XA PREPARE state, the only allowed further actions are XA ROLLBACK or XA COMMIT. Therefore, it can be argued that only the exclusive locks that the XA PREPARE transaction is holding are essential. Furthermore, InnoDB never preserved explicit locks across server restart. For XA PREPARE transations, we will only recover implicit exclusive locks for records that had been modified. Because of the fact that XA PREPARE followed by a server restart will cause some locks to be lost, we might as well always release all non-exclusive locks during the execution of an XA PREPARE statement. lock_release_on_prepare(): Release non-exclusive locks on XA PREPARE. trx_prepare(): Invoke lock_release_on_prepare() unless the isolation level is SERIALIZABLE or this is an internal distributed transaction with the binlog (not actual XA PREPARE statement). This has been discussed with Sergei Golubchik and Andrei Elkin. Reviewed by: Sergei Golubchik
-
- 06 Sep, 2021 1 commit
-
-
Marko Mäkelä authored
Invoking ut_delay(srv_wpin_wait_delay) inside a spinloop would cause a read of 2 global variables as well as multiplication. Let us loop around MY_RELAX_CPU() using a precomputed loop count to keep the loops simpler, to help them scale better. We also tried precomputing the delay into a global variable, but that appeared to result in slightly worse throughput.
-
- 31 Aug, 2021 3 commits
-
-
Marko Mäkelä authored
In commit 1bd681c8 (MDEV-25506 part 3) we introduced a "fake instant timeout" when a transaction would wait for a table or record lock while holding dict_sys.latch. This prevented a deadlock of the server but could cause bogus errors for operations on the InnoDB persistent statistics tables. A better fix is to ensure that whenever a transaction is being executed in the InnoDB internal SQL parser (which will for now require dict_sys.latch to be held), it will already have acquired all locks that could be required for the execution. So, we will acquire the following locks upfront, before acquiring dict_sys.latch: (1) MDL on the affected user table (acquired by the SQL layer) (2) If applicable (not for RENAME TABLE): InnoDB table lock (3) If persistent statistics are going to be modified: (3.a) MDL_SHARED on mysql.innodb_table_stats, mysql.innodb_index_stats (3.b) exclusive table locks on the statistics tables (4) Exclusive table locks on the InnoDB data dictionary tables (not needed in ANALYZE TABLE and the like) Note: Acquiring exclusive locks on the statistics tables may cause more locking conflicts between concurrent DDL operations. Notably, RENAME TABLE will lock the statistics tables even if no persistent statistics are enabled for the table. DROP DATABASE will only acquire locks on statistics tables if persistent statistics are enabled for the tables on which the SQL layer is invoking ha_innobase::delete_table(). For any "garbage collection" in innodb_drop_database(), a timeout while acquiring locks on the statistics tables will result in any statistics not being deleted for any tables that the SQL layer did not know about. If innodb_defragment=ON, information may be written to the statistics tables even for tables for which InnoDB persistent statistics are disabled. But, DROP TABLE will no longer attempt to delete that information if persistent statistics are not enabled for the table. This change should also fix the hangs related to InnoDB persistent statistics and STATS_AUTO_RECALC (MDEV-15020) as well as a bug that running ALTER TABLE on the statistics tables concurrently with running ALTER TABLE on InnoDB tables could cause trouble. lock_rec_enqueue_waiting(), lock_table_enqueue_waiting(): Do not issue a fake instant timeout error when the transaction is holding dict_sys.latch. Instead, assert that the dict_sys.latch is never being held here. lock_sys_tables(): A new function to acquire exclusive locks on all dictionary tables, in case DROP TABLE or similar operation is being executed. Locking non-hard-coded tables is optional to avoid a crash in row_merge_drop_temp_indexes(). The SYS_VIRTUAL table was introduced in MySQL 5.7 and MariaDB Server 10.2. Normally, we require all these dictionary tables to exist before executing any DDL, but the function row_merge_drop_temp_indexes() is an exception. When upgrading from MariaDB Server 10.1 or MySQL 5.6 or earlier, the table SYS_VIRTUAL would not exist at this point. ha_innobase::commit_inplace_alter_table(): Invoke log_write_up_to() while not holding dict_sys.latch. dict_sys_t::remove(), dict_table_close(): No longer try to drop index stubs that were left behind by aborted online ADD INDEX. Such indexes should be dropped from the InnoDB data dictionary by row_merge_drop_indexes() as part of the failed DDL operation. Stubs for aborted indexes may only be left behind in the data dictionary cache. dict_stats_fetch_from_ps(): Use a normal read-only transaction. ha_innobase::delete_table(), ha_innobase::truncate(), fts_lock_table(): While waiting for purge to stop using the table, do not hold dict_sys.latch. ha_innobase::delete_table(): Implement a work-around for the rollback of ALTER TABLE...ADD PARTITION. MDL_EXCLUSIVE would not be held if ALTER TABLE hits lock_wait_timeout while trying to upgrade the MDL due to a conflicting LOCK TABLES, such as in the first ALTER TABLE in the test case of Bug#53676 in parts.partition_special_innodb. Therefore, we must explicitly stop purge, because it would not be stopped by MDL. dict_stats_func(), btr_defragment_chunk(): Allocate a THD so that we can acquire MDL on the InnoDB persistent statistics tables. mysqltest_embedded: Invoke ha_pre_shutdown() before free_used_memory() in order to avoid ASAN heap-use-after-free related to acquire_thd(). trx_t::dict_operation_lock_mode: Changed the type to bool. row_mysql_lock_data_dictionary(), row_mysql_unlock_data_dictionary(): Implemented as macros. rollback_inplace_alter_table(): Apply an infinite timeout to lock waits. innodb_thd_increment_pending_ops(): Wrapper for thd_increment_pending_ops(). Never attempt async operation for InnoDB background threads, such as the trx_t::commit() in dict_stats_process_entry_from_recalc_pool(). lock_sys_t::cancel(trx_t*): Make dictionary transactions immune to KILL. lock_wait(): Make dictionary transactions immune to KILL, and to lock wait timeout when waiting for locks on dictionary tables. parts.partition_special_innodb: Use lock_wait_timeout=0 to instantly get ER_LOCK_WAIT_TIMEOUT. main.mdl: Filter out MDL on InnoDB persistent statistics tables Reviewed by: Thirunarayanan Balathandayuthapani
-
Marko Mäkelä authored
que_eval_sql(): Remove the parameter lock_dict. The only caller with lock_dict=true was dict_stats_exec_sql(), which will now explicitly invoke dict_sys.lock() and dict_sys.unlock() by itself. row_import_cleanup(): Do not unnecessarily lock the dictionary. Concurrent access to the table during ALTER TABLE...IMPORT TABLESPACE is prevented by MDL and the fact that there cannot exist any undo log or change buffer records that would refer to the table or tablespace. row_import_for_mysql(): Do not unnecessarily lock the dictionary while accessing fil_system. Thanks to MDL_EXCLUSIVE that was acquired by the SQL layer, only one IMPORT may be in effect for the table name. row_quiesce_set_state(): Do not unnecessarily lock the dictionary. The dict_table_t::quiesce state is documented to be protected by all index latches, which we are acquiring. dict_table_close(): Introduce a simpler variant with fewer parameters. dict_table_close(): Reduce the amount of calls. We can simply invoke dict_table_t::release() on startup or in DDL operations, or when the table is inaccessible. In none of these cases, there is no need to invalidate the InnoDB persistent statistics. pars_info_t::graph_owns_us: Remove (unused). pars_info_free(): Define inline. fts_delete(), trx_t::evict_table(), row_prebuilt_free(), row_rename_table_for_mysql(): Simplify. row_mysql_lock_data_dictionary(): Remove some references; use dict_sys.lock() and dict_sys.unlock() instead. row_mysql_lock_table(): Remove. Use lock_table_for_trx() instead. ha_innobase::check_if_supported_inplace_alter(), row_create_table_for_mysql(): Simply assert dict_sys.sys_tables_exist(). In commit 49e2c8f0 and commit 1bd681c8 srv_start() actually guarantees that the system tables will exist, or the server is in read-only mode, or startup will fail. Reviewed by: Thirunarayanan Balathandayuthapani
-
Marko Mäkelä authored
In the parent commit, dict_sys.latch could theoretically have been replaced with a mutex. But, we can do better and merge dict_sys.mutex into dict_sys.latch. Generally, every occurrence of dict_sys.mutex_lock() will be replaced with dict_sys.lock(). The PERFORMANCE_SCHEMA instrumentation for dict_sys_mutex will be removed along with dict_sys.mutex. The dict_sys.latch will remain instrumented as dict_operation_lock. Some use of dict_sys.lock() will be replaced with dict_sys.freeze(), which we will reintroduce for the new shared mode. Most notably, concurrent table lookups are possible as long as the tables are present in the dict_sys cache. In particular, this will allow more concurrency among InnoDB purge workers. Because dict_sys.mutex will no longer 'throttle' the threads that purge InnoDB transaction history, a performance degradation may be observed unless innodb_purge_threads=1. The table cache eviction policy will become FIFO-like, similar to what happened to fil_system.LRU in commit 45ed9dd9. The name of the list dict_sys.table_LRU will become somewhat misleading; that list contains tables that may be evicted, even though the eviction policy no longer is least-recently-used but first-in-first-out. (Note: Tables can never be evicted as long as locks exist on them or the tables are in use by some thread.) As demonstrated by the test perfschema.sxlock_func, there will be less contention on dict_sys.latch, because some previous use of exclusive latches will be replaced with shared latches. fts_parse_sql_no_dict_lock(): Replaced with pars_sql(). fts_get_table_name_prefix(): Merged to fts_optimize_create(). dict_stats_update_transient_for_index(): Deduplicated some code. ha_innobase::info_low(), dict_stats_stop_bg(): Use a combination of dict_sys.latch and table->stats_mutex_lock() to cover the changes of BG_STAT_SHOULD_QUIT, because the flag is being read in dict_stats_update_persistent() while not holding dict_sys.latch. row_discard_tablespace_for_mysql(): Protect stats_bg_flag by exclusive dict_sys.latch, like most other code does. row_quiesce_table_has_fts_index(): Remove unnecessary mutex acquisition. FLUSH TABLES...FOR EXPORT is protected by MDL. row_import::set_root_by_heuristic(): Remove unnecessary mutex acquisition. ALTER TABLE...IMPORT TABLESPACE is protected by MDL. row_ins_sec_index_entry_low(): Replace a call to dict_set_corrupted_index_cache_only(). Reads of index->type were not really protected by dict_sys.mutex, and writes (flagging an index corrupted) should be extremely rare. dict_stats_process_entry_from_defrag_pool(): Only freeze the dictionary, do not lock it exclusively. dict_stats_wait_bg_to_stop_using_table(), DICT_BG_YIELD: Remove trx. We can simply invoke dict_sys.unlock() and dict_sys.lock() directly. dict_acquire_mdl_shared()<trylock=false>: Assert that dict_sys.latch is only held in shared more, not exclusive mode. Only acquire it in exclusive mode if the table needs to be loaded to the cache. dict_sys_t::acquire(): Remove. Relocating elements in dict_sys.table_LRU would require holding an exclusive latch, which we want to avoid for performance reasons. dict_sys_t::allow_eviction(): Add the table first to dict_sys.table_LRU, to compensate for the removal of dict_sys_t::acquire(). This function is only invoked by INFORMATION_SCHEMA.INNODB_SYS_TABLESTATS. dict_table_open_on_id(), dict_table_open_on_name(): If dict_locked=false, try to acquire dict_sys.latch in shared mode. Only acquire the latch in exclusive mode if the table is not found in the cache. Reviewed by: Thirunarayanan Balathandayuthapani
-
- 27 Jul, 2021 1 commit
-
-
Marko Mäkelä authored
trx_t::will_lock: Changed the type to bool. trx_t::is_autocommit_non_locking(): Replaces trx_is_autocommit_non_locking(). trx_is_ac_nl_ro(): Remove (replaced with equivalent assertion expressions). assert_trx_nonlocking_or_in_list(): Remove. Replaced with at least as strict checks in each place. check_trx_state(): Moved to a static function; partially replaced with individual debug assertions implementing equivalent or stricter checks. This is a backport of commit 7b51d11c from 10.5.
-
- 23 Jul, 2021 1 commit
-
-
Marko Mäkelä authored
trx_flush_log_if_needed_low(): Do flush for both innodb_flush_log_at_trx_commit=1 and innodb_flush_log_at_trx_commit=3. This is the 10.6 version of 10.2 commit 5f8651ac which fixed a regression that had been introduced in 2e814d47 (breaking 288eeb3a).
-
- 22 Jul, 2021 1 commit
-
-
Marko Mäkelä authored
Starting with commit 6e12ebd4 (MDEV-25062), srv_wake_purge_thread_if_not_active() became more expensive operation, especially on NUMA systems, because instead of reading an atomic global variable trx_sys.rseg_history_len we are traversing up to 128 cache lines in trx_sys.history_exists(). trx_t::commit_cleanup(): Do not wake up purge at all. We will wake up purge about once per second in srv_master_callback(). srv_master_do_active_tasks(), srv_master_do_idle_tasks(): Move some duplicated code to srv_master_callback(). srv_master_callback(): Invoke purge_coordinator_timer_callback() to ensure that purge will be periodically woken up, even if the latest execution of trx_t::commit_cleanup() allowed the purge view to advance but did not wake up purge. Do not call log_free_check(), because every thread that is going to generate redo log is supposed to call that function anyway, before acquiring any page latches. Additional calls to the function every few seconds should not make any difference. srv_shutdown_threads(): Ensure that srv_shutdown_state can be at most SRV_SHUTDOWN_INITIATED in srv_master_callback(), by first invoking srv_master_timer.reset() before changing srv_shutdown_state. (Note: We first terminate the srv_master_callback and only then terminate the purge tasks. Thus, the purge subsystem should exist when srv_master_callback() invokes purge_coordinator_timer_callback() if it was initiated in the first place.
-
- 20 Jul, 2021 1 commit
-
-
Jagdeep Sidhu authored
In commit 2e814d47 on MariaDB 10.2 the switch case statement in trx_flush_log_if_needed_low() regressed. Since 10.2 this code was refactored to have switches in descending order, so value of 3 for innodb_flush_log_at_trx_commit is behaving the same as value of 2, that is no FSYNC is being enforced during COMMIT phase. The switch should however not be empty and cases 2 and 3 should not have the identical contents. As per documentation, setting innodb_flush_log_at_trx_commit to 3 should do FSYNC to disk if innodb_flush_log_at_trx_commit is set to 3. This fixes the regression so that the switch statement again does what users expect the setting should do. All new code of the whole pull request, including one or several files that are either new files or modified ones, are contributed under the BSD-new license. I am contributing on behalf of my employer Amazon Web Services, Inc.
-
- 03 Jul, 2021 1 commit
-
-
Marko Mäkelä authored
trx_t::free(): Declare xid as fully initialized in order to avoid tripping the subsequent MEM_CHECK_DEFINED (in WITH_MSAN and WITH_VALGRIND builds).
-
- 01 Jul, 2021 2 commits
-
-
Marko Mäkelä authored
With commit 1bd681c8 (MDEV-25506) it no longer is necessary to run DDL and DML operations in separate transactions. Let us remove the flag trx_t::internal. Dictionary transactions will be distinguished by trx_t::dict_operation.
-
Marko Mäkelä authored
The trx_t::xid is always allocated, so we might as well allocate it directly in the trx_t object to improve the locality of reference.
-
- 24 Jun, 2021 1 commit
-
-
Marko Mäkelä authored
trx_t::commit_in_memory(): Do not initiate a redo log write if the transaction has no visible effect. If anything for this transaction had been made durable, crash recovery will roll back the transaction just fine even if the end of ROLLBACK is not durably written. Rollbacks of transactions that are associated with XA identifiers (possibly internally via the binlog) will always be persisted. The test rpl.rpl_gtid_crash covers this.
-
- 23 Jun, 2021 1 commit
-
-
Marko Mäkelä authored
redo_rseg_mutex, noredo_rseg_mutex: Remove the PERFORMANCE_SCHEMA keys. The rollback segment mutex will be uninstrumented. trx_sys_t: Remove pointer indirection for rseg_array, temp_rseg. Align each element to the cache line. trx_sys_t::rseg_id(): Replaces trx_rseg_t::id. trx_rseg_t::ref: Replaces needs_purge, trx_ref_count, skip_allocation in a single std::atomic<uint32_t>. trx_rseg_t::latch: Replaces trx_rseg_t::mutex. trx_rseg_t::history_size: Replaces trx_sys_t::rseg_history_len trx_sys_t::history_size_approx(): Replaces trx_sys.rseg_history_len in those places where the exact count does not matter. We must not acquire any trx_rseg_t::latch while holding index page latches, because normally the trx_rseg_t::latch is acquired before any page latches. trx_sys_t::history_exists(): Replaces trx_sys.rseg_history_len!=0 with an approximation. We remove some unnecessary trx_rseg_t::latch acquisition around trx_undo_set_state_at_prepare() and trx_undo_set_state_at_finish(). Those operations will only access fields that remain constant after trx_rseg_t::init().
-
- 21 Jun, 2021 1 commit
-
-
Marko Mäkelä authored
Let us simply refuse an upgrade from earlier versions if the upgrade procedure was not followed. This simplifies the purge, commit, and rollback of transactions. Before upgrading to MariaDB 10.3 or later, a clean shutdown of the server (with innodb_fast_shutdown=1 or 0) is necessary, to ensure that any incomplete transactions are rolled back. The undo log format was changed in MDEV-12288. There is only one persistent undo log for each transaction.
-
- 09 Jun, 2021 1 commit
-
-
Marko Mäkelä authored
This is a complete rewrite of DROP TABLE, also as part of other DDL, such as ALTER TABLE, CREATE TABLE...SELECT, TRUNCATE TABLE. The background DROP TABLE queue hack is removed. If a transaction needs to drop and create a table by the same name (like TRUNCATE TABLE does), it must first rename the table to an internal #sql-ib name. No committed version of the data dictionary will include any #sql-ib tables, because whenever a transaction renames a table to a #sql-ib name, it will also drop that table. Either the rename will be rolled back, or the drop will be committed. Data files will be unlinked after the transaction has been committed and a FILE_RENAME record has been durably written. The file will actually be deleted when the detached file handle returned by fil_delete_tablespace() will be closed, after the latches have been released. It is possible that a purge of the delete of the SYS_INDEXES record for the clustered index will execute fil_delete_tablespace() concurrently with the DDL transaction. In that case, the thread that arrives later will wait for the other thread to finish. HTON_TRUNCATE_REQUIRES_EXCLUSIVE_USE: A new handler flag. ha_innobase::truncate() now requires that all other references to the table be released in advance. This was implemented by Monty. ha_innobase::delete_table(): If CREATE TABLE..SELECT is detected, we will "hijack" the current transaction, drop the table in the current transaction and commit the current transaction. This essentially fixes MDEV-21602. There is a FIXME comment about making the check less failure-prone. ha_innobase::truncate(), ha_innobase::delete_table(): Implement a fast path for temporary tables. We will no longer allow temporary tables to use the adaptive hash index. dict_table_t::mdl_name: The original table name for the purpose of acquiring MDL in purge, to prevent a race condition between a DDL transaction that is dropping a table, and purge processing undo log records of DML that had executed before the DDL operation. For #sql-backup- tables during ALTER TABLE...ALGORITHM=COPY, the dict_table_t::mdl_name will differ from dict_table_t::name. dict_table_t::parse_name(): Use mdl_name instead of name. dict_table_rename_in_cache(): Update mdl_name. For the internal FTS_ tables of FULLTEXT INDEX, purge would acquire MDL on the FTS_ table name, but not on the main table, and therefore it would be able to run concurrently with a DDL transaction that is dropping the table. Previously, the DROP TABLE queue hack prevented a race between purge and DDL. For now, we introduce purge_sys.stop_FTS() to prevent purge from opening any table, while a DDL transaction that may drop FTS_ tables is in progress. The function fts_lock_table(), which will be invoked before the dictionary is locked, will wait for purge to release any table handles. trx_t::drop_table_statistics(): Drop statistics for the table. This replaces dict_stats_drop_index(). We will drop or rename persistent statistics atomically as part of DDL transactions. On lock conflict for dropping statistics, we will fail instantly with DB_LOCK_WAIT_TIMEOUT, because we will be holding the exclusive data dictionary latch. trx_t::commit_cleanup(): Separated from trx_t::commit_in_memory(). Relax an assertion around fts_commit() and allow DB_LOCK_WAIT_TIMEOUT in addition to DB_DUPLICATE_KEY. The call to fts_commit() is entirely misplaced here and may obviously break the consistency of transactions that affect FULLTEXT INDEX. It needs to be fixed separately. dict_table_t::n_foreign_key_checks_running: Remove (MDEV-21175). The counter was a work-around for missing meta-data locking (MDL) on the SQL layer, and not really needed in MariaDB. ER_TABLE_IN_FK_CHECK: Replaced with ER_UNUSED_28. HA_ERR_TABLE_IN_FK_CHECK: Remove. row_ins_check_foreign_constraints(): Do not acquire dict_sys.latch either. The SQL-layer MDL will protect us. This was reviewed by Thirunarayanan Balathandayuthapani and tested by Matthias Leich.
-
- 27 May, 2021 1 commit
-
-
Marko Mäkelä authored
Back in 2006 or 2007, when MySQL AB and Innobase Oy existed as separately controlled entities (Innobase had been acquired by Oracle Corporation), MySQL 5.1 introduced a storage engine plugin interface and Oracle made use of it by distributing a separate InnoDB Plugin, which would contain some more bug fixes and improvements, compared to the version of InnoDB that was statically linked with the mysqld server that was distributed by MySQL AB. The built-in InnoDB would export global symbols, which would clash with the symbols of the dynamic InnoDB Plugin (which was supposed to override the built-in one when present). The solution to this problem was to declare all global symbols with UNIV_INTERN, so that they would get the GCC function attribute that specifies hidden visibility. Later, in MariaDB Server, something based on Percona XtraDB (a fork of MySQL InnoDB) became the statically linked implementation, and something closer to MySQL InnoDB was available as a dynamic plugin. Starting with version 10.2, MariaDB Server includes only one InnoDB implementation, and hence any reason to have the UNIV_INTERN definition was lost. btr_get_size_and_reserved(): Move to the same compilation unit with the only caller. innodb_set_buf_pool_size(): Remove. Modify innobase_buffer_pool_size directly. fil_crypt_calculate_checksum(): Merge to the only caller. ha_innobase::innobase_reset_autoinc(): Merge to the only caller. thd_query_start_micro(): Remove. Call thd_start_utime() directly.
-
- 21 May, 2021 1 commit
-
-
Marko Mäkelä authored
Many InnoDB data dictionary cache operations require that the table name be copied so that it will be NUL terminated. (For example, SYS_TABLES.NAME is not guaranteed to be NUL-terminated.) dict_table_t::is_garbage_name(): Check if a name belongs to the background drop table queue. dict_check_if_system_table_exists(): Remove. dict_sys_t::load_sys_tables(): Load the non-hard-coded system tables SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL on startup. dict_sys_t::create_or_check_sys_tables(): Replaces dict_create_or_check_foreign_constraint_tables() and dict_create_or_check_sys_virtual(). dict_sys_t::load_table(): Replaces dict_table_get_low() and dict_load_table(). dict_sys_t::find_table(): Renamed from get_table(). dict_sys_t::sys_tables_exist(): Check whether all the non-hard-coded tables SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL exist. trx_t::has_stats_table_lock(): Moved to dict0stats.cc. Some error messages will now report table names in the internal databasename/tablename format, instead of `databasename`.`tablename`.
-
- 19 May, 2021 1 commit
-
-
Monty authored
MDEV-25604 Atomic DDL: Binlog event written upon recovery does not have default database The purpose of this task is to ensure that ALTER TABLE is atomic even if the MariaDB server would be killed at any point of the alter table. This means that either the ALTER TABLE succeeds (including that triggers, the status tables and the binary log are updated) or things should be reverted to their original state. If the server crashes before the new version is fully up to date and commited, it will revert to the original table and remove all temporary files and tables. If the new version is commited, crash recovery will use the new version, and update triggers, the status tables and the binary log. The one execption is ALTER TABLE .. RENAME .. where no changes are done to table definition. This one will work as RENAME and roll back unless the whole statement completed, including updating the binary log (if enabled). Other changes: - Added handlerton->check_version() function to allow the ddl recovery code to check, in case of inplace alter table, if the table in the storage engine is of the new or old version. - Added handler->table_version() so that an engine can report the current version of the table. This should be changed each time the table definition changes. - Added ha_signal_ddl_recovery_done() and handlerton::signal_ddl_recovery_done() to inform all handlers when ddl recovery has been done. (Needed by InnoDB). - Added handlerton call inplace_alter_table_committed, to signal engine that ddl_log has been closed for the alter table query. - Added new handerton flag HTON_REQUIRES_NOTIFY_TABLEDEF_CHANGED_AFTER_COMMIT to signal when we should call hton->notify_tabledef_changed() during mysql_inplace_alter_table. This was required as MyRocks and InnoDB needed the call at different times. - Added function server_uuid_value() to be able to generate a temporary xid when ddl recovery writes the query to the binary log. This is needed to be able to handle crashes during ddl log recovery. - Moved freeing of the frm definition to end of mysql_alter_table() to remove duplicate code and have a common exit strategy. ------- InnoDB part of atomic ALTER TABLE (Implemented by Marko Mäkelä) innodb_check_version(): Compare the saved dict_table_t::def_trx_id to determine whether an ALTER TABLE operation was committed. We must correctly recover dict_table_t::def_trx_id for this to work. Before purge removes any trace of DB_TRX_ID from system tables, it will make an effort to load the user table into the cache, so that the dict_table_t::def_trx_id can be recovered. ha_innobase::table_version(): return garbage, or the trx_id that would be used for committing an ALTER TABLE operation. In InnoDB, table names starting with #sql-ib will remain special: they will be dropped on startup. This may be revisited later in MDEV-18518 when we implement proper undo logging and rollback for creating or dropping multiple tables in a transaction. Table names starting with #sql will retain some special meaning: dict_table_t::parse_name() will not consider such names for MDL acquisition, and dict_table_rename_in_cache() will treat such names specially when handling FOREIGN KEY constraints. Simplify InnoDB DROP INDEX. Prevent purge wakeup To ensure that dict_table_t::def_trx_id will be recovered correctly in case the server is killed before ddl_log_complete(), we will block the purge of any history in SYS_TABLES, SYS_INDEXES, SYS_COLUMNS between ha_innobase::commit_inplace_alter_table(commit=true) (purge_sys.stop_SYS()) and purge_sys.resume_SYS(). The completion callback purge_sys.resume_SYS() must be between ddl_log_complete() and MDL release. -------- MyRocks support for atomic ALTER TABLE (Implemented by Sergui Petrunia) Implement these SE API functions: - ha_rocksdb::table_version() - hton->check_version = rocksdb_check_versionMyRocks data dictionary now stores table version for each table. (Absence of table version record is interpreted as table_version=0, that is, which means no upgrade changes are needed) - For inplace alter table of a partitioned table, call the underlying handlerton when checking if the table is ok. This assumes that the partition engine commits all changes at once.
-