Commits · 485a1b1f116f0c5e73fce3a97ffdac84c861b3c2 · nexedi / MariaDB

An error occurred fetching the project authors.

18 Apr, 2023 1 commit

MDEV-30863 Server freeze, all threads in trx_assign_rseg_low() · 485a1b1f

Marko Mäkelä authored 1 year ago

trx_assign_rseg_low(): Simplify the debug check.

trx_rseg_t::reinit(): Reset the skip_allocation() flag.
This logic was broken in the merge
commit 3e2ad0e9
of commit 0de3be8c
(that is, innodb_undo_log_truncate=ON would never be "completed").

Tested by: Matthias Leich

485a1b1f

16 Mar, 2023 1 commit

MDEV-30357 Performance regression in locking reads from secondary indexes · 4105017a

Marko Mäkelä authored 1 year ago

lock_sec_rec_some_has_impl(): Remove a harmful condition that caused the
performance regression and should not have been added in
commit b6e41e38 in the first place.
Locking transactions that have not modified any persistent tables
can carry the transaction identifier 0.

trx_t::max_inactive_id: A cache for trx_sys_t::find_same_or_older().
The value is not reset on transaction commit so that previous results
can be reused for subsequent transactions. The smallest active
transaction ID can only increase over time, not decrease.

trx_sys_t::find_same_or_older(): Remember the maximum previous id for which
rw_trx_hash.iterate() returned false, to avoid redundant iterations.

lock_sec_rec_read_check_and_lock(): Add an early return in case we are
already holding a covering table lock.

lock_rec_convert_impl_to_expl(): Add a template parameter to avoid
a redundant run-time check on whether the index is secondary.

lock_rec_convert_impl_to_expl_for_trx(): Move some code from
lock_rec_convert_impl_to_expl(), to reduce code duplication due
to the added template parameter.

Reviewed by: Vladislav Lesin
Tested by: Matthias Leich

4105017a

02 Mar, 2023 1 commit

MDEV-30341 Reset check_foreigns, check_unique_secondary variables · 49e2b50d

Thirunarayanan Balathandayuthapani authored 2 years ago

- InnoDB fails to reset the check_foreigns and check_unique_secondary
in trx_t::free(), trx_t::commit_cleanup(). This lead to bulk insert
in internal innodb fts table operation.

49e2b50d

24 Feb, 2023 1 commit

MDEV-30671 InnoDB undo log truncation fails to wait for purge of history · 0de3be8c

Marko Mäkelä authored 2 years ago

It is not safe to invoke trx_purge_free_segment() or execute
innodb_undo_log_truncate=ON before all undo log records in
the rollback segment has been processed.

A prominent failure that would occur due to premature freeing of
undo log pages is that trx_undo_get_undo_rec() would crash when
trying to copy an undo log record to fetch the previous version
of a record.

If trx_undo_get_undo_rec() was not invoked in the unlucky time frame,
then the symptom would be that some committed transaction history is
never removed. This would be detected by CHECK TABLE...EXTENDED that
was impleented in commit ab019010.
Such a garbage collection leak should be possible even when using
innodb_undo_log_truncate=OFF, just involving trx_purge_free_segment().

trx_rseg_t::needs_purge: Change the type from Boolean to a transaction
identifier, noting the most recent non-purged transaction, or 0 if
everything has been purged. On transaction start, we initialize this
to 1 more than the transaction start ID. On recovery, the field may be
adjusted to the transaction end ID (TRX_UNDO_TRX_NO) if it is larger.

The field TRX_UNDO_NEEDS_PURGE becomes write-only; only some debug
assertions that would validate the value. The field reflects the old
inaccurate Boolean field trx_rseg_t::needs_purge.

trx_undo_mem_create_at_db_start(), trx_undo_lists_init(),
trx_rseg_mem_restore(): Remove the parameter max_trx_id.
Instead, store the maximum in trx_rseg_t::needs_purge,
where trx_rseg_array_init() will find it.

trx_purge_free_segment(): Contiguously hold a lock on
trx_rseg_t to prevent any concurrent allocation of undo log.

trx_purge_truncate_rseg_history(): Only invoke trx_purge_free_segment()
if the rollback segment is empty and there are no pending transactions
associated with it.

trx_purge_truncate_history(): Only proceed with innodb_undo_log_truncate=ON
if trx_rseg_t::needs_purge indicates that all history has been purged.

Tested by: Matthias Leich

0de3be8c

09 Feb, 2023 1 commit

Apply clang-tidy to remove empty constructors / destructors · 08c85202

Vicențiu Ciorbaru authored 2 years ago

This patch is the result of running
run-clang-tidy -fix -header-filter=.* -checks='-*,modernize-use-equals-default' .

Code style changes have been done on top. The result of this change
leads to the following improvements:

1. Binary size reduction.
* For a -DBUILD_CONFIG=mysql_release build, the binary size is reduced by
  ~400kb.
* A raw -DCMAKE_BUILD_TYPE=Release reduces the binary size by ~1.4kb.

2. Compiler can better understand the intent of the code, thus it leads
   to more optimization possibilities. Additionally it enabled detecting
   unused variables that had an empty default constructor but not marked
   so explicitly.

   Particular change required following this patch in sql/opt_range.cc

   result_keys, an unused template class Bitmap now correctly issues
   unused variable warnings.

   Setting Bitmap template class constructor to default allows the compiler
   to identify that there are no side-effects when instantiating the class.
   Previously the compiler could not issue the warning as it assumed Bitmap
   class (being a template) would not be performing a NO-OP for its default
   constructor. This prevented the "unused variable warning".

08c85202

17 Nov, 2022 1 commit

MDEV-29603 btr_cur_open_at_index_side() is missing some consistency checks · 24fe5347

Marko Mäkelä authored 2 years ago

btr_cur_t: Zero-initialize all fields in the default constructor.

btr_cur_t::index: Remove; it duplicated page_cur.index.

Many functions: Remove arguments that were duplicating
page_cur_t::index and page_cur_t::block.

page_cur_open_level(), btr_pcur_open_level(): Replaces
btr_cur_open_at_index_side() for dict_stats_analyze_index().
At the end, release all latches except the dict_index_t::lock
and the buf_page_t::lock on the requested page.

dict_stats_analyze_index(): Rely on mtr_t::rollback_to_savepoint()
to release all uninteresting page latches.

btr_search_guess_on_hash(): Simplify the logic, and invoke
mtr_t::rollback_to_savepoint().

We will use plain C++ std::vector<mtr_memo_slot_t> for mtr_t::m_memo.
In this way, we can avoid setting mtr_memo_slot_t::object to nullptr
and instead just remove garbage from m_memo.

mtr_t::rollback_to_savepoint(): Shrink the vector. We will be needing this
in dict_stats_analyze_index(), where we will release page latches and
only retain the index->lock in mtr_t::m_memo.

mtr_t::release_last_page(): Release the last acquired page latch.
Replaces btr_leaf_page_release().

mtr_t::release(const buf_block_t&): Release a single page latch.
Used in btr_pcur_move_backward_from_page().

mtr_t::memo_release(): Replaced with mtr_t::release().

mtr_t::upgrade_buffer_fix(): Acquire a latch for a buffer-fixed page.
This replaces the double bookkeeping in btr_cur_t::open_leaf().

Reviewed by: Vladislav Lesin

24fe5347

25 Oct, 2022 1 commit

MDEV-29843 Do not use asynchronous log_write_upto() for system THDs · b7fe6179

Vladislav Vaintroub authored 2 years ago

Non-blocking log_write_upto (MDEV-24341) was only designed  for the
client connections. Fix, so it is not be triggered for any system THD.

Previously, an incomplete solution only excluded Innodb purge THDs, but
not  the slave for example.

The hang in MDEV still remains somewhat a mystery though, it is not
immediately clear how exactly condition variable can become corrupted.
But it is clear that it can be avoided.

b7fe6179

24 Oct, 2022 1 commit

MDEV-28709 unexpected X lock on Supremum in READ COMMITTED · 8128a468

Vlad Lesin authored 2 years ago

The lock is created during page splitting after moving records and
locks(lock_move_rec_list_(start|end)()) to the new page, and inheriting
the locks to the supremum of left page from the successor of the infimum
on right page.

There is no need in such inheritance for READ COMMITTED isolation level
and not-gap locks, so the fix is to add the corresponding condition in
gap lock inheritance function.

One more fix is to forbid gap lock inheritance if XA was prepared. Use the
most significant bit of trx_t::n_ref to indicate that gap lock inheritance
is forbidden. This fix is based on
mysql/mysql-server@b063e52a8367dc9d5ed418e7f6d96400867e9f43

8128a468

21 Oct, 2022 2 commits

MDEV-29622 Wrong assertions in lock_cancel_waiting_and_release() for deadlock resolving caller · 9c04d66d

Vlad Lesin authored 2 years ago

Suppose we have two transactions, trx 1 and trx 2.

trx 2 does deadlock resolving from lock_wait(), it sets
victim->lock.was_chosen_as_deadlock_victim=true for trx 1, but has not
yet invoked lock_cancel_waiting_and_release().

trx 1 checks the flag in lock_trx_handle_wait(), and starts rollback
from row_mysql_handle_errors(). It can change trx->lock.wait_thr and
trx->state as it holds trx_t::mutex, but trx 2 has not yet requested it,
as lock_cancel_waiting_and_release() has not yet been called.

After that trx 1 tries to release locks in trx_t::rollback_low(),
invoking trx_t::rollback_finish(). lock_release() is blocked on try to
acquire lock_sys.rd_lock(SRW_LOCK_CALL) in lock_release_try(), as
lock_sys is blocked by trx 2, as deadlock resolution works under
lock_sys.wr_lock(SRW_LOCK_CALL), see Deadlock::report() for details.

trx 2 executes lock_cancel_waiting_and_release() for deadlock victim, i.
e. for trx 1. lock_cancel_waiting_and_release() contains some
trx->lock.wait_thr and trx->state assertions, which will fail, because
trx 1 has changed them during rollback execution.

So, according to the above scenario, it's legal to have
trx->lock.wait_thr==0 and trx->state!=TRX_STATE_ACTIVE in
lock_cancel_waiting_and_release(), if it was invoked from
Deadlock::report(), and the fix is just in the assertion conditions
changing.

The fix is just in changing assertion condition.

There is also lock_wait() cleanup around trx->error_state.

If trx->error_state can be changed not by the owned thread, it must be
protected with lock_sys.wait_mutex, as lock_wait() uses trx->lock.cond
along with that mutex.

Also if trx->error_state was changed before lock_sys.wait_mutex
acquision, then it could be reset with the following code, what is
wrong. Also we need to check trx->error_state before entering waiting
loop, otherwise it can be the case when trx->error_state was set before
lock_sys.wait_mutex acquision, but the thread will be waiting on
trx->lock.cond.

9c04d66d

MDEV-24402: InnoDB CHECK TABLE ... EXTENDED · ab019010

Marko Mäkelä authored 2 years ago

Until now, the attribute EXTENDED of CHECK TABLE was ignored by InnoDB,
and InnoDB only counted the records in each index according
to the current read view. Unless the attribute QUICK was specified, the
function btr_validate_index() would be invoked to validate the B-tree
structure (the sibling and child links between index pages).

The EXTENDED check will not only count all index records according to the
current read view, but also ensure that any delete-marked records in the
clustered index are waiting for the purge of history, and that all
secondary index records point to a version of the clustered index record
that is waiting for the purge of history. In other words, no index may
contain orphan records. Normal MVCC reads and the non-EXTENDED version
of CHECK TABLE would ignore these orphans.

Unpurged records merely result in warnings (at most one per index),
not errors, and no indexes will be flagged as corrupted due to such
garbage. It will remain possible to SELECT data from such indexes or
tables (which will skip such records) or to rebuild the table to
reclaim some space.

We introduce purge_sys.end_view that will be (almost) a copy of
purge_sys.view at the end of a batch of purging committed transaction
history. It is not an exact copy, because if the size of a purge batch
is limited by innodb_purge_batch_size, some records that
purge_sys.view would allow to be purged will be left over for
subsequent batches.

The purge_sys.view is relevant in the purge of committed transaction
history, to determine if records are safe to remove. The new
purge_sys.end_view is relevant in MVCC operations and in
CHECK TABLE ... EXTENDED. It tells which undo log records are
safe to access (have not been discarded at the end of a purge batch).

purge_sys.clone_oldest_view<true>(): In trx_lists_init_at_db_start(),
clone the oldest read view similar to purge_sys_t::clone_end_view()
so that CHECK TABLE ... EXTENDED will not report bogus failures between
InnoDB restart and the completed purge of committed transaction history.

purge_sys_t::is_purgeable(): Replaces purge_sys_t::changes_visible()
in the case that purge_sys.latch will not be held by the caller.
Among other things, this guards access to BLOBs. It is not safe to
dereference any BLOBs of a delete-marked purgeable record, because
they may have already been freed.

purge_sys_t::view_guard::view(): Return a reference to purge_sys.view
that will be protected by purge_sys.latch, held by purge_sys_t::view_guard.

purge_sys_t::end_view_guard::view(): Return a reference to
purge_sys.end_view while it is protected by purge_sys.end_latch.
Whenever a thread needs to retrieve an older version of a clustered
index record, it will hold a page latch on the clustered index page
and potentially also on a secondary index page that points to the
clustered index page. If these pages contain purgeable records that
would be accessed by a currently running purge batch, the progress of
the purge batch would be blocked by the page latches. Hence, it is
safe to make a copy of purge_sys.end_view while holding an index page
latch, and consult the copy of the view to determine whether a record
should already have been purged.

btr_validate_index(): Remove a redundant check.

row_check_index_match(): Check if a secondary index record and a
version of a clustered index record match each other.

row_check_index(): Replaces row_scan_index_for_mysql().
Count the records in each index directly, duplicating the relevant
logic from row_search_mvcc(). Initialize check_table_extended_view
for CHECK ... EXTENDED while holding an index leaf page latch.
If we encounter an orphan record, the copy of purge_sys.end_view that
we make is safe for visibility checks, and trx_undo_get_undo_rec() will
check for the safety to access each undo log record. Should that check
fail, we should return DB_MISSING_HISTORY to report a corrupted index.
The EXTENDED check tries to match each secondary index record with
every available clustered index record version, by duplicating the logic
of row_vers_build_for_consistent_read() and invoking
trx_undo_prev_version_build() directly.

Before invoking row_check_index_match() on delete-marked clustered index
record versions, we will consult purge_sys.is_purgeable() in order to
avoid accessing freed BLOBs.

We will always check that the DB_TRX_ID or PAGE_MAX_TRX_ID does not
exceed the global maximum. Orphan secondary index records will be
flagged only if everything up to PAGE_MAX_TRX_ID has been purged.
We warn also about clustered index records whose nonzero DB_TRX_ID
should have been reset in purge or rollback.

trx_set_rw_mode(): Move an assertion from ReadView::set_creator_trx_id().

trx_undo_prev_version_build(): Remove two debug-only parameters,
and return an error code instead of a Boolean.

trx_undo_get_undo_rec(): Return a pointer to the undo log record,
or nullptr if one cannot be retrieved. Instead of consulting the
purge_sys.view, consult the purge_sys.end_view to determine which
records can be accessed.

trx_undo_get_rec_if_purgeable(): A variant of trx_undo_get_undo_rec()
that will consult purge_sys.view instead of purge_sys.end_view.

TRX_UNDO_CHECK_PURGEABILITY: A new parameter to
trx_undo_prev_version_build(), passed by row_vers_old_has_index_entry()
so that purge_sys.view instead of purge_sys.end_view will be consulted
to determine whether a secondary index record may be safely purged.

row_upd_changes_disowned_external(): Remove. This should be more
expensive than briefly latching purge_sys in trx_undo_prev_version_build()
(which may make use of transactional memory).

row_sel_reset_old_vers_heap(): New function, split from
row_sel_build_prev_vers_for_mysql().

row_sel_build_prev_vers_for_mysql(): Reorder some parameters
to simplify the call to row_sel_reset_old_vers_heap().

row_search_for_mysql(): Replaced with direct calls to row_search_mvcc().

sel_node_get_nth_plan(): Define inline in row0sel.h

open_step(): Define at the call site, in simplified form.

sel_node_reset_cursor(): Merged with the only caller open_step().
---
ReadViewBase::check_trx_id_sanity(): Remove.
Let us handle "future" DB_TRX_ID in a more meaningful way:

row_sel_clust_sees(): Return DB_SUCCESS if the record is visible,
DB_SUCCESS_LOCKED_REC if it is invisible, and DB_CORRUPTION if
the DB_TRX_ID is in the future.

row_undo_mod_must_purge(), row_undo_mod_clust(): Silently ignore
corrupted DB_TRX_ID. We are in ROLLBACK, and we should have noticed
that corruption when we were about to modify the record in the first
place (leading us to refuse the operation).

row_vers_build_for_consistent_read(): Return DB_CORRUPTION if
DB_TRX_ID is in the future.

Tested by: Matthias Leich
Reviewed by: Vladislav Lesin

ab019010

03 Oct, 2022 1 commit

MDEV-29575 Access to innodb_trx, innodb_locks and innodb_lock_waits along with... · c0817dac

Vlad Lesin authored 2 years ago

MDEV-29575 Access to innodb_trx, innodb_locks and innodb_lock_waits along with detached XA's can cause SIGSEGV

trx->mysql_thd can be zeroed-out between thd_get_thread_id() and
thd_query_safe() calls in fill_trx_row(). trx_disconnect_prepared() zeroes out
trx->mysql_thd. And this can cause null pointer dereferencing in
fill_trx_row().

fill_trx_row() is invoked from fetch_data_into_cache() under trx_sys.mutex.

Bug fix is in reseting trx_t::mysql_thd in trx_disconnect_prepared() under
trx_sys.mutex lock too.

MTR test case can't be created for the fix, as we need to wait for
trx_t::mysql_thd reseting in fill_trx_row() after trx_t::mysql_thd was
checked for null while trx_sys.mutex is held. But trx_t::mysql_thd must be
reset in trx_disconnect_prepared() under trx_sys.mutex. There will be deadlock.

c0817dac

20 Sep, 2022 1 commit
- MDEV-15020 fixup: Make trx_t::apply_log() truly ATTRIBUTE_COLD · 5d9d3793
  Marko Mäkelä authored 2 years ago
  
  5d9d3793
24 Aug, 2022 1 commit

MDEV-29081 trx_t::lock.was_chosen_as_deadlock_victim race in lock_wait_end() · 8ff10969

Vlad Lesin authored 2 years ago

The issue is that trx_t::lock.was_chosen_as_deadlock_victim can be reset
before the transaction check it and set trx_t::error_state.

The fix is to reset trx_t::lock.was_chosen_as_deadlock_victim only in
trx_t::commit_in_memory(), which is invoked on full rollback. There is
also no need to have separate bit in
trx_t::lock.was_chosen_as_deadlock_victim to flag transaction it was
chosen as a victim of Galera conflict resolution, the same variable can be
used for both cases except debug build. For debug build we need to
distinguish deadlock and Galera's abort victims for debug checks. Also
there is no need to check for deadlock in lock_table_enqueue_waiting() for
Galera as the coresponding check presents in lock_wait().

Local variable "error_state" in lock_wait() was replaced with
trx->error_state, because before the replace
lock_sys_t::cancel<false>(trx, lock) and lock_sys.deadlock_check() could
change trx->error_state, which then could be overwritten with the local
"error_state" variable value.

The lock_wait_suspend_thread_enter DEBUG_SYNC point name is misleading,
because lock_wait_suspend_thread was eliminated in e71e6133. It was renamed
to lock_wait_start.

Reviewed by: Marko Mäkelä, Jan Lindström.

8ff10969

06 Jun, 2022 1 commit

MDEV-13542: Crashing on corrupted page is unhelpful · 0b47c126

Marko Mäkelä authored 2 years ago

The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.

We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.

This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.

We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.

Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.

buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.

btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.

btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().

btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().

FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.

dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.

trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().

trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.

trx_sys_create_sys_pages(): Merged with trx_sysf_create().

dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.

dir_pathname(): Replaces os_file_make_new_pathname().

row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.

btr_decryption_failed(): Report decryption failures

dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.

dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.

dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.

dict_table_is_corrupted(): Remove.

dict_index_t::is_btree(): Determine if the index is a valid B-tree.

BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.

UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.

buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().

fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.

btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.

opt_search_plan_for_table(): Simplify the code.

row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.

row_undo_mod_upd_exist_sec(): Simplify the code.

row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().

fut_get_ptr(): Replace with buf_page_get_gen() calls.

buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.

buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.

fseg_page_is_allocated(): Replaces fseg_page_is_free().

fts_drop_common_tables(): Return an error if the transaction
was rolled back.

fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.

fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.

Clean up mtr_t::page_lock()

buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.

buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.

mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().

recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.

recv_recover_page(): Return whether the operation succeeded.

recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.

Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.

0b47c126

26 Apr, 2022 1 commit

MDEV-26217 Failing assertion: list.count > 0 in ut_list_remove or Assertion... · 2ca11234

Marko Mäkelä authored 2 years ago

MDEV-26217 Failing assertion: list.count > 0 in ut_list_remove or Assertion `lock->trx == this' failed in dberr_t trx_t::drop_table

This follows up the previous fix in
commit c3c53926 (MDEV-26554).

ha_innobase::delete_table(): Work around the insufficient
metadata locking (MDL) during DML operations by acquiring exclusive
InnoDB table locks on all child tables. Previously, this was only
done on TRUNCATE and ALTER.

ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke
lock_update_delete() during change buffer operations.
The revised trx_t::commit(std::vector<pfs_os_file_t>&) will
hold exclusive lock_sys.latch while invoking fil_delete_tablespace(),
which in turn may invoke ibuf_delete_rec().

dict_index_t::has_locking(): A new predicate, replacing the dummy
!dict_table_is_locking_disabled(index->table). Used for skipping lock
operations during ibuf_delete_rec().

trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks
and remove the table from the cache while holding exclusive
lock_sys.latch.

trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds.

trx_t::commit(): Reset dict_operation before invoking commit_in_memory()
via commit_persist().

lock_release_on_drop(): Release locks while lock_sys.latch is
exclusively locked.

lock_table(): Add a parameter for a pointer to the table.
We must not dereference the table before a lock_sys.latch has
been acquired. If the pointer to the table does not match the table
at that point, the table is invalid and DB_DEADLOCK will be returned.

row_ins_foreign_check_on_constraint(): Improve the checks.
Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed
before commit c5fd9aa5 (MDEV-25919).

row_upd_check_references_constraints(),
wsrep_row_upd_check_foreign_constraints(): Simplify checks.

2ca11234

25 Apr, 2022 1 commit

MDEV-15250 UPSERT during ALTER TABLE results in 'Duplicate entry' error for alter · 4b80c11f

Thirunarayanan Balathandayuthapani authored 2 years ago

- InnoDB DDL results in `Duplicate entry' if concurrent DML throws
duplicate key error. The following scenario explains the problem

connection con1:
  ALTER TABLE t1 FORCE;

connection con2:
  INSERT INTO t1(pk, uk) VALUES (2, 2), (3, 2);

In connection con2, InnoDB throws the 'DUPLICATE KEY' error because
of unique index. Alter operation will throw the error when applying
the concurrent DML log.

- Inserting the duplicate key for unique index logs the insert
operation for online ALTER TABLE. When insertion fails,
transaction does rollback and it leads to logging of
delete operation for online ALTER TABLE.
While applying the insert log entries, alter operation
encounters 'DUPLICATE KEY' error.

- To avoid the above fake duplicate scenario, InnoDB should
not write any log for online ALTER TABLE before DML transaction
commit.

- User thread which does DML can apply the online log if
InnoDB ran out of online log and index is marked as completed.
Set online log error if apply phase encountered any error.
It can also clear all other indexes log, marks the newly
added indexes as corrupted.

- Removed the old online code which was a part of DML operations

commit_inplace_alter_table() : Does apply the online log
for the last batch of secondary index log and does frees
the log for the completed index.

trx_t::apply_online_log: Set to true while writing the undo
log if the modified table has active DDL

trx_t::apply_log(): Apply the DML changes to online DDL tables

dict_table_t::is_active_ddl(): Returns true if the table
has an active DDL

dict_index_t::online_log_make_dummy(): Assign dummy value
for clustered index online log to indicate the secondary
indexes are being rebuild.

dict_index_t::online_log_is_dummy(): Check whether the online
log has dummy value

ha_innobase_inplace_ctx::log_failure(): Handle the apply log
failure for online DDL transaction

row_log_mark_other_online_index_abort(): Clear out all other
online index log after encountering the error during
row_log_apply()

row_log_get_error(): Get the error happened during row_log_apply()

row_log_online_op(): Does apply the online log if index is
completed and ran out of memory. Returns false if apply log fails

UndorecApplier: Introduced a class to maintain the undo log
record, latched undo buffer page, parse the undo log record,
maintain the undo record type, info bits and update vector

UndorecApplier::get_old_rec(): Get the correct version of the
clustered index record that was modified by the current undo
log record

UndorecApplier::clear_undo_rec(): Clear the undo log related
information after applying the undo log record

UndorecApplier::log_update(): Handle the update, delete undo
log and apply it on online indexes

UndorecApplier::log_insert(): Handle the insert undo log
and apply it on online indexes

UndorecApplier::is_same(): Check whether the given roll pointer
is generated by the current undo log record information

trx_t::rollback_low(): Set apply_online_log for the transaction
after partially rollbacked transaction has any active DDL

prepare_inplace_alter_table_dict(): After allocating the online
log, InnoDB does create fulltext common tables. Fulltext index
doesn't allow the index to be online. So removed the dead
code of online log removal

Thanks to Marko Mäkelä for providing the initial prototype and
Matthias Leich for testing the issue patiently.

4b80c11f

14 Apr, 2022 1 commit

MDEV-28313: Shrink rw_trx_hash_element_t::mutex · 8074ab57

Marko Mäkelä authored 2 years ago

The element mutex is unnecessarily large. The PERFORMANCE_SCHEMA
instrumentation was not even enabled.

8074ab57

24 Feb, 2022 1 commit

MDEV-27935: Enable performance_schema profiling for trx_rseg_t latch · 83212632

Krunal Bauskar authored 3 years ago

- In 10.6, trx_rseg_t mutex was ported to use latch. As part of this porting
  profiling of the patch was removed. This patch reenables it given that
  the said latch continues to occupy the top-slots in the contention list.

83212632

18 Nov, 2021 1 commit

MDEV-27058: Reduce the size of buf_block_t and buf_page_t · aaef2e1d

Marko Mäkelä authored 3 years ago

buf_page_t::frame: Moved from buf_block_t::frame.
All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED
pages will have frame=nullptr, while all 'fat' buf_block_t
will have a non-null frame pointing to aligned innodb_page_size bytes.
This eliminates the need for separate states for
BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.

buf_page_t::lock: Moved from buf_block_t::lock. That is, all block
descriptors will have a page latch. The IO_PIN state that was used
for discarding or creating the uncompressed page frame of a
ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix
and page X-latch.

page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status
of buf_page_t with a single std::atomic<uint32_t>. All modifications
will use store(), fetch_add(), fetch_sub(). This space was previously
wasted to alignment on 64-bit systems. We will use the following encoding
that combines a state (partly read-fix or write-fix) and a buffer-fix
count:

buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
buf_page_t::FREED=3 + fix: pages marked as freed in the file
buf_page_t::UNFIXED=1U<<29 + fix: normal pages
buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)

buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to
UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.

buf_page_t::read_complete(): Renamed from buf_page_read_complete().
Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.

buf_page_t::can_relocate(): If the page latch is being held or waited for,
or the block is buffer-fixed or io-fixed, return false. (The condition
on the page latch is new.)

Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we
will acquire the page latch before fix(), and unfix() before unlocking.

buf_page_t::flush(): Replaces buf_flush_page(). Optimize the
handling of FREED pages.

buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held
by the caller.

buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.

buf_page_get_low(): Ignore guesses that are read-fixed because they
may not yet be registered in buf_pool.page_hash and buf_pool.LRU.

buf_page_optimistic_get(): Acquire latch before buffer-fixing.

buf_page_make_young(): Leave read-fixed blocks alone, because they
might not be registered in buf_pool.LRU yet.

recv_sys_t::recover_deferred(), recv_sys_t::recover_low():
Possibly fix MDEV-26326, by holding a page X-latch instead of
only buffer-fixing the page.

aaef2e1d

22 Oct, 2021 1 commit

MDEV-26769 InnoDB does not support hardware lock elision · 1f022809

Marko Mäkelä authored 3 years ago

This implements memory transaction support for:

* Intel Restricted Transactional Memory (RTM), also known as TSX-NI
(Transactional Synchronization Extensions New Instructions)
* POWER v2.09 Hardware Trace Monitor (HTM) on GNU/Linux

transactional_lock_guard, transactional_shared_lock_guard:
RAII lock guards that try to elide the lock acquisition
when transactional memory is available.

buf_pool.page_hash: Try to elide latches whenever feasible.
Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED
tables, this is not always possible.
In buf_page_get_low(), memory transactions only work reasonably
well for validating a guessed block address.

TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards
that try to elide lock_sys.latch and related latches.

1f022809

21 Oct, 2021 1 commit

MDEV-26864 Race condition between transaction commit and undo log truncation · c484a358

Marko Mäkelä authored 3 years ago

trx_commit_in_memory(): Do not release the rseg reference before
trx_undo_commit_cleanup() has been invoked and the current transaction
is truly done with the rollback segment. The purpose of the reference
count is to prevent data races with trx_purge_truncate_history().

This is based on
mysql/mysql-server@ac79aa1522f33e6eb912133a81fa2614db764c9c.

c484a358

18 Oct, 2021 1 commit

MDEV-26682 Replication timeouts with XA PREPARE · 18eab4a8

Marko Mäkelä authored 3 years ago

The purpose of non-exclusive locks in a transaction is to guarantee
that the records covered by those locks must remain in that way until
the transaction is committed. (The purpose of gap locks is to ensure
that a record that was nonexistent will remain that way.)

Once a transaction has reached the XA PREPARE state, the only allowed
further actions are XA ROLLBACK or XA COMMIT. Therefore, it can be
argued that only the exclusive locks that the XA PREPARE transaction
is holding are essential.

Furthermore, InnoDB never preserved explicit locks across server restart.
For XA PREPARE transations, we will only recover implicit exclusive locks
for records that had been modified.

Because of the fact that XA PREPARE followed by a server restart will
cause some locks to be lost, we might as well always release all
non-exclusive locks during the execution of an XA PREPARE statement.

lock_release_on_prepare(): Release non-exclusive locks on XA PREPARE.

trx_prepare(): Invoke lock_release_on_prepare() unless the
isolation level is SERIALIZABLE or this is an internal distributed
transaction with the binlog (not actual XA PREPARE statement).

This has been discussed with Sergei Golubchik and Andrei Elkin.

Reviewed by: Sergei Golubchik

18eab4a8

06 Sep, 2021 1 commit

MDEV-26467: Avoid re-reading srv_spin_wait_delay inside a loop · 0f0b7e47

Marko Mäkelä authored 3 years ago

Invoking ut_delay(srv_wpin_wait_delay) inside a spinloop would
cause a read of 2 global variables as well as multiplication.
Let us loop around MY_RELAX_CPU() using a precomputed loop count
to keep the loops simpler, to help them scale better.

We also tried precomputing the delay into a global variable,
but that appeared to result in slightly worse throughput.

0f0b7e47

31 Aug, 2021 3 commits

MDEV-25919: Lock tables before acquiring dict_sys.latch · c5fd9aa5

Marko Mäkelä authored 3 years ago

In commit 1bd681c8 (MDEV-25506 part 3)
we introduced a "fake instant timeout" when a transaction would wait
for a table or record lock while holding dict_sys.latch. This prevented
a deadlock of the server but could cause bogus errors for operations
on the InnoDB persistent statistics tables.

A better fix is to ensure that whenever a transaction is being
executed in the InnoDB internal SQL parser (which will for now
require dict_sys.latch to be held), it will already have acquired
all locks that could be required for the execution. So, we will
acquire the following locks upfront, before acquiring dict_sys.latch:

(1) MDL on the affected user table (acquired by the SQL layer)
(2) If applicable (not for RENAME TABLE): InnoDB table lock
(3) If persistent statistics are going to be modified:
(3.a) MDL_SHARED on mysql.innodb_table_stats, mysql.innodb_index_stats
(3.b) exclusive table locks on the statistics tables
(4) Exclusive table locks on the InnoDB data dictionary tables
(not needed in ANALYZE TABLE and the like)

Note: Acquiring exclusive locks on the statistics tables may cause
more locking conflicts between concurrent DDL operations.
Notably, RENAME TABLE will lock the statistics tables
even if no persistent statistics are enabled for the table.

DROP DATABASE will only acquire locks on statistics tables if
persistent statistics are enabled for the tables on which the
SQL layer is invoking ha_innobase::delete_table().
For any "garbage collection" in innodb_drop_database(), a timeout
while acquiring locks on the statistics tables will result in any
statistics not being deleted for any tables that the SQL layer
did not know about.

If innodb_defragment=ON, information may be written to the statistics
tables even for tables for which InnoDB persistent statistics are
disabled. But, DROP TABLE will no longer attempt to delete that
information if persistent statistics are not enabled for the table.

This change should also fix the hangs related to InnoDB persistent
statistics and STATS_AUTO_RECALC (MDEV-15020) as well as
a bug that running ALTER TABLE on the statistics tables
concurrently with running ALTER TABLE on InnoDB tables could
cause trouble.

lock_rec_enqueue_waiting(), lock_table_enqueue_waiting():
Do not issue a fake instant timeout error when the transaction
is holding dict_sys.latch. Instead, assert that the dict_sys.latch
is never being held here.

lock_sys_tables(): A new function to acquire exclusive locks on all
dictionary tables, in case DROP TABLE or similar operation is
being executed. Locking non-hard-coded tables is optional to avoid
a crash in row_merge_drop_temp_indexes(). The SYS_VIRTUAL table was
introduced in MySQL 5.7 and MariaDB Server 10.2. Normally, we require
all these dictionary tables to exist before executing any DDL, but
the function row_merge_drop_temp_indexes() is an exception.
When upgrading from MariaDB Server 10.1 or MySQL 5.6 or earlier,
the table SYS_VIRTUAL would not exist at this point.

ha_innobase::commit_inplace_alter_table(): Invoke
log_write_up_to() while not holding dict_sys.latch.

dict_sys_t::remove(), dict_table_close(): No longer try to
drop index stubs that were left behind by aborted online ADD INDEX.
Such indexes should be dropped from the InnoDB data dictionary by
row_merge_drop_indexes() as part of the failed DDL operation.
Stubs for aborted indexes may only be left behind in the
data dictionary cache.

dict_stats_fetch_from_ps(): Use a normal read-only transaction.

ha_innobase::delete_table(), ha_innobase::truncate(), fts_lock_table():
While waiting for purge to stop using the table,
do not hold dict_sys.latch.

ha_innobase::delete_table(): Implement a work-around for the rollback
of ALTER TABLE...ADD PARTITION. MDL_EXCLUSIVE would not be held if
ALTER TABLE hits lock_wait_timeout while trying to upgrade the MDL
due to a conflicting LOCK TABLES, such as in the first ALTER TABLE
in the test case of Bug#53676 in parts.partition_special_innodb.
Therefore, we must explicitly stop purge, because it would not be
stopped by MDL.

dict_stats_func(), btr_defragment_chunk(): Allocate a THD so that
we can acquire MDL on the InnoDB persistent statistics tables.

mysqltest_embedded: Invoke ha_pre_shutdown() before free_used_memory()
in order to avoid ASAN heap-use-after-free related to acquire_thd().

trx_t::dict_operation_lock_mode: Changed the type to bool.

row_mysql_lock_data_dictionary(), row_mysql_unlock_data_dictionary():
Implemented as macros.

rollback_inplace_alter_table(): Apply an infinite timeout to lock waits.

innodb_thd_increment_pending_ops(): Wrapper for
thd_increment_pending_ops(). Never attempt async operation for
InnoDB background threads, such as the trx_t::commit() in
dict_stats_process_entry_from_recalc_pool().

lock_sys_t::cancel(trx_t*): Make dictionary transactions immune to KILL.

lock_wait(): Make dictionary transactions immune to KILL, and to
lock wait timeout when waiting for locks on dictionary tables.

parts.partition_special_innodb: Use lock_wait_timeout=0 to instantly
get ER_LOCK_WAIT_TIMEOUT.

main.mdl: Filter out MDL on InnoDB persistent statistics tables

Reviewed by: Thirunarayanan Balathandayuthapani

c5fd9aa5

MDEV-25919 preparation: Various cleanup · 094de717

Marko Mäkelä authored 3 years ago

que_eval_sql(): Remove the parameter lock_dict. The only caller
with lock_dict=true was dict_stats_exec_sql(), which will now
explicitly invoke dict_sys.lock() and dict_sys.unlock() by itself.

row_import_cleanup(): Do not unnecessarily lock the dictionary.
Concurrent access to the table during ALTER TABLE...IMPORT TABLESPACE
is prevented by MDL and the fact that there cannot exist any
undo log or change buffer records that would refer to the table
or tablespace.

row_import_for_mysql(): Do not unnecessarily lock the dictionary
while accessing fil_system. Thanks to MDL_EXCLUSIVE that was acquired
by the SQL layer, only one IMPORT may be in effect for the table name.

row_quiesce_set_state(): Do not unnecessarily lock the dictionary.
The dict_table_t::quiesce state is documented to be protected by
all index latches, which we are acquiring.

dict_table_close(): Introduce a simpler variant with fewer parameters.

dict_table_close(): Reduce the amount of calls.
We can simply invoke dict_table_t::release() on startup or
in DDL operations, or when the table is inaccessible.
In none of these cases, there is no need to invalidate the
InnoDB persistent statistics.

pars_info_t::graph_owns_us: Remove (unused).

pars_info_free(): Define inline.

fts_delete(), trx_t::evict_table(), row_prebuilt_free(),
row_rename_table_for_mysql(): Simplify.

row_mysql_lock_data_dictionary(): Remove some references;
use dict_sys.lock() and dict_sys.unlock() instead.

row_mysql_lock_table(): Remove. Use lock_table_for_trx() instead.

ha_innobase::check_if_supported_inplace_alter(),
row_create_table_for_mysql(): Simply assert dict_sys.sys_tables_exist().
In commit 49e2c8f0 and
commit 1bd681c8 srv_start()
actually guarantees that the system tables will exist,
or the server is in read-only mode, or startup will fail.

Reviewed by: Thirunarayanan Balathandayuthapani

094de717

MDEV-24258 Merge dict_sys.mutex into dict_sys.latch · 82b7c561

Marko Mäkelä authored 3 years ago

In the parent commit, dict_sys.latch could theoretically have been
replaced with a mutex. But, we can do better and merge dict_sys.mutex
into dict_sys.latch. Generally, every occurrence of dict_sys.mutex_lock()
will be replaced with dict_sys.lock().

The PERFORMANCE_SCHEMA instrumentation for dict_sys_mutex
will be removed along with dict_sys.mutex. The dict_sys.latch
will remain instrumented as dict_operation_lock.

Some use of dict_sys.lock() will be replaced with dict_sys.freeze(),
which we will reintroduce for the new shared mode. Most notably,
concurrent table lookups are possible as long as the tables are present
in the dict_sys cache. In particular, this will allow more concurrency
among InnoDB purge workers.

Because dict_sys.mutex will no longer 'throttle' the threads that purge
InnoDB transaction history, a performance degradation may be observed
unless innodb_purge_threads=1.

The table cache eviction policy will become FIFO-like,
similar to what happened to fil_system.LRU
in commit 45ed9dd9.
The name of the list dict_sys.table_LRU will become somewhat misleading;
that list contains tables that may be evicted, even though the
eviction policy no longer is least-recently-used but first-in-first-out.
(Note: Tables can never be evicted as long as locks exist on them or
the tables are in use by some thread.)

As demonstrated by the test perfschema.sxlock_func, there
will be less contention on dict_sys.latch, because some previous
use of exclusive latches will be replaced with shared latches.

fts_parse_sql_no_dict_lock(): Replaced with pars_sql().

fts_get_table_name_prefix(): Merged to fts_optimize_create().

dict_stats_update_transient_for_index(): Deduplicated some code.

ha_innobase::info_low(), dict_stats_stop_bg(): Use a combination
of dict_sys.latch and table->stats_mutex_lock() to cover the
changes of BG_STAT_SHOULD_QUIT, because the flag is being read
in dict_stats_update_persistent() while not holding dict_sys.latch.

row_discard_tablespace_for_mysql(): Protect stats_bg_flag by
exclusive dict_sys.latch, like most other code does.

row_quiesce_table_has_fts_index(): Remove unnecessary mutex
acquisition. FLUSH TABLES...FOR EXPORT is protected by MDL.

row_import::set_root_by_heuristic(): Remove unnecessary mutex
acquisition. ALTER TABLE...IMPORT TABLESPACE is protected by MDL.

row_ins_sec_index_entry_low(): Replace a call
to dict_set_corrupted_index_cache_only(). Reads of index->type
were not really protected by dict_sys.mutex, and writes
(flagging an index corrupted) should be extremely rare.

dict_stats_process_entry_from_defrag_pool(): Only freeze the dictionary,
do not lock it exclusively.

dict_stats_wait_bg_to_stop_using_table(), DICT_BG_YIELD: Remove trx.
We can simply invoke dict_sys.unlock() and dict_sys.lock() directly.

dict_acquire_mdl_shared()<trylock=false>: Assert that dict_sys.latch is
only held in shared more, not exclusive mode. Only acquire it in
exclusive mode if the table needs to be loaded to the cache.

dict_sys_t::acquire(): Remove. Relocating elements in dict_sys.table_LRU
would require holding an exclusive latch, which we want to avoid
for performance reasons.

dict_sys_t::allow_eviction(): Add the table first to dict_sys.table_LRU,
to compensate for the removal of dict_sys_t::acquire(). This function
is only invoked by INFORMATION_SCHEMA.INNODB_SYS_TABLESTATS.

dict_table_open_on_id(), dict_table_open_on_name(): If dict_locked=false,
try to acquire dict_sys.latch in shared mode. Only acquire the latch in
exclusive mode if the table is not found in the cache.

Reviewed by: Thirunarayanan Balathandayuthapani

82b7c561

27 Jul, 2021 1 commit

MDEV-25594: Improve debug checks · cf1fc598

Marko Mäkelä authored 3 years ago

trx_t::will_lock: Changed the type to bool.

trx_t::is_autocommit_non_locking(): Replaces
trx_is_autocommit_non_locking().

trx_is_ac_nl_ro(): Remove (replaced with equivalent assertion expressions).

assert_trx_nonlocking_or_in_list(): Remove.
Replaced with at least as strict checks in each place.

check_trx_state(): Moved to a static function; partially replaced with
individual debug assertions implementing equivalent or stricter checks.

This is a backport of commit 7b51d11c
from 10.5.

cf1fc598

23 Jul, 2021 1 commit

MDEV-26138 innodb_flush_log_at_trx_commit=3 doesn't flush · 42b9daae

Marko Mäkelä authored 3 years ago

trx_flush_log_if_needed_low(): Do flush for both
innodb_flush_log_at_trx_commit=1 and
innodb_flush_log_at_trx_commit=3.

This is the 10.6 version of
10.2 commit 5f8651ac
which fixed a regression that had been introduced
in 2e814d47
(breaking 288eeb3a).

42b9daae

22 Jul, 2021 1 commit

MDEV-26193: Wake up purge less often · a4dc9265

Marko Mäkelä authored 3 years ago

Starting with commit 6e12ebd4
(MDEV-25062), srv_wake_purge_thread_if_not_active() became
more expensive operation, especially on NUMA systems, because
instead of reading an atomic global variable trx_sys.rseg_history_len
we are traversing up to 128 cache lines in trx_sys.history_exists().

trx_t::commit_cleanup(): Do not wake up purge at all.
We will wake up purge about once per second in srv_master_callback().

srv_master_do_active_tasks(), srv_master_do_idle_tasks():
Move some duplicated code to srv_master_callback().

srv_master_callback(): Invoke purge_coordinator_timer_callback()
to ensure that purge will be periodically woken up, even if the
latest execution of trx_t::commit_cleanup() allowed the purge view
to advance but did not wake up purge.
Do not call log_free_check(), because every thread that is going
to generate redo log is supposed to call that function anyway,
before acquiring any page latches. Additional calls to the function
every few seconds should not make any difference.

srv_shutdown_threads(): Ensure that srv_shutdown_state can be at most
SRV_SHUTDOWN_INITIATED in srv_master_callback(), by first invoking
srv_master_timer.reset() before changing srv_shutdown_state.
(Note: We first terminate the srv_master_callback and only then
terminate the purge tasks. Thus, the purge subsystem should exist
when srv_master_callback() invokes purge_coordinator_timer_callback()
if it was initiated in the first place.

a4dc9265

20 Jul, 2021 1 commit

Fix switch case statement in trx_flush_log_if_needed_low() · 5f8651ac

Jagdeep Sidhu authored 3 years ago

In commit 2e814d47 on MariaDB 10.2
the switch case statement in trx_flush_log_if_needed_low() regressed.

Since 10.2 this code was refactored to have switches in descending
order, so value of 3 for innodb_flush_log_at_trx_commit is behaving
the same as value of 2, that is no FSYNC is being enforced during
COMMIT phase. The switch should however not be empty and cases 2 and 3
should not have the identical contents.

As per documentation, setting innodb_flush_log_at_trx_commit to 3
should do FSYNC to disk if innodb_flush_log_at_trx_commit is set to 3.
This fixes the regression so that the switch statement again does
what users expect the setting should do.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.

5f8651ac

03 Jul, 2021 1 commit

fixup · 789a2a36

Marko Mäkelä authored 3 years ago

trx_t::free(): Declare xid as fully initialized in order to
avoid tripping the subsequent MEM_CHECK_DEFINED
(in WITH_MSAN and WITH_VALGRIND builds).

789a2a36

01 Jul, 2021 2 commits

MDEV-25919 preparation: Remove trx_t::internal · ed6b2307

Marko Mäkelä authored 3 years ago

With commit 1bd681c8 (MDEV-25506)
it no longer is necessary to run DDL and DML operations in
separate transactions. Let us remove the flag trx_t::internal.
Dictionary transactions will be distinguished by trx_t::dict_operation.

ed6b2307

Cleanup: Remove pointer indirection for trx_t::xid · 0a67b15a

Marko Mäkelä authored 3 years ago

The trx_t::xid is always allocated, so we might as well allocate it
directly in the trx_t object to improve the locality of reference.

0a67b15a

24 Jun, 2021 1 commit

MDEV-26007 Rollback unnecessarily initiates redo log write · 033e29b6

Marko Mäkelä authored 3 years ago

trx_t::commit_in_memory(): Do not initiate a redo log write if
the transaction has no visible effect. If anything for this
transaction had been made durable, crash recovery will roll back
the transaction just fine even if the end of ROLLBACK is not
durably written.

Rollbacks of transactions that are associated with XA identifiers
(possibly internally via the binlog) will always be persisted.
The test rpl.rpl_gtid_crash covers this.

033e29b6

23 Jun, 2021 1 commit

MDEV-25062: Reduce trx_rseg_t::mutex contention · 6e12ebd4

Marko Mäkelä authored 3 years ago

redo_rseg_mutex, noredo_rseg_mutex: Remove the PERFORMANCE_SCHEMA keys.
The rollback segment mutex will be uninstrumented.

trx_sys_t: Remove pointer indirection for rseg_array, temp_rseg.
Align each element to the cache line.

trx_sys_t::rseg_id(): Replaces trx_rseg_t::id.

trx_rseg_t::ref: Replaces needs_purge, trx_ref_count, skip_allocation
in a single std::atomic<uint32_t>.

trx_rseg_t::latch: Replaces trx_rseg_t::mutex.

trx_rseg_t::history_size: Replaces trx_sys_t::rseg_history_len

trx_sys_t::history_size_approx(): Replaces trx_sys.rseg_history_len
in those places where the exact count does not matter. We must not
acquire any trx_rseg_t::latch while holding index page latches, because
normally the trx_rseg_t::latch is acquired before any page latches.

trx_sys_t::history_exists(): Replaces trx_sys.rseg_history_len!=0
with an approximation.

We remove some unnecessary trx_rseg_t::latch acquisition around
trx_undo_set_state_at_prepare() and trx_undo_set_state_at_finish().
Those operations will only access fields that remain constant
after trx_rseg_t::init().

6e12ebd4

21 Jun, 2021 1 commit

MDEV-15912: Remove traces of insert_undo · e46f76c9

Marko Mäkelä authored 3 years ago

Let us simply refuse an upgrade from earlier versions if the
upgrade procedure was not followed. This simplifies the purge,
commit, and rollback of transactions.

Before upgrading to MariaDB 10.3 or later, a clean shutdown
of the server (with innodb_fast_shutdown=1 or 0) is necessary,
to ensure that any incomplete transactions are rolled back.
The undo log format was changed in MDEV-12288. There is only
one persistent undo log for each transaction.

e46f76c9

09 Jun, 2021 1 commit

MDEV-25506 (3 of 3): Do not delete .ibd files before commit · 1bd681c8

Marko Mäkelä authored 3 years ago

This is a complete rewrite of DROP TABLE, also as part of other DDL,
such as ALTER TABLE, CREATE TABLE...SELECT, TRUNCATE TABLE.

The background DROP TABLE queue hack is removed.
If a transaction needs to drop and create a table by the same name
(like TRUNCATE TABLE does), it must first rename the table to an
internal #sql-ib name. No committed version of the data dictionary
will include any #sql-ib tables, because whenever a transaction
renames a table to a #sql-ib name, it will also drop that table.
Either the rename will be rolled back, or the drop will be committed.

Data files will be unlinked after the transaction has been committed
and a FILE_RENAME record has been durably written. The file will
actually be deleted when the detached file handle returned by
fil_delete_tablespace() will be closed, after the latches have been
released. It is possible that a purge of the delete of the SYS_INDEXES
record for the clustered index will execute fil_delete_tablespace()
concurrently with the DDL transaction. In that case, the thread that
arrives later will wait for the other thread to finish.

HTON_TRUNCATE_REQUIRES_EXCLUSIVE_USE: A new handler flag.
ha_innobase::truncate() now requires that all other references to
the table be released in advance. This was implemented by Monty.

ha_innobase::delete_table(): If CREATE TABLE..SELECT is detected,
we will "hijack" the current transaction, drop the table in
the current transaction and commit the current transaction.
This essentially fixes MDEV-21602. There is a FIXME comment about
making the check less failure-prone.

ha_innobase::truncate(), ha_innobase::delete_table():
Implement a fast path for temporary tables. We will no longer allow
temporary tables to use the adaptive hash index.

dict_table_t::mdl_name: The original table name for the purpose of
acquiring MDL in purge, to prevent a race condition between a
DDL transaction that is dropping a table, and purge processing
undo log records of DML that had executed before the DDL operation.
For #sql-backup- tables during ALTER TABLE...ALGORITHM=COPY, the
dict_table_t::mdl_name will differ from dict_table_t::name.

dict_table_t::parse_name(): Use mdl_name instead of name.

dict_table_rename_in_cache(): Update mdl_name.

For the internal FTS_ tables of FULLTEXT INDEX, purge would
acquire MDL on the FTS_ table name, but not on the main table,
and therefore it would be able to run concurrently with a
DDL transaction that is dropping the table. Previously, the
DROP TABLE queue hack prevented a race between purge and DDL.
For now, we introduce purge_sys.stop_FTS() to prevent purge from
opening any table, while a DDL transaction that may drop FTS_
tables is in progress. The function fts_lock_table(), which will
be invoked before the dictionary is locked, will wait for
purge to release any table handles.

trx_t::drop_table_statistics(): Drop statistics for the table.
This replaces dict_stats_drop_index(). We will drop or rename
persistent statistics atomically as part of DDL transactions.
On lock conflict for dropping statistics, we will fail instantly
with DB_LOCK_WAIT_TIMEOUT, because we will be holding the
exclusive data dictionary latch.

trx_t::commit_cleanup(): Separated from trx_t::commit_in_memory().
Relax an assertion around fts_commit() and allow DB_LOCK_WAIT_TIMEOUT
in addition to DB_DUPLICATE_KEY. The call to fts_commit() is
entirely misplaced here and may obviously break the consistency
of transactions that affect FULLTEXT INDEX. It needs to be fixed
separately.

dict_table_t::n_foreign_key_checks_running: Remove (MDEV-21175).
The counter was a work-around for missing meta-data locking (MDL)
on the SQL layer, and not really needed in MariaDB.

ER_TABLE_IN_FK_CHECK: Replaced with ER_UNUSED_28.

HA_ERR_TABLE_IN_FK_CHECK: Remove.

row_ins_check_foreign_constraints(): Do not acquire
dict_sys.latch either. The SQL-layer MDL will protect us.

This was reviewed by Thirunarayanan Balathandayuthapani
and tested by Matthias Leich.

1bd681c8

27 May, 2021 1 commit

MDEV-25791: Remove UNIV_INTERN · a7d68e7a

Marko Mäkelä authored 3 years ago

Back in 2006 or 2007, when MySQL AB and Innobase Oy existed as
separately controlled entities (Innobase had been acquired by
Oracle Corporation), MySQL 5.1 introduced a storage engine plugin
interface and Oracle made use of it by distributing a separate
InnoDB Plugin, which would contain some more bug fixes and
improvements, compared to the version of InnoDB that was statically
linked with the mysqld server that was distributed by MySQL AB.
The built-in InnoDB would export global symbols, which would clash
with the symbols of the dynamic InnoDB Plugin (which was supposed
to override the built-in one when present).

The solution to this problem was to declare all global symbols with
UNIV_INTERN, so that they would get the GCC function attribute that
specifies hidden visibility.

Later, in MariaDB Server, something based on Percona XtraDB (a fork of
MySQL InnoDB) became the statically linked implementation, and something
closer to MySQL InnoDB was available as a dynamic plugin. Starting with
version 10.2, MariaDB Server includes only one InnoDB implementation,
and hence any reason to have the UNIV_INTERN definition was lost.

btr_get_size_and_reserved(): Move to the same compilation unit with
the only caller.

innodb_set_buf_pool_size(): Remove. Modify innobase_buffer_pool_size
directly.

fil_crypt_calculate_checksum(): Merge to the only caller.

ha_innobase::innobase_reset_autoinc(): Merge to the only caller.

thd_query_start_micro(): Remove. Call thd_start_utime() directly.

a7d68e7a

21 May, 2021 1 commit

MDEV-25743: Unnecessary copying of table names in InnoDB dictionary · 49e2c8f0

Marko Mäkelä authored 3 years ago

Many InnoDB data dictionary cache operations require that the
table name be copied so that it will be NUL terminated.
(For example, SYS_TABLES.NAME is not guaranteed to be NUL-terminated.)

dict_table_t::is_garbage_name(): Check if a name belongs to
the background drop table queue.

dict_check_if_system_table_exists(): Remove.

dict_sys_t::load_sys_tables(): Load the non-hard-coded system tables
SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL on startup.

dict_sys_t::create_or_check_sys_tables(): Replaces
dict_create_or_check_foreign_constraint_tables() and
dict_create_or_check_sys_virtual().

dict_sys_t::load_table(): Replaces dict_table_get_low()
and dict_load_table().

dict_sys_t::find_table(): Renamed from get_table().

dict_sys_t::sys_tables_exist(): Check whether all the non-hard-coded
tables SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL exist.

trx_t::has_stats_table_lock(): Moved to dict0stats.cc.

Some error messages will now report table names in the internal
databasename/tablename format, instead of `databasename`.`tablename`.

49e2c8f0

19 May, 2021 1 commit

MDEV-25180 Atomic ALTER TABLE · 7762ee5d

Monty authored 3 years ago

MDEV-25604 Atomic DDL: Binlog event written upon recovery does not
           have default database

The purpose of this task is to ensure that ALTER TABLE is atomic even if
the MariaDB server would be killed at any point of the alter table.
This means that either the ALTER TABLE succeeds (including that triggers,
the status tables and the binary log are updated) or things should be
reverted to their original state.

If the server crashes before the new version is fully up to date and
commited, it will revert to the original table and remove all
temporary files and tables.
If the new version is commited, crash recovery will use the new version,
and update triggers, the status tables and the binary log.
The one execption is ALTER TABLE .. RENAME .. where no changes are done
to table definition. This one will work as RENAME and roll back unless
the whole statement completed, including updating the binary log (if
enabled).

Other changes:
- Added handlerton->check_version() function to allow the ddl recovery
  code to check, in case of inplace alter table, if the table in the
  storage engine is of the new or old version.
- Added handler->table_version() so that an engine can report the current
  version of the table. This should be changed each time the table
  definition changes.
- Added  ha_signal_ddl_recovery_done() and
  handlerton::signal_ddl_recovery_done() to inform all handlers when
  ddl recovery has been done. (Needed by InnoDB).
- Added handlerton call inplace_alter_table_committed, to signal engine
  that ddl_log has been closed for the alter table query.
- Added new handerton flag
  HTON_REQUIRES_NOTIFY_TABLEDEF_CHANGED_AFTER_COMMIT to signal when we
  should call hton->notify_tabledef_changed() during
  mysql_inplace_alter_table. This was required as MyRocks and InnoDB
  needed the call at different times.
- Added function server_uuid_value() to be able to generate a temporary
  xid when ddl recovery writes the query to the binary log. This is
  needed to be able to handle crashes during ddl log recovery.
- Moved freeing of the frm definition to end of mysql_alter_table() to
  remove duplicate code and have a common exit strategy.

-------
InnoDB part of atomic ALTER TABLE
(Implemented by Marko Mäkelä)
innodb_check_version(): Compare the saved dict_table_t::def_trx_id
to determine whether an ALTER TABLE operation was committed.

We must correctly recover dict_table_t::def_trx_id for this to work.
Before purge removes any trace of DB_TRX_ID from system tables, it
will make an effort to load the user table into the cache, so that
the dict_table_t::def_trx_id can be recovered.

ha_innobase::table_version(): return garbage, or the trx_id that would
be used for committing an ALTER TABLE operation.

In InnoDB, table names starting with #sql-ib will remain special:
they will be dropped on startup. This may be revisited later in
MDEV-18518 when we implement proper undo logging and rollback
for creating or dropping multiple tables in a transaction.

Table names starting with #sql will retain some special meaning:
dict_table_t::parse_name() will not consider such names for
MDL acquisition, and dict_table_rename_in_cache() will treat such
names specially when handling FOREIGN KEY constraints.

Simplify InnoDB DROP INDEX.
Prevent purge wakeup

To ensure that dict_table_t::def_trx_id will be recovered correctly
in case the server is killed before ddl_log_complete(), we will block
the purge of any history in SYS_TABLES, SYS_INDEXES, SYS_COLUMNS
between ha_innobase::commit_inplace_alter_table(commit=true)
(purge_sys.stop_SYS()) and purge_sys.resume_SYS().
The completion callback purge_sys.resume_SYS() must be between
ddl_log_complete() and MDL release.

--------

MyRocks support for atomic ALTER TABLE
(Implemented by Sergui Petrunia)

Implement these SE API functions:
- ha_rocksdb::table_version()
- hton->check_version = rocksdb_check_versionMyRocks data dictionary
  now stores table version for each table.
  (Absence of table version record is interpreted as table_version=0,
  that is, which means no upgrade changes are needed)
- For inplace alter table of a partitioned table, call the underlying
  handlerton when checking if the table is ok. This assumes that the
  partition engine commits all changes at once.

7762ee5d