  1. 18 Apr, 2023 1 commit
  2. 16 Mar, 2023 1 commit
    • MDEV-30357 Performance regression in locking reads from secondary indexes · 4105017a
      Marko Mäkelä authored
      lock_sec_rec_some_has_impl(): Remove a harmful condition that caused the
      performance regression and should not have been added in
      commit b6e41e38 in the first place.
      Locking transactions that have not modified any persistent tables
      can carry the transaction identifier 0.
      
      trx_t::max_inactive_id: A cache for trx_sys_t::find_same_or_older().
      The value is not reset on transaction commit so that previous results
      can be reused for subsequent transactions. The smallest active
      transaction ID can only increase over time, not decrease.
      
      trx_sys_t::find_same_or_older(): Remember the maximum previous id for which
      rw_trx_hash.iterate() returned false, to avoid redundant iterations.
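
      The caching idea above can be sketched as follows. This is a minimal
      illustration, not the InnoDB code: TrxSysSketch, TrxSketch and
      scan_same_or_older() are hypothetical stand-ins, and the sketch assumes,
      as the commit message states, that the smallest active transaction ID
      never decreases.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for the transaction system: the set of active
// read-write transaction IDs. Not the real trx_sys_t.
struct TrxSysSketch {
  std::set<uint64_t> active_ids;
  unsigned scans = 0;  // how many times the expensive scan ran

  // Expensive check: is any active transaction ID <= id?
  bool scan_same_or_older(uint64_t id) {
    ++scans;
    return !active_ids.empty() && *active_ids.begin() <= id;
  }
};

// Hypothetical per-transaction cache, modeled on trx_t::max_inactive_id.
struct TrxSketch {
  // Largest ID for which a scan found no same-or-older active transaction.
  uint64_t max_inactive_id = 0;

  bool find_same_or_older(TrxSysSketch &sys, uint64_t id) {
    if (id <= max_inactive_id)
      return false;  // cached negative result; still valid because the
                     // smallest active ID can only increase
    const bool found = sys.scan_same_or_older(id);
    if (!found) {
      assert(id > max_inactive_id);
      max_inactive_id = id;  // remember the negative result
    }
    return found;
  }
};
```

      Because the cached value is not reset on commit, a later transaction
      reusing the same object can skip the scan for any ID at or below it.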
      
      lock_sec_rec_read_check_and_lock(): Add an early return in case we are
      already holding a covering table lock.
      
      lock_rec_convert_impl_to_expl(): Add a template parameter to avoid
      a redundant run-time check on whether the index is secondary.
      
      lock_rec_convert_impl_to_expl_for_trx(): Move some code from
      lock_rec_convert_impl_to_expl(), to reduce code duplication due
      to the added template parameter.
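
      A minimal sketch of replacing a run-time index-type check with a
      template parameter follows. The names IndexSketch and
      implicit_lock_hint() are hypothetical, and the two branches are a
      deliberate simplification of what InnoDB does for clustered and
      secondary indexes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified index descriptor (not dict_index_t).
struct IndexSketch {
  bool clustered;
};

// The index kind is a compile-time parameter. Within each instantiation
// is_primary is a constant, so the compiler can drop the dead branch and
// no run-time test of index.clustered remains in release builds.
template <bool is_primary>
uint64_t implicit_lock_hint(const IndexSketch &index, uint64_t rec_trx_id,
                            uint64_t page_max_trx_id) {
  assert(index.clustered == is_primary);  // debug-only sanity check
  if (is_primary)
    return rec_trx_id;       // clustered index: the ID is stored per record
  return page_max_trx_id;    // secondary index: only a page-level bound
}
```

      The caller, which already knows the index type, picks the
      instantiation, for example implicit_lock_hint<false>(...) on the
      secondary index path.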
      
      Reviewed by: Vladislav Lesin
      Tested by: Matthias Leich
      4105017a
  3. 02 Mar, 2023 1 commit
  4. 24 Feb, 2023 1 commit
    • MDEV-30671 InnoDB undo log truncation fails to wait for purge of history · 0de3be8c
      Marko Mäkelä authored
      It is not safe to invoke trx_purge_free_segment() or execute
      innodb_undo_log_truncate=ON before all undo log records in
      the rollback segment have been processed.
      
      A prominent failure that would occur due to premature freeing of
      undo log pages is that trx_undo_get_undo_rec() would crash when
      trying to copy an undo log record to fetch the previous version
      of a record.
      
      If trx_undo_get_undo_rec() was not invoked in the unlucky time frame,
      then the symptom would be that some committed transaction history is
      never removed. This would be detected by CHECK TABLE...EXTENDED that
      was implemented in commit ab019010.
      Such a garbage collection leak should be possible even when using
      innodb_undo_log_truncate=OFF, just involving trx_purge_free_segment().
      
      trx_rseg_t::needs_purge: Change the type from Boolean to a transaction
      identifier, noting the most recent non-purged transaction, or 0 if
      everything has been purged. On transaction start, we initialize this
      to 1 more than the transaction start ID. On recovery, the field may be
      adjusted to the transaction end ID (TRX_UNDO_TRX_NO) if it is larger.
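
      The change from a Boolean to a transaction identifier might look
      roughly like the sketch below. RsegSketch and all of its member
      functions are hypothetical; in particular, raising the value only when
      it would grow, and the explicit on_all_purged() reset, are assumptions
      that the commit message does not spell out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a rollback segment that tracks unpurged history.
struct RsegSketch {
  // 0 means that everything has been purged; otherwise an identifier
  // bounding the most recent transaction whose history is not yet purged.
  uint64_t needs_purge = 0;

  // On transaction start: one more than the transaction start ID.
  // Assumption: the value is never lowered by an older transaction.
  void on_transaction_start(uint64_t start_id) {
    assert(start_id < UINT64_MAX);
    if (needs_purge <= start_id)
      needs_purge = start_id + 1;
  }

  // On recovery: adjust to the transaction end ID if it is larger.
  void on_recovery(uint64_t trx_no) {
    if (trx_no > needs_purge)
      needs_purge = trx_no;
  }

  // Hypothetical hook for when purge has processed all history.
  void on_all_purged() { needs_purge = 0; }

  bool all_purged() const { return needs_purge == 0; }
};
```

      Unlike a Boolean, the identifier lets a caller compare the stored value
      against how far purge has progressed before freeing or truncating
      anything.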
      
      The field TRX_UNDO_NEEDS_PURGE becomes write-only; only some debug
      assertions will validate the value. The field reflects the old
      inaccurate Boolean field trx_rseg_t::needs_purge.
      
      trx_undo_mem_create_at_db_start(), trx_undo_lists_init(),
      trx_rseg_mem_restore(): Remove the parameter max_trx_id.
      Instead, store the maximum in trx_rseg_t::needs_purge,
      where trx_rseg_array_init() will find it.
      
      trx_purge_free_segment(): Continuously hold a lock on
      trx_rseg_t to prevent any concurrent allocation of undo log.
      
      trx_purge_truncate_rseg_history(): Only invoke trx_purge_free_segment()
      if the rollback segment is empty and there are no pending transactions
      associated with it.
      
      trx_purge_truncate_history(): Only proceed with innodb_undo_log_truncate=ON
      if trx_rseg_t::needs_purge indicates that all history has been purged.
      
      Tested by: Matthias Leich
      0de3be8c
  5. 09 Feb, 2023 1 commit
    • Apply clang-tidy to remove empty constructors / destructors · 08c85202
      Vicențiu Ciorbaru authored
      This patch is the result of running
      run-clang-tidy -fix -header-filter=.* -checks='-*,modernize-use-equals-default' .
      
      Code style changes have been done on top. This change leads to the
      following improvements:
      
      1. Binary size reduction.
      * For a -DBUILD_CONFIG=mysql_release build, the binary size is reduced by
        ~400kb.
      * A raw -DCMAKE_BUILD_TYPE=Release reduces the binary size by ~1.4kb.
      
      2. The compiler can better understand the intent of the code, which
         opens up more optimization possibilities. It also enables detection
         of unused variables whose type had an empty user-provided default
         constructor instead of an explicitly defaulted one.

         One change in sql/opt_range.cc was required following this patch:

         result_keys, an unused variable of the template class Bitmap, now
         correctly triggers an unused variable warning.

         Setting the Bitmap template class constructor to default allows the
         compiler to identify that instantiating the class has no side effects.
         Previously the compiler could not issue the warning, because it could
         not assume that the default constructor of Bitmap (being a template)
         was a no-op. This prevented the "unused variable" warning.
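
      The difference between an empty constructor body and "= default" can be
      checked with a standard type trait. The two struct names below are
      hypothetical and are not taken from the MariaDB sources; this is only
      an illustration of the language rule that the clang-tidy check relies
      on.

```cpp
#include <type_traits>

// An empty constructor body is still "user-provided", so the type is not
// trivially default constructible.
struct EmptyBody {
  EmptyBody() {}
  int x;
};

// A constructor defaulted on its first declaration is not user-provided,
// so the compiler knows that default construction does nothing.
struct Defaulted {
  Defaulted() = default;
  int x;
};

static_assert(!std::is_trivially_default_constructible<EmptyBody>::value,
              "an empty user-provided constructor is not trivial");
static_assert(std::is_trivially_default_constructible<Defaulted>::value,
              "a defaulted constructor is trivial");
```

      Both assertions hold at compile time, which shows that the two
      spellings are not equivalent from the compiler's point of view even
      though neither constructor contains any statements.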
      08c85202
  6. 17 Nov, 2022 1 commit
    • MDEV-29603 btr_cur_open_at_index_side() is missing some consistency checks · 24fe5347
      Marko Mäkelä authored
      btr_cur_t: Zero-initialize all fields in the default constructor.
      
      btr_cur_t::index: Remove; it duplicated page_cur.index.
      
      Many functions: Remove arguments that were duplicating
      page_cur_t::index and page_cur_t::block.
      
      page_cur_open_level(), btr_pcur_open_level(): Replaces
      btr_cur_open_at_index_side() for dict_stats_analyze_index().
      At the end, release all latches except the dict_index_t::lock
      and the buf_page_t::lock on the requested page.
      
      dict_stats_analyze_index(): Rely on mtr_t::rollback_to_savepoint()
      to release all uninteresting page latches.
      
      btr_search_guess_on_hash(): Simplify the logic, and invoke
      mtr_t::rollback_to_savepoint().
      
      We will use plain C++ std::vector<mtr_memo_slot_t> for mtr_t::m_memo.
      In this way, we can avoid setting mtr_memo_slot_t::object to nullptr
      and instead just remove garbage from m_memo.
      
      mtr_t::rollback_to_savepoint(): Shrink the vector. We will be needing this
      in dict_stats_analyze_index(), where we will release page latches and
      only retain the index->lock in mtr_t::m_memo.
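
      A minimal sketch of a vector-backed memo with savepoints follows.
      MiniTransactionSketch is a hypothetical class that tracks plain integer
      IDs instead of latches, and its rollback_to_savepoint() takes a single
      position, which may differ from the real mtr_t interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a mini-transaction that records acquired
// latches in a plain std::vector, in acquisition order.
struct MiniTransactionSketch {
  std::vector<int> memo;      // IDs of currently held latches
  std::vector<int> released;  // release order, kept only for illustration

  std::size_t savepoint() const { return memo.size(); }

  void acquire(int latch_id) { memo.push_back(latch_id); }

  // Release everything acquired after the savepoint, newest first, and
  // shrink the vector instead of leaving null slots behind.
  void rollback_to_savepoint(std::size_t sp) {
    assert(sp <= memo.size());
    while (memo.size() > sp) {
      released.push_back(memo.back());
      memo.pop_back();
    }
  }
};
```

      Taking a savepoint right after acquiring the index lock and rolling
      back to it later leaves only that first entry in the memo.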
      
      mtr_t::release_last_page(): Release the last acquired page latch.
      Replaces btr_leaf_page_release().
      
      mtr_t::release(const buf_block_t&): Release a single page latch.
      Used in btr_pcur_move_backward_from_page().
      
      mtr_t::memo_release(): Replaced with mtr_t::release().
      
      mtr_t::upgrade_buffer_fix(): Acquire a latch for a buffer-fixed page.
      This replaces the double bookkeeping in btr_cur_t::open_leaf().
      
      Reviewed by: Vladislav Lesin
      24fe5347
  7. 25 Oct, 2022 1 commit
    • MDEV-29843 Do not use asynchronous log_write_upto() for system THDs · b7fe6179
      Vladislav Vaintroub authored
      Non-blocking log_write_upto() (MDEV-24341) was designed only for
      client connections. Fix it so that it is not triggered for any system THD.

      Previously, an incomplete solution excluded only the InnoDB purge THDs,
      but not, for example, the replication slave.

      The hang in MDEV still remains somewhat of a mystery, though: it is not
      immediately clear how exactly a condition variable can become corrupted.
      But it is clear that it can be avoided.
      b7fe6179
  8. 24 Oct, 2022 1 commit
    • MDEV-28709 unexpected X lock on Supremum in READ COMMITTED · 8128a468
      Vlad Lesin authored
      The lock is created during page splitting, after moving records and
      locks (lock_move_rec_list_(start|end)()) to the new page, and inheriting
      the locks to the supremum of the left page from the successor of the
      infimum on the right page.

      There is no need for such inheritance for the READ COMMITTED isolation
      level and not-gap locks, so the fix is to add the corresponding condition
      to the gap lock inheritance function.
      
      One more fix is to forbid gap lock inheritance if XA was prepared. Use the
      most significant bit of trx_t::n_ref to indicate that gap lock inheritance
      is forbidden. This fix is based on
      mysql/mysql-server@b063e52a8367dc9d5ed418e7f6d96400867e9f43
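
      Reusing the most significant bit of a reference counter as a flag can
      be sketched as below. RefCountSketch, kForbidGapInherit and the member
      function names are hypothetical, and the memory ordering is kept
      relaxed for brevity; the real trx_t::n_ref handling may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical flag: the most significant bit of a 32-bit counter.
constexpr uint32_t kForbidGapInherit = 1U << 31;

class RefCountSketch {
  std::atomic<uint32_t> n_ref{0};

public:
  // The lower 31 bits hold the actual reference count.
  uint32_t count() const {
    return n_ref.load(std::memory_order_relaxed) & ~kForbidGapInherit;
  }

  void reference() { n_ref.fetch_add(1, std::memory_order_relaxed); }

  void release() {
    assert(count() > 0);
    n_ref.fetch_sub(1, std::memory_order_relaxed);
  }

  // Setting the flag does not disturb the count, and counting does not
  // disturb the flag as long as the count stays below 2^31.
  void forbid_gap_inheritance() {
    n_ref.fetch_or(kForbidGapInherit, std::memory_order_relaxed);
  }

  bool gap_inheritance_forbidden() const {
    return (n_ref.load(std::memory_order_relaxed) & kForbidGapInherit) != 0;
  }
};
```

      Packing the flag into the existing atomic word avoids adding a separate
      field that would need its own synchronization.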
      8128a468
  9. 21 Oct, 2022 2 commits
    • MDEV-29622 Wrong assertions in lock_cancel_waiting_and_release() for deadlock resolving caller · 9c04d66d
      Vlad Lesin authored
      Suppose we have two transactions, trx 1 and trx 2.
      
      trx 2 performs deadlock resolution from lock_wait(). It sets
      victim->lock.was_chosen_as_deadlock_victim=true for trx 1, but has not
      yet invoked lock_cancel_waiting_and_release().
      
      trx 1 checks the flag in lock_trx_handle_wait(), and starts rollback
      from row_mysql_handle_errors(). It can change trx->lock.wait_thr and
      trx->state as it holds trx_t::mutex, but trx 2 has not yet requested it,
      as lock_cancel_waiting_and_release() has not yet been called.
      
      After that, trx 1 tries to release locks in trx_t::rollback_low(),
      invoking trx_t::rollback_finish(). lock_release() blocks while trying to
      acquire lock_sys.rd_lock(SRW_LOCK_CALL) in lock_release_try(), because
      lock_sys is held by trx 2: deadlock resolution works under
      lock_sys.wr_lock(SRW_LOCK_CALL); see Deadlock::report() for details.
      
      trx 2 executes lock_cancel_waiting_and_release() for the deadlock victim,
      i.e. for trx 1. lock_cancel_waiting_and_release() contains some
      trx->lock.wait_thr and trx->state assertions, which will fail, because
      trx 1 has changed them during rollback execution.
      
      So, according to the above scenario, it is legal to have
      trx->lock.wait_thr==0 and trx->state!=TRX_STATE_ACTIVE in
      lock_cancel_waiting_and_release() if it was invoked from
      Deadlock::report(), and the fix is just to change the assertion
      conditions.
      
      There is also lock_wait() cleanup around trx->error_state.
      
      If trx->error_state can be changed by a thread other than the owning
      thread, it must be protected with lock_sys.wait_mutex, as lock_wait()
      uses trx->lock.cond along with that mutex.

      Also, if trx->error_state was changed before lock_sys.wait_mutex
      acquisition, it could be reset by the code that follows, which is
      wrong. We also need to check trx->error_state before entering the
      waiting loop; otherwise trx->error_state could have been set before
      lock_sys.wait_mutex acquisition while the thread still ends up waiting
      on trx->lock.cond.
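
      The rule of checking the error state before sleeping can be sketched
      with standard C++ primitives. WaiterSketch and its members are
      hypothetical stand-ins for lock_sys.wait_mutex, trx->lock.cond and
      trx->error_state; the real lock_wait() is considerably more involved.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

enum class Err { SUCCESS, DEADLOCK, LOCK_WAIT_TIMEOUT };

struct WaiterSketch {
  std::mutex wait_mutex;           // stands in for lock_sys.wait_mutex
  std::condition_variable cond;    // stands in for trx->lock.cond
  Err error_state = Err::SUCCESS;  // protected by wait_mutex
  bool lock_granted = false;       // protected by wait_mutex

  // Called by another thread, for example a deadlock resolver.
  void cancel(Err e) {
    assert(e != Err::SUCCESS);
    {
      std::lock_guard<std::mutex> g(wait_mutex);
      error_state = e;
    }
    cond.notify_one();
  }

  Err wait(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> g(wait_mutex);
    // error_state is deliberately not reset here. The predicate is
    // evaluated before the first sleep, so an error that was set before
    // the mutex was acquired ends the wait immediately.
    const bool woken = cond.wait_for(g, timeout, [this] {
      return lock_granted || error_state != Err::SUCCESS;
    });
    if (!woken)
      error_state = Err::LOCK_WAIT_TIMEOUT;
    return error_state;
  }
};
```

      If cancel() runs before wait(), the waiter returns the deadlock error
      without ever blocking on the condition variable.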
      9c04d66d
    • MDEV-24402: InnoDB CHECK TABLE ... EXTENDED · ab019010
      Marko Mäkelä authored
      Until now, the attribute EXTENDED of CHECK TABLE was ignored by InnoDB,
      and InnoDB only counted the records in each index according
      to the current read view. Unless the attribute QUICK was specified, the
      function btr_validate_index() would be invoked to validate the B-tree
      structure (the sibling and child links between index pages).
      
      The EXTENDED check will not only count all index records according to the
      current read view, but also ensure that any delete-marked records in the
      clustered index are waiting for the purge of history, and that all
      secondary index records point to a version of the clustered index record
      that is waiting for the purge of history. In other words, no index may
      contain orphan records. Normal MVCC reads and the non-EXTENDED version
      of CHECK TABLE would ignore these orphans.
      
      Unpurged records merely result in warnings (at most one per index),
      not errors, and no indexes will be flagged as corrupted due to such
      garbage. It will remain possible to SELECT data from such indexes or
      tables (which will skip such records) or to rebuild the table to
      reclaim some space.
      
      We introduce purge_sys.end_view that will be (almost) a copy of
      purge_sys.view at the end of a batch of purging committed transaction
      history. It is not an exact copy, because if the size of a purge batch
      is limited by innodb_purge_batch_size, some records that
      purge_sys.view would allow to be purged will be left over for
      subsequent batches.
      
      The purge_sys.view is relevant in the purge of committed transaction
      history, to determine if records are safe to remove. The new
      purge_sys.end_view is relevant in MVCC operations and in
      CHECK TABLE ... EXTENDED. It tells which undo log records are
      safe to access (have not been discarded at the end of a purge batch).
      
      purge_sys.clone_oldest_view<true>(): In trx_lists_init_at_db_start(),
      clone the oldest read view similar to purge_sys_t::clone_end_view()
      so that CHECK TABLE ... EXTENDED will not report bogus failures between
      InnoDB restart and the completed purge of committed transaction history.
      
      purge_sys_t::is_purgeable(): Replaces purge_sys_t::changes_visible()
      in the case that purge_sys.latch will not be held by the caller.
      Among other things, this guards access to BLOBs. It is not safe to
      dereference any BLOBs of a delete-marked purgeable record, because
      they may have already been freed.
      
      purge_sys_t::view_guard::view(): Return a reference to purge_sys.view
      that will be protected by purge_sys.latch, held by purge_sys_t::view_guard.
      
      purge_sys_t::end_view_guard::view(): Return a reference to
      purge_sys.end_view while it is protected by purge_sys.end_latch.
      Whenever a thread needs to retrieve an older version of a clustered
      index record, it will hold a page latch on the clustered index page
      and potentially also on a secondary index page that points to the
      clustered index page. If these pages contain purgeable records that
      would be accessed by a currently running purge batch, the progress of
      the purge batch would be blocked by the page latches. Hence, it is
      safe to make a copy of purge_sys.end_view while holding an index page
      latch, and consult the copy of the view to determine whether a record
      should already have been purged.
      
      btr_validate_index(): Remove a redundant check.
      
      row_check_index_match(): Check if a secondary index record and a
      version of a clustered index record match each other.
      
      row_check_index(): Replaces row_scan_index_for_mysql().
      Count the records in each index directly, duplicating the relevant
      logic from row_search_mvcc(). Initialize check_table_extended_view
      for CHECK ... EXTENDED while holding an index leaf page latch.
      If we encounter an orphan record, the copy of purge_sys.end_view that
      we make is safe for visibility checks, and trx_undo_get_undo_rec() will
      check for the safety to access each undo log record. Should that check
      fail, we should return DB_MISSING_HISTORY to report a corrupted index.
      The EXTENDED check tries to match each secondary index record with
      every available clustered index record version, by duplicating the logic
      of row_vers_build_for_consistent_read() and invoking
      trx_undo_prev_version_build() directly.
      
      Before invoking row_check_index_match() on delete-marked clustered index
      record versions, we will consult purge_sys.is_purgeable() in order to
      avoid accessing freed BLOBs.
      
      We will always check that the DB_TRX_ID or PAGE_MAX_TRX_ID does not
      exceed the global maximum. Orphan secondary index records will be
      flagged only if everything up to PAGE_MAX_TRX_ID has been purged.
      We warn also about clustered index records whose nonzero DB_TRX_ID
      should have been reset in purge or rollback.
      
      trx_set_rw_mode(): Move an assertion from ReadView::set_creator_trx_id().
      
      trx_undo_prev_version_build(): Remove two debug-only parameters,
      and return an error code instead of a Boolean.
      
      trx_undo_get_undo_rec(): Return a pointer to the undo log record,
      or nullptr if one cannot be retrieved. Instead of consulting the
      purge_sys.view, consult the purge_sys.end_view to determine which
      records can be accessed.
      
      trx_undo_get_rec_if_purgeable(): A variant of trx_undo_get_undo_rec()
      that will consult purge_sys.view instead of purge_sys.end_view.
      
      TRX_UNDO_CHECK_PURGEABILITY: A new parameter to
      trx_undo_prev_version_build(), passed by row_vers_old_has_index_entry()
      so that purge_sys.view instead of purge_sys.end_view will be consulted
      to determine whether a secondary index record may be safely purged.
      
      row_upd_changes_disowned_external(): Remove. This should be more
      expensive than briefly latching purge_sys in trx_undo_prev_version_build()
      (which may make use of transactional memory).
      
      row_sel_reset_old_vers_heap(): New function, split from
      row_sel_build_prev_vers_for_mysql().
      
      row_sel_build_prev_vers_for_mysql(): Reorder some parameters
      to simplify the call to row_sel_reset_old_vers_heap().
      
      row_search_for_mysql(): Replaced with direct calls to row_search_mvcc().
      
      sel_node_get_nth_plan(): Define inline in row0sel.h
      
      open_step(): Define at the call site, in simplified form.
      
      sel_node_reset_cursor(): Merged with the only caller open_step().
      ---
      ReadViewBase::check_trx_id_sanity(): Remove.
      Let us handle "future" DB_TRX_ID in a more meaningful way:
      
      row_sel_clust_sees(): Return DB_SUCCESS if the record is visible,
      DB_SUCCESS_LOCKED_REC if it is invisible, and DB_CORRUPTION if
      the DB_TRX_ID is in the future.
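
      The three-way result described above can be sketched as follows.
      DbErr, ReadViewSketch and clust_sees() are hypothetical, and the
      visibility rule is reduced to a single upper bound; a real read view
      also tracks which older transactions were still active.

```cpp
#include <cassert>
#include <cstdint>

enum class DbErr { SUCCESS, SUCCESS_LOCKED_REC, CORRUPTION };

// Hypothetical, heavily simplified read view: changes by transactions with
// an ID below low_limit_id are treated as visible.
struct ReadViewSketch {
  uint64_t low_limit_id;
};

// next_trx_id is assumed to be the next ID that would be assigned, so any
// record carrying an ID at or above it claims to come from the future.
inline DbErr clust_sees(const ReadViewSketch &view, uint64_t rec_trx_id,
                        uint64_t next_trx_id) {
  assert(view.low_limit_id <= next_trx_id);
  if (rec_trx_id >= next_trx_id)
    return DbErr::CORRUPTION;        // "future" DB_TRX_ID
  if (rec_trx_id < view.low_limit_id)
    return DbErr::SUCCESS;           // visible
  return DbErr::SUCCESS_LOCKED_REC;  // valid but invisible to this view
}
```

      Returning a distinct code for the corrupted case lets the caller stop
      and report the problem instead of treating the record as merely
      invisible.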
      
      row_undo_mod_must_purge(), row_undo_mod_clust(): Silently ignore
      corrupted DB_TRX_ID. We are in ROLLBACK, and we should have noticed
      that corruption when we were about to modify the record in the first
      place (leading us to refuse the operation).
      
      row_vers_build_for_consistent_read(): Return DB_CORRUPTION if
      DB_TRX_ID is in the future.
      
      Tested by: Matthias Leich
      Reviewed by: Vladislav Lesin
      ab019010
  10. 03 Oct, 2022 1 commit
    • MDEV-29575 Access to innodb_trx, innodb_locks and innodb_lock_waits along with detached XA's can cause SIGSEGV · c0817dac
      Vlad Lesin authored
      
      trx->mysql_thd can be zeroed-out between thd_get_thread_id() and
      thd_query_safe() calls in fill_trx_row(). trx_disconnect_prepared() zeroes out
      trx->mysql_thd. And this can cause null pointer dereferencing in
      fill_trx_row().
      
      fill_trx_row() is invoked from fetch_data_into_cache() under trx_sys.mutex.
      
      The fix is to reset trx_t::mysql_thd in trx_disconnect_prepared() under
      trx_sys.mutex too.

      An MTR test case cannot be created for the fix, as we would need to wait
      in fill_trx_row() for trx_t::mysql_thd to be reset after it was checked
      for null while trx_sys.mutex is held. But trx_t::mysql_thd must be reset
      in trx_disconnect_prepared() under trx_sys.mutex, so there would be a
      deadlock.
      c0817dac
  11. 20 Sep, 2022 1 commit
  12. 24 Aug, 2022 1 commit
    • MDEV-29081 trx_t::lock.was_chosen_as_deadlock_victim race in lock_wait_end() · 8ff10969
      Vlad Lesin authored
      The issue is that trx_t::lock.was_chosen_as_deadlock_victim can be reset
      before the transaction checks it and sets trx_t::error_state.

      The fix is to reset trx_t::lock.was_chosen_as_deadlock_victim only in
      trx_t::commit_in_memory(), which is invoked on full rollback. There is
      also no need for a separate bit in
      trx_t::lock.was_chosen_as_deadlock_victim to flag that the transaction
      was chosen as a victim of Galera conflict resolution; the same variable
      can be used for both cases, except in debug builds. In debug builds we
      need to distinguish deadlock victims from Galera abort victims for debug
      checks. Also, there is no need to check for deadlock in
      lock_table_enqueue_waiting() for Galera, as the corresponding check is
      present in lock_wait().
      
      The local variable "error_state" in lock_wait() was replaced with
      trx->error_state, because before the replacement
      lock_sys_t::cancel<false>(trx, lock) and lock_sys.deadlock_check() could
      change trx->error_state, which then could be overwritten with the value
      of the local "error_state" variable.
      
      The lock_wait_suspend_thread_enter DEBUG_SYNC point name is misleading,
      because lock_wait_suspend_thread was eliminated in e71e6133. It was renamed
      to lock_wait_start.
      
      Reviewed by: Marko Mäkelä, Jan Lindström.
      8ff10969
  13. 06 Jun, 2022 1 commit
    • MDEV-13542: Crashing on corrupted page is unhelpful · 0b47c126
      Marko Mäkelä authored
      The approach to handling corruption that was chosen by Oracle in
      commit 177d8b0c
      is not really useful. Not only did it actually fail to prevent InnoDB
      from crashing, but it is making things worse by blocking attempts to
      rescue data from or rebuild a partially readable table.
      
      We will try to prevent crashes in a different way: by propagating
      errors up the call stack. We will never mark the clustered index
      persistently corrupted, so that data recovery may be attempted by
      reading from the table, or by rebuilding the table.
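
      Propagating errors up the call stack, instead of asserting at the point
      of detection, can be sketched as follows. PageSketch, page_get() and
      count_records() are hypothetical; the nullptr-plus-error-code shape is
      only loosely modeled on how this commit message describes
      buf_page_get_gen().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class DbErr { SUCCESS, CORRUPTION };

// Hypothetical page: a corruption flag and a record count.
struct PageSketch {
  bool corrupted;
  int n_recs;
};

// Low level: report failure through the return value and *err instead of
// crashing, and let the caller decide what to do.
const PageSketch *page_get(const std::vector<PageSketch> &pages,
                           std::size_t page_no, DbErr *err) {
  assert(err != nullptr);
  if (page_no >= pages.size() || pages[page_no].corrupted) {
    *err = DbErr::CORRUPTION;
    return nullptr;
  }
  *err = DbErr::SUCCESS;
  return &pages[page_no];
}

// Higher level: stop at the first failure and hand the error upward.
DbErr count_records(const std::vector<PageSketch> &pages, long *total) {
  *total = 0;
  for (std::size_t i = 0; i < pages.size(); i++) {
    DbErr err;
    const PageSketch *page = page_get(pages, i, &err);
    if (!page)
      return err;
    *total += page->n_recs;
  }
  return DbErr::SUCCESS;
}
```

      The caller of count_records() can then fail a single statement while
      the rest of the data stays accessible.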
      
      This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
      it was extensively tested with innodb_file_per_table=0 and a
      non-autoextend system tablespace.
      
      We should now avoid crashes in many cases, such as when a page
      cannot be read or allocated, or an inconsistency is detected when
      attempting to update multiple pages. We will not crash on double-free,
      such as on the recovery of DDL in system tablespace in case something
      was corrupted.
      
      Crashes on corrupted data are still possible. The fault injection mechanism
      that is introduced in the subsequent commit may help catch more of them.
      
      buf_page_import_corrupt_failure: Remove the fault injection, and instead
      corrupt some pages using Perl code in the tests.
      
      btr_cur_pessimistic_insert(): Always reserve extents (except for the
      change buffer), in order to prevent a subsequent allocation failure.
      
      btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
      
      btr_assert_not_corrupted(), btr_corruption_report(): Remove.
      Similar checks are already part of btr_block_get().
      
      FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
      
      dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
      trx_undo_page_get_s_latched(): Replaced with error-checking calls.
      
      trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
      
      trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
      
      trx_sys_create_sys_pages(): Merged with trx_sysf_create().
      
      dict_check_tablespaces_and_store_max_id(): Do not access
      DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
      Merge dict_check_sys_tables() with this function.
      
      dir_pathname(): Replaces os_file_make_new_pathname().
      
      row_undo_ins_remove_sec(): Do not modify the undo page by adding
      a terminating NUL byte to the record.
      
      btr_decryption_failed(): Report decryption failures
      
      dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
      dict_set_corrupted_index_cache_only(): Remove.
      
      dict_set_corrupted(): Remove the constant parameter dict_locked=false.
      Never flag the clustered index corrupted in SYS_INDEXES, because
      that would deny further access to the table. It might be possible to
      repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
      no B-tree leaf page is corrupted.
      
      dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
      row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
      the callers to read dict_index_t::type only once.
      
      dict_table_is_corrupted(): Remove.
      
      dict_index_t::is_btree(): Determine if the index is a valid B-tree.
      
      BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
      
      UNIV_BTR_DEBUG: Remove. Inconsistencies will no longer trigger
      assertion failures; error codes will be returned instead.
      
      buf_corrupt_page_release(): Replaced with a direct call to
      buf_pool.corrupted_evict().
      
      fil_invalid_page_access_msg(): Never crash on an invalid read;
      let the caller of buf_page_get_gen() decide.
      
      btr_pcur_t::restore_position(): Propagate failure status to the caller
      by returning CORRUPTED.
      
      opt_search_plan_for_table(): Simplify the code.
      
      row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
      row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
      row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
      when no secondary indexes exist.
      
      row_undo_mod_upd_exist_sec(): Simplify the code.
      
      row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
      if the clustered index (and therefore the table) is corrupted, similar
      to what we do in row_insert_for_mysql().
      
      fut_get_ptr(): Replace with buf_page_get_gen() calls.
      
      buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
      if the page is marked as freed. For other modes than
      BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
      trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
      we will return nullptr for freed pages, so that the callers
      can be simplified. The purge of transaction history will be
      a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
      corrupted data.
      
      buf_page_get_low(): Never crash on a corrupted page, but simply
      return nullptr.
      
      fseg_page_is_allocated(): Replaces fseg_page_is_free().
      
      fts_drop_common_tables(): Return an error if the transaction
      was rolled back.
      
      fil_space_t::set_corrupted(): Report a tablespace as corrupted if
      it was not reported already.
      
      fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
      out-of-bounds page access or other errors.
      
      Clean up mtr_t::page_lock()
      
      buf_page_get_low(): Validate the page identifier (to check for
      recently read corrupted pages) after acquiring the page latch.
      
      buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
      with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
      
      mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
      
      recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
      if any log records exist for the page. We do not mind if read-ahead
      produces corrupted (or all-zero) pages that were not actually needed
      during recovery.
      
      recv_recover_page(): Return whether the operation succeeded.
      
      recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
      
      Thanks to Matthias Leich for testing this extensively and to the
      authors of https://rr-project.org for making it easy to diagnose
      and fix any failures that were found during the testing.
      0b47c126
  14. 26 Apr, 2022 1 commit
    • MDEV-26217 Failing assertion: list.count > 0 in ut_list_remove or Assertion `lock->trx == this' failed in dberr_t trx_t::drop_table · 2ca11234
      Marko Mäkelä authored
      
      This follows up the previous fix in
      commit c3c53926 (MDEV-26554).
      
      ha_innobase::delete_table(): Work around the insufficient
      metadata locking (MDL) during DML operations by acquiring exclusive
      InnoDB table locks on all child tables. Previously, this was only
      done on TRUNCATE and ALTER.
      
      ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke
      lock_update_delete() during change buffer operations.
      The revised trx_t::commit(std::vector<pfs_os_file_t>&) will
      hold exclusive lock_sys.latch while invoking fil_delete_tablespace(),
      which in turn may invoke ibuf_delete_rec().
      
      dict_index_t::has_locking(): A new predicate, replacing the dummy
      !dict_table_is_locking_disabled(index->table). Used for skipping lock
      operations during ibuf_delete_rec().
      
      trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks
      and remove the table from the cache while holding exclusive
      lock_sys.latch.
      
      trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds.
      
      trx_t::commit(): Reset dict_operation before invoking commit_in_memory()
      via commit_persist().
      
      lock_release_on_drop(): Release locks while lock_sys.latch is
      exclusively locked.
      
      lock_table(): Add a parameter for a pointer to the table.
      We must not dereference the table before a lock_sys.latch has
      been acquired. If the pointer to the table does not match the table
      at that point, the table is invalid and DB_DEADLOCK will be returned.
      
      row_ins_foreign_check_on_constraint(): Improve the checks.
      Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed
      before commit c5fd9aa5 (MDEV-25919).
      
      row_upd_check_references_constraints(),
      wsrep_row_upd_check_foreign_constraints(): Simplify checks.
      2ca11234
  15. 25 Apr, 2022 1 commit
    • MDEV-15250 UPSERT during ALTER TABLE results in 'Duplicate entry' error for alter · 4b80c11f
      Thirunarayanan Balathandayuthapani authored
      - InnoDB DDL results in `Duplicate entry' if concurrent DML throws a
      duplicate key error. The following scenario explains the problem:
      
      connection con1:
        ALTER TABLE t1 FORCE;
      
      connection con2:
        INSERT INTO t1(pk, uk) VALUES (2, 2), (3, 2);
      
      In connection con2, InnoDB throws the 'DUPLICATE KEY' error because
      of the unique index. The ALTER operation will throw the error when
      applying the concurrent DML log.
      
      - Inserting the duplicate key for the unique index logs the insert
      operation for online ALTER TABLE. When the insertion fails, the
      transaction rolls back, which leads to logging of a
      delete operation for online ALTER TABLE.
      While applying the insert log entries, the ALTER operation
      encounters the 'DUPLICATE KEY' error.
      
      - To avoid the above fake duplicate scenario, InnoDB should
      not write any log for online ALTER TABLE before DML transaction
      commit.
      
      - A user thread that performs DML can apply the online log if
      InnoDB ran out of online log and the index is marked as completed.
      If the apply phase encounters any error, the online log error is
      set. The thread can also clear the logs of all other indexes and
      mark the newly added indexes as corrupted.
      
      - Removed the old online code which was a part of DML operations
      
      commit_inplace_alter_table(): Apply the online log
      for the last batch of secondary index log and free
      the log for the completed index.
      
      trx_t::apply_online_log: Set to true while writing the undo
      log if the modified table has active DDL
      
      trx_t::apply_log(): Apply the DML changes to online DDL tables
      
      dict_table_t::is_active_ddl(): Returns true if the table
      has an active DDL
      
      dict_index_t::online_log_make_dummy(): Assign a dummy value
      to the clustered index online log to indicate that the secondary
      indexes are being rebuilt.
      
      dict_index_t::online_log_is_dummy(): Check whether the online
      log has dummy value
      
      ha_innobase_inplace_ctx::log_failure(): Handle the apply log
      failure for online DDL transaction
      
      row_log_mark_other_online_index_abort(): Clear out all other
      online index log after encountering the error during
      row_log_apply()
      
      row_log_get_error(): Get the error that happened during row_log_apply()
      
      row_log_online_op(): Apply the online log if the index is
      completed and ran out of memory. Return false if applying the log fails
      
      UndorecApplier: Introduced a class to maintain the undo log
      record, latched undo buffer page, parse the undo log record,
      maintain the undo record type, info bits and update vector
      
      UndorecApplier::get_old_rec(): Get the correct version of the
      clustered index record that was modified by the current undo
      log record
      
      UndorecApplier::clear_undo_rec(): Clear the undo log related
      information after applying the undo log record
      
      UndorecApplier::log_update(): Handle the update, delete undo
      log and apply it on online indexes
      
      UndorecApplier::log_insert(): Handle the insert undo log
      and apply it on online indexes
      
      UndorecApplier::is_same(): Check whether the given roll pointer
      is generated by the current undo log record information
      
      trx_t::rollback_low(): Set apply_online_log for the transaction
      if, after a partial rollback, the transaction has any active DDL
      
      prepare_inplace_alter_table_dict(): After allocating the online
      log, InnoDB creates the fulltext common tables. A fulltext index
      cannot be built online, so the dead code for online log removal
      was removed
      
      Thanks to Marko Mäkelä for providing the initial prototype and
      Matthias Leich for testing the issue patiently.
      4b80c11f
  16. 14 Apr, 2022 1 commit
  17. 24 Feb, 2022 1 commit
  18. 18 Nov, 2021 1 commit
    • MDEV-27058: Reduce the size of buf_block_t and buf_page_t · aaef2e1d
      Marko Mäkelä authored
      buf_page_t::frame: Moved from buf_block_t::frame.
      All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED
      pages will have frame=nullptr, while all 'fat' buf_block_t
      will have a non-null frame pointing to aligned innodb_page_size bytes.
      This eliminates the need for separate states for
      BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.
      
      buf_page_t::lock: Moved from buf_block_t::lock. That is, all block
      descriptors will have a page latch. The IO_PIN state that was used
      for discarding or creating the uncompressed page frame of a
      ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix
      and page X-latch.
      
      page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status
      of buf_page_t with a single std::atomic<uint32_t>. All modifications
      will use store(), fetch_add(), fetch_sub(). This space was previously
      wasted to alignment on 64-bit systems. We will use the following encoding
      that combines a state (partly read-fix or write-fix) and a buffer-fix
      count:
      
      buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
      buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
      buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
      buf_page_t::FREED=3 + fix: pages marked as freed in the file
      buf_page_t::UNFIXED=1U<<29 + fix: normal pages
      buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
      buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
      buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
      buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
      buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
      buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)
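
      The encoding listed above can be illustrated with a minimal sketch.
      The struct PageStateSketch and its member functions are hypothetical
      names chosen for illustration, not the InnoDB implementation; only the
      numeric constants are taken from the list. The sketch assumes that for
      states at or above UNFIXED the low 29 bits hold the buffer-fix count.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the combined state word; not the InnoDB class.
struct PageStateSketch
{
  // Constants copied from the commit message.
  enum : uint32_t {
    NOT_USED= 0, MEMORY= 1, REMOVE_HASH= 2, FREED= 3,
    UNFIXED= 1U << 29, IBUF_EXIST= 2U << 29, REINIT= 3U << 29,
    READ_FIX= 4U << 29, WRITE_FIX= 5U << 29,
    WRITE_FIX_IBUF= 6U << 29, WRITE_FIX_REINIT= 7U << 29
  };

  std::atomic<uint32_t> word{NOT_USED};

  uint32_t state() const { return word.load(std::memory_order_acquire); }

  // Buffer-fixing only adds to or subtracts from the low bits.
  void fix() { word.fetch_add(1, std::memory_order_acq_rel); }
  void unfix() { word.fetch_sub(1, std::memory_order_acq_rel); }

  // READ_FIX occupies [4U<<29, 5U<<29); all write-fix states are above it.
  bool is_read_fixed() const
  {
    const uint32_t s= state();
    return s >= READ_FIX && s < WRITE_FIX;
  }
  bool is_write_fixed() const { return state() >= WRITE_FIX; }

  // Assumption: meaningful only for states at or above UNFIXED.
  uint32_t fix_count() const { return state() & (UNFIXED - 1); }

  // READ_FIX -> UNFIXED in one atomic step that preserves the fix count.
  void read_complete()
  {
    word.fetch_sub(READ_FIX - UNFIXED, std::memory_order_release);
  }
};
```

      Because the state lives in the high bits and the count in the low bits,
      a state transition and a concurrent fix() or unfix() do not disturb
      each other as long as the count does not overflow into bit 29.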
      
      buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to
      UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.
      
      buf_page_t::read_complete(): Renamed from buf_page_read_complete().
      Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.
      
      buf_page_t::can_relocate(): If the page latch is being held or waited for,
      or the block is buffer-fixed or io-fixed, return false. (The condition
      on the page latch is new.)
      
      Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we
      will acquire the page latch before fix(), and unfix() before unlocking.
      
      buf_page_t::flush(): Replaces buf_flush_page(). Optimize the
      handling of FREED pages.
      
      buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held
      by the caller.
      
      buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.
      
      buf_page_get_low(): Ignore guesses that are read-fixed because they
      may not yet be registered in buf_pool.page_hash and buf_pool.LRU.
      
      buf_page_optimistic_get(): Acquire latch before buffer-fixing.
      
      buf_page_make_young(): Leave read-fixed blocks alone, because they
      might not be registered in buf_pool.LRU yet.
      
      recv_sys_t::recover_deferred(), recv_sys_t::recover_low():
      Possibly fix MDEV-26326, by holding a page X-latch instead of
      only buffer-fixing the page.
      aaef2e1d
  19. 22 Oct, 2021 1 commit
    • MDEV-26769 InnoDB does not support hardware lock elision · 1f022809
      Marko Mäkelä authored
      This implements memory transaction support for:
      
      * Intel Restricted Transactional Memory (RTM), also known as TSX-NI
      (Transactional Synchronization Extensions New Instructions)
      * POWER v2.09 Hardware Transactional Memory (HTM) on GNU/Linux
      
      transactional_lock_guard, transactional_shared_lock_guard:
      RAII lock guards that try to elide the lock acquisition
      when transactional memory is available.
      
      buf_pool.page_hash: Try to elide latches whenever feasible.
      Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED
      tables, this is not always possible.
      In buf_page_get_low(), memory transactions only work reasonably
      well for validating a guessed block address.
      
      TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards
      that try to elide lock_sys.latch and related latches.
      1f022809
  20. 21 Oct, 2021 1 commit
    • MDEV-26864 Race condition between transaction commit and undo log truncation · c484a358
      Marko Mäkelä authored
      trx_commit_in_memory(): Do not release the rseg reference before
      trx_undo_commit_cleanup() has been invoked and the current transaction
      is truly done with the rollback segment. The purpose of the reference
      count is to prevent data races with trx_purge_truncate_history().
      
      This is based on
      mysql/mysql-server@ac79aa1522f33e6eb912133a81fa2614db764c9c.
      c484a358
  21. 18 Oct, 2021 1 commit
    • MDEV-26682 Replication timeouts with XA PREPARE · 18eab4a8
      Marko Mäkelä authored
      The purpose of non-exclusive locks in a transaction is to guarantee
      that the records covered by those locks must remain in that way until
      the transaction is committed. (The purpose of gap locks is to ensure
      that a record that was nonexistent will remain that way.)
      
      Once a transaction has reached the XA PREPARE state, the only allowed
      further actions are XA ROLLBACK or XA COMMIT. Therefore, it can be
      argued that only the exclusive locks that the XA PREPARE transaction
      is holding are essential.
      
      Furthermore, InnoDB never preserved explicit locks across server restart.
      For XA PREPARE transactions, we will only recover implicit exclusive locks
      for records that had been modified.
      
      Because of the fact that XA PREPARE followed by a server restart will
      cause some locks to be lost, we might as well always release all
      non-exclusive locks during the execution of an XA PREPARE statement.
      
      lock_release_on_prepare(): Release non-exclusive locks on XA PREPARE.
      
      trx_prepare(): Invoke lock_release_on_prepare() unless the
      isolation level is SERIALIZABLE or this is an internal distributed
      transaction with the binlog (not actual XA PREPARE statement).
      
      This has been discussed with Sergei Golubchik and Andrei Elkin.
      
      Reviewed by: Sergei Golubchik
      18eab4a8
  22. 06 Sep, 2021 1 commit
    • MDEV-26467: Avoid re-reading srv_spin_wait_delay inside a loop · 0f0b7e47
      Marko Mäkelä authored
      Invoking ut_delay(srv_spin_wait_delay) inside a spinloop would
      cause a read of 2 global variables as well as multiplication.
      Let us loop around MY_RELAX_CPU() using a precomputed loop count
      to keep the loops simpler, to help them scale better.
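
      A minimal sketch of the idea, under stated assumptions:
      relax_cpu_sketch() is a hypothetical stand-in for MY_RELAX_CPU()
      (here only a compiler barrier), and spin_try_lock_sketch() is an
      illustrative test-and-set lock, not InnoDB code. The point is that
      the round count and the delay loop count are passed in as plain
      values, so no global variable is read inside the loop.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for MY_RELAX_CPU(): a compiler barrier only.
static inline void relax_cpu_sketch()
{
  std::atomic_signal_fence(std::memory_order_seq_cst);
}

// Try to acquire a simple test-and-set lock, pausing between attempts.
// 'rounds' and 'delay_loops' are computed once by the caller.
static bool spin_try_lock_sketch(std::atomic<bool> &locked,
                                 unsigned rounds, unsigned delay_loops)
{
  for (unsigned r= 0; r < rounds; r++)
  {
    if (!locked.exchange(true, std::memory_order_acquire))
      return true; // acquired
    // Fixed-count pause; no multiplication or global reads here.
    for (unsigned i= delay_loops; i--; )
      relax_cpu_sketch();
  }
  return false; // the caller would fall back to a blocking wait
}
```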
      
      We also tried precomputing the delay into a global variable,
      but that appeared to result in slightly worse throughput.
      0f0b7e47
  23. 31 Aug, 2021 3 commits
    • MDEV-25919: Lock tables before acquiring dict_sys.latch · c5fd9aa5
      Marko Mäkelä authored
      In commit 1bd681c8 (MDEV-25506 part 3)
      we introduced a "fake instant timeout" when a transaction would wait
      for a table or record lock while holding dict_sys.latch. This prevented
      a deadlock of the server but could cause bogus errors for operations
      on the InnoDB persistent statistics tables.
      
      A better fix is to ensure that whenever a transaction is being
      executed in the InnoDB internal SQL parser (which will for now
      require dict_sys.latch to be held), it will already have acquired
      all locks that could be required for the execution. So, we will
      acquire the following locks upfront, before acquiring dict_sys.latch:
      
      (1) MDL on the affected user table (acquired by the SQL layer)
      (2) If applicable (not for RENAME TABLE): InnoDB table lock
      (3) If persistent statistics are going to be modified:
      (3.a) MDL_SHARED on mysql.innodb_table_stats, mysql.innodb_index_stats
      (3.b) exclusive table locks on the statistics tables
      (4) Exclusive table locks on the InnoDB data dictionary tables
      (not needed in ANALYZE TABLE and the like)
      
      Note: Acquiring exclusive locks on the statistics tables may cause
      more locking conflicts between concurrent DDL operations.
      Notably, RENAME TABLE will lock the statistics tables
      even if no persistent statistics are enabled for the table.
      
      DROP DATABASE will only acquire locks on statistics tables if
      persistent statistics are enabled for the tables on which the
      SQL layer is invoking ha_innobase::delete_table().
      For any "garbage collection" in innodb_drop_database(), a timeout
      while acquiring locks on the statistics tables will result in any
      statistics not being deleted for any tables that the SQL layer
      did not know about.
      
      If innodb_defragment=ON, information may be written to the statistics
      tables even for tables for which InnoDB persistent statistics are
      disabled. But, DROP TABLE will no longer attempt to delete that
      information if persistent statistics are not enabled for the table.
      
      This change should also fix the hangs related to InnoDB persistent
      statistics and STATS_AUTO_RECALC (MDEV-15020) as well as
      a bug that running ALTER TABLE on the statistics tables
      concurrently with running ALTER TABLE on InnoDB tables could
      cause trouble.
      
      lock_rec_enqueue_waiting(), lock_table_enqueue_waiting():
      Do not issue a fake instant timeout error when the transaction
      is holding dict_sys.latch. Instead, assert that the dict_sys.latch
      is never being held here.
      
      lock_sys_tables(): A new function to acquire exclusive locks on all
      dictionary tables, in case DROP TABLE or similar operation is
      being executed. Locking non-hard-coded tables is optional to avoid
      a crash in row_merge_drop_temp_indexes(). The SYS_VIRTUAL table was
      introduced in MySQL 5.7 and MariaDB Server 10.2. Normally, we require
      all these dictionary tables to exist before executing any DDL, but
      the function row_merge_drop_temp_indexes() is an exception.
      When upgrading from MariaDB Server 10.1 or MySQL 5.6 or earlier,
      the table SYS_VIRTUAL would not exist at this point.
      
      ha_innobase::commit_inplace_alter_table(): Invoke
      log_write_up_to() while not holding dict_sys.latch.
      
      dict_sys_t::remove(), dict_table_close(): No longer try to
      drop index stubs that were left behind by aborted online ADD INDEX.
      Such indexes should be dropped from the InnoDB data dictionary by
      row_merge_drop_indexes() as part of the failed DDL operation.
      Stubs for aborted indexes may only be left behind in the
      data dictionary cache.
      
      dict_stats_fetch_from_ps(): Use a normal read-only transaction.
      
      ha_innobase::delete_table(), ha_innobase::truncate(), fts_lock_table():
      While waiting for purge to stop using the table,
      do not hold dict_sys.latch.
      
      ha_innobase::delete_table(): Implement a work-around for the rollback
      of ALTER TABLE...ADD PARTITION. MDL_EXCLUSIVE would not be held if
      ALTER TABLE hits lock_wait_timeout while trying to upgrade the MDL
      due to a conflicting LOCK TABLES, such as in the first ALTER TABLE
      in the test case of Bug#53676 in parts.partition_special_innodb.
      Therefore, we must explicitly stop purge, because it would not be
      stopped by MDL.
      
      dict_stats_func(), btr_defragment_chunk(): Allocate a THD so that
      we can acquire MDL on the InnoDB persistent statistics tables.
      
      mysqltest_embedded: Invoke ha_pre_shutdown() before free_used_memory()
      in order to avoid ASAN heap-use-after-free related to acquire_thd().
      
      trx_t::dict_operation_lock_mode: Changed the type to bool.
      
      row_mysql_lock_data_dictionary(), row_mysql_unlock_data_dictionary():
      Implemented as macros.
      
      rollback_inplace_alter_table(): Apply an infinite timeout to lock waits.
      
      innodb_thd_increment_pending_ops(): Wrapper for
      thd_increment_pending_ops(). Never attempt async operation for
      InnoDB background threads, such as the trx_t::commit() in
      dict_stats_process_entry_from_recalc_pool().
      
      lock_sys_t::cancel(trx_t*): Make dictionary transactions immune to KILL.
      
      lock_wait(): Make dictionary transactions immune to KILL, and to
      lock wait timeout when waiting for locks on dictionary tables.
      
      parts.partition_special_innodb: Use lock_wait_timeout=0 to instantly
      get ER_LOCK_WAIT_TIMEOUT.
      
      main.mdl: Filter out MDL on InnoDB persistent statistics tables
      
      Reviewed by: Thirunarayanan Balathandayuthapani
      c5fd9aa5
    • MDEV-25919 preparation: Various cleanup · 094de717
      Marko Mäkelä authored
      que_eval_sql(): Remove the parameter lock_dict. The only caller
      with lock_dict=true was dict_stats_exec_sql(), which will now
      explicitly invoke dict_sys.lock() and dict_sys.unlock() by itself.
      
      row_import_cleanup(): Do not unnecessarily lock the dictionary.
      Concurrent access to the table during ALTER TABLE...IMPORT TABLESPACE
      is prevented by MDL and the fact that there cannot exist any
      undo log or change buffer records that would refer to the table
      or tablespace.
      
      row_import_for_mysql(): Do not unnecessarily lock the dictionary
      while accessing fil_system. Thanks to MDL_EXCLUSIVE that was acquired
      by the SQL layer, only one IMPORT may be in effect for the table name.
      
      row_quiesce_set_state(): Do not unnecessarily lock the dictionary.
      The dict_table_t::quiesce state is documented to be protected by
      all index latches, which we are acquiring.
      
      dict_table_close(): Introduce a simpler variant with fewer parameters.
      
      dict_table_close(): Reduce the amount of calls.
      We can simply invoke dict_table_t::release() on startup or
      in DDL operations, or when the table is inaccessible.
      In none of these cases is there a need to invalidate the
      InnoDB persistent statistics.
      
      pars_info_t::graph_owns_us: Remove (unused).
      
      pars_info_free(): Define inline.
      
      fts_delete(), trx_t::evict_table(), row_prebuilt_free(),
      row_rename_table_for_mysql(): Simplify.
      
      row_mysql_lock_data_dictionary(): Remove some references;
      use dict_sys.lock() and dict_sys.unlock() instead.
      
      row_mysql_lock_table(): Remove. Use lock_table_for_trx() instead.
      
      ha_innobase::check_if_supported_inplace_alter(),
      row_create_table_for_mysql(): Simply assert dict_sys.sys_tables_exist().
      In commit 49e2c8f0 and
      commit 1bd681c8 srv_start()
      actually guarantees that the system tables will exist,
      or the server is in read-only mode, or startup will fail.
      
      Reviewed by: Thirunarayanan Balathandayuthapani
      094de717
    • MDEV-24258 Merge dict_sys.mutex into dict_sys.latch · 82b7c561
      Marko Mäkelä authored
      In the parent commit, dict_sys.latch could theoretically have been
      replaced with a mutex. But, we can do better and merge dict_sys.mutex
      into dict_sys.latch. Generally, every occurrence of dict_sys.mutex_lock()
      will be replaced with dict_sys.lock().
      
      The PERFORMANCE_SCHEMA instrumentation for dict_sys_mutex
      will be removed along with dict_sys.mutex. The dict_sys.latch
      will remain instrumented as dict_operation_lock.
      
      Some use of dict_sys.lock() will be replaced with dict_sys.freeze(),
      which we will reintroduce for the new shared mode. Most notably,
      concurrent table lookups are possible as long as the tables are present
      in the dict_sys cache. In particular, this will allow more concurrency
      among InnoDB purge workers.
      
      Because dict_sys.mutex will no longer 'throttle' the threads that purge
      InnoDB transaction history, a performance degradation may be observed
      unless innodb_purge_threads=1.
      
      The table cache eviction policy will become FIFO-like,
      similar to what happened to fil_system.LRU
      in commit 45ed9dd9.
      The name of the list dict_sys.table_LRU will become somewhat misleading;
      that list contains tables that may be evicted, even though the
      eviction policy no longer is least-recently-used but first-in-first-out.
      (Note: Tables can never be evicted as long as locks exist on them or
      the tables are in use by some thread.)
      
      As demonstrated by the test perfschema.sxlock_func, there
      will be less contention on dict_sys.latch, because some previous
      use of exclusive latches will be replaced with shared latches.
      
      fts_parse_sql_no_dict_lock(): Replaced with pars_sql().
      
      fts_get_table_name_prefix(): Merged to fts_optimize_create().
      
      dict_stats_update_transient_for_index(): Deduplicated some code.
      
      ha_innobase::info_low(), dict_stats_stop_bg(): Use a combination
      of dict_sys.latch and table->stats_mutex_lock() to cover the
      changes of BG_STAT_SHOULD_QUIT, because the flag is being read
      in dict_stats_update_persistent() while not holding dict_sys.latch.
      
      row_discard_tablespace_for_mysql(): Protect stats_bg_flag by
      exclusive dict_sys.latch, like most other code does.
      
      row_quiesce_table_has_fts_index(): Remove unnecessary mutex
      acquisition. FLUSH TABLES...FOR EXPORT is protected by MDL.
      
      row_import::set_root_by_heuristic(): Remove unnecessary mutex
      acquisition. ALTER TABLE...IMPORT TABLESPACE is protected by MDL.
      
      row_ins_sec_index_entry_low(): Replace a call
      to dict_set_corrupted_index_cache_only(). Reads of index->type
      were not really protected by dict_sys.mutex, and writes
      (flagging an index corrupted) should be extremely rare.
      
      dict_stats_process_entry_from_defrag_pool(): Only freeze the dictionary,
      do not lock it exclusively.
      
      dict_stats_wait_bg_to_stop_using_table(), DICT_BG_YIELD: Remove trx.
      We can simply invoke dict_sys.unlock() and dict_sys.lock() directly.
      
      dict_acquire_mdl_shared()<trylock=false>: Assert that dict_sys.latch is
      only held in shared mode, not exclusive mode. Only acquire it in
      exclusive mode if the table needs to be loaded to the cache.
      
      dict_sys_t::acquire(): Remove. Relocating elements in dict_sys.table_LRU
      would require holding an exclusive latch, which we want to avoid
      for performance reasons.
      
      dict_sys_t::allow_eviction(): Add the table first to dict_sys.table_LRU,
      to compensate for the removal of dict_sys_t::acquire(). This function
      is only invoked by INFORMATION_SCHEMA.INNODB_SYS_TABLESTATS.
      
      dict_table_open_on_id(), dict_table_open_on_name(): If dict_locked=false,
      try to acquire dict_sys.latch in shared mode. Only acquire the latch in
      exclusive mode if the table is not found in the cache.
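
      The lookup pattern described here (shared latch first, exclusive
      latch only on a cache miss) can be sketched as follows. This is an
      illustration assuming C++17 std::shared_mutex; DictCacheSketch,
      open() and load() are hypothetical names, and the "table" is reduced
      to an int so that the example stays self-contained.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical cache; not the dict_sys implementation.
struct DictCacheSketch
{
  std::shared_mutex latch;
  std::unordered_map<std::string, int> tables;
  unsigned loads= 0;

  // Hypothetical loader standing in for reading the data dictionary.
  int load(const std::string &name)
  {
    loads++;
    return int(name.size());
  }

  int open(const std::string &name)
  {
    {
      // Fast path: concurrent lookups only need the shared latch.
      std::shared_lock<std::shared_mutex> s(latch);
      auto it= tables.find(name);
      if (it != tables.end())
        return it->second;
    }
    // Slow path: the table must be loaded, which modifies the cache.
    std::unique_lock<std::shared_mutex> x(latch);
    auto it= tables.find(name); // recheck: another thread may have loaded it
    if (it == tables.end())
      it= tables.emplace(name, load(name)).first;
    return it->second;
  }
};
```

      The recheck under the exclusive latch is needed because the shared
      latch is released before the exclusive one is acquired.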
      
      Reviewed by: Thirunarayanan Balathandayuthapani
      82b7c561
  24. 27 Jul, 2021 1 commit
    • MDEV-25594: Improve debug checks · cf1fc598
      Marko Mäkelä authored
      trx_t::will_lock: Changed the type to bool.
      
      trx_t::is_autocommit_non_locking(): Replaces
      trx_is_autocommit_non_locking().
      
      trx_is_ac_nl_ro(): Remove (replaced with equivalent assertion expressions).
      
      assert_trx_nonlocking_or_in_list(): Remove.
      Replaced with at least as strict checks in each place.
      
      check_trx_state(): Moved to a static function; partially replaced with
      individual debug assertions implementing equivalent or stricter checks.
      
      This is a backport of commit 7b51d11c
      from 10.5.
      cf1fc598
  25. 23 Jul, 2021 1 commit
  26. 22 Jul, 2021 1 commit
    • MDEV-26193: Wake up purge less often · a4dc9265
      Marko Mäkelä authored
      Starting with commit 6e12ebd4
      (MDEV-25062), srv_wake_purge_thread_if_not_active() became a
      more expensive operation, especially on NUMA systems, because
      instead of reading an atomic global variable trx_sys.rseg_history_len
      we are traversing up to 128 cache lines in trx_sys.history_exists().
      
      trx_t::commit_cleanup(): Do not wake up purge at all.
      We will wake up purge about once per second in srv_master_callback().
      
      srv_master_do_active_tasks(), srv_master_do_idle_tasks():
      Move some duplicated code to srv_master_callback().
      
      srv_master_callback(): Invoke purge_coordinator_timer_callback()
      to ensure that purge will be periodically woken up, even if the
      latest execution of trx_t::commit_cleanup() allowed the purge view
      to advance but did not wake up purge.
      Do not call log_free_check(), because every thread that is going
      to generate redo log is supposed to call that function anyway,
      before acquiring any page latches. Additional calls to the function
      every few seconds should not make any difference.
      
      srv_shutdown_threads(): Ensure that srv_shutdown_state can be at most
      SRV_SHUTDOWN_INITIATED in srv_master_callback(), by first invoking
      srv_master_timer.reset() before changing srv_shutdown_state.
      (Note: We first terminate the srv_master_callback and only then
      terminate the purge tasks. Thus, the purge subsystem should exist
      when srv_master_callback() invokes purge_coordinator_timer_callback()
      if it was initiated in the first place.)
      a4dc9265
  27. 20 Jul, 2021 1 commit
    • Fix switch case statement in trx_flush_log_if_needed_low() · 5f8651ac
      Jagdeep Sidhu authored
      In commit 2e814d47 on MariaDB 10.2
      the switch case statement in trx_flush_log_if_needed_low() regressed.
      
      Since 10.2 this code was refactored to have switches in descending
      order, so value of 3 for innodb_flush_log_at_trx_commit is behaving
      the same as value of 2, that is no FSYNC is being enforced during
      COMMIT phase. The switch should however not be empty and cases 2 and 3
      should not have the identical contents.
      
      As per the documentation, setting innodb_flush_log_at_trx_commit to 3
      should do an FSYNC to disk.
      This fixes the regression so that the switch statement again does
      what users expect the setting should do.
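
      As an illustration only, the intended mapping can be sketched with a
      small switch. FlushActionSketch and flush_action_sketch() are
      hypothetical names, and the sketch models just the write/sync decision
      described above (values 1 and 3 sync, value 2 writes without syncing,
      value 0 does neither at commit); it is not the actual function body.

```cpp
#include <cassert>

// What happens to the redo log at commit (hypothetical summary type).
struct FlushActionSketch
{
  bool write;
  bool sync;
};

// Hypothetical helper mapping the setting to the action at commit.
static FlushActionSketch flush_action_sketch(unsigned flush_log_at_trx_commit)
{
  bool sync= true;
  switch (flush_log_at_trx_commit) {
  case 2:
    sync= false; // write the log but do not FSYNC it
    /* fall through */
  case 3:
  case 1:
    return {true, sync};
  default: // 0: nothing is done at commit
    return {false, false};
  }
}
```

      Placing case 2 before the shared tail keeps values 1 and 3 on the
      syncing path, which is the property the regression had broken.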
      
      All new code of the whole pull request, including one or several files
      that are either new files or modified ones, are contributed under the
      BSD-new license. I am contributing on behalf of my employer Amazon Web
      Services, Inc.
      5f8651ac
  28. 03 Jul, 2021 1 commit
    • fixup 0a67b15a · 789a2a36
      Marko Mäkelä authored
      trx_t::free(): Declare xid as fully initialized in order to
      avoid tripping the subsequent MEM_CHECK_DEFINED
      (in WITH_MSAN and WITH_VALGRIND builds).
      789a2a36
  29. 01 Jul, 2021 2 commits
  30. 24 Jun, 2021 1 commit
    • MDEV-26007 Rollback unnecessarily initiates redo log write · 033e29b6
      Marko Mäkelä authored
      trx_t::commit_in_memory(): Do not initiate a redo log write if
      the transaction has no visible effect. If anything for this
      transaction had been made durable, crash recovery will roll back
      the transaction just fine even if the end of ROLLBACK is not
      durably written.
      
      Rollbacks of transactions that are associated with XA identifiers
      (possibly internally via the binlog) will always be persisted.
      The test rpl.rpl_gtid_crash covers this.
      033e29b6
  31. 23 Jun, 2021 1 commit
    • MDEV-25062: Reduce trx_rseg_t::mutex contention · 6e12ebd4
      Marko Mäkelä authored
      redo_rseg_mutex, noredo_rseg_mutex: Remove the PERFORMANCE_SCHEMA keys.
      The rollback segment mutex will be uninstrumented.
      
      trx_sys_t: Remove pointer indirection for rseg_array, temp_rseg.
      Align each element to the cache line.
      
      trx_sys_t::rseg_id(): Replaces trx_rseg_t::id.
      
      trx_rseg_t::ref: Replaces needs_purge, trx_ref_count, skip_allocation
      in a single std::atomic<uint32_t>.
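
      One way to pack two flags and a reference count into a single
      std::atomic<uint32_t> is sketched below. The bit layout (SKIP=1,
      NEEDS_PURGE=2, REF=4) and all names in RsegRefSketch are assumptions
      made for illustration; the commit message does not specify them.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout: bit 0 and bit 1 are flags, the rest is a count.
struct RsegRefSketch
{
  enum : uint32_t { SKIP= 1, NEEDS_PURGE= 2, REF= 4 };

  std::atomic<uint32_t> ref{0};

  void set_skip_allocation() { ref.fetch_or(SKIP); }
  void set_needs_purge() { ref.fetch_or(NEEDS_PURGE); }
  bool needs_purge() const { return (ref.load() & NEEDS_PURGE) != 0; }

  // Take a reference unless allocation is being skipped.
  bool acquire()
  {
    uint32_t r= ref.load(std::memory_order_relaxed);
    while (!(r & SKIP))
      if (ref.compare_exchange_weak(r, r + REF, std::memory_order_acquire))
        return true;
    return false;
  }

  void release() { ref.fetch_sub(REF, std::memory_order_release); }

  uint32_t references() const { return ref.load() / REF; }
  bool is_referenced() const { return ref.load() >= REF; }
};
```

      With this layout, a truncation-like operation could set SKIP and then
      wait until is_referenced() returns false, all on one cache line.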
      
      trx_rseg_t::latch: Replaces trx_rseg_t::mutex.
      
      trx_rseg_t::history_size: Replaces trx_sys_t::rseg_history_len
      
      trx_sys_t::history_size_approx(): Replaces trx_sys.rseg_history_len
      in those places where the exact count does not matter. We must not
      acquire any trx_rseg_t::latch while holding index page latches, because
      normally the trx_rseg_t::latch is acquired before any page latches.
      
      trx_sys_t::history_exists(): Replaces trx_sys.rseg_history_len!=0
      with an approximation.
      
      We remove some unnecessary trx_rseg_t::latch acquisition around
      trx_undo_set_state_at_prepare() and trx_undo_set_state_at_finish().
      Those operations will only access fields that remain constant
      after trx_rseg_t::init().
      6e12ebd4
  32. 21 Jun, 2021 1 commit
    • MDEV-15912: Remove traces of insert_undo · e46f76c9
      Marko Mäkelä authored
      Let us simply refuse an upgrade from earlier versions if the
      upgrade procedure was not followed. This simplifies the purge,
      commit, and rollback of transactions.
      
      Before upgrading to MariaDB 10.3 or later, a clean shutdown
      of the server (with innodb_fast_shutdown=1 or 0) is necessary,
      to ensure that any incomplete transactions are rolled back.
      The undo log format was changed in MDEV-12288. There is only
      one persistent undo log for each transaction.
      e46f76c9
  33. 09 Jun, 2021 1 commit
    • MDEV-25506 (3 of 3): Do not delete .ibd files before commit · 1bd681c8
      Marko Mäkelä authored
      This is a complete rewrite of DROP TABLE, also as part of other DDL,
      such as ALTER TABLE, CREATE TABLE...SELECT, TRUNCATE TABLE.
      
      The background DROP TABLE queue hack is removed.
      If a transaction needs to drop and create a table by the same name
      (like TRUNCATE TABLE does), it must first rename the table to an
      internal #sql-ib name. No committed version of the data dictionary
      will include any #sql-ib tables, because whenever a transaction
      renames a table to a #sql-ib name, it will also drop that table.
      Either the rename will be rolled back, or the drop will be committed.
      
      Data files will be unlinked after the transaction has been committed
      and a FILE_RENAME record has been durably written. The file will
      actually be deleted when the detached file handle returned by
      fil_delete_tablespace() will be closed, after the latches have been
      released. It is possible that a purge of the delete of the SYS_INDEXES
      record for the clustered index will execute fil_delete_tablespace()
      concurrently with the DDL transaction. In that case, the thread that
      arrives later will wait for the other thread to finish.
      
      HTON_TRUNCATE_REQUIRES_EXCLUSIVE_USE: A new handler flag.
      ha_innobase::truncate() now requires that all other references to
      the table be released in advance. This was implemented by Monty.
      
      ha_innobase::delete_table(): If CREATE TABLE..SELECT is detected,
      we will "hijack" the current transaction, drop the table in
      the current transaction and commit the current transaction.
      This essentially fixes MDEV-21602. There is a FIXME comment about
      making the check less failure-prone.
      
      ha_innobase::truncate(), ha_innobase::delete_table():
      Implement a fast path for temporary tables. We will no longer allow
      temporary tables to use the adaptive hash index.
      
      dict_table_t::mdl_name: The original table name for the purpose of
      acquiring MDL in purge, to prevent a race condition between a
      DDL transaction that is dropping a table, and purge processing
      undo log records of DML that had executed before the DDL operation.
      For #sql-backup- tables during ALTER TABLE...ALGORITHM=COPY, the
      dict_table_t::mdl_name will differ from dict_table_t::name.
      
      dict_table_t::parse_name(): Use mdl_name instead of name.
      
      dict_table_rename_in_cache(): Update mdl_name.
      
      For the internal FTS_ tables of FULLTEXT INDEX, purge would
      acquire MDL on the FTS_ table name, but not on the main table,
      and therefore it would be able to run concurrently with a
      DDL transaction that is dropping the table. Previously, the
      DROP TABLE queue hack prevented a race between purge and DDL.
      For now, we introduce purge_sys.stop_FTS() to prevent purge from
      opening any table, while a DDL transaction that may drop FTS_
      tables is in progress. The function fts_lock_table(), which will
      be invoked before the dictionary is locked, will wait for
      purge to release any table handles.
      
      trx_t::drop_table_statistics(): Drop statistics for the table.
      This replaces dict_stats_drop_index(). We will drop or rename
      persistent statistics atomically as part of DDL transactions.
      On lock conflict for dropping statistics, we will fail instantly
      with DB_LOCK_WAIT_TIMEOUT, because we will be holding the
      exclusive data dictionary latch.
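
      The "fail instantly" policy can be illustrated with a try-lock that
      reports a conflict instead of waiting. This is only a sketch: the
      db_err enum, nowait_lock and lock_statistics_nowait() are
      hypothetical stand-ins, not the InnoDB lock manager:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for InnoDB error codes; only the names echo the text.
enum class db_err { SUCCESS, LOCK_WAIT_TIMEOUT };

// Hypothetical non-blocking lock, used only for this illustration.
struct nowait_lock
{
  std::atomic<bool> held{false};

  // Returns true if the lock was free and is now held by the caller.
  bool try_acquire()
  { return !held.exchange(true, std::memory_order_acquire); }

  void release() { held.store(false, std::memory_order_release); }
};

// A caller that already holds an exclusive latch must not wait here,
// so a conflict is reported immediately as a lock wait timeout.
db_err lock_statistics_nowait(nowait_lock &lock)
{
  return lock.try_acquire() ? db_err::SUCCESS : db_err::LOCK_WAIT_TIMEOUT;
}
```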
      
      trx_t::commit_cleanup(): Separated from trx_t::commit_in_memory().
      Relax an assertion around fts_commit() and allow DB_LOCK_WAIT_TIMEOUT
      in addition to DB_DUPLICATE_KEY. The call to fts_commit() is
      entirely misplaced here and may obviously break the consistency
      of transactions that affect FULLTEXT INDEX. It needs to be fixed
      separately.
      
      dict_table_t::n_foreign_key_checks_running: Remove (MDEV-21175).
      The counter was a work-around for missing meta-data locking (MDL)
      on the SQL layer, and not really needed in MariaDB.
      
      ER_TABLE_IN_FK_CHECK: Replaced with ER_UNUSED_28.
      
      HA_ERR_TABLE_IN_FK_CHECK: Remove.
      
      row_ins_check_foreign_constraints(): Do not acquire
      dict_sys.latch either. The SQL-layer MDL will protect us.
      
      This was reviewed by Thirunarayanan Balathandayuthapani
      and tested by Matthias Leich.
  34. 27 May, 2021 1 commit
    • Marko Mäkelä's avatar
      MDEV-25791: Remove UNIV_INTERN · a7d68e7a
      Marko Mäkelä authored
      Back in 2006 or 2007, when MySQL AB and Innobase Oy existed as
      separately controlled entities (Innobase had been acquired by
      Oracle Corporation), MySQL 5.1 introduced a storage engine plugin
      interface and Oracle made use of it by distributing a separate
      InnoDB Plugin, which would contain some more bug fixes and
      improvements, compared to the version of InnoDB that was statically
      linked with the mysqld server that was distributed by MySQL AB.
      The built-in InnoDB would export global symbols, which would clash
      with the symbols of the dynamic InnoDB Plugin (which was supposed
      to override the built-in one when present).
      
      The solution to this problem was to declare all global symbols with
      UNIV_INTERN, so that they would get the GCC function attribute that
      specifies hidden visibility.
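
      For illustration, a macro of this kind could look as follows. This
      is a reconstruction under assumptions: the exact historical
      definition may have differed, and example_internal_function() is a
      hypothetical name:

```cpp
#include <cassert>

// Illustrative reconstruction, not copied from the InnoDB sources.
#if defined(__GNUC__)
# define UNIV_INTERN __attribute__((visibility("hidden")))
#else
# define UNIV_INTERN /* no visibility control on other compilers */
#endif

// Hypothetical function: usable from any translation unit linked into the
// same shared object, but not exported from its dynamic symbol table, so
// it cannot clash with a same-named symbol in another module.
UNIV_INTERN int example_internal_function(int x);

int example_internal_function(int x)
{
  return x + 1;
}
```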
      
      Later, in MariaDB Server, something based on Percona XtraDB (a fork of
      MySQL InnoDB) became the statically linked implementation, and something
      closer to MySQL InnoDB was available as a dynamic plugin. Starting with
      version 10.2, MariaDB Server includes only one InnoDB implementation,
      and hence any reason to have the UNIV_INTERN definition was lost.
      
      btr_get_size_and_reserved(): Move to the same compilation unit with
      the only caller.
      
      innodb_set_buf_pool_size(): Remove. Modify innobase_buffer_pool_size
      directly.
      
      fil_crypt_calculate_checksum(): Merge to the only caller.
      
      ha_innobase::innobase_reset_autoinc(): Merge to the only caller.
      
      thd_query_start_micro(): Remove. Call thd_start_utime() directly.
  35. 21 May, 2021 1 commit
    • Marko Mäkelä's avatar
      MDEV-25743: Unnecessary copying of table names in InnoDB dictionary · 49e2c8f0
      Marko Mäkelä authored
      Many InnoDB data dictionary cache operations require that the
      table name be copied so that it will be NUL-terminated.
      (For example, SYS_TABLES.NAME is not guaranteed to be NUL-terminated.)
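
      One way to avoid such copies is to pass names as a pointer and a
      length. The sketch below assumes C++17 std::string_view; the
      name_cache structure is hypothetical and is not the actual
      dict_sys_t interface:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <unordered_map>

// Hypothetical cache keyed by table name; not the InnoDB dict_sys structure.
// The keys are views, so the storage they point to must outlive the cache.
struct name_cache
{
  std::unordered_map<std::string_view, int> tables;

  // Look up a name given as pointer + length. The buffer does not have
  // to be NUL-terminated, and no temporary copy of the name is made.
  const int *find(const char *name, std::size_t len) const
  {
    auto it = tables.find(std::string_view(name, len));
    return it == tables.end() ? nullptr : &it->second;
  }
};
```

      A record field that is followed by unrelated bytes can then be
      looked up in place, by passing only the length of the name.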
      
      dict_table_t::is_garbage_name(): Check if a name belongs to
      the background drop table queue.
      
      dict_check_if_system_table_exists(): Remove.
      
      dict_sys_t::load_sys_tables(): Load the non-hard-coded system tables
      SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL on startup.
      
      dict_sys_t::create_or_check_sys_tables(): Replaces
      dict_create_or_check_foreign_constraint_tables() and
      dict_create_or_check_sys_virtual().
      
      dict_sys_t::load_table(): Replaces dict_table_get_low()
      and dict_load_table().
      
      dict_sys_t::find_table(): Renamed from get_table().
      
      dict_sys_t::sys_tables_exist(): Check whether all the non-hard-coded
      tables SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL exist.
      
      trx_t::has_stats_table_lock(): Moved to dict0stats.cc.
      
      Some error messages will now report table names in the internal
      databasename/tablename format, instead of `databasename`.`tablename`.
  36. 19 May, 2021 1 commit
    • Monty's avatar
      MDEV-25180 Atomic ALTER TABLE · 7762ee5d
      Monty authored
      MDEV-25604 Atomic DDL: Binlog event written upon recovery does not
                 have default database
      
      The purpose of this task is to ensure that ALTER TABLE is atomic even if
      the MariaDB server would be killed at any point of the alter table.
      This means that either the ALTER TABLE succeeds (including that triggers,
      the status tables and the binary log are updated) or things should be
      reverted to their original state.
      
      If the server crashes before the new version is fully up to date and
      committed, it will revert to the original table and remove all
      temporary files and tables.
      If the new version is committed, crash recovery will use the new version,
      and update triggers, the status tables and the binary log.
      The one exception is ALTER TABLE .. RENAME .. where no changes are made
      to the table definition. This one will work as RENAME and roll back unless
      the whole statement completed, including updating the binary log (if
      enabled).
      
      Other changes:
      - Added handlerton->check_version() function to allow the ddl recovery
        code to check, in case of inplace alter table, if the table in the
        storage engine is of the new or old version.
      - Added handler->table_version() so that an engine can report the current
        version of the table. This should be changed each time the table
        definition changes.
      - Added ha_signal_ddl_recovery_done() and
        handlerton::signal_ddl_recovery_done() to inform all handlers when
        ddl recovery has been done. (Needed by InnoDB).
      - Added handlerton call inplace_alter_table_committed, to signal engine
        that ddl_log has been closed for the alter table query.
      - Added new handlerton flag
        HTON_REQUIRES_NOTIFY_TABLEDEF_CHANGED_AFTER_COMMIT to signal when we
        should call hton->notify_tabledef_changed() during
        mysql_inplace_alter_table. This was required as MyRocks and InnoDB
        needed the call at different times.
      - Added function server_uuid_value() to be able to generate a temporary
        xid when ddl recovery writes the query to the binary log. This is
        needed to be able to handle crashes during ddl log recovery.
      - Moved freeing of the frm definition to end of mysql_alter_table() to
        remove duplicate code and have a common exit strategy.
      
      -------
      InnoDB part of atomic ALTER TABLE
      (Implemented by Marko Mäkelä)
      innodb_check_version(): Compare the saved dict_table_t::def_trx_id
      to determine whether an ALTER TABLE operation was committed.
      
      We must correctly recover dict_table_t::def_trx_id for this to work.
      Before purge removes any trace of DB_TRX_ID from system tables, it
      will make an effort to load the user table into the cache, so that
      the dict_table_t::def_trx_id can be recovered.
      
      ha_innobase::table_version(): return garbage, or the trx_id that would
      be used for committing an ALTER TABLE operation.
      
      In InnoDB, table names starting with #sql-ib will remain special:
      they will be dropped on startup. This may be revisited later in
      MDEV-18518 when we implement proper undo logging and rollback
      for creating or dropping multiple tables in a transaction.
      
      Table names starting with #sql will retain some special meaning:
      dict_table_t::parse_name() will not consider such names for
      MDL acquisition, and dict_table_rename_in_cache() will treat such
      names specially when handling FOREIGN KEY constraints.
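
      The two prefix rules above can be sketched as simple checks on a
      name in the internal databasename/tablename format. The helper
      names below are hypothetical and do not correspond to InnoDB
      functions:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Return the part after "databasename/", or the whole name if no '/' exists.
std::string_view table_part(std::string_view name)
{
  const std::size_t slash = name.find('/');
  return slash == std::string_view::npos ? name : name.substr(slash + 1);
}

// C++17-compatible prefix test (std::string_view::starts_with is C++20).
bool has_prefix(std::string_view s, std::string_view prefix)
{
  return s.substr(0, prefix.size()) == prefix;
}

// Hypothetical check: tables with such names would be dropped on startup.
bool is_sql_ib_name(std::string_view name)
{
  return has_prefix(table_part(name), "#sql-ib");
}

// Hypothetical check: such names would not be considered for MDL acquisition.
bool is_sql_name(std::string_view name)
{
  return has_prefix(table_part(name), "#sql");
}
```

      Note that every #sql-ib name also satisfies the broader #sql check.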
      
      Simplify InnoDB DROP INDEX.
      Prevent purge wakeup
      
      To ensure that dict_table_t::def_trx_id will be recovered correctly
      in case the server is killed before ddl_log_complete(), we will block
      the purge of any history in SYS_TABLES, SYS_INDEXES, SYS_COLUMNS
      between ha_innobase::commit_inplace_alter_table(commit=true)
      (purge_sys.stop_SYS()) and purge_sys.resume_SYS().
      The completion callback purge_sys.resume_SYS() must be between
      ddl_log_complete() and MDL release.
      
      --------
      
      MyRocks support for atomic ALTER TABLE
      (Implemented by Sergei Petrunia)
      
      Implement these SE API functions:
      - ha_rocksdb::table_version()
      - hton->check_version = rocksdb_check_version

      MyRocks data dictionary now stores table version for each table.
      (Absence of a table version record is interpreted as table_version=0,
      which means that no upgrade changes are needed.)

      - For inplace alter table of a partitioned table, call the underlying
        handlerton when checking if the table is ok. This assumes that the
        partition engine commits all changes at once.
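
      The rule that a missing table version record counts as
      table_version=0 could be sketched as a lookup with a default. The
      map-based dictionary and the function name are hypothetical and do
      not reflect how MyRocks actually stores the version:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical in-memory stand-in for the stored table version records.
using version_dict = std::unordered_map<std::string, std::uint64_t>;

// Return the stored version of `name`, or 0 if no record exists, so that
// tables created before versions were stored need no upgrade step.
std::uint64_t lookup_table_version(const version_dict &dict,
                                   const std::string &name)
{
  auto it = dict.find(name);
  return it == dict.end() ? 0 : it->second;
}
```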