  1. 23 Sep, 2024 1 commit
  2. 24 Aug, 2024 1 commit
  3. 16 Aug, 2024 1 commit
  4. 09 Aug, 2024 1 commit
  5. 08 Aug, 2024 6 commits
  6. 10 Jan, 2024 2 commits
    • MDEV-26195 fixup: Remove page_no_t · 338ed5c4
      Marko Mäkelä authored
      338ed5c4
    • MDEV-33112 innodb_undo_log_truncate=ON is blocking page write · 3613fb2a
      Marko Mäkelä authored
      When innodb_undo_log_truncate=ON causes an InnoDB undo tablespace
      to be truncated, we must guarantee that the undo tablespace will
      be rebuilt atomically: After mtr_t::commit_shrink() has durably
      written the mini-transaction that rebuilds the undo tablespace,
      we must not write any old pages to the tablespace.
      
      To guarantee this, in trx_purge_truncate_history() we used to
      traverse the entire buf_pool.flush_list in order to acquire
      exclusive latches on all pages for the undo tablespace that
      reside in the buffer pool, so that those pages cannot be written
      and will be evicted during mtr_t::commit_shrink(). However, this
      traversal may interfere with the page writing activity of
      buf_flush_page_cleaner(). It would be better to lazily discard
      the old pages of the truncated undo tablespace.
      
      fil_space_t::is_being_truncated, fil_space_t::clear_stopping(): Remove.
      
      fil_space_t::create_lsn: A new field, identifying the LSN of the
      latest rebuild of a tablespace.
      
      buf_page_t::flush(), buf_flush_try_neighbors(): Evict pages whose
      FIL_PAGE_LSN is below fil_space_t::create_lsn.
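
The eviction rule above reduces to a single LSN comparison. A minimal
sketch, assuming a simplified stand-in for the tablespace object
(space_sketch and must_discard() are hypothetical names; only create_lsn
is taken from this commit message):

```cpp
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;

// Hypothetical stand-in for the per-tablespace state; only the field
// name create_lsn mirrors the commit message.
struct space_sketch {
  lsn_t create_lsn;  // LSN of the latest rebuild of the tablespace
};

// A page image stamped before the latest rebuild is stale: it should be
// discarded instead of being written to the truncated file.
inline bool must_discard(const space_sketch &space, lsn_t page_lsn) {
  return page_lsn < space.create_lsn;
}
```

Because the check happens at the time a page would be written, no
up-front traversal of the flush list is needed in this model.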
      
      mtr_t::commit_shrink(): Update fil_space_t::create_lsn and
      fil_space_t::size right before the log is durably written and the
      tablespace file is being truncated.
      
      fsp_page_create(), trx_purge_truncate_history(): Simplify the logic.
      
      Reviewed by: Thirunarayanan Balathandayuthapani, Vladislav Lesin
      Performance tested by: Axel Schwenke
      Correctness tested by: Matthias Leich
      3613fb2a
  7. 15 Nov, 2023 1 commit
    • MDEV-32757 innodb_undo_log_truncate=ON is not crash safe · a0f02f74
      Marko Mäkelä authored
      trx_purge_truncate_history(): Do not prematurely mark dirty pages
      as clean. This will be done in mtr_t::commit_shrink() as part of
      Shrink::operator()(mtr_memo_slot_t*). Also, register each dirty page
      only once in the mini-transaction.
      
      fsp_page_create(): Adjust and simplify the page creation during
      undo tablespace truncation. We can directly reuse pages that are
      already in buf_pool.page_hash.
      
      This fixes a regression that was caused by
      commit f5794e1d (MDEV-26445).
      
      Tested by: Matthias Leich
      Reviewed by: Thirunarayanan Balathandayuthapani
      a0f02f74
  8. 01 Nov, 2023 1 commit
  9. 27 Oct, 2023 1 commit
    • MDEV-28699 Shrink temporary tablespaces without restart · c507678b
      Thirunarayanan Balathandayuthapani authored
      - Introduced the variable "innodb_truncate_temporary_tablespace_now"
      to shrink the temporary tablespace.
      
      Steps for shrinking the temporary tablespace:
      
      1) Find the last used extent in temporary tablespace
      by iterating through the BITMAP in extent descriptor pages
      
      2) If the last used extent is less than the user-specified size,
      then set the desired target size to the user-specified size
      
      3) Store the page contents of the "to be modified" extent
      descriptor pages, latch the "to be modified" extent
      descriptor pages, and check for buffer pool memory availability
      
      4) Update the FSP_SIZE and FSP_FREE_LIMIT in header page
      
      5) Remove the "to be truncated" pages from FSP_FREE and
      FSP_FREE_FRAG list
      
      6) Reset the bitmap in the last descriptor pages for the
      "to be truncated" pages.
      
      7) Clear the freed ranges in the temporary tablespace that
      are to be truncated.
      
      8) Evict the "to be truncated" temporary tablespace pages
      from LRU list.
      
      9) In case of multiple files, calculate the truncated last
      file size and do truncation in last file
      
      10) Commit the mini-transaction for shrinking the tablespace
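
The first two steps (finding the last used extent and applying the
user-specified minimum) can be sketched as follows. This is an
illustration only: shrink_target(), the flat vector standing in for the
extent descriptor bitmap, and the extent size of 64 pages are all
assumptions, not the code from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed extent size in pages; the real value depends on the page size.
constexpr std::uint32_t PAGES_PER_EXTENT = 64;

// used[i] is true if extent i holds at least one allocated page.
// Returns the tablespace size in pages after shrinking, never below
// min_pages (the user-specified size) and never above the current size.
inline std::uint32_t shrink_target(const std::vector<bool> &used,
                                   std::uint32_t min_pages) {
  std::size_t last_used = 0;  // count of extents up to the last used one
  for (std::size_t i = used.size(); i > 0; --i) {
    if (used[i - 1]) { last_used = i; break; }
  }
  const std::uint32_t needed =
      static_cast<std::uint32_t>(last_used) * PAGES_PER_EXTENT;
  const std::uint32_t current =
      static_cast<std::uint32_t>(used.size()) * PAGES_PER_EXTENT;
  const std::uint32_t target = needed < min_pages ? min_pages : needed;
  return target < current ? target : current;
}
```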
      c507678b
  10. 14 Sep, 2023 1 commit
  11. 11 Sep, 2023 1 commit
  12. 28 Aug, 2023 1 commit
  13. 01 Aug, 2023 1 commit
    • MDEV-14795 InnoDB system tablespace cannot be shrunk · f9003c73
      Thirunarayanan Balathandayuthapani authored
      - Introduce the :autoshrink attribute, which can be added to
      the innodb_data_file_path variable to allow the system
      tablespace to be shrunk during startup.
      
      Steps for shrinking the system tablespace:
      
      1) Find the last used extent in system tablespace
      by iterating through the BITMAP in extent descriptor pages
      
      2) If the last used extent is less than the user-specified size,
      then set the desired target size to the user-specified size.
      
      3) Store the page contents of the "to be modified" extent
      descriptor pages, latch the "to be modified"
      extent descriptor pages, and check for buffer pool
      memory availability
      
      4) Make a checkpoint to flush all pages in the buffer pool, so
      that pages in the flush list do not have to use the doublewrite
      buffer, and disable the doublewrite buffer during the shrinking process
      
      5) Update the FSP_SIZE and FSP_FREE_LIMIT in header page
      
      6) Remove the "to be truncated" pages from FSP_FREE and
      FSP_FREE_FRAG list
      
      7) Reset the bitmap in the last descriptor pages for the
      "to be truncated" pages.
      
      8) In case of multiple files, calculate the truncated last
      file size and do the truncation in last file
      
      9) Check whether the mini-transaction log size exceeds
      the minimum value of innodb_log_buffer_size, which is 2MB.
      If it does, replace the modified buffer pool pages with
      their old contents.
      
      10) Commit the mini-transaction for shrinking the tablespace
      and enable or disable the doublewrite buffer, depending on the
      user-specified value.
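
The "store old page contents, modify, and restore if the log grows too
large" part of the procedure can be sketched like this. It is a
simplified model under stated assumptions: try_shrink(), the page maps,
and the log_bytes argument are hypothetical, and the 2 MiB limit is the
minimum innodb_log_buffer_size quoted in the message.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using page_t = std::vector<unsigned char>;

// 2 MiB: the minimum innodb_log_buffer_size named in the commit message.
constexpr std::size_t MAX_MTR_LOG_BYTES = 2u << 20;

// Hypothetical shrink attempt: 'pages' maps page numbers to contents,
// 'edits' holds the new contents of the descriptor pages, and log_bytes
// is the log volume that the mini-transaction would generate.
inline bool try_shrink(std::map<unsigned, page_t> &pages,
                       const std::map<unsigned, page_t> &edits,
                       std::size_t log_bytes) {
  std::map<unsigned, page_t> saved;
  for (const auto &e : edits) {
    assert(pages.count(e.first));     // only existing pages are modified
    saved[e.first] = pages[e.first];  // remember the old contents
    pages[e.first] = e.second;        // apply the modification
  }
  if (log_bytes > MAX_MTR_LOG_BYTES) {
    // Too much log: put the old contents back and give up.
    for (const auto &s : saved) pages[s.first] = s.second;
    return false;
  }
  return true;  // the caller may commit the mini-transaction
}
```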
      
      recv_sys_t::apply(): Handle the truncation of the system tablespace
      only if the recovered tablespace size is less than the actual
      existing size.
      f9003c73
  14. 24 May, 2023 1 commit
  15. 16 Feb, 2023 1 commit
    • MDEV-30638 Deadlock between INSERT and InnoDB non-persistent statistics update · 201cfc33
      Marko Mäkelä authored
      This is a partial revert of
      commit 8b6a308e (MDEV-29883)
      and a follow-up to the
      merge commit 394fc71f (MDEV-24569).
      
      The latching order related to any operation that accesses the allocation
      metadata of an InnoDB index tree is as follows:
      
      1. Acquire dict_index_t::lock in non-shared mode.
      2. Acquire the index root page latch in non-shared mode.
      3. Possibly acquire further index page latches. Unless an exclusive
      dict_index_t::lock is held, this must follow the root-to-leaf,
      left-to-right order.
      4. Acquire a *non-shared* fil_space_t::latch.
      5. Acquire latches on the allocation metadata pages.
      6. Possibly allocate and write some pages, or free some pages.
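
One way to picture the latching order above is a per-thread checker that
rejects any request ranked lower than the most recently acquired latch.
This is a sketch only: latch_rank and latch_order_checker are
hypothetical names, and InnoDB does not necessarily enforce the order
this way.

```cpp
#include <cassert>
#include <vector>

// Ranks follow the numbered list; the enumerator names are hypothetical.
enum class latch_rank {
  INDEX_LOCK = 1,     // dict_index_t::lock
  ROOT_PAGE = 2,      // index root page latch
  INDEX_PAGE = 3,     // further index page latches
  SPACE_LATCH = 4,    // fil_space_t::latch
  ALLOC_METADATA = 5  // allocation metadata page latches
};

// Tracks the latches held by one thread and rejects out-of-order requests.
class latch_order_checker {
  std::vector<latch_rank> held_;
public:
  // Returns true and records the latch if acquiring it respects the order.
  bool acquire(latch_rank r) {
    if (!held_.empty() && r < held_.back()) return false;
    held_.push_back(r);
    return true;
  }
  void release_all() { held_.clear(); }
};
```

In this model, a thread that already holds the tablespace latch and then
asks for an index page latch is refused, which corresponds to the lock
order inversion that the commit avoids.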
      
      btr_get_size_and_reserved(), dict_stats_update_transient_for_index(),
      dict_stats_analyze_index(): Acquire an exclusive fil_space_t::latch
      in order to avoid a deadlock in fseg_n_reserved_pages() in case of
      concurrent access to multiple indexes sharing the same "inode page".
      
      fseg_page_is_allocated(): Acquire an exclusive fil_space_t::latch
      in order to avoid deadlocks. All callers are holding latches
      on a buffer pool page, or an index, or both.
      Before commit edbde4a1 (MDEV-24167)
      a third mode was available that would not conflict with the shared
      fil_space_t::latch acquired by ha_innobase::info_low(),
      i_s_sys_tablespaces_fill_table(),
      or i_s_tablespaces_encryption_fill_table().
      Because those calls should be rather rare, it makes sense to use
      the simple rw_lock with only shared and exclusive modes.
      
      fil_crypt_get_page_throttle(): Avoid invoking fseg_page_is_allocated()
      on an allocation bitmap page (which can never be freed), to avoid
      acquiring a shared latch on top of an exclusive one.
      
      mtr_t::s_lock_space(), MTR_MEMO_SPACE_S_LOCK: Remove.
      201cfc33
  16. 24 Jan, 2023 2 commits
    • MDEV-26790 InnoDB read-ahead may cause page writes · a30d4250
      Marko Mäkelä authored
      buf_LRU_get_free_block(): Replace the Boolean parameter with a
      ternary parameter, so that have_no_mutex_soft can be specified to
      reduce the chances of initiating page eviction flushing in read-ahead.
      
      buf_read_acquire(): Invoke buf_LRU_get_free_block(have_no_mutex_soft)
      and check in each caller for a nullptr return value.
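
The effect of a three-valued mode can be illustrated with a mock free
list. Only the name have_no_mutex_soft comes from this commit message;
the other enumerators, pool_sketch, and the simulated flush are
assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Only have_no_mutex_soft is named in the commit message; the other
// enumerators and the pool type are assumptions for illustration.
enum lru_get_mode { have_no_mutex, have_mutex, have_no_mutex_soft };

struct block { int id; };

struct pool_sketch {
  std::vector<block*> free_list;
  std::size_t flushes = 0;  // counts eviction flushes that were initiated
  block spare{-1};          // block reclaimed by a (simulated) flush

  block *get_free_block(lru_get_mode mode) {
    if (!free_list.empty()) {
      block *b = free_list.back();
      free_list.pop_back();
      return b;
    }
    if (mode == have_no_mutex_soft) return nullptr;  // give up quietly
    ++flushes;  // the other modes pay for an eviction flush
    return &spare;
  }
};
```

A read-ahead caller would pass the soft mode and skip the read when it
gets nullptr back.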
      a30d4250
    • MDEV-30400 Assertion height == btr_page_get_level(...) on INSERT · de4030e4
      Marko Mäkelä authored
      This also fixes part of MDEV-29835 Partial server freeze
      which is caused by violations of the latching order that was
      defined in https://dev.mysql.com/worklog/task/?id=6326
      (WL#6326: InnoDB: fix index->lock contention). Unless the
      current thread is holding an exclusive dict_index_t::lock,
      it must acquire page latches in a strict parent-to-child,
      left-to-right order. Not all cases of MDEV-29835 are fixed yet.
      Failure to follow the correct latching order will cause deadlocks
      of threads due to lock order inversion.
      
      As part of these changes, the BTR_MODIFY_TREE mode is modified
      so that an Update latch (U a.k.a. SX) will be acquired on the
      root page, and eXclusive latches (X) will be acquired on all pages
      leading to the leaf page, as well as any left and right siblings
      of the pages along the path. The DEBUG_SYNC test innodb.innodb_wl6326
      will be removed, because at the time the DEBUG_SYNC point is hit,
      the thread is actually holding several page latches that will be
      blocking a concurrent SELECT statement.
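
As a rough illustration of the BTR_MODIFY_TREE latching described
above, the following sketch builds the list of latches for one
root-to-leaf path. The types, plan_modify_tree(), and the use of page
number 0 as a "no sibling" marker are hypothetical and exist only for
this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class latch { U, X };

struct planned_latch { unsigned page_no; latch mode; };

// One level of the root-to-leaf path; page number 0 is used here as a
// hypothetical "no sibling" marker.
struct level { unsigned left, page, right; };

// Builds the acquisition order: U on the root, then X on each lower
// level, taking the left sibling, the page itself, and the right sibling.
inline std::vector<planned_latch> plan_modify_tree(
    const std::vector<level> &path) {
  std::vector<planned_latch> plan;
  for (std::size_t i = 0; i < path.size(); ++i) {
    if (i == 0) { plan.push_back({path[i].page, latch::U}); continue; }
    if (path[i].left) plan.push_back({path[i].left, latch::X});
    plan.push_back({path[i].page, latch::X});
    if (path[i].right) plan.push_back({path[i].right, latch::X});
  }
  return plan;
}
```

The resulting order is parent-to-child and, within a level,
left-to-right, matching the rule stated in the first paragraph.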
      
      We also remove double bookkeeping that was caused by excessive
      information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo
      store information of latched pages, and ensure that
      mtr_memo_slot_t::object is never a null pointer.
      The tree_blocks[] and tree_savepoints[] were redundant.
      
      buf_page_get_low(): If innodb_change_buffering_debug=1, to avoid
      a hang, do not try to evict blocks if we are holding a latch on
      a modified page. The test innodb.innodb-change-buffer-recovery
      will be removed, because change buffering may no longer be forced
      by debug injection when the change buffer comprises multiple pages.
      Remove a debug assertion that could fail when
      innodb_change_buffering_debug=1 fails to evict a page.
      For other cases, the assertion is redundant, because we already
      checked that right after the got_block: label. The test
      innodb.innodb-change-buffering-recovery will be removed, because
      due to this change, we will be unable to evict the desired page.
      
      mtr_t::lock_register(): Register a change of a page latch
      on an unmodified buffer-fixed block.
      
      mtr_t::x_latch_at_savepoint(), mtr_t::sx_latch_at_savepoint():
      Replaced by the use of mtr_t::upgrade_buffer_fix(), which now
      also handles RW_S_LATCH.
      
      mtr_t::set_modified(): For temporary tables, invoke
      buf_page_t::set_modified() here and not in mtr_t::commit().
      We will never set the MTR_MEMO_MODIFY flag on anything other than
      persistent data pages, nor set mtr_t::m_modifications when
      temporary data pages are modified.
      
      mtr_t::commit(): Only invoke the buf_flush_note_modification() loop
      if persistent data pages were modified.
      
      mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo.
      This avoids many redundant entries in mtr_t::m_memo, as well as
      redundant calls to buf_page_get_gen() for blocks that had already
      been looked up in a mini-transaction.
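
The idea of looking up an already latched page in the memo, rather than
going back to the buffer pool, can be sketched as below. mtr_sketch and
its members are simplified stand-ins; the flag values are invented and
do not match the real MTR_MEMO_* constants.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical latch flags; the real memo uses MTR_MEMO_* constants.
enum memo_type : unsigned { S_FIX = 1, X_FIX = 2, SX_FIX = 4 };

struct block_sketch { std::uint32_t space, page_no; };

struct memo_slot { block_sketch *object; unsigned type; };

struct mtr_sketch {
  std::vector<memo_slot> memo;

  // The object pointer is never null, as required by the commit message.
  void latch(block_sketch *b, memo_type t) {
    assert(b);
    memo.push_back({b, t});
  }

  // Returns the block if this mini-transaction already latched the page
  // in one of the modes in type_mask, else nullptr (no pool lookup).
  block_sketch *get_already_latched(std::uint32_t space,
                                    std::uint32_t page_no,
                                    unsigned type_mask) const {
    for (const memo_slot &s : memo)
      if ((s.type & type_mask) && s.object->space == space &&
          s.object->page_no == page_no)
        return s.object;
    return nullptr;
  }
};
```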
      
      btr_get_latched_root(): Return a pointer to an already latched root page.
      This replaces btr_root_block_get() in cases where the mini-transaction
      has already latched the root page.
      
      btr_page_get_parent(): Fetch a parent page that was already latched
      in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched().
      If needed, upgrade the root page U latch to X.
      This avoids bloating mtr_t::m_memo as well as performing redundant
      buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for
      B-tree defragmentation, we will invoke btr_cur_search_to_nth_level().
      
      btr_cur_search_to_nth_level(): This will only be used for non-leaf
      (level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE
      or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be
      removed altogether, or retained for the case of
      CHECK TABLE without QUICK.
      
      btr_cur_t::left_block: Remove. btr_pcur_move_backward_from_page()
      can retrieve the left sibling from the end of mtr_t::m_memo.
      
      btr_cur_t::open_leaf(): Some clean-up.
      
      btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
      for searches to level=0 (the leaf level). We will never release
      parent page latches before acquiring leaf page latches. If we need to
      temporarily release the level=1 page latch in the BTR_SEARCH_PREV or
      BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the
      child node pointer so that we will land on the correct leaf page.
      
      btr_cur_t::pessimistic_search_leaf(): Implement new BTR_MODIFY_TREE
      latching logic in the case that page splits or merges will be needed.
      The parent pages (and their siblings) should already be latched on
      the first dive to the leaf and be present in mtr_t::m_memo; there
      should be no need for BTR_CONT_MODIFY_TREE. This pre-latching almost
      suffices; it must be revised in MDEV-29835 and work-arounds removed
      for cases where mtr_t::get_already_latched() fails to find a block.
      
      rtr_search_to_nth_level(): A SPATIAL INDEX version of
      btr_cur_search_to_nth_level() that can search to any level
      (including the leaf level).
      
      rtr_search_leaf(), rtr_insert_leaf(): Wrappers for
      rtr_search_to_nth_level().
      
      rtr_search(): Replaces rtr_pcur_open().
      
      rtr_latch_leaves(): Replaces btr_cur_latch_leaves(). Note that unlike
      in the B-tree code, there is no error handling in case the sibling
      pages are corrupted.
      
      rtr_cur_restore_position(): Remove an unused constant parameter.
      
      btr_pcur_open_on_user_rec(): Remove the constant parameter
      mode=PAGE_CUR_GE.
      
      row_ins_clust_index_entry_low(): Use a new
      mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page
      when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC.
      
      BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove.
      
      BTR_CONT_MODIFY_TREE: Note that this is only used by
      rtr_search_to_nth_level().
      
      btr_pcur_optimistic_latch_leaves(): Replaces
      btr_cur_optimistic_latch_leaves().
      
      ibuf_delete_rec(): Acquire exclusive ibuf.index->lock in order
      to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV).
      
      btr_blob_log_check_t(): Acquire a U latch on the root page,
      so that btr_page_alloc() in btr_store_big_rec_extern_fields()
      will avoid a deadlock.
      
      btr_store_big_rec_extern_fields(): Assert that the root page latch
      is being held.
      
      Tested by: Matthias Leich
      Reviewed by: Vladislav Lesin
      de4030e4
  17. 23 Jan, 2023 1 commit
  18. 19 Jan, 2023 1 commit
    • MDEV-30400 Assertion height == btr_page_get_level(...) on INSERT · f9cac8d2
      Marko Mäkelä authored
      This also fixes part of MDEV-29835 Partial server freeze
      which is caused by violations of the latching order that was
      defined in https://dev.mysql.com/worklog/task/?id=6326
      (WL#6326: InnoDB: fix index->lock contention). Unless the
      current thread is holding an exclusive dict_index_t::lock,
      it must acquire page latches in a strict parent-to-child,
      left-to-right order. Not all cases are fixed yet. Failure to
      follow the correct latching order will cause deadlocks of threads
      due to lock order inversion.
      
      As part of these changes, the BTR_MODIFY_TREE mode is modified
      so that an Update latch (U a.k.a. SX) will be acquired on the
      root page, and eXclusive latches (X) will be acquired on all pages
      leading to the leaf page, as well as any left and right siblings
      of the pages along the path. The test innodb.innodb_wl6326
      will be removed, because at the time the DEBUG_SYNC point is hit,
      the thread is actually holding several page latches that will be
      blocking a concurrent SELECT statement.
      
      We also remove double bookkeeping that was caused by excessive
      information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo
      store information of latched pages, and ensure that
      mtr_memo_slot_t::object is never a null pointer.
      The tree_blocks[] and tree_savepoints[] were redundant.
      
      mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo.
      This avoids many redundant entries in mtr_t::m_memo, as well as
      redundant calls to buf_page_get_gen() for blocks that had already
      been looked up in a mini-transaction.
      
      btr_get_latched_root(): Return a pointer to an already latched root page.
      This replaces btr_root_block_get() in cases where the mini-transaction
      has already latched the root page.
      
      btr_page_get_parent(): Fetch a parent page that was already latched
      in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched().
      If needed, upgrade the root page U latch to X.
      This avoids bloating mtr_t::m_memo as well as redundant
      buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for
      B-tree defragmentation, we will invoke btr_cur_search_to_nth_level().
      
      btr_cur_search_to_nth_level(): This will only be used for non-leaf
      (level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE
      or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be
      removed altogether, or retained for the case of
      CHECK TABLE without QUICK.
      
      btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
      for searches to level=0 (the leaf level).
      
      btr_cur_t::pessimistic_search_leaf(): Implement the new
      BTR_MODIFY_TREE latching logic in the case that page splits
      or merges will be needed. The parent pages (and their siblings)
      should already be latched on the first dive to the leaf and be
      present in mtr_t::m_memo; there should be no need for
      BTR_CONT_MODIFY_TREE. This pre-latching almost suffices;
      MDEV-29835 will have to revise it and remove work-arounds where
      mtr_t::get_already_latched() fails to find a block.
      
      rtr_search_to_nth_level(): A SPATIAL INDEX version of
      btr_cur_search_to_nth_level() that can search to any level
      (including the leaf level).
      
      rtr_search_leaf(), rtr_insert_leaf(): Wrappers for
      rtr_search_to_nth_level().
      
      rtr_search(): Replaces rtr_pcur_open().
      
      rtr_cur_restore_position(): Remove an unused constant parameter.
      
      btr_pcur_open_on_user_rec(): Remove the constant parameter
      mode=PAGE_CUR_GE.
      
      btr_cur_latch_leaves(): Update a pre-existing mtr_t::m_memo entry
      for the current leaf page.
      
      row_ins_clust_index_entry_low(): Use a new
      mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page
      when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC.
      
      btr_cur_t::open_leaf(): Some clean-up.
      
      mtr_t::lock_register(): Register a page latch on a buffer-fixed block.
      
      BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove.
      
      BTR_CONT_MODIFY_TREE: Note that this is only used by
      rtr_search_to_nth_level().
      
      btr_pcur_optimistic_latch_leaves(): Replaces
      btr_cur_optimistic_latch_leaves().
      
      ibuf_delete_rec(): Acquire ibuf.index->lock.u_lock() in order
      to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV).
      
      Tested by: Matthias Leich
      f9cac8d2
  19. 11 Jan, 2023 1 commit
    • MDEV-29694 Remove the InnoDB change buffer · f27e9c89
      Marko Mäkelä authored
      The purpose of the change buffer was to reduce random disk access,
      which could be useful on rotational storage, but maybe less so on
      solid-state storage.
      When we wished to
      (1) insert a record into a non-unique secondary index,
      (2) delete-mark a secondary index record,
      (3) delete a secondary index record as part of purge (but not ROLLBACK),
      and the B-tree leaf page to which the record belongs was not in the buffer
      pool, we inserted a record into the change buffer B-tree, indexed by
      the page identifier. When the page was eventually read into the buffer
      pool, we looked up the change buffer B-tree for any modifications to the
      page and applied them upon the completion of the read operation. This
      was called the insert buffer merge.
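
The mechanism being removed can be modeled in a few lines: changes to
pages that are not in memory are queued per page identifier and handed
back when the page is read. This is a toy model; change_buffer_sketch
and all of its members are hypothetical and ignore persistence, the
B-tree, and free-space tracking.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using page_id = std::pair<std::uint32_t, std::uint32_t>;  // (space, page)

enum class op { INSERT, DELETE_MARK, PURGE_DELETE };
struct change { op kind; std::string key; };

struct change_buffer_sketch {
  std::map<page_id, std::vector<change>> buffered;  // indexed by page id
  std::set<page_id> in_pool;                        // pages in memory

  // Returns true if the change was buffered instead of applied directly.
  bool modify(page_id id, change c) {
    if (in_pool.count(id)) return false;  // page available: apply directly
    buffered[id].push_back(std::move(c));
    return true;
  }

  // Called when a page read completes: hand back and forget its changes.
  std::vector<change> merge_on_read(page_id id) {
    in_pool.insert(id);
    std::vector<change> out;
    auto it = buffered.find(id);
    if (it != buffered.end()) {
      out = std::move(it->second);
      buffered.erase(it);
    }
    return out;
  }
};
```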
      
      We remove the change buffer, because it has been the source of
      various hard-to-reproduce corruption bugs, including those fixed in
      commit 5b9ee8d8 and
      commit 165564d3 but not limited to them.
      
      A downgrade will fail with a clear message starting with
      commit db14eb16 (MDEV-30106).
      
      buf_page_t::state: Merge IBUF_EXIST to UNFIXED and
      WRITE_FIX_IBUF to WRITE_FIX.
      
      buf_pool_t::watch[]: Remove.
      
      trx_t: Move isolation_level, check_foreigns, check_unique_secondary,
      bulk_insert into the same bit-field. The only purpose of
      trx_t::check_unique_secondary is to enable bulk insert into an
      empty table. It no longer enables insert buffering for UNIQUE INDEX.
      
      btr_cur_t::thr: Remove. This field was originally needed for change
      buffering. Later, its use was extended to cover SPATIAL INDEX.
      Much of the time, rtr_info::thr holds this field. When it does not,
      we will add parameters to SPATIAL INDEX specific functions.
      
      ibuf_upgrade_needed(): Check if the change buffer needs to be updated.
      
      ibuf_upgrade(): Merge and upgrade the change buffer after all redo log
      has been applied. Free any pages consumed by the change buffer, and
      zero out the change buffer root page to mark the upgrade completed,
      and to prevent a downgrade to an earlier version.
      
      dict_load_tablespaces(): Renamed from
      dict_check_tablespaces_and_store_max_id(). This needs to be invoked
      before ibuf_upgrade().
      
      btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics.
      The change buffer merge does not need this function anymore.
      
      btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer
      allocate any change buffer pages.
      
      row_search_index_entry(), btr_lift_page_up(): Add a parameter thr
      for the SPATIAL INDEX case.
      
      rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert().
      
      rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert().
      
      Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0
      change buffer format that predates the MySQL 4.1 introduction of
      the option innodb_file_per_table was removed in MySQL 5.6.5
      as part of mysql/mysql-server@69b6241a79876ae98bb0c9dce7c8d8799d6ad273
      and MariaDB 10.0.11 as part of 1d0f70c2.
      
      In the tests innodb.log_upgrade and innodb.log_corruption, we create
      valid (upgraded) change buffer pages.
      
      Tested by: Matthias Leich
      f27e9c89
  20. 31 Aug, 2022 1 commit
    • MDEV-29374 InnoDB recovery fails with "Data structure corruption" · bdf62ece
      Marko Mäkelä authored
      recv_sys_t::free_corrupted_page(): Identify the corrupted page in
      an error or warning message.
      
      buf_page_free(): Just in case, register the page as modified.
      This should already have been done in mtr_t::free() as part of
      fseg_free_page_low().
      
      mtr_t::memo_push(): Simplify a condition, so that when invoked
      with MTR_MEMO_PAGE_X_MODIFY, we will do the right thing.
      
      fseg_free_page_low(): Remove an accidentally added return statement
      that prevented mtr_t::free() from being called. This fixes a regression
      that was introduced in
      commit 0b47c126 (MDEV-13542).
      bdf62ece
  21. 17 Aug, 2022 1 commit
  22. 06 Jun, 2022 2 commits
    • MDEV-18976 Implement OPT_PAGE_CHECKSUM log record for improved validation · 4179f93d
      Marko Mäkelä authored
      We will introduce an optional log record OPT_PAGE_CHECKSUM for recording
      page checksums, so that more inconsistencies on crash recovery may be
      caught.
      
      mtr_t::page_checksum(const buf_page_t&): Write OPT_PAGE_CHECKSUM
      (currently not for ROW_FORMAT=COMPRESSED pages).
      
      mtr_t::do_write(): Write OPT_PAGE_CHECKSUM records for all pages
      (currently, in debug builds only).
      
      mtr_t::is_logged(): Return whether log should be written.
      
      mtr_t::set_log_mode_sub(const mtr_t&): Set the logging mode of
      a sub-minitransaction when another mini-transaction is holding
      latches on some modified pages. When creating or freeing BLOB pages,
      we may only write OPT_PAGE_CHECKSUM records in the main mini-transaction,
      after all changes have been written to the log.
      
      MTR_LOG_SUB: Log mode for a sub-mini-transaction.
      
      mtr_t::free(): Define non-inline, and invoke MarkFreed.
      
      MarkFreed: For any matching page in the mini-transaction log,
      change the first entry to say MTR_MEMO_PAGE_X_MODIFY and any subsequent
      entries to MTR_MEMO_PAGE_X_FIX.
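
The rewrite performed by MarkFreed can be pictured as a single pass
over the memo entries. The sketch below uses invented entry kinds and a
free function mark_freed(); it is an assumption about the shape of the
logic, not the functor from this commit.

```cpp
#include <cassert>
#include <vector>

// Hypothetical memo entry kinds, loosely modeled on the MTR_MEMO_* names.
enum memo_kind { PAGE_S_FIX, PAGE_SX_FIX, PAGE_X_FIX, PAGE_X_MODIFY };

struct memo_entry { unsigned page_no; memo_kind kind; };

// Rewrites every entry for page_no: the first becomes PAGE_X_MODIFY,
// later ones become PAGE_X_FIX. Returns the number of matching entries.
inline unsigned mark_freed(std::vector<memo_entry> &memo, unsigned page_no) {
  unsigned matches = 0;
  for (memo_entry &e : memo) {
    if (e.page_no != page_no) continue;
    e.kind = matches++ ? PAGE_X_FIX : PAGE_X_MODIFY;
  }
  return matches;
}
```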
      
      FindModified: Simplify a condition. MTR_MEMO_MODIFY can only be set
      if MTR_MEMO_PAGE_X_FIX or MTR_MEMO_PAGE_SX_FIX are set.
      
      FindBlockX: Consider also MTR_MEMO_PAGE_X_MODIFY.
      
      recv_sys_t::parse(): Store OPT_PAGE_CHECKSUM records.
      
      log_phys_t::apply(): Validate OPT_PAGE_CHECKSUM records.
      
      log_phys_t::page_checksum(): Validate an OPT_PAGE_CHECKSUM record.
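
The write-then-validate round trip of a page checksum record can be
sketched as follows. The record layout and function names are
hypothetical, and FNV-1a is used purely as a placeholder; it is not the
checksum that InnoDB computes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using page_t = std::vector<std::uint8_t>;

// Placeholder checksum (FNV-1a, 32-bit); an assumption for illustration,
// not the checksum InnoDB uses.
inline std::uint32_t checksum(const page_t &page) {
  std::uint32_t h = 2166136261u;
  for (std::uint8_t b : page) { h ^= b; h *= 16777619u; }
  return h;
}

struct checksum_record { std::uint32_t page_no, value; };

// Writer side: emitted after all changes to the page have been logged.
inline checksum_record log_page_checksum(std::uint32_t page_no,
                                         const page_t &page) {
  return {page_no, checksum(page)};
}

// Recovery side: after applying the earlier records, the page must match.
inline bool validate(const checksum_record &rec, const page_t &recovered) {
  return rec.value == checksum(recovered);
}
```

A mismatch at recovery time indicates that the recovered page differs
from what the writer saw, which is the kind of inconsistency the
optional record is meant to catch.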
      
      Tested by: Matthias Leich
      4179f93d
    • MDEV-13542: Crashing on corrupted page is unhelpful · 0b47c126
      Marko Mäkelä authored
      The approach to handling corruption that was chosen by Oracle in
      commit 177d8b0c
      is not really useful. Not only did it actually fail to prevent InnoDB
      from crashing, but it is making things worse by blocking attempts to
      rescue data from or rebuild a partially readable table.
      
      We will try to prevent crashes in a different way: by propagating
      errors up the call stack. We will never mark the clustered index
      persistently corrupted, so that data recovery may be attempted by
      reading from the table, or by rebuilding the table.
      
      This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
      it was extensively tested with innodb_file_per_table=0 and a
      non-autoextend system tablespace.
      
      We should now avoid crashes in many cases, such as when a page
      cannot be read or allocated, or an inconsistency is detected when
      attempting to update multiple pages. We will not crash on double-free,
      such as on the recovery of DDL in system tablespace in case something
      was corrupted.
      
      Crashes on corrupted data are still possible. The fault injection mechanism
      that is introduced in the subsequent commit may help catch more of them.
      
      buf_page_import_corrupt_failure: Remove the fault injection, and instead
      corrupt some pages using Perl code in the tests.
      
      btr_cur_pessimistic_insert(): Always reserve extents (except for the
      change buffer), in order to prevent a subsequent allocation failure.
      
      btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
      
      btr_assert_not_corrupted(), btr_corruption_report(): Remove.
      Similar checks are already part of btr_block_get().
      
      FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
      
      dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
      trx_undo_page_get_s_latched(): Replaced with error-checking calls.
      
      trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
      
      trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
      
      trx_sys_create_sys_pages(): Merged with trx_sysf_create().
      
      dict_check_tablespaces_and_store_max_id(): Do not access
      DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
      Merge dict_check_sys_tables() with this function.
      
      dir_pathname(): Replaces os_file_make_new_pathname().
      
      row_undo_ins_remove_sec(): Do not modify the undo page by adding
      a terminating NUL byte to the record.
      
      btr_decryption_failed(): Report decryption failures.
      
      dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
      dict_set_corrupted_index_cache_only(): Remove.
      
      dict_set_corrupted(): Remove the constant parameter dict_locked=false.
      Never flag the clustered index corrupted in SYS_INDEXES, because
      that would deny further access to the table. It might be possible to
      repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
      no B-tree leaf page is corrupted.
      
      dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
      row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
      the callers to read dict_index_t::type only once.
      
      dict_table_is_corrupted(): Remove.
      
      dict_index_t::is_btree(): Determine if the index is a valid B-tree.
      
      BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
      
      UNIV_BTR_DEBUG: Remove. Inconsistencies will no longer trigger
      assertion failures; error codes will be returned instead.
      
      buf_corrupt_page_release(): Replaced with a direct call to
      buf_pool.corrupted_evict().
      
      fil_invalid_page_access_msg(): Never crash on an invalid read;
      let the caller of buf_page_get_gen() decide.
      
      btr_pcur_t::restore_position(): Propagate failure status to the caller
      by returning CORRUPTED.
      
      opt_search_plan_for_table(): Simplify the code.
      
      row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
      row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
      row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
      when no secondary indexes exist.
      
      row_undo_mod_upd_exist_sec(): Simplify the code.
      
      row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
      if the clustered index (and therefore the table) is corrupted, similar
      to what we do in row_insert_for_mysql().
      
      fut_get_ptr(): Replace with buf_page_get_gen() calls.
      
      buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
      if the page is marked as freed. For other modes than
      BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
      trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
      we will return nullptr for freed pages, so that the callers
      can be simplified. The purge of transaction history will be
      a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
      corrupted data.
      
      buf_page_get_low(): Never crash on a corrupted page, but simply
      return nullptr.
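
The "return nullptr and set *err instead of crashing" convention described above can be sketched as follows. This is only an illustration: page_get(), read_first_byte(), page_t and the two-value dberr_t are simplified stand-ins invented for this example, not the actual InnoDB declarations, and a missing page is treated like a freed one for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Simplified stand-ins for InnoDB types (assumptions for illustration only).
enum dberr_t { DB_SUCCESS, DB_CORRUPTION };
struct page_t { bool freed; uint8_t first_byte; };

// Hypothetical lookup: for a freed (or missing) page, report the problem
// through *err and return nullptr instead of asserting.
inline const page_t *page_get(const std::unordered_map<uint32_t, page_t> &pool,
                              uint32_t page_no, dberr_t *err)
{
  auto it = pool.find(page_no);
  if (it == pool.end() || it->second.freed) {
    *err = DB_CORRUPTION;
    return nullptr;
  }
  *err = DB_SUCCESS;
  return &it->second;
}

// The caller decides how to react; here it simply propagates the error.
inline dberr_t read_first_byte(const std::unordered_map<uint32_t, page_t> &pool,
                               uint32_t page_no, uint8_t *out)
{
  dberr_t err;
  const page_t *page = page_get(pool, page_no, &err);
  if (!page)
    return err;
  *out = page->first_byte;
  return DB_SUCCESS;
}
```

With this shape, every caller has a single nullptr check and an error code to hand upwards, which is what allows the assertion-based handling to be dropped.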
      
      fseg_page_is_allocated(): Replaces fseg_page_is_free().
      
      fts_drop_common_tables(): Return an error if the transaction
      was rolled back.
      
      fil_space_t::set_corrupted(): Report a tablespace as corrupted if
      it was not reported already.
      
      fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
      out-of-bounds page access or other errors.
      
      Clean up mtr_t::page_lock()
      
      buf_page_get_low(): Validate the page identifier (to check for
      recently read corrupted pages) after acquiring the page latch.
      
      buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
      with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
      
      mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
      
      recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
      if any log records exist for the page. We do not mind if read-ahead
      produces corrupted (or all-zero) pages that were not actually needed
      during recovery.
      
      recv_recover_page(): Return whether the operation succeeded.
      
      recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
      
      Thanks to Matthias Leich for testing this extensively and to the
      authors of https://rr-project.org for making it easy to diagnose
      and fix any failures that were found during the testing.
      0b47c126
  23. 21 Jan, 2022 1 commit
    • MDEV-14425 Improve the redo log for concurrency · 685d958e
      Marko Mäkelä authored
      The InnoDB redo log used to be formatted in blocks of 512 bytes.
      The log blocks were encrypted and the checksum was calculated while
      holding log_sys.mutex, creating a serious scalability bottleneck.
      
      We remove the fixed-size redo log block structure altogether and
      essentially turn every mini-transaction into a log block of its own.
      This allows encryption and checksum calculations to be performed
      on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
      The mutex only protects a memcpy() of the data to the shared
      log_sys.buf, as well as the padding of the log, in case the
      to-be-written part of the log would not end in a block boundary of
      the underlying storage. For now, the "padding" consists of writing
      a single NUL byte, to allow recovery and mariadb-backup to detect
      the end of the circular log faster.
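
The division of work described above (expensive processing on a local buffer, only a memcpy() under the mutex) can be sketched roughly like this. The names shared_log, finish_record() and toy_checksum() are hypothetical, and the byte-sum checksum is only a placeholder for the real encryption and checksum work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical shared log buffer; only the copy is done under the mutex.
struct shared_log {
  std::mutex mutex;
  std::vector<uint8_t> buf;
  size_t used = 0;

  explicit shared_log(size_t capacity) : buf(capacity) {}

  // Returns the offset at which the record was placed, or SIZE_MAX if full.
  size_t append(const std::vector<uint8_t> &record)
  {
    std::lock_guard<std::mutex> guard(mutex);
    if (record.size() > buf.size() - used)
      return SIZE_MAX;
    std::memcpy(buf.data() + used, record.data(), record.size());
    const size_t offset = used;
    used += record.size();
    return offset;
  }
};

// Placeholder checksum (a plain byte sum, NOT CRC-32C), standing in for
// the work that no longer needs to happen while holding the mutex.
inline uint8_t toy_checksum(const std::vector<uint8_t> &payload)
{
  uint8_t sum = 0;
  for (uint8_t byte : payload)
    sum = uint8_t(sum + byte);
  return sum;
}

// Complete a record in a caller-local buffer, without taking any lock.
inline std::vector<uint8_t> finish_record(std::vector<uint8_t> payload)
{
  payload.push_back(toy_checksum(payload));
  return payload;
}
```

A writer would call log.append(finish_record(...)), so the critical section shrinks to a bounds check and one copy.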
      
      Like the previous implementation, we will overwrite the last log block
      over and over again, until it has been completely filled. It would be
      possible to write only up to the last completed block (if no more
      recent write was requested), or to write dummy FILE_CHECKPOINT records
      to fill the incomplete block, by invoking the currently disabled
      function log_pad(). This would require adjustments to some logic around
      log checkpoints, page flushing, and shutdown.
      
      An upgrade after a crash of any previous version is not supported.
      Logically empty log files from a previous version will be upgraded.
      
      An attempt to start up InnoDB without a valid ib_logfile0 will be
      refused. Previously, the redo log used to be created automatically
      if it was missing. Only with innodb_force_recovery=6 is it
      possible to start InnoDB in read-only mode even if the log file
      does not exist. This allows the contents of a possibly corrupted
      database to be dumped.
      
      Because a prepared backup from an earlier version of mariadb-backup
      will create a 0-sized log file, we will allow an upgrade from such
      log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
      tablespace looks valid.
      
      The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
      with 64-byte log checkpoint blocks at 0x1000 and 0x2000.
      
      The start of log records will move from 0x800 to 0x3000. This allows us
      to use 4096-byte aligned blocks for all I/O in a future revision.
      
      We extend the MDEV-12353 redo log record format as follows.
      
      (1) Empty mini-transactions or extra NUL bytes will not be allowed.
      (2) The end-of-minitransaction marker (a NUL byte) will be replaced
      with a 1-bit sequence number, which will be toggled each time when the
      circular log file wraps back to the beginning.
      (3) After the sequence bit, a CRC-32C checksum of all data
      (excluding the sequence bit) will be written.
      (4) If the log is encrypted, 8 bytes will be written before
      the checksum and included in it. This is part of the
      initialization vector (IV) of encrypted log data.
      (5) File names, page numbers, and checkpoint information will not be
      encrypted. Only the payload bytes of page-level log will be encrypted.
      The tablespace ID and page number will form part of the IV.
      (6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
      with all-zero payload, and with the normal end marker and checksum.
      The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.
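
To illustrate items (2) and (3), here is a simplified sketch of a record trailer that carries a sequence bit followed by a CRC-32C of the payload. The byte layout is an assumption made for the example (one whole byte for the sequence bit, a big-endian checksum, and a payload length known to the parser); it is not the exact on-disk encoding. The helper names frame() and is_valid() are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
inline uint32_t crc32c(const uint8_t *data, size_t len)
{
  uint32_t crc = 0xFFFFFFFFU;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
  }
  return ~crc;
}

// The bit toggles each time the circular log wraps to the beginning.
inline uint8_t sequence_bit(uint64_t wraps) { return uint8_t(wraps & 1); }

// Assumed layout: payload, 1 byte holding the sequence bit, 4-byte CRC-32C.
// The checksum covers the payload but not the sequence bit.
inline std::vector<uint8_t> frame(const std::vector<uint8_t> &payload,
                                  uint64_t wraps)
{
  std::vector<uint8_t> out(payload);
  out.push_back(sequence_bit(wraps));
  const uint32_t crc = crc32c(payload.data(), payload.size());
  for (int shift = 24; shift >= 0; shift -= 8)
    out.push_back(uint8_t(crc >> shift));
  return out;
}

// A record left over from an earlier lap fails the sequence bit check;
// a damaged record fails the checksum check.
inline bool is_valid(const std::vector<uint8_t> &rec, size_t payload_len,
                     uint64_t wraps)
{
  if (rec.size() != payload_len + 5)
    return false;
  if (rec[payload_len] != sequence_bit(wraps))
    return false;
  uint32_t stored = 0;
  for (size_t i = 0; i < 4; i++)
    stored = (stored << 8) | rec[payload_len + 1 + i];
  return stored == crc32c(rec.data(), payload_len);
}
```

The sequence bit is what lets a parser stop at the end of the current lap of the circular file without relying on NUL bytes.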
      
      In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
      no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
      will require a valid log file. When resizing the log, we will create
      a logically empty ib_logfile101 at the current LSN and use an atomic rename
      to replace ib_logfile0 with it. See the test innodb.log_file_size.
      
      Because there is no mandatory padding in the log file, we are able
      to create a dummy log file as of an arbitrary log sequence number.
      See the test mariabackup.huge_lsn.
      
      The parameter innodb_log_write_ahead_size and the
      INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.
      
      The minimum value of innodb_log_buffer_size will be increased to 2MiB
      (because log_sys.buf will replace recv_sys.buf) and the increment
      adjusted to 4096 bytes (the maximum log block size).
      
      The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:
      
      os_log_fsyncs
      os_log_pending_fsyncs
      log_pending_log_flushes
      log_pending_checkpoint_writes
      
      The following status variables will be removed:
      
      Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
      Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)
      
      log_sys.get_block_size(): Return the physical block size of the log file.
      This is only implemented on Linux and Microsoft Windows for now, and for
      the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
      maximum size of a checkpoint block). If the block size is anything else,
      the traditional 512-byte size will be used via normal file system
      buffering.
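
The block size selection rule above can be written down as a small function. This is a sketch of the rule as stated in this message, not the body of log_sys.get_block_size(); the function name and the idea of passing the physical size as a parameter are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Accept power-of-2 physical block sizes between 64 and 4096 bytes;
// for anything else, fall back to the traditional 512 bytes (which in
// the real server also means going through normal file system buffering).
inline uint32_t effective_log_block_size(uint32_t physical)
{
  const bool power_of_2 = physical && !(physical & (physical - 1));
  return power_of_2 && physical >= 64 && physical <= 4096 ? physical : 512;
}
```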
      
      If the file system buffers can be bypassed, a message like the following
      will be issued:
      
      InnoDB: File system buffers for log disabled (block size=512 bytes)
      InnoDB: File system buffers for log disabled (block size=4096 bytes)
      
      This has been tested on Linux and Microsoft Windows with both sizes.
      
      On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
      Tests in 3 different environments where the log is stored in a device
      with a physical block size of 512 bytes are yielding better throughput
      without O_DIRECT. This could be due to the fact that in the event the
      last log block is being overwritten (if multiple transactions would
      become durable at the same time, and each of them will write a small
      number of bytes to the last log block), it should be faster to re-copy
      data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
      to be finally written at fdatasync() time.
      
      The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
      data files. This option will enable O_DIRECT on the log file on Linux.
      It may be unsafe to use when the storage device does not support
      FUA (Force Unit Access) mode.
      
      When the server is compiled WITH_PMEM=ON, we will use memory-mapped
      I/O for the log file if the log resides on a "mount -o dax" device.
      We will identify PMEM in a start-up message:
      
      InnoDB: log sequence number 0 (memory-mapped); transaction id 3
      
      On Linux, we will also invoke mmap() on any ib_logfile0 that resides
      in /dev/shm, effectively treating the log file as persistent memory.
      This should speed up "./mtr --mem" and increase the test coverage of
      PMEM on non-PMEM hardware. It also allows users to estimate how much
      the performance would be improved by installing persistent memory.
      On other tmpfs file systems such as /run, we will not use mmap().
      
      mariadb-backup: Eliminated several variables. We will refer
      directly to recv_sys and log_sys.
      
      backup_wait_for_lsn(): Detect non-progress of
      xtrabackup_copy_logfile(). In this new log format with
      arbitrary-sized blocks, we can only detect log file overrun
      indirectly, by observing that the scanned log sequence number
      is not advancing.
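
The indirect overrun detection can be pictured as a polling loop that gives up when the scanned LSN stops advancing. The sketch below is an assumption about the general shape only: wait_for_lsn(), the polling callback and the max_stalls limit are hypothetical, and a real loop would sleep between polls.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Poll the scanned LSN until it reaches target. Return false after
// max_stalls consecutive polls that showed no progress.
inline bool wait_for_lsn(uint64_t target,
                         const std::function<uint64_t()> &poll_scanned_lsn,
                         unsigned max_stalls)
{
  uint64_t last = poll_scanned_lsn();
  unsigned stalls = 0;
  while (last < target) {
    const uint64_t now = poll_scanned_lsn();
    if (now > last) {
      last = now;
      stalls = 0;           // progress was made; reset the counter
    } else if (++stalls >= max_stalls) {
      return false;         // the copier appears to be stuck
    }
  }
  return true;
}
```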
      
      xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
      because we are not allowed to modify the server's log file, and our
      memory mapping is read-only.
      
      trx_flush_log_if_needed_low(): Do not use the callback on pmem.
      Using neither flush_lock nor write_lock around PMEM writes seems
      to yield the best performance. The pmem_persist() calls may
      still be somewhat slower than the pwrite() and fdatasync() based
      interface (PMEM mounted without -o dax).
      
      recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.
      
      recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.
      
      recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.
      
      recv_sys_t, log_sys_t: Removed many data members.
      
      recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
      recv_sys.offset: Renamed from recv_sys.recovered_offset.
      log_sys.buf_size: Replaces srv_log_buffer_size.
      
      recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
      when the buffer is being allocated from the memory heap.
      
      recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
      backed by ib_logfile0. The pointer will wrap from recv_sys.len
      (log_sys.file_size) to log_sys.START_OFFSET. For the record that
      wraps around, we may copy file name or record payload data to
      the auxiliary buffer decrypt_buf in order to have a contiguous
      block of memory. The maximum size of a record is less than
      innodb_page_size bytes.
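
The wrap-around copy into a contiguous auxiliary buffer can be sketched as below. read_wrapped() is a hypothetical helper, the std::vector stands in for the memory-mapped file, and start plays the role of log_sys.START_OFFSET; none of this is the actual recv_ring implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copy len bytes beginning at offset from the circular region
// [start, file.size()) into a contiguous buffer, wrapping from the end
// of the file back to start.
inline std::vector<uint8_t> read_wrapped(const std::vector<uint8_t> &file,
                                         size_t start, size_t offset,
                                         size_t len)
{
  const size_t end = file.size();
  assert(start < end && offset >= start && offset < end);
  assert(len <= end - start);
  std::vector<uint8_t> out;
  out.reserve(len);
  for (size_t i = 0; i < len; i++) {
    out.push_back(file[offset]);
    if (++offset == end)
      offset = start;       // wrap past the header area
  }
  return out;
}
```

Only the one record that straddles the end of the file needs such a copy; all other records can be parsed in place.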
      
      recv_sys_t::parse(): Take the smart pointer as a template parameter.
      Do not temporarily add a trailing NUL byte to FILE_ records, because
      we are not supposed to modify the memory-mapped log file. (It is
      attached in read-write mode already during recovery.)
      
      recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().
      
      recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
      returned on PMEM, use recv_ring to wrap around the buffer to the start.
      
      mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
      on PMEM, because it has no meaning on the mmap-based log.
      
      log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
      srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
      Protected by log_sys.mutex. Updated consistently in log_close().
      Previously, mtr_t::commit() conditionally updated the count,
      which was inconsistent.
      
      log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
      for writing to log_sys.log (the ib_logfile0). Replaces
      srv_stats.log_writes and export_vars.innodb_log_writes.
      Protected by log_sys.mutex.
      
      log_sys.waits: Count waits in append_prepare(). Replaces
      srv_stats.log_waits and export_vars.innodb_log_waits.
      
      recv_recover_page(): Do not unnecessarily acquire
      log_sys.flush_order_mutex. We are inserting the blocks in arbitrary
      order anyway, to be adjusted in recv_sys.apply(true).
      
      We will change the definition of flush_lock and write_lock to
      avoid potential false sharing. Depending on sizeof(log_sys) and
      CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
      share a cache line with each other or with the last data members
      of log_sys.
      
      Thanks to Matthias Leich for providing https://rr-project.org traces
      for various failures during the development, and to
      Thirunarayanan Balathandayuthapani for his help in debugging
      some of the recovery code. And thanks to the developers of the
      rr debugger for a tool without which extensive changes to InnoDB
      would be very challenging to get right.
      
      Thanks to Vladislav Vaintroub for useful feedback and
      to him, Axel Schwenke and Krunal Bauskar for testing the performance.
      685d958e
  24. 18 Nov, 2021 1 commit
    • MDEV-27058: Reduce the size of buf_block_t and buf_page_t · aaef2e1d
      Marko Mäkelä authored
      buf_page_t::frame: Moved from buf_block_t::frame.
      All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED
      pages will have frame=nullptr, while all 'fat' buf_block_t
      will have a non-null frame pointing to aligned innodb_page_size bytes.
      This eliminates the need for separate states for
      BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.
      
      buf_page_t::lock: Moved from buf_block_t::lock. That is, all block
      descriptors will have a page latch. The IO_PIN state that was used
      for discarding or creating the uncompressed page frame of a
      ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix
      and page X-latch.
      
      page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status
      of buf_page_t with a single std::atomic<uint32_t>. All modifications
      will use store(), fetch_add(), fetch_sub(). This space was previously
      wasted on alignment on 64-bit systems. We will use the following encoding
      that combines a state (partly read-fix or write-fix) and a buffer-fix
      count:
      
      buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
      buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
      buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
      buf_page_t::FREED=3 + fix: pages marked as freed in the file
      buf_page_t::UNFIXED=1U<<29 + fix: normal pages
      buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
      buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
      buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
      buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
      buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
      buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)
      
      buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to
      UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.
      
      buf_page_t::read_complete(): Renamed from buf_page_read_complete().
      Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.
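
A rough sketch of how a state and a buffer-fix count can share one 32-bit atomic word, using the constants listed above. It covers only the states at or above UNFIXED (top 3 bits for the state, low 29 bits for the count). The struct page_state and its member names are hypothetical, and the transitions assume that the caller holds the page latch, as the message describes.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct page_state {
  enum : uint32_t {
    UNFIXED = 1U << 29, IBUF_EXIST = 2U << 29, REINIT = 3U << 29,
    READ_FIX = 4U << 29, WRITE_FIX = 5U << 29,
    WRITE_FIX_IBUF = 6U << 29, WRITE_FIX_REINIT = 7U << 29,
    STATE_MASK = 7U << 29
  };

  std::atomic<uint32_t> word{UNFIXED};

  uint32_t state() const { return word.load() & STATE_MASK; }
  uint32_t fix_count() const { return word.load() & ~uint32_t(STATE_MASK); }

  // Buffer-fixing only touches the low bits, so the state is preserved.
  void fix() { word.fetch_add(1); }
  void unfix() { word.fetch_sub(1); }

  // READ_FIX -> UNFIXED or IBUF_EXIST, keeping the fix count.
  void read_complete(bool ibuf_exist = false)
  {
    assert(state() == READ_FIX);
    word.fetch_sub(READ_FIX - (ibuf_exist ? IBUF_EXIST : UNFIXED));
  }

  // WRITE_FIX or WRITE_FIX_REINIT -> UNFIXED; WRITE_FIX_IBUF -> IBUF_EXIST.
  void write_complete()
  {
    const uint32_t s = state();
    assert(s >= WRITE_FIX);
    word.fetch_sub(s == WRITE_FIX_IBUF ? WRITE_FIX_IBUF - IBUF_EXIST
                                       : s - UNFIXED);
  }
};
```

Because each transition is a single fetch_sub() of the difference between two state constants, concurrent fix() and unfix() calls by other threads are not lost.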
      
      buf_page_t::can_relocate(): If the page latch is being held or waited for,
      or the block is buffer-fixed or io-fixed, return false. (The condition
      on the page latch is new.)
      
      Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we
      will acquire the page latch before fix(), and unfix() before unlocking.
      
      buf_page_t::flush(): Replaces buf_flush_page(). Optimize the
      handling of FREED pages.
      
      buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held
      by the caller.
      
      buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.
      
      buf_page_get_low(): Ignore guesses that are read-fixed because they
      may not yet be registered in buf_pool.page_hash and buf_pool.LRU.
      
      buf_page_optimistic_get(): Acquire latch before buffer-fixing.
      
      buf_page_make_young(): Leave read-fixed blocks alone, because they
      might not be registered in buf_pool.LRU yet.
      
      recv_sys_t::recover_deferred(), recv_sys_t::recover_low():
      Possibly fix MDEV-26326, by holding a page X-latch instead of
      only buffer-fixing the page.
      aaef2e1d
  25. 10 Nov, 2021 1 commit
    • MDEV-26121 [Note] InnoDB: Resetting invalid page · 3480c3f9
      Thirunarayanan Balathandayuthapani authored
      In dict_index_t::clear(), InnoDB frees all pages except the root page.
      The leaf segment of the root page is reset and reinitialized.
      In fseg_create(), we assume that only
      FIL_PAGE_TYPE_TRX_SYS or FIL_PAGE_TYPE_TRX_SYS page should
      be re-created for non-full-crc32 format. This assumption is wrong
      in case of rollback of bulk insert operation.
      3480c3f9
  26. 22 Oct, 2021 1 commit
    • MDEV-26826 Duplicated computations of buf_pool.page_hash addresses · c091a0bc
      Marko Mäkelä authored
      Since commit bd5a6403 (MDEV-26033)
      we can actually calculate the buf_pool.page_hash cell and latch
      addresses while not holding buf_pool.mutex.
      
      buf_page_alloc_descriptor(): Remove the MEM_UNDEFINED.
      We now expect buf_page_t::hash to be zero-initialized.
      
      buf_pool_t::hash_chain: Dedicated data type for buf_pool.page_hash.array.
      
      buf_LRU_free_one_page(): Merged to the only caller
      buf_pool_t::corrupted_evict().
      c091a0bc
  27. 24 Sep, 2021 2 commits
    • MDEV-26445 innodb_undo_log_truncate is unnecessarily slow · f5794e1d
      Marko Mäkelä authored
      trx_purge_truncate_history(): Do not force a write of the undo tablespace
      that is being truncated. Instead, prevent page writes by acquiring
      an exclusive latch on all dirty pages of the tablespace.
      
      fseg_create(): Relax an assertion that could fail if a dirty undo page
      is being initialized during undo tablespace truncation (and
      trx_purge_truncate_history() already acquired an exclusive latch on it).
      
      fsp_page_create(): If we are truncating a tablespace, try to reuse
      a page that we may have already latched exclusively (because it was
      in buf_pool.flush_list). To some extent, this helps the test
      innodb.undo_truncate,16k to avoid running out of buffer pool.
      
      mtr_t::commit_shrink(): Mark as clean all pages that are outside the
      new bounds of the tablespace, and only add the newly reinitialized pages
      to the buf_pool.flush_list.
      
      buf_page_create(): Do not unnecessarily invoke change buffer merge on
      undo tablespaces.
      
      buf_page_t::clear_oldest_modification(bool temporary): Move some
      assertions to the caller buf_page_write_complete().
      
      innodb.undo_truncate: Use a bigger innodb_buffer_pool_size=24M.
      On my system, it would otherwise hang 1 out of 1547 attempts
      (on the 40th repeat of innodb.undo_truncate,16k).
      Other page sizes were not affected.
      f5794e1d
    • MDEV-26450: Corruption due to innodb_undo_log_truncate · f5fddae3
      Marko Mäkelä authored
      At least since commit 055a3334
      (MDEV-13564) the undo log truncation in InnoDB did not work correctly.
      
      The main issue is that during the execution of
      trx_purge_truncate_history() some pages of the newly truncated
      undo tablespace could be discarded.
      
      This is improved from commit 1cb218c3
      which was applied to earlier-version branches.
      
      fsp_try_extend_data_file(): Apply the peculiar rounding of
      fil_space_t::size_in_header only to the system tablespace,
      whose size can be expressed in megabytes in a configuration parameter.
      Other files may freely grow by a number of pages.
      
      fseg_alloc_free_page_low(): Do allow the extension of undo tablespaces,
      and mention the file name in the error message.
      
      mtr_t::commit_shrink(): Implement crash-safe shrinking of a tablespace:
      (1) durably write the log
      (2) release the page latches of the rebuilt tablespace
      (3) release the mutexes
      (4) truncate the file
      (5) release the tablespace latch
      This is refactored from trx_purge_truncate_history().
      
      log_write_and_flush_prepare(), log_write_and_flush(): New functions
      to durably write log during mtr_t::commit_shrink().
      f5fddae3
  28. 22 Sep, 2021 1 commit
    • MDEV-26450: Corruption due to innodb_undo_log_truncate · 1cb218c3
      Marko Mäkelä authored
      At least since commit 055a3334
      (MDEV-13564) the undo log truncation in InnoDB did not work correctly.
      
      The main issue is that during the execution of
      trx_purge_truncate_history() some pages of the newly truncated
      undo tablespace could be discarded.
      
      fsp_try_extend_data_file(): Apply the peculiar rounding of
      fil_space_t::size_in_header only to the system tablespace,
      whose size can be expressed in megabytes in a configuration parameter.
      Other files may freely grow by a number of pages.
      
      fseg_alloc_free_page_low(): Do allow the extension of undo tablespaces,
      and mention the file name in the error message.
      
      mtr_t::commit_shrink(): Implement crash-safe shrinking of a tablespace
      file. First, durably write the log, then shrink the file, and finally
      release the page latches of the rebuilt tablespace. Refactored from
      trx_purge_truncate_history().
      
      log_write_and_flush_prepare(), log_write_and_flush(): New functions
      to durably write log during mtr_t::commit_shrink().
      1cb218c3
  29. 22 Jul, 2021 1 commit
    • MDEV-26195: Use a 32-bit data type for some tablespace fields · ca501ffb
      Marko Mäkelä authored
      In the InnoDB data files, we allocate 32 bits for tablespace identifiers
      and page numbers as well as tablespace flags. But, in main memory
      data structures we allocate 32 or 64 bits, depending on the register
      width of the processor. Let us always use 32-bit fields to eliminate
      a mismatch and reduce the memory footprint on 64-bit systems.
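
The memory effect can be seen with two toy layouts. Both structs are hypothetical and exist only for this comparison; the wide one uses size_t to model fields whose width follows the register width of the processor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Before: field width follows the register width (8 bytes on 64-bit).
struct space_fields_wide { size_t space_id; size_t page_no; size_t flags; };

// After: the same 32 bits that the data files allocate for each field.
struct space_fields_narrow {
  uint32_t space_id; uint32_t page_no; uint32_t flags;
};

static_assert(sizeof(space_fields_narrow) == 12, "three 32-bit fields");
static_assert(sizeof(space_fields_narrow) <= sizeof(space_fields_wide),
              "the narrow layout is never larger");
```

On a typical 64-bit target the wide layout occupies 24 bytes, so the narrow one halves the footprint of these fields.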
      ca501ffb
  30. 20 Jul, 2021 1 commit
    • MDEV-24626 fixup: Remove useless code · ed0a7b1b
      Marko Mäkelä authored
      fil_ibd_create(): Remove code that should have been removed in
      commit 86dc7b4d already.
      We no longer wrote an initialized page to the file, but we would
      still allocate a page image in memory and write it.
      
      xb_space_create_file(): Remove an unnecessary page write.
      (This is a functional change for Mariabackup.)
      ed0a7b1b
  31. 16 Jun, 2021 1 commit