1. 12 Sep, 2024 2 commits
    • MDEV-32014: Reduce min val of large_commit_threshold for debug builds · d33b9d9b
      Brandon Nesterenko authored
      To help in the testing of MDEV-32014, allow debug builds to
      set a lower value for binlog_large_commit_threshold.
    • MDEV-32014 Rename binlog cache temporary file to binlog file for large transaction · fba09f8c
      Libing Song authored
      
      Description
      ===========
      When a transaction commits, it copies the binlog events from
      binlog cache to binlog file. Very large transactions
      (e.g. gigabytes) can stall other transactions for a long time
      because the data is copied while holding LOCK_log, which blocks
      other commits from binlogging.
      
      The solution in this patch is to rename the binlog cache file to
      a binlog file instead of copying it, if the committing transaction has
      a large binlog cache. Rename is a very fast operation, so it doesn't
      block other transactions for a long time.
      
      Design
      ======
      * binlog_large_commit_threshold
        type: ulonglong
        scope: global
        dynamic: yes
        default: 128MB
      
        Only binlog cache temporary files larger than this threshold
        (128MB by default) are renamed to a binlog file.
      
      * #binlog_cache_files directory
        To support rename, all binlog cache temporary files are managed
        as normal files now. The `#binlog_cache_files` directory is in the same
        directory as the binlog files. It is created at server startup if it doesn't
        exist. Otherwise, all files in the directory are deleted at startup.
      
        The temporary files are named with the ML_ prefix and the memory address
        of the binlog_cache_data object, which guarantees that the name is unique.
      
      * Reserve space
        To support the rename feature, enough space must be reserved at the
        beginning of the binlog cache file. The space is required for the
        Format description, Gtid list, checkpoint and Gtid events when
        renaming it to a binlog file.
      
        binlog_cache_data's cache_log is directly accessed by binlog log,
        online alter and wsrep, so it is not easy to update all the code. Thus
        the binlog cache will not reserve space if it is not a session binlog
        cache or if wsrep session is enabled.
      
        - m_file_reserved_bytes
          Stores the bytes reserved at the beginning of the cache file.
          It is initialized in write_prepare() and cleared by reset().
      
          The reserved file header is hidden from callers. Thus there is no
          change for callers. E.g.
          - get_byte_position() still returns the length of binlog data
            written to the cache, not the file length.
          - truncate(0) will truncate the file to m_file_reserved_bytes, not 0.
      
        - write_prepare()
          write_prepare() is called every time anything is written
          into the cache. It will call init_file_reserved_bytes() to create
          the cache file (if it doesn't exist) and reserve suitable space if
          the data written exceeds the buffer's size.
      
      * Binlog_commit_by_rotate
        It is used to encapsulate the code for renaming a binlog cache
        temporary file to a binlog file.
        - should_commit_by_rotate()
          It is called by write_transaction_to_binlog_events() to check if
          a binlog cache should be renamed to a binlog file.
        - commit()
          This is the entry point to rename a binlog cache and commit the
          transaction. Both rename and commit are protected by LOCK_log,
          thus no other transaction can write anything into the renamed
          binlog before it.
      
          Rename happens in a rotation. After the new binlog file is generated,
          replace_binlog_file() is called to:
          - copy data from the new binlog file to its binlog cache file.
          - write gtid event.
          - rename the binlog cache file to binlog file.
      
          After that the rotation will continue to succeed. Then the transaction
          is committed in a separate group by itself. Its cache file will be
          detached and cache log will be reset before calling
          trx_group_commit_with_engines(). Thus only the Xid event is written.
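The reserved-header behaviour and the threshold check described above can be sketched roughly as follows. This is a minimal illustration, not the server code: `CacheSketch`, `write()` and `file_length()` are hypothetical names, and the rule that the decision compares the binlog data length against the threshold is an assumption.

```cpp
#include <cstdint>

// Hypothetical stand-in for a binlog cache file that starts with a
// reserved header region. Only m_file_reserved_bytes, get_byte_position()
// and truncate() are named in the commit message; the rest is assumed.
class CacheSketch {
public:
  explicit CacheSketch(std::uint64_t reserved_bytes)
    : m_file_reserved_bytes(reserved_bytes), m_file_length(reserved_bytes) {}

  // Append n bytes of binlog data after whatever is already in the file.
  void write(std::uint64_t n) { m_file_length += n; }

  // Length of the binlog data only; the reserved header is not counted.
  std::uint64_t get_byte_position() const {
    return m_file_length - m_file_reserved_bytes;
  }

  // Truncate the binlog data to pos bytes. The header always survives,
  // so truncate(0) leaves the file at m_file_reserved_bytes, not 0.
  void truncate(std::uint64_t pos) {
    if (pos < get_byte_position())
      m_file_length = m_file_reserved_bytes + pos;
  }

  // Physical file length, header included.
  std::uint64_t file_length() const { return m_file_length; }

private:
  std::uint64_t m_file_reserved_bytes;
  std::uint64_t m_file_length;
};

// Assumed decision rule: rename instead of copy only above the threshold.
inline bool should_commit_by_rotate(const CacheSketch &cache,
                                    std::uint64_t threshold) {
  return cache.get_byte_position() > threshold;
}
```

With a 4096-byte header and 1000 bytes written, callers of this sketch see a position of 1000 while the file itself is 5096 bytes long.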
  2. 10 Sep, 2024 1 commit
  3. 05 Sep, 2024 1 commit
    • MDEV-33853 Async rollback prepared transactions during binlog crash recovery · 5bbda971
      Libing Song authored
      
      Summary
      =======
      When doing server recovery, the active transactions will be rolled
      back by the InnoDB background rollback thread automatically. The
      prepared transactions will be committed or rolled back accordingly
      by binlog recovery. Binlog recovery is done in the main thread before
      the server can provide service to users. If there is a big
      transaction to roll back, the server will not be available for a long
      time.
      
      This patch provides a way to roll back the prepared transactions
      asynchronously. Thus the rollback will not block server startup.
      
      Design
      ======
      - Handler::recover_rollback_by_xid()
        This patch provides a new handler interface to roll back transactions
        in the recovery phase. InnoDB just sets the transaction's state to active.
        Then the transaction will be rolled back by the background rollback
        thread.
      
      - Handler::signal_tc_log_recover_done()
        This function is called after opening the tc log (typically the binlog)
        has finished. When this function is called, all transactions that will
        be rolled back have been reverted to ACTIVE state. Thus it starts the
        rollback thread to roll back the transactions.
      
      - Background rollback thread
        With this patch, the background rollback thread is deferred until binlog
        recovery is finished. It is started by innobase_tc_log_recovery_done().
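The two-step hand-off in this design can be modelled with a small single-threaded sketch. Everything here is hypothetical (the `RecoverySketch` type, the state enum, and a plain function standing in for the background thread). It only illustrates that recovery flips state cheaply and that the expensive rollback runs after the "recovery done" signal.

```cpp
#include <cstddef>
#include <vector>

enum class TrxState { PREPARED, ACTIVE, ROLLED_BACK };

// Hypothetical model of the recovery hand-off; not the handler API.
struct RecoverySketch {
  std::vector<TrxState> trx;
  bool tc_log_recover_done = false;

  // Cheap step done during binlog recovery: only flip the state.
  void recover_rollback_by_xid(std::size_t i) {
    if (trx[i] == TrxState::PREPARED)
      trx[i] = TrxState::ACTIVE;
  }

  // Called once tc log recovery has finished.
  void signal_tc_log_recover_done() { tc_log_recover_done = true; }

  // Stand-in for the background rollback thread. It does nothing until
  // recovery is done, then rolls back every ACTIVE transaction and
  // returns how many it handled.
  std::size_t background_rollback() {
    if (!tc_log_recover_done)
      return 0;
    std::size_t n = 0;
    for (TrxState &s : trx) {
      if (s == TrxState::ACTIVE) {
        s = TrxState::ROLLED_BACK;
        ++n;
      }
    }
    return n;
  }
};
```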
  4. 04 Sep, 2024 3 commits
  5. 29 Aug, 2024 7 commits
    • Merge 11.2 into 11.4 · 44733aa8
      Marko Mäkelä authored
    • Merge 10.11 into 11.2 · e91a7994
      Marko Mäkelä authored
    • MDEV-34750 SET GLOBAL innodb_log_file_size is not crash safe · 984606d7
      Marko Mäkelä authored
      The recent commit 4ca355d8 (MDEV-33894)
      caused a serious regression for online InnoDB ib_logfile0 resizing,
      breaking crash-safety unless the memory-mapped log file interface is
      being used. However, the log resizing was broken also before this.
      
      To prevent such regressions in the future, we extend the test
      innodb.log_file_size_online with a kill and restart of the server
      and with some writes running concurrently with the log size change.
      When run enough times, this test revealed all the bugs that
      are being fixed by the code changes.
      
      log_t::resize_start(): Do not allow the resized log to start before
      the current log sequence number. In this way, there is no need to
      copy anything to the first block of resize_buf. The previous logic
      regarding that was incorrect in two ways. First, we would have to
      copy from the last written buffer (buf or flush_buf). Second, we failed
      to ensure that the mini-transaction end marker bytes would be 1
      in the buffer. If the source ib_logfile0 had wrapped around an odd number
      of times, the end marker would be 0. This was occasionally observed
      when running the test innodb.log_file_size_online.
      
      log_t::resize_write_buf(): To adjust for the resize_start() change,
      do not write anything that would be before the resize_lsn.
      Take the buffer (resize_buf or resize_flush_buf) as a parameter.
      Starting with commit 4ca355d8
      we no longer swap buffers when rewriting the last log block.
      
      log_t::append(): Define as a static function; only some debug
      assertions need to refer to the log_sys object.
      
      innodb_log_file_size_update(): Wake up the buf_flush_page_cleaner()
      if needed, and wait for it to complete a batch while waiting for
      the log resizing to be completed. If the current LSN is behind the
      resize target LSN, we will write redundant FILE_CHECKPOINT records to
      ensure that the log resizing completes. If the buf_pool.flush_list is
      empty or the buf_flush_page_cleaner() is stuck for some reason, our wait
      will time out in 5 seconds, so that we can periodically check if the
      execution of SET GLOBAL innodb_log_file_size was aborted. Previously,
      we could get into a busy loop here while the buf_flush_page_cleaner()
      would remain idle.
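The end-marker remark in the log_t::resize_start() paragraph can be illustrated with a small parity calculation. This is a sketch under the assumption that the marker simply alternates each time the circular log wraps, being 1 before the first wrap. The function and parameter names are hypothetical and are not taken from the InnoDB sources.

```cpp
#include <cstdint>

// Assumed model: the expected end-marker value flips on every wrap of the
// circular log file. An even number of wraps gives 1, an odd number 0.
inline unsigned end_marker_sketch(std::uint64_t lsn,
                                  std::uint64_t first_lsn,
                                  std::uint64_t capacity) {
  const std::uint64_t wraps = (lsn - first_lsn) / capacity;
  return (wraps & 1) ? 0u : 1u;
}
```

Under this model a freshly resized log has not wrapped yet, so it expects a marker of 1, while bytes copied from a source log that had wrapped an odd number of times would carry 0.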
    • Merge branch '10.6' into 10.11 · 3a1ff739
      Oleksandr Byelkin authored
    • Merge branch '10.5' into 10.6 · a4654ecc
      Oleksandr Byelkin authored
    • MDEV-34833 Assertion failure in Item_float::do_build_clone (Item_static_float_func) · 03a5455c
      Oleksandr Byelkin authored
      Added missing method of Item_static_float_func
    • Merge 10.6 into 10.11 · cfcf27c6
      Marko Mäkelä authored
  6. 28 Aug, 2024 8 commits
    • MDEV-34704 Quick mode produces the bug for mariadb client · 872dbec9
      Oleksandr Byelkin authored
        --quick-max-column-width parameter added to limit field
          width in --quick mode.
    • Merge 10.5 into 10.6 · 0e76c1ba
      Marko Mäkelä authored
    • MDEV-34802 Recovery fails to note some log corruption · 1ff6b6f0
      Marko Mäkelä authored
      recv_recovery_from_checkpoint_start(): Abort startup due to log
      corruption if we were unable to parse the entire log between
      the latest log checkpoint and the corresponding FILE_CHECKPOINT record.
      
      Also, reduce some code bloat related to log output and log_sys.mutex.
      
      Reviewed by: Debarun Banerjee
    • MDEV-33756: Deprecate binlog_optimize_thread_scheduling · 9811d23b
      Brandon Nesterenko authored
      The option binlog_optimize_thread_scheduling was initially added
      to provide a safe alternative for the newly added binlog group
      commit logic, such that when 0, it would disable a leader thread
      from performing the binlog write for all transactions that are a
      part of the group commit. Any problems related to the binlog group
      commit optimization should be sorted out by now, so we can
      deprecate-to-eventually-remove the option altogether.
      
      This commit performs the deprecation, and the removal is tracked
      by MDEV-33745. Note, as the option is only able to be provided
      via configuration at startup time, users will not see a
      deprecation message unless looking through the CLI help
      message.
      
      Reviewed By
      ============
      Kristian Nielsen <knielsen@knielsen-hq.org>
      Sergei Golubchik <serg@mariadb.org>
    • MDEV-34829 LOCALTIME returns a wrong data type · c67149b8
      Alexander Barkov authored
      Changing the alias LOCALTIME->CURRENT_TIMESTAMP to LOCALTIME->CURRENT_TIME.
      
      This changes the return type of LOCALTIME from DATETIME to TIME,
      according to the SQL Standard.
    • MDEV-32627 Spider: use CONNECTION string in SQLDriverConnect · 18d3f63a
      Yuchen Pei authored
      This is the CS part of the implementation of MENT-2070.
    • MDEV-34803 innodb_lru_flush_size is no longer used · bda40ccb
      Marko Mäkelä authored
      In commit fa8a46eb (MDEV-33613)
      the parameter innodb_lru_flush_size ceased to have any effect.
      
      Let us declare the parameter as deprecated and additionally as
      MARIADB_REMOVED_OPTION, so that there will be a warning written
      to the error log in case the option is specified in the command line.
      
      Let us also do the same for the parameter
      innodb_purge_rseg_truncate_frequency
      that was deprecated and ignored earlier in MDEV-32050.
      
      Reviewed by: Debarun Banerjee
    • Update markdown files for `main` branch · e6df06d4
      Andrew Hutchings authored
      Coding standards and PR template now reference `main`.
  7. 27 Aug, 2024 7 commits
  8. 26 Aug, 2024 9 commits
    • Fix sporadic failure of test case rpl.rpl_start_stop_slave · 8642453c
      Kristian Nielsen authored
      The test was expecting the I/O thread to be in a specific state, but thread
      scheduling may cause it to not yet have reached that state. So just have a
      loop that waits for the expected state to occur.
      Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
    • Fix sporadic failure of test case rpl.rpl_old_master · 214e6c5b
      Kristian Nielsen authored
      Remove the test for MDEV-14528. This is supposed to test that parallel
      replication from pre-10.0 master will update Seconds_Behind_Master. But
      after MDEV-12179 the SQL thread is blocked from even beginning to fetch
      events from the relay log due to FLUSH TABLES WITH READ LOCK, so the test
      case is no longer testing what it was intended to. And pre-10.0 versions are
      long since out of support, so it does not seem worthwhile to try to rewrite the
      test to work another way.
      
      The root cause of the test failure is MDEV-34778. Briefly, depending on
      exact timing during slave stop, the rli->sql_thread_caught_up flag may end
      up with different value. If it ends up as "true", this causes
      Seconds_Behind_Master to be 0 during next slave start; and this caused test
      case timeout as the test was waiting for Seconds_Behind_Master to become
      non-zero.
      Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
    • Fix sporadic test failure in rpl.rpl_create_drop_event · 7dc4ea56
      Kristian Nielsen authored
      Depending on timing, an extra event run could start just when the event
      scheduler is shut down and delay running until after the table has been
      dropped; this would cause the test to fail with a "table does not exist"
      error in the log.
      Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
    • Restore skipping rpl.rpl_mdev6020 under Valgrind · 33854d73
      Kristian Nielsen authored
      (Revert a change done by mistake when XtraDB was removed.)
      Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
    • MDEV-34696: do_gco_wait() completes too early on InnoDB dict stats updates · b4c2e239
      Kristian Nielsen authored
      Before doing mark_start_commit(), check that there is no pending deadlock
      kill. If there is a pending kill, we won't commit (we will abort, roll back,
      and retry). Then we should not mark the commit as started, since that could
      potentially make the following GCO start too early, before we completed the
      commit after the retry.
      
      This condition could trigger in some corner cases, where InnoDB would
      temporarily take table/row locks that are released again immediately, not held
      until the transaction commits. This happens with dict_stats updates and
      possibly auto-increment locks.
      
      Such locks can be passed to thd_rpl_deadlock_check() and cause a deadlock
      kill to be scheduled in the background. But since the blocking locks are
      held only temporarily, they can be released before the background kill
      happens. This way, the kill can be delayed until after mark_start_commit()
      has been called. Thus we need to check the synchronous indication
      rgi->killed_for_retry, not just the asynchronous thd->killed.
      Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
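The ordering rule from this commit (check for a pending deadlock kill before marking the commit as started) can be sketched as below. `RgiSketch` and `mark_start_commit_if_safe()` are hypothetical names used only for illustration. Only the `killed_for_retry` flag and the idea of mark_start_commit() come from the commit message.

```cpp
// Hypothetical, simplified stand-in for the per-event-group state.
struct RgiSketch {
  bool killed_for_retry = false;    // synchronous "will retry" indication
  bool start_commit_marked = false; // set once the commit is marked started
};

// Mark the commit as started only when no deadlock kill is pending.
// Returns false when the caller should abort, roll back and retry instead.
inline bool mark_start_commit_if_safe(RgiSketch &rgi) {
  if (rgi.killed_for_retry)
    return false; // do not let the following GCO start early
  rgi.start_commit_marked = true;
  return true;
}
```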
    • MDEV-34515: Reduce context switching in purge · 76f6b6d8
      Marko Mäkelä authored
      Before this patch, the InnoDB purge coordinator task submitted
      innodb_purge_threads-1 tasks even if there was not sufficient amount
      of work for all of them. For example, if there are undo log records
      only for 1 table, only 1 task can be employed, and that task had better
      be the purge coordinator.
      
      srv_purge_worker_task_low(): Split from purge_worker_callback().
      
      trx_purge_attach_undo_recs(): Remove the parameter n_purge_threads,
      and add the parameter n_work_items, to keep track of the amount of
      work.
      
      trx_purge(): Launch purge worker tasks only if necessary. The work of
      one thread will be executed by this purge coordinator thread.
      
      que_fork_scheduler_round_robin(): Merged to trx_purge().
      
      Thanks to Vladislav Vaintroub for supplying a prototype of this.
      
      Reviewed by: Debarun Banerjee
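The "launch workers only if necessary" idea can be expressed as a small calculation. This is a sketch with a hypothetical helper name. It assumes that at most one task per work item is useful and that the coordinator always takes one share of the work itself.

```cpp
#include <algorithm>

// Number of extra purge worker tasks to submit, assuming the coordinator
// executes one task's worth of work on its own thread.
inline unsigned worker_tasks_to_launch(unsigned n_work_items,
                                       unsigned n_purge_threads) {
  const unsigned tasks = std::min(n_work_items, n_purge_threads);
  return tasks > 0 ? tasks - 1 : 0;
}
```

With undo log records for a single table (one work item), this yields zero worker tasks, so the coordinator does all the work and no context switch is needed.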
    • MDEV-34515: Contention between purge and workload · b7b9f3ce
      Marko Mäkelä authored
      In a Sysbench oltp_update_index workload that involves 1 table,
      a serious contention between the workload and the purge of history
      was observed. This was the worst when the table contained only 1 record.
      
      This turned out to be fixed by setting innodb_purge_batch_size=128,
      which corresponds to the number of usable persistent rollback segments.
      When we go above that, there would be contention between row_purge_poss_sec()
      and the workload, typically on the clustered index page latch, sometimes
      also on a secondary index page latch. It might be that with smaller
      batches, trx_sys.history_size() will end up pausing all concurrent
      transaction start/commit frequently enough so that purge will be able
      to make some progress, so that there would be less contention on the
      index page latches between purge and SQL execution.
      
      In commit aa719b50 (part of MDEV-32050)
      the interpretation of the parameter innodb_purge_batch_size was slightly
      changed. It would correspond to the maximum desired size of the
      purge_sys.pages cache. Before that change, the parameter was referring to
      a number of undo log pages, but the accounting might have been inaccurate.
      
      To avoid a regression, we will reduce the default value to
      innodb_purge_batch_size=127, which will also be compatible with
      innodb_undo_tablespaces>1 (which will disable rollback segment 0).
      
      Additionally, some logic in the purge and MVCC checks is simplified.
      The purge tasks will make use of purge_sys.pages when accessing undo
      log pages to find out if a secondary index record can be removed.
      If an undo page needs to be looked up in buf_pool.page_hash, we will
      merely buffer-fix it. This is correct, because the undo pages are
      append-only in nature. Holding purge_sys.latch or purge_sys.end_latch
      or the fact that the current thread is executing as a part of an
      in-progress purge batch will prevent the contents of the undo page from
      being freed and subsequently reused. The buffer-fix will prevent the
      page from being evicted from the buffer pool. Thanks to this logic,
      we can refer to the undo log record directly in the buffer pool page
      and avoid copying the record.
      
      buf_pool_t::page_fix(): Look up and buffer-fix a page. This is useful
      for accessing undo log pages, which are append-only by nature.
      There will be no need to deal with change buffer or ROW_FORMAT=COMPRESSED
      in that case.
      
      purge_sys_t::view_guard::view_guard(): Allow the type of guard to be
      acquired: end_latch, latch, or no latch (in case we are a purge thread).
      
      purge_sys_t::view_guard::get(): Read-only accessor to purge_sys.pages.
      
      purge_sys_t::get_page(): Invoke buf_pool_t::page_fix().
      
      row_vers_old_has_index_entry(): Replaced with row_purge_is_unsafe()
      and row_undo_mod_sec_unsafe().
      
      trx_undo_get_undo_rec(): Merged to trx_undo_prev_version_build().
      
      row_purge_poss_sec(): Add the parameter mtr and remove redundant
      or unused parameters sec_pcur, sec_mtr, is_tree. We will use the
      caller's mtr object but release any acquired page latches before
      returning.
      
      btr_cur_get_page(), page_cur_get_page(): Do not invoke page_align().
      
      row_purge_remove_sec_if_poss_leaf(): Return the value of PAGE_MAX_TRX_ID
      to be checked against the page in row_purge_remove_sec_if_poss_tree().
      If the secondary index page was not changed meanwhile, it will be
      unnecessary to invoke row_purge_poss_sec() again.
      
      trx_undo_prev_version_build(): Access any undo log pages using
      the caller's mini-transaction object.
      
      row_purge_vc_matches_cluster(): Moved to the only compilation unit that
      needs it.
      
      Reviewed by: Debarun Banerjee
    • MDEV-34520 purge_sys_t::wait_FTS sleeps 10ms, even if it does not have to · d58734d7
      Marko Mäkelä authored
      There were two separate Atomic_counter<uint32_t>, purge_sys.m_SYS_paused
      and purge_sys.m_FTS_paused. In purge_sys.wait_FTS() we have to read both
      atomically. We used to use an overkill solution for this, acquiring
      purge_sys.latch and waiting 10 milliseconds between samples. To make
      matters worse, the 10-millisecond wait was unconditional, which would
      unnecessarily suspend the purge_coordinator_task every now and then.
      
      It turns out that we can fold both "reference counts" into a single
      Atomic_relaxed<uint32_t> and avoid the purge_sys.latch.
      To assess whether std::memory_order_relaxed is acceptable, we should
      consider the operations that read these "reference counts", that is,
      purge_sys_t::wait_FTS(bool) and purge_sys_t::must_wait_FTS().
      
      Outside debug assertions, purge_sys.must_wait_FTS() is only invoked in
      trx_purge_table_acquire(), which is covered by a shared dict_sys.latch.
      We would increment the counter as part of a DDL operation, but before
      acquiring an exclusive dict_sys.latch. So, a
      purge_sys_t::close_and_reopen() loop could be triggered slightly
      prematurely, before a problematic DDL operation is actually executed.
      Decrementing the counter is less of an issue; purge_sys.resume_FTS()
      or purge_sys.resume_SYS() would mostly be invoked while holding an
      exclusive dict_sys.latch; ha_innobase::delete_table() does it outside
      that critical section. Still, this would only cause some extra wait in
      the purge_coordinator_task, just like at the start of a DDL operation.
      
      There are two calls to purge_sys_t::wait_FTS(bool): in the above mentioned
      purge_sys_t::close_and_reopen() and in purge_sys_t::clone_oldest_view(),
      both invoked by the purge_coordinator_task. There is also a
      purge_sys.clone_oldest_view<true>() call at startup when no DDL operation
      can be in progress.
      
      purge_sys_t::m_SYS_paused: Merged into m_FTS_paused, using a new
      multiplier PAUSED_SYS = 65536.
      
      purge_sys_t::wait_FTS(): Remove an unnecessary sleep as well as the
      access to purge_sys.latch. It suffices to poll purge_sys.m_FTS_paused.
      
      purge_sys_t::stop_FTS(): Do not acquire purge_sys.latch.
      
      Reviewed by: Debarun Banerjee
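Folding two reference counts into one atomic word, as described above, might look roughly like this. It is a sketch using `std::atomic` rather than the server's own wrapper types. The class name, the bit layout (low 16 bits for FTS, the rest for SYS) and the `must_wait()` helper are assumptions made for illustration. Only the multiplier PAUSED_SYS = 65536 comes from the commit message.

```cpp
#include <atomic>
#include <cstdint>

class PauseCounterSketch {
  static constexpr std::uint32_t PAUSED_SYS = 65536;
  std::atomic<std::uint32_t> m_paused{0};

public:
  void stop_FTS() { m_paused.fetch_add(1, std::memory_order_relaxed); }
  void resume_FTS() { m_paused.fetch_sub(1, std::memory_order_relaxed); }
  void stop_SYS() { m_paused.fetch_add(PAUSED_SYS, std::memory_order_relaxed); }
  void resume_SYS() { m_paused.fetch_sub(PAUSED_SYS, std::memory_order_relaxed); }

  // A single load observes both counts at once, so no latch is needed
  // to read them consistently with each other.
  bool must_wait(bool also_sys) const {
    const std::uint32_t v = m_paused.load(std::memory_order_relaxed);
    return also_sys ? v != 0 : (v & (PAUSED_SYS - 1)) != 0;
  }
};
```

The design choice is that one 32-bit load replaces two separate reads under a latch. The price is that each count is limited to the width of its bit field.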
  9. 25 Aug, 2024 1 commit
    • Trivial fix: Make test_if_cheaper_ordering() use actual_rec_per_key() · 9020baf1
      Sergei Petrunia authored
      Discovered this while working on MDEV-34720: test_if_cheaper_ordering()
      uses rec_per_key, while the original estimate for the access method
      is produced in best_access_path() by using actual_rec_per_key().
      
      Make test_if_cheaper_ordering() also use actual_rec_per_key().
      Also make several getter functions "const" to make this compile.
      Also adjusted the testcase to handle this (the change was backported
      from 11.0).
  10. 23 Aug, 2024 1 commit
    • MDEV-34759: buf_page_get_low() is unnecessarily acquiring exclusive latch · 9db2b327
      Marko Mäkelä authored
      buf_page_ibuf_merge_try(): A new, separate function for invoking
      ibuf_merge_or_delete_for_page() when needed. Use the already requested
      page latch for determining if the call is necessary. If it is and
      if we are currently holding rw_latch==RW_S_LATCH, upgrading to an exclusive
      latch may involve waiting for another thread to acquire and release
      a U or X latch on the page. If we have to wait, we must recheck if the
      call to ibuf_merge_or_delete_for_page() is still needed. If the page
      turns out to be corrupted, we will release and fail the operation.
      Finally, the exclusive page latch will be downgraded to the originally
      requested latch.
      
      ssux_lock_impl::rd_u_upgrade_try(): Attempt to upgrade a shared lock to
      an update lock.
      
      sux_lock::s_x_upgrade_try(): Attempt to upgrade a shared lock to
      exclusive.
      
      sux_lock::s_x_upgrade(): Upgrade a shared lock to exclusive.
      Return whether a wait was elided.
      
      ssux_lock_impl::u_rd_downgrade(), sux_lock::u_s_downgrade():
      Downgrade an update lock to shared.
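A non-blocking shared-to-exclusive upgrade attempt can be sketched with a single atomic word. This toy lock is not how sux_lock or ssux_lock_impl are implemented. The class, its bit layout and all method names other than `s_x_upgrade_try()` are hypothetical, and it has no update (U) mode and no waiting.

```cpp
#include <atomic>
#include <cstdint>

// Toy reader/writer word: the low bits count shared holders and the top
// bit marks an exclusive holder. For illustration only.
class RwWordSketch {
  static constexpr std::uint32_t X = 1u << 31;
  std::atomic<std::uint32_t> word{0};

public:
  // Try to take a shared lock; fails while an exclusive holder exists.
  bool s_lock_try() {
    std::uint32_t v = word.load(std::memory_order_relaxed);
    while (!(v & X)) {
      if (word.compare_exchange_weak(v, v + 1, std::memory_order_acquire,
                                     std::memory_order_relaxed))
        return true;
    }
    return false;
  }

  void s_unlock() { word.fetch_sub(1, std::memory_order_release); }

  // Upgrade succeeds only if the caller is the sole shared holder.
  bool s_x_upgrade_try() {
    std::uint32_t expected = 1;
    return word.compare_exchange_strong(expected, X,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
  }

  // Return to a single shared holder, as when restoring the latch mode
  // that was originally requested.
  void x_s_downgrade() { word.store(1, std::memory_order_release); }
};
```

If the upgrade attempt fails, a caller would have to release the shared lock and wait for an exclusive one, after which any condition checked under the shared lock must be rechecked, as the commit message notes.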