Commit 20460897 authored by Marko Mäkelä's avatar Marko Mäkelä

MDEV-14637: Fix hang due to DDL with FOREIGN KEY

When MySQL 5.7.1 introduced WL#6326 to reduce contention on the
non-leaf levels of B-trees, it introduced a new rw-lock mode SX
(not conflicting with S, but conflicting with SX and X) and
new rules to go with it.

A thread that is holding an dict_index_t::lock aka index->lock
in SX mode is permitted to acquire non-leaf buf_block_t::lock
aka block->lock X or SX mode, in monotonically descending order.
That is, once the thread has acquired a block->lock, it is not
allowed to acquire a lock on its parent or grandparent pages.
Such arbitrary-order access is only allowed when the thread
acquired the index->lock in X mode upfront.

A customer encountered a repeatable hang when loading a dump into
InnoDB while using multiple innodb_purge_threads (default: 4).
The dump makes very heavy use of FOREIGN KEY constraints.
By luck, it happened so that two purge worker threads (srv_worker_thread)
deadlocked with each other. Both were operating on the index FOR_REF
of the InnoDB internal table SYS_FOREIGN. One of them was legitimately
holding index->lock S-latch and the root block->lock S-latch. The other
had acquired index->lock SX-latch, root block->lock SX-latch, and a bunch
of other latches, including the fil_space_t::latch for freeing some blocks
and some leaf page latches. This other thread was inside 2 nested calls
to btr_compress() and it was trying to reacquire the root block->lock
in X mode, violating the WL#6326 protocol.

This violation led to a deadlock, because while S is compatible with SX
and a thread can upgrade an SX lock to X when there are no conflicting
requests, in this case there was a conflicting S lock held by the other
purge worker thread.

During this deadlock, both threads are holding dict_operation_lock S-latch,
which would block any subsequent DDL statements, such as CREATE TABLE.

The tables SYS_FOREIGN and SYS_FOREIGN_COLS are special in that they
define key columns of the type VARCHAR(0), created using the InnoDB
internal SQL parser. Because InnoDB does not internally enforce the
maximum length of columns, it would happily write more than 0 bytes
to these columns. This caused a miscalculation of node_ptr_max_size.

btr_cur_will_modify_tree(): Clean up some code. (No functional change.)

btr_node_ptr_max_size(): Renamed from dict_index_node_ptr_max_size().
Use a more realistic maximum size for SYS_FOREIGN and SYS_FOREIGN_COLS.

btr_cur_pessimistic_delete(): Refrain from merging pages if it is
not safe.

This work is based on the following MySQL 5.7.23 fix:

commit 58dcf0b4a4165ed59de94a9a1e7d8c954f733726
Author: Aakanksha Verma <aakanksha.verma@oracle.com>
Date:   Wed May 9 18:54:03 2018 +0530

    BUG#26225783 MYSQL CRASH ON CREATE TABLE (REPRODUCEABLE) -> INNODB: A
    LONG SEMAPHORE WAIT
parent f70a3185
/*****************************************************************************
Copyright (c) 1994, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 1994, 2018, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2008, Google Inc.
Copyright (c) 2012, Facebook Inc.
Copyright (c) 2015, 2018, MariaDB Corporation.
......@@ -586,13 +586,15 @@ btr_cur_will_modify_tree(
first record and following compress might delete the record and causes
the uppper level node_ptr modification. */
const ulint n_recs = page_get_n_recs(page);
if (lock_intention <= BTR_INTENTION_BOTH) {
ulint margin;
/* check delete will cause. (BTR_INTENTION_BOTH
or BTR_INTENTION_DELETE) */
/* first, 2nd, 2nd-last and last records are 4 records */
if (page_get_n_recs(page) < 5) {
if (n_recs < 5) {
return(true);
}
......@@ -638,8 +640,7 @@ btr_cur_will_modify_tree(
/* Once we invoke the btr_cur_limit_optimistic_insert_debug,
we should check it here in advance, since the max allowable
records in a page is limited. */
LIMIT_OPTIMISTIC_INSERT_DEBUG(page_get_n_recs(page),
return(true));
LIMIT_OPTIMISTIC_INSERT_DEBUG(n_recs, return true);
/* needs 2 records' space for the case the single split and
insert cannot fit.
......@@ -652,18 +653,16 @@ btr_cur_will_modify_tree(
|| max_size < rec_size * 2) {
return(true);
}
/* TODO: optimize this condition for compressed page.
this is based on the worst compress rate.
currently looking only uncompressed page, but we can look
also compressed page page_zip_available() if already in the
buffer pool */
/* TODO: optimize this condition for ROW_FORMAT=COMPRESSED.
This is based on the worst case, and we could invoke
page_zip_available() on the block->page.zip. */
/* needs 2 records' space also for worst compress rate. */
if (page_size.is_compressed()
&& page_zip_empty_size(index->n_fields,
page_size.physical())
< rec_size * 2 + page_get_data_size(page)
+ page_dir_calc_reserved_space(
page_get_n_recs(page) + 2) + 1) {
<= rec_size * 2 + page_get_data_size(page)
+ page_dir_calc_reserved_space(n_recs + 2)) {
return(true);
}
}
......@@ -701,6 +700,98 @@ btr_cur_need_opposite_intention(
return(false);
}
/**
@param[in] index b-tree
@return maximum size of a node pointer record in bytes */
static ulint btr_node_ptr_max_size(const dict_index_t* index)
{
if (dict_index_is_ibuf(index)) {
/* cannot estimate accurately */
/* This is universal index for change buffer.
The max size of the entry is about max key length * 2.
(index key + primary key to be inserted to the index)
(The max key length is UNIV_PAGE_SIZE / 16 * 3 at
ha_innobase::max_supported_key_length(),
considering MAX_KEY_LENGTH = 3072 at MySQL imposes
the 3500 historical InnoDB value for 16K page size case.)
For the universal index, node_ptr contains most of the entry.
And 512 is enough to contain ibuf columns and meta-data */
return srv_page_size / 8 * 3 + 512;
}
/* Each record has page_no, length of page_no and header. */
ulint comp = dict_table_is_comp(index->table);
ulint rec_max_size = comp
? REC_NODE_PTR_SIZE + 1 + REC_N_NEW_EXTRA_BYTES
+ UT_BITS_IN_BYTES(index->n_nullable)
: REC_NODE_PTR_SIZE + 2 + REC_N_OLD_EXTRA_BYTES
+ 2 * index->n_fields;
/* Compute the maximum possible record size. */
for (ulint i = 0; i < dict_index_get_n_unique_in_tree(index); i++) {
const dict_field_t* field
= dict_index_get_nth_field(index, i);
const dict_col_t* col
= dict_field_get_col(field);
ulint field_max_size;
ulint field_ext_max_size;
/* Determine the maximum length of the index field. */
field_max_size = dict_col_get_fixed_size(col, comp);
if (field_max_size) {
/* dict_index_add_col() should guarantee this */
ut_ad(!field->prefix_len
|| field->fixed_len == field->prefix_len);
/* Fixed lengths are not encoded
in ROW_FORMAT=COMPACT. */
rec_max_size += field_max_size;
continue;
}
field_max_size = dict_col_get_max_size(col);
if (UNIV_UNLIKELY(!field_max_size)) {
/* SYS_FOREIGN.ID is defined as CHAR in the
InnoDB internal SQL parser, which translates
into the incorrect VARCHAR(0). InnoDB does
not enforce maximum lengths of columns, so
that is why any data can be inserted in the
first place.
Likewise, SYS_FOREIGN.FOR_NAME,
SYS_FOREIGN.REF_NAME, SYS_FOREIGN_COLS.ID, are
defined as CHAR, and also they are part of a key. */
ut_ad(!strcmp(index->table->name.m_name,
"SYS_FOREIGN")
|| !strcmp(index->table->name.m_name,
"SYS_FOREIGN_COLS"));
ut_ad(!comp);
rec_max_size += (srv_page_size == UNIV_PAGE_SIZE_MAX)
? REDUNDANT_REC_MAX_DATA_SIZE
: page_get_free_space_of_empty(FALSE) / 2;
}
field_ext_max_size = field_max_size < 256 ? 1 : 2;
if (field->prefix_len
&& field->prefix_len < field_max_size) {
field_max_size = field->prefix_len;
}
if (comp) {
/* Add the extra size for ROW_FORMAT=COMPACT.
For ROW_FORMAT=REDUNDANT, these bytes were
added to rec_max_size before this loop. */
rec_max_size += field_ext_max_size;
}
rec_max_size += field_max_size;
}
return rec_max_size;
}
/********************************************************************//**
Searches an index tree and positions a tree cursor on a given level.
NOTE: n_fields_cmp in tuple must be set so that it cannot be compared
......@@ -1028,7 +1119,7 @@ btr_cur_search_to_nth_level(
page_id_t page_id(space, dict_index_get_page(index));
if (root_leaf_rw_latch == RW_X_LATCH) {
node_ptr_max_size = dict_index_node_ptr_max_size(index);
node_ptr_max_size = btr_node_ptr_max_size(index);
}
up_match = 0;
......@@ -2128,7 +2219,7 @@ btr_cur_open_at_index_side_func(
const page_size_t& page_size = dict_table_page_size(index->table);
if (root_leaf_rw_latch == RW_X_LATCH) {
node_ptr_max_size = dict_index_node_ptr_max_size(index);
node_ptr_max_size = btr_node_ptr_max_size(index);
}
height = ULINT_UNDEFINED;
......@@ -2487,7 +2578,7 @@ btr_cur_open_at_rnd_pos_func(
dberr_t err = DB_SUCCESS;
if (root_leaf_rw_latch == RW_X_LATCH) {
node_ptr_max_size = dict_index_node_ptr_max_size(index);
node_ptr_max_size = btr_node_ptr_max_size(index);
}
height = ULINT_UNDEFINED;
......@@ -5162,7 +5253,6 @@ btr_cur_pessimistic_delete(
btr_discard_page(cursor, mtr);
ret = TRUE;
goto return_after_reservations;
}
......@@ -5236,23 +5326,45 @@ btr_cur_pessimistic_delete(
}
}
page_cur_delete_rec(btr_cur_get_page_cur(cursor), index, offsets, mtr);
/* SPATIAL INDEX never use SX locks; we can allow page merges
while holding X lock on the spatial index tree.
Do not allow merges of non-leaf B-tree pages unless it is
safe to do so. */
{
const bool allow_merge = page_is_leaf(page)
|| dict_index_is_spatial(index)
|| btr_cur_will_modify_tree(
index, page, BTR_INTENTION_DELETE, rec,
btr_node_ptr_max_size(index),
block->page.size, mtr);
page_cur_delete_rec(btr_cur_get_page_cur(cursor), index,
offsets, mtr);
#ifdef UNIV_ZIP_DEBUG
ut_a(!page_zip || page_zip_validate(page_zip, page, index));
#endif /* UNIV_ZIP_DEBUG */
/* btr_check_node_ptr() needs parent block latched */
ut_ad(!parent_latched || btr_check_node_ptr(index, block, mtr));
ut_ad(!parent_latched
|| btr_check_node_ptr(index, block, mtr));
if (!ret && btr_cur_compress_recommendation(cursor, mtr)) {
if (UNIV_LIKELY(allow_merge)) {
ret = btr_cur_compress_if_useful(
cursor, FALSE, mtr);
} else {
ib::warn() << "Not merging page "
<< block->page.id
<< " in index " << index->name
<< " of " << index->table->name;
ut_ad(!"MDEV-14637");
}
}
}
return_after_reservations:
*err = DB_SUCCESS;
mem_heap_free(heap);
if (ret == FALSE) {
ret = btr_cur_compress_if_useful(cursor, FALSE, mtr);
}
if (!srv_read_only_mode
&& page_is_leaf(page)
&& !dict_index_is_online_ddl(index)) {
......
......@@ -2139,93 +2139,6 @@ dict_col_name_is_reserved(
return(FALSE);
}
/****************************************************************//**
Return maximum size of the node pointer record.
@return maximum size of the record in bytes */
ulint
dict_index_node_ptr_max_size(
/*=========================*/
const dict_index_t* index) /*!< in: index */
{
ulint comp;
ulint i;
/* maximum possible storage size of a record */
ulint rec_max_size;
if (dict_index_is_ibuf(index)) {
/* cannot estimate accurately */
/* This is universal index for change buffer.
The max size of the entry is about max key length * 2.
(index key + primary key to be inserted to the index)
(The max key length is UNIV_PAGE_SIZE / 16 * 3 at
ha_innobase::max_supported_key_length(),
considering MAX_KEY_LENGTH = 3072 at MySQL imposes
the 3500 historical InnoDB value for 16K page size case.)
For the universal index, node_ptr contains most of the entry.
And 512 is enough to contain ibuf columns and meta-data */
return(UNIV_PAGE_SIZE / 8 * 3 + 512);
}
comp = dict_table_is_comp(index->table);
/* Each record has page_no, length of page_no and header. */
rec_max_size = comp
? REC_NODE_PTR_SIZE + 1 + REC_N_NEW_EXTRA_BYTES
: REC_NODE_PTR_SIZE + 2 + REC_N_OLD_EXTRA_BYTES;
if (comp) {
/* Include the "null" flags in the
maximum possible record size. */
rec_max_size += UT_BITS_IN_BYTES(index->n_nullable);
} else {
/* For each column, include a 2-byte offset and a
"null" flag. */
rec_max_size += 2 * index->n_fields;
}
/* Compute the maximum possible record size. */
for (i = 0; i < dict_index_get_n_unique_in_tree(index); i++) {
const dict_field_t* field
= dict_index_get_nth_field(index, i);
const dict_col_t* col
= dict_field_get_col(field);
ulint field_max_size;
ulint field_ext_max_size;
/* Determine the maximum length of the index field. */
field_max_size = dict_col_get_fixed_size(col, comp);
if (field_max_size) {
/* dict_index_add_col() should guarantee this */
ut_ad(!field->prefix_len
|| field->fixed_len == field->prefix_len);
/* Fixed lengths are not encoded
in ROW_FORMAT=COMPACT. */
rec_max_size += field_max_size;
continue;
}
field_max_size = dict_col_get_max_size(col);
field_ext_max_size = field_max_size < 256 ? 1 : 2;
if (field->prefix_len
&& field->prefix_len < field_max_size) {
field_max_size = field->prefix_len;
}
if (comp) {
/* Add the extra size for ROW_FORMAT=COMPACT.
For ROW_FORMAT=REDUNDANT, these bytes were
added to rec_max_size before this loop. */
rec_max_size += field_ext_max_size;
}
rec_max_size += field_max_size;
}
return(rec_max_size);
}
/****************************************************************//**
If a record of this index might not fit on a single B-tree page,
return TRUE.
......
......@@ -1952,14 +1952,6 @@ const char*
dict_tf_to_row_format_string(
/*=========================*/
ulint table_flag); /*!< in: row format setting */
/****************************************************************//**
Return maximum size of the node pointer record.
@return maximum size of the record in bytes */
ulint
dict_index_node_ptr_max_size(
/*=========================*/
const dict_index_t* index) /*!< in: index */
MY_ATTRIBUTE((warn_unused_result));
#define dict_col_is_virtual(col) (col)->is_virtual()
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment