    Bug#13721257 RACE CONDITION IN UPDATES OR INSERTS OF WIDE RECORDS · f77329ac
    Marko Mäkelä authored
    This bug was originally filed and fixed as Bug#12612184. The original
    fix was buggy, and it was patched by Bug#12704861. That patch was
    also buggy (potentially breaking crash recovery), and both fixes
    were reverted.
    
    This fix was not ported to the built-in InnoDB of MySQL 5.1, because
    the function signatures of many core functions are different from
    InnoDB Plugin and later versions. The block allocation routines and
    their callers would have to be changed so that they handle block
    descriptors instead of page frames.
    
    When a record is updated so that its size grows, non-updated columns
    can be selected for external (off-page) storage. The bug is that the
    initially inserted updated record contains an all-zero BLOB pointer to
    the field that was not updated. Only after the BLOB pages have been
    allocated and written can the valid pointer be written to the record.
    
    Between the release of the page latch in mtr_commit(mtr) after
    btr_cur_pessimistic_update() and the re-latching of the page in
    btr_pcur_restore_position(), other threads can see the invalid BLOB
    pointer consisting of 20 zero bytes. Moreover, if the system crashes
    at this point, the situation could persist after crash recovery, and
    the contents of the non-updated column would be permanently lost.
    
    The problem is amplified by the ROW_FORMAT=DYNAMIC and
    ROW_FORMAT=COMPRESSED formats that were introduced with
    innodb_file_format=barracuda in InnoDB Plugin, but the bug does
    exist in all InnoDB versions.
    
    The fix is as follows. After a pessimistic B-tree operation that needs
    to write out off-page columns, allocate the pages for these columns in
    the mini-transaction that performed the B-tree operation (btr_mtr),
    but write the pages in a separate mini-transaction (blob_mtr). Do
    mtr_commit(blob_mtr) before mtr_commit(btr_mtr). A quirk: Do not reuse
    pages that were previously freed in btr_mtr. Only write the off-page
    columns to 'fresh' pages.
    
    In this way, crash recovery will see redo log entries for blob_mtr
    before any redo log entry for btr_mtr. It will apply the BLOB page
    writes to pages that were marked free at that point. If crash recovery
    fails to see all of the btr_mtr redo log, there will be some
    unreachable BLOB data in free pages, but the B-tree will be in a
    consistent state.
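
    The ordering argument above can be checked with a toy model. The
    following C sketch is not InnoDB code, and every name in it
    (toy_db, toy_recover() and so on) is hypothetical. It assumes that
    crash recovery applies only whole mini-transactions, in log order,
    from some prefix of the redo log.

    ```c
    #include <stddef.h>

    /* Hypothetical toy model; not InnoDB code. */
    enum toy_mtr_kind { TOY_BLOB_MTR, TOY_BTR_MTR };

    struct toy_db {
        int blob_page_written;   /* BLOB contents are on a page */
        int record_expects_blob; /* B-tree record refers to the BLOB */
    };

    /* Replay the first n_complete mini-transactions of the log. */
    void toy_recover(struct toy_db *db, const enum toy_mtr_kind *log,
                     size_t n_complete)
    {
        db->blob_page_written = 0;
        db->record_expects_blob = 0;
        for (size_t i = 0; i < n_complete; i++) {
            if (log[i] == TOY_BLOB_MTR) {
                db->blob_page_written = 1;
            } else {
                db->record_expects_blob = 1;
            }
        }
    }

    /* Consistent: a record never refers to an unwritten BLOB. */
    int toy_is_consistent(const struct toy_db *db)
    {
        return !db->record_expects_blob || db->blob_page_written;
    }

    /* Check every possible crash point (every log prefix). */
    int toy_all_prefixes_consistent(const enum toy_mtr_kind *log, size_t n)
    {
        for (size_t k = 0; k <= n; k++) {
            struct toy_db db;
            toy_recover(&db, log, k);
            if (!toy_is_consistent(&db)) {
                return 0;
            }
        }
        return 1;
    }
    ```

    With the log order { TOY_BLOB_MTR, TOY_BTR_MTR } every prefix is
    consistent in this model. With the reverse order, a crash after the
    first entry leaves a record that refers to unwritten BLOB data.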
    
    btr_page_alloc_low(): Renamed from btr_page_alloc(). Add the parameter
    init_mtr. Return an allocated block, or NULL. If init_mtr!=mtr but
    the page was already X-latched in mtr, do not initialize the page.
    
    btr_page_alloc(): Wrapper for btr_page_alloc_for_ibuf() and
    btr_page_alloc_low().
    
    btr_page_free(): Add a debug assertion that the page was a B-tree page.
    
    btr_lift_page_up(): Return the father block.
    
    btr_compress(), btr_cur_compress_if_useful(): Add the parameter ibool
    adjust, for adjusting the cursor position.
    
    btr_cur_pessimistic_update(): Preserve the cursor position when
    big_rec will be written and the new flag BTR_KEEP_POS_FLAG is defined.
    Remove a duplicate rec_get_offsets() call. Keep the X-latch on
    index->lock when big_rec is needed.
    
    btr_store_big_rec_extern_fields(): Replace update_inplace with
    an operation code, and local_mtr with btr_mtr. When not doing a
    fresh insert and btr_mtr has freed pages, put aside any pages that
    were previously X-latched in btr_mtr, and free the pages after
    writing out all data. The data must be written to 'fresh' pages,
    because btr_mtr will be committed and written to the redo log after
    the BLOB writes have been written to the redo log.
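
    As a rough illustration of the 'fresh' pages rule, the C sketch
    below sets aside any allocated page that was already touched by
    btr_mtr and frees it again after the data has been written. It is a
    toy model only: the first-fit allocator, toy_space and
    toy_alloc_fresh() are hypothetical and do not correspond to the
    actual fsp/fseg routines.

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    #define TOY_N_PAGES 8

    /* Hypothetical tablespace model; not InnoDB code. */
    struct toy_space {
        bool is_free[TOY_N_PAGES];
        /* true if the page was X-latched (e.g. freed) in btr_mtr */
        bool latched_in_btr_mtr[TOY_N_PAGES];
    };

    /* First-fit allocation; returns a page number, or -1 if full. */
    int toy_alloc(struct toy_space *s)
    {
        for (int i = 0; i < TOY_N_PAGES; i++) {
            if (s->is_free[i]) {
                s->is_free[i] = false;
                return i;
            }
        }
        return -1;
    }

    /* Allocate up to n_needed fresh pages into out[].
     * Returns the number of pages actually stored in out[]. */
    size_t toy_alloc_fresh(struct toy_space *s, int *out, size_t n_needed)
    {
        int set_aside[TOY_N_PAGES];
        size_t n_aside = 0;
        size_t n_out = 0;

        while (n_out < n_needed) {
            int page_no = toy_alloc(s);
            if (page_no < 0) {
                break; /* out of space */
            }
            if (s->latched_in_btr_mtr[page_no]) {
                /* Not fresh: keep it allocated so that the
                 * allocator does not hand it out again. */
                set_aside[n_aside++] = page_no;
            } else {
                out[n_out++] = page_no;
            }
        }

        /* (The off-page data would be written to out[] here.) */

        /* Return the set-aside pages after all data is written. */
        for (size_t i = 0; i < n_aside; i++) {
            s->is_free[set_aside[i]] = true;
        }
        return n_out;
    }
    ```

    Holding on to the set-aside pages until the end is what stops a
    simple allocator from returning the same non-fresh page repeatedly.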
    
    btr_blob_op_is_update(): Check if an operation passed to
    btr_store_big_rec_extern_fields() is an update or insert-by-update.
    
    fseg_alloc_free_page_low(), fsp_alloc_free_page(),
    fseg_alloc_free_extent(), fseg_alloc_free_page_general(): Add the
    parameter init_mtr. Return an allocated block, or NULL. If
    init_mtr!=mtr but the page was already X-latched in mtr, do not
    initialize the page.
    
    xdes_get_descriptor_with_space_hdr(): Assert that the file space
    header is being X-latched.
    
    fsp_alloc_from_free_frag(): Refactored from fsp_alloc_free_page().
    
    fsp_page_create(): New function, for allocating, X-latching and
    potentially initializing a page. If init_mtr!=mtr but the page was
    already X-latched in mtr, do not initialize the page.
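
    The init_mtr rule stated here (and for btr_page_alloc_low() and the
    fseg/fsp allocation functions) can be written as a small predicate.
    This is a sketch with hypothetical names, not the signature of
    fsp_page_create(). It assumes that the page is initialized in every
    other case, which the text implies but does not state outright.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical stand-in for a mini-transaction handle. */
    struct toy_mtr { int id; };

    /* Decide whether a newly allocated page should be initialized.
     * mtr:      mini-transaction that allocates and X-latches the page
     * init_mtr: mini-transaction that would initialize the page
     * already_x_latched_in_mtr: the page was X-latched in mtr before
     *                           this allocation */
    bool toy_should_init_page(const struct toy_mtr *mtr,
                              const struct toy_mtr *init_mtr,
                              bool already_x_latched_in_mtr)
    {
        assert(mtr != NULL && init_mtr != NULL);

        if (init_mtr != mtr && already_x_latched_in_mtr) {
            /* mtr has already touched this page; initializing it
             * in a different mini-transaction is skipped. */
            return false;
        }
        /* Assumption: all remaining cases initialize the page. */
        return true;
    }
    ```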
    
    fsp_free_page(): Add ut_ad(0) to the error outcomes.
    
    fsp_free_page(), fseg_free_page_low(): Increment mtr->n_freed_pages.
    
    fsp_alloc_seg_inode_page(), fseg_create_general(): Assert that the
    page was not previously X-latched in the mini-transaction. A file
    segment or inode page should never be allocated in the middle of a
    mini-transaction that frees pages, such as btr_cur_pessimistic_delete().
    
    fseg_alloc_free_page_low(): If the hinted page was allocated, skip the
    check if the tablespace should be extended. Return NULL instead of
    FIL_NULL on failure. Remove the flag frag_page_allocated. Instead,
    return directly, because the page would already have been initialized.
    
    fseg_find_free_frag_page_slot() would return ULINT_UNDEFINED on error,
    not FIL_NULL. Correct a bogus assertion.
    
    fseg_alloc_free_page(): Redefine as a wrapper macro around
    fseg_alloc_free_page_general().
    
    buf_block_buf_fix_inc(): Move the definition from buf0buf.ic to
    buf0buf.h, so that it can be called from other modules.
    
    mtr_t: Add n_freed_pages (number of pages that have been freed).
    
    page_rec_get_nth_const(), page_rec_get_nth(): The inverse function of
    page_rec_get_n_recs_before(). Get the nth record of the record
    list. This is faster than iterating the linked list. Refactored from
    page_get_middle_rec().
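
    To illustrate the inverse relationship and why a lookup can beat a
    full list walk, here is a toy C sketch. It assumes a sparse
    directory with one slot per TOY_SLOT_SPAN records. The page layout
    and all toy_ names are hypothetical and much simpler than the real
    page format (real records are variable-length and cannot be
    indexed like an array).

    ```c
    #include <assert.h>
    #include <stddef.h>

    #define TOY_N_RECS 10
    #define TOY_SLOT_SPAN 4 /* hypothetical: one slot per 4 records */
    #define TOY_N_SLOTS ((TOY_N_RECS + TOY_SLOT_SPAN - 1) / TOY_SLOT_SPAN)

    struct toy_rec {
        int key;
        struct toy_rec *next; /* singly linked record list */
    };

    struct toy_page {
        struct toy_rec recs[TOY_N_RECS];
        struct toy_rec *slots[TOY_N_SLOTS]; /* sparse directory */
    };

    /* Build the record list and the sparse directory. */
    void toy_page_init(struct toy_page *page)
    {
        for (size_t i = 0; i < TOY_N_RECS; i++) {
            page->recs[i].key = (int) (i * 10);
            page->recs[i].next =
                (i + 1 < TOY_N_RECS) ? &page->recs[i + 1] : NULL;
            if (i % TOY_SLOT_SPAN == 0) {
                page->slots[i / TOY_SLOT_SPAN] = &page->recs[i];
            }
        }
    }

    /* Count the records that precede rec by walking the list. */
    size_t toy_get_n_recs_before(const struct toy_page *page,
                                 const struct toy_rec *rec)
    {
        size_t n = 0;
        for (const struct toy_rec *r = &page->recs[0]; r != rec;
             r = r->next) {
            assert(r != NULL); /* rec must be on this page */
            n++;
        }
        return n;
    }

    /* Inverse: jump to the nearest slot, then walk at most
     * TOY_SLOT_SPAN - 1 links instead of nth links. */
    struct toy_rec *toy_get_nth(struct toy_page *page, size_t nth)
    {
        assert(nth < TOY_N_RECS);
        struct toy_rec *r = page->slots[nth / TOY_SLOT_SPAN];
        for (size_t i = nth % TOY_SLOT_SPAN; i > 0; i--) {
            r = r->next;
        }
        return r;
    }
    ```

    For every valid n in this model,
    toy_get_n_recs_before(page, toy_get_nth(page, n)) equals n.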
    
    trx_undo_rec_copy(): Add a debug assertion for the length.
    
    trx_undo_add_page(): Return a block descriptor or NULL instead of a
    page number or FIL_NULL.
    
    trx_undo_report_row_operation(): Add debug assertions.
    
    trx_sys_create_doublewrite_buf(): Assert that each page was not
    previously X-latched.
    
    page_cur_insert_rec_zip_reorg(): Make use of page_rec_get_nth().
    
    row_ins_clust_index_entry_by_modify(): Pass BTR_KEEP_POS_FLAG, so that
    the repositioning of the cursor can be avoided.
    
    row_ins_index_entry_low(): Add DEBUG_SYNC points before and after
    writing off-page columns. If inserting by updating a delete-marked
    record, do not reposition the cursor or commit the mini-transaction
    before writing the off-page columns.
    
    row_build(): Tighten a debug assertion about null BLOB pointers.
    
    row_upd_clust_rec(): Add DEBUG_SYNC points before and after writing
    off-page columns. Do not reposition the cursor or commit the
    mini-transaction before writing the off-page columns.
    
    rb:939 approved by Jimmy Yang