• Vlad Lesin's avatar
    MDEV-29622 Wrong assertions in lock_cancel_waiting_and_release() for deadlock resolving caller · 9c04d66d
    Vlad Lesin authored
    Suppose we have two transactions, trx 1 and trx 2.
    
    trx 2 does deadlock resolving from lock_wait(), it sets
    victim->lock.was_chosen_as_deadlock_victim=true for trx 1, but has not
    yet invoked lock_cancel_waiting_and_release().
    
    trx 1 checks the flag in lock_trx_handle_wait(), and starts rollback
    from row_mysql_handle_errors(). It can change trx->lock.wait_thr and
    trx->state as it holds trx_t::mutex, but trx 2 has not yet requested it,
    as lock_cancel_waiting_and_release() has not yet been called.
    
    After that trx 1 tries to release locks in trx_t::rollback_low(),
    invoking trx_t::rollback_finish(). lock_release() is blocked on try to
    acquire lock_sys.rd_lock(SRW_LOCK_CALL) in lock_release_try(), as
    lock_sys is blocked by trx 2, as deadlock resolution works under
    lock_sys.wr_lock(SRW_LOCK_CALL), see Deadlock::report() for details.
    
    trx 2 executes lock_cancel_waiting_and_release() for deadlock victim, i.
    e. for trx 1. lock_cancel_waiting_and_release() contains some
    trx->lock.wait_thr and trx->state assertions, which will fail, because
    trx 1 has changed them during rollback execution.
    
    So, according to the above scenario, it's legal to have
    trx->lock.wait_thr==0 and trx->state!=TRX_STATE_ACTIVE in
    lock_cancel_waiting_and_release(), if it was invoked from
    Deadlock::report(), and the fix is just in the assertion conditions
    changing.
    
    The fix is just in changing assertion condition.
    
    There is also lock_wait() cleanup around trx->error_state.
    
    If trx->error_state can be changed not by the owned thread, it must be
    protected with lock_sys.wait_mutex, as lock_wait() uses trx->lock.cond
    along with that mutex.
    
    Also if trx->error_state was changed before lock_sys.wait_mutex
    acquision, then it could be reset with the following code, what is
    wrong. Also we need to check trx->error_state before entering waiting
    loop, otherwise it can be the case when trx->error_state was set before
    lock_sys.wait_mutex acquision, but the thread will be waiting on
    trx->lock.cond.
    9c04d66d
deadlock_wait_thr_race.result 1.27 KB