    MDEV-6917: Parallel replication: "Commit failed due to failure of an earlier... · 26b11130
    Kristian Nielsen authored
    MDEV-6917: Parallel replication: "Commit failed due to failure of an earlier commit on which this one depends", but no prior failure seen
    
    This bug was seen when parallel replication experienced a deadlock between
    transactions T1 and T2, where T2 has reached the commit phase and is waiting
    for T1 to commit first. In this case, the deadlock is broken by sending a kill
    to T2; that kill error is then later detected and converted to a deadlock
    error, which causes T2 to be rolled back and retried.
    
    The problem was that the kill caused ha_commit_trans() to erroneously call
    wakeup_subsequent_commits() on T3, signalling it to abort because T2 failed
    during commit. This is incorrect, because the error in T2 is only a temporary
    error, which will be resolved by normal transaction retry. We should not
    signal an error to the next transaction until we have executed the code that
    handles such temporary errors.
    
    So this patch just removes the calls to wakeup_subsequent_commits() from
    ha_commit_trans(). They are incorrect in this case, and they are not needed in
    general, as wakeup_subsequent_commits() must in any case be called in
    finish_event_group() to wake up any transactions that may have started to wait
    after ha_commit_trans(). And normally, wakeup will in fact have happened
    earlier, either from the binlog group commit code, or (in case of no
    binlogging) after the fast part of InnoDB/XtraDB group commit.
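
    The ordering issue can be illustrated with a small standalone model. This
    is only a sketch and not the server code: Txn, Waiter, Outcome and both
    run_* functions are hypothetical names, and a "temporary error" here simply
    means a commit attempt that fails and succeeds when retried. The first
    function signals the next transaction from inside the commit attempt; the
    second signals only after the retry loop has finished.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model only. None of these names exist in the MariaDB source.
enum class Outcome { Committed, Failed };

// Stands in for the next transaction (T3) waiting on this one (T2).
struct Waiter {
    bool woken = false;
    bool told_to_abort = false;
};

// A transaction whose commit attempts follow a fixed plan:
// attempt_fails[i] == true means attempt i hits a temporary error.
struct Txn {
    std::vector<bool> attempt_fails;
    std::size_t attempts_made = 0;

    explicit Txn(std::vector<bool> plan) : attempt_fails(std::move(plan)) {}

    // Returns true if this attempt commits, false on a temporary error.
    bool try_commit() {
        bool fails = attempts_made < attempt_fails.size() &&
                     attempt_fails[attempts_made];
        ++attempts_made;
        return !fails;
    }
};

// Problematic ordering: the waiter is signalled as part of each commit
// attempt, so a temporary error reaches it before the retry can run.
Outcome run_signal_in_commit(Txn &t, Waiter &next, int max_retries) {
    for (int i = 0; i <= max_retries; ++i) {
        bool ok = t.try_commit();
        if (!next.woken) {          // the first signal is the one that counts
            next.woken = true;
            next.told_to_abort = !ok;
        }
        if (ok)
            return Outcome::Committed;
    }
    return Outcome::Failed;
}

// Ordering after the change: retry first, then signal the waiter once
// with the final outcome.
Outcome run_signal_after_retry(Txn &t, Waiter &next, int max_retries) {
    Outcome result = Outcome::Failed;
    for (int i = 0; i <= max_retries; ++i) {
        if (t.try_commit()) {
            result = Outcome::Committed;
            break;
        }
    }
    next.woken = true;
    next.told_to_abort = (result == Outcome::Failed);
    return result;
}
```

    With a plan where the first attempt fails and the second succeeds, both
    functions report a successful commit, but only the first one tells the
    waiter to abort. That matches the symptom described below: a dependent
    transaction fails although no earlier commit visibly failed.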
    
    The symptom of this bug was that replication would break on some transaction
    with "Commit failed due to failure of an earlier commit on which this one
    depends", but with no such failure of an earlier commit visible anywhere.
    
rpl_parallel_retry.result 7.19 KB