    MDEV-10653: SHOW SLAVE STATUS Can Deadlock an Errored Slave · 8dad5148
    Brandon Nesterenko authored
    AKA rpl.rpl_parallel, binlog_encryption.rpl_parallel fails in
    buildbot with timeout in include
    
    A parallel replication worker thread can deadlock with another
    connection running SHOW SLAVE STATUS. If the worker thread is killed
    while in do_gco_wait(), it already holds LOCK_parallel_entry and,
    during error reporting, tries to grab err_lock. SHOW SLAVE STATUS,
    however, grabs these locks in the reverse order: it first grabs
    err_lock, and then tries to grab LOCK_parallel_entry. A deadlock
    results when each thread has grabbed its first lock but not yet its
    second.
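
The lock-order inversion described above can be reproduced in isolation. The following is a minimal sketch, not server code: the two mutexes are hypothetical stand-ins named after the real locks, and `blocked_on_second()` and `count_blocked_threads()` are illustrative helpers. Timed lock attempts are used so the demonstration reports the stall instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two server locks named in the message.
static std::timed_mutex lock_parallel_entry;
static std::timed_mutex err_lock;

// Take `first`, wait until the peer thread holds its own first lock, then
// try `second` with a timeout. Returns true if `second` could not be taken.
static bool blocked_on_second(std::timed_mutex &first,
                              std::timed_mutex &second,
                              std::atomic<int> &holding,
                              std::atomic<int> &done)
{
  std::lock_guard<std::timed_mutex> guard(first);
  holding.fetch_add(1);
  while (holding.load() < 2)
    std::this_thread::yield();

  bool blocked = !second.try_lock_for(std::chrono::milliseconds(50));
  if (!blocked)
    second.unlock();

  // Keep `first` held until both attempts are over, so neither side can
  // succeed merely by outlasting the other's timeout.
  done.fetch_add(1);
  while (done.load() < 2)
    std::this_thread::yield();
  return blocked;
}

// Returns how many of the two threads stalled waiting for their second lock.
int count_blocked_threads()
{
  std::atomic<int> holding{0}, done{0};
  bool worker_blocked = false, status_blocked = false;

  // Worker order: LOCK_parallel_entry, then err_lock.
  std::thread worker([&] {
    worker_blocked =
        blocked_on_second(lock_parallel_entry, err_lock, holding, done);
  });
  // SHOW SLAVE STATUS order: err_lock, then LOCK_parallel_entry.
  std::thread show_status([&] {
    status_blocked =
        blocked_on_second(err_lock, lock_parallel_entry, holding, done);
  });
  worker.join();
  show_status.join();
  return (worker_blocked ? 1 : 0) + (status_blocked ? 1 : 0);
}
```

With untimed locks, both threads would wait forever at the point where this sketch times out.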
    
    This patch implements the fix proposed in MDEV-31894: workers_idle()
    now determines idleness by checking whether
    queued_count==dequeued_count for the last in-use relay log. Because
    these values are updated atomically, workers_idle() no longer needs
    to grab LOCK_parallel_entry.
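
As a rough illustration of the idea, an idleness check based on two atomic counters might look like the sketch below. This is an assumption-laden simplification, not the actual patch: `RelayLogCounters` is a hypothetical struct, and only the names `workers_idle`, `queued_count`, and `dequeued_count` come from the commit message.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the bookkeeping of the last in-use relay log.
struct RelayLogCounters
{
  std::atomic<std::uint64_t> queued_count{0};   // events handed to workers
  std::atomic<std::uint64_t> dequeued_count{0}; // events workers finished
};

// Workers are considered idle when everything queued has been dequeued.
// Both reads are atomic loads, so no mutex is taken here.
bool workers_idle(const RelayLogCounters &last_inuse)
{
  return last_inuse.queued_count.load(std::memory_order_acquire) ==
         last_inuse.dequeued_count.load(std::memory_order_acquire);
}
```

Since the check takes no lock, a caller that already holds err_lock cannot block inside it.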
    
    Huge thanks to Kristian Nielsen for diagnosing the problem!
    
    Reviewed By:
    ============
    Kristian Nielsen <knielsen@knielsen-hq.org>
    Andrei Elkin <andrei.elkin@mariadb.com>
rpl_deadlock_show_slave_status.result 2.25 KB