A better fix for bug #56405 "Deadlock in the MDL deadlock detector" that doesn't introduce bug #56715 "Concurrent transactions + FLUSH result in sporadical unwarranted deadlock errors".

A deadlock could occur when a workload containing a mix of DML, DDL and FLUSH TABLES statements affecting the same set of tables was executed in a heavily concurrent environment.

This deadlock occurred when several connections tried to perform deadlock detection in the metadata locking subsystem. The first connection started traversing the wait-for graph, encountered a sub-graph representing a wait for flush, acquired LOCK_open and dived into sub-graph inspection. Then it encountered a sub-graph corresponding to a wait for a metadata lock and blocked while trying to acquire a rd-lock on MDL_lock::m_rwlock, since some other thread had a wr-lock on it. When this wr-lock was released, it could happen (if there was another pending wr-lock against this rwlock) that the rd-lock from the first connection was left unsatisfied while a new rd-lock request from the second connection sneaked in and was satisfied. (For this to be possible, the second rd-request must come exactly after the wr-lock is released but before the pending wr-lock manages to grab the rwlock, which is possible both on Linux and in our own rwlock implementation.) If this second connection continued traversing the wait-for graph and encountered a sub-graph representing a wait for flush, it tried to acquire LOCK_open, which the first connection still held, and thus the deadlock was created.

The previous patch tried to work around this problem by not allowing the deadlock detector to lock the LOCK_open mutex if some other thread doing deadlock detection already owned it and the current search depth was greater than 0. Instead, a deadlock was reported. As a result, it introduced bug #56715.

This patch solves the problem in a different way. It introduces a new rw_pr_lock_t implementation to be used by the MDL subsystem instead of one based on Linux rwlocks or our own rwlock implementation. This new implementation never allows a situation in which an rwlock is rd-locked and there is a blocked pending rd-lock. Thus the situation which caused this bug becomes impossible with this implementation.

Because this implementation is optimized for the wr-lock/unlock scenario, which is the most common one in the MDL subsystem, it doesn't introduce noticeable performance regressions in sysbench tests. Moreover, it significantly improves the situation for the POINT_SELECT test when many connections are used.

No test case is provided, as this bug is very hard to repeat in the MTR environment but is repeatable with the help of RQG tests. This patch also doesn't include a test for bug #56715 "Concurrent transactions + FLUSH result in sporadical unwarranted deadlock errors", as it takes too much time to be run as part of normal test-suite runs.
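
The property described above (a rd-lock request never blocks while the rwlock is rd-locked, even if a wr-lock is pending) can be illustrated with a minimal sketch. This is not the code from the patch: the class name ReaderPreferringLock, its methods and the demo function are hypothetical, and the sketch assumes a design in which readers only bump a counter under a mutex while a writer keeps that mutex locked for the whole duration of its wr-lock.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical reader-preferring rwlock (illustrative names, not from the patch).
class ReaderPreferringLock {
 public:
  void rdlock() {
    // Blocks only while a writer holds mutex_; a pending writer or
    // other active readers never delay a new reader.
    std::lock_guard<std::mutex> guard(mutex_);
    ++active_readers_;
  }

  void rdunlock() {
    std::lock_guard<std::mutex> guard(mutex_);
    assert(active_readers_ > 0);
    if (--active_readers_ == 0 && pending_writers_ > 0)
      no_readers_.notify_one();  // last reader wakes one pending writer
  }

  void wrlock() {
    std::unique_lock<std::mutex> guard(mutex_);
    ++pending_writers_;
    no_readers_.wait(guard, [this] { return active_readers_ == 0; });
    --pending_writers_;
    guard.release();  // keep mutex_ locked until wrunlock()
  }

  void wrunlock() {
    // Called by the thread that owns the wr-lock, so mutex_ is held here.
    if (pending_writers_ > 0) no_readers_.notify_one();
    mutex_.unlock();
  }

  int active_readers() {
    std::lock_guard<std::mutex> guard(mutex_);
    return active_readers_;
  }

  int pending_writers() {
    std::lock_guard<std::mutex> guard(mutex_);
    return pending_writers_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable no_readers_;
  int active_readers_ = 0;
  int pending_writers_ = 0;
};

// Returns true if a second reader gets in while a writer is pending
// and the writer still completes once both readers are gone.
bool second_reader_enters_while_writer_pending() {
  ReaderPreferringLock lock;
  bool writer_done = false;

  lock.rdlock();  // first reader holds the lock
  std::thread writer([&] {
    lock.wrlock();  // blocks until all readers have left
    writer_done = true;
    lock.wrunlock();
  });

  // Wait until the writer is registered as pending.
  while (lock.pending_writers() == 0) std::this_thread::yield();

  lock.rdlock();  // second reader: must not queue behind the writer
  bool entered = (lock.active_readers() == 2);

  lock.rdunlock();
  lock.rdunlock();
  writer.join();
  return entered && writer_done;
}
```

In this sketch an uncontended wr-lock/unlock pair costs one mutex lock and one mutex unlock, which is one way such a lock could be cheap in the wr-lock-heavy case mentioned above. The trade-off is that a steady stream of readers can starve writers.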