    CMFActivity.Activity.SQLBase: Reduce the number of deadlocks · 18b5e4ed
    Vincent Pelletier authored
    MariaDB appears to acquire locks in an inconsistent order when executing
    the activity reservation queries. As a consequence, it produces internal
    deadlocks, which it detects. Upon detection, it kills one of the queries
    involved, which causes message reservation to fail despite the presence
    of executable activities.
    To avoid depending on MariaDB's internal lock acquisition order, acquire an
    explicit table-scoped lock before running the activity reservation queries.
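
    As an illustration only, the idea of serialising reservation queries
    behind an explicit lock can be sketched as below. This sketch assumes
    MariaDB's named-lock functions GET_LOCK() and RELEASE_LOCK(), a DB-API
    cursor using the "%s" parameter style, and a hypothetical helper name
    (named_lock); the mechanism actually used in SQLBase.py may differ.

```python
from contextlib import contextmanager


@contextmanager
def named_lock(cursor, name, timeout=1):
    """Hold a MariaDB named lock for the duration of the "with" block.

    Hypothetical helper, not the code from SQLBase.py.
    Yields True if the lock was acquired, False on timeout or error,
    so the caller decides whether to proceed without the lock.
    """
    # GET_LOCK returns 1 on success, 0 on timeout, NULL on error.
    cursor.execute("SELECT GET_LOCK(%s, %s)", (name, timeout))
    row = cursor.fetchone()
    acquired = bool(row) and row[0] == 1
    try:
        yield acquired
    finally:
        # Only release a lock this connection actually holds.
        if acquired:
            cursor.execute("SELECT RELEASE_LOCK(%s)", (name,))
            cursor.fetchone()
```

    A caller would wrap the reservation queries in
    "with named_lock(cursor, lock_name) as acquired:" and choose, when
    acquired is False, between running the queries unprotected and skipping
    this reservation attempt.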
    
    On an otherwise-idle 31 processing node cluster with the following
    activities spawned, designed to stress activity reservation queries
    (many ultra-short activities being executed one at a time):
      active_getTitle = context.getPortalObject().portal_catalog.activate(
        activity='SQLQueue',
        priority=5,
        tag='foo',
      ).getTitle
      for _ in xrange(40000):
        active_getTitle()
    the results are:
    - a 26% shorter activity execution time: from 206s with the original code
      to 152s
    - a 100% reduction in reported deadlocks: from 300 with the original code
      to 0
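
    For reference, the percentages above follow from the measured values.
    The helper name below is illustrative only:

```python
def percent_reduction(before, after):
    """Return the relative reduction from before to after, in percent."""
    return 100.0 * (before - after) / before


# Execution time: 206s with the original code, 152s with the lock.
time_gain = percent_reduction(206, 152)
# Reported deadlocks: 300 with the original code, 0 with the lock.
deadlock_gain = percent_reduction(300, 0)
```

    Here time_gain rounds to 26 and deadlock_gain is exactly 100.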
    
    There is room for further improvements at a later time:
    - tweaking the amount of time spent waiting for this new lock to be
      available, set for now at 1s.
    - possibly bypassing this lock altogether when there are too few processing
      nodes simultaneously enabled, or even in an adaptive reaction to
      deadlock errors actually happening.
    - covering more write accesses to these tables with the same lock.
    
    In a production environment, the getReservedMessageList method alone
    appears to be involved in 95% of these deadlocks, so for now this change
    only targets this method.