product/CMFActivity/Activity/SQLBase.py · 18b5e4edea04b714f391de57b97cd05aa1704c3f · nexedi / erp5

CMFActivity.Activity.SQLBase: Reduce the number of deadlocks · 18b5e4ed

Vincent Pelletier authored Sep 17, 2021

MariaDB seems to use an inconsistent lock acquisition order when
executing the activity reservation queries. As a consequence, it produces
internal deadlocks, which it detects. Upon detection, it kills one of the
involved queries, which causes message reservation to fail despite the
presence of executable activities.
To avoid depending on MariaDB's internal lock acquisition order, acquire an
explicit table-scoped lock before running the activity reservation queries.
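
As an illustration only, here is a minimal sketch of taking a named,
per-table lock around the reservation queries. It assumes MariaDB's
GET_LOCK/RELEASE_LOCK functions are used for this; the helper name
table_lock, the query callable and the lock naming scheme are all
hypothetical and are not taken from SQLBase.py:

```python
from contextlib import contextmanager

@contextmanager
def table_lock(query, table, timeout=1):
    """Hold a named MariaDB lock while the wrapped block runs.

    `query` is a hypothetical callable taking (sql, params) and returning
    the result rows as a list of tuples. It stands in for whatever
    database connector is actually used.
    """
    # Hypothetical naming scheme: one lock name per activity table.
    lock_name = 'activity_reservation.' + table
    # GET_LOCK returns 1 on success, 0 on timeout and NULL on error.
    rows = query('SELECT GET_LOCK(%s, %s)', (lock_name, timeout))
    acquired = rows[0][0] == 1
    try:
        # The caller decides what to do when the lock was not obtained.
        yield acquired
    finally:
        if acquired:
            query('SELECT RELEASE_LOCK(%s)', (lock_name,))
```

A caller would wrap the reservation queries in "with table_lock(...) as
acquired:" and choose whether to proceed or skip when acquired is false.
The sketch deliberately leaves that choice open.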

On an otherwise-idle 31 processing node cluster with the following
activities spawned, designed to stress activity reservation queries
(many ultra-short activities being executed one at a time):
  active_getTitle = context.getPortalObject().portal_catalog.activate(
    activity='SQLQueue',
    priority=5,
    tag='foo',
  ).getTitle
  for _ in xrange(40000):
    active_getTitle()
the results are:
- a 26% shorter activity execution time: from 206s with the original code
  to 152s
- a 100% reduction in reported deadlocks: from 300 with the original code
  to 0
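
For reference, the percentages above follow from the raw figures. A small
sketch of the arithmetic (the helper name percent_reduction is only
illustrative):

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before

# Execution time in seconds, then number of reported deadlocks.
print(round(percent_reduction(206, 152)))
print(round(percent_reduction(300, 0)))
```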

There is room for further improvements at a later time:
- tweaking the amount of time spent waiting for this new lock to be
  available, set for now at 1s.
- possibly bypassing this lock altogether when there are too few processing
  nodes simultaneously enabled, or even in an adaptive reaction to
  deadlock errors actually happening.
- covering more write accesses to these tables with the same lock.

From a production environment, it appears that the getReservedMessageList
method alone is involved in 95% of these deadlocks, so for now this change
only targets this method.
