Commit a42da4de authored by Vincent Pelletier's avatar Vincent Pelletier

CMFActivity: Do not use offset for scanning messages to validate.

This was inefficient for two reasons:
- any message we could validate during current iteration means a message we
  did not consider is now in the range we just scanned. And it will not be
  considered until validation node starts over and scan this same range
  again.
- "LIMIT x,1000" pattern on >1000 messages causes a quick-growing number of
  extra rows scanned by the SQL database just to skip the "x" first rows:
  at 2000 rows present it must scan 1000 + 2000 = 3000 rows for a complete
  loop over all pending activities. At 3k rows it must scan 6k rows.
  At 4k, 10k.
  While this is an overestimation (some rows should be possible to
  validate, so these would be scanned once only), this overhead grows so
  large that this overestimation can become negligible.

Instead, use a range condition consistent with query's "SORT ON", which is
already efficiently materialised by an index: SQL database just has to
dive into the existing index to start just above the last message from
previous iteration, and resume scanning from there, solving both issues
listed above.
parent ea56fe72
......@@ -108,6 +108,9 @@ def sqltest_dict():
_('retry')
_('to_date', column="date", op="<=")
_('uid')
_('from_date', column="date", op=">=")
_('from_priority', column="priority", op=">=")
_('above_uid', column="uid", op=">")
return sqltest_dict
sqltest_dict = sqltest_dict()
......@@ -192,19 +195,18 @@ class SQLBase(Queue):
assert len(result[0]) == 1
return result[0][0]
def _getMessageList(self, activity_tool, offset=0, count=1000, src__=0, **kw):
def _getMessageList(self, activity_tool, count=1000, src__=0, **kw):
# XXX: Because most columns have NOT NULL constraint, conditions with None
# value should be ignored, instead of trying to render them
# (with comparisons with NULL).
sql_connection = activity_tool.getPortalObject().cmf_activity_sql_connection
q = sql_connection.sql_quote__
if offset:
limit = '\nLIMIT %d,%d' % (offset, sys.maxint if count is None else count)
else:
limit = '' if count is None else '\nLIMIT %d' % count
sql = '\n AND '.join(sqltest_dict[k](v, q) for k, v in kw.iteritems())
sql = "SELECT * FROM %s%s\nORDER BY priority, date, uid%s" % (
self.sql_table, sql and '\nWHERE ' + sql, limit)
self.sql_table,
sql and '\nWHERE ' + sql,
'' if count is None else '\nLIMIT %d' % count,
)
return sql if src__ else Results(sql_connection().query(sql, max_rows=0))
def getMessageList(self, *args, **kw):
......@@ -349,17 +351,15 @@ class SQLBase(Queue):
assignMessage = getattr(activity_tool, 'SQLBase_assignMessage', None)
if assignMessage is None:
return
offset = 0
now_date = self.getNow(activity_tool)
where_kw = {
'processing_node': -1,
'to_date': now_date,
'count': READ_MESSAGE_LIMIT,
}
validated_count = 0
while 1:
result = self._getMessageList(
activity_tool,
processing_node=-1,
to_date=now_date,
offset=offset,
count=READ_MESSAGE_LIMIT,
)
result = self._getMessageList(activity_tool, **where_kw)
if not result:
return
transaction.commit()
......@@ -399,7 +399,9 @@ class SQLBase(Queue):
validated_count += distributable_count
if validated_count >= MAX_VALIDATED_LIMIT:
return
offset += READ_MESSAGE_LIMIT
where_kw['from_priority'] = line.priority
where_kw['from_date'] = line.date
where_kw['above_uid'] = line.uid
def getReservedMessageList(self, activity_tool, date, processing_node,
limit=None, group_method_id=None):
......
  • @jm here is the new version, with one more coma which angered the previous tests. Now they look happy (though not over yet). So the remaining question is finding a better name for above_uid if you think it needs one.

    [edit]: Also, I fixed the commit message. There was a pretty bad typo in the previous one which made the first inefficiency description very hard to understand, I think.

    Edited by Vincent Pelletier
  • mentioned in merge request !776 (merged)

    Toggle commit list
  • @jm: I realise this change is bad. Not on the intent, but on the implementation:

    • the sort on uids, which was here before this change, is meaningless since the switch to random uids. The original intent was that uids implemented a finer-grained "date" value, giving information about activity age when activities are created within a single second. With random uids, this became meaningless.
    • ...but it is required for this change to work, otherwise if there are >=1k activities on a given priority and date and cannot be validated, then validation loops on these forever

    So this change relies on something meaningless to work, so it's not good.

    Also, while trying to improve this change I also realised something I completely forgot: validation node restarts from the beginning everytime it could validate 1k activities. I think it is wrong (why not just continue scanning the current activity queue until all activities have been considered ? this looks like an optimisation for the special case where validation node is also a processing node, which I think as meaningless to optimise for - either performance does not matter, or a cluster should be used and responsibilities split). This behaviour means the expected large gain from when there are many pending activities is not actually achieved.

    All this makes me want to revert this commit, to keep the code simple and not adding more code relying on dubious features. And we can continue living with a meaningless sort on uid until a better solution is found.

    What do you think ?

  • Increasing the precision of the date column (DATETIME(6)) looks enough, doesn't it ?

    MariaDB [test]> select UTC_TIMESTAMP(6); select UTC_TIMESTAMP(6);
    +----------------------------+
    | UTC_TIMESTAMP(6)           |
    +----------------------------+
    | 2018-12-19 15:48:51.207722 |
    +----------------------------+
    1 row in set (0.00 sec)
    
    +----------------------------+
    | UTC_TIMESTAMP(6)           |
    +----------------------------+
    | 2018-12-19 15:48:51.207986 |
    +----------------------------+
    1 row in set (0.00 sec)

    Note that several UTC_TIMESTAMP(6) within the same request return the same result (either MariaDB is too fast or it computes the result only once) but it should be ok to sort by uids here.

  • Increasing date precision looks like a good idea, yes.

    I would not say that it fixes the issue (we still need to sort by a meaningless value, just as a tie-break) but it certainly improves the result, and pushes the limit of how many nodes we can have and how fast they can be before they start to actually rely on that tie-break.

    What is needed ? I believe there is something needed on the insertion queries, doesn't it ? I also faintly remember that you worked on date precision in the past and had to patch dtml rendering.

  • mentioned in commit 0b3aecff

    Toggle commit list
  • Julien Muchembled @jm

    mentioned in merge request !820 (merged)

    ·

    mentioned in merge request !820 (merged)

    Toggle commit list
  • mentioned in commit b82f3ba1

    Toggle commit list
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment