Commit 72320df7 authored by Kirill Smelkov's avatar Kirill Smelkov

Add tests that demonstrates data corruption when ZEO5 client is served by ZEO4 server

As explained by "Bug1" in https://github.com/zopefoundation/ZEO/issues/209
there is a race in between load and invalidate when ZEO5 client works
wrt ZEO4 server:

    ---- 8< ----
    ZEO5.client - contrary to ZEO4.client - does not account for simultaneous invalidations when working wrt ZEO4.server.
    It shows as e.g.

    ```console
    (z-dev) kirr@deca:~/src/wendelin/z/ZEO5$ ZEO4_SERVER=1 zope-testrunner -fvvvx --test-path=src -t check_race_load_vs_external
    ...

    AssertionError: T1: obj1 (24)  !=  obj2 (23)
    obj1._p_serial: 0x03ea4b6fb05b52cc  obj2._p_serial: 0x03ea4b6faf253855
    zconn_at: 0x03ea4b6fb05b52cc  # approximated as max(serials)
    zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03ea4b6fb05b52cc  next_serial: None
    zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03ea4b6faf253855  next_serial: 0x03ea4b6fb07def66
    zstor._cache.clear()
    zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03ea4b6fb05b52cc  next_serial: 0x03ea4b6fb07def66
    zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03ea4b6fb05b52cc  next_serial: 0x03ea4b6fb07def66
    ```

    indicating that obj2 was provided to user from the cache that erroneously got stale.

    With [IO trace](kirr/ZEO@f184a1d9) showing message exchange in between client and the server, this looks as:

    ```
    # loadBefore issued
    tx (('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'), False, 'loadBefore', ('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'))

    # received invalidateTransaction
    rx (0, 1, 'invalidateTransaction', ('\x03\xeaKo\xb0[R\xcc', ['\x00\x00\x00\x00\x00\x00\x00\x01', '\x00\x00\x00\x00\x00\x00\x00\x02']))

    # received loadBefore reply but with end_tid=None !!!
    rx (('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'), 0, '.reply', ('\x80\x03cZODB.tests.MinPO\nMinPO\nq\x01.\x80\x03}q\x02U\x05valueq\x03K\x17s.', '\x03\xeaKo\xaf%8U', None))
    ```

    which:

    1) contradicts what @jimfulton [wrote about ZEO4](https://github.com/zopefoundation/ZEO/blob/master/docs/ordering.rst#zeo-4): that there _invalidations are sent in a callback called when the storage lock is held, blocking loads while committing_, and
    2) points out what the bug is:

    Since in ZEO4 loads can be handled while a commit is in progress, ZEO4.client has special care to detect if an `invalidateTransaction` comes in between `load` request and corresponding `.reply` response, and _does not update the cache if invalidateTransaction sneaked in-between_:

    https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L367-L374
    https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L841-L852
    https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L1473-L1476

    but in ZEO5.client `loadBefore` does not have anything like that

    https://github.com/zopefoundation/ZEO/blob/fc0729b3cc754bda02c7f54319260b5527dd42a3/src/ZEO/ClientStorage.py#L603-L608
    https://github.com/zopefoundation/ZEO/blob/fc0729b3cc754bda02c7f54319260b5527dd42a3/src/ZEO/asyncio/client.py#L289-L309

    and thus `invalidateTransaction` sneaked in between `loadBefore` request and
    corresponding `.reply` causes ZEO client cache corruption.

    In the original `check_race_load_vs_external_invalidate` the problem appears
    only sometimes, but the bug happens with ~ 100% probability with the
    following delay injected on the server after `loadBefore`:

    ```diff
    --- a/src/ZEO/tests/ZEO4/StorageServer.py
    +++ b/src/ZEO/tests/ZEO4/StorageServer.py
    @@ -285,7 +285,9 @@ def loadEx(self, oid):

         def loadBefore(self, oid, tid):
             self.stats.loads += 1
    -        return self.storage.loadBefore(oid, tid)
    +        x = self.storage.loadBefore(oid, tid)
    +        time.sleep(0.1)
    +        return x

         def getInvalidations(self, tid):
             invtid, invlist = self.server.get_invalidations(self.storage_id, tid)
    ```

    so maybe, in addition to normal test runs, it could be also good idea to run
    the whole ZEO testsuite against so-amended storage backend. This way it will be
    similar to how e.g. [races are detected](https://lab.nexedi.com/kirr/go123/blob/070bfdbb/tracing/tracetest/tracetest.go#L76-102) by my [tracetest](https://lab.nexedi.com/kirr/go123/blob/070bfdbb/tracing/tracetest/tracetest.go#L20-73) package.
    ---- 8< ----

-> Add suggested test to the whole ZEO testsuite. Currently it reliably
catches the data curruption caused by explained load/invalidate race
condition and fails as follows:

    (z-dev) kirr@deca:~/src/wendelin/z/ZEO5$ ZEO4_SERVER=1 zope-testrunner -fvvvc -a 100 --test-path=src
    ...

    Running .FileStorageLoadDelayedTests tests:
    ...

    Failure in test check_race_load_vs_external_invalidate (ZEO.tests.testZEO.FileStorageLoadDelayedTests)
    Traceback (most recent call last):
      File "/usr/lib/python2.7/unittest/case.py", line 329, in run
        testMethod()
      File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/racetest.py", line 253, in check_race_load_vs_external_invalidate
        return self._check_race_load_vs_external_invalidate(T2ObjectsInc())
      File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/util.py", line 417, in _
        return f(*argv, **kw)
      File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/racetest.py", line 346, in _check_race_load_vs_external_invalidate
        self.fail('\n\n'.join([_ for _ in failure if _]))
      File "/usr/lib/python2.7/unittest/case.py", line 410, in fail
        raise self.failureException(msg)
    AssertionError: T4: obj1 (19)  !=  obj2 (20)
    obj1._p_serial: 0x03eabd580073db55  obj2._p_serial: 0x03eabd580079a5dd
    zconn_at: 0x03eabd580079a5dd  # approximated as max(serials)
    zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580073db55  next_serial: None
    zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
    zstor._cache.clear()
    zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
    zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: 0x03eabd5800aff0ee

    T7: obj1 (18)  !=  obj2 (20)
    obj1._p_serial: 0x03eabd580071a011  obj2._p_serial: 0x03eabd580079a5dd
    zconn_at: 0x03eabd580079a5dd  # approximated as max(serials)
    zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580071a011  next_serial: None
    zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
    zstor._cache.clear()
    zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
    zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None

Fortunately added tests do not fail for ZEO5.client-ZEO5.server.

Running added tests takes more time compared to running regular tests.
For example the time to run BlobAdaptedFileStorageTests at default
level=1 takes 19s. To run the same BlobAdaptedFileStorageTests at
level=2 (thus including check_race_external_invalidate_vs_disconnect
from https://github.com/zopefoundation/ZODB/pull/368) takes 25s.
And to run hereby-added FileStorageLoadDelayedTests takes 85s.

That's why the test is marked as very long via level=10
parent 83730901
<component>
<import package="ZODB"/>
<sectiontype
name="loaddelayed_storage"
datatype="ZEO.tests.racetest.ZConfigLoadDelayed"
implements="ZODB.storage"
>
<section type="ZODB.storage" name="*" attribute="base" required="yes" />
</sectiontype>
</component>
##############################################################################
#
# Copyright (c) 2022 Zope Foundation and Contributors.
# All Rights Reserved.
#
# This software is subject to the provisions of the Zope Public License,
# Version 2.1 (ZPL). A copy of the ZPL should accompany this distribution.
# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
# WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
# FOR A PARTICULAR PURPOSE
#
##############################################################################
"""Module racetest provides infrastructure for tests that catch concurrency
problems in ZEO.
It complements ZODB.tests.racetest .
"""
import zope.interface
import time
# LoadDelayedStorage wraps base storage in injects delays after load*
# operations.
#
# It is useful to catch concurrency problems due to races in between load and
# other storage events, e.g. invalidations.
#
# See e.g. https://github.com/zopefoundation/ZEO/issues/209 for example where
# injecting such delays was effective to catch bugs that were otherwise hard to
# reproduce.
class LoadDelayedStorage(object):
def __init__(self, base, tdelay=0.01):
self.base = base
self.tdelay = tdelay
zope.interface.directlyProvides(self, zope.interface.providedBy(base))
def _delay(self):
time.sleep(self.tdelay)
def __getattr__(self, name):
return getattr(self.base, name)
def __len__(self): # ^^^ __getattr__ does not forward __<attr>
return len(self.base) # because they are mangled
def load(self, *argv, **kw):
_ = self.base.load(*argv, **kw)
self._delay()
return _
def loadBefore(self, *argv, **kw):
_ = self.base.loadBefore(*argv, **kw)
self._delay()
return _
def loadSerial(self, *argv, **kw):
_ = self.base.loadSerial(*argv, **kw)
self._delay()
return _
class ZConfigLoadDelayed(object):
_factory = LoadDelayedStorage
def __init__(self, config):
self.config = config
self.name = config.getSectionName()
def open(self):
base = self.config.base.open()
return self._factory(base)
...@@ -511,6 +511,27 @@ class FileStorageClientHexTests(FileStorageHexTests): ...@@ -511,6 +511,27 @@ class FileStorageClientHexTests(FileStorageHexTests):
return ZODB.tests.hexstorage.HexStorage(client) return ZODB.tests.hexstorage.HexStorage(client)
class FileStorageLoadDelayedTests(FileStorageTests):
"""Test ZEO backed by a FileStorage with delayes injected after load
operations.
This catches e.g. races in between loads and invalidations.
See https://github.com/zopefoundation/ZEO/issues/209 for details.
"""
level = 10 # very long test
def getConfig(self):
return """\
%import ZEO.tests
<loaddelayed_storage>
<filestorage 1>
path Data.fs
</filestorage>
</loaddelayed_storage>
"""
class ClientConflictResolutionTests( class ClientConflictResolutionTests(
GenericTestBase, GenericTestBase,
ConflictResolution.ConflictResolvingStorage): ConflictResolution.ConflictResolvingStorage):
...@@ -1748,6 +1769,7 @@ slow_test_classes = [ ...@@ -1748,6 +1769,7 @@ slow_test_classes = [
# DemoStorageTests, # DemoStorageTests,
# FileStorageTests, # FileStorageTests,
# FileStorageHexTests, FileStorageClientHexTests, # FileStorageHexTests, FileStorageClientHexTests,
FileStorageLoadDelayedTests,
] ]
if not forker.ZEO4_SERVER: if not forker.ZEO4_SERVER:
slow_test_classes.append(FileStorageSSLTests) slow_test_classes.append(FileStorageSSLTests)
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment