1. 04 Sep, 2022 1 commit
    • Kirill Smelkov's avatar
      Add tests that demonstrates data corruption when ZEO5 client is served by ZEO4 server · 72320df7
      Kirill Smelkov authored
      As explained by "Bug1" in https://github.com/zopefoundation/ZEO/issues/209
      there is a race in between load and invalidate when ZEO5 client works
      wrt ZEO4 server:
      
          ---- 8< ----
          ZEO5.client - contrary to ZEO4.client - does not account for simultaneous invalidations when working wrt ZEO4.server.
          It shows as e.g.
      
          ```console
          (z-dev) kirr@deca:~/src/wendelin/z/ZEO5$ ZEO4_SERVER=1 zope-testrunner -fvvvx --test-path=src -t check_race_load_vs_external
          ...
      
          AssertionError: T1: obj1 (24)  !=  obj2 (23)
          obj1._p_serial: 0x03ea4b6fb05b52cc  obj2._p_serial: 0x03ea4b6faf253855
          zconn_at: 0x03ea4b6fb05b52cc  # approximated as max(serials)
          zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03ea4b6fb05b52cc  next_serial: None
          zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03ea4b6faf253855  next_serial: 0x03ea4b6fb07def66
          zstor._cache.clear()
          zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03ea4b6fb05b52cc  next_serial: 0x03ea4b6fb07def66
          zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03ea4b6fb05b52cc  next_serial: 0x03ea4b6fb07def66
          ```
      
          indicating that obj2 was provided to user from the cache that erroneously got stale.
      
          With [IO trace](kirr/ZEO@f184a1d9) showing message exchange in between client and the server, this looks as:
      
          ```
          # loadBefore issued
          tx (('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'), False, 'loadBefore', ('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'))
      
          # received invalidateTransaction
          rx (0, 1, 'invalidateTransaction', ('\x03\xeaKo\xb0[R\xcc', ['\x00\x00\x00\x00\x00\x00\x00\x01', '\x00\x00\x00\x00\x00\x00\x00\x02']))
      
          # received loadBefore reply but with end_tid=None !!!
          rx (('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'), 0, '.reply', ('\x80\x03cZODB.tests.MinPO\nMinPO\nq\x01.\x80\x03}q\x02U\x05valueq\x03K\x17s.', '\x03\xeaKo\xaf%8U', None))
          ```
      
          which:
      
          1) contradicts what @jimfulton [wrote about ZEO4](https://github.com/zopefoundation/ZEO/blob/master/docs/ordering.rst#zeo-4): that there _invalidations are sent in a callback called when the storage lock is held, blocking loads while committing_, and
          2) points out what the bug is:
      
          Since in ZEO4 loads can be handled while a commit is in progress, ZEO4.client has special care to detect if an `invalidateTransaction` comes in between `load` request and corresponding `.reply` response, and _does not update the cache if invalidateTransaction sneaked in-between_:
      
          https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L367-L374
          https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L841-L852
          https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L1473-L1476
      
          but in ZEO5.client `loadBefore` does not have anything like that
      
          https://github.com/zopefoundation/ZEO/blob/fc0729b3cc754bda02c7f54319260b5527dd42a3/src/ZEO/ClientStorage.py#L603-L608
          https://github.com/zopefoundation/ZEO/blob/fc0729b3cc754bda02c7f54319260b5527dd42a3/src/ZEO/asyncio/client.py#L289-L309
      
          and thus `invalidateTransaction` sneaked in between `loadBefore` request and
          corresponding `.reply` causes ZEO client cache corruption.
      
          In the original `check_race_load_vs_external_invalidate` the problem appears
          only sometimes, but the bug happens with ~ 100% probability with the
          following delay injected on the server after `loadBefore`:
      
          ```diff
          --- a/src/ZEO/tests/ZEO4/StorageServer.py
          +++ b/src/ZEO/tests/ZEO4/StorageServer.py
          @@ -285,7 +285,9 @@ def loadEx(self, oid):
      
               def loadBefore(self, oid, tid):
                   self.stats.loads += 1
          -        return self.storage.loadBefore(oid, tid)
          +        x = self.storage.loadBefore(oid, tid)
          +        time.sleep(0.1)
          +        return x
      
               def getInvalidations(self, tid):
                   invtid, invlist = self.server.get_invalidations(self.storage_id, tid)
          ```
      
          so maybe, in addition to normal test runs, it could be also good idea to run
          the whole ZEO testsuite against so-amended storage backend. This way it will be
          similar to how e.g. [races are detected](https://lab.nexedi.com/kirr/go123/blob/070bfdbb/tracing/tracetest/tracetest.go#L76-102) by my [tracetest](https://lab.nexedi.com/kirr/go123/blob/070bfdbb/tracing/tracetest/tracetest.go#L20-73) package.
          ---- 8< ----
      
      -> Add suggested test to the whole ZEO testsuite. Currently it reliably
      catches the data curruption caused by explained load/invalidate race
      condition and fails as follows:
      
          (z-dev) kirr@deca:~/src/wendelin/z/ZEO5$ ZEO4_SERVER=1 zope-testrunner -fvvvc -a 100 --test-path=src
          ...
      
          Running .FileStorageLoadDelayedTests tests:
          ...
      
          Failure in test check_race_load_vs_external_invalidate (ZEO.tests.testZEO.FileStorageLoadDelayedTests)
          Traceback (most recent call last):
            File "/usr/lib/python2.7/unittest/case.py", line 329, in run
              testMethod()
            File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/racetest.py", line 253, in check_race_load_vs_external_invalidate
              return self._check_race_load_vs_external_invalidate(T2ObjectsInc())
            File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/util.py", line 417, in _
              return f(*argv, **kw)
            File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/racetest.py", line 346, in _check_race_load_vs_external_invalidate
              self.fail('\n\n'.join([_ for _ in failure if _]))
            File "/usr/lib/python2.7/unittest/case.py", line 410, in fail
              raise self.failureException(msg)
          AssertionError: T4: obj1 (19)  !=  obj2 (20)
          obj1._p_serial: 0x03eabd580073db55  obj2._p_serial: 0x03eabd580079a5dd
          zconn_at: 0x03eabd580079a5dd  # approximated as max(serials)
          zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580073db55  next_serial: None
          zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
          zstor._cache.clear()
          zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
          zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: 0x03eabd5800aff0ee
      
          T7: obj1 (18)  !=  obj2 (20)
          obj1._p_serial: 0x03eabd580071a011  obj2._p_serial: 0x03eabd580079a5dd
          zconn_at: 0x03eabd580079a5dd  # approximated as max(serials)
          zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580071a011  next_serial: None
          zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
          zstor._cache.clear()
          zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
          zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
      
      Fortunately added tests do not fail for ZEO5.client-ZEO5.server.
      
      Running added tests takes more time compared to running regular tests.
      For example the time to run BlobAdaptedFileStorageTests at default
      level=1 takes 19s. To run the same BlobAdaptedFileStorageTests at
      level=2 (thus including check_race_external_invalidate_vs_disconnect
      from https://github.com/zopefoundation/ZODB/pull/368) takes 25s.
      And to run hereby-added FileStorageLoadDelayedTests takes 85s.
      
      That's why the test is marked as very long via level=10
      72320df7
  2. 02 Sep, 2022 3 commits
  3. 01 Sep, 2022 1 commit
  4. 02 Aug, 2022 1 commit
  5. 01 Jun, 2022 1 commit
  6. 31 May, 2022 1 commit
  7. 30 May, 2022 1 commit
  8. 23 May, 2022 1 commit
    • Kirill Smelkov's avatar
      tests: Drop use of random2 · cc84605e
      Kirill Smelkov authored
      We can use stdlib random module and manually adjust randint to behave
      exactly as before Python3, so that testdata are matched exactly as before.
      
      Also drop unnecessary `random.seed(42)` in cache_run. It is not needed
      there because random is not used in cache_run at all.
      
      Based on work of @d-maurer in https://github.com/zopefoundation/ZEO/pull/195 .
      cc84605e
  9. 11 Apr, 2022 1 commit
  10. 09 Apr, 2022 2 commits
  11. 07 Apr, 2022 7 commits
  12. 06 Apr, 2022 3 commits
  13. 05 Apr, 2022 1 commit
  14. 31 Mar, 2022 4 commits
  15. 30 Mar, 2022 2 commits
  16. 24 Mar, 2022 5 commits
  17. 19 Mar, 2022 5 commits