Commit 3809711b authored by Tim Peters's avatar Tim Peters

Critical bugfix for invalidations during ZEO cache verification.

ClientCache.invalidate():

- This was written not to expect tid=None, which is what happens
  during cache verification, although it didn't catch the tid=None
  case and went on doing really strange things then.  Rewrote so
  that tid=None means "forget everything I think I know about
  the oid".  It was kind-of performing an invalidation wrt the
  in-memory data structures in this case, but leaving the
  invalidated data for oid looking as if it were still current on
  disk.  That was accidental, depending on what various other
  parts of the code thought "None" meant (primarily, an end_tid
  of None is generally taken here to mean "the data is current").
  
- Also changed the "and we already have version data for oid" case
  to a "forget everything I think I know about the oid" case.  The
  comments seemed to say that's what it thought it was doing, but
  the code actually left non-current revisions sitting around.  I
  really don't want to spend time trying to figure out what versions
  are supposed to do in all cases here, and nuking everything in
  this case is an obviously *safe* thing to do.

- Added more asserts, because the pre- and post-conditions here are
  delicate and non-obvious (indeed, that a new assert failed is how
  I stumbled into the invalidation problem to begin with).

FileCache.remove():  Changed to remove the key from self.key2entry.
While self.filemap keeps an in-memory map of the entire disk file,
there's no need to keep info in key2entry for keys that no longer
exist in the disk file.  I'm not sure a visible bug followed from
this (but not sure one didn't either!), but it at best wasted
memory, and violated a documented invariant.

ConnectionTests.py:  New test checkVerificationInvalidationPersists().
Before the patch, this provokes a persistent ZEO cache into
delivering stale object state.  Of course the test passes after
this patch.

stats.py:  Added a trace code meaning "invalidation hit, and the
object state was discarded".  This is what happens now for
invalidations produced by verification.  Because of MVCC, that's
really not needed *except* for the invalidations produced by
startup verification (invalidations normally just set a new
ending tid on the validity range for what was current data; but
validity-range info isn't available from verification invalidations).
parent eee5f974
...@@ -11,6 +11,17 @@ ZEO client cache ...@@ -11,6 +11,17 @@ ZEO client cache
is carried over from a time when gigabyte disks were expensive and rare; is carried over from a time when gigabyte disks were expensive and rare;
20MB is also too small on most modern machines. 20MB is also too small on most modern machines.
- Fixed a nasty bug in cache verification. A persistent ZEO cache uses a
disk file, and, when active, has some in-memory data structures too to
speed operation. Invalidations processed as part of startup cache
verification were reflected in the in-memory data structures, but not
correctly in the disk file. So if an object revision was invalidated as
part of verification, the object wasn't loaded again before the connection
was closed, and the object revision remained in the cache file until the
connection was closed, then the next time the cache file was opened it
could believe that the stale object revision in the file was actually
current.
- Fixed a bug wherein an object removed from the client cache didn't - Fixed a bug wherein an object removed from the client cache didn't
properly mark the file slice it occupied as being available for reuse. properly mark the file slice it occupied as being available for reuse.
......
...@@ -130,6 +130,7 @@ class ClientCache: ...@@ -130,6 +130,7 @@ class ClientCache:
elif o.end_tid is None: elif o.end_tid is None:
self.current[oid] = o.start_tid self.current[oid] = o.start_tid
else: else:
assert o.start_tid < o.end_tid
this_span = o.start_tid, o.end_tid this_span = o.start_tid, o.end_tid
span_list = self.noncurrent.get(oid) span_list = self.noncurrent.get(oid)
if span_list: if span_list:
...@@ -286,89 +287,90 @@ class ClientCache: ...@@ -286,89 +287,90 @@ class ClientCache:
self.fc.add(o) self.fc.add(o)
## ##
# Mark the current data for oid as non-current. If there is no # Remove all knowledge of noncurrent revisions of oid, both in
# current data for oid, do nothing. # self.noncurrent and in our FileCache. `version` and `tid` are used
# only for trace records.
def _remove_noncurrent_revisions(self, oid, version, tid):
noncurrent_list = self.noncurrent.get(oid)
if noncurrent_list:
# Note: must iterate over a copy of noncurrent_list. The
# FileCache remove() calls our _evicted() method, and that
# mutates the list.
for old_tid, dummy in noncurrent_list[:]:
# 0x1E = invalidate (hit, discarding current or non-current)
self._trace(0x1E, oid, version, tid)
self.fc.remove((oid, old_tid))
del self.noncurrent[oid]
##
# If `tid` is None, or we have data for `oid` in a (non-empty) version,
# forget all knowledge of `oid`. (`tid` can be None only for
# invalidations generated by startup cache verification.) If `tid`
# isn't None, we don't have version data for `oid`, and we had current
# data for `oid`, stop believing we have current data, and mark the
# data we had as being valid only up to `tid`. In all other cases, do
# nothing.
# @param oid object id # @param oid object id
# @param version name of version to invalidate. # @param version name of version to invalidate.
# @param tid the id of the transaction that wrote a new revision of oid # @param tid the id of the transaction that wrote a new revision of oid,
# or None to forget all cached info about oid (version, current
# revision, and non-current revisions)
def invalidate(self, oid, version, tid): def invalidate(self, oid, version, tid):
if tid > self.fc.tid: if tid > self.fc.tid and tid is not None:
self.fc.settid(tid) self.fc.settid(tid)
remove_all_knowledge_of_oid = tid is None
if oid in self.version: if oid in self.version:
# Forget we know about the version data.
# 0x1A = invalidate (hit, version)
self._trace(0x1A, oid, version, tid) self._trace(0x1A, oid, version, tid)
dllversion, dlltid = self.version[oid] dllversion, dlltid = self.version[oid]
assert not version or version == dllversion, (version, dllversion) assert not version or version == dllversion, (version, dllversion)
self.fc.remove((oid, dlltid)) self.fc.remove((oid, dlltid))
# And continue on, we must also remove any non-version data assert oid not in self.version # .remove() got rid of it
# from the cache. This is a bit of a failure of the current # And continue: we must also remove any non-version data from
# cache consistency approach as the new tid of the version # the cache. Or, at least, I have such a poor understanding of
# data gets confused with the old tid of the non-version data. # versions that anything less drastic would probably be wrong.
# I could sort this out, but it seems simpler to punt and remove_all_knowledge_of_oid = True
# have the cache invalidate too much for versions.
if remove_all_knowledge_of_oid:
if oid not in self.current: self._remove_noncurrent_revisions(oid, version, tid)
# Only current, non-version data remains to be handled.
cur_tid = self.current.get(oid)
if not cur_tid:
# 0x10 == invalidate (miss)
self._trace(0x10, oid, version, tid) self._trace(0x10, oid, version, tid)
return return
cur_tid = self.current.pop(oid)
# XXX. During checkDisconnectedAbort, tid shows up as None here # We had current data for oid, but no longer.
# sometimes.
# if remove_all_knowledge_of_oid:
#if not cur_tid < tid: # 0x1E = invalidate (hit, discarding current or non-current)
# print "OUCH1", cur_tid, tid self._trace(0x1E, oid, version, tid)
# import traceback self.fc.remove((oid, cur_tid))
# traceback.print_stack() assert cur_tid not in self.current # .remove() got rid of it
#assert cur_tid < tid return
#
# The stack trace. oid is 1, version '', tid None:
#
# File "C:\Python23\lib\threading.py", line 436, in __bootstrap
# self.run()
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\client.py", line 293, in run
# success = self.try_connecting(attempt_timeout)
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\client.py", line 324, in try_connecting
# r = self._connect_wrappers(wrappers, deadline)
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\client.py", line 385, in _connect_wrappers
# wrap.connect_procedure()
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\client.py", line 471, in connect_procedure
# self.test_connection()
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\client.py", line 495, in test_connection
# self.notify_client()
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\client.py", line 507, in notify_client
# self.client.notifyConnected(self.conn)
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\ClientStorage.py", line 516, in notifyConnected
# self._wait_sync()
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\ClientStorage.py", line 370, in _wait_sync
# self._connection.pending(30)
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\connection.py", line 546, in pending
# self.handle_read_event()
# File "C:\Python23\lib\asyncore.py", line 390, in handle_read_event
# self.handle_read()
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\smac.py", line 219, in handle_read
# self.message_input(msg)
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\connection.py", line 244, in message_input
# self.handle_request(msgid, flags, name, args)
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\zrpc\connection.py", line 269, in handle_request
# ret = meth(*args)
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\tests\ConnectionTests.py", line 63, in endVerify
# ClientStorage.endVerify(self)
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\ClientStorage.py", line 1093, in endVerify
# self._process_invalidations(InvalidationLogIterator(f))
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\ClientStorage.py", line 1074, in _process_invalidations
# self._cache.invalidate(oid, version, tid)
# File "C:\Code\ZODB3.3\build\lib.win32-2.3\ZEO\cache.py", line 313, in invalidate
# traceback.print_stack()
# Add the data we have to the list of non-current data for oid.
assert tid is not None and cur_tid < tid
# 0x1C = invalidate (hit, saving non-current)
self._trace(0x1C, oid, version, tid)
del self.current[oid] # because we no longer have current data
# Update the end_tid half of oid's validity range on disk.
# XXX Want to fetch object without marking it as accessed # XXX Want to fetch object without marking it as accessed
o = self.fc.access((oid, cur_tid)) o = self.fc.access((oid, cur_tid))
assert o is not None assert o is not None
assert o.end_tid is None # i.e., o was current assert o.end_tid is None # i.e., o was current
if o is None: if o is None:
# XXX is this possible? (doubt it; added an assert just above) # XXX is this possible? (doubt it; added an assert just above)
return None return
o.end_tid = tid o.end_tid = tid
self.fc.update(o) # record the new end_tid on disk self.fc.update(o) # record the new end_tid on disk
self._trace(0x1C, oid, version, tid) # Add to oid's list of non-current data.
L = self.noncurrent.setdefault(oid, []) L = self.noncurrent.setdefault(oid, [])
bisect.insort_left(L, (cur_tid, tid)) bisect.insort_left(L, (cur_tid, tid))
...@@ -943,7 +945,6 @@ class FileCache(object): ...@@ -943,7 +945,6 @@ class FileCache(object):
if size > self.maxsize - ZEC3_HEADER_SIZE: if size > self.maxsize - ZEC3_HEADER_SIZE:
return return
assert size <= self.maxsize
assert object.key not in self.key2entry assert object.key not in self.key2entry
assert len(object.key[0]) == 8 assert len(object.key[0]) == 8
assert len(object.key[1]) == 8 assert len(object.key[1]) == 8
...@@ -994,14 +995,16 @@ class FileCache(object): ...@@ -994,14 +995,16 @@ class FileCache(object):
# XXX Or we could just keep the header in memory at all times. # XXX Or we could just keep the header in memory at all times.
e = self.key2entry.get(key) e = self.key2entry.pop(key, None)
if e is None: if e is None:
return return
offset = e.offset offset = e.offset
size, e2 = self.filemap[offset] size, e2 = self.filemap[offset]
assert size >= 5 # only free blocks are tiny assert e is e2
self.filemap[offset] = size, None
self.f.seek(offset + OBJECT_HEADER_SIZE) self.f.seek(offset + OBJECT_HEADER_SIZE)
o = Object.fromFile(self.f, key, skip_data=True) o = Object.fromFile(self.f, key, skip_data=True)
assert size >= 5 # only free blocks are tiny
# Because `size` >= 5, we can change an allocated block to a free # Because `size` >= 5, we can change an allocated block to a free
# block just by overwriting the 'a' status byte with 'f' -- the # block just by overwriting the 'a' status byte with 'f' -- the
# size field stays the same. # size field stays the same.
...@@ -1009,7 +1012,6 @@ class FileCache(object): ...@@ -1009,7 +1012,6 @@ class FileCache(object):
self.f.write('f') self.f.write('f')
self.f.flush() self.f.flush()
self.parent._evicted(o) self.parent._evicted(o)
self.filemap[offset] = size, None
## ##
# Update on-disk representation of Object obj. # Update on-disk representation of Object obj.
......
...@@ -374,6 +374,8 @@ explain = { ...@@ -374,6 +374,8 @@ explain = {
0x10: "invalidate (miss)", 0x10: "invalidate (miss)",
0x1A: "invalidate (hit, version)", 0x1A: "invalidate (hit, version)",
0x1C: "invalidate (hit, saving non-current)", 0x1C: "invalidate (hit, saving non-current)",
# 0x1E can occur during startup verification.
0x1E: "invalidate (hit, discarding current or non-current)",
0x20: "load (miss)", 0x20: "load (miss)",
0x22: "load (hit)", 0x22: "load (hit)",
......
...@@ -386,7 +386,7 @@ class ConnectionTests(CommonSetupTearDown): ...@@ -386,7 +386,7 @@ class ConnectionTests(CommonSetupTearDown):
# objects that weren't in the _seriald, because the client was # objects that weren't in the _seriald, because the client was
# interrupted waiting for tpc_vote() to return. When the next # interrupted waiting for tpc_vote() to return. When the next
# transaction committed, it tried to do something with the # transaction committed, it tried to do something with the
# bogus _tbuf entries. The exaplanation is wrong/incomplete, # bogus _tbuf entries. The explanation is wrong/incomplete,
# because tpc_begin() should clear the _tbuf. # because tpc_begin() should clear the _tbuf.
# 2003-01-15T15:44:19 ERROR(200) ZODB A storage error occurred # 2003-01-15T15:44:19 ERROR(200) ZODB A storage error occurred
...@@ -465,6 +465,65 @@ class ConnectionTests(CommonSetupTearDown): ...@@ -465,6 +465,65 @@ class ConnectionTests(CommonSetupTearDown):
self.assertRaises(ClientDisconnected, self._storage.load, oid1, '') self.assertRaises(ClientDisconnected, self._storage.load, oid1, '')
self._storage.close() self._storage.close()
def checkVerificationInvalidationPersists(self):
# This tests a subtle invalidation bug from ZODB 3.3:
# invalidations processed as part of ZEO cache verification acted
# kinda OK wrt the in-memory cache structures, but had no effect
# on the cache file. So opening the file cache again could
# incorrectly believe that a previously invalidated object was
# still current. This takes some effort to set up.
# First, using a persistent cache ('test'), create an object
# MinPO(13). We used to see this again at the end of this test,
# despite that we modify it, and despite that it gets invalidated
# in 'test', before the end.
self._storage = self.openClientStorage('test')
oid = self._storage.new_oid()
obj = MinPO(13)
self._dostore(oid, data=obj)
self._storage.close()
# Now modify obj via a temp connection. `test` won't learn about
# this until we open a connection using `test` again.
self._storage = self.openClientStorage()
pickle, rev = self._storage.load(oid, '')
newobj = zodb_unpickle(pickle)
self.assertEqual(newobj, obj)
newobj.value = 42 # .value *should* be 42 forever after now, not 13
self._dostore(oid, data=newobj, revid=rev)
self._storage.close()
# Open 'test' again. `oid` in this cache should be (and is)
# invalidated during cache verification. The bug was that it
# got invalidated (kinda) in memory, but not in the cache file.
self._storage = self.openClientStorage('test')
# The invalidation happened already. Now create and store a new
# object before closing this storage: this is so `test` believes
# it's seen transactions beyond the one that invalidated `oid`, so
# that the *next* time we open `test` it doesn't process another
# invalidation for `oid`. It's also important that we not try to
# load `oid` now: because it's been (kinda) invalidated in the
# cache's memory structures, loading it now would fetch the
# current revision from the server, thus hiding the bug.
obj2 = MinPO(666)
oid2 = self._storage.new_oid()
self._dostore(oid2, data=obj2)
self._storage.close()
# Finally, open `test` again and load `oid`. `test` believes
# it's beyond the transaction that modified `oid`, so its view
# of whether it has an up-to-date `oid` comes solely from the disk
# file, unaffected by cache verification.
self._storage = self.openClientStorage('test')
pickle, rev = self._storage.load(oid, '')
newobj_copy = zodb_unpickle(pickle)
# This used to fail, with
# AssertionError: MinPO(13) != MinPO(42)
# That is, `test` retained a stale revision of the object on disk.
self.assertEqual(newobj_copy, newobj)
self._storage.close()
def checkReconnection(self): def checkReconnection(self):
# Check that the client reconnects when a server restarts. # Check that the client reconnects when a server restarts.
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment