Commit 9ca50f15 authored by Kirill Smelkov

ZApplication: Do not let Cleanup references be held anywhere except REQUEST

Cleanup.__del__() calls transaction.abort(). transaction.abort()
implicitly uses the thread-local transaction manager. The __del__ method is
called when the last reference to an object is dropped. If there are no
object cycles this is more or less controlled.

However, if an object with a __del__ method is held by other objects
forming a cycle, __del__ will be called only eventually, whenever that
cycle is collected by the GC. The garbage collection can run from
*whatever* thread happened to trigger it.

If we leave the `cleanup` local variable in place, a reference to it will be
held in the frame's .f_locals, and if e.g. an exception is raised and even
caught, a cycle will be created:

            exc_traceback
              |        ^
              |        |
              v     .f_localsplus
             frame

This is exactly the situation outlined above - a cycle of objects which
will be freed only by the GC, and one of those objects (.f_localsplus)
references `cleanup`. This way Cleanup.__del__ will be called not strictly
when REQUEST goes away but possibly later, if the cycle stays alive longer
(thus potentially aborting a different, later transaction), and/or from
another thread (thus aborting an unrelated transaction).
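
For illustration, a minimal sketch of this scenario (not part of the commit;
a trivial stand-in Cleanup class is used instead of the real one):

    import gc, sys

    class Cleanup:                       # trivial stand-in for App.ZApplication.Cleanup
        def __del__(self):
            print('Cleanup.__del__ called')

    def handler():
        cleanup = Cleanup()              # sits in the frame's f_locals
        try:
            raise RuntimeError('boom')
        except RuntimeError:
            tb = sys.exc_info()[2]       # traceback -> frame -> f_locals -> traceback: a cycle

    gc.disable()                         # emulate "cyclic GC runs later, possibly from elsewhere"
    handler()
    print('handler() returned; __del__ was not called yet')
    gc.collect()                         # only now the cycle - and with it cleanup - is freed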

The case was caught via analyzing strange messages in Zope logs on a
Wendelin system:

	Exception KeyError: (140574709077840,) in <bound method Cleanup.__del__ of <App.ZApplication.Cleanup instance at 0x7fd9f22f6830>> ignored

This turned out to be Wendelin.core's _ZBigFileH.on_connection_close()
trying to unregister itself from the transaction manager it had previously
registered with on ZODB connection open:

	https://lab.nexedi.com/nexedi/wendelin.core/blob/87bf4908/bigfile/file_zodb.py#L621
	https://lab.nexedi.com/nexedi/wendelin.core/blob/87bf4908/bigfile/file_zodb.py#L608

the transaction_manager being ThreadTransactionManager:

	https://github.com/zopefoundation/transaction/blob/bd26c334/transaction/__init__.py#L23
	https://github.com/zopefoundation/transaction/blob/bd26c334/transaction/_manager.py#L122

which implicitly keeps its instance variables in thread-local storage
(i.e. self.var is different for every thread), and so
transaction_manager.unregisterSynch() was removing the _ZBigFileH instance
from a set different from the one it was originally registered in:

	https://github.com/zopefoundation/transaction/blob/bd26c334/transaction/_manager.py#L79

hence the error message, not to mention the further inevitable
desynchronization between the wendelin.core transaction manager

	https://lab.nexedi.com/nexedi/wendelin.core/blob/87bf4908/bigfile/file_zodb.py#L548

and the ZODB transaction manager, which can lead to all kinds of corruption.
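
The effect can be modelled in a few lines (a simplified sketch using
threading.local directly, not the real transaction package):

    import threading

    class ThreadLocalManager(threading.local):
        def __init__(self):
            self.synchs = set()          # a fresh, empty set for every thread

    tm = ThreadLocalManager()
    tm.synchs.add('synch-A')             # "registered" from the request thread

    def unregister():
        tm.synchs.discard('synch-A')     # touches only *this* thread's set

    t = threading.Thread(target=unregister)
    t.start(); t.join()

    print(tm.synchs)                     # 'synch-A' is still there - nothing was removed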

----

So do not let the `cleanup` variable stay in the frame. This is a simple
but not optimal fix. A good solution would avoid using __del__ at all
and track request lifetime explicitly via regular programming logic.
parent 073b8a18
@@ -49,6 +49,13 @@ class ZApplicationWrapper:
# arrange for the connection to be closed when the request goes away
cleanup = Cleanup(conn)
REQUEST._hold(cleanup)
# do not let `cleanup` stay referenced by local frame.
# this could be dangerous if the frame is held in any traceback
# collected e.g. for exceptions raised and caught in ^^^ hooks, because
# if so cleanup lifetime will be different from REQUEST lifetime and
# thus cleanup will be called at arbitrary time from under GC from
# arbitrary thread.
del cleanup
conn.setDebugInfo(REQUEST.environ, REQUEST.other)
  • /cc @nexedi

  • I'd like to add that the bug is more critical for Wendelin systems than for regular ERP5 or Zope, since in Wendelin systems wendelin.core has to implement its own transaction manager logic and carefully synchronize it with the ZODB transaction management machinery.

  • @kirr : Very nice low-level debugging and clear analysis, as usual :) .

    A __del__ method being (de facto made) responsible for transaction control is totally frightening to me. I believe this should also be removed. Transaction control on a "normal" instance (i.e., not unit tests) must be the exclusive task of ZPublisher and (until it someday gets rewritten to go through ZPublisher for every single transaction) CMFActivity. Nothing else should ever get close to the "control transaction boundaries" side of the transaction module.

    Would it make sense to reverse the relation ? Like, trigger cleanup in a before (or after) commit hook ? I would expect the ZODB connection to be kept alive merely because of every single persistent object loaded in any frame of about every script run during the transaction keeping a reference to the connection (_p_jar), but I may be missing something. Also, I would expect ZODB.Connection, by registering itself to the transaction manager, to be also kept alive by that path until transaction end, which must ultimately happen in ZPublisher.
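
    For concreteness, a sketch only (not a proposed patch) of what such a hook could look like with the transaction API; `conn` stands for the ZODB connection opened for the request:

    import transaction

    def make_close_hook(conn):
        def after_commit(status):        # status is True if the commit succeeded
            conn.close()
        return after_commit

    # somewhere after db.open(), instead of holding a Cleanup in REQUEST:
    #   transaction.get().addAfterCommitHook(make_close_hook(conn))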

    About the fix, what about collapsing it all into a single statement so the local does not exist and hence does not need to be deleted ?

    REQUEST._hold(Cleanup(conn))

    It does remove the natural place to document the change in the code, but it is already well documented in the commit message. It is admittedly one of my OCDs, but I dislike such single-use variables very much as they just load the (mental) namespace for no good reason, even when there is no GC trickery around. Fewer statements FTW !

  • We all 3 had the idea of collapsing into a single statement. But in any case, a comment is required in the code to make sure nobody reverts it. Commit messages are obviously not a place for warnings.

  • No comment is needed if Cleanup.__del__ stops aborting transactions...

    And we cannot reasonably put all the "don't"s everywhere in the code, so I do not think this warning is critical enough that we should keep the code over-complex (...by some ridiculously low amount of extra complexity, granted) just for the sake of having the comment in the code.

  • @vpelletier, @jm thanks for feedback.

    A __del__ method being (de facto made) responsible for transaction control is totally frightening to me. I believe this should also be removed.

    Yes, this is what I was saying with "This is a simple but not optimal fix. A good solution would avoid using __del__ at all and track request lifetime explicitly via regular programming logic." I agree we should look into properly fixing this, probably, as you say, in ZPublisher.

    Would it make sense to reverse the relation ? Like, trigger cleanup in a before (or after) commit hook ?

    Yes, I would program it this way: after the request is handled without an error the transaction is committed, or if there is an error the transaction is aborted. Then the database connection is "closed" (see below).

    I would expect the ZODB connection to be kept alive merely because of every single persistent object loaded in any frame of about every script run during the transaction keeping a reference to the connection (_p_jar), but I may be missing something.

    When a connection is closed it is not destroyed but returned to the database connection pool. Its cache is also retained for the next requests to use:

    https://github.com/zopefoundation/ZODB/blob/3b700be9/src/ZODB/interfaces.py#L186
    https://github.com/zopefoundation/ZODB/blob/3b700be9/src/ZODB/Connection.py#L279
    https://github.com/zopefoundation/ZODB/blob/3b700be9/src/ZODB/Connection.py#L325

    After a transaction is committed or aborted there cannot be live objects left in a dirty state - they all become synchronized to the database state, either to the newly committed transaction or to the previous transaction in case of abort. This way all live objects are effectively cache. The ZODB cache keeps them (and garbage-collects them from time to time), but if a reference to an object stays live from outside after the transaction ends, it has to be understood that it is now a reference into a cache, and that modifying an object via such a reference is playing dirty games with cache internals, thus potentially affecting unrelated transactions. In other words, it is invalid to modify objects, references to which were obtained in a transaction, after that transaction's lifetime ends.
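
    A tiny sketch of the pooling behaviour (illustration only, with an in-memory MappingStorage rather than a real setup):

    from ZODB import DB
    from ZODB.MappingStorage import MappingStorage

    db = DB(MappingStorage())
    conn = db.open()
    conn.close()              # not destroyed: the connection goes back to db's pool
    conn2 = db.open()
    print(conn2 is conn)      # typically True - the same Connection object, cache and all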

    Also, I would expect ZODB.Connection, by registering itself to the transaction manager, to be also kept alive by that path until transaction end, which must ultimately happen in ZPublisher.

    Transaction manager keeps registered synchronizers in a weakset:

    https://github.com/zopefoundation/transaction/blob/1.1.1/transaction/_manager.py#L76
    https://github.com/zopefoundation/transaction/blob/1.1.1/transaction/_manager.py#L56
    https://github.com/zopefoundation/ZODB/blob/3b700be9/src/ZODB/Connection.py#L925

    so one cannot rely on just the transaction manager keeping a reference to the registered connection.
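
    Quick illustration (simplified; a plain WeakSet standing in for the manager's synchronizer set):

    from weakref import WeakSet
    import gc

    class Synch(object):
        pass

    synchs = WeakSet()
    s = Synch()
    synchs.add(s)
    print(len(synchs))        # 1 - registered
    del s                     # nothing else keeps the synchronizer alive...
    gc.collect()
    print(len(synchs))        # ...so it silently disappears from the weakset: 0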

    It is imho better to explicitly program the logic: request comes; open db connection; handle request; transaction commit or abort; close connection.


    I would agree to collapsing the Cleanup call into REQUEST._hold, with the comments kept. However I think this is minor and we had better fix Zope to serve requests without __del__ at all.

  • probably as you say in ZPublisher.

    What I meant is that ZPublisher is already handling transaction boundaries: it is where exceptions are caught so the error page is rendered, which also provides natural locations for abort & commit.

    When connection is closed it is not destroyed

    Oh right, that code was not about holding a strong reference to Connection, but merely using weak references to control something else (transaction lifetime).

    Transaction manager keeps registered synchronizers in a weakset:

    Mmh this is a bit weird to me. I mean, if a transactional adapter expressed interest in current transaction, I see this as a strong expectation that transaction will actually do stuff on that connector, and it may not be trivial to guarantee the synchronizer will be kept alive in another way. Maybe there is a separate strong-relation reference chain elsewhere ? Anyway, this may be drifting too far from the original topic.

  • For reference - with the following patch applied on one system:

    --- ZApplication.py.kirr        2017-06-14 14:46:30.000000000 +0200
    +++ ZApplication.py     2017-08-20 15:47:08.231696334 +0200
    @@ -49,6 +49,14 @@
             # arrange for the connection to be closed when the request goes away
             cleanup = Cleanup(conn)
             REQUEST._hold(cleanup)
    +        # https://lab.nexedi.com/kirr/Zope/commit/9ca50f15
    +        # do not let `cleanup` stay referenced by local frame.
    +        # this could be dangerous if the frame is held in any traceback
    +        # collected e.g. for exceptions raised and caught in ^^^ hooks, because
    +        # if so cleanup lifetime will be different from REQUEST lifetime and
    +        # thus cleanup will be called at arbitrary time from under GC from
    +        # arbitrary thread.
    +        del cleanup
     
             conn.setDebugInfo(REQUEST.environ, REQUEST.other)
     
    @@ -75,10 +83,25 @@
             return connection.root()[aname]
     
     
    +import traceback
    +from thread import get_ident as gettid
    +import time
    +import sys
     class Cleanup:
         def __init__(self, jar):
    +        self._tid = gettid()
    +        self._created = time.time()
             self._jar = jar
     
         def __del__(self):
    -        transaction.abort()
    -        self._jar.close()
    +        tid = gettid()
    +        if tid != self._tid:
    +             now = time.time()
    +             print >>sys.stderr, "E: %s: created from thread %s @%s; __del__ runs from thread %s @%s" \
    +                % (self, self._tid, time.ctime(self._created), tid, time.ctime(now))
    +        try:
    +            transaction.abort()
    +            self._jar.close()
    +        except:
    +            traceback.print_exc()
    +            raise

    I still see entries in log like this:

    ==> slappart8/.slappart8_zope-2.log <==
    E: <App.ZApplication.Cleanup instance at 0x7f1ca5237c68>: created from thread 139761418483456 @Tue Aug 29 04:35:38 2017; __del__ runs from thread 139761443661568 @Tue Aug 29 04:35:42 2017

    in other words Cleanup is still being run from a thread other than the one that created it, and oops - because the transaction module is implicitly thread-local, Cleanup wrongly aborts the transaction of another thread.

    I can imagine explaining it this way: even if Cleanup is held only by REQUEST, the REQUEST itself could participate in a cycle and thus won't necessarily be destroyed when the main ZPublisher logic drops its reference to it. So if REQUEST stays alive in a cycle until GC is run, and if GC is triggered by another thread, then Cleanup.__del__ will be called by that other thread.

    So the only reliable fix is to stop using __del__ at all.

  • Another data point:

    On a project we started to see the following regular errors on one of the zopes from the activities zope group:

    2018-07-09 08:28:49,472 ERROR ZODB.Connection Shouldn't load state for 0xf7b1 when the connection is closed
    ------
    2018-07-09 08:28:49,472 ERROR ZServerPublisher exception caught
    Traceback (most recent call last):
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
       response=b)
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
       environ, debug, request, response)
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
       return publish_module_standard.original(*args, **kw)
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 276, in publish_module_standard
       if request is not None: request.close()
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 220, in close
       notify(EndRequestEvent(None, self))
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/zope.event-3.5.2-py2.7.egg/zope/event/__init__.py", line 31, in notify
       subscriber(event)
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/zope.component-3.9.5-py2.7.egg/zope/component/event.py", line 24, in dispatch
       zope.component.subscribers(event, None)
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/zope.component-3.9.5-py2.7.egg/zope/component/_api.py", line 136, in subscribers
       return sitemanager.subscribers(objects, interface)
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 859, in setstate
       raise ConnectionStateError(msg)
    ConnectionStateError: Shouldn't load state for 0xf7b1 when the connection is closed 

    Collecting all such objects for which state could not be loaded because the connection was closed showed they are:

    # grep  "Shouldn't load state for" *event* | awk '{print $9}' |sort |uniq -c
          3
          1 0x00
        128 0x01
         77 0x01e8
         20 0x03
        200 0xf7b0
          9 0xf87b

    and checking via zodb catobj showed that:

    0x00     is a root object
    0x01     = OFS.Application.Application
    0x03     = App.ApplicationManager.ApplicationManager
    0x01e8   = Products.TimerService.TimerService
    0xf87b   = erp5.portal_type.Activity Tool

    in other words all problematic objects were somehow very global.

    Further inspection showed these to be errors from the health checker trying to access / every second. Curl confirmed that:

    [12:33:15] <kirr> (venv.1) (56ad0b3e7929159eb04546ac668fd1e9) slapuser8@micfix01:~/srv/runner/instance/slappart8/var/log$ curl -v 10.0.208.122:2221
    * Rebuilt URL to: 10.0.208.122:2221/
    *   Trying 10.0.208.122...
    * Connected to 10.0.208.122 (10.0.208.122) port 2221 (#0)
    > GET / HTTP/1.1
    > Host: 10.0.208.122:2221
    > User-Agent: curl/7.47.0
    > Accept: */*
    > 
    < HTTP/1.1 500 Internal Server Error
    < Server: Zope/(2.13.24, python 2.7.12, linux2) ZServer/1.1
    < Date: Mon, 09 Jul 2018 09:32:48 GMT
    < Bobo-Exception-Line: 859
    < Content-Length: 1157
    < Bobo-Exception-Value: See the server error log for details
    < Content-Type: text/html; charset=utf-8
    < Bobo-Exception-Type: <class 'ZODB.POSException.ConnectionStateError'>
    < Bobo-Exception-File: Connection.py
    < 
     <h2>Site Error</h2>
     <p>An error was encountered while publishing this resource.
     </p>
     <p><strong><class 'ZODB.POSException.ConnectionStateError'></strong></p>
    
     Sorry, a site error occurred.<p><p>Traceback (innermost last):</p>
    <ul>
    
    <li>  Module ZPublisher.Publish, line 247, in publish_module_standard<br />
       setDefaultSkin(request)</li>
    
    <li>  Module zope.publisher.skinnable, line 46, in setDefaultSkin<br />
       adapters = zope.component.getSiteManager().adapters</li>
    
    <li>  Module ZODB.Connection, line 859, in setstate<br />
       raise ConnectionStateError(msg)</li>
    
    </ul><p>ConnectionStateError: Shouldn't load state for 0xf7b1 when the connection is closed
    </p></p>
     <hr noshade="noshade"/>
    
     <p>Troubleshooting Suggestions</p>
    
     <ul>
     <li>The URL may be incorrect.</li>
     <li>The parameters passed to this resource may be incorrect.</li>
     <li>A resource that this resource relies on may be
         encountering an error.</li>
     </ul>
    
     <p>For more detailed information about the error, please
     refer to the error log.
     </p>
    
     <p>If the error persists please contact the site maintainer.
     Thank you for your patience.
    * Connection #0 to host 10.0.208.122 left intact

    It was also observed that liveness checks for that particular zope started to return 500 at

    zope-1-Z2.log:10.0.208.122 - Anonymous [08/Jul/2018:18:48:30 +0000] "GET / HTTP/1.0" 500 0 "" ""

    and after checking .slappart8_*.log it was discovered that even though

    Exception KeyError: (140574709077840,) in <bound method Cleanup.__del__ of <App.ZApplication.Cleanup instance at 0x7fd9f22f6830>> ignored

    is regularly seen for all zopes, for this particular zope there was something new:

    Exception AssertionError: AssertionError() in <bound method Cleanup.__del__ of <App.ZApplication.Cleanup instance at 0x7fe62b4b2098>> ignored

    This error was at the end of .slappart8_zope-1.log and the last mtime of that file corresponded to the time when 500 errors started to appear in zope-1-Z2.log.

    Inspecting Cleanup code shows:

    class Cleanup:
       def __init__(self, jar):
           self._jar = jar
    
       def __del__(self):
           transaction.abort()
           self._jar.close()

    that Cleanup.__del__ tries to abort the current transaction and close the ZODB connection initially pinned to it. Unfortunately there is no traceback printed for the AssertionError since it happened while the finalizer was running. Anyway, what can be said is that because Cleanup.__del__ was run either a) from a wrong thread, or b) too late, some assert and desynchronization in ZODB happened, and that prevented it from working correctly afterwards.

    For reference, inspecting all threads of that zope showed nothing abnormal:

    Threads traceback dump at 2018-07-09 09:55:44
    
    Thread 140626557400832:
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 25, in __init__
       name, a, b=accept()
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZRendezvous.py", line 73, in accept
       l.acquire()
    
    Thread 140626635888384:
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 774, in __bootstrap
       self.__bootstrap_inner()
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 801, in __bootstrap_inner
       self.run()
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/monitor.py", line 46, in run
       while not select(rlist, (), (), t)[0] or os.read(r, 8):
    
    Thread 140626878301952:
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/timerserver-2.0.2-py2.7.egg/timerserver/TimerServer.py", line 93, in run
       time.sleep(interval)
    
    Thread 140626669999872:
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 774, in __bootstrap
       self.__bootstrap_inner()
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 801, in __bootstrap_inner
       self.run()
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 754, in run
       self.__target(*self.__args, **self.__kwargs)
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/neoppod-repository/neo/lib/threaded_app.py", line 81, in run
       self._run()
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/neoppod-repository/neo/lib/threaded_app.py", line 91, in _run
       poll(1)
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/neoppod-repository/neo/lib/event.py", line 155, in poll
       self._poll(blocking)
     File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/neoppod-repository/neo/lib/event.py", line 202, in _poll
       event_list = self.epoll.poll(blocking)
    
    End of dump

    and after that zope was restarted the problem went away.

    Once again: the only reliable fix is to stop using __del__ at all.

    /cc @klaus, @Tyagov

  • @kirr, what prevents this change from being pushed ?

  • @Tyagov, as 9ca50f15 (comment 38864) says, even if we add an explicit del cleanup it won't prevent problems, and Cleanup.__del__ can still be, and in practice is, called from a wrong context. The only solution is to rework Zope further so that it does not depend on the finalizer at all. There is no patch for this yet.

  • I would expect the ZODB connection to be kept alive merely because of every single persistent object loaded in any frame of about every script run during the transaction keeping a reference to the connection (_p_jar), but I may be missing something.

    Because of ZODB#185, which I hope to get fixed one day, we currently want Connections to be closed once the zope worker has finished with a request. I agree this generates overhead but the only solution to avoid memory leaks would be to also process invalidations immediately for open Connections that are outside transactions. I didn't think about this before but that looks doable now that Connections can require explicit calls to transaction.begin (transaction >= 2.1.0).

    publish() is actually agnostic about accessed databases. We're somewhat biased because in our case, we always open the same DB. For the moment, I don't see what I could change in this method or in publish_module_standard (transactions are already begun/committed/aborted when needed, and the request is explicitly closed at the end). We could say that the whole ZPublisher is not optimised for our use.

    Like, trigger cleanup in a before (or after) commit hook ?

    What about aborts ? Or multiple commits within a request ? I know the latter is questionable (to the point we often wonder what's the difference between transactions and requests, and where to cache things - transactional cache or request ?), but there's actually a good example: process_timer (in particular CMFActivity). With what you suggest, we're likely to clean up things too early, or never.

  • I thought about using zope components, by making ZApplicationWrapper clean up things during EndRequestEvent (see BaseRequest.close), but there would be no guarantee that this clean up is processed last (as written in a comment, subscribers might need the zodb).

    Another thing is that Cleanup seems to do more than just cleanup. It also keeps the Connection open during the request: that's not done by transactions (which only have weak refs) and it's not clear whether traversal guarantees to keep references to persistent objects (either in locals or in the request). BTW, since Connection does not implement __del__, a Connection disappears without returning to the DB pool if all its referrers are deleted.

    Here is the best idea I have for the moment (not tested):

    diff --git a/src/App/ZApplication.py b/src/App/ZApplication.py
    index 1f8c785..7e3c5d1 100644
    --- a/src/App/ZApplication.py
    +++ b/src/App/ZApplication.py
    @@ -47,8 +47,11 @@ def __bobo_traverse__(self, REQUEST=None, name=None):
                     hook(conn)
     
             # arrange for the connection to be closed when the request goes away
    -        cleanup = Cleanup(conn)
    -        REQUEST._hold(cleanup)
    +        request_clear = REQUEST.clear
    +        def clear():
    +            request_clear()
    +            conn.close()
    +        REQUEST.clear = clear
     
             conn.setDebugInfo(REQUEST.environ, REQUEST.other)
     
    @@ -73,12 +76,3 @@ def __call__(self, connection=None):
                 connection=db.open(connection)
     
             return connection.root()[aname]
    -
    -
    -class Cleanup:
    -    def __init__(self, jar):
    -        self._jar = jar
    -
    -    def __del__(self):
    -        transaction.abort()
    -        self._jar.close()
  • I can imagine explaining it this way: even if Cleanup is held only by REQUEST, the REQUEST itself could participate in a cycle and thus won't necessarily be destroyed when the main ZPublisher logic drops its reference to it. So if REQUEST stays alive in a cycle until GC is run, and if GC is triggered by another thread, then Cleanup.__del__ will be called by that other thread.

    What remains obscure to me is that ZPublisher does close the request explicitly. How is it possible that self._held=None (in BaseRequest.clear) would not trigger Cleanup.__del__ synchronously ? Something may be wrong with nexedi/erp5@bc120f55.

  • How is it possible that self._held=None (in BaseRequest.clear) would not trigger Cleanup.__del__ synchronously ?

    Why would it be guaranteed to be synchronous ?

    __del__ is known to be an obscure mechanism in python, and in my experience it should be seen as a hook into the GC cleanup machinery rather than a regular destructor/finaliser for instances. And AFAIK it is inherently asynchronous.

    For example, I have seen __del__ be called despite the object still being referenced (although I do not remember now if it was during interpreter shutdown, which is yet another case triggering more counter-intuitive effects): a.foo = b, with b.__del__ being called before a.__del__. This was not even part of a reference loop resolution. The answer I was given when asking about this behaviour was along the lines of "use context managers, they are the only lifetime management mechanism available in python which works deterministically and reliably" (a small sketch follows at the end of this comment). Aside from writing our own try..finally blocks, of course, but that's just more verbose for AFAIK no gain.

    I think relying on an exact __del__ call sequence could legitimately make our code non-portable to other python implementations (thinking of pypy, as it has a radically different approach to garbage collection than cPython).
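
    To illustrate (a sketch only; `db` stands for a ZODB.DB-like object): cleanup through a context manager runs synchronously, on the same thread, with no GC involved:

    from contextlib import contextmanager

    @contextmanager
    def opened(db):
        conn = db.open()
        try:
            yield conn
        finally:
            conn.close()      # always runs right here, exception or not

    # usage:
    # with opened(db) as conn:
    #     handle_request(conn)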

  • My point is not to defend __del__; I know very well it's wrong in this case at least because it's incompatible with PyPy. But we're not going to use Zope 2.13 with PyPy, nor fix coding crimes in Zope if they don't actually affect us. I also spent time in NEO removing many of them.

    Why would it be guaranteed to be synchronous ?

    Because of refcounting. GC only plays a role in case of cycles and before CPython 3.4, it ignores objects with a __del__ method. The second paragraph of this commit message is wrong.

    I read the CPython code (not trivial so I might be wrong) and it matches what we can read e.g. at https://rushter.com/blog/python-garbage-collector/. tp_dealloc is called immediately when the refcount reaches 0, which in turn calls __del__ and deallocates everything (leaving aside specific cases like weak refs, resurrection, etc).

    For this commit to do anything, we should also make sure that an exception actually happened, so that a ref to cleanup is kept.

    For me __del__ is predictable and I just want to understand the bug.

  • Thanks for explaining, I understand your objective better now.

  • Too bad that the tracebacks we have on Prod don't show what calls __del__. But the log added by Kirill shows that it's called from the LongRequestLogger thread. I see nothing there about request._held or request.__dict__, but maybe manipulating frames is enough.

    Quick check:

    import sys, threading, time
    
    class A(object):
        def __del__(self):
            print('__del__')
    
    def f():
        x = A()
        time.sleep(.2)
        #del x
    def g():
        f()
        print('stopped')
    g()
    
    t = threading.Thread(target=g)
    t.start()
    time.sleep(.1)
    x = sys._current_frames()[t.ident]
    t.join()
    print('end')
    del x

    displays:

    __del__
    stopped
    stopped
    end
    __del__

    and if the del statement is uncommented:

    __del__
    stopped
    __del__
    stopped
    end

    I see only 1 place where frames could keep a reference to the cleanup object: BaseRequest._hold. Actually, Kirill's patch handles ZApplication.__bobo_traverse__ but misses callees. The remaining problem with this hypothesis is that the logger should not attempt to dump a request so early.

    Maybe Kirill has more information, so I'll wait until he's back.

    @klaus @Tyagov Meanwhile, the patch I suggest above should work, and you can either apply it in this form or as nexedi/erp5@879a8a62. Alternatives are:

    • if you're curious, add del object at the end of BaseRequest._hold (in addition to Kirill's patch)
    • disable the long request logger
    • if you're curious, add del object at the end of BaseRequest._hold (in addition to Kirill's patch)

    That's actually a good idea for 2 reasons:

    • that may confirm my analysis
    • we're sure there's no regression with such change

    About nexedi/erp5@879a8a62, a test result is completed at https://nexedi.erp5.net/test_result_module/20180821-55EF02 (apparently successfully, it's never obvious with some functional tests).

  • @jm, @vpelletier, nexedi/erp5@bc120f55 indeed seems suspicious: at first glance my_self.REQUEST.clear() or my_self.REQUEST.close() seems to be missing after we hand over held objects to the original self.REQUEST. This way, e.g. after self.REQUEST is explicitly closed by ZPublisher, if say my_self were still held by a cycle, it could be released only at some future GC run that collects that cycle.

    @jm, I suggest not to guess, and to simply add to BaseRequest.clear a verification of the reference count of the released Cleanup object. And if it is not what is expected - print information about who is additionally holding the object via gc.get_referrers (a rough sketch follows below). This way, next time the bug happens we will automatically get information about who is holding Cleanup. That would help to write a test case for the bug.

    I think the LongRequestLogger is unrelated - it is only the place where the user-visible bug happens, but the cause of the bug happened before and was due to transaction.abort() being incorrectly called from a wrong context. In other words the LongRequestLogger traceback is only a consequence of the bug.
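
    Something along these lines (a rough sketch of the idea, names hypothetical, not a tested patch):

    import gc, sys

    def report_referrers(obj, expected=3):
        # sys.getrefcount() includes its own argument and this frame's `obj`,
        # plus (typically) one reference at the call site - hence expected=3.
        n = sys.getrefcount(obj)
        if n > expected:
            sys.stderr.write('E: %r has %d references, expected <= %d\n' % (obj, n, expected))
            for referrer in gc.get_referrers(obj):
                sys.stderr.write('E:   referenced by %r\n' % (referrer,))

    # idea: call report_referrers(cleanup) from BaseRequest.clear right after
    # self._held is dropped - next time the bug happens the log will tell us
    # who else keeps the Cleanup instance alive (the calling frame itself will
    # also show up among the referrers and can be ignored).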

    The second paragraph of this commit message is wrong.

    I still think it is right, as the following testcase demonstrates, where __del__ is called by thread T2 because the object cycle that references the finalized object is garbage-collected by that second thread, not by the original T1 where the finalized object was created:

    #!/usr/bin/env python
    # demonstrate that __del__ might be called from whatever thread which causes cyclic GC
    
    from __future__ import print_function
    
    from golang import go, chan
    from six.moves._thread import get_ident as gettid
    import gc, sys
    
    # class without finalizer (objects with finalizers if cycled are not collected)
    class X:
        pass
    
    # report from __del__ which thread ran it
    class DelCall:
        def __del__(self):
            print('__del__ called under thread', gettid())
    
    
    # mkgarbagecycle creates object cycle, so that only cyclic GC can free them.
    def mkgarbagecycle():
        # cycle
        a, b = X(), X()
        a.b, b.a = b, a
        return a
    
    
    def main():
        ready = chan()      # t1 -> t2
        done  = chan()      # t1 <- t2
    
        # second thread
        def T2():
            ready.recv()    # wait till first thread has done its work
    
            print('T2:', gettid())
    
            # perform GC.
            # this simulates a possible scenario as if cyclic GC was automatically triggered by T2 thread.
            gc.collect()
    
            done.close()
    
        go(T2)
    
    
        # first thread
        print(sys.version)
        print('T1:', gettid())
    
        # disable automatic GC; now only objects whose reference count explicitly goes to 0 are deallocated.
        # objects which are kept in cycles stay alive.
        # this simulates a possible scenario where cyclic GC would be triggered by the T2 thread.
        gc.disable()
    
        # create cyclic garbage referencing object with __del__
        a = mkgarbagecycle()
        a.held = DelCall()
        del a
    
        ready.close()
        done.recv()
    
    
    if __name__ == '__main__':
        main()
    (py2.venv) kirr@deco:~/tmp/trashme/gcdel$ ./gcdel.py 
    2.7.15 (default, Jul 28 2018, 11:29:29) 
    [GCC 8.1.0]
    T1: 139630321579840
    T2: 139630308894464
    __del__ called under thread 139630308894464
    (py3.venv) kirr@deco:~/tmp/trashme/gcdel$ ./gcdel.py 
    3.6.6 (default, Jun 27 2018, 14:44:17) 
    [GCC 8.1.0]
    T1: 140348348168000
    T2: 140348332013312
    __del__ called under thread 140348332013312
  • nexedi/erp5@bc120f55 indeed seems to be suspicious: ...

    But unrelated to the current issue.

    In other words LongRequest logger traceback is only a consequence of the bug.

    No. It could really be what triggers the bug, by keeping a ref to the Cleanup instance.

    I still think it is right, as the following testcase demonstrates, where __del__ is called by thread T2 because the object cycle that references the finalized object is garbage-collected by that second thread, not by the original T1 where the finalized object was created:

    I already know what your testcase shows. And I misread the second paragraph of the commit message.

    Your hypothesis is rather that a cycle is formed by an exception. But that's valid only if it happens in ZApplication.__bobo_traverse__ (or a callee) while the cleanup variable is defined. I don't see what could raise an exception there. An example ?

    @klaus @Tyagov Do not wait for this discussion to end. First disable the long request logger on activity nodes. This is anyway a good idea because this logger is currently incompatible with CMFActivity, in that it produces useless logs.

  • @jm, I don't have a ready answer, but I already said what I would do to automatically determine who is holding an additional reference to Cleanup. Maybe that is due to an exception, maybe not. Maybe LongRequestLogger is related (but I do not see how after a quick check; maybe my fault), but the usefulness of LongRequestLogger is high, especially on activity nodes. Thus I would suggest not disabling it blindly without a clear understanding of how/if it is faulty and whether we can fix it.

  • the usefulness of LongRequestLogger is high, especially on activity nodes

    Sure, a working LongRequestLogger would be useful on activity nodes. But what happens currently is that process_timer processes activities in a loop as long as there are activities to process, whereas LongRequestLogger starts the timeout at the beginning of process_timer, rather than at the beginning of each activity. So as soon as you have more than a few validated activities, LongRequestLogger can dump a lot even if each activity takes less than the configured timeout.

    Does someone really look at long request logs of activity nodes ? I'd like to but: these are currently huge, with much noise, and no way to make stats from them. On another project, it was the first cause of downtime by filling the disk.

    Thus I would suggest not disabling it blindly without a clear understanding of how/if it is faulty and whether we can fix it.

    But we have logs reporting that it happens ~10 times per day. Applying any of my suggestions won't give all answers but we'd still learn something. They're safe, quick to test and I have good hope they work.

    So what is the mysterious ref to the Cleanup instance, even after the unconditional call to BaseRequest.close in publish_module_standard ? I hoped you might have answers from a previous debugging session, but no. If you want to add debug information to BaseRequest.clear, go ahead.

  • Julien, I'm assuming the assignment for this issue was passed to you; that's why I'm not debugging it further myself. Please let me know if that is not the case.

  • Your hypothesis is rather that a cycle is formed by an exception.

    Is a stack trace any different from an exception for this purpose ?

    At a quick glance at DeadlockDebugger/dumper.py I see that frame does not get deleted before dump_threads returns. Values extracted from non-DeadlockDebugger threads are kept alive as well: co, code and request (!). If DeadlockDebugger's own thread is the last one returned by _current_frames, then it will create a cycle with itself and keep the above variables from the last non-DeadlockDebugger thread alive.

    Off-topic: it's somewhat worrying to realise DeadlockDebugger's password matching is done by regex. I'm certain this is vulnerable to timing attacks. We normally do not expose zope root externally, so it should not be critical, but it's bad practice nevertheless.

  • I agree with JM on his quick approach of disabling long request logging. With some luck this can give us quick and cheap information about what is going on. @klaus, I suggest we disable it for 2-3 weeks and monitor. Still, I think it is not the activity nodes which are affected (in the Woelfel SHM* case) but the "web" family of nodes.

  • @Tyagov : I think @jm suggests disabling LongRequestLogger on activity nodes, as it is already worthless there - a clear win. The same issue does occur on web nodes as well, but disabling it there would remove a feature just to try to remove a bug - not a clear win.

  • @vpelletier, generally I agree with you. My point is to do it only for 2-3 weeks, gather information, draw conclusions and then bring it back (and hopefully fix it).

    If, for whatever business reasons (which Klaus can comment on), this is not an option, then well, no option...

  • Still I think it is not activity nodes which are affected (in Woelfel SHM* case) but "web" family of nodes.

    They are, but it's true it happens a little too rarely. For example, last night:

    /srv/slapgrid/slappart5/srv/runner/instance/slappart5/.slappart5_zope-4.log:E: <App.ZApplication.Cleanup instance at 0x7f545ca9bb90>: created from thread 140000710526720 @Tue Aug 28 01:05:46 2018; __del__ runs from thread 140000719443712 @Tue Aug 28 01:05:50 2018
  • We disabled LongRequestLogger on the activity nodes of mic_wind and mic_construction production servers today ([forum post for reference](https://www.erp5.com/project_section/woelfel-monitoring-project/forum/Mic-wind-upgrade---deactivate-the-long-request-logger-for-activity-nodes-XCcymCzmqK)).
