- 24 Jul, 2016 5 commits
-
-
Julien Muchembled authored
The following error was reported on a client node: #0x0000 Error < None (2001:...:2051) 1 (Retry Later) connection closed for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, closed, client) at 7f1ea7c42f90> Event Manager: connection started for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> #0x0000 RequestIdentification > None (2001:...:2051) Readers: [] Writers: [] Connections: 13: <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> (pending=False) Node manager : 1 nodes * None | MASTER | 2001:...:2051 | UNKNOWN <ClientCache history_size=0 oid_count=0 size=0 time=0 queue_length=[0] (life_time=10000 max_history_size=100000 max_size=20971520)> poll raised, retrying Traceback (most recent call last): File "neo/lib/threaded_app.py", line 93, in _run poll(1) File "neo/lib/event.py", line 134, in poll self._poll(0) File "neo/lib/event.py", line 164, in _poll conn = self.connection_dict[fd] KeyError: 13 which means that: - while the poll thread is getting a (13, EPOLLIN) event because it is closed (aborted by the master) - another thread processes the error packet, by closing it in PrimaryBootstrapHandler.notReady - next, the poll thread resumes the execution of EpollEventManager._poll and fails to find fd=13 in self.connection_dict So here, we have a race condition between epoll_wait and any further use of connection_dict to map returned fds. However, what commit a4731a0c does to handle the case of fd reallocation only works for mono-threaded applications. In EPOLLIN, wrapping 'self.connection_dict[fd]' the same way as for other events is not enough. For example: - case 1: - thread 1: epoll returns fd=13 - thread 2: close(13) - thread 2: open(13) - thread 1: self.connection_dict[13] does not match but this would be handled by the 'unregistered' list - case 2: - thread 1: reset 'unregistered' - thread 2: close(13) - thread 2: open(13) - thread 1: epoll returns fd=13 - thread 1: self.connection_dict[13] matches but it would be wrongly ignored by 'unregistered' - case 3: - thread 1: about to call readable/writable/onTimeout on a connection - thread 2: this connection is closed - thread 1: readable/writable/onTimeout wrongly called on a closed connection We could protect _poll() with a lock, and make unregister() use wakeup() so that it gets a chance to acquire it, but that causes threaded tests to deadlock (continuing in this direction seems too complicated). So we have to deal with the fact that there can be race conditions at any time and there's no way to make 'connection_dict' match exactly what epoll returns. We solve this by preventing fd reallocation inside _poll(), which is fortunately possible with sockets, using 'shutdown': the closing of fds is delayed. For above case 3, readable/writable/onTimeout for MTClientConnection are also changed to test whether the connection is still open while it has the lock. Just for safety, we do the same for 'process'. At last, another kind of race condition that this commit also fixes concerns the use of itervalues() on EventManager.connection_dict.
-
Julien Muchembled authored
This is a preliminary commit, without any functional change, just to make the next one easier to review.
-
Julien Muchembled authored
We already have logs when a connection fails, and ask() raises ConnectionClosed if the connection is closed.
-
Vincent Pelletier authored
-
Vincent Pelletier authored
-
- 13 Jul, 2016 1 commit
-
-
Julien Muchembled authored
SourceForge currently has too many issues.
-
- 17 Jun, 2016 2 commits
-
-
Julien Muchembled authored
Obviously, oids can't be automatically invalidated if the undo is done directly at the storage level. In commit 9cca0f8e, only 1 bug was found.
-
Julien Muchembled authored
-
- 15 Jun, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 08 Jun, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
FileStorage has been fixed in commit b7ea4e6f708dcded329332b24a9d70211a6b6368
-
- 26 May, 2016 1 commit
-
-
Julien Muchembled authored
Cache items are stored in double-linked chains. In order to quickly know the number of history items, an extra attribute is used to count them. It was not always decremented when a history item was removed. This led to the following exception: <ClientCache history_size=100000 oid_count=1959 size=20970973 time=2849049 queue_length=[1, 7, 738, 355, 480, 66, 255, 44, 3, 5, 2, 1, 3, 4, 2, 2] (life_time=10000 max_history_size=100000 max_size=20971520)> poll raised, retrying Traceback (most recent call last): ... File "neo/client/handlers/master.py", line 137, in packetReceived cache.store(oid, data, tid, None) File "neo/client/cache.py", line 247, in store self._add(head) File "neo/client/cache.py", line 129, in _add self._remove(head) File "neo/client/cache.py", line 136, in _remove level = item.level AttributeError: 'NoneType' object has no attribute 'level'
-
- 25 Apr, 2016 1 commit
-
-
Julien Muchembled authored
-
- 20 Apr, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
This fixes the following issue: WARNING replication aborted for partition 1 DEBUG connection started for <ClientConnection(uuid=None, address=...:43776, handler=StorageOperationHandler, fd=10, on_close=onConnectionClosed, connecting, client) at 7f5d2067fdd0> DEBUG connect failed for <SocketConnectorIPv6 at 0x7f5d2067fe10 fileno 10 ('::', 0), opened to ('...', 43776)>: ENETUNREACH (Network is unreachable) WARNING replication aborted for partition 5 DEBUG connection started for <ClientConnection(uuid=None, address=...:43776, handler=StorageOperationHandler, fd=10, on_close=onConnectionClosed, connecting, client) at 7f5d1c409510> PACKET #0x0000 RequestIdentification > None (...:43776) | (<EnumItem STORAGE (1)>, None, ('...', 60533), '...') ERROR Pre-mortem data: ERROR Traceback (most recent call last): ERROR File "neo/storage/app.py", line 157, in run ERROR self._run() ERROR File "neo/storage/app.py", line 197, in _run ERROR self.doOperation() ERROR File "neo/storage/app.py", line 285, in doOperation ERROR poll() ERROR File "neo/storage/app.py", line 95, in _poll ERROR self.em.poll(1) ERROR File "neo/lib/event.py", line 121, in poll ERROR self._poll(blocking) ERROR File "neo/lib/event.py", line 165, in _poll ERROR if conn.readable(): ERROR File "neo/lib/connection.py", line 481, in readable ERROR self._closure() ERROR File "neo/lib/connection.py", line 539, in _closure ERROR self.close() ERROR File "neo/lib/connection.py", line 531, in close ERROR handler.connectionClosed(self) ERROR File "neo/lib/handler.py", line 135, in connectionClosed ERROR self.connectionLost(conn, NodeStates.TEMPORARILY_DOWN) ERROR File "neo/storage/handlers/storage.py", line 59, in connectionLost ERROR replicator.abort() ERROR File "neo/storage/replicator.py", line 339, in abort ERROR self._nextPartition() ERROR File "neo/storage/replicator.py", line 260, in _nextPartition ERROR None if name else app.uuid, app.server, name or app.name)) ERROR File "neo/lib/connection.py", line 562, in ask ERROR raise ConnectionClosed ERROR ConnectionClosed
-
- 18 Apr, 2016 1 commit
-
-
Julien Muchembled authored
This fixes a lock leak on storages, causing further transactions to timeout.
-
- 01 Apr, 2016 1 commit
-
-
Julien Muchembled authored
-
- 31 Mar, 2016 1 commit
-
-
Julien Muchembled authored
-
- 30 Mar, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 28 Mar, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 22 Mar, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 21 Mar, 2016 3 commits
-
-
Julien Muchembled authored
This fixes the following crash (for example when a client disconnects during tpc_finish): Traceback (most recent call last): ... File "neo/master/handlers/storage.py", line 68, in answerInformationLocked self.app.tm.lock(ttid, conn.getUUID()) File "neo/master/transactions.py", line 338, in lock if self._ttid_dict[ttid].lock(uuid) and self._queue[0][1] == ttid: IndexError: list index out of range
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 09 Mar, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 08 Mar, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 04 Mar, 2016 3 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
Before this change, a storage node did 3 commits per transaction: - once all data are stored - when locking the transaction - when unlocking the transaction The last one is not important for ACID. In case of a crash, the transaction is unlocked again (verification phase). By deferring it by 1 second, we only have 2 commits per transaction during high activity because all pending changes are merged with the commits caused by other transactions. This change compensates the extra commit(s) per transaction that were introduced in commit 7eb7cf1b ("Minimize the amount of work during tpc_finish").
-
Julien Muchembled authored
-
- 02 Mar, 2016 1 commit
-
-
Julien Muchembled authored
Since commit d2d77437 ("client: make the cache tolerant to late invalidations when the entry is in the history queue"), invalidated items became current again when they were moved to the history queue, which was wrong for 2 reasons: - only the last items of _oid_dict values may have next_tid=None, - and for such items, they could be wrongly reused when caching the real current data.
-
- 01 Mar, 2016 1 commit
-
-
Julien Muchembled authored
-
- 26 Feb, 2016 3 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-