- 15 Nov, 2016 2 commits
-
-
Kirill Smelkov authored
A backup cluster for tids <= backup_tid has all data to provide regular read-only ZODB service. Having regular ZODB access to the data can be handy e.g. for externally verifying data for consistency between main and backup clusters. Peeking around without disturbing main cluster might be also useful sometimes. In this patch: - master & storage nodes are taught: * to instantiate read-only or regular client service handler depending on cluster state: RUNNING -> regular BACKINGUP -> read-only * in read-only client handler: + to reject write-related operations + to provide read operations but adjust semantic as last_tid in the database would be = backup_tid - new READ_ONLY_ACCESS protocol error code is introduced so that client can raise POSException.ReadOnlyError upon receiving it. I have not implemented back-channel for invalidations in read-only mode (yet ?). This way once a client connects to cluster in backup state, it won't see new data fetched by backup cluster from upstream after client connected. The reasons invalidations are not implemented is that for now (imho) there is no off-hand ready infrastructure to get updates from replicating node on transaction-by-transaction basis (it currently only notifies when whole batch is done). For consistency verification (main reason for this patch) we also don't need invalidations to work, as in that task we always connect afresh to backup. So I simply only put relevant TODOs about invalidations for now. The patch is not very polished but should work. /reviewed-on !4
-
Kirill Smelkov authored
-
- 27 Oct, 2016 1 commit
-
-
Iliya Manolov authored
Currently, the command "neoctl [arguments] print ids" has the following output: last_oid = 0x... last_tid = 0x... last_ptid = ... or backup_tid = 0x... last_tid = 0x... last_ptid = ... depending on whether the cluster is in normal or backup mode. This is extremely unreadable since the admin is often interested in the time that corresponds to each tid. Now the output is: last_oid = 0x... last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss) last_ptid = ... or backup_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss) last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss) last_ptid = ... /reviewed-on nexedi/neoppod!2
-
- 17 Oct, 2016 1 commit
-
-
Kirill Smelkov authored
Similarly to 13911ca3 on the same instance after MariaDB was upgraded to 10.1.17 the following query, even after `OPTIMIZE TABLE obj`, started to execute very slowly: MariaDB [(none)]> SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1; +--------------------+ | tid | +--------------------+ | 268707072758797063 | +--------------------+ 1 row in set (4.82 sec) Both explain and analyze says the query will/is using `partition` key but only partially (note key_len is only 10, not 18): MariaDB [(none)]> SHOW INDEX FROM neo1.obj; +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ | obj | 0 | PRIMARY | 1 | partition | A | 28755928 | NULL | NULL | | BTREE | | | | obj | 0 | PRIMARY | 2 | tid | A | 28755928 | NULL | NULL | | BTREE | | | | obj | 0 | PRIMARY | 3 | oid | A | 28755928 | NULL | NULL | | BTREE | | | | obj | 0 | partition | 1 | partition | A | 28755928 | NULL | NULL | | BTREE | | | | obj | 0 | partition | 2 | oid | A | 28755928 | NULL | NULL | | BTREE | | | | obj | 0 | partition | 3 | tid | A | 28755928 | NULL | NULL | | BTREE | | | | obj | 1 | data_id | 1 | data_id | A | 28755928 | NULL | NULL | YES | BTREE | | | +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ 7 rows in set (0.00 sec) MariaDB [(none)]> explain SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1; +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+ | 1 | SIMPLE | obj | ref | PRIMARY,partition | partition | 10 | const,const | 2 | Using where; Using index | +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+ 1 row in set (0.00 sec) MariaDB [(none)]> analyze SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1; +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra | +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+ | 1 | SIMPLE | obj | ref | PRIMARY,partition | partition | 10 | const,const | 2 | 9741121.00 | 100.00 | 0.00 | Using where; Using index | +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+ 1 row in set (4.93 sec) By explicitly forcing (partition, oid, tid) index usage which is precisely designed to serve this and similar queries can avoid the query from being slow: MariaDB [(none)]> analyze SELECT tid FROM neo1.obj FORCE INDEX(`partition`) WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1; +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra | +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+ | 1 | SIMPLE | obj | range | partition | partition | 18 | NULL | 2 | 1.00 | 100.00 | 100.00 | Using where; Using index | +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+ 1 row in set (0.00 sec) /cc @jm, @vpelltier, @Tyagov /reviewed-on nexedi/neoppod!1
-
- 12 Sep, 2016 1 commit
-
-
Julien Muchembled authored
Many patches have been merged upstream :) A notable change is that lastTransaction() does not ping the master anymore (but it still causes a connection to the master if the client is disconnected).
-
- 29 Aug, 2016 2 commits
-
-
Julien Muchembled authored
After partitions were dropped with TokuDB, we had a case where MariaDB 10.1.14 stopped using the most appropriate index. MariaDB [neo0]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=5; +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+ | 1 | SIMPLE | obj | range | PRIMARY,partition | data_id | 11 | NULL | 10 | Using where; Using index for group-by | +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+ MariaDB [neo0]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=5; Empty set (1 min 51.47 sec) Expected: MariaDB [neo1]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=4; +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+ | 1 | SIMPLE | obj | ref | PRIMARY,partition | PRIMARY | 2 | const | 1 | Using where; Using temporary | +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+ 1 row in set (0.00 sec) MariaDB [neo1]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=4; Empty set (0.00 sec) Restarting the server or 'OPTIMIZE TABLE obj; ' does not help. Such issue could prevent the cluster to start due to timeouts, by always going back to RECOVERING state.
-
Julien Muchembled authored
-
- 11 Aug, 2016 2 commits
-
-
Julien Muchembled authored
Freeing disk space when a cell is dropped will have to be implemented with care, not only for performance reasons.
-
Julien Muchembled authored
TRUNCATE was chosen for performance reasons, but it's usually done on small tables, and not for performance-critical operations. TRUNCATE commits implicitely, so for pt/ttrans in particular, it's certainly slower due to extra fsyncs to disk. On the other side, committing too early can corrupt the database if the storage node is stopped just after. For example, a failure in changePartitionTable() can cause 'pt' to remain empty.
-
- 01 Aug, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
It slowed down everything but abortFor(), which is not performance critical.
-
- 31 Jul, 2016 1 commit
-
-
Julien Muchembled authored
This reverts commit 7aecdada partially. There seems to be no bug here, because: - abortFor() is only called upon a notification from the master that a client is disconnected, - and from the same TCP connection, we only receive a LockInformation packet if there's still such a transaction on the master side. The code removed in abortFor() was redundant with abort().
-
- 27 Jul, 2016 6 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
There was a bug that connections were not maintained during a TPC, which caused transactions to be aborted when the limit was reached. Given that oids are spreaded evenly over all partitions, and that clients always write to all cells of each involved partitions, clients would spend their time reconnecting to storage nodes as soon as the limit is reached. So such feature really looks counter-productive.
-
Julien Muchembled authored
-
Julien Muchembled authored
Client nodes ignored the state of the connection to the master node when reading data from storage, as long as their partition tables were recent enough. This way, they were able to finish read-only transactions even if they could't reach the master, which could be useful for high availability. The downside is that the master node ignored that their node ids were still used, which causes "uuid" conflicts when reallocating them. Rejected solutions: - An unused NEO Storage should not insist in staying connected to master node. - Reverting to big random node identifiers is a lot of work and it would make debugging annoying (see commit 23fad3af). - Always increasing node ids could have been a simple solution if we accepted that the cluster dies after that all 2^24 possible ids were allocated. Given that reading from storage without being connected to the master can only be useful to finish the current transaction (because we always ping the master at the beginning of every transaction), keeping such feature is not worth the effort. This commit fixes id conflicts in a very simple way, by clearing the partition table upon primary node failure, which forces reconnection to the master before querying any storage node. In such case, we raise a special exception that will cause the transaction to be restarted, so that the user does not get errors for temporary connection failures.
-
Julien Muchembled authored
Currently, another argument not to lock is that we would not be able to test incremental resolution anymore. We can think about this again when deadlock resolution is implemented.
-
- 24 Jul, 2016 5 commits
-
-
Julien Muchembled authored
The following error was reported on a client node: #0x0000 Error < None (2001:...:2051) 1 (Retry Later) connection closed for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, closed, client) at 7f1ea7c42f90> Event Manager: connection started for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> #0x0000 RequestIdentification > None (2001:...:2051) Readers: [] Writers: [] Connections: 13: <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> (pending=False) Node manager : 1 nodes * None | MASTER | 2001:...:2051 | UNKNOWN <ClientCache history_size=0 oid_count=0 size=0 time=0 queue_length=[0] (life_time=10000 max_history_size=100000 max_size=20971520)> poll raised, retrying Traceback (most recent call last): File "neo/lib/threaded_app.py", line 93, in _run poll(1) File "neo/lib/event.py", line 134, in poll self._poll(0) File "neo/lib/event.py", line 164, in _poll conn = self.connection_dict[fd] KeyError: 13 which means that: - while the poll thread is getting a (13, EPOLLIN) event because it is closed (aborted by the master) - another thread processes the error packet, by closing it in PrimaryBootstrapHandler.notReady - next, the poll thread resumes the execution of EpollEventManager._poll and fails to find fd=13 in self.connection_dict So here, we have a race condition between epoll_wait and any further use of connection_dict to map returned fds. However, what commit a4731a0c does to handle the case of fd reallocation only works for mono-threaded applications. In EPOLLIN, wrapping 'self.connection_dict[fd]' the same way as for other events is not enough. For example: - case 1: - thread 1: epoll returns fd=13 - thread 2: close(13) - thread 2: open(13) - thread 1: self.connection_dict[13] does not match but this would be handled by the 'unregistered' list - case 2: - thread 1: reset 'unregistered' - thread 2: close(13) - thread 2: open(13) - thread 1: epoll returns fd=13 - thread 1: self.connection_dict[13] matches but it would be wrongly ignored by 'unregistered' - case 3: - thread 1: about to call readable/writable/onTimeout on a connection - thread 2: this connection is closed - thread 1: readable/writable/onTimeout wrongly called on a closed connection We could protect _poll() with a lock, and make unregister() use wakeup() so that it gets a chance to acquire it, but that causes threaded tests to deadlock (continuing in this direction seems too complicated). So we have to deal with the fact that there can be race conditions at any time and there's no way to make 'connection_dict' match exactly what epoll returns. We solve this by preventing fd reallocation inside _poll(), which is fortunately possible with sockets, using 'shutdown': the closing of fds is delayed. For above case 3, readable/writable/onTimeout for MTClientConnection are also changed to test whether the connection is still open while it has the lock. Just for safety, we do the same for 'process'. At last, another kind of race condition that this commit also fixes concerns the use of itervalues() on EventManager.connection_dict.
-
Julien Muchembled authored
This is a preliminary commit, without any functional change, just to make the next one easier to review.
-
Julien Muchembled authored
We already have logs when a connection fails, and ask() raises ConnectionClosed if the connection is closed.
-
Vincent Pelletier authored
-
Vincent Pelletier authored
-
- 13 Jul, 2016 1 commit
-
-
Julien Muchembled authored
SourceForge currently has too many issues.
-
- 17 Jun, 2016 2 commits
-
-
Julien Muchembled authored
Obviously, oids can't be automatically invalidated if the undo is done directly at the storage level. In commit 9cca0f8e, only 1 bug was found.
-
Julien Muchembled authored
-
- 15 Jun, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 08 Jun, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
FileStorage has been fixed in commit b7ea4e6f708dcded329332b24a9d70211a6b6368
-
- 26 May, 2016 1 commit
-
-
Julien Muchembled authored
Cache items are stored in double-linked chains. In order to quickly know the number of history items, an extra attribute is used to count them. It was not always decremented when a history item was removed. This led to the following exception: <ClientCache history_size=100000 oid_count=1959 size=20970973 time=2849049 queue_length=[1, 7, 738, 355, 480, 66, 255, 44, 3, 5, 2, 1, 3, 4, 2, 2] (life_time=10000 max_history_size=100000 max_size=20971520)> poll raised, retrying Traceback (most recent call last): ... File "neo/client/handlers/master.py", line 137, in packetReceived cache.store(oid, data, tid, None) File "neo/client/cache.py", line 247, in store self._add(head) File "neo/client/cache.py", line 129, in _add self._remove(head) File "neo/client/cache.py", line 136, in _remove level = item.level AttributeError: 'NoneType' object has no attribute 'level'
-
- 25 Apr, 2016 1 commit
-
-
Julien Muchembled authored
-
- 20 Apr, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
This fixes the following issue: WARNING replication aborted for partition 1 DEBUG connection started for <ClientConnection(uuid=None, address=...:43776, handler=StorageOperationHandler, fd=10, on_close=onConnectionClosed, connecting, client) at 7f5d2067fdd0> DEBUG connect failed for <SocketConnectorIPv6 at 0x7f5d2067fe10 fileno 10 ('::', 0), opened to ('...', 43776)>: ENETUNREACH (Network is unreachable) WARNING replication aborted for partition 5 DEBUG connection started for <ClientConnection(uuid=None, address=...:43776, handler=StorageOperationHandler, fd=10, on_close=onConnectionClosed, connecting, client) at 7f5d1c409510> PACKET #0x0000 RequestIdentification > None (...:43776) | (<EnumItem STORAGE (1)>, None, ('...', 60533), '...') ERROR Pre-mortem data: ERROR Traceback (most recent call last): ERROR File "neo/storage/app.py", line 157, in run ERROR self._run() ERROR File "neo/storage/app.py", line 197, in _run ERROR self.doOperation() ERROR File "neo/storage/app.py", line 285, in doOperation ERROR poll() ERROR File "neo/storage/app.py", line 95, in _poll ERROR self.em.poll(1) ERROR File "neo/lib/event.py", line 121, in poll ERROR self._poll(blocking) ERROR File "neo/lib/event.py", line 165, in _poll ERROR if conn.readable(): ERROR File "neo/lib/connection.py", line 481, in readable ERROR self._closure() ERROR File "neo/lib/connection.py", line 539, in _closure ERROR self.close() ERROR File "neo/lib/connection.py", line 531, in close ERROR handler.connectionClosed(self) ERROR File "neo/lib/handler.py", line 135, in connectionClosed ERROR self.connectionLost(conn, NodeStates.TEMPORARILY_DOWN) ERROR File "neo/storage/handlers/storage.py", line 59, in connectionLost ERROR replicator.abort() ERROR File "neo/storage/replicator.py", line 339, in abort ERROR self._nextPartition() ERROR File "neo/storage/replicator.py", line 260, in _nextPartition ERROR None if name else app.uuid, app.server, name or app.name)) ERROR File "neo/lib/connection.py", line 562, in ask ERROR raise ConnectionClosed ERROR ConnectionClosed
-
- 18 Apr, 2016 1 commit
-
-
Julien Muchembled authored
This fixes a lock leak on storages, causing further transactions to timeout.
-
- 01 Apr, 2016 1 commit
-
-
Julien Muchembled authored
-
- 31 Mar, 2016 1 commit
-
-
Julien Muchembled authored
-
- 30 Mar, 2016 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 28 Mar, 2016 1 commit
-
-
Julien Muchembled authored
-