Commits · 8eb14b0184cc5cde5df16fa52367e27e34bad404 · nexedi / neoppod

27 Nov, 2016 11 commits

Bump protocol version · 8eb14b01
Julien Muchembled authored Nov 27, 2016

8eb14b01

Fix identification issues, including a race condition causing id conflicts · 9385706f

Julien Muchembled authored Nov 24, 2016

The added test describes how the new id timestamps fix the race condition.
These timestamps could be any unique opaque values, and the protocol is
extended to exchange them along with node ids.

Internally, nodes also reuse timestamps as a marker to identify the first
NotifyNodeInformation packets from the master: since this packet is a complete
list of nodes in the cluster, any other node in the node manager has left the
cluster definitely and is removed.

The secondary masters didn't receive updates about master nodes.
It's also useless to send them information about non-master nodes.
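
The pruning rule described above (the first NotifyNodeInformation packet is a complete list, so anything else is gone) could be sketched as follows. This is a minimal illustration, not NEO's code: the class name `NodeManagerSketch`, the method name and the use of `id_timestamp is None` as the "first packet" marker are all assumptions.

```python
class NodeManagerSketch:
    """Hypothetical, simplified node manager (not NEO's actual class)."""

    def __init__(self):
        self.nodes = {}            # node id -> state
        self.id_timestamp = None   # assumed marker: None until the first packet

    def notify_node_information(self, timestamp, node_list):
        if self.id_timestamp is None:
            # First packet from the master: it lists every node of the
            # cluster, so any other node we still know about has left.
            listed = {node_id for node_id, _ in node_list}
            for node_id in list(self.nodes):
                if node_id not in listed:
                    del self.nodes[node_id]
            self.id_timestamp = timestamp
        # Later packets are incremental updates: never prune on them.
        for node_id, state in node_list:
            self.nodes[node_id] = state
```

Only the first call prunes; later notifications are treated as incremental.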

9385706f

protocol: simplify definition of Struct-based items · 54e819ff
Julien Muchembled authored Nov 24, 2016

54e819ff

Remove AskNodeInformation packet · d048a52d

Julien Muchembled authored Nov 25, 2016

When client (including backup master) and admin nodes are identified,
the primary master now automatically sends them all nodes with
NotifyNodeInformation, as with storage nodes.

d048a52d

master: fix crashes in identification due to buggy nodes · 35664759
Julien Muchembled authored Nov 24, 2016

- check address conflicts
- on invalid values, reject peer instead of dying

35664759

lib.node: fix NodeManager accessors returning identified nodes · e7cccf01

Julien Muchembled authored Nov 23, 2016

Listing connected/connecting nodes with a UUID is used:
- in one place by storage nodes: here, it does not matter if we skip nodes that
  aren't really identified
- in many places by the master, only for server connections, in which case we
  have equivalence with real identification

So in practice, NodeManager is only simplified to reuse the 'identified'
property of nodes.

e7cccf01

lib.node: code refactoring · 5941b27d
Julien Muchembled authored Nov 23, 2016

5941b27d

storage: only accept clients that are known by the master · c17f5f91
Julien Muchembled authored Nov 23, 2016

Therefore, a client node in the node manager is always RUNNING.

c17f5f91

Give new ids to clients whose ids were already reallocated · d752aadb

Julien Muchembled authored Nov 21, 2016

Although the change applies to any node with a temporary id (all but storage),
only clients don't have addresses and are therefore not recognizable.

After a client is disconnected from the master and before reconnecting, another
client may join the cluster and "steal" the id of the first client. This issue
leads to stuck clients, failing in a loop with exceptions like the following one:

    ERROR ZODB.Connection Couldn't load state for 0x0251
    Traceback (most recent call last):
      File "ZODB/Connection.py", line 860, in setstate
        self._setstate(obj)
      File "ZODB/Connection.py", line 901, in _setstate
        p, serial = self._storage.load(obj._p_oid, '')
      File "neo/client/Storage.py", line 82, in load
        return self.app.load(oid)[:2]
      File "neo/client/app.py", line 353, in load
        data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
      File "neo/client/app.py", line 373, in _loadFromStorage
        for node, conn in self.cp.iterateForObject(oid, readable=True):
      File "neo/client/pool.py", line 91, in iterateForObject
        pt = self.app.pt
      File "neo/client/app.py", line 145, in __getattr__
        self._getMasterConnection()
      File "neo/client/app.py", line 214, in _getMasterConnection
        result = self.master_conn = self._connectToPrimaryNode()
      File "neo/client/app.py", line 246, in _connectToPrimaryNode
        handler=handler)
      File "neo/lib/threaded_app.py", line 154, in _ask
        _handlePacket(qconn, qpacket, kw, handler)
      File "neo/lib/threaded_app.py", line 135, in _handlePacket
        handler.dispatch(conn, packet, kw)
      File "neo/lib/handler.py", line 66, in dispatch
        method(conn, *args, **kw)
      File "neo/lib/handler.py", line 188, in error
        getattr(self, Errors[code])(conn, message)
      File "neo/client/handlers/__init__.py", line 23, in protocolError
        raise StorageError("protocol error: %s" % message)
    StorageError: protocol error: already connected

d752aadb

spelling: oudated -> outdated · b62b8dc3
Julien Muchembled authored Nov 27, 2016

b62b8dc3

Fix spelling mistakes · 6e32ebb7
Julien Muchembled authored Nov 21, 2016

6e32ebb7

25 Nov, 2016 2 commits

coverage: CacheItem.__repr__ (client) · b61f8745
Julien Muchembled authored Nov 24, 2016

b61f8745

New neotestrunner option for code coverage testing · 5de0ff3a
Julien Muchembled authored Nov 24, 2016

5de0ff3a

21 Nov, 2016 2 commits

client: fix item eviction from cache, which could break loading from storage · 4ef05b9e

Julien Muchembled authored Nov 18, 2016

`ClientCache._oid_dict` shall not have empty values. For a given oid, when the
last item is removed from the cache, the oid must be removed as well to free
memory. In some cases, this was not done.

A consequence of this bug is the following exception:

    ERROR ZODB.Connection Couldn't load state for 0x02d1e1e4
    Traceback (most recent call last):
      File "ZODB/Connection.py", line 860, in setstate
        self._setstate(obj)
      File "ZODB/Connection.py", line 901, in _setstate
        p, serial = self._storage.load(obj._p_oid, '')
      File "neo/client/Storage.py", line 82, in load
        return self.app.load(oid)[:2]
      File "neo/client/app.py", line 358, in load
        self._cache.store(oid, data, tid, next_tid)
      File "neo/client/cache.py", line 228, in store
        prev = item_list[-1]
    IndexError: list index out of range
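
The invariant stated above (no empty values in `_oid_dict`) can be illustrated with a small sketch. The class `OidIndexSketch` and its methods are hypothetical and much simpler than the real `ClientCache`; only the rule "drop the oid when its last item goes away" is taken from the commit message.

```python
class OidIndexSketch:
    """Hypothetical stand-in for the oid -> items mapping of a cache."""

    def __init__(self):
        self._oid_dict = {}   # oid -> non-empty list of items

    def add(self, oid, item):
        self._oid_dict.setdefault(oid, []).append(item)

    def remove(self, oid, item):
        item_list = self._oid_dict[oid]
        item_list.remove(item)
        if not item_list:
            # Last item for this oid: drop the key too, so that no empty
            # list is left behind (frees memory, keeps the invariant).
            del self._oid_dict[oid]

    def last(self, oid):
        item_list = self._oid_dict.get(oid)
        # Relies on the invariant: a list found here is never empty, so
        # item_list[-1] cannot raise IndexError.
        return None if item_list is None else item_list[-1]
```

If `remove()` left an empty list behind, `last()` would fail with the same `IndexError: list index out of range` as in the traceback.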

4ef05b9e

Bump protocol version for new read-only mode in BACKUPING state · 2b3993f1
Julien Muchembled authored Nov 21, 2016

2b3993f1

15 Nov, 2016 2 commits

backup: Teach cluster in BACKUPING state to also serve regular ZODB clients in read-only mode · d4944062

Kirill Smelkov authored Nov 10, 2016

A backup cluster for tids <= backup_tid has all data to provide regular
read-only ZODB service. Having regular ZODB access to the data can be
handy e.g. for externally verifying data for consistency between
main and backup clusters. Peeking around without disturbing main
cluster might be also useful sometimes.

In this patch:

- master & storage nodes are taught:

    * to instantiate read-only or regular client service handler depending on cluster state:

        RUNNING   -> regular
        BACKINGUP -> read-only

    * in read-only client handler:

        + to reject write-related operations
        + to provide read operations but adjust semantic as last_tid in the database
          would be = backup_tid

- new READ_ONLY_ACCESS protocol error code is introduced so that client can
  raise POSException.ReadOnlyError upon receiving it.
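
The handler selection and error mapping listed above might look roughly like this. It is only a sketch: every class and function name is hypothetical, the dict-based "database" is a toy, and `ReadOnlyError` is a local stand-in for `POSException.ReadOnlyError` so that the example stays self-contained.

```python
READ_ONLY_ACCESS = "READ_ONLY_ACCESS"   # stand-in for the new error code


class ReadOnlyError(Exception):
    """Local stand-in for POSException.ReadOnlyError."""


class ProtocolError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.code = code


class RegularHandler:
    def __init__(self, db):
        self.db = db

    def load(self, oid):
        return self.db[oid]

    def store(self, oid, data):
        self.db[oid] = data


class ReadOnlyHandler(RegularHandler):
    def store(self, oid, data):
        # Write-related operations are rejected with the dedicated code.
        raise ProtocolError(READ_ONLY_ACCESS)


def make_handler(cluster_state, db):
    # Cluster state decides which client service handler is instantiated.
    handler_class = {"RUNNING": RegularHandler,
                     "BACKINGUP": ReadOnlyHandler}[cluster_state]
    return handler_class(db)


def client_store(handler, oid, data):
    # Client side: translate the protocol error into ReadOnlyError.
    try:
        handler.store(oid, data)
    except ProtocolError as e:
        if e.code == READ_ONLY_ACCESS:
            raise ReadOnlyError("cluster is in backup mode")
        raise
```

Reads go through the same code path in both modes; the sketch does not model the `last_tid = backup_tid` adjustment.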

I have not implemented a back-channel for invalidations in read-only mode (yet?).
As a result, once a client connects to a cluster in backup state, it won't see
new data fetched by the backup cluster from upstream after it connected.

The reason invalidations are not implemented is that for now (imho)
there is no ready-made infrastructure to get updates from the
replicating node on a transaction-by-transaction basis (it currently only
notifies when a whole batch is done). For consistency verification (the main
reason for this patch) we also don't need invalidations to work, as in
that task we always connect afresh to the backup. So I simply put
relevant TODOs about invalidations for now.

The patch is not very polished but should work.

/reviewed-on !4

d4944062

tests/threaded: Add handy shortcuts to NEOCluster to concisely check cluster properties in tests · ab552d87
Kirill Smelkov authored Nov 10, 2016

ab552d87

27 Oct, 2016 1 commit

neoctl: make 'print ids' command display time of TIDs · d9dd39f0

Iliya Manolov authored Oct 12, 2016

Currently, the command "neoctl [arguments] print ids" has the following output:

    last_oid = 0x...
    last_tid = 0x...
    last_ptid = ...

or

    backup_tid = 0x...
    last_tid = 0x...
    last_ptid = ...

depending on whether the cluster is in normal or backup mode.

This is hard to read, since the admin is often interested in the time that corresponds to each tid. Now the output is:

    last_oid = 0x...
    last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
    last_ptid = ...

or

    backup_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
    last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
    last_ptid = ...
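
As an illustration of where such a time comes from, here is a minimal sketch that decodes a tid, assuming the usual ZODB timestamp layout (high 32 bits: minutes counted from 1900 with 31-day months; low 32 bits: fraction of a minute). The helper names `tid_to_datetime` and `format_tid` are hypothetical and are not neoctl's implementation.

```python
from datetime import datetime, timedelta


def tid_to_datetime(tid):
    """Decode an integer tid, assuming the ZODB TimeStamp layout."""
    high, low = divmod(tid, 1 << 32)
    high, minute = divmod(high, 60)
    high, hour = divmod(high, 24)
    high, day = divmod(high, 31)      # every month is counted as 31 days
    year, month = divmod(high, 12)
    seconds = low * 60.0 / (1 << 32)  # low word is a fraction of a minute
    return (datetime(year + 1900, month + 1, day + 1, hour, minute)
            + timedelta(seconds=seconds))


def format_tid(name, tid):
    # Same shape as the new output: hexadecimal value, then the time.
    return "%s = 0x%016x (%s)" % (name, tid, tid_to_datetime(tid))
```

Under this layout, the tid 268707071353462798 that appears in the eaa00a88 query decodes to 2016-10-16 16:40, which is consistent with that commit's date.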

/reviewed-on !2

d9dd39f0

17 Oct, 2016 1 commit

mysql: force _getNextTID() to use appropriate/whole index · eaa00a88

Kirill Smelkov authored Oct 16, 2016

Similarly to 13911ca3, on the same instance, after MariaDB was upgraded to
10.1.17, the following query started to execute very slowly, even after
`OPTIMIZE TABLE obj`:

    MariaDB [(none)]> SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
    +--------------------+
    | tid                |
    +--------------------+
    | 268707072758797063 |
    +--------------------+
    1 row in set (4.82 sec)

Both explain and analyze say the query uses the `partition` key, but only partially (note key_len is only 10, not 18):

    MariaDB [(none)]> SHOW INDEX FROM neo1.obj;
    +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | Table | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
    +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | obj   |          0 | PRIMARY   |            1 | partition   | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
    | obj   |          0 | PRIMARY   |            2 | tid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
    | obj   |          0 | PRIMARY   |            3 | oid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
    | obj   |          0 | partition |            1 | partition   | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
    | obj   |          0 | partition |            2 | oid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
    | obj   |          0 | partition |            3 | tid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
    | obj   |          1 | data_id   |            1 | data_id     | A         |    28755928 |     NULL | NULL   | YES  | BTREE      |         |               |
    +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    7 rows in set (0.00 sec)

    MariaDB [(none)]> explain SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
    +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
    | id   | select_type | table | type | possible_keys     | key       | key_len | ref         | rows | Extra                    |
    +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
    |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | partition | 10      | const,const |    2 | Using where; Using index |
    +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
    1 row in set (0.00 sec)

    MariaDB [(none)]> analyze SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
    +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
    | id   | select_type | table | type | possible_keys     | key       | key_len | ref         | rows | r_rows     | filtered | r_filtered | Extra                    |
    +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
    |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | partition | 10      | const,const |    2 | 9741121.00 |   100.00 |       0.00 | Using where; Using index |
    +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
    1 row in set (4.93 sec)

Explicitly forcing use of the (partition, oid, tid) index, which is precisely designed to serve this and similar queries, keeps the query from being slow:

    MariaDB [(none)]> analyze SELECT tid FROM neo1.obj FORCE INDEX(`partition`) WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
    +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
    | id   | select_type | table | type  | possible_keys | key       | key_len | ref  | rows | r_rows | filtered | r_filtered | Extra                    |
    +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
    |    1 | SIMPLE      | obj   | range | partition     | partition | 18      | NULL |    2 |   1.00 |   100.00 |     100.00 | Using where; Using index |
    +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
    1 row in set (0.00 sec)
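
On the Python side, adding the hint amounts to a one-line change in how the query string is built. The helper below is purely illustrative: the name `next_tid_query` is hypothetical and this is not the actual `_getNextTID()` code, only the SQL text from the session above wrapped in a function.

```python
def next_tid_query(partition, oid, tid):
    """Build the 'next tid' query with an explicit index hint.

    All three arguments are integers, so %d formatting is safe here.
    """
    return ("SELECT tid FROM obj FORCE INDEX(`partition`)"
            " WHERE `partition`=%d AND oid=%d AND tid>%d"
            " ORDER BY tid LIMIT 1" % (partition, oid, tid))


query = next_tid_query(5, 79613, 268707071353462798)
```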

/cc @jm, @vpelltier, @Tyagov

/reviewed-on !1

eaa00a88

12 Sep, 2016 1 commit

Add support for latest versions of ZODB (4.4.3 & 5.0.1) · c39d5c67

Julien Muchembled authored Jun 15, 2016

Many patches have been merged upstream :)

A notable change is that lastTransaction() does not ping the master anymore
(but it still causes a connection to the master if the client is disconnected).

c39d5c67

29 Aug, 2016 2 commits

mysql: fix use of wrong SQL index when checking for dropped partitions · 13911ca3

Julien Muchembled authored Aug 29, 2016

After partitions were dropped with TokuDB, we had a case where MariaDB 10.1.14
stopped using the most appropriate index.

    MariaDB [neo0]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=5;
    +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
    | id   | select_type | table | type  | possible_keys     | key     | key_len | ref  | rows | Extra                                 |
    +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
    |    1 | SIMPLE      | obj   | range | PRIMARY,partition | data_id | 11      | NULL |   10 | Using where; Using index for group-by |
    +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
    MariaDB [neo0]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=5;
    Empty set (1 min 51.47 sec)

Expected:

    MariaDB [neo1]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=4;
    +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
    | id   | select_type | table | type | possible_keys     | key     | key_len | ref   | rows | Extra                        |
    +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
    |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | PRIMARY | 2       | const |    1 | Using where; Using temporary |
    +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
    1 row in set (0.00 sec)
    MariaDB [neo1]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=4;
    Empty set (0.00 sec)

Restarting the server or 'OPTIMIZE TABLE obj;' does not help.

Such an issue could prevent the cluster from starting due to timeouts, by always
going back to RECOVERING state.

13911ca3

Update TODO · 00ffb1ef
Julien Muchembled authored Aug 29, 2016

00ffb1ef

11 Aug, 2016 2 commits

Add test to check that a moved cell doesn't cause POSKeyError · df990a05
Julien Muchembled authored Aug 11, 2016

Freeing disk space when a cell is dropped will have to be implemented with care,
not only for performance reasons.

df990a05

mysql: do not use unsafe TRUNCATE statement · c3c2ffe2

Julien Muchembled authored Aug 11, 2016

TRUNCATE was chosen for performance reasons, but it's usually done on small
tables, and not for performance-critical operations. TRUNCATE commits
implicitly, so for pt/ttrans in particular, it's certainly slower due to extra
fsyncs to disk.

On the other hand, committing too early can corrupt the database if the storage
node is stopped just afterwards. For example, a failure in changePartitionTable()
can cause 'pt' to remain empty.
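
The difference can be shown with a small runnable sketch. SQLite (from the Python standard library) stands in for MySQL here, which is an assumption made only to keep the example self-contained; SQLite has no TRUNCATE, so the sketch only demonstrates the safe side: a DELETE that is part of the surrounding transaction is undone if the update fails midway. The table layout and the `replace_rows` helper are hypothetical.

```python
import sqlite3


def replace_rows(conn, rows, fail_midway=False):
    """Replace the content of 'pt' with DELETE + INSERT in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("DELETE FROM pt")
        if fail_midway:
            raise RuntimeError("simulated failure before the new rows are written")
        conn.executemany("INSERT INTO pt VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pt (partition_id INTEGER, node_id INTEGER)")
with conn:
    conn.executemany("INSERT INTO pt VALUES (?, ?)", [(0, 1), (1, 2)])

try:
    replace_rows(conn, [(0, 3)], fail_midway=True)
except RuntimeError:
    pass

# The DELETE was rolled back together with the failed update.
rows_after_failure = conn.execute(
    "SELECT * FROM pt ORDER BY partition_id").fetchall()
```

With a statement that commits implicitly in place of the DELETE, the same failure would leave the table empty.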

c3c2ffe2

01 Aug, 2016 2 commits

storage: speed up transaction registration · e25fa5d9
Julien Muchembled authored Aug 01, 2016

e25fa5d9

storage: remove uuid index in TransactionManager · c3d3dabd
Julien Muchembled authored Aug 01, 2016

It slowed down everything but abortFor(), which is not performance critical.

c3d3dabd

31 Jul, 2016 1 commit

storage: review TransactionManager.abortFor · 2d388048

Julien Muchembled authored Jul 31, 2016

This reverts commit 7aecdada partially.
There seems to be no bug here, because:
- abortFor() is only called upon a notification from the master that a client
  is disconnected,
- and from the same TCP connection, we only receive a LockInformation packet
  if there's still such a transaction on the master side.

The code removed in abortFor() was redundant with abort().

2d388048

27 Jul, 2016 6 commits

Reenable checkTransactionalUndoIterator · cb144fdb
Julien Muchembled authored Jul 27, 2016

cb144fdb

client: better exception handling in tpc_abort · 38583af9
Julien Muchembled authored Jul 27, 2016

38583af9

client: do not limit the number of open connections to storage nodes · 77132157

Julien Muchembled authored Jul 27, 2016

There was a bug: connections were not maintained during a TPC,
which caused transactions to be aborted when the limit was reached.

Given that oids are spread evenly over all partitions, and that clients always
write to all cells of each involved partition, clients would spend their time
reconnecting to storage nodes as soon as the limit is reached. So such a feature
really looks counter-productive.

77132157

client: small optimization when iterating over storage connections · cfe1b5ca
Julien Muchembled authored Jul 27, 2016

cfe1b5ca

client: fix conflict of node id by never reading from storage without being connected to the master · 11d83ad9

Julien Muchembled authored Jul 26, 2016

Client nodes ignored the state of the connection to the master node when reading
data from storage, as long as their partition tables were recent enough. This
way, they were able to finish read-only transactions even if they couldn't reach
the master, which could be useful for high availability. The downside is that
the master node ignored that their node ids were still used, which caused "uuid"
conflicts when reallocating them.

Rejected solutions:
- An unused NEO Storage should not insist on staying connected to the master node.
- Reverting to big random node identifiers is a lot of work and it would make
  debugging annoying (see commit 23fad3af).
- Always increasing node ids could have been a simple solution if we accepted
  that the cluster dies after all 2^24 possible ids have been allocated.

Given that reading from storage without being connected to the master can only
be useful to finish the current transaction (because we always ping the master
at the beginning of every transaction), keeping such a feature is not worth the
effort.

This commit fixes id conflicts in a very simple way, by clearing the partition
table upon primary node failure, which forces reconnection to the master before
querying any storage node. In such a case, we raise a special exception that will
cause the transaction to be restarted, so that the user does not get errors for
temporary connection failures.
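
The mechanism described in the last paragraph could be sketched as follows. Everything here is hypothetical (class names, method names, the `RestartTransaction` exception, the dict used as storage); the sketch only mirrors the described behaviour: the partition table is cleared on primary failure, a read without it forces a reconnection, and the transaction is restarted instead of failing.

```python
class RestartTransaction(Exception):
    """Hypothetical 'special exception': the transaction must be retried."""


class ClientSketch:
    def __init__(self, storage):
        self.storage = storage   # toy stand-in for storage nodes
        self.pt = None           # partition table, None when unknown

    def connect_to_master(self):
        self.pt = {"ready": True}

    def on_primary_failure(self):
        # Clearing the partition table makes any later read notice the loss.
        self.pt = None

    def load(self, oid):
        if self.pt is None:
            # Reconnect before querying any storage node, then ask the
            # caller to restart the transaction from a clean state.
            self.connect_to_master()
            raise RestartTransaction("connection to the master was lost")
        return self.storage[oid]


def run_transaction(client, oid, max_attempts=3):
    """Return (data, number of attempts), retrying on RestartTransaction."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.load(oid), attempt
        except RestartTransaction:
            continue
    raise RestartTransaction("gave up after %d attempts" % max_attempts)
```

From the user's point of view, a temporary connection failure only costs one extra attempt.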

11d83ad9

storage: add comment about the idea to lock an oid before reporting a resolvable conflict · 4e17456b

Julien Muchembled authored Jul 26, 2016

Currently, another argument not to lock is that we would not be able to test
incremental resolution anymore. We can think about this again when deadlock
resolution is implemented.

4e17456b

24 Jul, 2016 5 commits

Fix race conditions in EventManager between _poll/connection_dict and (un)registration · 8b91706a

Julien Muchembled authored Jul 24, 2016

The following error was reported on a client node:

    #0x0000 Error                   < None (2001:...:2051)
    1 (Retry Later)
    connection closed for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, closed, client) at 7f1ea7c42f90>
    Event Manager:
    connection started for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10>
    #0x0000 RequestIdentification          > None (2001:...:2051)
      Readers: []
      Writers: []
      Connections:
        13: <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> (pending=False)
    Node manager : 1 nodes
    * None |   MASTER | 2001:...:2051 | UNKNOWN
    <ClientCache history_size=0 oid_count=0 size=0 time=0 queue_length=[0] (life_time=10000 max_history_size=100000 max_size=20971520)>
    poll raised, retrying
    Traceback (most recent call last):
      File "neo/lib/threaded_app.py", line 93, in _run
        poll(1)
      File "neo/lib/event.py", line 134, in poll
        self._poll(0)
      File "neo/lib/event.py", line 164, in _poll
        conn = self.connection_dict[fd]
    KeyError: 13

which means that:
- while the poll thread is getting a (13, EPOLLIN) event because it is
  closed (aborted by the master)
- another thread processes the error packet, by closing it in
  PrimaryBootstrapHandler.notReady
- next, the poll thread resumes the execution of EpollEventManager._poll
  and fails to find fd=13 in self.connection_dict

So here, we have a race condition between epoll_wait and any further use
of connection_dict to map returned fds.

However, what commit a4731a0c does to handle
the case of fd reallocation only works for mono-threaded applications.
In EPOLLIN, wrapping 'self.connection_dict[fd]' the same way as for other
events is not enough. For example:
- case 1:
  - thread 1: epoll returns fd=13
  - thread 2: close(13)
  - thread 2: open(13)
  - thread 1: self.connection_dict[13] does not match
              but this would be handled by the 'unregistered' list
- case 2:
  - thread 1: reset 'unregistered'
  - thread 2: close(13)
  - thread 2: open(13)
  - thread 1: epoll returns fd=13
  - thread 1: self.connection_dict[13] matches
              but it would be wrongly ignored by 'unregistered'
- case 3:
  - thread 1: about to call readable/writable/onTimeout on a connection
  - thread 2: this connection is closed
  - thread 1: readable/writable/onTimeout wrongly called on a closed connection

We could protect _poll() with a lock, and make unregister() use wakeup() so
that it gets a chance to acquire it, but that causes threaded tests to deadlock
(continuing in this direction seems too complicated).

So we have to deal with the fact that there can be race conditions at any time
and there's no way to make 'connection_dict' match exactly what epoll returns.
We solve this by preventing fd reallocation inside _poll(), which is fortunately
possible with sockets, using 'shutdown': the closing of fds is delayed.

For above case 3, readable/writable/onTimeout for MTClientConnection are also
changed to test whether the connection is still open while it has the lock.
Just for safety, we do the same for 'process'.

Lastly, another kind of race condition that this commit also fixes concerns
the use of itervalues() on EventManager.connection_dict.
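
The idea of delaying the closing of fds can be sketched in a few lines. This is a simplified illustration, not the actual EventManager: the class `DeferredCloseManager` and its method names are hypothetical, and no real epoll loop is involved. Unregistering only shuts the socket down; the fd stays allocated until the poll step is over, so the OS cannot hand the same number to a new socket in between.

```python
import socket
import threading


class DeferredCloseManager:
    """Hypothetical sketch: shutdown now, close only after polling."""

    def __init__(self):
        self.connection_dict = {}   # fd -> socket
        self._pending_close = []
        self._lock = threading.Lock()

    def register(self, sock):
        self.connection_dict[sock.fileno()] = sock

    def unregister(self, sock):
        with self._lock:
            del self.connection_dict[sock.fileno()]
            # shutdown() ends the communication but keeps the fd number
            # allocated, so it cannot be reused by another thread yet.
            sock.shutdown(socket.SHUT_RDWR)
            self._pending_close.append(sock)

    def after_poll(self):
        # Called once the events returned by the poll have been mapped
        # to connections: only now is it safe to release the fds.
        with self._lock:
            pending, self._pending_close = self._pending_close, []
        for sock in pending:
            sock.close()
```

A connection opened between `unregister()` and `after_poll()` necessarily gets a different fd, which removes cases 1 and 2 above; case 3 still needs the check on the connection state under its lock.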

8b91706a

Indent many lines before any real change · 4a0b936f

Julien Muchembled authored Jul 22, 2016

This is a preliminary commit, without any functional change,
just to make the next one easier to review.

4a0b936f

client: remove redundant check of new connections to the master · 9f4dd15e
Julien Muchembled authored Jul 24, 2016

We already have logs when a connection fails,
and ask() raises ConnectionClosed if the connection is closed.

9f4dd15e

Control verbose locking via an environment variable · e791dc3f
Vincent Pelletier authored Jun 04, 2016

e791dc3f

client: avoid (harmless) variable shadowing · b7e0ec7f
Vincent Pelletier authored Jun 04, 2016

b7e0ec7f

13 Jul, 2016 1 commit

setup: first try to get 'mock.py' from the backup in repository · a4f34eaa
Julien Muchembled authored Jul 13, 2016

SourceForge currently has too many issues.

a4f34eaa

17 Jun, 2016 1 commit

tests: an expected failure was actually due to a misuse of undo API · 4dfdf05a

Julien Muchembled authored Jun 17, 2016

Obviously, oids can't be automatically invalidated if the undo is done directly
at the storage level.

In commit 9cca0f8e, only 1 bug was found.

4dfdf05a