1. 25 Nov, 2016 2 commits
  2. 21 Nov, 2016 2 commits
    • Julien Muchembled's avatar
      client: fix item eviction from cache, which could break loading from storage · 4ef05b9e
      Julien Muchembled authored
      `ClientCache._oid_dict` shall not have empty values. For a given oid, when the
      last item is removed from the cache, the oid must be removed as well to free
      memory. In some cases, this was not done.
      
      A consequence of this bug is the following exception:
      
          ERROR ZODB.Connection Couldn't load state for 0x02d1e1e4
          Traceback (most recent call last):
            File "ZODB/Connection.py", line 860, in setstate
              self._setstate(obj)
            File "ZODB/Connection.py", line 901, in _setstate
              p, serial = self._storage.load(obj._p_oid, '')
            File "neo/client/Storage.py", line 82, in load
              return self.app.load(oid)[:2]
            File "neo/client/app.py", line 358, in load
              self._cache.store(oid, data, tid, next_tid)
            File "neo/client/cache.py", line 228, in store
              prev = item_list[-1]
          IndexError: list index out of range
      4ef05b9e
    • Julien Muchembled's avatar
  3. 15 Nov, 2016 2 commits
    • Kirill Smelkov's avatar
      backup: Teach cluster in BACKUPING state to also serve regular ZODB clients in read-only mode · d4944062
      Kirill Smelkov authored
      A backup cluster for tids <= backup_tid has all data to provide regular
      read-only ZODB service. Having regular ZODB access to the data can be
      handy e.g. for externally verifying data for consistency between
      main and backup clusters. Peeking around without disturbing main
      cluster might be also useful sometimes.
      
      In this patch:
      
      - master & storage nodes are taught:
      
          * to instantiate read-only or regular client service handler depending on cluster state:
            RUNNING   -> regular
            BACKINGUP -> read-only
      
          * in read-only client handler:
            + to reject write-related operations
            + to provide read operations but adjust semantic as last_tid in the database
              would be = backup_tid
      
      - new READ_ONLY_ACCESS protocol error code is introduced so that client can
        raise POSException.ReadOnlyError upon receiving it.
      
      I have not implemented back-channel for invalidations in read-only mode (yet ?).
      This way once a client connects to cluster in backup state, it won't see
      new data fetched by backup cluster from upstream after client connected.
      
      The reasons invalidations are not implemented is that for now (imho)
      there is no off-hand ready infrastructure to get updates from
      replicating node on transaction-by-transaction basis (it currently only
      notifies when whole batch is done). For consistency verification (main
      reason for this patch) we also don't need invalidations to work, as in
      that task we always connect afresh to backup. So I simply only put
      relevant TODOs about invalidations for now.
      
      The patch is not very polished but should work.
      
      /reviewed-on nexedi/neoppod!4
      d4944062
    • Kirill Smelkov's avatar
  4. 27 Oct, 2016 1 commit
    • Iliya Manolov's avatar
      neoctl: make 'print ids' command display time of TIDs · d9dd39f0
      Iliya Manolov authored
      Currently, the command "neoctl [arguments] print ids" has the following output:
      
          last_oid = 0x...
          last_tid = 0x...
          last_ptid = ...
      
      or
      
          backup_tid = 0x...
          last_tid = 0x...
          last_ptid = ...
      
      depending on whether the cluster is in normal or backup mode.
      
      This is extremely unreadable since the admin is often interested in the time that corresponds to each tid. Now the output is:
      
          last_oid = 0x...
          last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
          last_ptid = ...
      
      or
      
          backup_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
          last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
          last_ptid = ...
      
      /reviewed-on nexedi/neoppod!2
      d9dd39f0
  5. 17 Oct, 2016 1 commit
    • Kirill Smelkov's avatar
      mysql: force _getNextTID() to use appropriate/whole index · eaa00a88
      Kirill Smelkov authored
      Similarly to 13911ca3 on the same instance after MariaDB was upgraded to
      10.1.17 the following query, even after `OPTIMIZE TABLE obj`, started to execute
      very slowly:
      
          MariaDB [(none)]> SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +--------------------+
          | tid                |
          +--------------------+
          | 268707072758797063 |
          +--------------------+
          1 row in set (4.82 sec)
      
      Both explain and analyze says the query will/is using `partition` key but only partially (note key_len is only 10, not 18):
      
          MariaDB [(none)]> SHOW INDEX FROM neo1.obj;
          +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          | Table | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
          +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          | obj   |          0 | PRIMARY   |            1 | partition   | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | PRIMARY   |            2 | tid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | PRIMARY   |            3 | oid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | partition |            1 | partition   | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | partition |            2 | oid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | partition |            3 | tid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          1 | data_id   |            1 | data_id     | A         |    28755928 |     NULL | NULL   | YES  | BTREE      |         |               |
          +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          7 rows in set (0.00 sec)
      
          MariaDB [(none)]> explain SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
          | id   | select_type | table | type | possible_keys     | key       | key_len | ref         | rows | Extra                    |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
          |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | partition | 10      | const,const |    2 | Using where; Using index |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
          1 row in set (0.00 sec)
      
          MariaDB [(none)]> analyze SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
          | id   | select_type | table | type | possible_keys     | key       | key_len | ref         | rows | r_rows     | filtered | r_filtered | Extra                    |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
          |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | partition | 10      | const,const |    2 | 9741121.00 |   100.00 |       0.00 | Using where; Using index |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
          1 row in set (4.93 sec)
      
      By explicitly forcing (partition, oid, tid) index usage which is precisely designed to serve this and similar queries can avoid the query from being slow:
      
          MariaDB [(none)]> analyze SELECT tid FROM neo1.obj FORCE INDEX(`partition`) WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
          | id   | select_type | table | type  | possible_keys | key       | key_len | ref  | rows | r_rows | filtered | r_filtered | Extra                    |
          +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
          |    1 | SIMPLE      | obj   | range | partition     | partition | 18      | NULL |    2 |   1.00 |   100.00 |     100.00 | Using where; Using index |
          +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
          1 row in set (0.00 sec)
      
      /cc @jm, @vpelltier, @Tyagov
      
      /reviewed-on nexedi/neoppod!1
      eaa00a88
  6. 12 Sep, 2016 1 commit
  7. 29 Aug, 2016 2 commits
    • Julien Muchembled's avatar
      mysql: fix use of wrong SQL index when checking for dropped partitions · 13911ca3
      Julien Muchembled authored
      After partitions were dropped with TokuDB, we had a case where MariaDB 10.1.14
      stopped using the most appropriate index.
      
      MariaDB [neo0]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=5;
      +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
      | id   | select_type | table | type  | possible_keys     | key     | key_len | ref  | rows | Extra                                 |
      +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
      |    1 | SIMPLE      | obj   | range | PRIMARY,partition | data_id | 11      | NULL |   10 | Using where; Using index for group-by |
      +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
      MariaDB [neo0]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=5;
      Empty set (1 min 51.47 sec)
      
      Expected:
      
      MariaDB [neo1]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=4;
      +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
      | id   | select_type | table | type | possible_keys     | key     | key_len | ref   | rows | Extra                        |
      +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
      |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | PRIMARY | 2       | const |    1 | Using where; Using temporary |
      +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
      1 row in set (0.00 sec)
      MariaDB [neo1]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=4;
      Empty set (0.00 sec)
      
      Restarting the server or 'OPTIMIZE TABLE obj; ' does not help.
      
      Such issue could prevent the cluster to start due to timeouts, by always going
      back to RECOVERING state.
      13911ca3
    • Julien Muchembled's avatar
      Update TODO · 00ffb1ef
      Julien Muchembled authored
      00ffb1ef
  8. 11 Aug, 2016 2 commits
    • Julien Muchembled's avatar
      Add test to check that a moved cell doesn't cause POSKeyError · df990a05
      Julien Muchembled authored
      Freeing disk space when a cell is dropped will have to be implemented with care,
      not only for performance reasons.
      df990a05
    • Julien Muchembled's avatar
      mysql: do not use unsafe TRUNCATE statement · c3c2ffe2
      Julien Muchembled authored
      TRUNCATE was chosen for performance reasons, but it's usually done on small
      tables, and not for performance-critical operations. TRUNCATE commits
      implicitely, so for pt/ttrans in particular, it's certainly slower due to extra
      fsyncs to disk.
      
      On the other side, committing too early can corrupt the database if the storage
      node is stopped just after. For example, a failure in changePartitionTable()
      can cause 'pt' to remain empty.
      c3c2ffe2
  9. 01 Aug, 2016 2 commits
  10. 31 Jul, 2016 1 commit
    • Julien Muchembled's avatar
      storage: review TransactionManager.abortFor · 2d388048
      Julien Muchembled authored
      This reverts commit 7aecdada partially.
      There seems to be no bug here, because:
      - abortFor() is only called upon a notification from the master that a client
        is disconnected,
      - and from the same TCP connection, we only receive a LockInformation packet
        if there's still such a transaction on the master side.
      
      The code removed in abortFor() was redundant with abort().
      2d388048
  11. 27 Jul, 2016 6 commits
    • Julien Muchembled's avatar
      cb144fdb
    • Julien Muchembled's avatar
      38583af9
    • Julien Muchembled's avatar
      client: do not limit the number of open connections to storage nodes · 77132157
      Julien Muchembled authored
      There was a bug that connections were not maintained during a TPC,
      which caused transactions to be aborted when the limit was reached.
      
      Given that oids are spreaded evenly over all partitions, and that clients always
      write to all cells of each involved partitions, clients would spend their time
      reconnecting to storage nodes as soon as the limit is reached. So such feature
      really looks counter-productive.
      77132157
    • Julien Muchembled's avatar
    • Julien Muchembled's avatar
      client: fix conflict of node id by never reading from storage without being connected to the master · 11d83ad9
      Julien Muchembled authored
      Client nodes ignored the state of the connection to the master node when reading
      data from storage, as long as their partition tables were recent enough. This
      way, they were able to finish read-only transactions even if they could't reach
      the master, which could be useful for high availability. The downside is that
      the master node ignored that their node ids were still used, which causes "uuid"
      conflicts when reallocating them.
      
      Rejected solutions:
      - An unused NEO Storage should not insist in staying connected to master node.
      - Reverting to big random node identifiers is a lot of work and it would make
        debugging annoying (see commit 23fad3af).
      - Always increasing node ids could have been a simple solution if we accepted
        that the cluster dies after that all 2^24 possible ids were allocated.
      
      Given that reading from storage without being connected to the master can only
      be useful to finish the current transaction (because we always ping the master
      at the beginning of every transaction), keeping such feature is not worth the
      effort.
      
      This commit fixes id conflicts in a very simple way, by clearing the partition
      table upon primary node failure, which forces reconnection to the master before
      querying any storage node. In such case, we raise a special exception that will
      cause the transaction to be restarted, so that the user does not get errors for
      temporary connection failures.
      11d83ad9
    • Julien Muchembled's avatar
      storage: add comment about the idea to lock an oid before reporting a resolvable conflict · 4e17456b
      Julien Muchembled authored
      Currently, another argument not to lock is that we would not be able to test
      incremental resolution anymore. We can think about this again when deadlock
      resolution is implemented.
      4e17456b
  12. 24 Jul, 2016 5 commits
    • Julien Muchembled's avatar
      Fix race conditions in EventManager between _poll/connection_dict and (un)registration · 8b91706a
      Julien Muchembled authored
      The following error was reported on a client node:
      
          #0x0000 Error                   < None (2001:...:2051)
          1 (Retry Later)
          connection closed for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, closed, client) at 7f1ea7c42f90>
          Event Manager:
          connection started for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10>
          #0x0000 RequestIdentification          > None (2001:...:2051)
            Readers: []
            Writers: []
            Connections:
              13: <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> (pending=False)
          Node manager : 1 nodes
          * None |   MASTER | 2001:...:2051 | UNKNOWN
          <ClientCache history_size=0 oid_count=0 size=0 time=0 queue_length=[0] (life_time=10000 max_history_size=100000 max_size=20971520)>
          poll raised, retrying
          Traceback (most recent call last):
            File "neo/lib/threaded_app.py", line 93, in _run
              poll(1)
            File "neo/lib/event.py", line 134, in poll
              self._poll(0)
            File "neo/lib/event.py", line 164, in _poll
              conn = self.connection_dict[fd]
          KeyError: 13
      
      which means that:
      - while the poll thread is getting a (13, EPOLLIN) event because it is
        closed (aborted by the master)
      - another thread processes the error packet, by closing it in
        PrimaryBootstrapHandler.notReady
      - next, the poll thread resumes the execution of EpollEventManager._poll
        and fails to find fd=13 in self.connection_dict
      
      So here, we have a race condition between epoll_wait and any further use
      of connection_dict to map returned fds.
      
      However, what commit a4731a0c does to handle
      the case of fd reallocation only works for mono-threaded applications.
      In EPOLLIN, wrapping 'self.connection_dict[fd]' the same way as for other
      events is not enough. For example:
      - case 1:
        - thread 1: epoll returns fd=13
        - thread 2: close(13)
        - thread 2: open(13)
        - thread 1: self.connection_dict[13] does not match
                    but this would be handled by the 'unregistered' list
      - case 2:
        - thread 1: reset 'unregistered'
        - thread 2: close(13)
        - thread 2: open(13)
        - thread 1: epoll returns fd=13
        - thread 1: self.connection_dict[13] matches
                    but it would be wrongly ignored by 'unregistered'
      - case 3:
        - thread 1: about to call readable/writable/onTimeout on a connection
        - thread 2: this connection is closed
        - thread 1: readable/writable/onTimeout wrongly called on a closed connection
      
      We could protect _poll() with a lock, and make unregister() use wakeup() so
      that it gets a chance to acquire it, but that causes threaded tests to deadlock
      (continuing in this direction seems too complicated).
      
      So we have to deal with the fact that there can be race conditions at any time
      and there's no way to make 'connection_dict' match exactly what epoll returns.
      We solve this by preventing fd reallocation inside _poll(), which is fortunately
      possible with sockets, using 'shutdown': the closing of fds is delayed.
      
      For above case 3, readable/writable/onTimeout for MTClientConnection are also
      changed to test whether the connection is still open while it has the lock.
      Just for safety, we do the same for 'process'.
      
      At last, another kind of race condition that this commit also fixes concerns
      the use of itervalues() on EventManager.connection_dict.
      8b91706a
    • Julien Muchembled's avatar
      Indent many lines before any real change · 4a0b936f
      Julien Muchembled authored
      This is a preliminary commit, without any functional change,
      just to make the next one easier to review.
      4a0b936f
    • Julien Muchembled's avatar
      client: remove redundant check of new connections to the master · 9f4dd15e
      Julien Muchembled authored
      We already have logs when a connection fails,
      and ask() raises ConnectionClosed if the connection is closed.
      9f4dd15e
    • Vincent Pelletier's avatar
      e791dc3f
    • Vincent Pelletier's avatar
      b7e0ec7f
  13. 13 Jul, 2016 1 commit
  14. 17 Jun, 2016 2 commits
  15. 15 Jun, 2016 2 commits
  16. 08 Jun, 2016 2 commits
  17. 26 May, 2016 1 commit
    • Julien Muchembled's avatar
      client: fix the count of history items in the cache · e61b017f
      Julien Muchembled authored
      Cache items are stored in double-linked chains. In order to quickly know the
      number of history items, an extra attribute is used to count them. It was not
      always decremented when a history item was removed.
      
      This led to the following exception:
        <ClientCache history_size=100000 oid_count=1959 size=20970973 time=2849049 queue_length=[1, 7, 738, 355, 480, 66, 255, 44, 3, 5, 2, 1, 3, 4, 2, 2] (life_time=10000 max_history_size=100000 max_size=20971520)>
        poll raised, retrying
        Traceback (most recent call last):
          ...
          File "neo/client/handlers/master.py", line 137, in packetReceived
            cache.store(oid, data, tid, None)
          File "neo/client/cache.py", line 247, in store
            self._add(head)
          File "neo/client/cache.py", line 129, in _add
            self._remove(head)
          File "neo/client/cache.py", line 136, in _remove
            level = item.level
        AttributeError: 'NoneType' object has no attribute 'level'
      e61b017f
  18. 25 Apr, 2016 1 commit
  19. 20 Apr, 2016 2 commits
    • Julien Muchembled's avatar
    • Julien Muchembled's avatar
      storage: fix crash when trying to replicate from an unreachable node · 8b07ff98
      Julien Muchembled authored
      This fixes the following issue:
      
      WARNING replication aborted for partition 1
      DEBUG   connection started for <ClientConnection(uuid=None, address=...:43776, handler=StorageOperationHandler, fd=10, on_close=onConnectionClosed, connecting, client) at 7f5d2067fdd0>
      DEBUG   connect failed for <SocketConnectorIPv6 at 0x7f5d2067fe10 fileno 10 ('::', 0), opened to ('...', 43776)>: ENETUNREACH (Network is unreachable)
      WARNING replication aborted for partition 5
      DEBUG   connection started for <ClientConnection(uuid=None, address=...:43776, handler=StorageOperationHandler, fd=10, on_close=onConnectionClosed, connecting, client) at 7f5d1c409510>
      PACKET  #0x0000 RequestIdentification          > None (...:43776)  | (<EnumItem STORAGE (1)>, None, ('...', 60533), '...')
      ERROR   Pre-mortem data:
      ERROR   Traceback (most recent call last):
      ERROR     File "neo/storage/app.py", line 157, in run
      ERROR       self._run()
      ERROR     File "neo/storage/app.py", line 197, in _run
      ERROR       self.doOperation()
      ERROR     File "neo/storage/app.py", line 285, in doOperation
      ERROR       poll()
      ERROR     File "neo/storage/app.py", line 95, in _poll
      ERROR       self.em.poll(1)
      ERROR     File "neo/lib/event.py", line 121, in poll
      ERROR       self._poll(blocking)
      ERROR     File "neo/lib/event.py", line 165, in _poll
      ERROR       if conn.readable():
      ERROR     File "neo/lib/connection.py", line 481, in readable
      ERROR       self._closure()
      ERROR     File "neo/lib/connection.py", line 539, in _closure
      ERROR       self.close()
      ERROR     File "neo/lib/connection.py", line 531, in close
      ERROR       handler.connectionClosed(self)
      ERROR     File "neo/lib/handler.py", line 135, in connectionClosed
      ERROR       self.connectionLost(conn, NodeStates.TEMPORARILY_DOWN)
      ERROR     File "neo/storage/handlers/storage.py", line 59, in connectionLost
      ERROR       replicator.abort()
      ERROR     File "neo/storage/replicator.py", line 339, in abort
      ERROR       self._nextPartition()
      ERROR     File "neo/storage/replicator.py", line 260, in _nextPartition
      ERROR       None if name else app.uuid, app.server, name or app.name))
      ERROR     File "neo/lib/connection.py", line 562, in ask
      ERROR       raise ConnectionClosed
      ERROR   ConnectionClosed
      8b07ff98
  20. 18 Apr, 2016 1 commit
  21. 01 Apr, 2016 1 commit