1. 17 Jul, 2023 8 commits
    • Levin Zimmermann's avatar
      Fix flaky `client_test.go/TestLoad` · 7a0674c2
      Levin Zimmermann authored
      /reviewed-by @kirr
      /reviewed-on kirr/neo!5
      
      * t-fix-flaky-testload:
        fixup! neonet/newlink: Fix lost conn in encoding detector
        go/neo/neonet: Fix client handshake not to accept server encoding if it is different from what client indicated
        go/neo/neonet: Demonstrate problem in handshake with NEO/py
        go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch" as handshake failure cause
        fixup! client_test: Keep NEO srv logs if test fails
        fixup! client_test/NEOSrv += LogContent for better debug
        neonet/newlink: Fix lost conn in encoding detector
        client_test: Keep NEO srv logs if test fails
        client_test += print NEO server log if >=1 test(s) failed
        client_test/NEOSrv += LogContent for better debug
      7a0674c2
    • Kirill Smelkov's avatar
      Merge branch 'master' into t-fix-flaky-testload · 917bacd2
      Kirill Smelkov authored
      * master:
        go/neo/neonet: Fix client handshake not to accept server encoding if it is different from what client indicated
        go/neo/neonet: Demonstrate problem in handshake with NEO/py
        go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch" as handshake failure cause
      917bacd2
    • Kirill Smelkov's avatar
    • Levin Zimmermann's avatar
      go/neo/neonet: Fix client handshake not to accept server encoding if it is... · 052856ce
      Levin Zimmermann authored
      go/neo/neonet: Fix client handshake not to accept server encoding if it is different from what client indicated
      
      If the peers encoding is different than our encoding two different
      scenarios can happen, because the handshake order is undefined (e.g.
      we don't know if our handshake is received before the peer sends its
      handshake):
      
      1. Our handshake is received before peer sends its handshake, NEO/py
      closes connection if it sees unexpected magic, version, etc.
      
      2. The client already sends a handshake before it proceeds our handshake.
      In this case it initally sends us it version, we can extract its encoding,
      and only later, once it proceeded our handshake with the bad encoding,
      it closes the connection.
      
      Before this patch case (2) wasn't handled correctly by the automatic
      encoding detection of 'DialLink'. 'DialLink' simply accepted the
      different-than-expected encoding, but once the peer proceeded the nodes
      handshake the peer closed the connection and the initially established
      and returned link was immediately closed again. Due to this it was good
      luck whether connecting with a peer different with an encoding different
      from the expected one worked or didn't work (it depended on which handshake
      was faster). Now 'DialLink' should reliably find the correct encoding
      and return a stable link.
      
      --------
      
      kirr: this is based on the following original patch by Levin: f6b59772
      
      I updated documentation throughout correspondingly and also added
      corresponding handshake-specific test in the previous patch.
      
      See kirr/neo!5 and b2da69e2
      (go/neo/neonet: Demonstrate problem in handshake with NEO/py) for more
      in-depth description of the problem.
      052856ce
    • Kirill Smelkov's avatar
      go/neo/neonet: Demonstrate problem in handshake with NEO/py · b2da69e2
      Kirill Smelkov authored
      Levin Zimmerman discovered that sometimes NEO/py accepts our handshake
      hello with encoding 'M', then replies its owh handshake ehlo with
      encoding 'N' and then further terminates the connection. In other words
      it looks like that the handshake went successful, but it actually did
      not and NEO/py terminates the link after some time.
      
      This manifests itself e.g. as infrequent TestLoad failures on t branch
      with the following output:
      
      === RUN   TestLoad/py/!ssl
      I: runneo.py: /tmp/neo776618506/1 !ssl: started master(s): 127.0.0.1:21151
      === RUN   TestLoad/py/!ssl/enc=N(dialTryOrder=N,M)
          client_test.go:598: skip: does not excercise client redial
      === RUN   TestLoad/py/!ssl/enc=N(dialTryOrder=M,N)
          xtesting.go:330: load 0285cbac258bf266:0000000000000000: returned err unexpected:
              have: neo://127.0.0.1:21151/1: load 0285cbac258bf266:0000000000000000: dial S1: dial 127.0.0.1:40345 (STORAGE): 127.0.0.1:56678 - 127.0.0.1:40345: request identification: 127.0.0.1:56678 - 127.0.0.1:40345 .1: recv: EOF
              want: nil
          xtesting.go:330: load 0285cbac258bf266:0000000000000000: returned tid unexpected: 0000000000000000  ; want: 0285cbac258bf266
          xtesting.go:330: load 0285cbac258bf266:0000000000000000: returned buf = nil
          xtesting.go:339: load 0285cbac258bf265:0000000000000000: returned err unexpected:
              have: neo://127.0.0.1:21151/1: load 0285cbac258bf265:0000000000000000: dial S1: dial 127.0.0.1:40345 (STORAGE): 127.0.0.1:56688 - 127.0.0.1:40345: request identification: 127.0.0.1:56688 - 127.0.0.1:40345 .1: recv: EOF
              want: neo://127.0.0.1:21151/1: load 0285cbac258bf265:0000000000000000: 0000000000000000: object was not yet created
          ...
          client_test.go:588: NEO log tail:
      
              log file 'storage_0.log' tail:
      
              2023-07-17 17:21:57.1519 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:51230, handler=IdentificationHandler, fd=20, server) at 7f3583fd4f50> (from 127.0.0.1:40345)
              2023-07-17 17:21:57.1537 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:51230, handler=IdentificationHandler, fd=20, server) at 7f3583fd4f50>
              2023-07-17 17:21:57.1548 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:51230, handler=IdentificationHandler, closed, server) at 7f3583fd4f50>
              2023-07-17 17:21:57.1551 WARNING   S1         A connection was lost during identification
              2023-07-17 17:22:00.1582 DEBUG     S1         accepted a connection from 127.0.0.1:51236
              2023-07-17 17:22:00.1585 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:51236, handler=IdentificationHandler, fd=20, server) at 7f3583fd4e90> (from 127.0.0.1:40345)
              2023-07-17 17:22:00.1604 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:51236, handler=IdentificationHandler, fd=20, server) at 7f3583fd4e90>
              2023-07-17 17:22:00.1622 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:51236, handler=IdentificationHandler, closed, server) at 7f3583fd4e90>
              2023-07-17 17:22:00.1625 WARNING   S1         A connection was lost during identification
              2023-07-17 17:22:03.1663 DEBUG     S1         accepted a connection from 127.0.0.1:51238
              2023-07-17 17:22:03.1666 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:51238, handler=IdentificationHandler, fd=20, server) at 7f3583fd4d10> (from 127.0.0.1:40345)
              2023-07-17 17:22:03.1674 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:51238, handler=IdentificationHandler, fd=20, server) at 7f3583fd4d10>
              2023-07-17 17:22:03.1688 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:51238, handler=IdentificationHandler, closed, server) at 7f3583fd4d10>
              2023-07-17 17:22:03.1691 WARNING   S1         A connection was lost during identification
              2023-07-17 17:22:06.1714 DEBUG     S1         accepted a connection from 127.0.0.1:57072
              2023-07-17 17:22:06.1719 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:57072, handler=IdentificationHandler, fd=20, server) at 7f3583fd4b50> (from 127.0.0.1:40345)
              2023-07-17 17:22:06.1727 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:57072, handler=IdentificationHandler, fd=20, server) at 7f3583fd4b50>
              2023-07-17 17:22:06.1738 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:57072, handler=IdentificationHandler, closed, server) at 7f3583fd4b50>
              2023-07-17 17:22:06.1738 WARNING   S1         A connection was lost during identification
      
              log file 'master_0.log' tail:
      
              2023-07-17 17:21:21.0799 PACKET    M1         #0x0012 NotifyNodeInformation          > A1 (127.0.0.1:37906)
              2023-07-17 17:21:21.0799 PACKET    M1          ! C0 | CLIENT |  | RUNNING | 2023-07-17 14:21:21.079838
              2023-07-17 17:21:21.0800 PACKET    M1         #0x0102 NotifyNodeInformation          > S1 (127.0.0.1:37918)
              2023-07-17 17:21:21.0800 PACKET    M1          ! C0 | CLIENT |  | RUNNING | 2023-07-17 14:21:21.079838
              2023-07-17 17:21:21.0801 DEBUG     M1         Handler changed on <ServerConnection(nid=None, address=127.0.0.1:37966, handler=ClientServiceHandler, fd=18, server) at 7f3584245910>
              2023-07-17 17:21:21.0802 PACKET    M1         #0x0001 AnswerRequestIdentification    > C0 (127.0.0.1:37966)
              2023-07-17 17:21:21.0804 PACKET    M1         #0x0000 NotifyNodeInformation          > C0 (127.0.0.1:37966)
              2023-07-17 17:21:21.0804 PACKET    M1          ! C0 | CLIENT  |                 | RUNNING | 2023-07-17 14:21:21.079838
              2023-07-17 17:21:21.0804 PACKET    M1          ! M1 | MASTER  | 127.0.0.1:21151 | RUNNING | None
              2023-07-17 17:21:21.0804 PACKET    M1          ! S1 | STORAGE | 127.0.0.1:40345 | RUNNING | 2023-07-17 14:21:18.737469
              2023-07-17 17:21:21.0805 PACKET    M1         #0x0002 NotifyPartitionTable           > C0 (127.0.0.1:37966)
              2023-07-17 17:21:21.0810 PACKET    M1         #0x0003 LastTransaction                < C0 (127.0.0.1:37966)
              2023-07-17 17:21:21.0811 PACKET    M1         #0x0003 AnswerLastTransaction          > C0 (127.0.0.1:37966)
              2023-07-17 17:22:06.2053 DEBUG     M1         <SocketConnectorIPv4 at 0x7f3584252d10 fileno 18 ('127.0.0.1', 21151), opened from ('127.0.0.1', 37966)> closed in recv
              2023-07-17 17:22:06.2056 DEBUG     M1         connection closed for <ServerConnection(nid=C0, address=127.0.0.1:37966, handler=ClientServiceHandler, closed, server) at 7f3584245910>
              2023-07-17 17:22:06.2058 PACKET    M1         #0x0014 NotifyNodeInformation          > A1 (127.0.0.1:37906)
              2023-07-17 17:22:06.2058 PACKET    M1          ! C0 | CLIENT |  | UNKNOWN | 2023-07-17 14:21:21.079838
              2023-07-17 17:22:06.2059 PACKET    M1         #0x0104 NotifyNodeInformation          > S1 (127.0.0.1:37918)
              2023-07-17 17:22:06.2059 PACKET    M1          ! C0 | CLIENT |  | UNKNOWN | 2023-07-17 14:21:21.079838
      
      The problem is due to that my analysis from e407f725 (go/neo/neonet:
      Rework handshake to differentiate client and server parts) turned out to
      be incorrect. Quoting that patch:
      
          -> Rework handshake so that client always sends its hello first, and
          only then the server side replies. This matches actual NEO/py behaviour:
      
          https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-67-g261dd4b4/neo/lib/connector.py#L293-294
      
          even though the "NEO protocol" states that
      
          	Handshake transmissions are not ordered with respect to each other and can go in parallel.
      
          	( https://neo.nexedi.com/P-NEO-Protocol.Specification.2019?portal_skin=CI_slideshow#/9/2 )
      
          If I recall correctly that sentence was authored by me in 2018 based on
          previous understanding of should-be full symmetry in-between client and
          server.
      
      so here "This matches actual NEO/py behaviour" was wrong: even though
      https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-67-g261dd4b4/neo/lib/connector.py#L293-294
      indeed says that
      
          #       The NEO protocol is such that a client connection is always the
          #       first to send a packet, as soon as the connection is established,
      
      in reality it is not the case as SocketConnector always queues handshake
      hello upon its creation before receiving anything from remote side:
      https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-93-gfd87e153/neo/lib/connector.py#L77-78 .
      In practice this leads to that in non-SSL case NEO/py server might be
      fast enough to send its prepared hello before receiving hello from us.
      Levin also explains at !5 (comment 187429):
      
          I think what happens is this:
      
          the NEO protocol doesn't specify in which order handshakes happen after
          initial dial. If the peer sends a handshake before receiving our
          handshake and if this peers handshake is received by us, 'DialLink'
          assumes everything is fine (no err is returned), it breaks the loop and
          returns the link. But then, very little time later, when the peer
          finally receives our handshake, this looks strange for the peer and it
          closes the connection.
      
          So in my understanding this should be fixed by explicitly comparing the
          encodings between our expected one and what the peer provided us. If
          encodings don't match we should retry with a new encoding in order to
          prevent the peer from closing the connection.
      
          For me this also explains why sometimes the tests passed and sometimes
          didn't: it depended on which node was faster ('race condition').
      
      -> In this patch we add correspondig handshake test that demonstrates
      this problem. It currently fails as
      
      --- FAIL: TestHandshake (0.01s)
          --- FAIL: TestHandshake/enc=N (0.00s)
              newlink_test.go:154: handshake encoding mismatch: client: unexpected error:
                  have: <nil>
                       "<nil>"
                  want: &neonet._HandshakeError{LocalRole:1, LocalAddr:net.pipeAddr{}, RemoteAddr:net.pipeAddr{}, Err:(*neonet._EncodingMismatchError)(0xc0000a4190)}
                       "pipe - pipe: handshake (client): protocol encoding mismatch: peer = 'M'  ; our side = 'N'"
          --- FAIL: TestHandshake/enc=M (0.00s)
              newlink_test.go:154: handshake encoding mismatch: client: unexpected error:
                  have: <nil>
                       "<nil>"
                  want: &neonet._HandshakeError{LocalRole:1, LocalAddr:net.pipeAddr{}, RemoteAddr:net.pipeAddr{}, Err:(*neonet._EncodingMismatchError)(0xc0001a22cc)}
                       "pipe - pipe: handshake (client): protocol encoding mismatch: peer = 'N'  ; our side = 'M'"
      
      We will fix it in the next patch.
      
      /reported-by @levin.zimmermann
      /reported-on !5
      b2da69e2
    • Kirill Smelkov's avatar
      go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch"... · c100a0c9
      Kirill Smelkov authored
      go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch" as handshake failure cause
      
      We will soon need to detect if a handshake failure was due to mismatch
      of protocol encodings and that would require introduction of dedicated
      error type for that cause. As a preparatory step first refactor "version
      mismatch cause" to follow the same style for symmetry.
      c100a0c9
    • Kirill Smelkov's avatar
      088502fd
    • Kirill Smelkov's avatar
      fixup! client_test/NEOSrv += LogContent for better debug · 91d65a96
      Kirill Smelkov authored
      LogContent -> LogTail
      py/neolog  -> `python -m neo.scripts.neolog`
      FIXME for glog
      FIXME for admin_0.log and other log files
      91d65a96
  2. 07 Jul, 2023 1 commit
    • Levin Zimmermann's avatar
      neonet/newlink: Fix lost conn in encoding detector · f6b59772
      Levin Zimmermann authored
      If the peers encoding is different than our encoding two different
      scenarios can happen, because the handshake order is undefined (e.g.
      we don't know if our handshake is received before the peer sends its
      handshake):
      
      1. Our handshake is received before peer sends its handshake, NEO/py
      closes connection if it sees unexpected magic, version, etc.
      
      2. The client already sends a handshake before it proceeds our handshake.
      In this case it initally sends us it version, we can extract its encoding,
      and only later, once it proceeded our handshake with the bad encoding,
      it closes the connection.
      
      Before this patch case (2) wasn't handled correctly by the automatic
      encoding detection of 'DialLink'. 'DialLink' simply accepted the
      different-than-expected encoding, but once the peer proceeded the nodes
      handshake the peer closed the connection and the initially established
      and returned link was immediately closed again. Due to this it was good
      luck whether connecting with a peer different with an encoding different
      from the expected one worked or didn't work (it depended on which handshake
      was faster). Now 'DialLink' should reliably find the correct encoding
      and return a stable link.
      f6b59772
  3. 29 Jun, 2023 3 commits
  4. 24 May, 2023 2 commits
    • Kirill Smelkov's avatar
      Y client: Adjust URI scheme to move client-specific options to fragment · 4c9414ea
      Kirill Smelkov authored
      For example option `compress` controls kind of compression that _client_
      performs when saving data to server. Similarly cache-size, logfile and
      read-only adjust on-client behaviour, not server.
      
      From nexedi/neoppod!18 (comment 124725) :
      
          In general the most correct thing to do is:
      
          - use host part for where to connect (host:port, list of host ports, UNIX socket, etc)
          - use path part to identify a database or other on-server resource
          - use query part for parameters that are passed to remote server (e.g. `storage` in case of ZEO)
          - use fragment part for local parameters that are not passed to remote server (e.g. local `logfile`)
          - use credentials part for things required to authenticate/encrypt.
      
          To normalize an URL wcfs client would drop credentials and fragment, but keep host, path and query.
      
          Fragments are documented not to be sent to remote side and to be evaluated by local side only.
      
      -> Move options that control client behaviour to fragment.
      4c9414ea
    • Kirill Smelkov's avatar
      Y client: Don't allow master_nodes and name to be present in options · 6047f893
      Kirill Smelkov authored
      Because list of masters and cluster name must be already present in
      netloc and path. Previously e.g.
      
      	neo://α,β,γ/db?master_nodes=a,b,c"
      
      would mean to use master nodes {a,b,c} not {α,β,γ}. Now it is treated as
      invalid URI to remove ambiguity. Same for cluster name.
      6047f893
  5. 18 Jan, 2023 3 commits
  6. 04 Nov, 2022 1 commit
  7. 03 Nov, 2022 1 commit
    • Kirill Smelkov's avatar
      Merge branch 'master' into t · a875f56c
      Kirill Smelkov authored
      * master:
        go/neo/t/neotest: Use python -c 'print ...' in a way that works on both py2 and py3
        go/neo/t/tzodb.py: Fix zhash for Python3
      a875f56c
  8. 18 May, 2022 2 commits
    • Kirill Smelkov's avatar
      go/neo/t/neotest: Use python -c 'print ...' in a way that works on both py2 and py3 · 789c9ed9
      Kirill Smelkov authored
      Without parenthesis it was failing on py3:
      
      	(neo) (py3.venv) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/kirr/neo/go/neo/t$ ./neotest info-local
      	date:   Wed, 18 May 2022 11:05:50 +0300
      	xnode:  kirr@deca.navytux.spb.ru (2401:5180:0:af::1 192.168.0.3 (+ 1·ipv4))
      	uname:  Linux deca 5.10.0-13-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
      	cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
      	  File "<string>", line 1
      	    print '%.2fGHz' % (400000 / 1E6)
      	          ^
      	SyntaxError: invalid syntax
      789c9ed9
    • Kirill Smelkov's avatar
      go/neo/t/tzodb.py: Fix zhash for Python3 · 30329f5a
      Kirill Smelkov authored
      On py3 dict.keys() returns iterator instead of list:
      
          $ ./tzodb.py zhash
          Traceback (most recent call last):
            File "/home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/neo/t/./tzodb.py", line 141, in <module>
              main()
            File "/home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/neo/t/./tzodb.py", line 138, in main
              zhash()
            File "/home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/neo/t/./tzodb.py", line 59, in zhash
              optv, argv = getopt(sys.argv[2:], "h", ["help", "check=", "bench="] + hashRegistry.keys())
          TypeError: can only concatenate list (not "dict_keys") to list
      30329f5a
  9. 25 Nov, 2021 2 commits
  10. 24 Nov, 2021 1 commit
    • Kirill Smelkov's avatar
      X go.mod: v↑ go123 · dfe935b7
      Kirill Smelkov authored
      Instead of requiring users to explicitly mark their tests as finished in
      particular subject-to-deadlock places, tracetest was fixed to
      unconditionally detect "extra sends" deadlocks automatically:
      
      go123@3b19f68c
      dfe935b7
  11. 19 Nov, 2021 2 commits
  12. 12 Nov, 2021 1 commit
    • Kirill Smelkov's avatar
      X go.mod: v↑ * · ea538368
      Kirill Smelkov authored
      go get: upgraded github.com/fsnotify/fsnotify v1.4.10-0.20200417215612-7f4cf4dd2b52 => v1.5.1
      go get: upgraded github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b => v1.0.0
      go get: upgraded github.com/gwenn/gosqlite v0.0.0-20201008200117-82e079acf5b6 => v0.0.0-20211101095637-b18efb2e44c8
      go get: upgraded github.com/gwenn/yacr v0.0.0-20200112083327-bbe82c1f4d60 => v0.0.0-20211101095056-492fb0c571bc
      go get: upgraded github.com/kisielk/og-rek v1.1.0 => v1.2.0
      go get: upgraded github.com/soheilhy/cmux v0.1.5-0.20210205191134-5ec6847320e5 => v0.1.5
      go get: upgraded github.com/tinylib/msgp v1.1.5 => v1.1.6
      go get: upgraded golang.org/x/net v0.0.0-20210226172049-e18ecbb05110 => v0.0.0-20211111160137-58aab5ef257a
      go get: upgraded golang.org/x/sys v0.0.0-20210301091718-77cc2087c03b => v0.0.0-20211111213525-f221eed1c01e
      go get: upgraded golang.org/x/text v0.3.5 => v0.3.7
      go get: upgraded lab.nexedi.com/kirr/go123 v0.0.0-20210128150852-c20e95f0f789 => v0.0.0-20210906140734-c9eb28d9e408
      
      Highlights:
      
      - og-rek brings support for pickle protocol 5
      - go123  brings in support for Go1.17
      ea538368
  13. 04 Oct, 2021 13 commits