Commits · a4a9d69d02ba8c24b49b0239ecbbd58b3397b9cb · Kirill Smelkov / neo

02 Feb, 2024 1 commit

X: Apply new URI scheme to NEO/go + some refactors and tests of URL parser · a4a9d69d

Kirill Smelkov authored Feb 02, 2024

/reviewed-by @kirr
/reviewed-on kirr/neo!4

* kirr/t+new-uri:
  Revert "Y client: Adjust URI scheme to move client-specific options to fragment"
  fixup! client.go: Fix URI client option parsing for supported + unsupported options
  client.go: Fix URI client option parsing for supported + unsupported options
  fixup! client_test: Add tests for NEO URI parser
  client_test: Add tests for NEO URI parser
  fixup! client: Refactor openClientByURL for easier testing
  client: Refactor openClientByURL for easier testing
  Y go/zodb: Handle common options in zurl in generic layer

a4a9d69d

29 Jan, 2024 8 commits

Revert "Y client: Adjust URI scheme to move client-specific options to fragment" · c9490507

Levin Zimmermann authored Dec 21, 2023

This reverts commit kirr/neo@4c9414ea.
This patch was added at a time when nexedi/neoppod!18 wasn't
resolved yet, but we already wanted to proceed with WCFS. Now the NEO MR
is resolved and we decided to mostly leave the NEO zurl as it was
originally implemented in nexedi/neoppod!6.
This means we don't need this patch anymore which changed the NEO
zurl format.

c9490507

fixup! client.go: Fix URI client option parsing for supported + unsupported options · f1a1bb9d
Kirill Smelkov authored Jan 29, 2024
```
readonly is handled by common zodb.OpenDriver.
```
f1a1bb9d

client.go: Fix URI client option parsing for supported + unsupported options · d6c33660

Levin Zimmermann authored Jun 01, 2023

Before this patch, the parser ignored options which were already supported
by the client (for instance 'read-only') and even raised an error. But the
client can already use this option: as a9246333 describes this should happen in the
local storage URL parser.

Furthermore not-yet-supported client options (for instance compress) broke
the NEO client before this patch. Now these options only raise a warning which
informs the user that they are ignored. Why? We want to use pre-complete NEO
in real-world projects together with NEO/py clusters. Those real-world projects
may already specify options which aren't supported by our NEO/go client yet.
But it doesn't matters so much, because those options are mostly
relevant for other NEO/py cluster clients (e.g. zope nodes). Instead of
filtering those parameters before parsing them to NEO/go in a higher
level (e.g. SlapOS), NEO/go should already support any valid NEO URL
and raise warnings for not yet implemented features.

d6c33660

fixup! client_test: Add tests for NEO URI parser · 4e9311d5

Kirill Smelkov authored Jan 29, 2024

- use simplified parseURL signature - DriverOptions are not passed nor changed there.
- read-only is handled by generic zodb layer not neo.parseURL .

4e9311d5

client_test: Add tests for NEO URI parser · 1fca6ad4

Levin Zimmermann authored Jun 01, 2023

This test was missing so far. Particularly recent changes of the
NEO URI scheme [1], but also problems with valid old URI [2] stressed
out the necessity for comprehensive NEO URI parser tests.

[1] kirr/neo@4c9414ea
[2] 573514c6 (comment 184417)

1fca6ad4

fixup! client: Refactor openClientByURL for easier testing · 2aa9b909

Kirill Smelkov authored Jan 26, 2024

- no need to pass DriverOptions into parseURL - it is only zurl that is
  parsed, and also DriverOptions should not be changed by the opener.

- no need to document "If anything fails within this process an error
  and nil are returned." because that is standard omnipresent Go convention.

2aa9b909

client: Refactor openClientByURL for easier testing · 7bad0dda

Levin Zimmermann authored May 31, 2023

With all the recent changes of the NEO URI scheme we need to reliably test
the function which parses the URI and convert it into the different
parameter. Testing is much simpler if we can only analyse how the URI
parsing works. Therefore this patch moves NEO URI parsing to an
external function.

7bad0dda

Y go/zodb: Handle common options in zurl in generic layer · 8c39d23e

Kirill Smelkov authored Jan 29, 2024

Offload drivers from handling options such as ?read-only=1 and force
them to deal with such options only via DriverOptions, never zurl.

See added commend for details.

8c39d23e

22 Aug, 2023 1 commit

fixup! fixup! Node: Add support for NEO cluster with > 1 master · dc36ffc7

Kirill Smelkov authored Aug 22, 2023

I missed the following build failure in go/neo/cmd:

    # lab.nexedi.com/kirr/neo/go/neo/cmd/neo
    ./storage.go:128:37: cannot use master (variable of type string) as []string value in argument to neo.NewStorage

dc36ffc7

02 Aug, 2023 8 commits

X: Teach NEO/go to handle multiple master nodes · db81e0de

Levin Zimmermann authored Aug 02, 2023

See kirr/neo!2 for discussion,
context and details.

/reviewed-by @kirr

* t-with-multiple-master-nodes:
  fixup! client_test: Add nmaster={1,2} to test matrix
  fixup! client_test: Support test cluster /w >1 master
  fixup! TalkMaster: Switch master if dialed M is secondary
  fixup! Node: Add support for NEO cluster with > 1 master
  fixup! Dial: Catch NotPrimaryMaster & return custom error
  fixup! proto: Implement Error for NotPrimaryMaster
  fixup! proto.NotPrimaryMaster: Fix .Primary data type (2)
  fixup! proto.NotPrimaryMaster: Fix .Primary data type (1)
  client_test: Add nmaster={1,2} to test matrix
  client_test: Support test cluster /w >1 master
  proto.NotPrimaryMaster: Fix .Primary data type
  TalkMaster: Switch master if dialed M is secondary
  Dial: Catch NotPrimaryMaster & return custom error
  proto: Implement Error for NotPrimaryMaster
  openClientByURL: Fix for >1 master (split URL host)
  Client.URL: Fix incomplete URL if > 1 master nodes
  Node: Add support for NEO cluster with > 1 master

db81e0de

fixup! client_test: Add nmaster={1,2} to test matrix · 4605cba1
Kirill Smelkov authored Aug 02, 2023
```
Actually do test nmaster=2 case.
```
4605cba1

fixup! client_test: Support test cluster /w >1 master · f96c8ccd

Kirill Smelkov authored Aug 02, 2023

- show all options in error context and ran test kind

- skip nmaster > 1 for NEO/go as that is currently not implemented on NEO/go server

- use json for interacting with runneo.py, so that we can use
  whatever builtin type for any argument without hardcoding ad-hoc
  handling of specific arguments inside runneo.py.

- adjust comments + cosmetics

f96c8ccd

fixup! TalkMaster: Switch master if dialed M is secondary · 3c37c0fc

Kirill Smelkov authored Aug 02, 2023

- validate received NotPrimaryMaster
- use Address.String() instead of printf with format that works only for ipv6
- add some logging, comments and TODO

3c37c0fc

fixup! Node: Add support for NEO cluster with > 1 master · 354887e3

Kirill Smelkov authored Aug 02, 2023

- no need to keep Node.MasterAddr anymore - the address of current PM is
  managed by TalkMaster and is provided as part of operational context
  to user functions that TalkMaster runs.

- correct docstrings.

- cosmetics.

354887e3

fixup! Dial: Catch NotPrimaryMaster & return custom error · 20f76288
Kirill Smelkov authored Aug 02, 2023
```
Expect NotPrimaryMaster only if we are trying to connect to a master.
```
20f76288
fixup! proto: Implement Error for NotPrimaryMaster · 7efe5d04
Kirill Smelkov authored Aug 02, 2023
```
Provide details in the error message.
```
7efe5d04

fixup! proto.NotPrimaryMaster: Fix .Primary data type (2) · 43c19d59

Kirill Smelkov authored Aug 01, 2023

Change .Promary type from int8 back to int32.

5d93e434 says that .Primary type is not NodeID. That is true, but
changing it to int8 was a mistake:

1. PSignedNull is explicitly defined to come with '!l' struct code, which
   according to https://docs.python.org/3/library/struct.html#module-struct
   comes as 4-bytes integer on the wire:

   https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-13-gf2ea4be2/neo/lib/protocol.py#L560-562

2. verifying this via serializing NotPrimaryMaster on NEO/py also
   confirms that .Primary occupies 4 bytes, not one:

   In [1]: from neo.lib.protocol import NotPrimaryMaster

   In [2]: NotPrimaryMaster(0x01020304, [('m111', 111), ('m222', 222)])._body
   Out[2]: '\x01\x02\x03\x04\x00\x00\x00\x02\x00\x00\x00\x04m111\x00o\x00\x00\x00\x04m222\x00\xde'
            ^^^^^^^^^^^^^^^^
            NOTE  NOTE  NOTE

-> change .Primary type back to being 4-bytes integer, but to int32
instead of NodeID because, as 5d93e434 correctly says, .Primary comes as
array index, not a node ID. The following place of NEO/py code
explicitly confirms this:

https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-13-gf2ea4be2/neo/master/handlers/identification.py#L155-159

Add corresponding test.

43c19d59

01 Aug, 2023 1 commit

fixup! proto.NotPrimaryMaster: Fix .Primary data type (1) · 3cb82317

Kirill Smelkov authored Aug 01, 2023

Rerun `go generate`. As the diff in zproto-marshal.go shows changing
NotPrimaryMaster.Primary type from NodeID to int8 actually does make a
difference. This happens because NodeID type is based on int32 and
changing that to int8 changes how NotPrimaryMaster structure is layed
out in memory and on the wire.

The changes to zproto-marshal.go in 5d93e434 seem to be done by hand
and not matching the change to proto.go even though head of
zproto-marshal.go says

	// Code generated by protogen.go; DO NOT EDIT.

3cb82317

18 Jul, 2023 9 commits

client_test: Add nmaster={1,2} to test matrix · b2929804
Levin Zimmermann authored Jul 18, 2023
```
Tests should work with both one master or more than one masters.
```
b2929804

client_test: Support test cluster /w >1 master · 6dba6409

Levin Zimmermann authored Jul 18, 2023

Now it's possible to run client tests against a NEO cluster which
has more than one master nodes. We need this adjustement in order
to test NEO/go client modification in order to support more than
one master node.

6dba6409

proto.NotPrimaryMaster: Fix .Primary data type · 5d93e434

Levin Zimmermann authored Apr 26, 2023

The '.Primary' attribute of the 'NotPrimaryMaster' packet has been
assigned to 'NodeID' data type. This is incorrect, because the data
doesn't represent the ID of the node, but an index of the
'.KnownMasterList' [1]. In the old protocol NEO/py therefore also
used 'PSignedNull' instead of 'PUUID' [2]. This patches fixes the data
type of '.Primary' and uses 'int8' instead of 'NodeID'. Technically this
doesn't make any difference, but semantically for human beings the code
is easier to understand now.

[1] https://lab.nexedi.com/nexedi/neoppod/blob/c6453626/neo/lib/handler.py#L161
[2] https://lab.nexedi.com/nexedi/neoppod/blob/c6453626/neo/lib/protocol.py#L716

5d93e434

TalkMaster: Switch master if dialed M is secondary · 22e5d1e9

Levin Zimmermann authored Apr 26, 2023

When connecting to a master node, the client needs to try a different
master if the initially tried one is a secondary master node. This
statement wasn't implemented yet before this patch and therefore it was
good luck if the initally tried master was the primary one - and the
connection worked - or if it was a secondary master - and the client
got stuck in re-trying the same node forever. This patch makes NEO/go
usage with clusters of more than one master therefore much more stable.

22e5d1e9

Dial: Catch NotPrimaryMaster & return custom error · bbf9f440

Levin Zimmermann authored Apr 26, 2023

After initial handshake a NEO node checks the identification of its peer
by sending the 'RequestIdentification' packet. In case the peer is a
secondary master it responds with 'NotPrimaryMaster'. Before this patch
'Dial' ignored the 'NotPrimaryMaster' packet and simply returned a
general error. Now - after this patch - 'Dial' returns an instance of
'proto.NotPrimaryMaster' (which implements 'Error').
This helps a caller to correctly handle the secondary-master-case, which
otherwise is impossible to differentiate from any other error
possibility.

bbf9f440

proto: Implement Error for NotPrimaryMaster · 8811f8b4

Levin Zimmermann authored Apr 26, 2023

When a client receives 'NotPrimaryMaster' from a secondary master, the
situation is similar to the situation when we receive an error: the
other node tells us, don't connect with me, connect with someone else.
Finally the peer even closes the connection.

Due to this similarity in structure (& because it helps us later to
teach NEO/go to correctly handle 'NotPrimaryMaster' with minimal
changes), we implement 'Error' for 'proto.NotPrimaryMaster'. Now
'NotPrimaryMaster' can be treated like an error.

8811f8b4

openClientByURL: Fix for >1 master (split URL host) · 2a75cdb0

Levin Zimmermann authored Apr 12, 2023

In a NEO URI more than one master node can be specified, because a NEO
cluster may have more masters than one. But before this patch
'openClientByURL' always assumed that the given URL only specifies one
master. Now the host is split into potentially > 1 master nodes. It
therefore works now in the same way as the Python implementation [1].

[1] https://lab.nexedi.com/nexedi/neoppod/blob/342168cd/neo/client/zodburi.py#L64

2a75cdb0

Client.URL: Fix incomplete URL if > 1 master nodes · 5ba1b669

Levin Zimmermann authored Apr 13, 2023

Before this patch Client.URL didn't contain more than one master node.
This can be problematic in case we have a cluster with > 1 master nodes
and the printed master is a secondary master (which may be down). In
this case the user who trusts the "URL" attribute to connect to the
cluster won't succeed, because the primary master can't be reached.

5ba1b669

Node: Add support for NEO cluster with > 1 master · 655c4cc3

Levin Zimmermann authored Apr 13, 2023

Some NEO clusters have more than one master to gain a higher availability.
Before this patch NEO/go Node type only handled one master address. This
commit adjusts the node type and related bits so that it can support more than
one master node.

655c4cc3

17 Jul, 2023 8 commits

Fix flaky `client_test.go/TestLoad` · 7a0674c2

Levin Zimmermann authored Jul 17, 2023

/reviewed-by @kirr
/reviewed-on kirr/neo!5

* t-fix-flaky-testload:
  fixup! neonet/newlink: Fix lost conn in encoding detector
  go/neo/neonet: Fix client handshake not to accept server encoding if it is different from what client indicated
  go/neo/neonet: Demonstrate problem in handshake with NEO/py
  go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch" as handshake failure cause
  fixup! client_test: Keep NEO srv logs if test fails
  fixup! client_test/NEOSrv += LogContent for better debug
  neonet/newlink: Fix lost conn in encoding detector
  client_test: Keep NEO srv logs if test fails
  client_test += print NEO server log if >=1 test(s) failed
  client_test/NEOSrv += LogContent for better debug

7a0674c2

Merge branch 'master' into t-fix-flaky-testload · 917bacd2

Kirill Smelkov authored Jul 17, 2023

* master:
  go/neo/neonet: Fix client handshake not to accept server encoding if it is different from what client indicated
  go/neo/neonet: Demonstrate problem in handshake with NEO/py
  go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch" as handshake failure cause

917bacd2

fixup! neonet/newlink: Fix lost conn in encoding detector · fb4d8efc
Kirill Smelkov authored Jul 17, 2023

fb4d8efc

go/neo/neonet: Fix client handshake not to accept server encoding if it is... · 052856ce

Levin Zimmermann authored Jul 07, 2023

go/neo/neonet: Fix client handshake not to accept server encoding if it is different from what client indicated

If the peers encoding is different than our encoding two different
scenarios can happen, because the handshake order is undefined (e.g.
we don't know if our handshake is received before the peer sends its
handshake):

1. Our handshake is received before peer sends its handshake, NEO/py
closes connection if it sees unexpected magic, version, etc.

2. The client already sends a handshake before it proceeds our handshake.
In this case it initally sends us it version, we can extract its encoding,
and only later, once it proceeded our handshake with the bad encoding,
it closes the connection.

Before this patch case (2) wasn't handled correctly by the automatic
encoding detection of 'DialLink'. 'DialLink' simply accepted the
different-than-expected encoding, but once the peer proceeded the nodes
handshake the peer closed the connection and the initially established
and returned link was immediately closed again. Due to this it was good
luck whether connecting with a peer different with an encoding different
from the expected one worked or didn't work (it depended on which handshake
was faster). Now 'DialLink' should reliably find the correct encoding
and return a stable link.

--------

kirr: this is based on the following original patch by Levin: levin.zimmermann/neoppod@f6b59772

I updated documentation throughout correspondingly and also added
corresponding handshake-specific test in the previous patch.

See !5 and b2da69e2
(go/neo/neonet: Demonstrate problem in handshake with NEO/py) for more
in-depth description of the problem.

052856ce

go/neo/neonet: Demonstrate problem in handshake with NEO/py · b2da69e2

Kirill Smelkov authored Jul 17, 2023

Levin Zimmerman discovered that sometimes NEO/py accepts our handshake
hello with encoding 'M', then replies its owh handshake ehlo with
encoding 'N' and then further terminates the connection. In other words
it looks like that the handshake went successful, but it actually did
not and NEO/py terminates the link after some time.

This manifests itself e.g. as infrequent TestLoad failures on t branch
with the following output:

=== RUN   TestLoad/py/!ssl
I: runneo.py: /tmp/neo776618506/1 !ssl: started master(s): 127.0.0.1:21151
=== RUN   TestLoad/py/!ssl/enc=N(dialTryOrder=N,M)
    client_test.go:598: skip: does not excercise client redial
=== RUN   TestLoad/py/!ssl/enc=N(dialTryOrder=M,N)
    xtesting.go:330: load 0285cbac258bf266:0000000000000000: returned err unexpected:
        have: neo://127.0.0.1:21151/1: load 0285cbac258bf266:0000000000000000: dial S1: dial 127.0.0.1:40345 (STORAGE): 127.0.0.1:56678 - 127.0.0.1:40345: request identification: 127.0.0.1:56678 - 127.0.0.1:40345 .1: recv: EOF
        want: nil
    xtesting.go:330: load 0285cbac258bf266:0000000000000000: returned tid unexpected: 0000000000000000  ; want: 0285cbac258bf266
    xtesting.go:330: load 0285cbac258bf266:0000000000000000: returned buf = nil
    xtesting.go:339: load 0285cbac258bf265:0000000000000000: returned err unexpected:
        have: neo://127.0.0.1:21151/1: load 0285cbac258bf265:0000000000000000: dial S1: dial 127.0.0.1:40345 (STORAGE): 127.0.0.1:56688 - 127.0.0.1:40345: request identification: 127.0.0.1:56688 - 127.0.0.1:40345 .1: recv: EOF
        want: neo://127.0.0.1:21151/1: load 0285cbac258bf265:0000000000000000: 0000000000000000: object was not yet created
    ...
    client_test.go:588: NEO log tail:

        log file 'storage_0.log' tail:

        2023-07-17 17:21:57.1519 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:51230, handler=IdentificationHandler, fd=20, server) at 7f3583fd4f50> (from 127.0.0.1:40345)
        2023-07-17 17:21:57.1537 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:51230, handler=IdentificationHandler, fd=20, server) at 7f3583fd4f50>
        2023-07-17 17:21:57.1548 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:51230, handler=IdentificationHandler, closed, server) at 7f3583fd4f50>
        2023-07-17 17:21:57.1551 WARNING   S1         A connection was lost during identification
        2023-07-17 17:22:00.1582 DEBUG     S1         accepted a connection from 127.0.0.1:51236
        2023-07-17 17:22:00.1585 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:51236, handler=IdentificationHandler, fd=20, server) at 7f3583fd4e90> (from 127.0.0.1:40345)
        2023-07-17 17:22:00.1604 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:51236, handler=IdentificationHandler, fd=20, server) at 7f3583fd4e90>
        2023-07-17 17:22:00.1622 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:51236, handler=IdentificationHandler, closed, server) at 7f3583fd4e90>
        2023-07-17 17:22:00.1625 WARNING   S1         A connection was lost during identification
        2023-07-17 17:22:03.1663 DEBUG     S1         accepted a connection from 127.0.0.1:51238
        2023-07-17 17:22:03.1666 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:51238, handler=IdentificationHandler, fd=20, server) at 7f3583fd4d10> (from 127.0.0.1:40345)
        2023-07-17 17:22:03.1674 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:51238, handler=IdentificationHandler, fd=20, server) at 7f3583fd4d10>
        2023-07-17 17:22:03.1688 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:51238, handler=IdentificationHandler, closed, server) at 7f3583fd4d10>
        2023-07-17 17:22:03.1691 WARNING   S1         A connection was lost during identification
        2023-07-17 17:22:06.1714 DEBUG     S1         accepted a connection from 127.0.0.1:57072
        2023-07-17 17:22:06.1719 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:57072, handler=IdentificationHandler, fd=20, server) at 7f3583fd4b50> (from 127.0.0.1:40345)
        2023-07-17 17:22:06.1727 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:57072, handler=IdentificationHandler, fd=20, server) at 7f3583fd4b50>
        2023-07-17 17:22:06.1738 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:57072, handler=IdentificationHandler, closed, server) at 7f3583fd4b50>
        2023-07-17 17:22:06.1738 WARNING   S1         A connection was lost during identification

        log file 'master_0.log' tail:

        2023-07-17 17:21:21.0799 PACKET    M1         #0x0012 NotifyNodeInformation          > A1 (127.0.0.1:37906)
        2023-07-17 17:21:21.0799 PACKET    M1          ! C0 | CLIENT |  | RUNNING | 2023-07-17 14:21:21.079838
        2023-07-17 17:21:21.0800 PACKET    M1         #0x0102 NotifyNodeInformation          > S1 (127.0.0.1:37918)
        2023-07-17 17:21:21.0800 PACKET    M1          ! C0 | CLIENT |  | RUNNING | 2023-07-17 14:21:21.079838
        2023-07-17 17:21:21.0801 DEBUG     M1         Handler changed on <ServerConnection(nid=None, address=127.0.0.1:37966, handler=ClientServiceHandler, fd=18, server) at 7f3584245910>
        2023-07-17 17:21:21.0802 PACKET    M1         #0x0001 AnswerRequestIdentification    > C0 (127.0.0.1:37966)
        2023-07-17 17:21:21.0804 PACKET    M1         #0x0000 NotifyNodeInformation          > C0 (127.0.0.1:37966)
        2023-07-17 17:21:21.0804 PACKET    M1          ! C0 | CLIENT  |                 | RUNNING | 2023-07-17 14:21:21.079838
        2023-07-17 17:21:21.0804 PACKET    M1          ! M1 | MASTER  | 127.0.0.1:21151 | RUNNING | None
        2023-07-17 17:21:21.0804 PACKET    M1          ! S1 | STORAGE | 127.0.0.1:40345 | RUNNING | 2023-07-17 14:21:18.737469
        2023-07-17 17:21:21.0805 PACKET    M1         #0x0002 NotifyPartitionTable           > C0 (127.0.0.1:37966)
        2023-07-17 17:21:21.0810 PACKET    M1         #0x0003 LastTransaction                < C0 (127.0.0.1:37966)
        2023-07-17 17:21:21.0811 PACKET    M1         #0x0003 AnswerLastTransaction          > C0 (127.0.0.1:37966)
        2023-07-17 17:22:06.2053 DEBUG     M1         <SocketConnectorIPv4 at 0x7f3584252d10 fileno 18 ('127.0.0.1', 21151), opened from ('127.0.0.1', 37966)> closed in recv
        2023-07-17 17:22:06.2056 DEBUG     M1         connection closed for <ServerConnection(nid=C0, address=127.0.0.1:37966, handler=ClientServiceHandler, closed, server) at 7f3584245910>
        2023-07-17 17:22:06.2058 PACKET    M1         #0x0014 NotifyNodeInformation          > A1 (127.0.0.1:37906)
        2023-07-17 17:22:06.2058 PACKET    M1          ! C0 | CLIENT |  | UNKNOWN | 2023-07-17 14:21:21.079838
        2023-07-17 17:22:06.2059 PACKET    M1         #0x0104 NotifyNodeInformation          > S1 (127.0.0.1:37918)
        2023-07-17 17:22:06.2059 PACKET    M1          ! C0 | CLIENT |  | UNKNOWN | 2023-07-17 14:21:21.079838

The problem is due to that my analysis from e407f725 (go/neo/neonet:
Rework handshake to differentiate client and server parts) turned out to
be incorrect. Quoting that patch:

    -> Rework handshake so that client always sends its hello first, and
    only then the server side replies. This matches actual NEO/py behaviour:

    https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-67-g261dd4b4/neo/lib/connector.py#L293-294

    even though the "NEO protocol" states that

    	Handshake transmissions are not ordered with respect to each other and can go in parallel.

    	( https://neo.nexedi.com/P-NEO-Protocol.Specification.2019?portal_skin=CI_slideshow#/9/2 )

    If I recall correctly that sentence was authored by me in 2018 based on
    previous understanding of should-be full symmetry in-between client and
    server.

so here "This matches actual NEO/py behaviour" was wrong: even though
https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-67-g261dd4b4/neo/lib/connector.py#L293-294
indeed says that

    #       The NEO protocol is such that a client connection is always the
    #       first to send a packet, as soon as the connection is established,

in reality it is not the case as SocketConnector always queues handshake
hello upon its creation before receiving anything from remote side:
https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-93-gfd87e153/neo/lib/connector.py#L77-78 .
In practice this leads to that in non-SSL case NEO/py server might be
fast enough to send its prepared hello before receiving hello from us.
Levin also explains at kirr/neo!5 (comment 187429):

    I think what happens is this:

    the NEO protocol doesn't specify in which order handshakes happen after
    initial dial. If the peer sends a handshake before receiving our
    handshake and if this peers handshake is received by us, 'DialLink'
    assumes everything is fine (no err is returned), it breaks the loop and
    returns the link. But then, very little time later, when the peer
    finally receives our handshake, this looks strange for the peer and it
    closes the connection.

    So in my understanding this should be fixed by explicitly comparing the
    encodings between our expected one and what the peer provided us. If
    encodings don't match we should retry with a new encoding in order to
    prevent the peer from closing the connection.

    For me this also explains why sometimes the tests passed and sometimes
    didn't: it depended on which node was faster ('race condition').

-> In this patch we add correspondig handshake test that demonstrates
this problem. It currently fails as

--- FAIL: TestHandshake (0.01s)
    --- FAIL: TestHandshake/enc=N (0.00s)
        newlink_test.go:154: handshake encoding mismatch: client: unexpected error:
            have: <nil>
                 "<nil>"
            want: &neonet._HandshakeError{LocalRole:1, LocalAddr:net.pipeAddr{}, RemoteAddr:net.pipeAddr{}, Err:(*neonet._EncodingMismatchError)(0xc0000a4190)}
                 "pipe - pipe: handshake (client): protocol encoding mismatch: peer = 'M'  ; our side = 'N'"
    --- FAIL: TestHandshake/enc=M (0.00s)
        newlink_test.go:154: handshake encoding mismatch: client: unexpected error:
            have: <nil>
                 "<nil>"
            want: &neonet._HandshakeError{LocalRole:1, LocalAddr:net.pipeAddr{}, RemoteAddr:net.pipeAddr{}, Err:(*neonet._EncodingMismatchError)(0xc0001a22cc)}
                 "pipe - pipe: handshake (client): protocol encoding mismatch: peer = 'N'  ; our side = 'M'"

We will fix it in the next patch.

/reported-by @levin.zimmermann
/reported-on kirr/neo!5

b2da69e2

go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch"... · c100a0c9

Kirill Smelkov authored Jul 17, 2023

go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch" as handshake failure cause

We will soon need to detect if a handshake failure was due to mismatch
of protocol encodings and that would require introduction of dedicated
error type for that cause. As a preparatory step first refactor "version
mismatch cause" to follow the same style for symmetry.

c100a0c9

fixup! client_test: Keep NEO srv logs if test fails · 088502fd
Kirill Smelkov authored Jul 17, 2023

088502fd

fixup! client_test/NEOSrv += LogContent for better debug · 91d65a96

Kirill Smelkov authored Jul 17, 2023

LogContent -> LogTail
py/neolog  -> `python -m neo.scripts.neolog`
FIXME for glog
FIXME for admin_0.log and other log files

91d65a96

07 Jul, 2023 1 commit

neonet/newlink: Fix lost conn in encoding detector · f6b59772

Levin Zimmermann authored Jul 07, 2023

1. Our handshake is received before peer sends its handshake, NEO/py
closes connection if it sees unexpected magic, version, etc.

f6b59772

29 Jun, 2023 3 commits

client_test: Keep NEO srv logs if test fails · 3d4e20f6

Levin Zimmermann authored Jun 29, 2023

If a test fails we may want to invest further why this test fails and
reading the NEO server log files can be an efficient way for these
investigations. Therefore we should keep log files if a test fails.

3d4e20f6

client_test += print NEO server log if >=1 test(s) failed · a1a9ff5b
Levin Zimmermann authored Jun 29, 2023

a1a9ff5b

client_test/NEOSrv += LogContent for better debug · 5b4ca2cd

Levin Zimmermann authored Jun 29, 2023

If tests fail we may want to check NEO logs. The new function LogContent
of the NEOSrv interface provides an API to fetch the relevant parts of
the log file of a NEO server. This API can be used by test code in order
to automatically print relevant server log information if a test fails.

5b4ca2cd