1. 20 Aug, 2024 1 commit
    • Levin Zimmermann's avatar
      identification: Set possible answers to same msgid · 20b37b60
      Levin Zimmermann authored
      When a node tries to connect to another node it initially sends a
      'RequestIdentification' packet. The other node can either reply with
      'AcceptIdentification' or in case of a secondary master with
      'NotPrimaryMaster'.
      
      In the second case the message id differs from the initial requests
      message id. This makes it difficult in a multi-threaded implementation
      to proceed this answer: due to the different msg/connection - id the
      multi-threaded implementation tries to proceed this incoming message in
      a different thread, while the requesting thread waits forever for its
      peers reply.
      
      The most straightforward solution is to use the same connection - id for
      both possible answers to the 'RequestIdentification' packet. This
      doesn't break given NEO/py implementation and is only a small patch.
      A workaround in a multi-threaded implementation on the other hand
      seems to be much more complicated and time-consuming. Finally it also makes
      sense semantically, because "Message IDs are used to identify response
      packets" and in the given context the 'NotPrimaryMaster' *is* the de facto
      response of a 'RequestIdentification'.
      
      /suggested-at !2
      /proposed-for-review-at nexedi/neoppod!20
      20b37b60
  2. 19 Aug, 2024 5 commits
    • Kirill Smelkov's avatar
      Merge branch 'master' into t · efdfb19d
      Kirill Smelkov authored
      * master:
        go/neo/proto: Fix KnownMasterList nesting
        go/neo/proto/RowInfo: Fix representation on the wire
      efdfb19d
    • Levin Zimmermann's avatar
      go/neo/proto: Fix KnownMasterList nesting · dc844465
      Levin Zimmermann authored
      Before this patch, the 'KnownMasterList' field of the 'NotPrimaryMaster'
      was expected to be structured in the following way:
      
      	ArrayHeader (KnownMasterList)
      
      	    ArrayHeader (KnownMaster)
      
      		ArrayHeader (Address)
      
      		    Host (string)
      		    Port (uint16)
      
      However NEO/py sends the following structure:
      
      	ArrayHeader (KnownMasterList)
      
      	    ArrayHeader (Address)
      
      		Host (string)
      		Port (uint16)
      
      This also makes sense, as 'KnownMaster' doesn't need to add another
      nesting, because it only includes the address.
      
      This patch amends the NEO/go protocol definition to transparently
      represent the nesting as it's send by NEO/py. See also
      levin.zimmermann/neoppod@18287612 for a similar issue.
      
      ----
      kirr: tests currently live only on t branch.
      
      /reviewed-by @kirr
      /reviewed-on !6
      dc844465
    • Levin Zimmermann's avatar
      go/neo/proto/RowInfo: Fix representation on the wire · 7f96ad77
      Levin Zimmermann authored
      Some NEO protocol packets have the field 'RowList'. This field contains
      information about each row of a partition table. In NEO/go the information
      of each row is represented with the 'RowInfo' type [1]. This type is defined
      as a struct with the field ‘CellList’. ‘CellList’ is defined as a list of
      'CellInfo' [1] (e.g. an entry for each cell).
      
      NEO/go {en,de}codes struct types with ‘genStructHead’ (structures in
      golang are encoded as arrays in msgpack) [2]. From the 'RowList'
      definition, the msgpack decoder currently expects the following msgpack
      array structure:
      
          ArrayHeader (RowList)
      
              ArrayHeader (RowInfo)
      
                  ArrayHeader (CellList)
      
                      ArrayHeader (CellInfo)
      
                          int32 (NID)
                          enum (State)
      
      However NEO/py actually sends:
      
          ArrayHeader (RowList)
      
              ArrayHeader (CellList)
      
                  ArrayHeader (CellInfo)
      
                      int32 (NID)
                      enum (State)
      
      In other words, currently the NEO/go msgpack encoder expects one more
      nesting, which NEO/py doesn’t provide (and which also doesn’t seem to
      be necessary as the outer nesting would always only contain one
      element). We could adjust the msgpack {en,de}coder to introduce an
      exception for the 'RowInfo' type, however as the protocol definition in
      'proto.go' aims to transparently reflect the structure of the packets on
      the wire, it seems to be more appropriate to fix this straight in the
      protocol definition. This is also less error-prone as we don't have to
      fix all the different positions of the encoder, decoder & sizer and it's
      less code (particularly if 'RowInfo' doesn't stay the only case for such
      an issue).
      
      [1] https://lab.nexedi.com/kirr/neo/-/blob/1ad088c8/go/neo/proto/proto.go#L391-394
      [2] https://lab.nexedi.com/kirr/neo/-/blob/1ad088c8/go/neo/proto/protogen.go#L1770-1775
      
      --------
      
      kirr: I've applied the following interdiff to the original patch of
      levin.zimmermann/neoppod@c93d5dbc :
      
          --- a/go/neo/neo_test.go
          +++ b/go/neo/neo_test.go
          @@ -100,7 +100,7 @@ func _TestMasterStorage(t0 *tEnv) {
                          PTid:           1,
                          NumReplicas:    0,
                          RowList:        []proto.RowInfo{
          -                       proto.RowInfo{proto.CellInfo{proto.NID(proto.STORAGE, 1), proto.UP_TO_DATE}},
          +                       {proto.CellInfo{proto.NID(proto.STORAGE, 1), proto.UP_TO_DATE}},
                          },
                  }))
      
          @@ -173,7 +173,7 @@ func _TestMasterStorage(t0 *tEnv) {
                          PTid:           1,
                          NumReplicas:    0,
                          RowList:        []proto.RowInfo{
          -                       proto.RowInfo{proto.CellInfo{proto.NID(proto.STORAGE, 1), proto.UP_TO_DATE}},
          +                       {proto.CellInfo{proto.NID(proto.STORAGE, 1), proto.UP_TO_DATE}},
                          },
                  }))
      
          --- a/go/neo/proto/proto_test.go
          +++ b/go/neo/proto/proto_test.go
          @@ -210,9 +210,9 @@ func TestMsgMarshal(t *testing.T) {
                                  PTid:        0x0102030405060708,
                                  NumReplicas: 34,
                                  RowList: []RowInfo{
          -                               {CellInfo{11, UP_TO_DATE}, CellInfo{17, OUT_OF_DATE}},
          -                               {CellInfo{11, FEEDING}},
          -                               {CellInfo{11, CORRUPTED}, CellInfo{15, DISCARDED}, CellInfo{23, UP_TO_DATE}},
          +                               {{11, UP_TO_DATE}, {17, OUT_OF_DATE}},
          +                               {{11, FEEDING}},
          +                               {{11, CORRUPTED}, {15, DISCARDED}, {23, UP_TO_DATE}},
                                  },
                                  },
      
          @@ -229,9 +229,9 @@ func TestMsgMarshal(t *testing.T) {
                                  hex("cf0102030405060708") +
                                  hex("22") +
                                  hex("93") +
          -                               hex("92"+"920bd40001"+"9211d40000") +
          -                               hex("91"+"920bd40002") +
          -                               hex("93"+"920bd40003"+"920fd40004"+"9217d40001"),
          +                               hex("92"+"920bd40401"+"9211d40400") +
          +                               hex("91"+"920bd40402") +
          +                               hex("93"+"920bd40403"+"920fd40404"+"9217d40401"),
                          },
      
                          // map[Oid]struct {Tid,Tid,bool}
      
      for cosmetics and because the tests were failing as
      
          --- FAIL: TestMsgMarshal (0.00s)
              proto_test.go:106: M/proto.AnswerPartitionTable: encode result unexpected:
              proto_test.go:107:  have: 93cf0102030405060708229392920bd404019211d4040091920bd4040293920bd40403920fd404049217d40401
              proto_test.go:108:  want: 93cf0102030405060708229392920bd400019211d4000091920bd4000293920bd40003920fd400049217d40001
      
      /reviewed-by @kirr
      /reviewed-on !6
      7f96ad77
    • Levin Zimmermann's avatar
      go/neo/proto: Fix KnownMasterList nesting · 30c69d9a
      Levin Zimmermann authored
      Before this patch, the 'KnownMasterList' field of the 'NotPrimaryMaster'
      was expected to be structured in the following way:
      
      	ArrayHeader (KnownMasterList)
      
      	    ArrayHeader (KnownMaster)
      
      		ArrayHeader (Address)
      
      		    Host (string)
      		    Port (uint16)
      
      However NEO/py sends the following structure:
      
      	ArrayHeader (KnownMasterList)
      
      	    ArrayHeader (Address)
      
      		Host (string)
      		Port (uint16)
      
      This also makes sense, as 'KnownMaster' doesn't need to add another
      nesting, because it only includes the address.
      
      This patch amends the NEO/go protocol definition to transparently
      represent the nesting as it's send by NEO/py. See also
      levin.zimmermann/neoppod@18287612 for a similar issue.
      
      /reviewed-by @kirr
      /reviewed-on !6
      30c69d9a
    • Levin Zimmermann's avatar
      go/neo/proto/RowInfo: Fix representation on the wire · 2cf10114
      Levin Zimmermann authored
      Some NEO protocol packets have the field 'RowList'. This field contains
      information about each row of a partition table. In NEO/go the information
      of each row is represented with the 'RowInfo' type [1]. This type is defined
      as a struct with the field ‘CellList’. ‘CellList’ is defined as a list of
      'CellInfo' [1] (e.g. an entry for each cell).
      
      NEO/go {en,de}codes struct types with ‘genStructHead’ (structures in
      golang are encoded as arrays in msgpack) [2]. From the 'RowList'
      definition, the msgpack decoder currently expects the following msgpack
      array structure:
      
          ArrayHeader (RowList)
      
              ArrayHeader (RowInfo)
      
                  ArrayHeader (CellList)
      
                      ArrayHeader (CellInfo)
      
                          int32 (NID)
                          enum (State)
      
      However NEO/py actually sends:
      
          ArrayHeader (RowList)
      
              ArrayHeader (CellList)
      
                  ArrayHeader (CellInfo)
      
                      int32 (NID)
                      enum (State)
      
      In other words, currently the NEO/go msgpack encoder expects one more
      nesting, which NEO/py doesn’t provide (and which also doesn’t seem to
      be necessary as the outer nesting would always only contain one
      element). We could adjust the msgpack {en,de}coder to introduce an
      exception for the 'RowInfo' type, however as the protocol definition in
      'proto.go' aims to transparently reflect the structure of the packets on
      the wire, it seems to be more appropriate to fix this straight in the
      protocol definition. This is also less error-prone as we don't have to
      fix all the different positions of the encoder, decoder & sizer and it's
      less code (particularly if 'RowInfo' doesn't stay the only case for such
      an issue).
      
      [1] https://lab.nexedi.com/kirr/neo/-/blob/1ad088c8/go/neo/proto/proto.go#L391-394
      [2] https://lab.nexedi.com/kirr/neo/-/blob/1ad088c8/go/neo/proto/protogen.go#L1770-1775
      
      --------
      
      kirr: I've applied the following interdiff to the original patch of
      levin.zimmermann/neoppod@c93d5dbc :
      
          --- a/go/neo/neo_test.go
          +++ b/go/neo/neo_test.go
          @@ -100,7 +100,7 @@ func _TestMasterStorage(t0 *tEnv) {
                          PTid:           1,
                          NumReplicas:    0,
                          RowList:        []proto.RowInfo{
          -                       proto.RowInfo{proto.CellInfo{proto.NID(proto.STORAGE, 1), proto.UP_TO_DATE}},
          +                       {proto.CellInfo{proto.NID(proto.STORAGE, 1), proto.UP_TO_DATE}},
                          },
                  }))
      
          @@ -173,7 +173,7 @@ func _TestMasterStorage(t0 *tEnv) {
                          PTid:           1,
                          NumReplicas:    0,
                          RowList:        []proto.RowInfo{
          -                       proto.RowInfo{proto.CellInfo{proto.NID(proto.STORAGE, 1), proto.UP_TO_DATE}},
          +                       {proto.CellInfo{proto.NID(proto.STORAGE, 1), proto.UP_TO_DATE}},
                          },
                  }))
      
          --- a/go/neo/proto/proto_test.go
          +++ b/go/neo/proto/proto_test.go
          @@ -210,9 +210,9 @@ func TestMsgMarshal(t *testing.T) {
                                  PTid:        0x0102030405060708,
                                  NumReplicas: 34,
                                  RowList: []RowInfo{
          -                               {CellInfo{11, UP_TO_DATE}, CellInfo{17, OUT_OF_DATE}},
          -                               {CellInfo{11, FEEDING}},
          -                               {CellInfo{11, CORRUPTED}, CellInfo{15, DISCARDED}, CellInfo{23, UP_TO_DATE}},
          +                               {{11, UP_TO_DATE}, {17, OUT_OF_DATE}},
          +                               {{11, FEEDING}},
          +                               {{11, CORRUPTED}, {15, DISCARDED}, {23, UP_TO_DATE}},
                                  },
                                  },
      
          @@ -229,9 +229,9 @@ func TestMsgMarshal(t *testing.T) {
                                  hex("cf0102030405060708") +
                                  hex("22") +
                                  hex("93") +
          -                               hex("92"+"920bd40001"+"9211d40000") +
          -                               hex("91"+"920bd40002") +
          -                               hex("93"+"920bd40003"+"920fd40004"+"9217d40001"),
          +                               hex("92"+"920bd40401"+"9211d40400") +
          +                               hex("91"+"920bd40402") +
          +                               hex("93"+"920bd40403"+"920fd40404"+"9217d40401"),
                          },
      
                          // map[Oid]struct {Tid,Tid,bool}
      
      for cosmetics and because the tests were failing as
      
          --- FAIL: TestMsgMarshal (0.00s)
              proto_test.go:106: M/proto.AnswerPartitionTable: encode result unexpected:
              proto_test.go:107:  have: 93cf0102030405060708229392920bd404019211d4040091920bd4040293920bd40403920fd404049217d40401
              proto_test.go:108:  want: 93cf0102030405060708229392920bd400019211d4000091920bd4000293920bd40003920fd400049217d40001
      
      /reviewed-by @kirr
      /reviewed-on !6
      2cf10114
  3. 06 Aug, 2024 1 commit
    • Kirill Smelkov's avatar
      Merge branch 'master' into t · 6fb93a60
      Kirill Smelkov authored
      * master:
        go/neo/neonet: DialLink: Fix SIGSEGV in case client handshake fails
        go/neo/neonet: Demonstrate DialLink misbehaviour when all handshake attempts fail
      6fb93a60
  4. 04 Aug, 2024 2 commits
    • Levin Zimmermann's avatar
      go/neo/neonet: DialLink: Fix SIGSEGV in case client handshake fails · d75f4ac2
      Levin Zimmermann authored
      In case the last 'handshakeClient' call returns an error, 'DialLink'
      returns 'link = nil, err = nil'. Callers of 'DialLink' then don't
      recognize that 'link' is 'nil', as it's the convention to only check if
      'err' is 'nil', which leads to a 'segmentation violation' as soon as
      subsequent code tries to access fields of 'link':
      
      ```
      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x14 pc=0x7087ae]
      
      goroutine 5 [running]:
      lab.nexedi.com/kirr/neo/go/neo/neonet.(*NodeLink).NewConn(0x0)
      	/srv/slapgrid/slappart82/srv/runner/instance/slappart6/software_release/parts/wendelin.core/wcfs/neo/go/neo/neonet/connection.go:404 +0x4e
      lab.nexedi.com/kirr/neo/go/neo/xneo.Dial.func1()
      	/srv/slapgrid/slappart82/srv/runner/instance/slappart6/software_release/parts/wendelin.core/wcfs/neo/go/neo/xneo/connect.go:138 +0x52
      lab.nexedi.com/kirr/neo/go/internal/xio.WithCloseOnErrCancel.func2()
      	/srv/slapgrid/slappart82/srv/runner/instance/slappart6/software_release/parts/wendelin.core/wcfs/neo/go/internal/xio/xio.go:114 +0x6a
      created by lab.nexedi.com/kirr/neo/go/internal/xio.WithCloseOnErrCancel in goroutine 21
      	/srv/slapgrid/slappart82/srv/runner/instance/slappart6/software_release/parts/wendelin.core/wcfs/neo/go/internal/xio/xio.go:109 +0x1ad
      ```
      
      This patch fixes this issue so that now 'err' and 'link' are never both
      'nil' again.
      
      /reviewed-by @kirr
      /reviewed-on !10
      d75f4ac2
    • Kirill Smelkov's avatar
      go/neo/neonet: Demonstrate DialLink misbehaviour when all handshake attempts fail · 03db1d8a
      Kirill Smelkov authored
      Levin found that when all handshake attempts fail DialLink returns
      both link=nil and err=nil which breaks what callers expect and lead to
      segmentation fault when accessing that nil link.
      
      -> Add test to demonstrate the problem.
      
      With xfail removed that test currently fails as
      
          --- FAIL: TestDialLink_AllHandshakeErr (0.00s)
          panic: lab.nexedi.com/kirr/neo/go/neo/neonet.TestDialLink_AllHandshakeErr.gox.func4.1: lab.nexedi.com/kirr/neo/go/neo/neonet.TestDialLink_AllHandshakeErr.func2: DialLink to handshake-rejecting server:
          have: link=<nil> err=<nil>
          want: link=<nil> err=client:1 - server:2: handshake (client): unexpected EOF [recovered]
      
      We will fix the problem in the next patch.
      
      /reported-by @levin.zimmermann
      /reported-at !10
      03db1d8a
  5. 23 Jul, 2024 2 commits
  6. 21 Jul, 2024 1 commit
    • Kirill Smelkov's avatar
      X: Sync zurl format with NEO/py · 95572d6a
      Kirill Smelkov authored
      Hello Kirill,
      
      in nexedi/neoppod!18 and nexedi/neoppod!21 we could find a common solution for a zurl format that previously diverged between NEOgo and NEOpy. The purpose of this MR is to sync again NEOgo and NEOpy zurl format. After merging this, we can continue to sync NEO zurl format in 'wendelin.core' & 'slapos'. Then we finally have unified approach again, which simplifies understanding and reduces unnecessary mental overhead.
      
      As this is strongly related to nexedi/neoppod!21 I thought it'd be a good idea to generally reduce difference and to replace WIP commits with merged NEOpy upstream commits.
      
      Best, Levin
      
      /reviewed-by @kirr
      /reviewed-on kirr/neo!7
      
      * lev/sync-zurl:
        client: Don't allow oPtion_nAme in zurl
        app: Remember SSL credentials so that it is possible to retrieve them
        client: Allow to force TLS via neos:// scheme
        client: Don't allow master_nodes and name to be present in options
        Revert "."
        Revert "Y client: Fix URI scheme to move credentials out of query"
        Revert "X Adjust NEO/go to neo:// URL change + py fixups"
        Revert "fixup! Y client: Fix URI scheme to move credentials out of query"
        Revert "Y client: Don't allow master_nodes and name to be present in options"
        go/client/zurl: Sync format to py upstream
      95572d6a
  7. 19 Jul, 2024 10 commits
  8. 02 Feb, 2024 3 commits
  9. 29 Jan, 2024 8 commits
  10. 22 Aug, 2023 1 commit
  11. 02 Aug, 2023 6 commits
    • Levin Zimmermann's avatar
      X: Teach NEO/go to handle multiple master nodes · db81e0de
      Levin Zimmermann authored
      See !2 for discussion,
      context and details.
      
      /reviewed-by @kirr
      
      * t-with-multiple-master-nodes:
        fixup! client_test: Add nmaster={1,2} to test matrix
        fixup! client_test: Support test cluster /w >1 master
        fixup! TalkMaster: Switch master if dialed M is secondary
        fixup! Node: Add support for NEO cluster with > 1 master
        fixup! Dial: Catch NotPrimaryMaster & return custom error
        fixup! proto: Implement Error for NotPrimaryMaster
        fixup! proto.NotPrimaryMaster: Fix .Primary data type (2)
        fixup! proto.NotPrimaryMaster: Fix .Primary data type (1)
        client_test: Add nmaster={1,2} to test matrix
        client_test: Support test cluster /w >1 master
        proto.NotPrimaryMaster: Fix .Primary data type
        TalkMaster: Switch master if dialed M is secondary
        Dial: Catch NotPrimaryMaster & return custom error
        proto: Implement Error for NotPrimaryMaster
        openClientByURL: Fix for >1 master (split URL host)
        Client.URL: Fix incomplete URL if > 1 master nodes
        Node: Add support for NEO cluster with > 1 master
      db81e0de
    • Kirill Smelkov's avatar
      fixup! client_test: Add nmaster={1,2} to test matrix · 4605cba1
      Kirill Smelkov authored
      Actually do test nmaster=2 case.
      4605cba1
    • Kirill Smelkov's avatar
      fixup! client_test: Support test cluster /w >1 master · f96c8ccd
      Kirill Smelkov authored
      - show all options in error context and ran test kind
      
      - skip nmaster > 1 for NEO/go as that is currently not implemented on NEO/go server
      
      - use json for interacting with runneo.py, so that we can use
        whatever builtin type for any argument without hardcoding ad-hoc
        handling of specific arguments inside runneo.py.
      
      - adjust comments + cosmetics
      f96c8ccd
    • Kirill Smelkov's avatar
      fixup! TalkMaster: Switch master if dialed M is secondary · 3c37c0fc
      Kirill Smelkov authored
      - validate received NotPrimaryMaster
      - use Address.String() instead of printf with format that works only for ipv6
      - add some logging, comments and TODO
      3c37c0fc
    • Kirill Smelkov's avatar
      fixup! Node: Add support for NEO cluster with > 1 master · 354887e3
      Kirill Smelkov authored
      - no need to keep Node.MasterAddr anymore - the address of current PM is
        managed by TalkMaster and is provided as part of operational context
        to user functions that TalkMaster runs.
      
      - correct docstrings.
      
      - cosmetics.
      354887e3
    • Kirill Smelkov's avatar
      fixup! Dial: Catch NotPrimaryMaster & return custom error · 20f76288
      Kirill Smelkov authored
      Expect NotPrimaryMaster only if we are trying to connect to a master.
      20f76288