1. 04 Aug, 2024 2 commits
    • Levin Zimmermann's avatar
      go/neo/neonet: DialLink: Fix SIGSEGV in case client handshake fails · d75f4ac2
      Levin Zimmermann authored
      In case the last 'handshakeClient' call returns an error, 'DialLink'
      returns 'link = nil, err = nil'. Callers of 'DialLink' then don't
      recognize that 'link' is 'nil', as it's the convention to only check if
      'err' is 'nil', which leads to a 'segmentation violation' as soon as
      subsequent code tries to access fields of 'link':
      
      ```
      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x14 pc=0x7087ae]
      
      goroutine 5 [running]:
      lab.nexedi.com/kirr/neo/go/neo/neonet.(*NodeLink).NewConn(0x0)
      	/srv/slapgrid/slappart82/srv/runner/instance/slappart6/software_release/parts/wendelin.core/wcfs/neo/go/neo/neonet/connection.go:404 +0x4e
      lab.nexedi.com/kirr/neo/go/neo/xneo.Dial.func1()
      	/srv/slapgrid/slappart82/srv/runner/instance/slappart6/software_release/parts/wendelin.core/wcfs/neo/go/neo/xneo/connect.go:138 +0x52
      lab.nexedi.com/kirr/neo/go/internal/xio.WithCloseOnErrCancel.func2()
      	/srv/slapgrid/slappart82/srv/runner/instance/slappart6/software_release/parts/wendelin.core/wcfs/neo/go/internal/xio/xio.go:114 +0x6a
      created by lab.nexedi.com/kirr/neo/go/internal/xio.WithCloseOnErrCancel in goroutine 21
      	/srv/slapgrid/slappart82/srv/runner/instance/slappart6/software_release/parts/wendelin.core/wcfs/neo/go/internal/xio/xio.go:109 +0x1ad
      ```
      
      This patch fixes this issue so that now 'err' and 'link' are never both
      'nil' again.
      
      /reviewed-by @kirr
      /reviewed-on !10
      d75f4ac2
    • Kirill Smelkov's avatar
      go/neo/neonet: Demonstrate DialLink misbehaviour when all handshake attempts fail · 03db1d8a
      Kirill Smelkov authored
      Levin found that when all handshake attempts fail DialLink returns
      both link=nil and err=nil which breaks what callers expect and lead to
      segmentation fault when accessing that nil link.
      
      -> Add test to demonstrate the problem.
      
      With xfail removed that test currently fails as
      
          --- FAIL: TestDialLink_AllHandshakeErr (0.00s)
          panic: lab.nexedi.com/kirr/neo/go/neo/neonet.TestDialLink_AllHandshakeErr.gox.func4.1: lab.nexedi.com/kirr/neo/go/neo/neonet.TestDialLink_AllHandshakeErr.func2: DialLink to handshake-rejecting server:
          have: link=<nil> err=<nil>
          want: link=<nil> err=client:1 - server:2: handshake (client): unexpected EOF [recovered]
      
      We will fix the problem in the next patch.
      
      /reported-by @levin.zimmermann
      /reported-at !10
      03db1d8a
  2. 02 Feb, 2024 1 commit
  3. 17 Jul, 2023 3 commits
    • Levin Zimmermann's avatar
      go/neo/neonet: Fix client handshake not to accept server encoding if it is... · 052856ce
      Levin Zimmermann authored
      go/neo/neonet: Fix client handshake not to accept server encoding if it is different from what client indicated
      
      If the peers encoding is different than our encoding two different
      scenarios can happen, because the handshake order is undefined (e.g.
      we don't know if our handshake is received before the peer sends its
      handshake):
      
      1. Our handshake is received before peer sends its handshake, NEO/py
      closes connection if it sees unexpected magic, version, etc.
      
      2. The client already sends a handshake before it proceeds our handshake.
      In this case it initally sends us it version, we can extract its encoding,
      and only later, once it proceeded our handshake with the bad encoding,
      it closes the connection.
      
      Before this patch case (2) wasn't handled correctly by the automatic
      encoding detection of 'DialLink'. 'DialLink' simply accepted the
      different-than-expected encoding, but once the peer proceeded the nodes
      handshake the peer closed the connection and the initially established
      and returned link was immediately closed again. Due to this it was good
      luck whether connecting with a peer different with an encoding different
      from the expected one worked or didn't work (it depended on which handshake
      was faster). Now 'DialLink' should reliably find the correct encoding
      and return a stable link.
      
      --------
      
      kirr: this is based on the following original patch by Levin: levin.zimmermann/neoppod@f6b59772
      
      I updated documentation throughout correspondingly and also added
      corresponding handshake-specific test in the previous patch.
      
      See !5 and b2da69e2
      (go/neo/neonet: Demonstrate problem in handshake with NEO/py) for more
      in-depth description of the problem.
      052856ce
    • Kirill Smelkov's avatar
      go/neo/neonet: Demonstrate problem in handshake with NEO/py · b2da69e2
      Kirill Smelkov authored
      Levin Zimmerman discovered that sometimes NEO/py accepts our handshake
      hello with encoding 'M', then replies its owh handshake ehlo with
      encoding 'N' and then further terminates the connection. In other words
      it looks like that the handshake went successful, but it actually did
      not and NEO/py terminates the link after some time.
      
      This manifests itself e.g. as infrequent TestLoad failures on t branch
      with the following output:
      
      === RUN   TestLoad/py/!ssl
      I: runneo.py: /tmp/neo776618506/1 !ssl: started master(s): 127.0.0.1:21151
      === RUN   TestLoad/py/!ssl/enc=N(dialTryOrder=N,M)
          client_test.go:598: skip: does not excercise client redial
      === RUN   TestLoad/py/!ssl/enc=N(dialTryOrder=M,N)
          xtesting.go:330: load 0285cbac258bf266:0000000000000000: returned err unexpected:
              have: neo://127.0.0.1:21151/1: load 0285cbac258bf266:0000000000000000: dial S1: dial 127.0.0.1:40345 (STORAGE): 127.0.0.1:56678 - 127.0.0.1:40345: request identification: 127.0.0.1:56678 - 127.0.0.1:40345 .1: recv: EOF
              want: nil
          xtesting.go:330: load 0285cbac258bf266:0000000000000000: returned tid unexpected: 0000000000000000  ; want: 0285cbac258bf266
          xtesting.go:330: load 0285cbac258bf266:0000000000000000: returned buf = nil
          xtesting.go:339: load 0285cbac258bf265:0000000000000000: returned err unexpected:
              have: neo://127.0.0.1:21151/1: load 0285cbac258bf265:0000000000000000: dial S1: dial 127.0.0.1:40345 (STORAGE): 127.0.0.1:56688 - 127.0.0.1:40345: request identification: 127.0.0.1:56688 - 127.0.0.1:40345 .1: recv: EOF
              want: neo://127.0.0.1:21151/1: load 0285cbac258bf265:0000000000000000: 0000000000000000: object was not yet created
          ...
          client_test.go:588: NEO log tail:
      
              log file 'storage_0.log' tail:
      
              2023-07-17 17:21:57.1519 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:51230, handler=IdentificationHandler, fd=20, server) at 7f3583fd4f50> (from 127.0.0.1:40345)
              2023-07-17 17:21:57.1537 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:51230, handler=IdentificationHandler, fd=20, server) at 7f3583fd4f50>
              2023-07-17 17:21:57.1548 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:51230, handler=IdentificationHandler, closed, server) at 7f3583fd4f50>
              2023-07-17 17:21:57.1551 WARNING   S1         A connection was lost during identification
              2023-07-17 17:22:00.1582 DEBUG     S1         accepted a connection from 127.0.0.1:51236
              2023-07-17 17:22:00.1585 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:51236, handler=IdentificationHandler, fd=20, server) at 7f3583fd4e90> (from 127.0.0.1:40345)
              2023-07-17 17:22:00.1604 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:51236, handler=IdentificationHandler, fd=20, server) at 7f3583fd4e90>
              2023-07-17 17:22:00.1622 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:51236, handler=IdentificationHandler, closed, server) at 7f3583fd4e90>
              2023-07-17 17:22:00.1625 WARNING   S1         A connection was lost during identification
              2023-07-17 17:22:03.1663 DEBUG     S1         accepted a connection from 127.0.0.1:51238
              2023-07-17 17:22:03.1666 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:51238, handler=IdentificationHandler, fd=20, server) at 7f3583fd4d10> (from 127.0.0.1:40345)
              2023-07-17 17:22:03.1674 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:51238, handler=IdentificationHandler, fd=20, server) at 7f3583fd4d10>
              2023-07-17 17:22:03.1688 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:51238, handler=IdentificationHandler, closed, server) at 7f3583fd4d10>
              2023-07-17 17:22:03.1691 WARNING   S1         A connection was lost during identification
              2023-07-17 17:22:06.1714 DEBUG     S1         accepted a connection from 127.0.0.1:57072
              2023-07-17 17:22:06.1719 DEBUG     S1         connection completed for <ServerConnection(nid=None, address=127.0.0.1:57072, handler=IdentificationHandler, fd=20, server) at 7f3583fd4b50> (from 127.0.0.1:40345)
              2023-07-17 17:22:06.1727 WARNING   S1         Protocol version mismatch with <ServerConnection(nid=None, address=127.0.0.1:57072, handler=IdentificationHandler, fd=20, server) at 7f3583fd4b50>
              2023-07-17 17:22:06.1738 DEBUG     S1         connection closed for <ServerConnection(nid=None, address=127.0.0.1:57072, handler=IdentificationHandler, closed, server) at 7f3583fd4b50>
              2023-07-17 17:22:06.1738 WARNING   S1         A connection was lost during identification
      
              log file 'master_0.log' tail:
      
              2023-07-17 17:21:21.0799 PACKET    M1         #0x0012 NotifyNodeInformation          > A1 (127.0.0.1:37906)
              2023-07-17 17:21:21.0799 PACKET    M1          ! C0 | CLIENT |  | RUNNING | 2023-07-17 14:21:21.079838
              2023-07-17 17:21:21.0800 PACKET    M1         #0x0102 NotifyNodeInformation          > S1 (127.0.0.1:37918)
              2023-07-17 17:21:21.0800 PACKET    M1          ! C0 | CLIENT |  | RUNNING | 2023-07-17 14:21:21.079838
              2023-07-17 17:21:21.0801 DEBUG     M1         Handler changed on <ServerConnection(nid=None, address=127.0.0.1:37966, handler=ClientServiceHandler, fd=18, server) at 7f3584245910>
              2023-07-17 17:21:21.0802 PACKET    M1         #0x0001 AnswerRequestIdentification    > C0 (127.0.0.1:37966)
              2023-07-17 17:21:21.0804 PACKET    M1         #0x0000 NotifyNodeInformation          > C0 (127.0.0.1:37966)
              2023-07-17 17:21:21.0804 PACKET    M1          ! C0 | CLIENT  |                 | RUNNING | 2023-07-17 14:21:21.079838
              2023-07-17 17:21:21.0804 PACKET    M1          ! M1 | MASTER  | 127.0.0.1:21151 | RUNNING | None
              2023-07-17 17:21:21.0804 PACKET    M1          ! S1 | STORAGE | 127.0.0.1:40345 | RUNNING | 2023-07-17 14:21:18.737469
              2023-07-17 17:21:21.0805 PACKET    M1         #0x0002 NotifyPartitionTable           > C0 (127.0.0.1:37966)
              2023-07-17 17:21:21.0810 PACKET    M1         #0x0003 LastTransaction                < C0 (127.0.0.1:37966)
              2023-07-17 17:21:21.0811 PACKET    M1         #0x0003 AnswerLastTransaction          > C0 (127.0.0.1:37966)
              2023-07-17 17:22:06.2053 DEBUG     M1         <SocketConnectorIPv4 at 0x7f3584252d10 fileno 18 ('127.0.0.1', 21151), opened from ('127.0.0.1', 37966)> closed in recv
              2023-07-17 17:22:06.2056 DEBUG     M1         connection closed for <ServerConnection(nid=C0, address=127.0.0.1:37966, handler=ClientServiceHandler, closed, server) at 7f3584245910>
              2023-07-17 17:22:06.2058 PACKET    M1         #0x0014 NotifyNodeInformation          > A1 (127.0.0.1:37906)
              2023-07-17 17:22:06.2058 PACKET    M1          ! C0 | CLIENT |  | UNKNOWN | 2023-07-17 14:21:21.079838
              2023-07-17 17:22:06.2059 PACKET    M1         #0x0104 NotifyNodeInformation          > S1 (127.0.0.1:37918)
              2023-07-17 17:22:06.2059 PACKET    M1          ! C0 | CLIENT |  | UNKNOWN | 2023-07-17 14:21:21.079838
      
      The problem is due to that my analysis from e407f725 (go/neo/neonet:
      Rework handshake to differentiate client and server parts) turned out to
      be incorrect. Quoting that patch:
      
          -> Rework handshake so that client always sends its hello first, and
          only then the server side replies. This matches actual NEO/py behaviour:
      
          https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-67-g261dd4b4/neo/lib/connector.py#L293-294
      
          even though the "NEO protocol" states that
      
          	Handshake transmissions are not ordered with respect to each other and can go in parallel.
      
          	( https://neo.nexedi.com/P-NEO-Protocol.Specification.2019?portal_skin=CI_slideshow#/9/2 )
      
          If I recall correctly that sentence was authored by me in 2018 based on
          previous understanding of should-be full symmetry in-between client and
          server.
      
      so here "This matches actual NEO/py behaviour" was wrong: even though
      https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-67-g261dd4b4/neo/lib/connector.py#L293-294
      indeed says that
      
          #       The NEO protocol is such that a client connection is always the
          #       first to send a packet, as soon as the connection is established,
      
      in reality it is not the case as SocketConnector always queues handshake
      hello upon its creation before receiving anything from remote side:
      https://lab.nexedi.com/nexedi/neoppod/blob/v1.12-93-gfd87e153/neo/lib/connector.py#L77-78 .
      In practice this leads to that in non-SSL case NEO/py server might be
      fast enough to send its prepared hello before receiving hello from us.
      Levin also explains at !5 (comment 187429):
      
          I think what happens is this:
      
          the NEO protocol doesn't specify in which order handshakes happen after
          initial dial. If the peer sends a handshake before receiving our
          handshake and if this peers handshake is received by us, 'DialLink'
          assumes everything is fine (no err is returned), it breaks the loop and
          returns the link. But then, very little time later, when the peer
          finally receives our handshake, this looks strange for the peer and it
          closes the connection.
      
          So in my understanding this should be fixed by explicitly comparing the
          encodings between our expected one and what the peer provided us. If
          encodings don't match we should retry with a new encoding in order to
          prevent the peer from closing the connection.
      
          For me this also explains why sometimes the tests passed and sometimes
          didn't: it depended on which node was faster ('race condition').
      
      -> In this patch we add correspondig handshake test that demonstrates
      this problem. It currently fails as
      
      --- FAIL: TestHandshake (0.01s)
          --- FAIL: TestHandshake/enc=N (0.00s)
              newlink_test.go:154: handshake encoding mismatch: client: unexpected error:
                  have: <nil>
                       "<nil>"
                  want: &neonet._HandshakeError{LocalRole:1, LocalAddr:net.pipeAddr{}, RemoteAddr:net.pipeAddr{}, Err:(*neonet._EncodingMismatchError)(0xc0000a4190)}
                       "pipe - pipe: handshake (client): protocol encoding mismatch: peer = 'M'  ; our side = 'N'"
          --- FAIL: TestHandshake/enc=M (0.00s)
              newlink_test.go:154: handshake encoding mismatch: client: unexpected error:
                  have: <nil>
                       "<nil>"
                  want: &neonet._HandshakeError{LocalRole:1, LocalAddr:net.pipeAddr{}, RemoteAddr:net.pipeAddr{}, Err:(*neonet._EncodingMismatchError)(0xc0001a22cc)}
                       "pipe - pipe: handshake (client): protocol encoding mismatch: peer = 'N'  ; our side = 'M'"
      
      We will fix it in the next patch.
      
      /reported-by @levin.zimmermann
      /reported-on !5
      b2da69e2
    • Kirill Smelkov's avatar
      go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch"... · c100a0c9
      Kirill Smelkov authored
      go/neo/neonet: Dedicate an error type to indicate "protocol version mismatch" as handshake failure cause
      
      We will soon need to detect if a handshake failure was due to mismatch
      of protocol encodings and that would require introduction of dedicated
      error type for that cause. As a preparatory step first refactor "version
      mismatch cause" to follow the same style for symmetry.
      c100a0c9
  4. 18 Jan, 2023 2 commits
  5. 18 May, 2022 2 commits
    • Kirill Smelkov's avatar
      go/neo/t/neotest: Use python -c 'print ...' in a way that works on both py2 and py3 · 789c9ed9
      Kirill Smelkov authored
      Without parenthesis it was failing on py3:
      
      	(neo) (py3.venv) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/kirr/neo/go/neo/t$ ./neotest info-local
      	date:   Wed, 18 May 2022 11:05:50 +0300
      	xnode:  kirr@deca.navytux.spb.ru (2401:5180:0:af::1 192.168.0.3 (+ 1·ipv4))
      	uname:  Linux deca 5.10.0-13-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
      	cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
      	  File "<string>", line 1
      	    print '%.2fGHz' % (400000 / 1E6)
      	          ^
      	SyntaxError: invalid syntax
      789c9ed9
    • Kirill Smelkov's avatar
      go/neo/t/tzodb.py: Fix zhash for Python3 · 30329f5a
      Kirill Smelkov authored
      On py3 dict.keys() returns iterator instead of list:
      
          $ ./tzodb.py zhash
          Traceback (most recent call last):
            File "/home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/neo/t/./tzodb.py", line 141, in <module>
              main()
            File "/home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/neo/t/./tzodb.py", line 138, in main
              zhash()
            File "/home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/neo/t/./tzodb.py", line 59, in zhash
              optv, argv = getopt(sys.argv[2:], "h", ["help", "check=", "bench="] + hashRegistry.keys())
          TypeError: can only concatenate list (not "dict_keys") to list
      30329f5a
  6. 25 Nov, 2021 1 commit
  7. 04 Oct, 2021 5 commits
  8. 08 Sep, 2021 4 commits
    • Kirill Smelkov's avatar
      go/internal/xtesting: Add missing X · e7bea476
      Kirill Smelkov authored
      staticcheck says:
      
      	xtesting.go:386:2: this value of err is never used (SA4006)
      e7bea476
    • Kirill Smelkov's avatar
      go/zodb/{fs1,zeo}: ~staticcheck · 3ff4eb14
      Kirill Smelkov authored
      fs1/format.go:204:2: only the first constant in this group has an explicit type (SA9004)
      zeo/proto.go:56:2: only the first constant in this group has an explicit type (SA9004)
      3ff4eb14
    • Kirill Smelkov's avatar
      go/zodb/btree: Fix missing return on data-consistency error · ca630163
      Kirill Smelkov authored
      staticcheck reports
      
          ziobtree.go:606:4: Errorf is a pure function but its return value is ignored (SA4017)
          ziobtree.go:626:4: Errorf is a pure function but its return value is ignored (SA4017)
          zlobtree.go:606:4: Errorf is a pure function but its return value is ignored (SA4017)
          zlobtree.go:626:4: Errorf is a pure function but its return value is ignored (SA4017)
      ca630163
    • Kirill Smelkov's avatar
      go/zodb, go/zodb/btree: Fix go generate after rename on zodbtools side · 3d27ed5d
      Kirill Smelkov authored
      run_with_zodb3py2_compat was renamed to run_with_zodb4py2_compat in
      nexedi/zodbtools@c59a54ca .
      
      Without the fix go genrate was failing as
      
          (neo) (z-dev) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/btree$ go generate
          Traceback (most recent call last):
            File "./testdata/gen-testdata", line 26, in <module>
              from zodbtools.test.gen_testdata import run_with_zodb3py2_compat
          ImportError: cannot import name run_with_zodb3py2_compat
      
      and
      
          (neo) (z-dev) (g.env) kirr@deca:~/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb$ go generate
          Traceback (most recent call last):
            File "./py/pydata-gen-testdata", line 24, in <module>
              from zodbtools.test.gen_testdata import run_with_zodb3py2_compat
          ImportError: cannot import name run_with_zodb3py2_compat
      
      This amends commit fc69e00d (go/zodb/fs1: Fix Python database generator
      to work with recent zodbtools).
      3d27ed5d
  9. 20 Jul, 2021 1 commit
  10. 24 May, 2021 6 commits
    • Kirill Smelkov's avatar
      go/zodb: Prevent newly-created not-yet committed objects to go to ghost on deactivate · bb618ce1
      Kirill Smelkov authored
      When object is just created, it is not yet assigned an OID, but can be
      reachable from other objects. The code that processes transaction can
      reach to that new object and try to PActivate/PDeactivate it. And
      currently PDeactivate will drop the object state completely.
      
      Another example of object without an OID is Bucket embedded into a Tree
      object. There, the code that scans the tree can reach to that bucket and
      try to activate/deactivate it, leading, again, to dropping state of that bucket.
      
      -> Fix it.
      bb618ce1
    • Kirill Smelkov's avatar
      go/zodb: Fix PActivate not to panic after an error · 67f0d2cd
      Kirill Smelkov authored
      Persistent.PActivate used to panic when called the second time, if the
      first time it hit an error. WCFS hit this in practice via object, that was
      previously accessed and pinned into the cache, but later deleted in the
      storage.
      
      -> Fix PActivate to reset .loading on an error, so that next time PActivate is
      called, it tries to trigger load again instead of panicking. Change doload
      criteria from
      
      	state==GHOST && refcnt==1
      
      to
      
      	state==GHOST && loading==nil
      
      because now, after failed PActivate, refcnt can be != 0, if there are several
      other PActivate calls that were waiting for the failed PActivate but did not
      yet woke up.
      
      Here is how added test fails without the fix:
      
          --- FAIL: TestActivateAfterDelete (1.65s)
          panic: t.zodb.MyObject(0000000000000065): activate: need to load, but .loading != nil [recovered]
                  panic: t.zodb.MyObject(0000000000000065): activate: need to load, but .loading != nil
      
          goroutine 10085 [running]:
          testing.tRunner.func1.2(0x649020, 0xc000520660)
                  /home/kirr/src/tools/go/go/src/testing/testing.go:1143 +0x332
          testing.tRunner.func1(0xc0001cb080)
                  /home/kirr/src/tools/go/go/src/testing/testing.go:1146 +0x4b6
          panic(0x649020, 0xc000520660)
                  /home/kirr/src/tools/go/go/src/runtime/panic.go:965 +0x1b9
          lab.nexedi.com/kirr/neo/go/zodb.(*Persistent).PActivate(0xc0001184d0, 0x6e8360, 0xc00019ac90, 0x0, 0x0)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/persistent.go:191 +0xce5
          lab.nexedi.com/kirr/neo/go/zodb.TestActivateAfterDelete(0xc0001cb080)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/persistent_test.go:786 +0x72c
      67f0d2cd
    • Kirill Smelkov's avatar
      go/zodb/fs1tools: Verify · 5a26fb31
      Kirill Smelkov authored
      Add utility to verify FileStorage data for consistency.
      To verify we just need to iterate through all records, because
      FileStorage driver performs all consistency checks by itself.
      
      Mimic normal output to be the same as in fstest from ZODB/py.
      Example runs of fstest.py and `fs1 verify` on a broken file:
      
          $ python ~/src/wendelin/z/ZODB/src/ZODB/scripts/fstest.py -v 1.fs
                   4: transaction tid 0x03e044f6448c8022 #0
                 213: transaction tid 0x03e044f646e044bb #1
          1.fs has data records that extend beyond the transaction record; end at 466
      
          $ fs1 verify -v 1.fs
                   4: transaction tid 0x03e044f6448c8022 #0
                 213: transaction tid 0x03e044f646e044bb #1
          2021/05/24 12:43:37 fsverify: 1.fs: 1.fs: transaction record @355: -> (iter data): 1.fs: data record @416: check: data record [..., 466) overlaps txn boundary [..., 458)
      
      As can be seen, in fs1 case, the error contains more details: [start, end) of
      both mismatching transaction and data records.
      
      In addition to fstest-like verbosity, add progress-mode, where % of total
      completion is printed in a style similar to one used by `fs1 verify-index`.
      
      The Go-based implementation is also faster even when data is on HDD. For
      example on a 73GB database provided by @jerome[1] fsrefs.py takes ~15 minutes
      to run and occupy ~70-100% of CPU. On the other hand `fs1 verify` takes ~7
      minutes to run and occupy ~ 20-30% of CPU.
      
      Tests pending.
      
      [1] nexedi/zodbtools!19 (comment 129480)
      5a26fb31
    • Kirill Smelkov's avatar
      go/zodb/fs1tools: Dumper += DumpEndOK · c66dc12d
      Kirill Smelkov authored
      Some dumpers might want to print something at the end of their dump.
      We will need this functionality for Verify (see next patch).
      c66dc12d
    • Kirill Smelkov's avatar
      7d896243
    • Kirill Smelkov's avatar
      go/zodb: tests: Teach tDB.Commit to return committed TID · c68c0994
      Kirill Smelkov authored
      And use that in the callers.
      c68c0994
  11. 19 May, 2021 1 commit
  12. 10 May, 2021 1 commit
  13. 26 Mar, 2021 11 commits
    • Kirill Smelkov's avatar
      go/*: Cosmetics · 890290e6
      Kirill Smelkov authored
      890290e6
    • Kirill Smelkov's avatar
    • Kirill Smelkov's avatar
      go/zodb/demo: New package that provides base+δ storages overlay · 4fb6bd0a
      Kirill Smelkov authored
      For Load demo.Storage implementation is similar to DemoStorage in
      ZODB/py with fixes "cherry-picked" from:
      
      - https://github.com/zopefoundation/ZODB/issues/318
        (DemoStorage does not take whiteouts into account -> leading to data corruption)
      
      - https://github.com/zopefoundation/ZODB/pull/323
        (loadAt + fix for the above issue)
      
      For safety demo.Storage - contrary to DemoStorage/py - actually verifies
      that for demo=base+δ δ comes strictly after base and that base remains
      unchanged.
      
      URI schema follows XRI Cross-references approach and is
      
      	demo:(zurl_base)/(zurl_δ)
      
      https://en.wikipedia.org/wiki/Extensible_Resource_Identifier provides
      some related details and examples.
      
      For ZODB/py corresponding pull-request for zodburi to support demo: URI
      scheme has been made here: https://github.com/Pylons/zodburi/pull/29 .
      
      Tests need:
      
      - recent zodbtools with zodbrestore:
      
        https://lab.nexedi.com/nexedi/zodbtools/blob/129afa67/zodbtools/zodbrestore.py
        nexedi/zodbtools!19
      
      - ZODB with support for DemoStorage.deleteObject
        https://github.com/zopefoundation/ZODB/pull/341
      
      On Go side demo storage is needed for wendelin.core 2 because ERP5 uses
      DemoStorage to run tests.
      4fb6bd0a
    • Kirill Smelkov's avatar
      go/zodb: Provide OpenDriver in addition to Open · 954321b2
      Kirill Smelkov authored
      This is low-level API to open IStorageDriver instead of IStorage.
      Demo storage will need this.
      
      Maybe it would be a good idea to move drivers-related functionality into
      separate package in the future.
      954321b2
    • Kirill Smelkov's avatar
      go/zodb/fs1: Report URL with file:// schema included · 5aa4372f
      Kirill Smelkov authored
      In ZODB/go when there is no schema in zurl, open automatically prepends
      file:// . However filename itself could contain ":" and so generally
      speaking it is incorrect to return URL without file:// schema prepended
      to file name.
      
      Another reason to always use fully-constructed URLs with schema, is
      interoperability with ZODB/py - there zodburi, when given zurl without
      schema, does not make any assumption that it is of file:// kind and
      rejects opening such URIs.
      5aa4372f
    • Kirill Smelkov's avatar
      go/zodb: Fix Open to detect lack of schema only by ":" instead of by "://" · d904eb10
      Kirill Smelkov authored
      An URI schema is required to have ":" after it, but - even if frequently
      used in practice - not //. We will soon introduce "demo:" URI scheme
      that comes without //, so fix Open to detect schema presence just by ":"
      and not to fixup "demo:..." url to "file://demo:..." automatically.
      d904eb10
    • Kirill Smelkov's avatar
      go/zodb: Fix IStorage implementation not to deadlock driver if watcher fails · 003c44cb
      Kirill Smelkov authored
      Before the patch if storage.watcher fails, storage.driver.Close is not
      called, and so the driver will continue to send to .drvWatchq, but noone
      is receiving from it.
      
      a5dbb92b (go/zodb: Require drivers to close watchq on Close), provides
      the guarantee that the driver will stop sending on drvWatchq right after
      drv.Close call.
      003c44cb
    • Kirill Smelkov's avatar
      go/zodb/zeo: Fix it to provide "Close vs watchq" guaranty · 58e0142c
      Kirill Smelkov authored
      Provide guaranty that Close forces the driver to stop sending to watchq
      and to close it. See a5dbb92b ("go/zodb: Require drivers to close watchq
      on Close") for details.
      
      Without the fix TestWatch fails with test timeout:
      
          panic: test timed out after 30s
      
          # Close waits for serve to stop
          goroutine 93 [semacquire]:
          sync.runtime_Semacquire(0xc000152170)
                  /home/kirr/src/tools/go/go/src/runtime/sema.go:56 +0x45
          sync.(*WaitGroup).Wait(0xc000152168)
                  /home/kirr/src/tools/go/go/src/sync/waitgroup.go:130 +0x65
          lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.(*zLink).Close(0xc0001520f0, 0x1313, 0x1)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zrpc.go:159 +0x47
          lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.(*zeo).Close(0xc000313680, 0xc000107c78, 0x1)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zeo.go:526 +0x2e
          lab.nexedi.com/kirr/neo/go/internal/xtesting.DrvTestWatch(0xc000082c00, 0xc0000aa2a0, 0x24, 0x6a4a38)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/internal/xtesting/xtesting.go:442 +0xdb5
          lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.TestWatch.func1(0xc000082c00, 0x6e3498, 0xc00009a380)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zeo_test.go:270 +0x99
          lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.withZEOSrv.func2.1(0xc0000a4168, 0x16)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zeo_test.go:207 +0xfb
          lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.withZEOSrv.func1(0xc000082c00, 0xc00009c180)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zeo_test.go:186 +0x129
          lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.withZEOSrv.func2(0xc000082c00)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zeo_test.go:199 +0x10e
          testing.tRunner(0xc000082c00, 0xc00009c160)
                  /home/kirr/src/tools/go/go/src/testing/testing.go:1194 +0xef
          created by testing.(*T).Run
                  /home/kirr/src/tools/go/go/src/testing/testing.go:1239 +0x2b3
      
          # serve is stuck in invalidateTransaction doing watchq<-
          goroutine 26 [chan send]:
          lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.(*zeo).invalidateTransaction(0xc000313680, 0x6417e0, 0xc000323b60, 0x0, 0x0)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zeo.go:176 +0x373
          lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.(*zLink).serveRecv1(0xc0001520f0, 0xc000393890, 0x0, 0x0)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zrpc.go:225 +0x4b4
          lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.(*zLink).serveRecv(0xc0001520f0)
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zrpc.go:176 +0x8d
          created by lab.nexedi.com/kirr/neo/go/zodb/storage/zeo.(*zLink).start
                  /home/kirr/src/neo/src/lab.nexedi.com/kirr/neo/go/zodb/storage/zeo/zrpc.go:99 +0xc8
      58e0142c
    • Kirill Smelkov's avatar
      go/zodb/fs1: Fix it to provide "Close vs watchq" guaranty · 88848c31
      Kirill Smelkov authored
      Provide guaranty that Close forces the driver to stop sending to watchq
      and to close it. See a5dbb92b ("go/zodb: Require drivers to close watchq
      on Close") for details.
      88848c31
    • Kirill Smelkov's avatar
      go/zodb: Require drivers to close watchq on Close · a5dbb92b
      Kirill Smelkov authored
      If we don't require drivers to stop sending to watchq after Close, there
      could be many deadlock scenarios, for example:
      
      - client called drv.Close(); no longer listens to watchq; driver is
        stuck sending to watchq, or
      
      - client called drv.Close(), which itself waits for tasks spawned inside
        driver to complete, which are stuck sending to watchq because client
        no longer does <-watchq.
      
      The change is similar in spirit to safety guaranty provided by high-level
      Watcher where after DelWatch call it is guaranteed that there will be no
      more sends to subscribed watchq (see c41c2907 "go/zodb: High-level
      watching - initial draft") for details.
      
      All drivers don't provide requested guarantee yet.
      We'll be fixing them one-by-one in followup commits.
      a5dbb92b
    • Kirill Smelkov's avatar
      go/internal/xtesting: Fix FatalIf to return a helper · 20d48ded
      Kirill Smelkov authored
      Else, on an error, it is the lineno of `t.Fatal(err)` inside FatalIf
      that is printed, not the line number inside user test.
      20d48ded