1. 08 Jul, 2018 10 commits
    • Kirill Smelkov's avatar
      go/zodb: Teach ZODB/go to access ZEO (draft) · b65f6d0f
      Kirill Smelkov authored
      For the reference on deco (performance, frequency not fixed):
      
      	name                           time/object
      	deco/fs1/zhash.py              15.8µs ± 2%
      	deco/fs1/zhash.py-P16           116µs ±12%
      	deco/fs1/zhash.go              2.60µs ± 0%
      	deco/fs1/zhash.go+prefetch128  3.70µs ±11%
      	deco/fs1/zhash.go-P16          13.4µs ±43%
      	deco/zeo/zhash.py               316µs ± 7%
      	deco/zeo/zhash.py-P16          2.68ms ± 7%
      	deco/zeo/zhash.go               111µs ± 2%
      	deco/zeo/zhash.go+prefetch128  57.7µs ± 2%
      	deco/zeo/zhash.go-P16          1.23ms ± 5%
      
      and in particular it shows that with the same ZEO/py server, the latency
      to load an object via the py client is ~ 3x worse than the latency to
      load the same object via this Go client.
      
      The performance was obtained via the forthcoming neotest, and in particular
      the ZEO/go client will also be used in the forthcoming zwrk (no analog on the python side).
      
      See http://navytux.spb.ru/~kirr/neo.html#performance-tests for details.
      
      Tests: pending.
      b65f6d0f
    • Kirill Smelkov's avatar
      go/zodb: Start putting pickle-related utilities into internal/pickletools package · 2ee495ce
      Kirill Smelkov authored
      As the first step, factor out the int64 checker Xint64 from
      zodb/storagefs1/index.go into it.  We'll need the checker in the next patch.
      2ee495ce
    • Kirill Smelkov's avatar
      go/neo/neonet: Lightweight mode · ec4b3ce0
      Kirill Smelkov authored
      In situations when created connections are used to only send/receive 1
      request/response, the overhead to create/shutdown full connections could be
      too much. Unfortunately this is exactly the mode that is currently
      primarily used for compatibility with NEO/py. To help mitigate the overhead
      in such scenarios, lightweight connections mode is provided:
      
      On the requester side, one message can be sent over a node link with
      link.Send1. Internally a connection is created and then shut down, but
      since the code manages the whole process itself and does not expose the
      connection to the user, it can optimize those operations significantly.
      Similarly, link.Ask1 sends 1 request, receives 1 response, and then puts
      the connection back into the pool for later reuse.
      
      On the receiver side, link.Recv1 accepts a connection together with the
      first message the remote peer sent when establishing it, and wraps the
      result into a Request object. The Request contains the received message
      and, internally, the connection. A response can be sent back via
      Request.Reply. Once Request.Close is called, the accepted connection
      object is immediately put back into the pool for later reuse.
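
      The connection reuse described above can be sketched as follows. This is
      only an illustration under assumptions: the conn and link types, the peer
      callback that stands in for the remote node, and the Ask1 signature are
      all hypothetical and are not the actual neonet code.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for a connection multiplexed over a link.
type conn struct {
	id uint32
}

// link keeps a free list of connection objects so that one-shot
// request/response exchanges do not allocate a fresh object every time.
type link struct {
	mu      sync.Mutex
	free    []*conn
	nextID  uint32
	created int // number of conn objects allocated so far
	// peer simulates the remote side answering a request (assumption).
	peer func(connID uint32, req string) string
}

// getConn takes a connection object from the free list, or allocates one,
// and assigns it a fresh connection ID.
func (l *link) getConn() *conn {
	l.mu.Lock()
	defer l.mu.Unlock()
	var c *conn
	if n := len(l.free); n > 0 {
		c = l.free[n-1]
		l.free = l.free[:n-1]
	} else {
		c = &conn{}
		l.created++
	}
	c.id = l.nextID
	l.nextID += 2
	return c
}

// putConn returns a connection object to the free list for later reuse.
func (l *link) putConn(c *conn) {
	l.mu.Lock()
	l.free = append(l.free, c)
	l.mu.Unlock()
}

// Ask1 sends one request, receives one response, and puts the connection
// back into the pool. The connection is never shown to the caller.
func (l *link) Ask1(req string) string {
	c := l.getConn()
	defer l.putConn(c)
	return l.peer(c.id, req)
}

func main() {
	l := &link{nextID: 1, peer: func(id uint32, req string) string {
		return fmt.Sprintf("reply to %q on conn %d", req, id)
	}}
	for i := 0; i < 3; i++ {
		fmt.Println(l.Ask1("ping"))
	}
	fmt.Println("conn objects allocated:", l.created)
}
```

      Because the caller never holds the connection, the same object can be
      handed out again for the next exchange with only its ID changed.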
      
      Some history of lightweight mode:
      
      lab.nexedi.com/kirr/neo/commit/0fa96338	X Clarified Request.Close semantics - tests working again
      lab.nexedi.com/kirr/neo/commit/a5ac1652	X Ask1: switch to sending directly over link
      lab.nexedi.com/kirr/neo/commit/755e3654	X Request.Reply: switch to replying directly over link
      lab.nexedi.com/kirr/neo/commit/c643ba53	X Send1: switch to sending directly over link
      lab.nexedi.com/kirr/neo/commit/7dcbc9c5	X Send1: switch to lightClose
      lab.nexedi.com/kirr/neo/commit/851864a9	X chan RTT benchmark which simulates Recv1 = Accept + Recv
      lab.nexedi.com/kirr/neo/commit/099bfc29	X freelist for PktBuf
      lab.nexedi.com/kirr/neo/commit/58c2e39a	X Benchmark for link Ask1/Recv1 over TCP loopback
      ec4b3ce0
    • Kirill Smelkov's avatar
      go/neo/neonet: User-visible Send/Recv + Ask/Expect · 39982595
      Kirill Smelkov authored
      Provide Conn.Send and Conn.Recv which work on NEO messages and
      automatically encode/decode them into packets on the fly.
      
      Similarly to NEO/py also provide Ask to send a request and receive
      expected reply and Expect which does only the latter half of Ask.
      39982595
    • Kirill Smelkov's avatar
      go/neo/neonet: Link establishment · e8396003
      Kirill Smelkov authored
      Implement NEO protocol handshaking and use it in the newly provided DialLink
      and ListenLink, which respectively first do a regular network dial or
      listen and then perform the handshake on the just-established TCP
      connection. If the handshake succeeds, the result is wrapped into a NodeLink.
      
      Some history:
      
      	lab.nexedi.com/kirr/neo/commit/8d0a1469	X Handshake draftly done
      
      See also http://navytux.spb.ru/~kirr/neo.html#development-overview
      (starting from "The neonet module also provides DialLink and ListenLink
      ...")
      e8396003
    • Kirill Smelkov's avatar
      go/neo/neonet: Start (first draft) · 64513925
      Kirill Smelkov authored
      Continue NEO/go with neonet - the layer to exchange messages in between
      NEO nodes.
      
      NEO/go shifts from thinking about NEO protocol logic as RPC to thinking
      of it as more general network protocol and so settles to provide general
      connection-oriented message exchange service. This way neonet provides
      generic connection multiplexing on top of a single TCP node-node link.
      
      Neonet compatibility with NEO/py depends on the following small NEO/py patch:
      
          dd3bb8b4
      
      which adjusts message ID a bit so it behaves like stream_id in HTTP/2:
      
          - always even for server initiated streams
          - always odd  for client initiated streams
      
      and is incremented by += 2 instead of += 1, to maintain the above invariant.
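
      A minimal sketch of such an ID scheme, for illustration only. The
      idAllocator type and its functions are hypothetical names, and starting
      the server side at 0 is an assumption (the text above only fixes the
      parity and the += 2 step).

```go
package main

import "fmt"

// idAllocator hands out IDs for one side of a link.
// Hypothetical type for illustration; not the actual neonet or NEO/py code.
type idAllocator struct {
	next uint32
}

// newIDAllocator returns an allocator for the given role:
// odd IDs for the client side, even IDs for the server side.
// The starting values 1 and 0 are assumptions.
func newIDAllocator(isClient bool) *idAllocator {
	if isClient {
		return &idAllocator{next: 1}
	}
	return &idAllocator{next: 0}
}

// Next returns the next ID and advances by 2, which preserves parity.
func (a *idAllocator) Next() uint32 {
	id := a.next
	a.next += 2
	return id
}

// initiatedByClient reports which side opened the stream with this ID.
func initiatedByClient(id uint32) bool {
	return id%2 == 1
}

func main() {
	client := newIDAllocator(true)
	server := newIDAllocator(false)
	for i := 0; i < 3; i++ {
		fmt.Println("client:", client.Next(), "server:", server.Next())
	}
}
```

      With this scheme both sides can open streams concurrently without ever
      picking the same ID, and the receiver can tell from the ID alone who
      initiated the stream.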
      
      See http://navytux.spb.ru/~kirr/neo.html#development-overview (starting from
      "Then comes the link layer which provides service to exchange messages over
      network...") for the rationale.
      
      Unfortunately current NEO/py maintainer is very much against merging that patch.
      
      This patch brings in the core of neonet. Next patches will add initial
      handshaking, user-level Send/Recv + Ask/Expect and "lightweight mode".
      
      Some neonet core history:
      
      lab.nexedi.com/kirr/neo/commit/6b9ed46d	X neonet: Avoid integer overflow on max packet length check
      lab.nexedi.com/kirr/neo/commit/8eac771c	X neo/connection: Fix race between link.shutdown() and conn.lightClose()
      lab.nexedi.com/kirr/neo/commit/8021a1d5	X rxghandoff
      lab.nexedi.com/kirr/neo/commit/68738036	X ... but negative impact on separate client / server processes, strange ...
      lab.nexedi.com/kirr/neo/commit/b0dda9d2	X serveRecv: help Go scheduler to switch to receiving G sooner
      lab.nexedi.com/kirr/neo/commit/4989918a	X remove defer from rx/tx hot paths
      lab.nexedi.com/kirr/neo/commit/e055406a	X no select for acceptq - similarly for rxq path
      lab.nexedi.com/kirr/neo/commit/c28ad4d0	X Conn.Recv: receive without select
      lab.nexedi.com/kirr/neo/commit/496bd425	X add benchmark RTT over plain net.Conn with serveRecv-style RX handler
      lab.nexedi.com/kirr/neo/commit/9fa79958	X draft how to mark RX down without reallocating .rxdown
      lab.nexedi.com/kirr/neo/commit/4324c812	X restore all Conn functionality
      lab.nexedi.com/kirr/neo/commit/a8e61d2f	X serveSend is not needed
      lab.nexedi.com/kirr/neo/commit/9d047b36	X recvPkt via only 1 syscall
      lab.nexedi.com/kirr/neo/commit/b555a507	X baseline net RTT benchmark
      lab.nexedi.com/kirr/neo/commit/91be5cdd	X everyone is listening from start; CloseAccept to disable listening - works
      lab.nexedi.com/kirr/neo/commit/c2a1b63a	X naming: Packet = raw data; Message = meaningful object
      lab.nexedi.com/kirr/neo/commit/6fd0c9be	X connection: Adding context to errors from NodeLink and Conn operations
      lab.nexedi.com/kirr/neo/commit/65b17bdc	X rework Conn acceptance to be explicit via NodeLink.Accept
      64513925
    • Kirill Smelkov's avatar
      go/neo/proto: Test that message codes are the same in between Go and Python NEO versions · 5beab048
      Kirill Smelkov authored
      This brings some go/py compatibility checks that verify that Go and Python
      treat message codes identically. Although message encoding is tested in
      the previous patch, there are no explicit tests for go/py compatibility of
      message encoding.
      5beab048
    • Kirill Smelkov's avatar
      go/neo/proto: Serialization support · ea5f7d61
      Kirill Smelkov authored
      Provide a way for every message to be encoded/decoded to/from NEO wire
      encoding. For this introduce Msg interface with wire coding methods and
      provide such methods for all message types.
      
      For selected types the methods are implemented manually.
      For most of the types the methods are generated automatically by protogen.go program.
      
      protogen.go was mentioned in http://navytux.spb.ru/~kirr/neo.html#development-overview
      in "On server-side NEO/go work started by first implementing messages
      serialization in exactly the same wire format as NEO/py does ..." paragraph.
      
      A bit of late protogen fixups history:
      
      	lab.nexedi.com/kirr/neo/commit/c884bfd5
      	lab.nexedi.com/kirr/neo/commit/385d813a
      	lab.nexedi.com/kirr/neo/commit/0f7e0b00
      	lab.nexedi.com/kirr/neo/commit/de3ef2c0
      
      Also a message type can be reverse-looked up by message code via MsgType().
      This will be later used in network receive code path.
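
      A rough sketch of the idea, under assumptions: the method names on Msg,
      the Ping message, its code 1, the big-endian layout, and the registry
      shape are all made up for illustration and do not reproduce the real
      proto package or the NEO wire format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"reflect"
)

// Msg is a hypothetical stand-in for a message with wire coding methods.
type Msg interface {
	MsgCode() uint16
	EncodedLen() int
	Encode(buf []byte)               // buf must have at least EncodedLen() bytes
	Decode(data []byte) (int, error) // returns the number of bytes consumed
}

// Ping is an invented message type with a single field.
type Ping struct {
	Seq uint32
}

func (*Ping) MsgCode() uint16 { return 1 }
func (*Ping) EncodedLen() int { return 4 }

func (p *Ping) Encode(buf []byte) {
	binary.BigEndian.PutUint32(buf, p.Seq)
}

func (p *Ping) Decode(data []byte) (int, error) {
	if len(data) < 4 {
		return 0, errors.New("ping: short buffer")
	}
	p.Seq = binary.BigEndian.Uint32(data)
	return 4, nil
}

// msgTypeRegistry maps a message code back to the Go type of the message.
var msgTypeRegistry = map[uint16]reflect.Type{
	1: reflect.TypeOf(Ping{}),
}

// msgType does the reverse lookup; it returns nil for unknown codes.
func msgType(code uint16) reflect.Type {
	return msgTypeRegistry[code]
}

// decodeByCode shows how a receive path could use the reverse lookup:
// create an empty message of the right type and decode the payload into it.
func decodeByCode(code uint16, payload []byte) (Msg, error) {
	t := msgType(code)
	if t == nil {
		return nil, fmt.Errorf("unknown message code %d", code)
	}
	m := reflect.New(t).Interface().(Msg)
	if _, err := m.Decode(payload); err != nil {
		return nil, err
	}
	return m, nil
}

func main() {
	in := &Ping{Seq: 42}
	buf := make([]byte, in.EncodedLen())
	in.Encode(buf)

	out, err := decodeByCode(in.MsgCode(), buf)
	fmt.Println(out, err)
}
```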
      ea5f7d61
    • Kirill Smelkov's avatar
      go/neo/proto: String/Error/Address/UUID conversion · fcd6f9f6
      Kirill Smelkov authored
      Provide routines to convert selected types to string and also for
      UUID and Address <-> string  encoding/decoding.
      fcd6f9f6
    • Kirill Smelkov's avatar
      go/neo/proto: Start (first draft) · 6301c23f
      Kirill Smelkov authored
      Start NEO/go code with the protocol package, which defines the messages that
      NEO nodes exchange with each other. The definitions are based on
      neo/lib/protocol.py and are kept in sync with that file.
      
      This commit brings only the message definitions. Message serialization will
      come in the follow-up patch.
      6301c23f
  2. 05 Jul, 2018 1 commit
  3. 04 Jul, 2018 1 commit
  4. 03 Jul, 2018 1 commit
  5. 09 Apr, 2018 1 commit
  6. 13 Mar, 2018 1 commit
  7. 02 Mar, 2018 3 commits
    • Julien Muchembled's avatar
      master: fix resumption of backup replication (internal or not) · 27229793
      Julien Muchembled authored
      Before, it waited for upstream activity until all partitions were touched.
      However, when upstream is idle, the backup cluster could remain stuck forever
      if it was interrupted while some cells were still late.
      27229793
    • Julien Muchembled's avatar
      master: fix/simplify generation of TID · 7b2e6752
      Julien Muchembled authored
      The 'min_tid < new_tid' assertion failed when jumping to the past.
      7b2e6752
    • Julien Muchembled's avatar
      master: fix possible failure when reading data in a backup cluster with replicas · ca2f7061
      Julien Muchembled authored
      Given that:
      - read locks are only taken by transactions (not replication)
      - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
        are synchronized up to different tids
      
      there was a race condition with the master node replying to LastTransaction
      with a TID that may not be replicated yet by all replicas, potentially causing
      such replicas to reply OidDoesNotExist or OidNotFound if a client asks them
      for data too early.
      
      IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
      it is only readable by NEO clients up to `getBackupTid(min)` as long as the
      cluster is in BACKINGUP state.
      ca2f7061
  8. 17 Jan, 2018 1 commit
  9. 15 Jan, 2018 21 commits
    • Kirill Smelkov's avatar
      go/zodb/zodbtools: TODO (cmp, analyze) · 6faed528
      Kirill Smelkov authored
      6faed528
    • Kirill Smelkov's avatar
      go/zodb/zodbtools: Catobj · aa1d7e12
      Kirill Smelkov authored
      `zodb catobj` command to dump content of an object - similarly to `git
      cat-file`. Two modes: raw and verbose with `zodb dump` like headers for
      the object present.
      
      There is no such command currently in zodbtools/py.
      aa1d7e12
    • Kirill Smelkov's avatar
      go/zodb/zodbtools: Info · 27d02ad5
      Kirill Smelkov authored
      Command to print general information about a ZODB database.
      Same as `zodb info` in zodbtools/py.
      27d02ad5
    • Kirill Smelkov's avatar
      go/zodb/zodbtools: Dump · dbb63f65
      Kirill Smelkov authored
      Add `zodb dump` command to dump arbitrary ZODB database in generic
      format. The actual dump protocol being used here is the same as in
      zodbtools/py with
      
      	https://lab.nexedi.com/zodbtools/merge_requests/3
      
      applied. (the MR there is OK and is just waiting for upstream ZODB to
      negotiate a way to retrieve transaction extension data in raw form).
      dbb63f65
    • Kirill Smelkov's avatar
      go/zodb: Start of zodbtools - tools for managing ZODB databases · c6457cf7
      Kirill Smelkov authored
      Add zodbtools, a generic (in contrast to fs1tools) set of ZODB
      management utilities. Only the package and command infrastructure is here -
      the actual commands will follow in the next patches.
      c6457cf7
    • Kirill Smelkov's avatar
      go/zodb/fs1tools: Reindex, Verify-index · 11ee44e0
      Kirill Smelkov authored
      Add commands for FileStorage index maintenance: manually rebuild the
      index and perform index verification.
      11ee44e0
    • Kirill Smelkov's avatar
      go/zodb/fs1tools: Dump · 9de107fe
      Kirill Smelkov authored
      Add various FileStorage-specific dump commands with output being
      bit-to-bit exact with the following ZODB/py FileStorage tools:
      
      - fsdump.py
      - fsdump.py (verbose dumper)
      - fstail.py
      
      Please see the patch for links about these dump formats.
      9de107fe
    • Kirill Smelkov's avatar
      go/zodb/fs1: My notes on I/O · 0814c1e1
      Kirill Smelkov authored
      0814c1e1
    • Kirill Smelkov's avatar
      d232237e
    • Kirill Smelkov's avatar
      go/zodb/fs1: Actual FileStorage ZODB driver · 7792a133
      Kirill Smelkov authored
      Build FileStorage ZODB driver out of format record loading/decoding
      and index routines we just added in previous patches.
      
      The driver supports only read-only mode so far.
      
      Promised tests for data format interoperability with ZODB/py are added.
      7792a133
    • Kirill Smelkov's avatar
      go/zodb/fs1: Index save/load · 8fa9fdaf
      Kirill Smelkov authored
      Build index type on top of fsb.Tree introduced in the previous patch and
      add routines to save and load it to/from disk.
      
      We ensure ZODB/py compatibility via generating test FileStorage database
      + its index and checking we can load index from it and also that if we
      save an index ZODB/py can load it back. FileStorage index is hard to get
      bit-to-bit identical since this index uses python pickles which can
      encode the same objects in several different ways.
      8fa9fdaf
    • Kirill Smelkov's avatar
      go/zodb/fs1: BTree specialized with KEY=zodb.Oid, VALUE=int64 · 33d10066
      Kirill Smelkov authored
      The FileStorage index maps an oid to the file position of the latest data
      record for that oid. This index is natural to implement via a BTree, as e.g.
      ZODB/py does.
      
      In Go world there is github.com/cznic/b BTree library but without
      specialization and working via interface{} it is slower than it could be
      and allocates a lot. So generate specialized version of that code with
      key and value types exactly suitable for FileStorage indexing.
      
      We use a bit patched b version with speed ups for bulk-loading data via
      regular point-ingestion BTree entry point:
      
      	https://lab.nexedi.com/kirr/b x/refill
      
      The patches have not been upstreamed because they slow down the general case
      a bit (only a bit, but still this is a "no" to me), and because with a
      dedicated bulk-loading API it could be possible to load data
      several times faster still. Still, the current version is enough for
      indices that are not very huge.
      
      Btw ZODB/py does the same (see fsBucket + friends).
      33d10066
    • Kirill Smelkov's avatar
      go/zodb: Start of FileStorage support · 8f64f6ed
      Kirill Smelkov authored
      Start implementing FileStorage support by adding code to load/decode
      FileStorage records and a way to iterate over a FileStorage.
      
      Tests will come in a later patch together with ZODB-level loading
      support.
      8f64f6ed
    • Kirill Smelkov's avatar
      go/zodb: Way for storage-drivers to be registered and for clients to open them by URL · fcab9405
      Kirill Smelkov authored
      Storage drivers can register themselves via zodb.RegisterDriver.
      
      Later clients can request to open a storage by URL via zodb.OpenStorage.
      The opener will look up the driver registry and wrap the created driver
      instance with a common layer (cache etc.) to turn an IStorageDriver into a
      fully working IStorage.
      fcab9405
    • Kirill Smelkov's avatar
      zodb/go: In-RAM client cache · 7233b4c0
      Kirill Smelkov authored
      The cache is needed so that we can provide IStorage.Prefetch
      functionality generally wrapped on top of a storage driver: when an
      object is loaded, the loading itself consists of steps:
      
      1. start loading object into cache,
      2. wait for the loading to complete.
      
      This way Prefetch is naturally only "1" - start loading object into
      cache but do not wait for the loading to be complete. Go's goroutines
      naturally help here where we can spawn every such loading into its own
      goroutine instead of explicitly programming loading in terms of a state
      machine.
      
      Since this cache is mainly needed for Prefetch to work, not to actually
      cache data (though it works as cache for repeating access too), the goal
      when writing it was to add minimal overhead for "data-not-yet-in-cache"
      case. Currently we are not completely there yet, but the latency is
      acceptable - depending on the workload the cache layer adds ~
      
      	0.5 - 1 - 3µs
      
      to loading times.
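
      The "start loading" / "wait for loading" split can be sketched like this.
      It is a simplified illustration under assumptions: the Cache and entry
      types, the oid-only key, the string payload and the absence of eviction
      are all simplifications and do not mirror the actual cache code.

```go
package main

import (
	"fmt"
	"sync"
)

// entry holds the result of one load; ready is closed when it is filled in.
type entry struct {
	ready chan struct{}
	data  string
	err   error
}

// Cache is a hypothetical minimal cache keyed by oid only.
type Cache struct {
	mu      sync.Mutex
	entries map[uint64]*entry
	load    func(oid uint64) (string, error) // underlying storage driver load
}

func NewCache(load func(oid uint64) (string, error)) *Cache {
	return &Cache{entries: make(map[uint64]*entry), load: load}
}

// begin is step 1: start loading the object into the cache unless a load
// for it was already started. Each load runs in its own goroutine.
func (c *Cache) begin(oid uint64) *entry {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[oid]; ok {
		return e
	}
	e := &entry{ready: make(chan struct{})}
	c.entries[oid] = e
	go func() {
		e.data, e.err = c.load(oid)
		close(e.ready)
	}()
	return e
}

// Prefetch is only step 1: start loading, do not wait.
func (c *Cache) Prefetch(oid uint64) {
	c.begin(oid)
}

// Load is step 1 followed by step 2: wait for the loading to complete.
func (c *Cache) Load(oid uint64) (string, error) {
	e := c.begin(oid)
	<-e.ready
	return e.data, e.err
}

func main() {
	cache := NewCache(func(oid uint64) (string, error) {
		return fmt.Sprintf("object-%d", oid), nil
	})
	cache.Prefetch(1) // returns immediately
	data, err := cache.Load(1)
	fmt.Println(data, err)
}
```

      A Load that follows a Prefetch for the same object waits on the load
      already in flight instead of starting a second one.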
      7233b4c0
    • Kirill Smelkov's avatar
      go/zodb: Minimal serialization compatibility with ZODB/py · dfd4fb73
      Kirill Smelkov authored
      ZODB/py serializes data using python pickles. Basically every serialized
      object has two parts: class description and object state. Here we
      start by providing minimal functionality to extract class-name from
      serialized data.
      
      The library used for pickle decoding (and in later patches encoding) is
      
      	github.com/kisielk/og-rek
      
      It was audited by me for security flaws to some extent.
      
      Contrary to Python pickle module it does not run arbitrary code on
      decoding.
      dfd4fb73
    • Kirill Smelkov's avatar
      go/zodb: Tid connection with time · bac6c953
      Kirill Smelkov authored
      Since ZODB TIDs correspond to time, provide functionality to
      convert a tid to a timestamp. Do so in exactly the same way as ZODB/py
      does, for interoperability.
      bac6c953
    • Kirill Smelkov's avatar
      3d13a276