- 08 Jul, 2018 3 commits
-
-
Kirill Smelkov authored
Provide a way for every message to be encoded/decoded to/from NEO wire encoding. For this introduce Msg interface with wire coding methods and provide such methods for all message types. For selected types the methods are implemented manually. For most of the types the methods are generated automatically by protogen.go program. protogen.go was mentioned in http://navytux.spb.ru/~kirr/neo.html#development-overview in "On server-side NEO/go work started by first implementing messages serialization in exactly the same wire format as NEO/py does ..." paragraph. A bit of late protogen fixups history: lab.nexedi.com/kirr/neo/commit/c884bfd5 lab.nexedi.com/kirr/neo/commit/385d813a lab.nexedi.com/kirr/neo/commit/0f7e0b00 lab.nexedi.com/kirr/neo/commit/de3ef2c0 Also a message type can be reverse-looked up by message code via MsgType(). This will be later used in network receive code path.
-
Kirill Smelkov authored
Provide routines to convert selected types to string and also for UUID and Address <-> string encoding/decoding.
-
Kirill Smelkov authored
Start NEO/go code with protocol package that defines message that NEO nodes exchange in between each other. The definition is based on neo/lib/protocol.py and is kept in sync with that file. This commit brings only messages definition. Messages serialization will come in the follow-up patch.
-
- 05 Jul, 2018 1 commit
-
-
Kirill Smelkov authored
We will need to use BE16 and BE32 in the next patch.
-
- 04 Jul, 2018 1 commit
-
-
Kirill Smelkov authored
See nexedi/zodbtools@b1163449 for details.
-
- 03 Jul, 2018 1 commit
-
-
Kirill Smelkov authored
-
- 09 Apr, 2018 1 commit
-
-
Kirill Smelkov authored
Tomáš Peterka noticed that gotos in dump.go are not actually needed because the same functionality could be achieved with defer in more clean and structured way. Do it. This brings ~ 5% performance hit name old time/op new time/op delta ZodbDump-4 148µs ± 1% 155µs ± 2% +4.69% (p=0.000 n=9+10) because defer implementation is currently not great (https://github.com/golang/go/issues/14939) If we absolutely need those 5% back it could be worked around similar to e.g. FileStorage.Load: https://lab.nexedi.com/kirr/neo/blob/6faed528/go/zodb/storage/fs1/filestorage.go#L133 https://lab.nexedi.com/kirr/neo/blob/6faed528/go/zodb/storage/fs1/filestorage.go#L141 /suggested-by @katomaso
-
- 13 Mar, 2018 1 commit
-
-
Julien Muchembled authored
-
- 02 Mar, 2018 3 commits
-
-
Julien Muchembled authored
Before, it waited for upstream activity until all partitions are touched. However, when upstream is idle the backup cluster could remain stuck forever if it was interrupted whereas some cells were still late.
-
Julien Muchembled authored
The 'min_tid < new_tid' assertion failed when jumping to the past.
-
Julien Muchembled authored
Given that: - read locks are only taken by transactions (not replication) - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions are synchronized up to different tids there was a race condition with the master node replying to LastTransaction with a TID that may not be replicated yet by all replicas, potentially causing such replicas to reply OidDoesNotExist or OidNotFound if a client asks it data too early. IOW, even if the cluster does contain the data up to `getBackupTid(max)`, it is only readable by NEO clients up to `getBackupTid(min)` as long as the cluster is in BACKINGUP state.
-
- 17 Jan, 2018 1 commit
-
-
Kirill Smelkov authored
Usage of supportsTransactionalUndo() was removed from ZODB in 2007 - see e.g. the following commits: https://github.com/zopefoundation/ZODB/commit/a06bfc03 https://github.com/zopefoundation/ZODB/commit/e667b022 https://github.com/zopefoundation/ZODB/commit/f595f7e7 ... /reviewed-by @vpelletier /reviewed-on nexedi/neoppod!8
-
- 15 Jan, 2018 25 commits
-
-
Kirill Smelkov authored
-
Kirill Smelkov authored
`zodb catobj` command to dump content of an object - similarly to `git cat-file`. Two modes: raw and verbose with `zodb dump` like headers for the object present. There is no such command currently in zodbtools/py.
-
Kirill Smelkov authored
Command to print general information about a ZODB database. Same as `zodb info` in zodbtools/py.
-
Kirill Smelkov authored
Add `zodb dump` command to dump arbitrary ZODB database in generic format. The actual dump protocol being used here is the same as in zodbtools/py with https://lab.nexedi.com/zodbtools/merge_requests/3 applied. (the MR there is OK and is just waiting for upstream ZODB to negotiate a way to retrieve transaction extension data in raw form).
-
Kirill Smelkov authored
Add zodbtools which is generic (contrast to fs1tools) set of ZODB managing utilities. Only package and command infrastructure here - actual commands will follow up in the next patches.
-
Kirill Smelkov authored
-
Kirill Smelkov authored
Add commands for FileStorage index maintainance: manually rebuild the index and to performe index verification.
-
Kirill Smelkov authored
Add various FileStorage-specific dump commands with output being bit-to-bit exact with the following ZODB/py FileStorage tools: - fsdump.py - fsdump.py (verbose dumper) - fstail.py Please see the patch for links about this dump formats.
-
Kirill Smelkov authored
-
Kirill Smelkov authored
-
Kirill Smelkov authored
-
Kirill Smelkov authored
Build FileStorage ZODB driver out of format record loading/decoding and index routines we just added in previous patches. The driver supports only read-only mode so far. Promised tests for data format interoperability with ZODB/py are added.
-
Kirill Smelkov authored
-
Kirill Smelkov authored
Build index type on top of fsb.Tree introduced in the previous patch and add routines to save and load it to/from disk. We ensure ZODB/py compatibility via generating test FileStorage database + its index and checking we can load index from it and also that if we save an index ZODB/py can load it back. FileStorage index is hard to get bit-to-bit identical since this index uses python pickles which can encode the same objects in several different ways.
-
Kirill Smelkov authored
FileStorage index maps oid to file position storing latest data record for this oid. This index is naturally to implement via BTree as e.g. ZODB/py does. In Go world there is github.com/cznic/b BTree library but without specialization and working via interface{} it is slower than it could be and allocates a lot. So generate specialized version of that code with key and value types exactly suitable for FileStorage indexing. We use a bit patched b version with speed ups for bulk-loading data via regular point-ingestion BTree entry point: https://lab.nexedi.com/kirr/b x/refill The patches has not been upstreamed because it slows down general case a bit (only a bit, but still this is a "no" to me), and because with dedicated bulk-loading API it could be possible to still load data several times faster. Still current version is enough for not very-huge indices. Btw ZODB/py does the same (see fsBucket + friends).
-
Kirill Smelkov authored
Start implementing FileStorage support by adding code to load/decode FileStorage records and way to iterate a FileStorage. Tests will come in a later patch together with ZODB-level loading support.
-
Kirill Smelkov authored
Storage drivers can register themselves via zodb.RegisterDriver. Later cliens can request to open a storage by URL via zodb.OpenStorage. The opener will lookup driver registry and wrap created driver instance with common layer with cache etc to turn an IStorageDriver into fully working IStorage.
-
Kirill Smelkov authored
The cache is needed so that we can provide IStorage.Prefetch functionality generally wrapped on top of a storage driver: when an object is loaded, the loading itself consists of steps: 1. start loading object into cache, 2. wait for the loading to complete. This way Prefetch is naturally only "1" - start loading object into cache but do not wait for the loading to be complete. Go's goroutines naturally help here where we can spawn every such loading into its own goroutine instead of explicitly programming loading in terms of a state machine. Since this cache is mainly needed for Prefetch to work, not to actually cache data (though it works as cache for repeating access too), the goal when writing it was to add minimal overhead for "data-not-yet-in-cache" case. Current state we are not completely there yet but the latency is acceptable - depending on the workload the cache layer adds ~ 0.5 - 1 - 3µs to loading times.
-
Kirill Smelkov authored
ZODB/py serializes data using python pickles. Basically every serialized object has two parts: class description and object state. Here we start by providing minimal functionality to extract class-name from serialized data. The library used for pickle decoding (and in later patches encoding) is github.com/kisielk/og-rek It was audited by me for security flaws to some extent. Contrary to Python pickle module it does not run arbitrary code on decoding.
-
Kirill Smelkov authored
Since in ZODB TIDs are corresponding to time, provide functionality to convert a tid to timestamp. Do so in exactly the same way as ZODB/py does for interoperability.
-
Kirill Smelkov authored
-
Kirill Smelkov authored
Our path of implementing NEO in Go will be not only for server-side, but also for client-side, since it is needed by Wendelin.core. On server-side we'll also need to work with types and data model Python ZODB implementation uses, so here it goes: Start of ZODB in Go. Here we define ZODB data types, data model and operational interfaces for IStorage + friends. The interfaces are currently read-only with stubs for write mode.
-
Kirill Smelkov authored
Ignore files commonly produced while profiling Go programs and running tests.
-
Kirill Smelkov authored
We want to make sure the code can be used by all projects without a problem. This way the license is GPLv3+ with wide exception for all Free Software / Open Source projects + Business options. Nexedi stack is licensed under Free Software licenses with various exceptions that cover three business cases: - Free Software - Proprietary Software - Rebranding As long as one intends to develop Free Software based on Nexedi stack, no license cost is involved. Developing proprietary software based on Nexedi stack may require a proprietary exception license. Rebranding Nexedi stack is prohibited unless rebranding license is acquired. Through this licensing approach, Nexedi expects to encourage Free Software development without restrictions and at the same time create a framework for proprietary software to contribute to the long term sustainability of the Nexedi stack. Please see https://www.nexedi.com/licensing for details, rationale and options. ( NEO/py for now stays at the old terms but it will be upgraded to the same terms as NEO/go eventually )
-
Kirill Smelkov authored
Sync with current NEO in Python implementation as the first step. We'll be using some common bits and in particular on-the-wire protocol must be the same and for py/go interoperability testing we'll also need python parts.
-
- 11 Jan, 2018 1 commit
-
-
Julien Muchembled authored
The issue was that at startup, or after nodes are back, the previous code prevented full load balancing until some data are written. It was like this to limit the number of connections, which does not matter anymore (see commit 77132157).
-
- 08 Jan, 2018 1 commit
-
-
Julien Muchembled authored
# Previous status The issue was that we had extreme storage fragmentation from the point of view of the replication algorithm, which processes one partition at a time. By using an autoincrement for the 'data' table, rows were ordered by the time at which they were added: - parts may be the result of replication -> ordered by partition, tid, oid - other rows are globally sorted by tid Which means that when scanning a given partition, many rows were skipped all the time: - if readahead is bigger enough, the efficiency is 1/N for a node with N partitions assigned - else, it is worse because it seeks all the time For huge databases, the replication was horribly slow, in particular from HDD. # Chosen solution This commit changes how ids are generated to somehow split 'data' per partition. The backend tracks 1 last id per assigned partition, where the 16 higher bits contains the partition. Keep in mind that the value of id has no meaning and it's only chosen for performance reasons. IOW, a row can be referred by an oid of a partition different than the 16 higher bits of id: - there's no migration needed and the 16 higher bits of all existing rows are 0 - in case of deduplication, a row can still be shared by different partitions Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement on existing databases. ## Downsides On insertion, increasing the number of partitions now slows down significantly: for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12 partitions, the difference remains negligible. The solution for this issue will be to enable to increase the number of partitions efficiently, so that nodes can keep a small number of them, even for DB that are expected to grow so much that many nodes are added over time: such feature was already considered so that users don't have to worry anymore about this obscure setting at database creation. Read performance is only slowed down for applications that read a lot of data that were written contiguously, but split in small blocks. A solution is to extend ZODB so that the application tells it to chose new oids that will end up in the same partition. Like for insertion, there should not be too many partitions. With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to collect all last ids at startup when there are many partitions. ## Other advantages - The storage layout of data is now always the same and does not depend on whether rows came from replication or commits. - Efficient deletion of partition to free space in-place will be possible. # Considered alternative The only serious alternative was to replicate as many partitions as possible at the same time, ideally all assigned partitions, but it's not always possible. For best performance, it would often require to synchronize new nodes, or even all of them, so that thesource nodes don't have to scan 'data' several times. If existing nodes are kept, all data that aren't copied to the newly added nodes have to be skipped. If the number of nodes is multiplied by N, the efficiency is 1-1/N at best (synchronized nodes), else it's even worse because partitions are somehow shuffled. Checking/replacing a single node would remain slow when there are several source nodes. At last, such an algorithm would be much more complex and we would not have the other advantages listed above.
-
- 05 Jan, 2018 1 commit
-
-
Julien Muchembled authored
For existing DB, altering the table may be doable with schema editing and clean up of sqlite_sequence.
-