Commits · 49a9cdcc72b4e46880c8f3d23505c358e55d07f6 · Levin Zimmermann / neoppod

05 Jul, 2018 1 commit
- go/internal/packed: Small package that provides types to be used in packed structures. · 49a9cdcc
  Kirill Smelkov authored Jul 04, 2018
```
We will need to use BE16 and BE32 in the next patch.
```
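A minimal sketch of what big-endian packed types can look like, assuming they are plain byte arrays with conversion helpers. Only the names `BE16` and `BE32` come from the message above; the helper names and the `header` struct are hypothetical and the real package may be organized differently.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// BE16 and BE32 keep integers in big-endian byte order. Being byte arrays,
// they have alignment 1, so a struct built from them has no padding.
// NOTE: illustrative definitions, not the actual go/internal/packed code.
type BE16 [2]byte
type BE32 [4]byte

// Hton16 converts a host-order value to its packed big-endian form.
func Hton16(v uint16) (b BE16) { binary.BigEndian.PutUint16(b[:], v); return b }

// Ntoh16 converts a packed big-endian value back to host order.
func Ntoh16(b BE16) uint16 { return binary.BigEndian.Uint16(b[:]) }

// Hton32 and Ntoh32 are the 32-bit counterparts.
func Hton32(v uint32) (b BE32) { binary.BigEndian.PutUint32(b[:], v); return b }
func Ntoh32(b BE32) uint32     { return binary.BigEndian.Uint32(b[:]) }

// header is a hypothetical packed record header used only for this example.
type header struct {
	Kind BE16
	Len  BE32
}

func main() {
	h := header{Kind: Hton16(0x0102), Len: Hton32(0x0a0b0c0d)}
	fmt.Println(Ntoh16(h.Kind), Ntoh32(h.Len), h.Kind, h.Len)
}
```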
  49a9cdcc
04 Jul, 2018 1 commit
- go/zodb/*.py: qq moved from zodbtools -> pygolang · cdd7234b
  Kirill Smelkov authored Jul 04, 2018
```
See nexedi/zodbtools@b1163449 for details.
```
  cdd7234b
03 Jul, 2018 1 commit
- Sync with NEO/py v1.9 · 4a2407dc
  Kirill Smelkov authored Jul 03, 2018
  
  4a2407dc
09 Apr, 2018 1 commit

go/zodb/zodbtools/dump: Don't use goto · 5bf40022

Kirill Smelkov authored Apr 09, 2018

Tomáš Peterka noticed that gotos in dump.go are not actually needed because
the same functionality could be achieved with defer in a cleaner and more
structured way.

Do it.

This brings a ~5% performance hit

	name        old time/op  new time/op  delta
	ZodbDump-4   148µs ± 1%   155µs ± 2%  +4.69%  (p=0.000 n=9+10)

because the defer implementation is currently not great
(https://github.com/golang/go/issues/14939)

If we absolutely need those 5% back, this could be worked around similarly to
e.g. FileStorage.Load:

https://lab.nexedi.com/kirr/neo/blob/6faed528/go/zodb/storage/fs1/filestorage.go#L133
https://lab.nexedi.com/kirr/neo/blob/6faed528/go/zodb/storage/fs1/filestorage.go#L141
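As a rough illustration of the goto-to-defer change described above (not the actual dump.go code), the sketch below releases a resource and adds error context on every return path with defer instead of jumping to a shared cleanup label. The `resource` type and `process` function are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// resource is a hypothetical stand-in for something that must be released.
type resource struct{ closed bool }

func (r *resource) Close() { r.closed = true }

// process releases r and decorates the returned error on every return path.
// With goto, each failure site would jump to a common "out:" label instead.
func process(r *resource, fail bool) (err error) {
	defer r.Close()
	defer func() {
		if err != nil {
			err = fmt.Errorf("process: %w", err)
		}
	}()

	if fail {
		return errors.New("bad record")
	}
	return nil
}

func main() {
	r := &resource{}
	err := process(r, true)
	fmt.Println(err, r.closed)
}
```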

/suggested-by @katomaso

5bf40022

13 Mar, 2018 1 commit
- Release version 1.9 · 1b57a7ae
  Julien Muchembled authored Mar 13, 2018
  
  1b57a7ae
02 Mar, 2018 3 commits

master: fix resumption of backup replication (internal or not) · 27229793

Julien Muchembled authored Feb 27, 2018

Before, it waited for upstream activity until all partitions were touched.
However, when upstream is idle, the backup cluster could remain stuck forever
if it was interrupted while some cells were still late.

27229793

master: fix/simplify generation of TID · 7b2e6752
Julien Muchembled authored Feb 27, 2018
```
The 'min_tid < new_tid' assertion failed when jumping to the past.
```
7b2e6752

master: fix possible failure when reading data in a backup cluster with replicas · ca2f7061

Julien Muchembled authored Feb 14, 2018

Given that:
- read locks are only taken by transactions (not replication)
- in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
  are synchronized up to different tids

there was a race condition with the master node replying to LastTransaction
with a TID that may not be replicated yet by all replicas, potentially causing
such replicas to reply OidDoesNotExist or OidNotFound if a client asks them for
data too early.

IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
it is only readable by NEO clients up to `getBackupTid(min)` as long as the
cluster is in BACKINGUP state.
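The min/max distinction can be sketched as follows, assuming each replica reports the TID up to which it has been synchronized. The function name and the slice-of-TIDs representation are hypothetical and only illustrate why the minimum is the safe answer; this is not NEO's implementation.

```go
package main

import "fmt"

// Tid is a transaction identifier; larger means later.
type Tid uint64

// readableUpTo returns the highest TID that every replica has already
// replicated, i.e. the minimum of the per-replica backup TIDs.
// Answering with the maximum instead could point clients at data that
// some replica does not have yet.
func readableUpTo(backupTids []Tid) Tid {
	if len(backupTids) == 0 {
		return 0
	}
	lowest := backupTids[0]
	for _, t := range backupTids[1:] {
		if t < lowest {
			lowest = t
		}
	}
	return lowest
}

func main() {
	fmt.Println(readableUpTo([]Tid{30, 10, 20}))
}
```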

ca2f7061

17 Jan, 2018 1 commit

client: kill .supportsTransactionalUndo() · f95f336a

Kirill Smelkov authored Jan 16, 2018

Usage of supportsTransactionalUndo() was removed from ZODB in 2007 - see
e.g. the following commits:

https://github.com/zopefoundation/ZODB/commit/a06bfc03
https://github.com/zopefoundation/ZODB/commit/e667b022
https://github.com/zopefoundation/ZODB/commit/f595f7e7
...

/reviewed-by @vpelletier
/reviewed-on nexedi/neoppod!8

f95f336a

15 Jan, 2018 25 commits

go/zodb/zodbtools: TODO (cmp, analyze) · 6faed528
Kirill Smelkov authored Jan 15, 2018

6faed528

go/zodb/zodbtools: Catobj · aa1d7e12

Kirill Smelkov authored Jan 15, 2018

`zodb catobj` command to dump the content of an object, similarly to `git
cat-file`. Two modes: raw, and verbose with `zodb dump`-like headers for
the object.

There is no such command currently in zodbtools/py.

aa1d7e12

go/zodb/zodbtools: Info · 27d02ad5

Kirill Smelkov authored Jan 15, 2018

Command to print general information about a ZODB database.
Same as `zodb info` in zodbtools/py.

27d02ad5

go/zodb/zodbtools: Dump · dbb63f65

Kirill Smelkov authored Jan 15, 2018

Add `zodb dump` command to dump arbitrary ZODB database in generic
format. The actual dump protocol being used here is the same as in
zodbtools/py with

	zodbtools!3

applied. (the MR there is OK and is just waiting for upstream ZODB to
negotiate a way to retrieve transaction extension data in raw form).

dbb63f65

go/zodb: Start of zodbtools - tools for managing ZODB databases · c6457cf7

Kirill Smelkov authored Jan 15, 2018

Add zodbtools, which is a generic (in contrast to fs1tools) set of ZODB
management utilities. Only package and command infrastructure here -
actual commands will follow up in the next patches.

c6457cf7

go/zodb/fs1tools: Notes about other possible useful commands currently being there on ZODB/py side · 99b986f3
Kirill Smelkov authored Jan 15, 2018

99b986f3

go/zodb/fs1tools: Reindex, Verify-index · 11ee44e0

Kirill Smelkov authored Jan 15, 2018

Add commands for FileStorage index maintenance: manually rebuild the
index and perform index verification.

11ee44e0

go/zodb/fs1tools: Dump · 9de107fe

Kirill Smelkov authored Jan 15, 2018

Add various FileStorage-specific dump commands with output being
bit-to-bit exact with the following ZODB/py FileStorage tools:

- fsdump.py
- fsdump.py (verbose dumper)
- fstail.py

Please see the patch for links about these dump formats.

9de107fe

go/zodb/fs1: Start fs1tools - tools for managing and maintaining ZODB FileStorage v1 databases · db167e69
Kirill Smelkov authored Jan 15, 2018

db167e69
go/zodb/fs1: My notes on I/O · 0814c1e1
Kirill Smelkov authored Jan 15, 2018

0814c1e1
go/zodb/fs1: Register FileStorage to zodb & wks · d232237e
Kirill Smelkov authored Jan 15, 2018

d232237e

go/zodb/fs1: Actual FileStorage ZODB driver · 7792a133

Kirill Smelkov authored Jan 15, 2018

Build FileStorage ZODB driver out of format record loading/decoding
and index routines we just added in previous patches.

The driver supports only read-only mode so far.

Promised tests for data format interoperability with ZODB/py are added.

7792a133

go/zodb/fs1: Add routines to (re)build and verify index from/wrt original FileStorage data · d3bf6538
Kirill Smelkov authored Jan 15, 2018

d3bf6538

go/zodb/fs1: Index save/load · 8fa9fdaf

Kirill Smelkov authored Jan 15, 2018

Build index type on top of fsb.Tree introduced in the previous patch and
add routines to save and load it to/from disk.

We ensure ZODB/py compatibility via generating test FileStorage database
+ its index and checking we can load index from it and also that if we
save an index ZODB/py can load it back. FileStorage index is hard to get
bit-to-bit identical since this index uses python pickles which can
encode the same objects in several different ways.

8fa9fdaf

go/zodb/fs1: BTree specialized with KEY=zodb.Oid, VALUE=int64 · 33d10066

Kirill Smelkov authored Jan 15, 2018

FileStorage index maps oid to the file position storing the latest data record
for this oid. This index is natural to implement via a BTree, as e.g.
ZODB/py does.

In Go world there is github.com/cznic/b BTree library but without
specialization and working via interface{} it is slower than it could be
and allocates a lot. So generate specialized version of that code with
key and value types exactly suitable for FileStorage indexing.

We use a slightly patched version of b with speed-ups for bulk-loading data via
the regular point-ingestion BTree entry point:

	https://lab.nexedi.com/kirr/b x/refill

The patches have not been upstreamed because they slow down the general case a
bit (only a bit, but still this is a "no" to me), and because with a
dedicated bulk-loading API it could be possible to load data
several times faster still. Still, the current version is enough for indices
that are not very huge.

Btw ZODB/py does the same (see fsBucket + friends).

33d10066

go/zodb: Start of FileStorage support · 8f64f6ed

Kirill Smelkov authored Jan 15, 2018

Start implementing FileStorage support by adding code to load/decode
FileStorage records and way to iterate a FileStorage.

Tests will come in a later patch together with ZODB-level loading
support.

8f64f6ed

go/zodb: Way for storage-drivers to be registered and for clients to open them by URL · fcab9405

Kirill Smelkov authored Jan 15, 2018

Storage drivers can register themselves via zodb.RegisterDriver.

Later clients can request to open a storage by URL via zodb.OpenStorage.
The opener will look up the driver registry and wrap the created driver instance
with a common layer (cache etc.) to turn an IStorageDriver into a fully
working IStorage.

fcab9405

zodb/go: In-RAM client cache · 7233b4c0

Kirill Smelkov authored Jan 15, 2018

The cache is needed so that we can provide IStorage.Prefetch
functionality generally wrapped on top of a storage driver: when an
object is loaded, the loading itself consists of steps:

1. start loading object into cache,
2. wait for the loading to complete.

This way Prefetch is naturally only "1" - start loading object into
cache but do not wait for the loading to be complete. Go's goroutines
naturally help here where we can spawn every such loading into its own
goroutine instead of explicitly programming loading in terms of a state
machine.
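The two-step scheme can be sketched as follows: `Prefetch` is only step 1, while `Load` is step 1 followed by step 2. This is a simplified illustration with no eviction or error handling; all type and function names are hypothetical and do not mirror the actual cache code.

```go
package main

import (
	"fmt"
	"sync"
)

type Oid uint64

// entry is one cache slot; ready is closed once data has been loaded.
type entry struct {
	ready chan struct{}
	data  string
}

// Cache sits on top of a driver-level load function (hypothetical).
type Cache struct {
	mu      sync.Mutex
	entries map[Oid]*entry
	load    func(Oid) string
}

func NewCache(load func(Oid) string) *Cache {
	return &Cache{entries: map[Oid]*entry{}, load: load}
}

// start is step 1: begin loading oid into the cache in its own goroutine,
// unless a load for it was already started.
func (c *Cache) start(oid Oid) *entry {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[oid]
	if !ok {
		e = &entry{ready: make(chan struct{})}
		c.entries[oid] = e
		go func() {
			e.data = c.load(oid)
			close(e.ready)
		}()
	}
	return e
}

// Prefetch starts loading but does not wait.
func (c *Cache) Prefetch(oid Oid) { c.start(oid) }

// Load starts loading (if needed) and then waits for completion (step 2).
func (c *Cache) Load(oid Oid) string {
	e := c.start(oid)
	<-e.ready
	return e.data
}

func main() {
	c := NewCache(func(oid Oid) string { return fmt.Sprintf("data of %d", oid) })
	c.Prefetch(1)
	fmt.Println(c.Load(1))
}
```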

Since this cache is mainly needed for Prefetch to work, not to actually
cache data (though it works as cache for repeating access too), the goal
when writing it was to add minimal overhead for "data-not-yet-in-cache"
case. In the current state we are not completely there yet, but the latency is
acceptable - depending on the workload the cache layer adds ~

	0.5 - 1 - 3µs

to loading times.

7233b4c0

go/zodb: Minimal serialization compatibility with ZODB/py · dfd4fb73

Kirill Smelkov authored Jan 15, 2018

ZODB/py serializes data using python pickles. Basically every serialized
object has two parts: class description and object state. Here we
start by providing minimal functionality to extract class-name from
serialized data.

The library used for pickle decoding (and in later patches encoding) is

	github.com/kisielk/og-rek

It was audited by me for security flaws to some extent.

Contrary to Python pickle module it does not run arbitrary code on
decoding.

dfd4fb73

go/zodb: Tid connection with time · bac6c953

Kirill Smelkov authored Jan 15, 2018

Since TIDs in ZODB correspond to time, provide functionality to
convert a tid to timestamp. Do so in exactly the same way as ZODB/py
does for interoperability.

bac6c953

go/zodb: Stringification and parsing for Tid, Oid, Xid · 3d13a276
Kirill Smelkov authored Jan 15, 2018

3d13a276

go: Start of ZODB · 20d8456c

Kirill Smelkov authored Jan 15, 2018

Our path of implementing NEO in Go will be not only for server-side, but
also for client-side, since it is needed by Wendelin.core. On
server-side we'll also need to work with types and data model Python
ZODB implementation uses, so here it goes: Start of ZODB in Go.

Here we define ZODB data types, data model and operational interfaces
for IStorage + friends.

The interfaces are currently read-only with stubs for write mode.

20d8456c

go: Basic .gitignore · 7cb20f32

Kirill Smelkov authored Jan 15, 2018

Ignore files commonly produced while profiling Go programs and running
tests.

7cb20f32

NEO/go licensing · 612d556d

Kirill Smelkov authored Jan 15, 2018

We want to make sure the code can be used by all projects without a
problem. This way the license is GPLv3+ with wide exception for all Free
Software / Open Source projects + Business options.

Nexedi stack is licensed under Free Software licenses with various exceptions
that cover three business cases:

- Free Software
- Proprietary Software
- Rebranding

As long as one intends to develop Free Software based on Nexedi stack, no
license cost is involved. Developing proprietary software based on Nexedi stack
may require a proprietary exception license. Rebranding Nexedi stack is
prohibited unless rebranding license is acquired.

Through this licensing approach, Nexedi expects to encourage Free Software
development without restrictions and at the same time create a framework for
proprietary software to contribute to the long term sustainability of the
Nexedi stack.

Please see https://www.nexedi.com/licensing for details, rationale and options.

( NEO/py for now stays at the old terms but it will be upgraded to the same
  terms as NEO/go eventually )

612d556d

Sync NEO/py · a48d51c2

Kirill Smelkov authored Jan 15, 2018

Sync with current NEO in Python implementation as the first step.

We'll be using some common bits and in particular on-the-wire protocol
must be the same and for py/go interoperability testing we'll also need
python parts.

a48d51c2

11 Jan, 2018 1 commit

client: for read accesses, pick a random good node, connected or not · 8dce4bbf

Julien Muchembled authored Jan 10, 2018

The issue was that at startup, or after nodes are back, the previous code
prevented full load balancing until some data are written.

It was like this to limit the number of connections, which does not matter
anymore (see commit 77132157).

8dce4bbf

08 Jan, 2018 1 commit

storage: optimize storage layout of raw data for replication · f4dd4bab

Julien Muchembled authored Nov 23, 2017

# Previous status

The issue was that we had extreme storage fragmentation from the point of view
of the replication algorithm, which processes one partition at a time.

By using an autoincrement for the 'data' table, rows were ordered by the time
at which they were added:
- parts may be the result of replication -> ordered by partition, tid, oid
- other rows are globally sorted by tid

Which means that when scanning a given partition, many rows were skipped all
the time:
- if readahead is big enough, the efficiency is 1/N for a node with N
  partitions assigned
- else, it is worse because it seeks all the time

For huge databases, the replication was horribly slow, in particular from HDD.

# Chosen solution

This commit changes how ids are generated to somehow split 'data'
per partition. The backend tracks 1 last id per assigned partition, where the
16 higher bits contain the partition. Keep in mind that the value of id has no
meaning and it's only chosen for performance reasons. IOW, a row can be
referred to by an oid of a partition different than the 16 higher bits of id:
- there's no migration needed and the 16 higher bits of all existing rows are 0
- in case of deduplication, a row can still be shared by different partitions
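The id scheme can be sketched as below, assuming 64-bit ids so that the partition occupies bits 48 to 63. The message only states that the 16 higher bits contain the partition; the id width, the allocator type and the choice of the first id in an empty range are assumptions for illustration.

```go
package main

import "fmt"

// partitionShift assumes a 64-bit id with the partition in its 16 highest bits.
const partitionShift = 48

// idAllocator tracks one last id per assigned partition (hypothetical type).
type idAllocator struct {
	last map[uint16]uint64
}

func newIDAllocator() *idAllocator {
	return &idAllocator{last: map[uint16]uint64{}}
}

// next returns a fresh id inside the range reserved for the partition, so
// that rows of one partition end up next to each other in 'data'.
func (a *idAllocator) next(partition uint16) uint64 {
	last, ok := a.last[partition]
	if !ok {
		// Nothing allocated yet: start from the base of the partition's range.
		last = uint64(partition) << partitionShift
	}
	last++
	a.last[partition] = last
	return last
}

// partitionOf extracts the 16 highest bits of an id.
func partitionOf(id uint64) uint16 { return uint16(id >> partitionShift) }

func main() {
	a := newIDAllocator()
	for _, p := range []uint16{3, 0, 3} {
		id := a.next(p)
		fmt.Printf("partition %d -> id %#x\n", partitionOf(id), id)
	}
}
```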

Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
on existing databases.

## Downsides

On insertion, increasing the number of partitions now causes a significant
slowdown: for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12
partitions, the difference remains negligible. The solution for this issue will
be to make it possible to increase the number of partitions efficiently, so
that nodes can keep a small number of them, even for a DB that is expected to
grow so much that many nodes are added over time: such a feature was already
considered so that users don't have to worry anymore about this obscure setting
at database creation.

Read performance is only slowed down for applications that read a lot of data
that were written contiguously, but split in small blocks. A solution is to
extend ZODB so that the application tells it to choose new oids that will end up
in the same partition. Like for insertion, there should not be too many
partitions.

With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
collect all last ids at startup when there are many partitions.

## Other advantages

- The storage layout of data is now always the same and does not depend on
  whether rows came from replication or commits.
- Efficient deletion of partition to free space in-place will be possible.

# Considered alternative

The only serious alternative was to replicate as many partitions as possible at
the same time, ideally all assigned partitions, but it's not always possible.
For best performance, it would often require synchronizing new nodes, or even
all of them, so that the source nodes don't have to scan 'data' several times.

If existing nodes are kept, all data that aren't copied to the newly added
nodes have to be skipped. If the number of nodes is multiplied by N, the
efficiency is 1-1/N at best (synchronized nodes), else it's even worse
because partitions are somehow shuffled.

Checking/replacing a single node would remain slow when there are several
source nodes.

Lastly, such an algorithm would be much more complex and we would not have the
other advantages listed above.

f4dd4bab

05 Jan, 2018 4 commits

sqlite: remove useless AUTOINCREMENT for data.id (reuse of deleted ids is fine) · 7b497b8e
Julien Muchembled authored Jan 05, 2018
```
For existing DB, altering the table may be doable with schema editing and
clean up of sqlite_sequence.
```
7b497b8e
storage: automatic upgrade of 'obj' table (change of indices) · d289050e
Julien Muchembled authored Jan 05, 2018

d289050e

storage: speed up reads by indexing 'obj' primarily by 'oid' (instead of 'tid') · 3c7a3160

Julien Muchembled authored Nov 20, 2017

getObject becomes faster because it no longer uses the secondary index,
only the primary one. This frees RAM during normal operation. For MySQL,
DatabaseManager._getObject is sped up by ~3% for in-memory loads.
An improvement of ~1% from ERP5 was also measured for IO-bound loads.

On insertion, the fast index is (`partition`, tid, oid) because we almost
always insert lines with increasing tid, whereas oid values are more random.
Although the value (data_id+value_tid) is moved from the fast to the slow index,
this should have little impact on performance because the value size is quite
small compared to the key.

The impact on replication should also be negligible:
- a little faster when there's no oid to replicate: only the secondary index,
  smaller, is scanned
- otherwise: the (slightly) biggest index is scanned randomly

On disk usage, an increase of ~4% was observed for TokuDB.
Less compressibility ? Any link with https://jira.percona.com/browse/TDB-86 ?

3c7a3160

storage: pass schema of tables to migration methods · ca7acefc
Julien Muchembled authored Nov 28, 2017

ca7acefc