- 12 Jun, 2018 3 commits
-
-
Kirill Smelkov authored
As already said in 899103bf (pull: Switch from porcelain `git fetch` to plumbing `git fetch-pack` + friends), `git-backup pull` became slow on lab.nexedi.com, and most of the slowness was tracked down to the fact that `git fetch`, for every pulled repository, does a linear scan of the whole backup repository history just to find out that there is usually nothing to fetch. Quoting 899103bf:

"""
`git fetch`, before fetching data from remote repository, first checks whether it already locally has all the objects remote advertises. This boils down to running

    echo $remote_tips | git rev-list --quiet --objects --stdin --not --all

and checking whether it succeeds or not:

    https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671
    https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925
    https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8

The "--not --all" in the query means that objects should not be reachable from any locally existing ref, and is implemented by linearly scanning from the tips of those existing refs and marking objects reachable from there as "do not print".

In case of git-backup, where we have mostly master, which is a super commit merging the whole histories of all projects and the backup history, linearly scanning from such a tip goes through lots of commits. Up to the point where fetching a small, outdated repository, which was already pulled into backup and has not changed for a long time, takes more than 30 seconds, with almost 100% of that time being spent in quickfetch() only.
"""

The solution is that we can build an index of the objects we already have only once at startup, and then in fetch, after checking lsremote output, consult that index, and if we see we already have everything for an advertised reference - just avoid giving it to fetch-pack to process.
It turns out that for many pulled repositories no references have changed at all, and this way fetch-pack can be skipped completely. This leads to a dramatic speedup: before, `gitlab-backup pull` was taking ~ 2 hours, and now it takes under ~ 5 minutes. The index building itself takes ~ 30 seconds - the time we were previously spending to fetch from just 1 unchanged repository. The index size is small, so it can all be kept in RAM - please see details in the code comments on this.

I initially wanted to speed up fetching by teaching `git fetch-objects` to consult the backup repo bitmap reachability index (if, for a commit, we can see that there is an entry in this index -> we know we already have all reachable objects for this commit and can skip fetching). This, however, won't fully work for all our refs - 40% of them are mostly tags, and since in the backup repository we don't keep tag objects - we keep tags/tree/blobs encoded as commits - the sha1 of those 40% of references to tags won't be in the bitmap index. So just do the indexing ourselves.
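The idea of consulting an in-memory index before calling fetch-pack can be sketched roughly as follows. This is only an illustration under assumptions: the `Ref` type, the `needFetch` name and the plain `map` used as the index are hypothetical and are not taken from git-backup's code.

```go
package main

import "fmt"

// Ref is one advertised reference: object sha1 and reference name.
// (hypothetical type, for illustration only)
type Ref struct {
	Sha1 string
	Name string
}

// needFetch returns those advertised refs whose sha1 is not present in the
// index of objects we already have. If the result is empty, fetch-pack can
// be skipped for this repository altogether.
func needFetch(have map[string]struct{}, advertised []Ref) []Ref {
	var todo []Ref
	for _, r := range advertised {
		if _, ok := have[r.Sha1]; !ok {
			todo = append(todo, r)
		}
	}
	return todo
}

func main() {
	// the index is built once at startup and then reused for every repository
	have := map[string]struct{}{
		"feeed96ca75fcf8dcf183008f61dbf72e91ab4de": {},
	}
	advertised := []Ref{
		{"feeed96ca75fcf8dcf183008f61dbf72e91ab4de", "refs/heads/master"},
		{"647e137fd3b31939b36889eba854a298ef97b6ff", "refs/heads/branch2"},
	}
	for _, r := range needFetch(have, advertised) {
		fmt.Println("need to fetch:", r.Sha1, r.Name)
	}
}
```

A set keyed by sha1 makes each lookup O(1), so the cost of the check no longer depends on the length of the backup history.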
-
Kirill Smelkov authored
In the next patch we will need to load backup.refs in the beginning of pull too. Factored function changed to return regular error instead of raising exception (which will be the general plan from now on).
-
Kirill Smelkov authored
On lab.nexedi.com `git-backup pull` became slow, and most of the slowness was tracked down to the following:

`git fetch`, before fetching data from remote repository, first checks whether it already locally has all the objects remote advertises. This boils down to running

    echo $remote_tips | git rev-list --quiet --objects --stdin --not --all

and checking whether it succeeds or not:

    https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671
    https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925
    https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8

The "--not --all" in the query means that objects should not be reachable from any locally existing ref, and is implemented by linearly scanning from the tips of those existing refs and marking objects reachable from there as "do not print".

In case of git-backup, where we have mostly master, which is a super commit merging the whole histories of all projects and the backup history, linearly scanning from such a tip goes through lots of commits. Up to the point where fetching a small, outdated repository, which was already pulled into backup and has not changed for a long time, takes more than 30 seconds, with almost 100% of that time being spent in quickfetch() only.

The solution will be to optimize checking whether we already have all the remote objects, and to not repeat the whole backup-repo scan for every pulled repository. This will be done by first querying through `git ls-remote` what tips the remote repository has, then checking in a git-backup specific index which tips we already have, and then fetching only the rest. This way we are essentially moving most of the quickfetch phase of git into git-backup.

Since we'll be telling git to fetch only some of the remote refs, we will either have to amend the refs `git fetch` creates after fetching ourselves, or not rely on `git fetch` creating any refs at all. Since we already have a long standing issue that the many refs which come to life after `git fetch` slow down further git fetches

    https://lab.nexedi.com/kirr/git-backup/blob/0ab7bbb6/git-backup.go#L551

the longer term plan will be not to create unneeded references. Since 2 forks could have references covering the same commits, we would either have to compare references created after git-fetch and deduplicate them, or manage reference creation ourselves.

It is also generally better to split `git fetch` into steps at the plumbing layer, because after doing so we have the chance to optimize or tweak any of the steps on our side, knowing the full git-backup context and indices.

This commit only switches from using `git fetch` to its plumbing counterpart `git fetch-pack` + friends + manually creating fetched refs exactly the way `git fetch` used to do. There should be neither functionality changes nor any speedup. Further commits will start to take advantage of the switch and optimize `git-backup pull`.
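The first plumbing step, learning which tips the remote advertises, comes down to parsing `git ls-remote` output, where each line is `<sha1>` TAB `<refname>`. A minimal sketch follows; the `Ref` type and the `parseLsRemote` name are assumptions made for illustration, not the names used in git-backup.

```go
package main

import (
	"fmt"
	"strings"
)

// Ref is one line of `git ls-remote` output. (hypothetical type)
type Ref struct {
	Sha1 string
	Name string
}

// parseLsRemote parses "<sha1>\t<refname>" lines as printed by `git ls-remote`.
// Lines that do not have exactly this shape are skipped.
func parseLsRemote(out string) []Ref {
	var refs []Ref
	for _, line := range strings.Split(strings.TrimRight(out, "\n"), "\n") {
		f := strings.SplitN(line, "\t", 2)
		if len(f) != 2 {
			continue
		}
		refs = append(refs, Ref{Sha1: f[0], Name: f[1]})
	}
	return refs
}

func main() {
	out := "feeed96ca75fcf8dcf183008f61dbf72e91ab4de\trefs/heads/master\n" +
		"f735011c9fcece41219729a33f7876cd8791f659\trefs/tags/tag-to-commit\n"
	for _, r := range parseLsRemote(out) {
		fmt.Println(r.Sha1, r.Name)
	}
}
```

Only the refs that survive the "do we already have it" check would then be handed to `git fetch-pack`, and the corresponding local refs created by hand.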
-
- 11 Jun, 2018 2 commits
-
-
Kirill Smelkov authored
- tell that reference name always goes without "refs/" prefix - use .name for reference name, not .ref: this way ref.name is more readable than ref.ref and so there is less need to use for __ in range loops.
-
Kirill Smelkov authored
Noticed this while changing how pull works and accidentally making an error there by leaving an extra "refs/" prefix. With that error, before this patch tests show:

    git-backup_test.go:91: git-backup_test.go:204: lab.nexedi.com/kirr/git-backup.cmd_restore: 2 errors:
        - E: extracted /tmp/t-git-backup981909377/1/dir 2 + β/repo with+fragile name %αβγ.git refs corrupt:
        - E: extracted /tmp/t-git-backup981909377/1/dir/hello.git refs corrupt:

with the patch tests report:

    git-backup_test.go:91: git-backup_test.go:204: lab.nexedi.com/kirr/git-backup.cmd_restore: 2 errors:
        - E: extracted /tmp/t-git-backup981909377/1/dir 2 + β/repo with+fragile name %αβγ.git refs corrupt:
            want:
            cbb6d3f205749888f77fb1a88fbac3b8a0b8000f refs/refs/heads/master
            have:
            cbb6d3f205749888f77fb1a88fbac3b8a0b8000f refs/heads/master
        - E: extracted /tmp/t-git-backup981909377/1/dir/hello.git refs corrupt:
            want:
            647e137fd3b31939b36889eba854a298ef97b6ff refs/refs/heads/branch2
            feeed96ca75fcf8dcf183008f61dbf72e91ab4de refs/refs/heads/master
            11e67095628aa17b03436850e690faea3006c25d refs/refs/tags/tag-to-blob
            f735011c9fcece41219729a33f7876cd8791f659 refs/refs/tags/tag-to-commit
            7124713e403925bc772cd252b0dec099f3ced9c5 refs/refs/tags/tag-to-tag
            ba899e5639273a6fa4d50d684af8db1ae070351e refs/refs/tags/tag-to-tree
            7a3343f584218e973165d943d7c0af47a52ca477 refs/refs/test/ref-to-blob
            61882eb85774ed4401681d800bb9c638031375e2 refs/refs/test/ref-to-tree
            have:
            647e137fd3b31939b36889eba854a298ef97b6ff refs/heads/branch2
            feeed96ca75fcf8dcf183008f61dbf72e91ab4de refs/heads/master
            11e67095628aa17b03436850e690faea3006c25d refs/tags/tag-to-blob
            f735011c9fcece41219729a33f7876cd8791f659 refs/tags/tag-to-commit
            7124713e403925bc772cd252b0dec099f3ced9c5 refs/tags/tag-to-tag
            ba899e5639273a6fa4d50d684af8db1ae070351e refs/tags/tag-to-tree
            7a3343f584218e973165d943d7c0af47a52ca477 refs/test/ref-to-blob
            61882eb85774ed4401681d800bb9c638031375e2 refs/test/ref-to-tree

It should be good to have these details if something really breaks after restore.
-
- 08 Jun, 2018 2 commits
-
-
Kirill Smelkov authored
This way, if backup repository was freshly repacked with bitmap index generation turned on, we can get ~ 30% - 50% speedup for a typical erp5.git pack extraction. "--use-bitmap-index" option was added to git in v2.0, but was only active for to-stdout packs generation. It was enabled for to-file packs generation in git v2.11. Since git v2.0 was released in 2014 - 4 years ago - I'm not adding runtime detection of "--use-bitmap-index" availability. See https://git.kernel.org/pub/scm/git/git.git/commit/?h=645c432d61 for details.
-
Kirill Smelkov authored
-
- 05 Jun, 2018 1 commit
-
-
Kirill Smelkov authored
- remove blank line between main description and package clause, so that the main description is understood as such; - move notes describing what a file does after package clause, so that those notes do not get mixed into program description under godoc.
-
- 25 Apr, 2018 1 commit
-
-
Alain Takoudjou authored
add option to remove or keep pulled backup data [ kirr: The .pulled files with gitlab backup data (SQL and the like) were originally not removed "just in case" in the early days of git/gitlab-backup. They are clearly not needed to be kept since their content is entered into git backup database by gitlab-backup, and leaving those .pulled files just wastes disk space. So default to not keep them around and for now add an option to forcibly preserve the raw gitlab backup if we'll need it just in case or for the debugging. However if it turns out we won't really need -keep in practice, it might go away in some time. ] /reviewed-on !3
-
- 07 Mar, 2018 1 commit
-
-
Alain Takoudjou authored
If a repository is removed when git-backup is running, print a warning message and continue pulling instead of exiting with error. /reviewed-on !2
-
- 24 Oct, 2017 1 commit
-
-
Kirill Smelkov authored
Relicense to GPLv3+ with wide exception for all Free Software / Open Source projects + Business options. Nexedi stack is licensed under Free Software licenses with various exceptions that cover three business cases: - Free Software - Proprietary Software - Rebranding As long as one intends to develop Free Software based on Nexedi stack, no license cost is involved. Developing proprietary software based on Nexedi stack may require a proprietary exception license. Rebranding Nexedi stack is prohibited unless rebranding license is acquired. Through this licensing approach, Nexedi expects to encourage Free Software development without restrictions and at the same time create a framework for proprietary software to contribute to the long term sustainability of the Nexedi stack. Please see https://www.nexedi.com/licensing for details, rationale and options.
-
- 19 Apr, 2017 1 commit
-
-
Kirill Smelkov authored
- myname moved -> my go123@98249b24 - Traceback now returns []runtime.Frame go123@7deb28a5
-
- 13 Dec, 2016 4 commits
-
-
Kirill Smelkov authored
to xflag.Count
-
https://lab.nexedi.com/kirr/go123/xstrings/
Kirill Smelkov authored
xstrings.SplitLines xstrings.Split2 xstrings.HeadTail Other string-related routines stay in git-backup for now as I don't feel they are general enough or interface chosen is really ok.
-
https://lab.nexedi.com/kirr/go123/mem/
Kirill Smelkov authored
It is now mem.String(), and mem.Bytes()
-
Kirill Smelkov authored
error.go is completely being moved to that shared place for handy Go utilities into several subpackages: lab.nexedi.com/kirr/go123/exc -- exception-style error handling for Go lab.nexedi.com/kirr/go123/myname -- easy way to determine current function's name and package lab.nexedi.com/kirr/go123/xerr -- addons for error-handling lab.nexedi.com/kirr/go123/xruntime -- addons to standard package runtime
-
- 03 Nov, 2016 1 commit
-
-
Kirill Smelkov authored
By definition of strings.Split(..., sep) it "slices s into all substrings separated by sep and returns a slice of the substrings between those separators". That means that

    strings.Split("hello\nworld\n", "\n") -> ["hello", "world", ""]    # NOTE the last ""

When parsing a file by lines, it is handy, though, not to get the last empty "" after the last "\n". #6 shows how we missed doing that filtering-out for the case of an empty backup.refs file and errored out because of that.

To fix, let's introduce a helper - splitlines() - which does the job of filtering out the last empty entry after the last separator. By using this helper everywhere we can hopefully avoid problems while pulling only empty repositories (#6 case), and also similar ones.

Fixes #6

/reported-by @iv
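A helper with the behaviour described above could look roughly like this. It is a sketch written for illustration and not necessarily how splitlines() is implemented in git-backup; the name `splitLines` and the choice to return nil for empty input are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// splitLines splits s by sep, without producing the trailing empty entry
// that strings.Split yields when s ends with sep.
// Empty input gives no lines at all.
func splitLines(s, sep string) []string {
	if s == "" {
		return nil
	}
	s = strings.TrimSuffix(s, sep)
	return strings.Split(s, sep)
}

func main() {
	fmt.Printf("%q\n", strings.Split("hello\nworld\n", "\n")) // has trailing ""
	fmt.Printf("%q\n", splitLines("hello\nworld\n", "\n"))    // no trailing ""
	fmt.Printf("%q\n", splitLines("", "\n"))                  // empty file -> no lines
}
```

With such a helper an empty backup.refs file yields zero lines to parse instead of a single empty one.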
-
- 01 Aug, 2016 3 commits
-
-
Kirill Smelkov authored
Continuing 62374038 (pull: Turns unused refs are removed not 100% and a lot of empty directories are accumulated) we just make sure to remove them in the end of pull. But NOTE: there could be O(n^2) behaviour still hidden, so it makes sense to eventually revisit it and cleanup empty dirs earlier. For now we just care not to degrade future pull performance. The appropriate time for revisiting could be when reworking pull to do fetches in parallel. Updates: https://lab.nexedi.com/lab.nexedi.com/lab.nexedi.com/issues/4
-
Kirill Smelkov authored
This allows us to leverage multiple CPUs on a system for pack extractions, which are computation-heavy operations.

The way to do it is more-or-less classical:

    - main worker prepares requests for pack extraction jobs
    - there are multiple pack-extraction workers, which read requests from the jobs queue and perform them
    - at the end we wait for everything to stop, collect errors, and optionally signal the whole thing to cancel if we see an error coming. (it is only a signal and we still have to wait for everything to stop)

The default number of workers is N(CPU) on the system - because we spawn a separate `git pack-objects ...` for every request.

We also now explicitly limit the N(CPU) each `git pack-objects ...` can use to 1. This way control over how many resources to use is in git-backup's hands, and git also packs better this way (when using only 1 thread), because when deltifying, all objects are considered against each other, not only the objects inside 1 thread's object pool; and even when pack.threads is not 1, the first "objects counting" phase of pack is serial - wasting all but 1 core.

On lab.nexedi.com we already use pack.threads=1 by default in global gitconfig, but the above change is for the code to be universal.

Time to restore nexedi/ from lab.nexedi.com backup:

    2CPU laptop:

        before (pack.threads=1)       10m11s
        before (pack.threads=NCPU)     9m13s
        after -j1                     10m11s
        after                          6m17s

    8CPU system (with other load present, noisy):

        before (pack.threads=1)       ~5m
        after                         ~1m30s
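The scheme above (one producer, several workers, error collection and a cancel signal) can be sketched in Go as follows. All names here (`job`, `runJobs`, the `do` callback) are hypothetical and only illustrate the pattern; the real code spawns `git pack-objects ...` per request, which is replaced by a callback in this sketch.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// job is a request to extract a pack for one repository. (hypothetical type)
type job struct{ repo string }

// runJobs feeds jobs to nworkers workers and returns all errors collected.
// After the first error no new jobs are queued, but jobs that are already
// running are still waited for.
func runJobs(jobs []job, nworkers int, do func(job) error) []error {
	if nworkers < 1 {
		nworkers = 1
	}
	jobq := make(chan job)
	stop := make(chan struct{}) // closed once, on first error
	var stopOnce sync.Once
	var mu sync.Mutex
	var errv []error
	var wg sync.WaitGroup

	for i := 0; i < nworkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobq {
				if err := do(j); err != nil {
					mu.Lock()
					errv = append(errv, err)
					mu.Unlock()
					stopOnce.Do(func() { close(stop) })
				}
			}
		}()
	}

feed:
	for _, j := range jobs {
		select {
		case jobq <- j:
		case <-stop: // cancel signal: stop queueing new work
			break feed
		}
	}
	close(jobq)
	wg.Wait() // cancel is only a signal; still wait for everything to stop
	return errv
}

func main() {
	jobs := []job{{"a.git"}, {"b.git"}, {"c.git"}}
	errv := runJobs(jobs, runtime.NumCPU(), func(j job) error {
		fmt.Println("extracting", j.repo)
		return nil
	})
	fmt.Println("errors:", len(errv))
}
```

Using an unbuffered queue keeps the producer in lockstep with the workers, so after a cancel signal at most nworkers jobs are still in flight.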
-
Kirill Smelkov authored
like in 302aaaea (raiseif: Fix it wrt erraddcallingcontext()) now fix raisef, which I originally overlooked.
-
- 31 Jul, 2016 3 commits
-
-
Kirill Smelkov authored
Because spawning a separate process per 1 commit is slow.

Libgit2 does not allow creating commits knowing only tree & parentv sha1s, but we can create commit objects by hand pretty easily - their format is

    tree <sha1>
    parent <parent1-sha1>
    parent <parent2-sha1>
    ...
    author user <email> date +offset
    committer user <email> date +offset
    LF
    message

Time for pulling-in kirr/slapos.git

    before: 2.5s
    after:  0.9s

NOTE AuthorInfo is changed to inherit from git.Signature (same fields and semantic)

NOTE Since libgit2 default ident can fail, and does not look beyond user.name and user.email, we do backup identity detection (user/hostname) - in a similar way to how Git does it - ourselves.
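Building a commit object by hand in the format above can be sketched as follows. The function names `rawCommit` and `objectSha1` are hypothetical; the sketch only assembles the raw content and computes the object id the way git does (sha1 over a `<type> <size>\0` header followed by the content), and it does not write anything to an object database.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"strings"
)

// rawCommit assembles raw commit object content knowing only the tree and
// parent sha1s. author and committer are already formatted as
// "user <email> date +offset".
func rawCommit(tree string, parents []string, author, committer, msg string) []byte {
	var b strings.Builder
	fmt.Fprintf(&b, "tree %s\n", tree)
	for _, p := range parents {
		fmt.Fprintf(&b, "parent %s\n", p)
	}
	fmt.Fprintf(&b, "author %s\n", author)
	fmt.Fprintf(&b, "committer %s\n", committer)
	b.WriteString("\n") // empty line separates headers from the message
	b.WriteString(msg)
	return []byte(b.String())
}

// objectSha1 computes a git object id: sha1("<kind> <len>\x00" + data).
func objectSha1(kind string, data []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "%s %d\x00", kind, len(data))
	h.Write(data)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	// sha1 of the empty tree, used here only as sample input
	tree := "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
	who := "user <user@example.com> 1469923200 +0300" // made-up identity
	c := rawCommit(tree, nil, who, who, "hello\n")
	fmt.Printf("%s\n-> commit %s\n", c, objectSha1("commit", c))
}
```

The resulting bytes could then be stored with an object-database write call, which avoids spawning a subprocess per commit.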
-
Kirill Smelkov authored
We are going to rework this function, but before adding changes let's move it to more appropriate place. Since xcommit_tree() creates commit object from tree and parents and is pretty standard git function - the appropriate place is gitobjects. NOTE we cannot just replace xcommit_tree() with g.CreateCommit() as the latter works with already loaded tree and parent objects, but we want to be able to make commits only knowing tree and parents sha1.
-
Kirill Smelkov authored
In an upcoming patch we are going to switch xcommit_tree() to our own implementation, and since this can potentially change how commits are represented, for backward compatibility reasons we need to make sure objects encoded as commits stay the same. So for all kinds of objects (they are present in testdata/ repositories) add checks that:

    - encode/decode is idempotent
    - encoding and decoding produce exactly the expected sha1

One nice side effect of this is that we can now remove the runtime consistency check from the tail of decoding. That check was there from the beginning - from 6f237f22 (git-backup: Initial draft) - mainly because there was no testsuite at that time.

The place of that check is, however, not even completely right - in case we somehow wrongly pulled an object, it has to be detected at pull time, not restore time. So that check was checking only 1/2 of the implementation - and not the main one - namely that decoding does not mess up.

Since we now have a proper testsuite and add encode/decode tests in this patch, we can remove that partial runtime check. And even if decoding messes something up, despite being covered by tests, it will be 100% caught by the restore process, because for an extracted repository, if some object which needs to be present in it is missing, pack generation for that repository will fail. So we can be safe with the removal.

Time for restoring kirr/slapos.git from lab.nexedi.com backup

    before: 5.5s
    after:  3.5s

( so much because there are ~ 500 tags in slapos.git and currently tag encoding is done by spawning a separate subprocess per tag )
-
- 30 Jul, 2016 1 commit
-
-
Kirill Smelkov authored
Do not waste resources adding every file converted to blob with spawning `git update-index ...` per file - we can queue the info and add all entries to index in one go. Time to pull files part for lab.nexedi.com before: ~110s after: ~3s
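One way to add many entries in one go is to queue them and feed them all to a single `git update-index --index-info` process on stdin, one `<mode> <sha1>` TAB `<path>` line per entry. Whether git-backup uses exactly this mechanism is an assumption here; the `indexEntry` type and `indexInfo` function are hypothetical names for the sketch, which only prepares the input text.

```go
package main

import (
	"fmt"
	"strings"
)

// indexEntry describes one blob to be added to the index. (hypothetical type)
type indexEntry struct {
	Mode uint32 // e.g. 0o100644 for a regular file
	Sha1 string
	Path string
}

// indexInfo formats queued entries as "<mode> SP <sha1> TAB <path> LF" lines,
// one of the input formats accepted by `git update-index --index-info`.
// NOTE: paths containing special characters would need quoting or -z mode;
// that is left out of this sketch.
func indexInfo(entries []indexEntry) string {
	var b strings.Builder
	for _, e := range entries {
		fmt.Fprintf(&b, "%06o %s\t%s\n", e.Mode, e.Sha1, e.Path)
	}
	return b.String()
}

func main() {
	queue := []indexEntry{
		{0o100644, "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", "dir/empty.txt"},
		{0o100755, "ce013625030ba8dba906f756967f9e9ca394464a", "dir/hello.sh"},
	}
	// the whole text would be written to stdin of one update-index process
	fmt.Print(indexInfo(queue))
}
```

The saving comes from paying the process startup cost once instead of once per file.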
-
- 29 Jul, 2016 6 commits
-
-
Kirill Smelkov authored
Time for restoring kirr/slapos.git from lab.nexedi.com backup before: 7.4s after: 5.6s
-
Kirill Smelkov authored
We can reuse ReadObject() like for blob_to_file(). We cannot drop xload_tag() in favor of Repository.LookupTag() because upon tag loading we need to have not only parsed tag, but also its raw content for encoding in another object. Time for restoring kirr/slapos.git from lab.nexedi.com backup before: 8.9s after: 7.4s ( it goes down because on restore restored tags are reencoded again to verify restoration was ok. Pulling time should go down appropriately as well )
-
Kirill Smelkov authored
Replace `git cat-file` with Odb.Read() and `git hash-object -w` with Odb.Write(). Timing for restoring only files from lab.nexedi.com backup: before: ~95s after: ~8s Timings for making backup in the file part should improve similarly.
-
Kirill Smelkov authored
This saves us one `git cat-file` call per recreated tag. Time for restoring kirr/slapos.git from lab.nexedi.com backup before: 10.3s after: 8.9s
-
Kirill Smelkov authored
Currently for every file -> blob, and blob -> file we invoke a git subprocess (cat-file or hash-object). We also invoke a git subprocess for every tag read/write and the same for commits, and this 1 subprocess per 1 object has very high overhead.

The ways to avoid such overhead could be:

    1) for every kind of operation spawn a git service process, like e.g. `git cat-file --batch` for reading files, and only do request/reply per object with it.

    2) use some Go library to work with the git repository ourselves.

"1" can work but:

    - at present there is no counterpart of `cat-file --batch` for e.g. `hash-object` - i.e. we cannot write objects without quirks or patching git.

    - even if we add support for hashing via request/reply, as all requests are processed sequentially on the git side by e.g. `git cat-file --batch`, we won't be able to leverage parallelism.

    - request/reply also has latency attached.

For "2" we have roughly the following choices:

    - use cgo bindings to libgit2 (git2go)
    - use some pure-go git library

The pure-go approach has the advantage that it by design avoids problems related to the tricky CGo rules for passing pointers C <-> Go. The fact that this was sorted out by the Go team itself only during the 1.6 cycle

    https://github.com/golang/go/issues/12416

tells a lot. The net is full of examples where those were hard to get right, and git2go in particular has a story of e.g. heap corruption (the bug was on the golang side itself and fixed only for 1.5)

    https://github.com/libgit2/git2go/issues/223
    https://groups.google.com/forum/#!topic/golang-nuts/Vi1HD-54BTA/discussion

However there is no good (to my knowledge) pure-go git library, and the family of forks around github.com/speedata/gogit either:

    - works 3x slower compared to git2go ( or the same 3x in serial mode compared to e.g. `git cat-file --batch`, as in serial mode the git subservice and git2go have roughly similar performance )

    - or does not work at all (e.g. barfing out on REF_DELTA pack entries, etc)

So because of the 3x slowdown, the pure-go way is currently a non-starter.

Since one person from the golang team cared to update git2go to properly follow the CGo rules

    https://github.com/libgit2/git2go/pull/282

we can be relatively confident about the quality of the git2go bindings and try to use them.

This commit only hooks git2go into the build, subcommands and to Sha1 for to/from Oid conversion. We'll be switching places to git2go incrementally in upcoming patches.

NOTE for now we need git2go from the next branch for

    https://github.com/libgit2/git2go/commit/cf7553e7

The plan is to eventually switch to gopkg.in/libgit2/git2go.v25 once it is out.
-
Kirill Smelkov authored
We are going to use git2go (see next patch) for which canonical import path is git (import "github.com/libgit2/git2go" results in package name being autotruncated to just "git") so free up the "git" name for that package. Reason is: git() - as function - is used not often, while the package will be used often. Regarding naming: not sure it is good choice but ggit() is something like xgit(), only g is for "GitError".
-
- 27 Jul, 2016 1 commit
-
-
Kirill Smelkov authored
We can do similar to what git does for blobs - searching in a window of repositories sorted by repo basename.
-
- 25 Jul, 2016 1 commit
-
-
Kirill Smelkov authored
In 28986e0e (Rewrite in Go) I've added mypkgname() with a comment that Go escapes all '.' in function names with %2e. That turned out to be not true: Go escapes only dots in the last component after the last slash, e.g.

    lab.nexedi.com/kirr/git-backup/package%2ename.Function
    lab.nexedi.com/kirr/git-backup/pkg2.qqq/name%2ezzz.Function

Correct mypkgname() accordingly. Noted while trying to run git-backup in a GOPATH root, not as standalone.
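Recovering the package path from such a function name could be sketched like this. The `pkgName` function is a hypothetical illustration of the rule described above (unescape %2e only in the last path component), not the actual mypkgname() code.

```go
package main

import (
	"fmt"
	"strings"
)

// pkgName extracts the package path from a full function name as reported by
// the Go runtime, e.g. "a.b/c/pkg%2ename.Function" -> "a.b/c/pkg.name".
// Dots are escaped as %2e only in the last component after the last slash,
// so only that component is unescaped.
func pkgName(funcName string) string {
	slash := strings.LastIndexByte(funcName, '/') // -1 if there is no slash
	last := funcName[slash+1:]
	// in the last component the first real '.' separates package from function
	if dot := strings.IndexByte(last, '.'); dot != -1 {
		last = last[:dot]
	}
	return funcName[:slash+1] + strings.Replace(last, "%2e", ".", -1)
}

func main() {
	for _, name := range []string{
		"lab.nexedi.com/kirr/git-backup/package%2ename.Function",
		"lab.nexedi.com/kirr/git-backup/pkg2.qqq/name%2ezzz.Function",
		"main.main",
	} {
		fmt.Println(name, "->", pkgName(name))
	}
}
```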
-
- 07 Jul, 2016 2 commits
-
-
Kirill Smelkov authored
erraddcallingcontext() already tries not to go beyond raise, but since raiseif was calling raise, it was omitting raiseif but not raise itself. So an error could be like this

    cmd_restore: raiseif: mkdir ../R/1: file exists

while it should be

    cmd_restore: mkdir ../R/1: file exists

Fix it.
-
Kirill Smelkov authored
when/if we ever get to need them.
-
- 06 Jul, 2016 2 commits
-
-
Kirill Smelkov authored
It was a default leftover to autodetect object type if obj_type=None, from the beginning - from bbee44ce (Start of git-backup.git) - because even there obj_represent_as_commit() is always called with obj_type explicitly passed in. So remove the leftover.
-
Kirill Smelkov authored
This is more-or-less 1-to-1 port of git-backup to Go. There are things we handle a bit differently: - there is a separate type for Sha1 - conversion of repo paths to git references is now more robust wrt avoiding not-allowed in git constructs like ".." or ".lock" https://git.kernel.org/cgit/git/git.git/tree/refs.c?h=v2.9.0-37-g6d523a3#n34 The rewrite happened because we need to optimize restore, and for e.g. parallelizing part it should be convenient to use goroutines and channels. I'm not very comfortable with how error handling is done, because contrary to what canonical Go way seems to be, in a lot of places it still looks to me exceptions are better idea compared to just error codes, though in many places just error codes are better and makes more sense. Probably there will be less exceptions over time once the code starts to be collaborating set of goroutines with communications done via channels. Still a lot of python habits on my side. And as a bonus we now have end-to-end pull/restore tests...
-
- 20 Jun, 2016 2 commits
-
-
Kirill Smelkov authored
Bug present since the beginning: 6f237f22 (git-backup: Initial draft).
-
Kirill Smelkov authored
Even though we delete all temporary refs after pull, git leaves empty directories in the place where the refs were - for example if there was a ref dir/ref and we delete ref `ref`, the empty dir/ is still left there.

That increasingly hurts the performance of subsequent pulls - before pulling, git wants to scan all local refs, and while doing so it descends into all directories under refs/. As after several pulls we can have many such empty directories under refs/backup/, this scanning can take quite some time: e.g. for lab.nexedi.com a normal pull currently takes ~3 minutes, but after doing pull ~60 times, it can become as bad as ~10 minutes for one pull.

And all that slowness goes away after cleaning refs/backup/ manually.

/cc https://lab.nexedi.com/lab.nexedi.com/lab.nexedi.com/issues/4
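Cleaning up such leftover directories can be sketched as a bottom-up pass that removes every empty directory below a root. The `pruneEmptyDirs` and `demo` names are hypothetical and the approach is only an illustration; it is not necessarily how git-backup performs the cleanup.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pruneEmptyDirs removes empty directories below root (root itself is kept)
// and returns how many were removed. Children are handled before parents, so
// a chain of nested empty directories disappears in one pass.
func pruneEmptyDirs(root string) (int, error) {
	var dirs []string
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() && path != root {
			dirs = append(dirs, path)
		}
		return nil
	})
	if err != nil {
		return 0, err
	}
	removed := 0
	// Walk lists parents before children; iterate in reverse for bottom-up order
	for i := len(dirs) - 1; i >= 0; i-- {
		entries, err := os.ReadDir(dirs[i])
		if err != nil {
			return removed, err
		}
		if len(entries) == 0 {
			if err := os.Remove(dirs[i]); err != nil {
				return removed, err
			}
			removed++
		}
	}
	return removed, nil
}

// demo builds a throwaway tree with an empty chain a/b/c and a non-empty
// keep/ directory, prunes it, and reports how many directories were removed.
func demo() (int, error) {
	root, err := os.MkdirTemp("", "refs-backup")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)
	if err := os.MkdirAll(filepath.Join(root, "a", "b", "c"), 0o755); err != nil {
		return 0, err
	}
	if err := os.MkdirAll(filepath.Join(root, "keep"), 0o755); err != nil {
		return 0, err
	}
	if err := os.WriteFile(filepath.Join(root, "keep", "ref"), []byte("sha1\n"), 0o644); err != nil {
		return 0, err
	}
	return pruneEmptyDirs(root)
}

func main() {
	n, err := demo()
	fmt.Println("removed:", n, "err:", err)
}
```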
-
- 13 Jun, 2016 1 commit
-
-
Kirill Smelkov authored
Same story as in e.g. wendelin.core@b0b2c52e ( in short: GitLab now prepends namespace/repo/blob/ref/ prefix by itself )
-
- 02 May, 2016 1 commit
-
-
Kirill Smelkov authored
No need to compute that twice. My mistake from original 6f237f22 (git-backup: Initial draft).
-