Commits · 495bd2fa157ba17c96733c4bd7411819f5f8a321 · Alain Takoudjou / git-backup

30 Dec, 2015 1 commit

gitlab-backup: Unpack *.tar.gz before storing them in git · 495bd2fa

Kirill Smelkov authored Dec 30, 2015

Starting from 8.2, GitLab backs up uploads and other directories not as
a set of files, but as a single tarball:

    https://gitlab.com/gitlab-org/gitlab-ce/commit/d3734fbd

and this does not play well with git: the objects are now stored as one
big compressed whole, so git cannot find good deltas.

So, to help git properly deltify and find duplicates, let's unpack/repack
the archives, the same way we already do for database.sql.gz.
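
A minimal Python sketch of the unpacking step follows. It is an
illustration only, not the code from this commit: the function name
unpack_tarball and the policy of extracting regular files only are
assumptions.

```python
import os
import shutil
import tarfile


def unpack_tarball(tar_path, dest_dir):
    """Extract regular files from a .tar.gz into dest_dir.

    Hypothetical helper: storing the unpacked files (instead of one
    compressed tarball) lets git deltify and deduplicate them.
    Returns the sorted list of extracted member names.
    """
    dest_root = os.path.realpath(dest_dir)
    os.makedirs(dest_root, exist_ok=True)
    names = []
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # assumption: only regular files are kept
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Refuse members that would land outside dest_dir.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe path in archive: %r" % member.name)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            names.append(member.name)
    return sorted(names)
```

On restore the opposite step would repack the directory into a tarball
again; that part is not sketched here.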


14 Oct, 2015 1 commit

fsck incoming objects on pull · 7c0e3ff2

Kirill Smelkov authored Oct 14, 2015

Since objects are shared between backed-up repositories, it is important
to make sure we never pull a broken object, not even once: doing so would
program future corruption of that object, after restore, in all
repositories which use it.

Object corruption could happen for two reasons:

    - plain storage corruption, or
    - someone intentionally pushing a corrupted object with a known sha1
      to any repository.

The second case is even more dangerous, as it potentially allows an
attacker to change data in repositories they have no access to.

Now objects are checked on pull, and if corrupt, git-backup complains,
e.g. this way:

    RuntimeError: git -c fetch.fsckObjects=true fetch --no-tags ../D/corrupt.git refs/*:refs/backup/20151014-1914/aaa/corrupt.git/*
    error: inflate: data stream error (incorrect data check)
    fatal: loose object 52baccfe8479b61c2a0d5447bc0a6bf7c6827c60 (stored in ./objects/52/baccfe8479b61c2a0d5447bc0a6bf7c6827c60) is corrupt
    fatal: The remote end hung up unexpectedly
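
A sketch of how such a checked pull could be invoked from Python. The
git command line is the one shown in the error above; the helper names
fetch_argv and fetch_checked are hypothetical and not taken from
git-backup itself.

```python
import subprocess


def fetch_argv(repo, backup_prefix):
    """Build the fetch command with object checking enabled.

    fetch.fsckObjects=true makes git verify every incoming object,
    so a corrupt object aborts the fetch instead of being stored.
    """
    return ["git", "-c", "fetch.fsckObjects=true", "fetch", "--no-tags",
            repo, "refs/*:%s/*" % backup_prefix]


def fetch_checked(repo, backup_prefix, cwd=None):
    """Run the fetch; raise RuntimeError with git's stderr on failure."""
    argv = fetch_argv(repo, backup_prefix)
    proc = subprocess.run(argv, cwd=cwd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)
    if proc.returncode != 0:
        raise RuntimeError("%s\n%s" % (" ".join(argv), proc.stderr))
    return proc.stdout
```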


24 Sep, 2015 1 commit

readme: Turn what-should-be-reference into hyperlinks · 19b35be9

Kirill Smelkov authored Sep 24, 2015
  
22 Sep, 2015 1 commit

readme: .txt -> .rst · a695bdbe

Kirill Smelkov authored Sep 22, 2015

Current hosting services don't recognize .txt as reStructuredText, so
let's be explicit, so that the readme gets rendered automatically.


08 Sep, 2015 2 commits

Fix typo · 73815b9f

Kirill Smelkov authored Sep 08, 2015


Don't forget to save symlinks pointing to directories · 380b65f1

Kirill Smelkov authored Sep 08, 2015

os.walk() yields symlinks to directories in dirnames and does not follow
them. Our backup cycle expects all files that need to go to a blob to be
in filenames, and dirnames to be only recursed into by walk().

Thus, until now, a symlink to a directory was simply ignored and not
backed up. In particular, *.git/hooks are usually symlinks to a common
place.

The fix is to adjust our xwalk() to always represent blob-ish things in
filenames, and leave dirnames only for real directories.
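
A minimal sketch of such a wrapper is given here for illustration. It
follows the description above but is not the exact xwalk() from this
commit.

```python
import os


def xwalk(top):
    """Like os.walk(), but report symlinks to directories in filenames.

    os.walk() (with the default followlinks=False) lists a symlink to a
    directory in dirnames without descending into it. Here such entries
    are moved to filenames, so dirnames holds real directories only.
    """
    for dirpath, dirnames, filenames in os.walk(top):
        real_dirs = []
        for name in dirnames:
            if os.path.islink(os.path.join(dirpath, name)):
                filenames.append(name)  # blob-ish: to be stored as a symlink
            else:
                real_dirs.append(name)
        dirnames[:] = real_dirs  # in-place, so os.walk() sees the change
        yield dirpath, dirnames, filenames
```

With this, a caller that stores everything from filenames as blobs also
picks up entries such as a symlinked *.git/hooks.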

/cc @kazuhiko


31 Aug, 2015 3 commits

gitlab-backup: Initial draft · 32e1f7af

Kirill Smelkov authored Aug 31, 2015

This is a convenience program to pull/restore backup data for a GitLab
instance into/from a git-backup managed repository.

Backup layout is:

    gitlab/misc   - db + uploads + ...
    gitlab/repo   - git repositories

On restoration we extract repositories into
.../git-data/repositories.<timestamp> and the db backup into a standard
gitlab backup tar, and advise the user how to proceed with the exact
finishing commands.

This will hopefully be improved and changed to finish automatically,
after some testing.


git-backup: Initial draft · 6f237f22

Kirill Smelkov authored Aug 31, 2015

This program backs up files and a set of bare Git repositories into one Git
repository. Files are copied to blobs and then added to a tree under a certain
place, and for Git repositories, all reachable objects are pulled in while
maintaining an index which remembers reference -> sha1 for every pulled
repository.

After objects from the backed-up Git repositories are pulled in, we create a
new commit which references the tree with the updated backup index and files,
and also has all head objects from the pulled-in repositories in its
parents(*). This way the backup has history and all pulled objects become
reachable from a single head commit in the backup repository. In particular
that means that the whole state of the backup can be described with a single
sha1, and that the backup repository itself can be synchronized via standard
git pull/push, be repacked, etc.
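
To illustrate the commit construction, here is a sketch that builds a
`git commit-tree` command line with the pulled-in heads as parents. The
helper name commit_tree_argv and the choice of the previous backup commit
as the first parent are assumptions made for this example, and the heads
are assumed to be commit sha1s already.

```python
def commit_tree_argv(tree, prev_backup, heads, message):
    """Build argv for a backup commit.

    tree        - sha1 of the tree with backup index and files
    prev_backup - sha1 of the previous backup commit, or None
                  (assumption: used as first parent to keep history)
    heads       - iterable of head sha1s from pulled-in repositories
    """
    parents = []
    for sha1 in [prev_backup] + sorted(heads):
        if sha1 and sha1 not in parents:  # skip None and duplicates
            parents.append(sha1)
    argv = ["git", "commit-tree", tree]
    for parent in parents:
        argv += ["-p", parent]
    argv += ["-m", message]
    return argv
```

Because every pulled-in head is a parent, all pulled objects stay
reachable from this one commit.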

The restoration process is the opposite: from a particular backup state, files
are extracted to their proper place, and for each Git repository a pack with
all objects reachable from that repository's heads is prepared and extracted
from the backup repository's object database.

This approach makes it possible to leverage Git's good ability for object
content deduplication and packing, especially when there are many hosted
repositories which are forks of each other, with a mostly common base and
relatively minor changes between each other and over time. In the author's
experience the size of the backup is dramatically smaller compared to the
straightforward "let's tar it all" approach.

Data for all backed-up files and repositories can be accessed by anyone who
has access to the backup repository, so either they all should be in the same
security domain, or extra care has to be taken to protect access to the backup
repository.

File permissions are not preserved in full detail, due to the inherent
nature of Git. This aspect can be improved with e.g. an etckeeper-like
(http://etckeeper.branchable.com/) approach if needed.

Please see README.txt for a user-level overview of how to use git-backup.

NOTE the idea of pulling all refs together is similar to git-namespaces
     http://git-scm.com/docs/gitnamespaces

(*) Tag objects are handled specially, because in a lot of places Git insists
    and assumes that commit parents can only be commit objects. We encode tag
    objects in a specially-crafted commit object on pull, and decode them back
    on backup restore.

    We do likewise if a ref points to a tree or blob, which is valid in Git.


Start of git-backup.git · bbee44ce

Kirill Smelkov authored Aug 31, 2015

A project to efficiently back up repositories hosted on a git hosting
service.
