Commit d31febed authored by Kirill Smelkov

gitlab-backup: Split each table to parts <= 16M in size

As outlined two patches earlier (gitlab-backup: Dump DB ourselves), the
DB dump is currently not git-friendly: each table is dumped as a single
(potentially large) file which only grows over time. In GitLab there is
one big table which dominates ~95% of the whole dump size.

So to avoid overloading git with large blobs, let's split each table
into parts <= 16M in size. This way we do not store very large blobs in
git, which handles them inefficiently.

The fact that table data is sorted (see the previous patch) helps keep
the splitting result more or less stable: we split not purely by byte
size but along line boundaries, with 16M only an approximate maximum,
so when a row inside some part changes, the table is split the same way
on the next backup run.
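
For example, the line-oriented behaviour of such a split can be checked
like this (a minimal sketch; table.dat and the part. suffix are
hypothetical names, not something this patch uses):

    # split into parts of at most 16M, cutting only at line boundaries
    split -C 16M table.dat table.dat.part.

    # concatenating the parts reproduces the original byte-for-byte
    cat table.dat.part.* | cmp - table.dat && echo "split is lossless"

Since split -C never cuts inside a line, editing a row without changing
its size leaves all cut points, and thus all other parts, intact.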

This works less well when the rows themselves are large (e.g. big
patches which change a lot of files with a big diff). For such cases
the splitting could be improved by choosing cut points from the content
itself, similarly to e.g. bup[1] - via nodes of a rolling checksum -
but for now we stay with the simpler way of doing the split.
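
Just to sketch that idea (a toy only, not what this patch does: bup
proper rolls a checksum over a byte window, while this cuts at line
granularity on a per-line hash; all file names are hypothetical):

    # end a part after any line whose hash hits a fixed pattern, so cut
    # points depend only on nearby content and survive edits elsewhere
    awk 'BEGIN {
        for (i = 1; i < 256; i++)
            ord[sprintf("%c", i)] = i       # byte -> value lookup table
    }
    {
        out = sprintf("part.%06d", p)
        print >> out
        h = 5381                            # djb2-style hash of this line
        for (i = 1; i <= length($0); i++)
            h = (h * 33 + ord[substr($0, i, 1)]) % 16777216
        if (h % 4096 == 0) {                # ~ one cut per 4096 lines
            close(out)
            p++
        }
    }' table.dat

With such cut points a change confined to one row redraws boundaries
only up to the next surviving cut, instead of shifting all the parts
that follow.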

This reduces the load on git packing (e.g. on repack, or when doing
fetch and push) a lot.

[1] https://github.com/bup/bup

/cc @kazuhiko
parent 5534e682
@@ -152,6 +152,19 @@ backup_pull() {
        rm -f $F.data{,.x} $F.tail{,.x}
    done

    # ... split each table to parts <= 16M in size
    # so we do not store very large blobs in git (with which it is inefficient)
    find "$db_pgdump" -maxdepth 1 -type f -name "*.dat" -a \! -name toc.dat | \
    while read F; do
        mv $F $F.x
        mkdir $F
        split -C 16M $F.x $F/`basename $F`.
        md5=`md5sum <$F.x`
        md5_=`cat $F/* | md5sum`
        test "$md5" = "$md5_" || die "E: md5 mismatch after $F split"
        rm $F.x
    done

    # 4. pull gitlab data into git-backup
    # gitlab/misc - db + uploads + ...
@@ -192,6 +205,14 @@ backup_restore() {
    # if backup is in pgdump (not sql) format - decode it
    db_pgdump="$tmpd/gitlab_backup/db/database.pgdump"
    if [ -d "$db_pgdump" ]; then
        # merge split database dump files back into one file per table
        find "$db_pgdump" -maxdepth 1 -type d -name "*.dat" | \
        while read F; do
            mv $F $F.x
            cat $F.x/* >$F
            rm -rf "$F.x"
        done

        # convert database dump to plain-text sql (as gitlab restore expects)
        gitlab-rake -e "exec \"pg_restore --clean \\"$db_pgdump\\" >$tmpd/gitlab_backup/db/database.sql \""
        rm -rf "$db_pgdump"
...