Commit 45c5c2aa authored by Jacob Vosmaer, committed by Achilleas Pipinellis

Update git object deduplication overview

parent 9f35bf3a
@@ -106,6 +106,11 @@ enabled for individual projects by executing
 be on hashed storage, should not be a fork itself, and hashed storage should be
 enabled for all new projects.

+DANGER: **Danger:**
+Do not run `git prune` or `git gc` in pool repositories! This can
+cause data loss in "real" repositories that depend on the pool in
+question.
+
 ### How to migrate to Hashed Storage

 To start a migration, enable Hashed Storage for new projects:
......
@@ -10,10 +10,10 @@ GitLab implements Git object deduplication.

 ## Enabling Git object deduplication via feature flags

-As of GitLab 11.9, Git object deduplication in GitLab is in beta. In this
-document, you can read about the caveats of enabling the feature. Also,
-note that Git object deduplication is limited to forks of public
-projects on hashed repository storage.
+As of GitLab 12.0, Git object deduplication in GitLab is still behind a
+feature flag. In this document, you can read about the effects of
+enabling the feature. Also, note that Git object deduplication is
+limited to forks of public projects on hashed repository storage.

 You can enable deduplication globally by setting the `object_pools`
 feature flag to `true`:
@@ -51,6 +51,15 @@ configuration. Objects in A that are not in B will remain in A. For this
 to work, it is of course critical that **no objects ever get deleted from
 B** because A might need them.

+DANGER: **Danger:**
+Do not run `git prune` or `git gc` in pool repositories! This can
+cause data loss in "real" repositories that depend on the pool in
+question.
+
+The danger lies in `git prune`, and `git gc` calls `git prune`. The
+problem is that `git prune`, when running in a pool repository, cannot
+reliably decide if an object is no longer needed.
+
 ### Git alternates in GitLab: pool repositories

 GitLab organizes this object borrowing by creating special **pool
@@ -80,43 +89,10 @@ across a collection of GitLab project repositories at the Git level:

 The effectiveness of Git object deduplication in GitLab depends on the
 amount of overlap between the pool repository and each of its
-participants. As of GitLab 11.9, we have a somewhat optimistic system.
-The only data that will be deduplicated is the data in the source
-project repository at the time the pool repository is created. That is,
-the data in the source project at the time of the first fork *after* the
-deduplication feature has been enabled.
-
-When we enable the object deduplication feature for
-gitlab.com/gitlab-org/gitlab-ce, which is about 1GB at the time of
-writing, all new forks of that project would be 1GB smaller than they
-would have been without Git object deduplication. So even in its current
-optimistic form, we expect Git object deduplication in GitLab to make a
-difference.
-
-However, if a lot of Git objects get added to the project repositories
-in a pool after the pool repository was created these new Git objects
-will currently (GitLab 11.9) not get deduplicated. Over time, the
-deduplication factor of the pool will get worse and worse.
-
-As an extreme example, if we create an empty repository A, and fork that
-to repository B, behind the scenes we get an object pool P with no
-objects in it at all. If we then push 1GB of Git data to A, and push the
-same Git data to B, it will not get deduplicated, because that data was
-not in A at the time P was created.
-
-This also matters in less extreme examples. Consider a pool P with
-source project A and 500 active forks B1, B2,...,B500. Suppose,
-optimistically, that the forks are fully deduplicated at the start of
-our scenario. Now some time passes and 200MB of new Git data gets added
-to project A. Because of the forking workflow, this data makes also its way
-into the forks B1, ..., B500. That means we would now have 100GB of Git
-data sitting around (500 \* 200MB) across the forks, that could have
-been deduplicated. But because of the way we do deduplication this new
-data will not be deduplicated.
-
-> TODO Add periodic maintenance of object pools to prevent gradual loss
-> of deduplication over time.
-> https://gitlab.com/groups/gitlab-org/-/epics/524
+participants. Each time garbage collection runs on the source project,
+Git objects from the source project will get migrated to the pool
+repository. One by one, as garbage collection runs, other member
+projects will benefit from the new objects that got added to the pool.

 ## SQL model

@@ -136,6 +112,9 @@ are as follows:
 - a `PoolRepository` has exactly one "source `Project`"
 (`pool.source_project`)

+> TODO Fix invalid SQL data for pools created prior to GitLab 11.11
+> https://gitlab.com/gitlab-org/gitaly/issues/1653.
+
 ### Assumptions

 - All repositories in a pool must use [hashed
@@ -146,10 +125,6 @@ are as follows:
 The Git alternates mechanism relies on direct disk access across
 multiple repositories, and we can only assume direct disk access to
 be possible within a Gitaly storage shard.
-- All project repositories in a pool must have "Public" visibility in
-GitLab at the time they join. There are gotchas around visibility of
-Git objects across alternates links. This restriction is a defense
-against accidentally leaking private Git data.
 - The only two ways to remove a member project from a pool are (1) to
 delete the project or (2) to move the project to another Gitaly
 storage shard.
@@ -187,17 +162,14 @@ are as follows:
 ### Consequences

 - If a normal Project participating in a pool gets moved to another
-Gitaly storage shard, its "belongs to PoolRepository" relation must
+Gitaly storage shard, its "belongs to PoolRepository" relation will
 be broken. Because of the way moving repositories between shards is
 implemented, we will automatically get a fresh self-contained copy
 of the project's repository on the new storage shard.
 - If the source project of a pool gets moved to another Gitaly storage
-shard or is deleted, we may have to break the "PoolRepository has
-one source Project" relation?
-
-> TODO What happens, or should happen, if a source project changes
-> visibility, is deleted, or moves to another storage shard?
-> https://gitlab.com/gitlab-org/gitaly/issues/1488
+shard or is deleted, the "source project" relation is not broken.
+However, as of GitLab 12.0 a pool will not fetch from a source
+unless the source is on the same Gitaly shard.

 ## Consistency between the SQL pool relation and Gitaly

@@ -209,16 +181,8 @@ repository and a pool.
 ### Pool existence

 If GitLab thinks a pool repository exists (i.e. it exists according to
-SQL), but it does not on the Gitaly server, then certain RPC calls that
-take the object pool as an argument will fail.
-
-> TODO What happens if SQL says the pool repo exists but Gitaly says it
-> does not? https://gitlab.com/gitlab-org/gitaly/issues/1533
-
-If GitLab thinks a pool does not exist, while it does exist on disk,
-that has no direct consequences on its own. However, if other
-repositories on disk borrow objects from this unknown pool repository
-then we risk data loss, see below.
+SQL), but it does not on the Gitaly server, then it will be created on
+the fly by Gitaly.

 ### Pool relation existence

@@ -226,26 +190,19 @@ There are three different things that can go wrong here.

 #### 1. SQL says repo A belongs to pool P but Gitaly says A has no alternate objects

-In this case, we miss out on disk space savings but all RPC's on A itself
-will function fine. As long as Git can find all its objects, it does not
-matter exactly where those objects are.
+In this case, we miss out on disk space savings but all RPCs on A
+itself will function fine. The next time garbage collection runs on A,
+the alternates connection gets established in Gitaly. This is done by
+`Projects::GitDeduplicationService` in gitlab-rails.

 #### 2. SQL says repo A belongs to pool P1 but Gitaly says A has alternate objects in pool P2

-If we are not careful, this situation can lead to data loss. During some
-operations (repository maintenance), GitLab will try to re-link A to its
-pool P1. If this clobbers the existing link to P2, then A will loose Git
-objects and become invalid.
-
-Also, keep in mind that if GitLab's database got messed up, it may not
-even know that P2 exists.
-
-> TODO Ensure that Gitaly will not clobber existing, unexpected
-> alternates links. https://gitlab.com/gitlab-org/gitaly/issues/1534
+In this case `Projects::GitDeduplicationService` will throw an exception.

 #### 3. SQL says repo A does not belong to any pool but Gitaly says A belongs to P

-This has the same data loss possibility as scenario 2 above.
+In this case `Projects::GitDeduplicationService` will try to
+"re-duplicate" the repository A using the DisconnectGitAlternates RPC.

 ## Git object deduplication and GitLab Geo

......
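
Note on the "Pool relation existence" hunk: the three mismatch scenarios in the new text can be read as a small decision table. The Python sketch below is only an illustration of that table, not the actual `Projects::GitDeduplicationService` code from gitlab-rails. The function name `reconcile`, its arguments, and the returned labels are all hypothetical; each argument stands for a pool identifier, or `None` when that side reports no pool.

```python
def reconcile(sql_pool, gitaly_pool):
    """Return a label for the behaviour the updated docs describe.

    sql_pool:    pool the project belongs to according to SQL (hypothetical id, or None)
    gitaly_pool: pool the repository borrows from according to Gitaly (hypothetical id, or None)
    """
    if sql_pool == gitaly_pool:
        # Both sides agree (including "no pool at all"): nothing to repair.
        return "consistent"
    if gitaly_pool is None:
        # Scenario 1: SQL knows a pool, Gitaly has no alternates link.
        # The link is established the next time garbage collection runs.
        return "link_on_next_gc"
    if sql_pool is None:
        # Scenario 3: Gitaly has a link that SQL does not know about.
        # The repository is "re-duplicated" via DisconnectGitAlternates.
        return "disconnect_alternates"
    # Scenario 2: both sides name a pool, but not the same one.
    # The docs say the service throws an exception here.
    return "raise_error"
```

For example, `reconcile("P1", "P2")` yields `"raise_error"`, which matches scenario 2: rather than clobbering the unexpected link to P2, the service refuses to proceed.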