Commit ed650b2d authored by Evan Read's avatar Evan Read

Merge branch '35414-categorise-data-types-for-geo' into 'master'

Docs: Categorise data types for Geo

Closes #35414

See merge request gitlab-org/gitlab!20751
parents 61dfa10b 4c6f3278
# Geo data types support
A Geo data type is a specific class of data that is required by one or more GitLab features to
store relevant information.
To replicate data produced by these features with Geo, we use several strategies to access, transfer, and verify them.
## Data types
We currently distinguish between three different data types:
- [Git repositories](#git-repositories)
- [Blobs](#blobs)
- [Database](#database)
See the list below of each feature or component we replicate, its corresponding data type, replication and
verification methods:
| Type | Feature / component | Replication method | Verification method |
|----------|-----------------------------------------------|---------------------------------------------|----------------------|
| Database | Application data in PostgreSQL | Native | Native |
| Database | Redis | _N/A_ (*1*) | _N/A_ |
| Database | Elasticsearch | Native | Native |
| Database | Personal snippets | Postgres Replication | Postgres Replication |
| Database | Project snippets | Postgres Replication | Postgres Replication |
| Database | SSH public keys | Postgres Replication | Postgres Replication |
| Git | Project repository | Geo with Gitaly | Gitaly Checksum |
| Git | Project wiki repository | Geo with Gitaly | Gitaly Checksum |
| Git | Project designs repository | Geo with Gitaly | Gitaly Checksum |
| Git | Object pools for forked project deduplication | Geo with Gitaly | _Not implemented_ |
| Blobs | User uploads _(filesystem)_ | Geo with API | _Not implemented_ |
| Blobs | User uploads _(object storage)_ | Geo with API/Managed (*2*) | _Not implemented_ |
| Blobs | LFS objects _(filesystem)_ | Geo with API | _Not implemented_ |
| Blobs | LFS objects _(object storage)_ | Geo with API/Managed (*2*) | _Not implemented_ |
| Blobs | CI job artifacts _(filesystem)_ | Geo with API | _Not implemented_ |
| Blobs | CI job artifacts _(object storage)_ | Geo with API/Managed (*2*) | _Not implemented_ |
| Blobs | Archived CI build traces _(filesystem)_ | Geo with API | _Not implemented_ |
| Blobs | Archived CI build traces _(object storage)_ | Geo with API/Managed (*2*) | _Not implemented_ |
| Blobs | Container registry _(filesystem)_ | Geo with API/Docker API | _Not implemented_ |
| Blobs | Container registry _(object storage)_ | Geo with API/Managed/Docker API (*2*) | _Not implemented_ |
- (*1*): Redis replication can be used as part of HA with Redis sentinel. It's not used between Geo nodes.
- (*2*): Object storage replication can be performed by Geo or by your object storage provider/appliance
native replication feature.
### Git repositories
A GitLab instance can have one or more repository shards. Each shard has a Gitaly instance that
is responsible for allowing access and operations on the locally stored Git repositories. It can run
on a machine with a single disk, multiple disks mounted as a single mount-point (like with a RAID array),
or using LVM.
It requires no special filesystem and can work with NFS or a mounted Storage Appliance (there may be
performance limitations when using a remote filesystem).
Communication is done via Gitaly's own gRPC API. There are three possible ways of synchronization:
- Using regular Git clone/fetch from one Geo node to another (with special authentication).
- Using repository snapshots (for when the first method fails or repository is corrupt).
- Manual trigger from the Admin UI (a combination of both of the above).
Each project can have at most 3 different repositories:
- A project repository, where the source code is stored.
- A wiki repository, where the wiki content is stored.
- A design repository, where design artifacts are indexed (assets are actually in LFS).
They all live in the same shard and share the same base name with a `-wiki` and `-design` suffix
for Wiki and Design Repository cases.
### Blobs
GitLab stores files and blobs such as Issue attachments or LFS objects into either:
- The filesystem in a specific location.
- An Object Storage solution. Object Storage solutions can be:
- Cloud based like Amazon S3 Google Cloud Storage.
- Self hosted (like MinIO).
- A Storage Appliance that exposes an Object Storage-compatible API.
When using the filesystem store instead of Object Storage, you need to use network mounted filesystems
to run GitLab when using more than one server (for example with a High Availability setup).
With respect to replication and verification:
- We transfer files and blobs using an internal API request.
- With Object Storage, you can either:
- Use a cloud provider replication functionality.
- Have GitLab replicate it for you.
### Database
GitLab relies on data stored in multiple databases, for different use-cases.
PostgreSQL is the single point of truth for user-generated content in the Web interface, like issues content, comments
as well as permissions and credentials.
PostgreSQL can also hold some level of cached data like HTML rendered Markdown, cached merge-requests diff (this can
also be configured to be offloaded to object storage).
We use PostgreSQL's own replication functionality to replicate data from the primary to secondary nodes.
We use Redis both as a cache store and to hold persistent data for our background jobs system. Because both
use-cases has data that are exclusive to the same Geo node, we don't replicate it between nodes.
Elasticsearch is an optional database, that can enable advanced searching capabilities, like improved Global Search
in both source-code level and user generated content in Issues / Merge-Requests and discussions. Currently it's not
supported in Geo.
## Limitations on replication/verification
The following table lists the GitLab features along with their replication
and verification status on a **secondary** node.
You can keep track of the progress to implement the missing items in
these epics/issues:
- [Unreplicated Data Types](https://gitlab.com/groups/gitlab-org/-/epics/893)
- [Verify all replicated data](https://gitlab.com/groups/gitlab-org/-/epics/1430)
DANGER: **DANGER**
Features not on this list, or with **No** in the **Replicated** column,
are not replicated on the **secondary** node. Failing over without manually
replicating data from those features will cause the data to be **lost**.
If you wish to use those features on a **secondary** node, or to execute a failover
successfully, you must replicate their data using some other means.
| Feature | Replicated | Verified | Notes |
|-----------------------------------------------------|--------------------------|-----------------------------|---------------------------------------------|
| Application data in PostgreSQL | **Yes** | **Yes** | |
| Project repository | **Yes** | **Yes** | |
| Project wiki repository | **Yes** | **Yes** | |
| Project designs repository | **Yes** | [No][design-verification] | Behind feature flag (*1*) |
| Uploads | **Yes** | [No][upload-verification] | Verified only on transfer, or manually (*2*)|
| LFS objects | **Yes** | [No][lfs-verification] | Verified only on transfer, or manually (*2*)|
| CI job artifacts (other than traces) | **Yes** | [No][artifact-verification] | Verified only manually (*2*) |
| Archived traces | **Yes** | [No][artifact-verification] | Verified only on transfer, or manually (*2*)|
| Personal snippets | **Yes** | **Yes** | |
| Project snippets | **Yes** | **Yes** | |
| Object pools for forked project deduplication | **Yes** | No | |
| [Server-side Git Hooks][custom-hooks] | No | No | |
| [Elasticsearch integration][elasticsearch] | No | No | |
| [GitLab Pages][gitlab-pages] | [No][pages-replication] | No | |
| [Container Registry][container-registry] | **Yes** | No | |
| [NPM Registry][npm-registry] | No | No | |
| [Maven Repository][maven-repository] | No | No | |
| [Conan Repository][conan-repository] | No | No | |
| [External merge request diffs][merge-request-diffs] | [No][diffs-replication] | No | |
| Content in object storage | **Yes** | No | |
- (*1*): Enable the `enable_geo_design_sync` feature flag by running the following in a Rails console:
```ruby
Feature.disable(:enable_geo_design_sync)
```
- (*2*): The integrity can be verified manually using
[Integrity Check Rake Task](../../raketasks/check.md) on both nodes and comparing the output between them.
[design-replication]: https://gitlab.com/groups/gitlab-org/-/epics/1633
[design-verification]: https://gitlab.com/gitlab-org/gitlab/issues/32467
[upload-verification]: https://gitlab.com/groups/gitlab-org/-/epics/1817
[lfs-verification]: https://gitlab.com/gitlab-org/gitlab/issues/8922
[artifact-verification]: https://gitlab.com/gitlab-org/gitlab/issues/8923
[diffs-replication]: https://gitlab.com/gitlab-org/gitlab/issues/33817
[pages-replication]: https://gitlab.com/groups/gitlab-org/-/epics/589
[custom-hooks]: ../../custom_hooks.md
[elasticsearch]: ../../../integration/elasticsearch.md
[gitlab-pages]: ../../pages/index.md
[container-registry]: ../../packages/container_registry.md
[npm-registry]: ../../../user/packages/npm_registry/index.md
[maven-repository]: ../../../user/packages/maven_repository/index.md
[conan-repository]: ../../../user/packages/conan_repository/index.md
[merge-request-diffs]: ../../merge_request_diffs.md
...@@ -252,74 +252,13 @@ This list of limitations only reflects the latest version of GitLab. If you are ...@@ -252,74 +252,13 @@ This list of limitations only reflects the latest version of GitLab. If you are
### Limitations on replication/verification ### Limitations on replication/verification
The following table lists the GitLab features along with their replication
and verification status on a **secondary** node.
You can keep track of the progress to implement the missing items in You can keep track of the progress to implement the missing items in
these epics/issues: these epics/issues:
- [Unreplicated Data Types](https://gitlab.com/groups/gitlab-org/-/epics/893) - [Unreplicated Data Types](https://gitlab.com/groups/gitlab-org/-/epics/893)
- [Verify all replicated data](https://gitlab.com/groups/gitlab-org/-/epics/1430) - [Verify all replicated data](https://gitlab.com/groups/gitlab-org/-/epics/1430)
| Feature | Replicated | Verified | Notes | There is a complete list of all GitLab [data types](datatypes.md) and [existing support for replication and verification](datatypes.md#limitations-on-replicationverification).
|-----------------------------------------------------|--------------------------|-----------------------------|--------------------------------------------|
| All database content | **Yes** | **Yes** | |
| Project repository | **Yes** | **Yes** | |
| Project wiki repository | **Yes** | **Yes** | |
| Project designs repository | **Yes** | [No][design-verification] | Behind feature flag (2) |
| Uploads | **Yes** | [No][upload-verification] | Verified only on transfer, or manually (1) |
| LFS Objects | **Yes** | [No][lfs-verification] | Verified only on transfer, or manually (1) |
| CI job artifacts (other than traces) | **Yes** | [No][artifact-verification] | Verified only manually (1) |
| Archived traces | **Yes** | [No][artifact-verification] | Verified only on transfer, or manually (1) |
| Personal snippets | **Yes** | **Yes** | |
| Version-controlled personal snippets | No | No | [Not yet supported][unsupported-snippets] |
| Project snippets | **Yes** | **Yes** | |
| Version-controlled project snippets | No | No | [Not yet supported][unsupported-snippets] |
| Object pools for forked project deduplication | **Yes** | No | |
| [Server-side Git Hooks][custom-hooks] | No | No | |
| [Elasticsearch integration][elasticsearch] | No | No | |
| [GitLab Pages][gitlab-pages] | [No][pages-replication] | No | |
| [Container Registry][container-registry] | **Yes** | No | |
| [NPM Registry][npm-registry] | No | No | |
| [Maven Repository][maven-repository] | No | No | |
| [Conan Repository][conan-repository] | No | No | |
| [External merge request diffs][merge-request-diffs] | [No][diffs-replication] | No | |
| Content in object storage | **Yes** | No | |
[design-replication]: https://gitlab.com/groups/gitlab-org/-/epics/1633
[design-verification]: https://gitlab.com/gitlab-org/gitlab/issues/32467
[upload-verification]: https://gitlab.com/groups/gitlab-org/-/epics/1817
[lfs-verification]: https://gitlab.com/gitlab-org/gitlab/issues/8922
[artifact-verification]: https://gitlab.com/gitlab-org/gitlab/issues/8923
[diffs-replication]: https://gitlab.com/gitlab-org/gitlab/issues/33817
[pages-replication]: https://gitlab.com/groups/gitlab-org/-/epics/589
[unsupported-snippets]: https://gitlab.com/gitlab-org/gitlab/issues/14228
[custom-hooks]: ../../custom_hooks.md
[elasticsearch]: ../../../integration/elasticsearch.md
[gitlab-pages]: ../../pages/index.md
[container-registry]: ../../packages/container_registry.md
[npm-registry]: ../../../user/packages/npm_registry/index.md
[maven-repository]: ../../../user/packages/maven_repository/index.md
[conan-repository]: ../../../user/packages/conan_repository/index.md
[merge-request-diffs]: ../../merge_request_diffs.md
1. The integrity can be verified manually using
[Integrity Check Rake Task](../../raketasks/check.md)
on both nodes and comparing the output between them.
1. Enable the `enable_geo_design_sync` feature flag by running the
following in a Rails console:
```ruby
Feature.disable(:enable_geo_design_sync)
```
DANGER: **DANGER**
Features not on this list, or with **No** in the **Replicated** column,
are not replicated on the **secondary** node. Failing over without manually
replicating data from those features will cause the data to be **lost**.
If you wish to use those features on a **secondary** node, or to execute a failover
successfully, you must replicate their data using some other means.
## Frequently Asked Questions ## Frequently Asked Questions
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment