Commit 6a1b3ac0 authored by Gerardo Lopez-Fernandez

Merge branch 'docs/gb/add-architecture-blueprints' into 'master'

Move architecture blueprints from epics to docs

See merge request gitlab-org/gitlab!43120
parents 6ae8edd8 ca27554a
---
comments: false
description: 'Next iteration of build logs architecture at GitLab'
---
# Cloud Native Build Logs
Cloud native and the adoption of Kubernetes have been recognised by GitLab as
one of the two biggest tailwinds helping us grow faster as the company behind
the project.
This effort is described in more detail [in the infrastructure team
handbook](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).
## Traditional build logs
Traditional job logs depend heavily on the availability of local shared
storage. Every time a GitLab Runner sends a new partial build output, we write
this output to a file on disk. This is simple, but the mechanism depends on
shared local storage: the same file needs to be available on every GitLab web
node machine, because GitLab Runner might connect to a different one each time
it performs an API request. Sidekiq also needs access to the file, because
when a job is complete the trace file contents are sent to the object store.
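For illustration, the traditional approach boils down to appending each
partial trace update to a per-build file on a shared mount. Below is a
minimal, hypothetical Go sketch of that idea; the shared root path and
function names are assumptions, not GitLab's actual implementation.
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// appendTrace appends a partial build log to a file on shared storage.
// Every web node (and Sidekiq) must see the same file, which is why this
// design requires NFS or similar shared local storage.
func appendTrace(sharedRoot string, buildID int64, data []byte) error {
	path := filepath.Join(sharedRoot, fmt.Sprintf("%d.log", buildID))
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(data)
	return err
}

func main() {
	// Hypothetical shared mount; in production this would be an NFS path.
	if err := appendTrace("/var/opt/gitlab/builds", 42, []byte("job output line\n")); err != nil {
		fmt.Println("append failed:", err)
	}
}
```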
## New architecture
The new architecture writes build logs to Redis instead of writing them to a
file.
In order to make this performant and resilient enough, we implemented a
chunked I/O mechanism: we store data in Redis in chunks, and migrate each
chunk to an object store once it reaches the desired size.
A simplified sequence diagram is available below.
```mermaid
sequenceDiagram
  autonumber
  participant U as User
  participant R as Runner
  participant G as GitLab (rails)
  participant I as Redis
  participant D as Database
  participant O as Object store
  loop incremental trace update sent by a runner
    Note right of R: Runner appends a build trace
    R->>+G: PATCH trace [build.id, offset, data]
    G->>+D: find or create chunk [chunk.index]
    D-->>-G: chunk [id, index]
    G->>I: append chunk data [chunk.index, data]
    G-->>-R: 200 OK
  end
  Note right of R: User retrieves a trace
  U->>+G: GET build trace
  loop every trace chunk
    G->>+D: find chunk [index]
    D-->>-G: chunk [id]
    G->>+I: read chunk data [chunk.index]
    I-->>-G: chunk data [data, size]
  end
  G-->>-U: build trace
  Note right of R: Trace chunk is full
  R->>+G: PATCH trace [build.id, offset, data]
  G->>+D: find or create chunk [chunk.index]
  D-->>-G: chunk [id, index]
  G->>I: append chunk data [chunk.index, data]
  G->>G: chunk full [index]
  G-->>-R: 200 OK
  G->>+I: read chunk data [chunk.index]
  I-->>-G: chunk data [data, size]
  G->>O: send chunk data [data, size]
  G->>D: update data store type [chunk.id]
  G->>I: delete chunk data [chunk.index]
```
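To make the chunking logic above concrete, here is a minimal, hypothetical Go
sketch of the idea: in-memory maps stand in for Redis and the object store,
and the chunk size is an assumed value, not GitLab's actual setting.
```go
package main

import "fmt"

const chunkSize = 128 * 1024 // assumed chunk size, not GitLab's actual value

// chunkStore models the chunked trace I/O: partial output is appended to the
// current chunk in Redis, and a full chunk is migrated to the object store.
type chunkStore struct {
	redis  map[int][]byte // stands in for Redis keys, one per chunk index
	object map[int][]byte // stands in for the object store
}

func (s *chunkStore) append(offset int, data []byte) {
	for len(data) > 0 {
		index := offset / chunkSize
		n := chunkSize - (offset % chunkSize) // room left in this chunk
		if len(data) < n {
			n = len(data)
		}
		s.redis[index] = append(s.redis[index], data[:n]...)
		// Once a chunk is full, persist it and drop the Redis copy.
		if len(s.redis[index]) == chunkSize {
			s.object[index] = s.redis[index]
			delete(s.redis, index)
		}
		offset += n
		data = data[n:]
	}
}

func main() {
	s := &chunkStore{redis: map[int][]byte{}, object: map[int][]byte{}}
	s.append(0, make([]byte, chunkSize+10)) // fills chunk 0, starts chunk 1
	fmt.Println("object store chunks:", len(s.object), "redis chunks:", len(s.redis))
}
```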
## NFS coupling
In 2017, we experienced serious problems scaling our NFS infrastructure. We
even tried to replace NFS with
[CephFS](https://docs.ceph.com/docs/master/cephfs/) - unsuccessfully.
Since then it has become apparent that the cost of operating and maintaining
an NFS cluster is significant, and that if we ever decide to migrate to
Kubernetes [we need to decouple GitLab from shared local storage and
NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396):
1. NFS might be a single point of failure
1. NFS can only be reliably scaled vertically
1. Moving to Kubernetes means increasing the number of mount points by an order
   of magnitude
1. NFS depends on an extremely reliable network, which can be difficult to
   provide in a Kubernetes environment
1. Storing customer data on NFS involves additional security risks
Moving GitLab to Kubernetes without decoupling it from NFS would result in an
explosion of complexity, maintenance cost, and an enormous negative impact on
availability.
## Iterations
1. ✓ Implement the new architecture in a way that does not depend on shared local storage
1. ✓ Evaluate performance and edge cases, and iterate to improve the new architecture
1. Design cloud native build logs correctness verification mechanisms
1. Build observability mechanisms around performance and correctness
1. Roll out the feature to the production environment incrementally
The work needed to make the new architecture production-ready and enabled on
GitLab.com is being tracked in the [Cloud Native Build Logs on
GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/4275) epic.
Enabling this feature on GitLab.com is a subtask of [making the new
architecture generally
available](https://gitlab.com/groups/gitlab-org/-/epics/3791) for everyone.
## Who
Proposal:
| Role                          | Who                     |
|------------------------------|-------------------------|
| Author | Grzegorz Bizon |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader | Darby Frey |
| Domain Expert | Kamil Trzciński |
| Domain Expert | Sean McGivern |
DRIs:
| Role                          | Who                    |
|------------------------------|------------------------|
| Product | Jason Yavorska |
| Leadership | Darby Frey |
| Engineering | Grzegorz Bizon |
---
comments: false
description: 'Making GitLab Pages a Cloud Native application - architecture blueprint.'
---
# GitLab Pages New Architecture
GitLab Pages is an important component of the GitLab product. It is mostly
used to serve static content, and has a limited set of well-defined
responsibilities. That said, it has unfortunately become a blocker for the
GitLab.com Kubernetes migration.
Cloud Native and the adoption of Kubernetes have been recognised by GitLab as
one of the two biggest tailwinds helping us grow faster as the company behind
the project.
This effort is described in more detail [in the infrastructure team handbook
page](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).
GitLab Pages is tightly coupled with NFS, and unblocking the Kubernetes
migration requires a significant change to GitLab Pages' architecture. This is
ongoing work that we started more than a year ago. This blueprint explains why
this work is important and what the roadmap looks like.
## How GitLab Pages Works
GitLab Pages is a daemon, written in [Go](https://golang.org/), designed to
serve static content.
Initially, GitLab Pages was designed to store static content on local shared
block storage (NFS) in a hierarchical group > project directory structure.
Each directory, representing a project, was supposed to contain a
configuration file and the static content that the GitLab Pages daemon would
read and serve.
```mermaid
graph LR
A(GitLab Rails) -- Writes new pages deployment --> B[(NFS)]
C(GitLab Pages) -. Reads static content .-> B
```
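As a rough illustration of that layout, resolving and reading a per-project
configuration could look like the hypothetical Go sketch below; the directory
root and the config fields are assumptions, not the actual Pages format.
```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// projectConfig is a stand-in for the per-project configuration file that the
// Pages daemon used to read from shared storage.
type projectConfig struct {
	Domains []string `json:"domains"`
}

// loadConfig reads <nfsRoot>/<group>/<project>/config.json, mirroring the
// hierarchical group > project layout of the original design.
func loadConfig(nfsRoot, group, project string) (*projectConfig, error) {
	raw, err := os.ReadFile(filepath.Join(nfsRoot, group, project, "config.json"))
	if err != nil {
		return nil, err
	}
	var cfg projectConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := loadConfig("/var/opt/gitlab/pages", "my-group", "my-project")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(cfg.Domains)
}
```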
This initial design has become outdated for a few reasons - NFS coupling
being one of them - and we decided to replace it with a more "decoupled
service"-like architecture. The new architecture that we are working on is
described in this blueprint.
## NFS coupling
In 2017, we experienced serious problems scaling our NFS infrastructure. We
even tried to replace NFS with
[CephFS](https://docs.ceph.com/docs/master/cephfs/) - unsuccessfully.
Since then it has become apparent that the cost of operating and maintaining
an NFS cluster is significant, and that if we ever decide to migrate to
Kubernetes [we need to decouple GitLab from shared local storage and
NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396):
1. NFS might be a single point of failure
1. NFS can only be reliably scaled vertically
1. Moving to Kubernetes means increasing the number of mount points by an order
   of magnitude
1. NFS depends on an extremely reliable network, which can be difficult to
   provide in a Kubernetes environment
1. Storing customer data on NFS involves additional security risks
Moving GitLab to Kubernetes without decoupling it from NFS would result in an
explosion of complexity, maintenance cost, and an enormous negative impact on
availability.
## New GitLab Pages Architecture
- GitLab Pages is going to source domain configuration from GitLab's internal
  API, instead of reading `config.json` files from local shared storage.
- GitLab Pages is going to serve static content from Object Storage.
```mermaid
graph TD
A(User) -- Pushes pages deployment --> B{GitLab}
C((GitLab Pages)) -. Reads configuration from API .-> B
C -. Reads static content .-> D[(Object Storage)]
C -- Serves static content --> E(Visitors)
```
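To illustrate the first point, a domain lookup against the internal API could
look like the hypothetical Go sketch below. The endpoint path and response
shape are assumptions for illustration, and the authentication the real API
requires is omitted here.
```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// lookupDomain asks the GitLab internal API for a domain's Pages
// configuration instead of reading config.json from NFS.
func lookupDomain(gitlabURL, host string) (map[string]any, error) {
	// Hypothetical endpoint; the real API also requires authentication.
	u := gitlabURL + "/api/v4/internal/pages?host=" + url.QueryEscape(host)
	resp, err := http.Get(u)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("domain lookup failed: %s", resp.Status)
	}
	var cfg map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	cfg, err := lookupDomain("https://gitlab.example.com", "group.example.io")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(cfg)
}
```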
This new architecture has also been briefly described in [a blog
post](https://about.gitlab.com/blog/2020/08/03/how-gitlab-pages-uses-the-gitlab-api-to-serve-content/).
## Iterations
1. ✓ Redesign GitLab Pages configuration source to use GitLab's API
1. ✓ Evaluate performance and build reliable caching mechanisms
1. ✓ Incrementally roll out the new source on GitLab.com
1. ✓ Make the GitLab Pages API domain config source enabled by default
1. Enable experimentation with different serving approaches through feature flags
1. Triangulate the object store serving design through meaningful experiments
1. Design Pages migration mechanisms that can work incrementally
1. Gradually migrate towards object storage serving on GitLab.com
The [GitLab Pages
Architecture](https://gitlab.com/groups/gitlab-org/-/epics/1316) epic, with a
detailed roadmap, is also available.
## Who
Proposal:
| Role                          | Who                     |
|------------------------------|-------------------------|
| Author | Grzegorz Bizon |
| Architecture Evolution Coach | Kamil Trzciński |
| Engineering Leader | Daniel Croft |
| Domain Expert | Grzegorz Bizon |
| Domain Expert | Vladimir Shushlin |
| Domain Expert | Jaime Martinez |
DRIs:
| Role                          | Who                    |
|------------------------------|------------------------|
| Product | Jackie Porter |
| Leadership | Daniel Croft |
| Engineering | TBD |
Domain Experts:
| Role                          | Who                    |
|------------------------------|------------------------|
| Domain Expert | Kamil Trzciński |
| Domain Expert | Grzegorz Bizon |
| Domain Expert | Vladimir Shushlin |
| Domain Expert | Jaime Martinez |
| Domain Expert | Krasimir Angelov |
---
comments: false
description: 'Internal usage of Feature Flags for GitLab development'
---
# Usage of Feature Flags for GitLab development
Usage of feature flags has become crucial for the development of GitLab.
Feature flags are a convenient way to ship changes early and safely roll them
out to a wide audience, ensuring that a feature is stable and performant.
Since the presence of a feature is controlled with a dedicated condition, a
developer can decide on the best time for testing the feature, ensuring that
it is not enabled prematurely.
## Challenges
The extensive usage of feature flags poses a few challenges:
- Each feature flag that we add to the codebase is ~"technical debt", as it
  adds a matrix of configurations.
- Testing each combination of feature flags is close to impossible, so we
  instead try to optimise our testing of feature flags for the most common
  scenarios.
- There is a growing challenge of maintaining an increasing number of feature
  flags. We sometimes forget how our feature flags are configured or why we
  have not yet removed a feature flag.
- The usage of feature flags can also be confusing to people outside of
  development who might not fully understand how a ~feature or ~bug fix
  depends on a feature flag, how the feature flag is configured, or whether
  the feature should be announced as part of a release post.
- Maintaining feature flags poses the additional challenge of managing
  different configurations across different environments and targets. We have
  different configurations of feature flags for testing, for development, for
  staging, for production, and for what is shipped to our customers as part of
  the on-premise offering.
## Goals
The biggest challenge today with our feature flag usage is its implicit
nature. Feature flags are part of the codebase, making them hard to understand
outside of the development function.
We should aim to make our feature-flag-based development accessible to any
interested party:
- developer / engineer
  - can easily add a new feature flag and configure its state
  - can quickly find who to reach when touching another feature flag
  - can quickly find stale feature flags
- engineering manager
  - can understand what feature flags their group manages
- engineering manager and director
  - can understand how much ~"technical debt" is inflicted by the number of feature flags we have to manage
  - can understand how many feature flags are added and removed in each release
- product manager and documentation writer
  - can understand which features are gated by which feature flags
  - can understand whether a feature, and thus its feature flag, is generally available on GitLab.com
  - can understand whether a feature, and thus its feature flag, is enabled by default for on-premise installations
- delivery engineer
  - can understand which feature flags are introduced and changed between subsequent deployments
- support and reliability engineer
  - can understand how feature flags changed between releases: which feature flags became enabled and which were removed
  - can quickly find relevant information about a feature flag to identify individuals who might help with an ongoing support request or incident
## Proposal
To help with the above goals, we should aim to make our feature flags usage
explicit and understood by all involved parties.
Introducing a YAML-described `feature-flags/<name-of-feature>.yml` file would
allow us to:
1. Have a central place where all feature flags are documented,
1. Describe why the given feature flag was introduced,
1. Link the relevant issue and merge request that introduced it,
1. Build automated documentation with all feature flags in the codebase,
1. Track how many feature flags there are per group,
1. Track how many feature flags are added and removed between releases,
1. Make this information easily accessible to all,
1. Allow our customers to easily discover how to enable features and quickly
   find out what changed between different releases.
### The `YAML`
```yaml
---
name: ci_disallow_to_create_merge_request_pipelines_in_target_project
introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40724
rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/235119
group: group::progressive delivery
type: development
default_enabled: false
```
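To show how such definitions could be consumed by tooling, here is a
hypothetical Go sketch that parses and minimally validates one of these files.
It uses the gopkg.in/yaml.v3 package, and the validation rules are
assumptions, not an agreed-upon policy.
```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// featureFlag mirrors the fields of a feature-flags/<name>.yml definition.
type featureFlag struct {
	Name            string `yaml:"name"`
	IntroducedByURL string `yaml:"introduced_by_url"`
	RolloutIssueURL string `yaml:"rollout_issue_url"`
	Group           string `yaml:"group"`
	Type            string `yaml:"type"`
	DefaultEnabled  bool   `yaml:"default_enabled"`
}

func loadFlag(path string) (*featureFlag, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var ff featureFlag
	if err := yaml.Unmarshal(raw, &ff); err != nil {
		return nil, err
	}
	// Minimal validation: every flag should document its name, owning group,
	// and a rollout issue, so the "technical debt" stays visible.
	if ff.Name == "" || ff.Group == "" || ff.RolloutIssueURL == "" {
		return nil, fmt.Errorf("%s: incomplete feature flag definition", path)
	}
	return &ff, nil
}

func main() {
	ff, err := loadFlag("feature-flags/ci_disallow_to_create_merge_request_pipelines_in_target_project.yml")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s (group %q) default_enabled=%v\n", ff.Name, ff.Group, ff.DefaultEnabled)
}
```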
## Reasons
These are the reasons why these changes are needed:
- we have around 500 different feature flags today
- we have a hard time tracking their usage
- we have ambiguous usage of feature flags, with different `default_enabled:`
  values and different `actors` used
- we lack a clear indication of who owns which feature flag and where to find
  relevant information
- we do not emphasise the need to create a feature flag rollout issue, which
  indicates that a feature flag is in fact ~"technical debt"
- we do not know exactly what feature flags we have in our codebase
- we do not know exactly how our feature flags are configured for different
  environments: what is being used for `test`, what we ship for `on-premise`,
  and what our settings are for `staging`, `qa`, and `production`
## Iterations
This work is being done as part of a dedicated epic: [Improve internal usage
of Feature Flags](https://gitlab.com/groups/gitlab-org/-/epics/3551). This
epic describes the overarching reasons for making these changes.
## Who
Proposal:
| Role                          | Who                     |
|------------------------------|-------------------------|
| Author | Kamil Trzciński |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader | Kamil Trzciński |
| Domain Expert | Shinya Maeda |
DRIs:
| Role                          | Who                    |
|------------------------------|------------------------|
| Product | ? |
| Leadership | Craig Gomes |
| Engineering | Kamil Trzciński |
---
comments: false
description: 'Architecture Practice at GitLab'
---
# Architecture at GitLab
- [Architecture at GitLab](https://about.gitlab.com/handbook/engineering/architecture/)
- [Architecture Workflow](https://about.gitlab.com/handbook/engineering/architecture/workflow/)