Commit 6a1b3ac0 authored by Gerardo Lopez-Fernandez

Merge branch 'docs/gb/add-architecture-blueprints' into 'master'

Move architecture blueprints from epics to docs

See merge request gitlab-org/gitlab!43120
parents 6ae8edd8 ca27554a
---
comments: false
description: 'Next iteration of build logs architecture at GitLab'
---
# Cloud Native Build Logs
Cloud native and the adoption of Kubernetes have been recognised by GitLab as
one of the two biggest tailwinds helping us grow faster as the company behind
the project.
This effort is described in more detail [in the infrastructure team
handbook](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).
## Traditional build logs
Traditional job logs depend heavily on the availability of local shared
storage. Every time a GitLab Runner sends a new partial build output, we write
this output to a file on disk. This is simple, but the mechanism depends on
shared local storage: the same file needs to be available on every GitLab web
node machine, because GitLab Runner might connect to a different one each time
it performs an API request. Sidekiq also needs access to the file, because
when a job is complete the trace file contents are sent to the object store.
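For illustration, the traditional approach boils down to appending each
partial trace update to a per-build file on a shared mount. Below is a
minimal, hypothetical Go sketch of that idea; the shared root path and
function names are assumptions, not GitLab's actual implementation.
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// appendTrace appends a partial build log to a file on shared storage.
// Every web node (and Sidekiq) must see the same file, which is why this
// design requires NFS or similar shared local storage.
func appendTrace(sharedRoot string, buildID int64, data []byte) error {
	path := filepath.Join(sharedRoot, fmt.Sprintf("%d.log", buildID))
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(data)
	return err
}

func main() {
	// Hypothetical shared mount; in production this would be an NFS path.
	if err := appendTrace("/var/opt/gitlab/builds", 42, []byte("job output line\n")); err != nil {
		fmt.Println("append failed:", err)
	}
}
```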
## New architecture
The new architecture writes build logs to Redis instead of writing them to a
file.
In order to make this performant and resilient enough, we implemented a
chunked I/O mechanism: we store data in Redis in chunks, and migrate each
chunk to an object store once it reaches the desired size.
A simplified sequence diagram is available below.
```mermaid
sequenceDiagram
  autonumber
  participant U as User
  participant R as Runner
  participant G as GitLab (rails)
  participant I as Redis
  participant D as Database
  participant O as Object store
  loop incremental trace update sent by a runner
    Note right of R: Runner appends a build trace
    R->>+G: PATCH trace [build.id, offset, data]
    G->>+D: find or create chunk [chunk.index]
    D-->>-G: chunk [id, index]
    G->>I: append chunk data [chunk.index, data]
    G-->>-R: 200 OK
  end
  Note right of R: User retrieves a trace
  U->>+G: GET build trace
  loop every trace chunk
    G->>+D: find chunk [index]
    D-->>-G: chunk [id]
    G->>+I: read chunk data [chunk.index]
    I-->>-G: chunk data [data, size]
  end
  G-->>-U: build trace
  Note right of R: Trace chunk is full
  R->>+G: PATCH trace [build.id, offset, data]
  G->>+D: find or create chunk [chunk.index]
  D-->>-G: chunk [id, index]
  G->>I: append chunk data [chunk.index, data]
  G->>G: chunk full [index]
  G-->>-R: 200 OK
  G->>+I: read chunk data [chunk.index]
  I-->>-G: chunk data [data, size]
  G->>O: send chunk data [data, size]
  G->>D: update data store type [chunk.id]
  G->>I: delete chunk data [chunk.index]
```
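To make the chunking logic above concrete, here is a minimal, hypothetical Go
sketch of the idea: in-memory maps stand in for Redis and the object store,
and the chunk size is an assumed value, not GitLab's actual setting.
```go
package main

import "fmt"

const chunkSize = 128 * 1024 // assumed chunk size, not GitLab's actual value

// chunkStore models the chunked trace I/O: partial output is appended to the
// current chunk in Redis, and a full chunk is migrated to the object store.
type chunkStore struct {
	redis  map[int][]byte // stands in for Redis keys, one per chunk index
	object map[int][]byte // stands in for the object store
}

func (s *chunkStore) append(offset int, data []byte) {
	for len(data) > 0 {
		index := offset / chunkSize
		n := chunkSize - (offset % chunkSize) // room left in this chunk
		if len(data) < n {
			n = len(data)
		}
		s.redis[index] = append(s.redis[index], data[:n]...)
		// Once a chunk is full, persist it and drop the Redis copy.
		if len(s.redis[index]) == chunkSize {
			s.object[index] = s.redis[index]
			delete(s.redis, index)
		}
		offset += n
		data = data[n:]
	}
}

func main() {
	s := &chunkStore{redis: map[int][]byte{}, object: map[int][]byte{}}
	s.append(0, make([]byte, chunkSize+10)) // fills chunk 0, starts chunk 1
	fmt.Println("object store chunks:", len(s.object), "redis chunks:", len(s.redis))
}
```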
## NFS coupling
In 2017, we experienced serious problems scaling our NFS infrastructure. We
even tried to replace NFS with
[CephFS](https://docs.ceph.com/docs/master/cephfs/) - unsuccessfully.
Since then it has become apparent that the cost of operating and maintaining
an NFS cluster is significant, and that if we ever decide to migrate to
Kubernetes [we need to decouple GitLab from shared local storage and
NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396):
1. NFS might be a single point of failure
1. NFS can only be reliably scaled vertically
1. Moving to Kubernetes means increasing the number of mount points by an order
   of magnitude
1. NFS depends on an extremely reliable network, which can be difficult to
   provide in a Kubernetes environment
1. Storing customer data on NFS involves additional security risks
Moving GitLab to Kubernetes without decoupling it from NFS would result in an
explosion of complexity, maintenance cost, and an enormous negative impact on
availability.
## Iterations
1. ✓ Implement the new architecture in a way that does not depend on shared local storage
1. ✓ Evaluate performance and edge cases, and iterate to improve the new architecture
1. Design cloud native build logs correctness verification mechanisms
1. Build observability mechanisms around performance and correctness
1. Roll out the feature to the production environment incrementally
The work needed to make the new architecture production-ready and enabled on
GitLab.com is being tracked in the [Cloud Native Build Logs on
GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/4275) epic.
Enabling this feature on GitLab.com is a subtask of [making the new
architecture generally
available](https://gitlab.com/groups/gitlab-org/-/epics/3791) for everyone.
## Who
Proposal:
| Role                          | Who                     |
|------------------------------|-------------------------|
| Author | Grzegorz Bizon |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader | Darby Frey |
| Domain Expert | Kamil Trzciński |
| Domain Expert | Sean McGivern |
DRIs:
| Role                          | Who                    |
|------------------------------|------------------------|
| Product | Jason Yavorska |
| Leadership | Darby Frey |
| Engineering | Grzegorz Bizon |
---
comments: false
description: 'Making GitLab Pages a Cloud Native application - architecture blueprint.'
---
# GitLab Pages New Architecture
GitLab Pages is an important component of the GitLab product. It is mostly
used to serve static content, and has a limited set of well-defined
responsibilities. That said, it has unfortunately become a blocker for the
GitLab.com Kubernetes migration.
Cloud Native and the adoption of Kubernetes have been recognised by GitLab as
one of the two biggest tailwinds helping us grow faster as the company behind
the project.
This effort is described in more detail [in the infrastructure team handbook
page](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).
GitLab Pages is tightly coupled with NFS, and unblocking the Kubernetes
migration requires a significant change to GitLab Pages' architecture. This is
ongoing work that we started more than a year ago. This blueprint explains why
this work is important and what the roadmap looks like.
## How GitLab Pages Works
GitLab Pages is a daemon, written in [Go](https://golang.org/), designed to
serve static content.
Initially, GitLab Pages was designed to store static content on local shared
block storage (NFS) in a hierarchical group > project directory structure.
Each directory, representing a project, was supposed to contain a
configuration file and the static content that the GitLab Pages daemon would
read and serve.
```mermaid
graph LR
A(GitLab Rails) -- Writes new pages deployment --> B[(NFS)]
C(GitLab Pages) -. Reads static content .-> B
```
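As a rough illustration of that layout, resolving and reading a per-project
configuration could look like the hypothetical Go sketch below; the directory
root and the config fields are assumptions, not the actual Pages format.
```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// projectConfig is a stand-in for the per-project configuration file that the
// Pages daemon used to read from shared storage.
type projectConfig struct {
	Domains []string `json:"domains"`
}

// loadConfig reads <nfsRoot>/<group>/<project>/config.json, mirroring the
// hierarchical group > project layout of the original design.
func loadConfig(nfsRoot, group, project string) (*projectConfig, error) {
	raw, err := os.ReadFile(filepath.Join(nfsRoot, group, project, "config.json"))
	if err != nil {
		return nil, err
	}
	var cfg projectConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := loadConfig("/var/opt/gitlab/pages", "my-group", "my-project")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(cfg.Domains)
}
```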
This initial design has become outdated for a few reasons - NFS coupling
being one of them - and we decided to replace it with a more "decoupled
service"-like architecture. The new architecture that we are working on is
described in this blueprint.
## NFS coupling
In 2017, we experienced serious problems scaling our NFS infrastructure. We
even tried to replace NFS with
[CephFS](https://docs.ceph.com/docs/master/cephfs/) - unsuccessfully.
Since then it has become apparent that the cost of operating and maintaining
an NFS cluster is significant, and that if we ever decide to migrate to
Kubernetes [we need to decouple GitLab from shared local storage and
NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396):
1. NFS might be a single point of failure
1. NFS can only be reliably scaled vertically
1. Moving to Kubernetes means increasing the number of mount points by an order
   of magnitude
1. NFS depends on an extremely reliable network, which can be difficult to
   provide in a Kubernetes environment
1. Storing customer data on NFS involves additional security risks
Moving GitLab to Kubernetes without decoupling it from NFS would result in an
explosion of complexity, maintenance cost, and an enormous negative impact on
availability.
## New GitLab Pages Architecture
- GitLab Pages is going to source domain configuration from GitLab's internal
  API, instead of reading `config.json` files from local shared storage.
- GitLab Pages is going to serve static content from Object Storage.
```mermaid
graph TD
A(User) -- Pushes pages deployment --> B{GitLab}
C((GitLab Pages)) -. Reads configuration from API .-> B
C -. Reads static content .-> D[(Object Storage)]
C -- Serves static content --> E(Visitors)
```
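To illustrate the first point, a domain lookup against the internal API could
look like the hypothetical Go sketch below. The endpoint path and response
shape are assumptions for illustration, and the authentication the real API
requires is omitted here.
```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// lookupDomain asks the GitLab internal API for a domain's Pages
// configuration instead of reading config.json from NFS.
func lookupDomain(gitlabURL, host string) (map[string]any, error) {
	// Hypothetical endpoint; the real API also requires authentication.
	u := gitlabURL + "/api/v4/internal/pages?host=" + url.QueryEscape(host)
	resp, err := http.Get(u)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("domain lookup failed: %s", resp.Status)
	}
	var cfg map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	cfg, err := lookupDomain("https://gitlab.example.com", "group.example.io")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(cfg)
}
```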
This new architecture has also been briefly described in [a blog
post](https://about.gitlab.com/blog/2020/08/03/how-gitlab-pages-uses-the-gitlab-api-to-serve-content/).
## Iterations
1. ✓ Redesign GitLab Pages configuration source to use GitLab's API
1. ✓ Evaluate performance and build reliable caching mechanisms
1. ✓ Incrementally roll out the new source on GitLab.com
1. ✓ Make the GitLab Pages API domain config source enabled by default
1. Enable experimentation with different serving approaches through feature flags
1. Triangulate the object store serving design through meaningful experiments
1. Design Pages migration mechanisms that can work incrementally
1. Gradually migrate towards object storage serving on GitLab.com
The [GitLab Pages
Architecture](https://gitlab.com/groups/gitlab-org/-/epics/1316) epic, with a
detailed roadmap, is also available.
## Who
Proposal:
| Role                          | Who                     |
|------------------------------|-------------------------|
| Author | Grzegorz Bizon |
| Architecture Evolution Coach | Kamil Trzciński |
| Engineering Leader | Daniel Croft |
| Domain Expert | Grzegorz Bizon |
| Domain Expert | Vladimir Shushlin |
| Domain Expert | Jaime Martinez |
DRIs:
| Role                          | Who                    |
|------------------------------|------------------------|
| Product | Jackie Porter |
| Leadership | Daniel Croft |
| Engineering | TBD |
Domain Experts:
| Role                          | Who                    |
|------------------------------|------------------------|
| Domain Expert | Kamil Trzciński |
| Domain Expert | Grzegorz Bizon |
| Domain Expert | Vladimir Shushlin |
| Domain Expert | Jaime Martinez |
| Domain Expert | Krasimir Angelov |
---
comments: false
description: 'Internal usage of Feature Flags for GitLab development'
---
# Usage of Feature Flags for GitLab development
Usage of feature flags has become crucial for the development of GitLab.
Feature flags are a convenient way to ship changes early and safely roll them
out to a wide audience, ensuring that a feature is stable and performant.
Since the presence of a feature is controlled with a dedicated condition, a
developer can decide on the best time for testing the feature, ensuring that
it is not enabled prematurely.
## Challenges
The extensive usage of feature flags poses a few challenges:
- Each feature flag that we add to the codebase is ~"technical debt", as it
  adds a matrix of configurations.
- Testing each combination of feature flags is close to impossible, so we
  instead try to optimise our testing of feature flags for the most common
  scenarios.
- There is a growing challenge of maintaining an increasing number of feature
  flags. We sometimes forget how our feature flags are configured or why we
  have not yet removed a feature flag.
- The usage of feature flags can also be confusing to people outside of
  development who might not fully understand how a ~feature or ~bug fix
  depends on a feature flag, how the feature flag is configured, or whether
  the feature should be announced as part of a release post.
- Maintaining feature flags poses the additional challenge of managing
  different configurations across different environments and targets. We have
  different configurations of feature flags for testing, for development, for
  staging, for production, and for what is shipped to our customers as part of
  the on-premise offering.
## Goals
The biggest challenge today with our feature flag usage is its implicit
nature. Feature flags are part of the codebase, making them hard to understand
outside of the development function.
We should aim to make our feature-flag-based development accessible to any
interested party:
- developer / engineer
  - can easily add a new feature flag and configure its state
  - can quickly find who to reach when touching another feature flag
  - can quickly find stale feature flags
- engineering manager
  - can understand what feature flags their group manages
- engineering manager and director
  - can understand how much ~"technical debt" is inflicted by the number of feature flags we have to manage
  - can understand how many feature flags are added and removed in each release
- product manager and documentation writer
  - can understand which features are gated by which feature flags
  - can understand whether a feature, and thus its feature flag, is generally available on GitLab.com
  - can understand whether a feature, and thus its feature flag, is enabled by default for on-premise installations
- delivery engineer
  - can understand which feature flags are introduced and changed between subsequent deployments
- support and reliability engineer
  - can understand how feature flags changed between releases: which feature flags became enabled and which were removed
  - can quickly find relevant information about a feature flag to identify individuals who might help with an ongoing support request or incident
## Proposal
To help with the above goals, we should aim to make our feature flags usage
explicit and understood by all involved parties.
Introducing a YAML-described `feature-flags/<name-of-feature>.yml` file would
allow us to:
1. Have a central place where all feature flags are documented,
1. Describe why the given feature flag was introduced,
1. Link the relevant issue and merge request that introduced it,
1. Build automated documentation with all feature flags in the codebase,
1. Track how many feature flags there are per group,
1. Track how many feature flags are added and removed between releases,
1. Make this information easily accessible to all,
1. Allow our customers to easily discover how to enable features and quickly
   find out what changed between different releases.
### The `YAML`
```yaml
---
name: ci_disallow_to_create_merge_request_pipelines_in_target_project
introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40724
rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/235119
group: group::progressive delivery
type: development
default_enabled: false
```
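To show how such definitions could be consumed by tooling, here is a
hypothetical Go sketch that parses and minimally validates one of these files.
It uses the gopkg.in/yaml.v3 package, and the validation rules are
assumptions, not an agreed-upon policy.
```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// featureFlag mirrors the fields of a feature-flags/<name>.yml definition.
type featureFlag struct {
	Name            string `yaml:"name"`
	IntroducedByURL string `yaml:"introduced_by_url"`
	RolloutIssueURL string `yaml:"rollout_issue_url"`
	Group           string `yaml:"group"`
	Type            string `yaml:"type"`
	DefaultEnabled  bool   `yaml:"default_enabled"`
}

func loadFlag(path string) (*featureFlag, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var ff featureFlag
	if err := yaml.Unmarshal(raw, &ff); err != nil {
		return nil, err
	}
	// Minimal validation: every flag should document its name, owning group,
	// and a rollout issue, so the "technical debt" stays visible.
	if ff.Name == "" || ff.Group == "" || ff.RolloutIssueURL == "" {
		return nil, fmt.Errorf("%s: incomplete feature flag definition", path)
	}
	return &ff, nil
}

func main() {
	ff, err := loadFlag("feature-flags/ci_disallow_to_create_merge_request_pipelines_in_target_project.yml")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s (group %q) default_enabled=%v\n", ff.Name, ff.Group, ff.DefaultEnabled)
}
```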
## Reasons
These are the reasons why these changes are needed:
- we have around 500 different feature flags today
- we have a hard time tracking their usage
- we have ambiguous usage of feature flags, with different `default_enabled:`
  values and different `actors` used
- we lack a clear indication of who owns which feature flag and where to find
  relevant information
- we do not emphasise the need to create a feature flag rollout issue, which
  indicates that a feature flag is in fact ~"technical debt"
- we do not know exactly what feature flags we have in our codebase
- we do not know exactly how our feature flags are configured for different
  environments: what is being used for `test`, what we ship for `on-premise`,
  and what our settings are for `staging`, `qa`, and `production`
## Iterations
This work is being done as part of a dedicated epic: [Improve internal usage
of Feature Flags](https://gitlab.com/groups/gitlab-org/-/epics/3551). This
epic describes the overarching reasons for making these changes.
## Who
Proposal:
| Role                          | Who                     |
|------------------------------|-------------------------|
| Author | Kamil Trzciński |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader | Kamil Trzciński |
| Domain Expert | Shinya Maeda |
DRIs:
| Role                          | Who                    |
|------------------------------|------------------------|
| Product | ? |
| Leadership | Craig Gomes |
| Engineering | Kamil Trzciński |
---
comments: false
description: 'Architecture Practice at GitLab'
---
# Architecture at GitLab
- [Architecture at GitLab](https://about.gitlab.com/handbook/engineering/architecture/)
- [Architecture Workflow](https://about.gitlab.com/handbook/engineering/architecture/workflow/)