# Optimizing GitLab for large repositories

Large repositories consisting of more than 50k files in a worktree
often require special consideration because of
the time required to clone and check out.

GitLab and GitLab Runner handle this scenario well,
but require optimized configuration to perform their
operations efficiently.

The general guidelines for handling big repositories are simple.
Each guideline is described in more detail in the sections below:

- Always fetch incrementally. Do not clone in a way that results in recreating all of the worktree.
- Always use shallow clone to reduce data transfer. Be aware that this puts more burden
  on the GitLab instance due to higher CPU impact.
- Control the clone directory if you heavily use a fork-based workflow.
- Optimize `git clean` flags to ensure that you remove or keep data that might affect or speed up your build.

## Shallow cloning

> Introduced in GitLab Runner 8.9.

GitLab and GitLab Runner perform a full clone by default.
While this means that all changes from GitLab are received,
it often results in receiving extra commit logs.

Ideally, you should always use `GIT_DEPTH` with a small number
like 10. This instructs GitLab Runner to perform shallow clones.
Shallow clones make Git request only the latest set of changes for a given branch,
up to the number of commits defined by the `GIT_DEPTH` variable.

This significantly speeds up fetching of changes from Git repositories,
especially if the repository has a very long backlog containing a number
of big files, because it reduces the amount of data transferred.

The following example makes GitLab Runner shallow clone to fetch only a given branch.
It does not fetch any other branches or tags.

```yaml
variables:
  GIT_DEPTH: 10

test:
  script:
    - ls -al
```

## Git strategy

> Introduced in GitLab Runner 8.9.

By default, GitLab is configured to prefer the `GIT_STRATEGY: fetch` strategy.
The `GIT_STRATEGY: fetch` strategy re-uses an existing worktree if one is found
on disk. This is different from the `GIT_STRATEGY: clone` strategy:
with `clone`, if a worktree is found, it is removed before cloning.

Using `fetch` is preferred because it reduces the amount of data to transfer and
does not really impact the operations that you might do on a repository from CI.

However, `fetch` does require access to the previous worktree. This works
well when using the `shell` or `docker` executor because these
try to preserve worktrees and re-use them by default.

This does not currently work for the `kubernetes` executor and has limitations when using
`docker+machine`. The `kubernetes` executor currently always clones into an ephemeral directory.

GitLab also offers the `GIT_STRATEGY: none` strategy. This disables any `fetch` and `checkout` commands
done by GitLab, requiring you to run them yourself.
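
The strategies described above can be set with the `GIT_STRATEGY` variable in `.gitlab-ci.yml`. The following is a minimal sketch, not a recommended configuration: the job names `build` and `notify` are hypothetical, and it assumes the `notify` job does not need any repository contents.

```yaml
variables:
  # Re-use an existing worktree when the executor preserved one.
  GIT_STRATEGY: fetch

build:
  script:
    - ls -al

notify:
  variables:
    # Skip the fetch and checkout done by GitLab for this job only.
    GIT_STRATEGY: none
  script:
    - echo "This job does not need the repository contents"
```

Setting the variable globally and overriding it per job keeps the common case in one place while letting individual jobs opt out.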

## Git clone path

> Introduced in GitLab Runner 11.10.

[`GIT_CLONE_PATH`](../yaml/README.md#custom-build-directories) allows you to
control where you clone your sources. This can have implications if you
heavily use big repositories with a fork workflow.

From GitLab Runner's perspective, a fork is stored as a separate repository
with a separate worktree. That means that GitLab Runner cannot optimize the usage
of worktrees on its own, and you might have to instruct GitLab Runner to do so.

In such cases, ideally you want the GitLab Runner executor to be used only
for the given project and not shared across different projects, to make this
process more efficient.

The [`GIT_CLONE_PATH`](../yaml/README.md#custom-build-directories) has to be
within the `$CI_BUILDS_DIR`. Currently, you cannot pick an arbitrary path
on disk.
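
As a rough illustration of the constraint above, the clone path in the following sketch is built from `$CI_BUILDS_DIR`. The directory name `my-project` is hypothetical, and the sketch assumes the Runner allows custom build directories.

```yaml
variables:
  # `my-project` is a hypothetical directory name used for illustration.
  # The path must stay within $CI_BUILDS_DIR.
  GIT_CLONE_PATH: $CI_BUILDS_DIR/my-project

test:
  script:
    # Print the directory the job runs in.
    - pwd
```

A fixed path like this is only safe when a single job uses it at a time. The fork-based workflow section shows a path that also accounts for concurrent jobs.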

## Git clean flags

> Introduced in GitLab Runner 11.10.

[`GIT_CLEAN_FLAGS`](../yaml/README.md#git-clean-flags) allows you to control
whether or not the `git clean` command is executed for each CI
job. By default, GitLab ensures that your worktree is on the given SHA
and that your repository is clean.

Setting [`GIT_CLEAN_FLAGS`](../yaml/README.md#git-clean-flags) to `none`
disables `git clean`. On very big repositories, this might be desired because
`git clean` is disk I/O intensive. Setting `GIT_CLEAN_FLAGS: -ffdx -e .build/`,
for example, lets you keep some directories within the worktree between
subsequent runs, which can speed up incremental builds. This has the biggest
effect if you re-use existing machines and have an existing worktree that you
can re-use for builds.

For exact parameters accepted by
[`GIT_CLEAN_FLAGS`](../yaml/README.md#git-clean-flags), see the documentation
for [git clean](https://git-scm.com/docs/git-clean). The available parameters
depend on the Git version.
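
The two settings mentioned above might be combined as in the following sketch. The job names are hypothetical, and `.build/` is assumed to be a directory your build writes to and wants to keep between runs.

```yaml
variables:
  # Remove untracked and ignored files and directories,
  # but exclude the .build/ directory from removal.
  GIT_CLEAN_FLAGS: -ffdx -e .build/

build:
  script:
    - ls -al

package:
  variables:
    # Skip `git clean` entirely for this job.
    GIT_CLEAN_FLAGS: none
  script:
    - ls -al .build/
```

Skipping `git clean` means leftovers from earlier jobs stay in the worktree, so it is best suited to jobs that tolerate or rely on that state.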

## Fork-based workflow

> Introduced in GitLab Runner 11.10.

Following the guidelines above, let's imagine that we want to:

- Optimize for a big project (more than 50k files in directory).
- Use a fork-based workflow for contributing.
- Reuse existing worktrees. Have preconfigured runners that are pre-cloned with repositories.
- Have the runner assigned only to the project and all its forks.

Let's consider the following two examples, one using the `shell` executor and
the other using the `docker` executor.

### `shell` executor example

Let's assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html).

```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "shell"
  builds_dir = "/builds"
  cache_dir = "/cache"

  [runners.custom_build_dir]
    enabled = true
```

This `config.toml`:

- Uses the `shell` executor.
- Specifies a custom `/builds` directory where all clones will be stored.
- Enables the ability to specify `GIT_CLONE_PATH`.
- Runs at most 4 jobs at once.

### `docker` executor example

Let's assume that you have the following [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html).

```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"

  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
```

This `config.toml`:

- Uses the `docker` executor.
- Specifies a custom `/builds` directory on disk where all clones will be stored.
  We host mount the `/builds` directory to make it reusable between subsequent runs
  and be allowed to override the cloning strategy.
- Does not explicitly enable the ability to specify `GIT_CLONE_PATH` because it is enabled by default.
- Runs at most 4 jobs at once.

### Our `.gitlab-ci.yml`

Once we have the executor configured, we need to fine-tune our `.gitlab-ci.yml`.

Our pipeline will be most performant if we use the following `.gitlab-ci.yml`:

```yaml
variables:
  GIT_DEPTH: 10
  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME

build:
  script: ls -al
```

The above configures a:

- Shallow clone with a depth of 10, to speed up subsequent `git fetch` commands.
- Custom clone path to make it possible to re-use worktrees between the parent project and all forks,
  because we use the same clone path for all forks.

Why use `$CI_CONCURRENT_ID`? The main reason is to ensure that the worktrees used do not conflict
between projects. The `$CI_CONCURRENT_ID` represents a unique identifier within the given executor,
so as long as we use it to construct the path, it is guaranteed that this directory will not conflict
with other concurrent jobs running.
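
To see how the path is resolved on a particular runner, a throwaway job like the following sketch can print it. The job name `show-clone-path` is hypothetical, and `$CI_PROJECT_DIR` is a predefined GitLab CI variable (not otherwise used on this page) that is assumed here to hold the directory the job was cloned into.

```yaml
variables:
  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME

show-clone-path:
  script:
    # Print the clone directory and the concurrency identifier used for this job.
    - echo "Worktree is $CI_PROJECT_DIR, concurrent ID is $CI_CONCURRENT_ID"
```

Running several pipelines at once and comparing the printed paths is a quick way to confirm that concurrent jobs land in separate directories.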

### Store custom clone options in `config.toml`

Ideally, all job-related configuration should be stored in `.gitlab-ci.yml`.
However, sometimes it is desirable to make these schemes part of the Runner configuration.

In the fork-based example above, making this configuration discoverable for users may be preferred,
but this brings administrative overhead as the `.gitlab-ci.yml` needs to be updated for each branch.
In such cases, it might be desirable to keep the `.gitlab-ci.yml` clone path agnostic, and make the
clone path part of the Runner configuration instead.

We can extend our [config.toml](https://docs.gitlab.com/runner/configuration/advanced-configuration.html)
with the following specification, which Runner uses if `.gitlab-ci.yml` does not override it:

```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"

  environment = [
    "GIT_DEPTH=10",
    "GIT_CLONE_PATH=$CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME"
  ]

  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
```

This makes the cloning configuration part of the given Runner,
and does not require us to update each `.gitlab-ci.yml`.
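
Because the Runner-level values apply only when `.gitlab-ci.yml` does not override them, a single project can still opt out. The following is a minimal sketch, assuming a hypothetical project that needs more history than the Runner default. The depth of 50 is an arbitrary illustrative value.

```yaml
variables:
  # Overrides the GIT_DEPTH=10 default set in the Runner's `environment`.
  GIT_DEPTH: 50

test:
  script:
    - ls -al
```

`GIT_CLONE_PATH` is left unset here, so the job keeps the clone path defined in the Runner configuration.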