# Elasticsearch knowledge **(STARTER ONLY)**

This page collects useful information for working with Elasticsearch.

Information on how to enable Elasticsearch and perform the initial indexing is in
the [Elasticsearch integration documentation](../integration/elasticsearch.md#enabling-elasticsearch).

## Deep Dive

In June 2019, Mario de la Ossa hosted a Deep Dive (GitLab team members only: `https://gitlab.com/gitlab-org/create-stage/issues/1`) on GitLab's [Elasticsearch integration](../integration/elasticsearch.md) to share his domain-specific knowledge with anyone who may work in this part of the code base in the future. You can find the [recording on YouTube](https://www.youtube.com/watch?v=vrvl-tN2EaA), and the slides on [Google Slides](https://docs.google.com/presentation/d/1H-pCzI_LNrgrL5pJAIQgvLX8Ji0-jIKOg1QeJQzChug/edit) and in [PDF](https://gitlab.com/gitlab-org/create-stage/uploads/c5aa32b6b07476fa8b597004899ec538/Elasticsearch_Deep_Dive.pdf). Everything covered in this deep dive was accurate as of GitLab 12.0, and while specific details may have changed since then, it should still serve as a good introduction.

## Supported Versions

See [Version Requirements](../integration/elasticsearch.md#version-requirements).

Developers making significant changes to Elasticsearch queries should test their features against all our supported versions.

## Setting up development environment

See the [Elasticsearch GDK setup instructions](https://gitlab.com/gitlab-org/gitlab-development-kit/blob/master/doc/howto/elasticsearch.md).

## Helpful Rake tasks

- `gitlab:elastic:test:index_size`: Tells you how much space the current index is using, as well as how many documents are in the index.
- `gitlab:elastic:test:index_size_change`: Outputs index size, reindexes, and outputs index size again. Useful when testing improvements to indexing size.

Additionally, if you need large repositories or multiple forks for testing, consider [following these instructions](rake_tasks.md#extra-project-seed-options).

## How does it work?

The Elasticsearch integration depends on an external indexer. We ship an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a Rake task but, after this is done, GitLab itself will trigger reindexing when required via `after_` callbacks on create, update, and destroy that are inherited from [`/ee/app/models/concerns/elastic/application_versioned_search.rb`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/concerns/elastic/application_versioned_search.rb).

After initial indexing is complete, create, update, and delete operations for all models except projects (see [#207494](https://gitlab.com/gitlab-org/gitlab/-/issues/207494)) are tracked in a Redis [`ZSET`](https://redis.io/topics/data-types#sorted-sets). A regular `sidekiq-cron` `ElasticIndexBulkCronWorker` processes this queue, updating many Elasticsearch documents at a time with the [Bulk Request API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).
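
As a rough illustration of the queue-then-bulk flow described above, the following Python sketch models the sorted set as an in-memory dictionary and builds a newline-delimited body in the shape the Bulk API accepts. The names `IndexingQueue`, `track`, `pop_batch`, and `build_bulk_body`, and the document IDs used with them, are hypothetical and are not taken from the GitLab code base. The action metadata is kept minimal on purpose.

```python
import json


class IndexingQueue:
    """In-memory stand-in for a Redis sorted set (ZSET) of pending updates."""

    def __init__(self):
        self._scores = {}  # member -> score, like a ZSET
        self._next_score = 0

    def track(self, operation, doc_id):
        # Adding an existing member only updates its score, so repeated
        # updates to one document collapse into a single queue entry.
        self._next_score += 1
        self._scores[(operation, doc_id)] = self._next_score

    def pop_batch(self, limit):
        # Take the lowest-scored (oldest) members first.
        members = sorted(self._scores, key=self._scores.get)[:limit]
        for member in members:
            del self._scores[member]
        return members


def build_bulk_body(index, batch, load_document):
    """Build an NDJSON bulk body: an action line, plus a source line for "index"."""
    lines = []
    for operation, doc_id in batch:
        lines.append(json.dumps({operation: {"_index": index, "_id": doc_id}}))
        if operation == "index":
            lines.append(json.dumps(load_document(doc_id)))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Because the queue is keyed by member, tracking the same document twice before the worker runs results in one bulk action rather than two.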

Search queries are generated by the concerns found in [`ee/app/models/concerns/elastic`](https://gitlab.com/gitlab-org/gitlab/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control, and have historically been a source of security bugs, so pay close attention to them.

## Existing Analyzers/Tokenizers/Filters

These are all defined in [`ee/lib/elastic/latest/config.rb`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/elastic/latest/config.rb).

### Analyzers

#### `path_analyzer`

Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.

Please see the `path_tokenizer` explanation below for an example.

#### `sha_analyzer`

Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.

Please see the `sha_tokenizer` explanation below for an example.

#### `code_analyzer`

Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: [`code`](#code), `lowercase`, and `asciifolding`.

The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example, the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched.

Please see the `code` filter for an explanation on how tokens are split.

NOTE: **Known Issues**:
Currently the [Elasticsearch code_analyzer doesn't account for all code cases](../integration/elasticsearch.md#known-issues).

#### `code_search_analyzer`

Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.

### Tokenizers

#### `sha_tokenizer`

This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) so that a SHA can be found by any leading portion of it (minimum of 5 characters).

Example:

`240c29dc7e` becomes:

- `240c2`
- `240c29`
- `240c29d`
- `240c29dc`
- `240c29dc7`
- `240c29dc7e`
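
The list above can be reproduced with a few lines of Python. This is only a sketch of what an edge n-gram tokenizer emits, not the Elasticsearch implementation. The function name `edge_ngrams` is hypothetical, and the `max_gram` default of 40 (the length of a full SHA-1) is an assumption rather than a value taken from the configuration.

```python
def edge_ngrams(text, min_gram=5, max_gram=40):
    """Return the prefixes of text whose length is between min_gram and max_gram."""
    longest = min(len(text), max_gram)
    return [text[:length] for length in range(min_gram, longest + 1)]


for token in edge_ngrams("240c29dc7e"):
    print(token)
```

Input shorter than `min_gram` produces no tokens at all in this sketch, which is why very short SHA fragments cannot match.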

#### `path_tokenizer`

This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` in order to allow searches to find paths no matter how much or how little of the path is given as input.

Example:

`'/some/path/application.js'` becomes:

- `'/some/path/application.js'`
- `'some/path/application.js'`
- `'path/application.js'`
- `'application.js'`
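
To make the effect of `reverse: true` concrete, here is a minimal Python sketch that reproduces the tokens listed above by dropping leading path components one at a time. The function name `reverse_path_hierarchy` is hypothetical, and edge cases such as trailing delimiters are not guaranteed to match Elasticsearch.

```python
def reverse_path_hierarchy(path, delimiter="/"):
    """Emit the path with leading components removed one at a time."""
    tokens = []
    remaining = path
    while remaining:
        tokens.append(remaining)
        if remaining.startswith(delimiter):
            # Drop a leading delimiter, for example "/some" -> "some".
            remaining = remaining[len(delimiter):]
        elif delimiter in remaining:
            # Drop the leading component, for example "some/path" -> "path".
            remaining = remaining.split(delimiter, 1)[1]
        else:
            break
    return tokens


for token in reverse_path_hierarchy("/some/path/application.js"):
    print(token)
```

Because every suffix of the path is indexed as its own token, a query for `path/application.js` or just `application.js` can match the same blob.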

### Filters

#### `code`

Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.

Patterns:

- `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCased and lowerCamelCased strings as separate tokens
- `"(\\d+)"`: extracts digits
- `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCased strings recursively. Ex: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single-quotes
- `'\.([^.]+)(?=\.|\s|\Z)'`: separate terms with periods in-between
- `'([\p{L}_.-]+)'`: some common chars in file names to keep the whole filename intact (eg. `my_file-ñame.txt`)
- `'([\p{L}\d_]+)'`: letters, numbers and underscores are the most common tokens in programming. Always capture them greedily regardless of context.
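
The following Python sketch shows how three of these patterns behave on sample input. It is an approximation only: Python's built-in `re` module has no `\p{Lu}` or `\p{Ll}` classes, so ASCII ranges stand in for them, and the helper `capture_tokens` (including its choice to keep the original token and skip duplicates) is a hypothetical simplification, not the filter's actual implementation.

```python
import re

# ASCII-only approximations of three of the patterns listed above.
CAMEL_PARTS = re.compile(r"([a-z]+|[A-Z][a-z]+|[A-Z]+)")
DIGITS = re.compile(r"(\d+)")
CAMEL_SUFFIXES = re.compile(r"(?=([A-Z]+[A-Za-z]+))")


def capture_tokens(token, patterns=(CAMEL_PARTS, DIGITS, CAMEL_SUFFIXES)):
    """Return the original token plus every capture, without duplicates."""
    captured = [token]
    for pattern in patterns:
        for match in pattern.finditer(token):
            if match.group(1) not in captured:
                captured.append(match.group(1))
    return captured


# The lookahead pattern matches at every uppercase run, which is what
# produces the "recursive" suffixes of a CamelCased string.
print(CAMEL_SUFFIXES.findall("ThisIsATest"))
print(capture_tokens("fooBar4"))
```

The zero-width lookahead is the important detail: because it consumes no input, the regular expression engine retries at the next character and captures each remaining CamelCased suffix in turn.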

## Gotchas

- Searches can have their own analyzers. Remember to check them when editing analyzers.
- `Character` filters (as opposed to token filters) always replace the original character, so they're not a good choice because they can hinder exact searches.

## Zero downtime reindexing with multiple indices

Currently, GitLab can handle only a single version of settings. Any setting or schema change requires reindexing everything from scratch. Because reindexing can take a long time, this can cause search functionality downtime.

To avoid downtime, GitLab is working to support multiple indices that
can function at the same time. Whenever the schema changes, the admin
will be able to create a new index and reindex to it, while searches
continue to go to the older, stable index. Any data updates will be
forwarded to both indices. Once the new index is ready, an admin can
mark it active, which will direct all searches to it, and remove the old
index.

This is also helpful for migrating to new servers, e.g. moving to/from AWS.

We are currently in the process of migrating to this new design. For now, everything is hardwired to work with a single version.

### Architecture

The traditional setup, provided by `elasticsearch-rails`, is to communicate through its internal proxy classes. Developers would write model-specific logic in a module for the model to include (e.g. `SnippetsSearch`). The `__elasticsearch__` methods would return a proxy object, e.g.:

- `Issue.__elasticsearch__` returns an instance of `Elasticsearch::Model::Proxy::ClassMethodsProxy`
- `Issue.first.__elasticsearch__` returns an instance of `Elasticsearch::Model::Proxy::InstanceMethodsProxy`.

These proxy objects would talk to the Elasticsearch server directly (see top half of the diagram).

![Elasticsearch Architecture](img/elasticsearch_architecture.svg)

In the planned new design, each model would have a pair of corresponding sub-classed proxy objects, in which model-specific logic is located. For example, `Snippet` would have `SnippetClassProxy` and `SnippetInstanceProxy` (subclasses of `Elasticsearch::Model::Proxy::ClassMethodsProxy` and `Elasticsearch::Model::Proxy::InstanceMethodsProxy`, respectively).

`__elasticsearch__` would represent another layer of proxy object, keeping track of multiple actual proxy objects. It would forward method calls to the appropriate index. For example:

- `model.__elasticsearch__.search` would be forwarded to the one stable index, since it is a read operation.
- `model.__elasticsearch__.update_document` would be forwarded to all indices, to keep all indices up-to-date.
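
The forwarding rule in the two bullets above can be sketched as follows. This is a language-neutral illustration written in Python, not the Ruby implementation: `IndexProxy` and `MultiVersionProxy` are hypothetical names, and the version labels used with them are placeholders.

```python
class IndexProxy:
    """Hypothetical per-index proxy that records the calls it receives."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    def search(self, query):
        self.calls.append(("search", query))
        return f"results for {query!r} from {self.name}"

    def update_document(self, doc_id):
        self.calls.append(("update_document", doc_id))


class MultiVersionProxy:
    """Routes reads to the stable index and writes to every index."""

    def __init__(self, stable, others=()):
        self.stable = stable
        self.all_indices = [stable, *others]

    def search(self, query):
        # Read operation: only the stable index answers.
        return self.stable.search(query)

    def update_document(self, doc_id):
        # Write operation: keep every index up to date.
        for index in self.all_indices:
            index.update_document(doc_id)
```

Marking a new index as active then amounts to swapping which proxy is treated as `stable`, while writes continue to reach all indices during the transition.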

The global configurations per version are now in the `Elastic::(Version)::Config` class. You can change mappings there.

### Creating new version of schema

NOTE: **Note:** This is not applicable yet because the multiple indices functionality is not fully implemented.

Folders like `ee/lib/elastic/v12p1` contain snapshots of search logic from different versions. To keep a continuous Git history, the latest version lives under `ee/lib/elastic/latest`, but its classes are aliased under an actual version (e.g. `ee/lib/elastic/v12p3`). When referencing these classes, never use the `Latest` namespace directly, but use the actual version (e.g. `V12p3`).

The version name basically follows GitLab's release version. If a setting is changed in 12.3, we will create a new namespace called `V12p3` (p stands for "point"). Raise an issue if there is a need to name a version differently.

If the current version is `v12p1`, and we need to create a new version for `v12p3`, the steps are as follows:

1. Copy the entire folder of `v12p1` as `v12p3`
1. Change the namespace for files under `v12p3` folder from `V12p1` to `V12p3` (which are still aliased to `Latest`)
1. Delete `v12p1` folder
1. Copy the entire folder of `latest` as `v12p1`
1. Change the namespace for files under `v12p1` folder from `Latest` to `V12p1`
1. Make changes to files under the `latest` folder as needed
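
The first five steps are mechanical, so they can be sketched as a small script. The following Python sketch is only an illustration under stated assumptions: the helper names `rewrite_namespace` and `create_schema_version` are hypothetical, the namespace change is a naive text replacement over `*.rb` files, and it works on the file system directly, so it does nothing to preserve Git history.

```python
import shutil
from pathlib import Path


def rewrite_namespace(folder: Path, old: str, new: str) -> None:
    """Naively replace a namespace name in every Ruby file under folder."""
    for path in folder.rglob("*.rb"):
        path.write_text(path.read_text().replace(old, new))


def create_schema_version(elastic_dir: Path, current: str, new: str) -> None:
    """Mirror the listed steps, for example current="v12p1", new="v12p3"."""
    current_dir = elastic_dir / current
    new_dir = elastic_dir / new
    latest_dir = elastic_dir / "latest"

    # Steps 1-2: copy the current folder under the new version name and
    # rename its namespace (its classes stay aliased to Latest).
    shutil.copytree(current_dir, new_dir)
    rewrite_namespace(new_dir, current.capitalize(), new.capitalize())

    # Step 3: delete the old folder.
    shutil.rmtree(current_dir)

    # Steps 4-5: snapshot `latest` under the old version name.
    shutil.copytree(latest_dir, current_dir)
    rewrite_namespace(current_dir, "Latest", current.capitalize())
    # Step 6 (editing files under `latest`) is manual and not automated here.
```

A call such as `create_schema_version(Path("ee/lib/elastic"), "v12p1", "v12p3")` would apply the steps to the folders named in the example above. Review the result by hand, because a plain text replacement can also touch comments or strings.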

## Performance Monitoring

### Prometheus

GitLab exports [Prometheus
metrics](../administration/monitoring/prometheus/gitlab_metrics.md) relating to
the number of requests and timing for all web/API requests and Sidekiq jobs,
which can help diagnose performance trends and compare how Elasticsearch timing
is impacting overall performance relative to the time spent doing other things.

#### Indexing queues

GitLab also exports [Prometheus
metrics](../administration/monitoring/prometheus/gitlab_metrics.md) for
indexing queues, which can help diagnose performance bottlenecks and determine
whether or not your GitLab instance or Elasticsearch server can keep up with
the volume of updates.

### Logs

All of the indexing happens in Sidekiq, so many of the relevant logs for the
Elasticsearch integration can be found in
[`sidekiq.log`](../administration/logs.md#sidekiqlog). In particular, all
Sidekiq workers that make requests to Elasticsearch in any way will log the
number of requests and time taken querying/writing to Elasticsearch. This can
be useful to understand whether or not your cluster is keeping up with
indexing.

Searching Elasticsearch is done via ordinary web workers handling requests. Any
requests to load a page or make an API request, which then make requests to
Elasticsearch, will log the number of requests and the time taken to
[`production_json.log`](../administration/logs.md#production_jsonlog). These
logs will also include the time spent on Database and Gitaly requests, which
may help to diagnose which part of the search is performing poorly.

There are additional logs specific to Elasticsearch that are sent to
[`elasticsearch.log`](../administration/logs.md#elasticsearchlog-starter-only)
that may contain information to help diagnose performance issues.

### Performance Bar

Elasticsearch requests will be displayed in the [`Performance
Bar`](../administration/monitoring/performance/performance_bar.md), which can
be used both locally in development and on any deployed GitLab instance to
diagnose poor search performance. This will show the exact queries being made,
which is useful to diagnose why a search might be slow.

### Correlation ID and X-Opaque-Id

Our [correlation
ID](./distributed_tracing.md#developer-guidelines-for-working-with-correlation-ids)
is forwarded by all requests from Rails to Elasticsearch as the
[`X-Opaque-Id`](https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html#_identifying_running_tasks)
header, which allows us to track any
[tasks](https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html)
in the cluster back to the request in GitLab.
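
As a minimal sketch of how that lookup could work, the following Python function filters an already-parsed tasks API response for tasks whose `X-Opaque-Id` header equals a given correlation ID. The function name `tasks_for_request` is hypothetical, and the code assumes the response nests tasks under `nodes`, with each task carrying a `headers` object.

```python
def tasks_for_request(tasks_response, correlation_id):
    """Return (task_id, action) pairs whose X-Opaque-Id matches correlation_id."""
    matches = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if task.get("headers", {}).get("X-Opaque-Id") == correlation_id:
                matches.append((task_id, task.get("action")))
    return matches
```

Tasks started without the header have no matching entry, so they are skipped rather than raising an error.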

## Troubleshooting

### Getting `flood stage disk watermark [95%] exceeded`

You might get an error such as:

```plaintext
[2018-10-31T15:54:19,762][WARN ][o.e.c.r.a.DiskThresholdMonitor] [pval5Ct]
   flood stage disk watermark [95%] exceeded on
   [pval5Ct7SieH90t5MykM5w][pval5Ct][/usr/local/var/lib/elasticsearch/nodes/0] free: 56.2gb[3%],
   all indices on this node will be marked read-only
```

This happens because you've exceeded the disk space threshold: Elasticsearch considers the disk too full, based on the default 95% flood stage threshold.

In addition, the `read_only_allow_delete` setting will be set to `true`. This blocks indexing, `forcemerge`, and similar operations. You can check the current index settings with:

```shell
curl "http://localhost:9200/gitlab-development/_settings?pretty"
```
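
The settings response is JSON keyed by index name. As a minimal sketch, assuming the response has already been parsed into a dictionary, the following Python function lists the indices that currently have the block set. The function name `read_only_indices` is hypothetical.

```python
def read_only_indices(settings_response):
    """Return names of indices whose read_only_allow_delete block is set."""
    blocked = []
    for index_name, body in settings_response.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        # Setting values come back as strings, so compare text, not booleans.
        if str(blocks.get("read_only_allow_delete", "false")).lower() == "true":
            blocked.append(index_name)
    return blocked
```

An empty result means no index in the response is blocked by this setting.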

Add this to your `elasticsearch.yml` file:

```yaml
# turn off the disk allocator
cluster.routing.allocation.disk.threshold_enabled: false
```

_or_

```yaml
# set your own limits
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.flood_stage: 5gb   # ES 6.x only
cluster.routing.allocation.disk.watermark.low: 15gb
cluster.routing.allocation.disk.watermark.high: 10gb
```

Restart Elasticsearch, and the `read_only_allow_delete` setting will clear on its own.

_from "Disk-based Shard Allocation | Elasticsearch Reference" [5.6](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/disk-allocator.html#disk-allocator) and [6.x](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/disk-allocator.html)_