Commit 5f33219d authored by Dylan Griffith

Change more Elasticsearch indexes to keyword type

Related to https://gitlab.com/gitlab-org/gitlab/-/issues/213035 .

The [Elasticsearch keyword type](
https://www.elastic.co/guide/en/elasticsearch/reference/7.10/keyword.html)
"is used for structured content such as IDs, email addresses, hostnames,
status codes, zip codes, or tags". This type is preferred over the
current [text type](
https://www.elastic.co/guide/en/elasticsearch/reference/7.10/text.html)
as the text type takes up more storage.

The `text` type splits up the text as though it were human-readable text
(i.e. splitting words apart) and indexes each word separately in the
inverted index. As such the `text` type will usually take up more space
in the inverted index and should only be used when you need to search
for individual words in the text.
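
As a rough illustration of the difference, here is a minimal Ruby sketch of
a toy inverted index. The tokenizer (lowercase, split on non-alphanumeric
characters) is a simplifying assumption and not Elasticsearch's actual
analyzer, and the branch names and helper names are hypothetical:

```ruby
require 'set'

# Simplified stand-in for a text analyzer (an assumption, not the real
# Elasticsearch analyzer): lowercase and split on non-alphanumerics.
def text_terms(value)
  value.downcase.scan(/[a-z0-9]+/)
end

# A keyword field indexes the whole value as a single term.
def keyword_terms(value)
  [value]
end

# Build a toy inverted index: term => Set of document ids.
def build_index(docs, &analyzer)
  index = Hash.new { |hash, term| hash[term] = Set.new }
  docs.each_with_object(index) do |(id, value), memo|
    analyzer.call(value).each { |term| memo[term] << id }
  end
end

# Total number of (term, document) entries stored in the index.
def postings(index)
  index.values.sum(&:size)
end

# Look up a term without creating new entries in the index.
def lookup(index, term)
  index.fetch(term, Set.new)
end

# Hypothetical branch names.
docs = {
  1 => 'fix-search-timeout',
  2 => 'fix-search-mapping'
}

text_index    = build_index(docs) { |value| text_terms(value) }
keyword_index = build_index(docs) { |value| keyword_terms(value) }

puts "text:    #{text_index.size} terms, #{postings(text_index)} postings"
puts "keyword: #{keyword_index.size} terms, #{postings(keyword_index)} postings"
```

In this toy model the text-style index stores more terms and postings for
the same two documents, while only the keyword-style index can answer an
exact match on the whole value.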

For each of the fields below, indexing individual words adds no value and
possibly makes certain searches incorrect. In local testing this change
appears to save `4%` of disk storage.

As per
https://gitlab.com/gitlab-org/gitlab/-/issues/213035#note_439629162 here
is the reasoning on a per field basis:

1. `state/merge_status` => We only do exact matches against this for
filtering. It's only 1 word, so changing to keyword won't make any
difference.
2. `target_branch/source_branch` => these are not used in any searches
today so there is no risk to changing the index options. Changing this
to keyword should have a decent storage improvement as these can be
quite long and composed of many words.
3. `merge_status` => this is not used in any searches today so there is
no risk to changing the index options. This appears to be things like
`can_be_merged/cannot_be_merged/unchecked`, which implies to me that it
should be a keyword anyway: splitting this by word would produce wrong
results if we ever did filter on it, and keyword will save some storage.
4. `commit.(committer/author).email` => this is used in commit searches
today and it's hard to know exactly how this might be used by our
current users. Users will lose some behaviour though if they were
searching for partial email addresses before. For example you can
[search for `dyl.griffith`](
https://gitlab.com/search?scope=commits&repository_ref=&search=dyl.griffith&group_id=9970&project_id=278964)
and you will find commits authored by my email address which starts with
`dyl.griffith`. After this change to use keyword you'd need to search
for the entire exact email address or you could use the prefix search
`dyl.griffith*` as well. However, since prefix searches (wildcards)
can only be used at the end of the word, you will not be able to search
for `griffith` alone after this change.
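
The matching behaviour described in item 4 can be sketched in plain Ruby.
This is a toy model of exact and trailing-wildcard matching against a
keyword value, not the actual search code; `keyword_match?` and the email
address are hypothetical:

```ruby
# Toy model of matching a query against a keyword-indexed value.
# A trailing `*` is treated as a prefix search; anything else must
# match the whole value exactly.
def keyword_match?(value, query)
  if query.end_with?('*')
    value.start_with?(query.chomp('*'))
  else
    value == query
  end
end

email = 'dyl.griffith@example.com' # hypothetical address

puts keyword_match?(email, 'dyl.griffith@example.com') # => true  (full address)
puts keyword_match?(email, 'dyl.griffith*')            # => true  (prefix search)
puts keyword_match?(email, 'dyl.griffith')             # => false (partial, no wildcard)
puts keyword_match?(email, 'griffith')                 # => false (middle of the value)
```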
parent 65e2e1b5
---
title: Change more Elasticsearch indexes to keyword type to save storage
merge_request: 46640
author:
type: performance
@@ -132,7 +132,7 @@ module Elastic
 index_options: 'positions'
 indexes :description, type: :text,
 index_options: 'positions'
-indexes :state, type: :text
+indexes :state, type: :keyword
 indexes :project_id, type: :integer
 indexes :author_id, type: :integer
@@ -150,11 +150,9 @@ module Elastic
 indexes :assignee_id, type: :integer
 ### MERGE REQUESTS
-indexes :target_branch, type: :text,
-index_options: 'docs'
-indexes :source_branch, type: :text,
-index_options: 'docs'
-indexes :merge_status, type: :text
+indexes :target_branch, type: :keyword
+indexes :source_branch, type: :keyword
+indexes :merge_status, type: :keyword
 indexes :source_project_id, type: :integer
 indexes :target_project_id, type: :integer
@@ -234,13 +232,13 @@ module Elastic
 indexes :author do
 indexes :name, type: :text, index_options: 'positions'
-indexes :email, type: :text, index_options: 'positions'
+indexes :email, type: :keyword
 indexes :time, type: :date, format: :basic_date_time_no_millis
 end
 indexes :committer do
 indexes :name, type: :text, index_options: 'positions'
-indexes :email, type: :text, index_options: 'positions'
+indexes :email, type: :keyword
 indexes :time, type: :date, format: :basic_date_time_no_millis
 end