Change more Elasticsearch indexes to keyword type
Related to https://gitlab.com/gitlab-org/gitlab/-/issues/213035.

The [Elasticsearch keyword type](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/keyword.html) "is used for structured content such as IDs, email addresses, hostnames, status codes, zip codes, or tags". For the fields below, this type is preferred over the current [text type](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/text.html), as the text type takes up more storage.

The `text` type splits up the value as though it were human-readable text (i.e. it splits words apart) and indexes each word separately in the inverted index. As such, the `text` type will usually take up more space in the inverted index and should only be used when you need to search for individual words in the text. For each of the fields below, this adds no value and possibly makes certain searches incorrect.

After testing locally, this change appears to save `4%` disk storage.

As per https://gitlab.com/gitlab-org/gitlab/-/issues/213035#note_439629162, here is the reasoning on a per-field basis:

1. `state/merge_status` => We only do exact matches against this for filtering. It's only 1 word, so changing to keyword won't make any difference.
2. `target_branch/source_branch` => These are not used in any searches today, so there is no risk in changing the index options. Changing these to keyword should give a decent storage improvement, as the values can be quite long and composed of many words.
3. `merge_status` => This is not used in any searches today, so there is no risk in changing the index options. The values appear to be things like `can_be_merged/cannot_be_merged/unchecked`, which implies to me that it should be a keyword anyway: splitting these by word would produce wrong results if we ever did filter on it, and keyword will save some storage.
4. `commit.(committer/author).email` => This is used in commit searches today, and it's hard to know exactly how it might be used by our current users. Users will lose some behaviour if they were searching for partial email addresses before. For example, you can [search for `dyl.griffith`](https://gitlab.com/search?scope=commits&repository_ref=&search=dyl.griffith&group_id=9970&project_id=278964) and you will find commits authored by my email address, which starts with `dyl.griffith`. After this change to keyword, you'd need to search for the entire exact email address, or you could use the prefix search `dyl.griffith*`. However, since wildcards (prefix searches) can only be used at the end of a word, you will not be able to search for `griffith` on its own after this change.
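To make the trade-off in point 4 concrete, here is a minimal Python sketch of a toy inverted index. It is not how Elasticsearch is implemented: `text_terms` is a hypothetical stand-in that lowercases and splits on every non-alphanumeric character, whereas real analyzers follow different, configurable rules. The sample email addresses are made up as well.

```python
import re
from collections import defaultdict


def text_terms(value):
    """Toy stand-in for a `text` analyzer (assumption: lowercase, then
    split on any non-alphanumeric character)."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def keyword_terms(value):
    """A `keyword` field indexes the whole value as one term."""
    return [value]


def build_index(docs, analyzer):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyzer(value):
            index[term].add(doc_id)
    return index


def term_search(index, term):
    """Exact match against a single indexed term."""
    return sorted(index.get(term, set()))


def prefix_search(index, prefix):
    """Match every indexed term starting with `prefix` (like `dyl.griffith*`)."""
    hits = set()
    for term, ids in index.items():
        if term.startswith(prefix):
            hits |= ids
    return sorted(hits)


# Hypothetical sample data, not taken from any real index.
emails = {1: "dyl.griffith@example.com", 2: "someone.else@example.com"}

text_index = build_index(emails, text_terms)
keyword_index = build_index(emails, keyword_terms)

# The text-style index holds more terms than the keyword-style one.
print(len(text_index), len(keyword_index))

# A partial match on an inner word only works with the text-style index.
print(term_search(text_index, "griffith"))
print(term_search(keyword_index, "griffith"))

# With keyword, a prefix or the full value is needed.
print(prefix_search(keyword_index, "dyl.griffith"))
print(term_search(keyword_index, "dyl.griffith@example.com"))
```

In this toy model the keyword index stores one term per value instead of one per word, which is where the storage saving comes from, and a match only succeeds on the full value or a prefix of it.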