Commit e279b703 authored by Imre Farkas

Merge branch '219329-pass-max-doc-size-to-elasticsearch-indexer' into 'master'

Add setting for max doc size Elasticsearch

See merge request gitlab-org/gitlab!36925
parents 64e4747b 587b0894
# frozen_string_literal: true
class AddElasticsearchIndexedFileSizeLimitKbToApplicationSettings < ActiveRecord::Migration[6.0]
DOWNTIME = false
def change
add_column :application_settings,
:elasticsearch_indexed_file_size_limit_kb,
:integer,
null: false,
default: 1024 # 1 MiB (units in KiB)
end
end
dfe979676a74b09ed6d0ffe258162f7d5af2bdea7c21b307aafbbb9ceb7c0334
\ No newline at end of file
...@@ -9247,6 +9247,7 @@ CREATE TABLE public.application_settings ( ...@@ -9247,6 +9247,7 @@ CREATE TABLE public.application_settings (
maintenance_mode boolean DEFAULT false NOT NULL, maintenance_mode boolean DEFAULT false NOT NULL,
maintenance_mode_message text, maintenance_mode_message text,
wiki_page_max_content_bytes bigint DEFAULT 52428800 NOT NULL, wiki_page_max_content_bytes bigint DEFAULT 52428800 NOT NULL,
elasticsearch_indexed_file_size_limit_kb integer DEFAULT 1024 NOT NULL,
CONSTRAINT check_51700b31b5 CHECK ((char_length(default_branch_name) <= 255)), CONSTRAINT check_51700b31b5 CHECK ((char_length(default_branch_name) <= 255)),
CONSTRAINT check_9c6c447a13 CHECK ((char_length(maintenance_mode_message) <= 255)), CONSTRAINT check_9c6c447a13 CHECK ((char_length(maintenance_mode_message) <= 255)),
CONSTRAINT check_d03919528d CHECK ((char_length(container_registry_vendor) <= 255)), CONSTRAINT check_d03919528d CHECK ((char_length(container_registry_vendor) <= 255)),
......
...@@ -440,6 +440,24 @@ Reports that go over the 20 MB limit won't be loaded. Affected reports: ...@@ -440,6 +440,24 @@ Reports that go over the 20 MB limit won't be loaded. Affected reports:
## Advanced Global Search limits ## Advanced Global Search limits
### Maximum file size indexed
> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/8638) in GitLab 13.3.
You can set a limit on the content of repository files that are indexed in
Elasticsearch. Any files larger than this limit will not be indexed, and thus
will not be searchable.
Setting a limit helps reduce the memory usage of the indexing processes as well
as the overall index size. This value defaults to `1024 KiB` (1 MiB) as any
text files larger than this likely aren't meant to be read by humans.
NOTE: **Note:**
You must set a limit, as an unlimited file size is not supported. If this
value is greater than the amount of memory available on GitLab's Sidekiq nodes,
those nodes will run out of memory, because they pre-allocate this amount of
memory during indexing.
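As a rough illustration of the rule above, the sketch below converts a limit in KiB to bytes and decides whether a file's content would be indexed. The `indexable?` helper is hypothetical and is not part of GitLab or its indexer; it also assumes that a file exactly at the limit is still indexed, which the text above implies but does not state outright.

```ruby
# Hypothetical helper, for illustration only: not GitLab's indexer code.
KIB = 1024

# Returns true when the content fits within the limit (given in KiB).
# An unlimited size is not supported, so the limit must be a positive integer.
def indexable?(content, limit_kb: 1024)
  unless limit_kb.is_a?(Integer) && limit_kb.positive?
    raise ArgumentError, 'limit_kb must be a positive integer'
  end

  content.bytesize <= limit_kb * KIB
end

small = 'Small file contents'
large = 'x' * (2 * 1024 * KIB) # 2 MiB, twice the default limit

puts indexable?(small) # fits under the default 1024 KiB limit
puts indexable?(large) # exceeds the default limit
```

With the default of 1024 KiB, the limit in bytes is 1024 * 1024 = 1,048,576.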
### Maximum field length ### Maximum field length
> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/201826) in GitLab 12.8. > [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/201826) in GitLab 12.8.
...@@ -448,6 +466,9 @@ You can set a limit on the content of text fields indexed for Global Search. ...@@ -448,6 +466,9 @@ You can set a limit on the content of text fields indexed for Global Search.
Setting a maximum helps to reduce the load of the indexing processes. If any Setting a maximum helps to reduce the load of the indexing processes. If any
text field exceeds this limit then the text will be truncated to this number of text field exceeds this limit then the text will be truncated to this number of
characters and the rest will not be indexed and hence will not be searchable. characters and the rest will not be indexed and hence will not be searchable.
This applies to all indexed data except repository files, which have a
separate limit (see [Maximum file size
indexed](#maximum-file-size-indexed)).
- On GitLab.com this is limited to 20000 characters - On GitLab.com this is limited to 20000 characters
- For self-managed installations it is unlimited by default - For self-managed installations it is unlimited by default
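The truncation behavior described above can be sketched in a few lines of plain Ruby. The `truncate_for_index` helper is hypothetical; it only mirrors the documented rule (truncate to a number of characters, with `0` meaning no limit) and is not GitLab's implementation.

```ruby
# Hypothetical helper, for illustration only.
# Truncates text to `limit` characters; a limit of 0 means "no limit".
def truncate_for_index(text, limit)
  return text if limit.zero?

  text[0, limit] # String#[] with (start, length) counts characters, not bytes
end

puts truncate_for_index('héllo wörld', 5)
puts truncate_for_index('héllo wörld', 0)
```

Anything past the cut-off is not sent to the index, so a search term that only appears in the truncated tail will not match.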
......
...@@ -238,6 +238,7 @@ are listed in the descriptions of the relevant settings. ...@@ -238,6 +238,7 @@ are listed in the descriptions of the relevant settings.
| `elasticsearch_aws` | boolean | no | **(PREMIUM)** Enable the use of AWS hosted Elasticsearch | | `elasticsearch_aws` | boolean | no | **(PREMIUM)** Enable the use of AWS hosted Elasticsearch |
| `elasticsearch_aws_region` | string | no | **(PREMIUM)** The AWS region the Elasticsearch domain is configured | | `elasticsearch_aws_region` | string | no | **(PREMIUM)** The AWS region the Elasticsearch domain is configured |
| `elasticsearch_aws_secret_access_key` | string | no | **(PREMIUM)** AWS IAM secret access key | | `elasticsearch_aws_secret_access_key` | string | no | **(PREMIUM)** AWS IAM secret access key |
| `elasticsearch_indexed_file_size_limit_kb` | integer | no | **(PREMIUM)** Maximum size of repository and wiki files that will be indexed by Elasticsearch. |
| `elasticsearch_indexed_field_length_limit` | integer | no | **(PREMIUM)** Maximum size of text fields that will be indexed by Elasticsearch. 0 value means no limit. This does not apply to repository and wiki indexing. | | `elasticsearch_indexed_field_length_limit` | integer | no | **(PREMIUM)** Maximum size of text fields that will be indexed by Elasticsearch. 0 value means no limit. This does not apply to repository and wiki indexing. |
| `elasticsearch_indexing` | boolean | no | **(PREMIUM)** Enable Elasticsearch indexing | | `elasticsearch_indexing` | boolean | no | **(PREMIUM)** Enable Elasticsearch indexing |
| `elasticsearch_limit_indexing` | boolean | no | **(PREMIUM)** Limit Elasticsearch to index certain namespaces and projects | | `elasticsearch_limit_indexing` | boolean | no | **(PREMIUM)** Limit Elasticsearch to index certain namespaces and projects |
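For illustration, the new attribute could be set through the application settings API (`PUT /api/v4/application/settings`). The sketch below only builds the request with Ruby's standard `Net::HTTP`; the host name and token are placeholders, and `build_settings_request` is a hypothetical helper rather than part of any GitLab client library.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a settings update request.
def build_settings_request(base_url, token, limit_kb)
  uri = URI.join(base_url, '/api/v4/application/settings')
  request = Net::HTTP::Put.new(uri)
  request['PRIVATE-TOKEN'] = token # admin access token (placeholder below)
  request.set_form_data('elasticsearch_indexed_file_size_limit_kb' => limit_kb.to_s)
  [uri, request]
end

uri, request = build_settings_request('https://gitlab.example.com', '<your_access_token>', 2048)
puts "#{request.method} #{uri}"
puts request.body

# To actually send it against a real instance:
# Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
#   puts http.request(request).code
# end
```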
......
...@@ -155,6 +155,7 @@ The following Elasticsearch settings are available: ...@@ -155,6 +155,7 @@ The following Elasticsearch settings are available:
| `AWS Region` | The AWS region your Elasticsearch service is located in. | | `AWS Region` | The AWS region your Elasticsearch service is located in. |
| `AWS Access Key` | The AWS access key. | | `AWS Access Key` | The AWS access key. |
| `AWS Secret Access Key` | The AWS secret access key. | | `AWS Secret Access Key` | The AWS secret access key. |
| `Maximum file size indexed` | See [the explanation in instance limits.](../administration/instance_limits.md#maximum-file-size-indexed). |
| `Maximum field length` | See [the explanation in instance limits.](../administration/instance_limits.md#maximum-field-length). | | `Maximum field length` | See [the explanation in instance limits.](../administration/instance_limits.md#maximum-field-length). |
| `Maximum bulk request size (MiB)` | The Maximum Bulk Request size is used by GitLab's Golang-based indexer processes and indicates how much data it ought to collect (and store in memory) in a given indexing process before submitting the payload to Elasticsearch’s Bulk API. This setting should be used with the Bulk request concurrency setting (see below) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running GitLab's Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. | | `Maximum bulk request size (MiB)` | The Maximum Bulk Request size is used by GitLab's Golang-based indexer processes and indicates how much data it ought to collect (and store in memory) in a given indexing process before submitting the payload to Elasticsearch’s Bulk API. This setting should be used with the Bulk request concurrency setting (see below) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running GitLab's Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. |
| `Bulk request concurrency` | The Bulk request concurrency indicates how many of GitLab's Golang-based indexer processes (or threads) can run in parallel to collect data to subsequently submit to Elasticsearch’s Bulk API. This increases indexing performance, but fills the Elasticsearch bulk requests queue faster. This setting should be used together with the Maximum bulk request size setting (see above) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running GitLab's Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. | | `Bulk request concurrency` | The Bulk request concurrency indicates how many of GitLab's Golang-based indexer processes (or threads) can run in parallel to collect data to subsequently submit to Elasticsearch’s Bulk API. This increases indexing performance, but fills the Elasticsearch bulk requests queue faster. This setting should be used together with the Maximum bulk request size setting (see above) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running GitLab's Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. |
......
...@@ -30,6 +30,7 @@ module EE ...@@ -30,6 +30,7 @@ module EE
:elasticsearch_max_bulk_concurrency, :elasticsearch_max_bulk_concurrency,
:elasticsearch_max_bulk_size_mb, :elasticsearch_max_bulk_size_mb,
:elasticsearch_replicas, :elasticsearch_replicas,
:elasticsearch_indexed_file_size_limit_kb,
:elasticsearch_indexed_field_length_limit, :elasticsearch_indexed_field_length_limit,
:elasticsearch_search, :elasticsearch_search,
:elasticsearch_shards, :elasticsearch_shards,
......
...@@ -66,6 +66,10 @@ module EE ...@@ -66,6 +66,10 @@ module EE
presence: { message: "can't be blank when using aws hosted elasticsearch" }, presence: { message: "can't be blank when using aws hosted elasticsearch" },
if: ->(setting) { setting.elasticsearch_indexing? && setting.elasticsearch_aws? } if: ->(setting) { setting.elasticsearch_indexing? && setting.elasticsearch_aws? }
validates :elasticsearch_indexed_file_size_limit_kb,
presence: true,
numericality: { only_integer: true, greater_than: 0 }
validates :elasticsearch_indexed_field_length_limit, validates :elasticsearch_indexed_field_length_limit,
presence: true, presence: true,
numericality: { only_integer: true, greater_than_or_equal_to: 0 } numericality: { only_integer: true, greater_than_or_equal_to: 0 }
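The two validations differ only in whether `0` is accepted. A plain-Ruby approximation of that difference is sketched below; the predicate names are hypothetical, and the real checks are the ActiveRecord `validates` calls above (which, unlike this sketch, also accept numeric strings submitted from forms).

```ruby
# Hypothetical predicates approximating the model validations.

# File size limit: an integer strictly greater than 0 (unlimited is not supported).
def valid_file_size_limit_kb?(value)
  value.is_a?(Integer) && value > 0
end

# Field length limit: an integer greater than or equal to 0 (0 means no limit).
def valid_field_length_limit?(value)
  value.is_a?(Integer) && value >= 0
end

[10, 0, nil, 1.1, -1].each do |value|
  puts format('%-4s file_size=%-5s field_length=%s',
              value.inspect,
              valid_file_size_limit_kb?(value),
              valid_field_length_limit?(value))
end
```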
...@@ -102,6 +106,7 @@ module EE ...@@ -102,6 +106,7 @@ module EE
elasticsearch_aws_region: ENV['ELASTIC_REGION'] || 'us-east-1', elasticsearch_aws_region: ENV['ELASTIC_REGION'] || 'us-east-1',
elasticsearch_replicas: 1, elasticsearch_replicas: 1,
elasticsearch_shards: 5, elasticsearch_shards: 5,
elasticsearch_indexed_file_size_limit_kb: 1024, # 1 MiB (units in KiB)
elasticsearch_indexed_field_length_limit: 0, elasticsearch_indexed_field_length_limit: 0,
elasticsearch_max_bulk_size_bytes: 10.megabytes, elasticsearch_max_bulk_size_bytes: 10.megabytes,
elasticsearch_max_bulk_concurrency: 10, elasticsearch_max_bulk_concurrency: 10,
......
...@@ -68,6 +68,12 @@ ...@@ -68,6 +68,12 @@
= _('How many replicas each Elasticsearch shard has.') = _('How many replicas each Elasticsearch shard has.')
= recreate_index_text = recreate_index_text
.form-group
= f.label :elasticsearch_indexed_file_size_limit_kb, _('Maximum file size indexed (KiB)'), class: 'label-bold'
= f.number_field :elasticsearch_indexed_file_size_limit_kb, value: @application_setting.elasticsearch_indexed_file_size_limit_kb, class: 'form-control'
.form-text.text-muted
= _('Any files larger than this limit will not be indexed, and thus will not be searchable.')
.form-group .form-group
= f.label :elasticsearch_indexed_field_length_limit, _('Maximum field length'), class: 'label-bold' = f.label :elasticsearch_indexed_field_length_limit, _('Maximum field length'), class: 'label-bold'
= f.number_field :elasticsearch_indexed_field_length_limit, value: @application_setting.elasticsearch_indexed_field_length_limit, class: 'form-control' = f.number_field :elasticsearch_indexed_field_length_limit, value: @application_setting.elasticsearch_indexed_field_length_limit, class: 'form-control'
......
---
title: Add setting for max indexed file size in Elasticsearch
merge_request: 36925
author:
type: added
...@@ -17,7 +17,13 @@ module Elastic ...@@ -17,7 +17,13 @@ module Elastic
number_of_shards: Elastic::AsJSON.new { Gitlab::CurrentSettings.elasticsearch_shards }, number_of_shards: Elastic::AsJSON.new { Gitlab::CurrentSettings.elasticsearch_shards },
number_of_replicas: Elastic::AsJSON.new { Gitlab::CurrentSettings.elasticsearch_replicas }, number_of_replicas: Elastic::AsJSON.new { Gitlab::CurrentSettings.elasticsearch_replicas },
highlight: { highlight: {
max_analyzed_offset: 1.megabyte
# `highlight.max_analyzed_offset` is technically not measured in
# bytes, but rather in characters. Since this is an upper bound on
# the number of characters that can be highlighted before
# Elasticsearch will error, it is fine to use the number of bytes as
# the upper limit, since you cannot fit more characters than bytes
# in a file.
max_analyzed_offset: Elastic::AsJSON.new { Gitlab::CurrentSettings.elasticsearch_indexed_file_size_limit_kb.kilobytes }
}, },
codec: 'best_compression', codec: 'best_compression',
analysis: { analysis: {
......
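The reasoning in the comment above (a file cannot hold more characters than bytes) can be checked quickly for UTF-8 text, where every character takes between one and four bytes. This is only a small demonstration with arbitrary sample strings, not part of the commit.

```ruby
# Each UTF-8 character occupies 1 to 4 bytes, so length <= bytesize always holds.
samples = ['plain ascii', 'héllo', '日本語', '🙂']

samples.each do |text|
  puts format('%-12s chars=%-3d bytes=%d', text, text.length, text.bytesize)
end

puts samples.all? { |text| text.length <= text.bytesize }
```

So a byte-based limit used as `max_analyzed_offset` is never smaller than the character count of any file that passed the size check.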
...@@ -101,7 +101,7 @@ module Gitlab ...@@ -101,7 +101,7 @@ module Gitlab
vars = { vars = {
'RAILS_ENV' => Rails.env, 'RAILS_ENV' => Rails.env,
'ELASTIC_CONNECTION_INFO' => elasticsearch_config(target), 'ELASTIC_CONNECTION_INFO' => elasticsearch_config(target),
'GITALY_CONNECTION_INFO' => gitaly_connection_info, 'GITALY_CONNECTION_INFO' => gitaly_config,
'FROM_SHA' => from_sha, 'FROM_SHA' => from_sha,
'TO_SHA' => to_sha, 'TO_SHA' => to_sha,
'CORRELATION_ID' => Labkit::Correlation::CorrelationId.current_id, 'CORRELATION_ID' => Labkit::Correlation::CorrelationId.current_id,
...@@ -168,9 +168,10 @@ module Gitlab ...@@ -168,9 +168,10 @@ module Gitlab
).to_json ).to_json
end end
def gitaly_connection_info def gitaly_config
{ {
storage: project.repository_storage storage: project.repository_storage,
limit_file_size: Gitlab::CurrentSettings.elasticsearch_indexed_file_size_limit_kb.kilobytes
}.merge(Gitlab::GitalyClient.connection_data(project.repository_storage)).to_json }.merge(Gitlab::GitalyClient.connection_data(project.repository_storage)).to_json
end end
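To make the shape of the `GITALY_CONNECTION_INFO` payload concrete, here is a standalone sketch that builds a similar JSON document without Rails. The `gitaly_config_json` helper and the sample `address` and `token` values are assumptions for illustration; in the commit the connection data comes from `Gitlab::GitalyClient.connection_data`, and `.kilobytes` is an ActiveSupport extension, replaced here by a multiplication.

```ruby
require 'json'

# Hypothetical stand-in for the gitaly_config method above.
def gitaly_config_json(storage:, limit_kb:, connection_data:)
  {
    storage: storage,
    limit_file_size: limit_kb * 1024 # the setting is in KiB, the indexer gets bytes
  }.merge(connection_data).to_json
end

json = gitaly_config_json(
  storage: 'default',
  limit_kb: 1024,
  connection_data: { 'address' => 'unix:/tmp/gitaly.socket', 'token' => 'secret' } # sample values
)

puts json
puts JSON.parse(json)['limit_file_size']
```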
......
...@@ -65,6 +65,7 @@ RSpec.describe 'Admin updates EE-only settings' do ...@@ -65,6 +65,7 @@ RSpec.describe 'Admin updates EE-only settings' do
check 'Search with Elasticsearch enabled' check 'Search with Elasticsearch enabled'
fill_in 'Number of Elasticsearch shards', with: '120' fill_in 'Number of Elasticsearch shards', with: '120'
fill_in 'Number of Elasticsearch replicas', with: '2' fill_in 'Number of Elasticsearch replicas', with: '2'
fill_in 'Maximum file size indexed (KiB)', with: '5000'
fill_in 'Maximum field length', with: '100000' fill_in 'Maximum field length', with: '100000'
fill_in 'Maximum bulk request size (MiB)', with: '17' fill_in 'Maximum bulk request size (MiB)', with: '17'
fill_in 'Bulk request concurrency', with: '23' fill_in 'Bulk request concurrency', with: '23'
...@@ -77,6 +78,7 @@ RSpec.describe 'Admin updates EE-only settings' do ...@@ -77,6 +78,7 @@ RSpec.describe 'Admin updates EE-only settings' do
expect(current_settings.elasticsearch_search).to be_truthy expect(current_settings.elasticsearch_search).to be_truthy
expect(current_settings.elasticsearch_shards).to eq(120) expect(current_settings.elasticsearch_shards).to eq(120)
expect(current_settings.elasticsearch_replicas).to eq(2) expect(current_settings.elasticsearch_replicas).to eq(2)
expect(current_settings.elasticsearch_indexed_file_size_limit_kb).to eq(5000)
expect(current_settings.elasticsearch_indexed_field_length_limit).to eq(100000) expect(current_settings.elasticsearch_indexed_field_length_limit).to eq(100000)
expect(current_settings.elasticsearch_max_bulk_size_mb).to eq(17) expect(current_settings.elasticsearch_max_bulk_size_mb).to eq(17)
expect(current_settings.elasticsearch_max_bulk_concurrency).to eq(23) expect(current_settings.elasticsearch_max_bulk_concurrency).to eq(23)
......
...@@ -10,6 +10,8 @@ RSpec.describe Gitlab::Elastic::Indexer do ...@@ -10,6 +10,8 @@ RSpec.describe Gitlab::Elastic::Indexer do
end end
let(:project) { create(:project, :repository) } let(:project) { create(:project, :repository) }
let(:user) { project.owner }
let(:expected_from_sha) { Gitlab::Git::EMPTY_TREE_ID } let(:expected_from_sha) { Gitlab::Git::EMPTY_TREE_ID }
let(:to_commit) { project.commit } let(:to_commit) { project.commit }
let(:to_sha) { to_commit.try(:sha) } let(:to_sha) { to_commit.try(:sha) }
...@@ -83,7 +85,8 @@ RSpec.describe Gitlab::Elastic::Indexer do ...@@ -83,7 +85,8 @@ RSpec.describe Gitlab::Elastic::Indexer do
it 'runs the indexing command' do it 'runs the indexing command' do
gitaly_connection_data = { gitaly_connection_data = {
storage: project.repository_storage storage: project.repository_storage,
limit_file_size: Gitlab::CurrentSettings.elasticsearch_indexed_file_size_limit_kb.kilobytes
}.merge(Gitlab::GitalyClient.connection_data(project.repository_storage)) }.merge(Gitlab::GitalyClient.connection_data(project.repository_storage))
expect_popen.with( expect_popen.with(
...@@ -131,27 +134,12 @@ RSpec.describe Gitlab::Elastic::Indexer do ...@@ -131,27 +134,12 @@ RSpec.describe Gitlab::Elastic::Indexer do
it_behaves_like 'index up to the specified commit' it_behaves_like 'index up to the specified commit'
context 'after reverting a change' do context 'after reverting a change' do
let(:user) { project.owner }
let!(:initial_commit) { project.repository.commit('master').sha } let!(:initial_commit) { project.repository.commit('master').sha }
def change_repository_and_index(project, &blk) def change_repository_and_index(project, &blk)
yield blk if blk yield blk if blk
current_commit = project.repository.commit('master').sha index_repository(project)
described_class.new(project).run(current_commit)
ensure_elasticsearch_index!
end
def indexed_file_paths_for(term)
blobs = Repository.elastic_search(
term,
type: 'blob'
)[:blobs][:results].response
blobs.map do |blob|
blob['_source']['blob']['path']
end
end end
def indexed_commits_for(term) def indexed_commits_for(term)
...@@ -242,8 +230,6 @@ RSpec.describe Gitlab::Elastic::Indexer do ...@@ -242,8 +230,6 @@ RSpec.describe Gitlab::Elastic::Indexer do
end end
context 'when IndexStatus#last_wiki_commit is no longer in repository' do context 'when IndexStatus#last_wiki_commit is no longer in repository' do
let(:user) { project.owner }
def change_wiki_and_index(project, &blk) def change_wiki_and_index(project, &blk)
yield blk if blk yield blk if blk
...@@ -368,6 +354,25 @@ RSpec.describe Gitlab::Elastic::Indexer do ...@@ -368,6 +354,25 @@ RSpec.describe Gitlab::Elastic::Indexer do
end end
end end
context 'when a file is larger than elasticsearch_indexed_file_size_limit_kb', :elastic do
let(:project) { create(:project, :repository) }
before do
stub_ee_application_setting(elasticsearch_indexed_file_size_limit_kb: 1) # 1 KiB limit
project.repository.create_file(user, 'small_file.txt', 'Small file contents', message: 'small_file.txt', branch_name: 'master')
project.repository.create_file(user, 'large_file.txt', 'Large file' * 1000, message: 'large_file.txt', branch_name: 'master')
index_repository(project)
end
it 'does not index that file' do
files = indexed_file_paths_for('file')
expect(files).to include('small_file.txt')
expect(files).not_to include('large_file.txt')
end
end
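The fixture sizes in this spec can be checked by hand: with a 1 KiB limit (1024 bytes), `'Small file contents'` is 19 bytes and `'Large file' * 1000` is 10,000 bytes. The sketch below only reproduces that arithmetic in plain Ruby and does not touch Elasticsearch; treating "larger than the limit" as a strict comparison is an assumption.

```ruby
limit_bytes = 1 * 1024 # elasticsearch_indexed_file_size_limit_kb: 1

small = 'Small file contents'
large = 'Large file' * 1000

{ 'small_file.txt' => small, 'large_file.txt' => large }.each do |path, content|
  verdict = content.bytesize > limit_bytes ? 'skipped' : 'indexed'
  puts "#{path}: #{content.bytesize} bytes -> #{verdict}"
end
# prints:
# small_file.txt: 19 bytes -> indexed
# large_file.txt: 10000 bytes -> skipped
```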
def expect_popen def expect_popen
expect(Gitlab::Popen).to receive(:popen) expect(Gitlab::Popen).to receive(:popen)
end end
...@@ -392,4 +397,22 @@ RSpec.describe Gitlab::Elastic::Indexer do ...@@ -392,4 +397,22 @@ RSpec.describe Gitlab::Elastic::Indexer do
Gitlab::Git::BLANK_SHA, Gitlab::Git::BLANK_SHA,
project.repository.__elasticsearch__.elastic_writing_targets.first) project.repository.__elasticsearch__.elastic_writing_targets.first)
end end
def indexed_file_paths_for(term)
blobs = Repository.elastic_search(
term,
type: 'blob'
)[:blobs][:results].response
blobs.map do |blob|
blob['_source']['blob']['path']
end
end
def index_repository(project)
current_commit = project.repository.commit('master').sha
described_class.new(project).run(current_commit)
ensure_elasticsearch_index!
end
end end
...@@ -41,6 +41,12 @@ RSpec.describe ApplicationSetting do ...@@ -41,6 +41,12 @@ RSpec.describe ApplicationSetting do
it { is_expected.not_to allow_value(1.1).for(:elasticsearch_replicas) } it { is_expected.not_to allow_value(1.1).for(:elasticsearch_replicas) }
it { is_expected.not_to allow_value(-1).for(:elasticsearch_replicas) } it { is_expected.not_to allow_value(-1).for(:elasticsearch_replicas) }
it { is_expected.to allow_value(10).for(:elasticsearch_indexed_file_size_limit_kb) }
it { is_expected.not_to allow_value(0).for(:elasticsearch_indexed_file_size_limit_kb) }
it { is_expected.not_to allow_value(nil).for(:elasticsearch_indexed_file_size_limit_kb) }
it { is_expected.not_to allow_value(1.1).for(:elasticsearch_indexed_file_size_limit_kb) }
it { is_expected.not_to allow_value(-1).for(:elasticsearch_indexed_file_size_limit_kb) }
it { is_expected.to allow_value(10).for(:elasticsearch_indexed_field_length_limit) } it { is_expected.to allow_value(10).for(:elasticsearch_indexed_field_length_limit) }
it { is_expected.to allow_value(0).for(:elasticsearch_indexed_field_length_limit) } it { is_expected.to allow_value(0).for(:elasticsearch_indexed_field_length_limit) }
it { is_expected.not_to allow_value(nil).for(:elasticsearch_indexed_field_length_limit) } it { is_expected.not_to allow_value(nil).for(:elasticsearch_indexed_field_length_limit) }
......
...@@ -2847,6 +2847,9 @@ msgstr "" ...@@ -2847,6 +2847,9 @@ msgstr ""
msgid "Any encrypted tokens" msgid "Any encrypted tokens"
msgstr "" msgstr ""
msgid "Any files larger than this limit will not be indexed, and thus will not be searchable."
msgstr ""
msgid "Any label" msgid "Any label"
msgstr "" msgstr ""
...@@ -14558,6 +14561,9 @@ msgstr "" ...@@ -14558,6 +14561,9 @@ msgstr ""
msgid "Maximum field length" msgid "Maximum field length"
msgstr "" msgstr ""
msgid "Maximum file size indexed (KiB)"
msgstr ""
msgid "Maximum file size is 2MB. Please select a smaller file." msgid "Maximum file size is 2MB. Please select a smaller file."
msgstr "" msgstr ""
......