Commit f24f5fb5 authored by Dylan Griffith's avatar Dylan Griffith

Fix the scroll position in code search results

The previous algorithm was a heuristic approach which looked for a term
inside a highlighted fragment. The term in the fragment was then used
again to find the matching line in the actual content. This ocassionally
gave incorrect results where whatever happened to appear to be
highlighted was also found earlier in the document written in a
different way which was not intended to be a match.

The [Elasticsearch
highlighter](
https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html
)
supports setting `number_of_fragments` to `0` which means that the
highlighted result will actual be the entire content itself rather than
small fragments of content. This makes it much easier to figure out the
line number since we can just loop through this to begin with and stop
as soon as we find the opening highlight tag.

This does come at the cost that all docs are returned in the highlight
section now which makes the Elasticsearch response payload approximately
twice as large but it is the only correct way to do it that I could
find.

An alternative described in [the docs](
https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html
)
is to use `boundary_scanner` but this requires the [Fast vector
highlighter](
https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html#fast-vector-highlighter
)
which in term requires us to set `offsets` for our
[`index_options`](
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-options.html
)
but this will use considerably more storage so we would like to avoid
this if possible.

It's worth noting that this won't fix the fact that highlighting is not
behaving properly in the quoted examples. For that I've created
https://gitlab.com/gitlab-org/gitlab/-/issues/254941 which is explaining
a very similar problem to this.
parent 75947d15
---
title: Fix the scroll position in code search results
merge_request: 43083
author:
type: fixed
......@@ -159,7 +159,7 @@ module Elastic
query_hash[:highlight] = {
pre_tags: ["gitlabelasticsearch→"],
post_tags: ["←gitlabelasticsearch"],
order: "score",
number_of_fragments: 0, # highlighted text fragments do not work well for code as we want to show a few whole lines of code. We need to get the whole content to determine the exact line number that was highlighted.
fields: {
"blob.content" => {},
"blob.file_name" => {}
......
......@@ -123,16 +123,12 @@ module Gitlab
project_id = result['_source']['project_id'].to_i
total_lines = content.lines.size
term =
if result['highlight']
highlighted = result['highlight']['blob.content']
highlighted && highlighted[0].match(/gitlabelasticsearch→(.*?)←gitlabelasticsearch/)[1]
end
highlight_content = result.dig('highlight', 'blob.content')&.first || ''
found_line_number = 0
content.each_line.each_with_index do |line, index|
if term && line.include?(term)
highlight_content.each_line.each_with_index do |line, index|
if line.include?('gitlabelasticsearch→')
found_line_number = index
break
end
......
......@@ -58,11 +58,12 @@ RSpec.describe Gitlab::Elastic::SearchResults, :elastic, :sidekiq_might_not_need
describe 'parse_search_result' do
let(:project) { double(:project) }
let(:content) { "foo\nbar\nbaz\n" }
let(:blob) do
{
'blob' => {
'commit_sha' => 'sha',
'content' => "foo\nbar\nbaz\n",
'content' => content,
'path' => 'path/file.ext'
}
}
......@@ -100,6 +101,60 @@ RSpec.describe Gitlab::Elastic::SearchResults, :elastic, :sidekiq_might_not_need
data: "bar\n"
)
end
context 'when the highlighting finds the same terms multiple times' do
let(:content) do
<<~END
bar
bar
foo
bar # this is the highlighted bar
baz
boo
bar
END
end
it 'does not mistake a line that happens to include the same term that was highlighted on a later line' do
highlighted_content = <<~END
bar
bar
foo
gitlabelasticsearch→bar←gitlabelasticsearch # this is the highlighted bar
baz
boo
bar
END
result = {
'_source' => blob,
'highlight' => {
'blob.content' => [highlighted_content]
}
}
parsed = described_class.parse_search_result(result, project)
expected_data = <<~END
bar
foo
bar # this is the highlighted bar
baz
boo
END
expect(parsed).to be_kind_of(::Gitlab::Search::FoundBlob)
expect(parsed).to have_attributes(
id: nil,
path: 'path/file.ext',
basename: 'path/file',
ref: 'sha',
startline: 2,
project: project,
data: expected_data
)
end
end
end
describe 'issues' do
......@@ -567,7 +622,7 @@ RSpec.describe Gitlab::Elastic::SearchResults, :elastic, :sidekiq_might_not_need
results = described_class.new(user, 'defau*', limit_project_ids)
blobs = results.objects('blobs')
expect(blobs.first.data).to include('default')
expect(blobs.first.data).to match(/default/i)
expect(results.blobs_count).to eq 3
end
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment