checks: Convert FileSizeCheck push rule to allow for batching

The FileSizeCheck push rule loads all new blobs for a given change and checks whether any of these blobs is bigger than a certain threshold. This push rule is quite expensive: loading new blobs requires a revwalk of all preexisting references and thus scales with the number of refs in a repository. This may easily take a few dozen seconds to compute, and repeating this computation for each change doesn't help either. Refactor the FileSizeCheck to be implemented atop the BaseBulkChecker such that we can fix this shortcoming in the future: instead of loading blobs for each change, we may load them once for all changes. While this would already amortize the costs because we have to perform the walk of negative refs once only, we can eventually tweak this even further and use the quarantine directory to enumerate all pushed objects directly without doing a revwalk at all. This optimization will bring down the time to load blobs from dozens of seconds to a few milliseconds. Note that this commit doesn't yet change behaviour, but instead only prepares the infrastructure to allow for batch loading. Batch loading will require some additional changes to our infrastructure first, which will be done in a separate patch series.

checks: Convert FileSizeCheck push rule to allow for batching
The FileSizeCheck push rule loads all new blobs for a given change and checks whether any of these blobs is bigger than a certain threshold. This push rule is quite expensive: loading new blobs requires a revwalk of all preexisting references and thus scales with the number of refs in a repository. This may easily take a few dozen seconds to compute, and repeating this computation for each change doesn't help either. Refactor the FileSizeCheck to be implemented atop the BaseBulkChecker such that we can fix this shortcoming in the future: instead of loading blobs for each change, we may load them once for all changes. While this would already amortize the costs because we have to perform the walk of negative refs once only, we can eventually tweak this even further and use the quarantine directory to enumerate all pushed objects directly without doing a revwalk at all. This optimization will bring down the time to load blobs from dozens of seconds to a few milliseconds. Note that this commit doesn't yet change behaviour, but instead only prepares the infrastructure to allow for batch loading. Batch loading will require some additional changes to our infrastructure first, which will be done in a separate patch series.
31fba5e0 · Patrick Steinhardt · 885ae6e9 · 31fba5e0 · 31fba5e0 · 31fba5e0
Commit 31fba5e0 authored Aug 24, 2021 by Patrick Steinhardt
3 changed files
--- a/ee/lib/ee/gitlab/checks/push_rule_check.rb
+++ b/ee/lib/ee/gitlab/checks/push_rule_check.rb
@@ -33,9 +33,7 @@ module EE
        # @return [Nil] returns nil unless an error is raised
        # @raise [Gitlab::GitAccess::ForbiddenError] if check fails
        def check_file_size!
-          changes_access.single_change_accesses.each do |single_change_access|
-            PushRules::FileSizeCheck.new(single_change_access).validate!
-          end
+          PushRules::FileSizeCheck.new(changes_access).validate!

          nil
        end

--- a/ee/lib/ee/gitlab/checks/push_rules/file_size_check.rb
+++ b/ee/lib/ee/gitlab/checks/push_rules/file_size_check.rb
@@ -4,7 +4,7 @@ module EE
  module Gitlab
    module Checks
      module PushRules
-        class FileSizeCheck < ::Gitlab::Checks::BaseSingleChecker
+        class FileSizeCheck < ::Gitlab::Checks::BaseBulkChecker
          LOG_MESSAGE = "Checking if any files are larger than the allowed size..."

          def validate!
@@ -12,14 +12,21 @@ module EE

            logger.log_timed(LOG_MESSAGE) do
              max_file_size = push_rule.max_file_size
-              blobs = project.repository.new_blobs(newrev, dynamic_timeout: logger.time_left)

-              large_blob = blobs.find do |blob|
-                ::Gitlab::Utils.bytes_to_megabytes(blob.size) > max_file_size
-              end
+              changes_access.changes.each do |change|
+                newrev = change[:newrev]
+
+                next if newrev.blank? || ::Gitlab::Git.blank_ref?(newrev)
+
+                blobs = project.repository.new_blobs(newrev, dynamic_timeout: logger.time_left)
+
+                large_blob = blobs.find do |blob|
+                  ::Gitlab::Utils.bytes_to_megabytes(blob.size) > max_file_size
+                end

-              if large_blob
-                raise ::Gitlab::GitAccess::ForbiddenError, %Q{File "#{large_blob.path}" is larger than the allowed size of #{max_file_size} MB}
+                if large_blob
+                  raise ::Gitlab::GitAccess::ForbiddenError, %Q{File "#{large_blob.path}" is larger than the allowed size of #{max_file_size} MB}
+                end
              end
            end
          end

--- a/ee/spec/lib/ee/gitlab/checks/push_rules/file_size_check_spec.rb
+++ b/ee/spec/lib/ee/gitlab/checks/push_rules/file_size_check_spec.rb
@@ -5,6 +5,29 @@ require 'spec_helper'
 RSpec.describe EE::Gitlab::Checks::PushRules::FileSizeCheck do
  include_context 'push rules checks context'

+  let(:changes) do
+    [
+      # Update of existing branch
+      { oldrev: oldrev, newrev: newrev, ref: ref },
+      # Creation of new branch
+      { newrev: newrev, ref: 'refs/heads/something' },
+      # Deletion of branch
+      { oldrev: oldrev, ref: 'refs/heads/deleteme' }
+    ]
+  end
+
+  let(:changes_access) do
+    Gitlab::Checks::ChangesAccess.new(
+      changes,
+      project: project,
+      user_access: user_access,
+      protocol: protocol,
+      logger: logger
+    )
+  end
+
+  subject { described_class.new(changes_access) }
+
  describe '#validate!' do
    let(:push_rule) { create(:push_rule, max_file_size: 1) }
    # SHA of the 2-mb-file branch