Respect sidekiq timeout when hard-killing workers

As discovered in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10930, the 5-second timeout can be too short: during normal shutdowns, getppid returns "1" sooner than expected. Even in a "real" failure case where the sidekiq-cluster process is terminated hard, we still need to respect the sidekiq timeout so that sidekiq can wait for running jobs to complete (or terminate them and push them back into the queue) before being killed off. Otherwise we end up with orphaned jobs that are only picked up by the reliable fetcher cleanup, up to an hour later.
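
The ordering described above (TERM first, wait for the configured sidekiq timeout plus a small grace period, then KILL) can be sketched roughly as follows. This is only an illustration, not the initializer itself: `hard_kill_delay` and `shutdown_orphaned_worker` are hypothetical names, the signal and wait steps are injected as callables so that nothing is really signalled, and the timeout of 25 seconds is an example value rather than something taken from this commit.

```ruby
# Hypothetical sketch; these helpers are not part of GitLab or Sidekiq.

# Extra seconds added on top of the sidekiq timeout, mirroring the "+ 5"
# in the change.
GRACE_PERIOD = 5

# How long to wait between TERM and KILL for a given sidekiq timeout.
def hard_kill_delay(sidekiq_timeout, grace: GRACE_PERIOD)
  sidekiq_timeout + grace
end

# Ask the worker to stop, give it the full timeout plus the grace period
# to finish or requeue its jobs, and only then kill it.
# `signal` and `wait` are injected (assumption for illustration) so the
# sequence can be exercised without touching a real process.
def shutdown_orphaned_worker(sidekiq_timeout, signal:, wait:)
  signal.call(:TERM)
  wait.call(hard_kill_delay(sidekiq_timeout))
  signal.call(:KILL)
end

# Record what would happen for an example timeout of 25 seconds.
events = []
shutdown_orphaned_worker(
  25,
  signal: ->(sig) { events << [:signal, sig] },
  wait: ->(seconds) { events << [:wait, seconds] }
)
# events is now [[:signal, :TERM], [:wait, 30], [:signal, :KILL]]
```

In a real process the callables would stand in for `Process.kill` and `sleep`; the point of the sketch is only that the wait scales with the configured timeout instead of being a fixed 5 seconds.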

d31730c3 · Craig Miskell · dbbdcf45 · d31730c3
Commit d31730c3 authored Jul 30, 2020 by Craig Miskell

Showing with 4 additions and 4 deletions

config/initializers/sidekiq_cluster.rb +4 -4

--- a/config/initializers/sidekiq_cluster.rb
+++ b/config/initializers/sidekiq_cluster.rb
@@ -14,10 +14,10 @@ if ENV['ENABLE_SIDEKIQ_CLUSTER']
      if Process.ppid != parent
        Process.kill(:TERM, Process.pid)

-        # Wait for just a few extra seconds for a final attempt to
-        # gracefully terminate. Considering the parent (cluster) process
-        # have changed (SIGKILL'd), it shouldn't take long to shutdown.
-        sleep(5)
+        # Allow sidekiq to cleanly terminate and push any running jobs back
+        # into the queue.  We use the configured timeout and add a small
+        # grace period
+        sleep(Sidekiq.options[:timeout] + 5)

        # Signaling the Sidekiq Pgroup as KILL is not forwarded to
        # a possible child process. In Sidekiq Cluster, all child Sidekiq