Trigger stackprof by sending a SIGUSR2 signal

64c39f21 · Igor · Stan Hu · 8ed529ae · 64c39f21 · 64c39f21
Commit 64c39f21 authored Jul 14, 2020 by Igor Committed by Stan Hu Jul 14, 2020
3 changed files
--- a/changelogs/unreleased/stackprof.yml
+++ b/changelogs/unreleased/stackprof.yml
+---
+title: Trigger stackprof by sending a SIGUSR2 signal
+merge_request: 35993
+author:
+type: performance
--- a/config/initializers/stackprof.rb
+++ b/config/initializers/stackprof.rb
+# frozen_string_literal: true
+# trigger stackprof by sending a SIGUSR2 signal
+#
+# default settings:
+# * collect raw samples
+# * sample at 100hz (every 10k microseconds)
+# * timeout profile after 30 seconds
+# * write to $TMPDIR/stackprof.$PID.$RAND.profile
+if Gitlab::Utils.to_boolean(ENV['STACKPROF_ENABLED'].to_s)
+  Gitlab::Cluster::LifecycleEvents.on_worker_start do
+    require 'stackprof'
+    require 'tmpdir'
+    Gitlab::AppJsonLogger.info "stackprof: listening on SIGUSR2 signal"
+    # create a pipe in order to propagate signal out of the signal handler
+    # see also: https://cr.yp.to/docs/selfpipe.html
+    read, write = IO.pipe
+    # create a separate thread that polls for signals on the pipe.
+    #
+    # this way we do not execute in signal handler context, which
+    # lifts restrictions and also serializes the calls in a thread-safe
+    # manner.
+    #
+    # it's very similar to a goroutine and channel design.
+    #
+    # another nice benefit of this method is that we can timeout the
+    # IO.select call, allowing the profile to automatically stop after
+    # a given interval (by default 30 seconds), avoiding unbounded memory
+    # growth from a profile that was started and never stopped.
+    t = Thread.new do
+      timeout_s = ENV['STACKPROF_TIMEOUT_S']&.to_i || 30
+      current_timeout_s = nil
+      loop do
+        got_value = IO.select([read], nil, nil, current_timeout_s)
+        read.getbyte if got_value
+        if StackProf.running?
+          stackprof_file_prefix = ENV['STACKPROF_FILE_PREFIX'] || Dir.tmpdir
+          stackprof_out_file = "#{stackprof_file_prefix}/stackprof.#{Process.pid}.#{SecureRandom.hex(6)}.profile"
+          Gitlab::AppJsonLogger.info(
+            event: "stackprof",
+            message: "stopping profile",
+            output_filename: stackprof_out_file,
+            pid: Process.pid,
+            timeout_s: timeout_s,
+            timed_out: got_value.nil?
+          )
+          StackProf.stop
+          StackProf.results(stackprof_out_file)
+          current_timeout_s = nil
+        else
+          Gitlab::AppJsonLogger.info(
+            event: "stackprof",
+            message: "starting profile",
+            pid: Process.pid
+          )
+          StackProf.start(
+            raw: Gitlab::Utils.to_boolean(ENV['STACKPROF_RAW'] || 'true'),
+            interval: ENV['STACKPROF_INTERVAL_US']&.to_i || 10_000
+          )
+          current_timeout_s = timeout_s
+        end
+      end
+    end
+    t.abort_on_exception = true
+    # in the case of puma, this will override the existing SIGUSR2 signal handler
+    # that can be used to trigger a restart.
+    #
+    # puma cluster has two types of restarts:
+    # * SIGUSR1: phased restart
+    # * SIGUSR2: restart
+    #
+    # phased restart is not supported in our configuration, because we use
+    # preload_app. this means we will always perform a normal restart.
+    # additionally, phased restart is not supported when sending a SIGUSR2
+    # directly to a puma worker (as opposed to the master process).
+    #
+    # the result is that the behaviour of SIGUSR1 and SIGUSR2 is identical in
+    # our configuration, and we can always use a SIGUSR1 to perform a restart.
+    #
+    # thus, it is acceptable for us to re-appropriate the SIGUSR2 signal, and
+    # override the puma behaviour.
+    #
+    # see also:
+    # * https://github.com/puma/puma/blob/master/docs/signals.md#puma-signals
+    # * https://github.com/phusion/unicorn/blob/master/SIGNALS
+    # * https://github.com/mperham/sidekiq/wiki/Signals
+    Signal.trap('SIGUSR2') do
+      write.write('.')
+    end
+  end
+end
--- a/doc/development/performance.md
+++ b/doc/development/performance.md
@@ -36,7 +36,6 @@ graphs/dashboards.
 GitLab provides built-in tools to help improve performance and availability:
 - [Profiling](profiling.md).
-  - [Sherlock](profiling.md#sherlock).
 - [Distributed Tracing](distributed_tracing.md)
 - [GitLab Performance Monitoring](../administration/monitoring/performance/index.md).
 - [Request Profiling](../administration/monitoring/performance/request_profiling.md).
@@ -108,16 +107,24 @@ In short:
 ## Profiling
 By collecting snapshots of process state at regular intervals, profiling allows
-you to see where time is spent in a process. The [StackProf](https://github.com/tmm1/stackprof)
+you to see where time is spent in a process. The
-gem is included in GitLab's development environment, allowing you to investigate
+[Stackprof](https://github.com/tmm1/stackprof) gem is included in GitLab,
-the behavior of suspect code in detail.
+allowing you to profile which code is running on CPU in detail.
-It's important to note that profiling an application *alters its performance*,
+It's important to note that profiling an application *alters its performance*.
-and will generally be done *in an unrepresentative environment*. In particular,
+Different profiling strategies have different overheads. Stackprof is a sampling
-a method is not necessarily troublesome just because it's executed many times,
+profiler. It will sample stack traces from running threads at a configurable
-or takes a long time to execute. Profiles are tools you can use to better
+frequency (e.g. 100hz, that is 100 stacks per second). This type of profiling
-understand what is happening in an application - using that information wisely
+has quite a low (albeit non-zero) overhead and is generally considered to be
-is up to you!
+safe for production.
+### Development
+A profiler can be a very useful tool during development, even if it does run *in
+an unrepresentative environment*. In particular, a method is not necessarily
+troublesome just because it's executed many times, or takes a long time to
+execute. Profiles are tools you can use to better understand what is happening
+in an application - using that information wisely is up to you!
 Keeping that in mind, to create a profile, identify (or create) a spec that
 exercises the troublesome code path, then run it using the `bin/rspec-stackprof`
@@ -211,11 +218,56 @@ application code, these profiles can be used to investigate slow tests as well.
 However, for smaller runs (like this example), this means that the cost of
 setting up the test suite will tend to dominate.
-It's also possible to modify the application code in-place to output profiles
+### Production
-whenever a particular code path is triggered without going through the test
-suite first. See the
+Stackprof can also be used to profile production workloads.
-[StackProf documentation](https://github.com/tmm1/stackprof/blob/master/README.md)
-for details.
+In order to enable production profiling for Ruby processes, you can set the `STACKPROF_ENABLED` environment variable to `true`.
+The following configuration options can be configured:
+- `STACKPROF_ENABLED`: Enables stackprof signal handler on SIGUSR2 signal.
+  Defaults to `false`.
+- `STACKPROF_INTERVAL_US`: Sampling interval in microseconds. Defaults to
+  `10000` μs (100hz).
+- `STACKPROF_FILE_PREFIX`: File path prefix where profiles are stored. Defaults
+  to `$TMPDIR` (often corresponds to `/tmp`).
+- `STACKPROF_TIMEOUT_S`: Profiling timeout in seconds. Profiling will
+  automatically stop after this time has elapsed. Defaults to `30`.
+- `STACKPROF_RAW`: Whether to collect raw samples or only aggregates. Raw
+  samples are needed to generate flamegraphs, but they do have a higher memory
+  and disk overhead. Defaults to `true`.
+Once enabled, profiling can be triggered by sending a `SIGUSR2` signal to the
+Ruby process. The process will begin sampling stacks. Profiling can be stopped
+by sending another `SIGUSR2`. Alternatively, it will automatically stop after
+the timeout.
+Once profiling stops, the profile is written out to disk at
+`$STACKPROF_FILE_PREFIX/stackprof.$PID.$RAND.profile`. It can then be inspected
+further via the `stackprof` command line tool, as described in the previous
+section.
+Currently supported profiling targets are:
+- Puma worker
+- Sidekiq
+NOTE: **Note:** The Puma master process is not supported. Neither is Unicorn.
+Sending SIGUSR2 to either of those will trigger restarts. In the case of Puma,
+take care to only send the signal to Puma workers.
+This can be done via `pkill -USR2 puma:`. The `:` disambiguates between `puma
+4.3.3.gitlab.2 ...` (the master process) from `puma: cluster worker 0: ...` (the
+worker processes), selecting the latter.
+Production profiles can be especially noisy. It can be helpful to visualize them
+as a [flamegraph](https://github.com/brendangregg/FlameGraph). This can be done
+via:
+```shell
+bundle exec stackprof --stackcollapse /tmp/stackprof.55769.c6c3906452.profile | flamegraph.pl > flamegraph.svg
+```
 ## RSpec profiling