Merge branch 'bvl-custom-adpex-duration-docs' into 'master'

This adds docs on how to customize the apdex target. See merge request gitlab-org/gitlab!71135

Commit edaf3c46 authored Sep 30, 2021 by Sean McGivern
4 changed files
--- a/doc/development/application_slis/index.md
+++ b/doc/development/application_slis/index.md
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# GitLab Application Service Level Indicators (SLIs)
+
+> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4
+
+It is possible to define [Service Level Indicators
+(SLIs)](https://en.wikipedia.org/wiki/Service_level_indicator)
+directly in the Ruby codebase. This keeps the definition of operations
+and their success close to the implementation and allows the people
+building features to easily define how these features should be
+monitored.
+
+Defining an SLI causes 2
+[Prometheus
+counters](https://prometheus.io/docs/concepts/metric_types/#counter)
+to be emitted from the Rails application:
+
+- `gitlab_sli:<sli_name>:total`: incremented for each operation.
+- `gitlab_sli:<sli_name>:success_total`: incremented for successful
+  operations.
+
+## Existing SLIs
+
+1. [`rails_request_apdex`](rails_request_apdex.md)
+
+## Defining a new SLI
+
+An SLI can be defined using the `Gitlab::Metrics::Sli` class.
+
+Before the first scrape, it is important to have [initialized the SLI
+with all possible
+label-combinations](https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics). This
+avoids confusing results when using these counters in calculations.
+
+To initialize an SLI, use the `.initialize_sli` class method, for
+example:
+
+```ruby
+Gitlab::Metrics::Sli.initialize_sli(:received_email, [
+  {
+    feature_category: :issue_tracking,
+    email_type: :create_issue
+  },
+  {
+    feature_category: :service_desk,
+    email_type: :service_desk
+  },
+  {
+    feature_category: :code_review,
+    email_type: :create_merge_request
+  }
+])
+```
+
+Metrics must be initialized before they get
+scraped for the first time. Preferably, this is done when the
+process that emits them starts, in which case we need to take care
+not to increase the application's boot time too much.
+
+Alternatively, if initializing would take too long, this can be done
+during the first scrape. We need to make sure we don't do it for every
+scrape. This can be done as follows:
+
+```ruby
+def initialize_request_slis_if_needed!
+  return if Gitlab::Metrics::Sli.initialized?(:rails_request_apdex)
+  Gitlab::Metrics::Sli.initialize_sli(:rails_request_apdex, possible_request_labels)
+end
+```
+
+Also make sure this happens for each of the metrics
+endpoints we have. Currently these are the
+[`WebExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/web_exporter.rb)
+and the
+[`HealthController`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/controllers/health_controller.rb)
+for Rails and
+[`SidekiqExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/sidekiq_exporter.rb)
+for Sidekiq.
+
+## Tracking operations for an SLI
+
+Tracking an operation in the newly defined SLI can be done like this:
+
+```ruby
+Gitlab::Metrics::Sli[:received_email].increment(
+  labels: {
+    feature_category: :service_desk,
+    email_type: :service_desk
+  },
+  success: issue_created?
+)
+```
+
+Calling `#increment` on this SLI will increment the total Prometheus counter:
+
+```prometheus
+gitlab_sli:received_email:total{ feature_category='service_desk', email_type='service_desk' }
+```
+
+If the `success:` argument passed is truthy, then the success counter
+will also be incremented:
+
+```prometheus
+gitlab_sli:received_email:success_total{ feature_category='service_desk', email_type='service_desk' }
+```
+
+## Using the SLI in service monitoring and alerts
+
+When the application is emitting metrics for the new SLI, those need
+to be consumed in the service catalog to result in alerts, and be
+included in the error budget for stage groups and GitLab.com's overall
+availability.
+
+This is currently being worked on in [this
+project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573). As
+part of [this
+issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1307)
+we will update the documentation.
+
+For any questions, please don't hesitate to create an issue in [the
+Scalability issue
+tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
+or come find us in
+[#g_scalability](https://gitlab.slack.com/archives/CMMF8TKR9) on Slack.
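The counter semantics described in the `index.md` changes above can be modeled in a few lines of plain Ruby. This is a minimal sketch for illustration only, not the `Gitlab::Metrics::Sli` implementation: the `SliSketch` class and its in-memory hashes are hypothetical stand-ins for the two Prometheus counters, and only the `increment(labels:, success:)` call shape is taken from the documentation.

```ruby
# Hypothetical in-memory model of an application SLI. It mimics the two
# counters described in the documentation; it does not talk to Prometheus.
class SliSketch
  attr_reader :total, :success_total

  # Every known label combination starts at zero, mirroring the advice to
  # initialize all combinations before the first scrape.
  def initialize(possible_label_combinations)
    @total = {}
    @success_total = {}
    possible_label_combinations.each do |labels|
      @total[labels] = 0
      @success_total[labels] = 0
    end
  end

  # The total counter always goes up. The success counter only goes up
  # when `success` is truthy.
  def increment(labels:, success:)
    @total[labels] = @total.fetch(labels, 0) + 1
    @success_total[labels] = @success_total.fetch(labels, 0) + (success ? 1 : 0)
  end
end

service_desk = { feature_category: :service_desk, email_type: :service_desk }
create_issue = { feature_category: :issue_tracking, email_type: :create_issue }

sli = SliSketch.new([service_desk, create_issue])
sli.increment(labels: service_desk, success: true)
sli.increment(labels: service_desk, success: false)

puts sli.total[service_desk]
puts sli.success_total[service_desk]
puts sli.total[create_issue]
```

Because every label combination is initialized up front, a combination that has not seen any operation yet still reports zero rather than being absent, which is what keeps ratio calculations over these counters well defined.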
--- a/doc/development/application_slis/rails_request_apdex.md
+++ b/doc/development/application_slis/rails_request_apdex.md
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Rails request apdex SLI
+
+> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4
+
+NOTE:
+This SLI is not yet used in [error budgets for stage
+groups](../stage_group_dashboards.md#error-budget) or service
+monitoring. This is being worked on in [this
+project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573).
+
+The request apdex SLI is [an SLI defined in the application](index.md)
+that measures the duration of successful requests as an indicator for
+application performance. This includes the REST and GraphQL API, and the
+regular controller endpoints. It consists of these counters:
+
+1. `gitlab_sli:rails_request_apdex:total`: This counter gets
+   incremented for every request that did not result in a response
+   with a 5xx status code. This means that slow failures don't get
+   counted twice: The request is already counted in the error-SLI.
+
+1. `gitlab_sli:rails_request_apdex:success_total`: This counter gets
+   incremented for every successful request that performed faster than
+   the [defined target duration](#adjusting-request-target-duration).
+
+Both these counters are labeled with:
+
+1. `endpoint_id`: The identification of the Rails Controller or the
+   Grape-API endpoint
+
+1. `feature_category`: The feature category specified for that
+   controller or API endpoint.
+
+## Request Apdex SLO
+
+These counters can be combined into a success ratio. The objective for
+this ratio is defined in the service catalog per service:
+
+1. [Web: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/web.jsonnet#L19)
+1. [API: 0.995](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/api.jsonnet#L19)
+1. [Git: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/git.jsonnet#L22)
+
+This means that for this SLI to meet SLO, the ratio recorded needs to
+be higher than those defined above.
+
+For example: for the web-service, we want at least 99.8% of requests
+to be faster than their target duration.
+
+These are the targets we use for alerting and service monitoring. So
+durations should be set taking those into account.
+
+Both successful measurements and unsuccessful ones have an impact on the
+error budget for stage groups.
+
+## Adjusting request target duration
+
+Not all endpoints perform the same type of work, so it is possible to
+define different durations for different endpoints.
+
+Long-running requests are more expensive for our
+infrastructure: while one request is being served, the thread remains
+occupied for the duration of that request. So nothing else can be handled by that
+thread. Because of Ruby's Global VM Lock, the thread might keep the
+lock and stall other requests handled by the same Puma worker
+process. The request is in fact a noisy neighbor for other requests
+handled by the worker. This is why the upper bound for a target
+duration is capped at 5 seconds.
+
+## Increasing the target duration (setting a slower target)
+
+Increasing the target duration on an existing endpoint can be done on
+a case-by-case basis. Please take the following into account:
+
+1. Apdex is about perceived performance. If a user is actively waiting
+   for the result of a request, waiting 5 seconds might not be
+   acceptable. But if the endpoint is used by an automation
+   requiring a lot of data, 5 seconds could be okay.
+
+   A product manager can help to identify how an endpoint is used.
+
+1. The workload for some endpoints can sometimes differ greatly
+   depending on the parameters specified by the caller. The target
+   duration needs to accommodate that. In some cases, it might be
+   interesting to define a separate [application
+   SLI](index.md#defining-a-new-sli) for what the endpoint is doing.
+
+   When the endpoints in certain cases turn into no-ops, making them
+   very fast, we should ignore these fast requests when setting the
+   target. For example, if the `MergeRequests::DraftsController` is
+   hit for every merge request being viewed, but doesn't need to
+   render anything in most cases, then we should pick the target that
+   would still accommodate the endpoint performing work.
+
+1. Consider the dependent resources consumed by the endpoint. If the endpoint
+   loads a lot of data from Gitaly or the database and this is causing
+   it to not perform satisfactorily, it could be better to optimize the
+   way the data is loaded rather than increasing the target duration.
+
+   In cases like this, it might be appropriate to temporarily increase
+   the duration to make the endpoint meet SLO, if this is bearable for
+   the infrastructure. In such cases, please link an issue from a code
+   comment.
+
+   If the endpoint consumes a lot of CPU time, we should also consider
+   this: these kinds of requests are the kind of noisy neighbors we
+   should try to keep as short as possible.
+
+1. Traffic characteristics should also be taken into account: if the
+   traffic to the endpoint is bursty, like CI traffic spinning up a
+   big batch of jobs hitting the same endpoint, then having these
+   endpoints take 5s is not acceptable from an infrastructure point of
+   view. We cannot scale up the fleet fast enough to accommodate
+   the incoming slow requests alongside the regular traffic.
+
+When increasing the target duration for an existing endpoint, please
+involve a [Scalability team
+member](https://about.gitlab.com/handbook/engineering/infrastructure/team/scalability/#team-members)
+in the review. We can use request rates and durations available in the
+logs to come up with a recommendation. Picking a threshold can be done
+using the same process as for [decreasing a target
+duration](#decreasing-a-target-duration-setting-a-faster-target), picking a duration that is
+higher than the SLO for the service.
+
+We shouldn't set the longest durations on endpoints in the merge
+requests that introduce them, since we don't yet have data to support
+the decision.
+
+## Decreasing a target duration (setting a faster target)
+
+When decreasing the target duration, we need to make sure the endpoint
+still meets SLO for the fleet that handles the request. You can use the
+information in the logs to determine this:
+
+1. Open [this table in
+   Kibana](https://log.gprd.gitlab.net/goto/bbb6465c68eb83642269e64a467df3df)
+
+1. The table loads information for the busiest endpoints by
+   default. You can speed things up by adding a filter for
+   `json.caller_id.keyword` and adding the identifier you're interested
+   in (for example: `Projects::RawController#show`).
+
+1. Check the [appropriate percentile duration](#request-apdex-slo) for
+   the service the endpoint is handled by. The overall duration should
+   be lower than the target you intend to set.
+
+1. If the overall duration is below the intended target, please also
+   check the peaks over time in [this
+   graph](https://log.gprd.gitlab.net/goto/9319c4a402461d204d13f3a4924a89fc)
+   in Kibana. Here, the percentile in question should not peak above
+   the target duration we want to set.
+
+Since decreasing a threshold too much could result in alerts for the
+apdex degradation, please also involve a Scalability team member in
+the merge request.
+
+## How to adjust the target duration
+
+The target duration can be specified similarly to how endpoints [get a
+feature category](../feature_categorization/index.md).
+
+For endpoints that don't have a specific target, the default of 1s
+(medium) will be used.
+
+The following configurations are available:
+
+| Name       | Duration in seconds | Notes                                         |
+|------------|---------------------|-----------------------------------------------|
+| :very_fast | 0.25s               |                                               |
+| :fast      | 0.5s                |                                               |
+| :medium    | 1s                  | This is the default when nothing is specified |
+| :slow      | 5s                  |                                               |
+
+### Rails controller
+
+A duration can be specified for all actions in a controller like this:
+
+```ruby
+class Boards::ListsController < ApplicationController
+  target_duration :fast
+end
+```
+
+To specify the duration for certain actions in a controller only, they
+can be listed like this:
+
+```ruby
+class Boards::ListsController < ApplicationController
+  target_duration :fast, [:index, :show]
+end
+```
+
+### Grape endpoints
+
+To specify the duration for an entire API class, this can be done as
+follows:
+
+```ruby
+module API
+  class Issues < ::API::Base
+    target_duration :slow
+  end
+end
+```
+
+To specify the duration for certain actions in an API class only, they
+can be listed like this:
+
+```ruby
+module API
+  class Issues < ::API::Base
+    target_duration :fast, [
+      '/groups/:id/issues',
+      '/groups/:id/issues_statistics'
+    ]
+  end
+end
+```
+
+Or, we can specify a custom duration per endpoint:
+
+```ruby
+get 'client/features', target_duration: :fast do
+  # endpoint logic
+end
+```
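To make the relationship between target durations, the two apdex counters, and the SLO concrete, here is a minimal Ruby sketch. It is an illustration under stated assumptions rather than GitLab's code: the `TARGET_DURATIONS` hash copies the table of configurations above, while `apdex_counts` and `meets_slo?` are hypothetical helpers and the sample requests are made up.

```ruby
# Target durations in seconds, copied from the table in the documentation.
TARGET_DURATIONS = {
  very_fast: 0.25,
  fast: 0.5,
  medium: 1.0,
  slow: 5.0
}.freeze

# Hypothetical helper: counts requests the way the two apdex counters are
# described. Requests with a 5xx status are left out of both counters.
# A counted request is a success when it finished faster than the target.
def apdex_counts(requests, target: :medium)
  threshold = TARGET_DURATIONS.fetch(target)
  counted = requests.reject { |request| request[:status] >= 500 }
  successes = counted.count { |request| request[:duration] < threshold }
  { total: counted.size, success_total: successes }
end

# Hypothetical helper: compares the success ratio with a service's SLO.
# Treating "no traffic" as meeting the SLO is an assumption of this sketch.
def meets_slo?(counts, slo)
  return true if counts[:total].zero?

  counts[:success_total].fdiv(counts[:total]) >= slo
end

# Made-up sample data, durations in seconds.
requests = [
  { status: 200, duration: 0.3 },
  { status: 200, duration: 0.7 },
  { status: 404, duration: 0.1 },
  { status: 500, duration: 4.0 }
]

counts = apdex_counts(requests, target: :fast)
puts counts[:total]
puts counts[:success_total]
puts meets_slo?(counts, 0.998)
```

With this sample data, three requests are counted (the 5xx response is excluded) and two of them are faster than the `:fast` target, so the ratio is far below the 0.998 objective listed for the web service. Picking `:medium` instead would count all three as successes, which is why a target should be backed by observed durations rather than chosen to make the ratio look good.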
--- a/doc/development/index.md
+++ b/doc/development/index.md
@@ -334,6 +334,7 @@ See [database guidelines](database/index.md).
 - [Features inside `.gitlab/`](features_inside_dot_gitlab.md)
 - [Dashboards for stage groups](stage_group_dashboards.md)
 - [Preventing transient bugs](transient/prevention-patterns.md)
+- [GitLab Application SLIs](application_slis/index.md)

 ## Other GitLab Development Kit (GDK) guides


--- a/doc/development/stage_group_dashboards.md
+++ b/doc/development/stage_group_dashboards.md
 ---
-stage: Enablement
-group: Infrastructure
+stage: Platforms
+group: Scalability
 info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
 ---

@@ -58,6 +58,12 @@ component can have 2 indicators:
   [Web](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/web.jsonnet#L154)
   services, that threshold is **5 seconds**.

+   We're working on making this target configurable per endpoint in [this
+   project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525). Learn
+   how to [customize the request
+   apdex](application_slis/rails_request_apdex.md). This new apdex
+   measurement is not yet part of the error budget.
+
   For Sidekiq job execution, the threshold depends on the [job
   urgency](sidekiq_style_guide.md#job-urgency). It is
   [currently](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/lib/sidekiq-helpers.libsonnet#L25-38)