This adds docs on how to customize apdex target

This adds the documentation on how to define new SLIs and how to customize the rails apdex SLI.

Commit 5006b546 authored Sep 24, 2021 by Bob Van Landuyt
4 changed files
--- a/doc/development/application_slis/index.md
+++ b/doc/development/application_slis/index.md
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# GitLab Application Service Level Indicators (SLIs)
+
+> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4
+
+It is possible to define [Service Level Indicators
+(SLIs)](https://en.wikipedia.org/wiki/Service_level_indicator)
+directly in the Ruby codebase. This keeps the definition of operations
+and their success close to the implementation and allows the people
+building features to easily define how these features should be
+monitored.
+
+Defining an SLI causes two
+[Prometheus
+counters](https://prometheus.io/docs/concepts/metric_types/#counter)
+to be emitted from the Rails application:
+
+- `gitlab_sli:<sli_name>:total`: incremented for each operation.
+- `gitlab_sli:<sli_name>:success_total`: incremented for successful
+  operations.
+
+## Existing SLIs
+
+1. [`rails_request_apdex`](rails_request_apdex.md)
+
+## Defining a new SLI
+
+An SLI can be defined using the `Gitlab::Metrics::Sli` class.
+
+Before the first scrape, it is important to have [initialized the SLI
+with all possible
+label-combinations](https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics). This
+avoids confusing results when using these counters in calculations.
+
+To initialize an SLI, use the `.initialize_sli` class method, for
+example:
+
+```ruby
+Gitlab::Metrics::Sli.initialize_sli(:received_email, [
+  {
+    feature_category: :issue_tracking,
+    email_type: :create_issue
+  },
+  {
+    feature_category: :service_desk,
+    email_type: :service_desk
+  },
+  {
+    feature_category: :code_review,
+    email_type: :create_merge_request
+  }
+])
+```
+
+Metrics must be initialized before they get
+scraped for the first time. Preferably, this is done when the
+process that emits them starts, in which case we need to take care
+not to increase the application's boot time too much.
+
+Alternatively, if initializing would take too long, this can be done
+during the first scrape. We need to make sure we don't do it for every
+scrape. This can be done as follows:
+
+```ruby
+def initialize_request_slis_if_needed!
+  return if Gitlab::Metrics::Sli.initialized?(:rails_request_apdex)
+  Gitlab::Metrics::Sli.initialize_sli(:rails_request_apdex, possible_request_labels)
+end
+```
+
+Make sure to do this for each of the metrics
+endpoints we have. Currently, these are the
+[`WebExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/web_exporter.rb)
+and the
+[`HealthController`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/controllers/health_controller.rb)
+for Rails and
+[`SidekiqExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/sidekiq_exporter.rb)
+for Sidekiq.
+
+## Tracking operations for an SLI
+
+Tracking an operation in the newly defined SLI can be done like this:
+
+```ruby
+Gitlab::Metrics::Sli[:received_email].increment(
+  labels: {
+    feature_category: :service_desk,
+    email_type: :service_desk
+  },
+  success: issue_created?
+)
+```
+
+Calling `#increment` on this SLI increments the total Prometheus counter:
+
+```prometheus
+gitlab_sli:received_email:total{ feature_category='service_desk', email_type='service_desk' }
+```
+
+If the `success:` argument passed is truthy, then the success counter
+will also be incremented:
+
+```prometheus
+gitlab_sli:received_email:success_total{ feature_category='service_desk', email_type='service_desk' }
+```
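The increment semantics described above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the `Gitlab::Metrics::Sli` implementation: `ToySli` and its two hashes are hypothetical stand-ins for the Prometheus counters.

```ruby
# Toy stand-in for an SLI (hypothetical, not Gitlab::Metrics::Sli):
# two hashes keyed by label set play the role of the two counters.
class ToySli
  attr_reader :total, :success_total

  def initialize(label_combinations)
    # Start every known label combination at zero so no series is missing.
    @total = label_combinations.to_h { |labels| [labels, 0] }
    @success_total = label_combinations.to_h { |labels| [labels, 0] }
  end

  def increment(labels:, success:)
    # The total counter always moves; the success counter only when truthy.
    @total[labels] = @total.fetch(labels, 0) + 1
    @success_total[labels] = @success_total.fetch(labels, 0) + 1 if success
  end
end

labels = { feature_category: :service_desk, email_type: :service_desk }
sli = ToySli.new([labels])
sli.increment(labels: labels, success: true)
sli.increment(labels: labels, success: false)
```

After these two calls, the toy total counter for the label set has moved twice and the success counter once, which mirrors how the two Prometheus series are expected to diverge when operations fail.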
+
+## Using the SLI in service monitoring and alerts
+
+When the application is emitting metrics for the new SLI, those need
+to be consumed in the service catalog to result in alerts, and be
+included in the error budget for stage groups and GitLab.com's overall
+availability.
+
+This is currently being worked on in [this
+project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573). As
+part of [this
+issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1307)
+we will update the documentation.
+
+For any questions, please don't hesitate to create an issue in [the
+Scalability issue
+tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
+or come find us in
+[#g_scalability](https://gitlab.slack.com/archives/CMMF8TKR9) on Slack.
--- a/doc/development/application_slis/rails_request_apdex.md
+++ b/doc/development/application_slis/rails_request_apdex.md
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Rails request apdex SLI
+
+> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4
+
+NOTE:
+This SLI is not yet used in [error budgets for stage
+groups](../stage_group_dashboards.md#error-budget) or service
+monitoring. This is being worked on in [this
+project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573).
+
+The request apdex SLI is [an SLI defined in the application](index.md)
+that measures the duration of successful requests as an indicator for
+application performance. This includes the REST and GraphQL API, and the
+regular controller endpoints. It consists of these counters:
+
+1. `gitlab_sli:rails_request_apdex:total`: This counter gets
+   incremented for every request that did not result in a response
+   with a 5xx status code. This means that slow failures don't get
+   counted twice, because the request is already counted in the error SLI.
+
+1. `gitlab_sli:rails_request_apdex:success_total`: This counter gets
+   incremented for every successful request that performed faster than
+   the [defined target duration](#adjusting-request-target-duration).
+
+Both these counters are labeled with:
+
+1. `endpoint_id`: The identification of the Rails Controller or the
+   Grape-API endpoint
+
+1. `feature_category`: The feature category specified for that
+   controller or API endpoint.
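The counter definitions above can be sketched as a small decision function. This is only an illustration: `apdex_increments` is a hypothetical helper, "successful" is assumed to mean "not a 5xx response", and "faster than the target" is assumed to mean strictly less than the target duration.

```ruby
# Hypothetical helper, not part of GitLab: decides which of the two
# apdex counters a single finished request would increment.
def apdex_increments(status:, duration_s:, target_s:)
  # 5xx responses are left out entirely; the error SLI already counts them.
  return { total: 0, success_total: 0 } if status >= 500

  # Assumption: "faster than the target" is modeled as strictly less than.
  { total: 1, success_total: duration_s < target_s ? 1 : 0 }
end

apdex_increments(status: 200, duration_s: 0.4, target_s: 1) # counted, within target
apdex_increments(status: 404, duration_s: 2.3, target_s: 1) # counted, too slow
apdex_increments(status: 503, duration_s: 9.0, target_s: 1) # not counted at all
```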
+
+## Request Apdex SLO
+
+These counters can be combined into a success ratio. The objective for
+this ratio is defined in the service catalog per service:
+
+1. [Web: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/web.jsonnet#L19)
+1. [API: 0.995](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/api.jsonnet#L19)
+1. [Git: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/git.jsonnet#L22)
+
+This means that for this SLI to meet the SLO, the recorded ratio needs to
+be at least the objective defined above for the service.
+
+For example: for the web-service, we want at least 99.8% of requests
+to be faster than their target duration.
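As a rough sketch of the arithmetic, the ratio check could look like the following. The helper names are hypothetical, the counts are made up for illustration, and real monitoring evaluates counter rates over a time window rather than raw totals.

```ruby
# Hypothetical helpers: ratio of successful operations to all operations.
def success_ratio(success_total, total)
  # Assumption: with no traffic there is nothing to violate the objective.
  return 1.0 if total.zero?

  success_total.to_f / total
end

def meets_slo?(success_total, total, objective)
  success_ratio(success_total, total) >= objective
end

WEB_OBJECTIVE = 0.998 # the Web objective from the list above

meets_slo?(9_985, 10_000, WEB_OBJECTIVE) # ratio 0.9985, meets the objective
meets_slo?(9_970, 10_000, WEB_OBJECTIVE) # ratio 0.997, misses the objective
```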
+
+These are the targets we use for alerting and service monitoring. So
+durations should be set taking those into account.
+
+Both successful measurements and unsuccessful ones have an impact on the
+error budget for stage groups.
+
+## Adjusting request target duration
+
+Not all endpoints perform the same type of work, so it is possible to
+define different durations for different endpoints.
+
+Long-running requests are more expensive for our
+infrastructure: while one request is being served, the thread remains
+occupied for the duration of that request, so nothing else can be handled by that
+thread. Because of Ruby's Global VM Lock, the thread might keep the
+lock and stall other requests handled by the same Puma worker
+process. The request is in fact a noisy neighbor for other requests
+handled by the worker. This is why the upper bound for a target
+duration is capped at 5 seconds.
+
+## Increasing the target duration (setting a slower target)
+
+Increasing the target duration on an existing endpoint can be done on
+a case-by-case basis. Please take the following into account:
+
+1. Apdex is about perceived performance. If a user is actively waiting
+   for the result of a request, waiting 5 seconds might not be
+   acceptable. If the endpoint is used by an automation
+   requiring a lot of data, 5 seconds could be okay.
+
+   A product manager can help to identify how an endpoint is used.
+
+1. The workload for some endpoints can sometimes differ greatly
+   depending on the parameters specified by the caller. The target
+   duration needs to accommodate that. In some cases, it might be
+   interesting to define a separate [application
+   SLI](index.md#defining-a-new-sli) for what the endpoint is doing.
+
+   When the endpoints in certain cases turn into no-ops, making them
+   very fast, we should ignore these fast requests when setting the
+   target. For example, if the `MergeRequests::DraftsController` is
+   hit for every merge request being viewed, but doesn't need to
+   render anything in most cases, then we should pick the target that
+   would still accommodate the endpoint performing work.
+
+1. Consider the dependent resources consumed by the endpoint. If the endpoint
+   loads a lot of data from Gitaly or the database and this is causing
+   it to not perform satisfactorily, it could be better to optimize the
+   way the data is loaded rather than increasing the target duration.
+
+   In cases like this, it might be appropriate to temporarily increase
+   the duration to make the endpoint meet SLO, if this is bearable for
+   the infrastructure. In such cases, please link an issue from a code
+   comment.
+
+   If the endpoint consumes a lot of CPU time, we should also consider
+   this: these kinds of requests are the kind of noisy neighbors we
+   should try to keep as short as possible.
+
+1. Traffic characteristics should also be taken into account: if the
+   traffic to the endpoint is bursty, like CI traffic spinning up a
+   big batch of jobs hitting the same endpoint, then having these
+   endpoints take 5s is not acceptable from an infrastructure point of
+   view. We cannot scale up the fleet fast enough to accommodate
+   the incoming slow requests alongside the regular traffic.
+
+When increasing the target duration for an existing endpoint, please
+involve a [Scalability team
+member](https://about.gitlab.com/handbook/engineering/infrastructure/team/scalability/#team-members)
+in the review. We can use request rates and durations available in the
+logs to come up with a recommendation. Picking a threshold can be done
+using the same process as for [decreasing a target
+duration](#decreasing-a-target-duration-setting-a-faster-target), picking a duration that is
+higher than the SLO for the service.
+
+We shouldn't set the longest durations on endpoints in the merge
+request that introduces them, since we don't yet have data to support
+the decision.
+
+## Decreasing a target duration (setting a faster target)
+
+When decreasing the target duration, we need to make sure the endpoint
+still meets SLO for the fleet that handles the request. You can use the
+information in the logs to determine this:
+
+1. Open [this table in
+   Kibana](https://log.gprd.gitlab.net/goto/bbb6465c68eb83642269e64a467df3df)
+
+1. The table loads information for the busiest endpoints by
+   default. You can speed things up by adding a filter for
+   `json.caller_id.keyword` and adding the identifier you're interested
+   in (for example: `Projects::RawController#show`).
+
+1. Check the [appropriate percentile duration](#request-apdex-slo) for
+   the service the endpoint is handled by. The overall duration should
+   be lower than the target you intend to set.
+
+1. If the overall duration is below the intended target, please also
+   check the peaks over time in [this
+   graph](https://log.gprd.gitlab.net/goto/9319c4a402461d204d13f3a4924a89fc)
+   in Kibana. Here, the percentile in question should not peak above
+   the target duration we want to set.
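The percentile check in the steps above can be approximated offline. The sketch below assumes a nearest-rank percentile over a plain list of request durations in seconds. The helper names and sample data are hypothetical, and the percentile Kibana reports may be computed differently.

```ruby
# Nearest-rank percentile over request durations (seconds).
# `ratio` is the SLO ratio for the service, for example 0.998 for Web.
def percentile_duration(durations, ratio)
  raise ArgumentError, 'no durations given' if durations.empty?

  sorted = durations.sort
  # Rational avoids float rounding surprises when computing the rank.
  rank = (Rational(ratio.to_s) * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# The target is only safe if the percentile duration stays below it.
def safe_target?(durations, ratio, target_s)
  percentile_duration(durations, ratio) < target_s
end

mostly_fast = Array.new(998, 0.2) + Array.new(2, 3.0)
too_many_slow = Array.new(997, 0.2) + Array.new(3, 3.0)

safe_target?(mostly_fast, 0.998, 0.5)   # percentile is 0.2s
safe_target?(too_many_slow, 0.998, 0.5) # percentile is 3.0s
```

With a 0.998 objective, three slow requests in a thousand are already enough to push the percentile above a 0.5s target, which is why the peaks over time also need checking.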
+
+Since decreasing a threshold too much could result in alerts for the
+apdex degradation, please also involve a Scalability team member in
+the merge request.
+
+## How to adjust the target duration
+
+The target duration can be specified similarly to how endpoints [get a
+feature category](../feature_categorization/index.md).
+
+For endpoints that don't have a specific target, the default of 1s
+(medium) will be used.
+
+The following configurations are available:
+
+| Name       | Duration in seconds | Notes                                         |
+|------------|---------------------|-----------------------------------------------|
+| :very_fast | 0.25s               |                                               |
+| :fast      | 0.5s                |                                               |
+| :medium    | 1s                  | This is the default when nothing is specified |
+| :slow      | 5s                  |                                               |
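For illustration, the table can be read as a lookup with a default. The constant and method names below are hypothetical and do not claim to match how GitLab stores these values internally; only the names and durations come from the table.

```ruby
# Values copied from the table above; the constant and method names
# are hypothetical.
TARGET_DURATIONS_S = { very_fast: 0.25, fast: 0.5, medium: 1, slow: 5 }.freeze

# Returns the target in seconds, falling back to :medium when no
# target was specified for the endpoint.
def target_duration_seconds(name = nil)
  TARGET_DURATIONS_S.fetch(name || :medium)
end

target_duration_seconds        # default, medium
target_duration_seconds(:slow) # the 5 second upper bound
```

Using `fetch` means an unknown name raises a `KeyError` instead of silently falling back to the default.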
+
+### Rails controller
+
+A duration can be specified for all actions in a controller like this:
+
+```ruby
+class Boards::ListsController < ApplicationController
+  target_duration :fast
+end
+```
+
+To specify the duration for only certain actions in a controller, list
+them like this:
+
+```ruby
+class Boards::ListsController < ApplicationController
+  target_duration :fast, [:index, :show]
+end
+```
+
+### Grape endpoints
+
+To specify the duration for an entire API class, this can be done as
+follows:
+
+```ruby
+module API
+  class Issues < ::API::Base
+    target_duration :slow
+  end
+end
+```
+
+To specify the duration for only certain actions in an API class, list
+them like this:
+
+```ruby
+module API
+  class Issues < ::API::Base
+    target_duration :fast, [
+      '/groups/:id/issues',
+      '/groups/:id/issues_statistics'
+    ]
+  end
+end
+```
+
+Or, we can specify a custom duration per endpoint:
+
+```ruby
+get 'client/features', target_duration: :fast do
+  # endpoint logic
+end
+```
--- a/doc/development/index.md
+++ b/doc/development/index.md
@@ -334,6 +334,7 @@ See [database guidelines](database/index.md).
 - [Features inside `.gitlab/`](features_inside_dot_gitlab.md)
 - [Dashboards for stage groups](stage_group_dashboards.md)
 - [Preventing transient bugs](transient/prevention-patterns.md)
+- [GitLab Application SLIs](application_slis/index.md)

 ## Other GitLab Development Kit (GDK) guides


--- a/doc/development/stage_group_dashboards.md
+++ b/doc/development/stage_group_dashboards.md
 ---
-stage: Enablement
-group: Infrastructure
+stage: Platforms
+group: Scalability
 info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
 ---

@@ -58,6 +58,12 @@ component can have 2 indicators:
   [Web](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/web.jsonnet#L154)
   services, that threshold is **5 seconds**.

+   We're working on making this target configurable per endpoint in [this
+   project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525). Learn
+   how to [customize the request
+   apdex](application_slis/rails_request_apdex.md). This new apdex
+   measurement is not yet part of the error budget.
+
   For Sidekiq job execution, the threshold depends on the [job
   urgency](sidekiq_style_guide.md#job-urgency). It is
   [currently](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/lib/sidekiq-helpers.libsonnet#L25-38)