Merge branch 'bvl-update-request-apdex-sli-docs' into 'master'

Update application SLI documentation

See merge request gitlab-org/gitlab!73994

Commit a9413b64 authored Nov 17, 2021 by Marcia Ramos
Showing with 84 additions and 19 deletions

doc/development/application_slis/index.md +70 -12

doc/development/application_slis/rails_request_apdex.md +14 -7

--- a/doc/development/application_slis/index.md
+++ b/doc/development/application_slis/index.md
@@ -110,21 +110,79 @@ will also be incremented:
 gitlab_sli:received_email:success_total{ feature_category='service_desk', email_type='service_desk' }
 ```
+So far, only tracking `apdex` using a success rate is supported. If you
+need to track errors this way, please upvote
+[this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
+and leave a comment so we can prioritize this.
 ## Using the SLI in service monitoring and alerts
-When the application is emitting metrics for the new SLI, those need
+When the application is emitting metrics for a new SLI, they need
-to be consumed in the service catalog to result in alerts, and be
+to be consumed from the [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/tree/master/metrics-catalog)
-included in the error budget for stage groups and GitLab.com's overall
+to result in alerts, and included in the error budget for stage
-availability.
+groups and GitLab.com's overall availability.
+Start by adding the new SLI to the
+[Application-SLI library](https://gitlab.com/gitlab-com/runbooks/-/blob/d109886dfd5170793eeb8de3d69aafd4a9da78f6/metrics-catalog/gitlab-slis/library.libsonnet#L4).
+After that, add the following information:
+- `name`: the name of the SLI as defined in code. For example
+  `received_email`.
+- `significantLabels`: an array of Prometheus labels that belong to the
+  metrics. For example: `["email_type"]`. If the significant labels
+  for the SLI include `feature_category`, the metrics will also
+  feed into the
+  [error budgets for stage groups](../stage_group_dashboards.md#error-budget).
+- `featureCategory`: if the SLI applies to a single feature category,
+  you can specify it statically through this field to feed the SLI
+  into the error budgets for stage groups.
+- `description`: a Markdown string explaining the SLI. It will
+  be shown on dashboards and alerts.
+- `kind`: the kind of indicator. Only `sliDefinition.apdexKind` is supported at the moment.
+  Reach out in
+  [this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
+  if you want to implement an SLI for success or error rates.
+When done, run `make generate` to generate recording rules for
+the new SLI. This command creates recordings for all services
+emitting these metrics aggregated over `significantLabels`.
+Open up a merge request with these changes and request review from a Scalability
+team member.
+When these changes are merged, and the aggregations in
+[Thanos](https://thanos.gitlab.net) recorded, query Thanos to see
+the success ratio of the new aggregated metrics. For example:
+```prometheus
+sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:success:rate_1h)
+/
+sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:weight:rate_1h)
+```
+This shows the success ratio, which can guide you to set an
+appropriate SLO when adding this SLI to a service.
+Then, add the SLI to the appropriate service
+catalog file. For example, the [`web` service](https://gitlab.com/gitlab-com/runbooks/-/blob/2b7be37a006c236bd684a4e6a1fbf4c66158292a/metrics-catalog/services/web.jsonnet#L198):
+```jsonnet
+rails_requests:
+  sliLibrary.get('rails_request_apdex')
+    .generateServiceLevelIndicator({ job: 'gitlab-rails' })
+```
+To pass extra selectors and override properties of the SLI, see the
+[service monitoring documentation](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/README.md).
-This is currently being worked on in [this
+SLIs with statically defined feature categories can already receive
-project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573). As
+alerts about the SLI in specified Slack channels. For more information, read the
-part of [this
+[alert routing documentation](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/alert-routing.md).
-issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1307)
+In [this project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/614)
-we will update the documentation.
+we are extending this so alerts for SLIs with a `feature_category`
+label in the source metrics can also be routed.
-For any question, please don't hesitate to createan issue in [the
+For any question, please don't hesitate to create an issue in
-Scalability issue
+[the Scalability issue tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
-tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
 or come find us in
 [#g_scalability](https://gitlab.slack.com/archives/CMMF8TKR9) on Slack.
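
Reviewer note on the Thanos query added in `index.md` above: the sketch below reproduces the arithmetic of `sum by (environment, stage, type)(success) / sum by (environment, stage, type)(weight)` on plain in-memory samples. It is only an illustration. The function name `success_ratio`, the sample label values, and the choice to skip groups with zero weight are assumptions made here, not part of the runbooks or of Prometheus itself.

```python
from collections import defaultdict


def success_ratio(success_samples, weight_samples,
                  by=("environment", "stage", "type")):
    """Aggregate both series over `by` labels, then divide group by group.

    Each sample is a (labels_dict, value) pair. This is a hypothetical
    helper that mirrors the PromQL expression, not a Prometheus client API.
    """
    def sum_by(samples):
        totals = defaultdict(float)
        for labels, value in samples:
            # Labels outside `by` (for example feature_category) are dropped.
            key = tuple(labels.get(name, "") for name in by)
            totals[key] += value
        return totals

    success = sum_by(success_samples)
    weight = sum_by(weight_samples)
    # Only groups present on both sides produce a result. Assumption for
    # this sketch: groups with zero weight are skipped instead of
    # returning NaN.
    return {
        key: success[key] / weight[key]
        for key in weight
        if key in success and weight[key] > 0
    }


# Made-up sample values for two feature categories of the same service.
base = {"environment": "gprd", "stage": "main", "type": "web"}
success_samples = [
    ({**base, "feature_category": "a"}, 90.0),
    ({**base, "feature_category": "b"}, 5.0),
]
weight_samples = [
    ({**base, "feature_category": "a"}, 95.0),
    ({**base, "feature_category": "b"}, 5.0),
]
ratios = success_ratio(success_samples, weight_samples)
```

With these made-up samples, the single `("gprd", "main", "web")` group has 95 successes over a weight of 100, which is the kind of observed ratio the documentation suggests looking at before choosing an SLO.
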
--- a/doc/development/application_slis/rails_request_apdex.md
+++ b/doc/development/application_slis/rails_request_apdex.md
@@ -9,11 +9,8 @@ info: To determine the technical writer assigned to the Stage/Group associated w
 > [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4
 NOTE:
-This SLI is not yet used in
+This SLI is used for service monitoring. But not for [error budgets for stage groups](../stage_group_dashboards.md#error-budget)
-[error budgets for stage groups](../stage_group_dashboards.md#error-budget)
+by default. You can [opt in](#error-budget-attribution-and-ownership).
-or service monitoring. To learn more about this work, read about how we are
-[incorporating custom SLIs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573)
-into error budgets and the service catalog.
 The request Apdex SLI (Service Level Indicator) is [an SLI defined in the application](index.md).
 It measures the duration of successful requests as an indicator for
@@ -223,8 +220,18 @@ end
 ### Error budget attribution and ownership
 This SLI is used for service level monitoring. It feeds into the
-[error budget for stage groups](../stage_group_dashboards.md#error-budget) when
+[error budget for stage
-opting in. For more information, read the epic for
+groups](../stage_group_dashboards.md#error-budget). For this
+particular SLI, we have opted everyone out by default to give time to
+set the correct urgencies on endpoints before it affects a group's
+error budget.
+To include this SLI in the error budget, remove the `rails_requests`
+from the `ignored_components` array in the entry for your group. Read
+more about what is configurable in the
+[runbooks documentation](https://gitlab.com/gitlab-com/runbooks/-/tree/master/services#teamsyml).
+For more information, read the epic for
 [defining custom SLIs and incorporating them into error budgets](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525).
 The endpoints for the SLI feed into a group's error budget based on the
 [feature category declared on it](../feature_categorization/index.md).