Commit a9413b64 authored by Marcia Ramos's avatar Marcia Ramos

Merge branch 'bvl-update-request-apdex-sli-docs' into 'master'

Update application SLI documentation

See merge request gitlab-org/gitlab!73994
parents deb20688 4e759c20
...@@ -110,21 +110,79 @@ will also be incremented: ...@@ -110,21 +110,79 @@ will also be incremented:
gitlab_sli:received_email:success_total{ feature_category='service_desk', email_type='service_desk' } gitlab_sli:received_email:success_total{ feature_category='service_desk', email_type='service_desk' }
``` ```
So far, only tracking `apdex` using a success rate is supported. If you
need to track errors this way, please upvote
[this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
and leave a comment so we can prioritize this.
## Using the SLI in service monitoring and alerts ## Using the SLI in service monitoring and alerts
When the application is emitting metrics for the new SLI, those need When the application is emitting metrics for a new SLI, they need
to be consumed in the service catalog to result in alerts, and be to be consumed from the [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/tree/master/metrics-catalog)
included in the error budget for stage groups and GitLab.com's overall to result in alerts, and included in the error budget for stage
availability. groups and GitLab.com's overall availability.
Start by adding the new SLI to the
[Application-SLI library](https://gitlab.com/gitlab-com/runbooks/-/blob/d109886dfd5170793eeb8de3d69aafd4a9da78f6/metrics-catalog/gitlab-slis/library.libsonnet#L4).
After that, add the following information:
- `name`: the name of the SLI as defined in code. For example
`received_email`.
- `significantLabels`: an array of Prometheus labels that belong to the
metrics. For example: `["email_type"]`. If the significant labels
for the SLI include `feature_category`, the metrics will also
feed into the
[error budgets for stage groups](../stage_group_dashboards.md#error-budget).
- `featureCategory`: if the SLI applies to a single feature category,
you can specify it statically through this field to feed the SLI
into the error budgets for stage groups.
- `description`: a Markdown string explaining the SLI. It will
be shown on dashboards and alerts.
- `kind`: the kind of indicator. Only `sliDefinition.apdexKind` is supported at the moment.
Reach out in
[this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
if you want to implement an SLI for success or error rates.
When done, run `make generate` to generate recording rules for
the new SLI. This command creates recordings for all services
emitting these metrics aggregated over `significantLabels`.
Open up a merge request with these changes and request review from a Scalability
team member.
When these changes are merged, and the aggregations in
[Thanos](https://thanos.gitlab.net) recorded, query Thanos to see
the success ratio of the new aggregated metrics. For example:
```prometheus
sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:success:rate_1h)
/
sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:weight:rate_1h)
```
This shows the success ratio, which can guide you to set an
appropriate SLO when adding this SLI to a service.
Then, add the SLI to the appropriate service
catalog file. For example, the [`web` service](https://gitlab.com/gitlab-com/runbooks/-/blob/2b7be37a006c236bd684a4e6a1fbf4c66158292a/metrics-catalog/services/web.jsonnet#L198):
```jsonnet
rails_requests:
sliLibrary.get('rails_request_apdex')
.generateServiceLevelIndicator({ job: 'gitlab-rails' })
```
To pass extra selectors and override properties of the SLI, see the
[service monitoring documentation](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/README.md).
This is currently being worked on in [this SLIs with statically defined feature categories can already receive
project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573). As alerts about the SLI in specified Slack channels. For more information, read the
part of [this [alert routing documentation](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/alert-routing.md).
issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1307) In [this project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/614)
we will update the documentation. we are extending this so alerts for SLIs with a `feature_category`
label in the souce metrics can also be routed.
For any question, please don't hesitate to createan issue in [the For any question, please don't hesitate to create an issue in
Scalability issue [the Scalability issue tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
or come find us in or come find us in
[#g_scalability](https://gitlab.slack.com/archives/CMMF8TKR9) on Slack. [#g_scalability](https://gitlab.slack.com/archives/CMMF8TKR9) on Slack.
...@@ -9,11 +9,8 @@ info: To determine the technical writer assigned to the Stage/Group associated w ...@@ -9,11 +9,8 @@ info: To determine the technical writer assigned to the Stage/Group associated w
> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4 > [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4
NOTE: NOTE:
This SLI is not yet used in This SLI is used for service monitoring. But not for [error budgets for stage groups](../stage_group_dashboards.md#error-budget)
[error budgets for stage groups](../stage_group_dashboards.md#error-budget) by default. You can [opt in](#error-budget-attribution-and-ownership).
or service monitoring. To learn more about this work, read about how we are
[incorporating custom SLIs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573)
into error budgets and the service catalog.
The request Apdex SLI (Service Level Indicator) is [an SLI defined in the application](index.md). The request Apdex SLI (Service Level Indicator) is [an SLI defined in the application](index.md).
It measures the duration of successful requests as an indicator for It measures the duration of successful requests as an indicator for
...@@ -223,8 +220,18 @@ end ...@@ -223,8 +220,18 @@ end
### Error budget attribution and ownership ### Error budget attribution and ownership
This SLI is used for service level monitoring. It feeds into the This SLI is used for service level monitoring. It feeds into the
[error budget for stage groups](../stage_group_dashboards.md#error-budget) when [error budget for stage
opting in. For more information, read the epic for groups](../stage_group_dashboards.md#error-budget). For this
particular SLI, we have opted everyone out by default to give time to
set the correct urgencies on endpoints before it affects a group's
error budget.
To include this SLI in the error budget, remove the `rails_requests`
from the `ignored_components` array in the entry for your group. Read
more about what is configurable in the
[runbooks documentation](https://gitlab.com/gitlab-com/runbooks/-/tree/master/services#teamsyml).
For more information, read the epic for
[defining custom SLIs and incorporating them into error budgets](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525)). [defining custom SLIs and incorporating them into error budgets](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525)).
The endpoints for the SLI feed into a group's error budget based on the The endpoints for the SLI feed into a group's error budget based on the
[feature category declared on it](../feature_categorization/index.md). [feature category declared on it](../feature_categorization/index.md).
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment