Commit 4e759c20 authored by Bob Van Landuyt, committed by Marcia Ramos

Update application SLI documentation

parent deb20688
......@@ -110,21 +110,79 @@ will also be incremented:
gitlab_sli:received_email:success_total{ feature_category='service_desk', email_type='service_desk' }
```
So far, only tracking `apdex` using a success rate is supported. If you
need to track errors this way, please upvote
[this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
and leave a comment so we can prioritize this.
## Using the SLI in service monitoring and alerts
When the application is emitting metrics for the new SLI, those need
to be consumed in the service catalog to result in alerts, and be
included in the error budget for stage groups and GitLab.com's overall
availability.
When the application is emitting metrics for a new SLI, they need
to be consumed from the [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/tree/master/metrics-catalog)
to result in alerts, and included in the error budget for stage
groups and GitLab.com's overall availability.
Start by adding the new SLI to the
[Application-SLI library](https://gitlab.com/gitlab-com/runbooks/-/blob/d109886dfd5170793eeb8de3d69aafd4a9da78f6/metrics-catalog/gitlab-slis/library.libsonnet#L4).
After that, add the following information:
- `name`: the name of the SLI as defined in code. For example
`received_email`.
- `significantLabels`: an array of Prometheus labels that belong to the
metrics. For example: `["email_type"]`. If the significant labels
for the SLI include `feature_category`, the metrics will also
feed into the
[error budgets for stage groups](../stage_group_dashboards.md#error-budget).
- `featureCategory`: if the SLI applies to a single feature category,
you can specify it statically through this field to feed the SLI
into the error budgets for stage groups.
- `description`: a Markdown string explaining the SLI. It will
be shown on dashboards and alerts.
- `kind`: the kind of indicator. Only `sliDefinition.apdexKind` is supported at the moment.
Reach out in
[this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
if you want to implement an SLI for success or error rates.
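As a rough illustration of the fields listed above, a library entry might look like the following sketch. The field names come from the list; the values are examples only, and the local `sliDefinition` object is a hypothetical stand-in so the snippet evaluates on its own (in the runbooks repository, `apdexKind` is provided by the library's own helper, so check the linked file for the exact shape).

```jsonnet
// Hypothetical stand-in for the runbooks helper that provides `apdexKind`.
// Assumption: in the real library this value comes from an import.
local sliDefinition = { apdexKind: 'apdex' };

// Illustrative entry only: field names follow the list above,
// values are examples and not taken from the real library.
{
  // Name of the SLI as defined in application code.
  name: 'received_email',
  // Prometheus labels to aggregate the metrics over.
  significantLabels: ['email_type'],
  // Static feature category, used when the SLI belongs to a single category.
  featureCategory: 'service_desk',
  // Markdown shown on dashboards and alerts (placeholder text).
  description: |||
    Placeholder: replace with a Markdown explanation of the SLI.
  |||,
  // Only the apdex kind is supported at the moment.
  kind: sliDefinition.apdexKind,
}
```

Setting `featureCategory` statically and listing `feature_category` in `significantLabels` are alternatives: use the former when the whole SLI belongs to one category, and the latter when the category varies per metric.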
When done, run `make generate` to generate recording rules for
the new SLI. This command creates recordings for all services
emitting these metrics aggregated over `significantLabels`.
Open up a merge request with these changes and request review from a Scalability
team member.
When these changes are merged, and the aggregations in
[Thanos](https://thanos.gitlab.net) recorded, query Thanos to see
the success ratio of the new aggregated metrics. For example:
```prometheus
sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:success:rate_1h)
/
sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:weight:rate_1h)
```
This shows the success ratio, which can guide you to set an
appropriate SLO when adding this SLI to a service.
Then, add the SLI to the appropriate service
catalog file. For example, the [`web` service](https://gitlab.com/gitlab-com/runbooks/-/blob/2b7be37a006c236bd684a4e6a1fbf4c66158292a/metrics-catalog/services/web.jsonnet#L198):
```jsonnet
rails_requests:
sliLibrary.get('rails_request_apdex')
.generateServiceLevelIndicator({ job: 'gitlab-rails' })
```
To pass extra selectors and override properties of the SLI, see the
[service monitoring documentation](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/README.md).
This is currently being worked on in [this
project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573). As
part of [this
issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1307)
we will update the documentation.
SLIs with statically defined feature categories can already receive
alerts about the SLI in specified Slack channels. For more information, read the
[alert routing documentation](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/alert-routing.md).
In [this project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/614)
we are extending this so alerts for SLIs with a `feature_category`
label in the source metrics can also be routed.
For any question, please don't hesitate to createan issue in [the
Scalability issue
tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
For any question, please don't hesitate to create an issue in
[the Scalability issue tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
or come find us in
[#g_scalability](https://gitlab.slack.com/archives/CMMF8TKR9) on Slack.
......@@ -9,11 +9,8 @@ info: To determine the technical writer assigned to the Stage/Group associated w
> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4
NOTE:
This SLI is not yet used in
[error budgets for stage groups](../stage_group_dashboards.md#error-budget)
or service monitoring. To learn more about this work, read about how we are
[incorporating custom SLIs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573)
into error budgets and the service catalog.
This SLI is used for service monitoring, but not for [error budgets for stage groups](../stage_group_dashboards.md#error-budget)
by default. You can [opt in](#error-budget-attribution-and-ownership).
The request Apdex SLI (Service Level Indicator) is [an SLI defined in the application](index.md).
It measures the duration of successful requests as an indicator for
......@@ -223,8 +220,18 @@ end
### Error budget attribution and ownership
This SLI is used for service level monitoring. It feeds into the
[error budget for stage groups](../stage_group_dashboards.md#error-budget) when
opting in. For more information, read the epic for
[error budget for stage
groups](../stage_group_dashboards.md#error-budget). For this
particular SLI, we have opted everyone out by default to give time to
set the correct urgencies on endpoints before it affects a group's
error budget.
To include this SLI in the error budget, remove `rails_requests`
from the `ignored_components` array in the entry for your group. Read
more about what is configurable in the
[runbooks documentation](https://gitlab.com/gitlab-com/runbooks/-/tree/master/services#teamsyml).
For more information, read the epic for
[defining custom SLIs and incorporating them into error budgets](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525).
The endpoints for the SLI feed into a group's error budget based on the
[feature category declared on it](../feature_categorization/index.md).
......