Commit 72e3ebbd authored by Bob Van Landuyt's avatar Bob Van Landuyt

Merge branch '342048-aqualls-tidy-page' into 'master'

Line edits: rails_request_apdex page

See merge request gitlab-org/gitlab!73369
parents bcbea92f ca5bef86
---
group: Scalability
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Rails request Apdex SLI
> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4
NOTE:
This SLI is not yet used in
[error budgets for stage groups](../stage_group_dashboards.md#error-budget)
or service monitoring. To learn more about this work, read about how we are
[incorporating custom SLIs](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573)
into error budgets and the service catalog.
The request Apdex SLI (Service Level Indicator) is [an SLI defined in the application](index.md).
It measures the duration of successful requests as an indicator for
application performance. This includes the REST and GraphQL API, and the
regular controller endpoints. It consists of these counters:
1. `gitlab_sli:rails_request_apdex:total`: This counter gets
incremented for every request that did not result in a response
with a `5xx` status code. It ensures slow failures are not
counted twice, because the request is already counted in the error SLI.
1. `gitlab_sli:rails_request_apdex:success_total`: This counter gets
incremented for every successful request that performed faster than
the [defined target duration depending on the endpoint's urgency](#adjusting-request-urgency).
Both these counters are labeled with:
1. `endpoint_id`: The identifier of the Rails controller or the
   Grape API endpoint.
1. `feature_category`: The feature category specified for that
controller or API endpoint.
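For illustration, here is a minimal sketch of how a request could update these
counters, assuming the application SLI interface described in
[the application SLIs page](index.md). The label values and the
`request_duration`/`target_duration` variables are hypothetical stand-ins:
```ruby
# Hypothetical labels for a single request; real values come from the
# controller action or Grape endpoint that handled the request.
labels = {
  endpoint_id: 'Projects::RawController#show',
  feature_category: :source_code_management
}

# Requests that respond with a 5xx status are skipped entirely. Every
# other request increments the total, and counts as a success only
# when it completed faster than the target for its urgency.
Gitlab::Metrics::Sli[:rails_request_apdex].increment(
  labels: labels,
  success: request_duration <= target_duration
)
```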
## Request Apdex SLO
These counters can be combined into a success ratio. The objective for
this ratio is defined in the service catalog per service. For this SLI to meet SLO,
the ratio recorded must be higher than:
- [Web: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/web.jsonnet#L19)
- [API: 0.995](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/api.jsonnet#L19)
- [Git: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/git.jsonnet#L22)
For example, for the web service, we want at least 99.8% of requests
to be faster than their target duration.
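As a worked example with hypothetical counts:
```ruby
# Hypothetical counter values over a measurement window for the web service:
total   = 10_000 # gitlab_sli:rails_request_apdex:total
success =  9_985 # gitlab_sli:rails_request_apdex:success_total

success.to_f / total # => 0.9985, above the 0.998 objective, so the SLO is met
```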
We use these targets for alerting and service monitoring. Set durations taking
these targets into account, so we don't cause alerts. The goal, however, is to
set the urgency to a target that satisfies our users.
Both successful measurements and unsuccessful ones affect the
error budget for stage groups.
## Adjusting request urgency
Not all endpoints perform the same type of work, so it is possible to
define different urgency levels for different endpoints. An endpoint with a
lower urgency can have a longer request duration than endpoints with high urgency.
Long-running requests are more expensive for our infrastructure. While serving
one request, the thread remains occupied for the duration of that request. The thread
can handle nothing else. Due to Ruby's Global VM Lock, the thread might keep the
lock and stall other requests handled by the same Puma worker
process. The request is, in fact, a noisy neighbor for other requests
handled by the worker. We cap the upper bound for a target duration at 5 seconds
for this reason.
## Decreasing the urgency (setting a higher target duration)
You can decrease the urgency on an existing endpoint on
a case-by-case basis. Take the following into account:
1. Apdex is about perceived performance. If a user is actively waiting
for the result of a request, waiting 5 seconds might not be
acceptable. However, if the endpoint is used by an automation
requiring a lot of data, 5 seconds could be acceptable.
A product manager can help to identify how an endpoint is used.
1. The workload for some endpoints can sometimes differ greatly
depending on the parameters specified by the caller. The urgency
needs to accommodate those differences. In some cases, you could
define a separate [application SLI](index.md#defining-a-new-sli)
for what the endpoint is doing.
When an endpoint in certain cases turns into a no-op, making it
very fast, we should ignore these fast requests when setting the
target. For example, if the `MergeRequests::DraftsController` is
hit for every merge request being viewed, but rarely renders
anything, then we should pick the target that
would still accommodate the endpoint performing work.
1. Consider the dependent resources consumed by the endpoint. If the endpoint
loads a lot of data from Gitaly or the database, and this causes
unsatisfactory performance, consider optimizing the
way the data is loaded rather than increasing the target duration
by lowering the urgency.
In these cases, it might be appropriate to temporarily decrease
urgency to make the endpoint meet SLO, if this is bearable for the
infrastructure. In such cases, create a code comment linking to an issue.
If the endpoint consumes a lot of CPU time, we should also consider
this: these requests are exactly the noisy neighbors we should
try to keep as short as possible.
1. Traffic characteristics should also be taken into account. If the
traffic to the endpoint is bursty, like CI traffic spinning up a
big batch of jobs hitting the same endpoint, then having these
endpoints take five seconds is unacceptable from an infrastructure point of
view. We cannot scale up the fleet fast enough to accommodate for
the incoming slow requests alongside the regular traffic.
When lowering the urgency for an existing endpoint, please involve a
[Scalability team member](https://about.gitlab.com/handbook/engineering/infrastructure/team/scalability/#team-members)
in the review. We can use request rates and durations available in the
logs to come up with a recommendation. You can pick a threshold
using the same process as for
[increasing urgency](#increasing-urgency-setting-a-lower-target-duration),
picking a duration that is higher than the SLO for the service.
We shouldn't set the longest durations on endpoints in the merge
requests that introduce them, because we don't yet have data to support
the decision.
## Increasing urgency (setting a lower target duration)
When increasing the urgency, we must make sure the endpoint
still meets SLO for the fleet that handles the request. You can use the
information in the logs to check:
1. Open [this table in Kibana](https://log.gprd.gitlab.net/goto/bbb6465c68eb83642269e64a467df3df)
1. The table loads information for the busiest endpoints by
default. To speed the response, add both:
- A filter for `json.caller_id.keyword`.
- The identifier you're interested in, such as `Projects::RawController#show`.
1. Check the [appropriate percentile duration](#request-apdex-slo) for
the service handling the endpoint. The overall duration should
be lower than your intended target.
1. If the overall duration is below the intended target, check the peaks over time
in [this graph](https://log.gprd.gitlab.net/goto/9319c4a402461d204d13f3a4924a89fc)
in Kibana. Here, the percentile in question should not peak above
the target duration we want to set.
As decreasing a threshold too much could result in alerts for the
Apdex degradation, please also involve a Scalability team member in
the merge request.
## How to adjust the urgency
You can specify urgency similar to how endpoints
[get a feature category](../feature_categorization/index.md). Endpoints without a
specific target use the default urgency: 1s duration. These configurations
are available:
| Urgency | Duration in seconds | Notes |
|------------|---------------------|-----------------------------------------------|
| `:high` | 0.25s | |
| `:medium` | 0.5s | |
| `:default` | 1s | The default when nothing is specified. |
| `:low` | 5s | |
### Rails controller
An urgency can be specified for all actions in a controller:
```ruby
class Boards::ListsController < ApplicationController
  # The urgency declared here applies to every action in the
  # controller; :low corresponds to a 5 second target duration.
  urgency :low
end
```
To also specify the urgency for certain actions in a controller:
```ruby
class Boards::ListsController < ApplicationController
  # Passing a list of actions scopes the urgency to those actions
  # only; the action names here are illustrative.
  urgency :high, [:index, :show]
end
```

### Grape endpoints
To specify the urgency for an entire API class:
```ruby
module API
  class Issues < ::API::Base
    # Declares the target duration for every route in the class.
    # The class name here is illustrative.
    urgency :low
  end
end
```
To also specify the urgency for certain routes in an API class:
```ruby
module API
  class Issues < ::API::Base
    # An urgency passed as a route option applies to that route only.
    # The route path and class name here are illustrative.
    get 'issues', urgency: :low do
      # ...
    end
  end
end
```

## Error budget attribution and ownership

The endpoints for the SLI feed into a group's error budget based on the
[feature category they are attributed to](../feature_categorization/index.md).
To know which endpoints are included for your group, you can see the
request rates on the
[group dashboard for your group](https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups).
In the **Budget Attribution** row, the **Puma Apdex** log link shows you
how many requests are not meeting a 1s or 5s target.
Learn more about the content of the dashboard in the documentation for
[dashboards for stage groups](../stage_group_dashboards.md).