Rebalance issues relative position without transaction
Rebalancing issues in a long-running transaction with a lock retry generates subtransactions that degrade overall DB performance; see https://gitlab.com/gitlab-org/gitlab/-/issues/338346

We are therefore moving issue rebalancing out of a single big transaction. Instead, repositioning within the project or namespace is locked for the duration of the rebalance.

Rebalancing is a long-running job: it can take multiple hours to rebalance a namespace with a large number of issues. This change introduces a simple mechanism for resuming a rebalance from a checkpoint.

- A limited number of rebalancing jobs are allowed to run at once, currently 5.
- Before starting a rebalance, the number of running rebalances is checked.
- If the limit of running rebalances is met, we store the first project id from the list of projects to be rebalanced and use it to determine how many rebalances are running.
- We load all namespace issue ids into a Redis sorted set, using each issue's current relative position as the score. This avoids a very slow SQL query that would otherwise have to do the ordering: we can load the issue ids in batches and get the sorting for free from Redis. If loading fails, it can also be resumed from the project we read last.
- Because we are no longer in a DB transaction and want to preserve the relative order of the issues after the rebalance, issue repositioning in the namespace must be disabled while rebalancing.
- Once all issue ids are loaded, positions are updated in batches by reading the issues in sorted order and computing new positions that are distributed equally, based on the number of issues in the namespace.
- Every successful update stores a checkpoint from which the next update can be picked up in case of a failure.
- Updates are retried on failure, and batch sizes are dynamically reduced on each retry, down to a limit of 5 issues per batch.
- All cache keys are set to expire 10 days after the last interaction. This leaves enough time for the job to be picked up again and also cleans up any unused keys after the grace period.

Changelog: changed
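
As a rough illustration of the ordering and equal-distribution steps described above, here is a minimal Python sketch. It is not the code from this change: an in-memory sorted list stands in for the Redis sorted set, and `MIN_POSITION`, `MAX_POSITION`, `build_sorted_ids` and `new_positions` are hypothetical names and values chosen for the example.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical bounds of the position range; the real limits are not
# stated in this commit message.
MIN_POSITION = 0
MAX_POSITION = 1_000_000


def build_sorted_ids(issues: Dict[int, int]) -> List[int]:
    """Stand-in for the Redis sorted set.

    `issues` maps issue id -> current relative position (the score).
    Ids come back ordered by score; ties are broken by id here.
    """
    ordered = sorted(issues.items(), key=lambda item: (item[1], item[0]))
    return [issue_id for issue_id, _ in ordered]


def batches(ids: List[int], size: int) -> Iterator[Tuple[int, List[int]]]:
    """Yield (offset, slice) pairs so ids can be processed in batches."""
    for start in range(0, len(ids), size):
        yield start, ids[start:start + size]


def new_positions(sorted_ids: List[int], batch_size: int) -> Dict[int, int]:
    """Spread issues evenly over the range while keeping their order."""
    gap = (MAX_POSITION - MIN_POSITION) // (len(sorted_ids) + 1)
    result: Dict[int, int] = {}
    for offset, batch in batches(sorted_ids, batch_size):
        for index, issue_id in enumerate(batch, start=offset):
            result[issue_id] = MIN_POSITION + gap * (index + 1)
    return result
```

With three issues at positions 500, 20 and 300, the ids are read back in the order of their old positions and then placed one equal gap apart, so the relative order survives the rebalance.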
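
The retry and checkpoint behaviour could look roughly like the following Python sketch. It rests on several assumptions: the commit message only says batch sizes are "dynamically downsized", so halving is a guess; the `checkpoint` dict stands in for a cache key; and `update_with_checkpoint` and `apply_batch` are hypothetical names.

```python
from typing import Callable, Dict, List

MIN_BATCH_SIZE = 5  # lower limit named in the commit message


def update_with_checkpoint(
    ids: List[int],
    batch_size: int,
    apply_batch: Callable[[List[int]], None],
    checkpoint: Dict[str, int],
) -> int:
    """Apply updates in batches, resuming from the stored checkpoint.

    On failure the batch size is halved (an assumed strategy) and the
    same batch is retried, down to MIN_BATCH_SIZE. If the smallest
    batch still fails, the error is re-raised; the checkpoint keeps
    the last successful index so a later run can resume from there.
    """
    index = checkpoint.get("index", 0)
    size = batch_size
    while index < len(ids):
        batch = ids[index:index + size]
        try:
            apply_batch(batch)
        except Exception:
            if size <= MIN_BATCH_SIZE:
                raise
            size = max(MIN_BATCH_SIZE, size // 2)
            continue
        index += len(batch)
        checkpoint["index"] = index  # record progress after each success
    return index
```

A caller would pass the same `checkpoint` object (or, in a real system, the same cache key) on every attempt, so that a job restarted after a crash skips the ids that were already updated.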