    Rebalance issues relative position without transaction · 8834fda8
    Alexandru Croitor authored
    Rebalancing issues in a long-running transaction with a lock
    retry generates subtransactions that degrade overall DB
    performance.
    
    See https://gitlab.com/gitlab-org/gitlab/-/issues/338346
    
    We are therefore moving issue rebalancing out of a single large
    transaction and instead locking repositioning within the project
    or namespace while the rebalance runs.
    
    Rebalancing is a long-running job: it can take multiple hours
    to finish a rebalance of a namespace with a large number of issues.
    This change introduces a simple mechanism for resuming a rebalance
    from a checkpoint.
    
    - A limited number of rebalancing jobs are allowed to run
    concurrently, 5 at this point.
    - Before starting a rebalance, the number of running rebalances
    is checked.
    - If the limit of running rebalances is met, we store the first
    project id from the list of projects to be rebalanced and use that
    to determine how many rebalances are running.
    - We load all namespace issue ids into a Redis sorted set, using
    each issue's current relative position as the score. This avoids
    a very slow SQL query that would otherwise require ordering: we
    can load the issue ids in batches and get the sorting for free
    from Redis. It also allows us to resume issue loading after a
    failure, starting from the project we read last.
    - Because we are no longer in a DB transaction and we want to
    preserve the relative position of the issues after the rebalance,
    we need to disable issue repositioning in the namespace
    while rebalancing.
    - Once all the issue ids are loaded, the positions are updated
    in batches by reading the issues in sorted order and computing
    the new positions based on the number of issues in the
    namespace, distributed equally.
    - Every successful update stores a checkpoint from which the
    next update can be picked up in case of a failure.
    - Updates are retried on failure with dynamically reduced batch
    sizes, down to a limit of 5 issues per batch.
    - All cache keys are set to expire 10 days after the last
    interaction, to leave enough time for the job to be picked up and
    to clean up any unused keys after the grace period.
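
    The batching, checkpointing and retry steps listed above can be
    sketched roughly as follows. This is an illustration only, not
    the code in this commit: the method names, the position bounds,
    the in-memory state hash (standing in for the cache) and the
    halving strategy are all assumptions. Only the lower limit of
    5 issues per batch comes from the description above.

```ruby
# Hypothetical sketch; none of these names come from the GitLab codebase.
MIN_BATCH_SIZE = 5 # lower limit named in the commit message

# Spread `count` positions evenly across min_pos..max_pos.
# The bounds are assumed values chosen for illustration.
def evenly_spaced_positions(count, min_pos: 0, max_pos: 1_000_000)
  gap = (max_pos - min_pos) / (count + 1)
  (1..count).map { |i| min_pos + i * gap }
end

# Assign new positions to already-sorted issue ids in batches, resuming
# from state[:checkpoint] when present. The block performs the actual
# update and may raise. On failure the batch size is halved (an assumed
# strategy) and the same slice is retried; a failure at MIN_BATCH_SIZE
# is re-raised to the caller.
def rebalance(sorted_ids, state, batch_size: 100)
  positions = evenly_spaced_positions(sorted_ids.size)
  index = state.fetch(:checkpoint, 0)
  size = batch_size

  while index < sorted_ids.size
    batch = sorted_ids[index, size].zip(positions[index, size])
    begin
      yield batch
      index += batch.size
      state[:checkpoint] = index # checkpoint after every successful update
    rescue StandardError
      raise if size <= MIN_BATCH_SIZE
      size = [size / 2, MIN_BATCH_SIZE].max
    end
  end

  state
end
```

    A caller would pass the issue ids in sorted order together with
    the persisted state, and perform the real position update inside
    the block. Because the checkpoint only advances after a
    successful batch, a restarted job skips the work already done.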
    
    Changelog: changed
state_spec.rb 7.59 KB