- Ensure that you have the `gitlab_rails['db_password']` set to the plain text-password used when creating the hash for `postgresql['sql_user_password']`.
Ensure you have the `gitlab_rails['db_password']` set to the plain-text
password used when creating the hash for `postgresql['sql_user_password']`.
- Verify the correct password is set for `gitlab_rails['db_password']` that was used when creating the hash in `postgresql['sql_user_password']` by running `gitlab-ctl pg-password-md5 gitlab` and entering the password.
Verify the correct password is set for `gitlab_rails['db_password']` that was
used when creating the hash in `postgresql['sql_user_password']` by running
`gitlab-ctl pg-password-md5 gitlab` and entering the password.
- Ensure that you have added the secondary node in the Admin Area of the **primary** node.
Ensure you have added the secondary node in the Admin Area of the **primary** node.
- Ensure that you entered the `external_url` or `gitlab_rails['geo_node_name']` when adding the secondary node in the Admin Area of the **primary** node.
- Prior to GitLab 12.4, edit the secondary node in the Admin Area of the **primary** node and ensure that there is a trailing `/` in the `Name` field.
Also ensure you entered the `external_url` or `gitlab_rails['geo_node_name']`
when adding the secondary node in the Admin Area of the **primary** node.
In GitLab 12.3 and earlier, edit the secondary node in the Admin Area of the **primary**
node and ensure that there is a trailing `/` in the `Name` field.
1. Check returns `Exception: PG::UndefinedTable: ERROR: relation "geo_nodes" does not exist`
- Check returns `Exception: PG::UndefinedTable: ERROR: relation "geo_nodes" does not exist`.
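For the password check described above, the following is a minimal sketch of how the hash can be reproduced by hand. It assumes the value stored in `postgresql['sql_user_password']` is the MD5 hex digest of the password immediately followed by the user name (the form PostgreSQL uses for `md5` authentication, shown here without the `md5` prefix), and that `md5sum` from GNU coreutils is available. The helper name `pg_md5` is hypothetical.

```shell
# Hypothetical helper: prints the MD5 hex digest of "<password><username>".
# Assumption: this matches the hash expected in postgresql['sql_user_password'].
pg_md5() {
  local password="$1"
  local username="$2"
  # printf avoids the trailing newline that echo would add to the digest input.
  printf '%s%s' "$password" "$username" | md5sum | cut -d' ' -f1
}

# Example (commented out so that nothing sensitive is printed by default):
# pg_md5 'my-plain-text-password' 'gitlab'
```

If the printed value differs from the configured hash, the plain-text password in `gitlab_rails['db_password']` is probably not the one that was used to create the hash. Avoid leaving real passwords in your shell history when you try this.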
The following sections outline troubleshooting steps for fixing replication
The following sections outline troubleshooting steps for fixing replication
errors (indicated by `Database replication working? ... no` in the
error messages (indicated by `Database replication working? ... no` in the
[`geo:check` output](#health-check-rake-task).
[`geo:check` output](#health-check-rake-task).
### Message: `ERROR: replication slots can only be used if max_replication_slots > 0`?
### Message: `ERROR: replication slots can only be used if max_replication_slots > 0`?
...
@@ -304,7 +310,7 @@ process](../setup/database.md) on the **secondary** node .
...
@@ -304,7 +310,7 @@ process](../setup/database.md) on the **secondary** node .
### Message: "Command exceeded allowed execution time" when setting up replication?
### Message: "Command exceeded allowed execution time" when setting up replication?
This may happen while [initiating the replication process](../setup/database.md#step-3-initiate-the-replication-process) on the **secondary** node,
This may happen while [initiating the replication process](../setup/database.md#step-3-initiate-the-replication-process) on the **secondary** node,
and indicates that your initial dataset is too large to be replicated in the default timeout (30 minutes).
and indicates your initial dataset is too large to be replicated in the default timeout (30 minutes).
Re-run `gitlab-ctl replicate-geo-database`, but include a larger value for
Re-run `gitlab-ctl replicate-geo-database`, but include a larger value for
`--backup-timeout`:
`--backup-timeout`:
...
@@ -318,7 +324,7 @@ sudo gitlab-ctl \
...
@@ -318,7 +324,7 @@ sudo gitlab-ctl \
```
```
This gives the initial replication up to six hours to complete, rather than
This gives the initial replication up to six hours to complete, rather than
the default thirty minutes. Adjust as required for your installation.
the default 30 minutes. Adjust as required for your installation.
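The timeout is given in seconds, so six hours corresponds to 21600. A minimal sketch of the arithmetic follows. The helper name `hours_to_seconds` is hypothetical, and the command is only printed for review, with a placeholder standing in for the options you already use.

```shell
# Hypothetical helper: converts whole hours to seconds.
hours_to_seconds() {
  echo $(( $1 * 3600 ))
}

timeout_seconds="$(hours_to_seconds 6)"

# Print the command instead of running it.
# "<your existing options>" is a placeholder, not a real flag.
echo "sudo gitlab-ctl replicate-geo-database <your existing options> --backup-timeout=${timeout_seconds}"
```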
### Message: "PANIC: could not write to file `pg_xlog/xlogtemp.123`: No space left on device"
### Message: "PANIC: could not write to file `pg_xlog/xlogtemp.123`: No space left on device"
...
@@ -334,7 +340,7 @@ log data to build up in `pg_xlog`. Removing the unused slots can reduce the amou
...
@@ -334,7 +340,7 @@ log data to build up in `pg_xlog`. Removing the unused slots can reduce the amou
NOTE:
NOTE:
Using `gitlab-rails dbconsole` does not work, because managing replication slots requires superuser permissions.
Using `gitlab-rails dbconsole` does not work, because managing replication slots requires superuser permissions.
1. View your replication slots with:
1. View your replication slots:
```sql
```sql
SELECT * FROM pg_replication_slots;
SELECT * FROM pg_replication_slots;
...
@@ -343,7 +349,7 @@ log data to build up in `pg_xlog`. Removing the unused slots can reduce the amou
...
@@ -343,7 +349,7 @@ log data to build up in `pg_xlog`. Removing the unused slots can reduce the amou
Slots where `active` is `f` are not active.
Slots where `active` is `f` are not active.
- When this slot should be active, because you have a **secondary** node configured using that slot,
- When this slot should be active, because you have a **secondary** node configured using that slot,
log in to that **secondary** node and check the [PostgreSQL logs](../../logs.md#postgresql-logs)
sign in to that **secondary** node and check the [PostgreSQL logs](../../logs.md#postgresql-logs)
to view why the replication is not running.
to view why the replication is not running.
- If you are no longer using the slot (for example, you no longer have Geo enabled), you can remove it in the
- If you are no longer using the slot (for example, you no longer have Geo enabled), you can remove it in the
...
@@ -355,11 +361,11 @@ Slots where `active` is `f` are not active.
...
@@ -355,11 +361,11 @@ Slots where `active` is `f` are not active.
### Message: "ERROR: canceling statement due to conflict with recovery"
### Message: "ERROR: canceling statement due to conflict with recovery"
This error occurs infrequently under normal usage, and the system is resilient
This error message occurs infrequently under normal usage, and the system is resilient
enough to recover.
enough to recover.
However, under certain conditions, some database queries on secondaries may run
However, under certain conditions, some database queries on secondaries may run
excessively long, which increases the frequency of this error. This can lead to a situation
excessively long, which increases the frequency of this error message. This can lead to a situation
where some queries never complete due to being canceled on every replication.
where some queries never complete due to being canceled on every replication.
These long-running queries are
These long-running queries are
...
@@ -426,7 +432,7 @@ If large repositories are affected by this problem,
...
@@ -426,7 +432,7 @@ If large repositories are affected by this problem,
their resync may take a long time and cause significant load on your Geo nodes,
their resync may take a long time and cause significant load on your Geo nodes,
storage and network systems.
storage and network systems.
If you get the error `Synchronization failed - Error syncing repository` along with the following log messages, this indicates that the expected `geo` remote is not present in the `.git/config` file
If you get the error message `Synchronization failed - Error syncing repository` along with the following log messages, this indicates that the expected `geo` remote is not present in the `.git/config` file
of a repository on the secondary Geo node's file system:
of a repository on the secondary Geo node's file system:
```json
```json
...
@@ -449,11 +455,11 @@ of a repository on the secondary Geo node's file system:
...
@@ -449,11 +455,11 @@ of a repository on the secondary Geo node's file system:
To solve this:
To solve this:
1. Log into the secondary Geo node.
1. Sign in to the secondary Geo node.
1. Back up [the `.git` folder](../../repository_storage_types.md#translate-hashed-storage-paths).
1. Back up [the `.git` folder](../../repository_storage_types.md#translate-hashed-storage-paths).
Use `fatal: 'geo'` as the `grep` term and the following API call:
Use `fatal: 'geo'` as the `grep` term and the following API call:
...
@@ -488,18 +494,19 @@ GitLab places a timeout on all repository clones, including project imports
...
@@ -488,18 +494,19 @@ GitLab places a timeout on all repository clones, including project imports
and Geo synchronization operations. If a fresh `git clone` of a repository
and Geo synchronization operations. If a fresh `git clone` of a repository
on the **primary** takes more than the default three hours, you may be affected by this.
on the **primary** takes more than the default three hours, you may be affected by this.
To increase the timeout, add the following line to `/etc/gitlab/gitlab.rb`
on the **secondary** node:
```ruby
gitlab_rails['gitlab_shell_git_timeout'] = 14400
```
Then reconfigure GitLab:
```shell
sudo gitlab-ctl reconfigure
```
To increase the timeout:
1. On the **secondary** node, add the following line to `/etc/gitlab/gitlab.rb`:
```ruby
gitlab_rails['gitlab_shell_git_timeout'] = 14400
```
1. Reconfigure GitLab:
```shell
sudo gitlab-ctl reconfigure
```
This increases the timeout to four hours (14400 seconds). Choose a time
This increases the timeout to four hours (14400 seconds). Choose a time
long enough to accommodate a full clone of your largest repositories.
long enough to accommodate a full clone of your largest repositories.
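Before you reconfigure, it can help to confirm which value is actually set in `/etc/gitlab/gitlab.rb`. The following is a minimal sketch, assuming the setting is written on a single line in the form shown above. The helper name `configured_git_timeout` is hypothetical.

```shell
# Hypothetical helper: prints the last uncommented gitlab_shell_git_timeout
# value (in seconds) found in the given file, or nothing if it is not set.
configured_git_timeout() {
  sed -n "s/^[[:space:]]*gitlab_rails\['gitlab_shell_git_timeout'\][[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$1" | tail -n 1
}

# Example (commented out; requires read access to the file):
# configured_git_timeout /etc/gitlab/gitlab.rb
```

Commented-out lines are ignored, and if the setting appears more than once, the last occurrence is reported.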
...
@@ -510,7 +517,7 @@ If new LFS objects are never replicated to secondary Geo nodes, check the versio
...
@@ -510,7 +517,7 @@ If new LFS objects are never replicated to secondary Geo nodes, check the versio
GitLab you are running. GitLab versions 11.11.x or 12.0.x are affected by
GitLab you are running. GitLab versions 11.11.x or 12.0.x are affected by
[a bug that results in new LFS objects not being replicated to Geo secondary nodes](https://gitlab.com/gitlab-org/gitlab/-/issues/32696).
[a bug that results in new LFS objects not being replicated to Geo secondary nodes](https://gitlab.com/gitlab-org/gitlab/-/issues/32696).
To resolve the issue, upgrade to GitLab 12.1 or newer.
To resolve the issue, upgrade to GitLab 12.1 or later.
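As a quick check, a version string can be compared against the affected series. The following is a minimal sketch, assuming the version is supplied as a plain `major.minor.patch` string. The helper name `is_lfs_bug_affected` is hypothetical.

```shell
# Hypothetical helper: exits with status 0 when the version belongs to the
# 11.11.x or 12.0.x series, and with status 1 otherwise.
is_lfs_bug_affected() {
  case "$1" in
    11.11|11.11.*|12.0|12.0.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage with a sample version string:
if is_lfs_bug_affected "12.0.4"; then
  echo "Affected: upgrade to GitLab 12.1 or later."
fi
```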
### Failures during backfill
### Failures during backfill
...
@@ -522,7 +529,7 @@ of the backfill queue, therefore these failures only clear up **after** the back
...
@@ -522,7 +529,7 @@ of the backfill queue, therefore these failures only clear up **after** the back
If you get a **secondary** node in a broken state and want to reset the replication state,
If you get a **secondary** node in a broken state and want to reset the replication state,
to start again from scratch, there are a few steps that can help you:
to start again from scratch, there are a few steps that can help you:
1. Stop Sidekiq and the Geo LogCursor
1. Stop Sidekiq and the Geo LogCursor.
It's possible to make Sidekiq stop gracefully, by making it stop getting new jobs and
It's possible to make Sidekiq stop gracefully, by making it stop getting new jobs and
wait until the current jobs finish processing.
wait until the current jobs finish processing.
...
@@ -545,7 +552,7 @@ to start again from scratch, there are a few steps that can help you:
...
@@ -545,7 +552,7 @@ to start again from scratch, there are a few steps that can help you:
gitlab-ctl tail sidekiq
gitlab-ctl tail sidekiq
```
```
1. Rename repository storage folders and create new ones. If you are not concerned about possible orphaned directories and files, then you can simply skip this step.
1. Rename repository storage folders and create new ones. If you are not concerned about possible orphaned directories and files, you can skip this step.
@@ -557,14 +564,14 @@ to start again from scratch, there are a few steps that can help you:
...
@@ -557,14 +564,14 @@ to start again from scratch, there are a few steps that can help you:
You may want to remove the `/var/opt/gitlab/git-data/repositories.old` in the future
You may want to remove the `/var/opt/gitlab/git-data/repositories.old` in the future
as soon as you have confirmed that you don't need it anymore, to save disk space.
as soon as you have confirmed that you don't need it anymore, to save disk space.
1. Optional. Rename other data folders and create new ones
1. Optional. Rename other data folders and create new ones.
WARNING:
WARNING:
You may still have files on the **secondary** node that have been removed from the **primary** node, but this
You may still have files on the **secondary** node that have been removed from the **primary** node, but this
removal has not been reflected. If you skip this step, these files are not removed at all from the Geo node.
removal has not been reflected. If you skip this step, these files are not removed from the Geo node.
Any uploaded content like file attachments, avatars or LFS objects are stored in a
Any uploaded content (like file attachments, avatars, or LFS objects) is stored in a
subfolder in one of the two paths below:
subfolder in one of these paths:
- `/var/opt/gitlab/gitlab-rails/shared`
- `/var/opt/gitlab/gitlab-rails/shared`
- `/var/opt/gitlab/gitlab-rails/uploads`
- `/var/opt/gitlab/gitlab-rails/uploads`
...
@@ -591,7 +598,7 @@ to start again from scratch, there are a few steps that can help you:
...
@@ -591,7 +598,7 @@ to start again from scratch, there are a few steps that can help you:
gitlab-ctl reconfigure
gitlab-ctl reconfigure
```
```
1. Reset the Tracking Database
1. Reset the Tracking Database.
```shell
```shell
gitlab-rake geo:db:drop # on a secondary app node
gitlab-rake geo:db:drop # on a secondary app node
...
@@ -599,7 +606,7 @@ to start again from scratch, there are a few steps that can help you:
...
@@ -599,7 +606,7 @@ to start again from scratch, there are a few steps that can help you:
gitlab-rake geo:db:setup # on a secondary app node
gitlab-rake geo:db:setup # on a secondary app node
```
```
1. Restart previously stopped services
1. Restart previously stopped services.
```shell
```shell
gitlab-ctl start
gitlab-ctl start
...
@@ -609,10 +616,10 @@ to start again from scratch, there are a few steps that can help you:
...
@@ -609,10 +616,10 @@ to start again from scratch, there are a few steps that can help you:
On the top bar, under **Menu > Admin > Geo > Nodes**,
On the top bar, under **Menu > Admin > Geo > Nodes**,
if the Design repositories progress bar shows
if the Design repositories progress bar shows
`Synced` and `Failed` greater than 100%, and negative `Queued`, then the instance
`Synced` and `Failed` greater than 100%, and negative `Queued`, the instance
is likely affected by
is likely affected by
[a bug in GitLab 13.2 and 13.3](https://gitlab.com/gitlab-org/gitlab/-/issues/241668).
[a bug in GitLab 13.2 and 13.3](https://gitlab.com/gitlab-org/gitlab/-/issues/241668).
It was [fixed in 13.4+](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40643).
It was [fixed in GitLab 13.4 and later](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40643).
To determine the actual replication status of design repositories in
To determine the actual replication status of design repositories in
a [Rails console](../../operations/rails_console.md):
a [Rails console](../../operations/rails_console.md):
...
@@ -663,7 +670,7 @@ determine the actual replication status of Design repositories.
...
@@ -663,7 +670,7 @@ determine the actual replication status of Design repositories.
`gitlab-ctl promote-to-primary-node` fails since it runs preflight checks.
`gitlab-ctl promote-to-primary-node` fails since it runs preflight checks.
If the [previous snippet](#design-repository-failures-on-mirrored-projects-and-project-imports)
If the [previous snippet](#design-repository-failures-on-mirrored-projects-and-project-imports)
shows that all designs are synced, then you can use the
shows that all designs are synced, you can use the
`--skip-preflight-checks` option or the `--force` option to move forward with
`--skip-preflight-checks` option or the `--force` option to move forward with
promotion.
promotion.
...
@@ -676,9 +683,9 @@ determine the actual replication status of Design repositories.
...
@@ -676,9 +683,9 @@ determine the actual replication status of Design repositories.
### Sync failure message: "Verification failed with: Error during verification: File is not checksummable"
### Sync failure message: "Verification failed with: Error during verification: File is not checksummable"
Until GitLab 14.6, certain data types which were missing on the Geo primary site were marked as "synced" on Geo secondary sites. This was because from the perspective of Geo secondary sites, the state matched the primary site and nothing more could be done on secondary sites.
In GitLab 14.5 and earlier, certain data types which were missing on the Geo primary site were marked as "synced" on Geo secondary sites. This was because from the perspective of Geo secondary sites, the state matched the primary site and nothing more could be done on secondary sites.
Secondaries would regularly try to sync these files again via the "verification" feature:
Secondaries would regularly try to sync these files again by using the "verification" feature:
- Verification fails since the file doesn't exist.
- Verification fails since the file doesn't exist.
- The file is marked "sync failed".
- The file is marked "sync failed".
...
@@ -703,11 +710,11 @@ After confirming this is the problem, the files on the primary site need to be f
...
@@ -703,11 +710,11 @@ After confirming this is the problem, the files on the primary site need to be f
- A non-atomic backup was restored.
- A non-atomic backup was restored.
- Services or servers or network infrastructure was interrupted/restarted during use.
- Services or servers or network infrastructure was interrupted/restarted during use.
The appropriate action sometimes depends on the cause. For example, you can remount an NFS share. Often, a root cause may not be apparent or not useful to discover. If you have regular backups, then it may be expedient to look through them and pull files from there.
The appropriate action sometimes depends on the cause. For example, you can remount an NFS share. Often, a root cause may not be apparent or not useful to discover. If you have regular backups, it may be expedient to look through them and pull files from there.
In some cases, a file may be determined to be of low value, and so it may be worth deleting the record.
In some cases, a file may be determined to be of low value, and so it may be worth deleting the record.
Geo itself is an excellent mitigation for files missing on the primary. If a file disappears on the primary but it was already synced to the secondary, then you can grab the secondary's file. In cases like this, the `File is not checksummable` error will not occur on Geo secondary sites, and only the primary will log this error.
Geo itself is an excellent mitigation for files missing on the primary. If a file disappears on the primary but it was already synced to the secondary, you can grab the secondary's file. In cases like this, the `File is not checksummable` error message will not occur on Geo secondary sites, and only the primary will log this error message.
This problem is more likely to show up in Geo secondary sites which were set up long after the original GitLab site. In this case, Geo is only surfacing an existing problem.
This problem is more likely to show up in Geo secondary sites which were set up long after the original GitLab site. In this case, Geo is only surfacing an existing problem.
...
@@ -725,17 +732,19 @@ This behavior affects only the following data types through GitLab 14.6:
...
@@ -725,17 +732,19 @@ This behavior affects only the following data types through GitLab 14.6:
| Uploads | 14.6 |
| Uploads | 14.6 |
| CI Job Artifacts | 14.6 |
| CI Job Artifacts | 14.6 |
[Since GitLab 14.7, files which are missing on the primary site are now treated as sync failures](https://gitlab.com/gitlab-org/gitlab/-/issues/348745) in order to make Geo visibly surface data loss risks. The sync/verification loop is therefore short-circuited. `last_sync_failure` is now set to `The file is missing on the Geo primary site`.
[Since GitLab 14.7, files that are missing on the primary site are now treated as sync failures](https://gitlab.com/gitlab-org/gitlab/-/issues/348745)
to make Geo visibly surface data loss risks. The sync/verification loop is
therefore short-circuited. `last_sync_failure` is now set to `The file is missing on the Geo primary site`.
## Fixing errors during a failover or when promoting a secondary to a primary node
## Fixing errors during a failover or when promoting a secondary to a primary node
The following are possible errors that might be encountered during failover or
The following are possible error messages that might be encountered during failover or
when promoting a secondary to a primary node with strategies to resolve them.
when promoting a secondary to a primary node with strategies to resolve them.
### Message: ActiveRecord::RecordInvalid: Validation failed: Name has already been taken
### Message: ActiveRecord::RecordInvalid: Validation failed: Name has already been taken
When [promoting a **secondary** site](../disaster_recovery/index.md#step-3-promoting-a-secondary-site),
When [promoting a **secondary** site](../disaster_recovery/index.md#step-3-promoting-a-secondary-site),
This can happen with XFS or file systems that list files in lexical order, because the
This can happen with XFS or file systems that list files in lexical order, because the
load order of the Omnibus command files can be different than expected, and a global function would get redefined.
load order of the Omnibus GitLab command files can be different than expected, and a global function would get redefined.
More details can be found in [the related issue](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6076).
More details can be found in [the related issue](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6076).
The workaround is to manually run the preflight checks and promote the database, by running
The workaround is to manually run the preflight checks and promote the database, by running
...
@@ -894,8 +903,8 @@ registry record related to the orphan files on disk.
...
@@ -894,8 +903,8 @@ registry record related to the orphan files on disk.
### Message: The redirect URI included is not valid
### Message: The redirect URI included is not valid
If you are able to log in to the **primary** node, but you receive this error
If you are able to sign in to the **primary** node, but you receive this error message
when attempting to log into a **secondary**, you should check that the Geo
when attempting to sign in to a **secondary**, you should verify the Geo
node's URL matches its external URL.
node's URL matches its external URL.
On the **primary** node:
On the **primary** node:
...
@@ -909,7 +918,7 @@ On the **primary** node:
...
@@ -909,7 +918,7 @@ On the **primary** node:
## Fixing common errors
## Fixing common errors
This section documents common errors reported in the Admin Area and how to fix them.
This section documents common error messages reported in the Admin Area, and how to fix them.
### Geo database configuration file is missing
### Geo database configuration file is missing
...
@@ -930,7 +939,7 @@ It is safest to use a fresh secondary, or reset the whole secondary by following
...
@@ -930,7 +939,7 @@ It is safest to use a fresh secondary, or reset the whole secondary by following
### Geo node has a database that is writable which is an indication it is not configured for replication with the primary node
### Geo node has a database that is writable which is an indication it is not configured for replication with the primary node
This error refers to a problem with the database replica on a **secondary** node,
This error message refers to a problem with the database replica on a **secondary** node,
which Geo expects to have access to. It usually means, either:
which Geo expects to have access to. It usually means, either:
- An unsupported replication method was used (for example, logical replication).
- An unsupported replication method was used (for example, logical replication).
...
@@ -943,7 +952,7 @@ Geo **secondary** sites require two separate PostgreSQL instances:
...
@@ -943,7 +952,7 @@ Geo **secondary** sites require two separate PostgreSQL instances:
- A read-only replica of the **primary** node.
- A read-only replica of the **primary** node.
- A regular, writable instance that holds replication metadata. That is, the Geo tracking database.
- A regular, writable instance that holds replication metadata. That is, the Geo tracking database.
This error indicates that the replica database in the **secondary** site is misconfigured and replication has stopped.
This error message indicates that the replica database in the **secondary** site is misconfigured and replication has stopped.
To restore the database and resume replication, you can do one of the following:
To restore the database and resume replication, you can do one of the following:
...
@@ -979,7 +988,7 @@ This can be caused by orphaned records in the project registry. You can clear th
...
@@ -979,7 +988,7 @@ This can be caused by orphaned records in the project registry. You can clear th
### Geo Admin Area returns 404 error for a secondary node
### Geo Admin Area returns 404 error for a secondary node
Sometimes `sudo gitlab-rake gitlab:geo:check` indicates that the **secondary** node is
Sometimes `sudo gitlab-rake gitlab:geo:check` indicates that the **secondary** node is
healthy, but a 404 error for the **secondary** node is returned in the Geo Admin Area on
healthy, but a 404 Not Found error message for the **secondary** node is returned in the Geo Admin Area on
the **primary** node.
the **primary** node.
To resolve this issue:
To resolve this issue:
...
@@ -997,7 +1006,7 @@ If using a load balancer, ensure that the load balancer's URL is set as the `ext
...
@@ -997,7 +1006,7 @@ If using a load balancer, ensure that the load balancer's URL is set as the `ext
### Geo Admin Area shows 'Unhealthy' after enabling Maintenance Mode
### Geo Admin Area shows 'Unhealthy' after enabling Maintenance Mode
In GitLab 13.9 through GitLab 14.3, when [GitLab Maintenance Mode](../../maintenance_mode/index.md) is enabled, the status of Geo secondary sites will stop getting updated. After 10 minutes, the status will become `Unhealthy`.
In GitLab 13.9 through GitLab 14.3, when [GitLab Maintenance Mode](../../maintenance_mode/index.md) is enabled, the status of Geo secondary sites will stop getting updated. After 10 minutes, the status changes to `Unhealthy`.
Geo secondary sites will continue to replicate and verify data, and the secondary sites should still be usable. You can use the [Sync status Rake task](#sync-status-rake-task) to determine the actual status of a secondary site during Maintenance Mode.
Geo secondary sites will continue to replicate and verify data, and the secondary sites should still be usable. You can use the [Sync status Rake task](#sync-status-rake-task) to determine the actual status of a secondary site during Maintenance Mode.
...
@@ -1006,7 +1015,7 @@ This bug was [fixed in GitLab 14.4](https://gitlab.com/gitlab-org/gitlab/-/issue
...
@@ -1006,7 +1015,7 @@ This bug was [fixed in GitLab 14.4](https://gitlab.com/gitlab-org/gitlab/-/issue
### GitLab Pages return 404 errors after promoting
### GitLab Pages return 404 errors after promoting
This is due to [Pages data not being managed by Geo](datatypes.md#limitations-on-replicationverification).
This is due to [Pages data not being managed by Geo](datatypes.md#limitations-on-replicationverification).
Find advice to resolve those errors in the
Find advice to resolve those error messages in the