Commit c10bde1f authored by Achilleas Pipinellis's avatar Achilleas Pipinellis

Merge branch 'docs/add-another-fdw-troubleshoot-item' into 'master'

Add new troubleshooting step and refactor Geo replication docs

Closes #63210

See merge request gitlab-org/gitlab-ce!29783
parents a53fc9f1 b765b8e3
# Geo Troubleshooting **[PREMIUM ONLY]** # Geo Troubleshooting **[PREMIUM ONLY]**
NOTE: **Note:**
This list is an attempt to document all the moving parts that can go wrong.
We are working into getting all this steps verified automatically in a
rake task in the future.
Setting up Geo requires careful attention to details and sometimes it's easy to Setting up Geo requires careful attention to details and sometimes it's easy to
miss a step. Here is a list of questions you should ask to try to detect miss a step.
what you need to fix (all commands and path locations are for Omnibus installs):
Here is a list of steps you should take to attempt to fix problem:
- Perform [basic troubleshooting](#basic-troubleshooting).
- Fix any [replication errors](#fixing-replication-errors).
- Fix any [Foreign Data Wrapper](#fixing-foreign-data-wrapper-errors) errors.
- Fix any [common](#fixing-common-errors) errors.
## First check the health of the **secondary** node ## Basic troubleshooting
Before attempting more advanced troubleshooting:
- Check [the health of the **secondary** node](#check-the-health-of-the-secondary-node).
- Check [if PostgreSQL replication is working](#check-if-postgresql-replication-is-working).
### Check the health of the **secondary** node
Visit the **primary** node's **Admin Area > Geo** (`/admin/geo/nodes`) in Visit the **primary** node's **Admin Area > Geo** (`/admin/geo/nodes`) in
your browser. We perform the following health checks on each **secondary** node your browser. We perform the following health checks on each **secondary** node
...@@ -23,10 +31,12 @@ to help identify if something is wrong: ...@@ -23,10 +31,12 @@ to help identify if something is wrong:
![Geo health check](img/geo_node_healthcheck.png) ![Geo health check](img/geo_node_healthcheck.png)
For information on how to resolve common errors reported from the UI, see [common errors](#common-errors). For information on how to resolve common errors reported from the UI, see
[Fixing Common Errors](#fixing-common-errors).
If the UI is not working, or you are unable to log in, you can run the Geo If the UI is not working, or you are unable to log in, you can run the Geo
health check manually to get this information as well as a few more details. health check manually to get this information as well as a few more details.
This rake task can be run on an app node in the **primary** or **secondary** This rake task can be run on an app node in the **primary** or **secondary**
Geo nodes: Geo nodes:
...@@ -36,7 +46,7 @@ sudo gitlab-rake gitlab:geo:check ...@@ -36,7 +46,7 @@ sudo gitlab-rake gitlab:geo:check
Example output: Example output:
``` ```text
Checking Geo ... Checking Geo ...
GitLab Geo is available ... yes GitLab Geo is available ... yes
...@@ -68,7 +78,7 @@ sudo gitlab-rake geo:status ...@@ -68,7 +78,7 @@ sudo gitlab-rake geo:status
Example output: Example output:
``` ```text
http://secondary.example.com/ http://secondary.example.com/
----------------------------------------------------- -----------------------------------------------------
GitLab Version: 11.10.4-ee GitLab Version: 11.10.4-ee
...@@ -89,16 +99,21 @@ http://secondary.example.com/ ...@@ -89,16 +99,21 @@ http://secondary.example.com/
Last status report was: 2 minutes ago Last status report was: 2 minutes ago
``` ```
## Is Postgres replication working? ### Check if PostgreSQL replication is working
To check if PostgreSQL replication is working, check if:
- [Nodes are pointing to the correct database instance](#are-nodes-pointing-to-the-correct-database-instance).
- [Geo can detect the current node correctly](#can-geo-detect-the-current-node-correctly).
### Are my nodes pointing to the correct database instance? #### Are nodes pointing to the correct database instance?
You should make sure your **primary** Geo node points to the instance with You should make sure your **primary** Geo node points to the instance with
writing permissions. writing permissions.
Any **secondary** nodes should point only to read-only instances. Any **secondary** nodes should point only to read-only instances.
### Can Geo detect my current node correctly? #### Can Geo detect the current node correctly?
Geo uses the defined node from the **Admin Area > Geo** screen, and tries to match Geo uses the defined node from the **Admin Area > Geo** screen, and tries to match
it with the value defined in the `/etc/gitlab/gitlab.rb` configuration file. it with the value defined in the `/etc/gitlab/gitlab.rb` configuration file.
...@@ -112,29 +127,38 @@ sudo gitlab-rails runner "puts Gitlab::Geo.current_node.inspect" ...@@ -112,29 +127,38 @@ sudo gitlab-rails runner "puts Gitlab::Geo.current_node.inspect"
and expect something like: and expect something like:
``` ```ruby
#<GeoNode id: 2, schema: "https", host: "gitlab.example.com", port: 443, relative_url_root: "", primary: false, ...> #<GeoNode id: 2, schema: "https", host: "gitlab.example.com", port: 443, relative_url_root: "", primary: false, ...>
``` ```
By running the command above, `primary` should be `true` when executed in By running the command above, `primary` should be `true` when executed in
the **primary** node, and `false` on any **secondary** node. the **primary** node, and `false` on any **secondary** node.
## How do I fix the message, "ERROR: replication slots can only be used if max_replication_slots > 0"? ## Fixing replication errors
The following sections outline troubleshooting steps for fixing replication
errors.
### Message: "ERROR: replication slots can only be used if max_replication_slots > 0"?
This means that the `max_replication_slots` PostgreSQL variable needs to This means that the `max_replication_slots` PostgreSQL variable needs to
be set on the **primary** database. In GitLab 9.4, we have made this setting be set on the **primary** database. In GitLab 9.4, we have made this setting
default to 1. You may need to increase this value if you have more default to 1. You may need to increase this value if you have more
**secondary** nodes. Be sure to restart PostgreSQL for this to take **secondary** nodes.
Be sure to restart PostgreSQL for this to take
effect. See the [PostgreSQL replication effect. See the [PostgreSQL replication
setup][database-pg-replication] guide for more details. setup][database-pg-replication] guide for more details.
## How do I fix the message, "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"? ### Message: "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"?
This occurs when PostgreSQL does not have a replication slot for the This occurs when PostgreSQL does not have a replication slot for the
**secondary** node by that name. You may want to rerun the [replication **secondary** node by that name.
You may want to rerun the [replication
process](database.md) on the **secondary** node . process](database.md) on the **secondary** node .
## How do I fix the message, "Command exceeded allowed execution time" when setting up replication? ### Message: "Command exceeded allowed execution time" when setting up replication?
This may happen while [initiating the replication process][database-start-replication] on the **secondary** node, This may happen while [initiating the replication process][database-start-replication] on the **secondary** node,
and indicates that your initial dataset is too large to be replicated in the default timeout (30 minutes). and indicates that your initial dataset is too large to be replicated in the default timeout (30 minutes).
...@@ -153,7 +177,7 @@ sudo gitlab-ctl \ ...@@ -153,7 +177,7 @@ sudo gitlab-ctl \
This will give the initial replication up to six hours to complete, rather than This will give the initial replication up to six hours to complete, rather than
the default thirty minutes. Adjust as required for your installation. the default thirty minutes. Adjust as required for your installation.
## How do I fix the message, "PANIC: could not write to file 'pg_xlog/xlogtemp.123': No space left on device" ### Message: "PANIC: could not write to file 'pg_xlog/xlogtemp.123': No space left on device"
Determine if you have any unused replication slots in the **primary** database. This can cause large amounts of Determine if you have any unused replication slots in the **primary** database. This can cause large amounts of
log data to build up in `pg_xlog`. Removing the unused slots can reduce the amount of space used in the `pg_xlog`. log data to build up in `pg_xlog`. Removing the unused slots can reduce the amount of space used in the `pg_xlog`.
...@@ -184,11 +208,12 @@ Slots where `active` is `f` are not active. ...@@ -184,11 +208,12 @@ Slots where `active` is `f` are not active.
SELECT pg_drop_replication_slot('<name_of_extra_slot>'); SELECT pg_drop_replication_slot('<name_of_extra_slot>');
``` ```
## Very large repositories never successfully synchronize on the **secondary** node ### Very large repositories never successfully synchronize on the **secondary** node
GitLab places a timeout on all repository clones, including project imports GitLab places a timeout on all repository clones, including project imports
and Geo synchronization operations. If a fresh `git clone` of a repository and Geo synchronization operations. If a fresh `git clone` of a repository
on the primary takes more than a few minutes, you may be affected by this. on the primary takes more than a few minutes, you may be affected by this.
To increase the timeout, add the following line to `/etc/gitlab/gitlab.rb` To increase the timeout, add the following line to `/etc/gitlab/gitlab.rb`
on the **secondary** node: on the **secondary** node:
...@@ -205,7 +230,7 @@ sudo gitlab-ctl reconfigure ...@@ -205,7 +230,7 @@ sudo gitlab-ctl reconfigure
This will increase the timeout to three hours (10800 seconds). Choose a time This will increase the timeout to three hours (10800 seconds). Choose a time
long enough to accommodate a full clone of your largest repositories. long enough to accommodate a full clone of your largest repositories.
## How to reset Geo **secondary** node replication ### Reseting Geo **secondary** node replication
If you get a **secondary** node in a broken state and want to reset the replication state, If you get a **secondary** node in a broken state and want to reset the replication state,
to start again from scratch, there are a few steps that can help you: to start again from scratch, there are a few steps that can help you:
...@@ -289,12 +314,16 @@ to start again from scratch, there are a few steps that can help you: ...@@ -289,12 +314,16 @@ to start again from scratch, there are a few steps that can help you:
gitlab-ctl start gitlab-ctl start
``` ```
## How do I fix a "Foreign Data Wrapper (FDW) is not configured" error? ## Fixing Foreign Data Wrapper errors
This section documents ways to fix potential Foreign Data Wrapper errors.
### "Foreign Data Wrapper (FDW) is not configured" error
When setting up Geo, you might see this warning in the `gitlab-rake When setting up Geo, you might see this warning in the `gitlab-rake
gitlab:geo:check` output: gitlab:geo:check` output:
``` ```text
GitLab Geo tracking database Foreign Data Wrapper schema is up-to-date? ... foreign data wrapper is not configured GitLab Geo tracking database Foreign Data Wrapper schema is up-to-date? ... foreign data wrapper is not configured
``` ```
...@@ -307,7 +336,7 @@ There are a few key points to remember: ...@@ -307,7 +336,7 @@ There are a few key points to remember:
By default, the Geo secondary and tracking database are running on the By default, the Geo secondary and tracking database are running on the
same host on different ports. That is, 5432 and 5431 respectively. same host on different ports. That is, 5432 and 5431 respectively.
### Checking configuration #### Checking configuration
NOTE: **Note:** NOTE: **Note:**
The following steps are for Omnibus installs only. Using Geo with source-based installs was **deprecated** in GitLab 11.5. The following steps are for Omnibus installs only. Using Geo with source-based installs was **deprecated** in GitLab 11.5.
...@@ -419,7 +448,7 @@ should see something like this: ...@@ -419,7 +448,7 @@ should see something like this:
- `geo_postgresql['fdw_external_user']` - `geo_postgresql['fdw_external_user']`
- `geo_postgresql['fdw_external_password']` - `geo_postgresql['fdw_external_password']`
### Manual reload of FDW schema #### Manual reload of FDW schema
If you're still unable to get FDW working, you may want to try a manual If you're still unable to get FDW working, you may want to try a manual
reload of the FDW schema. To manually reload the FDW schema: reload of the FDW schema. To manually reload the FDW schema:
...@@ -459,9 +488,25 @@ reload of the FDW schema. To manually reload the FDW schema: ...@@ -459,9 +488,25 @@ reload of the FDW schema. To manually reload the FDW schema:
[database-start-replication]: database.md#step-3-initiate-the-replication-process [database-start-replication]: database.md#step-3-initiate-the-replication-process
[database-pg-replication]: database.md#postgresql-replication [database-pg-replication]: database.md#postgresql-replication
## Common errors ### "Geo database has an outdated FDW remote schema" error
GitLab can error with a `Geo database has an outdated FDW remote schema` message.
For example:
This section documents common errors reported in the admin UI and how to fix them. ```text
Geo database has an outdated FDW remote schema. It contains 229 of 236 expected tables. Please refer to Geo Troubleshooting.
```
To resolve this, run the following command:
```sh
sudo gitlab-rake geo:db:refresh_foreign_tables
```
## Fixing common errors
This section documents common errors reported in the Admin UI and how to fix them.
### Geo database configuration file is missing ### Geo database configuration file is missing
...@@ -470,7 +515,6 @@ GitLab cannot find or doesn't have permission to access the `database_geo.yml` c ...@@ -470,7 +515,6 @@ GitLab cannot find or doesn't have permission to access the `database_geo.yml` c
In an Omnibus GitLab installation, the file should be in `/var/opt/gitlab/gitlab-rails/etc`. In an Omnibus GitLab installation, the file should be in `/var/opt/gitlab/gitlab-rails/etc`.
If it doesn't exist or inadvertent changes have been made to it, run `sudo gitlab-ctl reconfigure` to restore it to its correct state. If it doesn't exist or inadvertent changes have been made to it, run `sudo gitlab-ctl reconfigure` to restore it to its correct state.
If this path is mounted on a remote volume, please check your volume configuration and that it has correct permissions. If this path is mounted on a remote volume, please check your volume configuration and that it has correct permissions.
### Geo node has a database that is writable which is an indication it is not configured for replication with the primary node. ### Geo node has a database that is writable which is an indication it is not configured for replication with the primary node.
...@@ -503,7 +547,7 @@ Make sure you follow the [Geo database replication](database.md) instructions fo ...@@ -503,7 +547,7 @@ Make sure you follow the [Geo database replication](database.md) instructions fo
If you are using GitLab Omnibus installation, something might have failed during upgrade. You can: If you are using GitLab Omnibus installation, something might have failed during upgrade. You can:
- Run `sudo gitlab-ctl reconfigure`. - Run `sudo gitlab-ctl reconfigure`.
- Manually trigger the database migration by running: `sudo gitlab-rake geo:db:migrate` as root on the **secondary** node. - Manually trigger the database migration by running: `sudo gitlab-rake geo:db:migrate` as root on the **secondary** node.
### Geo database is not configured to use Foreign Data Wrapper ### Geo database is not configured to use Foreign Data Wrapper
...@@ -511,4 +555,4 @@ If you are using GitLab Omnibus installation, something might have failed during ...@@ -511,4 +555,4 @@ If you are using GitLab Omnibus installation, something might have failed during
This error means the Geo Tracking Database doesn't have the FDW server and credentials This error means the Geo Tracking Database doesn't have the FDW server and credentials
configured. configured.
See [How do I fix a "Foreign Data Wrapper (FDW) is not configured" error?](#how-do-i-fix-a-foreign-data-wrapper-fdw-is-not-configured-error). See ["Foreign Data Wrapper (FDW) is not configured" error?](#foreign-data-wrapper-fdw-is-not-configured-error).
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment