Commit 19e5320c authored by Achilleas Pipinellis's avatar Achilleas Pipinellis

Add troubleshooting sections

parent 4f331147
......@@ -810,7 +810,7 @@ Role | Name | Upstream | Connection String
```
If the 'Role' column for any node says "FAILED", check the
[Troubleshooting section](troubleshooting/index.md) before proceeding.
[Troubleshooting section](troubleshooting.md) before proceeding.
Also, check that the `repmgr-check-master` command works successfully on each node:
......@@ -822,7 +822,7 @@ gitlab-ctl repmgr-check-master || echo 'This node is a standby repmgr node'
This command relies on exit codes to tell Consul whether a particular node is a master
or secondary. The most important thing here is that this command does not produce errors.
If there are errors it's most likely due to incorrect `gitlab-consul` database user permissions.
Check the [Troubleshooting section](troubleshooting/index.md) before proceeding.
Check the [Troubleshooting section](troubleshooting.md) before proceeding.
<div align="right">
<a type="button" class="btn btn-default" href="#setup-components">
......@@ -1425,7 +1425,7 @@ NOTE: **Note:**
If you encounter a `rake aborted!` error stating that PgBouncer is failing to connect to
PostgreSQL it may be that your PgBouncer node's IP address is missing from
PostgreSQL's `trust_auth_cidr_addresses` in `gitlab.rb` on your database nodes. See
[PgBouncer error `ERROR: pgbouncer cannot connect to server`](#pgbouncer-error-error-pgbouncer-cannot-connect-to-server)
[PgBouncer error `ERROR: pgbouncer cannot connect to server`](troubleshooting.md#pgbouncer-error-error-pgbouncer-cannot-connect-to-server)
in the Troubleshooting section before proceeding.
<div align="right">
......
# Troubleshooting a reference architecture set up
# Troubleshooting a reference architecture setup
This page serves as the troubleshooting documentation if you followed one of
the [reference architectures](index.md#reference-architectures).
......@@ -86,8 +86,117 @@ a workaround, in the mean time, to
## Troubleshooting Redis
If the application node cannot connect to the Redis node, check your firewall rules and
make sure Redis can accept TCP connections under port `6379`.
There are a lot of moving parts that needs to be taken care carefully
in order for the HA setup to work as expected.
Before proceeding with the troubleshooting below, check your firewall rules:
- Redis machines
- Accept TCP connection in `6379`
- Connect to the other Redis machines via TCP in `6379`
- Sentinel machines
- Accept TCP connection in `26379`
- Connect to other Sentinel machines via TCP in `26379`
- Connect to the Redis machines via TCP in `6379`
### Troubleshooting Redis replication
You can check if everything is correct by connecting to each server using
`redis-cli` application, and sending the `info replication` command as below.
```shell
/opt/gitlab/embedded/bin/redis-cli -h <redis-host-or-ip> -a '<redis-password>' info replication
```
When connected to a `Primary` Redis, you will see the number of connected
`replicas`, and a list of each with connection details:
```plaintext
# Replication
role:master
connected_replicas:1
replica0:ip=10.133.5.21,port=6379,state=online,offset=208037514,lag=1
master_repl_offset:208037658
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:206989083
repl_backlog_histlen:1048576
```
When it's a `replica`, you will see details of the primary connection and if
its `up` or `down`:
```plaintext
# Replication
role:replica
master_host:10.133.1.58
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
replica_repl_offset:208096498
replica_priority:100
replica_read_only:1
connected_replicas:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
```
### Troubleshooting Sentinel
If you get an error like: `Redis::CannotConnectError: No sentinels available.`,
there may be something wrong with your configuration files or it can be related
to [this issue](https://github.com/redis/redis-rb/issues/531).
You must make sure you are defining the same value in `redis['master_name']`
and `redis['master_pasword']` as you defined for your sentinel node.
The way the Redis connector `redis-rb` works with sentinel is a bit
non-intuitive. We try to hide the complexity in omnibus, but it still requires
a few extra configurations.
---
To make sure your configuration is correct:
1. SSH into your GitLab application server
1. Enter the Rails console:
```shell
# For Omnibus installations
sudo gitlab-rails console
# For source installations
sudo -u git rails console -e production
```
1. Run in the console:
```ruby
redis = Redis.new(Gitlab::Redis::SharedState.params)
redis.info
```
Keep this screen open and try to simulate a failover below.
1. To simulate a failover on primary Redis, SSH into the Redis server and run:
```shell
# port must match your primary redis port, and the sleep time must be a few seconds bigger than defined one
redis-cli -h localhost -p 6379 DEBUG sleep 20
```
1. Then back in the Rails console from the first step, run:
```ruby
redis.info
```
You should see a different port after a few seconds delay
(the failover/reconnect time).
## Troubleshooting Gitaly
......@@ -327,3 +436,82 @@ or
```shell
curl http[s]://localhost:<EXPORTER LISTENING PORT>/-/metric
```
## Troubleshooting PgBouncer
In case you are experiencing any issues connecting through PgBouncer, the first place to check is always the logs:
```shell
sudo gitlab-ctl tail pgbouncer
```
Additionally, you can check the output from `show databases` in the [Administrative console](#administrative-console). In the output, you would expect to see values in the `host` field for the `gitlabhq_production` database. Additionally, `current_connections` should be greater than 1.
### Message: `LOG: invalid CIDR mask in address`
See the suggested fix [in Geo documentation](../geo/replication/troubleshooting.md#message-log--invalid-cidr-mask-in-address).
### Message: `LOG: invalid IP mask "md5": Name or service not known`
See the suggested fix [in Geo documentation](../geo/replication/troubleshooting.md#message-log--invalid-ip-mask-md5-name-or-service-not-known).
## Troubleshooting PostgreSQL
In case you are experiencing any issues connecting through PgBouncer, the first place to check is always the logs:
```shell
sudo gitlab-ctl tail postgresql
```
### Consul and PostgreSQL changes not taking effect
Due to the potential impacts, `gitlab-ctl reconfigure` only reloads Consul and PostgreSQL, it will not restart the services. However, not all changes can be activated by reloading.
To restart either service, run `gitlab-ctl restart SERVICE`
For PostgreSQL, it is usually safe to restart the master node by default. Automatic failover defaults to a 1 minute timeout. Provided the database returns before then, nothing else needs to be done. To be safe, you can stop `repmgrd` on the standby nodes first with `gitlab-ctl stop repmgrd`, then start afterwards with `gitlab-ctl start repmgrd`.
On the Consul server nodes, it is important to restart the Consul service in a controlled fashion. Read our [Consul documentation](../high_availability/consul.md#restarting-the-server-cluster) for instructions on how to restart the service.
### `gitlab-ctl repmgr-check-master` command produces errors
If this command displays errors about database permissions it is likely that something failed during
install, resulting in the `gitlab-consul` database user getting incorrect permissions. Follow these
steps to fix the problem:
1. On the master database node, connect to the database prompt - `gitlab-psql -d template1`
1. Delete the `gitlab-consul` user - `DROP USER "gitlab-consul";`
1. Exit the database prompt - `\q`
1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) and the user will be re-added with the proper permissions.
1. Change to the `gitlab-consul` user - `su - gitlab-consul`
1. Try the check command again - `gitlab-ctl repmgr-check-master`.
Now there should not be errors. If errors still occur then there is another problem.
### PgBouncer error `ERROR: pgbouncer cannot connect to server`
You may get this error when running `gitlab-rake gitlab:db:configure` or you
may see the error in the PgBouncer log file.
```plaintext
PG::ConnectionBad: ERROR: pgbouncer cannot connect to server
```
The problem may be that your PgBouncer node's IP address is not included in the
`trust_auth_cidr_addresses` setting in `/etc/gitlab/gitlab.rb` on the database nodes.
You can confirm that this is the issue by checking the PostgreSQL log on the master
database node. If you see the following error then `trust_auth_cidr_addresses`
is the problem.
```plaintext
2018-03-29_13:59:12.11776 FATAL: no pg_hba.conf entry for host "123.123.123.123", user "pgbouncer", database "gitlabhq_production", SSL off
```
To fix the problem, add the IP address to `/etc/gitlab/gitlab.rb`.
```ruby
postgresql['trust_auth_cidr_addresses'] = %w(123.123.123.123/32 <other_cidrs>)
```
[Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect.
\ No newline at end of file
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment