Add basic documentation on the consul cluster

Commit 2324c152 authored Sep 27, 2017 by Ian Baum Committed by Marin Jankovski Sep 27, 2017
Showing with 80 additions and 0 deletions

doc/administration/high_availability/alpha_database.md +1 -0

doc/administration/high_availability/consul.md +79 -0

--- a/doc/administration/high_availability/alpha_database.md
+++ b/doc/administration/high_availability/alpha_database.md
@@ -518,5 +518,6 @@ Read more on high-availability configuration:
 1. [Configure NFS](nfs.md)
 1. [Configure the GitLab application servers](gitlab.md)
 1. [Configure the load balancers](load_balancer.md)
+1. [Manage the bundled Consul cluster](consul.md)
 [reconfigure GitLab]: ../restart_gitlab.md#omnibus-gitlab-reconfigure
--- a/doc/administration/high_availability/consul.md
+++ b/doc/administration/high_availability/consul.md
+# Working with the bundled Consul service
+## Overview
+As part of its High Availability stack, GitLab EEP includes a bundled version of [Consul](http://consul.io) that can be managed through `/etc/gitlab/gitlab.rb`.
+A Consul cluster consists of multiple server agents, as well as client agents that run on other nodes that need to talk to the Consul cluster.
+## Operations
+### Checking cluster membership
+To see which nodes are part of the cluster, run the following on any member of the cluster:
+```
+# /opt/gitlab/embedded/bin/consul members
+Node            Address               Status  Type    Build  Protocol  DC
+consul-a        XX.XX.X.Y:8301        alive   server  0.9.0  2         gitlab_consul
+consul-b        XX.XX.X.Y:8301        alive   server  0.9.0  2         gitlab_consul
+consul-c        XX.XX.X.Y:8301        alive   server  0.9.0  2         gitlab_consul
+db-a            XX.XX.X.Y:8301        alive   client  0.9.0  2         gitlab_consul
+db-b            XX.XX.X.Y:8301        alive   client  0.9.0  2         gitlab_consul
+```
+Ideally all nodes will have a `Status` of `alive`.
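
The status check above can also be scripted. The following is a minimal Python sketch, not part of GitLab or Consul. It assumes you have already captured the text output of `consul members` and that it keeps the whitespace-separated column layout shown above. The helper name `non_alive_members` is hypothetical.

```python
from typing import List, Tuple


def non_alive_members(members_output: str) -> List[Tuple[str, str]]:
    """Return (node, status) pairs for every member whose Status is not 'alive'.

    `members_output` is the raw text printed by `consul members`,
    including its header row.
    """
    rows = [line.split() for line in members_output.strip().splitlines()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Locate columns by header name rather than by fixed position.
    node_col = header.index("Node")
    status_col = header.index("Status")
    return [
        (row[node_col], row[status_col])
        for row in body
        if row[status_col] != "alive"
    ]
```

An empty result means every listed member reported `alive`. Any other result names the nodes that need attention.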
+### Restarting the server cluster
+**Note**: This section only applies to server agents. It is safe to restart client agents whenever needed.
+If it is necessary to restart the server cluster, it is important to do this in a controlled fashion in order to maintain quorum. If quorum is lost, you will need to follow the Consul [outage recovery](#outage-recovery) process to recover the cluster.
+To be safe, we recommend you only restart one server agent at a time to ensure the cluster remains intact.
+For larger clusters, it is possible to restart multiple agents at a time. See the [Consul consensus document](https://www.consul.io/docs/internals/consensus.html#deployment-table) for how many failures a cluster of a given size can tolerate. This is the number of simultaneous restarts it can sustain.
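
The failure tolerance follows from the majority rule used by Consul's consensus protocol: a quorum is more than half of the server agents. As a rough illustration, the arithmetic can be sketched in Python. The function names are hypothetical and only reproduce the calculation behind the deployment table.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of `servers` server agents."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many server agents can be down while a quorum still exists."""
    return servers - quorum_size(servers)


if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        print(n, quorum_size(n), failure_tolerance(n))
```

For example, a three-server cluster needs two servers for quorum and so tolerates one restart at a time, while a five-server cluster tolerates two.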
+## Troubleshooting
+### Consul server agents unable to communicate
+By default, the server agents will attempt to [bind](https://www.consul.io/docs/agent/options.html#_bind) to `0.0.0.0`, but they will advertise the first private IP address on the node for other agents to communicate with them. If the other nodes cannot reach a node on this address, the cluster will have a failed status.
+You will see messages like the following in `gitlab-ctl tail consul` output if you are running into this issue:
+```
+2017-09-25_19:53:39.90821     2017/09/25 19:53:39 [WARN] raft: no known peers, aborting election
+2017-09-25_19:53:41.74356     2017/09/25 19:53:41 [ERR] agent: failed to sync remote state: No cluster leader
+```
+To fix this:
+1. On each node, pick an address that all of the other nodes can use to reach it.
+1. Update your `/etc/gitlab/gitlab.rb`:
+   ```ruby
+   consul['configuration'] = {
+     ...
+     bind_addr: 'IP ADDRESS'
+   }
+   ```
+1. Run `gitlab-ctl reconfigure`
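
Before settling on an address in the first step, it can help to confirm that other nodes can actually open a connection to it. The following Python sketch is one way to do that, under some assumptions: it only tests TCP, it uses port 8301 because that is the port shown in the `consul members` output above, and the helper name `can_reach` is hypothetical. A successful TCP connection does not prove that every kind of agent traffic will get through.

```python
import socket


def can_reach(host: str, port: int = 8301, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run it from each of the other nodes against the candidate address, for example `can_reach("IP ADDRESS")` with the same value you plan to use for `bind_addr`.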
+### Outage recovery
+If you have lost enough server agents in the cluster to break quorum, the cluster is considered failed, and it will not function without manual intervention.
+#### Recreate from scratch
+By default, GitLab does not store anything in the Consul cluster that cannot be recreated. To erase the Consul database and reinitialize:
+```
+# gitlab-ctl stop consul
+# rm -rf /var/opt/gitlab/consul/data
+# gitlab-ctl start consul
+```
+After this, the cluster should start back up, and the server agents should rejoin. Shortly after that, the client agents should rejoin as well.
+#### Recover a failed cluster
+If you have taken advantage of Consul to store other data and want to restore the failed cluster, please follow the [Consul guide](https://www.consul.io/docs/guides/outage.html) to recover a failed cluster.