# Geo configuration
>**Note:**
This is the documentation for the Omnibus GitLab packages. For installations
from source, follow the [**Geo nodes configuration for installations
from source**](configuration_source.md) guide.
## Configuring a new secondary node
>**Note:**
This is the final step in setting up a secondary Geo node. Stages of the
setup process must be completed in the documented order.
Before attempting the steps in this stage, [complete all prior stages](README.md#using-omnibus-gitlab).
The basic steps of configuring a secondary node are to replicate required
configurations between the primary and the secondaries; to configure a tracking
database on each secondary; and to start GitLab on the secondary node.
You are encouraged to first read through all the steps before executing them
in your testing/production environment.
>**Notes:**
- **Do not** set up any custom authentication on the secondary nodes; this is
handled by the primary node.
- **Do not** add anything in the secondaries' Geo Nodes admin area
(**Admin Area ➔ Geo Nodes**). This is handled solely by the primary node.
### Step 1. Manually replicate secret GitLab values
GitLab stores a number of secret values in the `/etc/gitlab/gitlab-secrets.json`
file which *must* match between the primary and secondary nodes. Until there is
a means of automatically replicating these between nodes (see
[issue #3789](https://gitlab.com/gitlab-org/gitlab-ee/issues/3789)), they must
be manually replicated to the secondary.
1. SSH into the **primary** node, and execute the command below:
```bash
sudo cat /etc/gitlab/gitlab-secrets.json
```
This will display the secrets that need to be replicated, in JSON format.
1. SSH into the **secondary** node and login as the `root` user:
```
sudo -i
```
1. Make a backup of any existing secrets:
```bash
mv /etc/gitlab/gitlab-secrets.json /etc/gitlab/gitlab-secrets.json.`date +%F`
```
1. Copy `/etc/gitlab/gitlab-secrets.json` from the primary to the secondary, or
copy-and-paste the file contents between nodes:
```bash
sudo editor /etc/gitlab/gitlab-secrets.json
# paste the output of the `cat` command you ran on the primary
# save and exit
```
1. Ensure the file permissions are correct:
```bash
chown root:root /etc/gitlab/gitlab-secrets.json
chmod 0600 /etc/gitlab/gitlab-secrets.json
```
1. Reconfigure the secondary node for the change to take effect:
```
gitlab-ctl reconfigure
```
Once reconfigured, the secondary will automatically start
replicating missing data from the primary in a process known as backfill.
Meanwhile, the primary node will start to notify the secondary of any changes, so
that the secondary can act on those notifications immediately.
Make sure the secondary instance is
running and accessible. You can log in to the secondary node
with the same credentials as used on the primary.
### Step 2. Manually replicate primary SSH host keys
GitLab integrates with the system-installed SSH daemon, designating a user
(typically named git) through which all access requests are handled.
In a [Disaster Recovery](../disaster_recovery/index.md) situation, GitLab system
administrators promote a secondary Geo replica to a primary and can point the
DNS records for the primary domain at the secondary, avoiding the need to update
every reference to the primary domain (such as Git remotes and API URLs).
Without matching host keys, all SSH requests to the newly promoted primary node
would fail due to an SSH host key mismatch. To prevent this, the primary SSH host
keys must be manually replicated to the secondary node.
1. SSH into the **secondary** node and login as the `root` user:
```
sudo -i
```
1. Make a backup of any existing SSH host keys:
```bash
find /etc/ssh -iname 'ssh_host_*' -exec cp {} {}.backup.`date +%F` \;
```
1. SSH into the **primary** node, and execute the command below:
```bash
sudo find /etc/ssh -iname 'ssh_host_*' -not -iname '*.pub'
```
1. For each file in that list, copy the file from the primary node to
the **same** location on your **secondary** node.
1. On your **secondary** node, ensure the file permissions are correct:
```bash
chown root:root /etc/ssh/ssh_host_*
chmod 0600 /etc/ssh/ssh_host_*
```
1. Regenerate the public keys from the private keys:
```bash
find /etc/ssh -iname 'ssh_host_*' -not -iname '*.backup*' -exec sh -c 'ssh-keygen -y -f "{}" > "{}.pub"' \;
```
1. Restart sshd:
```bash
service ssh restart
```
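To confirm the replicated host keys match, you can compare key fingerprints on the
primary and the secondary. A minimal check, run on both nodes (the output should be
identical):
```bash
# Print the fingerprint of every SSH host public key
for file in /etc/ssh/ssh_host_*_key.pub; do ssh-keygen -lf "$file"; done
```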
### Step 3. (Optional) Enabling hashed storage (from GitLab 10.0)
>**Warning:**
Hashed storage is in **Beta**. It is not considered production-ready. See
[Hashed Storage](../repository_storage_types.md) for more detail,
and for the latest updates, check
[infrastructure issue #2821](https://gitlab.com/gitlab-com/infrastructure/issues/2821).
Using hashed storage significantly improves Geo replication - project and group
renames no longer require synchronization between nodes.
1. Visit the **primary** node's **Admin Area ➔ Settings**
(`/admin/application_settings`) in your browser
1. In the `Repository Storages` section, check `Create new projects using hashed storage paths`:
![](img/hashed-storage.png)
### Step 4. (Optional) Configuring the secondary to trust the primary
You can safely skip this step if your primary uses a CA-issued HTTPS certificate.
If your primary is using a self-signed certificate for *HTTPS* support, you will
need to add that certificate to the secondary's trust store. Retrieve the
certificate from the primary and follow
[these instructions](https://docs.gitlab.com/omnibus/settings/ssl.html)
on the secondary.
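If you don't already have the certificate file at hand, one way to retrieve it from
the primary is with `openssl s_client`. This is only a sketch, assuming the
hypothetical hostname `primary.geo.example.com`:
```bash
# Fetch the primary's HTTPS certificate and save it in PEM format
echo | openssl s_client -connect primary.geo.example.com:443 -servername primary.geo.example.com 2>/dev/null \
  | openssl x509 -outform PEM > primary.geo.example.com.crt
```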
### Step 5. Enable Git access over HTTP/HTTPS
Geo synchronizes repositories over HTTP/HTTPS, and therefore requires this clone
method to be enabled. Navigate to **Admin Area ➔ Settings**
(`/admin/application_settings`) on the primary node, and set
`Enabled Git access protocols` to `Both SSH and HTTP(S)` or `Only HTTP(S)`.
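As a quick sanity check that HTTP(S) Git access is enabled, you can list refs for a
project on the primary. A sketch using a hypothetical project URL (you will be
prompted for credentials if the project is private):
```bash
git ls-remote https://primary.geo.example.com/group/project.git
```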
### Step 6. Verify proper functioning of the secondary node
Congratulations! Your secondary Geo node is now configured!
You can log in to the secondary node with the same credentials you used on the
primary. Visit the secondary node's **Admin Area ➔ Geo Nodes**
(`/admin/geo_nodes`) in your browser to check if it's correctly identified as a
secondary Geo node and if Geo is enabled.
The initial replication, or 'backfill', will probably still be in progress. You
can monitor the synchronization process on each Geo node from the primary
node's Geo Nodes dashboard in your browser.
![Geo dashboard](img/geo-node-dashboard.png)
If your installation isn't working properly, check the
[troubleshooting document](troubleshooting.md).
The two most obvious issues that can become apparent in the dashboard are:
1. Database replication not working well
1. Instance-to-instance notification not working. In that case, it can be
one of the following:
- You are using a custom certificate or custom CA (see the
[troubleshooting document](troubleshooting.md))
- The instance is firewalled (check your firewall rules)
Please note that disabling a secondary node will stop the sync process.
Also note that if `git_data_dirs` is customized on the primary for multiple
repository shards, you must duplicate the same configuration on the secondary;
a quick way to compare the two nodes is sketched below.
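Assuming the default Omnibus configuration path, you can print the `git_data_dirs`
block on each node and compare the output manually:
```bash
# Run on the primary and on the secondary; shard names and paths must match
sudo grep -A 10 "git_data_dirs" /etc/gitlab/gitlab.rb
```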
Point your users to the ["Using a Geo Server" guide](using_a_geo_server.md).
Currently, this is what is synced:
* Git repositories
* Wikis
* LFS objects
* Issues, merge requests, snippets, and comment attachments
* Users, groups, and project avatars
## Selective synchronization
Geo supports selective synchronization, which allows admins to choose
which projects should be synchronized by secondary nodes.
It is important to note that selective synchronization does not:
1. Restrict permissions from secondary nodes.
1. Hide project metadata from secondary nodes.
* Since Geo currently relies on PostgreSQL replication, all project metadata
gets replicated to secondary nodes, but repositories that have not been
selected will be empty.
1. Reduce the number of events generated for the Geo event log
* The primary generates events as long as any secondaries are present.
Selective synchronization restrictions are implemented on the secondaries,
not the primary.
A subset of projects can be chosen, either by group or by storage shard. The
former is ideal for replicating data belonging to a subset of users, while the
latter is more suited to progressively rolling out Geo to a large GitLab
instance.
## Upgrading Geo
See the [updating the Geo nodes document](updating_the_geo_nodes.md).
## Troubleshooting
See the [troubleshooting document](troubleshooting.md).
# Geo configuration
>**Note:**
This is the documentation for installations from source. For installations
using the Omnibus GitLab packages, follow the
[**Omnibus Geo nodes configuration**](configuration.md) guide.
## Configuring a new secondary node
>**Note:**
This is the final step in setting up a secondary Geo node. Stages of the setup
process must be completed in the documented order. Before attempting the steps
in this stage, [complete all prior stages](README.md#using-gitlab-installed-from-source).
The basic steps of configuring a secondary node are to replicate required
configurations between the primary and the secondaries; to configure a tracking
database on each secondary; and to start GitLab on the secondary node.
You are encouraged to first read through all the steps before executing them
in your testing/production environment.
>**Notes:**
- **Do not** set up any custom authentication on the secondary nodes; this is
handled by the primary node.
- **Do not** add anything in the secondaries' Geo Nodes admin area
(**Admin Area ➔ Geo Nodes**). This is handled solely by the primary node.
### Step 1. Manually replicate secret GitLab values
GitLab stores a number of secret values in the `/home/git/gitlab/config/secrets.yml`
file which *must* match between the primary and secondary nodes. Until there is
a means of automatically replicating these between nodes (see
[issue #3789](https://gitlab.com/gitlab-org/gitlab-ee/issues/3789)), they must
be manually replicated to the secondary.
1. SSH into the **primary** node, and execute the command below:
```bash
sudo cat /home/git/gitlab/config/secrets.yml
```
This will display the secrets that need to be replicated, in YAML format.
1. SSH into the **secondary** node and login as the `git` user:
```bash
sudo -i -u git
```
1. Make a backup of any existing secrets:
```bash
mv /home/git/gitlab/config/secrets.yml /home/git/gitlab/config/secrets.yml.`date +%F`
```
1. Copy `/home/git/gitlab/config/secrets.yml` from the primary to the secondary, or
copy-and-paste the file contents between nodes:
```bash
sudo editor /home/git/gitlab/config/secrets.yml
# paste the output of the `cat` command you ran on the primary
# save and exit
```
1. Ensure the file permissions are correct:
```bash
chown git:git /home/git/gitlab/config/secrets.yml
chmod 0600 /home/git/gitlab/config/secrets.yml
```
1. Restart GitLab for the changes to take effect:
```bash
service gitlab restart
```
Once restarted, the secondary will automatically start replicating missing data
from the primary in a process known as backfill. Meanwhile, the primary node
will start to notify the secondary of any changes, so that the secondary can
act on those notifications immediately.
Make sure the secondary instance is running and accessible. You can log in to
the secondary node with the same credentials as used on the primary.
### Step 2. Manually replicate primary SSH host keys
Read [Manually replicate primary SSH host keys](configuration.md#step-2-manually-replicate-primary-ssh-host-keys).
### Step 3. (Optional) Enabling hashed storage (from GitLab 10.0)
Read [Enabling Hashed Storage](configuration.md#step-3-optional-enabling-hashed-storage-from-gitlab-10-0).
### Step 4. (Optional) Configuring the secondary to trust the primary
You can safely skip this step if your primary uses a CA-issued HTTPS certificate.
If your primary is using a self-signed certificate for *HTTPS* support, you will
need to add that certificate to the secondary's trust store. Retrieve the
certificate from the primary and follow your distribution's instructions for
adding it to the secondary's trust store. In Debian/Ubuntu, for example, with a
certificate file of `primary.geo.example.com.crt`, you would follow these steps:
```
sudo -i
cp primary.geo.example.com.crt /usr/local/share/ca-certificates
update-ca-certificates
```
### Step 5. Enable Git access over HTTP/HTTPS
Geo synchronizes repositories over HTTP/HTTPS, and therefore requires this clone
method to be enabled. Navigate to **Admin Area ➔ Settings**
(`/admin/application_settings`) on the primary node, and set
`Enabled Git access protocols` to `Both SSH and HTTP(S)` or `Only HTTP(S)`.
### Step 6. Verify proper functioning of the secondary node
Read [Verify proper functioning of the secondary node](configuration.md#step-6-verify-proper-functioning-of-the-secondary-node).
## Selective synchronization
Read [Selective synchronization](configuration.md#selective-synchronization).
## Troubleshooting
Read the [troubleshooting document](troubleshooting.md).
# Geo database replication
>**Note:**
This is the documentation for the Omnibus GitLab packages. For installations
from source, follow the
[**database replication for installations from source**](database_source.md) guide.
>**Note:**
If your GitLab installation uses external PostgreSQL, the Omnibus roles
will not be able to perform all necessary configuration steps. Refer to the
section on [External PostgreSQL][external postgresql] for additional instructions.
>**Note:**
The stages of the setup process must be completed in the documented order.
Before attempting the steps in this stage, [complete all prior stages][toc].
This document describes the minimal steps you have to take in order to
replicate your primary GitLab database to a secondary node's database. You may
have to change some values according to your database setup, how big it is, etc.
You are encouraged to first read through all the steps before executing them
in your testing/production environment.
## PostgreSQL replication
The GitLab primary node where the write operations happen will connect to
the primary database server, and the secondary nodes, which are read-only, will
connect to the secondary database servers (which are also read-only).
>**Note:**
In database documentation you may see "primary" being referenced as "master"
and "secondary" as either "slave" or "standby" server (read-only).
We recommend using [PostgreSQL replication
slots](https://medium.com/@tk512/replication-slots-in-postgresql-b4b03d277c75)
to ensure that the primary retains all the data necessary for the secondaries to
recover. See below for more details.
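Once replication is running (configured in the steps below), you can inspect the
replication slots on the primary. A minimal sketch using the Omnibus `gitlab-psql`
wrapper:
```bash
# List replication slots and whether each one is currently in use
sudo gitlab-psql -c "SELECT slot_name, slot_type, active FROM pg_replication_slots;"
```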
The following guide assumes that:
- You are using Omnibus and therefore you are using PostgreSQL 9.6 or later
which includes the [`pg_basebackup` tool][pgback] and improved
[Foreign Data Wrapper][FDW] support.
- You have a primary node already set up (the GitLab server you are
replicating from), running Omnibus' PostgreSQL (or equivalent version), and
you have a new secondary server set up with the same versions of the OS,
PostgreSQL, and GitLab on all nodes.
- The IP of the primary server for our examples will be `1.2.3.4`, whereas the
secondary's IP will be `5.6.7.8`. Note that the primary and secondary servers
**must** be able to communicate over these addresses. More on this in the
guide below.
### Step 1. Configure the primary server
1. SSH into your GitLab **primary** server and login as root:
```bash
sudo -i
```
1. Execute the command below to define the node as the primary Geo node:
```bash
gitlab-ctl set-geo-primary-node
```
This command will use your defined `external_url` in `/etc/gitlab/gitlab.rb`.
1. GitLab 10.4 and up only: Do the following to make sure the `gitlab` database user has a password defined.
Generate an MD5 hash of the desired password:
```bash
gitlab-ctl pg-password-md5 gitlab
# Enter password: mypassword
# Confirm password: mypassword
# fca0b89a972d69f00eb3ec98a5838484
```
Edit `/etc/gitlab/gitlab.rb`:
```ruby
# Fill with the hash generated by `gitlab-ctl pg-password-md5 gitlab`
postgresql['sql_user_password'] = 'fca0b89a972d69f00eb3ec98a5838484'
# If you have HA setup, this must be present in all nodes as well
gitlab_rails['db_password'] = 'mypassword'
```
1. Omnibus GitLab already has a [replication user](https://wiki.postgresql.org/wiki/Streaming_Replication)
called `gitlab_replicator`. You must set the password for this user manually.
You will be prompted to enter a password:
```bash
gitlab-ctl set-replication-password
```
This command will also read the `postgresql['sql_replication_user']` Omnibus
setting in case you have changed the `gitlab_replicator` username to something
else.
1. Configure PostgreSQL to listen on network interfaces
For security reasons, PostgreSQL does not listen on any network interfaces
by default. However, Geo requires the secondary to be able to
connect to the primary's database. For this reason, we need the address of
each node. Note: For external PostgreSQL instances, see [additional instructions][external postgresql].
If you are using a cloud provider, you can look up the addresses for each
Geo node through your cloud provider's management console.
To look up the address of a Geo node, SSH into the Geo node and execute:
```bash
##
## Private address
##
ip route get 255.255.255.255 | awk '{print "Private address:", $NF; exit}'
##
## Public address
##
echo "External address: $(curl ipinfo.io/ip)"
```
In most cases, the following addresses will be used to configure GitLab
Geo:
| Configuration | Address |
|-----|-----|
| `postgresql['listen_address']` | Primary's private address |
| `postgresql['trust_auth_cidr_addresses']` | Primary's private address |
| `postgresql['md5_auth_cidr_addresses']` | Secondary's public addresses |
If you are using Google Cloud Platform, SoftLayer, or any other vendor that
provides a virtual private cloud, you can use the secondary's private
address (corresponds to "internal address" for Google Cloud Platform) for
`postgresql['md5_auth_cidr_addresses']`.
The `listen_address` option opens PostgreSQL up to network connections
with the interface corresponding to the given address. See [the PostgreSQL
documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-connection.html)
for more details.
Depending on your network configuration, the suggested addresses may not
be correct. If your primary and secondary connect over a local
area network, or a virtual network connecting availability zones like
[Amazon's VPC](https://aws.amazon.com/vpc/) or [Google's VPC](https://cloud.google.com/vpc/)
you should use the secondary's private address for `postgresql['md5_auth_cidr_addresses']`.
Edit `/etc/gitlab/gitlab.rb` and add the following, replacing the IP
addresses with addresses appropriate to your network configuration:
```ruby
geo_primary_role['enable'] = true
##
## Primary address
## - replace '1.2.3.4' with the primary private address
##
postgresql['listen_address'] = '1.2.3.4'
postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32','1.2.3.4/32']
##
## Secondary addresses
## - replace '5.6.7.8' with the secondary public address
##
postgresql['md5_auth_cidr_addresses'] = ['5.6.7.8/32']
##
## Replication settings
## - set this to be the number of Geo secondary nodes you have
##
postgresql['max_replication_slots'] = 1
# postgresql['max_wal_senders'] = 10
# postgresql['wal_keep_segments'] = 10
##
## Disable automatic database migrations temporarily
## (until PostgreSQL is restarted and listening on the private address).
##
gitlab_rails['auto_migrate'] = false
```
1. Optional: If you want to add another secondary, the relevant setting would look like:
```ruby
postgresql['md5_auth_cidr_addresses'] = ['5.6.7.8/32','9.10.11.12/32']
```
You may also want to edit the `wal_keep_segments` and `max_wal_senders` to
match your database replication requirements. Consult the [PostgreSQL -
Replication documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html)
for more information.
1. Save the file and reconfigure GitLab for the database listen changes and
the replication slot changes to be applied.
```bash
gitlab-ctl reconfigure
```
Restart PostgreSQL for its changes to take effect:
```bash
gitlab-ctl restart postgresql
```
1. Re-enable migrations now that PostgreSQL is restarted and listening on the
private address.
Edit `/etc/gitlab/gitlab.rb` and **change** the configuration to `true`:
```ruby
gitlab_rails['auto_migrate'] = true
```
Save the file and reconfigure GitLab:
```bash
gitlab-ctl reconfigure
```
1. Now that the PostgreSQL server is set up to accept remote connections, run
`netstat -plnt | grep 5432` to make sure that PostgreSQL is listening on port
`5432` on the primary server's private address.
1. A certificate was automatically generated when GitLab was reconfigured. This
will be used automatically to protect your PostgreSQL traffic from
eavesdroppers, but to protect against active ("man-in-the-middle") attackers,
the secondary needs a copy of the certificate. Make a copy of the PostgreSQL
`server.crt` file on the primary node by running this command:
```bash
cat ~gitlab-psql/data/server.crt
```
Copy the output to your clipboard or to a local file. You
will need it when setting up the secondary! The certificate is not sensitive
data. If you prefer, you can also transfer it directly, as sketched below.
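If direct SSH access between the nodes is available (an assumption, adjust to your
environment), you could transfer the certificate with `scp`, using the example
secondary address from this guide:
```bash
# Run as root on the primary; 5.6.7.8 is the example secondary address
scp ~gitlab-psql/data/server.crt root@5.6.7.8:/tmp/server.crt
```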
### Step 2. Add the secondary GitLab node
To prevent the secondary Geo node from trying to act as the primary once the
database is replicated, the secondary Geo node must be added on the
primary **before** the database is replicated.
1. Visit the **primary** node's **Admin Area ➔ Geo Nodes**
(`/admin/geo_nodes`) in your browser.
1. Add the secondary node by providing its full URL. **Do NOT** check the box
'This is a primary node'.
1. Optionally, choose which namespaces should be replicated by the
secondary node. Leave blank to replicate all. Read more in
[selective replication](#selective-replication).
1. Click the **Add node** button.
1. SSH into your GitLab **primary** server and login as root to verify the
secondary is reachable:
```
gitlab-rake gitlab:geo:check
```
The new secondary Geo node will have the status **Unhealthy**. This is expected
because we have not yet configured the secondary server. This is the next step.
### Step 3. Configure the secondary server
1. SSH into your GitLab **secondary** server and login as root:
```
sudo -i
```
1. [Check TCP connectivity](../raketasks/maintenance.md) to the
primary's PostgreSQL server:
```bash
gitlab-rake gitlab:tcp_check[1.2.3.4,5432]
```
If this step fails, you may be using the wrong IP address, or a firewall may
be preventing access to the server. Check the IP address, paying close
attention to the difference between public and private addresses and ensure
that, if a firewall is present, the secondary is permitted to connect to the
primary on port 5432.
1. Create a file `server.crt` on the secondary server, with the content from the last step of the primary setup:
```
editor server.crt
```
1. Set up PostgreSQL TLS verification on the secondary
Install the `server.crt` file:
```bash
install -D -o gitlab-psql -g gitlab-psql -m 0400 -T server.crt ~gitlab-psql/.postgresql/root.crt
```
PostgreSQL will now only recognize that exact certificate when verifying TLS
connections. The certificate can only be replicated by someone with access
to the private key, which is **only** present on the primary node.
1. Test that the `gitlab-psql` user can connect to the primary's database:
```bash
sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql --list -U gitlab_replicator -d "dbname=gitlabhq_production sslmode=verify-ca" -W -h 1.2.3.4
```
When prompted, enter the password you set in the first step for the
`gitlab_replicator` user. If all worked correctly, you should see
the list of the primary's databases.
A failure to connect here indicates that the TLS configuration is incorrect.
Ensure that the contents of `~gitlab-psql/data/server.crt` on the primary
match the contents of `~gitlab-psql/.postgresql/root.crt` on the secondary.
1. Configure PostgreSQL to enable FDW support
This step is similar to how we configured the primary instance.
FDW support must be enabled here, even if you are using a single node.
Edit `/etc/gitlab/gitlab.rb` and add the following, replacing the IP
addresses with addresses appropriate to your network configuration:
```ruby
# Secondary addresses
# - replace '5.6.7.8' with the secondary private address
postgresql['listen_address'] = '5.6.7.8'
postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32','5.6.7.8/32']
postgresql['md5_auth_cidr_addresses'] = ['5.6.7.8/32']
# gitlab database user's password (defined previously)
gitlab_rails['db_password'] = 'mypassword'
# enable fdw for the geo tracking database
geo_secondary['db_fdw'] = true
```
1. Edit `/etc/gitlab/gitlab.rb` and add the following:
```ruby
geo_secondary_role['enable'] = true
```
For external PostgreSQL instances, [see additional instructions][external postgresql].
If you bring a former primary back online to serve as a secondary, then you also need to remove `geo_primary_role['enable'] = true`.
1. Reconfigure GitLab for the changes to take effect:
```bash
gitlab-ctl reconfigure
```
1. Restart PostgreSQL for its changes to take effect:
```bash
gitlab-ctl restart postgresql
```
### Step 4. Initiate the replication process
Below we provide a script that connects the database on the secondary node to
the database on the primary node, replicates the database, and creates the
needed files for streaming replication.
The directories used are the defaults that are set up in Omnibus. If you have
changed any defaults or are using a source installation, configure it as you
see fit, replacing the directories and paths.
>**Warning:**
Make sure to run this on the **secondary** server as it removes all PostgreSQL's
data before running `pg_basebackup`.
1. SSH into your GitLab **secondary** server and login as root:
```
sudo -i
```
1. Choose a database-friendly name for your secondary to
use as the replication slot name. For example, if your domain is
`secondary.geo.example.com`, you may use `secondary_example` as the slot
name as shown in the commands below.
1. Execute the command below to start a backup/restore and begin the replication.
>**Warning:** Each Geo secondary must have its own unique replication slot name.
Using the same slot name between two secondaries will break PostgreSQL replication.
```bash
gitlab-ctl replicate-geo-database --slot-name=secondary_example --host=1.2.3.4
```
When prompted, enter the _plaintext_ password you set up for the `gitlab_replicator`
user in the first step.
This command also takes a number of additional options. You can use `--help`
to list them all, but here are a couple of tips:
- If PostgreSQL is listening on a non-standard port, add `--port=` as well.
- If your database is too large to be transferred in 30 minutes, you will need
to increase the timeout, e.g., `--backup-timeout=3600` if you expect the
initial replication to take under an hour.
- Pass `--sslmode=disable` to skip PostgreSQL TLS authentication altogether
(e.g., you know the network path is secure, or you are using a site-to-site
VPN). This is **not** safe over the public Internet!
- You can read more details about each `sslmode` in the
[PostgreSQL documentation](https://www.postgresql.org/docs/9.6/static/libpq-ssl.html#LIBPQ-SSL-PROTECTION);
the instructions above are carefully written to ensure protection against
both passive eavesdroppers and active "man-in-the-middle" attackers.
- Change the `--slot-name` to the name of the replication slot
to be used on the primary database. The script will attempt to create the
replication slot automatically if it does not exist.
- If you're repurposing an old server into a Geo secondary, you'll need to
add `--force` to the command line.
- When not on a production machine, you can disable the backup step by adding
`--skip-backup`, if you are really sure this is what you want.
1. Verify that the secondary is configured correctly and that the primary is
reachable:
```
gitlab-rake gitlab:geo:check
```
The replication process is now complete.
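To confirm that streaming replication is active, you can query `pg_stat_replication`
on the **primary**. A minimal check:
```bash
# One row per connected standby; "state" should be "streaming"
sudo gitlab-psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"
```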
### External PostgreSQL instances
For installations using external PostgreSQL instances, the `geo_primary_role`
and `geo_secondary_role` roles include configuration changes that must be applied
manually.
The `geo_primary_role` makes configuration changes to `pg_hba.conf` and
`postgresql.conf` on the primary:
```
##
## Geo Primary
## - pg_hba.conf
##
host replication gitlab_replicator <trusted secondary IP>/32 md5
```
```
##
## Geo Primary Role
## - postgresql.conf
##
sql_replication_user = gitlab_replicator
wal_level = hot_standby
max_wal_senders = 10
wal_keep_segments = 50
max_replication_slots = 1 # number of secondary instances
hot_standby = on
```
The `geo_secondary_role` makes configuration changes to `postgresql.conf` and
enables the Geo Log Cursor (`geo_logcursor`) and the secondary tracking database
on the secondary. It adds the following PostgreSQL settings on top of
the default settings:
```
##
## Geo Secondary Role
## - postgresql.conf
##
wal_level = hot_standby
max_wal_senders = 10
wal_keep_segments = 10
hot_standby = on
```
Geo secondary nodes use a tracking database to keep track of replication
status and recover automatically from some replication issues. Follow the
instructions for [enabling tracking database on the secondary server][tracking].
## MySQL replication
MySQL replication is not supported for Geo.
## Troubleshooting
Read the [troubleshooting document](troubleshooting.md).
[pgback]: http://www.postgresql.org/docs/9.2/static/app-pgbasebackup.html
[external postgresql]: #external-postgresql-instances
[tracking]: database_source.md#enable-tracking-database-on-the-secondary-server
[FDW]: https://www.postgresql.org/docs/9.6/static/postgres-fdw.html
[toc]: README.md#using-omnibus-gitlab
# Geo database replication
>**Note:**
This is the documentation for installations from source. For installations
using the Omnibus GitLab packages, follow the
[**database replication for Omnibus GitLab**](database.md) guide.
>**Note:**
The stages of the setup process must be completed in the documented order.
Before attempting the steps in this stage, [complete all prior stages][toc].
This document describes the minimal steps you have to take in order to
replicate your primary GitLab database to a secondary node's database. You may
have to change some values according to your database setup, how big it is, etc.
You are encouraged to first read through all the steps before executing them
in your testing/production environment.
## PostgreSQL replication
The GitLab primary node where the write operations happen will connect to
the primary database server, and the secondary nodes, which are read-only, will
connect to the secondary database servers (which are also read-only).
>**Note:**
In database documentation you may see "primary" being referenced as "master"
and "secondary" as either "slave" or "standby" server (read-only).
We recommend using [PostgreSQL replication
slots](https://medium.com/@tk512/replication-slots-in-postgresql-b4b03d277c75)
to ensure the primary retains all the data necessary for the secondaries to
recover. See below for more details.
The following guide assumes that:
- You are using PostgreSQL 9.6 or later which includes the
[`pg_basebackup` tool][pgback] and improved [Foreign Data Wrapper][FDW] support.
- You have a primary node already set up (the GitLab server you are
replicating from), running PostgreSQL 9.6 or later, and
you have a new secondary server set up with the same versions of the OS,
PostgreSQL, and GitLab on all nodes.
- The IP of the primary server for our examples will be `1.2.3.4`, whereas the
secondary's IP will be `5.6.7.8`. Note that the primary and secondary servers
**must** be able to communicate over these addresses. These IP addresses can either
be public or private.
### Step 1. Configure the primary server
1. SSH into your GitLab **primary** server and login as root:
```bash
sudo -i
```
1. Add this node as the Geo primary by running:
```bash
bundle exec rake geo:set_primary_node
```
1. Create a [replication user] named `gitlab_replicator`:
```bash
sudo -u postgres psql -c "CREATE USER gitlab_replicator REPLICATION ENCRYPTED PASSWORD 'thepassword';"
```
1. Make sure the `gitlab` database user has a password defined:
```bash
sudo -u postgres psql -d template1 -c "ALTER USER gitlab WITH ENCRYPTED PASSWORD 'mydatabasepassword';"
```
1. Edit the content of `database.yml` in `production:` and add the password like the example below:
```yaml
#
# PRODUCTION
#
production:
adapter: postgresql
encoding: unicode
database: gitlabhq_production
pool: 10
username: gitlab
password: mydatabasepassword
host: /var/opt/gitlab/geo-postgresql
```
1. Set up TLS support for the PostgreSQL primary server
> **Warning**: Only skip this step if you **know** that PostgreSQL traffic
> between the primary and secondary will be secured through some other
> means, e.g., a known-safe physical network path or a site-to-site VPN that
> you have configured.
If you are replicating your database across the open Internet, it is
**essential** that the connection is TLS-secured. Correctly configured, this
provides protection against both passive eavesdroppers and active
"man-in-the-middle" attackers.
To generate a self-signed certificate and key, run this command:
```bash
openssl req -nodes -batch -x509 -newkey rsa:4096 -keyout server.key -out server.crt -days 3650
```
This will create two files - `server.key` and `server.crt` - that you can
use for authentication.
Copy them to the correct location for your PostgreSQL installation:
```bash
# Copying a self-signed certificate and key
install -o postgres -g postgres -m 0400 -T server.crt ~postgres/9.x/main/data/server.crt
install -o postgres -g postgres -m 0400 -T server.key ~postgres/9.x/main/data/server.key
```
Add this configuration to `postgresql.conf`, removing any existing
configuration for `ssl_cert_file` or `ssl_key_file`:
```
ssl = on
ssl_cert_file='server.crt'
ssl_key_file='server.key'
```
1. Edit `postgresql.conf` to configure the primary server for streaming replication
(for Debian/Ubuntu that would be `/etc/postgresql/9.x/main/postgresql.conf`):
```
listen_addresses = '1.2.3.4'
wal_level = hot_standby
max_wal_senders = 5
min_wal_size = 80MB
max_wal_size = 1GB
max_replication_slots = 1 # Number of Geo secondary nodes
wal_keep_segments = 10
hot_standby = on
```
Be sure to set `max_replication_slots` to the number of Geo secondary
nodes that you may potentially have (at least 1).
For security reasons, PostgreSQL by default only listens on the local
interface (e.g. 127.0.0.1). However, Geo needs to communicate
between the primary and secondary nodes over a common network, such as a
corporate LAN or the public Internet. For this reason, we need to
configure PostgreSQL to listen on more interfaces.
The `listen_addresses` option opens PostgreSQL up to external connections
with the interface corresponding to the given IP. See [the PostgreSQL
documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-connection.html)
for more details.
You may also want to edit the `wal_keep_segments` and `max_wal_senders` to
match your database replication requirements. Consult the [PostgreSQL - Replication documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html)
for more information.
1. Set the access control on the primary to allow TCP connections using the
server's public IP and set the connection from the secondary to require a
password. Edit `pg_hba.conf` (for Debian/Ubuntu that would be
`/etc/postgresql/9.x/main/pg_hba.conf`):
```bash
host all all 127.0.0.1/32 trust
host all all 1.2.3.4/32 trust
host replication gitlab_replicator 5.6.7.8/32 md5
```
Where `1.2.3.4` is the public IP address of the primary server, and `5.6.7.8`
the public IP address of the secondary one. If you want to add another
secondary, add one more row like the replication one and change the IP
address:
```bash
host all all 127.0.0.1/32 trust
host all all 1.2.3.4/32 trust
host replication gitlab_replicator 5.6.7.8/32 md5
host replication gitlab_replicator 11.22.33.44/32 md5
```
1. Restart PostgreSQL for the changes to take effect.
1. Choose a database-friendly name for your secondary to use as the
replication slot name. For example, if your domain is
`secondary.geo.example.com`, you may use `secondary_example` as the slot
name.
1. Create the replication slot on the primary:
```bash
$ sudo -u postgres psql -c "SELECT * FROM pg_create_physical_replication_slot('secondary_example');"
slot_name | xlog_position
------------------+---------------
secondary_example |
(1 row)
```
1. Now that the PostgreSQL server is set up to accept remote connections, run
`netstat -plnt` to make sure that PostgreSQL is listening on the server's
public IP. You can also confirm that the replication slot exists, as sketched below.
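A minimal check of the replication slot created above (it will show as inactive
until the secondary connects):
```bash
sudo -u postgres psql -c "SELECT slot_name, slot_type, active FROM pg_replication_slots;"
```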
### Step 2. Add the secondary GitLab node
Follow the steps in ["add the secondary GitLab node"](database.md#step-2-add-the-secondary-gitlab-node).
### Step 3. Configure the secondary server
Follow the first steps in ["configure the secondary server"](database.md#step-3-configure-the-secondary-server),
but note that since you are installing from source, the username and
group listed as `gitlab-psql` in those steps should be replaced by `postgres`
instead. After completing the "Test that the `gitlab-psql` user can connect to
the primary's database" step, continue here:
1. Edit `postgresql.conf` to configure the secondary for streaming replication
(for Debian/Ubuntu that would be `/etc/postgresql/9.*/main/postgresql.conf`):
```bash
wal_level = hot_standby
max_wal_senders = 5
checkpoint_segments = 10
wal_keep_segments = 10
hot_standby = on
```
1. Restart PostgreSQL for the changes to take effect.
#### Enable tracking database on the secondary server
Geo secondary nodes use a tracking database to keep track of replication status
and recover automatically from some replication issues. Follow the steps below to create
the tracking database.
1. On the secondary node, run the following command to create `database_geo.yml` with the
information of your secondary PostgreSQL instance:
```bash
sudo cp /home/git/gitlab/config/database_geo.yml.postgresql /home/git/gitlab/config/database_geo.yml
```
1. Edit the content of `database_geo.yml` in `production:` as in the example below:
```yaml
#
# PRODUCTION
#
production:
adapter: postgresql
encoding: unicode
database: gitlabhq_geo_production
pool: 10
username: gitlab_geo
# password:
host: /var/opt/gitlab/geo-postgresql
```
1. Create the database `gitlabhq_geo_production` on the PostgreSQL instance of the secondary
node.
1. Set up the Geo tracking database:
```bash
bundle exec rake geo:db:migrate
```
1. Configure the [PostgreSQL FDW][FDW] connection and credentials:
Save the script below in a file, e.g. `/tmp/geo_fdw.sh`, and modify the connection
params to match your environment.
```bash
#!/bin/bash
# Secondary Database connection params:
DB_HOST="/var/opt/gitlab/postgresql"
DB_NAME="gitlabhq_production"
DB_USER="gitlab"
DB_PORT="5432"
# Tracking Database connection params:
GEO_DB_HOST="/var/opt/gitlab/geo-postgresql"
GEO_DB_NAME="gitlabhq_geo_production"
GEO_DB_USER="gitlab_geo"
GEO_DB_PORT="5432"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "CREATE EXTENSION postgres_fdw;"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "CREATE SERVER gitlab_secondary FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '$(DB_HOST)', dbname '$(DB_NAME)', port '$(DB_PORT)' );"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "CREATE USER MAPPING FOR $(GEO_DB_USER) SERVER gitlab_secondary OPTIONS (user '$(DB_USER)');"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "CREATE SCHEMA gitlab_secondary;"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "GRANT USAGE ON FOREIGN SERVER gitlab_secondary TO $(GEO_DB_USER);"
```
Then edit the content of `database_geo.yml` and add `fdw: true` to
the `production:` block. You can verify the foreign server as sketched below.
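A sketch of listing the foreign servers on the tracking database, reusing the
connection parameters from the script above:
```bash
# "\des+" lists foreign servers; gitlab_secondary should appear in the output
sudo -u postgres psql -h /var/opt/gitlab/geo-postgresql -p 5432 -d gitlabhq_geo_production -c '\des+'
```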
### Step 4. Initiate the replication process
Below we provide a script that connects the database on the secondary node to
the database on the primary node, replicates the database, and creates the
needed files for streaming replication.
The directories used are the defaults for Debian/Ubuntu. If you have changed
any defaults, configure it as you see fit replacing the directories and paths.
>**Warning:**
Make sure to run this on the **secondary** server as it removes all PostgreSQL's
data before running `pg_basebackup`.
1. SSH into your GitLab **secondary** server and login as root:
```bash
sudo -i
```
1. Save the snippet below in a file, let's say `/tmp/replica.sh`. Modify the
embedded paths if necessary:
```bash
#!/bin/bash
PORT="5432"
USER="gitlab_replicator"
echo ---------------------------------------------------------------
echo WARNING: Make sure this script is run from the secondary server
echo ---------------------------------------------------------------
echo
echo Enter the IP or FQDN of the primary PostgreSQL server
read HOST
echo Enter the password for $USER@$HOST
read -s PASSWORD
echo Enter the required sslmode
read SSLMODE
echo Stopping PostgreSQL and all GitLab services
gitlab-ctl stop
echo Backing up postgresql.conf
sudo -u postgres mv /var/opt/gitlab/postgresql/data/postgresql.conf /var/opt/gitlab/postgresql/
echo Cleaning up old cluster directory
sudo -u postgres rm -rf /var/opt/gitlab/postgresql/data
rm -f /tmp/postgresql.trigger
echo Starting base backup as the replicator user
echo Enter the password for $USER@$HOST
sudo -u postgres /opt/gitlab/embedded/bin/pg_basebackup -h $HOST -D /var/opt/gitlab/postgresql/data -U gitlab_replicator -v -x -P
echo Writing recovery.conf file
sudo -u postgres bash -c "cat > /var/opt/gitlab/postgresql/data/recovery.conf <<- _EOF1_
standby_mode = 'on'
primary_conninfo = 'host=$HOST port=$PORT user=$USER password=$PASSWORD sslmode=$SSLMODE'
trigger_file = '/tmp/postgresql.trigger'
_EOF1_
"
echo Restoring postgresql.conf
sudo -u postgres mv /var/opt/gitlab/postgresql/postgresql.conf /var/opt/gitlab/postgresql/data/
echo Starting PostgreSQL and all GitLab services
gitlab-ctl start
```
1. Run it with:
```bash
bash /tmp/replica.sh
```
When prompted, enter the IP/FQDN of the primary, and the password you set up
for the `gitlab_replicator` user in the first step.
You should use `verify-ca` for the `sslmode`. You can use `disable` if you
are happy to skip PostgreSQL TLS authentication altogether (e.g., you know
the network path is secure, or you are using a site-to-site VPN). This is
**not** safe over the public Internet!
You can read more details about each `sslmode` in the
[PostgreSQL documentation](https://www.postgresql.org/docs/9.6/static/libpq-ssl.html#LIBPQ-SSL-PROTECTION);
the instructions above are carefully written to ensure protection against
both passive eavesdroppers and active "man-in-the-middle" attackers.
The replication process is now complete.
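To double-check that the secondary is now running as a standby, you can ask
PostgreSQL directly. A minimal check, run on the secondary:
```bash
# Returns "t" when the server is in recovery mode, i.e. acting as a standby
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
```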
## MySQL replication
MySQL replication is not supported for Geo.
## Troubleshooting
Read the [troubleshooting document](troubleshooting.md).
[pgback]: http://www.postgresql.org/docs/9.6/static/app-pgbasebackup.html
[replication user]:https://wiki.postgresql.org/wiki/Streaming_Replication
[FDW]: https://www.postgresql.org/docs/9.6/static/postgres-fdw.html
[toc]: README.md#using-gitlab-installed-from-source
# Docker Registry for a secondary node
You can set up a [Docker Registry](https://docs.docker.com/registry/) on your
secondary Geo node that mirrors the one on the primary Geo node.
## Storage support
CAUTION: **Warning:**
If you use [local storage](../container_registry.md#container-registry-storage-driver)
for the Container Registry you **cannot** replicate it to the secondary Geo node.
Docker Registry currently supports a few types of storage. If you choose a
distributed storage (`azure`, `gcs`, `s3`, `swift`, or `oss`) for your Docker
Registry on a primary Geo node, you can use the same storage for a secondary
Docker Registry as well. For more information, read the
[Load balancing considerations](https://docs.docker.com/registry/deploying/#load-balancing-considerations)
when deploying the Registry, and how to set up the storage driver for GitLab's
integrated [Container Registry](../container_registry.md#container-registry-storage-driver).
[ee]: https://about.gitlab.com/products/
# Geo Frequently Asked Questions
## Can I use Geo in a disaster recovery situation?
Yes, but there are limitations to what we replicate (see
[What data is replicated to a secondary node?](#what-data-is-replicated-to-a-secondary-node)).
Read the documentation for [Disaster Recovery](../disaster_recovery/index.md).
## What data is replicated to a secondary node?
We currently replicate project repositories, LFS objects, generated
attachments/avatars, and the whole database. This means user accounts,
issues, merge requests, groups, project data, etc., will be available for
query. We currently don't replicate artifact data (`shared/folder`).
## Can I git push to a secondary node?
No. All write operations (including `git push`) must be performed on your
primary node.
## How long does it take to have a commit replicated to a secondary node?
All replication operations are asynchronous and are queued to be dispatched in
a batched request every 10 minutes. Besides that, it depends on a lot of other
factors including the amount of traffic, how big your commit is, the
connectivity between your nodes, your hardware, etc.
## What if the SSH server runs at a different port?
We send the clone URL from the primary server to any secondaries, so it
doesn't matter. If the primary is running on port `2200`, the clone URL will
reflect that.
## Is it possible to set up a Docker Registry for a secondary node that mirrors the one on the primary node?
Yes. See [Docker Registry for a secondary Geo node](docker_registry.md).
# Geo High Availability
This document describes a minimal reference architecture for running Geo
in a high availability configuration. If your HA setup differs from the one
described, it is possible to adapt these instructions to your needs.
## Architecture overview
![Geo HA Diagram](../img/high_availability/geo-ha-diagram.png)
_[diagram source - gitlab employees only](https://docs.google.com/drawings/d/1z0VlizKiLNXVVVaERFwgsIOuEgjcUqDTWPdQYsE7Z4c/edit)_
The topology above assumes that the primary and secondary Geo clusters
are located in two separate locations, on their own virtual network
with private IP addresses. The network is configured such that all machines within
one geographic location can communicate with each other using their private IP addresses.
The IP addresses given are examples and may be different depending on the
network topology of your deployment.
The only external way to access the two Geo deployments is by HTTPS at
`gitlab.us.example.com` and `gitlab.eu.example.com` in the example above.
> **Note:** The primary and secondary Geo deployments must be able to
> communicate to each other over HTTPS.
## Redis and PostgreSQL High Availability
The primary and secondary Redis and PostgreSQL should be configured
for high availability. Because of the additional complexity involved
in setting up this configuration for PostgreSQL and Redis,
it is not covered by this Geo HA documentation.
The two services will instead be configured such that
they will each run on a single machine.
For more information about setting up a highly available PostgreSQL cluster and Redis cluster using the Omnibus package, see the high availability documentation for
[PostgreSQL](../high_availability/database.md) and
[Redis](../high_availability/redis.md), respectively.
From these instructions you will need the following for the examples below:
* `gitlab_rails['db_password']` for the PostgreSQL "DB password"
* `redis['password']` for the Redis "Redis password"
NOTE: **Note:**
It is possible to use cloud-hosted services for PostgreSQL and Redis, but this is beyond the scope of this document.
### Prerequisites
Make sure you have GitLab EE installed using the
[Omnibus package](https://about.gitlab.com/installation).
### Step 1: Configure the Geo Backend Services
On the **primary** backend servers configure the following services:
* [Redis](../high_availability/redis.md) for high availability.
* [NFS Server](../high_availability/nfs.md) for repository, LFS, and upload storage.
* [PostgreSQL](../high_availability/database.md) for high availability.
On the **secondary** backend servers configure the following services:
* [Redis](../high_availability/redis.md) for high availability.
* [NFS Server](../high_availability/nfs.md) which will store data that is synchronized from the Geo primary.
### Step 2: Configure the Postgres services on the Geo Secondary
1. Configure the [secondary Geo PostgreSQL database](../gitlab-geo/database.md)
as a read-only secondary of the primary Geo PostgreSQL database.
1. Configure the Geo tracking database on the secondary server; to do this, modify `/etc/gitlab/gitlab.rb`:
```ruby
geo_postgresql['enable'] = true
geo_postgresql['listen_address'] = '10.1.4.1'
geo_postgresql['trust_auth_cidr_addresses'] = ['10.1.0.0/16']
geo_secondary['auto_migrate'] = true
geo_secondary['db_host'] = '10.1.4.1'
geo_secondary['db_password'] = 'Geo tracking DB password'
```
NOTE: **Note:**
Be sure that other non-postgresql services are disabled by setting `enable` to `false` in
the [gitlab.rb configuration](https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/files/gitlab-config-template/gitlab.rb.template).
After making these changes be sure to run `sudo gitlab-ctl reconfigure` so that they take effect.
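A quick way to confirm that the tracking database came up after reconfiguring is to
check its service status. A minimal check:
```bash
# The geo-postgresql service should be reported as "run"
sudo gitlab-ctl status geo-postgresql
```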
### Step 3: Set up the load balancer
In this topology, a load balancer is needed at each geographic location
to route traffic to the application servers.
See the [Load Balancer for GitLab HA](../high_availability/load_balancer.md)
documentation for more information.
### Step 4: Configure the Geo Frontend Application Servers
In the architecture overview there are two machines running the GitLab application
services. These services are enabled selectively in the configuration. Additionally,
the addresses of the remote endpoints for PostgreSQL and Redis will need to be specified.
#### On the GitLab Primary Frontend servers
1. Edit `/etc/gitlab/gitlab.rb` and ensure the following, so that PostgreSQL and Redis do not run locally.
```ruby
##
## Disable PostgreSQL on the local machine and connect to the remote
##
postgresql['enable'] = false
gitlab_rails['auto_migrate'] = false
gitlab_rails['db_host'] = '10.0.3.1'
gitlab_rails['db_password'] = 'DB password'
##
## Disable Redis on the local machine and connect to the remote
##
redis['enable'] = false
gitlab_rails['redis_host'] = '10.0.2.1'
gitlab_rails['redis_password'] = 'Redis password'
geo_primary_role['enable'] = true
```
#### On the GitLab Secondary Frontend servers
On the secondary the remote endpoint for the PostgreSQL Geo database will
be specified.
1. Edit `/etc/gitlab/gitlab.rb` and ensure the following, so that PostgreSQL and Redis do not run locally, and configure the secondary to connect to the Geo tracking database.
```ruby
##
## Disable PostgreSQL on the local machine and connect to the remote
##
postgresql['enable'] = false
gitlab_rails['auto_migrate'] = false
gitlab_rails['db_host'] = '10.1.3.1'
gitlab_rails['db_password'] = 'DB password'
##
## Disable Redis on the local machine and connect to the remote
##
redis['enable'] = false
gitlab_rails['redis_host'] = '10.1.2.1'
gitlab_rails['redis_password'] = 'Redis password'
##
## Enable the geo secondary role and configure the
## geo tracking database
##
geo_secondary_role['enable'] = true
geo_secondary['db_host'] = '10.1.4.1'
geo_secondary['db_password'] = 'Geo tracking DB password'
geo_postgresql['enable'] = false
```
After making these changes, [reconfigure GitLab][] so that they take effect.
On the primary the following GitLab frontend services will be enabled:
* gitlab-pages
* gitlab-workhorse
* logrotate
* nginx
* registry
* remote-syslog
* sidekiq
* unicorn
On the secondary the following GitLab frontend services will be enabled:
* geo-logcursor
* gitlab-pages
* gitlab-workhorse
* logrotate
* nginx
* registry
* remote-syslog
* sidekiq
* unicorn
Verify these services by running `sudo gitlab-ctl status` on the frontend
application servers.
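For example, output similar to the following (illustrative only; PIDs, uptimes, and
the exact service list will differ) indicates the expected services are running:
```bash
sudo gitlab-ctl status
# run: gitlab-workhorse: (pid 1201) 2400s; run: log: (pid 1200) 2400s
# run: nginx: (pid 1203) 2400s; run: log: (pid 1202) 2400s
# run: sidekiq: (pid 1205) 2400s; run: log: (pid 1204) 2400s
# run: unicorn: (pid 1207) 2400s; run: log: (pid 1206) 2400s
```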
[reconfigure GitLab]: ../restart_gitlab.md#omnibus-gitlab-reconfigure
[restart GitLab]: ../restart_gitlab.md#omnibus-gitlab-restart
# Geo (Geo Replication)
> **Notes:**
- Geo is part of [GitLab Premium][ee].
- Introduced in GitLab Enterprise Edition 8.9.
We recommend you use it with at least GitLab Enterprise Edition 10.0 for
basic Geo features, or the latest version for a better experience.
- You should make sure that all nodes run the same GitLab version.
- Geo requires PostgreSQL 9.6 and Git 2.9 in addition to GitLab's usual
[minimum requirements](../install/requirements.md)
- Using Geo in combination with High Availability is considered **GA** in GitLab Enterprise Edition 10.4
>**Note:**
Geo changes significantly from release to release. Upgrades **are**
supported and [documented](#updating-the-geo-nodes), but you should ensure that
you're following the right version of the documentation for your installation!
The best way to do this is to follow the documentation from the `/help` endpoint
on your **primary** node, but you can also navigate to [this page on GitLab.com](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/doc/gitlab-geo/README.md)
and choose the appropriate release from the `tags` dropdown, e.g., `v10.0.0-ee`.
Geo allows you to replicate your GitLab instance to other geographical
locations as a read-only fully operational version.
## Overview
If you have two or more teams geographically spread out, but your GitLab
instance is in a single location, fetching large repositories can take a long
time.
Your Geo instance can be used for cloning and fetching projects, in addition to
reading any data. This will make working with large repositories over large
distances much faster.
![Geo overview](img/geo-overview.png)
When Geo is enabled, we refer to your original instance as a **primary** node
and the replicated read-only ones as **secondaries**.
Keep in mind that:
- Secondaries talk to the primary to get user data for logins (API) and to
replicate repositories, LFS Objects and Attachments (HTTPS + JWT).
- Since GitLab Premium 10.0, the primary no longer talks to
secondaries to notify them about changes (API).
## Use-cases
- Can be used for cloning and fetching projects, in addition
to reading any data available in the GitLab web interface (see [current limitations](#current-limitations))
- Overcomes slow connection between distant offices, saving time by
improving speed for distributed teams
- Helps reduce the load time for automated tasks,
custom integrations, and internal workflows
- Quickly fail over to a Geo secondary in a
[Disaster Recovery](../disaster_recovery/index.md) scenario
- Allows [planned fail-over](../disaster_recovery/planned_fail_over.md) to a Geo secondary
## Architecture
The following diagram illustrates the underlying architecture of Geo:
![Geo architecture](img/geo-architecture.png)
[Source diagram](https://docs.google.com/drawings/d/1Abw0P_H0Ew1-2Lj_xPDRWP87clGIke-1fil7_KQqrtE/edit)
In this diagram, there is one Geo primary node and one secondary. The
secondary clones repositories via git over HTTPS. Attachments, LFS objects, and
other files are downloaded via HTTPS using the GitLab API to authenticate,
with a special endpoint protected by JWT.
Writes to the database and Git repositories can only be performed on the Geo
primary node. The secondary node receives database updates via PostgreSQL
streaming replication.
Note that the secondary needs two different PostgreSQL databases: a read-only
instance that streams data from the main GitLab database and another used
internally by the secondary node to record what data has been replicated.
On the secondary nodes, there is an additional daemon: the Geo Log Cursor.
## Geo Recommendations
We highly recommend that you install Geo on an operating system that supports
OpenSSH 6.9 or higher. The following operating systems are known to ship with a
current version of OpenSSH:
* CentOS 7.4
* Ubuntu 16.04
Note that CentOS 6 and 7.0 ship with an old version of OpenSSH that does not
support a feature that Geo requires. See the [documentation on Geo SSH
access](../operations/fast_ssh_key_lookup.md) for more details.
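To check which OpenSSH version your operating system ships with, a quick check:
```bash
# Prints the installed OpenSSH version, e.g. OpenSSH_7.4p1
ssh -V
```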
### LDAP
We recommend that, if you use LDAP on your primary, you also set up a
secondary LDAP server for the secondary Geo node. Otherwise, users will not be
able to perform Git operations over HTTP(S) on the **secondary** Geo node
using HTTP Basic Authentication. However, Git via SSH and personal access
tokens will still work.
Check with your LDAP provider for instructions on how to set up
replication. For example, OpenLDAP provides [these
instructions](https://www.openldap.org/doc/admin24/replication.html).
### Geo Tracking Database
The tracking database stores the metadata used to control what needs to be
updated on the disk of the local instance (for example, downloading new assets,
fetching new LFS objects, or fetching changes from a repository that has
recently been updated).
Because the replicated instance is read-only, we need this additional instance
per secondary location.
### Geo Log Cursor
This daemon reads a log of events replicated by the primary node to the secondary
database and updates the Geo Tracking Database with changes that need to be
executed.
When something is marked to be updated in the tracking database, asynchronous
jobs running on the secondary node will execute the required operations and
update the state.
This architecture makes Geo resilient to connectivity issues between the
nodes: whether the outage lasts a few minutes or several days, the secondary
instance will be able to replay all the events in the correct order and get
back in sync.
## Setup instructions
These instructions assume you have a working instance of GitLab. They will
guide you through making your existing instance the primary Geo node and
adding secondary Geo nodes.
The steps below should be followed in the order they appear. **Make sure the
GitLab version is the same on all nodes.**
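For Omnibus installations, one quick way to confirm the version running on a node is the environment info rake task (a sketch; run it on every node and compare the reported GitLab version):

```bash
# Print GitLab version and component details for this node (Omnibus installations).
sudo gitlab-rake gitlab:env:info
```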
### Using Omnibus GitLab
If you installed GitLab using the Omnibus packages (highly recommended):
1. [Install GitLab Enterprise Edition][install-ee] on the server that will serve
as the **secondary** Geo node. Do not create an account or login to the new
secondary node.
1. [Upload the GitLab License](../user/admin_area/license.md) on the **primary**
Geo node to unlock Geo.
1. [Setup the database replication](database.md) (`primary (read-write) <->
secondary (read-only)` topology).
1. [Configure fast lookup of authorized SSH keys in the database](../operations/fast_ssh_key_lookup.md),
this step is required and needs to be done on both the primary AND secondary nodes.
1. [Configure GitLab](configuration.md) to set the primary and secondary nodes.
1. Optional: [Configure a secondary LDAP server](../auth/ldap.md)
for the secondary. See [notes on LDAP](#ldap).
1. [Follow the "Using a Geo Server" guide](using_a_geo_server.md).
[install-ee]: https://about.gitlab.com/downloads-ee/ "GitLab Enterprise Edition Omnibus packages downloads page"
### Using GitLab installed from source
If you installed GitLab from source:
1. [Install GitLab Enterprise Edition][install-ee-source] on the server that
will serve as the **secondary** Geo node. Do not create an account or login
to the new secondary node.
1. [Upload the GitLab License](../user/admin_area/license.md) on the **primary**
Geo node to unlock Geo.
1. [Setup the database replication](database_source.md) (`primary (read-write)
<-> secondary (read-only)` topology).
1. [Configure fast lookup of authorized SSH keys in the database](../operations/fast_ssh_key_lookup.md),
do this step for both primary AND secondary nodes.
1. [Configure GitLab](configuration_source.md) to set the primary and secondary
nodes.
1. [Follow the "Using a Geo Server" guide](using_a_geo_server.md).
[install-ee-source]: https://docs.gitlab.com/ee/install/installation.html "GitLab Enterprise Edition installation from source"
## Configuring Geo
Read through the [Geo configuration](configuration.md) documentation.
## Updating the Geo nodes
Read how to [update your Geo nodes to the latest GitLab version](updating_the_geo_nodes.md).
## Configuring Geo HA
Read through the [Geo High Availability documentation](ha.md).
## Configuring Geo with Object storage
When you have object storage enabled, please consult the
[Geo with Object Storage](object_storage.md) documentation.
## Replicating the Container Registry
Read how to [replicate the Container Registry](docker_registry.md).
## Current limitations
- You cannot push code to secondary nodes, see [3912](https://gitlab.com/gitlab-org/gitlab-ee/issues/3912) for details.
- The primary node has to be online for OAuth login to happen (existing sessions and Git are not affected)
- Replication works for repositories, wikis, issues, and merge requests, but not for job logs, artifacts, GitLab Pages, or Docker images in the Container
  Registry (Container Registry replication can be configured separately, see [replicate the Container Registry](docker_registry.md) for details)
- The installation takes multiple manual steps that together can take about an hour depending on circumstances; we are working on improving this experience, see [#2978](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2978) for details.
## Frequently Asked Questions
Read more in the [Geo FAQ](faq.md).
## Log files
Since GitLab 9.5, Geo stores structured log messages in a `geo.log` file. For
Omnibus installations, this file can be found in
`/var/log/gitlab/gitlab-rails/geo.log`. This file contains information about
when Geo attempts to sync repositories and files. Each line in the file contains a
separate JSON entry that can be ingested into Elasticsearch, Splunk, etc. For
example:
```json
{"severity":"INFO","time":"2017-08-06T05:40:16.104Z","message":"Repository update","project_id":1,"source":"repository","resync_repository":true,"resync_wiki":true,"class":"Gitlab::Geo::LogCursor::Daemon","cursor_delay_s":0.038}
```
This message shows that Geo detected that a repository update was needed for project 1.
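Because each line is standalone JSON, the file is easy to filter from the command line. For example, a minimal sketch assuming `jq` is installed:

```bash
# Show recent repository update events recorded in the Geo log.
sudo tail -n 200 /var/log/gitlab/gitlab-rails/geo.log | jq 'select(.message == "Repository update")'
```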
## Security of Geo
Read the [security review](security-review.md) page.
## Tuning Geo
Read the [Geo tuning](tuning.md) documentation.
## Troubleshooting
Read the [troubleshooting document](troubleshooting.md).
[ee]: https://about.gitlab.com/products/ "GitLab Enterprise Edition landing page"
[install-ee]: https://about.gitlab.com/downloads-ee/ "GitLab Enterprise Edition Omnibus packages downloads page"
[install-ee-source]: https://docs.gitlab.com/ee/install/installation.html "GitLab Enterprise Edition installation from source"
# Geo with Object storage
Geo can be used in combination with Object Storage (AWS S3, or
other compatible object storage).
## Configuration
Currently, if object storage is enabled on the primary, it must also be
enabled on the secondary.

The secondary nodes can use the same storage bucket as the primary, or they can
use a replicated storage bucket. At this time GitLab does not replicate content
between object storage buckets itself.
For LFS, follow the documentation to
[set up LFS object storage](../workflow/lfs/lfs_administration.md#setting-up-s3-compatible-object-storage).
For CI job artifacts, there is similar documentation to configure
[job artifact object storage](../job_artifacts.md#using-object-storage).
Complete these steps on all nodes, primary **and** secondary.
## Replication
When using Amazon S3, you can use
[CRR](https://docs.aws.amazon.com/AmazonS3/latest/dev/crr.html) to
have automatic replication between the bucket used by the primary and
the bucket used by the secondary.
If you are using Google Cloud Storage, consider using
[Multi-Regional Storage](https://cloud.google.com/storage/docs/storage-classes#multi-regional).
Or you can use the [Storage Transfer Service](https://cloud.google.com/storage/transfer/),
although this only supports daily synchronization.
For manual synchronization, or scheduled by `cron`, please have a look at:
- [`s3cmd sync`](http://s3tools.org/s3cmd-sync)
- [`gsutil rsync`](https://cloud.google.com/storage/docs/gsutil/commands/rsync)
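For example, a one-way bucket-to-bucket sync along these lines could be scheduled via `cron` (a sketch only; the bucket names are placeholders and the schedule should match your tolerance for replication lag):

```bash
# Sync the primary's bucket to the secondary's bucket with s3cmd; schedule via cron as needed.
s3cmd sync s3://primary-gitlab-objects s3://secondary-gitlab-objects
```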
# Geo security review

The following security review of the Geo feature set focuses on security
aspects of the feature as they apply to customers running their own GitLab
instances. The review questions are based in part on the [application security architecture](https://www.owasp.org/index.php/Application_Security_Architecture_Cheat_Sheet)
questions from [owasp.org](https://www.owasp.org).
## Business Model
### What geographic areas does the application service?
- This varies by customer. Geo allows customers to deploy to multiple areas,
and they get to choose where they are.
- Region and node selection is entirely manual.
## Data Essentials
### What data does the application receive, produce, and process?
- Geo streams almost all data held by a GitLab instance between sites. This
includes full database replication, most files (user-uploaded attachments,
etc) and repository + wiki data. In a typical configuration, this will
happen across the public Internet, and be TLS-encrypted.
- PostgreSQL replication is TLS-encrypted.
- See also: [only TLSv1.2 should be supported](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2948)
### How can the data be classified into categories according to its sensitivity?
- GitLab’s model of sensitivity is centered around public vs. internal vs.
private projects. Geo replicates them all indiscriminately. “Selective sync”
exists for files and repositories (but not database content), which would permit
only less-sensitive projects to be replicated to a secondary if desired.
- See also: [developing a data classification policy](https://gitlab.com/gitlab-com/security/issues/4).
### What data backup and retention requirements have been defined for the application?
- Geo is designed to provide replication of a certain subset of the application
data. It is part of the solution, rather than part of the problem.
## End-Users
### Who are the application's end‐users?
- Geo nodes (secondaries) are created in regions that are distant (in terms of
Internet latency) from the main GitLab installation (the primary). They are
intended to be used by anyone who would ordinarily use the primary, who finds
that the secondary is closer to them (in terms of Internet latency).
### How do the end‐users interact with the application?
- A Geo secondary node provides all the interfaces a Geo primary node does
(notably a HTTP/HTTPS web application, and HTTP/HTTPS or SSH git repository
access), but is constrained to read-only activities. The principal use case is
envisioned to be cloning git repositories from the secondary in favor of the
primary, but end-users may use the GitLab web interface to view projects,
issues, merge requests, snippets, etc.
### What security expectations do the end‐users have?
- The replication process must be secure. It would typically be unacceptable to
transmit the entire database contents or all files and repositories across the
public Internet in plaintext, for instance.
- The Geo secondary must have the same access controls over its content as the
primary - unauthenticated users must not be able to gain access to privileged
information on the primary by querying the secondary.
- Attackers must not be able to impersonate the secondary to the primary, and
thus gain access to privileged information.
## Administrators
### Who has administrative capabilities in the application?
- Nothing Geo-specific. Any user where `admin: true` is set in the database is
considered an admin with super-user privileges.
- See also: [more granular access control](https://gitlab.com/gitlab-org/gitlab-ce/issues/32730)
(not geo-specific)
- Much of Geo’s integration (database replication, for instance) must be
configured with the application, typically by system administrators.
### What administrative capabilities does the application offer?
- Geo secondaries may be added, modified, or removed by users with
administrative access.
- The replication process may be controlled (start/stop) via the Sidekiq
administrative controls.
## Network
### What details regarding routing, switching, firewalling, and load‐balancing have been defined?
- Geo requires the primary and secondary to be able to communicate with each
other across a TCP/IP network. In particular, the secondaries must be able to
access HTTP/HTTPS and PostgreSQL services on the primary.
### What core network devices support the application?
- Varies from customer to customer.
### What network performance requirements exist?
- Maximum replication speed between the primary and a secondary is limited by the
available bandwidth between sites. No hard requirements exist; the time to complete
replication (and ability to keep up with changes on the primary) is a function
of the size of the data set, tolerance for latency, and available network
capacity.
### What private and public network links support the application?
- Customers choose their own networks. As sites are intended to be
geographically separated, it is envisioned that replication traffic will pass
over the public Internet in a typical deployment, but this is not a requirement.
## Systems
### What operating systems support the application?
- Geo imposes no additional restrictions on the operating system (see the
[GitLab installation](https://about.gitlab.com/installation/) page for more
details); however, we recommend using the operating systems listed in the [Geo documentation](http://docs.gitlab.com/ee/gitlab-geo/#geo-recommendations).
### What details regarding required OS components and lock‐down needs have been defined?
- The recommended installation method (Omnibus) packages most components itself.
A from-source installation method exists. Both are documented at
https://docs.gitlab.com/ee/gitlab-geo/
- There are significant dependencies on the system-installed OpenSSH daemon (Geo
requires users to set up custom authentication methods) and the omnibus or
system-provided PostgreSQL daemon (it must be configured to listen on TCP,
additional users and replication slots must be added, etc).
- The process for dealing with security updates (for example, if there is a
significant vulnerability in OpenSSH or other services, and the customer
wants to patch those services on the OS) is identical to the non-Geo
situation: security updates to OpenSSH would be provided to the user via the
usual distribution channels. Geo introduces no delay there.
## Infrastructure Monitoring
### What network and system performance monitoring requirements have been defined?
- None specific to Geo.
### What mechanisms exist to detect malicious code or compromised application components?
- None specific to Geo.
### What network and system security monitoring requirements have been defined?
- None specific to Geo.
## Virtualization and Externalization
### What aspects of the application lend themselves to virtualization?
- All.
### What virtualization requirements have been defined for the application?
- Nothing Geo-specific, but everything in GitLab needs to have full
functionality in such an environment.
### What aspects of the product may or may not be hosted via the cloud computing model?
- GitLab is “cloud native” and this applies to Geo as much as to the rest of the
product. Deployment in clouds is a common and supported scenario.
### If applicable, what approach(es) to cloud computing will be taken (Managed Hosting versus "Pure" Cloud, a "full machine" approach such as AWS-EC2 versus a "hosted database" approach such as AWS-RDS and Azure, etc)?
- To be decided by our customers, according to their operational needs.
## Environment
### What frameworks and programming languages have been used to create the application?
- Ruby on Rails, Ruby.
### What process, code, or infrastructure dependencies have been defined for the application?
- Nothing specific to Geo.
### What databases and application servers support the application?
- PostgreSQL >= 9.6, Redis, Sidekiq, Unicorn.
### How will database connection strings, encryption keys, and other sensitive components be stored, accessed, and protected from unauthorized detection?
- There are some Geo-specific values. Some are shared secrets which must be
securely transmitted from the primary to the secondary at setup time. Our
documentation recommends transmitting them from the primary to the system
administrator via SSH, and then back out to the secondary in the same manner.
In particular, this includes the PostgreSQL replication credentials and a secret
key (`db_key_base`) which is used to decrypt certain columns in the database.
The `db_key_base` secret is stored unencrypted on the filesystem, in
`/etc/gitlab/gitlab-secrets.json`, along with a number of other secrets. There is
no at-rest protection for them.
## Data Processing
### What data entry paths does the application support?
- Data is entered via the web application exposed by GitLab itself. Some data is
also entered using system administration commands on the GitLab servers (e.g.,
`gitlab-ctl set-primary-node`).
- Secondaries also receive inputs via PostgreSQL streaming replication from the
primary.
### What data output paths does the application support?
- Primaries output via PostgreSQL streaming replication to the secondary.
Otherwise, principally via the web application exposed by GitLab itself, and via
SSH `git clone` operations initiated by the end-user.
### How does data flow across the application's internal components?
- Secondaries and primaries interact via HTTP/HTTPS (secured with JSON web
tokens) and via PostgreSQL streaming replication.
- Within a primary or secondary, the single source of truth (SSOT) is the
filesystem and the database (including the Geo tracking database on the
secondary). The various internal components are orchestrated to make
alterations to these stores.
### What data input validation requirements have been defined?
- Secondaries must have a faithful replication of the primary’s data.
### What data does the application store and how?
- Git repositories and files, tracking information related to them, and the
GitLab database contents.
### What data is or may need to be encrypted and what key management requirements have been defined?
- Neither primaries nor secondaries encrypt Git repository or filesystem data at
rest. A subset of database columns is encrypted at rest using `db_otp_key`,
a static secret shared across all hosts in a GitLab deployment.
- In transit, data should be encrypted, although the application does permit
communication to proceed unencrypted. The two main transits are the secondary’s
replication process for PostgreSQL, and for git repositories/files. Both should
be protected using TLS, with the keys for that managed via Omnibus per existing
configuration for end-user access to GitLab.
### What capabilities exist to detect the leakage of sensitive data?
- Comprehensive system logs exist, tracking every connection to GitLab and
PostgreSQL.
### What encryption requirements have been defined for data in transit - including transmission over WAN, LAN, SecureFTP, or publicly accessible protocols such as http: and https:?
- Data must have the option to be encrypted in transit, and be secure against
both passive and active attack (e.g., MITM attacks should not be possible).
## Access
### What user privilege levels does the application support?
- Geo adds one type of privilege: secondaries can access a special Geo API to
download files over HTTP/HTTPS, and to clone repositories using HTTP/HTTPS.
### What user identification and authentication requirements have been defined?
- Geo secondaries identify to Geo primaries via OAuth or JWT authentication
based on the shared database (HTTP access) or a PostgreSQL replication user (for
database replication). The database replication also requires IP-based access
controls to be defined.
### What user authorization requirements have been defined?
- Secondaries must only be able to *read* data. They are not currently able to
mutate data on the primary.
### What session management requirements have been defined?
- Geo JWTs are defined to last for only two minutes before needing to be
regenerated.
### What access requirements have been defined for URI and Service calls?
- A Geo secondary makes many calls to the primary's API. This is how file
replication proceeds, for instance. This endpoint is only accessible with a JWT
token.
- The primary also makes calls to the secondary to get status information.
## Application Monitoring
### What application auditing requirements have been defined? How are audit and debug logs accessed, stored, and secured?
- Structured JSON log is written to the filesystem, and can also be ingested
into a Kibana installation for further analysis.
# Geo Troubleshooting
>**Note:**
This list is an attempt to document all the moving parts that can go wrong.
We are working on having all these steps verified automatically in a
rake task in the future.

Setting up Geo requires careful attention to detail, and it's easy to
miss a step. Here is a list of questions you should ask to try to detect
what you need to fix (all commands and path locations are for Omnibus installs):
#### First check the health of the secondary
Visit the primary node's **Admin Area ➔ Geo Nodes** (`/admin/geo_nodes`) in
your browser. We perform the following health checks on each secondary node
to help identify if something is wrong:
- Is the node running?
- Is the node's secondary database configured for streaming replication?
- Is the node's secondary tracking database configured?
- Is the node's secondary tracking database connected?
- Is the node's secondary tracking database up-to-date?
![Geo health check](img/geo-node-healthcheck.png)
There is also an option to check the status of the secondary node by running a special rake task:
```
sudo gitlab-rake geo:status
```
#### Is Postgres replication working?
#### Are my nodes pointing to the correct database instance?
You should make sure your primary Geo node points to the instance with
writing permissions.
Any secondary nodes should point only to read-only instances.
#### Can Geo detect my current node correctly?
Geo finds the current node by matching the nodes defined in **Admin Area ➔ Geo Nodes**
against the value of `external_url` in the `/etc/gitlab/gitlab.rb` configuration file.
The relevant line looks like: `external_url "http://gitlab.example.com"`.
To check if the node on the current machine is correctly detected type:
```bash
sudo gitlab-rails runner "puts Gitlab::Geo.current_node.inspect"
```
and expect something like:
```
#<GeoNode id: 2, schema: "https", host: "gitlab.example.com", port: 443, relative_url_root: "", primary: false, ...>
```
When the command above is run on the primary node, `primary` should be `true`;
on any secondary node, it should be `false`.
#### How do I fix the message, "ERROR: replication slots can only be used if max_replication_slots > 0"?
This means that the `max_replication_slots` PostgreSQL variable needs to
be set on the primary database. In GitLab 9.4, we have made this setting
default to 1. You may need to increase this value if you have more Geo
secondary nodes. Be sure to restart PostgreSQL for this to take
effect. See the [PostgreSQL replication
setup](database.md#postgresql-replication) guide for more details.
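To check the value currently in effect on the primary database, one option (a sketch assuming the Omnibus `gitlab-psql` wrapper, which passes standard `psql` options through) is:

```bash
# Show the current max_replication_slots setting on the primary database.
sudo gitlab-psql gitlabhq_production -c 'SHOW max_replication_slots;'
```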
#### How do I fix the message, "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"?
This occurs when PostgreSQL does not have a replication slot for the
secondary by that name. You may want to rerun the [replication
process](database.md) on the secondary.
#### How do I fix the message, "Command exceeded allowed execution time" when setting up replication?
This may happen while [initiating the replication process](database.md#step-4-initiate-the-replication-process) on the Geo secondary, and indicates that your
initial dataset is too large to be replicated in the default timeout (30 minutes).
Re-run `gitlab-ctl replicate-geo-database`, but include a larger value for
`--backup-timeout`:
```bash
sudo gitlab-ctl replicate-geo-database --host=primary.geo.example.com --slot-name=secondary_geo_example_com --backup-timeout=21600
```
This will give the initial replication up to six hours to complete, rather than
the default thirty minutes. Adjust as required for your installation.
#### How do I fix the message, "PANIC: could not write to file 'pg_xlog/xlogtemp.123': No space left on device"
Determine if you have any unused replication slots in the primary database. Unused slots can cause large amounts of log data to build up in `pg_xlog`.
Removing the unused slots can reduce the amount of space used in `pg_xlog`.
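To gauge how much space the write-ahead logs are currently consuming, you can check the size of the `pg_xlog` directory first (a sketch assuming the default Omnibus data directory):

```bash
# Report the on-disk size of the PostgreSQL write-ahead log directory (default Omnibus path).
sudo du -sh /var/opt/gitlab/postgresql/data/pg_xlog
```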
1. Start a PostgreSQL console session:
```bash
sudo gitlab-psql gitlabhq_production
```
Note that using `gitlab-rails dbconsole` will not work, because managing replication slots requires superuser permissions.
2. View your replication slots with
```sql
SELECT * FROM pg_replication_slots;
```
Slots where `active` is `f` are not active.
- If a slot should be active because you have a secondary configured to use it,
  log in to that secondary and check the PostgreSQL logs to see why replication is not running.
- If you are no longer using the slot (e.g. you no longer have Geo enabled), you can remove it in the PostgreSQL console session:
```sql
SELECT pg_drop_replication_slot('name_of_extra_slot');
```
#### Very large repositories never successfully synchronize on the secondary
GitLab places a timeout on all repository clones, including project imports
and Geo synchronization operations. If a fresh `git clone` of a repository
on the primary takes more than a few minutes, you may be affected by this.
To increase the timeout, add the following line to `/etc/gitlab/gitlab.rb`
on the secondary:
```ruby
gitlab_rails['gitlab_shell_git_timeout'] = 10800
```
Then reconfigure GitLab:
```bash
sudo gitlab-ctl reconfigure
```
This will increase the timeout to three hours (10800 seconds). Choose a time
long enough to accommodate a full clone of your largest repositories.
# Tuning Geo
## Changing the sync capacity values
In the Geo admin page (`/admin/geo_nodes`), there are several variables that
can be tuned to improve performance of Geo:
* Repository sync capacity
* File sync capacity
Increasing these values will increase the number of jobs that are scheduled,
but this may not lead to more downloads in parallel unless the number of
available Sidekiq threads is also increased. For example, if repository sync
capacity is increased from 25 to 50, you may also want to increase the number
of Sidekiq threads from 25 to 50. See the [Sidekiq concurrency
documentation](../operations/extra_sidekiq_processes.html#concurrency)
for more details.
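For Omnibus installations, one way to check whether Sidekiq concurrency has been explicitly configured before raising the sync capacities (a sketch; no output simply means the default value is in use):

```bash
# Look for an explicit Sidekiq concurrency setting in gitlab.rb; no output means the default applies.
sudo grep -E "sidekiq\['concurrency'\]" /etc/gitlab/gitlab.rb
```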
# Updating the Geo nodes
Depending on which version of Geo you are updating to/from, there may be
different steps.
## General update steps
In order to update the Geo nodes when a new GitLab version is released,
all you need to do is update GitLab itself:
1. Log into each node (primary and secondaries)
1. [Update GitLab][update]
1. [Update tracking database on secondary node](#update-tracking-database-on-secondary-node) when
the tracking database is enabled.
1. [Test](#check-status-after-updating) primary and secondary nodes, and check version in each.
## Upgrading to GitLab 10.5
For Geo Disaster Recovery to work with minimum downtime, your Geo secondary
should use the same set of secrets as the primary. However, setup instructions
prior to the 10.5 release only synchronized the `db_key_base` secret.
To rectify this error on existing installations, you should **overwrite** the
contents of `/etc/gitlab/gitlab-secrets.json` on the secondary node with the
contents of `/etc/gitlab/gitlab-secrets.json` on the primary node, then run the
following command on the secondary node:
```bash
sudo gitlab-ctl reconfigure
```
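One way to perform the overwrite is to copy the file directly (a sketch assuming root SSH access from the primary to the secondary; the host name is a placeholder, and copy-and-pasting the file contents works just as well):

```bash
# Run on the primary: copy the secrets file to the secondary, then reconfigure there.
scp /etc/gitlab/gitlab-secrets.json root@secondary.geo.example.com:/etc/gitlab/gitlab-secrets.json
ssh root@secondary.geo.example.com 'gitlab-ctl reconfigure'
```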
If you do not perform this step, you may find that two-factor authentication
[is broken following DR](faq.md#i-followed-the-disaster-recovery-instructions-and-now-two-factor-auth-is-broken).
To prevent SSH requests to the newly promoted primary node from failing
due to SSH host key mismatch when updating the primary domain's DNS record
you should perform the step to [Manually replicate primary SSH host keys](configuration.md#step-2-manually-replicate-primary-ssh-host-keys) in each
secondary node.
## Upgrading to GitLab 10.4
There are no Geo-specific steps to take!
## Upgrading to GitLab 10.3
### Support for SSH repository synchronization removed
In GitLab 10.2, synchronizing secondaries over SSH was deprecated. In 10.3,
support is removed entirely. All installations will switch to the HTTP/HTTPS
cloning method instead. Before upgrading, ensure that all your Geo nodes are
configured to use this method and that it works for your installation. In
particular, ensure that [Git access over HTTP/HTTPS is enabled](configuration.md#step-5-enable-git-access-over-http-https).
Synchronizing repositories over the public Internet using HTTP is insecure, so
you should ensure that you have HTTPS configured before upgrading. Note that
file synchronization is **also** insecure in these cases!
## Upgrading to GitLab 10.2
### Secure PostgreSQL replication
Support for TLS-secured PostgreSQL replication has been added. If you are
currently using PostgreSQL replication across the open internet without an
external means of securing the connection (e.g., a site-to-site VPN), then you
should immediately reconfigure your primary and secondary PostgreSQL instances
according to the [updated instructions](database.md).
If you *are* securing the connections externally and wish to continue doing so,
ensure you include the new option `--sslmode=prefer` in future invocations of
`gitlab-ctl replicate-geo-database`.
### HTTPS repository sync
Support for replicating repositories and wikis over HTTP/HTTPS has been added.
Replicating over SSH has been deprecated, and support for this option will be
removed in a future release.
To switch to HTTP/HTTPS replication, log into the primary node as an admin and visit
**Admin Area ➔ Geo Nodes** (`/admin/geo_nodes`). For each secondary listed,
press the "Edit" button, change the "Repository cloning" setting from
"SSH (deprecated)" to "HTTP/HTTPS", and press "Save changes". This should take
effect immediately.
Any new secondaries should be created using HTTP/HTTPS replication - this is the
default setting.
After you've verified that HTTP/HTTPS replication is working, you should remove
the now-unused SSH keys from your secondaries, as they may cause problems if the
secondary is ever promoted to a primary:
1. **[secondary]** Login to **all** your secondary nodes and run:
```bash
sudo -u git -H rm ~git/.ssh/id_rsa ~git/.ssh/id_rsa.pub
```
### Hashed Storage
>**Warning**
Hashed storage is in **Alpha**. It is considered experimental and not
production-ready. See [Hashed
Storage](../repository_storage_types.md) for more detail.
If you previously enabled Hashed Storage and migrated all your existing
projects to Hashed Storage, disabling hashed storage will not migrate projects
to their previous project based storage path. As such, once enabled and
migrated we recommend leaving Hashed Storage enabled.
## Upgrading to GitLab 10.1
>**Warning**
Hashed storage is in **Alpha**. It is considered experimental and not
production-ready. See [Hashed
Storage](../repository_storage_types.md) for more detail.
[Hashed storage](../repository_storage_types.md) was introduced
in GitLab 10.0, and a [migration path](../raketasks/storage.md)
for existing repositories was added in GitLab 10.1.
## Upgrading to GitLab 10.0
Since GitLab 10.0, we require all **Geo** systems to [use SSH key lookups via
the database](../operations/fast_ssh_key_lookup.md) to avoid having to maintain consistency of the
`authorized_keys` file for SSH access. Failing to do this will prevent users
from being able to clone via SSH.
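After enabling database key lookups, a quick smoke test is to open an SSH connection to the node as the `git` user; GitLab should greet the authenticated user rather than reject the key (a sketch; the host name is a placeholder for your own node):

```bash
# Should print a "Welcome to GitLab, @username!" style greeting if SSH key lookup works.
ssh -T git@secondary.gitlab.example.com
```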
Note that in older versions of Geo, attachments downloaded on the secondary
nodes would be saved to the wrong directory. We recommend that you do the
following to clean this up.
On the SECONDARY Geo nodes, run as root:
```sh
mv /var/opt/gitlab/gitlab-rails/working /var/opt/gitlab/gitlab-rails/working.old
mkdir /var/opt/gitlab/gitlab-rails/working
chmod 700 /var/opt/gitlab/gitlab-rails/working
chown git:git /var/opt/gitlab/gitlab-rails/working
```
You may delete `/var/opt/gitlab/gitlab-rails/working.old` any time.
Once this is done, we advise restarting GitLab on the secondary nodes for the
new working directory to be used:
```
sudo gitlab-ctl restart
```
## Upgrading from GitLab 9.3 or older
If you started running Geo on GitLab 9.3 or older, we recommend that you
resync your secondary PostgreSQL databases to use replication slots. If you
started using Geo with GitLab 9.4 or 10.x, no further action should be
required because replication slots are used by default. However, if you
started with GitLab 9.3 and upgraded later, you should still follow the
instructions below.
When in doubt, it does not hurt to do a resync. The easiest way to do this in
Omnibus is the following:
1. Install GitLab on the primary server
1. Run `gitlab-ctl reconfigure` and `gitlab-ctl restart postgresql`. This will enable replication slots on the primary database.
1. Install GitLab on the secondary server.
1. Re-run the [database replication process](database.md#step-3-initiate-the-replication-process).
## Special update notes for 9.0.x
> **IMPORTANT**:
With GitLab 9.0, the PostgreSQL version is upgraded to 9.6 and manual steps are
required in order to update the secondary nodes and keep the Streaming
Replication working. Downtime is required, so plan ahead.
The following steps apply only if you upgrade from GitLab 8.17 to
9.0+. For previous versions, update to GitLab 8.17 first before attempting to
upgrade to 9.0+.
---
Make sure to follow the steps in the exact order in which they appear below, and pay
extra attention to which node (primary/secondary) you execute them on! Each step
is prefixed with the relevant node for clarity:
1. **[secondary]** Login to **all** your secondary nodes and stop all services:
```bash
sudo gitlab-ctl stop
```
1. **[secondary]** Make a backup of the `recovery.conf` file on **all**
secondary nodes to preserve PostgreSQL's credentials:
```
sudo cp /var/opt/gitlab/postgresql/data/recovery.conf /var/opt/gitlab/
```
1. **[primary]** Update the primary node to GitLab 9.0 following the
[regular update docs][update]. At the end of the update, the primary node
will be running with PostgreSQL 9.6.
1. **[primary]** To prevent a de-synchronization of the repository replication,
stop all services except `postgresql` as we will use it to re-initialize the
secondary node's database:
```
sudo gitlab-ctl stop
sudo gitlab-ctl start postgresql
```
1. **[secondary]** Run the following steps on each of the secondaries:
1. **[secondary]** Stop all services:
```
sudo gitlab-ctl stop
```
1. **[secondary]** Prevent running database migrations:
```
sudo touch /etc/gitlab/skip-auto-migrations
```
1. **[secondary]** Move the old database to another directory:
```
sudo mv /var/opt/gitlab/postgresql{,.bak}
```
1. **[secondary]** Update to GitLab 9.0 following the [regular update docs][update].
At the end of the update, the node will be running with PostgreSQL 9.6.
1. **[secondary]** Make sure all services are up:
```
sudo gitlab-ctl start
```
1. **[secondary]** Reconfigure GitLab:
```
sudo gitlab-ctl reconfigure
```
1. **[secondary]** Run the PostgreSQL upgrade command:
```
sudo gitlab-ctl pg-upgrade
```
1. **[secondary]** View the stored database credentials, which you will need
   to re-initialize the replication:
```
sudo grep -s primary_conninfo /var/opt/gitlab/recovery.conf
```
1. **[secondary]** Create the `replica.sh` script as described in the
[database configuration document](database.md#step-3-initiate-the-replication-process).
1. **[secondary]** Run the recovery script using the credentials from the
previous step:
```
sudo bash /tmp/replica.sh
```
1. **[secondary]** Reconfigure GitLab:
```
sudo gitlab-ctl reconfigure
```
1. **[secondary]** Start all services:
```
sudo gitlab-ctl start
```
1. **[secondary]** Repeat the steps for the rest of the secondaries.
1. **[primary]** After all secondaries are updated, start all services in
primary:
```
sudo gitlab-ctl start
```
## Check status after updating
Now that the update process is complete, you may want to check whether
everything is working correctly:
1. Run the Geo rake task on all nodes; everything should be green:
```
sudo gitlab-rake gitlab:geo:check
```
1. Check the primary's Geo dashboard for any errors
1. Test the data replication by pushing code to the primary and checking that it
   is received by the secondaries
## Update tracking database on secondary node
After updating a secondary node, you might need to run migrations on
the tracking database. The tracking database was added in GitLab 9.1,
and has been required since GitLab 10.0.
1. Run database migrations on tracking database
```
sudo gitlab-rake geo:db:migrate
```
1. Repeat this step for every secondary node
[update]: ../update/README.md
[//]: # (Please update EE::GitLab::GeoGitAccess::GEO_SERVER_DOCS_URL if this file is moved)
# Using a Geo Server
After you set up the [database replication and configure the Geo nodes][req],
there are a few things to consider:
1. Users need an extra step to be able to fetch code from the secondary and push
   to the primary:
1. Clone the repository as you would normally do, but from the secondary node:
```bash
git clone git@secondary.gitlab.example.com:user/repo.git
```
1. Change the remote push URL to always push to the primary, following this example:
```bash
git remote set-url --push origin git@primary.gitlab.example.com:user/repo.git
```
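You can confirm the result; the fetch URL should point at the secondary and the push URL at the primary:

```bash
# Inspect the configured fetch and push URLs for the "origin" remote.
git remote -v
```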
[req]: README.md#setup-instructions
# Geo (Geo Replication)
> **Notes:**
- Geo is part of [GitLab Premium][ee].
- Introduced in GitLab Enterprise Edition 8.9.
  We recommend you use it with at least GitLab Enterprise Edition 10.0 for
  basic Geo features, or the latest version for the best experience.
- You should make sure that all nodes run the same GitLab version.
- Geo requires PostgreSQL 9.6 and Git 2.9 in addition to GitLab's usual
  [minimum requirements](../install/requirements.md).
- Using Geo in combination with High Availability is considered **GA** in GitLab Enterprise Edition 10.4
>**Note:**
Geo changes significantly from release to release. Upgrades **are**
supported and [documented](#updating-the-geo-nodes), but you should ensure that
you're following the right version of the documentation for your installation!
The best way to do this is to follow the documentation from the `/help` endpoint
on your **primary** node, but you can also navigate to [this page on GitLab.com](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/doc/gitlab-geo/README.md)
and choose the appropriate release from the `tags` dropdown, e.g., `v10.0.0-ee`.
Geo allows you to replicate your GitLab instance to other geographical
locations as a read-only, fully operational version.
## Overview
If you have two or more teams geographically spread out, but your GitLab
instance is in a single location, fetching large repositories can take a long
time.
Your Geo instance can be used for cloning and fetching projects, in addition to
reading any data. This will make working with large repositories over large
distances much faster.
![Geo overview](img/geo-overview.png)
When Geo is enabled, we refer to your original instance as a **primary** node
and the replicated read-only ones as **secondaries**.
Keep in mind that:
- Secondaries talk to the primary to get user data for logins (API) and to
replicate repositories, LFS Objects and Attachments (HTTPS + JWT).
- Since GitLab Premium 10.0, the primary no longer needs to talk to
  secondaries to notify them of changes (API).
## Use-cases
- Can be used for cloning and fetching projects, in addition
to reading any data available in the GitLab web interface (see [current limitations](#current-limitations))
- Overcomes slow connection between distant offices, saving time by
improving speed for distributed teams
- Helps reduce load times for automated tasks,
  custom integrations, and internal workflows
- Quickly fail over to a Geo secondary in a
[Disaster Recovery](../administration/disaster_recovery/index.md) scenario
- Allows [planned fail-over](../administration/disaster_recovery/planned_fail_over.md) to a Geo secondary
## Architecture
The following diagram illustrates the underlying architecture of Geo:
![Geo architecture](img/geo-architecture.png)
[Source diagram](https://docs.google.com/drawings/d/1Abw0P_H0Ew1-2Lj_xPDRWP87clGIke-1fil7_KQqrtE/edit)
In this diagram, there is one Geo primary node and one secondary. The
secondary clones repositories via git over HTTPS. Attachments, LFS objects, and
other files are downloaded via HTTPS using the GitLab API to authenticate,
with a special endpoint protected by JWT.
Writes to the database and Git repositories can only be performed on the Geo
primary node. The secondary node receives database updates via PostgreSQL
streaming replication.
Note that the secondary needs two different PostgreSQL databases: a read-only
instance that streams data from the main GitLab database and another used
internally by the secondary node to record what data has been replicated.
In the secondary nodes there is an additional daemon: Geo Log Cursor.
## Geo Recommendations
We highly recommend that you install Geo on an operating system that supports
OpenSSH 6.9 or higher. The following operating systems are known to ship with a
current version of OpenSSH:
* CentOS 7.4
* Ubuntu 16.04
Note that CentOS 6 and 7.0 ship with an older version of OpenSSH that does not
support a feature that Geo requires. See the [documentation on Geo SSH
access](../administration/operations/fast_ssh_key_lookup.md) for more details.
### LDAP
We recommend that if you use LDAP on your primary that you also set up a
secondary LDAP server for the secondary Geo node. Otherwise, users will not be
able to perform Git operations over HTTP(s) on the **secondary** Geo node
using HTTP Basic Authentication. However, Git via SSH and personal access
tokens will still work.
Check with your LDAP provider for instructions on how to set up
replication. For example, OpenLDAP provides [these
instructions](https://www.openldap.org/doc/admin24/replication.html).
### Geo Tracking Database
The tracking database stores the metadata used to control what needs to be
updated on the disk of the local instance (for example, downloading new assets,
fetching new LFS objects, or fetching changes from a repository that has
recently been updated).
Because the replicated instance is read-only, we need this additional instance
per secondary location.
### Geo Log Cursor
This daemon reads a log of events replicated by the primary node to the secondary
database and updates the Geo Tracking Database with changes that need to be
executed.
When something is marked to be updated in the tracking database, asynchronous
jobs running on the secondary node will execute the required operations and
update the state.
This architecture makes Geo resilient to connectivity issues between the
nodes: whether the outage lasts a few minutes or several days, the secondary
instance will be able to replay all the events in the correct order and get
back in sync.
## Setup instructions
These instructions assume you have a working instance of GitLab. They will
guide you through making your existing instance the primary Geo node and
adding secondary Geo nodes.
The steps below should be followed in the order they appear. **Make sure the
GitLab version is the same on all nodes.**
### Using Omnibus GitLab
If you installed GitLab using the Omnibus packages (highly recommended):
1. [Install GitLab Enterprise Edition][install-ee] on the server that will serve
as the **secondary** Geo node. Do not create an account or login to the new
secondary node.
1. [Upload the GitLab License](../user/admin_area/license.md) on the **primary**
Geo node to unlock Geo.
1. [Setup the database replication](database.md) (`primary (read-write) <->
secondary (read-only)` topology).
1. [Configure fast lookup of authorized SSH keys in the database](../administration/operations/fast_ssh_key_lookup.md),
this step is required and needs to be done on both the primary AND secondary nodes.
1. [Configure GitLab](configuration.md) to set the primary and secondary nodes.
1. Optional: [Configure a secondary LDAP server](../administration/auth/ldap.md)
for the secondary. See [notes on LDAP](#ldap).
1. [Follow the "Using a Geo Server" guide](using_a_geo_server.md).
[install-ee]: https://about.gitlab.com/downloads-ee/ "GitLab Enterprise Edition Omnibus packages downloads page"
### Using GitLab installed from source
If you installed GitLab from source:
1. [Install GitLab Enterprise Edition][install-ee-source] on the server that
will serve as the **secondary** Geo node. Do not create an account or login
to the new secondary node.
1. [Upload the GitLab License](../user/admin_area/license.md) on the **primary**
Geo node to unlock Geo.
1. [Setup the database replication](database_source.md) (`primary (read-write)
<-> secondary (read-only)` topology).
1. [Configure fast lookup of authorized SSH keys in the database](../administration/operations/fast_ssh_key_lookup.md),
do this step for both primary AND secondary nodes.
1. [Configure GitLab](configuration_source.md) to set the primary and secondary
nodes.
1. [Follow the "Using a Geo Server" guide](using_a_geo_server.md).
[install-ee-source]: https://docs.gitlab.com/ee/install/installation.html "GitLab Enterprise Edition installation from source"
## Configuring Geo
Read through the [Geo configuration](configuration.md) documentation.
## Updating the Geo nodes
Read how to [update your Geo nodes to the latest GitLab version](updating_the_geo_nodes.md).
## Configuring Geo HA
Read through the [Geo High Availability documentation](ha.md).
## Configuring Geo with Object storage
When you have object storage enabled, please consult the
[Geo with Object Storage](object_storage.md) documentation.
## Replicating the Container Registry
Read how to [replicate the Container Registry](docker_registry.md).
## Current limitations
- You cannot push code to secondary nodes, see [3912](https://gitlab.com/gitlab-org/gitlab-ee/issues/3912) for details.
- The primary node has to be online for OAuth login to happen (existing sessions and Git are not affected)
- Replication works for repositories, wikis, issues, and merge requests, but not for job logs, artifacts, GitLab Pages, or Docker images in the Container
  Registry (Container Registry replication can be configured separately, see [replicate the Container Registry](docker_registry.md) for details)
- The installation takes multiple manual steps that together can take about an hour depending on circumstances; we are working on improving this experience, see [#2978](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2978) for details.
## Frequently Asked Questions
Read more in the [Geo FAQ](faq.md).
## Log files
Since GitLab 9.5, Geo stores structured log messages in a `geo.log` file. For
Omnibus installations, this file can be found in
`/var/log/gitlab/gitlab-rails/geo.log`. This file contains information about
when Geo attempts to sync repositories and files. Each line in the file contains a
separate JSON entry that can be ingested into Elasticsearch, Splunk, etc. For
example:
```json
{"severity":"INFO","time":"2017-08-06T05:40:16.104Z","message":"Repository update","project_id":1,"source":"repository","resync_repository":true,"resync_wiki":true,"class":"Gitlab::Geo::LogCursor::Daemon","cursor_delay_s":0.038}
```
This message shows that Geo detected that a repository update was needed for project 1.
## Security of Geo
Read the [security review](security-review.md) page.
## Tuning Geo
Read the [Geo tuning](tuning.md) documentation.
## Troubleshooting
Read the [troubleshooting document](troubleshooting.md).
[ee]: https://about.gitlab.com/products/ "GitLab Enterprise Edition landing page"
[install-ee]: https://about.gitlab.com/downloads-ee/ "GitLab Enterprise Edition Omnibus packages downloads page"
[install-ee-source]: https://docs.gitlab.com/ee/install/installation.html "GitLab Enterprise Edition installation from source"
This document was moved to [another location](../administration/geo/index.md).
# Geo configuration
>**Note:**
This is the documentation for the Omnibus GitLab packages. For installations
from source, follow the [**Geo nodes configuration for installations
from source**](configuration_source.md) guide.
## Configuring a new secondary node
>**Note:**
This is the final step in setting up a secondary Geo node. Stages of the
setup process must be completed in the documented order.
Before attempting the steps in this stage, [complete all prior stages](README.md#using-omnibus-gitlab).
The basic steps of configuring a secondary node are to replicate required
configurations between the primary and the secondaries; to configure a tracking
database on each secondary; and to start GitLab on the secondary node.
You are encouraged to first read through all the steps before executing them
in your testing/production environment.
>**Notes:**
- **Do not** set up any custom authentication on the secondary nodes; this will be
  handled by the primary node.
- **Do not** add anything in the secondaries' Geo Nodes admin area
  (**Admin Area ➔ Geo Nodes**). This is handled solely by the primary node.
### Step 1. Manually replicate secret GitLab values
GitLab stores a number of secret values in the `/etc/gitlab/gitlab-secrets.json`
file which *must* match between the primary and secondary nodes. Until there is
a means of automatically replicating these between nodes (see
[issue #3789](https://gitlab.com/gitlab-org/gitlab-ee/issues/3789)), they must
be manually replicated to the secondary.
1. SSH into the **primary** node, and execute the command below:
```bash
sudo cat /etc/gitlab/gitlab-secrets.json
```
This will display the secrets that need to be replicated, in JSON format.
1. SSH into the **secondary** node and login as the `root` user:
```
sudo -i
```
1. Make a backup of any existing secrets:
```bash
mv /etc/gitlab/gitlab-secrets.json /etc/gitlab/gitlab-secrets.json.`date +%F`
```
1. Copy `/etc/gitlab/gitlab-secrets.json` from the primary to the secondary, or
copy-and-paste the file contents between nodes:
```bash
sudo editor /etc/gitlab/gitlab-secrets.json
# paste the output of the `cat` command you ran on the primary
# save and exit
```
1. Ensure the file permissions are correct:
```bash
chown root:root /etc/gitlab/gitlab-secrets.json
chmod 0600 /etc/gitlab/gitlab-secrets.json
```
1. Reconfigure the secondary node for the change to take effect:
```
gitlab-ctl reconfigure
```
Once reconfigured, the secondary will automatically start
replicating missing data from the primary in a process known as backfill.
Meanwhile, the primary node will start to notify the secondary of any changes, so
that the secondary can act on those notifications immediately.
Make sure the secondary instance is
running and accessible. You can login to the secondary node
with the same credentials as used in the primary.
### Step 2. Manually replicate primary SSH host keys
GitLab integrates with the system-installed SSH daemon, designating a user
(typically named git) through which all access requests are handled.
In a [Disaster Recovery](../administration/disaster_recovery/index.md) situation, GitLab system
administrators will promote a secondary Geo replica to a primary. They can also
update the DNS records for the primary domain to point to the secondary, which
avoids the need to update all references to the primary domain (such as Git
remotes and API URLs) to the secondary domain.
This will cause all SSH requests to the newly promoted primary node to
fail due to SSH host key mismatch. To prevent this, the primary SSH host
keys must be manually replicated to the secondary node.
1. SSH into the **secondary** node and login as the `root` user:
```
sudo -i
```
1. Make a backup of any existing SSH host keys:
```bash
find /etc/ssh -iname ssh_host_* -exec cp {} {}.backup.`date +%F` \;
```
1. SSH into the **primary** node, and execute the command below:
```bash
sudo find /etc/ssh -iname ssh_host_* -not -iname '*.pub'
```
1. For each file in that list, copy the file from the primary node to
   the **same** location on your **secondary** node (one way to do this is
   sketched after this list).
1. On your **secondary** node, ensure the file permissions are correct:
```bash
chown root:root /etc/ssh/ssh_host_*
chmod 0600 /etc/ssh/ssh_host_*
```
1. Regenerate the public keys from the private keys:
```bash
find /etc/ssh -iname ssh_host_* -not -iname '*.backup*' -exec sh -c 'ssh-keygen -y -f "{}" > "{}.pub"' \;
```
1. Restart sshd:
```bash
service ssh restart
```
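One way to copy the private host keys from the primary into place on the secondary, as referenced above (a sketch assuming direct root SSH access from the secondary to the primary; the host name is a placeholder):

```bash
# Run on the secondary: copy the primary's private SSH host keys into place,
# then fix permissions and regenerate the public keys as described above.
scp root@primary.geo.example.com:'/etc/ssh/ssh_host_*_key' /etc/ssh/
```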
### Step 3. (Optional) Enabling hashed storage (from GitLab 10.0)
>**Warning**
Hashed storage is in **Beta**. It is not considered production-ready. See
[Hashed Storage](../administration/repository_storage_types.md) for more detail,
and for the latest updates, check
[infrastructure issue #2821](https://gitlab.com/gitlab-com/infrastructure/issues/2821).
Using hashed storage significantly improves Geo replication - project and group
renames no longer require synchronization between nodes.
1. Visit the **primary** node's **Admin Area ➔ Settings**
(`/admin/application_settings`) in your browser
1. In the `Repository Storages` section, check `Create new projects using hashed storage paths`:
![](img/hashed-storage.png)
### Step 4. (Optional) Configuring the secondary to trust the primary
You can safely skip this step if your primary uses a CA-issued HTTPS certificate.
If your primary is using a self-signed certificate for *HTTPS* support, you will
need to add that certificate to the secondary's trust store. Retrieve the
certificate from the primary and follow
[these instructions](https://docs.gitlab.com/omnibus/settings/ssl.html)
on the secondary.
### Step 5. Enable Git access over HTTP/HTTPS
Geo synchronizes repositories over HTTP/HTTPS, and therefore requires this clone
method to be enabled. Navigate to **Admin Area ➔ Settings**
(`/admin/application_settings`) on the primary node, and set
`Enabled Git access protocols` to `Both SSH and HTTP(S)` or `Only HTTP(S)`.
### Step 6. Verify proper functioning of the secondary node
Congratulations! Your secondary Geo node is now configured!
You can login to the secondary node with the same credentials you used on the
primary. Visit the secondary node's **Admin Area ➔ Geo Nodes**
(`/admin/geo_nodes`) in your browser to check if it's correctly identified as a
secondary Geo node and if Geo is enabled.
The initial replication, or 'backfill', will probably still be in progress. You
can monitor the synchronization process on each Geo node from the primary
node's Geo Nodes dashboard in your browser.
![Geo dashboard](img/geo-node-dashboard.png)
If your installation isn't working properly, check the
[troubleshooting document](troubleshooting.md).
The two most obvious issues that can become apparent in the dashboard are:
1. Database replication not working well
1. Instance-to-instance notification not working. In that case, it can be
   one of the following:
- You are using a custom certificate or custom CA (see the
[troubleshooting document](troubleshooting.md))
- The instance is firewalled (check your firewall rules)
Please note that disabling a secondary node will stop the sync process.

Also note that if `git_data_dirs` is customized on the primary for multiple
repository shards, you must duplicate the same configuration on the secondary;
see the sketch below for one way to compare the two.
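For example, to print the shard configuration on a node so the primary and secondary can be compared side by side (a sketch for Omnibus installations):

```bash
# Print any git_data_dirs configuration from gitlab.rb; run on both nodes and compare the output.
sudo grep -A 10 "git_data_dirs" /etc/gitlab/gitlab.rb
```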
Point your users to the ["Using a Geo Server" guide](using_a_geo_server.md).
Currently, this is what is synced:
* Git repositories
* Wikis
* LFS objects
* Issues, merge requests, snippets, and comment attachments
* Users, groups, and project avatars
## Selective synchronization
Geo supports selective synchronization, which allows admins to choose
which projects should be synchronized by secondary nodes.
It is important to note that selective synchronization does not:
1. Restrict permissions from secondary nodes.
1. Hide project metadata from secondary nodes.
* Since Geo currently relies on PostgreSQL replication, all project metadata
gets replicated to secondary nodes, but repositories that have not been
selected will be empty.
1. Reduce the number of events generated for the Geo event log
* The primary generates events as long as any secondaries are present.
Selective synchronization restrictions are implemented on the secondaries,
not the primary.
A subset of projects can be chosen, either by group or by storage shard. The
former is ideal for replicating data belonging to a subset of users, while the
latter is more suited to progressively rolling out Geo to a large GitLab
instance.
## Upgrading Geo
See the [updating the Geo nodes document](updating_the_geo_nodes.md).
## Troubleshooting
See the [troubleshooting document](troubleshooting.md).
This document was moved to [another location](../administration/geo/configuration.md).
# Geo configuration
>**Note:**
This is the documentation for installations from source. For installations
using the Omnibus GitLab packages, follow the
[**Omnibus Geo nodes configuration**](configuration.md) guide.
## Configuring a new secondary node
>**Note:**
This is the final step in setting up a secondary Geo node. Stages of the setup
process must be completed in the documented order. Before attempting the steps
in this stage, [complete all prior stages](README.md#using-gitlab-installed-from-source).
The basic steps of configuring a secondary node are to replicate required
configurations between the primary and the secondaries; to configure a tracking
database on each secondary; and to start GitLab on the secondary node.
You are encouraged to first read through all the steps before executing them
in your testing/production environment.
>**Notes:**
- **Do not** set up any custom authentication on the secondary nodes; this will be
handled by the primary node.
- **Do not** add anything in the secondaries' Geo Nodes Admin Area
(**Admin Area ➔ Geo Nodes**). This is handled solely by the primary node.
### Step 1. Manually replicate secret GitLab values
GitLab stores a number of secret values in the `/home/git/gitlab/config/secrets.yml`
file which *must* match between the primary and secondary nodes. Until there is
a means of automatically replicating these between nodes (see
[issue #3789](https://gitlab.com/gitlab-org/gitlab-ee/issues/3789)), they must
be manually replicated to the secondary.
1. SSH into the **primary** node, and execute the command below:
```bash
sudo cat /home/git/gitlab/config/secrets.yml
```
This will display the secrets that need to be replicated, in YAML format.
1. SSH into the **secondary** node and login as the `git` user:
```bash
sudo -i -u git
```
1. Make a backup of any existing secrets:
```bash
mv /home/git/gitlab/config/secrets.yml /home/git/gitlab/config/secrets.yml.`date +%F`
```
1. Copy `/home/git/gitlab/config/secrets.yml` from the primary to the secondary, or
copy-and-paste the file contents between nodes:
```bash
sudo editor /home/git/gitlab/config/secrets.yml
# paste the output of the `cat` command you ran on the primary
# save and exit
```
1. Ensure the file permissions are correct:
```bash
chown git:git /home/git/gitlab/config/secrets.yml
chmod 0600 /home/git/gitlab/config/secrets.yml
```
1. Restart GitLab for the changes to take effect:
```bash
service gitlab restart
```
Once restarted, the secondary will automatically start replicating missing data
from the primary in a process known as backfill. Meanwhile, the primary node
will start to notify the secondary of any changes, so that the secondary can
act on those notifications immediately.
Make sure the secondary instance is running and accessible. You can login to
the secondary node with the same credentials as used in the primary.
### Step 2. Manually replicate primary SSH host keys
Read [Manually replicate primary SSH host keys](configuration.md#step-2-manually-replicate-primary-ssh-host-keys)
### Step 3. (Optional) Enabling hashed storage (from GitLab 10.0)
Read [Enabling Hashed Storage](configuration.md#step-3-optional-enabling-hashed-storage-from-gitlab-10-0)
### Step 4. (Optional) Configuring the secondary to trust the primary
You can safely skip this step if your primary uses a CA-issued HTTPS certificate.
If your primary is using a self-signed certificate for *HTTPS* support, you will
need to add that certificate to the secondary's trust store. Retrieve the
certificate from the primary and follow your distribution's instructions for
adding it to the secondary's trust store. In Debian/Ubuntu, for example, with a
certificate file of `primary.geo.example.com.crt`, you would follow these steps:
```
sudo -i
cp primary.geo.example.com.crt /usr/local/share/ca-certificates
update-ca-certificates
```
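You can then confirm that the certificate is trusted by the system store, for example on Debian/Ubuntu (the hostname is a placeholder):

```bash
# Should end with "Verify return code: 0 (ok)" once the certificate is trusted
openssl s_client -connect primary.geo.example.com:443 -CApath /etc/ssl/certs < /dev/null
```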
### Step 5. Enable Git access over HTTP/HTTPS
Geo synchronizes repositories over HTTP/HTTPS, and therefore requires this clone
method to be enabled. Navigate to **Admin Area ➔ Settings**
(`/admin/application_settings`) on the primary node, and set
`Enabled Git access protocols` to `Both SSH and HTTP(S)` or `Only HTTP(S)`.
### Step 6. Verify proper functioning of the secondary node
Read [Verify proper functioning of the secondary node](configuration.md#step-6-verify-proper-functioning-of-the-secondary-node).
## Selective synchronization
Read [Selective synchronization](configuration.md#selective-synchronization).
## Troubleshooting
Read the [troubleshooting document](troubleshooting.md).
This document was moved to [another location](../administration/geo/configuration_source.md).
# Geo database replication
>**Note:**
This is the documentation for the Omnibus GitLab packages. For installations
from source, follow the
[**database replication for installations from source**](database_source.md) guide.
>**Note:**
If your GitLab installation uses external PostgreSQL, the Omnibus roles
will not be able to perform all necessary configuration steps. Refer to the
section on [External PostgreSQL][external postgresql] for additional instructions.
>**Note:**
The stages of the setup process must be completed in the documented order.
Before attempting the steps in this stage, [complete all prior stages][toc].
This document describes the minimal steps you have to take in order to
replicate your primary GitLab database to a secondary node's database. You may
have to change some values according to your database setup, how big it is, etc.
You are encouraged to first read through all the steps before executing them
in your testing/production environment.
## PostgreSQL replication
The GitLab primary node where the write operations happen will connect to
the primary database server, and the secondary nodes which are read-only will
connect to the secondary database servers (which are also read-only).
>**Note:**
In database documentation you may see "primary" being referenced as "master"
and "secondary" as either "slave" or "standby" server (read-only).
We recommend using [PostgreSQL replication
slots](https://medium.com/@tk512/replication-slots-in-postgresql-b4b03d277c75)
to ensure that the primary retains all the data necessary for the secondaries to
recover. See below for more details.
The following guide assumes that:
- You are using Omnibus and therefore you are using PostgreSQL 9.6 or later
which includes the [`pg_basebackup` tool][pgback] and improved
[Foreign Data Wrapper][FDW] support.
- You have a primary node already set up (the GitLab server you are
replicating from), running Omnibus' PostgreSQL (or equivalent version), and
you have a new secondary server set up with the same versions of the OS,
PostgreSQL, and GitLab on all nodes.
- The IP of the primary server for our examples will be `1.2.3.4`, whereas the
secondary's IP will be `5.6.7.8`. Note that the primary and secondary servers
**must** be able to communicate over these addresses. More on this in the
guide below.
### Step 1. Configure the primary server
1. SSH into your GitLab **primary** server and login as root:
```bash
sudo -i
```
1. Execute the command below to define the node as primary Geo node:
```bash
gitlab-ctl set-geo-primary-node
```
This command will use your defined `external_url` in `/etc/gitlab/gitlab.rb`.
1. GitLab 10.4 and up only: Do the following to make sure the `gitlab` database user has a password defined.
Generate an MD5 hash of the desired password:
```bash
gitlab-ctl pg-password-md5 gitlab
# Enter password: mypassword
# Confirm password: mypassword
# fca0b89a972d69f00eb3ec98a5838484
```
Edit `/etc/gitlab/gitlab.rb`:
```ruby
# Fill with the hash generated by `gitlab-ctl pg-password-md5 gitlab`
postgresql['sql_user_password'] = 'fca0b89a972d69f00eb3ec98a5838484'
# If you have HA setup, this must be present in all nodes as well
gitlab_rails['db_password'] = 'mypassword'
```
1. Omnibus GitLab already has a [replication user](https://wiki.postgresql.org/wiki/Streaming_Replication)
called `gitlab_replicator`. You must set the password for this user manually.
You will be prompted to enter a password:
```bash
gitlab-ctl set-replication-password
```
This command will also read the `postgresql['sql_replication_user']` Omnibus
setting in case you have changed `gitlab_replicator` username to something
else.
1. Configure PostgreSQL to listen on network interfaces
For security reasons, PostgreSQL does not listen on any network interfaces
by default. However, Geo requires the secondary to be able to
connect to the primary's database. For this reason, we need the address of
each node. Note: For external PostgreSQL instances, see [additional instructions][external postgresql].
If you are using a cloud provider, you can look up the addresses for each
Geo node through your cloud provider's management console.
To look up the address of a Geo node, SSH in to the Geo node and execute:
```bash
##
## Private address
##
ip route get 255.255.255.255 | awk '{print "Private address:", $NF; exit}'
##
## Public address
##
echo "External address: $(curl ipinfo.io/ip)"
```
In most cases, the following addresses will be used to configure GitLab
Geo:
| Configuration | Address |
|-----|-----|
| `postgresql['listen_address']` | Primary's private address |
| `postgresql['trust_auth_cidr_addresses']` | Primary's private address |
| `postgresql['md5_auth_cidr_addresses']` | Secondary's public addresses |
If you are using Google Cloud Platform, SoftLayer, or any other vendor that
provides a virtual private cloud you can use the secondary's private
address (corresponds to "internal address" for Google Cloud Platform) for
`postgresql['md5_auth_cidr_addresses']`.
The `listen_address` option opens PostgreSQL up to network connections
with the interface corresponding to the given address. See [the PostgreSQL
documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-connection.html)
for more details.
Depending on your network configuration, the suggested addresses may not
be correct. If your primary and secondary connect over a local
area network, or a virtual network connecting availability zones like
[Amazon's VPC](https://aws.amazon.com/vpc/) or [Google's VPC](https://cloud.google.com/vpc/)
you should use the secondary's private address for `postgresql['md5_auth_cidr_addresses']`.
Edit `/etc/gitlab/gitlab.rb` and add the following, replacing the IP
addresses with addresses appropriate to your network configuration:
```ruby
geo_primary_role['enable'] = true
##
## Primary address
## - replace '1.2.3.4' with the primary private address
##
postgresql['listen_address'] = '1.2.3.4'
postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32','1.2.3.4/32']
##
## Secondary addresses
## - replace '5.6.7.8' with the secondary public address
##
postgresql['md5_auth_cidr_addresses'] = ['5.6.7.8/32']
##
## Replication settings
## - set this to be the number of Geo secondary nodes you have
##
postgresql['max_replication_slots'] = 1
# postgresql['max_wal_senders'] = 10
# postgresql['wal_keep_segments'] = 10
##
## Disable automatic database migrations temporarily
## (until PostgreSQL is restarted and listening on the private address).
##
gitlab_rails['auto_migrate'] = false
```
1. Optional: If you want to add another secondary, the relevant setting would look like:
```ruby
postgresql['md5_auth_cidr_addresses'] = ['5.6.7.8/32','9.10.11.12/32']
```
You may also want to edit the `wal_keep_segments` and `max_wal_senders` to
match your database replication requirements. Consult the [PostgreSQL -
Replication documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html)
for more information.
1. Save the file and reconfigure GitLab for the database listen changes and
the replication slot changes to be applied.
```bash
gitlab-ctl reconfigure
```
Restart PostgreSQL for its changes to take effect:
```bash
gitlab-ctl restart postgresql
```
1. Re-enable migrations now that PostgreSQL is restarted and listening on the
private address.
Edit `/etc/gitlab/gitlab.rb` and **change** the configuration to `true`:
```ruby
gitlab_rails['auto_migrate'] = true
```
Save the file and reconfigure GitLab:
```bash
gitlab-ctl reconfigure
```
1. Now that the PostgreSQL server is set up to accept remote connections, run
`netstat -plnt | grep 5432` to make sure that PostgreSQL is listening on port
`5432` to the primary server's private address.
1. A certificate was automatically generated when GitLab was reconfigured. This
will be used automatically to protect your PostgreSQL traffic from
eavesdroppers, but to protect against active ("man-in-the-middle") attackers,
the secondary needs a copy of the certificate. Make a copy of the PostgreSQL
`server.crt` file on the primary node by running this command:
```bash
cat ~gitlab-psql/data/server.crt
```
Copy the output into a clipboard or into a local file. You
will need it when setting up the secondary! The certificate is not sensitive
data.
### Step 2. Add the secondary GitLab node
To prevent the secondary Geo node from trying to act as the primary once the
database is replicated, the secondary Geo node must be added on the
primary before the database is replicated.
1. Visit the **primary** node's **Admin Area ➔ Geo Nodes**
(`/admin/geo_nodes`) in your browser.
1. Add the secondary node by providing its full URL. **Do NOT** check the box
'This is a primary node'.
1. Optionally, choose which namespaces should be replicated by the
secondary node. Leave blank to replicate all. Read more in
[selective replication](#selective-replication).
1. Click the **Add node** button.
1. SSH into your GitLab **primary** server and login as root to verify the
secondary is reachable:
```
gitlab-rake gitlab:geo:check
```
The new secondary Geo node will have the status **Unhealthy**. This is expected
because we have not yet configured the secondary server. This is the next step.
### Step 3. Configure the secondary server
1. SSH into your GitLab **secondary** server and login as root:
```
sudo -i
```
1. [Check TCP connectivity](../administration/raketasks/maintenance.md) to the
primary's PostgreSQL server:
```bash
gitlab-rake gitlab:tcp_check[1.2.3.4,5432]
```
If this step fails, you may be using the wrong IP address, or a firewall may
be preventing access to the server. Check the IP address, paying close
attention to the difference between public and private addresses and ensure
that, if a firewall is present, the secondary is permitted to connect to the
primary on port 5432.
1. Create a file `server.crt` on the secondary server, with the content you got in the last step of the primary setup:
```
editor server.crt
```
1. Set up PostgreSQL TLS verification on the secondary
Install the `server.crt` file:
```bash
install -D -o gitlab-psql -g gitlab-psql -m 0400 -T server.crt ~gitlab-psql/.postgresql/root.crt
```
PostgreSQL will now only recognize that exact certificate when verifying TLS
connections. The certificate can only be replicated by someone with access
to the private key, which is **only** present on the primary node.
1. Test that the `gitlab-psql` user can connect to the primary's database:
```bash
sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql --list -U gitlab_replicator -d "dbname=gitlabhq_production sslmode=verify-ca" -W -h 1.2.3.4
```
When prompted, enter the password you set in the first step for the
`gitlab_replicator` user. If all worked correctly, you should see
the list of the primary's databases.
A failure to connect here indicates that the TLS configuration is incorrect.
Ensure that the contents of `~gitlab-psql/data/server.crt` on the primary
match the contents of `~gitlab-psql/.postgresql/root.crt` on the secondary.
1. Configure PostgreSQL to enable FDW support
This step is similar to how we configured the primary instance.
FDW support must be enabled even if you are using a single node.
Edit `/etc/gitlab/gitlab.rb` and add the following, replacing the IP
addresses with addresses appropriate to your network configuration:
```ruby
# Secondary addresses
# - replace '5.6.7.8' with the secondary private address
postgresql['listen_address'] = '5.6.7.8'
postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32','5.6.7.8/32']
postgresql['md5_auth_cidr_addresses'] = ['5.6.7.8/32']
# gitlab database user's password (defined previously)
gitlab_rails['db_password'] = 'mypassword'
# enable fdw for the geo tracking database
geo_secondary['db_fdw'] = true
```
1. Edit `/etc/gitlab/gitlab.rb` and add the following:
```ruby
geo_secondary_role['enable'] = true
```
For external PostgreSQL instances, [see additional instructions][external postgresql].
If you bring a former primary back online to serve as a secondary then you also need to remove `geo_primary_role['enable'] = true`.
1. Reconfigure GitLab for the changes to take effect:
```bash
gitlab-ctl reconfigure
```
1. Restart PostgreSQL for its changes to take effect:
```bash
gitlab-ctl restart postgresql
```
### Step 4. Initiate the replication process
Below we provide a script that connects the database on the secondary node to
the database on the primary node, replicates the database, and creates the
needed files for streaming replication.
The directories used are the defaults that are set up in Omnibus. If you have
changed any defaults or are using a source installation, adjust the directories
and paths accordingly.
>**Warning:**
Make sure to run this on the **secondary** server as it removes all PostgreSQL's
data before running `pg_basebackup`.
1. SSH into your GitLab **secondary** server and login as root:
```
sudo -i
```
1. Choose a database-friendly name for your secondary to use as the
replication slot name. For example, if your domain is
`secondary.geo.example.com`, you may use `secondary_example` as the slot
name as shown in the commands below.
1. Execute the command below to start a backup/restore and begin the replication
>**Warning:** Each Geo secondary must have its own unique replication slot name.
Using the same slot name between two secondaries will break PostgreSQL replication.
```bash
gitlab-ctl replicate-geo-database --slot-name=secondary_example --host=1.2.3.4
```
When prompted, enter the _plaintext_ password you set up for the `gitlab_replicator`
user in the first step.
This command also takes a number of additional options. You can use `--help`
to list them all, but here are a couple of tips:
- If PostgreSQL is listening on a non-standard port, add `--port=` as well.
- If your database is too large to be transferred in 30 minutes, you will need
to increase the timeout, e.g., `--backup-timeout=3600` if you expect the
initial replication to take under an hour.
- Pass `--sslmode=disable` to skip PostgreSQL TLS authentication altogether
(e.g., you know the network path is secure, or you are using a site-to-site
VPN). This is **not** safe over the public Internet!
- You can read more details about each `sslmode` in the
[PostgreSQL documentation](https://www.postgresql.org/docs/9.6/static/libpq-ssl.html#LIBPQ-SSL-PROTECTION);
the instructions above are carefully written to ensure protection against
both passive eavesdroppers and active "man-in-the-middle" attackers.
- Change the `--slot-name` to the name of the replication slot
to be used on the primary database. The script will attempt to create the
replication slot automatically if it does not exist.
- If you're repurposing an old server into a Geo secondary, you'll need to
add `--force` to the command line.
- When not on a production machine, you can skip the backup step by adding
`--skip-backup`, if you are really sure this is what you want.
1. Verify that the secondary is configured correctly and that the primary is
reachable:
```
gitlab-rake gitlab:geo:check
```
The replication process is now complete.
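To confirm that streaming replication is active, you can inspect the replication slots on the primary. This is a sketch using the Omnibus `gitlab-psql` wrapper; the slot name follows the example used above:

```bash
# Run on the primary: the slot created for the secondary should exist and be active
sudo gitlab-psql -c "SELECT slot_name, active FROM pg_replication_slots;"
```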
### External PostgreSQL instances
For installations using external PostgreSQL instances, the `geo_primary_role`
and `geo_secondary_role` roles include configuration changes that must be
applied manually.
The `geo_primary_role` makes configuration changes to `pg_hba.conf` and
`postgresql.conf` on the primary:
```
##
## Geo Primary
## - pg_hba.conf
##
host replication gitlab_replicator <trusted secondary IP>/32 md5
```
```
##
## Geo Primary Role
## - postgresql.conf
##
sql_replication_user = gitlab_replicator
wal_level = hot_standby
max_wal_senders = 10
wal_keep_segments = 50
max_replication_slots = 1 # number of secondary instances
hot_standby = on
```
The `geo_secondary_role` makes configuration changes to `postgresql.conf`, and
enables the Geo Log Cursor (`geo_logcursor`) and the secondary tracking database
on the secondary. It adds the following PostgreSQL settings to the defaults:
```
##
## Geo Secondary Role
## - postgresql.conf
##
wal_level = hot_standby
max_wal_senders = 10
wal_keep_segments = 10
hot_standby = on
```
Geo secondary nodes use a tracking database to keep track of replication
status and recover automatically from some replication issues. Follow the
instructions for [enabling tracking database on the secondary server][tracking].
## MySQL replication
MySQL replication is not supported for Geo.
## Troubleshooting
Read the [troubleshooting document](troubleshooting.md).
[pgback]: http://www.postgresql.org/docs/9.2/static/app-pgbasebackup.html
[external postgresql]: #external-postgresql-instances
[tracking]: database_source.md#enable-tracking-database-on-the-secondary-server
[FDW]: https://www.postgresql.org/docs/9.6/static/postgres-fdw.html
[toc]: README.md#using-omnibus-gitlab
This document was moved to [another location](../administration/geo/database.md).
# Geo database replication
>**Note:**
This is the documentation for installations from source. For installations
using the Omnibus GitLab packages, follow the
[**database replication for Omnibus GitLab**](database.md) guide.
>**Note:**
The stages of the setup process must be completed in the documented order.
Before attempting the steps in this stage, [complete all prior stages][toc].
This document describes the minimal steps you have to take in order to
replicate your primary GitLab database to a secondary node's database. You may
have to change some values according to your database setup, how big it is, etc.
You are encouraged to first read through all the steps before executing them
in your testing/production environment.
## PostgreSQL replication
The GitLab primary node, where the write operations happen, will connect to
the primary database server, and the secondary nodes, which are read-only, will
connect to the secondary database servers (which are also read-only).
>**Note:**
In database documentation, you may see "primary" referred to as "master"
and "secondary" as either "slave" or "standby" server (read-only).
We recommend using [PostgreSQL replication
slots](https://medium.com/@tk512/replication-slots-in-postgresql-b4b03d277c75)
to ensure the primary retains all the data necessary for the secondaries to
recover. See below for more details.
The following guide assumes that:
- You are using PostgreSQL 9.6 or later which includes the
[`pg_basebackup` tool][pgback] and improved [Foreign Data Wrapper][FDW] support.
- You have a primary node already set up (the GitLab server you are
replicating from), running PostgreSQL 9.6 or later, and
you have a new secondary server set up with the same versions of the OS,
PostgreSQL, and GitLab on all nodes.
- The IP of the primary server for our examples will be `1.2.3.4`, whereas the
secondary's IP will be `5.6.7.8`. Note that the primary and secondary servers
**must** be able to communicate over these addresses. These IP addresses can either
be public or private.
### Step 1. Configure the primary server
1. SSH into your GitLab **primary** server and login as root:
```bash
sudo -i
```
1. Add this node as the Geo primary by running:
```bash
bundle exec rake geo:set_primary_node
```
1. Create a [replication user] named `gitlab_replicator`:
```bash
sudo -u postgres psql -c "CREATE USER gitlab_replicator REPLICATION ENCRYPTED PASSWORD 'thepassword';"
```
1. Make sure the `gitlab` database user has a password defined:
```bash
sudo -u postgres psql -d template1 -c "ALTER USER gitlab WITH ENCRYPTED PASSWORD 'mydatabasepassword';"
```
1. Edit the content of `database.yml` in `production:` and add the password like the example below:
```yaml
#
# PRODUCTION
#
production:
adapter: postgresql
encoding: unicode
database: gitlabhq_production
pool: 10
username: gitlab
password: mydatabasepassword
host: /var/opt/gitlab/geo-postgresql
```
1. Set up TLS support for the PostgreSQL primary server
> **Warning**: Only skip this step if you **know** that PostgreSQL traffic
> between the primary and secondary will be secured through some other
> means, e.g., a known-safe physical network path or a site-to-site VPN that
> you have configured.
If you are replicating your database across the open Internet, it is
**essential** that the connection is TLS-secured. Correctly configured, this
provides protection against both passive eavesdroppers and active
"man-in-the-middle" attackers.
To generate a self-signed certificate and key, run this command:
```bash
openssl req -nodes -batch -x509 -newkey rsa:4096 -keyout server.key -out server.crt -days 3650
```
This will create two files - `server.key` and `server.crt` - that you can
use for authentication.
Copy them to the correct location for your PostgreSQL installation:
```bash
# Copying a self-signed certificate and key
install -o postgres -g postgres -m 0400 -T server.crt ~postgres/9.x/main/data/server.crt
install -o postgres -g postgres -m 0400 -T server.key ~postgres/9.x/main/data/server.key
```
Add this configuration to `postgresql.conf`, removing any existing
configuration for `ssl_cert_file` or `ssl_key_file`:
```
ssl = on
ssl_cert_file='server.crt'
ssl_key_file='server.key'
```
1. Edit `postgresql.conf` to configure the primary server for streaming replication
(for Debian/Ubuntu that would be `/etc/postgresql/9.x/main/postgresql.conf`):
```
listen_address = '1.2.3.4'
wal_level = hot_standby
max_wal_senders = 5
min_wal_size = 80MB
max_wal_size = 1GB
max_replication_slots = 1 # Number of Geo secondary nodes
wal_keep_segments = 10
hot_standby = on
```
Be sure to set `max_replication_slots` to the number of Geo secondary
nodes that you may potentially have (at least 1).
For security reasons, PostgreSQL by default only listens on the local
interface (e.g. 127.0.0.1). However, Geo needs to communicate
between the primary and secondary nodes over a common network, such as a
corporate LAN or the public Internet. For this reason, we need to
configure PostgreSQL to listen on more interfaces.
The `listen_address` option opens PostgreSQL up to external connections
with the interface corresponding to the given IP. See [the PostgreSQL
documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-connection.html)
for more details.
You may also want to edit the `wal_keep_segments` and `max_wal_senders` to
match your database replication requirements. Consult the [PostgreSQL - Replication documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html)
for more information.
1. Set the access control on the primary to allow TCP connections using the
server's public IP and set the connection from the secondary to require a
password. Edit `pg_hba.conf` (for Debian/Ubuntu that would be
`/etc/postgresql/9.x/main/pg_hba.conf`):
```bash
host all all 127.0.0.1/32 trust
host all all 1.2.3.4/32 trust
host replication gitlab_replicator 5.6.7.8/32 md5
```
Where `1.2.3.4` is the public IP address of the primary server, and `5.6.7.8`
the public IP address of the secondary one. If you want to add another
secondary, add one more row like the replication one and change the IP
address:
```bash
host all all 127.0.0.1/32 trust
host all all 1.2.3.4/32 trust
host replication gitlab_replicator 5.6.7.8/32 md5
host replication gitlab_replicator 11.22.33.44/32 md5
```
1. Restart PostgreSQL for the changes to take effect.
1. Choose a database-friendly name for your secondary to use as the
replication slot name. For example, if your domain is
`secondary.geo.example.com`, you may use `secondary_example` as the slot
name.
1. Create the replication slot on the primary:
```bash
$ sudo -u postgres psql -c "SELECT * FROM pg_create_physical_replication_slot('secondary_example');"
slot_name | xlog_position
------------------+---------------
secondary_example |
(1 row)
```
1. Now that the PostgreSQL server is set up to accept remote connections, run
`netstat -plnt` to make sure that PostgreSQL is listening on the server's
public IP.
### Step 2. Add the secondary GitLab node
Follow the steps in ["add the secondary GitLab node"](database.md#step-2-add-the-secondary-gitlab-node).
### Step 3. Configure the secondary server
Follow the first steps in ["configure the secondary server"](database.md#step-3-configure-the-secondary-server),
but note that since you are installing from source, the username and
group listed as `gitlab-psql` in those steps should be replaced by `postgres`
instead. After completing the "Test that the `gitlab-psql` user can connect to
the primary's database" step, continue here:
1. Edit `postgresql.conf` to configure the secondary for streaming replication
(for Debian/Ubuntu that would be `/etc/postgresql/9.*/main/postgresql.conf`):
```bash
wal_level = hot_standby
max_wal_senders = 5
checkpoint_segments = 10
wal_keep_segments = 10
hot_standby = on
```
1. Restart PostgreSQL for the changes to take effect.
#### Enable tracking database on the secondary server
Geo secondary nodes use a tracking database to keep track of replication status
and recover automatically from some replication issues. Follow the steps below to create
the tracking database.
1. On the secondary node, run the following command to create `database_geo.yml` with the
information of your secondary PostgreSQL instance:
```bash
sudo cp /home/git/gitlab/config/database_geo.yml.postgresql /home/git/gitlab/config/database_geo.yml
```
1. Edit the content of `database_geo.yml` in `production:` as in the example below:
```yaml
#
# PRODUCTION
#
production:
adapter: postgresql
encoding: unicode
database: gitlabhq_geo_production
pool: 10
username: gitlab_geo
# password:
host: /var/opt/gitlab/geo-postgresql
```
1. Create the database `gitlabhq_geo_production` on the PostgreSQL instance of the secondary
node.
1. Set up the Geo tracking database:
```bash
bundle exec rake geo:db:migrate
```
1. Configure the [PostgreSQL FDW][FDW] connection and credentials:
Save the script below in a file, e.g. `/tmp/geo_fdw.sh`, and modify the
connection parameters to match your environment.
```bash
#!/bin/bash
# Secondary Database connection params:
DB_HOST="/var/opt/gitlab/postgresql"
DB_NAME="gitlabhq_production"
DB_USER="gitlab"
DB_PORT="5432"
# Tracking Database connection params:
GEO_DB_HOST="/var/opt/gitlab/geo-postgresql"
GEO_DB_NAME="gitlabhq_geo_production"
GEO_DB_USER="gitlab_geo"
GEO_DB_PORT="5432"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "CREATE EXTENSION postgres_fdw;"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "CREATE SERVER gitlab_secondary FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '${DB_HOST}', dbname '${DB_NAME}', port '${DB_PORT}');"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "CREATE USER MAPPING FOR ${GEO_DB_USER} SERVER gitlab_secondary OPTIONS (user '${DB_USER}');"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "CREATE SCHEMA gitlab_secondary;"
sudo -u postgres psql -h $GEO_DB_HOST -d $GEO_DB_NAME -p $GEO_DB_PORT -c "GRANT USAGE ON FOREIGN SERVER gitlab_secondary TO ${GEO_DB_USER};"
```
Then edit the content of `database_geo.yml` to add `fdw: true` to the
`production:` block; a verification sketch follows this list.
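A quick way to verify that the FDW objects were created (a sketch; the connection parameters match the script above):

```bash
# List foreign servers defined in the tracking database; gitlab_secondary should appear
sudo -u postgres psql -h /var/opt/gitlab/geo-postgresql -d gitlabhq_geo_production \
  -c "SELECT srvname FROM pg_foreign_server;"
```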
### Step 4. Initiate the replication process
Below we provide a script that connects the database on the secondary node to
the database on the primary node, replicates the database, and creates the
needed files for streaming replication.
The directories used are the defaults for Debian/Ubuntu. If you have changed
any defaults, configure it as you see fit replacing the directories and paths.
>**Warning:**
Make sure to run this on the **secondary** server as it removes all PostgreSQL's
data before running `pg_basebackup`.
1. SSH into your GitLab **secondary** server and login as root:
```bash
sudo -i
```
1. Save the snippet below in a file, let's say `/tmp/replica.sh`. Modify the
embedded paths if necessary:
```bash
#!/bin/bash
PORT="5432"
USER="gitlab_replicator"
echo ---------------------------------------------------------------
echo WARNING: Make sure this script is run from the secondary server
echo ---------------------------------------------------------------
echo
echo Enter the IP or FQDN of the primary PostgreSQL server
read HOST
echo Enter the password for $USER@$HOST
read -s PASSWORD
echo Enter the required sslmode
read SSLMODE
echo Stopping PostgreSQL and all GitLab services
gitlab-ctl stop
echo Backing up postgresql.conf
sudo -u postgres mv /var/opt/gitlab/postgresql/data/postgresql.conf /var/opt/gitlab/postgresql/
echo Cleaning up old cluster directory
sudo -u postgres rm -rf /var/opt/gitlab/postgresql/data
rm -f /tmp/postgresql.trigger
echo Starting base backup as the replicator user
echo Enter the password for $USER@$HOST
sudo -u postgres /opt/gitlab/embedded/bin/pg_basebackup -h $HOST -D /var/opt/gitlab/postgresql/data -U gitlab_replicator -v -x -P
echo Writing recovery.conf file
sudo -u postgres bash -c "cat > /var/opt/gitlab/postgresql/data/recovery.conf <<- _EOF1_
standby_mode = 'on'
primary_conninfo = 'host=$HOST port=$PORT user=$USER password=$PASSWORD sslmode=$SSLMODE'
trigger_file = '/tmp/postgresql.trigger'
_EOF1_
"
echo Restoring postgresql.conf
sudo -u postgres mv /var/opt/gitlab/postgresql/postgresql.conf /var/opt/gitlab/postgresql/data/
echo Starting PostgreSQL and all GitLab services
gitlab-ctl start
```
1. Run it with:
```bash
bash /tmp/replica.sh
```
When prompted, enter the IP/FQDN of the primary, and the password you set up
for the `gitlab_replicator` user in the first step.
You should use `verify-ca` for the `sslmode`. You can use `disable` if you
are happy to skip PostgreSQL TLS authentication altogether (e.g., you know
the network path is secure, or you are using a site-to-site VPN). This is
**not** safe over the public Internet!
You can read more details about each `sslmode` in the
[PostgreSQL documentation](https://www.postgresql.org/docs/9.6/static/libpq-ssl.html#LIBPQ-SSL-PROTECTION);
the instructions above are carefully written to ensure protection against
both passive eavesdroppers and active "man-in-the-middle" attackers.
The replication process is now complete.
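To confirm that the secondary is connected and streaming, you can query the primary's PostgreSQL instance (a sketch):

```bash
# Run on the primary: one row per connected standby; state should be "streaming"
sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;"
```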
## MySQL replication
MySQL replication is not supported for Geo.
## Troubleshooting
Read the [troubleshooting document](troubleshooting.md).
[pgback]: http://www.postgresql.org/docs/9.6/static/app-pgbasebackup.html
[replication user]:https://wiki.postgresql.org/wiki/Streaming_Replication
[FDW]: https://www.postgresql.org/docs/9.6/static/postgres-fdw.html
[toc]: README.md#using-gitlab-installed-from-source
This document was moved to [another location](../administration/geo/database_source.md).
# Docker Registry for a secondary node
You can set up a [Docker Registry](https://docs.docker.com/registry/) on your
secondary Geo node that mirrors the one on the primary Geo node.
## Storage support
CAUTION: **Warning:**
If you use [local storage](../administration/container_registry.md#container-registry-storage-driver)
for the Container Registry you **cannot** replicate it to the secondary Geo node.
Docker Registry currently supports a few types of storage. If you choose a
distributed storage (`azure`, `gcs`, `s3`, `swift`, or `oss`) for your Docker
Registry on a primary Geo node, you can use the same storage for a secondary
Docker Registry as well. For more information, read the
[Load balancing considerations](https://docs.docker.com/registry/deploying/#load-balancing-considerations)
when deploying the Registry, and how to set up the storage driver for GitLab's
integrated [Container Registry](../administration/container_registry.md#container-registry-storage-driver).
[ee]: https://about.gitlab.com/products/
This document was moved to [another location](../administration/geo/docker_registry.md).
# Geo Frequently Asked Questions
## Can I use Geo in a disaster recovery situation?
Yes, but there are limitations to what we replicate (see
[What data is replicated to a secondary node?](#what-data-is-replicated-to-a-secondary-node)).
Read the documentation for [Disaster Recovery](../administration/disaster_recovery/index.md).
## What data is replicated to a secondary node?
We currently replicate project repositories, LFS objects, generated
attachments / avatars and the whole database. This means user accounts,
issues, merge requests, groups, project data, etc., will be available for
query. We currently don't replicate artifact data (`shared/folder`).
## Can I git push to a secondary node?
No. All write operations (including `git push`) must be performed on your
primary node.
## How long does it take to have a commit replicated to a secondary node?
All replication operations are asynchronous and are queued to be dispatched in
a batched request every 10 minutes. Besides that, it depends on a lot of other
factors including the amount of traffic, how big your commit is, the
connectivity between your nodes, your hardware, etc.
## What if the SSH server runs at a different port?
We send the clone URL from the primary server to any secondaries, so it
doesn't matter. If the primary is running on port `2200`, the clone URL will
reflect that.
## Is it possible to set up a Docker Registry for a secondary node that mirrors the one on a primary node?
Yes. See [Docker Registry for a secondary Geo node](docker_registry.md).
This document was moved to [another location](../administration/geo/faq.md).
# Geo High Availability
This document describes a minimal reference architecture for running Geo
in a high availability configuration. If your HA setup differs from the one
described, it is possible to adapt these instructions to your needs.
## Architecture overview
![Geo HA Diagram](../administration/img/high_availability/geo-ha-diagram.png)
_[diagram source - gitlab employees only](https://docs.google.com/drawings/d/1z0VlizKiLNXVVVaERFwgsIOuEgjcUqDTWPdQYsE7Z4c/edit)_
The topology above assumes that the primary and secondary Geo clusters
are located in two separate locations, on their own virtual network
with private IP addresses. The network is configured such that all machines within
one geographic location can communicate with each other using their private IP addresses.
The IP addresses given are examples and may be different depending on the
network topology of your deployment.
The only external way to access the two Geo deployments is by HTTPS at
`gitlab.us.example.com` and `gitlab.eu.example.com` in the example above.
> **Note:** The primary and secondary Geo deployments must be able to
> communicate to each other over HTTPS.
## Redis and PostgreSQL High Availability
The primary and secondary Redis and PostgreSQL should be configured
for high availability. Because of the additional complexity involved
in setting up this configuration for PostgreSQL and Redis,
it is not covered by this Geo HA documentation.
Instead, in the examples below, each of these services
runs on a single machine.
For more information about setting up a highly available PostgreSQL cluster and Redis cluster using the Omnibus package, see the high availability documentation for
[PostgreSQL](../administration/high_availability/database.md) and
[Redis](../administration/high_availability/redis.md), respectively.
From these instructions you will need the following for the examples below:
* `gitlab_rails['db_password']` for the PostgreSQL "DB password"
* `redis['password']` for the Redis "Redis password"
NOTE: **Note:**
It is possible to use cloud hosted services for PostgreSQL and Redis but this is beyond the scope of this document.
### Prerequisites
Make sure you have GitLab EE installed using the
[Omnibus package](https://about.gitlab.com/installation).
### Step 1: Configure the Geo Backend Services
On the **primary** backend servers configure the following services:
* [Redis](../administration/high_availability/redis.md) for high availability.
* [NFS Server](../administration/high_availability/nfs.md) for repository, LFS, and upload storage.
* [PostgreSQL](../administration/high_availability/database.md) for high availability.
On the **secondary** backend servers configure the following services:
* [Redis](../administration/high_availability/redis.md) for high availability.
* [NFS Server](../administration/high_availability/nfs.md) which will store data that is synchronized from the Geo primary.
### Step 2: Configure the Postgres services on the Geo Secondary
1. Configure the [secondary Geo PostgreSQL database](../gitlab-geo/database.md)
as a read-only secondary of the primary Geo PostgreSQL database.
1. Configure the Geo tracking database on the secondary server, to do this modify `/etc/gitlab/gitlab.rb`:
```ruby
geo_postgresql['enable'] = true
geo_postgresql['listen_address'] = '10.1.4.1'
geo_postgresql['trust_auth_cidr_addresses'] = ['10.1.0.0/16']
geo_secondary['auto_migrate'] = true
geo_secondary['db_host'] = '10.1.4.1'
geo_secondary['db_password'] = 'Geo tracking DB password'
```
NOTE: **Note:**
Be sure that other non-PostgreSQL services are disabled by setting `enable` to `false` in
the [gitlab.rb configuration](https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/files/gitlab-config-template/gitlab.rb.template).
After making these changes be sure to run `sudo gitlab-ctl reconfigure` so that they take effect.
### Step 3: Setup the LoadBalancer
In this topology, a load balancer is needed at each geographic location
to route traffic to the application servers.
See the [Load Balancer for GitLab HA](../administration/high_availability/load_balancer.md)
documentation for more information.
### Step 4: Configure the Geo Frontend Application Servers
In the architecture overview there are two machines running the GitLab application
services. These services are enabled selectively in the configuration. Additionally
the addresses of the remote endpoints for PostgreSQL and Redis will need to be specified.
#### On the GitLab Primary Frontend servers
1. Edit `/etc/gitlab/gitlab.rb` and ensure the following settings are present, to disable PostgreSQL and Redis from running locally:
```ruby
##
## Disable PostgreSQL on the local machine and connect to the remote
##
postgresql['enable'] = false
gitlab_rails['auto_migrate'] = false
gitlab_rails['db_host'] = '10.0.3.1'
gitlab_rails['db_password'] = 'DB password'
##
## Disable Redis on the local machine and connect to the remote
##
redis['enable'] = false
gitlab_rails['redis_host'] = '10.0.2.1'
gitlab_rails['redis_password'] = 'Redis password'
geo_primary_role['enable'] = true
```
#### On the GitLab Secondary Frontend servers
On the secondary the remote endpoint for the PostgreSQL Geo database will
be specified.
1. Edit `/etc/gitlab/gitlab.rb` and ensure the following settings are present, to disable PostgreSQL and Redis from running locally and configure the secondary to connect to the Geo tracking database:
```ruby
##
## Disable PostgreSQL on the local machine and connect to the remote
##
postgresql['enable'] = false
gitlab_rails['auto_migrate'] = false
gitlab_rails['db_host'] = '10.1.3.1'
gitlab_rails['db_password'] = 'DB password'
##
## Disable Redis on the local machine and connect to the remote
##
redis['enable'] = false
gitlab_rails['redis_host'] = '10.1.2.1'
gitlab_rails['redis_password'] = 'Redis password'
##
## Enable the geo secondary role and configure the
## geo tracking database
##
geo_secondary_role['enable'] = true
geo_secondary['db_host'] = '10.1.4.1'
geo_secondary['db_password'] = 'Geo tracking DB password'
geo_postgresql['enable'] = false
```
After making these changes [Reconfigure GitLab][] so that they take effect.
On the primary the following GitLab frontend services will be enabled:
* gitlab-pages
* gitlab-workhorse
* logrotate
* nginx
* registry
* remote-syslog
* sidekiq
* unicorn
On the secondary the following GitLab frontend services will be enabled:
* geo-logcursor
* gitlab-pages
* gitlab-workhorse
* logrotate
* nginx
* registry
* remote-syslog
* sidekiq
* unicorn
Verify these services by running `sudo gitlab-ctl status` on the frontend
application servers.
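For example, on a secondary frontend the Geo Log Cursor should appear among the running services (a sketch):

```bash
# geo-logcursor is only expected on secondary frontend servers
sudo gitlab-ctl status | grep geo-logcursor
```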
[reconfigure GitLab]: ../administration/restart_gitlab.md#omnibus-gitlab-reconfigure
[restart GitLab]: ../administration/restart_gitlab.md#omnibus-gitlab-restart
This document was moved to [another location](../administration/geo/high_availability.md).
# Geo with Object storage
Geo can be used in combination with Object Storage (AWS S3, or
other compatible object storage).
## Configuration
At this time, if object storage is enabled on the primary, it must also be
enabled on the secondary.
The secondary nodes can use the same storage bucket as the primary, or
they can use a replicated storage bucket. At this time GitLab does not
take care of content replication in object storage.
For LFS, follow the documentation to
[set up LFS object storage](../workflow/lfs/lfs_administration.md#setting-up-s3-compatible-object-storage).
For CI job artifacts, there is similar documentation to configure
[jobs artifact object storage](../administration/job_artifacts.md#using-object-storage)
Complete these steps on all nodes, primary **and** secondary.
## Replication
When using Amazon S3, you can use
[CRR](https://docs.aws.amazon.com/AmazonS3/latest/dev/crr.html) to
have automatic replication between the bucket used by the primary and
the bucket used by the secondary.
If you are using Google Cloud Storage, consider using
[Multi-Regional Storage](https://cloud.google.com/storage/docs/storage-classes#multi-regional).
Or you can use the [Storage Transfer Service](https://cloud.google.com/storage/transfer/),
although this only supports daily synchronization.
For manual synchronization, or scheduled by `cron`, please have a look at:
- [`s3cmd sync`](http://s3tools.org/s3cmd-sync)
- [`gsutil rsync`](https://cloud.google.com/storage/docs/gsutil/commands/rsync)
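For example, a scheduled `gsutil rsync` might look like this (a sketch; the bucket names are placeholders, and `gsutil` must be installed and authenticated):

```bash
# Mirror the primary's LFS bucket to the secondary's bucket; schedule via cron as needed
gsutil -m rsync -r gs://gitlab-primary-lfs gs://gitlab-secondary-lfs
```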
This document was moved to [another location](../administration/geo/object_storage.md).
The following security review of the Geo feature set focuses on security
aspects of the feature as they apply to customers running their own GitLab
instances. The review questions are based in part on the [application security architecture](https://www.owasp.org/index.php/Application_Security_Architecture_Cheat_Sheet)
questions from [owasp.org](https://www.owasp.org).
## Business Model
### What geographic areas does the application service?
- This varies by customer. Geo allows customers to deploy to multiple areas,
and they get to choose where they are.
- Region and node selection is entirely manual.
## Data Essentials
### What data does the application receive, produce, and process?
- Geo streams almost all data held by a GitLab instance between sites. This
includes full database replication, most files (user-uploaded attachments,
etc) and repository + wiki data. In a typical configuration, this will
happen across the public Internet, and be TLS-encrypted.
- PostgreSQL replication is TLS-encrypted.
- See also: [only TLSv1.2 should be supported](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2948)
### How can the data be classified into categories according to its sensitivity?
- GitLab’s model of sensitivity is centered around public vs. internal vs.
private projects. Geo replicates them all indiscriminately. “Selective sync”
exists for files and repositories (but not database content), which would permit
only less-sensitive projects to be replicated to a secondary if desired.
- See also: [developing a data classification policy](https://gitlab.com/gitlab-com/security/issues/4).
### What data backup and retention requirements have been defined for the application?
- Geo is designed to provide replication of a certain subset of the application
data. It is part of the solution, rather than part of the problem.
## End-Users
### Who are the application's end‐users?
- Geo nodes (secondaries) are created in regions that are distant (in terms of
Internet latency) from the main GitLab installation (the primary). They are
intended to be used by anyone who would ordinarily use the primary, who finds
that the secondary is closer to them (in terms of Internet latency).
### How do the end‐users interact with the application?
- A Geo secondary node provides all the interfaces a Geo primary node does
(notably an HTTP/HTTPS web application, and HTTP/HTTPS or SSH git repository
access), but is constrained to read-only activities. The principal use case is
envisioned to be cloning git repositories from the secondary in favor of the
primary, but end-users may use the GitLab web interface to view projects,
issues, merge requests, snippets, etc.
### What security expectations do the end‐users have?
- The replication process must be secure. It would typically be unacceptable to
transmit the entire database contents or all files and repositories across the
public Internet in plaintext, for instance.
- The Geo secondary must have the same access controls over its content as the
primary - unauthenticated users must not be able to gain access to privileged
information on the primary by querying the secondary.
- Attackers must not be able to impersonate the secondary to the primary, and
thus gain access to privileged information.
## Administrators
### Who has administrative capabilities in the application?
- Nothing Geo-specific. Any user where `admin: true` is set in the database is
considered an admin with super-user privileges.
- See also: [more granular access control](https://gitlab.com/gitlab-org/gitlab-ce/issues/32730)
(not geo-specific)
- Much of Geo’s integration (database replication, for instance) must be
configured with the application, typically by system administrators.
### What administrative capabilities does the application offer?
- Geo secondaries may be added, modified, or removed by users with
administrative access.
- The replication process may be controlled (start/stop) via the Sidekiq
administrative controls.
## Network
### What details regarding routing, switching, firewalling, and load‐balancing have been defined?
- Geo requires the primary and secondary to be able to communicate with each
other across a TCP/IP network. In particular, the secondaries must be able to
access HTTP/HTTPS and PostgreSQL services on the primary.
### What core network devices support the application?
- Varies from customer to customer.
### What network performance requirements exist?
- Maximum replication speeds between primary and secondary is limited by the
available bandwidth between sites. No hard requirements exist - time to complete
replication (and ability to keep up with changes on the primary) is a function
of the size of the data set, tolerance for latency, and available network
capacity.
### What private and public network links support the application?
- Customers choose their own networks. As sites are intended to be
geographically separated, it is envisioned that replication traffic will pass
over the public Internet in a typical deployment, but this is not a requirement.
## Systems
### What operating systems support the application?
- Geo imposes no additional restrictions on operating system (see the
[GitLab installation](https://about.gitlab.com/installation/) page for more
details), however we recommend using the operating systems listed in the [Geo documentation](http://docs.gitlab.com/ee/gitlab-geo/#geo-recommendations).
### What details regarding required OS components and lock‐down needs have been defined?
- The recommended installation method (Omnibus) packages most components itself.
A from-source installation method exists. Both are documented at
https://docs.gitlab.com/ee/gitlab-geo/
- There are significant dependencies on the system-installed OpenSSH daemon (Geo
requires users to set up custom authentication methods) and the omnibus or
system-provided PostgreSQL daemon (it must be configured to listen on TCP,
additional users and replication slots must be added, etc).
- The process for dealing with security updates (for example, if there is a
significant vulnerability in OpenSSH or other services, and the customer
wants to patch those services on the OS) is identical to the non-Geo
situation: security updates to OpenSSH would be provided to the user via the
usual distribution channels. Geo introduces no delay there.
## Infrastructure Monitoring
### What network and system performance monitoring requirements have been defined?
- None specific to Geo.
### What mechanisms exist to detect malicious code or compromised application components?
- None specific to Geo.
### What network and system security monitoring requirements have been defined?
- None specific to Geo.
## Virtualization and Externalization
### What aspects of the application lend themselves to virtualization?
- All.
### What virtualization requirements have been defined for the application?
- Nothing Geo-specific, but everything in GitLab needs to have full
functionality in such an environment.
### What aspects of the product may or may not be hosted via the cloud computing model?
- GitLab is “cloud native” and this applies to Geo as much as to the rest of the
product. Deployment in clouds is a common and supported scenario.
### If applicable, what approach(es) to cloud computing will be taken (Managed Hosting versus "Pure" Cloud, a "full machine" approach such as AWS-EC2 versus a "hosted database" approach such as AWS-RDS and Azure, etc)?
- To be decided by our customers, according to their operational needs.
## Environment
### What frameworks and programming languages have been used to create the application?
- Ruby on Rails, Ruby.
### What process, code, or infrastructure dependencies have been defined for the application?
- Nothing specific to Geo.
### What databases and application servers support the application?
- PostgreSQL >= 9.6, Redis, Sidekiq, Unicorn.
### How will database connection strings, encryption keys, and other sensitive components be stored, accessed, and protected from unauthorized detection?
- There are some Geo-specific values. Some are shared secrets which must be
securely transmitted from the primary to the secondary at setup time. Our
documentation recommends transmitting them from the primary to the system
administrator via SSH, and then back out to the secondary in the same manner.
In particular, this includes the PostgreSQL replication credentials and a secret
key (`db_key_base`) which is used to decrypt certain columns in the database.
The `db_key_base` secret is stored unencrypted on the filesystem, in
`/etc/gitlab/gitlab-secrets.json`, along with a number of other secrets. There is
no at-rest protection for them.
## Data Processing
### What data entry paths does the application support?
- Data is entered via the web application exposed by GitLab itself. Some data is
also entered using system administration commands on the GitLab servers (e.g.,
`gitlab-ctl set-primary-node`).
- Secondaries also receive inputs via PostgreSQL streaming replication from the
primary.
### What data output paths does the application support?
- Primaries output via PostgreSQL streaming replication to the secondary.
Otherwise, principally via the web application exposed by GitLab itself, and via
SSH `git clone` operations initiated by the end-user.
### How does data flow across the application's internal components?
- Secondaries and primaries interact via HTTP/HTTPS (secured with JSON web
tokens) and via PostgreSQL streaming replication.
- Within a primary or secondary, the SSOT is the filesystem and the database
(including Geo tracking database on secondary). The various internal components
are orchestrated to make alterations to these stores.
### What data input validation requirements have been defined?
- Secondaries must have a faithful replication of the primary’s data.
### What data does the application store and how?
- Git repositories and files, tracking information related to them, and the
GitLab database contents.
### What data is or may need to be encrypted and what key management requirements have been defined?
- Neither primaries nor secondaries encrypt Git repository or filesystem data at
rest. A subset of database columns are encrypted at rest using the `db_otp_key`
- a static secret shared across all hosts in a GitLab deployment.
- In transit, data should be encrypted, although the application does permit
communication to proceed unencrypted. The two main data flows in transit are
PostgreSQL streaming replication to the secondary and the replication of Git
repositories and files. Both should be protected using TLS, with the keys for
that managed via Omnibus per the existing configuration for end-user access to
GitLab.
### What capabilities exist to detect the leakage of sensitive data?
- Comprehensive system logs exist, tracking every connection to GitLab and
PostgreSQL.
### What encryption requirements have been defined for data in transit - including transmission over WAN, LAN, SecureFTP, or publicly accessible protocols such as http: and https:?
- Data must have the option to be encrypted in transit, and be secure against
both passive and active attack (e.g., MITM attacks should not be possible).
## Access
### What user privilege levels does the application support?
- Geo adds one type of privilege: secondaries can access a special Geo API to
download files over HTTP/HTTPS, and to clone repositories using HTTP/HTTPS.
### What user identification and authentication requirements have been defined?
- Geo secondaries identify to Geo primaries via OAuth or JWT authentication
based on the shared database (HTTP access) or a PostgreSQL replication user (for
database replication). The database replication also requires IP-based access
controls to be defined.
### What user authorization requirements have been defined?
- Secondaries must only be able to *read* data. They are not currently able to
mutate data on the primary.
### What session management requirements have been defined?
- Geo JWTs are defined to last for only two minutes before needing to be
regenerated.
### What access requirements have been defined for URI and Service calls?
- A Geo secondary makes many calls to the primary's API. This is how file
replication proceeds, for instance. These endpoints are only accessible with a
JWT.
- The primary also makes calls to the secondary to get status information.
## Application Monitoring
### What application auditing requirements have been defined? How are audit and debug logs accessed, stored, and secured?
- Structured JSON logs are written to the filesystem and can also be ingested
into a Kibana installation for further analysis.
This document was moved to [another location](../administration/geo/security_review.md).
# Geo Troubleshooting
>**Note:**
This list is an attempt to document all the moving parts that can go wrong.
We are working on having all these steps verified automatically by a
rake task in the future.
Setting up Geo requires careful attention to detail, and it's easy to
miss a step. Here is a list of questions you should ask to try to detect
what you need to fix (all commands and path locations are for Omnibus installs):
#### First check the health of the secondary
Visit the primary node's **Admin Area ➔ Geo Nodes** (`/admin/geo_nodes`) in
your browser. We perform the following health checks on each secondary node
to help identify if something is wrong:
- Is the node running?
- Is the node's secondary database configured for streaming replication?
- Is the node's secondary tracking database configured?
- Is the node's secondary tracking database connected?
- Is the node's secondary tracking database up-to-date?
![Geo health check](img/geo-node-healthcheck.png)
There is also an option to check the status of the secondary node by running a special rake task:
```
sudo gitlab-rake geo:status
```
#### Is Postgres replication working?
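One quick way to check is to list the WAL senders on the primary (a sketch for Omnibus installs; it assumes `gitlab-psql` passes extra flags through to `psql`, and the query itself is standard PostgreSQL). An empty result means no secondary is currently streaming:

```bash
# On the primary: expect one row per streaming secondary.
sudo gitlab-psql gitlabhq_production -c 'SELECT * FROM pg_stat_replication;'
```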
#### Are my nodes pointing to the correct database instance?
You should make sure your primary Geo node points to the instance with
write permissions.
Any secondary nodes should point only to read-only instances.
#### Can Geo detect my current node correctly?
Geo uses the node defined on the `Admin ➔ Geo` screen, and tries to match
it with the value defined in the `/etc/gitlab/gitlab.rb` configuration file.
The relevant line looks like: `external_url "http://gitlab.example.com"`.
To check if the node on the current machine is correctly detected, type:
```bash
sudo gitlab-rails runner "puts Gitlab::Geo.current_node.inspect"
```
and expect something like:
```
#<GeoNode id: 2, schema: "https", host: "gitlab.example.com", port: 443, relative_url_root: "", primary: false, ...>
```
In the output of the command above, `primary` should be `true` when executed on
the primary node, and `false` on any secondary.
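If you only want a boolean answer, the same runner approach works with the Geo role helpers (assuming `Gitlab::Geo.primary?` and `Gitlab::Geo.secondary?` are available in your GitLab EE version):

```bash
sudo gitlab-rails runner "puts Gitlab::Geo.primary?"    # expect true on the primary
sudo gitlab-rails runner "puts Gitlab::Geo.secondary?"  # expect true on a secondary
```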
#### How do I fix the message, "ERROR: replication slots can only be used if max_replication_slots > 0"?
This means that the `max_replication_slots` PostgreSQL variable needs to
be set on the primary database. In GitLab 9.4, we have made this setting
default to 1. You may need to increase this value if you have more Geo
secondary nodes. Be sure to restart PostgreSQL for this to take
effect. See the [PostgreSQL replication
setup](database.md#postgresql-replication) guide for more details.
#### How do I fix the message, "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"?
This occurs when PostgreSQL does not have a replication slot for the
secondary by that name. You may want to rerun the [replication
process](database.md) on the secondary.
#### How do I fix the message, "Command exceeded allowed execution time" when setting up replication?
This may happen while [initiating the replication process](database.md#step-4-initiate-the-replication-process) on the Geo secondary, and indicates that your
initial dataset is too large to be replicated in the default timeout (30 minutes).
Re-run `gitlab-ctl replicate-geo-database`, but include a larger value for
`--backup-timeout`:
```bash
sudo gitlab-ctl replicate-geo-database --host=primary.geo.example.com --slot-name=secondary_geo_example_com --backup-timeout=21600
```
This will give the initial replication up to six hours to complete, rather than
the default thirty minutes. Adjust as required for your installation.
#### How do I fix the message, "PANIC: could not write to file 'pg_xlog/xlogtemp.123': No space left on device"
Determine if you have any unused replication slots in the primary database. Unused slots can cause large amounts of WAL data to build up in `pg_xlog`.
Removing the unused slots reduces the amount of space used in `pg_xlog`.
1. Start a PostgreSQL console session:
```bash
sudo gitlab-psql gitlabhq_production
```
Note that using `gitlab-rails dbconsole` will not work, because managing replication slots requires superuser permissions.
2. View your replication slots with
```sql
SELECT * FROM pg_replication_slots;
```
Slots where `active` is `f` are not active (the query after this list shows a quick way to find them).
- When a slot should be active because you have a secondary configured to use it,
log in to that secondary and check the PostgreSQL logs to see why replication is not running.
- If you are no longer using the slot (e.g. you no longer have Geo enabled), you can remove it in the PostgreSQL console session:
```sql
SELECT pg_drop_replication_slot('name_of_extra_slot');
```
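To list only the inactive slots from the command line (a convenience sketch; again assuming `gitlab-psql` passes flags through to `psql`):

```bash
sudo gitlab-psql gitlabhq_production -c 'SELECT slot_name FROM pg_replication_slots WHERE NOT active;'
```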
#### Very large repositories never successfully synchronize on the secondary
GitLab places a timeout on all repository clones, including project imports
and Geo synchronization operations. If a fresh `git clone` of a repository
on the primary takes more than a few minutes, you may be affected by this.
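To gauge whether you are affected (a rough check; the repository URL below is a placeholder, so substitute one of your own largest projects), time a fresh clone from the primary:

```bash
# Replace the URL with your largest project on the primary.
time git clone https://primary.gitlab.example.com/group/large-project.git /tmp/clone-timing-check
rm -rf /tmp/clone-timing-check
```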
To increase the timeout, add the following line to `/etc/gitlab/gitlab.rb`
on the secondary:
```ruby
gitlab_rails['gitlab_shell_git_timeout'] = 10800
```
Then reconfigure GitLab:
```bash
sudo gitlab-ctl reconfigure
```
This will increase the timeout to three hours (10800 seconds). Choose a time
long enough to accommodate a full clone of your largest repositories.
This document was moved to [another location](../administration/geo/troubleshooting.md).
# Tuning Geo
## Changing the sync capacity values
In the Geo admin page (`/admin/geo_nodes`), there are several variables that
can be tuned to improve performance of Geo:
* Repository sync capacity
* File sync capacity
Increasing these values will increase the number of jobs that are scheduled,
but this may not lead to more downloads in parallel unless the number of
available Sidekiq threads is also increased. For example, if repository sync
capacity is increased from 25 to 50, you may also want to increase the number
of Sidekiq threads from 25 to 50. See the [Sidekiq concurrency
documentation](../administration/operations/extra_sidekiq_processes.html#concurrency)
for more details.
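For example, on an Omnibus secondary you might raise the Sidekiq concurrency to match a higher sync capacity (a sketch; `sidekiq['concurrency']` is the usual Omnibus setting, but confirm the key against the documentation for your version):

```bash
# Add or update this line in /etc/gitlab/gitlab.rb on the secondary:
#   sidekiq['concurrency'] = 50
sudo editor /etc/gitlab/gitlab.rb
sudo gitlab-ctl reconfigure
```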
This document was moved to [another location](../administration/geo/tuning.md).
# Updating the Geo nodes
Depending on which version of Geo you are updating to/from, there may be
different steps.
## General update steps
In order to update the Geo nodes when a new GitLab version is released,
all you need to do is update GitLab itself:
1. Log into each node (primary and secondaries)
1. [Update GitLab][update]
1. [Update tracking database on secondary node](#update-tracking-database-on-secondary-node) when
the tracking database is enabled.
1. [Test](#check-status-after-updating) primary and secondary nodes, and check version in each.
## Upgrading to GitLab 10.5
For Geo Disaster Recovery to work with minimum downtime, your Geo secondary
should use the same set of secrets as the primary. However, setup instructions
prior to the 10.5 release only synchronized the `db_key_base` secret.
To rectify this error on existing installations, you should **overwrite** the
contents of `/etc/gitlab/gitlab-secrets.json` on the secondary node with the
contents of `/etc/gitlab/gitlab-secrets.json` on the primary node, then run the
following command on the secondary node:
```bash
sudo gitlab-ctl reconfigure
```
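To confirm that the two files now match (a quick sanity check), compare their checksums; the output should be identical on both nodes:

```bash
# Run on the primary and on the secondary, then compare the hashes.
sudo sha256sum /etc/gitlab/gitlab-secrets.json
```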
If you do not perform this step, you may find that two-factor authentication
[is broken following DR](faq.md#i-followed-the-disaster-recovery-instructions-and-now-two-factor-auth-is-broken).
To prevent SSH requests to the newly promoted primary node from failing
due to an SSH host key mismatch when the primary domain's DNS record is updated,
you should perform the step to [Manually replicate primary SSH host keys](configuration.md#step-2-manually-replicate-primary-ssh-host-keys) on each
secondary node.
## Upgrading to GitLab 10.4
There are no Geo-specific steps to take!
## Upgrading to GitLab 10.3
### Support for SSH repository synchronization removed
In GitLab 10.2, synchronizing secondaries over SSH was deprecated. In 10.3,
support is removed entirely. All installations will switch to the HTTP/HTTPS
cloning method instead. Before upgrading, ensure that all your Geo nodes are
configured to use this method and that it works for your installation. In
particular, ensure that [Git access over HTTP/HTTPS is enabled](configuration.md#step-5-enable-git-access-over-http-https).
Synchronizing repositories over the public Internet using HTTP is insecure, so
you should ensure that you have HTTPS configured before upgrading. Note that
file synchronization is **also** insecure in these cases!
## Upgrading to GitLab 10.2
### Secure PostgreSQL replication
Support for TLS-secured PostgreSQL replication has been added. If you are
currently using PostgreSQL replication across the open internet without an
external means of securing the connection (e.g., a site-to-site VPN), then you
should immediately reconfigure your primary and secondary PostgreSQL instances
according to the [updated instructions](database.md).
If you *are* securing the connections externally and wish to continue doing so,
ensure you include the new option `--sslmode=prefer` in future invocations of
`gitlab-ctl replicate-geo-database`.
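For example, a future re-run might look like the following (the host and slot name are placeholders, as in the replication examples elsewhere in these docs):

```bash
sudo gitlab-ctl replicate-geo-database --host=primary.geo.example.com --slot-name=secondary_geo_example_com --sslmode=prefer
```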
### HTTPS repository sync
Support for replicating repositories and wikis over HTTP/HTTPS has been added.
Replicating over SSH has been deprecated, and support for this option will be
removed in a future release.
To switch to HTTP/HTTPS replication, log into the primary node as an admin and visit
**Admin Area ➔ Geo Nodes** (`/admin/geo_nodes`). For each secondary listed,
press the "Edit" button, change the "Repository cloning" setting from
"SSH (deprecated)" to "HTTP/HTTPS", and press "Save changes". This should take
effect immediately.
Any new secondaries should be created using HTTP/HTTPS replication - this is the
default setting.
After you've verified that HTTP/HTTPS replication is working, you should remove
the now-unused SSH keys from your secondaries, as they may cause problems if the
secondary is ever promoted to a primary:
1. **[secondary]** Log in to **all** your secondary nodes and run:
```bash
sudo -u git -H rm ~git/.ssh/id_rsa ~git/.ssh/id_rsa.pub
```
### Hashed Storage
>**Warning**
Hashed storage is in **Alpha**. It is considered experimental and not
production-ready. See [Hashed
Storage](../administration/repository_storage_types.md) for more detail.
If you previously enabled Hashed Storage and migrated all your existing
projects to Hashed Storage, disabling Hashed Storage will not migrate projects
back to their previous project-based storage paths. As such, once enabled and
migrated, we recommend leaving Hashed Storage enabled.
## Upgrading to GitLab 10.1
>**Warning**
Hashed storage is in **Alpha**. It is considered experimental and not
production-ready. See [Hashed
Storage](../administration/repository_storage_types.md) for more detail.
[Hashed storage](../administration/repository_storage_types.md) was introduced
in GitLab 10.0, and a [migration path](../administration/raketasks/storage.md)
for existing repositories was added in GitLab 10.1.
## Upgrading to GitLab 10.0
Since GitLab 10.0, we require all **Geo** systems to [use SSH key lookups via
the database](../administration/operations/fast_ssh_key_lookup.md) to avoid having to maintain consistency of the
`authorized_keys` file for SSH access. Failing to do this will prevent users
from being able to clone via SSH.
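On Omnibus installs this is usually done by pointing `sshd` at GitLab's authorized-keys check command (a sketch; verify the exact binary path and options against the fast SSH key lookup documentation linked above before applying it):

```bash
# Append to /etc/ssh/sshd_config on each Geo node:
#   AuthorizedKeysCommand /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-authorized-keys-check git %u %k
#   AuthorizedKeysCommandUser git
# Then restart sshd (the service may be named "ssh" on Debian/Ubuntu):
sudo systemctl restart sshd
```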
Note that in older versions of Geo, attachments downloaded on the secondary
nodes would be saved to the wrong directory. We recommend that you do the
following to clean this up.
On the SECONDARY Geo nodes, run as root:
```sh
mv /var/opt/gitlab/gitlab-rails/working /var/opt/gitlab/gitlab-rails/working.old
mkdir /var/opt/gitlab/gitlab-rails/working
chmod 700 /var/opt/gitlab/gitlab-rails/working
chown git:git /var/opt/gitlab/gitlab-rails/working
```
You may delete `/var/opt/gitlab/gitlab-rails/working.old` at any time.
Once this is done, we advise restarting GitLab on the secondary nodes for the
new working directory to be used:
```
sudo gitlab-ctl restart
```
## Upgrading from GitLab 9.3 or older
If you started running Geo on GitLab 9.3 or older, we recommend that you
resync your secondary PostgreSQL databases to use replication slots. If you
started using Geo with GitLab 9.4 or 10.x, no further action should be
required because replication slots are used by default. However, if you
started with GitLab 9.3 and upgraded later, you should still follow the
instructions below.
When in doubt, it does not hurt to do a resync. The easiest way to do this in
Omnibus is the following:
1. Install GitLab on the primary server
1. Run `gitlab-ctl reconfigure` and `gitlab-ctl restart postgresql`. This will enable replication slots on the primary database.
1. Install GitLab on the secondary server.
1. Re-run the [database replication process](database.md#step-3-initiate-the-replication-process).
## Special update notes for 9.0.x
> **IMPORTANT**:
With GitLab 9.0, the PostgreSQL version is upgraded to 9.6 and manual steps are
required in order to update the secondary nodes and keep the Streaming
Replication working. Downtime is required, so plan ahead.
The following steps apply only if you upgrade from GitLab 8.17 to
9.0+. For earlier versions, update to GitLab 8.17 first before attempting to
upgrade to 9.0+.
---
Make sure to follow the steps in the exact order they appear below, and pay
extra attention to which node (primary/secondary) you execute them on! Each step
is prefixed with the relevant node for clarity:
1. **[secondary]** Log in to **all** your secondary nodes and stop all services:
```
sudo gitlab-ctl stop
```
1. **[secondary]** Make a backup of the `recovery.conf` file on **all**
secondary nodes to preserve PostgreSQL's credentials:
```
sudo cp /var/opt/gitlab/postgresql/data/recovery.conf /var/opt/gitlab/
```
1. **[primary]** Update the primary node to GitLab 9.0 following the
[regular update docs][update]. At the end of the update, the primary node
will be running with PostgreSQL 9.6.
1. **[primary]** To prevent a de-synchronization of the repository replication,
stop all services except `postgresql` as we will use it to re-initialize the
secondary node's database:
```
sudo gitlab-ctl stop
sudo gitlab-ctl start postgresql
```
1. **[secondary]** Run the following steps on each of the secondaries:
1. **[secondary]** Stop all services:
```
sudo gitlab-ctl stop
```
1. **[secondary]** Prevent running database migrations:
```
sudo touch /etc/gitlab/skip-auto-migrations
```
1. **[secondary]** Move the old database to another directory:
```
sudo mv /var/opt/gitlab/postgresql{,.bak}
```
1. **[secondary]** Update to GitLab 9.0 following the [regular update docs][update].
At the end of the update, the node will be running with PostgreSQL 9.6.
1. **[secondary]** Make sure all services are up:
```
sudo gitlab-ctl start
```
1. **[secondary]** Reconfigure GitLab:
```
sudo gitlab-ctl reconfigure
```
1. **[secondary]** Run the PostgreSQL upgrade command:
```
sudo gitlab-ctl pg-upgrade
```
1. **[secondary]** Retrieve the stored database credentials, which you will
need to re-initialize the replication:
```
sudo grep -s primary_conninfo /var/opt/gitlab/recovery.conf
```
1. **[secondary]** Create the `replica.sh` script as described in the
[database configuration document](database.md#step-3-initiate-the-replication-process).
1. **[secondary]** Run the recovery script using the credentials from the
previous step:
```
sudo bash /tmp/replica.sh
```
1. **[secondary]** Reconfigure GitLab:
```
sudo gitlab-ctl reconfigure
```
1. **[secondary]** Start all services:
```
sudo gitlab-ctl start
```
1. **[secondary]** Repeat the steps for the rest of the secondaries.
1. **[primary]** After all secondaries are updated, start all services on the
primary:
```
sudo gitlab-ctl start
```
## Check status after updating
Now that the update process is complete, you may want to check whether
everything is working correctly:
1. Run the Geo rake task on all nodes; everything should be green:
```
sudo gitlab-rake gitlab:geo:check
```
1. Check the primary's Geo dashboard for any errors.
1. Test data replication by pushing code to the primary and verifying that it
is received by the secondaries (see the sketch below).
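A minimal way to exercise the replication path (hostnames and project path below are placeholders; use a small test project of your own):

```bash
# Push an empty commit to the primary...
git clone git@primary.gitlab.example.com:group/geo-test.git
cd geo-test
git commit --allow-empty -m "Geo replication check"
git push origin master
# ...then confirm the commit appears when cloning from the secondary.
git clone git@secondary.gitlab.example.com:group/geo-test.git /tmp/geo-test-secondary
git -C /tmp/geo-test-secondary log -1 --oneline
```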
## Update tracking database on secondary node
After updating a secondary node, you might need to run migrations on
the tracking database. The tracking database was added in GitLab 9.1,
and has been required since GitLab 10.0.
1. Run database migrations on the tracking database:
```
sudo gitlab-rake geo:db:migrate
```
1. Repeat this step for every secondary node
[update]: ../update/README.md
This document was moved to [another location](../administration/geo/updating_the_geo_nodes.md).
[//]: # (Please update EE::GitLab::GeoGitAccess::GEO_SERVER_DOCS_URL if this file is moved)
# Using a Geo Server
After you set up the [database replication and configure the Geo nodes][req],
there are a few things to consider:
1. Users need an extra step to be able to fetch code from the secondary and push
to the primary:
1. Clone the repository as you would normally do, but from the secondary node:
```bash
git clone git@secondary.gitlab.example.com:user/repo.git
```
1. Change the remote push URL to always push to the primary, following this example (see the verification check below):
```bash
git remote set-url --push origin git@primary.gitlab.example.com:user/repo.git
```
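After changing the push URL, `git remote -v` should show the secondary for fetches and the primary for pushes (hostnames follow the examples above):

```bash
git remote -v
# origin  git@secondary.gitlab.example.com:user/repo.git (fetch)
# origin  git@primary.gitlab.example.com:user/repo.git (push)
```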
[req]: README.md#setup-instructions
This document was moved to [another location](../administration/geo/using_a_geo_server.md).
@@ -4,7 +4,7 @@ module EE
   include ::Gitlab::ConfigHelper
   include ::EE::GitlabRoutingHelper
-  GEO_SERVER_DOCS_URL = 'https://docs.gitlab.com/ee/gitlab-geo/using_a_geo_server.html'.freeze
+  GEO_SERVER_DOCS_URL = 'https://docs.gitlab.com/ee/administration/geo/using_a_geo_server.html'.freeze
   protected