Added documentation on adding database indexes

5f739407 · Yorick Peterse · 3a46eac1 · 5f739407 · 5f739407
Commit 5f739407 authored Aug 11, 2016 by Yorick Peterse
Hide whitespace changes
Inline Side-by-side

Showing with 127 additions and 0 deletions

doc/development/README.md doc/development/README.md +4 -0

doc/development/adding_database_indexes.md doc/development/adding_database_indexes.md +123 -0

No files found.
--- a/doc/development/README.md
+++ b/doc/development/README.md
@@ -30,7 +30,11 @@
 - [Rake tasks](rake_tasks.md) for development
 - [Shell commands](shell_commands.md) in the GitLab codebase
 - [Sidekiq debugging](sidekiq_debugging.md)
+## Databases
 - [What requires downtime?](what_requires_downtime.md)
+- [Adding database indexes](adding_database_indexes.md)
 ## Compliance

--- a/doc/development/adding_database_indexes.md
+++ b/doc/development/adding_database_indexes.md
+# Adding Database Indexes
+Indexes can be used to speed up database queries, but when should you add a new
+index? Traditionally the answer to this question has been to add an index for
+every column used for filtering or joining data. For example, consider the
+following query:
+```sql
+SELECT *
+FROM projects
+WHERE user_id = 2;
+```
+Here we are filtering by the `user_id` column and as such a developer may decide
+to index this column.
+While in certain cases indexing columns using the above approach may make sense
+it can actually have a negative impact. Whenever you write data to a table any
+existing indexes need to be updated. The more indexes there are the slower this
+can potentially become. Indexes can also take up quite some disk space depending
+on the amount of data indexed and the index type. For example, PostgreSQL offers
+"GIN" indexes which can be used to index certain data types that can not be
+indexed by regular btree indexes. These indexes however generally take up more
+data and are slower to update compared to btree indexes.
+Because of all this one should not blindly add a new index for every column used
+to filter data by. Instead one should ask themselves the following questions:
+1. Can I write my query in such a way that it re-uses as many existing indexes
+   as possible?
+2. Is the data going to be large enough that using an index will actually be
+   faster than just iterating over the rows in the table?
+3. Is the overhead of maintaining the index worth the reduction in query
+   timings?
+We'll explore every question in detail below.
+## Re-using Queries
+The first step is to make sure your query re-uses as many existing indexes as
+possible. For example, consider the following query:
+```sql
+SELECT *
+FROM todos
+WHERE user_id = 123
+AND state = 'open';
+```
+Now imagine we already have an index on the `user_id` column but not on the
+`state` column. One may think this query will perform badly due to `state` being
+unindexed. In reality the query may perform just fine given the index on
+`user_id` can filter out enough rows.
+The best way to determine if indexes are re-used is to run your query using
+`EXPLAIN ANALYZE`. Depending on any extra tables that may be joined and
+other columns being used for filtering you may find an extra index is not going
+to make much (if any) difference. On the other hand you may determine that the
+index _may_ make a difference.
+In short:
+1. Try to write your query in such a way that it re-uses as many existing
+   indexes as possible.
+2. Run the query using `EXPLAIN ANALYZE` and study the output to find the most
+   ideal query.
+## Data Size
+A database may decide not to use an index despite it existing in case a regular
+sequence scan (= simply iterating over all existing rows) is faster. This is
+especially the case for small tables.
+If a table is expected to grow in size and you expect your query has to filter
+out a lot of rows you may want to consider adding an index. If the table size is
+very small (e.g. only a handful of rows) or any existing indexes filter out
+enough rows you may _not_ want to add a new index.
+## Maintenance Overhead
+Indexes have to be updated on every table write. In case of PostgreSQL _all_
+existing indexes will be updated whenever data is written to a table. As a
+result of this having many indexes on the same table will slow down writes.
+Because of this one should ask themselves: is the reduction in query performance
+worth the overhead of maintaining an extra index?
+If adding an index reduces SELECT timings by 5 milliseconds but increases
+INSERT/UPDATE/DELETE timings by 10 milliseconds then the index may not be worth
+it. On the other hand, if SELECT timings are reduced but INSERT/UPDATE/DELETE
+timings are not affected you may want to add the index after all.
+## Finding Unused Indexes
+To see which indexes are unused you can run the following query:
+```sql
+SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass))
+FROM pg_stat_all_indexes
+WHERE schemaname = 'public'
+AND idx_scan = 0
+AND idx_tup_read = 0
+AND idx_tup_fetch = 0
+ORDER BY pg_relation_size(indexrelname::regclass) desc;
+```
+This query outputs a list containing all indexes that are never used and sorts
+them by indexes sizes in descending order.  This query can be useful to
+determine if any previously indexes are useful after all. More information on
+the meaning of the various columns can be found at
+<https://www.postgresql.org/docs/current/static/monitoring-stats.html>.
+Because the output of this query relies on the actual usage of your database it
+may be affected by factors such as (but not limited to):
+* Certain queries never being executed, thus not being able to use certain
+  indexes.
+* Certain tables having little data, resulting in PostgreSQL using sequence
+  scans instead of index scans.
+In other words, this data is only reliable for a frequently used database with
+plenty of data and with as many GitLab features enabled (and being used) as
+possible.