Guide/Intermediate/Indexes: How Databases Find Data Fast

Indexes: How Databases Find Data Fast

What indexes are, when they help, and when they do not.

When you run a query with a WHERE clause, the database has to find the rows that match your condition. Without an index, the only way to do this is to examine every row in the table one by one. This is called a sequential scan, and for small tables it is perfectly fast. For a table with tens of millions of rows, it is the difference between a query that returns in milliseconds and one that takes minutes.

Indexes are the mechanism that allows the database to skip most of that work.

What an index actually is

A B-tree index (the most common kind) is a balanced tree data structure maintained separately from the table itself. Each leaf in the tree holds a column value and a pointer to the physical location of the corresponding row.

When you query WHERE department = 'Engineering', the database traverses the tree in O(log n) time to find the "Engineering" entry, then follows the pointer directly to the matching rows. It reads a tiny fraction of the data it would otherwise scan.

The analogy is a book index. To find every mention of "window functions" in a 500-page book, you could read all 500 pages or you could look up "window functions" in the index, find the page numbers, and go directly there. The index is slower to build and uses extra space, but it makes lookups fast.

CREATE INDEX idx_employees_department ON employees (department);

Automatic indexes

You do not always have to create indexes manually. The database creates them automatically for:

ConstraintIndex created
PRIMARY KEYUnique index on the key column(s)
UNIQUEUnique index on the constrained column(s)

If you are querying by id, you already have an index. If you are querying by a non-key column that appears frequently in WHERE or JOIN ON conditions, you probably want to add one.

When indexes help

Indexes are most valuable when:

  • The column appears frequently in WHERE, JOIN ON, or ORDER BY
  • The query returns a small fraction of the table (an index on status where 95% of rows are 'active' is unlikely to help)
  • The table is large enough that a sequential scan is noticeably slow

Indexes are less useful when:

  • The column has very few distinct values (low selectivity). The database may decide a scan is faster.
  • The query would return most of the table anyway. Following millions of index pointers is slower than a sequential scan.
  • The table is small. A scan over 10,000 rows is instant regardless.

Multi-column indexes

You can index multiple columns together in a single index:

CREATE INDEX idx_orders_customer_date ON orders (customer_id, created_at);

This index is useful for queries that filter on customer_id alone, or on both customer_id and created_at. It is not useful for queries that filter on created_at alone, because created_at is the second column. The database can only use the index efficiently when the query includes the leftmost column(s).

This is called the leftmost prefix rule. An index on (a, b, c) supports queries that filter on: a alone, a and b together, or a and b and c together. It does not support b alone or c alone.

The cost side

Indexes are not free. Every INSERT, UPDATE, and DELETE must update all indexes on the table, in addition to the table itself. A table with ten indexes has ten extra data structures to maintain for every write.

For read-heavy tables, this overhead is acceptable. For tables that are written to constantly and queried less often, excess indexes slow down writes noticeably. The goal is to have exactly the indexes your most important queries need, and no more.

Checking index usage with EXPLAIN

EXPLAIN SELECT * FROM employees WHERE department = 'Engineering';

The output will show either "Index Scan" (the index is being used) or "Seq Scan" (full table scan). If you see a sequential scan where you expected an index, either the index does not exist, the database decided a scan was faster given the data distribution, or the query is written in a way that prevents index use (such as filtering on a function of the column rather than the column itself). The next guide covers how to read and interpret this output.