Reference Guides/Intermediate/Indexes: How Databases Find Data Fast

Indexes: How Databases Find Data Fast

What indexes are, when they help, and when they do not.

When a query has a WHERE clause, the database needs to find the matching rows. Without an index, it does this by scanning every row in the table from top to bottom. That is fine on a table with a few thousand rows. On a table with 50 million rows it can mean seconds or minutes of wall time for a query that should return instantly.

An index lets the database skip most of that work. Understanding when and why to add one is one of the highest-leverage things you can learn as a SQL developer.

What an index actually is

The most common kind is a B-tree index: a balanced tree maintained separately from the table. Each leaf holds a column value and a pointer to the physical location of the matching row.

When you filter on WHERE department = 'Engineering', the database walks the tree in O(log n) time to find that entry, then jumps directly to the matching rows. It touches a tiny fraction of what a full scan would read.

Think of the index at the back of a textbook. To find every mention of "window functions" in 500 pages you could read the whole thing, or you could flip to the index, find the page numbers, and go there. The index takes space and stays up to date as the book changes, but it makes lookups dramatically faster.

CREATE INDEX idx_employees_department ON employees (department);

Automatic indexes

You already have more indexes than you think. The database creates them automatically for:

ConstraintIndex created
PRIMARY KEYUnique index on the key column(s)
UNIQUEUnique index on the constrained column(s)

If your queries filter on id, you are already covered. The gap shows up when you start filtering on other columns - status, email, created_at - that do not have automatic indexes. Those are the ones worth examining.

When indexes help

Indexes help most when the column appears in WHERE, JOIN ON, or ORDER BY, the query returns a small slice of the table, and the table is large enough that a scan is noticeably slow.

They are less useful - or even counterproductive - in a few situations that surprise people:

  • Low-selectivity columns: if 95% of rows have status = 'active', an index on status is unlikely to help. The database may decide a scan is faster.
  • Queries that return most of the table anyway. Following millions of individual index pointers is actually slower than one clean sequential scan.
  • Small tables. A scan over 10,000 rows is instant regardless of indexes.

Multi-column indexes

A single index can cover multiple columns. This matters when your queries filter on more than one column at a time:

CREATE INDEX idx_orders_customer_date ON orders (customer_id, created_at);

This index works for queries that filter on customer_id alone, or on both customer_id and created_at. It does not work for queries that filter on created_at alone - that is the part people get wrong.

The rule is called the leftmost prefix rule. An index on (a, b, c) supports: a alone, a+b together, a+b+c together. It does not support b alone or c alone. Column order in a composite index is not arbitrary - put the most-filtered column first.

The cost side

Every index has a write cost. Each INSERT, UPDATE, and DELETE must update every index on the table in addition to the table itself. A table with ten indexes pays ten extra writes for every row mutation.

On a read-heavy table that overhead is fine. On a table that gets hammered with writes and queried less often, excess indexes become a drag. The goal is to have exactly the indexes your most important queries need - no fewer, no more.

Checking index usage with EXPLAIN

EXPLAIN SELECT * FROM employees WHERE department = 'Engineering';

The output shows either "Index Scan" (index used) or "Seq Scan" (full table scan). If you see a sequential scan where you expected an index, there are a few likely culprits: the index does not exist yet, the database decided a scan was cheaper given the data distribution, or the query wraps the column in a function - WHERE LOWER(email) = ... cannot use an index on email. The next guide goes deep on reading this output.