GROUP BY and HAVING

The previous guide showed how aggregate functions summarize an entire table into a single row. GROUP BY takes that a step further: it splits the table into groups first, then applies the aggregate to each group separately. Instead of one row describing everything, you get one row per group.

This is the foundation of most analytical SQL. Any time you want to compare categories, calculate per-customer totals, or summarize data by time period, you are using GROUP BY.

Basic GROUP BY

SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;

The database collects all rows with the same department value, counts the rows in each collection, and returns one output row per department. The result might look like:

department   | headcount
-------------|----------
Engineering  | 42
Marketing    | 18
Sales        | 31
Design       | 12

You can group by multiple columns. When you do, each unique combination of those column values forms its own group:

SELECT department, title, AVG(salary) AS avg_salary
FROM employees
GROUP BY department, title;

This gives one row per (department, title) pair: "Engineering / Senior Engineer," "Engineering / Staff Engineer," "Marketing / Manager," and so on.

The SELECT rule

This rule confuses almost everyone the first time: every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate function.

-- Wrong: name is neither grouped nor aggregated
SELECT department, name, COUNT(*)
FROM employees
GROUP BY department;

PostgreSQL will return an error here. The reason is logical: if you are collapsing 42 engineering employees into a single summary row, which employee's name do you show? There is no single right answer. The database is being strict because any value it chose would be arbitrary.

The fix is to include only columns that are either part of the grouping or described by an aggregate:

-- Right: both columns are accounted for
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;

HAVING: filtering on aggregated results

WHERE filters individual rows before any grouping happens. HAVING filters groups after aggregation is complete. They are not interchangeable.

SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

This returns only departments that have more than 10 employees. You cannot write WHERE COUNT(*) > 10 because at the time WHERE runs, the groups do not exist yet. WHERE would return a syntax error.

The distinction between WHERE and HAVING:

Clause	Filters	Runs
`WHERE`	Individual rows	Before grouping
`HAVING`	Groups	After grouping and aggregation

Using WHERE and HAVING together

Both clauses can appear in the same query, and each does its job at a different stage:

SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY department
HAVING AVG(salary) > 80000;

The execution order here is: WHERE first reduces the rows to employees hired since 2020, then GROUP BY groups those filtered rows by department, then HAVING removes groups where the average salary is too low. The final result is departments that, among employees hired since 2020, average more than $80,000.

Filtering in WHERE before grouping is also more efficient: it reduces the number of rows that need to be grouped in the first place.

A mental model for the whole pipeline

Think of GROUP BY as sorting rows into buckets, one bucket per unique combination of the grouped columns. Aggregate functions measure each bucket. HAVING decides which buckets make it into the final output. WHERE narrows which rows go into the buckets at all.

That pipeline, filter rows, group into buckets, measure buckets, filter buckets, is the core pattern behind almost all analytical SQL.