Reference Guides/Beginner/Grouping

Grouping

Summarizing data by category, and filtering those summaries.

Aggregate functions collapse an entire table into a single number. That is useful, but you rarely want one number for everything. You want one number per department, per customer, per month. That is exactly what GROUP BY is for: it splits rows into groups first, then runs the aggregate on each group separately. Instead of one row describing everything, you get one row per group.

Most of the analytical SQL you will write in practice relies on this pattern. Comparing categories, ranking customers by revenue, spotting trends over time - it all comes back to GROUP BY.

Basic GROUP BY

SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;

The database collects all rows with the same department value, counts the rows in each collection, and returns one output row per department. The result might look like:

department   | headcount
-------------|----------
Engineering  | 42
Marketing    | 18
Sales        | 31
Design       | 12

You can group by multiple columns. When you do, each unique combination of those column values forms its own group:

SELECT department, title, AVG(salary) AS avg_salary
FROM employees
GROUP BY department, title;

This gives one row per (department, title) pair: "Engineering / Senior Engineer," "Engineering / Staff Engineer," "Marketing / Manager," and so on. The grouping key is always the combination of values, not each column independently.

The SELECT rule

Here is the rule that trips people up more than anything else: every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate function.

-- Wrong: name is neither grouped nor aggregated
SELECT department, name, COUNT(*)
FROM employees
GROUP BY department;

PostgreSQL raises an error here. Think about why: you are collapsing 42 engineering employees into one summary row. Which employee's name should appear? There is no right answer, so the database refuses to guess.

The fix is straightforward - include only columns that are part of the grouping key or described by an aggregate:

-- Right: both columns are accounted for
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;

HAVING: filtering on aggregated results

You will want to filter on the results of an aggregate eventually - "show me only departments with more than 10 people." That is where HAVING comes in. WHERE filters rows before any grouping happens; HAVING filters the groups that come out the other side. They are not interchangeable, and the reason matters.

SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

Only departments with more than 10 employees appear. Writing WHERE COUNT(*) > 10 instead would be a syntax error - at the time WHERE runs, the rows have not been grouped yet, so there is nothing to count.

The difference between WHERE and HAVING:

ClauseFiltersRuns
WHEREIndividual rowsBefore grouping
HAVINGGroupsAfter grouping and aggregation

Using WHERE and HAVING together

Both clauses can appear in the same query, and each does its job at a different stage:

SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY department
HAVING AVG(salary) > 80000;

The execution order here is: WHERE first reduces the rows to employees hired since 2020, then GROUP BY groups those filtered rows by department, then HAVING removes groups where the average salary is too low. The final result is departments that, among employees hired since 2020, average more than $80,000.

Filtering in WHERE before grouping is also more efficient: it reduces the number of rows that need to be grouped in the first place.

A mental model for the whole pipeline

Picture GROUP BY as sorting rows into buckets - one bucket per unique combination of grouped column values. WHERE controls which rows even reach the buckets. Aggregate functions measure what is inside each bucket. HAVING decides which buckets make the final cut.

Filter -> bucket -> measure -> filter buckets. Get that sequence into your head and most analytical SQL queries will make sense on first read.