Grouping
Summarizing data by category, and filtering those summaries.
Aggregate functions collapse an entire table into a single number. That is useful, but you rarely want one number for everything. You want one number per department, per customer, per month. That is exactly what GROUP BY is for: it splits rows into groups first, then runs the aggregate on each group separately. Instead of one row describing everything, you get one row per group.
Most of the analytical SQL you will write in practice relies on this pattern. Comparing categories, ranking customers by revenue, spotting trends over time - it all comes back to GROUP BY.
Basic GROUP BY
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;The database collects all rows with the same department value, counts the rows in each collection, and returns one output row per department. The result might look like:
department | headcount
-------------|----------
Engineering | 42
Marketing | 18
Sales | 31
Design | 12You can group by multiple columns. When you do, each unique combination of those column values forms its own group:
SELECT department, title, AVG(salary) AS avg_salary
FROM employees
GROUP BY department, title;This gives one row per (department, title) pair: "Engineering / Senior Engineer," "Engineering / Staff Engineer," "Marketing / Manager," and so on. The grouping key is always the combination of values, not each column independently.
The SELECT rule
Here is the rule that trips people up more than anything else: every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate function.
-- Wrong: name is neither grouped nor aggregated
SELECT department, name, COUNT(*)
FROM employees
GROUP BY department;PostgreSQL raises an error here. Think about why: you are collapsing 42 engineering employees into one summary row. Which employee's name should appear? There is no right answer, so the database refuses to guess.
The fix is straightforward - include only columns that are part of the grouping key or described by an aggregate:
-- Right: both columns are accounted for
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;HAVING: filtering on aggregated results
You will want to filter on the results of an aggregate eventually - "show me only departments with more than 10 people." That is where HAVING comes in. WHERE filters rows before any grouping happens; HAVING filters the groups that come out the other side. They are not interchangeable, and the reason matters.
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;Only departments with more than 10 employees appear. Writing WHERE COUNT(*) > 10 instead would be a syntax error - at the time WHERE runs, the rows have not been grouped yet, so there is nothing to count.
The difference between WHERE and HAVING:
| Clause | Filters | Runs |
|---|---|---|
WHERE | Individual rows | Before grouping |
HAVING | Groups | After grouping and aggregation |
Using WHERE and HAVING together
Both clauses can appear in the same query, and each does its job at a different stage:
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY department
HAVING AVG(salary) > 80000;The execution order here is: WHERE first reduces the rows to employees hired since 2020, then GROUP BY groups those filtered rows by department, then HAVING removes groups where the average salary is too low. The final result is departments that, among employees hired since 2020, average more than $80,000.
Filtering in WHERE before grouping is also more efficient: it reduces the number of rows that need to be grouped in the first place.
A mental model for the whole pipeline
Picture GROUP BY as sorting rows into buckets - one bucket per unique combination of grouped column values. WHERE controls which rows even reach the buckets. Aggregate functions measure what is inside each bucket. HAVING decides which buckets make the final cut.
Filter -> bucket -> measure -> filter buckets. Get that sequence into your head and most analytical SQL queries will make sense on first read.