Aggregation
Collapsing many rows into a single summary value.
Up until now, every query has returned one output row per input row. Aggregation is a different mode entirely: you feed in many rows, and the database collapses them into a single summary. That is how you answer questions like "how many orders came in this month?" or "what is the average salary in engineering?"
Get comfortable with how aggregation works - and how it interacts with NULL - before you move on to GROUP BY. The concepts carry straight over.
The five core aggregate functions
| Function | What it returns |
|---|---|
COUNT(*) | Number of rows in the set |
COUNT(column) | Number of non-NULL values in the column |
SUM(column) | Total of all non-NULL values |
AVG(column) | Arithmetic mean of all non-NULL values |
MIN(column) | Smallest non-NULL value |
MAX(column) | Largest non-NULL value |
SELECT
COUNT(*) AS total_employees,
AVG(salary) AS avg_salary,
MAX(salary) AS highest_salary,
MIN(salary) AS lowest_salary,
SUM(salary) AS total_payroll,
MIN(hire_date) AS earliest_hire
FROM employees;Without a GROUP BY clause, these functions collapse the entire table into one row. All the individual rows disappear - you see only the summary.
COUNT(*) versus COUNT(column)
This distinction trips people up constantly, and the frustrating part is that getting it wrong produces no error - just a subtly wrong number.
COUNT(*) counts every row, full stop. COUNT(column) counts only the rows where that specific column is not NULL.
SELECT
COUNT(*) AS total_orders,
COUNT(shipped_at) AS shipped_orders
FROM orders;If 80 orders exist but only 65 have been shipped, the first column returns 80 and the second returns 65. The right choice depends on the question: "how many orders?" is COUNT(*), "how many have shipped?" is COUNT(shipped_at).
How NULL affects aggregates
All aggregate functions except COUNT(*) silently ignore NULL values. Usually that is exactly what you want - but the "silently" part is where the surprises live.
Consider a bonus column where some employees have no bonus recorded:
SELECT
SUM(bonus) AS total_bonuses, -- ignores NULLs; sums only what exists
AVG(bonus) AS avg_bonus, -- divides by non-NULL count, not total rows
COUNT(bonus) AS has_bonus -- count of employees with a bonus on record
FROM employees;If you want AVG to treat missing bonuses as zero rather than ignoring them, you have to say so explicitly:
SELECT AVG(COALESCE(bonus, 0)) AS avg_bonus_including_zero
FROM employees;Which version is correct depends on your domain. "No bonus" can mean zero (they earned nothing extra) or it can mean the information is not recorded yet. Both are legitimate - pick the one that matches what the data actually represents.
DISTINCT inside aggregates
Adding DISTINCT inside an aggregate tells it to operate on unique values only:
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;Without DISTINCT, a customer who placed five orders gets counted five times. With DISTINCT, they count once. You can technically apply DISTINCT to SUM and other aggregates too, but COUNT(DISTINCT ...) is by far the most common use.
Aggregation as a foundation
Before you dive into a dataset, a few bare aggregate queries are worth running just to get your bearings: how many rows are there, what is the range of values, are there unexpected NULLs in columns you assumed were populated? It only takes a minute and often saves you from building analysis on top of faulty assumptions. Once that foundation is solid, GROUP BY takes everything you just learned and applies it to subgroups - which is where most real analytical SQL lives.