Reference Guides/Beginner/Aggregation

Aggregation

Collapsing many rows into a single summary value.

Up until now, every query has returned one output row per input row. Aggregation is a different mode entirely: you feed in many rows, and the database collapses them into a single summary. That is how you answer questions like "how many orders came in this month?" or "what is the average salary in engineering?"

Get comfortable with how aggregation works - and how it interacts with NULL - before you move on to GROUP BY. The concepts carry straight over.

The five core aggregate functions

FunctionWhat it returns
COUNT(*)Number of rows in the set
COUNT(column)Number of non-NULL values in the column
SUM(column)Total of all non-NULL values
AVG(column)Arithmetic mean of all non-NULL values
MIN(column)Smallest non-NULL value
MAX(column)Largest non-NULL value
SELECT
  COUNT(*)         AS total_employees,
  AVG(salary)      AS avg_salary,
  MAX(salary)      AS highest_salary,
  MIN(salary)      AS lowest_salary,
  SUM(salary)      AS total_payroll,
  MIN(hire_date)   AS earliest_hire
FROM employees;

Without a GROUP BY clause, these functions collapse the entire table into one row. All the individual rows disappear - you see only the summary.

COUNT(*) versus COUNT(column)

This distinction trips people up constantly, and the frustrating part is that getting it wrong produces no error - just a subtly wrong number.

COUNT(*) counts every row, full stop. COUNT(column) counts only the rows where that specific column is not NULL.

SELECT
  COUNT(*)           AS total_orders,
  COUNT(shipped_at)  AS shipped_orders
FROM orders;

If 80 orders exist but only 65 have been shipped, the first column returns 80 and the second returns 65. The right choice depends on the question: "how many orders?" is COUNT(*), "how many have shipped?" is COUNT(shipped_at).

How NULL affects aggregates

All aggregate functions except COUNT(*) silently ignore NULL values. Usually that is exactly what you want - but the "silently" part is where the surprises live.

Consider a bonus column where some employees have no bonus recorded:

SELECT
  SUM(bonus)   AS total_bonuses,   -- ignores NULLs; sums only what exists
  AVG(bonus)   AS avg_bonus,       -- divides by non-NULL count, not total rows
  COUNT(bonus) AS has_bonus        -- count of employees with a bonus on record
FROM employees;

If you want AVG to treat missing bonuses as zero rather than ignoring them, you have to say so explicitly:

SELECT AVG(COALESCE(bonus, 0)) AS avg_bonus_including_zero
FROM employees;

Which version is correct depends on your domain. "No bonus" can mean zero (they earned nothing extra) or it can mean the information is not recorded yet. Both are legitimate - pick the one that matches what the data actually represents.

DISTINCT inside aggregates

Adding DISTINCT inside an aggregate tells it to operate on unique values only:

SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;

Without DISTINCT, a customer who placed five orders gets counted five times. With DISTINCT, they count once. You can technically apply DISTINCT to SUM and other aggregates too, but COUNT(DISTINCT ...) is by far the most common use.

Aggregation as a foundation

Before you dive into a dataset, a few bare aggregate queries are worth running just to get your bearings: how many rows are there, what is the range of values, are there unexpected NULLs in columns you assumed were populated? It only takes a minute and often saves you from building analysis on top of faulty assumptions. Once that foundation is solid, GROUP BY takes everything you just learned and applies it to subgroups - which is where most real analytical SQL lives.