Over Keyword - Window Functions
Window functions perform calculations across sets of table rows that are related to the current row. Unlike aggregate functions that return a single result for a group of rows, window functions return a value for every row while considering a window of rows defined by the OVER clause.
We'll cover high-level, introductory information about window functions, and then move on to composition.
We also have some common examples to get you started.
Click Demo this query within our query examples to see them in our live demo.
Deep Dive: What is a Window Function?
A window function performs a calculation across a set of rows that are related
to the current row. This set of related rows is called a "window", defined by an
OVER
clause that follows the window function.
In practical terms, window functions are used when you need to perform a
calculation that depends on a group of rows, but you want to retain the
individual rows in the result set. This is different from aggregate functions
like a cumulative sum
or avg
, which perform calculations on a group of rows
and return a single result.
The underlying mechanism of a window function involves three components:
-
Partitioning: The
PARTITION BY
clause divides the result set into partitions (groups of rows) upon which the window function is applied. If no partition is defined, the function treats all rows of the query result set as a single partition. -
Ordering: The
ORDER BY
clause within theOVER
clause determines the order of the rows in each partition. -
Frame Specification: This defines the set of rows included in the window, relative to the current row. For example,
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
includes all rows from the start of the partition to the current row.
Use cases for window functions are vast.
They are often used in analytics for tasks such as:
- Calculating running totals or averages
- Finding the maximum or minimum value in a sequence or partition
- Ranking items within a specific category or partition
- Calculating moving averages or cumulative sums
Window functions are tough to grok.
An analogy before we get to building:
Imagine a group of cars in a race. Each car has a number, a name, and a finish
time. If you wanted to know the average finish time, you could use an aggregate
function like avg()
to calculate it. But this would only give you a single
result: the average time. You wouldn't know anything about individual cars'
times.
For example, a window function allows you to calculate the average finish time for all the cars (the window), but display it on each car (row), so you can compare this average to each car's average speed to see if they were faster or slower than the global average.
So, in essence, window functions allow you to perform calculations that consider more than just the individual row or the entire table, but a 'window' of related rows. This 'window' could be all rows with the same value in a certain column, like all cars of the same engine size, or it could be a range of rows based on some order, like the three cars who finished before and after a certain car.
This makes window functions incredibly powerful for complex calculations and analyses.
Syntax
functionName OVER (
[PARTITION BY columnName [, ...]]
[ORDER BY columnName [ASC | DESC] [, ...]]
[ROWS | RANGE BETWEEN frame_start AND frame_end]
[EXCLUDE CURRENT ROW | EXCLUDE NO OTHERS]
)
Where:
functionName
: The window function to apply (e.g., avg, sum, rank)OVER
: Specifies the window over which the function operatesPARTITION BY
: Divides the result set into partitionsORDER BY
: Specifies the order of rows within each partitionROWS | RANGE BETWEEN
: Defines the window frame relative to the current rowEXCLUDE
: Optionally excludes certain rows from the frame
Supported functions
-
avg()
– Calculates the average within a window -
count()
– Counts rows or non-null values -
dense_rank()
– Assigns a rank to rows monotonically -
first_not_null_value()
– Retrieves the first not null value in a window -
first_value()
– Retrieves the first value in a window -
lag()
– Accesses data from previous rows -
last_value()
– Retrieves the last value in a window -
lead()
– Accesses data from subsequent rows -
max()
– Returns the maximum value within a window -
min()
– Returns the minimum value within a window -
rank()
– Assigns a rank to rows -
row_number()
– Assigns sequential numbers to rows -
sum()
– Calculates the sum within a window
Components of a window function
A window function calculates results across a set of rows related to the current row, called a window. This allows for complex calculations like moving averages, running totals, and rankings without collapsing rows.
- Function Name: Specifies the calculation to perform (e.g.,
avg(price)
) - OVER Clause: Defines the window for the function
PARTITION BY
: Divides the result set into partitionsORDER BY
: Orders rows within partitions- Frame Specification: Defines the subset of rows using ROWS or RANGE
- Exclusion Option: Excludes specific rows from the frame
Example
SELECT
symbol,
price,
timestamp,
avg(price) OVER (
PARTITION BY symbol
ORDER BY timestamp
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
) AS moving_avg
FROM trades;
This calculates a moving average of price over the current and three preceding rows for each symbol. For other common window function examples, please check the Window functions reference.
Frame types and behavior
Window frames specify which rows are included in the calculation relative to the current row.
ROWS frame
Defines the frame based on a physical number of rows:
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
This includes the current row and two preceding rows.
RANGE frame
RANGE functions have a known issue. When using RANGE, all the rows with the same value will have the same output for the function. Read the open issue for more information.
Defines the frame based on the actual values in the ORDER BY column, rather than counting rows. Unlike ROWS, which counts a specific number of rows, RANGE considers the values in the ORDER BY column to determine the window.
Important requirements for RANGE:
- Data must be ordered by the designated timestamp column
- The window is calculated based on the values in that ORDER BY column
For example, with a current row at 09:04 and RANGE BETWEEN '1' MINUTE PRECEDING AND CURRENT ROW
:
- Only includes rows with timestamps between 09:03 and 09:04 (inclusive)
- Earlier rows (e.g., 09:00, 09:02) are excluded as they fall outside the 1-minute range
The following time units can be used in RANGE window functions:
- day
- hour
- minute
- second
- millisecond
- microsecond
Plural forms of these time units are also accepted (e.g., 'minutes', 'hours').
SELECT
timestamp,
bid_px_00,
-- 5-minute average: includes rows from (current_timestamp - 5 minutes) to current_timestamp
AVG(bid_px_00) OVER (
ORDER BY timestamp
RANGE BETWEEN '5' MINUTE PRECEDING AND CURRENT ROW
) AS avg_5min,
-- 100ms count: includes rows from (current_timestamp - 100ms) to current_timestamp
COUNT(*) OVER (
ORDER BY timestamp
RANGE BETWEEN '100' MILLISECOND PRECEDING AND CURRENT ROW
) AS updates_100ms,
-- 2-second sum: includes rows from (current_timestamp - 2 seconds) to current_timestamp
SUM(bid_sz_00) OVER (
ORDER BY timestamp
RANGE BETWEEN '2' SECOND PRECEDING AND CURRENT ROW
) AS volume_2sec
FROM AAPL_orderbook
WHERE bid_px_00 > 0
LIMIT 10;
This query demonstrates different time intervals in action, calculating:
- 5-minute moving average of best bid price
- Update frequency in 100ms windows
- 2-second rolling volume
Note that each window calculation is based on the timestamp values, not the number of rows. This means the number of rows included can vary depending on how many records exist within each time interval.
Frame boundaries
Frame boundaries determine which rows are included in the window calculation:
UNBOUNDED PRECEDING
: Starts at the first row of the partition<value> PRECEDING
: Starts or ends at a specified number of rows or interval before the current rowCURRENT ROW
: Starts or ends at the current row
When the frame clause is not specified, the default frame is
RANGE UNBOUNDED PRECEDING
, which includes all rows from the start of the
partition to the current row.
-
If
ORDER BY
is not present, the frame includes the entire partition, as all rows are considered equal. -
If
ORDER BY
is present, the frame includes all rows from the start of the partition to the current row. Note thatUNBOUNDED FOLLOWING
is only allowed when the frame start isUNBOUNDED PRECEDING
, which means the frame includes the entire partition.
Restrictions
-
Frame start can only be:
UNBOUNDED PRECEDING
<value> PRECEDING
CURRENT ROW
-
Frame end can only be:
CURRENT ROW
<value> PRECEDING
(unless start isUNBOUNDED PRECEDING
)
-
RANGE frames must be ordered by a designated timestamp
Exclusion options
Modifies the window frame by excluding certain rows:
EXCLUDE NO OTHERS
- Default behavior
- Includes all rows in the frame
EXCLUDE CURRENT ROW
- Excludes the current row from the frame
- When frame ends at
CURRENT ROW
, end boundary automatically adjusts to1 PRECEDING
- This automatic adjustment ensures that the current row is effectively excluded from the calculation, as there cannot be a frame that ends after the current row when the current row is excluded.
Example query
To tie it together, consider the following example:
SELECT
timestamp,
price,
SUM(price) OVER (
ORDER BY timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
EXCLUDE CURRENT ROW
) AS cumulative_sum_excluding_current
FROM trades;
The query calculates a cumulative sum of the price column for each row in the trades table, excluding the current row from the calculation. By using EXCLUDE CURRENT ROW
, the window frame adjusts to include all rows from the start up to one row before the current row. This demonstrates how the EXCLUDE CURRENT ROW
option modifies the window frame to exclude the current row, affecting the result of the window function.
Notes and restrictions
ORDER BY behavior
- ORDER BY in OVER clause determines the logical order for window functions
- Independent of the query-level ORDER BY
- Required for window-only functions
- Required for RANGE frames
Frame specifications
-
ROWS frames:
- Based on physical row counts
- More efficient for large datasets
- Can be used with any ORDER BY column
-
RANGE frames:
- Defines the frame based on the actual values in the ORDER BY column, rather than counted row.
- Require ORDER BY on timestamp
- Support time-based intervals (e.g., '1h', '5m')
Exclusion behavior
- Using
EXCLUDE CURRENT ROW
with frame end atCURRENT ROW
:- Automatically adjusts end boundary to
1 PRECEDING
- Ensures consistent results across queries
- Automatically adjusts end boundary to
Performance considerations
- ROWS frames typically perform better than RANGE frames for large datasets
- Partitioning can improve performance by processing smaller chunks of data
- Consider index usage when ordering by timestamp columns
Common pitfalls
Using window functions in WHERE clauses:
-- Incorrect usage
SELECT
symbol,
price,
timestamp
FROM trades
WHERE
avg(price) OVER (ORDER BY timestamp) > 100;
Instead, build like so:
with prices_and_avg AS (
SELECT
symbol,
price, avg(price) OVER (ORDER BY timestamp) as moving_avg_price,
timestamp
FROM trades
WHERE timestamp in yesterday()
)
select * from prices_and_avg
WHERE
moving_avg_price > 100;
Missing ORDER BY in OVER clause
When no ORDER BY
is specified, the average will be calculated for the whole
partition. Given we don't have a PARTITION BY and we are using a global window,
all the rows will show the same average. This is the average for the whole
dataset.
-- Potential issue
SELECT
symbol,
price,
sum(price) OVER () AS cumulative_sum
FROM trades;
WHERE timestamp in yesterday();
To compute the moving average, we need to specify an ORDER BY
clause:
SELECT
symbol,
price,
sum(price) OVER (ORDER BY TIMESTAMP) AS cumulative_sum
FROM trades
WHERE timestamp in yesterday();
We may also have a case where all the rows for the same partition (symbol) will
have the same average, if we include a PARTITION BY
clause without an
ORDER BY
clause:
-- Potential issue
SELECT
symbol,
price,
sum(price) OVER (PARTITION BY symbol ) AS cumulative_sum
FROM trades
WHERE timestamp in yesterday();
For every row to show the moving average for each symbol, we need to specify both
an ORDER BY
and a PARTITION BY
clause:
SELECT
symbol,
price,
sum(price) OVER (PARTITION BY symbol ORDER BY TIMESTAMP) AS cumulative_sum
FROM trades
WHERE timestamp in yesterday();