SAMPLE BY keyword
SAMPLE BY is used on time-series data to summarize large datasets into
aggregates of homogeneous time chunks as part of a SELECT statement.
To use SAMPLE BY, a table column needs to be specified as a
designated timestamp.
Users performing SAMPLE BY queries on datasets with missing data may make
use of the FILL keyword to specify a fill behavior.
Syntax
SAMPLE BY keywords
FILL keywords
ALIGN TO keywords
Sample units
The size of sampled groups is specified with the following syntax:
SAMPLE BY n{units}
Where the unit for sampled groups may be one of the following:
unit | description |
---|---|
U | microsecond |
T | millisecond |
s | second |
m | minute |
h | hour |
d | day |
M | month |
y | year |
For example, given a table trades, the following query returns the number of
trades per hour:
SELECT ts, count() FROM trades SAMPLE BY 1h
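To make the unit table concrete, here is a minimal Python sketch that splits an interval such as 30m into its count and unit and converts fixed-width units to microseconds. It is not QuestDB's parser; the function names parse_sample_interval and interval_micros are hypothetical. Note that units are case-sensitive: m is minute, M is month.

```python
import re

# Fixed-width units from the table above, expressed in microseconds.
# M (month) and y (year) vary in length, so this sketch leaves them out.
MICROS_PER_UNIT = {
    "U": 1,
    "T": 1_000,
    "s": 1_000_000,
    "m": 60 * 1_000_000,
    "h": 3_600 * 1_000_000,
    "d": 86_400 * 1_000_000,
}


def parse_sample_interval(text):
    """Split a string such as '30m' into (n, unit)."""
    match = re.fullmatch(r"(\d+)([UTsmhdMy])", text)
    if match is None:
        raise ValueError(f"not a valid sample interval: {text!r}")
    return int(match.group(1)), match.group(2)


def interval_micros(text):
    """Return the width of a fixed-width interval in microseconds."""
    n, unit = parse_sample_interval(text)
    if unit not in MICROS_PER_UNIT:
        raise ValueError(f"unit {unit!r} is calendar-based and has no fixed width")
    return n * MICROS_PER_UNIT[unit]


print(parse_sample_interval("1h"))  # (1, 'h')
print(interval_micros("30m"))       # 1800000000
```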
FROM-TO
Versions prior to QuestDB 8.1.0 do not have access to this extension.
Please see the new blog for more information.
When using SAMPLE BY with FILL, you can fill missing rows within the result set with pre-determined values.
However, this method will only fill rows between existing data in the data set and cannot fill rows outside of this range.
To fill outside the bounds of the existing data, you can specify a fill range using a FROM-TO clause.
Syntax
Specify the shape of the query using FROM and TO:
SELECT pickup_datetime as t, count()
FROM trips
SAMPLE BY 1d FROM '2008-12-28' TO '2009-01-05' FILL(NULL)
Since no rows exist before 2009, QuestDB automatically fills in the sample rows for the days before the first record.
This is distinct from the WHERE clause. As a simple rule of thumb, WHERE controls what data flows in, and FROM-TO controls what data flows out.
FROM and TO can each be used in isolation to pre-fill or post-fill data. If FROM is not provided, the lower bound is the start of the dataset, aligned to calendar. Likewise, if TO is not provided, the upper bound is the end of the dataset.
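The effect of a FROM-TO range combined with FILL(NULL) can be sketched in Python as follows. This is only an illustration of the idea, not QuestDB's implementation: the function name and the pickup times are hypothetical, the upper bound is treated as exclusive, and the lower bound is assumed to fall on midnight.

```python
from datetime import datetime, timedelta


def sample_by_day_from_to(timestamps, start, end):
    """Count rows per day for every day in [start, end); empty days get None."""
    counts = {}
    for ts in timestamps:
        # Rows outside the range never reach the output.
        if start <= ts < end:
            day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
            counts[day] = counts.get(day, 0) + 1
    rows = []
    day = start
    while day < end:
        # Emit a row for every day in the range, even when no data exists.
        rows.append((day, counts.get(day)))
        day += timedelta(days=1)
    return rows


# Hypothetical pickup times; the earliest record is in 2009.
pickups = [
    datetime(2009, 1, 1, 0, 5),
    datetime(2009, 1, 1, 9, 30),
    datetime(2009, 1, 3, 18, 0),
]
rows = sample_by_day_from_to(pickups, datetime(2008, 12, 28), datetime(2009, 1, 5))
for day, count in rows:
    print(day.date(), count)
```

The four days before the first record and the trailing day after the last record appear in the output with None, which plain FILL without a range would not produce.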
WHERE clause optimisation
If the query does not provide a WHERE clause, or the WHERE clause does not consider the designated timestamp, QuestDB will add one for you, matching the FROM-TO interval.
This means that the query will run optimally and avoid touching data not relevant to the result.
Therefore, the prior query is compiled into something similar to this:
SELECT pickup_datetime as t, count()
FROM trips
WHERE pickup_datetime >= '2008-12-28'
AND pickup_datetime < '2009-01-05'
SAMPLE BY 1d FROM '2008-12-28' TO '2009-01-05' FILL(NULL)
Limitations
Here are the current limits to this feature:
- This syntax is not compatible with FILL(PREV) or FILL(LINEAR).
- This syntax is for ALIGN TO CALENDAR only (default alignment).
- Does not consider any specified OFFSET.
- This syntax is for non-keyed SAMPLE BY, i.e. only designated timestamp and aggregate columns.
Fill options
The FILL keyword is optional and expects one or more fillOption strategies
which will be applied to one or more aggregate columns. The following
restrictions apply:
- Keywords denoting fill strategies may not be combined. Only one option from NONE, NULL, PREV, LINEAR and constants may be used for each aggregate column.
- LINEAR strategy is not supported for keyed queries, i.e. queries that contain non-aggregated columns other than the timestamp in the SELECT clause.
- The FILL keyword must precede alignment described in the sample calculation section, i.e.:
SELECT ts, max(price) max FROM prices
SAMPLE BY 1h FILL(LINEAR)
ALIGN TO ...
fillOption | Description |
---|---|
NONE | No fill applied. If there is no data, the time sample will be skipped in the results. A table could be missing intervals. |
NULL | Fills with NULL values. |
PREV | Fills using the previous value. |
LINEAR | Fills by linear interpolation of the 2 surrounding points. |
x | Fills with a constant value - where x is the desired value, for example FILL(100.05) . |
Consider an example table named prices which has no records during the entire
third hour (2021-01-01T03):
ts | price |
---|---|
2021-01-01T01:00:00.000000Z | p1 |
2021-01-01T02:00:00.000000Z | p2 |
2021-01-01T04:00:00.000000Z | p4 |
2021-01-01T05:00:00.000000Z | p5 |
The following query returns the maximum price per hour. As there are no records for the third hour, no aggregate can be calculated for it:
SELECT ts, max(price) max FROM prices SAMPLE BY 1h;
A row is missing for the 2021-01-01T03:00:00.000000Z sample:
ts | max |
---|---|
2021-01-01T01:00:00.000000Z | max1 |
2021-01-01T02:00:00.000000Z | max2 |
2021-01-01T04:00:00.000000Z | max4 |
2021-01-01T05:00:00.000000Z | max5 |
A FILL strategy can be employed which fills with the previous value using PREV:
SELECT ts, max(price) max FROM prices SAMPLE BY 1h FILL(PREV);
ts | max |
---|---|
2021-01-01T01:00:00.000000Z | max1 |
2021-01-01T02:00:00.000000Z | max2 |
2021-01-01T03:00:00.000000Z | max2 |
2021-01-01T04:00:00.000000Z | max4 |
2021-01-01T05:00:00.000000Z | max5 |
Linear interpolation is done using the LINEAR fill option:
SELECT ts, max(price) max FROM prices SAMPLE BY 1h FILL(LINEAR);
ts | max |
---|---|
2021-01-01T01:00:00.000000Z | max1 |
2021-01-01T02:00:00.000000Z | max2 |
2021-01-01T03:00:00.000000Z | (max2+max4)/2 |
2021-01-01T04:00:00.000000Z | max4 |
2021-01-01T05:00:00.000000Z | max5 |
A constant value can be used as a fillOption:
SELECT ts, max(price) max FROM prices SAMPLE BY 1h FILL(100.5);
ts | max |
---|---|
2021-01-01T01:00:00.000000Z | max1 |
2021-01-01T02:00:00.000000Z | max2 |
2021-01-01T03:00:00.000000Z | 100.5 |
2021-01-01T04:00:00.000000Z | max4 |
2021-01-01T05:00:00.000000Z | max5 |
Finally, NULL may be used as a fillOption:
SELECT ts, max(price) max FROM prices SAMPLE BY 1h FILL(NULL);
ts | max |
---|---|
2021-01-01T01:00:00.000000Z | max1 |
2021-01-01T02:00:00.000000Z | max2 |
2021-01-01T03:00:00.000000Z | null |
2021-01-01T04:00:00.000000Z | max4 |
2021-01-01T05:00:00.000000Z | max5 |
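The fill options shown above can be compared side by side with a small Python sketch. It works on a list of per-hour aggregates in which None marks the empty third hour. The function name fill_gaps and the price values are hypothetical. Leaving a gap unfilled when it has no neighbour on one side is an assumption of this sketch, not a statement about QuestDB's behaviour.

```python
def fill_gaps(values, option):
    """Apply one fill option to per-bucket aggregates (None marks an empty bucket)."""
    if option == "NONE":
        # Empty buckets are skipped entirely.
        return [v for v in values if v is not None]
    if option == "NULL":
        # Empty buckets are kept and stay null.
        return list(values)
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        if option == "PREV":
            # Carry the previous (possibly already filled) value forward.
            filled[i] = filled[i - 1] if i > 0 else None
        elif option == "LINEAR":
            # Interpolate between the nearest populated buckets on each side.
            prev_i = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
            next_i = next((j for j in range(i + 1, len(values)) if values[j] is not None), None)
            if prev_i is not None and next_i is not None:
                weight = (i - prev_i) / (next_i - prev_i)
                filled[i] = values[prev_i] + weight * (values[next_i] - values[prev_i])
        else:
            # Anything else is treated as a constant fill value.
            filled[i] = option
    return filled


# Hypothetical hourly maxima for hours 01 to 05; hour 03 has no data.
hourly_max = [10.0, 20.0, None, 40.0, 50.0]

print(fill_gaps(hourly_max, "NONE"))    # [10.0, 20.0, 40.0, 50.0]
print(fill_gaps(hourly_max, "PREV"))    # [10.0, 20.0, 20.0, 40.0, 50.0]
print(fill_gaps(hourly_max, "LINEAR"))  # [10.0, 20.0, 30.0, 40.0, 50.0]
print(fill_gaps(hourly_max, 100.5))     # [10.0, 20.0, 100.5, 40.0, 50.0]
```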
Multiple fill values
FILL()
accepts a list of values where each value corresponds to a single
aggregate column in the SELECT clause order:
SELECT min(price), max(price), avg(price), ts
FROM prices
SAMPLE BY 1h
FILL(NULL, 10, PREV);
In the above query, the min(price) aggregate will get the FILL(NULL) strategy
applied, max(price) will get FILL(10), and avg(price) will get FILL(PREV).
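The positional pairing of fill values and aggregate columns can be pictured with a short Python sketch. Each bucket is a tuple of (min, max, avg) aggregates, or None when the bucket is empty. The function name fill_columns and the sample values are hypothetical.

```python
def fill_columns(rows, options):
    """Fill empty buckets column by column; options[i] applies to column i."""
    filled = []
    for row in rows:
        if row is None:
            # PREV needs the last emitted row; before any data there is nothing to copy.
            previous = filled[-1] if filled else (None,) * len(options)
            row = tuple(
                None if opt == "NULL" else (previous[i] if opt == "PREV" else opt)
                for i, opt in enumerate(options)
            )
        filled.append(row)
    return filled


# Hypothetical (min, max, avg) aggregates for three hours; the middle hour is empty.
buckets = [(1.0, 3.0, 2.0), None, (4.0, 6.0, 5.0)]
result = fill_columns(buckets, ("NULL", 10, "PREV"))
print(result[1])  # (None, 10, 2.0)
```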
Sample calculation
Before QuestDB v7.4.0, the default time calculation of sampled groups was an absolute value. In other
words, sampling by one day produced 24 hour ranges starting at the first observation, not bound to calendar
dates. To align sampled groups to calendar dates, the ALIGN TO keywords can be
used and are described in the ALIGN TO CALENDAR section below.
Since QuestDB v7.4.0, the default behaviour for ALIGN TO has changed. If you do not specify
an explicit alignment, SAMPLE BY expressions will use ALIGN TO CALENDAR behaviour.
The prior default behaviour can be retained by specifying ALIGN TO FIRST OBSERVATION on a SAMPLE BY query.
Alternatively, one can set the cairo.sql.sampleby.default.alignment.calendar option to false in server.conf.
ALIGN TO FIRST OBSERVATION
Consider a table sensors with the following data spanning three calendar days:
CREATE TABLE sensors (
ts TIMESTAMP,
val INT
) TIMESTAMP(ts) PARTITION BY DAY WAL;
INSERT INTO sensors (ts, val) VALUES
('2021-05-31T23:10:00.000000Z', 10),
('2021-06-01T01:10:00.000000Z', 80),
('2021-06-01T07:20:00.000000Z', 15),
('2021-06-01T13:20:00.000000Z', 10),
('2021-06-01T19:20:00.000000Z', 40),
('2021-06-02T01:10:00.000000Z', 90),
('2021-06-02T07:20:00.000000Z', 30);
The following query can be used to sample the table by day.
SELECT ts, count() FROM sensors
SAMPLE BY 1d
ALIGN TO FIRST OBSERVATION
This query will return two rows:
ts | count |
---|---|
2021-05-31T23:10:00.000000Z | 5 |
2021-06-01T23:10:00.000000Z | 2 |
The timestamp values for the 24 hour groups start at the first-observed
timestamp and continue in 1d intervals.
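The grouping rule can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic, not QuestDB's implementation. The timestamps are the seven sensors rows above, written as naive datetimes that stand for UTC, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# The ts column of the sensors table above (UTC).
SENSOR_TS = [
    datetime(2021, 5, 31, 23, 10),
    datetime(2021, 6, 1, 1, 10),
    datetime(2021, 6, 1, 7, 20),
    datetime(2021, 6, 1, 13, 20),
    datetime(2021, 6, 1, 19, 20),
    datetime(2021, 6, 2, 1, 10),
    datetime(2021, 6, 2, 7, 20),
]


def sample_from_first_observation(timestamps, step):
    """Count rows per bucket, with buckets anchored at the earliest timestamp."""
    origin = min(timestamps)
    counts = Counter()
    for ts in timestamps:
        # Number of whole steps between the first observation and this row.
        index = (ts - origin) // step
        counts[origin + index * step] += 1
    return sorted(counts.items())


result = sample_from_first_observation(SENSOR_TS, timedelta(days=1))
for bucket, count in result:
    print(bucket.isoformat(), count)
# 2021-05-31T23:10:00 5
# 2021-06-01T23:10:00 2
```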
ALIGN TO CALENDAR
ALIGN TO CALENDAR is the default behaviour for SAMPLE BY. This option aligns data to calendar dates and accepts two optional parameters, TIME ZONE and WITH OFFSET. The following query:
SELECT ts, count() FROM sensors
SAMPLE BY 1d
or:
SELECT ts, count() FROM sensors
SAMPLE BY 1d
ALIGN TO CALENDAR
gives the following result:
ts | count |
---|---|
2021-05-31T00:00:00.000000Z | 1 |
2021-06-01T00:00:00.000000Z | 4 |
2021-06-02T00:00:00.000000Z | 2 |
In this case, the timestamps are floored to the start of their UTC day and grouped. The counts correspond to the number of entries occurring within each UTC day.
This is particularly useful for summarising data for charting purposes; see the candlestick chart from the example crypto dashboard.
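For comparison with first-observation alignment, flooring to the UTC day can be sketched in Python as below. Again this only illustrates the grouping rule. The timestamps are the sensors rows from this page as naive datetimes standing for UTC, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

# The ts column of the sensors table (UTC).
SENSOR_TS = [
    datetime(2021, 5, 31, 23, 10),
    datetime(2021, 6, 1, 1, 10),
    datetime(2021, 6, 1, 7, 20),
    datetime(2021, 6, 1, 13, 20),
    datetime(2021, 6, 1, 19, 20),
    datetime(2021, 6, 2, 1, 10),
    datetime(2021, 6, 2, 7, 20),
]


def sample_by_utc_day(timestamps):
    """Count rows per calendar day by truncating each timestamp to midnight."""
    counts = Counter(
        ts.replace(hour=0, minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted(counts.items())


result = sample_by_utc_day(SENSOR_TS)
for day, count in result:
    print(day.isoformat(), count)
# 2021-05-31T00:00:00 1
# 2021-06-01T00:00:00 4
# 2021-06-02T00:00:00 2
```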
TIME ZONE
A time zone may be provided for sampling with calendar alignment. Details on the options for specifying time zones with available formats are provided in the guide for working with timestamps and time zones.
SELECT ts, count() FROM sensors
SAMPLE BY 1d
ALIGN TO CALENDAR TIME ZONE 'Europe/Berlin'
In this case, the 24 hour samples begin at 2021-05-31T01:00:00.000000Z:
ts | count |
---|---|
2021-05-31T01:00:00.000000Z | 1 |
2021-06-01T01:00:00.000000Z | 4 |
2021-06-02T01:00:00.000000Z | 2 |
Additionally, an offset may be applied when aligning sample calculation to calendar:
SELECT ts, count() FROM sensors
SAMPLE BY 1d
ALIGN TO CALENDAR TIME ZONE 'Europe/Berlin' WITH OFFSET '00:45'
In this case, the 24 hour samples begin at 2021-05-31T01:45:00.000000Z:
ts | count |
---|---|
2021-05-31T01:45:00.000000Z | 2 |
2021-06-01T01:45:00.000000Z | 4 |
2021-06-02T01:45:00.000000Z | 1 |
Local timezone output
The timestamp values output from SAMPLE BY queries are in UTC. To have UTC
values converted to specific timezones, the to_timezone() function should
be used.
SELECT to_timezone(ts, 'PST') ts, count
FROM (SELECT ts, count()
FROM sensors SAMPLE BY 2h
ALIGN TO CALENDAR TIME ZONE 'PST')
Time zone transitions
Calendar dates may contain historical time zone transitions or may vary in the
total number of hours due to daylight saving time. Consider 31 October 2021
in the Europe/London time zone, a calendar day which consists of 25 hours:
on Sunday, 31 October 2021, at 02:00:00 local time, clocks are turned backward
1 hour to 01:00:00 local standard time.
When a SAMPLE BY operation crosses time zone transitions in cases such as
this, the first sampled group which spans a transition will include aggregates
by full calendar range. Consider a table sensors with one data point per hour
spanning five hours in UTC:
ts | val |
---|---|
2021-10-31T00:10:00.000000Z | 10 |
2021-10-31T01:10:00.000000Z | 20 |
2021-10-31T02:10:00.000000Z | 30 |
2021-10-31T03:10:00.000000Z | 40 |
2021-10-31T04:10:00.000000Z | 50 |
The following query will sample by hour with the Europe/London time zone and
align to calendar ranges:
SELECT ts, count() FROM sensors
SAMPLE BY 1h
ALIGN TO CALENDAR TIME ZONE 'Europe/London'
The sampled group for the hour in which the time zone transition occurs contains two records, one from each of the two hours that share the same local time:
ts | count |
---|---|
2021-10-31T00:00:00.000000Z | 2 |
2021-10-31T01:00:00.000000Z | 1 |
2021-10-31T02:00:00.000000Z | 1 |
2021-10-31T03:00:00.000000Z | 1 |
Similarly, given one data point per hour on this table, running SAMPLE BY 1d
will have a count of 25 for this day when aligned to calendar time zone
'Europe/London'.
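The 25 hour day and the doubled-up local hour can be checked with a small Python sketch. To stay self-contained it does not load a time zone database. Instead it hard-codes the single transition described above (clocks go back at 2021-10-31T01:00:00Z) using two fixed offsets. BST, GMT, TRANSITION and to_london are names introduced for this sketch only.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
BST = timezone(timedelta(hours=1), "BST")  # British Summer Time, UTC+1
GMT = timezone(timedelta(0), "GMT")        # local standard time, UTC+0

# Assumption hard-coded for this sketch: 02:00 BST on 31 October 2021 is 01:00 UTC.
TRANSITION = datetime(2021, 10, 31, 1, 0, tzinfo=UTC)


def to_london(ts_utc):
    """Convert a UTC timestamp to London wall-clock time around the transition."""
    return ts_utc.astimezone(BST if ts_utc < TRANSITION else GMT)


# Local midnight to local midnight: the day starts in BST and ends in GMT.
day_start = datetime(2021, 10, 31, tzinfo=BST)
day_end = datetime(2021, 11, 1, tzinfo=GMT)
hours_in_day = (day_end - day_start) // timedelta(hours=1)
print(hours_in_day)  # 25

# The five hourly rows from the table above.
readings = [datetime(2021, 10, 31, hour, 10, tzinfo=UTC) for hour in range(5)]
per_local_hour = Counter(to_london(ts).strftime("%H:00") for ts in readings)
print(sorted(per_local_hour.items()))
# [('01:00', 2), ('02:00', 1), ('03:00', 1), ('04:00', 1)]
```

The two rows at 00:10 and 01:10 UTC both fall in the 01:00 local hour, which is why one sampled group has a count of 2.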
WITH OFFSET
An arbitrary offset can be applied to calendar-aligned sampling in the format
'+/-HH:mm', for example:
- '00:30' plus thirty minutes
- '+00:30' plus thirty minutes
- '-00:15' minus 15 minutes
The query uses the default offset '00:00' if the parameter is not set.
SELECT ts, count() FROM sensors
SAMPLE BY 1d
ALIGN TO CALENDAR WITH OFFSET '02:00'
In this case, the 24 hour samples begin at 2021-05-31T02:00:00.000000Z:
ts | count |
---|---|
2021-05-31T02:00:00.000000Z | 2 |
2021-06-01T02:00:00.000000Z | 4 |
2021-06-02T02:00:00.000000Z | 1 |
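The shifted bucket boundaries can be reproduced by subtracting the offset before flooring and adding it back afterwards. The following Python sketch does this for the sensors rows used on this page, as naive datetimes standing for UTC. It illustrates the arithmetic only, and the function name sample_with_offset is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# The ts column of the sensors table (UTC).
SENSOR_TS = [
    datetime(2021, 5, 31, 23, 10),
    datetime(2021, 6, 1, 1, 10),
    datetime(2021, 6, 1, 7, 20),
    datetime(2021, 6, 1, 13, 20),
    datetime(2021, 6, 1, 19, 20),
    datetime(2021, 6, 2, 1, 10),
    datetime(2021, 6, 2, 7, 20),
]


def sample_with_offset(timestamps, step, offset):
    """Count rows per bucket, with bucket starts at epoch + offset + k * step."""
    counts = Counter()
    for ts in timestamps:
        # Floor division rounds down, so negative offsets work as well.
        index = (ts - EPOCH - offset) // step
        counts[EPOCH + offset + index * step] += 1
    return sorted(counts.items())


result = sample_with_offset(SENSOR_TS, timedelta(days=1), timedelta(hours=2))
for bucket, count in result:
    print(bucket.isoformat(), count)
# 2021-05-31T02:00:00 2
# 2021-06-01T02:00:00 4
# 2021-06-02T02:00:00 1
```

With an offset of zero the same function reduces to plain calendar alignment.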
TIME ZONE WITH OFFSET
The TIME ZONE and WITH OFFSET options can be combined.
SELECT ts, count() FROM sensors
SAMPLE BY 1h
ALIGN TO CALENDAR TIME ZONE 'Europe/London' WITH OFFSET '02:00'
The sample then begins from Europe/London at 2021-10-31T02:00:00.000000Z:
ts | count |
---|---|
2021-10-31T02:00:00.000000Z | 1 |
2021-10-31T03:00:00.000000Z | 1 |
2021-10-31T04:00:00.000000Z | 3 |
2021-10-31T05:00:00.000000Z | 2 |
Examples
Assume the following table trades:
ts | quantity | price |
---|---|---|
2021-05-31T23:45:10.000000Z | 10 | 100.05 |
2021-06-01T00:01:33.000000Z | 5 | 100.05 |
2021-06-01T00:15:14.000000Z | 200 | 100.15 |
2021-06-01T00:30:40.000000Z | 300 | 100.15 |
2021-06-01T00:45:20.000000Z | 10 | 100 |
2021-06-01T01:00:50.000000Z | 50 | 100.15 |
This query will return the number of trades per hour, with sample groups starting at the first observation:
SELECT ts, count() FROM trades SAMPLE BY 1h ALIGN TO FIRST OBSERVATION;
ts | count |
---|---|
2021-05-31T23:45:10.000000Z | 4 |
2021-06-01T00:45:10.000000Z | 2 |
The following will return the trade volume in 30 minute intervals:
SELECT ts, sum(quantity*price) FROM trades SAMPLE BY 30m;
ts | sum |
---|---|
2021-05-31T23:45:10.000000Z | 1000.5 |
2021-06-01T00:15:10.000000Z | 16024 |
2021-06-01T00:45:10.000000Z | 8000 |
2021-06-01T00:15:10.000000Z | 8012 |
2021-06-01T00:45:10.000000Z | 8000 |
The following will return the average trade notional (where notional = quantity * price) by day:
SELECT ts, avg(quantity*price) FROM trades SAMPLE BY 1d;
ts | avg |
---|---|
2021-05-31T23:45:10.000000Z | 6839.416666666667 |
To make this sample align to calendar dates:
SELECT ts, avg(quantity*price) FROM trades SAMPLE BY 1d ALIGN TO CALENDAR;
ts | avg |
---|---|
2021-05-31T00:00:00.000000Z | 1000.5 |
2021-06-01T00:00:00.000000Z | 8007.2 |
See also
This section includes links to additional information such as tutorials: