Maximize your SQL efficiency: SELECT best practices
SQL is wildly popular. Within it, the most common command is the SELECT statement. After all, what good is a database full of valuable information if you're unable to read it with flexibility and efficiency?
This guide walks you through best SELECT practices. Afterwards, you'll write more efficient and powerful SQL statements. This can save both processing time and resources, leading to faster queries and cheaper databases.
While many of the optimization tips would make sense universally across any SQL database, the statements we show are specific for QuestDB and many include its robust time-series extensions.
If you don't have a database of your own running, consider launching these queries into the QuestDB demo instance. The example queries leverage the built in data sets when possible.
Apply columnar specificity
Read only the columns you need and avoid doing select *.
The first one may seem like common sense, but searching GitHub for
SELECT * FROM
returns 5.7 million results. Surely a great number of these are
only looking for a column or two. Specifying a column vs. all columns narrows
down the result set. This reduces data transfer and lightens the load on memory.
Specify - or "scope" - specific columns, if all you need are specific columns:
SELECT fare_amount from trips;
Scoping your columns is always a wise idea. But where it shines most is within databases that support the columnar database format, such as QuestDB.
Always filter by designated timestamp
Designated timestamps are a very powerful feature. With timestamps, the database quickly locates the initial data point that is relevant to the query.
QuestDB optimizes query execution by physically organizing data in ascending order based on a designated timestamp and partitioning this data according to a predefined resolution.
This minimizes random access operations, instead favouring sequential access. Sequential access is preferred by both the operating system and the storage drive, leading to faster and more efficient data retrieval.
If possible, always filter by the designated timestamp.
Instead of returning "all time", we return only the time we need. Since data is stored in chronological order, a database like QuestDB will know when to stop reading as soon it sees a value later than the timestamp boundary. As a result, it can immediately return results:
SELECT price, timestamp FROM tradesWHERE timestamp BETWEEN '2022-03-08T08:59:59' AND '2022-03-08T18:59:59';
Optimize partitions
Partitions split databases into smaller chunks. Time is one convenient way to partition a database. But consider that your partitions are very large: a YEAR, for example.
Learn more about database partitioning.
Your queries may often apply a short scope, a day, or hours. In this common case, you are accessing a large chunk of un-needed data. This is inefficient.
Consider the following query to against a table partitioned by YEAR:
SELECT prices, timestampFROM tradesWHERE timestamp >= '2022-03-08T00:00:00' AND timestamp < '2022-03-09T00:00:00';
Looks OK. But because of the underlying partitioning, we need to read - and discard - many rows until we find the start of the time range, which is in August. If the table was instead partioned by MONTH, then we would read/discard a much smaller amount of data, even if we end up opening more files as a result of spanning two partitions.
Try to find the sweet spot based on your most typical queries.
Divide tables to match query patterns
Building off of partition precision, consider matching your tables to your query types. Apart from matching the timestamps up with an appropriately broad partition, it is strong practice to match tables up with query types.
Divide large single tables into multiple tables that match query types.
Consider dividing a large table into multiple tables if you have different query patterns for different columns in the query. Maybe some columns you typically need to query daily and some monthly. Table makeup can reflect this.
JOINs are expensive
While many tables are appropriate in some cases, a singular table may be the right choice if you often find yourself applying JOIN to your queries. A database query that JOINs together many tables is typically more expensive than a similar query that reads from a single table. If different columns are frequently queried together, then a single table is preferred.
Let's use an example query:
SELECTpickup_datetime,fare_amount,tempF,windDirFROMtripsASOF JOIN weatherWHEREtrips.pickup_datetime >= '2018-06-01T00:00:00'ANDtrips.pickup_datetime < '2018-06-02T00:00:00';
The above query is a two-table join. Not bad. Efficient and clean.
However, what if we added four additional tables: drivers, vehicles, and payment_info. In this query, t, w, d, v, p, tc are table aliases used for clarity and brevity.
SELECTt.pickup_datetime,t.fare_amount,w.tempF,w.windDir,d.driver_name,d.driver_rating,v.vehicle_make,v.vehicle_model,p.payment_method,p.amount_paid,FROMtrips tASOF JOIN weather wJOIN drivers d ON (driver_id)JOIN vehicles v ON(vehicle_id)LEFT OUTER JOIN payment_info p ON (transaction_id)WHEREt.pickup_datetime >= '2022-03-08T00:00:00' AND t.pickup_datetime < '2022-03-09T00:00:00';
This query is much more difficult to process.
If these were all part of a singular trip_details table, it would be that much more efficient.
Match tables up with symbols
A symbol is a data structure used to store repetitive strings. Within the database itself, symbol types are stored as a table of integers and their corresponding string values.
Using symbols can improve query performance by a large margin, as string operations compare and write int types instead of string. As an additional benefit, less storage space is consumed. They're also unobtrusive to the query writer (you!) and offer reduced overall complexity.
If you often query by timestamp plus a symbol, like type, country, model and so on, you may be able to find additional performance through the creation of multiple tables. Typically, one table per different symbol value would provide a boost.
CREATE TABLE symbol_table(symbol SYMBOL CAPACITY 128 NOCACHE, timestamp TIMESTAMP)timestamp(timestamp);
However do note that if you needed to query two different symbol values, you
would need to apply a UNION
, so make sure the trade off is right for you.
Applying UNION
merges the results of 2 different queries:
symbol1_tableUNIONsymbol2_table;
Also, if the number of symbols is in the thousands and you create a different table for each, there will be extra memory & disk consumption. In that case, you may find that you need to manually tune configuration, or perhaps just keep using a single table.
EXPLAIN your queries first
Use EXPLAIN to see how queries will execute.
Depending on which column types you are filtering by, or which aggregation functions you are using, queries might parallelise or not.
To demonstrate, let's deconstruct the trades table in the demo instance.
It is created like so:
CREATE TABLE trades (symbol SYMBOL CAPACITY 256 CACHE,side SYMBOL CAPACITY 256 CACHE,price DOUBLE,amount DOUBLE,timestamp TIMESTAMP) TIMESTAMP (timestamp) PARTITION BY DAY
Let's run one of our ill-advised open *
queries:
EXPLAIN SELECT * FROM trades ORDER BY timestamp DESC;
It returns:
QUERY PLAN
DataFrame
Row backward scan
Frame backward scan on: trades
Noteable is that the entire DataFrame was processed backward.
Now let's try another query, with a filtering parameter.
EXPLAIN SELECT * FROM trades WHERE amount > 100.0;
It returns:
Async JIT Filter
filter: 100.0<amount
workers: 1
DataFrame
Row forward scan
Frame forward scan on: trades
Ah, now it's a forward scan and we can see the filter.
Let's try one more example:
EXPLAIN SELECT avg(price), var_pop(price)FROM trades WHERE amount > 100.0;
This returns:
GroupBy vectorized: false
values: [avg(price),var_pop(price)]
Async JIT Filter workers: 24
filter: 100.0<amount
DataFrame
Row forward scan
Frame forward scan on: trades
Note GroupBy vectorized: false
. Vectorization means that QuestDB will
parallelize some computations and aggregations. It leads to better performance.
The avg
aggregation does support vectoration, but var_pop
does not.
Read more about vectorization and SIMD operations.
We can remove var_pop
and try again:
EXPLAIN SELECT avg(price)FROM trades WHERE amount > 100.0;
We no longer see the notice that the query as a whole is not applying vectorization:
Async JIT Group By workers: 24
values: [avg(price)]
filter: 100.0<amount
DataFrame
Row forward scan
Frame forward scan on: trades
Learning the nitty-gritty of how your queries are actually being processed will pay future dividends. You'll soon get a sense of just how your queries are running, and become more familiar with the conditions which lead to high query performance.
Are you up-to-date?
OK, this one is a bit of a soft ball. But database innovation often moves at a staggering pace. Newer versions tend to improve execution, so as a general rule: Always ensure you are on the latest available version.
For example, QuestDB 7.3.10, brought with it a 5x-10x speed-up in ASOF and LT JOIN queries, as well as a number of optimizations in parallel filters and GROUP BY, among other improvements.
Simply bumping the version offers strong value.
You'll remain current, and - naturally - very hip.
Know your queries
Imagine you found a query you love.
Like, LATEST ON running on the trades table to find the most recent pair of rows:
SELECT price, symbol, timestamp FROM tradesLATEST ON timestamp PARTITION BY side;
It's neat to see different trades fly into the database.
But let's say we apply many symbols:
SELECT price, symbol, timestamp FROM tradesWHERE symbol IN ('SYM1', 'SYM2', 'SYM3', 'SYM4')LATEST ON timestamp PARTITION BY side;
This will be a fairly taxing query, as LATEST ON performs better when only a single symbol is used. How would we know? Well, the docs tell us.
If you're using a database like QuestDB that has native time-series extensions, or any database that offers less common extensions, they may have tradeoffs and caveats.
As such, it's always wise to read the docs when using novel extensions.
Apply early inner-filtering
In SQL, the "innermost" query represents the deepest level of nesting. It is often the first part of the query to be executed in a sequence of nested queries, also known as a subquery.
When using subqueries or Common Table Expressions (CTEs), apply filters as early as possible in your query structure. Filtering early reduces the volume of data processed in subsequent steps. Irrelevant rows are removed and not passed down the query chain, thus preventing unnecessary computations on data not included in the final result.
Let's look at an example.
We want to calculate the average price of ETH-USD trades bewteen '2022-03-15' and '2022-03-19', but only for 'buy' side trades.
An inefficient example might look as such:
WITH DailyTrades AS (SELECTsymbol,side,AVG(price) AS avg_priceFROM tradesWHERE timestamp BETWEEN '2022-03-15 00:00:00' AND '2022-03-19 23:59:59'GROUP BY symbol, side)SELECT * FROM DailyTradesWHERE symbol = 'ETH-USD'AND side = 'buy';
Our DailyTrades CTE calculates the average price for all symbols and sides for
the entire day first. Only after aggregation does it filter for 'ETH-USD' and
'buy' trades. This means the aggregation operation processes more data than is
needed. As per the demo instance, ~230ms
is our total computation time.
Can we do better?
SELECTsymbol,side,AVG(price) AS avg_priceFROM tradesWHEREtimestamp BETWEEN '2022-03-15 00:00:00' AND '2022-03-19 23:59:59'AND symbol = 'ETH-USD'AND side = 'buy'GROUP BY symbol, side
In this second cut, we've applied filters on timestamp, symbol, and side in the
main FROM clause. This ensures that only relevant data is considered from the
start. And as a result, our second query finishes around ~190ms
, a cool 18%
improvement.
How come? The GROUP BY clause - an already well optimized extension - now has less data to process and thus operates more efficiently. Avoiding an aggregation across the entire dataset is, indeed, bound to be helpful.
ORDER BY at the end
Apply ORDER BY at the end of your statements. Or, if you expect a large result set, avoid ORDER BY. Also by default the natural order of results will be ascending designated timestamp. If thatβs good enough for you, then great! No need for ORDER BY. But assuming you do, consider the following example.
Let's say we want 10 'buy' side trades for 'ETH-USD' between '2022-03-15' and '2022-03-19', including the symbol, side, price, and timestamp.
SELECT *FROM (SELECT symbol, side, price, timestampFROM tradesORDER BY timestamp DESC) AS OrderedTradesWHERE symbol = 'ETH-USD'AND side = 'buy'AND timestamp BETWEEN '2022-03-15 00:00:00' AND '2022-03-19 23:59:59'LIMIT 10;
This structure is logically correct, however note that ORDER BY is applied
within a subquery. The query completed at around ~200ms
.
Can we do better? We can!
Let's apply ORDER BY at the end:
SELECT symbol, side, price, timestampFROM tradesWHERE symbol = 'ETH-USD'AND side = 'buy'AND timestamp BETWEEN '2022-03-15 00:00:00' AND '2022-03-19 23:59:59'ORDER BY timestamp DESCLIMIT 10;
This re-ordering of our statement averages around ~180ms
. The goal, as in the
prior best practice, is to present a minimized dataset to ORDER BY. The
opimization becomes more stark the larger your dataset.
It should be noted that, in the case of a database like QuestDB, SQL extensions
can also help. For example, if we were to apply a LIMIT clause, the whole
ORDER BY timestamp DESC LIMIT 10
can be replaced by 'LIMIT -10'
. The natural
sorting of QuestDB once again saves the day (or the microsecond):
SELECT symbol, side, price, timestampFROM tradesWHERE symbol = 'ETH-USD'AND side = 'buy'AND timestamp BETWEEN '2022-03-15 00:00:00' AND '2022-03-19 23:59:59'LIMIT -10;
Summary
Hopefully these 10 SELECT tips will help you refine your queries. Great improvements can often be found through fairly minor adjustments. Applying these tips will help you write efficient and flexible queries.
If you have any other recommendations, we'd love to hear about them. Tweet or x-eet or wave or email or smoke signal us (whatever your flavour), or visit us in our Community Forum to share.
Thanks for reading!