Why AI needs a database (and a database needs AI)

AI-powered language models have arrived. The hype storm is real. Suddenly, frontier models are the hammer and every opportunity looks like a nail.

But is this really the case? Will an AI solve all of our infrastructure and development challenges? Nope - not yet, anyway. That said, their intelligent use will supercharge our capabilities and require broad adaptation.

In this article, we'll pierce through the confusion and wild claims, and better define the dynamic and profound agentified era of database + AI collaboration.

To that end, we'll consider a question:

Will an AI replace my database?

To answer that, we'll start with the token problem.

A massive wave appears - one sits in fear, the other has a surf board
We're excited!

The token problem

A token is the smallest unit of text a language model processes, typically a word, subword, or even a character, depending on the tokenizer. Models don't read text like humans do — they break text into tokens, which serve as the building blocks for understanding and generating language.

Let's demonstrate simplified tokenization on a context-rich sentence:

Diagram showing how a sentence is broken down into individual tokens

This whole sentence is 6 tokens.

It's a tidy process. But what about information that isn't text?

Structured data doesn't tokenize well — LLMs stumble over it.

To illustrate the point, let's "tokenize" a structured timestamp:

Diagram showing how a timestamp is broken into multiple tokens

That's 17 tokens, depending on the tokenizer - GPT-4o's tokenizer pictured.

Converted into a microsecond timestamp - 1646762637609765 - it's 7.

Either way, that's a lot of tokens for a single, precise point of data.
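If you want to reproduce these counts, here's a minimal sketch using OpenAI's open-source tiktoken library (assuming a recent release that ships the o200k_base, i.e. GPT-4o, encoding; exact counts vary by tokenizer):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the GPT-4o tokenizer

samples = {
    "plain sentence": "The quick brown fox jumps over the lazy dog.",
    "ISO timestamp": "2022-03-08T18:03:57.609765Z",
    "epoch micros": "1646762637609765",
}

for label, text in samples.items():
    tokens = enc.encode(text)
    print(f"{label}: {len(tokens)} tokens -> {tokens}")

Your exact numbers will differ from the figures above depending on the tokenizer, but the pattern holds: dense, precise values splinter into many small tokens.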

Databases - especially time-series databases - are stuffed with structured data.

Consider a database table full of financial trades.

We have one such database on our demo instance:

The reliable ol' trades table
SELECT * FROM trades;

This table has over 1.5 billion rows representing various trades across time.

As an uncompressed file it's "only" 52GB - pretty lean, right?

But how many tokens do you think are contained in that 52GB?

Tokenizing War and Peace

Textual data is naturally broken into words and phrases that align with an LLM's tokenization scheme. Structured datasets - tables, logs, financial records, and similar - are dense, compact formats that become wildly inefficient when converted into LLM-readable token sequences.

Consider one row from our trades table:

| symbol | side | price | amount | timestamp |
|---|---|---|---|---|
| ETH-USD | sell | 2615.54 | 0.00044 | 2022-03-08T18:03:57... |

How many tokens do you think make up this one row?

| Value | Estimated tokens |
|---|---|
| ETH-USD | ~2 tokens |
| sell | ~1 token |
| 2615.54 | ~4+ tokens |
| 0.00044 | ~3+ tokens |
| 2022-03-08T18:03:57.609765Z | ~16+ tokens |

Before metadata or padding, one row balloons to 25-30 tokens.

Within this range, let's estimate the total token size of the dataset...

  • Lower bound: 1.5 billion rows * 25 avg tokens = 37.5 billion tokens

  • Upper bound: 1.5 billion rows * 30 avg tokens = 45 billion tokens

For reference, GPT-3 was trained on ~300 billion tokens. The training set drew on a significant portion of the organized English text we've digitized: encyclopedias, articles, websites, books, and much, much more - tens of terabytes of raw material, distilled down into those tokens.

Our relatively tiny 52GB example dataset of 1.5 billion rows would have consumed ~15% of GPT-3's entire training budget if tokenized!
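As a rough sanity check on those bounds, you can tokenize a single serialized row and scale it up (tiktoken again; the serialization format you pick will nudge the count):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the GPT-4o tokenizer

# One trade, serialized the way it might be pasted into a prompt
row = "ETH-USD,sell,2615.54,0.00044,2022-03-08T18:03:57.609765Z"

tokens_per_row = len(enc.encode(row))
total_rows = 1_500_000_000

estimated_total = tokens_per_row * total_rows
print(f"tokens per row:          {tokens_per_row}")
print(f"estimated dataset total: {estimated_total / 1e9:.1f} billion tokens")
print(f"share of GPT-3's ~300B:  {estimated_total / 300e9:.1%}")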

Most structured datasets are much larger than 52GB. A real financial dataset with billions of rows wouldn't be tens of gigabytes; it would be hundreds of gigabytes, terabytes even, and it would keep expanding through a constant, real-time stream. Token counts for a useful timeframe would skyrocket into the trillions. Wild.

As such, "training" on this data would be:

  • Wildly expensive
  • Very, very slow
  • Deeply impractical

Why's that? How does that work?

Unforgiving tokenomics

LLMs tokenize data based on learned language structures.

Structured and binary data do not map efficiently to these token vocabularies.

This results in severe token expansion when converting structured datasets:

| Data type | Bytes per token | Tokens per 1GB | Expansion vs. text |
|---|---|---|---|
| Plain Text | ~4 bytes/token | 250M tokens | 1× (baseline) |
| JSON Data | ~2-3 bytes/token | 500M-750M tokens | 2-3× |
| Tabular Data | ~1-2 bytes/token | 865M-1B tokens | 3.5-4× |
| Binary Logs | ~0.75-1.5 bytes/token | 670M-1.3B tokens | 3-5× |
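These ratios are rules of thumb, not constants; you can measure them for your own data in a few lines (tiktoken once more, and your results will vary by tokenizer and format):

import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

samples = {
    "plain text": "Prices drifted lower through the afternoon as traders took profits.",
    "JSON": json.dumps({"symbol": "ETH-USD", "side": "sell", "price": 2615.54,
                        "amount": 0.00044, "timestamp": "2022-03-08T18:03:57.609765Z"}),
    "CSV row": "ETH-USD,sell,2615.54,0.00044,2022-03-08T18:03:57.609765Z",
}

for label, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(enc.encode(text))
    print(f"{label:>10}: {n_bytes / n_tokens:.2f} bytes per token ({n_tokens} tokens)")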

Tokenization is lossy and inefficient for structured data.

It breaks information into approximate units optimized for prediction, not retrieval. This expansive process is especially dramatic for numerical and binary data, where precise values must be split across multiple tokens.


Consider this:

For text data, 1GB of data is 1800 novels of average length.

That's about 286 copies of War and Peace.

This expands into about 250M tokens.

For structured data, 1GB is ~29 million rows of transactions over mere weeks.

This expands into almost a billion tokens — almost 4× the size of plain text.

Bar chart comparing token counts between novels and transaction data
Quite a contrast

A single month of financial data outstrips entire bookshelves of classic literature. But, eh, the financial dataset still seems so small.

Can't the LLM just store it for us and analyze its contents?

Training ≠ storing

If LLMs simply "memorized" everything, they'd function like a lossless compression algorithm, perfectly retrieving exact details on demand.

But as shown above with tokenizing, we see the opposite.

As a result:

  • LLMs struggle with factual recall
  • They generate responses probabilistically
  • They approximate facts, rather than retrieve exact data

Why? A model is a function, not a repository.

If 300 billion tokens were stored directly, you'd expect something around a terabyte in size. But the data is lossily compressed: text into tokens, then into parameterized weights during training. Hundreds of billions of words become hundreds of billions of tokens, which in turn become just hundreds of gigabytes of weights.

Diagram showing how training data is compressed into model weights
Pseudo-compression during training

This is possible because a model doesn't store text, it learns patterns.

It forms an intricate map of probabilities and relationships.

This creates an abstraction atop meaning, not a ledger of exact records.

In this sense, training is less like taking a snapshot of reality and more like sculpting a shape out of raw material — only the most "statistically useful" features remain. Statistically useful is... OK, but is it accurate?

Homer Simpson stands next to a very poor art piece
'I analyzed your time-series dataset'

Trying to extract precise information from a model — like the exact closing price of Ethereum on a specific date — is a fundamentally misaligned request.

An LLM might generate a plausible answer, but there's no guarantee it's correct. This presents a major challenge. Structured datasets — prices, transactions, and events — constantly change. Today's models can't keep up.

So why can't we just "give our data to the model", in CSV chunks if we have to, to produce insights from it and "talk to it" in real-time? Can't we use a pre-prompt, Retrieval-Augmented Generation (RAG), or "context"?

"Just AI it"

In other words: can't we "just AI it" anyway?

Nope. Not if we expect accuracy, quality and recency from our results.

Unfortunately, LLMs just don't work that way.

A meme from Futurama - a robot with an animal wheel face, on a horse, with the text "horse says analysis denied"
Neigh!

The problem is, context windows are fixed, and structured data is massive.

Even state-of-the-art models like Claude 3.7 max out at 200K tokens per conversation. While this might seem large, it's only a tiny fraction of a real structured dataset.

A common assumption is that chunking — splitting a dataset into smaller parts and feeding them incrementally — allows an LLM to work with large-scale structured data.

The bottleneck remains the model's ability to retain past data. LLMs do not have persistent memory; they process only what is inside their active context window.

Each new chunk overwrites earlier chunks once the context limit is reached, meaning older data is forgotten. Unlike a database, which indexes and retains all historical records, an LLM's memory resets every time it reaches capacity.

| Model | Parameters | Context window size (tokens) |
|---|---|---|
| GPT-4 Turbo | 1.76 trillion* | 128,000 |
| Claude 3 | 540 billion | 200,000 (expandable to 1 million) |
| Claude 3.5 Sonnet | Not specified | 200,000 |
| Claude 3.5 Haiku | Not specified | 200,000 |
| Claude 3.7 Sonnet | Not specified | 200,000 |
| DeepSeek-V2 | 236 billion | 128,000 |
| DeepSeek-V3 | 671 billion | 128,000 |
| Jamba | 52 billion | 256,000 |

Even if we manage to fit a portion of the dataset into a prompt, LLMs do not retrieve facts deterministically — they generate responses probabilistically. This means even when given the same correct data, an LLM might:

  • Return slightly different numbers
  • Round off values inconsistently
  • Introduce hallucinated data
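A toy illustration of that last point: generation is sampling from a probability distribution over next tokens, so the same prompt can surface different values across runs. This is not a real model, just numpy mimicking the sampling step, with a made-up distribution over candidate price strings:

import numpy as np

# Hypothetical next-token distribution: the model strongly "prefers" the true
# price, but still assigns probability to near-miss alternatives.
candidates = ["2615.54", "2615.45", "2616.54", "2615.50"]
logits = np.array([4.0, 2.2, 1.8, 1.5])

def sample_price(temperature=1.5, seed=None):
    # A fairly high temperature, to make the variability easy to see
    rng = np.random.default_rng(seed)
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return rng.choice(candidates, p=probs)

# Ask the "same question" five times; sampling means the answers can disagree.
print([sample_price(seed=s) for s in range(5)])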

So, what do we do?

Enter the trusty database.

Database: friend of AI

Databases, on the other hand, provide:

  • Exact lookups: deterministic access to structured data
  • Real-time updates: new data is immediately available
  • Precision: answers are exact, not probabilistic estimates

The solution isn't to replace databases but to integrate them into AI workflows. Instead of forcing AI models to "remember" everything, we give them access to query and interpret structured data dynamically.

Indiana Jones prepares to swap a bag of sand (hallucinations) for the golden idol (real data)
Steady now...

AIs, like able humans, apes and mischievous crows, can work with tools.

This has powerful potential, especially for high-value data like financial market feeds, industrial metrics, and sensor readings, all of which stream in real time and at enormous scale.

These datasets are often:

  • Massive, with billions of rows and terabytes of data
    • Think of the tokens!
  • High-frequency, expanding every millisecond
  • High-cardinality, with thousands of unique entities

So what if an AI directly calls a database - QuestDB, PostgreSQL, or similar?

  1. The model translates natural language into structured SQL queries

  2. The database executes the query, and the model interprets the results

The aforementioned flow
Tokenized vs. database retrieval
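Here's what that flow can look like in practice: a minimal sketch in Python against QuestDB's PostgreSQL wire protocol (port 8812, with the default credentials of a local instance), assuming psycopg2 is installed. ask_llm() is a stand-in for whichever model API you prefer:

import psycopg2

def ask_llm(prompt: str) -> str:
    """Stand-in for your model call: OpenAI, Anthropic, a local model, etc."""
    raise NotImplementedError

PRE_PROMPT = (
    "You translate questions into SQL for a QuestDB table "
    "trades(symbol, side, price, amount, timestamp). Return only the SQL."
)

def answer(question: str) -> str:
    # 1. The model translates natural language into a structured SQL query
    sql = ask_llm(f"{PRE_PROMPT}\n\nQuestion: {question}")

    # In production, validate or whitelist the generated SQL before running it
    with psycopg2.connect(host="127.0.0.1", port=8812, user="admin",
                          password="quest", dbname="qdb") as conn:
        with conn.cursor() as cur:
            # 2. The database executes the query and returns exact values
            cur.execute(sql)
            rows = cur.fetchall()

    # 3. The model interprets the exact results for the user
    return ask_llm(f"Question: {question}\nQuery results: {rows}\n"
                   "Summarize these results in plain language.")

Call answer("How did ETH perform last month compared to BTC?") and the heavy lifting - storage, indexing, aggregation - stays where it belongs: in the database.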

This provides exact numerical recall, with no approximations.

It's live, up-to-date data retrieval, which scales efficiently for massive datasets.

In theory, it solves the token problem.

The model won't be crushed by the token bloat that structured data creates.

But how does this work in real-world "AI" systems?

Direct database querying

Let's say an AI-powered trading assistant needs to analyze market trends and explain them to a human. Instead of making the AI model "memorize" past trades, we let it dynamically query something like QuestDB.

User asks an "agent" or similar model-guided entry point:

"How did ETH perform last month compared to BTC?"

As of today, frontier models are proficient at query generation when they know the available syntax. With a coherent pre-prompt, the AI converts the question into SQL:

Text converted into SQL
SELECT symbol, AVG(price) FROM trades
WHERE symbol IN ('ETH-USD', 'BTC-USD')
  AND timestamp BETWEEN '2022-03-01' AND '2022-03-31'
GROUP BY symbol;

QuestDB then executes in milliseconds, returning exact averages:

| symbol | AVG Price |
|---|---|
| ETH-USD | 2914.567413002606 |
| BTC-USD | 42180.825850850815 |

The model then interprets them: "ETH's average price was $2,914.57, whereas BTC's was $42,180.83 - roughly 7% of BTC's average over the same period."

By separating the logic, we achieve accuracy and sub-second response times instead of janky LLM-generated estimates or a ponderous journey through a large index. We receive exact data, instead of probabilistic inference.

In this hybrid approach, QuestDB ingests a massive stream of data, scaling cost-effectively as it routes data from immediate storage into less expensive object storage for historical querying.

Both the latest information and historical data are thus always available for immediate analysis. The cost of storing this data, the relative simplicity of the entry point (one query into all your historical data), and the millisecond response time, offer extreme efficiency and performance.

The flow discussed in the next section
Hey data, it's me

With database access and a clean pre-prompt, a conversational interface atop one source which contains "all of your data over time" turns the model into something profound. Given its vast knowledge about market analysis techniques, these conversations generate queries and visualizations that are extremely deep, precise, up-to-date, and fast.

A clean system-of-record at scale provides a simplified interface for your AI or agent to use. It's also easily verifiable against the source, with SQL being as coherent for a human as it is for an agent. The best of both worlds.

The bottleneck in how fluid these conversations will feel - and in how "far back" in time they can go - is the quality of the data store and the query engine underneath. Overly complicated and expensive ingress/egress architectures will slow things down and weaken response quality.

Choose your AI's partner carefully!

Bender from Futurama with his lady bot in a girder of love
The right model with the right DB <3

The Others

So, what are our other options?

If we are working with structured data, are there other techniques?

The flow of the following methods
One of these will work

Vector search converts text into embeddings — lists of numbers which capture meaning in a mathematical space. Similar meanings have similar vectors, even if they use different words.

When a query is made, an approximate nearest-neighbor (ANN) search finds the most semantically similar data points. These data points then return as standalone search results, or are re-routed into an LLM as additional context.
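Under the hood, the retrieval step is nearest-neighbour search over vectors. Here's a bare-bones sketch with numpy, using a toy bag-of-words embed() purely so it runs; a real system would call an embedding model and use an ANN index (HNSW, IVF) instead of brute force:

import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hash words into a small dense vector.
    A real system would use an embedding model (sentence-transformers, an API, ...)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

documents = [
    "QuestDB is a time-series database built for fast ingestion.",
    "War and Peace is a novel by Leo Tolstoy.",
    "ETH-USD trades stream in from the exchange in real time.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def top_k(query: str, k: int = 2):
    q = embed(query)
    # Cosine similarity: higher means "closer in meaning" (for real embeddings)
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(-sims)[:k]
    return [(documents[i], round(float(sims[i]), 3)) for i in best]

print(top_k("Which database handles time-series data?"))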

So hard to explain. A detective, cabinets, gibberish - it's wild
ChatGPT's wacky take on a vector search meme, for hilarity

This method is fast for unstructured text, making it ideal for:

  • Research papers
  • Customer support logs
  • Knowledge bases

Given that embeddings are indexed, there's no need to retrain the model.

We just update the search index.

As such, vector search excels at unstructured data retrieval.

But it struggles with exact values and real-time updates.

Why?

  • Embeddings rank similarity over precision - a price of $99.99 vs. $100.01 may be similar in value but far apart in embedding space
  • Precomputed embeddings can become outdated - if financial transactions change, old embeddings remain static

Typically, vector search on its own doesn't generate new content — it retrieves the closest matching results. When we add generation into the flow, then we're talking about RAG.

Retrieval-Augmented Generation (RAG)

RAG builds on vector search.

Instead of relying on the model's internal knowledge, RAG retrieves up-to-date information from an external source — typically a "vector-friendly" database like Pinecone, Weaviate, or Elasticsearch. The LLM then combines the original query with the retrieved data to generate an answer.
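In code, RAG is just that retrieval step bolted onto the prompt. A minimal, library-agnostic sketch, where retrieve() and ask_llm() are whatever retrieval function and model client you already have (for example, the top_k() sketched earlier and your LLM call):

def rag_answer(question: str, retrieve, ask_llm) -> str:
    """retrieve(question) returns a list of relevant text snippets (vector search,
    keyword search, a database query...); ask_llm(prompt) returns the model's reply."""
    snippets = retrieve(question)
    context = "\n\n".join(snippets)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return ask_llm(prompt)

Whatever retrieve() brings back caps the quality of the answer, which is exactly the weakness discussed below.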

Homer reading a book
It's so simple

This method is particularly powerful because it:

  • Allows real-time updates
  • Expands an LLM's effective knowledge without retraining
  • Bridges structured and unstructured retrieval, making it flexible

RAG retrieves and integrates external data. As such, it excels at the same dynamic, knowledge-heavy applications as its forebear, vector search.

However, RAG still inherits the weaknesses of vector search when it comes to structured numerical data. It does not guarantee exact numerical recall, meaning financial or time-series data isn't a strong fit.

The approach also leans heavily on a well-structured retrieval pipeline. Poor retrieval pipelines mean poor results, and query efficiency depends on the quality of the search index, which degrades if left unmaintained.

Furthermore, multi-step retrieval methodologies introduce latency before response generation. This is crucial when you expect a fluid and information-rich conversation.

If accuracy, deterministic values, recency, and fluid speed are needed - as with financial data, real-time analytics, structured logs, and IoT - then direct database querying is the best choice for structured data.

Summary

A language model is not a database, nor should it be treated as one.

Depending on your data type, the best approach is to combine an AI's language capabilities with something else, like a high-performance database that scales and simplifies the overall ingest and query architecture atop your data.

This enables real-time, accurate, and cost-effective conversations, and avoids the pitfalls of tokenization overload, slow retrieval, and hallucinated responses.

In sum, an AI isn't coming to replace your database. It needs its help.

... For now...

A suspicious robot smile
*Trustworthy smile*