Multi-Armed Bandit Optimization in Trading

SUMMARY

Multi-armed bandit (MAB) optimization is a reinforcement learning framework used in algorithmic trading to dynamically allocate resources across multiple trading strategies while balancing exploration of new opportunities with exploitation of known profitable approaches. The name comes from the "one-armed bandit", a nickname for the casino slot machine: a player facing several machines with unknown reward distributions must decide which one to play next.

Core concepts and mathematical framework

The MAB problem in trading can be formalized mathematically as follows:

Let $\{a_1, ..., a_K\}$ be a set of $K$ trading strategies (arms). For each time step $t = 1, ..., T$:

Select a strategy $a_i$
Receive reward $r_t \sim R_i$, where $R_i$ is the unknown reward distribution of strategy $a_i$
Update the strategy selection policy based on the observed reward

The objective is to maximize the cumulative reward, which is equivalent to minimizing regret, the shortfall relative to always playing the best strategy:

$\text{Regret}(T) = \mu^* T - \sum_{t=1}^T r_t$

where $\mu^*$ is the expected per-step reward of the optimal strategy and $T$ is the time horizon.
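As a small illustration of the regret formula, the sketch below computes it two ways: from the rewards actually received, and from the expected rewards of the chosen strategies (often called pseudo-regret). The function names and the example numbers are hypothetical, and in live trading $\mu^*$ is not known, so this kind of calculation is mainly useful in simulation or backtesting where the true means are assumed.

```python
def realized_regret(mu_star, rewards):
    """Regret(T) = mu* T - sum_t r_t, using the rewards actually observed."""
    return mu_star * len(rewards) - sum(rewards)


def pseudo_regret(true_means, choices):
    """Regret measured on expected rewards, which removes reward noise.

    true_means: assumed expected per-step reward of each strategy.
    choices:    index of the strategy selected at each time step.
    """
    mu_star = max(true_means)  # expected reward of the best strategy
    return sum(mu_star - true_means[i] for i in choices)


# Hypothetical example: two strategies, four time steps.
true_means = [0.01, 0.03]
choices = [0, 0, 1, 1]
rewards = [0.02, -0.01, 0.04, 0.03]

print(realized_regret(max(true_means), rewards))
print(pseudo_regret(true_means, choices))
```

Realized regret can be negative over short horizons when the rewards happen to be lucky, while pseudo-regret never is.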

Next generation time-series database

QuestDB is an open-source time-series database optimized for market and heavy industry data. Built from scratch in Java and C++, it offers high-throughput ingestion and fast SQL queries with time-series extensions.

Common MAB algorithms in trading

Upper Confidence Bound (UCB)

At each step, the UCB algorithm (shown here in its UCB1 form) computes the following index for every strategy and selects the strategy with the highest value:

$\text{UCB}_i(t) = \bar{X}_i + \sqrt{\frac{2\ln(t)}{n_i}}$

where:

$\bar{X}_i$ is the mean reward for strategy i
$n_i$ is the number of times strategy i has been selected
$t$ is the total number of trials so far

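This formula balances exploitation (the first term, which favors strategies with a high observed mean reward) with exploration (the second term, which is largest for strategies that have been selected rarely and shrinks as $n_i$ grows).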
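The selection rule can be sketched in a few lines of Python. This is a minimal illustration rather than a production allocator: the class name `UCB1` and its methods are hypothetical, each strategy is tried once first so that $n_i > 0$, and rewards are assumed to be scaled to a comparable range (classically $[0, 1]$), since the size of the exploration bonus is only meaningful relative to the reward scale.

```python
import math


class UCB1:
    """Minimal UCB selector over n_arms strategies (illustrative only)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # n_i: times each strategy was selected
        self.means = [0.0] * n_arms  # running mean reward per strategy
        self.t = 0                   # total number of trials so far

    def select(self):
        # Try every strategy once so that n_i > 0 in the formula below.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        scores = [
            mean + math.sqrt(2.0 * math.log(self.t) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        # Incremental update of the mean reward for this strategy.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

A typical loop calls `select()`, runs the chosen strategy for one period, and passes the resulting reward to `update()`.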

Thompson Sampling

Thompson Sampling maintains a Bayesian posterior distribution over each strategy's expected return and repeats three steps:

For each strategy $i$, sample $\theta_i$ from its posterior distribution
Select the strategy with the highest sampled value
Update that strategy's posterior with the observed reward
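The three steps above can be sketched with a Beta-Bernoulli model. This is one common choice, not the only one: it assumes the reward is reduced to a binary outcome (for example, whether the strategy was profitable over the period), and a Gaussian posterior would be a more natural fit for continuous returns. The class name `BetaThompson` and the uniform Beta(1, 1) prior are illustrative assumptions.

```python
import random


class BetaThompson:
    """Thompson Sampling with a Beta posterior per strategy (illustrative)."""

    def __init__(self, n_arms, seed=None):
        # Beta(1, 1) is a uniform prior over each strategy's win probability.
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        # Step 1: sample theta_i from each strategy's posterior.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.alpha, self.beta)
        ]
        # Step 2: pick the strategy with the highest sampled value.
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, win):
        # Step 3: conjugate update of the selected strategy's posterior.
        if win:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0
```

Because selection is driven by random posterior samples, strategies with little data are still chosen occasionally, and exploration fades on its own as the posteriors narrow.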

Applications in trading systems

Strategy allocation

MAB optimization helps solve several key challenges in algorithmic trading:

Dynamic capital allocation across strategies
Adaptation to changing market conditions
Automated strategy selection and rotation

Risk management integration

The framework can be extended to account for risk, for example by penalizing each strategy's UCB index by its volatility:

$\text{UCB}_i^{\text{risk}}(t) = \text{UCB}_i(t) - \lambda \sigma_i$

where:

$\sigma_i$ is the strategy volatility
$\lambda$ is the risk aversion parameter
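A minimal sketch of the risk-adjusted index follows. It assumes $\sigma_i$ is estimated as the sample standard deviation of each strategy's observed rewards, which is only one possible volatility estimate; the function name `risk_adjusted_ucb` and its parameters are hypothetical, and every strategy is assumed to have at least one observation.

```python
import math
import statistics


def risk_adjusted_ucb(rewards_by_arm, t, risk_aversion):
    """Return UCB_i(t) - lambda * sigma_i for every strategy.

    rewards_by_arm: list of reward histories, one list per strategy.
    t:              total number of trials so far.
    risk_aversion:  lambda, the penalty applied per unit of volatility.
    """
    scores = []
    for rewards in rewards_by_arm:
        n = len(rewards)
        mean = statistics.fmean(rewards)
        # Assumption: volatility is the sample standard deviation,
        # set to zero when there is only a single observation.
        sigma = statistics.stdev(rewards) if n > 1 else 0.0
        bonus = math.sqrt(2.0 * math.log(t) / n)
        scores.append(mean + bonus - risk_aversion * sigma)
    return scores


# Two hypothetical strategies with the same mean but different volatility.
steady = [0.1, 0.1, 0.1, 0.1]
choppy = [0.4, -0.2, 0.4, -0.2]
print(risk_adjusted_ucb([steady, choppy], t=8, risk_aversion=1.0))
```

With `risk_aversion=0.0` the two strategies score the same; any positive value ranks the steadier strategy first.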

Performance considerations

Implementation challenges

Reward definition and normalization
Time horizon selection
Strategy correlation handling
Market regime adaptation

Monitoring and validation

Key performance metrics include:

Cumulative regret
Strategy selection diversity
Exploration/exploitation ratio
Risk-adjusted returns
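Some of these metrics can be computed directly from a log of selections and rewards. The definitions below are assumptions chosen for illustration, not standard ones: diversity is measured as the normalized Shannon entropy of selection frequencies, the exploration ratio as the share of steps where the chosen strategy differed from the one with the best observed mean at that time, and the risk-adjusted return as a simple mean-to-volatility ratio without annualization.

```python
import math
import statistics
from collections import Counter


def selection_entropy(choices, n_arms):
    """Normalized entropy of selections: 0 = one strategy, 1 = uniform."""
    total = len(choices)
    counts = Counter(choices)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_arms) if n_arms > 1 else 0.0


def exploration_ratio(choices, greedy_choices):
    """Share of steps where the selection differed from the greedy pick."""
    explored = sum(1 for c, g in zip(choices, greedy_choices) if c != g)
    return explored / len(choices)


def mean_to_volatility(rewards):
    """Mean reward divided by sample standard deviation (needs 2+ rewards)."""
    sd = statistics.stdev(rewards)
    return statistics.fmean(rewards) / sd if sd > 0 else float("nan")
```

A falling exploration ratio alongside flat cumulative regret suggests the allocator has settled on a good strategy; a falling ratio with rising regret may point to a regime change that it has not yet detected.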

Real-world considerations

The practical implementation of MAB in trading requires careful attention to:

Transaction costs and market impact
Strategy capacity constraints
Execution latency
Market microstructure effects

These factors can significantly impact the effectiveness of the MAB framework and must be incorporated into the reward calculation and strategy selection process.
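One simple way to fold trading frictions into the reward is to subtract an estimated cost from each period's gross return before passing it to the bandit. The sketch below assumes costs are proportional to turnover and quoted in basis points; the function name `net_reward` and the cost figures are hypothetical placeholders, and realistic market impact is usually nonlinear in trade size.

```python
def net_reward(gross_return, turnover, cost_bps, impact_bps=0.0):
    """Gross period return minus proportional trading frictions.

    gross_return: strategy return for the period, as a fraction (0.01 = 1%).
    turnover:     fraction of allocated capital traded during the period.
    cost_bps:     assumed commissions and spread cost, in basis points.
    impact_bps:   assumed market impact, in basis points.
    """
    friction = turnover * (cost_bps + impact_bps) / 10_000.0
    return gross_return - friction


# Hypothetical example: 1% gross return, full turnover, 10 bps of costs.
reward = net_reward(gross_return=0.01, turnover=1.0, cost_bps=10.0)
```

Feeding the net figure to the selection policy means a high-turnover strategy has to earn its costs back before it is preferred over a cheaper one.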

The success of MAB optimization in trading depends heavily on:

Quality of the strategy pool
Accuracy of reward measurement
Robustness of the exploration mechanism
Effectiveness of risk controls

By carefully considering these elements, traders can develop more adaptive and resilient trading systems that effectively balance the exploration-exploitation tradeoff inherent in strategy selection.