From Rust to Reality: The Hidden Journey of fetch_max

How a Job Interview Sent Me Down a Compiler Rabbit Hole

I occasionally interview candidates for engineering roles. We need people who understand concurrent programming. One of our favorite questions involves keeping track of a maximum value across multiple producer threads - a classic pattern that appears in many real-world systems.

Candidates can use any language they want. In Java (the language I know best), you might write a CAS loop, or if you're feeling functional, use updateAndGet() with a lambda:

AtomicLong highScore = new AtomicLong(100);
[...]
highScore.updateAndGet(current -> Math.max(current, newScore));

But that lambda is doing work - it's still looping under the hood, retrying if another thread interferes. You can see the loop right in AtomicLong's source code.

Then one candidate chose Rust.

I was following along as he started typing, expecting to see either an explicit CAS loop or some functional wrapper around one. But instead, he just wrote:

high_score.fetch_max(new_score, Ordering::Relaxed);

"Rust has fetch_max built in," he explained casually, moving on to the next part of the problem.

Hold on. This wasn't a wrapper around a loop pattern - this was a first-class atomic operation, sitting right there next to fetch_add and fetch_or. Java doesn't have this. C++ doesn't have this. How could Rust just... have this?

After the interview, curiosity got the better of me. Why would Rust provide fetch_max as a built-in intrinsic? Intrinsics usually exist to leverage specific hardware instructions. But x86-64 doesn't have an atomic max instruction. So there had to be a CAS loop somewhere in the pipeline. Unless... maybe some architectures do have this instruction natively? And if so, how does the same Rust code work on both?

I had to find out. Was the loop in Rust's standard library? Was it in LLVM? Was it generated during code generation for x86-64?

So I started digging. What I found was a fascinating journey through five distinct layers of compiler transformations, each one peeling back another level of abstraction, until I found exactly where that loop materialized. Let me share what I discovered.

Layer 1: The Rust Code

Let's start with what that candidate wrote - a simple high score tracker that can be safely updated from multiple threads:

use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    let high_score = AtomicU64::new(100);
    // [...]
    // Another thread reports a new score of 200
    let _old_score = high_score.fetch_max(200, Ordering::Relaxed);
    // [...]
}
// Save this snippet as `main.rs`; we are going to use it later.

This single line does exactly what it promises: atomically fetches the current value, compares it with the new one, updates it if the new value is greater, and returns the old value. It's safe, concise, and impossible to mess up. No explicit loops, no retry logic visible anywhere. But how does it actually work under the hood?
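Before we answer that, it's worth seeing the guarantee in action. Here's a minimal multi-producer sketch of my own (not from the interview): several threads race to report scores, and fetch_max makes sure the largest one wins:

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn main() {
    let high_score = AtomicU64::new(100);

    thread::scope(|s| {
        for score in [150u64, 90, 200, 175] {
            let high_score = &high_score;
            // Each producer publishes its score; no locks, no data races.
            s.spawn(move || {
                high_score.fetch_max(score, Ordering::Relaxed);
            });
        }
    });

    // Whatever the interleaving, the maximum always survives.
    assert_eq!(high_score.load(Ordering::Relaxed), 200);
}

No matter how the threads interleave, the final value is the maximum. Now, back to the question.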

Layer 2: The Macro Expansion

Before our fetch_max call even reaches anywhere close to machine code generation, there's another layer of abstraction at work. The fetch_max method isn't hand-written for each atomic type - it's generated by a Rust macro called atomic_int!.

If we peek into Rust's standard library source code, we find that AtomicU64 and all its methods are actually created by this macro:

atomic_int! {
    cfg(target_has_atomic = "64"),
    // ... various configuration attributes ...
    atomic_umin, atomic_umax, // The intrinsics to use
    8, // Alignment
    u64 AtomicU64 // The type to generate
}

Inside this macro, fetch_max is defined as a template that works for any integer type:

pub fn fetch_max(&self, val: $int_type, order: Ordering) -> $int_type {
    // SAFETY: data races are prevented by atomic intrinsics.
    unsafe { $max_fn(self.v.get(), val, order) }
}

The $max_fn placeholder gets replaced with atomic_umax for unsigned types and atomic_max for signed types. This single macro definition generates fetch_max methods for AtomicI8, AtomicU8, AtomicI16, AtomicU16, and so on - all the way up to AtomicU128.
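If declarative macros are new to you, here is a toy sketch of the same code-generation trick (purely illustrative - the real atomic_int! is far more involved): one template stamps out a concrete function per integer type:

macro_rules! make_max_fn {
    ($fn_name:ident, $int_type:ty) => {
        // Each invocation below expands into a complete function definition.
        fn $fn_name(current: $int_type, val: $int_type) -> $int_type {
            if current > val { current } else { val }
        }
    };
}

make_max_fn!(max_u8, u8);
make_max_fn!(max_u64, u64);

fn main() {
    assert_eq!(max_u8(3, 7), 7);
    assert_eq!(max_u64(100, 200), 200);
}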

So our simple fetch_max call is actually invoking generated code. But what does the atomic_umax function actually do? To answer that, we need to see what the Rust compiler produces next.

Layer 3: LLVM IR

Now that we know fetch_max is macro-generated code calling atomic_umax, let's see what happens when the Rust compiler processes it. The compiler doesn't go straight to assembly. First, it translates the code into an intermediate representation. Rust uses the LLVM compiler project, so it generates LLVM Intermediate Representation (IR).

If we peek at the LLVM IR for our fetch_max call, we see something like this:

; Before the transformation
bb7:
  %0 = atomicrmw umax ptr %self, i64 %val monotonic, align 8
  ...

This is LLVM's language for saying: "I need an atomic read-modify-write operation. The modification I want to perform is an unsigned maximum."

This is a powerful, high-level instruction within the compiler itself. But it poses a critical question: does the CPU actually have a single instruction called umax? For most architectures, the answer is no. So how does the compiler bridge this gap?

How to See This Yourself

My goal is not to merely describe what is happening, but to give you the tools to see it for yourself. You can trace this transformation step-by-step on your own machine.

First, tell the Rust compiler to stop after generating the LLVM IR:

rustc --emit=llvm-ir main.rs

This creates a main.ll file. This file contains the LLVM IR representation of your Rust code, including our atomicrmw umax instruction. Keep the file around; we'll use it in the next steps.
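The file is fairly large; to confirm the instruction made it in, a plain text search works (any grep will do):

grep -n "atomicrmw" main.ll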

Interlude: Compiler Intrinsics

We're missing something important. How does the Rust function atomic_umax actually become the LLVM instruction atomicrmw umax? This is where compiler intrinsics come into play.

If you dig into Rust's source code, you'll find that atomic_umax is defined like this:

/// Updates `*dst` to the max value of `val` and the old value (unsigned comparison)
#[inline]
#[cfg(target_has_atomic)]
#[cfg_attr(miri, track_caller)] // even without panics, this helps for Miri backtraces
unsafe fn atomic_umax<T: Copy>(dst: *mut T, val: T, order: Ordering) -> T {
    // SAFETY: the caller must uphold the safety contract for `atomic_umax`
    unsafe {
        match order {
            Relaxed => intrinsics::atomic_umax::<T, { AO::Relaxed }>(dst, val),
            Acquire => intrinsics::atomic_umax::<T, { AO::Acquire }>(dst, val),
            Release => intrinsics::atomic_umax::<T, { AO::Release }>(dst, val),
            AcqRel => intrinsics::atomic_umax::<T, { AO::AcqRel }>(dst, val),
            SeqCst => intrinsics::atomic_umax::<T, { AO::SeqCst }>(dst, val),
        }
    }
}

But what is this intrinsics::atomic_umax function? If you look at its definition, you find something slightly unusual:

/// Maximum with the current value using an unsigned comparison.
/// `T` must be an unsigned integer type.
///
/// The stabilized version of this intrinsic is available on the
/// [`atomic`] unsigned integer types via the `fetch_max` method. For example, [`AtomicU32::fetch_max`].
#[rustc_intrinsic]
#[rustc_nounwind]
pub unsafe fn atomic_umax<T: Copy, const ORD: AtomicOrdering>(dst: *mut T, src: T) -> T;

There is no body. This is a declaration, not a definition. The #[rustc_intrinsic] attribute tells the Rust compiler that this function maps directly to a low-level operation understood by the compiler itself. When the Rust compiler sees a call to intrinsics::atomic_umax, it knows to replace it with the corresponding low-level LLVM operation - in our case, the atomicrmw umax instruction.

So our journey actually looks like this:

  1. fetch_max method (user-facing API)
  2. Macro expands to call atomic_umax function
  3. atomic_umax is a compiler intrinsic
  4. Rustc replaces the intrinsic with LLVM's atomicrmw umax ← We are here
  5. LLVM processes this instruction...

Layer 4: The Transformation

LLVM runs a series of "passes" that analyze and transform the code. The one we're interested in is called the AtomicExpandPass.

Its job is to look at high-level atomic operations like atomicrmw umax and ask the target architecture, "Can you do this natively?"

When the x86-64 backend says "No, I can't," this pass expands the single instruction into a sequence of more fundamental ones that the CPU does understand. The result is a compare-and-swap (CAS) loop.

We can see this transformation in action by asking LLVM to emit the intermediate representation before and after this pass. To see the IR before the AtomicExpandPass, run:

llc -print-before=atomic-expand main.ll -o /dev/null

Tip: If you do not have llc installed, you can ask rustc to run the pass for you directly:

rustc -C llvm-args="-print-before=atomic-expand -print-after=atomic-expand" main.rs

The code will be printed to your terminal. The function containing our atomic max looks like this:

*** IR Dump Before Expand Atomic instructions (atomic-expand) ***
; Function Attrs: inlinehint nonlazybind uwtable
define internal i64 @_ZN4core4sync6atomic9AtomicU649fetch_max17h6c42d6f2fc1a6124E(ptr align 8 %self, i64 %val, i8 %0) unnamed_addr #1 {
start:
  %_0 = alloca [8 x i8], align 8
  %order = alloca [1 x i8], align 1
  store i8 %0, ptr %order, align 1
  %1 = load i8, ptr %order, align 1
  %_7 = zext i8 %1 to i64
  switch i64 %_7, label %bb2 [
    i64 0, label %bb7
    i64 1, label %bb5
    i64 2, label %bb6
    i64 3, label %bb4
    i64 4, label %bb3
  ]

bb2:                                   ; preds = %start
  unreachable

bb7:                                   ; preds = %start
  %2 = atomicrmw umax ptr %self, i64 %val monotonic, align 8
  store i64 %2, ptr %_0, align 8
  br label %bb1

bb5:                                   ; preds = %start
  %3 = atomicrmw umax ptr %self, i64 %val release, align 8
  store i64 %3, ptr %_0, align 8
  br label %bb1

bb6:                                   ; preds = %start
  %4 = atomicrmw umax ptr %self, i64 %val acquire, align 8
  store i64 %4, ptr %_0, align 8
  br label %bb1

bb4:                                   ; preds = %start
  %5 = atomicrmw umax ptr %self, i64 %val acq_rel, align 8
  store i64 %5, ptr %_0, align 8
  br label %bb1

bb3:                                   ; preds = %start
  %6 = atomicrmw umax ptr %self, i64 %val seq_cst, align 8
  store i64 %6, ptr %_0, align 8
  br label %bb1

bb1:                                   ; preds = %bb3, %bb4, %bb6, %bb5, %bb7
  %7 = load i64, ptr %_0, align 8
  ret i64 %7
}

You can see the atomicrmw umax instruction in multiple places, depending on the memory ordering specified. This is the high-level atomic operation that the compiler backend understands, but the CPU does not.

Now run the same command again, this time asking for the IR after the pass:

llc -print-after=atomic-expand main.ll -o /dev/null

This is the relevant part of the output:

*** IR Dump After Expand Atomic instructions (atomic-expand) ***
; Function Attrs: inlinehint nonlazybind uwtable
define internal i64 @_ZN4core4sync6atomic9AtomicU649fetch_max17h6c42d6f2fc1a6124E(ptr align 8 %self, i64 %val, i8 %0) unnamed_addr #1 {
start:
  %_0 = alloca [8 x i8], align 8
  %order = alloca [1 x i8], align 1
  store i8 %0, ptr %order, align 1
  %1 = load i8, ptr %order, align 1
  %_7 = zext i8 %1 to i64
  switch i64 %_7, label %bb2 [
    i64 0, label %bb7
    i64 1, label %bb5
    i64 2, label %bb6
    i64 3, label %bb4
    i64 4, label %bb3
  ]

bb2:                                   ; preds = %start
  unreachable

bb7:                                   ; preds = %start
  %2 = load i64, ptr %self, align 8    ; seed expected value
  br label %atomicrmw.start            ; enter CAS loop

atomicrmw.start:                       ; preds = %atomicrmw.start, %bb7
  ; on the first iteration use %2, on retries use the value observed by the last cmpxchg
  %loaded = phi i64 [ %2, %bb7 ], [ %newloaded, %atomicrmw.start ]
  %3 = icmp ugt i64 %loaded, %val      ; unsigned compare (umax semantics)
  %new = select i1 %3, i64 %loaded, i64 %val ; desired = max(loaded, val)
  ; CAS: if *self == loaded, store new
  %4 = cmpxchg ptr %self, i64 %loaded, i64 %new monotonic monotonic, align 8
  %success = extractvalue { i64, i1 } %4, 1 ; boolean: whether the swap happened
  %newloaded = extractvalue { i64, i1 } %4, 0 ; value seen in memory before the CAS
  br i1 %success, label %atomicrmw.end, label %atomicrmw.start ; loop until CAS succeeds

atomicrmw.end:                         ; preds = %atomicrmw.start
  store i64 %newloaded, ptr %_0, align 8
  br label %bb1

[... MORE OF THE SAME, JUST FOR DIFFERENT ORDERINGS ...]

bb1:                                   ; preds = %bb3, %bb4, %bb6, %bb5, %bb7
  %7 = load i64, ptr %_0, align 8
  ret i64 %7
}

We can see that the pass did not change the first part - it still has the code to dispatch based on the memory ordering. But in the bb7 block, where we originally had the atomicrmw umax LLVM instruction, we now see a full compare-and-swap loop. A compiler engineer would say that the atomicrmw umax instruction has been "lowered" into a sequence of more primitive operations that are closer to what the hardware can actually execute.

Here's the simplified logic:

  1. Read (seed): grab the current value (expected).
  2. Compute: desired = umax(expected, val).
  3. Attempt: observed, success = cmpxchg(ptr, expected, desired, [...]).
  4. If success, return observed (the old value). Otherwise set expected = observed and loop.

This CAS loop is a fundamental pattern in lock-free programming. The compiler just built it for us automatically.
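For comparison, here is what that same loop looks like if you write it by hand in safe Rust - a minimal sketch of my own (fetch_max_by_hand is a hypothetical helper, not library code) that mirrors the four steps above using compare_exchange_weak:

use std::sync::atomic::{AtomicU64, Ordering};

// Hand-rolled equivalent of atom.fetch_max(val, Ordering::Relaxed).
fn fetch_max_by_hand(atom: &AtomicU64, val: u64) -> u64 {
    // 1. Read (seed): grab the current value as our first `expected`.
    let mut expected = atom.load(Ordering::Relaxed);
    loop {
        // 2. Compute: desired = max(expected, val).
        let desired = expected.max(val);
        // 3. Attempt: swap only if memory still holds `expected`.
        match atom.compare_exchange_weak(expected, desired, Ordering::Relaxed, Ordering::Relaxed) {
            // 4. Success: return the old value, just like fetch_max does.
            Ok(old) => return old,
            // Failure: another thread got there first; retry with the observed value.
            Err(observed) => expected = observed,
        }
    }
}

This is exactly the kind of boilerplate fetch_max saves you from writing - and from getting subtly wrong.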

Layer 5: The Final Product (x86-64 Assembly)

We're at the final step. To see the final machine code, you can tell rustc to emit the assembly directly:

rustc --emit=asm main.rs

This will produce a main.s file containing the final assembly code. Inside, you'll find the result of the cmpxchg loop:

.LBB8_2:
    movq    -32(%rsp), %rax       # rax = &self
    movq    (%rax), %rax          # rax = *self (seed 'expected')
    movq    %rax, -48(%rsp)       # spill expected to stack
.LBB8_3:                          # loop head
    movq    -48(%rsp), %rax       # rax = expected
    movq    -32(%rsp), %rcx       # rcx = &self
    movq    -40(%rsp), %rdx       # rdx = val
    movq    %rax, %rsi            # rsi = expected (scratch)
    subq    %rdx, %rsi            # set flags for unsigned compare: expected - val
    cmovaq  %rax, %rdx            # if (expected > val) rdx = expected; else rdx = val (compute max)
    lock cmpxchgq %rdx, (%rcx)    # CAS: if *rcx == rax then *rcx = rdx; rax <- old *rcx; ZF = success
    sete    %cl                   # cl = success
    movq    %rax, -56(%rsp)       # spill observed to stack
    testb   $1, %cl               # branch on success
    movq    %rax, -48(%rsp)       # expected = observed (for retry)
    jne     .LBB8_4               # success -> exit
    jmp     .LBB8_3               # failure -> retry

The syntax might look a bit different from what you're used to; that's because it's AT&T syntax, the default for rustc. If you prefer Intel syntax, you can use rustc --emit=asm main.rs -C "llvm-args=-x86-asm-syntax=intel" to get that.

I'm not an assembly expert, but you can see the key parts of the CAS loop here:

  • Seed read (first iteration): Load *self once to initialize the expected value.
  • Compute umax without branching: The pair sub + cmova implements desired = max_u(expected, val).
  • CAS operation: On x86-64, cmpxchg uses RAX as the expected value and returns the observed value in RAX; ZF encodes success.
  • Retry or finish: If ZF is clear, we failed and need to retry. Otherwise, we are done.

Note we did not ask rustc to optimize the code. If we did, the compiler would generate more efficient assembly: No spills to the stack, fewer jumps, no dispatch on memory ordering, etc. But I wanted to keep the output as close to the original IR as possible to make it easier to follow.
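If you want to compare for yourself, rustc's -O flag (shorthand for -C opt-level=2) produces the optimized version:

rustc -O --emit=asm main.rs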

The Beauty of Abstraction

And there we have it. Our journey is complete. We started with a safe, clear, single line of Rust and ended with a CAS loop written in assembly language.

Rust fetch_max → Macro-generated atomic_umax → LLVM atomicrmw umax → LLVM cmpxchg loop → Assembly lock cmpxchg loop

This journey is a perfect example of the power of modern compilers. We get to work at a high level of abstraction, focusing on safety and logic, while the compiler handles the messy, error-prone, and incredibly complex task of generating correct and efficient code for the hardware.

So, next time you use an atomic, take a moment to appreciate the incredible, hidden journey your code is about to take.

PS: After going on this journey, I learned that C++26 adds fetch_max too!

PPS: We are hiring!

Bonus: Apple Silicon (AArch64)

Out of curiosity, I also checked how this looks on Apple Silicon (AArch64). This architecture does have a native atomic max instruction, so the AtomicExpandPass does not need to lower it into a CAS loop. The LLVM IR before and after the pass is identical, still containing the atomicrmw umax instruction.
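If you are not on a Mac, you can still reproduce this by cross-compiling; since we only emit assembly and never link, installing the target's standard library via rustup is enough:

rustup target add aarch64-apple-darwin
rustc --emit=asm --target=aarch64-apple-darwin main.rs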

The final assembly contains a variant of the LDUMAX instruction. This is the relevant part of the assembly:

ldr x8, [sp, #16] # x8 = value to compare with
ldr x9, [sp, #8] # x9 = pointer to the atomic variable
ldumax x8, x8, [x9] # atomic unsigned max (relaxed), [x9] = max(x8, [x9]), x8 = old value
str x8, [sp, #40] # Store old value
b LBB8_11

Note that AArch64 uses Unified Assembler Language; when reading the snippet above, it's important to remember that the destination register comes first.

And that's really it. We could continue to dig into the microarchitecture, to see how instructions are executed at the hardware level, what are the effects of the LOCK prefix, dive into differences in memory ordering, etc. But we'll leave that for another day.

Alice: "Would you tell me, please, which way I ought to go from here?"
The Cat: "That depends a good deal on where you want to get to."
Alice: "I don't much care where."
The Cat: "Then it doesn't much matter which way you go."
Alice: "...So long as I get somewhere."
The Cat: "Oh, you're sure to do that, if only you walk long enough."

- Lewis Carroll, Alice's Adventures in Wonderland
