From Rust to Reality: The Hidden Journey of fetch_max
How a Job Interview Sent Me Down a Compiler Rabbit Hole
I occasionally interview candidates for engineering roles. We need people who understand concurrent programming. One of our favorite questions involves keeping track of a maximum value across multiple producer threads - a classic pattern that appears in many real-world systems.
Candidates can use any language they want.
In Java (the language I know best), you might write a CAS loop, or if you're feeling functional, use updateAndGet() with a lambda:

AtomicLong highScore = new AtomicLong(100);
[...]
highScore.updateAndGet(current -> Math.max(current, newScore));
But that lambda is doing work - it's still looping under the hood, retrying if another thread interferes. You can see the loop right in AtomicLong's source code.
Then one candidate chose Rust.
I was following along as he started typing, expecting to see either an explicit CAS loop or some functional wrapper around one. But instead, he just wrote:
high_score.fetch_max(new_score, Ordering::Relaxed);
"Rust has fetch_max built in," he explained casually, moving on to the next part of the problem.
Hold on. This wasn't a wrapper around a loop pattern - this was a first-class atomic operation, sitting right there next to fetch_add and fetch_or. Java doesn't have this. C++ doesn't have this. How could Rust just... have this?
After the interview, curiosity got the better of me. Why would Rust provide fetch_max as a built-in intrinsic? Intrinsics usually exist to leverage specific hardware instructions. But x86-64 doesn't have an atomic max instruction. So there had to be a CAS loop somewhere in the pipeline. Unless... maybe some architectures do have this instruction natively? And if so, how does the same Rust code work on both?
I had to find out. Was the loop in Rust's standard library? Was it in LLVM? Was it generated during code generation for x86-64?
So I started digging. What I found was a fascinating journey through five distinct layers of compiler transformations, each one peeling back another level of abstraction, until I found exactly where that loop materialized. Let me share what I discovered.
Layer 1: The Rust Code
Let's start with what that candidate wrote - a simple high score tracker that can be safely updated from multiple threads:
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    let high_score = AtomicU64::new(100);
    // [...]
    // Another thread reports a new score of 200
    let _old_score = high_score.fetch_max(200, Ordering::Relaxed);
    // [...]
}
// Save this snippet as `main.rs`; we are going to use it later.
This single line does exactly what it promises: atomically fetches the current value, compares it with the new one, updates it if the new value is greater, and returns the old value. It's safe, concise, and impossible to mess up. No explicit loops, no retry logic visible anywhere. But how does it actually work under the hood?
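For contrast, here is what the "functional wrapper" route looks like in Rust - fetch_update, the closest analog to Java's updateAndGet. This is a minimal sketch of my own, not part of the candidate's solution:

use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    let high_score = AtomicU64::new(100);
    // fetch_update runs a CAS loop under the hood, retrying the closure
    // until the compare-and-swap succeeds; Ok carries the previous value.
    let _old_score = high_score
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |current| {
            Some(current.max(200))
        })
        .unwrap(); // the closure always returns Some, so this cannot fail
}

Just like the Java lambda, that closure may run several times under contention. fetch_max promises the same result with no visible loop at all.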
Layer 2: The Macro Expansion
Before our fetch_max call even reaches anywhere close to machine code generation, there's another layer of abstraction at work. The fetch_max method isn't hand-written for each atomic type - it's generated by a Rust macro called atomic_int!.

If we peek into Rust's standard library source code, we find that AtomicU64 and all its methods are actually created by this macro:
atomic_int! {
    cfg(target_has_atomic = "64"),
    // ... various configuration attributes ...
    atomic_umin, atomic_umax, // The intrinsics to use
    8, // Alignment
    u64 AtomicU64 // The type to generate
}
Inside this macro, fetch_max is defined as a template that works for any integer type:

pub fn fetch_max(&self, val: $int_type, order: Ordering) -> $int_type {
    // SAFETY: data races are prevented by atomic intrinsics.
    unsafe { $max_fn(self.v.get(), val, order) }
}
The $max_fn placeholder gets replaced with atomic_umax for unsigned types and atomic_max for signed types. This single macro definition generates fetch_max methods for AtomicI8, AtomicU8, AtomicI16, AtomicU16, and so on - all the way up to AtomicU128.
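The signed/unsigned distinction is easy to observe from safe code. Here's a small sketch of my own (not from the standard library) exercising both comparisons:

use std::sync::atomic::{AtomicI32, AtomicU32, Ordering};

fn main() {
    // Signed type: the macro plugs in atomic_max, a signed comparison.
    let signed = AtomicI32::new(-1);
    signed.fetch_max(-5, Ordering::Relaxed); // -1 > -5 under signed compare
    assert_eq!(signed.load(Ordering::Relaxed), -1);

    // Unsigned type: the macro plugs in atomic_umax, an unsigned comparison.
    let unsigned = AtomicU32::new(5);
    unsigned.fetch_max(u32::MAX, Ordering::Relaxed); // u32::MAX always wins
    assert_eq!(unsigned.load(Ordering::Relaxed), u32::MAX);
}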
So our simple fetch_max call is actually invoking generated code. But what does the atomic_umax function actually do? To answer that, we need to see what the Rust compiler produces next.
Layer 3: LLVM IR
Now that we know fetch_max is macro-generated code calling atomic_umax, let's see what happens when the Rust compiler processes it. The compiler doesn't go straight to assembly. First, it translates the code into an intermediate representation. Rust uses the LLVM compiler project, so it generates LLVM Intermediate Representation (IR).

If we peek at the LLVM IR for our fetch_max call, we see something like this:
; Before the transformation
bb7:
  %0 = atomicrmw umax ptr %self, i64 %val monotonic, align 8
  ...
This is LLVM's language for saying: "I need an atomic read-modify-write operation. The modification I want to perform is an unsigned maximum."
This is a powerful, high-level instruction within the compiler itself. But it poses a critical question: does the CPU actually have a single instruction called umax? For most architectures, the answer is no. So how does the compiler bridge this gap?
How to See This Yourself
My goal is not merely to describe what is happening, but to give you the tools to see it for yourself. You can trace this transformation step-by-step on your own machine.
First, tell the Rust compiler to stop after generating the LLVM IR:
rustc --emit=llvm-ir main.rs
This creates a main.ll file. This file contains the LLVM IR representation of your Rust code, including our atomicrmw umax instruction. Keep the file around; we'll use it in the next steps.
Interlude: Compiler Intrinsics
We're missing something important. How does the Rust function atomic_umax actually become the LLVM instruction atomicrmw umax? This is where compiler intrinsics come into play.

If you dig into Rust's source code, you'll find that atomic_umax is defined like this:
/// Updates `*dst` to the max value of `val` and the old value (unsigned comparison)
#[inline]
#[cfg(target_has_atomic)]
#[cfg_attr(miri, track_caller)] // even without panics, this helps for Miri backtraces
unsafe fn atomic_umax<T: Copy>(dst: *mut T, val: T, order: Ordering) -> T {
    // SAFETY: the caller must uphold the safety contract for `atomic_umax`
    unsafe {
        match order {
            Relaxed => intrinsics::atomic_umax::<T, { AO::Relaxed }>(dst, val),
            Acquire => intrinsics::atomic_umax::<T, { AO::Acquire }>(dst, val),
            Release => intrinsics::atomic_umax::<T, { AO::Release }>(dst, val),
            AcqRel => intrinsics::atomic_umax::<T, { AO::AcqRel }>(dst, val),
            SeqCst => intrinsics::atomic_umax::<T, { AO::SeqCst }>(dst, val),
        }
    }
}
But what is this intrinsics::atomic_umax function? If you look at its definition, you find something slightly unusual:
/// Maximum with the current value using an unsigned comparison.
/// `T` must be an unsigned integer type.
///
/// The stabilized version of this intrinsic is available on the
/// [`atomic`] unsigned integer types via the `fetch_max` method. For example, [`AtomicU32::fetch_max`].
#[rustc_intrinsic]
#[rustc_nounwind]
pub unsafe fn atomic_umax<T: Copy, const ORD: AtomicOrdering>(dst: *mut T, src: T) -> T;
There is no body. This is a declaration, not a definition. The #[rustc_intrinsic] attribute tells the Rust compiler that this function maps directly to a low-level operation understood by the compiler itself. When the Rust compiler sees a call to intrinsics::atomic_umax, it knows to replace it with the corresponding LLVM instruction.
So our journey actually looks like this:

- fetch_max method (user-facing API)
- The macro expands it to a call to the atomic_umax function
- atomic_umax is a compiler intrinsic
- rustc replaces the intrinsic with LLVM's atomicrmw umax ← We are here
- LLVM processes this instruction...
Layer 4: The Transformation
LLVM runs a series of "passes" that analyze and transform the code. The one we're interested in is called the AtomicExpandPass. Its job is to look at high-level atomic operations like atomicrmw umax and ask the target architecture, "Can you do this natively?"

When the x86-64 backend says "No, I can't," this pass expands the single instruction into a sequence of more fundamental ones that the CPU does understand. The result is a compare-and-swap (CAS) loop.
We can see this transformation in action by asking LLVM to emit the intermediate representation before and after this pass. To see the IR before the AtomicExpandPass, run:
llc -print-before=atomic-expand main.ll -o /dev/null
Tip: If you do not have llc installed, you can ask rustc to run the pass for you directly:

rustc -C llvm-args="-print-before=atomic-expand -print-after=atomic-expand" main.rs
The code will be printed to your terminal. The function containing our atomic max looks like this:
*** IR Dump Before Expand Atomic instructions (atomic-expand) ***
; Function Attrs: inlinehint nonlazybind uwtable
define internal i64 @_ZN4core4sync6atomic9AtomicU649fetch_max17h6c42d6f2fc1a6124E(ptr align 8 %self, i64 %val, i8 %0) unnamed_addr #1 {
start:
  %_0 = alloca [8 x i8], align 8
  %order = alloca [1 x i8], align 1
  store i8 %0, ptr %order, align 1
  %1 = load i8, ptr %order, align 1
  %_7 = zext i8 %1 to i64
  switch i64 %_7, label %bb2 [
    i64 0, label %bb7
    i64 1, label %bb5
    i64 2, label %bb6
    i64 3, label %bb4
    i64 4, label %bb3
  ]

bb2:                                              ; preds = %start
  unreachable

bb7:                                              ; preds = %start
  %2 = atomicrmw umax ptr %self, i64 %val monotonic, align 8
  store i64 %2, ptr %_0, align 8
  br label %bb1

bb5:                                              ; preds = %start
  %3 = atomicrmw umax ptr %self, i64 %val release, align 8
  store i64 %3, ptr %_0, align 8
  br label %bb1

bb6:                                              ; preds = %start
  %4 = atomicrmw umax ptr %self, i64 %val acquire, align 8
  store i64 %4, ptr %_0, align 8
  br label %bb1

bb4:                                              ; preds = %start
  %5 = atomicrmw umax ptr %self, i64 %val acq_rel, align 8
  store i64 %5, ptr %_0, align 8
  br label %bb1

bb3:                                              ; preds = %start
  %6 = atomicrmw umax ptr %self, i64 %val seq_cst, align 8
  store i64 %6, ptr %_0, align 8
  br label %bb1

bb1:                                              ; preds = %bb3, %bb4, %bb6, %bb5, %bb7
  %7 = load i64, ptr %_0, align 8
  ret i64 %7
}
You can see the atomicrmw umax instruction in multiple places, depending on the memory ordering specified. This is the high-level atomic operation that the compiler backend understands, but the CPU does not.

Now ask for the IR after the pass:

llc -print-after=atomic-expand main.ll -o /dev/null
This is the relevant part of the output:
*** IR Dump After Expand Atomic instructions (atomic-expand) ***
; Function Attrs: inlinehint nonlazybind uwtable
define internal i64 @_ZN4core4sync6atomic9AtomicU649fetch_max17h6c42d6f2fc1a6124E(ptr align 8 %self, i64 %val, i8 %0) unnamed_addr #1 {
start:
  %_0 = alloca [8 x i8], align 8
  %order = alloca [1 x i8], align 1
  store i8 %0, ptr %order, align 1
  %1 = load i8, ptr %order, align 1
  %_7 = zext i8 %1 to i64
  switch i64 %_7, label %bb2 [
    i64 0, label %bb7
    i64 1, label %bb5
    i64 2, label %bb6
    i64 3, label %bb4
    i64 4, label %bb3
  ]

bb2:                                              ; preds = %start
  unreachable

bb7:                                              ; preds = %start
  %2 = load i64, ptr %self, align 8               ; seed expected value
  br label %atomicrmw.start                       ; enter CAS loop

atomicrmw.start:                                  ; preds = %atomicrmw.start, %bb7
  %loaded = phi i64 [ %2, %bb7 ], [ %newloaded, %atomicrmw.start ] ; on first iteration: use %2, on retries: use value observed by last cmpxchg
  %3 = icmp ugt i64 %loaded, %val                 ; unsigned compare (umax semantics)
  %new = select i1 %3, i64 %loaded, i64 %val      ; desired = max(loaded, val)
  %4 = cmpxchg ptr %self, i64 %loaded, i64 %new monotonic monotonic, align 8 ; CAS: if *self==loaded, store new
  %success = extractvalue { i64, i1 } %4, 1       ; boolean: whether the swap happened
  %newloaded = extractvalue { i64, i1 } %4, 0     ; value seen in memory before the CAS
  br i1 %success, label %atomicrmw.end, label %atomicrmw.start ; loop until CAS succeeds

atomicrmw.end:                                    ; preds = %atomicrmw.start
  store i64 %newloaded, ptr %_0, align 8
  br label %bb1

[... MORE OF THE SAME, JUST FOR DIFFERENT ORDERINGS ...]

bb1:                                              ; preds = %bb3, %bb4, %bb6, %bb5, %bb7
  %7 = load i64, ptr %_0, align 8
  ret i64 %7
}
We can see the pass did not change the first part - it still has the code to dispatch based on the memory ordering. But in the bb7 block, where we originally had the atomicrmw umax LLVM instruction, we now see a full compare-and-swap loop.

A compiler engineer would say that the atomicrmw umax instruction has been "lowered" into a sequence of more primitive operations that are closer to what the hardware can actually execute.
Here's the simplified logic:

- Read (seed): grab the current value (expected).
- Compute: desired = umax(expected, val).
- Attempt: observed, success = cmpxchg(ptr, expected, desired, [...]).
- If success, return observed (the old value). Otherwise, set expected = observed and loop.
This CAS loop is a fundamental pattern in lock-free programming - the compiler just built it for us automatically. Translated back into Rust, it looks roughly like the sketch below.
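A minimal sketch of that lowered loop, hand-written for the Relaxed ("monotonic") case - my reconstruction, not compiler output:

use std::sync::atomic::{AtomicU64, Ordering};

// What AtomicExpandPass produced, expressed in Rust: seed the expected
// value, compute the unsigned max, attempt a CAS, and retry on failure.
fn fetch_max_by_hand(atom: &AtomicU64, val: u64) -> u64 {
    let mut expected = atom.load(Ordering::Relaxed); // seed expected value
    loop {
        let desired = expected.max(val); // desired = umax(expected, val)
        match atom.compare_exchange(
            expected,
            desired,
            Ordering::Relaxed, // success ordering ("monotonic")
            Ordering::Relaxed, // failure ordering
        ) {
            Ok(old) => return old,                // swap happened: return old value
            Err(observed) => expected = observed, // like the phi node: retry with the observed value
        }
    }
}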
Layer 5: The Final Product (x86-64 Assembly)
We're at the final step. To see the final machine code, you can tell rustc to emit the assembly directly:
rustc --emit=asm main.rs
This will produce a main.s file containing the final assembly code. Inside, you'll find the result of the cmpxchg loop:
.LBB8_2:
	movq	-32(%rsp), %rax         # rax = &self
	movq	(%rax), %rax            # rax = *self (seed 'expected')
	movq	%rax, -48(%rsp)         # spill expected to stack
.LBB8_3:                                # loop head
	movq	-48(%rsp), %rax         # rax = expected
	movq	-32(%rsp), %rcx         # rcx = &self
	movq	-40(%rsp), %rdx         # rdx = val
	movq	%rax, %rsi              # rsi = expected (scratch)
	subq	%rdx, %rsi              # set flags for unsigned compare: expected - val
	cmovaq	%rax, %rdx              # if (expected > val) rdx = expected; else rdx = val (compute max)
	lock cmpxchgq	%rdx, (%rcx)    # CAS: if *rcx==rax then *rcx=rdx; rax <- old *rcx; ZF=success
	sete	%cl                     # cl = success
	movq	%rax, -56(%rsp)         # spill observed to stack
	testb	$1, %cl                 # branch on success
	movq	%rax, -48(%rsp)         # expected = observed (for retry)
	jne	.LBB8_4                 # success -> exit
	jmp	.LBB8_3                 # failure -> retry
The syntax might look a bit different from what you're used to; that's because it's AT&T syntax, the default for rustc. If you prefer Intel syntax, you can use rustc --emit=asm main.rs -C "llvm-args=-x86-asm-syntax=intel" to get it.
I'm not an assembly expert, but you can see the key parts of the CAS loop here:
- Seed read (first iteration): Load *self once to initialize the expected value.
- Compute umax without branching: The pair sub + cmova implements desired = max_u(expected, val).
- CAS operation: On x86-64, cmpxchg uses RAX as the expected value and returns the observed value in RAX; ZF encodes success.
- Retry or finish: If ZF is clear, we failed and need to retry. Otherwise, we are done.
Note that we did not ask rustc to optimize the code. If we did, the compiler would generate more efficient assembly: no spills to the stack, fewer jumps, no dispatch on memory ordering, and so on. But I wanted to keep the output as close to the original IR as possible to make it easier to follow.
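If you want to compare for yourself, rerun the command with rustc's standard -O flag to enable optimizations:

rustc -O --emit=asm main.rs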
The Beauty of Abstraction
And there we have it. Our journey is complete. We started with a safe, clear, single line of Rust and ended with a CAS loop written in assembly language.
Rust fetch_max → Macro-generated atomic_umax → LLVM atomicrmw umax → LLVM cmpxchg loop → Assembly lock cmpxchg loop
This journey is a perfect example of the power of modern compilers. We get to work at a high level of abstraction, focusing on safety and logic, while the compiler handles the messy, error-prone, and incredibly complex task of generating correct and efficient code for the hardware.
So, next time you use an atomic, take a moment to appreciate the incredible, hidden journey your code is about to take.
PS: After conducting this journey I learned that C++26 adds fetch_max too!
PPS: We are hiring!
Bonus: Apple Silicon (AArch64)
Out of curiosity, I also checked how this looks on Apple Silicon (AArch64). This architecture does have a native atomic max instruction, so the AtomicExpandPass does not need to lower it into a CAS loop. The LLVM code before and after the pass is identical, still containing the atomicrmw umax instruction.

The final assembly contains a variant of the LDUMAX instruction. This is the relevant part of the assembly:
	ldr	x8, [sp, #16]   # x8 = value to compare with
	ldr	x9, [sp, #8]    # x9 = pointer to the atomic variable
	ldumax	x8, x8, [x9]    # atomic unsigned max (relaxed): [x9] = max(x8, [x9]), x8 = old value
	str	x8, [sp, #40]   # store old value
	b	LBB8_11
Note that AArch64 uses Unified Assembler Language; when reading the snippet above, it's important to remember that the destination register comes first.
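If you are not on an ARM machine, you can still reproduce this by cross-compiling - a sketch, assuming you use rustup to install the target's standard library first (no linker is needed, since --emit=asm stops before linking):

rustup target add aarch64-apple-darwin
rustc --target aarch64-apple-darwin --emit=asm main.rs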
And that's really it. We could continue to dig into the microarchitecture to see how instructions are executed at the hardware level, what the effects of the LOCK prefix are, dive into differences in memory ordering, and so on. But we'll leave that for another day.
Alice: "Would you tell me, please, which way I ought to go from here?"
The Cat: "That depends a good deal on where you want to get to."
Alice: "I don't much care where."
The Cat: "Then it doesn't much matter which way you go."
Alice: "...So long as I get somewhere."
The Cat: "Oh, you're sure to do that, if only you walk long enough."- Lewis Carroll, Alice's Adventures in Wonderland