The mask that compiles to nothing: how HotSpot's JIT learned to reason about bits
When a developer types (x << 2) & -4, an optimizing compiler should compile it to just the shift,
x << 2: the bitwise AND should disappear. Why? Imagine we have an 8-bit number x = 1011 0111 and my expression looks like (x << 2) & -4;
x 1011 0111 (the original x)----------------------x << 2 1101 1100 (x after lshift, the 2 lowest bits are always 0s)& -4 1111 1100 (-4 in two's complement -> the mask only clears lowest 2 bits)----------------------result 1101 1100 (same as x << 2, the bitwise AND has no effect)
But how does the C2 compiler actually do this? The optimization relies on a general abstraction, built over multiple changesets. This post is my attempt to understand it and explain it.
What does the JIT actually know about your numbers?
When C2 is compiling a method, what does it know about the value in a given variable?
The answer is "a set of possible values." The compiler usually can't know the exact runtime value of
x (that's the whole point of a variable), but it can often prove that x is
constrained. If it can prove the constraint is tight enough, it can rewrite the code. A classic
example: if the compiler proves an array index is always in [0, length), it deletes
the bounds check. Constant folding, dead branch elimination, and other optimizations often come down to
"prove the set of possible values is small enough to act on."
C2 stores this "set of possible values" as a type. Don't imagine a Java types int or long, but a much
richer internal type that carries a range. For most of HotSpot's life, an integer type was
essentially:
[lo, hi] // the value is somewhere in this signed range, inclusive
So the type of x & 0xFF would be [0, 255], and the compiler could use that. This signed range is simple and useful,
but it is too weak for our purpose.
Consider x << 2. What's the range of x << 2 if x can be anything? Well... almost anything.
Shifting left can overflow, wrap around, produce huge positives and huge negatives. The tightest
signed interval covering the result is [Integer.MIN_VALUE, 2147483644]. It is a huge range, just shy
of the full int range, and it gives the compiler almost nothing useful.
But we know something very specific about x << 2: No matter what x is the bottom two bits are
always zero! A range can't express that. [lo, hi] can say
"the value is small" but it can't say "the value is even," let alone "the value is a multiple of
four." That knowledge lives in the bits, and the range alone cannot express it.
Enter known bits
The fix: alongside the range, also track what we know about each individual bit. For a 32-bit integer, imagine two extra 32-bit masks travelling with the type:
zeros: a1in position i means "bit i is definitely 0"ones: a1in position i means "bit i is definitely 1"
A bit that's unknown is 0 in both masks. A bit can't be both, so the invariant zeros & ones == 0
must always hold. In C2 it's a dozen lines of rangeinference.hpp:
template <class U>class KnownBits {static_assert(U(-1) > U(0), "bit info should be unsigned");public:U _zeros;U _ones;bool is_satisfied_by(U v) const {return (v & _zeros) == U(0) && (v & _ones) == _ones;}};
is_satisfied_by just checks whether a given number v is allowed by the masks. The masks must never
rule out a value that can actually occur, and C2's tests and internal checks use this to verify they
don't.
Every bit is now in one of three states: known-0, known-1, or unknown. Here's the type of x << 2, with . for unknown:
bit: 31 2 1 0x: . . . . . . . . . . . . . . . . . . . (x is fully unknown)x << 2: . . . . . . . . . . . . . . . . . 0 0 (low two bits: known zero!)
So x << 2 has zeros = 0b11 and everything else unknown.
The compiler now knows the number is a multiple of four.
This data structure isn't unique to HotSpot. LLVM's
KnownBits
is the closest match to C2's representation (a Zero and a One bitmask with the same
Zero & One == 0 invariant), and GCC tracks the same thing as the mask in its
conditional constant-propagation pass
(a value plus an uncertainty mask). It's the same idea: a cheap, non-relational, per-bit
abstraction of an integer. HotSpot got the foundations
(JDK-8315066, by Quan Anh Mai and Emanuel Peter) in
JDK 26, and the pieces that actually delete (x << 2) & -4 landed in JDK 27. Fun fact: Quan Anh Mai is merykitty, whose magic SWAR line
parser we took apart during 1BRC.
INFO
Non-relational: the abstraction tracks each bit on its own and never records relationships between bits (no "bit 3 is always equal to bit 5"). That independence is what makes it cheap: constant per-bit bookkeeping.
Two views that refine each other
We now have two different descriptions of the same number side by side: a range and a set of known bits. Each one knows things the other doesn't, and they can teach each other:
- The bits know the low two bits are zero. The range uses that: if it currently says
[5, 41], then 5, 6, 7, and 41 all have nonzero low bits and can't occur, so the range tightens to[8, 40], the nearest multiples of four inside it. - The range knows the value is in
[0, 200]. The bits use that: every number from 0 to 200 has its top 24 bits zero, so those aren't unknown bits at all; they're known-zero, and the bits record it.
This "combine two abstractions and let them refine each other" pattern has a name in the compiler
literature (a reduced product), and the function that makes the range and the bits
reconcile until they agree is called the reduction operator. In C2 it's a function with the very
down-to-earth name canonicalize_constraints(), and it runs every time a new integer type
is created. It's the heart of the whole feature.
The canonicalization dance
So how do you get the range and the bits to agree? You let them take turns refining each other
until neither has anything new to add. One clarification first: C2 carries both a signed and
an unsigned interval, and the reduction here runs on the unsigned one, the "simple interval"
[ulo, uhi], which never straddles the sign boundary, so the shared-prefix trick in step 1 is always
safe. The helper below works on one such simple interval; the outer canonicalize_constraints()
splits and recombines when the signed and unsigned views straddle the boundary. The loop:
-
Bits learn from the interval. Take the current unsigned interval
[ulo, uhi]. The high bits that are identical in bothuloanduhiare identical for every value in between (that's just how binary counting works: the leading digits don't change until you cross a power-of-two boundary). Those bits are now known. -
Interval learns from the bits. Take the current
ulo. If it violates the known bits (say the bits demand the low two are zero butulois0b...101), thenuloisn't actually a possible value, so bump it up to the smallest value ≥ulothat satisfies the bits. Do the mirror thing to pulluhidown. -
Go back to step 1 and see if the tighter interval reveals more known bits. Repeat until a full pass changes nothing.
That's the whole loop, straight from
canonicalize_constraints_simple:
adjust_unsigned_bounds_from_bits is step 2, adjust_bits_from_unsigned_bounds is step 1, and the
exit is _progress going false:
while (true) {canonicalized_bounds = adjust_unsigned_bounds_from_bits(canonicalized_bounds._result, canonicalized_bits._result);if (!canonicalized_bounds._progress || canonicalized_bounds.empty()) {return SimpleCanonicalResult<U>(canonicalized_bounds._present, canonicalized_bounds._result, canonicalized_bits._result);}canonicalized_bits = adjust_bits_from_unsigned_bounds(canonicalized_bits._result, canonicalized_bounds._result);if (!canonicalized_bits._progress || canonicalized_bits.empty()) {return SimpleCanonicalResult<U>(canonicalized_bits._present, canonicalized_bounds._result, canonicalized_bits._result);}}
Does this terminate? Yes. Each iteration that triggers another full round must turn at least one previously-unknown bit into a known bit, and a final bounds-only tightening may happen just before the loop exits. Given there are at most 64 bits, convergence is bounded by the word width.
Teaching AND to disappear
We also have to teach C2 to compute known bits through operations: every participating operation needs a rule for "given the known bits of my inputs, what are the known bits of my output?" These rules are simple and mechanical:
AND: out.ones = a.ones & b.ones // a bit is 1 only if it's 1 in BOTHout.zeros = a.zeros | b.zeros // a bit is 0 if it's 0 in EITHEROR: out.ones = a.ones | b.onesout.zeros = a.zeros & b.zerosSHL by k: shift both masks left by k, and OR (1<<k)-1 into zeros// the vacated low k bits are known-zero, exactly our x<<2 fact
INFO
Transfer function: the rule for a single operation that turns what we know about the inputs
into what we know about the output. The AND and SHL rules just above are transfer functions:
feed in the operands' known bits, get back the result's known bits.
The real code barely differs. The AND rule
(infer_and_impl):
U<CTP> zeros = st1._bits._zeros | st2._bits._zeros;U<CTP> ones = st1._bits._ones & st2._bits._ones;
and the left shift
(infer_lshift),
where pattern = (1 << k) - 1 is literally the "vacated low bits are zero" fact:
U<CTP> pattern = (U<CTP>(1) << masked_shift) - U<CTP>(1);U<CTP> known_one_bits = t1->_bits._ones << masked_shift;U<CTP> known_zero_bits = (t1->_bits._zeros << masked_shift) | pattern;
I learned that other compilers use similar techniques. The AND and shift rules we just saw are in
LLVM as KnownBits::operator&
and KnownBits::shl,
and GCC as bit_value_binop.
Quan Anh Mai brought this to the Value of C2's AND, OR, and XOR nodes in
JDK-8367341; the same patch adds the
OR
and XOR
rules right beside the AND (the XOR dual rides along, even though the issue is titled just
And / Or). Ashay Rane did the left-shift rule in the
JDK-8278857 that started this whole post.
With known bits flowing through the graph, the actual optimization, deleting the mask, becomes
almost a one-liner. When is x & y just y? When the AND can't possibly clear any bit that y
might have set. In known-bits terms: at every position where y could be 1, x must definitely
be 1. Here is the real code from C2,
AndINode::Identity,
the JDK-8380475 change (by Ashay Rane) that finally
deletes the mask.
Two bits of C2 vocabulary first: in(1) and in(2) are the AND node's two operands, and Identity
is the hook every node has for answering "am I redundant?" Returning one of the inputs tells the
compiler to delete this AND and rewire everything that used it to that input instead:
Node* AndINode::Identity(PhaseGVN* phase) {// x & x => xif (in(1) == in(2)) {return in(1);}const TypeInt* t1 = phase->type(in(1))->is_int();const TypeInt* t2 = phase->type(in(2))->is_int();if ((~t1->_bits._ones & ~t2->_bits._zeros) == 0) {// All bits that might be 0 in in1 are known to be 0 in in2return in(2);}if ((~t2->_bits._ones & ~t1->_bits._zeros) == 0) {// All bits that might be 0 in in2 are known to be 0 in in1return in(1);}return MulNode::Identity(phase);}
The two known-bits ifs are mirror images: the first collapses x & y to y, the second to x.
For (x << 2) & -4 it's the second that fires: x & y == x exactly when every bit y might clear
is already definitely zero in x. In known-bits terms ~y.ones is the set of bits y might clear (anything
not known-1), and ~x.zeros is the set of bits x might have set (anything not known-0), both
taken within the 32-bit lane; the identity holds when those two don't overlap:
x & y == x <=> (~y.ones & ~x.zeros) == 0
Let's have a look at a concrete example: (x << 2) & -4
The AndINode has the first argument (in(1)) set to a node representing x = x << 2 and the 2nd argument (in(2)) represents -4.
x = x << 2: zeros = ...0011 -> ~zeros = ...1100 (bits x might have set: all but the low two)y = -4: ones = ...1100 -> ~ones = ...0011 (bits y might clear: the low two)(~y.ones & ~x.zeros) == (...0011 & ...1100) == 0
-4 can only clear the low two bits, and x << 2 has those two definitely zero, so the AND changes
nothing and C2 returns in(1), which is x << 2. The & -4 node is removed from the graph.
The machine code contains only the shift and the and instruction is not emitted at all.
Let's see the effects of this machinery in action. Here is a simple - almost naive - test program: two one-line methods and a loop that calls them often enough to trigger C2 compilation:
public class ShiftMask {// Redundant: (x << 2) already has its low two bits zero, and -4 == 0xFFFFFFFC// clears exactly those two bits. C2 should delete the AND.static int redundant(int x) { return (x << 2) & -4; }// Not redundant: -8 == 0xFFFFFFF8 also clears bit 2, which C2 can't prove is zero.static int kept(int x) { return (x << 2) & -8; }public static void main(String[] args) {long s = 0;for (int i = 0; i < 400_000; i++) {s += redundant(i);s += kept(i);}System.out.println(s);}}
Compile it, then have C2 print the disassembly for just those two methods (this needs hsdis on the JDK's library path):
javac ShiftMask.javajava -XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation \-XX:CompileCommand='compileonly,ShiftMask::redundant' \-XX:CompileCommand='compileonly,ShiftMask::kept' \-XX:CompileCommand='dontinline,ShiftMask::redundant' \-XX:CompileCommand='dontinline,ShiftMask::kept' \-XX:CompileCommand='print,ShiftMask::redundant' \-XX:CompileCommand='print,ShiftMask::kept' \-XX:PrintAssemblyOptions=intel ShiftMask
Environment for the listings below: AMD Ryzen 9 9950X, Ubuntu 26.04, Linux 7.0.0. The
JDK 24 output is stock Corretto 24.0.2. The JDK 27 output is a local build of openjdk/jdk at
0c209afd.
The compileonly filter also leaves the hot main loop interpreted, so it never becomes an OSR
compile. What you read below is the plain, invocation-triggered method body. On JDK 24, before this work
was merged, the redundant mask is emitted:
; JDK 24: static int redundant(int x) { return (x << 2) & -4; }...shl esi,0x2 ; esi = x << 2mov eax,esiand eax,0xfffffffc ; & -4, a no-op, but the old C2 can't see through...ret
On a JDK 27 build the redundant and is gone, and on
this x86-64 build C2 even folds the shift into a single lea:
; JDK 27: static int redundant(int x) { return (x << 2) & -4; }...lea eax,[rsi*4+0x0] ; eax = x << 2 (AND is gone)...ret
lea eax,[rsi*4+0x0] computes x * 4, i.e. x << 2, in a single addressing operation, with no
and anywhere.
Here is an experiment to validate our understanding: Change the mask from -4 to -8,
so it also clears the bit 2, which C2 can't prove is zero, and on that same JDK 27 build the and
is part of the emitted code again:
; JDK 27: static int kept(int x) { return (x << 2) & -8; }...shl esi,0x2 ; esi = x << 2mov eax,esiand eax,0xfffffff8 ; the mask stays: -8 clears a bit that might be 1...ret
The redundant and kept methods start from the same bytecode (ishl, then iand, then
ireturn), differing only in the masking constant. The known-bits analysis is the reason that build
drops the first iand and keeps the second.
How do you even test this?
How do you prove that canonicalize_constraints and all those
bit rules are actually correct, across all 2^64 possible longs?
The change ships with a little intn_t<N> type (an integer of arbitrary tiny width). Canonicalization is checked exhaustively through 4-bit types (at 4 bits
that's 16 values, every combination of ranges and masks, by brute force). I read it as a useful trick: shrink the integer
width until exhaustive testing becomes cheap then test every state in that universe. The operation transfer functions use a mix of exhaustive tiny-width tests and random wider ones.
Here's the core
(test_rangeinference.cpp):
six nested loops over the whole tiny universe, feeding every possible
(lo, hi, ulo, uhi, zeros, ones) tuple through canonicalize_constraints:
for (int lo = S::min; lo <= S::max; lo++) {for (int hi = lo; hi <= S::max; hi++) {for (int ulo = U::min; ulo <= U::max; ulo++) {for (int uhi = ulo; uhi <= U::max; uhi++) {for (int zeros = U::min; zeros <= U::max; zeros++) {for (int ones = U::min; ones <= U::max; ones++) {TypeIntPrototype<S, U> t{{S(lo), S(hi)}, {U(ulo), U(uhi)}, {U(zeros), U(ones)}};auto new_t = t.canonicalize_constraints();// ...then assert new_t contains exactly the same values as t, for every v
S and U are
intn_t<N> / uintn_t<N>,
and the
test
runs the whole thing for N = 1, 2, 3, 4.
Why I care, and maybe why you should
I'm not a compiler engineer, and I won't pretend to be. I just enjoy learning how compilers computers work, and
chasing an optimization like this is fun. I also use LLMs to help me understand concepts that are new to me.
During these explorations I use LLM as a personal tutor: I ask questions until I understand what is going on. If you are an actual
compiler engineer and spot an error in my explanation, please let me know! I will be happy to correct it.
Thanks to Quan Anh Mai, Emanuel Peter, and Ashay Rane, whose work sent me down this wonderful rabbit hole!