
When AI Optimizations Miss the Mark: A Case Study in Array Shape Calculation

QuestDB is the open-source time-series database for demanding workloads, from trading floors to mission control. It delivers ultra-low latency, high ingestion throughput, and a multi-tier storage engine. Native support for Parquet and SQL keeps your data portable and AI-ready, with no vendor lock-in.

As a core database engineer at QuestDB, I regularly work with performance-critical code that processes massive datasets. Recently, I encountered an interesting case where an AI-suggested optimization, despite appearing more sophisticated on paper, barely made our code any faster. This experience highlights the importance of benchmarking real-world performance rather than relying solely on theoretical improvements.

The Function: Array Shape Calculation

The function in question is calculate_array_shape, a critical component of our Apache Parquet reader that computes the dimensions of arrays from repetition levels. In the case of N-dimensional (ND) arrays, the repetition levels specify at which repeated field (think: dimension) in the path an array element is defined. Thus, repetition levels are nothing more than an array of integers. The calculate_array_shape function processes potentially millions of elements during large dataset exports, making its performance crucial to overall system throughput. The very first version of the function was simple, but left a lot to be desired when it comes to being CPU-friendly:

pub fn calculate_array_shape(
    shape: &mut [u32; ARRAY_NDIMS_LIMIT], // sink for the calculated shape
    max_rep_level: u32,                   // maximum possible repetition level
    rep_levels: &[u32],                   // the repetition levels list
) {
    if rep_levels.len() == 0 {
        return;
    }
    let max_rep_level = max_rep_level as usize;
    let mut counts = [0_u32; ARRAY_NDIMS_LIMIT];
    for &rep_level in rep_levels {
        let rep_level = std::cmp::max(rep_level, 1) as usize;
        // Reset counts for dimensions deeper than repetition level
        for dim in rep_level..max_rep_level {
            counts[dim] = 0;
        }
        // Increment count at the deepest level (where actual values are)
        counts[max_rep_level - 1] += 1;
        // Update shape with maximum counts seen so far
        for dim in 0..max_rep_level {
            shape[dim] = std::cmp::max(shape[dim], counts[dim]);
        }
        // If this is not the first element, increment deeper dimension counts
        for dim in rep_level - 1..max_rep_level - 1 {
            counts[dim] += 1;
            shape[dim] = std::cmp::max(shape[dim], counts[dim]);
        }
    }
}

This function takes the given Parquet repetition levels and converts them into QuestDB's native array header. The header contains a NumPy-style shape, i.e. a list of dimension sizes. In this post, we'll be focusing on the performance aspects of the code rather than on its functionality, but if you're eager to learn more, here is the pull request where array support in Parquet partitions landed.
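
To make repetition levels a bit more concrete, here is a small usage sketch (ARRAY_NDIMS_LIMIT is the dimension-limit constant from the snippet above). For a single 2x3 array stored as a nested list, Parquet emits one repetition level per element: 0 opens the array, 1 opens the next row, and 2 continues the current row:

let rep_levels = [0_u32, 2, 2, 1, 2, 2]; // repetition levels of a 2x3 array
let mut shape = [0_u32; ARRAY_NDIMS_LIMIT];
calculate_array_shape(&mut shape, 2, &rep_levels); // max_rep_level = 2 for a 2D array
assert_eq!(&shape[..2], &[2, 3]);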

Vlad, our CTO, reviewed the initial version of the pull request and noticed the potentially slow calculate_array_shape function. He suggested giving Claude Code a try at optimizing the function, so that's what we did.

The original function was given to Claude and it came up with several sophisticated improvements:

  • Special-case handling for different array depths
  • Unrolled loops for common small dimensions (2D, 3D, 4D)
  • Chunked processing for higher dimensions to improve cache locality

At this high level, the optimizations looked spot-on. The AI correctly identified hotspots and applied textbook performance optimization techniques. However, when I looked at the code and then ran microbenchmarks, I discovered the optimized version was only slightly faster than the original, and in the generic ND case it performed exactly the same.

The AI Optimizations

What Claude came up with had fast-path code for 1D-4D arrays, which makes complete sense since such arrays are the most common ones. The fast path looked something like the following:

// Common case optimization for small dimensions
if max_rep_level <= 4 {
    // Unrolled version for common small dimensions
    for &rep_level in rep_levels {
        let rep_level = rep_level.max(1) as usize;
        // Reset counts - unrolled for small dimensions
        match max_rep_level {
            2 => {
                if rep_level < 2 {
                    counts[1] = 0;
                }
            }
            3 => {
                // ...
            }
        }
        // Always increment the deepest level
        counts[max_rep_level - 1] += 1;
        // Update shape - unrolled for better performance
        for i in 0..max_rep_level {
            let count = counts[i];
            shape[i] = shape[i].max(count);
        }
        // Increment intermediate dimensions
        if rep_level > 0 {
            for dim in (rep_level - 1)..(max_rep_level - 1) {
                counts[dim] += 1;
                shape[dim] = shape[dim].max(counts[dim]);
            }
        }
    }
} else {
    // General case for higher dimensions
    // ...
}

As you may have noticed, the AI kept the structure of the old code but introduced an unrolled version of the first loop. While this is a step in the right direction, it is a half-done optimization. In a moment, we'll see what a properly optimized fast path looks like.

Besides that, Claude replaced std::cmp::max with a max method call on the number itself, which is a cosmetic change. It also introduced block-based iteration in the general ND case:

// General case for higher dimensions
const CHUNK_SIZE: usize = 64;
for chunk in rep_levels.chunks(CHUNK_SIZE) {
    for &rep_level in chunk {
        let rep_level = rep_level.max(1) as usize;
        // Reset counts for dimensions deeper than repetition level
        // ... more or less the same code as in the original version follows
    }
}

The AI's justification for introducing 64-element block iteration was better CPU cache locality. Unfortunately, the iteration here is single-pass only, so this change gave no improvement.

After looking at all of this, we took a shot at optimizing the function from where the AI left off.

The New Implementation

The final version takes a different approach, focusing on explicit pattern matching for each dimension level:

match max_rep_level {
    1 => {
        shape[0] = rep_levels.len() as u32;
    }
    2 => {
        for &rep_level in rep_levels {
            let rep_level = rep_level.max(1) as usize;
            if rep_level < 2 {
                counts[0] += 1;
                counts[1] = 1;
            } else {
                counts[1] += 1;
            }
            shape[1] = shape[1].max(counts[1]);
        }
        shape[0] = shape[0].max(counts[0]);
    }
    3 => { /* similar pattern for 3D */ }
    4 => { /* similar pattern for 4D */ }
    _ => { /* general case */ }
}

First of all, for 1D-4D arrays we handle each number of dimensions with its own optimized logic. For instance, for 1D arrays we know that the repetition levels always look like 0, 1, ..., 1, since the only possible depth is the first dimension. So, calculating the shape of a 1D array is as simple as:

shape[0] = rep_levels.len() as u32;

The same goes for 2D-4D arrays: we took the original code, unrolled the loops by hand for the known number of dimensions, and then removed redundant operations, such as a zero assignment to an array element immediately followed by an increment; a rough sketch of what a 3D arm could look like follows below.
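
For illustration, here is one way a 3D arm following the same pattern as the 2D case could look. This is our own sketch of the approach, not necessarily the exact production code:

3 => {
    for &rep_level in rep_levels {
        let rep_level = rep_level.max(1) as usize;
        if rep_level < 2 {
            // New outermost element: bump dim 0, reset the two inner dims
            counts[0] += 1;
            counts[1] = 1;
            counts[2] = 1;
        } else if rep_level < 3 {
            // New middle-level list: bump dim 1, reset the innermost dim
            counts[1] += 1;
            counts[2] = 1;
        } else {
            // Continuing the innermost list: bump dim 2 only
            counts[2] += 1;
        }
        shape[1] = shape[1].max(counts[1]);
        shape[2] = shape[2].max(counts[2]);
    }
    shape[0] = shape[0].max(counts[0]);
}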

While this approach results in more code duplication, it eliminates a lot of branches as well as redundant loads and stores, and it allows the compiler to optimize each case independently.

Next, we took a look at the generic case. While we couldn't unroll the loops as easily as in the previous cases, we could still merge them into a single, simpler loop. The result is the following:

// General case for higher dimensions
for &rep_level in rep_levels {
    let rep_level = rep_level.max(1) as usize;
    // Increment count at the dimension given by the repetition level
    counts[rep_level - 1] += 1;
    shape[rep_level - 1] = shape[rep_level - 1].max(counts[rep_level - 1]);
    for dim in rep_level..max_rep_level {
        // Reset counts for dimensions deeper than the repetition level
        counts[dim] = 1;
        // Don't forget to update the shape
        shape[dim] = shape[dim].max(1);
    }
}

Instead of three loops, we now have a single one with no additional branches. What's not to like for a CPU?
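
Since the rewrite is only worth shipping if it keeps producing exactly the same shapes, a differential check against the original implementation makes for a cheap safety net. Below is a minimal sketch of such a test; calculate_array_shape_old is a hypothetical name for the original version kept around for comparison, and the inline xorshift generator simply avoids extra dependencies:

#[test]
fn old_and_new_shapes_match() {
    let mut state = 0x9E37_79B9_7F4A_7C15_u64;
    let mut next = move || {
        // xorshift64: tiny deterministic pseudo-random generator
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };
    for max_rep_level in 1..=5_u32 {
        for _ in 0..1_000 {
            let len = 1 + (next() % 100) as usize;
            let mut levels: Vec<u32> = (0..len)
                .map(|_| (next() % (max_rep_level as u64 + 1)) as u32)
                .collect();
            levels[0] = 0; // an array's first element always has repetition level 0
            let mut old_shape = [0_u32; ARRAY_NDIMS_LIMIT];
            let mut new_shape = [0_u32; ARRAY_NDIMS_LIMIT];
            calculate_array_shape_old(&mut old_shape, max_rep_level, &levels);
            calculate_array_shape(&mut new_shape, max_rep_level, &levels);
            assert_eq!(old_shape, new_shape);
        }
    }
}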

The full code of the function can be found here. Now, it's time to benchmark the whole thing.

Benchmark Results

After a number of attempts, Claude kindly generated a microbenchmark to compare both implementations across different array dimensions. To eliminate CPU branch predictor advantages and get realistic performance measurements, the benchmark uses 1,000 different randomly-generated test cases for each array type. The test cases are stored in a single flat vector to avoid CPU cache misses.
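
The generated benchmark itself isn't shown in this post, but its core loop looks roughly like the sketch below. The flat levels-plus-lengths layout and the bench_shape name are our assumptions, not the exact generated code:

use std::time::Instant;

// Rough sketch of the benchmark core: all test cases live in one flat buffer
// (`levels`) with per-case lengths (`lens`) so they stay contiguous in memory.
fn bench_shape(levels: &[u32], lens: &[usize], max_rep_level: u32, iters: usize) -> std::time::Duration {
    let start = Instant::now();
    for _ in 0..iters {
        let mut offset = 0;
        for &len in lens {
            let mut shape = [0_u32; ARRAY_NDIMS_LIMIT];
            calculate_array_shape(&mut shape, max_rep_level, &levels[offset..offset + len]);
            std::hint::black_box(&shape); // keep the optimizer from eliding the work
            offset += len;
        }
    }
    start.elapsed()
}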

Here is the result, obtained on a Ryzen 7900X box running Ubuntu 22.04 and Rust 1.89:

// Test setup: 1,000 test cases Ă— up to 100 elements each, 1,000 iterations
// Random repetition levels to prevent branch predictor optimization
1D arrays: 9.36x speedup
2D arrays: 5.98x speedup
3D arrays: 8.14x speedup
4D arrays: 6.71x speedup
5D arrays: 2.07x speedup
Average: 6.45x speedup
// Test setup: 1 test case Ă— 100-144 elements, 1,000 iterations
// Same test case to measure performance with predictable inputs
1D arrays: 17.30x speedup
2D arrays: 3.72x speedup
3D arrays: 4.00x speedup
4D arrays: 3.19x speedup
5D arrays: 1.74x speedup
Average: 5.99x speedup

The performance improvement remains substantial even with randomized inputs that prevent branch predictor optimization. The 1D case shows the most dramatic improvement, at over 9x faster in the random-input case, as the new implementation has a trivial special case: shape[0] = rep_levels.len() as u32. The old implementation processed even this simple case through the complex generic logic.

The 2D-4D cases show consistent 6-8x improvements in the random test cases, demonstrating that the benefits persist even under realistic conditions with unpredictable branching patterns. In the same-test-case scenario the difference is smaller, yet still noticeable; this is explained by the CPU branch predictor being able to predict all branches in the old code. The generic ND case is also noticeably faster than Claude's version.

The randomized approach provides a more realistic assessment than benchmarking identical arrays repeatedly, as real-world Parquet processing involves arrays with varying shapes and nesting patterns that prevent the CPU's branch predictor from optimizing the hot paths. For the sake of completeness, we also ran the same-test-case scenario to understand how the new code behaves when the array shape stays the same across the Parquet file.

Let's check how the old code compares with the new one for 1D arrays from the CPU hardware perspective. For that, we'll use the Linux perf utility to collect statistics while running the benchmark. To collect the stats, we changed the number of test cases to 10,000 and the number of iterations to 100,000. First, let's run the old code:

$ perf stat -d ./bench
Benchmark 1: 1D arrays (10000 test cases, up to 100 elements each, 100000 iterations)
Old implementation: 63259ms (632594.38ns/op)
Performance counter stats for './bench':
63,250.95 msec task-clock # 1.000 CPUs utilized
568 context-switches # 8.980 /sec
97 cpu-migrations # 1.534 /sec
627 page-faults # 9.913 /sec
1,953,101,968,146 instructions # 5.73 insn per cycle
# 0.00 stalled cycles per insn (71.43%)
340,887,156,255 cycles # 5.389 GHz (71.43%)
8,183,732,105 stalled-cycles-frontend # 2.40% frontend cycles idle (71.42%)
464,537,558,223 branches # 7.344 G/sec (71.43%)
1,048,091,584 branch-misses # 0.23% of all branches (71.43%)
200,230,551,465 L1-dcache-loads # 3.166 G/sec (71.43%)
3,354,941,370 L1-dcache-load-misses # 1.68% of all L1-dcache accesses (71.43%)
63.262848982 seconds time elapsed
63.222188000 seconds user
0.028997000 seconds sys

Next, let's run the new code:

$ perf stat -d ./bench
Benchmark 1: 1D arrays (10000 test cases, up to 100 elements each, 100000 iterations)
New implementation: 6741ms (67413.41ns/op)
Performance counter stats for './bench':
6,742.86 msec task-clock # 1.000 CPUs utilized
61 context-switches # 9.047 /sec
20 cpu-migrations # 2.966 /sec
627 page-faults # 92.987 /sec
63,123,343,365 instructions # 1.77 insn per cycle
# 0.02 stalled cycles per insn (71.40%)
35,725,494,364 cycles # 5.298 GHz (71.43%)
1,005,643,847 stalled-cycles-frontend # 2.81% frontend cycles idle (71.44%)
6,028,389,684 branches # 894.040 M/sec (71.44%)
3,187,586 branch-misses # 0.05% of all branches (71.44%)
34,602,313,746 L1-dcache-loads # 5.132 G/sec (71.44%)
128,238,227 L1-dcache-load-misses # 0.37% of all L1-dcache accesses (71.41%)
6.744747777 seconds time elapsed
6.739497000 seconds user
0.003999000 seconds sys

The number of branches dropped from 7.3B to 894M per second, and the branch miss rate also dropped significantly, from 0.23% to 0.05%. This tells us that the CPU is dealing with far fewer branches in the new code. On the other hand, the IPC (instructions per cycle) value dropped from 5.73 to 1.77, another example of how a higher IPC does not always mean better performance. Here the total number of retired instructions dropped from 1,953B to 63B, so it's no surprise that the new code ran faster despite its 3x lower IPC: the CPU had to execute roughly 31x fewer instructions. Put differently, runtime is roughly instructions divided by (IPC x clock frequency): 1,953B / 5.73 is about 341B cycles for the old code versus 63B / 1.77, or about 36B cycles, for the new one, which matches the measured ~9.4x speedup.

Of course, we could go one step further and port the code to SIMD instructions, applying the same logic simultaneously to the repetition levels of multiple arrays. But we decided to stop where we were, since this function is only one part of the decoding pipeline and the new version is already fast enough not to be the bottleneck.

Lessons Learned

This experience reinforced several important principles for performance-critical systems:

Simplicity often wins: The most straightforward implementation frequently outperforms "clever" optimizations, especially when the compiler can apply its own optimizations effectively.

Profile everything: No optimization is complete without thorough benchmarking on representative workloads. What looks good in theory may perform poorly in practice.

Understand your hardware: Modern CPUs are incredibly sophisticated, with complex branch prediction, caching, and instruction-level parallelism. Hand optimizations must work with these features, not against them.

Trust but verify: AI suggestions can provide valuable insights, but they should always be validated through empirical testing before being deployed to production systems.

The database field is particularly unforgiving when it comes to performance regressions. A few percentage points of slowdown in a hot path can translate to significant increases in query latency across an entire system. This makes rigorous benchmarking not just good practice, but essential for maintaining system performance at scale.

In our Parquet decoding pipeline, this 5x improvement in array shape calculation translates to noticeably faster query times for complex arrays - a meaningful improvement for users working with large analytical datasets.
