AI evals are becoming the new compute bottleneck

I’ve been watching the AI evaluation cost problem grow for a while now, but the numbers coming out of recent agent benchmarks finally made me sit up. The Holistic Agent Leaderboard (HAL) spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. That’s not a typo. A single GAIA run on a frontier model can hit $2,829 before you even think about caching. Exgentic’s $22,000 sweep across agent configurations found a 33× cost spread on identical tasks, and UK-AISI scaled agentic steps into the millions to study inference-time compute.

This isn’t just a sticker shock problem. It changes who can do serious evaluation work. If you’re a small lab or a university group, you’re priced out. Period.

The static benchmark days are gone

Let me back up. The cost problem didn’t start with agents. When Stanford’s CRFM released HELM in 2022, their per-model accounting showed API costs ranging from $85 for OpenAI’s code-cushman-001 to $10,926 for AI21’s J1-Jumbo (178B). Open models needed 540 to 4,200 GPU-hours, with BLOOM (176B) and OPT (175B) at the top end. Across HELM’s 30 models and 42 scenarios, the aggregate came to roughly $100,000. That was already a lot.

But here’s the kicker from Perlitz et al.’s analysis of EleutherAI’s Pythia checkpoints: developers pay for evaluation repeatedly during model development. Pythia released 154 checkpoints for each of 16 models spanning 8 sizes, or 2,464 checkpoints in total. Running the LM Evaluation Harness on every one of them turns evaluation into a multiplier on training cost. For small models, evaluation becomes the dominant compute line item across the whole development cycle. And when you scale inference-time compute, you scale evaluation costs along with it.

What we learned from compressing static benchmarks

Perlitz et al. then asked how much of HELM actually carried the rankings. The result was striking: a 100× to 200× reduction in compute preserved nearly the same ordering. Flash-HELM turned that into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM’s compute was confirming rankings the field could have inferred much more cheaply.
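To make the coarse-to-fine idea concrete, here is a minimal two-stage sketch. It is not Flash-HELM’s actual procedure, just a toy in the same spirit: the model names, their “true” accuracies, the sample sizes, and the `simulated_eval` function are all made up for illustration.

```python
import random
from typing import Callable, Dict, List


def coarse_to_fine(
    models: List[str],
    evaluate: Callable[[str, int], float],
    coarse_n: int,
    fine_n: int,
    keep_top: int,
) -> Dict[str, float]:
    """Score every model cheaply, then re-score only the leaders at high resolution."""
    coarse = {m: evaluate(m, coarse_n) for m in models}
    leaders = sorted(coarse, key=coarse.get, reverse=True)[:keep_top]
    refined = {m: evaluate(m, fine_n) for m in leaders}
    # Leaders keep their high-resolution score; everyone else keeps the coarse one.
    return {**coarse, **refined}


# Hypothetical models and accuracies, used only to drive the simulation.
TRUE_ACC = {"model-a": 0.80, "model-b": 0.75, "model-c": 0.55, "model-d": 0.40, "model-e": 0.30}
rng = random.Random(0)
usage = {"examples": 0}  # running count of examples "paid for"


def simulated_eval(model: str, n: int) -> float:
    """Stand-in for a real benchmark run: n Bernoulli draws at the model's true accuracy."""
    usage["examples"] += n
    hits = sum(rng.random() < TRUE_ACC[model] for _ in range(n))
    return hits / n


scores = coarse_to_fine(list(TRUE_ACC), simulated_eval, coarse_n=50, fine_n=5000, keep_top=2)
full_budget = len(TRUE_ACC) * 5000
print(f"examples used: {usage['examples']} vs {full_budget} for a full sweep")
```

The saving comes entirely from never running the expensive pass on models the cheap pass has already ruled out. The risk is the obvious one: if the coarse pass is too noisy, a good model can be cut before it gets a fair look.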

Other work reached the same conclusion. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error using Item Response Theory. The Open LLM Leaderboard collapsed from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 87 language-model/prompt pairs on GLUE. Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so ranking can survive aggressive subsampling.
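Here is a toy illustration of why rankings can survive subsampling. To be clear about what it is not: it uses plain random subsampling on simulated data, not the Item Response Theory approach of tinyBenchmarks or the anchor selection of Anchor Points. The skill levels, item count, and subset size are all assumptions I picked for the demo.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall rank correlation between two equal-length score lists (tied pairs count as zero)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


def accuracy(results, item_ids):
    """Per-model accuracy restricted to the given item indices."""
    return [sum(row[i] for i in item_ids) / len(item_ids) for row in results]


rng = random.Random(0)
n_items = 2000
skills = [0.35, 0.45, 0.55, 0.65, 0.75, 0.85]  # hypothetical per-model accuracies

# results[m][i] is 1 if simulated model m answers item i correctly, else 0.
results = [[int(rng.random() < s) for _ in range(n_items)] for s in skills]

full = accuracy(results, range(n_items))
subset = rng.sample(range(n_items), 100)  # 5% of the items, chosen at random
small = accuracy(results, subset)

tau = kendall_tau(full, small)
print(f"Kendall tau between full and subsampled ranking: {tau:.2f}")
```

In this simulation every item is equally informative, which is the easy case. The published methods earn their larger compression ratios by choosing items deliberately rather than at random.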

That trick weakened sharply once benchmarks moved from static predictions to agents.

Agent evals are a different beast

HAL’s $40,000 headline covers 21,730 rollouts across 9 models and 9 benchmarks, and by April 2026 the leaderboard had grown to 26,597 rollouts. Behind that aggregate, the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders of magnitude within some individual benchmarks.

Part of the explanation is a blunt pricing fact. Claude Opus 4.1 charges $15 per million input tokens and $75 per million output tokens. Gemini 2.0 Flash charges $0.10 and $0.40, a 150× gap (about two orders of magnitude) on input alone. Agent benchmarks rarely measure “the model” in isolation. They measure a combination of model, scaffold, and token budget, and small scaffold choices can multiply costs 10×.
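A quick sketch of how those list prices turn into per-rollout cost. The prices are the ones quoted above; the dictionary keys are just my labels, and the token counts per rollout are a hypothetical workload, not a measurement from any benchmark.

```python
# USD per million tokens as (input, output), taken from the prices quoted in the text.
# The keys are informal labels, not official API model identifiers.
PRICES = {
    "claude-opus-4.1": (15.00, 75.00),
    "gemini-2.0-flash": (0.10, 0.40),
}


def rollout_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one rollout at list price, ignoring caching and discounts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical agent rollout: long accumulated context, comparatively short outputs.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000_000, 50_000

opus = rollout_cost("claude-opus-4.1", INPUT_TOKENS, OUTPUT_TOKENS)
flash = rollout_cost("gemini-2.0-flash", INPUT_TOKENS, OUTPUT_TOKENS)
print(f"opus: ${opus:.2f}  flash: ${flash:.2f}  ratio: {opus / flash:.0f}x")
```

Because agents re-read their growing context on every step, input tokens tend to dominate the bill, which is why a scaffold that is careless with context can cost several times more than a frugal one on the same model.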

Worse, higher spend does not reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes “a 9× difference in cost despite just a two-percentage-point difference in accuracy.” On GAIA, an HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. CLEAR finds across 6 SOTA agents on 300 enterprise tasks that “accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives” with comparable real-world performance.
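One way to read those numbers is as a Pareto question: is any run beaten on both cost and accuracy by another run on the same benchmark? The sketch below uses only the four cost and accuracy pairs quoted above. The `Run` type and `pareto_frontier` helper are mine, and the second GAIA agent is unnamed in the text, so it is labeled as such.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    name: str
    cost_usd: float
    accuracy: float  # percent


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep runs that no other run matches or beats on both cost and accuracy."""
    def dominated(r: Run) -> bool:
        return any(
            o.cost_usd <= r.cost_usd
            and o.accuracy >= r.accuracy
            and (o.cost_usd < r.cost_usd or o.accuracy > r.accuracy)
            for o in runs
        )
    return [r for r in runs if not dominated(r)]


online_mind2web = [
    Run("Browser-Use + Claude Sonnet 4", 1577, 40.0),
    Run("SeeAct + GPT-5 Medium", 171, 42.0),
]
gaia = [
    Run("HAL Generalist + o3 Medium", 2828, 28.5),
    Run("unnamed agent (from the text)", 1686, 57.6),
]

for benchmark, runs in [("Online Mind2Web", online_mind2web), ("GAIA", gaia)]:
    names = [r.name for r in pareto_frontier(runs)]
    print(f"{benchmark}: {names}")
```

In both pairs the more expensive run is dominated outright: it costs more and scores lower.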

Scientific ML isn’t immune either

In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.
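The arithmetic is simple but worth spelling out. The 960 H100-hours per architecture comes from the figures above; the assumption that each extra seed repeats the full cost is mine, as is the choice of three seeds.

```python
H100_HOURS_PER_ARCHITECTURE = 960  # figure quoted in the text for The Well


def sweep_h100_hours(num_architectures: int, num_seeds: int = 1) -> int:
    """Total H100-hours, assuming every seed repeats the full evaluation (an assumption)."""
    return H100_HOURS_PER_ARCHITECTURE * num_architectures * num_seeds


print(sweep_h100_hours(4))               # four-baseline sweep, single run: 3840
print(sweep_h100_hours(4, num_seeds=3))  # same sweep with three seeds: 11520
```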

What this means

We’re entering a phase where evaluation costs are becoming the gating factor for AI research. The compression techniques that worked for static benchmarks transfer poorly to agent benchmarks, which are noisy, scaffold-sensitive, and only partly compressible. The field needs new approaches: better caching strategies, more efficient evaluation protocols, or a fundamental rethinking of what “evaluation” means in an agentic world.

For now, if you’re not working at a place with deep pockets or significant GPU reserves, you’re effectively locked out of rigorous agent evaluation. That’s a problem that won’t solve itself.