How Many Raters Do You Actually Need for a Reliable AI Benchmark?

If you’ve ever built a dataset for evaluating an AI model, you’ve probably stared at a spreadsheet of human annotations and wondered: did we pay enough people to look at this?

Turns out, that question is way more consequential than most people in ML realize. Google Research just published a paper called “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation” that pokes at a sore spot in our field: we’ve been systematically underestimating the impact of human disagreement.

The forest vs tree problem

The core question is deceptively simple: given a fixed budget for human annotation, should you spread it thin across many items (the forest) or go deep with many raters per item (the tree)?

Think of it like restaurant reviews. The forest approach asks 1,000 people to each try one dish. You get a broad sense of quality but zero signal about whether a 4-star average hides wild disagreement. The tree approach asks 20 people to each try the same 50 dishes. You learn that the pad thai is divisive while the curry is universally loved — which actually tells you more about the restaurant’s consistency.

Most AI benchmarks have leaned heavily toward the forest. The standard is 1 to 5 raters per item, with the assumption that this is enough to extract a single “correct” label via majority vote or plurality. That assumption, Google’s researchers argue, is often wrong.
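
The arithmetic behind the trade-off is unforgiving: a fixed budget buys exactly N × K labels, so every extra rater per item comes straight out of your item count. A minimal sketch (my own illustration, with a made-up budget figure) makes the tension concrete:

```python
# A fixed annotation budget buys N * K labels in total, so going deeper
# on each item (higher K) directly shrinks how many items you can cover.
BUDGET = 10_000  # total human labels you can afford (hypothetical figure)

for k in (1, 3, 5, 20, 100):
    n = BUDGET // k
    print(f"K={k:>3} raters/item -> N={n:>6} items covered")
```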

What they actually did

They built a simulator fed by real-world datasets involving subjective tasks — toxicity detection, hate speech labeling, that sort of thing. Then they stress-tested thousands of combinations of:

  • N (number of items): from 100 to 50,000
  • K (raters per item): from 1 to 500

The goal was to find which configurations produce statistically reliable results at the conventional p < 0.05 level, meaning another lab running the same evaluation with fresh raters would reach the same conclusion.
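
Their simulator is grounded in real rating distributions; the toy version below is entirely my own (the Beta-distributed ambiguity model and the hardness/trials parameters are made-up stand-ins), but it captures the flavor of the experiment. It asks: if two independent panels of K raters each label the same N items, how often do their majority votes agree?

```python
import numpy as np

rng = np.random.default_rng(0)

def majority(p, k):
    """Sample k Bernoulli ratings per item, return strict-majority labels."""
    votes = rng.random((p.size, k)) < p[:, None]
    return votes.sum(axis=1) * 2 > k  # strict majority (use odd k)

def replication_agreement(n, k, hardness=2.0, trials=200):
    """Fraction of items on which two independent k-rater panels agree.

    Each item gets a latent share of the rater population that would
    label it positive, drawn from Beta(hardness, hardness): small values
    give polarized, easy items; larger values pile items near 50/50
    splits. The distribution is my stand-in, not the paper's model.
    """
    agree = []
    for _ in range(trials):
        p = rng.beta(hardness, hardness, size=n)
        # Two independent panels: each call draws fresh ratings.
        agree.append(np.mean(majority(p, k) == majority(p, k)))
    return float(np.mean(agree))

for k in (1, 3, 5, 15, 51):
    print(f"K={k:>2}: panels agree on {replication_agreement(500, k):.0%} of items")
```

In this toy model, agreement between panels climbs with K and doesn’t depend on N at all: more items sharpen your aggregate score, but only more raters stabilize individual labels.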

The results are sobering. For many subjective tasks, the common practice of 1-3 raters per item is flat-out insufficient. You need more depth (higher K) than most people budget for, especially when the task involves genuine ambiguity where humans naturally disagree.

Why this matters beyond academia

This isn’t just a theoretical exercise. If your toxicity classifier gets evaluated on a benchmark where 3 raters labeled each example, and another team runs the same model comparison on a different split with 3 different raters, the two of you might reach opposite conclusions about which model is better. That’s not reproducibility; that’s noise masquerading as science.
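
You can put a number on that flip risk. Suppose 60% of the rater population would call an example toxic; a quick binomial calculation (my own worked example, and the 60/40 split is hypothetical) shows how often a random panel of K raters lands on the minority label:

```python
from math import comb

def minority_wins_prob(p_majority: float, k: int) -> float:
    """Probability that a k-rater majority vote lands on the *minority*
    label, with raters voting independently and k odd."""
    q = 1 - p_majority  # chance a single rater sides with the minority
    need = k // 2 + 1   # votes the minority needs for a strict majority
    return sum(comb(k, m) * q**m * p_majority**(k - m) for m in range(need, k + 1))

for k in (1, 3, 5, 15, 51):
    print(f"K={k:>2}: the minority label wins {minority_wins_prob(0.6, k):.1%} of the time")
```

At a genuine 60/40 split, three raters return the minority label about 35% of the time, and even fifteen raters flip roughly one item in five.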

The paper provides a framework to optimize the N/K trade-off for a given budget. The sweet spot depends on how subjective the task is. For tasks with high inter-rater agreement, you can get away with fewer raters per item. For tasks where disagreement is baked into the problem (like content moderation), you need more depth.
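
One way to picture that framework (this is my toy planning rule, not the paper’s actual optimizer): pick the smallest K whose majority vote tracks the population-majority label at, say, 95% reliability, given your estimate of how often raters agree.

```python
from scipy.stats import binom

def smallest_reliable_k(p_majority: float, target: float = 0.95,
                        k_max: int = 501) -> int:
    """Smallest odd K whose majority vote matches the population-majority
    label with probability >= target. A toy planning rule, not the
    paper's optimizer."""
    q = 1 - p_majority
    for k in range(1, k_max + 1, 2):
        flip = binom.sf(k // 2, k, q)  # P(minority side gets a strict majority)
        if 1 - flip >= target:
            return k
    raise ValueError(f"no odd K <= {k_max} reaches {target:.0%} reliability")

for agreement in (0.97, 0.80, 0.70, 0.60):
    print(f"{agreement:.0%} rater agreement -> K >= {smallest_reliable_k(agreement)}")
```

Under this rule, near-consensus tasks really do get away with a single rater, while a 60/40 task demands dozens of raters per item. The exact numbers come from my simplified model, but the shape matches the paper’s conclusion: depth requirements explode as disagreement grows.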

My take

I’ve been on the receiving end of this problem more times than I want to count. You pay for 5 raters per item, get back a spreadsheet, collapse everything to majority vote, and pretend the disagreement never happened. The paper’s simulator approach is refreshingly practical — it gives you a way to calculate, before you spend a dime, whether your budget allocation will actually produce reproducible results.

That said, the paper doesn’t solve the deeper issue: most benchmark builders don’t have the budget to do this right. Google can afford 500 raters per item on a research project. A startup building a specialized dataset can’t. The framework helps you make the best of a bad situation, but it doesn’t change the fact that high-quality subjective evaluation is expensive.

The open-source simulator they released is genuinely useful. If you’re building a benchmark, plug in your estimated disagreement rates and budget, and it’ll tell you whether your planned N/K split is likely to produce garbage. I wish I’d had this tool five years ago.
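
I haven’t studied the released tool’s exact interface, so treat the sketch below as a hand-rolled stand-in for the kind of pre-spend check it enables: sweep the (N, K) splits your budget allows and flag the ones whose majority votes are too unstable to trust. The budget figure, the agreement rate, and the function itself are all my own hypotheticals.

```python
from scipy.stats import binom

def nk_options(budget: int, p_agree: float, target: float = 0.95) -> None:
    """Enumerate (N, K) splits of a fixed label budget and flag which ones
    produce majority votes that match the population-majority label at
    least `target` of the time. A stand-in sketch, not the released
    simulator's API; p_agree is your estimated rater-agreement rate."""
    for k in (1, 3, 5, 9, 15, 25, 51, 101):
        if k > budget:
            break
        n = budget // k
        reliability = 1 - binom.sf(k // 2, k, 1 - p_agree)
        verdict = "ok" if reliability >= target else "RISKY"
        print(f"K={k:>3}  N={n:>6}  per-item reliability={reliability:.3f}  {verdict}")

nk_options(budget=30_000, p_agree=0.7)
```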

One thing I wish they’d explored more: the interaction between rater quality and the N/K trade-off. Not all raters are equal. A smaller crowd of carefully vetted experts might outperform a larger crowd of cheap MTurkers. The paper mostly assumes raters are interchangeable, which is a simplification.

Still, this is the kind of work that should make benchmark builders uncomfortable. If your evaluation results could flip depending on which 3 people happened to label each example, you don’t have a benchmark — you have a lottery.
