Google’s AI Overviews still gets 1 in 10 answers wrong, and that’s a lot of garbage per hour

If you’ve used Google recently, you’ve probably seen AI Overviews—that Gemini-powered blob of text that now sits at the top of your search results, trying to answer your query before you click any links. It’s been a rough ride since 2024, with users complaining about everything from hallucinated facts to outright dangerous advice. Google’s been quietly improving it, and by most accounts, it’s getting better.

But “better” is a low bar when the baseline was a dumpster fire.

The New York Times decided to actually measure how often AI Overviews gets things wrong, and the results are sobering. They teamed up with Oumi, a startup that builds AI models, to run the SimpleQA benchmark—a set of over 4,000 questions with verifiable answers, originally released by OpenAI in 2024. It’s a standard way to test how factual a generative model is.

When Oumi first ran the test last year, with Gemini 2.5 under the hood, AI Overviews scored 85 percent accuracy. That sounds decent until you realize 15 percent of answers were wrong. After the Gemini 3 update, the score climbed to 91 percent. That still leaves 9 percent of answers wrong, or roughly one in ten.

Now do the math. Google processes billions of searches every day. Not every search triggers an AI Overview, but even if only a fraction of them do, a 9 percent error rate works out to tens of millions of incorrect answers daily. Per hour, that’s anywhere from hundreds of thousands to a few million wrong answers landing on people’s screens. And that’s under controlled test conditions. Real-world queries are messier, and the actual error rate is probably higher.
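For anyone who wants to check that back-of-envelope estimate, here is a minimal sketch in Python. Only the 9 percent error rate comes from the benchmark result above. The daily search volumes and the share of searches that show an AI Overview are hypothetical placeholders, not figures reported by Google or the Times, and the function name is made up for illustration.

```python
def wrong_answers(searches_per_day, overview_share, error_rate):
    """Estimate incorrect AI Overview answers per day and per hour.

    searches_per_day: total daily searches (assumed value, not a reported one)
    overview_share:   fraction of searches that show an AI Overview (assumed)
    error_rate:       fraction of AI Overview answers that are wrong
    """
    per_day = searches_per_day * overview_share * error_rate
    per_hour = per_day / 24
    return per_day, per_hour


# 100% minus the 91% benchmark accuracy cited in the article.
ERROR_RATE = 0.09

# Hypothetical scenarios: (daily searches, share that trigger an AI Overview).
scenarios = [(2e9, 0.10), (5e9, 0.20)]

for searches, share in scenarios:
    per_day, per_hour = wrong_answers(searches, share, ERROR_RATE)
    print(f"{searches:,.0f} searches/day, {share:.0%} with an Overview: "
          f"{per_day:,.0f} wrong per day, {per_hour:,.0f} per hour")
```

Under the first assumed scenario, that comes to about 18 million wrong answers a day, or 750,000 an hour. The second lands at 90 million a day. Change the placeholders and the totals move, but it takes a very small Overview share to get below the tens of millions.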

I’ve been watching this space since the early days of LLMs, and I have to say: I’m not surprised, but I am disappointed. Google has the resources to get this right. They have the data, the talent, the compute. Yet they’re shipping a product that gets hundreds of thousands of answers wrong every hour and calling it good enough. The fact that it took a newspaper and a startup running an OpenAI benchmark to put a number on the problem says a lot about the state of evaluation in this industry.

Look, I get it. AI Overviews is useful for quick answers, and it’s better than digging through ten blue links for simple facts. But when the cost of a wrong answer can be anything from mild annoyance to medical misinformation, a 91 percent accuracy rate isn’t a victory lap. It’s a warning.

Google needs to either slow down and fix the underlying model, or be far more transparent about when AI Overviews is guessing versus when it actually knows the answer. Right now, it feels like we’re all beta testers for a product that’s already in production. And that’s not how you build trust.
