Can LLMs Actually Help Physicists? Google Put Them to the Test on Superconductivity

Can a large language model be a useful partner for a physicist working on one of the hardest open problems in condensed matter? That’s the question behind a new paper from Google Research, published in PNAS, in which the authors put six LLMs through the wringer on high-temperature superconductivity.

The topic is a smart choice. High-temperature superconductivity, specifically in cuprates (a class of copper-oxide compounds), has been an open puzzle since its discovery in 1986. Thousands of papers have been published, multiple competing theories exist, and no single model has won universal acceptance. A graduate student entering this field faces a mountain of literature and a minefield of conflicting interpretations. If an LLM can navigate that, it can probably handle a lot.

So what did they test? The team assembled expert-level questions that require more than just fact retrieval. These were the kind of questions a researcher might ask when exploring a new direction or trying to understand the state of a debate. Things like “What is the evidence for and against the spin fluctuation mechanism in cuprates?”—not “What is the critical temperature of YBCO?”

A panel of domain experts then graded the responses on accuracy, completeness, and how well they reflected the current state of the field, including unresolved debates.

The results are interesting, and maybe a little surprising. The top performers weren’t the biggest general-purpose models. They were two systems that draw from a closed, curated set of sources: NotebookLM (Google’s own tool for interacting with your documents) and a custom-built system that similarly restricted its knowledge base to a vetted collection of papers.

That makes sense to me. When you’re dealing with a field where the literature is vast and not all of it is equally reliable, having a model that can’t just grab something from a random Reddit thread or a 1995 preprint that’s been superseded is actually a feature, not a bug. The open-web models tended to produce answers that were more generic, sometimes glossing over nuance or giving equal weight to fringe ideas.

But even the best systems had clear weaknesses. The experts noted that the models sometimes struggled to distinguish between settled facts and active controversies. They’d present a theory that’s been mostly abandoned as if it’s still a contender, or they’d fail to mention that a particular experimental result has been challenged by later work. In a field like superconductivity, that kind of error isn’t just an annoyance—it could send a junior researcher down a wrong path for months.

There’s also the question of depth. The models could produce coherent paragraphs that sounded authoritative, but when probed on specifics—like the exact predictions of a particular theory or the limitations of a certain experimental technique—they often fell short. The experts described some responses as “plausible but shallow.”

This isn’t exactly shocking. LLMs are pattern matchers, not physicists. They can mimic the language of scientific discourse without truly understanding the underlying physics. But the fact that they got as far as they did is still noteworthy. A student who uses one of these curated systems as a starting point for literature exploration could save a lot of time, as long as they treat the output as a guide, not gospel.

The paper also highlights something I’ve been saying for a while: for specialized domains, the quality of the sources a model draws on matters more than the size of the model. NotebookLM’s advantage came from being able to restrict its attention to a specific set of documents that the user provides, which is a matter of what it reads at answer time rather than what it was trained on. That’s a fundamentally different paradigm from asking a general-purpose chatbot to answer a physics question. It’s more like having a really fast, really thorough research assistant who can only read the papers you give them.
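
To make the "can only read the papers you give them" idea concrete, here is a toy sketch of closed-corpus retrieval. It is not how NotebookLM or the paper's custom system works internally. The corpus entries, the stop-word list, and the function names are all made up for illustration, and the keyword-overlap scoring is a stand-in for whatever retrieval a real system would use. The point is only that the system declines to answer when nothing in the vetted set matches.

```python
import re
from collections import Counter

# Hypothetical curated corpus: IDs and text are placeholders, not real papers.
CURATED_CORPUS = {
    "review_a": "Spin fluctuation pairing in cuprates: evidence from neutron scattering and open objections.",
    "review_b": "Phonon mediated pairing in cuprates: isotope effect measurements and their limits.",
    "review_c": "Pseudogap phase in underdoped cuprates: competing interpretations.",
}

# Small illustrative stop-word list so common words do not count as matches.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "for", "is", "what", "to", "from", "their"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def retrieve(question, corpus, top_k=2):
    """Rank documents by keyword overlap with the question (toy scoring)."""
    q_counts = Counter(tokenize(question))
    scored = []
    for doc_id, text in corpus.items():
        d_counts = Counter(tokenize(text))
        overlap = sum(min(q_counts[t], d_counts[t]) for t in q_counts)
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def build_context(question, corpus, top_k=2):
    """Return passages from the curated set only, or None if nothing matches."""
    doc_ids = retrieve(question, corpus, top_k)
    if not doc_ids:
        return None  # Refuse rather than fall back to outside knowledge.
    return "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in doc_ids)


if __name__ == "__main__":
    print(build_context("What is the evidence for spin fluctuation pairing in cuprates?", CURATED_CORPUS))
    print(build_context("What are the best pizza toppings?", CURATED_CORPUS))
```

In a real setup the returned passages would be handed to the model as the only material it may cite. The design choice worth noticing is the `None` branch: a closed system can say "not in my sources" instead of improvising.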

Other groups at Google are also working on AI for science—using models as thought partners for hypothesis generation, as agents that can write scientific software, or as tools for single-cell analysis. This superconductivity paper is a focused case study, but the implications are broader. If you want an LLM to be useful in a scientific context, you probably need to control what it reads.

The takeaway for me is cautiously optimistic. LLMs aren’t ready to replace human experts, and they probably never will be for genuinely open research questions. But as tools for getting up to speed on a complex field, especially when paired with curated references, they’re already useful. Just don’t ask them to resolve a 40-year-old debate over coffee.
