Deep Dives
AI evals are becoming the new compute bottleneck
Evaluation costs for AI models have skyrocketed, with agent benchmarks costing tens of thousands of...
VAKRA: What Happens When Enterprise AI Agents Actually Have to Use Tools
IBM Research's VAKRA benchmark tests AI agents on real multi-step workflows with 8,000+ APIs. The...
Training mRNA Language Models Across 25 Species for $165
OpenMed built a protein-to-mRNA pipeline in 55 GPU-hours, comparing architectures like ModernBERT and RoBERTa for...
QIMMA: The Arabic LLM Leaderboard That Actually Checks Its Homework
QIMMA is a quality-first Arabic LLM leaderboard that validates benchmarks before evaluating models. It found...
Google’s TurboQuant Shrinks LLM Memory 6x Without Sacrificing Quality
Google Research's TurboQuant compression algorithm reduces LLM memory usage 6x and boosts speed 8x by...
Fusion power might not get cheap as fast as you think
A new study estimates fusion's experience rate at 2–8%, meaning electricity from fusion plants could...
We Actually Tried a Diagnostic AI on Real Patients. Here’s What Happened.
Google and Beth Israel Deaconess put AMIE, their conversational diagnostic AI, through a real-world clinical...
TurboQuant: Google’s New Compression Tricks That Actually Work
Google Research introduces TurboQuant, a set of compression algorithms that reduce AI model memory without...
Google and NHS test AI for breast cancer screening — here’s what they found
Two new studies in Nature Cancer evaluate Google's mammography AI across NHS screening services, showing...
Can LLMs Actually Help Physicists? Google Put Them to the Test on Superconductivity
Google researchers tested six LLMs on expert-level questions about high-temperature superconductivity. The results show promise...
Groundsource: Google’s Gemini turns news articles into flood data
Google Research introduces Groundsource, a scalable framework using Gemini to extract structured historical data from...
How Many Raters Do You Actually Need for a Reliable AI Benchmark?
Google Research digs into the reproducibility crisis in AI evaluation, asking whether it's better to...