Deep Dives

AI evals are becoming the new compute bottleneck

Evaluation costs for AI models have skyrocketed, with agent benchmarks costing tens of thousands of...

VAKRA: What Happens When Enterprise AI Agents Actually Have to Use Tools

IBM Research's VAKRA benchmark tests AI agents on real multi-step workflows with 8,000+ APIs. The...

Training mRNA Language Models Across 25 Species for $165

OpenMed built a protein-to-mRNA pipeline in 55 GPU-hours, comparing architectures like ModernBERT and RoBERTa for...

QIMMA: The Arabic LLM Leaderboard That Actually Checks Its Homework

QIMMA is a quality-first Arabic LLM leaderboard that validates benchmarks before evaluating models. It found...

Google’s TurboQuant Shrinks LLM Memory 6x Without Sacrificing Quality

Google Research's TurboQuant compression algorithm reduces LLM memory usage 6x and boosts speed 8x by...

Fusion power might not get cheap as fast as you think

A new study estimates fusion's experience rate at 2–8%, meaning electricity from fusion plants could...

We Actually Tried a Diagnostic AI on Real Patients. Here’s What Happened.

Google and Beth Israel Deaconess put AMIE, their conversational diagnostic AI, through a real-world clinical...

TurboQuant: Google’s New Compression Tricks That Actually Work

Google Research introduces TurboQuant, a set of compression algorithms that reduce AI model memory without...

Google and NHS test AI for breast cancer screening — here’s what they found

Two new studies in Nature Cancer evaluate Google's mammography AI across NHS screening services, showing...

Can LLMs Actually Help Physicists? Google Put Them to the Test on Superconductivity

Google researchers tested six LLMs on expert-level questions about high-temperature superconductivity. The results show promise...

Groundsource: Google’s Gemini turns news articles into flood data

Google Research introduces Groundsource, a scalable framework using Gemini to extract structured historical data from...

How Many Raters Do You Actually Need for a Reliable AI Benchmark?

Google Research digs into the reproducibility crisis in AI evaluation, asking whether it's better to...
