Google Cloud researchers just dropped a paper at ICLR that tackles something that’s been bugging me for a while: why do deployed agents keep making the same dumb mistakes?
Their new framework, ReasoningBank, is a memory system that doesn’t just log everything an agent does or only celebrate wins. It actively distills both successful trajectories and failed attempts into high-level reasoning strategies. Think of it as an agent that actually gets smarter over time, not just more cluttered.
The problem with existing agent memory
Most agent memory approaches fall into two camps, and neither is great.
First, there’s trajectory memory — the “log everything” approach used in systems like Synapse. It records every click, every API call, every scroll. Sure, you can replay it, but you’re drowning in detail. The agent never asks “why did that work?” or “what could I have done differently?”
Second, there’s workflow memory, which only saves successful runs. That sounds sensible until you realize it actively discards the most valuable learning signal: failures. An agent that never analyzes its mistakes will keep repeating them.
Both approaches share a deeper flaw: they store actions, not reasoning. They remember that you clicked “Load More” but forget the context that made it work — or fail.
How ReasoningBank works
ReasoningBank stores structured memory items with three fields: a title (what’s this strategy about?), a description (brief summary), and content (the actual reasoning steps or decision rationales).
Here’s the kicker: the content isn’t a raw action log. It’s distilled insight. Instead of “click button X > wait 2s > scroll down”, you get “before loading more results, always verify the current page identifier to avoid infinite scroll traps.”
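To make that structure concrete, here's a minimal sketch of what a memory item might look like. The three field names follow the paper's description; the class itself and the example values are my illustration, not the released code.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    """One distilled strategy in the bank (illustrative sketch, not the paper's code)."""
    title: str        # one-line handle: what is this strategy about?
    description: str  # brief summary of when and why it applies
    content: str      # the distilled reasoning steps or decision rationales

# Example: an insight distilled from a failed web-browsing trajectory
item = MemoryItem(
    title="Verify page state before loading more results",
    description="Guards against infinite-scroll traps on paginated lists",
    content=(
        "Before triggering 'Load More', check that the page identifier or "
        "result count changed since the last action; if it didn't, the page "
        "is not paginating and further clicks just waste steps."
    ),
)
```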
The workflow is a continuous loop. Before acting, the agent retrieves relevant memories from the bank. It interacts with the environment, then uses an LLM-as-a-judge to self-assess the outcome — both successes and failures. It extracts generalizable lessons and appends them back to the bank.
Crucially, the self-judgment doesn’t need to be perfect. The authors note ReasoningBank is surprisingly robust against judgment noise. That’s a relief, because LLM-as-judge setups are notoriously inconsistent.
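Putting the loop together, it looks roughly like the sketch below. The `retrieve`, `run`, `assess`, and `distill` helpers are hypothetical names standing in for the paper's actual retrieval, prompting, and judging machinery.

```python
def run_task(task, memory_bank, agent, judge):
    """One pass through the ReasoningBank loop (illustrative sketch;
    the helper methods are hypothetical, not the paper's API)."""
    # 1. Retrieve: pull strategies relevant to this task from the bank
    relevant = memory_bank.retrieve(query=task.description, top_k=5)

    # 2. Act: run the agent with the retrieved strategies injected as guidance
    trajectory = agent.run(task, guidance=relevant)

    # 3. Judge: LLM-as-a-judge labels the trajectory as success or failure
    #    (the paper reports the bank tolerates noise in this label)
    verdict = judge.assess(task, trajectory)

    # 4. Distill: extract generalizable lessons from BOTH outcomes,
    #    returning structured MemoryItem objects rather than raw action logs
    lessons = agent.distill(trajectory, verdict)

    # 5. Consolidate: the paper simply appends new items to the bank
    memory_bank.extend(lessons)
    return trajectory, verdict
```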
What the numbers show
The team evaluated on web browsing and software engineering benchmarks. Compared to baseline approaches, ReasoningBank improved both success rates and efficiency — fewer steps to complete tasks. That makes sense: if you’ve internalized why you failed last time, you skip the dead ends.
I’d love to see more granular ablation studies — how much does failure analysis contribute versus success distillation? — but the direction is solid.
What this means for deployed agents
We’re moving toward agents that run for weeks or months, not just single-shot tasks. In that world, memory isn’t a luxury; it’s the difference between a system that improves and one that stagnates.
ReasoningBank’s approach of learning from failure feels obvious in hindsight, but most systems don’t do it. That’s partly because failure analysis is harder — you need to extract counterfactual signals, not just replay what worked. But it’s also because the field has been obsessed with immediate accuracy rather than long-term adaptation.
The code is on GitHub, so I’ll be curious to see how it performs in practice beyond the paper’s benchmarks. The consolidation strategy (just appending new memories) feels like a placeholder — real-world systems will need deduplication and pruning. But as a starting point for self-evolving agents, this is one of the more practical ideas I’ve seen in a while.
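For what it's worth, that consolidation pass wouldn't have to be elaborate. Something like embedding-similarity deduplication is a plausible first step — the sketch below is my own guess at it, not anything from the paper, and `embed` stands in for any sentence embedder that returns unit-normalized vectors.

```python
import numpy as np

def deduplicate(items, embed, threshold=0.9):
    """Drop new memory items that are near-duplicates of ones already kept.

    `embed` maps text to a unit-normalized vector; `threshold` is the cosine
    similarity above which two items count as redundant. Illustrative sketch.
    """
    kept, kept_vecs = [], []
    for item in items:
        vec = embed(item.content)
        # Keep the item only if it isn't too similar to anything already kept
        if all(float(np.dot(vec, v)) < threshold for v in kept_vecs):
            kept.append(item)
            kept_vecs.append(vec)
    return kept
```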
If your agent isn’t learning from its mistakes, it’s not really learning at all.