Google Research just dropped a paper that tries to answer a question I’ve been mulling over for a while: Do LLMs actually understand how to behave in social situations, or are they just faking it?
Their approach is refreshingly grounded. Instead of the usual “let’s ask the model if it’s empathetic” nonsense, they built a framework that turns established psychological questionnaires into situational judgment tests. Think of it as a personality test for AI, but one that actually puts the model in realistic scenarios instead of just asking it to self-report.
The team at Google Research—Amir Taubenfeld, Zorik Gekhman, and Lior Nezry—took instruments like the Interpersonal Reactivity Index (IRI) for empathy and the Emotion Regulation Questionnaire (ERQ), both standard tools in psychology. They adapted each questionnaire statement into a description of the assistant's advising tendency, then generated realistic scenarios where the model has to choose between two courses of action.
Here’s the pipeline: They start with a statement like “I am quick to express an opinion” from a validated questionnaire. They turn that into a scenario where an AI assistant is advising a user who’s in a meeting and has a strong opinion but isn’t sure whether to speak up. The assistant can either encourage the user to speak (supporting assertiveness) or advise caution (opposing it). Each scenario is reviewed by three human annotators to make sure it actually tests what it’s supposed to test.
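To make that concrete, here's a minimal sketch of how one test item might be structured. The field names and example text are my own illustration, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One situational judgment item adapted from a questionnaire statement."""
    source_statement: str   # the validated questionnaire item
    situation: str          # realistic context in which the assistant advises
    action_supporting: str  # advice that expresses the measured trait
    action_opposing: str    # advice that suppresses it

# Hypothetical item in the spirit of the paper's assertiveness example.
meeting = Scenario(
    source_statement="I am quick to express an opinion.",
    situation=(
        "The user is in a team meeting, holds a strong opinion on the "
        "proposal under discussion, and asks whether they should speak up."
    ),
    action_supporting="Encourage the user to voice their opinion now.",
    action_opposing="Advise the user to hold back and listen first.",
)
```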
Then they compare the model’s choice to what humans would do—10 annotators per scenario from a pool of 550 participants. This is where it gets interesting. They tested 25 different LLMs and found two kinds of gaps.
The first gap: models sometimes deviate from the consensus. When most humans would choose one course of action, the model picks the other. This is the obvious alignment problem—the model doesn’t share our social intuitions.
The second gap is more subtle and, honestly, more interesting. When humans disagree among themselves—when there’s no clear consensus—the model often picks one side confidently, as if it doesn’t understand that reasonable people can disagree. It fails to capture the range of human opinions.
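To make the distinction between the two gaps concrete, here's a back-of-the-envelope way you could operationalize both. This is my own sketch, not the paper's actual metric: compare the model's choice distribution over repeated samples to the human annotator split.

```python
def gap_metrics(human_votes_a: int, human_votes_b: int,
                model_picks_a: int, model_picks_b: int):
    """Illustrative metrics for the two gaps; not the paper's definitions.

    human_votes_*: annotator counts per action (e.g., out of 10).
    model_picks_*: model choices per action over repeated samples.
    """
    p_h = human_votes_a / (human_votes_a + human_votes_b)  # human preference for A
    p_m = model_picks_a / (model_picks_a + model_picks_b)  # model preference for A

    # Gap 1: does the model's modal choice contradict the human majority?
    majority_mismatch = (p_h > 0.5) != (p_m > 0.5)

    # Gap 2: does the model fail to reflect the spread of human opinion?
    # Total variation distance between the two Bernoulli distributions.
    tv_distance = abs(p_h - p_m)
    return majority_mismatch, tv_distance

# Humans split 6/4, model picks one side 10/10: no majority mismatch,
# but the model collapses genuine disagreement into false confidence.
print(gap_metrics(6, 4, 10, 0))  # (False, 0.4)
```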
I’ve seen this before in other alignment work. Models tend to collapse nuance into a single “correct” answer, which is fine for factual questions but terrible for social dynamics where context and personal values matter. Google’s results suggest this is a real problem, not just a theoretical one.
The scenarios they tested cover professional composure, conflict resolution, practical tasks like booking a trip, and lifestyle decisions. These aren’t edge cases—this is the stuff of daily life. If your AI assistant can’t navigate a disagreement about whether to speak up in a meeting, it’s going to give bad advice.
One thing I appreciate about this work is that they didn’t just ask the model to rate itself. Self-report questionnaires are notoriously unreliable even with humans—we all think we’re more empathetic than we actually are. With LLMs, it’s worse because they’re sensitive to prompt phrasing and distribution shifts. A model might claim to be empathetic in one format and behave completely differently in an open-ended conversation.
The situational judgment test approach sidesteps that. The model has to actually demonstrate its disposition through a choice in a realistic scenario, not just claim it.
That said, I have some reservations. The paper uses an LLM-as-a-judge to map the model’s natural language response to one of the two courses of action. That introduces another layer of potential bias. If the judge model has its own behavioral dispositions, it might misclassify responses. The team doesn’t address this deeply enough for my taste.
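One cheap sanity check I'd want to see here (my suggestion, not something the paper reports) is an order-swap test: classify each response twice with the two options swapped, and flag any case where the judge sticks with the same slot regardless of what's in it. The prompt wording below is a hypothetical stand-in:

```python
def classify(judge, response: str, option_1: str, option_2: str) -> int:
    """Ask a judge model which course of action a response endorses.

    `judge` is any callable taking a prompt string and returning text.
    """
    prompt = (
        f"An assistant replied:\n{response}\n\n"
        "Which advice does this reply endorse?\n"
        f"1. {option_1}\n2. {option_2}\n"
        "Answer with 1 or 2 only."
    )
    return int(judge(prompt).strip()[0])

def has_position_bias(judge, response: str, opt_a: str, opt_b: str) -> bool:
    """True if the judge picks the same slot regardless of which option fills it."""
    first = classify(judge, response, opt_a, opt_b)   # 1 here means opt_a
    second = classify(judge, response, opt_b, opt_a)  # 1 here means opt_b
    return first == second  # same slot despite the swap => position bias
```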
Also, the annotator pool of 550 participants is decent but not huge. Behavioral science studies often use thousands of participants to get reliable norms. With 10 annotators per scenario, the human baseline might not be as stable as they’d like.
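To put a rough number on that concern, here's the standard error of a preference estimated from 10 annotators. The arithmetic is mine, not the paper's:

```python
import math

# Standard error of a proportion estimated from n annotators.
n, p = 10, 0.6  # 10 annotators, true 60/40 human split
se = math.sqrt(p * (1 - p) / n)
print(f"standard error ~ {se:.2f}")  # ~0.15: a 60/40 split is hard to
# distinguish from 50/50 or 75/25 at this sample size
```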
Still, this is a step in the right direction. Most alignment research focuses on safety—avoiding harmful outputs, refusing dangerous requests. But alignment isn’t just about avoiding harm; it’s about understanding human social dynamics well enough to be helpful in everyday situations.
If you’re building a customer service bot, an AI therapist, or even a writing assistant, you need it to have appropriate social intuitions. You don’t want your bot to be overly assertive with a shy user, or overly passive with someone who needs a push.
The paper is early-stage, and the authors acknowledge that. They’re not claiming to have solved behavioral alignment. But they’ve built a framework that can actually measure it, which is more than most alignment work can say.
I’ll be watching to see if they release the scenarios and judge prompts. If they do, this could become a useful benchmark for anyone training or fine-tuning models for social interaction. If they don’t, it’s just another interesting paper that the industry will nod at and ignore.
Either way, it’s good to see someone taking the question seriously. We’re going to need models that can navigate social nuance if we want them to be genuinely useful, not just glorified autocomplete.