Two people glance at each other across a crowded room. In milliseconds, you know they're collaborating—not competing, not strangers, not waiting. An AI watching the same scene? It's still guessing.
That gap—between human intuition and machine interpretation—is wider than we thought. A study published at the International Conference on Learning Representations (ICLR) in April 2025 reveals that even the most advanced AI models struggle to interpret the social dynamics humans read effortlessly.
The research, led by scientists at Johns Hopkins University, pitted more than 350 language, image, and video AI models against human perception. The result: no AI model could adequately match how people understand and respond to social behavior in real time.
This isn't just an academic curiosity. It's a fundamental limitation with real-world stakes—for autonomous vehicles navigating pedestrian crossings, delivery robots interpreting when someone holds a door open, and any technology that must move safely through human spaces.
What Social Interaction Actually Involves
To see what AI can't do, we first need to clarify what humans do without thinking.
Social interaction isn't just seeing people move. It's reading body language, interpreting context, predicting intentions, and sensing collaboration or conflict in a glance. When two people assemble furniture together, you instantly recognize coordination. When they work on separate tasks in the same room, you know they're coexisting, not cooperating.
These judgments happen in fractions of a second. They rely on pattern recognition, contextual memory, and emotional inference—cognitive processes woven so deeply into perception that we barely notice them.
AI, by contrast, sees pixels and patterns. It lacks the lived experience that teaches humans what collaboration looks like versus competition, what hesitation means versus confidence.
How Scientists Tested AI Against Human Perception
The Johns Hopkins team, made up of researchers Kathy Garcia, Emalie McMahon, Colin Conwell, Michael F. Bonner, and Leyla Isik, designed an experiment to measure this gap precisely.
The Three-Second Video Experiment
Participants watched 250 short video clips—each just three seconds long—drawn from the Moments in Time dataset. In these clips, people performed tasks together or independently, demonstrating different aspects of social interaction.
After watching, participants rated characteristics important for understanding social dynamics on a scale from 1 to 5. Questions included: Are these people working together? Is this interaction cooperative or independent? What is the social relationship here?
What AI Models Were Asked to Do
Researchers fed the same videos to more than 350 AI systems, spanning image and video models as well as large language models (systems trained on vast amounts of text to predict and generate human-like responses).
All of the models were asked to predict how humans would rate the videos. Because language models can't watch video, they were given short captions written by humans describing each interaction instead.
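To make that comparison concrete, here is a minimal sketch, in Python, of the kind of score such an analysis produces: a rank correlation between a model's predicted ratings and the average human rating per clip. The data below are simulated, and the study's actual pipeline is more involved.

```python
# Minimal sketch (simulated data): how closely do a model's predicted
# ratings track average human ratings across the 250 clips?
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical mean human rating (1-5) per clip for one question,
# e.g. "Are these people working together?"
human = rng.uniform(1, 5, size=250)
# A hypothetical model's predictions: correlated with humans, but noisy.
model = human + rng.normal(0, 1.2, size=250)

rho, _ = spearmanr(human, model)
print(f"Spearman correlation with human ratings: {rho:.2f}")
```

A model that truly captured social judgments would push that correlation toward the level at which humans agree with one another, a ceiling the study's models fell well short of.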
To deepen the comparison, the team also collected fMRI data from four participants as they watched the same clips, measuring neural activity in regions associated with social cognition, specifically the lateral stream, a visual pathway implicated in processing social information.
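Relating model features to brain data is commonly done with an "encoding model": fit a regularized linear map from a model's clip features to each voxel's response, then score predictions on held-out clips. The sketch below shows that general recipe, not the authors' exact pipeline; the feature dimensions, voxel counts, and simulated signal are all assumptions.

```python
# Hedged sketch (simulated data): ridge-regression encoding model from
# a video model's clip features to fMRI voxel responses.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

features = rng.normal(size=(250, 512))        # hypothetical 512-d embedding per clip
W = rng.normal(size=(512, 100))               # hypothetical feature-to-voxel weights
voxels = features @ W + 5.0 * rng.normal(size=(250, 100))  # signal plus noise

X_tr, X_te, y_tr, y_te = train_test_split(
    features, voxels, test_size=0.2, random_state=0)

# Cross-validated ridge picks the regularization strength automatically.
enc = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, y_tr)
pred = enc.predict(X_te)

# Score: correlation between predicted and measured response, per voxel.
per_voxel_r = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1]
               for v in range(voxels.shape[1])]
print(f"mean held-out voxel correlation: {np.mean(per_voxel_r):.2f}")
```

The higher that held-out correlation, the more of the brain's response the model's features can account for; in the study, video models managed this only in certain regions.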
Why AI Struggles With Social Dynamics
The results were clear: AI models could not reliably predict human judgments about social behavior.
Language models performed relatively well at predicting human ratings when given text captions. Video models showed some ability to predict brain responses in certain regions. But no single model excelled at both behavioral judgments and social brain activity.
Think of it like reading sheet music versus feeling rhythm. AI sees the notes but misses the beat that makes humans move together.
The researchers concluded that current AI architectures lack something fundamental about how the human brain interprets dynamic social interaction so quickly and accurately. The missing piece isn't just more data or better algorithms; it's something closer to lived understanding, the kind that comes from being a social creature navigating a social world.
What This Means for Autonomous Technology
This limitation isn't abstract. It has immediate implications for technologies already entering public spaces.
Self-Driving Cars and Social Navigation
Autonomous vehicles rely on AI to interpret pedestrian behavior. A person making eye contact at a crosswalk signals intent to cross. A group hesitating on the curb suggests uncertainty. These cues—invisible to current AI—are critical for safe navigation.
If an AI can't distinguish collaboration from coexistence in a three-second video, how reliably can it interpret the social choreography of a busy intersection?
Assistant Robots in Human Spaces
Delivery robots, warehouse assistants, and service machines must navigate environments filled with people. They need to recognize when someone is blocking a path intentionally versus accidentally, when a gesture means "go ahead" versus "wait."
Without the ability to read social dynamics, these systems risk awkward interactions at best—and safety failures at worst.
The Missing Piece in AI Architecture
What exactly do humans possess that AI lacks?
The Johns Hopkins researchers point to something deeper than pattern recognition. Humans don't just process visual information—they interpret it through layers of social experience, emotional context, and predictive modeling built over a lifetime of interaction.
AI models, even those trained on billions of images and videos, lack this embodied knowledge. They can identify objects, track motion, and classify actions. But they can't feel the difference between a tense silence and a comfortable one, between cooperation and competition, between invitation and dismissal.
That gap—between seeing and understanding—is where current AI architecture falls short.
What Comes Next for AI Development
The research team made their findings publicly available, sharing code, captions, behavioral data, and fMRI data through the Open Science Framework and inviting other researchers to build on the work.
In a follow-up study posted in October 2025, Garcia and Isik introduced a human-similarity benchmark with approximately 49,000 odd-one-out judgments. They also developed a method to fine-tune video models to better align with human social judgments.
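As an illustration of how a benchmark like that can score a model: given a triplet of clips, treat the model's "odd one out" as the clip whose embedding is least similar to the other two, and count how often that matches the human choice. Everything in the sketch below (embeddings, triplets, human picks) is simulated, and the real benchmark's format may differ.

```python
# Sketch (simulated data): scoring a model on odd-one-out triplets.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical clip embeddings from a video model, unit-normalized
# so that dot products are cosine similarities.
emb = rng.normal(size=(250, 128))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T  # pairwise cosine similarity between all clips

def model_odd_one_out(i, j, k):
    """The model's pick: the clip least similar to the other two."""
    scores = [sim[i, j] + sim[i, k],   # how much clip i resembles j and k
              sim[j, i] + sim[j, k],
              sim[k, i] + sim[k, j]]
    return (i, j, k)[int(np.argmin(scores))]

# Simulated triplets and human choices (a real benchmark supplies both).
triplets = np.array([rng.choice(250, size=3, replace=False)
                     for _ in range(1000)])
human_choices = triplets[:, 0]

agreement = np.mean([model_odd_one_out(*t) == h
                     for t, h in zip(triplets, human_choices)])
print(f"model-human agreement: {agreement:.1%}")
```

Fine-tuning a video model, as the follow-up study does, amounts to nudging that embedding space so agreement with human judgments rises.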
These steps suggest a path forward: not just training AI on more data, but training it to recognize the patterns that matter most to human social cognition.
The question isn't whether AI will learn to read social cues—it's how researchers will teach machines something the human brain does without thinking. Until then, the room remains unreadable to the algorithm watching from the corner.