A Tesla Model S slows at a San Francisco crosswalk. Two pedestrians stand at the curb. One glances at the other and nods. The other steps forward. The car waits, sensors tracking motion but blind to the exchange that just decided who would cross first.
In a new test, researchers pitted more than 350 state‑of‑the‑art AI systems against human observers to see which could read such moments. Not a single AI matched human performance.
When 350 AI Models Watched Humans Work
Cognitive scientists at Johns Hopkins University showed three‑second video clips of people collaborating or acting alone to both humans and AI. The team, led by cognitive scientist Chaz Firestone and research scientist Yael Benn, drew 250 clips from the Moments in Time dataset. They reserved 200 clips to train the AI systems. The remaining 50 served as the test set.
At least ten human raters watched each clip. They scored what they saw on five‑point scales: How much were people collaborating? Who was leading? How smooth was the interaction? What was the shared intent?
The same clips fed into more than 350 AI architectures. The roster included large language models trained on billions of words, vision networks that parse images and multimodal systems that process video and audio together. Some came from OpenAI and Google DeepMind. Others originated in university labs at MIT, Stanford and Carnegie Mellon.
The team also recorded brain activity using functional magnetic resonance imaging (fMRI). Four volunteers lay inside a scanner and watched the clips. The imaging revealed which brain regions activated during social perception, particularly the lateral visual stream—the pathway that lights up when humans read intentions and roles.
The Pattern AI Can't See
Human raters agreed strongly on what they observed, while AI predictions scattered across the scale. The human ratings reached a Cronbach's alpha of 0.85 out of a possible 1.0, a statistical measure of inter‑rater reliability that indicates strong consensus. When one person saw smooth collaboration, others did too. When one detected a leader, the group concurred.
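For readers curious about the statistic, Cronbach's alpha compares the variance of each rater's individual scores with the variance of the raters' summed scores. Here is a minimal sketch of that calculation, using a small hypothetical ratings matrix rather than the study's data:

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a (clips x raters) matrix of scores."""
    k = ratings.shape[1]                          # number of raters
    item_vars = ratings.var(axis=0, ddof=1)       # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of per-clip summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: 5 clips scored by 4 raters on a 1-5 scale.
ratings = np.array([
    [5, 5, 4, 5],
    [2, 1, 2, 2],
    [4, 4, 5, 4],
    [1, 1, 1, 2],
    [3, 3, 3, 4],
])
print(round(cronbach_alpha(ratings), 2))  # strong agreement pushes alpha toward 1.0
```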
The best AI models fell far short across all four dimensions. Their predictions missed human judgments by more than 1.5 points on the five‑point scale—a difference larger than the gap between "somewhat collaborating" and "collaborating strongly."
But here's where it gets interesting. The researchers tried feeding the AI systems a second type of input: human‑written captions. A person watched each clip and wrote a precise description: "Two workers hand tools back and forth" or "One person gestures while the other adjusts equipment." Even with these textual scaffolds, the models still failed to predict the social ratings humans generated effortlessly.
Language alone didn't bridge the gap. The AI could parse the words but couldn't extract the invisible structure—the mutual awareness, joint attention and role hierarchy—that humans read automatically.
What AI Sees vs. What It Misses
Current AI learns statistical regularities in pixels and words, not the mental states that generate behavior. Human social perception operates differently. Cognitive scientists call it "theory of mind"—the automatic simulation of others' goals, beliefs and focus. You watch someone reach for a door and instantly infer whether they expect you to hold it, whether they see you, whether they're in a hurry.
Brain imaging studies show that this inference happens in specific regions. The lateral visual stream activates when we process not just shapes and motion but the intentions behind them. Firestone's fMRI data confirmed this. When volunteers watched collaborative clips, activity surged in the pathways associated with reading minds, not just tracking movement.
AI architectures encode visual input as feature maps—arrays of numbers representing edges, textures and objects. Language models represent text as token embeddings—vectors in high‑dimensional space. Neither representation includes explicit slots for agents, their goals or the shared structures that bind them.
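To make that contrast concrete, here is a rough sketch of what those representations look like in code; the array shapes are illustrative assumptions, not taken from any model in the study:

```python
import numpy as np

# A vision network's output for one video frame: a feature map, i.e. a grid of
# numeric activations (channels x height x width). Nothing in this array is
# labeled "agent", "goal" or "shared attention".
feature_map = np.random.rand(256, 14, 14)

# A language model's view of a caption: one embedding vector per token.
# Again, just coordinates in a high-dimensional space.
tokens = ["two", "workers", "hand", "tools", "back", "and", "forth"]
token_embeddings = np.random.rand(len(tokens), 768)

print(feature_map.shape, token_embeddings.shape)  # (256, 14, 14) (7, 768)
```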
AI can identify two people and a ball. It can label the action as "throwing." It cannot tell whether they're playing catch as a team, whether one is teaching the other, or whether one is leading and the other following. It reads social cues the way a tourist watches an unfamiliar sport, seeing the players move but missing the coordination and strategy unfolding between them.
Why Autonomous Systems Struggle in Human Spaces
Self‑driving cars and warehouse robots relying on these models miss critical social signals. Waymo vehicles navigate Phoenix intersections using sensor arrays that detect pedestrians, cyclists and other cars. The system knows where bodies are and how fast they're moving. It doesn't know that two pedestrians are hesitating together, that a cyclist is making eye contact with a driver, or that a parent is signaling a child to wait.
Amazon operates more than 520,000 robotic drive units in its fulfillment centers. These machines move shelves, track inventory and navigate around workers. But when a worker glances at a coworker to signal, "I'll hand you this package," the robot doesn't register the exchange. It sees two people near a conveyor belt, not a coordinated handoff. The result: inefficient paths, hesitation or near‑collisions.
Boston Dynamics' humanoid robots can climb stairs, open doors and carry boxes. They struggle in spaces where humans collaborate fluidly because they lack the perceptual machinery to parse who's doing what with whom. A robot entering a room sees individuals moving. A human sees a team working, a hierarchy forming, a moment of cooperation.
These aren't hypothetical limits. They surface in real deployments where AI must share space with people. Safety protocols in autonomous systems currently assume worst‑case scenarios—stop if uncertain—because the models cannot distinguish coordinated behavior from independent action.
The Missing Element AI Cannot Replicate
Humans layer visual input with automatic inference about shared goals, attention flow and role dynamics. You watch two people pass a ball and instantly sense who's leading, who's following, whether they're equals or mentor and student. This understanding arrives without conscious effort, milliseconds after the image hits your retina.
The computation happens in parallel across multiple brain systems. The lateral visual stream processes shapes and motion. The superior temporal sulcus tracks gaze direction. The medial prefrontal cortex simulates intentions. The temporoparietal junction integrates perspective‑taking. Together, they construct a model not just of what is happening but of what people believe, intend and expect.
AI has no equivalent mechanism. Current architectures perform pattern recognition: correlating inputs with labels, predicting next tokens, clustering similar features. They do not simulate hidden mental states. They do not build internal models of agents and goals.
Bridging the gap may require hybrid designs. Some researchers propose augmenting neural networks with symbolic‑reasoning layers that represent agents, actions and causal relations. Others suggest training on datasets annotated not just for objects and actions but for coordination types, attention patterns and interaction roles.
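As a hedged illustration of the first idea, a symbolic layer of that kind might expose explicit slots for agents, goals and roles on top of the raw pixels. The classes below are hypothetical and sketch the concept, not any published architecture:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Agent:
    name: str
    goal: str                           # inferred intention, e.g. "receive the package"
    attending_to: Optional[str] = None  # who or what this agent is focused on

@dataclass
class Interaction:
    agents: list
    action: str                                 # e.g. "handoff"
    roles: dict = field(default_factory=dict)   # agent name -> "leader" / "follower"

# The kind of structured scene description a hybrid system might build above pixels.
worker_a = Agent(name="worker_a", goal="pass the package", attending_to="worker_b")
worker_b = Agent(name="worker_b", goal="receive the package", attending_to="worker_a")
scene = Interaction(
    agents=[worker_a, worker_b],
    action="handoff",
    roles={"worker_a": "leader", "worker_b": "follower"},
)
print(scene.action, scene.roles)
```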
Firestone's team tested whether AI's internal representations at least correlate with human brain activity. They compared the feature maps from vision models with the fMRI responses in the lateral visual stream. The correlation was weak. The patterns that drive AI classifications do not resemble the neural codes that drive human social perception.
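One standard way to run that kind of model‑to‑brain comparison is representational similarity analysis: compute pairwise clip similarities from the model's features, do the same for the brain responses, then correlate the two patterns. A minimal sketch under that assumption, with random arrays standing in for real features and voxel data:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_clips = 50
model_features = rng.random((n_clips, 512))    # model activations per clip (illustrative)
brain_responses = rng.random((n_clips, 2000))  # fMRI voxel responses per clip (illustrative)

# Representational dissimilarity matrices: pairwise distances between clips.
model_rdm = pdist(model_features, metric="correlation")
brain_rdm = pdist(brain_responses, metric="correlation")

# If the model "sees" the clips the way the brain does, the two RDMs correlate.
rho, _ = spearmanr(model_rdm, brain_rdm)
print(f"model-brain similarity: {rho:.2f}")  # near zero for unrelated representations
```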
The Road From Seeing to Understanding
The next generation of AI may need to simulate not just what humans do but what they intend. That shift—from pattern recognition to something closer to mind reading—won't come from scaling up existing architectures. It requires building models that represent the world not as pixels and tokens but as scenes populated by goal‑directed agents.
For now, that crosswalk in San Francisco still depends on human intuition. The pedestrians exchange a glance and a nod. One steps forward. The car, lacking the circuitry to parse the exchange, waits for motion it can measure. As autonomous systems move from controlled test tracks to crowded sidewalks, the gap between seeing and understanding may determine whether machines become true collaborators or remain very fast, very literal observers of a social world they cannot yet read.
Can new training methods teach models to infer coordination, not just detect it? Will hybrid architectures that blend neural networks with symbolic reasoning close the perception gap? The experiments to answer those questions are just beginning.