A Tesla Model S slows at a San Francisco crosswalk. Two pedestrians stand at the curb. One glances at the other and nods. The other steps forward. The car waits, sensors tracking motion but blind to the exchange that just decided who would cross first.
In a new test, researchers pitted more than 350 state‑of‑the‑art AI systems against human observers to see which could read such moments. Not a single AI matched human performance.
When 350 AI Models Watched Humans Work
Cognitive scientists at Johns Hopkins University showed three‑second video clips of people collaborating or acting alone to both humans and AI. The team, led by cognitive scientist Chaz Firestone and research scientist Yael Benn, drew 250 clips from the Moments in Time dataset. They reserved 200 clips to train the AI systems. The remaining 50 served as the test set.
At least ten human raters watched each clip. They scored what they saw on five‑point scales: How much were people collaborating? Who was leading? How smooth was the interaction? What was the shared intent?
The same clips fed into more than 350 AI architectures. The roster included large language models trained on billions of words, vision networks that parse images and multimodal systems that process video and audio together. Some came from OpenAI and Google DeepMind. Others originated in university labs at MIT, Stanford and Carnegie Mellon.
The team also recorded brain activity using functional magnetic resonance imaging (fMRI). Four volunteers lay inside a scanner and watched the clips. The imaging revealed which brain regions activated during social perception, particularly the lateral visual stream—the pathway that lights up when humans read intentions and roles.
The Pattern AI Can't See
Human raters agreed strongly on what they observed, while AI predictions scattered across the scale. The human ratings reached a Cronbach's alpha of 0.85 out of a possible 1.0, a statistical measure of inter‑rater reliability that indicates strong consensus. When one person saw smooth collaboration, others did too. When one detected a leader, the group concurred.
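For readers curious about the statistic, Cronbach's alpha compares the variance of each rater's individual scores with the variance of the raters' summed scores. Here is a minimal sketch of that calculation, using a small hypothetical ratings matrix rather than the study's data:

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a (clips x raters) matrix of scores."""
    k = ratings.shape[1]                          # number of raters
    item_vars = ratings.var(axis=0, ddof=1)       # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of per-clip summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: 5 clips scored by 4 raters on a 1-5 scale.
ratings = np.array([
    [5, 5, 4, 5],
    [2, 1, 2, 2],
    [4, 4, 5, 4],
    [1, 1, 1, 2],
    [3, 3, 3, 4],
])
print(round(cronbach_alpha(ratings), 2))  # strong agreement pushes alpha toward 1.0
```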
The best AI models fell far short across all four dimensions. Their predictions missed human judgments by more than 1.5 points on the five‑point scale—a difference larger than the gap between "somewhat collaborating" and "collaborating strongly."
But here's where it gets interesting. The researchers tried feeding the AI systems a second type of input: human‑written captions. A person watched each clip and wrote a precise description: "Two workers hand tools back and forth" or "One person gestures while the other adjusts equipment." Even with these textual scaffolds, the models still failed to predict the social ratings humans generated effortlessly.
Language alone didn't bridge the gap. The AI could parse the words but couldn't extract the invisible structure—the mutual awareness, joint attention and role hierarchy—that humans read automatically.
What AI Sees vs. What It Misses
Current AI learns statistical regularities in pixels and words, not the mental states that generate behavior. Human social perception operates differently. Cognitive scientists call it "theory of mind"—the automatic simulation of others' goals, beliefs and focus. You watch someone reach for a door and instantly infer whether they expect you to hold it, whether they see you, whether they're in a hurry.
Brain imaging studies show that this inference happens in specific regions. The lateral visual stream activates when we process not just shapes and motion but the intentions behind them. Firestone's fMRI data confirmed this. When volunteers watched collaborative clips, activity surged in the pathways associated with reading minds, not just tracking movement.
AI architectures encode visual input as feature maps—arrays of numbers representing edges, textures and objects. Language models represent text as token embeddings—vectors in high‑dimensional space. Neither representation includes explicit slots for agents, their goals or the shared structures that bind them.
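To make that contrast concrete, here is a rough sketch of what those representations look like in code; the array shapes are illustrative assumptions, not taken from any model in the study:

```python
import numpy as np

# A vision network's output for one video frame: a feature map, i.e. a grid of
# numeric activations (channels x height x width). Nothing in this array is
# labeled "agent", "goal" or "shared attention".
feature_map = np.random.rand(256, 14, 14)

# A language model's view of a caption: one embedding vector per token.
# Again, just coordinates in a high-dimensional space.
tokens = ["two", "workers", "hand", "tools", "back", "and", "forth"]
token_embeddings = np.random.rand(len(tokens), 768)

print(feature_map.shape, token_embeddings.shape)  # (256, 14, 14) (7, 768)
```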
AI can identify two people and a ball. It can label the action as "throwing." It cannot tell whether they're playing catch as a team, whether one is teaching the other, or whether one is leading and the other following. It reads social cues the way a tourist watches an unfamiliar sport, seeing the players move but missing the coordination and strategy unfolding between them.
Why Autonomous Systems Struggle in Human Spaces
Self‑driving cars and warehouse robots relying on these models miss critical social signals. Waymo vehicles navigate Phoenix intersections using sensor arrays that detect pedestrians, cyclists and other cars. The system knows where bodies are and how fast they're moving. It doesn't know that two pedestrians are hesitating together, that a cyclist is making eye contact with a driver, or that a parent is signaling a child to wait.
Amazon operates more than 520,000 robotic drive units in its fulfillment centers. These machines move shelves, track inventory and navigate around workers. But when a worker glances at a coworker to signal, "I'll hand you this package," the robot doesn't register the exchange. It sees two people near a conveyor belt, not a coordinated handoff. The result: inefficient paths, hesitation or near‑collisions.
Boston Dynamics' humanoid robots can climb stairs, open doors and carry boxes. They struggle in spaces where humans collaborate fluidly because they lack the perceptual machinery to parse who's doing what with whom. A robot entering a room sees individuals moving. A human sees a team working, a hierarchy forming, a moment of cooperation.
These aren't hypothetical limits. They surface in real deployments where AI must share space with people. Safety protocols in autonomous systems currently assume worst‑case scenarios—stop if uncertain—because the models cannot distinguish coordinated behavior from independent action.
The Missing Element AI Cannot Replicate
Humans layer visual input with automatic inference about shared goals, attention flow and role dynamics. You watch two people pass a ball and instantly sense who's leading, who's following, whether they're equals or mentor and student. This understanding arrives without conscious effort, milliseconds after the image hits your retina.
The computation happens in parallel across multiple brain systems. The lateral visual stream processes shapes and motion. The superior temporal sulcus tracks gaze direction. The medial prefrontal cortex simulates intentions. The temporoparietal junction integrates perspective‑taking. Together, they construct a model not just of what is happening but of what people believe, intend and expect.
AI has no equivalent mechanism. Current architectures perform pattern recognition: correlating inputs with labels, predicting next tokens, clustering similar features. They do not simulate hidden mental states. They do not build internal models of agents and goals.
Bridging the gap may require hybrid designs. Some researchers propose augmenting neural networks with symbolic‑reasoning layers that represent agents, actions and causal relations. Others suggest training on datasets annotated not just for objects and actions but for coordination types, attention patterns and interaction roles.
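As a hedged illustration of the first idea, a symbolic layer of that kind might expose explicit slots for agents, goals and roles on top of the raw pixels. The classes below are hypothetical and sketch the concept, not any published architecture:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Agent:
    name: str
    goal: str                           # inferred intention, e.g. "receive the package"
    attending_to: Optional[str] = None  # who or what this agent is focused on

@dataclass
class Interaction:
    agents: list
    action: str                                 # e.g. "handoff"
    roles: dict = field(default_factory=dict)   # agent name -> "leader" / "follower"

# The kind of structured scene description a hybrid system might build above pixels.
worker_a = Agent(name="worker_a", goal="pass the package", attending_to="worker_b")
worker_b = Agent(name="worker_b", goal="receive the package", attending_to="worker_a")
scene = Interaction(
    agents=[worker_a, worker_b],
    action="handoff",
    roles={"worker_a": "leader", "worker_b": "follower"},
)
print(scene.action, scene.roles)
```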
Firestone's team tested whether AI's internal representations at least correlate with human brain activity. They compared the feature maps from vision models with the fMRI responses in the lateral visual stream. The correlation was weak. The patterns that drive AI classifications do not resemble the neural codes that drive human social perception.
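One standard way to run that kind of model‑to‑brain comparison is representational similarity analysis: compute pairwise clip similarities from the model's features, do the same for the brain responses, then correlate the two patterns. A minimal sketch under that assumption, with random arrays standing in for real features and voxel data:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_clips = 50
model_features = rng.random((n_clips, 512))    # model activations per clip (illustrative)
brain_responses = rng.random((n_clips, 2000))  # fMRI voxel responses per clip (illustrative)

# Representational dissimilarity matrices: pairwise distances between clips.
model_rdm = pdist(model_features, metric="correlation")
brain_rdm = pdist(brain_responses, metric="correlation")

# If the model "sees" the clips the way the brain does, the two RDMs correlate.
rho, _ = spearmanr(model_rdm, brain_rdm)
print(f"model-brain similarity: {rho:.2f}")  # near zero for unrelated representations
```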
The Road From Seeing to Understanding
The next generation of AI may need to simulate not just what humans do but what they intend. That shift—from pattern recognition to something closer to mind reading—won't come from scaling up existing architectures. It requires building models that represent the world not as pixels and tokens but as scenes populated by goal‑directed agents.
For now, that crosswalk in San Francisco still depends on human intuition. The pedestrians exchange a glance and a nod. One steps forward. The car, lacking the circuitry to parse the exchange, waits for motion it can measure. As autonomous systems move from controlled test tracks to crowded sidewalks, the gap between seeing and understanding may determine whether machines become true collaborators or remain very fast, very literal observers of a social world they cannot yet read.
Can new training methods teach models to infer coordination, not just detect it? Will hybrid architectures that blend neural networks with symbolic reasoning close the perception gap? The experiments to answer those questions are just beginning.