Teleoperation Data Is The Wrong Unit Of Scale For Physical AI

May 26, 2026

For the last four years, the default playbook for training humanoid robot foundation models has been simple:

  • Buy a teleoperation rig
  • Hire operators
  • Collect thousands of hours of demonstrations
  • Train a Vision-Language-Action model on the result

(This is how Open X-Embodiment and π0 got built.)

That playbook is dead. Teleop hours are the wrong thing to scale at every stage of training a humanoid foundation model. Pretraining should scale on widely distributed datasets, where egocentric data dominates. Post-training should scale on joint-level precision, data collection rate, and diversity of contact coverage, where IMUs, RL, and on-robot tactile sensors each beat teleop on their respective axes.

Why Teleop Fails At Pretraining

Pretraining a humanoid foundation model is fundamentally a data scaling problem. The job of the base model is to learn what physical reality looks like across people, scenes, tasks, and objects, so the bottleneck is how many distinct physical situations the data covers.

Teleop is bad at delivering this. Almost every teleoperation hour comes from a similar fleet of robots in the same lab or industrial setting, doing the same narrow task list, so the information teleop adds quickly becomes redundant for pretraining. Ego data has the opposite shape: it is passive, sampled from real workers doing real jobs in diverse environments. A thousand workers recording egocentric data sample a wider distribution of states than any feasible teleop program concentrated on a small fleet. The base model is learning what the world contains: how hands grasp, how liquids pour, how fabrics deform, how doors open within particular kinematic envelopes. Coverage scales with the number of distinct situations sampled.
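The coverage argument can be made concrete with a toy simulation. This is only a sketch under loud assumptions: the "world" is a set of numbered situations, every source (a robot cell or a worker) can only reach a small local pool, and all of the numbers below are hypothetical. It shows how, for the same total hours, many sources with few hours each reach far more distinct situations than a few sources with many hours each.

```python
import random

def distinct_situations(num_sources, hours_per_source, pool_size, world_size, seed=0):
    """Count distinct situations seen when each source samples only from its own local pool."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(num_sources):
        # Assumption: a source (robot cell or worker) can only reach a local slice of the world.
        local_pool = rng.sample(range(world_size), pool_size)
        for _ in range(hours_per_source):
            # Assumption: one recorded hour yields one sampled situation.
            seen.add(rng.choice(local_pool))
    return len(seen)

WORLD_SIZE = 100_000  # hypothetical number of distinct situations in the world

# Same total budget of 10,000 hours in both programs.
teleop_coverage = distinct_situations(
    num_sources=10, hours_per_source=1000, pool_size=50, world_size=WORLD_SIZE
)
ego_coverage = distinct_situations(
    num_sources=1000, hours_per_source=10, pool_size=50, world_size=WORLD_SIZE
)

print("teleop distinct situations:", teleop_coverage)
print("ego distinct situations:   ", ego_coverage)
```

In this toy model the teleop program can never exceed 10 x 50 = 500 distinct situations no matter how many hours it logs, while the ego program is limited mainly by how many workers it recruits.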

Additionally, the industry is converging on egocentric data. Apple’s EgoDex released 829 hours of dexterous egocentric video with 3D hand and finger tracking; NVIDIA’s EgoScale released 20,854 hours of action-labeled human egocentric video; DreamDojo’s pretraining corpus reaches 44,711 hours of diverse egocentric video; Build AI’s Egocentric-100K just released 100,000 hours of industrial first-person footage from 14,228 workers.
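For a sense of scale, the hour counts quoted above can be tallied directly. The sketch below uses only the figures from this post; the summed total is an upper bound rather than a deduplicated count, since it assumes (without checking) that the corpora do not overlap.

```python
# Hours of egocentric data as quoted in this post.
ego_corpora_hours = {
    "EgoDex": 829,
    "EgoScale": 20_854,
    "DreamDojo": 44_711,
    "Egocentric-100K": 100_000,
}

# Upper bound only: assumes the corpora share no footage.
total_hours = sum(ego_corpora_hours.values())

# Egocentric-100K: 100,000 hours spread across 14,228 workers.
hours_per_worker = ego_corpora_hours["Egocentric-100K"] / 14_228

print(f"total quoted hours (upper bound): {total_hours:,}")
print(f"Egocentric-100K hours per worker: {hours_per_worker:.2f}")
```

The per-worker figure comes out to roughly seven hours each, which is the shape the previous section argues for: many contributors, each sampling a small slice of the world.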

Why Teleop Fails At Post-Training

Post-training is where teleoperation data has the most value. Fine-tuning a base model on teleop data supplies clean joint trajectories, plus wide coverage that only on-robot demonstrations can provide. The historical case for teleop at post-training rested on two claims: that teleop produces cleaner action labels, and that teleop produces more robot-feasible trajectories.

While teleop may produce cleaner labels by reading the rig’s own joint encoders, collecting it at scale is extremely expensive. Every robot needs its own operator, locked to that machine, and each operator has to work through a diverse set of tasks.

Additionally, turning ego data into clean joint-level action labels is fundamentally an engineering problem. DexCap’s mocap glove plus chest LiDAR rig delivers “precise, occlusion-resistant tracking of wrist and finger motions based on SLAM,” and the Universal Manipulation Interface turns hand-held human demonstrations into hardware-agnostic robot policies through “portable, low-cost, and information-rich data collection.” Recent papers like HumanEgo even report zero-shot learning from purely egocentric data. IMUs can also augment ego data, providing high-rate joint and orientation signals captured directly, with no camera-angle dependence.

The “robot-feasible trajectories” argument is being closed by retargeting and latent action learning. EgoMI and physics-aware residual RL translate human demonstrations directly to robot embodiments, learning the kinematic movements an arm actually makes. Genie learned a vocabulary of latent actions from 30,000 hours of unlabeled video, where each token is the model’s inference about what must have happened between two consecutive frames. DreamDojo uses continuous latent actions as unified proxy actions across its 44,711-hour ego pretraining corpus. Teleop data remains useful for training robot trajectories in post-training, but ego data is steadily gaining ground as these engineering efforts mature.
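To make the latent action idea less abstract, here is a deliberately simplified sketch. It is not how Genie or DreamDojo work internally: those systems learn their action vocabularies from video, whereas this toy swaps in nearest-neighbor quantization of raw frame differences against a fixed, hand-written codebook. The "frames", the codebook, and every function name are hypothetical. The point is only the interface: unlabeled consecutive frames go in, one discrete action token per transition comes out.

```python
def frame_delta(prev_frame, next_frame):
    """Change between two consecutive frames (here, the motion of one tracked point)."""
    return [b - a for a, b in zip(prev_frame, next_frame)]

def nearest_code(delta, codebook):
    """Index of the codebook entry closest to the observed change (squared Euclidean distance)."""
    def squared_distance(code):
        return sum((d - c) ** 2 for d, c in zip(delta, code))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))

def latent_actions(frames, codebook):
    """One token per pair of consecutive frames: a guess at what happened in between."""
    return [
        nearest_code(frame_delta(a, b), codebook)
        for a, b in zip(frames, frames[1:])
    ]

# Hypothetical fixed vocabulary: 0=stay, 1=right, 2=left, 3=up, 4=down.
# A real system would learn this vocabulary instead of hard-coding it.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

# Hypothetical "video": the (x, y) position of a single point over five frames.
frames = [(0.0, 0.0), (0.9, 0.1), (1.0, 1.1), (1.0, 1.1), (0.1, 1.0)]

tokens = latent_actions(frames, CODEBOOK)
print(tokens)  # [1, 3, 0, 2] -> right, up, stay, left
```

No action labels were supplied at any point; the tokens are inferred purely from how the observation changed, which is why this family of methods can consume unlabeled human video.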

What Is The Right Data For Physical AI?

In my eyes, teleoperation data will be replaced.

Generalist AI’s GEN-1, released April 2026, makes the bet explicit:

“The base foundation model is trained without any robot data. Instead, the model uses data from low-cost wearable devices on humans doing millions of activities.”

A frontier humanoid foundation model is trainable from scratch on roughly 500,000 hours of wearable human data, with robot data appearing only at the fine-tuning stage, at about one hour per task. With wearable human data and barely any teleop, robot policies learn production tasks like folding and packing at commercial reliability.
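A quick back-of-the-envelope calculation shows how small the robot-data share becomes under this recipe. The 500,000 pretraining hours and one hour per task come from the figures above; the number of tasks is a hypothetical input chosen only for illustration.

```python
def robot_data_share(num_tasks, pretrain_hours=500_000, robot_hours_per_task=1.0):
    """Fraction of total training hours that come from robot data."""
    robot_hours = num_tasks * robot_hours_per_task
    return robot_hours / (pretrain_hours + robot_hours)

# Hypothetical: fine-tuning on 100 distinct tasks.
share = robot_data_share(num_tasks=100)
print(f"robot data share of total hours: {share:.4%}")
```

Even at a hundred fine-tuned tasks, robot data stays well under a tenth of a percent of total hours in this sketch.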

NVIDIA’s GR00T N1.7 uses 20,854 hours from the EgoScale corpus for pretraining, with synthetic DreamGen trajectories for additional data and a few teleop runs for fine-tuning.

Figure’s Project Go-Big bets big on human video data, committing to gather over 100,000 hours of ego data.

TL;DR: Very bullish on ego data being the way ahead, at least for now.

Further Reading