What Matters When Cotraining Robot Manipulation Policies
on Everyday Human Videos?

Richard Li, Aditya Prakash, Andrew Wen, Branden Romero, Saurabh Gupta, Yilun Du, Pulkit Agrawal

We compare our policy to a robot-data-only baseline and show zero-shot generalization to unseen backgrounds, objects, and distractors.

System Diagram

System diagram: Large-scale human data, action alignment, and policy cotraining for robot manipulation

Autonomous Rollouts

Zero-shot policy rollouts on unseen objects and backgrounds across all six manipulation tasks. Select a task to see example rollouts from our test environments.

Abstract

Existing human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations collected by roboticists. The human motions in these videos are carefully orchestrated to resemble robot motions, and the 3D hand poses are highly accurate due to the use of specialized capture devices. A more plentiful and diverse source of data is everyday Internet videos of humans accomplishing daily tasks. To study cotraining in this setting, we curate a large-scale dataset of 532 videos of natural human activity (cooking) with over 28 hours of high-quality hand labels. We find that hand-label quality affects transfer performance, but even with high-quality hands, the inherent motion gap can hinder transfer if the policy network cannot properly specialize to each embodiment. We propose a cotraining recipe based on a token-level fusion architecture, embodiment-specific action encoders and decoders, and a loss that upweights robot data. Our recipe yields consistent scaling improvements when cotraining on human data, with a mean success rate improvement of +29.7% in the low-robot-data regime across six manipulation tasks.
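The upweighting term in the recipe can be illustrated with a small sketch: in a mixed batch of robot and human samples, per-sample losses are reweighted so robot data dominates the gradient. The function name and the weight value below are hypothetical illustrations, not the paper's actual hyperparameters.

```python
import numpy as np

def cotraining_loss(per_sample_loss, is_robot, robot_weight=5.0):
    """Weighted mean loss over a mixed robot/human batch.

    per_sample_loss: array of scalar losses, one per sample.
    is_robot: boolean array marking which samples come from robot data.
    robot_weight: hypothetical upweighting factor for robot samples
    (the paper's actual value is not given here).
    """
    weights = np.where(is_robot, robot_weight, 1.0)
    # Normalize by total weight so the loss scale is stable regardless
    # of the robot/human mixture ratio in the batch.
    return float(np.sum(weights * per_sample_loss) / np.sum(weights))

# With robot_weight=3, a robot sample contributes 3x as much as a human one.
loss = cotraining_loss(np.array([2.0, 0.0]),
                       np.array([True, False]),
                       robot_weight=3.0)
# (3 * 2.0 + 1 * 0.0) / (3 + 1) = 1.5
```

In practice the same idea applies to any per-sample objective (e.g. an action-prediction loss), with embodiment-specific encoders and decoders producing the per-sample losses that get mixed here.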

532
Human Videos
28h+
Hand Labels
+29.7%
Success Rate Gain
6
Manipulation Tasks

TriHands Visualization

We build TriHands, a dataset of everyday cooking videos with accurate 3D hands produced by a multi-view triangulation pipeline.
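A standard way to triangulate a 3D keypoint from multiple calibrated views is the linear (DLT) method; the sketch below is a generic illustration of that technique, not necessarily the exact pipeline used to build TriHands.

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one 3D point from >= 2 calibrated views.

    proj_mats: list of 3x4 camera projection matrices.
    points_2d: list of (u, v) pixel observations, one per view.
    Returns the 3D point in world coordinates.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each observation contributes two linear constraints on the
        # homogeneous 3D point X: u * (P[2] @ X) = P[0] @ X, etc.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.asarray(rows)
    # The least-squares solution of A @ X = 0 is the right singular
    # vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

For hand keypoints, this is run per joint using 2D detections from the egocentric and exocentric views; per-view detection confidences can be folded in by weighting the corresponding rows of `A`.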

Below we visualize 2D projections of our triangulated hand keypoints in the egocentric and exocentric camera views across diverse cooking scenes.

TriHands triangulation visualizations (first five scenes).

TriHands triangulation visualizations (remaining four scenes).

Training & Test Environments

Training Environments (10 scenes)

Test Environments (unseen objects & backgrounds)

Acknowledgments

We thank the EgoExo4D team for providing the dataset and calibration data. We also thank all data collectors and annotators who contributed to the robot demonstration dataset. This work was supported in part by [funding sources to be added].

BibTeX

@inproceedings{anonymous2026cotraining,
  title     = {What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?},
  author    = {Anonymous},
  booktitle = {TBD},
  year      = {2026}
}