ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos

University of Pennsylvania
ICRA 2025
Equal Contribution; Correspondence to: junys@seas.upenn.edu

Training robots today requires demonstrations collected on the same robot, in the same room, with the same objects, which scales poorly.

Can robots instead learn general skill policies, tied to no particular robot, scene, or object, from larger and more diverse sources of data?

ZeroMimic Concept Overview

We introduce ZeroMimic, a system that distills robotic manipulation skills from egocentric human web videos for zero-shot deployment in diverse environments with a variety of objects. Critically, ZeroMimic learns purely from passive human videos and does not require any robot data.

Abstract

Many recent advances in robotic manipulation have come through imitation learning, yet these rely largely on mimicking a particularly hard-to-acquire form of demonstrations: those collected on the same robot, in the same room, and with the same objects that the trained policy must handle at test time. In contrast, large pre-recorded human video datasets demonstrating manipulation skills in the wild already exist and contain valuable information for robots. Is it possible to distill a repository of useful robotic skill policies out of such data without any additional requirements on robot-specific demonstrations or exploration? We present ZeroMimic, the first such system, which generates immediately deployable image goal-conditioned skill policies for several common categories of manipulation tasks (opening, closing, pouring, pick&place, cutting, and stirring), each capable of acting upon diverse objects and across diverse unseen task setups. ZeroMimic is carefully designed to exploit recent advances in semantic and geometric visual understanding of human videos, together with modern grasp affordance detectors and imitation policy classes. After training ZeroMimic on the popular EpicKitchens dataset of egocentric human videos, we evaluate its out-of-the-box performance in varied kitchen settings, demonstrating its impressive abilities to handle these varied tasks. To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we will release software and policy checkpoints for all skills.

Overview of ZeroMimic


ZeroMimic is composed of a grasping module and a post-grasp module. The grasping module (left) leverages existing pre-trained models for human interaction affordance prediction and grasp candidate generation to execute a task-relevant grasp. The post-grasp module (right) is an imitation policy trained on web videos to predict 6D wrist trajectories. We deploy this trained model directly on the robot. The images in this figure correspond to a real execution of the ZeroMimic “slide opening” skill for a drawer-opening task.
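For readers who want the control flow at a glance, the sketch below mirrors this two-module decomposition: an affordance-guided grasp followed by a goal-conditioned policy that predicts 6D wrist waypoints. It is a minimal illustration only; the callable names and interfaces (`affordance_model`, `grasp_generator`, `post_grasp_policy`) are placeholders of our own, not ZeroMimic's released API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SkillRollout:
    grasp_pose: np.ndarray        # 4x4 end-effector pose used for the grasp
    wrist_waypoints: list         # sequence of 4x4 post-grasp wrist poses

def run_skill(rgb, depth, goal_image,
              affordance_model, grasp_generator, post_grasp_policy,
              num_steps=10):
    """Illustrative two-stage rollout: grasp, then imitate the post-grasp motion.

    affordance_model(rgb)          -> image region a human would interact with
    grasp_generator(depth, region) -> ranked list of candidate 6-DoF grasp poses
    post_grasp_policy(obs, goal)   -> next 6-DoF wrist pose, goal-image conditioned
    These three callables stand in for the pre-trained components; their exact
    signatures here are assumptions made for this sketch.
    """
    # Grasping module: choose a task-relevant grasp near the predicted interaction region.
    region = affordance_model(rgb)
    grasp_pose = grasp_generator(depth, region)[0]

    # Post-grasp module: roll out the web-video-trained policy to get wrist waypoints.
    waypoints = []
    obs = {"rgb": rgb, "wrist_pose": grasp_pose}
    for _ in range(num_steps):
        next_pose = post_grasp_policy(obs, goal_image)
        waypoints.append(next_pose)
        obs = {"rgb": rgb, "wrist_pose": next_pose}  # in practice, re-observe after each move
    return SkillRollout(grasp_pose=grasp_pose, wrist_waypoints=waypoints)
```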

ZeroMimic Zero-Shot Deployment

We train 9 different skill policies: vertical hinge opening/closing, slide opening/closing, pouring, picking, placing, cutting, and stirring. We evaluate ZeroMimic skill policies out-of-the-box in the real world on a diverse set of objects in diverse unseen environments. See below for the evaluation videos.



Evaluation videos (select an object and instance for each skill): Vertical Hinge Opening, Vertical Hinge Closing, Slide Opening, Slide Closing, Picking, Placing, Pouring, Cutting, Stirring.

ZeroMimic Zero-Shot Quantitative Results

ZeroMimic Quantitative Results
ZeroMimic demonstrates strong generalization capabilities, achieving consistent success across diverse tasks, robot embodiments, and both real-world and simulated environments. The evaluation spans 32 distinct scenarios across 16 object categories in 7 kitchen scenes, highlighting the adaptability and robustness of the system.

Comparison to ReKep

ReKep Results
Concurrent work ReKep [1] optimizes keypoint-based constraints generated by vision-language models (VLMs) to achieve zero-shot robotic behavior. To compare ZeroMimic to ReKep, we perform real-world experiments on 4 tasks using the Franka robot in the Levine Hall Kitchen environment. We observe that the failure cases of ReKep mostly stem from two issues: the vision module generates inaccurate keypoints or associates incorrect keypoints with target objects, and the VLM generates incorrect keypoint-based constraints due to its limited spatial reasoning capabilities.

[1] Huang, Wenlong, et al. "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation." 8th Annual Conference on Robot Learning.

Grounded Human Wrist Trajectories from Web Videos

To curate diverse and large-scale human behavior data, we use EpicKitchens, an in-the-wild egocentric vision dataset containing 20M frames across 100 hours of daily kitchen activities. To extract wrist trajectories from EpicKitchens, we run HaMeR, a state-of-the-art pre-trained hand-tracking model, to obtain 3D hand pose reconstructions. We leverage the camera extrinsics and intrinsics reconstructed by EPIC-Fields to account for the motion of the egocentric camera and accurately convert human actions into the camera coordinate frame. See the paper for more details.

Camera Info
Camera extrinsics/intrinsics from Structure-from-Motion
Hand Pose
3D hand pose reconstruction from HaMeR

Grounded Wrist Pose

Visualized with sparse point cloud

Projected on images
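As a rough sketch of the coordinate bookkeeping described above, the function below re-expresses per-frame wrist poses (camera-relative, as HaMeR-style reconstructions are) in a single reference camera frame using per-frame camera-to-world extrinsics such as those from EPIC-Fields. The function name and the choice of frame 0 as the reference camera are assumptions made for illustration, not the released pipeline.

```python
import numpy as np

def ground_wrist_trajectory(wrist_in_cam, cam_to_world):
    """Express per-frame wrist poses in the frame-0 camera's coordinate system.

    wrist_in_cam: list of 4x4 wrist poses, each in its own frame's camera coordinates
                  (e.g., built from a HaMeR-style 3D hand reconstruction).
    cam_to_world: list of 4x4 camera-to-world extrinsics for the same frames
                  (e.g., from an EPIC-Fields-style SfM reconstruction).
    """
    world_to_cam0 = np.linalg.inv(cam_to_world[0])
    grounded = []
    for T_wrist_cam, T_cam_world in zip(wrist_in_cam, cam_to_world):
        T_wrist_world = T_cam_world @ T_wrist_cam       # lift into the shared world frame
        grounded.append(world_to_cam0 @ T_wrist_world)  # re-express in the reference camera
    return grounded
```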

Predictions on Unseen Web Videos

We evaluate ZeroMimic policies on unseen web videos from the EpicKitchens dataset. As shown below, while the predicted trajectories do not exactly replicate the ground truth, they still offer plausible paths that accomplish the intended tasks effectively, thanks to training on a diverse set of multi-modal human demonstrations.

Open Drawer

Prediction
Ground Truth

Open Cupboard

Prediction
Ground Truth

Experiment Time-Lapse

Our complete experimental procedure is shown in the following time-lapse video. Before each evaluation, we position the camera and the robot so that the camera's view of the workspace roughly matches the viewpoint of the human hand in the egocentric training videos. We perform 10 trials with varying camera and robot positions that broadly resemble a human's egocentric viewpoint.