ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos

University of Pennsylvania
ICRA 2025
Equal Contribution; Correspondence to: junys@seas.upenn.edu

Training robots today requires demonstrations collected on the same robot, in the same room, with the same objects, which scales poorly.

Can robots instead learn general skill policies, tied to no particular robot, scene, or object, from larger and more diverse sources of data?

ZeroMimic Concept Overview

We introduce ZeroMimic, a system that distills robotic manipulation skills from egocentric human web videos for zero-shot deployment in diverse environments with a variety of objects. Critically, ZeroMimic learns purely from passive human videos and does not require any robot data.

Abstract

Many recent advances in robotic manipulation have come through imitation learning, yet these rely largely on mimicking a particularly hard-to-acquire form of demonstrations: those collected on the same robot, in the same room, and with the same objects that the trained policy must handle at test time. In contrast, large pre-recorded human video datasets demonstrating manipulation skills in the wild already exist and contain valuable information for robots. Is it possible to distill a repository of useful robotic skill policies out of such data without any additional requirements on robot-specific demonstrations or exploration? We present ZeroMimic, the first such system, which generates immediately deployable image goal-conditioned skill policies for several common categories of manipulation tasks (opening, closing, pouring, pick&place, cutting, and stirring), each capable of acting upon diverse objects and across diverse unseen task setups. ZeroMimic is carefully designed to exploit recent advances in semantic and geometric visual understanding of human videos, together with modern grasp affordance detectors and imitation policy classes. After training ZeroMimic on the popular EpicKitchens dataset of egocentric human videos, we evaluate its out-of-the-box performance in varied kitchen settings, demonstrating its impressive abilities to handle these varied tasks. To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we will release software and policy checkpoints for all skills.

Overview of ZeroMimic


ZeroMimic is composed of a grasping module and a post-grasp module. The grasping module (left) leverages existing pre-trained models for human interaction affordance prediction and grasp candidate generation to execute a task-relevant grasp. The post-grasp module (right) is an imitation policy trained on web videos to predict 6D wrist trajectories. We deploy this trained model directly on the robot. The images in this figure correspond to a real execution of the ZeroMimic “slide opening” skill for a drawer-opening task.
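For readers who want the control flow at a glance, the sketch below mirrors this two-module decomposition: an affordance-guided grasp followed by a goal-conditioned policy that predicts 6D wrist waypoints. It is a minimal illustration only; the callable names and interfaces (`affordance_model`, `grasp_generator`, `post_grasp_policy`) are placeholders of our own, not ZeroMimic's released API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SkillRollout:
    grasp_pose: np.ndarray        # 4x4 end-effector pose used for the grasp
    wrist_waypoints: list         # sequence of 4x4 post-grasp wrist poses

def run_skill(rgb, depth, goal_image,
              affordance_model, grasp_generator, post_grasp_policy,
              num_steps=10):
    """Illustrative two-stage rollout: grasp, then imitate the post-grasp motion.

    affordance_model(rgb)          -> image region a human would interact with
    grasp_generator(depth, region) -> ranked list of candidate 6-DoF grasp poses
    post_grasp_policy(obs, goal)   -> next 6-DoF wrist pose, goal-image conditioned
    These three callables stand in for the pre-trained components; their exact
    signatures here are assumptions made for this sketch.
    """
    # Grasping module: choose a task-relevant grasp near the predicted interaction region.
    region = affordance_model(rgb)
    grasp_pose = grasp_generator(depth, region)[0]

    # Post-grasp module: roll out the web-video-trained policy to get wrist waypoints.
    waypoints = []
    obs = {"rgb": rgb, "wrist_pose": grasp_pose}
    for _ in range(num_steps):
        next_pose = post_grasp_policy(obs, goal_image)
        waypoints.append(next_pose)
        obs = {"rgb": rgb, "wrist_pose": next_pose}  # in practice, re-observe after each move
    return SkillRollout(grasp_pose=grasp_pose, wrist_waypoints=waypoints)
```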

ZeroMimic Zero-Shot Deployment

We train 9 different skill policies: vertical hinge opening/closing, slide opening/closing, pouring, picking, placing, cutting, and stirring. We evaluate ZeroMimic skill policies out-of-the-box in the real world on a diverse set of objects in diverse unseen environments. See below for the evaluation videos.



Evaluation videos (select an object and instance for each skill): Vertical Hinge Opening, Vertical Hinge Closing, Slide Opening, Slide Closing, Picking, Placing, Pouring, Cutting, Stirring.

ZeroMimic Zero-Shot Quantitative Results

ZeroMimic Quantitative Results
ZeroMimic demonstrates strong generalization capabilities, achieving consistent success across diverse tasks, robot embodiments, and both real-world and simulated environments. The evaluation spans 32 distinct scenarios across 16 object categories in 7 kitchen scenes, highlighting the adaptability and robustness of the system.

Comparison to ReKep

ReKep Results
Concurrent work ReKep [1] optimizes keypoint-based constraints generated by vision-language models (VLMs) to achieve zero-shot robotic behavior. To compare ZeroMimic to ReKep, we perform real-world experiments on 4 tasks using the Franka robot in the Levine Hall Kitchen environment. We observe that the failure cases of ReKep mostly stem from two issues: the vision module generates inaccurate keypoints or associates incorrect keypoints with target objects, and the VLM generates incorrect keypoint-based constraints due to its limited spatial reasoning capabilities.

[1] Huang, Wenlong, et al. "ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation." 8th Annual Conference on Robot Learning.

Grounded Human Wrist Trajectories from Web Videos

To curate diverse and large-scale human behavior data, we use EpicKitchens, an in-the-wild egocentric vision dataset containing 20M frames across 100 hours of daily kitchen activities. To extract wrist trajectories from EpicKitchens, we run HaMeR, a state-of-the-art pre-trained hand-tracking model, to obtain 3D hand pose reconstructions. We leverage the camera extrinsics and intrinsics reconstructed by EPIC-Fields to account for the motion of the egocentric camera and accurately convert human actions into the camera coordinate frame. See the paper for more details.

Camera Info
Camera extrinsics/intrinsics from Structure-from-Motion
Hand Pose
3D hand pose reconstruction from HaMeR

Grounded Wrist Pose

Visualized with sparse point cloud

Projected on images
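As a rough sketch of the coordinate bookkeeping described above, the function below re-expresses per-frame wrist poses (camera-relative, as HaMeR-style reconstructions are) in a single reference camera frame using per-frame camera-to-world extrinsics such as those from EPIC-Fields. The function name and the choice of frame 0 as the reference camera are assumptions made for illustration, not the released pipeline.

```python
import numpy as np

def ground_wrist_trajectory(wrist_in_cam, cam_to_world):
    """Express per-frame wrist poses in the frame-0 camera's coordinate system.

    wrist_in_cam: list of 4x4 wrist poses, each in its own frame's camera coordinates
                  (e.g., built from a HaMeR-style 3D hand reconstruction).
    cam_to_world: list of 4x4 camera-to-world extrinsics for the same frames
                  (e.g., from an EPIC-Fields-style SfM reconstruction).
    """
    world_to_cam0 = np.linalg.inv(cam_to_world[0])
    grounded = []
    for T_wrist_cam, T_cam_world in zip(wrist_in_cam, cam_to_world):
        T_wrist_world = T_cam_world @ T_wrist_cam       # lift into the shared world frame
        grounded.append(world_to_cam0 @ T_wrist_world)  # re-express in the reference camera
    return grounded
```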

Predictions on Unseen Web Videos

We evaluate ZeroMimic policies on unseen web videos from the EpicKitchens dataset. As shown below, while the predicted trajectories do not exactly replicate the ground truth, they still offer plausible paths that accomplish the intended tasks effectively, thanks to training on a diverse set of multi-modal human demonstrations.

Open Drawer

Prediction
Ground Truth

Open Cupboard

Prediction
Ground Truth

Experiment Time-Lapse

Our complete experimental procedure is shown in the following time-lapse video. Before each evaluation, we position the camera and the robot so that the camera's view of the workspace roughly matches the viewpoint of the human hand in the egocentric training videos. We perform 10 trials with varying camera and robot positions that broadly resemble a human's egocentric viewpoint.