We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior, such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suite of tiered, semantically grounded tasks decomposed into skill-specific stages, with variations that systematically challenge spatial, physical, and coordination capabilities. Tasks are paired with fine-grained diagnostic metrics and 3,000+ human demonstrations to support imitation learning. Our experiments reveal that policies with similar success rates diverge in how tasks are executed: some struggle with alignment, others with temporally consistent bimanual control. We find that behavioral metrics correlate with success in over half of task-metric pairs and remain informative even when binary success saturates. By pinpointing when and how policies fail, RoboEval enables a deeper, more actionable understanding of robotic manipulation and highlights the need for evaluation tools that go beyond success alone.
RoboEval is a benchmark for evaluating bimanual manipulation policies under diverse task settings. The first iteration consists of 10 base tasks and 3,000+ human demonstrations. The tasks are drawn from activities humans perform in diverse settings, from service-style tasks such as lifting a tray, to warehouse tasks such as closing a box, to industrial tasks such as rotating handwheels. Each task includes multiple variations, ranging from static setups to dynamic shifts in object pose and semantic context, designed to assess policy performance systematically. To facilitate research in imitation learning and demonstration-driven policy training, we provide a suite of raw expert human demonstrations along with fine-grained evaluation metrics such as trajectory smoothness and environment collision counts.
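To make the flavor of these fine-grained metrics concrete, the sketch below scores trajectory smoothness using a log dimensionless-jerk measure (Hogan and Sternad's formulation). This is a generic smoothness metric, not necessarily the exact one RoboEval computes; the function name and the assumption of fixed-rate (T, 3) end-effector positions are illustrative choices.

```python
import numpy as np

def smoothness_log_jerk(positions, dt):
    """Negative log dimensionless jerk of a trajectory (higher = smoother).

    positions: (T, 3) array of end-effector positions sampled at timestep dt.
    """
    vel = np.gradient(positions, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    jerk = np.gradient(acc, dt, axis=0)
    duration = dt * (len(positions) - 1)
    peak_speed = np.max(np.linalg.norm(vel, axis=1))
    # Integral of squared jerk magnitude, approximated as a Riemann sum.
    integral = np.sum(jerk ** 2) * dt
    # Dimensionless jerk; log-transformed for numerical stability.
    return -np.log(integral * duration ** 5 / peak_speed ** 2)
```

A jittery execution of the same path scores strictly lower than a clean one, which is what lets the metric separate policies whose binary success rates are identical.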
Task Name | Variations | # Demos | Avg. Traj. Len. | Skills | Coordination Type |
---|---|---|---|---|---|
Lift Tray | Static, Pos, Rot, PR, Drag | 730 | 77.318 | grasp, lift | Tight Sym. |
Stack Two Cubes | Static, Pos, Rot, PR | 400 | 108.368 | grasp, hold, place | Loosely Coord. |
Stack Single Book Shelf | Static, Pos, PR | 199 | 187.280 | push, grasp, lift, place | Loosely Coord. |
Rod Handover | Static, Pos, Rot, PR, Vertical | 511 | 93.631 | grasp, hold | Loosely Coord. |
Lift Pot | Static, Pos, Rot, PR | 390 | 58.561 | grasp, lift | Tight Sym. |
Pack Box | Static, Pos, Rot, PR | 312 | 123.016 | push | Uncoord. |
Pick Book From Table | Static, Pos, Rot, PR | 359 | 103.364 | grasp, lift | Loosely Coord. |
Rotate Valve | Static, Pos, Rot, PR | 456 | 112.484 | grasp, rotate along axis | Uncoord. |
We evaluate policy performance along four axes:

- Trajectory-Based Metrics
- Spatial Precision Metrics
- Task Progression Metrics
- Bimanual Coordination Metrics
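As one concrete example from the coordination category, a simple bimanual synchrony score can be computed as the mean cosine similarity between the two arms' instantaneous velocity vectors. This is an illustrative measure of our own construction, not RoboEval's exact definition; for a tightly symmetric task like Lift Tray it should stay near 1, while uncoordinated motion drives it toward 0.

```python
import numpy as np

def bimanual_synchrony(left_pos, right_pos, dt):
    """Mean cosine similarity between the two arms' velocity vectors.

    1.0 = perfectly synchronized motion (e.g., jointly lifting a tray);
    -1.0 = opposed motion. Inputs are (T, 3) end-effector position arrays.
    """
    v_l = np.gradient(left_pos, dt, axis=0)
    v_r = np.gradient(right_pos, dt, axis=0)
    eps = 1e-8  # guards against division by zero on stationary frames
    cos = np.sum(v_l * v_r, axis=1) / (
        np.linalg.norm(v_l, axis=1) * np.linalg.norm(v_r, axis=1) + eps)
    return float(np.mean(cos))
```

Averaging per-frame similarity (rather than comparing endpoints) is what makes the score sensitive to *temporally* inconsistent coordination, one of the failure modes binary success hides.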
This section provides a visual summary of performance degradation using radar plots of per-metric scores and failure-mode heatmaps. These visualizations allow fine-grained interpretation of agent performance under different task variations.
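Radar plots of this kind can be drawn with matplotlib's polar projection. The sketch below uses hypothetical per-category scores for a single policy; the category names and values are illustrative, not results from the benchmark.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt

# Hypothetical per-category scores for one policy (values are illustrative).
scores = {"Trajectory": 0.82, "Precision": 0.61,
          "Progression": 0.90, "Coordination": 0.48}

labels = list(scores)
values = list(scores.values())
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
# Close the polygon by repeating the first vertex.
values += values[:1]
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 1)
fig.savefig("radar_example.png")
```

Overlaying one polygon per policy on the same axes makes it easy to see two policies with equal success rates degrade along different axes.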