RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation

¹University of Washington  ²Allen Institute for AI  *Equal advising

Summary

Abstract

We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior—such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suite of tiered, semantically grounded tasks decomposed into skill-specific stages, with variations that systematically challenge spatial, physical, and coordination capabilities. Tasks are paired with fine-grained diagnostic metrics and 3,000+ human demonstrations to support imitation learning. Our experiments reveal that policies with similar success rates diverge in how tasks are executed—some struggle with alignment, others with temporally consistent bimanual control. We find that behavioral metrics correlate with success in over half of task-metric pairs and remain informative even when binary success saturates. By pinpointing when and how policies fail, RoboEval enables a deeper, more actionable understanding of robotic manipulation—and highlights the need for evaluation tools that go beyond success alone.

Benchmark

Task Overview



RoboEval Benchmark

RoboEval is a benchmark for evaluating bimanual manipulation policies under diverse task settings. The first iteration consists of 10 base tasks and 3,000+ human demonstrations. The tasks are drawn from activities humans perform in diverse settings, from service-style tasks such as lifting a tray, to warehouse tasks like closing a box, to industrial tasks like rotating hand-wheels. Each task includes multiple variations—ranging from static setups to dynamic shifts in object pose and semantic context—designed to assess policy performance systematically. To facilitate research in imitation learning and demonstration-driven policy training, we provide a suite of raw expert human demonstrations, along with fine-grained evaluation metrics such as trajectory smoothness and environment collision counts.

Base Task Set in RoboEval

| Task Name | Variations | # Demos | Avg. Traj. Len | Skills | Coordination Type |
|---|---|---|---|---|---|
| Lift Tray | Static, Pos, Rot, PR, Drag | 730 | 77.318 | grasp, lift | Tight Sym. |
| Stack Two Cubes | Static, Pos, Rot, PR | 400 | 108.368 | grasp, hold, place | Loosely Coord. |
| Stack Single Book Shelf | Static, Pos, PR | 199 | 187.280 | push, grasp, lift, place | Loosely Coord. |
| Rod Handover | Static, Pos, Rot, PR, Vertical | 511 | 93.631 | grasp, hold | Loosely Coord. |
| Lift Pot | Static, Pos, Rot, PR | 390 | 58.561 | grasp, lift | Tight Sym. |
| Pack Box | Static, Pos, Rot, PR | 312 | 123.016 | push | Uncoord. |
| Pick Book From Table | Static, Pos, Rot, PR | 359 | 103.364 | grasp, lift | Loosely Coord. |
| Rotate Valve | Static, Pos, Rot, PR | 456 | 112.484 | grasp, rotate along axis | Uncoord. |

Evaluation Metrics

We evaluate policy performance along four axes: trajectory quality, spatial precision, task progression, and bimanual coordination. An illustrative computation sketch follows each metric list below.

Trajectory-Based Metrics

  • Joint Path Length: Total angular joint distance during execution.
  • Cartesian Path Length: 3D distance traveled by end-effectors.
  • Jerk (Joint / Cartesian): Magnitude of the third time-derivative of the joint or end-effector trajectory; lower jerk indicates smoother motion.
  • Collision Counts: Number of robot/environment collisions.
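
A minimal sketch of how these trajectory metrics can be computed, assuming rollouts are logged as fixed-rate NumPy arrays of joint angles and end-effector positions; the function names and array layouts are illustrative, not RoboEval's API.

```python
import numpy as np

def joint_path_length(q: np.ndarray) -> float:
    """Total angular distance travelled in joint space.
    q: (T, DoF) array of joint angles sampled at a fixed rate."""
    return float(np.abs(np.diff(q, axis=0)).sum())

def cartesian_path_length(p: np.ndarray) -> float:
    """Total 3D distance travelled by an end-effector.
    p: (T, 3) array of end-effector positions."""
    return float(np.linalg.norm(np.diff(p, axis=0), axis=1).sum())

def mean_jerk(x: np.ndarray, dt: float) -> float:
    """Mean magnitude of the third time-derivative of x (joint or Cartesian);
    lower values indicate smoother motion."""
    jerk = np.diff(x, n=3, axis=0) / dt**3
    return float(np.linalg.norm(jerk, axis=-1).mean())
```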

Spatial Precision Metrics

  • Final Distance to Target: Euclidean distance between the object's final and goal positions.
  • Orientation Error: Geodesic distance between the object's final and goal orientations.
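
A brief sketch of these precision metrics, assuming the final and goal poses are available as a 3D position plus a 3×3 rotation matrix; the names are illustrative.

```python
import numpy as np

def final_distance_to_target(p_final: np.ndarray, p_goal: np.ndarray) -> float:
    """Euclidean distance between the object's final and goal positions."""
    return float(np.linalg.norm(p_final - p_goal))

def orientation_error(R_final: np.ndarray, R_goal: np.ndarray) -> float:
    """Geodesic distance (radians) between two rotations on SO(3)."""
    R_rel = R_final.T @ R_goal
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```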

Task Progression Metrics

  • Stage-wise Success: Binary success indicators for each sub-task.
  • Time in Each Stage: Timesteps spent per task stage.
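
A small sketch of the stage bookkeeping, under the assumption that each rollout carries a per-timestep stage label (e.g. "reach", "grasp", "lift"); this is an illustration, not the benchmark's internal representation.

```python
from collections import Counter

def time_in_each_stage(stage_labels: list[str]) -> dict[str, int]:
    """Number of timesteps spent in each task stage."""
    return dict(Counter(stage_labels))

def stage_success(stage_labels: list[str], required: list[str]) -> dict[str, bool]:
    """Binary success per sub-task: a stage counts as successful
    if the rollout ever enters it."""
    reached = set(stage_labels)
    return {stage: stage in reached for stage in required}
```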

Bimanual Coordination Metrics

  • Height Discrepancy: Height difference between the two arms' end-effectors.
  • Velocity Divergence: Difference between the two arms' velocities; large values indicate poorly synchronized motion.
  • Slip Count: Number of unintended object slips or drops during execution.
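
A sketch of the coordination metrics, assuming synchronized end-effector position arrays for both arms and a boolean per-timestep grasp signal; the benchmark's exact definitions may differ.

```python
import numpy as np

def height_discrepancy(p_left: np.ndarray, p_right: np.ndarray) -> float:
    """Mean absolute difference in end-effector height (z) between the arms."""
    return float(np.abs(p_left[:, 2] - p_right[:, 2]).mean())

def velocity_divergence(p_left: np.ndarray, p_right: np.ndarray, dt: float) -> float:
    """Mean norm of the difference between the two arms' velocities;
    near zero when the arms move in lockstep."""
    v_left = np.diff(p_left, axis=0) / dt
    v_right = np.diff(p_right, axis=0) / dt
    return float(np.linalg.norm(v_left - v_right, axis=1).mean())

def slip_count(grasped: np.ndarray) -> int:
    """Number of times the object transitions from held to dropped."""
    g = grasped.astype(bool)
    return int(np.sum(g[:-1] & ~g[1:]))
```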

Simulation Results

Task rollouts

(Interactive rollout viewer: choose a task, variation, method, and episode.)


Performance on Bimanual Tasks with Variations


Real-world experiments

To validate the trends observed in our simulation benchmark, we conducted real-world experiments on three tasks (Lift Tray, Stack Two Cubes, and Rod Handover) using a bimanual Franka Panda setup. These tasks closely mirror their simulated counterparts and include multiple task variations. For each task, we collected 100 demonstrations under the static variation using an Oculus VR controller and fine-tuned OpenVLA until training accuracy plateaued. We then evaluated each task variation over 25 trials. The results, shown in Figure 4, demonstrate a strong correlation between OpenVLA's real-world performance and the trends observed in simulation, with performance decreasing as task variations increased complexity.

Failure Analysis & Metric Summary

This section provides a visual summary of performance degradation using radial metrics plots and failure mode heatmaps. These visualizations allow fine-grained interpretation of agent performance under different task variations.
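
As a rough illustration of the radial metrics plot, the snippet below draws a radar chart with matplotlib; the metric names and values are placeholders, not results from the benchmark.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder metrics, normalized to [0, 1] with higher = better.
metrics = ["Success", "Smoothness", "Precision", "Coordination", "Collision-free"]
values = [0.8, 0.6, 0.7, 0.5, 0.9]

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
values_closed = values + values[:1]   # repeat the first point to close the polygon
angles_closed = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles_closed, values_closed, linewidth=2)
ax.fill(angles_closed, values_closed, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
plt.show()
```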


Radial Metrics Plot

Failure Summary Plot

Interactive demo (WIP)

Lift Tray