Metrics#
RoboEval provides comprehensive metrics tracking through the MetricRolloutEval class, enabling detailed analysis of robot performance beyond binary success.
Overview#
The metrics system tracks:
Task success and progression (stages, subtask completion)
Bimanual coordination (velocity sync, vertical alignment)
Object manipulation quality (slippage detection)
Safety (collision detection)
Trajectory quality (smoothness via jerk metrics)
Efficiency (path length in cartesian and joint space)
Custom task metrics (distances, pose errors)
MetricRolloutEval Integration#
To enable comprehensive metrics, inherit from MetricRolloutEval in your task:
from abc import ABC

from roboeval.roboeval_env import RoboEvalEnv
from roboeval.utils.metric_rollout import MetricRolloutEval


class MyTask(RoboEvalEnv, ABC, MetricRolloutEval):
    """Task with comprehensive metrics tracking."""

    def _initialize_env(self):
        # Create props
        self.cube = Cube(self._mojo)

        # Initialize metric tracking
        self._metric_init(
            track_vel_sync=True,
            track_vertical_sync=True,
            track_slippage=True,
            slip_objects=[self.cube],
            slip_sample_window=20,
            track_collisions=True,
            track_cartesian_jerk=True,
            track_joint_jerk=True,
            track_cartesian_path_length=True,
            track_joint_path_length=True,
            track_orientation_path_length=True,
            robot=self.robot,
        )

    def _on_reset(self):
        super()._on_reset()
        # Re-initialize metrics for each episode
        self._metric_init(...)
        # Initialize stage flags
        for idx in range(1, 4):
            self._metric_stage(idx, False)

    def _on_step(self):
        # Update metrics every step
        self._metric_step()

    def _success(self):
        success = self._check_success_conditions()

        # Track stage progression
        if self._is_grasped():
            self._metric_stage(1)
        if self._is_lifted():
            self._metric_stage(2)
        if success:
            self._metric_stage(3)

        # Finalize and store metrics
        self._final_metrics = self._metric_finalize(
            success_flag=success,
            target_distance={"distance": 0.1},
            pose_error=0.02,
        )
        return success

    def _get_task_info(self):
        # Expose metrics via info dict
        return getattr(self, "_final_metrics", {})
Available Tracking Options#
_metric_init() Parameters#
Bimanual Coordination:
track_vel_sync (bool): Track arm velocity synchronization. Measures the difference in joint velocities between the left and right arms.
Returns bimanual_arm_velocity_difference (lower is better).
track_vertical_sync (bool): Track gripper height alignment. Measures the vertical (Z-axis) difference between the left and right gripper wrists.
Returns bimanual_gripper_vertical_difference (lower is better).
Object Manipulation:
track_slippage (bool): Detect object grip loss.
slip_objects (list): Props to monitor for slipping.
slip_sample_window (int): Frames between slip checks (default: 20).
Detects when a gripper loses its grip (was holding, is no longer holding, and the gripper did not open).
Returns slip_count and slip_count_per_object.
Safety:
track_collisions (bool): Track collision events. Detects new collisions with the environment and robot self-collisions.
Returns env_collision_count and self_collision_count.
Only counts new collision contacts, not sustained contact.
Trajectory Quality:
track_cartesian_jerk (bool): End-effector smoothness. Jerk is the rate of change of acceleration (the derivative of acceleration).
Returns avg_cartesian_jerk and rms_cartesian_jerk per arm. Lower jerk means smoother, more human-like motion.
track_joint_jerk (bool): Joint-space smoothness. Jerk in joint space.
Returns avg_joint_jerk and rms_joint_jerk per arm.
Efficiency:
track_cartesian_path_length (bool): End-effector path distance. Total distance traveled by the end-effector in Cartesian space.
Returns cartesian_path_length per arm, total_cartesian_path_length, avg_cartesian_path_length.
track_joint_path_length (bool): Joint-space path distance. Total distance traveled in joint configuration space.
Returns joint_path_length per arm, total_joint_path_length, avg_joint_path_length.
track_orientation_path_length (bool): Rotation path length. Total angular distance traveled (quaternion angular distance).
Returns orientation_path_length per arm, total_orientation_path_length.
Required Parameter:
robot (Robot): Robot instance for accessing kinematics and grippers.
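The slip heuristic described above (was holding, now not holding, gripper did not open) can be sketched in isolation. This is an illustrative reimplementation, not the library's actual code; the names `count_slips`, `is_holding`, and `gripper_open` are hypothetical.

```python
# Hypothetical sketch of the slip heuristic: a slip is counted when an
# object was held at the previous sample, is no longer held now, and the
# gripper did not open in between (an open gripper means an intentional
# release, not a slip).
def count_slips(samples):
    """samples: list of (is_holding, gripper_open) tuples, one per check."""
    slip_count = 0
    was_holding = False
    prev_gripper_open = True
    for is_holding, gripper_open in samples:
        gripper_opened = gripper_open and not prev_gripper_open
        # Was holding, now not, and the gripper stayed closed -> slip
        if was_holding and not is_holding and not gripper_opened:
            slip_count += 1
        was_holding = is_holding
        prev_gripper_open = gripper_open
    return slip_count
```

With `slip_sample_window=20`, the real tracker would collect one such sample every 20 frames rather than every step, trading detection latency for robustness to transient contact noise.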
Metrics Lifecycle#
1. Initialize (_metric_init)
Call in both _initialize_env() and _on_reset():
self._metric_init(
    track_vel_sync=True,
    track_slippage=True,
    slip_objects=[self.cube, self.pot],
    track_collisions=True,
    robot=self.robot,
    slip_sample_window=20,
)

# Initialize stage flags
for idx in range(1, 4):
    self._metric_stage(idx, False)
2. Update (_metric_step)
Call in _on_step() to update metrics every simulation step:
def _on_step(self):
    self._metric_step()
This method:
Tracks velocity and vertical sync differences
Detects new collisions
Detects slip events (every N frames based on slip_sample_window)
Accumulates path lengths
Calculates and accumulates jerk values
3. Track Stages (_metric_stage)
Call during _success() to mark subtask completion:
def _success(self):
    # Update stages as task progresses
    if self._is_grasped():
        self._metric_stage(1)  # Stage 1 complete
    if self._is_lifted():
        self._metric_stage(2)  # Stage 2 complete

    success = self._in_goal_region()
    if success:
        self._metric_stage(3)  # Stage 3 complete

    # Finalize...
    return success
4. Finalize (_metric_finalize)
Call at end of _success() to compute final metrics:
self._final_metrics = self._metric_finalize(
    success_flag=success,
    target_distance={"lift_distance": 0.15, "goal_distance": 0.05},
    pose_error=0.02,
)
Parameters:
success_flag (bool): Whether the task succeeded.
target_distance (float or dict): Distance(s) to the goal/target.
pose_error (float or dict): Orientation/pose error(s).
5. Expose (_get_task_info)
Return metrics in info dict:
def _get_task_info(self):
    return getattr(self, "_final_metrics", {})
Metrics Dictionary Structure#
_metric_finalize() returns a comprehensive dictionary:
metrics = {
    # Core metrics (always present)
    "success": 1.0,            # Float: 1.0 or 0.0
    "completion_time": 15.32,  # Seconds since episode start

    # Stage progression (if _metric_stage was called)
    "task_stage_reached": {1: True, 2: True, 3: False},
    "subtask_progress": 0.67,  # 2/3 stages completed

    # Custom distances (if target_distance provided)
    "target_distance": {
        "lift_distance": 0.15,
        "goal_distance": 0.05,
    },
    # Or single value: "target_distance": 0.05

    # Pose errors (if pose_error provided)
    "object_pose_error": 0.02,
    # Or dict: "object_pose_error": {"pot": 0.02, "lid": 0.01}

    # Slip tracking (if track_slippage=True)
    "slip_count": 2,  # Total slip events
    "slip_count_per_object": {"object_1": 1, "object_2": 1},

    # Collision tracking (if track_collisions=True)
    "env_collision_count": 3,   # New environment collisions
    "self_collision_count": 0,  # Robot self-collisions

    # Bimanual coordination (if multiarm and track_vel_sync=True)
    "bimanual_arm_velocity_difference": 0.05,  # Avg velocity diff

    # Vertical sync (if multiarm and track_vertical_sync=True)
    "bimanual_gripper_vertical_difference": 0.01,  # Avg height diff

    # Path lengths (if track_cartesian_path_length=True)
    "cartesian_path_length": {"left": 1.52, "right": 1.48},
    "total_cartesian_path_length": 3.00,
    "avg_cartesian_path_length": 1.50,

    # Joint path (if track_joint_path_length=True)
    "joint_path_length": {"left": 8.23, "right": 8.45},
    "total_joint_path_length": 16.68,
    "avg_joint_path_length": 8.34,

    # Orientation path (if track_orientation_path_length=True)
    "orientation_path_length": {"left": 0.52, "right": 0.48},
    "total_orientation_path_length": 1.00,
    "avg_orientation_path_length": 0.50,

    # Cartesian jerk (if track_cartesian_jerk=True)
    "avg_cartesian_jerk": {"left": 0.32, "right": 0.28},
    "rms_cartesian_jerk": {"left": 0.45, "right": 0.41},
    "overall_avg_cartesian_jerk": 0.30,
    "overall_rms_cartesian_jerk": 0.43,

    # Joint jerk (if track_joint_jerk=True)
    "avg_joint_jerk": {"left": 2.1, "right": 2.3},
    "rms_joint_jerk": {"left": 3.2, "right": 3.5},
    "overall_avg_joint_jerk": 2.2,
    "overall_rms_joint_jerk": 3.35,
}
Note: For single-arm robots, metrics return single values instead of dicts:
"cartesian_path_length": 1.50, # Not {"left": 1.50}
"avg_cartesian_jerk": 0.30, # Not {"left": 0.30}
Analyze collected information over episodes:
import numpy as np
from typing import Any

# Collect info from multiple episodes
all_info: list[list[dict[str, Any]]] = []
for episode in range(100):
    obs, info = env.reset()
    episode_info = []
    for step in range(1000):
        action = policy.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        episode_info.append(info)
        if terminated or truncated:
            break
    all_info.append(episode_info)

# Compute statistics
success_rate = np.mean([ep_info[-1]['task_success']
                        for ep_info in all_info])

# Analyze custom info if available
if 'stage' in all_info[0][0]:
    max_stages = [max(step_info['stage'] for step_info in ep_info)
                  for ep_info in all_info]
    avg_stage = np.mean(max_stages)
    print(f"Average max stage: {avg_stage:.2f}")

print(f"Success rate: {success_rate:.2%}")
Accessing Metrics#
During Episode:
Metrics are only finalized at episode end (when _success() is called):
obs, reward, terminated, truncated, info = env.step(action)

if terminated or truncated:
    # Metrics available in info dict
    print(f"Success: {info['success']}")
    print(f"Time: {info['completion_time']:.2f}s")
    print(f"Slips: {info['slip_count']}")
    print(f"Collisions: {info['env_collision_count']}")
    print(f"Progress: {info['subtask_progress']:.1%}")
    if 'cartesian_path_length' in info:
        print(f"Path: {info['cartesian_path_length']}")
    if 'avg_cartesian_jerk' in info:
        print(f"Jerk: {info['avg_cartesian_jerk']}")
Collecting Over Episodes:
from collections import defaultdict
import numpy as np

metrics = defaultdict(list)

for episode in range(100):
    obs, info = env.reset()
    for step in range(1000):
        action = policy.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            # Collect all metrics
            metrics["success"].append(info["success"])
            metrics["completion_time"].append(info["completion_time"])
            metrics["slip_count"].append(info["slip_count"])
            metrics["subtask_progress"].append(info["subtask_progress"])
            if "env_collision_count" in info:
                metrics["collisions"].append(info["env_collision_count"])
            if "total_cartesian_path_length" in info:
                metrics["path_length"].append(info["total_cartesian_path_length"])
            if "overall_avg_cartesian_jerk" in info:
                metrics["jerk"].append(info["overall_avg_cartesian_jerk"])
            if "bimanual_arm_velocity_difference" in info:
                metrics["vel_sync"].append(info["bimanual_arm_velocity_difference"])
            break

# Analyze results
print(f"Success rate: {np.mean(metrics['success']):.2%}")
print(f"Avg completion time: {np.mean(metrics['completion_time']):.2f}s")
print(f"Avg slips: {np.mean(metrics['slip_count']):.2f}")
print(f"Avg collisions: {np.mean(metrics['collisions']):.2f}")
print(f"Avg subtask progress: {np.mean(metrics['subtask_progress']):.1%}")
print(f"Avg path length: {np.mean(metrics['path_length']):.3f}")
print(f"Avg jerk (smoothness): {np.mean(metrics['jerk']):.3f}")
print(f"Avg velocity sync: {np.mean(metrics['vel_sync']):.3f}")
Interpreting Metrics#
Success and Progression:
success: Binary task completion (1.0 = success, 0.0 = failure).
subtask_progress: Fraction of stages completed (0.0 to 1.0). Useful for analyzing partial task completion.
Example: 0.67 = completed 2 out of 3 stages
task_stage_reached: Which specific stages were reached. Helps diagnose where failures occur.
Example: {1: True, 2: True, 3: False} = grasped and lifted, but failed to place
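Aggregating `task_stage_reached` over many episodes turns this into a failure-mode histogram. A minimal sketch, assuming you have collected one stage dict per failed episode (the episode data below is illustrative):

```python
from collections import Counter

# Sketch: given task_stage_reached dicts from failed episodes, find the
# first stage each episode failed to reach, then count failure modes.
def first_failed_stage(stage_reached):
    for stage in sorted(stage_reached):
        if not stage_reached[stage]:
            return stage
    return None  # All stages reached

# Illustrative episode data (one dict per failed rollout)
episodes = [
    {1: True, 2: True, 3: False},   # Grasped and lifted, failed to place
    {1: True, 2: False, 3: False},  # Grasped, failed to lift
    {1: True, 2: True, 3: False},   # Grasped and lifted, failed to place
]
failure_stages = Counter(first_failed_stage(e) for e in episodes)
print(failure_stages)  # Counter({3: 2, 2: 1}) -> most failures at placement
```

A concentration of failures at one stage tells you where to focus: stage 3 failures here point at placement, not grasping.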
Bimanual Coordination:
bimanual_arm_velocity_difference: Lower is better. Measures how synchronized the arm movements are.
High values indicate uncoordinated bimanual manipulation.
bimanual_gripper_vertical_difference: Lower is better. Measures gripper height alignment.
Important for tasks requiring level grasping (e.g., carrying trays).
Manipulation Quality:
slip_count: Lower is better (ideally 0). Each slip indicates a grip failure.
High slip counts suggest poor grasp quality or excessive forces.
slip_count_per_object: Identifies which objects are problematic.
Example: {"pot": 3, "lid": 0} = the pot slips often, the lid doesn't
Safety:
env_collision_count: Lower is better (ideally 0). Counts new collisions with the environment (tables, cabinets, etc.).
Does NOT count sustained contact (e.g., the gripper on an object being manipulated).
self_collision_count: Should always be 0. Counts robot parts colliding with each other.
Indicates poor motion planning or singularities.
Trajectory Quality:
avg_cartesian_jerk / rms_cartesian_jerk: Lower is better. Measures motion smoothness.
Human-like motion has low jerk; high jerk means jerky, robotic motion.
RMS jerk penalizes spikes more than average jerk.
avg_joint_jerk / rms_joint_jerk: Lower is better. Measures joint-space smoothness.
Important for motor wear and energy efficiency.
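A small numeric example shows why RMS penalizes spikes. The two jerk traces below are made up for illustration: they have the same mean magnitude, but one contains a single large spike.

```python
import numpy as np

# Two illustrative jerk traces with equal mean magnitude (1.0):
# one uniform, one dominated by a single spike.
smooth = np.array([1.0, 1.0, 1.0, 1.0])
spiky = np.array([0.1, 0.1, 0.1, 3.7])

for name, trace in [("smooth", smooth), ("spiky", spiky)]:
    avg = np.mean(np.abs(trace))           # Average jerk magnitude
    rms = np.sqrt(np.mean(trace ** 2))     # RMS jerk
    print(f"{name}: avg={avg:.2f}, rms={rms:.2f}")
```

Both traces report an average of 1.00, but the spiky trace's RMS (about 1.85) is nearly double the smooth trace's (1.00): squaring before averaging makes isolated spikes dominate, which is exactly the behavior you want when screening for abrupt motions.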
Efficiency:
cartesian_path_length: Lower is better. Total distance traveled by the end-effector.
Shorter paths are more efficient; compare against an ideal or demonstrated path length.
joint_path_length: Lower is better. Total distance in joint configuration space.
May differ from the Cartesian path (redundant arms).
orientation_path_length: Lower is better. Total angular distance traveled.
Excessive rotation indicates inefficient orientation planning.
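Orientation path length is the sum of quaternion angular distances between consecutive orientations. A minimal pure-numpy sketch of that idea (the actual library may compute it differently, e.g. via pyquaternion):

```python
import numpy as np

# Angular distance (radians) between two unit quaternions (w, x, y, z).
# abs() handles the double cover: q and -q are the same rotation.
def quat_angle(q1, q2):
    dot = abs(np.dot(q1, q2))
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))

def orientation_path_length(quats):
    """Sum of angular distances between consecutive orientations."""
    return sum(quat_angle(quats[i], quats[i + 1])
               for i in range(len(quats) - 1))

# A 90-degree rotation about Z, sampled in two 45-degree steps
quats = [
    np.array([1.0, 0.0, 0.0, 0.0]),                              # Identity
    np.array([np.cos(np.pi / 8), 0.0, 0.0, np.sin(np.pi / 8)]),  # 45° about Z
    np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]),  # 90° about Z
]
print(orientation_path_length(quats))  # ≈ π/2 (90° of total rotation)
```

A direct rotation to the goal orientation gives the minimum possible value; back-and-forth wobbling inflates the sum even when the final pose is correct.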
Custom Metrics:
target_distance: Task-specific distance to the goal. Useful for debugging near-misses.
Can be a dict for multiple distances.
object_pose_error: Orientation/pose error. Measures how close an object is to its target pose.
Example: pot upright angle deviation
Adding Custom Metrics#
You can add task-specific metrics via _metric_finalize() parameters:
Distance Metrics:
def _success(self):
    cube_pos = self.cube.body.get_position()
    gripper_pos = self.robot.grippers[HandSide.LEFT].wrist_position

    # Calculate custom distances
    lift_distance = cube_pos[2] - 1.0  # Height above table
    gripper_distance = np.linalg.norm(cube_pos - gripper_pos)
    goal_distance = np.linalg.norm(cube_pos - self.goal_pos)

    success = goal_distance < 0.05

    self._final_metrics = self._metric_finalize(
        success_flag=success,
        target_distance={
            "lift_distance": lift_distance,
            "gripper_distance": gripper_distance,
            "goal_distance": goal_distance,
        },
        pose_error=0.0,
    )
    return success
Pose Error Metrics:
def _success(self):
    from pyquaternion import Quaternion

    # Calculate orientation error
    pot_quat = Quaternion(self.pot.body.get_quaternion())
    target_quat = Quaternion(axis=[0, 0, 1], angle=0)

    # Angular distance between orientations
    angle_error = Quaternion.absolute_distance(pot_quat, target_quat)

    success = angle_error < np.deg2rad(10)

    self._final_metrics = self._metric_finalize(
        success_flag=success,
        target_distance=0.0,
        pose_error=angle_error,  # In radians
    )
    return success
Multiple Object Errors:
def _success(self):
    # Track errors for multiple objects
    pose_errors = {}
    for obj_name, obj in [("pot", self.pot), ("lid", self.lid)]:
        obj_quat = Quaternion(obj.body.get_quaternion())
        target_quat = self.target_orientations[obj_name]
        pose_errors[obj_name] = Quaternion.absolute_distance(obj_quat, target_quat)

    success = all(error < self.threshold for error in pose_errors.values())

    self._final_metrics = self._metric_finalize(
        success_flag=success,
        target_distance=0.0,
        pose_error=pose_errors,  # Dict of errors
    )
    return success
Accessing Custom Metrics:
# These appear in the info dict automatically
info = env.step(action)[-1]

if "target_distance" in info:
    if isinstance(info["target_distance"], dict):
        for name, dist in info["target_distance"].items():
            print(f"{name}: {dist:.3f}")
    else:
        print(f"Distance: {info['target_distance']:.3f}")

if "object_pose_error" in info:
    if isinstance(info["object_pose_error"], dict):
        for obj, error in info["object_pose_error"].items():
            print(f"{obj} error: {np.rad2deg(error):.1f}°")
    else:
        print(f"Pose error: {np.rad2deg(info['object_pose_error']):.1f}°")
Best Practices#
Always inherit from MetricRolloutEval - Provides comprehensive, standardized metrics
Initialize metrics in both _initialize_env() and _on_reset() - Ensures clean state per episode
Call _metric_step() in _on_step() - Updates metrics every simulation step
Use _metric_stage() for progression tracking - Helps diagnose where failures occur
Track what’s relevant to your task:
Slippage for grasping tasks
Collisions for safety-critical tasks
Velocity/vertical sync for bimanual coordination
Jerk for human-like motion
Path length for efficiency
Add custom distances and pose errors - Provides debugging insights beyond binary success
Initialize stage flags - Set all stages to False in _on_reset() before tracking
Store and expose metrics properly:
self._final_metrics = self._metric_finalize(...)

def _get_task_info(self):
    return getattr(self, "_final_metrics", {})
Analyze distributions, not just averages - Use RMS jerk, max slips, 95th percentile path length
Compare against baselines - Human demonstrations, random policy, previous versions
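The "distributions, not just averages" advice can be applied directly with numpy. The metric arrays below are illustrative placeholders for values collected over episodes:

```python
import numpy as np

# Illustrative per-episode metrics: one outlier episode hides in the mean
path_lengths = np.array([1.4, 1.5, 1.5, 1.6, 1.7, 4.2])
slip_counts = np.array([0, 0, 1, 0, 0, 5])

print(f"Mean path length:       {np.mean(path_lengths):.2f}")
print(f"Median path length:     {np.median(path_lengths):.2f}")
print(f"95th pct path length:   {np.percentile(path_lengths, 95):.2f}")
print(f"Max slips in an episode: {np.max(slip_counts)}")
```

Here the mean path length (about 1.98) is inflated by a single outlier episode while the median stays at 1.55; the 95th percentile and per-episode maximum surface that outlier instead of averaging it away.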
Common Pitfalls#
Forgetting to re-initialize in _on_reset():
# WRONG - metrics persist across episodes
def _initialize_env(self):
    self._metric_init(...)

# CORRECT - reset metrics each episode
def _initialize_env(self):
    self._metric_init(...)

def _on_reset(self):
    self._metric_init(...)  # Re-initialize!
Not calling _metric_step():
# WRONG - metrics never update
def _on_step(self):
    pass

# CORRECT
def _on_step(self):
    self._metric_step()
Accessing metrics before finalization:
# WRONG - metrics not available during episode
obs, reward, terminated, truncated, info = env.step(action)
print(info['slip_count'])  # May not exist yet!

# CORRECT - only access at episode end
if terminated or truncated:
    print(info['slip_count'])  # Now available
Forgetting to store _final_metrics:
# WRONG - metrics computed but not stored
def _success(self):
    self._metric_finalize(...)  # Lost!
    return True

# CORRECT - store in instance variable
def _success(self):
    self._final_metrics = self._metric_finalize(...)
    return True
See Also#
Creating Custom Tasks - Implementing tasks with metrics
Demonstrations - Using metrics with demonstrations
roboeval/utils/metric_rollout.py - Full implementation details
roboeval/envs/lift_pot.py - Example task using all metrics