Achieving human-level manipulation requires dexterous hands and standardized evaluation. Existing dexterous benchmarks often lack realistic manipulator-hand setups, tasks that reveal the unique capabilities of dexterous hands over parallel grippers, reliable demonstration acquisition tools, and unified pipelines for evaluating modern VLA models.
DexJoCo addresses these gaps with 11 functionally grounded tasks built around the Franka Panda and Allegro Hand. It provides a low-cost motion-capture data collection system, 1.1K human demonstration trajectories, replay-based domain randomization, and evaluation support for modern imitation learning and VLA policies.
Benchmark for task-oriented dexterous manipulation
DexJoCo tasks are designed around functional interactions rather than isolated object relocation. Each task defines interactive objects and success constraints over execution order, object poses, articulated joint states, and contact, so completion requires meaningful progress toward an everyday objective.
The benchmark covers tool use, reasoning, bimanual coordination, and long-horizon execution. Its assets provide explicit visual interaction feedback, such as unlocking an iPad after entering a password, spraying water from a watering can, and waking a display by clicking a mouse.
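As an illustration of success constraints over execution order, object poses, articulated joint states, and contact, the following is a minimal sketch of an ordered success tracker. Every name here (`SimState`, `Stage`, `OrderedSuccessTracker`, the helper predicates, and the example geometry and joint names) is hypothetical and chosen for illustration; this is not DexJoCo's actual API.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Set, Tuple


@dataclass
class SimState:
    """Hypothetical per-step snapshot of the simulated scene."""
    object_pos: Dict[str, Tuple[float, float, float]]  # object name -> position (m)
    joint_pos: Dict[str, float]                        # articulated joint -> value
    contacts: Set[FrozenSet[str]]                      # unordered pairs in contact


@dataclass
class Stage:
    name: str
    check: Callable[[SimState], bool]


def in_contact(a: str, b: str) -> Callable[[SimState], bool]:
    return lambda s: frozenset((a, b)) in s.contacts


def joint_at_least(name: str, threshold: float) -> Callable[[SimState], bool]:
    return lambda s: s.joint_pos.get(name, 0.0) >= threshold


def object_near(name: str, target, tol: float) -> Callable[[SimState], bool]:
    return lambda s: name in s.object_pos and math.dist(s.object_pos[name], target) <= tol


class OrderedSuccessTracker:
    """Completes stages strictly in order; the task succeeds once all are done."""

    def __init__(self, stages: List[Stage]):
        self.stages = stages
        self.next_idx = 0

    def update(self, state: SimState) -> bool:
        # Only the next pending stage is checked, so a later condition that
        # happens to hold early does not count as progress.
        while self.next_idx < len(self.stages) and self.stages[self.next_idx].check(state):
            self.next_idx += 1
        return self.next_idx == len(self.stages)
```

A mouse-click task could then be expressed, under these assumptions, as two stages: `Stage("touch", in_contact("fingertip", "mouse"))` followed by `Stage("press", joint_at_least("mouse_button", 0.002))`. Calling `update` once per simulation step keeps the order constraint explicit.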
Single-arm demos
Bimanual demos
Data collection system, low-cost and user-friendly
DexJoCo provides a low-cost teleoperation system for efficient human demonstration collection. Rokoko Smartgloves capture hand motion without camera occlusion, while HTC Vive Trackers and Base Stations track wrist motion for Franka end-effector control in a unified setup of about $2,300 USD.
The software combines wrist tracking with retarget MLP, a lightweight self-supervised retargeting method that maps human fingertip poses to Allegro Hand joint configurations without paired human-robot annotations.
The teleoperation system combines hand motion retargeting and wrist motion tracking. Because human and robotic hands have different structures, direct linear mapping is infeasible. DexJoCo uses retarget MLP to preserve fingertip motion directions, workspace coverage, pinch behavior, and collision avoidance for real-time Allegro Hand control.
Hardware Design
- Rokoko Smartgloves capture hand motion without camera occlusion.
- HTC Vive Trackers and Base Stations track wrist motion and end-effector pose.
- The full setup stays comfortable, unified, and low-cost at about $2,300 USD.
Teleoperation Algorithm
- The system combines hand motion retargeting with wrist motion tracking.
- Retarget MLP bridges structural differences between human and robot hands.
- Self-supervised learning removes the need for manual human-robot pair annotation.
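To make the self-supervised idea concrete, the sketch below evaluates a retargeting loss that needs only human fingertip motion and the robot's forward kinematics, with no paired human-robot joint labels. It is a toy under stated assumptions, not the Retarget MLP objective itself: the planar two-joint finger (`planar_fk`), the link lengths, the pinch threshold, and the weighting are all hypothetical, and only two of the listed goals (motion direction and pinch behavior) are represented. In training, the joint angles would come from the network and the loss would be minimized by gradient descent through differentiable kinematics.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]


def planar_fk(q: Sequence[float], link_lengths=(0.05, 0.04)) -> Vec2:
    """Fingertip position of a toy 2-joint planar finger (stand-in for real hand kinematics)."""
    a1 = q[0]
    a2 = q[0] + q[1]
    x = link_lengths[0] * math.cos(a1) + link_lengths[1] * math.cos(a2)
    y = link_lengths[0] * math.sin(a1) + link_lengths[1] * math.sin(a2)
    return (x, y)


def direction_loss(human_prev: Vec2, human_now: Vec2,
                   robot_prev: Vec2, robot_now: Vec2, eps: float = 1e-9) -> float:
    """1 - cosine similarity between human and robot fingertip displacements."""
    hd = (human_now[0] - human_prev[0], human_now[1] - human_prev[1])
    rd = (robot_now[0] - robot_prev[0], robot_now[1] - robot_prev[1])
    hn = math.hypot(*hd)
    rn = math.hypot(*rd)
    if hn < eps or rn < eps:
        return 0.0  # no motion on one side: direction is undefined, so no penalty
    return 1.0 - (hd[0] * rd[0] + hd[1] * rd[1]) / (hn * rn)


def pinch_loss(human_gap: float, robot_gap: float, pinch_threshold: float = 0.015) -> float:
    """Penalize the robot thumb-index gap only while the human is pinching."""
    return robot_gap ** 2 if human_gap < pinch_threshold else 0.0


def retarget_loss(human_prev: Vec2, human_now: Vec2,
                  q_prev: Sequence[float], q_now: Sequence[float],
                  human_gap: float, robot_gap: float, w_pinch: float = 10.0) -> float:
    """Label-free loss: compares human fingertip motion with robot FK output."""
    robot_prev = planar_fk(q_prev)
    robot_now = planar_fk(q_now)
    return (direction_loss(human_prev, human_now, robot_prev, robot_now)
            + w_pinch * pinch_loss(human_gap, robot_gap))
```

The direction term ignores displacement magnitude on purpose, which is one way to tolerate the size mismatch between a human hand and a robot hand; workspace coverage and collision avoidance would need additional terms not shown here.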
Datasets and policy learning
DexJoCo collects 1.1K human demonstration trajectories across the 11 benchmark tasks. Each trajectory records rich observations, including third-person and wrist-mounted visual streams, object and robot states, TCP pose, and hand joint angles, while actions are represented as target absolute end-effector poses and hand joint angles.
The dataset can be converted into common formats such as LeRobot Dataset v3.0 and Diffusion Policy Zarr. DexJoCo then evaluates policies through constructed task environments and an asynchronous server-client deployment pipeline.
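The per-step record described above can be pictured with a small schema like the one below. This is a sketch under assumptions rather than the dataset's actual layout: the class and field names are hypothetical, the pose is assumed to be position plus a unit quaternion (7 values), and only a single arm with one 16-joint Allegro Hand is modeled.

```python
from dataclasses import dataclass
from typing import Dict, List

ALLEGRO_DOF = 16   # Allegro Hand: 4 fingers x 4 joints
POSE_DIM = 7       # assumed layout: [x, y, z, qw, qx, qy, qz]


@dataclass
class Observation:
    images: Dict[str, bytes]        # camera name -> encoded frame (e.g. third-person, wrist)
    object_state: List[float]
    robot_state: List[float]
    tcp_pose: List[float]           # assumed POSE_DIM values
    hand_joint_angles: List[float]  # ALLEGRO_DOF values


@dataclass
class Action:
    target_ee_pose: List[float]      # absolute end-effector target, not a delta
    target_hand_joints: List[float]  # absolute hand joint targets

    def to_vector(self) -> List[float]:
        """Flatten into one action vector, validating dimensions first."""
        if len(self.target_ee_pose) != POSE_DIM:
            raise ValueError(f"expected {POSE_DIM} pose values")
        if len(self.target_hand_joints) != ALLEGRO_DOF:
            raise ValueError(f"expected {ALLEGRO_DOF} hand joint values")
        return list(self.target_ee_pose) + list(self.target_hand_joints)


@dataclass
class Step:
    observation: Observation
    action: Action
```

Under these assumptions a single-arm action flattens to 23 values; a bimanual task would concatenate two such vectors. A flat vector of this kind is the sort of array that a converter to another dataset format would typically write per step.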
Domain Randomizations
DexJoCo supports domain randomization across all task scenarios. Object placement and table height are randomized for trajectory diversity, while third-person camera poses, lighting direction and color, and tabletop textures are randomized to evaluate visual robustness.
Third-person camera pose
Camera poses are sampled on a spherical surface, then filtered to select viewpoints with minimal occlusion.
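One way to sketch this sample-then-filter procedure is shown below. The elevation band, candidate counts, and especially the `occlusion_score` callback are assumptions: the page does not say how occlusion is measured, so the score is left as a user-supplied function (in practice it might be computed from rendered segmentation masks).

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def sample_camera_position(center: Sequence[float], radius: float, rng: random.Random,
                           min_elev_deg: float = 20.0, max_elev_deg: float = 70.0) -> Vec3:
    """Sample a point on a spherical band around `center`, uniform in surface area."""
    azim = rng.uniform(0.0, 2.0 * math.pi)
    # Sampling sin(elevation) uniformly yields uniform density over the sphere surface.
    s = rng.uniform(math.sin(math.radians(min_elev_deg)),
                    math.sin(math.radians(max_elev_deg)))
    c = math.sqrt(1.0 - s * s)
    return (center[0] + radius * c * math.cos(azim),
            center[1] + radius * c * math.sin(azim),
            center[2] + radius * s)


def select_views(center: Sequence[float], radius: float,
                 occlusion_score: Callable[[Vec3], float],
                 n_candidates: int = 64, n_keep: int = 8, seed: int = 0) -> List[Vec3]:
    """Keep the `n_keep` candidate positions with the lowest occlusion score."""
    rng = random.Random(seed)
    candidates = [sample_camera_position(center, radius, rng) for _ in range(n_candidates)]
    candidates.sort(key=occlusion_score)
    return candidates[:n_keep]
```

Each selected position would then be paired with a look-at orientation toward `center` before being assigned to the third-person camera.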
Lighting direction and color
Lighting randomization follows a simple procedure: each light in the scene is randomized in position, direction, and diffuse color to introduce diverse illumination conditions.
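A minimal sketch of per-light randomization over position, direction, and diffuse color follows. The `Light` container, the jitter magnitudes, and the color range are assumptions made for illustration; in MuJoCo the corresponding per-light values live in the model's `light_pos`, `light_dir`, and `light_diffuse` arrays, to which the sampled values could be written.

```python
import math
import random
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Light:
    pos: Vec3
    dir: Vec3      # unit direction vector
    diffuse: Vec3  # RGB in [0, 1]


def randomize_light(light: Light, rng: random.Random,
                    pos_jitter: float = 0.5, dir_jitter: float = 0.3,
                    color_range: Tuple[float, float] = (0.4, 1.0)) -> Light:
    """Return a new light with jittered position and direction and a resampled color."""
    pos = tuple(p + rng.uniform(-pos_jitter, pos_jitter) for p in light.pos)
    raw = tuple(d + rng.uniform(-dir_jitter, dir_jitter) for d in light.dir)
    norm = math.sqrt(sum(c * c for c in raw))
    # Re-normalize so the perturbed direction stays a unit vector.
    direction = tuple(c / norm for c in raw) if norm > 1e-9 else light.dir
    diffuse = tuple(rng.uniform(*color_range) for _ in range(3))
    return Light(pos, direction, diffuse)


def randomize_scene_lights(lights, seed: int = 0):
    """Apply independent randomization to every light in the scene."""
    rng = random.Random(seed)
    return [randomize_light(light, rng) for light in lights]
```

Seeding the generator per episode keeps each randomized illumination condition reproducible during evaluation.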
Table height
Table height is randomized together with object placement to broaden the state distribution of replayed trajectories.
Table texture
For tabletop texture randomization, we sample textures from a pre-constructed texture library.
Baseline model performance
DexJoCo benchmarks ACT, Diffusion Policy, π0.5, and GR00T N1.5 under object-only and full visual randomization. The benchmark is challenging: visual randomization sharply reduces success rates, difficult bimanual tasks expose persistent failures, and precise interaction remains a key bottleneck.
Citation
BibTeX
@article{wang2026dexjoco,
title = {DexJoCo: A Unified Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo},
author = {Wang, Hanwen and Zhao, Weizhi and Wang, Xiangyu and Huang, Siyuan and Lin, He and Zheng, Boyuan and Xu, Rongtao and Wang, Gang and Mu, Yao and Wang, He and Fan, Lue and Li, Hongsheng and Zhang, Zhaoxiang and Tan, Tieniu},
journal = {arXiv preprint arXiv:2605.16257},
year = {2026},
url = {https://dexjoco.github.io}
}