DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

Tobias Jülg¹*, Seongjin Bien¹*, Simon Hilber², Yannik Blei¹, Pierre Krack¹, Maximilian Li², Sven Parusel³, Rudolf Lioutikov², Florian Walter⁴, Wolfram Burgard¹
¹University of Technology Nuremberg, ²Karlsruhe Institute of Technology, ³Franka Robotics, ⁴Technical University of Munich (*core contributors)

Overview of DuoBench: four bimanual task categories with eleven tasks, four replicated in the real world, and a stage-based evaluation protocol for diagnosing failure modes beyond binary success.

Abstract

Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation on the Franka Research 3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, it provides stage-based evaluation for fine-grained semantic failure analysis beyond binary success and human-teleoperated datasets for all benchmark tasks. Across dual-arm imitation-learning and vision-language-action baselines, DuoBench exposes persistent difficulty in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings.

Motivation & Main Contribution

Why DuoBench?

Existing manipulation benchmarks still provide limited support for systematically evaluating dual-arm coordination. They often underrepresent the diversity of coordination patterns required by two-arm manipulation, provide little support for reproducible sim-and-real evaluation, or rely mainly on binary task success.

Without finer-grained signals, it is hard to tell whether a failure comes from grasp acquisition, stabilization, object transfer, parallel execution, or a later semantic stage of the task.

What DuoBench adds

  • 11 bimanual benchmark tasks across 4 coordination categories
  • Stage-based evaluation for semantic failure analysis
  • Reproducible sim-and-real task recipes with 3D-printable assets
  • Teleoperated datasets and a shared interface for collection, replay, and evaluation

Platform & Reproducibility

Side-by-side real and simulated Franka Research 3 Duo setup.

DuoBench is built around the Franka Research 3 Duo setup and exposes the same benchmark logic through a shared environment interface for simulation, teleoperation, replay, and evaluation.

The goal is not just to report whether a policy succeeds on a single lab setup, but to make the benchmark itself reproducible across labs. All eleven tasks can be recreated in simulation, and four of them can also be rebuilt in the real world with 3D-printable assets and shared teleoperation tools.

  • duobench/<task_id> Gymnasium environments
  • Task stages directly exposed through the environment wrapper
  • Human teleoperation in simulation and on hardware
  • Replay support for future ablations and extensions
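
As a rough illustration of the `duobench/<task_id>` naming scheme, the sketch below derives an environment ID from a paper task name, following the lowercase, underscore-separated pattern used by the IDs in the benchmark table. The helper names `env_id` and `make_env` are hypothetical and not part of DuoBench. `make_env` additionally assumes that Gymnasium is installed and that the DuoBench environments have already been registered, for example by importing the benchmark package.

```python
PREFIX = "duobench"


def env_id(task_name: str) -> str:
    """Map a paper task name such as 'Ball-Maze' to 'duobench/ball_maze'."""
    task_id = task_name.strip().lower().replace("-", "_")
    # Reject empty names and names with stray separators or symbols.
    if not task_id or not all(part.isalnum() for part in task_id.split("_")):
        raise ValueError(f"unexpected task name: {task_name!r}")
    return f"{PREFIX}/{task_id}"


def make_env(task_name: str, **kwargs):
    """Create the environment through Gymnasium.

    Assumption: the DuoBench IDs are already registered with Gymnasium.
    """
    import gymnasium as gym  # imported lazily so env_id works without it

    return gym.make(env_id(task_name), **kwargs)


print(env_id("Ball-Maze"))
```

Keeping the ID construction separate from environment creation means task lists can be validated without a simulator installed.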

Bimanual Task Taxonomy

Asymmetric Support

One arm stabilizes or holds an object while the other performs the task-relevant interaction.

Hinge-Chest, Spring-Door, Pour-Marbles

Bimanual Manipulation

Both arms jointly manipulate the same object in a tightly coupled way.

Ball-Maze, Carry-Pot, Block-Balance, Join-Blocks

Sequential Handoff

The object must be transferred between arms due to workspace or task constraints.

Transfer-Cube, Transfer-Gate, Transfer-Reorient

Parallel Execution

Both arms solve largely independent subproblems without direct physical cooperation.

Bin-Sort
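
The taxonomy above can be written down as a small lookup table, which is convenient for grouping per-task results by coordination category. This is only a sketch: the category and task names come from the list above, while the variable and function names (`TAXONOMY`, `category_of`, `tasks_per_category`) are illustrative and not part of DuoBench.

```python
TAXONOMY = {
    "Asymmetric Support": ["Hinge-Chest", "Spring-Door", "Pour-Marbles"],
    "Bimanual Manipulation": ["Ball-Maze", "Carry-Pot", "Block-Balance", "Join-Blocks"],
    "Sequential Handoff": ["Transfer-Cube", "Transfer-Gate", "Transfer-Reorient"],
    "Parallel Execution": ["Bin-Sort"],
}


def category_of(task: str) -> str:
    """Return the coordination category a task belongs to."""
    for category, tasks in TAXONOMY.items():
        if task in tasks:
            return category
    raise KeyError(task)


def tasks_per_category() -> dict[str, int]:
    """Count how many tasks each category contains."""
    return {category: len(tasks) for category, tasks in TAXONOMY.items()}


# Sanity check: eleven distinct tasks across four categories.
all_tasks = [task for tasks in TAXONOMY.values() for task in tasks]
assert len(all_tasks) == len(set(all_tasks)) == 11

print(tasks_per_category())
```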

Benchmark Tasks

The table below summarizes the benchmark tasks, their environment IDs, and their language instructions.

Paper task ID | Task description | Env ID string | Language instruction
Ball-Maze | Pick up a maze board with both arms and tilt it so a ball rolls into a target region. | duobench/ball_maze | pick up the board and tilt it so the ball rolls onto the red square
Bin-Sort | Sort two cubes into the matching bowls, testing simultaneous execution instead of direct cooperation. | duobench/bin_sort | use the left arm to place the white cube in the white bowl; use the right arm to place the black cube in the black bowl
Block-Balance | Place a beam on a support cube and then place two rectangular blocks on the beam simultaneously. | duobench/block_balance | place the beam on the cube and then place the other blocks on the beam simultaneously using one arm for each cube
Carry-Pot | Carry a pot using both side handles and place it on a stove. Both arms are needed to lift the pot. | duobench/carry_pot | use two arms to carry the pot at the handle on the stove
Hinge-Chest | Hold the lid of a small chest open while inserting a box. One arm must hold the lid while the other inserts the box. | duobench/hinge_chest | open the box with the right arm and place the cube inside the box with the left arm
Join-Blocks | Connect two movable blocks together and then attach them to a peg on a third stationary block. | duobench/join_blocks | join the two blocks using the peg on the left block and join the free socket of the right block with the peg on the wall
Pour-Marbles | Two cups, one of which contains marbles, must both be picked up; the marbles are then poured into the other cup before both cups are placed back. | duobench/pour_marbles | grasp and lift both cups, then pour the marbles from one cup into the other and place the cups back to their original location inside the green square
Spring-Door | A spring-loaded microwave door requires one arm to hold it open while the other inserts a box. | duobench/spring_door | use the left arm to open the microwave door, then use the right arm to place the box inside the microwave, and close the door again
Transfer-Cube | Hand over a cube between arms before placing it into a bowl. | duobench/transfer_cube | grasp the white cube with the right arm, hand it over to the left arm and place it in the white bowl with the left arm
Transfer-Gate | Hand over a box between arms before placing it onto a mat. The box has to be passed through a gate. | duobench/transfer_gate | use the right arm to pick up the white box, and hand it over to the left arm through the hoop, then place it on the green mat with the left arm
Transfer-Reorient | The right arm picks up a peg and hands it over to the left arm so that the left arm can insert it into a socket. | duobench/transfer_reorient | grasp the block with the right arm, hand it over to the left arm such that the left arm can easily insert the piece later, then insert the block into the socket with the left arm

Simulation Results

DuoBench is challenging for current dual-arm policies: none of the evaluated baselines comes close to saturating the full set of 11 simulation tasks.

The stage-based analysis is particularly useful because many failures happen in the earliest interaction phases, especially grasp acquisition and initial task setup. The benchmark also reveals meaningful differences between tasks that look similar at a high level but differ substantially in execution difficulty.
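
To make the stage-based analysis concrete, the following sketch computes the fraction of rollouts that failed in each stage. It assumes each rollout is summarized by the number of stages it completed; the stage names, the record format, and the function name `stage_failure_fractions` are hypothetical and do not reflect the actual DuoBench wrapper API.

```python
from collections import Counter


def stage_failure_fractions(stages, completed_counts):
    """Return the fraction of rollouts that failed in each stage.

    stages: ordered list of stage names for one task.
    completed_counts: per rollout, how many stages were completed.
    A rollout that completed k < len(stages) stages failed in stages[k];
    a rollout that completed all stages is counted under 'success'.
    """
    if not completed_counts:
        raise ValueError("need at least one rollout")
    num_stages = len(stages)
    counts = Counter()
    for completed in completed_counts:
        if not 0 <= completed <= num_stages:
            raise ValueError(f"invalid stage count: {completed}")
        label = "success" if completed == num_stages else stages[completed]
        counts[label] += 1
    total = len(completed_counts)
    return {label: counts[label] / total for label in list(stages) + ["success"]}


# Hypothetical stage names and rollout outcomes for a handover task.
stages = ["grasp", "handover", "place"]
rollouts = [0, 0, 1, 3, 3, 2, 0, 3]
print(stage_failure_fractions(stages, rollouts))
```

In this toy example, the same share of rollouts fails at the first stage as succeeds overall, which is the kind of early-stage bottleneck a binary success rate would hide.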

Parallel execution is especially difficult: Bin-Sort remains a weak point across policies, suggesting that independent yet coordinated control of both arms is still far from solved.

Mixed simulation-and-real training can help on some real-world tasks, but the sim-to-real gap remains clearly visible. That makes DuoBench useful not only for comparing policies within a domain, but also for studying transfer strategies and data-mixing choices.
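
The text above does not specify how simulated and real demonstrations are combined during training. As one simple data-mixing choice, the sketch below samples each batch with a fixed fraction of real episodes. The function name `mixed_batch`, the `real_fraction` parameter, and the toy data are assumptions made for illustration only.

```python
import random


def mixed_batch(sim_episodes, real_episodes, batch_size, real_fraction, rng):
    """Sample a batch containing a fixed fraction of real-world episodes."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    num_real = round(batch_size * real_fraction)
    num_sim = batch_size - num_real
    if (num_real and not real_episodes) or (num_sim and not sim_episodes):
        raise ValueError("a required data source is empty")
    # Sample with replacement so a small real dataset can still fill its share.
    batch = [rng.choice(real_episodes) for _ in range(num_real)]
    batch += [rng.choice(sim_episodes) for _ in range(num_sim)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
sim = [("sim", i) for i in range(100)]    # placeholder simulated episodes
real = [("real", i) for i in range(10)]   # placeholder real episodes
batch = mixed_batch(sim, real, batch_size=8, real_fraction=0.25, rng=rng)
print(sum(1 for source, _ in batch if source == "real"), "real episodes in batch")
```

Sweeping `real_fraction` is one way to study how much real data a policy needs before the sim-to-real gap narrows.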

  • Early-stage failures dominate many rollouts
  • Hinge-Chest is much harder than Spring-Door
  • Transfer-Gate is easier than unconstrained Transfer-Cube
  • Bin-Sort exposes persistent weakness in parallel execution
Fraction of rollouts in simulation that failed in a given stage.
Average task progress over normalized time across all rollouts in simulation.
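
A curve of average task progress over normalized time can be computed roughly as follows. The sketch assumes each rollout provides a per-step progress value in [0, 1] (for example, the fraction of stages completed so far) and that rollouts may differ in length; the function names and the sampling scheme are illustrative, not the benchmark's own implementation.

```python
def resample(progress, num_points):
    """Sample a per-step progress curve at evenly spaced normalized times."""
    if not progress or num_points < 2:
        raise ValueError("need a non-empty curve and at least two points")
    last = len(progress) - 1
    # Integer arithmetic maps normalized time i/(num_points-1) to a step index.
    return [progress[(i * last) // (num_points - 1)] for i in range(num_points)]


def average_progress(rollouts, num_points=5):
    """Average progress across rollouts on a shared normalized time axis."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    curves = [resample(rollout, num_points) for rollout in rollouts]
    return [sum(curve[i] for curve in curves) / len(curves) for i in range(num_points)]


# Two hypothetical rollouts of different length.
rollouts = [
    [0.0, 0.5, 1.0],            # short, successful rollout
    [0.0, 0.0, 0.0, 0.0, 0.5],  # longer rollout that stalls early
]
print(average_progress(rollouts, num_points=3))
```

Normalizing time before averaging keeps long, stalled rollouts from dominating the curve simply because they contain more steps.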

Real-World Results

Real-world evaluation on the four replicated tasks shows the same overall pattern as simulation. Many failures happen early, with grasping and initial setup remaining the dominant bottlenecks. At the same time, once grasping succeeds, later stages are often completed reliably, especially in tasks such as Transfer-Cube and Bin-Sort.

Fraction of real-world rollouts that ended in a given stage. This stage-distribution figure is shown in the appendix; success rates are annotated above for each policy-task pair.

Selected Rollouts

These qualitative rollouts make the benchmark failure modes more concrete: some runs succeed once the initial interaction is correct, while others stall immediately at grasping, setup, or coordinated follow-through.

Ball-Maze success: the policy establishes the correct dual-arm contact and then follows through on the task physics.
Bin-Sort success: an example where parallel execution works and both arms complete their own subtask.
Ball-Maze real-world rollout: a physical example on the replicated real setup, showing the same task family outside simulation.
Early grasp failure: a typical rollout where the policy never gets the first interaction under control.
Transfer-Cube failure: the task breaks before the handover because the initial grasp and setup phase is unstable.
Hinge-Chest failure: a longer rollout illustrating why this task remains one of the hardest in the benchmark.

BibTeX

@misc{duobench,
  title={{DuoBench}: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World}, 
  author={Tobias J{\"u}lg and Seongjin Bien and Simon Hilber and Yannik Blei and Pierre Krack and Maximilian Li and Sven Parusel and Rudolf Lioutikov and Florian Walter and Wolfram Burgard},
  year={2026},
  url={https://arxiv.org/abs/2606.11901}
}