Opening heavy, self-closing doors—especially those that require pulling—remains a long-standing challenge in robotics. Humans naturally employ both arms in a dexterous manner—rotating the handle, widening the gap, holding the door, switching arms when needed, and moving through while maintaining clearance. To replicate such behaviors, a robot must perform a long sequence of motions spanning multiple stages and interactions with different parts of the door. Traditional approaches rely on state machines that transition between manually defined stages (e.g., pulling after the knob is rotated, passing after the gap is sufficiently wide). While intuitive, these methods lack robustness, as hand-crafted trajectories fail to generalize to the diversity of real-world conditions without extensive engineering effort. Recent advances in imitation learning offer a scalable alternative, yet no existing visual-action model has demonstrated simultaneous coordination of a nonholonomic base and dual arms for the complete door opening and passing task. In this paper, we tackle this complex, highly constrained problem using a diffusion-based visuomotor control policy. Our results demonstrate that a single end-to-end policy can be learned to execute long-horizon tasks requiring tight coordination between manipulation and locomotion. The resulting policy not only achieves a high success rate in opening and traversing damped pull doors but also demonstrates strong robustness to external disturbances—capabilities that are difficult to realize with traditional methods.
To collect data in simulation, we use a state-based controller that combines inverse kinematics (IK) for door manipulation with model predictive control (MPC) for base motion. To better approximate human-like behavior, we do not command the IK solution directly at the joint level. Instead, we use a scheduler that gradually guides the robot in task space toward the desired joint configuration found by the IK solver. To increase realism and variability, we randomize lighting conditions as well as door and handle colors across episodes, reflecting the natural diversity of real-world environments.
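The idea of approaching a goal gradually rather than jumping to it can be illustrated with a bounded-step waypoint generator. This is only a minimal sketch under our own assumptions: the functions `next_waypoint` and `schedule`, the straight-line interpolation, and the `max_step` parameter are hypothetical and are not the scheduler used in this work, whose details are not given here.

```python
import math

def next_waypoint(current, target, max_step):
    """Move from `current` toward `target` by at most `max_step` (Euclidean)."""
    delta = [t - c for c, t in zip(current, target)]
    dist = math.sqrt(sum(d * d for d in delta))
    if dist <= max_step:
        # Close enough: snap to the target so the schedule terminates exactly.
        return list(target)
    scale = max_step / dist
    return [c + d * scale for c, d in zip(current, delta)]

def schedule(start, target, max_step):
    """Return the list of task-space waypoints from `start` to `target`."""
    waypoints = [list(start)]
    while waypoints[-1] != list(target):
        waypoints.append(next_waypoint(waypoints[-1], target, max_step))
    return waypoints

# Hypothetical end-effector positions in metres, for illustration only.
path = schedule([0.0, 0.0, 0.0], [0.10, 0.0, 0.0], max_step=0.03)
```

Each waypoint would then be passed to the IK solver in turn, so the arm follows a smooth task-space path instead of taking the shortest joint-space route.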
In both hardware and simulation, we generated 100 demonstration trajectories. To prevent the policies from overfitting to a single motion pattern, we additionally randomized the robot's initial base pose (dx, dy, and dyaw). The robot base is placed at a lateral distance of 0.90 ± 0.03 m, with a longitudinal offset within ±0.03 m and a yaw offset within ±1.00 rad.
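The initial-pose randomization can be sketched as follows. The ranges are taken from the text above. The uniform distribution, the `BasePose` container, and the function name `sample_base_pose` are assumptions made for illustration, since the sampling distribution is not specified.

```python
import random
from dataclasses import dataclass

@dataclass
class BasePose:
    lateral: float       # lateral distance [m]
    longitudinal: float  # longitudinal offset [m]
    yaw: float           # yaw offset [rad]

def sample_base_pose(rng, lateral_nominal=0.90, lateral_range=0.03,
                     longitudinal_range=0.03, yaw_range=1.00):
    """Draw one initial base pose; uniform sampling is an assumption."""
    return BasePose(
        lateral=lateral_nominal + rng.uniform(-lateral_range, lateral_range),
        longitudinal=rng.uniform(-longitudinal_range, longitudinal_range),
        yaw=rng.uniform(-yaw_range, yaw_range),
    )

rng = random.Random(0)  # fixed seed so the set of episodes is reproducible
poses = [sample_base_pose(rng) for _ in range(100)]  # one pose per demonstration
```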
For collecting expert demonstrations on hardware, we use the teleoperation kit provided by Realman, which has the same joint configuration as the target robot. We collect image data from three independent cameras capturing RGB images of shape 180×240 pixels. During teleoperation, the robot's joint positions and images from all three cameras are recorded as the state, while the joint positions commanded through the teleoperation kit are stored as actions.
Our diffusion-based policy architecture consists of three independent ResNet-18 encoders for multi-view perception and a 1D U-Net with FiLM conditioning for action generation. The policy operates with a prediction horizon of 16 steps and generates action sequences of length 8, using stacked observations from the last 3 timesteps.
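The interplay of the three horizons can be sketched as a receding-horizon rollout loop. This is a minimal sketch under stated assumptions: we assume that the policy predicts 16 actions per call and that the first 8 are executed before replanning, which is one common reading of these settings. The `toy_policy` stand-in, the `step_fn` environment callback, and the scalar observations are hypothetical placeholders and not the trained model.

```python
from collections import deque

OBS_STEPS = 3       # stacked observations fed to the policy
PRED_HORIZON = 16   # actions predicted per policy call
ACTION_STEPS = 8    # actions executed before replanning (assumed)

def toy_policy(obs_stack):
    """Hypothetical stand-in: returns PRED_HORIZON scalar 'actions'."""
    latest = obs_stack[-1]
    return [latest + k for k in range(PRED_HORIZON)]

def rollout(policy, first_obs, step_fn, n_env_steps):
    """Run the policy in closed loop for n_env_steps environment steps."""
    # Pad the history with the first observation until real ones arrive.
    obs_buffer = deque([first_obs] * OBS_STEPS, maxlen=OBS_STEPS)
    executed = []
    while len(executed) < n_env_steps:
        plan = policy(list(obs_buffer))
        for action in plan[:ACTION_STEPS]:
            obs_buffer.append(step_fn(action))  # oldest observation drops out
            executed.append(action)
            if len(executed) == n_env_steps:
                break
    return executed

# Toy environment whose next observation simply echoes the action.
actions = rollout(toy_policy, first_obs=0, step_fn=lambda a: a, n_env_steps=16)
```

Executing only part of each predicted sequence lets the policy replan from fresh observations, which matters when the door state changes unexpectedly during execution.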
The diffusion process uses 100 forward diffusion steps during training; at inference, this is reduced to 10 denoising steps for computational efficiency. This approach enables the policy to learn smooth, coordinated actions while maintaining robustness to the complex dynamics of door opening and traversal tasks.
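One common way to run fewer denoising steps at inference than were used in training is to visit an evenly spaced subset of the training timesteps, as in DDIM-style samplers. The sampler is not named in the text, so the sketch below is an assumption, and the function name `inference_timesteps` is hypothetical.

```python
def inference_timesteps(num_train_steps=100, num_inference_steps=10):
    """Evenly spaced subset of training timesteps, from noisiest to cleanest."""
    if num_train_steps % num_inference_steps != 0:
        raise ValueError("this sketch assumes an integer stride")
    stride = num_train_steps // num_inference_steps
    ascending = [i * stride for i in range(num_inference_steps)]
    return ascending[::-1]  # denoising runs from high noise to low noise

timesteps = inference_timesteps()
```

With the default values, the denoiser is evaluated at every tenth training timestep, so each action sequence costs 10 network evaluations instead of 100.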
We successfully trained a diffusion policy that enables a mobile manipulator to open and traverse a damped pull door using only visual and proprioceptive inputs. The policy learns a long-horizon trajectory that integrates multiple coordinated skills, including reaching for the handle, twisting it, pulling the door, coordinating both arms, and synchronizing manipulation with locomotion.
We conducted an ablation study on our proposed policy and additionally trained ACT and SmolVLA as baselines for comparison, as shown in Table II. Each policy was trained for 100k steps, with evaluations performed every 20k steps. During each evaluation, the policy was executed 10 times, and we report the best-performing trial.
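The evaluation schedule can be written out as a short sketch. The helper names are hypothetical, the outcome lists are placeholder data rather than results from this work, and selecting the checkpoint with the highest success rate is only one possible reading of the reporting rule above.

```python
def eval_steps(total_steps=100_000, every=20_000):
    """Training steps at which a checkpoint is evaluated."""
    return list(range(every, total_steps + 1, every))

def success_rate(outcomes):
    """Fraction of successful rollouts (True = success)."""
    return sum(outcomes) / len(outcomes)

def best_checkpoint(results):
    """Training step whose rollouts achieved the highest success rate."""
    return max(results, key=lambda step: success_rate(results[step]))

steps = eval_steps()

# Placeholder outcomes (10 rollouts per checkpoint), NOT measured data.
placeholder = {
    step: [i < n_ok for i in range(10)]
    for step, n_ok in zip(steps, [2, 5, 7, 9, 8])
}
best = best_checkpoint(placeholder)
```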
The policy successfully executed reliable door-opening behaviors with bimanual coordination on real hardware. Notably, the learned policy demonstrated robustness to disturbances: when the door was manually re-closed during execution, the policy responded by halting further extension and re-initiating the opening sequence.
Our study demonstrates that diffusion-based visuomotor policies can achieve reliable performance on the challenging task of opening and traversing damped pull doors using a dual-arm mobile manipulator. Unlike prior approaches that rely on state machines or heavily engineered perception pipelines, our method learns a unified policy that integrates perception, manipulation, and base coordination directly from demonstration data. The results show that diffusion policies not only generate long-horizon trajectories but also exhibit robustness to disturbances and environmental variability, a critical capability for real-world deployment.