Motion Before Action: Diffusing Object Motion as Manipulation Condition

Yue Su¹,²,*, Xinyu Zhan¹,*, Hongjie Fang¹, Yong-Lu Li¹, Cewu Lu¹, Lixin Yang¹,†

¹Shanghai Jiao Tong University, ²Xidian University

* Equal contribution † Corresponding author

IEEE RA-L 2025

MBA assists robotic manipulation by predicting the motion representation of objects from the scene.

Abstract

Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA, a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks.

Motion Before Action (MBA)

MBA is a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads.
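The cascade described above can be sketched as two DDPM-style sampling loops run back to back: the first samples an object pose sequence conditioned on the observation, and the second samples a robot action sequence conditioned on both the observation and the predicted object motion. This is a minimal illustration, not the authors' implementation. The names `toy_eps`, `ddpm_sample`, and `mba_sample`, the 7-dimensional pose and action vectors, the horizon, and the linear noise schedule are all assumptions made for the example, and `toy_eps` is a stand-in for trained noise-prediction networks.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice): betas and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def toy_eps(x, t, cond):
    """Hypothetical stand-in for a trained noise predictor eps(x, t, cond)."""
    bias = sum(cond) / len(cond)
    return [[value - bias for value in row] for row in x]


def ddpm_sample(eps_model, cond, horizon, dim, num_steps, rng):
    """DDPM ancestral sampling of a (horizon x dim) sequence given `cond`."""
    betas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise.
    x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(horizon)]
    for t in reversed(range(num_steps)):
        eps = eps_model(x, t, cond)
        alpha = 1.0 - betas[t]
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        x = [[(x[i][j] - coef * eps[i][j]) / math.sqrt(alpha)
              + sigma * rng.gauss(0.0, 1.0)
              for j in range(dim)]
             for i in range(horizon)]
    return x


def mba_sample(obs_feat, rng, horizon=4, pose_dim=7, action_dim=7, num_steps=10):
    """Two cascaded diffusion processes: object motion first, then robot action."""
    # Stage 1: future object pose sequence, conditioned on the observation only.
    object_motion = ddpm_sample(toy_eps, obs_feat, horizon, pose_dim, num_steps, rng)
    # Stage 2: robot action sequence, conditioned on observation + predicted motion.
    flat_motion = [value for row in object_motion for value in row]
    actions = ddpm_sample(toy_eps, obs_feat + flat_motion, horizon, action_dim,
                          num_steps, rng)
    return object_motion, actions


object_motion, actions = mba_sample([0.1, 0.2, 0.3], random.Random(0))
print(len(object_motion), len(object_motion[0]), len(actions), len(actions[0]))
```

In this sketch the only coupling between the two stages is the conditioning vector passed to the second sampler, which is what would let such a module attach to an existing policy that already has a diffusion action head.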

Presentation

MBA benefits manipulation in two ways. When the manipulated object is stationary, predicting its pose sequence is equivalent to accurately estimating the object's pose, which helps the robot arm grasp the object. When the object is in motion, the trajectory described by the predicted pose sequence serves as a prior for the robot's motion sequence, guiding and calibrating the robot's movements.

Cut Clay at 4x Speed

MBA (ours)

Baseline

Open Drawer at 4x Speed

MBA (ours)

Baseline

Put Bread into Pot at 4x Speed

MBA (ours)

Baseline

Pour Balls at 4x Speed

MBA (ours)

Baseline

Handling Diverse Objects and Tasks

We conduct comparative experiments with MBA on three 2D and 3D robotic manipulation policies with diffusion action heads, demonstrating substantial performance improvements across various tasks. The evaluation covers 57 tasks from 3 simulation benchmarks and 4 real-world tasks, spanning articulated object manipulation, soft and rigid body manipulation, tool use, non-tool use, and diverse action patterns. Results show that MBA consistently enhances the performance of such policies in both simulated and real-world environments.

Failure Cases

Because only 50 expert demonstrations are available, MBA's estimation of object pose sequences is not always robust, which can lead to task execution failures. For example, in the Open Drawer task shown below, when the target is positioned at the image edge and the point cloud is therefore occluded, MBA fails to grasp the drawer handle. Similarly, in the Put Bread task, heavy clutter in the scene degrades the policy's execution.

Failure Cases at 4x Speed

Open Drawer

Put Bread into Pot

Test Recording

Citation

@ARTICLE{MBA,
  author={Su, Yue and Zhan, Xinyu and Fang, Hongjie and Li, Yong-Lu and Lu, Cewu and Yang, Lixin},
  journal={IEEE Robotics and Automation Letters}, 
  title={Motion Before Action: Diffusing Object Motion as Manipulation Condition}, 
  year={2025},
  volume={10},
  number={7},
  pages={7428-7435},
  doi={10.1109/LRA.2025.3577424}
}