

DSPv2: Improved Dense Policy for Effective and Generalizable Whole-body Mobile Manipulation

1The University of Hong Kong 2Astribot 3Xidian University 4Tsinghua University
* Project Leader    † Corresponding author

Effective Perception, Generalizable Manipulation and Coherent Whole-Body Actions

DSPv2 Diagram

DSPv2 is a whole-body mobile manipulation policy that achieves generalizable performance by fusing multi-view 2D semantic perception with 3D spatial awareness, and generates coherent whole-body actions via a dense action head.

Abstract


Learning whole-body mobile manipulation via imitation is essential for generalizing robotic skills to diverse environments and complex tasks. However, this goal is hindered by significant challenges, particularly in effectively processing complex observations, achieving robust generalization, and generating coherent actions. To address these issues, we propose DSPv2, a novel policy architecture. DSPv2 introduces an effective encoding scheme that aligns 3D spatial features with multi-view 2D semantic features. This fusion enables the policy to achieve broad generalization while retaining the fine-grained perception necessary for precise control. Furthermore, we extend the Dense Policy paradigm to the whole-body mobile manipulation domain, demonstrating its effectiveness in generating coherent and precise actions for the whole-body robotic platform. Extensive experiments show that our method significantly outperforms existing approaches in both task performance and generalization ability.

Overview of DSPv2

DSPv2 Diagram
First, a sparse 3D encoder processes the xyz point cloud, projected from the head camera frame to the base frame, to obtain voxel-level feature tokens. A 2D vision foundation model is also used to acquire patch-level feature maps. Subsequently, a Q-former queries multi-view semantic features from the feature maps for the voxels and fuses them with the spatial features, based on the positional information of the voxels and patches. Finally, the resulting features are fed into a dense head that generates the future action sequence in a bidirectional autoregressive paradigm, reducing error amplification between components and enhancing the coherence of whole-body actions.
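For concreteness, below is a minimal PyTorch sketch of these two ideas: cross-attention fusion of voxel and patch tokens, and coarse-to-fine bidirectional autoregressive decoding. This is not the authors' released code; all module names, layer counts, and dimensions (e.g., the hypothetical 22-dim whole-body action space and 16-step horizon) as well as the random stand-in inputs are illustrative assumptions.

import torch
import torch.nn as nn


class QFormerFusion(nn.Module):
    """Queries multi-view 2D patch features for each 3D voxel token and
    fuses them with the voxel's spatial feature via cross-attention."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_feats, voxel_pos, patch_feats, patch_pos):
        # voxel_feats: (B, Nv, D) tokens from the sparse 3D encoder
        # patch_feats: (B, Np, D) multi-view 2D foundation-model patches
        # Positional embeddings encode voxel / patch locations so attention
        # can associate each voxel with the patches that observe it.
        q = voxel_feats + voxel_pos
        kv = patch_feats + patch_pos
        semantic, _ = self.cross_attn(q, kv, kv)
        return self.norm(voxel_feats + semantic)  # fused spatial + semantic


class DenseActionHead(nn.Module):
    """Coarse-to-fine, bidirectional autoregressive decoding: start from a
    single action token and repeatedly double the temporal resolution,
    attending to the fused observation tokens at every level."""

    def __init__(self, dim=256, act_dim=22, horizon=16, heads=8):
        super().__init__()
        assert (horizon & (horizon - 1)) == 0, "horizon must be a power of 2"
        self.horizon = horizon
        self.start = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_action = nn.Linear(dim, act_dim)

    def forward(self, obs_tokens):
        B = obs_tokens.shape[0]
        seq = self.start.expand(B, 1, -1)
        while seq.shape[1] < self.horizon:
            # Upsample 2x in time, then refine with full (bidirectional)
            # self-attention over the sequence, conditioned on observations.
            seq = seq.repeat_interleave(2, dim=1)
            seq = self.decoder(seq, obs_tokens)
        return self.to_action(seq)  # (B, horizon, act_dim) whole-body actions


# Toy forward pass with random stand-ins for the real encoders.
B, Nv, Np, D = 1, 128, 512, 256
fusion = QFormerFusion(D)
head = DenseActionHead(D)
obs = fusion(torch.randn(B, Nv, D), torch.randn(B, Nv, D),
             torch.randn(B, Np, D), torch.randn(B, Np, D))
actions = head(obs)
print(actions.shape)  # torch.Size([1, 16, 22])

Note how the dense head differs from a causal (left-to-right) autoregressive decoder: every refinement level attends over the whole sequence in both directions, which is what keeps the predicted whole-body trajectory temporally coherent.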

Pick and Place with Light and Spatial Generalization

Pick and Place

With Lights Dimmed and Table Raised

Cart Pushing with Scene-level Generalization

In-domain Scene

In-domain Scene

Out-of-domain Scene

Bowling & Sort with Object Shape Generalization

Bowling

Original Bucket Sizes

All Bucket Sizes Changed

Deliver (Transfer to an Out-of-domain Robot and Platform)

Original Setup

Object Color Changed

Conclusion & Limitation

All videos above are shown at 4x speed. DSPv2 demonstrates the ability to fully exploit observation information and achieve coherent whole-body manipulation. It also generalizes across changes in lighting, spatial arrangement, object color and shape, and scene. However, its limitations are also clear.

When a large domain shift occurs, such as when the test robot differs from the training robot and the manipulation platform changes (as in Deliver), its generalization is very limited, which affects the stability of the whole-body actions. We believe that introducing higher-frequency modalities to help the policy achieve more robust generalization is key to addressing this challenge in future work.

Citation

@misc{su2025dspv2improveddensepolicy,
      title={DSPv2: Improved Dense Policy for Effective and Generalizable Whole-body Mobile Manipulation}, 
      author={Yue Su and Chubin Zhang and Sijin Chen and Liufan Tan and Yansong Tang and Jianan Wang and Xihui Liu},
      year={2025},
      eprint={2509.16063},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.16063}, 
}