

DSPv2: Improved Dense Policy for Effective and Generalizable Whole-body Mobile Manipulation

1The University of Hong Kong 2Astribot 3Xidian University 4Tsinghua University
* Project Leader    † Corresponding author

Effective Perception, Generalizable Manipulation and Coherent Whole-Body Actions

DSPv2 Diagram

DSPv2 is a whole-body mobile manipulation policy that achieves generalizable performance by fusing multi-view 2D semantic perception with 3D spatial awareness, and generates coherent whole-body actions via a dense action head.

Abstract


Learning whole-body mobile manipulation via imitation is essential for generalizing robotic skills to diverse environments and complex tasks. However, this goal is hindered by significant challenges, particularly in effectively processing complex observations, achieving robust generalization, and generating coherent actions. To address these issues, we propose DSPv2, a novel policy architecture. DSPv2 introduces an effective encoding scheme that aligns 3D spatial features with multi-view 2D semantic features. This fusion enables the policy to achieve broad generalization while retaining the fine-grained perception necessary for precise control. Furthermore, we extend the Dense Policy paradigm to the whole-body mobile manipulation domain, demonstrating its effectiveness in generating coherent and precise actions for the whole-body robotic platform. Extensive experiments show that our method significantly outperforms existing approaches in both task performance and generalization ability.

Overview of DSPv2

DSPv2 Diagram
First, a sparse 3D encoder processes the xyz point cloud, projected from the head camera frame to the base frame, to obtain voxel-level feature tokens. A 2D vision foundation model is also used to acquire patch-level feature maps. Subsequently, a Q-former queries multi-view semantic features from the feature maps for the voxels and fuses them with the spatial features, based on the positional information of the voxels and patches. Finally, the resulting features are fed into a dense head that generates the future action sequence in a bidirectional autoregressive paradigm, reducing error amplification between components and enhancing the coherence of whole-body actions.
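For concreteness, below is a minimal PyTorch sketch of these two ideas: cross-attention fusion of voxel and patch tokens, and coarse-to-fine bidirectional autoregressive decoding. This is not the authors' released code; all module names, layer counts, and dimensions (e.g., the hypothetical 22-dim whole-body action space and 16-step horizon) as well as the random stand-in inputs are illustrative assumptions.

import torch
import torch.nn as nn


class QFormerFusion(nn.Module):
    """Queries multi-view 2D patch features for each 3D voxel token and
    fuses them with the voxel's spatial feature via cross-attention."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_feats, voxel_pos, patch_feats, patch_pos):
        # voxel_feats: (B, Nv, D) tokens from the sparse 3D encoder
        # patch_feats: (B, Np, D) multi-view 2D foundation-model patches
        # Positional embeddings encode voxel / patch locations so attention
        # can associate each voxel with the patches that observe it.
        q = voxel_feats + voxel_pos
        kv = patch_feats + patch_pos
        semantic, _ = self.cross_attn(q, kv, kv)
        return self.norm(voxel_feats + semantic)  # fused spatial + semantic


class DenseActionHead(nn.Module):
    """Coarse-to-fine, bidirectional autoregressive decoding: start from a
    single action token and repeatedly double the temporal resolution,
    attending to the fused observation tokens at every level."""

    def __init__(self, dim=256, act_dim=22, horizon=16, heads=8):
        super().__init__()
        assert (horizon & (horizon - 1)) == 0, "horizon must be a power of 2"
        self.horizon = horizon
        self.start = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_action = nn.Linear(dim, act_dim)

    def forward(self, obs_tokens):
        B = obs_tokens.shape[0]
        seq = self.start.expand(B, 1, -1)
        while seq.shape[1] < self.horizon:
            # Upsample 2x in time, then refine with full (bidirectional)
            # self-attention over the sequence, conditioned on observations.
            seq = seq.repeat_interleave(2, dim=1)
            seq = self.decoder(seq, obs_tokens)
        return self.to_action(seq)  # (B, horizon, act_dim) whole-body actions


# Toy forward pass with random stand-ins for the real encoders.
B, Nv, Np, D = 1, 128, 512, 256
fusion = QFormerFusion(D)
head = DenseActionHead(D)
obs = fusion(torch.randn(B, Nv, D), torch.randn(B, Nv, D),
             torch.randn(B, Np, D), torch.randn(B, Np, D))
actions = head(obs)
print(actions.shape)  # torch.Size([1, 16, 22])

Note how the dense head differs from a causal (left-to-right) autoregressive decoder: every refinement level attends over the whole sequence in both directions, which is what keeps the predicted whole-body trajectory temporally coherent.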

Pick and Place with Light and Spatial Generalization

Pick and Place

With Lights Dimmed and Table Raised

Cart Pushing with Scene-level Generalization

In-domain Scene

In-domain Scene

Out-of-domain Scene

Bowling & Sort with Object Shape Generalization

Bowling

Original Bucket Sizes

All Bucket Sizes Changed

Deliver (Transfer to an Out-of-domain Robot and Platform)

Original Setup

Object Color Changed

Conclusion & Limitation

All videos above are shown at 4x speed. DSPv2 demonstrates the ability to fully exploit observation information and achieve coherent whole-body manipulation. It also generalizes across changes in lighting, spatial arrangement, object color and shape, and scene. However, its limitations are also clear.

When a large domain shift occurs, such as when the test robot differs from the training robot and the manipulation platform changes (as in Deliver), its generalization is very limited, which affects the stability of the whole-body actions. We believe that introducing higher-frequency modalities to help the policy achieve more robust generalization is key to addressing this challenge in future work.

Citation

@misc{su2025dspv2improveddensepolicy,
      title={DSPv2: Improved Dense Policy for Effective and Generalizable Whole-body Mobile Manipulation}, 
      author={Yue Su and Chubin Zhang and Sijin Chen and Liufan Tan and Yansong Tang and Jianan Wang and Xihui Liu},
      year={2025},
      eprint={2509.16063},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.16063}, 
}