Dense Policy offers new insights into policy learning. From a sequence-learning perspective, we posit that bidirectional prediction is better suited to sequence modeling than unidirectional prediction. For action generation, we show that expanding actions from sparse keyframes into a complete, dense sequence during inference is more effective than modeling the joint distribution directly.
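To make the coarse-to-fine idea concrete, here is a minimal sketch (not the authors' code) of what such an expansion schedule could look like: starting from a single keyframe, each inference level roughly doubles the number of predicted frames until the full horizon is covered, so a horizon-H sequence needs on the order of log2(H) levels rather than H autoregressive steps. The function name and the exact doubling rule are illustrative assumptions.

```python
# Hypothetical sketch of a coarse-to-fine expansion schedule: each dense level
# doubles the number of predicted action frames until the full horizon is reached.
def expansion_schedule(horizon: int) -> list[int]:
    lengths = [1]                      # start from a single sparse keyframe
    while lengths[-1] < horizon:
        lengths.append(min(2 * lengths[-1], horizon))
    return lengths

print(expansion_schedule(16))          # [1, 2, 4, 8, 16] -> ~log2(H) levels, not H steps
```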
Dense Policy accepts visual inputs of different modalities together with optional robot proprioception. A unified encoder performs cross-attention between hierarchical action representations and observation features, driving a bidirectionally expanding dense process: at each level, actions that start as sparse keyframes are progressively infilled and refined into a complete predicted sequence, yielding a coarse-to-fine generation procedure.
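Below is a minimal, hypothetical PyTorch sketch of the inference loop described above. We assume a standard Transformer decoder layer supplies the cross-attention between action tokens and observation features, and linear interpolation along the time axis stands in for keyframe infilling before each refinement level; the class name `DensePolicySketch`, the infilling rule, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# A minimal, hypothetical sketch of the bidirectional coarse-to-fine inference loop.
# Assumptions (not the original implementation): a TransformerDecoder provides the
# cross-attention between action tokens and observation features, and linear
# interpolation over time stands in for keyframe infilling at each level.
import torch
import torch.nn as nn


class DensePolicySketch(nn.Module):
    def __init__(self, act_dim: int = 7, d_model: int = 256, horizon: int = 16):
        super().__init__()
        self.horizon = horizon
        self.act_embed = nn.Linear(act_dim, d_model)   # action frames -> tokens
        self.act_head = nn.Linear(d_model, act_dim)    # tokens -> refined action frames
        # No causal mask: self-attention over action tokens is fully bidirectional,
        # while cross-attention conditions them on the observation features.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)

    @staticmethod
    def infill(actions: torch.Tensor, target_len: int) -> torch.Tensor:
        # Insert new frames between existing keyframes (linear interpolation over time).
        return nn.functional.interpolate(
            actions.transpose(1, 2), size=target_len, mode="linear", align_corners=True
        ).transpose(1, 2)

    def forward(self, obs_feats: torch.Tensor, init_action: torch.Tensor) -> torch.Tensor:
        """obs_feats: (B, N, d_model) fused visual/proprioceptive tokens from any encoder;
        init_action: (B, 1, act_dim) initial sparse keyframe."""
        actions, length = init_action, 1
        while length < self.horizon:
            length = min(2 * length, self.horizon)
            coarse = self.infill(actions, length)                  # sparse -> denser keyframes
            tokens = self.decoder(self.act_embed(coarse), obs_feats)
            actions = self.act_head(tokens)                        # refined actions at this level
        return actions                                             # (B, horizon, act_dim)


# Usage with dummy tensors standing in for encoded observations and the current pose.
policy = DensePolicySketch()
obs_feats = torch.randn(2, 64, 256)
init_action = torch.zeros(2, 1, 7)
pred = policy(obs_feats, init_action)   # -> torch.Size([2, 16, 7])
```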
Compared with other downstream policies, Dense Policy strikes a better balance between a lightweight parameter count and fast inference. It also learns more efficiently, reaching higher performance within the same number of training iterations. We demonstrate its efficacy across four distinct manipulation tasks, shown in the comparisons below.
[Video comparisons on the four manipulation tasks: Dense Policy (ours) vs. Baseline]
Dense Policy exhibits superior performance across a diverse range of manipulation targets, including rigid bodies, deformable objects, and articulated structures, as well as tasks characterized by high degrees of freedom, long horizons, and multi-object interactions. This is primarily attributed to its bidirectional sequence modeling, which produces smoother, more adaptive action trajectories, and to its coarse-to-fine hierarchical inference, which enables the high-precision actions required by manipulation tasks with low error tolerance.
We evaluate the zero-shot generalization of Dense Policy by elevating the cup, and then both the cup and the bowl. These configurations are absent from the expert demonstrations and thus pose an out-of-distribution challenge: the policy has likely never encountered grasping and pouring from such heights during training. We find that the policy completes the ball transfer when only the cup is elevated, but performance degrades significantly when the cup and the bowl are raised simultaneously. This indicates that while the policy generalizes to some extent, it is not yet robust to such combined perturbations.
(a) Elevate cup
(b) Elevate cup and bowl
(c) The Flower Dilemma
We also conducted an intriguing experiment on flower arrangement, a task that stringently tests a model's spatial reasoning, as mentioned in our paper. While the order in which flowers are picked is often inconsequential to task success, certain extreme cases demand a specific sequence. In the case presented, only by first inserting the flower with the blue base into the cup can all three flowers be arranged successfully; inserting the other flowers first makes grasping the blue-base flower significantly harder due to spatial constraints and the risk of colliding with already-inserted flowers, potentially leading to failure. Furthermore, the dense point cloud in this crowded scene constricts the action space, inherently increasing the task's difficulty. After successfully inserting the red-base flower, Dense Policy had difficulty picking the blue-base flower and subsequently stalled, highlighting a limitation in handling such constrained scenarios.
@article{su2025dense,
  title={Dense Policy: Bidirectional Autoregressive Learning of Actions},
  author={Su, Yue and Zhan, Xinyu and Fang, Hongjie and Xue, Han and Fang, Hao-Shu and Li, Yong-Lu and Lu, Cewu and Yang, Lixin},
  journal={arXiv preprint arXiv:2503.13217},
  year={2025}
}