Dense Policy offers new insights into policy learning. From a sequence-learning perspective, we posit that bidirectional prediction is better suited to sequence modeling than unidirectional prediction. For action generation, we show that expanding actions from sparse keyframes into a complete, dense sequence during inference is more effective than modeling the joint distribution directly.
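To make the coarse-to-fine idea concrete, here is a minimal sketch (not the authors' code) of what such an expansion schedule could look like: starting from a single keyframe, each inference level roughly doubles the number of predicted frames until the full horizon is covered, so a horizon-H sequence needs on the order of log2(H) levels rather than H autoregressive steps. The function name and the exact doubling rule are illustrative assumptions.

```python
# Hypothetical sketch of a coarse-to-fine expansion schedule: each dense level
# doubles the number of predicted action frames until the full horizon is reached.
def expansion_schedule(horizon: int) -> list[int]:
    lengths = [1]                      # start from a single sparse keyframe
    while lengths[-1] < horizon:
        lengths.append(min(2 * lengths[-1], horizon))
    return lengths

print(expansion_schedule(16))          # [1, 2, 4, 8, 16] -> ~log2(H) levels, not H steps
```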
Dense Policy accepts visual inputs of different modalities together with optional robot proprioception. A unified encoder performs cross-attention between hierarchical action representations and observation features, driving a bidirectionally expanding dense process: at each level, actions that start as sparse keyframes are progressively infilled and refined into a complete predicted sequence, yielding a coarse-to-fine generation procedure.
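Below is a minimal, hypothetical PyTorch sketch of the inference loop described above. We assume a standard Transformer decoder layer supplies the cross-attention between action tokens and observation features, and linear interpolation along the time axis stands in for keyframe infilling before each refinement level; the class name `DensePolicySketch`, the infilling rule, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# A minimal, hypothetical sketch of the bidirectional coarse-to-fine inference loop.
# Assumptions (not the original implementation): a TransformerDecoder provides the
# cross-attention between action tokens and observation features, and linear
# interpolation over time stands in for keyframe infilling at each level.
import torch
import torch.nn as nn


class DensePolicySketch(nn.Module):
    def __init__(self, act_dim: int = 7, d_model: int = 256, horizon: int = 16):
        super().__init__()
        self.horizon = horizon
        self.act_embed = nn.Linear(act_dim, d_model)   # action frames -> tokens
        self.act_head = nn.Linear(d_model, act_dim)    # tokens -> refined action frames
        # No causal mask: self-attention over action tokens is fully bidirectional,
        # while cross-attention conditions them on the observation features.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)

    @staticmethod
    def infill(actions: torch.Tensor, target_len: int) -> torch.Tensor:
        # Insert new frames between existing keyframes (linear interpolation over time).
        return nn.functional.interpolate(
            actions.transpose(1, 2), size=target_len, mode="linear", align_corners=True
        ).transpose(1, 2)

    def forward(self, obs_feats: torch.Tensor, init_action: torch.Tensor) -> torch.Tensor:
        """obs_feats: (B, N, d_model) fused visual/proprioceptive tokens from any encoder;
        init_action: (B, 1, act_dim) initial sparse keyframe."""
        actions, length = init_action, 1
        while length < self.horizon:
            length = min(2 * length, self.horizon)
            coarse = self.infill(actions, length)                  # sparse -> denser keyframes
            tokens = self.decoder(self.act_embed(coarse), obs_feats)
            actions = self.act_head(tokens)                        # refined actions at this level
        return actions                                             # (B, horizon, act_dim)


# Usage with dummy tensors standing in for encoded observations and the current pose.
policy = DensePolicySketch()
obs_feats = torch.randn(2, 64, 256)
init_action = torch.zeros(2, 1, 7)
pred = policy(obs_feats, init_action)   # -> torch.Size([2, 16, 7])
```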
Compared with other downstream policies, Dense Policy strikes a better balance between a lightweight parameter count and fast inference. It also learns more efficiently, reaching higher performance within the same number of training iterations. We demonstrate its efficacy across four distinct manipulation tasks, shown in the comparisons below.
[Video comparisons on the four manipulation tasks: Dense Policy (ours) vs. Baseline]
Dense Policy exhibits superior performance across a diverse range of manipulation targets, including rigid bodies, deformable objects, and articulated structures, as well as tasks characterized by high degrees of freedom, long horizons, and multi-object interactions. This is primarily attributed to its bidirectional sequence modeling, which produces smoother, more adaptive action trajectories, and to its coarse-to-fine hierarchical inference, which enables the high-precision actions required by manipulation tasks with low error tolerance.
We evaluate the zero-shot generalization of Dense Policy by elevating the cup, and then both the cup and the bowl. These configurations are absent from the expert demonstrations and thus pose an out-of-distribution challenge: the policy has likely never encountered grasping and pouring from such heights during training. We find that the policy completes the ball transfer when only the cup is elevated, but performance degrades significantly when the cup and the bowl are raised simultaneously. This indicates that while the policy generalizes to some extent, it is not yet robust to such combined perturbations.
(a) Elevate cup
(b) Elevate cup and bowl
(c) The Flower Dilemma
We also conducted an intriguing experiment on flower arrangement, a task that stringently tests a model's spatial reasoning, as mentioned in our paper. While the order in which flowers are picked is often inconsequential to task success, certain extreme cases demand a specific sequence. In the case presented, only by first inserting the flower with the blue base into the cup can all three flowers be arranged successfully; inserting the other flowers first makes grasping the blue-base flower significantly harder due to spatial constraints and the risk of colliding with already-inserted flowers, potentially leading to failure. Furthermore, the dense point cloud in this crowded scene constricts the action space, inherently increasing the task's difficulty. After successfully inserting the red-base flower, Dense Policy had difficulty picking the blue-base flower and subsequently stalled, highlighting a limitation in handling such constrained scenarios.
@article{su2025dense,
  title={Dense Policy: Bidirectional Autoregressive Learning of Actions},
  author={Su, Yue and Zhan, Xinyu and Fang, Hongjie and Xue, Han and Fang, Hao-Shu and Li, Yong-Lu and Lu, Cewu and Yang, Lixin},
  journal={arXiv preprint arXiv:2503.13217},
  year={2025}
}