Drive-Cascade: Autoregressive Occupancy to LiDAR and Video Synthesis

Shuangming Lei1,*, Yuehao Huang1,*, Yi Yao1, Yijia Xie1, Jingke Wang1, Ruoyu Wang1, Jiajun Lv1, Guanglin Xu2, AiXue Ye2, Bingbing Liu2, Siyuan Cheng2, Hongbo Zhang2, Yukai Ma1†, Yong Liu1†
1Zhejiang University 2Huawei Noah's Ark Lab

Abstract

Generating realistic, consistent, and controllable multi-modal data for dynamic driving scenes remains a central challenge in autonomous vehicle simulation. Current methods often struggle to maintain geometric and temporal coherence, particularly when synthesizing complex interactions across disparate modalities such as LiDAR and video. In this paper, we propose a cascaded autoregressive framework that generates realistic, multi-modally aligned driving scenes. Its key idea is to use dynamic occupancy as a unified, explicit intermediate representation. The framework operates in two stages. First, it generates a coherent sequence of controllable dynamic occupancy grids that capture the spatiotemporal geometry of the scene. Second, conditioned on this generated occupancy prior, two specialized diffusion models autoregressively synthesize the corresponding LiDAR point clouds and camera videos. Because the generation of every modality is anchored to a shared geometric foundation, the model inherently maintains cross-modal consistency and temporal stability. Extensive experiments demonstrate that the approach significantly outperforms state-of-the-art methods in generation fidelity, geometric accuracy, and long-term temporal coherence for both LiDAR and video synthesis, paving the way for high-fidelity multi-modal simulation.
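The two-stage data flow described in the abstract can be pictured with a small sketch. This is only an illustration under loud assumptions: every function name here (`next_occupancy`, `rollout_occupancy`, `lidar_from_occupancy`, `image_from_occupancy`, `generate_scene`) is hypothetical, and the toy stand-ins replace the learned occupancy generator and the two diffusion models with trivial NumPy operations. The toy sensor stand-ins also work frame by frame for brevity, whereas the abstract describes autoregressive synthesis in the second stage as well.

```python
import numpy as np

def next_occupancy(history, rng):
    """Toy stand-in for the stage-1 occupancy generator (hypothetical).
    Shifts the latest grid one cell along x and flips a few random voxels."""
    shifted = np.roll(history[-1], shift=1, axis=0)
    noise = rng.random(shifted.shape) < 0.01
    return np.logical_xor(shifted, noise)

def rollout_occupancy(first_frame, num_frames, context=4, seed=0):
    """Stage 1: autoregressive rollout. Each new frame is produced from
    the last `context` frames that were already generated."""
    rng = np.random.default_rng(seed)
    frames = [first_frame]
    while len(frames) < num_frames:
        frames.append(next_occupancy(frames[-context:], rng))
    return np.stack(frames)  # shape (T, X, Y, Z)

def lidar_from_occupancy(occ_frame):
    """Toy stand-in for the LiDAR generator: one point per occupied voxel,
    placed at the voxel center (in voxel units)."""
    return np.argwhere(occ_frame).astype(np.float32) + 0.5

def image_from_occupancy(occ_frame):
    """Toy stand-in for the video generator: a top-down silhouette image."""
    return occ_frame.any(axis=2).astype(np.float32)

def generate_scene(first_frame, num_frames):
    """Stage 2: both sensor streams are derived from the SAME occupancy
    sequence, so they share one underlying geometry."""
    occ = rollout_occupancy(first_frame, num_frames)
    lidar = [lidar_from_occupancy(frame) for frame in occ]
    video = np.stack([image_from_occupancy(frame) for frame in occ])
    return occ, lidar, video

# A small scene containing a single box-shaped object.
first = np.zeros((16, 16, 4), dtype=bool)
first[4:6, 8:10, 0:2] = True
occ, lidar, video = generate_scene(first, num_frames=6)
```

The point of the sketch is the dependency structure rather than the models themselves: the occupancy sequence is rolled out first, and both sensor outputs are computed from that single sequence instead of being generated independently.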

Drive-Cascade teaser figure

Drive-Cascade models dynamic occupancy autoregressively, then uses the generated 4D state as a shared geometric scaffold for LiDAR and video synthesis.

Highlights

Unified 4D Occupancy State

Dynamic semantic occupancy serves as the explicit world state, making spatial structure and temporal evolution controllable.

Autoregressive Long-Horizon Modeling

The state sequence is generated autoregressively to preserve object permanence and temporal continuity over extended scenes.

Cross-Modal Sensor Synthesis

Conditioned on the same occupancy prior, dedicated LiDAR and video generators remain geometrically aligned and temporally stable.
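To make the idea of occupancy as a geometric anchor concrete, the sketch below marches rays through a boolean occupancy grid and reports the first-hit range for each ray. This is a classical geometric stand-in chosen for illustration, not the diffusion-based LiDAR generator described on this page; the function name `raycast_ranges`, the voxel size, the step size, and the example scene are all assumptions.

```python
import numpy as np

def raycast_ranges(occ, origin, directions, voxel_size=0.5,
                   max_range=20.0, step=0.1):
    """March each ray through a boolean occupancy grid (hypothetical helper).
    Returns the distance to the first occupied voxel per ray, or max_range
    if the ray leaves the grid or hits nothing."""
    origin = np.asarray(origin, dtype=np.float64)
    ranges = np.full(len(directions), max_range)
    for i, d in enumerate(directions):
        d = d / np.linalg.norm(d)  # unit direction
        for t in np.arange(step, max_range, step):
            idx = np.floor((origin + t * d) / voxel_size).astype(int)
            if np.any(idx < 0) or np.any(idx >= occ.shape):
                break  # ray left the grid without a hit
            if occ[tuple(idx)]:
                ranges[i] = t
                break
    return ranges

# Assumed scene: a 20 m x 20 m x 4 m grid with a wall starting at x = 10 m.
occ = np.zeros((40, 40, 8), dtype=bool)
occ[20, :, :] = True

directions = np.array([
    [1.0, 0.0, 0.0],   # toward the wall
    [-1.0, 0.0, 0.0],  # away from the wall
    [0.0, 1.0, 0.0],   # parallel to the wall
])
ranges = raycast_ranges(occ, origin=(2.0, 10.0, 1.0), directions=directions)
```

With the sensor 8 m in front of the wall, the first ray returns roughly 8 m (up to one marching step), while the other two rays exit the grid and return `max_range`. Any sensor output tied to the same grid has to agree with ranges like these, which is one intuition for why a shared occupancy prior keeps modalities geometrically aligned.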

Figures

Teaser

Overview figure highlighting Drive-Cascade's autoregressive occupancy-to-sensor generation pipeline.

Drive-Cascade qualitative visualization overview

Qualitative Overview

Combined qualitative comparison highlighting coherent multi-modal generation.

BibTeX

@article{lei2026drivecascade,
  title={Drive-Cascade: Autoregressive Occupancy to LiDAR and Video Synthesis},
  author={Lei, Shuangming and Huang, Yuehao and Yao, Yi and Xie, Yijia and Wang, Jingke and Wang, Ruoyu and Lv, Jiajun and Xu, Guanglin and Ye, AiXue and Liu, Bingbing and Cheng, Siyuan and Zhang, Hongbo and Ma, Yukai and Liu, Yong},
  journal={CVPR 2026 Findings},
  year={2026},
  url={https://summersray.github.io/Drive-Cascade/}
}