Drive-Cascade: Autoregressive Occupancy to LiDAR and Video Synthesis

Lei, Shuangming; Huang, Yuehao; Yao, Yi; Xie, Yijia; Wang, Jingke; Wang, Ruoyu; Lv, Jiajun; Xu, Guanglin; Ye, AiXue; Liu, Bingbing; Cheng, Siyuan; Zhang, Hongbo; Ma, Yukai; Liu, Yong

Drive-Cascade: Autoregressive Occupancy to LiDAR and Video Synthesis

Shuangming Lei^1,*, Yuehao Huang^1,*, Yi Yao¹, Yijia Xie¹, Jingke Wang¹, Ruoyu Wang¹, Jiajun Lv¹, Guanglin Xu², AiXue Ye², Bingbing Liu², Siyuan Cheng², Hongbo Zhang², Yukai Ma^1†, Yong Liu^1†

¹Zhejiang University ²Huawei Noah's Ark Lab

Paper PDF Read Online Project Repo Demo BibTeX

Abstract

The generation of realistic, consistent, and controllable multi-modal data for dynamic driving scenes remains a crucial challenge in autonomous vehicle simulation. Current methods often struggle to maintain geometric and temporal coherence, particularly when synthesizing complex interactions across disparate modalities, such as LiDAR and video. In this paper, we propose a novel cascaded autoregressive framework to generate highly realistic and multi-modally aligned driving scenes. The key innovation of this work is the utilization of dynamic occupancy as a unified and explicit intermediate representation. The proposed framework operates in two stages: first, the system generates a coherent sequence of controllable dynamic occupancy grids that capture the spatiotemporal geometry of the scene. Second, conditioned on the generated occupancy prior, two specialized diffusion models autoregressively synthesize the corresponding LiDAR point clouds and camera videos. By anchoring the generation of all modalities to a shared geometric foundation, the proposed model inherently ensures cross-modal consistency and temporal stability. Extensive experiments demonstrate that the proposed approach significantly outperforms state-of-the-art methods in terms of generation fidelity, geometric accuracy, and long-term temporal coherence for both LiDAR and video synthesis, paving the way for high-fidelity and multi-modal simulation.

Drive-Cascade models dynamic occupancy autoregressively, then uses the generated 4D state as a shared geometric scaffold for LiDAR and video synthesis.

Highlights

Unified 4D Occupancy State

Dynamic semantic occupancy serves as the explicit world state, making spatial structure and temporal evolution controllable.

Autoregressive Long-Horizon Modeling

The state sequence is generated autoregressively to preserve object permanence and temporal continuity over extended scenes.

Cross-Modal Sensor Synthesis

Conditioned on the same occupancy prior, dedicated LiDAR and video generators remain geometrically aligned and temporally stable.

Figures

Teaser

Overview figure highlighting Drive-Cascade's autoregressive occupancy-to-sensor generation pipeline.

Drive-Cascade qualitative visualization overview

Qualitative Overview

Combined qualitative comparison highlighting coherent multi-modal generation.

Visualization

Paper

BibTeX

@article{lei2026drivecascade,
  title={Drive-Cascade: Autoregressive Occupancy to LiDAR and Video Synthesis},
  author={Lei, Shuangming and Huang, Yuehao and Yao, Yi and Xie, Yijia and Wang, Jingke and Wang, Ruoyu and Lv, Jiajun and Xu, Guanglin and Ye, AiXue and Liu, Bingbing and Cheng, Siyuan and Zhang, Hongbo and Ma, Yukai and Liu, Yong},
  journal={CVPR 2026 Findings},
  year={2026},
  url={https://summersray.github.io/Drive-Cascade/}
}

Navigate

Abstract

Highlights

Figures

Visualization

Paper PDF

BibTeX

Drive-Cascade: Autoregressive Occupancy to LiDAR and Video Synthesis

Abstract

Drive-Cascade models dynamic occupancy autoregressively, then uses the generated 4D state as a shared geometric scaffold for LiDAR and video synthesis.

Highlights

Unified 4D Occupancy State

Autoregressive Long-Horizon Modeling

Cross-Modal Sensor Synthesis

Figures

Teaser

Qualitative Overview

Visualization

Paper

BibTeX