MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving

Haiguang Wang 1,2,*, Daqi Liu 2,*, Hongwei Xie 2,†, Haisong Liu 1, Enhui Ma 3, Kaicheng Yu 3, Limin Wang 1,✉, Bing Wang 2
1Nanjing University, 2Xiaomi EV, 3Westlake University
*Equal Contribution, †Project Leader, ✉Corresponding Author

Abstract


In recent years, data-driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge, requiring significant investment in equipment and labor. World models, which predict and generate future environmental states, offer a promising solution by synthesizing annotated video data for training. However, existing methods struggle to generate long, consistent videos without accumulating errors, especially in dynamic scenes. To address this, we propose MiLA, a novel framework for generating high-fidelity, long-duration videos of up to one minute. MiLA adopts a Coarse-to-(Re)fine approach to both stabilize video generation and correct distortions of dynamic objects. In addition, we introduce a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to further improve the quality of the generated videos. Extensive experiments on the nuScenes dataset show that MiLA achieves state-of-the-art video generation quality.


1. Model Overview



The overall pipeline of the proposed MiLA. Top: the Coarse-to-(Re)fine generation framework. The current (Re)fine stage is conditioned on both the corrected high-FPS frames from the previous (Re)fine stage and the low-FPS anchor frames predicted in the Coarse stage. The low-FPS anchor frames are denoised jointly with the high-FPS interpolation frames to correct their artifacts. Bottom: the generation model of MiLA. The model takes as input multi-view video (condition frames, noise, and, when available, anchor frames), FPS, waypoints, camera parameters, and a brief textual scene description. After encoding the multi-modal inputs with dedicated encoders, a DiT-based backbone denoises the noisy video tokens over N steps and outputs the predicted video.
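To make the pipeline above concrete, the sketch below walks through a Coarse-to-(Re)fine inference loop in Python. All identifiers (the model and VAE objects, the sample() arguments, the chunk sizes) are hypothetical placeholders inferred from the figure description, not the released API.

# Minimal sketch of Coarse-to-(Re)fine inference, assuming a hypothetical
# `model.sample()` diffusion sampler and a latent video VAE. Real argument
# names and tensor shapes will differ from the released code.
import torch

@torch.no_grad()
def generate_long_video(model, vae, cond_frames, waypoints, cams, text,
                        low_fps=2, high_fps=10, num_chunks=6, steps=50):
    # Coarse stage: predict sparse low-FPS anchor frames spanning the full horizon.
    anchors = model.sample(condition=vae.encode(cond_frames), fps=low_fps,
                           waypoints=waypoints, cameras=cams, text=text,
                           num_steps=steps)
    anchor_groups = anchors.chunk(num_chunks, dim=1)   # one anchor group per (Re)fine stage

    chunks, prev_cond = [], vae.encode(cond_frames)
    for k in range(num_chunks):
        # (Re)fine stage k: denoise high-FPS interpolation frames jointly with
        # the k-th group of anchors so that anchor artifacts are corrected.
        latent = model.sample(condition=prev_cond, anchors=anchor_groups[k],
                              fps=high_fps, waypoints=waypoints, cameras=cams,
                              text=text, num_steps=steps)
        chunks.append(vae.decode(latent))
        prev_cond = latent[:, -4:]  # last corrected frames seed the next stage

    return torch.cat(chunks, dim=1)  # (B, T, views, C, H, W) multi-view video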

2. Short-term Multi-view Scenario Generation

Videos in this section are generated conditioned only on 4 initial frames from the nuScenes dataset and on waypoints; each video lasts roughly 2.5 seconds.
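For readers who want to reproduce this setup, the sketch below shows one way the conditioning inputs could be pulled from nuScenes with the official nuscenes-devkit: four initial multi-view keyframes plus the subsequent ego poses as waypoints. The devkit calls are real; the packaging (camera list, number of future keyframes) is illustrative.

# Assemble 4 conditioning keyframes and future ego waypoints from nuScenes.
# nuscenes-devkit calls are real; the surrounding structure is illustrative.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/nuscenes', verbose=False)
CAMS = ['CAM_FRONT', 'CAM_FRONT_LEFT', 'CAM_FRONT_RIGHT',
        'CAM_BACK', 'CAM_BACK_LEFT', 'CAM_BACK_RIGHT']

sample_token = nusc.scene[0]['first_sample_token']
cond_frames, waypoints = [], []
for i in range(4 + 6):                       # 4 condition keyframes + a few future ones
    sample = nusc.get('sample', sample_token)
    sd = nusc.get('sample_data', sample['data']['CAM_FRONT'])
    pose = nusc.get('ego_pose', sd['ego_pose_token'])
    if i < 4:                                # multi-view image paths for conditioning
        cond_frames.append({cam: nusc.get_sample_data_path(sample['data'][cam])
                            for cam in CAMS})
    else:                                    # future ego (x, y) positions as waypoints
        waypoints.append(pose['translation'][:2])
    sample_token = sample['next']
    if not sample_token:                     # reached the end of the scene
        break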

3. Long-term Multi-view Scenario Generation

Videos in this section are generated conditioned on 4 initial frames collected by our Xiaomi SU7 vehicle.

4. Waypoint Controllability

Given the same initial frames, MiLA generates future frames that strictly follow different waypoint conditions.
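As a usage illustration, the sketch below pairs the same conditioning frames with three hand-crafted waypoint trajectories (straight, left turn, right turn). It reuses the hypothetical generate_long_video helper from the overview section; the trajectory shapes and spacing are made up purely for illustration.

# Same initial frames, different waypoint conditions -> different futures.
# Trajectories are toy 2-D ego-frame paths (metres); spacing is arbitrary.
import numpy as np

def make_waypoints(kind, horizon=12, step=2.0):
    t = np.arange(1, horizon + 1) * step          # forward distance per step
    if kind == 'straight':
        lateral = np.zeros_like(t)
    elif kind == 'left':
        lateral = 0.02 * t ** 2                   # gentle left curve
    elif kind == 'right':
        lateral = -0.02 * t ** 2                  # gentle right curve
    else:
        raise ValueError(f'unknown trajectory kind: {kind}')
    return np.stack([t, lateral], axis=1)         # (horizon, 2) array of (x, y)

futures = {
    kind: generate_long_video(model, vae, cond_frames,
                              waypoints=make_waypoints(kind),
                              cams=cams, text='sunny urban street')
    for kind in ('straight', 'left', 'right')
}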

5. Comparison between MiLA and Vista

Given the same initial frame, frames generated by MiLA exhibit far less distortion as time progresses.

6. Generation Quality Comparison



Comparison of generation fidelity on the nuScenes validation set, where red and blue highlight the best results. MiLA outperforms state-of-the-art world models in generation quality. Note that our evaluation setting is fully aligned with Vista's.
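For context, here is a rough sketch of how such a fidelity comparison is typically scored. We assume the table reports frame-level FID and clip-level FVD (the standard metrics in this line of work), with compute_fid and compute_fvd standing in for whatever metric implementation is actually used.

# Generic FID/FVD-style evaluation loop over a validation split. The metric
# functions are placeholders; only the aggregation pattern is illustrated.
def evaluate(generate_fn, val_clips, compute_fid, compute_fvd):
    real_frames, fake_frames, real_clips, fake_clips = [], [], [], []
    for clip in val_clips:                       # clip: dict of tensors
        pred = generate_fn(clip['cond_frames'], clip['waypoints'])
        real_frames.extend(clip['gt_frames'])    # per-frame statistics -> FID
        fake_frames.extend(pred)
        real_clips.append(clip['gt_frames'])     # per-clip statistics  -> FVD
        fake_clips.append(pred)
    return {'FID': compute_fid(real_frames, fake_frames),
            'FVD': compute_fvd(real_clips, fake_clips)}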

BibTeX

@article{wang2025mila,
      author    = {Haiguang Wang and Daqi Liu and Hongwei Xie and Haisong Liu and Enhui Ma and Kaicheng Yu and Limin Wang and Bing Wang},
      title     = {MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving},
      year      = {2025},
}