ORION: A Holistic End-to-End Autonomous Driving Framework

by Vision-Language Instructed Action Generation

Haoyu Fu1*, Diankun Zhang2*, Zongchuang Zhao1*, Jianfeng Cui2, Dingkang Liang1†, Chong Zhang2,

Dingyuan Zhang1, Hongwei Xie2†, Bing Wang2, Xiang Bai1✉
1Huazhong University of Science & Technology 2Xiaomi EV

†Project Lead. ✉Corresponding Author. *Equal Contribution.

Work done while Haoyu Fu and Zongchuang Zhao were interns at Xiaomi EV.

Abstract

End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, few VLM-based E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory output of the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving-scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space with the action space to enable unified E2E optimization of both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive benchmark, outperforming state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.


Comprehensive capabilities of ORION in closed-loop simulation

Framework

The pipeline of ORION, a holistic E2E framework that aligns the vision, reasoning, and action spaces. It consists of three key components: a QT-Former that extracts long-term context and links the visual space of the vision encoder to the LLM's reasoning space; an LLM that performs textual tasks and predicts a planning token; and a generative planner that bridges the reasoning and action spaces by generating a multi-modal trajectory conditioned on the planning token.
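To make the wiring concrete, below is a minimal PyTorch sketch of how these three components could fit together. All module names, dimensions, the memory length, and the MLP planner head are illustrative simplifications, not the actual ORION implementation (the paper's planner is generative, and the LLM stand-in here is a plain Transformer encoder); see the official repository for the real code.

# Minimal sketch of an ORION-style pipeline (hypothetical modules/dims).
import torch
import torch.nn as nn

class QTFormer(nn.Module):
    """Learnable queries cross-attend to current vision features plus a
    memory of past query outputs, aggregating long-term context and
    linking the vision space to the LLM's reasoning space."""
    def __init__(self, d_model=256, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.history = []  # memory bank of past query outputs

    def forward(self, vision_feats):  # (B, N, d_model)
        B = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        ctx = (torch.cat([vision_feats] + self.history[-4:], dim=1)
               if self.history else vision_feats)
        out, _ = self.cross_attn(q, ctx, ctx)  # aggregate long-term context
        self.history.append(out.detach())
        return out  # (B, n_queries, d_model) scene tokens for the LLM

class GenerativePlanner(nn.Module):
    """Decodes a multi-modal trajectory conditioned on the planning token
    (a plain MLP head here; ORION uses a generative model)."""
    def __init__(self, d_model=256, horizon=4, modes=3):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(d_model, 512), nn.ReLU(),
                                  nn.Linear(512, modes * horizon * 2))
        self.horizon, self.modes = horizon, modes

    def forward(self, plan_token):  # (B, d_model)
        out = self.head(plan_token)
        return out.view(-1, self.modes, self.horizon, 2)  # (x, y) waypoints

# Wiring: vision encoder -> QT-Former -> LLM -> planning token -> planner.
B, N, D = 2, 100, 256
vision_feats = torch.randn(B, N, D)  # stand-in for vision encoder output
qt_former = QTFormer(d_model=D)
llm = nn.TransformerEncoder(  # stand-in for the LLM
    nn.TransformerEncoderLayer(D, 8, batch_first=True), num_layers=2)
planner = GenerativePlanner(d_model=D)

scene_tokens = qt_former(vision_feats)
reasoned = llm(scene_tokens)      # ORION also performs VQA/textual tasks here
plan_token = reasoned[:, -1]      # treat the last token as the planning token
traj = planner(plan_token)
print(traj.shape)                 # torch.Size([2, 3, 4, 2])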

Main Results

Closed-loop and Open-loop Results of E2E-AD Methods on Bench2Drive under the base set. C/L refers to camera/LiDAR input. Avg. L2 is averaged over the predictions within 2 seconds at 2 Hz, similar to UniAD. * denotes expert feature distillation. NC: navigation command, TP: target point, DS: Driving Score, SR: Success Rate.
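For reference, here is a minimal sketch of how an average L2 metric of this form is typically computed, assuming 4 predicted waypoints per sample (2 seconds at 2 Hz); the exact averaging protocol follows UniAD.

# Hypothetical sketch of the Avg. L2 computation described above.
import numpy as np

def avg_l2(pred, gt):
    """pred, gt: (num_samples, 4, 2) arrays of (x, y) waypoints
    at t = 0.5, 1.0, 1.5, 2.0 s (2 Hz over a 2 s horizon)."""
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-waypoint L2, (num_samples, 4)
    return dists.mean()  # average over waypoints and samples

pred = np.zeros((1, 4, 2))
gt = np.ones((1, 4, 2))
print(avg_l2(pred, gt))  # sqrt(2) ≈ 1.4142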

Multi-Ability

Multi-Ability Results of E2E-AD Methods under the base set. * denotes expert feature distillation. C/L refers to camera/LiDAR input. NC: navigation command, TP: target point.

Qualitative Results

Qualitative results of ORION on the Bench2Drive closed-loop evaluation set. Brown, red, and green denote the action decision, the objects that influence driving decisions, and the predicted trajectory, respectively.

Closed-loop evaluation in different scenarios

Construction Obstacle Two-Ways

Crossing Bicycle Flow

Hard Break Route

Non-Signalized Left-Turn Enter Flow

Accident

Opposite Vehicle Taking Priority

Highway Cut-In

Qualitative analysis on Bench2Drive comparing ORION with baselines

More details are available in the ORION repository.

TCP | UniAD-Base | VAD-Base | ORION

Route 2283: Signalized Junction Left-Turn

TCP: Success

UniAD: Failed

VAD: Success

ORION: Success

Route 3749: Enter Actor Flow

TCP: Success

UniAD: Failed

VAD: Failed

ORION: Success

Route 25418: Parked Obstacle

TCP: Failed

UniAD: Failed

VAD: Success

ORION: Success

Route 26396: Static Cut-In

TCP: Failed

UniAD: Success

VAD: Success

ORION: Success

BibTeX

@article{fu2025orion,
  title={ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation},
  author={Haoyu Fu and Diankun Zhang and Zongchuang Zhao and Jianfeng Cui and Dingkang Liang and Chong Zhang and Dingyuan Zhang and Hongwei Xie and Bing Wang and Xiang Bai},
  journal={arXiv preprint arXiv:2503.19755},
  year={2025}
}