ORION: A Holistic End-to-End Autonomous Driving Framework

by Vision-Language Instructed Action Generation

Haoyu Fu1*, Diankun Zhang2*, Zongchuang Zhao1*, Jianfeng Cui2, Dingkang Liang1†, Chong Zhang2,

Dingyuan Zhang1, Hongwei Xie2†, Bing Wang2, Xiang Bai1✉
1Huazhong University of Science & Technology 2Xiaomi EV

†Project Lead. ✉Corresponding Author. *Equal Contribution.

Work done while Haoyu Fu and Zongchuang Zhao were interns at Xiaomi EV.

Abstract

End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, few VLM-based E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory output of the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving-scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space with the action space to enable unified E2E optimization of both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive benchmark, outperforming state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.


Comprehensive capabilities of ORION in closed-loop simulation

Framework

The pipeline of ORION, a holistic E2E framework that aligns the vision, reasoning, and action spaces. It consists of three key components: a QT-Former that extracts long-term context and links the visual space of the vision encoder to the LLM's reasoning space; an LLM that performs textual tasks and predicts a planning token; and a generative planner that bridges the reasoning and action spaces by generating a multi-modal trajectory conditioned on the planning token.
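To make the wiring concrete, below is a minimal PyTorch sketch of how these three components could fit together. All module names, dimensions, the memory length, and the MLP planner head are illustrative simplifications, not the actual ORION implementation (the paper's planner is generative, and the LLM stand-in here is a plain Transformer encoder); see the official repository for the real code.

# Minimal sketch of an ORION-style pipeline (hypothetical modules/dims).
import torch
import torch.nn as nn

class QTFormer(nn.Module):
    """Learnable queries cross-attend to current vision features plus a
    memory of past query outputs, aggregating long-term context and
    linking the vision space to the LLM's reasoning space."""
    def __init__(self, d_model=256, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.history = []  # memory bank of past query outputs

    def forward(self, vision_feats):  # (B, N, d_model)
        B = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        ctx = (torch.cat([vision_feats] + self.history[-4:], dim=1)
               if self.history else vision_feats)
        out, _ = self.cross_attn(q, ctx, ctx)  # aggregate long-term context
        self.history.append(out.detach())
        return out  # (B, n_queries, d_model) scene tokens for the LLM

class GenerativePlanner(nn.Module):
    """Decodes a multi-modal trajectory conditioned on the planning token
    (a plain MLP head here; ORION uses a generative model)."""
    def __init__(self, d_model=256, horizon=4, modes=3):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(d_model, 512), nn.ReLU(),
                                  nn.Linear(512, modes * horizon * 2))
        self.horizon, self.modes = horizon, modes

    def forward(self, plan_token):  # (B, d_model)
        out = self.head(plan_token)
        return out.view(-1, self.modes, self.horizon, 2)  # (x, y) waypoints

# Wiring: vision encoder -> QT-Former -> LLM -> planning token -> planner.
B, N, D = 2, 100, 256
vision_feats = torch.randn(B, N, D)  # stand-in for vision encoder output
qt_former = QTFormer(d_model=D)
llm = nn.TransformerEncoder(  # stand-in for the LLM
    nn.TransformerEncoderLayer(D, 8, batch_first=True), num_layers=2)
planner = GenerativePlanner(d_model=D)

scene_tokens = qt_former(vision_feats)
reasoned = llm(scene_tokens)      # ORION also performs VQA/textual tasks here
plan_token = reasoned[:, -1]      # treat the last token as the planning token
traj = planner(plan_token)
print(traj.shape)                 # torch.Size([2, 3, 4, 2])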

Main Results

Closed-loop and Open-loop Results of E2E-AD Methods on Bench2Drive under the base set. C/L refers to camera/LiDAR input. Avg. L2 is averaged over the predictions within 2 seconds at 2 Hz, similar to UniAD. * denotes expert feature distillation. NC: navigation command, TP: target point, DS: Driving Score, SR: Success Rate.
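For reference, here is a minimal sketch of how an average L2 metric of this form is typically computed, assuming 4 predicted waypoints per sample (2 seconds at 2 Hz); the exact averaging protocol follows UniAD.

# Hypothetical sketch of the Avg. L2 computation described above.
import numpy as np

def avg_l2(pred, gt):
    """pred, gt: (num_samples, 4, 2) arrays of (x, y) waypoints
    at t = 0.5, 1.0, 1.5, 2.0 s (2 Hz over a 2 s horizon)."""
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-waypoint L2, (num_samples, 4)
    return dists.mean()  # average over waypoints and samples

pred = np.zeros((1, 4, 2))
gt = np.ones((1, 4, 2))
print(avg_l2(pred, gt))  # sqrt(2) ≈ 1.4142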

Multi-Ability

Multi-Ability Results of E2E-AD Methods under the base set. * denotes expert feature distillation. C/L refers to camera/LiDAR input. NC: navigation command, TP: target point.

Qualitative Results

Qualitative results of ORION on the Bench2Drive closed-loop evaluation set. Brown, red, and green denote the action decision, the objects that influence driving decisions, and the predicted trajectory, respectively.

Closed-loop evaluation in different scenarios

Construction Obstacle Two-Ways

Crossing Bicycle Flow

Hard Break Route

Non-Signalized Left-Turn Enter Flow

Accident

Opposite Vehicle Taking Priority

Highway Cut-In

Qualitative analysis on Bench2Drive comparing ORION with baselines

More details are available in the ORION repository.

TCP | UniAD-Base | VAD-Base | ORION

Route 2283: Signalized Junction Left-Turn

TCP: Success

UniAD: Failed

VAD: Success

ORION: Success

Route 3749: Enter Actor Flow

TCP: Success

UniAD: Failed

VAD: Failed

ORION: Success

Route 25418: Parked Obstacle

TCP: Failed

UniAD: Failed

VAD: Success

ORION: Success

Route 26396: Static Cut-In

TCP: Failed

UniAD: Success

VAD: Success

ORION: Success

BibTeX

@article{fu2025orion,
  title={ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation},
  author={Haoyu Fu and Diankun Zhang and Zongchuang Zhao and Jianfeng Cui and Dingkang Liang and Chong Zhang and Dingyuan Zhang and Hongwei Xie and Bing Wang and Xiang Bai},
  journal={arXiv preprint arXiv:2503.19755},
  year={2025}
}