MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning

Haoyu Fu1*, Diankun Zhang2*, Zongchuang Zhao1, Jianfeng Cui2, Hongwei Xie2†, Bing Wang2, Guang Chen2, Dingkang Liang1†, Xiang Bai1✉
1Huazhong University of Science & Technology  2Xiaomi EV

†Project Lead. ✉Corresponding Author. *Equal Contribution.

Abstract

Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online reinforcement learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework built on a single large language model (LLM) equipped with two distinct sets of LoRA parameters. One set serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions instead of operating directly in a continuous action space. This approach balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration during online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
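The dual-expert design can be pictured as one base LLM carrying two switchable LoRA adapters. The sketch below is illustrative only and is not the released implementation: the base checkpoint name, LoRA hyperparameters, and prompt strings are placeholder assumptions, using the Hugging Face transformers and peft libraries.

```python
# Minimal sketch (not the authors' code): one shared base LLM with two LoRA adapters,
# a Decision Expert for linguistic meta-actions and an Action Expert for trajectories.
# The base checkpoint, LoRA hyperparameters, and prompts are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "Qwen/Qwen2-0.5B"  # placeholder base LLM, not necessarily the paper's choice
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)

def lora_cfg() -> LoraConfig:
    # One fresh adapter config per expert; r/alpha are arbitrary example values.
    return LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Decision Expert adapter: scene reasoning -> discrete speed/path meta-action in text.
model = get_peft_model(base, lora_cfg(), adapter_name="decision_expert")
# Action Expert adapter: meta-action -> concrete trajectory (e.g., waypoints as text).
model.add_adapter("action_expert", lora_cfg())

def run(prompt: str, adapter: str, max_new_tokens: int = 64) -> str:
    # Switch the active LoRA adapter, then decode greedily from the shared base LLM.
    model.set_adapter(adapter)
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

meta_action = run("Scene: <front-camera description>. Choose speed and path meta-actions.",
                  "decision_expert")
trajectory = run(f"Meta-action: {meta_action}. Output the trajectory waypoints.",
                 "action_expert")
```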


Comprehensive capabilities of MindDrive in closed-loop simulation

Framework

Overview of MindDrive and its training pipeline. MindDrive consists of two experts that share the same base LLM and are each fine-tuned with a distinct set of LoRA parameters. The Decision Expert generates high-level meta-actions from scene and text inputs, and the Action Expert maps these meta-actions to concrete trajectories. MindDrive is first trained with Imitation Learning (IL) on expert data to align meta-actions with trajectories, and is then refined through online reinforcement learning in a closed-loop simulator, where trajectory-level action rewards directly enhance the model's reasoning capabilities.
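For concreteness, here is a hedged sketch of one step of the second training stage: the Decision Expert samples a discrete linguistic meta-action, the Action Expert turns it into a trajectory, the closed-loop simulator returns a trajectory-level reward, and that reward is pushed back into the reasoning space with a simple REINFORCE-style policy-gradient update. The `simulator.rollout` API, the baseline term, and the choice of REINFORCE are assumptions for illustration rather than the paper's exact algorithm; the sketch reuses the two-adapter `model`, `tok`, and `run` helper defined above.

```python
# Hedged sketch of one online-RL step, assuming the model/tok/run objects defined above.
# REINFORCE over the sampled meta-action tokens stands in for the paper's RL algorithm,
# and simulator.rollout() is a hypothetical closed-loop evaluation API.
import torch

def rl_step(optimizer, scene_prompt: str, simulator, baseline: float = 0.0):
    # 1) Sample a discrete linguistic meta-action from the Decision Expert.
    model.set_adapter("decision_expert")
    ids = tok(scene_prompt, return_tensors="pt")
    gen = model.generate(**ids, do_sample=True, max_new_tokens=16)
    action_ids = gen[0, ids["input_ids"].shape[1]:]
    meta_action = tok.decode(action_ids, skip_special_tokens=True)

    # 2) Map the meta-action to a trajectory and execute it in the closed-loop simulator.
    trajectory = run(f"Meta-action: {meta_action}. Output the trajectory waypoints.",
                     "action_expert")
    reward = simulator.rollout(trajectory)  # trajectory-level scalar reward (hypothetical)

    # 3) Policy-gradient update on the meta-action tokens: the trajectory-level reward
    #    is fed back into the Decision Expert's reasoning space.
    model.set_adapter("decision_expert")
    full_ids = torch.cat([ids["input_ids"][0], action_ids]).unsqueeze(0)
    logits = model(input_ids=full_ids).logits
    n = action_ids.shape[0]
    logp = torch.log_softmax(logits[0, -n - 1:-1], dim=-1)  # positions predicting each action token
    logp_action = logp.gather(-1, action_ids.unsqueeze(-1)).sum()
    loss = -(reward - baseline) * logp_action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return meta_action, reward
```

Because the meta-actions form a small, finite set of linguistic decisions, trial-and-error exploration in this loop happens over a handful of discrete options rather than over a continuous trajectory space.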

Main Results

Closed-loop and multi-ability results of E2E-AD methods on Bench2Drive under the base training set. C/L refers to camera/LiDAR input. * denotes expert feature distillation. † denotes results reproduced with the official code. DS: Driving Score, SR: Success Rate, M: Merging, O: Overtaking, EB: Emergency Brake, GW: Give Way, TS: Traffic Sign.

Qualitative Results

Qualitative results of MindDrive after IL and RL on the Bench2Drive closed-loop evaluation set. Green and blue denote the predicted speed and path meta-actions, respectively. Red indicates that a collision has occurred.

Closed-loop evaluation in different scenarios

Construction Obstacle

Pedestrian Crossing

Accident

Dynamic Object Crossing

IL vs. RL: A Visual Comparison

IL (Imitation Learning)

Parking Crossing Pedestrian

Vanilla Non-Signalized Turn

Accident

RL (Reinforcement Learning)

Parking Crossing Pedestrian

Vanilla Non-Signalized Turn

Accident

BibTeX

@article{fu2025minddrive,
  title={MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning},
  author={Haoyu Fu and Diankun Zhang and Zongchuang Zhao and Jianfeng Cui and Dingkang Liang and Hongwei Xie and Bing Wang and Xiang Bai},
  journal={},
  year={2025}
}