AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

University of California, Los Angeles
AutoVLA Framework Diagram

AutoVLA integrates CoT reasoning and physical action tokenization to generate planning trajectories directly through a unified autoregressive process, dynamically switching between fast and slow thinking modes.

Abstract

Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often suffer from physically infeasible action outputs, complex model structures, and unnecessarily long reasoning.

In this paper, we propose AutoVLA, a novel VLA framework that unifies reasoning and action generation within a single autoregressive model. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), which reduces unnecessary reasoning in straightforward scenarios.

Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results further showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios. We will release the code, model weights, and datasets to facilitate future research in the field.

Model Structure and Training Strategy

AutoVLA Framework and Training Strategy

⚙️ Two Main Components:

  • VLM Backbone: a unified autoregressive Transformer decoder that processes visual and textual inputs and generates the corresponding reasoning and action tokens.
  • Physical Action Token Generation: extends the language model decoder to output physical action tokens, which are designed to comply with physical constraints and can be reliably decoded into feasible trajectories (a minimal sketch follows this list).
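
To illustrate the idea of physical action tokenization, here is a minimal sketch assuming each token indexes a short motion primitive, i.e., a constant (acceleration, curvature) pair rolled out with a unicycle model. The bin counts, ranges, and the greedy inverse mapping are illustrative assumptions, not the paper's exact tokenizer.

```python
import numpy as np

# Assumed action vocabulary: each token indexes a motion primitive, i.e., a
# constant (acceleration, curvature) pair applied for DT seconds. The bin
# counts and ranges here are illustrative, not the paper's exact values.
ACCELS = np.linspace(-4.0, 3.0, 8)    # m/s^2
CURVS = np.linspace(-0.2, 0.2, 16)    # 1/m
DT = 0.5                              # seconds per action token

def step(state, token):
    """Advance a unicycle model by one token. Feasibility is guaranteed by
    construction: every token maps to bounded acceleration and curvature."""
    x, y, yaw, v = state
    a = ACCELS[token // len(CURVS)]
    k = CURVS[token % len(CURVS)]
    v = max(0.0, v + a * DT)               # no reversing in this sketch
    yaw = yaw + k * v * DT                 # heading change from curvature
    x = x + v * np.cos(yaw) * DT
    y = y + v * np.sin(yaw) * DT
    return (x, y, yaw, v)

def detokenize(tokens, state=(0.0, 0.0, 0.0, 5.0)):
    """Decode a discrete token sequence into a continuous (x, y) trajectory."""
    traj = [state[:2]]
    for t in tokens:
        state = step(state, t)
        traj.append(state[:2])
    return np.array(traj)

def tokenize(waypoints, state=(0.0, 0.0, 0.0, 5.0)):
    """Greedy inverse mapping: at each step, pick the token whose one-step
    rollout lands closest to the next ground-truth waypoint."""
    tokens = []
    for wx, wy in waypoints:
        cands = [step(state, t) for t in range(len(ACCELS) * len(CURVS))]
        best = int(np.argmin([np.hypot(c[0] - wx, c[1] - wy) for c in cands]))
        tokens.append(best)
        state = cands[best]
    return tokens
```

A real tokenizer would likely be fit to the driving data distribution, but the key property carries over: any decoded token sequence respects the kinematic limits encoded in the vocabulary, so the planner cannot emit physically infeasible trajectories.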

🪜 Two Training Stages:

  • Supervised Fine-Tuning (SFT) jointly learns reasoning and action generation and instills the dual thinking modes, using ground-truth trajectory data and high-quality reasoning data distilled from a large-scale VLM (see the target-format sketch below).
  • Reinforcement Fine-Tuning (RFT) uses task-specific reward functions to optimize planning performance while enabling adaptive reasoning, improving runtime efficiency by cutting unnecessary reasoning (see the GRPO sketch below).
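
To make the dual thinking modes concrete, the sketch below assembles SFT targets in both formats; the <think> delimiters and <action_i> token names are illustrative assumptions rather than the paper's actual special-token vocabulary.

```python
def build_sft_target(action_tokens, reasoning=None):
    """Assemble an SFT training target. Fast-thinking samples contain only
    physical action tokens; slow-thinking samples prepend distilled CoT
    reasoning wrapped in (assumed) <think>...</think> delimiters."""
    actions = "".join(f"<action_{t}>" for t in action_tokens)
    if reasoning is None:
        return actions                                  # fast thinking
    return f"<think>{reasoning}</think>{actions}"       # slow thinking

# Fast thinking: trajectory-only target.
print(build_sft_target([12, 87, 45]))
# Slow thinking: CoT reasoning followed by the same kind of action tokens.
print(build_sft_target([3, 3, 41], "Pedestrian crossing ahead; yield, then proceed."))
```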
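
And below is a minimal sketch of the GRPO signal used in RFT, under an assumed reward design: for each scenario a group of rollouts is sampled, each is scored by a planning reward (here a hypothetical PDMS term minus a reasoning-length penalty with an assumed coefficient beta), and rewards are normalized within the group, so no learned value function is needed. The penalty term is what nudges the policy toward fast, trajectory-only answers when extra reasoning does not pay off.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    group mean and standard deviation (the core of GRPO; no critic)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def planning_reward(pdms, cot_len, beta=0.001):
    """Assumed reward shape: driving score minus a reasoning-length penalty
    that discourages unnecessary CoT in straightforward scenarios."""
    return pdms - beta * cot_len

# One GRPO group: G = 4 rollouts sampled for the same scenario,
# recorded as (pdms_score, num_reasoning_tokens).
group = [(0.82, 180), (0.85, 0), (0.61, 240), (0.79, 0)]
rewards = [planning_reward(p, n) for p, n in group]
adv = grpo_advantages(rewards)
print(adv)  # concise, high-scoring rollouts receive the largest positive weight
```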

Experiments

Data Scaling

In this experiment, AutoVLA is trained on a mixture of the nuPlan and nuScenes datasets with varying training set sizes (10k, 50k, 100k, 185k).

  • Increasing the amount of training data consistently improves planning performance.
  • Learning structured reasoning requires sufficient training data and can further improve planning performance.
  • Action-only supervision performs better on datasets with simpler scenarios, such as nuScenes.
Data Scaling Metrics

RFT Performance

RFT Visualization

We apply RFT to the full-data CoT reasoning model trained via SFT.

  • RFT yields a 10.6% improvement in PDMS on the NAVSIM test set and a 66.8% reduction in runtime (measured over 500 samples).
  • Larger GRPO group sizes lead to better performance by promoting broader exploration of the training samples.
  • RFT reduces unnecessary and slow reasoning in simple scenarios.

nuPlan Results

nuPlan Visualization

Waymo End-to-End Driving Results

In the Waymo Vision-based End-to-End Driving Challenge (as of May 22, 2025), AutoVLA ranks highly on both the RFS Overall and ADE metrics and achieves the top score on the RFS Spotlight metric, which focuses on the most challenging scenarios.

Waymo Visualization

nuScenes Results

Red lines represent the planned trajectories, and green lines represent the ground-truth trajectories.

CARLA Closed-loop Results

Closed-loop qualitative results on representative CARLA scenarios:

  • Parking Crossing Pedestrian
  • Vehicle Turning Route Pedestrian
  • Parking Cut-In
  • Parked Obstacle
  • Hazard At Side Lane
  • Opposite Vehicle Taking Priority

BibTeX

@article{zhou2025autovla,
  author    = {Zhou, Zewei and Cai, Tianhui and Zhao, Seth Z. and Zhang, Yun and Huang, Zhiyu and Zhou, Bolei and Ma, Jiaqi},
  title     = {AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning},
  journal   = {arXiv preprint arXiv:2506.13757},
  year      = {2025},
}

The website design was adapted from nerfies.