VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

📖 Paper · 🤖 Model Weights · 🚀 Live Demo

| 🗺️ Overview | 📊 Benchmark | 🌍 Real-World | ⚡ Quick Start |

We are thrilled to announce the launch of VITA-VLA, a powerful and streamlined vision-language-action (VLA) model that combines architectural simplicity with highly efficient training. Designed to unlock the full potential of pretrained vision-language models in embodied-AI tasks, VITA-VLA introduces a novel two-stage training strategy that enables superior performance with minimal computational overhead. Whether in simulation or in real-world robotic scenarios, VITA-VLA delivers state-of-the-art results.

VITA-VLA Logo

✨ Highlights

  • Efficient Training: Lightly align VLM hidden states with a pretrained action expert, then fine-tune for accurate action generation (see the sketch after this list).
  • Simple Architecture: Adds only a state encoder and an action query token on top of the VLM.
  • Strong Performance: Outperforms most VLAs on CALVIN ABC-D and LIBERO.
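
Conceptually, the two training stages reduce to two losses: an alignment loss that matches the VLM's action-query hidden states to the frozen action expert's features, and an action loss used during fine-tuning. The sketch below is only meant to illustrate that idea; the module and argument names (`vlm`, `action_expert`, `proj`, `action_head`) and the specific MSE/L1 choices are assumptions, not the repository's actual code.

```python
import torch
import torch.nn.functional as F


def stage1_alignment_loss(vlm, action_expert, proj, batch):
    """Stage 1: align the VLM's action-query hidden states with the frozen expert's features."""
    with torch.no_grad():                           # the pretrained action expert stays frozen
        expert_feat = action_expert(batch["obs"], batch["state"])
    query_hidden = vlm(batch["obs"], batch["text"], batch["state"])  # hidden states at the action-query tokens
    return F.mse_loss(proj(query_hidden), expert_feat)


def stage2_action_loss(vlm, action_head, batch):
    """Stage 2: fine-tune the VLM and action head to regress ground-truth actions."""
    query_hidden = vlm(batch["obs"], batch["text"], batch["state"])
    pred_actions = action_head(query_hidden)
    return F.l1_loss(pred_actions, batch["actions"])
```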

🧠 Method Overview

Overview of mainstream VLA architectures.

  1. Discretization-based methods convert actions into tokens and decode them directly from visual and language features, but they omit robot state information, which is crucial for capturing physical dynamics.
  2. Diffusion-based approaches extract vision-language features with a VLM but offload action generation to an action expert, reducing the VLM to a passive feature extractor.
  3. Our method introduces a state encoder and an action query token, retains the full VLM, and distills knowledge from an expert model, combining the VLM's reasoning capacity with efficient action modeling (a minimal sketch follows this list).
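
To make the third design concrete, here is a minimal PyTorch sketch of wrapping a pretrained VLM with a state encoder, learnable action-query tokens, and an action head. The class, the Hugging Face-style `inputs_embeds` interface, and all dimensions are assumptions for illustration, not the actual VITA-VLA implementation.

```python
import torch
import torch.nn as nn


class VLAHead(nn.Module):
    """Sketch: a pretrained VLM extended with a state encoder and learnable action-query tokens."""

    def __init__(self, vlm, hidden_dim=4096, state_dim=7, action_dim=7, num_queries=1):
        super().__init__()
        self.vlm = vlm                                                   # the full pretrained VLM is retained
        self.state_encoder = nn.Linear(state_dim, hidden_dim)            # embeds the robot state
        self.action_queries = nn.Parameter(torch.zeros(num_queries, hidden_dim))  # learnable query tokens
        self.action_head = nn.Linear(hidden_dim, action_dim)             # decodes queries into actions

    def forward(self, vision_lang_embeds, robot_state):
        # Append the state embedding and the action queries to the vision-language token sequence.
        b = vision_lang_embeds.size(0)
        state_token = self.state_encoder(robot_state).unsqueeze(1)
        queries = self.action_queries.unsqueeze(0).expand(b, -1, -1)
        sequence = torch.cat([vision_lang_embeds, state_token, queries], dim=1)
        hidden = self.vlm(inputs_embeds=sequence).last_hidden_state      # assumes an HF-style interface
        # Read out the hidden states at the query positions and predict the action(s).
        return self.action_head(hidden[:, -queries.size(1):])
```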

📈 Benchmark Results

CALVIN ABC-D

Our model demonstrates strong zero-shot generalization to unseen environments, achieving higher overall performance compared with existing VLA models. This highlights both the effectiveness of our two-stage distillation strategy and the importance of fine-tuning the VLM for action execution.

LIBERO-LONG

Our model achieves state-of-the-art performance, with a 5.8% improvement over Seer-Large and a 1% improvement over the fine-tuning-only strategy. These results validate the effectiveness of our approach in handling long-horizon tasks and complex instruction-following scenarios.

LIBERO

Our model achieves the highest average success rate across all task suites, outperforming existing VLA models by a significant margin. In particular, it improves the previous best result on LIBERO-LONG by 24.5%, reaching a 97.3% success rate. These findings demonstrate that our framework effectively combines the reasoning capacity of large-scale VLMs with the efficient action modeling of small action models.

🌍 Real-World Experiment

🛠️ Task Setup

  • Platform: ALOHA
  • Control: 6-DoF arm + 1-D gripper width (an illustrative action vector follows this list)
  • Tasks: Pick, Place, Close, Stack
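
As a concrete illustration of this control interface, a single action step can be viewed as a 7-dimensional vector (6 arm DoF plus the gripper width). The ordering and units below are assumptions for illustration, not the actual data format.

```python
import numpy as np

# Hypothetical per-step action layout: 6 arm DoF plus gripper width.
action = np.array([
    0.01, -0.02, 0.00,   # end-effector translation deltas (x, y, z)
    0.00,  0.05, 0.00,   # end-effector rotation deltas (roll, pitch, yaw)
    0.04,                # gripper opening width
], dtype=np.float32)
assert action.shape == (7,)
```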

🎬 Execution Demo

Instructions:

  1. "Close the drawer"
  2. "Stack the orange cup on top of the green cup"
  3. "Stack the red block on top of the yellow block"
  4. "Pick up the sponge and put it into the basket"
  5. "Pick up the red block and put it into the basket"

📊 Real-World Results Summary

VITA-VLA achieves top performance across all tasks, demonstrating the strongest results in long-horizon and stacking scenarios, and validating its two-stage training strategy.

📔 Get Started

Prepare Environment

  • Python: 3.9–3.12
  • PyTorch: ≥ 1.13
  • CUDA: 11.8+
  • OS: Ubuntu 20.04+

Please refer to Seer and VITA for environment configuration, and download the corresponding weights.

```bash
conda create -n vita_vla python=3.10 -y
conda activate vita_vla
git clone https://github.com/InternRobotics/Seer.git
cd Seer
pip install -r requirements.txt
```

Please follow the instructions to prepare the CALVIN and LIBERO environments and download the corresponding weights.

```bash
git clone https://github.com/VITA-MLLM/VITA.git
cd VITA
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

Download:

🎲 Training

We use the CALVIN ABC-D experiment as an example. LIBERO scripts are also provided in the scripts/ directory.

Strategy 1: Alignment + Finetune

Stage 1: Alignment

```bash
bash scripts/CALVIN_ABC_D/pretrain_frevlm.sh
```

Stage 2: Finetune

```bash
bash scripts/CALVIN_ABC_D/ft_nofrevlm_pt_frevlm_from_2pth.sh
```

Set `finetune_from_pretrained_ckpt` to the Stage 1 alignment weights.
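
Under the hood, that flag only needs to restore the alignment weights into the Stage 2 model; the sketch below shows what such loading typically looks like in PyTorch. The function name, checkpoint path, and key layout are assumptions, not the repository's actual code.

```python
import torch


def load_alignment_ckpt(model, path="checkpoints/stage1_alignment.pth"):
    """Load Stage 1 alignment weights into the Stage 2 model (non-strict, so any new heads keep their init)."""
    ckpt = torch.load(path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)        # some checkpoints nest the weights under a "model" key
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model
```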

Strategy 2: Finetune Only

```bash
bash scripts/CALVIN_ABC_D/ft_wopt_nofrevlm.sh
```

Strategy 3: Freeze VLM

```bash
bash scripts/CALVIN_ABC_D/ft_wopt_from_seer.sh
```

Use transfer.py to convert the full weights into the small checkpoint needed here (action modules only), then set `finetune_from_pretrained_ckpt` to the transferred VITA-VLA weights.
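
In case it helps to see what such a conversion involves, here is a minimal sketch of extracting only the action modules from a full checkpoint. The key prefixes, function name, and file paths are assumptions for illustration, not the actual transfer.py.

```python
import torch

# Hypothetical key prefixes for the action-related modules to keep.
KEEP_PREFIXES = ("state_encoder.", "action_queries", "action_head.")


def extract_action_modules(full_ckpt_path, out_path):
    """Save only the action-module weights from a full checkpoint into a small one."""
    ckpt = torch.load(full_ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)
    small = {k: v for k, v in state_dict.items() if k.startswith(KEEP_PREFIXES)}
    torch.save({"model": small}, out_path)
    print(f"kept {len(small)} of {len(state_dict)} tensors")


# Example usage (paths are placeholders):
# extract_action_modules("checkpoints/full_model.pth", "checkpoints/action_modules.pth")
```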

🔎 Evaluation

```bash
bash scripts/CALVIN_ABC_D/eval_ft_nofrevlm_pt_frevlm_from_2pth.sh
```

Set `resume_from_checkpoint` to your trained weights.

🙏 Acknowledgements
