| 🗺️ Overview | 📊 Benchmark | 🌍 Real-World | ⚡ Quick Start |
We are thrilled to announce VITA-VLA, a powerful and streamlined vision-language-action (VLA) model that combines architectural simplicity with highly efficient training. Designed to unlock the full potential of pretrained vision-language models in embodied AI, VITA-VLA introduces a novel two-stage training strategy that enables superior performance with minimal computational overhead. In both simulation and real-world robotic scenarios, VITA-VLA delivers state-of-the-art results.
- Efficient Training: First perform a lightweight alignment of VLM hidden states with a pretrained action model, then fine-tune for accurate action generation.
- Simple Architecture: Adds only two lightweight components, a state encoder and an action query token.
- Strong Performance: Outperforms most existing VLAs on CALVIN ABC-D and LIBERO.
Overview of mainstream VLA architectures.
- Discretization-based methods convert actions into tokens and directly decode them using visual and language features, but omit robot state information, which is crucial for physical dynamics.
- Diffusion-based approaches extract vision-language features with a VLM, but offload action generation to an action expert, making the VLM a passive feature extractor.
- Our method introduces a state encoder and an action query token, retains the full VLM, and distills knowledge from an expert action model, combining strong reasoning with efficient action generation.
CALVIN ABC-D
Our model demonstrates strong zero-shot generalization to unseen environments, achieving higher overall performance compared with existing VLA models. This highlights both the effectiveness of our two-stage distillation strategy and the importance of fine-tuning the VLM for action execution.
LIBERO-LONG
Our model achieves state-of-the-art performance, with a 5.8% improvement over Seer-Large and a 1% improvement over the fine-tuning-only strategy. These results validate the effectiveness of our approach in handling long-horizon tasks and complex instruction-following scenarios.
LIBERO
Our model achieves the highest average success rate across all task suites, outperforming existing VLA models by a significant margin. In particular, it improves the previous best result on LIBERO-LONG by 24.5%, reaching a 97.3% success rate. These findings demonstrate that our framework effectively combines the reasoning capacity of large-scale VLMs with the efficient action modeling of small action models.
- Platform: ALOHA
- Control: 6-DoF arm + 1-D gripper width
- Tasks: Pick, Place, Close, Stack
Instructions:
- "Close the drawer"
- "Stack the orange cup on top of the green cup"
- "Stack the red block on top of the yellow block"
- "Pick up the sponge and put it into the basket"
- "Pick up the red block and put it into the basket"
VITA-VLA achieves top performance across all tasks, demonstrating the strongest results in long-horizon and stacking scenarios, and validating its two-stage training strategy.
- Python: 3.9–3.12
- PyTorch: ≥ 1.13
- CUDA: 11.8+
- OS: Ubuntu 20.04+
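A quick, optional way to confirm that an existing environment meets these requirements (a convenience snippet, not part of the official setup):

# Print the Python, PyTorch, and CUDA versions currently visible to Python
python -c "import sys, torch; print(sys.version.split()[0], torch.__version__, torch.version.cuda)"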
Please refer to Seer and VITA for environment configuration, and download the corresponding weights.
conda create -n vita_vla python=3.10 -y
conda activate vita_vla
git clone https://github.com/InternRobotics/Seer.git
cd Seer
pip install -r requirements.txt
Please follow the instructions below to prepare the CALVIN and LIBERO environments and download the corresponding weights.
git clone https://github.com/VITA-MLLM/VITA.git
cd VITA
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Download:
We use the CALVIN ABC-D experiment as an example; LIBERO scripts are also provided in the scripts/ directory.
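To see which launch scripts are available:

# List the provided benchmark scripts
ls scripts/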
Stage 1: Alignment
bash scripts/CALVIN_ABC_D/pretrain_frevlm.sh
Stage 2: Finetune
bash scripts/CALVIN_ABC_D/ft_nofrevlm_pt_frevlm_from_2pth.sh
Set `finetune_from_pretrained_ckpt` to the Stage 1 alignment weights (see the sketch below).
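A minimal sketch of this step, assuming the script assigns the flag as a shell variable (the checkpoint path below is a placeholder):

# Find where the flag is set, then point it at the Stage 1 output (placeholder path)
grep -n "finetune_from_pretrained_ckpt" scripts/CALVIN_ABC_D/ft_nofrevlm_pt_frevlm_from_2pth.sh
sed -i 's#finetune_from_pretrained_ckpt=.*#finetune_from_pretrained_ckpt=/path/to/stage1_alignment.pth#' \
    scripts/CALVIN_ABC_D/ft_nofrevlm_pt_frevlm_from_2pth.sh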
bash scripts/CALVIN_ABC_D/ft_wopt_nofrevlm.sh
bash scripts/CALVIN_ABC_D/ft_wopt_from_seer.sh
Use transfer.py to convert the full weights into the small checkpoint we need, then set `finetune_from_pretrained_ckpt` to the transferred VITA-VLA weights (action modules only), e.g. as sketched below.
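A hypothetical invocation of transfer.py (the argument names are illustrative; check the script for its actual interface):

# Extract the action-module weights from a full checkpoint (placeholder paths and flags)
python transfer.py \
    --input /path/to/full_checkpoint.pth \
    --output /path/to/vita_vla_action_modules.pth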
Evaluation
bash scripts/CALVIN_ABC_D/eval_ft_nofrevlm_pt_frevlm_from_2pth.sh
Set `resume_from_checkpoint` to your trained weights, as sketched below.
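A minimal sketch, under the same assumption that the eval script assigns the flag as a shell variable (placeholder path):

# Point the eval script at your trained checkpoint, then launch it
sed -i 's#resume_from_checkpoint=.*#resume_from_checkpoint=/path/to/trained_checkpoint.pth#' \
    scripts/CALVIN_ABC_D/eval_ft_nofrevlm_pt_frevlm_from_2pth.sh
bash scripts/CALVIN_ABC_D/eval_ft_nofrevlm_pt_frevlm_from_2pth.sh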