This repository provides the official implementation of the paper VITA: Vision-to-Action Flow Matching Policy (July 2025).
VITA is a noise-free, conditioning-free policy learning framework that learns visuomotor policies by directly mapping latent images to latent actions.
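As a rough illustration of that idea, here is a minimal flow-matching sketch in which the flow's source is the latent image rather than Gaussian noise, so no separate conditioning input is needed. The latent dimensions, module names, and loss are placeholders, not VITA's actual architecture:

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Predicts the flow velocity v(z_t, t) along the image-to-action path."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 512), nn.GELU(), nn.Linear(512, dim)
        )

    def forward(self, z_t, t):
        # t has shape (batch, 1); concatenated as an extra input feature
        return self.net(torch.cat([z_t, t], dim=-1))

def flow_matching_loss(v_field, z_img, z_act):
    """Noise-free flow matching: the flow starts at the latent image
    (not Gaussian noise) and ends at the latent action."""
    t = torch.rand(z_img.shape[0], 1)       # random interpolation times in [0, 1]
    z_t = (1 - t) * z_img + t * z_act       # straight-line interpolant
    target_v = z_act - z_img                # constant velocity of that path
    return ((v_field(z_t, t) - target_v) ** 2).mean()

# e.g.: loss = flow_matching_loss(VelocityField(), torch.randn(8, 256), torch.randn(8, 256))
```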
This section covers installation, dataset preprocessing, and training.
- Policy and training: `./flare`
- Simulation: AV-ALOHA tasks (`gym-av-aloha`) and Robomimic tasks (`gym-robomimic`)
- Datasets: built on LeRobot Hugging Face formats, with optimized preprocessing into offline Zarr for faster training
```bash
git clone git@github.com:ucd-dare/VITA.git
cd VITA

conda create --name vita python==3.10
conda activate vita
conda install cmake

pip install -e .
pip install -r requirements.txt

# Install LeRobot dependencies
cd lerobot
pip install -e .

# Install ffmpeg for dataset processing
conda install -c conda-forge ffmpeg
```
Set the dataset storage path:
```bash
echo 'export FLARE_DATASETS_DIR=<PATH_TO_VITA>/gym-av-aloha/outputs' >> ~/.bashrc
# Reload bashrc
source ~/.bashrc
conda activate vita
```
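To double-check the variable from Python, a snippet like this works (the fallback path here is an assumption, not a documented default):

```python
import os
from pathlib import Path

# Falls back to the in-repo default if FLARE_DATASETS_DIR is unset
# (this fallback is an assumption, not the repo's documented behavior).
datasets_dir = Path(os.environ.get("FLARE_DATASETS_DIR", "gym-av-aloha/outputs"))
print(f"Datasets directory: {datasets_dir.resolve()}")
```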
Install benchmark dependencies for AV-ALOHA and/or Robomimic as needed:
- AV-ALOHA:

  ```bash
  cd gym-av-aloha
  pip install -e .
  ```

- Robomimic:

  ```bash
  cd gym-robomimic
  pip install -e .
  ```
Our dataloaders extend LeRobot, converting datasets into an offline Zarr format for faster training. We host datasets on Hugging Face. To list the available datasets:

```bash
cd gym-av-aloha/scripts
python convert.py --ls
```
As of September 2025, the available datasets include:

- `iantc104/av_aloha_sim_cube_transfer`
- `iantc104/av_aloha_sim_thread_needle`
- `iantc104/av_aloha_sim_pour_test_tube`
- `iantc104/av_aloha_sim_slot_insertion`
- `iantc104/av_aloha_sim_hook_package`
- `iantc104/robomimic_sim_transport`
- `iantc104/robomimic_sim_square`
- `iantc104/robomimic_sim_can`
- `lerobot/pusht`
Convert a Hugging Face dataset to an offline Zarr dataset (conversion may take more than 10 minutes). For example:

```bash
python convert.py -r iantc104/av_aloha_sim_hook_package
```

Datasets are stored in `./gym-av-aloha/outputs`.
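To sanity-check a converted dataset, you can open the Zarr store directly. The store name below assumes the `hook_package` conversion above, and the layout printed by `tree()` is the authoritative reference, since the exact group and array names are guesses:

```python
import zarr

# Store path assumes the hook_package conversion above; adjust to your dataset.
store = zarr.open("gym-av-aloha/outputs/av_aloha_sim_hook_package.zarr", mode="r")
print(store.tree())  # inspect the actual group/array layout
```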
If you encounter errors with `cv2`, `numpy`, or `scipy` during conversion, reinstalling them often resolves the issue:

```bash
pip uninstall opencv-python numpy scipy
pip install opencv-python numpy scipy
```
We use WandB for experiment tracking. Log in with `wandb login`, then set your entity in `./flare/configs/default_policy.yaml` (or append `wandb.entity=YOUR_ENTITY_NAME` to the training command):

```yaml
wandb:
  entity: "YOUR_WANDB_ENTITY"
```
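For reference, the entity configured above feeds WandB's standard run initialization; a minimal stand-alone check (the project and run names are placeholders, and the exact call site in this repo is an assumption):

```python
import wandb

# Placeholder project/run names; only the entity key mirrors the config above.
run = wandb.init(entity="YOUR_WANDB_ENTITY", project="vita", name="test")
run.finish()
```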
We log offline validation results, online simulator validation results, and visualizations of the ODE denoising process, which help interpret how action trajectories evolve during ODE solving under different algorithms. For example, in the first row below, VITA produces a structured action trajectory after just one ODE step, while conventional flow matching starts from Gaussian noise and denoises gradually.
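Conceptually, inference integrates the learned velocity field from the latent image toward a latent action, and a fixed-step Euler solver is the simplest such ODE algorithm. The sketch below (reusing the hypothetical `VelocityField` interface from earlier, not the repository's solver) records the intermediate states that such a visualization would plot:

```python
import torch

@torch.no_grad()
def sample_actions(v_field, z_img, n_steps=10):
    """Integrate dz/dt = v(z, t) from t=0 (latent image) to t=1
    (latent action) with a fixed-step Euler solver, keeping the
    intermediate states for visualization."""
    z = z_img.clone()
    dt = 1.0 / n_steps
    trajectory = [z.clone()]
    for i in range(n_steps):
        t = torch.full((z.shape[0], 1), i * dt)
        z = z + dt * v_field(z, t)
        trajectory.append(z.clone())
    return z, trajectory  # final latent action + intermediate states
```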
Train a policy:

```bash
python flare/train.py policy=vita task=hook_package session=test
```
- Use `session` to name checkpoints/logs (and WandB runs).
- Default config: `./flare/configs/default_policy.yaml`
- Policy config: `./flare/configs/policy/vita.yaml`
- Task config: `./flare/configs/task/hook_package.yaml`
- These override the defaults when specified, e.g. `policy=vita task=hook_package`.
Override flags as needed:
```bash
# Example 1: Use a specific GPU
python flare/train.py policy=vita task=hook_package session=test device=cuda:2

# Example 2: Change online validation frequency and episodes
python flare/train.py policy=vita task=hook_package session=test \
    val.val_online_freq=2000 val.eval_n_episodes=10

# Example 3: Run an ablation
python flare/train.py policy=vita task=hook_package session=ablate \
    policy.vita.decode_flow_latents=False wandb.notes=ablation
```
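These overrides use Hydra's syntax, so you can also inspect the composed configuration without launching training; a minimal sketch, assuming the configs compose cleanly outside `train.py`:

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

# config_path and config_name come from the paths above; whether this repo's
# configs compose cleanly outside train.py is an assumption.
with initialize(version_base=None, config_path="flare/configs"):
    cfg = compose(
        config_name="default_policy",
        overrides=["policy=vita", "task=hook_package", "session=test"],
    )
print(OmegaConf.to_yaml(cfg))  # dump the fully resolved config
```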
Available task configs are located in `./flare/configs/task`. To launch training with a specific task, set the `task` flag (e.g., `task=cube_transfer` to load `cube_transfer.yaml`).
```
# AV-ALOHA tasks
cube_transfer
hook_package
pour_test_tube
slot_insertion
thread_needle

# Robomimic tasks
robomimic_can
robomimic_square

# PushT
pusht
```
- 🧪 Project Page
- 📄 [arXiv Paper](https://arxiv.org/abs/2507.13231)
We gratefully acknowledge open-source codebases that inspired VITA: AV-ALOHA, Robomimic, and LeRobot.
```bibtex
@article{gao2025vita,
  title={VITA: Vision-to-Action Flow Matching Policy},
  author={Gao, Dechen and Zhao, Boqi and Lee, Andrew and Chuang, Ian and Zhou, Hanchu and Wang, Hang and Zhao, Zhe and Zhang, Junshan and Soltani, Iman},
  journal={arXiv preprint arXiv:2507.13231},
  year={2025}
}
```