This repository provides the official implementation of the paper VITA: Vision-to-Action Flow Matching Policy (July 2025).
VITA is a noise-free, conditioning-free policy learning framework that learns visuomotor policies by directly mapping latent images to latent actions.
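As a rough illustration of that idea, here is a minimal flow-matching sketch in which the flow's source is the latent image rather than Gaussian noise, so no separate conditioning input is needed. The latent dimensions, module names, and loss are placeholders, not VITA's actual architecture:

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Predicts the flow velocity v(z_t, t) along the image-to-action path."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 512), nn.GELU(), nn.Linear(512, dim)
        )

    def forward(self, z_t, t):
        # t has shape (batch, 1); concatenated as an extra input feature
        return self.net(torch.cat([z_t, t], dim=-1))

def flow_matching_loss(v_field, z_img, z_act):
    """Noise-free flow matching: the flow starts at the latent image
    (not Gaussian noise) and ends at the latent action."""
    t = torch.rand(z_img.shape[0], 1)       # random interpolation times in [0, 1]
    z_t = (1 - t) * z_img + t * z_act       # straight-line interpolant
    target_v = z_act - z_img                # constant velocity of that path
    return ((v_field(z_t, t) - target_v) ** 2).mean()

# e.g.: loss = flow_matching_loss(VelocityField(), torch.randn(8, 256), torch.randn(8, 256))
```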
This section covers installation, dataset preprocessing, and training.
- Policy and training: `./flare`
- Simulation: AV-ALOHA tasks (`gym-av-aloha`) and Robomimic tasks (`gym-robomimic`)
- Datasets: built on LeRobot Hugging Face formats, with optimized preprocessing into offline Zarr for faster training
```bash
git clone git@github.com:ucd-dare/VITA.git
cd VITA

conda create --name vita python==3.10
conda activate vita
conda install cmake

pip install -e .
pip install -r requirements.txt

# Install LeRobot dependencies
cd lerobot
pip install -e .

# Install ffmpeg for dataset processing
conda install -c conda-forge ffmpeg
```
Set the dataset storage path:
```bash
echo 'export FLARE_DATASETS_DIR=<PATH_TO_VITA>/gym-av-aloha/outputs' >> ~/.bashrc
# Reload bashrc
source ~/.bashrc
conda activate vita
```
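To double-check the variable from Python, a snippet like this works (the fallback path here is an assumption, not a documented default):

```python
import os
from pathlib import Path

# Falls back to the in-repo default if FLARE_DATASETS_DIR is unset
# (this fallback is an assumption, not the repo's documented behavior).
datasets_dir = Path(os.environ.get("FLARE_DATASETS_DIR", "gym-av-aloha/outputs"))
print(f"Datasets directory: {datasets_dir.resolve()}")
```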
Install benchmark dependencies for AV-ALOHA and/or Robomimic as needed:
- AV-ALOHA:

  ```bash
  cd gym-av-aloha
  pip install -e .
  ```

- Robomimic:

  ```bash
  cd gym-robomimic
  pip install -e .
  ```
Our dataloaders extend LeRobot, converting datasets into an offline Zarr format for faster training. We host datasets on Hugging Face. To list the available datasets:

```bash
cd gym-av-aloha/scripts
python convert.py --ls
```
As of September 2025, the available datasets include:

- `iantc104/av_aloha_sim_cube_transfer`
- `iantc104/av_aloha_sim_thread_needle`
- `iantc104/av_aloha_sim_pour_test_tube`
- `iantc104/av_aloha_sim_slot_insertion`
- `iantc104/av_aloha_sim_hook_package`
- `iantc104/robomimic_sim_transport`
- `iantc104/robomimic_sim_square`
- `iantc104/robomimic_sim_can`
- `lerobot/pusht`
Convert a Hugging Face dataset to an offline Zarr dataset (conversion may take more than 10 minutes). For example:

```bash
python convert.py -r iantc104/av_aloha_sim_hook_package
```

Datasets are stored in `./gym-av-aloha/outputs`.
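To sanity-check a converted dataset, you can open the Zarr store directly. The store name below assumes the `hook_package` conversion above, and the layout printed by `tree()` is the authoritative reference, since the exact group and array names are guesses:

```python
import zarr

# Store path assumes the hook_package conversion above; adjust to your dataset.
store = zarr.open("gym-av-aloha/outputs/av_aloha_sim_hook_package.zarr", mode="r")
print(store.tree())  # inspect the actual group/array layout
```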
If you encounter errors with `cv2`, `numpy`, or `scipy` during conversion, reinstalling them often resolves the issue:

```bash
pip uninstall opencv-python numpy scipy
pip install opencv-python numpy scipy
```
We use WandB for experiment tracking. Log in with `wandb login`, then set your entity in `./flare/configs/default_policy.yaml` (or append `wandb.entity=YOUR_ENTITY_NAME` to the training command):

```yaml
wandb:
  entity: "YOUR_WANDB_ENTITY"
```
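For reference, the entity configured above feeds WandB's standard run initialization; a minimal stand-alone check (the project and run names are placeholders, and the exact call site in this repo is an assumption):

```python
import wandb

# Placeholder project/run names; only the entity key mirrors the config above.
run = wandb.init(entity="YOUR_WANDB_ENTITY", project="vita", name="test")
run.finish()
```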
We log offline validation results, online simulator validation results, and visualizations of the ODE denoising process, which help interpret how action trajectories evolve during ODE solving under different algorithms. For example, in the first row below, VITA produces a structured action trajectory after just one ODE step, while conventional flow matching starts from Gaussian noise and denoises gradually.
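Conceptually, inference integrates the learned velocity field from the latent image toward a latent action, and a fixed-step Euler solver is the simplest such ODE algorithm. The sketch below (reusing the hypothetical `VelocityField` interface from earlier, not the repository's solver) records the intermediate states that such a visualization would plot:

```python
import torch

@torch.no_grad()
def sample_actions(v_field, z_img, n_steps=10):
    """Integrate dz/dt = v(z, t) from t=0 (latent image) to t=1
    (latent action) with a fixed-step Euler solver, keeping the
    intermediate states for visualization."""
    z = z_img.clone()
    dt = 1.0 / n_steps
    trajectory = [z.clone()]
    for i in range(n_steps):
        t = torch.full((z.shape[0], 1), i * dt)
        z = z + dt * v_field(z, t)
        trajectory.append(z.clone())
    return z, trajectory  # final latent action + intermediate states
```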
Train a policy:

```bash
python flare/train.py policy=vita task=hook_package session=test
```
- Use `session` to name checkpoints/logs (and WandB runs).
- Default config: `./flare/configs/default_policy.yaml`
- Policy config: `./flare/configs/policy/vita.yaml`
- Task config: `./flare/configs/task/hook_package.yaml`
- These override the defaults when specified, e.g. `policy=vita task=hook_package`.
Override flags as needed:
```bash
# Example 1: Use a specific GPU
python flare/train.py policy=vita task=hook_package session=test device=cuda:2

# Example 2: Change online validation frequency and episodes
python flare/train.py policy=vita task=hook_package session=test \
    val.val_online_freq=2000 val.eval_n_episodes=10

# Example 3: Run an ablation
python flare/train.py policy=vita task=hook_package session=ablate \
    policy.vita.decode_flow_latents=False wandb.notes=ablation
```
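These overrides use Hydra's syntax, so you can also inspect the composed configuration without launching training; a minimal sketch, assuming the configs compose cleanly outside `train.py`:

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

# config_path and config_name come from the paths above; whether this repo's
# configs compose cleanly outside train.py is an assumption.
with initialize(version_base=None, config_path="flare/configs"):
    cfg = compose(
        config_name="default_policy",
        overrides=["policy=vita", "task=hook_package", "session=test"],
    )
print(OmegaConf.to_yaml(cfg))  # dump the fully resolved config
```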
Available task configs are located in `./flare/configs/task`. To launch training with a specific task, set the `task` flag (e.g., `task=cube_transfer` to load `cube_transfer.yaml`).
```
# AV-ALOHA tasks
cube_transfer
hook_package
pour_test_tube
slot_insertion
thread_needle

# Robomimic tasks
robomimic_can
robomimic_square

# PushT
pusht
```
- 🧪 Project Page
- 📄 [arXiv Paper](https://arxiv.org/abs/2507.13231)
We gratefully acknowledge open-source codebases that inspired VITA: AV-ALOHA, Robomimic, and LeRobot.
```bibtex
@article{gao2025vita,
  title={VITA: Vision-to-Action Flow Matching Policy},
  author={Gao, Dechen and Zhao, Boqi and Lee, Andrew and Chuang, Ian and Zhou, Hanchu and Wang, Hang and Zhao, Zhe and Zhang, Junshan and Soltani, Iman},
  journal={arXiv preprint arXiv:2507.13231},
  year={2025}
}
```