This is a fork of LLaVA: Large Language and Vision Assistant that adds compatibility with alternative vision encoders from OpenCLIP. If anything here remains unclear, please have a look at the original LLaVA and OpenCLIP repositories.
This fork adds support for:
- Using alternative vision encoders with LLaVA, particularly from OpenCLIP
- Testing with the UCSC-VLAA/ViT-L-14-CLIPA-336-datacomp1B encoder
- Compatibility with different CUDA versions and hardware setups
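As a rough illustration of what using an alternative OpenCLIP encoder means in practice, the sketch below loads the CLIPA checkpoint named above via the `open_clip` package and extracts image features. This is not the fork's actual integration code (LLaVA consumes patch-level features from the vision tower rather than the pooled embedding), and the image path is a placeholder.

```python
# Minimal sketch, assuming the open_clip_torch package is installed and the
# checkpoint is published on the Hugging Face Hub under the name given above.
# This is NOT the fork's wiring into LLaVA; it only exercises the encoder itself.
import torch
import open_clip
from PIL import Image

model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:UCSC-VLAA/ViT-L-14-CLIPA-336-datacomp1B"
)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    features = model.encode_image(image)  # pooled image embedding
print(features.shape)
```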
For downloading and preparing all necessary datasets to train LLaVA models, refer to `preparing_datasets/download.py`, which handles:
- Pretrain feature alignment datasets
- Visual instruction tuning datasets (COCO, GQA, OCR-VQA, TextVQA, VisualGenome)
See comments in the script for handling timeouts and special cases with certain datasets.
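The exact logic lives in that script; purely as an illustration of the kind of timeout handling it needs, here is a generic download helper with a timeout and simple retries. The URL, destination path, and retry policy are placeholders and are not taken from the script.

```python
# Illustrative only: a download helper with timeouts and retries, in the spirit
# of the issues preparing_datasets/download.py has to work around. All URLs,
# paths, and retry settings here are placeholders.
import time
import requests

def download(url: str, dest: str, timeout: int = 60, retries: int = 3) -> None:
    for attempt in range(1, retries + 1):
        try:
            with requests.get(url, stream=True, timeout=timeout) as r:
                r.raise_for_status()
                with open(dest, "wb") as f:
                    for chunk in r.iter_content(chunk_size=1 << 20):
                        f.write(chunk)
            return  # success
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            time.sleep(5 * attempt)  # simple backoff
    raise RuntimeError(f"Could not download {url} after {retries} attempts")

# download("https://example.com/dataset.zip", "data/dataset.zip")  # placeholder
```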
For building with Docker:

```bash
export DOCKER_DEFAULT_PLATFORM=linux/amd64
cog build -t registry.datexis.com/jwesterhoff/llava-pretrain:latest
```

Note that compared to the original LLaVA code, I could not get it to build without some version adjustments in the `cog.yaml` files and removing flash attention (see `llava/train/train_mem.py`).
If the build fails:

- Run `cog debug > Dockerfile`
- Modify the `Dockerfile` and `requirements.txt` in the `.cog` folder as needed (e.g., a different CUDA and torch version)
- Build with `docker build -t registry.datexis.com/jwesterhoff/llava-train:latest . --platform=linux/amd64`
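If you do change the CUDA and torch versions, a quick sanity check inside the built image can confirm that the combination actually works. This uses only standard PyTorch attributes and is not specific to this fork.

```python
# Sanity check for the torch/CUDA combination inside the built container.
# Standard PyTorch calls only; nothing here is specific to this fork.
import torch

print("torch version:", torch.__version__)
print("compiled for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```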
- Hardware used: 8x A100 (40GB) with CUDA 11.8
- Compatible with both standard CLIP and custom OpenCLIP models
- Batch size 16 with 2 gradient accumulation steps worked for all models on this setup (the corresponding training arguments are sketched after this list)
- For CLIP models: used the `zero3.json` config with bf16 and batch size 16 (1 gradient accumulation step) on 8x B200 GPUs with CUDA 12.8
- For OpenCLIP models: used `zero2.json` with CUDA 11.8 on 8x A100s with bf16
- Note: the custom OpenCLIP implementation appears to have compatibility issues with DeepSpeed stage 3 optimization.
- With zero2 training on 8x A100 (40GB), the batch size had to be reduced to 2 (training time: ~14 hours)

NOTE: SigLIP models could not be trained at all; they do not fit on 8x A100 (40GB) even with batch size 1.

- For better DeepSpeed stage 3 compatibility, improvements could be made by following approaches like EVA-CLIP's.
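For reference, the batch size / gradient accumulation / precision / DeepSpeed settings above roughly correspond to the following Hugging Face `TrainingArguments` keyword arguments (LLaVA's training script builds on that class). This is a hedged sketch: the `scripts/zero*.json` paths are assumed from the original LLaVA repo layout, and values not stated above are marked as illustrative.

```python
# Sketch of the training knobs from the notes above as TrainingArguments kwargs.
# DeepSpeed config paths are assumed from the original LLaVA repo layout.

# CLIP encoders: zero3 + bf16, batch size 16, 1 gradient accumulation step
clip_training_kwargs = dict(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    bf16=True,
    deepspeed="./scripts/zero3.json",  # assumed path
)

# Custom OpenCLIP encoders: zero2 + bf16, batch size reduced to 2 on 8x A100 (40GB)
openclip_training_kwargs = dict(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,     # illustrative; not stated above for this batch size
    bf16=True,
    deepspeed="./scripts/zero2.json",  # assumed path
)

# Usage (illustrative): transformers.TrainingArguments(output_dir="...", **clip_training_kwargs)
```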
For inference with the trained models, refer to the SCAM project.
For the complete documentation of the base LLaVA project, including installation instructions, model zoo, evaluation, and more, please refer to the original LLaVA repository.
This project follows the licensing terms of the original LLaVA repository.