This is a fork of LLaVA: Large Language and Vision Assistant that adds compatibility with alternative vision encoders from OpenCLIP. If anything here remains unclear, please have a look at the original LLaVA and OpenCLIP repositories.
This fork adds support for:
- Using alternative vision encoders with LLaVA, particularly from OpenCLIP
- Testing with the UCSC-VLAA/ViT-L-14-CLIPA-336-datacomp1B encoder
- Compatibility with different CUDA versions and hardware setups
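As a rough illustration of what using an alternative OpenCLIP encoder means in practice, the sketch below loads the CLIPA checkpoint named above via the `open_clip` package and extracts image features. This is not the fork's actual integration code (LLaVA consumes patch-level features from the vision tower rather than the pooled embedding), and the image path is a placeholder.

```python
# Minimal sketch, assuming the open_clip_torch package is installed and the
# checkpoint is published on the Hugging Face Hub under the name given above.
# This is NOT the fork's wiring into LLaVA; it only exercises the encoder itself.
import torch
import open_clip
from PIL import Image

model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:UCSC-VLAA/ViT-L-14-CLIPA-336-datacomp1B"
)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    features = model.encode_image(image)  # pooled image embedding
print(features.shape)
```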
For downloading and preparing all necessary datasets to train LLaVA models, refer to `preparing_datasets/download.py`, which handles:
- Pretrain feature alignment datasets
- Visual instruction tuning datasets (COCO, GQA, OCR-VQA, TextVQA, VisualGenome)
See comments in the script for handling timeouts and special cases with certain datasets.
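The exact logic lives in that script; purely as an illustration of the kind of timeout handling it needs, here is a generic download helper with a timeout and simple retries. The URL, destination path, and retry policy are placeholders and are not taken from the script.

```python
# Illustrative only: a download helper with timeouts and retries, in the spirit
# of the issues preparing_datasets/download.py has to work around. All URLs,
# paths, and retry settings here are placeholders.
import time
import requests

def download(url: str, dest: str, timeout: int = 60, retries: int = 3) -> None:
    for attempt in range(1, retries + 1):
        try:
            with requests.get(url, stream=True, timeout=timeout) as r:
                r.raise_for_status()
                with open(dest, "wb") as f:
                    for chunk in r.iter_content(chunk_size=1 << 20):
                        f.write(chunk)
            return  # success
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            time.sleep(5 * attempt)  # simple backoff
    raise RuntimeError(f"Could not download {url} after {retries} attempts")

# download("https://example.com/dataset.zip", "data/dataset.zip")  # placeholder
```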
For building with Docker:

```bash
export DOCKER_DEFAULT_PLATFORM=linux/amd64
cog build -t registry.datexis.com/jwesterhoff/llava-pretrain:latest
```

Note that compared to the original LLaVA code, I could not get it to build without some version adjustments in the `cog.yaml` files and removing flash attention (see `llava/train/train_mem.py`).
If the build fails:

- Run `cog debug > Dockerfile`
- Modify the `Dockerfile` and `requirements.txt` in the `.cog` folder as needed (e.g., a different CUDA and torch version)
- Build with `docker build -t registry.datexis.com/jwesterhoff/llava-train:latest . --platform=linux/amd64`
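If you do change the CUDA and torch versions, a quick sanity check inside the built image can confirm that the combination actually works. This uses only standard PyTorch attributes and is not specific to this fork.

```python
# Sanity check for the torch/CUDA combination inside the built container.
# Standard PyTorch calls only; nothing here is specific to this fork.
import torch

print("torch version:", torch.__version__)
print("compiled for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```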
- Hardware used: 8x A100 (40GB) with CUDA 11.8
- Compatible with both standard CLIP and custom OpenCLIP models
- Batch size 16 with 2 gradient accumulation steps worked for all models on this setup (the corresponding training arguments are sketched after this list)
- For CLIP models: used the `zero3.json` config with bf16 and batch size 16 (1 gradient accumulation step) on 8x B200 GPUs with CUDA 12.8
- For OpenCLIP models: used `zero2.json` with CUDA 11.8 on 8x A100s with bf16
- Note: the custom OpenCLIP implementation appears to have compatibility issues with DeepSpeed stage 3 optimization.
- With zero2 training on 8x A100 (40GB), the batch size had to be reduced to 2 (training time: ~14 hours)

NOTE: SigLIP models could not be trained at all; they do not fit on 8x A100 (40GB) even with batch size 1.

- For better DeepSpeed stage 3 compatibility, improvements could be made by following approaches like EVA-CLIP's.
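For reference, the batch size / gradient accumulation / precision / DeepSpeed settings above roughly correspond to the following Hugging Face `TrainingArguments` keyword arguments (LLaVA's training script builds on that class). This is a hedged sketch: the `scripts/zero*.json` paths are assumed from the original LLaVA repo layout, and values not stated above are marked as illustrative.

```python
# Sketch of the training knobs from the notes above as TrainingArguments kwargs.
# DeepSpeed config paths are assumed from the original LLaVA repo layout.

# CLIP encoders: zero3 + bf16, batch size 16, 1 gradient accumulation step
clip_training_kwargs = dict(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    bf16=True,
    deepspeed="./scripts/zero3.json",  # assumed path
)

# Custom OpenCLIP encoders: zero2 + bf16, batch size reduced to 2 on 8x A100 (40GB)
openclip_training_kwargs = dict(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,     # illustrative; not stated above for this batch size
    bf16=True,
    deepspeed="./scripts/zero2.json",  # assumed path
)

# Usage (illustrative): transformers.TrainingArguments(output_dir="...", **clip_training_kwargs)
```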
For inference with the trained models, refer to the SCAM project.
For the complete documentation of the base LLaVA project, including installation instructions, model zoo, evaluation, and more, please refer to the original LLaVA repository.
This project follows the licensing terms of the original LLaVA repository.