Repo for the paper "Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding".
We recommend setting up a conda environment for the project:
git clone https://github.com/yunlong10/AVicuna.git
cd AVicuna
conda env create -f avicuna.yml
conda activate avicuna
Download the JSON metadata here and place the files into the ./data folder.
Download the fine-tuned model's checkpoints here and place them into the ./checkpoints folder.
After downloading, the folders should be organized as follows:
- data
  - stage1.json
  - stage2.json
  - stage3.json
  - stage4.json
- checkpoints
  - avicuna-vicuna-v1-5-7b-stage1
  - avicuna-vicuna-v1-5-7b-stage2
  - avicuna-vicuna-v1-5-7b-stage3
  - avicuna-vicuna-v1-5-7b-stage4
  - clip
    - ViT-L-14.pt
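Before running anything, you can quickly verify the layout with a small script such as the one below (a minimal sketch; the paths simply mirror the listing above, and placing clip/ViT-L-14.pt inside ./checkpoints is an assumption):

```python
# check_layout.py -- quick sanity check that the expected data/checkpoint paths exist
from pathlib import Path

expected = [
    "data/stage1.json",
    "data/stage2.json",
    "data/stage3.json",
    "data/stage4.json",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage1",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage2",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage3",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage4",
    "checkpoints/clip/ViT-L-14.pt",  # assumed location of the CLIP weights
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    print("Missing paths:", *missing, sep="\n  ")
else:
    print("All expected data and checkpoint paths are present.")
```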
To run inference:
python -m avicuna.inference
The video and audio features can be extracted by ./avicuna/get_clip.py and ./avicuna/get_clap.py. You can also download the extracted features here.
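For reference, frame-level visual features with CLIP ViT-L/14 can be obtained along the lines of the sketch below. This is not the repo's get_clip.py; it assumes the openai/CLIP package and decord for frame decoding, and the video path, frame count, and output file are placeholders:

```python
# clip_features.py -- rough sketch of CLIP-based frame feature extraction
# (illustrative only; see ./avicuna/get_clip.py for the actual pipeline)
import torch
import clip                           # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from decord import VideoReader, cpu   # pip install decord

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-L/14 matches the checkpoint listed under checkpoints/clip/ViT-L-14.pt (assumed location)
model, preprocess = clip.load("ViT-L/14", device=device, download_root="checkpoints/clip")

def extract_clip_features(video_path: str, num_frames: int = 100) -> torch.Tensor:
    """Uniformly sample frames and return a (num_frames, 768) CLIP feature tensor."""
    vr = VideoReader(video_path, ctx=cpu(0))
    idx = torch.linspace(0, len(vr) - 1, num_frames).long().tolist()
    frames = [Image.fromarray(vr[i].asnumpy()) for i in idx]
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats.float().cpu()

if __name__ == "__main__":
    feats = extract_clip_features("example.mp4")   # placeholder video path
    torch.save(feats, "example_clip.pt")
    print(feats.shape)
```

Audio features (get_clap.py) would follow the same pattern, with a CLAP audio encoder in place of the CLIP image encoder.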
We train our model on a single NVIDIA A6000 48GB GPU. Training proceeds in four stages:
Stage I: Vision-Text Alignment
bash scripts/stage1.sh
Stage II: Audio-Text Alignment
bash scripts/stage2.sh
Stage III: Time-Event Alignment
bash scripts/stage3.sh
Stage IV: Instruction Tuning
bash scripts/stage4.sh
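To run the whole curriculum end-to-end, the four scripts can simply be chained, for example with a small wrapper like the one below (it assumes each stage script reads the checkpoint produced by the previous stage, as in the ./checkpoints layout above):

```python
# run_all_stages.py -- launch the four training stages in order
import subprocess

for stage in range(1, 5):
    print(f"=== Stage {stage} ===")
    subprocess.run(["bash", f"scripts/stage{stage}.sh"], check=True)
```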
This work was supported by Sony Group Corporation. We would like to thank Sayaka Nakamura and Jerry Jun Yokono for insightful discussion.
We are also thankful for the following awesome projects that AVicuna builds upon:
- LLaMA: Open and efficient foundation language models.
- FastChat: An open platform for training, serving, and evaluating large language model based chatbots.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models.
- Vid2seq: Large-scale pretraining of a visual language model for dense video captioning.
- VTimeLLM: A Vid-LLM for fine-grained video moment understanding.
- VALOR-32K: An audiovisual-language dataset.
- UnAV-100: An untrimmed video dataset for dense audio-visual event localization.
- Auto-ACD: A large-scale dataset for audio-language representation learning.
- AudioSet: A large-scale dataset of manually annotated audio events.
- AudioCap: Towards generating natural language description for any kind of audio in the wild.
- InternVid: A large-scale video-text dataset.
@article{tang2024avicuna,
title={Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding},
author={Tang, Yunlong and Shimada, Daiki and Bi, Jing and Feng, Mingqian and Hua, Hang and Xu, Chenliang},
journal={arXiv preprint arXiv:2403.16276},
year={2024}
}