Repo for the paper "Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding".
We recommend setting up a conda environment for the project:
git clone https://github.com/yunlong10/AVicuna.git
cd AVicuna
conda env create -f avicuna.yml
conda activate avicuna
Download the JSON metadata here and place the files into the ./data folder.
Download the fine-tuned model's checkpoints here and place them into the ./checkpoints folder.
After downloading, the folders should be organized as follows:
- data
  - stage1.json
  - stage2.json
  - stage3.json
  - stage4.json
- checkpoints
  - avicuna-vicuna-v1-5-7b-stage1
  - avicuna-vicuna-v1-5-7b-stage2
  - avicuna-vicuna-v1-5-7b-stage3
  - avicuna-vicuna-v1-5-7b-stage4
  - clip
    - ViT-L-14.pt
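Before running anything, you can quickly verify the layout with a small script such as the one below (a minimal sketch; the paths simply mirror the listing above, and placing clip/ViT-L-14.pt inside ./checkpoints is an assumption):

```python
# check_layout.py -- quick sanity check that the expected data/checkpoint paths exist
from pathlib import Path

expected = [
    "data/stage1.json",
    "data/stage2.json",
    "data/stage3.json",
    "data/stage4.json",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage1",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage2",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage3",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage4",
    "checkpoints/clip/ViT-L-14.pt",  # assumed location of the CLIP weights
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    print("Missing paths:", *missing, sep="\n  ")
else:
    print("All expected data and checkpoint paths are present.")
```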
To run inference:
python -m avicuna.inference
The video and audio features can be extracted by ./avicuna/get_clip.py and ./avicuna/get_clap.py. You can also download the extracted features here.
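For reference, frame-level visual features with CLIP ViT-L/14 can be obtained along the lines of the sketch below. This is not the repo's get_clip.py; it assumes the openai/CLIP package and decord for frame decoding, and the video path, frame count, and output file are placeholders:

```python
# clip_features.py -- rough sketch of CLIP-based frame feature extraction
# (illustrative only; see ./avicuna/get_clip.py for the actual pipeline)
import torch
import clip                           # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from decord import VideoReader, cpu   # pip install decord

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-L/14 matches the checkpoint listed under checkpoints/clip/ViT-L-14.pt (assumed location)
model, preprocess = clip.load("ViT-L/14", device=device, download_root="checkpoints/clip")

def extract_clip_features(video_path: str, num_frames: int = 100) -> torch.Tensor:
    """Uniformly sample frames and return a (num_frames, 768) CLIP feature tensor."""
    vr = VideoReader(video_path, ctx=cpu(0))
    idx = torch.linspace(0, len(vr) - 1, num_frames).long().tolist()
    frames = [Image.fromarray(vr[i].asnumpy()) for i in idx]
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats.float().cpu()

if __name__ == "__main__":
    feats = extract_clip_features("example.mp4")   # placeholder video path
    torch.save(feats, "example_clip.pt")
    print(feats.shape)
```

Audio features (get_clap.py) would follow the same pattern, with a CLAP audio encoder in place of the CLIP image encoder.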
We train our model on a single NVIDIA A6000 48GB GPU. Training proceeds in four stages:
Stage I: Vision-Text Alignment
bash scripts/stage1.sh
Stage II: Audio-Text Alignment
bash scripts/stage2.sh
Stage III: Time-Event Alignment
bash scripts/stage3.sh
Stage IV: Instruction Tuning
bash scripts/stage4.sh
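To run the whole curriculum end-to-end, the four scripts can simply be chained, for example with a small wrapper like the one below (it assumes each stage script reads the checkpoint produced by the previous stage, as in the ./checkpoints layout above):

```python
# run_all_stages.py -- launch the four training stages in order
import subprocess

for stage in range(1, 5):
    print(f"=== Stage {stage} ===")
    subprocess.run(["bash", f"scripts/stage{stage}.sh"], check=True)
```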
This work was supported by Sony Group Corporation. We would like to thank Sayaka Nakamura and Jerry Jun Yokono for insightful discussion.
We are also thankful for the following awesome projects that AVicuna builds upon:
- LLaMA: Open and efficient foundation language models.
- FastChat: An open platform for training, serving, and evaluating large language model based chatbots.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models.
- Vid2seq: Large-scale pretraining of a visual language model for dense video captioning.
- VTimeLLM: A Vid-LLM for fine-grained video moment understanding.
- VALOR-32K: An audiovisual-language dataset.
- UnAV-100: An untrimmed video dataset for dense audio-visual event localization.
- Auto-ACD: A large-scale dataset for audio-language representation learning.
- AudioSet: A large-scale dataset of manually annotated audio events.
- AudioCap: Towards generating natural language description for any kind of audio in the wild.
- InternVid: A large-scale video-text dataset.
@article{tang2024avicuna,
title={Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding},
author={Tang, Yunlong and Shimada, Daiki and Bi, Jing and Feng, Mingqian and Hua, Hang and Xu, Chenliang},
journal={arXiv preprint arXiv:2403.16276},
year={2024}
}