
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding (AAAI 2025)

Repo for the paper "Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding".


Installation

We recommend setting up a conda environment for the project:

git clone https://github.com/yunlong10/AVicuna.git
cd AVicuna

conda env create -f avicuna.yml
conda activate avicuna
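
To sanity-check the environment (a minimal sketch; it assumes the conda environment provides PyTorch with CUDA support, which training on an A6000 implies):

# Quick environment check (assumption: PyTorch is installed by avicuna.yml)
import torch
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())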

Data & Checkpoints

Download the JSON metadata here and place the files in the ./data folder.

Download the fine-tuned model checkpoints here and place them in the ./checkpoints folder. The expected layout is shown below, followed by a small script to verify it.

- data
    - stage1.json
    - stage2.json
    - stage3.json
    - stage4.json

- checkpoints
    - avicuna-vicuna-v1-5-7b-stage1
    - avicuna-vicuna-v1-5-7b-stage2
    - avicuna-vicuna-v1-5-7b-stage3
    - avicuna-vicuna-v1-5-7b-stage4
    - clip
        - ViT-L-14.pt
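
A minimal sketch to verify the layout above (the paths come directly from the listing; the script itself is illustrative, not part of the repo):

# check_layout.py -- verify the data/checkpoint layout described above (illustrative)
from pathlib import Path

expected = [
    "data/stage1.json", "data/stage2.json", "data/stage3.json", "data/stage4.json",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage1",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage2",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage3",
    "checkpoints/avicuna-vicuna-v1-5-7b-stage4",
    "checkpoints/clip/ViT-L-14.pt",
]
missing = [p for p in expected if not Path(p).exists()]
print("All files in place." if not missing else f"Missing: {missing}")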

Inference

python -m avicuna.inference

Features

The video and audio features can be extracted with ./avicuna/get_clip.py and ./avicuna/get_clap.py, respectively. Alternatively, you can download the pre-extracted features here.
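
For reference, here is a minimal sketch of frame-level visual feature extraction with CLIP ViT-L/14 (the backbone listed under ./checkpoints/clip). It is an illustration using the openai/CLIP and decord packages, not the exact logic of get_clip.py; the sampling rate and file names are assumptions:

# Illustrative CLIP feature extraction; see ./avicuna/get_clip.py for the actual script.
import clip
import torch
from decord import VideoReader
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# "ViT-L/14" can also be replaced with a local path such as ./checkpoints/clip/ViT-L-14.pt
model, preprocess = clip.load("ViT-L/14", device=device)

vr = VideoReader("example.mp4")                       # hypothetical input video
indices = range(0, len(vr), 30)                       # ~1 frame/sec at 30 fps (assumption)
frames = [Image.fromarray(vr[i].asnumpy()) for i in indices]
batch = torch.stack([preprocess(f) for f in frames]).to(device)

with torch.no_grad():
    features = model.encode_image(batch)              # one embedding per sampled frame
print(features.shape)                                 # (num_frames, 768) for ViT-L/14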

Training

We train our model on a single NVIDIA A6000 48 GB GPU.

Stage I: Vision-Text Alignment

bash scripts/stage1.sh

Stage II: Audio-Text Alignment

bash scripts/stage2.sh

Stage III: Time-Event Alignment

bash scripts/stage3.sh

Stage IV: Instruction Tuning

bash scripts/stage4.sh

Pseudo-Untrimmed Video Construction Pipeline

Coming soon ...

Acknowledgements

This work was supported by Sony Group Corporation. We would like to thank Sayaka Nakamura and Jerry Jun Yokono for insightful discussion.

We are also grateful to the following awesome projects that AVicuna builds upon:

  • LLaMA: Open and efficient foundation language models.
  • FastChat: An open platform for training, serving, and evaluating large language model based chatbots.
  • Video-ChatGPT: Towards detailed video understanding via large vision and language models.
  • Vid2seq: Large-scale pretraining of a visual language model for dense video captioning.
  • VTimeLLM: A Vid-LLM for fine-grained video moment understanding.
  • VALOR-32K: An audiovisual-language dataset.
  • UnAV-100: An untrimmed video dataset for dense audio-visual event localization.
  • Auto-ACD: A large-scale dataset for audio-language representation learning.
  • AudioSet: A large-scale dataset of manually annotated audio events.
  • AudioCap: Towards generating natural language description for any kind of audio in the wild.
  • InternVid: A large-scale video-text dataset.

Citation

@article{tang2024avicuna,
  title={Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding},
  author={Tang, Yunlong and Shimada, Daiki and Bi, Jing and Feng, Mingqian and Hua, Hang and Xu, Chenliang},
  journal={arXiv preprint arXiv:2403.16276},
  year={2024}
}
