
Survey of Video Diffusion Models: Foundations, Implementations, and Applications

Paper

¹University of Waterloo, ²Duke University, ³Netflix Eyeline Studios
*Contributed Equally, Corresponding Author

Abstract

In this survey and GitHub repository, we provide a comprehensive overview of recent advances in video diffusion models. We cover the foundations of video generative models, including GANs, auto-regressive models, and diffusion models. We also discuss the learning foundations, including classic denoising diffusion models, flow matching, and training-free methods. Additionally, we explore various architectures, including UNet and diffusion transformers. We then discuss the applications of video diffusion models, including video generation, enhancement, personalization, and 3D-aware video generation. Finally, we highlight the benefits of video diffusion models for other domains, such as video representation learning and video retrieval.

Moreover, to facilitate the understanding of video diffusion models, we provide a cheatsheet including commonly used training datasets, training engineering techniques, and evaluation metrics. We also provide a list of video diffusion models in academia and industry.



Foundations

Video generative paradigms

GAN video models

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation arXiv Star Website ICCV 2023
StyleLipSync: Style-based Personalized Lip-sync Video Generation arXiv - Website ICCV 2023
SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant Learning arXiv - - ICCV 2023
AniFaceGAN: Animatable 3D-Aware Face Image Generation for Video Avatars arXiv - - NeurIPS 2022
StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 arXiv Star Website CVPR 2022
Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks arXiv Star Website ICLR 2022
A Good Image Generator Is What You Need for High-Resolution Video Synthesis arXiv Star Website ICLR 2021
Analyzing and Improving the Image Quality of StyleGAN arXiv Star - CVPR 2020
Train Sparsely, Generate Densely: Memory-efficient Unsupervised Training of High-resolution Temporal GAN arXiv Star - IJCV 2020
Adversarial Video Generation on Complex Datasets arXiv - - arXiv 2019
MoCoGAN: Decomposing Motion and Content for Video Generation arXiv Star - CVPR 2018
Temporal Generative Adversarial Nets with Singular Value Clipping arXiv Star Website ICCV 2017

Auto-regressive video models

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers arXiv Star Website ICLR 2023
Single Image Video Prediction with Auto-Regressive GANs arXiv - - Sensors 2022
HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator arXiv - Website ICIP 2022
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer arXiv Star Website ECCV 2022
VideoGPT: Video Generation using VQ-VAE and Transformers arXiv Star Website arXiv 2021
Latent Video Transformer arXiv Star - arXiv 2020
Parallel Multiscale Autoregressive Density Estimation arXiv - - ICML 2017
Video Pixel Networks arXiv - - ICML 2017

Video diffusion models

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
CCEdit: Creative and Controllable Video Editing via Diffusion Models arXiv - - arXiv 2024
Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models arXiv Star Website CVPR 2023
Make-A-Video: Text-to-Video Generation without Text-Video Data arXiv Star Website arXiv 2022
MagicVideo: Efficient Video Generation with Latent Diffusion Models arXiv - Website arXiv 2022
Imagen Video: High Definition Video Generation with Diffusion Models arXiv - - arXiv 2022
Video Diffusion Models arXiv Star Website arXiv 2022
Cascaded Diffusion Models for High Fidelity Image Generation arXiv - Website JMLR 2022
High-Resolution Image Synthesis with Latent Diffusion Models arXiv Star - CVPR 2022

Auto-regressive video diffusion models

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
From Slow Bidirectional to Fast Causal Video Generators arXiv Star Website arXiv 2024
Progressive Autoregressive Video Diffusion Models arXiv Star Website arXiv 2024
Pyramidal Flow Matching for Efficient Video Generative Modeling arXiv Star Website arXiv 2024
ART·V: Auto-Regressive Text-to-Video Generation with Diffusion Models arXiv Star Website CVPR 2024

Learning foundations

Classic denoising diffusion models

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Elucidating the Design Space of Diffusion-Based Generative Models arXiv Star - NeurIPS 2022
Denoising Diffusion Implicit Models arXiv Star - ICLR 2021
Improved Denoising Diffusion Probabilistic Models arXiv Star - ICML 2021
Denoising Diffusion Probabilistic Models arXiv Star Website NeurIPS 2020
Deep Unsupervised Learning using Nonequilibrium Thermodynamics arXiv Star - ICML 2015
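
As a quick reference for the papers above, the standard DDPM formulation (notation follows Denoising Diffusion Probabilistic Models): the forward process adds Gaussian noise according to a variance schedule β_t, and the network ε_θ is trained to predict that noise.

```latex
% Forward (noising) process, with \alpha_t = 1 - \beta_t and \bar\alpha_t = \prod_{s=1}^{t} \alpha_s
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\bigr), \qquad
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\, I\bigr)

% Simplified training objective: predict the injected noise
L_{\mathrm{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}
\Bigl[\ \bigl\lVert \epsilon - \epsilon_\theta\bigl(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\bigr) \bigr\rVert^2 \Bigr]
```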

Flow matching and rectified flow

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale arXiv Star Website NeurIPS 2023
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion arXiv - - ICASSP 2023
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation arXiv - - ICLR 2024
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions arXiv Star - arXiv 2023
Flow Matching for Generative Modeling arXiv - - arXiv 2022
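
For orientation, the rectified-flow / flow-matching objective used by several of the papers above, under one common convention (x_0 is noise, x_1 is data; some papers flip the direction of time): the model learns a velocity field v_θ along a straight-line interpolant.

```latex
% Straight-line interpolant between noise x_0 ~ N(0, I) and data x_1
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]

% Conditional flow-matching / rectified-flow objective: regress the constant velocity x_1 - x_0
L_{\mathrm{CFM}} = \mathbb{E}_{x_0,\ x_1,\ t}\bigl[\ \lVert v_\theta(x_t, t) - (x_1 - x_0) \rVert^2 \bigr]

% Sampling integrates the learned ODE from noise (t = 0) to data (t = 1)
\frac{d x_t}{d t} = v_\theta(x_t, t)
```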

Learning from feedback and reward models

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback arXiv - - arXiv 2024
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design arXiv Star Website arXiv 2024
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback arXiv Star Website NeurIPS 2024
InstructVideo: Instructing Video Diffusion Models with Human Feedback arXiv Star Website CVPR 2024
Click to Move: Controlling Video Generation with Sparse Motion arXiv Star - ICCV 2021

One-shot and few-shot learning

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Make an Image Move: Few-Shot Based Video Generation Guided by CLIP - - - ICPR 2025
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation arXiv Star Website arXiv 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation arXiv Star Website ICCV 2023

Training-free methods

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control arXiv - Website arXiv 2024
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models arXiv Star Website arXiv 2024
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning arXiv Star Website CVPR 2024
Peekaboo: Interactive Video Generation via Masked-Diffusion arXiv Star Website CVPR 2024
ControlVideo: Training-free Controllable Text-to-Video Generation arXiv Star Website ICLR 2024
Magic-Me: Identity-Specific Video Customized Diffusion arXiv Star Website arXiv 2024
AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks arXiv Star Website arXiv 2024
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator arXiv Star Website NeurIPS 2023
FateZero: Fusing Attentions for Zero-shot Text-based Video Editing arXiv Star - ICCV 2023
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators arXiv Star Website ICCV 2023

Token learning

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control arXiv - Website arXiv 2024
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation arXiv - - arXiv 2023
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion arXiv Star Website ICLR 2023

Guidances

Classifier guidance

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
CFG++: Manifold-Constrained Classifier Free Guidance for Diffusion Models arXiv Star Website arXiv 2024
LLM-grounded Video Diffusion Models arXiv Star Website ICLR 2024
Exploring Compositional Visual Generation with Latent Classifier Guidance arXiv - - CVPRW 2023
Diffusion Models Beat GANs on Image Synthesis arXiv - - NeurIPS 2021
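
For reference, classifier guidance (as introduced in Diffusion Models Beat GANs on Image Synthesis) steers sampling with the gradient of a separately trained, noise-aware classifier p_φ(c | x_t), scaled by a guidance weight s:

```latex
\hat\epsilon_\theta(x_t, t) = \epsilon_\theta(x_t, t) - \sqrt{1-\bar\alpha_t}\; s\, \nabla_{x_t} \log p_\phi(c \mid x_t)
```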

Classifier-free guidance

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Classifier-Free Diffusion Guidance arXiv Star Website 2022
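
Classifier-free guidance removes the external classifier: the condition c is randomly dropped (replaced by a null token ∅) during training, and at sampling time the conditional and unconditional predictions of the same network are extrapolated with a guidance weight w (w = 0 recovers the purely conditional model; larger w trades diversity for conditioning fidelity):

```latex
\tilde\epsilon_\theta(x_t, c) = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t, \varnothing)
```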

Diffusion model frameworks

Pixel diffusion and latent diffusion

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Latte: Latent Diffusion Transformer for Video Generation arXiv Star Website TMLR 2025
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation arXiv Star Website arXiv 2023
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models arXiv - - ICCV 2023
ModelScope Text-to-Video Technical Report arXiv - Website arXiv 2023
Structure and Content-Guided Video Synthesis with Diffusion Models arXiv - - ICCV 2023
Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models arXiv Star Website CVPR 2023
Text-To-4D Dynamic Scene Generation arXiv - Website arXiv 2023

Optical-flow-based diffusion models

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise arXiv Star Website CVPR 2025
Infinite-Resolution Integral Noise Warping for Diffusion Models arXiv Star Website ICLR 2024
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling arXiv Star Website SIGGRAPH 2024
A Dynamic Multi-Scale Voxel Flow Network for Video Prediction arXiv Star Website CVPR 2023
FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation arXiv Star - arXiv 2023
FlowFormer: A Transformer Architecture for Optical Flow arXiv Star Website arXiv 2022
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow arXiv Star - ECCV 2020
LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation arXiv Star - CVPR 2018
PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume arXiv Star - CVPR 2018
FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks arXiv Star - arXiv 2016
FlowNet: Learning Optical Flow with Convolutional Networks arXiv Star - ICCV 2015
A Quantitative Analysis of Current Practices in Optical Flow Estimation and the Principles Behind Them - - - IJCV 2014
Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods - - - IJCV 2005
A Framework for the Robust Estimation of Optical Flow - - - ICCV 1993
Determining Optical Flow - - - AI Journal 1981
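
Many of the flow-based methods above rely on backward warping of frames or noise maps with a dense flow field. A minimal PyTorch sketch of that operation (the helper name `backward_warp` and the pixel-unit flow convention are illustrative, not taken from any specific paper):

```python
import torch
import torch.nn.functional as F

def backward_warp(source: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `source` (B, C, H, W) with `flow` (B, 2, H, W), flow given in pixels.

    For each target pixel (x, y), samples the source at (x + u, y + v),
    the usual backward-warping convention used with RAFT-style flows.
    """
    b, _, h, w = source.shape
    # Pixel-coordinate grid for the target frame.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=source.device, dtype=source.dtype),
        torch.arange(w, device=source.device, dtype=source.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # (B, H, W)
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1] as required by grid_sample (align_corners=True).
    grid_x = 2.0 * grid_x / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)   # (B, H, W, 2), (x, y) order
    return F.grid_sample(source, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Example: warp random "noise" the way warped-noise priors do.
noise = torch.randn(1, 4, 64, 64)
flow = torch.zeros(1, 2, 64, 64)   # identity flow -> output equals input
warped = backward_warp(noise, flow)
```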

Noise scheduling

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Rethinking the Noise Schedule of Diffusion-Based Generative Models arXiv - - ICLR 2024
On the Importance of Noise Scheduling for Diffusion Models arXiv - - arXiv 2023
simple diffusion: End-to-end diffusion for high resolution images arXiv Star - ICML 2023
Elucidating the Design Space of Diffusion-Based Generative Models arXiv Star - NeurIPS 2022
Improved Denoising Diffusion Probabilistic Models arXiv Star - ICML 2021
Denoising Diffusion Probabilistic Models arXiv Star Website NeurIPS 2020
Score-Based Generative Modeling through Stochastic Differential Equations arXiv Star - ICLR 2021
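
As a concrete example of the design space studied above, the cosine schedule of Improved DDPM defines the cumulative signal level ᾱ_t directly (with a small offset s ≈ 0.008, and β_t clipped near the end of the schedule), retaining more signal at early steps than the original linear β_t schedule:

```latex
\bar\alpha_t = \frac{f(t)}{f(0)}, \qquad
f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right), \qquad
\beta_t = 1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}
```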

Agent-based diffusion models

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
VideoAgent: Self-Improving Video Generation arXiv Star Website ICLR 2025
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework arXiv Star - arXiv 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework arXiv Star - arXiv 2023
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework arXiv - Website arXiv 2023
DriveGAN: Towards a Controllable High-Quality Neural Simulation arXiv Star Website CVPR 2021

Architectures

UNet

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
MoVideo: Motion-Aware Video Generation with Diffusion Models arXiv - Website ECCV 2025
ModelScope Text-to-Video Technical Report arXiv - - arXiv 2023
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation arXiv - Website arXiv 2023
MagicVideo: Efficient Video Generation with Latent Diffusion Models arXiv - Website arXiv 2022
Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths arXiv - - arXiv 2022
Imagen Video: High Definition Video Generation with Diffusion Models arXiv - - arXiv 2022
Video Diffusion Models arXiv Star Website arXiv 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding arXiv - - NeurIPS 2022
High-Resolution Image Synthesis with Latent Diffusion Models arXiv Star - CVPR 2022
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale arXiv - - ICLR 2021
U-Net: Convolutional Networks for Biomedical Image Segmentation arXiv Star - MICCAI 2015

Diffusion transformers

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Open-Sora: Democratizing Efficient Video Production for All arXiv Star - arXiv 2024
From Slow Bidirectional to Fast Causal Video Generators arXiv Star Website arXiv 2024
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers arXiv - Website arXiv 2024
GenTron: Diffusion Transformers for Image and Video Generation arXiv - Website CVPR 2024
VDT: General-purpose Video Diffusion Transformers via Mask Modeling arXiv Star - ICLR 2024
Text2Performer: Text-Driven Human Video Generation arXiv Star - ICCV 2023
Scalable Diffusion Models with Transformers arXiv - Website ICCV 2023
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers arXiv - - ICLR 2023
VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation - - - ICLR 2023
ViViT: A Video Vision Transformer arXiv - - ICCV 2021

VAE for latent space compression

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
HunyuanVideo: A Systematic Framework for Large Video Generative Models arXiv - - arXiv 2025
SkyReels V1: Human-Centric Video Foundation Model - Star Website arXiv 2025
Magic 1-for-1: Generating One Minute Video Clips Within One Minute arXiv Star Website arXiv 2025
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model arXiv Star Website arXiv 2025
Latte: Latent Diffusion Transformer for Video Generation arXiv Star Website TMLR 2025
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models Without Specific Tuning arXiv Star Website ICLR 2024
VideoPoet: A Large Language Model for Zero-Shot Video Generation arXiv - Website PMLR 2024
LaVie: Latent Video Encoding for Diffusion Models arXiv Star Website IJCV 2024
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation arXiv Star Website arXiv 2023
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation arXiv Star Website arXiv 2023
Make-A-Video: Text-to-Video Generation Without Text-Video Data arXiv Star Website ICLR 2023
Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions arXiv Star Website ICLR 2023
Denoising Diffusion Probabilistic Models arXiv Star - NeurIPS 2020

Text encoders

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Magic 1-for-1: Generating One Minute Video Clips Within One Minute arXiv Star Website arXiv 2025
SkyReels V1: Human-Centric Video Foundation Model arXiv Star - arXiv 2025
HunyuanVideo: A Systematic Framework for Large Video Generative Models arXiv - Website arXiv 2025
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model arXiv - Website arXiv 2025
Identity-Preserving Text-to-Video Generation by Frequency Decomposition arXiv Star Website arXiv 2024
An empirical study and analysis of text-to-image generation using large language model-powered textual representation arXiv - - arXiv 2024
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding arXiv - Website arXiv 2024
FIT: Flexible Vision Transformer for Diffusion Model arXiv Star Website arXiv 2024
CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer arXiv Star Website arXiv 2024
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis arXiv Star Website ICML 2024
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers arXiv Star Website arXiv 2024
Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis arXiv - Website arXiv 2024
Open-Sora: Democratizing Efficient Video Production for All arXiv Star Website arXiv 2024
Open-Sora-Plan arXiv Star Website arXiv 2024
SimDA: Simple Diffusion Adapter for Efficient Video Generation arXiv Star Website CVPR 2024
Latte: Latent Diffusion Transformer for Video Generation arXiv Star Website arXiv 2024
Flux arXiv Star Website arXiv 2023
Scalable Diffusion Models with Transformers arXiv Star Website ICCV 2023
All are Worth Words: A ViT Backbone for Diffusion Models arXiv Star Website CVPR 2023
Baichuan 2: Open Large-Scale Language Models arXiv Star Website arXiv 2023
LLaMA 2: Open Foundation and Fine-Tuned Chat Models arXiv Star Website arXiv 2023
LLaMA: Open and Efficient Foundation Language Models arXiv Star Website arXiv 2023
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models arXiv Star - TACL 2022
Imagen Video: High Definition Video Generation with Diffusion Models arXiv - Website arXiv 2022
Hierarchical Text-Conditional Image Generation with CLIP Latents arXiv - Website arXiv 2022
High-Resolution Image Synthesis with Latent Diffusion Models arXiv Star Website CVPR 2022
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models arXiv Star - arXiv 2021
GLM: General Language Model Pretraining with Autoregressive Blank Infilling arXiv Star Website arXiv 2021
Learning Transferable Visual Models From Natural Language Supervision arXiv Star Website ICML 2021
Zero-Shot Text-to-Image Generation arXiv - - ICML 2021
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer arXiv Star - JMLR 2020
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding arXiv Star Website NAACL 2019
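
Most text-conditioned video models above consume frozen text-encoder features (CLIP and/or T5 families) as cross-attention context. A minimal Hugging Face Transformers sketch of that extraction step; the checkpoints and pooling choices here are illustrative only, and each model in the table uses its own encoders:

```python
import torch
from transformers import AutoTokenizer, CLIPTextModel, T5EncoderModel

# Illustrative checkpoints; individual video models ship their own encoders.
clip_tok = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()
t5_tok = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-base").eval()

prompt = "a corgi surfing a wave at sunset, cinematic lighting"

with torch.no_grad():
    clip_in = clip_tok(prompt, padding="max_length", truncation=True, return_tensors="pt")
    clip_ctx = clip_enc(**clip_in).last_hidden_state   # (1, 77, 768) token features
    t5_in = t5_tok(prompt, padding="longest", return_tensors="pt")
    t5_ctx = t5_enc(**t5_in).last_hidden_state          # (1, L, d) token features

# A diffusion backbone attends over these sequences (often concatenated or fed
# to separate cross-attention streams) as its conditioning context.
```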

Implementation

Datasets

More datasets could be found on Pixabay, Mixkit, Pond5, Adobe Stock, Shutterstock, Getty, Coverr, Videvo, Depositphotos, Storyblocks, Dissolve, Freepik, Vimeo, and Envato. Also, there are some datasets at Midjourney V5.1 Cleaned Data, Unsplash-lite, AnimateBench, Pexels-400k, and LAION-AESTHETICS.

Title | arXiv | GitHub | Website | Conference & Year
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers arXiv Star Website CVPR 2024
VBench: Comprehensive Benchmark Suite for Video Generative Models arXiv Star Website CVPR 2024
InternVid: Learning Text-to-Video Generation from Web-scale Video-Text Data arXiv Star Website ICLR 2024
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions arXiv Star Website NeurIPS 2024
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models arXiv Star Website NeurIPS 2024
Vript: A Video Is Worth Thousands of Words arXiv Star - NeurIPS 2024
VideoCrafter2 arXiv Star Website arXiv 2024
Open-Sora: Democratizing Efficient Video Production for All arXiv Star Website arXiv 2024
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators arXiv Star Website ICCV 2023
Temporally Consistent Transformers for Video Generation arXiv Star Website ICML 2023
Bitstream-Corrupted Video Recovery: A Novel Benchmark Dataset and Method arXiv Star - NeurIPS 2023
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation arXiv Star - NeurIPS 2023
AIGCBench: Comprehensive evaluation of image-to-video content generated by AI arXiv Star Website TBench 2023
AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling arXiv Star - TIP 2023
Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation arXiv Star - arXiv 2023
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions arXiv Star - CVPR 2022
The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting arXiv Star - CVPR 2022
VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution arXiv - Website CVPR 2022
Learning Audio-Video Modalities from Image Captions arXiv - Website ECCV 2022
The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing arXiv Star - ECCV 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding arXiv Star Website NeurIPS 2022
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation arXiv - Website TMLR 2022
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning arXiv Star Website ICCV 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval arXiv Star Website ICCV 2021
MERLOT: Multimodal Neural Script Knowledge Models arXiv Star Website NeurIPS 2021
Learning Video Representations from Textual Web Supervision arXiv - - arXiv 2020
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips arXiv - Website ICCV 2019
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research arXiv - Website ICCV 2019
Towards Automatic Learning of Procedures from Web Instructional Videos arXiv - Website AAAI 2018
How2: A Large-scale Dataset for Multimodal Language Understanding arXiv Star Website arXiv 2018
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset arXiv Star - CVPR 2017
Localizing Moments in Video with Natural Language arXiv Star - ICCV 2017
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language - - Website CVPR 2016
ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding - - Website CVPR 2015
A Dataset for Movie Description arXiv - - CVPR 2015
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild arXiv - Website arXiv 2012

Training engineering

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
LLaVA-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models arXiv Star Website ICLR 2025
SAM 2: Segment Anything in Images and Videos arXiv - Website ICLR 2025
Motion Prompting: Controlling Video Generation with Motion Trajectories arXiv - Website CVPR 2025
SimDA: Simple Diffusion Adapter for Efficient Video Generation arXiv Star Website CVPR 2024
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors arXiv Star Website ECCV 2024
CogVLM2: Visual Language Models for Image and Video Understanding arXiv Star - arXiv 2024
CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer arXiv Star Website arXiv 2024
Open-Sora: Democratizing Efficient Video Production for All arXiv Star Website arXiv 2024
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding arXiv - Website arXiv 2024
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning arXiv Star Website arXiv 2024
Cogvideo: Large-scale pretraining for text-to-video generation via transformers arXiv Star - ICLR 2023
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets arXiv Star - arXiv 2023
LLaMA: Open and Efficient Foundation Language Models arXiv Star Website arXiv 2023
LLaMA 2: Open Foundation and Fine-Tuned Chat Models arXiv Star Website arXiv 2023
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning arXiv Star - NeurIPS 2022
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness arXiv Star - NeurIPS 2022
Visual Prompt Tuning arXiv Star - ECCV 2022
CoCa: Contrastive Captioners are Image-Text Foundation Models arXiv Star - TMLR 2022
ZeRO: memory optimizations toward training trillion parameter models arXiv - - Supercomputing 2020
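
Alongside the systems-level techniques listed above (FlashAttention, ZeRO, parameter-efficient adapters), mixed-precision training with loss scaling and gradient clipping is a routine part of the recipe. A minimal PyTorch sketch of the inner loop, with the model, data, and loss standing in as placeholders:

```python
import torch

# Placeholder denoiser; real models are UNets or DiTs. Assumes a CUDA device.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64)
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # loss scaling for fp16 stability

for step in range(100):
    x = torch.randn(8, 64, device="cuda")      # placeholder batch of latents
    target = torch.randn_like(x)                # placeholder noise target
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        pred = model(x)
        loss = torch.nn.functional.mse_loss(pred, target)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                  # so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```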

Evaluation metrics and benchmarking findings

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
VBench: Comprehensive Benchmark Suite for Video Generative Models arXiv Star Website CVPR 2024
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models arXiv Star Website CVPR 2024
STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models arXiv Star - ICLR 2024
TC-Bench: Benchmarking Temporal Compositionality in Conditional Video Generation arXiv Star - arXiv 2024
Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives arXiv Star - ICCV 2023
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation arXiv Star - NeurIPS 2023
AIGCBench: Comprehensive evaluation of image-to-video content generated by AI arXiv Star Website TBench 2023
Learning Transferable Visual Models From Natural Language Supervision arXiv Star Website ICML 2021
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow arXiv Star - ECCV 2020
FVD: A New Metric for Video Generation - - - ICLR Workshop 2019
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium arXiv Star - NeurIPS 2017
Blind video temporal consistency - - - TOG 2015
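
Several metrics above (FID for frames, FVD for clips) reduce to a Fréchet distance between Gaussian fits of real and generated feature statistics; only the feature extractor differs (Inception features for FID, an I3D-style video network for FVD). A minimal NumPy/SciPy sketch, assuming the features have already been extracted (the helper name is illustrative):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two feature sets of shape (N, D)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real  # sqrtm can return tiny imaginary parts numerically
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Example with random placeholder features (real usage: Inception/I3D embeddings).
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(512, 64)), rng.normal(size=(512, 64))))
```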

Industry models

Title | arXiv | GitHub | Website | Conference & Year
Magic 1-for-1: Generating One Minute Video Clips Within One Minute arXiv Star Website arXiv 2025
SkyReels V1: Human-Centric Video Foundation Model arXiv Star - arXiv 2025
Step-Video-T2V arXiv - Website arXiv 2024
HunyuanVideo arXiv - Website arXiv 2024
Sora - - Website 2024
STIV arXiv Star Website arXiv 2024
LTX-Video arXiv Star Website arXiv 2024
Allegro arXiv Star Website arXiv 2024
Jimeng arXiv - Website arXiv 2024
Mochi 1 arXiv - Website arXiv 2024
EasyAnimate arXiv Star Website arXiv 2024
Vidu - - Website 2024
VideoCrafter2 arXiv Star Website arXiv 2024
VideoCrafter1 arXiv Star Website arXiv 2023
Mira arXiv - Website arXiv 2024
Hailuo AI - - Website 2024
Lumiere arXiv - Website arXiv 2024
VideoPoet arXiv - Website arXiv 2023
LumaAI Ray 2 - - Website 2024
LumaAI Dream Machine - - Website 2023
Veo-2 - - Website 2024
Veo-1 - - Website 2023
Nova Real - - Website 2024
Wanx 2.1 - - Website 2024
Kling - - Website 2024
Show-1 arXiv Star Website NeurIPS 2023
MovieGen arXiv - Website arXiv 2024
Pika - - Website 2023
Vchitect-2.0 - - Website 2024
Optis arXiv Star Website NeurIPS 2023
VLogger arXiv Star Website ICCV 2023
Seine arXiv Star Website CVPR 2023
Lavie arXiv Star Website ICCV 2023
MiracleVision - - Website 2023
Phenaki arXiv Star Website ICLR 2024
W.A.L.T arXiv - Website arXiv 2024
Imagen video arXiv - Website 2022
GEN-3 Alpha - - Website 2024
GEN-2 - - Website 2023
GEN-1 - - Website 2022

Academia models

Title | arXiv | GitHub | Website | Conference & Year
RepVideo: Rethinking Cross-Layer Representation for Video Generation arXiv Star Website arXiv 2025
CausVid: Causality-Aware Video Generation with Slow-Fast Diffusion Models arXiv Star Website CVPR 2025
Open-Sora Plan: Open-Source Large Video Generation Model arXiv Star Website arXiv 2024
Open-Sora: Democratizing Efficient Video Production for All arXiv Star Website arXiv 2024
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis arXiv - Website arXiv 2024
SiT: Exploring Flow and Diffusion-Based Generative Models with Scalable Interpolant Transformers arXiv Star Website arXiv 2024
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning arXiv Star Website COLM 2024
AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning arXiv Star Website arXiv 2024
I4VGEN: Interactive Video Generation via Integrated Dynamic Control arXiv Star Website arXiv 2024
SimDA: Simple Diffusion Adapter for Efficient Text-to-Video Generation arXiv Star Website arXiv 2023
AnimateDiff-v2 arXiv Star Website ICLR 2024
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation arXiv Star Website arXiv 2023
VideoGen: A Reference-Guided Latent Diffusion Approach for High-Definition Text-to-Video Generation arXiv - Website arXiv 2023
Dysen-VDM: Diffusion Model with Dynamic Spatio-Temporal Fusion for Video Generation arXiv Star - arXiv 2023
HiGen: Hierarchical 3D Feature Generation for 3D-Aware Image Synthesis and Manipulation arXiv Star - arXiv 2023
ModelScope Text-to-Video Technical Report arXiv - Website arXiv 2023
InstructVideo: Instructing Video Diffusion Models with Human Feedback arXiv - Website CVPR 2024
VideoComposer: Compositional Video Synthesis with Motion Controllability arXiv Star Website NeurIPS 2023
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation arXiv - - CVPR 2023
MagViT-v2: Masked Generative Video Transformer arXiv Star - arXiv 2023
MagViT: Masked Generative Video Transformer arXiv Star - arXiv 2022
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation arXiv - Website arXiv 2023
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models arXiv Star Website CVPR 2023
Video Diffusion Models arXiv Star Website arXiv 2022
Make-A-Video: Text-to-Video Generation without Text-Video Data arXiv Star Website ICLR 2023
MagicVideo: Efficient Video Generation With Latent Diffusion Models arXiv - Website arXiv 2022
CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer arXiv Star Website arXiv 2024
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers arXiv Star Website ICLR 2023
VideoGPT: Video Generation using VQ-VAE and Transformers arXiv Star Website arXiv 2021

Applications

Conditions

Image condition

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer arXiv Star Website ICLR 2025
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control arXiv - Website ICLR 2025
DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance arXiv Star Website ICASSP 2025
EMO: Emote Portrait Alive-Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions arXiv - Website ECCV 2024
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models arXiv Star Website CVPR 2025
MagDiff: Multi-alignment Diffusion for High-Fidelity Video Generation and Editing arXiv Star Website ECCV 2024
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation arXiv Star Website TMLR 2024
I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models arXiv Star Website SIGGRAPH 2024
ID-Animator: Zero-Shot Identity-Preserving Human Video Generation arXiv Star Website arXiv 2024
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation arXiv - Website arXiv 2024
Generative Image Dynamics arXiv - Website CVPR 2024
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models arXiv Star Website CVPR 2024
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models arXiv - Website CVPR 2024
AtomoVideo: High Fidelity Image-to-Video Generation arXiv - Website arXiv 2024
Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling arXiv Star Website SIGGRAPH 2024
Seer: Language Instructed Video Prediction with Latent Diffusion Models arXiv Star Website ICLR 2024
AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance arXiv Star Website arXiv 2023
VideoBooth: Diffusion-based Video Generation with Image Prompts arXiv Star Website CVPR 2024
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models arXiv Star Website ECCV 2024
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors arXiv Star Website ECCV 2024
Adding Conditional Control to Text-to-Image Diffusion Models arXiv Star Website ICCV 2023
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets arXiv Star Website arXiv 2023
Make pixels dance: High-dynamic video generation arXiv - Website CVPR 2024
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models arXiv Star Website arXiv 2023
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation arXiv Star Website arXiv 2023
VDT: General-purpose Video Diffusion Transformers via Mask Modeling arXiv Star Website ICLR 2024
VideoComposer: Compositional Video Synthesis with Motion Controllability arXiv Star Website NeurIPS 2023
Conditional Image-to-Video Generation with Latent Flow Diffusion Models arXiv Star Website CVPR 2023

Spatial condition

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation arXiv - Website SIGGRAPH 2025
ObjCtrl-2.5D: Training-free Object Control with Camera Poses arXiv Star Website arXiv 2024
Motion Prompting: Controlling Video Generation with Motion Trajectories arXiv - Website CVPR 2025
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation arXiv Star Website ICLR 2025
MVideo: Motion Control for Enhanced Complex Action Video Generation arXiv - Website arXiv 2024
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control arXiv - Website arXiv 2024
Tora: Trajectory-oriented Diffusion Transformer for Video Generation arXiv Star Website CVPR 2025
Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion arXiv - Website SIGGRAPH 2024
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models arXiv Star Website arXiv 2024
MotionBooth: Motion-Aware Customized Text-to-Video Generation arXiv Star Website NeurIPS 2024
DragAnything: Motion Control for Anything using Entity Representation arXiv Star Website ECCV 2024
Boximator: Generating Rich and Controllable Motions for Video Synthesis arXiv - Website arXiv 2024
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling arXiv Star Website SIGGRAPH 2024
PEEKABOO: Interactive Video Generation via Masked-Diffusion arXiv Star Website CVPR 2024
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models arXiv - Website ECCV 2024
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory arXiv Star Website arXiv 2023

Camera parameter condition

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation arXiv - Website SIGGRAPH 2025
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation arXiv Star Website ICLR 2025
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers arXiv Star Website CVPR 2025
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention arXiv - Website arXiv 2024
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control arXiv - Website ICLR 2025
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation arXiv - Website arXiv 2024
MotionBooth: Motion-Aware Customized Text-to-Video Generation arXiv Star Website NeurIPS 2024
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control arXiv Star Website NeurIPS 2024
CameraCtrl: Enabling Camera Control for Text-to-Video Generation arXiv Star Website arXiv 2024
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion arXiv Star Website arXiv 2024
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation arXiv Star Website SIGGRAPH 2024

Audio condition

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models arXiv - Website arXiv 2025
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding arXiv Star Website ACM MM 2024
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time arXiv - Website NeurIPS 2024
EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars arXiv Star Website CVPR 2024
FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization arXiv - - CVPR 2024
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation arXiv Star - arXiv 2024
EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions arXiv Star Website ECCV 2024
Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video arXiv Star Website ICCV 2023
Audio-Driven Co-Speech Gesture Video Generation arXiv Star Website NeurIPS 2022
CCVS: Context-aware Controllable Video Synthesis arXiv Star Website NeurIPS 2021

High-level video condition

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation arXiv Star Website ICLR 2024
MotionClone: Training-Free Motion Cloning for Controllable Video Generation arXiv Star Website ICLR 2025
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models arXiv Star Website SIGGRAPH Asia 2024
ReVideo: Remake a Video with Motion and Content Control arXiv Star Website NeurIPS 2024
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding arXiv Star Website ACM MM 2024
AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks arXiv Star Website TMLR 2024
UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing arXiv Star Website arXiv 2024
VidToMe: Video Token Merging for Zero-Shot Video Editing arXiv Star Website CVPR 2024
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis arXiv - Website CVPR 2024
SAVE: Protagonist Diversification with Structure Agnostic Video Editing arXiv Star Website ECCV 2024
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models arXiv Star Website CVPR 2024
DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing arXiv - Website arXiv 2023
DragVideo: Interactive Drag-style Video Editing arXiv Star Website ECCV 2024
Drag-A-Video: Non-rigid Video Editing with Point-based Interaction arXiv Star Website arXiv 2023
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence arXiv Star Website CVPR 2024
A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing arXiv Star Website CVPR 2024
Motion-Conditioned Image Animation for Video Editing arXiv Star Website arXiv 2023
MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion arXiv Star Website ICML 2024
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation arXiv Star Website CVPR 2024
Consistent Video-to-Video Transfer Using Synthetic Dataset arXiv Star Website ICLR 2024
MotionDirector: Motion Customization of Text-to-Video Diffusion Models arXiv Star Website ECCV 2024
SimDA: Simple Diffusion Adapter for Efficient Video Generation arXiv Star Website CVPR 2024
MagicEdit: High-Fidelity and Temporally Coherent Video Editing arXiv Star Website arXiv 2023
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing arXiv Star Website CVPR 2024
StableVideo: Text-driven Consistency-aware Diffusion Video Editing arXiv Star Website ICCV 2023
VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet arXiv Star Website arXiv 2023
VideoComposer: Compositional Video Synthesis with Motion Controllability arXiv Star Website NeurIPS 2023
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation arXiv Star Website SIGGRAPH Asia 2023
Video Colorization with Pre-trained Text-to-Image Diffusion Models arXiv Star Website arXiv 2023
VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing arXiv - Website TMLR 2024
DisCo: Disentangled Control for Realistic Human Dance Generation arXiv Star Website CVPR 2024
Towards Consistent Video Editing with Text-to-Image Diffusion Models arXiv - - NeurIPS 2023
Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models arXiv - - -
ControlVideo: Training-free Controllable Text-to-Video Generation arXiv Star Website ICLR 2024
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions arXiv - - arXiv 2023
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos arXiv Star Website AAAI 2024
DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion arXiv Star Website ICCV 2023
Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models arXiv Star - IEEE Trans On Multimedia, 2023
Pix2Video: Video Editing using Image Diffusion arXiv Star Website ICCV 2023
Structure and Content-Guided Video Synthesis with Diffusion Models arXiv - Website ICCV 2023
Shape-aware Text-driven Layered Video Editing arXiv Star Website CVPR 2023
DPE: Disentanglement of Pose and Expression for General Video Portrait Editing arXiv Star Website CVPR 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation arXiv Star Website ICCV 2023
Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding arXiv Star Website CVPR 2023
Layered Neural Atlases for Consistent Video Editing arXiv Star Website SIGGRAPH Asia 2021

Other conditions

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Mind the Time: Temporally-Controlled Multi-Event Video Generation arXiv - Website CVPR 2025
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation arXiv Star Website ECCV 2024
MotionCraft: Physics-based Zero-Shot Video Generation arXiv Star Website AAAI 2025
VideoAgent: Long-form Video Understanding with Large Language Model as Agent arXiv Star Website ECCV 2024
Synthetic Generation of Face Videos with Plethysmograph Physiology arXiv - - CVPR 2022

Enhancement

Video denoising and deblurring

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Video restoration based on deep learning: a comprehensive survey - - - 2022

Video inpainting

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency arXiv Star Website AAAI 2025
UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts arXiv - - arXiv 2024
Semantically Consistent Video Inpainting with Conditional Diffusion Models arXiv - - arXiv 2024
AVID: Any-Length Video Inpainting with Diffusion Model arXiv Star Website CVPR 2024
Towards language-driven video inpainting via multimodal large language models arXiv Star Website CVPR 2024
Deep Learning-Based Image and Video Inpainting: A Survey arXiv - - arXiv 2024
Smartbrush: Text and shape guided object inpainting with diffusion model arXiv - - CVPR 2023
Repaint: Inpainting using denoising diffusion probabilistic models arXiv Star - CVPR 2022
Free-form video inpainting with 3d gated convolution and temporal patchgan arXiv Star - ICCV 2019
Generative image inpainting with contextual attention arXiv Star - CVPR 2018
Context encoders: Feature learning by inpainting arXiv Star Website CVPR 2016

Video interpolation and extrapolation/prediction

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Video interpolation with diffusion models arXiv - Website CVPR 2024
ToonCrafter: Generative Cartoon Interpolation arXiv Star Website TOG 2024
LDMVFI: Video Frame Interpolation with Latent Diffusion Models arXiv Star Website AAAI 2024
Tell me what happened: Unifying text-guided video completion via multimodal masked video generation arXiv Star Website CVPR 2023
MCVD - Masked Conditional Video Diffusion for Prediction arXiv Star Website NeurIPS 2022
Diffusion models for video prediction and infilling arXiv Star Website TMLR 2022

Video super-resolution

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
FreeScale: High-Resolution Video Generation with Cascaded Diffusion Models arXiv Star Website arXiv 2024
Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation arXiv Star Website ECCV 2024
VEnhancer: Generative Space-Time Enhancement for Video Generation arXiv Star Website arXiv 2024
Exploiting diffusion prior for real-world image super-resolution arXiv Star Website IJCV 2024
Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution arXiv - - CVPR 2024
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution arXiv Star Website CVPR 2024
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution arXiv - - WACV Workshop 2024

Combining multiple video enhancement tasks

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts arXiv - - arXiv 2024
VEnhancer: Generative Space-Time Enhancement for Video Generation arXiv Star Website arXiv 2024
MCVD - Masked Conditional Video Diffusion for Prediction arXiv Star Website NeurIPS 2022

Personalization

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Dynamic Concepts Personalization from Single Videos arXiv - Website SIGGRAPH 2025
VideoAlchemy: Open-set Personalization in Video Generation arXiv Star Website CVPR 2025
PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation arXiv Star Website arXiv 2024
Identity-Preserving Text-to-Video Generation by Frequency Decomposition arXiv Star Website CVPR 2025
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control arXiv - Website ICLR 2025
Still-Moving: Customized Video Generation without Customized Video Data arXiv Star Website ACM TOG 2024
MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation arXiv Star - CVPR 2024
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding arXiv Star Website ACM MM 2024
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time arXiv - Website NeurIPS 2024
FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization arXiv - - CVPR 2024
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions arXiv Star Website ECCV 2024
DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance arXiv Star Website ICASSP 2025
Audio-Driven Co-Speech Gesture Video Generation arXiv Star Website NeurIPS 2022

Consistency

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models arXiv Star Website CVPR 2025
How I Warped Your Noise: A Temporally-Correlated Noise Prior for Diffusion Models - - Website ICLR 2024
Tokenflow: Consistent diffusion features for consistent video editing arXiv Star Website ICLR 2024
Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv Star Website ICLR 2024
VideoBooth: Diffusion-based Video Generation with Image Prompts arXiv Star Website CVPR 2024
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models arXiv - Website CVPR 2024
EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars arXiv Star Website CVPR 2024
CAMEL: CAusal Motion Enhancement tailored for Lifting Text-driven Video Editing arXiv Star - CVPR 2024
VidToMe: Video Token Merging for Zero-Shot Video Editing arXiv Star Website CVPR 2024
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions arXiv Star Website ECCV 2024
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation arXiv Star Website NeurIPS 2024
Streetscapes: Large-scale consistent street view generation using autoregressive video diffusion arXiv - Website SIGGRAPH 2024
Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling arXiv Star Website SIGGRAPH 2024
Consisti2v: Enhancing visual consistency for image-to-video generation arXiv Star Website TMLR 2024
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text arXiv Star Website arXiv 2024
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation arXiv Star - arXiv 2024
Flexifilm: Long video generation with flexible conditions arXiv Star Website arXiv 2024
Towards Smooth Video Composition arXiv Star Website ICLR 2023
Conditional Image-to-Video Generation with Latent Flow Diffusion Models arXiv Star - CVPR 2023
MoStGAN-V: Video Generation with Temporal Motion Styles arXiv Star - CVPR 2023
MOSO: Decomposing MOtion, Scene and Object for Video Prediction arXiv Star - CVPR 2023
StableVideo: Text-driven Consistency-aware Diffusion Video Editing arXiv Star HuggingFace Demo ICCV 2023
Preserve your own correlation: A noise prior for video diffusion models arXiv - Website ICCV 2023
Scenescape: Text-driven consistent scene generation arXiv Star Website NeurIPS 2023
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER arXiv Star - NeurIPS 2023
VideoComposer: Compositional Video Synthesis with Motion Controllability arXiv Star Website NeurIPS 2023
DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models arXiv Star Website arXiv 2023
Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks arXiv Star Website ICLR 2022
StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 arXiv Star Website CVPR 2022

Long video

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences arXiv Star Website ICLR 2025
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling arXiv Star Website ICLR 2024
ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models arXiv Star - arXiv 2024
Video-Infinity: Distributed Long Video Generation arXiv Star - arXiv 2024
Progressive Autoregressive Video Diffusion Models arXiv Star Website arXiv 2024
MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling arXiv Star - arXiv 2024
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers arXiv Star - ICLR 2023
Towards Smooth Video Composition arXiv Star Website ICLR 2023
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation arXiv - Website ACL 2023
Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising arXiv Star Website arXiv 2023
StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 arXiv - Website CVPR 2022
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer arXiv Star Website ECCV 2022
Generating Long Videos of Dynamic Scenes arXiv Star Website NeurIPS 2022
Flexible Diffusion Modeling of Long Videos arXiv - Website NeurIPS 2022

3D-aware video diffusion

Training on 3D dataset

Papers are listed generally in reverse order of their publication timestamps.

Title | arXiv | GitHub | Website | Conference & Year
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation arXiv Star Website ICLR 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation arXiv - Website arXiv 2025
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation arXiv Star Website NeurIPS D&B 2024
Generating 3D-Consistent Videos from Unposed Internet Photos arXiv - Website CVPR 2025
Cavia: Camera-controllable multi-view video diffusion with view-integrated attention arXiv - Website arXiv 2024
RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos arXiv Star Website CVPR 2024
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models arXiv Star Website ECCV 2024
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis arXiv Star Website ECCV 2024
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation arXiv - Website -
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control arXiv Star Website NeurIPS 2024
Diffusion4D: Fast Spatial-temporal Consistent 4D generation via Video Diffusion Models arXiv Star Website NeurIPS 2024
V3D: Video Diffusion Models are Effective 3D Generators arXiv Star Website arXiv 2024
Sora Generates Videos with Stunning Geometrical Consistency arXiv Star Website arXiv 2024
IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation arXiv - Website ICML 2024
InternVid: Learning Text-to-Video Generation from Web-scale Video-Text Data arXiv Star Website ICLR 2024
DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision arXiv Star Website CVPR 2024
Stable video diffusion: Scaling latent video diffusion models to large datasets arXiv Star Website arXiv 2023
Objaverse-XL: A Universe of 10M+ 3D Objects arXiv Star Website NeurIPS D&B 2023
OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation arXiv Star Website CVPR 2023
MVImgNet: A Large-scale Dataset of Multi-view Images arXiv Star Website CVPR 2023
Objaverse: A Universe of Annotated 3D Objects arXiv Star Website CVPR 2023
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval arXiv Star Website ICCV 2021
Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items arXiv - Website ICRA 2022
Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction arXiv Star Website ICCV 2021
Stereo magnification: Learning view synthesis using multiplane images arXiv Star Website SIGGRAPH 2018

Architecture for 3D diffusion models

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency arXiv Star Website ICLR 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation arXiv - Website arXiv 2025
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control arXiv Star Website arXiv 2025
Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion arXiv Star Website arXiv 2025
Wonderland: Navigating 3D Scenes from a Single Image arXiv Star Website arXiv 2024
CamI2V: Camera-Controlled Image-to-Video Diffusion Model arXiv Star Website arXiv 2024
Generating 3D-Consistent Videos from Unposed Internet Photos arXiv - Website CVPR 2025
Cavia: Camera-controllable multi-view video diffusion with view-integrated attention arXiv - Website arXiv 2024
ControlDreamer: Blending Geometry and Style in Text-to-3D arXiv Star Website BMVC 2024
UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation arXiv - Website ECCV 2024
Vivid-ZOO: Multi-View Video Generation with Diffusion Model arXiv Star Website NeurIPS 2024
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation arXiv - Website arXiv 2024
CAT3D: Create Anything in 3D with Multi-View Diffusion Models arXiv - Website NeurIPS 2024
MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction arXiv - Website ECCV 2024
MVDream: Multi-view Diffusion for 3D Generation arXiv Star Website ICLR 2024
MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation arXiv Star Website CVPR 2024
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion arXiv Star Website CVPR 2024
Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation arXiv Star Website ECCV 2024
SPAD: Spatially Aware Multi-View Diffusers arXiv Star Website CVPR 2024
MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion arXiv Star Website NeurIPS 2023
ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion arXiv - Website CVPR 2024

Camera conditioning

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Motion Prompting: Controlling Video Generation with Motion Trajectories arXiv - Website CVPR 2025
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control arXiv Star Website ICLR 2025
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers arXiv Star Website CVPR 2025
CameraCtrl: Enabling Camera Control for Text-to-Video Generation arXiv Star Website ICLR 2025
I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength arXiv Star Website ICLR 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation arXiv - Website arXiv 2025
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control arXiv Star Website arXiv 2025
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models arXiv - Website arXiv 2024
Wonderland: Navigating 3D Scenes from a Single Image arXiv Star Website arXiv 2024
CamI2V: Camera-Controlled Image-to-Video Diffusion Model arXiv Star Website arXiv 2024
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation arXiv Star Website NeurIPS D&B 2024
Cavia: Camera-controllable multi-view video diffusion with view-integrated attention arXiv - Website arXiv 2024
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models arXiv Star Website ACM MM 2024
MyGo: Consistent and Controllable Multi-View Driving Video Generation with Camera Control arXiv - Website arXiv 2024
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation arXiv Star Website SIGGRAPH 2024
Controlling Space and Time with Diffusion Models arXiv - Website ICLR 2025
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis arXiv Star Website ECCV 2024
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation arXiv - Website arXiv 2024
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control arXiv Star Website NeurIPS 2024
CAT3D: Create Anything in 3D with Multi-View Diffusion Models arXiv - Website NeurIPS 2024
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion arXiv Star Website SIGGRAPH 2024
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image Using Latent Video Diffusion arXiv - Website ECCV 2024
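
Several of the camera-conditioning works listed above (e.g., CameraCtrl and CamCo) feed the denoiser per-pixel Plücker ray embeddings computed from camera intrinsics and extrinsics. The NumPy sketch below shows one common way to build such an embedding; the function name, argument layout, and resolution handling are illustrative assumptions rather than any specific paper's implementation.

```python
import numpy as np

def plucker_embedding(K, c2w, H, W):
    """Illustrative per-pixel Plucker ray embedding of shape (H, W, 6).

    K   : (3, 3) camera intrinsics
    c2w : (4, 4) camera-to-world extrinsics
    H, W: target feature resolution
    """
    # Pixel grid in homogeneous image coordinates (sampled at pixel centers).
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)

    # Back-project pixels to ray directions in camera space, then rotate to world space.
    dirs_cam = pix @ np.linalg.inv(K).T                         # (H, W, 3)
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Ray origin is the camera center; the Plucker moment is origin x direction.
    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    moment = np.cross(origin, dirs_world)

    # Concatenate direction and moment into a 6-channel per-pixel embedding.
    return np.concatenate([dirs_world, moment], axis=-1)        # (H, W, 6)
```

In practice such an embedding is typically downsampled to the latent resolution and injected into the denoiser (e.g., concatenated to the latent channels or passed through a small encoder); the exact injection point varies across the papers above.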

Inference-time tricks

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer arXiv Star Website ICLR 2025
Training-free Camera Control for Video Generation arXiv - Website ICLR 2025
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models arXiv Star Website CVPR 2024

Benefits to other domains

Video representation learning

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Deep video representation learning: a survey arXiv - - Multimedia Tools and Applications 2024
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval arXiv Star - CVPR 2024
T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition arXiv Star - NeurIPS 2024
M²Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation arXiv Star Website arXiv 2024
Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment - Star - CVPR Workshops 2023
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training arXiv - - ICCV 2023
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models arXiv - - ICCV Workshop 2023
DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training arXiv Star - ICLR 2023
Visual Consensus Modeling for Video-Text Retrieval - - - AAAI 2022
End-to-End Referring Video Object Segmentation with Multimodal Transformers arXiv Star - CVPR 2022
Language as Queries for Referring Video Object Segmentation arXiv Star - CVPR 2022
Align and Prompt: Video-and-Language Pre-training with Entity Prompts arXiv Star - CVPR 2022
Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation - - - CVPR 2022
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval arXiv Star - SIGIR 2022
GL-RG: Global-Local Representation Granularity for Video Captioning arXiv Star - IJCAI 2022
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling arXiv Star - CVPR 2021
Learning Transferable Visual Models From Natural Language Supervision arXiv Star - ICML 2021
Multi-modal Transformer for Video Retrieval arXiv Star Website ECCV 2020
URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark - Star - ECCV 2020
Asymmetric 3D Convolutional Neural Networks for action recognition - - - Pattern Recognition 2019
Skeleton-Based Human Action Recognition With Global Context-Aware Attention LSTM Networks arXiv - - TIP 2018
Deep Sequential Context Networks for Action Prediction - - - CVPR 2017
Attention Is All You Need arXiv - - NeurIPS 2017
Deep Residual Learning for Image Recognition arXiv Star - CVPR 2016
Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition arXiv Star - ECCV 2016
Learning Spatiotemporal Features with 3D Convolutional Networks arXiv - - ICCV 2015
ImageNet Classification with Deep Convolutional Neural Networks - - - NeurIPS 2012
Long Short-Term Memory - - - Neural Computation 1997
Neural Networks and Physical Systems with Emergent Collective Computational Abilities - - - PNAS 1982

Video retrieval

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Towards Retrieval Augmented Generation over Large Video Libraries arXiv - - IEEE HSI 2024
Video Enriched Retrieval Augmented Generation Using Aligned Video Captions arXiv Star - SIGIR Workshop 2024
GenTron: Diffusion Transformers for Image and Video Generation arXiv - Website CVPR 2024
iRAG: Advancing RAG for Videos with an Incremental Approach arXiv - - CIKM 2024
Img2Loc: Revisiting Image Geolocalization using Multi-modality Foundation Models and Image-based Retrieval-Augmented Generation arXiv Star - SIGIR 2024
Multimodal Federated Learning via Contrastive Representation Ensemble arXiv Star - ICLR 2023
SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant Learning arXiv - - ICCV 2023
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation arXiv - - arXiv 2023
Visual Consensus Modeling for Video-Text Retrieval - - - AAAI 2022
FLAVA: A Foundational Language And Vision Alignment Model arXiv Star Website CVPR 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval arXiv Star Website CVPR 2022
Exposing the Limits of Video-Text Models through Contrast Sets - Star - NAACL 2022
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval arXiv Star - SIGIR 2022
Boosting Video-Text Retrieval with Explicit High-Level Semantics arXiv - - ACM MM 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval arXiv Star - ACM MM 2022
Cross-Modal Discrete Representation Learning arXiv - - ACL 2022
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval arXiv Star - ECCV 2022
CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning arXiv Star - Neurocomputing 2022
Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision - - - TMM 2022
Deep Unified Cross-Modality Hashing by Pairwise Data Alignment - - - IJCAI 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling arXiv Star - CVPR 2021
MDMMT: Multidomain Multimodal Transformer for Video Retrieval arXiv Star - CVPR Workshop 2021
TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval arXiv Star Website ICCV 2021
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss arXiv - - arXiv 2021
CLIP2TV: Align, Match and Distill for Video-Text Retrieval arXiv - - arXiv 2021
ActBERT: Learning Global-Local Video-Text Representations arXiv - - CVPR 2020
End-to-End Learning of Visual Representations from Uncurated Instructional Videos arXiv Star Website CVPR 2020
Multi-modal Transformer for Video Retrieval arXiv Star Website ECCV 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks arXiv Star - ECCV 2020
Searching Privately by Imperceptible Lying: A Novel Private Hashing Method with Differential Privacy - - - ACM MM 2020
Language Models are Few-Shot Learners arXiv - - NeurIPS 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning arXiv - - NeurIPS 2020
StyleGuide: Zero-Shot Sketch-Based Image Retrieval Using Style-Guided Image Generation - - - TMM 2020
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding arXiv - - NAACL 2019
Dual Encoding for Zero-Example Video Retrieval arXiv Star - CVPR 2019
Language Models are Unsupervised Multitask Learners - Star - OpenAI 2019
End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering arXiv - - CVPR 2017

Video QA and captioning

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding arXiv Star Website NeurIPS 2024
Video Question Answering: Datasets, Algorithms and Challenges arXiv Star - EMNLP 2022
Video Question Answering via Gradually Refined Attention over Appearance and Motion - Star - ACM MM 2017
Video Question Answering via Hierarchical Dual-Level Attention Network Learning - - - ACM MM 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering arXiv Star - CVPR 2017
Leveraging Video Descriptions to Learn Video Question Answering arXiv - - AAAI 2017

3D and 4D generation

Video diffusion for 3D generation

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
Wonderland: Navigating 3D Scenes from a Single Image arXiv Star Website arXiv 2024
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model arXiv Star Website arXiv 2024
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models arXiv Star Website ACM MM 2024
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models arXiv Star Website ECCV 2024
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image Using Latent Video Diffusion arXiv - Website ECCV 2024
V3D: Video Diffusion Models are Effective 3D Generators arXiv Star Website arXiv 2024
IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation arXiv - Website ICML 2024

Video diffusion for 4D generation

Papers are listed generally in reverse order of their publication timestamps.

Title
arXiv
GitHub
Website
Conference & Year
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency arXiv Star Website ICLR 2025
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models arXiv - Website arXiv 2024
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models arXiv - Website NeurIPS 2024
PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting arXiv - Website arXiv 2024
4Diffusion: Multi-view Video Diffusion Model for 4D Generation arXiv Star Website NeurIPS 2024
TC4D: Trajectory-Conditioned Text-to-4D Generation arXiv Star Website ECCV 2024
Animate3D: Animating Any 3D Model with Multi-view Video Diffusion arXiv Star Website NeurIPS 2024
Vivid-ZOO: Multi-View Video Generation with Diffusion Model arXiv Star Website NeurIPS 2024
DreamGaussian4D: Generative 4D Gaussian Splatting arXiv Star Website arXiv 2024
EG4D: Explicit Generation of 4D Object without Score Distillation arXiv Star Website ICLR 2025
Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels arXiv Star Website NeurIPS 2024
Diffusion4D: Fast Spatial-temporal Consistent 4D generation via Video Diffusion Models arXiv Star Website NeurIPS 2024
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling arXiv Star Website CVPR 2024
Dream-in-4D: A Unified Approach for Text-and Image-Guided 4D Scene Generation arXiv Star Website CVPR 2024
STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians arXiv Star Website ECCV 2024
VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model arXiv Star Website arXiv 2024
Animate124: Animating One Image to 4D Dynamic Scene arXiv Star Website arXiv 2024
Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models arXiv - Website CVPR 2024
Text-to-4D Dynamic Scene Generation arXiv - Website ICML 2023

Citation

If you find our survey useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@misc{wang2025surveyvideodiffusionmodels,
      title={Survey of Video Diffusion Models: Foundations, Implementations, and Applications}, 
      author={Yimu Wang and Xuye Liu and Wei Pang and Li Ma and Shuai Yuan and Paul Debevec and Ning Yu},
      year={2025},
      eprint={2504.16081},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.16081}, 
}

Acknowledgement

The format of this repo is based on Awesome-Video-Diffusion-Models.
