Yimu Wang1,*, Xuye Liu1,*, Wei Pang1,*, Li Ma3,*, Shuai Yuan2,*, Paul Debevec3, Ning Yu3,†
1University of Waterloo, 2Duke University, 3Netflix Eyeline Studios
*Contributed Equally, †Corresponding Author
In this survey and GitHub repository, we provide a comprehensive overview of recent advances in video diffusion models. We cover the foundations of video generative modeling, including GANs, autoregressive models, and diffusion models, as well as the learning foundations: classic denoising diffusion models, flow matching, and training-free methods. We also examine the main architectures, namely the UNet and the diffusion transformer. We then discuss applications of video diffusion models, including video generation, enhancement, personalization, and 3D-aware video generation. Finally, we highlight the benefits that video diffusion models bring to other domains, such as video representation learning and video retrieval.
Moreover, to facilitate understanding of video diffusion models, we provide a cheatsheet covering commonly used training datasets, training engineering techniques, and evaluation metrics, along with a list of video diffusion models from academia and industry.
- Foundations
- Implementation
- Applications
- Benefits to other domains
- Citation
- Acknowledgement
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | | | | ICLR 2023
Single Image Video Prediction with Auto-Regressive GANs | - | - | | Sensors 2022
HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator | - | | | ICIP 2022
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | | | | ECCV 2022
VideoGPT: Video Generation using VQ-VAE and Transformers | | | | arXiv 2021
Latent Video Transformer | - | | | arXiv 2020
Parallel Multiscale Autoregressive Density Estimation | - | - | | ICML 2017
Video Pixel Networks | - | - | | ICML 2017
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
CCEdit: Creative and Controllable Video Editing via Diffusion Models | - | - | | arXiv 2024
Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | | | | CVPR 2023
Make-A-Video: Text-to-Video Generation without Text-Video Data | | | | arXiv 2022
MagicVideo: Efficient Video Generation with Latent Diffusion Models | - | | | arXiv 2022
Imagen Video: High Definition Video Generation with Diffusion Models | - | - | | arXiv 2022
Video Diffusion Models | | | | arXiv 2022
Cascaded Diffusion Models for High Fidelity Image Generation | - | | | JMLR 2022
High-Resolution Image Synthesis with Latent Diffusion Models | - | | | CVPR 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
From Slow Bidirectional to Fast Causal Video Generators | | | | arXiv 2024
Progressive Autoregressive Video Diffusion Models | | | | arXiv 2024
Pyramidal Flow Matching for Efficient Video Generative Modeling | | | | arXiv 2024
ART·V: Auto-Regressive Text-to-Video Generation with Diffusion Models | | | | CVPR 2024
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Elucidating the Design Space of Diffusion-Based Generative Models | - | | | NeurIPS 2022
Denoising Diffusion Implicit Models | - | | | ICLR 2021
Improved Denoising Diffusion Probabilistic Models | - | | | ICML 2021
Denoising Diffusion Probabilistic Models | | | | NeurIPS 2020
Deep Unsupervised Learning using Nonequilibrium Thermodynamics | - | | | ICML 2015
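As a quick companion to the denoising diffusion papers above, the sketch below shows the DDPM forward (noising) process and its simplified training objective in plain NumPy (Ho et al., 2020). The schedule defaults and function names are illustrative, not taken from any specific codebase.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule (illustrative defaults from Ho et al., 2020).
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alphas_cumprod, noise):
    # Closed-form forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise

def ddpm_loss(pred_noise, true_noise):
    # Simplified objective: MSE between predicted and true noise.
    return np.mean((pred_noise - true_noise) ** 2)

betas = linear_beta_schedule()
alphas_cumprod = np.cumprod(1.0 - betas)   # abar_t, strictly decreasing in t

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))           # toy "clean" latent
noise = rng.standard_normal(x0.shape)
xt = q_sample(x0, t=500, alphas_cumprod=alphas_cumprod, noise=noise)
```

In practice the noise predictor is a UNet or transformer conditioned on t; here the loss is written directly against the noise target to show its shape.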
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | | | | NeurIPS 2023
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion | - | - | | ICASSP 2023
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | - | - | | ICLR 2024
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions | - | | | arXiv 2023
Flow Matching for Generative Modeling | - | - | | arXiv 2022
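For reference, the conditional flow matching objective used by several of the papers above can be sketched in a few lines: sample a point on a straight path between noise x0 and data x1, and regress the model's velocity onto the constant target x1 − x0. This is the rectified-flow / linear-interpolant special case only; function names are illustrative.

```python
import numpy as np

def fm_pair(x0, x1, t):
    # Linear interpolation path x_t = (1 - t) * x0 + t * x1; the conditional
    # flow matching target velocity along this path is the constant x1 - x0.
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

def fm_loss(v_pred, v_target):
    # Regress the model's predicted velocity onto the target velocity.
    return np.mean((v_pred - v_target) ** 2)
```

At t = 0 the sample is pure noise and at t = 1 it is data; sampling integrates the learned velocity field from t = 0 to t = 1.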
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback | - | - | | arXiv 2024
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design | | | | arXiv 2024
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback | | | | NeurIPS 2024
InstructVideo: Instructing Video Diffusion Models with Human Feedback | | | | CVPR 2024
Click to Move: Controlling Video Generation with Sparse Motion | - | | | ICCV 2021
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Make an Image Move: Few-Shot Based Video Generation Guided by CLIP | - | - | - | ICPR 2025
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation | | | | arXiv 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation | | | | ICCV 2023
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control | - | | | arXiv 2024
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation | - | - | | arXiv 2023
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | | | | ICLR 2023
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
CFG++: Manifold-Constrained Classifier Free Guidance for Diffusion Models | | | | arXiv 2024
LLM-grounded Video Diffusion Models | | | | ICLR 2024
Exploring Compositional Visual Generation with Latent Classifier Guidance | - | - | | CVPRW 2023
Diffusion Models Beat GANs on Image Synthesis | - | - | | NeurIPS 2021
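The classifier-guidance papers above share a common update rule: shift the diffusion model's noise prediction by the gradient of a classifier's log-probability (Dhariwal & Nichol, 2021). A minimal sketch, with illustrative function and argument names:

```python
import numpy as np

def classifier_guided_eps(eps, grad_log_p, sqrt_one_minus_abar, scale=1.0):
    # Classifier guidance in eps-space (Dhariwal & Nichol, 2021):
    # eps_hat = eps - s * sqrt(1 - abar_t) * grad_x log p(y | x_t),
    # where grad_log_p is the classifier gradient w.r.t. the noisy sample x_t.
    return eps - scale * sqrt_one_minus_abar * grad_log_p
```

Larger scales push samples toward regions the classifier labels as class y, at the cost of diversity.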
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Classifier-Free Diffusion Guidance | | | | arXiv 2022
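The core of classifier-free guidance is a one-line extrapolation between the unconditional and conditional noise predictions (Ho & Salimans). A minimal sketch:

```python
def cfg_eps(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the conditional one; scale 0 is unconditional, 1 is conditional,
    # and values > 1 over-emphasize the condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At sampling time the model is evaluated twice per step (with and without the condition) and the two predictions are combined with this rule.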
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Latte: Latent Diffusion Transformer for Video Generation | | | | TMLR 2025
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation | | | | arXiv 2023
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | - | - | | ICCV 2023
ModelScope Text-to-Video Technical Report | - | | | arXiv 2023
Structure and Content-Guided Video Synthesis with Diffusion Models | - | - | | ICCV 2023
Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | | | | CVPR 2023
Text-To-4D Dynamic Scene Generation | - | | | arXiv 2023
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Rethinking the Noise Schedule of Diffusion-Based Generative Models | - | - | | ICLR 2024
On the Importance of Noise Scheduling for Diffusion Models | - | - | | arXiv 2023
simple diffusion: End-to-end diffusion for high resolution images | - | | | ICML 2023
Elucidating the Design Space of Diffusion-Based Generative Models | - | | | NeurIPS 2022
Improved Denoising Diffusion Probabilistic Models | - | | | ICML 2021
Denoising Diffusion Probabilistic Models | | | | NeurIPS 2020
Score-Based Generative Modeling through Stochastic Differential Equations | - | | | ICLR 2021
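As a concrete example of the schedules studied above, the cosine schedule of Nichol & Dhariwal (2021) can be written directly in terms of the cumulative product of alphas:

```python
import numpy as np

def cosine_alphas_cumprod(T=1000, s=0.008):
    # Cosine schedule (Nichol & Dhariwal, 2021): abar_t = f(t) / f(0) with
    # f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2); s is a small offset
    # that keeps the first steps from being too noisy.
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]

abar = cosine_alphas_cumprod()
# Recover per-step betas; clip to avoid a singular final step where abar -> 0.
betas = np.clip(1.0 - abar[1:] / abar[:-1], 0.0, 0.999)
```

Compared with the linear schedule, abar decays more slowly early on and more sharply near the end, which the papers above connect to better sample quality at high resolution.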
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
VideoAgent: Self-Improving Video Generation | | | | ICLR 2025
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework | - | | | arXiv 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | - | | | arXiv 2023
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework | - | | | arXiv 2023
DriveGAN: Towards a Controllable High-Quality Neural Simulation | | | | CVPR 2021
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
MoVideo: Motion-Aware Video Generation with Diffusion Models | - | | | ECCV 2024
ModelScope Text-to-Video Technical Report | - | - | | arXiv 2023
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation | - | | | arXiv 2023
MagicVideo: Efficient Video Generation with Latent Diffusion Models | - | | | arXiv 2022
Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths | - | - | | arXiv 2022
Imagen Video: High Definition Video Generation with Diffusion Models | - | - | | arXiv 2022
Video Diffusion Models | | | | arXiv 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | - | - | | NeurIPS 2022
High-Resolution Image Synthesis with Latent Diffusion Models | - | | | CVPR 2022
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | - | - | | ICLR 2021
U-Net: Convolutional Networks for Biomedical Image Segmentation | - | | | MICCAI 2015
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Open-Sora: Democratizing Efficient Video Production for All | - | | | arXiv 2024
From Slow Bidirectional to Fast Causal Video Generators | | | | arXiv 2024
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | - | | | arXiv 2024
GenTron: Diffusion Transformers for Image and Video Generation | - | | | CVPR 2024
VDT: General-purpose Video Diffusion Transformers via Mask Modeling | - | | | ICLR 2024
Text2Performer: Text-Driven Human Video Generation | - | | | ICCV 2023
Scalable Diffusion Models with Transformers | - | | | ICCV 2023
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | - | - | | ICLR 2023
VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation | - | - | - | ICLR 2024
ViViT: A Video Vision Transformer | - | - | | ICCV 2021
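Diffusion transformers like those above operate on sequences of spatiotemporal patch tokens rather than convolutional feature maps. The sketch below shows the patchify step on a toy video latent; the patch sizes and function name are illustrative, not tied to any specific model.

```python
import numpy as np

def patchify_video(latent, pt=1, ph=2, pw=2):
    # Flatten a video latent of shape (T, H, W, C) into a sequence of
    # spacetime patch tokens, each covering pt frames and a ph x pw window.
    # Returns an array of shape (num_tokens, pt * ph * pw * C).
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch dims together
    return x.reshape(-1, pt * ph * pw * C)

tokens = patchify_video(np.zeros((8, 32, 32, 4)))   # toy latent
```

With pt=1, ph=pw=2 on an (8, 32, 32, 4) latent this yields 8 × 16 × 16 = 2048 tokens of dimension 16; the transformer then attends over this sequence, optionally factorizing attention into spatial and temporal passes.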
More datasets can be found on Pixabay, Mixkit, Pond5, Adobe Stock, Shutterstock, Getty, Coverr, Videvo, Depositphotos, Storyblocks, Dissolve, Freepik, Vimeo, and Envato. Additional datasets are available at Midjourney V5.1 Cleaned Data, Unsplash-lite, AnimateBench, Pexels-400k, and LAION-AESTHETICS.
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Magic 1-for-1: Generating one minute video clips within one minute | | | | arXiv 2025
SkyReels V1: Human-Centric Video Foundation Model | - | | | arXiv 2025
Step-Video-T2V | - | | | arXiv 2025
HunyuanVideo | - | | | arXiv 2024
Sora | - | - | | 2024
STIV | | | | arXiv 2024
LTX-Video | | | | arXiv 2024
Allegro | | | | arXiv 2024
Jimeng | - | | | arXiv 2024
Mochi 1 | - | | | arXiv 2024
EasyAnimate | | | | arXiv 2024
Vidu | - | - | | 2024
VideoCrafter2 | | | | arXiv 2024
VideoCrafter1 | | | | arXiv 2023
Mira | - | | | arXiv 2024
Hailuo AI | - | - | | 2024
Lumiere | - | | | arXiv 2024
VideoPoet | - | | | arXiv 2023
LumaAI Ray 2 | - | - | | 2025
LumaAI Dream Machine | - | - | | 2024
Veo-2 | - | - | | 2024
Veo-1 | - | - | | 2024
Nova Reel | - | - | | 2024
Wanx 2.1 | - | - | | 2024
Kling | - | - | | 2024
Show-1 | | | | arXiv 2023
MovieGen | - | | | arXiv 2024
Pika | - | - | | 2023
Vchitect-2.0 | - | - | | 2024
Optis | | | | NeurIPS 2023
VLogger | | | | ICCV 2023
SEINE | | | | ICLR 2024
LaVie | | | | arXiv 2023
MiracleVision | - | - | | 2023
Phenaki | | | | ICLR 2023
W.A.L.T | - | | | arXiv 2023
Imagen Video | - | | | 2022
GEN-3 Alpha | - | - | | 2024
GEN-2 | - | - | | 2023
GEN-1 | - | - | | 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Mind the Time: Temporally-Controlled Multi-Event Video Generation | - | | | CVPR 2025
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | | | | ECCV 2024
MotionCraft: Physics-based Zero-Shot Video Generation | | | | AAAI 2025
VideoAgent: Long-form Video Understanding with Large Language Model as Agent | | | | ECCV 2024
Synthetic Generation of Face Videos with Plethysmograph Physiology | - | - | | CVPR 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Video Restoration Based on Deep Learning: A Comprehensive Survey | - | - | - | 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency | | | | AAAI 2025
UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts | - | - | | arXiv 2024
Semantically Consistent Video Inpainting with Conditional Diffusion Models | - | - | | arXiv 2024
AVID: Any-Length Video Inpainting with Diffusion Model | | | | CVPR 2024
Towards Language-Driven Video Inpainting via Multimodal Large Language Models | | | | CVPR 2024
Deep Learning-Based Image and Video Inpainting: A Survey | - | - | | arXiv 2024
SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model | - | - | | CVPR 2023
RePaint: Inpainting Using Denoising Diffusion Probabilistic Models | - | | | CVPR 2022
Free-Form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN | - | | | ICCV 2019
Generative Image Inpainting with Contextual Attention | - | | | CVPR 2018
Context Encoders: Feature Learning by Inpainting | | | | CVPR 2016
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Video Interpolation with Diffusion Models | - | | | CVPR 2024
ToonCrafter: Generative Cartoon Interpolation | | | | TOG 2024
LDMVFI: Video Frame Interpolation with Latent Diffusion Models | | | | AAAI 2024
Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation | | | | CVPR 2023
MCVD - Masked Conditional Video Diffusion for Prediction | | | | NeurIPS 2022
Diffusion Models for Video Prediction and Infilling | | | | TMLR 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts | - | - | | arXiv 2024
VEnhancer: Generative Space-Time Enhancement for Video Generation | | | | arXiv 2024
MCVD - Masked Conditional Video Diffusion for Prediction | | | | NeurIPS 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer | | | | ICLR 2025
Training-free Camera Control for Video Generation | - | | | ICLR 2025
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models | | | | CVPR 2024
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | | | | NeurIPS 2024
Video Question Answering: Datasets, Algorithms and Challenges | - | | | EMNLP 2022
Video Question Answering via Gradually Refined Attention over Appearance and Motion | - | - | | ACM MM 2017
Video Question Answering via Hierarchical Dual-Level Attention Network Learning | - | - | - | ACM MM 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | - | | | CVPR 2017
Leveraging Video Descriptions to Learn Video Question Answering | - | - | | AAAI 2017
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Wonderland: Navigating 3D Scenes from a Single Image | | | | arXiv 2024
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model | | | | arXiv 2024
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models | | | | ACM MM 2024
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models | | | | ECCV 2024
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image Using Latent Video Diffusion | - | | | ECCV 2024
V3D: Video Diffusion Models are Effective 3D Generators | | | | arXiv 2024
IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation | - | | | ICML 2024
If you find our survey useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
@misc{wang2025surveyvideodiffusionmodels,
title={Survey of Video Diffusion Models: Foundations, Implementations, and Applications},
author={Yimu Wang and Xuye Liu and Wei Pang and Li Ma and Shuai Yuan and Paul Debevec and Ning Yu},
year={2025},
eprint={2504.16081},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.16081},
}
The format of this repository is based on Awesome-Video-Diffusion-Models.