Yimu Wang1,*, Xuye Liu1,*, Wei Pang1,*, Li Ma3,*, Shuai Yuan2,*, Paul Debevec3, Ning Yu3,†
1University of Waterloo, 2Duke University, 3Netflix Eyeline Studios
*Contributed Equally, †Corresponding Author
In this survey and GitHub repository, we provide a comprehensive overview of recent advances in video diffusion models. We cover the foundations of video generative modeling, including GANs, autoregressive models, and diffusion models, as well as the learning foundations: classic denoising diffusion models, flow matching, and training-free methods. We also examine the main architectures, namely the UNet and the diffusion transformer. We then discuss applications of video diffusion models, including video generation, enhancement, personalization, and 3D-aware video generation. Finally, we highlight the benefits that video diffusion models bring to other domains, such as video representation learning and video retrieval.
Moreover, to facilitate understanding of video diffusion models, we provide a cheatsheet covering commonly used training datasets, training engineering techniques, and evaluation metrics, along with a list of video diffusion models from academia and industry.
- Foundations
- Implementation
- Applications
- Benefits to other domains
- Citation
- Acknowledgement
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | | | | ICLR 2023
Single Image Video Prediction with Auto-Regressive GANs | - | - | | Sensors 2022
HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator | - | | | ICIP 2022
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | | | | ECCV 2022
VideoGPT: Video Generation using VQ-VAE and Transformers | | | | arXiv 2021
Latent Video Transformer | - | | | arXiv 2020
Parallel Multiscale Autoregressive Density Estimation | - | - | | ICML 2017
Video Pixel Networks | - | - | | ICML 2017
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
CCEdit: Creative and Controllable Video Editing via Diffusion Models | - | - | | arXiv 2024
Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | | | | CVPR 2023
Make-A-Video: Text-to-Video Generation without Text-Video Data | | | | arXiv 2022
MagicVideo: Efficient Video Generation with Latent Diffusion Models | - | | | arXiv 2022
Imagen Video: High Definition Video Generation with Diffusion Models | - | - | | arXiv 2022
Video Diffusion Models | | | | arXiv 2022
Cascaded Diffusion Models for High Fidelity Image Generation | - | | | JMLR 2022
High-Resolution Image Synthesis with Latent Diffusion Models | - | | | CVPR 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
From Slow Bidirectional to Fast Causal Video Generators | | | | arXiv 2024
Progressive Autoregressive Video Diffusion Models | | | | arXiv 2024
Pyramidal Flow Matching for Efficient Video Generative Modeling | | | | arXiv 2024
ART·V: Auto-Regressive Text-to-Video Generation with Diffusion Models | | | | CVPR 2024
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Elucidating the Design Space of Diffusion-Based Generative Models | - | | | NeurIPS 2022
Denoising Diffusion Implicit Models | - | | | ICLR 2021
Improved Denoising Diffusion Probabilistic Models | - | | | ICML 2021
Denoising Diffusion Probabilistic Models | | | | NeurIPS 2020
Deep Unsupervised Learning using Nonequilibrium Thermodynamics | - | | | ICML 2015
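As a quick companion to the denoising diffusion papers above, the sketch below shows the DDPM forward (noising) process and its simplified training objective in plain NumPy (Ho et al., 2020). The schedule defaults and function names are illustrative, not taken from any specific codebase.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule (illustrative defaults from Ho et al., 2020).
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alphas_cumprod, noise):
    # Closed-form forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise

def ddpm_loss(pred_noise, true_noise):
    # Simplified objective: MSE between predicted and true noise.
    return np.mean((pred_noise - true_noise) ** 2)

betas = linear_beta_schedule()
alphas_cumprod = np.cumprod(1.0 - betas)   # abar_t, strictly decreasing in t

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))           # toy "clean" latent
noise = rng.standard_normal(x0.shape)
xt = q_sample(x0, t=500, alphas_cumprod=alphas_cumprod, noise=noise)
```

In practice the noise predictor is a UNet or transformer conditioned on t; here the loss is written directly against the noise target to show its shape.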
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | | | | NeurIPS 2023
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion | - | - | | ICASSP 2023
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation | - | - | | ICLR 2024
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions | - | | | arXiv 2023
Flow Matching for Generative Modeling | - | - | | arXiv 2022
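For reference, the conditional flow matching objective used by several of the papers above can be sketched in a few lines: sample a point on a straight path between noise x0 and data x1, and regress the model's velocity onto the constant target x1 − x0. This is the rectified-flow / linear-interpolant special case only; function names are illustrative.

```python
import numpy as np

def fm_pair(x0, x1, t):
    # Linear interpolation path x_t = (1 - t) * x0 + t * x1; the conditional
    # flow matching target velocity along this path is the constant x1 - x0.
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

def fm_loss(v_pred, v_target):
    # Regress the model's predicted velocity onto the target velocity.
    return np.mean((v_pred - v_target) ** 2)
```

At t = 0 the sample is pure noise and at t = 1 it is data; sampling integrates the learned velocity field from t = 0 to t = 1.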
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback | - | - | | arXiv 2024
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design | | | | arXiv 2024
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback | | | | NeurIPS 2024
InstructVideo: Instructing Video Diffusion Models with Human Feedback | | | | CVPR 2024
Click to Move: Controlling Video Generation with Sparse Motion | - | | | ICCV 2021
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Make an Image Move: Few-Shot Based Video Generation Guided by CLIP | - | - | - | ICPR 2025
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation | | | | arXiv 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation | | | | ICCV 2023
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control | - | | | arXiv 2024
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation | - | - | | arXiv 2023
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | | | | ICLR 2023
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
CFG++: Manifold-Constrained Classifier Free Guidance for Diffusion Models | | | | arXiv 2024
LLM-grounded Video Diffusion Models | | | | ICLR 2024
Exploring Compositional Visual Generation with Latent Classifier Guidance | - | - | | CVPRW 2023
Diffusion Models Beat GANs on Image Synthesis | - | - | | NeurIPS 2021
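The classifier-guidance papers above share a common update rule: shift the diffusion model's noise prediction by the gradient of a classifier's log-probability (Dhariwal & Nichol, 2021). A minimal sketch, with illustrative function and argument names:

```python
import numpy as np

def classifier_guided_eps(eps, grad_log_p, sqrt_one_minus_abar, scale=1.0):
    # Classifier guidance in eps-space (Dhariwal & Nichol, 2021):
    # eps_hat = eps - s * sqrt(1 - abar_t) * grad_x log p(y | x_t),
    # where grad_log_p is the classifier gradient w.r.t. the noisy sample x_t.
    return eps - scale * sqrt_one_minus_abar * grad_log_p
```

Larger scales push samples toward regions the classifier labels as class y, at the cost of diversity.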
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Classifier-Free Diffusion Guidance | | | | arXiv 2022
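The core of classifier-free guidance is a one-line extrapolation between the unconditional and conditional noise predictions (Ho & Salimans). A minimal sketch:

```python
def cfg_eps(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the conditional one; scale 0 is unconditional, 1 is conditional,
    # and values > 1 over-emphasize the condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At sampling time the model is evaluated twice per step (with and without the condition) and the two predictions are combined with this rule.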
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Latte: Latent Diffusion Transformer for Video Generation | | | | TMLR 2025
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation | | | | arXiv 2023
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | - | - | | ICCV 2023
ModelScope Text-to-Video Technical Report | - | | | arXiv 2023
Structure and Content-Guided Video Synthesis with Diffusion Models | - | - | | ICCV 2023
Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | | | | CVPR 2023
Text-To-4D Dynamic Scene Generation | - | | | arXiv 2023
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Rethinking the Noise Schedule of Diffusion-Based Generative Models | - | - | | ICLR 2024
On the Importance of Noise Scheduling for Diffusion Models | - | - | | arXiv 2023
simple diffusion: End-to-end diffusion for high resolution images | - | | | ICML 2023
Elucidating the Design Space of Diffusion-Based Generative Models | - | | | NeurIPS 2022
Improved Denoising Diffusion Probabilistic Models | - | | | ICML 2021
Denoising Diffusion Probabilistic Models | | | | NeurIPS 2020
Score-Based Generative Modeling through Stochastic Differential Equations | - | | | ICLR 2021
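As a concrete example of the schedules studied above, the cosine schedule of Nichol & Dhariwal (2021) can be written directly in terms of the cumulative product of alphas:

```python
import numpy as np

def cosine_alphas_cumprod(T=1000, s=0.008):
    # Cosine schedule (Nichol & Dhariwal, 2021): abar_t = f(t) / f(0) with
    # f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2); s is a small offset
    # that keeps the first steps from being too noisy.
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]

abar = cosine_alphas_cumprod()
# Recover per-step betas; clip to avoid a singular final step where abar -> 0.
betas = np.clip(1.0 - abar[1:] / abar[:-1], 0.0, 0.999)
```

Compared with the linear schedule, abar decays more slowly early on and more sharply near the end, which the papers above connect to better sample quality at high resolution.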
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
VideoAgent: Self-Improving Video Generation | | | | ICLR 2025
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework | - | | | arXiv 2024
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework | - | | | arXiv 2023
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework | - | | | arXiv 2023
DriveGAN: Towards a Controllable High-Quality Neural Simulation | | | | CVPR 2021
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
MoVideo: Motion-Aware Video Generation with Diffusion Models | - | | | ECCV 2024
ModelScope Text-to-Video Technical Report | - | - | | arXiv 2023
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation | - | | | arXiv 2023
MagicVideo: Efficient Video Generation with Latent Diffusion Models | - | | | arXiv 2022
Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths | - | - | | arXiv 2022
Imagen Video: High Definition Video Generation with Diffusion Models | - | - | | arXiv 2022
Video Diffusion Models | | | | arXiv 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | - | - | | NeurIPS 2022
High-Resolution Image Synthesis with Latent Diffusion Models | - | | | CVPR 2022
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | - | - | | ICLR 2021
U-Net: Convolutional Networks for Biomedical Image Segmentation | - | | | MICCAI 2015
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Open-Sora: Democratizing Efficient Video Production for All | - | | | arXiv 2024
From Slow Bidirectional to Fast Causal Video Generators | | | | arXiv 2024
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | - | | | arXiv 2024
GenTron: Diffusion Transformers for Image and Video Generation | - | | | CVPR 2024
VDT: General-purpose Video Diffusion Transformers via Mask Modeling | - | | | ICLR 2024
Text2Performer: Text-Driven Human Video Generation | - | | | ICCV 2023
Scalable Diffusion Models with Transformers | - | | | ICCV 2023
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | - | - | | ICLR 2023
VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation | - | - | - | ICLR 2024
ViViT: A Video Vision Transformer | - | - | | ICCV 2021
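Diffusion transformers like those above operate on sequences of spatiotemporal patch tokens rather than convolutional feature maps. The sketch below shows the patchify step on a toy video latent; the patch sizes and function name are illustrative, not tied to any specific model.

```python
import numpy as np

def patchify_video(latent, pt=1, ph=2, pw=2):
    # Flatten a video latent of shape (T, H, W, C) into a sequence of
    # spacetime patch tokens, each covering pt frames and a ph x pw window.
    # Returns an array of shape (num_tokens, pt * ph * pw * C).
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch dims together
    return x.reshape(-1, pt * ph * pw * C)

tokens = patchify_video(np.zeros((8, 32, 32, 4)))   # toy latent
```

With pt=1, ph=pw=2 on an (8, 32, 32, 4) latent this yields 8 × 16 × 16 = 2048 tokens of dimension 16; the transformer then attends over this sequence, optionally factorizing attention into spatial and temporal passes.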
More datasets can be found on Pixabay, Mixkit, Pond5, Adobe Stock, Shutterstock, Getty, Coverr, Videvo, Depositphotos, Storyblocks, Dissolve, Freepik, Vimeo, and Envato. Additional datasets are available at Midjourney V5.1 Cleaned Data, Unsplash-lite, AnimateBench, Pexels-400k, and LAION-AESTHETICS.
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Magic 1-for-1: Generating one minute video clips within one minute | | | | arXiv 2025
SkyReels V1: Human-Centric Video Foundation Model | - | | | arXiv 2025
Step-Video-T2V | - | | | arXiv 2025
HunyuanVideo | - | | | arXiv 2024
Sora | - | - | | 2024
STIV | | | | arXiv 2024
LTX-Video | | | | arXiv 2024
Allegro | | | | arXiv 2024
Jimeng | - | | | arXiv 2024
Mochi 1 | - | | | arXiv 2024
EasyAnimate | | | | arXiv 2024
Vidu | - | - | | 2024
VideoCrafter2 | | | | arXiv 2024
VideoCrafter1 | | | | arXiv 2023
Mira | - | | | arXiv 2024
Hailuo AI | - | - | | 2024
Lumiere | - | | | arXiv 2024
VideoPoet | - | | | arXiv 2023
LumaAI Ray 2 | - | - | | 2025
LumaAI Dream Machine | - | - | | 2024
Veo-2 | - | - | | 2024
Veo-1 | - | - | | 2024
Nova Reel | - | - | | 2024
Wanx 2.1 | - | - | | 2024
Kling | - | - | | 2024
Show-1 | | | | arXiv 2023
MovieGen | - | | | arXiv 2024
Pika | - | - | | 2023
Vchitect-2.0 | - | - | | 2024
Optis | | | | NeurIPS 2023
VLogger | | | | ICCV 2023
SEINE | | | | ICLR 2024
LaVie | | | | arXiv 2023
MiracleVision | - | - | | 2023
Phenaki | | | | ICLR 2023
W.A.L.T | - | | | arXiv 2023
Imagen Video | - | | | 2022
GEN-3 Alpha | - | - | | 2024
GEN-2 | - | - | | 2023
GEN-1 | - | - | | 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Mind the Time: Temporally-Controlled Multi-Event Video Generation | - | | | CVPR 2025
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | | | | ECCV 2024
MotionCraft: Physics-based Zero-Shot Video Generation | | | | AAAI 2025
VideoAgent: Long-form Video Understanding with Large Language Model as Agent | | | | ECCV 2024
Synthetic Generation of Face Videos with Plethysmograph Physiology | - | - | | CVPR 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Video Restoration Based on Deep Learning: A Comprehensive Survey | - | - | - | 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency | | | | AAAI 2025
UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts | - | - | | arXiv 2024
Semantically Consistent Video Inpainting with Conditional Diffusion Models | - | - | | arXiv 2024
AVID: Any-Length Video Inpainting with Diffusion Model | | | | CVPR 2024
Towards Language-Driven Video Inpainting via Multimodal Large Language Models | | | | CVPR 2024
Deep Learning-Based Image and Video Inpainting: A Survey | - | - | | arXiv 2024
SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model | - | - | | CVPR 2023
RePaint: Inpainting Using Denoising Diffusion Probabilistic Models | - | | | CVPR 2022
Free-Form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN | - | | | ICCV 2019
Generative Image Inpainting with Contextual Attention | - | | | CVPR 2018
Context Encoders: Feature Learning by Inpainting | | | | CVPR 2016
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Video Interpolation with Diffusion Models | - | | | CVPR 2024
ToonCrafter: Generative Cartoon Interpolation | | | | TOG 2024
LDMVFI: Video Frame Interpolation with Latent Diffusion Models | | | | AAAI 2024
Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation | | | | CVPR 2023
MCVD - Masked Conditional Video Diffusion for Prediction | | | | NeurIPS 2022
Diffusion Models for Video Prediction and Infilling | | | | TMLR 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts | - | - | | arXiv 2024
VEnhancer: Generative Space-Time Enhancement for Video Generation | | | | arXiv 2024
MCVD - Masked Conditional Video Diffusion for Prediction | | | | NeurIPS 2022
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer | | | | ICLR 2025
Training-free Camera Control for Video Generation | - | | | ICLR 2025
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models | | | | CVPR 2024
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | | | | NeurIPS 2024
Video Question Answering: Datasets, Algorithms and Challenges | - | | | EMNLP 2022
Video Question Answering via Gradually Refined Attention over Appearance and Motion | - | - | | ACM MM 2017
Video Question Answering via Hierarchical Dual-Level Attention Network Learning | - | - | - | ACM MM 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | - | | | CVPR 2017
Leveraging Video Descriptions to Learn Video Question Answering | - | - | | AAAI 2017
Papers are listed generally in reverse order of their publication timestamps.
Title | arXiv | GitHub | Website | Conference & Year
---|---|---|---|---
Wonderland: Navigating 3D Scenes from a Single Image | | | | arXiv 2024
ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model | | | | arXiv 2024
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models | | | | ACM MM 2024
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models | | | | ECCV 2024
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image Using Latent Video Diffusion | - | | | ECCV 2024
V3D: Video Diffusion Models are Effective 3D Generators | | | | arXiv 2024
IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation | - | | | ICML 2024
If you find our survey useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
@misc{wang2025surveyvideodiffusionmodels,
title={Survey of Video Diffusion Models: Foundations, Implementations, and Applications},
author={Yimu Wang and Xuye Liu and Wei Pang and Li Ma and Shuai Yuan and Paul Debevec and Ning Yu},
year={2025},
eprint={2504.16081},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.16081},
}
The format of this repository is based on Awesome-Video-Diffusion-Models.