📜 A Curated List of Amazing Works in World Modeling, spanning applications in Embodied AI, Autonomous Driving, Natural Language Processing and Agents.
Based on Awesome-World-Model-for-Autonomous-Driving and Awesome-World-Model-for-Robotics.
Photo Credit: Gemini-Nano-Banana🍌.
Major updates and announcements are shown below. Scroll for full timeline.
🚀 [2025-11] 1k+ Stars ⭐️ in Under 30 Days — 🌍 Awesome World Models reached 1k GitHub stars within 30 days of its initial release. Let's go!!!
🗺️ [2025-10] Enhanced Visual Navigation — Introduced a badge system for papers! All entries now display badges for quick access to resources.
🔥 [2025-10] Repository Launch — Awesome World Models is now live! We're building a comprehensive collection spanning Embodied AI, Autonomous Driving, NLP, and more. See CONTRIBUTING.md for how to contribute.
💡 [Ongoing] Community Contributions Welcome — Help us maintain the most up-to-date world models resource! Submit papers via PR or contact us at email.
⭐ [Ongoing] Support This Project — If you find this useful, please cite our work and give us a star. Share with your research community!
- 🎯 Aim of the project
- 📚 Definition of World Models
- 📖 Surveys of World Models
- 🎮 World Models for Game Simulation
- 🚗 World Models for Autonomous Driving
- 🤖 World Models for Embodied AI
- 🔬 World Models for Science
- 💭 Positions on World Models
- 📐 Theory & World Models Explainability
- 🛠️ General Approaches to World Models
- 📊 Evaluating World Models
- 🙏 Acknowledgements
- 📝 Citation
World Models have become a hot topic in both research and industry, attracting unprecedented attention from the AI community and beyond. However, due to the interdisciplinary nature of the field (and because the term "world model" simply sounds amazing), the concept has been used with varying definitions across different domains.
This repository aims to:
- 🔍 Organize the rapidly growing body of world model research across multiple application domains
- 🗺️ Provide a minimalist map of how world models are utilized in different fields (Embodied AI, Autonomous Driving, NLP, etc.)
- 🤝 Bridge the gap between different communities working on world models with varying perspectives
- 📚 Serve as a one-stop resource for researchers, practitioners, and enthusiasts interested in world modeling
- 🚀 Track the latest developments and breakthroughs in this exciting field
Whether you're a researcher looking for related work, a practitioner seeking implementation references, or simply curious about world models, we hope this curated list helps you navigate the landscape!
While the reach of world models keeps expanding, it is widely accepted that the concept traces back to these two works:
- [⭐️] World Models, "World Models" (Ha & Schmidhuber, 2018).
- [⭐️] Yann LeCun's Speech, "A Path Towards Autonomous Machine Intelligence".
Some other great blog posts and surveys on world models include:
- [⭐️] Towards Video World Models, "Towards Video World Models".
- Status of World Models in 2025, "Beyond the Hype: How I See World Models Evolving in 2025".
- [⭐️] Jim Fan's tweet.
- [⭐️] Is Sora a World Simulator, "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond".
- Physics Cognition in Video Generation, "Exploring the Evolution of Physics Cognition in Video Generation: A Survey".
- [⭐️] 3D and 4D World Modeling: A Survey, "3D and 4D World Modeling: A Survey".
- [⭐️] Understanding World or Predicting Future?, "Understanding World or Predicting Future? A Comprehensive Survey of World Models".
- From 2D to 3D Cognition, "From 2D to 3D Cognition: A Brief Survey of General World Models".
- [⭐️] World Models for Embodied AI, "A Comprehensive Survey on World Models for Embodied AI".
- World Models and Physical Simulation, "A Survey: Learning Embodied Intelligence from Physical Simulators and World Models".
- Embodied AI Agents: Modeling the World, "Embodied AI Agents: Modeling the World".
- Aligning Cyber Space with Physical World, "Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI".
- [⭐️] A Survey of World Models for Autonomous Driving, "A Survey of World Models for Autonomous Driving".
- World Models for Autonomous Driving: An Initial Survey, "World Models for Autonomous Driving: An Initial Survey".
- Interplay Between Video Generation and World Models in Autonomous Driving, "Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey".
- From Masks to Worlds, "From Masks to Worlds: A Hitchhiker's Guide to World Models".
- The Safety Challenge of World Models, "The Safety Challenge of World Models for Embodied AI Agents: A Review".
- World Models in AI: Like a Child, "World Models in Artificial Intelligence: Sensing, Learning, and Reasoning Like a Child".
- World Model Safety, "World Models: The Safety Perspective".
- Model-based reinforcement learning: "A survey on model-based reinforcement learning".
Pixel Space:
- [⭐️] GameNGen, "Diffusion Models Are Real-Time Game Engines".
- [⭐️] DIAMOND, "Diffusion for World Modeling: Visual Details Matter in Atari".
- MineWorld, "MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft".
- Oasis, "Oasis: A Universe in a Transformer".
- AnimeGamer, "AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction".
- [⭐️] Matrix-Game, "Matrix-Game: Interactive World Foundation Model."
- [⭐️] Matrix-Game 2.0, "Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model".
- RealPlay, "From Virtual Games to Real-World Play".
- GameFactory, "GameFactory: Creating New Games with Generative Interactive Videos".
- WORLDMEM, "Worldmem: Long-term Consistent World Simulation with Memory".
3D Mesh Space:
- [⭐️] HunyuanWorld 1.0, "HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels".
- [⭐️] Matrix-3D, "Matrix-3D: Omnidirectional Explorable 3D World Generation".
Refer to https://github.com/LMD0311/Awesome-World-Model for full list.
Note
📢 [Call for Maintenance] The repo creator is not an expert in autonomous driving, so this section is a concise, unclassified list of works. We welcome community effort to sort and organize it.
- [⭐️] Cosmos-Drive-Dreams, "Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models".
- [⭐️] GAIA-2, "GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving".
- Copilot4D, "Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion".
- OmniNWM: "OmniNWM: Omniscient Driving Navigation World Models".
- GAIA-1, "Introducing GAIA-1: A Cutting-Edge Generative AI Model for Autonomy".
- PWM, "From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction".
- Dream4Drive, "Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks".
- SparseWorld, "SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries".
- DriveVLA-W0: "DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving".
- "Enhancing Physical Consistency in Lightweight World Models".
- IRL-VLA: "IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model".
- LiDARCrafter: "LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences".
- FASTopoWM: "FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models".
- Orbis: "Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models".
- "World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving".
- NRSeg: "NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models".
- World4Drive: "World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model".
- Epona: "Epona: Autoregressive Diffusion World Model for Autonomous Driving".
- "Towards foundational LiDAR world models with efficient latent flow matching".
- SceneDiffuser++: "SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model".
- COME: "COME: Adding Scene-Centric Forecasting Control to Occupancy World Model".
- STAGE: "STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation".
- ReSim: "ReSim: Reliable World Simulation for Autonomous Driving".
- "Ego-centric Learning of Communicative World Models for Autonomous Driving".
- Dreamland: "Dreamland: Controllable World Creation with Simulator and Generative Models".
- LongDWM: "LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model".
- GeoDrive: "GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control".
- FutureSightDrive: "FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving".
- Raw2Drive: "Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)".
- VL-SAFE: "VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving".
- PosePilot: "PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth".
- "World Model-Based Learning for Long-Term Age of Information Minimization in Vehicular Networks".
- DriVerse: "DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment".
- "End-to-End Driving with Online Trajectory Evaluation via BEV World Model".
- "Knowledge Graphs as World Models for Semantic Material-Aware Obstacle Handling in Autonomous Vehicles".
- MiLA: "MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving".
- SimWorld: "SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model".
- UniFuture: "Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception".
- EOT-WM: "Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space".
- InDRiVE: "InDRiVE: Intrinsic Disagreement based Reinforcement for Vehicle Exploration through Curiosity Driven Generalized World Model".
- MaskGWM: "MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction".
- Dream to Drive: "Dream to Drive: Model-Based Vehicle Control Using Analytic World Models".
- "Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving".
- HERMES: "HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation".
- AdaWM: "AdaWM: Adaptive World Model based Planning for Autonomous Driving".
- AD-L-JEPA: "AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding Predictive Architecture for Autonomous Driving with LiDAR Data".
- DrivingWorld: "DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT".
- DrivingGPT: "DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers".
- "An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training".
- GEM: "GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control".
- GaussianWorld: "GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction".
- Doe-1: "Doe-1: Closed-Loop Autonomous Driving with Large World Model".
- InfiniCube: "InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models".
- InfinityDrive: "InfinityDrive: Breaking Time Limits in Driving World Models".
- ReconDreamer: "ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration".
- Imagine-2-Drive: "Imagine-2-Drive: High-Fidelity World Modeling in CARLA for Autonomous Vehicles".
- DynamicCity: "DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes".
- DriveDreamer4D: "World Models Are Effective Data Machines for 4D Driving Scene Representation".
- DOME: "Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model".
- SSR: "Does End-to-End Autonomous Driving Really Need Perception Tasks?".
- "Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models".
- LatentDriver: "Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving".
- OccLLaMA: "An Occupancy-Language-Action Generative World Model for Autonomous Driving".
- DriveGenVLM: "Real-world Video Generation for Vision Language Model based Autonomous Driving".
- Drive-OccWorld: "Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving".
- CarFormer: "Self-Driving with Learned Object-Centric Representations".
- BEVWorld: "A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space".
- TOKEN: "Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving".
- UMAD: "Unsupervised Mask-Level Anomaly Detection for Autonomous Driving".
- AdaptiveDriver: "Planning with Adaptive World Models for Autonomous Driving".
- UnO: "Unsupervised Occupancy Fields for Perception and Forecasting".
- LAW: "Enhancing End-to-End Autonomous Driving with Latent World Model".
- Delphi: "Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation".
- OccSora: "4D Occupancy Generation Models as World Simulators for Autonomous Driving".
- MagicDrive3D: "Controllable 3D Generation for Any-View Rendering in Street Scenes".
- Vista: "A Generalizable Driving World Model with High Fidelity and Versatile Controllability".
- CarDreamer: "Open-Source Learning Platform for World Model based Autonomous Driving".
- DriveSim: "Probing Multimodal LLMs as World Models for Driving".
- DriveWorld: "4D Pre-trained Scene Understanding via World Models for Autonomous Driving".
- LidarDM: "Generative LiDAR Simulation in a Generated World".
- SubjectDrive: "Scaling Generative Data in Autonomous Driving via Subject Control".
- DriveDreamer-2: "LLM-Enhanced World Models for Diverse Driving Video Generation".
- Think2Drive: "Efficient Reinforcement Learning by Thinking in Latent World Model for Quasi-Realistic Autonomous Driving".
- MARL-CCE: "Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model".
- GenAD: "Generalized Predictive Model for Autonomous Driving".
- NeMo: "Neural Volumetric World Models for Autonomous Driving".
- ViDAR: "Visual Point Cloud Forecasting enables Scalable Autonomous Driving".
- Drive-WM: "Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving".
- Cam4DOCC: "Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications".
- Panacea: "Panoramic and Controllable Video Generation for Autonomous Driving".
- OccWorld: "Learning a 3D Occupancy World Model for Autonomous Driving".
- DrivingDiffusion: "Layout-Guided multi-view driving scene video generation with latent diffusion model".
- SafeDreamer: "Safe Reinforcement Learning with World Models".
- MagicDrive: "Street View Generation with Diverse 3D Geometry Control".
- DriveDreamer: "Towards Real-world-driven World Models for Autonomous Driving".
- SEM2: "Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model".
- Comparative Study of World Models: "Comparative Study of World Models, NVAE-Based Hierarchical Models, and NoisyNet-Augmented Models in CarRacing-v2".
- Knowledge Graphs as World Models: "Knowledge Graphs as World Models for Material-Aware Obstacle Handling in Autonomous Vehicles".
- Uncertainty Modeling: "Uncertainty Modeling in Autonomous Vehicle Trajectory Prediction: A Comprehensive Survey".
- Divide and Merge: "Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving".
- [⭐️] Genie Envisioner: "Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation".
- [⭐️] WoW, "WoW: Towards a World omniscient World model Through Embodied Interaction".
- UnifoLM-WMA-0, "UnifoLM-WMA-0: A World-Model-Action (WMA) Framework under UnifoLM Family".
- [⭐️] iVideoGPT, "iVideoGPT: Interactive VideoGPTs are Scalable World Models".
- Direct Robot Configuration Space Construction: "Direct Robot Configuration Space Construction using Convolutional Encoder-Decoders".
- [⭐️] FLARE, "FLARE: Robot Learning with Implicit World Modeling".
- [⭐️] Enerverse, "EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation".
- [⭐️] AgiBot-World, "AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems".
- [⭐️] DyWA: "DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation"
- [⭐️] TesserAct, "TesserAct: Learning 4D Embodied World Models".
- [⭐️] DreamGen: "DreamGen: Unlocking Generalization in Robot Learning through Video World Models".
- [⭐️] HiP, "Compositional Foundation Models for Hierarchical Planning".
- PAR: "Physical Autoregressive Model for Robotic Manipulation without Action Pretraining".
- iMoWM: "iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation".
- WristWorld: "WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation".
- "A Recipe for Efficient Sim-to-Real Transfer in Manipulation with Online Imitation-Pretrained World Models".
- EMMA: "EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer".
- PhysTwin, "PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos".
- [⭐️] KeyWorld: "KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models".
- World4RL: "World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation".
- [⭐️] SAMPO: "SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models".
- PhysicalAgent: "PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models".
- "Empowering Multi-Robot Cooperation via Sequential World Models".
- [⭐️] "Learning Primitive Embodied World Models: Towards Scalable Robotic Learning".
- [⭐️] GWM: "GWM: Towards Scalable Gaussian World Models for Robotic Manipulation".
- [⭐️] Flow-as-Action, "Latent Policy Steering with Embodiment-Agnostic Pretrained World Models".
- EmbodieDreamer: "EmbodieDreamer: Advancing Real2Sim2Real Transfer for Policy Training via Embodied World Modeling".
- RoboScape: "RoboScape: Physics-informed Embodied World Model".
- FWM, "Factored World Models for Zero-Shot Generalization in Robotic Manipulation".
- [⭐️] ParticleFormer: "ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation".
- ManiGaussian++: "ManiGaussian++: General Robotic Bimanual Manipulation with Hierarchical Gaussian World Model".
- ReOI: "Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control".
- GAF: "GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation".
- "Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins".
- "Time-Aware World Model for Adaptive Prediction and Control".
- [⭐️] 3DFlowAction: "3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model".
- [⭐️] ORV: "ORV: 4D Occupancy-centric Robot Video Generation".
- [⭐️] WoMAP: "WoMAP: World Models For Embodied Open-Vocabulary Object Localization".
- "Sparse Imagination for Efficient Visual World Model Planning".
- [⭐️] OSVI-WM: "OSVI-WM: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation".
- [⭐️] LaDi-WM: "LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation".
- FlowDreamer: "FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation".
- PIN-WM: "PIN-WM: Learning Physics-INformed World Models for Non-Prehensile Manipulation".
- RoboMaster, "Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control".
- ManipDreamer: "ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance".
- [⭐️] AdaWorld: "AdaWorld: Learning Adaptable World Models with Latent Actions"
- "Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks"
- [⭐️] EVA: "EVA: An Embodied World Model for Future Video Anticipation".
- "Representing Positional Information in Generative World Models for Object Manipulation".
- DexSim2Real$^2$: "DexSim2Real$^2$: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation".
- "Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics".
- [⭐️] LUMOS: "LUMOS: Language-Conditioned Imitation Learning with World Models".
- [⭐️] "Object-Centric World Model for Language-Guided Manipulation"
- [⭐️] DEMO^3: "Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning"
- "Strengthening Generative Robot Policies through Predictive World Modeling".
- RoboHorizon: "RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation".
- Dream to Manipulate: "Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination".
- [⭐️] RoboDreamer: "RoboDreamer: Learning Compositional World Models for Robot Imagination".
- ManiGaussian: "ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation".
- [⭐️] WHALE: "WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making".
- [⭐️] VisualPredicator: "VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning".
- [⭐️] "Multi-Task Interactive Robot Fleet Learning with Visual World Models".
- PIVOT-R: "PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation".
- Video2Action, "Grounding Video Models to Actions through Goal Conditioned Exploration".
- Diffuser, "Planning with Diffusion for Flexible Behavior Synthesis".
- Decision Diffuser, "Is Conditional Generative Modeling all you need for Decision-Making?".
- Potential Based Diffusion Motion Planning, "Potential Based Diffusion Motion Planning".
- World4Omni: "World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation".
- Mobile Manipulation with Active Inference: "Mobile Manipulation with Active Inference for Long-Horizon Rearrangement Tasks".
- [⭐️] NWM, "Navigation World Models".
- [⭐️] MindJourney: "MindJourney: Test-Time Scaling with World Models for Spatial Reasoning".
- Scaling Inference-Time Search: "Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension".
- Foundation Models as World Models: "Foundation Models as World Models: A Foundational Study in Text-Based GridWorlds".
- Geosteering Through the Lens of Decision Transformers: "Geosteering Through the Lens of Decision Transformers: Toward Embodied Sequence Decision-Making".
- Latent Weight Diffusion: "Latent Weight Diffusion: Generating reactive policies instead of trajectories".
- NavMorph: "NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments".
- Unified World Models: "Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation" [code].
- RECON, "Rapid Exploration for Open-World Navigation with Latent Goal Models".
- WMNav: "WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation".
- NaVi-WM, "Deductive Chain-of-Thought Augmented Socially-aware Robot Navigation World Model".
- AIF, "Deep Active Inference with Diffusion Policy and Multiple Timescale World Model for Real-World Exploration and Navigation".
- "Kinodynamic Motion Planning for Mobile Robot Navigation across Inconsistent World Models".
- "World Model Implanting for Test-time Adaptation of Embodied Agents".
- "Imaginative World Modeling with Scene Graphs for Embodied Agent Navigation".
- [⭐️] Persistent Embodied World Models, "Learning 3D Persistent Embodied World Models".
- "Perspective-Shifted Neuro-Symbolic World Models: A Framework for Socially-Aware Robot Navigation"
- X-MOBILITY: "X-MOBILITY: End-To-End Generalizable Navigation via World Modeling".
- MWM, "Masked World Models for Visual Control".
Locomotion:
- [⭐️] Ego-VCP, "Ego-Vision World Model for Humanoid Contact Planning".
- [⭐️] RWM-O, "Offline Robotic World Model: Learning Robotic Policies without a Physics Simulator".
- [⭐️] DWL: "Advancing Humanoid Locomotion: Mastering Challenging Terrains with Denoising World Model Learning".
- HRSSM: "Learning Latent Dynamic Robust Representations for World Models".
- WMP: "World Model-based Perception for Visual Legged Locomotion".
- TrajWorld, "Trajectory World Models for Heterogeneous Environments".
- Puppeteer: "Hierarchical World Models as Visual Whole-Body Humanoid Controllers".
- ProTerrain: "ProTerrain: Probabilistic Physics-Informed Rough Terrain World Modeling".
- Occupancy World Model, "Occupancy World Model for Robots".
- [⭐️] "Accelerating Model-Based Reinforcement Learning with State-Space World Models".
- [⭐️] "Learning Humanoid Locomotion with World Model Reconstruction".
- [⭐️] Robotic World Model: "Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics".
Loco-Manipulation:
- [⭐️] 1X World Model, 1X World Model.
- [⭐️] GROOT-Dreams, "Dream Come True — NVIDIA Isaac GR00T-Dreams Advances Robot Training With Synthetic Data and Neural Simulation".
- Humanoid World Models: "Humanoid World Models: Open World Foundation Models for Humanoid Robotics".
- Ego-Agent, "EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds".
- D^2PO, "World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning"
- COMBO: "COMBO: Compositional World Models for Embodied Multi-Agent Cooperation".
- Scalable Humanoid Whole-Body Control: "Scalable Humanoid Whole-Body Control via Differentiable Neural Network Dynamics".
- Bridging the Sim-to-Real Gap: "Bridging the Sim-to-Real Gap in Humanoid Dynamics via Learned Nonlinear Operators".
Unifying World Models and VLAs in one model:
- [⭐️] CoT-VLA: "CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models".
- [⭐️] UP-VLA, "UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent".
- [⭐️] VPP, "Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations".
- [⭐️] FLARE: "FLARE: Robot Learning with Implicit World Modeling".
- [⭐️] MinD: "MinD: Unified Visual Imagination and Control via Hierarchical World Models".
- [⭐️] DreamVLA, "DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge".
- [⭐️] WorldVLA: "WorldVLA: Towards Autoregressive Action World Model".
- 3D-VLA: "3D-VLA: A 3D Vision-Language-Action Generative World Model".
- LAWM: "Latent Action Pretraining Through World Modeling".
- [⭐️] UniVLA: "UniVLA: Unified Vision-Language-Action Model".
- [⭐️] dVLA, "dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought".
- [⭐️] Vidar, "Vidar: Embodied Video Diffusion Model for Generalist Manipulation".
- [⭐️] UD-VLA, "Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process".
- Goal-VLA: "Goal-VLA: Image-Generative VLMs as Object-Centric World Models Empowering Zero-shot Robot Manipulation".
Combining World Models and VLAs:
- [⭐️] Ctrl-World: "Ctrl-World: A Controllable Generative World Model for Robot Manipulation".
- VLA-RFT: "VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators".
- World-Env: "World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training".
- [⭐️] Self-Improving Embodied Foundation Models, "Self-Improving Embodied Foundation Models".
- GigaBrain-0, GigaBrain-0: A World Model-Powered Vision-Language-Action Model.
- A Smooth Sea Never Made a Skilled SAILOR: "A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search".
This subsection focuses on general policy learning methods in embodied intelligence via leveraging world models.
- [⭐️] UWM, "Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets".
- [⭐️] UVA, Unified Video Action Model.
- DiWA, "DiWA: Diffusion Policy Adaptation with World Models".
- [⭐️] Dreamer 4, "Training Agents Inside of Scalable World Models".
- Latent Action Learning Requires Supervision: "Latent Action Learning Requires Supervision in the Presence of Distractors".
- Robotic World Model: "Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics".
- Sim-to-Real Contact-Rich Pivoting: "Sim-to-Real Contact-Rich Pivoting via Optimization-Guided RL with Vision and Touch".
- Hierarchical Task Environments: "Hierarchical Task Environments as the Next Frontier for Embodied World Models in Robot Soccer".
Real-world policy evaluation is expensive and noisy. The promise of world models is that, by accurately capturing environment dynamics, they can serve as surrogate evaluation environments whose scores correlate strongly with real-world policy performance. Before world models, simulators filled this role.
For World Model Evaluation:
- [⭐️] WorldGym, "WorldGym: Evaluating Robot Policies in a World Model".
- [⭐️] WorldEval: "WorldEval: World Model as Real-World Robot Policies Evaluator".
- [⭐️] WoW!: "WOW!: World Models in a Closed-Loop World".
- Cosmos-Surg-dVRK: "Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning".
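The surrogate-evaluation idea above is often checked by measuring how well a world model's policy rankings agree with real-world rankings. A minimal sketch, with entirely illustrative numbers (the policy returns below are made up, not from any of the papers listed):

```python
# Hypothetical check: if a world model is a good surrogate evaluator,
# ranking policies by world-model rollout return should agree with
# ranking them by (expensive) real-world return.

def rank(values):
    """Map each value to its rank (1 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation (assumes no ties, for simplicity)."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative per-policy mean returns: world-model rollouts vs. real evaluation.
wm_returns = [0.82, 0.55, 0.91, 0.40, 0.67]
real_returns = [0.78, 0.50, 0.88, 0.35, 0.70]

print(spearman(wm_returns, real_returns))  # 1.0: the two rankings agree exactly
```

A correlation near 1 suggests the world model can stand in for real evaluation when comparing policies, even if its absolute return estimates are biased.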
Natural Science:
- [⭐️] CellFlux, "CellFlux: Simulating Cellular Morphology Changes via Flow Matching".
- CheXWorld, "CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning".
- EchoWorld: "EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance".
- ODesign, "ODesign: A World Model for Biomolecular Interaction Design."
- [⭐️] SFP, "Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models".
- Xray2Xray, "Xray2Xray: World Model from Chest X-rays with Volumetric Context".
- [⭐️] Medical World Model: "Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning".
- Surgical Vision World Model, "Surgical Vision World Model".
Social Science:
- Social World Models, "Social World Models".
- "Social World Model-Augmented Mechanism Design Policy Learning".
- SocioVerse, "SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users".
- Effectively Designing 2-Dimensional Sequence Models: "Effectively Designing 2-Dimensional Sequence Models for Multivariate Time Series".
- A Virtual Reality-Integrated System: "A Virtual Reality-Integrated System for Behavioral Analysis in Neurological Decline".
- Latent Representation Encoding: "Latent Representation Encoding and Multimodal Biomarkers for Post-Stroke Speech Assessment".
- Reconstructing Dynamics: "Reconstructing Dynamics from Steady Spatial Patterns with Partial Observations".
- SP: Learning Physics from Sparse Observations: "SP: Learning Physics from Sparse Observations — Three Pitfalls of PDE-Constrained Diffusion Models".
- SP: Continuous Autoregressive Generation: "SP: Continuous Autoregressive Generation with Mixture of Gaussians".
- EquiReg: "EquiReg: Symmetry-Driven Regularization for Physically Grounded Diffusion-based Inverse Solvers".
- PINT: "PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data".
- [⭐️] Video as the New Language for Real-World Decision Making, "Video as the New Language for Real-World Decision Making".
- [⭐️] Critiques of World Models, "Critiques of World Models".
- LAW, "Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning".
- [⭐️] Compositional Generative Modeling: A Single Model is Not All You Need, "Compositional Generative Modeling: A Single Model is Not All You Need".
- Interactive Generative Video as Next-Generation Game Engine, "Position: Interactive Generative Video as Next-Generation Game Engine".
- A Proposal for Networks Capable of Continual Learning: "A Proposal for Networks Capable of Continual Learning".
- Towards Unified Expressive Policy Optimization: "Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning".
- A Unified World Model: "Opinion: A Unified World Model is the cornerstone for integrating perception, reasoning, and decision-making in embodied AI".
- [⭐️] General Agents Contain World Models, "General agents contain world models".
- [⭐️] When Do Neural Networks Learn World Models?, "When Do Neural Networks Learn World Models?".
- What Does it Mean for a Neural Network to Learn a 'World Model'?, "What Does it Mean for a Neural Network to Learn a 'World Model'?".
- Transformer cannot learn HMMs (sometimes), "On Limitation of Transformer for Learning HMMs".
- [⭐️] Inductive Bias Probe, "What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models".
- [⭐️] Dynamical Systems Learning for World Models, "When do World Models Successfully Learn Dynamical Systems?".
- How Hard is it to Confuse a World Model?, "How Hard is it to Confuse a World Model?".
- ICL Emergence, "Context and Diversity Matter: The Emergence of In-Context Learning in World Models".
- [⭐️] Scaling Law, "Scaling Laws for Pre-training Agents and World Models".
- LLM World Model, "Linear Spatial World Models Emerge in Large Language Models".
- Revisiting Othello, "Revisiting the Othello World Model Hypothesis".
- [⭐️] Transformers Use Causal World Models, "Transformers Use Causal World Models in Maze-Solving Tasks".
- [⭐️] Causal World Model inside NTP, "A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment".
- Utilizing World Models: "Utilizing World Models for Adaptively Covariate Acquisition Under Limited Budget for Causal Decision Making Problem".
Interactive Video Generation:
- [⭐️] Genie 3, "Genie 3: A new frontier for world models".
- [⭐️] V-JEPA 2, "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning".
- [⭐️] Cosmos Predict 2.5 & Cosmos Transfer 2.5, "Cosmos Predict 2.5 & Transfer 2.5: Evolving the World Foundation Models for Physical AI".
- [⭐️] PAN, "PAN: A World Model for General, Interactable, and Long-Horizon World Simulation".
3D Scene Generation:
- [⭐️] RTFM, "RTFM: A Real-Time Frame Model".
- [⭐️] Marble, "Generating Bigger and Better Worlds".
- [⭐️] WorldGen, "WorldGen: From Text to Traversable and Interactive 3D Worlds".
Genie Series:
- [⭐️] Genie 2, "Genie 2: A Large-Scale Foundation World Model".
- [⭐️] Genie, "Genie: Generative Interactive Environments".
V-JEPA Series:
Cosmos Series:
World-Lab Projects:
Other Awesome Models:
- [⭐️] Pandora, "Pandora: Towards General World Model with Natural Language Actions and Video States".
- [⭐️] UniSim, "UniSim: Learning Interactive Real-World Simulators".
- Masked Generative Priors: "Masked Generative Priors Improve World Models Sequence Modelling Capabilities".
- Mixture-of-Transformers: "Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models".
- Mixture-of-Mamba: "Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity".
- FPAN: "FPAN: Mitigating Replication in Diffusion Models through the Fine-Grained Probabilistic Addition of Noise to Token Embeddings".
This represents a "bottom-up" approach to achieving intelligence: sensorimotor before abstraction. In 2D pixel space, world models often build upon pre-existing image/video generation approaches.
To what extent does Vision Intelligence exist in Video Generation Models:
- [⭐️] Sora, "Video generation models as world simulators". [Technical report]
- [⭐️] Veo 3, "Video models are zero-shot learners and reasoners".
- [⭐️] PhyWorld, "How Far is Video Generation from World Model: A Physical Law Perspective".
- Emergent Few-Shot Learning in Video Diffusion Models, "From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models".
- VideoVerse: "VideoVerse: How Far is Your T2V Generator from a World Model?".
- [⭐️] Emu 3.5, "Emu3.5: Native Multimodal Models are World Learners".
- [⭐️] Emu 3, "Emu3: Next-Token Prediction is All You Need".
Useful Approaches in Video Generation:
- [⭐️] Diffusion Forcing, "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion".
- [⭐️] DFoT, "History-Guided Video Diffusion".
- [⭐️] Self-Forcing, "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion".
- CausVid, "From Slow Bidirectional to Fast Causal Video Generators".
- Longlive, "LongLive: Real-time Interactive Long Video Generation".
- ControlNet, "Adding Conditional Control to Text-to-Image Diffusion Models".
- ReCamMaster, "ReCamMaster: Camera-Controlled Generative Rendering from A Single Video".
From Video Generation Models to World Models:
- [⭐️] Vid2World: "Vid2World: Crafting Video Diffusion Models to Interactive World Models".
- AVID, "AVID: Adapting Video Diffusion Models to World Models".
- IRASim, "IRASim: A Fine-Grained World Model for Robot Manipulation".
- DWS, "Pre-Trained Video Generative Models as World Simulators".
- Video Adapter, "Probabilistic Adaptation of Black-Box Text-to-Video Models".
- Video Agent, "VideoAgent: Self-Improving Video Generation".
- WISA, "WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation".
- Force Prompting, "Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals".
Pixel Space World Models:
- [⭐️] Owl-1: "Owl-1: Omni World Model for Consistent Long Video Generation".
- [⭐️] Long-Context State-Space Video World Models, "Long-Context State-Space Video World Models".
- [⭐️] StateSpaceDiffuser: "StateSpaceDiffuser: Bringing Long Context to Diffusion World Models".
- [⭐️] Geometry Forcing: "Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling".
- Yume: "Yume: An Interactive World Generation Model".
- PSI, "World Modeling with Probabilistic Structure Integration".
- Martian World Models, "Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions".
- WorldDreamer: "WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens".
- EBWM: "Cognitively Inspired Energy-Based World Models".
- "Video World Models with Long-term Spatial Memory".
- VRAG, "Learning World Models for Interactive Video Generation".
- DRAW, "Adapting World Models with Latent-State Dynamics Residuals".
- ForeDiff, "Consistent World Models via Foresight Diffusion".
- Distribution Recovery: "Distribution Recovery in Compact Diffusion World Models via Conditioned Frame Interpolation".
- EmbodiedScene: "EmbodiedScene: Towards Automated Generation of Diverse and Realistic Scenes for Embodied AI".
- Beyond Single-Step: "Beyond Single-Step: Multi-Frame Action-Conditioned Video Generation for Reinforcement Learning Environments".
- Adaptive Attention-Guided Masking: "Adaptive Attention-Guided Masking in Vision Transformers for Self-Supervised Hyperspectral Feature Learning".
- Enhancing Long Video Generation Consistency: "Enhancing Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory".
- Can Image-To-Video Models Simulate Pedestrian Dynamics?: "Can Image-To-Video Models Simulate Pedestrian Dynamics?".
- Video Self-Distillation: "Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception".
- Whole-Body Conditioned Egocentric Video Prediction: "Whole-Body Conditioned Egocentric Video Prediction".
- Taming generative world models: "Taming generative world models for zero-shot optical flow extraction".
3D mesh is also a useful representation of the physical world, offering benefits such as spatial consistency.
- [⭐️] WorldGrow: "WorldGrow: Generating Infinite 3D World".
- TRELLISWorld: "TRELLISWorld: Training-Free World Generation from Object Generators".
- Terra: "Terra: Explorable Native 3D World Model with Point Latents".
- MorphoSim: "MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator".
- EvoWorld: "EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory".
- [⭐️] FantasyWorld: "FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction".
- [⭐️] Aether: "Aether: Geometric-Aware Unified World Modeling".
- HERO: "HERO: Hierarchical Extrapolation and Refresh for Efficient World Models".
- UrbanWorld: "UrbanWorld: An Urban World Model for 3D City Generation".
- DeepVerse: "DeepVerse: 4D Autoregressive Video Generation as a World Model".
This represents a "top-down" approach to achieving intelligence: abstraction before sensorimotor.
Aiming to Advance LLM/VLM skills:
- [⭐️] VLWM, "Planning with Reasoning using Vision Language World Model".
- [⭐️] Agent Learning via Early Experience, "Agent Learning via Early Experience".
- [⭐️] CWM, "CWM: An Open-Weights LLM for Research on Code Generation with World Models".
- [⭐️] RAP, "Reasoning with language model is planning with world model".
- SURGE, "SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors".
- LLM-Sim: "Can Language Models Serve as Text-Based World Simulators?".
- WorldLLM, "WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making".
- LLMs as World Models, "LLMs as World Models: Data-Driven and Human-Centered Pre-Event Simulation for Disaster Impact Assessment".
- [⭐️] LWM: "World Model on Million-Length Video And Language With RingAttention".
- "Evaluating World Models with LLM for Decision Making".
- LLMPhy: "LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models".
- LLMCWM: "Language Agents Meet Causality -- Bridging LLMs and Causal World Models".
- "Making Large Language Models into World Models with Precondition and Effect Knowledge".
- CityBench: "CityBench: Evaluating the Capabilities of Large Language Model as World Model".
Aiming to enhance computer-use agent performance:
- [⭐️] Neural-OS, "NeuralOS: Towards Simulating Operating Systems via Neural Generative Models".
- R-WoM: "R-WoM: Retrieval-augmented World Model For Computer-use Agents".
- [⭐️] SimuRA: "SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model".
- WebSynthesis, "WebSynthesis: World-Model-Guided MCTS for Efficient WebUI-Trajectory Synthesis".
- WKM: "Agent Planning with World Knowledge Model".
- WebDreamer, "Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents".
- "Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation".
- WebEvolver: "WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model".
- WALL-E 2.0: "WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents".
- ViMo: "ViMo: A Generative Visual GUI World Model for App Agent".
- [⭐️] Dyna-Think: "Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents".
- FPWC, "Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach".
Symbolic World Models:
- [⭐️] PoE-World, "PoE-World: Compositional World Modeling with Products of Programmatic Experts".
- "One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration".
- "Finite Automata Extraction: Low-data World Model Learning as Programs from Gameplay Video".
- "Synthesizing world models for bilevel planning".
- "Generating Symbolic World Models via Test-time Scaling of Large Language Models".
LLM-in-the-loop World Generation:
- LatticeWorld, "LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation".
- Text2World: "Text2World: Benchmarking Large Language Models for Symbolic World Model Generation".
A recent trend of work is bridging highly-compressed semantic tokens (e.g. language) with information-sparse cues in the observation space (e.g. vision). This results in World Models that combine high-level and low-level intelligence.
- [⭐️] VAGEN, "VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents".
- [⭐️] Semantic World Models, "Semantic World Models".
- DyVA: "Can World Models Benefit VLMs for World Dynamics?".
- From Foresight to Forethought: "From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment".
- Emergent Stack Representations: "Emergent Stack Representations in Modeling Counter Languages Using Transformers".
- Dialogues Between Adam and Eve: "Dialogues Between Adam and Eve: Exploration of Unknown Civilization Language by LLM".
- Memory Helps, but Confabulation Misleads: "Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs".
- VLA-OS: "VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models".
- LLM-Guided Probabilistic Program Induction: "LLM-Guided Probabilistic Program Induction for POMDP Model Estimation".
- Decoupled Planning and Execution: "Decoupled Planning and Execution with LLM-Driven World Models for Efficient Task Planning".
- The Physical Basis of Prediction: "The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum".
- Avi: "Avi: A 3D Vision-Language Action Model Architecture generating Action from Volumetric Inference".
- How Foundational Skills Influence VLM-based Embodied Agents: "How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective".
- Towards Fine-tuning a Small Vision-Language Model: "Towards Fine-tuning a Small Vision-Language Model for Aerial Navigation".
- Improvisational Reasoning: "Improvisational Reasoning with Vision-Language Models for Grounded Procedural Planning".
- Vision-Language Reasoning for Burn Depth Assessment: "Vision-Language Reasoning for Burn Depth Assessment with Structured Diagnostic Hypotheses".
- Puffin: "Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation".
While learning in the observation space (pixels, 3D meshes, language, etc.) is a common approach, for many applications (planning, policy evaluation, etc.) learning in latent space is sufficient, or is believed to lead to even better performance.
- [⭐️] DINO-WM, "DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning".
- [⭐️] DINO-World, "Back to the Features: DINO as a Foundation for Video World Models".
- [⭐️] DINO-Foresight, "DINO-Foresight: Looking into the Future with DINO".
- AWM, "Learning Abstract World Models with a Group-Structured Latent Space".
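As a toy illustration of planning in a learned latent space (in the spirit of the zero-shot planning above), here is a minimal random-shooting sketch. The linear dynamics, dimensions, and goal are all illustrative assumptions standing in for a frozen encoder and a learned latent dynamics model, not taken from any of the listed papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, da, H, K = 4, 2, 5, 256    # latent dim, action dim, horizon, num candidates

# Hypothetical stand-ins for learned components.
A = np.eye(d) * 0.9                  # latent transition (state part)
B = rng.normal(size=(d, da)) * 0.5   # latent transition (action part)
z0 = rng.normal(size=d)              # current latent (would come from the encoder)
z_goal = np.zeros(d)                 # goal latent (e.g., an encoded goal image)

def rollout(z, actions):
    """Roll the latent dynamics forward and return the final latent."""
    for a in actions:
        z = A @ z + B @ a
    return z

# Random-shooting planner: sample K action sequences, keep the best one.
cands = rng.normal(size=(K, H, da))
costs = [np.linalg.norm(rollout(z0, seq) - z_goal) for seq in cands]
best = cands[int(np.argmin(costs))]
print("best first action:", best[0], "cost:", min(costs))
```

In an MPC loop, only `best[0]` would be executed before re-encoding and re-planning.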
JEPA is a special kind of latent-space learning in which the loss is placed on the latent space and the encoder and predictor are co-trained. JEPA is used not only in world models (e.g., V-JEPA 2-AC) but also in representation learning (e.g., I-JEPA, V-JEPA); we provide representative works from both perspectives below.
- [⭐️] I-JEPA, "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture".
- IWM: "Learning and Leveraging World Models in Visual Representation Learning".
- [⭐️] V-JEPA: "V-JEPA: Video Joint Embedding Predictive Architecture".
- [⭐️] V-JEPA Learns Intuitive Physics, "Intuitive physics understanding emerges from self-supervised pretraining on natural videos".
- [⭐️] V-JEPA 2, "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning".
- seq-JEPA: "seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models".
- MC-JEPA, "MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features".
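As a toy illustration of the JEPA recipe described above, here is a minimal NumPy sketch: the loss lives entirely in latent space, gradients flow into the predictor and the online encoder, and the target encoder is an EMA copy (stop-gradient). All dimensions, transforms, and learning rates are illustrative assumptions, not taken from any of the papers above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_lat, n = 8, 4, 32

X = rng.normal(size=(n, d_obs))   # context views
Y = np.roll(X, 1, axis=1)         # "target" views (here: a fixed toy transform)

W_enc = 0.2 * rng.normal(size=(d_lat, d_obs))   # online encoder
W_tgt = W_enc.copy()                            # EMA target encoder (stop-gradient)
W_pred = np.eye(d_lat)                          # latent-space predictor

def jepa_epoch(lr=0.05, ema=0.99):
    """One full-batch JEPA-style update; returns the latent prediction loss."""
    global W_enc, W_pred, W_tgt
    Zx = X @ W_enc.T                  # context latents
    Zy = Y @ W_tgt.T                  # target latents (no gradient flows here)
    E = Zx @ W_pred.T - Zy            # loss = mean 0.5*||E||^2, purely in latent space
    g_pred = E.T @ Zx / n             # grad w.r.t. the predictor
    g_enc = (E @ W_pred).T @ X / n    # grad w.r.t. the encoder (through the predictor)
    W_pred -= lr * g_pred
    W_enc -= lr * g_enc
    W_tgt = ema * W_tgt + (1 - ema) * W_enc   # EMA update of the target encoder
    return 0.5 * float((E * E).sum()) / n

losses = [jepa_epoch() for _ in range(300)]
print(f"latent prediction loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The EMA target is one common way to discourage representation collapse; the real methods above pair it with richer encoders, masking strategies, and regularizers.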
- [⭐️] NPE, "A Compositional Object-Based Approach to Learning Physical Dynamics".
- [⭐️] SlotFormer, "SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models".
- Dyn-O, "Dyn-O: Building Structured World Models with Object-Centric Representations".
- COMET, "Compete and Compose: Learning Independent Mechanisms for Modular World Models".
- FPTT, "Transformers and Slot Encoding for Sample Efficient Physical World Modelling".
- "Efficient Exploration and Discriminative World Model Learning with an Object-Centric Abstraction".
- OC-STORM, "Objects matter: object-centric world models improve reinforcement learning in visually complex environments".
- Unifying Causal and Object-centric Representation Learning: "Unifying Causal and Object-centric Representation Learning allows Causal Composition".
- Object-Centric Representations: "Object-Centric Representations Generalize Better Compositionally with Less Compute".
- [⭐️] RLVR-World: "RLVR-World: Training World Models with Reinforcement Learning".
- RLIR, "Reinforcement Learning with Inverse Rewards for World Model Post-training".
- Chrono-Edit, "ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation".
- [⭐️] SWIFT, "Can Test-Time Scaling Improve World Foundation Model?".
A significant proportion of world model algorithms and techniques stem from the advances in model-based reinforcement learning around 2020. Dreamer (v1-v3) is the classical line of work from this era. We provide a list of these classics as well as works following this line of thought.
- [⭐️] Dreamer, "Dream to Control: Learning Behaviors by Latent Imagination".
- [⭐️] Dreamerv2, "Mastering Atari with Discrete World Models".
- [⭐️] Dreamerv3, "Mastering Diverse Domains through World Models".
- DreamSmooth: "DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing".
- [⭐️] TD-MPC2: "TD-MPC2: Scalable, Robust World Models for Continuous Control".
- Hieros: "Hieros: Hierarchical Imagination on Structured State Space Sequence World Models".
- CoWorld: "Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning".
- HarmonyDream, "HarmonyDream: Task Harmonization Inside World Models".
- DyMoDreamer, "DyMoDreamer: World Modeling with Dynamic Modulation".
- "Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization".
- PIGDreamer, "PIGDreamer: Privileged Information Guided World Models for Safe Partially Observable Reinforcement Learning".
- [⭐️] Continual Reinforcement Learning by Planning with Online World Models, "Continual Reinforcement Learning by Planning with Online World Models".
- Δ-IRIS: "Efficient World Models with Context-Aware Tokenization".
- AD3: "AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors".
- R2I: "Mastering Memory Tasks with World Models".
- REM: "Improving Token-Based World Models with Parallel Observation Prediction".
- AWM, "Do Transformer World Models Give Better Policy Gradients?".
- [⭐️] Dreaming of Many Worlds, "Dreaming of Many Worlds: Learning Contextual World Models Aids Zero-Shot Generalization".
- PWM: "PWM: Policy Learning with Large World Models".
- GenRL: "GenRL: Multimodal foundation world models for generalist embodied agents".
- DLLM: "World Models with Hints of Large Language Models for Goal Achieving".
- Adaptive World Models: "Adaptive World Models: Learning Behaviors by Latent Imagination Under Non-Stationarity".
- "Reward-free World Models for Online Imitation Learning".
- MoReFree: "World Models Increase Autonomy in Reinforcement Learning".
- ROMBRL, "Policy-Driven World Model Adaptation for Robust Offline Model-based Reinforcement Learning".
- "Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning".
- [⭐️] MoSim: "Neural Motion Simulator Pushing the Limit of World Models in Reinforcement Learning".
- SENSEI: "SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models".
- Spiking World Model, "Implementing Spiking World Model with Multi-Compartment Neurons for Model-based Reinforcement Learning".
- DCWM, "Discrete Codebook World Models for Continuous Control".
- Multimodal Dreaming: "Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning".
- "Generalist World Model Pre-Training for Efficient Reinforcement Learning".
- "Learning To Explore With Predictive World Model Via Self-Supervised Learning".
- Simulus: "Uncovering Untapped Potential in Sample-Efficient World Model Agents".
- DMWM: "DMWM: Dual-Mind World Model with Long-Term Imagination".
- EvoAgent: "EvoAgent: Agent Autonomous Evolution with Continual World Model for Long-Horizon Tasks".
- GLIMO: "Grounding Large Language Models In Embodied Environment With Imperfect World Models".
- Energy-based Transition Models, "Offline Transition Modeling via Contrastive Energy Learning".
- PCM, "Policy-conditioned Environment Models are More Generalizable".
- World Models as Reference Trajectories: "World Models as Reference Trajectories for Rapid Motor Adaptation".
- Stress-Testing Offline Reward-Free Reinforcement Learning: "Stress-Testing Offline Reward-Free Reinforcement Learning: A Case for Planning with Latent Dynamics Models".
- Decentralized Transformers with Centralized Aggregation: "Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models".
- Model-based Offline Reinforcement Learning: "Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning".
- Learning from Reward-Free Offline Data: "Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models".
- DAWM: "DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions".
- Revisiting Multi-Agent World Modeling: "Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective".
- Communicating Plans, Not Percepts: "Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models".
- Learning to Focus: "Learning to Focus: Prioritizing Informative Histories with Structured Attention Mechanisms in Partially Observable Reinforcement Learning".
- Stable Planning: "Stable Planning through Aligned Representations in Model-Based Reinforcement Learning".
- THICK: "Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics".
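The core loop these Dreamer-style methods share, imagining latent rollouts instead of stepping the real environment, can be sketched as follows. Every component (dynamics, reward head, policy) is an illustrative stand-in for its learned counterpart; none of the specifics come from the papers above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, da = 4, 2

# Hypothetical stand-ins for learned components of a latent world model.
A = 0.95 * np.eye(d)                 # latent dynamics: z' = A z + B a
B = 0.3 * rng.normal(size=(d, da))
w_r = rng.normal(size=d)             # reward head: r(z) = w_r . z
K = 0.1 * rng.normal(size=(da, d))   # simple linear policy: a = K z

def imagine_return(z, horizon=15, gamma=0.99):
    """Estimate the policy's return purely by an imagined latent rollout."""
    ret = 0.0
    for t in range(horizon):
        a = K @ z                    # act from the imagined latent
        ret += (gamma ** t) * float(w_r @ z)
        z = A @ z + B @ a            # imagined transition (no real env step)
    return ret

z0 = rng.normal(size=d)              # posterior latent from a real observation
print("imagined return:", imagine_return(z0))
```

In the actual methods, gradients of such imagined returns (or value estimates bootstrapped from them) are what train the policy.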
Implicit Memory:
- [⭐️] Context as Memory, "Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval".
- [⭐️] History-Guided Video Diffusion, "History-Guided Video Diffusion".
- [⭐️] Mixture of Contexts for Long Video Generation, "Mixture of Contexts for Long Video Generation".
Explicit Memory:
World Models in the Language Modality:
- Evaluating the World Model Implicit in a Generative Model, "Evaluating the World Model Implicit in a Generative Model".
- "Benchmarking World-Model Learning".
- WM-ABench: "Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation".
- UNIVERSE: "Adapting Vision-Language Models for Evaluating World Models".
- WorldPrediction: "WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning".
- EVA: "EVA: An Embodied World Model for Future Video Anticipation".
- AeroVerse: "AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models".
World Models in the Pixel Space:
- World-in-World: "World-in-World: World Models in a Closed-Loop World".
- WorldPrediction: "WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning".
- "Toward Memory-Aided World Models: Benchmarking via Spatial Consistency".
- SimWorld: "SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model".
- EWMBench: "EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models".
- "Toward Stable World Models: Measuring and Addressing World Instability in Generative Environments".
- WorldModelBench: "WorldModelBench: Judging Video Generation Models As World Models".
- EVA: "EVA: An Embodied World Model for Future Video Anticipation".
- ACT-Bench: "ACT-Bench: Towards Action Controllable World Models for Autonomous Driving".
- WorldSimBench: "WorldSimBench: Towards Video Generation Models as World Simulators".
- WorldScore, "WorldScore: A Unified Evaluation Benchmark for World Generation".
- "Imagine the Unseen World: A Benchmark for Systematic Generalization in Visual World Models".
World Models in 3D Mesh Space:
World Models in other modalities:
Physically Plausible World Models:
- Text2World: "Text2World: Benchmarking World Modeling Capabilities of Large Language Models via Program Synthesis".
- AetherVision-Bench: "AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives".
- VideoPhy-2: "VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation".
- A Comprehensive Evaluation: "A Comprehensive Evaluation of Physical Realism in Text-to-Video Models".
This project is largely built on the foundations laid by:
- 🕶️ A Survey: Learning Embodied Intelligence from Physical Simulators and World Models
- 🕶️ Awesome-World-Model-for-Autonomous-Driving
- 🕶️ Awesome-World-Model-for-Robotics
Huge shoutout to the authors for their awesome work.
If you find this repository useful, please consider citing this list:
@misc{huang2025awesomeworldmodels,
title = {Awesome-World-Models},
author = {Siqiao Huang},
howpublished = {GitHub repository},
url = {https://github.com/knightnemo/Awesome-World-Models},
year = {2025},
}
