
🌍 Awesome World Models

Awesome GitHub stars License PRs Welcome

📜 A Curated List of Amazing Works in World Modeling, spanning applications in Embodied AI, Autonomous Driving, Natural Language Processing and Agents.
Based on Awesome-World-Model-for-Autonomous-Driving and Awesome-World-Model-for-Robotics.

Awesome World Models

Photo Credit: Gemini-Nano-Banana🍌.


🚩 News & Updates

Major updates and announcements are shown below. Scroll for full timeline.

🚀 [2025-11] 1k+ Stars ⭐️ Under 30 Days — 🌍 Awesome World Models reached 1k GitHub stars within 30 days of its initial release, let's go!!!

🗺️ [2025-10] Enhanced Visual Navigation — Introduced badge system for papers! All entries now display arXiv Website Code for quick access to resources.

🔥 [2025-10] Repository Launch — Awesome World Models is now live! We're building a comprehensive collection spanning Embodied AI, Autonomous Driving, NLP, and more. See CONTRIBUTING.md for how to contribute.

💡 [Ongoing] Community Contributions Welcome — Help us maintain the most up-to-date world models resource! Submit papers via PR or contact us at email.

[Ongoing] Support This Project — If you find this useful, please cite our work and give us a star. Share with your research community!


Overview


Aim of the Project

World Models have become a hot topic in both research and industry, attracting unprecedented attention from the AI community and beyond. However, due to the interdisciplinary nature of the field (and because the term "world model" simply sounds amazing), the concept has been used with varying definitions across different domains.

Awesome World Models

This repository aims to:

  • 🔍 Organize the rapidly growing body of world model research across multiple application domains
  • 🗺️ Provide a minimalist map of how world models are utilized in different fields (Embodied AI, Autonomous Driving, NLP, etc.)
  • 🤝 Bridge the gap between different communities working on world models with varying perspectives
  • 📚 Serve as a one-stop resource for researchers, practitioners, and enthusiasts interested in world modeling
  • 🚀 Track the latest developments and breakthroughs in this exciting field

Whether you're a researcher looking for related work, a practitioner seeking implementation references, or simply curious about world models, we hope this curated list helps you navigate the landscape!


Definition of World Models

While the scope of world models has expanded again and again, it is widely accepted that the concept originates from these two sources:

  • [⭐️] World Models, "World Models". arXiv Website
  • [⭐️] Yann LeCun's Position Paper, "A Path Towards Autonomous Machine Intelligence". OpenReview

Some other great blogposts on world models include:

  • [⭐️] Towards Video World Models, "Towards Video World Models". Blog
  • Status of World Models in 2025, "Beyond the Hype: How I See World Models Evolving in 2025". Blog
  • [⭐️] Jim Fan's tweet. Blog

Surveys of World Models

1. World Models and Video Generation:

  • [⭐️] Is Sora a World Simulator, "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond". arXiv Website
  • Physics Cognition in Video Generation, "Exploring the Evolution of Physics Cognition in Video Generation: A Survey". arXiv Website

2. World Models and 3D Generation:

  • [⭐️] 3D and 4D World Modeling: A Survey, "3D and 4D World Modeling: A Survey". arXiv
  • [⭐️] Understanding World or Predicting Future?, "Understanding World or Predicting Future? A Comprehensive Survey of World Models". arXiv
  • From 2D to 3D Cognition, "From 2D to 3D Cognition: A Brief Survey of General World Models". arXiv

3. World Models and Embodied Artificial Intelligence:

  • [⭐️] World Models for Embodied AI, "A Comprehensive Survey on World Models for Embodied AI". arXiv Website
  • World Models and Physical Simulation, "A Survey: Learning Embodied Intelligence from Physical Simulators and World Models". arXiv Website
  • Embodied AI Agents: Modeling the World, "Embodied AI Agents: Modeling the World". arXiv
  • Aligning Cyber Space with Physical World, "Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI". arXiv Website

4. World Models for Autonomous Driving:

  • [⭐️] A Survey of World Models for Autonomous Driving, "A Survey of World Models for Autonomous Driving". arXiv
  • World Models for Autonomous Driving: An Initial Survey, "World Models for Autonomous Driving: An Initial Survey". arXiv
  • Interplay Between Video Generation and World Models in Autonomous Driving, "Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey". arXiv

5. Other Good Surveys:

  • From Masks to Worlds, "From Masks to Worlds: A Hitchhiker's Guide to World Models". arXiv Website
  • The Safety Challenge of World Models, "The Safety Challenge of World Models for Embodied AI Agents: A Review". arXiv
  • World Models in AI: Like a Child, "World Models in Artificial Intelligence: Sensing, Learning, and Reasoning Like a Child". arXiv
  • World Model Safety, "World Models: The Safety Perspective". arXiv
  • Model-based reinforcement learning: "A survey on model-based reinforcement learning". Website

World Models for Game Simulation

Pixel Space:

  • [⭐️] GameNGen, "Diffusion Models Are Real-Time Game Engines". arXiv
  • [⭐️] DIAMOND, "Diffusion for World Modeling: Visual Details Matter in Atari". arXiv Code
  • MineWorld, "MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft". arXiv Website
  • Oasis, "Oasis: A Universe in a Transformer". Website
  • AnimeGamer, "AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction". arXiv Website
  • [⭐️] Matrix-Game, "Matrix-Game: Interactive World Foundation Model." arXiv Code
  • [⭐️] Matrix-Game 2.0, "Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model". arXiv Website
  • RealPlay, "From Virtual Games to Real-World Play". arXiv Website Code
  • GameFactory, "GameFactory: Creating New Games with Generative Interactive Videos". arXiv Website Code
  • WORLDMEM, "Worldmem: Long-term Consistent World Simulation with Memory". arXiv Website Code

3D Mesh Space:

  • [⭐️] HunyuanWorld 1.0, "HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels". arXiv Website Code
  • [⭐️] Matrix-3D, "Matrix-3D: Omnidirectional Explorable 3D World Generation". arXiv Website

World Models for Autonomous Driving

Refer to https://github.com/LMD0311/Awesome-World-Model for full list.

Note

📢 [Call for Maintenance] The repo creator is not an expert in autonomous driving, so this is a rather bare, unclassified list of works. We welcome community effort to make this section cleaner and better organized.

  • [⭐️] Cosmos-Drive-Dreams, "Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models". arXiv Website
  • [⭐️] GAIA-2, "GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving". arXiv Website
  • Copilot4D, "Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion". arXiv
  • OmniNWM: "OmniNWM: Omniscient Driving Navigation World Models". arXiv Website
  • GAIA-1, "Introducing GAIA-1: A Cutting-Edge Generative AI Model for Autonomy". arXiv Blog
  • PWM, "From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction". arXiv Code

  • Dream4Drive, "Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks". arXiv Website

  • SparseWorld, "SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries". arXiv Code

  • DriveVLA-W0: "DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving". arXiv Code

  • "Enhancing Physical Consistency in Lightweight World Models". arXiv

  • IRL-VLA: "IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model". arXiv Website Code

  • LiDARCrafter: "LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences". arXiv Website Code

  • FASTopoWM: "FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models". arXiv Code

  • Orbis: "Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models". arXiv Code

  • "World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving". arXiv

  • NRSeg: "NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models" arXiv Code

  • World4Drive: "World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model". arXiv Code

  • Epona: "Epona: Autoregressive Diffusion World Model for Autonomous Driving". arXiv Code

  • "Towards foundational LiDAR world models with efficient latent flow matching". arXiv

  • SceneDiffuser++: "SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model". arXiv

  • COME: "COME: Adding Scene-Centric Forecasting Control to Occupancy World Model" arXiv Code

  • STAGE: "STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation". arXiv

  • ReSim: "ReSim: Reliable World Simulation for Autonomous Driving". arXiv Code Website

  • "Ego-centric Learning of Communicative World Models for Autonomous Driving". arXiv

  • Dreamland: "Dreamland: Controllable World Creation with Simulator and Generative Models". arXiv Website

  • LongDWM: "LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model". arXiv Website

  • GeoDrive: "GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control". arXiv Code

  • FutureSightDrive: "FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving". arXiv Code

  • Raw2Drive: "Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)". arXiv

  • VL-SAFE: "VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving". arXiv Website

  • PosePilot: "PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth". arXiv

  • "World Model-Based Learning for Long-Term Age of Information Minimization in Vehicular Networks". arXiv

  • "Learning to Drive from a World Model". arXiv

  • DriVerse: "DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment". arXiv

  • "End-to-End Driving with Online Trajectory Evaluation via BEV World Model". arXiv Code

  • "Knowledge Graphs as World Models for Semantic Material-Aware Obstacle Handling in Autonomous Vehicles". arXiv

  • MiLA: "MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving". arXiv Website

  • SimWorld: "SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model". arXiv Website

  • UniFuture: "Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception". arXiv Website

  • EOT-WM: "Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space". arXiv

  • "Temporal Triplane Transformers as Occupancy World Models". arXiv

  • InDRiVE: "InDRiVE: Intrinsic Disagreement based Reinforcement for Vehicle Exploration through Curiosity Driven Generalized World Model". arXiv

  • MaskGWM: "MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction". arXiv

  • Dream to Drive: "Dream to Drive: Model-Based Vehicle Control Using Analytic World Models". arXiv

  • "Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving". arXiv

  • "Dream to Drive with Predictive Individual World Model". arXiv Code

  • HERMES: "HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation". arXiv

  • AdaWM: "AdaWM: Adaptive World Model based Planning for Autonomous Driving". arXiv

  • AD-L-JEPA: "AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding Predictive Architecture for Autonomous Driving with LiDAR Data". arXiv

  • DrivingWorld: "DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT". arXiv Code Website

  • DrivingGPT: "DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers". arXiv Website

  • "An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training". arXiv

  • GEM: "GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control". arXiv Website

  • GaussianWorld: "GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction". arXiv Code

  • Doe-1: "Doe-1: Closed-Loop Autonomous Driving with Large World Model". arXiv Website Code

  • "Physical Informed Driving World Model". arXiv Website

  • InfiniCube: "InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models". arXiv Website

  • InfinityDrive: "InfinityDrive: Breaking Time Limits in Driving World Models". arXiv Website

  • ReconDreamer: "ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration". arXiv Website

  • Imagine-2-Drive: "Imagine-2-Drive: High-Fidelity World Modeling in CARLA for Autonomous Vehicles". arXiv Website

  • DynamicCity: "DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes". arXiv Website Code

  • DriveDreamer4D: "World Models Are Effective Data Machines for 4D Driving Scene Representation". arXiv Website

  • DOME: "Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model". arXiv Website

  • SSR: "Does End-to-End Autonomous Driving Really Need Perception Tasks?". arXiv Code

  • "Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models". arXiv

  • LatentDriver: "Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving". arXiv Code

  • RenderWorld: "World Model with Self-Supervised 3D Label". arXiv

  • OccLLaMA: "An Occupancy-Language-Action Generative World Model for Autonomous Driving". arXiv

  • DriveGenVLM: "Real-world Video Generation for Vision Language Model based Autonomous Driving". arXiv

  • Drive-OccWorld: "Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving". arXiv

  • CarFormer: "Self-Driving with Learned Object-Centric Representations". arXiv Code

  • BEVWorld: "A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space". arXiv Code

  • TOKEN: "Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving". arXiv

  • UMAD: "Unsupervised Mask-Level Anomaly Detection for Autonomous Driving". arXiv

  • SimGen: "Simulator-conditioned Driving Scene Generation". arXiv Code

  • AdaptiveDriver: "Planning with Adaptive World Models for Autonomous Driving". arXiv Code

  • UnO: "Unsupervised Occupancy Fields for Perception and Forecasting". arXiv Code

  • LAW: "Enhancing End-to-End Autonomous Driving with Latent World Model". arXiv Code

  • Delphi: "Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation". arXiv Code

  • OccSora: "4D Occupancy Generation Models as World Simulators for Autonomous Driving". arXiv Code

  • MagicDrive3D: "Controllable 3D Generation for Any-View Rendering in Street Scenes". arXiv Code

  • Vista: "A Generalizable Driving World Model with High Fidelity and Versatile Controllability". arXiv Code

  • CarDreamer: "Open-Source Learning Platform for World Model based Autonomous Driving". arXiv Code

  • DriveSim: "Probing Multimodal LLMs as World Models for Driving". arXiv Code

  • DriveWorld: "4D Pre-trained Scene Understanding via World Models for Autonomous Driving". arXiv

  • LidarDM: "Generative LiDAR Simulation in a Generated World". arXiv Code

  • SubjectDrive: "Scaling Generative Data in Autonomous Driving via Subject Control". arXiv Website

  • DriveDreamer-2: "LLM-Enhanced World Models for Diverse Driving Video Generation". arXiv Code

  • Think2Drive: "Efficient Reinforcement Learning by Thinking in Latent World Model for Quasi-Realistic Autonomous Driving". arXiv

  • MARL-CCE: "Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model". arXiv Code

  • GenAD: "Generalized Predictive Model for Autonomous Driving". arXiv Website

  • GenAD: "Generative End-to-End Autonomous Driving". arXiv Code

  • NeMo: "Neural Volumetric World Models for Autonomous Driving". arXiv

  • MARL-CCE: "Modelling-Competitive-Behaviors-in-Autonomous-Driving-Under-Generative-World-Model". Code

  • ViDAR: "Visual Point Cloud Forecasting enables Scalable Autonomous Driving". arXiv Code

  • Drive-WM: "Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving". arXiv Code

  • Cam4DOCC: "Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications". arXiv Code

  • Panacea: "Panoramic and Controllable Video Generation for Autonomous Driving". arXiv Code

  • OccWorld: "Learning a 3D Occupancy World Model for Autonomous Driving". arXiv Code

  • DrivingDiffusion: "Layout-Guided multi-view driving scene video generation with latent diffusion model". arXiv Code

  • SafeDreamer: "Safe Reinforcement Learning with World Models". arXiv Code

  • MagicDrive: "Street View Generation with Diverse 3D Geometry Control". arXiv Code

  • DriveDreamer: "Towards Real-world-driven World Models for Autonomous Driving". arXiv Code

  • SEM2: "Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model". arXiv

  • Comparative Study of World Models: "Comparative Study of World Models, NVAE-Based Hierarchical Models, and NoisyNet-Augmented Models in CarRacing-v2". OpenReview Website
  • Knowledge Graphs as World Models: "Knowledge Graphs as World Models for Material-Aware Obstacle Handling in Autonomous Vehicles". OpenReview Website
  • Uncertainty Modeling: "Uncertainty Modeling in Autonomous Vehicle Trajectory Prediction: A Comprehensive Survey". OpenReview Website
  • Divide and Merge: "Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving". OpenReview Website
  • RDAR: "RDAR: Reward-Driven Agent Relevance Estimation for Autonomous Driving". OpenReview Website

World Models for Embodied AI

1. Foundation Embodied World Models

  • [⭐️] Genie Envisioner: "Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation". arXiv Website
  • [⭐️] WoW, "WoW: Towards a World omniscient World model Through Embodied Interaction". arXiv Website Code
  • UnifoLM-WMA-0, "UnifoLM-WMA-0: A World-Model-Action (WMA) Framework under UnifoLM Family". Website Code
  • [⭐️] iVideoGPT, "iVideoGPT: Interactive VideoGPTs are Scalable World Models". arXiv Website
  • Direct Robot Configuration Space Construction: "Direct Robot Configuration Space Construction using Convolutional Encoder-Decoders". OpenReview Website
  • ViPRA: "ViPRA: Video Prediction for Robot Actions". OpenReview Website
  • ROPES: "ROPES: Robotic Pose Estimation via Score-based Causal Representation Learning". OpenReview Website

2. World Models for Manipulation

  • [⭐️] FLARE, "FLARE: Robot Learning with Implicit World Modeling". arXiv Website
  • [⭐️] Enerverse, "EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation". arXiv Website
  • [⭐️] AgiBot-World, "AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems". arXiv Website Code
  • [⭐️] DyWA: "DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation" arXiv Website
  • [⭐️] TesserAct, "TesserAct: Learning 4D Embodied World Models". arXiv Website
  • [⭐️] DreamGen: "DreamGen: Unlocking Generalization in Robot Learning through Video World Models". arXiv Code
  • [⭐️] HiP, "Compositional Foundation Models for Hierarchical Planning". arXiv Website
  • PAR: "Physical Autoregressive Model for Robotic Manipulation without Action Pretraining". arXiv Website
  • iMoWM: "iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation". arXiv Website
  • WristWorld: "WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation". arXiv
  • "A Recipe for Efficient Sim-to-Real Transfer in Manipulation with Online Imitation-Pretrained World Models". arXiv
  • EMMA: "EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer". arXiv
  • PhysTwin, "PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos". arXiv Website Code
  • [⭐️] KeyWorld: "KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models". arXiv
  • World4RL: "World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation". arXiv
  • [⭐️] SAMPO: "SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models". arXiv
  • PhysicalAgent: "PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models". arXiv
  • "Empowering Multi-Robot Cooperation via Sequential World Models". arXiv
  • [⭐️] "Learning Primitive Embodied World Models: Towards Scalable Robotic Learning". arXiv Website
  • [⭐️] GWM: "GWM: Towards Scalable Gaussian World Models for Robotic Manipulation". arXiv Website
  • [⭐️] Flow-as-Action, "Latent Policy Steering with Embodiment-Agnostic Pretrained World Models". arXiv
  • EmbodieDreamer: "EmbodieDreamer: Advancing Real2Sim2Real Transfer for Policy Training via Embodied World Modeling". arXiv Website
  • RoboScape: "RoboScape: Physics-informed Embodied World Model". arXiv Code
  • FWM, "Factored World Models for Zero-Shot Generalization in Robotic Manipulation". arXiv
  • [⭐️] ParticleFormer: "ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation". arXiv Website
  • ManiGaussian++: "ManiGaussian++: General Robotic Bimanual Manipulation with Hierarchical Gaussian World Model". arXiv Code
  • ReOI: "Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control". arXiv
  • GAF: "GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation". arXiv Website
  • "Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins". arXiv Website
  • "Time-Aware World Model for Adaptive Prediction and Control". arXiv
  • [⭐️] 3DFlowAction: "3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model". arXiv
  • [⭐️] ORV: "ORV: 4D Occupancy-centric Robot Video Generation". arXiv Code Website
  • [⭐️] WoMAP: "WoMAP: World Models For Embodied Open-Vocabulary Object Localization". arXiv
  • "Sparse Imagination for Efficient Visual World Model Planning". arXiv
  • [⭐️] OSVI-WM: "OSVI-WM: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation". arXiv
  • [⭐️] LaDi-WM: "LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation". arXiv
  • FlowDreamer: "FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation". arXiv Website
  • PIN-WM: "PIN-WM: Learning Physics-INformed World Models for Non-Prehensile Manipulation". arXiv
  • RoboMaster, "Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control". arXiv Website Code
  • ManipDreamer: "ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance". arXiv
  • [⭐️] AdaWorld: "AdaWorld: Learning Adaptable World Models with Latent Actions" arXiv Website
  • "Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks" arXiv Website
  • [⭐️] EVA: "EVA: An Embodied World Model for Future Video Anticipation". arXiv Website
  • "Representing Positional Information in Generative World Models for Object Manipulation". arXiv
  • DexSim2Real$^2$: "DexSim2Real$^2$: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation". arXiv
  • "Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics". arXiv Website
  • [⭐️] LUMOS: "LUMOS: Language-Conditioned Imitation Learning with World Models". arXiv Website
  • [⭐️] "Object-Centric World Model for Language-Guided Manipulation" arXiv
  • [⭐️] DEMO^3: "Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning" arXiv Website
  • "Strengthening Generative Robot Policies through Predictive World Modeling". arXiv Website
  • RoboHorizon: "RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation". arXiv
  • Dream to Manipulate: "Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination". arXiv Website
  • [⭐️] RoboDreamer: "RoboDreamer: Learning Compositional World Models for Robot Imagination". arXiv Code
  • ManiGaussian: "ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation". arXiv Code
  • [⭐️] WHALE: "WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making". arXiv
  • [⭐️] VisualPredicator: "VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning". arXiv
  • [⭐️] "Multi-Task Interactive Robot Fleet Learning with Visual World Models". arXiv Code
  • PIVOT-R: "PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation". arXiv
  • Video2Action, "Grounding Video Models to Actions through Goal Conditioned Exploration". arXiv Website Code
  • Diffuser, "Planning with Diffusion for Flexible Behavior Synthesis". arXiv
  • Decision Diffuser, "Is Conditional Generative Modeling all you need for Decision-Making?". arXiv
  • Potential Based Diffusion Motion Planning, "Potential Based Diffusion Motion Planning". arXiv
  • GRIM: "GRIM: Task-Oriented Grasping with Conditioning on Generative Examples". OpenReview Website
  • World4Omni: "World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation". OpenReview Website
  • In-Context Policy Iteration: "In-Context Policy Iteration for Dynamic Manipulation". OpenReview Website
  • HDFlow: "HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Robotic Assembly". OpenReview Website
  • Mobile Manipulation with Active Inference: "Mobile Manipulation with Active Inference for Long-Horizon Rearrangement Tasks". OpenReview Website

3. World Models for Navigation

  • [⭐️] NWM, "Navigation World Models". arXiv Website
  • [⭐️] MindJourney: "MindJourney: Test-Time Scaling with World Models for Spatial Reasoning". arXiv Website
  • Test-Time Scaling: "Test-Time Scaling with World Models for Spatial Reasoning". arXiv OpenReview Website
  • Scaling Inference-Time Search: "Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension". OpenReview Website
  • FalconWing: "FalconWing: An Ultra-Light Fixed-Wing Platform for Indoor Aerial Applications". OpenReview Website
  • Foundation Models as World Models: "Foundation Models as World Models: A Foundational Study in Text-Based GridWorlds". OpenReview Website
  • Geosteering Through the Lens of Decision Transformers: "Geosteering Through the Lens of Decision Transformers: Toward Embodied Sequence Decision-Making". OpenReview Website
  • Latent Weight Diffusion: "Latent Weight Diffusion: Generating reactive policies instead of trajectories". OpenReview Website
  • Abstract Sim2Real: "Abstract Sim2Real through Approximate Information States". OpenReview Website
  • FLAM: "FLAM: Scaling Latent Action Models with Factorization". OpenReview Website
  • NavMorph: "NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments". arXiv Code
  • Unified World Models: "Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation". arXiv [code]
  • RECON, "Rapid Exploration for Open-World Navigation with Latent Goal Models". arXiv Website
  • WMNav: "WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation". arXiv Website
  • NaVi-WM, "Deductive Chain-of-Thought Augmented Socially-aware Robot Navigation World Model". arXiv Website
  • AIF, "Deep Active Inference with Diffusion Policy and Multiple Timescale World Model for Real-World Exploration and Navigation". arXiv
  • "Kinodynamic Motion Planning for Mobile Robot Navigation across Inconsistent World Models". arXiv
  • "World Model Implanting for Test-time Adaptation of Embodied Agents". arXiv
  • "Imaginative World Modeling with Scene Graphs for Embodied Agent Navigation". arXiv
  • [⭐️] Persistent Embodied World Models, "Learning 3D Persistent Embodied World Models". arXiv
  • "Perspective-Shifted Neuro-Symbolic World Models: A Framework for Socially-Aware Robot Navigation" arXiv
  • X-MOBILITY: "X-MOBILITY: End-To-End Generalizable Navigation via World Modeling". arXiv
  • MWM, "Masked World Models for Visual Control". arXiv Website Code

4. World Models for Locomotion

Locomotion:

  • [⭐️] Ego-VCP, "Ego-Vision World Model for Humanoid Contact Planning". arXiv Website Code
  • [⭐️] RWM-O, "Offline Robotic World Model: Learning Robotic Policies without a Physics Simulator". arXiv
  • [⭐️] DWL: "Advancing Humanoid Locomotion: Mastering Challenging Terrains with Denoising World Model Learning". arXiv
  • HRSSM: "Learning Latent Dynamic Robust Representations for World Models". arXiv Code
  • WMP: "World Model-based Perception for Visual Legged Locomotion". arXiv Website
  • TrajWorld, "Trajectory World Models for Heterogeneous Environments". arXiv Code
  • Puppeteer: "Hierarchical World Models as Visual Whole-Body Humanoid Controllers". arXiv Code
  • ProTerrain: "ProTerrain: Probabilistic Physics-Informed Rough Terrain World Modeling". arXiv
  • Occupancy World Model, "Occupancy World Model for Robots". arXiv
  • [⭐️] "Accelerating Model-Based Reinforcement Learning with State-Space World Models". arXiv
  • [⭐️] "Learning Humanoid Locomotion with World Model Reconstruction". arXiv
  • [⭐️] Robotic World Model: "Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics". arXiv

Loco-Manipulation:

  • [⭐️] 1X World Model, 1X World Model. Blog
  • [⭐️] GROOT-Dreams, "Dream Come True — NVIDIA Isaac GR00T-Dreams Advances Robot Training With Synthetic Data and Neural Simulation". Blog
  • Humanoid World Models: "Humanoid World Models: Open World Foundation Models for Humanoid Robotics". arXiv
  • Ego-Agent, "EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds". arXiv
  • D^2PO, "World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning" arXiv
  • COMBO: "COMBO: Compositional World Models for Embodied Multi-Agent Cooperation". arXiv Website Code
  • Scalable Humanoid Whole-Body Control: "Scalable Humanoid Whole-Body Control via Differentiable Neural Network Dynamics". OpenReview Website
  • HuWo: "HuWo: Building Physical Interaction World Models for Humanoid Robot Locomotion". OpenReview Website
  • Bridging the Sim-to-Real Gap: "Bridging the Sim-to-Real Gap in Humanoid Dynamics via Learned Nonlinear Operators". OpenReview Website

5. World Models x VLAs

Unifying World Models and VLAs in one model:

  • [⭐️] CoT-VLA: "CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models". arXiv Website
  • [⭐️] UP-VLA, "UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent". arXiv Code
  • [⭐️] VPP, "Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations". arXiv Website
  • [⭐️] FLARE: "FLARE: Robot Learning with Implicit World Modeling". arXiv Code Website
  • [⭐️] MinD: "MinD: Unified Visual Imagination and Control via Hierarchical World Models". arXiv Website
  • [⭐️] DreamVLA, "DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge". arXiv Code Website
  • [⭐️] WorldVLA: "WorldVLA: Towards Autoregressive Action World Model". arXiv Code
  • 3D-VLA: "3D-VLA: A 3D Vision-Language-Action Generative World Model". arXiv
  • LAWM: "Latent Action Pretraining Through World Modeling". arXiv Code
  • [⭐️] UniVLA: "UniVLA: Unified Vision-Language-Action Model". arXiv Code
  • [⭐️] dVLA, "dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought". arXiv
  • [⭐️] Vidar, "Vidar: Embodied Video Diffusion Model for Generalist Manipulation". arXiv
  • [⭐️] UD-VLA, "Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process". arXiv Code Website
  • Goal-VLA: "Goal-VLA: Image-Generative VLMs as Object-Centric World Models Empowering Zero-shot Robot Manipulation". arXiv Website

Combining World Models and VLAs:

  • [⭐️] Ctrl-World: "Ctrl-World: A Controllable Generative World Model for Robot Manipulation". arXiv Website Code
  • VLA-RFT: "VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators". arXiv
  • World-Env: "World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training". arXiv
  • [⭐️] Self-Improving Embodied Foundation Models, "Self-Improving Embodied Foundation Models". arXiv
  • GigaBrain-0, GigaBrain-0: A World Model-Powered Vision-Language-Action Model. arXiv Website
  • NinA: "NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows". OpenReview Website
  • Ada-Diffuser: "Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making". OpenReview Website
  • Steering Diffusion Policies: "Steering Diffusion Policies with Value-Guided Denoising". OpenReview Website
  • SPUR: "SPUR: Scaling Reward Learning from Human Demonstrations". OpenReview Website
  • A Smooth Sea Never Made a Skilled SAILOR: "A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search". OpenReview Website
  • RADI: "RADI: LLMs as World Models for Robotic Action Decomposition and Imagination". OpenReview Website
  • WMPO: "WMPO: World Model-based Policy Optimization for Vision-Language-Action Models". arXiv Website

6. World Models x Policy Learning

This subsection focuses on general policy-learning methods in embodied intelligence that leverage world models; a minimal sketch of imagination-style policy training follows the list below.

  • [⭐️] UWM, "Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets". arXiv Website
  • [⭐️] UVA, Unified Video Action Model. arXiv Website Code
  • DiWA, "DiWA: Diffusion Policy Adaptation with World Models". arXiv Code
  • [⭐️] Dreamer 4, "Training Agents Inside of Scalable World Models". arXiv Website
  • Latent Action Learning Requires Supervision: "Latent Action Learning Requires Supervision in the Presence of Distractors". OpenReview Website
  • Beyond Experience: "Beyond Experience: Fictive Learning as an Inherent Advantage of World Models". OpenReview Website
  • Robotic World Model: "Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics". OpenReview Website
  • Sim-to-Real Contact-Rich Pivoting: "Sim-to-Real Contact-Rich Pivoting via Optimization-Guided RL with Vision and Touch". OpenReview Website
  • Hierarchical Task Environments: "Hierarchical Task Environments as the Next Frontier for Embodied World Models in Robot Soccer". OpenReview Website
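
As a rough illustration of this recipe (a simplification of mine, loosely in the spirit of Dreamer-style training, not any single paper's exact method), the sketch below optimizes an actor purely on differentiable rollouts of a latent world model. The dynamics and reward modules here are untrained placeholders standing in for learned components.

```python
# Minimal "learning in imagination" sketch: the actor never touches the real
# environment in this inner loop; it is trained only on world-model rollouts.
# All modules are hypothetical placeholders, not any paper's actual architecture.
import torch
import torch.nn as nn

latent_dim, action_dim, horizon = 8, 2, 10

dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 64), nn.Tanh(),
                         nn.Linear(64, latent_dim))          # stand-in for learned p(z' | z, a)
reward = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(),
                       nn.Linear(32, 1))                      # stand-in for learned r(z)
actor = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(),
                      nn.Linear(32, action_dim), nn.Tanh())   # policy pi(a | z)

# Freeze the world-model placeholders; only the actor is optimized here.
for p in list(dynamics.parameters()) + list(reward.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
for step in range(100):
    z = torch.randn(32, latent_dim)            # batch of imagined start states
    imagined_return = 0.0
    for _ in range(horizon):                    # differentiable imagined rollout
        a = actor(z)
        z = dynamics(torch.cat([z, a], dim=-1))
        imagined_return = imagined_return + reward(z).mean()
    loss = -imagined_return                     # maximize predicted return
    opt.zero_grad()
    loss.backward()
    opt.step()
```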

7. World Models for Policy Evaluation

Real-world policy evaluation is expensive and noisy. The promise of world models is that, by accurately capturing environment dynamics, they can serve as surrogate evaluation environments whose scores correlate strongly with real-world policy performance (a minimal sketch of this surrogate-evaluation loop follows the list below). Before world models, simulators played this role:

  • [⭐️] Simpler, "Evaluating Real-World Robot Manipulation Policies in Simulation". arXiv Code
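
The surrogate-evaluation idea can be made concrete with a minimal sketch (mine, not taken from any paper listed here): roll the policy out inside the learned model and use the imagined return as its score. The names `world_model`, `reward_model`, and `policy` are hypothetical placeholders.

```python
# Minimal sketch of world-model-based policy evaluation: imagined rollouts
# inside a learned dynamics/reward model act as a cheap stand-in for real trials.
import numpy as np

def evaluate_in_world_model(policy, world_model, reward_model,
                            init_obs, horizon=50, n_rollouts=16):
    """Average imagined return of `policy` under the learned world model."""
    returns = []
    for _ in range(n_rollouts):
        obs, total = init_obs, 0.0
        for _ in range(horizon):
            action = policy(obs)                # policy acts on the imagined observation
            obs = world_model(obs, action)      # predicted next observation
            total += reward_model(obs, action)  # predicted reward
        returns.append(total)
    return float(np.mean(returns))

# Toy stand-ins so the sketch runs end-to-end (not learned models).
rng = np.random.default_rng(0)
world_model = lambda o, a: o + 0.1 * a + 0.01 * rng.normal(size=o.shape)
reward_model = lambda o, a: -float(np.sum(o ** 2))   # e.g. "stay near the origin"
policy = lambda o: -o                                 # simple stabilizing policy
print(evaluate_in_world_model(policy, world_model, reward_model,
                              init_obs=rng.normal(size=4)))
```

In practice, the surrogate score is validated by checking how well its ranking of candidate policies correlates with their measured real-world returns.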

Using world models as policy evaluators:

  • [⭐️] WorldGym, "WorldGym: Evaluating Robot Policies in a World Model". arXiv Website
  • [⭐️] WorldEval: "WorldEval: World Model as Real-World Robot Policies Evaluator". arXiv Website
  • [⭐️] WoW!: "WOW!: World Models in a Closed-Loop World". OpenReview Website
  • Cosmos-Surg-dVRK: "Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning". arXiv

World Models for Science

Natural Science:

  • [⭐️] CellFlux, "CellFlux: Simulating Cellular Morphology Changes via Flow Matching". arXiv Website.
  • CheXWorld, "CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning". arXiv Code
  • EchoWorld: "EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance". arXiv Code
  • ODesign, "ODesign: A World Model for Biomolecular Interaction Design." arXiv Website
  • [⭐️] SFP, "Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models". arXiv
  • Xray2Xray, "Xray2Xray: World Model from Chest X-rays with Volumetric Context". arXiv
  • [⭐️] Medical World Model: "Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning". arXiv
  • Surgical Vision World Model, "Surgical Vision World Model". arXiv

Social Science:

  • Social World Models, "Social World Models". arXiv
  • "Social World Model-Augmented Mechanism Design Policy Learning". arXiv
  • SocioVerse, "SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users". arXiv Code
  • Effectively Designing 2-Dimensional Sequence Models: "Effectively Designing 2-Dimensional Sequence Models for Multivariate Time Series". OpenReview Website
  • A Virtual Reality-Integrated System: "A Virtual Reality-Integrated System for Behavioral Analysis in Neurological Decline". OpenReview Website
  • TwinMarket: "TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets". OpenReview Website
  • Latent Representation Encoding: "Latent Representation Encoding and Multimodal Biomarkers for Post-Stroke Speech Assessment". OpenReview Website
  • Reconstructing Dynamics: "Reconstructing Dynamics from Steady Spatial Patterns with Partial Observations". OpenReview Website
  • SP: Learning Physics from Sparse Observations: "SP: Learning Physics from Sparse Observations — Three Pitfalls of PDE-Constrained Diffusion Models". OpenReview Website
  • SP: Continuous Autoregressive Generation: "SP: Continuous Autoregressive Generation with Mixture of Gaussians". OpenReview Website
  • EquiReg: "EquiReg: Symmetry-Driven Regularization for Physically Grounded Diffusion-based Inverse Solvers". OpenReview Website
  • Neural Modular World Model: "Neural Modular World Model". OpenReview Website
  • Bidding for Influence: "Bidding for Influence: Auction-Driven Diffusion Image Generation". OpenReview Website
  • PINT: "PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data". OpenReview Website
  • HEP-JEPA: "HEP-JEPA: A foundation model for collider physics". OpenReview Website

Positions on World Models

  • [⭐️] Video as the New Language for Real-World Decision Making, "Video as the New Language for Real-World Decision Making". arXiv
  • [⭐️] Critiques of World Models, "Critiques of World Models". arXiv
  • LAW, "Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning". arXiv
  • [⭐️] Compositional Generative Modeling: A Single Model is Not All You Need, "Compositional Generative Modeling: A Single Model is Not All You Need". arXiv
  • Interactive Generative Video as Next-Generation Game Engine, "Position: Interactive Generative Video as Next-Generation Game Engine". arXiv
  • A Proposal for Networks Capable of Continual Learning: "A Proposal for Networks Capable of Continual Learning". OpenReview Website
  • Towards Unified Expressive Policy Optimization: "Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning". OpenReview Website
  • Learning Intuitive Physics: "Opinion: Learning Intuitive Physics Requires More Than Visual Data". OpenReview Website
  • A Unified World Model: "Opinion: A Unified World Model is the cornerstone for integrating perception, reasoning, and decision-making in embodied AI". OpenReview Website
  • Small VLAs: "Opinion: Small VLAs Self-Learn Consistency". OpenReview Website
  • How Can Causal AI Benefit World Models?: "Opinion: How Can Causal AI Benefit World Models?". OpenReview Website

Theory & World Models Explainability

  • [⭐️] General agents Contain World Models, "General agents contain world models". arXiv
  • [⭐️] When Do Neural Networks Learn World Models? "When Do Neural Networks Learn World Models?" arXiv
  • What Does it Mean for a Neural Network to Learn a 'World Model'?, "What Does it Mean for a Neural Network to Learn a 'World Model'?". arXiv
  • Transformer cannot learn HMMs (sometimes), "On Limitation of Transformer for Learning HMMs". arXiv
  • [⭐️] Inductive Bias Probe, "What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models". arXiv
  • [⭐️] Dynamical Systems Learning for World Models, "When do World Models Successfully Learn Dynamical Systems?". arXiv
  • How Hard is it to Confuse a World Model?, "How Hard is it to Confuse a World Model?". arXiv
  • ICL Emergence, "Context and Diversity Matter: The Emergence of In-Context Learning in World Models". arXiv
  • [⭐️] Scaling Law, "Scaling Laws for Pre-training Agents and World Models". arXiv
  • LLM World Model, "Linear Spatial World Models Emerge in Large Language Models". arXiv Code
  • Revisiting Othello, "Revisiting the Othello World Model Hypothesis". arXiv
  • [⭐️] Transformers Use Causal World Models, "Transformers Use Causal World Models in Maze-Solving Tasks". arXiv
  • [⭐️] Causal World Model inside NTP, "A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment". arXiv
  • When do neural networks learn world models?: "When do neural networks learn world models?". OpenReview Website
  • Utilizing World Models: "Utilizing World Models for Adaptively Covariate Acquisition Under Limited Budget for Causal Decision Making Problem". OpenReview Website

General Approaches to World Models

1. Foundation World Models

SOTA Models:

Interactive Video Generation:

  • [⭐️] Genie 3, "Genie 3: A new frontier for world models". Blog
  • [⭐️] V-JEPA 2, "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning". arXiv Website Code
  • [⭐️] Cosmos Predict 2.5 & Cosmos Transfer 2.5, "Cosmos Predict 2.5 & Transfer 2.5: Evolving the World Foundation Models for Physical AI". Blog Code
  • [⭐️] PAN, "PAN: A World Model for General, Interactable, and Long-Horizon World Simulation". arXiv Blog

3D Scene Generation:

  • [⭐️] RTFM, "RTFM: A Real-Time Frame Model". Blog
  • [⭐️] Marble, "Generating Bigger and Better Worlds". Blog
  • [⭐️] WorldGen, "WorldGen: From Text to Traversable and Interactive 3D Worlds". Blog OpenReview

Classics:

Genie Series:

  • [⭐️] Genie 2, "Genie 2: A Large-Scale Foundation World Model". Blog
  • [⭐️] Genie, "Genie: Generative Interactive Environments". arXiv Blog

V-JEPA Series:

  • [⭐️] V-JEPA: "V-JEPA: Video Joint Embedding Predictive Architecture". Blog arXiv Code

Cosmos Series:

  • [⭐️] Cosmos, "Cosmos World Foundation Model Platform for Physical AI". arXiv Code Blog

World-Lab Projects:

  • Generating Worlds, "Generating Worlds". Blog

Other Awesome Models:

  • [⭐️] Pandora, "Pandora: Towards General World Model with Natural Language Actions and Video States". arXiv Code
  • [⭐️] UniSim, "UniSim: Learning Interactive Real-World Simulators". arXiv Website
  • Masked Generative Priors: "Masked Generative Priors Improve World Models Sequence Modelling Capabilities". OpenReview Website
  • Recurrent world model: "Recurrent world model with tokenized latent states". OpenReview Website
  • Mixture-of-Transformers: "Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models". OpenReview Website
  • Mixture-of-Mamba: "Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity". OpenReview Website
  • Improving World Models: "Improving World Models using Supervision with Co-Evolving Linear Probes". OpenReview Website
  • MS-SSM: "MS-SSM: A Multi-Scale State Space Model for Enhanced Sequence Modeling". OpenReview Website
  • Fixed-Point RNNs: "Fixed-Point RNNs: From Diagonal to Dense in a Few Iterations". OpenReview Website
  • ACDiT: "ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer". OpenReview Website
  • FPAN: "FPAN: Mitigating Replication in Diffusion Models through the Fine-Grained Probabilistic Addition of Noise to Token Embeddings". OpenReview Website
  • SPARTAN: "SPARTAN: A Sparse Transformer World Model Attending to What Matters". arXiv OpenReview

2. Building World Models from 2D Vision Priors

The represents a "bottom-up" approach to achieving intelligence, sensorimotor before abstraction. In the 2D pixel space, world models often build upon pre-existing image/video generation approaches.

To what extent does Vision Intelligence exist in Video Generation Models:

  • [⭐️] Sora, "Video generation models as world simulators". [Technical report]
  • [⭐️] Veo 3 as a Zero-Shot Learner and Reasoner, "Video models are zero-shot learners and reasoners". arXiv Website
  • [⭐️] PhyWorld, "How Far is Video Generation from World Model: A Physical Law Perspective". arXiv Website Code
  • Emergent Few-Shot Learning in Video Diffusion Models, "From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models". arXiv
  • VideoVerse: "VideoVerse: How Far is Your T2V Generator from a World Model?". arXiv
  • [⭐️] Emu 3.5, "Emu3.5: Native Multimodal Models are World Learners". arXiv Website
  • [⭐️] Emu 3, "Emu3: Next-Token Prediction is All You Need". arXiv Website Code

Useful Approaches in Video Generation:

  • [⭐️] Diffusion Forcing, "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion". arXiv Website
  • [⭐️] DFoT, "History-Guided Video Diffusion". arXiv Website Code
  • [⭐️] Self-Forcing, "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion". arXiv Website Code
  • CausVid, "From Slow Bidirectional to Fast Causal Video Generators". arXiv Code Website
  • Longlive, "LongLive: Real-time Interactive Long Video Generation". arXiv Code Website
  • ControlNet, "Adding Conditional Control to Text-to-Image Diffusion Models". arXiv
  • ReCamMaster, "ReCamMaster: Camera-Controlled Generative Rendering from A Single Video". arXiv Code Website

From Video Generation Models to World Models:

  • [⭐️] Vid2World: "Vid2World: Crafting Video Diffusion Models to Interactive World Models". arXiv Website
  • AVID, "AVID: Adapting Video Diffusion Models to World Models". arXiv Code
  • IRASim, "IRASim: A Fine-Grained World Model for Robot Manipulation". arXiv Website
  • DWS, "Pre-Trained Video Generative Models as World Simulators". arXiv
  • Video Adapter, "Probabilistic Adaptation of Black-Box Text-to-Video Models". OpenReview Website
  • Video Agent, "VideoAgent: Self-Improving Video Generation". arXiv Website
  • WISA, "WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation". arXiv Website Code
  • Force Prompting, "Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals". arXiv Website Code

Pixel Space World Models:

  • [⭐️] Owl-1: "Owl-1: Omni World Model for Consistent Long Video Generation". arXiv
  • [⭐️] Long-Context State-Space Video World Models, "Long-Context State-Space Video World Models". arXiv Website
  • [⭐️] StateSpaceDiffuser: "StateSpaceDiffuser: Bringing Long Context to Diffusion World Models". arXiv
  • [⭐️] Geometry Forcing: "Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling". arXiv Website
  • Yume: "Yume: An Interactive World Generation Model". arXiv Website Code
  • PSI, "World Modeling with Probabilistic Structure Integration". arXiv
  • Martian World Models, "Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions". arXiv Website
  • WorldDreamer: "WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens". arXiv Code
  • EBWM: "Cognitively Inspired Energy-Based World Models". arXiv
  • "Video World Models with Long-term Spatial Memory". arXiv Website
  • VRAG, "Learning World Models for Interactive Video Generation". arXiv
  • DRAW, "Adapting World Models with Latent-State Dynamics Residuals". arXiv
  • ForeDiff, "Consistent World Models via Foresight Diffusion". arXiv
  • Distribution Recovery: "Distribution Recovery in Compact Diffusion World Models via Conditioned Frame Interpolation". OpenReview Website
  • EmbodiedScene: "EmbodiedScene: Towards Automated Generation of Diverse and Realistic Scenes for Embodied AI". OpenReview Website
  • Beyond Single-Step: "Beyond Single-Step: Multi-Frame Action-Conditioned Video Generation for Reinforcement Learning Environments". OpenReview Website
  • Adaptive Attention-Guided Masking: "Adaptive Attention-Guided Masking in Vision Transformers for Self-Supervised Hyperspectral Feature Learning". OpenReview Website
  • Implicit State Estimation: "Implicit State Estimation via Video Replanning". OpenReview Website
  • Enhancing Long Video Generation Consistency: "Enhancing Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory". OpenReview Website
  • Can Image-To-Video Models Simulate Pedestrian Dynamics?: "Can Image-To-Video Models Simulate Pedestrian Dynamics?". OpenReview Website
  • Eyes of the DINO: "Eyes of the DINO: Learning Physical World Models from Uncurated Web Videos". OpenReview Website
  • Video Self-Distillation: "Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception". OpenReview Website
  • Learning Skill Abstraction: "Learning Skill Abstraction from Action-Free Videos via Optical Flow". OpenReview Website
  • CRISP: "CRISP: Contact-guided Real2Sim from Monocular Video with Planar Scene Primitives". OpenReview Website
  • Whole-Body Conditioned Egocentric Video Prediction: "Whole-Body Conditioned Egocentric Video Prediction". arXiv OpenReview Website
  • Taming generative world models: "Taming generative world models for zero-shot optical flow extraction". arXiv OpenReview Website

3. Building World Models from 3D Vision Priors

3D meshes are also a useful representation of the physical world, offering benefits such as spatial consistency.

  • [⭐️] WorldGrow: "WorldGrow: Generating Infinite 3D World". arXiv Code
  • TRELLISWorld: "TRELLISWorld: Training-Free World Generation from Object Generators". arXiv
  • Terra: "Terra: Explorable Native 3D World Model with Point Latents". arXiv Website
  • MorphoSim: "MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator". arXiv [code]
  • EvoWorld: "EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory". arXiv Code
  • [⭐️] FantasyWorld: "FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction". arXiv
  • [⭐️] Aether: "Aether: Geometric-Aware Unified World Modeling". arXiv Website
  • HERO: "HERO: Hierarchical Extrapolation and Refresh for Efficient World Models". arXiv
  • UrbanWorld: "UrbanWorld: An Urban World Model for 3D City Generation". arXiv
  • DeepVerse: "DeepVerse: 4D Autoregressive Video Generation as a World Model". arXiv
  • EnerVerse-AC: "EnerVerse-AC: Envisioning Embodied Environments with Action Condition". OpenReview Website
  • Adapting a World Model: "Adapting a World Model for Trajectory Following in a 3D Game". OpenReview Website
  • SteerX: "SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering". OpenReview Website
  • SP-PhysicsNeRF: "SP-PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views". OpenReview Website

4. Building World Models from Language Priors

The represents a "top-down" approach to achieving intelligence, abstraction before sensorimotor.

Aiming to Advance LLM/VLM skills:

  • [⭐️] VLWM, "Planning with Reasoning using Vision Language World Model". arXiv
  • [⭐️] Agent Learning via Early Experience, "Agent Learning via Early Experience". arXiv
  • [⭐️] CWM, "CWM: An Open-Weights LLM for Research on Code Generation with World Models". arXiv Website Code
  • [⭐️] RAP, "Reasoning with Language Model is Planning with World Model". arXiv
  • SURGE, "SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors". arXiv Code
  • LLM-Sim: "Can Language Models Serve as Text-Based World Simulators?". arXiv Code
  • WorldLLM, "WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making". arXiv
  • LLMs as World Models, "LLMs as World Models: Data-Driven and Human-Centered Pre-Event Simulation for Disaster Impact Assessment". arXiv
  • [⭐️] LWM: "World Model on Million-Length Video And Language With RingAttention". arXiv Code
  • "Evaluating World Models with LLM for Decision Making". arXiv
  • LLMPhy: "LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models". arXiv
  • LLMCWM: "Language Agents Meet Causality -- Bridging LLMs and Causal World Models". arXiv Code
  • "Making Large Language Models into World Models with Precondition and Effect Knowledge". arXiv
  • CityBench: "CityBench: Evaluating the Capabilities of Large Language Model as World Model". arXiv Code

Aiming to enhance computer-use agent performance:

  • [⭐️] Neural-OS, "NeuralOS: Towards Simulating Operating Systems via Neural Generative Models". arXiv Website Code
  • R-WoM: "R-WoM: Retrieval-augmented World Model For Computer-use Agents". arXiv
  • [⭐️] SimuRA: "SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model". arXiv
  • WebSynthesis, "WebSynthesis: World-Model-Guided MCTS for Efficient WebUI-Trajectory Synthesis". arXiv
  • WKM: "Agent Planning with World Knowledge Model". arXiv Code
  • WebDreamer, "Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents". arXiv Code
  • "Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation". arXiv
  • WebEvolver: "WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model". arXiv
  • WALL-E 2.0: "WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents". arXiv Code
  • ViMo: "ViMo: A Generative Visual GUI World Model for App Agent". arXiv
  • [⭐️] Dyna-Think: "Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents". arXiv
  • FPWC, "Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach". arXiv

Symbolic World Models:

  • [⭐️] PoE-World, "PoE-World: Compositional World Modeling with Products of Programmatic Experts". arXiv Website
  • "One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration". arXiv Code
  • "Finite Automata Extraction: Low-data World Model Learning as Programs from Gameplay Video". arXiv
  • "Synthesizing world models for bilevel planning". arXiv
  • "Generating Symbolic World Models via Test-time Scaling of Large Language Models". arXiv Website

LLM-in-the-loop World Generation:

  • LatticeWorld, "LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation". arXiv
  • Text2World: "Text2World: Benchmarking Large Language Models for Symbolic World Model Generation". arXiv Website

5. Building World Models by Bridging Language and Vision Intelligence

A recent line of work bridges highly compressed semantic tokens (e.g. language) with information-sparse cues in the observation space (e.g. vision). This results in world models that combine high-level and low-level intelligence.

  • [⭐️] VAGEN, "VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents". arXiv Website
  • [⭐️] Semantic World Models, "Semantic World Models". arXiv Website
  • DyVA: "Can World Models Benefit VLMs for World Dynamics?". arXiv Website
  • From Foresight to Forethought: "From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment". OpenReview Website
  • SEAL: "SEAL: SEmantic-Augmented Imitation Learning via Language Model". OpenReview Website
  • Programmatic Video Prediction: "Programmatic Video Prediction Using Large Language Models". OpenReview Website
  • Emergent Stack Representations: "Emergent Stack Representations in Modeling Counter Languages Using Transformers". OpenReview Website
  • DIALOGUES BETWEEN ADAM AND EVE: "DIALOGUES BETWEEN ADAM AND EVE: EXPLORATION OF UNKNOWN CIVILIZATION LANGUAGE BY LLM". OpenReview Website
  • Memory Helps, but Confabulation Misleads: "Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs". OpenReview Website
  • Reframing LLM Finetuning: "Reframing LLM Finetuning Through the Lens of Bayesian Optimization". OpenReview Website
  • TrajEvo: "TrajEvo: Designing Trajectory Prediction Heuristics via LLM-driven Evolution". OpenReview Website
  • CCC: "CCC: Enhancing Video Generation via Structured MLLM Feedback". OpenReview Website
  • VLA-OS: "VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models". OpenReview Website
  • LLM-Guided Probabilistic Program Induction: "LLM-Guided Probabilistic Program Induction for POMDP Model Estimation". OpenReview Website
  • Decoupled Planning and Execution: "Decoupled Planning and Execution with LLM-Driven World Models for Efficient Task Planning". OpenReview Website
  • The Physical Basis of Prediction: "The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum". OpenReview Website
  • Avi: "Avi: A 3D Vision-Language Action Model Architecture generating Action from Volumetric Inference". OpenReview Website
  • Plan Verification: "Plan Verification for LLM-Based Embodied Task Completion Agents". OpenReview Website
  • SpatialThinker: "SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards". OpenReview Website
  • How Foundational Skills Influence VLM-based Embodied Agents: "How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective". OpenReview Website
  • Towards Fine-tuning a Small Vision-Language Model: "Towards Fine-tuning a Small Vision-Language Model for Aerial Navigation". OpenReview Website
  • Improvisational Reasoning: "Improvisational Reasoning with Vision-Language Models for Grounded Procedural Planning". OpenReview Website
  • Vision-Language Reasoning for Burn Depth Assessment: "Vision-Language Reasoning for Burn Depth Assessment with Structured Diagnostic Hypotheses". OpenReview Website
  • WALL-E: "WALL-E: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents". arXiv OpenReview Code
  • Puffin: "Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation". arXiv Website

6. Latent Space World Models

While learning in the observation space (pixels, 3D meshes, language, etc.) is a common approach, for many applications (planning, policy evaluation, etc.) learning in a latent space is sufficient, and is often believed to lead to even better performance.
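
As a minimal sketch of this setup (hypothetical names and shapes, loosely in the spirit of latent-feature world models such as DINO-WM but not their implementation): encode observations with a frozen pretrained encoder, learn only a predictor over latents, and keep the prediction loss entirely in latent space; planning can then roll out the same predictor.

```python
# Minimal latent-space world model sketch (hypothetical names/shapes).
# A frozen pretrained encoder produces latents z; only the latent predictor
# z_{t+1} ≈ f(z_t, a_t) is trained, with the loss computed in latent space.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, z_dim=384, act_dim=4, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))  # predicted next latent

def train_step(frozen_encoder, dynamics, optimizer, obs, action, next_obs):
    with torch.no_grad():                          # pretrained encoder stays frozen
        z, z_next = frozen_encoder(obs), frozen_encoder(next_obs)
    loss = (dynamics(z, action) - z_next).pow(2).mean()  # loss lives in latent space
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```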

  • [⭐️] DINO-WM, "DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning". arXiv Website
  • [⭐️] DINO-World, "Back to the Features: DINO as a Foundation for Video World Models". arXiv
  • [⭐️] DINO-Foresight, "DINO-Foresight: Looking into the Future with DINO". arXiv Code
  • AWM, "Learning Abstract World Models with a Group-Structured Latent Space". arXiv

JEPA is a special case of learning in latent space: the loss is placed on the latent space, and the encoder and predictor are trained jointly. JEPA is used not only in world models (e.g. V-JEPA 2-AC) but also in representation learning (e.g. I-JEPA, V-JEPA); we provide representative works from both perspectives below.
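
A hedged sketch of the JEPA training signal (hypothetical names; one common instantiation rather than the exact recipe of any paper below): the context encoder and predictor are trained to match the embedding that a target encoder assigns to the held-out (masked or future) view, with the target branch typically maintained as an exponential moving average of the context encoder.

```python
# Hedged JEPA-style update sketch (hypothetical module names, not the
# official I-JEPA/V-JEPA code). The loss is computed purely in latent space.
import torch
import torch.nn.functional as F

def jepa_step(context_enc, target_enc, predictor, optimizer,
              context_view, target_view, ema=0.996):
    with torch.no_grad():
        target = target_enc(target_view)            # no gradient to the target branch
    pred = predictor(context_enc(context_view))     # predict the target embedding
    loss = F.smooth_l1_loss(pred, target)           # latent-space loss, no pixels
    optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Target encoder tracks the context encoder via an exponential moving average.
    with torch.no_grad():
        for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1.0 - ema)
    return loss.item()
```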

  • [⭐️] I-JEPA, "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture". arXiv
  • IWM: "Learning and Leveraging World Models in Visual Representation Learning". arXiv
  • [⭐️] V-JEPA: "V-JEPA: Video Joint Embedding Predictive Architecture". Blog arXiv Code
  • [⭐️] V-JEPA Learns Intuitive Physics, "Intuitive physics understanding emerges from self-supervised pretraining on natural videos". arXiv Code
  • [⭐️] V-JEPA 2, "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning". arXiv Website Code
  • seq-JEPA: "seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models". arXiv
  • MC-JEPA, "MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features". arXiv

7. Building World Models from an Object-Centric Perspective

  • [⭐️] NPE, "A Compositional Object-Based Approach to Learning Physical Dynamics". arXiv
  • [⭐️] SlotFormer, "SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models". arXiv
  • Dyn-O, "Dyn-O: Building Structured World Models with Object-Centric Representations". arXiv
  • COMET, "Compete and Compose: Learning Independent Mechanisms for Modular World Models". arXiv
  • FPTT, "Transformers and Slot Encoding for Sample Efficient Physical World Modelling". arXiv Code
  • "Efficient Exploration and Discriminative World Model Learning with an Object-Centric Abstraction". arXiv
  • OC-STORM, "Objects matter: object-centric world models improve reinforcement learning in visually complex environments". arXiv
  • Object-Centric Latent Action Learning: "Object-Centric Latent Action Learning". Website
  • Unifying Causal and Object-centric Representation Learning: "Unifying Causal and Object-centric Representation Learning allows Causal Composition". Website
  • Object-Centric Representations: "Object-Centric Representations Generalize Better Compositionally with Less Compute". Website

8. Post-training and Inference-Time Scaling for World Models

  • [⭐️] RLVR-World: "RLVR-World: Training World Models with Reinforcement Learning". arXiv Website Code
  • RLIR, "Reinforcement Learning with Inverse Rewards for World Model Post-training". arXiv
  • Chrono-Edit, "ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation". arXiv Website Code
  • [⭐️] SWIFT, "Can Test-Time Scaling Improve World Foundation Model?". arXiv Code

9. World Models in the context of Model-Based RL

A significant proportion of world model algorithms and techniques stem from advances in Model-Based Reinforcement Learning around 2020; Dreamer (v1-v3) is the classic line of work from this era. We provide a list of these classical works as well as works following this line of thought.
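
For readers new to this line of work, the sketch below gives a rough picture of a Dreamer-style training iteration (hypothetical interfaces, not the authors' implementation): fit a latent world model from replayed experience, imagine rollouts of the current policy inside that model, and train an actor-critic purely on the imagined trajectories.

```python
# Rough Dreamer-style iteration sketch (hypothetical interfaces; not the
# official implementation of any listed paper).
def dreamer_iteration(world_model, actor, critic, replay_buffer, horizon=15):
    batch = replay_buffer.sample()

    # 1. World-model learning: fit latent dynamics, reward and reconstruction heads.
    model_loss, start_states = world_model.observe(batch)
    world_model.update(model_loss)

    # 2. Imagination: roll the current policy forward inside the learned latent
    #    dynamics, with no additional environment interaction.
    states, rewards = [start_states], []
    s = start_states
    for _ in range(horizon):
        a = actor(s)
        s, r = world_model.imagine_step(s, a)
        states.append(s); rewards.append(r)

    # 3. Behavior learning: actor and critic are trained on the imagined
    #    trajectories (e.g. via lambda-returns computed from predicted rewards).
    returns = critic.lambda_returns(states, rewards)
    actor.update(states, returns)
    critic.update(states, returns)
```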

  • [⭐️] Dreamer, "Dream to Control: Learning Behaviors by Latent Imagination". arXiv Code Website
  • [⭐️] Dreamerv2, "Mastering Atari with Discrete World Models". arXiv Code Website
  • [⭐️] Dreamerv3, "Mastering Diverse Domains through World Models". arXiv Code Website
  • DreamSmooth: "DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing". arXiv
  • [⭐️] TD-MPC2: "TD-MPC2: Scalable, Robust World Models for Continuous Control". arXiv Code
  • Hieros: "Hieros: Hierarchical Imagination on Structured State Space Sequence World Models". arXiv
  • CoWorld: "Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning". arXiv
  • HarmonyDream, "HarmonyDream: Task Harmonization Inside World Models". arXiv Code
  • DyMoDreamer, "DyMoDreamer: World Modeling with Dynamic Modulation". arXiv Code
  • "Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization". arXiv
  • PIGDreamer, "PIGDreamer: Privileged Information Guided World Models for Safe Partially Observable Reinforcement Learning". arXiv
  • [⭐️] Continual Reinforcement Learning by Planning with Online World Models, "Continual Reinforcement Learning by Planning with Online World Models". arXiv
  • Δ-IRIS: "Efficient World Models with Context-Aware Tokenization". arXiv Code
  • AD3: "AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors". arXiv
  • R2I: "Mastering Memory Tasks with World Models". arXiv Website Code
  • REM: "Improving Token-Based World Models with Parallel Observation Prediction". arXiv Code
  • AWM, "Do Transformer World Models Give Better Policy Gradients?"". arXiv
  • [⭐️] Dreaming of Many Worlds, "Dreaming of Many Worlds: Learning Contextual World Models Aids Zero-Shot Generalization". arXiv Code
  • PWM: "PWM: Policy Learning with Large World Models". arXiv Code
  • GenRL: "GenRL: Multimodal foundation world models for generalist embodied agents". arXiv Code
  • DLLM: "World Models with Hints of Large Language Models for Goal Achieving". arXiv
  • Adaptive World Models: "Adaptive World Models: Learning Behaviors by Latent Imagination Under Non-Stationarity". arXiv
  • "Reward-free World Models for Online Imitation Learning". arXiv
  • MoReFree: "World Models Increase Autonomy in Reinforcement Learning". arXiv Website
  • ROMBRL, "Policy-Driven World Model Adaptation for Robust Offline Model-based Reinforcement Learning". arXiv
  • "Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning". arXiv
  • [⭐️] MoSim: "Neural Motion Simulator Pushing the Limit of World Models in Reinforcement Learning". arXiv
  • SENSEI: "SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models". arXiv Website
  • Spiking World Model, "Implementing Spiking World Model with Multi-Compartment Neurons for Model-based Reinforcement Learning". arXiv
  • DCWM, "Discrete Codebook World Models for Continuous Control". arXiv
  • Multimodal Dreaming: "Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning". arXiv
  • "Generalist World Model Pre-Training for Efficient Reinforcement Learning". arXiv
  • "Learning To Explore With Predictive World Model Via Self-Supervised Learning". arXiv
  • Simulus: "Uncovering Untapped Potential in Sample-Efficient World Model Agents". arXiv
  • DMWM: "DMWM: Dual-Mind World Model with Long-Term Imagination". arXiv
  • EvoAgent: "EvoAgent: Agent Autonomous Evolution with Continual World Model for Long-Horizon Tasks". arXiv
  • GLIMO: "Grounding Large Language Models In Embodied Environment With Imperfect World Models". arXiv
  • Energy-based Transition Models, "Offline Transition Modeling via Contrastive Energy Learning". OpenReview Code
  • PCM, "Policy-conditioned Environment Models are More Generalizable". OpenReview Website Code
  • Temporal Difference Flows: "Temporal Difference Flows". OpenReview Website
  • Improving Transformer World Models: "Improving Transformer World Models for Data-Efficient RL". OpenReview Website
  • Accelerating Goal-Conditioned RL: "Accelerating Goal-Conditioned RL Algorithms and Research". OpenReview Website
  • LEARNING FROM LESS: "LEARNING FROM LESS: SINDY SURROGATES IN RL". OpenReview Website
  • Combining Unsupervised and Offline RL: "Combining Unsupervised and Offline RL via World Models". OpenReview Website
  • World Models as Reference Trajectories: "World Models as Reference Trajectories for Rapid Motor Adaptation". arXiv OpenReview
  • Stress-Testing Offline Reward-Free Reinforcement Learning: "Stress-Testing Offline Reward-Free Reinforcement Learning: A Case for Planning with Latent Dynamics Models". OpenReview Website
  • Decentralized Transformers with Centralized Aggregation: "Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models". OpenReview Website
  • Model-based Offline Reinforcement Learning: "Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning". OpenReview Website
  • BiD: "BiD: Behavioral Agents in Dynamic Auctions". OpenReview Website
  • Pushing the Limit: "Pushing the Limit of Sample-Efficient Offline Reinforcement Learning". OpenReview Website
  • Learning from Reward-Free Offline Data: "Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models". OpenReview Website
  • SP: JEPA-WMs: "SP: JEPA-WMs: On Planning with Joint-Embedding Predictive World Models". OpenReview Website
  • DAWM: "DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions". OpenReview Website
  • Revisiting Multi-Agent World Modeling: "Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective". OpenReview Website
  • Communicating Plans, Not Percepts: "Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models". OpenReview Website
  • Exploring exploration: "Exploring exploration with foundation agents in interactive environments". OpenReview Website
  • Adversarial Diffusion: "Adversarial Diffusion for Robust Reinforcement Learning". OpenReview Website
  • Learning to Focus: "Learning to Focus: Prioritizing Informative Histories with Structured Attention Mechanisms in Partially Observable Reinforcement Learning". OpenReview Website
  • PolicyGRID: "PolicyGRID: Acting to Understand, Understanding to Act". OpenReview Website
  • Stable Planning: "Stable Planning through Aligned Representations in Model-Based Reinforcement Learning". OpenReview Website
  • JOWA: "Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining". OpenReview Code
  • LS-Imagine: "Open-World Reinforcement Learning over Long Short-Term Imagination". OpenReview
  • TWISTER: "Learning Transformer-based World Models with Contrastive Predictive Coding". OpenReview Website
  • WAKER: "Reward-Free Curricula for Training Robust World Models". OpenReview
  • THICK: "Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics". OpenReview Code

10. World Models in Other Modalities

  • Graph World Model, "Graph World Model". arXiv Website

11. Memory in World Models

Implicit Memory:

  • [⭐️] Context as Memory, "Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval". arXiv Website
  • [⭐️] History-Guided Video Diffusion, "History-Guided Video Diffusion". arXiv Website
  • [⭐️] Mixture of Contexts for Long Video Generation, "Mixture of Contexts for Long Video Generation". arXiv Website

Explicit Memory:

  • [⭐️] WonderWorld, "WonderWorld: Interactive 3D Scene Generation from a Single Image". arXiv Website

Evaluating World Models

World Models in the Language Modality:

  • Evaluating the World Model Implicit in a Generative Model, "Evaluating the World Model Implicit in a Generative Model". arXiv Code
  • "Benchmarking World-Model Learning". arXiv
  • WM-ABench: "Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation". arXiv Website
  • UNIVERSE: "Adapting Vision-Language Models for Evaluating World Models". arXiv
  • WorldPrediction: "WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning". arXiv
  • EVA: "EVA: An Embodied World Model for Future Video Anticipation". arXiv Website
  • AeroVerse: "AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models". arXiv

World Models in the Pixel Space:

  • World-in-World: "World-in-World: World Models in a Closed-Loop World". arXiv Website
  • WorldPrediction: "WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning". arXiv
  • "Toward Memory-Aided World Models: Benchmarking via Spatial Consistency". arXiv Website Code
  • SimWorld: "SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model". arXiv Code
  • EWMBench: "EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models". arXiv Code
  • "Toward Stable World Models: Measuring and Addressing World Instability in Generative Environments". arXiv
  • WorldModelBench: "WorldModelBench: Judging Video Generation Models As World Models". arXiv Website
  • EVA: "EVA: An Embodied World Model for Future Video Anticipation". arXiv Website
  • ACT-Bench: "ACT-Bench: Towards Action Controllable World Models for Autonomous Driving". arXiv
  • WorldSimBench: "WorldSimBench: Towards Video Generation Models as World Simulators". arXiv Website
  • WorldScore, "WorldScore: A Unified Evaluation Benchmark for World Generation". arXiv Website
  • "Imagine the Unseen World: A Benchmark for Systematic Generalization in Visual World Models". arXiv

World Models in 3D Mesh Space:

  • OmniWorld: "OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling". arXiv Website

World Models in other modalities:

  • "Beyond Simulation: Benchmarking World Models for Planning and Causality in Autonomous Driving". arXiv

Physically Plausible World Models:

  • Newton: "Newton - A Small Benchmark for Interactive Foundation World Models". Website
  • Text2World: "Text2World: Benchmarking World Modeling Capabilities of Large Language Models via Program Synthesis". Website
  • AetherVision-Bench: "AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives". Website
  • VideoPhy-2: "VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation". Website
  • A Comprehensive Evaluation: "A Comprehensive Evaluation of Physical Realism in Text-to-Video Models". Website
  • ScenePhys: "ScenePhys — Controllable Physics Videos for World-Model Evaluation". Website
  • OpenGVL: "OpenGVL - Benchmarking Visual Temporal Progress for Data Curation". Website

Acknowledgements

This project is largely built on the foundations laid by:

Huge shoutout to the authors for their awesome work.


Citation

If you find this repository useful, please consider citing this list:

@misc{huang2025awesomeworldmodels,
    title = {Awesome-World-Models},
    author = {Siqiao Huang},
    howpublished = {GitHub repository},
    url = {https://github.com/knightnemo/Awesome-World-Models},
    year = {2025},
}

All Thanks to Our Contributors


Star History

Star History Chart
