Stars
[NeurIPS 2025] Reasoning MLLM, Share-GRPO, advantage vanishing, sparse reward
AgentFlow: In-the-Flow Agentic System Optimization
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
MMSearch-R1 is an end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools.
An open-source AI agent that brings the power of Gemini directly into your terminal.
A powerful tool for creating fine-tuning datasets for LLMs
MM-Eureka V0 (also called R1-Multimodal-Journey); the latest version is in MM-Eureka
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024]
MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
[NeurIPS 2024 Best Paper Award] [GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". A…
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
Large World Model -- Modeling Text and Video with Millions of Tokens of Context
VideoSys: An easy and efficient system for video generation
Open-Sora: Democratizing Efficient Video Production for All
[TMM 2025🔥] Mixture-of-Experts for Large Vision-Language Models
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
A Next-Generation Training Engine Built for Ultra-Large MoE Models
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding