- yunlong10.github.io
- in/yolo-yunlong-tang
- @yoloytang
Stars
Fine-Tuning MLLMs with Reasoning Priors from DeepSeek-R1
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images (AAAI 2023)
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Building a Foundational Guardrail for General Agentic Systems via Synthetic Data
🔥🔥🔥 Latest Papers, Codes and Datasets on Video-LMM Post-Training
VideoNSA: Native Sparse Attention Scales Video Understanding
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
[EMNLP 2025 Oral] Official codebase for Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors.
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
🔥 🔥 🔥 Awesome MLLMs/Benchmarks for Short/Long/Streaming Video Understanding 📹
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
Repository for the preprint "LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation"
Latest Papers, Codes and Datasets on VTG-LLMs.
video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is developed by the Department of Electronic Engineering at Tsin…
🚀 Efficient implementations of state-of-the-art linear attention models
Reference PyTorch implementation and models for DINOv3
Streamlining Cartoon Production with Generative Post-Keyframing
✏️ Storyboarder makes it easy to visualize a story as fast as you can draw stick figures.
Accompanying material for the sleep-time compute paper
Renderer for the harmony response format to be used with gpt-oss
Structured Video Comprehension of Real-World Shorts

