Trending Papers

Submitted by

AdinaY

Depth Anything 3: Recovering the Visual Space from Any Views

Depth Anything 3 (DA3) uses a plain transformer for geometry prediction from visual inputs, achieving state-of-the-art results in camera pose estimation, any-view geometry, visual rendering, and monocular depth estimation.

ByteDance Seed · Published on Nov 13, 2025

Upvote

30

GitHub 1.15k arXiv Page

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors · Published on Oct 8, 2024

Upvote

8

GitHub 23.2k arXiv Page

Submitted by

taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style visual encoder and ERNIE-4.5 language model, achieves state-of-the-art performance in document parsing with minimal resource consumption.

PaddlePaddle · Published on Oct 16, 2025

Upvote

90

GitHub 63.9k arXiv Page

Submitted by

daixufang

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Agent Lightning is a flexible RL framework for training LLMs in various agents, using a hierarchical RL algorithm and decoupling execution from training to handle complex interactions.

8 authors · Published on Aug 5, 2025

Upvote

119

GitHub 8.39k arXiv Page

Submitted by

akhaliq

FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

FilmAgent, a multi-agent collaborative framework based on LLMs, automates film production processes including scriptwriting, cinematography, and actor positioning, outperforming other models in human evaluations.

10 authors · Published on Jan 22, 2025

Upvote

73

GitHub 1.05k arXiv Page

Submitted by

dyyyyyyyy

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Flawed-Aware Policy Optimization (FAPO) enhances reinforcement learning with verifiable rewards by penalizing flawed-positive rollouts, improving reasoning capability and training stability in large language models.

6 authors · Published on Oct 26, 2025

Upvote

6

GitHub 15.8k arXiv Page

Submitted by

taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors · Published on Sep 26, 2025

Upvote

131

GitHub 48.9k arXiv Page

Submitted by

wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

18 authors · Published on Sep 27, 2024

Upvote

31

GitHub 48.9k arXiv Page

Submitted by

YiZhouDenseHub

Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B

VibeThinker-1.5B, a 1.5B-parameter model using the Spectrum-to-Signal Principle, achieves superior reasoning capabilities compared to larger models at a significantly lower cost.

WeiboAI · Published on Nov 9, 2025

Upvote

91

GitHub 281 arXiv Page

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

9 authors · Published on Oct 23, 2024

Upvote

5

GitHub 49.6k arXiv Page

Submitted by

akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors · Published on Mar 20, 2024

Upvote

168

GitHub 62.5k arXiv Page

Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models

A general agent framework enabling adaptive writing through recursive task decomposition and dynamic integration of retrieval, reasoning, and composition outperforms existing methods.

5 authors · Published on Mar 11, 2025

Upvote

3

GitHub 623 arXiv Page

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors · Published on Feb 8, 2025

Upvote

4

GitHub 15.4k arXiv Page

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors · Published on Dec 28, 2024

Upvote

10

GitHub 24.9k arXiv Page

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors · Published on Apr 28, 2025

Upvote

32

GitHub 43.2k arXiv Page

Submitted by

YuWangX

MIRIX: Multi-Agent Memory System for LLM-Based Agents

MIRIX, a modular multi-agent memory system, enhances language models' memory capabilities by integrating diverse memory types and a dynamic framework, achieving superior performance in multimodal and long-form conversation benchmarks.

2 authors · Published on Jul 10, 2025

Upvote

76

GitHub 2.85k arXiv Page

Submitted by

Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Data Intelligence Lab@HKU · Published on Oct 14, 2025

Upvote

48

GitHub 10.3k arXiv Page

Submitted by

zhangshaolei

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

DeepAnalyze-8B, an agentic LLM, autonomously completes the data science pipeline from raw data to research reports using curriculum-based training and data-grounded trajectory synthesis.

RUC-DataLab · Published on Oct 19, 2025

Upvote

98

GitHub 2.07k arXiv Page

Submitted by

Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state of the art in multimodal tasks.

47 authors · Published on Apr 14, 2025

Upvote

301

arXiv Page

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT in the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, critical for enterprise use cases.

5 authors · Published on Jan 20, 2025

Upvote

6

GitHub 20.2k arXiv Page

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

The PyTorch distributed data parallel module optimizes large-scale model training using techniques like gradient bucketing, computation-communication overlap, and selective synchronization to achieve near-linear scalability.

11 authors · Published on Jun 28, 2020

Upvote

2

GitHub 95.1k arXiv Page

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

PyTorch Fully Sharded Data Parallel (FSDP) enables efficient and scalable training of large models across hardware configurations.

16 authors · Published on Apr 21, 2023

Upvote

3

GitHub 95.1k arXiv Page

Submitted by

taesiri

JoyAgent-JDGenie: Technical Report on the GAIA

A generalist agent architecture combining multi-agent planning, hierarchical memory, and a refined tool suite outperforms existing systems in diverse tasks.

jingdong · Published on Oct 1, 2025

Upvote

3

GitHub 11k arXiv Page

Submitted by

AlexiaJM

Less is More: Recursive Reasoning with Tiny Networks

Tiny Recursive Model (TRM) achieves high generalization on complex puzzle tasks using a small, two-layer network with minimal parameters, outperforming larger language models.

Samsung SAIT AI Lab, Montreal · Published on Oct 6, 2025

Upvote

476

GitHub 5.6k arXiv Page

Submitted by

callanwu

Scaling Agents via Continual Pre-training

AgentFounder, a deep research agent model incorporating Agentic Continual Pre-training, achieves state-of-the-art performance in agentic tasks while maintaining strong tool-use ability.

22 authors · Published on Sep 16, 2025

Upvote

115

GitHub 17.2k arXiv Page

Submitted by

learn3r

WebSailor: Navigating Super-human Reasoning for Web Agent

WebSailor, a post-training methodology, enhances open-source LLMs with sophisticated reasoning to match proprietary systems in complex information-seeking tasks.

19 authors · Published on Jul 3, 2025

Upvote

121

GitHub 17.2k arXiv Page

Submitted by

richardxp888

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

WebWatcher, a multimodal agent with enhanced visual-language reasoning, outperforms existing agents in complex visual and textual information retrieval tasks using synthetic trajectories and reinforcement learning.

Alibaba-NLP · Published on Aug 7, 2025

Upvote

138

GitHub 17.2k arXiv Page

Submitted by

callanwu

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

WebShaper, a formalization-driven framework, synthesizes information-seeking datasets using set theory and Knowledge Projections to enhance reasoning structure and achieve top performance in open-sourced benchmarks.

Alibaba-NLP · Published on Jul 20, 2025

Upvote

59

GitHub 17.2k arXiv Page

Submitted by

callanwu

WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

WebSailor, a post-training methodology, enhances open-source models with systematic uncertainty reduction, matching proprietary agents' performance in complex information-seeking tasks.

17 authors · Published on Sep 16, 2025

Upvote

89

GitHub 17.2k arXiv Page

Submitted by

callanwu

WebDancer: Towards Autonomous Information Seeking Agency

The paper proposes a framework for building end-to-end information-seeking agents through a combination of data construction, trajectory sampling, supervised fine-tuning, and reinforcement learning, and demonstrates its effectiveness on information-seeking benchmarks.

12 authors · Published on May 28, 2025

Upvote

33

GitHub 17.2k arXiv Page

Submitted by

taesiri

WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

WebWeaver, a dual-agent framework, addresses open-ended deep research challenges by integrating adaptive planning and focused synthesis to produce high-quality, reliable reports.

12 authors · Published on Sep 16, 2025

Upvote

105

GitHub 17.2k arXiv Page

Submitted by

callanwu

ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

ReSum, a novel paradigm with periodic context summarization, enhances web agents' performance on knowledge-intensive tasks by overcoming context window limitations, achieving significant improvements over ReAct.

14 authors · Published on Sep 16, 2025

Upvote

78

GitHub 17.2k arXiv Page

Submitted by

cccczshao

Continuous Autoregressive Language Models

Continuous Autoregressive Language Models (CALM) improve language model efficiency by predicting continuous vectors instead of discrete tokens, reducing computational cost while maintaining performance.

Tencent · Published on Oct 31, 2025

Upvote

64

GitHub 560 arXiv Page

Submitted by

taesiri

Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

Time-to-Move (TTM) is a plug-and-play framework for motion- and appearance-controlled video generation using image-to-video (I2V) diffusion models, offering precise control over video content without requiring additional training.

Technion Israel Institute of Technology · Published on Nov 9, 2025

Upvote

49

GitHub 64 arXiv Page

Submitted by

xw-eric

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Agent S2, a compositional framework using Mixture-of-Grounding and Proactive Hierarchical Planning, achieves state-of-the-art performance in computer use automation across various benchmarks and operating systems.

Simular · Published on Apr 1, 2025

Upvote

26

GitHub 8.23k arXiv Page

Submitted by

taesiri

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

AgentScope enhances agentic applications by providing flexible tool-based interactions, unified interfaces, and advanced infrastructure based on the ReAct paradigm, supporting efficient and safe development and deployment.

23 authors · Published on Aug 22, 2025

Upvote

52

GitHub 13.9k arXiv Page

Submitted by

xw-eric

The Unreasonable Effectiveness of Scaling Agents for Computer Use

Behavior Best-of-N (bBoN) improves the reliability and success rates of computer-use agents by generating and selecting among multiple rollouts using behavior narratives, achieving state-of-the-art performance on OSWorld and strong generalization to different operating systems.

Simular · Published on Oct 2, 2025

Upvote

24

GitHub 8.23k arXiv Page

Submitted by

taesiri

Robot Learning from a Physical World Model

PhysWorld integrates video generation and physical world modeling to enable accurate robotic manipulation from visual demonstrations without real robot data.

DeepMind · Published on Nov 10, 2025

Upvote

25

GitHub 105 arXiv Page

Submitted by

ZhiyuanZeng

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

Reinforcement Learning with Adaptive Verifiable Environments (RLVE) improves language model reasoning by dynamically adjusting problem difficulty, outperforming static environments and traditional RL training.

17 authors · Published on Nov 10, 2025

Upvote

10

GitHub 115 arXiv Page

Submitted by

zoeyuchao

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

RLinf-VLA is a unified framework for scalable reinforcement learning training of vision-language-action models, offering improved performance and generalization compared to supervised fine-tuning.

RLinf · Published on Oct 8, 2025

Upvote

38

GitHub 1.26k arXiv Page

Submitted by

taesiri

FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution

Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.

7 authors · Published on Oct 14, 2025

Upvote

36

GitHub 839 arXiv Page

Submitted by

akhaliq

Transformer Explainer: Interactive Learning of Text-Generative Models

Transformer Explainer is an interactive visualization tool that allows non-experts to understand the inner workings of the GPT-2 model through real-time experimentation and visualization in a web browser.

8 authors

· Published on Aug 8, 2024

Upvote

172

GitHub 5.93k arXiv Page

Submitted by

hiyouga

Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

A framework called Easy Dataset synthesizes fine-tuning data from unstructured documents using a GUI and LLMs, improving domain-specific performance of LLMs while maintaining general knowledge.

7 authors

· Published on Jul 5, 2025

Upvote

51

GitHub 11.8k arXiv Page

Submitted by

weiminwang

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Ovi is a unified audio-video generation model using twin-DiT modules with blockwise cross-modal fusion, enabling natural synchronization and high-quality multimodal outputs.

Character.AI · Published on Sep 30, 2025

Upvote

32

GitHub 1.3k arXiv Page

Submitted by

akhaliq

3D Gaussian Splatting for Real-Time Radiance Field Rendering

A method using 3D Gaussians for scene representation and optimized rendering allows high-quality, real-time novel-view synthesis at 1080p resolution.

4 authors

· Published on Aug 8, 2023

Upvote

192

GitHub 19.5k arXiv Page

Submitted by

nielsr

DINOv3

DINOv3, a self-supervised learning model, achieves superior performance across various vision tasks by scaling datasets and models, addressing dense feature degradation, and enhancing flexibility with post-hoc strategies.

AI at Meta · Published on Aug 13, 2025

Upvote

278

GitHub 8.3k arXiv Page

Submitted by

jayw

ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

ChronoEdit addresses physical consistency in image editing by reframing it as a video generation problem, leveraging pretrained video models and temporal reasoning tokens.

NVIDIA · Published on Oct 5, 2025

Upvote

14

GitHub 531 arXiv Page

Submitted by

taesiri

Visual Spatial Tuning

A framework called Visual Spatial Tuning (VST) enhances the spatial abilities of Vision-Language Models (VLMs) through progressive training with specialized datasets, achieving state-of-the-art results on spatial benchmarks.

ByteDance Seed · Published on Nov 7, 2025

Upvote

46

GitHub 120 arXiv Page

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

A unified continuous speech tokenizer and model enable instruction-based free-form speech editing with high performance across multiple metrics.

inclusionAI · Published on Oct 26, 2025

Upvote

8

GitHub 369 arXiv Page

Submitted by

zhuiguang-ning

LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls

A fully automated data evolution framework, LoopTool, enhances tool-use capabilities of Large Language Models by iteratively refining data and model through a closed-loop process.

Shanghai Jiao Tong University · Published on Nov 12, 2025

Upvote

15

GitHub 20 arXiv Page

by AK and the research community