Discover, Discuss, and Read arXiv papers

Discover new, recommended papers

GitHub
Aristotle, developed by The Harmonic Team, achieved gold-medal-equivalent performance on the IMO 2025 by producing formally verified solutions to five out of six problems, leveraging a hybrid approach that integrates informal reasoning with a formal proof search algorithm in Lean 4.
Research from Boston University and LinkedIn demonstrates that vanilla Supervised Fine-Tuning (SFT) can achieve strong generalization capabilities comparable to or exceeding Reinforcement Learning (RL) methods, provided training data incorporates prompt diversity and Chain-of-Thought (CoT) supervision. The work attributes SFT's previously observed failures to data design issues rather than inherent algorithmic limitations.
1
RLAD trains large language models to discover and utilize natural language "reasoning abstractions," which are high-level descriptions of procedural and factual knowledge for problem-solving. This approach enhances reasoning by promoting structured exploration, outperforming baseline methods by up to 11.9 percentage points on math benchmarks like AIME 2025 and showing improved generalization across diverse domains.
107
Researchers from FAIR at Meta systematically investigated synthetic data in LLM pre-training, showing that while pure synthetic data isn't superior, mixing approximately 30% rephrased synthetic data with natural text can accelerate pre-training by 5-10x and potentially reduce irreducible loss. The study clarifies the conditional benefits and optimal application strategies for synthetic data across various scales.
Researchers at Pathway developed the Dragon Hatchling (BDH) architecture, which bridges the gap between Large Language Models and biologically plausible brain models through a system of locally interacting neuron particles. Its GPU-optimized variant (BDH-GPU) achieves performance comparable to Transformer models on language tasks, while demonstrating emergent modularity, monosemantic synapses, and adaptive sparsity that offers inherent interpretability.
ExGRPO introduces a principled framework for large language models that enhances reasoning capabilities by systematically managing and reusing valuable past experiences in reinforcement learning from verifiable rewards (RLVR). It achieves consistent performance gains of +3.5 to +7.6 points on reasoning benchmarks and stabilizes training for weaker models by optimizing intermediate-difficulty, low-entropy reasoning trajectories.
327
Simular Research introduces Behavior Best-of-N (bBoN), a framework that boosts Computer-Use Agent (CUA) reliability on complex digital tasks by generating and comparatively evaluating multiple full-length solution trajectories. bBoN achieves a new state-of-the-art 69.9% success rate on OSWorld, nearing human performance.
Google DeepMind introduces Dreamer 4, a world model agent that achieves the first successful diamond acquisition in Minecraft purely from offline data. The model demonstrates real-time, highly accurate simulation of complex game mechanics on a single GPU, enabling an agent to learn long-horizon tasks and generalize action understanding to new environments.
Computer-Use Agents (CUAs) are an increasingly deployed class of agents that take actions on GUIs to accomplish user goals. In this paper, we show that CUAs consistently exhibit Blind Goal-Directedness (BGD): a bias to pursue goals regardless of feasibility, safety, reliability, or context. We characterize three prevalent patterns of BGD: (i) lack of contextual reasoning, (ii) assumptions and decisions under ambiguity, and (iii) contradictory or infeasible goals. We develop BLIND-ACT, a benchmark of 90 tasks capturing these three patterns. Built on OSWorld, BLIND-ACT provides realistic environments and employs LLM-based judges to evaluate agent behavior, achieving 93.75% agreement with human annotations. We use BLIND-ACT to evaluate nine frontier models, including Claude Sonnet and Opus 4, Computer-Use-Preview, and GPT-5, observing high average BGD rates (80.8%) across them. We show that BGD exposes subtle risks that arise even when inputs are not directly harmful. While prompting-based interventions lower BGD levels, substantial risk persists, highlighting the need for stronger training- or inference-time interventions. Qualitative analysis reveals observed failure modes: execution-first bias (focusing on how to act over whether to act), thought-action disconnect (execution diverging from reasoning), and request-primacy (justifying actions due to user request). Identifying BGD and introducing BLIND-ACT establishes a foundation for future research on studying and mitigating this fundamental risk and ensuring safe CUA deployment.
GEM, an open-source environment simulator, provides a standardized, framework-agnostic platform for reinforcement learning research with agentic large language models. It facilitates the development of multi-turn, long-horizon agents capable of tool use and introduces REINFORCE with Return Batch Normalization (ReBN) as an effective algorithm for these complex settings.
190
There are no more papers matching your filters at the moment.