
Top 18 Frontier AI Papers (2026-05-06)


A curated selection of recent arXiv papers covering six areas: large language models, multimodality, AI agents, video generation, world models, and reinforcement learning. Compiled by Euler from web search.


🧠 Large Language Models (LLM)

1. Large Language Model Reasoning Failures

  • Authors:
     Peiyang Song, Pengrui Han, Noah Goodman
  • Published:
     2026-02-05 | arXiv:2602.06176
  • Link:
     https://arxiv.org/abs/2602.06176

Summary: This paper presents the first systematic survey of reasoning failures in large language models. It divides reasoning into "embodied" and "non-embodied" types, with non-embodied reasoning further split into intuitive (informal) and formal (logical) reasoning; along a second axis, it groups failures into three classes: failures intrinsic to LLM architectures, application-specific failures, and out-of-distribution failures. The result is a unified taxonomy for understanding and improving LLM reasoning.

Abstract:

Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific failures; and out-of-distribution failures.


2. Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models

  • Authors:
     Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar
  • Published:
     2025-04-06 (v4 2026-03-05) | arXiv:2504.04372
  • Link:
     https://arxiv.org/abs/2504.04372

Summary: The first large-scale empirical study of how code changes affect LLMs' fault-localization ability. Inspired by mutation testing, the authors develop a new evaluation methodology that reveals how code changes weaken or strengthen an LLM's ability to localize defects, offering key insights for applying LLMs to non-generative software maintenance tasks.

Abstract:

Generative Large Language Models (LLMs) are increasingly used in non-generative software maintenance tasks, such as fault localization (FL). Success in FL depends on a model’s ability to reason about code relationships. We present the first large-scale empirical investigation into the robustness of LLMs’ fault localizability, inspired by mutation testing. We develop a methodology to systematically evaluate how code changes impact the FL performance of LLMs, revealing when and why FL capabilities degrade or improve.
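The mutation-testing idea the abstract borrows can be made concrete. The sketch below is illustrative and assumes nothing about the paper's actual tooling or mutation operators: it applies one classic mutant (flipping a `<` comparison) via Python's `ast` module, the kind of small semantic change whose effect on LLM fault localization such a study would measure.

```python
import ast

# Illustrative sketch (not the paper's methodology): generate one classic
# mutant by rewriting `<` comparisons to `>=`, a small semantic change
# suitable for probing whether an LLM can still localize the fault.

class FlipLessThan(ast.NodeTransformer):
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.GtE() if isinstance(op, ast.Lt) else op
                    for op in node.ops]
        return node

src = "def in_bounds(x):\n    return x < 10\n"
tree = ast.fix_missing_locations(FlipLessThan().visit(ast.parse(src)))
mutant = ast.unparse(tree)  # requires Python 3.9+
print(mutant)  # the mutated body now reads: return x >= 10
```

Each such mutant, paired with the original, yields a controlled before/after probe of an LLM's fault-localization behavior.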


3. Colorful Talks with Graphs: Human-Interpretable Graph Encodings for Large Language Models

  • Authors:
     Angelo Zangari, Peyman et al.
  • Published:
     ACL Findings 2026 | arXiv:2602.10386
  • Link:
     https://arxiv.org/abs/2602.10386

Summary: Studies how to let LLMs understand and process graph-structured data interpretably. Proposes human-interpretable graph encodings that convert graph information into text an LLM can process, validates the approach across a range of graph-analysis tasks, and offers a new paradigm for applying LLMs to graph data.

Abstract:

We present Colorful Talks with Graphs, a study on human-interpretable graph encodings for large language models. We explore how to effectively encode graph structures into textual representations that LLMs can process, evaluating multiple encoding strategies across diverse graph analytics tasks. Our findings reveal that carefully designed graph encodings can significantly improve LLM performance on graph-related reasoning while maintaining interpretability.
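The abstract does not spell out the encoding strategies, but the core move, serializing a graph into text an LLM can read, is easy to sketch. The function below is a hypothetical adjacency-list-to-prose encoder, one plausible instance of such a strategy, not the paper's actual method.

```python
# Hypothetical sketch: one plausible way to serialize a graph into
# human-readable text for an LLM prompt. The paper's actual encoding
# strategies are not detailed in this digest.

def encode_graph_as_text(nodes, edges):
    """Render a directed graph as a natural-language adjacency description."""
    lines = [f"The graph has {len(nodes)} nodes: {', '.join(nodes)}."]
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
    for n in nodes:
        if adj[n]:
            lines.append(f"{n} is connected to {', '.join(adj[n])}.")
    return "\n".join(lines)

prompt = encode_graph_as_text(
    nodes=["A", "B", "C"],
    edges=[("A", "B"), ("A", "C")],
)
print(prompt)
# The graph has 3 nodes: A, B, C.
# A is connected to B, C.
```

The resulting string would be prepended to a task question (e.g. "Is there a path from B to C?") to form the LLM prompt.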


🎨 Multimodal

4. Beyond Language Modeling: An Exploration of Multimodal Pretraining

  • Authors:
     Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas et al.
  • Published:
     2026-03-03 | arXiv:2603.03276
  • Link:
     https://arxiv.org/abs/2603.03276

Summary: FAIR, Meta, and NYU jointly explore native multimodal pretraining with Transfusion, a unified autoregressive Transformer framework. Experiments reveal that representation autoencoders (RAE) can effectively unify visual understanding and generation, that world-modeling ability emerges from general data, and that MoE architectures scale efficiently. The work offers key lessons for moving beyond pure language models toward truly unified multimodal foundation models.

Abstract:

We explore native multimodal pretraining from scratch using Transfusion, a unified autoregressive transformer framework. Controlled multimodal pretraining experiments reveal key insights about unified visual representations, data complementarity, world modeling emergence from general data, and efficient scaling with Mixture-of-Experts architectures. This work provides a foundation for building true unified multimodal foundation models that engage directly with the visual world rather than relying on text as a lossy compression of reality.


5. MIRAGE: The Illusion of Visual Understanding

  • Authors:
     Mohammad Asadi, Jack W. O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, Euan Ashley
  • Published:
     2026-03-23 (v3 2026-04-02) | arXiv:2603.21687
  • Link:
     https://arxiv.org/abs/2603.21687

Summary: Uncovers deep mechanistic puzzles in the visual-language reasoning of multimodal AI systems. The study finds a systematic illusion of visual understanding in current VLMs: even without genuine visual comprehension, they can produce reasoning traces that look plausible and are hard to distinguish from correct ones. This finding raises fundamental questions about the capability boundaries of current multimodal systems.

Abstract:

Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and understand visual information. We then show how this behavior can lend the illusion of visual understanding by generating a reasoning trace, indistinguishable from a correct one, despite the absence of genuine visual comprehension.


6. GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

  • Authors:
     (to be added)
  • Published:
     2026-03-15 | arXiv:2603.13370
  • Link:
     https://arxiv.org/abs/2603.13370

Summary: The first systematic benchmark for evaluating VLM capabilities on multimodal graph learning. GraphVLM studies three paradigms, VLM as graph encoder, as feature enhancer, and as graph reasoner, and shows on six datasets from diverse domains that VLMs significantly improve multimodal graph learning, opening a new path for combining vision-language capabilities with graph-structured reasoning.

Abstract:

We present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning. GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: VLM-as-Encoder, VLM-as-Feature-Enhancer, and VLM-as-Graph-Reasoner. Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles.


🤖 AI Agent

7. Agentifying Agentic AI

  • Authors:
     Virginia Dignum, Frank Dignum et al.
  • Published:
     AAAI 2026 WMAC Bridge Program | arXiv:2511.17332
  • Link:
     https://arxiv.org/abs/2511.17332

Summary: Argues that agentic AI systems must combine data-driven learning with structured reasoning and coordination models. Tools from the AAMAS community, such as BDI architectures, communication protocols, and mechanism design, can complement the adaptability of foundation models, yielding agent systems that are powerful and flexible yet transparent, trustworthy, and accountable, charting a theoretical path toward genuine agent autonomy.

Abstract:

Agentic AI seeks to endow systems with sustained autonomy, reasoning, and interaction capabilities. This paper argues that in order to make agentic AI systems truly agents, learning-based mechanisms must be complemented by structured reasoning and coordination models. By aligning and complementing the adaptive power of foundation models with the explicit structure of agent-based reasoning, we outline a path toward systems that are not only capable and flexible, but also coherent, cooperative, and accountable.


8. AI Agents Under EU Law: A Compliance Architecture for AI Providers

  • Authors:
     Luca Nannini, Adam Leon Smith, Michele Joshua Maggini, Enrico Panai, Sandra Feliciano, Aleksandr Tiulkanov, Elena Maran, James Gealy, Piercosma Bisconti
  • Published:
     2026-04-08 | arXiv:2604.04604
  • Link:
     https://arxiv.org/abs/2604.04604

Summary: The first study to systematically map the compliance landscape for AI agent providers, covering the M/613 harmonised standards, the GPAI Code of Practice, the CRA harmonised standards programme, and the November 2025 Digital Omnibus proposals. Conclusion: high-risk agentic systems with untraceable behavioral drift cannot currently satisfy the AI Act's core requirements, and a provider's first compliance task is a complete inventory of the agent's external actions, data flows, connected systems, and affected persons.

Abstract:

This paper provides the first systematic regulatory mapping for AI agent providers integrating draft harmonised standards under M/613, the GPAI Code of Practice, the CRA harmonised standards programme, and the Digital Omnibus proposals. We conclude that high-risk agentic systems with untraceable behavioral drift cannot currently satisfy the AI Act’s essential requirements, and that the provider’s foundational compliance task is an exhaustive inventory of the agent’s external actions, data flows, connected systems, and affected persons.


9. The Auton Agentic AI Framework

  • Authors:
     (Snapchat et al.)
  • Published:
     2026-02-23 | arXiv:2602.23720
  • Link:
     https://arxiv.org/abs/2602.23720

Summary: Snapchat proposes a declarative autonomous-agent architecture that strictly separates the "Cognitive Blueprint" (a declarative specification of identity and capabilities) from the "Runtime Engine" (a platform-specific execution substrate). The paper formalizes an augmented POMDP with a latent reasoning space, introduces a hierarchical memory-consolidation architecture inspired by biological episodic memory, proposes a constraint-manifold framework for safe execution, and presents a three-level self-evolution mechanism with runtime optimizations that significantly reduce end-to-end latency for multi-step agent workflows.

Abstract:

The Auton Agentic AI Framework is a principled architecture for standardizing the creation, execution, and governance of autonomous agent systems. The framework is organized around a strict separation between the Cognitive Blueprint and the Runtime Engine. The paper formalizes the agent execution model as an augmented POMDP with a latent reasoning space, introduces a hierarchical memory consolidation architecture, defines a constraint manifold formalism for safety enforcement, and presents a three-level self-evolution framework with runtime optimizations that reduce end-to-end latency for multi-step agent workflows.
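The "augmented POMDP with a latent reasoning space" can be read against the standard POMDP tuple. The sketch below is a hedged reading, not the paper's exact formalism; the latent space $\mathcal{Z}$ and the policy signature are illustrative:

```latex
% Standard POMDP tuple, augmented (illustratively) with a latent reasoning space Z
\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega, R \rangle
\;\longrightarrow\;
\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{Z}, T, \Omega, R \rangle,
\qquad
\pi : \mathcal{O} \times \mathcal{Z} \to \Delta(\mathcal{A} \times \mathcal{Z})
```

Here $T$ is the state-transition kernel, $\Omega$ the observation function, and $R$ the reward; under this reading, the policy consumes the current observation and latent reasoning state and emits a distribution over both the next action and an updated latent state.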


🎬 Video Generation

10. Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

  • Authors:
     (Open-Sora team)
  • Published:
     2025-03-13 (v3) | arXiv:2503.09642
  • Link:
     https://arxiv.org/abs/2503.09642

Summary: Open-Sora 2.0 is a commercial-level video generation model trained for only $200k, roughly one-fifth to one-tenth the training cost of leading global systems. The report discloses the key techniques behind this efficiency breakthrough: data curation, model architecture, training strategy, and system optimization. Its quality is comparable to HunyuanVideo (open-source) and Runway Gen-3 Alpha (closed-source), offering a new paradigm for training top-tier video generation models at low cost.

Abstract:

We present Open-Sora 2.0, a commercial-level video generation model trained for only $200k. We demonstrate that the cost of training a top-performing video generation model is highly controllable. We detail all techniques that contribute to this efficiency breakthrough, including data curation, model architecture, training strategy, and system optimization. Open-Sora 2.0 is comparable to global leading video generation models including the open-source HunyuanVideo and the closed-source Runway Gen-3 Alpha.


11. Seedance 2.0: Advancing Video Generation for World Complexity

  • Authors:
     (ByteDance)
  • Published:
     2026-04-18 | arXiv:2604.14148
  • Link:
     https://arxiv.org/abs/2604.14148

Summary: ByteDance releases Seedance 2.0, a native multimodal audio-video generation model that directly generates 4-15 second audio-video content at 480p/720p output resolutions. Positioned as video generation for "world complexity", it markedly improves physical consistency and spatiotemporal coherence, advancing AI video generation toward handling complex real-world dynamic scenes.

Abstract:

Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. It supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. Seedance 2.0 advances video generation for world complexity, demonstrating significantly improved physical consistency and spatiotemporal coherence for real-world dynamic scenes.


12. Motion Attribution for Video Generation

  • Authors:
     (to be added)
  • Published:
     2026-01-15 | arXiv:2601.08828
  • Link:
     https://arxiv.org/abs/2601.08828

Summary: Proposes Motive, the first gradient-based framework for motion attribution in video generation, identifying which training clips improve or harm temporal dynamics. A motion-weighted loss mask isolates temporal dynamics from static appearance; on VBench the method achieves a 74.1% human-preference win rate, markedly improving motion smoothness and dynamic degree and providing a scientific basis for data curation in video generation.

Abstract:

We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model.
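A "motion-weighted loss mask" can be sketched in a few lines. The weighting below (normalized temporal differences of the target video) is a hypothetical stand-in for the paper's actual mask, meant only to show the shape of the idea: error in moving regions dominates the loss while static appearance is down-weighted.

```python
import numpy as np

# Hypothetical sketch of a motion-weighted loss in the spirit of Motive:
# up-weight reconstruction error where frames change, down-weight static
# regions. The paper's exact weighting scheme may differ.

def motion_weighted_loss(pred, target):
    """pred, target: (T, H, W) video arrays."""
    motion = np.abs(np.diff(target, axis=0))   # per-pixel temporal change, (T-1, H, W)
    weights = motion / (motion.sum() + 1e-8)   # normalize into a mask summing to ~1
    err = (pred[1:] - target[1:]) ** 2         # align error with the diffed frames
    return float((weights * err).sum())

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8, 8))
pred = target + 0.1                            # uniform error of 0.1 everywhere
print(motion_weighted_loss(pred, target))      # ≈ 0.01: constant error × mask summing to 1
```

With a uniform error the mask is inert; the weighting only matters when errors concentrate in moving versus static regions, which is exactly the separation the attribution framework relies on.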


🌍 World Models

13. A Mechanistic View on Video Generation as World Models: State and Dynamics

  • Authors:
     (to be added)
  • Published:
     2026-01-19 | arXiv:2601.17067
  • Link:
     https://arxiv.org/abs/2601.17067

Summary: A mechanistic analysis of large-scale video generation models as world models. Proposes two analytical pillars, "state construction" and "dynamics modeling", to explain how physical coherence emerges and how these models internally represent state transitions and physical laws, laying a theoretical foundation for understanding and improving their world-modeling capabilities.

Abstract:

Large-scale video generation models have demonstrated emergent physical coherence, positioning them as potential world models. This work proposes a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling, which categorize how video generation models build internal representations of scene state and predict physical dynamics over time, providing a mechanistic understanding of their emergent world modeling capabilities.


14. Probabilistic Dreaming for World Models

  • Authors:
     (to be added)
  • Published:
     2026-03-09 | arXiv:2603.04715
  • Link:
     https://arxiv.org/abs/2603.04715

Summary: Studies how a "dreaming" mechanism lets agents learn from imagined experience. Proposes imagining within a probabilistic world-model framework to improve the robustness and sample efficiency of agent learning, providing more reliable environment models for reinforcement learning and planning, with significant implications for reducing real-environment interaction and accelerating learning.

Abstract:

“Dreaming” lets agents learn from imagined experiences, enabling more robust and sample-efficient learning of world models. We explore probabilistic formulations that allow agents to imagine diverse scenarios and learn from imaginary experiences in a principled way, improving robustness and sample efficiency in real-world tasks.


15. LingBot-World: Advancing Open-source World Models

  • Authors:
     R Team
  • Published:
     2026-01-22 | arXiv:2601.20540
  • Link:
     https://arxiv.org/abs/2601.20540

Summary: LingBot-World is an open-source world simulator that evolved from video generation, supporting minute-level horizons while preserving contextual consistency ("long-term memory"). It uses a hybrid data-acquisition engine and an MoE architecture to ensure semantic consistency, long-term memory, and fine-grained control. Code and models are open-sourced to narrow the technical gap between open- and closed-source world models.

Abstract:

We present LingBot-World, an open-sourced world simulator stemming from video generation. It maintains high fidelity and robust dynamics across diverse environments, enables a minute-level horizon while preserving contextual consistency over time (long-term memory), and employs a hybrid data acquisition engine with Mixture-of-Experts architecture. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source world model technologies.


🎮 Reinforcement Learning

16. Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

  • Authors:
     Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee
  • Published:
     ICLR 2026 | arXiv:2510.25992
  • Link:
     https://arxiv.org/abs/2510.25992

Summary: Proposes Supervised Reinforcement Learning (SRL) for small open-source models: in multi-step reasoning tasks, step-wise similarity to expert trajectories provides a smoother learning signal even when every rollout is incorrect. On sparse-reward control and agentic reasoning benchmarks, SRL improves over strong RL baselines by up to +81%, and by +11% on tool-use reasoning tasks.

Abstract:

Supervised Reinforcement Learning (SRL) enhances small-scale LLMs’ multi-step reasoning by generating internal monologues and using step-wise similarity between model actions and expert actions as supervision. This provides richer learning signals even when all rollouts are incorrect. Across sparse-reward control environments and agentic reasoning benchmarks, SRL consistently improves learning efficiency and final performance over strong RL baselines, achieving gains of up to +81% in complex multi-step environments and +11% in tool-using reasoning tasks.
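The dense signal SRL describes, scoring each model step against the aligned expert step so that even fully failed rollouts carry gradient, can be sketched. The Jaccard token overlap below is a hypothetical stand-in for the paper's actual similarity metric:

```python
# Hypothetical sketch of a step-wise supervision signal in the spirit of SRL:
# score each model step against the aligned expert step, so partially-correct
# rollouts still receive reward. The paper's similarity metric may differ.

def stepwise_reward(model_steps, expert_steps):
    """Per-step Jaccard token overlap between model and expert actions."""
    rewards = []
    for m, e in zip(model_steps, expert_steps):
        mt, et = set(m.split()), set(e.split())
        rewards.append(len(mt & et) / len(mt | et) if mt | et else 1.0)
    return rewards

model = ["open file", "read header", "guess answer"]
expert = ["open file", "read header", "parse body"]
print(stepwise_reward(model, expert))  # [1.0, 1.0, 0.0]
```

Even though the rollout fails at the last step, the first two steps earn full credit, which is exactly the "learning signal despite incorrect rollouts" property the abstract highlights.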


17. Experiential Reinforcement Learning

  • Authors:
     (to be added)
  • Published:
     2026-02-19 | arXiv:2602.13949
  • Link:
     https://arxiv.org/abs/2602.13949

Summary: Proposes Experiential Reinforcement Learning (ERL), which embeds an explicit experience-reflection-consolidation loop into the RL process. The model makes an initial attempt → receives environmental feedback → produces a reflection → makes a refined second attempt guided by it → successes are reinforced and internalized into the base policy. ERL consistently improves learning efficiency and final performance on sparse-reward environments and agentic reasoning benchmarks.

Abstract:

We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks.
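The experience-reflection-consolidation loop reads naturally as a two-attempt episode. The sketch below uses toy stand-ins for the policy, environment, and reflection model; all names and the success protocol are illustrative, not the paper's interface:

```python
# Hypothetical sketch of ERL's attempt -> feedback -> reflection -> retry loop,
# with toy stand-ins for the model and environment.

def erl_episode(policy, env, reflect):
    first = policy(None)                  # initial attempt, no reflection yet
    feedback = env(first)
    if feedback == "success":
        return first, True
    note = reflect(first, feedback)       # reflection on the failure
    second = policy(note)                 # refined attempt guided by the note
    return second, env(second) == "success"

# Toy task: the environment accepts only the answer "42".
env = lambda a: "success" if a == "42" else "wrong answer"
policy = lambda note: "42" if note else "41"   # improves once it has a reflection
reflect = lambda attempt, fb: f"attempt {attempt!r} failed: {fb}"

answer, ok = erl_episode(policy, env, reflect)
print(answer, ok)  # 42 True
```

In the full method the successful second attempt would then be reinforced, consolidating the reflection-driven improvement into the base policy.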


18. On the Role of Iterative Computation in Reinforcement Learning

  • Authors:
     (to be added)
  • Published:
     2026-02-07 | arXiv:2602.05999
  • Link:
     https://arxiv.org/abs/2602.05999

Summary: Formally defines "compute-bounded policies" and proves that policies using more compute can solve longer-horizon tasks beyond the reach of policies with less compute. Proposes a minimal recurrent architecture with a variable amount of compute and validates it on 31 online and offline RL tasks: simply adding compute improves performance, with stronger generalization to longer-horizon tasks than feedforward or deep residual baselines using up to 5× more parameters.

Abstract:

We formalize compute bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. We propose a minimal recurrent architecture that can use a variable amount of compute. On 31 different tasks spanning online and offline RL, we show that this architecture achieves stronger performance simply by using more compute, and stronger generalization on longer-horizon test tasks compared to standard feedforward or deep residual networks using up to 5× more parameters.
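A weight-tied recurrent cell applied a variable number of times is the simplest realization of "more compute without more parameters". The sketch below is a hypothetical minimal version in that spirit; the paper's actual architecture may differ:

```python
import numpy as np

# Hypothetical sketch of a variable-compute recurrent policy: one small
# weight-tied cell applied n_iters times, so compute scales at test time
# without adding parameters.

class RecurrentPolicy:
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(dim, dim))  # shared across iterations

    def __call__(self, obs, n_iters):
        h = obs
        for _ in range(n_iters):              # more iterations = more compute
            h = np.tanh(self.W @ h + obs)     # weight-tied refinement step
        return h

policy = RecurrentPolicy(dim=4)
obs = np.ones(4)
shallow = policy(obs, n_iters=1)
deep = policy(obs, n_iters=8)                 # same parameters, 8x compute
print(shallow.shape, deep.shape)              # (4,) (4,)
```

The parameter count is fixed by `dim`, yet the effective depth of the computation is a free knob, which is the property the paper's longer-horizon generalization results exploit.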


This report was compiled by Euler from web search | 2026-05-06