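AI Paper Express | May 28, 2026

💡 Today's selection: 22 of the latest AI papers, covering computer vision, natural language processing, machine learning, artificial intelligence, and multimodal research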

🖥️ Computer Vision

1. D^2Turb: Depth-Aware Simulation and Decoupled Learning for Single-Frame Atmospheric Turbulence Mitigation

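【Authors】Zixiao Hu, Tianyu Li, Guoqing Wang et al.

【Summary】Atmospheric turbulence causes spatially varying blur and geometric distortion, and existing end-to-end methods struggle to restore texture and correct shape at the same time. The proposed D^2Turb framework first introduces a depth-aware turbulence synthesis engine that generates physically consistent degraded data. It then decouples restoration into two interacting stages, texture deblurring and geometric rectification, and uses an adaptive structure-prior injection mechanism to pass deep structural information that guides the rectification. On synthetic and real datasets, the method reports state-of-the-art texture recovery and geometric fidelity.

【Abstract】Single-frame atmospheric turbulence mitigation is inherently ill-posed due to spatially varying blur coupled with non-rigid geometric distortion. Existing end-to-end approaches trained on flat-field simulations often struggle to balance texture recovery with geometric rectification. To overcome this limitation, we propose D^2Turb, a unified framework that bridges physics-grounded simulation with explicitly decoupled restoration.

📄 Paper: https://arxiv.org/abs/2605.27460 💻 Code: https://github.com/HertzDot222/D2Turb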

2. Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

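【Authors】Chung-Ta Huang, Leopold Das, Jeffrey Zhou et al.

【Summary】To let AR glasses offer proactive assistance, higher-level user behaviors must be recognized from their inertial measurement unit (IMU) data. The study defines five behavioral categories and builds a dataset of 160K samples. The proposed hierarchical model, HiT-HAR, has only 703K parameters yet outperforms prior head-mounted IMU models on action and scene recognition. The analysis indicates that exploiting temporal context and scene structure is more effective than simply scaling up the model. The code and dataset are open source.

【Abstract】AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability.

📄 Paper: https://arxiv.org/abs/2605.27464 💻 Code: https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR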

3. Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking

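【Authors】Hongtao Yang, Bineng Zhong, Qihua Liang et al.

【Summary】UAV tracking must balance accuracy and speed. Existing methods weaken the backbone to cut computation, which degrades performance. The EATrack framework uses a teacher-guided distillation strategy that strengthens the lightweight student model's representations through dual-branch knowledge transfer at the spatial-feature level and the prediction level. It also introduces fine-grained target-aware distillation to handle appearance changes and uses a temporal adaptation module at inference time. On five benchmarks, the method achieves a good balance between accuracy and speed.

【Abstract】Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model.

📄 Paper: https://arxiv.org/abs/2605.28018 💻 Code: https://github.com/GXNU-ZhongLab/EATrack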

4. Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

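【Authors】Haozhan Shen, Tiancheng Zhao, Kangjia Zhao et al.

【Summary】Spatial intelligence requires visual representations that capture both semantics and geometric structure. The study presents what it describes as the first systematic frozen-feature probing of vision-language models (VLMs) and video generation models (VGMs). The two turn out to be complementary: VLMs are stronger at semantic labeling and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Simply fusing the two yields representations that perform well on both geometry and semantics, which points toward stronger backbones for spatial intelligence.

【Abstract】Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds.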

📄 Paper: https://arxiv.org/abs/2605.28132 💻 Code: https://github.com/om-ai-lab/Probing-VLM-VGM

5. MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment

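【Authors】Shurui Xu, Siqi Yang, Weiping Ding et al.

【Summary】Clinical diagnosis of meniscus injuries requires combining MRI imaging with patient information. Existing benchmarks are typically unimodal with coarse labels. The MeniOmni benchmark contains 746 multi-center MRI studies and supports two tasks: fine-grained severity grading and diagnostic report generation. It also proposes risk-aware evaluation and a semantic consistency metric. Baseline experiments show that incorporating clinical priors improves grading performance and reduces severe errors.

【Abstract】Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning.

📄 Paper: https://arxiv.org/abs/2605.28161 💻 Code: https://github.com/ShuruiXu/MeniOmni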

📝 Natural Language Processing

6. From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

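【Authors】Xiangyu Ma, Teng Xiao, Zuchao Li et al.

【Summary】Diffusion models can generate text in parallel, but they are structurally mismatched with pre-trained autoregressive models. Through strictly causal alignment, the FLUID framework efficiently adapts standard GPT checkpoints to the diffusion paradigm, avoiding pre-training from scratch. It also introduces an entropy-based elastic horizon mechanism that dynamically adjusts the denoising step size according to local information density. Experiments show that FLUID reaches state-of-the-art performance while cutting training cost by several orders of magnitude.

【Abstract】Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre-training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm.

📄 Paper: https://arxiv.org/abs/2605.27387 💻 Code: https://github.com/Oli-lab-nun/FLUID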

7. ChildEval: When large language models meet children's personalities

【Authors】Yanyan Luo, Xue Han, Chunxu Zhao et al.

【Summary】LLMs remain under-evaluated on child-centered personalization. The ChildEval benchmark contains 29K synthesized persona profiles of children aged 3-6, along with corresponding explicit or implicit preferences. It defines evaluation scenarios in five major categories and fourteen subcategories. Experiments show that different personalization representations affect model responses, and that fine-tuning on ChildEval effectively improves child-centered service quality.

【Abstract】While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information.

📄 Paper: https://arxiv.org/abs/2605.27805 💻 Code: https://github.com/ziyanluo/ChildEval

8. MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

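【Authors】Zixuan Yang, Yibo Zhao, Weicong Liu et al.

【Summary】Matching papers with reviewers at scale is a major challenge. The MERIT framework first trains a reviewer evaluator with reinforcement learning, using an LLM judge that assigns rewards according to expertise rubrics, to identify the expertise dimensions a paper requires. In the second stage, the evaluator's predictions are distilled into an embedding-based retriever for efficient large-scale assignment. The 4B-parameter evaluator outperforms larger general-purpose models on suitability classification, and the retriever achieves state-of-the-art results on the benchmark.

【Abstract】Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two-stage framework that bridges this gap by converting criterion-level expertise matching into scalable suitability supervision.

📄 Paper: https://arxiv.org/abs/2605.27865 💻 Code: https://github.com/Luli3220/MERIT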

9. GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

【Authors】Parth Bhalerao, Jeromy Chang, David Chou et al.

【Summary】Evaluating AI tutor responses requires going beyond factual correctness. GRADE systematically studies how well open-source models assess pedagogical ability in tutoring dialogues. Testing 120 configurations shows that Gemma3-12B suits single-task evaluation, while 8-bit Gemma3-27B is better for multitask prediction. A carefully chosen LoRA pipeline can match or even surpass proprietary systems on key pedagogical dimensions. The study also finds that model choice and inference mode significantly affect carbon emissions.

【Abstract】Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for pedagogical ability assessment in student-tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations across five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multit...

📄 Paper: https://arxiv.org/abs/2605.27866 💻 Code: https://github.com/pvbgeek/GRADE

10. Retrieval, Reward, and Training Protocols: What Matters in Training Search Agents?

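【Authors】Yibo Zhao, Zichen Ding, Jiayi Wu et al.

【Summary】This is a controlled empirical study of search-agent training methods, aimed at clarifying what actually drives improvements. It finds that fixing data coverage problems in the commonly used Wikipedia 2018 corpus yields larger gains than the differences between training algorithms. In most cases, the simplest outcome-reward approach achieves competitive or better performance, whereas process-level credit assignment can over-correct agent behavior. The study also distills practical guidelines on data diversity, off-policy data use, and search budget scaling.

【Abstract】Search agents powered by large language models can autonomously decompose queries, retrieve information, and synthesize answers through multi-step reasoning. However, the rapid growth of training methods has outpaced controlled comparison: existing works differ in retrieval corpora, reward designs, and training protocols, making it unclear what actually drives improvements. We present a controlled empirical study that isolates three under-explored dimensions of search agent training.

📄 Paper: https://arxiv.org/abs/2605.27881 💻 Code: https://github.com/YiboZhao624/SearchAgentReview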

🤖 Machine Learning

11. Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective

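【Authors】Hyunmin Cho, Woo Kyoung Han, Kyong Hwan Jin

【Summary】The attention matrix in transformers is interpreted as an associative memory matrix that encodes associations between input features. When it is decomposed into symmetric and skew-symmetric parts, the symmetric part governs the structure of the energy landscape and the skew-symmetric part drives circulation on it. A Hopfield stability measure derived from the symmetric part quantifies the stability of retrieved features, and it is observed to correlate with the fidelity-diversity trade-off in generation. Finally, the paper proposes a controllable knob that adjusts this trade-off by modifying the circulation of the underlying dynamics.

【Abstract】We characterize the pre-softmax attention matrix QK^T in transformers as an associative memory matrix encoding pairwise associations between input features. By decomposing this matrix into its symmetric and skew-symmetric parts, we interpret the symmetric component as governing the structure of the energy landscape, and the skew-symmetric component as driving circulation on that landscape.

📄 Paper: https://arxiv.org/abs/2605.27476 💻 Code: https://github.com/hyeon-cho/Attention-Symmetric-Decomposition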
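
The symmetric/skew-symmetric split described in this entry is a standard linear-algebra identity and can be sketched in a few lines. The following is a minimal illustration in plain Python, not the paper's code: the 3x3 matrix `A` is a made-up stand-in for a pre-softmax attention score matrix, and the helper names are hypothetical.

```python
def transpose(m):
    """Return the transpose of a square matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def split_symmetric(a):
    """Split a square matrix into symmetric and skew-symmetric parts.

    S = (A + A^T) / 2 and K = (A - A^T) / 2, so that A = S + K.
    """
    at = transpose(a)
    n = len(a)
    sym = [[(a[i][j] + at[i][j]) / 2 for j in range(n)] for i in range(n)]
    skew = [[(a[i][j] - at[i][j]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew


def quadratic_form(m, x):
    """Compute the scalar x^T M x."""
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))


# Made-up example scores (an assumption for illustration only).
A = [[2.0, 1.0, 0.0],
     [3.0, 4.0, 5.0],
     [0.0, 1.0, 6.0]]
S, K = split_symmetric(A)
x = [1.0, -2.0, 0.5]

print("symmetric part:", S)
print("skew-symmetric part:", K)
print("x^T K x =", quadratic_form(K, x))
```

Because x^T K x is zero for any skew-symmetric K, the quadratic form x^T A x depends only on S. This is one elementary reason the symmetric part can be read as shaping an energy function, while the skew-symmetric part contributes only a rotational component.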

12. GenSBI: Generative Methods for Simulation-Based Inference in JAX

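【Authors】Aurelio Amerio

【Summary】For researchers who develop models and analysis pipelines in JAX, the GenSBI library implements flow matching, score matching, and denoising diffusion entirely in JAX. It provides three interchangeable transformer-based architectures and decouples the generative method, the neural backbone, and the inference mode. On standard benchmarks, with minimal tuning, the library achieves near-ideal calibrated posterior coverage and mean C2ST scores of 0.50-0.56 (0.50 is ideal).

【Abstract】Flow and diffusion generative models have established themselves as widely adopted density estimators for simulation-based inference (SBI), extending naturally from neural posterior estimation to likelihood and joint density estimation. Their principled optimization objectives and freedom from architectural constraints have driven rapid adoption across the natural sciences.

📄 Paper: https://arxiv.org/abs/2605.27499 💻 Code: https://github.com/aurelio-amerio/GenSBI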
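
The C2ST figure quoted in this entry comes from a classifier two-sample test: a classifier is trained to tell samples of one distribution from samples of another, and its held-out accuracy is reported, with 0.50 meaning the two sets are indistinguishable. The sketch below illustrates only that idea on 1-D toy data, using a deliberately simple nearest-mean classifier in plain Python. It is not GenSBI's API, and the function name, sample sizes, and seed are assumptions chosen for illustration; practical C2ST evaluations typically use a stronger classifier.

```python
import random


def c2st_accuracy(samples_p, samples_q):
    """Held-out accuracy of a nearest-mean classifier separating two sample sets."""
    half_p, half_q = len(samples_p) // 2, len(samples_q) // 2
    train_p, test_p = samples_p[:half_p], samples_p[half_p:]
    train_q, test_q = samples_q[:half_q], samples_q[half_q:]

    # "Training": estimate one mean per class from the first half of each set.
    mean_p = sum(train_p) / len(train_p)
    mean_q = sum(train_q) / len(train_q)

    # "Testing": label each held-out point by the closer class mean.
    correct = sum(abs(x - mean_p) <= abs(x - mean_q) for x in test_p)
    correct += sum(abs(x - mean_q) < abs(x - mean_p) for x in test_q)
    return correct / (len(test_p) + len(test_q))


rng = random.Random(0)  # fixed seed so the toy example is repeatable
reference = [rng.gauss(0.0, 1.0) for _ in range(4000)]
matched = [rng.gauss(0.0, 1.0) for _ in range(4000)]  # same distribution
shifted = [rng.gauss(3.0, 1.0) for _ in range(4000)]  # clearly different

acc_matched = c2st_accuracy(reference, matched)
acc_shifted = c2st_accuracy(reference, shifted)
print("matched:", acc_matched)
print("shifted:", acc_shifted)
```

With matched samples the accuracy should land near 0.5, and with the shifted samples it should be much closer to 1.0, which is why scores of 0.50-0.56 indicate posteriors that are hard to distinguish from the reference.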

13. SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

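【Authors】Shuhao Chen, Weisen Jiang, Yeqi Gong et al.

【Summary】Fine-tuning LLMs often undermines their safety alignment, and harmful fine-tuning attacks amplify the problem. The SPARD defense framework integrates safety-projected alternating optimization with relevance-diversity aware data selection. It uses the SPAG method, which alternates between utility updates and explicit safety projection, and a diversity-aware selection process that picks a compact set of safety data. Under four harmful attacks, SPARD consistently achieves the lowest average attack success rate while maintaining high task accuracy.

【Abstract】Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection.

📄 Paper: https://arxiv.org/abs/2605.28030 💻 Code: https://github.com/shuhao02/SPARD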

14. Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

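【Authors】Hao Jiang, Shurui Li, Tianpeng Bu et al.

【Summary】Online reinforcement learning for LLMs often shows an imbalance between exploration and exploitation. IB-Score, a metric grounded in Information Bottleneck theory, evaluates this balance by quantifying the trade-off between step-level reasoning diversity and mutual information with the correct answer. The analysis finds that common methods such as GRPO struggle to maintain the balance. The proposed IB-TPO framework uses IB-Score as an optimization objective together with a new IB-guided tree sampling strategy, which improves online sampling efficiency by 50% and reuses the tree structure for effective estimation. Experiments show gains of 2.9% to 3.6% over GRPO baselines.

【Abstract】Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversit...

📄 Paper: https://arxiv.org/abs/2605.28109 💻 Code: https://github.com/alibaba/EfficientRL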

15. QuITE: Query-Based Irregular Time Series Embedding

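【Authors】JungHoon Lim

【Summary】Irregular multivariate time series are hard to model because of their irregular sampling. Existing methods either design specialized architectures or map the data onto a regular grid through interpolation, which can distort temporal dynamics. The QuITE module uses learnable query tokens and a single self-attention layer to aggregate irregular observations and directly produce backbone-compatible latent representations, without generating artificial values or modifying the architecture. On multiple real-world benchmarks, QuITE consistently improves multivariate time series models, with average relative gains of up to 54.7% on forecasting and 15.8% on classification.

【Abstract】Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing approaches typically either (i) design specialized architectures that limit the reuse of proven Multivariate Time Series (MTS) models, or (ii) map IMTS onto regular temporal grids through interpolation, which may distort temporal dynamics by introducing artificial values. To address these limitations, we propose a new input-embedding-based approach.

📄 Paper: https://arxiv.org/abs/2605.28166 💻 Code: https://github.com/Meaningfull9502/QuITE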
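
To make the query-token idea concrete, here is a rough sketch of how a fixed number of query vectors can pool a variable number of irregularly timed observations through attention. It is a simplification and not QuITE's implementation: the queries here are fixed numbers rather than learned parameters, the `embed` function and all other names are hypothetical, and the read-out is a single attention step from queries to observations rather than a full self-attention layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed(time, value):
    """Hypothetical embedding of one (timestamp, value) observation."""
    return [value, math.sin(time), math.cos(time)]


def pool_with_queries(queries, observations):
    """Attend from each query to all observation tokens and average them."""
    tokens = [embed(t, v) for t, v in observations]
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between the query and every token.
        scores = [sum(qi * ki for qi, ki in zip(q, tok)) / math.sqrt(dim)
                  for tok in tokens]
        weights = softmax(scores)
        # Weighted average of the tokens: one output vector per query.
        pooled.append([sum(w * tok[i] for w, tok in zip(weights, tokens))
                       for i in range(dim)])
    return pooled


# Stand-in query vectors; in a trained model these would be learned.
QUERIES = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Two series with different numbers of unevenly spaced (time, value) pairs.
series_a = [(0.0, 1.2), (0.7, 0.9), (3.1, 1.5)]
series_b = [(0.2, -0.4), (0.3, -0.1), (1.9, 0.2), (2.0, 0.3), (5.5, 0.0)]

latent_a = pool_with_queries(QUERIES, series_a)
latent_b = pool_with_queries(QUERIES, series_b)
print(len(latent_a), len(latent_b))
```

Both series map to the same 2x3 shape even though they contain three and five observations, and no values are interpolated onto a regular grid, so a downstream backbone that expects a fixed-length input could consume either one.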

🧠 Artificial Intelligence

16. PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

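【Authors】Yuxuan Zhao, Sijia Chen, Ningxin Su

【Summary】Portfolio management, a critical financial decision-making task, lacks good benchmarks. PortBench covers six asset classes over a ten-year span and includes a static QA dataset and a dynamic five-stage allocation pipeline. It introduces a two-level correlation score and the CEPS metric to evaluate portfolios. An evaluation of ten frontier LLMs finds that, despite strong static QA performance, 90% of model portfolios fail to beat a simple equal-weight allocation under stress, and that portfolios can suffer catastrophic drawdowns even when they satisfy every pipeline constraint.

【Abstract】LLMs have shown strong performance across diverse financial tasks, yet portfolio management (PM), a critical financial decision-making task, remains poorly benchmarked. Existing benchmarks exhibit two main gaps: they ignore cross-asset correlation structures, thereby failing to distinguish genuinely diversified portfolios from concentrated ones, and fail to evaluate the complete PM decision pipeline in real-world scenarios.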

📄 Paper: https://arxiv.org/abs/2605.27887 💻 Code: https://github.com/AgenticFinLab/portbench

17. A Unified Framework for the Evaluation of LLM Agentic Capabilities

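【Authors】Pengyu Zhu, Lijun Li, Yaxing Lyu et al.

【Summary】When the agentic capabilities of LLMs are evaluated, benchmark scores often mix model capability with the effects of framework implementation choices. The proposed unified evaluation framework uses a single configuration system to consolidate diverse benchmarks into a standardized instruction-tool-environment format and runs agents in a controlled sandbox. It also provides an offline mode to eliminate environment fluctuations. A large-scale analysis across seven widely used benchmarks covering 24 domains finds that framework and environment choices significantly affect benchmark results; controlling for them helps isolate the model's intrinsic capability.

【Abstract】As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross-benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities.

📄 Paper: https://arxiv.org/abs/2605.27898 💻 Code: https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities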

18. STAB: Specification-driven Testing for Algorithmic Bottlenecks

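【Authors】Soohan Lim, Joonghyuk Hahn, Hyundong Jin et al.

【Summary】Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Starting only from the natural-language problem specification, the STAB pipeline generates such cases in two stages: constraint saturation and adversarial scenario injection. The constraint saturator extracts and satisfies constraints, while the adversarial injector retrieves construction principles from a scenario catalog. On CodeContests, STAB raises the rate of generated test cases that expose algorithmic bottlenecks from an average of 50.43% to 73.45%.

【Abstract】Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency test cases either by increasing input size or by generating code-specific inputs that make the given implementation run slowly. Consequently, they do not address the structural input conditions that drive the algorithmic worst case.

📄 Paper: https://arxiv.org/abs/2605.27981 💻 Code: https://github.com/suhanmen/STAB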

19. Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

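【Authors】Yaoyang Luo, Zhi Zheng, Ziwei Zhao et al.

【Summary】Malicious agents in a multi-agent system may attack cooperatively to mislead the system more effectively. In the proposed adaptive cooperative attack framework, malicious agents autonomously coordinate their attack strategies over multiple rounds of interaction. The corresponding STAR defense framework identifies and rectifies misleading information in communications at the sentence level. Experiments show that cooperative attacks reduce the task success rate more than independent attacks do (a relative drop of 5.34%), while STAR effectively mitigates both threats, raising the task success rate by 36.76% on average.

【Abstract】Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS.

📄 Paper: https://arxiv.org/abs/2605.28104 💻 Code: https://github.com/smoooom/STAR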

20. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

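【Authors】Shuaike Li, Kai Zhang, Xianquan Wang et al.

【Summary】The dominant paradigm in knowledge editing is static fact overwriting, but this fractures the model's pre-trained logical topology and triggers epistemic dissonance, in which old and new knowledge conflict. Experiments confirm that this is a structural flaw. Editing based on causal narratives, by contrast, cuts the conflict rate from 95.6% to 6.6%. The CODE method injects causal transition logic directly into parametric memory through causal bootstrapping and asymmetric on-policy distillation. On LLaMA-3.1 and Qwen-2.5, CODE pushes the self-refutation rate down to 1.8% while preserving multi-hop reasoning accuracy.

【Abstract】While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, forcibly injecting isolated facts. Fracturing pre-trained logical topologies, this triggers Epistemic Dissonance -- a pathology where un-evolved legacy priors force the model to explicitly negate the injected update.

📄 Paper: https://arxiv.org/abs/2605.28303 💻 Code: https://github.com/CrashBugger/CODE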

🌐 Multimodal

21. AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?

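【Authors】Zongheng Cao, Yi Zheng, Rui Song et al.

【Summary】Video production workflows are a demanding setting for evaluating multimodal AI agents. The AgenticVBench benchmark contains 100 agentic tasks drawn from real workflows and designed by 20 industry experts. The evaluation shows that the best agent stack barely exceeds 30% accuracy, far below human experts. The study also finds that the choice of framework significantly affects a model's behavior, scores, and failure modes. The benchmark lays a foundation for diagnosing and improving models and frameworks for agentic video production.

【Abstract】Video production workflows offer a rich and demanding arena for evaluating multimodal AI agents: they require composite capabilities across text, image, audio, and video understanding, along with long-horizon planning, and tool use. To this end, we introduce AgenticVBench, a benchmark of 100 agentic tasks across 4 task families spanning the real world post-production workflow, constructed from real production workflows contributed by 20 industry experts averaging 6 years of professional experien...

📄 Paper: https://arxiv.org/abs/2605.27705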

22. Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search

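【Authors】Zequn Xie, Guijin Luo, Chuxin Wang et al.

【Summary】Text-based person anomaly search aims to retrieve specific behavioral events from surveillance video using natural-language queries. Existing methods face a pose-semantic gap: semantically different actions can have similar skeletons. The SSDC framework decouples retrieval into two stages: a lightweight model first performs coarse retrieval based on skeleton similarity, and a "detective squad" then performs semantic verification. The squad consists of a detective for fast screening, an analyst that extracts evidence, and a writer that synthesizes the semantics. Finally, the generated descriptions are fused with structural priors for re-ranking. On the PAB benchmark, SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.

【Abstract】Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive.

📄 Paper: https://arxiv.org/abs/2604.23282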

📬 AI Paper Express · Updated daily · Follow so you don't miss an issue

💬 Feel free to share and forward, and track the AI frontier together