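Relevant Today
AI4Protein Frontier Tracker
AI In-Depth Analysis
The study proposes Socratic-SWE, a framework that automatically builds high-quality software engineering (SWE) repair tasks through the co-evolution of a Solver and a Generator. The core challenges are how to obtain skills that are both structured and grounded in real agent behavior, and how to ensure that generated tasks are usable and valid.
Methodologically, the study first builds an Agent Skill Registry. The registry distills structured skills from historical interaction traces; each skill has a name, a description, applicability conditions, and an action sequence. Successful traces yield general strategies, while failed traces are summarized into failure lessons and correction principles. The Generator then uses these skills to synthesize, inside a code repository environment, candidate tasks that target the Solver's weaknesses.
To ensure task quality, the study designs a strict validation pipeline. Candidate tasks must pass four checks: format, grounding (links to entities in the codebase), execution stability, and semantic validity (a valid fix exists). An alignment reward is also introduced: the Generator must not only produce executable tasks but also ensure that the policy update direction the Solver obtains from completing the task agrees with the target gradient direction computed on a held-out validation set. Tasks that are invalid, or whose update direction deviates, are rejected.
Finally, the framework forms a closed loop: the Solver solves tasks in a sandbox and produces traces, the traces are distilled into new skills, the Generator uses the new skills to propose new tasks, and validated tasks enter a curriculum pool for the Solver to learn from. This mechanism iterates continuously over tasks, traces, and skills, improving the model on complex code repair tasks while filtering out invalid or non-reproducible tasks.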
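The acceptance rule described above (four validation checks, then a gradient-alignment test) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of cosine similarity, the clipping at zero, the `min_reward` threshold, and all function and key names are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def alignment_reward(checks, task_grad, target_grad):
    """Return 0 for a task that fails any check, else the clipped cosine alignment.

    checks      -- dict of check name -> bool (hypothetical names)
    task_grad   -- update direction the Solver would get from this task
    target_grad -- gradient direction computed on the held-out validation set
    """
    if not all(checks.values()):
        return 0.0
    return max(0.0, cosine(task_grad, target_grad))

def keep_task(checks, task_grad, target_grad, min_reward=0.1):
    """Accept a candidate task only if its reward reaches the (assumed) threshold."""
    return alignment_reward(checks, task_grad, target_grad) >= min_reward

checks = {"format": True, "grounding": True,
          "execution_stable": True, "semantically_valid": True}
target = [0.2, -0.5, 0.1]
print(keep_task(checks, [0.4, -1.0, 0.2], target))   # update parallel to the target
print(keep_task(checks, [-0.4, 1.0, -0.2], target))  # update opposite to the target
```

In a real system the gradients would be high-dimensional and likely approximated; the sketch only shows how validity and direction agreement combine into one accept or reject decision.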
Original
Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
Abstract: LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that reuses the agent's historical solving traces as a source of training signal. Rather than treating traces only as evidence for reward computation, Socratic-SWE distills them into structured agent skills that summarize recurring failures and effective repair patterns. These skills then guide the generation of targeted repair tasks in real repositories. Candidate tasks are checked through execution-based validation and scored with a solver-gradient alignment reward, so that the retained tasks are both verifiable and useful for improving the Solver. The updated Solver produces new traces, enabling the task curriculum to adapt over successive rounds. Across SWE-bench Verified, SWE-bench Lite, SWE-bench Pro, and Terminal-Bench 2.0, Socratic-SWE consistently improves over self-evolving baselines under the same compute budget, reaching 50.40% on SWE-bench Verified after three iterations. These results suggest that solving traces can serve as a scalable substrate for self-evolving SWE agents.
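Link: https://arxiv.org/pdf/2606.07412
AI In-Depth Analysis
To address the limits of knowledge-graph memory systems in multi-hop retrieval, this paper proposes HKVM-RAG, a retrieval-augmented generation framework built on hypergraph key-value storage. It targets a problem in existing methods such as HippoRAG and EcphoryRAG: they rely heavily on pairwise entity keys or online filtering, and they cannot easily optimize the retrieval key space on top of a fixed evidence extraction step. The core of the method is the construction of "answer-path hyperedges": a large language model first extracts local triple evidence, which is cached; shared "bridge vertices" then automatically assemble the scattered triple components into higher-order hyperedges, which serve as retrieval keys that bind multiple entities and map back to the relevant documents. The experiments control variables strictly, comparing conventional pairwise graph traversal (KG-PPR) with hypergraph traversal on the same evidence extraction subsystem and candidate document set. The results indicate that upgrading the retrieval key space from pairwise relations to answer-path hyperedges organizes multi-hop evidence more effectively without extra extraction cost, supporting the potential of higher-order structure for improving retrieval quality.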
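The idea of joining cached triples through a shared bridge vertex can be sketched as below. This is an illustrative simplification, not the paper's algorithm: it only builds two-hop hyperedges, treats a bridge as "tail of one triple equals head of another", and the tuple layout, function name, and example data are all hypothetical.

```python
from collections import defaultdict

def assemble_hyperedges(triples):
    """Join triples that share a bridge vertex into two-hop hyperedges.

    triples -- iterable of (head, relation, tail, passage_id)
    Returns a dict mapping a frozenset of entities (the retrieval key)
    to the set of passage ids that support it (the values).
    """
    by_head = defaultdict(list)
    for triple in triples:
        by_head[triple[0]].append(triple)

    hyperedges = {}
    for first in triples:
        head, _, bridge, pid = first
        # .get avoids creating empty entries in the defaultdict
        for second in by_head.get(bridge, []):
            if second == first:
                continue  # a self-loop triple must not pair with itself
            _, _, tail2, pid2 = second
            key = frozenset({head, bridge, tail2})
            hyperedges.setdefault(key, set()).update({pid, pid2})
    return hyperedges

triples = [
    ("Film A", "directed_by", "Person B", "p1"),
    ("Person B", "born_in", "City C", "p2"),
    ("Person D", "born_in", "City E", "p3"),
]
edges = assemble_hyperedges(triples)
for key, passages in edges.items():
    print(sorted(key), sorted(passages))
```

Here "Person B" acts as the bridge, so the first two triples form one key that points back to both supporting passages, while the unrelated third triple contributes nothing.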
Original
HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG
Abstract: Multi-hop RAG poses a data-engineering problem beyond passage matching: under fixed retrieval budgets, a system must organize retrieved text into evidence units that expose answer chains. Dense retrievers score passages independently, while graph-based memories make associations explicit but often rely on pairwise or entity-centered keys that fragment multi-hop evidence. We present HKVM-RAG, a key-value-separated evidence-organization layer. It assembles answer-path hyperedges from cached passage-level LLM evidence tuples and uses them as retrieval keys, while retaining passage text as answer values. To isolate key-space design, our fixed-substrate protocol holds the tuple cache, candidate passages, reader, and evaluation budget constant across pairwise graph and hypergraph variants. Weighted hypergraph key-value retrieval improves over KG-PPR by +3.426 F1 on 2WikiMultiHopQA and +3.592 F1 on MuSiQue; HotpotQA shows that higher structured support coverage need not yield standalone answer-F1 gains. We therefore study WHG-KV as an evidence-control signal rather than a dense-retrieval replacement. Oracle and train-to-dev analyses identify support selection as repairable, and a dense-aware controller combines frozen ColBERTv2 and HKVM rank/score features using out-of-fold HKVM predictions. It reaches 88.846, 65.073, and 85.810 F1 on the three benchmarks, improving over ColBERTv2 by +11.084, +6.763, and +5.966 F1. Source-level ablations show that matched non-WHG structured signals do not match the WHG-KV gains. These results provide bounded evidence that key-value-separated hypergraph organization can serve as a reusable evidence-control mechanism for multi-hop RAG.
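Link: https://arxiv.org/pdf/2606.07218
AI In-Depth Analysis
This study addresses the poor performance of generative drug design models on shallow binding pockets by building the ShallowBench benchmark. It first applies stratified sampling and preprocessing to the CrossDocked2020 dataset: it computes each ligand's centroid, extracts the protein atoms within an 8.0 Å radius, and removes samples with fewer than 4 binding-site atoms, which keeps the test set heterogeneous and structurally diverse. The evaluation covers four dimensions (chemical validity, drug-likeness (QED), Vina binding affinity, and shape complementarity (Sc)) and compares three recent models: DiffSBDD, SimpleSBDD, and TargetDiff. In the shallow-pocket setting, binding affinity drops markedly for all models, confirming that flat surfaces are hard to dock against effectively. SimpleSBDD, which screens molecules from the ZINC library, achieves the highest QED scores and binding energies but very low shape complementarity, suggesting that it acts more as a drug-repurposing scoring filter than as a true de novo designer. DiffSBDD generates many chemically valid molecules, but its QED scores are very low and its shape complementarity is close to zero, indicating that its conditioning signal does little to steer molecules toward the pocket geometry. TargetDiff shows a clear degradation in chemical validity. Overall, the study exposes the limits of current generative models on shallow topologies and stresses that, without the geometric constraints of a deep pocket, the models struggle to establish effective geometric traction.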
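The preprocessing step described above (ligand centroid, 8.0 Å cutoff, minimum of 4 binding-site atoms) can be sketched in a few lines. This is a minimal illustration on bare coordinate tuples; the function names are hypothetical, and treating the cutoff as inclusive (`<=`) is an assumption.

```python
import math

def centroid(coords):
    """Arithmetic mean of a non-empty list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def extract_pocket(protein_atoms, ligand_atoms, radius=8.0):
    """Protein atoms within `radius` angstroms of the ligand centroid."""
    center = centroid(ligand_atoms)
    return [atom for atom in protein_atoms if math.dist(atom, center) <= radius]

def keep_sample(protein_atoms, ligand_atoms, radius=8.0, min_atoms=4):
    """Drop samples whose binding site has fewer than `min_atoms` atoms."""
    return len(extract_pocket(protein_atoms, ligand_atoms, radius)) >= min_atoms

ligand = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]           # centroid at (1, 0, 0)
protein = [(1.0, 0.0, 0.0), (1.0, 7.9, 0.0), (5.0, 0.0, 0.0),
           (0.0, 3.0, 4.0), (1.0, 8.1, 0.0)]          # last atom lies outside
print(len(extract_pocket(protein, ligand)), keep_sample(protein, ligand))
```

A production pipeline would read coordinates from structure files and likely use a spatial index, but the filtering logic is the same.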
Original
ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets
Abstract: While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets, such as the historically "undruggable" oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow-pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape "lid" volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state-of-the-art generative models reveals weaker predicted binding affinity on these low-concavity interfaces. ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.
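Link: https://arxiv.org/pdf/2606.06717
Popular Today
Trending arXiv Picks
AI In-Depth Analysis
This study tackles the fact that the vast majority of products in large e-commerce catalogs lack Product Carbon Footprint (PCF) labels, and proposes an inference framework that combines semantic similarity search with structured large language model (LLM) prompting. Using a small labeled "Carbon Catalogue", the method applies retrieval-augmented generation (RAG): the carbon footprints of neighboring products are supplied as in-context examples that guide the LLM in estimating the footprint of Amazon products. The study builds a recommendation pipeline with candidate generation and carbon-aware re-ranking, and tunes a parameter λ to trade off recommendation quality (for example, NDCG@10) against environmental impact (average carbon footprint). Experiments show that the retrieval-augmented strategy gives more accurate estimates than simple nearest-neighbor averaging or zero-shot LLM prediction, and that it exposes the Pareto frontier between user utility and environmental sustainability in the recommender's objective, offering a reproducible technical path and evaluation baseline for designing responsible, environmentally aware recommender systems.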
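The λ trade-off described above can be illustrated with a small post-hoc re-ranker. The scoring rule below, a convex combination of min-max-normalized relevance and carbon footprint, is an assumption chosen for clarity; the paper's exact formula is not given here, and the item names and numbers are made up.

```python
def min_max(values):
    """Scale values to [0, 1]; all zeros when every value is identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def carbon_aware_rerank(items, lam):
    """Re-rank (item_id, relevance, carbon_kg) tuples.

    lam = 0 ranks purely by relevance; lam = 1 ranks purely by low carbon.
    Assumed score: (1 - lam) * relevance_norm - lam * carbon_norm.
    """
    relevance = min_max([rel for _, rel, _ in items])
    carbon = min_max([co2 for _, _, co2 in items])
    scored = [((1 - lam) * r - lam * c, item_id)
              for (item_id, _, _), r, c in zip(items, relevance, carbon)]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps ties in input order
    return [item_id for _, item_id in scored]

items = [("kettle", 0.9, 30.0), ("mug", 0.8, 2.0), ("lamp", 0.5, 10.0)]
for lam in (0.0, 0.5, 1.0):
    print(lam, carbon_aware_rerank(items, lam))
```

Sweeping `lam` over a grid and recording an engagement metric and the mean footprint of the top-k list at each value is one way to trace a Pareto frontier of the kind the study reports.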
Original
Trading Engagement for Sustainability: Carbon-Aware Re-ranking for E-commerce Recommendations
Abstract: E-commerce recommender systems strongly influence which products users consider and purchase, yet sustainability signals such as Product Carbon Footprint (PCF) are almost never available at catalog scale. We study carbon-aware product recommendation in the realistic setting where PCF labels are missing for most items and must be inferred. We first estimate product-level carbon footprints via a retrieval-augmented PCF estimation pipeline that transfers supervision from the Carbon Catalogue, a small set of life-cycle-assessed products, to a large unlabeled e-commerce catalog using semantic similarity search, few-shot LLM prompting, and a nearest-neighbour fallback. We then apply a carbon-aware post-hoc re-ranking strategy on top of relevance scores produced by three established recommendation models: BPR, NeuMF, and LightGCN. The method trades off predicted user-item engagement against estimated carbon footprint through a single tunable parameter, lambda. In this offline study, engagement is operationalized through Amazon review interactions, which serve as implicit feedback and as a proxy for user interest or purchase behavior. We evaluate the framework on the Amazon Reviews dataset across three product categories: Home and Kitchen, Sports and Outdoors, and Electronics. By sweeping lambda, we construct Pareto frontiers that characterize the achievable engagement and carbon trade-off for each model and category. Substantial carbon reductions are achievable at minimal engagement cost across all models and categories. However, the available carbon headroom varies by model and category, underscoring the importance of model choice and domain context.
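Link: https://arxiv.org/pdf/2606.04550
AI In-Depth Analysis
The study proposes CatDT, an autonomous multi-agent system aimed at core challenges in catalyst design: transition-state structures are hard to construct, computation is costly, and results are hard to reproduce. Built on the CAMEL framework, the system strictly follows the principle that agents reason and tools compute: 8 specialized agents jointly call 27 scientific tools to automate the full workflow from crystal structure to microkinetic observables without human intervention. It has two main innovations. The UniMech agent uses energy-guided path pruning to make reaction-network construction more than 10^3 times cheaper than exhaustive enumeration, and a memory-augmented reinforcement loop raises the transition-state endpoint success rate from 41% to 84%. The system also self-corrects: when an NEB calculation fails or an endpoint is rejected, a validation-audit agent diagnoses the specific cause, such as atomic overlap, element counts, or path collisions, and proposes a geometric fix. On benchmarks spanning seven classes of gas-solid interfaces, including stepped metals, single-atom catalysts, and 2D sulfides, CatDT's predictions fall within 0.5 to 2 times the experimental values across measurements covering four orders of magnitude. The system autonomously proposed non-precious-metal catalysts for propane dehydrogenation, such as a Ni@ZrO₂ SMSI overlayer, whose simulated turnover frequency (1.63 s⁻¹) and selectivity (about 100%) reportedly exceed those of the industrial PtSn benchmark. Compared with a conventional serial workflow that takes 6 to 30 hours of expert labor, CatDT completes the same task in 5 to 30 minutes on a single GPU, and it can keep improving as foundation models and tools advance, offering a reliable technical path toward realistic catalyst digital twins.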
Original
Autonomous heterogeneous catalyst discovery with a self-evolving multi-agent digital twin
Abstract: Theoretical heterogeneous catalysis promises rapid catalyst discovery, yet computational and machine-learning predictions often deviate from experiment and stay confined to narrow material families, for want of a faithful, condition-aware catalytic simulator. We present CatDT (Catalysis Digital Twin), a self-evolving multi-agent system that builds an autonomous digital twin of a working catalyst, unifying gas-solid and liquid-solid modeling. From only a bulk crystal and a natural-language reaction description, eight specialized agents and 27 scientific tools predict stable facets, reconstruct working surfaces, enumerate and rank reaction pathways, locate transition states, and compute kinetics in 5-30 min on a single GPU. Two innovations address the hardest steps: UniMech finds dominant pathways for novel materials at over 10^3× lower cost than exhaustive enumeration by fusing agent-guided proposals with energy-cached graph search, and a memory-augmented reinforcement loop raises barrier-calculation success from 41% to 84% across 600 catalytic surfaces. Across seven gas-solid benchmarks -- stepped metals, single-atom catalysts, ordered intermetallics, vacancy-rich 2D sulfides and carbides, and a strong-metal-support-interaction (SMSI) interface -- every CatDT prediction lies within 0.5-2 times experiment over four orders of magnitude. For propane dehydrogenation, CatDT independently discovers non-precious candidates rivaling the Pt-based industrial benchmark, with a proposed Ni@ZrO_2 SMSI overlayer reaching a simulated TOF of 1.63 s^-1 at ~100% selectivity. More broadly, the decisive factor for a faithful catalyst digital twin -- or any multi-stage scientific simulator -- is not raw LLM capability but the engineered harness around it: deterministic tools, persistent memory, and verified self-improvement that compound across models, tools, and runs.
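Link: https://arxiv.org/pdf/2606.05050
AI In-Depth Analysis
This paper proposes a recurrent ternary logic gate network trained through polynomial surrogates (PST-DTLGN), aimed at the difficulty of moving discrete logic gates between continuous training and discrete inference. It first defines two partial orders: a numerical order, used for continuous gradient-descent training, and an information order, used for discrete inference to separate "unknown" from "determined" states. Based on these two orders, the authors select two gate sets from the 19,683 possible two-input ternary logic gates: numerically monotone (NM) gates, which guarantee that fixed points exist and orbits stay bounded in recurrent circuits, and information-monotone (IM) gates, which provide specific degradation guarantees. The intersection NM ∩ IM has both stability and degradation guarantees, making it the ideal gate set for building controlled recurrent connections. On the method side, the paper designs a Polynomial Surrogate Training (PST) framework that uses a 9-dimensional monomial basis to map discrete logic gates to continuously differentiable polynomials; a commitment regularization term makes the soft truth tables learned during training converge exactly to discrete ternary gates at inference. The architecture uses clipping to keep values in range and to prevent the polynomials from diverging, yielding a network that discretizes automatically into a pure ternary logic circuit without a predefined vocabulary, balancing the flexibility of continuous training against the determinism of discrete logic.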
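The gate count and the two monotonicity notions can be checked by brute force. The sketch below assumes the usual Kleene conventions: values {-1, 0, +1}, the numerical order -1 < 0 < +1, and an information order in which 0 (unknown) sits below both -1 and +1 while those two are incomparable. Extending each order to input pairs componentwise is also an assumption; the paper's exact definitions may differ.

```python
from itertools import product

VALUES = (-1, 0, 1)
INPUTS = list(product(VALUES, repeat=2))  # the 9 possible input pairs

def num_leq(x, y):
    """Numerical order: -1 < 0 < +1."""
    return x <= y

def info_leq(x, y):
    """Information order: unknown (0) is below everything; -1 and +1 are incomparable."""
    return x == y or x == 0

def is_monotone(gate, leq):
    """True if p <= q componentwise always implies gate[p] <= gate[q]."""
    return all(
        leq(gate[p], gate[q])
        for p in INPUTS for q in INPUTS
        if leq(p[0], q[0]) and leq(p[1], q[1])
    )

def all_gates():
    """Yield every two-input ternary gate as a truth-table dict (3**9 of them)."""
    for outputs in product(VALUES, repeat=len(INPUTS)):
        yield dict(zip(INPUTS, outputs))

kleene_and = {p: min(p) for p in INPUTS}
kleene_or = {p: max(p) for p in INPUTS}
negate_first = {p: -p[0] for p in INPUTS}

print(3 ** len(INPUTS))                                   # size of the gate space
for name, gate in [("AND", kleene_and), ("OR", kleene_or), ("NOT a", negate_first)]:
    print(name, is_monotone(gate, num_leq), is_monotone(gate, info_leq))
```

Filtering `all_gates()` with `is_monotone` under each order gives the sizes of the NM and IM vocabularies for these assumed definitions. Under them, Kleene AND and OR pass both tests, while negating the first input is information-monotone but not numerically monotone.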
Original
On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks
Abstract: Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy. Standard RNN architectures offer no structural guarantee that outputs degrade gracefully under sensor degradation; a dropped input can silently flip a verdict from safe to unsafe. We introduce the Recurrent Differentiable Ternary Logic Gate Network (R-DTLGN), a recurrent architecture that operates over Kleene's three-valued logic {-1, 0, +1}, where 0 explicitly represents unknown. The R-DTLGN trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit at inference. We analyze the hardened circuit through two gate vocabularies derived from two orderings on the ternary domain: numerically monotone gates ensure stable recurrent dynamics, while information-monotone gates, when present, guarantee principled abstention (unknown inputs never produce wrong outputs) and monotonicity in input certainty (more information can only improve the verdict). We show that the recurrent connections required by bounded STL operators use exclusively AND and OR, which belong to both vocabularies, linking the monitoring task to the architecture's guarantees. A realizability bound derived from the STL formula's temporal operators directly sizes the network's hidden state, replacing hyperparameter search with a formula-driven specification. We evaluate on STL specifications over D4RL PointMaze navigation data, testing prediction accuracy, degradation under predicate dropout, and the accuracy-versus-safety tradeoff between two label construction pipelines. The R-DTLGN is, to our knowledge, the first recurrent architecture that couples learned temporal prediction with formal degradation guarantees rooted in three-valued logic.
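Link: https://arxiv.org/pdf/2605.24649
AI In-Depth Analysis
This study addresses uncertainty in the reasoning of multi-agent systems (MAS) and the error propagation caused by agent interaction, and proposes a hierarchical entropy feature extraction framework. It builds an entropy feature set at four levels (agent, round, sample, and system) and adds entropy features of the base model (Mbase) to quantify their effect on MAS performance. On this basis, it trains an ensemble classifier called the Entropy Judger, combining XGBoost and LightGBM, to predict whether a sample is correct and to support label-free candidate selection. Through SHAP analysis and causal inference, the study finds that the base model's entropy level strongly constrains MAS effectiveness: excessively high base-model entropy directly lowers MAS accuracy. Specifically, LLaMA-family models have lower entropy (0-100) but lack a self-correction mechanism, so errors propagate easily when entropy is high; Qwen-family models have higher entropy (100-1000), but their verify-and-revise reasoning style suppresses error spread, so they hold up better at higher entropy. The study also shows that a single-agent system (SAS) is not inferior to MAS in many scenarios, challenging the assumption that more agents are always better: MAS gains are not guaranteed and depend heavily on the base model's intrinsic entropy characteristics and reasoning style.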
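To make the entropy features concrete, the sketch below computes Shannon entropy per token distribution and aggregates it per response. The selection rule at the end (pick the candidate with the lowest mean entropy) is only a stand-in for illustration: the Entropy Judger described above is a trained XGBoost and LightGBM ensemble over many features, which this example does not reproduce. All names and numbers are hypothetical.

```python
import math
from statistics import mean

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def response_features(token_dists):
    """Aggregate token-level entropies of one response into simple features."""
    entropies = [token_entropy(dist) for dist in token_dists]
    return {"mean": mean(entropies), "max": max(entropies), "sum": sum(entropies)}

def select_candidate(candidates):
    """Stand-in selector: choose the candidate with the lowest mean token entropy.

    candidates -- dict mapping a candidate name to its list of token distributions
    """
    return min(candidates, key=lambda name: response_features(candidates[name])["mean"])

candidates = {
    "agent_a": [[0.5, 0.5], [0.6, 0.4]],     # uncertain generation
    "agent_b": [[0.9, 0.1], [0.95, 0.05]],   # confident generation
}
print(select_candidate(candidates))
```

A learned judger would consume features like these, at the agent, round, sample, and system levels, together with correctness labels instead of relying on a fixed rule.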
Original
On the Uncertainty of Large Language Model-Based Multi-Agent Systems
Abstract: Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of uncertainty, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies and six benchmark tasks. By analyzing 245 features spanning token-, trajectory-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that uncertainty dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: reducing uncertainty at any stage for any agent is critical for guaranteeing correct solutions; 2) Base Uncertainty: base models with lower entropy during problem-solving directly benefit MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS's pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent-entropy.
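Link: https://arxiv.org/pdf/2602.04234
AI In-Depth Analysis
The study proposes ZEDD, a lightweight prompt-injection detection framework aimed at the semantic drift that is hard to catch in large language model applications. It uses a layered modeling strategy: a two-component Gaussian mixture model (GMM) is first fitted to the distribution of embedding drift scores to separate normal "clean-clean" prompt pairs from attacked "injected-clean" pairs; when the GMM fails to converge, the system automatically falls back to kernel density estimation (KDE). For threshold optimization, an iterative binary search in the statistical tail sets the decision threshold, keeping the false positive rate (FPR) on clean data within 3% while aiming for roughly 50% overall detection coverage.
The experiments fine-tuned four mainstream models (Sentence BERT, Llama 3 8B, Mistral 7B, Qwen 2 7B) on an NVIDIA B200 GPU, taking only 15 to 18 minutes. On a test set of 51,603 aligned prompt pairs, ZEDD achieved high detection accuracy, with an average clean false positive rate of 2.93%. Although the Sentence BERT model fluctuated slightly on specific categories such as Jailbreak, Encoding Manipulation, and System Leak, accuracy and F1 for all models stayed above 90% on most attack categories. Compared with existing approaches, ZEDD improves detection precision and recall while keeping computational overhead very low. The study notes that, because it relies on a specific embedding model to characterize semantic drift, ZEDD may be limited by differences between the embedding spaces of LLMs of different sizes; future work plans to use adaptive resource adjustment and few-shot learning to strengthen robustness against malicious bypass attacks.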
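The drift score and the FPR-constrained threshold can be sketched as follows. Two simplifications are assumed and should be read as such: drift is taken as one minus cosine similarity, and the threshold is found by binary search on the empirical clean-score distribution rather than on a fitted GMM or KDE tail. Function names and the toy scores are hypothetical.

```python
import math

def cosine_drift(u, v):
    """Embedding drift as 1 - cosine similarity (assumed definition)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / (norm_u * norm_v)

def false_positive_rate(clean_scores, threshold):
    """Fraction of clean pairs whose drift exceeds the threshold."""
    return sum(score > threshold for score in clean_scores) / len(clean_scores)

def threshold_for_fpr(clean_scores, max_fpr=0.03, iterations=50):
    """Binary-search the smallest threshold whose clean FPR stays within max_fpr.

    Works because the FPR never increases as the threshold grows, and the
    maximum clean score always gives an FPR of zero.
    """
    lo, hi = min(clean_scores), max(clean_scores)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if false_positive_rate(clean_scores, mid) <= max_fpr:
            hi = mid   # mid is acceptable; try a lower threshold
        else:
            lo = mid   # too many false positives; raise the threshold
    return hi

clean_scores = [i / 100 for i in range(100)]  # toy drift scores for clean pairs
threshold = threshold_for_fpr(clean_scores)
print(threshold, false_positive_rate(clean_scores, threshold))
```

At inference, a prompt pair whose drift exceeds the threshold would be flagged as a likely injection; swapping the empirical scores for samples or quantiles from a fitted mixture model would bring the sketch closer to the described pipeline.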
Original
Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs
Abstract: Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With greater than 93% accuracy in classifying prompt injections across model architectures like Llama 3, Qwen 2, and Mistral and a false positive rate of <3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.
Link: https://arxiv.org/pdf/2601.12359