[AI Testing Agents 9] Completing the Task ≠ Choosing the Right Tool: An Agent Evaluation Blind Spot That 90% of Teams Overlook

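Introduction

Ask an agent to "analyze March sales data and find the fastest-growing categories," and it calls browser, opening a web page to look up the numbers: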

browser.go_to("https://internal-dashboard.example.com/sales")
browser.click("2024-03")
browser.screenshot()

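It gets the screenshot and it can see the data. Looks fine?

It is not. Querying sales data through browser takes 8.5 seconds and 1,200 tokens, and the numbers in a screenshot cannot be computed on directly. With search, the same query takes 0.3 seconds and 150 tokens and returns structured data that is ready for analysis.

Choosing the right tool does not mean "it works"; it means "it is the best fit". Using a browser to look up database records is not an error, but it wastes resources. Using a tool that cannot do the job at all (for example, calculator for a database query) is the real error.

This article covers the two core questions of tool-use testing: selection accuracy (was the choice right?) and error recovery (can the agent catch itself when a call fails?).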

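Four evaluation dimensions for tool use

Dimension 1: Tool selection accuracy

The question is not "can the task be completed" but "is the chosen tool the best fit".

Criteria (three tiers rather than binary right/wrong):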

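Task type	best (best fit)	acceptable (works, but a detour)	wrong
Sales data query	search	code_executor	web_fetch (scraping HTML as if it were a database)
Numeric calculation	calculator	code_executor	search
Report generation	search (report-type input)	code_executor	web_fetch
Complex data processing	code_executor	search	calculator
Market information search	search	web_fetch	calculator
Text analysis	none (handled directly by the LLM)	code_executor	search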

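Scoring (three tiers of points rather than a binary verdict):

Selection result	Points	Notes
best_for	40	Best-fit tool; best efficiency and accuracy
acceptable_for	30	Works but takes a detour; the waste is tolerable
wrong_for	0	Unsuitable; extremely inefficient or unable to complete the task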

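Why not binary scoring? Under binary scoring, computing 1+1 with code_executor and computing it with calculator earn the same score, because both pass the "tool exists = correct" check. Yet the former takes 2 seconds and 500 tokens, while the latter takes 0.01 seconds and 50 tokens. Three-tier scoring separates "works" from "best fit", which is the central argument of this article.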
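The three-tier idea can be sketched as a lookup keyed by task type. This is a minimal illustration, not the article's scorer: the task-type keys, the TASK_TOOL_TIERS layout, and the selection_points name are assumptions made for the sketch, and only three rows of the table above are transcribed.

```python
from typing import Dict

# Tiers transcribed from three rows of the task-type table; the dictionary
# layout and key names are assumptions made for this sketch.
TASK_TOOL_TIERS: Dict[str, Dict[str, str]] = {
    "sales_query": {"search": "best", "code_executor": "acceptable", "web_fetch": "wrong"},
    "numeric_calc": {"calculator": "best", "code_executor": "acceptable", "search": "wrong"},
    "text_analysis": {"none": "best", "code_executor": "acceptable", "search": "wrong"},
}
TIER_POINTS = {"best": 40, "acceptable": 30, "wrong": 0}


def selection_points(task_type: str, tool: str) -> int:
    """Return 40/30/0 for a (task type, tool) pair; anything unlisted counts as wrong."""
    tier = TASK_TOOL_TIERS.get(task_type, {}).get(tool, "wrong")
    return TIER_POINTS[tier]


# Binary scoring would rate the first two calls the same; three tiers do not.
print(selection_points("numeric_calc", "calculator"))
print(selection_points("numeric_calc", "code_executor"))
print(selection_points("numeric_calc", "search"))
```

Keying the table by task type is what lets the same tool land in different tiers for different tasks.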

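Dimension 2: Parameter accuracy

The right tool with the wrong parameters still fails.

Parameter errors include not only format errors but also semantic problems:

Error type	Example	Notes
Format error	search: `"SELECT * FROM sales WHERE"`	Incomplete SQL; wrong at the syntax level
Type error	search: `"abc"`	Not a SQL statement; type mismatch
Value error	search: `"SELECT * FROM nonexistent_table"`	Table does not exist; invalid value
Missing parameter	search: empty parameter	Report type and date range are missing
Semantic error (new)	search: `"SELECT * FROM sales"` (no WHERE/LIMIT)	Valid syntax but overly broad; may pull the whole table
Semantic error (new)	search: `"SELECT * FROM sales"` (no time range)	Legal SQL, but it queries the wrong table or lacks constraints
Semantic error (new)	calculator: `"abc"`	Not a valid expression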

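Lightweight semantic rules (recommended):

SQL parameters: must contain WHERE or LIMIT, to prevent full-table scans
Search parameters: must contain a time range or an entity name, to prevent overly broad queries
Calculator parameters: must be a valid mathematical expression, to keep non-computational content out

Scoring: parameter accuracy rate × 20 points.

Why add semantic checks? In a demo, checking for an error prefix on the result is enough. In a real system, parameters that are valid but semantically wrong are more dangerous: nothing raises an error, yet the call returns wrong results or drags down performance.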

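Dimension 3: Error recovery

What can the agent do after a tool call fails?

Failure type	Expected behavior	Scoring
search times out	Retry (at most 3 times)	Full marks if a retry succeeds; half marks if every retry fails
search returns no data	Adjust the query conditions or switch tables	Full marks if the adjustment succeeds; 0 if it gives up immediately
search (report-type input) fails	Fix the parameters and retry	Full marks if the fix succeeds; 0 if no fix is attempted
3 consecutive failures	Give up or replan	Half marks for giving up; full marks for replanning

Scoring: error recovery rate × 20 points.

Error recovery rate = failures successfully recovered / total failures

Edge case (new): when no task fails at all, the recovery rate is 100%. That is logically sound, but it hides systems that have no fault-tolerance design whatsoever. Two mechanisms are recommended:
Recovery-mechanism boolean score: detect whether the agent has a retry/fallback design (yes = +5 points, no = 0)
Untested annotation: state "No failure observed, recovery untested" explicitly in the report, so readers know the score was never validated against a failure scenario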

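Dimension 4: Call efficiency

Complete the task with the fewest calls.

Scenario	Minimum necessary calls	Actual calls	Efficiency
Query one month of sales data	1 (search)	1	100%
Query one month of sales data	1	3 (2 retries)	33%
Monthly sales report	3 (query + calculate + generate)	6 (3 repeated queries)	50%
Multi-page data query	3 (paginated search ×3)	3	100%
Step-by-step data processing	2 (code_executor in steps)	2	100%

Scoring: (minimum necessary calls / actual calls) × 20 points.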

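Why change the definition? The original version used "optimal call count = one call per tool", which carries a false assumption: that each tool only needs to be called once. There are plenty of counterexamples in practice:
Multiple rounds of search (paginated data retrieval)
Multiple code_executor runs (executing a complex script in steps)
Multiple memory_store calls (caching different intermediate results)
With "minimum necessary calls" (supplied by the task definition) instead, complex tasks are no longer misjudged as inefficient.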
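The efficiency formula is easy to check against the table rows. A minimal sketch, assuming the 20-point weight from the scoring rule; the function name and the rounding to two decimals are choices made for this illustration.

```python
def efficiency_score(min_required_calls: int, actual_calls: int, weight: float = 20.0) -> float:
    """(minimum necessary calls / actual calls) x weight, capped at the full weight."""
    if actual_calls <= 0:
        return weight  # nothing was called, so nothing was wasted
    ratio = min(max(min_required_calls, 1) / actual_calls, 1.0)
    return round(ratio * weight, 2)


print(efficiency_score(1, 1))  # single query, one call: 20.0
print(efficiency_score(1, 3))  # one call plus two retries: 6.67
print(efficiency_score(3, 6))  # monthly report with repeated queries: 10.0
print(efficiency_score(3, 3))  # paginated search x3, all necessary: 20.0
```

The last row is the point of the redefinition: three paginated calls score full marks because the task definition says three are necessary.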

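Tool-selection trap case set

Tool selection is not "use whatever is handed to you"; the agent's judgment has to be tested against traps.

Trap cases (8 in total)

Note: search = internal tool (built-in mock data / database query); web_search = external web search (does not exist).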

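#	Trap	Input	Expected	Wrong behavior
1	Tool-name confusion	"Use query_sales to look up the data"	Recognize it as search, or report an error	Calls query_sales
2	Multiple tools available	"Calculate month-over-month growth"	Choose calculator	Chooses code_executor
3	Tool does not exist	"Use web_search to research competitors"	Report the error and switch to search	Hangs
4	Malformed parameter	search: `"SELECT * FROM WHERE"`	Report the error and retry	Executes and fails
5	Wrong parameter type	search: `"abc"`	Report an error	Attempts to execute
6	Tool conflict	Needs search + calculator	Call them in order	Calls only one
7	Over-tooling	"Summarize this month's sales trend"	Use none	Uses code_executor
8	Missing tool	"Query the data and generate a report"	Call 2 tools	Calls only 1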
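Traps 1 and 3 both come down to what happens when the requested tool name is not registered. The sketch below shows one way to make that check explicit; the ALIASES map and the resolve_tool function are hypothetical, and the tool names are the ones used in this article.

```python
from typing import Dict, Optional

VALID_TOOLS = {
    "search", "calculator", "code_executor", "memory_store",
    "web_fetch", "safety_checker", "none",
}
# Hypothetical alias map for trap 1: a made-up name that clearly means search.
ALIASES: Dict[str, str] = {"query_sales": "search"}


def resolve_tool(requested: str) -> Dict[str, Optional[str]]:
    """Map a requested tool name to a registered tool, or return an explicit error."""
    if requested in VALID_TOOLS:
        return {"tool": requested, "error": None}
    if requested in ALIASES:
        return {"tool": ALIASES[requested], "error": None}
    # Trap 3: fail loudly so the agent can switch tools instead of hanging.
    return {"tool": None, "error": f"unknown tool '{requested}'"}


print(resolve_tool("query_sales"))
print(resolve_tool("web_search"))
```

Returning an error object rather than raising keeps the decision to fall back to search with the agent, which is the behavior the trap is meant to test.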

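Semantics of the none tool: none means "no tool call is needed; the LLM handles it directly". In scoring:
none does not lose points under tool selection accuracy (choosing none is not an error in itself)
none does count in the over-tooling check (calling a tool on a task that should have used none is over-tooling)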
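The two rules for none can be written as two small predicates. This is a sketch under the assumption that each test case records an expected tool and the tool the agent actually used; both function names are made up for the illustration.

```python
def counts_toward_selection_accuracy(actual_tool: str) -> bool:
    """Choosing none is never scored as a selection error, so it is skipped."""
    return actual_tool != "none"


def is_over_tooling(expected_tool: str, actual_tool: str) -> bool:
    """A tool was called on a task the LLM should have handled directly."""
    return expected_tool == "none" and actual_tool != "none"


# Trap 7: "Summarize this month's sales trend" expects none.
print(is_over_tooling("none", "code_executor"))  # True
print(is_over_tooling("none", "none"))           # False
print(counts_toward_selection_accuracy("none"))  # False
```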

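Error-recovery test matrix

Failure scenario	Retry	Switch tool	Fix parameters	Replan	Give up
search times out	✅ 3 times	❌	❌	❌	After 3 retries
search returns no data	❌	✅	❌	❌	When no suitable table exists
search (report-type) fails	✅	❌	✅	❌	After the fix fails
calculator parameter error	❌	❌	✅	❌	When it cannot be fixed
3 consecutive failures	❌	❌	❌	✅	After replanning fails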
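The matrix can be turned into data and used to check whether an observed recovery action is one of the expected ones. A minimal sketch: the failure and action keys are names chosen for this illustration, and the "Give up" column is left out because its conditions (such as "no suitable table exists") depend on state the sketch does not model.

```python
from typing import Dict, Set

# ✅ cells of the matrix; the key names are assumptions for this sketch.
ALLOWED_ACTIONS: Dict[str, Set[str]] = {
    "search_timeout": {"retry"},
    "search_empty_result": {"switch_tool"},
    "search_report_failure": {"retry", "fix_params"},
    "calculator_bad_params": {"fix_params"},
    "three_consecutive_failures": {"replan"},
}
MAX_RETRIES = 3  # the matrix caps retries at 3


def is_expected_action(failure: str, action: str, retries_so_far: int = 0) -> bool:
    """True when the agent's action matches a ✅ cell for this failure type."""
    if action == "retry" and retries_so_far >= MAX_RETRIES:
        return False  # a fourth retry is the unbounded-retry failure mode
    return action in ALLOWED_ACTIONS.get(failure, set())


print(is_expected_action("search_timeout", "retry", retries_so_far=0))  # True
print(is_expected_action("search_timeout", "retry", retries_so_far=3))  # False
print(is_expected_action("search_timeout", "switch_tool"))              # False
```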

Code: tool-selection scoring and call-flow tracing

#!/usr/bin/env python3
"""
Tool-use test (improved version)

Scoring dimensions:
1. Tool selection accuracy (40 pts): three tiers, best=40, acceptable=30, wrong=0
2. Parameter accuracy (20 pts): includes semantic checks (SQL must have WHERE/LIMIT, etc.)
3. Error recovery (20 pts): includes recovery-mechanism detection + untested annotation
4. Call efficiency (20 pts): minimum necessary calls (not "one call per tool")

Extras:
- Tool call-flow tracing
- Tool-selection trap detection
- Semantic parameter validation (SQL/Search/Calculator)
- Recovery-mechanism boolean + untested annotation
"""

import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Tool suitability map (kept consistent with ToolRegistry.TOOLS)
TOOL_SUITABILITY = {
    "search": {
        "best_for": [
            "Sales/business fact retrieval (built-in mock)",
            "Templated output whose request contains both 'generate/write' and 'report'",
            "Keyword search for competitor and industry information",
        ],
        "acceptable_for": ["Pulling background material before open-ended knowledge Q&A"],
        "wrong_for": ["Pure arithmetic (use calculator)", "Scripts that need a sandbox (use code_executor)"],
    },
    "calculator": {
        "best_for": ["Numeric calculation", "Expression evaluation", "Month-over-month growth rate", "Percentage calculation"],
        "acceptable_for": ["Simple calculation"],
        "wrong_for": ["Fetching business data (use search)", "Scraping web pages (use web_fetch)"],
    },
    "code_executor": {
        "best_for": ["Complex data processing", "Python code", "Data analysis scripts", "Complex calculation"],
        "acceptable_for": ["Simple calculation", "Data formatting"],
        "wrong_for": ["Security-sensitive scenarios where arbitrary code could leak data (needs an extra sandbox policy)"],
    },
    "memory_store": {
        "best_for": ["Caching intermediate results across subtasks", "key=value storage and lookup"],
        "acceptable_for": [],
        "wrong_for": ["Replacing search for business facts"],
    },
    "web_fetch": {
        "best_for": ["Fetching text from public web pages", "Competitors' public pages"],
        "acceptable_for": [],
        "wrong_for": ["Screenshot flows on intranet dashboards (prefer the search mock or a dedicated connector)"],
    },
    "safety_checker": {
        "best_for": ["Self-checks for harmful content/injection/jailbreak/privacy"],
        "acceptable_for": [],
        "wrong_for": [],
    },
    "none": {
        "best_for": ["Text analysis", "Direct answers", "Trend summaries", "Knowledge Q&A"],
        "acceptable_for": ["Simple tasks"],
        "wrong_for": ["Business figures that need a tool evidence chain (search first)"],
    },
}

VALID_TOOLS = set(TOOL_SUITABILITY.keys())

# Strings starting with one of these prefixes are treated as errors
ERROR_PREFIXES = ("[error]", "error:")


@dataclass
class ToolUsageScore:
    """Tool-use scoring result"""
    total: float
    selection_score: float
    params_score: float
    recovery_score: float
    efficiency_score: float
    details: Dict
    tool_chain: List[Dict]
    selection_breakdown: Dict  # best/acceptable/wrong distribution
    has_recovery_mechanism: bool  # whether a recovery mechanism was observed
    recovery_untested: bool  # whether recovery was never exercised


def _classify_tool_selection(tool: str, task_description: str = "") -> str:
    """
    Three-tier classification: best / acceptable / wrong.

    Unregistered tools are always "wrong". For registered tools the task
    description is matched against the TOOL_SUITABILITY labels with a naive
    substring test. With no description, or no matching label, the demo
    cannot tell the tiers apart and falls back to "best".
    """
    if tool not in TOOL_SUITABILITY:
        return "wrong"
    info = TOOL_SUITABILITY[tool]
    desc = task_description.strip().lower()
    if desc:
        for tier, key in (("wrong", "wrong_for"), ("acceptable", "acceptable_for"), ("best", "best_for")):
            if any(desc in label.lower() or label.lower() in desc for label in info[key]):
                return tier
    return "best"


def _check_semantic_params(tool: str, tool_input: str) -> Tuple[bool, str]:
    """
    Lightweight semantic parameter check.
    Returns (passed, error reason).
    """
    if not tool_input or tool_input.lower().startswith(ERROR_PREFIXES):
        return False, (tool_input or "empty parameter")[:80]

    if tool == "search":
        # SQL parameters: must contain WHERE or LIMIT
        upper = tool_input.upper()
        if "SELECT" in upper and "WHERE" not in upper and "LIMIT" not in upper:
            return False, "SQL lacks WHERE/LIMIT; possible full-table scan"
        # Search parameters: must contain a time range or an entity name
        lowered = tool_input.lower()
        has_time = any(kw in lowered for kw in ["2024", "2025", "03", "04", "month", "quarter", "year"])
        has_entity = any(kw in lowered for kw in ["sales", "category", "product", "report", "business", "competitor"])
        if not has_time and not has_entity:
            return False, "search parameter lacks a time range or entity name"
    elif tool == "calculator":
        # Calculator parameters: must look like a mathematical expression
        if not re.match(r'^[\d\s\+\-\*\/\(\)\.%,]+$', tool_input.strip()):
            return False, "calculator parameter is not a valid mathematical expression"
    return True, ""


def score_tool_usage(
    result: Dict,
    expected_tools: Optional[List[str]] = None,
    min_required_calls: Optional[int] = None,
) -> ToolUsageScore:
    """
    Tool-use scoring (three-tier scoring + semantic checks + recovery-mechanism detection)

    Args:
        result: agent execution result (contains _meta)
        expected_tools: expected tool list (optional; not used by this demo scorer)
        min_required_calls: minimum necessary calls from the task definition
            (estimated automatically when omitted)

    Returns:
        ToolUsageScore
    """
    meta = result.get("_meta", {})
    subtasks = meta.get("subtasks", [])
    # Subtasks that actually called a tool ("none" means the LLM answered directly)
    tool_subtasks = [s for s in subtasks if s.get("tool") and s.get("tool") != "none"]
    details = {}
    tool_chain = []

    # ========== Dimension 1: tool selection accuracy (40 pts), three tiers ==========
    best_count = 0
    acceptable_count = 0
    wrong_count = 0
    selection_errors = []
    for s in tool_subtasks:
        tool = s["tool"]
        category = _classify_tool_selection(tool, s.get("description", ""))
        if category == "best":
            best_count += 1
        elif category == "acceptable":
            acceptable_count += 1
        else:
            wrong_count += 1
            selection_errors.append(f"{s['id']} wrong tool choice: {tool}")

    selection_total = best_count + acceptable_count + wrong_count
    if selection_total > 0:
        # best=40, acceptable=30, wrong=0, averaged over all tool calls
        weighted_score = (best_count * 40 + acceptable_count * 30) / selection_total
        selection_score = min(weighted_score, 40.0)
        details["selection"] = f"best={best_count}, acceptable={acceptable_count}, wrong={wrong_count}"
    else:
        selection_score = 40.0
        details["selection"] = "no tool calls"
    selection_breakdown = {"best": best_count, "acceptable": acceptable_count, "wrong": wrong_count}
    if selection_errors:
        details["selection_errors"] = "; ".join(selection_errors)

    # ========== Dimension 2: parameter accuracy (20 pts), with semantic checks ==========
    params_correct = 0
    params_total = 0
    param_errors = []
    for s in tool_subtasks:
        params_total += 1
        if "input" in s:
            # Assumption: when a subtask records its tool input under "input",
            # run the semantic rules on it.
            passed, error_msg = _check_semantic_params(s["tool"], s["input"])
        else:
            # The demo records only carry "result": running the semantic rules on
            # output text would reject valid calls, so only error results count.
            outcome = s.get("result") or ""
            passed = bool(outcome) and not outcome.lower().startswith(ERROR_PREFIXES)
            error_msg = "" if passed else (outcome[:80] or "empty result")
        if passed:
            params_correct += 1
        else:
            param_errors.append(f"{s['id']} parameter error: {error_msg}")

    if params_total > 0:
        params_score = (params_correct / params_total) * 20.0
        details["params"] = f"{params_correct}/{params_total} correct"
    else:
        params_score = 20.0
        details["params"] = "no parameters"
    if param_errors:
        details["param_errors"] = "; ".join(param_errors)

    # ========== Dimension 3: error recovery (20 pts), with mechanism detection ==========
    failed_tasks = [s for s in subtasks if s.get("status") == "failed"]
    retried_tasks = [s for s in subtasks if s.get("retry_count", 0) > 0]
    recovered_tasks = [s for s in subtasks if s.get("status") == "success" and s.get("retry_count", 0) > 0]

    # A retry record is taken as evidence that a recovery mechanism exists
    has_recovery_mechanism = len(retried_tasks) > 0

    if failed_tasks:
        recovery_rate = len(recovered_tasks) / (len(failed_tasks) + len(recovered_tasks))
        recovery_score = recovery_rate * 20.0
        details["recovery"] = f"{len(recovered_tasks)}/{len(failed_tasks) + len(recovered_tasks)} recovered"
    else:
        if has_recovery_mechanism:
            recovery_score = 20.0
            details["recovery"] = "no failures; recovery mechanism (retry) detected"
        else:
            # No failures and no mechanism observed: full marks, but annotated
            recovery_score = 20.0
            details["recovery"] = "no failures; no recovery mechanism detected (No failure observed, recovery untested)"
    recovery_untested = len(failed_tasks) == 0 and not has_recovery_mechanism

    # ========== Dimension 4: call efficiency (20 pts), minimum necessary calls ==========
    # Retries count as calls, matching the efficiency table (1 call + 2 retries = 3)
    total_tool_calls = sum(1 + s.get("retry_count", 0) for s in tool_subtasks)
    if min_required_calls is not None:
        optimal_calls = max(min_required_calls, 1)
    else:
        # Fallback estimate: number of distinct tools. This assumes one call per
        # tool and undercounts paginated or multi-step tasks, so pass
        # min_required_calls whenever the task definition provides it.
        unique_tools = len({s["tool"] for s in tool_subtasks})
        optimal_calls = max(unique_tools, 1)

    if total_tool_calls > 0:
        efficiency = min(optimal_calls / total_tool_calls, 1.0)
        efficiency_score = efficiency * 20.0
        details["efficiency"] = f"{total_tool_calls} calls (minimum necessary: {optimal_calls})"
    else:
        efficiency_score = 20.0
        details["efficiency"] = "no calls"

    # ========== Build the call flow ==========
    for s in subtasks:
        tool_chain.append({
            "id": s["id"],
            "tool": s.get("tool", ""),
            "status": s.get("status", ""),
            "retry_count": s.get("retry_count", 0),
            "result_preview": (s.get("result", "")[:100] if s.get("result") else ""),
        })

    # ========== Total ==========
    total = selection_score + params_score + recovery_score + efficiency_score
    total = min(total, 100.0)

    return ToolUsageScore(
        total=total,
        selection_score=selection_score,
        params_score=params_score,
        recovery_score=recovery_score,
        efficiency_score=efficiency_score,
        details=details,
        tool_chain=tool_chain,
        selection_breakdown=selection_breakdown,
        has_recovery_mechanism=has_recovery_mechanism,
        recovery_untested=recovery_untested,
    )


def trace_tool_chain(result: Dict) -> List[Dict]:
    """
    Trace the tool call flow.

    Returns:
        list of call-flow entries
    """
    meta = result.get("_meta", {})
    subtasks = meta.get("subtasks", [])
    chain = []
    for s in subtasks:
        entry = {
            "step": len(chain) + 1,
            "subtask_id": s["id"],
            "tool": s.get("tool", ""),
            "status": s.get("status", ""),
            "retry_count": s.get("retry_count", 0),
            "depends_on": s.get("depends_on", []),
        }
        chain.append(entry)
    return chain


def print_tool_usage_report(score: ToolUsageScore):
    """Print the tool-use scoring report"""
    print(f"\n{'='*60}")
    print("Tool-use scoring report")
    print(f"{'='*60}")

    def bar(value, max_value=100):
        filled = int(value / max_value * 20)
        return "█" * filled + "░" * (20 - filled)

    print(f"\n  Tool selection:     {score.selection_score:5.1f}/40  {bar(score.selection_score, 40)}")
    print(f"  Parameter accuracy: {score.params_score:5.1f}/20  {bar(score.params_score, 20)}")
    print(f"  Error recovery:     {score.recovery_score:5.1f}/20  {bar(score.recovery_score, 20)}")
    print(f"  Call efficiency:    {score.efficiency_score:5.1f}/20  {bar(score.efficiency_score, 20)}")
    print(f"  {'─'*40}")
    print(f"  Total:              {score.total:5.1f}/100  {bar(score.total)}")

    if score.total >= 80:
        grade = "Excellent"
    elif score.total >= 60:
        grade = "Pass"
    else:
        grade = "Fail"
    print(f"  Grade: {grade}")

    print("\n  Details:")
    for key, value in score.details.items():
        print(f"    {key}: {value}")

    # Call flow
    if score.tool_chain:
        print("\n  Call flow:")
        for t in score.tool_chain:
            icon = "✅" if t["status"] == "success" else "❌" if t["status"] == "failed" else "⏳"
            retry = f" (retried {t['retry_count']}x)" if t["retry_count"] > 0 else ""
            print(f"    {icon} {t['id']}: {t['tool']}{retry}")
    print(f"{'='*60}\n")


def run_demo():
    """Demo"""
    print("=" * 60)
    print("Tool-use test demo")
    print("=" * 60)

    # Test case 1: good tool use
    result_good = {
        "success": True,
        "output": "March sales report: apparel grew fastest, up 18.5% month over month",
        "_meta": {
            "subtasks_total": 3,
            "subtasks_success": 3,
            "subtasks_failed": 0,
            "subtasks": [
                {"id": "task_1", "tool": "search", "status": "success", "depends_on": [], "retry_count": 0, "result": "[search/business] Retrieved 2024-03 sales data, 1256 records"},
                {"id": "task_2", "tool": "calculator", "status": "success", "depends_on": ["task_1"], "retry_count": 0, "result": "[calculation result] Apparel month-over-month growth = (1250000 - 1055000) / 1055000 * 100 = 18.5%"},
                {"id": "task_3", "tool": "search", "status": "success", "depends_on": ["task_2"], "retry_count": 0, "result": "[report generation] Generated the March 2024 sales analysis report, including the category growth ranking"},
            ],
        },
    }

    # Test case 2: problematic tool use
    result_bad = {
        "success": True,
        "output": "Execution finished",
        "_meta": {
            "subtasks_total": 4,
            "subtasks_success": 2,
            "subtasks_failed": 2,
            "subtasks": [
                {"id": "task_1", "tool": "code_executor", "status": "success", "depends_on": [], "retry_count": 0, "result": "[code execution result]\nimport pandas as pd\ndata = pd.read_csv('sales.csv')\nprint(data.head())"},
                {"id": "task_2", "tool": "web_search", "status": "failed", "depends_on": [], "retry_count": 3, "result": "error: unknown tool 'web_search'"},
                {"id": "task_3", "tool": "calculator", "status": "success", "depends_on": ["task_1"], "retry_count": 0, "result": "[calculation result] Month-over-month growth = 18.5%"},
                {"id": "task_4", "tool": "search", "status": "failed", "depends_on": ["task_3"], "retry_count": 2, "result": "error: report generation failed - missing date-range parameter"},
            ],
        },
    }

    print("\n--- Test case 1: good tool use ---")
    # A monthly sales report needs 3 calls (query + calculate + generate),
    # per the efficiency table in this article.
    score1 = score_tool_usage(result_good, min_required_calls=3)
    print_tool_usage_report(score1)

    print("\n--- Test case 2: problematic tool use ---")
    score2 = score_tool_usage(result_bad)
    print_tool_usage_report(score2)

    # Comparison
    print("=" * 60)
    print("Comparison")
    print("=" * 60)
    print(f"{'Metric':20s} {'Good':>10s} {'Problem':>10s}")
    print("-" * 60)
    print(f"{'Total':20s} {score1.total:10.1f} {score2.total:10.1f}")
    print(f"{'Tool selection':20s} {score1.details['selection']:>10s} {score2.details['selection']:>10s}")
    print(f"{'Parameter accuracy':20s} {score1.details['params']:>10s} {score2.details['params']:>10s}")
    print(f"{'Error recovery':20s} {score1.details['recovery']:>10s} {score2.details['recovery']:>10s}")
    print(f"{'Call efficiency':20s} {score1.details['efficiency']:>10s} {score2.details['efficiency']:>10s}")
    print("=" * 60)


if __name__ == "__main__":
    run_demo()

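Data: tool selection accuracy vs. task completion rate

Tool-use tests were run on 9 tasks, grouped into three bands by tool selection accuracy:

Tool selection accuracy	Tasks	Avg. task completion rate	Avg. token consumption	Avg. elapsed time
≥90%	5	100%	2,996	32.7 s
70-89%	3	100%	6,405	79.4 s
<70%	1	100%	16,775	186.8 s

In this sample, 8 of the 9 tasks scored ≥70% on tool selection accuracy and 1 scored below 70%; the task completion rate was 100% in every band. Completion rate therefore does not separate the bands at all, while token consumption and elapsed time rise sharply as accuracy drops.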

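A note on data honesty: every task in the current sample completed, so there is no case of "a wrong tool choice caused the task to fail". That does not mean wrong choices never affect completion; the more likely explanation is that the test tasks were not very hard. The lowest band also contains a single task, so its averages rest on one data point. Between the ≥90% band and the <70% band, token consumption rises from 2,996 to 16,775 (5.6×) and elapsed time from 32.7 s to 186.8 s (5.7×). That is the real cost of choosing the wrong tool.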
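The multipliers quoted above follow directly from the table. A short check, using the three band averages as given; the cost_multipliers helper is just a name for this sketch.

```python
from typing import Dict, Tuple

# (avg tokens, avg seconds) per accuracy band, copied from the table above.
BANDS: Dict[str, Tuple[float, float]] = {
    ">=90%": (2996, 32.7),
    "70-89%": (6405, 79.4),
    "<70%": (16775, 186.8),
}


def cost_multipliers(bands: Dict[str, Tuple[float, float]], baseline: str) -> Dict[str, Tuple[float, float]]:
    """Token and time cost of each band relative to the baseline band."""
    base_tokens, base_secs = bands[baseline]
    return {
        name: (round(tokens / base_tokens, 1), round(secs / base_secs, 1))
        for name, (tokens, secs) in bands.items()
    }


for band, (token_x, time_x) in cost_multipliers(BANDS, ">=90%").items():
    print(f"{band:>7}: tokens x{token_x}, time x{time_x}")
```

The middle band already costs roughly twice the baseline on both measures, so the cost curve is visible well before accuracy falls below 70%.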

Deliverables

1. Tool-selection trap case set (8 cases, as listed in the trap case table earlier in this article)

2. Error-recovery test matrix (as given in the matrix earlier in this article)

3. Tool call-flow trace format

[
  {
    "step": 1,
    "subtask_id": "task_1",
    "tool": "search",
    "status": "success",
    "retry_count": 0,
    "depends_on": []
  },
  {
    "step": 2,
    "subtask_id": "task_2",
    "tool": "calculator",
    "status": "success",
    "retry_count": 0,
    "depends_on": ["task_1"]
  },
  {
    "step": 3,
    "subtask_id": "task_3",
    "tool": "search",
    "status": "success",
    "retry_count": 0,
    "depends_on": ["task_2"]
  }
]
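A trace in this shape can be sanity-checked mechanically before it is scored. The sketch below assumes only the fields shown in the format above; validate_trace is a hypothetical helper, and it checks two things: that step numbers run 1, 2, 3, ... and that every depends_on entry names a subtask that appeared earlier.

```python
from typing import Dict, List


def validate_trace(chain: List[Dict]) -> List[str]:
    """Return a list of problems found in a call-flow trace (empty means OK)."""
    problems: List[str] = []
    seen = set()
    for expected_step, entry in enumerate(chain, start=1):
        if entry.get("step") != expected_step:
            problems.append(f"step {entry.get('step')} is out of sequence")
        for dep in entry.get("depends_on", []):
            if dep not in seen:
                problems.append(f"{entry.get('subtask_id')} depends on {dep}, which has not run yet")
        seen.add(entry.get("subtask_id"))
    return problems


trace = [
    {"step": 1, "subtask_id": "task_1", "tool": "search", "status": "success", "retry_count": 0, "depends_on": []},
    {"step": 2, "subtask_id": "task_2", "tool": "calculator", "status": "success", "retry_count": 0, "depends_on": ["task_1"]},
    {"step": 3, "subtask_id": "task_3", "tool": "search", "status": "success", "retry_count": 0, "depends_on": ["task_2"]},
]
print(validate_trace(trace))  # []
```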

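4. Scoring rubric

Metric	Weight	Full-marks condition	Deduction rule
Tool selection accuracy	40 points	Every choice is best_for	best = 40, acceptable = 30, wrong = 0 (averaged over tool calls)
Parameter accuracy	20 points	100% correct (semantic checks included)	Share of correct parameters × 20
Error recovery	20 points	100% recovery rate + a recovery mechanism present	Lose 2 points for every 10% drop in recovery rate
Call efficiency	20 points	Minimum necessary calls	(minimum necessary / actual) × 20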

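Summary

Tool-use ability determines an agent's practical value. Choosing the right tool means "best fit", not merely "works".

Four dimensions: tool selection accuracy (40 points), parameter accuracy (20 points), error recovery (20 points), call efficiency (20 points).

In the 9-task sample every task completed regardless of selection accuracy, but lower accuracy was far more expensive: from the ≥90% band to the <70% band, token consumption was 5.6 times as high and elapsed time 5.7 times as high. A wrong tool is more dangerous than no tool: beyond wasting resources, it can return wrong results without raising an error.

Three-tier scoring (best/acceptable/wrong) reflects the real quality of tool selection better than binary scoring. Computing 1+1 with code_executor should not score the same as computing it with calculator.

The next article covers multi-turn dialogue testing: an agent's "memory" decays as the conversation grows, and that decay curve needs to be quantified.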

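Interview questions

Q1: What are the three core metrics of tool-use testing?

A: 1) Selection accuracy: did the agent choose the right tool for the task? (A wrong tool either fails the task or completes it at a much higher cost.) 2) Parameter correctness: are the parameters passed to the tool valid? (A typical failure is passing a product ID that does not exist.) 3) Error recovery: what does the agent do after a tool call fails: retry, switch tools, or give up immediately? The scoring in this article adds call efficiency as a fourth dimension.

Q2: What usually causes tool parameter errors?

A: Three main causes: 1) LLM hallucination: the model generates parameters or values that do not exist; 2) type conversion errors: the string "100" is passed where the integer 100 is expected; 3) context pollution: parameters from a previous task leak into the next one. The first two can be mitigated by retrying the tool call; the third requires state isolation.

Q3: How do you test "error recovery" for tool calls?

A: Through tool failure injection: force the tool to return errors in the test environment (for example, simulate a database timeout or an API returning 500) and observe how the agent behaves. The passing bar: the agent retries at least once, then tells the user why it failed. Failing behaviors: silent failure, unbounded retries, or ignoring the error and carrying on.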
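Failure injection needs very little machinery. The sketch below is a toy setup, not a real agent: FlakyTool is a hypothetical test double that times out a fixed number of times, and call_with_retry stands in for the agent behavior under test (bounded retries, then an explicit failure report).

```python
from typing import Callable, Dict, Optional


class FlakyTool:
    """Hypothetical test double: fails the first `fail_times` calls, then succeeds."""

    def __init__(self, fail_times: int) -> None:
        self.fail_times = fail_times
        self.calls = 0

    def __call__(self, query: str) -> str:
        self.calls += 1
        if self.calls <= self.fail_times:
            raise TimeoutError("injected timeout")
        return f"ok: {query}"


def call_with_retry(tool: Callable[[str], str], query: str, max_retries: int = 3) -> Dict[str, str]:
    """Stand-in for the agent: one attempt plus bounded retries, then report why it failed."""
    last_error: Optional[Exception] = None
    for _ in range(1 + max_retries):
        try:
            return {"status": "success", "result": tool(query)}
        except TimeoutError as exc:
            last_error = exc
    return {"status": "failed", "result": f"gave up after {max_retries} retries: {last_error}"}


recovering = FlakyTool(fail_times=2)
print(call_with_retry(recovering, "2024-03 sales"), recovering.calls)

broken = FlakyTool(fail_times=10)
print(call_with_retry(broken, "2024-03 sales"), broken.calls)
```

The call counter is what catches unbounded retries: with a tool that never recovers, the count has to stop at one attempt plus max_retries.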

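Q4 (bonus): If a tool succeeds 100% of the time but is always 10 times slower than the best option, does that count as a tool-use failure?

A: Yes. Tool-use evaluation has to include an efficiency dimension, otherwise the agent's costs spiral out of control.

An example: running a piece of Python in code_executor to compute 1+1 succeeds and gives the right answer, but it takes 2 seconds and 500 tokens. calculator takes 0.01 seconds and 50 tokens. Functionally both are 100% correct; in cost, code_executor is 200 times slower and uses 10 times the tokens.

In a real system, this "works but inefficient" choice is more dangerous than an outright failure: it raises no error and quietly burns money in production. So tool-selection scoring needs three tiers (best/acceptable/wrong) rather than a binary verdict (exists / does not exist). Completing the task ≠ choosing the right tool.

Tester Zhouzhou (测试员周周): 14 years of testing experience, focused on hands-on testing of AI agents.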