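Citing a simulation run by London startup General Reasoning, the Financial Times reported this week that several frontier large models lost money overall when replaying a full Premier League season of betting, a sign that "automated analysts" still lack a stable chain of evidence in noisy real-world settings.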
— — —

General Reasoning, an artificial-intelligence startup based in London, rebuilt the 2023-24 Premier League calendar as a betting simulation and stress-tested eight frontier models on it. When the Financial Times published the results this week, the headline was uncomfortable: the models lost money overall, a reminder that strong language skills and polished coding demos do not automatically transfer to long-horizon, noisy markets.
Researchers fed the models historical schedules, team sheets where available, and past performance statistics. The agents could not browse the live web for injuries, line-up leaks, or breaking news. Each model was allowed up to three attempts, placing wagers on both match outcomes and goal totals, with instructions to pursue profit while keeping drawdowns in check.
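The instruction to keep drawdowns in check can be made concrete with a small calculation. The sketch below is only an illustration of the usual peak-to-trough definition of drawdown; the article does not say how the study measured it, and both the function name `max_drawdown` and the bankroll figures are hypothetical.

```python
from typing import Sequence


def max_drawdown(bankroll_path: Sequence[float]) -> float:
    """Return the largest peak-to-trough decline as a fraction of the peak."""
    if not bankroll_path:
        raise ValueError("bankroll_path must not be empty")
    peak = bankroll_path[0]
    worst = 0.0
    for value in bankroll_path:
        # Track the highest bankroll seen so far.
        peak = max(peak, value)
        if peak > 0:
            # Compare the current value against that running peak.
            worst = max(worst, (peak - value) / peak)
    return worst


# Hypothetical bankroll after each round of bets (made-up numbers, not study data).
example_path = [100_000, 104_000, 97_000, 101_000, 91_000, 95_000]
print(f"max drawdown: {max_drawdown(example_path):.1%}")  # max drawdown: 12.5%
```

Here the peak is 104,000 and the later trough is 91,000, so the drawdown is 13,000 / 104,000 = 12.5%, even though the path ends only 5% below its starting point.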
Every system started from a notional £100,000 bankroll. When a model finished all three runs, its final score was the average across those attempts; incomplete trials were left out of the averages. Even so, the cohort as a whole finished underwater.
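The scoring rule described above can be sketched in a few lines. This is one reading of the article, not the study's actual code: it assumes a model receives an average only if all three runs finished, and that models without an average are left out of the cohort figure. The model names and final bankrolls are invented for illustration.

```python
from statistics import mean
from typing import Optional

STARTING_BANKROLL = 100_000  # notional pounds, as stated in the article
REQUIRED_RUNS = 3            # each model could make up to three attempts


def run_return(final_bankroll: float) -> float:
    """Fractional profit or loss of a single run relative to the starting bankroll."""
    return (final_bankroll - STARTING_BANKROLL) / STARTING_BANKROLL


def model_score(final_bankrolls: list[Optional[float]]) -> Optional[float]:
    """Average return across runs, or None if any run did not finish (assumed rule)."""
    completed = [b for b in final_bankrolls if b is not None]
    if len(completed) < REQUIRED_RUNS:
        return None
    return mean(run_return(b) for b in completed)


# Hypothetical final bankrolls per run; None marks a run that did not complete.
results = {
    "model_a": [98_000, 90_000, 94_000],
    "model_b": [120_000, 70_000, 95_000],
    "model_c": [92_000, None, None],
}

scores = {name: model_score(runs) for name, runs in results.items()}
# Only models with a full set of runs contribute to the cohort average.
cohort = mean(s for s in scores.values() if s is not None)

for name, score in scores.items():
    print(name, "excluded" if score is None else f"{score:+.1%}")
print(f"cohort average: {cohort:+.1%}")
```

In this made-up example `model_b` has one strongly profitable run yet still averages a loss, which mirrors the kind of swing the article reports without reproducing any of its figures.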
Claude Opus 4.6 from Anthropic recorded the shallowest average loss, about 11%, and nearly broke even on one attempt. Google’s Gemini 3.1 swung wildly: one attempt showed a 34% paper gain, while another blew up. xAI’s Grok 4.20 lost money on its first try and could not complete the follow-on runs, underscoring stability gaps across vendors.
Ross Taylor, chief executive of General Reasoning and a co-author of the study, told the FT that automation hype is running ahead of measurement. Many public benchmarks live in “very static environments,” he said, and bear little resemblance to the chaos of live sportsbooks. By extension, the same gap applies to payments and treasury desks, which must reconcile streaming payments, fraud alerts, and macro shocks at once.
The paper has not been peer reviewed. Taylor’s team frames it as a counterweight to Silicon Valley chatter about autonomous coding agents. “If you test these systems on real-world chores, they can fall apart quickly,” he said. Banks and wallet operators should weigh that warning before handing models direct control of capital.
◆
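For mobile-payments and fintech teams, the real takeaway from this story lies not on the pitch but in risk-control workflows: when a model can only consume offline history and receives no real-time signals, its betting strategy, much like an anti-fraud model, can easily mistake structural bias for alpha. Before putting AI into clearing and settlement or merchant strategy, answer two questions first: does the data source cover the key sources of noise, and who is the backstop when the model fails?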
— — —
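Republication note: the body text is compiled from the Brazilian newspaper Valor Econômico's citation of, and expanded commentary on, the Financial Times report. Only the news elements are retained; bylines, editor credits, and update traces have been omitted.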
📖 Key vocabulary and expressions
| English | IPA | Meaning | Example |
|---|---|---|---|
| frontier model | /ˈfrʌntɪə ˈmɒdl/ | a leading-edge large AI model | Eight frontier models joined the Premier League betting replay. |
| bankroll | /ˈbæŋkrəʊl/ | capital set aside for betting; pool of funds | Each agent began with a £100,000 bankroll. |
| drawdown | /ˈdrɔːdaʊn/ | decline from a peak to a subsequent trough | Risk teams track drawdown before scaling live capital. |
| notional | /ˈnəʊʃənl/ | nominal; simulated on paper | The desk ran a notional book to test model behaviour. |
| paper gain | /ˈpeɪpə ɡeɪn/ | unrealized profit shown on the books | One Gemini run showed a 34% paper gain. |
| live web access | /laɪv web ˈækses/ | permission to use the internet in real time | Agents had no live web access during the simulation. |
| goal totals | /ɡəʊl ˈtəʊtlz/ | betting market on total goals scored | They wagered on outcomes and goal totals alike. |
| stress test | /ˈstres test/ | test under demanding or adverse conditions | Compliance asked for a quarterly model stress test. |
| peer review | /ˌpɪə rɪˈvjuː/ | evaluation by independent experts in the field | The study has not yet passed peer review. |
| static benchmark | /ˈstætɪk ˈbentʃmɑːk/ | benchmark run in an unchanging environment | Static benchmarks rarely mirror treasury workflows. |
| automation hype | /ˌɔːtəˈmeɪʃn haɪp/ | inflated excitement about automation | Automation hype now outpaces field measurement. |
| long-horizon task | /lɒŋ həˈraɪzn tɑːsk/ | task that spans a long period | Payments reconciliation is a long-horizon task for models. |
| counterweight | /ˈkaʊntəweɪt/ | balancing factor; point of contrast | The paper acts as a counterweight to coding headlines. |
| vendor stability | /ˈvendə stəˈbɪləti/ | reliability of a supplier | Procurement added vendor stability to the AI scorecard. |
| treasury desk | /ˈtreʒəri desk/ | treasury trading desk | The treasury desk refused unsupervised model trades. |
Ten minutes before bed, see another side of the world. (BeforeBed)