

Health Care AI Requires a Lot of Expensive Humans to Run

Preparing cancer patients for difficult decisions is an oncologist's job. They don't always remember to do it, however. At the University of Pennsylvania Health System, doctors are nudged to talk about a patient's treatment and end-of-life preferences by an artificially intelligent algorithm that predicts the chances of death. But it's far from being a set-it-and-forget-it tool. A routine tech checkup revealed the algorithm decayed during the COVID-19 pandemic, getting 7 percentage points worse at predicting who would die, according to a 2022 study. There were likely real-life impacts.
Ravi Parikh, an Emory University oncologist who was the study's lead author, told KFF Health News the tool failed hundreds of times to prompt doctors to initiate that important discussion, possibly heading off unnecessary chemotherapy, with patients who needed it. He believes several algorithms designed to enhance medical care weakened during the pandemic, not just the one at Penn Medicine. “Many institutions are not routinely monitoring the performance of their products,” Parikh said.
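The kind of routine checkup Parikh describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used at Penn Medicine or in the 2022 study: the metric (the share of patients who died whom the tool had flagged), the record fields, the `monitor` function, the thresholds, and the data are all hypothetical.

```python
from collections import defaultdict

def sensitivity(records, threshold):
    """Share of patients who died that the model flagged (score >= threshold)."""
    died = [r for r in records if r["died"]]
    if not died:
        return None  # undefined when the window contains no deaths
    flagged = sum(1 for r in died if r["score"] >= threshold)
    return flagged / len(died)

def monitor(records, baseline_period, threshold=0.4, max_drop=0.05):
    """Compare each period with the baseline; return {period: (sensitivity, alert)}.

    threshold and max_drop are hypothetical values chosen for illustration.
    """
    by_period = defaultdict(list)
    for r in records:
        by_period[r["period"]].append(r)
    baseline = sensitivity(by_period[baseline_period], threshold)
    report = {}
    for period in sorted(by_period):
        s = sensitivity(by_period[period], threshold)
        alert = (s is not None and baseline is not None
                 and (baseline - s) > max_drop)
        report[period] = (s, alert)
    return report

# Made-up records: a risk score per patient and whether the patient died.
records = [
    {"period": "2019", "score": 0.8, "died": True},
    {"period": "2019", "score": 0.7, "died": True},
    {"period": "2019", "score": 0.6, "died": True},
    {"period": "2019", "score": 0.5, "died": True},
    {"period": "2019", "score": 0.1, "died": False},
    {"period": "2020", "score": 0.8, "died": True},
    {"period": "2020", "score": 0.3, "died": True},
    {"period": "2020", "score": 0.2, "died": True},
    {"period": "2020", "score": 0.5, "died": True},
    {"period": "2020", "score": 0.1, "died": False},
]

report = monitor(records, baseline_period="2019")
for period, (value, alert) in report.items():
    print(period, value, "ALERT" if alert else "ok")
```

In this toy data the tool flags every 2019 patient who died but only half of the 2020 patients who died, so the 2020 window raises an alert. A real check would need far more data, a clinically chosen metric, and a person to review each alert.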


Algorithm glitches are one facet of a dilemma that computer scientists and doctors have long acknowledged but that is starting to puzzle hospital executives and researchers: Artificial intelligence systems require consistent monitoring and staffing, both to put them in place and to keep them working well. In essence: You need people, and more machines, to make sure the new tools don't mess up. “Everybody thinks that AI will help us with our access and capacity and improve care and so on,” said Nigam Shah, chief data scientist at Stanford Health Care. “All of that is nice and good, but if it increases the cost of care by 20%, is that viable?”
Government officials worry hospitals lack the resources to put these technologies through their paces. “I have looked far and wide,” FDA Commissioner Robert Califf said at a recent agency panel on AI. “I do not believe there's a single health system, in the United States, that's capable of validating an AI algorithm that's put into place in a clinical care system.”
AI is already widespread in health care. Algorithms are used to predict patients' risk of death or deterioration, to suggest diagnoses or triage patients, to record and summarize visits to save doctors work, and to approve insurance claims. If tech evangelists are right, the technology will become ubiquitous — and profitable. The investment firm Bessemer Venture Partners has identified some 20 health-focused AI startups on track to make $10 million in revenue each in a year. The FDA has approved nearly a thousand artificially intelligent products.


Evaluating whether these products work is challenging. Evaluating whether they continue to work—or have developed the software equivalent of a blown gasket or leaky engine—is even trickier. Take a recent study at Yale Medicine evaluating six “early warning systems,” which alert clinicians when patients are likely to deteriorate rapidly.
A supercomputer ran the data for several days, said Dana Edelson, a doctor at the University of Chicago and co-founder of a company that provided one algorithm for the study. The process was fruitful, showing huge differences in performance among the six products. It's not easy for hospitals and providers to select the best algorithms for their needs. The average doctor doesn't have a supercomputer sitting around, and there is no Consumer Reports for AI.
“We have no standards,” said Jesse Ehrenfeld, immediate past president of the American Medical Association. “There is nothing I can point you to today that is a standard around how you evaluate, monitor, look at the performance of a model of an algorithm, AI-enabled or not, when it's deployed.”
Perhaps the most common AI product in doctors' offices is called ambient documentation, a tech-enabled assistant that listens to and summarizes patient visits. Last year, investors at Rock Health tracked $353 million flowing into these documentation companies.
But, Ehrenfeld said, “There is no standard right now for comparing the output of these tools.” And that's a problem, when even small errors can be devastating. A team at Stanford University tried using large language models — the technology underlying popular AI tools like ChatGPT — to summarize patients' medical history. They compared the results with what a physician would write.


“Even in the best case, the models had a 35% error rate,” said Stanford's Shah. In medicine, “when you're writing a summary and you forget one word, like ‘fever’, I mean, that's a problem, right?”
Sometimes the reasons algorithms fail are fairly logical. For example, changes to underlying data can erode their effectiveness, like when hospitals switch lab providers.
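To make the lab-provider example concrete, here is a minimal Python sketch of one way a change in underlying data might be caught before it reaches a model. Everything in it is an assumption for illustration: the `mean_shift` and `shifted` helpers, the limit of 3 standard deviations, and the made-up hemoglobin values, which imagine a new lab reporting in g/L where the old one reported in g/dL.

```python
import statistics

def mean_shift(reference, recent):
    """Difference in means, measured in standard deviations of the reference data.

    Raises ZeroDivisionError if every reference value is identical.
    """
    ref_mean = statistics.mean(reference)
    ref_sd = statistics.stdev(reference)
    return (statistics.mean(recent) - ref_mean) / ref_sd

def shifted(reference, recent, limit=3.0):
    """True when recent values sit implausibly far from the reference values.

    The limit of 3.0 is a hypothetical cutoff, not a clinical standard.
    """
    return abs(mean_shift(reference, recent)) > limit

# Made-up hemoglobin results: the old lab reports g/dL, the new lab g/L.
old_lab = [13.0, 14.0, 15.0, 14.0, 13.5, 14.5]
new_lab = [130.0, 140.0, 150.0, 140.0, 135.0, 145.0]

print(shifted(old_lab, old_lab))  # same data, no shift
print(shifted(old_lab, new_lab))  # tenfold unit change, flagged
```

A check like this only catches crude changes such as a unit switch. Subtler changes in how a lab measures or rounds values would need more careful statistics and someone who knows the data.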
Sometimes, however, the pitfalls yawn open for no apparent reason. Sandy Aronson, a tech executive at Mass General Brigham's personalized medicine program in Boston, said that when his team tested one application meant to help genetic counselors locate relevant literature about DNA variants, the product suffered “nondeterminism”—that is, when asked the same question multiple times in a short period, it gave different results.
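The test Aronson describes, asking the same question several times and comparing the answers, can be sketched in Python as follows. This is a hypothetical illustration, not his team's procedure: `ask` stands for whatever function queries the tool under test, and `stable_tool` and `flaky_tool` are stand-ins invented for the example.

```python
import itertools
from collections import Counter

def repeatability(ask, question, runs=5):
    """Ask the same question `runs` times and count each distinct answer.

    Answers must be hashable, for example strings.
    """
    return Counter(ask(question) for _ in range(runs))

def is_deterministic(ask, question, runs=5):
    """True when every run returned exactly the same answer."""
    return len(repeatability(ask, question, runs)) == 1

# Hypothetical stand-ins for a real literature-search tool.
def stable_tool(question):
    return "paper A, paper B"

_answers = itertools.cycle(["paper A, paper B", "paper C"])

def flaky_tool(question):
    return next(_answers)  # alternates between two different answers

question = "What is known about this DNA variant?"
print(is_deterministic(stable_tool, question))
print(is_deterministic(flaky_tool, question))
```

Exact string comparison is the strictest possible test. For a tool that returns free text, a team would probably compare the set of papers cited and ignore differences in wording.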
Aronson is excited about the potential for large language models to summarize knowledge for overburdened genetic counselors, but “the technology needs to improve.” If metrics and standards are sparse and errors can crop up for strange reasons, what are institutions to do? Invest lots of resources. At Stanford, Shah said, it took eight to 10 months and 115 man-hours just to audit two models for fairness and reliability. Experts interviewed by KFF Health News floated the idea of artificial intelligence monitoring artificial intelligence, with some (human) data whiz monitoring both.
All acknowledged that would require organizations to spend even more money — a tough ask given the realities of hospital budgets and the limited supply of AI tech specialists. “It's great to have a vision where we're melting icebergs in order to have a model monitoring their model,” Shah said. “But is that really what I wanted? How many more people are we going to need?”

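Key vocabulary
oncologist n. a doctor who specializes in treating cancer
algorithm n. a set of rules a computer follows to solve a problem
prompt v. to cause or urge someone to act
initiate v. to start; to set in motion
dilemma n. a difficult situation or choice
viable adj. workable; feasible
deterioration n. the process of getting worse
ubiquitous adj. present everywhere
tricky adj. difficult to handle
fruitful adj. productive; yielding good results
devastating adj. highly destructive
underlying adj. forming the basis of something
erode v. to weaken gradually
pitfall n. a hidden danger or difficulty
audit v. to examine formally
sparse adj. scarce; thinly scattered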
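Source | Scientific American: Health Care AI Requires a Lot of Expensive Humans to Run
Image source | 百度图片 (Baidu Images)
Translator | 陈华政
Translation reviewer | 王春渝
Second reviewer | 李小辉
Reviewing editor | 王春渝
Executive editor | 陈华政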
