
点击蓝字 关注我们


上一期提到Biomni有20个功能块代码,包括主函数代码Biomni\biomni\tool\和介绍(description)代码Biomni\biomni\tool\tool_descriptionl。

以"genomics.py(基因组与前沿单细胞)"的annotate_celltype_scRNA功能为例进行介绍。
"description"代码:
{"description": "Annotate cell types based on gene markers and transferred ""labels using LLM. After leiden clustering, annotate clusters ""using differentially expressed genes and optionally ""incorporate transferred labels from reference datasets.","name": "annotate_celltype_scRNA","optional_parameters": [{"default": "leiden","description": "Clustering method to use for cell type annotation","name": "cluster","type": "str",},{"default": "claude-3-5-sonnet-20241022","description": "Language model instance for cell type prediction","name": "llm","type": "str",},{"default": None,"description": "Transferred cell type composition for each cluster","name": "composition","type": "pd.DataFrame",},],"required_parameters": [{"default": None,"description": "Name of the AnnData file containing scRNA-seq data","name": "adata_filename","type": "str",},{"default": None,"description": "Directory containing the data files","name": "data_dir","type": "str",},{"default": None,"description": 'Information about the scRNA-seq data (e.g., "homo sapiens, brain tissue, normal")',"name": "data_info","type": "str",},{"default": None,"description": "Path to the data lake","name": "data_lake_path","type": "str",},],},
主函数代码:
第一阶段:加载数据与提取 Marker 基因
做什么:
使用 scanpy 读取单细胞 .h5ad 文件。
调用 sc.tl.rank_genes_groups,使用 Wilcoxon 秩和检验(单细胞分析中最经典的差异表达算法),计算每一个 Leiden 聚类群里,哪些基因的表达量显著高于其他群。
取出前 20 个表达最显著且得分大于 0(gene_scores > 0)的基因,存入 markers 字典中。这就是给 LLM 准备的“物理证据”。
def annotate_celltype_scRNA(adata_filename,data_dir,data_info,data_lake_path,cluster="leiden",llm="claude-3-5-sonnet-20241022",composition=None,):"""Annotate cell types based on gene markers and transferred labels using LLM.After leiden clustering, annotate clusters using differentially expressed genesand optionally incorporate transferred labels from reference datasets.Parameters----------- adata_filename(str): Name of the AnnData file containing scRNA-seq data- data_dir(str): Directory containing the data files- data_info(str): Information about the scRNA-seq data(e.g., "homo sapiens, brain tissue, normal")- data_lake_path(str): Path to the data lake- llm(str): Language model instance for cell type prediction, such as 'claude-3-haiku-20240307'- composition(pd.DataFrame, optional): Transferred cell type composition for each clusterReturns:- str: Steps performed and file paths where results were saved"""def _cluster_info(cluster_id, marker_genes, composition_df=None):"""Format cluster information for LLM prompt."""if composition_df is None:return f"The enriched genes in this cluster are: {', '.join(marker_genes)}."info = [f"The enriched genes in this cluster are: {', '.join(marker_genes)}.",f"For a starting point, the transferred reference cell type composition {cluster_id} is:",]cluster_comp = []for celltype, proportion in composition_df.loc[cluster_id].items():if proportion > 0:cluster_comp.append(f"{celltype}:{proportion:.2f}")return "\n".join(info) + "" + "; ".join(cluster_comp) + "\n"from langchain_core.prompts import PromptTemplate# from langchain.chains import LLMChainsteps = []steps.append(f"Loading AnnData from {data_dir}/{adata_filename}")adata = sc.read_h5ad(f"{data_dir}/{adata_filename}")steps.append(f"Identifying marker genes for clusters defined by {cluster} clustering.")sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon", use_raw=False)genes = pd.DataFrame(adata.uns["rank_genes_groups"]["names"]).head(20)scores = pd.DataFrame(adata.uns["rank_genes_groups"]["scores"]).head(20)markers = {}for i in range(genes.shape[1]):gene_names = genes.iloc[:, i].tolist()gene_scores = scores.iloc[:, i].tolist()markers[i] = list(np.array(gene_names)[np.array(gene_scores) > 0])
第二阶段:构建标准化细胞图谱白名单 (Lines 45-49)
做什么:从数据湖里读取了一个标准的 CZI (Chan Zuckerberg Initiative) 细胞本体论 (Cell Ontology) 数据集。
为什么:AI 容易天马行空地瞎编细胞名字(比如把同一个细胞一会儿叫 "T cell",一会儿叫 "T-Lymphocyte")。代码在这里把官方标准的细胞名词全部提取出来,组合成一个长字符串 czi_celltype,作为紧箍咒紧紧限制住 LLM 的输出范围。
# TODO: this can be optimizedczi_celltype_path = data_lake_path + "/czi_census_datasets_v4.parquet"df = pd.read_parquet(czi_celltype_path)czi_celltype_set = {cell_type.strip() for cell_types in df["cell_type"] for cell_type in str(cell_types).split(";")}czi_celltype = ", ".join(sorted(czi_celltype_set))
第三阶段:设计 Prompt 模板与 AI 链 (Lines 51-68)
这里构建了发给 Claude 的提示词工程(Prompt Engineering):
提示词明确告诉 AI:
结合组织背景(data_info)。
参考转移标签(composition),但明确规定:如果置信度低于 50%(0.5),就不要信任它。
必须从 CZI 细胞标准库里挑选名字。
严格限制输出格式为:"name; score; reason"(名字; 分数; 理由)。
prompt_template = f"""Please think carefully, and identify the cell type in {data_info} based on the gene markers.Optionally refer to the transferred cell type information but do not trust it when the percentage is lower than 0.5.{{cluster_info}}The cell type names should come from cell ontology: {czi_celltype}.Only provide the cell type name, confidence score (0-1), and detailed reason.Output format: "name; score; reason".No numbers before name or spaces before number."""# Some can be a mixture of multiple cell types.llm = get_llm(llm)prompt = PromptTemplate(input_variables=["cluster_info"], template=prompt_template)chain = prompt | llmsteps.append("Annotating cell types of each cluster based on gene markers and transferred labels.")# valid_celltypes = set(czi_celltype.split(";"))cluster_annotations = {}annotation_reasons = []
第四阶段:循环迭代与纠错机制 (Lines 74-98)
这是代码中最精彩的部分。由于大模型输出具有随机性,代码采用了一个 while True 死循环来确保结果的 100% 正确:
自适应纠错:如果 Claude 没有按照 ; 分割格式输出,或者给出的细胞名字不在 CZI 标准库里,代码不会报错崩溃,而是把错误信息追加到提示词后面,重新调教并喂给 AI,直到 AI 给出正确格式的答案为止。
print(f"Annotate each cluster of {cluster}")for _idx in range(len(adata.obs[cluster].unique())):cluster_info = _cluster_info(str(_idx), markers[_idx], composition)while True:response = chain.invoke({"cluster_info": cluster_info})# Handle different response typesif hasattr(response, "content"): # For AIMessageresponse = response.contentelif isinstance(response, dict) and"text" in response:response = response["text"]elif isinstance(response, str):response = responseelse:response = str(response)try:predicted_celltype, confidence, reason = [x.strip() for x in response.split(";", 2)]if predicted_celltype in czi_celltype_set or predicted_celltype.lower() in czi_celltype_set:cluster_annotations[str(_idx)] = predicted_celltypeannotation_reasons.append((predicted_celltype, reason))breakelse:cluster_info += "\nAssigned cell type name must be in cell ontology!"except ValueError:cluster_info += "\nPlease follow the format: name; score; reason"print(f"Cluster {_idx}: {response}")
第五阶段:数据写回与保存 (Lines 100-111)
将 AI 预测出的聚类标签映射回每一个细胞(adata.obs["cell_type"])。
将 AI 给出的判定理由存入 cell_type_reason 供用户后续人工复核。
最后以 gzip 压缩格式将最终的 AnnData 写回磁盘。
# create reason dictionaryreason_dict = {}for celltype, reason in annotation_reasons:if celltype not in reason_dict:reason_dict[celltype] = []reason_dict[celltype].append(reason)reason_dict = {k: "\n".join(v) for k, v in reason_dict.items()}adata.obs["cell_type"] = adata.obs[cluster].map(cluster_annotations)adata.obs["cell_type_reason"] = adata.obs["cell_type"].map(reason_dict).astype(str)steps.append(f"Saving annotated adata to {data_dir}/annotated.h5ad, the annotations are in the 'cell_type' column.")adata.write(f"{data_dir}/annotated.h5ad", compression="gzip")return"\n".join(steps)
其实这种设计思路,在科研工具里是共通的。
比如在 青熵视界(https://qssj.nextsci.cn)
这类数据分析与可视化系统里,也在做类似的事情:
用结构化流程替代手工分析
用参数化系统替代经验选图
用规则约束保证图表一致性
用自动化生成降低科研表达成本
本质上两者在解决同一个问题:
把“经验驱动的科研操作”,变成“可计算、可复现、可约束的系统”。
Biomni 在做的是“AI 生信自动注释系统”,
而青熵这类工具在做的是“科研表达与分析自动化系统”。
方向不同,但底层逻辑一致:
让科研从“人做流程”,变成“系统执行认知”。
点击
阅读原文
查看更多
夜雨聆风