Biomni Source Code Breakdown | How Does AI Automatically Annotate Single-Cell Types?


The previous installment mentioned that Biomni ships 20 functional modules. Each has main function code under Biomni\biomni\tool\ and description code under Biomni\biomni\tool\tool_description\.

This installment uses the annotate_celltype_scRNA function in "genomics.py" (genomics and cutting-edge single-cell analysis) as an example.

The "description" code:

    {
        "description": "Annotate cell types based on gene markers and transferred "
        "labels using LLM. After leiden clustering, annotate clusters "
        "using differentially expressed genes and optionally "
        "incorporate transferred labels from reference datasets.",
        "name": "annotate_celltype_scRNA",
        "optional_parameters": [
            {
                "default": "leiden",
                "description": "Clustering method to use for cell type annotation",
                "name": "cluster",
                "type": "str",
            },
            {
                "default": "claude-3-5-sonnet-20241022",
                "description": "Language model instance for cell type prediction",
                "name": "llm",
                "type": "str",
            },
            {
                "default": None,
                "description": "Transferred cell type composition for each cluster",
                "name": "composition",
                "type": "pd.DataFrame",
            },
        ],
        "required_parameters": [
            {
                "default": None,
                "description": "Name of the AnnData file containing scRNA-seq data",
                "name": "adata_filename",
                "type": "str",
            },
            {
                "default": None,
                "description": "Directory containing the data files",
                "name": "data_dir",
                "type": "str",
            },
            {
                "default": None,
                "description": 'Information about the scRNA-seq data (e.g., "homo sapiens, brain tissue, normal")',
                "name": "data_info",
                "type": "str",
            },
            {
                "default": None,
                "description": "Path to the data lake",
                "name": "data_lake_path",
                "type": "str",
            },
        ],
    },
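To see how a description entry relates to the main function, one can compare it with the function signature. The sketch below is only an illustration: check_description is a hypothetical helper that is not part of Biomni, the function is a stub with the same signature as the tool, and the description is trimmed to names and defaults.

```python
import inspect


def annotate_celltype_scRNA(adata_filename, data_dir, data_info, data_lake_path,
                            cluster="leiden", llm="claude-3-5-sonnet-20241022",
                            composition=None):
    """Stub with the same signature as the tool; the body is omitted."""


# Trimmed copy of the description entry: only names and defaults are kept.
description = {
    "name": "annotate_celltype_scRNA",
    "required_parameters": [
        {"name": "adata_filename"}, {"name": "data_dir"},
        {"name": "data_info"}, {"name": "data_lake_path"},
    ],
    "optional_parameters": [
        {"name": "cluster", "default": "leiden"},
        {"name": "llm", "default": "claude-3-5-sonnet-20241022"},
        {"name": "composition", "default": None},
    ],
}


def check_description(func, desc):
    """Hypothetical helper: list mismatches between a description and a signature."""
    problems = []
    params = inspect.signature(func).parameters
    # Parameters without a default are required; the rest are optional.
    required = [n for n, p in params.items() if p.default is inspect.Parameter.empty]
    optional = {n: p.default for n, p in params.items()
                if p.default is not inspect.Parameter.empty}
    if func.__name__ != desc["name"]:
        problems.append("name mismatch")
    if [p["name"] for p in desc["required_parameters"]] != required:
        problems.append("required parameters differ")
    described = {p["name"]: p["default"] for p in desc["optional_parameters"]}
    if described != optional:
        problems.append("optional parameters or defaults differ")
    return problems


print(check_description(annotate_celltype_scRNA, description))
```

An empty list means the description and the signature agree on names, order of required parameters, and default values.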

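Main function code:

Stage 1: Load the data and extract marker genes

What it does:

Read the single-cell .h5ad file with scanpy.
Call sc.tl.rank_genes_groups with the Wilcoxon rank-sum test (a widely used differential expression test in single-cell analysis) to find, for each Leiden cluster, the genes expressed significantly higher than in the other clusters. Note that the call hard-codes groupby="leiden", so the cluster argument does not affect this step.
Take the top 20 ranked genes per cluster, keep those with a score greater than 0 (gene_scores > 0), and store them in the markers dictionary. This is the "physical evidence" prepared for the LLM.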

def annotate_celltype_scRNA(
    adata_filename,
    data_dir,
    data_info,
    data_lake_path,
    cluster="leiden",
    llm="claude-3-5-sonnet-20241022",
    composition=None,
):
    """Annotate cell types based on gene markers and transferred labels using LLM.
    After leiden clustering, annotate clusters using differentially expressed genes
    and optionally incorporate transferred labels from reference datasets.

    Parameters
    ----------
    - adata_filename(str): Name of the AnnData file containing scRNA-seq data
    - data_dir(str): Directory containing the data files
    - data_info(str): Information about the scRNA-seq data(e.g., "homo sapiens, brain tissue, normal")
    - data_lake_path(str): Path to the data lake
    - llm(str): Language model instance for cell type prediction, such as 'claude-3-haiku-20240307'
    - composition(pd.DataFrame, optional): Transferred cell type composition for each cluster

    Returns:
    - str: Steps performed and file paths where results were saved
    """

    def _cluster_info(cluster_id, marker_genes, composition_df=None):
        """Format cluster information for LLM prompt."""
        if composition_df is None:
            return f"The enriched genes in this cluster are: {', '.join(marker_genes)}."

        info = [
            f"The enriched genes in this cluster are: {', '.join(marker_genes)}.",
            f"For a starting point, the transferred reference cell type composition {cluster_id} is:",
        ]
        cluster_comp = []
        for celltype, proportion in composition_df.loc[cluster_id].items():
            if proportion > 0:
                cluster_comp.append(f"{celltype}:{proportion:.2f}")

        return "\n".join(info) + "" + "; ".join(cluster_comp) + "\n"

    from langchain_core.prompts import PromptTemplate

    # from langchain.chains import LLMChain

    steps = []
    steps.append(f"Loading AnnData from {data_dir}/{adata_filename}")
    adata = sc.read_h5ad(f"{data_dir}/{adata_filename}")

    steps.append(f"Identifying marker genes for clusters defined by {cluster} clustering.")
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon", use_raw=False)
    genes = pd.DataFrame(adata.uns["rank_genes_groups"]["names"]).head(20)
    scores = pd.DataFrame(adata.uns["rank_genes_groups"]["scores"]).head(20)

    markers = {}
    for i in range(genes.shape[1]):
        gene_names = genes.iloc[:, i].tolist()
        gene_scores = scores.iloc[:, i].tolist()
        markers[i] = list(np.array(gene_names)[np.array(gene_scores) > 0])
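The marker-selection step can be illustrated without scanpy. The following is a minimal sketch in plain Python, assuming the ranking has already been computed; top_positive_markers is a hypothetical helper and the gene scores are made up for illustration.

```python
def top_positive_markers(ranked, n_top=20):
    """Keep, per cluster, the first n_top ranked genes whose score is > 0.

    `ranked` maps a cluster id to a list of (gene, score) pairs that is
    already sorted from highest to lowest score.
    """
    markers = {}
    for cluster_id, pairs in ranked.items():
        markers[cluster_id] = [gene for gene, score in pairs[:n_top] if score > 0]
    return markers


# Made-up scores for illustration only.
ranked = {
    0: [("CD3E", 12.4), ("IL7R", 9.1), ("MS4A1", -0.8)],
    1: [("MS4A1", 15.0), ("CD79A", 11.2), ("CD3E", -3.5)],
}
markers = top_positive_markers(ranked, n_top=3)
print(markers)
```

Genes with a negative score are depleted in the cluster relative to the rest, so dropping them keeps only positive evidence in the prompt.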

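Stage 2: Build a whitelist of standardized cell type names (Lines 45-49)

What it does: reads czi_census_datasets_v4.parquet from the data lake, a CZI (Chan Zuckerberg Initiative) table whose cell_type column supplies standardized Cell Ontology names.

Why: an LLM tends to name cell types inconsistently (for example, calling the same cell "T cell" in one reply and "T-Lymphocyte" in another). The code extracts every standard cell type name, joins them into one long string, czi_celltype, and uses it to constrain the range of names the LLM may output.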

    # TODO: this can be optimized
    czi_celltype_path = data_lake_path + "/czi_census_datasets_v4.parquet"
    df = pd.read_parquet(czi_celltype_path)
    czi_celltype_set = {cell_type.strip() for cell_types in df["cell_type"] for cell_type in str(cell_types).split(";")}
    czi_celltype = ", ".join(sorted(czi_celltype_set))
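The set comprehension above does three things at once: split on ";", strip whitespace, and deduplicate. Here is a minimal sketch of the same idea in plain Python, where build_celltype_whitelist is a hypothetical helper and the rows are invented stand-ins for df["cell_type"].

```python
def build_celltype_whitelist(cell_type_column):
    """Split semicolon-separated entries, strip whitespace, and deduplicate."""
    names = {
        name.strip()
        for entry in cell_type_column
        for name in str(entry).split(";")
    }
    # The sorted, comma-joined string is what gets pasted into the prompt.
    return names, ", ".join(sorted(names))


# Hypothetical rows standing in for df["cell_type"].
rows = ["T cell; B cell", "B cell;  neuron", "astrocyte"]
whitelist, whitelist_text = build_celltype_whitelist(rows)
print(whitelist_text)
```

One thing to watch with this pattern: str(entry) turns a missing value such as None into the literal string "None", which would then be added to the whitelist.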

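Stage 3: Design the prompt template and the LLM chain (Lines 51-68)

Here the code builds the prompt that is sent to the LLM (Claude by default).

The prompt explicitly tells the model to:

Take the tissue context (data_info) into account.
Refer to the transferred labels (composition), but not trust a label whose proportion is below 50% (0.5).
Pick the name from the CZI cell type whitelist.
Keep the output strictly in the format "name; score; reason".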

    prompt_template = f"""Please think carefully, and identify the cell type in {data_info} based on the gene markers.
    Optionally refer to the transferred cell type information but do not trust it when the percentage is lower than 0.5.
    {{cluster_info}}
    The cell type names should come from cell ontology: {czi_celltype}.
    Only provide the cell type name, confidence score (0-1), and detailed reason.
    Output format: "name; score; reason".
    No numbers before name or spaces before number."""
    # Some can be a mixture of multiple cell types.

    llm = get_llm(llm)
    prompt = PromptTemplate(input_variables=["cluster_info"], template=prompt_template)
    chain = prompt | llm

    steps.append("Annotating cell types of each cluster based on gene markers and transferred labels.")
    # valid_celltypes = set(czi_celltype.split(";"))
    cluster_annotations = {}
    annotation_reasons = []
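The template is filled in two passes, which is why cluster_info is wrapped in doubled braces. The sketch below shows the mechanism with a shortened prompt; str.format is used here as a stand-in for the template engine, and the example values are invented.

```python
data_info = "homo sapiens, brain tissue, normal"
czi_celltype = "astrocyte, neuron"

# Pass 1: the f-string substitutes data_info and czi_celltype immediately,
# while the doubled braces collapse to a literal {cluster_info} placeholder.
template = f"""Identify the cell type in {data_info} based on the gene markers.
{{cluster_info}}
The cell type names should come from cell ontology: {czi_celltype}."""

# Pass 2: the remaining placeholder is filled once per cluster.
prompt = template.format(
    cluster_info="The enriched genes in this cluster are: GFAP, AQP4."
)
print(prompt)
```

A consequence of this design is that any literal brace inside data_info or the whitelist would survive pass 1 and be misread as a placeholder in pass 2.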

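Stage 4: Iteration and error correction (Lines 74-98)

This is the most interesting part of the code. Because LLM output is nondeterministic, each request is wrapped in a while True loop that repeats until the reply passes two checks. The loop guarantees that every accepted answer is well-formed and uses a whitelisted name. It does not guarantee that the biological call is correct, and because there is no retry limit, it could in principle run indefinitely.

Adaptive correction: if Claude's reply cannot be split on ";" into three fields, or the cell type name is not in the CZI whitelist, the code does not crash. Instead it appends the error message to the prompt and queries the LLM again, until the reply has the correct format and a valid name.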

    print(f"Annotate each cluster of {cluster}")
    for _idx in range(len(adata.obs[cluster].unique())):
        cluster_info = _cluster_info(str(_idx), markers[_idx], composition)

        while True:
            response = chain.invoke({"cluster_info": cluster_info})
            # Handle different response types
            if hasattr(response, "content"):  # For AIMessage
                response = response.content
            elif isinstance(response, dict) and "text" in response:
                response = response["text"]
            elif isinstance(response, str):
                response = response
            else:
                response = str(response)

            try:
                predicted_celltype, confidence, reason = [x.strip() for x in response.split(";", 2)]
                if predicted_celltype in czi_celltype_set or predicted_celltype.lower() in czi_celltype_set:
                    cluster_annotations[str(_idx)] = predicted_celltype
                    annotation_reasons.append((predicted_celltype, reason))
                    break
                else:
                    cluster_info += "\nAssigned cell type name must be in cell ontology!"
            except ValueError:
                cluster_info += "\nPlease follow the format: name; score; reason"

        print(f"Cluster {_idx}: {response}")
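The same validate-and-retry pattern can be written with an upper bound on attempts. This is a variation for illustration, not Biomni's behavior: annotate_with_retries and max_retries are hypothetical, and a scripted function stands in for the LLM so the flow can be followed step by step.

```python
def annotate_with_retries(ask_llm, cluster_info, whitelist, max_retries=3):
    """Ask until the reply parses and names a whitelisted type, or give up.

    `ask_llm` is any callable that takes a prompt string and returns text.
    """
    for _attempt in range(max_retries):
        reply = ask_llm(cluster_info)
        parts = [part.strip() for part in reply.split(";", 2)]
        if len(parts) != 3:
            # Wrong format: append the correction and ask again.
            cluster_info += "\nPlease follow the format: name; score; reason"
            continue
        name, score, reason = parts
        if name in whitelist or name.lower() in whitelist:
            return name, score, reason
        # Well-formed but unknown name: append the correction and ask again.
        cluster_info += "\nAssigned cell type name must be in cell ontology!"
    return None  # the caller decides how to handle an unresolved cluster


# Scripted stand-in for an LLM: bad format, then bad name, then a valid reply.
replies = iter([
    "Probably T cells",
    "T-Lymphocyte; 0.9; CD3E and IL7R are enriched",
    "T cell; 0.9; CD3E and IL7R are enriched",
])
prompts_seen = []


def fake_llm(prompt):
    prompts_seen.append(prompt)
    return next(replies)


result = annotate_with_retries(
    fake_llm,
    "The enriched genes in this cluster are: CD3E, IL7R.",
    {"T cell", "B cell"},
)
print(result)
```

Returning None after the last attempt trades completeness for a guaranteed stop, so a caller could fall back to a label such as "unknown" instead of waiting on a model that never complies.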

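Stage 5: Write the results back and save (Lines 100-111)

Map the predicted cluster labels back onto every cell (adata.obs["cell_type"]).

Store the LLM's reasoning in cell_type_reason so users can review it manually later.

Finally, write the annotated AnnData back to disk as annotated.h5ad with gzip compression.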

    # create reason dictionary
    reason_dict = {}
    for celltype, reason in annotation_reasons:
        if celltype not in reason_dict:
            reason_dict[celltype] = []
        reason_dict[celltype].append(reason)
    reason_dict = {k: "\n".join(v) for k, v in reason_dict.items()}

    adata.obs["cell_type"] = adata.obs[cluster].map(cluster_annotations)
    adata.obs["cell_type_reason"] = adata.obs["cell_type"].map(reason_dict).astype(str)

    steps.append(f"Saving annotated adata to {data_dir}/annotated.h5ad, the annotations are in the 'cell_type' column.")
    adata.write(f"{data_dir}/annotated.h5ad", compression="gzip")

    return "\n".join(steps)
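The write-back logic is easier to follow on a toy example. The sketch below reproduces the two mappings with plain dictionaries and lists; the cluster labels and reasons are invented, and cell_clusters stands in for adata.obs[cluster].

```python
cluster_annotations = {"0": "T cell", "1": "B cell", "2": "T cell"}
annotation_reasons = [
    ("T cell", "CD3E and IL7R are enriched"),
    ("B cell", "MS4A1 and CD79A are enriched"),
    ("T cell", "CD3D and TRAC are enriched"),
]

# Group reasons by cell type; clusters sharing a label share the joined text.
reason_dict = {}
for celltype, reason in annotation_reasons:
    reason_dict.setdefault(celltype, []).append(reason)
reason_dict = {k: "\n".join(v) for k, v in reason_dict.items()}

# Per-cell cluster labels, standing in for adata.obs[cluster].
cell_clusters = ["0", "1", "2", "0"]
cell_types = [cluster_annotations[c] for c in cell_clusters]
cell_reasons = [reason_dict[t] for t in cell_types]
print(cell_types)
```

Because the reasons are keyed by cell type rather than by cluster, cells from clusters 0 and 2 end up with the same combined reason text.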

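In fact, this design approach is common across research tools.

For example, 青熵视界 (https://qssj.nextsci.cn)

and similar data analysis and visualization systems do comparable things:

Replace manual analysis with structured workflows
Replace experience-based chart selection with parameterized systems
Use rule constraints to keep charts consistent
Use automated generation to lower the cost of presenting research

At bottom, both address the same problem:

turning "experience-driven research operations" into "computable, reproducible, constrainable systems".

Biomni is building an "AI-driven automatic annotation system for bioinformatics",

while tools like 青熵 are building "automation systems for research presentation and analysis".

The directions differ, but the underlying logic is the same:

moving research from "people carrying out the workflow" to "systems executing the reasoning".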
