A Practical Guide to the ESM Protein AI Toolbox: From Understanding Sequences to Designing New Proteins

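This article is based on the latest cookbook in the official ESM (Evolutionary Scale Modeling) repository and gets you started quickly with its three core models: ESMC, ESMFold2, and ESM3. Whether you work on protein function prediction, structure determination, or de novo protein design, you should find the matching tool and method here.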

1. What each of the three models is for

The ESM repository contains three model families with different roles. Do not mix them up:

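Model	Analogy	Input	Output	Role in one sentence
ESMC	GPT/BERT for proteins	Amino acid sequence (string)	Embeddings, logits, hidden states	Understands sequences: feature extraction, mutation-effect prediction, classification
ESMFold2	AlphaFold for proteins	Sequence (+ DNA/RNA/small molecules)	All-atom 3D structure	Predicts structure: folds a sequence into a 3D structure
ESM3	Multimodal generative model	Any combination of sequence / structure / function / secondary structure	Completes or generates any track	Designs proteins: generates or modifies proteins to your specification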

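Key differences:

ESMC is read-only: you give it a sequence and it returns an analysis
ESMFold2 predicts structure: you give it a sequence and it returns 3D coordinates (PDB-style)
ESM3 is writable: you give it partial information (for example, only a structure) and it fills in the rest (for example, generating a sequence)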

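2. ESMC: the protein language model

ESMC's core capability is encoding a protein sequence into numeric vectors. The official cookbook builds four hands-on tutorials on top of this; the three most useful are covered below.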

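2.1 Sequence embedding and clustering (`embed.ipynb`)

Scenario: you have a set of protein sequences (for example, homologs from different species) and want to see how they cluster by structure or function.

Core code: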

    from esm.sdk import esmc_client, batch_executor
    from esm.sdk.api import ESMProtein, LogitsConfig

    # Connect to the model (local or API both work); `token` holds your API token
    model = esmc_client(model="esmc-300m-2024-12", url="https://biohub.ai", token=token)

    # Config: return mean-pooled hidden states
    EMBEDDING_CONFIG = LogitsConfig(
        sequence=True,
        return_mean_hidden_states=True,  # shape: (n_layers, hidden_size)
    )

    def embed_sequence(model, sequence: str):
        protein = ESMProtein(sequence=sequence)
        protein_tensor = model.encode(protein)
        output = model.logits(protein_tensor, EMBEDDING_CONFIG)
        return output.mean_hidden_state  # one vector per layer

    # Process all sequences as a batch; `df` is a DataFrame with a "sequence" column
    with batch_executor() as executor:
        outputs = executor.execute_batch(
            user_func=embed_sequence,
            model=model,
            sequence=df["sequence"].tolist(),
        )

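How to use the results:

Each protein returns a matrix of shape (n_layers, hidden_size)
Take the vector from an intermediate layer (for example, layer 30, when the loaded model is deeper than that), then apply PCA for dimensionality reduction followed by KMeans clustering
Intermediate layers usually work better than the last layer for structure/function clustering, because the last layer is specialized toward the pretraining objective of predicting masked tokens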
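
The layer-selection and clustering steps above can be sketched without any ESM dependency. This is a minimal illustration, not the notebook's code: the toy embeddings are invented, the PCA step is omitted, and `pick_layer`, `kmeans`, and `cluster_layer` are hypothetical helper names. The k-means here uses a deterministic farthest-point initialization so the result is reproducible.

```python
def pick_layer(per_protein_layers, layer):
    """Keep one layer's mean-pooled vector for every protein."""
    return [layers[layer] for layers in per_protein_layers]


def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain k-means with farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far
        far = max(points, key=lambda p: min(_dist2(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: _dist2(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_layer(per_protein_layers, layer, k):
    return kmeans(pick_layer(per_protein_layers, layer), k)


# Toy data: 4 proteins x 3 layers x 2 dimensions (invented values).
# Only layer 1 separates the first two proteins from the last two.
toy = [
    [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]],
    [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]],
    [[0.0, 0.0], [4.0, 4.1], [5.0, 5.0]],
    [[0.0, 0.0], [4.2, 3.9], [5.0, 5.0]],
]
labels = cluster_layer(toy, layer=1, k=2)
print(labels)  # → [0, 0, 1, 1]
```

With real embeddings you would typically replace this hand-rolled k-means with a library implementation and reduce dimensionality first, as the text describes.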

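Example result: the tutorial uses an adenylate kinase (AdK) dataset and clusters by lid type (closed/open); the intermediate-layer embeddings separate the two classes cleanly.

2.2 Zero-shot mutation analysis (`esmc_mutation_scoring.ipynb`)

Scenario: you want to run directed evolution without a wet-lab screen up front, so you first use the model to predict which positions tolerate mutation and which are conserved.

Core idea: leave-one-out masking

Mask the amino acid at each position of the sequence in turn and look at the model's probability distribution at that position: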

    def get_leave_one_out_logits(client, sequence):
        # One masked copy per position: residue i is replaced by "_" (mask)
        masked_sequences = [
            sequence[:i] + "_" + sequence[i + 1 :] for i in range(len(sequence))
        ]
        # get_logits is the per-sequence helper defined in the notebook
        with batch_executor() as executor:
            outputs = executor.execute_batch(
                get_logits, client=client, sequence=masked_sequences
            )
        return outputs
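
To make the masking concrete, here is a small standalone sketch of what the masked inputs look like for a short sequence, and of the index shift used later when reading logits. `leave_one_out` and `token_index` are illustrative names, and the single prepended BOS token is taken from the `+1` comment in the article's code.

```python
MASK = "_"  # mask character used in the article's prompts


def leave_one_out(sequence):
    """Return one copy of the sequence per position, with that position masked."""
    return [sequence[:i] + MASK + sequence[i + 1:] for i in range(len(sequence))]


def token_index(residue_index):
    """Index of a residue in the model output when one BOS token is prepended."""
    return residue_index + 1


variants = leave_one_out("MKTA")
print(variants)        # → ['_KTA', 'M_TA', 'MK_A', 'MKT_']
print(token_index(0))  # → 1
```

A sequence of length L therefore costs L forward passes, which is why the notebook sends the variants through `batch_executor`.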

Two key metrics:

    import torch
    import torch.nn.functional as F

    # Metric 1: entropy
    def get_per_position_entropy(logit_outputs, sequence):
        entropies = []
        for i in range(len(sequence)):
            logits = logit_outputs[i].logits.sequence
            position_logits = logits[i + 1]  # +1 skips the BOS token
            probs = F.softmax(position_logits, dim=-1)
            entropy = -torch.sum(probs * torch.log2(probs + 1e-9)).item()
            entropies.append(entropy)
        return entropies

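Entropy (bits)	Meaning	Application
Low (<2)	The model expects only specific amino acids at this position	Conserved site, possibly part of the active site; avoid mutating it
High (>4)	Many amino acids are acceptable	Mutable site, a good candidate for directed-evolution screening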
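
To put the thresholds in the table in context, the following sketch computes the entropy of two extreme distributions in pure Python. It assumes a 20-letter amino-acid alphabet for simplicity; the model's actual vocabulary contains additional tokens, so its maximum entropy can be slightly higher.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Fully tolerant position: all 20 amino acids equally likely
uniform = softmax([0.0] * 20)
# Strongly conserved position: one amino acid dominates
peaked = softmax([10.0] + [0.0] * 19)

max_entropy = entropy_bits(uniform)      # equals log2(20), about 4.32 bits
conserved_entropy = entropy_bits(peaked)  # close to 0 bits
```

Under this assumption the "high (>4)" band sits just below the theoretical ceiling of log2(20) ≈ 4.32 bits, so it marks positions where the model has almost no preference.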

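    # Metric 2: mutation deleteriousness (log-likelihood ratio, LLR)
    # LLR < 0 means the mutation is less likely than the wild type, i.e. probably harmful
    def get_per_position_log_likelihood_ratios(logit_outputs, sequence, vocab):
        # vocab (assumed here): maps each amino-acid letter to its token index
        llrs = []
        for i, aa in enumerate(sequence):
            logits = logit_outputs[i].logits.sequence
            logp = torch.log_softmax(logits[i + 1], dim=-1)
            wt_idx = vocab[aa]
            # Log-probability of every possible token relative to the wild type
            llrs.append(logp - logp[wt_idx])
        return llrs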
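
A worked example of the log-likelihood ratio for a single position, using a made-up four-letter alphabet and made-up logits. `llr_table` is an illustrative helper, not part of the ESM SDK.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def llr_table(position_logits, alphabet, wild_type):
    """LLR of every residue at one position: log p(residue) - log p(wild type)."""
    logp = dict(zip(alphabet, log_softmax(position_logits)))
    return {aa: logp[aa] - logp[wild_type] for aa in alphabet}


alphabet = "ACDE"                         # toy alphabet
position_logits = [2.0, 0.0, 1.0, -1.0]   # invented logits for one masked position
scores = llr_table(position_logits, alphabet, wild_type="A")
# scores is approximately {'A': 0.0, 'C': -2.0, 'D': -1.0, 'E': -3.0}
```

Because the normalizer cancels, the LLR reduces to a difference of raw logits; the wild type always scores 0 and any less favored substitution scores below 0.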

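Case study: the tutorial analyzes PETase (a plastic-degrading enzyme). Low-entropy sites cluster around the catalytic triad, while high-entropy sites fall in surface loops, which agrees with known structural biology.

2.3 PEFT fine-tuning (`esmc_finetune.ipynb`)

Scenario: you have your own labeled data (for example, enzyme classes, stability scores, or binding affinities) and want to train a classifier on top of ESMC.

Core code: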

    import torch
    from transformers import AutoTokenizer, ESMCForSequenceClassification
    from peft import LoraConfig, get_peft_model

    # Load ESMC-300M with a classification head
    MODEL_PATH = "biohub/ESMC-300M"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = ESMCForSequenceClassification.from_pretrained(
        MODEL_PATH, num_labels=7, device_map="auto"  # 7 EC1 enzyme classes
    )

    # Add LoRA so that only a small number of parameters are trained
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.01,
        target_modules=["out_proj"],
        target_parameters=[
            "layernorm_qkv.weight",
            "ffn.fc1_weight",
            "ffn.fc2_weight",
        ],
    )
    model = get_peft_model(model, lora_config)

    # Standard training loop
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
    for step in range(NUM_TRAINING_STEPS):
        # sample_batch and the training data are defined in the notebook
        batch = sample_batch(train_sequences, train_labels, BATCH_SIZE)
        # inputs / labels: the tokenized sequences and class ids built from batch
        optimizer.zero_grad()
        outputs = model(**inputs, labels=labels)
        outputs.loss.backward()
        optimizer.step()

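Results: on the EC1 enzyme classification task (7 top-level classes: oxidoreductases, transferases, hydrolases, and so on), the tutorial reaches reasonable classification accuracy after 1,000 training steps. The key point is the very low training cost: LoRA updates fewer than 1% of the parameters.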
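
Why LoRA is so cheap follows from simple arithmetic: for a weight matrix of shape d_out x d_in, a rank-r adapter adds two small matrices with r * (d_in + d_out) parameters in total. The sketch below shows this for one matrix; the 2048 x 2048 shape is a hypothetical example, not ESMC's actual layer size, and the overall fraction for a real model depends on which matrices are adapted.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A with B of shape (d_out, r) and A of shape (r, d_in).
    """
    return r * (d_in + d_out)


def trainable_fraction(d_in, d_out, r):
    """Adapter parameters relative to the frozen matrix they modify."""
    return lora_extra_params(d_in, d_out, r) / (d_in * d_out)


# Hypothetical 2048 x 2048 projection with the article's rank r=8
extra = lora_extra_params(2048, 2048, r=8)
fraction = trainable_fraction(2048, 2048, r=8)
print(extra)     # → 32768
print(fraction)  # → 0.0078125
```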

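3. ESMFold2: structure prediction

3.1 Folding single proteins and complexes (`esmfold2.ipynb`)

Scenario: you have a protein sequence and want to know what it looks like, or, in a harder case, the structure of its complex with DNA, RNA, peptides, or small molecules.

Core call: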

    from esm.sdk import esmfold2_client
    from esm.sdk.api import FoldingConfig
    from esm.utils.structure import input_builder

    client = esmfold2_client(
        model="esmfold2-fast-2026-05", url="https://biohub.ai", token=token
    )
    config = FoldingConfig(
        num_loops=3,            # number of refinement iterations
        num_sampling_steps=32,  # number of sampling steps
        include_pae=True,       # also return the PAE matrix
    )

Three typical inputs:

(1) Single-protein folding

    protein = input_builder.ProteinInput(
        id="A",
        sequence="MNKIIIYTDGGARGN...",
    )
    structure = client.fold_all_atom(protein, config=config)
    # Output: coordinates, plddt, pae

(2) Protein + RNA/DNA complex

    rnaseh_protein = input_builder.ProteinInput(
        id=["A", "B"], sequence=rnaseh_sequence
    )
    rna = input_builder.NucleicAcidInput(id="C", sequence="CGACACCUGAUUCC")
    dna = input_builder.NucleicAcidInput(id="D", sequence="GGAATCAGGTGTCG")
    complex_input = input_builder.StructurePredictionInput(
        sequences=[rnaseh_protein, rna, dna]
    )
    structure = client.fold_all_atom(complex_input, config=config)

(3) Protein + peptide + small molecule + covalent bond

    peptide = input_builder.ProteinInput(
        id="B",
        sequence="HAEGTFTSDVSSYLEGQAAKEFIAWLVRGRG",
        modifications=[
            input_builder.Modification(position=1, ccd="AIB")  # non-natural amino acid
        ],
    )
    ligand = input_builder.LigandInput(
        id="C",
        smiles="C(=O)(CCOCCOCC(=O)N...)",
    )
    # Define the covalent bond (lys_idx is assumed to be defined earlier)
    bond = input_builder.CovalentBond(
        chain_id1="B", res_idx1=lys_idx, atom_idx1=8,
        chain_id2="C", res_idx2=0, atom_idx2=0,
    )
    # receptor is the receptor chain input, assumed to be defined earlier
    complex_inputs = input_builder.StructurePredictionInput(
        sequences=[receptor, peptide, ligand],
        covalent_bonds=[bond],
    )
    structure = client.fold_all_atom(complex_inputs, config=config)

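Reading the output:

structure.coordinates: all-atom coordinates (can be written directly to PDB/mmCIF)
structure.plddt: per-residue confidence; above 90 is highly accurate, below 50 should be used with caution
structure.pae: the Predicted Aligned Error matrix, which shows which pairs of regions have uncertain relative placement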
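
The confidence outputs above can be summarized with a few lines of plain Python. This sketch assumes pLDDT is available as a flat list of per-residue scores and PAE as a square nested list; the numbers are invented, the helper names are illustrative, and the "intermediate" label for scores between 50 and 90 is this sketch's own choice rather than something the article defines.

```python
def plddt_band(score):
    """Bucket a per-residue pLDDT using the article's two thresholds."""
    if score > 90:
        return "high"
    if score < 50:
        return "low"
    return "intermediate"


def mean_pae_between(pae, rows, cols):
    """Average PAE over a block, e.g. residues of chain A vs. residues of chain B."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


# Invented example: 4 residues forming two 2-residue regions
plddt = [95.2, 91.0, 72.5, 48.0]
pae = [
    [0.0, 1.0, 9.0, 10.0],
    [1.0, 0.0, 8.0, 9.0],
    [9.0, 8.0, 0.0, 1.0],
    [10.0, 9.0, 1.0, 0.0],
]

bands = [plddt_band(s) for s in plddt]
within = mean_pae_between(pae, rows=[0, 1], cols=[0, 1])
across = mean_pae_between(pae, rows=[0, 1], cols=[2, 3])
print(bands)   # → ['high', 'high', 'intermediate', 'low']
print(within)  # → 0.5
print(across)  # → 9.0
```

A low within-region PAE combined with a high across-region PAE, as in this toy matrix, is the pattern you would expect when each region is well predicted but their relative placement is uncertain.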

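3.2 Binder design (`binder_design.ipynb`)

Scenario: you want to design a small protein (minibinder) or antibody fragment that binds specifically to a target protein (for example, PD-L1).

This is the most demanding application of ESMFold2. The full pipeline takes roughly 8-10 minutes on an A100: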

    # Abridged outline of the notebook: helper functions and setup are omitted
    class ESMFold2Design:
        def load(self, use_scaling_critics=False):
            # Load several ESMFold2 models (inversion + critic)
            self.inversion_models = {
                name: _load_hf_model(name, lm_dropout=0.5, device="cuda")
                for name in ["ESMFold2-Experimental-Fast", "ESMFold2-Experimental-Fast-Cutoff2025"]
            }
            self.hf_critic_models = {
                name: _load_hf_model(name, lm_dropout=0.25, device="cuda")
                for name in ["ESMFold2-Experimental-Fast", ...]
            }
            self.esmc_model = ESMCForMaskedLM.from_pretrained("biohub/ESMC-6B")

        def design(self, target_name="pd-l1", binder_name="minibinder", seed=0):
            # 150 optimization steps
            for step in range(STEPS):
                # 1. Softmax gives the current (relaxed) binder sequence
                design = F.softmax(logits / temperature, dim=-1)
                # 2. ESMFold2 predicts the target + binder complex structure
                fold_result = fold_and_get_distogram(model, target_seq, design)
                # 3. Compute the structure losses
                losses = compute_structure_losses(
                    fold_result["distogram_logits"], binder_length
                )
                # 4. ESMC pseudo-perplexity (sequence plausibility)
                plm_loss = compute_esmc_pseudoperplexity_nll(esmc_model, design)
                # 5. Gradient update
                logits.grad = structure_grad + 0.15 * plm_grad
                optimizer.step()

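The loss has three terms:

Intra-contact loss: residue contacts inside the binder (the binder should fold well on its own)
Inter-contact loss: interface contacts between binder and target (the binding should be tight)
Globularity loss: how globular the binder is (it should be compact rather than loose)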
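
Steps 1 and 5 of the optimization loop above, the temperature softmax over sequence logits and the weighted sum of the two gradients, can be illustrated in isolation. This is a toy sketch with invented numbers and plain lists instead of tensors; only the 0.15 weight is taken from the article's code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Relaxed one-hot over residues; lower temperature gives a sharper choice."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine_gradients(structure_grad, plm_grad, plm_weight=0.15):
    """Total gradient = structure term + weighted language-model term."""
    return [s + plm_weight * p for s, p in zip(structure_grad, plm_grad)]


position_logits = [2.0, 1.0, 0.0]  # invented logits for one binder position
soft = softmax_with_temperature(position_logits, temperature=1.0)
sharp = softmax_with_temperature(position_logits, temperature=0.1)

# Invented per-logit gradients from the two loss families
grad = combine_gradients([1.0, -2.0, 0.5], [2.0, 2.0, -1.0])
# grad is approximately [1.3, -1.7, 0.35]
```

Optimizing continuous logits rather than discrete residues is what makes the sequence differentiable; lowering the temperature moves the relaxed sequence toward a concrete one.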

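Results from an actual run:

Target: PD-L1 (115 amino acids)
Binder: a 100-amino-acid minibinder
Over 150 steps, the total loss drops from 8.14 to 3.41
Output sequence: QQHNNNNNNNVLNQILNQ... (about 116 aa)
Scoring the final design with the critic models yields metrics such as pTM and ipTM for judging binding quality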

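4. ESM3: multimodal generation

ESM3's core capability is generation over any combination of tracks. It has 5 tracks, each usable as input or output:

sequence: amino acid sequence
coordinates: 3D atomic coordinates
secondary_structure: secondary structure (H/E/C)
sasa: solvent-accessible surface area
function: function annotations

You can supply any one of them and let the model fill in the other four.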

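4.1 Motif scaffolding (`esm3_generate.ipynb`)

Scenario: you have a functional motif (for example, an enzyme active site or a metal-binding site) and want to design a complete new protein scaffold around it.

Workflow: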

    import numpy as np
    import torch
    from esm.sdk.api import ESMProtein, GenerationConfig
    from esm.utils.structure.protein_chain import ProteinChain

    # 1. Extract the motif from a natural protein
    pdb_id = "1ITU"  # Renal Dipeptidase
    chain = ProteinChain.from_rcsb(pdb_id, "A")
    motif_inds = np.arange(123, 146)  # 23-residue motif
    motif_sequence = chain[motif_inds].sequence
    motif_coords = chain[motif_inds].atom37_positions

    # 2. Build the prompt: motif fixed, every other position masked
    prompt_length = 200
    motif_start = 72
    sequence_prompt = ["_"] * prompt_length
    sequence_prompt[motif_start : motif_start + len(motif_sequence)] = list(motif_sequence)
    structure_prompt = torch.full((prompt_length, 37, 3), np.nan)
    structure_prompt[motif_start : motif_start + len(motif_coords)] = torch.tensor(motif_coords)
    protein_prompt = ESMProtein(
        sequence="".join(sequence_prompt),
        coordinates=structure_prompt,
    )

    # 3. Run the sequence track first to fill in the sequence
    sequence_gen = model.generate(
        protein_prompt,
        GenerationConfig(track="sequence", num_steps=protein_prompt.sequence.count("_") // 2),
    )

    # 4. Then run the structure track to predict the structure
    structure_gen = model.generate(
        ESMProtein(sequence=sequence_gen.sequence),
        GenerationConfig(track="structure", num_steps=len(sequence_gen) // 8),
    )

    # 5. Check the motif RMSD (motif positions in the design vs. in the source chain)
    motif_inds_in_gen = np.arange(motif_start, motif_start + len(motif_inds))
    crmsd = structure_gen.to_protein_chain().rmsd(
        chain, mobile_inds=motif_inds_in_gen, target_inds=motif_inds
    )

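Key points:

"_" marks a masked position where the model generates new sequence
Generation runs in two steps: sequence first, then structure
Finally, the RMSD checks whether the motif was preserved accurately (below 1.5 Å counts as success)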
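
The prompt bookkeeping in this workflow is easy to check by hand. The sketch below rebuilds only the sequence prompt in pure Python, using the article's numbers (a 23-residue motif placed at offset 72 in a 200-residue prompt); the motif string is a placeholder, not the real 1ITU sequence, and `build_sequence_prompt` is an illustrative helper.

```python
def build_sequence_prompt(motif_sequence, prompt_length, motif_start):
    """Masked prompt of fixed length with the motif written in at motif_start."""
    if motif_start + len(motif_sequence) > prompt_length:
        raise ValueError("motif does not fit inside the prompt")
    prompt = ["_"] * prompt_length
    prompt[motif_start:motif_start + len(motif_sequence)] = list(motif_sequence)
    return "".join(prompt)


motif = "A" * 23  # placeholder residues standing in for the real motif
prompt = build_sequence_prompt(motif, prompt_length=200, motif_start=72)

num_masked = prompt.count("_")  # positions the model has to fill in
num_steps = num_masked // 2     # step count used for the sequence track
motif_inds_in_gen = list(range(72, 72 + len(motif)))  # motif indices in the design

print(len(prompt), num_masked, num_steps)            # → 200 177 88
print(motif_inds_in_gen[0], motif_inds_in_gen[-1])   # → 72 94
```

Note that the motif occupies indices 123-145 in the source chain but 72-94 in the design, which is why the RMSD call needs two separate index arrays.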

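4.2 GFP design (`gfp_design.ipynb`)

This is the best-known ESM3 case study: it reproduces the method used in the paper to design esmGFP.

Idea: start from natural GFP (PDB 1qy3) as the template, keep the key residues that form the chromophore, and let ESM3 redesign the rest to obtain a new GFP that does not exist in nature.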

    template_gfp = ESMProtein.from_protein_chain(
        ProteinChain.from_rcsb("1qy3", chain_id="A")
    )

    # Keep the key sites (required for chromophore formation)
    prompt_sequence = ["_"] * len(template_gfp.sequence)
    prompt_sequence[59] = "T"
    prompt_sequence[62] = "T"
    prompt_sequence[63] = "Y"
    prompt_sequence[64] = "G"
    prompt_sequence[93] = "R"
    prompt_sequence[219] = "E"
    prompt = model.encode(ESMProtein(sequence="".join(prompt_sequence)))

    # Predict the structure first
    structure_gen = model.generate(
        prompt,
        GenerationConfig(track="structure", temperature=1.0),
    )

    # Then generate the full sequence
    sequence_gen = model.generate(
        structure_gen,
        GenerationConfig(track="sequence", temperature=1.0),
    )

    # Evaluate: sequence identity and RMSD at the key sites
    # (align and alignment are set up earlier in the notebook)
    identity = align.get_sequence_identity(alignment)

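Design constraints:

The key sites are fixed (they provide the chemistry needed to form the fluorescent chromophore)
The rest of the sequence is generated freely
The final sequence can be quite dissimilar to natural GFP (in the paper, esmGFP has < 60% sequence identity to known GFPs)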
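
Two checks follow naturally from these constraints: that a generated sequence still carries the fixed residues, and how far it has drifted from the template. The sketch below implements both in pure Python. The fixed positions are the ones from the article's code; the prompt length of 230 is a placeholder rather than the real chain length, and `ungapped_identity` is a simplification that compares equal-length sequences position by position instead of aligning them as the notebook does.

```python
FIXED_SITES = {59: "T", 62: "T", 63: "Y", 64: "G", 93: "R", 219: "E"}


def build_prompt(length, fixed_sites):
    """Fully masked prompt with only the fixed residues written in."""
    prompt = ["_"] * length
    for index, residue in fixed_sites.items():
        prompt[index] = residue
    return "".join(prompt)


def fixed_sites_preserved(sequence, fixed_sites):
    """True if every fixed position still holds its required residue."""
    return all(sequence[i] == aa for i, aa in fixed_sites.items())


def ungapped_identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


prompt = build_prompt(230, FIXED_SITES)  # 230 is a placeholder length
print(prompt.count("_"))                           # → 224
print(fixed_sites_preserved(prompt, FIXED_SITES))  # → True
print(ungapped_identity("ACDE", "ACDF"))           # → 0.75
```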

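4.3 Guided generation (`esm3_guided_generation.ipynb`)

Scenario: you want generated proteins to satisfy a custom property, for example high pTM (good structural quality), no cysteines (to avoid misfolding), or compactness (small radius of gyration).

Usage: subclass GuidedDecodingScoringFunction and write your own scoring function.

Example 1: guiding toward high pTM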

    from esm.sdk.experimental import ESM3GuidedDecoding, GuidedDecodingScoringFunction

    class PTMScoringFunction(GuidedDecodingScoringFunction):
        def __call__(self, protein: ESMProtein) -> float:
            return float(protein.ptm)  # higher is better

    ptm_guided = ESM3GuidedDecoding(
        client=model, scoring_function=PTMScoringFunction()
    )
    generated = ptm_guided.guided_generate(
        protein=ESMProtein(sequence="_" * 256),
        num_decoding_steps=256 // 8,
        num_samples_per_step=10,  # sample 10 candidates per step and keep the best
    )

Comparison:

Unguided: pTM = 0.52
Guided: pTM = 0.78

Example 2: avoiding cysteines

    class NoCysteineScoringFunction(GuidedDecodingScoringFunction):
        def __call__(self, protein: ESMProtein) -> float:
            return -protein.sequence.count("C")  # fewer cysteines is better

    no_cys_guided = ESM3GuidedDecoding(
        client=model, scoring_function=NoCysteineScoringFunction()
    )
    result = no_cys_guided.guided_generate(...)
    # In the tutorial's run, the resulting sequence contains 0 cysteines
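
The "sample several candidates per step, keep the best" idea behind guided decoding can be shown with a toy stand-in that needs no model. In this sketch a random generator plays the role of ESM3, unmasking a few positions per step, and the scoring function is the same cysteine count as above. Every name here is illustrative; it does not reproduce the internals of ESM3GuidedDecoding.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def no_cysteine_score(sequence):
    """Higher is better: penalize every cysteine."""
    return -sequence.count("C")


def propose(sequence, n_unmask, rng):
    """Toy 'model': fill n_unmask random masked positions with random residues."""
    masked = [i for i, c in enumerate(sequence) if c == "_"]
    chosen = rng.sample(masked, min(n_unmask, len(masked)))
    out = list(sequence)
    for i in chosen:
        out[i] = rng.choice(AMINO_ACIDS)
    return "".join(out)


def best_of(candidates, score):
    """Selection step of guided decoding: keep the highest-scoring candidate."""
    return max(candidates, key=score)


def guided_generate(length, num_steps, num_samples, score, seed=0):
    rng = random.Random(seed)
    sequence = "_" * length
    per_step = -(-length // num_steps)  # ceiling division
    for _ in range(num_steps):
        candidates = [propose(sequence, per_step, rng) for _ in range(num_samples)]
        sequence = best_of(candidates, score)
    return sequence


designed = guided_generate(
    length=32, num_steps=4, num_samples=10, score=no_cysteine_score
)
print(designed, designed.count("C"))
```

The scoring function only ranks candidates, so it never needs gradients; this is why arbitrary Python logic such as a character count can be used as guidance.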

Example 3: combining constraints

You can also use ESM3GuidedDecodingWithConstraints to force certain positions to stay fixed:

    from esm.sdk.experimental import GenerationConstraint, ConstraintType

    constraints = [
        GenerationConstraint(
            track="sequence",
            constraint_type=ConstraintType.EQUAL,
            first_residue=0,
            last_residue=10,
            value="MKT...",  # fixed N-terminal signal peptide
        )
    ]

5. Choosing a model by task

Your task	Recommended model	Tutorial / script	Key API
Sequence feature extraction (for downstream classification/clustering)	ESMC	`embed.ipynb`	`client.logits(..., return_embeddings=True)`
Mutation site prediction (which positions are conserved or mutable)	ESMC	`esmc_mutation_scoring.ipynb`	Leave-one-out masking + entropy
Finding the best feature layer (which layer's embedding suits my task)	ESMC	`esmc_layer_sweep.ipynb`	`return_mean_hidden_states=True` + per-layer cross-validation
Training a classifier (enzyme classes, stability prediction, and so on)	ESMC + LoRA	`esmc_finetune.ipynb`	`ESMCForSequenceClassification` + PEFT
Protein structure prediction	ESMFold2	`esmfold2.ipynb`	`client.fold_all_atom(...)`
Protein-DNA/RNA/small-molecule complexes	ESMFold2	`esmfold2.ipynb`	`StructurePredictionInput(sequences=[...])`
Designing binding proteins (minibinders/antibodies)	ESMFold2 + ESMC	`binder_design.ipynb`	`design_binder(...)`
Designing a new protein around a motif	ESM3	`esm3_generate.ipynb`	`GenerationConfig(track="sequence")` + `track="structure"`
De novo design of new proteins such as GFP	ESM3	`gfp_design.ipynb`	Fixed key sites + stepwise generation
Protein generation with custom properties	ESM3	`esm3_guided_generation.ipynb`	Subclass `GuidedDecodingScoringFunction`
Model interpretability (SAE features)	ESMC + SAE	`esmc_sae_feature_interpretation.ipynb`	`LogitsConfig(sae_config=SAEConfig(...))`
Batch processing of many sequences	Any	`batch_executor()`	`executor.execute_batch(user_func, ...)`

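6. Running locally vs. using the API

Every model supports both modes:

    import os

    from esm.models.esmc import ESMC
    from esm.sdk import esmc_client

    if os.environ.get("ESM_API_KEY"):
        # Option 1: Biohub API (no GPU needed, usage-based billing)
        model = esmc_client(
            model="esmc-300m-2024-12",
            url="https://biohub.ai",
            token=os.environ["ESM_API_KEY"],
        )
    else:
        # Option 2: load locally (needs a GPU, free)
        model = ESMC.from_pretrained("esmc_300m")  # local path or HuggingFace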

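Mode	Pros	Cons
API	Zero setup, always available, no GPU to manage	Requires network access, usage-based billing
Local	Free, data stays inside your network, suited to large batch runs	Needs a data-center GPU such as an A100 or V100, large model files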

Approximate model file sizes:

ESMC-300M: about 1.2 GB
ESMC-6B: about 22 GB
ESMFold2-Experimental-Fast: about 685 MB
ESMFold2-Experimental: about 862 MB
ESM3-medium: about 3-4 GB
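
A rough way to sanity-check sizes like these is parameters times bytes per parameter. The sketch below assumes nominal parameter counts (300 million and 6 billion) and 4-byte float32 weights; both are assumptions, since actual checkpoints may use other precisions or include extra tensors.

```python
def checkpoint_size_gb(num_params, bytes_per_param=4):
    """Estimated checkpoint size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def checkpoint_size_gib(num_params, bytes_per_param=4):
    """Estimated checkpoint size in binary gibibytes (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


size_300m_gb = checkpoint_size_gb(300e6)  # about 1.2 GB
size_6b_gb = checkpoint_size_gb(6e9)      # about 24 GB
size_6b_gib = checkpoint_size_gib(6e9)    # about 22.4 GiB
size_6b_half = checkpoint_size_gb(6e9, bytes_per_param=2)  # about 12 GB at 16-bit
```

Under these assumptions the 300M figure matches the list directly, and the 6B figure is consistent with the list if that entry is read in binary units; at 16-bit precision the same model would need roughly half the space.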

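Closing thoughts

The core design idea behind the ESM toolkit is a unified multimodal representation of biological data:

Sequence, structure, and function are all tokenized into discrete symbols
A Transformer reasons jointly over these symbols
Generation can be conditioned on any combination (give only the structure and let the model infer the sequence; give only the sequence and let the model predict the structure)

This means you do not need to train a separate model for every task: a single ESM3 model can fold, inverse-fold, and design new proteins. That is what sets it apart from traditional tools.

All code in this article comes from the official ESM cookbook: https://github.com/biohub/esm/tree/main/cookbook/tutorials. Cloning it and running through the notebooks yourself is recommended.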