没养龙虾(OpenClaw),先养个马(Hermes)来做生物信息学

AI 发展日新月异，还没来得及养龙虾，马（Hermes）又来了。“弃龙虾（OpenClaw）、选爱马仕（Hermes）” ，似乎正在形成共识。真是应了那句话：只要学得慢，就不用学。

既然如此，那么我们就先不管龙虾，今天先来安装一个马试试。

安装

我们先看一下 Hermes 的 GitHub 主页：https://github.com/nousresearch/hermes-agent

今天（2026-05-12），hermes 在 GitHub 上有 14.5 万颗星，这对于才火起来 1 个月左右的项目来说，已经非常成功了。

进入文档页面：https://hermes-agent.nousresearch.com/

我们复制安装命令：

curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash

运行安装——我这里是 Linux 系统。在运行这条命令之前，先要确保系统安装了Python 3.11。

配置

运行hermes setup

我们选择第一项：Quick setup。

设置大模型提供商

我们这里选择使用 DeepSeek。

我们先到 DeepSeek 官网的开放平台申请好 API Key：

取一个名字，比如：hermes

然后回到终端，在这里输入刚才创建的 API key：

接着 Base URL 填写这个：https://api.deepseek.com

选择模型：

选择终端后端

保持默认选项就好了。

设置消息平台

选择 QQ：

如果终端没有出现二维码，那么就复制提示的链接，到 QQ 开放平台创建一个 QQ 机器人，我们给它取一个名字：Hermes

配置好之后：

最后我们希望小马能常驻服务器，于是同意安装程序将其设为系统服务，这样重启服务器 hermes 也自动启动了。

但是出现了一条错误，原始是我们的当前账号galaxy没有添加系统服务的权限。

在 root 下执行：

# 1. 赋予 galaxy 用户常驻权限（确保退出登录后服务依然运行）loginctl enable-linger galaxy# 2. 切换到 galaxy 用户并进入对应目录su - galaxycd /home/galaxy/.hermes/hermes-agent# 3. 运行安装命令hermes gateway install

打声招呼

要在 QQ 上跟 hermes 对话，首先必须要创建好机器人并且跟 hermes 配对好，就像蓝牙配对一样。其次hermes它只是一个智能体，它能工作要依赖于大模型，因此必须为其配置好大模型，我们这里用的是 DeepSeek v4。

可以看到，前面跟机器人对话它没反应，就是没有配置好。

单细胞学习路线图

不得不说，hermes给的单细胞学习建议还是非常科学客观的。先是打语言基础，然后是核心工具的掌握：Seurat，Scanpy。接着是标准分析，进阶分析。

单细胞标准分析

大家接下来看hermes的实战表演吧。

下面让它补一个celltypist分析：

最终 UMAP 图：

最后总结

有意思的是，hermes会自动将分析过程提取成skill，这或许就是它跟OpenClaw很大的不同之处。后者更依赖人工编写skill。

比如我们刚才的分析就自动保存在

/home/galaxy/.hermes/skills/data-science/single-cell-scanpy目录下：

.├── references│   ├── celltypist_pitfalls.md│   └── rank_genes_groups_structured_array.md├── SKILL.md└── templates    ├── celltypist_annotation.py    └── pbmc3k_pipeline.py2 directories, 5 files

SKILL.md

---name: single-cell-scanpydescription: Single-cell RNA-seq analysis with Scanpy — QC, normalization, HVG, PCA, UMAP, clustering, marker detection, and cell-type annotation (manual marker + CellTypist automated).---# Single-Cell RNA-seq Analysis with ScanpyStandard scRNA-seq analysis pipeline using Scanpy (Python). Covers loading data through cell-type annotation, with tool-specific pitfalls documented below.## Triggers- User asks to run single-cell analysis, scRNA-seq, Scanpy, PBMC analysis- User mentions UMAP, Leiden clustering, marker genes, Seurat/Scanpy- User wants to analyze `.h5ad` files or 10x Genomics data## Standard Pipeline (10 Steps)1. **Load data** — `sc.datasets.pbmc3k()` or `sc.read_h5ad()`2. **QC filtering** — mark MT/ribo genes, `sc.pp.calculate_qc_metrics`, filter cells by n_genes and %MT3. **Normalization** — `sc.pp.normalize_total(target_sum=1e4)` + `sc.pp.log1p()`4. **HVG selection** — `sc.pp.highly_variable_genes(n_top_genes=2000)` (use default `seurat` flavor unless `scikit-misc` installed)5. **Regress + scale** — `sc.pp.regress_out(['total_counts', 'pct_counts_mt'])` then `sc.pp.scale(max_value=10)`6. **PCA** — `sc.tl.pca(svd_solver='arpack', n_comps=50)`7. **Neighbors + UMAP** — `sc.pp.neighbors(n_pcs=15)`, `sc.tl.umap()`8. **Clustering** — `sc.tl.leiden(resolution=1.0)`9. **Marker genes** — `sc.tl.rank_genes_groups(groupby, method='wilcoxon', use_raw=True)`10. **Cell-type annotation** — manual marker-based or automated via **CellTypist** (recommended: majority voting per cluster, far more accurate than manual markers)## CellTypist Automated Annotation (Step 10-b)CellTypist (Oxford Teichmann Lab) uses a pre-trained model with 98 immune cell types and 4164 gene features. **Always prefer this over manual marker-based annotation** — manual markers routinely misclassify MAIT cells, CD8+ Tem/Trm, and DCs (all of which express NKG7/CCL5/GZMB and get confused with NK cells).```pythonfrom celltypist import modelsmodel = models.Model.load(model='Immune_All_Low.pkl')  # auto-downloads# CellTypist needs log-norm data; create clean copyadata_ct = adata.raw.to_adata()adata_ct.obs = adata.obs.copy()sc.pp.normalize_total(adata_ct, target_sum=1e4)sc.pp.log1p(adata_ct)predictions = celltypist.annotate(    adata_ct, model='Immune_All_Low.pkl',    majority_voting=True, over_clustering='leiden')adata.obs['celltypist_label'] = predictions.predicted_labels['majority_voting'].values```See `references/celltypist_pitfalls.md` for model output columns, data prep, and manual-vs-automated comparison.Full working script: `templates/celltypist_annotation.py`.## Critical Pitfalls### P1: `rank_genes_groups` yields structured numpy arrays- `adata.uns['rank_genes_groups']['names']` is a 1D structured recarray- Shape = `(n_genes,)` — NOT `(n_genes, n_clusters)`- Number of clusters = `len(names.dtype.names)` — NOT `shape[1]`- Access pattern: `names[rank][cluster_index]` — rank is gene rank (0=top), cluster_index is integer- Always cast genes to `str()` before passing to plotting functions```pythonnames = adata.uns['rank_genes_groups']['names']n_clusters = len(names.dtype.names)top_gene_cluster0 = str(names[0][0])  # top gene for cluster 0```### P2: `sc.pl.pca_variance_ratio` does not accept `ax` parameter- Use separate `plt.figure()` calls; save and close each individually- Same for `sc.pl.pca()` — it manages its own figure### P3: Dotplot `var_names` must be plain Python strings- Passing numpy record types (from structured arrays) causes `TypeError: unhashable type`- Always convert: `var_names=[str(g) for g in gene_list]`### P4: `sc.tl.score_genes` needs a flat gene list- Passing `list(marker_dict.values())` gives list-of-lists — wrong- Flatten: `[g for genes in marker_dict.values() for g in genes]`### P5: `sc.tl.rank_genes_groups` should use log-normalized data- Despite `use_raw=True`, log-normalize before calling or expect a warning- The pipeline above normalizes + log1p before HVG, so data is ready### P6: Vanilla `seurat` HVG flavor works out of the box- `flavor='seurat_v3'` requires `scikit-misc` (not commonly pre-installed)- Default `flavor='seurat'` needs no extra packages### P7: CellTypist — majority voting output has no `conf_score` column- `predictions.predicted_labels` columns: `['predicted_labels', 'over_clustering', 'majority_voting']`- Use `majority_voting` for cluster-level consensus; no separate confidence column- Must pass log-normalized data (not regressed/scaled) — create fresh `adata_ct` from `.raw`- Model auto-downloads on first use; wrap in `stdbuf -oL -eL` to avoid buffered-hang appearance## Running with real-time outputAlways use `stdbuf -oL -eL python3 -u script.py` or `PYTHONUNBUFFERED=1 python3 -u script.py` to avoid buffered stdout in long analyses.## Files| Path                                               | Purpose                                                            || -------------------------------------------------- | ------------------------------------------------------------------ || `templates/pbmc3k_pipeline.py`                     | Complete 10-step pipeline — copy and modify for new datasets       || `templates/celltypist_annotation.py`               | CellTypist automated annotation with UMAP cluster labeling         || `references/rank_genes_groups_structured_array.md` | Deep-dive on the structured array access pattern (hardest pitfall) || `references/celltypist_pitfalls.md`                | CellTypist model output, data prep, manual-vs-automated comparison |

pbmc3k_pipeline.py

分析代码保存成了模板：

#!/usr/bin/env python3"""PBMC 3k scRNA-seq Analysis Pipeline (Scanpy)==============================================Validated template for single-cell analysis. Replace dataset loadingin Step 1 to use your own .h5ad file.Generated by / updated by: see SKILL.md for full documentation."""import scanpy as scimport matplotlib.pyplot as pltimport os, sysimport warningswarnings.filterwarnings('ignore')sc.settings.verbosity = 2sc.settings.set_figure_params(dpi=100, facecolor='white', frameon=True)# --- Configuration ---OUT_DIR = "./results/"os.makedirs(OUT_DIR, exist_ok=True)# ============================================================# Step 1: Load data# ============================================================print("\n" + "="*60 + "\n  Step 1: Load data\n" + "="*60)adata = sc.datasets.pbmc3k()  # Replace with sc.read_h5ad('your_data.h5ad')print(f"  Dimensions: {adata.shape[0]} cells x {adata.shape[1]} genes")# ============================================================# Step 2: Quality Control# ============================================================print("\n" + "="*60 + "\n  Step 2: QC\n" + "="*60)# Mark mitochondrial and ribosomal genes (adjust prefix for your species)adata.var['mt'] = adata.var_names.str.startswith('MT-')adata.var['ribo'] = adata.var_names.str.startswith(('RPS', 'RPL'))sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], inplace=True)# --- QC plots (before filtering) ---fig, axes = plt.subplots(1, 3, figsize=(18, 5))sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],             jitter=0.4, multi_panel=False, ax=axes[0], show=False)axes[0].set_title('QC (before)')sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='pct_counts_mt',              ax=axes[1], show=False)axes[1].set_title('Genes vs UMI')axes[2].hist(adata.obs['n_genes_by_counts'], bins=100, alpha=0.7)axes[2].axvline(200, color='red', linestyle='--', label='min=200')axes[2].axvline(2500, color='darkred', linestyle='--', label='max=2500')axes[2].set_xlabel('Number of genes')axes[2].set_title('Gene count distribution')axes[2].legend()plt.tight_layout()plt.savefig(f"{OUT_DIR}01_QC_before.png", dpi=150, bbox_inches='tight')plt.close()# --- Filtering ---print(f"  Before filter: {adata.n_obs} cells")sc.pp.filter_cells(adata, min_genes=200)sc.pp.filter_cells(adata, max_genes=2500)adata = adata[adata.obs.pct_counts_mt < 5, :].copy()sc.pp.filter_genes(adata, min_cells=3)print(f"  After filter:  {adata.n_obs} cells, {adata.n_vars} genes")# QC after filteringfig, axes = plt.subplots(1, 3, figsize=(18, 5))sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],             jitter=0.4, multi_panel=False, ax=axes[0], show=False)axes[0].set_title('QC (after)')sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='pct_counts_mt',              ax=axes[1], show=False)axes[1].set_title('Genes vs UMI (clean)')top20 = adata.var_names[adata.var['n_cells_by_counts'].argsort()[::-1][:20]]sc.pl.highest_expr_genes(adata, n_top=20, ax=axes[2], show=False)plt.tight_layout()plt.savefig(f"{OUT_DIR}02_QC_after.png", dpi=150, bbox_inches='tight')plt.close()# ============================================================# Step 3: Normalization# ============================================================print("\n" + "="*60 + "\n  Step 3: Normalization\n" + "="*60)adata.raw = adata.copy()sc.pp.normalize_total(adata, target_sum=1e4)sc.pp.log1p(adata)# ============================================================# Step 4: Highly Variable Genes# ============================================================print("\n" + "="*60 + "\n  Step 4: HVG selection\n" + "="*60)# Use default 'seurat' flavor (no extra deps). 'seurat_v3' needs scikit-misc.sc.pp.highly_variable_genes(adata, n_top_genes=2000, min_mean=0.0125, max_mean=3, min_disp=0.5)n_hvg = adata.var.highly_variable.sum()print(f"  HVGs: {n_hvg}")sc.pl.highly_variable_genes(adata, show=False)plt.savefig(f"{OUT_DIR}03_HVG.png", dpi=150, bbox_inches='tight')plt.close()adata = adata[:, adata.var.highly_variable].copy()# ============================================================# Step 5: Regress + Scale# ============================================================print("\n" + "="*60 + "\n  Step 5: Regress + Scale\n" + "="*60)sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])sc.pp.scale(adata, max_value=10)# ============================================================# Step 6: PCA# ============================================================print("\n" + "="*60 + "\n  Step 6: PCA\n" + "="*60)sc.tl.pca(adata, svd_solver='arpack', n_comps=50)# PCA plots — note: pca_variance_ratio does NOT accept ax=sc.pl.pca_variance_ratio(adata, n_pcs=50, show=False)plt.title('Elbow plot')plt.savefig(f"{OUT_DIR}04a_elbow.png", dpi=150, bbox_inches='tight')plt.close()sc.pl.pca(adata, color=['n_genes_by_counts', 'pct_counts_mt'], show=False)plt.suptitle('PCA colored by QC metrics')plt.savefig(f"{OUT_DIR}04b_pca_qc.png", dpi=150, bbox_inches='tight')plt.close()# ============================================================# Step 7: Neighbors + UMAP# ============================================================print("\n" + "="*60 + "\n  Step 7: Neighbors + UMAP\n" + "="*60)n_pcs = 15sc.pp.neighbors(adata, n_pcs=n_pcs, n_neighbors=15)sc.tl.umap(adata, min_dist=0.3, spread=1.0)# ============================================================# Step 8: Clustering# ============================================================print("\n" + "="*60 + "\n  Step 8: Clustering\n" + "="*60)for res in [0.5, 0.8, 1.0, 1.2]:    sc.tl.leiden(adata, resolution=res, key_added=f'leiden_r{res}')print("  Clusters per resolution:")for res in [0.5, 0.8, 1.0, 1.2]:    print(f"    res={res}: {adata.obs[f'leiden_r{res}'].nunique()}")adata.obs['leiden'] = adata.obs['leiden_r1.0'].astype(str)# UMAP clustering visualizationfig, axes = plt.subplots(2, 2, figsize=(14, 14))sc.pl.umap(adata, color='leiden', legend_loc='right margin', ax=axes[0,0],           title=f'Leiden (n={adata.obs.leiden.nunique()})', show=False)sc.pl.umap(adata, color='n_genes_by_counts', ax=axes[0,1], title='n_genes', show=False)sc.pl.umap(adata, color='pct_counts_mt', ax=axes[1,0], title='%MT', show=False)sc.pl.umap(adata, color='total_counts', ax=axes[1,1], title='total UMI', show=False)plt.tight_layout()plt.savefig(f"{OUT_DIR}05_UMAP_clusters.png", dpi=150, bbox_inches='tight')plt.close()# ============================================================# Step 9: Marker Genes# ============================================================print("\n" + "="*60 + "\n  Step 9: Marker Genes\n" + "="*60)sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon', use_raw=True)# CRITICAL: rank_genes_groups['names'] is a 1D structured recarray#   Shape = (n_genes,), clusters accessed via dtype.names#   Access: names[rank][cluster_index]names_struct = adata.uns['rank_genes_groups']['names']scores_struct = adata.uns['rank_genes_groups']['scores']n_clusters = len(names_struct.dtype.names)print("\n  Top5 markers per cluster:")for ci in range(n_clusters):    clabel = names_struct.dtype.names[ci]    markers = [str(names_struct[j][ci]) for j in range(min(5, len(names_struct)))]    scores  = [scores_struct[j][ci] for j in range(min(5, len(names_struct)))]    print(f"    Cluster {clabel:>2}: " + ', '.join(f"{g}({s:.1f})" for g,s in zip(markers, scores)))# Dotplot with top-3 per cluster (cast to plain strings!)top3_genes = []for ci in range(n_clusters):    for rank in range(min(3, len(names_struct))):        top3_genes.append(str(names_struct[rank][ci]))top3_genes = list(set(top3_genes))sc.pl.dotplot(adata, var_names=top3_genes, groupby='leiden', show=False)plt.savefig(f"{OUT_DIR}07_Dotplot_top3.png", dpi=150, bbox_inches='tight')plt.close()# ============================================================# Step 10: Cell-Type Annotation# ============================================================print("\n" + "="*60 + "\n  Step 10: Cell-Type Annotation\n" + "="*60)# Adjust marker genes for your dataset/speciesmarker_dict = {    'CD14+ Monocytes':   ['CD14', 'LYZ', 'S100A9'],    'FCGR3A+ Monocytes': ['FCGR3A', 'MS4A7', 'LST1'],    'CD4+ T cells':      ['CD3D', 'CD3E', 'IL7R', 'CD4'],    'CD8+ T cells':      ['CD3D', 'CD3E', 'CD8A', 'CD8B'],    'NK cells':          ['NKG7', 'GNLY', 'KLRD1'],    'B cells':           ['CD79A', 'MS4A1', 'CD19'],    'Dendritic cells':   ['FCER1A', 'CST3'],    'Megakaryocytes':    ['PPBP', 'PF4'],}# Average expression per cluster for each cell typecell_type_anno = {}for cluster in sorted(adata.obs['leiden'].unique(), key=int):    mask = adata.obs['leiden'] == cluster    ct_scores = {}    for ct, genes in marker_dict.items():        valid_genes = [g for g in genes if g in adata.raw.var_names]        if not valid_genes:            ct_scores[ct] = 0            continue        avg_expr = adata.raw[mask, valid_genes].X.mean()        try:            avg_expr = float(avg_expr)        except:            avg_expr = 0        ct_scores[ct] = avg_expr    best_ct = max(ct_scores, key=ct_scores.get)    cell_type_anno[cluster] = best_ct    sorted_by_score = sorted(ct_scores.items(), key=lambda x: x[1], reverse=True)    candidates = ' / '.join(f"{ct}({s:.2f})" for ct, s in sorted_by_score[:2])    print(f"  Cluster {cluster:>2}: \u2192 {best_ct} ({candidates})")adata.obs['cell_type'] = adata.obs['leiden'].map(cell_type_anno)# Cell-type UMAPfig, axes = plt.subplots(1, 2, figsize=(20, 8))sc.pl.umap(adata, color='leiden', legend_loc='right margin', ax=axes[0],           title='Leiden Clusters', show=False)sc.pl.umap(adata, color='cell_type', legend_loc='right margin', ax=axes[1],           title='Predicted Cell Types', show=False)plt.tight_layout()plt.savefig(f"{OUT_DIR}09_UMAP_celltypes.png", dpi=150, bbox_inches='tight')plt.close()# Marker dotplot by cell typemarker_genes_all = list(dict.fromkeys(    g for genes in marker_dict.values() for g in genes if g in adata.raw.var_names))sc.pl.dotplot(adata, var_names=marker_genes_all, groupby='cell_type',              use_raw=True, dendrogram=True, show=False)plt.savefig(f"{OUT_DIR}10_Dotplot_markers.png", dpi=150, bbox_inches='tight')plt.close()# ============================================================# Summary# ============================================================print(f"""{'='*60}  Analysis Complete!{'='*60}Cells after QC:  {adata.n_obs}HVGs:            {n_hvg}PCs used:        {n_pcs}Leiden clusters: {adata.obs['leiden'].nunique()}Cell types:""")for ct in sorted(set(cell_type_anno.values())):    count = (adata.obs['cell_type'] == ct).sum()    pct = count / adata.n_obs * 100    print(f"  {ct}: {count} cells ({pct:.1f}%)")print(f"\nResults saved to: {OUT_DIR}")

大家可以看看它这个代码写得到底对不对，质量如何？

好了，今天养马的尝试就先到这里。大家有什么好的想法欢迎到群里交流：

中国银河生信云平台精品课程

中国银河生信云平台（UseGalaxy.cn）致力于生信平权。海量云端算力、8000+生信工具结合AI，推动生信进入3.0时代：数据分析从本地到云端，从手工到 AI。加入交流群，免费领取学习资料。

特色生信培训，助你丝滑发顶刊：

单细胞数据分析培训班（Python/Galaxy可选），不怕学不会

转录组数据分析实战（Galaxy）| 直播回放

咨询小助手：usegalaxy