OpenClaw Architecture Deep Dive: How the Gateway Turns a Message Into

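Architecture and principles | Checkable against the official docs | Worth bookmarking

> Based on [OpenClaw Gateway architecture](https://docs.openclaw.ai/concepts/architecture), [Agent Loop](https://docs.openclaw.ai/concepts/agent-loop), and [Gateway protocol](https://docs.openclaw.ai/gateway/protocol) (documentation snapshot from 2026-06). The official docs are authoritative for numbers and field names; after installing, run `openclaw doctor` to check them against your local version.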

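## 1. The problem you are solving (30 seconds)

Three points of confusion come up often when reading the OpenClaw docs:

1. Which one actually does the work, the Gateway or the Agent? Why can WhatsApp only connect to the Gateway and not directly to the model?

2. The `agent` RPC returned a `runId`, so why is there no reply yet? Where do the streaming events come from, and what is `agent.wait` waiting for?

3. Session cross-talk, tool races, and the 48-hour timeout appear in different chapters. Do they all belong to the same concurrency model?

The goal of this article: line the components up along one complete path, so that after reading you can look at a log and say "it is stuck at step N", rather than only remembering that "the Gateway is like a switchboard".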

## 2. Core mechanism: Hub-and-Spoke + two planes

The official description: on a single host, one long-running Gateway daemon owns all messaging surfaces (WhatsApp/Baileys, Telegram/grammY, Slack, Discord, Signal, iMessage, WebChat). Control-plane clients (CLI, Web UI, macOS App) and Nodes (`role: node`, with capabilities such as camera/canvas) all connect to the Gateway over WebSocket, by default at `127.0.0.1:18789`.

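This splits into two planes:

| Plane | Responsible for | Does not |
| --- | --- | --- |
| Control plane (Gateway) | Channel connections, WS protocol, session routing, auth/pairing, event broadcast | Run LLM inference |
| Execution plane (Agent Runtime) | Building prompts, calling the model, running tools, writing the transcript, compaction | Hold the Baileys WhatsApp session directly |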

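Invariants (as stated in the docs): each host has exactly one Gateway, which controls a single Baileys WhatsApp session; the first frame must be `connect`, otherwise the connection is hard-closed; events are not replayed, so after a gap the client must refresh on its own.

### 2.1 From channel message to Agent Run (sequence)

The diagram below merges the connect flow from the Architecture doc with the five-step execution chain from Agent Loop: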

```mermaid
sequenceDiagram
    participant WA as WhatsApp/Telegram/…
    participant GW as Gateway
    participant RT as Agent Runtime
    participant LLM as Model Provider
    WA->>GW: Inbound message (channel adapter)
    GW->>GW: Router resolves channel/account/peer → sessionKey
    GW->>GW: Enqueue on session lane (serial)
    Note over GW: Clients can also send req:agent themselves
    GW->>RT: agentCommand → runEmbeddedAgent
    RT->>RT: Take session write lock, load skills snapshot
    RT->>LLM: Inference (assistant deltas may stream)
    LLM-->>RT: tool_calls
    RT->>RT: Run tools (tool event stream)
    RT->>GW: lifecycle end / assistant final
    GW->>WA: Outbound reply (via messaging tool or channel send)
```

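Key takeaways:

- Calling `agent` from the CLI and a message arriving from WhatsApp on your phone both end up in the same Gateway process and share the same Session/Queue rules.
- The "reply" is not the synchronous return value of the `agent` RPC. It arrives as the `event:agent` stream plus the channel outbound; the synchronous API is `agent.wait`, which waits for the lifecycle to end.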

### 2.2 The five steps of the Agent Loop (execution plane)

The call chain given in the official Agent Loop doc (the function names can be searched for directly in the source):

1. `agent` RPC: validate params → resolve `sessionKey`/`sessionId` → persist session metadata → return `{ runId, acceptedAt }` immediately

2. `agentCommand`: resolve model/thinking defaults → load the skills snapshot → call `runEmbeddedAgent`

3. `runEmbeddedAgent`: serialized through the per-session + global queues; subscribes to runtime events; aborts on timeout

4. `subscribeEmbeddedAgentSession`: maps runtime events onto Gateway streams:

   - tool → `stream: "tool"`
   - assistant delta → `stream: "assistant"`
   - lifecycle → `stream: "lifecycle"` (`start` | `end` | `error`)

5. `agent.wait`: `waitForAgentRun(runId)` → returns `{ status: ok|error|timeout, … }`

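Do not confuse the two timeouts (documented defaults):

| Setting | Default | Meaning |
| --- | --- | --- |
| `timeoutMs` on `agent.wait` | 30s | Only bounds the wait RPC itself; does not stop the agent |
| `agents.defaults.timeoutSeconds` | 172800s (48h) | Upper bound for the whole embedded run, enforced by the runtime abort timer |

This design is exactly why many people see "wait timed out but the agent is still running".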
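
The difference between the two timeouts can be illustrated with a small `asyncio` sketch. It is not OpenClaw's implementation: `agent_run` and `agent_wait` are hypothetical stand-ins, and the durations are shrunk to milliseconds. The point is only that the waiter gives up without cancelling the run it was waiting on.

```python
import asyncio
from typing import Dict, List, Tuple


async def agent_run(log: List[str], duration: float) -> str:
    """Hypothetical stand-in for an embedded agent run."""
    log.append("lifecycle:start")
    await asyncio.sleep(duration)  # simulated model and tool work
    log.append("lifecycle:end")
    return "ok"


async def agent_wait(run: "asyncio.Task[str]", timeout_s: float) -> Dict[str, str]:
    """Wait up to timeout_s for the run; never cancel it on timeout."""
    try:
        # shield() keeps wait_for's timeout from cancelling the underlying run.
        status = await asyncio.wait_for(asyncio.shield(run), timeout_s)
        return {"status": status}
    except asyncio.TimeoutError:
        return {"status": "timeout"}


async def demo() -> Tuple[Dict[str, str], Dict[str, str], List[str]]:
    log: List[str] = []
    run = asyncio.create_task(agent_run(log, duration=0.2))
    first = await agent_wait(run, timeout_s=0.02)  # gives up early
    second = await agent_wait(run, timeout_s=2.0)  # the run was still alive
    return first, second, log
```

Calling `asyncio.run(demo())` yields a `timeout` status for the first wait and `ok` for the second, with the run's lifecycle completing exactly once.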

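### 2.3 Concurrency model: Session Lane + Write Lock

The Agent Loop doc emphasizes:

- Runs are serialized per session key (the session lane), optionally passing through a global lane as well, which prevents tool/session races.
- Transcript writes are additionally guarded by a session write lock (process-aware, file-based); acquire waits up to 60000ms by default and reports session busy on timeout.
- The write lock is non-reentrant by default; nested writes must explicitly set `allowReentrant: true`.

This means: do not assume two agent runs can execute in parallel on the same session. The channel layer also has queue modes such as steer/followup/collect/interrupt that feed the lane (see the Command Queue doc).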
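
To make the session lane idea concrete, here is a minimal sketch that serializes runs per session key with one `asyncio.Lock` per key. The class name `SessionLanes` and its structure are assumptions for illustration; the real Gateway queue (including the optional global lane and the queue modes) is more involved.

```python
import asyncio
from collections import defaultdict
from typing import Dict, List


class SessionLanes:
    """Serialize runs per session key; runs on different keys may interleave."""

    def __init__(self) -> None:
        # One lock per session key, created lazily (hypothetical structure).
        self._locks: Dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def run(self, session_key: str, name: str, log: List[str]) -> None:
        async with self._locks[session_key]:
            log.append(f"{name}:start")
            await asyncio.sleep(0.01)  # simulated model and tool work
            log.append(f"{name}:end")


async def demo() -> List[str]:
    lanes = SessionLanes()
    log: List[str] = []
    await asyncio.gather(
        lanes.run("main", "a1", log),   # same session as a2: must not overlap
        lanes.run("main", "a2", log),
        lanes.run("other", "b1", log),  # different session: free to overlap
    )
    return log
```

In the resulting log, `a2:start` always comes after `a1:end`, while `b1:start` appears before `a1:end` because it belongs to a different lane.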

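## 3. What the protocol and configuration look like

### 3.1 WebSocket frame format (Gateway protocol)

Transport: WebSocket text frames with a JSON payload. The first frame must be `connect`.

Requests, responses, and events after the handshake (summarized from Architecture):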

```json
{ "type": "req", "id": "1", "method": "agent", "params": { } }
{ "type": "res", "id": "1", "ok": true, "payload": { } }
{ "type": "event", "event": "agent", "payload": { }, "seq": 42 }
```
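
Since events are not replayed, a client has to notice gaps itself. The sketch below tracks the `seq` field of event frames and flags a gap. It assumes `seq` increases by exactly 1 per event on a connection, which this article does not confirm; `EventGapTracker` is a hypothetical name.

```python
import json
from typing import Optional


class EventGapTracker:
    """Track `seq` on event frames and flag gaps that call for a refresh."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None

    def observe(self, raw: str) -> str:
        frame = json.loads(raw)
        if frame.get("type") != "event":
            return "ignored"  # req/res frames are not sequenced here
        seq = frame["seq"]
        previous, self.last_seq = self.last_seq, seq
        # Assumption: consecutive events carry consecutive seq values.
        if previous is not None and seq != previous + 1:
            return "gap"  # events are not replayed, so refresh state
        return "ok"
```

A client would call `observe` on every incoming text frame and trigger its own refresh logic whenever it returns `"gap"`.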

The `connect` params in the protocol doc are more detailed than in earlier examples, and include `role`, `scopes`, the `device` signature, and more (excerpt):

```json
{
  "type": "req",
  "id": "connect-1",
  "method": "connect",
  "params": {
    "minProtocol": 3,
    "maxProtocol": 4,
    "role": "operator",
    "scopes": ["operator.read", "operator.write"],
    "auth": { "token": "…" },
    "client": {
      "id": "cli",
      "version": "1.2.3",
      "platform": "macos",
      "mode": "operator"
    },
    "device": {
      "id": "device_fingerprint",
      "publicKey": "…",
      "signature": "…",
      "signedAt": 1737264000000,
      "nonce": "…"
    }
  }
}
```
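
The `minProtocol`/`maxProtocol` pair describes a version range. How the Gateway actually chooses a version is not covered in this article; the sketch below assumes the common convention of picking the highest version in the overlap of the client and server ranges. `negotiate_protocol` is a hypothetical helper, not an OpenClaw function.

```python
from typing import Optional


def negotiate_protocol(client_min: int, client_max: int,
                       server_min: int, server_max: int) -> Optional[int]:
    """Return the highest version both sides support, or None if disjoint.

    Assumption: both ranges are inclusive and the highest common version wins.
    """
    low = max(client_min, server_min)
    high = min(client_max, server_max)
    return high if low <= high else None
```

With the client range 3 to 4 from the excerpt, a server supporting 2 to 3 would settle on 3 under this assumption, and a server supporting only 5 and above would have no common version.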

Side-effecting methods (`send`, `agent`) require an idempotency key. The server keeps a short-TTL dedupe cache, so a retry must carry the same key.
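
A short-TTL dedupe cache can be sketched as follows. This is a generic illustration of the technique, not the Gateway's code: `DedupeCache`, `run_once`, and the injectable clock are all hypothetical, and the actual TTL is not given in this article.

```python
import time
from typing import Any, Callable, Dict, Tuple


class DedupeCache:
    """Replay the stored result when the same idempotency key is retried."""

    def __init__(self, ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, result)

    def run_once(self, key: str, action: Callable[[], Any]) -> Any:
        now = self._clock()
        # Drop expired entries so the cache stays small.
        self._entries = {k: v for k, v in self._entries.items() if v[0] > now}
        if key in self._entries:
            return self._entries[key][1]  # retry within TTL: no second side effect
        result = action()
        self._entries[key] = (now + self._ttl_s, result)
        return result
```

A retry with the same key inside the TTL gets the original result back without running the action again; a retry with a new key, or after expiry, runs it again, which is why a client must reuse the key when retrying.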

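### 3.2 Configuration snippet (Gateway bind and port)

```yaml
gateway:
  port: 18789
  bind: loopback  # remote access needs Tailscale/VPN or an SSH tunnel, see below
```

Remote access as given in the Architecture doc:

```bash
ssh -N -L 18789:127.0.0.1:18789 user@host
```

Canvas / A2UI are served on the same port as the Gateway:

- `/__openclaw__/canvas/`
- `/__openclaw__/a2ui/`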

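### 3.3 What you actually need to pass in an `agent` call

Conceptually (field names are governed by the protocol schema of your installed version):

- `sessionKey`: the anchor for almost all session-level operations. The Gateway derives the trusted runtime context on the server side and does not let callers forge the delivery context at will.
- `message`: the user input.
- `runId` / idempotency: used for dedupe and for lining up with `agent.wait`.

During the streaming phase you receive `event:agent` on the WS; the payload distinguishes `stream: "assistant" | "tool" | "lifecycle"`.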
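
A client-side consumer for these payloads might fold the three streams into one summary, as sketched below. Only the `stream` field and its three values come from the docs cited here; the `delta`, `name`, and `phase` fields are hypothetical placeholders, so check the protocol schema of your installed version for the real names.

```python
from typing import Any, Dict, Iterable, List


def summarize_run(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold `event:agent` payloads into final text, tool names, and status."""
    text: List[str] = []
    tools: List[str] = []
    status = "running"
    for payload in events:
        stream = payload["stream"]
        if stream == "assistant":
            text.append(payload["delta"])  # hypothetical field name
        elif stream == "tool":
            tools.append(payload["name"])  # hypothetical field name
        elif stream == "lifecycle":
            phase = payload["phase"]       # hypothetical field: start|end|error
            if phase in ("end", "error"):
                status = phase
    return {"status": status, "text": "".join(text), "tools": tools}
```

A status that is still `running` after the stream goes quiet is the signal to check the lifecycle events rather than assume the run has failed.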

### 3.4 Where the system prompt comes from

Per the Agent Loop doc: system prompt = OpenClaw base prompt + skills prompt + bootstrap context + per-run overrides; model-specific limits and compaction reserve tokens are enforced as well.

Places where hooks can intercept (excerpt):

- Gateway: `agent:bootstrap` (modify bootstrap files)
- Plugin: `before_prompt_build`, `before_tool_call`, `tool_result_persist`, `before_compaction` …

Where the harness lands: much of OpenClaw's "control" lives in these hooks + exec approvals + tool sanitization, rather than in a prompt that shouts "please be careful about security".

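## 4. How it fundamentally differs from ChatGPT web / Hermes

Same requirement: "send a message to a personal assistant in Telegram, and it can run tools on a server".

| Dimension | ChatGPT web | Hermes |
| --- | --- | --- |
| Control plane | OpenAI SaaS | Hermes Gateway (self-hosted) |
| Channels | Official app | Telegram/Discord/… |
| Protocol | HTTPS API | Python `AIAgent` + registry |
| Session concurrency | Opaque | session + sub-agent |
| Extension | GPTs / Actions | MCP + self-generated skills |

Selection conclusions (at the mechanism level, not by reputation):

- If you need WhatsApp plus the single-host, single-session invariant → the OpenClaw docs state this constraint explicitly; Hermes leans toward Telegram/multi-platform, but on a different implementation stack.
- If you want minimal ops and accept SaaS → this is not OpenClaw's home ground.
- If you want to study self-improving skills + RL trajectories → the Hermes docs go deeper there; OpenClaw puts more weight on the Gateway protocol and the channel OS.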

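## 5. Three failure cases: symptom → cause → how to check

### Case A: `agent.wait` times out at 30s, but the Telegram reply arrives later

Symptom: the client reports a wait timeout; the message arrives a few minutes later.

Cause: `agent.wait` only waits 30s by default, while the runtime allows up to 48h by default. A wait timeout does not abort the agent.

How to check:

1. Raise `timeoutMs`, or switch to event subscription instead of a blocking wait.

2. Subscribe to `event:agent` and see whether the lifecycle reaches `end`.

3. Check `agents.defaults.timeoutSeconds` and the model idle timeout (docs: model idle is capped at 120s by default; cron runs have an additional outer timeout).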

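### Case B: messages from two groups "swap personas", or a tool writes to the wrong directory

Symptom: a command in group A triggers a tool in group B's context.

Cause: the sessionKey is routed incorrectly or the workspace is shared; or runs bypass the queue and execute in parallel, causing a transcript race.

How to check:

1. Log the inbound message's channel / account / peer → the resolved sessionKey.

2. Confirm that the Router binding (Multi-Agent rules) maps different peers to different agents/workspaces.

3. Check whether a second process is writing the same session file (the write lock should block it, but it reports busy after 60s).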
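
Once the routing decisions from step 1 are logged, a small script can surface suspicious sharing. The sketch below assumes you have extracted `(channel, account, peer, sessionKey)` tuples from your own logs; that tuple shape and the helper name are hypothetical. It reports every sessionKey that more than one peer resolves to.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (channel, account, peer, sessionKey) as extracted from your own routing logs.
Route = Tuple[str, str, str, str]


def find_shared_session_keys(routes: Iterable[Route]) -> Dict[str, Set[str]]:
    """Map each sessionKey to its peers, keeping only keys shared by 2+ peers."""
    peers_by_key: Dict[str, Set[str]] = defaultdict(set)
    for channel, account, peer, session_key in routes:
        peers_by_key[session_key].add(f"{channel}/{account}/{peer}")
    return {k: v for k, v in peers_by_key.items() if len(v) > 1}
```

A shared key is not necessarily a bug, since some setups share a session on purpose, but for two unrelated groups it is the first thing to rule out.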

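### Case C: the Gateway disconnects immediately after connecting / `connect` fails

Symptom: the first WS frame errors out, or the connection is hard-closed.

Causes (as listed in the docs):

- The first frame is not `connect`, or is not JSON
- The `connect.challenge` nonce was not signed (non-local connections must pair)
- `gateway.auth.mode` does not match the token/Tailscale/trusted-proxy setup
- The frame exceeds the 64 KiB pre-connect limit

How to check:

1. Run `openclaw gateway` in the foreground and watch stdout.

2. Call the `health` / `status` RPCs (after connect succeeds).

3. Check the device token against [Pairing](https://docs.openclaw.ai/channels/pairing) and [Security](https://docs.openclaw.ai/gateway/security).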
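
Two of the listed causes can be checked on the client before the frame is even sent. The sketch below is a hypothetical preflight helper covering only the "not JSON / not `connect`" and "over 64 KiB" cases; it does not validate auth, pairing, or the nonce signature, and it is not how the Gateway itself validates frames.

```python
import json
from typing import List

PRE_CONNECT_LIMIT = 64 * 1024  # 64 KiB, from the list of causes above


def check_first_frame(raw: bytes) -> List[str]:
    """Return the problems found in a candidate first frame (empty if none)."""
    problems: List[str] = []
    if len(raw) > PRE_CONNECT_LIMIT:
        problems.append("frame exceeds 64 KiB pre-connect limit")
    try:
        frame = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return problems + ["frame is not valid JSON"]
    if not isinstance(frame, dict) or frame.get("method") != "connect":
        problems.append("first frame is not a connect request")
    return problems
```

An empty list only means these two checks passed; a hard close can still come from the auth and pairing causes in the same list.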

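## 6. Minimal runnable path (command level, not "doc reading order")

Goal: a local Gateway + one inbound channel (or WebChat) + one complete agent run.

```bash
# 1. Install and configure (omitted, see the official Getting Started)
openclaw setup
# 2. Start the Gateway in the foreground (logs go straight to the terminal, handy for cases A/C in this article)
openclaw gateway
# 3. In another terminal: health check (over WS, or via the CLI wrapper)
# Equivalent to method: health after connect
# 4. Verify with WebChat only (without touching WhatsApp Baileys)
# Open the WebChat served by the Gateway in a browser, or configure webchat per the docs
# 5. Trigger one agent run (the CLI wraps the agent RPC)
openclaw agent --message "List the files in the current workspace root" --session-key "main"
# 6. Watch the event stream: lifecycle start → assistant/tool deltas → lifecycle end
# If using the API: remember to pass agent.wait a timeoutMs >> 30s
```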

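Pass criteria:

- [ ] Only one Gateway process is listening on 18789
- [ ] connect succeeds, and `hello-ok` shows the methods/events discovery
- [ ] A single run leaves tool + assistant entries in the transcript
- [ ] When two consecutive messages share the same sessionKey, the second can reference the first (verifies session persistence)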

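## 7. What production still needs (concrete items, not "mind security")

OpenClaw gives you the skeleton of a personal AI OS; for production (even a small team's internal bot) you need to add at least:

| Area | What to add | Docs |
| --- | --- | --- |
| Network | No public `bind: all` + `auth.mode: none`; prefer Tailscale/VPN; WS TLS + pinning | Security, Remote access |
| Identity | Non-local connects must pair; device signature v3 is bound to the platform | Pairing, protocol |
| Permissions | Separate `operator.read/write/admin` scopes; audit the tools actually available with `tools.effective` | Gateway protocol |
| Execution | Exec Approvals for shell; use `before_tool_call` to block sensitive tools | Exec Approvals, Plugin hooks |
| Data | Redact in `tool_result_persist`; transcript backup and retention | Agent Loop hooks |
| Concurrency | Understand the session lane; choose a channel queue mode | Queue |
| Observability | Enable diagnostics: `session.stalled` / `session.stuck` distinguish a slow run from a real deadlock | Agent Loop Timeouts |
| Compliance | WhatsApp/Telegram ToS; limits on automated bulk sending | Channel docs |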

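## Self-check after reading (if you cannot answer, it has not sunk in yet)

1. Does the 30s `agent.wait` timeout cancel a running agent? Why?

2. When two runs target the same session in parallel, what mechanism does the official design use to avoid tool races?

3. Why does the Baileys WhatsApp session require "one Gateway per host"?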

「职场AI技能」
