OpenClaw Getting Started Guide - OG_024 | OpenClaw Monitoring and Alerting in Practice: Building an Agent Ops Stack from Scratch

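Author: Alex | Series: OpenClaw Getting Started Guide, Part 024 | Tags: #OpenClaw #Monitoring #Alerting #Ops #Prometheus #Grafana
Opening: Why Do Agents Need Monitoring?
After building more than 20 Agents with OpenClaw, I lived through one rough early morning:
3:47 Feishu pushes an alert: "marketing-content-creator failed 3 tasks in a row"
3:48 I check the logs: the disk is full because temporary files were never cleaned up
3:52 I free up disk space and restart the Agent
3:55 Everything is back to normal
Without monitoring, this might not have been noticed until 9 a.m. That is an outage window of more than five hours, easily enough for a scheduled article to miss its prime publishing slot.
Agent monitoring is not optional. It is a requirement.
This article walks through building a monitoring and alerting stack for OpenClaw Agents from scratch, covering metrics collection, log aggregation, and alert notification end to end.
By the end, when one of your Agents breaks, you will know before it does.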
Chapter 1: Monitoring Architecture
1.1 Overall Architecture
┌─────────────────────────────────────────────┐
│               OpenClaw Agent                │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐  │
│  │ Task runs │ │ Resources │ │ Error log │  │
│  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘  │
│        └─────────────┼─────────────┘        │
│              ┌───────┴───────┐              │
│              │    Metrics    │              │
│              │  (built-in)   │              │
│              └───────┬───────┘              │
└──────────────────────┼──────────────────────┘
                       │
            ┌──────────┴──────────┐
     ┌──────┴──────┐       ┌──────┴──────┐
     │ Prometheus  │       │    Loki     │
     │  (metrics)  │       │   (logs)    │
     └──────┬──────┘       └──────┬──────┘
            └──────────┬──────────┘
               ┌───────┴───────┐
               │    Grafana    │
               │ (dashboards)  │
               └───────┬───────┘
                       │
               ┌───────┴───────┐
               │  Alert rules  │
               │(Alertmanager) │
               └───────┬───────┘
                       │
         ┌─────────────┴─────────────┐
         │       Notifications       │
         │ Feishu / DingTalk / WeChat│
         └───────────────────────────┘
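1.2 The Three Pillars of Monitoring
| Pillar | Tool | What it collects | Purpose |
|---|---|---|---|
| Metrics | Prometheus | CPU, memory, task success rate, response time | Trend analysis, capacity planning |
| Logs | Loki | Error logs, exception stack traces, operation records | Troubleshooting, audit trails |
| Alerts | Alertmanager | Notifications sent once a rule fires | Real-time response, limiting the damage |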
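Chapter 2: Metrics Collection with Prometheus
2.1 Built-in OpenClaw Metrics
OpenClaw 2026.5.22+ ships with a built-in Prometheus metrics endpoint:
# Check whether the metrics endpoint is enabled
openclaw config get gateway.metrics.enabled
# Enabled by default, on port 8080
# Test the metrics endpoint
curl http://localhost:8080/metrics
Built-in metrics: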
# Agent task counters
openclaw_tasks_total{agent="marketing-content-creator", status="success"} 42
openclaw_tasks_total{agent="marketing-content-creator", status="failure"} 3
openclaw_task_duration_seconds{agent="marketing-content-creator", quantile="0.95"} 2.5
# Resource usage
openclaw_memory_usage_bytes{agent="marketing-content-creator"} 536870912
openclaw_cpu_usage_percent{agent="marketing-content-creator"} 15.3
# Channel status (1 = healthy, 0 = unhealthy)
openclaw_channel_health{channel="discord"} 1
openclaw_channel_health{channel="telegram"} 1
openclaw_channel_health{channel="wechat"} 0
# Message counters
openclaw_messages_total{channel="discord", direction="inbound"} 128
openclaw_messages_total{channel="discord", direction="outbound"} 128
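To get a feel for what these lines carry before Prometheus is even running, here is a minimal Python sketch that parses the text format and computes a success rate for one Agent. The sample text is copied from the list above; `parse_metrics` and `success_rate` are illustrative helper names, and the regex covers only the simple `name{labels} value` shape shown here, not the full exposition format.

```python
import re

# Sample lines copied from the metric list above.
SAMPLE = '''\
# Agent task counters
openclaw_tasks_total{agent="marketing-content-creator", status="success"} 42
openclaw_tasks_total{agent="marketing-content-creator", status="failure"} 3
openclaw_channel_health{channel="wechat"} 0
'''

# name, optional {labels}, then a numeric value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if not match:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def success_rate(samples, agent):
    """Share of successful tasks for one agent, or None if it has no tasks."""
    counts = {}
    for name, labels, value in samples:
        if name == "openclaw_tasks_total" and labels.get("agent") == agent:
            counts[labels.get("status")] = value
    total = sum(counts.values())
    return counts.get("success", 0.0) / total if total else None
```

With the sample above, `success_rate(parse_metrics(SAMPLE), "marketing-content-creator")` works out to 42 / 45. In production you would let Prometheus do this with PromQL; the sketch is only meant to make the data model concrete.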
2.2 Configure Prometheus Scraping
Create prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'openclaw-agents'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 5s
  - job_name: 'openclaw-gateway'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/gateway/metrics'
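Start Prometheus with Docker. Inside the container, localhost is the container itself, so if OpenClaw runs on the host, point the scrape targets at host.docker.internal:8080 instead (on Linux this also needs --add-host=host.docker.internal:host-gateway):
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus:latest
Verify scraping:
# Check target status
curl http://localhost:9090/api/v1/targets
# Query a metric
curl 'http://localhost:9090/api/v1/query?query=openclaw_tasks_total'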
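If you would rather check the query result from a script than eyeball the curl output, the sketch below builds the same query URL and pulls the label/value pairs out of an instant-query response. The response shape (`status`, `data.resultType`, `data.result[].value`) follows the Prometheus HTTP API; the sample body, including its timestamp, is made up for illustration, and `build_query_url` and `extract_vector` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def extract_vector(response_text):
    """Return [(labels, value), ...] from an instant-query JSON response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample's value is [unix_timestamp, "value as string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Illustrative response body; the timestamp is invented.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "openclaw_tasks_total",
                        "agent": "marketing-content-creator",
                        "status": "success"},
             "value": [1770000000.0, "42"]},
            {"metric": {"__name__": "openclaw_tasks_total",
                        "agent": "marketing-content-creator",
                        "status": "failure"},
             "value": [1770000000.0, "3"]},
        ],
    },
})
```

To use it against a live server, fetch `build_query_url("http://localhost:9090", "openclaw_tasks_total")` with any HTTP client and pass the response text to `extract_vector`.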
Chapter 3: Log Aggregation with Loki
3.1 OpenClaw Logging Configuration
# Show the current logging configuration
openclaw config get gateway.logging
# Set the output format (JSON is easy for Loki to parse) and the log file
openclaw config set gateway.logging.format json
openclaw config set gateway.logging.output /var/log/openclaw/agent.log
# Enable structured logging
openclaw config set gateway.logging.structured true
Sample log entry:
{
  "timestamp": "2026-05-25T10:30:00+08:00",
  "level": "ERROR",
  "agent": "marketing-content-creator",
  "task": "image-generation",
  "error": "API rate limit exceeded",
  "retry_count": 3,
  "duration_ms": 5000
}
3.2 Configure Promtail
Create promtail.yml:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: openclaw-logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: openclaw-agent
          __path__: /var/log/openclaw/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            agent: agent
            task: task
      - labels:
          level:
          agent:
          task:
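The two pipeline stages are easier to reason about once you see them as plain code: the json stage pulls fields out of each line, and the labels stage promotes some of them to Loki labels. The Python sketch below imitates that for the three fields configured above. It is a rough model for understanding, not Promtail's actual implementation; `extract_labels` is a hypothetical helper, and the fall-through for non-JSON lines is an assumption about the behavior you would want.

```python
import json

# Fields the pipeline above extracts and promotes to labels.
LABEL_KEYS = ("level", "agent", "task")


def extract_labels(line, static_labels=None):
    """Return the label set one log line would carry after the pipeline."""
    labels = dict(static_labels or {})
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Not JSON: keep only the static labels from static_configs.
        return labels
    if not isinstance(record, dict):
        return labels
    for key in LABEL_KEYS:
        if key in record:
            labels[key] = str(record[key])
    return labels
```

Fed the sample log entry from section 3.1 together with `{"job": "openclaw-agent"}`, it returns the job label plus level, agent, and task, which is the label set the LogQL selector `{job="openclaw-agent", level="ERROR"}` matches against.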
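Start Loki and Promtail with Docker. As separate containers they do not share localhost, so the Promtail client URL has to use an address that actually reaches Loki (for example, a shared Docker network and http://loki:3100):
# Loki
docker run -d \
  --name loki \
  -p 3100:3100 \
  grafana/loki:latest
# Promtail
docker run -d \
  --name promtail \
  -v "$(pwd)/promtail.yml:/etc/promtail/config.yml" \
  -v /var/log/openclaw:/var/log/openclaw \
  grafana/promtail:latest \
  -config.file=/etc/promtail/config.yml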
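Chapter 4: Visualization with Grafana Dashboards
4.1 Install Grafana
docker run -d \
  --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin123 \
  grafana/grafana:latest
Open http://localhost:3000 and log in as admin / admin123 (the password set through GF_SECURITY_ADMIN_PASSWORD above; pick a stronger one for anything beyond a local test).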
4.2 Configure Data Sources
1. Prometheus data source:
   - URL: http://host.docker.internal:9090
   - Click Save & Test
2. Loki data source:
   - URL: http://host.docker.internal:3100
   - Click Save & Test
4.3 Import the OpenClaw Dashboard
Create the dashboard JSON (openclaw-dashboard.json):
{
  "dashboard": {
    "title": "OpenClaw Agent Monitoring",
    "panels": [
      {
        "title": "Task success rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(openclaw_tasks_total{status=\"success\"}[5m])) / sum(rate(openclaw_tasks_total[5m])) * 100",
            "legendFormat": "Success rate"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "yellow", "value": 80},
                {"color": "green", "value": 95}
              ]
            }
          }
        }
      },
      {
        "title": "Agent memory usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "openclaw_memory_usage_bytes / 1024 / 1024",
            "legendFormat": "{{agent}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "mbytes"
          }
        }
      },
      {
        "title": "Channel health",
        "type": "table",
        "targets": [
          {
            "expr": "openclaw_channel_health",
            "format": "table",
            "instant": true
          }
        ]
      },
      {
        "title": "Error logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{job=\"openclaw-agent\", level=\"ERROR\"}"
          }
        ]
      }
    ]
  }
}
Import it:
# Import through the HTTP API
curl -X POST \
  http://admin:admin123@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @openclaw-dashboard.json
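A malformed dashboard file tends to fail in unhelpful ways, so it can be worth sanity-checking the JSON before posting it. The sketch below wraps a dashboard in the body shape `/api/dashboards/db` accepts (`dashboard` plus `overwrite`) and refuses panels that have no query targets. `build_import_payload` is a hypothetical helper, and the "every panel needs a target" rule is my own assumption about what is worth checking, not a Grafana requirement.

```python
import copy
import json


def build_import_payload(doc, overwrite=True):
    """Build the JSON body for POST /api/dashboards/db from a dashboard file."""
    # Accept either the bare dashboard or the {"dashboard": {...}} wrapper.
    dashboard = copy.deepcopy(doc.get("dashboard", doc))
    dashboard["id"] = None  # a null id asks Grafana to create a new dashboard
    empty = [p.get("title", "<untitled>")
             for p in dashboard.get("panels", [])
             if not p.get("targets")]
    if empty:
        raise ValueError(f"panels without targets: {', '.join(empty)}")
    return {"dashboard": dashboard, "overwrite": overwrite}


def to_json(payload):
    """Serialize the payload the way it would be sent over HTTP."""
    return json.dumps(payload, ensure_ascii=False)
```

Load openclaw-dashboard.json with `json.load`, pass the result to `build_import_payload`, and send `to_json(...)` as the request body in place of the raw file.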
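Chapter 5: Alert Rules and Alertmanager
5.1 Create Alert Rules
Create alert-rules.yml (Prometheus loads it only if prometheus.yml lists it under rule_files, and forwards alerts only if alerting.alertmanagers points at Alertmanager; the prometheus.yml in section 2.2 has neither, so both need adding):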
groups:
  - name: openclaw-agent-alerts
    rules:
      # Task failure rate too high.
      # "sum by (agent)" keeps the agent label that the summary references.
      - alert: AgentTaskFailureRateHigh
        expr: sum by (agent) (rate(openclaw_tasks_total{status="failure"}[5m])) / sum by (agent) (rate(openclaw_tasks_total[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent }} task failure rate above 10%"
          description: "Failure rate over the past 5 minutes: {{ $value | humanizePercentage }}"
      # Memory usage too high
      - alert: AgentMemoryHigh
        expr: openclaw_memory_usage_bytes / 1024 / 1024 / 1024 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} memory usage above 1GB"
          description: "Current usage: {{ $value | humanize }}GB"
      # Channel unhealthy
      - alert: ChannelUnhealthy
        expr: openclaw_channel_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Channel {{ $labels.channel }} is unhealthy"
          description: "The channel has been unavailable for 1 minute"
      # Tasks running too slowly
      - alert: AgentTaskSlow
        expr: openclaw_task_duration_seconds{quantile="0.95"} > 30
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} tasks are running slowly"
          description: "P95 latency: {{ $value }}s"
      # Low disk space.
      # The node_filesystem_* metrics come from node_exporter, which this
      # stack does not deploy; add it for this rule to have data.
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space is running low"
          description: "Free space: {{ $value | humanizePercentage }}"
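The failure-rate expression is worth working through by hand once. The sketch below does the arithmetic for a single Agent using two readings of the task counters taken five minutes apart. The second reading is invented for the example, and `rate` here is a deliberately simplified stand-in: the real PromQL `rate()` also handles counter resets and extrapolates to the window edges.

```python
def rate(prev, curr, window_seconds):
    """Per-second increase of a counter between two readings (no reset handling)."""
    return (curr - prev) / window_seconds


def failure_ratio(prev, curr, window_seconds=300):
    """failure rate / total rate for one agent, or None if nothing ran."""
    failure = rate(prev["failure"], curr["failure"], window_seconds)
    total = sum(rate(prev[status], curr[status], window_seconds) for status in curr)
    return failure / total if total > 0 else None


# First reading matches the metric list in section 2.1; the second is made up.
before = {"success": 42, "failure": 3}
after = {"success": 60, "failure": 6}
ratio = failure_ratio(before, after)
fires = ratio is not None and ratio > 0.1
```

Here 3 new failures out of 21 new tasks is a ratio of about 0.14, which is above the 0.1 threshold. The alert itself only fires once the condition has held for the full `for: 2m`.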
5.2 Configure Alertmanager Notifications
Create alertmanager.yml:
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alert@yjett.com'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'agent']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'feishu-webhook'
  routes:
    - match:
        severity: critical
      receiver: 'feishu-webhook'
      continue: true
    - match:
        severity: warning
      receiver: 'feishu-webhook'
      repeat_interval: 4h

receivers:
  - name: 'feishu-webhook'
    webhook_configs:
      - url: 'https://open.feishu.cn/open-apis/bot/v2/hook/your-webhook-token'
        send_resolved: true
        # Upstream Alertmanager's webhook_configs has no title/text fields and
        # rejects unknown keys, so the two lines below are commented out. The
        # generic webhook body is also not the JSON a Feishu bot expects, so a
        # small relay (or a Feishu-aware build) is typically needed in between.
        # title: '{{ template "default.title" . }}'
        # text: '{{ template "default.content" . }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'agent']
Feishu notification template (feishu.tmpl):
{{ define "default.title" }}
🔥 OpenClaw alert: {{ .GroupLabels.alertname }}
{{ end }}
{{ define "default.content" }}
{{ range .Alerts }}
**Severity**: {{ .Labels.severity }}
**Agent**: {{ .Labels.agent }}
**Problem**: {{ .Annotations.summary }}
**Details**: {{ .Annotations.description }}
**Time**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
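Alertmanager's generic webhook posts its own JSON (with `status`, `groupLabels`, and an `alerts` array), while a Feishu custom bot expects a body like `{"msg_type": "text", "content": {"text": "..."}}`. If you put a small relay between the two, the conversion could look like the sketch below. It mirrors the fields of the template above; `to_feishu_text` is a hypothetical helper name, and the relay's HTTP server and the final POST to the webhook URL are left out.

```python
def to_feishu_text(payload):
    """Convert an Alertmanager webhook payload (dict) into a Feishu text message."""
    group = payload.get("groupLabels", {})
    status = payload.get("status", "unknown")
    lines = [f"OpenClaw alert: {group.get('alertname', 'unknown')} ({status})"]
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        notes = alert.get("annotations", {})
        lines.append(f"Severity: {labels.get('severity', '-')}")
        lines.append(f"Agent: {labels.get('agent', '-')}")
        lines.append(f"Problem: {notes.get('summary', '-')}")
        lines.append(f"Details: {notes.get('description', '-')}")
        lines.append(f"Time: {alert.get('startsAt', '-')}")
    # Body shape expected by a Feishu custom bot for plain-text messages.
    return {"msg_type": "text", "content": {"text": "\n".join(lines)}}
```

The relay would receive Alertmanager's POST, call `to_feishu_text` on the decoded JSON, and forward the result to the Feishu webhook URL with `Content-Type: application/json`. The `url` in alertmanager.yml then points at the relay instead of at Feishu directly.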
Start Alertmanager:
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v "$(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
  -v "$(pwd)/feishu.tmpl:/etc/alertmanager/templates/feishu.tmpl" \
  prom/alertmanager:latest \
  --config.file=/etc/alertmanager/alertmanager.yml
Chapter 6: One-Command Deployment
6.1 Full Docker Compose File
Create docker-compose.monitoring.yml:
# Services on the Compose network reach each other by service name
# (for example http://loki:3100 or http://prometheus:9090), not localhost.
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - ./promtail.yml:/etc/promtail/config.yml
      - /var/log/openclaw:/var/log/openclaw
    command: -config.file=/etc/promtail/config.yml

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana-data:/var/lib/grafana

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - ./feishu.tmpl:/etc/alertmanager/templates/feishu.tmpl

volumes:
  prometheus-data:
  loki-data:
  grafana-data:
6.2 Startup Commands
# Bring up the whole monitoring stack
docker-compose -f docker-compose.monitoring.yml up -d
# Check container status
docker-compose -f docker-compose.monitoring.yml ps
# Follow the logs
docker-compose -f docker-compose.monitoring.yml logs -f
Service URLs:
| Service | URL | Purpose |
|---|---|---|
| Prometheus | http://localhost:9090 | Metric queries |
| Grafana | http://localhost:3000 | Visualization |
| Alertmanager | http://localhost:9093 | Alert management |
| Loki | http://localhost:3100 | Log queries |
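Rather than opening four browser tabs, you can probe each service's health endpoint from a script. The sketch below uses the readiness paths these projects expose (`/-/ready` for Prometheus and Alertmanager, `/ready` for Loki, `/api/health` for Grafana) on the ports from the table above. `check_stack` and `http_ok` are hypothetical helper names; the probe function is injectable so the logic can be exercised without a running stack.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Health/readiness endpoints on the ports listed in the table above.
READINESS_URLS = {
    "Prometheus": "http://localhost:9090/-/ready",
    "Grafana": "http://localhost:3000/api/health",
    "Alertmanager": "http://localhost:9093/-/ready",
    "Loki": "http://localhost:3100/ready",
}


def http_ok(url, timeout=3):
    """True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def check_stack(probe=http_ok, urls=READINESS_URLS):
    """Return {service name: bool} for every service in the stack."""
    return {name: probe(url) for name, url in urls.items()}
```

Calling `check_stack()` after `docker-compose up -d` returns a dict such as `{"Prometheus": True, ...}`; any `False` entry tells you which container's logs to read first.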
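Chapter 7: Hands-on Checklist
Before deployment
- [ ] Docker is installed
- [ ] The OpenClaw metrics endpoint is enabled
- [ ] The Feishu webhook is configured
During deployment
- [ ] Prometheus scrape configuration is correct
- [ ] Loki is receiving logs
- [ ] Grafana data sources connect successfully
- [ ] The dashboard imports successfully
Alert verification
- [ ] Trigger a test alert (for example, stop an Agent manually)
- [ ] Confirm the notification arrives in Feishu
- [ ] Verify the resolved notification
Routine maintenance
- [ ] Check disk space every week
- [ ] Review alert rules every month (to cut false positives)
- [ ] Tidy up the dashboard layout every quarter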
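Closing: What Monitoring Is Really For
One last line from a programmer who broke into the industry in 2021:
Monitoring is not there to look good. It is there to catch problems before they happen and to pinpoint them quickly after they do.
A good monitoring stack = metrics that tell you "where something may be wrong" + logs that tell you "what exactly is wrong" + alerts that tell you "deal with this now".
Now go monitor your Agents.
📌 Action list
- [ ] Start Prometheus + Grafana + Loki + Alertmanager
- [ ] Import the OpenClaw dashboard
- [ ] Configure Feishu alert notifications
- [ ] Trigger a test alert to verify the notification path
- [ ] Set a weekly reminder to review monitoring
👇 Discussion
What is the strangest failure your Agent has run into? Share it in the comments.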