OpenClaw Getting Started Guide – OG_024 | OpenClaw Monitoring and Alerting in Practice: Building Your Agent Ops Stack from Scratch

Author: Alex
Series: OpenClaw Getting Started Guide, Part 024
Tags: #OpenClaw #MonitoringAndAlerting #Ops #Prometheus #Grafana

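Introduction: Why Do Agents Need Monitoring?

After building more than 20 Agents with OpenClaw, I went through one small-hours incident that nearly broke me:

3:47 Feishu pushes an alert: "marketing-content-creator failed 3 tasks in a row"
3:48 Check the logs: the disk is full because temporary files were never cleaned up
3:52 Clean up the disk and restart the Agent
3:55 Back to normal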

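Without monitoring, this might not have been noticed until 9 a.m.: a failure window of more than five hours, easily long enough for a scheduled article to miss its prime publishing slot.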

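Agent monitoring is not optional; it is mandatory.

This article walks you step by step through building a monitoring and alerting stack for OpenClaw Agents from scratch, covering the full flow of metrics collection, log aggregation, and alert notification.

By the end, when one of your Agents runs into trouble, you will know before it does.

Chapter 1: Monitoring Architecture Design

1.1 Overall Architecture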

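┌─────────────────────────────────────────────┐
│               OpenClaw Agent                │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐  │
│  │   Task    │ │ Resource  │ │   Error   │  │
│  │ execution │ │   usage   │ │   logs    │  │
│  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘  │
│        └─────────────┼─────────────┘        │
│                      │                      │
│            ┌─────────┴─────────┐            │
│            │ Metric collection │            │
│            │    (built-in)     │            │
│            └─────────┬─────────┘            │
└──────────────────────┼──────────────────────┘
                       │
            ┌──────────┴──────────┐
            │                     │
   ┌────────┴────────┐   ┌────────┴────────┐
   │   Prometheus    │   │      Loki       │
   │(metrics storage)│   │  (log storage)  │
   └────────┬────────┘   └────────┬────────┘
            │                     │
            └──────────┬──────────┘
                       │
              ┌────────┴────────┐
              │     Grafana     │
              │ (visualization) │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │   Alert rules   │
              │ (Alertmanager)  │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │  Notifications  │
              │Feishu, DingTalk,│
              │     WeChat      │
              └─────────────────┘

Note: the diagram shows the logical flow. In the configuration used in this article, Prometheus evaluates the alert rules and forwards firing alerts to Alertmanager, which then sends the notifications.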

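1.2 The Three Elements of Monitoring

Element	Tool	What it collects	Purpose
Metrics	Prometheus	CPU, memory, task success rate, response time	Trend analysis, capacity planning
Logs	Loki	Error logs, exception stack traces, operation records	Troubleshooting, audit trails
Alerts	Alertmanager	Notifications sent when rules fire	Real-time response, limiting the damage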

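Chapter 2: Metrics Collection with Prometheus

2.1 Built-in OpenClaw Metrics

OpenClaw 2026.5.22+ ships with a built-in Prometheus metrics endpoint:

# Check whether the metrics endpoint is enabled
openclaw config get gateway.metrics.enabled
# Enabled by default, on port 8080

# Test the metrics endpoint
curl http://localhost:8080/metrics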

Built-in metrics:

# Agent task statistics
openclaw_tasks_total{agent="marketing-content-creator", status="success"} 42
openclaw_tasks_total{agent="marketing-content-creator", status="failure"} 3
openclaw_task_duration_seconds{agent="marketing-content-creator", quantile="0.95"} 2.5

# Resource usage
openclaw_memory_usage_bytes{agent="marketing-content-creator"} 536870912
openclaw_cpu_usage_percent{agent="marketing-content-creator"} 15.3

# Channel status
openclaw_channel_health{channel="discord"} 1
openclaw_channel_health{channel="telegram"} 1
openclaw_channel_health{channel="wechat"} 0  # 0 = unhealthy

# Message statistics
openclaw_messages_total{channel="discord", direction="inbound"} 128
openclaw_messages_total{channel="discord", direction="outbound"} 128
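To make the metric format above concrete, here is a minimal Python sketch that parses lines in the Prometheus text format and derives a per-agent success rate from openclaw_tasks_total. The helper names (parse_metrics, success_rate_by_agent) are made up for illustration and are not part of OpenClaw; the parser only handles the simple line shape shown above.

```python
import re
from collections import defaultdict

# One sample line looks like: name{label="value", ...} 42
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL_RE.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples


def success_rate_by_agent(samples):
    """Return {agent: successes / total} based on openclaw_tasks_total."""
    totals = defaultdict(float)
    successes = defaultdict(float)
    for name, labels, value in samples:
        if name != "openclaw_tasks_total":
            continue
        agent = labels.get("agent", "unknown")
        totals[agent] += value
        if labels.get("status") == "success":
            successes[agent] += value
    return {a: successes[a] / totals[a] for a in totals if totals[a] > 0}


EXAMPLE = """
openclaw_tasks_total{agent="marketing-content-creator", status="success"} 42
openclaw_tasks_total{agent="marketing-content-creator", status="failure"} 3
"""
rates = success_rate_by_agent(parse_metrics(EXAMPLE))
print(rates)
```

With the sample values from the list (42 successes, 3 failures) the rate is 42/45. Note that this is a ratio of lifetime counters; the dashboards and alert rules later in the article use rate() over a time window instead.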

2.2 Configuring Prometheus Scraping

Create prometheus.yml:

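global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules from chapter 5 (the Compose file in chapter 6 mounts them at this path)
rule_files:
  - /etc/prometheus/alert-rules.yml

# Forward firing alerts to Alertmanager ("alertmanager" is the Compose service name from chapter 6)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # Inside the Prometheus container, localhost is the container itself, so OpenClaw on the
  # host is reached via host.docker.internal (on Linux, map that name to host-gateway)
  - job_name: 'openclaw-agents'
    static_configs:
      - targets: ['host.docker.internal:8080']
    metrics_path: '/metrics'
    scrape_interval: 5s

  - job_name: 'openclaw-gateway'
    static_configs:
      - targets: ['host.docker.internal:8080']
    metrics_path: '/gateway/metrics'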

Start Prometheus with Docker:

docker run -d \
  --name prometheus \
  -p 9090:9090 \
  --add-host=host.docker.internal:host-gateway \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus:latest

Verify scraping:

# Check target status
curl http://localhost:9090/api/v1/targets

# Query a metric
curl 'http://localhost:9090/api/v1/query?query=openclaw_tasks_total'
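The same check can be scripted. The sketch below, under the assumption that Prometheus is reachable at http://localhost:9090 as above, builds the instant-query URL and unpacks the JSON that the /api/v1/query endpoint returns. The function names are illustrative only.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # address used in this article


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def extract_samples(response):
    """Turn an instant-query response into a list of (labels, value) pairs."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error', 'unknown error')}")
    return [
        (item["metric"], float(item["value"][1]))  # value is [timestamp, "string"]
        for item in response["data"]["result"]
    ]


def query(expr):
    """Run the query against a live Prometheus (the stack must be running)."""
    with urllib.request.urlopen(build_query_url(expr), timeout=5) as resp:
        return extract_samples(json.load(resp))
```

Calling query("openclaw_tasks_total") against a running stack should return one pair per label combination; an empty list usually means the scrape target is down or the metric name is wrong.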

Chapter 3: Log Aggregation with Loki

3.1 OpenClaw Logging Configuration

# Show the current logging configuration
openclaw config get gateway.logging

# Set the log output format (JSON is easy for Loki to parse)
openclaw config set gateway.logging.format json
openclaw config set gateway.logging.output /var/log/openclaw/agent.log

# Enable structured logging
openclaw config set gateway.logging.structured true

Sample log entry:

{
  "timestamp": "2026-05-25T10:30:00+08:00",
  "level": "ERROR",
  "agent": "marketing-content-creator",
  "task": "image-generation",
  "error": "API rate limit exceeded",
  "retry_count": 3,
  "duration_ms": 5000
}
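Before Loki is in place, a log file in this format can already be checked locally. The following sketch assumes one JSON object per line with the field names from the sample above and counts ERROR entries per agent and task; count_errors is a hypothetical helper, not an OpenClaw command.

```python
import json
from collections import Counter


def count_errors(lines):
    """Count ERROR entries per (agent, task); skip lines that are not valid JSON."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # for example a stray plain-text line
        if isinstance(entry, dict) and entry.get("level") == "ERROR":
            key = (entry.get("agent", "unknown"), entry.get("task", "unknown"))
            counts[key] += 1
    return counts


sample = [
    '{"timestamp": "2026-05-25T10:30:00+08:00", "level": "ERROR", '
    '"agent": "marketing-content-creator", "task": "image-generation", '
    '"error": "API rate limit exceeded", "retry_count": 3, "duration_ms": 5000}',
    '{"timestamp": "2026-05-25T10:30:05+08:00", "level": "INFO", '
    '"agent": "marketing-content-creator", "task": "image-generation"}',
    "not json",
]
error_counts = count_errors(sample)
print(error_counts)
```

To run it on a real file, pass an open file handle, for example count_errors(open("/var/log/openclaw/agent.log", encoding="utf-8")).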

3.2 Configuring Promtail

Create promtail.yml:

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

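clients:
  # Inside the Promtail container, localhost is the container itself. The name "loki" resolves
  # when both containers share a Docker network, as in the Compose file in chapter 6.
  - url: http://loki:3100/loki/api/v1/push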

scrape_configs:
  - job_name: openclaw-logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: openclaw-agent
          __path__: /var/log/openclaw/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            agent: agent
            task: task
      - labels:
          level:
          agent:
          task:

Start Loki and Promtail with Docker:

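# Shared network so that Promtail can reach Loki by container name
docker network create monitoring

# Loki
docker run -d \
  --name loki \
  --network monitoring \
  -p 3100:3100 \
  grafana/loki:latest

# Promtail
docker run -d \
  --name promtail \
  --network monitoring \
  -v "$(pwd)/promtail.yml:/etc/promtail/config.yml" \
  -v /var/log/openclaw:/var/log/openclaw \
  grafana/promtail:latest \
  -config.file=/etc/promtail/config.yml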
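Once both containers are up, one way to confirm that log lines are arriving is to query Loki's HTTP API. This is a minimal sketch that only builds the URL for the /loki/api/v1/query_range endpoint; it assumes Loki is published on http://localhost:3100 and that level is available as a label, as configured in promtail.yml above. The function names are illustrative.

```python
import urllib.parse
from datetime import datetime, timedelta, timezone

LOKI_URL = "http://localhost:3100"  # port published by the Loki container


def to_ns(moment):
    """Loki accepts start/end as Unix epoch nanoseconds."""
    return int(moment.timestamp()) * 10**9


def build_range_query(selector, end, minutes=15, limit=100, base_url=LOKI_URL):
    """Return a query_range URL covering the `minutes` before `end`."""
    start = end - timedelta(minutes=minutes)
    params = {
        "query": selector,
        "start": to_ns(start),
        "end": to_ns(end),
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base_url}/loki/api/v1/query_range?" + urllib.parse.urlencode(params)


url = build_range_query(
    '{job="openclaw-agent", level="ERROR"}',
    end=datetime.now(timezone.utc),
)
print(url)
```

Opening the printed URL with curl or urllib.request.urlopen returns the matching log streams as JSON; an empty result right after startup is normal until Promtail has shipped its first batch.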

Chapter 4: Visualization with Grafana Dashboards

4.1 Installing Grafana

docker run -d \
  --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin123 \
  grafana/grafana:latest

Open http://localhost:3000 and log in as admin with the password set above (admin123). Change this password for anything beyond local testing.

4.2 Configuring Data Sources

1. Prometheus data source:

• URL: http://host.docker.internal:9090
• Save & Test

2. Loki data source:

• URL: http://host.docker.internal:3100
• Save & Test
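The two data sources can also be created through Grafana's HTTP API instead of the UI. The sketch below assumes the admin credentials and URLs used in this article; the helper names are made up for illustration, and create_datasource only works against a running Grafana.

```python
import base64
import json
import urllib.request


def datasource_payload(name, ds_type, url, is_default=False):
    """Build the JSON body for POST /api/datasources."""
    return {
        "name": name,
        "type": ds_type,
        "url": url,
        "access": "proxy",  # Grafana's backend makes the requests
        "isDefault": is_default,
    }


def create_datasource(payload, grafana_url="http://localhost:3000",
                      user="admin", password="admin123"):
    """Send the payload to Grafana using HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return json.load(resp)


DATASOURCES = [
    datasource_payload("Prometheus", "prometheus",
                       "http://host.docker.internal:9090", is_default=True),
    datasource_payload("Loki", "loki", "http://host.docker.internal:3100"),
]
```

To apply them, call create_datasource(payload) for each entry in DATASOURCES. Grafana rejects a second data source with the same name, so this is meant for a fresh instance.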

4.3 Importing the OpenClaw Dashboard

Create the dashboard JSON (openclaw-dashboard.json):

{
  "dashboard": {
    "title": "OpenClaw Agent Monitoring",
    "panels": [
      {
        "title": "Task success rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(openclaw_tasks_total{status=\"success\"}[5m])) / sum(rate(openclaw_tasks_total[5m])) * 100",
            "legendFormat": "Success rate"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "yellow", "value": 80},
                {"color": "green", "value": 95}
              ]
            }
          }
        }
      },
      {
        "title": "Agent memory usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "openclaw_memory_usage_bytes / 1024 / 1024",
            "legendFormat": "{{agent}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "mbytes"
          }
        }
      },
      {
        "title": "Channel health",
        "type": "table",
        "targets": [
          {
            "expr": "openclaw_channel_health",
            "format": "table",
            "instant": true
          }
        ]
      },
      {
        "title": "Error logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{job=\"openclaw-agent\", level=\"ERROR\"}"
          }
        ]
      }
    ]
  }
}

Import it:

# Import through the API
curl -X POST \
  http://admin:admin123@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @openclaw-dashboard.json
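A malformed dashboard file tends to fail with an unhelpful API error, so a quick structural check before the import can save time. The sketch below is one possible pre-flight check, not a Grafana tool: prepare_import is a hypothetical helper that verifies each panel has a title, type, and at least a query expression per target, then returns the body expected by POST /api/dashboards/db.

```python
import json


def prepare_import(document, overwrite=True):
    """Validate the dashboard document and return the API request body."""
    dashboard = dict(document["dashboard"])  # shallow copy, input stays untouched
    problems = []
    for index, panel in enumerate(dashboard.get("panels", [])):
        for key in ("title", "type", "targets"):
            if key not in panel:
                problems.append(f"panel {index}: missing '{key}'")
        for target in panel.get("targets", []):
            if not target.get("expr"):
                problems.append(f"panel {index}: target without 'expr'")
    if problems:
        raise ValueError("; ".join(problems))
    dashboard.setdefault("id", None)  # a null id asks Grafana to create a new dashboard
    return {"dashboard": dashboard, "overwrite": overwrite}


def load_and_prepare(path="openclaw-dashboard.json"):
    """Read the file created above and prepare it for import."""
    with open(path, encoding="utf-8") as handle:
        return prepare_import(json.load(handle))
```

Setting overwrite to True lets the same file be re-imported after edits; without it Grafana refuses to replace an existing dashboard with the same title.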

Chapter 5: Alert Rules and Alertmanager Configuration

5.1 Creating Alert Rules

Create alert-rules.yml:

groups:
  - name: openclaw-agent-alerts
    rules:
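      # Task failure rate too high
      - alert: AgentTaskFailureRateHigh
        # Aggregate by agent so that {{ $labels.agent }} is populated in the annotations
        expr: sum by (agent) (rate(openclaw_tasks_total{status="failure"}[5m])) / sum by (agent) (rate(openclaw_tasks_total[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent }} task failure rate above 10%"
          description: "Failure rate over the last 5 minutes: {{ $value | humanizePercentage }}"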

      # Memory usage too high
      - alert: AgentMemoryHigh
        expr: openclaw_memory_usage_bytes / 1024 / 1024 / 1024 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} memory usage above 1GB"
          description: "Current usage: {{ $value | humanize }}GB"

      # Channel unhealthy
      - alert: ChannelUnhealthy
        expr: openclaw_channel_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Channel {{ $labels.channel }} is unhealthy"
          description: "The channel has been unavailable for 1 minute"

      # Task execution too slow
      - alert: AgentTaskSlow
        expr: openclaw_task_duration_seconds{quantile="0.95"} > 30
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} tasks are running slowly"
          description: "P95 latency: {{ $value }}s"

      # Low disk space (the node_* metrics come from node_exporter, which must be deployed and scraped separately)
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Free space remaining: {{ $value | humanizePercentage }}"
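The for: field in these rules is easy to misread: a rule does not fire the moment its expression becomes true, it first sits in a pending state until the condition has held for the whole duration. The following sketch simulates that behaviour for a single series. It is a simplified model for illustration, not Prometheus code, and alert_states is a made-up name.

```python
def alert_states(samples, threshold, for_seconds):
    """Simulate a rule of the form `value > threshold` with a `for:` duration.

    samples: list of (timestamp_seconds, value), one per rule evaluation.
    Returns a list of (timestamp, state) with state in
    {"inactive", "pending", "firing"}.
    """
    states = []
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            held_for = timestamp - pending_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            pending_since = None  # any false evaluation resets the timer
            state = "inactive"
        states.append((timestamp, state))
    return states


# Failure ratio evaluated once a minute, rule: ratio > 0.1 for 2m
history = [(0, 0.05), (60, 0.2), (120, 0.3), (180, 0.25), (240, 0.02)]
for timestamp, state in alert_states(history, threshold=0.1, for_seconds=120):
    print(timestamp, state)
```

In this run the ratio crosses the threshold at t=60 but the alert only fires at t=180, and a single good evaluation at t=240 clears it. This is why short for: values catch problems faster but are more likely to page on brief spikes.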

5.2 Configuring Alertmanager Notifications

Create alertmanager.yml:

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alert@yjett.com'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'agent']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'feishu-webhook'
  routes:
    - match:
        severity: critical
      receiver: 'feishu-webhook'
      continue: true
    - match:
        severity: warning
      receiver: 'feishu-webhook'
      repeat_interval: 4h

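receivers:
  - name: 'feishu-webhook'
    webhook_configs:
      # webhook_configs has no title/text fields; Alertmanager POSTs its own fixed JSON payload.
      # Feishu bots expect a different body (msg_type/content), so in practice this URL should
      # point at a small relay that converts the payload and forwards it to the Feishu webhook.
      - url: 'https://open.feishu.cn/open-apis/bot/v2/hook/your-webhook-token'
        send_resolved: true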

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'agent']

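Feishu notification template (feishu.tmpl). Alertmanager's generic webhook receiver does not render templates into its payload, so treat this as the message layout for a relay or for receivers that have templated fields: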

{{ define "default.title" }}
🔥 OpenClaw alert: {{ .GroupLabels.alertname }}
{{ end }}

{{ define "default.content" }}
{{ range .Alerts }}
**Severity**: {{ .Labels.severity }}
**Agent**: {{ .Labels.agent }}
**Problem**: {{ .Annotations.summary }}
**Details**: {{ .Annotations.description }}
**Time**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}

Start Alertmanager:

docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  -v $(pwd)/feishu.tmpl:/etc/alertmanager/templates/feishu.tmpl \
  prom/alertmanager:latest \
  --config.file=/etc/alertmanager/alertmanager.yml

Chapter 6: One-Command Deployment

6.1 Complete Docker Compose File

Create docker-compose.monitoring.yml:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    # Lets the container reach services running on the Docker host (needed on Linux)
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - ./promtail.yml:/etc/promtail/config.yml
      - /var/log/openclaw:/var/log/openclaw
    command: -config.file=/etc/promtail/config.yml

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana-data:/var/lib/grafana

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - ./feishu.tmpl:/etc/alertmanager/templates/feishu.tmpl

volumes:
  prometheus-data:
  loki-data:
  grafana-data:

6.2 Startup Commands

# Start the whole monitoring stack with one command
docker-compose -f docker-compose.monitoring.yml up -d

# Check container status
docker-compose -f docker-compose.monitoring.yml ps

# Follow the logs
docker-compose -f docker-compose.monitoring.yml logs -f

Service URLs:

Service	URL	Purpose
Prometheus	http://localhost:9090	Metric queries
Grafana	http://localhost:3000	Visualization
Alertmanager	http://localhost:9093	Alert management
Loki	http://localhost:3100	Log queries
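Rather than opening four browser tabs, the stack can be probed with a short script. This sketch uses the health endpoints each service exposes (/-/healthy for Prometheus and Alertmanager, /api/health for Grafana, /ready for Loki) on the ports from the table above. The function names are illustrative, and the fetch parameter exists only so the logic can be exercised without a running stack.

```python
import urllib.error
import urllib.request

HEALTH_ENDPOINTS = {
    "Prometheus": "http://localhost:9090/-/healthy",
    "Grafana": "http://localhost:3000/api/health",
    "Alertmanager": "http://localhost:9093/-/healthy",
    "Loki": "http://localhost:3100/ready",
}


def http_status(url, timeout=3):
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # reachable, but answered with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def check_stack(endpoints=HEALTH_ENDPOINTS, fetch=http_status):
    """Return {service: True/False} where True means the probe returned 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}


def report(results):
    """Format the results as one line per service."""
    return "\n".join(f"{name}: {'OK' if ok else 'DOWN'}" for name, ok in results.items())
```

Running print(report(check_stack())) after docker-compose up gives a quick pass/fail view. Loki can take a little while to report ready after startup, so a first DOWN there is not necessarily a problem.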

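Chapter 7: Hands-On Checklist

Before deployment

• [ ] Docker is installed
• [ ] The OpenClaw metrics endpoint is enabled
• [ ] The Feishu webhook is configured

During deployment

• [ ] Prometheus scrape configuration is correct
• [ ] Loki is receiving logs
• [ ] Grafana data sources connect successfully
• [ ] The dashboard imports successfully

Alert verification

• [ ] Trigger a test alert (for example, by manually stopping an Agent)
• [ ] Confirm that Feishu receives the notification
• [ ] Verify the resolved notification

Routine maintenance

• [ ] Check disk space every week
• [ ] Review alert rules every month (to reduce false positives)
• [ ] Tidy up the dashboard layout every quarter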
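Stopping an Agent is one way to trigger a test alert; another is to push a synthetic alert straight into Alertmanager through its /api/v2/alerts endpoint, which exercises routing and notification without touching the Agents. The sketch below assumes Alertmanager on http://localhost:9093; the alert name ManualTestAlert and the agent label value are invented for the test.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(now=None, minutes=5):
    """Build a one-alert payload that expires after `minutes`."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "ManualTestAlert",  # hypothetical name for this test
            "severity": "warning",
            "agent": "test-agent",
        },
        "annotations": {
            "summary": "Manual test alert",
            "description": "Sent by hand to verify the notification path",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts, alertmanager_url="http://localhost:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{alertmanager_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status
```

After send_alerts(build_test_alert()), the alert should show up in the Alertmanager UI and a notification should follow after group_wait. Once endsAt passes, the resolved notification covers the third item in the list above.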
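The weekly disk check is easy to automate, which matters here given that a full disk caused the incident at the start of this article. This is a minimal sketch using only the standard library; the 10% threshold mirrors the DiskSpaceLow rule, and the function names are illustrative.

```python
import shutil


def disk_free_ratio(path="/"):
    """Return free space as a fraction of total capacity for `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def needs_cleanup(free_ratio, threshold=0.10):
    """True when free space is below the threshold (10% by default)."""
    return free_ratio < threshold


ratio = disk_free_ratio("/")
status = "cleanup needed" if needs_cleanup(ratio) else "ok"
print(f"free: {ratio:.1%} ({status})")
```

Scheduled from cron once a week, this covers the host even when node_exporter is not deployed. It complements the alert rule rather than replacing it, since it only runs when scheduled.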

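Closing: The Essence of Monitoring

One last line from a programmer who made it into the industry in 2021:

Monitoring is not about looking good. It is about spotting problems before they happen and pinpointing them quickly after they do.

A good monitoring stack = metrics that tell you "where something might be wrong" + logs that tell you "what exactly is wrong" + alerts that tell you "deal with this now".

Now go and monitor your Agents.

📌 Action Checklist

• [ ] Start Prometheus + Grafana + Loki + Alertmanager
• [ ] Import the OpenClaw dashboard
• [ ] Configure Feishu alert notifications
• [ ] Trigger a test alert to verify the notification path
• [ ] Set a reminder for a weekly monitoring review

👇 Discussion
What is the strangest failure one of your Agents has run into? Tell us in the comments.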