Deploying OpenClaw on K8s, a Complete Hands-On Guide: Building a Highly Available Game Server Platform from Scratch
— 无处不在的技术 —

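1. What Is OpenClaw?
OpenClaw is an open-source 2D side-scrolling game engine written entirely in C++ that recreates the full game mechanics of the classic PC game Captain Claw. The project is hosted on GitHub and ships with a physics engine, collision detection, a level editor, and a resource management system.
Deploying OpenClaw on Kubernetes brings four main benefits:
- ▸ Elastic scaling: scale out automatically at player peaks and scale in during quiet periods to cut resource costs
- ▸ High availability: multiple replicas plus health checks, so Pods are rescheduled automatically when a single node fails
- ▸ Unified operations: integrates with the cloud-native ecosystem (Prometheus, Ingress, CI/CD)
- ▸ Container isolation: each game instance runs independently without interfering with the others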

2. Architecture Design
Before containerizing OpenClaw and deploying it to K8s, here is the overall architecture:
📄 CODE
                 ┌─────────────────────────────┐
                 │   External traffic entry    │
                 │   Ingress / LoadBalancer    │
                 └──────────────┬──────────────┘
                                │
                 ┌──────────────▼──────────────┐
                 │     Service (ClusterIP)     │
                 │ openclaw-service:80 → 8080  │
                 └──────────────┬──────────────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
┌──────────▼──────┐  ┌──────────▼──────┐  ┌──────────▼──────┐
│ Pod openclaw-0  │  │ Pod openclaw-1  │  │ Pod openclaw-2  │
│  (game server)  │  │  (game server)  │  │  (game server)  │
└─────────────────┘  └─────────────────┘  └─────────────────┘
           │                    │                    │
           └────────────────────┼────────────────────┘
                                │
                 ┌──────────────▼──────────────┐
                 │      PersistentVolume       │
                 │  /data/openclaw (NFS/CSI)   │
                 └─────────────────────────────┘
Core components:
◆
3. Environment Preparation
3.1 Prerequisites
3.2 Directory Layout
💻 Shell
openclaw-k8s/
├── Dockerfile              # Image build file
├── docker-compose.yml      # Local debugging
├── k8s/
│   ├── namespace.yaml      # Namespace
│   ├── configmap.yaml      # Configuration injection
│   ├── secret.yaml         # Sensitive data
│   ├── pvc.yaml            # Persistent storage
│   ├── deployment.yaml     # Core deployment
│   ├── service.yaml        # Service exposure
│   ├── ingress.yaml        # Layer 7 routing
│   └── hpa.yaml            # Autoscaling
└── scripts/
    ├── build.sh            # Build script
    └── deploy.sh           # One-command deployment script
◆
4. Building the Docker Image
4.1 Clone the OpenClaw Source
💻 Shell
# Clone the project
git clone https://github.com/pjank/OpenClaw.git
cd OpenClaw
# Inspect the project layout
ls -la
# Main entries:
#   OpenClaw/       - game source code
#   CaptainClaw/    - asset files
#   CMakeLists.txt
4.2 Write a Multi-Stage Dockerfile
🐳 Dockerfile
# ============================================================
# Stage 1: build stage - compile the C++ project on Ubuntu 22.04
# ============================================================
FROM ubuntu:22.04 AS builder
LABEL stage=builder
# Suppress interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive
# Install build dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    libsdl2-dev \
    libsdl2-image-dev \
    libsdl2-mixer-dev \
    libsdl2-ttf-dev \
    libboost-all-dev \
    libxml2-dev \
    zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /build
# Copy the source tree
COPY . .
# Configure and build with CMake (Release mode)
RUN mkdir -p build && cd build \
    && cmake .. \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=/opt/openclaw \
    && make -j"$(nproc)" \
    && make install
# ============================================================
# Stage 2: runtime stage - slim image with runtime dependencies only
# ============================================================
FROM ubuntu:22.04 AS runtime
ENV DEBIAN_FRONTEND=noninteractive
# Install runtime libraries only (no development headers, which keeps the image much smaller)
RUN apt-get update && apt-get install -y \
    libsdl2-2.0-0 \
    libsdl2-image-2.0-0 \
    libsdl2-mixer-2.0-0 \
    libsdl2-ttf-2.0-0 \
    libboost-system1.74.0 \
    libboost-filesystem1.74.0 \
    libxml2 \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Copy the build output from the builder stage
COPY --from=builder /opt/openclaw /opt/openclaw
# Game resource directory
WORKDIR /opt/openclaw
# Create a non-root user (security best practice). UID/GID 1000 matches
# runAsUser/runAsGroup in the Deployment, and the data directories are created
# before chown so the volumes below start out writable by that user.
RUN groupadd -g 1000 openclaw \
    && useradd -u 1000 -g openclaw -d /opt/openclaw -s /usr/sbin/nologin openclaw \
    && mkdir -p /opt/openclaw/saves /opt/openclaw/levels /opt/openclaw/logs \
    && chown -R openclaw:openclaw /opt/openclaw
# Mount points: game saves and custom levels
VOLUME ["/opt/openclaw/saves", "/opt/openclaw/levels"]
# Game service port (adjust to your configuration)
EXPOSE 8080
USER openclaw
# Health check for plain Docker runs; it assumes the server answers HTTP on /health.
# Kubernetes ignores HEALTHCHECK and relies on the probes defined in the Deployment.
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -sf http://localhost:8080/health || exit 1
# The flags below follow this guide's server setup; adjust them to the options your build accepts
ENTRYPOINT ["/opt/openclaw/bin/openclaw"]
CMD ["--server", "--port=8080", "--config=/opt/openclaw/config/server.xml"]
4.3 Build and Push the Image
💻 Shell
# Image repository variables
export IMAGE_REPO="registry.example.com/games/openclaw"
export IMAGE_TAG="v1.0.0"
# Build the image with BuildKit enabled
DOCKER_BUILDKIT=1 docker build \
  -t "${IMAGE_REPO}:${IMAGE_TAG}" \
  -t "${IMAGE_REPO}:latest" \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  .
# Check the image size
docker images | grep openclaw
# Push to the registry
docker push "${IMAGE_REPO}:${IMAGE_TAG}"
docker push "${IMAGE_REPO}:latest"
# Verify locally
docker run --rm -p 8080:8080 \
  -v "$(pwd)/saves:/opt/openclaw/saves" \
  "${IMAGE_REPO}:${IMAGE_TAG}"
◆
5. Kubernetes Resource Manifests
5.1 Create the Namespace
📄 YAML
# k8s/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openclaw
  labels:
    app.kubernetes.io/name: openclaw
    environment: production
    team: game-ops
💻 Shell
kubectl apply -f k8s/namespace.yaml
kubectl get namespace openclaw
5.2 ConfigMap: Injecting Game Configuration
📄 YAML
# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openclaw-config
  namespace: openclaw
  labels:
    app: openclaw
data:
  # Basic server settings
  SERVER_PORT: "8080"
  SERVER_HOST: "0.0.0.0"
  MAX_PLAYERS: "100"
  TICK_RATE: "60"
  # Game parameters
  GAME_DIFFICULTY: "normal"
  ENABLE_CHEATS: "false"
  LOG_LEVEL: "info"
  # Performance tuning
  PHYSICS_FPS: "60"
  RENDER_DISTANCE: "1024"
  # Game configuration file (full XML, mounted as a file)
  server.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <ServerConfig>
      <Network>
        <Host>0.0.0.0</Host>
        <Port>8080</Port>
        <MaxConnections>100</MaxConnections>
        <Timeout>30</Timeout>
      </Network>
      <Game>
        <Difficulty>normal</Difficulty>
        <TickRate>60</TickRate>
        <PhysicsEnabled>true</PhysicsEnabled>
      </Game>
      <Logging>
        <Level>INFO</Level>
        <File>/opt/openclaw/logs/server.log</File>
      </Logging>
    </ServerConfig>
5.3 Secret: Managing Sensitive Data
💻 Shell
# Generate the Secret manifest with kubectl (values are Base64-encoded automatically)
kubectl create secret generic openclaw-secret \
  --namespace=openclaw \
  --from-literal=admin-password='SuperSecureP@ss123' \
  --from-literal=api-key='sk-openclaw-xxx-yyy-zzz' \
  --from-literal=db-password='Db@Pass456' \
  --dry-run=client -o yaml > k8s/secret.yaml
# --dry-run=client only writes the file, so apply it before reading the Secret back
kubectl apply -f k8s/secret.yaml
# Inspect a value (decode the Base64 to verify)
kubectl get secret openclaw-secret -n openclaw -o jsonpath='{.data.admin-password}' | base64 -d
📄 YAML
# k8s/secret.yaml (for production, prefer Vault or the External Secrets Operator)
apiVersion: v1
kind: Secret
metadata:
  name: openclaw-secret
  namespace: openclaw
type: Opaque
data:
  # Each value is the output of: echo -n 'value' | base64
  admin-password: U3VwZXJTZWN1cmVQQHNzMTIz
  api-key: c2stb3BlbmNsYXcteHh4LXl5eS16eno=
  db-password: RGJAUGFzczQ1Ng==
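The values under `data:` are plain Base64, an encoding rather than encryption, so anyone who can read the manifest can recover the literals. The round trip can be sketched as follows, assuming a `base64` utility that accepts `-d` (GNU coreutils does); the helper names `b64_encode` and `b64_decode` are illustrative and not part of kubectl:

```shell
#!/bin/sh
# Hypothetical helpers mirroring what `kubectl create secret` does per literal.

# Encode a literal without a trailing newline. A plain `echo` would append
# one and silently change the stored secret.
b64_encode() {
  printf '%s' "$1" | base64
}

# Decode a value copied from the Secret manifest.
b64_decode() {
  printf '%s' "$1" | base64 -d
}

# Encode the sample admin password used in this section
b64_encode 'SuperSecureP@ss123'

# Decode the sample api-key value from secret.yaml
b64_decode 'c2stb3BlbmNsYXcteHh4LXl5eS16eno='
echo
```

Because decoding is this easy, a committed `secret.yaml` should be treated as if it contained the passwords in clear text.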
5.4 PersistentVolumeClaim: Persistent Storage
📄 YAML
# k8s/pvc.yaml
---
# PVC for game saves
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-saves-pvc
  namespace: openclaw
  labels:
    app: openclaw
    type: saves
spec:
  accessModes:
    - ReadWriteMany              # Shared read/write across Pods (requires NFS or CephFS)
  storageClassName: nfs-client   # Replace with your StorageClass
  resources:
    requests:
      storage: 10Gi
---
# PVC for custom levels
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-levels-pvc
  namespace: openclaw
  labels:
    app: openclaw
    type: levels
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 5Gi
5.5 Deployment: Core Deployment Configuration
This is the most important manifest. It includes the probes, resource limits, anti-affinity rules, and other production-grade settings:
📄 YAML
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw
  namespace: openclaw
  labels:
    app: openclaw
    version: v1.0.0
spec:
  replicas: 3                # Start with 3 replicas
  revisionHistoryLimit: 5    # Keep 5 old revisions for rollback
  selector:
    matchLabels:
      app: openclaw
  # Rolling update strategy: always keep instances serving
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # At most 1 extra Pod during an update
      maxUnavailable: 0      # No Pod may be unavailable during an update (zero downtime)
  template:
    metadata:
      labels:
        app: openclaw
        version: v1.0.0
      annotations:
        # Prometheus auto-discovery
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      # Prefer spreading Pods across nodes (high availability)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: ["openclaw"]
                topologyKey: kubernetes.io/hostname
      # Security context (run as non-root)
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      # Grace period for shutdown
      terminationGracePeriodSeconds: 60
      # Init container: confirms that the configuration is mounted
      initContainers:
        - name: init-config
          image: busybox:1.36
          command:
            - sh
            - -c
            - |
              echo "Checking configuration files..."
              ls -la /config/
              echo "Initialization complete"
          # Explicit requests/limits keep the Pod admissible once the
          # ResourceQuota from section 10.1 is active
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"
            limits:
              cpu: "50m"
              memory: "32Mi"
          volumeMounts:
            - name: config-volume
              mountPath: /config
      containers:
        - name: openclaw
          image: registry.example.com/games/openclaw:v1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: metrics
              containerPort: 9090
              protocol: TCP
          # Environment variables
          env:
            - name: SERVER_PORT
              valueFrom:
                configMapKeyRef:
                  name: openclaw-config
                  key: SERVER_PORT
            - name: MAX_PLAYERS
              valueFrom:
                configMapKeyRef:
                  name: openclaw-config
                  key: MAX_PLAYERS
            - name: ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: openclaw-secret
                  key: admin-password
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          # Resource limits (keep a single Pod from exhausting the node)
          resources:
            requests:
              cpu: "250m"        # Reserved at scheduling time
              memory: "512Mi"
            limits:
              cpu: "2000m"       # At most 2 cores
              memory: "2Gi"      # At most 2 GiB; exceeding it triggers an OOM kill (exit code 137)
          # The probe paths below assume the server exposes these HTTP endpoints
          # (1) Startup probe: gives the C++ process time to initialize (up to 3 minutes)
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            failureThreshold: 36   # 36 attempts x 5s = 3 minutes
            periodSeconds: 5
          # (2) Liveness probe: detects a hung process
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
            successThreshold: 1
          # (3) Readiness probe: controls whether this Pod receives traffic
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 2    # Two consecutive successes before taking traffic
          # Graceful shutdown: let in-flight connections finish first
          lifecycle:
            preStop:
              exec:
                command:
                  - sh
                  - -c
                  - "sleep 10"     # Wait 10s so the load balancer drops this Pod before it stops
          # Volume mounts
          volumeMounts:
            - name: config-volume
              mountPath: /opt/openclaw/config
              readOnly: true
            - name: saves-storage
              mountPath: /opt/openclaw/saves
            - name: levels-storage
              mountPath: /opt/openclaw/levels
            - name: logs-volume
              mountPath: /opt/openclaw/logs
            - name: tmp-dir
              mountPath: /tmp
        # Sidecar: log collection (optional)
        - name: log-collector
          image: fluent/fluent-bit:2.2
          resources:
            requests:
              cpu: "50m"
              memory: "64Mi"
            limits:
              cpu: "100m"
              memory: "128Mi"
          volumeMounts:
            - name: logs-volume
              mountPath: /logs
              readOnly: true
      volumes:
        - name: config-volume
          configMap:
            name: openclaw-config
        - name: saves-storage
          persistentVolumeClaim:
            claimName: openclaw-saves-pvc
        - name: levels-storage
          persistentVolumeClaim:
            claimName: openclaw-levels-pvc
        - name: logs-volume
          emptyDir: {}
        - name: tmp-dir
          emptyDir: {}
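Each probe's `failureThreshold` and `periodSeconds` together define roughly how long the container may keep failing before Kubernetes reacts. A small sketch of that arithmetic, using the values from the manifest above; the helper name `probe_budget` is illustrative, and the result is approximate because it ignores `timeoutSeconds` and scheduling jitter:

```shell
#!/bin/sh
# probe_budget FAILURE_THRESHOLD PERIOD_SECONDS
# Prints the approximate number of seconds a probe may keep failing before
# Kubernetes acts: a restart for startup/liveness probes, removal from the
# Service endpoints for readiness probes.
probe_budget() {
  echo $(( $1 * $2 ))
}

echo "startup:   $(probe_budget 36 5)s"   # time allowed for initialization
echo "liveness:  $(probe_budget 3 15)s"   # time until a hung process is restarted
echo "readiness: $(probe_budget 3 10)s"   # time until traffic is withdrawn
```

If the server regularly needs longer than the startup budget, raising `failureThreshold` on the startup probe is usually safer than loosening the liveness probe, which stays in effect for the whole lifetime of the container.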
5.6 Service: Exposing the Application
📄 YAML
# k8s/service.yaml
---
# ClusterIP: in-cluster communication
apiVersion: v1
kind: Service
metadata:
  name: openclaw-service
  namespace: openclaw
  labels:
    app: openclaw
spec:
  type: ClusterIP
  selector:
    app: openclaw
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
---
# NodePort: direct external access (for test environments)
apiVersion: v1
kind: Service
metadata:
  name: openclaw-nodeport
  namespace: openclaw
spec:
  type: NodePort
  selector:
    app: openclaw
  ports:
    - name: http
      port: 80
      targetPort: 8080
      nodePort: 30800   # Default range is 30000-32767; reachable at NodeIP:30800
5.7 Ingress: Layer 7 Traffic Routing
📄 YAML
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openclaw-ingress
  namespace: openclaw
  annotations:
    # NGINX Ingress Controller settings.
    # rewrite-target is deliberately not set: a bare "/" without capture groups
    # would rewrite every request path to "/".
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    # WebSocket support (required for real-time game traffic): ingress-nginx
    # forwards the Upgrade headers on its own, so only the idle timeouts need
    # raising to keep long-lived connections open
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "100"
    # Redirect HTTP to HTTPS
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  # TLS configuration
  tls:
    - hosts:
        - openclaw.example.com
      secretName: openclaw-tls-cert
  rules:
    - host: openclaw.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: openclaw-service
                port:
                  number: 80
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: openclaw-service
                port:
                  number: 80
5.8 HPA: Horizontal Autoscaling
📄 YAML
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: openclaw-hpa
  namespace: openclaw
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: openclaw
  minReplicas: 2     # At least 2 replicas (high availability)
  maxReplicas: 20    # At most 20 replicas (player peaks)
  metrics:
    # Scale out when average CPU usage exceeds 70% of the Pods' CPU requests
    # (utilization is measured against requests, not limits)
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Scale out when average memory usage exceeds 80% of the Pods' memory requests
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Scale-up stabilization window: 60s
      policies:
        - type: Pods
          value: 4                      # Add at most 4 Pods per period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Scale-down stabilization window: 5 minutes (prevents flapping)
      policies:
        - type: Pods
          value: 2                      # Remove at most 2 Pods per period
          periodSeconds: 120
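The replica count the HPA aims for follows the formula in the Kubernetes documentation: desired = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to `minReplicas` and `maxReplicas`. The sketch below reproduces only that core calculation; the function name `hpa_desired_replicas` is illustrative, and the controller's tolerance band and the `behavior` policies above are deliberately left out:

```shell
#!/bin/sh
# hpa_desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Utilization values are integer percentages of the Pods' resource requests.
hpa_desired_replicas() {
  # Integer ceiling of (current replicas * current utilization / target)
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))
  # Clamp to the configured bounds
  if [ "$desired" -lt "$4" ]; then desired=$4; fi
  if [ "$desired" -gt "$5" ]; then desired=$5; fi
  echo "$desired"
}

# 3 replicas at 90% CPU against a 70% target, bounds 2..20
hpa_desired_replicas 3 90 70 2 20
# 3 replicas at 30% CPU: the raw result is held at minReplicas
hpa_desired_replicas 3 30 70 2 20
# 10 replicas at 200% CPU: the raw result is capped at maxReplicas
hpa_desired_replicas 10 200 70 2 20
```

When several metrics are configured, as with CPU and memory here, the HPA evaluates each one and uses the largest resulting replica count.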
◆
6. One-Command Deployment Script
6.1 Full Deployment Script
💻 Shell
#!/bin/bash
# scripts/deploy.sh - one-command OpenClaw deployment to K8s
set -euo pipefail
# =========== Configuration ===========
NAMESPACE="openclaw"
IMAGE_REPO="registry.example.com/games/openclaw"
IMAGE_TAG="${1:-v1.0.0}"
K8S_DIR="./k8s"
# =====================================
echo "🚀 Deploying OpenClaw ${IMAGE_TAG} to K8s..."
# Check the kubectl connection
kubectl cluster-info > /dev/null 2>&1 || {
  echo "❌ kubectl cannot reach the cluster; check your kubeconfig"
  exit 1
}
# 1. Create the namespace
echo "📦 [1/7] Creating namespace..."
kubectl apply -f "${K8S_DIR}/namespace.yaml"
# 2. Apply configuration
echo "⚙️ [2/7] Applying ConfigMap..."
kubectl apply -f "${K8S_DIR}/configmap.yaml" -n "${NAMESPACE}"
# 3. Apply the Secret (use Vault instead in production)
echo "🔐 [3/7] Applying Secret..."
kubectl apply -f "${K8S_DIR}/secret.yaml" -n "${NAMESPACE}"
# 4. Create storage
echo "💾 [4/7] Creating PVCs..."
kubectl apply -f "${K8S_DIR}/pvc.yaml" -n "${NAMESPACE}"
# Wait for the PVCs to bind. A PVC reports its state in .status.phase rather
# than as a condition, so wait on the JSONPath (requires kubectl 1.23+).
echo " Waiting for PVCs to bind..."
for pvc in openclaw-saves-pvc openclaw-levels-pvc; do
  kubectl wait --for=jsonpath='{.status.phase}'=Bound "pvc/${pvc}" \
    -n "${NAMESPACE}" --timeout=60s
done
# 5. Substitute the image tag and deploy
echo "🐳 [5/7] Deploying the application..."
sed "s|registry.example.com/games/openclaw:v1.0.0|${IMAGE_REPO}:${IMAGE_TAG}|g" \
  "${K8S_DIR}/deployment.yaml" | kubectl apply -f - -n "${NAMESPACE}"
# Wait for the Deployment to become ready
echo " Waiting for Pods to start (5-minute timeout)..."
kubectl rollout status deployment/openclaw \
  -n "${NAMESPACE}" --timeout=300s
# 6. Deploy the Service and Ingress
echo "🌐 [6/7] Deploying Service and Ingress..."
kubectl apply -f "${K8S_DIR}/service.yaml" -n "${NAMESPACE}"
kubectl apply -f "${K8S_DIR}/ingress.yaml" -n "${NAMESPACE}"
# 7. Configure the HPA
echo "📈 [7/7] Configuring autoscaling..."
kubectl apply -f "${K8S_DIR}/hpa.yaml" -n "${NAMESPACE}"
echo ""
echo "✅ Deployment complete!"
echo ""
echo "📊 Deployment status:"
kubectl get pods -n "${NAMESPACE}" -l app=openclaw
echo ""
kubectl get svc -n "${NAMESPACE}"
echo ""
echo "🔗 URL: https://openclaw.example.com"
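Step 5 of the script rewrites the image reference with `sed` before piping the manifest to kubectl. The substitution can be tried on its own, without a cluster, to confirm what will be applied. This is a minimal sketch; the function name `render_image` is illustrative and the registry and tag in the example are placeholders:

```shell
#!/bin/sh
# render_image IMAGE_REPO IMAGE_TAG
# Reads a manifest on stdin and replaces the pinned image reference, using
# the same sed expression as the deployment script.
render_image() {
  sed "s|registry.example.com/games/openclaw:v1.0.0|$1:$2|g"
}

# Dry run against a single manifest line
printf '%s\n' 'image: registry.example.com/games/openclaw:v1.0.0' \
  | render_image registry.example.com/games/openclaw v1.1.0
```

The replacement only works while `deployment.yaml` still contains the exact `v1.0.0` reference, so the file itself should stay pinned to that placeholder tag.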
6.2 Quick Verification Commands
💻 Shell
# Check Pod status
kubectl get pods -n openclaw -o wide
# Show Pod details (for debugging)
kubectl describe pod -n openclaw -l app=openclaw
# Follow the game server logs (-c selects the container in this multi-container Pod)
kubectl logs -n openclaw -l app=openclaw -c openclaw -f --tail=100
# Show logs for a specific container of one Pod
kubectl logs -n openclaw <pod-name> -c openclaw --tail=50
# Open a shell in the container for debugging
kubectl exec -it -n openclaw <pod-name> -c openclaw -- /bin/sh
# Check HPA status
kubectl get hpa -n openclaw
# Check the Ingress
kubectl get ingress -n openclaw
# Port-forward for local debugging
kubectl port-forward -n openclaw svc/openclaw-service 8080:80
# Then open http://localhost:8080
◆
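7. Rolling Updates and Rollbacks
7.1 Update the Image Version
💻 Shell
# Option 1: update the image directly (recommended)
kubectl set image deployment/openclaw \
  openclaw=registry.example.com/games/openclaw:v1.1.0 \
  -n openclaw
# Record the reason for the change; this fills the CHANGE-CAUSE column below
kubectl annotate deployment/openclaw -n openclaw \
  kubernetes.io/change-cause="Update to v1.1.0" --overwrite
# Option 2: edit the Deployment
kubectl edit deployment openclaw -n openclaw
# Watch the rolling update
kubectl rollout status deployment/openclaw -n openclaw
# Show the rollout history
kubectl rollout history deployment/openclaw -n openclaw
# REVISION  CHANGE-CAUSE
# 1         <none>
# 2         Update to v1.1.0
# 3         Hotfix memory leak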
7.2 Roll Back a Version
💻 Shell
# Roll back to the previous revision
kubectl rollout undo deployment/openclaw -n openclaw
# Roll back to a specific revision (for example, revision 1)
kubectl rollout undo deployment/openclaw \
  --to-revision=1 \
  -n openclaw
# Verify the rollback
kubectl get pods -n openclaw
kubectl rollout status deployment/openclaw -n openclaw
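7.3 Canary Release (Advanced)
💻 Shell
# openclaw-service selects Pods by app=openclaw. A Deployment made with
# `kubectl create deployment openclaw-canary` labels its Pods
# app=openclaw-canary, so it would receive no traffic from that Service.
# Instead, apply a copy of k8s/deployment.yaml (saved here under the
# hypothetical name k8s/deployment-canary.yaml) with these changes:
#   metadata.name: openclaw-canary
#   spec.replicas: 1
#   selector and Pod labels: keep app: openclaw and add track: canary
#   image: registry.example.com/games/openclaw:v1.2.0-beta
kubectl apply -f k8s/deployment-canary.yaml -n openclaw
# Adjust the replica count to control the canary's share of traffic
kubectl scale deployment openclaw-canary \
  --replicas=1 -n openclaw
# With 3 stable Pods and 1 canary Pod, traffic splits roughly 75% / 25%
# Once the canary is verified, roll the new image out to the stable Deployment
kubectl set image deployment/openclaw \
  openclaw=registry.example.com/games/openclaw:v1.2.0-beta \
  -n openclaw
kubectl delete deployment openclaw-canary -n openclaw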
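With replica-based canaries, the share of traffic follows from the Pod counts alone, because the Service spreads connections across every ready Pod it selects. A quick sketch of that calculation, assuming connections are distributed evenly; the helper name `canary_share` is illustrative:

```shell
#!/bin/sh
# canary_share STABLE_PODS CANARY_PODS
# Prints the approximate percentage of traffic reaching the canary when a
# single Service balances evenly across all ready Pods (rounded down).
canary_share() {
  echo $(( $2 * 100 / ($1 + $2) ))
}

canary_share 3 1    # 3 stable Pods plus 1 canary Pod
canary_share 9 1    # a smaller slice needs more stable Pods
```

This also shows the limit of the approach: the finest possible slice is one Pod out of the total, and an HPA that changes the stable replica count changes the ratio with it.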
◆
8. Monitoring and Alerting
8.1 Prometheus Alert Rules
📄 YAML
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openclaw-alerts
  namespace: openclaw
  labels:
    release: prometheus
spec:
  groups:
    - name: openclaw.rules
      rules:
        # Alert on frequent Pod restarts
        - alert: OpenClawPodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total{
              namespace="openclaw", container="openclaw"
            }[5m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "OpenClaw Pod is restarting frequently"
            description: "Pod {{ $labels.pod }} restarted within the last 5 minutes"
        # Alert when memory usage exceeds 85% of the limit
        - alert: OpenClawHighMemory
          expr: |
            (container_memory_usage_bytes{namespace="openclaw",container="openclaw"}
              / container_spec_memory_limit_bytes{namespace="openclaw",container="openclaw"})
              * 100 > 85
          for: 3m
          labels:
            severity: warning
          annotations:
            summary: "OpenClaw memory usage is high"
            description: "Pod {{ $labels.pod }} memory usage is {{ $value | humanize }}%"
        # Alert when fewer Pods are ready than expected
        - alert: OpenClawPodsNotReady
          expr: |
            kube_deployment_status_replicas_ready{
              namespace="openclaw", deployment="openclaw"
            } < 2
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Not enough OpenClaw instances available"
            description: "Ready Pods: {{ $value }}, below the required minimum of 2"
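The `OpenClawHighMemory` expression divides usage by the container's memory limit and compares the result with 85. The same check can be reproduced by hand when reading `kubectl top` output. This is a sketch only: the helper names `mem_pct` and `alert_would_fire` are illustrative, and the inputs are whole MiB values rather than the byte counts Prometheus works with:

```shell
#!/bin/sh
# mem_pct USED_MIB LIMIT_MIB
# Prints usage as an integer percentage of the limit (rounded down).
mem_pct() {
  echo $(( $1 * 100 / $2 ))
}

# alert_would_fire USED_MIB LIMIT_MIB
# Exit status 0 when usage is strictly above 85% of the limit. The comparison
# is done without rounding so that values just above 85% still count.
alert_would_fire() {
  [ $(( $1 * 100 )) -gt $(( $2 * 85 )) ]
}

# Compare a few usage readings against the 2Gi (2048 MiB) limit
for used in 1024 1792 1900; do
  if alert_would_fire "$used" 2048; then state="above threshold"; else state="ok"; fi
  echo "${used}Mi of 2048Mi: $(mem_pct "$used" 2048)% (${state})"
done
```

In the real rule the condition must also hold continuously for the `for: 3m` duration before the alert fires.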
8.2 Key Monitoring Commands
💻 Shell
# Show Pod resource usage (requires metrics-server)
kubectl top pods -n openclaw
# Show node resource usage
kubectl top nodes
# Show Pod events (for troubleshooting)
kubectl get events -n openclaw --sort-by='.lastTimestamp'
# Look for OOM kill events
kubectl get events -n openclaw | grep -i oom
# Watch Pod status changes in real time
watch -n 2 kubectl get pods -n openclaw -o wide
◆
9. Troubleshooting Common Problems
9.1 Pod Fails to Start (ImagePullBackOff)
💻 Shell
# Show the exact error
kubectl describe pod <pod-name> -n openclaw
# Common causes:
# 1. Registry authentication failed: create an imagePullSecret
kubectl create secret docker-registry registry-secret \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n openclaw
# Reference it in the Deployment:
# spec.template.spec.imagePullSecrets:
#   - name: registry-secret
# 2. The image tag does not exist: check the registry
docker pull registry.example.com/games/openclaw:v1.0.0
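9.2 Pod Status OOMKilled (Exit Code 137)
💻 Shell
# Confirm that it is an OOM problem
kubectl describe pod <pod-name> -n openclaw | grep -A5 "Last State"
# Last State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137
# Check current memory usage per container
kubectl top pod <pod-name> -n openclaw --containers
# Fixes:
# 1. Temporarily raise the memory limit of the game container only
kubectl set resources deployment openclaw -c openclaw \
  --limits=memory=4Gi -n openclaw
# 2. Check for a memory leak (open a shell in the container)
kubectl exec -it <pod-name> -n openclaw -c openclaw -- sh
# Inside the container, read the cgroup counters; /proc/meminfo reports the
# node's memory, not the container's:
#   cgroup v2: cat /sys/fs/cgroup/memory.current /sys/fs/cgroup/memory.max
#   cgroup v1: cat /sys/fs/cgroup/memory/memory.usage_in_bytes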
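Exit code 137 is not specific to Kubernetes. By shell convention, a process terminated by a signal reports 128 plus the signal number, and the kernel's OOM killer sends SIGKILL (signal 9). A small sketch of the decoding; the helper name `exit_signal` is illustrative:

```shell
#!/bin/sh
# exit_signal EXIT_CODE
# Prints the signal number encoded in an exit code, or 0 when the process
# exited on its own (codes up to 128).
exit_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}

exit_signal 137    # OOMKilled containers: SIGKILL
exit_signal 143    # graceful termination that was honored: SIGTERM
```

A container that shows 137 without the `OOMKilled` reason was killed by SIGKILL for another cause, for example after `terminationGracePeriodSeconds` ran out.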
9.3 Probe Failures Causing Restart Loops
💻 Shell
# Find out why the probe failed
kubectl describe pod <pod-name> -n openclaw | grep -A10 "Liveness probe"
# Temporarily remove the liveness probe to verify (debugging only!)
kubectl patch deployment openclaw -n openclaw --type='json' \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'
# Test the health endpoint manually
kubectl exec -it <pod-name> -n openclaw -c openclaw -- curl http://localhost:8080/health
9.4 PVC Mount Failure (Pending State)
💻 Shell
# Check PVC status
kubectl get pvc -n openclaw
kubectl describe pvc openclaw-saves-pvc -n openclaw
# Common causes:
# 1. The StorageClass does not exist
kubectl get storageclass
# 2. The NFS server is unreachable
# Check the NFS exports from a node
ssh <node-ip>
showmount -e <nfs-server-ip>
# 3. Insufficient capacity
kubectl describe pv | grep -A5 "Capacity"
◆
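10. Production Best Practices
10.1 Security Hardening Checklist
💻 Shell
# 1. Enforce non-root execution (verify)
kubectl get pod <pod-name> -n openclaw \
  -o jsonpath='{.spec.securityContext}'
# 2. Network policy isolation (only the ingress controller may connect)
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: openclaw-netpol
  namespace: openclaw
spec:
  podSelector:
    matchLabels:
      app: openclaw
  policyTypes:
    - Ingress
    - Egress
  # Note: this also blocks Prometheus scrapes on port 9090; add a rule for
  # your monitoring namespace if metrics are collected
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to: []   # Allow all outbound traffic (restrict as needed)
EOF
# 3. Resource quota (keeps one namespace from exhausting the cluster).
#    With CPU/memory quotas active, every container, init containers included,
#    must declare requests and limits or the Pod is rejected.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: openclaw-quota
  namespace: openclaw
spec:
  hard:
    pods: "50"
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "48"        # 20 replicas plus 1 surge Pod at 2.1 cores each need 44.1
    limits.memory: "80Gi"
    persistentvolumeclaims: "10"
EOF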
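Whether a given `limits.cpu` quota leaves room for the HPA's maximum can be checked with simple arithmetic before the quota is applied. The sketch below assumes the per-Pod CPU limit is the 2000m of the game server plus the 100m of the log-collector sidecar, and counts one extra Pod for `maxSurge: 1` during a rolling update; the helper names `required_millicores` and `fits_quota` are illustrative:

```shell
#!/bin/sh
# required_millicores MAX_REPLICAS SURGE_PODS PER_POD_MILLICORES
# Prints the CPU limit total needed when the Deployment is at its largest.
required_millicores() {
  echo $(( ($1 + $2) * $3 ))
}

# fits_quota NEEDED_MILLICORES QUOTA_MILLICORES
# Exit status 0 when the quota covers the need.
fits_quota() {
  [ "$1" -le "$2" ]
}

need=$(required_millicores 20 1 2100)
for quota in 40000 48000; do
  if fits_quota "$need" "$quota"; then verdict="fits"; else verdict="does not fit"; fi
  echo "need ${need}m, quota ${quota}m: ${verdict}"
done
```

When the quota is too small, the symptom is not an error from the HPA itself: the ReplicaSet simply fails to create the extra Pods and records a quota event.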
10.2 Backup and Disaster Recovery
💻 Shell
# Back up the whole namespace with Velero
velero backup create openclaw-backup \
  --include-namespaces openclaw \
  --storage-location default
# Check the backup status
velero backup describe openclaw-backup
# Restore from the backup
velero restore create --from-backup openclaw-backup
# Scheduled backup (every day at 2:00 AM)
velero schedule create openclaw-daily \
  --schedule="0 2 * * *" \
  --include-namespaces openclaw
10.3 Performance Tuning Reference
◆
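11. Summary
This article covered the full path for OpenClaw, from compiling the source and building the image to orchestrating the complete set of K8s resources:
✅ Multi-stage Dockerfile: only the build output ships, keeping the image small
✅ Three probes: startup, liveness, and readiness work together to keep the service stable
✅ Complete K8s resources: Namespace, ConfigMap, Secret, PVC, Deployment, Service, and Ingress
✅ HPA autoscaling: scales on CPU and memory to absorb traffic swings
✅ Security hardening: non-root execution, NetworkPolicy, and ResourceQuota as three layers of protection
✅ Monitoring and alerting: Prometheus rules cover crashes, memory, and replica count
✅ Operations toolbox: rolling updates, one-command rollback, canary releases, and Velero backups
Cloud-native deployment is not just about putting an application into a container. The goal is for it to become self-healing, elastic, and observable under K8s orchestration.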
◆
Author: 高鹏举, Senior Operations Development Engineer at State Grid; author of 《云原生Kubernetes自动化运维实践》 (Tsinghua University Press)