
K8s NVIDIA Plugin Deployment

I. Prerequisites

Before deploying the plugin, make sure that every GPU node meets the following requirements:

  • NVIDIA driver version ~= 384.81
  • nvidia-container-toolkit >= 1.7.0
  • Kubernetes version >= 1.10
  • nvidia-container-runtime configured as the default container runtime
Compatibility summary across version ranges (covering 1.24-1.29 and 1.3x):

| K8s version range | nvidia-device-plugin version | NVIDIA Container Toolkit version |
|-------------------|------------------------------|----------------------------------|
| 1.24-1.26         | v0.14.x                      | 1.11.x                           |
| 1.27-1.29         | v0.15.x~v0.17.x              | 1.12.x~1.14.x                    |
| 1.30+             | v0.18.x~v0.19.x              | 1.15.x~1.17.x                    |
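These prerequisites can be spot-checked directly on each GPU node before moving on. A minimal sketch, assuming only the standard nvidia-smi and nvidia-ctk binaries that the driver and toolkit packages install:

# Driver version (should satisfy ~= 384.81)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# NVIDIA Container Toolkit version (should be >= 1.7.0)
nvidia-ctk --version
# Kubernetes server version, from any machine with kubectl access (should be >= 1.10)
kubectl version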

II. Install the NVIDIA Container Toolkit

1. Ubuntu / Debian
# Add the NVIDIA GPG key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add the repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list > /dev/null
# Refresh the package index
apt-get update
# Install nvidia-container-toolkit
apt-get install -y nvidia-container-toolkit
# Configure the nvidia runtime for containerd
nvidia-ctk runtime configure --runtime=containerd
# Restart containerd
systemctl restart containerd
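Note that nvidia-ctk runtime configure --runtime=containerd registers the nvidia runtime in /etc/containerd/config.toml but does not by itself make it the default handler, which the prerequisites above call for. A hedged sketch of that extra step, applicable to both this section and the RPM-based install below (the --set-as-default flag is provided by nvidia-ctk; confirm it against your toolkit version):

# Register the nvidia runtime and make it containerd's default
nvidia-ctk runtime configure --runtime=containerd --set-as-default
systemctl restart containerd
# Quick check that the runtime entry landed in the containerd config
grep -A3 'nvidia' /etc/containerd/config.toml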
2. openEuler / Rocky Linux / CentOS
# Add the repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# (Optional) enable the experimental repository, only needed for experimental/RC packages
dnf config-manager --enable nvidia-container-toolkit-experimental
# Install a pinned toolkit version
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.2-1
dnf install -y \
      nvidia-container-toolkit-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      nvidia-container-toolkit-base-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container-tools-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container1-${NVIDIA_CONTAINER_TOOLKIT_VERSION}
# Configure the nvidia runtime for containerd
nvidia-ctk runtime configure --runtime=containerd
# Restart containerd
systemctl restart containerd
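Before involving Kubernetes at all, it is worth confirming that the toolkit itself can see the GPUs. A small sanity check, assuming the libnvidia-container-tools package installed above provides the nvidia-container-cli binary:

# Print driver/GPU information as seen by the container runtime hook
nvidia-container-cli info
# List the device nodes and libraries that will be injected into containers
nvidia-container-cli list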

III. Deploy the NVIDIA Device Plugin

1. Deploy the plugin
# Quick online deployment of the NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.4/deployments/static/nvidia-device-plugin.yml

# The offline manifest is shown below.
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.17.4
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: kubelet-device-plugins-dir
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: kubelet-device-plugins-dir
        hostPath:
          path: /var/lib/kubelet/device-plugins
          type: Directory
# Verify the deployment
kubectl get daemonset -n kube-system
# Check the plugin Pods
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Check a node's GPU resources
kubectl describe node <node-name>
# Expected output includes:
# Capacity:
#   nvidia.com/gpu:     1
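To see the advertised GPU capacity across the whole cluster rather than one node at a time, a simple sketch using only standard kubectl output:

# Node names together with their nvidia.com/gpu capacity/allocatable lines
kubectl describe nodes | grep -E 'Name:|nvidia.com/gpu'
# Or dump a single node's allocatable resources
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'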
2. Test
# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never   # nvidia-smi exits after printing, so do not restart the container
  containers:
  - name: gpu-test
    image: nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

kubectl apply -f gpu-test.yaml
# The container command is nvidia-smi.
# Check the Pod logs; the expected output is the nvidia-smi GPU information panel.
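Once the Pod has completed, reading its logs and cleaning it up looks like this:

# Expected output: the nvidia-smi table listing the node's GPU(s)
kubectl logs gpu-test
# Remove the test Pod when finished
kubectl delete pod gpu-test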
3. Production: Helm-based deployment recommended
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# helm repo add nvidia https://nvidia.github.io/k8s-device-plugin
# helm repo update

# Deploy via GPU Operator (driver installation disabled, since the driver is already on the nodes)
helm install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --set driver.enabled=false
# Alternatively, deploy only the device plugin:
# helm upgrade -i nvidia-device-plugin nvidia/k8s-device-plugin \
#   --namespace kube-system \
#   --create-namespace

# Install with custom values
helm install -f gpu-operator-values.yaml -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --set driver.enabled=false
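The gpu-operator-values.yaml referenced above is not shown in the original post. As an illustration only, a minimal file might simply disable the components that were already handled manually (driver on the host, container toolkit from section II); the driver.enabled and toolkit.enabled keys are assumptions to be checked against helm show values nvidia/gpu-operator for your chart version:

# Write a minimal, illustrative gpu-operator-values.yaml
cat <<'EOF' > gpu-operator-values.yaml
driver:
  enabled: false    # driver is already installed on the GPU nodes
toolkit:
  enabled: false    # nvidia-container-toolkit was installed manually in section II
EOF

It is then passed with the -f flag exactly as in the last helm install command above.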