Deploying the NVIDIA Device Plugin on Kubernetes
1. Prerequisites
Before deploying the plugin, make sure every GPU node meets the following requirements:

• NVIDIA driver version ~= 384.81
• nvidia-container-toolkit >= 1.7.0
• Kubernetes version >= 1.10
• nvidia-container-runtime configured as the default container runtime
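Before proceeding, it helps to confirm the driver version on each node. A minimal sketch of such a check, assuming GNU `sort -V` is available (the `535.104.05` value is hard-coded for illustration; on a real node you would read it from `nvidia-smi`):

```shell
# In practice, obtain the installed version with:
#   driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
min_required="384.81"
driver_version="535.104.05"   # placeholder for illustration

# sort -V orders version strings numerically; the newest version sorts last
newest=$(printf '%s\n%s\n' "$min_required" "$driver_version" | sort -V | tail -n1)
if [ "$newest" = "$driver_version" ]; then
  echo "driver OK (>= $min_required)"
else
  echo "driver $driver_version is older than required $min_required" >&2
fi
```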
Version compatibility summary (covering 1.24–1.29 and 1.3x)
2. Installing the NVIDIA Container Toolkit
2.1 Ubuntu / Debian
```bash
# Add the NVIDIA GPG key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add the repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list > /dev/null
# Update the package index
apt-get update
# Install nvidia-container-toolkit
apt-get install -y nvidia-container-toolkit
# Configure nvidia-container-runtime for containerd
nvidia-ctk runtime configure --runtime=containerd
# Restart the containerd service
systemctl restart containerd
```
2.2 openEuler / Rocky / CentOS
```bash
# Add the repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# Optionally enable the experimental repo
dnf config-manager --enable nvidia-container-toolkit-experimental
# Pin and install a specific toolkit version
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.2-1
dnf install -y \
  nvidia-container-toolkit-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  nvidia-container-toolkit-base-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  libnvidia-container-tools-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  libnvidia-container1-${NVIDIA_CONTAINER_TOOLKIT_VERSION}
# Configure nvidia-container-runtime for containerd
nvidia-ctk runtime configure --runtime=containerd
# Restart the containerd service
systemctl restart containerd
```
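On either distro family, `nvidia-ctk runtime configure --runtime=containerd` registers an `nvidia` runtime in the containerd config. A sketch of the fragment it typically writes to `/etc/containerd/config.toml` (the exact keys and layout vary with containerd and toolkit versions, so treat this as illustrative, not authoritative):

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

If the restart of containerd succeeds but GPU pods still fail, this file is the first place to check.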
3. Deploying the NVIDIA Device Plugin
3.1 Deploy the plugin
```bash
# Quick online deployment of the NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.4/deployments/static/nvidia-device-plugin.yml
```

For offline installation, the manifest is as follows:

```yaml
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.17.4
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: kubelet-device-plugins-dir
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: kubelet-device-plugins-dir
        hostPath:
          path: /var/lib/kubelet/device-plugins
          type: Directory
```
```bash
# Verify deployment status
kubectl get daemonset -n kube-system
# Check pod status
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# View node GPU resources; the node description should now include:
#   Capacity:
#     nvidia.com/gpu: 1
```
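To check GPU capacity across every node at once, the same field can be pulled from the API output. A sketch using `jq` on a canned single-node response (the node name `gpu-node-1` is illustrative; against a live cluster you would pipe `kubectl get nodes -o json` into the same filter):

```shell
# Canned API response standing in for: kubectl get nodes -o json
cat <<'EOF' > /tmp/nodes.json
{"items":[{"metadata":{"name":"gpu-node-1"},"status":{"capacity":{"nvidia.com/gpu":"1"}}}]}
EOF
# Print "<node> <gpu-count>" for each node
jq -r '.items[] | "\(.metadata.name) \(.status.capacity["nvidia.com/gpu"])"' /tmp/nodes.json
```

A node that is missing from this output (or shows `null`) has not registered its GPUs with the kubelet, which usually points back at the device-plugin pod on that node.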
3.2 Test
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: gpu-test
    image: nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

```bash
kubectl apply -f gpu-test.yaml
# The container command is nvidia-smi; check the pod logs —
# the expected output is the GPU information panel
```
3.3 Production: Helm deployment (recommended)
```bash
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# Alternative repo, for the device plugin chart only:
# helm repo add nvidia https://nvidia.github.io/k8s-device-plugin
# helm repo update

# Deploy via the GPU Operator (drivers already installed on the nodes)
helm install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --set driver.enabled=false
# Device-plugin-only alternative:
# helm upgrade -i nvidia-device-plugin nvidia/k8s-device-plugin \
#   --namespace kube-system \
#   --create-namespace

# Install with custom values
helm install -f gpu-operator-values.yaml -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --set driver.enabled=false
```
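The `gpu-operator-values.yaml` file referenced above is not shown in the source. A minimal hypothetical sketch of what it might contain: `driver.enabled` mirrors the `--set` flag already used above, and `toolkit.enabled: false` is an assumption for hosts that installed nvidia-container-toolkit manually in section 2 — verify both keys against the chart's own values before use:

```yaml
# Hypothetical gpu-operator-values.yaml — adjust to your cluster
driver:
  enabled: false    # GPU drivers are already installed on the hosts
toolkit:
  enabled: false    # nvidia-container-toolkit was installed manually (section 2)
```

Keeping these flags in a values file rather than on the command line makes upgrades (`helm upgrade -f gpu-operator-values.yaml ...`) reproducible.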
夜雨聆风
