Managing Four RTX 4090 GPUs with Kubernetes on Ubuntu 22.04

Author: @小白创作中心
Source: CSDN, https://m.blog.csdn.net/wu_tech/article/details/143182294

This article walks through managing four RTX 4090 GPUs with Kubernetes on Ubuntu 22.04. It covers GPU driver installation, Docker configuration, the NVIDIA Container Toolkit, building a Kubernetes cluster, and deploying the GPU device plugin, and is aimed at readers with some technical background.

1. Installing the GPU Driver

Use a driver from the 535 series, which ships with CUDA 12.2:

# wget https://cn.download.nvidia.com/XFree86/Linux-x86_64/535.113.01/NVIDIA-Linux-x86_64-535.113.01.run
# sh NVIDIA-Linux-x86_64-535.113.01.run

Reboot after the installation completes:

sudo reboot

Enable persistence mode (keeps the driver loaded between jobs):

nvidia-smi -pm 1

List the detected GPUs:

nvidia-smi -L
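
With four cards installed, `nvidia-smi -L` should print exactly one line per GPU. A minimal sketch of a sanity check; the sample output below is illustrative, not captured from a real machine:

```shell
#!/bin/sh
# count_gpus: count the devices in `nvidia-smi -L` style output.
count_gpus() {
    grep -c '^GPU '
}

# Illustrative sample of what `nvidia-smi -L` prints with four RTX 4090s.
sample='GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-aaaa)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-bbbb)
GPU 2: NVIDIA GeForce RTX 4090 (UUID: GPU-cccc)
GPU 3: NVIDIA GeForce RTX 4090 (UUID: GPU-dddd)'

# On a live system you would pipe the real command instead:
#   nvidia-smi -L | count_gpus
echo "$sample" | count_gpus
```

If the count is not 4, recheck the driver installation before moving on.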

2. Installing Docker

Configure the APT repository:

sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

Install the Docker packages:

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Restart the Docker service:

sudo systemctl restart docker

3. Installing the NVIDIA Container Toolkit

Configure the repository:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& \
sudo apt-get update

Install the NVIDIA Container Toolkit:

sudo apt-get install -y nvidia-container-toolkit

Test the installation:

docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
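After the toolkit is installed, `docker info` should also list the nvidia runtime. A small sketch that checks for it; the `docker info` fragment below is illustrative, not real output from this host:

```shell
#!/bin/sh
# has_nvidia_runtime: succeed if a `docker info` dump lists the nvidia runtime.
has_nvidia_runtime() {
    grep -q '^ *Runtimes:.*nvidia' && echo yes || echo no
}

# Illustrative fragment of `docker info` after the toolkit is installed.
info=' Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc'

# Live check: docker info 2>/dev/null | has_nvidia_runtime
echo "$info" | has_nvidia_runtime
```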

4. Installing Kubernetes and kubeadm

Since Docker is already installed on this server, containerd is not installed separately.

Base environment setup

  1. Set the hostname:
hostnamectl set-hostname ubuntu
  2. Disable SELinux (note: Ubuntu uses AppArmor rather than SELinux, so these commands apply to RHEL-family systems and can be skipped on Ubuntu):
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
  3. Disable the swap partition (kubelet refuses to start while swap is active):
swapoff -a
sed -ri 's/.*swap.*/#&/' /etc/fstab
  4. Let bridged traffic pass through iptables and ip6tables:
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
  5. Apply the settings:
sysctl --system
  6. Install the Kubernetes components:

Configure Docker:

cat /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "data-root": "/data2/dockerdata",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
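
A syntax error in /etc/docker/daemon.json prevents the daemon from starting at all, so it is worth validating the file before restarting Docker. A minimal sketch, checking an inline copy of the config above; on a real host you would point `json.tool` at /etc/docker/daemon.json directly:

```shell
#!/bin/sh
# Write a copy of the daemon.json shown above and validate its JSON syntax.
cat > /tmp/daemon.json <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "data-root": "/data2/dockerdata",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# Live check: python3 -m json.tool /etc/docker/daemon.json > /dev/null
if python3 -m json.tool /tmp/daemon.json > /dev/null 2>&1; then
    echo valid
else
    echo invalid
fi
```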

Add the Kubernetes APT repository:

curl https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-add-repository "deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main"

Install the Kubernetes components:

sudo apt update
sudo apt install -y kubelet=1.23.8-00 kubeadm=1.23.8-00 kubectl=1.23.8-00
sudo apt-mark hold kubelet kubeadm kubectl

Enable kubelet to start on boot:

sudo systemctl enable --now kubelet

Map the control-plane hostname:

echo "172.16.1.220 cluster-endpoint" | sudo tee -a /etc/hosts

Initialize the Kubernetes cluster:

sudo kubeadm init \
--apiserver-advertise-address=172.16.1.220 \
--control-plane-endpoint=cluster-endpoint \
--image-repository registry.aliyuncs.com/google_containers \
--kubernetes-version v1.23.8 \
--service-cidr=10.96.0.0/16 \
--pod-network-cidr=10.244.0.0/16

When Kubernetes runs on Docker, Docker must use systemd as its cgroup driver:

vim /etc/docker/daemon.json
{
"exec-opts":["native.cgroupdriver=systemd"]
}
systemctl daemon-reload
systemctl restart docker

If a previous initialization attempt failed, reset it first:

kubeadm reset
rm -rf /etc/kubernetes/manifests/kube-apiserver.yaml
rm -rf /etc/kubernetes/manifests/kube-controller-manager.yaml
rm -rf /etc/kubernetes/manifests/kube-scheduler.yaml
rm -rf /etc/kubernetes/manifests/etcd.yaml
rm -rf /var/lib/etcd/*

Check the kubelet status:

sudo systemctl status kubelet
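
A healthy kubelet reports "active (running)" in its status output. A small sketch that extracts that state; the `systemctl status` fragment below is illustrative, not real output:

```shell
#!/bin/sh
# kubelet_active: report whether a `systemctl status` dump shows the unit running.
kubelet_active() {
    grep -q 'Active: active (running)' && echo active || echo inactive
}

# Illustrative fragment of `systemctl status kubelet` on a healthy node.
status='kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled)
     Active: active (running) since Mon 2024-04-01 10:00:00 CST'

# Live check: systemctl status kubelet | kubelet_active
echo "$status" | kubelet_active
```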

List the pulled control-plane images:

sudo docker images | grep google

Generate a join command for additional nodes:

kubeadm token create --print-join-command

Configure kubectl:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Install the network add-on (Calico):

curl https://docs.projectcalico.org/v3.20/manifests/calico.yaml -O
vi calico.yaml   # typically to make CALICO_IPV4POOL_CIDR match the pod CIDR (10.244.0.0/16)
kubectl apply -f calico.yaml

Check the node status:

kubectl get nodes
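
Every node should report STATUS=Ready once the network add-on is up. A minimal sketch that parses `kubectl get nodes` style output; the sample below is illustrative for a single-node cluster:

```shell
#!/bin/sh
# all_ready: succeed if every node line (header skipped) reports STATUS=Ready.
all_ready() {
    awk 'NR > 1 && $2 != "Ready" { bad = 1 } END { print (bad ? "not-ready" : "ready") }'
}

# Illustrative output shortly after `kubectl apply -f calico.yaml`.
nodes='NAME     STATUS   ROLES                  AGE   VERSION
ubuntu   Ready    control-plane,master   10m   v1.23.8'

# Live check: kubectl get nodes | all_ready
echo "$nodes" | all_ready
```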

Remove the control-plane taint so Pods can be scheduled on this node:

kubectl taint nodes --all node-role.kubernetes.io/master-

Check the Pod status:

kubectl get pods --all-namespaces

5. Installing the Device Plugin

The preferred way to deploy the device plugin is as a DaemonSet via Helm. See the Helm documentation for installation instructions.

Download and install the Helm binary:

tar -zxvf helm-v3.10.2-linux-amd64.tar.gz
mv linux-amd64/helm /usr/local/bin/helm

Add the Helm repository:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

Check the available plugin versions:

helm search repo nvdp --devel

Deploy the device plugin:

helm install --generate-name nvdp/nvidia-device-plugin --namespace nvidia-device-plugin \
--create-namespace
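
Once the plugin registers, the node advertises `nvidia.com/gpu` in its allocatable resources, which should be 4 on this host. A sketch that extracts the count from `kubectl describe node` style output; the fragment below is illustrative:

```shell
#!/bin/sh
# gpu_allocatable: pull the nvidia.com/gpu count out of `kubectl describe node`
# style output.
gpu_allocatable() {
    awk '/nvidia.com\/gpu:/ { print $2; exit }'
}

# Illustrative fragment of the Allocatable section after the plugin registers.
desc='Allocatable:
  cpu:                64
  memory:             515635200Ki
  nvidia.com/gpu:     4'

# Live check: kubectl describe node ubuntu | gpu_allocatable
echo "$desc" | gpu_allocatable
```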

Optionally, download the Helm chart package for local installation:

helm pull nvdp/nvidia-device-plugin

If the installation fails, check the Pod logs and adjust the configuration:

kubectl logs nvidia-device-plugin-1712138777-wxdc8 -n nvidia-device-plugin

Edit daemon.json to make nvidia the default runtime:

more /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "data-root": "/data2/dockerdata",
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
sudo systemctl restart docker

6. Installing GPU Feature Discovery

Add the Helm repository:

helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update
helm search repo nvgfd --devel

Deploy the GPU feature discovery component:

helm install --generate-name nvgfd/gpu-feature-discovery --namespace gpu-feature-discovery \
--create-namespace

If the image pull fails, pull the images manually and reinstall:

helm uninstall gpu-feature-discovery-1712148385 -n gpu-feature-discovery
docker pull yansenchangyu/node-feature-discovery:v0.13.1
docker pull nvcr.io/nvidia/gpu-feature-discovery:v0.8.2
helm install gpu-feature-discovery . --create-namespace --namespace gpu-feature-discovery

7. Testing the Cluster and GPU Integration

Create a GPU Pod:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-rtx4090
EOF

Check the Pod logs:

kubectl logs gpu-pod

If the output shows "Test PASSED", the container successfully ran its computation on the GPU.
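
This final check can be scripted. A minimal sketch that greps the Pod log for the CUDA sample's success marker; the log fragment below is illustrative:

```shell
#!/bin/sh
# passed: report whether a pod log contains the CUDA sample's success marker.
passed() {
    grep -q 'Test PASSED' && echo PASSED || echo FAILED
}

# Illustrative tail of the vectoradd sample's log on success.
log='[Vector addition of 50000 elements]
Test PASSED
Done'

# Live check: kubectl logs gpu-pod | passed
echo "$log" | passed
```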

A full video walkthrough is available on Bilibili: 老吴聊技术
