CentOS 7: installing the NVIDIA driver, CUDA, offline Docker, offline Docker GPU support, building your own CUDA image, and setting up SSH in a container

Original content takes effort to write. Thank you!


Whether you are working on a laptop or a desktop, the preparation steps below are critical.

References

Official NVIDIA documentation (CUDA installation guide, Container Toolkit install guide, driver search)
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
https://www.nvidia.cn/Download/Find.aspx?lang=cn
Official cuDNN documentation
https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html

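Supplementary note: press F2 at boot to enter the BIOS/UEFI setup

Disable Secure Boot (set Secure Boot = Disabled); otherwise the unsigned NVIDIA kernel module may fail to load.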

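Supplementary note: granting a CentOS user sudo privileges

1. Check the file permissions: ls -al /etc/sudoers
2. Switch to the root user: su root
3. Make the file writable: chmod u+w /etc/sudoers
4. Open the file to add the sudo entry:
vim /etc/sudoers
5. Add a line for the new user xxx:
xxx  ALL=(ALL) ALL
6. Restore the read-only permission and check it again: chmod u-w /etc/sudoers; ls -al /etc/sudoers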
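
An alternative to editing /etc/sudoers by hand is a drop-in file under /etc/sudoers.d that visudo checks before it is installed. The following is only a sketch: the function names sudoers_entry and grant_sudo are made up for illustration, and it assumes the stock CentOS 7 sudoers file, which includes /etc/sudoers.d.

```shell
#!/bin/sh
# Sketch only: sudoers_entry and grant_sudo are illustrative names.

# Print the sudoers entry for one user (same rule as in step 5).
sudoers_entry() {
    printf '%s ALL=(ALL) ALL\n' "$1"
}

# Write the entry to a temporary file, let visudo check its syntax,
# and only then install it under /etc/sudoers.d. Must run as root.
grant_sudo() {
    user=$1
    tmp=$(mktemp) || return 1
    sudoers_entry "$user" > "$tmp"
    status=1
    if visudo -cf "$tmp"; then
        install -m 0440 "$tmp" "/etc/sudoers.d/$user"
        status=$?
    fi
    rm -f "$tmp"
    return "$status"
}
```

Usage, as root: grant_sudo xxx. Because the main file is never opened for writing, a typo cannot lock every user out of sudo.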
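Supplementary note: installing gcc on CentOS 7

Install gcc and g++:
1. yum install gcc -y
2. yum install gcc-c++ -y

Supplementary note: mounting NTFS drives on CentOS 7

The stock yum repositories do not provide ntfs-3g, so a third-party repository is needed; here I use Alibaba's EPEL mirror.
Switch to the system yum repository directory, download the Alibaba EPEL repo file, then install ntfs-3g:
[root@localhost cdrom]# cd /etc/yum.repos.d/
[root@localhost yum.repos.d]# wget http://mirrors.aliyun.com/repo/epel-7.repo
[root@localhost yum.repos.d]# yum -y install ntfs-3g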

Supplementary note: Chinese input method on CentOS 7

[Screenshot not preserved]

Supplementary note: installing a package with rpm

sudo rpm -ivh xxx.rpm

Supplementary note: checking whether the current operating system is Ubuntu or CentOS

Run: lsb_release -a
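
On minimal installations lsb_release may not be present. A fallback, sketched below, reads the ID field from /etc/os-release, which both CentOS 7 and Ubuntu ship. The function name os_id is hypothetical.

```shell
#!/bin/sh
# Sketch only: os_id is an illustrative name.

# Print the ID field (for example "centos" or "ubuntu") from an
# os-release file; defaults to /etc/os-release.
os_id() {
    file=${1:-/etc/os-release}
    # Source the file in a subshell so its variables do not leak.
    ( . "$file" && printf '%s\n' "$ID" )
}
```

Usage: os_id prints the distribution ID of the current system.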

Supplementary note: common Docker operations

a. Stop the running containers
docker stop $(docker ps -a -q)
b. Remove all containers

docker rm $(docker ps -a -q)
c. Remove all images

docker rmi $(docker images -q)
d. List the installed Docker packages

yum list installed | grep docker
e. Uninstall them

yum -y remove  docker.x86_64   docker-client.x86_64  docker-common.x86_64 nvidia-docker.x86_

f. Commit a container to an image:  docker commit a8ab2d989dde pydocker:aaa
g. Save an image to a .tar file:    docker save pydocker:aaa >  pydocker.tar
h. Run a container with port mappings
docker run -itd --gpus all  -p 10000:22 -p 10001:10001 -p 10002:10002 -h pytorch  --name pytorch 157b43e24a92 sh /root/start.sh
i. Start a container with an interactive bash shell
docker run -it --gpus all nvidia/cuda:11.4.0-devel-centos7 /bin/bash
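
The stop, rm, and rmi commands above print a usage error when the inner docker ps or docker images command returns nothing, because they then receive no arguments. A small guard, sketched here with the made-up helper name with_ids, runs the command only when there is at least one ID.

```shell
#!/bin/sh
# Sketch only: with_ids is an illustrative helper name.

# Read IDs from standard input and run the given command with them
# appended as arguments; do nothing when no IDs arrive.
with_ids() {
    ids=$(cat)
    [ -n "$ids" ] || return 0
    # Unquoted on purpose: one argument per ID.
    "$@" $ids
}
```

Usage: docker ps -q | with_ids docker stop, or docker images -q | with_ids docker rmi.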

Supplementary note: merging split tar parts into a single archive

cat tar.gz.part-a* > 1.tar.gz
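
After merging it is worth checking that the archive is intact before loading it into Docker. The sketch below wraps the cat command and adds a gzip integrity test; merge_parts is a hypothetical name, and the parts are assumed to sort correctly in shell glob order (as the suffixes produced by split do).

```shell
#!/bin/sh
# Sketch only: merge_parts is an illustrative name.

# Concatenate all files that start with PREFIX (in glob order) into OUT,
# then verify the gzip stream. Returns non-zero if either step fails.
merge_parts() {
    prefix=$1
    out=$2
    cat "$prefix"* > "$out" && gzip -t "$out"
}
```

Usage: merge_parts tar.gz.part-a 1.tar.gz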

Supplementary note: transferring files between CentOS 7 and Windows 10

 scp -P <port> /home/xxx/xxx.sh   root@<remote-host>:/root/<remote-directory>

##################################################################

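Installing the GPU driver (version 450), CUDA 11.4, and the matching cuDNN

CUDA download address: see the wget URL in step 10.

1. Check the graphics card on Linux (note: if the lspci command is missing, install it with yum install pciutils):

lspci | grep -i vga

2. If you have an NVIDIA GPU, you can also run:

lspci | grep -i nvidia

3. Check the installed GPU driver version and the available kernel sources:

cat /proc/driver/nvidia/version
ll /usr/src/kernels/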

4. Install the build dependencies:

sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

5. Check that the kernel version and the kernel source (kernel-devel) version match:

ls /boot | grep vmlinu
rpm -aq | grep kernel-devel
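
The comparison in step 5 can be automated. This sketch, with the hypothetical function name kernel_devel_matches, checks whether a kernel-devel package exists for exactly the kernel release that is running.

```shell
#!/bin/sh
# Sketch only: kernel_devel_matches is an illustrative name.

# $1: kernel release, as printed by uname -r
# $2: installed kernel-devel packages, one per line (rpm -qa kernel-devel)
# Returns 0 when a kernel-devel package for that exact release is listed.
kernel_devel_matches() {
    printf '%s\n' "$2" | grep -qxF "kernel-devel-$1"
}
```

Usage: kernel_devel_matches "$(uname -r)" "$(rpm -qa kernel-devel)" || echo "kernel-devel does not match the running kernel"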

6. Disable the built-in nouveau driver
Check whether it is loaded:

lsmod | grep nouveau

Edit the dist-blacklist.conf file:

vim /lib/modprobe.d/dist-blacklist.conf

Comment out the nvidiafb line:

#blacklist nvidiafb

Then add the following lines:

blacklist nouveau
options nouveau modeset=0
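
Before rebuilding the initramfs, a quick check that both lines are really active (not commented out) can save a reboot cycle. A sketch, with nouveau_blacklisted as a made-up function name:

```shell
#!/bin/sh
# Sketch only: nouveau_blacklisted is an illustrative name.

# Return 0 when the given modprobe config file contains both
# uncommented lines required to disable nouveau.
nouveau_blacklisted() {
    grep -Eq '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "$1" &&
    grep -Eq '^[[:space:]]*options[[:space:]]+nouveau[[:space:]]+modeset=0[[:space:]]*$' "$1"
}
```

Usage: nouveau_blacklisted /lib/modprobe.d/dist-blacklist.conf && echo "nouveau is blacklisted"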

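7. Rebuild the initramfs image and switch the default target to text mode; reboot afterwards so that nouveau is no longer loaded

mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
systemctl set-default multi-user.target

8. Install the GPU driver (the --kernel-source-path must match the output of uname -r on your machine)

./NVIDIA-Linux-x86_64-450.66.run --kernel-source-path=/usr/src/kernels/3.10.0-1127.19.1.el7.x86_64  --no-opengl-files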

9. Configure the environment variables

vi /etc/profile

Add the following lines to the file:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Apply the environment variables immediately:
source /etc/profile
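
When LD_LIBRARY_PATH is unset, the export above leaves a trailing colon, which the dynamic linker treats as the current directory; sourcing the profile repeatedly also duplicates the entries. One way around both, as a sketch with the hypothetical helper prepend_path:

```shell
#!/bin/sh
# Sketch only: prepend_path is an illustrative helper name.

# Print LIST with DIR prepended, unless DIR is already an entry.
# An empty LIST yields just DIR, so no empty entry is created.
prepend_path() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        "::")       printf '%s\n' "$dir" ;;
        *)          printf '%s\n' "$dir:$list" ;;
    esac
}

PATH=$(prepend_path /usr/local/cuda/bin "$PATH")
LD_LIBRARY_PATH=$(prepend_path /usr/local/cuda/lib64 "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

These lines could replace the two export lines in /etc/profile; the paths are the same ones used in step 9.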

10. Install CUDA

wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda-repo-rhel7-11-4-local-11.4.0_470.42.01-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-11-4-local-11.4.0_470.42.01-1.x86_64.rpm
sudo yum clean all
sudo yum -y install nvidia-driver-latest-dkms cuda
sudo yum -y install cuda-drivers

11. Verify CUDA

cd /usr/local/cuda/samples/1_Utilities/deviceQuery 
sudo make
./deviceQuery
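
The deviceQuery sample ends its output with a "Result = PASS" line when it can reach the GPU. For scripted checks, a small filter such as the one sketched here (device_query_passed is a made-up name) turns that into an exit status.

```shell
#!/bin/sh
# Sketch only: device_query_passed is an illustrative name.

# Read deviceQuery output on standard input; return 0 only if it
# reports a passing result.
device_query_passed() {
    grep -q 'Result = PASS'
}
```

Usage: ./deviceQuery | device_query_passed && echo "CUDA device OK"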

12. Install cuDNN

Older versions:
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
Newer versions (cuDNN 8 and later keep the version macros in cudnn_version.h, so copy that header as well):
sudo cp cuda/include/cudnn_version.h /usr/local/cuda/include/
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

13. Test (older versions, where the version macros are in cudnn.h)

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
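
The grep output above still has to be read by eye. As a sketch, the following function (cudnn_version is a hypothetical name) extracts the three version macros from whichever header holds them and prints them as MAJOR.MINOR.PATCHLEVEL.

```shell
#!/bin/sh
# Sketch only: cudnn_version is an illustrative name.

# Print MAJOR.MINOR.PATCHLEVEL from a cuDNN header file
# (cudnn_version.h for newer releases, cudnn.h for older ones).
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$1"
}
```

Usage: cudnn_version /usr/local/cuda/include/cudnn_version.h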

################################################################################

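Docker GPU installation guide

Offline Docker installation

1. Download the Docker package:
https://download.docker.com/linux/static/stable/x86_64/
2. Extract it:
tar -xzvf docker-<version>.tgz

3. Move the binaries from the extracted docker directory into /usr/bin

mv  /usr/local/resource/docker/docker/*  /usr/bin/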

4. Configure the Docker service

vim /etc/systemd/system/docker.service

[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target
   
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
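# -H tcp://0.0.0.0:2375 exposes the Docker API without TLS; drop it unless remote access is needed
ExecStart=/usr/bin/dockerd -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --add-runtime=nvidia=/usr/bin/nvidia-container-runtime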
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
# restart the docker process if it exits prematurely
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
   
[Install]
WantedBy=multi-user.target

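5. Set the permissions of the unit file (systemd unit files should not be executable)

chmod 644 /etc/systemd/system/docker.service

Install all of the downloaded rpm packages (this covers both the NVIDIA docker-gpu packages and the Docker ones):

rpm -Uvh *.rpm --nodeps --force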

6. Start Docker and add the docker user group

systemctl daemon-reload   # reload the xxx.service files under systemd
systemctl start docker       # start Docker
systemctl enable docker.service   # start Docker at boot
systemctl status docker   # check the Docker status
docker -v # check the Docker version
sudo groupadd docker     # add the docker user group
sudo gpasswd -a $USER docker     # add the logged-in user to the docker group
newgrp docker     # refresh the group membership
docker ps    # test whether docker now works without sudo
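
If docker ps still reports a permission error, the current shell may not have picked up the new group yet. A quick check, sketched with the made-up function name has_group, looks at the groups of the running shell:

```shell
#!/bin/sh
# Sketch only: has_group is an illustrative name.

# Return 0 when the current shell session is a member of GROUP.
has_group() {
    id -nG | tr ' ' '\n' | grep -qxF "$1"
}
```

Usage: has_group docker || echo "log out and back in, or run: newgrp docker"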

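Installing docker-gpu

1. Download the NVIDIA Docker packages for offline use (on a machine with internet access)

yum install --downloadonly nvidia-docker2 --downloaddir=/tmp/nvidia

2. Install them with rpm, from the directory that holds the downloaded packages

rpm -Uvh *.rpm --nodeps --force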

3. Check that Docker supports the --gpus option

docker run --help | grep -i gpus

4. Load the tar archive as an image

docker image load -i 1.tar.gz

5. Create a container from the image

docker run -itd --gpus all  -p 10000:22 -p 10001:10001 -p 10002:10002 -h pytorch  --name pytorch 157b43e24a92 sh /root/start.sh

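Building your own docker-gpu image

1. docker-gpu image download address
https://hub.docker.com/r/ufoym/deepo
Run the image you have pulled with the commands below to create containers with different configurations, as needed.
For example, the --gpus option selects which GPUs the container can use (all GPUs, a given number of GPUs, GPUs with specific device IDs, and so on).
Option reference:
-i interactive mode
-t allocate a terminal
--gpus this option is important: only a container created with it can see the host's GPU environment.
all means that the container can see every GPU device on the host.
--name assigns a name to the new container; the name used here is container0.
nvidia/cuda:11.3-base the image pulled from the official repository.
For example: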

# Test nvidia-smi with the latest official CUDA image
docker run --gpus all nvidia/cuda:11.3-base nvidia-smi
# Start a GPU enabled container on two GPUs
docker run --gpus 2 nvidia/cuda:11.3-base nvidia-smi
# Start a GPU enabled container on specific GPUs (device indices 0,1 are illustrative)
docker run --gpus '"device=0,1"' nvidia/cuda:11.3-base nvidia-smi
# Specify a capability (graphics, compute, ...) for the container
# Note this is rarely if ever used this way
docker run --gpus all,capabilities=utility nvidia/cuda:10.0-base nvidia-smi

The /bin/bash after the image name is the command to run.
Docker lets you pass a command when creating a container; this one gives us an interactive shell.
For example:

docker run -it --gpus all --name container0 nvidia/cuda:10.0-base  /bin/bash
docker exec -it pytorch bash

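SSH (note: most docker-gpu CUDA 11.3 images are based on Ubuntu, so the commands below use apt)

1. Check whether the SSH service is installed

dpkg -l | grep ssh

2. If there is no output, install openssh-server first

sudo apt-get update
sudo apt-get install openssh-server -y

3. After a successful installation, running dpkg -l | grep ssh again should list the openssh-server package.
4. Also confirm that the following paths exist

/var/run/sshd
/usr/sbin/sshd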

5. Edit the SSH configuration file

vim /etc/ssh/sshd_config

6. Uncomment the following two lines and make sure both are set to yes

PasswordAuthentication yes
PermitRootLogin yes
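7. Start the SSH service

service ssh start  # to stop the service: service ssh stop

8. Check the SSH service status

service ssh status

9. Create a start.sh script in the container's /root directory with the following content:

/usr/sbin/sshd
/bin/bash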

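10. From now on, when you create a container, the command after the image name can be sh /root/start.sh, which runs the script as the container starts. Here it starts the container's SSH service and opens an interactive shell.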

docker run -itd --gpus all  -p 10000:22 -p 10001:10001 -p 10002:10002 -h pytorch  --name pytorch 157b43e24a92 sh /root/start.sh
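
A slightly more defensive start.sh could create the directory that sshd needs and hand the container over to bash with exec. This is a sketch, not the author's script: write_start_script is a made-up helper that writes such a file to the path you give it.

```shell
#!/bin/sh
# Sketch only: write_start_script is an illustrative helper name.

# Write a start script to the given path.
write_start_script() {
    cat > "$1" <<'EOF'
#!/bin/sh
# sshd will not start without its privilege-separation directory.
mkdir -p /var/run/sshd
# Start the SSH daemon; it moves into the background by itself.
/usr/sbin/sshd
# Keep the container alive with bash as its main process.
exec /bin/bash
EOF
}
```

Usage, inside the container: write_start_script /root/start.sh. With the port mapping -p 10000:22 from the command above, the container should then be reachable with ssh -p 10000 root@<host-ip>, where <host-ip> is a placeholder for the address of the Docker host.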

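11. To connect to the container as root, first set a password for root. (A freshly installed root user has no password, and without one you cannot log in as root.)

apt-get install sudo
sudo passwd

12. After that, install PyTorch and anything else as usual; that is not covered here.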

Original link: https://blog.csdn.net/m0_46308537/article/details/123390387