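I won't dwell on the prerequisites: the NVIDIA driver and Docker 19.03 or later (the first release with the --gpus flag) must be installed first, and there are plenty of tutorials for that online.
Attempt 1: Build on the existing deepo image
deepo is an open-source image that bundles almost every deep learning framework. Here we pull only the tensorflow-gpu variant to keep disk usage down.
# Pull the image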
root@master:/home/hqc# docker pull ufoym/deepo:tensorflow-py36
# Check whether the GPU is visible from a container
root@master:/home/hqc# sudo docker run --rm --gpus all ufoym/deepo:tensorflow-py36 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44 Driver Version: 495.44 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 23% 38C P8 9W / 250W | 11MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
# The GPU is visible
# Enter the container
root@master:/home/hqc# docker run --gpus all -it ufoym/deepo:tensorflow-py36 bash
root@6ef50267dc04:/# nvidia-smi
Tue Jun 7 05:59:34 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44 Driver Version: 495.44 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 23% 38C P8 9W / 250W | 11MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
# Also visible
# Verify with TensorFlow
root@6ef50267dc04:/# python
Python 3.6.9 (default, Oct 8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2022-06-07 05:35:03.266686: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-06-07 05:35:03.266704: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
>>> tf.test.is_gpu_available() # Check whether TensorFlow can use a GPU
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2022-06-07 05:03:51.592175: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-07 05:03:51.594231: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-06-07 05:03:51.594877: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-06-07 05:03:51.631522: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-07 05:03:51.631807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2022-06-07 05:03:51.631894: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-06-07 05:03:51.631989: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-06-07 05:03:51.632046: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-06-07 05:03:51.633279: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-06-07 05:03:51.633571: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-06-07 05:03:51.634825: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-06-07 05:03:51.634947: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-06-07 05:03:51.634997: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-06-07 05:03:51.635004: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-06-07 05:03:51.686019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-06-07 05:03:51.686037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2022-06-07 05:03:51.686041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
False
# The result is False, so TensorFlow cannot use the GPU
# Check the CUDA toolkit version
root@6ef50267dc04:/# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
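# CUDA is installed, but it is 10.1, older than the 11.5 that the host driver supports. The log above shows TensorFlow looking for CUDA 11 libraries (libcudart.so.11.0, libcublas.so.11, libcudnn.so.8), so this mismatch is the likely cause.
After a lot of searching I could not resolve this, so I decided to try another approach.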
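To make the failure easier to read, the sketch below pulls the names of the libraries TensorFlow could not load out of a log like the one above. The function name missing_libraries and the shortened sample log are illustrative assumptions; only the "Could not load dynamic library" message format comes from the transcript.

```python
import re

# Matches the dso_loader warning format seen in the TensorFlow log above.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the unique library names that failed to load, in first-seen order."""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Shortened excerpt of the log from the deepo container (timestamps removed).
sample_log = "\n".join([
    "Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file",
    "Successfully opened dynamic library libcuda.so.1",
    "Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file",
    "Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file",
    "Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file",
    "Successfully opened dynamic library libcufft.so.10",
    "Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file",
    "Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file",
])

for lib in missing_libraries(sample_log):
    print(lib)
```

Every missing library carries a CUDA 11 or cuDNN 8 version suffix, while nvcc in the same container reports release 10.1, which is consistent with a TensorFlow build that expects a newer CUDA than the image provides.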
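Attempt 2: Build on an Ubuntu image that ships with CUDA
1 Pull the official nvidia/cuda image
Note that the CUDA version you pick must be one the host's GPU driver supports (NVIDIA's official site lists the requirements). The host reports Driver Version: 495.44, which supports up to CUDA 11.5, so an 11.0 image is fine.
On the image's official page I chose the tag 11.0.3-cudnn8-devel-ubuntu18.04 and copied the command:
docker pull nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04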
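The compatibility check described above can be sketched in a few lines: the "CUDA Version" field in the nvidia-smi header is the highest CUDA version the installed driver supports, so the image's CUDA version should not exceed it. The function names here are hypothetical, and the parsing assumes the tag starts with major.minor, as the nvidia/cuda tag used in this post does.

```python
import re


def driver_cuda_ceiling(smi_header):
    """Parse the 'CUDA Version: X.Y' field from an nvidia-smi header line."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        raise ValueError("no CUDA Version field found")
    return int(match.group(1)), int(match.group(2))


def tag_cuda_version(tag):
    """Parse the leading major.minor from a tag such as 11.0.3-cudnn8-devel-ubuntu18.04."""
    match = re.match(r"(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError("tag does not start with a CUDA version")
    return int(match.group(1)), int(match.group(2))


def image_fits_driver(tag, smi_header):
    """True if the image's CUDA version is not newer than what the driver supports."""
    return tag_cuda_version(tag) <= driver_cuda_ceiling(smi_header)


# Header line copied from the nvidia-smi output on the host.
header = "| NVIDIA-SMI 495.44 Driver Version: 495.44 CUDA Version: 11.5 |"
print(image_fits_driver("11.0.3-cudnn8-devel-ubuntu18.04", header))  # True
```

Comparing (major, minor) tuples keeps the check simple; it does not replace the driver requirements table on NVIDIA's site.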
The image is fairly large, so be patient while it downloads. Once the pull completes:
root@master:/home/hqc# docker pull nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04
11.0.3-cudnn8-devel-ubuntu18.04: Pulling from nvidia/cuda
e4ca327ec0e7: Pull complete
0fa9fc055636: Pull complete
448bb2d7fba5: Pull complete
a084e2627368: Pull complete
de932d3a14d8: Pull complete
ebe7db8e97e0: Pull complete
66fef8aabad3: Pull complete
9696f5331161: Pull complete
7799e2177407: Pull complete
56d35ebee226: Pull complete
Digest: sha256:f56265faac1e5cf5062c466c250b65be0cfd18e2146cf06056f1376f94c08bac
Status: Downloaded newer image for nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04
docker.io/nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04
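2 Create a container from the image
Run the following command:
sudo docker run -it --name ubuntu-tf-gpu --gpus all -p 1234:22 nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04
Options:
-it: run the container in interactive mode with a terminal attached, so it stays up instead of exiting right away
--name ubuntu-tf-gpu: name the container ubuntu-tf-gpu; otherwise Docker assigns a random name
--gpus all: expose all GPUs to the container. This is essential: without it the container cannot use the GPU
-p 1234:22: map host port 1234 to container port 22 (the standard SSH port), in preparation for SSH access
nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04: image name:tag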
root@master:/home/hqc# docker run -it --name ubuntu-tf-gpu --gpus all -p 1234:22 nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04
root@6feb1814fb68:/# nvidia-smi
Tue Jun 7 07:35:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44 Driver Version: 495.44 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 23% 38C P8 9W / 250W | 11MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@6feb1814fb68:/# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
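# The CUDA toolkit inside the container is version 11.0
3 Re-enter the container
# Exiting the interactive container with Ctrl+D stops it but does not remove it, so its name is still taken and running the same command again fails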
root@master:/home/hqc# docker run -it --name ubuntu-tf-gpu --gpus all -p 1234:22 nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04
docker: Error response from daemon: Conflict. The container name "/ubuntu-tf-gpu" is already in use by container "6feb1814fb68986b382f5c4c2a3a28ba59c19b4c2a172786f31cf0aa6cf4de72". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
# Find the ID of the stopped container
root@master:/home/hqc# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
048bc6bb1fbc 74060cea7f70 "kube-apiserver --ad…" 4 minutes ago Exited (0) 2 minutes ago k8s_kube-apiserver_kube-apiserver-master_kube-system_7cedabdfa181aeeef915699b4c38b029_24272
6feb1814fb68 nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04 "bash" 9 minutes ago Exited (0) 7 minutes ago ubuntu-tf-gpu
6ef50267dc04 ufoym/deepo:tensorflow-py36 "bash" 2 hours ago Exited (1) About an hour ago thirsty_aryabhata
# Restart the container
root@master:/home/hqc# docker start 6feb1814fb68
6feb1814fb68
# Enter it again
root@master:/home/hqc# docker exec -it 6feb1814fb68 /bin/bash
# Or
root@master:/home/hqc# docker attach 6feb1814fb68
root@6feb1814fb68:/#
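In practice we usually start the container with docker run -itd image:tag /bin/bash (-it keeps bash, and therefore the container, alive in the background) and then enter it with docker exec -it <container ID> /bin/bash. This is the safest and most reliable approach, because exiting an exec session leaves the container running.
# Kill the container's main process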
root@master:/home/hqc# docker kill 6feb1814fb68
6feb1814fb68
# The container still exists; it has only been stopped
# Remove the container
root@master:/home/hqc# docker rm -f 6feb1814fb68
6feb1814fb68
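The lifecycle in this section (create, restart a stopped container, or just enter a running one) can be summarized as a small decision function. This is a sketch under assumptions: plan_commands is a hypothetical helper, and the status argument is expected to be the value printed by docker inspect -f '{{.State.Status}}' NAME, or None when no container with that name exists.

```python
def plan_commands(status, name, image):
    """Return the docker commands needed to get a shell in the named container."""
    enter = ["docker", "exec", "-it", name, "/bin/bash"]
    if status is None:
        # No such container yet: create it detached, then open a shell in it.
        create = ["docker", "run", "-itd", "--name", name, "--gpus", "all",
                  image, "/bin/bash"]
        return [create, enter]
    if status == "running":
        return [enter]
    if status in ("exited", "created"):
        # The name is taken by a stopped container: start it instead of `docker run`.
        return [["docker", "start", name], enter]
    raise ValueError(f"unhandled container status: {status}")


image = "nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04"
for command in plan_commands("exited", "ubuntu-tf-gpu", image):
    print(" ".join(command))
```

For the stopped ubuntu-tf-gpu container above, this yields docker start followed by docker exec, which avoids the name conflict error.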
4 Set up Python 3.6
# Create and enter a folder for dependencies
root@1dcc1e5f8ae7:/# mkdir requirement
root@1dcc1e5f8ae7:/# cd requirement/
root@1dcc1e5f8ae7:/requirement# ls
root@1dcc1e5f8ae7:/requirement# sudo apt-get install python3.6 python3-pip
bash: sudo: command not found # sudo is not installed in this image; we are already root, so it is not needed
# Update the package index first
root@1dcc1e5f8ae7:/requirement# apt-get update
Get:1 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [957 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic/restricted amd64 Packages [13.5 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages [1344 kB]
# Install
root@1dcc1e5f8ae7:/requirement# apt-get install python3.6 python3-pip
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
dbus dh-python file gir1.2-glib-2.0 libapparmor1 libdbus-1-3 libexpat1 libexpat1-dev libgirepository-1.0-1 libglib2.0-0 libglib2.0-data libicu60 libmagic-mgc libmagic1 libmpdec2 libpython3-dev
...
python3-secretstorage python3-setuptools python3-six python3-wheel python3-xdg python3.6 python3.6-dev python3.6-minimal shared-mime-info xdg-user-dirs
0 upgraded, 50 newly installed, 0 to remove and 48 not upgraded.
Need to get 65.9 MB of archives.
After this operation, 167 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
...
5 Set up the virtual environment tool virtualenv
# Install
root@1dcc1e5f8ae7:/requirement# pip3 install -U virtualenv
Collecting virtualenv
...
Successfully installed distlib-0.3.4 filelock-3.4.1 importlib-metadata-4.8.3 importlib-resources-5.4.0 platformdirs-2.4.0 six-1.16.0 typing-extensions-4.1.1 virtualenv-20.14.1 zipp-3.6.0
# Create a dedicated folder for virtual environments
root@1dcc1e5f8ae7:/requirement# cd ..
root@1dcc1e5f8ae7:/# mkdir environments
root@1dcc1e5f8ae7:/# cd environments/
# Create a virtual environment named tf2_py3
root@1dcc1e5f8ae7:/environments# virtualenv --system-site-packages -p python3.6 ./tf2_py3
created virtual environment CPython3.6.9.final.0-64 in 147ms
creator CPython3Posix(dest=/environments/tf2_py3, clear=False, no_vcs_ignore=False, global=True)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
added seed packages: pip==21.3.1, setuptools==59.6.0, wheel==0.37.1
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
# Activate the environment
root@1dcc1e5f8ae7:/environments# source tf2_py3/bin/activate
6 Install tensorflow-gpu 2.4.0 in the virtual environment
# Install tensorflow-gpu 2.4.0 inside the virtual environment
(tf2_py3) root@1dcc1e5f8ae7:/environments# pip install --upgrade tensorflow-gpu==2.4.0
Collecting tensorflow-gpu==2.4.0
Downloading tensorflow_gpu-2.4.0-cp36-cp36m-manylinux2010_x86_64.whl (394.7 MB)
...
Successfully installed absl-py-0.15.0 astunparse-1.6.3 cachetools-4.2.4 certifi-2022.5.18.1 charset-normalizer-2.0.12 dataclasses-0.8 flatbuffers-1.12 gast-0.3.3 google-auth-2.6.6 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.32.0 h5py-2.10.0 keras-preprocessing-1.1.2 markdown-3.3.7 numpy-1.19.5 oauthlib-3.2.0 opt-einsum-3.3.0 protobuf-3.19.4 pyasn1-0.4.8 pyasn1-modules-0.2.8 requests-2.27.1 requests-oauthlib-1.3.1 rsa-4.8 six-1.15.0 tensorboard-2.9.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-estimator-2.4.0 tensorflow-gpu-2.4.0 termcolor-1.1.0 typing-extensions-3.7.4.3 urllib3-1.26.9 werkzeug-2.0.3 wrapt-1.12.1
Verify the tensorflow-gpu environment:
(tf2_py3) root@1dcc1e5f8ae7:/environments# python
Python 3.6.9 (default, Mar 15 2022, 13:55:28)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2022-06-07 08:52:18.240052: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
>>> print(tf.__version__)
2.4.0
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
...
2022-06-07 08:52:59.482723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 10261 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
True
# Success!
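The deprecation warning in the transcript recommends tf.config.list_physical_devices('GPU') in place of tf.test.is_gpu_available(). A minimal sketch of that check follows; the wrapper visible_gpus is a hypothetical name, and returning an empty list when TensorFlow is not installed is a choice made here so the snippet can run anywhere.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or [] if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    # Each entry is a PhysicalDevice; an empty list means no usable GPU was found.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
print(f"GPUs visible to TensorFlow: {gpus}")
```

Unlike is_gpu_available(), this call only enumerates devices, so an empty list is the signal that the GPU libraries failed to load.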
7 Other setup
1 libssl-dev: OpenSSL development files, used for transmitting sensitive data securely
(tf2_py3) root@1dcc1e5f8ae7:/requirement# apt-get install libssl-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
libssl1.1
Suggested packages:
libssl-doc
The following NEW packages will be installed:
libssl-dev
The following packages will be upgraded:
libssl1.1
1 upgraded, 1 newly installed, 0 to remove and 47 not upgraded.
...
2 make: build automation
(tf2_py3) root@1dcc1e5f8ae7:/requirement# make
make: *** No targets specified and no makefile found. Stop.
(tf2_py3) root@1dcc1e5f8ae7:/requirement# apt-get install make
Reading package lists... Done
Building dependency tree
Reading state information... Done
make is already the newest version (4.1-9.1ubuntu1).
make set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 47 not upgraded.
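3 NFS: file synchronization and sharing
The Network File System (NFS) is a distributed file system protocol that lets you share remote directories over a network. With NFS you can mount a remote directory on your own system and work with the remote machine's files as conveniently as local ones.
nfs-common: the client package, with fewer features
nfs-kernel-server: the server package, with more features (recommended)
This looked fairly complicated, so I will come back to it later.
4 git: version control tool & vim: text editor
A version control system makes it easier to manage different versions of files.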
(tf2_py3) root@1dcc1e5f8ae7:/requirement# git
bash: git: command not found
(tf2_py3) root@1dcc1e5f8ae7:/requirement# apt-get install git
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
...
After this operation, 49.5 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 multiarch-support amd64 2.27-3ubuntu1.6 [6960 B]
...
(tf2_py3) root@1dcc1e5f8ae7:/requirement# apt-get install vim
...
8 Commit the container to an image and save it (it can be re-imported later)
# Exit the container
(tf2_py3) root@1dcc1e5f8ae7:/# exit
exit
# Look up the container ID
root@master:/home/hqc# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1dcc1e5f8ae7 nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04 "bash" 2 hours ago Exited (0) About a minute ago ubuntu-tf-gpu
# Commit the container to an image
root@master:/home/hqc# docker commit 1dcc1e5f8ae7 tf-gpu2.4.0
sha256:141cba844accecf332ce7434ee949b69b6e9c4fc62e0edc05e5645304b622bff
root@master:/home/hqc# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
tf-gpu2.4.0 latest 141cba844acc 8 seconds ago 9.76GB
# Save the image to a tar file in a chosen folder
root@master:/home/hqc# docker save -o ./docker_learning/tf-gpu2.4.0.tar tf-gpu2.4.0
# The tar file can then be loaded on any machine with Docker
root@master:/home/hqc# docker load -i ./docker_learning/tf-gpu2.4.0.tar
Loaded image: tf-gpu2.4.0:latest