$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
Check whether CUDA is usable:
>>> import torch
>>> torch.cuda.is_available()
False
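Listing the drivers for this card turns up a whole pile of GPU driver packages (a driver is the program that operates the graphics card, i.e. the software counterpart of the hardware). Note that ubuntu-drivers devices lists the packages available for the device, so not every entry is necessarily installed: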
$ ubuntu-drivers devices
== /sys/devices/pci0000:64/0000:64:00.0/0000:65:00.0 ==
modalias : pci:v000010DEd00001E87sv00001458sd000037A8bc03sc00i00
vendor : NVIDIA Corporation
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-470-server - distro non-free recommended
driver : nvidia-driver-460 - distro non-free
driver : nvidia-driver-460-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
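Exactly one line in the listing above carries the recommended tag. As a minimal sketch (not part of the original procedure), that package name could be picked out of the text programmatically. The function name recommended_driver and the SAMPLE constant are hypothetical, and the sample contains only a few driver lines copied from the listing above.

```python
import re

# A few driver lines copied from the listing above.
SAMPLE = """\
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-470-server - distro non-free recommended
driver : xserver-xorg-video-nouveau - distro free builtin
"""


def recommended_driver(listing):
    """Return the package name on the line tagged 'recommended', or None."""
    for line in listing.splitlines():
        # Expected shape (an assumption): "driver : <package> - <tags...>"
        match = re.match(r"\s*driver\s*:\s*(\S+)\s+-\s+(.*)$", line)
        if match and "recommended" in match.group(2).split():
            return match.group(1)
    return None


print(recommended_driver(SAMPLE))
```

For the sample this selects nvidia-driver-470-server, which is the package installed in the next step.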
The driver situation is too messy, so I decided to remove all existing NVIDIA drivers and install the recommended version:
$ sudo apt-get remove --purge nvidia-\*
Install:
$ sudo apt-get install nvidia-driver-470-server nvidia-settings nvidia-prime
But running nvidia-smi still fails with the same error.
Check which driver package is currently installed:
$ dpkg -l | grep nvidia-driver
ii nvidia-driver-470-server 470.57.02-0ubuntu0.18.04.2 amd64 NVIDIA Server Driver metapackage
The driver package was installed successfully.
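A possible cross-check at this point is to compare the version of the installed package with the version of the kernel module that is actually loaded. The sketch below assumes that the loaded module reports itself in /proc/driver/nvidia/version on a line starting with "NVRM version:". The helper names are hypothetical, the dpkg line is copied from the output above, and the NVRM line is a made-up sample, not output from this machine.

```python
import re

VERSION_RE = re.compile(r"\d+\.\d+(?:\.\d+)?")

# Copied from the dpkg output above.
DPKG_LINE = ("ii nvidia-driver-470-server 470.57.02-0ubuntu0.18.04.2 "
             "amd64 NVIDIA Server Driver metapackage")

# Made-up sample of /proc/driver/nvidia/version (format is an assumption).
SAMPLE_PROC = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  460.00.00  (sample)\n"


def package_version(dpkg_line):
    """Extract the upstream driver version from one `dpkg -l` line."""
    fields = dpkg_line.split()
    # fields: status, package name, version, architecture, description...
    match = VERSION_RE.match(fields[2])
    return match.group(0) if match else None


def loaded_module_version(proc_text):
    """Extract the version from the 'NVRM version:' line, or None."""
    for line in proc_text.splitlines():
        if line.startswith("NVRM version:"):
            match = VERSION_RE.search(line)
            return match.group(0) if match else None
    return None


installed = package_version(DPKG_LINE)
loaded = loaded_module_version(SAMPLE_PROC)
print("installed:", installed, "loaded:", loaded, "match:", installed == loaded)
```

On a real machine the second argument would come from reading /proc/driver/nvidia/version. Differing versions would be consistent with the "Driver/library version mismatch" error.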
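The original post explains that this problem occurs because the NVIDIA kernel module has not been updated: the module that is still loaded belongs to the old driver version. Normally a reboot fixes it. If you cannot reboot for some reason, the kernel module can be reloaded by hand:
- unload the nvidia kernel module:
$ sudo rmmod nvidia
- reload the nvidia kernel module:
$ sudo nvidia-smi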
When I ran this, unloading failed:
$ sudo rmmod nvidia
rmmod: ERROR: Module nvidia is in use by: nvidia_uvm nvidia_modeset
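At this point the driver has to be unloaded piece by piece, which requires knowing how the kernel modules depend on each other. The error message already tells us that nvidia_uvm and nvidia_modeset depend on nvidia, so they have to be unloaded first.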
$ lsmod | grep nvidia
nvidia_uvm 978944 0
nvidia_drm 49152 9
nvidia_modeset 1183744 12 nvidia_drm
nvidia 19742720 552 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 1 nvidia_drm
drm 491520 12 drm_kms_helper,nvidia_drm
i2c_nvidia_gpu 16384 0
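The last column of this output ("Used by") determines the order in which the modules can be removed: a module can only be unloaded after every module listed as using it is gone. A minimal sketch of deriving that order from the text above follows. The function names are hypothetical, and the listing is assumed to have the columns name, size, use count, used-by list.

```python
# Copied from the lsmod output above.
LSMOD = """\
nvidia_uvm 978944 0
nvidia_drm 49152 9
nvidia_modeset 1183744 12 nvidia_drm
nvidia 19742720 552 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 1 nvidia_drm
drm 491520 12 drm_kms_helper,nvidia_drm
i2c_nvidia_gpu 16384 0
"""


def parse_lsmod(text):
    """Map each module name to the list of modules in its 'Used by' column."""
    users = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        used_by = fields[3].split(",") if len(fields) > 3 else []
        users[fields[0]] = [name for name in used_by if name]
    return users


def unload_order(users, target="nvidia"):
    """Return the modules to rmmod, dependents first, ending with target."""
    order = []

    def visit(module):
        if module in order:
            return
        # Everything that uses this module has to go before it.
        for user in users.get(module, []):
            visit(user)
        order.append(module)

    visit(target)
    return order


print(unload_order(parse_lsmod(LSMOD)))
```

For the listing above this yields nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia. The sketch only looks at module-to-module dependencies. A non-zero use count with an empty "Used by" list, as on nvidia_drm (9), points to users that are not modules, typically running processes, and those have to be stopped separately.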
The output shows that nvidia has a use count of 552. We can start by unloading nvidia_uvm and nvidia_modeset. Because nvidia_drm in turn depends on nvidia_modeset, nvidia_drm has to be unloaded as well:
$ sudo rmmod nvidia_uvm
$ sudo rmmod nvidia_modeset
rmmod: ERROR: Module nvidia_modeset is in use by: nvidia_drm
$ sudo rmmod nvidia_drm
rmmod: ERROR: Module nvidia_drm is in use
Unloading failed. First find out which processes are using /dev/nvidia*, then stop all of them:
$ sudo lsof -n -w /dev/nvidia*
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
Xorg 1310 root mem CHR 195,255 493 /dev/nvidiactl
Xorg 1310 root 12u CHR 195,255 0t0 493 /dev/nvidiactl
....
gnome-she 1544 nics mem CHR 195,255 493 /dev/nvidiactl
gnome-she 1544 nics mem CHR 195,0 494 /dev/nvidia0
gnome-she 1544 nics 12u CHR 195,255 0t0 493 /dev/nvidiactl
....
code 3013 nics mem CHR 195,255 493 /dev/nvidiactl
code 3013 nics mem CHR 195,0 494 /dev/nvidia0
code 3013 nics 23u CHR 195,255 0t0 493 /dev/nvidiactl
....
chrome 5158 nics mem CHR 195,255 493 /dev/nvidiactl
chrome 5158 nics mem CHR 195,0 494 /dev/nvidia0
chrome 5158 nics 24u CHR 195,255 0t0 493 /dev/nvidiactl
....
chrome 5280 nics mem CHR 195,255 493 /dev/nvidiactl
chrome 5280 nics mem CHR 195,0 494 /dev/nvidia0
chrome 5280 nics 24u CHR 195,255 0t0 493 /dev/nvidiactl
....
code 7925 nics mem CHR 195,255 493 /dev/nvidiactl
code 7925 nics mem CHR 195,0 494 /dev/nvidia0
code 7925 nics 23u CHR 195,255 0t0 493 /dev/nvidiactl
....
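Kill these processes. Killing Xorg and gnome-shell ends the graphical session, so it is safer to do this from an SSH session or a text console: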
$ sudo kill -9 7925
$ sudo kill -9 5280
....
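The same PID appears on several rows of the lsof listing, so copying them by hand is error-prone. As a rough sketch, the listing can be reduced to one entry per process. The function name pids_using_gpu is hypothetical, the sample rows are a subset copied from the output above, and the PID is assumed to be the second column.

```python
# A subset of the lsof output above, including the header and an elided row.
LSOF = """\
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
Xorg 1310 root mem CHR 195,255 493 /dev/nvidiactl
Xorg 1310 root 12u CHR 195,255 0t0 493 /dev/nvidiactl
....
gnome-she 1544 nics mem CHR 195,0 494 /dev/nvidia0
code 3013 nics 23u CHR 195,255 0t0 493 /dev/nvidiactl
"""


def pids_using_gpu(lsof_text):
    """Return {pid: command} for every process row in the lsof listing."""
    found = {}
    for line in lsof_text.splitlines():
        fields = line.split()
        # Skip the header, elided rows such as "....", and blank lines.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        found[int(fields[1])] = fields[0]
    return found


for pid, command in sorted(pids_using_gpu(LSOF).items()):
    print(pid, command)
```

lsof also has a -t option that prints only the PIDs, which may be enough when the command names are not needed.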
Check that all of the processes are gone, then unload the remaining modules:
$ sudo lsof -n -w /dev/nvidia* # nothing is returned
$ sudo rmmod nvidia_drm
$ sudo rmmod nvidia_modeset
Check whether the "Used by" count of nvidia has dropped to 0:
$ lsmod | grep nvidia
nvidia 19742720 0
i2c_nvidia_gpu 16384 0
# it has dropped to 0
Finally, unload nvidia and run nvidia-smi again:
$ sudo rmmod nvidia
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:65:00.0 Off | N/A |
| 38% 44C P0 30W / 225W | 0MiB / 7979MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This time CUDA works!
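Copyright notice: this is an original article by weixin_43192983, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.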