GPU driver version mismatch

Based on Comzyh's blog post.

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Check whether CUDA is usable:

>>> import torch
>>> torch.cuda.is_available()
False

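Check the current drivers. `ubuntu-drivers devices` lists a whole pile of NVIDIA driver packages for this card (a driver is the software that lets the operating system talk to a piece of hardware):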

$ ubuntu-drivers devices
== /sys/devices/pci0000:64/0000:64:00.0/0000:65:00.0 ==
modalias : pci:v000010DEd00001E87sv00001458sd000037A8bc03sc00i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free recommended
driver   : nvidia-driver-460 - distro non-free
driver   : nvidia-driver-460-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
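
The listing above tags one package as recommended. As a minimal sketch, that package name can be pulled out of the output mechanically. The helper name recommended_driver is made up for illustration, and the sketch assumes the output keeps the format shown above:

```shell
# Hypothetical helper: read `ubuntu-drivers devices` output on stdin and
# print the package name from every "driver" line tagged "recommended".
recommended_driver() {
    awk '$1 == "driver" && $NF == "recommended" { print $3 }'
}

# Example use on a machine that has ubuntu-drivers installed:
#   ubuntu-drivers devices | recommended_driver
```

Matching on the last field rather than on the whole line avoids picking up a package whose name merely contains the word "recommended".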

The driver situation is a mess, so I decided to purge every existing NVIDIA driver package and reinstall the recommended version.

$ sudo apt-get remove --purge nvidia-\*

Install:

$ sudo apt-get install nvidia-driver-470-server nvidia-settings nvidia-prime

But nvidia-smi still reports the same error.
Check which driver package is installed:

$ dpkg -l | grep nvidia-driver
ii  nvidia-driver-470-server                          470.57.02-0ubuntu0.18.04.2                       amd64        NVIDIA Server Driver metapackage

The driver package installed successfully.

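The original post explains that this error appears because the NVIDIA kernel module still loaded in the running kernel has not been updated to the new driver version. A reboot normally fixes it. If for some reason the machine cannot be rebooted, the kernel module can also be reloaded by hand: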

  1. unload nvidia kernel mod, i.e., (sudo rmmod nvidia)
  2. reload nvidia kernel mod, i.e., (sudo nvidia-smi)
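
Before unloading anything, it can help to confirm that the loaded module really is older than the installed one. The following is a rough sketch under two assumptions: that the loaded module exposes its version at /sys/module/nvidia/version, and that `modinfo` can find the newly installed module for the running kernel. All function names are hypothetical.

```shell
# Hypothetical helpers for comparing the loaded module with the installed one.

# Version of the nvidia module currently loaded (empty if it is not loaded).
loaded_nvidia_version() {
    cat /sys/module/nvidia/version 2>/dev/null || true
}

# Version of the nvidia module file installed for the running kernel
# (empty if modinfo cannot find it).
installed_nvidia_version() {
    modinfo -F version nvidia 2>/dev/null || true
}

# Succeeds (exit status 0) when the two version strings given differ.
versions_differ() {
    [ "$1" != "$2" ]
}

# Print both versions and say whether a reload or reboot looks necessary.
report_nvidia_versions() {
    loaded=$(loaded_nvidia_version)
    installed=$(installed_nvidia_version)
    echo "loaded module:    ${loaded:-not loaded}"
    echo "installed module: ${installed:-not found}"
    if versions_differ "$loaded" "$installed"; then
        echo "versions differ: reboot or reload the nvidia modules"
    else
        echo "versions agree"
    fi
}
```

Calling report_nvidia_versions on the affected machine prints both versions side by side. If they agree, the mismatch probably has a different cause and reloading the module is unlikely to help.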

When I tried this, the unload failed:

$ sudo rmmod nvidia
rmmod: ERROR: Module nvidia is in use by: nvidia_uvm nvidia_modeset

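So the driver stack has to be unloaded piece by piece. First we need to know how the kernel modules depend on each other. The error message already tells us that nvidia_uvm and nvidia_modeset depend on nvidia, so those two must be unloaded first.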

$ lsmod | grep nvidia
nvidia_uvm            978944  0
nvidia_drm             49152  9
nvidia_modeset       1183744  12 nvidia_drm
nvidia              19742720  552 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   491520  12 drm_kms_helper,nvidia_drm
i2c_nvidia_gpu         16384  0

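nvidia has a use count of 552 and is held by nvidia_uvm and nvidia_modeset, so we unload those two first. nvidia_modeset is in turn used by nvidia_drm (that is, nvidia_drm depends on nvidia_modeset), so nvidia_drm has to be unloaded as well: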
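
The "Used by" column is easy to misread in the raw listing. As an illustrative sketch (nvidia_module_users is a made-up name), the relevant columns of the `lsmod` output can be printed in labelled form:

```shell
# Hypothetical helper: read `lsmod` output on stdin and, for every module
# whose name starts with "nvidia", print its use count and the modules
# that hold it ("-" when no module holds it).
nvidia_module_users() {
    awk '$1 ~ /^nvidia/ {
        printf "%s count=%s used_by=%s\n", $1, $3, ($4 == "" ? "-" : $4)
    }'
}

# Example use:
#   lsmod | nvidia_module_users
```

A module listed under used_by has to be removed before the module on that line. A nonzero count with used_by=- means the holders are processes or other kernel users rather than modules, which is the situation nvidia_drm is in here.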

$ sudo rmmod nvidia_uvm
$ sudo rmmod nvidia_modeset
rmmod: ERROR: Module nvidia_modeset is in use by: nvidia_drm
$ sudo rmmod nvidia_drm
rmmod: ERROR: Module nvidia_drm is in use

The unload failed. First find out which processes are using /dev/nvidia*, then stop all of them:

$ sudo lsof -n -w  /dev/nvidia*
COMMAND    PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
Xorg      1310 root  mem    CHR 195,255           493 /dev/nvidiactl
Xorg      1310 root   12u   CHR 195,255      0t0  493 /dev/nvidiactl
....
gnome-she 1544 nics  mem    CHR 195,255           493 /dev/nvidiactl
gnome-she 1544 nics  mem    CHR   195,0           494 /dev/nvidia0
gnome-she 1544 nics   12u   CHR 195,255      0t0  493 /dev/nvidiactl
....
code      3013 nics  mem    CHR 195,255           493 /dev/nvidiactl
code      3013 nics  mem    CHR   195,0           494 /dev/nvidia0
code      3013 nics   23u   CHR 195,255      0t0  493 /dev/nvidiactl
....
chrome    5158 nics  mem    CHR 195,255           493 /dev/nvidiactl
chrome    5158 nics  mem    CHR   195,0           494 /dev/nvidia0
chrome    5158 nics   24u   CHR 195,255      0t0  493 /dev/nvidiactl
....
chrome    5280 nics  mem    CHR 195,255           493 /dev/nvidiactl
chrome    5280 nics  mem    CHR   195,0           494 /dev/nvidia0
chrome    5280 nics   24u   CHR 195,255      0t0  493 /dev/nvidiactl
....
code      7925 nics  mem    CHR 195,255           493 /dev/nvidiactl
code      7925 nics  mem    CHR   195,0           494 /dev/nvidia0
code      7925 nics   23u   CHR 195,255      0t0  493 /dev/nvidiactl
....

Kill these processes:

$ sudo kill -9 7925
$ sudo kill -9 5280
....
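
Killing the processes one PID at a time gets tedious when the `lsof` listing is long. A small sketch for collecting the unique PIDs from that listing follows. The function name nvidia_pids is hypothetical, and the sketch assumes the column layout shown above, with the PID in the second column:

```shell
# Hypothetical helper: read `lsof` output on stdin and print each distinct
# PID once, in ascending order. Lines whose second field is not a number
# (the header, "...." separators) are ignored.
nvidia_pids() {
    awk '$2 ~ /^[0-9]+$/ { print $2 }' | sort -n -u
}

# Example use (review the PID list before killing anything):
#   sudo lsof -n -w /dev/nvidia* | nvidia_pids
```

`lsof -t` prints bare PIDs directly and may be simpler when no other columns are needed. Sending a plain `kill` (SIGTERM) first and keeping `kill -9` as a fallback gives processes such as Xorg a chance to shut down cleanly.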

Confirm that all of those processes are gone, then unload the remaining modules:

$ sudo lsof -n -w  /dev/nvidia* # nothing is returned
$ sudo rmmod nvidia_drm
$ sudo rmmod nvidia_modeset

Check whether the use count ("Used by") of nvidia has dropped to 0:

$ lsmod | grep nvidia
nvidia              19742720  0
i2c_nvidia_gpu         16384  0
# it has dropped to 0
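
The same check can be scripted so that the final `rmmod` only runs once the count is really zero. This is a sketch only, and module_refcount is a hypothetical name:

```shell
# Hypothetical helper: read `lsmod` output on stdin and print the use count
# of the module named in $1. Prints nothing if the module is not listed.
# The name is matched exactly, so "nvidia" does not match "nvidia_uvm".
module_refcount() {
    awk -v mod="$1" '$1 == mod { print $3 }'
}

# Example use:
#   count=$(lsmod | module_refcount nvidia)
#   [ "$count" = "0" ] && sudo rmmod nvidia
```

For a loaded module the kernel also exposes this number in /sys/module/nvidia/refcnt, which avoids parsing `lsmod` altogether.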

Finally:

$ sudo rmmod nvidia
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:65:00.0 Off |                  N/A |
| 38%   44C    P0    30W / 225W |      0MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

CUDA works this time!


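Copyright notice: this is an original article by weixin_43192983, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.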