Dell R750, Ubuntu 20.04: installing the NVIDIA Tesla A100 80GB PCIe driver (working)
<strong style="color:#ff0000;">I wrote this document because no ready-made solution for this problem could be found anywhere online.</strong>
Environment
NVIDIA A100 80GB PCIe GPU
Dell PowerEdge R750
Ubuntu 20.04
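Prerequisites to confirm first
- Windows was installed on the server as a test, and with the driver installed there nvidia-smi produced output, which proves that the server supports this card
- The NVIDIA A100 80GB PCIe GPU is installed in a PCIe Gen4 x16 slot
- BIOS Secure Boot is disabled
- iDRAC version 5.00.10.20 added support for the NVIDIA A100 80GB PCIe GPU in the PowerEdge R750, PowerEdge R750xa, and PowerEdge R7525, so the firmware should be at least that version
- The NVIDIA A100 is installed in PCIe slot 7 or 2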
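Two of the checks above can be done from the Linux side before touching the driver. The following is a minimal sketch, assuming the lspci and mokutil tools are present (the script skips a check when its tool is missing). The helper names has_nvidia_device and secure_boot_disabled are made up for illustration and are not part of any package.

```shell
#!/bin/sh
# Sketch: pre-flight checks before installing the driver.

# Succeeds if lspci-style text on stdin lists an NVIDIA device.
# 10de is NVIDIA's PCI vendor ID, shown by `lspci -nn`.
has_nvidia_device() {
    grep -qiE 'nvidia|\[10de:'
}

# Succeeds if mokutil-style text on stdin reports Secure Boot as disabled.
secure_boot_disabled() {
    grep -qi 'SecureBoot disabled'
}

# Run the real checks only where the tools exist.
if command -v lspci >/dev/null 2>&1; then
    if lspci -nn | has_nvidia_device; then
        echo "NVIDIA device visible on the PCIe bus"
    else
        echo "no NVIDIA device found by lspci"
    fi
fi

if command -v mokutil >/dev/null 2>&1; then
    if mokutil --sb-state 2>/dev/null | secure_boot_disabled; then
        echo "Secure Boot is disabled"
    else
        echo "Secure Boot is enabled or its state is unknown"
    fi
fi
```

The slot number and the iDRAC version are not checked here; those are easier to read from the iDRAC web interface.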
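Reproducing the problem
user@user:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
# On top of that, NVIDIA-Linux-x86_64-515.65.07.run would not install after three days of trying, so I gave up on it. Switching gcc/g++ versions, changing the Ubuntu kernel, and every other fix found online had all been tried without success.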
Solution
Install the driver
user@user:~$ sudo apt-get install nvidia-driver-515
# The driver can also be installed through the graphical "Additional Drivers" tool
Error after installation
user@user:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
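When nvidia-smi still fails after the package is installed, the kernel log and the module list are the usual places to look. The original post did not record any log output, so the sketch below only filters for lines that might be relevant. The patterns are assumptions, and the helper names gpu_log_lines and nvidia_module_loaded are hypothetical.

```shell
#!/bin/sh
# Sketch: look for driver-related hints after a failed nvidia-smi.

# Reads kernel log text on stdin and prints lines that mention the NVIDIA
# driver (NVRM is the prefix its kernel module logs with) or a PCI BAR.
gpu_log_lines() {
    grep -iE 'nvrm|nvidia|BAR [0-9]' || true
}

# Reads lsmod-style text on stdin; succeeds if the core nvidia module is listed.
nvidia_module_loaded() {
    grep -qE '^nvidia[[:space:]]'
}

if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nvidia_module_loaded; then
        echo "nvidia module is loaded"
    else
        echo "nvidia module is not loaded"
    fi
fi

# dmesg may require root; errors are ignored and only the last matches shown.
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | gpu_log_lines | tail -n 20
fi
```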
Edit /etc/default/grub
# Edit /etc/default/grub
user@user:~$ sudo vim /etc/default/grub
# Add pci=realloc=off to GRUB_CMDLINE_LINUX=""
GRUB_CMDLINE_LINUX="pci=realloc=off"
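The same edit can be scripted instead of done by hand in vim. This is a minimal sketch, not a tested deployment tool: add_kernel_param is a hypothetical helper, it assumes the file has a double-quoted GRUB_CMDLINE_LINUX line, and it assumes the parameter contains no characters that are special to sed (such as | or &). The demonstration at the end runs on a scratch file rather than on the real /etc/default/grub.

```shell
#!/bin/sh
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX in a grub defaults file.
# Usage: add_kernel_param FILE PARAM   (hypothetical helper, not part of GRUB)
add_kernel_param() {
    file=$1
    param=$2
    # Nothing to do if the parameter is already on the line.
    if grep -qE "^GRUB_CMDLINE_LINUX=\".*${param}.*\"" "$file"; then
        return 0
    fi
    # Keep a backup next to the original.
    cp "$file" "$file.bak"
    # Insert the parameter before the closing quote, then drop the leading
    # space that appears when the original value was empty.
    sed -E "s|^GRUB_CMDLINE_LINUX=\"(.*)\"|GRUB_CMDLINE_LINUX=\"\1 ${param}\"|; s|^(GRUB_CMDLINE_LINUX=\") |\1|" "$file.bak" > "$file"
}

# Demonstration on a scratch copy.
scratch=$(mktemp)
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\nGRUB_CMDLINE_LINUX=""\n' > "$scratch"
add_kernel_param "$scratch" "pci=realloc=off"
cat "$scratch"
rm -f "$scratch" "$scratch.bak"
```

To apply it to the real file, the function would have to run as root against /etc/default/grub, followed by the GRUB update and reboot described in this post.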
Update GRUB
user@user:~$ sudo update-grub
Reboot
user@user:~$ sudo reboot
The moment of truth
user@user:~$ nvidia-smi
Wed Nov 16 13:04:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   36C    P0    67W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   36C    P0    63W / 300W |      0MiB / 81920MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
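For a check that can go into a script rather than being read by eye, nvidia-smi has a query mode that prints one line per GPU. The sketch below counts those lines. The expected count of 2 is an assumption taken from the two cards in the output above, and count_gpus is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: confirm that the expected number of GPUs is visible to the driver.

# Counts non-empty lines on stdin. With
# `nvidia-smi --query-gpu=name --format=csv,noheader` that is one line per GPU.
count_gpus() {
    grep -c . || true
}

expected=2   # assumption: two A100 cards, as on this server

if command -v nvidia-smi >/dev/null 2>&1; then
    # nvidia-smi exits non-zero when it cannot talk to the driver.
    if names=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null); then
        found=$(printf '%s\n' "$names" | count_gpus)
        if [ "$found" -eq "$expected" ]; then
            echo "all $expected GPUs are visible"
        else
            echo "expected $expected GPUs, found $found"
        fi
    else
        echo "nvidia-smi could not communicate with the driver"
    fi
fi
```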
How I got there
The root of all evil / where the dream began
# Key phrase
Ubuntu Server 20.04 LTS for Dell EMC PowerEdge Servers Release Notes
# The parameter that solves the problem
pci=realloc=off
Tracking down the source of the solution
Searching for pci=realloc=off showed that it is a kernel boot parameter
https://blog.csdn.net/liuzq/article/details/89682079
This led to the documentation on the kernel command line (/proc/cmdline)
https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html
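Since /proc/cmdline shows the parameters the running kernel was actually booted with, it is also the place to confirm that the change took effect after a reboot. A minimal sketch, with cmdline_has_param as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: check whether a parameter is on the running kernel's command line.

# Reads command-line text on stdin; succeeds only if the argument appears
# as a complete, space-separated parameter.
cmdline_has_param() {
    tr ' ' '\n' | grep -qxF -- "$1"
}

if [ -r /proc/cmdline ]; then
    if cmdline_has_param "pci=realloc=off" < /proc/cmdline; then
        echo "pci=realloc=off is active"
    else
        echo "pci=realloc=off is not on the running kernel's command line"
    fi
fi
```

Matching whole parameters rather than substrings avoids a false positive from a similar value such as pci=realloc=on.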
Finding out how to add a kernel parameter
https://linux.cn/article-2268-1.html
# Edit /etc/default/grub
user@user:~$ sudo vim /etc/default/grub
# Add pci=realloc=off to GRUB_CMDLINE_LINUX=""
GRUB_CMDLINE_LINUX="pci=realloc=off"
# Update GRUB
user@user:~$ sudo update-grub