Installing the NVIDIA Tesla A100 80G PCIe driver on a Dell R750 with Ubuntu 20.04 (working)

I wrote this document because I could not find a complete, working solution for this setup anywhere online.

Environment

NVIDIA A100 80GB PCIe GPU
Dell PowerEdge R750
Ubuntu 20.04

Prerequisites to confirm

  1. As a test, Windows was installed on the server; with the Windows driver installed, nvidia-smi produced output successfully, which shows that the server supports this GPU.
  2. The NVIDIA A100 80GB PCIe GPU is installed in a PCIe Gen4 x16 slot.
  3. Secure Boot is disabled in the BIOS.
  4. iDRAC version 5.00.10.20 added support for the NVIDIA A100 80GB PCIe GPU in the PowerEdge R750, PowerEdge R750xa, and PowerEdge R7525.
  5. The NVIDIA A100 is installed in PCIe slot 7 or 2.
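Two of these prerequisites can also be checked from Linux itself. The following is a minimal sketch, not part of the original procedure: `list_nvidia_devices` and `secure_boot_disabled` are hypothetical helper names, and it assumes `lspci` (pciutils) and `mokutil` are installed. It relies on 10de being NVIDIA's PCI vendor ID and on `mokutil --sb-state` printing "SecureBoot enabled" or "SecureBoot disabled".

```shell
#!/usr/bin/env bash

# Filter `lspci -nn` style text (read from stdin) down to NVIDIA devices.
# 10de is NVIDIA's PCI vendor ID; the device ID after the colon is not checked.
list_nvidia_devices() {
    grep -i '\[10de:[0-9a-f]\{4\}\]'
}

# Succeeds if the text printed by `mokutil --sb-state` reports Secure Boot as disabled.
secure_boot_disabled() {
    case "$1" in
        *"SecureBoot disabled"*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live system (not run here):
#   lspci -nn | list_nvidia_devices
#   secure_boot_disabled "$(mokutil --sb-state)" && echo "Secure Boot is off"
```

If the first command prints nothing, the card is not visible on the PCIe bus at all, and no driver change will help until that is fixed.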

Reproducing the problem

user@user:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
# In addition, NVIDIA-Linux-x86_64-515.65.07.run would not install after three days of trying. I switched gcc/g++ versions, changed the Ubuntu kernel, and tried every fix I could find online, then gave up on the .run installer.

Solution

Driver installation

user@user:~$ sudo apt-get install nvidia-driver-515
# The driver can also be installed through the graphical Additional Drivers tool.

Error after installation

user@user:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
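This message means the nvidia-smi tool could not reach the kernel driver, so a reasonable first check is whether the nvidia kernel module loaded at all. The sketch below is an assumption about how one might check, not something the original steps include; `nvidia_module_loaded` is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Succeeds if lsmod-style text on stdin lists the core "nvidia" module.
# Companion modules such as nvidia_uvm or nvidia_drm do not count.
nvidia_module_loaded() {
    grep -q '^nvidia[[:space:]]'
}

# On a live system (not run here):
#   lsmod | nvidia_module_loaded || echo "nvidia kernel module is not loaded"
#   sudo dmesg | grep -i -e nvrm -e nvidia   # messages logged by the driver, if any
```

A package that is installed while its module is not loaded points at a problem during boot or device initialization rather than at the package itself.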

Edit /etc/default/grub

# Edit /etc/default/grub
user@user:~$ sudo vim /etc/default/grub
# Add pci=realloc=off to GRUB_CMDLINE_LINUX=""
GRUB_CMDLINE_LINUX="pci=realloc=off"
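For servers that are provisioned by script, the same edit can be made without opening an editor. This is a sketch under stated assumptions: `add_kernel_param` is a hypothetical helper, the file is assumed to contain a double-quoted `GRUB_CMDLINE_LINUX="..."` line, and the parameter is assumed to contain no regex metacharacters, `|`, or `&`. It appends the parameter only if it is missing and leaves a `.bak` copy of the file.

```shell
#!/usr/bin/env bash

# Append a kernel parameter to GRUB_CMDLINE_LINUX unless it is already there.
# Usage: add_kernel_param FILE PARAM
add_kernel_param() {
    local file=$1 param=$2

    # Already present as a whole word inside the quoted value: nothing to do.
    if grep -Eq "^GRUB_CMDLINE_LINUX=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi

    # First expression appends " PARAM" before the closing quote.
    # Second expression removes the leading space left when the value was empty.
    # GRUB_CMDLINE_LINUX_DEFAULT is not touched because the pattern requires =" right after LINUX.
    sed -i.bak -E \
        -e "s|^(GRUB_CMDLINE_LINUX=\"[^\"]*)\"|\1 ${param}\"|" \
        -e 's|^(GRUB_CMDLINE_LINUX=") |\1|' \
        "$file"
}

# On a live system (not run here):
#   sudo bash -c "$(declare -f add_kernel_param); add_kernel_param /etc/default/grub pci=realloc=off"
```

Running it twice leaves the file unchanged the second time, which makes it safe to keep in a setup script.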

Update GRUB

user@user:~$ sudo update-grub

Reboot

user@user:~$ sudo reboot

The moment of truth

user@user:~$ nvidia-smi
Wed Nov 16 13:04:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   36C    P0    67W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   36C    P0    63W / 300W |      0MiB / 81920MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
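For a scripted check, the table above is awkward to parse; `nvidia-smi -L` prints one line per GPU instead. The sketch below counts those lines. `count_gpus` is a hypothetical helper name, and the expected count of 2 is simply the number of cards in this particular server.

```shell
#!/usr/bin/env bash

# Count GPUs in `nvidia-smi -L` style text read from stdin.
# Each GPU appears on a line of the form "GPU <index>: <name> (UUID: ...)".
count_gpus() {
    # grep -c exits non-zero when the count is 0; keep the function's status at 0.
    grep -Ec '^GPU [0-9]+:' || true
}

# On a live system (not run here):
#   [ "$(nvidia-smi -L | count_gpus)" -eq 2 ] && echo "both A100 cards are visible"
```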

Analysis of the process

The root of all evil / where the dream began

https://www.dell.com/community/PowerEdge-Hardware-General/NVIDIA-A100-in-Dell-Poweredge-R750-with-Ubuntu-20-04/td-p/8136431

# Key phrase
Ubuntu Server 20.04 LTS for Dell EMC PowerEdge Servers Release Notes
# The string that solves the problem
pci=realloc=off

Finding the original text of the solution

https://www.dell.com/support/manuals/zh-cn/ubuntu-server/ubuntu_20.04_rn_pub/nvidia-out-of-box-driver-fails-to-load-when-system-has-nvidia-gpgpus-on-ubuntu-20.04?guid=guid-030db733-1273-4e3b-a53f-b13f9f4c40f7&lang=en-us

Searching for pci=realloc=off shows that it is a kernel parameter

https://blog.csdn.net/liuzq/article/details/89682079

Learning about the kernel command line (/proc/cmdline)

https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html
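/proc/cmdline holds the parameters the running kernel was booted with, so after a reboot it can confirm whether pci=realloc=off actually took effect. A minimal sketch follows; `cmdline_has_param` is a hypothetical helper name, and it matches the parameter as a whole space-separated word.

```shell
#!/usr/bin/env bash

# Succeeds if PARAM appears as a whole word in the given kernel command line.
# Usage: cmdline_has_param "CMDLINE TEXT" PARAM
cmdline_has_param() {
    local cmdline=$1 param=$2
    # Pad both sides with spaces so the first and last words match as well.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live system (not run here):
#   cmdline_has_param "$(cat /proc/cmdline)" pci=realloc=off \
#       && echo "parameter is active" \
#       || echo "parameter is missing: check /etc/default/grub and rerun update-grub"
```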

Finding out how to add a kernel parameter

https://linux.cn/article-2268-1.html

# Edit /etc/default/grub
user@user:~$ sudo vim /etc/default/grub
# Add pci=realloc=off to GRUB_CMDLINE_LINUX=""
GRUB_CMDLINE_LINUX="pci=realloc=off"
# Update GRUB
user@user:~$ sudo update-grub