https://github.com/NVIDIA/cuda-samples/tree/master/Samples/UnifiedMemoryPerf
Unified and other CUDA Memories Performance
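This sample compares the performance of a matrix multiplication kernel using unified memory, with and without hints, against other memory types on a single GPU: zero-copy buffers, and pageable and page-locked host memory with synchronous and asynchronous transfers. The memory modes are labeled as follows: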
UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
"UMhint", // Managed Memory With Hints
"UMhntAs", // Managed Memory With_Hints Async
"UMeasy", // Managed_Memory with No Hints
"0Copy", // Zero Copy
"MemCopy", // USE HOST PAGEABLE AND DEVICE_MEMORY
"CpAsync", // USE HOST PAGEABLE AND DEVICE_MEMORY ASYNC
"CpHpglk", // USE HOST PAGELOCKED AND DEVICE MEMORY
"CpPglAs" // USE HOST PAGELOCKED AND DEVICE MEMORY ASYNC
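The differences between these modes come down to how the buffers are allocated and how data reaches the GPU. The following is a minimal sketch of three of them (UMhint, 0Copy, CpPglAs), not the sample's actual code. The `scale` kernel and the `CHECK` macro are hypothetical stand-ins introduced for illustration, and the `cudaMemAdvise` / `cudaMemPrefetchAsync` calls assume a CUDA 12.x or earlier toolkit, where they take an `int` device argument (newer toolkits changed these signatures).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                   cudaGetErrorString(err_));                    \
      std::exit(EXIT_FAILURE);                                   \
    }                                                            \
  } while (0)

// Hypothetical stand-in for the sample's matrix multiplication kernel.
__global__ void scale(float *data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);
  const int device = 0;
  const int threads = 256, blocks = (n + threads - 1) / threads;
  CHECK(cudaSetDevice(device));
  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));

  // UMhint: one managed pointer, usable from both host and device.
  float *um = nullptr;
  CHECK(cudaMallocManaged(&um, bytes));
  for (int i = 0; i < n; ++i) um[i] = 1.0f;
  int concurrent = 0;
  CHECK(cudaDeviceGetAttribute(&concurrent,
                               cudaDevAttrConcurrentManagedAccess, device));
  if (concurrent) {
    // Hints (assumed pre-CUDA-13 signatures). Skipping this branch
    // corresponds to the no-hint "UMeasy" mode.
    CHECK(cudaMemAdvise(um, bytes, cudaMemAdviseSetPreferredLocation, device));
    CHECK(cudaMemPrefetchAsync(um, bytes, device, stream));
  }
  scale<<<blocks, threads, 0, stream>>>(um, n, 2.0f);
  CHECK(cudaGetLastError());
  CHECK(cudaStreamSynchronize(stream));
  std::printf("UMhint : %f\n", um[0]);

  // 0Copy: mapped page-locked host memory; the kernel reads and writes
  // host memory directly, with no explicit copy.
  float *zcHost = nullptr, *zcDev = nullptr;
  CHECK(cudaHostAlloc(&zcHost, bytes, cudaHostAllocMapped));
  for (int i = 0; i < n; ++i) zcHost[i] = 1.0f;
  CHECK(cudaHostGetDevicePointer(&zcDev, zcHost, 0));
  scale<<<blocks, threads, 0, stream>>>(zcDev, n, 2.0f);
  CHECK(cudaGetLastError());
  CHECK(cudaStreamSynchronize(stream));
  std::printf("0Copy  : %f\n", zcHost[0]);

  // CpPglAs: page-locked host buffer, separate device buffer, async copies.
  // Replacing cudaMallocHost with malloc gives the pageable variants.
  float *pinned = nullptr, *dev = nullptr;
  CHECK(cudaMallocHost(&pinned, bytes));
  CHECK(cudaMalloc(&dev, bytes));
  for (int i = 0; i < n; ++i) pinned[i] = 1.0f;
  CHECK(cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream));
  scale<<<blocks, threads, 0, stream>>>(dev, n, 2.0f);
  CHECK(cudaGetLastError());
  CHECK(cudaMemcpyAsync(pinned, dev, bytes, cudaMemcpyDeviceToHost, stream));
  CHECK(cudaStreamSynchronize(stream));
  std::printf("CpPglAs: %f\n", pinned[0]);

  CHECK(cudaFree(um));
  CHECK(cudaFreeHost(zcHost));
  CHECK(cudaFreeHost(pinned));
  CHECK(cudaFree(dev));
  CHECK(cudaStreamDestroy(stream));
  return 0;
}
```

Assuming the file is saved as `modes.cu` (a hypothetical name), it can be built with `nvcc modes.cu -o modes`. The hints are guarded by `cudaDevAttrConcurrentManagedAccess` because prefetching managed memory is not supported on every platform.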
Test results:
- (Dell Precision 5520) Device 0: "Quadro M1200" (Maxwell cc5.0)
- Jetson Xavier, compute capability 7.2 (Volta)
- (Mechrevo S1): MX150 (Pascal)
