Professional CUDA C Programming
Code download: http://www.wrox.com/WileyCDA/WroxTitle/Professional-CUDA-C-Programming.productCd-1118739329,descCd-DOWNLOAD.html
Memory Management
Memory management in CUDA programming resembles ordinary C memory management, with the added responsibility that the programmer explicitly manages device memory and the movement of data between host and device:
- Allocating and freeing device memory
- Transferring data between the host and the device
Memory Allocation and Deallocation
// allocate device memory:
cudaError_t cudaMalloc(void **devPtr, size_t count);
// initialize device memory (set each of count bytes to value):
cudaError_t cudaMemset(void *devPtr, int value, size_t count);
// free device memory:
cudaError_t cudaFree(void *devPtr);
Device memory allocation is a very expensive operation, so device memory should be reused as much as possible rather than repeatedly allocated and freed.
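As a hedged sketch of this reuse pattern (the buffer size and iteration count are invented for illustration):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t nbytes = 1 << 20;
    float *d_buf;

    // allocate once, outside the loop -- allocation is the expensive part
    if (cudaMalloc((void **)&d_buf, nbytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    for (int iter = 0; iter < 100; iter++) {
        // reuse the same buffer every iteration; only its contents change
        cudaMemset(d_buf, 0, nbytes);
        // ... launch kernels that read/write d_buf ...
    }

    cudaFree(d_buf);   // free once, after the last use
    return 0;
}
```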
Memory Transfer
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                       enum cudaMemcpyKind kind);
// kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice,
// cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.
// cudaMemcpy is synchronous in most cases: it blocks the host
// until the copy has completed.
Example:
#include "../common/common.h"
#include <cuda_runtime.h>
#include <stdio.h>
/*
* An example of using CUDA's memory copy API to transfer data to and from the
* device. In this case, cudaMalloc is used to allocate memory on the GPU and
* cudaMemcpy is used to transfer the contents of host memory to an array
* allocated using cudaMalloc.
*/
int main(int argc, char **argv)
{
    // set up device
    int dev = 0;
    CHECK(cudaSetDevice(dev));

    // memory size
    unsigned int isize = 1 << 22;
    unsigned int nbytes = isize * sizeof(float);

    // get device information
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("%s starting at ", argv[0]);
    printf("device %d: %s memory size %d nbyte %5.2fMB\n", dev,
           deviceProp.name, isize, nbytes / (1024.0f * 1024.0f));

    // allocate the host memory
    float *h_a = (float *)malloc(nbytes);

    // allocate the device memory
    float *d_a;
    CHECK(cudaMalloc((float **)&d_a, nbytes));

    // initialize the host memory
    for (unsigned int i = 0; i < isize; i++) h_a[i] = 0.5f;

    // transfer data from the host to the device
    CHECK(cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice));

    // transfer data from the device to the host
    CHECK(cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost));

    // free memory
    CHECK(cudaFree(d_a));
    free(h_a);

    // reset device
    CHECK(cudaDeviceReset());
    return EXIT_SUCCESS;
}
Compile and run:
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter04$ nvcc -O3 memTransfer.cu -o memTransfer
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter04$ nvprof ./memTransfer
==8038== NVPROF is profiling process 8038, command: ./memTransfer
./memTransfer starting at device 0: GeForce GT 740M memory size 4194304 nbyte 16.00MB
==8038== Profiling application: ./memTransfer
==8038== Profiling result:
Time(%) Time Calls Avg Min Max Name
50.53% 2.7607ms 1 2.7607ms 2.7607ms 2.7607ms [CUDA memcpy HtoD]
49.47% 2.7030ms 1 2.7030ms 2.7030ms 2.7030ms [CUDA memcpy DtoH]

The figure above (a bandwidth diagram, not reproduced here) shows the transfer paths between CPU and GPU. The CPU-GPU link over PCIe is comparatively slow (around 8 GB/s), while transfers between the GPU and its on-board memory are far faster. For programming, this means always looking for ways to reduce data transfers between CPU and GPU.
Pinned Memory
Why do we need a virtual address space?
Suppose a process needs 4 MB of memory but only 1 MB of physical memory is available. If the process used physical addresses directly, it could not run at all for lack of memory; with virtual addressing, it can page in just the data it needs at any given time and still execute.
Host memory is managed in pages, with a mapping between virtual and physical addresses. To copy a host variable to the device, the driver needs that variable's physical address, but the host's virtual-to-physical mapping changes over time, so the device cannot safely access a pageable host variable directly. Therefore, when data in pageable host memory is sent to the device, the CUDA driver first allocates a temporary page-locked (pinned) host buffer and copies the host data into it; the GPU then reads from this pinned buffer, as shown in the figure below:

We can also allocate pinned memory explicitly:
cudaError_t cudaMallocHost(void **ptr, size_t count);
Because pinned memory is directly accessible to the device (the data still travels over PCIe; compared with the left-hand figure, it simply skips the pageable-to-pinned staging copy), it offers considerably higher read/write bandwidth than pageable memory. However, pinning reduces the amount of pageable memory available to the system and can hurt overall virtual-memory performance.
cudaError_t status = cudaMallocHost((void **)&h_aPinned, bytes);
if (status != cudaSuccess) {
    fprintf(stderr, "Error returned from pinned host memory allocation\n");
    exit(1);
}
// free pinned memory:
cudaError_t cudaFreeHost(void *ptr);
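The pinMemTransfer program profiled below is the earlier memTransfer example with malloc/free replaced by cudaMallocHost/cudaFreeHost. Its source is not reproduced in these notes; a minimal sketch of what the core change might look like, assuming the same CHECK macro from common.h:

```cuda
#include "../common/common.h"
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    unsigned int isize  = 1 << 22;
    unsigned int nbytes = isize * sizeof(float);

    // pinned (page-locked) host allocation instead of malloc
    float *h_a;
    CHECK(cudaMallocHost((void **)&h_a, nbytes));

    float *d_a;
    CHECK(cudaMalloc((void **)&d_a, nbytes));

    for (unsigned int i = 0; i < isize; i++) h_a[i] = 0.5f;

    // the DMA engine can read h_a directly -- no staging copy needed
    CHECK(cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_a));
    CHECK(cudaFreeHost(h_a));   // pinned memory must be freed with cudaFreeHost
    CHECK(cudaDeviceReset());
    return EXIT_SUCCESS;
}
```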
Example:
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter04$ nvcc -O3 pinMemTransfer.cu -o pinMemTransfer
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter04$ nvprof ./pinMemTransfer
==9488== NVPROF is profiling process 9488, command: ./pinMemTransfer
./pinMemTransfer starting at device 0: GeForce GT 740M memory size 4194304 nbyte 16.00MB canMap 1
==9488== Profiling application: ./pinMemTransfer
==9488== Profiling result:
Time(%) Time Calls Avg Min Max Name
50.71% 2.5983ms 1 2.5983ms 2.5983ms 2.5983ms [CUDA memcpy HtoD]
49.29% 2.5255ms 1 2.5255ms 2.5255ms 2.5255ms [CUDA memcpy DtoH]
Allocating pinned memory is more expensive than allocating pageable memory, but it performs well for large transfers; how much it helps also depends on the device's compute capability. Batching many small transfers into one large transfer, and using pinned memory for it, greatly reduces transfer overhead. On some GPUs, data transfers can also overlap with kernel computation.
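A hedged sketch of the batching idea: instead of issuing one cudaMemcpy per small array, pack the arrays into a single pinned staging buffer and transfer them in one call (the array count and sizes here are invented for illustration):

```cuda
#include <cuda_runtime.h>
#include <string.h>

#define NARRAYS 8
#define NELEM   1024

int main(void)
{
    size_t one   = NELEM * sizeof(float);
    size_t total = NARRAYS * one;

    // hypothetical small host arrays that would otherwise each need a memcpy
    static float hostArrays[NARRAYS][NELEM];
    for (int i = 0; i < NARRAYS; i++)
        for (int j = 0; j < NELEM; j++)
            hostArrays[i][j] = 0.5f;

    float *d_packed;
    cudaMalloc((void **)&d_packed, total);

    // one pinned staging buffer holding all the small arrays back to back
    float *h_staging;
    cudaMallocHost((void **)&h_staging, total);

    // packing in host memory is cheap...
    for (int i = 0; i < NARRAYS; i++)
        memcpy(h_staging + i * NELEM, hostArrays[i], one);

    // ...and one large PCIe transfer replaces NARRAYS small ones
    cudaMemcpy(d_packed, h_staging, total, cudaMemcpyHostToDevice);

    cudaFreeHost(h_staging);
    cudaFree(d_packed);
    return 0;
}
```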
Zero-Copy Memory
Normally the host cannot directly access device variables, and the device cannot directly access host variables. Zero-copy memory is the exception: both the host and the device can access it.
Advantages of zero-copy memory:
- Leveraging host memory when device memory is insufficient
- Avoiding explicit data transfers between host and device
- Improving PCIe transfer rates
One caveat: memory accesses to zero-copy data must be synchronized across host and device. If the host and the device modify the same zero-copy data at the same time, the result is undefined.
Zero-copy memory is, in essence, pinned memory that has also been mapped into the device's address space.
cudaError_t cudaHostAlloc(void **pHost, size_t count, unsigned int flags);
With cudaHostAllocDefault, cudaHostAlloc is equivalent to cudaMallocHost. The other flags:
- cudaHostAllocPortable: the pinned allocation is valid in all CUDA contexts, not only the context (thread) that performed the allocation.
- cudaHostAllocWriteCombined: for special system configurations; write-combined pinned memory transfers faster over PCIe but is inefficient for the host itself to read, so it is typically written by the host and read by the device.
- cudaHostAllocMapped: the most commonly used flag; it returns standard zero-copy memory. The device-side address can then be obtained with the following API:
cudaError_t cudaHostGetDevicePointer(void **pDevice, void *pHost, unsigned int flags);
// note: flags must currently be set to 0
Using zero-copy memory for device data that is read or written frequently is unwise; the root cause is again the low transfer speed between GPU and CPU. Under frequent access, zero-copy can perform considerably worse than global memory. The following program measures how zero-copy behaves under such access:
#include "../common/common.h"
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
/*
* This example demonstrates the use of zero-copy memory to remove the need to
* explicitly issue a memcpy operation between the host and device. By mapping
* host, page-locked memory into the device's address space, the address can
* directly reference a host array and transfer its contents over the PCIe bus.
*
* This example compares performing a vector addition with and without zero-copy
* memory.
*/
void checkResult(float *hostRef, float *gpuRef, const int N)
{
    double epsilon = 1.0E-8;

    for (int i = 0; i < N; i++)
    {
        // use fabs, not abs: abs truncates floating-point differences to int
        if (fabs(hostRef[i] - gpuRef[i]) > epsilon)
        {
            printf("Arrays do not match!\n");
            printf("host %5.2f gpu %5.2f at current %d\n", hostRef[i],
                   gpuRef[i], i);
            break;
        }
    }

    return;
}
void initialData(float *ip, int size)
{
    int i;

    for (i = 0; i < size; i++)
    {
        ip[i] = (float)(rand() & 0xFF) / 10.0f;
    }

    return;
}

void sumArraysOnHost(float *A, float *B, float *C, const int N)
{
    for (int idx = 0; idx < N; idx++)
    {
        C[idx] = A[idx] + B[idx];
    }
}

__global__ void sumArrays(float *A, float *B, float *C, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N) C[i] = A[i] + B[i];
}

__global__ void sumArraysZeroCopy(float *A, float *B, float *C, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N) C[i] = A[i] + B[i];
}
int main(int argc, char **argv)
{
    // set up device
    int dev = 0;
    CHECK(cudaSetDevice(dev));

    // get device properties
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));

    // check if support mapped memory
    if (!deviceProp.canMapHostMemory)
    {
        printf("Device %d does not support mapping CPU host memory!\n", dev);
        CHECK(cudaDeviceReset());
        exit(EXIT_SUCCESS);
    }

    printf("Using Device %d: %s ", dev, deviceProp.name);

    // set up data size of vectors
    int ipower = 10;

    if (argc > 1) ipower = atoi(argv[1]);

    int nElem = 1 << ipower;
    size_t nBytes = nElem * sizeof(float);

    if (ipower < 18)
    {
        printf("Vector size %d power %d nbytes %3.0f KB\n", nElem, ipower,
               (float)nBytes / (1024.0f));
    }
    else
    {
        printf("Vector size %d power %d nbytes %3.0f MB\n", nElem, ipower,
               (float)nBytes / (1024.0f * 1024.0f));
    }

    // part 1: using device memory
    // malloc host memory
    float *h_A, *h_B, *hostRef, *gpuRef;
    h_A     = (float *)malloc(nBytes);
    h_B     = (float *)malloc(nBytes);
    hostRef = (float *)malloc(nBytes);
    gpuRef  = (float *)malloc(nBytes);

    // initialize data at host side
    initialData(h_A, nElem);
    initialData(h_B, nElem);
    memset(hostRef, 0, nBytes);
    memset(gpuRef,  0, nBytes);

    // add vector at host side for result checks
    sumArraysOnHost(h_A, h_B, hostRef, nElem);

    // malloc device global memory
    float *d_A, *d_B, *d_C;
    CHECK(cudaMalloc((float **)&d_A, nBytes));
    CHECK(cudaMalloc((float **)&d_B, nBytes));
    CHECK(cudaMalloc((float **)&d_C, nBytes));

    // transfer data from host to device
    CHECK(cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice));

    // set up execution configuration
    int iLen = 512;
    dim3 block (iLen);
    dim3 grid  ((nElem + block.x - 1) / block.x);

    sumArrays<<<grid, block>>>(d_A, d_B, d_C, nElem);

    // copy kernel result back to host side
    CHECK(cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost));

    // check device results
    checkResult(hostRef, gpuRef, nElem);

    // free device global memory
    CHECK(cudaFree(d_A));
    CHECK(cudaFree(d_B));

    // free host memory
    free(h_A);
    free(h_B);

    // part 2: using zerocopy memory for array A and B
    // allocate zerocpy memory
    CHECK(cudaHostAlloc((void **)&h_A, nBytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void **)&h_B, nBytes, cudaHostAllocMapped));

    // initialize data at host side
    initialData(h_A, nElem);
    initialData(h_B, nElem);
    memset(hostRef, 0, nBytes);
    memset(gpuRef,  0, nBytes);

    // pass the pointer to device
    CHECK(cudaHostGetDevicePointer((void **)&d_A, (void *)h_A, 0));
    CHECK(cudaHostGetDevicePointer((void **)&d_B, (void *)h_B, 0));

    // add at host side for result checks
    sumArraysOnHost(h_A, h_B, hostRef, nElem);

    // execute kernel with zero copy memory
    sumArraysZeroCopy<<<grid, block>>>(d_A, d_B, d_C, nElem);

    // copy kernel result back to host side
    CHECK(cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost));

    // check device results
    checkResult(hostRef, gpuRef, nElem);

    // free memory
    CHECK(cudaFree(d_C));
    CHECK(cudaFreeHost(h_A));
    CHECK(cudaFreeHost(h_B));
    free(hostRef);
    free(gpuRef);

    // reset device
    CHECK(cudaDeviceReset());
    return EXIT_SUCCESS;
}
Compile and run:
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter04$ nvcc -O3 sumArrayZerocpy.cu -o sumZerocpy
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter04$ nvprof ./sumZerocpy
==11871== NVPROF is profiling process 11871, command: ./sumZerocpy
Using Device 0: GeForce GT 740M Vector size 1024 power 10 nbytes 4 KB
==11871== Profiling application: ./sumZerocpy
==11871== Profiling result:
Time(%) Time Calls Avg Min Max Name
36.62% 5.3440us 2 2.6720us 2.6560us 2.6880us [CUDA memcpy DtoH]
32.46% 4.7360us 1 4.7360us 4.7360us 4.7360us sumArraysZeroCopy(float*, float*, float*, int)
17.76% 2.5920us 2 1.2960us 1.2800us 1.3120us [CUDA memcpy HtoD]
13.16% 1.9200us 1 1.9200us 1.9200us 1.9200us sumArrays(float*, float*, float*, int)
Usage:
$ ./sumZerocopy <size-log-2>

Therefore, for a small block of memory shared between host and device, zero-copy is a good choice: it simplifies programming while offering decent performance.
Heterogeneous architectures come in two kinds: integrated and discrete. In an integrated architecture, the CPU and GPU sit on the same die and share physical memory; zero-copy memory is well suited there. In a discrete architecture, the CPU and GPU are on separate chips communicating over the PCIe bus, and zero-copy is appropriate only in specific scenarios. In any case, do not overuse zero-copy, because device threads read zero-copy memory very slowly.
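A hedged sketch of how a program might decide at runtime whether zero-copy is worth using, based on the integrated and canMapHostMemory fields of cudaDeviceProp:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    if (!prop.canMapHostMemory) {
        printf("zero-copy not supported on this device\n");
        return 0;
    }

    if (prop.integrated) {
        // CPU and GPU share physical memory: zero-copy avoids a real copy
        printf("integrated GPU: prefer zero-copy for shared buffers\n");
    } else {
        // discrete GPU: every zero-copy access crosses the PCIe bus
        printf("discrete GPU: use zero-copy only for small, rarely-read data\n");
    }
    return 0;
}
```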
Unified Virtual Addressing
Devices of compute capability 2.0 and above support a feature called Unified Virtual Addressing (UVA). First introduced in CUDA 4.0, it is supported on 64-bit Linux systems. As the figure below shows, with UVA the CPU and GPU share a single continuous address space:

Before UVA, we had to keep track of which pointers referred to host memory and which to device memory. With UVA, the physical location behind a pointer becomes transparent to us: we see one continuous address space.
Thus the address returned by cudaHostAlloc for pinned memory is valid on both the device and the host, and we can use it directly inside a kernel. Recall the zero-copy workflow from the previous section:
1. Allocate pinned memory mapped into the device address space.
2. Obtain the device-side mapped address from the host address.
3. Use that mapped address in the kernel.
With UVA, there is no need to obtain the device-side mapped address; the same address works everywhere, as the following code shows:
// allocate zero-copy memory at the host side
cudaHostAlloc((void **)&h_A, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void **)&h_B, nBytes, cudaHostAllocMapped);
// initialize data at the host side
initialData(h_A, nElem);
initialData(h_B, nElem);
// invoke the kernel with zero-copy memory
sumArraysZeroCopy<<<grid, block>>>(h_A, h_B, d_C, nElem);
Compile and run:
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/Solutions/chapter04$ nvcc -O3 sumArrayZerocpyUVA.cu -o sumArrayZerocpyUVA
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/Solutions/chapter04$ nvprof ./sumArrayZerocpyUVA
==16987== NVPROF is profiling process 16987, command: ./sumArrayZerocpyUVA
Using Device 0: GeForce GT 740M Vector size 16777216 power 24 nbytes 64 MB
sumArrays, elapsed = 0.015717 s
sumArraysZeroCopy, elapsed = 0.020800 s
sumArraysZeroCopy w/ UVA, elapsed = 0.020872 s
==16987== Profiling application: ./sumArrayZerocpyUVA
==16987== Profiling result:
Time(%) Time Calls Avg Min Max Name
29.82% 33.489ms 3 11.163ms 11.107ms 11.256ms [CUDA memcpy DtoH]
19.39% 21.775ms 2 10.887ms 10.847ms 10.927ms [CUDA memcpy HtoD]
18.50% 20.778ms 1 20.778ms 20.778ms 20.778ms sumArraysZeroCopyWithUVA(float*, float*, float*, int)
18.46% 20.733ms 1 20.733ms 20.733ms 20.733ms sumArraysZeroCopy(float*, float*, float*, int)
13.84% 15.545ms 1 15.545ms 15.545ms 15.545ms sumArrays(float*, float*, float*, int)
Unified Memory
CUDA 6.0 introduced a new feature, Unified Memory, to simplify memory management in CUDA.
Unified Memory depends on UVA, but they are entirely different technologies. UVA provides a single virtual address space covering all CPUs and GPUs, but UVA never migrates data from one physical location to another on its own; automatic migration is what Unified Memory adds.
Unified Memory provides a "single-pointer-to-data" model that is conceptually similar to zero-copy. However, zero-copy memory is allocated in host memory, so kernel performance generally suffers from high-latency accesses to it over the PCIe bus. Unified Memory, by contrast, decouples the memory space from the execution space: data is migrated transparently on demand to the host or the device, improving locality and performance.
The original CUDA program:
__global__ void AplusB(int *ret, int a, int b)
{
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main()
{
    int *ret;

    //**************************************
    cudaMalloc(&ret, 1000 * sizeof(int));
    AplusB<<<1, 1000>>>(ret, 10, 100);
    //**************************************

    int *host_ret = (int *)malloc(1000 * sizeof(int));
    cudaMemcpy(host_ret, ret, 1000 * sizeof(int), cudaMemcpyDefault);

    for (int i = 0; i < 1000; i++)
        printf("%d: A + B = %d\n", i, host_ret[i]);

    free(host_ret);
    cudaFree(ret);
    return 0;
}
Using Unified Memory:
#include "../common/common.h"
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void AplusB(int *ret, int a, int b)
{
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main()
{
    int *ret;

    //***********************************************
    CHECK(cudaMallocManaged(&ret, 1000 * sizeof(int)));
    AplusB<<<1, 1000>>>(ret, 10, 100);
    //***********************************************

    CHECK(cudaDeviceSynchronize());

    for (int i = 0; i < 1000; i++)
        printf("%d: A + B = %d\n", i, ret[i]);

    cudaFree(ret);
    return 0;
}
Comparing the two versions, the Unified Memory code is noticeably simpler: cudaMallocManaged() allocates a single block of memory that can be used both inside the kernel and in main, achieving the goal of unified addressing.
Note that main calls a synchronization function after launching the kernel. On reflection, the reason becomes clear: since this memory can be accessed by both the kernel and main, the synchronization resolves the access conflict, ensuring that main only reads the values after the kernel has finished writing them.
Note: Unified Memory requires compute capability 3.0 or higher and a 64-bit host application.
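On newer hardware, on-demand migration can be tuned further. The following is a hedged sketch that goes beyond the book: cudaMemPrefetchAsync (available since CUDA 8.0, effective on compute capability 6.0 and later) migrates managed data ahead of time so the kernel does not pay page-fault costs:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void AplusB(int *ret, int a, int b)
{
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main()
{
    int n = 1000, dev = 0;
    int *ret;
    cudaMallocManaged(&ret, n * sizeof(int));

    // migrate the managed block to the GPU before the launch
    cudaMemPrefetchAsync(ret, n * sizeof(int), dev);
    AplusB<<<1, n>>>(ret, 10, 100);

    // migrate the results back to the CPU before reading them
    cudaMemPrefetchAsync(ret, n * sizeof(int), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("0: A + B = %d\n", ret[0]);
    cudaFree(ret);
    return 0;
}
```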
Reference: http://blog.csdn.net/tom1027/article/details/44856875