最色网站在线观看,欧美韩国你懂得在线

Memory Hierarchy

GPU Hardware

Memory Abstraction

全局內(nèi)存 global memory

獨(dú)立于GPU核心的硬件RAM
GPU絕大多數(shù)內(nèi)存空間都是全局內(nèi)存
全局內(nèi)存的IO是GPU上最慢的IO形式（除了訪問(wèn)host端內(nèi)存）

通過(guò)cache L2(CC>=3.0)訪問(wèn)，cache line 大小128 bytes ，每個(gè)線程操作盡量少的cache line，速度更快

Examples of Global Memory Accesses. Examples of Global Memory Accesses by a Warp, 4-Byte Word per Thread, and Associated Memory Transactions for Compute Capabilities 2.x and Beyond

共享內(nèi)存 shared memory

SM（SM = streaming multiprocessor）中的內(nèi)存空間
最大48KB
作用域是線程塊

靜態(tài)分配語(yǔ)法
__shared__ float data[1024];
Declared in the kernel function, nothing in host code

動(dòng)態(tài)分配語(yǔ)法
Host:
kernel<<<grid_dim, block_dim, numBytesShMem>>>(args);
Device (in kernel):
extern __shared__ float s[];

多個(gè)動(dòng)態(tài)分配的變量  需要額外注意對(duì)齊
extern __shared__ int s[];
int *integerData = s;                        // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF];      // nC chars

myKernel<<<gridSize, blockSize, nI*sizeof(int)+nF*sizeof(float)+nC*sizeof(char)>>>(...);

共享內(nèi)存塊沖突

共享內(nèi)存分成相同大小的內(nèi)存塊，實(shí)現(xiàn)高速并行訪問(wèn)，但是當(dāng)多個(gè)線程的請(qǐng)求地址映射到同一個(gè)內(nèi)存塊block時(shí)，訪問(wèn)是串行的

步幅stride為n時(shí) 最大公約數(shù)為1，即gcd(n,32)==1 ,訪問(wèn)共享內(nèi)存可以避免塊沖突

Strided Shared Memory Accesses. Examples for devices of compute capability 3.x (in 32-bit mode) or compute capability 5.x and 6.x

Left
Linear addressing with a stride of one 32-bit word (no bank conflict).
Middle
Linear addressing with a stride of two 32-bit words (two-way bank conflict).
Right
Linear addressing with a stride of three 32-bit words (no bank conflict).

Irregular Shared Memory Accesses. Examples for devices of compute capability 3.x, 5.x, or 6.x

Left
Conflict-free access via random permutation.
Middle
Conflict-free access since threads 3, 4, 6, 7, and 9 access the same word within bank 5.
Right
Conflict-free broadcast access (threads access the same word within a bank).

本地內(nèi)存 local memory

位于堆棧中，不在寄存器中的所有內(nèi)容
作用域?yàn)樘囟ň€程
存儲(chǔ)在global內(nèi)存空間中，速度比寄存器慢很多

寄存器溢出 register spilling

內(nèi)核使用的寄存器比可用的寄存器多,存儲(chǔ)到local memory中

L1 cache

每個(gè)SM都有自己的L1 cache
可配置大小16KB/48KB cudaFuncSetCacheConfig
2.x Fermi - caches local & global memory
3.x 及以上 Kepler, Maxwell - only caches local memory

L2 cache

緩存 local and global memory
被所有的SM共享
大約為1MB

常量?jī)?nèi)存 constant memory

屬于全局內(nèi)存，大小64KB
線程請(qǐng)求同一個(gè)數(shù)據(jù)時(shí)很快，請(qǐng)求不同的數(shù)據(jù)時(shí)性能下降
在運(yùn)行中不變，所有constant變量的值必須在kernel啟動(dòng)之前從host設(shè)置
__global__ 函數(shù)參數(shù)通過(guò) constant memory穿的到device端，限定4 KB，即kernel參數(shù)通過(guò)常量?jī)?nèi)存?zhèn)鬟f

__constant__ float constData[256]; 
float data[256]; 
cudaMemcpyToSymbol(constData, data, sizeof(data)); 
cudaMemcpyFromSymbol(data, constData, sizeof(data));

常量緩存 constant cache

每個(gè)SM上大小8KB,CC>=5.0大小為10KB
把一個(gè)內(nèi)存地址廣播到所有的warp線程
可以加載靜態(tài)索引數(shù)據(jù)，通過(guò) “l(fā)oad uniform” (LDU)指令

紋理內(nèi)存空間 texture memory

類似constant memory，是只讀內(nèi)存，以某種形式訪問(wèn)的時(shí)候可以提升性能。原本是用在OpenGL和DirectX渲染管線中的。
有用的特點(diǎn)：

不需考慮要聚合coalescing訪問(wèn)的問(wèn)題
通過(guò)“CUDA Array”進(jìn)行緩存的2D或3D空間的數(shù)據(jù)位置
在1D，2D或3D數(shù)組上進(jìn)行快速插值
將整數(shù)轉(zhuǎn)換為“unitized”浮點(diǎn)數(shù)

用例：

通過(guò)紋理緩存和CUDA數(shù)組讀取輸入數(shù)據(jù)，以利用空間緩存
利用數(shù)字紋理功能。
與OpenGL和通用計(jì)算機(jī)圖形的交互

紋理緩存 read-only texture cache

CC ≥ 3.5 大多數(shù)的 __restrict__ 變量自動(dòng)加載到紋理緩存中了
通過(guò) __ldg函數(shù)強(qiáng)行加載到緩存

// 2D float texture 
texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;

cudaArray* cuArray; 
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>; 
cudaMallocArray(&cuArray, &channelDesc, width, height);
cudaMemcpyToArray(cuArray, 0, 0, h_data, size,
                      cudaMemcpyHostToDevice);

// Set texture reference parameters 
texRef.addressMode[0] = cudaAddressModeWrap; 
texRef.addressMode[1] = cudaAddressModeWrap; 
texRef.filterMode = cudaFilterModeLinear; 
texRef.normalized = true; 

// Bind the array to the texture reference 
cudaBindTextureToArray(texRef, cuArray, channelDesc);
cudaUnbindTexture (const textureReference *texref);
cudaFreeArray(cuArray);

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

CUDA——內(nèi)存層級(jí)

CUDA——內(nèi)存層級(jí)

全局內(nèi)存 global memory

共享內(nèi)存 shared memory

共享內(nèi)存塊沖突

本地內(nèi)存 local memory

寄存器溢出 register spilling

L1 cache

L2 cache

常量?jī)?nèi)存 constant memory

常量緩存 constant cache

紋理內(nèi)存空間 texture memory

紋理緩存 read-only texture cache

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

CUDA——內(nèi)存層級(jí)

全局內(nèi)存 global memory

共享內(nèi)存 shared memory

共享內(nèi)存塊沖突

本地內(nèi)存 local memory

寄存器溢出 register spilling

L1 cache

L2 cache

常量?jī)?nèi)存 constant memory

常量緩存 constant cache

紋理內(nèi)存空間 texture memory

紋理緩存 read-only texture cache

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av