国产又猛又黄又爽视频,婷婷久久人人摸

前言：Ascend C算子（TIK C++）使用C/C++作為前端開發(fā)語(yǔ)言，通過(guò)四層接口抽象、并行編程范式、孿生調(diào)試等技術(shù)，極大提高算子開發(fā)效率，助力AI開發(fā)者低成本完成算子開發(fā)和模型調(diào)優(yōu)部署。學(xué)習(xí)完理論后，上代碼，通過(guò)實(shí)踐理解Ascend C算子的概念，掌握開發(fā)流程，以及內(nèi)核調(diào)用符方式的調(diào)試方法。

一、算子分析

????????Add算子的數(shù)學(xué)公式:z= x+y，為簡(jiǎn)單起見，設(shè)定輸入張量x, y,z為固定shape(8,2048)，數(shù)據(jù)類型dtype為half類型，數(shù)據(jù)排布類型format為ND。

????????確定如下內(nèi)容：

????????1、計(jì)算邏輯:輸入數(shù)據(jù)需要先搬入到片上存儲(chǔ)，然后使用計(jì)算接口（TIK C++ API/矢量計(jì)算/雙目/ADD，采用2級(jí)接口）完成兩個(gè)加法運(yùn)算，得到最終結(jié)果，再搬出到外部存儲(chǔ)。

????????2、輸入與輸出

????????輸入：x,y：固定shape(8，2048)，數(shù)據(jù)排布類型為ND。

????????輸出：z：與輸入相同，固定shape(8，2048)，數(shù)據(jù)排布類型為ND。

????????3、核函數(shù)名稱和入?yún)?/p>

????????核函數(shù)名稱：定義為add_tik2

????????入?yún)?個(gè)，x，y，z：x，y為輸入向量在Global Memory上的內(nèi)存地址，z為計(jì)算結(jié)果輸出到Global Memory上的內(nèi)存地址。?

二、代碼分析

????代碼結(jié)構(gòu)：

一）算子實(shí)現(xiàn)——Add_tik2.cpp

1、核函數(shù)定義

extern "C" __global__ __aicore__ void add_tik2(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z)

2、核函數(shù)實(shí)現(xiàn)——算子類的init()和process()

1）在核函數(shù)里實(shí)例化算子類KernelAdd，并調(diào)用init()實(shí)現(xiàn)初始化；調(diào)用process()實(shí)現(xiàn)流水操作

extern "C" __global__ __aicore__ void add_tik2(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z)

{

? ? KernelAdd op;

? ? op.Init(x, y, z);

? ? op.Process();

}

2）KernelAdd算子類定義

class KernelAdd {

public:

? ? __aicore__ inline KernelAdd() {}

? ? __aicore__ inline void Init(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z)

? ? {

? ? ? ? // get start index for current core, core parallel

? ? ? ? xGm.SetGlobalBuffer((__gm__ half*)x + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);

? ? ? ? yGm.SetGlobalBuffer((__gm__ half*)y + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);

? ? ? ? zGm.SetGlobalBuffer((__gm__ half*)z + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);

? ? ? ? // pipe alloc memory to queue, the unit is Bytes

? ? ? ? pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));

? ? ? ? pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));

? ? ? ? pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));

? ? }

? ? __aicore__ inline void Process()

? ? {

? ? ? ? // loop count need to be doubled, due to double buffer

? ? ? ? constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;

? ? ? ? // tiling strategy, pipeline parallel

? ? ? ? for (int32_t i = 0; i < loopCount; i++) {

? ? ? ? ? ? CopyIn(i);

? ? ? ? ? ? Compute(i);

? ? ? ? ? ? CopyOut(i);

? ? ? ? }

? ? }

private:

? ? __aicore__ inline void CopyIn(int32_t progress)

? ? {

? ? ? ? // alloc tensor from queue memory

? ? ? ? LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();

? ? ? ? LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();

? ? ? ? // copy progress_th tile from global tensor to local tensor

? ? ? ? DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);

? ? ? ? DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);

? ? ? ? // enque input tensors to VECIN queue

? ? ? ? inQueueX.EnQue(xLocal);

? ? ? ? inQueueY.EnQue(yLocal);

? ? }

? ? __aicore__ inline void Compute(int32_t progress)

? ? {

? ? ? ? // deque input tensors from VECIN queue

? ? ? ? LocalTensor<half> xLocal = inQueueX.DeQue<half>();

? ? ? ? LocalTensor<half> yLocal = inQueueY.DeQue<half>();

? ? ? ? LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();

? ? ? ? // call Add instr for computation

? ? ? ? Add(zLocal, xLocal, yLocal, TILE_LENGTH);

? ? ? ? // enque the output tensor to VECOUT queue

? ? ? ? outQueueZ.EnQue<half>(zLocal);

? ? ? ? // free input tensors for reuse

? ? ? ? inQueueX.FreeTensor(xLocal);

? ? ? ? inQueueY.FreeTensor(yLocal);

? ? }

? ? __aicore__ inline void CopyOut(int32_t progress)

? ? {

? ? ? ? // deque output tensor from VECOUT queue

? ? ? ? LocalTensor<half> zLocal = outQueueZ.DeQue<half>();

? ? ? ? // copy progress_th tile from local tensor to global tensor

? ? ? ? DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);

? ? ? ? // free output tensor for reuse

? ? ? ? outQueueZ.FreeTensor(zLocal);

? ? }

private:

? ? TPipe pipe;

? ? // create queues for input, in this case depth is equal to buffer num

? ? TQue<QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;

? ? // create queue for output, in this case depth is equal to buffer num

? ? TQue<QuePosition::VECOUT, BUFFER_NUM> outQueueZ;

? ? GlobalTensor<half> xGm, yGm, zGm;

};

3）算子類——init()

__aicore__ inline void Init(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z)

{

? ? // get start index for current core, core parallel

? ? xGm.SetGlobalBuffer((__gm__ half*)x + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);

? ? yGm.SetGlobalBuffer((__gm__ half*)y + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);

? ? zGm.SetGlobalBuffer((__gm__ half*)z + block_idx * BLOCK_LENGTH, BLOCK_LENGTH);

? ? // pipe alloc memory to queue, the unit is Bytes

? ? pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));

? ? pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));

? ? pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));

}

4）算子類——process()

__aicore__ inline void Process()

{

? ? // loop count need to be doubled, due to double buffer

? ? constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;

? ? // tiling strategy, pipeline parallel

? ? for (int32_t i = 0; i < loopCount; i++) {

? ? ? ? CopyIn(i);

? ? ? ? Compute(i);

? ? ? ? CopyOut(i);

? ? }

}

__aicore__ inline void CopyIn(int32_t progress)

{

? ? // alloc tensor from queue memory

? ? LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();

? ? LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();

? ? // copy progress_th tile from global tensor to local tensor

? ? DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);

? ? DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);

? ? // enque input tensors to VECIN queue

? ? inQueueX.EnQue(xLocal);

? ? inQueueY.EnQue(yLocal);

}

__aicore__ inline void Compute(int32_t progress)

{

? ? // deque input tensors from VECIN queue

? ? LocalTensor<half> xLocal = inQueueX.DeQue<half>();

? ? LocalTensor<half> yLocal = inQueueY.DeQue<half>();

? ? LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();

? ? // call Add instr for computation

? ? Add(zLocal, xLocal, yLocal, TILE_LENGTH);

? ? // enque the output tensor to VECOUT queue

? ? outQueueZ.EnQue<half>(zLocal);

? ? // free input tensors for reuse

? ? inQueueX.FreeTensor(xLocal);

? ? inQueueY.FreeTensor(yLocal);

}

__aicore__ inline void CopyOut(int32_t progress)

{

? ? // deque output tensor from VECOUT queue

? ? LocalTensor<half> zLocal = outQueueZ.DeQue<half>();

? ? // copy progress_th tile from local tensor to global tensor

? ? DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);

? ? // free output tensor for reuse

? ? outQueueZ.FreeTensor(zLocal);

}

二）算子驗(yàn)證

1、算子調(diào)用——main.c

1）CPU方式——通過(guò)ICPU_RUN_KF宏調(diào)用

#ifdef __CCE_KT_TEST__

? ? uint8_t* x = (uint8_t*)tik2::GmAlloc(inputByteSize);

? ? uint8_t* y = (uint8_t*)tik2::GmAlloc(inputByteSize);

? ? uint8_t* z = (uint8_t*)tik2::GmAlloc(outputByteSize);

? ? ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);

? ? // PrintData(x, 16, printDataType::HALF);

? ? ReadFile("./input/input_y.bin", inputByteSize, y, inputByteSize);

? ? // PrintData(y, 16, printDataType::HALF);

? ? ICPU_RUN_KF(add_tik2, blockDim, x, y, z); // use this macro for cpu debug

? ? // PrintData(z, 16, printDataType::HALF);

? ? WriteFile("./output/output_z.bin", z, outputByteSize);

? ? tik2::GmFree((void *)x);

? ? tik2::GmFree((void *)y);

? ? tik2::GmFree((void *)z);

2）NPU方式——內(nèi)核調(diào)用符方式

使用NPU方式，需要按照AscendCL的編程流程調(diào)用。

#ifdef __CCE_KT_TEST__

//cpu 方式

#else

? ? aclInit(nullptr);

? ? aclrtContext context;

? ? aclError error;

? ? int32_t deviceId = 0;

? ? aclrtCreateContext(&context, deviceId);

? ? aclrtStream stream = nullptr;

? ? aclrtCreateStream(&stream);

? ? uint8_t *xHost, *yHost, *zHost;

? ? uint8_t *xDevice, *yDevice, *zDevice;

? ? aclrtMallocHost((void**)(&xHost), inputByteSize);

? ? aclrtMallocHost((void**)(&yHost), inputByteSize);

? ? aclrtMallocHost((void**)(&zHost), outputByteSize);

? ? aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);

? ? aclrtMalloc((void**)&yDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);

? ? aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);

? ? ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);

? ? // PrintData(xHost, 16, printDataType::HALF);

? ? ReadFile("./input/input_y.bin", inputByteSize, yHost, inputByteSize);

? ? // PrintData(yHost, 16, printDataType::HALF);

? ? aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE);

? ? aclrtMemcpy(yDevice, inputByteSize, yHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE);

? ? add_tik2_do(blockDim, nullptr, stream, xDevice, yDevice, zDevice); // call kernel in this function

? ? aclrtSynchronizeStream(stream);

? ? aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST);

? ? // PrintData(zHost, 16, printDataType::HALF);

? ? WriteFile("./output/output_z.bin", zHost, outputByteSize);

? ? aclrtFree(xDevice);

? ? aclrtFree(yDevice);

? ? aclrtFree(zDevice);

? ? aclrtFreeHost(xHost);

? ? aclrtFreeHost(yHost);

? ? aclrtFreeHost(zHost);

? ? aclrtDestroyStream(stream);

? ? aclrtResetDevice(deviceId);

? ? aclFinalize();

#endif

實(shí)質(zhì)上，使用的是內(nèi)核調(diào)用符方式：<<<>>>

#ifndef __CCE_KT_TEST__

// call of kernel function

void add_tik2_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y, uint8_t* z)

{

? ? add_tik2<<<blockDim, l2ctrl, stream>>>(x, y, z);

}

#endif

2、算子驗(yàn)證

????????通過(guò)numpy生成輸入x，y的值，并計(jì)算出x+y的值作為精度比對(duì)基準(zhǔn)，上述三個(gè)數(shù)據(jù)落盤存儲(chǔ)，然后調(diào)用寫好的add算子在CPU模式和npu模式下分別以落盤的x，y作為輸入，計(jì)算出結(jié)果z，并于numpy的計(jì)算結(jié)果進(jìn)行對(duì)比，驗(yàn)證。采用計(jì)算md5方式比較add算子和numpy對(duì)相同輸入的計(jì)算結(jié)果，兩者md5相同，則兩個(gè)文件完全相同。

1）生成基準(zhǔn)數(shù)據(jù)——add_tik2.py

????????用numpy的隨機(jī)生成輸入：input_x和input_y，并計(jì)算出input_x+input_y的值golden作為比對(duì)基準(zhǔn)數(shù)據(jù)，并落盤存儲(chǔ)。

import numpy as np

def gen_golden_data_simple():

? ? input_x = np.random.uniform(-100, 100, [8, 2048]).astype(np.float16)

? ? input_y = np.random.uniform(-100, 100, [8, 2048]).astype(np.float16)

? ? golden = (input_x + input_y).astype(np.float16)

? ? input_x.tofile("./input/input_x.bin")

? ? input_y.tofile("./input/input_y.bin")

? ? golden.tofile("./output/golden.bin")

if __name__ == "__main__":

? ? gen_golden_data_simple()

2）數(shù)據(jù)比對(duì)

????????直接比較算子計(jì)算結(jié)果和基準(zhǔn)數(shù)據(jù)的md5，兩者相同，則數(shù)據(jù)完全相同。在run.sh的末尾處。

# 驗(yàn)證計(jì)算結(jié)果

echo "md5sum: ";md5sum output/*.bin

三、運(yùn)行調(diào)試

????????本次訓(xùn)練營(yíng)沒(méi)有提供開發(fā)環(huán)境，提供了一個(gè)沙箱，沙箱已經(jīng)安裝好了開發(fā)環(huán)境。首先把代碼搞沙箱里面。老師為了簡(jiǎn)化操作，提前將cpu和npu模式下的編譯和運(yùn)行，封裝到腳本run.sh中。使用腳本命令分別執(zhí)行CPU或NPU模式下的調(diào)試。

????????一）CPU模式下運(yùn)行、調(diào)試

????????1、編譯、運(yùn)行：

bash run.sh add_tik2 ascend910 aicore cpu

????????編譯及運(yùn)行結(jié)果：

?????2、gdb調(diào)試：

????????使用gdb單步調(diào)試算子計(jì)算精度，也可以在代碼中直接編寫printf(...)來(lái)觀察數(shù)值的輸出。由于cpu調(diào)測(cè)已轉(zhuǎn)為多進(jìn)程調(diào)試，每個(gè)核都是一個(gè)獨(dú)立的子進(jìn)程，故gdb需要轉(zhuǎn)換成子進(jìn)程調(diào)試的方式。

????????在gdb啟動(dòng)后，首先設(shè)置跟蹤子進(jìn)程，之后再打斷點(diǎn)，就會(huì)停留在子進(jìn)程中，設(shè)置的命令為：

set follow-fork-mode child

? ? ? ? 這樣，停留在遇到斷點(diǎn)的第一個(gè)子進(jìn)程中。其余不再贅述。

????????二）NPU模式下運(yùn)行、調(diào)試

????????1、運(yùn)行：

bash run.sh add_tik2 ascend910 aicore npu

????????編譯及運(yùn)行結(jié)果：

????????2、調(diào)試：

????????在真實(shí)芯片上獲取profiling數(shù)據(jù)，進(jìn)行性能精細(xì)調(diào)優(yōu)。

msprof --application="./add_tik2_npu" --output="./out" --ai-core=on --aic-metrics="PipeUtilization"

? ? ? ? 執(zhí)行過(guò)程如下：

?????????執(zhí)行后，對(duì)Profiling數(shù)據(jù)進(jìn)行解析與導(dǎo)出，存放在工程的下述目錄下。

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

【2023 · CANN訓(xùn)練營(yíng)第一季】——Ascend C算子代碼分析—Add算子（內(nèi)核調(diào)用符方式）

【2023 · CANN訓(xùn)練營(yíng)第一季】——Ascend C算子代碼分析—Add算子（內(nèi)核調(diào)用符方式）

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

【2023 · CANN訓(xùn)練營(yíng)第一季】——Ascend C算子代碼分析—Add算子（內(nèi)核調(diào)用符方式）

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av