How do you use Ascend's ATB acceleration library?

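1 Introduction

The Ascend Transformer Boost acceleration library (hereafter the ATB acceleration library) is an efficient, reliable acceleration library built on Huawei Ascend AI processors and designed specifically for training and inference of Transformer-style models. For details, see: What is ATB?

So how does a novice programmer implement an ATB operator?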

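2 Implementing an ATB operator

The following is based on:

Operator Usage Guide - Acceleration Library Usage Guide - Ascend Transformer Boost Acceleration Library - Domain Acceleration Library Development - CANN Commercial Edition 8.0.RC2.2 Development Documentation - Ascend Community

Implementing an ATB operator takes roughly the following 10 steps, as shown in the figure below.

[Figure: the 10 steps for implementing an ATB operator]

step 1: Include the ACL and acceleration library header files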

#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"

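Note the following:

  • Install the ATB .so files first. Only then are the header files available and can the program link without errors.
  • Different operators may need different header files.
  • Add any other header files you need.

Reference:

Installation and Deployment - Ascend Transformer Boost Acceleration Library - Domain Acceleration Library Development - CANN Commercial Edition 8.0.RC2.2 Development Documentation - Ascend Community

step 2: Configure deviceId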

uint32_t deviceId = 0;
aclError status = aclrtSetDevice(deviceId);

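Set deviceId as needed. For example, on a single machine with multiple cards, the deviceId values available on Ascend are 0-7 (8 cards in total).

step 3: Create the operator object instance
As described in the earlier article "What is ATB?", ATB provides three kinds of operator implementation. Each is explained below.

1. Basic Operation (native operators)

Sub-step 1: Construct the Operation parameters

Instantiate the parameter struct for the operator you want to create. The parameter structs are defined in atb/infer_op_params.h and atb/train_op_params.h.

Take the Mul operator as an example. Mul belongs to Elewise, so its parameters can be constructed as follows: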

atb::infer::ElewiseParam mulParam;
mulParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_MUL;

Sub-step 2: Create the operator object instance

atb::Operation *op = nullptr;
atb::Status st = atb::CreateOperation(mulParam, &op);
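
When checking what the Mul operator returns, a host-side reference can be handy. The sketch below assumes ELEWISE_MUL multiplies two same-shaped tensors element by element. `ReferenceMul` is a hypothetical helper written for this article, not part of the ATB API, and it works on plain float vectors rather than device tensors.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical host-side reference: element-wise product of two equally sized buffers.
// Intended only for comparing against values copied back from the device.
std::vector<float> ReferenceMul(const std::vector<float> &a, const std::vector<float> &b)
{
    if (a.size() != b.size()) {
        throw std::invalid_argument("inputs must have the same number of elements");
    }
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] * b[i];
    }
    return out;
}
```

Calling `ReferenceMul({1.0f, 2.0f}, {3.0f, 4.0f})` yields a two-element vector holding the pairwise products, which can then be compared with the device output after converting it to float.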

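2. Plugin mechanism (plugin operators)

A plugin operator requires a kernel implemented in Ascend C or by some other means.

We recommend going straight to Section 3.2 of this article.

Reference:

Plugin Mechanism - ATB Operators

Sub-step 1: Develop the operator

This example creates an Add operator with Ascend C. You can implement a custom operator in another way if your needs differ.

See kernel_add.cpp:

plugin_op_demo/kernel/kernel_add.cpp · Si1verBul1et623548/atb-op-demo - Gitee - OSCHINA (gitee.com)

Sub-step 2: Create the operator object instance

CustomOperation *op = new CustomOperation("CustomOperation");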

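3. Graph Frame (graph operators)

A graph operator can be created and used in two ways: by configuring TensorIds or by configuring TensorNames.

Consider the graph operator structure shown in the figure below:

[Figure: graph operator structure]

From it, the TensorId and TensorName mapping can be configured as follows: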


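Table 1 TensorId and TensorName mapping

Tensor                TensorId    TensorName
Graph input 0         0           a
Graph input 1         1           b
Graph input 2         2           c
Graph output          3           output
Intermediate tensor   4           a_add_b_output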
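Graph construction method 1: Configure TensorIds

Sub-step 1: Construct the Operation parameters

Unlike the parameters of a single operator, the parameters of a graph operator carry graph-level information: the graph nodes, the number of input tensors, the number of output tensors, the number of intermediate tensors, and so on.

First, based on the designed graph structure, count the graph's input tensors (say x), output tensors (say y), and intermediate tensors (say z). Graph input tensors take IDs in [0, x - 1], graph output tensors take IDs in [x, x + y - 1], and intermediate tensors take IDs in [x + y, x + y + z - 1]. For the example mapping, see the Tensor and TensorId columns of Table 1.

Next, configure each node: the single-operator instance already created, its input tensors, and its output tensors. A node's input and output tensors may be graph inputs, graph outputs, or intermediate tensors, so pick each ID from the range that matches the tensor's type in the graph.

op0 and op1 in the example are created in the same way as a single operator.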

atb::GraphParam graphParam;
graphParam.inTensorNum = 3;                 // number of input tensors of the graph
graphParam.outTensorNum = 1;                // number of output tensors of the graph
graphParam.internalTensorNum = 1;           // number of intermediate tensors of the graph
graphParam.nodes.resize(2);                 // number of nodes in the graph, i.e. the number of single operators it contains
graphParam.nodes[0].operation = op0;        // single-operator instance for node 0
graphParam.nodes[0].inTensorIds = {0, 1};   // IDs of the input tensors node 0 consumes
graphParam.nodes[0].outTensorIds = {4};     // IDs of the output tensors node 0 produces
graphParam.nodes[1].operation = op1;        // single-operator instance for node 1
graphParam.nodes[1].inTensorIds = {4, 2};   // IDs of the input tensors node 1 consumes
graphParam.nodes[1].outTensorIds = {3};     // IDs of the output tensors node 1 produces
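
The ID ranges described above are easy to get wrong by one, so a small helper can make them explicit. This is a minimal sketch: `GraphTensorKind` and `ClassifyTensorId` are hypothetical names introduced here, and the function only encodes the three ranges [0, x - 1], [x, x + y - 1], and [x + y, x + y + z - 1].

```cpp
#include <cstdint>
#include <stdexcept>

enum class GraphTensorKind { Input, Output, Internal };

// Classifies a tensor ID given the number of graph inputs (x), outputs (y)
// and intermediate tensors (z), following the ranges stated in the text.
GraphTensorKind ClassifyTensorId(uint32_t id, uint32_t x, uint32_t y, uint32_t z)
{
    if (id < x) {
        return GraphTensorKind::Input;
    }
    if (id < x + y) {
        return GraphTensorKind::Output;
    }
    if (id < x + y + z) {
        return GraphTensorKind::Internal;
    }
    throw std::out_of_range("tensor id is outside [0, x + y + z - 1]");
}
```

With the counts used in the example (x = 3, y = 1, z = 1), IDs 0 to 2 are graph inputs, ID 3 is the graph output, and ID 4 is the intermediate tensor written by node 0 and read by node 1.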

Sub-step 2: Create the operator object instance

atb::Operation *op = nullptr;
atb::Status st = atb::CreateOperation(graphParam, &op);

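Graph construction method 2: Configure TensorNames

Building a graph with TensorIds requires defining every ID up front, which is tedious. This method identifies each tensor by a string instead, which is more readable. For the example mapping, see the Tensor and TensorName columns of Table 1 above.

Sub-step 1: Create the graph operator builder

atb::GraphOpBuilder* graphOpBuilder;
CreateGraphOpBuilder(&graphOpBuilder);

Sub-step 2: Initialize the graph operator builder

// Lambda that infers the output TensorDescs (DataType, Format, Shape, etc.) from the graph operator's input TensorDescs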
atb::InferShapeFunc inferShapeFunc = [=](const atb::SVector<atb::TensorDesc> &inTensorDescs, atb::SVector<atb::TensorDesc> &outTensorDescs) {
    outTensorDescs.at(0) = inTensorDescs.at(0);
    return atb::NO_ERROR;
};
graphOpBuilder->Init("DemoGraphOperation", inferShapeFunc, {"a", "b", "c"}, {"output"});

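Sub-step 3: Build the graph with the graph operator builder

While building the graph, you can define a lambda to reshape a tensor. The total number of elements must be the same before and after the reshape.

Single operators such as op0 are created in the same way as the single operator above.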

graphOpBuilder->AddOperation(op0, {"a", "b"}, {"a_add_b_output"});
graphOpBuilder->AddOperation(op1, {"a_add_b_output", "c"}, {"output"});
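
The reshape constraint mentioned above (same total size before and after) can be checked on the host before wiring a reshape lambda into the graph. The sketch below is an assumption-laden illustration: it uses `std::vector<int64_t>` for shapes rather than `atb::Dims`, and `NumElements` and `IsValidReshape` are hypothetical helper names.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Product of all dimensions; an empty shape is treated as a scalar with one element.
int64_t NumElements(const std::vector<int64_t> &dims)
{
    return std::accumulate(dims.begin(), dims.end(), int64_t{1}, std::multiplies<int64_t>());
}

// A reshape is acceptable only if it keeps the element count unchanged.
bool IsValidReshape(const std::vector<int64_t> &oldShape, const std::vector<int64_t> &newShape)
{
    return NumElements(oldShape) == NumElements(newShape);
}
```

For instance, merging the first two axes of a [2, 3, 4] tensor into [6, 4] passes the check, while [6, 3] does not.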

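Sub-step 4: Create the graph operator object instance

atb::Operation *op = graphOpBuilder->Build(); // check that op is not a null pointer before using it
DestroyGraphOpBuilder(graphOpBuilder); // destroy the graph operator builder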
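
To see how the name-based graph lines up with the ID-based one, the sketch below numbers tensor names in the order inputs, outputs, then intermediates. How GraphOpBuilder resolves names internally is not described in the source, so this is only an illustration of the correspondence. `NamedNode` and `AssignTensorIds` are hypothetical names.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One graph node described purely by the tensor names it reads and writes.
struct NamedNode {
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
};

// Numbers graph inputs first, then graph outputs, then any remaining node outputs
// (the intermediate tensors), mirroring the ID layout used by the TensorId method.
std::map<std::string, uint32_t> AssignTensorIds(const std::vector<std::string> &graphInputs,
                                                const std::vector<std::string> &graphOutputs,
                                                const std::vector<NamedNode> &nodes)
{
    std::map<std::string, uint32_t> ids;
    uint32_t next = 0;
    auto assign = [&](const std::string &name) {
        if (ids.emplace(name, next).second) {
            ++next; // only advance when the name was new
        }
    };
    for (const auto &name : graphInputs) {
        assign(name);
    }
    for (const auto &name : graphOutputs) {
        assign(name);
    }
    for (const auto &node : nodes) {
        for (const auto &name : node.outputs) {
            assign(name);
        }
    }
    return ids;
}
```

Feeding it the names from the example (inputs a, b, c, output "output", and the two AddOperation calls) gives a, b, c the IDs 0 to 2, "output" the ID 3, and "a_add_b_output" the ID 4, which matches the IDs used in method 1.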

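step 4: Create the input and output tensors and store them in a VariantPack
A VariantPack holds the lists of input and output tensors. Each input tensor passed in through a VariantPack must be larger than 0 and no larger than 256 GB.

// Set the attributes of each input tensor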
void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs) 
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Set up each input tensor and allocate its memory. The input tensors are filled by hand here; in a real project they could be converted from a torch tensor or another simple data structure
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    std::vector<char> zeroData(8, 0); // an all-zero host buffer
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, zeroData.data(), zeroData.size(), ACL_MEMCPY_HOST_TO_DEVICE); // copy host (CPU) memory to the NPU
        if (ret != 0) {
            std::cout << "copy error!";
            exit(0);
        }
    }
}

// Set up each output tensor and allocate its memory, in the same way as the input tensors
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}
// Build all input and output tensors as above and store them in the VariantPack
atb::VariantPack pack;
atb::SVector<atb::TensorDesc> intensorDescs;
atb::SVector<atb::TensorDesc> outtensorDescs;

uint32_t inTensorNum = op->GetInputNum();
uint32_t outTensorNum = op->GetOutputNum();
pack.inTensors.resize(inTensorNum);
intensorDescs.resize(inTensorNum);

CreateInTensorDescs(intensorDescs);
CreateInTensors(pack.inTensors, intensorDescs);
    
outtensorDescs.resize(outTensorNum);
pack.outTensors.resize(outTensorNum);
op->InferShape(intensorDescs, outtensorDescs);
CreateOutTensors(pack.outTensors, outtensorDescs);
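
The 8-byte host buffer used above matches a 2 x 2 tensor of 2-byte float16 elements. The sketch below shows that arithmetic together with the size limits stated for VariantPack inputs. `TensorBytes` is a hypothetical helper, shapes are plain vectors rather than `atb::Dims`, and reading "256 GB" as 256 * 2^30 bytes is an assumption.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Upper bound from the text; interpreting GB as GiB is an assumption made for this sketch.
constexpr uint64_t kMaxTensorBytes = 256ULL * 1024 * 1024 * 1024;

// Byte size of a dense tensor: product of the dimensions times the element size.
// Overflow for extremely large shapes is not handled in this sketch.
uint64_t TensorBytes(const std::vector<int64_t> &dims, uint64_t elemSize)
{
    uint64_t bytes = elemSize;
    for (int64_t d : dims) {
        if (d <= 0) {
            throw std::invalid_argument("every dimension must be positive");
        }
        bytes *= static_cast<uint64_t>(d);
    }
    if (bytes == 0 || bytes > kMaxTensorBytes) {
        throw std::out_of_range("tensor size must be in (0, 256 GB]");
    }
    return bytes;
}
```

For the shape used in this article, `TensorBytes({2, 2}, 2)` gives 8, the same length as `zeroData`.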

step 5: Create the context and configure the stream
The Context is mainly responsible for managing the streams used on the NPU.

atb::Context *context = nullptr;
st = atb::CreateContext(&context);

aclrtStream stream = nullptr;
status = aclrtCreateStream(&stream);
context->SetExecuteStream(stream);

step 6: Call the Setup interface to compute the workspace size

uint64_t workspaceSize = 0;
st = op->Setup(pack, workspaceSize, context);

step 7: Allocate NPU memory according to the workspace size

void *workspace = nullptr;
if (workspaceSize != 0) {
    status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
    if (status != 0) {
        std::cout << "alloc error!";
        exit(0);
    }
}

When the workspace size is 0, skip this step. Attempting the allocation anyway reports an error.

step 8: Call the Execute interface to run the operator

st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);
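
Most of the calls above return a status that the snippets store but never inspect. One way to surface failures is a small checking helper. This sketch assumes that both `aclError` and `atb::Status` convert to `int` and that 0 means success, which is consistent with the `!= 0` checks used in this article. `CheckStatus` is a hypothetical name, not an ATB or ACL function.

```cpp
#include <stdexcept>
#include <string>

// Throws with a readable message when a status code is non-zero.
// Assumption: 0 denotes success for the status types used in this article.
inline void CheckStatus(int status, const std::string &what)
{
    if (status != 0) {
        throw std::runtime_error(what + " failed with status " + std::to_string(status));
    }
}
```

In the demo this could wrap calls such as Setup and Execute, for example `CheckStatus(op->Setup(pack, workspaceSize, context), "Setup")`, so that a failing step stops the program with a message instead of being silently ignored.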

step 9: Destroy the created objects and free memory

// Synchronize the stream, i.e. wait for the device-side tasks to finish
auto ret = aclrtSynchronizeStream(stream);
if (ret != 0) {
    std::cout << "sync error!";
    exit(0);
}

status = aclrtDestroyStream(stream); // destroy the stream
st = atb::DestroyOperation(op);      // destroy the op object
st = atb::DestroyContext(context);   // destroy the context
// free the input tensors
for (size_t i = 0; i < pack.inTensors.size(); i++) {
    aclrtFree(pack.inTensors.at(i).deviceData);
}
// free the output tensors
for (size_t i = 0; i < pack.outTensors.size(); i++) {
    aclrtFree(pack.outTensors.at(i).deviceData);
}
status = aclrtFree(workspace);       // free the workspace
aclrtResetDevice(deviceId);          // reset the device
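
Each device buffer should be released exactly once, and manual cleanup makes it easy to miss one or free one twice. A possible alternative is to tie the release to object lifetime. The sketch below uses `std::malloc` and `std::free` as stand-ins for `aclrtMalloc` and `aclrtFree` so that it runs on a host without an NPU. `FakeDeviceMalloc`, `FakeDeviceFree`, `DeviceBuffer`, and `MakeDeviceBuffer` are all hypothetical names.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Stand-ins for aclrtMalloc/aclrtFree; the counter only exists to make leaks visible.
int g_liveBuffers = 0;

void *FakeDeviceMalloc(std::size_t size)
{
    void *p = std::malloc(size);
    if (p != nullptr) {
        ++g_liveBuffers;
    }
    return p;
}

void FakeDeviceFree(void *p)
{
    if (p != nullptr) {
        std::free(p);
        --g_liveBuffers;
    }
}

// Deleter that releases the buffer when the owning unique_ptr goes out of scope.
struct DeviceDeleter {
    void operator()(void *p) const { FakeDeviceFree(p); }
};
using DeviceBuffer = std::unique_ptr<void, DeviceDeleter>;

DeviceBuffer MakeDeviceBuffer(std::size_t size)
{
    return DeviceBuffer(FakeDeviceMalloc(size));
}
```

With this pattern the raw pointer handed to a tensor would come from `buffer.get()`, and the buffer is released once when its owner is destroyed, so no explicit free loop is needed.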

step 10: Run the demo
Compile the source file:

# Compile the demo project with g++; demo.cpp is the demo's source file
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" demo.cpp -l atb -l ascendcl -o demo

Here:

ATB_HOME_PATH is the installation path of the ATB library files.

Run:

./demo # run the executable

3 Complete code files

3.1 Complete single-operator example

The file is named atb_mul_operation.cpp

// step1: include the ACL and acceleration library header files
#include <iostream>
#include <vector>
#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"


void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs) 
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Set up each input tensor and allocate its memory. The input tensors are filled by hand here; in a real project they could be converted from a torch tensor or another simple data structure
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    std::vector<char> zeroData(8, 0); // an all-zero host buffer
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, zeroData.data(), zeroData.size(), ACL_MEMCPY_HOST_TO_DEVICE); // copy host (CPU) memory to the NPU
        if (ret != 0) {
            std::cout << "copy error!";
            exit(0);
        }
    }
}

// Set up each output tensor and allocate its memory, in the same way as the input tensors
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

int main() {
    // step2: configure deviceId
    uint32_t deviceId = 0;
    aclError status = aclrtSetDevice(deviceId);

    // step3: create the operator object instance. Mul is used as the example; it belongs to Elewise, so its parameters can be constructed as follows
    // Sub-step 1: construct the Operation parameters
    atb::infer::ElewiseParam mulParam;
    mulParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_MUL;

    // Sub-step 2: create the operator object instance
    atb::Operation *op = nullptr;
    atb::Status st = atb::CreateOperation(mulParam, &op);

    // step4: create the input and output tensors and store them in the VariantPack
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = op->GetInputNum();
    uint32_t outTensorNum = op->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);
    CreateInTensors(pack.inTensors, intensorDescs);
        
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    op->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // step5: create the context and configure the stream
    atb::Context *context = nullptr;
    st = atb::CreateContext(&context);

    aclrtStream stream = nullptr;
    status = aclrtCreateStream(&stream);
    context->SetExecuteStream(stream);

    // step6: call the Setup interface to compute the workspace size
    uint64_t workspaceSize = 0;
    st = op->Setup(pack, workspaceSize, context);

    // step7: allocate NPU memory according to the workspace size
    void *workspace = nullptr;
    if (workspaceSize != 0) {
        status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (status != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }

    // step8: call the Execute interface to run the operator
    st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

    // step9: destroy the created objects and free memory
    // Synchronize the stream, i.e. wait for the device-side tasks to finish
    auto ret = aclrtSynchronizeStream(stream);
    if (ret != 0) {
        std::cout << "sync error!";
        exit(0);
    }

    status = aclrtDestroyStream(stream); // destroy the stream
    st = atb::DestroyOperation(op);      // destroy the op object
    st = atb::DestroyContext(context);   // destroy the context
    // free the input tensors
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    // free the output tensors
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    status = aclrtFree(workspace);       // free the workspace
    aclrtResetDevice(deviceId);          // reset the device

    return 0;
}

See also:

single_op_demo/single_op_demo.cpp · Si1verBul1et623548/atb-op-demo - Gitee - OSCHINA (gitee.com)
Compile and run:

# Compile the demo with g++; atb_mul_operation.cpp is the source file
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_mul_operation.cpp -l atb -l ascendcl -o atb_mul_operation


# Run the executable
./atb_mul_operation

3.2 Complete example of the plugin mechanism (plugin operators)

Reference:

Si1verBul1et623548/atb-op-demo
gitee.com/geyunqi/atb-op-demo/tree/master/plugin_op_demo

Change into the plugin_op_demo directory and run

bash run.sh

The build output appears in plugin_op_demo/build

total 68
drwxr-xr-x. 3 root root  4096 Sep 29 20:02 ./
drwxr-xr-x. 5 root root  4096 Sep 29 20:02 ../
-rw-r--r--. 1 root root 14543 Sep 29 20:02 CMakeCache.txt
drwxr-xr-x. 6 root root  4096 Sep 29 20:02 CMakeFiles/
-rw-r--r--. 1 root root  5773 Sep 29 20:02 Makefile
-rw-r--r--. 1 root root  1664 Sep 29 20:02 cmake_install.cmake
-rwxr-xr-x. 1 root root 27720 Sep 29 20:02 libplugin_add.so*

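As you can see, the build currently produces a shared library (.so). Even so, the process inside the demo shows clearly enough how to write a single plugin operator.

3.3 Graph Frame (graph operators)

3.3.1 Implementation using graph construction method 1: configuring TensorIds

The file is named atb_add_graph_by_tensor_id.cpp

// step1: include the ACL and acceleration library header files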
#include <iostream>
#include <vector>
#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"


void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs) 
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Set up each input tensor and allocate its memory. The input tensors are filled by hand here; in a real project they could be converted from a torch tensor or another simple data structure
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        std::vector<uint16_t> hostData(atb::Utils::GetTensorNumel(inTensors.at(i)), 2);   // a host buffer with every 16-bit element set to 2
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, hostData.data(), hostData.size() * sizeof(uint16_t), ACL_MEMCPY_HOST_TO_DEVICE); // copy host (CPU) memory to the NPU
        if (ret != 0) {
            std::cout << "copy error!";
            exit(0);
        }
    }
}

// Set up each output tensor and allocate its memory, in the same way as the input tensors
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}


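// Two points need particular attention when constructing the graph parameters.
// First, the tensor IDs. The ATB graph interface splits tensors into three kinds: input, output and intermediate.
// Input and output tensors are those of the whole graph; intermediate tensors exist only inside the graph.
// Tensor IDs must be assigned in ascending order as inputs, then outputs, then intermediates,
// and the count of each kind must match the value set in the parameters.
// Second, the order of the nodes. Turn the computation graph into an ordered queue that follows its topology,
// while keeping the tensor-to-node relationships consistent with the computation graph.
void CreateGraphOperation(atb::GraphParam &opGraph, atb::Operation **operation)
{
    // Build the graph
    opGraph.inTensorNum = 4;
    opGraph.outTensorNum = 1;
    opGraph.internalTensorNum = 2;
    opGraph.nodes.resize(3);

    enum InTensorId {               // define each tensor ID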
        IN_TENSOR_A = 0,
        IN_TENSOR_B,
        IN_TENSOR_C,
        IN_TENSOR_D,
        ADD3_OUT,
        ADD1_OUT,
        ADD2_OUT
    };

    size_t nodeId = 0;
    atb::Node &addNode = opGraph.nodes.at(nodeId++);
    atb::Node &addNode2 = opGraph.nodes.at(nodeId++);
    atb::Node &addNode3 = opGraph.nodes.at(nodeId++);

    atb::infer::ElewiseParam addParam;
    addParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    atb::Status status = atb::CreateOperation(addParam, &addNode.operation);
    addNode.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B};
    addNode.outTensorIds = {ADD1_OUT};

    atb::infer::ElewiseParam addParam2;
    addParam2.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = atb::CreateOperation(addParam2, &addNode2.operation);
    addNode2.inTensorIds = {IN_TENSOR_C, IN_TENSOR_D};
    addNode2.outTensorIds = {ADD2_OUT};

    atb::infer::ElewiseParam addParam3;
    addParam3.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = atb::CreateOperation(addParam3, &addNode3.operation);
    addNode3.inTensorIds = {ADD1_OUT, ADD2_OUT};
    addNode3.outTensorIds = {ADD3_OUT};

    status = atb::CreateOperation(opGraph, operation);
}

void PrintOutTensorValue(atb::Tensor &outTensor)
{
    // Copy the output tensor back to the host and print it
    std::vector<uint16_t> outBuffer(atb::Utils::GetTensorNumel(outTensor));
    int ret = aclrtMemcpy(outBuffer.data(), outBuffer.size() * sizeof(uint16_t), outTensor.deviceData, outTensor.dataSize, ACL_MEMCPY_DEVICE_TO_HOST);
    if (ret != 0) {
        std::cout << "copy error!";
        exit(0);
    }
    for (size_t i = 0; i < outBuffer.size(); i = i + 1) {
        std::cout << "out[" << i << "] = " << (uint32_t)outBuffer.at(i) << std::endl;
    }
}

int main() {
    // step2: configure deviceId
    uint32_t deviceId = 0;
    aclError status = aclrtSetDevice(deviceId);

    // step3: create the graph operator object instance
    // Sub-step 1: construct the Operation parameters
    atb::Operation *op = nullptr;
    atb::GraphParam opGraph;

    // Sub-step 2: create opGraph
    CreateGraphOperation(opGraph, &op);

    // step4: create the input and output tensors and store them in the VariantPack
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = op->GetInputNum();
    uint32_t outTensorNum = op->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);
    CreateInTensors(pack.inTensors, intensorDescs);
        
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    op->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // step5: create the context and configure the stream
    atb::Context *context = nullptr;
    auto st = atb::CreateContext(&context);

    aclrtStream stream = nullptr;
    status = aclrtCreateStream(&stream);
    context->SetExecuteStream(stream);

    // step6: call the Setup interface to compute the workspace size
    uint64_t workspaceSize = 0;
    st = op->Setup(pack, workspaceSize, context);

    // step7: allocate NPU memory according to the workspace size
    void *workspace = nullptr;
    if (workspaceSize != 0) {
        status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (status != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }

    // step8: call the Execute interface to run the operator
    st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

    // step9: destroy the created objects and free memory
    // Synchronize the stream, i.e. wait for the device-side tasks to finish
    auto ret = aclrtSynchronizeStream(stream);
    if (ret != 0) {
        std::cout << "sync error!";
        exit(0);
    }

    // Print the values of the output tensor
    PrintOutTensorValue(pack.outTensors.at(0));

    status = aclrtDestroyStream(stream); // destroy the stream
    st = atb::DestroyOperation(op);      // destroy the op object
    st = atb::DestroyContext(context);   // destroy the context
    // free the input tensors
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    // free the output tensors
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    status = aclrtFree(workspace);       // free the workspace
    aclrtResetDevice(deviceId);          // reset the device

    return 0;
}
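
The comment in the listing asks for the nodes to be arranged in an order that follows the graph topology. A quick host-side check of that property might look like the sketch below. `NodeIds` and `IsTopologicallyOrdered` are hypothetical names, and the check only covers ordering: every node input must be a graph input or the output of an earlier node.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Tensor IDs read and written by one node, in the same spirit as inTensorIds/outTensorIds.
struct NodeIds {
    std::vector<uint32_t> inTensorIds;
    std::vector<uint32_t> outTensorIds;
};

// Returns true if each node only consumes graph inputs (IDs 0 .. inTensorNum - 1)
// or tensors produced by nodes that appear earlier in the list.
bool IsTopologicallyOrdered(uint32_t inTensorNum, const std::vector<NodeIds> &nodes)
{
    std::set<uint32_t> available;
    for (uint32_t id = 0; id < inTensorNum; ++id) {
        available.insert(id);
    }
    for (const auto &node : nodes) {
        for (uint32_t id : node.inTensorIds) {
            if (available.count(id) == 0) {
                return false; // consumed before any node produced it
            }
        }
        available.insert(node.outTensorIds.begin(), node.outTensorIds.end());
    }
    return true;
}
```

For the three Add nodes in the listing (inputs 0 to 3, ADD1_OUT = 5, ADD2_OUT = 6, ADD3_OUT = 4) the order used there passes the check, whereas putting the final Add first would fail it.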

Compile and run:

# Compile the demo with g++; atb_add_graph_by_tensor_id.cpp is the source file
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_id.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_id

# Run the executable
./atb_add_graph_by_tensor_id

# If a coredump occurs at run time, try adding -D_GLIBCXX_USE_CXX11_ABI=0 to the g++ command, so that the command above becomes:
#g++ -D_GLIBCXX_USE_CXX11_ABI=0 -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_id.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_id
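
One detail worth keeping in mind when reading the printed values: the tensors are declared as ACL_FLOAT16, while the host buffers hold `uint16_t` and the print routine shows the raw 16-bit values. Assuming float16 here means IEEE 754 half precision, the stored value 2 is the bit pattern 0x0002 (a tiny subnormal), not the number 2.0. The sketch below converts between float and half bit patterns on the host. `FloatToHalf` and `HalfToFloat` are hypothetical helpers, and the encoder is deliberately simplified (it truncates the mantissa, flushes values below the normal half range to zero, and maps overflow and NaN to infinity).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Simplified float -> half conversion: truncation, no subnormal output, overflow/NaN -> infinity.
uint16_t FloatToHalf(float f)
{
    uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof(bits));
    const uint32_t sign = (bits >> 16) & 0x8000u;
    const int32_t exp = static_cast<int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    const uint32_t mant = bits & 0x7FFFFFu;
    if (exp >= 31) {
        return static_cast<uint16_t>(sign | 0x7C00u);
    }
    if (exp <= 0) {
        return static_cast<uint16_t>(sign);
    }
    return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp) << 10) | (mant >> 13));
}

// Exact half -> float conversion, including subnormals, infinity and NaN.
float HalfToFloat(uint16_t h)
{
    const uint32_t sign = (h >> 15) & 0x1u;
    const uint32_t exp = (h >> 10) & 0x1Fu;
    const uint32_t frac = h & 0x3FFu;
    float value = 0.0f;
    if (exp == 0) {
        value = std::ldexp(static_cast<float>(frac), -24);           // subnormal: frac * 2^-24
    } else if (exp == 31) {
        value = (frac != 0) ? std::numeric_limits<float>::quiet_NaN()
                            : std::numeric_limits<float>::infinity();
    } else {
        value = std::ldexp(static_cast<float>(frac | 0x400u), static_cast<int>(exp) - 25);
    }
    return (sign != 0) ? -value : value;
}
```

To feed the value 2.0 to the graph, the host buffer would be filled with `FloatToHalf(2.0f)` (the bit pattern 0x4000), and the output buffer would be passed through `HalfToFloat` before printing.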

3.3.2 Implementation using graph construction method 2: configuring TensorNames

The file is named atb_add_graph_by_tensor_name.cpp

// step1: include the ACL and acceleration library header files
#include <iostream>
#include <vector>
#include <acl/acl.h>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include "atb/infer_op_params.h"


void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs) 
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Set up each input tensor and allocate its memory. The input tensors are filled by hand here; in a real project they could be converted from a torch tensor or another simple data structure
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        std::vector<uint16_t> hostData(atb::Utils::GetTensorNumel(inTensors.at(i)), 2);   // a host buffer with every 16-bit element set to 2
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, hostData.data(), hostData.size() * sizeof(uint16_t), ACL_MEMCPY_HOST_TO_DEVICE); // copy host (CPU) memory to the NPU
        if (ret != 0) {
            std::cout << "copy error!";
            exit(0);
        }
    }
}

// Set up each output tensor and allocate its memory, in the same way as the input tensors
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

static uint64_t DIM3 = 3;

struct LlamaMlpParamGb {
    bool transpose = true;
};

atb::Operation* Linear(const LlamaMlpParamGb &param)
{
    atb::Operation* op = nullptr;
    atb::infer::LinearParam linearParam;
    linearParam.hasBias = false;
    linearParam.transposeB = param.transpose;
    CreateOperation(linearParam, &op);
    return op;
}

atb::Operation* Split(const LlamaMlpParamGb &param)
{
    atb::Operation* op = nullptr;
    atb::infer::SplitParam splitParam = {2, 2};
    CreateOperation(splitParam, &op);
    return op;
}

atb::Operation* Swish(const LlamaMlpParamGb &param)
{
    atb::Operation* op = nullptr;
    atb::infer::ActivationParam activationParam;
    activationParam.activationType = atb::infer::ActivationType::ACTIVATION_SWISH;
    CreateOperation(activationParam, &op);
    return op;
}

atb::Operation* Mul(const LlamaMlpParamGb &param)
{
    atb::Operation* op = nullptr;
    atb::infer::ElewiseParam elewiseParam;
    elewiseParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_MUL;
    CreateOperation(elewiseParam, &op);
    return op;
}

atb::Status CreateLlamaMlpOperationByGraphOpBuilder(const LlamaMlpParamGb &param, atb::Operation **operation)
{
    atb::InferShapeFunc inferShapeFunc = [=](const atb::SVector<atb::TensorDesc> &inTensorDescs,
                                atb::SVector<atb::TensorDesc> &outTensorDescs) {
        outTensorDescs.at(0) = inTensorDescs.at(0);
        if (param.transpose == true) {
            outTensorDescs.at(0).shape.dimNum = DIM3;
            outTensorDescs.at(0).shape.dims[0] = inTensorDescs.at(0).shape.dims[0];
            outTensorDescs.at(0).shape.dims[1] = inTensorDescs.at(0).shape.dims[1];
            outTensorDescs.at(0).shape.dims[2] = inTensorDescs.at(1).shape.dims[0] / 2;
        } else {
            outTensorDescs.at(0).shape.dimNum = DIM3;
            outTensorDescs.at(0).shape.dims[0] = inTensorDescs.at(0).shape.dims[0];
            outTensorDescs.at(0).shape.dims[1] = inTensorDescs.at(0).shape.dims[1];
            outTensorDescs.at(0).shape.dims[2] = inTensorDescs.at(1).shape.dims[1] / 2;
        }
        return atb::NO_ERROR;
    };

    atb::ReshapeFunc reshape_01_2 = [](const atb::Dims &oldShape, atb::Dims &newShape) {
        newShape.dimNum = 2; // dimNum: 2
        newShape.dims[0] = oldShape.dims[0] * oldShape.dims[1];
        newShape.dims[1] = oldShape.dims[1];
    };
    atb::ReshapeFunc unsqueueze_0 = [](const atb::Dims &oldShape, atb::Dims &newShape) {
        newShape.dimNum = 3; // dimNum: 3
        newShape.dims[0] = 1;
        newShape.dims[1] = oldShape.dims[0];
        newShape.dims[2] = oldShape.dims[1];
    };
    atb::GraphOpBuilder* graphOpBuilder;
    CreateGraphOpBuilder(&graphOpBuilder);

    graphOpBuilder->Init(
        "LlamaMlpGraphOp",
        inferShapeFunc,
        {"hidden_states", "weight"},
        {"mlp_out"}
    );

    graphOpBuilder->Reshape("hidden_states", reshape_01_2, "hidden_states_");
    graphOpBuilder->AddOperation(Linear(param), {"hidden_states_", "weight"}, {"linear_out"});
    graphOpBuilder->Reshape("linear_out", unsqueueze_0, "linear_out_");
    graphOpBuilder->AddOperation(Split(param), {"linear_out_"}, {"gate_out", "up_out"});
    graphOpBuilder->AddOperation(Swish(param), {"gate_out"}, {"swish_out"});
    graphOpBuilder->AddOperation(Mul(param), {"swish_out", "up_out"}, {"mlp_out"});

    *operation = graphOpBuilder->Build();
    DestroyGraphOpBuilder(graphOpBuilder);
    return atb::NO_ERROR;
}

void PrintOutTensorValue(atb::Tensor &outTensor)
{
    // Copy the output tensor back to the host and print it
    std::vector<uint16_t> outBuffer(atb::Utils::GetTensorNumel(outTensor));
    int ret = aclrtMemcpy(outBuffer.data(), outBuffer.size() * sizeof(uint16_t), outTensor.deviceData, outTensor.dataSize, ACL_MEMCPY_DEVICE_TO_HOST);
    if (ret != 0) {
        std::cout << "copy error!";
        exit(0);
    }
    for (size_t i = 0; i < outBuffer.size(); i = i + 1) {
        std::cout << "out[" << i << "] = " << (uint32_t)outBuffer.at(i) << std::endl;
    }
}

int main() {
    // step2: configure deviceId
    uint32_t deviceId = 0;
    aclError status = aclrtSetDevice(deviceId);

    // step3: create the graph operator object instance
    // Sub-step 1: construct the Operation parameters
    atb::Operation *op = nullptr;
    ::LlamaMlpParamGb opGraph;

    // Sub-step 2: create opGraph
    CreateLlamaMlpOperationByGraphOpBuilder(opGraph, &op);

    // step4: create the input and output tensors and store them in the VariantPack
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = op->GetInputNum();
    uint32_t outTensorNum = op->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);
    CreateInTensors(pack.inTensors, intensorDescs);
        
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    op->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // step5: create the context and configure the stream
    atb::Context *context = nullptr;
    auto st = atb::CreateContext(&context);

    aclrtStream stream = nullptr;
    status = aclrtCreateStream(&stream);
    context->SetExecuteStream(stream);

    // step6: call the Setup interface to compute the workspace size
    uint64_t workspaceSize = 0;
    st = op->Setup(pack, workspaceSize, context);

    // step7: allocate NPU memory according to the workspace size
    void *workspace = nullptr;
    if (workspaceSize != 0) {
        status = aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (status != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }

    // step8: call the Execute interface to run the operator
    st = op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);

    // step9: destroy the created objects and free memory
    // Synchronize the stream, i.e. wait for the device-side tasks to finish
    auto ret = aclrtSynchronizeStream(stream);
    if (ret != 0) {
        std::cout << "sync error!";
        exit(0);
    }

    // Print the values of the output tensor
    PrintOutTensorValue(pack.outTensors.at(0));

    status = aclrtDestroyStream(stream); // destroy the stream
    st = atb::DestroyOperation(op);      // destroy the op object
    st = atb::DestroyContext(context);   // destroy the context
    // free the input tensors
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    // free the output tensors
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    status = aclrtFree(workspace);       // free the workspace
    aclrtResetDevice(deviceId);          // reset the device

    return 0;
}
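
The graph built in this listing chains Linear, Split, Swish, and Mul, which is the gated MLP pattern: the linear output is split into a gate half and an up half, and the result is Swish(gate) * up. A host-side reference for the part after the Linear node could look like the sketch below. It assumes Swish(x) = x * sigmoid(x), that is, a scale of 1, and that the split cuts the last axis into two equal halves. `Swish` and `GatedMlpRow` are hypothetical helper names operating on plain float vectors.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Swish activation with scale 1 (an assumption): x * sigmoid(x) = x / (1 + exp(-x)).
float Swish(float x)
{
    return x / (1.0f + std::exp(-x));
}

// Takes one row laid out as [gate half | up half] and returns Swish(gate) * up.
std::vector<float> GatedMlpRow(const std::vector<float> &linearOut)
{
    if (linearOut.size() % 2 != 0) {
        throw std::invalid_argument("row length must be even to split into two halves");
    }
    const std::size_t half = linearOut.size() / 2;
    std::vector<float> out(half);
    for (std::size_t i = 0; i < half; ++i) {
        out[i] = Swish(linearOut[i]) * linearOut[half + i];
    }
    return out;
}
```

The output row has half the length of the linear output, which is consistent with the division by 2 in the listing's shape inference lambda.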

Compile and run:

# Compile the demo with g++; atb_add_graph_by_tensor_name.cpp is the source file
g++ -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "${ATB_HOME_PATH}/lib" -L "${ASCEND_HOME_PATH}/lib64" atb_add_graph_by_tensor_name.cpp -l atb -l ascendcl -o atb_add_graph_by_tensor_name

# Run the executable
./atb_add_graph_by_tensor_name

# If a coredump occurs at run time, try adding -D_GLIBCXX_USE_CXX11_ABI=0 to the g++ command, so that the command above becomes:
#g++ -D_GLIBCXX_USE_CXX11_ABI=0 -I "${ATB_HOME_PATH}/include" -I "${ASCEND_HOME_PATH}/include" -L "