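Hi everyone, I'm 極智視界. This post dissects an implementation of the min-max symmetric quantization algorithm, using Tengine's implementation as the example.

Tengine is an excellent on-device deep learning inference framework open-sourced by OpenAILab. Its core is written mainly in C, with C++ used in the functional code wrapped around it. Quantization is an indispensable optimization step for inference acceleration, and mature inference frameworks generally split the quantization module out into a standalone set of tools, as Tengine, NCNN, Ascend and Cambricon all do. The main reason is that the quantization process is not strongly tied to the hardware, so decoupling it lets the tool do more.

min-max and KL quantization are the baseline, standard-issue algorithms when hardware vendors adapt an inference engine. KL quantization is popular with users; NVIDIA's TensorRT, for example, uses a KL quantization strategy. min-max, the subject here, is logically simple and works well, which makes it a good opener for this series on quantization implementations. Let's look at how the min-max quantization strategy is actually implemented in Tengine.

1. Using the quantization tool

Quantization splits into activation (dynamic) quantization and weight & bias (static) quantization. Weight & bias quantization has the larger effect on accuracy. Activation quantization affects the overall result less, but it still has to be done for the two to work together and reach a satisfactory result. In general, weights & biases are quantized per channel (perChannel), while activations are quantized per layer (perLayer). Here is why: convolution is the bulk of what gets quantized, and for convolution a per-channel activation scale conflicts with the way the kernel parameters are shared. Each output value sums contributions from all input channels, so a separate scale per input channel could not be factored out of the integer accumulation. Activations therefore generally use per-layer quantization, which fits convolution's parameter sharing.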

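Here are the arguments the Tengine quantization tool needs:

Input model: the fp32 tmfile model file passed in;
Output model: the int8 tmfile model file that is generated;
Calib images: the calibration images used for activation quantization;
Scale file: the calibration table file that is generated;
Algorithm: the quantization algorithm, one of MIN-MAX, KL, ACIQ, DFQ, EQ;
Dims: the shape of the calibration input, passed as three dimensions c h w; n is hard-coded to n = 1 in the code;
Mean: mean values for image preprocessing;
Scale: scale factors for image preprocessing;
BGR2RGB: channel conversion;
Center crop: image preprocessing, cropping;
Letter box: image preprocessing, resizing the image while keeping its aspect ratio;
YOLOv5 focus: YOLOv5-style focus preprocessing, which slices the image and stacks the slices along the channel dimension;
Thread num: number of threads used for quantization.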

2. min-max quantization

min-max is the simplest quantization algorithm. Its main logic is as follows:

[image: min-max quantization logic]
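Judging from the Tengine code quoted later in this post, the logic boils down to three formulas: scale = max(|min|, |max|) / 127, q = clip(round(x / scale), -127, 127), and x ~= q * scale. Here is a minimal standalone sketch of that idea. The function names minmax_scale, quantize and dequantize are made up for illustration and are not Tengine APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric min-max scale: max(|min|, |max|) / 127, zero point fixed at 0.
float minmax_scale(const std::vector<float>& x)
{
    assert(!x.empty());
    const auto mm = std::minmax_element(x.begin(), x.end());
    return std::max(std::fabs(*mm.first), std::fabs(*mm.second)) / 127.f;
}

// q = clip(round(x / scale), -127, 127)
std::int8_t quantize(float x, float scale)
{
    if (scale == 0.f)
        return 0; // all-zero tensor: nothing to scale
    float q = std::round(x / scale);
    q = std::min(127.f, std::max(-127.f, q));
    return static_cast<std::int8_t>(q);
}

// x ~= q * scale
float dequantize(std::int8_t q, float scale)
{
    return static_cast<float>(q) * scale;
}
```

Values outside the calibrated range saturate at -127 or 127, and the round trip through dequantize is off by at most about one scale step for values inside the range.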

The main code that implements the min-max method in Tengine is:

case ALGORITHM_MIN_MAX:{
    if (quant_tool.scale_file.empty()){
        quant_tool.scale_file = "table_minmax.scale";
        quant_tool.activation_quant_tool();
    }
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* Evaluate quantitative losses */
    if (quant_tool.evaluate){
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}

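The two key interfaces of the quantization search strategy are quant_tool.activation_quant_tool() and save_graph_i8_perchannel. For min-max they do two things, respectively:

(1) activation quantization, which produces table_minmax.scale;

(2) weight & bias quantization, which produces scale_weight.txt and scale_bias.txt.

2.1 Activation quantization

When reading the Tengine source, always keep hold of struct graph* ir_graph. The graph struct is the heart of it.

Activation quantization is a dynamic process: it needs the data distribution of every layer at run time, which is why you have to feed it a certain number of calibration images.

First, the preprocessing module, which is shared with the other quantization algorithms: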

// Bind input_tensor to the address of input_data; input_tensor itself lives in ir_graph->tensor_list.
// Note: be sure to spot this step, otherwise the code that follows is hard to read.
tensor_t input_tensor = get_graph_input_tensor(ir_graph, 0, 0);

if (set_tensor_shape(input_tensor, dims, 4) < 0){
    fprintf(stderr, "Set input tensor shape failed\n");
    return -1;
}

if (set_tensor_buffer(input_tensor, input_data.data(), img_size * sizeof(float)) < 0){
    fprintf(stderr, "Set input tensor buffer failed\n");
    return -1;
}

// prerun graph: performs some initialization
if (prerun_graph_multithread(ir_graph, this->opt) < 0){
    fprintf(stderr, "Prerun multithread graph failed.\n");
    return -1;
}

// Image preprocessing writes into input_data, which is bound to input_tensor and ir_graph->tensor_list[0] above,
// so modifying input_data modifies the tensor held in ir_graph->tensor_list.
get_input_data_cv(imgs_list[nums].c_str(), input_data.data(), img_c, img_h, img_w, mean, scale, sw_RGB, center_crop, letterbox_rows, letterbox_cols, focus);

Then run the graph once, which records the intermediate activations in ir_graph->tensor_list[i]:

if (run_graph(ir_graph, 1) < 0){
    fprintf(stderr, "Run graph failed\n");
    return -1;
}

Get the min and max values of the activations:

/* get the min/max value of activation tensor */
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* act_tensor = ir_graph->tensor_list[i];
    if (act_tensor->tensor_type == TENSOR_TYPE_VAR || act_tensor->tensor_type == TENSOR_TYPE_INPUT){
        float* start_addr = (float*)act_tensor->data;
        float* end_addr = (float*)act_tensor->data + act_tensor->elem_num;
        max_activation[i] = std::max(max_activation[i], *std::max_element(start_addr, end_addr));
        min_activation[i] = std::min(min_activation[i], *std::min_element(start_addr, end_addr));
    }
}
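Because this loop runs once per calibration image, max_activation[i] and min_activation[i] act as running extremes over the whole calibration set. The excerpt does not show how they are initialized, so the sketch below makes an assumption: the range starts out "empty" so that the first image always sets it. MinMaxTracker is a hypothetical helper, not a Tengine type.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Running min/max of one activation tensor across calibration images.
// Hypothetical helper for illustration only.
struct MinMaxTracker
{
    // Assumed initial values: an empty range that any real value will replace.
    float min_val = std::numeric_limits<float>::max();
    float max_val = std::numeric_limits<float>::lowest();

    // Fold the activations produced by one calibration image into the range.
    void update(const std::vector<float>& activations)
    {
        assert(!activations.empty());
        const auto mm = std::minmax_element(activations.begin(), activations.end());
        min_val = std::min(min_val, *mm.first);
        max_val = std::max(max_val, *mm.second);
    }
};
```

One tracker per activation tensor, updated after every run_graph call, would reproduce what the two arrays do in the excerpt. A single outlier image is enough to widen the range, which is the usual weakness of min-max calibration.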

Compute the activation quantization scale. For softmax layers the scale is always 1 / 127.f:

/* save the calibration file with min-max algorithm */
FILE* fp_minmax = fopen("table_minmax.scale", "wb");
if (fp_minmax == nullptr){
    fprintf(stderr, "Open table_minmax.scale failed\n");
    return -1;
}
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* t = ir_graph->tensor_list[i];
    if (t->tensor_type == TENSOR_TYPE_VAR || t->tensor_type == TENSOR_TYPE_INPUT){
        float act_scale = 1.f;
        int act_zero_point = 0;

        act_scale = std::max(std::abs(max_activation[i]), std::abs(min_activation[i])) / 127.f;

        /* the scale of softmax is always scale = 1 / 127.f */
        for (int j = 0; j < ir_graph->node_num; j++){
            struct node* noden = ir_graph->node_list[j];
            struct tensor* tensor_tmp = get_ir_graph_tensor(ir_graph, noden->output_tensors[0]);

            if (!(tensor_tmp->tensor_type == TENSOR_TYPE_INPUT || tensor_tmp->tensor_type == TENSOR_TYPE_VAR))
                continue;

            std::string tmp_op_name = get_op_name_from_type(noden->op.type);
            std::string cur_name = t->name;
            std::string tmp_name = tensor_tmp->name;

            if ((cur_name == tmp_name) && tmp_op_name == "Softmax"){
                act_scale = 1 / 127.f;
                break;
            }
        }

        /* one line per tensor: name, scale, zero point */
        fprintf(fp_minmax, "%s %f %d\n", ir_graph->tensor_list[i]->name, act_scale, act_zero_point);
    }
}
fclose(fp_minmax);
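Each line of the table therefore has the form "name scale zero_point". As a rough sketch of that format, here is a pair of helpers that write and read one line. ScaleEntry, format_scale_line and parse_scale_line are hypothetical names, and the sketch assumes tensor names contain no whitespace.

```cpp
#include <cassert>
#include <cstdio>
#include <sstream>
#include <string>

// One row of the calibration table (hypothetical type for illustration).
struct ScaleEntry
{
    std::string name;
    float scale = 0.f;
    int zero_point = 0;
};

// Format a row the same way as the fprintf call above: "%s %f %d".
std::string format_scale_line(const std::string& name, float scale, int zero_point)
{
    char buf[256];
    const int n = std::snprintf(buf, sizeof(buf), "%s %f %d", name.c_str(), scale, zero_point);
    assert(n > 0 && n < static_cast<int>(sizeof(buf))); // name assumed short enough
    return std::string(buf);
}

// Parse a row back; returns false if any of the three fields is missing or malformed.
bool parse_scale_line(const std::string& line, ScaleEntry& out)
{
    std::istringstream iss(line);
    return static_cast<bool>(iss >> out.name >> out.scale >> out.zero_point);
}
```

One thing the format makes visible: %f keeps six decimal places, so a very small scale is rounded when it is written to the file.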

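2.2 Weight & bias quantization

Weight & bias quantization differs from activation quantization. Activation quantization has to run inference on calibration images to capture the dynamic distribution of the input data, whereas weights & biases are static, so quantizing them does not require a forward pass.

2.2.1 Creating the graph

Load the tmfile and build the graph:

struct graph* ir_graph = (struct graph*)create_graph(nullptr, "tengine", model_file);
if (nullptr == ir_graph){
    fprintf(stderr, "Create graph failed.\n");
    return -1;
}

2.2.2 Optimizing the activation quantization scales

This step mainly applies the quant.inplace optimization, a quantization strategy aimed at non-convolution operators.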

if (inplace == 0){
    /* inplace disabled: every activation tensor takes its scale straight from the calibration table */
    for (int i = 0; i < ir_graph->tensor_num; i++){
        struct tensor* ir_tensor = ir_graph->tensor_list[i];
        if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
            ir_tensor->scale = layer_scale[ir_tensor->name];
            ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];
        }
    }
}
else{
    std::tr1::unordered_map<std::string, bool> layer_pass;
    /* walk the tensors from the end of the list back to the start */
    for (int i = ir_graph->tensor_num - 1; i >= 0; i--){
        struct tensor* ir_tensor = ir_graph->tensor_list[i];
        if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
            if (layer_pass[ir_tensor->name] == false){
                uint32_t ir_node_idx = ir_tensor->producer;
                struct node* t_node = ir_graph->node_list[ir_node_idx];

                std::string op_name = get_op_name_from_type(t_node->op.type);

                /* pass-through candidates: pooling with pool_method == 0 and ReLU with zero negative slope */
                bool poolTrue = false;
                bool reluTrue = false;
                if (op_name == "Pooling"){
                    struct pool_param* pool_param = (struct pool_param*)t_node->op.param_mem;
                    if (pool_param->pool_method == 0)
                        poolTrue = true;
                }
                else if (op_name == "ReLU"){
                    struct relu_param* relu_param = (struct relu_param*)t_node->op.param_mem;
                    if (relu_param->negative_slope == 0.f)
                        reluTrue = true;
                }
                if (op_name == "Flatten" || op_name == "Reshape" || op_name == "Squeeze" || op_name == "Clip" || op_name == "Slice" || poolTrue || reluTrue){
                    struct tensor* t_in_tensor = ir_graph->tensor_list[t_node->input_tensors[0]];
                    if (layer_scale[ir_tensor->name] != 0){
                        ir_tensor->scale = layer_scale[ir_tensor->name];
                        ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];

                        if (t_in_tensor->tensor_type == TENSOR_TYPE_VAR || t_in_tensor->tensor_type == TENSOR_TYPE_INPUT){
                            recursion_pass_through(ir_graph, ir_tensor->name, t_in_tensor, layer_used, layer_scale, layer_zeropoint, layer_pass);
                        }
                    }
                }
                else{
                    ir_tensor->scale = layer_scale[ir_tensor->name];
                    ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];
                }
                layer_pass[ir_tensor->name] = true;
            }
        }
    }
}
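The body of recursion_pass_through is not shown in this excerpt. Going by its arguments, one plausible reading is that the scale of a pass-through operator's output is pushed back onto its input, so both sides of the operator share a single scale. The toy sketch below illustrates only that idea on a linear chain of tensors. ChainTensor, is_pass_through and share_scales are hypothetical, the operator list is shortened, and Tengine's real graph handling is more general.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A toy linear chain: tensor i is produced by producer_op from tensor i - 1.
// Hypothetical model for illustration; not Tengine's graph structure.
struct ChainTensor
{
    std::string producer_op; // operator that produced this tensor
    float scale;             // calibrated activation scale
};

// Shortened, assumed list of operators that can share a scale across input and output.
bool is_pass_through(const std::string& op)
{
    return op == "ReLU" || op == "Flatten" || op == "Reshape";
}

// Walk from output to input; when a tensor comes from a pass-through operator,
// copy its scale onto the tensor feeding that operator.
void share_scales(std::vector<ChainTensor>& chain)
{
    assert(!chain.empty());
    for (std::size_t i = chain.size() - 1; i > 0; --i)
        if (is_pass_through(chain[i].producer_op))
            chain[i - 1].scale = chain[i].scale;
}
```

For a chain Input, Convolution, ReLU, Reshape, the Reshape output's scale ends up on the ReLU output and on the Convolution output, while the graph input keeps its own scale. With a shared scale, such an operator can work directly on the int8 values without requantizing in between.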

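2.2.3 Weight & bias quantization

The overall process is similar to activation quantization: first search for the min and max values, then clip and scale. Here it is not enough to compute the scale. The values themselves have to be clipped and scaled, because an int8 tmfile quantized model file must be generated. One more point to note: weights are quantized to int8, but biases are quantized to int32. After the weight matrix multiplication the result can easily overflow int8, so the result of the multiplication is stored as int32 and then added to the int32 bias.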
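A small sketch makes the overflow concrete. The helper names dot_i8 and dequantize_acc are made up for illustration. The only part taken from the Tengine code is the relation that the bias shares the scale in_scale * w_scale with the accumulator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Integer dot product of quantized activations and weights plus an int32 bias.
// A single int8 * int8 product already needs 16 bits, and the running sum
// grows with the number of terms, so the accumulator is int32.
std::int32_t dot_i8(const std::vector<std::int8_t>& x, const std::vector<std::int8_t>& w, std::int32_t bias)
{
    assert(x.size() == w.size());
    std::int32_t acc = bias;
    for (std::size_t i = 0; i < x.size(); ++i)
        acc += static_cast<std::int32_t>(x[i]) * static_cast<std::int32_t>(w[i]);
    return acc;
}

// Back to float: accumulator and bias share the scale in_scale * w_scale.
float dequantize_acc(std::int32_t acc, float in_scale, float w_scale)
{
    return static_cast<float>(acc) * in_scale * w_scale;
}
```

Three terms of 127 * 127 already sum to 48387, far beyond the int8 range. Because the bias is expressed in the same in_scale * w_scale units, it can be added to the accumulator as a plain integer.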

Beyond that, there is one more difference from activation quantization: activations are quantized perLayer, while weights & biases are quantized perChannel.

Weight int8 quantization:

/* quantize the weight data from fp32 to int8 */
if (op_name == "Convolution" || op_name == "FullyConnected" || op_name == "Deconvolution"){
    struct tensor* weight_tensor = ir_graph->tensor_list[noden->input_tensors[1]];

    int channel_num = weight_tensor->dims[0];
    int cstep = int(weight_tensor->elem_num / channel_num);
    float* weight_data = (float*)weight_tensor->data;
    int8_t* i8_weight_data = (int8_t*)sys_malloc(weight_tensor->elem_num * sizeof(int8_t));

    float* weight_scale_list = (float*)sys_malloc(channel_num * sizeof(float));
    int* weight_zp_list = (int*)sys_malloc(channel_num * sizeof(int));

    fprintf(fp_weight, "%s ", weight_tensor->name);
    /* calculate the per-channel quant scale of the weight, scale = max(abs(min), abs(max)) / 127 */
    if (internal){
        // TODO
        for (int ch = 0; ch < channel_num; ch++){
            weight_scale_list[ch] = weight_tensor->scale_list[ch];
            weight_zp_list[ch] = 0;
        }
    }
    else{
        for (int ch = 0; ch < channel_num; ch++){
            float* weight_data_ch_start = weight_data + ch * cstep;
            float* weight_data_ch_end = weight_data + (ch + 1) * cstep;
            float weight_max = *std::max_element(weight_data_ch_start, weight_data_ch_end);
            float weight_min = *std::min_element(weight_data_ch_start, weight_data_ch_end);

            weight_scale_list[ch] = std::max(std::abs(weight_max), std::abs(weight_min)) / 127.f;
            weight_zp_list[ch] = 0;
            fprintf(fp_weight, "%8.8f ", weight_scale_list[ch]);
        }
        fprintf(fp_weight, "\n");
    }

    /* quantize the value of weight from Float32 to Int8, value_i8 = (value_fp32 / scale).round().clip(-127, 127) */
    for (int ch = 0; ch < channel_num; ch++){
        for (int j = 0; j < cstep; j++){
            if (weight_data[ch * cstep + j] == 0 || weight_scale_list[ch] == 0)
                i8_weight_data[ch * cstep + j] = 0;
            else{
                float int8_data = round(weight_data[ch * cstep + j] / weight_scale_list[ch]);
                int8_data = int8_data > 127.f ? 127.f : int8_data;
                int8_data = int8_data < -127.f ? -127.f : int8_data;
                i8_weight_data[ch * cstep + j] = int8_t(int8_data);
            }
        }
    }

    weight_tensor->scale_list = weight_scale_list;
    weight_tensor->zp_list = weight_zp_list;
    weight_tensor->data_type = TENGINE_DT_INT8;
    weight_tensor->elem_size = sizeof(int8_t); // int8, signed char
    weight_tensor->data = i8_weight_data;
    weight_tensor->quant_param_num = channel_num;
}
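Stripped of the Tengine tensor bookkeeping, the per-channel logic above can be sketched as a standalone function. QuantizedWeights and quantize_weights_per_channel are hypothetical names, and the weights are assumed to be a flat buffer made of equal-sized contiguous blocks, one per output channel, which matches how cstep is used above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of per-channel quantization (hypothetical type for illustration).
struct QuantizedWeights
{
    std::vector<std::int8_t> data; // same layout as the float input
    std::vector<float> scales;     // one scale per output channel
};

QuantizedWeights quantize_weights_per_channel(const std::vector<float>& w, std::size_t channels)
{
    assert(channels > 0 && w.size() % channels == 0);
    const std::size_t step = w.size() / channels; // elements per channel
    QuantizedWeights out;
    out.data.resize(w.size());
    out.scales.resize(channels);
    for (std::size_t ch = 0; ch < channels; ++ch){
        // scale = max(|w|) over this channel / 127
        float abs_max = 0.f;
        for (std::size_t j = 0; j < step; ++j)
            abs_max = std::max(abs_max, std::fabs(w[ch * step + j]));
        const float scale = abs_max / 127.f;
        out.scales[ch] = scale;
        // q = clip(round(w / scale), -127, 127); an all-zero channel stays zero
        for (std::size_t j = 0; j < step; ++j){
            float q = (scale == 0.f) ? 0.f : std::round(w[ch * step + j] / scale);
            q = std::min(127.f, std::max(-127.f, q));
            out.data[ch * step + j] = static_cast<std::int8_t>(q);
        }
    }
    return out;
}
```

Since every channel gets its own scale, a channel with small weights still uses the full int8 range instead of being squeezed by a large value in some other channel.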

Bias int32 quantization:

/* quantize the bias data from fp32 to int32 */
if (noden->input_num > 2){
    struct tensor* input_tensor = ir_graph->tensor_list[noden->input_tensors[0]];
    struct tensor* bias_tensor = ir_graph->tensor_list[noden->input_tensors[2]];

    float* bias_scale_list = (float*)sys_malloc(bias_tensor->dims[0] * sizeof(float));
    int* bias_zp_list = (int*)sys_malloc(bias_tensor->dims[0] * sizeof(int32_t));

    float* bias_data = (float*)bias_tensor->data;
    int* int32_bias_data = (int*)sys_malloc(bias_tensor->elem_num * sizeof(int32_t));

    int bstep = int(bias_tensor->elem_num / channel_num);

    fprintf(fp_bias, "%s ", bias_tensor->name);

    /* calculate the quant scale value of bias perchannel, scale = scale_weight * scale_in */
    for (int ch = 0; ch < channel_num; ch++){
        bias_scale_list[ch] = weight_scale_list[ch] * input_tensor->scale;
        bias_zp_list[ch] = 0;

        fprintf(fp_bias, "%8.8f ", bias_scale_list[ch]);
    }
    fprintf(fp_bias, "\n");

    /* quantize the value of bias from Float32 to Int32, value_i32 = (value_fp32 / scale).round() */
    for (int ch = 0; ch < channel_num; ch++){
        for (int bi = 0; bi < bstep; bi++){
            if (bias_data[ch * bstep + bi] == 0 || bias_scale_list[ch] == 0)
                int32_bias_data[ch * bstep + bi] = 0;
            else
                int32_bias_data[ch * bstep + bi] = int(round(bias_data[ch * bstep + bi] / bias_scale_list[ch]));
        }
    }

    bias_tensor->scale_list = bias_scale_list;
    bias_tensor->zp_list = bias_zp_list;
    bias_tensor->data_type = TENGINE_DT_INT32;
    bias_tensor->elem_size = sizeof(int32_t); // int32, signed int
    bias_tensor->data = int32_bias_data;
    bias_tensor->quant_param_num = channel_num;
}
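The same step can be written as a short standalone sketch. quantize_bias is a hypothetical name, and the sketch assumes one bias value per output channel, which corresponds to bstep == 1 in the code above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// bias_i32[ch] = round(bias_fp32[ch] / (weight_scale[ch] * input_scale))
// Assumes one bias value per output channel.
std::vector<std::int32_t> quantize_bias(const std::vector<float>& bias,
                                        const std::vector<float>& weight_scales,
                                        float input_scale)
{
    assert(bias.size() == weight_scales.size());
    std::vector<std::int32_t> out(bias.size(), 0);
    for (std::size_t ch = 0; ch < bias.size(); ++ch){
        const float scale = weight_scales[ch] * input_scale;
        if (scale != 0.f) // a zero scale leaves the bias at 0, as in the excerpt
            out[ch] = static_cast<std::int32_t>(std::lround(bias[ch] / scale));
    }
    return out;
}
```

Like the excerpt, the sketch does not clamp the result. It relies on int32 being wide enough for the scaled bias.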

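That about covers weight & bias quantization.

This post walked through the implementation of the min-max quantization algorithm in detail, using Tengine's code as the example. I hope it helps a little with your learning.

[Official account link]
"[Model Inference] Quantization Implementation Series, Part 1: min-max Symmetric Quantization Explained in Detail"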
