I. Introduction
TVMC is a tool provided by the TVM Python package that lets you run auto-tuning, compilation, performance profiling, and model execution from the command line. Following the tutorial in the official TVM documentation, this article walks through how to use TVMC.
II. TVMC usage examples
1. Installation
After updating TVM to the latest code from GitHub:
cd tvm/python
python gen_requirements.py
python setup.py build
python setup.py install
A problem came up:
Processing dependencies for tvm==0.9.dev915+g5c29e55be
Searching for synr==0.6.0
Reading https://pypi.python.org/simple/synr/
Couldn't find index page for 'synr' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or working download links found for synr==0.6.0
error: Could not find suitable distribution for Requirement.parse('synr==0.6.0')
Solution:
pip install synr
python setup.py build
python setup.py install
Once installation finishes, running tvmc --help prints:
...
usage: tvmc [-v] [--version] [-h] {run,tune,compile} ...
TVM compiler driver
optional arguments:
-v, --verbose increase verbosity
--version print the version and exit
-h, --help show this help message and exit.
commands:
{run,tune,compile}
run run a compiled module
tune auto-tune a model
compile compile a model.
TVMC - TVM driver command-line interface
2. Usage example
TVMC is an application tool within the TVM Python package; once TVM is installed, it is available via the tvmc command. Next we use tvmc to run an image-classification model:
(1) Downloading the model
We use the ResNet50 model in ONNX format:
pip install onnx onnxoptimizer
wget https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet50-v2-7.onnx
(2) Compiling the model
tvmc compile --target "llvm" \
--output resnet50-v2-7-tvm.tar \
resnet50-v2-7.onnx
When the compilation finishes, it prints:
target={1: llvm -keys=cpu -link-params=0}
target={1: llvm -keys=cpu -link-params=0}, target_host=None
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
A resnet50-v2-7-tvm.tar archive is generated in the current directory; unpacking it reveals three files:
-rw-rw-r-- 1 user user 88K Mar 29 16:42 mod.json
-rw-rw-r-- 1 user user 98M Mar 29 16:42 mod.params
-rwxrwxr-x 1 user user 582K Mar 29 16:42 mod.so
mod.so: the model compiled into a C++ shared library, loadable by the TVM runtime;
mod.json: the TVM Relay computation graph, describing the model's nodes and the types and parameters of each node's inputs and outputs;
mod.params: the trained model parameters.
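As a quick sanity check before unpacking, the contents of the exported package can be listed with Python's standard tarfile module. This is a minimal sketch; the list_package helper is made up for illustration, and the path is assumed to be the archive produced by tvmc compile above:

```python
import tarfile

def list_package(path):
    """Return the sorted member names of a TVMC package archive."""
    with tarfile.open(path) as tar:
        return sorted(member.name for member in tar.getmembers())

# For a package produced by `tvmc compile`, this is expected to return
# ["mod.json", "mod.params", "mod.so"]:
# list_package("resnet50-v2-7-tvm.tar")
```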
(3) Tuning
By default TVMC tunes with the XGBoost tuner and requires an output file for the tuning records. Tuning is essentially a parameter search: each operator is tried with different parameter configurations, and the configuration that makes the model run fastest is kept. Because this is a search over a parameter space, it is usually quite time-consuming. In this example --number and --repeat limit the total number of tuning runs:
tvmc tune --target "llvm" \
--output resnet50-v2-7-autotuner_records.json \
--number 10 \
--repeat 10 \
resnet50-v2-7.onnx
When the whole process completes, we get the tuning-record file resnet50-v2-7-autotuner_records.json:
{"input": ["llvm -keys=cpu -link-params=0", "conv2d_NCHWc.x86", [["TENSOR", [1, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 7, 7], "float32"], [2, 2], [3, 3, 3, 3], [1, 1], "NCHW", "NCHW", "float32"], {}], "config": {"index": 42, "code_hash": null, "entity": [["tile_ic", "sp", [-1, 1]], ["tile_oc", "sp", [-1, 1]], ["tile_ow", "sp", [-1, 64]], ["unroll_kw", "ot", true]]}, "result": [[0.009887683900000001], 0, 6.844898462295532, 1648601601.8722398], "version": 0.2, "tvm_version": "0.9.dev0"}
{"input": ["llvm -keys=cpu -link-params=0", "conv2d_NCHWc.x86", [["TENSOR", [1, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 7, 7], "float32"], [2, 2], [3, 3, 3, 3], [1, 1], "NCHW", "NCHW", "float32"], {}], "config": {"index": 172, "code_hash": null, "entity": [["tile_ic", "sp", [-1, 1]], ["tile_oc", "sp", [-1, 4]], ["tile_ow", "sp", [-1, 1]], ["unroll_kw", "ot", false]]}, "result": [[0.008670090600000001], 0, 1.1055541038513184, 1648601602.1772938], "version": 0.2, "tvm_version": "0.9.dev0"}
...
Each record contains the input ("input"), the configuration ("config"), and the measured result ("result").
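Since the file is newline-delimited JSON, it is easy to post-process with standard tooling. The sketch below (using simplified records with the same shape as the log, not the full entries above) picks the configuration with the lowest mean measured cost, which is the same criterion TVM uses when applying the best tuning history:

```python
import json

def best_record(lines):
    """Return the record whose mean measured cost (result[0]) is lowest."""
    best, best_cost = None, float("inf")
    for line in lines:
        record = json.loads(line)
        costs, error_no = record["result"][0], record["result"][1]
        if error_no != 0:  # skip failed measurements
            continue
        mean_cost = sum(costs) / len(costs)
        if mean_cost < best_cost:
            best, best_cost = record, mean_cost
    return best

# Two simplified records mirroring the tuning log's structure
records = [
    '{"config": {"index": 42}, "result": [[0.0098876839], 0, 6.84, 1648601601.87]}',
    '{"config": {"index": 172}, "result": [[0.0086700906], 0, 1.10, 1648601602.18]}',
]
print(best_record(records)["config"]["index"])  # the faster config, index 172
```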
(4) Compiling the tuned model
With the tuning records in hand, the model can now be recompiled using them:
tvmc compile --target "llvm" \
--output resnet50-v2-7-tvm_autotuned.tar \
--tuning-records resnet50-v2-7-autotuner_records.json \
resnet50-v2-7.onnx
(5) Comparing results
- Input preprocessing
Save the following code as tvmc_pre_process.py:
from tvm.contrib.download import download_testdata
from PIL import Image  # requires the pillow package: pip install pillow
import numpy as np

img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

# ResNet50 expects 224x224 input images
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# ONNX expects NCHW input, so move the channel axis first (HWC -> CHW)
img_data = np.transpose(img_data, (2, 0, 1))

# Normalize the input with the ImageNet mean and standard deviation
imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_stddev = np.array([0.229, 0.224, 0.225])
norm_img_data = np.zeros(img_data.shape).astype("float32")
for i in range(img_data.shape[0]):
    norm_img_data[i, :, :] = (img_data[i, :, :] / 255 - imagenet_mean[i]) / imagenet_stddev[i]

# Add the batch dimension
img_data = np.expand_dims(norm_img_data, axis=0)

# Save as .npz, a format TVMC supports out of the box
np.savez("imagenet_cat", data=img_data)
- Output postprocessing
Save the following code as tvmc_post_process.py:
import os.path
import numpy as np
from scipy.special import softmax
from tvm.contrib.download import download_testdata

# Download the class labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")
with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

output_file = "predictions.npz"

# Read the inference output
if os.path.exists(output_file):
    with np.load(output_file) as data:
        scores = softmax(data["output_0"])  # apply softmax to the raw output
        scores = np.squeeze(scores)  # drop the size-1 dimensions
        ranks = np.argsort(scores)[::-1]  # indices sorted by descending score
        for rank in ranks[0:5]:  # print the top-5 classes
            print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
- Untuned model results
python tvmc_pre_process.py
tvmc run --inputs imagenet_cat.npz \
--output predictions.npz \
--print-time \
--repeat 100 \
resnet50-v2-7-tvm.tar
python tvmc_post_process.py
Output:
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
148.2297 142.3968 217.4981 137.0922 15.2046
class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367180
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261
- Tuned model results
python tvmc_pre_process.py
tvmc run --inputs imagenet_cat.npz \
--output predictions.npz \
--print-time \
--repeat 100 \
resnet50-v2-7-tvm_autotuned.tar
python tvmc_post_process.py
Output:
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
108.6551 108.0915 119.1327 106.1360 2.0859
class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367179
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261
Tuning clearly delivers a sizeable speedup, while the accuracy is essentially unaffected.
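Using the mean latencies reported above, the improvement can be quantified with a quick back-of-the-envelope calculation:

```python
untuned_mean_ms = 148.2297  # mean latency of the untuned model
tuned_mean_ms = 108.6551    # mean latency of the tuned model

speedup = untuned_mean_ms / tuned_mean_ms
reduction = (untuned_mean_ms - tuned_mean_ms) / untuned_mean_ms

print("speedup: %.2fx" % speedup)                        # roughly 1.36x
print("latency reduction: %.1f%%" % (reduction * 100))   # roughly 26.7%
```

Note also that the standard deviation drops from 15.2 ms to 2.1 ms, so the tuned model is not just faster but markedly more consistent.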
III. Basic implementation flow
The TVMC code lives under python/tvm/driver/tvmc. TVMC offers many command options; here we focus on the implementation flow of three subcommands: compile, tune, and run.
1. main
tvmc's command parsers are added through registration. main.py first defines a global list REGISTERED_PARSER and a registration function register_parser():
REGISTERED_PARSER = []
def register_parser(make_subparser):
REGISTERED_PARSER.append(make_subparser)
return make_subparser
When adding a new command parser, for example the parser for the compile subcommand, compiler.py implements it as follows:
@register_parser
def add_compile_parser(subparsers, _):
    parser = subparsers.add_parser("compile", help="compile a model.")
    parser.set_defaults(func=drive_compile)  # bind the func attribute
    ...
This effectively appends the add_compile_parser function object to the REGISTERED_PARSER list. The _main() function in main.py then iterates over this list and invokes each registered factory:
def _main(argv):
    ...
    for make_subparser in REGISTERED_PARSER:
        make_subparser(subparser, parser)
    ...
    args = parser.parse_args(argv)
    ...
    try:
        return args.func(args)  # call the function bound to the func attribute
    except TVMCImportError as err:
        ...
At this point the contents of args are:
Namespace(FILE='resnet50-v2-7.onnx', cross_compiler='', cross_compiler_options='', desired_layout=None, disabled_pass=[''], dump_code='', executor='graph',
......,
func=<function drive_compile at 0x7f12ca2f78b0>, input_shapes=None, model_format=None, opt_level=3, output='resnet50-v2-7-tvm_autotuned.tar', output_format='so', pass_config=None, runtime='cpp', target='llvm',
......
tuning_records='resnet50-v2-7-autotuner_records.json', verbose=0, version=False)
Here func=<function drive_compile at 0x7f12ca2f78b0> is the function the subcommand executes.
2. compiler
drive_compile() performs just two main operations:
(1) Loading the model via frontends.load_model()
The frontend interfaces wrapped in frontends.py are:
ALL_FRONTENDS = [
KerasFrontend,
OnnxFrontend,
TensorflowFrontend,
TFLiteFrontend,
PyTorchFrontend,
PaddleFrontend,
]
frontends.py first defines an abstract base class; every subclass must implement these three functions. In load(), each frontend reads the model from the given file path using its own framework's API and then calls the corresponding relay.frontend.from_xxx(model, ...) to load and convert it into a TVM model:
class Frontend(ABC):
    @staticmethod
    @abstractmethod
    def name():
        """Name of the frontend"""

    @staticmethod
    @abstractmethod
    def suffixes():
        """File suffixes of the model format"""

    @abstractmethod
    def load(self, path, shape_dict=None, **kwargs):
        """Load the model from the given path"""

class OnnxFrontend(Frontend):
    @staticmethod
    def name():
        return "onnx"

    @staticmethod
    def suffixes():
        return ["onnx"]

    def load(self, path, shape_dict=None, **kwargs):
        onnx = lazy_import("onnx")
        model = onnx.load(path)
        return relay.frontend.from_onnx(model, shape=shape_dict, **kwargs)
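The base class makes it straightforward for load_model() to pick a frontend by file suffix. The following standalone sketch imitates that lookup; the guess_frontend helper and the stub frontend classes here are simplified illustrations, not the real implementation in frontends.py:

```python
from abc import ABC, abstractmethod
import os

class Frontend(ABC):
    @staticmethod
    @abstractmethod
    def name(): ...

    @staticmethod
    @abstractmethod
    def suffixes(): ...

class OnnxStub(Frontend):
    @staticmethod
    def name():
        return "onnx"

    @staticmethod
    def suffixes():
        return ["onnx"]

class TFLiteStub(Frontend):
    @staticmethod
    def name():
        return "tflite"

    @staticmethod
    def suffixes():
        return ["tflite"]

ALL_FRONTENDS = [OnnxStub, TFLiteStub]

def guess_frontend(path):
    """Pick the frontend whose registered suffix matches the file extension."""
    suffix = os.path.splitext(path)[1][1:].lower()
    for frontend in ALL_FRONTENDS:
        if suffix in frontend.suffixes():
            return frontend()
    raise ValueError("no frontend for suffix: %s" % suffix)

print(guess_frontend("resnet50-v2-7.onnx").name())  # onnx
```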
(2) Compiling the model via compile_model()
This function does two main things: it calls relay.build() to run the compilation and then exports the build result:
graph_module = relay.build(mod, target=tvm_target, executor=executor, runtime=runtime, params=params)
...
package_path = tvmc_model.export_package(graph_module, package_path, cross, cross_options, output_format)
# This ultimately calls export_classic_format(), which writes the model data held by
# graph_module into the corresponding files and saves them as a tar archive
...
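The exported package is just a tar archive, so the export step can be imitated with the standard library alone. A minimal sketch (the pack_package helper is made up; the real export_classic_format() also handles cross-compilation and other output formats):

```python
import os
import tarfile
import tempfile

def pack_package(workdir, out_path):
    """Pack mod.so, mod.json and mod.params from workdir into a tar archive."""
    with tarfile.open(out_path, "w") as tar:
        for name in ("mod.so", "mod.json", "mod.params"):
            tar.add(os.path.join(workdir, name), arcname=name)
    return out_path

# Demonstrate with placeholder files standing in for the compiled artifacts
workdir = tempfile.mkdtemp()
for name in ("mod.so", "mod.json", "mod.params"):
    with open(os.path.join(workdir, name), "wb") as f:
        f.write(b"placeholder")
package = pack_package(workdir, os.path.join(workdir, "model.tar"))
with tarfile.open(package) as tar:
    print(sorted(m.name for m in tar.getmembers()))
```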
3. autotuner
drive_tune() determines the hardware parameters from the configuration, decides whether tuning should run on a remote machine over RPC, and then calls tune_model() to do the work. TVMC currently supports two auto-tuning approaches, auto-scheduling and autotvm; autotvm is the default. Their entry points for launching tuning tasks are schedule_tasks() and tune_tasks() respectively. Here we look at the default path:
def tune_tasks(
    tasks: List[autotvm.task.Task],
    log_file: str,
    measure_option: autotvm.measure_option,
    tuner: str,
    trials: int,
    early_stopping: Optional[int] = None,
    tuning_records: Optional[str] = None,
):
    if not tasks:
        logger.warning("there were no tasks found to be tuned")
        return
    if not early_stopping:
        early_stopping = trials
    # Iterate over the tuning tasks
    for i, tsk in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        # Create the tuner
        if tuner in ("xgb", "xgb-rank"):
            tuner_obj = XGBTuner(tsk, loss_type="rank")
        elif tuner == "xgb_knob":
            tuner_obj = XGBTuner(tsk, loss_type="rank", feature_type="knob")
        elif tuner == "ga":
            tuner_obj = GATuner(tsk, pop_size=50)
        elif tuner == "random":
            tuner_obj = RandomTuner(tsk)
        elif tuner == "gridsearch":
            tuner_obj = GridSearchTuner(tsk)
        else:
            raise TVMCException("invalid tuner: %s " % tuner)
        # If earlier tuning records exist, resume from them ("checkpointed" tuning)
        if tuning_records and os.path.exists(tuning_records):
            logger.info("loading tuning records from %s", tuning_records)
            start_time = time.time()
            tuner_obj.load_history(autotvm.record.load_from_file(tuning_records))
            logging.info("loaded history in %.2f sec(s)", time.time() - start_time)
        tuner_obj.tune(
            n_trial=min(trials, len(tsk.config_space)),
            early_stopping=early_stopping,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(trials, prefix=prefix),
                autotvm.callback.log_to_file(log_file),
            ],
        )
4. runner
drive_run() mainly decides whether to execute remotely over RPC, calls run_module() to run model inference, and finally prints and saves the results. The key steps of run_module() are excerpted below, with the micro-TVM and RPC parts omitted:
def run_module(tvmc_package: TVMCPackage, device: str, ...):
    ...
    # Load the compiled model's .so library
    session.upload(tvmc_package.lib_path)
    lib = session.load_module(tvmc_package.lib_name)
    ...
    # Select the target device to run on
    if device == "cuda":
        dev = session.cuda()
    elif device == "cl":
        dev = session.cl()
    ...
    else:
        dev = session.cpu()
    ...
    # Create the module object
    module = executor.create(tvmc_package.graph, lib, dev)
    ...
    # Load the trained model parameters
    module.load_params(tvmc_package.params)
    ...
    # Set the model's input data
    shape_dict, dtype_dict = module.get_input_info()
    inputs_dict = make_inputs_dict(shape_dict, dtype_dict, inputs, fill_mode)
    module.set_input(**inputs_dict)
    ...
    # Run inference
    times = module.benchmark(dev, number=number, repeat=repeat, end_to_end=end_to_end)
    ...
    # Collect the inference outputs
    num_outputs = module.get_num_outputs()
    outputs = {}
    for i in range(num_outputs):
        output_name = "output_{}".format(i)
        outputs[output_name] = module.get_output(i).numpy()
    return TVMCResult(outputs, times)
IV. Summary
This article has covered how to use the TVMC tool and its basic implementation flow. TVMC offers a rich set of command options, and it makes an excellent worked example of the TVM Python API; readers who are interested are encouraged to study it in depth.