1. TensorRT
TensorRT is a high-performance deep learning inference library for running inference in real production environments; it is not used for training.
Two key metrics: power efficiency and fast response time.
Both directly affect user experience and cost.
Where TensorRT sits in the machine learning workflow:

In the production environment, TensorRT automatically optimizes the trained neural network and delivers better performance.
The figure below compares GPU and CPU inference at the same power budget:

1.1 The two phases of a DNN: training and inference
Solving a supervised machine learning problem with deep neural networks involves a two-step process.
- 1.The first step is to train a deep neural network on massive amounts of labeled data using GPUs. During this step, the neural network learns millions of weights or parameters that enable it to map input data examples to correct responses. Training requires iterative forward and backward passes through the network as the objective function is minimized with respect to the network weights. Often several models are trained and accuracy is validated against data not seen during training in order to estimate real-world performance.
Training: using large amounts of labeled data, the deep neural network learns millions of weights and parameters so that inputs are mapped to the correct outputs. Training is iterative: forward and backward passes through the network adjust the weights so that the objective function decreases. To estimate real-world performance, several models are often trained and validated on data not seen during training.
- 2.The next step–inference–uses the trained model to make predictions from new data. During this step, the best trained model is used in an application running in a production environment such as a data center, an automobile, or an embedded platform. For some applications, such as autonomous driving, inference is done in real time and therefore high throughput is critical.
Inference: the trained model is used to make predictions on new data. The best trained model is selected for the target environment, such as a data center, a smartphone, or an embedded platform. For latency-critical applications such as autonomous driving, production inference requires high throughput and low latency.
1.2 Inference Versus Training

Both DNN training and Inference start out with the same forward propagation calculation, but training goes further. As Figure 1 illustrates, after forward propagation, the results from the forward propagation are compared against the (known) correct answer to compute an error value. A backward propagation phase propagates the error back through the network’s layers and updates their weights using gradient descent in order to improve the network’s performance at the task it is trying to learn. It is common to batch hundreds of training inputs (for example, images in an image classification network or spectrograms for speech recognition) and operate on them simultaneously during DNN training in order to prevent overfitting and, more importantly, amortize loading weights from GPU memory across many inputs, increasing computational efficiency.
For a DNN, training and inference share the same forward-propagation computation; training simply goes further. As the figure above shows, after forward propagation the result is compared against the known correct answer to compute an error value; back-propagation then pushes that error back through the network's layers and updates the weights with gradient descent, improving the network's performance on the task. During training it is common to batch hundreds of inputs (for example, images for an image-classification network or spectrograms for speech recognition) and process them simultaneously, partly to help prevent overfitting and, more importantly, to amortize the cost of loading weights from GPU memory across many inputs, which raises computational efficiency.
For inference, the performance goals are different. To minimize the network's end-to-end response time, inference typically batches a smaller number of inputs than training, as services relying on inference to work (for example, a cloud-based image-processing pipeline) are required to be as responsive as possible so users do not have to wait several seconds while the system is accumulating images for a large batch. In general, we might say that the per-image workload for training is higher than for inference, and while high throughput is the only thing that counts during training, latency becomes important for inference as well.
For inference the goals are different. To reduce the network's end-to-end response time, inference generally uses much smaller batches than training: a service that depends on inference (e.g. a cloud-based image-processing pipeline) must respond quickly, rather than making users wait while it accumulates a large batch of images. In general, the per-image workload during training is higher than during inference; throughput is all that matters during training, whereas for inference latency matters as well.
1.3 Inference on GPU vs. CPU
Two classic neural network architectures are used for the experiments:
AlexNet (winner of the 2012 ImageNet ILSVRC)
GoogLeNet (winner of the 2014 ImageNet ILSVRC), much deeper and more complex than AlexNet
For each network, jetson_tx1_whitepaper.pdf considers two cases:
- Case 1: input images may be batched. This targets models doing inference in the cloud (many users uploading images all the time), where the extra latency of assembling a batch is acceptable. The experiments use a batch size of 48 for the CPU and 128 for the GPU.
- Case 2: no batching (extremely latency-sensitive), batch size = 1.
Four devices: the NVIDIA Tegra X1 vs. an Intel Core i7 6700K, and the NVIDIA Titan X vs. a 16-core Intel Xeon E5.
GPU software: Caffe accelerated with cuDNN.
The Intel CPUs run optimized CPU inference code (Intel Deep Learning Framework, which supports only a CaffeNet architecture similar to AlexNet, batch sizes 1-48).
The Tegra X1 runs inference at two floating-point precisions: FP16 and FP32.
The Tegra X1 adds FP16 arithmetic throughput, and the newer cuDNN release adds FP16 support; together they significantly improve performance with no loss in classification accuracy.



Conclusions of the GPU vs. CPU comparison:
- 1. TX1 with FP16 is far more power-efficient than CPU inference:
Tegra X1 in FP16: 45 img/sec/W, compared to 3.9 img/sec/W for the Core i7 6700K.
Absolute performance: 258 img/sec on the Tegra X1 in FP16, compared to 242 img/sec on the Core i7.
- 2. The Titan X vs. Xeon E5 results tell a similar story:
At large batch sizes the Titan X outperforms the Xeon E5 while using less energy per image, roughly 3000 images/second vs. 500 images/second; even with no batching, the TX1 and Titan X deliver better performance/watt (the Titan X's 12 GB framebuffer also helps FFT-based convolution algorithms, which are memory-hungry).
- 3. The whitepaper draws a further conclusion: the new cuDNN improves inference performance beyond the optimizations added to the Caffe deep learning framework, mostly in the convolution algorithms (splitting the work across multiprocessors for small batches, which improves small-batch performance on the GPU). The new cuDNN also adds FP16 support for convolutions; FP16 arithmetic can reach twice the performance of FP32 and, like FP16 storage, introduces no loss of accuracy relative to FP32 inference.
- 4. Part of the GPU performance gain is also due to the Caffe framework, which allows inference (with batch size 1) to use cuBLAS GEMV (matrix-vector multiplication) in place of GEMM (matrix-matrix multiplication).
The environment in which a trained model is deployed can differ considerably from the training environment; on an embedded target, for example, inference faces strict requirements on response time and power.
Key metric: efficiency, i.e. inference performance per watt.
Efficiency is also a critical metric for large-scale data centers; beyond it, latency, physical footprint, and cooling must all be considered, since each constrains the performance that can actually be delivered.
1.4 TensorRT build and deployment
TensorRT is a high-performance inference engine aimed at maximum inference throughput and efficiency for tasks such as image classification, segmentation, and object detection. It optimizes the trained neural network for the actual deployment scenario (web, mobile, embedded, or automotive) to get the best performance from GPU-accelerated inference.

TensorRT provides two key functions:
- 1. Optimizing the trained network
- 2. A runtime for the deployment target
Using TensorRT is a two-step process: build and deployment.
Build: TensorRT optimizes the network configuration and generates an optimized plan for computing the forward pass. The plan is optimized object code that can be serialized and stored in memory or on disk.
Deployment: typically a long-running service or user application that accepts batches of input data, performs inference, and returns batches of results (classifications, detected objects, and so on). With TensorRT there is no need to install or run any other deep learning framework on the deployment hardware.
Other concerns of an inference service, such as batching and pipelining, are not discussed here; the focus is on using TensorRT for inference.
1.4.1 Build
TensorRT's build phase needs three files to deploy a classification neural network:
- A network architecture file (deploy.prototxt)
- Trained weights (net.caffemodel)
- A label file providing a name for each output class
In addition, the batch size and the output layer must be specified. The code below converts a Caffe model into a TensorRT object; lines 3-5 read in the network information. If no network architecture file (deploy.prototxt) is available, the network can instead be defined directly through the TensorRT builder API.
Converting a Caffe model into a TensorRT object:
IBuilder* builder = createInferBuilder(gLogger);
// parse the caffe model to populate the network, then set the outputs
INetworkDefinition* network = builder->createNetwork();
CaffeParser parser;
auto blob_name_to_tensor = parser.parse("deploy.prototxt",
                                        trained_file.c_str(),
                                        *network,
                                        DataType::kFLOAT);
// specify which tensors are outputs
network->markOutput(*blob_name_to_tensor->find("prob"));
// build the engine with a maximum batch size of 1 and 1 GB of workspace
builder->setMaxBatchSize(1);
builder->setMaxWorkspaceSize(1 << 30);
ICudaEngine* engine = builder->buildCudaEngine(*network);
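Once the engine has been built, the build-time objects are no longer needed. As a small follow-up to the code above, and assuming the destroy() methods exposed by TensorRT releases of this generation, they can be released while keeping only the engine:

// the engine is self-contained, so the network definition and builder can be released
network->destroy();
builder->destroy();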
Layer types supported by TensorRT:
- Convolution: 2D
- Activation: ReLU, tanh and sigmoid
- Pooling: max and average
- ElementWise: sum, product or max of two tensors
- LRN: cross-channel only
- Fully-connected: with or without bias
- SoftMax: cross-channel only
- Deconvolution
When the Caffe parser is not used, the network can be defined through the TensorRT C++ API. The API can create any of the supported layers listed above along with their parameters, including the parameters that vary between networks, such as convolution filter dimensions and output counts, or the window size and stride of a pooling layer.
Defining a network with the TensorRT C++ API:
ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{...});
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, ...);
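To make this more concrete, below is a hedged sketch that defines a tiny convolution, ReLU, pooling, fully-connected, softmax network entirely through the API. The input dimensions, channel counts, and weight arrays (convW, convB, fcW, fcB and their data pointers) are illustrative assumptions rather than values from the original text:

// weights normally come from the trained model; pointers and counts here are placeholders
Weights convW{DataType::kFLOAT, convKernelData, 32 * 3 * 5 * 5};
Weights convB{DataType::kFLOAT, convBiasData, 32};
Weights fcW{DataType::kFLOAT, fcKernelData, fcWeightCount};
Weights fcB{DataType::kFLOAT, fcBiasData, 10};
ITensor* data = network->addInput("data", DataType::kFLOAT, Dims3{3, 224, 224});
IConvolutionLayer* conv = network->addConvolution(*data, 32, DimsHW{5, 5}, convW, convB);
IActivationLayer* relu = network->addActivation(*conv->getOutput(0), ActivationType::kRELU);
IPoolingLayer* pool = network->addPooling(*relu->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
pool->setStride(DimsHW{2, 2});
IFullyConnectedLayer* fc = network->addFullyConnected(*pool->getOutput(0), 10, fcW, fcB);
ISoftMaxLayer* prob = network->addSoftMax(*fc->getOutput(0));
prob->getOutput(0)->setName("prob");
network->markOutput(*prob->getOutput(0));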
Steps after the network has been defined or loaded:
- The output tensors must be marked explicitly (see network->markOutput in the Caffe-to-TensorRT code above); this example uses "prob" (for probability).
- Define the maximum batch size with builder->setMaxBatchSize; the batch size actually used can later be chosen to fit the deployment (application requirements and system configuration), up to this maximum.
- TensorRT performs layer optimizations to reduce inference time. This is transparent to the API user, but analyzing and optimizing the layers needs scratch memory, so the maximum available workspace must be set with builder->setMaxWorkspaceSize.
- buildCudaEngine runs the layer optimizations and builds the engine for the optimized network from the inputs and parameters provided. Once the model has been converted into a TensorRT object, it can be serialized for storage on the host device or for use anywhere it is needed; a sketch of serializing the plan follows below.
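A minimal sketch of that last serialization step, assuming the engine->serialize() / IHostMemory interface of this TensorRT generation (plan.bin is only an example file name; requires <fstream>):

// serialize the optimized engine ("plan") and write it to disk for later deployment
IHostMemory* plan = engine->serialize();
std::ofstream planFile("plan.bin", std::ios::binary);
planFile.write(static_cast<const char*>(plan->data()), plan->size());
planFile.close();
plan->destroy();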
TensorRT applies several important transformations and optimizations to the neural network. First, layers whose outputs are never used are eliminated to avoid unnecessary computation. Then, wherever possible, convolution, bias, and ReLU layers are fused into a single layer. Fusion happens in two directions, vertical and horizontal:
Vertical Layer Fusion

For the network structure shown above, the result of vertical layer fusion is shown in the next figure; fused layers are labeled CBR (convolution + bias + ReLU).
Layer fusion improves the efficiency of the network that TensorRT produces.

Horizontal Layer Fusion
Horizontal fusion is also called layer aggregation: layers that take the same source tensor and perform the same operation with similar parameters are combined into a single, wider layer, which is more efficient to compute. In the figure below, three 1×1 CBR layers are combined into one larger CBR layer. Note that the output of the combined layer must then be split so that each consumer of the original CBR layers still receives the output it expects.

TensorRT performs these transformations during the build phase, after the parser has read in the trained network and its configuration files; this corresponds to the Caffe-to-TensorRT code shown earlier.
1.4.2 Deployment
Running the inference builder (buildCudaEngine) returns a pointer to a new inference engine runtime object (ICudaEngine). The runtime object is ready for immediate use; alternatively, its state can be serialized and saved to disk or stored for distribution to the target. The serialized form is called a plan.
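Conversely, a saved plan can be turned back into an engine on the deployment target without the builder or the parser. A hedged sketch, assuming the plan file has already been read into a byte buffer planData of length planSize, and using the createInferRuntime / deserializeCudaEngine calls of this TensorRT generation:

// recreate the engine from a serialized plan; no framework or builder is needed at runtime
IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(planData, planSize, nullptr);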
As noted above, batching and streaming data into the runtime inference engine are beyond the scope of this document. The code below demonstrates using the inference engine to process a batch of inputs and produce a result.
// The execution context is responsible for launching the
// compute kernels
IExecutionContext *context = engine->createExecutionContext();
// In order to bind the buffers, we need to know the names of the
// input and output tensors.
int inputIndex = engine->getBindingIndex(INPUT_LAYER_NAME),
    outputIndex = engine->getBindingIndex(OUTPUT_LAYER_NAME);
// Allocate GPU memory for Input / Output data
void** buffers = static_cast<void**>(malloc(engine->getNbBindings() * sizeof(void*)));
cudaMalloc(&buffers[inputIndex], batchSize * size_of_single_input);
cudaMalloc(&buffers[outputIndex], batchSize * size_of_single_output);
// Use CUDA streams to manage the concurrency of copying and executing
cudaStream_t stream;
cudaStreamCreate(&stream);
// Copy Input Data to the GPU
cudaMemcpyAsync(buffers[inputIndex], input,
                batchSize * size_of_single_input,
                cudaMemcpyHostToDevice, stream);
// Launch an instance of the GIE compute kernel
context->enqueue(batchSize, buffers, stream, nullptr);
// Copy Output Data to the Host
cudaMemcpyAsync(output, buffers[outputIndex],
                batchSize * size_of_single_output,
                cudaMemcpyDeviceToHost, stream);
// It is possible to have multiple instances of the code above
// in flight on the GPU in different streams.
// The host can then sync on a given stream and use the results
cudaStreamSynchronize(stream);
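When the service shuts down, the stream, device buffers, and TensorRT objects created above should be released. A minimal cleanup sketch, again assuming the destroy() methods of this TensorRT generation:

// release the CUDA stream, the device buffers, and the TensorRT runtime objects
cudaStreamDestroy(stream);
cudaFree(buffers[inputIndex]);
cudaFree(buffers[outputIndex]);
free(buffers);
context->destroy();
engine->destroy();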
1.5 Maximizing TensorRT performance and efficiency
TensorRT simplifies the deployment of neural networks and strengthens a product's deep learning capability, giving it higher performance and efficiency.
The build phase identifies the optimizations that can be applied to the network; the deployment phase runs the optimized network to cut latency and raise throughput.
If you run a web or mobile application backed by servers in a data center, TensorRT lets you deploy larger and more sophisticated models to make the end-user experience smarter while keeping the client side lightweight. If you are building a next-generation device, TensorRT helps you deploy networks with high performance, high accuracy, and high power efficiency.
Moreover, running neural network inference with mixed-precision FP16 data lowers GPU power consumption, halves memory usage, and delivers higher performance.
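As a rough sketch of how FP16 is enabled at build time: the exact flag depends on the release (setHalf2Mode on early versions, setFp16Mode around 4.x), and platformHasFastFp16 reports whether the GPU actually benefits:

// enable FP16 kernels only when the GPU has fast FP16 support (e.g. Tegra X1, P100, V100)
if (builder->platformHasFastFp16())
{
    builder->setFp16Mode(true);   // on older releases: builder->setHalf2Mode(true)
}
ICudaEngine* engine = builder->buildCudaEngine(*network);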
A quick check shows that TensorRT is not installed on my host PC: the host is used mainly for training (most often through DIGITS), while TensorRT targets inference, so it does not appear on the host side.
1.6 Update
The material above is based on documents that are now a few years old. TensorRT has since reached version 4.0.1, which brings several changes and new features.
- New layers: Top-K, LSTM with projection, Constant, Softmax, and batched GEMM.
- Multi-layer perceptrons (MLPs) are optimized through layer fusion (vertical or horizontal).
- Samples are provided for a quick start with recurrent neural networks (RNNs), multi-layer perceptrons (MLPs), and neural machine translation (NMT).
- ONNX parser: TensorRT can import and optimize models in the ONNX format exported from frameworks such as Caffe2, Chainer, MXNet, PyTorch, and others. Both C++ and Python APIs are supported (a hedged usage sketch follows this list).
- TensorFlow model support: TensorFlow 1.7 offers a simple API for accelerating models with TensorRT, with automatic optimization for the available precisions (FP32, FP16, INT8) and Tensor Cores.
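For the ONNX path mentioned above, here is a hedged sketch of the C++ parser usage. The interface changed between releases; this follows the nvonnxparser::createParser / parseFromFile form of later versions, and model.onnx is only an example file name:

// populate the network definition from an ONNX model instead of a Caffe one
nvonnxparser::IParser* onnxParser = nvonnxparser::createParser(*network, gLogger);
onnxParser->parseFromFile("model.onnx",
                          static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));
// the builder steps (setMaxBatchSize, setMaxWorkspaceSize, buildCudaEngine)
// are then the same as in the Caffe example of section 1.4.1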
The updated TensorRT workflow:

The seven steps of a TensorRT deployment, combining the build and deployment stages of section 1.4 (see the code there):


TensorRT performance:

References:
- Inference: The Next Step in GPU-Accelerated Deep Learning (NVIDIA developer blog)
- GPU-Based Deep Learning Inference: A Performance and Power Analysis (Jetson TX1 whitepaper)
- https://devblogs.nvidia.com/deploying-deep-learning-nvidia-tensorrt/
- https://docs.nvidia.com/deeplearning/sdk/

