TensorRT 8 Usage Notes (6): Performance Statistics

Converting and deploying a model with TensorRT mainly involves the following performance metrics:

Performance metrics
  1. Throughput

Unit: qps (Queries Per Second), the number of queries that can be served per second.
The observed throughput, computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
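As a rough illustration of this check, the sketch below computes throughput from a query count and a total host walltime, then compares it with the reciprocal of the mean GPU Compute Time. The function names and the 0.9 cutoff are assumptions chosen for illustration; they are not part of TensorRT or trtexec.

```python
def throughput_qps(num_queries: int, total_host_walltime_s: float) -> float:
    """Observed throughput: number of queries divided by Total Host Walltime (seconds)."""
    if total_host_walltime_s <= 0:
        raise ValueError("total_host_walltime_s must be positive")
    return num_queries / total_host_walltime_s


def gpu_bound_qps(mean_gpu_compute_ms: float) -> float:
    """Reciprocal of the mean GPU Compute Time, expressed in queries per second."""
    if mean_gpu_compute_ms <= 0:
        raise ValueError("mean_gpu_compute_ms must be positive")
    return 1000.0 / mean_gpu_compute_ms


def looks_underutilized(observed_qps: float,
                        mean_gpu_compute_ms: float,
                        ratio_threshold: float = 0.9) -> bool:
    """Heuristic flag: observed throughput is well below the GPU-bound rate.

    ratio_threshold is an arbitrary illustrative cutoff, not an official value.
    """
    return observed_qps < ratio_threshold * gpu_bound_qps(mean_gpu_compute_ms)
```

With the figures from the sample output in this post (56.2013 qps observed, 17.689 ms mean GPU Compute Time), the GPU-bound rate is roughly 56.5 qps, so `looks_underutilized(56.2013, 17.689)` returns `False`.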

  1. Latency

The summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
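The relationship can be checked with a few lines of Python. This is a minimal sketch: the `QueryTimings` class is a hypothetical helper introduced here, and the numbers are the mean values from the sample output in this post. The sum works for means because the mean is additive; min, max, and percentile values generally do not add up this way.

```python
from dataclasses import dataclass


@dataclass
class QueryTimings:
    """Per-query timings in milliseconds (hypothetical helper, not a TensorRT type)."""
    h2d_ms: float
    gpu_compute_ms: float
    d2h_ms: float

    @property
    def latency_ms(self) -> float:
        # Latency = H2D Latency + GPU Compute Time + D2H Latency
        return self.h2d_ms + self.gpu_compute_ms + self.d2h_ms


# Mean values taken from the sample output in this post.
mean = QueryTimings(h2d_ms=4.47215, gpu_compute_ms=17.689, d2h_ms=0.00656637)
print(f"{mean.latency_ms:.4f} ms")  # 22.1677 ms, matching the reported mean Latency
```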

  1. End-to-End Host Latency

The duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the latency to wait for the completion of the previous query. This is the latency of a query if multiple queries are enqueued consecutively.

  1. Enqueue Time

The host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.

  1. H2D Latency (Host to Device latency)

The latency for host-to-device data transfers for input tensors of a single query.

  1. GPU Compute Time

The GPU latency to execute the kernels for a query, that is, the time the GPU spends on the computation itself.

  1. D2H Latency (Device to Host latency)

The latency for device-to-host data transfers for output tensors of a single query.

  1. Total Host Walltime[1]

The host walltime from when the first query (after warmups) is enqueued to when the last query is completed.

  1. Total GPU Compute Time

The summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
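Two of the under-utilization hints described above (Total GPU Compute Time versus Total Host Walltime, and Enqueue Time versus GPU Compute Time) can be turned into simple checks. The sketch below is only illustrative: both function names are hypothetical, and the inputs are values from the sample output in this post.

```python
def gpu_busy_fraction(total_gpu_compute_s: float, total_host_walltime_s: float) -> float:
    """Share of the host walltime during which the GPU was executing kernels."""
    if total_host_walltime_s <= 0:
        raise ValueError("total_host_walltime_s must be positive")
    return total_gpu_compute_s / total_host_walltime_s


def enqueue_is_bottleneck(mean_enqueue_ms: float, mean_gpu_compute_ms: float) -> bool:
    """True when enqueuing a query on the host takes longer than the GPU needs to run it."""
    return mean_enqueue_ms > mean_gpu_compute_ms


# Values from the sample output in this post.
print(f"{gpu_busy_fraction(3.02481, 3.04263):.1%}")  # 99.4%
print(enqueue_is_bottleneck(1.54232, 17.689))        # False
```

A busy fraction close to 100% together with an enqueue time well below the GPU compute time suggests that, for this sample run, the GPU is the limiting factor rather than the host.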

Example:
The following is performance test data from inference on one model.

Throughput: 56.2013 qps
Latency: min = 22 ms, max = 22.5906 ms, mean = 22.1677 ms, median = 22.1396 ms, percentile(99%) = 22.588 ms
End-to-End Host Latency: min = 34.3153 ms, max = 35.616 ms, mean = 35.2231 ms, median = 35.2316 ms, percentile(99%) = 35.5511 ms
Enqueue Time: min = 0.937988 ms, max = 2.3905 ms, mean = 1.54232 ms, median = 1.5459 ms, percentile(99%) = 1.79907 ms
H2D Latency: min = 4.42554 ms, max = 4.89941 ms, mean = 4.47215 ms, median = 4.43042 ms, percentile(99%) = 4.88135 ms
GPU Compute Time: min = 17.5708 ms, max = 17.8975 ms, mean = 17.689 ms, median = 17.6906 ms, percentile(99%) = 17.8646 ms
D2H Latency: min = 0.00292969 ms, max = 0.0129395 ms, mean = 0.00656637 ms, median = 0.0055542 ms, percentile(99%) = 0.0128784 ms
Total Host Walltime: 3.04263 s
Total GPU Compute Time: 3.02481 s
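For further analysis, a summary like the one above can be loaded into a dictionary. The parser below is a minimal sketch that assumes the lines look exactly like this sample (real logs may carry extra prefixes that would need stripping first); `parse_summary` and `STAT_RE` are names made up for this example. As a cross-check, multiplying throughput by Total Host Walltime gives an estimate of how many queries were timed.

```python
import re

SAMPLE = """\
Throughput: 56.2013 qps
Latency: min = 22 ms, max = 22.5906 ms, mean = 22.1677 ms, median = 22.1396 ms, percentile(99%) = 22.588 ms
GPU Compute Time: min = 17.5708 ms, max = 17.8975 ms, mean = 17.689 ms, median = 17.6906 ms, percentile(99%) = 17.8646 ms
Total Host Walltime: 3.04263 s
Total GPU Compute Time: 3.02481 s
"""

# Matches pairs such as "min = 22 ms" or "percentile(99%) = 22.588 ms".
STAT_RE = re.compile(r"(\w[\w()%]*) = ([\d.]+) ms")


def parse_summary(text: str) -> dict:
    """Parse 'Name: ...' lines into {name: {stat: value}} or {name: {'value', 'unit'}}."""
    metrics = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(": ")
        if not sep:
            continue  # not a "Name: value" line
        stats = {key: float(val) for key, val in STAT_RE.findall(rest)}
        if stats:
            metrics[name] = stats
            continue
        value, _, unit = rest.partition(" ")
        try:
            metrics[name] = {"value": float(value), "unit": unit}
        except ValueError:
            continue  # skip lines that do not carry a number
    return metrics


summary = parse_summary(SAMPLE)
estimated_queries = (summary["Throughput"]["value"]
                     * summary["Total Host Walltime"]["value"])
print(f"{estimated_queries:.1f}")  # 171.0
```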

[1] Walltime (wall-clock time) is the elapsed real time from when a computation starts to when it finishes.
