Hive Tuning: Tools

HQL provides the EXPLAIN and ANALYZE statements for inspecting and diagnosing query performance. In addition, the Hive logs contain enough detail for performance investigation and troubleshooting.

  1. EXPLAIN returns a query's execution plan without running the query. When we suspect a query has a performance problem, we can use this statement to analyze it. Its syntax is:

EXPLAIN [FORMATTED|EXTENDED|DEPENDENCY|AUTHORIZATION] hql_query
where:

  • FORMATTED: produces the execution plan in JSON format
  • EXTENDED: provides extra information about the operators in the plan, such as file path names
  • DEPENDENCY: produces JSON output listing the tables and partitions the query depends on. Supported since Hive 0.10.0
  • AUTHORIZATION: lists all entities that need to be authorized, including the inputs and outputs of the query, as well as any authorization failures. Supported since Hive 0.14.0
    A typical query execution plan has three parts:
  • Abstract Syntax Tree (AST): the syntax tree of the HQL statement, generated automatically by the ANTLR parser generator.
  • Stage Dependencies: lists all the stages used to run the query and the dependencies among them.
  • Stage Plans: contains the important details for running the job, such as operators and sort orders.
    The following is a typical execution plan. The abstract syntax tree is rendered as the Map/Reduce operator trees. In the STAGE DEPENDENCIES section, Stage-1 is the root stage and Stage-0 depends on Stage-1. In the STAGE PLANS section, Stage-1 has one Map Operator Tree and one Reduce Operator Tree; inside each tree, all operators, including expressions and aggregations, are listed. Stage-0 has no map or reduce phase; it only performs a fetch operation.
> EXPLAIN SELECT gender_age.gender, count(*) 
> FROM employee_partitioned WHERE year=2018 
> GROUP BY gender_age.gender LIMIT 2;
+----------------------------------------------------------------------+
| Explain                                                              |
+----------------------------------------------------------------------+
| STAGE DEPENDENCIES:                                                  |
| Stage-1 is a root stage                                              |
| Stage-0 depends on stages: Stage-1                                   |
|                                                                      |
| STAGE PLANS:                                                         |
| Stage: Stage-1                                                       |
| Map Reduce                                                           |
| Map Operator Tree:                                                   |
| TableScan                                                            |
| alias: employee_partitioned                                          |
| Pruned Column Paths: gender_age.gender                               |
| Statistics:                                                          |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE  |
| Select Operator                                                      |
| expressions: gender_age.gender (type: string)                        |
| outputColumnNames: _col0                                             |
| Statistics:                                                          |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE  |
| Group By Operator                                                    |
| aggregations: count()                                                |
| keys: _col0 (type: string)                                           |
| mode: hash                                                           |
| outputColumnNames: _col0, _col1                                      |
| Statistics:                                                          |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE  |
| Reduce Output Operator                                               |
| key expressions: _col0 (type: string)                                |
| sort order: +                                                        |
| Map-reduce partition columns: _col0 (type: string)                   |
| Statistics:                                                          |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE  |
| TopN Hash Memory Usage: 0.1                                          |
| value expressions: _col1 (type: bigint)                              |
| Reduce Operator Tree:                                                |
| Group By Operator                                                    |
| aggregations: count(VALUE._col0)                                     |
| keys: KEY._col0 (type: string)                                       |
| mode: mergepartial                                                   |
| outputColumnNames: _col0, _col1                                      |
| Statistics:                                                          |
| Num rows: 2 Data size: 111 Basic stats: COMPLETE Column stats: NONE  |
| Limit                                                                |
| Number of rows: 2                                                    |
| Statistics:                                                          |
| Num rows: 2 Data size: 110 Basic stats: COMPLETE Column stats: NONE  |
| File Output Operator                                                 |
| compressed: false                                                    |
| Statistics:                                                          |
| Num rows: 2 Data size: 110 Basic stats: COMPLETE Column stats: NONE  |
| table:                                                               |
| input format:                                                        |
| org.apache.hadoop.mapred.SequenceFileInputFormat                     |
| output format:                                                       |
| org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat            |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe            |
|                                                                      |
| Stage: Stage-0                                                       |
| Fetch Operator                                                       |
| limit: 2                                                             |
| Processor Tree:                                                      |
| ListSink                                                             |
+----------------------------------------------------------------------+
53 rows selected (0.232 seconds)
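When we only need to know which tables and partitions a query touches, the DEPENDENCY option is handy. A minimal sketch against the same table (the JSON shape shown in the comments is illustrative and may vary by Hive version):

```sql
EXPLAIN DEPENDENCY
SELECT gender_age.gender, count(*)
FROM employee_partitioned WHERE year=2018
GROUP BY gender_age.gender;
-- Illustrative output shape:
-- {"input_tables":[{"tablename":"default@employee_partitioned",
--                   "tabletype":"MANAGED_TABLE"}],
--  "input_partitions":[...]}
```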
  2. ANALYZE Hive statistics are a collection of data that describe objects in the database in more detail, such as the number of rows, the number of files, and the raw data size. Statistics are metadata about the data, collected and stored in the Metastore database. Hive supports statistics at the table, partition, and column level. These statistics serve as input to Hive's Cost-Based Optimizer (CBO), helping it choose the execution plan with the lowest cost in terms of the system resources needed to complete the query. Hive statistics can be collected partially automatically (since Hive 3.2.0), or manually by running the ANALYZE statement.
Collect table statistics manually:

When NOSCAN is specified, the statement does not scan the files; it only collects the number of files and their sizes:

> ANALYZE TABLE employee COMPUTE STATISTICS;
No rows affected (27.979 seconds)

> ANALYZE TABLE employee COMPUTE STATISTICS NOSCAN;
No rows affected (25.979 seconds)
Collect partition statistics manually:
-- Applies for specific partition
> ANALYZE TABLE employee_partitioned 
> PARTITION(year=2018, month=12) COMPUTE STATISTICS;
No rows affected (45.054 seconds)
      
-- Applies for all partitions
> ANALYZE TABLE employee_partitioned 
> PARTITION(year, month) COMPUTE STATISTICS;
No rows affected (45.054 seconds)
Collect column statistics manually:
> ANALYZE TABLE employee_id COMPUTE STATISTICS FOR COLUMNS       
employee_id;       
No rows affected (41.074 seconds)

Automatic statistics collection can be enabled with:

SET hive.stats.autogather=true

When a table or partition is loaded through an INSERT OVERWRITE/INTO statement, its statistics are collected automatically into the Metastore; the LOAD statement, however, does not trigger automatic statistics collection.
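A sketch of the difference (the table name and file path below are illustrative):

```sql
SET hive.stats.autogather=true;
-- Statistics are gathered automatically for data written by INSERT:
INSERT OVERWRITE TABLE employee_copy SELECT * FROM employee;
-- LOAD only moves files into place, so no statistics are gathered:
LOAD DATA INPATH '/tmp/employee.txt' INTO TABLE employee_copy;
```

After the LOAD, an explicit ANALYZE TABLE would be needed to bring the statistics up to date.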
Statistics can be checked with the DESCRIBE EXTENDED/FORMATTED command. For example:

-- Check statistics in a partition
> DESCRIBE EXTENDED employee_partitioned PARTITION(year=2018, month=12);
-- Check statistics in a table
> DESCRIBE EXTENDED employee;
...
parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}).
-- Check statistics in a column
> DESCRIBE FORMATTED employee.name;
+--------+---------+---+---+---------+--------------+
|col_name|data_type|min|max|num_nulls|distinct_count| ...
+--------+---------+---+---+---------+--------------+
| name   | string  |   |   | 0       | 5            | ...
+--------+---------+---+---+---------+--------------+
+-----------+-----------+
|avg_col_len|max_col_len| ...
+-----------+-----------+
| 5.6       | 7         | ...
+-----------+-----------+
3 rows selected (0.116 seconds)
  3. Logs provide detailed information about how a query or job runs. By checking the logs, we can find runtime problems and errors, which often lead to poor performance. Hive provides two kinds of logs: system logs and job logs.
  • The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties, which includes settings such as:

hive.root.logger=WARN,DRFA ## set logger level
hive.log.dir=/tmp/${user.name} ## set log file path
hive.log.file=hive.log ## set log file name

The settings above apply to all users. We can also override them on the Hive command line, in which case they only take effect for the current user session, e.g.:

$ hive --hiveconf hive.root.logger=DEBUG,console

  • Job logs contain the job information and are usually managed by YARN. To check a job's log, run the following YARN command:

yarn logs -applicationId <application_id>
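If the application ID is not known up front, it can be looked up first and the logs filtered for errors; a sketch (the application ID below is a placeholder):

```shell
# List recently finished applications to find the job's ID
yarn application -list -appStates FINISHED
# Fetch its logs and search for errors
yarn logs -applicationId application_1543334_0001 | grep -iE 'error|exception'
```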
