HQL provides the EXPLAIN and ANALYZE statements for inspecting and diagnosing query performance. In addition, Hive logs contain enough detail for performance investigation and troubleshooting.
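- EXPLAIN returns the execution plan of a query without running it. When we suspect that a query has a performance problem, we can use this statement to analyze it. The syntax is:
EXPLAIN [FORMATTED|EXTENDED|DEPENDENCY|AUTHORIZATION] hql_query
where:
- FORMATTED: generates the execution plan in JSON format
- EXTENDED: provides additional information about the operators in the plan, such as file path names
- DEPENDENCY: produces JSON output containing the list of tables and partitions the query depends on. Supported since Hive 0.10.0
- AUTHORIZATION: lists all entities that need to be authorized, including the inputs and outputs used to run the query, as well as any authorization failures. Supported since Hive 0.14.0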
A typical query execution plan has three parts:
- Abstract Syntax Tree (AST): Hive uses the ANTLR parser generator to automatically generate the syntax tree for HQL.
- Stage Dependencies: lists all dependencies and the stages used to execute the query.
- Stage Plans: contains the important information used to run the job, such as operators and sort orders.
The following example shows a typical execution plan. The Abstract Syntax Tree (AST) appears as the Map/Reduce Operator Tree. In the Stage Dependencies section, Stage-1 is a root stage and Stage-0 depends on Stage-1. In the Stage Plans section, Stage-1 has one Map Operator Tree and one Reduce Operator Tree, and each tree lists all of its operators, including expressions and aggregations. Stage-0 has no map or reduce phase; it consists of a single Fetch operation.
> EXPLAIN SELECT gender_age.gender, count(*)
> FROM employee_partitioned WHERE year=2018
> GROUP BY gender_age.gender LIMIT 2;
+----------------------------------------------------------------------+
| Explain |
+----------------------------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: employee_partitioned |
| Pruned Column Paths: gender_age.gender |
| Statistics: |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: gender_age.gender (type: string) |
| outputColumnNames: _col0 |
| Statistics: |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count() |
| keys: _col0 (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE |
| TopN Hash Memory Usage: 0.1 |
| value expressions: _col1 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| keys: KEY._col0 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: |
| Num rows: 2 Data size: 111 Basic stats: COMPLETE Column stats: NONE |
| Limit |
| Number of rows: 2 |
| Statistics: |
| Num rows: 2 Data size: 110 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: |
| Num rows: 2 Data size: 110 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: |
| org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: |
| org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: 2 |
| Processor Tree: |
| ListSink |
+----------------------------------------------------------------------+
53 rows selected (0.232 seconds)
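For plans with many stages, the Stage Dependencies section is easier to reason about as a small dependency map. The following is a minimal Python sketch, not part of Hive: the function name `parse_stage_dependencies` is hypothetical, and it assumes the plan text has been captured as a string and only contains the two line shapes shown above ("is a root stage" and "depends on stages:"). Other annotations that Hive may print in this section are not handled.

```python
import re

def parse_stage_dependencies(plan_text):
    """Map each stage name to the list of stages it depends on."""
    deps = {}
    in_section = False
    for raw in plan_text.splitlines():
        # Drop surrounding whitespace and the "|" table borders, if present.
        line = raw.strip().strip("|").strip()
        if line == "STAGE DEPENDENCIES:":
            in_section = True
            continue
        if line == "STAGE PLANS:":
            break  # the dependency section ends where the plans begin
        if not in_section or not line:
            continue
        root = re.fullmatch(r"(Stage-\d+) is a root stage", line)
        if root:
            deps[root.group(1)] = []
            continue
        dep = re.fullmatch(r"(Stage-\d+) depends on stages: (.+)", line)
        if dep:
            deps[dep.group(1)] = [s.strip() for s in dep.group(2).split(",")]
    return deps

sample = """
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
"""
print(parse_stage_dependencies(sample))
```

For the plan above, the result maps Stage-1 to an empty list and Stage-0 to a list containing Stage-1, which confirms that the fetch stage runs only after the MapReduce stage.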
- ANALYZE: Hive statistics are a collection of data describing details such as the number of rows, the number of files, and the raw data size of the objects in the database. Statistics are metadata about the data, collected and stored in the metastore database. Hive supports statistics at the table, partition, and column level. These statistics are the input to the Hive Cost-Based Optimizer (CBO), which uses them to choose the execution plan with the lowest cost in terms of the system resources needed to complete the query. Statistics can be gathered partly automatically (since Hive v3.2.0) or manually by running ANALYZE.
Collecting table statistics manually
When NOSCAN is specified, the command does not scan the files; it only collects the number of files and their size.
> ANALYZE TABLE employee COMPUTE STATISTICS;
No rows affected (27.979 seconds)
> ANALYZE TABLE employee COMPUTE STATISTICS NOSCAN;
No rows affected (25.979 seconds)
Collecting partition statistics manually
-- Applies for specific partition
> ANALYZE TABLE employee_partitioned
> PARTITION(year=2018, month=12) COMPUTE STATISTICS;
No rows affected (45.054 seconds)
-- Applies for all partitions
> ANALYZE TABLE employee_partitioned
> PARTITION(year, month) COMPUTE STATISTICS;
No rows affected (45.054 seconds)
Collecting column statistics manually
> ANALYZE TABLE employee_id COMPUTE STATISTICS FOR COLUMNS
> employee_id;
No rows affected (41.074 seconds)
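When statistics have to be refreshed for many tables, it can help to generate the ANALYZE statements rather than type them. The following is a minimal Python sketch that only builds the statement text for the three variants shown above; it does not connect to Hive. The helper name `analyze_statement` and its parameters are hypothetical, identifiers are not escaped, and refusing to combine NOSCAN with FOR COLUMNS is a conservative choice of this sketch.

```python
def analyze_statement(table, partition=None, columns=None, noscan=False):
    """Build an ANALYZE TABLE statement as a string (nothing is executed)."""
    if noscan and columns:
        # Conservative assumption of this sketch: treat the two as exclusive.
        raise ValueError("NOSCAN is not combined with FOR COLUMNS in this sketch")
    parts = ["ANALYZE TABLE", table]
    if partition:
        # A value of None means "all values of this partition column".
        spec = ", ".join(
            key if value is None else f"{key}={value}"
            for key, value in partition.items()
        )
        parts.append(f"PARTITION({spec})")
    parts.append("COMPUTE STATISTICS")
    if columns:
        parts.append("FOR COLUMNS " + ", ".join(columns))
    if noscan:
        parts.append("NOSCAN")
    return " ".join(parts) + ";"

print(analyze_statement("employee", noscan=True))
print(analyze_statement("employee_partitioned", partition={"year": 2018, "month": 12}))
print(analyze_statement("employee_id", columns=["employee_id"]))
```

The generated strings match the statements used in the examples above and can be passed to whatever Hive client is in use.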
Automatic statistics collection can be enabled with the following setting:
SET hive.stats.autogather=true
With this setting, statistics for tables or partitions populated through INSERT OVERWRITE/INTO statements are collected into the metastore automatically, whereas the LOAD statement does not trigger automatic statistics collection.
Statistics can be viewed with the DESCRIBE EXTENDED/FORMATTED command, as in the following examples:
-- Check statistics in a partition
> DESCRIBE EXTENDED employee_partitioned PARTITION(year=2018, month=12);
-- Check statistics in a table
> DESCRIBE EXTENDED employee;
...
parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}).
-- Check statistics in a column
> DESCRIBE FORMATTED employee.name;
+--------+---------+---+---+---------+--------------+
|col_name|data_type|min|max|num_nulls|distinct_count| ...
+--------+---------+---+---+---------+--------------+
| name | string | | | 0 | 5 | ...
+--------+---------+---+---+---------+--------------+
+-----------+-----------+
|avg_col_len|max_col_len| ...
+-----------+-----------+
| 5.6 | 7 | ...
+-----------+-----------+
3 rows selected (0.116 seconds)
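The `parameters:{...}` portion of the DESCRIBE EXTENDED output is hard to read by eye, so it can be convenient to pull the basic statistics into a dictionary. The following is a minimal Python sketch under stated assumptions: the function name `parse_table_parameters` is hypothetical, and the values are assumed to be flat `key=value` pairs without embedded commas or braces, as in the sample above. Output whose values contain nested structures would need a real parser.

```python
import re

def parse_table_parameters(describe_output, required_key="numRows"):
    """Return the first parameters:{...} map that contains required_key."""
    # The output may contain several parameters:{...} blocks, so each
    # block is checked until one with the statistics is found.
    for body in re.findall(r"parameters:\{([^}]*)\}", describe_output):
        params = {}
        for pair in body.split(","):
            key, sep, value = pair.strip().partition("=")
            if sep:  # skip empty fragments such as an empty {} block
                params[key] = value
        if required_key in params:
            return params
    return {}

line = ("parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, "
        "transient_lastDdlTime=1417726247, numRows=4, totalSize=227, "
        "rawDataSize=223})")
stats = parse_table_parameters(line)
print(stats["numRows"], stats["totalSize"], stats["rawDataSize"])
```

All values are kept as strings; convert them with `int()` where arithmetic is needed, for example to compare `totalSize` with `rawDataSize`.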
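- Logs: Logs provide detailed information about how a query or job runs. By checking the logs, we can identify runtime problems and errors, which are often the cause of poor performance. Hive provides two types of logs: the system log and the job log.
- The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties, which contains the following settings: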
hive.root.logger=WARN,DRFA ## set logger level
hive.log.dir=/tmp/${user.name} ## set log file path
hive.log.file=hive.log ## set log file name
The settings above apply to all users. We can also pass them on the Hive command line, in which case they take effect only for the current user session, for example:
$ hive --hiveconf hive.root.logger=DEBUG,console
- The job log contains job information and is usually managed by YARN. To view the log of a job, run the following yarn command:
yarn logs -applicationId <application_id>
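Job logs can be long, so when investigating a slow or failed query we often only want the lines that point to a problem. The following is a minimal Python sketch that wraps the command above. The function names are hypothetical, the application ID in the example is made up, and filtering on the strings "ERROR" and "Exception" is an assumption about what is worth looking at, not a rule defined by YARN or Hive.

```python
import re
import subprocess

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")

def yarn_logs_command(application_id):
    """Build the argument list for `yarn logs -applicationId <id>`."""
    if not APP_ID_PATTERN.fullmatch(application_id):
        raise ValueError(f"not a YARN application ID: {application_id!r}")
    return ["yarn", "logs", "-applicationId", application_id]

def fetch_error_lines(application_id):
    """Run yarn logs and keep only lines mentioning ERROR or Exception."""
    result = subprocess.run(
        yarn_logs_command(application_id),
        capture_output=True, text=True, check=True,
    )
    return [
        line for line in result.stdout.splitlines()
        if "ERROR" in line or "Exception" in line
    ]

# Hypothetical application ID, used only to show the generated command.
print(" ".join(yarn_logs_command("application_1417726247000_0001")))
```

Only `fetch_error_lines` needs the `yarn` client on the PATH; building the command works anywhere and makes the ID validation easy to check in isolation.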