Spark-Sql的參數(shù)調(diào)優(yōu):
官網(wǎng): http://spark.apache.org/docs/latest/sql-programming-guide.html
spark.catalog.cacheTable("tableName")
sqlContext.cacheTable("tableName")
spark.catalog.uncacheTable("tableName")
sqlContext.uncacheTable("tableName")
或者緩存dataFrame
dataFrame.cache()
dataFrame.unpersist()
或者CACHE TABLE
//緩存全表
sqlContext.sql("CACHE TABLE activity")
//緩存過(guò)濾結(jié)果
sqlContext.sql("CACHE TABLE activity_cached as select * from activity where ...")
// lazy cache
sqlContext.sql("CACHE LAZY TABLE ...")
// 取消緩存
sqlContext.sql("UNCACHE TABLE activity")
可通過(guò)兩種配置方式開啟緩存數(shù)據(jù)功能:
使用SQLContext的setConf方法
執(zhí)行SQL命令 SET key=value
- 參數(shù)配置
| Property Name | Default | Meaning |
|---|---|---|
| spark.sql.files.maxPartitionBytes | 134217728 (128 MB) | The maximum number of bytes to pack into a single partition when reading files. |
| spark.sql.files.openCostInBytes | 4194304 (4 MB) | The estimated cost to open a file, measured by the number of bytes could be scanned in the same time. This is used when putting multiple files into a partition. It is better to over estimated, then the partitions with small files will be faster than partitions with bigger files (which is scheduled first). |
| spark.sql.broadcastTimeout | 300 | Timeout in seconds for the broadcast wait time in broadcast joins |
| spark.sql.autoBroadcastJoinThreshold | 10485760 (10 MB) | Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run. |
| spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
用到的配置
-- spark.sql.autoBroadcastJoinThreshold, broadcast表的最大值10M,當(dāng)這是為-1時(shí), broadcasting不可用,內(nèi)存允許的情況下加大這個(gè)值
-- spark.sql.shuffle.partitions 當(dāng)join或者聚合產(chǎn)生shuffle操作時(shí), partitions的數(shù)量, 這個(gè)值可以調(diào)大點(diǎn), 我一般配置500, 切分更多的task, 有助于數(shù)據(jù)傾斜的減緩, 但是如果task越多, shuffle數(shù)據(jù)量也會(huì)增多
- Broadcast Hint for SQL Queries
參考: https://blog.csdn.net/dabokele/article/details/65963401
import org.apache.spark.sql.functions.broadcast
broadcast(spark.table("src")).join(spark.table("records"), "key").show()
對(duì)于broadcast join模式,會(huì)將小于spark.sql.autoBroadcastJoinThreshold值(默認(rèn)為10M)的表廣播到其他計(jì)算節(jié)點(diǎn),不走shuffle過(guò)程,所以會(huì)更加高效。
官網(wǎng)的原話是這個(gè)樣子:
The BROADCAST hint guides Spark to broadcast each specified table when joining them with another table or view. When Spark deciding the join methods, the broadcast hash join (i.e., BHJ) is preferred, even if the statistics is above the configuration spark.sql.autoBroadcastJoinThreshold. When both sides of a join are specified, Spark broadcasts the one having the lower statistics. Note Spark does not guarantee BHJ is always chosen, since not all cases (e.g. full outer join) support BHJ. When the broadcast nested loop join is selected, we still respect the hint.
注意: 確定broadcast hash join的決定性因素是hive的表統(tǒng)計(jì)信息一定要準(zhǔn)確。并且,由于視圖是沒有表統(tǒng)計(jì)信息的,所以所有的視圖在join時(shí)都不會(huì)被廣播。所以至少要有一張hive表。
------------------------待完善------------------------