spark -SQL 配置參數(shù)

Spark-Sql的參數(shù)調(diào)優(yōu):

官網(wǎng): http://spark.apache.org/docs/latest/sql-programming-guide.html

spark.catalog.cacheTable("tableName") 
sqlContext.cacheTable("tableName")
spark.catalog.uncacheTable("tableName") 
sqlContext.uncacheTable("tableName")

或者緩存dataFrame

dataFrame.cache()   
dataFrame.unpersist()

或者CACHE TABLE

//緩存全表
sqlContext.sql("CACHE TABLE activity")
//緩存過(guò)濾結(jié)果
sqlContext.sql("CACHE TABLE activity_cached as select * from activity where ...")
// lazy cache
sqlContext.sql("CACHE LAZY TABLE ...")
// 取消緩存
sqlContext.sql("UNCACHE TABLE activity")

可通過(guò)兩種配置方式開啟緩存數(shù)據(jù)功能:

使用SQLContext的setConf方法
執(zhí)行SQL命令 SET key=value

  • 參數(shù)配置
Property Name Default Meaning
spark.sql.files.maxPartitionBytes 134217728 (128 MB) The maximum number of bytes to pack into a single partition when reading files.
spark.sql.files.openCostInBytes 4194304 (4 MB) The estimated cost to open a file, measured by the number of bytes could be scanned in the same time. This is used when putting multiple files into a partition. It is better to over estimated, then the partitions with small files will be faster than partitions with bigger files (which is scheduled first).
spark.sql.broadcastTimeout 300 Timeout in seconds for the broadcast wait time in broadcast joins
spark.sql.autoBroadcastJoinThreshold 10485760 (10 MB) Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
spark.sql.shuffle.partitions 200 Configures the number of partitions to use when shuffling data for joins or aggregations.

用到的配置
-- spark.sql.autoBroadcastJoinThreshold, broadcast表的最大值10M,當(dāng)這是為-1時(shí), broadcasting不可用,內(nèi)存允許的情況下加大這個(gè)值
-- spark.sql.shuffle.partitions 當(dāng)join或者聚合產(chǎn)生shuffle操作時(shí), partitions的數(shù)量, 這個(gè)值可以調(diào)大點(diǎn), 我一般配置500, 切分更多的task, 有助于數(shù)據(jù)傾斜的減緩, 但是如果task越多, shuffle數(shù)據(jù)量也會(huì)增多

import org.apache.spark.sql.functions.broadcast
broadcast(spark.table("src")).join(spark.table("records"), "key").show()

對(duì)于broadcast join模式,會(huì)將小于spark.sql.autoBroadcastJoinThreshold值(默認(rèn)為10M)的表廣播到其他計(jì)算節(jié)點(diǎn),不走shuffle過(guò)程,所以會(huì)更加高效。
官網(wǎng)的原話是這個(gè)樣子:
The BROADCAST hint guides Spark to broadcast each specified table when joining them with another table or view. When Spark deciding the join methods, the broadcast hash join (i.e., BHJ) is preferred, even if the statistics is above the configuration spark.sql.autoBroadcastJoinThreshold. When both sides of a join are specified, Spark broadcasts the one having the lower statistics. Note Spark does not guarantee BHJ is always chosen, since not all cases (e.g. full outer join) support BHJ. When the broadcast nested loop join is selected, we still respect the hint.
注意: 確定broadcast hash join的決定性因素是hive的表統(tǒng)計(jì)信息一定要準(zhǔn)確。并且,由于視圖是沒有表統(tǒng)計(jì)信息的,所以所有的視圖在join時(shí)都不會(huì)被廣播。所以至少要有一張hive表。

------------------------待完善------------------------

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容