Fixing stale partitions when Spark reads a table after the Hive command msck repair table table_name has repaired the table's partitions

https://blog.csdn.net/weixin_40829577/article/details/109001268

Contents

1 Cause

2 Solution

1 Cause

To improve performance, Spark caches table metadata. When an external system (for example, Hive running msck repair table) updates that metadata, Spark must invalidate and refresh its cached copy of the table's metadata before the change becomes visible. The Scaladoc for the relevant API explains this:

```scala
/**
 * Invalidates and refreshes all the cached data and metadata of the given table. For performance
 * reasons, Spark SQL or the external data source library it uses might cache certain metadata
 * about a table, such as the location of blocks. When those change outside of Spark SQL, users
 * should call this function to invalidate the cache.
 *
 * If this table is cached as an InMemoryRelation, drop the original cached version and make the
 * new version cached lazily.
 *
 * @param tableName is either a qualified or unqualified name that designates a table/view.
 *                  If no database identifier is provided, it refers to a temporary view or
 *                  a table/view in the current database.
 * @since 2.0.0
 */
def refreshTable(tableName: String): Unit
```
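To make the failure mode concrete, the session below is a minimal sketch of the symptom and the fix; the table name db.events and partition column dt are hypothetical, not from the original article:

```scala
// In spark-shell. Spark has already read and cached the table's partition
// metadata from the Hive metastore.
spark.sql("SHOW PARTITIONS db.events").show()   // the new partition is missing

// Meanwhile, on the Hive side, newly written partition directories are
// registered with:  hive> msck repair table events;

// Spark still serves the stale cached metadata, so the new partition stays
// invisible until the cache is invalidated:
spark.sql("SELECT count(*) FROM db.events WHERE dt = '2020-10-10'").show()

// Invalidate and refresh the cached metadata; the repaired partitions
// now appear:
spark.catalog.refreshTable("db.events")
spark.sql("SHOW PARTITIONS db.events").show()
```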

2 Solution

1. Launch the spark-shell client

1) Allocate enough executor-memory/driver-memory; otherwise the job will run out of memory;

2) Do not set the parallelism too high; otherwise it will exceed the allowed number of concurrent accesses;

```bash
spark-shell \
  --name ShyTestError \
  --master yarn \
  --deploy-mode client \
  --num-executors 3 \
  --executor-memory 24G \
  --executor-cores 2 \
  --driver-memory 8G \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.memoryOverhead=4G \
  --conf spark.default.parallelism=12 \
  --conf spark.sql.shuffle.partitions=12
```
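Once the shell is up, it can be worth sanity-checking that the submitted settings took effect; a minimal sketch using the standard SparkSession and SparkContext APIs (the expected values match the command above):

```scala
// Verify that the session picked up the submitted configuration.
println(spark.conf.get("spark.sql.shuffle.partitions"))           // expect: 12
println(spark.sparkContext.getConf.get("spark.executor.memory"))  // expect: 24G
```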

2. Refresh the metadata of the affected table

```scala
spark.catalog.refreshTable("table_name")
```
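If you would rather stay in SQL, Spark exposes the same invalidation as a REFRESH TABLE statement, which can be issued through spark.sql; this is an equivalent alternative, not the article's original command:

```scala
// SQL equivalent of spark.catalog.refreshTable("table_name")
spark.sql("REFRESH TABLE table_name")
```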
