Not in subquery and Not exists subquery in Spark SQL

在本文中,您將了解.

1.? exists and in 子查詢的在mysql中的區(qū)別。

2.? exists and in 子查詢在sparksql中的實現(xiàn)。

3.? not exists and not in 子查詢在sparksql中的實現(xiàn)。

4.? 什麼是Nested loop join 和它的適用范圍。

5.? 一個例子顯示兩個子查詢在生產(chǎn)環(huán)境中spark sql 上的性能差異。

1. difference between exists and in subquery in mysql

1. 當(dāng)使用in子查詢的時候??梢詫ν獗砗蛢?nèi)表做索引。

2. 當(dāng)使用exists子查詢的時候,只對內(nèi)表做索引。

3. 在這里區(qū)別就是是否對外表做索引, 做索引花的時間是否比內(nèi)存中查詢要快。如果外表數(shù)據(jù)很小。顯然做索引是不值得的。如果外表很大,做索引是值得的。這也解釋了爲(wèi)什麼外表大,推薦用in子查詢,外表小,推薦用exists.

in subquery

select A.key from A.key in (select B.key from B)

exists subquery

select A.key where exists (select * from B where A.key = B.key)

2.?in subquery and exists subquery in Spark SQL implementation

比較下兩者的physical plan

in subquery

explain select a.key1? from testdata1 as a? where a.key1 in? (select key3 from testdata3 as b);

== Physical Plan ==

*(1) Project [key1#52]

+- *(1) BroadcastHashJoin [key1#52], [key3#54], LeftSemi, BuildRight

? :- HiveTableScan [key1#52], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#52, value1#53]

? +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))

? ? ? +- HiveTableScan [key3#54], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#54, value3#55]

Time taken: 0.062 seconds, Fetched 1 row(s)


exists query

explain select a.key1? from testdata1 as a? where? exists (select * from testdata3 as b where a.key1=b.key3);

== Physical Plan ==

*(1) Project [key1#15]

+- *(1) BroadcastHashJoin [key1#15], [key3#17], LeftSemi, BuildRight

? :- HiveTableScan [key1#15], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#15, value1#16]

? +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))

? ? ? +- HiveTableScan [key3#17], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#17, value3#18]

Time taken: 0.347 seconds, Fetched 1 row(s)


兩個的spark plan 是一樣的。

3. not in subquery and not exists subquery in Spark SQL implementation

not in subquery

explain select a.key1? from testdata1 as a? where a.key1 not in? (select key3 from testdata3 as b);

== Physical Plan ==

*(1) Project [key1#66]

+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#66 = key3#68) || isnull((key1#66 = key3#68)))

? :- HiveTableScan [key1#66], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#66, value1#67]

? +- BroadcastExchange IdentityBroadcastMode

? ? ? +- HiveTableScan [key3#68], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#68, value3#69]

```

not exists subquery

explain select a.key1? from testdata1 as a? where? not exists (select * from testdata3 as b where a.key1=b.key3);

== Physical Plan ==

*(1) Project [key1#73]

+- *(1) BroadcastHashJoin [key1#73], [key3#75], LeftAnti, BuildRight

? :- HiveTableScan [key1#73], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#73, value1#74]

? +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))

? ? ? +- HiveTableScan [key3#75], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#75, value3#76]

```


因爲(wèi)spark使用broadcastnestedloopJoin去實現(xiàn)not in 子查詢,而使用broadcasthashjoin實現(xiàn)not exists子查詢。后面會解釋broadcastnestedloopJoin爲(wèi)什麼是個性能不太友好的join算法,因爲(wèi)它總共要要在內(nèi)存里遍歷count(A)*count(B)遍, 而在實際生產(chǎn)中, count(A)*count(B)是個很大的數(shù)字。 并且not in 子查詢我們是沒有辦法通過參數(shù)優(yōu)化改變它的spark plan.

4. Nested loop join and which case is suitable

嵌套循環(huán)連接通常被定義為sql中最基本的連接算法。在偽代碼中,它可以實現(xiàn)為:

O_SET? # outer query set

I_SET # inner query set

PREDICATE # join predicate

foreach o_row in O_SET:

? foreach i_row in I_SET:

? ? if i_row matches PREDICATE:

? ? ? return (o_row, i_row)

broadcastnestedloopJoin性能并不友好,因為它需要將外表和內(nèi)表加載到內(nèi)存中,并將外表中的記錄與內(nèi)表中的記錄進行比較。它適用于數(shù)據(jù)量很小的表。


5. An example to show the significant difference between (not in subquery) ?and (not exists subquery)

顯示查詢掛起時間長達39min,實際掛起時間長達15h。

select a.* from A as a where a.tracking_number not in (select b.tracking_number from B as b)




在我們修改sql之后,它用了1.1分鐘來完成。原因是使用Broadcasthash join而不是broadcastnestedloopJoin

select a.*from A as a where a.tracking_number not exists? (select b.tracking_number from B as b);


Conclusion

1.何時在spark 使用in子查詢,exists子查詢?

都可以

2.何時在spark?使用not in子查詢,not exists子查詢?

建議使用not exists 子查詢因爲(wèi) not in 子查詢使用BroadcastNestedLoopJoin 這種性能不友好的算法實現(xiàn),只適用數(shù)據(jù)量很小的情況,對生產(chǎn)環(huán)境不適合。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi閱讀 7,872評論 0 10
  • pyspark.sql模塊 模塊上下文 Spark SQL和DataFrames的重要類: pyspark.sql...
    mpro閱讀 9,932評論 0 13
  • **2014真題Directions:Read the following text. Choose the be...
    又是夜半驚坐起閱讀 11,214評論 0 23
  • 1. “把青春獻給身后那座輝煌的都市,為了這個美夢我們付出著代價” 此時此刻聽到《私奔》這首歌,我才回想起曾經(jīng)我也...
    枝秋閱讀 316評論 2 0
  • 01 女友:“這件衣服哪個顏色我穿上好看?” 男友:“你穿哪件都好看。” 女友生氣而走,男友一臉茫然。 戀愛中,女...
    一顆玉米豆閱讀 199評論 0 0

友情鏈接更多精彩內(nèi)容