在本文中,您將了解.
1.? exists and in 子查詢的在mysql中的區(qū)別。
2.? exists and in 子查詢在sparksql中的實現(xiàn)。
3.? not exists and not in 子查詢在sparksql中的實現(xiàn)。
4.? 什麼是Nested loop join 和它的適用范圍。
5.? 一個例子顯示兩個子查詢在生產(chǎn)環(huán)境中spark sql 上的性能差異。
1. difference between exists and in subquery in mysql
1. 當(dāng)使用in子查詢的時候??梢詫ν獗砗蛢?nèi)表做索引。
2. 當(dāng)使用exists子查詢的時候,只對內(nèi)表做索引。
3. 在這里區(qū)別就是是否對外表做索引, 做索引花的時間是否比內(nèi)存中查詢要快。如果外表數(shù)據(jù)很小。顯然做索引是不值得的。如果外表很大,做索引是值得的。這也解釋了爲(wèi)什麼外表大,推薦用in子查詢,外表小,推薦用exists.
in subquery
select A.key from A.key in (select B.key from B)
exists subquery
select A.key where exists (select * from B where A.key = B.key)
2.?in subquery and exists subquery in Spark SQL implementation
比較下兩者的physical plan
in subquery
explain select a.key1? from testdata1 as a? where a.key1 in? (select key3 from testdata3 as b);
== Physical Plan ==
*(1) Project [key1#52]
+- *(1) BroadcastHashJoin [key1#52], [key3#54], LeftSemi, BuildRight
? :- HiveTableScan [key1#52], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#52, value1#53]
? +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
? ? ? +- HiveTableScan [key3#54], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#54, value3#55]
Time taken: 0.062 seconds, Fetched 1 row(s)
exists query
explain select a.key1? from testdata1 as a? where? exists (select * from testdata3 as b where a.key1=b.key3);
== Physical Plan ==
*(1) Project [key1#15]
+- *(1) BroadcastHashJoin [key1#15], [key3#17], LeftSemi, BuildRight
? :- HiveTableScan [key1#15], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#15, value1#16]
? +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
? ? ? +- HiveTableScan [key3#17], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#17, value3#18]
Time taken: 0.347 seconds, Fetched 1 row(s)
兩個的spark plan 是一樣的。
3. not in subquery and not exists subquery in Spark SQL implementation
not in subquery
explain select a.key1? from testdata1 as a? where a.key1 not in? (select key3 from testdata3 as b);
== Physical Plan ==
*(1) Project [key1#66]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#66 = key3#68) || isnull((key1#66 = key3#68)))
? :- HiveTableScan [key1#66], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#66, value1#67]
? +- BroadcastExchange IdentityBroadcastMode
? ? ? +- HiveTableScan [key3#68], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#68, value3#69]
```
not exists subquery
explain select a.key1? from testdata1 as a? where? not exists (select * from testdata3 as b where a.key1=b.key3);
== Physical Plan ==
*(1) Project [key1#73]
+- *(1) BroadcastHashJoin [key1#73], [key3#75], LeftAnti, BuildRight
? :- HiveTableScan [key1#73], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#73, value1#74]
? +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
? ? ? +- HiveTableScan [key3#75], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#75, value3#76]
```
因爲(wèi)spark使用broadcastnestedloopJoin去實現(xiàn)not in 子查詢,而使用broadcasthashjoin實現(xiàn)not exists子查詢。后面會解釋broadcastnestedloopJoin爲(wèi)什麼是個性能不太友好的join算法,因爲(wèi)它總共要要在內(nèi)存里遍歷count(A)*count(B)遍, 而在實際生產(chǎn)中, count(A)*count(B)是個很大的數(shù)字。 并且not in 子查詢我們是沒有辦法通過參數(shù)優(yōu)化改變它的spark plan.
4. Nested loop join and which case is suitable
嵌套循環(huán)連接通常被定義為sql中最基本的連接算法。在偽代碼中,它可以實現(xiàn)為:
O_SET? # outer query set
I_SET # inner query set
PREDICATE # join predicate
foreach o_row in O_SET:
? foreach i_row in I_SET:
? ? if i_row matches PREDICATE:
? ? ? return (o_row, i_row)
broadcastnestedloopJoin性能并不友好,因為它需要將外表和內(nèi)表加載到內(nèi)存中,并將外表中的記錄與內(nèi)表中的記錄進行比較。它適用于數(shù)據(jù)量很小的表。
5. An example to show the significant difference between (not in subquery) ?and (not exists subquery)
顯示查詢掛起時間長達39min,實際掛起時間長達15h。
select a.* from A as a where a.tracking_number not in (select b.tracking_number from B as b)

在我們修改sql之后,它用了1.1分鐘來完成。原因是使用Broadcasthash join而不是broadcastnestedloopJoin
select a.*from A as a where a.tracking_number not exists? (select b.tracking_number from B as b);

Conclusion
1.何時在spark 使用in子查詢,exists子查詢?
都可以
2.何時在spark?使用not in子查詢,not exists子查詢?
建議使用not exists 子查詢因爲(wèi) not in 子查詢使用BroadcastNestedLoopJoin 這種性能不友好的算法實現(xiàn),只適用數(shù)據(jù)量很小的情況,對生產(chǎn)環(huán)境不適合。