SQL Differences Between Impala and Hive



http://www.cloudera.com/documentation/cdh/5-1-x/Impala/Installing-and-Using-Impala/ciiu_langref_unsupported.html#langref_hiveql_delta


Note that using window functions currently requires a HiveContext (in Spark < 2.0).

set mapreduce.job.queuename = production;

set hive.execution.engine = spark;


To compute aggregates over a time range before and after each event, there are two approaches:

Let's start with example data:

import spark.implicits._  // import sqlContext.implicits._ in Spark < 2.0

val df = Seq(

("001", "event1", 10, "2016-05-01 10:50:51"),

("002", "event2", 100, "2016-05-01 10:50:53"),

("001", "event3", 20, "2016-05-01 10:50:55"),

("001", "event1", 15, "2016-05-01 10:51:50"),

("003", "event1", 13, "2016-05-01 10:55:30"),

("001", "event2", 12, "2016-05-01 10:57:00"),

("001", "event3", 11, "2016-05-01 11:00:01")

).toDF("KEY", "Event_Type", "metric", "Time")

I assume that an event is identified by KEY. If this is not the case, you can adjust the GROUP BY / PARTITION BY clauses according to your requirements.

If you're interested in an aggregation with a static window, independent of the data, convert timestamps to numeric values and round:

import org.apache.spark.sql.functions.{round, sum}

// cast string to timestamp

val ts = $"Time".cast("timestamp").cast("long")

// Round to 300 seconds interval

val interval = (round(ts / 300L) * 300.0).cast("timestamp").alias("interval")

df.groupBy($"KEY", interval).sum("metric")

// +---+---------------------+-----------+
// |KEY|interval             |sum(metric)|
// +---+---------------------+-----------+
// |001|2016-05-01 11:00:00.0|11         |
// |001|2016-05-01 10:55:00.0|12         |
// |001|2016-05-01 10:50:00.0|45         |
// |003|2016-05-01 10:55:00.0|13         |
// |002|2016-05-01 10:50:00.0|100        |
// +---+---------------------+-----------+
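The 300-second bucketing arithmetic above can be sanity-checked in plain Scala, without Spark. This is a sketch, not part of the original answer; the `bucket` helper is hypothetical and mirrors `round(ts / 300) * 300` using the JDK time API (epoch seconds in UTC):

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Round a timestamp string to the nearest 300-second interval,
// the same arithmetic as (round(ts / 300L) * 300.0) in the Spark code.
def bucket(time: String): String = {
  val ts = LocalDateTime.parse(time, fmt).toEpochSecond(ZoneOffset.UTC)
  val rounded = math.round(ts / 300.0) * 300L
  LocalDateTime.ofEpochSecond(rounded, 0, ZoneOffset.UTC).format(fmt)
}

println(bucket("2016-05-01 10:50:51")) // 2016-05-01 10:50:00
println(bucket("2016-05-01 10:57:00")) // 2016-05-01 10:55:00
```

The buckets agree with the `interval` column in the output above: the three 10:50-ish events for KEY 001 land in the 10:50:00 bucket (sum 45), while 10:57:00 rounds down into 10:55:00.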

If you're interested in a window relative to the current row use window functions:

import org.apache.spark.sql.expressions.Window

// Partition by KEY

// Order by timestamp

// Consider a window of -150 seconds to +150 seconds relative to the current row

val w = Window.partitionBy($"KEY").orderBy("ts").rangeBetween(-150, 150)

df.withColumn("ts", ts).withColumn("window_sum", sum($"metric").over(w))

// +---+----------+------+-------------------+----------+----------+
// |KEY|Event_Type|metric|Time               |ts        |window_sum|
// +---+----------+------+-------------------+----------+----------+
// |003|event1    |13    |2016-05-01 10:55:30|1462092930|13        |
// |001|event1    |10    |2016-05-01 10:50:51|1462092651|45        |
// |001|event3    |20    |2016-05-01 10:50:55|1462092655|45        |
// |001|event1    |15    |2016-05-01 10:51:50|1462092710|45        |
// |001|event2    |12    |2016-05-01 10:57:00|1462093020|12        |
// |001|event3    |11    |2016-05-01 11:00:01|1462093201|11        |
// |002|event2    |100   |2016-05-01 10:50:53|1462092653|100       |
// +---+----------+------+-------------------+----------+----------+

For performance reasons this approach is useful only if the data can be partitioned into multiple separate groups. In Spark < 2.0.0 you'll also need a HiveContext to make it work.
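The semantics of `rangeBetween(-150, 150)` can likewise be verified in plain Scala against the sample data: for each row, sum `metric` over rows with the same KEY whose timestamp is within 150 seconds of the current one. This is an illustrative sketch (the `windowSum` helper is hypothetical), not how Spark executes the window:

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
def epoch(s: String): Long =
  LocalDateTime.parse(s, fmt).toEpochSecond(ZoneOffset.UTC)

// Same (KEY, metric, Time) rows as the example DataFrame.
val rows = Seq(
  ("001", 10, "2016-05-01 10:50:51"),
  ("002", 100, "2016-05-01 10:50:53"),
  ("001", 20, "2016-05-01 10:50:55"),
  ("001", 15, "2016-05-01 10:51:50"),
  ("003", 13, "2016-05-01 10:55:30"),
  ("001", 12, "2016-05-01 10:57:00"),
  ("001", 11, "2016-05-01 11:00:01")
)

// Sum metric over same-KEY rows whose ts is within +/-150 s of `time`,
// i.e. the partitionBy(KEY).rangeBetween(-150, 150) frame.
def windowSum(key: String, time: String): Int = {
  val t = epoch(time)
  rows.collect {
    case (k, m, s) if k == key && math.abs(epoch(s) - t) <= 150 => m
  }.sum
}
```

For instance, KEY 001 at 10:50:51 picks up the 10:50:55 and 10:51:50 events (10 + 20 + 15 = 45), while 10:57:00 is more than 150 seconds from its neighbours and keeps only its own metric (12), matching the `window_sum` column above.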

