Spark Streaming stateful DStream: persisting RDDs / stateful memory

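In stream-oriented distributed computing, a common requirement is that some data set being processed should persist rather than disappear as the stream flows past.

Taking Spark Streaming as an example, the goal is a data set that can be updated in the current batch and still be accessed in the next one. The simplest implementation is to keep a large in-memory structure yourself in the driver. The drawback of this hack is that it cannot use the distributed computing capabilities Spark provides.

To address this, Spark Streaming provides stateful streaming: you can create a stateful DStream and operate on an RDD that spans batches.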

1 updateStateByKey

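This method provides the following mechanism: it maintains an RDD that spans batches, call it the StateRDD. In every batch it iterates over all the data in the StateRDD and runs the update method on each record. When the update method returns None, that record is evicted from the StateRDD.

The interface is as follows: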

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @param numPartitions Number of partitions of each RDD in the new DStream.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    numPartitions: Int
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}

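That is, the user implements an updateFunc function with the following parameters:

Seq[V]: the values in the current batch that share the same key, passed as a Seq

Option[S]: the data held in the historical state for that key

Return value: the historical state to keep; None means the entry is deleted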

def updateStateFunc(lines: Seq[Array[String]], state: Option[Array[String]]): Option[Array[String]] = {...}
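To make the contract of the update function concrete, here is a minimal sketch in plain Scala. The names `updateCount` and `applyBatch` and the eviction rule (drop a key as soon as a batch brings no new values for it) are assumptions chosen for illustration, not part of Spark. `applyBatch` is only a local stand-in for what a batch does: it calls the update function for every key found in the state or in the batch.

```scala
object UpdateStateSketch {
  // Same shape as updateFunc: (Seq[V], Option[S]) => Option[S].
  // Hypothetical rule: keep a running sum per key; returning None
  // evicts the key when the batch holds no new values for it.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    if (newValues.isEmpty) None
    else Some(state.getOrElse(0) + newValues.sum)

  // Local simulation of one batch: the update function runs for every key
  // in the old state or in the batch, not only for keys with new data.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    (state.keySet ++ grouped.keySet).flatMap { key =>
      updateCount(grouped.getOrElse(key, Seq.empty), state.get(key)).map(key -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val afterBatch1 = applyBatch(Map.empty, Seq("a" -> 1, "b" -> 2, "a" -> 3))
    val afterBatch2 = applyBatch(afterBatch1, Seq("a" -> 5))
    println(afterBatch1) // state after the first batch
    println(afterBatch2) // "b" received nothing in batch 2, so it was evicted
  }
}
```

In an actual job, assuming a `DStream[(String, Int)]` named `pairs`, the same function would be passed as `pairs.updateStateByKey(updateCount _)`.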

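This approach is simple and clear, but it leaves some room for optimization:

a) When the StateRDD has grown large while each incoming batch is comparatively small, every batch still has to traverse the whole StateRDD, whether or not the batch contains any data that updates it. This can cause performance problems.

b) The user has to implement the eviction policy, which is sometimes inconvenient.

c) The returned type must be the same as the type held in the cache; the type cannot change.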

2 mapWithState

This interface is an improvement on updateStateByKey and addresses the points above that leave room for optimization:

/**
 * :: Experimental ::
 * Return a [[MapWithStateDStream]] by applying a function to every key-value element of
 * `this` stream, while maintaining some state data for each unique key. The mapping function
 * and other specification (e.g. partitioners, timeouts, initial state data, etc.) of this
 * transformation can be specified using [[StateSpec]] class. The state data is accessible in
 * as a parameter of type [[State]] in the mapping function.
 *
 * Example of using `mapWithState`:
 * {{{
 *    // A mapping function that maintains an integer state and return a String
 *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
 *      // Use state.exists(), state.get(), state.update() and state.remove()
 *      // to manage state, and return the necessary string
 *    }
 *
 *    val spec = StateSpec.function(mappingFunction).numPartitions(10)
 *
 *    val mapWithStateDStream = keyValueDStream.mapWithState[StateType, MappedType](spec)
 * }}}
 *
 * @param spec          Specification of this transformation
 * @tparam StateType    Class type of the state data
 * @tparam MappedType   Class type of the mapped data
 */
@Experimental
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self,
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}

Here spec wraps the user-defined function that updates the cached data:

mappingFunction: (KeyType, Option[ValueType], State[StateType]) => MappedType

A sample implementation:

val mappingFunc = (k: String, line: Option[Array[String]], state: State[Array[String]]) => {...}

The parameters are:

k: the key of the record

line: the data of each record in the RDD

state: the cached data

Calling remove on state deletes that entry.

Note: if the entry has timed out (state.isTimingOut() reports this case), do not call remove, because Spark removes it automatically after mappingFunc returns.

a) Unlike updateStateByKey, which traverses the cached data on every batch, mapWithState traverses only the data in each batch and updates the matching cached entries. When the cached data is large, this improves performance considerably.

b) It provides a built-in timeout mechanism: entries that have not been updated within a given time are evicted.

Note that mappingFunc is called both when data arrives and when a timeout occurs.
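The points above can be sketched with a running count per key. This is an illustrative sketch, not the author's code: the object name, the `runningCounts` helper, the `(String, Int)` output type, and the 30-second timeout are all assumptions. It needs the spark-streaming library on the classpath. The mapping function checks `state.isTimingOut()` first, because it is also invoked when a key times out, and in that case the state may only be read.

```scala
import org.apache.spark.streaming.{Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

object MapWithStateSketch {
  // Running count per key. The mapped output type (String, Int) is allowed
  // to differ from the state type Int.
  val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
    if (state.isTimingOut()) {
      // Invoked because the key timed out: Spark removes the state itself,
      // so only read it here; update() and remove() are not allowed.
      (key, state.get())
    } else {
      val sum = state.getOption().getOrElse(0) + value.getOrElse(0)
      state.update(sum)
      (key, sum)
    }
  }

  // The 30-second idle timeout is an arbitrary illustrative value.
  val spec = StateSpec.function(mappingFunc).timeout(Seconds(30))

  // Hypothetical helper: applies the spec to an existing key-value stream.
  def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
    pairs.mapWithState(spec)
}
```

Only keys that appear in the current batch, or that are timing out, reach `mappingFunc`, which is where the performance difference from updateStateByKey comes from.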

3 checkpointing

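Normally, the data that the RDD transformations in a DStream depend on comes from the current batch. Stateful transformations, however, including updateStateByKey, reduceByKeyAndWindow, and mapWithState, also depend on data from earlier batches. Because of the RDD fault-tolerance mechanism, recomputing an RDD after a failure needs the RDD information from those earlier batches. If this dependency chain grows too long, it requires a large amount of memory, even when some of the RDDs are already in memory and do not need to be recomputed. Spark breaks the dependency chain with checkpoints: a checkpoint writes a new RDD to HDFS that holds the computed result set and has no dependency on the earlier RDDs.

In this case checkpointing must be enabled, so that RDDs are checkpointed periodically.

When StateDStream implements the RDD compute method, it merges the previous PreStateRDD with the ParentRDD that the current batch depends on.

The checkpoint is then implemented by writing this merged RDD to HDFS.

In the current checkpoint implementation, writing the data to HDFS is done asynchronously by a fixed thread pool. One risk is that a new checkpoint arrives before the previous one has finished writing, which increases the load on the cluster and may trigger a series of problems.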

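4 Setting the checkpoint interval

You can call the checkpoint method on the DStream returned by mapWithState or updateStateByKey to set the checkpoint interval. Note that the duration passed must be an integer multiple of the batch interval.

In addition, for mapWithState, data is actually deleted only when a checkpoint runs. State.remove only sets a flag that marks the entry as deleted; the data is not really removed, and it can still be retrieved through the snapshot method.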
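As a sketch of the interval setting described above, the following program enables checkpointing and sets the interval on a stateful stream. The checkpoint path, the socket source on localhost:9999, the 5-second batch interval, and the factor of 10 are assumptions chosen for illustration; it needs the spark-streaming library and a text source to run.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointIntervalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-interval-sketch")
    val batchInterval = Seconds(5)
    val ssc = new StreamingContext(conf, batchInterval)

    // Stateful transformations need a checkpoint directory (hypothetical path;
    // on a cluster this would normally be an HDFS location).
    ssc.checkpoint("/tmp/spark-checkpoint-sketch")

    // Hypothetical input: lines of text read from a local socket.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + values.sum))

    // Checkpoint the state RDDs every 10 batches. The duration is derived
    // from the batch interval so it stays an integer multiple of it.
    counts.checkpoint(batchInterval * 10)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

A longer interval means fewer writes to the checkpoint directory but a longer lineage to recompute after a failure, so the factor is a trade-off to tune for the workload.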

References:
[1] https://halfvim.github.io/2016/06/19/Checkpointing-in-Spark-Streaming/
[2] http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark/
