數(shù)據(jù)處理的過(guò)程基本可以分為三個(gè)階段分別是,數(shù)據(jù)從來(lái)哪里,做什么業(yè)務(wù)邏輯,落地到哪里去。
flink也如此。
SourceFunction 簡(jiǎn)介
flink自定義數(shù)據(jù)源需要實(shí)現(xiàn)SourceFunction,內(nèi)置的SourceFunction實(shí)現(xiàn)類有:SocketTextStreamFunction、FromElementsFunction、FlinkKafkaConsumer 等等
SourceFunction 定義了2個(gè)方法 run 和cancel 。如下圖

run方法的主體就是實(shí)現(xiàn)數(shù)據(jù)的生產(chǎn)邏輯。比如從Redis里面獲取數(shù)據(jù),或者自己模擬產(chǎn)生數(shù)據(jù)邏輯。下面會(huì)舉例說(shuō)明
cancel方法就是在任務(wù)取消的時(shí)候調(diào)用,作一些狀態(tài)賦值或者鏈接關(guān)閉之類的。
自定義flink source
首先根據(jù)并行度來(lái)區(qū)分,可分為單并行度(并行度為1)和多并行度的source。單并行度的source之后的算子中不能再通過(guò)setParallelism()來(lái)改變并行度,多并行度默認(rèn)同任務(wù)的并行度
然后可以根據(jù)是否為RichFunction來(lái)區(qū)分。RichFunction接口中有open,close,getRuntimeContext和setRuntimeContext等方法來(lái)獲取狀態(tài),緩存系統(tǒng)內(nèi)部數(shù)據(jù)等
單并行度source? 實(shí)現(xiàn)?
SourceFunction
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
class NoParalleSource extends SourceFunction[String]{
private? var isrunning =true
? override def run(sourceContext: SourceFunction.SourceContext[String]):Unit = {
while (isrunning){
val time =new SimpleDateFormat("HH:mm:ss").format(new Date())
sourceContext.collect(Thread.currentThread().getId +"_"+time)
Thread.sleep(1000*1)
}
}
override def cancel():Unit = {
isrunning =false
? }
}
object NoParalleSourceTest{
def main(args: Array[String]):Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
? ? val stream = env.addSource(new NoParalleSource())/*.setParallelism(2)*/
? ? val reduce = stream.timeWindowAll(Time.seconds(5)).reduce(_+"~"+_)
reduce.print()
env.execute(NoParalleSourceTest.getClass.getName)
}
}
多并行度source 實(shí)現(xiàn)?
ParallelSourceFunction
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
/**
* 不設(shè)置并發(fā)數(shù),那就任務(wù)的默認(rèn)并發(fā)數(shù)
*/
class ParalleSource extends? ParallelSourceFunction[String]{
private var isrunning =true
? override def run(sourceContext: SourceFunction.SourceContext[String]):Unit = {
while (isrunning){
val time =new SimpleDateFormat("HH:mm:ss").format(new Date())
sourceContext.collect(Thread.currentThread().getId +"_"+time)
Thread.sleep(1000*1)
}
}
override def cancel():Unit = {
isrunning =false
? }
}
object ParalleSourceTest{
def main(args: Array[String]):Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
? ? val stream = env.addSource(new ParalleSource()).setParallelism(4)
val reduce = stream.timeWindowAll(Time.seconds(5)).reduce(_+"~"+_)
reduce.print()
env.execute(ParalleSourceTest.getClass.getName)
}
}
rich 單并行度source 實(shí)現(xiàn)? RichSourceFunction?
rich 多并行度source 實(shí)現(xiàn)? RichParallelSourceFunction
