Flink dynamic table轉(zhuǎn)成stream實(shí)戰(zhàn)

Dynamic table在flink中是一個(gè)邏輯概念,也正是Dynamic table可以讓流數(shù)據(jù)支持table API和SQL。下圖是stream和dynamic table轉(zhuǎn)換關(guān)系,先將stream轉(zhuǎn)化為dynamic table,再基于dynamic table進(jìn)行SQL操作生成新的dynamic table,最后將dynamic table轉(zhuǎn)化為stream。本文將重點(diǎn)介紹dynamic table轉(zhuǎn)化為stream的3種方式。

stream-query-stream

注:本文所涉及的代碼全部基于flink 1.9.0以及flink-table-planner-blink_2.11

Append-only stream

官方定義如下:

A dynamic table that is only modified by INSERT changes can be converted into a stream by emitting the inserted rows.

也就是說(shuō)如果dynamic table只包含了插入新數(shù)據(jù)的操作那么就可以轉(zhuǎn)化為append-only stream,所有數(shù)據(jù)追加到stream里面。

樣例代碼:

public class AppendOnlyExample {
    public static void main(String[] args) throws Exception {
        EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
        env.setParallelism(1);

        DataStream<Tuple2<String, String>> data = env.fromElements(
                new Tuple2<>("Mary", "./home"),
                new Tuple2<>("Bob", "./cart"),
                new Tuple2<>("Mary", "./prod?id=1"),
                new Tuple2<>("Liz", "./home"),
                new Tuple2<>("Bob", "./prod?id=3")
        );

        Table clicksTable = tEnv.fromDataStream(data, "user,url");

        tEnv.registerTable("clicks", clicksTable);
        Table rTable = tEnv.sqlQuery("select user,url from clicks where user='Mary'");

        DataStream ds = tEnv.toAppendStream(rTable, TypeInformation.of(new TypeHint<Tuple2<String, String>>(){}));
        ds.print();
        env.execute();
    }
}

運(yùn)行結(jié)果:

(Mary,./prod?id=8)
(Mary,./prod?id=6)

Retract stream

官方定義

A retract stream is a stream with two types of messages, add messages and retract messages. A dynamic table is converted into an retract stream by encoding an INSERT change as add message, a DELETE change as retract message, and an UPDATE change as a retract message for the updated (previous) row and an add message for the updating (new) row. The following figure visualizes the conversion of a dynamic table into a retract stream.

undo-redo-mode

樣例代碼:

public class RetractExample {
    public static void main(String[] args) throws Exception {
        EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
        env.setParallelism(1);

        DataStream<Tuple2<String, String>> data = env.fromElements(
                new Tuple2<>("Mary", "./home"),
                new Tuple2<>("Bob", "./cart"),
                new Tuple2<>("Mary", "./prod?id=1"),
                new Tuple2<>("Liz", "./home"),
                new Tuple2<>("Bob", "./prod?id=3")
        );

        Table clicksTable = tEnv.fromDataStream(data, "user,url");

        tEnv.registerTable("clicks", clicksTable);
        Table rTable = tEnv.sqlQuery("SELECT user,COUNT(url) FROM clicks GROUP BY user");

        DataStream ds = tEnv.toRetractStream(rTable, TypeInformation.of(new TypeHint<Tuple2<String, Long>>(){}));
        ds.print();
        env.execute();
    }
}

運(yùn)行結(jié)果:

(true,(Mary,1))
(true,(Bob,1))
(false,(Mary,1))
(true,(Mary,2))
(true,(Liz,1))
(false,(Bob,1))
(true,(Bob,2))

對(duì)于toRetractStream的返回值是一個(gè)Tuple2<Boolean, T>類型,第一個(gè)元素為true表示這條數(shù)據(jù)為要插入的新數(shù)據(jù),false表示需要?jiǎng)h除的一條舊數(shù)據(jù)。也就是說(shuō)可以把更新表中某條數(shù)據(jù)分解為先刪除一條舊數(shù)據(jù)再插入一條新數(shù)據(jù)。

Upsert stream

官方定義:

An upsert stream is a stream with two types of messages, upsert messages and delete messages. A dynamic table that is converted into an upsert stream requires a (possibly composite) unique key. A dynamic table with unique key is converted into a stream by encoding INSERT and UPDATE changes as upsert messages and DELETE changes as delete messages. The stream consuming operator needs to be aware of the unique key attribute in order to apply messages correctly. The main difference to a retract stream is that UPDATE changes are encoded with a single message and hence more efficient. The following figure visualizes the conversion of a dynamic table into an upsert stream.

redo-mode

這個(gè)模式和以上兩個(gè)模式不同的地方在于要實(shí)現(xiàn)將Dynamic table轉(zhuǎn)化成Upsert stream需要實(shí)現(xiàn)一個(gè)UpsertStreamTableSink,而不能直接使用StreamTableEnvironment進(jìn)行轉(zhuǎn)換。

樣例代碼:

public class UpsertExample {
    public static void main(String[] args) throws Exception {
        EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
        env.setParallelism(1);

        DataStream<Tuple2<String, String>> data = env.fromElements(
                new Tuple2<>("Mary", "./home"),
                new Tuple2<>("Bob", "./cart"),
                new Tuple2<>("Mary", "./prod?id=1"),
                new Tuple2<>("Liz", "./home"),
                new Tuple2<>("Liz", "./prod?id=3"),
                new Tuple2<>("Mary", "./prod?id=7")
        );

        Table clicksTable = tEnv.fromDataStream(data, "user,url");

        tEnv.registerTable("clicks", clicksTable);
        Table rTable = tEnv.sqlQuery("SELECT user,COUNT(url) FROM clicks GROUP BY user");
        tEnv.registerTableSink("MemoryUpsertSink", new MemoryUpsertSink(rTable.getSchema()));
        rTable.insertInto("MemoryUpsertSink");

        env.execute();
    }

    private static class MemoryUpsertSink implements UpsertStreamTableSink<Tuple2<String, Long>> {
        private TableSchema schema;
        private String[] keyFields;
        private boolean isAppendOnly;

        private String[] fieldNames;
        private TypeInformation<?>[] fieldTypes;

        public MemoryUpsertSink() {

        }

        public MemoryUpsertSink(TableSchema schema) {
            this.schema = schema;
        }

        @Override
        public void setKeyFields(String[] keys) {
            this.keyFields = keys;
        }

        @Override
        public void setIsAppendOnly(Boolean isAppendOnly) {
            this.isAppendOnly = isAppendOnly;
        }

        @Override
        public TypeInformation<Tuple2<String, Long>> getRecordType() {
            return TypeInformation.of(new TypeHint<Tuple2<String, Long>>(){});
        }

        @Override
        public void emitDataStream(DataStream<Tuple2<Boolean, Tuple2<String, Long>>> dataStream) {
            consumeDataStream(dataStream);
        }

        @Override
        public DataStreamSink<?> consumeDataStream(DataStream<Tuple2<Boolean, Tuple2<String, Long>>> dataStream) {
            return dataStream.addSink(new DataSink()).setParallelism(1);
        }

        @Override
        public TableSink<Tuple2<Boolean, Tuple2<String, Long>>> configure(String[] fieldNames, TypeInformation<?>[] fieldTypes) {
            MemoryUpsertSink memoryUpsertSink = new MemoryUpsertSink();
            memoryUpsertSink.setFieldNames(fieldNames);
            memoryUpsertSink.setFieldTypes(fieldTypes);
            memoryUpsertSink.setKeyFields(keyFields);
            memoryUpsertSink.setIsAppendOnly(isAppendOnly);

            return memoryUpsertSink;
        }

        @Override
        public String[] getFieldNames() {
            return schema.getFieldNames();
        }

        public void setFieldNames(String[] fieldNames) {
            this.fieldNames = fieldNames;
        }

        @Override
        public TypeInformation<?>[] getFieldTypes() {
            return schema.getFieldTypes();
        }

        public void setFieldTypes(TypeInformation<?>[] fieldTypes) {
            this.fieldTypes = fieldTypes;
        }
    }

    private static class DataSink extends RichSinkFunction<Tuple2<Boolean, Tuple2<String, Long>>> {
        public DataSink() {
        }

        @Override
        public void invoke(Tuple2<Boolean, Tuple2<String, Long>> value, Context context) throws Exception {
            System.out.println("send message:" + value);
        }
    }
}

運(yùn)行結(jié)果:

send message:(true,(Mary,1))
send message:(true,(Bob,1))
send message:(true,(Mary,2))
send message:(true,(Liz,1))
send message:(true,(Liz,2))
send message:(true,(Mary,3))

這種模式的返回值也是一個(gè)Tuple2<Boolean, T>類型,和Retract的區(qū)別在于更新表中的某條數(shù)據(jù)并不會(huì)返回一條刪除舊數(shù)據(jù)一條插入新數(shù)據(jù),而是看上去真的是更新了某條數(shù)據(jù)。

Upsert stream番外篇

以上所講的內(nèi)容全部都是來(lái)自于flink官網(wǎng),只是附上了與其對(duì)應(yīng)的樣例,可以讓讀者更直觀的感受每種模式的輸出效果。網(wǎng)上也有很多對(duì)官方文檔的翻譯,但是幾乎沒(méi)有文章或者樣例說(shuō)明在使用UpsertStreamTableSink的時(shí)候什么情況下返回值Tuple2<Boolean, T>第一個(gè)元素是false?話不多說(shuō)直接上樣例,只要把上面的例子中的sql改為如下

String sql = "SELECT user, cnt " +
             "FROM (" +
                    "SELECT user,COUNT(url) as cnt FROM clicks GROUP BY user" +
                   ")" +
             "ORDER BY cnt LIMIT 2";

返回結(jié)果:

send message:(true,(Mary,1))
send message:(true,(Bob,1))
send message:(false,(Mary,1))
send message:(true,(Bob,1))
send message:(true,(Mary,2))
send message:(true,(Liz,1))
send message:(false,(Liz,1))
send message:(true,(Mary,2))
send message:(false,(Mary,2))
send message:(true,(Liz,2))

具體的原理可以查看源碼,org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecRankorg.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecSortLimit
解析sql的時(shí)候通過(guò)下面的方法得到不同的strategy,由此影響是否需要?jiǎng)h除原有數(shù)據(jù)的行為。

  def getStrategy(forceRecompute: Boolean = false): RankProcessStrategy = {
    if (strategy == null || forceRecompute) {
      strategy = RankProcessStrategy.analyzeRankProcessStrategy(
        inputRel, ImmutableBitSet.of(), sortCollation, cluster.getMetadataQuery)
    }
    strategy
  }

知道什么時(shí)候會(huì)產(chǎn)生false屬性的數(shù)據(jù),對(duì)于理解JDBCUpsertTableSinkHBaseUpsertTableSink的使用會(huì)有很大的幫助。

總結(jié)

在閱讀flink文檔的時(shí)候,總是想通過(guò)代碼實(shí)戰(zhàn)體會(huì)其中的原理,很多文章是將官網(wǎng)的描述翻譯成了中文,而本文是將官網(wǎng)的描述“翻譯”成了代碼,希望可以幫助讀者理解官網(wǎng)原文中的含義。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

友情鏈接更多精彩內(nèi)容