Dynamic table在flink中是一個(gè)邏輯概念,也正是Dynamic table可以讓流數(shù)據(jù)支持table API和SQL。下圖是stream和dynamic table轉(zhuǎn)換關(guān)系,先將stream轉(zhuǎn)化為dynamic table,再基于dynamic table進(jìn)行SQL操作生成新的dynamic table,最后將dynamic table轉(zhuǎn)化為stream。本文將重點(diǎn)介紹dynamic table轉(zhuǎn)化為stream的3種方式。

注:本文所涉及的代碼全部基于flink 1.9.0以及flink-table-planner-blink_2.11
Append-only stream
官方定義如下:
A dynamic table that is only modified by INSERT changes can be converted into a stream by emitting the inserted rows.
也就是說(shuō)如果dynamic table只包含了插入新數(shù)據(jù)的操作那么就可以轉(zhuǎn)化為append-only stream,所有數(shù)據(jù)追加到stream里面。
樣例代碼:
public class AppendOnlyExample {
public static void main(String[] args) throws Exception {
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
env.setParallelism(1);
DataStream<Tuple2<String, String>> data = env.fromElements(
new Tuple2<>("Mary", "./home"),
new Tuple2<>("Bob", "./cart"),
new Tuple2<>("Mary", "./prod?id=1"),
new Tuple2<>("Liz", "./home"),
new Tuple2<>("Bob", "./prod?id=3")
);
Table clicksTable = tEnv.fromDataStream(data, "user,url");
tEnv.registerTable("clicks", clicksTable);
Table rTable = tEnv.sqlQuery("select user,url from clicks where user='Mary'");
DataStream ds = tEnv.toAppendStream(rTable, TypeInformation.of(new TypeHint<Tuple2<String, String>>(){}));
ds.print();
env.execute();
}
}
運(yùn)行結(jié)果:
(Mary,./prod?id=8)
(Mary,./prod?id=6)
Retract stream
官方定義
A retract stream is a stream with two types of messages, add messages and retract messages. A dynamic table is converted into an retract stream by encoding an INSERT change as add message, a DELETE change as retract message, and an UPDATE change as a retract message for the updated (previous) row and an add message for the updating (new) row. The following figure visualizes the conversion of a dynamic table into a retract stream.

樣例代碼:
public class RetractExample {
public static void main(String[] args) throws Exception {
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
env.setParallelism(1);
DataStream<Tuple2<String, String>> data = env.fromElements(
new Tuple2<>("Mary", "./home"),
new Tuple2<>("Bob", "./cart"),
new Tuple2<>("Mary", "./prod?id=1"),
new Tuple2<>("Liz", "./home"),
new Tuple2<>("Bob", "./prod?id=3")
);
Table clicksTable = tEnv.fromDataStream(data, "user,url");
tEnv.registerTable("clicks", clicksTable);
Table rTable = tEnv.sqlQuery("SELECT user,COUNT(url) FROM clicks GROUP BY user");
DataStream ds = tEnv.toRetractStream(rTable, TypeInformation.of(new TypeHint<Tuple2<String, Long>>(){}));
ds.print();
env.execute();
}
}
運(yùn)行結(jié)果:
(true,(Mary,1))
(true,(Bob,1))
(false,(Mary,1))
(true,(Mary,2))
(true,(Liz,1))
(false,(Bob,1))
(true,(Bob,2))
對(duì)于toRetractStream的返回值是一個(gè)Tuple2<Boolean, T>類型,第一個(gè)元素為true表示這條數(shù)據(jù)為要插入的新數(shù)據(jù),false表示需要?jiǎng)h除的一條舊數(shù)據(jù)。也就是說(shuō)可以把更新表中某條數(shù)據(jù)分解為先刪除一條舊數(shù)據(jù)再插入一條新數(shù)據(jù)。
Upsert stream
官方定義:
An upsert stream is a stream with two types of messages, upsert messages and delete messages. A dynamic table that is converted into an upsert stream requires a (possibly composite) unique key. A dynamic table with unique key is converted into a stream by encoding INSERT and UPDATE changes as upsert messages and DELETE changes as delete messages. The stream consuming operator needs to be aware of the unique key attribute in order to apply messages correctly. The main difference to a retract stream is that UPDATE changes are encoded with a single message and hence more efficient. The following figure visualizes the conversion of a dynamic table into an upsert stream.

這個(gè)模式和以上兩個(gè)模式不同的地方在于要實(shí)現(xiàn)將Dynamic table轉(zhuǎn)化成Upsert stream需要實(shí)現(xiàn)一個(gè)UpsertStreamTableSink,而不能直接使用
StreamTableEnvironment進(jìn)行轉(zhuǎn)換。
樣例代碼:
public class UpsertExample {
public static void main(String[] args) throws Exception {
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
env.setParallelism(1);
DataStream<Tuple2<String, String>> data = env.fromElements(
new Tuple2<>("Mary", "./home"),
new Tuple2<>("Bob", "./cart"),
new Tuple2<>("Mary", "./prod?id=1"),
new Tuple2<>("Liz", "./home"),
new Tuple2<>("Liz", "./prod?id=3"),
new Tuple2<>("Mary", "./prod?id=7")
);
Table clicksTable = tEnv.fromDataStream(data, "user,url");
tEnv.registerTable("clicks", clicksTable);
Table rTable = tEnv.sqlQuery("SELECT user,COUNT(url) FROM clicks GROUP BY user");
tEnv.registerTableSink("MemoryUpsertSink", new MemoryUpsertSink(rTable.getSchema()));
rTable.insertInto("MemoryUpsertSink");
env.execute();
}
private static class MemoryUpsertSink implements UpsertStreamTableSink<Tuple2<String, Long>> {
private TableSchema schema;
private String[] keyFields;
private boolean isAppendOnly;
private String[] fieldNames;
private TypeInformation<?>[] fieldTypes;
public MemoryUpsertSink() {
}
public MemoryUpsertSink(TableSchema schema) {
this.schema = schema;
}
@Override
public void setKeyFields(String[] keys) {
this.keyFields = keys;
}
@Override
public void setIsAppendOnly(Boolean isAppendOnly) {
this.isAppendOnly = isAppendOnly;
}
@Override
public TypeInformation<Tuple2<String, Long>> getRecordType() {
return TypeInformation.of(new TypeHint<Tuple2<String, Long>>(){});
}
@Override
public void emitDataStream(DataStream<Tuple2<Boolean, Tuple2<String, Long>>> dataStream) {
consumeDataStream(dataStream);
}
@Override
public DataStreamSink<?> consumeDataStream(DataStream<Tuple2<Boolean, Tuple2<String, Long>>> dataStream) {
return dataStream.addSink(new DataSink()).setParallelism(1);
}
@Override
public TableSink<Tuple2<Boolean, Tuple2<String, Long>>> configure(String[] fieldNames, TypeInformation<?>[] fieldTypes) {
MemoryUpsertSink memoryUpsertSink = new MemoryUpsertSink();
memoryUpsertSink.setFieldNames(fieldNames);
memoryUpsertSink.setFieldTypes(fieldTypes);
memoryUpsertSink.setKeyFields(keyFields);
memoryUpsertSink.setIsAppendOnly(isAppendOnly);
return memoryUpsertSink;
}
@Override
public String[] getFieldNames() {
return schema.getFieldNames();
}
public void setFieldNames(String[] fieldNames) {
this.fieldNames = fieldNames;
}
@Override
public TypeInformation<?>[] getFieldTypes() {
return schema.getFieldTypes();
}
public void setFieldTypes(TypeInformation<?>[] fieldTypes) {
this.fieldTypes = fieldTypes;
}
}
private static class DataSink extends RichSinkFunction<Tuple2<Boolean, Tuple2<String, Long>>> {
public DataSink() {
}
@Override
public void invoke(Tuple2<Boolean, Tuple2<String, Long>> value, Context context) throws Exception {
System.out.println("send message:" + value);
}
}
}
運(yùn)行結(jié)果:
send message:(true,(Mary,1))
send message:(true,(Bob,1))
send message:(true,(Mary,2))
send message:(true,(Liz,1))
send message:(true,(Liz,2))
send message:(true,(Mary,3))
這種模式的返回值也是一個(gè)Tuple2<Boolean, T>類型,和Retract的區(qū)別在于更新表中的某條數(shù)據(jù)并不會(huì)返回一條刪除舊數(shù)據(jù)一條插入新數(shù)據(jù),而是看上去真的是更新了某條數(shù)據(jù)。
Upsert stream番外篇
以上所講的內(nèi)容全部都是來(lái)自于flink官網(wǎng),只是附上了與其對(duì)應(yīng)的樣例,可以讓讀者更直觀的感受每種模式的輸出效果。網(wǎng)上也有很多對(duì)官方文檔的翻譯,但是幾乎沒(méi)有文章或者樣例說(shuō)明在使用UpsertStreamTableSink的時(shí)候什么情況下返回值Tuple2<Boolean, T>第一個(gè)元素是false?話不多說(shuō)直接上樣例,只要把上面的例子中的sql改為如下
String sql = "SELECT user, cnt " +
"FROM (" +
"SELECT user,COUNT(url) as cnt FROM clicks GROUP BY user" +
")" +
"ORDER BY cnt LIMIT 2";
返回結(jié)果:
send message:(true,(Mary,1))
send message:(true,(Bob,1))
send message:(false,(Mary,1))
send message:(true,(Bob,1))
send message:(true,(Mary,2))
send message:(true,(Liz,1))
send message:(false,(Liz,1))
send message:(true,(Mary,2))
send message:(false,(Mary,2))
send message:(true,(Liz,2))
具體的原理可以查看源碼,org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecRank和org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecSortLimit
解析sql的時(shí)候通過(guò)下面的方法得到不同的strategy,由此影響是否需要?jiǎng)h除原有數(shù)據(jù)的行為。
def getStrategy(forceRecompute: Boolean = false): RankProcessStrategy = {
if (strategy == null || forceRecompute) {
strategy = RankProcessStrategy.analyzeRankProcessStrategy(
inputRel, ImmutableBitSet.of(), sortCollation, cluster.getMetadataQuery)
}
strategy
}
知道什么時(shí)候會(huì)產(chǎn)生false屬性的數(shù)據(jù),對(duì)于理解JDBCUpsertTableSink和HBaseUpsertTableSink的使用會(huì)有很大的幫助。
總結(jié)
在閱讀flink文檔的時(shí)候,總是想通過(guò)代碼實(shí)戰(zhàn)體會(huì)其中的原理,很多文章是將官網(wǎng)的描述翻譯成了中文,而本文是將官網(wǎng)的描述“翻譯”成了代碼,希望可以幫助讀者理解官網(wǎng)原文中的含義。