In Flink 1.11, I'm trying the debezium-json format, and the following should work, right? I'm trying to follow the docs [1]:
TableResult products = bsTableEnv.executeSql(
"CREATE TABLE products (\n" +
" id BIGINT,\n" +
" name STRING,\n" +
" description STRING,\n" +
" weight DECIMAL(10, 2)\n" +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'dbserver1.inventory.products',\n" +
" 'properties.bootstrap.servers' = 'localhost:9092',\n" +
" 'properties.group.id' = 'testGroup',\n" +
"'scan.startup.mode'='earliest-offset',\n" +
" 'format' = 'debezium-json'" +
")"
);
bsTableEnv.executeSql("SHOW TABLES").print(); // This seems to work;
bsTableEnv.executeSql("SELECT id FROM products").print();
Output Snippet / Exception:
+------------+
| table name |
+------------+
| products |
+------------+
1 row in set
Exception in thread "main" org.apache.flink.table.api.TableException: AppendStreamTableSink doesn't support consuming update and delete changes which is produced by node TableSourceScan(table=[[default_catalog, default_database, products]], fields=[id, name, description, weight])
I have verified the Debezium setup, and there are messages in the dbserver1.inventory.products topic. I'm able to read from Kafka topics in Flink using other approaches, but as described above, I'm hoping to get the debezium-json format to work.
Also, I understand Flink 1.12 introduces a new upsert-kafka connector, but I'm stuck on 1.11 for now.
I'm pretty new to Flink, so entirely possible I'm missing something obvious here.
Thanks in advance
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/debezium.html
Asked too soon, it seems. In case it helps someone else, I was able to get it to work with:
Table results = bsTableEnv.sqlQuery("SELECT id, name FROM products");
bsTableEnv.toRetractStream(results, Row.class).print();
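For context: debezium-json produces update and delete changes, and in Flink 1.11 the executeSql("SELECT ...").print() path runs through an append-only sink (which is exactly what the exception complains about), whereas toRetractStream consumes the full changelog. A minimal sketch of the surrounding boilerplate, assuming bsTableEnv was created from a StreamExecutionEnvironment named bsEnv (name illustrative):
// each element is a Tuple2<Boolean, Row>: true = add/insert, false = retraction
bsTableEnv.toRetractStream(results, Row.class).print();
// unlike executeSql(...).print(), a DataStream sink only runs once the job is submitted
bsEnv.execute("debezium-products-demo");  // job name is illustrative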
Related
I'm new to Flink and hope someone can help. I have tried to follow the Flink tutorials.
We have a requirement where we consume from two sources:
1. A Kafka topic: when an event arrives on the Kafka topic, we need the JSON event fields (mobile_acc_id, member_id, mobile_number, valid_from, valid_to) to be stored in an external database (Postgres).
2. A Kinesis stream: when an event arrives on the Kinesis stream, we need to look up the mobile_number from the event in the Postgres DB (from step 1), extract the member_id, enrich the incoming Kinesis event, and sink it to another output stream.
So I set up a Stream and a Table environment like this:
public static StreamExecutionEnvironment initEnv() {
var env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setAutoWatermarkInterval(0L); // disables periodic watermark generation
return env;
}
public static TableEnvironment initTableEnv() {
var settings = EnvironmentSettings.newInstance().inStreamingMode().build();
return TableEnvironment.create(settings);
}
Calling the process(..) method with initEnv() uses Kinesis as the source:
process(config.getTransformerConfig(), input, sink, deadLetterSink, initEnv());
In process(..) I am also initialising the table environment using initTableEnv(), hoping that Flink will consume from both sources when I call env.execute(..):
public static void process(TransformerConfig cfg, SourceFunction<String> source, SinkFunction<UsageSummaryWithHeader> sink,
SinkFunction<DeadLetterEvent> deadLetterSink, StreamExecutionEnvironment env) throws Exception {
var events =
StreamUtils.source(source, env, "kinesis-events", cfg.getInputParallelism());
collectInSink(transform(cfg, events, deadLetterSink), sink, "kinesis-summary-events", cfg.getOutputParallelism());
processStreamIntoTable(initTableEnv());
env.execute("my-flink-event-enricher-svc");
}
private static void processStreamIntoTable(TableEnvironment tableEnv) throws Exception {
tableEnv.executeSql("CREATE TABLE mobile_accounts (\n" +
" mobile_acc_id VARCHAR(36) NOT NULL,\n" +
" member_id BIGINT NOT NULL,\n" +
" mobile_number VARCHAR(14) NOT NULL,\n" +
" valid_from TIMESTAMP NOT NULL,\n" +
" valid_to TIMESTAMP NOT NULL \n" +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'mobile_accounts',\n" +
" 'properties.bootstrap.servers' = 'kafka:9092',\n" +
" 'format' = 'json'\n" +
")");
tableEnv.executeSql("CREATE TABLE mobile_account\n" +
"(\n" +
" mobile_acc_id VARCHAR(36) NOT NULL,\n" +
" member_id BIGINT NOT NULL,\n" +
" mobile_number VARCHAR(14) NOT NULL,\n" +
" valid_from TIMESTAMP NOT NULL,\n" +
" valid_to TIMESTAMP NOT NULL \n" +
") WITH (\n" +
" 'connector' = 'jdbc',\n" +
" 'url' = 'jdbc:postgresql://flinkpg:5432/flink-demo',\n" +
" 'table-name' = 'mobile_account',\n" +
" 'driver' = 'org.postgresql.Driver',\n" +
" 'username' = 'flink-demo',\n" +
" 'password' = 'flink-demo'\n" +
")");
Table mobileAccounts = tableEnv.from("mobile_accounts");
report(mobileAccounts).executeInsert("mobile_account");
}
public static Table report(Table mobileAccounts) {
return mobileAccounts.select(
$("mobile_acc_id"),
$("member_id"),
$("mobile_number"),
$("valid_from"),
$("valid_to"));
}
What I have noticed on the Flink console is that it is only consuming from one source!
I liked the TableEnvironment approach because not much code is needed to get the items inserted into the DB.
How can we consume from both sources, Kinesis and the Kafka-backed table, in Flink?
Am I using the right approach?
Is there an alternative to implement my requirements?
I assume you are able to create the tables correctly; then you can simply JOIN the two streams, here named kafka_stream and kinesis_stream, as:
SELECT l.*, r.something_useful FROM kinesis_stream as l
INNER JOIN kafka_stream as r
ON l.member_id = r.member_id;
If the PostgreSQL sink is essential, you can handle it in a separate query:
INSERT INTO postgre_sink
SELECT * FROM kafka_stream;
Together these will solve your problem with the Table API (or Flink SQL).
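If it helps, here is a rough sketch of how both pipelines could be wired into a single job with a StatementSet, reusing the queries above and the tableEnv from the question (the table names kafka_stream, kinesis_stream, postgre_sink and enriched_sink are illustrative and stand for tables created with executeSql):
StatementSet statements = tableEnv.createStatementSet();
// Kafka -> PostgreSQL
statements.addInsertSql(
    "INSERT INTO postgre_sink SELECT * FROM kafka_stream");
// Kinesis enriched with the Kafka data -> output sink
statements.addInsertSql(
    "INSERT INTO enriched_sink " +
    "SELECT l.*, r.something_useful FROM kinesis_stream AS l " +
    "INNER JOIN kafka_stream AS r ON l.member_id = r.member_id");
statements.execute();  // submits both INSERTs as one Flink job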
I am using the Flink Table API to pull data from a Kinesis stream into a table. I want to periodically pull that data into a temporary table and run a custom scalar function on it. However, I notice that my scalar function is not being called at all.
Here is the code for the Kinesis table:
this.tableEnv.executeSql("CREATE TABLE transactions (\n" +
" entry STRING,\n" +
" sequence_number VARCHAR(128) NOT NULL METADATA FROM 'sequence-number' VIRTUAL,\n" +
" shard_id VARCHAR(128) NOT NULL METADATA FROM 'shard-id' VIRTUAL,\n" +
" arrival_time TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,\n" +
" WATERMARK FOR arrival_time AS arrival_time - INTERVAL '5' SECOND\n" +
") WITH (\n" +
" 'connector' = 'kinesis',\n" +
" 'stream' = '" + streamName + "',\n" +
" 'aws.region' = 'us-west-2', \n" +
" 'format' = 'raw'\n" +
")");
Then I want to run a tumbling window every second that pulls data from Kinesis and updates a temporary table.
My temporary table is defined like this:
this.tableEnv.executeSql("CREATE TABLE temporaryTable (\n" +
" entry STRING,\n" +
" sequence_number VARCHAR(128) NOT NULL,\n" +
" shard_id VARCHAR(128) NOT NULL,\n" +
" arrival_time TIMESTAMP(3),\n" +
" record_list STRING NOT NULL,\n" +
" PRIMARY KEY (shard_id, sequence_number) NOT ENFORCED" +
") WITH (\n" +
" 'connector' = 'print'\n" +
")");
I then have code to do the tumbling:
Table inMemoryTable = transactions
    .window(Tumble.over(lit(1).second()).on($("arrival_time")).as("log_ts"))
    .groupBy($("entry"), $("sequence_number"), $("log_ts"), $("shard_id"), $("arrival_time"))
    .select(
        $("entry"),
        $("sequence_number"), $("shard_id"), $("arrival_time"),
        call(CustomFunction.class, $("entry")).as("record_list"));
inMemoryTable.executeInsert("temporaryTable");
The CustomFunction class looks like this:
public class CustomFunction extends ScalarFunction {
    @DataTypeHint("STRING")
    public String eval(
            @DataTypeHint("STRING") String serializedEntry) throws IOException {
        return "asd";
    }
}
When I run this code in Flink, I don't get anything in stdout, so obviously I am missing something.
Here is the Flink UI:
Image as a link, as I don't have enough rep.
Thanks for any help.
I am able to get the stream to print with:
driver.tableEnv.getConfig().getConfiguration().setString("table.exec.source.idle-timeout", "10000 ms");
driver.env.getConfig().setAutoWatermarkInterval(5000);
I am using Flink 1.14, deployed by the Lyft Flink operator.
I am trying to do a tumbling window aggregate with the Table API: read from the transactions table source and put the per-window aggregate result into a new Kafka topic.
My source is a Kafka topic populated by Debezium.
EnvironmentSettings settings = EnvironmentSettings.inStreamingMode();
TableEnvironment tEnv = TableEnvironment.create(settings);
//this is the source
tEnv.executeSql("CREATE TABLE transactions (\n" +
" event_time TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,\n"+
" transaction_time AS TO_TIMESTAMP_LTZ(4001, 3),\n"+
" id INT PRIMARY KEY,\n" +
" transaction_status STRING,\n" +
" transaction_type STRING,\n" +
" merchant_id INT,\n" +
" WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND\n" +
") WITH (\n" +
" 'debezium-json.schema-include' = 'true' ,\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'dbserver1.inventory.transactions',\n" +
" 'properties.bootstrap.servers' = 'my-cluster-kafka-bootstrap.kafka.svc:9092',\n" +
" 'properties.group.id' = 'testGroup',\n" +
" 'scan.startup.mode' = 'earliest-offset',\n"+
" 'format' = 'debezium-json'\n" +
")");
I do the tumbling window and count the ids in each window with:
public static Table report(Table transactions) {
return transactions
.window(Tumble.over(lit(2).minutes()).on($("transaction_time")).as("w"))
.groupBy($("w"), $("transaction_status"))
.select(
$("w").start().as("window_start"),
$("w").end().as("window_end"),
$("transaction_status"),
$("id").count().as("id_count"));
}
The sink is:
tEnv.executeSql("CREATE TABLE my_report (\n" +
"window_start TIMESTAMP(3),\n"+
"window_end TIMESTAMP(3)\n,"+
"transaction_status STRING,\n" +
" id_count BIGINT,\n" +
" PRIMARY KEY (window_start) NOT ENFORCED\n"+
") WITH (\n" +
" 'connector' = 'upsert-kafka',\n" +
" 'topic' = 'dbserver1.inventory.my-window-sink',\n" +
" 'properties.bootstrap.servers' = 'my-cluster-kafka-bootstrap.kafka.svc:9092',\n" +
" 'properties.group.id' = 'testGroup',\n" +
" 'key.format' = 'json',\n"+
" 'value.format' = 'json'\n"+
")");
Table transactions = tEnv.from("transactions");
Table merchants = tEnv.from("merchants");
report(transactions).executeInsert("my_report");
The problem is that when I consume dbserver1.inventory.my-window-sink with kubectl -n kafka exec my-cluster-kafka-0 -c kafka -i -t -- bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic dbserver1.inventory.my-window-sink --from-beginning, I don't get any results. I wait 2 minutes (the window size), insert into the transactions table, then wait another 2 minutes and insert again; still no results.
I don't know if I have a problem with my watermark.
I am working with parallelism: 2.
On the Flink dashboard UI I can see that the "Records Received" count of the GroupWindowAggregate task increases when I insert into the table, but I still can't see any results when I consume the topic!
With this line
transaction_time AS TO_TIMESTAMP_LTZ(4001, 3)
you have given every event the same transaction time (4001), and with
WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND
you have arranged for the watermarks to depend on the transaction_time. With this arrangement, time is standing still, and the windows can never close.
As for "I wait 2 minutes (the window size)," this isn't how event time processing works. Assuming the timestamps and watermarks were actually moving forward, you would need to wait however long it takes to process 2 minutes worth of data.
In addition to what David thankfully answered, I was also missing table.exec.source.idle-timeout in the configuration of the streaming environment; this option controls whether sources are marked as idle.
Its default value is 0, which means idle sources are never detected. I set it to 1000 ms and that fixed it: the idle-source condition is detected and the watermarks are generated properly.
This probably won't affect regular streams with consistent message ingestion, but it was the issue in my case because I was inserting records manually, so the stream was idle a lot of the time.
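For reference, a one-liner sketch of how that option can be set from the Table API (assuming a TableEnvironment named tEnv, as in the question; 1000 ms is the value mentioned above):
// mark sources idle after 1 second of silence so watermarks keep advancing
tEnv.getConfig().getConfiguration().setString("table.exec.source.idle-timeout", "1000 ms");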
I am using Flink 1.11 and trying a nested query with MATCH_RECOGNIZE inside, as shown below:
SELECT * FROM events
WHERE id = (
  SELECT * FROM events
  MATCH_RECOGNIZE (
    PARTITION BY org_id
    ORDER BY proctime
    MEASURES A.id AS startId
    ONE ROW PER MATCH
    PATTERN (A C* B)
    DEFINE
      A AS A.tag = 'tag1',
      C AS C.tag <> 'tag2',
      B AS B.tag = 'tag2'
  )
);
And I am getting this error: org.apache.calcite.sql.validate.SqlValidatorException: Table 'A' not found
Is this not supported? If not, what's the alternative?
I was able to get something working by doing this:
Table events = tableEnv.fromDataStream(input,
$("sensorId"),
$("ts").rowtime(),
$("kwh"));
tableEnv.createTemporaryView("events", events);
Table matches = tableEnv.sqlQuery(
"SELECT id " +
"FROM events " +
"MATCH_RECOGNIZE ( " +
"PARTITION BY sensorId " +
"ORDER BY ts " +
"MEASURES " +
"this_step.sensorId AS id " +
"AFTER MATCH SKIP TO NEXT ROW " +
"PATTERN (this_step next_step) " +
"DEFINE " +
"this_step AS TRUE, " +
"next_step AS TRUE " +
")"
);
tableEnv.createTemporaryView("mmm", matches);
Table results = tableEnv.sqlQuery(
"SELECT * FROM events WHERE events.sensorId IN (select * from mmm)");
tableEnv
.toAppendStream(results, Row.class)
.print();
For some reason, I couldn't get it to work without defining a view. I kept getting Calcite errors.
I guess you are trying to avoid enumerating all of the columns from A in the MEASURES clause of the MATCH_RECOGNIZE. You may want to compare the resulting execution plans to see if there's any significant difference.
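If it helps, a quick sketch of how to dump the optimized plan for such a comparison (using the tableEnv and the view names from the snippet above):
// prints the abstract syntax tree and the optimized execution plan for the query
System.out.println(tableEnv.explainSql(
    "SELECT * FROM events WHERE events.sensorId IN (SELECT * FROM mmm)"));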
I have 2 data streams which were created from 2 tables like:
Table orderRes1 = ste.sqlQuery(
"SELECT orderId, userId, SUM(bidPrice) as q FROM " + tble +
" Group by orderId, userId");
Table orderRes2 = ste.sqlQuery(
"SELECT orderId, userId, SUM(askPrice) as q FROM " + tble +
" Group by orderId, userId");
DataStream<Tuple2<Boolean, Row>> ds1 = ste.toRetractStream(orderRes1, Row.class)
    .filter(order -> order.f0);
DataStream<Tuple2<Boolean, Row>> ds2 = ste.toRetractStream(orderRes2, Row.class)
    .filter(order -> order.f0);
I want to perform a full outer join on these 2 streams. I tried both orderRes1.fullOuterJoin(orderRes2, $(exp))
and an SQL query containing a full outer join, as below:
Table bidOrdr = ste.fromDataStream(bidTuple, $("orderId"),
$("userId"), $("price"));
Table askOrdr = ste.fromDataStream(askTuple, $("orderId"),
$("userId"), $("price"));
Table result = ste.sqlQuery(
"SELECT COALESCE(bidTbl.orderId,askTbl.orderId) , " +
" COALESCE(bidTbl.userId,askTbl.orderId)," +
" COALESCE(bidTbl.bidTotalPrice,0) as bidTotalPrice, " +
" COALESCE(askTbl.askTotalPrice,0) as askTotalPrice, " +
" FROM " +
" (SELECT orderId, userId," +
" SUM(price) AS bidTotalPrice " +
" FROM " + bidOrdr +
" Group by orderId, userId) bidTbl full outer JOIN " +
" (SELECT orderId, userId," +
" SUM(price) AS askTotalPrice" +
" FROM " + askOrdr +
" Group by orderId, userId) askTbl " +
" ON (bidTbl.orderId = askTbl.orderId" +
" AND bidTbl.userId= askTbl.userId) ") ;
DataStream<Tuple2<Boolean, Row>> joinedStream = ste.toRetractStream(result, Row.class).filter(order -> order.f0);
However, the result is not correct in some cases: imagine user A sells to B 3 times, and after that user B sells to A 2 times; the second time around the result is:
7> (true,123,a,300.0,0.0)
7> (true,123,a,300.0,200.0)
10> (true,123,b,0.0,300.0)
10> (true,123,b,200.0,300.0)
The second and fourth lines are the expected result of the stream, but it generates the 1st and 3rd lines too.
Worth mentioning that coGroup is another solution, but I do not want to use windowing in this scenario, and a non-windowed coGroup is only available for bounded streams (the DataSet API).
Hint: orderId and userId will repeat in both streams, and I want to produce 2 rows for each action, containing:
orderId, userId1, bidTotalPrice, askTotalPrice AND
orderId, userId2, bidTotalPrice, askTotalPrice
Something like this is to be expected with streaming queries (or in other words, with queries executed on dynamic tables). Unlike a traditional database, where the input relations to a query are kept static during query execution, the inputs to a streaming query are being continuously updated -- and so the result must also be continuously updated.
If I understand the setup here, the "incorrect" results on lines 1 and 3 are correct up until the relevant rows from orderRes2 are processed. If those rows never arrive, then lines 1 and 3 will remain correct.
What you should expect is an eventually correct result, including retractions as necessary. You can reduce the number of intermediate results by turning on mini-batch aggregation.
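For reference, a sketch of how mini-batch aggregation could be switched on (assuming the StreamTableEnvironment ste from the question; the latency and size values are illustrative and would need tuning):
// buffer input records so consecutive updates for the same key are folded together
ste.getConfig().getConfiguration().setString("table.exec.mini-batch.enabled", "true");
ste.getConfig().getConfiguration().setString("table.exec.mini-batch.allow-latency", "5 s");
ste.getConfig().getConfiguration().setString("table.exec.mini-batch.size", "5000");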
This mailing list thread gives more insight. If I've misunderstood your situation, please provide a reproducible example that illustrates the problem.