Flink Source kafka Join with CDC source to kafka sink - apache-flink

We are trying to join a table backed by a DB CDC connector (which behaves as an upsert stream)
with a 'kafka' source of events, in order to enrich these events by key with the existing CDC data.
kafka-source (id, B, C) + cdc (id, D, E, F) = result(id, B, C, D, E, F) into a kafka sink (append)
INSERT INTO sink (zapatos, naranjas, device_id, account_id, user_id)
SELECT zapatos, naranjas, source.device_id, account_id, user_id FROM source
JOIN mongodb_source ON source.device_id = mongodb_source._id
The problem is that this only works if our Kafka sink is 'upsert-kafka'.
But that creates tombstones when rows are deleted in the DB.
We need the output to behave as plain events, not as a changelog.
However, we cannot use a plain 'kafka' sink, because the DB connector produces an upsert stream, which is not compatible with it.
What would be the way to do this? Can the upsert stream be transformed into plain append events?
from pyflink.datastream import StreamExecutionEnvironment, TimeCharacteristic
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

s_env = StreamExecutionEnvironment.get_execution_environment()
s_env.set_stream_time_characteristic(TimeCharacteristic.EventTime)
s_env.set_parallelism(1)
# use blink table planner
st_env = StreamTableEnvironment \
    .create(s_env, environment_settings=EnvironmentSettings
            .new_instance()
            .in_streaming_mode()
            .use_blink_planner().build())
ddl = """CREATE TABLE sink (
`zapatos` INT,
`naranjas` STRING,
`account_id` STRING,
`user_id` STRING,
`device_id` STRING,
`time28` INT,
PRIMARY KEY (device_id) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'as-test-output-flink-topic',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'testGroup',
'key.format' = 'raw',
'value.format' = 'json',
'value.fields-include' = 'EXCEPT_KEY'
)
"""
st_env.sql_update(ddl)
ddl = """CREATE TABLE source (
`device_id` STRING,
`timestamp` TIMESTAMP_LTZ(3) METADATA FROM 'timestamp',
`event_type` STRING,
`payload` ROW<`zapatos` INT, `naranjas` STRING, `time28` INT, `device_id` STRING>,
`trace_id` STRING
) WITH (
'connector' = 'kafka',
'topic' = 'as-test-input-flink-topic',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'testGroup',
'key.format' = 'raw',
'key.fields' = 'device_id',
'value.format' = 'json',
'value.fields-include' = 'EXCEPT_KEY'
)
"""
st_env.sql_update(ddl)
ddl = """
CREATE TABLE mongodb_source (
`_id` STRING PRIMARY KEY,
`account_id` STRING,
`user_id` STRING,
`device_id` STRING
) WITH (
'connector' = 'mongodb-cdc',
'uri' = '******',
'database' = '****',
'collection' = 'testflink'
)
"""
st_env.sql_update(ddl)
st_env.sql_update("""
INSERT INTO sink (zapatos, naranjas, device_id, account_id, user_id)
SELECT zapatos, naranjas, source.device_id, account_id, user_id FROM source
JOIN mongodb_source ON source.device_id = mongodb_source._id
""")
# execute
st_env.execute("kafka_to_kafka")
Don't mind the mongodb-cdc connector; it is new, but it works like mysql-cdc or postgres-cdc.
Thanks for your help!

Have you tried using LEFT JOIN instead of JOIN?
It shouldn't create tombstones then, if your purpose is just to enrich the Kafka events with data from Mongo when there is any.
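To make the tombstone behaviour from the question concrete, here is a minimal plain-Python sketch (not Flink code) of how a changelog maps to Kafka records. It assumes, as the upsert-kafka documentation describes, that INSERT and UPDATE_AFTER rows are written as normal key/value records and DELETE rows as records with a null value. The function names `to_upsert_records` and `to_append_records` and the sample rows are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# A changelog row: (kind, key, value) with kind in {"+I", "-U", "+U", "-D"}.
ChangelogRow = Tuple[str, str, dict]
KafkaRecord = Tuple[str, Optional[dict]]


def to_upsert_records(changelog: Iterable[ChangelogRow]) -> List[KafkaRecord]:
    """Model of an upsert-style sink: deletes become tombstones (value None)."""
    records: List[KafkaRecord] = []
    for kind, key, value in changelog:
        if kind in ("+I", "+U"):
            records.append((key, value))
        elif kind == "-D":
            records.append((key, None))
        # "-U" (update-before) rows are skipped: an upsert sink does not need them.
    return records


def to_append_records(changelog: Iterable[ChangelogRow]) -> List[KafkaRecord]:
    """Hypothetical append-only view: keep only inserts and update-afters."""
    return [(key, value) for kind, key, value in changelog if kind in ("+I", "+U")]


changelog = [
    ("+I", "dev-1", {"zapatos": 1, "user_id": "u1"}),
    ("-U", "dev-1", {"zapatos": 1, "user_id": "u1"}),
    ("+U", "dev-1", {"zapatos": 1, "user_id": "u2"}),
    ("-D", "dev-1", {"zapatos": 1, "user_id": "u2"}),
]
upsert_records = to_upsert_records(changelog)
append_records = to_append_records(changelog)
print(upsert_records)
print(append_records)
```

In a real job, one direction to investigate is the DataStream bridge (for example `toChangelogStream` in the Java API), filtering rows by kind before writing to a plain 'kafka' sink. Treat that as a suggestion to verify, not a confirmed solution.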

Related

Using CTE in apache flink sql

I am trying to write a SQL statement that uses a CTE in Flink.
I have a table defined as:
CREATE TABLE test_cte
(
pod VARCHAR,
PRIMARY KEY (pod) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'test_cte',
'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
'properties.group.id' = 'test_cte_group_id',
'value.format' = 'json',
'key.format' = 'json',
'properties.allow.auto.create.topics' = 'true',
'properties.replication.factor' = '3',
'value.json.timestamp-format.standard' = 'ISO-8601',
'sink.parallelism' = '3'
);
Then I have an insert as follows:
WITH q1 AS ( SELECT pod FROM source )
FROM q1
INSERT OVERWRITE TABLE test_cte
SELECT pod;
I get an error saying org.apache.flink.sql.parser.impl.ParseException: Incorrect syntax near the keyword 'FROM' at line 2, column 2.
The source table has the column pod.
When I run just the select, like here:
WITH q1 AS ( SELECT pod FROM roles_deleted_raw_v1)
select * from q1;
I can see the result.
The FROM ... INSERT OVERWRITE ... SELECT form is Hive syntax, so in Flink it is only available when using the Hive dialect:
SET table.sql-dialect = hive;
However, this feature is only supported by the HiveCatalog catalog, so it is not possible to use with upsert-kafka.
For more info on this, you can check out the Flink docs.

Using flink sql how to convert a string read using json_value to timestamp_ltz

CREATE TABLE roles_created_raw_v1
(
id VARCHAR,
created VARCHAR,
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'sink_topic',
'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
'properties.group.id' = 'sink_topic_id',
'value.format' = 'json',
'key.format' = 'json',
'properties.allow.auto.create.topics' = 'true',
'value.json.timestamp-format.standard' = 'ISO-8601',
'sink.parallelism' = '3'
);
I am trying to insert into this table using:
insert into roles_created_raw_v1
select
JSON_VALUE(contentJson, '$.id') as id,
to_timestamp(JSON_VALUE(contentJson, '$.created'), 'yyyy-MM-ddTHH:mm:ss.SSSZ') as created
from some_raw_table;
My contentJson field has
"contentJson": "{\"created\":\"2023-02-04T04:12:07.925Z\"}".
The created field in the sink_topic topic and in the table roles_created_raw_v1 is null. How do I get this converted to a timestamp_ltz field?
If, instead of to_timestamp(JSON_VALUE(contentJson, '$.created'), 'yyyy-MM-ddTHH:mm:ss.SSSZ'), I use JSON_VALUE(contentJson, '$.created' RETURNING STRING), I get the string value back.
to_timestamp(replace(replace(JSON_VALUE(contentJson, '$.created'), 'T', ' '), 'Z', ' '),'yyyy-MM-dd HH:mm:ss.SSS')
This seems to work.
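The workaround normalizes the ISO-8601 string (removing the literal T and Z markers) before parsing it with a plain pattern. The same idea can be sketched in plain Python, outside Flink, to check the normalization on the sample value from the question. The helper name `parse_created` is hypothetical, and treating the trailing Z as UTC is an assumption based on ISO-8601.

```python
from datetime import datetime, timezone


def parse_created(raw: str) -> datetime:
    """Normalize 'YYYY-MM-DDTHH:MM:SS.mmmZ' and parse it as a UTC timestamp."""
    # Same idea as the nested replace() calls in the SQL: drop the 'T' and 'Z' markers.
    normalized = raw.replace("T", " ").replace("Z", "").strip()
    parsed = datetime.strptime(normalized, "%Y-%m-%d %H:%M:%S.%f")
    # Assumption: the trailing 'Z' means UTC, so attach that zone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


created = parse_created("2023-02-04T04:12:07.925Z")
print(created.isoformat())
```

Note that the SQL above yields a TIMESTAMP, while the sink column in the question is declared as VARCHAR, so the column type may also need to be adjusted.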

Why flink 1.15.2 showing No Watermark (Watermarks are only available if EventTime is used)

In my CREATE TABLE DDL I have set a watermark on a column, and I am doing a simple count(distinct userId) over a 1-minute tumble window, but I am still not getting any data. The same simple job works fine in 1.13.
CREATE TABLE test (
eventName String,
ingestion_time BIGINT,
time_ltz AS TO_TIMESTAMP_LTZ(ingestion_time, 3),
props ROW(userId VARCHAR, id VARCHAR, tourName VARCHAR, advertiserId VARCHAR, deviceId VARCHAR, tourId VARCHAR),
WATERMARK FOR time_ltz AS time_ltz - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'test',
'scan.startup.mode' = 'latest-offset',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'local_test_flink_115',
'format' = 'json',
'json.ignore-parse-errors' = 'true',
'scan.topic-partition-discovery.interval' = '60000'
);
We have also migrated other jobs, but none of their data matches the expected output. Is there a default watermark setting that we need to set?
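As background for this question, here is a small plain-Python model (not Flink code) of how the declared watermark, `time_ltz - INTERVAL '5' SECOND`, gates a 1-minute tumbling window. It is a simplification with stated assumptions: a single input partition, and a window fires once the watermark reaches the last millisecond of the window. The helper name `fired_windows` and the sample timestamps are hypothetical.

```python
from typing import Dict, List

DELAY_MS = 5_000    # matches INTERVAL '5' SECOND in the DDL
WINDOW_MS = 60_000  # 1-minute tumbling window


def fired_windows(event_times_ms: List[int]) -> List[int]:
    """Return the start times of the windows that fire, in firing order."""
    open_windows: Dict[int, int] = {}  # window start -> event count
    fired: List[int] = []
    watermark = float("-inf")
    for ts in event_times_ms:
        start = ts - ts % WINDOW_MS
        # An event is late if its window's last millisecond is already covered.
        if start + WINDOW_MS - 1 > watermark:
            open_windows[start] = open_windows.get(start, 0) + 1
        # The watermark trails the largest event time seen so far by DELAY_MS.
        watermark = max(watermark, ts - DELAY_MS)
        for w in sorted(open_windows):
            if w + WINDOW_MS - 1 <= watermark:
                fired.append(w)
                del open_windows[w]
    return fired


no_output = fired_windows([1_000, 30_000, 59_000, 64_998])
with_output = fired_windows([1_000, 30_000, 59_000, 64_999])
print(no_output)
print(with_output)
```

In this model nothing is emitted until an event arrives that is roughly 5 seconds past the window end. With several Kafka partitions the effective watermark is the minimum across them, so an idle partition may be one thing to check when the UI shows no watermark.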

Temporal Table function: Cannot add expression of different type to set

I am using a temporal table function to join two streams like this, but I get the error below.
The only difference between the set type and the expression type is the type of proctime0: one of them is NOT NULL.
How does this difference arise, and is there any way to solve it?
Exception in thread "main" java.lang.AssertionError: Cannot add expression of different type to set:
set type is RecordType(VARCHAR(2147483647) CHARACTER SET "UTF-16LE" order_id, DECIMAL(32, 2) price, VARCHAR(2147483647) CHARACTER SET "UTF-16LE" currency, TIMESTAMP(3) order_time, TIMESTAMP_LTZ(3) *PROCTIME* NOT NULL proctime, VARCHAR(2147483647) CHARACTER SET "UTF-16LE" currency0, BIGINT conversion_rate, TIMESTAMP(3) update_time, TIMESTAMP_LTZ(3) *PROCTIME* proctime0) NOT NULL
expression type is RecordType(VARCHAR(2147483647) CHARACTER SET "UTF-16LE" order_id, DECIMAL(32, 2) price, VARCHAR(2147483647) CHARACTER SET "UTF-16LE" currency, TIMESTAMP(3) order_time, TIMESTAMP_LTZ(3) *PROCTIME* NOT NULL proctime, VARCHAR(2147483647) CHARACTER SET "UTF-16LE" currency0, BIGINT conversion_rate, TIMESTAMP(3) update_time, TIMESTAMP_LTZ(3) *PROCTIME* NOT NULL proctime0) NOT NULL
set is rel#61:LogicalCorrelate.NONE.any.None: 0.[NONE].[NONE](left=HepRelVertex#59,right=HepRelVertex#60,correlation=$cor0,joinType=inner,requiredColumns={4})
expression is LogicalJoin(condition=[__TEMPORAL_JOIN_CONDITION($4, $7, __TEMPORAL_JOIN_CONDITION_PRIMARY_KEY($5))], joinType=[inner])
LogicalProject(order_id=[$0], price=[$1], currency=[$2], order_time=[$3], proctime=[PROCTIME()])
LogicalTableScan(table=[[default_catalog, default_database, orders]])
LogicalProject(currency=[$0], conversion_rate=[$1], update_time=[$2], proctime=[PROCTIME()])
LogicalTableScan(table=[[default_catalog, default_database, currency_rates]])
Fact Table:
CREATE TABLE `orders` (
order_id STRING,
price DECIMAL(32,2),
currency STRING,
order_time TIMESTAMP(3),
proctime as PROCTIME()
) WITH (
'properties.bootstrap.servers' = '127.0.0.1:9092',
'properties.group.id' = 'test',
'scan.topic-partition-discovery.interval' = '10000',
'connector' = 'kafka',
'format' = 'json',
'scan.startup.mode' = 'latest-offset',
'topic' = 'test1'
)
Build Table:
CREATE TABLE `currency_rates` (
currency STRING,
conversion_rate BIGINT,
update_time TIMESTAMP(3),
proctime as PROCTIME()
) WITH (
'properties.bootstrap.servers' = '127.0.0.1:9092',
'properties.group.id' = 'test',
'scan.topic-partition-discovery.interval' = '10000',
'connector' = 'kafka',
'format' = 'json',
'scan.startup.mode' = 'latest-offset',
'topic' = 'test3'
)
The way to generate table function:
TemporalTableFunction table_rate = tEnv.from("currency_rates")
.createTemporalTableFunction("update_time", "currency");
tEnv.registerFunction("rates", table_rate);
Join logic:
SELECT
order_id,
price,
s.currency,
conversion_rate,
order_time
FROM orders AS o,
LATERAL TABLE (rates(o.proctime)) AS s
WHERE o.currency = s.currency
Try using tEnv.createTemporarySystemFunction("rates", table_rate) instead of .registerFunction().

Flink Dynamic Tables Temporal Join - Calcite error

I think I am doing exactly what was mentioned in this Stack Overflow question, "Flink temporal join not showing data", and what is described in the official docs for joining two data streams with a temporal join, but I keep getting the following error:
[ERROR] Could not execute SQL statement. Reason:
java.lang.ClassCastException: org.apache.flink.table.planner.plan.nodes.calcite.LogicalWatermarkAssigner cannot be cast to org.apache.calcite.rel.core.TableScan
while executing the following SQL:
SELECT v.rid, u.name, v.parsed_timestamp FROM VEHICLES v JOIN USERS FOR SYSTEM_TIME AS OF v.parsed_timestamp AS u ON v.userID = u.userID;
I have two dynamic tables that must be joined (regular join is not enough since I need to maintain the rowtime column from VEHICLES for further processing and lookup join is not okay since both sources come from Kafka).
The two tables were created with:
CREATE TABLE USERS (
userID BIGINT,
name STRING,
ts STRING,
parsed_timestamp AS TO_TIMESTAMP(ts),
WATERMARK FOR parsed_timestamp AS parsed_timestamp - INTERVAL '5' SECONDS,
PRIMARY KEY(userID) NOT ENFORCED
) WITH (
'connector' = 'kafka',
'topic' = 'USERS',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'testGroup4',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json'
);
CREATE TABLE VEHICLES (
userID BIGINT,
rid BIGINT,
type STRING,
manufacturer STRING,
model STRING,
plate STRING,
status STRING,
ts STRING,
parsed_timestamp AS TO_TIMESTAMP(ts),
WATERMARK FOR parsed_timestamp AS parsed_timestamp - INTERVAL '5' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'VEHICLES',
'properties.group.id' = 'mytestgroup4',
'scan.startup.mode' = 'earliest-offset',
'properties.bootstrap.servers' = 'kafka:9092',
'format' ='json'
);
Any suggestions on what I have done wrong? I couldn't find much information about the error. If possible, I would prefer not to create views and to work directly with the two tables (I have tried with views, but I still get the same message).
Thank you
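For reference, the intended semantics of `FOR SYSTEM_TIME AS OF v.parsed_timestamp` can be modelled in plain Python: each VEHICLES row is matched with the latest USERS version whose timestamp is not later than the row's own timestamp. This sketch only illustrates those semantics under that assumption. It does not reproduce or fix the ClassCastException, and the helper name `temporal_join`, the integer timestamps, and the sample rows are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple

UserRow = Tuple[int, int, str]     # (timestamp, userID, name)
VehicleRow = Tuple[int, int, int]  # (timestamp, userID, rid)


def temporal_join(vehicles: List[VehicleRow], users: List[UserRow]) -> List[Tuple[int, str, int]]:
    """Join each vehicle row with the user version that was valid at its timestamp."""
    # Build a per-key, time-ordered version history of USERS.
    history: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for ts, user_id, name in sorted(users):
        history[user_id].append((ts, name))

    result = []
    for ts, user_id, rid in vehicles:
        versions = history.get(user_id, [])
        # Index of the first version strictly after ts; the one before it is valid at ts.
        idx = bisect_right([v_ts for v_ts, _ in versions], ts)
        if idx > 0:  # inner join: drop rows with no version valid at ts
            result.append((rid, versions[idx - 1][1], ts))
    return result


users = [(10, 1, "alice"), (50, 1, "alice-renamed"), (20, 2, "bob")]
vehicles = [(5, 1, 100), (30, 1, 101), (60, 1, 102), (25, 2, 200)]
joined = temporal_join(vehicles, users)
print(joined)
```

The vehicle row at timestamp 5 is dropped because no version of user 1 exists yet at that time, which is how an inner temporal join behaves in this model.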
