Using CTE in Apache Flink SQL - apache-flink

I am trying to write a SQL statement that uses a CTE in Flink.
I have a table defined:
CREATE TABLE test_cte
(
pod VARCHAR,
PRIMARY KEY (pod) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'test_cte',
'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
'properties.group.id' = 'test_cte_group_id',
'value.format' = 'json',
'key.format' = 'json',
'properties.allow.auto.create.topics' = 'true',
'properties.replication.factor' = '3',
'value.json.timestamp-format.standard' = 'ISO-8601',
'sink.parallelism' = '3'
);
Then I have an insert statement:
WITH q1 AS ( SELECT pod FROM source )
FROM q1
INSERT OVERWRITE TABLE test_cte
SELECT pod;
I get an error saying org.apache.flink.sql.parser.impl.ParseException: Incorrect syntax near the keyword 'FROM' at line 2, column 2.
The source table has the column pod.
When I run just the SELECT, like here:
WITH q1 AS ( SELECT pod FROM roles_deleted_raw_v1)
select * from q1;
I can see the result.

This FROM ... INSERT form of a CTE is only available in Flink when using the Hive dialect:
SET table.sql-dialect = hive;
However, the Hive dialect is only supported together with a HiveCatalog, so it cannot be used with an upsert-kafka table.
For more info on this, you can check out the Flink docs.
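If the goal is simply to get the CTE result into test_cte, one possible workaround in the default dialect (an untested sketch; it assumes the parser accepts a WITH clause inside the source query of an INSERT) is to move the CTE after the INSERT INTO clause:
INSERT INTO test_cte
WITH q1 AS (SELECT pod FROM source)
SELECT pod FROM q1;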

Related

Using Flink SQL, how to convert a string read using JSON_VALUE to TIMESTAMP_LTZ

CREATE TABLE roles_created_raw_v1
(
id VARCHAR,
created VARCHAR,
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'sink_topic',
'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
'properties.group.id' = 'sink_topic_id',
'value.format' = 'json',
'key.format' = 'json',
'properties.allow.auto.create.topics' = 'true',
'value.json.timestamp-format.standard' = 'ISO-8601',
'sink.parallelism' = '3'
);
I am trying to insert into this table using:
insert into roles_created_raw_v1
select
JSON_VALUE(contentJson, '$.id') as id,
to_timestamp(JSON_VALUE(contentJson, '$.created'), 'yyyy-MM-ddTHH:mm:ss.SSSZ') as created
from some_raw_table;
My contentJson field has
"contentJson": "{\"created\":\"2023-02-04T04:12:07.925Z\"}".
The created field in the sink_topic and in the table roles_created_raw_v1 is null. How do I get this converted to a TIMESTAMP_LTZ field?
If, instead of to_timestamp(JSON_VALUE(contentJson, '$.created'), 'yyyy-MM-ddTHH:mm:ss.SSSZ'), I use JSON_VALUE(contentJson, '$.created' RETURNING STRING), I get the string value back.
to_timestamp(replace(replace(JSON_VALUE(contentJson, '$.created'), 'T', ' '), 'Z', ' '), 'yyyy-MM-dd HH:mm:ss.SSS')
This seems to work.
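An alternative that avoids the string replacements (a sketch, assuming the format argument of to_timestamp treats single-quoted text as literals, as Java date patterns do; the quotes are doubled because they sit inside a SQL string literal) would be:
to_timestamp(JSON_VALUE(contentJson, '$.created'), 'yyyy-MM-dd''T''HH:mm:ss.SSS''Z''')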

Why is Flink 1.15.2 showing No Watermark (Watermarks are only available if EventTime is used)?

In my CREATE TABLE DDL I have set a watermark on a column, and I am doing a simple COUNT(DISTINCT userId) over a tumble window of 1 minute, but I am still not getting any data. The same simple job works fine in 1.13.
CREATE TABLE test (
eventName String,
ingestion_time BIGINT,
time_ltz AS TO_TIMESTAMP_LTZ(ingestion_time, 3),
props ROW(userId VARCHAR, id VARCHAR, tourName VARCHAR, advertiserId VARCHAR, deviceId VARCHAR, tourId VARCHAR),
WATERMARK FOR time_ltz AS time_ltz - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'test',
'scan.startup.mode' = 'latest-offset',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'local_test_flink_115',
'format' = 'json',
'json.ignore-parse-errors' = 'true',
'scan.topic-partition-discovery.interval' = '60000'
);
We have also migrated other jobs, but the output data does not match. Is there any default watermark setting we need to set?
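For reference, the windowed aggregation described above would look roughly like this (a sketch only; the actual query is not shown in the question, and the column names are taken from the DDL above):
SELECT
TUMBLE_START(time_ltz, INTERVAL '1' MINUTE) AS window_start,
COUNT(DISTINCT props.userId) AS distinct_users
FROM test
GROUP BY TUMBLE(time_ltz, INTERVAL '1' MINUTE);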

Flink Dynamic Tables Temporal Join - Calcite error

I think I am doing exactly what was mentioned in this Stack Overflow question (Flink temporal join not showing data) and in the official docs for joining two data streams with a temporal join, but I keep getting the following error:
[ERROR] Could not execute SQL statement. Reason:
java.lang.ClassCastException: org.apache.flink.table.planner.plan.nodes.calcite.LogicalWatermarkAssigner cannot be cast to org.apache.calcite.rel.core.TableScan
while executing the following SQL:
SELECT v.rid, u.name, v.parsed_timestamp FROM VEHICLES v JOIN USERS FOR SYSTEM_TIME AS OF v.parsed_timestamp AS u ON v.userID = u.userID;
I have two dynamic tables that must be joined (a regular join is not enough since I need to maintain the rowtime column from VEHICLES for further processing, and a lookup join is not an option since both sources come from Kafka).
The two tables were created with:
CREATE TABLE USERS (
userID BIGINT,
name STRING,
ts STRING,
parsed_timestamp AS TO_TIMESTAMP(ts),
WATERMARK FOR parsed_timestamp AS parsed_timestamp - INTERVAL '5' SECONDS,
PRIMARY KEY(userID) NOT ENFORCED
) WITH (
'connector' = 'kafka',
'topic' = 'USERS',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'testGroup4',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json'
);
CREATE TABLE VEHICLES (
userID BIGINT,
rid BIGINT,
type STRING,
manufacturer STRING,
model STRING,
plate STRING,
status STRING,
ts STRING,
parsed_timestamp AS TO_TIMESTAMP(ts),
WATERMARK FOR parsed_timestamp AS parsed_timestamp - INTERVAL '5' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'VEHICLES',
'properties.group.id' = 'mytestgroup4',
'scan.startup.mode' = 'earliest-offset',
'properties.bootstrap.servers' = 'kafka:9092',
'format' ='json'
);
Any suggestion on what I have done wrong? I couldn't find much information about the error. If possible, I would prefer not to create views and to work directly with the two tables (I have tried with views, but I still get the same message).
Thank you

Flink Source kafka Join with CDC source to kafka sink

We are trying to join a DB CDC connector table (which has upsert behaviour) with a 'kafka' source of events, enriching these events by key with the existing CDC data:
kafka-source (id, B, C) + cdc (id, D, E, F) = result (id, B, C, D, E, F) into a kafka sink (append)
INSERT INTO sink (zapatos, naranjas, device_id, account_id, user_id)
SELECT zapatos, naranjas, source.device_id, account_id, user_id FROM source
JOIN mongodb_source ON source.device_id = mongodb_source._id
The problem is that this only works if our Kafka sink is 'upsert-kafka', but that creates tombstones on deletions in the DB. We need it to behave as plain events, not as a changelog, yet we cannot use a plain 'kafka' sink because the DB connector is upsert and therefore not compatible...
What would be the way to do this? Transform the upsert stream into plain append events?
from pyflink.datastream import StreamExecutionEnvironment, TimeCharacteristic
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

s_env = StreamExecutionEnvironment.get_execution_environment()
s_env.set_stream_time_characteristic(TimeCharacteristic.EventTime)
s_env.set_parallelism(1)
# use blink table planner
st_env = StreamTableEnvironment \
.create(s_env, environment_settings=EnvironmentSettings
.new_instance()
.in_streaming_mode()
.use_blink_planner().build())
ddl = """CREATE TABLE sink (
`zapatos` INT,
`naranjas` STRING,
`account_id` STRING,
`user_id` STRING,
`device_id` STRING,
`time28` INT,
PRIMARY KEY (device_id) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'as-test-output-flink-topic',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'testGroup',
'key.format' = 'raw',
'value.format' = 'json',
'value.fields-include' = 'EXCEPT_KEY'
)
"""
st_env.sql_update(ddl)
ddl = """CREATE TABLE source (
`device_id` STRING,
`timestamp` TIMESTAMP_LTZ(3) METADATA FROM 'timestamp',
`event_type` STRING,
`payload` ROW<`zapatos` INT, `naranjas` STRING, `time28` INT, `device_id` STRING>,
`trace_id` STRING
) WITH (
'connector' = 'kafka',
'topic' = 'as-test-input-flink-topic',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'testGroup',
'key.format' = 'raw',
'key.fields' = 'device_id',
'value.format' = 'json',
'value.fields-include' = 'EXCEPT_KEY'
)
"""
st_env.sql_update(ddl)
ddl = """
CREATE TABLE mongodb_source (
`_id` STRING PRIMARY KEY,
`account_id` STRING,
`user_id` STRING,
`device_id` STRING
) WITH (
'connector' = 'mongodb-cdc',
'uri' = '******',
'database' = '****',
'collection' = 'testflink'
)
"""
st_env.sql_update(ddl)
st_env.sql_update("""
INSERT INTO sink (zapatos, naranjas, device_id, account_id, user_id)
SELECT zapatos, naranjas, source.device_id, account_id, user_id FROM source
JOIN mongodb_source ON source.device_id = mongodb_source._id
""")
# execute
st_env.execute("kafka_to_kafka")
Don't mind the Mongo CDC connector; it is new but works like the mysql-cdc or postgres-cdc connectors.
Thanks for your help!
Have you tried using a LEFT JOIN instead of a JOIN?
It shouldn't create tombstones then, if your purpose is just to enrich the Kafka events with whatever data exists in Mongo…
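Applied to the INSERT statement from the question, the suggestion would look roughly like this (a sketch only, reusing the table and column names above):
INSERT INTO sink (zapatos, naranjas, device_id, account_id, user_id)
SELECT zapatos, naranjas, source.device_id, account_id, user_id FROM source
LEFT JOIN mongodb_source ON source.device_id = mongodb_source._id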

Flink: Not able to sink a stream into csv

I am trying to sink a stream into the filesystem in CSV format using PyFlink; however, it does not work.
# stream_to_csv.py
from pyflink.table import EnvironmentSettings, StreamTableEnvironment
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
table_env = StreamTableEnvironment.create(environment_settings=env_settings)
table_env.execute_sql("""
CREATE TABLE datagen (
id INT,
data STRING
) WITH (
'connector' = 'datagen',
'rows-per-second' = '1'
)
""")
table_env.execute_sql("""
CREATE TABLE print (
id INT,
data STRING
) WITH (
'connector' = 'filesystem',
'format' = 'csv',
'path' = '/tmp/output'
)
""")
table_env.execute_sql("""
INSERT INTO print
SELECT id, data
FROM datagen
""").wait()
To run the script:
$ python stream_to_csv.py
I expect records to go to the /tmp/output folder; however, that doesn't happen.
$ ~ ls /tmp/output
(nothing shown here)
Am I missing anything?
I shamelessly copy Dian Fu's reply from http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Not-able-to-sink-a-stream-into-csv-td43105.html.
You need to set the rolling policy for the filesystem connector. You could refer to the Rolling Policy section [1] for more details.
Actually, there is output: if you execute the command ls -la /tmp/output/, you will see several files named ".part-xxx".
For your job, you need to set execution.checkpointing.interval in the configuration and sink.rolling-policy.rollover-interval in the properties of the filesystem connector.
The job will look like the following:
from pyflink.table import EnvironmentSettings, StreamTableEnvironment
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
table_env = StreamTableEnvironment.create(environment_settings=env_settings)
table_env.get_config().get_configuration().set_string("execution.checkpointing.interval", "10s")
table_env.execute_sql("""
CREATE TABLE datagen (
id INT,
data STRING
) WITH (
'connector' = 'datagen',
'rows-per-second' = '1'
)
""")
table_env.execute_sql("""
CREATE TABLE print (
id INT,
data STRING
) WITH (
'connector' = 'filesystem',
'format' = 'csv',
'path' = '/tmp/output',
'sink.rolling-policy.rollover-interval' = '10s'
)
""")
table_env.execute_sql("""
INSERT INTO print
SELECT id, data
FROM datagen
""").wait()
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/filesystem.html#rolling-policy
