I am trying to sink a stream to the filesystem in CSV format using PyFlink, but it does not work.
# stream_to_csv.py
from pyflink.table import EnvironmentSettings, StreamTableEnvironment
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
table_env = StreamTableEnvironment.create(environment_settings=env_settings)
table_env.execute_sql("""
CREATE TABLE datagen (
id INT,
data STRING
) WITH (
'connector' = 'datagen',
'rows-per-second' = '1'
)
""")
table_env.execute_sql("""
CREATE TABLE print (
id INT,
data STRING
) WITH (
'connector' = 'filesystem',
'format' = 'csv',
'path' = '/tmp/output'
)
""")
table_env.execute_sql("""
INSERT INTO print
SELECT id, data
FROM datagen
""").wait()
To run the script:
$ python stream_to_csv.py
I expect records to be written to the /tmp/output folder, but that doesn't happen.
$ ~ ls /tmp/output
(nothing shown here)
Am I missing anything?
I am shamelessly copying Dian Fu's reply from http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Not-able-to-sink-a-stream-into-csv-td43105.html.
You need to set the rolling policy for the filesystem connector. You can refer to the Rolling Policy section [1] for more details.
There actually is output: if you run ls -la /tmp/output/, you will see several hidden files named “.part-xxx”. These are in-progress part files, which a plain ls does not show and which are only finalized when a checkpoint completes.
For your job, you need to set execution.checkpointing.interval in the configuration and sink.rolling-policy.rollover-interval in the properties of the filesystem connector.
The job will look like the following:
from pyflink.table import EnvironmentSettings, StreamTableEnvironment
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
table_env = StreamTableEnvironment.create(environment_settings=env_settings)
table_env.get_config().get_configuration().set_string("execution.checkpointing.interval", "10s")
table_env.execute_sql("""
CREATE TABLE datagen (
id INT,
data STRING
) WITH (
'connector' = 'datagen',
'rows-per-second' = '1'
)
""")
table_env.execute_sql("""
CREATE TABLE print (
id INT,
data STRING
) WITH (
'connector' = 'filesystem',
'format' = 'csv',
'path' = '/tmp/output',
'sink.rolling-policy.rollover-interval' = '10s'
)
""")
table_env.execute_sql("""
INSERT INTO print
SELECT id, data
FROM datagen
""").wait()
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/filesystem.html#rolling-policy
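To check from Python what the job has written so far, a small helper along the following lines may be useful. It is only a sketch using the standard library: the function name split_part_files is hypothetical, and it assumes that any file whose name starts with a dot (such as the “.part-xxx” files mentioned above) is still in progress, while other files are finished.

```python
from pathlib import Path


def split_part_files(output_dir):
    """Return (in_progress, finished) file names found directly in output_dir.

    Assumption: a leading dot marks a part file that has not been
    finalized yet; every other regular file counts as finished.
    """
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing has been written yet (or the path is wrong).
        return [], []

    in_progress, finished = [], []
    for entry in sorted(root.iterdir()):
        if not entry.is_file():
            continue
        if entry.name.startswith("."):
            in_progress.append(entry.name)
        else:
            finished.append(entry.name)
    return in_progress, finished


# Example call for the path used in the question:
# hidden, visible = split_part_files("/tmp/output")
```

If only the first list is non-empty, the job is writing data but no part file has been finalized yet.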
Related
I am trying to write a SQL statement that uses a CTE in Flink.
I have a table defined as:
CREATE TABLE test_cte
(
pod VARCHAR,
PRIMARY KEY (pod) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'test_cte',
'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
'properties.group.id' = 'test_cte_group_id',
'value.format' = 'json',
'key.format' = 'json',
'properties.allow.auto.create.topics' = 'true',
'properties.replication.factor' = '3',
'value.json.timestamp-format.standard' = 'ISO-8601',
'sink.parallelism' = '3'
);
Then I have an insert like this:
WITH q1 AS ( SELECT pod FROM source )
FROM q1
INSERT OVERWRITE TABLE test_cte
SELECT pod;
I get an error saying org.apache.flink.sql.parser.impl.ParseException: Incorrect syntax near the keyword 'FROM' at line 2, column 2.
The source table has the column pod.
When I run just the select, like this:
WITH q1 AS ( SELECT pod FROM roles_deleted_raw_v1)
select * from q1;
I can see the result.
This form of CTE (WITH ... FROM ... INSERT OVERWRITE TABLE ... SELECT) is Hive syntax, which is only available in Flink when using the Hive dialect:
SET table.sql-dialect = hive;
However, this feature is only supported by HiveCatalog, so it cannot be used with upsert-kafka.
For more information, see the Flink docs.
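If the Hive dialect is not an option, one possible workaround is to inline the CTE body as a subquery in a plain INSERT INTO ... SELECT. The sketch below only builds the statement text. The helper name insert_with_subquery is hypothetical, and it assumes that INSERT INTO (instead of INSERT OVERWRITE) is acceptable for the upsert-kafka table.

```python
def insert_with_subquery(target, subquery, columns, alias="q1"):
    """Build an INSERT INTO statement that inlines a CTE body as a derived table."""
    column_list = ", ".join(columns)
    return (
        f"INSERT INTO {target}\n"
        f"SELECT {column_list}\n"
        f"FROM ({subquery}) AS {alias}"
    )


# Table and column names are taken from the question.
statement = insert_with_subquery("test_cte", "SELECT pod FROM source", ["pod"])
print(statement)
```

The resulting text could then be run in the SQL client or passed to a table environment's execute_sql.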
CREATE TABLE roles_created_raw_v1
(
id VARCHAR,
created VARCHAR,
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'sink_topic',
'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
'properties.group.id' = 'sink_topic_id',
'value.format' = 'json',
'key.format' = 'json',
'properties.allow.auto.create.topics' = 'true',
'value.json.timestamp-format.standard' = 'ISO-8601',
'sink.parallelism' = '3'
);
I am trying to insert into this table using:
insert into roles_created_raw_v1
select
JSON_VALUE(contentJson, '$.id') as id,
to_timestamp(JSON_VALUE(contentJson, '$.created'), 'yyyy-MM-ddTHH:mm:ss.SSSZ') as created
from some_raw_table;
My contentJson field contains:
"contentJson": "{\"created\":\"2023-02-04T04:12:07.925Z\"}".
The created field in sink_topic and in the table roles_created_raw_v1 is null. How do I get this converted to a TIMESTAMP_LTZ field?
If I use JSON_VALUE(contentJson, '$.created' RETURNING STRING) instead of to_timestamp(JSON_VALUE(contentJson, '$.created'), 'yyyy-MM-ddTHH:mm:ss.SSSZ'), I get the string value back.
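to_timestamp(replace(replace(JSON_VALUE(contentJson, '$.created'), 'T', ' '), 'Z', ''), 'yyyy-MM-dd HH:mm:ss.SSS')
This seems to work. The original call most likely returns null because the unquoted T and Z in the format string are read as pattern letters rather than as literal characters.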
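The same clean-up idea can be tried outside Flink in plain Python, which makes it easy to see what string the format pattern has to match. This is only an illustration with the standard library, not Flink code; the function name parse_created is hypothetical.

```python
from datetime import datetime


def parse_created(raw):
    """Parse a string such as '2023-02-04T04:12:07.925Z' into a naive datetime."""
    # Turn the 'T' separator into a space and drop the trailing 'Z',
    # leaving '2023-02-04 04:12:07.925'.
    cleaned = raw.replace("T", " ").replace("Z", "")
    # %f accepts one to six fractional digits, so '.925' becomes 925000 microseconds.
    return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S.%f")


created = parse_created("2023-02-04T04:12:07.925Z")
print(created)
```

Note that the result carries no time zone information, just like the cleaned string.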
I have created a table:
CREATE OR REPLACE TABLE EMPLOYEE_SKILL
(
FILE_NAME_FULL_S3_PATH VARCHAR,
LINE_NUMBER VARCHAR,
SKILL_ID NUMBER,
EMP_ID NUMBER,
SKILL_NAME VARCHAR(50),
SKILL_LEVEL VARCHAR(50)
);
Note that I have added two columns, FILE_NAME_FULL_S3_PATH and LINE_NUMBER, because I also want these two details about the data ingestion.
I am trying to ingest data into the above table from 1000 files inside an S3 bucket.
I am using this command:
copy into EMPLOYEE_SKILL
from 's3://test_bucket/emp/' credentials=(aws_key_id='XXXXXXXXXXXXXX' aws_secret_key='YYYYYYYYYY')
file_format = (type = csv field_delimiter = '|' skip_header = 1)
on_error = 'continue';
How can I make sure the first two columns, the full S3 path of the file and the line number, are populated automatically?
You would need to do something like this:
copy into EMPLOYEE_SKILL(FILE_NAME_FULL_S3_PATH,
LINE_NUMBER,
SKILL_ID,
EMP_ID,
SKILL_NAME,
SKILL_LEVEL )
from (select metadata$filename, metadata$file_row_number, t.$1, t.$2, t.$3, t.$4 from s3://test_bucket/emp/) t
credentials=(aws_key_id='XXXXXXXXXXXXXX' aws_secret_key='YYYYYYYYYY')
file_format = (type = csv field_delimiter = '|' skip_header = 1)
on_error = 'continue';
I think I am doing exactly what is described in the Stack Overflow question "Flink temporal join not showing data" and in the official documentation for joining two data streams with a temporal join, but I keep getting the following error:
[ERROR] Could not execute SQL statement. Reason:
java.lang.ClassCastException: org.apache.flink.table.planner.plan.nodes.calcite.LogicalWatermarkAssigner cannot be cast to org.apache.calcite.rel.core.TableScan
while executing the following SQL:
SELECT v.rid, u.name, v.parsed_timestamp FROM VEHICLES v JOIN USERS FOR SYSTEM_TIME AS OF v.parsed_timestamp AS u ON v.userID = u.userID;
I have two dynamic tables that must be joined (regular join is not enough since I need to maintain the rowtime column from VEHICLES for further processing and lookup join is not okay since both sources come from Kafka).
The two tables were created with:
CREATE TABLE USERS (
userID BIGINT,
name STRING,
ts STRING,
parsed_timestamp AS TO_TIMESTAMP(ts),
WATERMARK FOR parsed_timestamp AS parsed_timestamp - INTERVAL '5' SECONDS,
PRIMARY KEY(userID) NOT ENFORCED
) WITH (
'connector' = 'kafka',
'topic' = 'USERS',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'testGroup4',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json'
);
CREATE TABLE VEHICLES (
userID BIGINT,
rid BIGINT,
type STRING,
manufacturer STRING,
model STRING,
plate STRING,
status STRING,
ts STRING,
parsed_timestamp AS TO_TIMESTAMP(ts),
WATERMARK FOR parsed_timestamp AS parsed_timestamp - INTERVAL '5' SECONDS
) WITH (
'connector' = 'kafka',
'topic' = 'VEHICLES',
'properties.group.id' = 'mytestgroup4',
'scan.startup.mode' = 'earliest-offset',
'properties.bootstrap.servers' = 'kafka:9092',
'format' ='json'
);
Any suggestions on what I have done wrong? I couldn't find much information about the error. If possible, I would prefer not to create views and to work directly with the two tables (I have tried with views, but I still get the same message).
Thank you
We are trying to join a table from a DB CDC connector (upsert behavior)
with a 'kafka' source of events, in order to enrich these events by key with the existing CDC data.
kafka-source (id, B, C) + cdc (id, D, E, F) = result(id, B, C, D, E, F) into a kafka sink (append)
INSERT INTO sink (zapatos, naranjas, device_id, account_id, user_id)
SELECT zapatos, naranjas, source.device_id, account_id, user_id FROM source
JOIN mongodb_source ON source.device_id = mongodb_source._id
The problem is that this only works if our Kafka sink is 'upsert-kafka'.
But this creates tombstones when a record is deleted in the DB.
We need the output to behave as plain events, not as a changelog.
However, we cannot use a plain 'kafka' sink, because the DB connector is upsert and therefore not compatible.
What would be the way to do this? Can the upsert stream be transformed into append-only events?
from pyflink.datastream import StreamExecutionEnvironment, TimeCharacteristic
from pyflink.table import EnvironmentSettings, StreamTableEnvironment
s_env = StreamExecutionEnvironment.get_execution_environment()
s_env.set_stream_time_characteristic(TimeCharacteristic.EventTime)
s_env.set_parallelism(1)
# use blink table planner
st_env = StreamTableEnvironment \
    .create(s_env, environment_settings=EnvironmentSettings
            .new_instance()
            .in_streaming_mode()
            .use_blink_planner().build())
ddl = """CREATE TABLE sink (
`zapatos` INT,
`naranjas` STRING,
`account_id` STRING,
`user_id` STRING,
`device_id` STRING,
`time28` INT,
PRIMARY KEY (device_id) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'as-test-output-flink-topic',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'testGroup',
'key.format' = 'raw',
'value.format' = 'json',
'value.fields-include' = 'EXCEPT_KEY'
)
"""
st_env.sql_update(ddl)
ddl = """CREATE TABLE source (
`device_id` STRING,
`timestamp` TIMESTAMP_LTZ(3) METADATA FROM 'timestamp',
`event_type` STRING,
`payload` ROW<`zapatos` INT, `naranjas` STRING, `time28` INT, `device_id` STRING>,
`trace_id` STRING
) WITH (
'connector' = 'kafka',
'topic' = 'as-test-input-flink-topic',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'testGroup',
'key.format' = 'raw',
'key.fields' = 'device_id',
'value.format' = 'json',
'value.fields-include' = 'EXCEPT_KEY'
)
"""
st_env.sql_update(ddl)
ddl = """
CREATE TABLE mongodb_source (
`_id` STRING PRIMARY KEY,
`account_id` STRING,
`user_id` STRING,
`device_id` STRING
) WITH (
'connector' = 'mongodb-cdc',
'uri' = '******',
'database' = '****',
'collection' = 'testflink'
)
"""
st_env.sql_update(ddl)
st_env.sql_update("""
INSERT INTO sink (zapatos, naranjas, device_id, account_id, user_id)
SELECT zapatos, naranjas, source.device_id, account_id, user_id FROM source
JOIN mongodb_source ON source.device_id = mongodb_source._id
""")
# execute
st_env.execute("kafka_to_kafka")
Don't mind the mongodb-cdc connector; it is new, but it works like mysql-cdc or postgres-cdc.
Thanks for your help!
Have you tried using LEFT JOIN instead of JOIN?
It shouldn't create tombstones then, if your purpose is just to enrich the Kafka events with data from Mongo when there is any.
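For reference, the suggested change applied to the statement from the question would look roughly like the sketch below. Only the join keyword differs; the constant name LEFT_JOIN_INSERT is hypothetical, and whether this actually avoids tombstones in a given setup is something to verify.

```python
# Same INSERT as in the question, with JOIN replaced by LEFT JOIN so that
# every Kafka event is kept even when there is no matching Mongo document.
LEFT_JOIN_INSERT = """
INSERT INTO sink (zapatos, naranjas, device_id, account_id, user_id)
SELECT zapatos, naranjas, source.device_id, account_id, user_id FROM source
LEFT JOIN mongodb_source ON source.device_id = mongodb_source._id
"""

# With the table environment from the question, it would be submitted
# in place of the original statement, for example:
# st_env.sql_update(LEFT_JOIN_INSERT)
print(LEFT_JOIN_INSERT)
```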