Context
I have a Flink job written with the Python SQL API. It consumes source data from Kinesis and produces results to Kinesis. I want to run a local test to ensure the Flink application code is correct, so I mocked out both the source and sink Kinesis streams with the filesystem connector and ran the test pipeline locally. The local Flink job always runs successfully, but when I look into the sink file, the sink file is always empty. The same happens when I run the code in the Flink SQL Client.
Here is my code:
CREATE TABLE incoming_data (
requestId VARCHAR(4),
groupId VARCHAR(32),
userId VARCHAR(32),
requestStartTime VARCHAR(32),
processTime AS PROCTIME(),
requestTime AS TO_TIMESTAMP(SUBSTR(REPLACE(requestStartTime, 'T', ' '), 0, 23), 'yyyy-MM-dd HH:mm:ss.SSS'),
WATERMARK FOR requestTime AS requestTime - INTERVAL '5' SECOND
) WITH (
'connector' = 'filesystem',
'path' = '/path/to/test/json/file.json',
'format' = 'json',
'json.timestamp-format.standard' = 'ISO-8601'
)
CREATE TABLE user_latest_request (
groupId VARCHAR(32),
userId VARCHAR(32),
latestRequestTime TIMESTAMP
) WITH (
'connector' = 'filesystem',
'path' = '/path/to/sink',
'format' = 'csv'
)
INSERT INTO user_latest_request
SELECT groupId,
userId,
MAX(requestTime) as latestRequestTime
FROM incoming_data
GROUP BY TUMBLE(processTime, INTERVAL '1' SECOND), groupId, userId;
Curious what I am doing wrong here.
Note:
I am using Flink 1.11.0
If I directly dump data from source to sink without windowing and grouping, it works fine. That means the source and sink tables are set up correctly, so the problem seems to be around the tumbling window and grouping with the local filesystem.
This code works fine with Kinesis source and sink.
Have you enabled checkpointing? This is required if you are in `STREAMING` mode, which appears to be the case. See https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/file_sink/
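If checkpointing isn't enabled yet, here is a minimal sketch using the Python Table API (the 10-second interval is just an illustrative value to tune):
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# In streaming mode the filesystem sink only commits (rolls) part files on
# checkpoints, so without this the output directory can stay empty.
env.enable_checkpointing(10000)  # interval in milliseconds
t_env = StreamTableEnvironment.create(env)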
The most likely cause is that there isn't enough data in the file being read to keep the job running long enough for the window to close. You have a processing-time-based window that is 1 second long, which means that the job will have to run for at least one second to guarantee that the first window will produce results.
Otherwise, once the source runs out of data the job will shut down, regardless of whether the window contains unreported results.
If you switch to event-time-based windowing, then when the file source runs out of data it will send one last watermark with the value MAX_WATERMARK, which will trigger the window.
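For illustration (table and column names are taken from the question, and t_env is assumed to be the job's StreamTableEnvironment), the aggregation rewritten onto an event-time tumbling window over requestTime could look roughly like this:
# Sketch: group by an event-time window so that the final MAX_WATERMARK
# emitted when the file source finishes fires the last window before shutdown.
t_env.execute_sql("""
    INSERT INTO user_latest_request
    SELECT groupId,
           userId,
           MAX(requestTime) AS latestRequestTime
    FROM incoming_data
    GROUP BY TUMBLE(requestTime, INTERVAL '1' SECOND), groupId, userId
""")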
In my pipeline I am using PyFlink to load and transform data from an RDS database and sink it to MySQL. Using Flink CDC I am able to get the data I want from RDS and, with the JDBC connector, sink it to MySQL. My aim is to read one table and create 10 others using a sample of the code below, in one job (basically breaking a huge table into smaller tables). The problem I am facing is that despite using RocksDB as the state backend and Flink CDC options such as scan.incremental.snapshot.chunk.size, scan.snapshot.fetch.size and debezium.min.row.count.to.stream.results, memory usage keeps growing, causing a TaskManager with 2 GB of memory to fail. My intuition here is that a simple select-insert query loads the whole table into memory no matter what! If so, can I somehow avoid that? The table size is around 500k rows.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
stmt_set = t_env.create_statement_set()
create_kafka_source= (
"""
CREATE TABLE somethin(
bla INT,
bla1 DOUBLE,
bla2 TIMESTAMP(3),
PRIMARY KEY(bla2) NOT ENFORCED
) WITH (
'connector'='mysql-cdc',
'server-id'='1000',
'debezium.snapshot.mode' = 'when_needed',
'debezium.poll.interval.ms'='5000',
'hostname'= 'som2',
'port' ='som2',
'database-name'='som3',
'username'='som4',
'password'='somepass',
'table-name' = 'atable'
)
"""
)
create_kafka_dest = (
"""CREATE TABLE IF NOT EXISTS atable(
time1 TIMESTAMP(3),
blah2 DOUBLE,
PRIMARY KEY(time1) NOT ENFORCED
) WITH ( 'connector'= 'jdbc',
'url' = 'jdbc:mysql://name1:3306/name1',
'table-name' = 't1','username' = 'user123',
'password' = '123'
)"""
)
t_env.execute_sql(create_kafka_source)
t_env.execute_sql(create_kafka_dest)
stmt_set.add_insert_sql(
"INSERT INTO atable SELECT DISTINCT bla2,bla1,"
"FROM somethin"
)
Using DISTINCT in a streaming query is expensive, especially when there aren't any temporal constraints on the distinctiveness (e.g., counting unique visitors per day). I imagine that's why your query needs a lot of state.
However, you should be able to get this to work. RocksDB isn't always well-behaved; sometimes it will consume more memory than it has been allocated.
What version of Flink are you using? Improvements were made in Flink 1.11 (by switching to jemalloc) and further improvements came in Flink 1.14 (by upgrading to a newer version of RocksDB). So upgrading Flink might fix this. Otherwise you may need to basically lie and tell Flink it has somewhat less memory than it actually has, so that when RocksDB steps out of bounds it doesn't cause out-of-memory errors.
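If upgrading isn't an option, another way to keep the DISTINCT state bounded is to let Flink expire idle state. A minimal sketch, assuming the idle-state-retention setting exposed by PyFlink's TableConfig (the 1 h / 2 h values are placeholders, and keys that reappear after expiry will be emitted again):
from datetime import timedelta

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Drop state that hasn't been touched within the retention interval, so the
# state kept for DISTINCT cannot grow without bound.
t_env.get_config().set_idle_state_retention_time(
    timedelta(hours=1), timedelta(hours=2)
)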
When a pipe is re-created, there is a chance of missing some notifications. Is there any way to replay these missed notifications? Refreshing the pipe is dangerous (so not an option), as the load history is lost when the pipe is re-created (and hence could result in ingesting the same files twice & creating duplicate records)
Snowflake has documented a process on how to re-create pipes with automated data loading (link). Unfortunately, any new notifications coming in between step 1 (pause the pipe) and step 3 (re-create the pipe) can be missed. Even by automating the process with a procedure, we can shrink the window, but not eliminate it. I have confirmed this with multiple tests. Even without pausing the previous pipe, there's still a slim chance for this to happen.
However, Snowflake is aware of the notifications, as the notification queue is separate from the pipes (and shared for the entire account). But the notifications received at the "wrong" time are just never processed (which I guess makes sense if there's no active pipe to process them at the time).
I think we can see those notifications in the numOutstandingMessagesOnChannel property of the pipe status, but I can't find much more information about this, nor how to get those notifications processed. I think they might just become lost when the pipe is replaced. 😞
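(For reference, here is a sketch of how that property can be read from SYSTEM$PIPE_STATUS using the Snowflake Python connector; the connection values and pipe name are placeholders:)
import json
import snowflake.connector

# Hypothetical connection parameters; replace with real account details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password"
)
try:
    cur = conn.cursor()
    cur.execute("SELECT SYSTEM$PIPE_STATUS('my_db.my_schema.my_pipe')")
    # The pipe status is returned as a JSON document.
    status = json.loads(cur.fetchone()[0])
    print(status.get("numOutstandingMessagesOnChannel"))
finally:
    conn.close()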
Note: This is related to another question I asked about preserving the load history when re-creating pipes in Snowflake (link).
Assuming there's no way to replay outstanding notifications, I've instead created a procedure to detect files that have failed to load automatically. A benefit of this approach is that it can also detect any file that has failed to load for any reason (not only missed notifications).
The procedure can be called like this:
CALL verify_pipe_load(
'my_db.my_schema.my_pipe', -- Pipe name
'my_db.my_schema.my_stage', -- Stage name
'my_db.my_schema.my_table', -- Table name
'/YYYY/MM/DD/HH/', -- File prefix
'YYYY-MM-DD', -- Start time for the loads
'ERROR' -- Mode
);
Here's how it works, at a high level:
First, it finds all the files in the stage that match the specified prefix (using the LIST command), minus a slight delay to account for latency.
Then, out of those files, it finds all of those that have no records in COPY_HISTORY.
Finally, it handles those missing file loads in one of three ways, depending on the mode:
The 'ERROR' mode will abort the procedure by throwing an exception. This is useful to automate the continuous monitoring of pipes and ensure no files are missed. Just hook it up to your automation tool of choice! We use DBT + DBT Cloud.
The 'INGEST' mode will automatically re-queue the files for ingestion by Snowpipe using the REFRESH command for those specific files only.
The 'RETURN' mode will simply return the list of files in the response.
Here is the code for the procedure:
-- Returns a list of files missing from the destination table (separated by new lines).
-- Returns NULL if there are no missing files.
CREATE OR REPLACE PROCEDURE verify_pipe_load(
-- The FQN of the pipe (used to auto ingest):
PIPE_FQN STRING,
-- Stage to get the files from (same as the pipe definition):
STAGE_NAME STRING,
-- Destination table FQN (same as the pipe definition):
TABLE_FQN STRING,
-- File prefix (to filter files):
-- This should be based on a timestamp (ex: /YYYY/MM/DD/HH/)
-- in order to restrict files to a specific time interval
PREFIX STRING,
-- The time to get the loaded files from (should match the prefix):
START_TIME STRING,
-- What to do with the missing files (if any):
-- 'RETURN': Return the list of missing files.
-- 'INGEST': Automatically ingest the missing files (and return the list).
-- 'ERROR': Make the procedure fail by throwing an exception.
MODE STRING
)
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
MODE = MODE.toUpperCase();
if (!['RETURN', 'INGEST', 'ERROR'].includes(MODE)) {
throw `Exception: Invalid mode '${MODE}'. Must be one of 'RETURN', 'INGEST' or 'ERROR'`;
}
let tableDB = TABLE_FQN.split('.')[0];
let [pipeDB, pipeSchema, pipeName] = PIPE_FQN.split('.')
.map(name => name.startsWith('"') && name.endsWith('"')
? name.slice(1, -1)
: name.toUpperCase()
);
let listQueryId = snowflake.execute({sqlText: `
LIST @${STAGE_NAME}${PREFIX};
`}).getQueryId();
let missingFiles = snowflake.execute({sqlText: `
WITH staged_files AS (
SELECT
"name" AS name,
TO_TIMESTAMP_NTZ(
"last_modified",
'DY, DD MON YYYY HH24:MI:SS GMT'
) AS last_modified,
-- Add a minute per GB, to account for larger file size = longer ingest time
ROUND("size" / 1024 / 1024 / 1024) AS ingest_delay,
-- Estimate the time by which the ingest should be done (default 5 minutes)
DATEADD(minute, 5 + ingest_delay, last_modified) AS ingest_done_ts
FROM TABLE(RESULT_SCAN('${listQueryId}'))
-- Ignore files that may not be done being ingested yet
WHERE ingest_done_ts < CONVERT_TIMEZONE('UTC', CURRENT_TIMESTAMP())::TIMESTAMP_NTZ
), loaded_files AS (
SELECT stage_location || file_name AS name
FROM TABLE(
${tableDB}.information_schema.copy_history(
table_name => '${TABLE_FQN}',
start_time => '${START_TIME}'::TIMESTAMP_LTZ
)
)
WHERE pipe_catalog_name = '${pipeDB}'
AND pipe_schema_name = '${pipeSchema}'
AND pipe_name = '${pipeName}'
), stage AS (
SELECT DISTINCT stage_location
FROM TABLE(
${tableDB}.information_schema.copy_history(
table_name => '${TABLE_FQN}',
start_time => '${START_TIME}'::TIMESTAMP_LTZ
)
)
WHERE pipe_catalog_name = '${pipeDB}'
AND pipe_schema_name = '${pipeSchema}'
AND pipe_name = '${pipeName}'
), missing_files AS (
SELECT REPLACE(name, stage_location) AS prefix
FROM staged_files
CROSS JOIN stage
WHERE name NOT IN (
SELECT name FROM loaded_files
)
)
SELECT LISTAGG(prefix, '\n') AS "missing_files"
FROM missing_files;
`});
if (!missingFiles.next()) return null;
missingFiles = missingFiles.getColumnValue('missing_files');
if (missingFiles.length == 0) return null;
if (MODE == 'ERROR') {
throw `Exception: Found missing files:\n'${missingFiles}'`;
}
if (MODE == 'INGEST') {
missingFiles
.split('\n')
.forEach(file => snowflake.execute({sqlText: `
ALTER PIPE ${PIPE_FQN} REFRESH prefix='${file}';
`}));
}
return missingFiles;
$$
;
I have the following simple code snippet that writes a streaming GROUP BY result into a Kafka topic.
The Kafka sink table definition:
CREATE TABLE sinkTable (
id STRING,
total_price DOUBLE
) WITH (
'connector' = 'kafka',
'topic' = 'test6',
'properties.bootstrap.servers' = 'localhost:9092',
'key.format' = 'json',
'key.json.ignore-parse-errors' = 'true',
'key.fields' = 'id',
'value.format' = 'json',
'value.json.fail-on-missing-field' = 'false',
'value.fields-include' = 'ALL'
)
When I run the following query
insert into sinkTable
select id, sum(price)
from sourceTable
group by id
it throws an exception:
Table sink 'default_catalog.default_database.sinkTable' doesn't support consuming update changes which is produced by node GroupAggregate(groupBy=[id], select=[id, SUM(price) AS EXPR$1])
I don't know where the problem is. It looks to me like the kafka connector doesn't support GROUP BY queries?
The problem is exactly as you've described it: the default kafka connector only supports append-only streams. As you may imagine, the query you are trying to run will produce an update for every new element, since it changes the sum for the elements with that id.
One of the simplest things to do is to use the upsert-kafka connector, which will automatically handle updates and write them to Kafka, but it is only available since Flink 1.12, so you may want to upgrade to that version if you haven't yet.
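For illustration, the sink declared with upsert-kafka could look roughly like this (a sketch assuming Flink 1.12+, shown through PyFlink's execute_sql with a StreamTableEnvironment t_env; the same DDL works in the SQL Client, and the topic and broker address are reused from the question):
# Sketch: an upsert-kafka sink accepts the update stream produced by the
# GROUP BY; the PRIMARY KEY column becomes the Kafka message key.
t_env.execute_sql("""
    CREATE TABLE sinkTable (
        id STRING,
        total_price DOUBLE,
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'upsert-kafka',
        'topic' = 'test6',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'value.format' = 'json'
    )
""")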
I'm trying to join two continuous queries, but keep running into the following error:
Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before.\nPlease check the documentation for the set of currently supported SQL features.
Here's the table definition:
CREATE TABLE `Combined` (
`machineID` STRING,
`cycleID` BIGINT,
`start` TIMESTAMP(3),
`end` TIMESTAMP(3),
WATERMARK FOR `end` AS `end` - INTERVAL '5' SECOND,
`sensor1` FLOAT,
`sensor2` FLOAT
)
and the insert query
INSERT INTO `Combined`
SELECT
a.`MachineID`,
a.`cycleID`,
MAX(a.`start`) `start`,
MAX(a.`end`) `end`,
MAX(a.`sensor1`) `sensor1`,
MAX(m.`sensor2`) `sensor2`
FROM `Aggregated` a, `MachineStatus` m
WHERE
a.`MachineID` = m.`MachineID` AND
a.`cycleID` = m.`cycleID` AND
a.`start` = m.`timestamp`
GROUP BY a.`MachineID`, a.`cycleID`, SESSION(a.`start`, INTERVAL '1' SECOND)
In the source tables Aggregated and MachineStatus, the start and timestamp columns are time attributes with a watermark.
I've tried casting the input rows of the join to timestamps, but that didn't fix the issue and would mean that I cannot use SESSION, which is supposed to ensure that only one data point gets recorded per cycle.
Any help is greatly appreciated!
I investigated this a little further and noticed that the GROUP BY statement doesn't make sense in that context.
Furthermore, the SESSION window can be replaced by a time-bounded (interval) join condition, which is the more idiomatic approach here.
INSERT INTO `Combined`
SELECT
a.`MachineID`,
a.`cycleID`,
a.`start`,
a.`end`,
a.`sensor1`,
m.`sensor2`
FROM `Aggregated` a, `MachineStatus` m
WHERE
a.`MachineID` = m.`MachineID` AND
a.`cycleID` = m.`cycleID` AND
m.`timestamp` BETWEEN a.`start` AND a.`start` + INTERVAL '0' SECOND
To understand the different ways to join dynamic tables, I found the Ververica SQL training extremely helpful.
I've created a pipe from an S3 stage, and with a Python script I'm recording the timestamps of when I generate the data from a streaming service into file batches. I would also like to add the timestamp of when the files were actually copied into the table from the S3 stage. I've found some documentation regarding the PIPE_USAGE_HISTORY function, but although I've run quite a few tests over the past days, the query below returns an empty table. What am I doing wrong?
select * from table(information_schema.pipe_usage_history(
date_range_start=>dateadd('day',-14,current_date()),
date_range_end=>current_date())
)
I found the answer. There is another function I should be using: copy_history.
The above query can be rewritten as follows:
select * from table(information_schema.copy_history(
table_name => '{replace with your schema.table}',
start_time => dateadd(days, -14, current_timestamp()),
end_time => current_timestamp())
)