I am working on a solution where I read from a Kafka CDC compacted topic (Debezium format). Technically, this topic is used as a source of external information stored in a Flink table. The problem is: how do I wait for this table to catch up to the latest offset on the compacted topic before it starts processing everything else? I want to do this for when the Flink app needs to restart, so it can first finish playing back data from the CDC topic and only then start consuming messages from the other Kafka source.
Flink: 1.15.2
Kafka CDC table:
CREATE TABLE metadata (
id VARCHAR,
eventMetadata VARCHAR,
origin_ts as PROCTIME(),
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'kafka',
'properties.bootstrap.servers' = 'SERVER_ADDR',
'properties.group.id' = 'SOME_GROUP',
'topic' = 'SOME_TOPIC',
'scan.startup.mode' = 'latest-offset',
'value.format' = 'debezium-json'
)
Ideally, I need to wait until the CDC table has caught up with the latest offset on the topic and only then start processing messages from the actual message source:
KafkaSource.<CustomObject>builder()
.setBootstrapServers(params.get("kafka.bootstrap.servers"))
.setBounded(OffsetsInitializer.latest())
.setDeserializer(KafkaRecordDeserializationSchema.of(new CustomObjectDeserializer()))
.setTopics(params.get("kafka.source.topic"))
.setGroupId(params.get("kafka.source.consumer.id"))
.build();
Would I have to write some blocking code that executes the rest of the topology only once the CDC table is "quiet" (i.e. hasn't read any new rows in, say, the last 10 seconds)? Are there any better ways of playing back Kafka topics in Flink at startup that would wait until the CDC source is fully caught up?
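For context, the kind of blocking check I have in mind would look roughly like this: compare the CDC consumer group's committed offsets against the current end offsets before attaching the second source. This is only a sketch; the bootstrap servers, group id, and the assumption that the job commits offsets for SOME_GROUP on checkpoints are placeholders/assumptions, and the helper is not a Flink API.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class CdcCatchUpCheck {

    // Hypothetical helper: returns true once the CDC consumer group has committed
    // offsets at (or beyond) the current end offsets of every partition it owns.
    static boolean isCaughtUp(String bootstrapServers, String groupId) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        try (AdminClient admin = AdminClient.create(props)) {
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();
            if (committed.isEmpty()) {
                return false; // nothing committed yet, definitely not caught up
            }
            // Ask the brokers for the current end offset of each partition the group has seen
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(request).all().get();
            // Caught up only if every partition's committed offset has reached the end offset
            return committed.entrySet().stream().allMatch(e ->
                    e.getValue() != null
                            && e.getValue().offset() >= endOffsets.get(e.getKey()).offset());
        }
    }
}

The idea would be to poll this until it returns true before wiring in the second KafkaSource, but I'm not sure whether this is idiomatic in Flink.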
Related
We have a Flink SQL job (Table API) that reads Offers from a Kafka topic (8 partitions) as its source, aggregates them with other data sources to calculate the cheapest offer per item and enrich the result with extra data, and sinks the result back to another Kafka topic.
Sink looks like this:
CREATE TABLE cheapest_item_offer (
`id_offer` VARCHAR(36),
`id_item` VARCHAR(36),
`price` DECIMAL(13,2),
-- ... more offer fields
PRIMARY KEY (`id_item`) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = '<TOPIC_NAME>',
'properties.bootstrap.servers' = '<KAFKA_BOOTSTRAP_SERVERS>',
'properties.group.id' = '<JOBNAME>',
'sink.buffer-flush.interval' = '1000',
'sink.buffer-flush.max-rows' = '100',
'key.format' = 'json',
'value.format' = 'json'
);
And the upsert looks like this:
INSERT INTO cheapest_item_offer
WITH offers_with_stock_ordered_by_price AS (
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY id_item
ORDER BY price ASC
) AS n_row
FROM offer
WHERE quantity > 0
), cheapest_offer AS (
SELECT offer.*
FROM offers_with_stock_ordered_by_price offer
WHERE offer.n_row = 1
)
SELECT id_offer,
id_item,
price,
-- ... item extra fields
FROM cheapest_offer
-- ... extra JOINS here to aggregate more item data
Given this configuration, the job initially ingests the data, calculates things properly, and sets the cheapest offer correctly. But after some time passes, events in our data source unexpectedly result in a tombstone (not always; sometimes the value is set properly). After checking them, we see they shouldn't be tombstones, mainly because there is an actual cheapest offer for that item and the related JOIN rows do exist.
The following images illustrate the issue with some Kafka messages:
Data source
This is the data source we ingest the data from. The latest update for a given Item shows that an Offer has some changes.
Data Sink
This is the data sink for the same Item. As we can see, the latest update was generated at the same time, because of the data source update, but the resulting value is a tombstone rather than the actual value from the data source.
If we relaunch the job from scratch (ignoring savepoints), the affected Items are fixed on the first run, but the same issue appears again after some time.
Some considerations:
In our Data Source, each Item can have multiple Offers and can be allocated in different Partitions
Flink job is running with parallelism set to 8 (same as the number of Kafka partitions)
We're using Flink 1.13.2 with upsert-kafka connector in Source & Sink
We're using Kafka 2.8
We believe the issue is in the cheapest offer virtual tables, as the JOINs contain proper data
We're using rocksdb as state.backend
We're struggling to find the reason behind this behavior (we're pretty new to Flink), and we don't know where to focus to fix it. Can anybody help here?
Any suggestion will be highly appreciated!
Apparently it was a bug in Flink SQL on v1.13.2, as noted in Flink's Jira ticket FLINK-25559.
We managed to solve this issue by upgrading version to v1.13.6.
I have a Kafka stream and a Hive table that I want to use as a lookup table to enrich the data from Kafka. The Hive table points to Parquet files in S3 and is updated once a day with an INSERT OVERWRITE statement, which means the older files on that S3 path are replaced by newer files once a day.
Every time the Hive table is updated, the newer data from the Hive table is joined with the historical data from Kafka, and this results in older Kafka data getting republished. I see this is the expected behaviour from this link.
I tried to set an idle state retention of 2 days as shown below, but it looks like Flink is not honoring the 2-day idle state retention and seems to be keeping all the Kafka records in table state. I was expecting only the last 2 days of data to be republished when the Hive table is updated. My job has been running for one month, and instead I see records as old as one month still being sent in the output. I think this will make the state grow forever and might result in an out-of-memory exception at some point.
One possible reason for this, I think, is that Flink keeps the state of the Kafka data keyed by the sales_customer_id field, because that is the field used to join with the Hive table, and as soon as another sale comes in for that customer id, the state expiry is extended for another 2 days? I am not sure whether this is the reason, but I wanted to check with a Flink expert on what the possible problem could be here.
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
TableConfig tableConfig = tableEnv.getConfig();
Configuration configuration = tableConfig.getConfiguration();
tableConfig.setIdleStateRetention(Duration.ofHours(24*2));
configuration.setString("table.dynamic-table-options.enabled", "true");
DataStream<Sale> salesDataStream = ....;
Table salesTable = tableEnv.fromDataStream(salesDataStream);
Table customerTable = tableEnv.sqlQuery("select * from my_schema.customers" +
" /*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.partition-order'='create-time') */");
Table resultTable = salesTable.leftOuterJoin(customerTable, $("sales_customer_id").isEqual($("customer_id")));
DataStream<Sale> salesWithCustomerInfoDataStream = tableEnv.toRetractStream(resultTable, Row.class).map(new RowToSaleFunction());
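For comparison, here is a rough sketch of the same enrichment expressed as a processing-time lookup join against the Hive table instead of a regular join. The proc_time attribute, the Hive lookup options, and the availability of fromDataStream(stream, Schema) are assumptions, not part of my current job, and this would also change the semantics: previously seen sales would no longer be re-emitted when the Hive table is refreshed, because the Kafka side is not kept in join state.

import org.apache.flink.table.api.Schema;
import org.apache.flink.table.api.Table;

// Register the sales stream as a view with a processing-time attribute (assumed name: proc_time)
tableEnv.createTemporaryView(
        "sales",
        tableEnv.fromDataStream(
                salesDataStream,
                Schema.newBuilder()
                        .columnByExpression("proc_time", "PROCTIME()")
                        .build()));

// Lookup join against the (assumed) Hive table my_schema.customers; cache TTL is illustrative
Table resultTable = tableEnv.sqlQuery(
        "SELECT s.sales_customer_id, c.customer_id " +   // select the other fields you need here
        "FROM sales AS s " +
        "LEFT JOIN my_schema.customers " +
        "/*+ OPTIONS('streaming-source.enable'='false', 'lookup.join.cache.ttl'='12 h') */ " +
        "FOR SYSTEM_TIME AS OF s.proc_time AS c " +
        "ON s.sales_customer_id = c.customer_id");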
My company is migrating to Snowflake from SQL Server 2017, and I am looking to build historical data tables that capture delta changes. In SQL Server, these would be in stored procedures, where old records get expired (on changes to the data) and a new row with the updated data is inserted. This design allows dynamic retrieval of historical data at any point in time.
My question is: how would I migrate this design to Snowflake? From what I read about procedures, they're more like UDTs or scalar functions (the SQL Server equivalents), but in JavaScript...
Below is a brief example of how we are doing CDC for tables in SQL Server.
Would a data pipeline cover this? If anyone knows a good tutorial site for Snowflake 101 (not the official Snowflake documentation, it's terrible), that would be appreciated.
Thanks
update h
set h.expiration_date = t.effective_date
from data_table_A_history h
join data_table_A as t
on h.account_id = t.account_id
where h.expiration_date is null
and (
(isnull(t.person_name,'x') <> isnull(h.person_name,'x')) or
(isnull(t.person_age,0) <> isnull(h.person_age,0))
)
---------------------------------------------------------------------
insert into data_table_A_history (account_id,person_name,person_age)
select
account_id,person_name,person_age
from
data_table_A t
left join data_table_A_history h
on t.account_id = h.account_id
and h.expiration_date is null
where
h.account_id is null
Table streams are Snowflake's CDC solution
You can set up multiple streams on a single table, and each will track changes to the table from a particular point in time. This point in time moves once you consume the data in the stream, with the new starting point being the time you consumed it. Consumption here means using the data in a DML statement, for example to upsert another table or to insert the data into a log table. Simple SELECT statements do not consume the data.
A pipeline could be something like this: Snowpipe -> staging table -> stream on staging table -> task with SP -> merge/upsert target table.
If you wanted to keep a log of the changes, then you could set up a 2nd stream on the staging table and consume it by inserting the data into another table.
Another trick, if you didn't want to use a 2nd stream, is to amend your SP so that before you consume the data, you run a SELECT on the stream and then immediately run:
INSERT INTO my_table SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
This does not consume the stream or change its offset, and it leaves the stream data available to be consumed by another DML operation.
I am reading at https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/upsert-kafka/.
It says that:
As a sink, the upsert-kafka connector can consume a changelog stream.
It will write INSERT/UPDATE_AFTER data as normal Kafka messages value,
and write DELETE data as Kafka messages with null values (indicate
tombstone for the key).
It doesn't mention what would happen if an UPDATE_BEFORE message is written to upsert-kafka.
In the same link (https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/upsert-kafka/#full-example), the doc provides a full example:
INSERT INTO pageviews_per_region
SELECT
user_region,
COUNT(*),
COUNT(DISTINCT user_id)
FROM pageviews
GROUP BY user_region;
With the above INSERT/SELECT operation, INSERT/UPDATE_BEFORE/UPDATE_AFTER messages will be generated and will go to the upsert-kafka sink. I would like to ask what happens when the upsert-kafka sink meets an UPDATE_BEFORE message.
From the comments in the source code:
// partial code
// In upsert mode, during serialization, if the operation being executed is RowKind.DELETE or RowKind.UPDATE_BEFORE,
// the value is set to NULL (corresponding to a Kafka tombstone)
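In other words, the upsert-mode branch that comment describes behaves roughly like the following. This is a simplified sketch, not the actual Flink serializer code; the method and parameter names are illustrative.

import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.table.data.RowData;
import org.apache.flink.types.RowKind;

// Simplified sketch of the upsert-mode value serialization
static byte[] serializeValue(SerializationSchema<RowData> valueSerialization,
                             RowData row,
                             boolean upsertMode) {
    RowKind kind = row.getRowKind();
    if (upsertMode && (kind == RowKind.DELETE || kind == RowKind.UPDATE_BEFORE)) {
        // A record with a key and a null value is a Kafka tombstone for that key,
        // so both DELETE and UPDATE_BEFORE end up as "value = null" in upsert mode.
        return null;
    }
    return valueSerialization.serialize(row);
}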
https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=165221669#content/view/165221669
Upsert-kafka sink doesn’t require planner to send UPDATE_BEFORE messages (planner may still send UPDATE_BEFORE messages in some cases), and will write INSERT/UPDATE_AFTER messages as normal Kafka records with key parts, and will write DELETE messages as Kafka records with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns.
Upsert-kafka source is a kind of changelog source. The primary key semantics on changelog source means the materialized changelogs (INSERT/UPDATE_BEFORE/UPDATE_AFTER/DELETE) are unique on the primary key constraints. Flink assumes all messages are in order on the primary key.
Implementation Details
The upsert-kafka connector only produces an upsert stream, which doesn't contain UPDATE_BEFORE messages. However, several operations require UPDATE_BEFORE messages for correct processing, e.g. aggregations. Therefore, we need a physical node to materialize the upsert stream and generate a changelog stream with full change messages. In the physical operator, we use state to know whether the key is being seen for the first time. The operator will produce INSERT rows, or additionally generate UPDATE_BEFORE rows for the previous image, or produce DELETE rows with all columns filled with values.
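As an illustration of that materialization step, here is a simplified stand-in for the operator described above: a keyed function that remembers the last image per primary key. This is an assumption-heavy sketch, not Flink's actual implementation.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;
import org.apache.flink.util.Collector;

// Input: an upsert stream (INSERT/UPDATE_AFTER as upserts, DELETE as tombstones), keyed by primary key.
// Output: a full changelog stream including UPDATE_BEFORE rows.
public class ChangelogNormalizeSketch extends KeyedProcessFunction<String, Row, Row> {

    private transient ValueState<Row> lastImage;

    @Override
    public void open(Configuration parameters) {
        lastImage = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-image", Row.class));
    }

    @Override
    public void processElement(Row in, Context ctx, Collector<Row> out) throws Exception {
        Row previous = lastImage.value();
        if (in.getKind() == RowKind.DELETE) {
            if (previous != null) {
                previous.setKind(RowKind.DELETE);        // DELETE with all columns filled from state
                out.collect(previous);
                lastImage.clear();
            }
        } else {
            if (previous != null) {
                previous.setKind(RowKind.UPDATE_BEFORE); // emit the previous image first
                out.collect(previous);
                in.setKind(RowKind.UPDATE_AFTER);
            } else {
                in.setKind(RowKind.INSERT);              // first time this key is seen
            }
            out.collect(in);
            lastImage.update(in);
        }
    }
}

In a real plan Flink inserts a dedicated operator for this automatically; the sketch only illustrates why per-key state is needed to emit the previous image.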
I am running an Apache Flink mini cluster in IntelliJ.
I am trying to set up a stream join where one stream comes from a Kinesis source and the other from JDBC.
When I create a DataStream from the table source like the following:
// Table with two fields (String name, Integer age)
Table table = ...
// convert the Table into an append DataStream of Row by specifying the class
DataStream<Row> dsRow = tableEnv.toAppendStream(table, Row.class);
I am getting the following INFO message in the stack trace:
INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint triggering task Source ... job bcf73c5d7a0312d57c2ca36d338d4569 is not in state RUNNING but FINISHED instead. Aborting checkpoint.
Flink checkpoints cannot happen if any of the job's tasks have run to completion. Perhaps your JDBC source has finished, and this is preventing any further checkpointing?
You can also check your parallelism settings: if your program's parallelism is greater than the source's parallelism, then some source subtasks will finish because they have no data to process, and that will abort checkpoints.
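If upgrading is an option, newer Flink versions (1.14+) can also keep checkpointing after some tasks have finished. A hedged sketch of enabling that for a local/IDE run follows; the option comes from FLIP-147, and the rest of the setup is illustrative.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative local setup (assumes Flink 1.14 or newer)
Configuration conf = new Configuration();
// Allow checkpoints to continue even after some tasks (e.g. a bounded JDBC source) have FINISHED
conf.setBoolean("execution.checkpointing.checkpoints-after-tasks-finish.enabled", true);

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
env.enableCheckpointing(60_000); // checkpoint every 60 seconds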