Flink TableConfig setIdleStateRetention seems to be not working - apache-flink

I have a Kafka stream and a Hive table that I want to use as a lookup table to enrich the data from Kafka. The Hive table points to Parquet files in S3 and is updated once a day with an INSERT OVERWRITE statement, which means the older files in that S3 path are replaced by newer files once a day.
Every time the Hive table is updated, the new data from the Hive table is joined with the historical data from Kafka, and this results in older Kafka data being republished. I see that this is the expected behaviour from this link.
I tried to set an idle state retention of 2 days as shown below, but it looks like Flink is not honoring the 2-day idle state retention and seems to be keeping all the Kafka records in table state. I was expecting only the last 2 days of data to be republished when the Hive table is updated. My job has been running for one month, and instead I see records as old as one month still being sent in the output. I think this will make the state grow forever and might result in an out-of-memory exception at some point.
One possible reason I can think of is that Flink keeps the state of the Kafka data keyed by the sales_customer_id field, because that is the field used to join with the Hive table, and as soon as another sale comes in for that customer id, the state expiry is extended for another 2 days. I am not sure whether this is the reason, but I wanted to check with a Flink expert on what the possible problem could be here.
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
TableConfig tableConfig = tableEnv.getConfig();
Configuration configuration = tableConfig.getConfiguration();
// Expire state that has not been accessed for 2 days
tableConfig.setIdleStateRetention(Duration.ofHours(24 * 2));
// Required for the /*+ OPTIONS(...) */ hint on the Hive table
configuration.setString("table.dynamic-table-options.enabled", "true");

DataStream<Sale> salesDataStream = ....;
Table salesTable = tableEnv.fromDataStream(salesDataStream);

Table customerTable = tableEnv.sqlQuery("select * from my_schema.customers" +
        " /*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.partition-order'='create-time') */");

Table resultTable = salesTable.leftOuterJoin(customerTable, $("sales_customer_id").isEqual($("customer_id")));

DataStream<Sale> salesWithCustomerInfoDataStream =
        tableEnv.toRetractStream(resultTable, Row.class).map(new RowToSaleFunction());

Related

Unforeseeable tombstone messages when joining with Flink SQL

We have a Flink SQL job (Table API) that reads Offers from a Kafka topic (8 partitions) as a source, performs some aggregations with other data sources to calculate the cheapest one and aggregate extra data over that result, and sinks it back to another Kafka topic.
The sink looks like this:
CREATE TABLE cheapest_item_offer (
    `id_offer` VARCHAR(36),
    `id_item` VARCHAR(36),
    `price` DECIMAL(13,2),
    -- ... more offer fields
    PRIMARY KEY (`id_item`) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = '<TOPIC_NAME>',
    'properties.bootstrap.servers' = '<KAFKA_BOOTSTRAP_SERVERS>',
    'properties.group.id' = '<JOBNAME>',
    'sink.buffer-flush.interval' = '1000',
    'sink.buffer-flush.max-rows' = '100',
    'key.format' = 'json',
    'value.format' = 'json'
);
And the upsert looks like this:
INSERT INTO cheapest_item_offer
WITH offers_with_stock_ordered_by_price AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY id_item
               ORDER BY price ASC
           ) AS n_row
    FROM offer
    WHERE quantity > 0
), cheapest_offer AS (
    SELECT offer.*
    FROM offers_with_stock_ordered_by_price offer
    WHERE offer.n_row = 1
)
SELECT id_offer,
       id_item,
       price,
       -- ... item extra fields
FROM cheapest_offer
-- ... extra JOINS here to aggregate more item data
Given this configuration, the job initially ingests the data, calculates it properly and sets the cheapest offer correctly. But after some time passes, events in our data source unexpectedly result in a tombstone (not always, though; sometimes the value is set properly). After checking them, we notice they shouldn't be tombstones, mainly because there is an actual cheapest offer for that item and the related JOIN rows do exist.
The following images illustrate the issue with some Kafka messages:
Data source
This is the data source we ingest the data from. The latest update for a given Item shows that an Offer has some changes.
Data sink
This is the data sink for the same Item. As we can see, the latest update was generated at the same time, because of the data source update, but the resulting value is a tombstone rather than its actual value from the data source.
If we relaunch the job from scratch (ignoring savepoints), the affected Items are fixed on the first run, but the same issue appears again after some time.
Some considerations:
In our data source, each Item can have multiple Offers and can be allocated to different partitions
The Flink job is running with parallelism set to 8 (same as the number of Kafka partitions)
We're using Flink 1.13.2 with upsert-kafka connector in Source & Sink
We're using Kafka 2.8
We believe the issue is in the cheapest offer virtual tables, as the JOINs contain proper data
We're using rocksdb as state.backend
We're struggling to find the reason behind this behavior (we're pretty new to Flink), and we don't know where to focus to fix it. Can anybody help here?
Any suggestion will be highly appreciated!
Apparently it was a bug in Flink SQL on v1.13.2, as noted in the Flink Jira ticket FLINK-25559.
We managed to solve this issue by upgrading to v1.13.6.

Building CDC in Snowflake

My company is migrating to Snowflake from SQL Server 2017, and I am looking to build historical data tables that capture delta changes. In SQL Server, these would be handled in stored procedures, where old records get expired (on a change to the data) and a new row is inserted with the updated data. This design allows dynamic retrieval of historical data at any point in time.
My question is: how would I migrate this design to Snowflake? From what I read about procedures, they're more like UDTs or scalar functions (the SQL Server equivalent), but in JavaScript...
Below is a brief example of how we are doing CDC for tables in SQL Server.
Would a data pipeline cover this? If anyone knows a good tutorial site for Snowflake 101 (not the official Snowflake documentation, it's terrible), it would be appreciated.
Thanks
update h
set h.expiration_date = t.effective_date
from data_table_A_history h
join data_table_A as t
    on h.account_id = t.account_id
where h.expiration_date is null
  and (
      (isnull(t.person_name, 'x') <> isnull(h.person_name, 'x')) or
      (isnull(t.person_age, 0) <> isnull(h.person_age, 0))
  )
---------------------------------------------------------------------
insert into data_table_A_history (account_id, person_name, person_age)
select
    t.account_id, t.person_name, t.person_age
from
    data_table_A t
    left join data_table_A_history h
        on t.account_id = h.account_id
        and h.expiration_date is null
where
    h.account_id is null
Table streams are Snowflake's CDC solution
You can set up multiple streams on a single table, and each one will track changes to the table from a particular point in time. This point in time moves forward once you consume the data in the stream, with the new starting point being the time you consumed the data. Consumption here means using the data in a DML statement, for example to upsert another table or to insert the data into a log table; plain SELECT statements do not consume the data.
A pipeline could be something like this: Snowpipe -> staging table -> stream on staging table -> task with SP -> merge/upsert target table.
If you wanted to keep a log of the changes, then you could set up a 2nd stream on the staging table and consume that by inserting the data into another table.
Another trick, if you don't want to use a 2nd stream, is to amend your SP so that before you consume the data, you run a SELECT on the stream and then immediately run:
INSERT INTO my_table select * from table(result_scan(last_query_id()))
This does not consume the stream or change the offset, and it leaves the stream data available to be consumed by another DML operation.
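To make the pipeline above more concrete, here is a minimal sketch of the stream -> task -> merge part, with hypothetical names (account_staging is the staging table loaded by Snowpipe, data_table_A is the target, etl_wh is the warehouse) and a single MERGE in the task body rather than a stored procedure call:

-- Stream on the staging table; it records changes from this point forward.
create or replace stream account_staging_stream on table account_staging;

-- Task that wakes up every minute but only runs when the stream has new rows.
create or replace task upsert_data_table_A
    warehouse = etl_wh
    schedule = '1 MINUTE'
when system$stream_has_data('ACCOUNT_STAGING_STREAM')
as
    merge into data_table_A t
    using (select * from account_staging_stream where metadata$action = 'INSERT') s
        on t.account_id = s.account_id
    when matched then update set person_name = s.person_name, person_age = s.person_age
    when not matched then insert (account_id, person_name, person_age)
        values (s.account_id, s.person_name, s.person_age);

-- Tasks are created suspended; resume the task to start the schedule.
alter task upsert_data_table_A resume;

Because the MERGE runs inside the task, it consumes the stream, so the stream offset only advances when the merge commits successfully.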

how to Copy from big table to another table in snowflake?

I have a roughly 7 TB table in Snowflake, and I want to copy half of that table to a new table, for example with a country filter. What technique would you recommend? INSERT INTO ... SELECT * FROM TABLE WHERE COUNTRY = 'A', or use Snowpipe to send Parquet files to S3 and then COPY INTO the Snowflake target table?
I tried the first option. Five hours later the process was at 35%. I read a post where someone had to scale the warehouse to an XL instance, and another post where Snowpipe was recommended as the good option. My warehouse is only an XS :(
By the way, I have a cluster key, and the mission is to segment the data by country because of company policy.
The original table contains events from devices that have the app installed, about 30 events per session minute, for example an Uber or Lyft app.
An MV (materialized view) will definitely be more performant than a standard view, but there is an extra cost associated with that, as Snowflake has to keep the MV in sync with the table. It sounds like the table will be changing rapidly, so this cost will be continuous.
Another option is to create a stream on the source table and use a task to merge the stream data into the target table (a sketch follows below). Tasks require a running warehouse, but I've found that an XS warehouse is very capable, so at a minimum you're talking 24 credits per day. Tasks also have a minimum 1-minute interval, so if you need bleeding-edge freshness, that might rule this option out.
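For illustration, a minimal sketch of that stream + task setup, with hypothetical table, column and warehouse names (events_source, events_country_a, xs_wh). Note that a stream only captures changes from the moment it is created, so the existing 7 TB still needs a one-off backfill:

-- One-off backfill of the existing rows (possibly on a larger warehouse);
-- assumes events_country_a has already been created with the same columns.
insert into events_country_a
select event_id, country, event_ts, payload     -- hypothetical columns
from events_source
where country = 'A';

-- From then on, a stream + task keeps the target in sync with new inserts:
create or replace stream events_source_stream on table events_source;

create or replace task copy_country_a
    warehouse = xs_wh
    schedule = '1 MINUTE'
when system$stream_has_data('EVENTS_SOURCE_STREAM')
as
    insert into events_country_a
    select event_id, country, event_ts, payload
    from events_source_stream
    where metadata$action = 'INSERT'
      and country = 'A';

alter task copy_country_a resume;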

Spring Batch application periodically pulling data from DB

I am working on a Spring Batch service that pulls data from a DB on a schedule (e.g. every day at 12pm).
I am using JdbcPagingItemReader to read the data and a scheduler (@Scheduled, provided by Spring) to launch the job. The problem that I have now is: every time the job runs, it just pulls all the data from the beginning and not from the "last read" row.
The data in the DB changes every day (old rows are deleted and new ones are added), and all I have is a timestamp column to track them.
Is there a way to "remember" the last row read in the last execution of the job and read only data later than that row?
Since you need to pull data on a daily basis and your records have a timestamp, you can design your job instances to be based on a given date (i.e. using the date as an identifying job parameter). With this approach, you do not need to "remember" the last processed record. All you need to do is process the records for a given date by using the correct SQL query. For example:
Job instance ID | Date       | Job parameter   | SQL
1               | 2021-03-22 | date=2021-03-22 | Select c1, c2 from table where date = 2021-03-22
2               | 2021-03-23 | date=2021-03-23 | Select c1, c2 from table where date = 2021-03-23
...             | ...        | ...             | ...
With that in place, you can use any cursor-based or paging-based reader to process the records of a given date. If a job instance fails, you can restart it without the risk of interfering with other job instances. The restart can even be done several days after the failure, since the job instance will always process the same data set. Moreover, in case of failure and restart, Spring Batch will reprocess records from the last checkpoint of the previous (failed) run.
Just want to post an update to this question.
So in the end I created two more steps to achieve what I wanted to do initially.
Since I don't have the privilege to modify the table I read the data from, I couldn't use the "process indicator pattern", which requires a column to mark whether a record has been processed or not. Instead, I created another table to store the last-read record's timestamp and use it to parameterise the SQL query (a sketch of the queries involved follows below):
step 0: a tasklet that reads the bookmark from a table and passes it into the job context
step 1: a chunk step that gets the bookmark from the context and uses JdbcPagingItemReader to read the data
step 2: a tasklet that updates the bookmark
But doing this, I have to be very cautious with the bookmark table: if I lose that, I lose everything.
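For reference, a minimal SQL sketch of this bookmark approach, with hypothetical table and column names (job_bookmark, source_table, updated_ts) and a named parameter placeholder for the bookmark value:

-- Bookmark table holding the last-read timestamp per job
create table job_bookmark (
    job_name     varchar(100) primary key,
    last_read_ts timestamp not null
);

-- Step 0: read the bookmark and put it into the job execution context
select last_read_ts from job_bookmark where job_name = 'daily-pull';

-- Step 1: the paging reader's query, parameterised with the bookmark
-- (JdbcPagingItemReader needs a deterministic sort key, hence the ORDER BY)
select id, payload, updated_ts
from source_table
where updated_ts > :lastReadTs
order by updated_ts, id;

-- Step 2: move the bookmark forward once the chunk step has completed
update job_bookmark
set last_read_ts = (select max(updated_ts) from source_table)
where job_name = 'daily-pull';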

What's the best way to save large monthly data backups in SQL?

I work on a program that stores information about network connections across my university, and I have been asked to create a report that shows the status changes of these connections over time. I was thinking about adding another table that holds the current connection information plus the date the data was added, so when the report is run it just grabs the data for that date. However, I'm worried that the report might get slow after a couple of months, as the table would grow by about 50,000 rows every month. Is there a better way to do this? We use Microsoft SQL Server.
It depends on the reason you are holding historical data for facts. If the reason is:
For reporting needs, then you could hold it in the same table by adding two date columns, FromDate and ToDate, which removes the need to join the active and historical data tables later on.
Just for reference, then it makes sense to keep it in a different table, as it may otherwise decrease the performance of the indexes on your active table.
I'll highlight the Slowly Changing Dimension (SCD) type 2 approach that tracks data history by maintaining multiple versions of records and uses either the EndDate or a flag to identify the active record. This method allows tracking any number of historical records as each time a new record is inserted, the older ones are populated with an EndDate.
Step 1: For re-loaded facts, UPDATE IsActive = 0 for the record whose history is to be preserved and populate EndDate with the current date.
merge ActiveTable as T
using DataToBeLoaded as D
    on T.ID = D.ID
    and T.IsActive = 1 -- current active entry
when matched then
    update set T.IsActive = 0,
               T.EndDate = GETDATE();
Step 2: Insert the latest data into the ActiveTable with IsActive = 1 and FromDate as the current date.
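For completeness, a minimal sketch of step 2, reusing the hypothetical ActiveTable/DataToBeLoaded names from the merge above (OtherColumns stands in for the real fact columns):

-- Insert the freshly loaded rows as the new active versions
insert into ActiveTable (ID, OtherColumns, IsActive, FromDate, EndDate)
select D.ID, D.OtherColumns, 1, GETDATE(), NULL
from DataToBeLoaded as D;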
Disclaimer: this SCD type 2 approach could make your data warehouse huge. However, I don't believe it would affect performance much for your scenario.
