Unforeseeable tombstone messages when joining with Flink SQL

We have a Flink SQL job (Table API) that reads Offers from a Kafka topic (8 partitions) as its source, joins them with other data sources to calculate the cheapest offer per item and aggregate extra data over that result, and sinks the result back to another Kafka topic.
The sink looks like this:
CREATE TABLE cheapest_item_offer (
  `id_offer` VARCHAR(36),
  `id_item` VARCHAR(36),
  `price` DECIMAL(13,2),
  -- ... more offer fields
  PRIMARY KEY (`id_item`) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = '<TOPIC_NAME>',
  'properties.bootstrap.servers' = '<KAFKA_BOOTSTRAP_SERVERS>',
  'properties.group.id' = '<JOBNAME>',
  'sink.buffer-flush.interval' = '1000',
  'sink.buffer-flush.max-rows' = '100',
  'key.format' = 'json',
  'value.format' = 'json'
);
And the upsert looks like this:
INSERT INTO cheapest_item_offer
WITH offers_with_stock_ordered_by_price AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY id_item
      ORDER BY price ASC
    ) AS n_row
  FROM offer
  WHERE quantity > 0
), cheapest_offer AS (
  SELECT offer.*
  FROM offers_with_stock_ordered_by_price offer
  WHERE offer.n_row = 1
)
SELECT id_offer,
  id_item,
  price,
  -- ... item extra fields
FROM cheapest_offer
-- ... extra JOINS here to aggregate more item data
Given this configuration, the job initially ingests the data, calculates it properly, and sets the cheapest offer right. But after some time passes, some events arriving from our data source unexpectedly result in a tombstone (not always, though; sometimes the value is set properly). After checking them, we see they shouldn't be tombstones, mainly because there is an actual cheapest offer for that item and the related JOIN rows do exist.
The following images illustrate the issue with some Kafka messages:
Data source
This is the data source we ingest the data from. The latest update for a given Item shows that one of its Offers has changed.
Data Sink
This is the data sink for the same Item. As we can see, the latest update was generated at the same time, triggered by the data source update, but the resulting value is a tombstone rather than the actual value from the data source.
If we relaunch the job from scratch (ignoring savepoints), the affected Items are fixed on the first run, but the same issue appears again after some time.
Some considerations:
In our data source, each Item can have multiple Offers, which may be located in different partitions
The Flink job is running with parallelism set to 8 (same as the number of Kafka partitions)
We're using Flink 1.13.2 with the upsert-kafka connector for both source and sink
We're using Kafka 2.8
We believe the issue is in the cheapest-offer virtual tables, as the JOINs contain proper data
We're using RocksDB as the state backend
We're struggling to find the reason behind this behavior (we're pretty new to Flink), and we don't know where to focus to fix it. Can anybody help here?
Any suggestion will be highly appreciated!

Apparently it was a bug in Flink SQL v1.13.2, as noted in Flink's Jira ticket FLINK-25559.
We managed to solve the issue by upgrading to v1.13.6.

Related

Can I perform transformations using Snowflake Streams?

Currently I have a Snowflake table being updated from a Kafka connector in near real time. I want to be able to take these new data entries, also in near real time, through something such as Snowflake CDC / Snowflake Streams and append some additional fields. Some of these will track max values within a certain time period (probably a window function) and others will pull values from static tables based on where static_table.id = realtime_table.id.
The final goal is to perform these transformations and transfer the results to a new presentation-level table, so I have both a source table and a presentation-level table with little latency between the two.
Is this possible with Snowflake Streams? Or is there a combination of tools Snowflake offers that can be used to achieve this goal? Due to a number of outside constraints it is important that this can be done within the Snowflake infrastructure.
Any help would be much appreciated :).
I have considered the use of a materialised view, but am concerned regarding costs / latency.
The goal of Streams - together with Tasks - is to do exactly the kind of transformations you are asking for.
This quickstart will help you start growing your Streams and Tasks abilities:
https://quickstarts.snowflake.com/guide/getting_started_with_streams_and_tasks/
On the 6th step you can see a task that would transform the data as it arrives:
create or replace task REFINE_TASK
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  SCHEDULE = '4 minute'
  COMMENT = '2. ELT Process New Transactions in Landing/Staging Table into a more Normalized/Refined Table (flattens JSON payloads)'
when
  SYSTEM$STREAM_HAS_DATA('CC_TRANS_STAGING_VIEW_STREAM')
as
  insert into CC_TRANS_ALL (
    select card_id, merchant_id, transaction_id, amount, currency, approved, type, timestamp
    from CC_TRANS_STAGING_VIEW_STREAM
  );
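One usage note: a newly created task is suspended by default, so after creating it you still have to resume it before it starts running on its schedule:
ALTER TASK REFINE_TASK RESUME;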

Flink TableConfig setIdleStateRetention seems not to be working

I have a Kafka stream and a Hive table that I want to use as a lookup table to enrich the data from Kafka. The Hive table points to Parquet files in S3 and is updated once a day with an INSERT OVERWRITE statement, which means older files under that S3 path are replaced by newer files once a day.
Every time the Hive table is updated, the newer data from the Hive table is joined with the historical data from Kafka, and this results in older Kafka data getting republished. I see this is the expected behaviour from this link.
I tried to set an idle state retention of 2 days as shown below, but it looks like Flink is not honoring the 2-day idle state retention and seems to be keeping all the Kafka records in table state. I was expecting only the last 2 days of data to be republished when the Hive table is updated. My job has been running for one month, and instead I see records as old as one month still being sent to the output. I think this will make the state grow forever and might result in an out-of-memory exception at some point.
One possible reason for this, I think, is that Flink keeps the state of the Kafka data keyed by the sales_customer_id field, because that is the field used to join with the Hive table, and as soon as another sale comes in for that customer id the state expiry is extended by another 2 days? I am not sure whether this is the reason, but I wanted to check with a Flink expert on what the possible problem could be here.
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
TableConfig tableConfig = tableEnv.getConfig();
// Keep idle state for at most 2 days after a key was last updated
tableConfig.setIdleStateRetention(Duration.ofHours(24 * 2));
Configuration configuration = tableConfig.getConfiguration();
// Allow the /*+ OPTIONS(...) */ hint used in the query below
configuration.setString("table.dynamic-table-options.enabled", "true");
DataStream<Sale> salesDataStream = ....;
Table salesTable = tableEnv.fromDataStream(salesDataStream);
Table customerTable = tableEnv.sqlQuery("select * from my_schema.customers" +
    " /*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.partition-order'='create-time') */");
Table resultTable = salesTable.leftOuterJoin(customerTable,
    $("sales_customer_id").isEqual($("customer_id")));
DataStream<Sale> salesWithCustomerInfoDataStream =
    tableEnv.toRetractStream(resultTable, Row.class).map(new RowToSaleFunction());

Building CDC in Snowflake

My company is migrating from SQL Server 2017 to Snowflake, and I am looking to build historical data tables that capture delta changes. In SQL Server these live in stored procedures, where old records get expired (on a change to the data) and a new row is inserted with the updated data. This design allows dynamic retrieval of historical data at any point in time.
My question is, how would I migrate this design to Snowflake? From what I read about procedures, they're more like UDTs or scalar functions (the SQL Server equivalents), but in JavaScript...
Below is a brief example of how we are doing CDC for tables in SQL Server.
Would a data pipeline cover this? If anyone knows a good tutorial site for Snowflake 101 (not the official Snowflake documentation, it's terrible), that would be appreciated.
Thanks
update h
set h.expiration_date = t.effective_date
from data_table_A_history h
join data_table_A as t
  on h.account_id = t.account_id
where h.expiration_date is null
  and (
    (isnull(t.person_name,'x') <> isnull(h.person_name,'x')) or
    (isnull(t.person_age,0) <> isnull(h.person_age,0))
  )
---------------------------------------------------------------------
insert into data_table_A_history (account_id, person_name, person_age)
select
  account_id, person_name, person_age
from
  data_table_A t
left join data_table_A_history h
  on t.account_id = h.account_id
  and h.expiration_date is null
where
  h.account_id is null
Table streams are Snowflake's CDC solution.
You can set up multiple streams on a single table, and each one will track changes to the table from a particular point in time. This point in time advances once you consume the data in the stream, with the new starting point being the time you consumed the data. Consumption here means using the data in a DML statement, for example to upsert another table or to insert the data into a log table. Plain SELECT statements do not consume the data.
A pipeline could be something like this: Snowpipe -> staging table -> stream on staging table -> task with a stored procedure -> merge/upsert into the target table.
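As a rough sketch of that pipeline (the object names, warehouse, and schedule below are hypothetical, the columns are borrowed from the question's example, and the MERGE is inlined in the task rather than wrapped in a stored procedure):
-- track changes landing in the staging table:
create or replace stream staging_stream on table staging_table;
-- task that only runs when the stream has data, and upserts it into the target table:
create or replace task upsert_target_task
  warehouse = my_wh
  schedule = '5 minute'
when
  SYSTEM$STREAM_HAS_DATA('STAGING_STREAM')
as
  merge into target_table t
  using staging_stream s
    on t.account_id = s.account_id
  when matched then update set t.person_name = s.person_name, t.person_age = s.person_age
  when not matched then insert (account_id, person_name, person_age)
    values (s.account_id, s.person_name, s.person_age);
The MERGE is the DML that consumes the stream and advances its offset.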
If you wanted to keep a log of the changes, you could set up a second stream on the staging table and consume it by inserting the data into another table.
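A minimal sketch of that second stream, again with hypothetical names (the log table is assumed to have matching columns):
create or replace stream staging_stream_for_log on table staging_table;
-- inserting from the stream is a DML operation, so it consumes this stream and advances its offset:
insert into change_log_table
  select account_id, person_name, person_age, METADATA$ACTION, current_timestamp()
  from staging_stream_for_log;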
Another trick, if you didn't want to use a second stream, is to amend your stored procedure so that before you consume the data you run a SELECT on the stream and then immediately run:
INSERT INTO my_table SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
This does not consume the stream or change its offset, and it leaves the stream data available to be consumed by another DML operation.
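Put together, the sequence inside the procedure would look roughly like this (my_stream, target_table, and the columns are hypothetical; my_table is the log table from above):
-- 1. read the stream; a plain SELECT does not advance its offset:
SELECT * FROM my_stream;
-- 2. log the same rows without touching the stream, via the previous query's result:
INSERT INTO my_table SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
-- 3. consume the stream with the "real" DML, e.g. a MERGE into the target table:
MERGE INTO target_table t
USING my_stream s ON t.account_id = s.account_id
WHEN MATCHED THEN UPDATE SET t.person_name = s.person_name
WHEN NOT MATCHED THEN INSERT (account_id, person_name) VALUES (s.account_id, s.person_name);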

Perform multiple inserts per POST request

We have a scenario where each insert happens per id_2 for a given id_1, with the below schema, in Cassandra:
CREATE TABLE IF NOT EXISTS my_table (
  id_1 UUID,
  id_2 UUID,
  textDetails TEXT,
  PRIMARY KEY (id_1, id_2)
);
A single POST request body has the details for multiple values of id_2. This triggers multiple inserts per POST request on a single table.
Each INSERT query is performed as shown below:
insertQueryString = "INSERT INTO my_table (id_1, id_2, textDetails) " + "VALUES (?, ?, ?) IF NOT EXISTS"
cassandra.Session.Query(insertQueryString,
    id1,
    id2,
    myTextDetails).Exec();
1. Does Cassandra ensure data consistency for multiple inserts into a single table per POST request? Each POST request is processed in a goroutine (thread), and subsequent GET requests should retrieve consistent data (i.e. what was inserted through the POST).
Using BATCH statements gives us "Batch too large" issues in staging & production. https://github.com/RBMHTechnology/eventuate/issues/166
2. We have two data centres (for Cassandra), with 3 replica nodes per data centre.
What consistency levels need to be set for the write operation (POST request) and the read operation (GET request) to ensure full consistency?
There are multiple problems here:
Batching should be used very carefully in Cassandra - only if you're inserting data into the same partition. If you insert data into multiple partitions, it's better to use separate queries executed in parallel (but you can collect multiple entries per partition key and batch them).
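For example, the rows from one POST request that share the same id_1 (and therefore the same partition) could be grouped into a single batch; a minimal sketch against the question's table, with bind markers as in the question's prepared statement:
-- all statements target the same id_1, so this is a single-partition batch applied as one mutation
BEGIN UNLOGGED BATCH
  INSERT INTO my_table (id_1, id_2, textDetails) VALUES (?, ?, ?);
  INSERT INTO my_table (id_1, id_2, textDetails) VALUES (?, ?, ?);
APPLY BATCH;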
You're using IF NOT EXISTS, and it's done against the same partition - as a result it leads to conflicts between multiple nodes (see the documentation on lightweight transactions), plus it requires reading data from disk, so it heavily increases the load on the nodes. But do you really need to insert data only if the row doesn't exist? What is the problem if the row already exists? It's easier to just overwrite data in Cassandra with a plain INSERT, because that won't require reading data from disk.
Regarding consistency levels - QUORUM (or SERIAL for LWTs) will give you strong consistency, but at the expense of increased latency (because you need to wait for an answer from the other DC) and of fault tolerance - if you lose the other DC, all your queries will fail. In most cases LOCAL_QUORUM (LOCAL_SERIAL in the case of LWTs) is enough, and it provides fault tolerance. I recommend reading this whitepaper on best practices for building fault-tolerant applications on top of Cassandra.

How to compare data (1 billion records) between two Kafka streams or database tables

We are sending data from DB2 (table-1) via CDC to a Kafka topic (topic-1).
We need to reconcile the DB2 data with the Kafka topic.
We have two options:
a) Bring all the Kafka topic data down into DB2 (as table-1-copy) and then do a left outer join (between table-1 and table-1-copy) to find the non-matching records, create the delta, and push it back into Kafka.
Problem: scalability - our data set is about a billion records, and I am not sure the DB2 DBA is going to let us run such a huge join operation (it may easily last over 15-20 minutes).
b) Push the DB2 data back into a parallel Kafka topic (topic-1-copy) and then use a Kafka Streams based solution to do a left outer join between topic-1 and topic-1-copy. I am still wrapping my head around Kafka Streams and left outer joins.
I am not sure whether (using the windowing system in Kafka Streams) I will be able to compare the ENTIRE contents of topic-1 with topic-1-copy.
To make matters worse, topic-1 in Kafka is a compacted topic, so when we push the data from DB2 back into Kafka topic-1-copy, we cannot deterministically kick off the topic-compaction cycle to make sure both topic-1 and topic-1-copy are fully compacted before running any sort of comparison on them.
c) Is there any other framework option that we can consider for this?
The ideal solution has to scale to any data size.
I see no reason why you couldn't do this in either Kafka Streams or KSQL. Both support table-table joins. That's assuming the format of the data is supported.
Key compaction won't affect the results, as both Streams and KSQL will build the correct final state of joining the two tables. If compaction has run, the amount of data that needs processing may be less, but the result will be the same.
For example, in ksqlDB you could import both topics as tables, perform a join, and then filter on the topic-1 side being null to find the list of missing rows.
-- example using 0.9 ksqlDB, assuming an INT primary key:
-- create a table from the main topic:
CREATE TABLE TABLE_1
  (ROWKEY INT PRIMARY KEY, <other column defs>)
  WITH (kafka_topic='topic-1', value_format='?');
-- create a table from the second topic:
CREATE TABLE TABLE_2
  (ROWKEY INT PRIMARY KEY, <other column defs>)
  WITH (kafka_topic='topic-1-copy', value_format='?');
-- create a table containing only the missing keys:
CREATE TABLE MISSING AS
  SELECT T2.* FROM TABLE_2 T2 LEFT JOIN TABLE_1 T1 ON T2.ROWKEY = T1.ROWKEY
  WHERE T1.ROWKEY IS NULL;
The benefit of this approach is that the MISSING table of missing rows would update automatically: as you extract the missing rows from your source DB2 instance and produce them to topic-1, the corresponding rows in the MISSING table are deleted, i.e. you'd see tombstones being produced to the MISSING topic.
You can even extend this approach to find rows that exist in topic-1 but are no longer in the source DB:
-- using the same DDL statements for TABLE_1 and TABLE_2 as above
-- perform the join:
CREATE TABLE JOINED AS
  SELECT * FROM TABLE_2 T2 FULL OUTER JOIN TABLE_1 T1 ON T2.ROWKEY = T1.ROWKEY;
-- detect rows in the DB that aren't in the topic:
CREATE TABLE MISSING AS
  SELECT * FROM JOINED
  WHERE T1_ROWKEY IS NULL;
-- detect rows in the topic that aren't in the DB:
CREATE TABLE EXTRA AS
  SELECT * FROM JOINED
  WHERE T2_ROWKEY IS NULL;
Of course, you'll need to size your cluster accordingly. The bigger your ksqlDB cluster the quicker it will process the data. It'll also need on-disk capacity to materialize the table.
The maximum amount of parallelization you can get is set by the number of partitions on the topics. If you have only 1 partition, the data will be processed sequentially. If you run with 100 partitions, you can process the data using 100 CPU cores, assuming you run enough ksqlDB instances. (By default, each ksqlDB node creates 4 stream-processing threads per query, though you can increase this if the server has more cores!)
