Flink table left outer join showing an error - apache-flink

I am doing a left outer join on two tables in Flink, and the code is given below; it throws an exception. An inner join on the same two tables worked fine and I was able to convert the result to a DataStream.
Table table = customerTable
    .leftOuterJoin(contactTable, $("cust_custcode").isEqual($("contact_custcode")))
    .select($("customermessage"), $("contactmessage"));
The exception is : org.apache.flink.table.api.TableException: Table sink 'anonymous_datastream_sink$3' doesn't support consuming update and delete changes which is produced by node Join(joinType=[LeftOuterJoin], where=[(f0 = f00)], select=[f0, f1, f00, f10], leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey])

When executed in streaming mode, some Flink SQL queries produce an output stream that the planner knows will only ever need to INSERT rows into the sink, while other queries produce an output stream that sometimes needs to UPDATE previously emitted results.
Some sinks cannot accept UPDATE streams -- including the one you are using. You'll need to either (1) change your query (e.g., by doing a windowed join), (2) use a different sink (e.g., JDBC can accept updates), or (3) write to the sink in a different format (e.g., a CDC format like Debezium).
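For example, here is a rough sketch of option (2) in Flink SQL, declaring an 'upsert-kafka' sink that can consume the update/delete changes a left outer join produces. Only the column names come from your snippet; the sink table, topic, broker address, and source table names (customers, contacts) are made up for illustration:
CREATE TABLE joined_sink (
    cust_custcode   STRING,
    customermessage STRING,
    contactmessage  STRING,
    PRIMARY KEY (cust_custcode) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'joined-output',
    'properties.bootstrap.servers' = 'localhost:9092',
    'key.format' = 'json',
    'value.format' = 'json'
);
-- the same left outer join, written against the assumed source tables:
INSERT INTO joined_sink
SELECT c.cust_custcode, c.customermessage, ct.contactmessage
FROM customers c
LEFT JOIN contacts ct
    ON c.cust_custcode = ct.contact_custcode;
The upsert semantics of the sink (keyed by cust_custcode) are what let it absorb the retractions produced by the join.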

Related

how to join two data streams along with sliding window function in Flink Table API?

I have two streaming tables from two Kafka topics, and I want to join these streams and perform an aggregate function on the joined data.
The streams need to be joined using a sliding window. On joining and windowing the data, I get the error: Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before.
Below is the code snippet:
SELECT cep.payload['id'], ep.payload['id'],
       ep.event_flink_time,
       ep.rowtime,
       TIMESTAMPDIFF(SECOND, ep.event_flink_time, cep.event_flink_time) AS timediff,
       HOP_START(cep.event_flink_time, INTERVAL '5' MINUTES, INTERVAL '10' MINUTES) AS hop_start,
       HOP_END(cep.event_flink_time, INTERVAL '5' MINUTES, INTERVAL '10' MINUTES) AS hop_end
FROM table1 cep
JOIN table2 ep
  ON cep.payload['id'] = ep.payload['id']
GROUP BY HOP(cep.event_flink_time, INTERVAL '5' MINUTES, INTERVAL '10' MINUTES),
         cep.payload, ep.payload, cep.event_flink_time, ep.event_flink_time, ep.rowtime
I am using an AWS Zeppelin notebook with the Flink SQL Table API.
For streaming data, how can I join the data using the sliding window function? Or should I use a different type of join for streaming data along with window functions?
Here is a ticket for the same error: https://issues.apache.org/jira/browse/FLINK-10211
Based on your SQL, I suggest you split this task into two parts: first do the left join of the two streaming data sources, then create a view from it and run the Group By on that view. In addition, confirm that the type of the event-time attributes is correct.
Streaming SQL relies on time attributes to expire state that is no longer needed to produce results. This is meaningful in the context of specific temporal queries where the timestamps on both the input and output records are advancing -- which happens with queries like windows and interval joins.
A regular join (a join without any temporal constraints) doesn't work this way. Any previously ingested record might be updated at any point in time, and the corresponding output record(s) would then need to be updated. This means that both input streams must be fully materialized in Flink state, and the output stream has no temporal ordering that downstream operations can leverage to do state retention optimization.
Given how this all works, Flink's stream SQL planner can't handle having a window after a regular join -- the regular join can't produce time attributes, and the HOP insists on having them.
One potential solution would be to reformulate the join as an interval join, if that will meet your needs.
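For example, a rough sketch of that reformulation (assuming event_flink_time is a proper event-time attribute on both tables, and picking a 10-minute bound purely for illustration):
SELECT cep.payload['id'] AS cep_id,
       ep.payload['id']  AS ep_id,
       TIMESTAMPDIFF(SECOND, ep.event_flink_time, cep.event_flink_time) AS timediff
FROM table1 cep
JOIN table2 ep
  ON cep.payload['id'] = ep.payload['id']
 AND ep.event_flink_time BETWEEN cep.event_flink_time - INTERVAL '10' MINUTE
                             AND cep.event_flink_time
Because the interval bounds how far apart matching events can be, Flink can expire old state, and the result keeps a time attribute, so a HOP window can then be applied on top of it.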

Building CDC in Snowflake

My company is migrating from SQL Server 2017 to Snowflake, and I am looking to build historical data tables that capture delta changes. In SQL Server, this is done in stored procedures: when data changes, the old record is expired and a new row with the updated data is inserted. This design allows dynamic retrieval of historical data at any point in time.
My question is: how would I migrate this design to Snowflake? From what I read about Snowflake procedures, they're more like UDTs or scalar functions (the SQL Server equivalents), but in JavaScript...
Below is a brief example of how we are doing CDC for tables in SQL Server.
Would a data pipeline cover this? If anyone knows a good tutorial site for Snowflake 101 (not the official Snowflake documentation, it's terrible), that would be appreciated.
thanks
update h
set h.expiration_date = t.effective_date
from data_table_A_history h
join data_table_A as t
  on h.account_id = t.account_id
where h.expiration_date is null
  and (
    (isnull(t.person_name, 'x') <> isnull(h.person_name, 'x')) or
    (isnull(t.person_age, 0) <> isnull(h.person_age, 0))
  )
---------------------------------------------------------------------
insert into data_table_A_history (account_id, person_name, person_age)
select
  t.account_id, t.person_name, t.person_age
from
  data_table_A t
  left join data_table_A_history h
    on t.account_id = h.account_id
    and h.expiration_date is null
where
  h.account_id is null
Table streams are Snowflake's CDC solution.
You can set up multiple streams on a single table, and each stream will track changes to the table from a particular point in time. That point in time advances once you consume the data in the stream, with the new starting point being the time you consumed it. Consumption here means using the data in a DML statement, for example to upsert another table or to insert the rows into a log table. Plain SELECT statements do not consume the stream.
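For example, a minimal sketch (the table and stream names here are made up):
-- create a stream that tracks changes to an assumed staging table:
CREATE OR REPLACE STREAM staging_table_stream ON TABLE staging_table;
-- reading the stream does NOT consume it or move its offset:
SELECT * FROM staging_table_stream;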
A pipeline could be something like this: Snowpipe -> staging table -> stream on staging table -> task with SP -> merge/upsert into target table.
If you wanted to keep a log of the changes, you could set up a 2nd stream on the staging table and consume that by inserting the data into another table.
Another trick, if you didn't want to use a 2nd stream, is to amend your SP so that before you consume the data, you run a SELECT on the stream and then immediately run:
INSERT INTO my_table select * from table(result_scan(last_query_id()))
This does not consume the stream or change its offset, and it leaves the stream data available to be consumed by another DML operation.
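To make the "task with SP -> merge/upsert" step of the pipeline above concrete, here is a hedged sketch of a task that consumes the stream with a simple upsert. The column names are borrowed from your example; the task, warehouse, stream, and table names are made up, and your expiration/history logic would go in the MERGE or in a stored procedure the task calls:
CREATE OR REPLACE TASK apply_staging_changes
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('staging_table_stream')
AS
MERGE INTO target_table t
USING staging_table_stream s
  ON t.account_id = s.account_id
WHEN MATCHED THEN
  UPDATE SET t.person_name = s.person_name, t.person_age = s.person_age
WHEN NOT MATCHED THEN
  INSERT (account_id, person_name, person_age)
  VALUES (s.account_id, s.person_name, s.person_age);
-- newly created tasks are suspended; resume it to start the schedule:
ALTER TASK apply_staging_changes RESUME;
Because the MERGE is a DML statement against the stream, each run of the task consumes the stream and advances its offset.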

How can I get all matching rows from lookup using full cache mode?

I need to make a lookup between two tables T1(A,B,C) and T2(A,B,C,D,E) on column C to get all the column B values that match.
When I choose Full cache mode I get only the first matching row (I'm only interested in the column B values): 12122, but I also need to get 12123 and 12124 because C matches those rows as well.
I've tried the Partial and No cache modes with a custom query using an inner join (which returns all the needed rows when executed in SSMS), but the lookup still doesn't return all rows and it kills performance.
I've also tried the solution proposed here:
How to get unmatched data between two sources in SSIS Data Flow?
It gives the same results as the lookup. Plus, I need to redirect unmatched rows to a new table.
I don't think the cache mode will affect your result; it is a performance setting. The official explanation is:
If there are multiple matches in the reference table, the Lookup transformation returns only the first match returned by the lookup query. If multiple matches are found, the Lookup transformation generates an error or warning only when the transformation has been configured to load all the reference dataset into the cache. In this case, the Lookup transformation generates a warning when the transformation detects multiple matches as the transformation fills the cache.
To get the matching B values from T2, you can just use SQL in the OLE DB Source (SQL command), for example:
SELECT DISTINCT t2.B
FROM T1 AS t1
INNER JOIN T2 AS t2
  ON t2.C = t1.C
If LONG's answer does not address your needs, you'd need to write a Script Transformation operating in asynchronous mode (1 row of input can yield 0 to many output rows).
If the source data/T1 doesn't contain duplicate C values, then the pre-execute phase of the component could cache the B and C columns from T2 into local memory. Then, for each source row that flows through, you'd loop through the cached results and append the B values into the data flow.
This gets trickier if T1 can have duplicate-ish data, as you'd need to query the target table for each row that flows through - but you'd also have to track the B/C values that have already rolled through, as you might need to reference those Bs as well.
You can also evaluate a Merge Join, as I think that allows multiple rows to be emitted, but I'm guessing you'll have more control over performance with a script transformation.
Either way, when you pull the T2 table in, write a custom query and only select the columns you need (B and C).
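If you go the Merge Join route, a rough sketch of the two source queries (selecting only what's needed and pre-sorting on the join key, per the note above):
-- T1 side, sorted on the join key C:
SELECT A, B, C
FROM T1
ORDER BY C;
-- T2 side, only the columns you need (B and C), also sorted on C:
SELECT B, C
FROM T2
ORDER BY C;
In the Advanced Editor of each source you'd then mark the output as sorted (IsSorted = True, SortKeyPosition = 1 on C) so the Merge Join will accept the inputs.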

how to compare (1 billion records) data between two kafka streams or Database tables

We are sending data from DB2 (table-1) via CDC to Kafka topics (topic-1).
We need to do reconciliation between the DB2 data and the Kafka topics.
We have two options:
a) Bring all the Kafka topic data down into DB2 (as table-1-copy) and then do a left outer join (between table-1 and table-1-copy) to find the non-matching records, create the delta, and push it back into Kafka.
Problem: scalability - our data set is about a billion records and I am not sure the DB2 DBA is going to let us run such a huge join operation (which may easily last over 15-20 mins).
b) Push DB2 back again into a parallel Kafka topic (topic-1-copy) and then use a Kafka Streams based solution to do a left outer join between topic-1 and topic-1-copy. I am still wrapping my head around Kafka Streams and left outer joins.
I am not sure whether (using the windowing system in Kafka Streams) I will be able to compare the ENTIRE contents of topic-1 with topic-1-copy.
To make matters worse, topic-1 in Kafka is a compacted topic,
so when we push the data from DB2 back into Kafka topic-1-copy, we cannot deterministically kick off the topic-compaction cycle to make sure both topic-1 and topic-1-copy are fully compacted before running any sort of compare operation on them.
c) Is there any other framework option that we can consider for this?
The ideal solution has to scale for any size of data.
I see no reason why you couldn't do this in either Kafka Streams or KSQL. Both support table-table joins. That's assuming the format of the data is supported.
Key compaction won't affect the results, as both Streams and KSQL will build the correct final state when joining the two tables. If compaction has run, the amount of data that needs processing may be less, but the result will be the same.
For example, in ksqlDB you could import both topics as tables and perform a join and then filter by the topic-1 table being null to find the list of missing rows.
-- example using 0.9 ksqlDB, assuming a INT primary key:
-- create table from main topic:
CREATE TABLE TABLE_1
(ROWKEY INT PRIMARY KEY, <other column defs>)
WITH (kafka_topic='topic-1', value_format='?');
-- create table from second topic:
CREATE TABLE TABLE_2
(ROWKEY INT PRIMARY KEY, <other column defs>)
WITH (kafka_topic='topic-1-copy', value_format='?');
-- create a table containing only the missing keys:
CREATE TABLE MISSING AS
SELECT T2.* FROM TABLE_2 T2
LEFT JOIN TABLE_1 T1 ON T1.ROWKEY = T2.ROWKEY
WHERE T1.ROWKEY IS NULL;
The benefit of this approach is that the MISSING table of missing rows would update automatically: as you extract the missing rows from your source DB2 instance and produce them to topic-1, the corresponding rows in the MISSING table would be deleted, i.e. you'd see tombstones being produced to the MISSING topic.
You can even extend this approach to find rows that exist in topic-1 that are no longer in the source db:
-- using the same DDL statements for TABLE_1 and TABLE_2 from above
-- perform the join:
CREATE TABLE JOINED AS
SELECT * FROM TABLE_2 T2
FULL OUTER JOIN TABLE_1 T1 ON T1.ROWKEY = T2.ROWKEY;
-- detect rows in the DB that aren't in the topic:
CREATE TABLE MISSING AS
SELECT * FROM JOINED
WHERE T1_ROWKEY IS NULL;
-- detect rows in the topic that aren't in the DB:
CREATE TABLE EXTRA AS
SELECT * FROM JOINED
WHERE T2_ROWKEY IS NULL;
Of course, you'll need to size your cluster accordingly. The bigger your ksqlDB cluster the quicker it will process the data. It'll also need on-disk capacity to materialize the table.
The maximum amount of parallelization you can get is set by the number of partitions on the topics. If you have only 1 partition, the data will be processed sequentially. If running with 100 partitions, then you can process the data using 100 CPU cores, assuming you run enough ksqlDB instances. (By default, each ksqlDB node will create 4 stream-processing threads per query, though you can increase this if the server has more cores.)

how to tell SSIS not to hold data while being joined

I am asking how to do something in SSIS that is a feature in DataStage.
I am looking at an SSIS job where, if I am going to perform a join or lookup, SSIS tries to "memorize" the entire datasets prior to the join. My datasets are too large for SSIS to "memorize", and this causes memory overloads.
In DataStage, I can avoid this by putting sort stages in front of the join stage; the join stage takes advantage of this by using a "sorted join", where the entire dataset isn't held in memory but is joined immediately and sent on to the next stage while the join is in progress, saving memory. The sort stage also allows me to sort in the source connector and just "say it's sorted". Either way, the datasets are not held until fully memorized; they get passed on as the join happens.
How do I accomplish this in SSIS? Thank you.
Well, from what I understood, you do not want SSIS to store the data in memory because the dataset is too large and it's causing an error, right? In the Lookup transformation you can select how you want SSIS to handle your data with the cache mode (I worked with this in BIDS 2008). Basically you have 3 options:
Full Cache: the database is queried and the data is "memorized" BEFORE any transformation or insert is done.
Partial Cache: uses a partial cache, and if no match is found in the cache, queries the database.
No Cache: it does not maintain a lookup cache, so it queries the database for every row processed.
You can find more detailed information about Lookup cache mode here
Hope it was what you were looking for
Instead of a lookup, you should be using a Merge Join transformation.
The Merge Join is partially blocking, meaning the incoming rows need to be sorted, and the output is only held up until either of the incoming keys moves to a new value.
This article explains how the Merge Join works way more exhaustively than I will. If this link ever goes dead, just google "SSIS Merge Join Blocking".
But what you need to know is that your source components need to be sorted on the keys that you will join them on. Then the Merge Join will only "memorize" as much data as it needs for each possible JOIN combination, and then it will release those rows to the rest of the data flow while it processes the next combination.
In other words, it does exactly what you are asking for.
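As a rough illustration (the table and column names below are hypothetical), sorting at the source looks like this; you then tell SSIS the output is already sorted via the Advanced Editor (IsSorted = True on the output, SortKeyPosition = 1 on the key column), which is the SSIS equivalent of DataStage's "say it's sorted":
-- left source, sorted on the join key:
SELECT customer_id, order_id, amount
FROM dbo.orders
ORDER BY customer_id;
-- right source, sorted on the same key:
SELECT customer_id, customer_name
FROM dbo.customers
ORDER BY customer_id;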
