Can I deduplicate records in union of two Flink SQL tables? - apache-flink

What I want to achieve is to run a Flink SQL query in streaming mode (simple aggregations like count, min, max), but I also need to load archival data to get valid aggregates. To this end I created a Flink table for a Kafka topic and another for a table in a relational database. I then union both tables and apply the deduplication pattern, since some records from the Kafka topic may have already been saved to the relational database.
create table `kafka_catalog`.`default_database`.`dummy` (
recordId STRING,
recordTime TIMESTAMP(3),
proctime AS PROCTIME()
) with (
'connector' = 'kafka',
...
);
create table `jdbc_catalog`.`default_database`.`dummy` (
recordId STRING,
recordTime TIMESTAMP(3),
proctime AS PROCTIME()
) with (
'connector' = 'jdbc',
...
);
create view `dummy_union` as
select * from `kafka_catalog`.`default_database`.`dummy`
union
select * from `jdbc_catalog`.`default_database`.`dummy`
;
create view `dummy_full_history` as
select
*
from (
select
*,
row_number() over (partition by recordId order by proctime asc) as row_num
from
dummy_union
)
where
row_num = 1
;
select * from dummy_full_history;
Unfortunately, according to the query plan, the deduplication optimisation is not applied. Instead, Calc -> GroupAggregate -> Rank -> Calc is applied:
[69]:TableSourceScan(table=[[kafka_catalog, default_database, dummy]], fields=[recordId, recordTime])
+- [70]:Calc(select=[recordId, recordTime, PROCTIME() AS proctime])
[71]:TableSourceScan(table=[[jdbc_catalog, default_database, dummy]], fields=[recordId, recordTime])
+- [72]:Calc(select=[recordId, recordTime, PROCTIME() AS proctime])
[74]:Calc(select=[recordId, recordTime, PROCTIME_MATERIALIZE(proctime) AS proctime])
[76]:GroupAggregate(groupBy=[recordId, proctime], select=[recordId, proctime])
[78]:Rank(strategy=[UpdateFastStrategy[0,1]], rankType=[ROW_NUMBER], rankRange=[rankStart=1, rankEnd=1], partitionBy=[recordId], orderBy=[proctime ASC], select=[recordId, proctime])
+- [79]:Calc(select=[recordId, CAST(proctime AS VARCHAR(2147483647)) AS proctime, 1 AS row_num])
+- [80]:ConstraintEnforcer[NotNullEnforcer(fields=[proctime, row_num])]
+- Sink: Collect table sink
When I apply deduplication on a single table, it works like a charm:
[4]:Deduplicate(keep=[FirstRow], key=[recordId], order=[PROCTIME])
Any suggestions on how to make it work?

You should use union all. A plain union removes duplicates, which the planner implements as a GroupAggregate on top of the union; that extra aggregation prevents it from recognising the ROW_NUMBER query as the Deduplicate pattern. union all keeps all rows and leaves the pattern intact.
create view `dummy_union` as
select * from `kafka_catalog`.`default_database`.`dummy`
union all
select * from `jdbc_catalog`.`default_database`.`dummy`
;
According to the docs:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/set-ops/
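The difference is easy to see with plain set semantics. Below is a small illustration using Python's sqlite3 (not Flink; the tables and values are made up), showing that UNION already deduplicates the combined rows while UNION ALL preserves every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kafka_dummy (recordId TEXT, recordTime TEXT);
    CREATE TABLE jdbc_dummy  (recordId TEXT, recordTime TEXT);
    -- the same record arrives on the topic and is already archived in the database
    INSERT INTO kafka_dummy VALUES ('r1', '2023-01-01 10:00:00');
    INSERT INTO jdbc_dummy  VALUES ('r1', '2023-01-01 10:00:00');
""")

# UNION deduplicates -- in Flink this shows up as an extra GroupAggregate
union_rows = conn.execute(
    "SELECT * FROM kafka_dummy UNION SELECT * FROM jdbc_dummy").fetchall()

# UNION ALL keeps every row, leaving deduplication to the ROW_NUMBER pattern
union_all_rows = conn.execute(
    "SELECT * FROM kafka_dummy UNION ALL SELECT * FROM jdbc_dummy").fetchall()

print(len(union_rows), len(union_all_rows))  # 1 2
```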

Related

How to recreate old snapshot using field history table in Bigquery

I'm currently working on an interesting problem: recreating the state of a table as it was on a given past date. I have two tables:
Table A: consists of live data, refreshed on an hourly basis.
Table A_field_history: consists of changes made to the fields in Table A.
The following image shows the current state, where Table A has live, updated data and Table A_field_history only captures changes made to fields in Table A.
I am trying to recreate Table A as of a particular date. The following image shows the table state as it was on 06/30/2020.
The requirement is to have capability to recreate state of Table A based on any given date.
I actually identified a way to roll back (virtually, not on the actual table) all the updates made after a given date. These are the steps I followed:
Create dummy tables:
WITH
Table_A AS
(
SELECT 1 As ID, '2020-6-28' as created_date, 10 as qty, 100 as value
Union ALL
SELECT 2 As ID, '2020-5-29' as created_date, 20 as qty, 200 as value),
Table_A_field_history AS
(
SELECT 'xyz' id,'2020-07-29' created_date,'12345' created_by,'qty' field,'10' new_value,'200' old_value,'1' A_id
UNION ALL
SELECT 'abc' id,'2020-07-24' created_date,'12345' created_by,'qty' field,'20' new_value,'10' old_value,'2' A_id
UNION ALL
SELECT 'xyz' id,'2020-07-29' created_date,'12345' created_by,'value' field,'100' new_value,'2000' old_value,'1' A_id
UNION ALL
SELECT 'abc' id,'2020-07-24' created_date,'12345' created_by,'value' field,'200' new_value,'5000' old_value,'2' A_id
UNION ALL
SELECT 'xyz' id,'2020-06-29' created_date,'12345' created_by,'qty' field,'200' new_value,'' old_value,'1' A_id
UNION ALL
SELECT 'abc' id,'2020-05-30' created_date,'12345' created_by,'qty' field,'10' new_value,'' old_value,'2' A_id
UNION ALL
SELECT 'xyz' id,'2020-06-29' created_date,'12345' created_by,'value' field,'2000' new_value,'' old_value,'1' A_id
UNION ALL
SELECT 'abc' id,'2020-05-30' created_date,'12345' created_by,'value' field,'5000' new_value,'' old_value,'2' A_id
),
Step 1. Create a date CTE to filter data based on the given date:
date_spine
AS
(
SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2020-01-01', CURRENT_DATE(), INTERVAL 1 Day)) AS as_of_date
),
Step 2. The date CTE above can be used as a spine for our query: cross join it to map each as_of_date to all the changes in the history table.
date_changes
AS
(
SELECT DISTINCT
date.as_of_date,
hist.A_id
FROM Table_A_field_history hist CROSS JOIN date_spine date
),
Step 3. With as_of_date mapped to all historical transactions, we can get the max change date per as_of_date.
most_recent_changes AS (
SELECT
dc.as_of_date,
dc.A_id ,
MAX(fh.created_date) AS created_date
FROM date_changes dc
LEFT JOIN Table_A_field_history AS fh
ON dc.A_id = fh.A_id
WHERE CAST(fh.created_date AS DATE) <= dc.as_of_date
GROUP BY dc.as_of_date,
dc.A_id
),
Step 4. Now map the max change date back onto the history table via created_date:
past_changes AS (
SELECT
mr.as_of_date,
mr.A_id,
mr.created_date,
a.id AS entry_id,
a.created_by AS created_by_id,
CASE WHEN a.field='qty' THEN a.new_value ELSE '' END AS qty,
CASE WHEN a.field='value' THEN a.new_value ELSE '' END AS value
FROM most_recent_changes AS mr
LEFT JOIN Table_A_field_history AS a
ON mr.A_id = a.A_id
AND mr.created_date = a.created_date
WHERE a.id IS NOT NULL
)
Step 5. Now we can use as_of_date to get the historical state of Table A:
Select *
From past_changes x
WHERE x.as_of_date = '2020-07-29'
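The steps above can be condensed into a runnable sketch. This uses SQLite (via Python's sqlite3) rather than BigQuery, so the date-spine/UNNEST machinery is replaced by a ROW_NUMBER() window that picks the latest change per (A_id, field) up to the requested date; the table and column names follow the example above, and the sample rows are a subset of it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE field_history (A_id TEXT, field TEXT, new_value TEXT, created_date TEXT);
    INSERT INTO field_history VALUES
        ('1', 'qty',   '200',  '2020-06-29'),
        ('1', 'qty',   '10',   '2020-07-29'),
        ('1', 'value', '2000', '2020-06-29'),
        ('1', 'value', '100',  '2020-07-29');
""")

def state_as_of(as_of_date):
    # For each (A_id, field), the latest change on or before as_of_date wins.
    return conn.execute("""
        SELECT A_id, field, new_value
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY A_id, field
                                      ORDER BY created_date DESC) AS rn
            FROM field_history
            WHERE created_date <= ?
        )
        WHERE rn = 1
        ORDER BY field
    """, (as_of_date,)).fetchall()

print(state_as_of('2020-06-30'))  # qty=200, value=2000 -- the 06-29 values
print(state_as_of('2020-07-30'))  # qty=10,  value=100  -- after the 07-29 update
```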

SQL Server 2008 ROW_NUMBER() order by Clustered PK slow

I have a simple query below that joins 3 of my tables. Because the OFFSET and FETCH statements are not available in SQL Server 2008, I implemented ROW_NUMBER() in one of my paginated order reports.
SELECT * FROM
(
SELECT
ROW_NUMBER() OVER ( ORDER BY OrderProductDetail.ID ) AS RowNum,
*
FROM
[Order] JOIN
OrderProduct ON [Order].ID = OrderProduct.OrderID JOIN
OrderProductDetail ON OrderProduct.ID = OrderProductDetail.OrderProductID
WHERE
[Order].Date BETWEEN '2018-01-01 00:00:00.000' AND '2018-02-01 00:00:00.000'
) AS OrderDetailView
WHERE RowNum BETWEEN 1 AND 1000;
With over 16M records in the tables, the above query took 1 minute to complete; the records found are capped to 1000.
However, if I simply remove the RowNum filter in the WHERE clause, the query completes within 3 seconds and returns 1700 records in total. (The same holds if I run only the sub-query portion.)
SELECT * FROM
(
SELECT
ROW_NUMBER() OVER ( ORDER BY OrderProductDetail.ID ) AS RowNum,
*
FROM
[Order] JOIN
OrderProduct ON [Order].ID = OrderProduct.OrderID JOIN
OrderProductDetail ON OrderProduct.ID = OrderProductDetail.OrderProductID
WHERE
[Order].Date BETWEEN '2018-01-01 00:00:00.000' AND '2018-02-01 00:00:00.000'
) AS OrderDetailView
Order.ID = Unique Clustered PK (Int)
Order.Date = Non-Clustered Index (Timestamp)
OrderProduct.ID = Unique Clustered PK (Int)
OrderProductDetail.ID = Unique Clustered PK (Int)
Some other test cases I've performed:
( ORDER BY Order.Date ) AS RowNumber >> Fast
( ORDER BY Order.ID ) AS RowNumber >> Fast
Question: How can I improve the performance?
UPDATE STATISTICS Order WITH FULLSCAN;
UPDATE STATISTICS OrderProduct WITH FULLSCAN;
UPDATE STATISTICS OrderProductDetail WITH FULLSCAN;
Finally, the query went back to normal after I executed the commands above; my DBA hadn't included the FULLSCAN option on the first attempt, which is why it wasn't working.
Thanks @Jeroen Mostert!
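The paging pattern itself can be sketched outside SQL Server. The snippet below uses Python's sqlite3 (window functions need SQLite ≥ 3.25, bundled with modern Python) and a made-up orders table to show the ROW_NUMBER()-in-a-subquery pagination the question implements:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, f"2018-01-{i % 28 + 1:02d}") for i in range(1, 51)])

# Page 2 with a page size of 10: number the rows in the subquery,
# then filter on the row number in the outer query.
page = conn.execute("""
    SELECT id FROM (
        SELECT ROW_NUMBER() OVER (ORDER BY id) AS rownum, id
        FROM orders
    )
    WHERE rownum BETWEEN 11 AND 20
""").fetchall()

print([r[0] for r in page])  # ids 11 through 20
```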

Prevent auto sorting in SQL Server

I am trying to run the following query:
Select *
From table
Where ID In ('100', '20', '222', '1', '15')
When I run this query, the result set comes back ordered by ID (ID is the primary key).
How do I ensure that the result set is returned in the order I specified in the IN list?
You need to explicitly state the order you want; without an ORDER BY, the order of the results is not guaranteed.
SELECT *
FROM TABLE
WHERE ID IN ('100','20','222','1','15')
ORDER BY
CASE WHEN ID = '100' THEN 1
WHEN ID = '20' THEN 2
WHEN ID = '222' THEN 3
WHEN ID = '1' THEN 4
WHEN ID = '15' THEN 5
END
When you have a primary key, the data is stored sorted by it: behind the scenes, SQL Server creates a clustered index for the primary key, and one characteristic of a clustered index is that the data is kept physically sorted by it. That is why the rows tend to come back in primary-key order, but without an ORDER BY the order is never guaranteed.
If you need to show the data in a custom order, you need to specify it using ORDER BY.
@Brad provided a sample query that sorts the data based on your IN clause: https://stackoverflow.com/a/48453671/1666800
Yet another option: the XML parser will include a sequence number.
Declare #List varchar(max) = '100,20,222,1,15'
Select A.*
From YourTable A
Join (
Select RetSeq = Row_Number() over (Order By (Select null))
,RetVal = LTrim(RTrim(B.i.value('(./text())[1]', 'varchar(max)')))
From (Select x = Cast('<x>' + replace(#List,',','</x><x>')+'</x>' as xml).query('.')) as A
Cross Apply x.nodes('x') AS B(i)
) B on A.ID=B.RetVal
Order by B.RetSeq
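When the IN list is built at runtime, the CASE expression from the first answer can be generated dynamically instead of hand-written. A sketch using Python's sqlite3 (the table t and its contents are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("1",), ("15",), ("20",), ("100",), ("222",)])

wanted = ["100", "20", "222", "1", "15"]

# Build "CASE id WHEN ? THEN 0 WHEN ? THEN 1 ... END" so that
# ORDER BY follows the literal order of the IN list.
case_expr = ("CASE id "
             + " ".join(f"WHEN ? THEN {i}" for i in range(len(wanted)))
             + " END")
placeholders = ", ".join("?" * len(wanted))
rows = conn.execute(
    f"SELECT id FROM t WHERE id IN ({placeholders}) ORDER BY {case_expr}",
    wanted + wanted,  # parameters: once for IN, once for the CASE
).fetchall()

print([r[0] for r in rows])  # ['100', '20', '222', '1', '15']
```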

Efficiently query for the latest version of a record using SQL

I need to query a table for the latest version of a record for all available dates (end of day time-series). The example below illustrates what I am trying to achieve.
My question is whether the table's design (primary key, etc.) and the LEFT OUTER JOIN query is accomplishing this goal in the most efficient manner.
CREATE TABLE [PriceHistory]
(
[RowID] [int] IDENTITY(1,1) NOT NULL,
[ItemIdentifier] [varchar](10) NOT NULL,
[EffectiveDate] [date] NOT NULL,
[Price] [decimal](12, 2) NOT NULL,
CONSTRAINT [PK_PriceHistory]
PRIMARY KEY CLUSTERED ([ItemIdentifier] ASC, [RowID] DESC, [EffectiveDate] ASC)
)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-15',5.50)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-16',5.75)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-16',6.25)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-17',6.05)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-18',6.85)
GO
SELECT
L.EffectiveDate, L.Price
FROM
[PriceHistory] L
LEFT OUTER JOIN
[PriceHistory] R ON L.ItemIdentifier = R.ItemIdentifier
AND L.EffectiveDate = R.EffectiveDate
AND L.RowID < R.RowID
WHERE
L.ItemIdentifier = 'ABC' and R.EffectiveDate is NULL
ORDER BY
L.EffectiveDate
Follow up: the table can contain thousands of ItemIdentifiers, each with decades' worth of price data. Historical versions of the data need to be preserved for audit reasons. Say I query the table and use the data in a report; I store @MRID = MAX(RowID) at the time the report was generated. Now, if the price for 'ABC' on '2016-03-16' is corrected at some later date, I can modify the query using @MRID and replicate the report that I ran earlier.
A slightly modified version of @SeanLange's answer will give you the last row per date, instead of per item:
with sortedResults as
(
select *
, ROW_NUMBER() over(PARTITION by ItemIdentifier, EffectiveDate
ORDER by RowID desc) as RowNum
from PriceHistory
)
select ItemIdentifier, EffectiveDate, Price
from sortedResults
where RowNum = 1
order by 2
I assume you have more than one ItemIdentifier in your table. Your design is a bit problematic in that you are keeping versions of the data in the same table. You can, however, do something like this quite easily to get the most recent row for each ItemIdentifier:
with sortedResults as
(
select *
, ROW_NUMBER() over(PARTITION by ItemIdentifier order by EffectiveDate desc) as RowNum
from PriceHistory
)
select *
from sortedResults
where RowNum = 1
Short answer, no.
You're hitting the same table twice, and possibly creating a looped table scan, depending on your existing indexes. In the best case, you're causing a looped index seek, and then throwing out most of the rows.
This would be the most efficient query for what you're asking.
SELECT
L.EffectiveDate,
L.Price
FROM
(
SELECT
L.EffectiveDate,
L.Price,
ROW_NUMBER() OVER (
PARTITION BY
L.ItemIdentifier,
L.EffectiveDate
ORDER BY RowID DESC ) RowNum
FROM [PriceHistory] L
WHERE L.ItemIdentifier = 'ABC'
) L
WHERE
L.RowNum = 1;
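For comparison, here is the ROW_NUMBER() approach run end to end on the sample data from the question, using Python's sqlite3 instead of SQL Server (an AUTOINCREMENT column stands in for IDENTITY):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PriceHistory (
        RowID INTEGER PRIMARY KEY AUTOINCREMENT,
        ItemIdentifier TEXT NOT NULL,
        EffectiveDate TEXT NOT NULL,
        Price REAL NOT NULL
    );
    INSERT INTO PriceHistory (ItemIdentifier, EffectiveDate, Price) VALUES
        ('ABC', '2016-03-15', 5.50),
        ('ABC', '2016-03-16', 5.75),
        ('ABC', '2016-03-16', 6.25),
        ('ABC', '2016-03-17', 6.05),
        ('ABC', '2016-03-18', 6.85);
""")

# Highest RowID per (item, date) wins, i.e. the latest version of each record.
rows = conn.execute("""
    SELECT EffectiveDate, Price FROM (
        SELECT EffectiveDate, Price,
               ROW_NUMBER() OVER (PARTITION BY ItemIdentifier, EffectiveDate
                                  ORDER BY RowID DESC) AS rn
        FROM PriceHistory
        WHERE ItemIdentifier = 'ABC'
    )
    WHERE rn = 1
    ORDER BY EffectiveDate
""").fetchall()

print(rows)  # 2016-03-16 resolves to 6.25, the later RowID
```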

How to combine PIVOT and aggregation?

Here is my sample schema and data (http://sqlfiddle.com/#!3/0d8b7/3/0):
CREATE TABLE cpv
(
ClientId INT,
CodeName VARCHAR(20),
Value VARCHAR(30),
LastModified DATETIME
);
INSERT INTO cpv (ClientId,CodeName,Value,LastModified)
VALUES
(1000, 'PropA', 'A', '2014-05-15 17:02:00'),
(1000, 'PropB', 'B', '2014-05-15 17:01:00'),
(1000, 'PropC', 'C', '2014-05-15 17:01:00'),
(2000, 'PropA', 'D', '2014-05-15 17:02:00'),
(2000, 'PropB', 'E', '2014-05-15 17:05:00');
I need to reshape it into:
ClientId PropA PropB PropC LastModified
1000 A B C '2014-05-15 17:02:00'
2000 D E NULL '2014-05-15 17:05:00'
There are two operations involved here:
aggregation of the LastModified - taking the Max within the same ClientId
pivoting the CodeName column
I have no idea how to combine them.
This SQL Fiddle demonstrates pivoting the CodeName column:
SELECT PropA,PropB,PropC
FROM (
SELECT CodeName,Value FROM cpv
) src
PIVOT (
MAX(Value)
FOR CodeName IN (PropA,PropB,PropC)
) p
But it does not group by ClientId, nor does it take the maximum of LastModified.
This SQL Fiddle demonstrates grouping by the ClientId and aggregating LastModified:
SELECT ClientId,MAX(LastModified) LastModified
FROM cpv
GROUP BY ClientId
But it totally ignores the CodeName and Value columns.
How can I group by ClientId, aggregate by taking the Maximum LastModified within the group and also pivot the CodeName column, again within each group?
EDIT
The answer is available here.
Try this:
;with cte as
(SELECT ClientID,PropA,PropB,PropC
FROM (
SELECT ClientID, CodeName,Value FROM cpv
) src
PIVOT (
MAX(Value)
FOR CodeName IN (PropA,PropB,PropC)
) p)
SELECT DISTINCT cte.ClientID, PropA, PropB, PropC, MAX(LastModified) OVER(PARTITION BY cte.clientid ORDER BY cte.clientid) MaxLastModified FROM cte
INNER JOIN cpv ON cte.clientid = cpv.clientid
Demo here.
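An alternative to PIVOT that works in any dialect is conditional aggregation: group by ClientId and fold each CodeName into its own column with MAX(CASE ...), taking MAX(LastModified) in the same pass. A runnable sketch with Python's sqlite3 (standing in for SQL Server), on the sample data from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cpv (ClientId INT, CodeName TEXT, Value TEXT, LastModified TEXT);
    INSERT INTO cpv VALUES
        (1000, 'PropA', 'A', '2014-05-15 17:02:00'),
        (1000, 'PropB', 'B', '2014-05-15 17:01:00'),
        (1000, 'PropC', 'C', '2014-05-15 17:01:00'),
        (2000, 'PropA', 'D', '2014-05-15 17:02:00'),
        (2000, 'PropB', 'E', '2014-05-15 17:05:00');
""")

# One GROUP BY does both jobs: the CASEs pivot CodeName into columns,
# and MAX(LastModified) aggregates the timestamp within each ClientId.
rows = conn.execute("""
    SELECT ClientId,
           MAX(CASE WHEN CodeName = 'PropA' THEN Value END) AS PropA,
           MAX(CASE WHEN CodeName = 'PropB' THEN Value END) AS PropB,
           MAX(CASE WHEN CodeName = 'PropC' THEN Value END) AS PropC,
           MAX(LastModified) AS LastModified
    FROM cpv
    GROUP BY ClientId
    ORDER BY ClientId
""").fetchall()

print(rows)
# (1000, 'A', 'B', 'C', '2014-05-15 17:02:00')
# (2000, 'D', 'E', None, '2014-05-15 17:05:00')
```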
