Snowflake - Reconciliation of two tables based on keys

I have two tables (primary and secondary) and need to do a row- and column-level reconciliation between them, producing a summary of the differences.
Table A:
col_A  col_B  col_C
One    Two    Three
Four   Five   Six
Seven  Eight  Nine
Table B:
col_A  col_B  col_C
One    Two    Three
Four   Five   ABC
Seven  Eight  Nine
Nine   Eight  Nine
In the above tables, col_A is the primary key column. I want to compare Table A and Table B and produce results like the following:
Matched Rows: 2
Unmatched Rows: 1
Columns not matching: col_C (sample key: Four)
Rows Present in Table A but not in B: 0
Rows Present in Table B but not in A: 1 (sample key: Nine)
Table A and Table B each have approximately a billion rows. What would be an efficient way to do this in Snowflake?

For row comparison, consider the MINUS/EXCEPT set operator.
SELECT count(*) as countOfRowsInTable1NotInTable2 FROM
(
SELECT col_A, col_B, col_C FROM table1
MINUS
SELECT col_A, col_B, col_C FROM table2
)sub;
You can reverse the order of the inner SELECT statements to get the corresponding count for table2 (rows in table2 that are not in table1). You can also use SELECT * instead of SELECT COUNT(*) if you want to see which rows exist in one table but not the other.
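As a rough illustration of how the set-operator idea extends to the full summary asked for in the question, here is a minimal sketch that runs against an in-memory SQLite database from Python. It is not Snowflake code: SQLite has no MINUS, so EXCEPT is used instead (Snowflake accepts both spellings), and the null-safe comparison is written with SQLite's IS NOT where Snowflake would use IS DISTINCT FROM. The helper name scalar and the variable names are made up for the example; the table names and data come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (col_A TEXT PRIMARY KEY, col_B TEXT, col_C TEXT);
CREATE TABLE table2 (col_A TEXT PRIMARY KEY, col_B TEXT, col_C TEXT);
INSERT INTO table1 VALUES
  ('One', 'Two', 'Three'), ('Four', 'Five', 'Six'), ('Seven', 'Eight', 'Nine');
INSERT INTO table2 VALUES
  ('One', 'Two', 'Three'), ('Four', 'Five', 'ABC'),
  ('Seven', 'Eight', 'Nine'), ('Nine', 'Eight', 'Nine');
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Rows that are identical in both tables (all columns equal).
matched = scalar("""
    SELECT count(*) FROM (
        SELECT col_A, col_B, col_C FROM table1
        INTERSECT
        SELECT col_A, col_B, col_C FROM table2)""")

# Keys present in both tables whose non-key columns differ.
# IS NOT is SQLite's null-safe inequality (an assumption of this sketch).
unmatched = scalar("""
    SELECT count(*) FROM table1 a JOIN table2 b ON a.col_A = b.col_A
    WHERE a.col_B IS NOT b.col_B OR a.col_C IS NOT b.col_C""")

# Keys that exist on one side only.
only_in_a = scalar("""
    SELECT count(*) FROM (SELECT col_A FROM table1 EXCEPT SELECT col_A FROM table2)""")
only_in_b = scalar("""
    SELECT count(*) FROM (SELECT col_A FROM table2 EXCEPT SELECT col_A FROM table1)""")

print(matched, unmatched, only_in_a, only_in_b)
```

With the sample data this reproduces the counts from the question (2 matched, 1 unmatched, 0 only in A, 1 only in B). Comparing only the key column for the "present in one table only" counts keeps a changed row (key Four) from also being reported as missing.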

Related

How to overwrite the table rows by another Table if they are duplicate rows

I have a table in snowflake which has some data like below
Table 1(snowflake table)
LOCATIONID OBSERVATION_TIME_UTC source_record_id Value
LFOB 201001000001.00 cw_altdata:LFOB_historical_hourly.txt:2020-12-23_003400:1 3
LFOB 201001000002.00 cw_altdata:LFOB_historical_hourly.txt:2020-12-23_003400:2 3
I need to append data to this existing table and remove duplicates based on the first two columns.
Table 2(Need to append to the existing table)
LOCATIONID OBSERVATION_TIME_UTC source_record_id Value
LFOB 201001000001.00 cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:3 4
LFOB 201001000002.00 cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:4 4
After appending the Table 2 data, I want the duplicate rows to be removed from the table. My output table should look like this:
LOCATIONID OBSERVATION_TIME_UTC source_record_id Value
LFOB 201001000001.00 cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:3 4
LFOB 201001000002.00 cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:4 4
Here we can see that the duplicate rows have been removed. The row with the latest date should be kept; e.g., here 2020-12-24_003400 is later than the date in Table 1.
I only know some basics of sql statements. I did not find any articles regarding this, so did not get a chance to try any solutions. It would be a great help if someone has a solution.
UPDATE is the most expensive DML operation in Snowflake (and just about every other RDBMS). If the number of rows in Table 2 is a significant percentage of Table 1, and a significant percentage of them would result in an UPDATE instead of an INSERT, the following technique is an alternative:
DELETE FROM TABLE_1 T1 WHERE (T1.LOCATIONID, T1.OBSERVATION_TIME_UTC)
IN (SELECT T2.LOCATIONID, T2.OBSERVATION_TIME_UTC FROM TABLE_2 T2);
INSERT INTO TABLE_1 (SELECT * FROM TABLE_2);
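To see the delete-then-insert pattern end to end, here is a small sketch using Python's built-in sqlite3 module with the sample rows from the question. It only illustrates the logic; the column types are assumed (the question does not state them), and running both statements in one transaction is a choice made for this example so that readers never see the table with the old rows deleted but the new ones missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column types are assumptions; the question does not give them.
CREATE TABLE TABLE_1 (LOCATIONID TEXT, OBSERVATION_TIME_UTC TEXT,
                      source_record_id TEXT, Value INTEGER);
CREATE TABLE TABLE_2 (LOCATIONID TEXT, OBSERVATION_TIME_UTC TEXT,
                      source_record_id TEXT, Value INTEGER);
INSERT INTO TABLE_1 VALUES
  ('LFOB', '201001000001.00', 'cw_altdata:LFOB_historical_hourly.txt:2020-12-23_003400:1', 3),
  ('LFOB', '201001000002.00', 'cw_altdata:LFOB_historical_hourly.txt:2020-12-23_003400:2', 3);
INSERT INTO TABLE_2 VALUES
  ('LFOB', '201001000001.00', 'cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:3', 4),
  ('LFOB', '201001000002.00', 'cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:4', 4);
""")

# Both statements run in one transaction: commit on success, roll back on error.
with conn:
    # Remove the rows in TABLE_1 that are about to be replaced.
    conn.execute("""
        DELETE FROM TABLE_1
        WHERE (LOCATIONID, OBSERVATION_TIME_UTC)
           IN (SELECT LOCATIONID, OBSERVATION_TIME_UTC FROM TABLE_2)""")
    # Append every row from TABLE_2.
    conn.execute("INSERT INTO TABLE_1 SELECT * FROM TABLE_2")

rows = conn.execute("""
    SELECT source_record_id, Value FROM TABLE_1
    ORDER BY OBSERVATION_TIME_UTC""").fetchall()
for row in rows:
    print(row)
```

After the two statements, TABLE_1 holds only the 2020-12-24 records with Value 4, which is the output the question asks for.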
If you want to eliminate rows that are duplicated across all columns in the table (assuming there are no duplicates in Table_1):
INSERT INTO TABLE_1
(SELECT * FROM TABLE_2
MINUS
SELECT * FROM TABLE_1)
or, if the table has many columns:
INSERT INTO TABLE_1
(SELECT * FROM TABLE_2 T2
WHERE (T2.LOCATIONID, T2.OBSERVATION_TIME_UTC)
IN
-- keys that are present in TABLE_2 but not yet in TABLE_1
( SELECT
LOCATIONID, OBSERVATION_TIME_UTC
FROM TABLE_2
MINUS
SELECT
LOCATIONID, OBSERVATION_TIME_UTC
FROM TABLE_1));
You can use a merge statement to update table_1 with the values from table_2 when they are different for the business key (assuming in this case that the business key is LOCATIONID, OBSERVATION_TIME_UTC). If the business key does not exist in table_1 the merge statement will insert the row.
Here is the merge:
merge into table_1
using(SELECT LOCATIONID,
OBSERVATION_TIME_UTC,
source_record_id,
Value
FROM table_2
) table_2
on table_1.LOCATIONID = table_2.LOCATIONID
and table_1.OBSERVATION_TIME_UTC = table_2.OBSERVATION_TIME_UTC
WHEN MATCHED
and (table_1.source_record_id is distinct from table_2.source_record_id or
table_1.value is distinct from table_2.value)
THEN UPDATE
SET table_1.source_record_id = table_2.source_record_id,
table_1.value = table_2.value
WHEN NOT MATCHED
THEN INSERT
(
LOCATIONID,
OBSERVATION_TIME_UTC,
source_record_id,
Value
)
VALUES
(
table_2.LOCATIONID,
table_2.OBSERVATION_TIME_UTC,
table_2.source_record_id,
table_2.Value
)
;

SQL Query to return rows with the most columns populated

Azure SQL Server 2019.
We have a table Table1 with over 100 columns of differing types of nvarchar data, all of which are allowed NULL values, and where there could be anywhere from 1 to 100 columns populated in a given record. I need to formulate a query that returns the rows ranked by how many columns have values in them, in descending order.
I started going down a road of using DATALENGTH and having to type out the name of every single column, but I can only imagine there has to be a more efficient way. Assuming the column names are column1, column2, column3 etc, how would I accomplish this?
How about a lateral join that unpivots the columns to rows? This requires enumerating the columns just once, like so:
select t.*, c.cnt
from mytable t
cross apply (
select count(*) cnt
from (values (t.column1), (t.column2), (t.column3)) x(col)
where col is not null
) c
order by c.cnt desc
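If typing out 100 column names is the real obstacle, one option is to generate the counting expression from the table's metadata. The sketch below shows the idea in Python with SQLite, and it swaps in a different technique from the CROSS APPLY above: SQLite has no CROSS APPLY, so each column contributes a 0/1 term (colN IS NOT NULL) that is summed. On SQL Server the column list would come from the catalog views rather than PRAGMA table_info. The id column and the sample rows are hypothetical and only there to make the result readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, column1 TEXT, column2 TEXT, column3 TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, "a", None, None),   # 1 populated column
        (2, "a", "b", "c"),     # 3 populated columns
        (3, None, "b", "c"),    # 2 populated columns
    ],
)

# PRAGMA table_info returns one row per column: (cid, name, type, notnull, dflt_value, pk).
cols = [row[1] for row in conn.execute("PRAGMA table_info(mytable)") if row[1] != "id"]

# Build "(column1 IS NOT NULL) + (column2 IS NOT NULL) + ..." once, from metadata.
count_expr = " + ".join(f'("{c}" IS NOT NULL)' for c in cols)

ranked = conn.execute(
    f"SELECT id, {count_expr} AS cnt FROM mytable ORDER BY cnt DESC, id"
).fetchall()
print(ranked)
```

The column names are interpolated into the SQL text, which is acceptable here only because they come from the database's own metadata and not from user input.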

SQL Server - More efficient way of joining three tables without parent table

This is an issue I have been working on for a while. I have three tables, all of which share 3 of the same columns, but there are rows that are unique to each table. I would like to combine all of the tables without duplicating rows. I have a working solution but I feel like it might not be the most efficient. I tried using joins but found that without a parent table, I wasn't getting the expected number of results. My solution, which does yield the correct number of results (I've cut some columns for simplicity):
--Create table
CREATE TABLE #Temp
(
ID,
Date
)
-- Insert rows that are only in db1
INSERT INTO #Temp
SELECT
ID,
Date
FROM test.dbo.db1
-- Do not include rows shared by db1 and db2
EXCEPT
(
SELECT
ID,
Date
FROM test.dbo.db2
INTERSECT
SELECT
ID,
Date
FROM test.dbo.db1
)
EXCEPT
-- And not in db1 and db3
(
SELECT
ID,
Date
FROM test.dbo.db1
INTERSECT
SELECT
ID,
Date
FROM test.dbo.db3
)
EXCEPT
-- And not in db1, db2 and db3
** Code where I intersect all 3 tables
I repeat the above steps for all three tables and then add the intersections for each combined ID/Date (db1+db2+db3, db1+db2, etc.).
Does anyone know of a way to do this that is more direct and to the point? I have tried doing a full join of all of them but without a parent table with all of the ID's, I found the ID's that only appear in the other two tables don't show up.
SELECT
ID,
Date
FROM test.dbo.db1
UNION
SELECT
ID,
Date
FROM test.dbo.db2
UNION
SELECT
ID,
Date
FROM test.dbo.db3
The UNION takes care of removing duplicates.
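A quick way to convince yourself of this is to compare UNION with UNION ALL on a few overlapping rows. The sketch below does that in Python with an in-memory SQLite database; the sample rows and column types are invented for illustration, and the tables are named db1, db2 and db3 after the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE db1 (ID INTEGER, Date TEXT);
CREATE TABLE db2 (ID INTEGER, Date TEXT);
CREATE TABLE db3 (ID INTEGER, Date TEXT);
INSERT INTO db1 VALUES (1, '2020-01-01'), (2, '2020-01-02');
INSERT INTO db2 VALUES (2, '2020-01-02'), (3, '2020-01-03');
INSERT INTO db3 VALUES (1, '2020-01-01'), (3, '2020-01-03'), (4, '2020-01-04');
""")

# UNION keeps each distinct (ID, Date) pair once, whichever tables it came from.
union_rows = conn.execute("""
    SELECT ID, Date FROM db1
    UNION
    SELECT ID, Date FROM db2
    UNION
    SELECT ID, Date FROM db3
    ORDER BY ID, Date""").fetchall()

# UNION ALL simply concatenates, so shared rows appear more than once.
union_all_count = conn.execute("""
    SELECT count(*) FROM (
        SELECT ID, Date FROM db1
        UNION ALL
        SELECT ID, Date FROM db2
        UNION ALL
        SELECT ID, Date FROM db3)""").fetchone()[0]

print(union_rows)
print(union_all_count)
```

The seven input rows collapse to four distinct ID/Date pairs, including ID 4, which exists only in db3 and would be lost by a join driven from db1.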

Create a table from full outer join query

I have two tables, TDM and AccountMaster. Both have three columns in common, and I have to create a table that retrieves all the rows from TDM by joining on those three columns, i.e. FD_BRANCH, FD_CUSTCODE and PRODUCTID.
While creating the table through a SELECT INTO clause, I get this error:
Column names in each table must be unique. Column name 'FD_BRANCH' in table 'acty' is specified more than once.
Here is the query from which I want to create the table:
SELECT * FROM (SELECT FD_BRANCH,FD_CUSTCODE,PRODUCTID FROM TDM
GROUP BY FD_BRANCH,FD_CUSTCODE,PRODUCTID) A full OUTER JOIN AccountMaster B
ON( A.FD_BRANCH=B.FD_BRANCH AND A.FD_CUSTCODE=B.FD_CUSTCODE AND
A.PRODUCTID=B.PRODUCTID)
Change your select to get only the fields you need from one of the 2 tables.
Select A.*
FROM (
or
Select B.FD_BRANCH,
B.FD_CUSTCODE,
B.PRODUCTID
FROM (
FULL OUTER JOIN combines the 2 sets of columns from both queries, so you end up with at least 6 columns. Even though they come from different tables or aliases, the column names are the same, and SELECT INTO cannot create a table with duplicate column names.

how to get multiple sets of distinct values

This is not about distinct combinations of values (Select distinct col1, col2 from table)
I have a table with a newly loaded csv file.
Some columns are linked to foreign key dimensions but the values in a given column may not exist in the reference tables.
My desire is to find all the values in each column that do not exist but in such a way as to minimize the amount of table scans in our source table.
My current approach consumes the output of a bunch of queries like the following:
SELECT DISTINCT col2 FROM table WHERE col2 NOT IN (SELECT val FROM DimCol2)
SELECT DISTINCT col3 FROM table WHERE col3 NOT IN (SELECT val FROM DimCol3)
however, for N columns, this results in N table scans.
Table is up to 10M rows and columns range in cardinality from 5 through to 5M, but almost all values are already present in the dim tables (>99%).
DimColN ranges in size from 5 values to 50M values, and is well indexed.
The csv is loaded into the table via SSIS, so splitting the pre-processing inside SSIS is possible, but I would have to avoid a SQL query for each row.
The ssis server does not have enough spare ram to cache all the dim tables.
What about using a LEFT JOIN and checking where the results of the join are NULL, meaning the values don't exist in DimCol2?
SELECT DISTINCT a.Col2
FROM table a
LEFT JOIN DimCol2 b ON a.Col2 = b.val
WHERE b.val IS NULL
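Here is a minimal runnable sketch of that anti-join in Python with SQLite. The source table is called staging here because table is a reserved word, and the sample values are made up; DimCol2 and its val column follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging (Col2 TEXT);           -- hypothetical name for the loaded csv table
CREATE TABLE DimCol2 (val TEXT PRIMARY KEY);
INSERT INTO staging VALUES ('a'), ('b'), ('x'), ('x'), ('a'), ('y');
INSERT INTO DimCol2 VALUES ('a'), ('b'), ('c');
""")

# Rows with no partner in DimCol2 come back with b.val NULL; keep only those.
missing = [
    row[0]
    for row in conn.execute("""
        SELECT DISTINCT a.Col2
        FROM staging a
        LEFT JOIN DimCol2 b ON a.Col2 = b.val
        WHERE b.val IS NULL
        ORDER BY a.Col2""")
]
print(missing)
```

Note that this still reads the source table once per column being checked, so it changes the form of each query and not the number of scans.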

Resources