How to use dbt's incremental model without duplicates - snowflake-cloud-data-platform

I have a query that selects some data that I would like to use to create an incremental table. Something like:
{{
    config(
        materialized='incremental',
        unique_key='customer_id'
    )
}}
SELECT
customer_id,
email,
updated_at,
first_name,
last_name
FROM data
The input data has duplicate customers in it. If I read the documentation correctly, records with the same unique_key should be treated as the same record: they should be updated instead of creating duplicates in the final table. However, I am seeing duplicates in the final table. What am I doing wrong?
I am using Snowflake as my data warehouse.

If your source table already contains the duplicates, this is the expected behavior.
As per the dbt documentation: "The first time a model is run, the table is built by transforming all rows of source data."
Docs: https://docs.getdbt.com/docs/build/incremental-models
This basically means that duplicates will be avoided in all future loads, but not during the initial creation. Hence you need to change your SELECT statement so that duplicates are filtered out in the creation itself.
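A minimal sketch of what that could look like, assuming updated_at tells you which row is the most recent (Snowflake's QUALIFY clause is one option here; it is not part of the original question):
{{
    config(
        materialized='incremental',
        unique_key='customer_id'
    )
}}
SELECT
    customer_id,
    email,
    updated_at,
    first_name,
    last_name
FROM data
-- keep only the latest row per customer_id so the very first build is already deduplicated
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1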

With incremental materialization, dbt does a merge or delete+insert using the unique_key; in the case of Snowflake, I believe it does a merge. This means that running the same model several times won't write the same records over and over again to the target table.
If you experience duplicates, most likely your SELECT returns duplicate records. You'd need to deduplicate your input data, which is often done with the row_number() function, something like:
(ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY {{ timestamp_column }} DESC) = 1) AS is_latest
and then filtering by WHERE is_latest
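Put together, a sketch of the deduplicated model body might look like the following (the column names are taken from the question, with updated_at standing in for {{ timestamp_column }}):
WITH ranked AS (
    SELECT
        customer_id,
        email,
        updated_at,
        first_name,
        last_name,
        -- true only for the most recent row per customer_id
        (ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1) AS is_latest
    FROM data
)
SELECT customer_id, email, updated_at, first_name, last_name
FROM ranked
WHERE is_latest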

Related

Query columns from a table and analyze their sources

I would like to ask about the best approach for analyzing the columns of a materialized table in order to find their source tables.
As an example: the table to be analyzed is CUSTOMERS and the column is customer_name.
I would like to have an output like:
current_name | source_name | source_table
customer_name | nombre_cust | nombre_cust
I would also like to create valid_from/valid_to columns from the source table, as in the example below.
Desired Output:
Is there any way to analyze the sources of the columns from the final table?
I am using Snowflake and I have checked INFORMATION_SCHEMA.COLUMNS and SNOWFLAKE.ACCOUNT_USAGE.COLUMNS but they did not help me find the source name of the column or its source table.
I would appreciate any suggestions.
Thanks,
Kind Regards
In case you are asking for the source of a particular table: in theory you could have numerous scripts/ways inside or outside of Snowflake to load a target table. That's why there is no straightforward way with Snowflake's built-in capabilities to detect the source of a certain table; it really depends on how you are loading the table.
Procedure based: You could run DESC PROCEDURE xyz to get the procedure code and parse it for source objects.
INSERT based: If someone is executing a simple INSERT ... SELECT statement, you are not getting this dependency unless you are parsing the query history.
CREATE based: If you are asking for the dependency based on a CREATE TABLE AS ... SELECT statement, you also need to check the query history.
In case you are asking for views/materialized views: you can check out OBJECT_DEPENDENCIES in Snowflake: https://docs.snowflake.com/en/user-guide/object-dependencies.html. With this you can query which view depends on which source object. Since a materialized view can only be based on one source table, every column is based on this source table or was somehow derived from it.
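For the view case, a sketch of such a lookup against the account usage view (column names as documented for SNOWFLAKE.ACCOUNT_USAGE.OBJECT_DEPENDENCIES; the object name CUSTOMERS comes from the question):
SELECT referencing_database,
       referencing_schema,
       referencing_object_name,
       referenced_database,
       referenced_schema,
       referenced_object_name,
       referenced_object_domain
FROM   snowflake.account_usage.object_dependencies
WHERE  referencing_object_name = 'CUSTOMERS';  -- the view/materialized view being analyzed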

How to join query_id & METADATA$ROW_ID in SnowFlake

I am working on tracking changes in data along with a few audit details, like the user who made the change.
Streams in Snowflake give delta record details and a few audit columns, including METADATA$ROW_ID.
Another table, i.e. information_schema.query_history, contains query history details including query_id, user_name, DB name, schema name etc.
I am looking for a way to join query_id & METADATA$ROW_ID so that I can find the user_name corresponding to each change in the data.
Any lead will be much appreciated.
Regards,
Neeraj
The METADATA$ROW_ID column in a stream uniquely identifies each row in the source table so that you can track its changes using the stream.
It isn't there to track who changed the data, rather it is used to track how the data changed.
To my knowledge Snowflake doesn't track who changed individual rows; this is something you would have to build into your application yourself, for example by having a column like updated_by.
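A minimal sketch of that idea (the table and column names are illustrative, not from the question):
-- every write from the application also records who made the change
UPDATE customer
SET    email      = 'new@example.com',
       updated_by = CURRENT_USER(),
       updated_at = CURRENT_TIMESTAMP()
WHERE  customer_id = 42;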
The only way I have found is to run
SELECT * FROM table(information_schema.QUERY_HISTORY_BY_SESSION()) ORDER BY start_time DESC LIMIT 1
during report/table/row generation.
Assuming you have not changed the setting that lets you run more than one query at a time in a session, this returns the currently running query's id. Change it to a CTE and cross join it into the last part of the SELECT to write the id to all rows.
This way you get all the variables from the query_history table. Also remember that Snowflake keeps SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY (and other data) for up to one year, so I recommend a weekly/monthly job that merges this data into a long-term history table. That way you can also manage access to the history data much more easily than by giving users the ACCOUNTADMIN role.
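A sketch of that pattern, assuming the INSERT itself is the query you want to record (target_table, source_table and the column names are illustrative):
INSERT INTO target_table (customer_id, email, change_query_id)
WITH current_query AS (
    -- the most recent query in this session, i.e. the one currently running
    SELECT query_id
    FROM   TABLE(information_schema.QUERY_HISTORY_BY_SESSION())
    ORDER BY start_time DESC
    LIMIT 1
)
SELECT s.customer_id, s.email, q.query_id
FROM   source_table s
CROSS JOIN current_query q;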

SQL: How to bring back 20 columns in a select statement, with a unique on a single column only?

I have a huge select statement, with multiple inner joins, that brings back 20 columns of info.
Is there some way to filter the result to do a unique (or distinct) based on a single column only?
Another way of thinking about it is that when it does a join, it only grabs the first result of the join on a single ID, then it halts and moves on to joining by the next ID.
I've successfully used group by and distinct, but these require you to specify many columns, not just one column, which appears to slow the query down by an order of magnitude.
Update
The answer by #Martin Smith works perfectly.
When I updated the query to use this technique:
It more than doubled in speed (1663ms down to 740ms)
It used less T-SQL code (no need to add lots of parameters to the GROUP BY clause).
It's more maintainable.
Caveat (very minor)
Note that you should only use the answer from #Martin Smith if you are absolutely sure that the rows that will be eliminated will always be duplicates, or else this query will be non-deterministic (i.e. it could bring back different results from run to run).
This is not an issue with GROUP BY, as the TSQL syntax parser will prevent this from ever occurring, i.e. it will only let you bring back results where there is no possibility of duplicates.
You can use row_number for this
WITH T AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY YourCol ORDER BY YourOtherCol) AS RN,
           --Rest of your query here
)
SELECT *
FROM T
WHERE RN = 1

What is the best way to keep changes history to database fields?

For example I have a table which stores details about properties. Which could have owners, value etc.
Is there a good design to keep the history of every change to owner and value? I want to do this for many tables, kind of like an audit of the table.
What I thought was keeping a single table with fields
table_name, field_name, prev_value, current_val, time, user.
But it looks kind of hacky and ugly. Is there a better design?
Thanks.
There are a few approaches
Field based
audit_field (table_name, id, field_name, field_value, datetime)
This one can capture the history of all tables and is easy to extend to new tables. No changes to the structure are necessary for new tables.
Field_value is sometimes split into multiple fields to natively support the actual field type from the original table (but only one of those fields will be filled, so the data is denormalized; a variant is to split the above table into one table for each type).
Other metadata such as field_type, user_id, user_ip, action (update, delete, insert) etc. can be useful.
The structure of such records will most likely need to be transformed to be used.
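A minimal sketch of such a field-based audit table in T-SQL (the column names, types, and the extra metadata columns are assumptions along the lines described above):
CREATE TABLE audit_field (
    audit_id      BIGINT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key for the audit row
    table_name    SYSNAME       NOT NULL,            -- table that was changed
    id            BIGINT        NOT NULL,            -- primary key of the changed row
    field_name    SYSNAME       NOT NULL,            -- column that was changed
    field_value   NVARCHAR(MAX) NULL,                -- new value, stored as text
    change_time   DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME(),
    user_id       NVARCHAR(128) NULL,                -- optional extra metadata
    change_action CHAR(1)       NULL                 -- 'I' = insert, 'U' = update, 'D' = delete
);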
Record based
audit_table_name (timestamp, id, field_1, field_2, ..., field_n)
For each record type in the database, create a generalized table that has all the fields of the original record, plus a versioning field (additional metadata is again possible). One table for each working table is necessary. The process of creating such tables can be automated.
This approach provides you with a semantically rich structure very similar to the main data structure, so the tools used to analyze and process the original data can easily be used on this structure too.
Log file
The first two approaches usually use tables which are very lightly indexed (or have no indexes at all and no referential integrity) so that the write penalty is minimized. Still, sometimes a flat log file might be preferred, but of course functionality is greatly reduced. (Basically it depends on whether you want an actual audit/log that will be analyzed by some other system, or whether the historical records are part of the main system.)
A different way to look at this is to time-dimension the data.
Assuming your table looks like this:
create table my_table (
    my_table_id number not null primary key,
    attr1 varchar2(10) not null,
    attr2 number null,
    constraint my_table_ak unique (attr1, attr2) );
Then if you changed it like so:
create table my_table (
    my_table_id number not null,
    attr1 varchar2(10) not null,
    attr2 number null,
    effective_date date not null,
    is_deleted number(1,0) default 0 not null,
    constraint my_table_ak unique (attr1, attr2, effective_date),
    constraint my_table_pk primary key (my_table_id, effective_date) );
You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to intercept UPDATE activity and turn it into INSERT activity, and to change DELETE activity into updating the IS_DELETED boolean.
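A hedged sketch of what that paradigm shift looks like against the table above (the values are illustrative):
-- instead of UPDATE: insert a new effective-dated version of the row
INSERT INTO my_table (my_table_id, attr1, attr2, effective_date, is_deleted)
VALUES (42, 'NEW_VALUE', 7, SYSDATE, 0);

-- instead of DELETE: insert a version flagged as deleted
INSERT INTO my_table (my_table_id, attr1, attr2, effective_date, is_deleted)
VALUES (42, 'NEW_VALUE', 7, SYSDATE, 1);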
Unreason:
You are correct that this solution is similar to record-based auditing; I initially read it as a concatenation of fields into a string, which I've also seen. My apologies.
The primary differences I see between time-dimensioning the table and using record-based auditing center around maintainability without sacrificing performance or scalability.
Maintainability: One needs to remember to change the shadow table if making a structural change to the primary table. Similarly, one needs to remember to make changes to the triggers which perform change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger which would be against it to intercept DML.
In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, such clarity is appreciated, especially if you have to do a lot of refactoring.
Performance and Scalability: If one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table", and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers - exactly what the record-based solution would require.
The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
In short, it's a design choice; I'm not sure either's right or either's wrong.
In our projects we usually do it this way:
You have a table
properties(ID, value1, value2)
then you add table
properties_audit(ID, RecordID, timestamp or datetime, value1, value2)
ID is the id of the history record (not really required).
RecordID points to the record in the original properties table.
When you update the properties table, you add a new record to properties_audit with the previous values of the record being updated in properties. This can be done using triggers or in your DAL.
After that you have the latest value in properties and all the history (previous values) in properties_audit.
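A minimal sketch of the trigger variant in T-SQL (the dialect, the change_time column name, and the assumption that properties_audit.ID is an identity column are mine; the table and other column names follow the example):
CREATE TRIGGER trg_properties_audit
ON properties
AFTER UPDATE
AS
BEGIN
    -- 'deleted' holds the pre-update values of the rows being changed
    INSERT INTO properties_audit (RecordID, change_time, value1, value2)
    SELECT d.ID, SYSUTCDATETIME(), d.value1, d.value2
    FROM deleted AS d;
END;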
I think a simpler schema would be
table_name, field_name, value, time, userId
No need to save current and previous values in the audit table. When you make a change to any of the fields, you just add a row to the audit table with the changed value. This way you can always sort the audit table on time and know what the previous value of the field was prior to your change.
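For example, a sketch of reconstructing the previous value from that schema with a window function (the audit table name field_audit is an assumption):
SELECT table_name,
       field_name,
       value,
       -- partition by the record id as well if the schema includes one
       LAG(value) OVER (PARTITION BY table_name, field_name ORDER BY time) AS prev_value,
       time,
       userId
FROM   field_audit;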

In SQL Server what is most efficient way to compare records to other records for duplicates with in a given range of values?

We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.
Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is by using a cursor based process, and it places a heavy burden on the database.
So I wanted to ask if there's a more efficient way to do this. I've heard it said that there's almost always a more efficient way to replace cursors with clever JOINS. But I have to admit I'm having a lot of trouble with this one.
For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.
(order_id, customer_id, product_id, quantity, sale_date, price)
We want to look through the records to find suspect duplicates on the following example criteria. These get increasingly harder.
Records that have the same product_id, sale_date, and quantity but different customer_id's should be marked as suspect duplicates for review
Records that have the same customer_id, product_id, quantity and have sale_dates within five days of each other should be marked as suspect duplicates for review
Records that have the same customer_id, product_id, but different quantities within 20 units, and sale_dates within five days of each other should be considered suspect.
Is it possible to satisfy each one of these criteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?
If this gets much more involved, then you might be looking at a simple ETL process to do the heavy lifting for you: the load to the database should be manageable in the sense that you will be loading to your ETL environment, running transformations/checks/comparisons and then writing your results to, perhaps, a staging table that outputs the stats you need. It sounds like a lot of work, but once it is set up, tweaking it is no great pain.
On the other hand, if you are looking at comparing vast amounts of data, then that might entail significant network traffic.
I am thinking "efficient" will mean adding indexes on the fields whose contents you are comparing. I'm not sure offhand whether a mega-join is what you need, or whether it's enough to write the primary keys of the suspect records into a holding table so you can simply list the problems later. I.e., do you need to know why each record is suspect in the result set?
You could
-- Assuming some pkid (primary key) has been added
1.
select o.pkid, o.order_id, o.customer_id, o.product_id, o.quantity, o.sale_date
from orders o
join orders o2
  on o.product_id = o2.product_id
 and o.sale_date = o2.sale_date
 and o.quantity = o2.quantity
 and o.customer_id <> o2.customer_id
then keep joining up more copies of orders, I suppose.
You can do this in a single CASE statement. In the scenario below, the value of MarkedForReview tells you which of your three tests (1, 2, or 3) triggered the review. Note that I have to check the conditions for the third test before the second test.
With InputData As
(
    Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
        , Case
            When O.sale_date = O2.sale_date Then 1
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
                And Abs(O.quantity - O2.quantity) <= 20 Then 3
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 Then 2
            Else 0
            End As MarkedForReview
    From Orders As O
        Left Join Orders As O2
            On O2.order_id <> O.order_id
                And O2.customer_id = O.customer_id
                And O2.product_id = O.product_id
)
Select order_id, product_id, sale_date, quantity, customer_id
From InputData
Where MarkedForReview <> 0
Btw, if you are using something prior to SQL Server 2005, you can achieve the equivalent query using a derived table. Also note that you can return the id of the complementary order that triggered the review. Both orders that trigger a review will obviously be returned.
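A sketch of that pre-2005 derived-table form, under the same assumptions as the CTE above:
Select order_id, product_id, sale_date, quantity, customer_id
From (
    Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
        , Case
            When O.sale_date = O2.sale_date Then 1
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
                And Abs(O.quantity - O2.quantity) <= 20 Then 3
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 Then 2
            Else 0
            End As MarkedForReview
    From Orders As O
        Left Join Orders As O2
            On O2.order_id <> O.order_id
                And O2.customer_id = O.customer_id
                And O2.product_id = O.product_id
) As InputData
Where MarkedForReview <> 0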
