dbt incremental load merge operation without IS_INCREMENTAL() macro - snowflake-cloud-data-platform

I have set my model to be incremental with:
{{
    config(
        materialized='incremental',
        unique_key='my_key',
        incremental_strategy='merge'
    )
}}
Do I absolutely have to include the is_incremental() macro at the end of the statement?
If not, then without is_incremental():
1. Is dbt running a full refresh in the pipeline?
2. Is dbt still merging the output on my_key in Snowflake, or is dbt going to completely overwrite the entire Snowflake table with the full-refresh output?

is_incremental() is just a convenience macro. It evaluates to True if the model is being run in "incremental mode" (i.e., the table already exists and the --full-refresh flag has not been passed).
If a model is running for the first time or --full-refresh is set, dbt will drop the target table and create a new table with the results of the select statement in your model.
Otherwise, in incremental mode, an incremental model will attempt to upsert any records returned by the select statement.
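To illustrate what "upsert" means here: with incremental_strategy='merge' and unique_key='my_key' on Snowflake, dbt stages your model's select statement in a temporary object and then issues a statement roughly shaped like the sketch below (table, alias, and column names are placeholders; the exact SQL comes from the adapter's materialization code):
merge into analytics.my_model as dest
using my_model__dbt_tmp as src
on src.my_key = dest.my_key
when matched then update set dest.col_a = src.col_a   -- one assignment per column
when not matched then insert (my_key, col_a) values (src.my_key, src.col_a);
So even without is_incremental(), rows are still merged on my_key rather than the whole table being overwritten; the macro only controls which rows your select returns.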
Typically, an incremental model's source continues to grow over time (maybe it's a table of 100M records growing by 2M records/week). In this case, you would use something like:
select *
from source
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
so that you're only selecting the new records from your source table.
However, if your source table already purges old records, you may not need the special where clause. Just note that this isn't the pattern dbt was designed for, and if someone ever runs --full-refresh on your prod target you'll lose all of the old data.
If you do have a source that purges (deletes/truncates) old records, I would consider creating a snapshot of that source, instead of loading the values into an incremental model directly. Snapshots are designed for this kind of use case, and will generalize better across multiple dev environments, etc.
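A minimal sketch of such a snapshot, using the classic snapshot block syntax and assuming a hypothetical source('my_source', 'my_table') with a my_key unique key and an updated_at column:
{% snapshot my_source_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='my_key',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}
select * from {{ source('my_source', 'my_table') }}
{% endsnapshot %}
Because dbt snapshot only ever adds or updates snapshot rows, the captured history survives source purges and a --full-refresh of downstream models.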
Incremental docs
Snapshot docs

Related

How to test/validate views in Snowflake

Sometimes, upstream DDL changes can break downstream views (it shouldn't happen, but humans make mistakes).
In order to detect these defects before our stakeholders do, is there a way to automatically test the validity of all views in Snowflake?
A view is not automatically updated if the underlying source objects are modified.
Currently, Snowflake does not have a way of tracking which views are no longer valid, as the view definition is only validated at execution time.
Workaround
You may use the OBJECT_DEPENDENCIES view or the GET_OBJECT_REFERENCES function to identify the affected views and try to rebuild the view definitions.
https://docs.snowflake.com/en/sql-reference/account-usage/object_dependencies.html
https://docs.snowflake.com/en/user-guide/object-dependencies.html#label-object-dependencies-features
Use the task feature to check the status of the source tables and send a notification.
You could create a stored procedure using the SNOWFLAKE.ACCOUNT_USAGE.VIEWS view and GET_OBJECT_REFERENCES() to find the invalid views.
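For example (database/schema/object names are placeholders, column names are as documented in the links above, and ACCOUNT_USAGE views can lag by a couple of hours):
-- Which views depend on a given table?
select referencing_database, referencing_schema, referencing_object_name
from snowflake.account_usage.object_dependencies
where referenced_object_name = 'MY_SOURCE_TABLE';
-- Which objects does a given view reference?
select *
from table(get_object_references(
    database_name => 'MY_DB',
    schema_name   => 'MY_SCHEMA',
    object_name   => 'MY_VIEW'));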
We started calling this query on each view, and it will produce an error if the view is invalid (hence failing the test):
EXPLAIN SELECT 1 FROM database.schema.view LIMIT 1;
Even though the query without EXPLAIN is extremely simple, it can still be slow on more complex views.
EXPLAIN only builds the query plan rather than executing the query (and the query planning will fail if the view is invalid).
The query plan is built exclusively in the cloud services layer, so those queries do not require an active warehouse, and are essentially free (as long as your cloud services usage remains below 10% of your total usage).
We use dbt to run this test every hour, but you could use any other tool that allows you to automate queries (such as SnowSQL).
Some details on our usage of DBT:
We have added a project variable called test_views.
We have added a test_view() macro:
{%- macro test_view(model) -%}
EXPLAIN SELECT 1 FROM {{ model.database }}.{{ model.schema }}.{{ model.alias }} LIMIT 1;
{%- endmacro -%}
We have added the following code to the view materialization:
{%- set sql = test_view(model) if (model.config.materialized == 'view' and var('test_views') == True) else sql -%}
We can run the test with this command:
dbt run -s config.materialized:view --vars '{test_views: true}'
Another option would be to simply re-create the views constantly (which would fail if there is an issue with the upstream dependencies), but that just seems terribly wasteful.

In dbt, why is my incremental model not registering as incremental, and why is the compiled SQL not what I would expect?

My incremental model is not working, and I don't know why! Thank you in advance.
My intention is to only process records that are within 60 days of today to be materialized into a table that has many years worth of data. Since a record can change until it is older than 60 days, I need to reprocess all records younger than 60 days every day.
So I set up an incremental model like this
Which, compiled, looks like this (notice that the 60-day clause does not show up, and I don't know why).
When I run this model, there is no mention of a temp table being created,
and also no mention of a merge action for the unique_key clause at the top of my model.
is_incremental() behavior
Have you run this model before? How are you running this model? I ask because is_incremental() will only be true (and therefore your extra WHERE clause will only be included) if certain criteria are met.
The docs here say:
The is_incremental() macro will return True if:
the destination table already exists in the database
dbt is not running in full-refresh mode
the running model is configured with materialized='incremental'
In this case, you're right that #2 and #3 are both True, but my guess is that #1 is False.
To learn more about is_incremental() check out the source code, which, IMHO, clearly lays out the logic.
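For reference, the macro is only a few lines; paraphrased from dbt-core (check the linked source for the exact version you're on), it looks roughly like this:
{% macro is_incremental() %}
    {# never true while dbt is just parsing the project #}
    {% if not execute %}
        {{ return(False) }}
    {% else %}
        {% set relation = adapter.get_relation(this.database, this.schema, this.table) %}
        {{ return(relation is not none
                  and relation.type == 'table'
                  and model.config.materialized == 'incremental'
                  and not should_full_refresh()) }}
    {% endif %}
{% endmacro %}
If adapter.get_relation() cannot find the target table (first run, a dropped or renamed table, or a dev schema that has not been built yet), the whole thing returns False and your extra WHERE clause is compiled away.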
no TEMP table being created?
When I run this model, there is no mention of a temp table being created
I do see a create or replace transient table statement, which is how dbt makes a temp table on Snowflake (source code)
In the dbt cloud development environment, is_incremental() always evaluates to False. In dbt cloud production runs (and using the CLI), is_incremental() evaluates to True if the table already exists.

Can we convert DELETE SQL statements into dbt?

I am trying to build a dbt model from SQL which has delete statements based on a where clause.
Can anyone please suggest how to convert the below SQL delete statement into a dbt model?
delete from table_name where condition;
Thanks
There are a couple of options for running DELETE statements in dbt (both sketched below):
add a DELETE statement as a pre_hook or post_hook for an existing model
create an operation macro to run a DELETE statement independently of a model
Note that unless your model materialization type is "incremental" it doesn't make much sense to delete from the model target.
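A sketch of both options, with hypothetical table names and conditions:
-- Option 1: a pre-hook at the top of an existing model (models/my_model.sql)
{{ config(
    materialized='incremental',
    pre_hook="delete from {{ this }} where loaded_at < dateadd(day, -90, current_date)"
) }}
-- Option 2: a standalone macro (macros/purge_old_rows.sql), run with: dbt run-operation purge_old_rows
{% macro purge_old_rows() %}
    {% do run_query("delete from my_db.my_schema.my_table where loaded_at < dateadd(day, -90, current_date)") %}
{% endmacro %}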
Disclaimer: I haven't been using dbt for long so there might well be better ways of doing this, or reasons to not do it at all.
Not sure what your use case is, but I've had to use DELETEs when retrofitting existing data warehouse logic into dbt. If you're starting from scratch with dbt, then probably try to avoid a design that requires deleting data.
I have needed to implement deletes to comply with CCPA deletion requirements. Our raw layer is dropped and rebuilt daily, so if a row does not exist in raw, it needs to be deleted in the downstream tables.
The stage layer is a set of views that rename and cast the raw tables and also create a surrogate key as sha1(raw_table_business_key). The pre_hook for incrementally loaded EDW tables is something like:
delete from {{ this }} where skey not in
(select skey from {{ ref('stage_view') }})
Yes, it absolutely restates history.
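Wired into the model's config block, that pre-hook might look like the sketch below (skey and stage_view come from the answer above; everything else is assumed):
{{ config(
    materialized='incremental',
    unique_key='skey',
    pre_hook="delete from {{ this }} where skey not in (select skey from {{ ref('stage_view') }})"
) }}
As a side note, not in misbehaves if the subquery can ever return a NULL skey, so a not exists anti-join is often the safer way to write the same delete.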

Find out the recently selected rows from an Oracle table, and can I update a LAST_ACCESSED column whenever the table is accessed?

I have a database table with more than 1 million records, uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select query can be issued from multiple places: sometimes a row is returned on its own, sometimes as part of a larger result set. One select query fetches over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up the table: I want to delete all rows that were never used (retrieved via a select query) in the last 5 years.
Does Oracle have any built-in metadata which can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table. But that is a costly operation for me, given the time taken by the whole process; at least 1,000 - 10,000 records will be selected from the table in a single operation. Is there any efficient way to do this other than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waits for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization, which brings you Heat Maps to track table access (modifications as well as read operations). Careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. It does not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
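In SQL terms the workflow looks roughly like this (object names are placeholders, the exact columns of the heat map views vary by version, and the licensing caveats above apply):
-- Enable heat map tracking
ALTER SYSTEM SET HEAT_MAP = ON;
-- Later, inspect segment-level read/write activity
SELECT *
FROM   dba_heat_map_segment
WHERE  owner = 'MY_SCHEMA'
AND    object_name = 'MY_BIG_TABLE';
-- If the table is date-partitioned, purging a cold partition is far cheaper than a conventional DELETE
ALTER TABLE my_schema.my_big_table DROP PARTITION p_2017_01 UPDATE GLOBAL INDEXES;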
I couldn't use any built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and the trigger likewise hurt performance.
Finally, I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used or selected in a query are inserted. When deleting, I cross-check this table and avoid deleting entries present in it.
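In outline, that workaround looks something like this (table and column names are made up for illustration):
-- The application code and procedures record keys that are still being read
INSERT INTO guid_still_used (guid, last_seen)
VALUES (:guid, SYSDATE);
-- The cleanup job only deletes old rows that never show up in the usage table
DELETE FROM my_big_table t
WHERE  t.created_date < ADD_MONTHS(SYSDATE, -60)
AND    NOT EXISTS (SELECT 1 FROM guid_still_used u WHERE u.guid = t.guid);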

How to bulk insert and validate data against existing database data

Here is my situation: my client wants to bulk insert 100,000+ rows into the database from a CSV file, which is simple enough, but the values need to be checked against data that is already in the database (does this product type exist? Is this product still sold? etc.). To make things worse, these files will also be uploaded into the live system during the day, so I need to make sure I'm not locking any tables for long. The data that is inserted will also be spread across multiple tables.
I've been loading the data into a staging table, which takes seconds. I then tried creating a web service to start processing the table using LINQ, marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done, I need to take the valid rows and update/add them to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest something like:
IF EXISTS (SELECT blah FROM blah WHERE....)
UPDATE (blah)
ELSE
INSERT (blah)
You could do this in chunks to avoid server load, but it is by no means a quick solution, so SSIS would be preferable.
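Spelled out with placeholder names (dbo.Product as the live table, @-parameters supplied per staged row), that pattern is:
-- T-SQL sketch; @ProductCode, @Price, @IsActive would be parameters of a surrounding procedure or loop
IF EXISTS (SELECT 1 FROM dbo.Product WHERE ProductCode = @ProductCode)
BEGIN
    UPDATE dbo.Product
    SET    Price = @Price, IsActive = @IsActive
    WHERE  ProductCode = @ProductCode;
END
ELSE
BEGIN
    INSERT INTO dbo.Product (ProductCode, Price, IsActive)
    VALUES (@ProductCode, @Price, @IsActive);
END
For 100,000+ rows, the same logic expressed set-based against the staging table (or an SSIS package, as noted) will scale much better than row-by-row checks.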
