Use Schema as pre hook in dbt not working - snowflake-cloud-data-platform

I'm trying to call a stored procedure from a dbt model. The call is made using a custom materialization macro, as below:
{% materialization call_proc, adapter='snowflake' -%}
{%- call statement('main') -%}
{{ sql }}
{%- endcall -%}
{%- endmaterialization %}
Model definition:
{{ config(
materialized='call_proc',
database='TEST_DB',
schema = 'TEST',
pre_hook = "use schema {{ database }}.{{ schema }};"
)
}}
call "{{ database }}"."{{ schema }}".TEST_PROC('SAMPLE');
The procedure is a Snowflake procedure created with the "Execute as Caller" property. In the Snowflake history I can see it is called with the db/schema.
Internally the procedure calls another procedure which doesn't use a fully qualified name.
Ideally, since it is Execute as Caller, the internal procedure should run using the DB/SCHEMA context that was set.
For that I have specifically put a USE DB.SCHEMA as a pre-hook, but it seems it is not working.
Any ideas? I don't want to use fully qualified names in the call statements inside the procedure body, or even pass them as parameters.

dbt uses a different session to execute pre-hooks and model code; use ... only applies to the current session, which is why the pre-hook has no effect on the behavior of your stored procedure.
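One way to guarantee that the context change runs in the same session as the procedure call (a sketch, untested, building on your custom materialization; it assumes the resolved {{ this.database }}/{{ this.schema }} point at the schema that owns TEST_PROC) is to issue the use statement from inside the materialization itself rather than as a pre-hook:
{% materialization call_proc, adapter='snowflake' -%}
{# run the context change on the same connection that executes the call #}
{%- call statement('set_context') -%}
use schema {{ this.database }}.{{ this.schema }}
{%- endcall -%}
{%- call statement('main') -%}
{{ sql }}
{%- endcall -%}
{%- endmaterialization %}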
Generally, calling stored procedures like this from dbt is a hack: dbt runs should be idempotent, so if you are inserting or deleting data, etc., then you're breaking the dbt paradigm, which is why stored procedures aren't supported more directly.
That said, if you have to use one, I would wrap it in a macro and then call that macro in either an on-run-start hook, a pre-hook, or using dbt's run-operation command. Models are intended to map 1:1 with database relations, and hacking a materialization macro like you have is really not ideal.
You can also use the target context variable to retrieve database and schema from your profile:
-- in macros/call_test_proc.sql
{% macro call_test_proc(my_arg) %}
begin;
use database {{ target.database }};
use schema {{ target.schema }};
call TEST_PROC('{{ my_arg }}');
commit;
{% endmacro %}
Then from the command line:
dbt run-operation call_test_proc --args '{"my_arg": "SAMPLE"}' --target dev
Or as a hook:
# in dbt_project.yml
on-run-start: "{{ call_test_proc('SAMPLE') }}"

Related

Silence a macro error when it is expected, without affecting dbt run

I have a macro that does a COPY INTO to an S3 bucket.
{% macro my_macros() %}
{%- if not execute or target.name != 'prod' -%}
{{ return('') }}
{%- endif %}
{% set query = 'USE DATABASE MY_DB; USE SCHEMA MY_SCH; COPY INTO @MY_STAGE/my_table FROM (SELECT OBJECT_CONSTRUCT(*) from MY_TABLE) FILE_FORMAT =(TYPE = JSON COMPRESSION = NONE) OVERWRITE=FALSE;' %}
{# dbt_utils.log_info(query) #}
{%- do run_query(query) -%}
{% endmacro %}
Note that I set OVERWRITE=FALSE because I only want the files to be copied once. However, if the macro is rerun, dbt run will obviously fail due to the overwrite restriction:
001030 (22000): Files already existing at the unload destination: @MY_STAGE/my_table. Use overwrite option to force unloading.
I'm looking for a solution along any one of these lines:
Maybe I can silence the error during dbt run.
A condition statement within the macro to handle the error (see the sketch after this list).
Silence the error ONLY for this macro/model.
A condition in dbt_project.yml that only runs the macro once a day (or something similar).
Or any other solution; I'm open to it.
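For the "condition statement within the macro" option, here is one rough sketch (untested; it assumes the files always land under the same stage path and that the run_query calls share one connection, so the USE statements carry over to the COPY):
{% macro my_macros() %}
{%- if not execute or target.name != 'prod' -%}
{{ return('') }}
{%- endif %}
{% do run_query('USE DATABASE MY_DB') %}
{% do run_query('USE SCHEMA MY_SCH') %}
{# skip the unload entirely when files already exist at the destination #}
{% set existing = run_query('LIST @MY_STAGE/my_table') %}
{% if existing is not none and existing.rows | length > 0 %}
{{ return('') }}
{% endif %}
{% do run_query("COPY INTO @MY_STAGE/my_table FROM (SELECT OBJECT_CONSTRUCT(*) FROM MY_TABLE) FILE_FORMAT = (TYPE = JSON COMPRESSION = NONE) OVERWRITE = FALSE") %}
{% endmacro %}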

dbt Snapshot Failing (ERROR: 100090 (42P18): Duplicate row detected during DML action)

So we have a table called dim_merchant.sql and a snapshot of this table called dim_merchant_snapshot.
{% snapshot dim_merchant_snapshot %}
{{
config
(
target_schema='snapshots',
unique_key='id',
strategy='check',
check_cols='all'
)
}}
select * from {{ ref('dim_merchant') }}
{% endsnapshot %}
We never had any trouble with it, but since yesterday the snapshot run has been failing with the following error message:
Database Error in snapshot dim_merchant_snapshot (snapshots/dim_merchant_snapshot.sql)
100090 (42P18): Duplicate row detected during DML action
The error is happening during this step of the snapshot:
On snapshot.analytics.dim_merchant_snapshot: merge into "X"."SNAPSHOTS"."DIM_MERCHANT_SNAPSHOT" as DBT_INTERNAL_DEST
using "X"."SNAPSHOTS"."DIM_MERCHANT_SNAPSHOT__dbt_tmp" as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.dbt_scd_id = DBT_INTERNAL_DEST.dbt_scd_id
when matched
and DBT_INTERNAL_DEST.dbt_valid_to is null
and DBT_INTERNAL_SOURCE.dbt_change_type in ('update', 'delete')
then update
set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to
when not matched
and DBT_INTERNAL_SOURCE.dbt_change_type = 'insert'
We realized that some values were being inserted and updated twice in the snapshot (since yesterday), and that caused the failure of our snapshot, but we are not sure why.
Note that the id key on dim_merchant is tested for uniqueness and there are no duplicates of it. Meanwhile, the snapshot table contains duplicates after our first snapshot run (which doesn't cause any failure), but the subsequent runs on the snapshot table affected by duplicates are failing.
We recently updated dbt from 0.20.0 to 1.0.3, but we didn't find any change in the snapshot definition between these versions.
SETUP:
dbt-core==1.0.3,
dbt-snowflake==1.0.0,
dbt-extractor==0.4.0,
Snowflake version: 6.7.1
Thanks !
I know it's been a while since this was posted. I wanted to report that I'm seeing this weird behavior as well. I have a table that I'm snapshotting with the timestamp strategy. The unique_key is made up of several columns. The intention is that this is a full snapshot each time the model is run. The table that is being snapshotted has all unique rows meaning dbt_scd_id is a unique key. I resolved this issue by adding the updated_at column to the unique_key config. In theory, this shouldn't matter since the dbt_scd_id is already a concatenation of unique_key and updated_at. Regardless, it has resolved the issue.
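A sketch of that workaround (the column and snapshot names here are illustrative, not taken from the poster's project; on dbt 1.0-era versions unique_key is a single SQL expression, so a composite key is usually written as a concatenation):
{% snapshot my_table_snapshot %}
{{
config
(
target_schema='snapshots',
strategy='timestamp',
updated_at='updated_at',
unique_key="col_a || '-' || col_b || '-' || updated_at"
)
}}
select * from {{ ref('my_table') }}
{% endsnapshot %}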

dbt (Data Build Tool) - drop the default database prefix that gets added to each model on deployment

In dbt, whenever we deploy the models, the database name gets prefixed to each deployed model's SQL definition in the database.
I need to configure the dbt project so that it doesn't prefix the database name to the deployed models.
You can overwrite the built-in ref macro. This macro returns a Relation object, so we can manipulate its output like this:
{% macro ref(model_name) %}
{% do return(builtins.ref(model_name).include(database=false)) %}
{% endmacro %}
So, from there, all models that use the ref function will return the Relation object without the database specification.
dbt code:
select * from {{ ref('model') }}
compiled code:
select * from schema_name.model
EDIT:
As you requested, here's an example to remove the database name from the sources:
{% macro source(source_name, table_name) %}
{% do return(builtins.source(source_name, table_name).include(database=false)) %}
{% endmacro %}
I've worked with sources from different databases, so if you ever get to that case, you might want to edit the macro to offer an option to include the database name, for example:
{% macro source(source_name, table_name, include_database = False) %}
{% do return(builtins.source(source_name, table_name).include(database = include_database)) %}
{% endmacro %}
dbt code:
select * from {{ source('kaggle_aps', 'coaches') }}
select * from {{ source('kaggle_aps', 'coaches', include_database = True) }}
compiled code:
select * from schema_name.object_name
select * from database_name.schema_name.object_name
More details can be found in the official documentation
Do you mean that:
You don't want the schema name to have a prefix added to it, i.e. just finance.modelname instead of dbname_finance.modelname, or
you want the relation name to be rendered with a two-part name (schema.modelname) instead of a three-part name (database.schema.modelname)?
If it's #1, I recommend you read the entire custom schema names docs page, specifically the part about Advanced custom schema configuration (a sketch of the usual override follows after this answer).
If it's #2, this is a change required at the adapter level. Since you've tagged synapse, I'd wager a guess that you're using Synapse SQL Serverless Pools because I have also encountered the fact that you can't use three-part names in Serverless pools. Last week, I actually made dbt-synapse-serverless a separate adapter from dbt-synapse which in fact disables the three-part name.
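For case #1, the usual pattern from that docs page (a sketch; drop it in your macros folder, e.g. macros/generate_schema_name.sql) is to override generate_schema_name so a configured custom schema is used as-is instead of being prefixed with the target schema:
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}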

Loading data in Snowflake using bind variables

We're building dynamic data loading statements for Snowflake using the Python interface.
We want to create a stage at query runtime and use that stage in a subsequent statement. Table and stage names are dynamic, using bind variables.
Yet, it doesn't seem like we can find the correct syntax; we tried everything on https://docs.snowflake.com/en/user-guide/python-connector-api.html
COPY INTO IDENTIFIER( %(table_name)s )(SRC, LOAD_TIME, ROW_HASH)
FROM (SELECT t.$1, CURRENT_TIMESTAMP(0), MD5(t.$1) FROM "'%(stage_name)s'" t)
PURGE = TRUE;
Is this even possible? Does it work for anyone?
Your code does not create a stage as you mentioned, and you don't need to create one; instead use a table stage or user stage. The SQL below uses a table stage.
You also need to change your syntax a little and use a more Pythonic approach: f-strings.
sql = f"""COPY INTO {table_name} (SRC, LOAD_TIME, ROW_HASH)
FROM (SELECT t.$1, CURRENT_TIMESTAMP(0), MD5(t.$1) FROM #%{table_name} t)
PURGE = TRUE"""

Django model joining a one to many relationship for displaying in a template

Not sure what the best way to describe the problem is. I have two tables, contact and attribute. The contact table has one entry per person, and the attribute table has 0, 1, or many entries per person. They are currently joined with a "fake" foreign key that isn't really a foreign key. If I need to add a real foreign key I will; it is not a big deal, I'm just dealing with old data and there was no foreign key originally. The tables are laid out as follows:
contact:
class contact(models.Model):
contactId = models.AutoField(primary_key=True, db_column='contactId')
firstName = models.CharField(max_length=255, null=True, db_column='firstName')
middleName = models.CharField(max_length=255, null=True, db_column='middleName')
lastName = models.CharField(max_length=255, null=True, db_column='lastName')
attribute:
class attribute(models.Model):
attributeId = models.AutoField(primary_key=True, db_column='attributeId')
contactId = models.IntegerField(db_index=True, null=True, db_column='contactId')
attributeValue = models.TextField(null=True, db_column='attributeValue')
So I have correctly set up the Django models to represent these tables. Now what I need to accomplish is a view and template to loop over these tables such that it generates an xml doc in the following format:
<contacts>
<contact>
<contactId></contactId>
<firstName></firstName>
<lastName></lastName>
<attributes>
<attribute>
<attributeId></attributeId>
<attributeValue></attributeValue>
</attribute>
</attributes>
</contact>
</contacts>
So there would be a listing of all contacts and all of the attributes associated with each contact.
I am sure there is a simple way to accomplish this. In other languages I would simply write two looping queries: loop over the contacts, then loop over the attributes for each contact. However, the company I work for is migrating to a new platform and wants the new application written in Django/Python, both of which I am still trying to learn.
Any help that anyone can provide is appreciated.
Assuming you've set up your django models to use your current database setup, I'd do the following if I didn't have a foreign key set up.
contacts = Contact.objects.all()
for contact in contacts:
contact.attributes = Attribute.objects.filter(contactId=contact.pk)
return render_to_response("mytemplate.html", {'contacts': contacts })
Alternatively, to save some queries:
attributes_map = dict(
[(attribute.contactId, attribute) for attribute in \
Attribute.objects.filter(contactId__in=[contact.pk for contact in contacts])]
)
for contact in contacts:
contact.attributes = attributes_map.get(contact.pk)
return render_to_response("mytemplate.html", {'contacts': contacts })
Template
<contacts>
{% for contact in contacts %}
<contact>
<contactId>{{ contact.pk }}</contactId>
<firstName>{{ contact.firstName }}</firstName>
<lastName>{{ contact.lastName }}</lastName>
{% if contact.attributes %}
<attributes>
{% for attribute in contact.attributes %}
<attribute>
<attributeId>{{ attribute.pk }}</attributeId>
<attributeValue>{{ attribute.attributeValue }}</attributeValue>
</attribute>
{% endfor %}
</attributes>
{% endif %}
</contact>
{% endfor %}
</contacts>
It sounds like you are in legacy db hell.
As such, you will probably find it difficult to use standard django model stuff like ForeignKey, but you should still strive to keep code like that out of the view code.
So it would probably be better to define some methods on your contact class, something like...
def attributes(self):
return Attribute.objects.filter(contactId=self.contactId)
then, you can use that in your templates.
<attributes>
{% for attribute in contact.attributes.all %}
<attribute>
...
</attribute>
{% endfor %}
</attributes>
Though contact.attributes.all in a template might not be ideal (it nails down the relationship in the template), it would be simple to add something to the context in your view instead.
Yuji's answer is a lot like Nick Johnson's prefetch_refprops for App Engine:
http://blog.notdot.net/2010/01/ReferenceProperty-prefetching-in-App-Engine
I found a way to optimize this code:
attributes_map = dict(
[(attribute.contactId, attribute) for attribute in \
Attribute.objects.filter(contactId__in=[contact.pk for contact in contacts])]
)
Replace it with:
attributes_map = dict(Attribute.objects
.filter(contactId__in=set([contact.pk for contact in contacts]))
.values_list('contactId', 'attributeValue'))
This substitution saves time in the query by:
compacting the list of keys into a set(), and
retrieving only the two fields used by the dict().
It also takes advantage of the fact that .values_list() returns the list of tuples expected by dict(), but this only works if the first field (i.e., 'contactId') is unique.
To extend this solution for multiple foreign keys, create a map for each key field using the code above. Then, inside of this loop:
for contact in contacts:
contact.attributes = attributes_map.get(contact.pk)
add a line for each foreign key. For example,
attributes_map = dict(Attribute.objects
.filter(contactId__in=set([contact.pk for contact in contacts]))
.values_list('contactId', 'attributeValue'))
names_map = dict(Name.objects
.filter(nameId__in=set([contact.nameId for contact in contacts]))
.values_list('nameId', 'name'))
for contact in contacts:
contact.attributes = attributes_map.get(contact.pk)
contact.name = names_map.get(contact.nameId)
If you've ever had a template generate lots of extra SQL calls for each row of a result set (in order to look up each foreign key's value), this approach will make a huge savings by pre-loading all the data before sending it to the template.
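If a contact can have more than one attribute (the original one-to-many case), a variation on the same idea (a sketch, assuming the Contact/Attribute model names used in the answers above) is to group the values per contact, so nothing is lost when several attributes share a contactId:
from collections import defaultdict

# Build {contactId: [attributeValue, ...]}; a plain dict() over values_list
# keeps only one row per contactId, which drops data in the one-to-many case.
attributes_map = defaultdict(list)
rows = (Attribute.objects
        .filter(contactId__in=set(contact.pk for contact in contacts))
        .values_list('contactId', 'attributeValue'))
for contact_id, value in rows:
    attributes_map[contact_id].append(value)

for contact in contacts:
    contact.attributes = attributes_map.get(contact.pk, [])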
