Post-hook to DBT_CLOUD_PR schemas (dbt Cloud CI)

I was wondering if it is possible to attach a post-hook to the DBT_CLOUD_PR schemas (generated by dbt Cloud CI) so that only developers can see the PR tables generated in the database.
I would like to do something like the following:
dbt_cloud_pr:
  +post-hook:
    - "revoke select on {{ this }} from role reporter"
Right now, our dbt_cloud_pr schemas can be seen by multiple roles on Snowflake, and it clutters the database with some non-essential schemas that we would rather hide.
Thanks for your help!

This is a cool idea!
You can configure your dbt Cloud CI job to use a custom "target" name (it's on the Job > Edit Settings page). Let's say you set that to ci.
Then I think you should be able to add a post-hook like this in your dbt_project.yml:
models:
  +post-hook: "{% if target.name == 'ci' %} revoke select on {{ this }} from role reporter {% else %} select 1 {% endif %}"
If that if..else syntax isn't allowed in the post-hook itself, you can wrap it in a macro, and call that macro from the post-hook:
models:
  +post-hook: "{{ revoke_select_on_ci(this) }}"
And in your macros directory:
{%- macro revoke_select_on_ci(model, ci_target='ci') -%}
  {%- if target.name == ci_target -%}
    revoke select on {{ model }} from role reporter
  {%- else -%}
    select 1
  {%- endif -%}
{%- endmacro -%}
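For illustration, here is roughly what that hook compiles to on a CI run versus any other run (the relation name below is a made-up example of a PR schema):

-- on a CI run (target.name == 'ci')
revoke select on analytics.dbt_cloud_pr_123_45.my_model from role reporter

-- on any other run, a harmless no-op
select 1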

Related

Snowflake SQL UDF - Unexpected 'create' syntax error

I've created the following dbt macro
{% macro product_nums() %}
create function multiply1 (a number, b number)
returns number
language sql
as 'a * b';
{% endmacro %}
However, when I try to call it with the query
SELECT
{{ target.schema }}.multiply1(5,2)
I get the following error:
Database Error
001003 (42000): SQL compilation error:
syntax error line 22 at position 0 unexpected 'create'.
The macros docs show inline SQL in the macro, not the creation of an actual Snowflake SQL function.
Thus, for your case you should write:
{% macro product_nums(a, b) %}
({{a}} * {{b}})
{% endmacro %}
and use it like:
SELECT {{product_nums(5,2)}}
which will generate the SQL
SELECT (5 * 2)
to be run on the Snowflake servers. In other words, macros are text preprocessing; they don't create database objects.
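If you actually want a persistent Snowflake function managed by dbt, one pattern (a sketch, not part of the original answer) is to wrap the CREATE FUNCTION in a macro that executes it with run_query, and then call that macro from an on-run-start hook in dbt_project.yml:

{% macro create_multiply1() %}
  {% set ddl %}
    create or replace function {{ target.schema }}.multiply1 (a number, b number)
    returns number
    language sql
    as 'a * b'
  {% endset %}
  {# only run DDL during execution, not while dbt is parsing the project #}
  {% if execute %}
    {% do run_query(ddl) %}
  {% endif %}
{% endmacro %}

Registering it as on-run-start: "{{ create_multiply1() }}" makes the function available before any model that calls {{ target.schema }}.multiply1(...) runs.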

How to use jinja template with mix of params and airflow inbuilts

I am trying to create a SQL template for Snowflake that loads an S3 file using the SnowflakeOperator, where the S3 file path is provided as an XCom variable from an upstream task.
Here is an example template for SQL
create or replace temp table {{ params.raw_target_table }}_tmp
as
select *
from '#{{ params.stage_name }}/{{ params.get_s3_file }}'
file_format => '{{ params.file_format }}'
;
params.get_s3_file is set to use ti like {{{{ti.xcom_pull(task_ids="foo", key="file_uploaded_to_s3")}}}}
I understand that the xcom_pull will work if it is used directly in the template rather than coming from params, but I want it to be configurable so I can reuse it across multiple DAGs/tasks.
Ideally I want this to work
create or replace temp table {{ params.raw_target_table }}_tmp
as
select *
from '#{{ params.stage_name }}/{{ti.xcom_pull(task_ids="{{params.previous_task}}", key="file_uploaded_to_s3")}}'
file_format => '{{ params.file_format }}'. --note the nested structure
;
So it resolves params.previous_task first and then gets the xcom value. Not sure how to instruct it to do that.
When you use {{ <some code> }}, Jinja executes the code inside the braces at render time; it is plain Python code, not another template to be expanded.
{{ti.xcom_pull(task_ids="{{params.previous_task}}", key="file_uploaded_to_s3")}} will therefore try to pull the xcom with key file_uploaded_to_s3 from a task literally named {{params.previous_task}}, which doesn't exist. Instead of providing a string as task_ids, you can provide params.previous_task directly, and Jinja will replace it with the value of previous_task from the params dict:
create or replace temp table {{ params.raw_target_table }}_tmp
as
select *
from '#{{ params.stage_name }}/{{ti.xcom_pull(task_ids=params.previous_task, key="file_uploaded_to_s3")}}'
file_format => '{{ params.file_format }}'
;
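For illustration, with made-up values (raw_target_table = MY_TABLE, stage_name = MY_STAGE, the upstream task pushing data/file.csv, file_format = MY_CSV_FORMAT), the corrected template renders to plain SQL like this before it is sent to Snowflake:

create or replace temp table MY_TABLE_tmp
as
select *
from '#MY_STAGE/data/file.csv'
file_format => 'MY_CSV_FORMAT'
;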

How to convert delete+insert SQL into a dbt model

I'm learning dbt and would like to rewrite the following Snowflake procedure as a dbt model.
Unfortunately I don't know how to express SQL deletes/inserts in dbt.
Here is my procedure:
create or replace procedure staging.ingest_google_campaigns_into_master()
returns varchar
language sql
as
$$
begin
DELETE FROM GOOGLE_ADWORD_CAMPAIGN
WHERE DT IN (SELECT DISTINCT ORIGINALDATE AS DT FROM GOOGLEADWORDS_CAMPAIGN);
INSERT INTO GOOGLE_ADWORD_CAMPAIGN
SELECT DISTINCT *
FROM
(
SELECT g.* ,
YEAR(TO_TIMESTAMP(DATE_PART(EPOCH_SECOND, ORIGINALDATE::TIMESTAMP)::VARCHAR)) AS YEAR,
LPAD(MONTH(TO_TIMESTAMP(DATE_PART(EPOCH_SECOND, ORIGINALDATE::TIMESTAMP)::VARCHAR)),2,0) AS MONTH,
LPAD(DAY(TO_TIMESTAMP(DATE_PART(EPOCH_SECOND, ORIGINALDATE::TIMESTAMP)::VARCHAR)),2,0) AS DAY,
TO_DATE(DATE_PART(EPOCH_SECOND, ORIGINALDATE::TIMESTAMP)::VARCHAR) AS DT
FROM GOOGLEADWORDS_CAMPAIGN g
) t;
end;
$$
;
The procedure first removes old rows from the GOOGLE_ADWORD_CAMPAIGN table and then replaces them with fresh ones.
This pattern is called an "incremental" materialization in dbt. See the docs for more background.
On Snowflake, there are a few different "strategies" you can use for incremental materializations. One strategy is called delete+insert, which does exactly what your stored procedure does. Another option is merge, which may perform better.
Adapting your code to dbt would look like this:
{{
config(
materialized='incremental',
unique_key='ORIGINALDATE',
incremental_strategy='delete+insert',
)
}}
SELECT DISTINCT
g.* ,
YEAR(TO_TIMESTAMP(DATE_PART(EPOCH_SECOND, ORIGINALDATE::TIMESTAMP)::VARCHAR)) AS YEAR,
LPAD(MONTH(TO_TIMESTAMP(DATE_PART(EPOCH_SECOND, ORIGINALDATE::TIMESTAMP)::VARCHAR)),2,0) AS MONTH,
LPAD(DAY(TO_TIMESTAMP(DATE_PART(EPOCH_SECOND, ORIGINALDATE::TIMESTAMP)::VARCHAR)),2,0) AS DAY,
TO_DATE(DATE_PART(EPOCH_SECOND, ORIGINALDATE::TIMESTAMP)::VARCHAR) AS DT
FROM GOOGLEADWORDS_CAMPAIGN g
Note that this assumes there is just a single record in GOOGLEADWORDS_CAMPAIGN for each ORIGINALDATE -- if that is not true, you will want to substitute your own unique_key in the config block.
This also assumes that GOOGLEADWORDS_CAMPAIGN already contains a sliding date window. If that isn't the case, and you want to filter the dates contained in that table (and only update a subset of the data), you will want to add a conditional WHERE clause that only applies when the model is built in incremental mode:
...
FROM GOOGLEADWORDS_CAMPAIGN g
{% if is_incremental() %}
WHERE
DT >= (SELECT MAX(DT) FROM {{ this }}) - INTERVAL '7 DAYS'
{% endif %}
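If you would rather use the merge strategy mentioned above, only the config block changes (a sketch, keeping the same unique-key assumption as before; merge requires the key to be truly unique per row):

{{
    config(
        materialized='incremental',
        unique_key='ORIGINALDATE',
        incremental_strategy='merge',
    )
}}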

Storing the results from run query into a table in dbt

I am calling this stored procedure in dbt. How do I store the results in a temp table using a select statement?
{% set results= run_query('call mystoredproc()') %}
{% do log("Printing table" , info=True) %}
{% do results.print_table() %}
{% set sql %}
select * from results <<--- how to store the result into a temp table
{% endset %}
{% do run_query(create_table_as(True, tmp_relation, sql)) %}
You should use a materialization, which is dbt's strategy for persisting models in the warehouse. You can configure the materialization in the dbt_project.yml file or directly inside the SQL file (materialized can be 'table', 'view', 'incremental', or 'ephemeral'), for example:
{{ config(materialized='table', sort='timestamp', dist='user_id') }}
select *
from ...
For more info check the Materialization docs.
I ran into this problem when trying to create a table that I could join later on in the same model. Turned out all I needed to do was:
with (call mystoredproc())
as temp_table select ...
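If what you actually need is the stored procedure's result set in a temporary table, one Snowflake-specific pattern is RESULT_SCAN over the last query id. This is only a rough sketch (untested; it assumes both statements execute back-to-back on the same Snowflake session, and mystoredproc and tmp_proc_results are placeholders):

{% macro load_proc_results() %}
  {# run the procedure first; its output becomes the most recent result set on this session #}
  {% do run_query('call mystoredproc()') %}
  {# result_scan(last_query_id()) re-reads that result set so it can be persisted #}
  {# note: a temporary table only lives for the current session #}
  {% do run_query('create or replace temporary table tmp_proc_results as select * from table(result_scan(last_query_id()))') %}
{% endmacro %}

You could then invoke it ad hoc with dbt run-operation load_proc_results.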

dbt query to Snowflake resulting in an "invalid identifier" error for a column that exists

I've been pulling my hair out for several hours trying to understand what's going on, to no avail so far.
I've got this query on dbt:
{{
config(
materialized='incremental',
unique_key='event_ID'
)
}}
SELECT
{{ dbt_utils.star(from=ref('staging_pg_ahoy_events'), relation_alias='events', prefix='event_') }},
{{ dbt_utils.star(from=ref('staging_pg_ahoy_visits'), relation_alias='visits', prefix='visit_') }}
FROM
{{ ref('staging_pg_ahoy_events') }} AS events
LEFT JOIN {{ ref('staging_pg_ahoy_visits') }} AS visits ON events.visit_id = visits.id
{% if is_incremental() %}
WHERE "events"."event_ID" >= (SELECT max("events"."event_ID") FROM {{ this }})
{% endif %}
Along with this config:
version: 2
models:
  - name: facts_ahoy_events
    columns:
      - name: event_ID
        quote: true
        tests:
          - unique
          - not_null
dbt run -m facts_ahoy_events --full-refresh runs successfully; however, when I try an incremental run by dropping the --full-refresh flag, the following error ensues:
10:35:51 1 of 1 START incremental model DBT_PCOISNE.facts_ahoy_events.................... [RUN]
10:35:52 1 of 1 ERROR creating incremental model DBT_PCOISNE.facts_ahoy_events........... [ERROR in 0.88s]
10:35:52
10:35:52 Finished running 1 incremental model in 3.01s.
10:35:52
10:35:52 Completed with 1 error and 0 warnings:
10:35:52
10:35:52 Database Error in model facts_ahoy_events (models/marts/facts/facts_ahoy_events.sql)
10:35:52 000904 (42000): SQL compilation error: error line 41 at position 10
10:35:52 invalid identifier '"events"."event_ID"'
I've gotten used to the case-sensitive column names on Snowflake, but I can't for the life of me figure out what's going on, since the following query, run directly on Snowflake, completes:
select "event_ID" from DBT_PCOISNE.FACTS_AHOY_EVENTS limit 10;
Whereas this one expectedly fails:
select event_ID from DBT_PCOISNE.FACTS_AHOY_EVENTS limit 10;
I think I've tried every combination of upper, lower, and mixed casing, each with and without quoting, but all my attempts have failed.
Any help or insight would be greatly appreciated!
Thank you
Most probably your column event_ID was created with double quotes around it, which makes it a quoted, case-sensitive identifier. Referencing it then also requires double quotes, because Snowflake upper-cases all unquoted column names.
The solution is to either keep double quotes around the column name everywhere it is used, or rename it with an ALTER so the unquoted name matches.
For dbt you can read more here.
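For illustration, here is how Snowflake resolves quoted versus unquoted identifiers (t is just a throwaway example table):

create or replace temporary table t ("event_ID" number);  -- quoted: case is preserved

select event_ID  from t;   -- fails: unquoted names are folded to EVENT_ID
select "event_ID" from t;  -- works: matches the stored, case-sensitive name

-- renaming to an unquoted identifier removes the need for quoting
alter table t rename column "event_ID" to event_id;
select event_id from t;    -- now works without quotes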
