How to use a Jinja template with a mix of params and Airflow built-ins - snowflake-cloud-data-platform

I am trying to create a SQL template for Snowflake that loads an S3 file using the SnowflakeOperator, where the S3 file path is provided as an XCom value from an upstream task.
Here is an example SQL template:
create or replace temp table {{ params.raw_target_table }}_tmp
as
select *
from '@{{ params.stage_name }}/{{ params.get_s3_file }}'
file_format => '{{ params.file_format }}'
;
params.get_s3_file is set to use ti, e.g. {{ ti.xcom_pull(task_ids="foo", key="file_uploaded_to_s3") }}
I understand that the xcom_pull call works if it is written directly in the template rather than supplied through params, but I want it to be configurable so I can reuse the template across multiple DAGs/tasks.
Ideally I would want something like this to work:
create or replace temp table {{ params.raw_target_table }}_tmp
as
select *
from '@{{ params.stage_name }}/{{ ti.xcom_pull(task_ids="{{ params.previous_task }}", key="file_uploaded_to_s3") }}'
file_format => '{{ params.file_format }}' --note the nested structure
;
The idea is that it resolves params.previous_task first and then pulls the XCom value. I am not sure how to instruct Jinja to do that.

When you use {{ <some code> }}, Jinja executes the code at render time, so whatever is inside the braces is plain Python code (not another template) evaluated at that point.
{{ ti.xcom_pull(task_ids="{{params.previous_task}}", key="file_uploaded_to_s3") }} will therefore try to pull the XCom with key file_uploaded_to_s3 from a task literally named {{params.previous_task}}, which doesn't exist. Instead of providing a string literal as task_ids, you can provide params.previous_task directly, and Jinja will replace it with the value of previous_task from the params dict:
create or replace temp table {{ params.raw_target_table }}_tmp
as
select *
from '@{{ params.stage_name }}/{{ ti.xcom_pull(task_ids=params.previous_task, key="file_uploaded_to_s3") }}'
file_format => '{{ params.file_format }}'
;
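For completeness, here is a hedged sketch of what the operator side might look like so that the template above resolves: the SQL file is passed as the templated sql argument, and everything configurable, including the upstream task id, goes into params. The DAG id, task ids, connection id, and file names below are made up for illustration.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="load_s3_to_snowflake",          # hypothetical DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    load_raw = SnowflakeOperator(
        task_id="load_raw_table",
        snowflake_conn_id="snowflake_default",   # assumed connection id
        sql="load_s3_file.sql",                  # the template shown above
        params={
            "raw_target_table": "raw_events",
            "stage_name": "my_stage",
            "file_format": "my_csv_format",
            # name of the upstream task that pushed key="file_uploaded_to_s3"
            "previous_task": "foo",
        },
    )
Because only the params dict changes between DAGs/tasks, the same .sql template can be reused everywhere.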

Related

Snowflake: Copy from S3 into Table with nested JSON

Requirement: load a nested JSON file from S3 into Snowflake.
Error: SQL compilation error: COPY statement only supports simple SELECT from stage statements for import.
I know I can create a temporary table from the SQL; is there a better way to load directly from S3 into Snowflake?
COPY INTO schema.table_A FROM (
WITH s3 AS (
SELECT $1 AS json_array
FROM @public.stage
(file_format => 'public.json',
pattern => 'abc/xyz/.*')
)
SELECT DISTINCT
CURRENT_TIMESTAMP() AS exec_t,
json_array AS data,
json_array:id AS id,
json_array:code::text AS code
FROM s3,TABLE(Flatten(s3.json_array)) f
);
Transformations during loading come with certain limitations; see https://docs.snowflake.com/en/user-guide/data-load-transform.html#transforming-data-during-a-load
If you still want to keep your transformations rather than apply them after loading, you may create a view on top of the stage and then INSERT into the target table based on a SELECT from that view.
Avoiding the CTE may already help, since COPY transformations only accept a simple SELECT from the stage. A hedged sketch of that direction is below.
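For illustration, a minimal sketch of that direction using the Python connector: a plain INSERT ... SELECT against the stage, skipping the CTE and the FLATTEN join (the intermediate view the answer mentions could be layered on top in the same way). Connection placeholders are dummies; the stage, file format, and target table names are taken from the question.
import snowflake.connector

# Dummy credentials; stage, file format, and target table come from the question.
conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='xxx', database='xxx')
try:
    conn.cursor().execute("""
        INSERT INTO schema.table_A
        SELECT
            CURRENT_TIMESTAMP() AS exec_t,
            $1                  AS data,
            $1:id               AS id,
            $1:code::text       AS code
        FROM @public.stage (file_format => 'public.json', pattern => 'abc/xyz/.*')
    """)
finally:
    conn.close()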

Using variable arrays in models

Is it possible to define an array in the vars section and use it inside the SQL syntax of a model?
Something like this
dbt_project.yml:
vars:
active_country_codes: ['it','ge']
model.sql
SELECT ...
FROM TABLE WHERE country_code IN ('{{ var("active_country_codes") }}')
I've tried with a single value, i.e. ['it'], and it works, but if I add another value it starts failing.
I am using the SQL Server Data connector.
The query that you are writing is correct. You just need to pass the variable as a single string, with the quotes and commas embedded in it as literal characters.
vars:
active_country_codes: 'it'',''ge'
You can do something like this :
SELECT ...
FROM TABLE WHERE country_code IN ('{{ var("active_country_codes") }}')
And it will generate the query for you like this:
SELECT ...
FROM TABLE WHERE country_code IN ('it','ge')
I have tested this and it works fine. I'm using a BigQuery connection, but that shouldn't matter since the substitution is done by dbt.
My educated guess is that {{ var("active_country_codes") }} inserts a comma-separated string. In that case, you'll need a string-splitting function. You will have to roll your own if you haven't already, unless you are on SQL Server 2016 or later, in which case you can use string_split. Below is code using it. I use the exists approach as opposed to in for performance reasons.
select ...
from table t
where exists (
select 0
from string_split('{{ var("active_country_codes") }}', ',') ss
where t.country_code = ss.value
)
I would use:
vars:
var_name: "'one','two','three'"
where field_name in ({{ var("var_name") }})
Looks a little bit clearer than:
active_country_codes: 'it'',''ge'

Snowflake procedure call through AWS Lambda function using Python code

Is it possible to call a Snowflake procedure (it has a MERGE statement which copies data from a stage to the Snowflake main table) from an AWS Lambda function? I want the function to be triggered as soon as we push a file to the S3 stage. We have the Snowpipe option to copy the data, but I don't want another intermediate table just to copy the data from the S3 bucket to the Snowflake stage and then merge it into the master table; instead, I have a MERGE statement which merges the data from the file in the S3 bucket directly into the Snowflake master table.
First I would create an external stage in Snowflake:
CREATE STAGE YOUR_STAGE_NAME
URL='s3://your-bucket-name'
CREDENTIALS = (
AWS_KEY_ID='xxx'
AWS_SECRET_KEY='yyy'
)
FILE_FORMAT = (
TYPE=CSV
COMPRESSION=AUTO,
FIELD_DELIMITER=',',
SKIP_HEADER=0,
....
);
Then I would run a query against your CSV file in the stage from your Python Lambda script (pulling the file name from the Lambda event payload):
WITH file_data AS (
select t.$1, t.$2, <etc..> from @YOUR_STAGE_NAME/path/your-file.csv t
)
MERGE INTO MASTER_TABLE USING FILE_DATA ON <......>
You should be able to wrap this into a stored procedure if you so desire. There is also syntax that skips creating the named external stage and references the bucket name and credentials inside the query call, but I wouldn't recommend that, since it would expose your AWS credentials in plaintext in the stored procedure definition. A rough sketch of the Lambda side follows.
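A hedged sketch of that Lambda handler, assuming the snowflake-connector-python package is bundled with the function and using made-up table/column names: pull the uploaded file's key from the S3 event payload and run the MERGE (or CALL your stored procedure) against the stage.
import snowflake.connector

def lambda_handler(event, context):
    # key of the file that was just uploaded, taken from the S3 put-event payload
    key = event["Records"][0]["s3"]["object"]["key"]

    conn = snowflake.connector.connect(user='xxx', password='yyy', account='xxx',
                                        warehouse='xxx', database='xxx', schema='xxx')
    try:
        conn.cursor().execute(f"""
            MERGE INTO MASTER_TABLE m
            USING (
                SELECT t.$1 AS col1, t.$2 AS col2          -- etc.
                FROM @YOUR_STAGE_NAME/{key} t
            ) f
            ON m.col1 = f.col1                             -- your join condition
            WHEN MATCHED THEN UPDATE SET m.col2 = f.col2
            WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (f.col1, f.col2)
        """)
        # or, if the MERGE lives in a stored procedure:
        # conn.cursor().execute("CALL your_merge_procedure(%s)", (key,))
    finally:
        conn.close()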

How to use inline file format to query data from stage in Snowflake data warehouse

Is there any way to query data from a stage with an inline file format without copying the data into a table?
When using a COPY INTO table statement, I can specify an inline file format:
COPY INTO <table>
FROM (
SELECT ...
FROM @my_stage/some_file.csv
)
FILE_FORMAT = (
TYPE = CSV,
...
);
However, the same thing doesn't work when running the same select query directly, outside of the COPY INTO command:
SELECT ...
FROM @my_stage/some_file.csv
(FILE_FORMAT => (
TYPE = CSV,
...
));
Instead, the best I can do is to use a pre-existing file format:
SELECT ...
FROM @my_stage/some_file.csv
(FILE_FORMAT => 'my_file_format');
But this doesn't allow me to programmatically change the file format when creating the query. I've tried every syntax variation possible, but this just doesn't seem to be supported right now.
I don't believe it is possible, but as a workaround, can't you create the file format programmatically, use that named file format in your SQL, and then, if necessary, drop it? A sketch of that workaround is below.
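A minimal sketch of that workaround with the Python connector, assuming dummy credentials and the stage/file from the question: create a uniquely named file format, use it in the staged-file query, and drop it afterwards.
import uuid

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='xxx', database='xxx', schema='xxx')
fmt_name = f"tmp_fmt_{uuid.uuid4().hex}"   # unique, throwaway format name
cur = conn.cursor()
try:
    # Build the format from whatever options your code decides at runtime.
    cur.execute(f"CREATE FILE FORMAT {fmt_name} TYPE = CSV SKIP_HEADER = 1")
    cur.execute(f"SELECT $1, $2 FROM @my_stage/some_file.csv (FILE_FORMAT => '{fmt_name}')")
    print(cur.fetchall())
finally:
    cur.execute(f"DROP FILE FORMAT IF EXISTS {fmt_name}")
    conn.close()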

I am trying to run multiple query statements created when using the python connector with the same query id

I have created a Python function which creates multiple query statements.
Once it creates a SQL statement, it executes it (one at a time).
Is there any way to bulk-run all the statements at once (assuming I was able to create all the SQL statements and wanted to execute them only after all the statements were generated)? I know there is execute_stream in the Python connector, but I think this requires a file to be created first, and it also appears to run a single query statement at a time.
Since this question was missing an example of the file, here is file content I have provided as an extra that we can work from:
# connection test file for python multiple queries
import snowflake.connector

conn = snowflake.connector.connect(
    user='xxx',
    password='',
    account='xxx',
    warehouse='xxx',
    database='TEST_xxx',
    session_parameters={
        'QUERY_TAG': 'Rachel_test',
    }
)

cur = conn.cursor()
try:
    cur.execute("CREATE WAREHOUSE IF NOT EXISTS tiny_warehouse_mg")
    cur.execute("CREATE DATABASE IF NOT EXISTS testdb_mg")
    cur.execute("USE DATABASE testdb_mg")
    cur.execute(
        "CREATE OR REPLACE TABLE "
        "test_table(col1 integer, col2 string)")
    cur.execute(
        "INSERT INTO test_table(col1, col2) VALUES "
        "(123, 'test string1'), "
        "(456, 'test string2')")
    print(cur.sfqid)  # query id of the last statement executed on this cursor
except Exception as e:
    conn.rollback()
    raise e

conn.close()
The documentation referenced in this question describes a method that works from a file; the example in the documentation is as follows:
from codecs import open
with open(sqlfile, 'r', encoding='utf-8') as f:
    for cur in con.execute_stream(f):
        for ret in cur:
            print(ret)
Reference to guide I used
Now, when I ran these, they were not perfect, but in practice I was able to execute multiple SQL statements in one connection, just not many at once. Each statement had its own query id. Is it possible to have a .sql file associated with one query id?
Is it possible to have a .sql file associated with one query id?
You can achieve that effect with the QUERY_TAG session parameter. Set QUERY_TAG to the name of your .sql file before executing its queries, then look up the .sql file's query IDs later using the QUERY_TAG field in QUERY_HISTORY().
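A hedged sketch of that approach, assuming a file name and dummy credentials: tag the session with the .sql file's name, stream the file, then look up the query IDs by tag.
from codecs import open

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='xxx', database='xxx')
sqlfile = 'my_statements.sql'   # assumed file name

# Tag everything that follows in this session with the file name.
conn.cursor().execute(f"ALTER SESSION SET QUERY_TAG = '{sqlfile}'")
with open(sqlfile, 'r', encoding='utf-8') as f:
    for cur in conn.execute_stream(f):
        for ret in cur:
            print(ret)

# Later: every query id that came from that file.
cur = conn.cursor()
cur.execute(
    "SELECT query_id, query_text"
    " FROM TABLE(information_schema.query_history())"
    " WHERE query_tag = %s",
    (sqlfile,),
)
print(cur.fetchall())
conn.close()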
I believe that even though you generated the .sql file, each statement will still get a unique query id when it executes in Snowflake.
If you want to run one SQL statement independently of another, you may try the multiprocessing/multithreading features in Python, sketched below.
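A minimal sketch of that multithreading idea, assuming the generated statements are independent of each other and dummy credentials; the connector documents DB-API thread safety level 2, so a single connection can be shared while each thread runs its statement on its own cursor (use one connection per thread if you prefer full isolation).
from concurrent.futures import ThreadPoolExecutor

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='xxx', database='xxx')

# Stand-ins for the statements your Python function generated.
statements = [
    "CREATE OR REPLACE TABLE t1 (c integer)",
    "CREATE OR REPLACE TABLE t2 (c integer)",
]

def run(sql):
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.sfqid          # each statement still gets its own query id
    finally:
        cur.close()

with ThreadPoolExecutor(max_workers=4) as pool:
    query_ids = list(pool.map(run, statements))

print(query_ids)
conn.close()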
The Python and Node.js libraries do not allow multiple statement executions.
I'm not sure about Python, but for Node.js there is a library that extends the original one and adds a method called "ExecutionAll" to it:
snowflake-multisql
You just need to wrap the multiple statements with BEGIN and END:
BEGIN
<statement_1>;
<statement_2>;
END;
With these keywords, I was able to execute multiple statements in Node.js.
