Snowflake procedure call through AWS Lambda function using Python code - snowflake-cloud-data-platform

Is it possible to call a Snowflake procedure (it contains a MERGE statement that copies data from a stage into the main Snowflake table) from an AWS Lambda function? I want the function to be triggered as soon as we push a file to the S3 stage. Snowpipe is an option for copying the data, but I don't want an extra table just to copy the data from the S3 bucket into the Snowflake stage and then merge it into the master table; instead, I have a MERGE statement that merges the data from the file in the S3 bucket directly into the Snowflake master table.

First I would create an external stage in Snowflake:
CREATE STAGE YOUR_STAGE_NAME
  URL = 's3://your-bucket-name'
  CREDENTIALS = (
    AWS_KEY_ID = 'xxx'
    AWS_SECRET_KEY = 'yyy'
  )
  FILE_FORMAT = (
    TYPE = CSV
    COMPRESSION = AUTO
    FIELD_DELIMITER = ','
    SKIP_HEADER = 0
    ....
  );
Then I would run a query against the CSV file in the stage from your Python Lambda script (pulling the file name from the event payload):
WITH file_data AS (
  SELECT t.$1, t.$2, <etc..> FROM @YOUR_STAGE_NAME/path/your-file.csv t
)
MERGE INTO MASTER_TABLE USING file_data ON <......>
You should be able to wrap this in a stored procedure if you so desire. There is also syntax to skip creating the named external stage and reference the bucket name and credentials directly in the query, but I wouldn't recommend that, since it would expose your AWS credentials in plain text in the stored procedure definition.
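To tie this back to the original question, here is a minimal sketch of the Lambda side, assuming snowflake-connector-python is bundled with the function, credentials come from environment variables, and the column names and join key in the MERGE are only placeholders:

import os
import snowflake.connector

def lambda_handler(event, context):
    # Pull the key of the file that was just uploaded from the S3 event payload
    key = event['Records'][0]['s3']['object']['key']

    conn = snowflake.connector.connect(
        user=os.environ['SF_USER'],
        password=os.environ['SF_PASSWORD'],
        account=os.environ['SF_ACCOUNT'],
        warehouse=os.environ['SF_WAREHOUSE'],
        database=os.environ['SF_DATABASE'],
        schema=os.environ['SF_SCHEMA'],
    )
    try:
        # Merge the newly staged file straight into the master table
        conn.cursor().execute(f"""
            MERGE INTO MASTER_TABLE m
            USING (
                SELECT t.$1 AS id, t.$2 AS val
                FROM @YOUR_STAGE_NAME/{key} t
            ) f
            ON m.id = f.id
            WHEN MATCHED THEN UPDATE SET m.val = f.val
            WHEN NOT MATCHED THEN INSERT (id, val) VALUES (f.id, f.val)
        """)
    finally:
        conn.close()

If you prefer to keep the MERGE inside a stored procedure, the handler can simply execute a CALL of that procedure with the file name instead.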

Related

Snowflake: Copy from S3 into Table with nested JSON

Requirement: load a nested JSON file from S3 into Snowflake.
Error: SQL compilation error: COPY statement only supports simple SELECT from stage statements for import.
I know I can create a temporary table from the SQL. Is there a better way to load directly from S3 into Snowflake?
COPY INTO schema.table_A FROM (
  WITH s3 AS (
    SELECT $1 AS json_array
    FROM '@public.stage'
      (file_format => 'public.json',
       pattern => 'abc/xyz/.*')
  )
  SELECT DISTINCT
    CURRENT_TIMESTAMP() AS exec_t,
    json_array AS data,
    json_array:id AS id,
    json_array:code::text AS code
  FROM s3, TABLE(FLATTEN(s3.json_array)) f
);
Basically, transformations during loading come with certain limitations, see here: https://docs.snowflake.com/en/user-guide/data-load-transform.html#transforming-data-during-a-load
If you still want to keep your code and not apply the transformations later, you may create a view on top of the stage and then simply INSERT into another table based on a SELECT * from that view.
Maybe avoiding the CTE already helps.
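If you instead go the other route and apply the transformations after loading, a rough sketch of that two-step variant (the landing table name is made up, and it is shown here through the Python connector used elsewhere on this page):

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='xxx', database='xxx', schema='public')
cur = conn.cursor()

# Step 1: a plain COPY of the raw JSON into a landing table with a single VARIANT column
cur.execute("""
    COPY INTO schema.table_A_raw
    FROM '@public.stage'
    FILE_FORMAT = (FORMAT_NAME = 'public.json')
    PATTERN = 'abc/xyz/.*'
""")

# Step 2: transform out of the landing table, where CTEs, DISTINCT and FLATTEN are all allowed
cur.execute("""
    INSERT INTO schema.table_A
    SELECT DISTINCT
        CURRENT_TIMESTAMP() AS exec_t,
        t.json_array AS data,
        t.json_array:id AS id,
        t.json_array:code::text AS code
    FROM schema.table_A_raw t, TABLE(FLATTEN(t.json_array)) f
""")
conn.close()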

Is there a way to automatically unload data in Snowflake to a local machine using tasks or stored procedures?

I would like to automatically unload my data using a task or stored procedure, but it didn't work out; it seems the GET command wasn't run. Is there a way to fix it? Thank you.
CREATE OR REPLACE STREAM unload_dimads_stream
ON TABLE "FA_PROJECT01_DB"."ADSBI"."DIM_ADS";

CREATE OR REPLACE TASK unload_dimads_task1
WAREHOUSE = FA_PROJECT01_CLOUDDW
SCHEDULE = '1 minute'
WHEN SYSTEM$STREAM_HAS_DATA('unload_dimads_stream')
AS
COPY INTO @adsbi.%dim_ads FROM AdsBI.Dim_Ads file_format = (TYPE=CSV FIELD_DELIMITER = '|' BINARY_FORMAT = 'UTF-8' compression=none) header=true overwrite=true;

CREATE OR REPLACE TASK unload_dimads_task2
WAREHOUSE = FA_PROJECT01_CLOUDDW
AFTER unload_dimads_task1
WHEN SYSTEM$STREAM_HAS_DATA('unload_dimads_stream')
AS
GET @adsbi.%dim_ads file://D:\unload\table1;

ALTER TASK unload_dimads_task2 RESUME;
ALTER TASK unload_dimads_task1 RESUME;

GET @adsbi.%dim_ads file://D:\unload\table1;
You cannot run this step in a stored procedure or task. Stored procedures and tasks run on Snowflake's servers, while GET downloads files to the client machine that issues it. You can script GET operations outside of Snowflake using the SnowSQL client, but something else will need to call that script to run it.
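As a rough sketch of that approach, the GET could be scripted on the client machine and scheduled with cron or Windows Task Scheduler rather than with a Snowflake task. The Python connector can also issue PUT/GET from the client side, so the same idea looks roughly like this (connection details are placeholders; object names and the local path come from the question):

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='FA_PROJECT01_CLOUDDW',
                                    database='FA_PROJECT01_DB', schema='ADSBI')
cur = conn.cursor()

# Server-side step: unload the table into its table stage (essentially the same COPY as in the task)
cur.execute("""
    COPY INTO @adsbi.%dim_ads FROM AdsBI.Dim_Ads
    FILE_FORMAT = (TYPE=CSV FIELD_DELIMITER='|' BINARY_FORMAT='UTF-8' COMPRESSION=NONE)
    HEADER = TRUE OVERWRITE = TRUE
""")

# Client-side step: download the unloaded files to the local drive
cur.execute("GET @adsbi.%dim_ads file://D:/unload/table1")
conn.close()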

How to use a COPY with a Storage Integration in a Snowflake task statement?

I'm testing Snowflake. To do this I created a Snowflake instance on GCP.
One of the tests is to try a daily load of data from a STORAGE INTEGRATION.
To do that I generated the STORAGE INTEGRATION and the stage.
I tested the copy:
copy into DEMO_DB.PUBLIC.DATA_BY_REGION from @sg_gcs_covid pattern='.*data_by_region.*'
and all goes fine.
Now it's time to test the daily scheduling with the task statement.
I created this task:
CREATE TASK schedule_regioni
WAREHOUSE = COMPUTE_WH
SCHEDULE = 'USING CRON 42 18 9 9 * Europe/Rome'
COMMENT = 'Test Schedule'
AS
copy into DEMO_DB.PUBLIC.DATA_BY_REGION from @sg_gcs_covid pattern='.*data_by_region.*';
And I enabled it:
alter task schedule_regioni resume;
I got no errors, but the task doesn't load any data.
To resolve the issue I had to put the COPY in a stored procedure and call the stored procedure in the task instead of the COPY:
DROP TASK schedule_regioni;
CREATE TASK schedule_regioni
WAREHOUSE = COMPUTE_WH
SCHEDULE = 'USING CRON 42 18 9 9 * Europe/Rome'
COMMENT = 'Test Schedule'
AS
call sp_upload_c19_regioni();
The question is: is this desired behavior or an issue (as I suppose)?
Can someone give me some information about this?
I've just tried it (but with the storage integration and stage on AWS S3) and it works fine, also when using the COPY command inside the SQL part of the task, without calling a stored procedure.
In order to start investigating the issue, I would check the following (for debugging I would create the task with a schedule of every few minutes):
Check task_history and verify the executions:
select *
from table(information_schema.task_history(
scheduled_time_range_start=>dateadd('hour',-1,current_timestamp()),
result_limit => 100,
task_name=>'YOUR_TASK_NAME'));
If the previous step is successful, check copy_history and verify that the input file name, target table, and number of records/errors are the expected ones:
SELECT *
FROM TABLE (information_schema.copy_history(TABLE_NAME => 'YOUR_TABLE_NAME',
start_time=> dateadd(hours, -1, current_timestamp())))
ORDER BY 3 DESC;
Check whether the results are the same as when the task with the stored procedure call is executed.
Please also confirm that you are loading new files that have not yet been loaded into your table with the COPY command (otherwise you need to specify the FORCE = TRUE parameter in the COPY command, or remove the load metadata by truncating your target table, to reload the same files).
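For that last point, if you deliberately want to reload files that were already loaded, the COPY from the question can be rerun with FORCE = TRUE, for example (sketched via the Python connector for consistency with the rest of this page; connection details are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='COMPUTE_WH', database='DEMO_DB', schema='PUBLIC')
conn.cursor().execute("""
    COPY INTO DEMO_DB.PUBLIC.DATA_BY_REGION
    FROM @sg_gcs_covid
    PATTERN = '.*data_by_region.*'
    FORCE = TRUE
""")
conn.close()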

In the tutorial "Tutorial: Bulk Loading from a local file system using copy" what is the difference between my_stage and my_table permissions?

I started to go through the first tutorial for how to load data into Snowflake from a local file.
This is what I have set up so far:
CREATE WAREHOUSE mywh;
CREATE DATABASE Mydb;
Use Database mydb;
CREATE ROLE ANALYST;
grant usage on database mydb to role sysadmin;
grant usage on database mydb to role analyst;
grant usage, create file format, create stage, create table on schema mydb.public to role analyst;
grant operate, usage on warehouse mywh to role analyst;
//tutorial 1 loading data
CREATE FILE FORMAT mycsvformat
TYPE = "CSV"
FIELD_DELIMITER= ','
SKIP_HEADER = 1;
CREATE FILE FORMAT myjsonformat
TYPE="JSON"
STRIP_OUTER_ARRAY = true;
//create stage
CREATE OR REPLACE STAGE my_stage
FILE_FORMAT = mycsvformat;
//Use snowsql for this and make sure that the role, db, and warehouse are selected: put file:///data/data.csv @my_stage;
// put file on stage
PUT file://contacts.csv @my
List @~;
list @%mytable;
Then in my active Snowsql when I run:
Put file:///Users/<user>/Documents/data/data.csv @my_table;
I have confirmed I am in the correct role ACCOUNTADMIN, but I get this error:
002003 (02000): SQL compilation error:
Stage 'MYDB.PUBLIC.MY_TABLE' does not exist or not authorized.
So then I try to create the table in Snowsql and am successful:
create or replace table my_table(id varchar, link varchar, stuff string);
I still run into this error after I run:
Put file:///Users/<>/Documents/data/data.csv @my_table;
002003 (02000): SQL compilation error:
Stage 'MYDB.PUBLIC.MY_TABLE' does not exist or not authorized.
What is the difference between putting a file to my_table versus my_stage in this scenario? Thanks for your help!
EDIT:
CREATE OR REPLACE TABLE myjsontable(json variant);
COPY INTO myjsontable
FROM @my_stage/random.json.gz
FILE_FORMAT = (TYPE= 'JSON')
ON_ERROR = 'skip_file';
CREATE OR REPLACE TABLE save_copy_errors AS SELECT * FROM TABLE(VALIDATE(myjsontable, JOB_ID=>'enterid'));
SELECT * FROM SAVE_COPY_ERRORS;
//error for random: Error parsing JSON: invalid character outside of a string: '\\'
//no error for generated
SELECT * FROM Myjsontable;
REMOVE @My_stage pattern = '.*.csv.gz';
REMOVE @My_stage pattern = '.*.json.gz';
//yay you are done!
The put command copies the file from your local drive to the stage. You should do the put to the stage, not to the table.
put file:///Users/<>/Documents/data/data.csv @my_stage;
The copy command loads it from the stage.
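A compact sketch of that flow with the tutorial's names, run through the Python connector (the local path and credentials are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='mywh', database='mydb', schema='public')
cur = conn.cursor()

# 1. Upload the local file into the named stage (PUT compresses it to .gz by default)
cur.execute("PUT file:///Users/<user>/Documents/data/data.csv @my_stage")

# 2. Load from the stage into the table
cur.execute("""
    COPY INTO my_table
    FROM @my_stage/data.csv.gz
    FILE_FORMAT = (FORMAT_NAME = 'mycsvformat')
""")
conn.close()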
But in the documentation it is mentioned that one gets created by default for every table:
Each table has a Snowflake stage allocated to it by default for storing files. This stage is a convenient option if your files need to be accessible to multiple users and only need to be copied into a single table.
Table stages have the following characteristics and limitations:
Table stages have the same name as the table; e.g. a table named mytable has a stage referenced as @%mytable
So in this case, without creating a stage, it should load into the default Snowflake stage allocated to the table.
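Note that the default table stage is addressed with the % prefix: @my_table is interpreted as a (non-existent) named stage, which is exactly the error above, while @%my_table is the table stage. A sketch of using the table stage via the Python connector (placeholders as before):

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='mywh', database='mydb', schema='public')
cur = conn.cursor()

# PUT into the table stage @%my_table (not @my_table, which Snowflake resolves as a named stage)
cur.execute("PUT file:///Users/<user>/Documents/data/data.csv @%my_table")

# COPY from the table stage into the table
cur.execute("""
    COPY INTO my_table
    FROM @%my_table
    FILE_FORMAT = (FORMAT_NAME = 'mycsvformat')
""")
conn.close()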

I am trying to run multiple query statements created when using the python connector with the same query id

I have created a Python function which creates multiple query statements.
Once it creates the SQL statement, it executes it (one at a time).
Is there any way to bulk run all the statements at once (assuming I was able to create all the SQL statements and wanted to execute them once all the statements were generated)? I know there is an execute_stream method in the Python connector, but I think this requires a file to be created first. It also appears to run a single query statement at a time.
Since this question was missing an example of the file, here is some file content I have provided as an extra that we can work from.
# connection test file for python multiple queries
import snowflake.connector

conn = snowflake.connector.connect(
    user='xxx',
    password='',
    account='xxx',
    warehouse='xxx',
    database='TEST_xxx',
    session_parameters={
        'QUERY_TAG': 'Rachel_test',
    }
)

cur = conn.cursor()
try:
    cur.execute("CREATE WAREHOUSE IF NOT EXISTS tiny_warehouse_mg")
    cur.execute("CREATE DATABASE IF NOT EXISTS testdb_mg")
    cur.execute("USE DATABASE testdb_mg")
    cur.execute(
        "CREATE OR REPLACE TABLE "
        "test_table(col1 integer, col2 string)")
    cur.execute(
        "INSERT INTO test_table(col1, col2) VALUES " +
        " (123, 'test string1'), " +
        " (456, 'test string2')")
    print(cur.sfqid)  # query id of the last statement executed on this cursor
except Exception as e:
    conn.rollback()
    raise e
finally:
    conn.close()
The reference for this question refers to a method that works with a file, and the example in the documentation is as follows:
from codecs import open

with open(sqlfile, 'r', encoding='utf-8') as f:
    for cur in con.execute_stream(f):
        for ret in cur:
            print(ret)
Reference to guide I used
Now when I ran these, they were not perfect, but in practice I was able to execute multiple SQL statements in one connection, just not many at once. Each statement had its own query id. Is it possible to have a .sql file associated with one query id?
Is it possible to have a .sql file associated with one query id?
You can achieve that effect with the QUERY_TAG session parameter. Set the QUERY_TAG to the name of your .sql file before executing its queries. Access the .sql file's QUERY_IDs later using the QUERY_TAG field in QUERY_HISTORY().
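A rough sketch of that idea with the Python connector (the file name, statements, and credentials are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='xxx', database='TEST_xxx')
cur = conn.cursor()

# Tag every statement in this session with the name of the .sql file
cur.execute("ALTER SESSION SET QUERY_TAG = 'my_batch.sql'")

with open('my_batch.sql', 'r', encoding='utf-8') as f:
    for c in conn.execute_stream(f):
        for row in c:
            print(row)

# Later: look up all query ids that ran under that tag
cur.execute("""
    SELECT query_id, query_text
    FROM TABLE(information_schema.query_history())
    WHERE query_tag = 'my_batch.sql'
    ORDER BY start_time
""")
print(cur.fetchall())
conn.close()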
I believe that even though you generated the .sql file, each statement will get its own unique query id when it executes in Snowflake.
If you want to run one SQL statement independently of the others, you may try the multiprocessing/multithreading concepts in Python, as in the sketch below.
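A minimal sketch of that suggestion (each thread gets its own connection so the executions stay fully independent; statements and credentials are placeholders):

import threading
import snowflake.connector

def run(statement):
    # One connection per thread keeps the executions independent of each other
    conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                       warehouse='xxx', database='TEST_xxx')
    try:
        conn.cursor().execute(statement)
    finally:
        conn.close()

statements = [
    "CREATE OR REPLACE TABLE t1 (c1 integer)",
    "CREATE OR REPLACE TABLE t2 (c1 integer)",
]
threads = [threading.Thread(target=run, args=(s,)) for s in statements]
for t in threads:
    t.start()
for t in threads:
    t.join()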
The Python and Node.js libraries do not allow multiple statement executions.
I'm not sure about Python, but for Node.js there is a library that extends the original one and adds a method called "ExecutionAll" to it:
snowflake-multisql
You just need to wrap multiple statements with the BEGIN and END.
BEGIN
<statement_1>;
<statement_2>;
END;
With these operators, I was able to execute multiple statements in Node.js.
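For reference, what BEGIN ... END gives you here is a Snowflake Scripting anonymous block, which is submitted as a single statement. From the Python connector the same idea would look roughly like this; depending on the connector/server version you may need the EXECUTE IMMEDIATE $$ ... $$ wrapper shown here (table names are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(user='xxx', password='xxx', account='xxx',
                                    warehouse='xxx', database='TEST_xxx')
# The whole block is sent to Snowflake as one statement
conn.cursor().execute("""
EXECUTE IMMEDIATE $$
BEGIN
    CREATE OR REPLACE TABLE t1 (c1 INTEGER);
    INSERT INTO t1 (c1) VALUES (1), (2);
END;
$$
""")
conn.close()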
