Snowflake performance test - snowflake-cloud-data-platform

I have an ETL pipeline scheduled in Airflow; the Airflow DAG calls a Snowflake stored procedure.
The stored procedure reads data from a view and writes it into a table by performing a MERGE.
I am making some changes to the pipeline by rewriting the query in the view, specifically removing a filter from the view and applying it in the stored procedure instead.
How can I test this without using any cache in Snowflake?
I have tested with separate warehouses.
I have tested with ALTER SESSION SET USE_CACHED_RESULT=FALSE;
I have checked the view's query plans.
I have tested the new pipeline via the Airflow DAG in a non-prod environment, but I am not able to find the query ID of this run in the query history to check the query plan.
How can I get the query ID of the non-prod pipeline? And any suggestions on an easier way to test?

While running this kind of test you can set USE_CACHED_RESULT=FALSE at the user level, so you do not need to worry about setting it for every session.
ALTER USER <your_test_user> SET USE_CACHED_RESULT=FALSE;
Queries generated by external applications might not appear in your query history if the corresponding filter is disabled. You can enable it under Activity - Query History - Filters - Client generated statements. Also make sure that, if you use different usernames for Airflow and for the UI, the user filter is set accordingly.
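To actually fetch the query ID of the non-prod run, you can also look it up programmatically instead of relying on the UI. A minimal sketch, assuming the Airflow connection uses a dedicated service user (AIRFLOW_USER is a placeholder) and that you are looking for the MERGE issued by the stored procedure:
-- QUERY_HISTORY_BY_USER looks back up to 7 days and returns up to 100 rows by default
SELECT query_id, query_text, warehouse_name, start_time, total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_USER(
    USER_NAME => 'AIRFLOW_USER',
    END_TIME_RANGE_START => DATEADD('hour', -24, CURRENT_TIMESTAMP())))
WHERE query_text ILIKE '%merge%'
ORDER BY start_time DESC;
The query_id values returned can then be used to open the query profile and compare the old and new plans.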

Related

snowflake session_id on query_history, can it be reinitialized to have new id per application execution

I want to use the sessionId in Snowflake's query_history to find all the queries executed in one session. It works fine on the Snowflake side when I use different worksheets, which create different sessions. But from other tools (they appear to reuse the same connection pool until the connection is recreated), the same session ID shows up for multiple jobs in Snowflake's query_history. Is there a way to have a new session ID created on every execution? I am using the Control-M scheduling/job-automation tool to execute multiple jobs, each of which executes a different Snowflake stored proc. I want to see if I can get a different session ID for each execution of a procedure in the Snowflake query_history table.
Thanks
Djay
You can change the "idle session timeout" with a session policy; see the Snowflake documentation on session policies.
You can set it to as low as 5, which means any queries that are at least 5 minutes apart will need to reauthenticate and will have a new session.
CREATE [OR REPLACE] SESSION POLICY DO_NOT_IDLE SESSION_IDLE_TIMEOUT_MINS = 5;
Though I believe this will affect any applications that use your account, and will make your applications need to reauthenticate every time the session expires.
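A session policy is a schema-level object, so after creating it you also need to attach it. A short sketch, assuming you want it applied either account-wide or only to the service user that Control-M connects with (the user name is a placeholder):
-- Attach the policy account-wide (affects every user and application on the account)
ALTER ACCOUNT SET SESSION POLICY DO_NOT_IDLE;
-- Or scope it to the scheduler's service user only
ALTER USER CONTROL_M_USER SET SESSION POLICY DO_NOT_IDLE;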
Another option, if you need a window smaller than 5 minutes, is to get the session ID and explicitly run an ABORT_SESSION after your query has finished,
which would look something like this
SELECT SYSTEM$ABORT_SESSION(CURRENT_SESSION());

How to specify snowflake session parameters in Talend

I am using Talend to load data from Oracle to Snowflake. I am able to set up the load pipeline, but I wanted to set the query tag as part of the load pipeline so that I can do some analysis based on the tag. However, I could not find any way to specify the query tag along with query statements (ALTER SESSION SET QUERY_TAG='TALENDLOAD') in the load pipeline.
Is it that Talend does not allow to set the session parameters?
You need to first run ALTER SESSION SET MULTI_STATEMENT_COUNT=0; because the default value is 1, which allows only one statement per request in the JDBC and ODBC connectors (see the Snowflake documentation for MULTI_STATEMENT_COUNT).
Then you may pass ALTER SESSION SET QUERY_TAG='TALENDLOAD' along with other query statements.
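Put together, the statements you pass from Talend might look something like this sketch (the table names are placeholders):
-- Run this once on its own so the JDBC/ODBC connector accepts multi-statement batches
ALTER SESSION SET MULTI_STATEMENT_COUNT=0;
-- Then the tag can travel in the same batch as the load statements
ALTER SESSION SET QUERY_TAG='TALENDLOAD';
INSERT INTO target_table SELECT * FROM staging_table;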

AWS DataPipeline insert status with SQLActivity

I am looking for a way to record the status of the pipeline in a DB table; I assume this is a very common use case.
Is there any way I can record:
status and time of completion of the complete pipeline.
status and time of completion of selected individual activities.
the ID of individual runs/execution.
The only way I found was using an SqlActivity that depends on an individual activity, but even there I cannot access the status or timestamp of the parent node.
I am using a JDBC connection to connect to a remote SQL Server, and the pipeline copies S3 files into the SQL Server DB.
Hmmm... I haven't tried this, but I can give you some pointers that may get you the desired results. However, you will have to do the research and figure out the actual implementation.
Option 1
Create a ShellCommandActivity whose dependsOn is set to the last activity in your pipeline. Its shell script uses the AWS CLI (aws datapipeline list-runs) to fetch details of the current run; you can use filters to narrow it down.
Use staging data to move the output of the previous ShellCommandActivity into an SqlActivity, which eventually inserts it into the destination SQL Server.
Option 2
Use an AWS Lambda function to run aws datapipeline list-runs periodically, with filters, and update the destination table with the latest activities.
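Whichever option you choose, the status table that the SqlActivity (or the Lambda) writes to in the destination SQL Server might look something like this; the table, columns, and placeholder values are purely illustrative, not something Data Pipeline provides:
-- Hypothetical status table in the destination SQL Server database
CREATE TABLE pipeline_run_status (
    pipeline_id   VARCHAR(100),
    run_id        VARCHAR(100),
    activity_name VARCHAR(200),
    status        VARCHAR(50),
    finished_at   DATETIME2
);
-- One row per run, with values parsed from the list-runs output of the previous step
INSERT INTO pipeline_run_status (pipeline_id, run_id, activity_name, status, finished_at)
VALUES ('<pipeline_id>', '<run_id>', '<activity_name>', '<status>', SYSUTCDATETIME());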

Transmitting sessions id from SQL Server to web scripts

I have a bunch of stored procs doing the business-logic work in a SQL Server instance behind a web application. When something goes wrong, each of them queues a message in a specific table (let's say 'warnings'), keeping track of the error severity and issue description. I want the web application using the stored procs to be able to query the message table at the end of the run, but to get a list of only the relevant messages, i.e. the messages created during that specific session or connection. So I am in doubt whether to
have the web application send a GUID to the DB to INSERT as a column value in the message records (but somehow I have to keep track of it across several stored procs running at the page level, so I need something "global")
OR
use some ID related to the connection opened by the application - which would definitely be saner. Something like (this is pseudo code):
SELECT @sessionid = sessionid FROM sys.something WHERE this = that
Have some hints?
Thanks a lot for your ideas
To retrieve your current SessionId in SQL Server you can simply execute the following:
SELECT @@SPID
See:
http://msdn.microsoft.com/en-us/library/ms189535.aspx
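A minimal sketch of how @@SPID could serve as that connection-scoped value; the warnings columns and the @severity/@description parameters are assumptions based on the question:
-- Inside each stored procedure: tag the message with the current session id
INSERT INTO warnings (session_id, severity, description, created_at)
VALUES (@@SPID, @severity, @description, SYSUTCDATETIME());
-- At the end of the page run, on the same open connection, read back only this session's messages
SELECT severity, description, created_at
FROM warnings
WHERE session_id = @@SPID;
Note that with connection pooling the same @@SPID can be reused by later requests, so the read-back has to happen on the same open connection before it is returned to the pool.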

How do I configure cocoon to use a database as a store for quartz jobs and triggers

I'm using Cocoon and want to store the jobs and triggers for the quartz scheduler in the database so they are persisted. I can see where I need to make the change in cocoon.xconf but I can't find much on how to configure the datasource etc.
How do I configure this to use our existing (postgres) database?
You need to do 2 things:
Add the following configuration to quartz.properties with appropriate values substituted for the $ placeholders
org.quartz.jobStore.dataSource=myDS
org.quartz.dataSource.myDS.URL=$URL
org.quartz.dataSource.myDS.driver=$driver
org.quartz.dataSource.myDS.maxConnections=5
org.quartz.dataSource.myDS.password=$password
org.quartz.dataSource.myDS.user=$user
org.quartz.dataSource.myDS.validationQuery=$any query that doesn't return an error when properly connected
org.quartz.jobStore.tablePrefix=QREPL_
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
Create the database tables in which Quartz stores the job data - you should find a DDL script included in the Quartz distribution that will create them for you. Each of the Quartz table names should begin with the same prefix; in the configuration above, I've assumed this prefix is "QREPL_".
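As a quick sanity check after running the DDL script (with the table names adjusted to the QREPL_ prefix), you could verify that the tables exist in Postgres; this assumes the DDL creates them unquoted, so the names are stored in lowercase:
-- List the Quartz tables; they should all carry the configured prefix
SELECT table_name
FROM information_schema.tables
WHERE table_name LIKE 'qrepl%';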
Hope this helps,
Don
