Snowflake - loading queries and controlling the execution of a sequence of steps

As part of our overall flow, data will be ingested into Azure Blob Storage from InfluxDB and a SQL database. The plan is to use Snowflake queries/stored procedures to load the data from blob storage into Snowflake on a schedule (batch process), with Tasks to schedule and orchestrate the execution using Snowflake Scripting. A few questions:
Can dynamic queries be created and executed based on a config table - e.g. a COPY command specifying the exact paths and files to load data from?
As part of Snowflake Scripting, our understanding is that a sequence of steps (queries / SPs) stored in a configuration DB can be executed in order, along with some control mechanism. Is that correct?
What are the possibilities for sending email notifications of error records by loading them into a table? Or should that be handled outside of Snowflake, after the data load, using Azure Data Factory / Logic Apps?
Is the above approach feasible, and are there any limitations to doing it this way? Are there alternate approaches worth considering?

You can dynamically generate and execute queries within a stored procedure. You can chain activities within a single SP's logic, or with linked tasks that each run a separate SP. There is no functionality within Snowflake that will generate emails, so notifications would need to be handled outside Snowflake (e.g. by Azure Data Factory or Logic Apps, as you suggest).
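As a rough sketch of that pattern (the config table, stage and file format names below are assumptions, not from the question), a Snowflake Scripting procedure can loop over a config table and EXECUTE IMMEDIATE a COPY statement built from each row:

    CREATE OR REPLACE PROCEDURE load_from_config()
      RETURNS STRING
      LANGUAGE SQL
    AS
    $$
    DECLARE
      -- hypothetical config table: one row per load, with target table, blob path and file pattern
      c1 CURSOR FOR
        SELECT target_table, stage_path, file_pattern
        FROM etl_config
        WHERE enabled = TRUE
        ORDER BY run_order;
      stmt STRING;
    BEGIN
      FOR rec IN c1 DO
        stmt := 'COPY INTO ' || rec.target_table ||
                ' FROM @azure_blob_stage/' || rec.stage_path ||
                ' PATTERN = ''' || rec.file_pattern || '''' ||
                ' FILE_FORMAT = (FORMAT_NAME = my_file_format)';
        EXECUTE IMMEDIATE :stmt;  -- dynamic SQL built from the config row
      END FOR;
      RETURN 'done';
    END;
    $$;

Chaining can then be done with tasks: a scheduled root task calls the procedure and child tasks run AFTER it:

    CREATE OR REPLACE TASK load_task
      WAREHOUSE = my_wh
      SCHEDULE = '60 MINUTE'
    AS
      CALL load_from_config();

    CREATE OR REPLACE TASK post_load_task
      WAREHOUSE = my_wh
      AFTER load_task
    AS
      CALL log_error_records();  -- hypothetical follow-up procedure, e.g. writing rejects to an error table

    -- tasks are created suspended; resume children before the root
    ALTER TASK post_load_task RESUME;
    ALTER TASK load_task RESUME;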

Related

Is it possible to run batch processing on a dynamic table in Flink?

Currently I run multiple variant-structured ETL jobs on the same table with the following steps:
sync data from the RDBMS to the data warehouse continuously.
run multiple ETL jobs at different times (against the data in the warehouse at the corresponding point in time).
If it were possible to share the dynamic table across multiple ETL jobs at different times, the first syncing step could be removed.
Here are a few options I can think of.
Use an external database (SQL or similar) to hold the dynamic config table; the table would be loaded every time your batch job runs (see the sketch after this list).
A versioned table is also an option, as you may have explored already.
Use Flink queryable state. You would need an external client to update the state, though.
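For the first option above, a minimal Flink SQL sketch of mapping an external config table through the JDBC connector (connection details and schema are placeholders, not from the question):

    -- Flink SQL: expose an external RDBMS table as a config source for the batch job
    CREATE TABLE etl_config (
      job_name    STRING,
      param_value STRING,
      updated_at  TIMESTAMP(3),
      PRIMARY KEY (job_name) NOT ENFORCED
    ) WITH (
      'connector'  = 'jdbc',
      'url'        = 'jdbc:mysql://config-host:3306/etl',  -- placeholder connection string
      'table-name' = 'etl_config'
    );

Each batch run then reads this table at job start, which matches the "load the table every time your batch job runs" behaviour.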

Trigger Snowflake task when data pushed to Snowflake stage

I have one Snowflake internal stage to which I push JSON files through SnowSQL. I then run some queries using the Snowflake UI. Currently it's all manual; is there any way to trigger a Snowflake task when I put data on the stage?
There's no way to trigger a Snowflake task, only to schedule it. You can, however, prevent a task from running on a particular schedule based on a condition. Right now, the only condition supported is SYSTEM$STREAM_HAS_DATA:
https://docs.snowflake.com/en/sql-reference/functions/system_stream_has_data.html
In any case, you don't need tasks to automate this pipeline; streams and tasks are more than this flow requires. If the Snowflake stage were external (S3, Azure Blob or GCP storage rather than a Snowflake internal stage), you could use Snowpipe to copy newly arriving files into a table.
https://docs.snowflake.com/en/user-guide/data-load-snowpipe.html
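For reference, the schedule-plus-condition pattern described above looks roughly like this; the table, stream and task names are made up for illustration:

    -- stream on the table the JSON files are copied into (hypothetical names)
    CREATE OR REPLACE STREAM json_stream ON TABLE raw_json_landing;

    -- the task still runs on a schedule, but the run is skipped when the stream is empty
    CREATE OR REPLACE TASK process_json_task
      WAREHOUSE = my_wh
      SCHEDULE  = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('JSON_STREAM')
    AS
      INSERT INTO parsed_json (id, doc)
      SELECT raw_doc:id::STRING, raw_doc  -- assumes raw_doc is a VARIANT column
      FROM json_stream;

    -- tasks are created suspended
    ALTER TASK process_json_task RESUME;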

Row processing data from Redshift to Redshift

We are working on a requirement where we want to fetch incremental data from one Redshift cluster "row wise", process it based on the requirement, and insert it into another Redshift cluster. We want to do it "row wise", not as a "batch operation". For that we are writing a generic service that will do row processing from Redshift to Redshift, i.e. Redshift -> Service -> Redshift.
For inserting data, we will use INSERT queries, committing after a particular batch rather than per row, for performance.
But I am a bit worried about the performance of many INSERT queries. Is there any other tool that does this? There are many ETL tools available, but they all do batch processing and we want to process row-wise. Can someone please advise?
Based on experience, I can guarantee that your approach will not be efficient. You can refer to this link for detailed best practices:
https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
Instead, I would suggest that you do the following:
Write a Python script to UNLOAD the data from your source Redshift cluster to S3 based on a query condition that filters data as per your requirement, i.e. on some threshold such as a time or date. This operation should be fast, and you can schedule the script to execute every minute or every couple of minutes, generating multiple files.
You now essentially have a continuous stream of files in S3, where the size of each file or batch can be controlled via the frequency of the previous script.
Set up a service that keeps polling S3 for objects/files as they are created, processes them as needed, and puts each processed file into another bucket; let's call it B2.
Set up another Python script / ETL step that remotely executes a COPY command from bucket B2 into the target cluster.
This is just an initial idea, though; you will have to evolve and optimize this approach (a rough sketch of the UNLOAD/COPY pieces follows). Best of luck!
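The UNLOAD and COPY halves of that flow boil down to something like the following; the bucket names, IAM role ARNs, tables and filter column are placeholders:

    -- step 1: incremental UNLOAD from the source cluster to S3, filtered by a threshold
    UNLOAD ('SELECT * FROM source_schema.events WHERE updated_at > ''2020-01-01 00:00:00''')
    TO 's3://my-raw-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
    DELIMITER '|' GZIP ALLOWOVERWRITE;

    -- final step: after the service writes processed files to bucket B2, COPY them into the target cluster
    COPY target_schema.events
    FROM 's3://my-processed-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    DELIMITER '|' GZIP;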

Hibernate: how to mirror specific data

I'm currently working on a project that uses Hibernate for persistence on top of databases of various types.
The solution consists of multiple servers with their own databases.
The challenge now is to build a server that receives all data from all the other servers to provide monitoring and reporting functionality. If data changes on one of the servers, it should (almost) instantly be sent to the monitoring server. Network latency and outages have to be handled.
I found two possible ways to monitor the data changes (insert, update, delete):
Hibernate Envers
Envers appears to be an auditing solution that builds a log of all modifications in individually created database tables. I could not find information on how to filter the data, which may become necessary in the future.
Hibernate Interceptor
The interceptor functionality (e.g. as described in the Mkyong blog entry) does almost the same as Envers, but gives me the possibility to use my own audit table to store the modifications and to filter the data by my own criteria if necessary.
My idea now is to:
store the modifications by serializing the data into the audit table (a sketch of such a table follows this list),
scan the table (e.g. every 30 seconds) for new entries,
transfer the entries (e.g. via HTTP upload) to the monitoring server,
import the data into the monitoring database using Hibernate.
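Purely as an illustration of the first step, the custom audit table the interceptor writes serialized modifications into could be as simple as this (hypothetical schema; exact column types vary by database):

    -- hypothetical audit/outbox table populated by the Hibernate interceptor
    CREATE TABLE audit_event (
      id           BIGINT       NOT NULL PRIMARY KEY,  -- identity or sequence, depending on the database
      entity_name  VARCHAR(255) NOT NULL,              -- Hibernate entity name
      entity_id    VARCHAR(255) NOT NULL,
      operation    VARCHAR(10)  NOT NULL,              -- INSERT / UPDATE / DELETE
      payload      TEXT         NOT NULL,              -- serialized entity state, e.g. JSON
      created_at   TIMESTAMP    NOT NULL,
      transferred  BOOLEAN      NOT NULL DEFAULT FALSE -- flipped once uploaded to the monitoring server
    );

The periodic scan then just selects rows where transferred is still false.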
My question is now:
Is there a better or easier way to solve this?

Importing Millions of Records with SSIS

Any tips for speeding up the import process? There are a lot of joins in the DB.
Also, when an SSIS task is completed, is the best way to handle the next steps through code or by using the email notification SSIS has?
Here is a sample I have used to illustrate loading 1 million rows from a text file into a SQL Server database in under 3 minutes. The package in the sample was created with SSIS 2008 R2 and executed on a single-core Xeon CPU at 2.5 GHz with 3.00 GB of RAM.
Import records on SSIS after lookup
One of the main bottlenecks when importing a large number of rows is the destination component: the faster the destination component can insert rows, the faster the preceding source and transformation components can process them. This changes if you have components like the Sort transformation, because Sort holds up all the data before sending it down the pipeline.
Sending email depends on what you would like to do.
If you need a simple success or failure notification, you can simply use the Send Mail task. Another option is to enable alert notifications on the SQL Agent job from which you might schedule the package to run on a regular basis.
If you need more information in the email, you might need a Script Task to formulate the message body. After creating the message body, you can send the mail from within the Script Task or use the Send Mail task.
Hope that example, along with the article @Nikhil S provided, helps you fine-tune your package.
This SimpleTalk article discusses ways to optimize your data flow task:
Horizontally partition the data to be transferred into N data flows, where N is the number of CPU cores available on the server where SSIS is installed (a sketch follows this list).
Play with the SSIS buffer size properties to figure out the optimal settings for your kind of data.
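As a rough illustration of the horizontal-partitioning suggestion above, each of the N data flows can get its own source query that takes a modulo slice of a numeric key (table and column names here are made up):

    -- OLE DB Source query for data flow 1 of 4; the other flows use % 4 = 1, 2 and 3
    SELECT *
    FROM dbo.SourceRecords
    WHERE RecordId % 4 = 0;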
