How can I overwrite data in Snowflake using Snowpipe?

I have created a Snowpipe in Snowflake, but I am unable to overwrite the data it loads.
Is there a way to delete or overwrite the existing data in the table before Snowpipe copies in the new data?

You can create a stored procedure that executes the following steps sequentially:
Truncate the table.
Execute the COPY command.
If you want to load the data at a specific time, you can schedule the stored procedure with a task.
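For illustration, a minimal Snowflake Scripting sketch of that pattern; the database, schema, table, stage, warehouse, and cron schedule names are all placeholders, not from the original question:

    CREATE OR REPLACE PROCEDURE reload_my_table()
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    BEGIN
        -- Step 1: clear out the previous load (placeholder table name)
        TRUNCATE TABLE my_db.my_schema.my_table;
        -- Step 2: load the new files from the stage (placeholder stage and file format)
        COPY INTO my_db.my_schema.my_table
            FROM @my_db.my_schema.my_stage
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
        RETURN 'reload complete';
    END;
    $$;

    -- Schedule the procedure with a task (warehouse and cron expression are placeholders)
    CREATE OR REPLACE TASK reload_my_table_task
        WAREHOUSE = my_wh
        SCHEDULE = 'USING CRON 0 6 * * * UTC'
    AS
        CALL reload_my_table();

    ALTER TASK reload_my_table_task RESUME;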

It would be normal practice to load the data from Snowpipe into a temporary/staging table and process it there; I doubt you can process it within Snowpipe itself.
Please can you explain in more detail how/why you want to overwrite data in Snowpipe? Can you provide examples of what you are trying to achieve?
Answer 2
Snowpipe is used for continuous data loading from a stream, so I'm not sure why you are using it for daily loads of a single dataset.
I would create a standard COPY INTO process and then wrap it in a stored procedure that handles the target table deletion as well. You can then schedule this to run daily.

Related

Snowflake - loading queries and controlling the execution of a sequence of steps

As part of our overall flow, data will be ingested into Azure Blob Storage from InfluxDB and a SQL database. The plan is to use Snowflake queries/stored procedures to load the data from blob storage into Snowflake on a schedule (batch process), using tasks to schedule and orchestrate the execution with Snowflake Scripting. A few questions:
Can dynamic queries be created and executed based on a config table? For example, a COPY command specifying the exact paths and files to load data from.
With Snowflake Scripting, as I understand it, a sequence of steps (queries / stored procedures) stored in a configuration DB can be executed in order along with some control mechanism.
Is it possible to send email notifications of error records by loading them into a table, or should this be handled outside of Snowflake after the data load, using Azure Data Factory / Logic Apps?
Is the above approach feasible, and are there any limitations in using it this way? Are there any alternative approaches that could be considered?
You can dynamically generate and execute queries with a stored procedure, and you can chain activities within an SP's logic or with linked tasks running separate SPs. There is no functionality within Snowflake that will generate emails.
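As a rough sketch of the dynamic-query part in Snowflake Scripting; the config table etl_config and its columns (stage_path, target_table, file_format_name, run_order) are assumptions for illustration, not part of the original question:

    CREATE OR REPLACE PROCEDURE run_config_copies()
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    DECLARE
        -- Hypothetical config table: one row per COPY command to run, in run_order
        c1 CURSOR FOR
            SELECT stage_path, target_table, file_format_name
            FROM etl_config
            ORDER BY run_order;
    BEGIN
        FOR rec IN c1 DO
            -- Build and execute each COPY command from the config row
            EXECUTE IMMEDIATE
                'COPY INTO ' || rec.target_table ||
                ' FROM '     || rec.stage_path ||
                ' FILE_FORMAT = (FORMAT_NAME = ''' || rec.file_format_name || ''')';
        END FOR;
        RETURN 'all copies executed';
    END;
    $$;

Such a procedure can then be run by a task, or split across several linked tasks if the steps need separate scheduling or error handling.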

Is it possible to run batch processing on a dynamic table in Flink?

Currently I run multiple variant-structured ETL jobs on the same table using the following steps:
Sync data from the RDBMS to the data warehouse continuously.
Run multiple ETL jobs at different times (against the data in the warehouse as of the corresponding point in time).
If it were possible to share the dynamic table across multiple ETL jobs running at different times, the first syncing step could be removed.
Here are a few options I can think of:
Use an external database (SQL or similar) to hold the dynamic config table. The table would be loaded every time your batch job runs.
A versioned table is also an option, as you may have explored already; see the sketch below.
Use Flink queryable state. You would need an external client to update the state, though.
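For the versioned-table option, a rough Flink SQL sketch; every table name, topic, and connector setting here is hypothetical. The changelog is declared as a versioned table, and each job joins against it FOR SYSTEM_TIME AS OF its own event time, so a job run later still sees the state as of the rows it processes:

    -- Versioned table built from a changelog stream (primary key + event time required)
    CREATE TABLE config_versioned (
        config_key   STRING,
        config_value STRING,
        update_time  TIMESTAMP(3),
        PRIMARY KEY (config_key) NOT ENFORCED,
        WATERMARK FOR update_time AS update_time
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'config-changelog',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'debezium-json'
    );

    -- Probe-side table (hypothetical), also with event time
    CREATE TABLE events (
        event_id   STRING,
        config_key STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    );

    -- Temporal join: each event sees the version that was valid at its event time
    SELECT e.event_id, c.config_value
    FROM events AS e
    LEFT JOIN config_versioned FOR SYSTEM_TIME AS OF e.event_time AS c
        ON e.config_key = c.config_key;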

Snowpipe for continuous ingestion of daily files arriving irregularly

I am new to Snowflake and we are working on a POC. The scenario: we get around 100 (.txt) files from our ERP system uploaded to an S3 bucket overnight. These files need to be loaded into staging tables and then into DW tables, with data transformations applied, in Snowflake. We are thinking of using Snowpipe to load the data from S3 into the staging tables, as file arrival from the ERP is not scheduled and could happen anytime within a window of four hours. The daily files are timestamped and contain the full data each day, so the staging tables would need to be truncated daily before ingesting that day's files.
However, a Snowpipe definition doesn't allow TRUNCATE/CREATE statements.
Please share your thoughts on this. Should we continue considering Snowpipe, or use a COPY command scheduled as a TASK to run at fixed intervals, say every 15 minutes?
Have you considered simply continuing to append the data to your staging tables, putting an append-only STREAM over each table, and then using tasks to load the downstream tables from the STREAM? The task could run every minute with a WHEN clause that checks whether there is data in the STREAM. This would load the data and push it downstream whenever it happens to land from your ERP.
Then, you can have a daily task that runs at any time during the day and checks that the STREAM contains NO data; if that's true, it can DELETE everything in the underlying table. This step is only needed to save storage, and because the STREAM is append-only, the DELETE statement does not create records in your STREAM.
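A rough sketch of that pattern, with placeholder names (stg_erp, dw_erp, my_wh) and the transformation columns left as stubs:

    -- Append-only stream over the staging table that Snowpipe loads into
    CREATE OR REPLACE STREAM stg_erp_stream ON TABLE stg_erp APPEND_ONLY = TRUE;

    -- Runs every minute, but only does work when the stream has data
    CREATE OR REPLACE TASK load_dw_erp
        WAREHOUSE = my_wh
        SCHEDULE = '1 MINUTE'
        WHEN SYSTEM$STREAM_HAS_DATA('STG_ERP_STREAM')
    AS
        INSERT INTO dw_erp (order_id, amount, load_ts)
        SELECT order_id, amount, CURRENT_TIMESTAMP()   -- placeholder columns; apply transformations here
        FROM stg_erp_stream;

    -- Daily cleanup: delete staged rows only when the stream has been fully consumed
    CREATE OR REPLACE PROCEDURE purge_stg_erp()
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    DECLARE
        has_data BOOLEAN;
    BEGIN
        has_data := (SELECT SYSTEM$STREAM_HAS_DATA('STG_ERP_STREAM'));
        IF (NOT has_data) THEN
            -- Append-only stream: this DELETE does not create stream records
            DELETE FROM stg_erp;
            RETURN 'staging table purged';
        END IF;
        RETURN 'stream still has data, skipped';
    END;
    $$;

    CREATE OR REPLACE TASK purge_stg_erp_task
        WAREHOUSE = my_wh
        SCHEDULE = 'USING CRON 0 3 * * * UTC'
    AS
        CALL purge_stg_erp();

    ALTER TASK load_dw_erp RESUME;
    ALTER TASK purge_stg_erp_task RESUME;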

Row-by-row processing of data from Redshift to Redshift

We are working on a requirement where we want to fetch incremental data from one Redshift cluster row by row, process it based on the requirement, and insert it into another Redshift cluster. We want to do it row-wise, not as a batch operation. For that we are writing a generic service that will do the row processing, so the flow is Redshift -> Service -> Redshift.
For inserting data we will use INSERT queries, committing after each batch rather than per row, for performance.
But I am a bit worried about the performance of many individual INSERT queries. Is there another tool available that does this? There are many ETL tools available, but they all do batch processing; we want to process row-wise. Can someone please suggest an approach?
Based on experience, I can guarantee that this approach will not be efficient. You can refer to this link for detailed best practices:
https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
Instead, I would suggest that you do the following:
Write a Python script to UNLOAD the data from your source Redshift cluster to S3, using a query condition that filters data as per your requirement, i.e. based on some threshold such as a time or date. This operation should be fast, and you can schedule the script to execute every minute or every couple of minutes, generating multiple files.
Now you basically have a continuous stream of files in S3, where the size of each file or batch can be controlled via the frequency of the previous script.
All you have to do then is set up a service that keeps polling S3 for objects/files as they are created, processes them as needed, and puts the processed files in another bucket. Let's call this bucket B2.
Set up another Python script / ETL step that remotely executes a COPY command from bucket B2 (a rough sketch of the UNLOAD/COPY pair is shown below).
This is just an initial idea, though. You will have to evolve and optimize this approach. Best of luck!
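For illustration, the UNLOAD and COPY ends of that pipeline might look roughly like this; the table names, bucket names, IAM role ARNs, and the timestamp threshold are all placeholders:

    -- First step (run against the source cluster on a schedule): unload filtered increments to S3
    UNLOAD ('SELECT * FROM source_table WHERE updated_at > ''2023-01-01 00:00:00''')
    TO 's3://source-bucket/incremental/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
    FORMAT AS CSV
    ALLOWOVERWRITE;

    -- Final step (run against the target cluster after the service writes processed files to bucket B2)
    COPY target_table
    FROM 's3://b2-processed-bucket/incremental/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS CSV;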

Do triggers create performance issues in production?

I am working on a project where I receive logs or data feeds from an external source as CSV and DAT files, and we have an SSIS package configured to load them. I want to create a trigger on the table to reconcile the table row count with the file row count. If I create triggers, will it lead to performance issues?
The average CSV/DAT file contains around 2.5 million records.
Yes, triggers will reduce your system performance by holding locks on the tables for longer. It is better to look at alternatives such as CDC, or to handle it manually through stored procedures or some other mechanism.
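For instance, instead of a trigger, the SSIS package could call a reconciliation procedure once after the load; a minimal T-SQL sketch, where all table, column, and procedure names are hypothetical:

    -- The SSIS package passes the file's row count after the load completes
    CREATE PROCEDURE dbo.usp_ReconcileLoad
        @FileRowCount BIGINT,
        @LoadDate     DATE
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @TableRowCount BIGINT;

        -- Count what actually landed in the staging table for this load (hypothetical table)
        SELECT @TableRowCount = COUNT_BIG(*)
        FROM dbo.StagingFeed
        WHERE LoadDate = @LoadDate;

        -- Log a mismatch instead of firing per-row trigger logic
        IF @TableRowCount <> @FileRowCount
            INSERT INTO dbo.LoadReconciliationLog (LoadDate, FileRowCount, TableRowCount, LoggedAt)
            VALUES (@LoadDate, @FileRowCount, @TableRowCount, SYSUTCDATETIME());
    END;

Because this runs once per load rather than once per row, it avoids the locking overhead that a trigger would add to a 2.5-million-row insert.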
