Snowpipe for continuous ingestion of daily files arriving irregularly - snowflake-cloud-data-platform

I am new to Snowflake and we are working on a POC. The scenario: we get around 100 (.txt) files from our ERP system uploaded to an S3 bucket overnight. We need these files loaded into staging tables and then into DW tables in Snowflake, with data transformations applied. We are thinking of using Snowpipe to load the data from S3 into the staging tables, as file arrival from the ERP is not scheduled and could happen anytime within a four-hour window. The daily files are timestamped and contain the full data set each day, so we would need the staging tables to be truncated daily before ingesting that day's files.
But a Snowpipe definition doesn't allow TRUNCATE/CREATE statements.
Please share your thoughts on this. Should we continue with Snowpipe, or try a COPY command scheduled as a TASK to run at fixed intervals, say every 15 minutes?
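For reference, a minimal sketch of the "COPY scheduled as a TASK" alternative; the stage, table, warehouse, and file-format settings are placeholders, and the ERP files are assumed to be pipe-delimited text:

    -- Minimal sketch of the "COPY scheduled as a TASK" alternative.
    -- Stage, table, warehouse, and file-format settings are placeholders.
    CREATE OR REPLACE TASK load_erp_staging
      WAREHOUSE = load_wh
      SCHEDULE  = '15 MINUTE'
    AS
      COPY INTO staging.erp_daily
      FROM @erp_s3_stage
      FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1);

    -- Tasks are created suspended; resume to start the schedule.
    ALTER TASK load_erp_staging RESUME;

The task body is a single SQL statement, so the daily TRUNCATE would typically go in a predecessor task or a stored procedure called by the task.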

Have you considered just continually appending the data to your staging table, putting an append-only STREAM over that table, and then using tasks to load downstream tables from the STREAM? The task could run every minute with a WHEN clause that checks whether data is in the STREAM. This would load the data and push it downstream whenever it happens to land from your ERP.
Then you can have a daily task, running at any time during the day, that checks the STREAM to make sure there is NO DATA in it, and if that's true, DELETEs everything in the underlying table. This step is only needed to save storage, and because the STREAM is append-only, the DELETE statement does not create records in your STREAM.
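A minimal sketch of that pattern, assuming placeholder names (stg_erp for the staging table, dw_erp for the downstream table, load_wh for the warehouse):

    -- Append-only stream over the staging table.
    CREATE OR REPLACE STREAM stg_erp_stream ON TABLE stg_erp APPEND_ONLY = TRUE;

    -- Push new rows downstream whenever the stream has data.
    CREATE OR REPLACE TASK push_to_dw
      WAREHOUSE = load_wh
      SCHEDULE  = '1 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('STG_ERP_STREAM')
    AS
      INSERT INTO dw_erp
      SELECT col1, col2          -- transformations would go here
      FROM stg_erp_stream;

    -- Daily cleanup: empty the staging table only once the stream is drained.
    -- Because the stream is append-only, this DELETE creates no stream records.
    CREATE OR REPLACE TASK purge_stg_erp
      WAREHOUSE = load_wh
      SCHEDULE  = 'USING CRON 0 12 * * * UTC'
      WHEN NOT SYSTEM$STREAM_HAS_DATA('STG_ERP_STREAM')
    AS
      DELETE FROM stg_erp;

    ALTER TASK push_to_dw RESUME;
    ALTER TASK purge_stg_erp RESUME;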

Related

Should I sync two databases using only incremental files, or should I use a combination of incremental + full sync (at the end of every month)?

I have a table in my system which needs to stay in sync with a third-party table at a daily interval. The table has more than 50 million rows and every day less than 1% of them get updated or created, so we perform an incremental sync where they send the delta data to us each day. This is based on timestamps, and I was wondering whether we should also have a monthly full data sync to make sure everything is in order. I'm guessing this would act as a validation of sorts, confirming that any data missed during the daily syncs is picked up by the monthly one.
The full data sync is obviously painful to do (how do you easily update 50 million rows in your relational database?), but assuming it is feasible, should incremental data syncs really be backed up by monthly full syncs to ensure completeness, or is this just over-engineering a simple problem that should work off the bat?
EDIT: It is not DB replication; we modify a few columns in the source data and also add a few custom attributes of our own to each row.
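For what it's worth, applying the daily delta is usually just an upsert of the staged delta rows; a rough sketch with made-up table and column names:

    -- Rough sketch: apply the day's delta to the local copy.
    -- delta_stage holds the rows the third party sent; all names are illustrative.
    MERGE INTO target_table AS t
    USING delta_stage AS d
      ON t.id = d.id
    WHEN MATCHED THEN UPDATE SET
      t.col_a      = d.col_a,
      t.col_b      = d.col_b,
      t.updated_at = d.updated_at
    WHEN NOT MATCHED THEN
      INSERT (id, col_a, col_b, updated_at)
      VALUES (d.id, d.col_a, d.col_b, d.updated_at);

A monthly full sync would be the same statement run against the complete extract rather than the delta.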

Reload specific files in an external stage

I'm loading CSVs from S3 into a table in Snowflake using COPY INTO. The table is truncated each time the process runs (data is persisted in a subsequent staging table). If the COPY INTO finishes but the job fails before loading into the persistent staging table, the records are lost on the next load because the COPY INTO command will skip the already-loaded files.
Our archive process applies to files more than one day old, so I can't switch to a FORCE load temporarily as irrelevant files would be loaded.
Manually reducing the load to just the missing files isn't ideal, as we have 100+ tables which are partitioned by table name in S3.
Can anyone suggest any other approaches?
I would consider changing your process to copy the files to both the staging location and your archive location at the same time, and then leverage the PURGE option in your COPY INTO. This way errored files stick around for the next run, and you still have a full archive available.
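A rough sketch of that COPY with PURGE enabled (stage, path, table, and file-format settings are placeholders):

    -- Files are removed from the stage only after a successful load,
    -- so anything that errors out remains in place for the next run.
    COPY INTO staging_table
    FROM @my_s3_stage/table_name/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    PURGE = TRUE;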

Row processing data from Redshift to Redshift

We are working on a requirement where we want to fetch incremental data from one Redshift cluster row-wise, process it based on the requirement, and insert it into another Redshift cluster. We want to do it row-wise, not as a batch operation. For that we are writing a generic service which will do row processing from Redshift to Redshift, so it is like Redshift -> Service -> Redshift.
For inserting the data we will use INSERT queries, committing after each batch rather than per row for performance.
But I am a bit worried about the performance of many individual INSERT queries. Is there any other tool available that does this? There are many ETL tools available, but they all do batch processing and we want to process row-wise. Can someone please advise?
Based on experience, I can guarantee that your approach will not be efficient. You can refer to this link for detailed best practices:
https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
Instead, I would suggest the following (a rough SQL sketch of the UNLOAD and COPY steps follows below):
1. Write a Python script to UNLOAD the data from your source Redshift to S3 based on a query condition that filters data per your requirement, i.e. on some threshold like a time or date. This operation should be fast, and you can schedule the script to execute every minute or every couple of minutes, generating multiple files.
2. You now effectively have a continuous stream of files in S3, where the size of each file or batch can be controlled by the frequency of the previous script.
3. Set up a service that keeps polling S3 for objects/files as they are created, processes them as needed, and puts the processed files in another bucket. Let's call this bucket B2.
4. Set up another Python script/ETL step that remotely executes a COPY command from bucket B2 into the target cluster.
This is just an initial idea, though. You will have to evolve and optimize this approach. Best of luck!
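A rough sketch of the UNLOAD and COPY pieces of that flow in Redshift SQL; the bucket names, IAM roles, and the timestamp filter are illustrative assumptions:

    -- Step 1 sketch: unload the incremental slice from the source cluster to S3.
    UNLOAD ('SELECT * FROM source_table WHERE updated_at > ''2024-01-01 00:00:00''')
    TO 's3://source-bucket/incremental/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
    DELIMITER '|' GZIP ALLOWOVERWRITE;

    -- Step 4 sketch: after the service writes processed files to bucket B2,
    -- load them into the target cluster.
    COPY target_table
    FROM 's3://b2-processed-bucket/incremental/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    DELIMITER '|' GZIP;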

Where should transaction records go? Flat file or Database

I'm developing a Java Enterprise Application which needs to write transaction records either to flat files or directly to a relational database. Transaction records show when a transaction started and ended, its status (success/failure), and data unique to that transaction.
These transaction records will then be used to generate reports. The report-generating tool reads its data from a database.
If flat files are used, the records will eventually have to be loaded into the database for report generation, which adds an extra step.
If the database is used directly, there is no flat file. My concern is that if the database is down, some records will be lost, so this approach seems less safe than the flat-file one.
So, I cannot decide. Maybe there are other things I didn't consider? What's your view?
Thanks in advance.
If you DO use a flat file, you'll need to worry about locking and flushing and all of that garbage. Furthermore, it can only live in one place, which makes it a pain if you ever want the app to scale. Go with the database unless downtime is a REALLY big concern.

Upload large amounts of data to production sql server with minimal contention

I am running a web site that helps manage lots of information for medical clinics. Part of the application needs to upload patient files from an Excel spreadsheet. The patient table has about 1 million records, and an Excel import needs to insert or update 10k, 20k, or 30k patient records at a time, all while other customers are pounding the table. Processing time is less important than reducing contention on the database. What strategies would you recommend?
I know other sites effectively do this. Salesforce allows you to upload large amounts of data at once.
Load the Excel sheet into a staging table first, then decide whether to update/insert the rows in a single batch or in smaller chunks.
Typically, inserting a million rows from one table to another should be quick enough to run while the server is under load. You will have a lock during the insert, but it should be a matter of seconds. Unless you are loading billions of records a minute, or your upsert operation is very intensive, I don't see it being a problem.
If your upsert is very complex, there are a number of ways to do it. You can insert everything in a single batch but mark the production records as incomplete while their subordinate records are updated, or you can mark the staging rows as unprocessed and process them in batches.
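A rough sketch of that staging-then-batch pattern in T-SQL; all object names are invented:

    -- After bulk-loading the spreadsheet into dbo.PatientStaging,
    -- upsert into the live table in one set-based pass.
    MERGE dbo.Patient AS p
    USING dbo.PatientStaging AS s
      ON p.PatientId = s.PatientId
    WHEN MATCHED THEN
      UPDATE SET p.FirstName = s.FirstName,
                 p.LastName  = s.LastName,
                 p.UpdatedAt = SYSUTCDATETIME()
    WHEN NOT MATCHED THEN
      INSERT (PatientId, FirstName, LastName, UpdatedAt)
      VALUES (s.PatientId, s.FirstName, s.LastName, SYSUTCDATETIME());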
If each row update is independent, run a loop that gets a row, updates the table, gets another row, and so on.
Then you can put a delay in the loop to slow it down and avoid impacting the main site (some sort of load metric could be used to adjust this on the fly). Some kind of token-ring-like setup could be used to make several updaters throttle together.
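A batched sketch of that throttled loop in T-SQL; the batch size, delay, and all object names are arbitrary, and the same idea works one row at a time:

    -- Apply staged rows to the live table in small batches,
    -- pausing between batches to limit contention.
    DECLARE @batch INT = 500;
    DECLARE @ids TABLE (PatientId INT PRIMARY KEY);

    WHILE 1 = 1
    BEGIN
        DELETE @ids;

        -- Pick the next batch of unprocessed staging rows.
        INSERT INTO @ids (PatientId)
        SELECT TOP (@batch) PatientId
        FROM dbo.PatientStaging
        WHERE Processed = 0
        ORDER BY PatientId;

        IF @@ROWCOUNT = 0 BREAK;

        BEGIN TRAN;

        -- Update existing patients from the staged batch
        -- (new rows would be inserted in a similar per-batch step).
        UPDATE p
        SET    p.FirstName = s.FirstName,
               p.LastName  = s.LastName
        FROM   dbo.Patient AS p
        JOIN   dbo.PatientStaging AS s ON s.PatientId = p.PatientId
        JOIN   @ids AS i ON i.PatientId = s.PatientId;

        UPDATE s
        SET    s.Processed = 1
        FROM   dbo.PatientStaging AS s
        JOIN   @ids AS i ON i.PatientId = s.PatientId;

        COMMIT;

        WAITFOR DELAY '00:00:01';   -- throttle; a load metric could tune this on the fly
    END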
