We are using Snowpipe to load CSV data into our Snowflake table.
We got a requirement to add a new datetime field (DATAUPLOADTIME) to the table. Our process sometimes takes hours, so the client wants to see the start time, finish time, and data upload time in the Snowflake table.
We have implemented the start time and finish time, but not the data upload time, because we are not sure how to capture it.
Please guide/advise.
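One way to capture the upload time is to stamp each row with CURRENT_TIMESTAMP() inside the pipe's COPY transformation. This is only a sketch: the table, stage, file-format, and column names below are placeholders you would replace with your own objects.

```sql
-- Add the new column to the target table (hypothetical names throughout).
ALTER TABLE MY_TABLE ADD COLUMN DATAUPLOADTIME TIMESTAMP_NTZ;

-- Recreate the pipe with a transforming COPY that stamps each row at load time.
CREATE OR REPLACE PIPE MY_PIPE AUTO_INGEST = TRUE AS
  COPY INTO MY_TABLE (COL1, COL2, DATAUPLOADTIME)
  FROM (
    SELECT $1, $2, CURRENT_TIMESTAMP()  -- recorded when Snowpipe loads the row
    FROM @MY_STAGE
  )
  FILE_FORMAT = (FORMAT_NAME = MY_CSV_FORMAT);
```

Because Snowpipe runs the COPY for you, every row it ingests gets the timestamp of its own load, independent of your process's start and finish times.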
I am new to Snowflake and we are working on a POC. The scenario is that we get around 100 (.txt) files from our ERP system uploaded to an S3 bucket overnight. We need these files loaded into staging tables and then into DW tables, with data transformations applied, in Snowflake. We are thinking of using Snowpipe to load the data from S3 into the staging tables, as file arrival from the ERP is not scheduled and could happen anytime within a four-hour window. The daily files are timestamped and contain the full data set each day, so the staging tables need to be truncated daily before ingesting that day's files.
But a Snowpipe definition doesn't allow TRUNCATE/CREATE statements.
Please share your thoughts on this. Should we continue with Snowpipe, or instead schedule a COPY command as a TASK to run at fixed intervals, say every 15 minutes?
Have you considered continually appending the data to your staging tables, putting an append-only STREAM over each table, and then using tasks to load the downstream tables from the STREAM? The task could run every minute with a WHEN clause that checks whether the STREAM contains data. This would load the data and push it downstream whenever it happens to land from your ERP.
Then you can have a daily task, run at any time of day, that checks that the STREAM is empty and, if so, DELETEs everything in the underlying table. This step is only needed to save storage, and because the STREAM is append-only, the DELETE statement does not create records in your STREAM.
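A minimal sketch of this pattern, using hypothetical object names (STG_ERP, DW_TARGET, MY_WH) and placeholder columns:

```sql
-- Append-only stream over the staging table.
CREATE OR REPLACE STREAM STG_ERP_STREAM ON TABLE STG_ERP APPEND_ONLY = TRUE;

-- Every-minute task that only fires when the stream has data.
CREATE OR REPLACE TASK LOAD_DW
  WAREHOUSE = MY_WH
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('STG_ERP_STREAM')
AS
  INSERT INTO DW_TARGET (COL1, COL2)
  SELECT COL1, COL2
  FROM STG_ERP_STREAM;  -- consuming the stream in DML advances its offset

-- Daily cleanup, run from a separate daily task only after confirming
-- SYSTEM$STREAM_HAS_DATA('STG_ERP_STREAM') returns FALSE. With an
-- append-only stream, this DELETE adds no records to the stream.
DELETE FROM STG_ERP;
```

Remember that newly created tasks are suspended; you would run `ALTER TASK LOAD_DW RESUME;` to start them.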
I have a SQL Server database where millions of rows are inserted, deleted, or updated every day. I'm supposed to propose an ETL solution to transfer data from this database to a data warehouse. At first I tried working with CDC and SSIS, but the company I work for wants a more real-time solution. I did some research and discovered stream processing. I also looked for Spark and Flink tutorials but didn't find anything suitable.
My question is: which stream processing tool should I choose, and how do I learn to work with it?
Open Source Solution
You can use Kafka with Confluent's integration tooling to track insert and update operations using a load timestamp. This automatically gives you, in near real time, the rows that are inserted or updated in the database. If your database uses soft deletes, those can also be tracked using the load timestamp and an active/inactive flag.
If there are no such flags, you need logic to determine which partition might have been updated that day and send that entire partition into the stream, which is definitely resource-intensive.
Paid Solution
There is a paid tool called Striim that can provide real-time change data capture (CDC) for your system.
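For the open-source route, a typical configuration for Confluent's JDBC source connector in timestamp mode might look like the fragment below. All connection details, table names, and the timestamp column are placeholders for illustration.

```json
{
  "name": "sqlserver-timestamp-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:sqlserver://dbhost:1433;databaseName=mydb",
    "connection.user": "connect_user",
    "connection.password": "********",
    "mode": "timestamp",
    "timestamp.column.name": "load_timestamp",
    "table.whitelist": "dbo.orders",
    "topic.prefix": "cdc-",
    "poll.interval.ms": "1000"
  }
}
```

In timestamp mode the connector repeatedly queries each table for rows whose timestamp column is newer than the last value it saw, so inserts and updates flow into Kafka without touching the application. Note that hard deletes are invisible to this approach, which is why the soft-delete flag matters.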
We have an ETL process that ingests data every 5 minutes from different source systems (AS400, Oracle, SAP, etc.) into our SQL Server database, and from there we ingest data into an Elasticsearch index every 5 minutes so that both stay in sync.
I want to tighten the timeframe to seconds rather than 5 minutes, and I want to make sure the two are in sync at all times.
I am using a control log table to make sure the Elasticsearch ingestion and the SSIS ETL are not running at the same time, which would put us out of sync. This is a poor solution and does not allow me to achieve near-real-time data capture.
I am looking for a better solution to sync the SQL Server database and the Elasticsearch index in near real time rather than doing it manually.
Note: I am currently using Python scripts to pump the data from SQL Server to the Elasticsearch index.
One approach would be to have an event stream coming out of your database, or even directly out of the SSIS package run (which might actually be simpler to implement), that feeds directly into your Elasticsearch index. The ELK stack handles streaming log files, so it should handle an event stream well.
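Since the pump is already a Python script, the per-event path into Elasticsearch can be kept simple: turn each changed row into a bulk-API action and hand the actions to `elasticsearch.helpers.bulk`. The field names below are hypothetical; only the pure transformation is shown, so it can be tested without a live cluster.

```python
# Sketch: convert change-event rows from SQL Server into actions for the
# Elasticsearch bulk API. The generator is pure; the pump script would pass
# its output to elasticsearch.helpers.bulk(es_client, actions).

def rows_to_bulk_actions(rows, index_name):
    """Yield one bulk 'index' action per changed row, keyed by primary key."""
    for row in rows:
        yield {
            "_op_type": "index",        # index-by-_id gives upsert semantics
            "_index": index_name,
            "_id": row["id"],           # assumes a single-column primary key
            "_source": {k: v for k, v in row.items() if k != "id"},
        }

actions = list(rows_to_bulk_actions(
    [{"id": 1, "name": "widget", "qty": 5}], "products"))
print(actions[0]["_id"], actions[0]["_source"]["name"])  # → 1 widget
```

Because each action carries the row's primary key as `_id`, re-delivering the same event is harmless: Elasticsearch simply overwrites the document, which makes the stream safe to replay.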
We are working on a requirement to fetch incremental data from one Redshift cluster row by row, process it as needed, and insert it into another Redshift cluster. We want to do this row-wise, not as a batch operation. For that we are writing a generic service that does row processing from Redshift to Redshift, i.e. Redshift -> Service -> Redshift.
For inserting data, we will use INSERT queries, committing after each batch rather than per row, for performance.
But I am a bit worried about the performance of many individual INSERT queries. Is there another tool available that does this? There are many ETL tools available, but they all do batch processing, and we want to process row-wise. Can someone please advise?
I can guarantee from experience that your approach will not be efficient. You can refer to this link for detailed best practices:
https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
Instead, I would suggest the following:
1. Write a Python script that unloads data from your source Redshift cluster to S3, using a query condition that filters the data as required, e.g. by a threshold such as a time or date. This operation should be fast, and you can schedule the script to run every minute or every couple of minutes, generating multiple files.
2. You now effectively have a continuous stream of files in S3, where the size of each file or batch is controlled by the frequency of the previous script.
3. Set up a service that polls S3 for objects/files as they are created, processes them as needed, and puts each processed file in another bucket; call this bucket B2.
4. Set up another Python script/ETL step that remotely executes a COPY command from bucket B2 into the target cluster.
This is just an initial idea; you will have to evolve and optimize this approach. Best of luck!
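The unload (step 1) and reload (step 4) ends of this pipeline could look roughly like the SQL below. Bucket names, IAM role ARNs, table names, and the filter predicate are all placeholders; the scheduling script would substitute the real high-water-mark value into the UNLOAD query each run.

```sql
-- Step 1 sketch: incremental UNLOAD from the source cluster. The timestamp
-- literal stands in for a high-water mark tracked by the scheduling script.
UNLOAD ('SELECT * FROM source_table WHERE updated_at > ''2024-01-01 00:00:00''')
TO 's3://my-raw-bucket/increment_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
PARALLEL ON;

-- Step 4 sketch: bulk-load the processed files from bucket B2 into the target
-- cluster. One COPY of many files is far cheaper than many row-wise INSERTs.
COPY target_table
FROM 's3://my-processed-bucket/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV;
```

This keeps the row-level logic in your service (step 3) while letting Redshift do what it is optimized for at both ends: bulk unload and bulk COPY.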
What would be the better approach if I want to create a SQL table daily, into which I insert data via my Storm application:
1. run a daily cron job to create the table on the server, or
2. use tick tuples in the Storm bolt to create a new table every 24 hours,
or is there some other, better approach?
If you want to perform the SQL table creation at a specified time of day, then a cron job is the better option; otherwise tick tuples would do.
I went ahead with the first option for my use case.
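For the cron route, the entry can be as small as the sketch below. The script path, schedule, and log location are hypothetical; the wrapper script would issue a `CREATE TABLE IF NOT EXISTS` with a date-suffixed table name so a rerun on the same day is harmless.

```
# Hypothetical crontab entry: at 00:05 every day, create the day's table.
5 0 * * * /usr/local/bin/create_daily_table.sh >> /var/log/daily_table.log 2>&1
```

Using `IF NOT EXISTS` also protects the Storm bolt: if an insert arrives before the cron job has run, the bolt can safely attempt the same creation itself as a fallback.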