Use Snowpipe to load the latest data and delete the previous one - snowflake-cloud-data-platform

I have a table which gets loaded from S3 every time there is a new file in the bucket, and I am using Snowpipe to do so.
However, the requirement is to refresh the table on every load.
To accomplish that, my thought process is below:
Create a pipe on table t1 to copy from S3.
Create a stream on table t1.
Create a task to run every 5 minutes, conditioned on the stream having data.
The task statement will delete the records from the table where the load_date in the stream is not equal to the load_date in the table. (The DML references the stream so that the stream is consumed and emptied.)
So basically, I am using the table's own stream to delete data from the table (a rough sketch of this setup is included below).
However, my issue is what will happen when there are multiple loads on the same day.
Also, this approach does not look very professional. Is there a better way?
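For reference, a rough sketch of the setup described above, with hypothetical names (my_s3_stage, my_wh, a load_date column) that would need to be adapted to the real stage and table definitions:

-- Snowpipe copies each new file from the external stage into t1
-- (auto-ingest assumes the S3 event notifications are already configured)
CREATE OR REPLACE PIPE t1_pipe AUTO_INGEST = TRUE AS
  COPY INTO t1 FROM @my_s3_stage FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- stream on t1 to detect that a new load has arrived
CREATE OR REPLACE STREAM t1_stream ON TABLE t1;

-- task runs every 5 minutes, only when the stream has data, and deletes every
-- row whose load_date differs from the latest load_date seen in the stream;
-- referencing the stream in the DML consumes it
CREATE OR REPLACE TASK purge_previous_loads
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('T1_STREAM')
AS
  DELETE FROM t1
  WHERE load_date <> (SELECT MAX(load_date) FROM t1_stream);

ALTER TASK purge_previous_loads RESUME;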

I would create a new target table for the stream data and merge into this new table on every run. If you really need to delete data from t1, then you could set up a child task that deletes data from t1 based on what you have in t2 (after you have merged).
However, the stream will record these delete operations. Depending on how your load works, you could create an append-only stream or, when ingesting the stream, make sure to use the metadata columns to filter only the data events you are interested in.
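Along the same lines as the sketch in the question, a hedged sketch of this suggestion, again with hypothetical names and columns (t2, an id key, a payload column) that would need to match the real schema; it assumes at most one stream row per key between runs so the MERGE is deterministic:

-- new target table with the same shape as t1
CREATE OR REPLACE TABLE t2 LIKE t1;

-- append-only stream so that deletes on t1 are not picked up again
CREATE OR REPLACE STREAM t1_stream ON TABLE t1 APPEND_ONLY = TRUE;

-- parent task: merge the new rows from the stream into t2 (this consumes the stream)
CREATE OR REPLACE TASK merge_t1_into_t2
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('T1_STREAM')
AS
  MERGE INTO t2 USING t1_stream s ON t2.id = s.id
  WHEN MATCHED THEN UPDATE SET t2.load_date = s.load_date, t2.payload = s.payload
  WHEN NOT MATCHED THEN INSERT (id, load_date, payload) VALUES (s.id, s.load_date, s.payload);

-- optional child task: trim t1 based on what is now in t2
CREATE OR REPLACE TASK purge_t1
  WAREHOUSE = my_wh
  AFTER merge_t1_into_t2
AS
  DELETE FROM t1 WHERE load_date < (SELECT MAX(load_date) FROM t2);

-- child tasks are resumed before the root task
ALTER TASK purge_t1 RESUME;
ALTER TASK merge_t1_into_t2 RESUME;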

Related

Building CDC in Snowflake

My company is migrating from SQL Server 2017 to Snowflake, and I am looking to build historical data tables that capture delta changes. In SQL Server, these live in stored procedures, where old records get expired (on a change to the data) and a new row is inserted with the updated data. This design allows dynamic retrieval of historical data at any point in time.
My question is: how would I migrate this design to Snowflake? From what I read about procedures, they are more like UDFs or scalar functions (the SQL Server equivalents), but in JavaScript...
Below is a brief example of how we are doing CDC for tables in SQL Server.
Would a data pipeline cover this? If anyone knows a good tutorial site for Snowflake 101 (not the official Snowflake documentation, it's terrible), that would be appreciated.
Thanks.
-- Expire the current history row when any tracked attribute has changed
update h
set h.expiration_date = t.effective_date
from data_table_A_history h
join data_table_A as t
    on h.account_id = t.account_id
where h.expiration_date is null
  and (
        (isnull(t.person_name, 'x') <> isnull(h.person_name, 'x')) or
        (isnull(t.person_age, 0)    <> isnull(h.person_age, 0))
      )
---------------------------------------------------------------------
-- Insert a new open history row for accounts that no longer have one
insert into data_table_A_history (account_id, person_name, person_age)
select
    t.account_id, t.person_name, t.person_age
from
    data_table_A t
    left join data_table_A_history h
        on t.account_id = h.account_id
        and h.expiration_date is null
where
    h.account_id is null
Table streams are Snowflake's CDC solution
You can set up multiple streams on a single table, and each one tracks changes to the table from a particular point in time. This point in time moves once you consume the data in the stream, with the new starting point being the time you consumed the data. Consumption here means either using the data to upsert another table or, for example, inserting the data into a log table. Simple SELECT statements do not consume the data.
A pipeline could be something like this: Snowpipe -> staging table -> stream on staging table -> task with stored procedure -> merge/upsert into target table.
If you wanted to keep a log of the changes, then you could set up a 2nd stream on the staging table and consume it by inserting the data into another table.
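A minimal sketch of that second stream, assuming a hypothetical staging table stg_t1 with columns (id, load_date, payload) and a log table stg_t1_log that has those columns plus two metadata columns:

CREATE OR REPLACE STREAM stg_t1_log_stream ON TABLE stg_t1;

-- this INSERT consumes the logging stream (advances its offset) without
-- touching the other stream used by the main merge
INSERT INTO stg_t1_log
  SELECT id, load_date, payload, METADATA$ACTION, METADATA$ISUPDATE
  FROM stg_t1_log_stream;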
Another trick, if you didn't want to use a 2nd stream, is to amend your stored procedure so that before you consume the data, you run a SELECT on the stream and then immediately run:
INSERT INTO my_table SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
This does not consume the stream or change its offset, and it leaves the stream data available to be consumed by another DML operation.
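In context, the trick amounts to something like the following pair of statements inside the stored procedure (hypothetical stream and log table names; the log table needs columns matching the stream's output, including the METADATA$ columns):

-- a plain SELECT does not advance the stream offset
SELECT * FROM stg_t1_stream;

-- re-use the result set of that SELECT for the log insert
INSERT INTO stg_t1_log
  SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- the stream is still unconsumed and can be consumed later by the main MERGE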

SpringBatch application periodically pulling data from DB

I am working on a Spring Batch service that pulls data from a DB on a schedule (e.g. every day at 12 pm).
I am using JdbcPagingItemReader to read the data and a scheduler (@Scheduled, provided by Spring) to launch the job. The problem that I have now is: every time the job runs, it pulls all the data from the beginning and not from the "last read" row.
The data in the DB changes every day (old rows are deleted and new ones are added), and all I have is a timestamp column to track them.
Is there a way to "remember" the last row read during the last execution of the job and read only data later than that row?
Since you need to pull data on a daily basis and your records have a timestamp, you can design your job instances to be based on a given date (i.e. using the date as an identifying job parameter). With this approach, you do not need to "remember" the last processed record. All you need to do is process the records for a given date by using the correct SQL query. For example:
Job instance ID | Date       | Job parameter   | SQL
1               | 2021-03-22 | date=2021-03-22 | Select c1, c2 from table where date = 2021-03-22
2               | 2021-03-23 | date=2021-03-23 | Select c1, c2 from table where date = 2021-03-23
...             | ...        | ...             | ...
With that in place, you can use any cursor-based or paging-based reader to process the records of a given date. If a job instance fails, you can restart it without the risk of interfering with other job instances. The restart can even be done several days after the failure, since the job instance will always process the same data set. Moreover, in case of failure and restart, Spring Batch will reprocess records from the last checkpoint of the previous (failed) run.
Just want to post an update to this question.
So in the end I created two more steps to achieve what I wanted to do initially.
Since I don't have the privilege to modify the table that I read the data from, I couldn't use the "process indicator pattern", which involves having a column to mark whether a record has been processed or not. Instead, I created another table to store the last-read record's timestamp and use it to update the SQL query.
step 0: a tasklet that reads the bookmark from a table and passes it into the job context
step 1: a chunk step that gets the bookmark from the context and uses JdbcPagingItemReader to read the data
step 2: a tasklet that updates the bookmark
But doing this, I have to be very cautious with the bookmark table: if I lose it, I lose everything. A rough SQL sketch of these steps follows.
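The sketch uses hypothetical names (batch_bookmark, source_table, updated_ts); the :placeholders stand for values passed through the job execution context:

-- one-off: the bookmark table that survives between job executions
CREATE TABLE batch_bookmark (
  job_name     VARCHAR(100) PRIMARY KEY,
  last_read_ts TIMESTAMP NOT NULL
);

-- step 0: read the bookmark and put it into the job context
SELECT last_read_ts FROM batch_bookmark WHERE job_name = 'myJob';

-- step 1: the paging reader only fetches rows newer than the bookmark
SELECT id, payload, updated_ts
FROM source_table
WHERE updated_ts > :lastReadTs
ORDER BY updated_ts, id;

-- step 2: advance the bookmark only after the chunk step succeeds
UPDATE batch_bookmark
SET last_read_ts = :maxProcessedTs
WHERE job_name = 'myJob';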

Sending incremental data from an Oracle database to another application - even a small suggestion would be very helpful

I have one table, let's suppose Item. Many DML operations happen on this table daily. Whatever DML (insert, update, delete) happens on this table, I need to send that transaction data to another application using APIs.
If, in the Item table, 2 records get inserted, 1 updated and 1 deleted, I need to send the data to the other application in the form below. The file will be in JSON format.
I can create the file below. My question is about how to extract the daily transactional data.
{
  "insert": ["A1", "A2"],
  "delete": "B1",
  "update": "C1"
}
Something like the above: if A1 and A2 are inserted into the Item table, B1 gets deleted and C1 gets updated, then I will send the data in the above format to the target application so it can apply the changes.
To do this I created one more table, Item_trigger, and a trigger on the Item table, so that whenever any DML happens the trigger inserts rows into the Item_trigger table with values like
('A1','Insert'), ('A2','Insert'), ('B1','Delete'), ('C1','Update')
Then, using the Item_trigger table, I will create the file and send the data to the target system.
The above design has been rejected because I am using a trigger. Is there any good solution? I was thinking about a materialized view, but it doesn't cover deletes, so I cannot use that either.
Could you please help me with the design? Is there any way to record transactions without using a trigger?
You can make use of statement-level auditing on the particular table. That will only provide information about what type of operation was performed, not the actual data. You can combine this information with storing the values of whatever was inserted, deleted and updated in another table, or use the main table directly to transmit the data.
Below is the script:
-- enable statement-level auditing for the table (requires the audit trail to be enabled)
audit select, insert, update, delete on test.test_audit by access;
-- example DML that will now be recorded
delete from test_audit where id <= 10;
-- review the recorded operations
select * from dba_audit_object where obj_name = 'TEST_AUDIT';

Backing up a table before deleting all the records and reloading it in SSIS

I have a table named abcTbl, and the data in it is populated from other tables in a different database. Every time I load data into abcTbl, I delete everything from it and then load the buffer data into it.
This package runs daily. My question is: how do I avoid losing data from the table abcTbl if we fail to load the data into it? My first step is deleting all the data in abcTbl, then selecting the data from various sources into a buffer, and then loading the buffer data into abcTbl.
We can encounter issues like failed connections, the package stopping prematurely, supernatural forces trying to stop/break my package from running smoothly, etc., all of which would leave the package with no buffer data after I have already deleted the data from abcTbl.
My first intuition was to save the data from abcTbl into a backup table and then delete the data in abcTbl, but my DBAs wouldn't be too thrilled about creating a backup table in every environment just for this package, and giving me the rights to create backup tables on the fly and then delete them again is out of the question too. This data is not business critical and can be repopulated if lost.
But what is the best approach here? What are the best practices for this issue?
For backing up your table, instead of loading data from one table (original) to another table (backup), you can just rename your original table to the backup name, create the original table again with the same structure as the backup, and then drop the renamed table only when your data load is successful. This can save the time spent transferring data from one table to another. You may want to test which approach is faster for you, depending on your data and table structure, but this is one more way to do it. If you have a lot of data in that table, the approach below may be faster.
sp_rename 'abcTbl', 'abcTbl_bkp';
CREATE TABLE abcTbl (...);   -- keep the same table structure as abcTbl_bkp
-- load your new data into abcTbl, and only when that succeeds:
DROP TABLE abcTbl_bkp;
Trying to figure this out, but I think what you are asking for is a method to capture the older data before loading the new data. I would agree with your DBAs that a separate table for every reload would be extremely messy and not very usable if you ever need it.
Instead, create a table that copies your load table but adds a single DateTime field (say, history_date). On each load, you would just flow all the data from your primary table into the backup table. Use a Derived Column task in the Data Flow to add the history_date value to the backup table.
Once the backup table is complete, either truncate or delete the contents of the current table, then load the new data.
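If you prefer to do that copy with an Execute SQL Task rather than a Data Flow with a Derived Column, a hedged T-SQL sketch (hypothetical column names col1, col2) would be:

-- abcTbl_history is assumed to have the same columns as abcTbl plus history_date
INSERT INTO abcTbl_history (col1, col2, history_date)
SELECT col1, col2, GETDATE()
FROM abcTbl;

-- only after the copy succeeds
TRUNCATE TABLE abcTbl;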
Instead of creating additional tables, you can set the package to execute as a single transaction. By doing this, if any component fails, all the tasks that have already executed will be rolled back and subsequent ones will not run. To do this, set the TransactionOption to Required on the package so that the package begins a transaction. Then set this property to Supported for all components that you want to succeed or fail together; the Supported level has those tasks join the transaction already in progress in the parent container, which in this case is the package.

If there are other components in the package that you want to commit or roll back independently of these tasks, you can place the related objects in a Sequence container and apply the Required level to the Sequence instead. An important thing to note is that if anything performs a TRUNCATE, then all other components that access the truncated object will need ValidateExternalMetadata set to false to avoid the known blocking issue that results from this.

How to bulk insert and validate data against existing database data

Here is my situation: my client wants to bulk insert 100,000+ rows into the database from a CSV file, which is simple enough, but the values need to be checked against data that is already in the database (does this product type exist? is this product still sold? etc.). To make things worse, these files will also be uploaded into the live system during the day, so I need to make sure I'm not locking any tables for long. The inserted data will also be spread across multiple tables.
I've been loading the data into a staging table, which takes seconds. I then tried creating a web service to start processing the table using LINQ and marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done, I need to take the valid rows and update/add them to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest
IF EXISTS (SELECT blah FROM blah WHERE....)
UPDATE (blah)
ELSE
INSERT (blah)
You could do this in chunks to avoid server load, but this is by no means a quick solution, so SSIS would be preferable.
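A hedged T-SQL sketch of that pattern, with hypothetical names (dbo.Products, ProductCode) and parameters supplied per staged row:

IF EXISTS (SELECT 1 FROM dbo.Products WHERE ProductCode = @ProductCode)
BEGIN
    -- the product already exists: refresh its attributes
    UPDATE dbo.Products
    SET ProductName = @ProductName
    WHERE ProductCode = @ProductCode;
END
ELSE
BEGIN
    -- new product: add it
    INSERT INTO dbo.Products (ProductCode, ProductName)
    VALUES (@ProductCode, @ProductName);
END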
