I am new to Snowflake and am doing a POC following the Automating Snowpipe for Amazon S3 document.
Here is the Snowpipe I've created:
create pipe demo_db.public.storage_to_snowflake_pipe
  auto_ingest = true
as
  copy into demo_db.public.test_table (Name)
  from (select $1 from @demo_db.public.stage_table)
  file_format = (type = 'CSV' skip_header = 1);
Is there any possibility to truncate/delete the data in the Snowflake table (test_table) before Snowpipe loads data into it?
Thanks in advance.
Since you are truncating the table I assume it's a staging table and you will have another process to read data from it and move to another table.
If that's the case you can use ALTER TABLE .. SWAP WITH ... to achieve your goal.
Assuming your table is T_STAGE and it's being loaded by Snowpipe, you can create a secondary stage table T_STAGE_INC with the same structure.
SWAP will swap the data between the tables.
At the beginning of your process that reads data from T_STAGE you need to run
ALTER TABLE T_STAGE SWAP WITH T_STAGE_INC
and use T_STAGE_INC as your source.
After your process is done, you can truncate the T_STAGE_INC table.
T_STAGE_INC also needs to be empty for the first run.
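A minimal sketch of that pattern in Snowflake SQL, assuming the table names above (the processing step in the middle is just a placeholder):
-- Created once; must be empty before the first swap
CREATE TABLE IF NOT EXISTS T_STAGE_INC LIKE T_STAGE;

-- At the start of the reader process: atomically exchange the two tables,
-- so Snowpipe keeps loading into the (now empty) T_STAGE
ALTER TABLE T_STAGE SWAP WITH T_STAGE_INC;

-- ... read/process rows from T_STAGE_INC here ...

-- Once the downstream process is done, clear T_STAGE_INC for the next cycle
TRUNCATE TABLE T_STAGE_INC;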
Checked this with Snowflake. This is not supported with Snowpipe yet.
Related
My company is migrating from SQL Server 2017 to Snowflake, and I am looking to build historical data tables that capture delta changes. In SQL Server these would be in stored procedures, where old records get expired (on a change to the data) and a new row is inserted with the updated data. This design allows dynamic retrieval of historical data at any point in time.
My question is, how would I migrate this design to Snowflake? From what I read about procedures, they're more like UDTs or scalar functions (the SQL equivalent), but in JavaScript...
Below is a brief example of how we are doing CDC for tables in SQL Server.
Would a data pipeline cover this? If anyone knows a good tutorial site for Snowflake 101 (not the official Snowflake documentation, it's terrible), that would be appreciated.
thanks
update h
set h.expiration_date = t.effective_date
from data_table_A_history h
join data_table_A as t
    on h.account_id = t.account_id
where h.expiration_date is null
  and (
        (isnull(t.person_name, 'x') <> isnull(h.person_name, 'x')) or
        (isnull(t.person_age, 0) <> isnull(h.person_age, 0))
      );
---------------------------------------------------------------------
insert into data_table_A_history (account_id, person_name, person_age)
select
    t.account_id, t.person_name, t.person_age
from data_table_A t
left join data_table_A_history h
    on t.account_id = h.account_id
   and h.expiration_date is null
where h.account_id is null;
Table streams are Snowflake's CDC solution.
You can set up multiple streams on a single table, and each will track changes to the table from a particular point in time. This point in time moves once you consume the data in the stream, with the new starting point being the time you consumed the data. Consumption here means using the data in a DML statement, for example to upsert another table or to insert the data into a log table. Plain SELECT statements do not consume the data.
A pipeline could be something like this: Snowpipe->staging table->stream on staging table->task with SP->merge/upsert target table
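A hedged sketch of that pipeline in Snowflake SQL; the object names (staging_tbl, target_tbl, my_wh) and the two-column schema are assumptions, and the MERGE runs directly in the task rather than in a stored procedure for brevity:
CREATE OR REPLACE STREAM staging_stream ON TABLE staging_tbl;

CREATE OR REPLACE TASK merge_task
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('STAGING_STREAM')
AS
  -- The staging table is insert-only (fed by Snowpipe), so no METADATA$ACTION filtering here
  MERGE INTO target_tbl t
  USING staging_stream s
    ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.val = s.val
  WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);

ALTER TASK merge_task RESUME;   -- tasks are created suspended
Consuming the stream inside the MERGE is what advances its offset, as described above.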
If you wanted to keep a log of the changes, you could set up a second stream on the staging table and consume that by inserting the data into another table.
Another trick, if you didn't want to use a second stream, is to amend your SP so that before you consume the data, you run a SELECT on the stream and then immediately run
INSERT INTO my_table SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
This does not consume the stream or change the offset, and it leaves the stream data available to be consumed by another DML operation.
I have a table which gets loaded from S3 every time there is a new file in the bucket,
and I am using Snowpipe to do so.
However, the ask is to refresh the table on every load.
To accomplish that, my thought process is below.
Create a pipe on table t1 to copy from S3.
Create a Stream on table t1.
Create a task to run every 5 minutes, with the condition that the stream has data.
The task statement will delete the records from the table where the load_date in the stream is not equal to the load_date of the table (using the stream in a DML operation so that the stream gets emptied).
So basically I am using the table's own stream to delete data from the table.
However, my issue is what will happen when there are multiple loads on the same day.
Also, this approach does not look very professional. Is there a better way?
I would create a new target table for the stream data and merge into this new table on every run. If you really need to delete data from t1, then you could set up a child task that deletes data from t1 based on what you have in t2 (after you have merged).
However, the stream will record these delete operations. Depending on how your load works, you could create an append-only stream, or when ingesting the stream, make sure to use the metadata to filter only the data events you are interested in.
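A sketch of that arrangement under assumed names (t1, t2, t1_stream, my_wh) and an assumed (id, load_date) schema; the append-only stream ignores the deletes performed on t1:
CREATE OR REPLACE STREAM t1_stream ON TABLE t1 APPEND_ONLY = TRUE;

CREATE OR REPLACE TASK merge_t2
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('T1_STREAM')
AS
  MERGE INTO t2
  USING t1_stream s ON t2.id = s.id
  WHEN MATCHED THEN UPDATE SET t2.load_date = s.load_date
  WHEN NOT MATCHED THEN INSERT (id, load_date) VALUES (s.id, s.load_date);

-- Child task: clean up t1 only after the merge has landed in t2
CREATE OR REPLACE TASK purge_t1
  WAREHOUSE = my_wh
  AFTER merge_t2
AS
  DELETE FROM t1 WHERE load_date < (SELECT MAX(load_date) FROM t2);

ALTER TASK purge_t1 RESUME;   -- resume child tasks before the root task
ALTER TASK merge_t2 RESUME;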
I have a table named abcTbl whose data is populated from tables in a different database. Every time I load data into abcTbl, I delete everything in it and then load the buffer data into it.
This package runs daily. My question is: how do I avoid losing the data in abcTbl if we fail to load the new data into it? My first step is deleting all the data in abcTbl, then selecting the data from various sources into a buffer, and then loading the buffer data into abcTbl.
We can encounter issues like failed connections, the package stopping prematurely, supernatural forces trying to stop/break my package from running smoothly, etc., which would end up with the package losing all the data in the buffer after I have already deleted the data from abcTbl.
My first intuition was to save the data from abcTbl into a backup table and then delete the data in abcTbl, but my DBAs wouldn't be too thrilled about creating a backup table in every environment just for this package, and giving me the rights to create backup tables on the fly and then delete them again is out of the question too. This data is not business critical and can be repopulated if lost.
But, what is the best approach here? What are the best practices for this issue?
For backing up your table, instead of loading the data from one table (original) into another table (backup), you can just rename your original table to a backup name, create the original table again with the same structure as the backup table, and then drop the renamed table only once your data load is successful. This may save the time it takes to transfer data from one table to another. You may want to test which approach is faster for you depending on your data/table structure etc., but what I wanted to mention is that this is also one way to do it. If you have a lot of data in that table, the approach below may be faster.
EXEC sp_rename 'abcTbl', 'abcTbl_bkp';
CREATE TABLE abcTbl (...);
While creating this table, keep the same table structure as abcTbl_bkp.
Load your new data into the abcTbl table.
DROP TABLE abcTbl_bkp;
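A self-contained T-SQL sketch of those steps; the column definitions are hypothetical placeholders, not the real abcTbl schema:
EXEC sp_rename 'abcTbl', 'abcTbl_bkp';

CREATE TABLE abcTbl (            -- same structure as abcTbl_bkp
    id   INT PRIMARY KEY,
    col1 NVARCHAR(100) NULL,
    col2 DATETIME NULL
);

-- ... load the new data into abcTbl here ...

DROP TABLE abcTbl_bkp;           -- only after the load has succeeded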
Trying to figure this out, but I think what you are asking for is a method to capture the older data before loading the new data. I would agree with your DBAs that a separate table for every reload would be extremely messy and not very usable if you ever need it.
Instead, create a table that copies your load table but adds a single DATETIME field (say history_date). On each load you would just flow all the data in your primary table to the backup table, using a Derived Column task in the Data Flow to add the history_date value to the backup table.
Once the backup table is complete, either truncate or delete the contents of the current table. Then load the new data.
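The same backup step expressed in plain T-SQL rather than with a Derived Column transformation; abcTbl_history and the column list are assumptions:
-- Copy the current contents into the history table, stamped with the load time
INSERT INTO abcTbl_history (id, col1, col2, history_date)
SELECT id, col1, col2, GETDATE()
FROM abcTbl;

-- Then clear the current table before loading the new data
TRUNCATE TABLE abcTbl;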
Instead of creating additional tables, you can set the package to execute as a single transaction. By doing this, if any component fails, all the tasks that have already executed will be rolled back and subsequent ones will not run. To do this, set the TransactionOption to Required on the package, which makes the package begin a transaction. Then set this property to Supported on all components that you want to succeed or fail together. The Supported level has these tasks join a transaction that is already in progress in the parent container, the package in this case. If there are other components in the package that you want to commit or roll back independently of these tasks, you can place the related objects in a Sequence container and apply the Required level to the Sequence instead.
An important thing to note is that if anything performs a TRUNCATE, then all other components that access the truncated object will need the ValidateExternalMetadata option set to false to avoid the known blocking issue that results from this.
There is a process with a huge ETL step which finally dumps the data into table X of database ABC.
I want to create a mirror of this table as table Y, which is in database DEF on a different server.
I had created a trigger for this which would push the data on insertion into the table.
But I later came to know that the ETL process drops the table and re-creates it,
which results in the trigger being dropped as well.
How can I implement this process to make an exact copy of the table in a DB on another server?
Edit:
a) I don't have any control over the ETL process, hence I can't make any modifications to it and want to keep this process discrete.
b) I cannot truncate table Y, as DDL is not supported over linked servers; hence I want to delete from table Y and then insert.
What way can I implement this process to make an exact copy of the table on another server DB?
The best way to do it is to use SQL Server Replication.
If for any reason you prefer to recreate the trigger on the recreated table, you can use a DDL trigger that re-creates your trigger in the event of table creation.
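A hedged sketch of such a DDL trigger; the object names (table X, trg_mirror_X, and the [DEF-SERVER].DEF.dbo.Y linked-server path) are assumptions for illustration:
CREATE TRIGGER trg_recreate_mirror
ON DATABASE
FOR CREATE_TABLE
AS
BEGIN
    DECLARE @evt XML = EVENTDATA();
    DECLARE @obj SYSNAME = @evt.value('(/EVENT_INSTANCE/ObjectName)[1]', 'sysname');

    -- Re-create the mirroring trigger whenever the ETL process re-creates table X
    IF @obj = N'X'
        EXEC (N'CREATE TRIGGER trg_mirror_X ON dbo.X AFTER INSERT AS
                INSERT INTO [DEF-SERVER].DEF.dbo.Y SELECT * FROM inserted;');
END;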
P.S. Why don't you use TRUNCATE table (with index disabling if needed) instead of drop/create?
I have a running system where data is inserted periodically into an MS SQL DB, and a web application is used to display this data to users.
During the data insert, users should be able to continue using the DB; unfortunately I can't redesign the whole system right now. Every 2 hours 40k-80k records are inserted.
Right now the process looks like this:
A temp table is created.
Data is inserted into it using plain INSERT statements (parameterized queries or stored procedures should improve the speed).
Data is pumped from the temp table to the destination table using INSERT INTO MyTable(...) SELECT ... FROM #TempTable
I think this approach is very inefficient. I see that the insert phase can be improved (bulk insert?), but what about transferring data from the temp table to the destination?
This is what we did a few times. Rename your table as TableName_A. Create a view that calls that table. Create a second table exactly like the first one (TableName_B) and populate it with the data from the first one. Now set up your import process to populate the table that is not being referenced by the view, then change the view to point to that table instead. Total downtime to users: a few seconds. Then repopulate the first table. It is actually easier if you can truncate and repopulate the table, because then you don't need that last step, but that may not be possible if your input data is not a complete refresh.
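A minimal T-SQL sketch of that pattern; the names (MyTable, MyTable_A, MyTable_B) are assumptions, and the view keeps the original table name so the application does not have to change:
EXEC sp_rename 'dbo.MyTable', 'MyTable_A';
GO
CREATE VIEW dbo.MyTable AS SELECT * FROM dbo.MyTable_A;
GO
SELECT * INTO dbo.MyTable_B FROM dbo.MyTable_A;   -- second copy (indexes/constraints are not copied)
GO
-- The import loads into MyTable_B; then repoint the view (downtime of a few seconds)
ALTER VIEW dbo.MyTable AS SELECT * FROM dbo.MyTable_B;
GO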
You cannot avoid locking when inserting into the table. Even with BULK INSERT this is not possible.
But clients that want to access this table during the concurrent INSERT operations can do so by changing the transaction isolation level to READ UNCOMMITTED or by executing the SELECT with the WITH (NOLOCK) table hint.
The INSERT command will still lock the table/rows, but the SELECT will then ignore these locks and also read uncommitted rows.
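The two reader-side options above in plain T-SQL (dbo.MyTable stands in for the actual table):
-- Option 1: session-level isolation
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM dbo.MyTable;

-- Option 2: per-query table hint
SELECT * FROM dbo.MyTable WITH (NOLOCK);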