Change data retention for dropped tables in Snowflake

We are using Snowflake as our warehouse platform. While loading data into Snowflake, we had some drop/create logic for populating the data. Instead of using temporary or transient tables, we accidentally used permanent tables, which resulted in huge storage sizes for Time Travel storage (close to 140 TB).
Since we have already dropped these permanent tables, they no longer appear in the Snowflake schema. We wanted to check whether there is any way to change the retention period for already-dropped tables so that we can release the Time Travel storage they are using.
Many thanks in advance,
Prasanth

If you did not change the default Time Travel retention time, it defaults to one day, and you will see the data start to clear out in eight days (1 day Time Travel + 7 days Fail-safe). In that case you may want to just let it age out until it is cleared automatically. If you altered the duration, you can try the following to change the retention time; it should work.
If you have new tables with the old dropped table names, you'll need to temporarily rename the new tables. Please refer to this KB article for details: KB Article for Time Travel on dropped tables.
After temporarily renaming your new tables, undrop your old tables. You can then set their retention to 0 (be 100% sure you want this), and then drop them again.
undrop table MY_TABLE;
alter table MY_TABLE set DATA_RETENTION_TIME_IN_DAYS = 0;
drop table MY_TABLE;
Of course, once you re-drop your old tables, you can rename your new transient tables back to their proper names. While this should work, you will not see the Time Travel data go away immediately; a background service will delete it at some point.
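For the temporary rename mentioned above, something like this (table names are placeholders, assuming the replacement table currently holds the original name) would wrap the undrop/drop sequence:
alter table MY_TABLE rename to MY_TABLE_TMP;    -- before undropping: move the new table out of the way
-- ... undrop / set retention to 0 / re-drop, as shown above ...
alter table MY_TABLE_TMP rename to MY_TABLE;    -- after re-dropping: restore the new table's name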

Related

Can Snowflake Auto Purge records older than X number of days?

For an existing table in Snowflake, is there a way we can set a TTL for each record?
In other words, can I ensure that records updated/created more than 90 days ago are automatically purged periodically?
You can use a Snowflake TASK to run deletes on a routine schedule. And if you are dealing with a very large table, I recommend that you cluster it on the DATE of whatever field you are using to delete from. This will increase the performance of the delete statement. Unfortunately, there is no way to set this on a table and have it remove records automatically for you.
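For example, a scheduled purge task might look like this (the warehouse, table, and column names are placeholders):
CREATE OR REPLACE TASK purge_old_records
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 3 * * * UTC'   -- run daily at 03:00 UTC
AS
  DELETE FROM my_table
  WHERE updated_at < DATEADD(day, -90, CURRENT_TIMESTAMP());
ALTER TASK purge_old_records RESUME;   -- tasks are created suspended, so resume to start the schedule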
Opt 1: If the table is used for analytics, you can build a view on top of it that retrieves only the last 90 days of data (that way you keep the history); see the sketch after this list.
Opt 2: You can run a SQL statement on a schedule that deletes the records older than 90 days.
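A minimal sketch of Opt 1, assuming a hypothetical updated_at timestamp column:
CREATE OR REPLACE VIEW my_table_last_90_days AS
SELECT *
FROM my_table
WHERE updated_at >= DATEADD(day, -90, CURRENT_TIMESTAMP());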

MS SQL Trigger for ETL vs Performance

I need information on what the impact on a production DB might be of creating triggers on ~30 production tables that capture every UPDATE, DELETE, and INSERT statement and write the following information to a separate table: "PK", "Table Name", "Time of modification".
I have limited ability to test it, as I have read-only permissions to both the Prod and Test environments (and I can get one working day with 10 end users to test it).
I have estimated that the number of records inserted by these triggers will be around 150-200k daily.
Background:
I have a project to deploy a data warehouse for a database that is heavily customized, plus there are jobs running every day that manipulate the data. The "Updated On" date column is not being maintained (a customization), plus there are hard deletes occurring on tables. We decided to ask the DEV team to add triggers like:
CREATE TRIGGER [dbo].[triggerName] ON [dbo].[ProductionTable]
FOR INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- log the PK of every inserted or updated row
    INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
    SELECT 'ProductionTable', PK_ID, GETDATE() FROM inserted;
    -- log the PK of every deleted (or pre-update) row
    INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
    SELECT 'ProductionTable', PK_ID, GETDATE() FROM deleted;
END
on the ~30 core production tables.
Based on this table we will pull the delta from the last 24 hours and push it to the data warehouse staging tables.
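For illustration, the daily delta pull could be as simple as this (a hypothetical 24-hour window over the audit table defined above):
SELECT DISTINCT Table_Name, Regular_PK
FROM For_ETL_Warehouse
WHERE Insert_Date >= DATEADD(hour, -24, GETDATE());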
If anyone has had a similar issue and can help me estimate how it might impact performance on the production database, I would really appreciate it. (If it works, I am saved; if not, I need to propose another solution. Currently mirroring or replication might be hard to get, as the local DEVs have no idea how to set it up...)
Other ideas on how to handle this situation or perform tests are welcome (my deadline is Friday 26-01).
First of all, I would suggest you encode the table name as a smaller data type rather than a character one (30 tables => tinyint).
Second, you need to understand how big the payload you are going to write is, and how it will be written:
if you choose a correct clustered index (the date column), then the server just needs to write the data row by row in sequence. That is an easy job even if you insert all 200k rows at once.
if you encode the table name as a tinyint, then it basically has to write:
1 byte (table name) + the PK size (hopefully numeric, so <= 8 bytes) + 8 bytes for the datetime - so approximately 17 bytes on the data page, plus any indexes, plus the log file. This is very lightweight and again will put no "real" pressure on SQL Server (see the sketch below).
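As a rough sketch of that leaner layout (a hypothetical lookup table plus the audit table; adapt names and types to your schema):
CREATE TABLE dbo.Table_Lookup (
    Table_ID   tinyint NOT NULL PRIMARY KEY,
    Table_Name sysname NOT NULL
);

CREATE TABLE dbo.For_ETL_Warehouse (
    Table_ID    tinyint  NOT NULL,   -- 1 byte instead of a character table name
    Regular_PK  bigint   NOT NULL,   -- numeric PK, at most 8 bytes
    Insert_Date datetime NOT NULL    -- 8 bytes
);

-- clustered on the date column so trigger inserts are appended sequentially
CREATE CLUSTERED INDEX CIX_For_ETL_Warehouse_Insert_Date
    ON dbo.For_ETL_Warehouse (Insert_Date);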
The trigger itself will add a small overhead, but with the number of rows you are talking about, it is negligible.
I have seen systems that do similar things on a much larger scale with close to zero effect on the work process, so I would say it's a safe bet. The only problem with this approach is that it will not work in some cases (e.g. outputs to temp tables from DML statements). But if you do not have that kind of blocker, then go for it.
I hope it helps.

Find out the recently selected rows from an Oracle table, and can I update a LAST_ACCESSED column whenever the table is accessed?

I have a database table with more than 1 million records, uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select query can come from multiple places: sometimes the row is returned on its own, sometimes it is part of a set of rows. There is a select query that does the fetching over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up the database table: I want to delete all rows that were never used (retrieved via a select query) in the last 5 years.
Does Oracle DB have any built-in metadata that can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table, but that is a costly operation in terms of the time taken for the whole process. At least 1,000-10,000 records will be selected from the table in a single operation. Is there a more efficient way to do this than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waiting periods for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization, which brings you Heat Maps to track table access (modifications as well as read operations). Careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor at the individual block level, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
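As a rough sketch of how that could look (schema, table, and partition names are placeholders; the Heat Map views are only populated once the licensed feature is enabled):
-- enable Heat Map tracking
ALTER SYSTEM SET HEAT_MAP = ON;

-- later: check when each table or partition was last read or written
SELECT object_name, subobject_name,
       segment_write_time, segment_read_time, full_scan, lookup_scan
FROM   dba_heat_map_segment
WHERE  owner = 'MY_SCHEMA'
AND    object_name = 'MY_TABLE';

-- if the table is partitioned by date, purge partitions that are no longer read
ALTER TABLE my_table DROP PARTITION p_2019_01 UPDATE GLOBAL INDEXES;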
I couldn't use any built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and the trigger likewise had a performance hit.
Finally, I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used or selected in a query are inserted. While deleting, I cross-check this table and avoid deleting the entries present in it.
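A sketch of that final clean-up delete, with hypothetical table and column names for the tracking table:
DELETE FROM my_table t
WHERE  NOT EXISTS (SELECT 1
                   FROM   recently_used_rows u
                   WHERE  u.row_guid = t.row_guid);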

Will archiving lots of old data lock my Database?

I need to move data that is a month old from a logging table to a logging-archive table, and remove data older than a year from the latter.
There is a lot of data (600k inserts in 2 months).
I was considering simply calling (batching) a stored proc every day/week.
I first thought about doing it with two stored procs:
Deleting from the archive what is older than 365 days
Moving the data that is older than 30 days from logging to archive (I suppose there's a way to do that with one SQL query)
Removing from logging what is older than 30 days.
However, this solution seems quite inefficient and will probably lock the DB for a few minutes, which I do not want.
So, do I have any alternative and what are they?
None of this should lock the tables that you actually use. You are currently writing only to the logging table, and only new records.
You are selecting only old records from the logging table, and writing to a table that you don't write to except during the archive process.
The steps you are taking sound fine. I would go one step further, and instead of deleting based on date, just do an INNER JOIN to your archive table on your id field - then you only delete the specific records you have archived.
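For example (SQL Server syntax; the table and column names are hypothetical):
-- archive rows older than 30 days
INSERT INTO logging_archive
SELECT *
FROM   logging
WHERE  log_date < DATEADD(day, -30, GETDATE());

-- delete only the rows that were actually archived, joining on the id
DELETE l
FROM   logging AS l
       INNER JOIN logging_archive AS a ON a.id = l.id;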
As a side note, 600k records is not very big at all. We have production DBs with tables over 2 billion rows, and I know some other folks here have DBs with millions of inserts a minute into transactional tables.
Edit:
I forgot to include this originally: another benefit of your planned method is that each step is isolated. If you need to stop for any reason, none of your steps is destructive or depends on the next step executing immediately. You could potentially archive a lot of records, then run the deletes the next day or overnight without creating any issues.
What if you archived to a secondary database?
I.e.:
The primary database has the logging table.
The secondary database has the archive table.
That way, if you're worried about locking your archive table so you can run a batch on it, it won't take your primary database down.
But in any case, I'm not sure you have to worry about locking; I guess it just depends on how you implement it.

What are good strategies for updating a live database table?

I have a db table that gets entirely re-populated with fresh data periodically. This data then needs to be pushed into a corresponding live db table, overwriting the previous live data.
As the table size increases, the time required to push the data into the live table also increases, and the app looks like it is missing data while the push runs.
One solution is to push the new data into a live_temp table and then run an SQL RENAME command on this table to rename it as the live table. The rename usually runs in sub-second time. Is this the "right" way to solve this problem?
Are there other strategies or tools to tackle this problem? Thanks.
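For reference, the rename swap described above might look like this (MySQL syntax shown; the table names are placeholders and the exact RENAME syntax varies by engine):
RENAME TABLE live TO live_old,
             live_temp TO live;
DROP TABLE live_old;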
I don't like messing with schema objects in this way - it can confuse query optimizers and I have no idea what will happen to any transactions that are going on while you execute the rename.
I much prefer to add a version column to the table, and have a separate table to hold the current version.
That way, the client code becomes
select *
  from myTable t
  join myTable_currentVersion tcv
    on t.versionID = tcv.CurrentVersion
This also keeps history around, which may or may not be useful; if it's not, delete the old records after setting the CurrentVersion column.
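A sketch of a refresh under this scheme (the version number, staging table, and column names are hypothetical):
-- 1. load the new data under a new version ID
INSERT INTO myTable (versionID, item_id, item_value)
SELECT 42, item_id, item_value
FROM   staging_table;

-- 2. switch readers to the new version in a single statement
UPDATE myTable_currentVersion
SET    CurrentVersion = 42;

-- 3. optionally purge older versions once nothing reads them
DELETE FROM myTable
WHERE  versionID < 42;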
Create a duplicate table - exact copy.
Create a new table that does nothing more than keep track of the "up to date" table.
MostCurrent (table)
id (column) - holds the name of the table holding the "up to date" data.
When repopulating, populate the older table and update MostCurrent.id to point to it.
Now, in your app where you bind the data to the page, bind the newest table.
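A minimal sketch of the pointer table (the duplicate table names myTable_A and myTable_B are hypothetical):
CREATE TABLE MostCurrent (
    id varchar(128) NOT NULL   -- name of the table holding the "up to date" data
);
INSERT INTO MostCurrent (id) VALUES ('myTable_A');

-- after repopulating myTable_B:
UPDATE MostCurrent SET id = 'myTable_B';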
Would it be appropriate to only push changes to the live db table? For most applications I have worked with, changes have been minimal. You should be able to apply all the changes in a single transaction. Committing the transaction will make them visible with no outage on the table.
If the data does change entirely, then you could configure the database so that you can replace all the data in a single transaction.
