Converting Large Data Table To Use Partitions - sql-server

I have a single MSSQL 2017 Standard table, let's call it myTable, with data going back to 2015, containing 206.4 million rows. Once INSERTed by the application, these rows are never modified or deleted. The table is actively collecting data, 24/7.
My goal is to reduce the data in this table to only the most recent six full months plus the current month, organized into monthly partitions for easy monthly pruning. myTable.dateCreated would determine the partition in which each row ultimately resides.
(Unrelated, but mentioning in case it ends up being relevant: I have an existing application that replicates all data that gets stored in myTable out to a data warehouse for long term storage every 15 minutes; the main application is able to query myTable for recent data and the data warehouse for older data as needed.)
Because I want to prune the oldest month's worth of data out of myTable each time a new month starts, partitioning myTable by month makes the most sense - I can simply SWITCH the oldest partition out to a staging table, then truncate that staging table, without causing downtime or a performance hit on the main table.
I've come up with the following plan, and my questions are simple: Is this the best way to approach this task, and will it keep downtime/performance degradation to a minimum?
Create a new table, myTable_pending, with exactly the same table structure as myTable, EXCEPT that it will be configured with a total of 7 monthly partitions (6 months of retention plus the current month);
In one complete step: rename myTable to myTable_transfer, and rename myTable_pending to myTable. This should have the net effect of allowing incoming data to continue being stored, but now it will be in a partition for the month of 2023-01;
Step 3 is where I need advice... which of the following might be best to get the remaining 6mos + current data back into the now-partitioned myTable, or are there additional options I should consider?
OPTION 1: Run a Bulk Insert of just the most recent 6 months of data from myTable_transfer back into myTable, causing the data to end up in the correct partitions in the process (with the understanding that this may still take some time, but not as long as a bunch of INSERTs that would end up chewing on the transaction log);
OPTION 2: Run a DELETE against myTable_transfer to get rid of all data except the most recent six full months plus the current month, then set up and apply partitioning on THIS table, which would cause SQL Server to reorganize the data into those partitions without affecting access or performance on myTable; after that I could simply SWITCH the partitions from myTable_transfer into myTable for immediate access. (Related issue: since myTable is still collecting current data, and myTable_transfer will also contain data from the current month, can the two current-month partitions be merged?)
OPTION 3: Any other way to do this, so that myTable ends up with 6 months worth of data, properly partitioned, without significant downtime?
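For concreteness, here is roughly what I picture the partition setup and the monthly prune looking like (boundary dates, the filegroup, the column list, and the myTable_staging name are placeholders, not our actual schema):
-- Monthly partition function: 6 retained months (Jul-Dec 2022) plus the current
-- month (Jan 2023); partition 1 (anything before Jul 2022) is expected to stay empty.
CREATE PARTITION FUNCTION pf_myTable_monthly (datetime2)
AS RANGE RIGHT FOR VALUES
    ('2022-07-01', '2022-08-01', '2022-09-01', '2022-10-01',
     '2022-11-01', '2022-12-01', '2023-01-01');

CREATE PARTITION SCHEME ps_myTable_monthly
AS PARTITION pf_myTable_monthly ALL TO ([PRIMARY]);

-- myTable_pending has the same columns as myTable but lives on the scheme,
-- partitioned by dateCreated.
CREATE TABLE dbo.myTable_pending
(
    id          bigint    NOT NULL,
    dateCreated datetime2 NOT NULL,
    -- ... remaining columns identical to dbo.myTable ...
    CONSTRAINT PK_myTable_pending PRIMARY KEY CLUSTERED (dateCreated, id)
) ON ps_myTable_monthly (dateCreated);

-- Monthly prune: switch the oldest populated partition into an empty staging
-- table with identical structure on the same filegroup, then truncate it.
ALTER TABLE dbo.myTable SWITCH PARTITION 2 TO dbo.myTable_staging;
TRUNCATE TABLE dbo.myTable_staging;

-- Then retire the emptied boundary and add next month's boundary.
ALTER PARTITION FUNCTION pf_myTable_monthly() MERGE RANGE ('2022-07-01');
ALTER PARTITION SCHEME ps_myTable_monthly NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_myTable_monthly() SPLIT RANGE ('2023-02-01');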

We ended up revising our solution. Since the original table was replicated to a data warehouse anyway, we simply renamed it and created a new, partitioned table that started collecting data from the rename point forward. This gave us the least downtime, the fastest schema change, and the partitioning we needed to maintain the table efficiently going forward.

Related

MS SQL Trigger for ETL vs Performance

I need to know what the impact on the production DB might be of creating triggers on ~30 production tables that capture every UPDATE, DELETE, and INSERT statement and write the following information to a separate table: "PK", "Table Name", "Time of modification".
I have limited ability to test this, as I have read-only permissions to both the Prod and Test environments (and I can get one work day for 10 end users to test it).
I estimate that the number of records inserted by those triggers will be around 150-200k daily.
Background:
I have a project to deploy a Data Warehouse for a database that is heavily customized, with jobs running every day that manipulate the data. The "Updated On" date column is not being maintained (a customization), and hard deletes occur on the tables. We decided to ask the DEV team to add triggers like this:
CREATE TRIGGER [dbo].[triggerName] ON [dbo].[ProductionTable]
FOR INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Log the key of every inserted or updated row.
    INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
    SELECT 'ProductionTable', PK_ID, GETDATE() FROM inserted;
    -- Log the key of every deleted row (and the pre-update image on UPDATE).
    INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
    SELECT 'ProductionTable', PK_ID, GETDATE() FROM deleted;
END
on core ~30 production tables.
Based on this table we will pull the delta from the last 24 hours and push it to the Data Warehouse staging tables.
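For clarity, the delta pull I have in mind would look roughly like this (the 24-hour window and the grouping are my assumptions; column names follow the trigger above):
-- Pull the last 24 hours of changes from the trigger-populated change log,
-- one row per changed key, ready to drive the staging load.
SELECT Table_Name, Regular_PK, MAX(Insert_Date) AS Last_Change
FROM   For_ETL_Warehouse
WHERE  Insert_Date >= DATEADD(HOUR, -24, GETDATE())
GROUP BY Table_Name, Regular_PK;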
If anyone has had a similar issue and can help me estimate how it might impact performance on the production database, I would really appreciate it. (If it works, I am saved; if not, I need to propose another solution. Mirroring or replication might currently be hard to get, as the local DEVs have no idea how to set it up...)
Other ideas on how to handle this situation or perform tests are welcome (my deadline is Friday 26-01).
First of all, I would suggest you encode the table name as a smaller data type rather than a character one (30 tables => tinyint).
Second, you need to understand how big the payload you are going to write is, and how it will be written:
if you choose the right clustered index (the date column), the server just needs to write the data row by row in sequence. That is an easy job even if you insert all 200k rows at once.
if you encode the table name as a tinyint, then basically it has to write:
1 byte (table code) + PK size (hopefully numeric, so <= 8 bytes) + 8 bytes datetime - approximately 17 bytes per row on the data page, plus indexes (if any) and the log file. This is very lightweight and, again, will put no "real" pressure on SQL Server.
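As a rough sketch of that layout (the names are made up; the point is the tinyint table code and the clustered index on the date column so inserts stay sequential):
-- Hypothetical change-log table using a tinyint code instead of a table name string.
CREATE TABLE dbo.For_ETL_Warehouse
(
    Table_Code  tinyint   NOT NULL,  -- 1 byte; maps to one of the ~30 tables
    Regular_PK  bigint    NOT NULL,  -- up to 8 bytes for a numeric key
    Insert_Date datetime2 NOT NULL CONSTRAINT DF_ETL_Date DEFAULT SYSDATETIME()
);

-- Clustered index on the date column so new rows are appended in sequence.
CREATE CLUSTERED INDEX CIX_For_ETL_Warehouse ON dbo.For_ETL_Warehouse (Insert_Date);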
The trigger itself will add a small overhead, but with the number of rows you are talking about, it is negligible.
I have seen systems doing similar things on a far larger scale with close to zero effect on the workload, so I would say it's a safe bet. The only problem with this approach is that it will not work in some cases (e.g. DML statements that OUTPUT to temp tables). But if you do not have that kind of blocker, go for it.
I hope it helps.

Find out the recently selected rows from an Oracle table, and can I update a LAST_ACCESSED column whenever the table is accessed

I have a database table with more than 1 million records uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select query can happen from multiple places: sometimes a row is returned on its own, sometimes as part of a set of rows. A select query does the fetching over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up the table: I want to delete all rows that were never used (i.e. retrieved via a select query) in the last 5 years.
Does Oracle have any built-in metadata that can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table. But that is a costly operation in terms of the time taken for the whole process: at least 1,000-10,000 records are selected from the table in a single operation. Is there any efficient way to do this other than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waits for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified and whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently; keeping a counter for each row would quickly become very costly). However, if you have your data partitioned by date, e.g. a new partition for every day, you can over time easily determine which days are still being read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
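If the licensing is covered, enabling the tracking and reading the results looks roughly like this (view and column names as I recall them from the 12c documentation; the owner and table names are placeholders):
-- Turn heat map tracking on instance-wide (requires the appropriate license).
ALTER SYSTEM SET HEAT_MAP = ON;

-- Later, check when each segment (table or partition) was last written or read.
SELECT object_name,
       subobject_name,          -- partition name, when the table is partitioned
       segment_write_time,
       segment_read_time,
       full_scan,
       lookup_scan
FROM   dba_heat_map_segment
WHERE  owner = 'MY_SCHEMA'
  AND  object_name = 'MY_TABLE';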
I couldn't use any built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and the trigger likewise hurt performance.
Finally, I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used or selected by a query are inserted. When deleting, I cross-check this table and avoid deleting entries present in it.

Best way to handle updates on a table

I am looking for a much better way to update tables using SSIS. Specifically, I want to optimize the updates on tables (around 10 tables use the same logic).
The logic is:
Select the source data from staging, then insert it into a physical temp table in the DW (i.e. TMP_Tbl).
Update all data in MyTbl that matches TMP_Tbl on the customerId column.
Insert all rows from TMP_Tbl1 whose customerId does not yet exist in MyTbl.
With the above steps, it takes some time to populate TMP_Tbl. Hence, I planned to change the logic to delete-insert, but according to this:
In SQL, is UPDATE always faster than DELETE+INSERT? that would be a recipe for pain.
Given:
no index/keys used on the tables
some tables contain 5M rows, some contain 2k rows
each table update takes up to 2-3 minutes, which adds up to about 15 to 20 minutes all in all
these updates run simultaneously in separate sequence containers
Does anyone know the best approach to use here? It seems like the physical temp table needs to be removed - is this normal?
With SSIS you usually BULK INSERT, not INSERT, so if you do not mind the DELETE, reinserting the rows should in general outperform an UPDATE.
Considering this, the faster approach would be:
[Execute SQL Task] Delete all records which you need to update. (Depending on your DB design and queries, some index may help here).
[Data Flow Task] Fast load (using an OLE DB Destination, Data access mode: "Table or view - fast load") both updated and new records from the source into MyTbl. No need for temp tables here.
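A sketch of the delete step, assuming for illustration that the affected customerId values are still available in a staging table like the question's TMP_Tbl (if you can bound them by a date range instead, you don't need any staging table at all):
-- Execute SQL Task: remove the rows that the fast load is about to re-insert.
-- An index on MyTbl.customerId lets this seek instead of scanning the table.
DELETE t
FROM   dbo.MyTbl AS t
WHERE  EXISTS (SELECT 1
               FROM   dbo.TMP_Tbl AS s
               WHERE  s.customerId = t.customerId);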
If you cannot/don't want to DELETE records - your current approach is OK too.
You just need to fix the performance of that UPDATE query (adding an index should help). 2-3 minutes per table updated is way too long.
If it is 2-3 minutes for updating millions of records, though, then it's acceptable.
Adding the correct non-clustered index to a table should not result in "much more time on the updates".
There will be a slight overhead, but if it helps your UPDATE to seek instead of scanning a big table - it is usually well worth it.
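If you keep the UPDATE path, the index and a set-based update could look roughly like this (the column being refreshed is my placeholder, not from the question):
-- Index on the join key so the UPDATE can seek into MyTbl.
CREATE NONCLUSTERED INDEX IX_MyTbl_customerId ON dbo.MyTbl (customerId);

-- Set-based update of the matching rows from the staged batch.
UPDATE t
SET    t.SomeColumn = s.SomeColumn        -- placeholder column list
FROM   dbo.MyTbl   AS t
JOIN   dbo.TMP_Tbl AS s
       ON s.customerId = t.customerId;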

How to delete data from a large SQL Server database faster?

I have a SQL Server database of about 150 GB that stores data for analysis. Each day new data comes in and we need to delete old data (based on date). Recently the daily data volume has increased a lot; it will soon be about 8-9 GB per day.
Currently we delete in small batches, which takes a very long time to finish. Is there a general guideline for making it faster? I tried dropping/disabling indexes before the delete and rebuilding them after it finished; it does not help much.
Or does this depend entirely on the actual data?
Thanks
Given the amount of data, I would use a partitioned table, with one partition for each day.
Swapping partitions in and out is going to be the fastest way to delete all data for one day.
EDIT: since truncating a partition is not as trivial as it should be in SQL Server, I figured I'd provide more details, in case you're not familiar with partitions.
In the next release of SQL Server, you should be able to just TRUNCATE PARTITION or something like that. In the meantime you have to proceed as follows:
The quickest way to delete a day of data in your database is to have the table partitioned by day and then:
Switch out the partition that you want to delete to another table: ALTER TABLE partitioned SWITCH PARTITION n TO otherTableToDelete.
TRUNCATE TABLE otherTableToDelete.
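Putting the two steps together on a hypothetical day-partitioned table (the names are placeholders; the target table must be empty, identical in structure, and on the same filegroup as the partition being switched):
-- Purge the oldest day: switch its partition into an empty staging table,
-- then truncate that table; both are metadata-only operations.
ALTER TABLE dbo.AnalysisData SWITCH PARTITION 1 TO dbo.AnalysisData_ToDelete;
TRUNCATE TABLE dbo.AnalysisData_ToDelete;

-- On SQL Server 2016 and later the staging table is no longer needed:
TRUNCATE TABLE dbo.AnalysisData WITH (PARTITIONS (1));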

Copy data from one column to another in oracle table

My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This seems to run for 4 hours but never seems to complete.
I spoke with a former developer who was more familiar with Oracle than I am, and they told me you would normally create a procedure that breaks this down into chunks of data to be committed (roughly 1000 records or so). This procedure would iterate over the 13 million records, committing 1000 records, then the next 1000... normally breaking the data up based on the primary key.
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table or linking to another database; I'm simply copying data from one column to another. I don't consider 13 million records to be large, considering there are systems out there on the order of billions of records. I can't imagine it takes a computer hours and hours (only to fail) to copy a simple column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE run for 4 hours and fail (which simply copies one column into another column), but the create table which copies the entire table takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me. However, this comes to mind:
When you are updating the table, undo and redo information must be written for every row in case a rollback is needed. When creating a table with CREATE TABLE ... AS SELECT, that per-row overhead largely isn't there.
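Which is why the create-table-as-select route the question already found to be fast is the usual workaround. Roughly (NOLOGGING is optional; the real select list would include every column of the table, and any indexes, constraints, grants, and triggers have to be recreated on the new table by hand):
-- Build the replacement with COLUMN_A populated at load time instead of updating.
CREATE TABLE mytable_2 NOLOGGING AS
SELECT column_b,
       column_b AS column_a     -- the whole "copy" happens here, once
FROM   mytable;

-- Swap the tables, then recreate indexes, constraints, grants and triggers.
DROP TABLE mytable;
ALTER TABLE mytable_2 RENAME TO mytable;
CREATE INDEX idx_mytable_column_b ON mytable (column_b);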
