Is it more efficient and ultimately FASTER to delete rows from a DB in blocks of 1000 or 10000? I am having to remove approximately 3 million rows from many tables. I first did the deletes in blocks of 100K rows, but the performance wasn't looking good. I changed to 10,000 and seem to be removing them faster. I'm wondering whether even smaller blocks, like 1K per DELETE statement, would be better still.
Thoughts?
I am deleting like this:
DELETE TOP(10000)
FROM TABLE
WHERE Date < '1/1/2012'
Yes, it is. It all depends on your server though. I mean, the last time I did that I was using this approach to delete things in increments of 64 million (on a table that had, at that point, around 14 billion rows, 80% of which ultimately got deleted). I got a delete through every 10 seconds or so.
It really depends on your hardware. Going more granular is more work, but it means less waiting on the transaction log for other things operating on the table. You have to try it out and find where you are comfortable - there is no ultimate answer because it is totally dependent on the usage of the table and on the hardware.
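To experiment with batch sizes, one option is to wrap the batched delete in a loop that runs until a batch deletes nothing. A minimal sketch (assuming SQL Server, with TABLE standing in for the real table name as in the question, and the batch size as a placeholder to tune):

DECLARE @BatchSize int = 10000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM TABLE
    WHERE Date < '20120101';

    IF @@ROWCOUNT = 0 BREAK;   -- stop once a batch finds nothing left to delete
END

Changing the batch size is then a one-line edit, which makes it easy to measure what your log and locking can tolerate.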
We used table partitioning to remove 5 million rows in less than a second, but this was from just one table. It took some work up front but ultimately was the best way. This may not be the best way for you.
From our document about partitioning:
Let’s say you want to add 5 million rows to a table but don’t want to lock the table up while you do it. I ran into a case in an ordering system where I couldn’t insert the rows without stopping the system from taking orders. BAD! Partitioning is one way of doing it if you are adding rows that don’t overlap current data.
WHAT TO WATCH OUT FOR:
Data CANNOT overlap current data. You have to partition the data on a value; the new data cannot be intertwined within the currently partitioned data.
If removing data, you have to remove an entire partition or partitions - you will not have a WHERE clause.
If you are doing this on a production database and want to limit the locking on the table, create your indexes with “ONLINE = ON”.
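For example, a hedged sketch (the index, table and column names here are placeholders, and online index builds require an edition that supports them):

CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    WITH (ONLINE = ON);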
OVERVIEW OF STEPS:
FOR ADDING RECORDS
Partition the table you want to add records to (leave a blank partition for the new data). Do not forget to partition all of your indexes.
Create new table with the exact same structure (keys, data types, etc.).
Add a constraint to the new table to limit that data so that it would fit into the blank partition in the old table.
Insert new rows into new table.
Add indexes to match old table.
Swap the new table with the blank partition of the old table.
Un-partition the old table if you wish.
FOR DELETING RECORDS
Partition the table into sets so that the data you want to delete is all on partitions by itself (this could be many different partitions).
Create a new table with the same partitions.
Swap the partitions with the data you want to delete to the new table.
Un-partition the old table if you wish.
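As a rough illustration of the delete case: the swap itself is a metadata-only operation, so it is near-instant regardless of row counts. All object names below are made up, and the staging table must already exist, be empty, match the source table's structure, and sit on the same filegroup as the partition being switched out:

ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Purge;
DROP TABLE dbo.BigTable_Purge;   -- the old rows are gone without row-by-row DELETEs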
Yes and no - it depends on the usage of the table, because of locking. I would try deleting the records at a slower pace, i.e. the opposite of the OP's question:
set rowcount 10000

-- run the first batch so @@rowcount is set and the loop is entered
delete from table
where date < convert(datetime, '20120101', 112)

while @@rowcount > 0
begin
    waitfor delay '0:0:1'   -- pause between batches to give other work a chance

    delete from table
    where date < convert(datetime, '20120101', 112)
end

set rowcount 0
I have a database table which has more than 1 million records uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select can happen from multiple places: sometimes the row is returned as a single row, sometimes as part of a set of rows. There is a select query that does the fetching over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up the database table: I want to delete all rows which were never used (i.e. never retrieved via a select query) in the last 5 years.
Does Oracle DB have any built-in metadata which can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table. But that is a costly operation for me in terms of the time the whole process takes: at least 1,000 - 10,000 records are selected from the table in a single operation. Is there a more efficient way to do this than updating the table after reading from it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waits for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Be careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified and whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently; keeping a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. you create a new partition for every day, you can over time easily determine which days are still being read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion, you can either use In-Database Archiving to mark rows as archived or just go ahead and purge them. If you happen to have the data partitioned, you can do easy DROP PARTITION operations to purge one or many partitions rather than having to run conventional DELETE statements.
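If the licensing is in place, a rough sketch of enabling the tracking and inspecting it at segment level could look like the following (MYTABLE is a placeholder; the view and column names are as I recall them from the 12c data dictionary, so verify them against your version):

ALTER SYSTEM SET heat_map = ON;

SELECT object_name, subobject_name, segment_write_time, segment_read_time, full_scan, lookup_scan
FROM   dba_heat_map_segment
WHERE  object_name = 'MYTABLE';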
I couldn't use any built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and similarly the trigger had a performance hit.
Finally I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used or selected in a query are inserted. While deleting, I cross-check this table and avoid deleting entries present in it.
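A rough sketch of that clean-up delete, with made-up table and column names (RECENTLY_USED being the side table that records the GUIDs still in use):

DELETE FROM my_table t
WHERE  t.created_date < ADD_MONTHS(SYSDATE, -60)       -- older than 5 years
AND    NOT EXISTS (SELECT 1
                   FROM   recently_used r
                   WHERE  r.row_guid = t.row_guid);    -- skip rows logged as still used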
I am looking for a much better way to update tables using SSIS. Specifically, I want to optimize the updates on around 10 tables that use the same logic.
The logic is:
Select the source data from staging, then insert it into a physical temp table in the DW (i.e. TMP_Tbl).
Update all data matching on the customerId column from TMP_Tbl to MyTbl.
Insert all non-existing customerId rows from TMP_Tbl1 to MyTbl.
Using the above steps, it takes some time to populate TMP_Tbl. Hence, I planned to change the logic to delete-insert, but according to this:
In SQL, is UPDATE always faster than DELETE+INSERT? - this would be a recipe for pain.
Given:
no index/keys used on the tables
some tables contains 5M rows, some contains 2k rows
each table update takes up to 2-3 minutes, which adds up to about 15 to 20 minutes in all
the updates are in separate sequence containers that run simultaneously
Does anyone know the best way to do this? It seems like the physical temp table needs to be removed - is this normal?
With SSIS you usually BULK INSERT, not INSERT. So if you do not mind the DELETE, reinserting the rows should in general outperform an UPDATE.
Considering this, the faster approach would be:
[Execute SQL Task] Delete all the records which you need to update (depending on your DB design and queries, an index may help here); a sketch follows after these two steps.
[Data Flow Task] Fast load (using an OLE DB Destination, Data access mode: "Table or view - fast load") both the updated and the new records from the source into MyTbl. No need for temp tables here.
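A hedged sketch of the Execute SQL Task statement, with placeholder table names (dbo.Staging standing in for wherever the incoming rows land):

DELETE m
FROM   dbo.MyTbl AS m
WHERE  EXISTS (SELECT 1
               FROM   dbo.Staging AS s
               WHERE  s.customerId = m.customerId);   -- remove rows that are about to be re-loaded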
If you cannot/don't want to DELETE records - your current approach is OK too.
You just need to fix the performance of that UPDATE query (adding an index should help). 2-3 minutes per record updated is way too long; if it is 2-3 minutes for updating millions of records, though, then it's acceptable.
Adding the correct non-clustered index to a table should not result in "much more time on the updates".
There will be a slight overhead, but if it helps your UPDATE to seek instead of scanning a big table - it is usually well worth it.
I have a Firebird table with 60 million rows and I need to delete about half of the table.
The table rows hold the GPS positions of cars, a timestamp, and other data. The table has a primary key on IdVehicle+TimeStamp and one foreign key (to the Vehicle table). There is no other key, index, or trigger. One vehicle has 100,000 - 500,000 records.
I need to delete the older data, e.g. for all vehicles delete the data older than 1 March 2015. I tried different ways, and my fastest approach so far uses EXECUTE BLOCK (via the primary key): first I read, for one vehicle, the records older than 1.3.2015, then I go through the individual records, prepare an EXECUTE BLOCK statement for every 50 entries, and run it against Firebird.
EXECUTE BLOCK AS BEGIN
    DELETE FROM RIDE_POS WHERE IdVehicle = 1547 AND date = '4.5.2015 8:56:47';
    DELETE FROM RIDE_POS WHERE IdVehicle = 1547 AND date = '4.5.2015 8:56:59';
    DELETE FROM RIDE_POS WHERE IdVehicle = 1547 AND date = '4.5.2015 8:57:17';
    /* ... a total of 50 such DELETE lines ... */
END
This deletes 1 million rows in about 800 seconds (roughly 1 record per millisecond).
Is there a quicker way to delete the records?
Additionally, this way I can only delete a few million rows before I have to restart Firebird, otherwise it starts to slow down and jam (on the test server there is no other database or application). At first the records are cleared quickly, but it gradually takes longer and longer.
For orientation: how quickly do you routinely delete records from large tables (not wiping the whole table, only deleting part of the records)?
If you want to delete all records older than a given date, no matter the vehicle, then there is no point including IdVehicle in the query; the date alone is enough. I.e. the following should do - just a straight query, no need for EXECUTE BLOCK either:
DELETE FROM RIDE_POS WHERE date < '2015-03-01'
If you have to delete many thousands (or millions) of records, do not do it in one single transaction. You are better off doing it in several steps - delete, for example, 1000 records and commit, then delete another 1000 and commit; that should be faster than deleting a million records in one transaction. 1000 is not a rule - it depends on your particular situation (how large your records are, how much linked data they have via foreign keys with "on delete cascade"). Also check whether you have "on delete" triggers and whether it is possible to temporarily deactivate them.
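If I remember Firebird's DML correctly, the ROWS clause gives you a simple way to batch this; treat the following as a sketch and verify it against your Firebird version. Run it, commit, and repeat until no rows are affected:

DELETE FROM RIDE_POS
WHERE date < '2015-03-01'
ROWS 1000;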
Maybe a combined approach would help.
Add (temporarily) index on date:
CREATE INDEX IDX_RIDE_POS_date_ASC ON RIDE_POS (date)
Write an execute block:
EXECUTE BLOCK
AS
    DECLARE VARIABLE V_ID_VEHICLE INTEGER;
BEGIN
    FOR SELECT DISTINCT IdVehicle
        FROM RIDE_POS
        INTO :V_ID_VEHICLE
    DO BEGIN
        DELETE FROM RIDE_POS
        WHERE IdVehicle = :V_ID_VEHICLE AND date < '2015-03-01';
    END
END
Drop the index if you don't want to keep it:
DROP INDEX IDX_RIDE_POS_date_ASC;
I think that even taking into account the time needed to create the index, you would still save some time on deleting the records.
Finally, I found where the problem was. The main problem was that I was using a classic WinForms application (or IBExpert) to run the deletes, and that caused the jams and slowed the queries. I had used EXECUTE BLOCK to erase the data in portions, which solved the jamming, but it was slow.
The solution was to create a simple console application and run the query from it. I kept the primary key and deleted through it (no adding or dropping of indexes), and the delete speed was about 65 rows per millisecond (1 million rows per 16 seconds).
When I tried dropping the primary key and adding an index on the datetime column instead, the deletes sped up only a little, about 5-10%.
First off, I am kinda new to programming... Let's assume I created 6 filegroups/files (null, file with data, file with data, file with data, file with data, null) to partition a table at the beginning, with range boundaries at 5, 10, 15+ million. Over time users have filled this table and I realized it has reached 23 million rows. Of course I want to limit each filegroup file to 5 million rows, so I want to change the partition structure of the partitioned table and reorganize it, i.e. add new filegroups and files as 5, 10, 15, 20, 25, 30+ million. How should I do that? Or should I follow a completely different solution or something else? Thank you...
You want to split the partition function, see here: https://msdn.microsoft.com/en-us/library/ms186307.aspx
Note that splitting a partition where there is existing data is going to take a while (depending on how much data) and will lock the table while it runs. Generally you want to create partitions ahead of time.
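A hedged sketch of one split (the partition function, scheme and filegroup names are invented; you repeat the pair of statements for each new boundary you add):

ALTER PARTITION SCHEME psRows
    NEXT USED [FG_20M];          -- filegroup that will receive the new partition

ALTER PARTITION FUNCTION pfRows ()
    SPLIT RANGE (20000000);      -- introduce a new boundary at 20 million rows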
My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This seems to run for 4 hours but never seems to complete.
I spoke with a former developer who was more familiar with Oracle than I am, and they told me you would normally create a procedure that breaks this down into chunks of data to be committed (roughly 1000 records or so). The procedure would iterate over the 13 million records, commit 1000 records, then commit the next 1000, and so on - normally breaking the data up based on the primary key.
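If I understood them correctly, such a procedure would look roughly like the block below (my own sketch, not their code; the DECODE comparison is just my guess at a NULL-safe way to pick rows that have not been copied yet):

BEGIN
  LOOP
    UPDATE MYTABLE
    SET    COLUMN_A = COLUMN_B
    WHERE  DECODE(COLUMN_A, COLUMN_B, 0, 1) = 1   -- NULL-safe "not yet copied" test
    AND    ROWNUM <= 1000;

    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;                                       -- commit after every batch of 1000 rows
  END LOOP;
  COMMIT;
END;
/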
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table or linking to another database; I'm simply copying data from one column to another. I don't consider 13 million records to be large considering there are systems out there on the order of billions of records. I can't imagine it takes a computer hours and hours (only to fail) to copy a single column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE run for 4 hours and fail (which simply copies one column into another column), but the create table which copies the entire table takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me too. However, this comes to mind:
When you are updating the table, transaction log (undo) information must be written for every row in case a rollback is needed. When creating a table, that isn't necessary.
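A rough sketch of the create-and-swap route from the question, using the names the questioner already used (in practice you would also select the table's remaining columns and recreate constraints, indexes, grants and triggers afterwards):

CREATE TABLE MYTABLE_2 AS
    SELECT COLUMN_B, COLUMN_B AS COLUMN_A   -- plus the table's other columns
    FROM   MYTABLE;

DROP TABLE MYTABLE;
RENAME MYTABLE_2 TO MYTABLE;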