Delete a large amount of data in SQL Server

Suppose I have a table with 10,000,000 records. What is the difference between these two solutions?
Delete all the data with a single statement:
DELETE FROM MyTable
Delete the data from an application, row by row:
DELETE FROM MyTable WHERE ID = @SelectedID
Does the first solution have the best performance?
What is the impact on the log and on performance?

If you need to restrict which rows you delete rather than doing a complete delete, or you can't use TRUNCATE TABLE (e.g. the table is referenced by an FK constraint or included in an indexed view), then you can do the delete in chunks:
DECLARE @RowsDeleted INTEGER
SET @RowsDeleted = 1

WHILE (@RowsDeleted > 0)
BEGIN
    -- delete 10,000 rows at a time
    DELETE TOP (10000) FROM MyTable [WHERE .....] -- WHERE is optional
    SET @RowsDeleted = @@ROWCOUNT
END
Generally, TRUNCATE is the best way and I'd use that if possible. But it cannot be used in all scenarios. Also, note that TRUNCATE will reset the IDENTITY value for the table if there is one.
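For illustration, a rough sketch of the difference (assuming MyTable has an IDENTITY column; DBCC CHECKIDENT is shown only as the manual alternative after a DELETE):
TRUNCATE TABLE MyTable;                   -- deallocates pages and resets the IDENTITY seed
-- versus
DELETE FROM MyTable;                      -- fully logged; IDENTITY keeps counting from where it was
DBCC CHECKIDENT ('MyTable', RESEED, 0);   -- reseed manually after a DELETE if you need that behaviour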
If you are using SQL Server 2000 or earlier, TOP is not available in the DELETE statement, so you can use SET ROWCOUNT instead.
DECLARE @RowsDeleted INTEGER
SET @RowsDeleted = 1
SET ROWCOUNT 10000 -- delete 10,000 rows at a time

WHILE (@RowsDeleted > 0)
BEGIN
    DELETE FROM MyTable [WHERE .....] -- WHERE is optional
    SET @RowsDeleted = @@ROWCOUNT
END
SET ROWCOUNT 0 -- reset so later statements are not limited

If you have that many records in your table and you want to delete them all, you should consider TRUNCATE <table> instead of DELETE FROM <table>. It will be much faster, but be aware that it does not fire any triggers.
See the documentation for more details (in this case for SQL Server 2000):
http://msdn.microsoft.com/en-us/library/aa260621%28SQL.80%29.aspx
Deleting the rows from the application one by one will take a very long time, because the DBMS cannot optimize anything: it doesn't know in advance that you are going to delete everything.

The first has clearly better performance.
When you specify DELETE [MyTable] it simply erases everything without checking for IDs. The second one wastes time and disk operations locating each respective record before deleting it.
It gets worse because every time a record disappears from the middle of the table, the engine may want to condense data on disk, wasting time and work again.
Maybe a better idea would be to delete data based on clustered index columns in descending order. Then the table will basically be truncated from the end at every delete operation.
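A rough sketch of that idea, assuming ID is the clustered primary key (names are illustrative, not from the question): delete in batches through a TOP ... ORDER BY ... DESC CTE so each batch removes rows from the tail of the clustered index.
-- delete in 10,000-row batches, always from the end of the clustered index
WHILE 1 = 1
BEGIN
    ;WITH LastRows AS
    (
        SELECT TOP (10000) *
        FROM MyTable
        ORDER BY ID DESC
    )
    DELETE FROM LastRows;

    IF @@ROWCOUNT = 0 BREAK;
END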

Option 1 will create a very large transaction and have a big impact on the log and on performance, as well as escalating locks so that the table will be unavailable.
Option 2 will be slower, although it will generate less impact on the log (assuming bulk-logged / full recovery mode).
If you want to get rid of all the data, TRUNCATE TABLE MyTable would be faster than both; although it has no facility to filter rows, it performs a metadata change behind the scenes and basically drops the IAMs on the floor for the table in question.

The best performance for clearing a table comes from TRUNCATE TABLE MyTable. See http://msdn.microsoft.com/en-us/library/ms177570.aspx for a more detailed explanation.

Found this post on Microsoft TechNet.
Basically, it recommends:
Using SELECT INTO, copy the data that you want to KEEP to an intermediate table;
Truncate the source table;
Copy the data back from the intermediate table into the source table with INSERT INTO;
BEGIN TRANSACTION
SELECT *
INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 0;
TRUNCATE TABLE dbo.bigtable;
SET IDENTITY_INSERT dbo.bigTable ON;
INSERT INTO dbo.bigtable WITH (TABLOCK) (Id, c1, c2, c3)
SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
SET IDENTITY_INSERT dbo.bigtable OFF;
ROLLBACK TRANSACTION -- dry run only; use COMMIT TRANSACTION to actually keep the change

The first will delete all the data from the table and will have better performance than your second, which deletes only the data for a specific key.
If you have to delete all the data from the table and you don't rely on being able to roll back, consider using TRUNCATE TABLE.

Related

Deleting old records from a very big table based on criteria

I have a table (Table A) that contains 300 million records. I want to do a data-retention activity based on some criteria, so I want to delete about 200M records from the table.
For performance reasons, I planned to create a new table (Table-B) with the oldest 10M records from Table-A. Then I can select the records from Table-B that match the criteria and delete them in Table-A.
Extracting 10M records from Table-A and loading them into Table-B using SQL*Loader takes ~5 hours.
I already created indexes and I use parallel 32 wherever applicable.
What I wanted to know is:
Is there any better way to extract from Table-A and load into Table-B?
Is there any better approach than creating a temp table (Table-B)?
DBMS: Oracle 10g, PL/SQL and shell.
Thanks.
If you want to delete 70% of the records in your table, the best way is to create a new table that contains the remaining 30% of the rows, drop the old table, and rename the new table to the name of the old table. One way to create the new table is a create-table-as-select statement (CTAS), but there are also approaches that have a much smaller impact on the running system, e.g. you can use a materialized view to select the remaining data and then convert the materialized view to a table. The details of the approach depend on the requirements.
This reading and writing is much more efficient than deleting the rows of the old table.
If you delete the rows of the old table, it is probably necessary to reorganize the old table afterwards, which will also end up writing those remaining 30% of the data.
Partitioning the table by your criteria may be an option.
Consider a case where the criterion is the month: all January data falls into the Jan partition, all February data falls into the Feb partition, and so on.
Then when it comes time to drop all the old January data, you just drop the partition.
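For illustration only (table, column and partition names are assumptions), a month-range partitioned table where dropping old data becomes a metadata operation:
-- illustrative month-range partitioning
CREATE TABLE table_a_part (
  id          NUMBER,
  created_dt  DATE,
  payload     VARCHAR2(100)
)
PARTITION BY RANGE (created_dt) (
  PARTITION p_2023_01 VALUES LESS THAN (DATE '2023-02-01'),
  PARTITION p_2023_02 VALUES LESS THAN (DATE '2023-03-01'),
  PARTITION p_max     VALUES LESS THAN (MAXVALUE)
);

-- dropping a whole month of old data is then fast
-- (add UPDATE GLOBAL INDEXES if global indexes exist)
ALTER TABLE table_a_part DROP PARTITION p_2023_01;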
Using ROWID is the best approach, but an inline cursor can help you more.
For example: insert the rows you want to keep into Table-B (INSERT INTO table_b SELECT * FROM table_a WHERE <criteria>), then truncate Table-A.
Is there any better way to extract from Table-A and to load it into Table-B? You can use a parallel CTAS: CREATE TABLE table_b AS SELECT ... FROM table_a. You can use compression and parallel query in one step.
Is there any better approach other than creating a temp table (Table-B)? A better approach would be partitioning of Table-A.
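A hedged sketch of that CTAS idea (table names, degree of parallelism and the retention predicate are assumptions):
-- parallel, compressed, minimally logged CTAS keeping only the rows to retain
CREATE TABLE table_b
  COMPRESS
  NOLOGGING
  PARALLEL 32
AS
SELECT /*+ PARALLEL(t, 32) */ *
FROM   table_a t
WHERE  t.created_dt >= SYSDATE - 365;   -- assumed retention criterion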
Probably the better approach would be partitioning of Table-A, but if not you can try something fast and simple:
declare
i pls_integer :=0 ;
begin
for r in
( -- select what you want to move to second table
SELECT
rowid as rid,
col1,
col2,
col3
FROM
table_a t
WHERE
t.col < SYSDATE - 30 --- or other criteria
)
loop
insert /*+ append */ into table_b values (r.col1, r.col2, r.col3 ); -- insert it to second table
delete from table_a where rowid = r.rid; -- and delete it
if i < 500 -- check your best commit interval
then
i:=i+1;
else
commit;
i:=0;
end if;
end loop;
commit;
end;
In the above example you move your records in small transactions of about 500 rows. You could optimize it using collections and bulk inserts, but I wanted to keep the code simple.
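For reference, a hedged sketch of that bulk variant (column names and the criterion are assumptions, not from the original answer); it keeps the same batch-commit behaviour:
declare
  cursor c is
    select rowid, col1, col2, col3
    from   table_a t
    where  t.created_dt < sysdate - 30;   -- assumed criterion
  type t_rid  is table of rowid;
  type t_col1 is table of table_a.col1%type;
  type t_col2 is table of table_a.col2%type;
  type t_col3 is table of table_a.col3%type;
  l_rid  t_rid;
  l_col1 t_col1;
  l_col2 t_col2;
  l_col3 t_col3;
begin
  open c;
  loop
    fetch c bulk collect into l_rid, l_col1, l_col2, l_col3 limit 500;
    exit when l_rid.count = 0;

    forall i in 1 .. l_rid.count
      insert into table_b (col1, col2, col3)
      values (l_col1(i), l_col2(i), l_col3(i));

    forall i in 1 .. l_rid.count
      delete from table_a where rowid = l_rid(i);

    commit;   -- one commit per 500-row batch, as in the simple version
  end loop;
  close c;
end;
/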
I was missing an index on one of the columns I was using in the search criteria.
Apart from this, there were also some indexes missing on referenced tables.
@miracle173's answer is also good, but we have some foreign keys that might create problems if we had used that approach.
+1 to @miracle173

How to copy a large amount of data from one table to another in the same database?

I have two tables with same column structure in the same database: TableA and TableB.
TableA doesn't have any indexes, but TableB has a non-clustered unique index.
TableA has 290 million rows of data that need to be copied to TableB.
As they both have the same structure, I've tried:
INSERT INTO TableB
SELECT *
FROM TableA;
It was executing for hours and produced a huge log file that filled the disk. As a result the disk ran out of space and the query was killed.
I can shrink the log file. How can I copy this many rows of data to another table efficiently?
First of all, disable the index on TableB before inserting the rows. You can do it using T-SQL:
ALTER INDEX IX_Index_Name ON dbo.TableB DISABLE;
Make sure to disable all the constraints (foreign keys, check constraints, unique indexes) on your destination table.
Re-enable (and rebuild) them after the load is complete.
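For example (index and constraint names are illustrative), the disable/re-enable pair could look like this; rebuilding an index also re-enables it:
-- before the load: disable FK and check constraints on the destination
ALTER TABLE dbo.TableB NOCHECK CONSTRAINT ALL;

-- after the load: rebuild (and thereby re-enable) the index, then re-validate constraints
ALTER INDEX IX_Index_Name ON dbo.TableB REBUILD;
ALTER TABLE dbo.TableB WITH CHECK CHECK CONSTRAINT ALL;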
Now, there's a couple of approaches to solve the problem:
You have to be OK with a slight chance of data loss: use the INSERT INTO ... SELECT ... FROM ... syntax you have, but switch your database to the Bulk-logged recovery model first (read up before switching). This won't help if you're already in Bulk-logged or Simple.
With exporting the data first: you can use the BCP utility to export/import the data. It supports loading data in batches. Read more about using the BCP utility here (see the sketch after the batching example below).
Fancy, with exporting the data first: with SQL Server 2012+ you can try exporting the data into a binary file (using the BCP utility) and loading it with the BULK INSERT statement, setting the ROWS_PER_BATCH option.
Old-school "I don't give a damn" method: to prevent the log from filling up you will need to perform the inserts in batches of rows, not everything at once. If your database is running in the Full recovery model you will need to keep log backups running, maybe even increasing the frequency of the job.
To batch-load your rows you will need a WHILE loop (don't use them in day-to-day stuff, just for batch loads); something like the following will work if you have an identifier column in the dbo.TableA table:
DECLARE @RowsToLoad BIGINT;
DECLARE @RowsPerBatch INT = 5000;
DECLARE @LeftBoundary BIGINT = 0;
DECLARE @RightBoundary BIGINT = @RowsPerBatch;

SELECT @RowsToLoad = MAX(IdentifierColumn) FROM dbo.TableA;

WHILE @LeftBoundary < @RowsToLoad
BEGIN
    INSERT INTO TableB (Column1, Column2)
    SELECT
        tA.Column1,
        tA.Column2
    FROM
        dbo.TableA AS tA
    WHERE
        tA.IdentifierColumn > @LeftBoundary
        AND tA.IdentifierColumn <= @RightBoundary;

    SET @LeftBoundary = @LeftBoundary + @RowsPerBatch;
    SET @RightBoundary = @RightBoundary + @RowsPerBatch;
END
For this to work effectively you really want to consider creating an index on dbo.TableA (IdentifierColumn), just for the time you're running the load.
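For the BCP / BULK INSERT routes mentioned above, a rough sketch (server name, database name and file path are assumptions, not from the original answer):
-- export TableA in native format (bcp runs from a command prompt, not from T-SQL)
-- bcp MyDatabase.dbo.TableA out C:\temp\TableA.dat -S MyServer -T -n

-- then load the file into TableB in batches
BULK INSERT dbo.TableB
FROM 'C:\temp\TableA.dat'
WITH (
    DATAFILETYPE = 'native',
    ROWS_PER_BATCH = 50000,
    TABLOCK
);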

Delete vs Rollback Strategy - ETL Load

I am loading data to table in the following manner:
DECLARE @srcRc INT;
DECLARE @dstRc INT;
SET @srcRc = ( SELECT COUNT(*) FROM A );

INSERT INTO t
    (Col1
    ,Col2
    ,Col3
    )
SELECT A.Col1
      ,A.Col2
      ,B.Col3
FROM A
JOIN B
    ON A.Id = B.Id;

SET @dstRc = @@ROWCOUNT;
Now I am comparing the variables @srcRc and @dstRc. The row counts must be the same; if they are not, the inserted rows need to be deleted.
Q1: What would be the best strategy to rollback the inserted rows?
I have couple of ideas:
1) Run the load in transaction and rollback if the rowcount does not match.
2) Add a flag column (bit) called toBeDeleted to the destination table; run the load and, if the row count does not match, set toBeDeleted to 1 to flag those rows as candidates for deletion. Then delete them in batch mode (while loop), or do not delete them at all but always exclude deletion candidates from queries against the t table.
3) Before inserting the rows, compare the row counts first. If they do not match, don't start the load.
DECLARE @srcRc INT;
DECLARE @dstRc INT;
SET @srcRc = ( SELECT COUNT(1) FROM A );
SET @dstRc = ( SELECT COUNT(1) FROM A JOIN B ON A.Id = B.Id );
Q2: What would be a better solution for a higher number of rows, say 10-100 million?
Q3: Or is there any better strategy for similar case?
OK, assuming:
You need the rollback to work at some later date, when the content of tables A and B may have changed;
There may also be other rows in T which you don't want to delete as part of the rollback.
Then you MUST keep a list of the rows you inserted, as you are unable to reliably regenerate that list from A and B, and you can't just delete everything from T.
You could do this in two ways:
Change your import so that it first inserts the rows into an import table; keep the import table around until you are sure you don't need it any more.
Add an extra column [importId] to T, into which you put a value that uniquely identifies the load.
Obviously the first strategy uses a lot more disk space. So the longer you keep the data and the more data there is, the better the extra column looks.
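A minimal sketch of the extra-column approach (column and variable names are assumptions, not from the question):
-- tag every row of a load with an import id, so a failed load can be removed precisely
ALTER TABLE t ADD importId UNIQUEIDENTIFIER NULL;
GO  -- run the ALTER in its own batch before referencing the new column

DECLARE @importId UNIQUEIDENTIFIER = NEWID();

INSERT INTO t (Col1, Col2, Col3, importId)
SELECT A.Col1, A.Col2, B.Col3, @importId
FROM A
JOIN B ON A.Id = B.Id;

-- if the row counts don't match, undo just this load
DELETE FROM t WHERE importId = @importId;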
Another option would be to generate the list of imported data separately and have your transaction SQL be a bulk insert with all the data hard-coded into the SQL.
This works well for small lists, initial setup data and the like.
Edit:
From your comments it sounds like you don't want a rollback per se, but rather the best way to apply business logic around the import process.
In this case your third option is the best: don't do the import when you know the source data is incorrect.

UPDATE slow when setting column to NULL

I have a SQL Server 2008 table with 80,000 rows and am executing the following query:
UPDATE dbo.TableName WITH (ROWLOCK)
SET HelloWorldID = NULL
WHERE HelloWorldID = @helloWorldID
HelloWorldID is an int and the @helloWorldID parameter is also an int.
The query is taking too long and I'd like to optimize it. I created a nonclustered index on HelloWorldID but it didn't help. I may have to redesign this... maybe put HelloWorldID in another table that links it to the TableName table?
Since the command you're waiting on is DELETE I have to guess that there is a trigger on dbo.TableName and that it is performing additional work that you do not expect. Or perhaps some CASCADE option that is affecting other tables that have triggers on them.
It all depends on how many rows will be updated by this query.
If you're updating a lot of rows, say 30% of the table, then the index will actually slow down the query (the index has to be updated along with the table, and it won't help with filtering the rows to update). ROWLOCK will also slow it down, because the engine issues a separate lock for each row (as opposed to the page locks that would normally occur).
Try removing the index and running this update using WITH(TABLOCK) just to see what happens.
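That is, something along these lines (a sketch of the same statement with the hint swapped):
UPDATE dbo.TableName WITH (TABLOCK)
SET HelloWorldID = NULL
WHERE HelloWorldID = @helloWorldID;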
I get this problem sometimes. Your query depends on simultaneously getting a write lock on every row in the table that meets the conditions of the WHERE clause. Depending on your need for full 'ACID' behaviour, you could do something like this:
SELECT getdate() -- force @@ROWCOUNT = 1
WHILE @@ROWCOUNT > 0
    UPDATE TOP (1000) dbo.TableName
    SET HelloWorldID = NULL
    WHERE HelloWorldID = @helloWorldID
This will do the update in smaller chunks and help overcome locking issues. But remember, this method gives up on doing the query as a single transaction. You will need to tune the 1000 to a value that is right for your server.

Updating two columns in a table containing millions of rows

I am updating two columns in a table that contains millions of rows (85 million). To update them I am using an UPDATE command like:
UPDATE Table1
SET Table1.column1 = Table2.column1,
    Table1.column2 = Table2.column2
FROM ... -- tables and join conditions
My problem is that it is taking 23 hours. Even after using a batch size there is not much change in the time taken.
But I need to update it in less than 5 hours. Is that possible? What approach should I take to achieve it?
SQL UPDATE statements have to keep all the affected rows in the log file so they can roll back on failure. As explained by this guy, the best way to handle millions of rows is to forget about atomicity and batch your updates into 50,000 rows (or whatever):
--Declare variable for row count
Declare @rc int
Set @rc = 50000

While @rc = 50000
Begin
    Begin Transaction

    --Use Top (50000) to limit the number of updates
    --performed in each batch to 50K rows.
    --Use tablockx and holdlock to obtain and hold
    --an immediate exclusive table lock. This usually
    --speeds up the update because only one lock is needed.
    Update Top (50000) mt
    Set UpdFlag = 0
    From MyTable mt With (tablockx, holdlock)
    Join ControlTable ct
        On mt.KeyCol = ct.PK
    --Add criteria to avoid updating rows that
    --were updated in a previous pass
    Where mt.UpdFlag <> 0

    --Get the number of rows updated.
    --The process continues until fewer than 50000 are updated.
    Select @rc = @@ROWCOUNT

    --Commit the transaction
    Commit
End
This still has some problems, in that you need to know which rows you've already handled; perhaps someone smarter than this guy (and me!) can figure out something nicer with more MSSQL magic, but this should be a start.
I have used SSIS for this task.
First I took the source table in which I have to update the two columns. Then I added a Lookup task, mapping the source columns to the destination table columns from which I have to get the data to update the source columns. Finally I added an OLE DB destination, filling the table based on the join conditions from the lookup.
This process was much faster than executing an update script.
