Updating two columns in a table containing millions of rows - sql-server

I am updating 2 columns in a table that contains 85 million rows. To update them I am using an UPDATE command like this:
UPDATE Table1
SET Table1.column1 = Table2.column1 ,
Table1.column2 = Table2.column2
FROM
Tables and with a Join-conditions;
My problem is that it takes 23 hours. Even when I run it in batches, the time taken barely changes.
I need to finish the update in less than 5 hours. Is that possible? What approach should I take to achieve it?

A single UPDATE statement has to keep every changed row in the transaction log so that it can roll back on failure. As explained by this guy, the best way to handle millions of rows is to forget about atomicity and batch your updates into 50,000 rows (or whatever):
--Declare variable for row count
Declare @rc int
Set @rc=50000
While @rc=50000
Begin
    Begin Transaction
    --Use Top (50000) to limit number of updates
    --performed in each batch to 50K rows.
    --Use tablockx and holdlock to obtain and hold
    --an immediate exclusive table lock. This usually
    --speeds the update because only one lock is needed.
    Update Top (50000) mt
    Set UpdFlag = 0
    From MyTable mt With (tablockx, holdlock)
    Join ControlTable ct
    On mt.KeyCol=ct.PK
    --Add criteria to avoid updating rows that
    --were updated in previous pass
    Where mt.UpdFlag <> 0
    --Get number of rows updated
    --Process will continue until less than 50000
    Select @rc=@@rowcount
    --Commit the transaction
    Commit
End
This still has some problems, in that you need to know which rows you've already handled. Perhaps someone smarter than this guy (and me!) can figure out something nicer with more MSSQL magic, but this should be a start.
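One way to track which rows have already been handled, without a flag column, is to walk the key in ranges and remember the last key processed. The following is only a rough sketch of that idea, not the method from the linked post. It assumes both tables share an integer key (called Id here, which is not in the question), and it uses SQLite through Python's sqlite3 module as a stand-in so that it can be run anywhere. On SQL Server the same loop would use UPDATE ... FROM with a join and the same key-range predicate.

```python
import sqlite3

BATCH_SIZE = 1000  # the answer above uses 50,000; smaller here so the demo runs several batches

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (Id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT);
CREATE TABLE Table2 (Id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT);
""")
# Hypothetical demo data: 5,000 rows in each table, matched on Id.
conn.executemany("INSERT INTO Table1 (Id) VALUES (?)", [(i,) for i in range(1, 5001)])
conn.executemany(
    "INSERT INTO Table2 VALUES (?, ?, ?)",
    [(i, f"a{i}", f"b{i}") for i in range(1, 5001)],
)
conn.commit()


def update_in_key_ranges(conn, batch_size):
    """Copy column1/column2 from Table2 into Table1 one key range at a time.

    Returns the number of batches that were committed.
    """
    last_id = 0  # highest key already processed
    batches = 0
    while True:
        # Find the upper bound of the next batch by walking the primary key.
        upper_id = conn.execute(
            "SELECT MAX(Id) FROM "
            "(SELECT Id FROM Table1 WHERE Id > ? ORDER BY Id LIMIT ?)",
            (last_id, batch_size),
        ).fetchone()[0]
        if upper_id is None:  # no keys left above last_id
            break
        conn.execute(
            """
            UPDATE Table1
            SET column1 = (SELECT t2.column1 FROM Table2 t2 WHERE t2.Id = Table1.Id),
                column2 = (SELECT t2.column2 FROM Table2 t2 WHERE t2.Id = Table1.Id)
            WHERE Id > ? AND Id <= ?
              AND EXISTS (SELECT 1 FROM Table2 t2 WHERE t2.Id = Table1.Id)
            """,
            (last_id, upper_id),
        )
        conn.commit()  # one short transaction per batch
        last_id = upper_id
        batches += 1
    return batches


batches = update_in_key_ranges(conn, BATCH_SIZE)
remaining = conn.execute(
    "SELECT COUNT(*) FROM Table1 WHERE column1 IS NULL"
).fetchone()[0]
```

Because each batch is bounded by the key rather than by a "not yet updated" filter, a restart only needs the last committed key, and no batch has to rescan rows that earlier batches already touched.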

I used SSIS for this task.
First I added the source table whose 2 columns I have to update. Then I added a Lookup task that maps the source columns to the columns of the table that supplies the new values. Finally I added an OLE DB destination, which fills the table based on the join conditions from the lookup.
This process was much faster than executing an update script.

Related

TSQL Large Update in Batches - Is Join Costing Me More Because it is Performed Each Time in a Loop

I'm trying to archive many records in batches rather than in one shot.
Will T-SQL join the two tables, TeamRoster and @teamIdsToDelete, on every iteration of the batch loop? My concern is that if my temporary table is huge and I don't remove records from the temporary table as I go, the JOIN might be unnecessarily expensive. On the other hand, how expensive is it to delete from the temporary table as I go? Is it made up for by the (?real/hypothetical?) smaller joins I'll have to do in each batch?
(Can provide more details/thoughts but will do so if helpful.)
DECLARE @teamIdsToDelete Table
(
    RosterID int PRIMARY KEY
)
--collect the list of active teamIds. we will rely on the modified date to age them out.
INSERT INTO @teamIdsToDelete
SELECT DISTINCT tr.RosterID FROM
rosterload.TeamRoster tr WITH (NOLOCK)
WHERE tr.IsArchive=0 and tr.Loaded=1
--age out remaining rosters. (no cap - proved we can update more than 50k by modifying test case)
WHILE (1 = 1)
BEGIN
    BEGIN TRANSACTION
    UPDATE TOP (1000) r
    SET [Status] = 'Delete', IsArchive = 1, ModifiedDate = GETDATE(), ModifiedBy = 'abc'
    FROM rosterload.TeamRoster r with(rowlock)
    JOIN @teamIdsToDelete ttd ON ttd.rosterID = r.RosterID
    WHERE r.[Status] != 'Delete' AND r.IsArchive != 1 AND r.ModifiedBy != 'abc' -- predicate for filtering;
    IF @@ROWCOUNT = 0 -- terminating condition;
    BEGIN
        COMMIT TRANSACTION
        BREAK
    END
    COMMIT TRANSACTION
END
As I understand it, the goal of this query is to archive a huge number of rows without blocking other queries at the same time. The temp table helps you narrow down the subset of records to archive. Since it has a single column that is the clustered primary key, the join to another PK will be blazingly fast. You would spend more effort on finding and deleting the updated records from the temp table.
Also, there is no reason to use a transaction and do batches. You could just do one big update instead. The result is the same: the table will be locked once the first 5k row locks are acquired (roughly after the first five batches) until the COMMIT statement. The ROWLOCK hint does not prevent lock escalation. On the other hand, running without a transaction would give other queries the opportunity to continue after each 1000-row batch. If you need to make sure that all records are archived in one go, add some retry logic to your query or your application code for errors such as deadlocks or process interruption. And do you really need the NOLOCK hint?

Sqlite Trigger Number of Rows Affected

I need to keep a running count of rows in a very large database. The row count is needed often enough in my program that running Count(*) is too slow, so I will just keep a running count to get around this in SQLite.
CREATE TRIGGER RowCountUpdate AFTER INSERT ON LastSample
BEGIN
UPDATE BufferControl SET NumberOfSamples = NumberOfSamples +
(SELECT Count(*) FROM Inserted);
END;
So from here I want to take the current number of rows (NumberOfSamples) and increment it by the number of rows affected by the insert (and do the same with DELETE, decrementing). In the C API of SQLite, this is done with sqlite3_changes(). However, I cannot use that function in this script. I looked around and saw that some people were using SELECT Count(*) FROM Inserted, but I don't think SQLite supports that.
Is there any statement that SQLite recognizes that holds the number of rows affected by the INSERT and DELETE queries?
SQLite has the changes() SQL function, but like the sqlite3_changes() API function, it reports the number of rows of the last completed statement.
During trigger execution, the triggering statement is not yet completed.
Just use a FOR EACH ROW trigger, and add 1 for each row.
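A minimal sketch of that suggestion, run through Python's sqlite3 module so it can be tried directly. The table names come from the question; the column layout of LastSample, the single-row BufferControl table, and the trigger names are assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LastSample (Id INTEGER PRIMARY KEY, Value REAL);
CREATE TABLE BufferControl (NumberOfSamples INTEGER NOT NULL);
INSERT INTO BufferControl VALUES (0);

-- The trigger body runs once per inserted row, so each firing adds 1.
CREATE TRIGGER RowCountInsert AFTER INSERT ON LastSample
FOR EACH ROW
BEGIN
    UPDATE BufferControl SET NumberOfSamples = NumberOfSamples + 1;
END;

-- Same idea for deletes: each deleted row subtracts 1.
CREATE TRIGGER RowCountDelete AFTER DELETE ON LastSample
FOR EACH ROW
BEGIN
    UPDATE BufferControl SET NumberOfSamples = NumberOfSamples - 1;
END;
""")

# Insert four rows, then delete the one row whose Value is above 4.
conn.executemany(
    "INSERT INTO LastSample (Value) VALUES (?)",
    [(v,) for v in (1.5, 2.5, 3.5, 4.5)],
)
conn.execute("DELETE FROM LastSample WHERE Value > 4")
conn.commit()

running_count = conn.execute("SELECT NumberOfSamples FROM BufferControl").fetchone()[0]
actual_count = conn.execute("SELECT COUNT(*) FROM LastSample").fetchone()[0]
```

After the four inserts and one delete, the counter in BufferControl and COUNT(*) on LastSample should agree, without the trigger ever needing to know how many rows the triggering statement touched.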

Getting bulk data into a busy table

I am currently performing analysis on a client's MSSQL Server. I've already fixed many issues (unnecessary indexes, index fragmentation, NEWID() being used for identities all over the shop etc), but I've come across a specific situation that I haven't seen before.
Process 1 imports data into a staging table, then Process 2 copies the data from the staging table using an INSERT INTO. The first process is very quick (it uses BULK INSERT), but the second takes around 30 mins to execute. The "problem" SQL in Process 2 is as follows:
INSERT INTO ProductionTable(field1,field2)
SELECT field1, field2
FROM SourceHeapTable (nolock)
The above INSERT statement inserts hundreds of thousands of records into ProductionTable, each row allocating a UNIQUEIDENTIFIER, and inserting into about 5 different indexes. I appreciate this process is going to take a long time, so my issue is this: while this import is taking place, a 3rd process is responsible for performing constant lookups on ProductionTable - in addition to inserting an additional record into the table as such:
INSERT INTO ProductionTable(fields...)
VALUES(values...)
SELECT *
FROM ProductionTable (nolock)
WHERE ID = @Id
For the 30 or so minutes that the INSERT...SELECT above is taking place, the INSERT INTO times-out.
My immediate thought is that SQL Server is locking the entire table during the INSERT...SELECT. I did quite a lot of profiling on the server during my analysis, and there are definitely locks being held for the duration of the INSERT...SELECT, though I fail to remember what type they were.
Having never needed to insert records into a table from two sources at the same time - at least during an ETL process - I'm not sure how to approach this. I've been looking up INSERT table hints, but most are being made obsolete in future versions.
It looks to me like a CURSOR is the only way to go here?
You could consider BULK INSERT for Process-2 to get the data into the ProductionTable.
Another option would be to batch Process-2 into small batches of around 1000 records and use a Table Valued Parameter to do the INSERT. See: http://msdn.microsoft.com/en-us/library/bb510489.aspx#BulkInsert
It looks like a table lock.
Try inserting in portions in the ETL process. Something like
while 1=1
begin
    INSERT INTO ProductionTable(field1,field2)
    SELECT top (1000) field1, field2
    FROM SourceHeapTable sht (nolock)
    where not exists (select 1 from ProductionTable pt where pt.id = sht.id)
    -- check @@rowcount immediately after the insert, before another statement can change it
    if @@rowcount = 0
        break;
    -- optional
    --waitfor delay '00:00:01.0'
end

Return unlocked rows in a "select top n" query

I need to have a MsSql database table and another 8 (identical) processes accessing the same table in parallel - making a select top n, processing those n rows, and updating a column of those rows. The problem is that I need to select and process each row just once. This means that if one process got to the database and selected the top n rows, when the second process comes it should find those rows locked and select the rows from n to 2*n rows, and so on...
Is it possible to put a lock on some rows when you select them, and when someone requests top n rows which are locked to return the next rows, and not to wait for the locked ones? Seems like a long shot, but...
Another thing I was thinking - maybe not so elegant but sounds simple and safe, is to have in the database a counter for the instances which made selects on that table. The first instance that comes will increment the counter and select top n, the next one will increment the counter and select rows from n*(i-1) to n*i, and so on...
Does this sound like a good idea? Do you have any better suggestions? Any thought is highly appreciated!
Thanks for your time.
Here's a sample I blogged about a while ago:
The READPAST hint is what ensures multiple processes don't block each other when polling for records to process. Plus, in this example I have a bit field to physically "lock" a record - could be a datetime if needed.
DECLARE @NextId INTEGER
BEGIN TRANSACTION
-- Find the next available item
SELECT TOP 1 @NextId = ID
FROM QueueTable WITH (UPDLOCK, READPAST)
WHERE IsBeingProcessed = 0
ORDER BY ID ASC
-- If found, flag it to prevent being picked up again
IF (@NextId IS NOT NULL)
BEGIN
    UPDATE QueueTable
    SET IsBeingProcessed = 1
    WHERE ID = @NextId
END
COMMIT TRANSACTION
-- Now return the queue item, if we have one
IF (@NextId IS NOT NULL)
    SELECT * FROM QueueTable WHERE ID = @NextId
The simplest method is to use row locking:
BEGIN TRAN
SELECT *
FROM authors
WITH (HOLDLOCK, ROWLOCK)
WHERE au_id = '274-80-9391'
/* Do all your stuff here while the record is locked */
COMMIT TRAN
But if you are accessing your data and then closing the connection, you won't be able to use this method.
How long will you be needing to lock the rows for? The best way might actually be as you say to place a counter on the rows you select (best done using OUTPUT clause within an UPDATE).
The best idea if you want to select records in this manner would be to use a counter in a separate table.
You really don't want to be locking rows on a production database exclusively for any great period of time, therefore I would recommend using a counter. This way only one of your processes would be able to grab that counter number at a time (as it will lock as it is being updated) which will give you the concurrency that you need.
If you need a hand writing the tables and procedures that will do this (simply and safely as you put it!) just ask.
EDIT: ahh, nevermind, you're working in a disconnected style. How about this:
UPDATE TOP (@n) QueueTable SET Locked = 1
OUTPUT INSERTED.Col1, INSERTED.Col2 INTO @this
WHERE Locked = 0
<do your stuff>
Perhaps you are looking for the READPAST hint?
<begin or save transaction>
INSERT INTO @this (Col1, Col2)
SELECT TOP (@n) Col1, Col2
FROM Table1 WITH (ROWLOCK, HOLDLOCK, READPAST)
<do your stuff>
<commit or rollback>

Delete large amount of data in sql server

Suppose that I have a table with 10,000,000 records. What is the difference between these two solutions?
Delete the data like this:
DELETE FROM MyTable
Delete all of the data from an application, row by row:
DELETE FROM MyTable WHERE ID = @SelectedID
Does the first solution have the best performance?
What is the impact on the log and on performance?
If you need to restrict to what rows you need to delete and not do a complete delete, or you can't use TRUNCATE TABLE (e.g. the table is referenced by a FK constraint, or included in an indexed view), then you can do the delete in chunks:
DECLARE @RowsDeleted INTEGER
SET @RowsDeleted = 1
WHILE (@RowsDeleted > 0)
BEGIN
    -- delete 10,000 rows a time
    DELETE TOP (10000) FROM MyTable [WHERE .....] -- WHERE is optional
    SET @RowsDeleted = @@ROWCOUNT
END
Generally, TRUNCATE is the best way and I'd use that if possible. But it cannot be used in all scenarios. Also, note that TRUNCATE will reset the IDENTITY value for the table if there is one.
If you are using SQL 2000 or earlier, the TOP condition is not available, so you can use SET ROWCOUNT instead.
DECLARE @RowsDeleted INTEGER
SET @RowsDeleted = 1
SET ROWCOUNT 10000 -- delete 10,000 rows a time
WHILE (@RowsDeleted > 0)
BEGIN
    DELETE FROM MyTable [WHERE .....] -- WHERE is optional
    SET @RowsDeleted = @@ROWCOUNT
END
SET ROWCOUNT 0 -- restore the default (no row limit) for later statements
If you have that many records in your table and you want to delete them all, you should consider truncate <table> instead of delete from <table>. It will be much faster, but be aware that it cannot activate a trigger.
See for more details (this case sql server 2000):
http://msdn.microsoft.com/en-us/library/aa260621%28SQL.80%29.aspx
Deleting the table row by row from within the application will take a very long time, because your DBMS cannot optimize anything: it doesn't know in advance that you are going to delete everything.
The first has clearly better performance.
When you specify DELETE [MyTable] it will simply erase everything without doing checks for ID. The second one will waste time and disk operation to locate a respective record each time before deleting it.
It also gets worse because every time a record disappears from the middle of the table, the engine may want to condense data on disk, thus wasting time and work again.
Maybe a better idea would be to delete data based on clustered index columns in descending order. Then the table will basically be truncated from the end at every delete operation.
Option 1 will create a very large transaction and have a big impact on the log / performance, as well as escalating locks so that the table will be unavailable.
Option 2 will be slower, although it will generate less impact on the log (assuming bulk / full mode)
If you want to get rid of all the data, Truncate Table MyTable would be faster than both, although it has no facility to filter rows. It does a metadata change behind the scenes and basically drops the IAM on the floor for the table in question.
TRUNCATE TABLE MyTable would give the best performance for clearing a table. See http://msdn.microsoft.com/en-us/library/ms177570.aspx for a more detailed explanation.
Found this post on Microsoft TechNet.
Basically, it recommends:
Use SELECT INTO to copy the data that you want to KEEP to an intermediate table;
Truncate the source table;
Copy the data back from the intermediate table to the source table with INSERT INTO;
..
BEGIN TRANSACTION
SELECT *
INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 0;
TRUNCATE TABLE dbo.bigtable;
SET IDENTITY_INSERT dbo.bigTable ON;
INSERT INTO dbo.bigtable WITH (TABLOCK) (Id, c1, c2, c3)
SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
SET IDENTITY_INSERT dbo.bigtable OFF;
ROLLBACK TRANSACTION -- ROLLBACK undoes all of the above; use COMMIT TRANSACTION to keep the changes
The first will delete all the data from the table and will perform better than the second, which deletes only the data for one specific key at a time.
If you have to delete all the data from the table and you don't rely on being able to roll back, consider using TRUNCATE TABLE.
