Deleting 1 million rows in SQL Server - sql-server

I am working on a client's database where about 1 million rows need to be deleted due to a bug in the software. Is there a more efficient way to delete them than:
DELETE FROM table_1 where condition1 = 'value' ?

Here is a structure for a batched delete, as suggested above. Do not try 1M at once...
The batch size and the WAITFOR delay are obviously quite variable, and will depend on your server's capabilities as well as your need to mitigate contention. You may need to manually delete some rows, measure how long they take, and adjust your batch size to something your server can handle. As mentioned above, anything over 5000 rows can cause lock escalation (which I was not aware of).
This would be best done after hours... but 1M rows is really not a lot for SQL Server to handle. If you watch the Messages tab in SSMS, it may take a while for the print output to show; it will appear after several batches, just be aware it won't update in real time.
Edit: added a stop time via @MAXRUNTIME and @BSTOPATMAXTIME. If you set @BSTOPATMAXTIME to 1, the script will stop on its own at the desired time, say 8:00 AM. This way you can schedule it nightly to start at, say, midnight, and it will stop before production hours at 8 AM.
Edit: this answer is pretty popular, so I have added RAISERROR in lieu of PRINT per the comments.
DECLARE @BATCHSIZE INT, @WAITFORVAL VARCHAR(8), @ITERATION INT, @TOTALROWS INT, @MAXRUNTIME VARCHAR(8), @BSTOPATMAXTIME BIT, @MSG VARCHAR(500)

SET DEADLOCK_PRIORITY LOW;
SET @BATCHSIZE = 4000
SET @WAITFORVAL = '00:00:10'
SET @MAXRUNTIME = '08:00:00' -- 8AM
SET @BSTOPATMAXTIME = 1 -- ENFORCE 8AM STOP TIME
SET @ITERATION = 0 -- LEAVE THIS
SET @TOTALROWS = 0 -- LEAVE THIS

WHILE @BATCHSIZE > 0
BEGIN
    -- IF @BSTOPATMAXTIME = 1, THEN WE'LL STOP THE WHOLE JOB AT A SET TIME...
    IF CONVERT(VARCHAR(8), GETDATE(), 108) >= @MAXRUNTIME AND @BSTOPATMAXTIME = 1
    BEGIN
        RETURN
    END

    DELETE TOP (@BATCHSIZE)
    FROM SOMETABLE
    WHERE <condition>

    SET @BATCHSIZE = @@ROWCOUNT
    SET @ITERATION = @ITERATION + 1
    SET @TOTALROWS = @TOTALROWS + @BATCHSIZE
    SET @MSG = 'Iteration: ' + CAST(@ITERATION AS VARCHAR) + ' Total deletes: ' + CAST(@TOTALROWS AS VARCHAR)

    RAISERROR (@MSG, 0, 1) WITH NOWAIT
    WAITFOR DELAY @WAITFORVAL
END

BEGIN TRANSACTION

DoAgain:
    DELETE TOP (1000)
    FROM <YourTable>

    IF @@ROWCOUNT > 0
        GOTO DoAgain

COMMIT TRANSACTION

(Note that this runs all the batches inside a single transaction, so the log cannot be truncated until the final COMMIT.)

Or maybe this solution from Uri Dimant:

WHILE 1 = 1
BEGIN
    DELETE TOP (2000)
    FROM Foo
    WHERE <predicate>;

    IF @@ROWCOUNT < 2000 BREAK;
END
(Link: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/b5225ca7-f16a-4b80-b64f-3576c6aa4d1f/how-to-quickly-delete-millions-of-rows?forum=transactsql)

Here is something I have used:
If the bad data is mixed in with the good:

INSERT INTO #table
SELECT columns
FROM old_table
WHERE <condition that excludes the bad rows>

TRUNCATE TABLE old_table

INSERT INTO old_table
SELECT columns FROM #table

Not sure how good this would be, but what if you do it as below (provided table_1 is a standalone table, i.e., not referenced by any other table):
create a duplicate of table_1, e.g. table_1_dup
insert into table_1_dup select * from table_1 where condition1 <> 'value';
drop table table_1;
exec sp_rename 'table_1_dup', 'table_1';

If you cannot afford to get the database out of production while repairing, do it in small batches. See also: How to efficiently delete rows while NOT using Truncate Table in a 500,000+ rows table
If you are in a hurry and need the fastest way possible:
take the database out of production
drop all non-clustered indexes and triggers
delete the records (or, if the majority of records are bad, copy+drop+rename the table)
(if applicable) fix the inconsistencies caused by the fact that you dropped triggers
re-create the indexes and triggers
bring the database back in production
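As a rough sketch of those steps (all object names here are hypothetical, and disabling indexes/triggers is shown as a lighter-weight alternative to dropping and re-creating them):

-- Hypothetical names throughout; adapt to your schema
ALTER INDEX IX_table_1_condition1 ON table_1 DISABLE; -- repeat for each non-clustered index
DISABLE TRIGGER ALL ON table_1;

DELETE FROM table_1 WHERE condition1 = 'value';

ENABLE TRIGGER ALL ON table_1;
ALTER INDEX IX_table_1_condition1 ON table_1 REBUILD; -- rebuilding re-enables the index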

Related

Recompilation issue with in-memory tables

In our high-load OLTP processing we use permanent in-memory tables like temporary tables (similar to https://learn.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/faster-temp-table-and-table-variable-by-using-memory-optimization?view=sql-server-2017, case C). But under low load we found a very large number of recompilations in stored procedures, caused by reason '2 - Statistics changed'. The number of rows in these tables varies from 0 to 50-100 per execution. There is no way to disable auto-update statistics on in-memory tables. Also, the 'KEEPFIXED PLAN' hint cannot be applied in subqueries like this:
if exists(
    select 1 from dbo.mytable option (KEEPFIXED PLAN)
)
begin
    select 1
end
Any ideas, how can we avoid excessive recompilations?
Aaron, thank you very much for your suggestion - a solution was found. It's not perfect - we have to rewrite a lot of code, but it works. We have to change the 'if' statement to a query, so we can apply hints this way:
declare @exists int

select @exists = 1 where exists (select 1 from dbo.MyTable) option (KEEPFIXED PLAN)

if @exists = 1
begin
    select 'exists!!!'
end
else
begin
    select 'not exists...'
end

Why is my UPDATE stored procedure executed multiple times?

I use stored procedures to manage a warehouse. PDA scanners scan the added stock and send it in bulk (when plugged back in) to a SQL database (SQL Server 2016).
The SQL database is fairly remote (in another country), so there is sometimes a delay in some queries, but this particular one is problematic: even though the stock table is fine, I had problems updating the occupancy of the warehouse spots. The PDA tracks the items added to every spot as a SMALLINT, then sends this value back to the stored procedure below.
PDA "send_spots" query:
SELECT spot, added_items FROM spots WHERE change=1
Stored procedure:
CREATE PROCEDURE [dbo].[update_spots]
    @spot VARCHAR(10),
    @added_items SMALLINT
AS
BEGIN
    BEGIN TRAN
    UPDATE storage_spots
    SET remaining_capacity = remaining_capacity - @added_items
    WHERE storage_spot = @spot
    IF @@ROWCOUNT <> 1
    BEGIN
        ROLLBACK TRAN
        RETURN -1
    END
    ELSE
    BEGIN
        COMMIT TRAN
        RETURN 0
    END
END
GO
If the remaining_capacity value is 0, the PDAs can't add more items to that spot on the next round. But with this process I got negative values, because the query apparently ran twice (and so subtracted @added_items twice).
Is there a way for that to be possible? How can I fix it? From what I understand, the transaction should be cancelled (ROLLBACK) if the number of affected rows is != 1, but maybe that's something else.
EDIT: current solution, with the help of @Zero:
CREATE PROCEDURE [dbo].[update_spots]
    @spot VARCHAR(10),
    @added_racks SMALLINT
AS
BEGIN
    -- Recover current capacity of the spot
    DECLARE @orig_capacity SMALLINT
    SELECT TOP 1 @orig_capacity = remaining_capacity
    FROM storage_spots
    WHERE storage_spot = @spot

    -- Test if a double is present in the logs by comparing dates (last 10 seconds)
    DECLARE @is_double BIT = 0
    SELECT @is_double = CASE WHEN EXISTS(SELECT *
                                         FROM spot_logs
                                         WHERE log_timestamp >= DATEADD(second, -10, GETDATE()) AND storage_spot = @spot AND delta = @added_racks)
                             THEN 1 ELSE 0 END

    BEGIN
        BEGIN TRAN
        UPDATE storage_spots
        SET remaining_capacity = @orig_capacity - @added_racks
        WHERE storage_spot = @spot
        IF @@ROWCOUNT <> 1 OR @is_double <> 0
            -- If double, rollback UPDATE
            ROLLBACK TRAN
        ELSE
            -- If no double, commit UPDATE
            COMMIT TRAN

        -- Write log
        INSERT INTO spot_logs
            (storage_spot, former_capacity, new_capacity, delta, log_timestamp, double_detected)
        VALUES
            (@spot, @orig_capacity, @orig_capacity - @added_racks, @added_racks, GETDATE(), @is_double)
    END
END
GO
I was thinking about possible causes (and a way to trace them), and then it hit me - you have no value validation!
Here's a simple example to illustrate the problem:
Spot | capacity
---------------
x1   | 1
Update spots set capacity = capacity - 2 where spot = 'X1'
Your scanner most likely gave you a higher quantity than you had capacity to take in.
I am not sure how your business logic goes, but you need to do something along the lines of:

UPDATE spots SET capacity = capacity - @added_items WHERE spot = 'X1' AND capacity >= @added_items
IF @@ROWCOUNT <> 1
    ROLLBACK;
EDIT: a few methods to trace your problem without implementing validation:
Create a logging table (with timestamp, user id (the user connected to the DB), session id, old value, new value, and delta value (added items)).
Option 1:
Log all updates that change the value from positive to negative (at least until you figure out the problem).
The drawback of this option is that it will not register double calls that do not result in a negative capacity.
Option 2 (logging identical updates):
Create a script that creates a global temporary table and, once every minute or so, deletes records from that table with timestamps older than... let's say 10 minutes (play around with the numbers).
This temporary table should hold the data passed to your update statement, so 'spot' and 'added_items', plus a 'timestamp' (for tracking).
Now the crucial part: when you call your update statement, check whether a similar record exists in the temporary table (same spot and added_items, and where the current timestamp is between timestamp and [timestamp + 1 second] - again, play around with the numbers). If a record like that exists, log that update; if not, add it to the temporary table.
This will register identical updates that arrive within 1 second of each other (or whatever time-frame you choose).
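A minimal sketch of that duplicate check, assuming a global temp table ##update_trace created by the maintenance script (all table and column names here are made up for illustration):

-- Created once by the cleanup script:
-- CREATE TABLE ##update_trace (spot VARCHAR(10), added_items SMALLINT, ts DATETIME)

IF EXISTS (SELECT 1 FROM ##update_trace
           WHERE spot = @spot
             AND added_items = @added_items
             AND ts >= DATEADD(second, -1, GETDATE()))
BEGIN
    -- Probable double call: log it instead of applying the update
    INSERT INTO double_call_log (spot, added_items, ts)
    VALUES (@spot, @added_items, GETDATE())
END
ELSE
BEGIN
    INSERT INTO ##update_trace (spot, added_items, ts)
    VALUES (@spot, @added_items, GETDATE())
    -- ...run the normal UPDATE here...
END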
I found an alternative SQL query here that does the update the way I need, but with a temporary value using DECLARE. Would it work better in my case, or is my initial query correct?
Initial query:
UPDATE storage_spots
SET remaining_capacity = remaining_capacity - @added_items
WHERE storage_spot = @spot
Alternative query:
DECLARE @orig_capacity SMALLINT

SELECT TOP 1 @orig_capacity = remaining_capacity
FROM storage_spots
WHERE storage_spot = @spot

UPDATE storage_spots
SET remaining_capacity = @orig_capacity - @added_items
WHERE storage_spot = @spot
Also, should I get rid of the ROLLBACK/COMMIT instructions?

How can I get number of deleted records?

I have a stored procedure which deletes certain records. I need to get the number of deleted records.
I tried to do it like this:
DELETE FROM OperationsV1.dbo.Files WHERE FileID = @FileID
SELECT @@ROWCOUNT AS DELETED;
But DELETED is shown as 0, though the appropriate records are deleted. I tried SET NOCOUNT OFF; without success. Could you please help?
Thanks.
That should work fine. The setting of NOCOUNT is irrelevant - it only affects the "n rows affected" information sent back to the client and has no effect on the workings of @@ROWCOUNT.
Do you have any statements between the two that you have shown? @@ROWCOUNT is reset after every statement, so you must retrieve the value immediately, with no intervening statements.
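The safest pattern is to capture @@ROWCOUNT into a variable on the very next statement, for example (reusing the names from the question):

DECLARE @deleted int;
DELETE FROM OperationsV1.dbo.Files WHERE FileID = @FileID;
SET @deleted = @@ROWCOUNT; -- capture immediately, before any other statement runs
SELECT @deleted AS DELETED;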
I use this code snippet when debugging stored procedures to verify counts after operations, such as a delete:
DECLARE @Msg varchar(30)
...
SELECT @Msg = CAST(@@ROWCOUNT AS VARCHAR(10)) + ' rows affected'
RAISERROR (@Msg, 0, 1) WITH NOWAIT
START TRANSACTION;
SELECT @before := (SELECT COUNT(*) FROM OperationsV1.dbo.Files);
DELETE FROM OperationsV1.dbo.Files WHERE FileID = @FileID;
SELECT @after := (SELECT COUNT(*) FROM OperationsV1.dbo.Files);
COMMIT;
SELECT @before - @after AS DELETED;
I don't know about SQL Server, but in MySQL SELECT COUNT(*) FROM ... can be an extremely cheap operation.

SQL Server - Implementing sequences

I have a system which requires I have IDs on my data before it goes to the database. I was using GUIDs, but found them to be too big to justify the convenience.
I'm now experimenting with implementing a sequence generator which basically reserves a range of unique ID values for a given context. The code is as follows;
ALTER PROCEDURE [dbo].[Sequence.ReserveSequence]
    @Name varchar(100),
    @Count int,
    @FirstValue bigint OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- Ensure the parameters are valid
    IF (@Name IS NULL OR @Count IS NULL OR @Count < 0)
        RETURN -1;

    -- Reserve the sequence
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION

    -- Get the sequence ID, and the last reserved value of the sequence
    DECLARE @SequenceID int;
    DECLARE @LastValue bigint;

    SELECT TOP 1 @SequenceID = [ID], @LastValue = [LastValue]
    FROM [dbo].[Sequences]
    WHERE [Name] = @Name;

    -- Ensure the sequence exists
    IF (@SequenceID IS NULL)
    BEGIN
        -- Create the new sequence
        INSERT INTO [dbo].[Sequences] ([Name], [LastValue])
        VALUES (@Name, @Count);

        -- The first reserved value of a sequence is 1
        SET @FirstValue = 1;
    END
    ELSE
    BEGIN
        -- Update the sequence
        UPDATE [dbo].[Sequences]
        SET [LastValue] = @LastValue + @Count
        WHERE [ID] = @SequenceID;

        -- The sequence start value will be the last previously reserved value + 1
        SET @FirstValue = @LastValue + 1;
    END

    COMMIT TRANSACTION
END
The 'Sequences' table is just an ID, Name (unique), and the last allocated value of the sequence. Using this procedure I can request N values in a named sequence and use these as my identifiers.
This works great so far - it's extremely quick since I don't have to constantly ask for individual values, I can just use up a range of values and then request more.
The problem is that at extremely high frequency, calling the procedure concurrently can sometimes result in a deadlock. I have only seen this occur during stress testing, but I'm worried it'll crop up in production. Are there any notable flaws in this procedure, and can anyone recommend a way to improve it? It would be nice to do this without transactions, for example, but I do need it to be 'thread safe'.
MS themselves offer a solution, and even they say it locks/deadlocks.
If you add lock hints, you'd reduce concurrency for your high loads.
Options:
You could develop against the "Denali" CTP which is the next release
Use IDENTITY and the OUTPUT clause like everyone else
Adopt/modify the solutions above
On DBA.SE there is "Emulate a TSQL sequence via a stored procedure": see dportas' answer which I think extends the MS solution.
I'd recommend sticking with GUIDs if, as you say, this is mostly about composing data ready for a bulk insert (it's simpler than what I present below).
As an alternative, could you work with a restricted count? Say, 100 ID values at a time? In that case, you could have a table with an IDENTITY column, insert into that table, return the generated ID (say, 39), and then your code could assign all values between 3900 and 3999 (i.e., multiply up by your assumed granularity) without consulting the database server again.
Of course, this could be extended to allocating multiple blocks in a single call - provided that you're okay with some IDs potentially going unused. E.g., you need 638 IDs, so you ask the database for 7 new ID values (which implies you've allocated 700 values), use the 638 you want, and the remaining 62 never get assigned.
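A rough sketch of that IDENTITY-based allocator (the table and column names are invented for illustration, and a granularity of 100 is assumed):

-- One row per allocated block of 100 IDs
CREATE TABLE dbo.IdBlocks (
    BlockID bigint IDENTITY(1,1) PRIMARY KEY,
    AllocatedAt datetime NOT NULL DEFAULT GETDATE()
);

DECLARE @BlockID bigint;
INSERT INTO dbo.IdBlocks DEFAULT VALUES;
SET @BlockID = SCOPE_IDENTITY();

-- The caller now owns the range [@BlockID * 100, @BlockID * 100 + 99]
SELECT @BlockID * 100 AS FirstValue, @BlockID * 100 + 99 AS LastValue;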
Can you get some kind of deadlock trace? For example, enable trace flag 1222 as shown here, reproduce the deadlock, and then look in the SQL Server log for the deadlock trace.
Also, you might inspect what locks are taken out in your code by inserting a call to exec sp_lock or select * from sys.dm_tran_locks immediately before the COMMIT TRANSACTION.
Most likely you are observing a conversion deadlock. To avoid these, you want to make sure that your table is clustered and has a PK, but this advice is specific to 2005 and 2008 R2; the implementation may change in later versions, rendering it useless. Google "Some heap tables may be more prone to deadlocks than identical tables with clustered indexes".
Anyway, if you observe an error during stress testing, it is likely that sooner or later it will occur in production as well.
You may want to use sp_getapplock to serialize your requests. Google up "Application Locks (or Mutexes) in SQL Server 2005". Also I described a few useful ideas here: "Developing Modifications that Survive Concurrency".
I thought I'd share my solution. It doesn't deadlock, nor does it produce duplicate values. An important difference between this and my original procedure is that it doesn't create the sequence if it doesn't already exist:
ALTER PROCEDURE [dbo].[ReserveSequence]
(
    @Name nvarchar(100),
    @Count int,
    @FirstValue bigint OUTPUT
)
AS
BEGIN
    SET NOCOUNT ON;

    IF (@Count <= 0)
    BEGIN
        SET @FirstValue = NULL;
        RETURN -1;
    END

    DECLARE @Result TABLE ([LastValue] bigint)

    -- Update the sequence last value, and get the previous one
    UPDATE [Sequences]
    SET [LastValue] = [LastValue] + @Count
    OUTPUT DELETED.LastValue INTO @Result
    WHERE [Name] = @Name;

    -- Select the first value
    SELECT TOP 1 @FirstValue = [LastValue] + 1 FROM @Result;
END

How to delete records in SQL 2005 keeping transaction logs in check

I am running the following stored procedure to delete a large number of records. I understand that the DELETE statement writes to the transaction log, and deleting many rows will make the log grow.
I have looked into other options of creating tables and inserting records to keep and then Truncating the source, this method will not work for me.
How can I make my stored procedure below more efficient while making sure that I keep the transaction log from growing unnecessarily?
CREATE PROCEDURE [dbo].[ClearLog]
(
    @Age int = 30
)
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    -- DELETE ERRORLOG
    WHILE EXISTS ( SELECT [LogId] FROM [dbo].[Error_Log] WHERE DATEDIFF( dd, [TimeStamp], GETDATE() ) > @Age )
    BEGIN
        SET ROWCOUNT 10000
        DELETE [dbo].[Error_Log] WHERE DATEDIFF( dd, [TimeStamp], GETDATE() ) > @Age
        WAITFOR DELAY '00:00:01'
        SET ROWCOUNT 0
    END
END
Here is how I would do it:
CREATE PROCEDURE [dbo].[ClearLog] (
    @Age int = 30)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @d DATETIME
          , @batch INT;

    SET @batch = 10000;
    SET @d = DATEADD( dd, -@Age, GETDATE() )

    WHILE (1=1)
    BEGIN
        DELETE TOP (@batch) [dbo].[Error_Log]
        WHERE [Timestamp] < @d;

        IF (0 = @@ROWCOUNT)
            BREAK
    END
END
Make the Timestamp comparison SARGable.
Capture GETDATE() once at the start to produce a consistent run (otherwise new records can 'age' into the window while the old ones are being deleted, and the loop can run indefinitely).
Use TOP instead of SET ROWCOUNT (deprecated: "Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in the next release of SQL Server.").
Check @@ROWCOUNT to break the loop instead of running a redundant SELECT.
Assuming you have the option of rebuilding the error log table on a partition scheme one option would be to partition the table on date and swap out the partitions. Do a google search for 'alter table switch partition' to dig a bit further.
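As a rough sketch of that approach (assuming Error_Log has been rebuilt on a date-based partition scheme; the staging table must have an identical schema on the same filegroup, and all names here are hypothetical):

-- Switch the oldest partition out into an empty staging table,
-- then remove the staged rows with a metadata-only truncate
ALTER TABLE dbo.Error_Log SWITCH PARTITION 1 TO dbo.Error_Log_Stage;
TRUNCATE TABLE dbo.Error_Log_Stage;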
How about you run it more often and delete fewer rows each time? Run this every 30 minutes:
CREATE PROCEDURE [dbo].[ClearLog]
(
    @Age int = 30
)
AS
BEGIN
    SET NOCOUNT ON;
    SET ROWCOUNT 10000 -- I assume you are on an old version of SQL Server and can't use TOP
    DELETE dbo.Error_Log WHERE Timestamp < GETDATE() - @Age
    WAITFOR DELAY '00:00:01' -- why???
    SET ROWCOUNT 0
END
The way it handles the dates will not truncate the time portion, so each run only the next 30 minutes' worth of data ages past the cutoff.
If your database is in FULL recovery mode, the only way to minimize the impact of your delete statements is to "space them out" - only delete so many per "transaction interval". For example, if you do t-log backups every hour, delete only, say, 20,000 rows per hour. That may not drop all you need all at once, but will things even out after 24 hours, or after a week?
If your database is in SIMPLE or BULK_LOGGED mode, breaking the deletes into chunks should do it. But since you're already doing that, I'd have to guess your database is in FULL recovery mode. (That, or the connection calling the procedure may be part of a transaction.)
A solution I have used in the past was to temporarily set the recovery model to "Bulk Logged", then back to "Full" at the end of the stored procedure:
DECLARE @dbName NVARCHAR(128);
SELECT @dbName = DB_NAME();

EXEC('ALTER DATABASE ' + @dbName + ' SET RECOVERY BULK_LOGGED')

WHILE EXISTS (...)
BEGIN
    -- Delete a batch of rows, then WAITFOR here
END

EXEC('ALTER DATABASE ' + @dbName + ' SET RECOVERY FULL')
This will significantly reduce the transaction log consumption for large batches.
I don't like that it sets the recovery model for the whole database (not just for this session), but it's the best solution I could find.
