Avoid inserting duplicate records in SQL Server - sql-server

I haven't been able to find an answer to this. Suppose I have the following table/query:
The table:
create table ##table
(
column1 int,
column2 nvarchar(max)
)
The query (in a real life scenario the condition will be more complex):
declare @shouldInsert bit
set @shouldInsert = case when exists(
    select *
    from ##table
    where column2 = 'test') then 1 else 0 end
--Exaggerating a possible delay:
waitfor delay '00:00:10'
if(@shouldInsert = 0)
    insert into ##table
    values(1, 'test')
If I run this query twice simultaneously, it's liable to insert duplicate records (enforcing a unique constraint is out of the question because the real-life condition is more involved than mere "column1" uniqueness across the table).
I see two possible solutions:
Run both concurrent transactions in SERIALIZABLE mode, but that can deadlock (each session first takes a shared lock in the SELECT, then tries to take an X-lock for the INSERT - deadlock).
In the SELECT statement, use the query hints WITH (UPDLOCK, TABLOCK), which will effectively lock the entire table, but that prevents other transactions from reading data (something I'd like to avoid).
Which is more acceptable? Is there a third solution?
Thanks.

If you can, you should put a UNIQUE constraint (or index) on whatever column(s) define the uniqueness.
With this, you might still get the "OK, doesn't exist yet" response from the initial check in two separate processes - but one of the two will be first and get its row inserted, while the second will get a "unique constraint violated" exception back from the database.
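A minimal sketch of that pattern, assuming for illustration that column1 is the value that must be unique (an nvarchar(max) column such as column2 can't be an index key directly) and SQL Server 2012+ for THROW:
--Illustrative only: enforce uniqueness and let the constraint arbitrate
--between concurrent inserts.
ALTER TABLE ##table ADD CONSTRAINT UQ_table_column1 UNIQUE (column1);

BEGIN TRY
    INSERT INTO ##table (column1, column2)
    VALUES (1, 'test');
END TRY
BEGIN CATCH
    --2627 = unique constraint violation, 2601 = duplicate key in a unique index:
    --another session won the race, so treat "already exists" as non-fatal.
    IF ERROR_NUMBER() NOT IN (2601, 2627)
        THROW;
END CATCH;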

Regardless of how "involved" your "real-life condition" is, you have two options: enforce UNIQUE or deal with multiple records. Any workaround is likely to be fragile.
For example, your delay hack is pretty useless if you need to add another DB server, or if overwhelming load slows down the execution of individual threads.
One way to allow multiple copies of a should-be-unique value is to create another table that acts as a queue (and doesn't enforce uniqueness) plus a serial worker to dequeue it. Or change the data structure to allow 1-to-many and pick the first row when querying. Still a hack, but at least it isn't terribly "creative" and it can't break.
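A rough sketch of the queue idea, with hypothetical object names:
--Hypothetical queue table: duplicates are allowed here; a single worker
--drains it and applies the real uniqueness rule serially.
CREATE TABLE dbo.InsertQueue
(
    QueueId int IDENTITY(1,1) PRIMARY KEY,
    column1 int,
    column2 nvarchar(max)
);

--Producers just enqueue; no existence checks needed:
INSERT INTO dbo.InsertQueue (column1, column2) VALUES (1, 'test');

--The single worker takes one item at a time and decides whether to insert:
DECLARE @dequeued TABLE (column1 int, column2 nvarchar(max));

;WITH nextItem AS
(
    SELECT TOP (1) QueueId, column1, column2
    FROM dbo.InsertQueue WITH (ROWLOCK, READPAST)
    ORDER BY QueueId
)
DELETE FROM nextItem
OUTPUT deleted.column1, deleted.column2 INTO @dequeued;

INSERT INTO ##table (column1, column2)
SELECT d.column1, d.column2
FROM @dequeued d
WHERE NOT EXISTS (SELECT * FROM ##table t WHERE t.column2 = d.column2);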

declare @shouldInsert bit
set @shouldInsert = case when exists(
    select *
    from ##table
    where column2 = 'test') then 1 else 0 end
--Exaggerating a possible delay:
waitfor delay '00:00:10'
--#temp is assumed to have the same schema as ##table, e.g. created earlier with:
--select top 0 * into #temp from ##table
truncate table #temp
if(@shouldInsert = 0)
    insert into #temp
    values(1, 'test')
--if the record is not already in ##table, it is copied from #temp into ##table
insert into ##table
select * from #temp
except
select * from ##table

Related

SQL Server Custom Identity Column

I want to generate a custom identity column related to the type of product.
Can this query guarantee the order of the identity values and resolve concurrency issues?
This is a sample query:
BEGIN TRAN
INSERT INTO TBLKEY
VALUES((SELECT 'A-' + CAST(MAX(CAST(ID AS INT)) + 1 AS NVARCHAR) FROM TBLKEY),'EHSAN')
COMMIT
Try this:
BEGIN TRAN
INSERT INTO TBLKEY
VALUES((SELECT 'A-' + CAST(MAX(CAST(ID AS INT)) + 1 AS NVARCHAR) FROM TBLKEY WITH (UPDLOCK)),'EHSAN')
COMMIT
When selecting the max ID you acquire a U lock on the row. That U lock is incompatible with the U lock that another session running the same query at the same time will try to acquire, so only one such query executes at a given time. The IDs will be in order and contiguous, with no gaps between them.
A better solution would be to create an extra table dedicated only for storing the current or next id and use it instead of the maximum.
You can understand the difference by doing the following:
Prepare a table
CREATE TABLE T(id int not null PRIMARY KEY CLUSTERED)
INSERT INTO T VALUES(1)
And then run the following query in two different sessions, one after the other, less than 10 seconds apart.
BEGIN TRAN
DECLARE @idv int
SELECT @idv = max(id) FROM T
WAITFOR DELAY '0:0:10'
INSERT INTO T VALUES(@idv+1)
COMMIT
Wait for a while until both queries complete. Observe that one of them succeeded and the other failed.
Now do the same with the following query
BEGIN TRAN
DECLARE @idv int
SELECT @idv = max(id) FROM T WITH (UPDLOCK)
WAITFOR DELAY '0:0:5'
INSERT INTO T VALUES(@idv+1)
COMMIT
View the contents of T
Clean up the T table with DROP TABLE T
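Going back to the dedicated id table suggested above, a minimal sketch (table and column names are made up for illustration):
--Hypothetical one-row-per-key-type table holding the next id to hand out.
CREATE TABLE dbo.KeySequence
(
    KeyType nvarchar(10) NOT NULL PRIMARY KEY,
    NextId  int          NOT NULL
);
INSERT INTO dbo.KeySequence (KeyType, NextId) VALUES (N'A', 1);

--Claim the next id atomically: the UPDATE takes an exclusive lock on the row,
--so concurrent callers are serialized and each gets a distinct value.
DECLARE @claimed TABLE (NextId int);

UPDATE dbo.KeySequence
SET NextId = NextId + 1
OUTPUT deleted.NextId INTO @claimed
WHERE KeyType = N'A';

--Use the claimed value to build the custom key, e.g. 'A-1', 'A-2', ...
SELECT 'A-' + CAST(NextId AS nvarchar(20)) AS CustomKey
FROM @claimed;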
This would be a bad thing to do, as there is no way to guarantee that two queries running at the same time wouldn't get the same value for MAX(ID).
If you used a standard identity column, you could also have a computed column which uses it, or just return the key when you return the data.
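A minimal sketch of the identity-plus-computed-column idea, using the 'A-' prefix from the question (the table name is illustrative):
--A normal identity column handles concurrency; the displayed key is derived
--from it, so there is no MAX(ID) + 1 race at all.
CREATE TABLE dbo.TblKeyDemo
(
    Id      int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Name    nvarchar(50) NOT NULL,
    KeyCode AS ('A-' + CAST(Id AS nvarchar(10)))
);

INSERT INTO dbo.TblKeyDemo (Name) VALUES (N'EHSAN');
SELECT KeyCode, Name FROM dbo.TblKeyDemo;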
Ed

Select and Delete in the same transaction using TOP clause

I have a table to which data is continuously being added at a rapid pace.
I need to fetch records from this table and immediately remove them so that the same record is not processed a second time. And since the data is being added at a fast rate, I need to use the TOP clause so that only a small number of records goes to the business logic for processing at a time.
I am using the query below:
BEGIN TRAN readrowdata

SELECT
    top 5 [RawDataId],
    [RawData]
FROM
    [TABLE] with(HOLDLOCK);

WITH q AS
(
    SELECT
        top 5 [RawDataId],
        [RawData]
    FROM
        [TABLE] with(HOLDLOCK)
)
DELETE from q

COMMIT TRANSACTION readrowdata
I am using HOLDLOCK here so that new data cannot be inserted into the table while I am performing the SELECT and DELETE operations. I used it because, supposing there are only 3 records in the table now, the SELECT statement will get 3 records, and if a new record gets inserted at the same time, the DELETE statement will delete 4 records, so I would lose 1 record.
Is the query OK in terms of performance? If I can improve it, please give me your suggestions.
Thank you
Personally, I'd use a different approach. One with less locking, but also extra information signifying that certain records are currently being processed...
DECLARE @rowsBeingProcessed TABLE (
    id INT
);
WITH rows AS (
    SELECT top 5 [RawDataId], processing_start
    FROM yourTable
    WHERE processing_start IS NULL
)
UPDATE rows
SET processing_start = getDate()
OUTPUT INSERTED.RawDataId INTO @rowsBeingProcessed
WHERE processing_start IS NULL;
-- Business Logic Here
DELETE yourTable WHERE RawDataId IN (SELECT id FROM @rowsBeingProcessed);
Then you can also add checks like "if a record has been 'beingProcessed' for more than 10 minutes, assume that the business logic failed", etc, etc.
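For example, a hedged sketch of that timeout check, reusing the hypothetical processing_start column from the query above:
--Rows claimed more than 10 minutes ago are assumed to have failed and are
--released back so another worker can pick them up.
UPDATE yourTable
SET processing_start = NULL
WHERE processing_start IS NOT NULL
  AND processing_start < DATEADD(MINUTE, -10, GETDATE());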
By locking the table in this way, you force other processes to wait for your transaction to complete. This can very quickly have consequences for scalability and performance - and it tends to be hard to predict, because there's often a chain of components all relying on your database.
If you have multiple clients each running this query, and multiple clients adding new rows to the table, the overall system performance is likely to deteriorate at some times, as each "read" client is waiting for a lock, the number of "write" clients waiting to insert data grows, and they in turn may tie up other components (whatever is generating the data you want to insert).
Diego's answer is on the money - put the data into a variable, and delete matching rows. Don't use locks in SQL Server if you can possibly avoid it!
You can do it very easily with triggers. The example below shows an approach that avoids holding up other users who are trying to insert data at the same time.
Data Definition language
CREATE TABLE SampleTable
(
id int
)
Sample Record
insert into SampleTable(id)Values(1)
Sample Trigger
CREATE TRIGGER SampleTableTrigger
ON SampleTable AFTER INSERT
AS
IF EXISTS (SELECT id FROM INSERTED)
BEGIN
    SET NOCOUNT ON
    SET XACT_ABORT ON
    BEGIN TRY
        BEGIN TRAN
            -- return the newly inserted ids to the caller
            SELECT ID FROM INSERTED
            -- remove the processed rows from the table
            DELETE FROM SampleTable WHERE ID IN (SELECT id FROM INSERTED);
        COMMIT TRAN
    END TRY
    BEGIN CATCH
        ROLLBACK TRAN
    END CATCH
END
Hope this is very simple and helpful
If I understand you correctly, you are worried that between your select and your delete, more records would be inserted and the first TOP 5 would be different than the second TOP 5?
If so, why don't you load your first select into a temp table or table variable (or at least the PKs), do whatever you have to do with your data, and then do your delete based on that table?
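A minimal sketch of that approach, keeping the names from the question and assuming RawDataId is the key (treat the hints and column types as assumptions):
BEGIN TRAN readrowdata;

DECLARE @batch TABLE (RawDataId int PRIMARY KEY, RawData nvarchar(max));

--Grab a small batch and remember exactly which keys were taken.
INSERT INTO @batch (RawDataId, RawData)
SELECT TOP (5) RawDataId, RawData
FROM [TABLE] WITH (UPDLOCK, READPAST)
ORDER BY RawDataId;

--Delete only the rows that were selected, no matter what was inserted since.
DELETE t
FROM [TABLE] t
INNER JOIN @batch b ON b.RawDataId = t.RawDataId;

COMMIT TRANSACTION readrowdata;

--@batch now feeds the business logic outside the transaction.
SELECT RawDataId, RawData FROM @batch;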
I know it's an old question, but I found a solution here https://www.simple-talk.com/sql/learn-sql-server/the-delete-statement-in-sql-server/:
DECLARE @Output table
(
    StaffID INT,
    FirstName NVARCHAR(50),
    LastName NVARCHAR(50),
    CountryRegion NVARCHAR(50)
);
DELETE ss
OUTPUT DELETED.* INTO @Output
FROM Sales.vSalesPerson sp
INNER JOIN dbo.SalesStaff ss
    ON sp.BusinessEntityID = ss.StaffID
WHERE sp.SalesLastYear = 0;
SELECT * FROM @Output;
Maybe it will be helpful for you.

Can I Select and Update at the same time?

This is an over-simplified explanation of what I'm working on.
I have a table with status column. Multiple instances of the application will pull the contents of the first row with a status of NEW, update the status to WORKING and then go to work on the contents.
It's easy enough to do this with two database calls; first the SELECT then the UPDATE. But I want to do it all in one call so that another instance of the application doesn't pull the same row. Sort of like a SELECT_AND_UPDATE thing.
Is a stored procedure the best way to go?
You could use the OUTPUT clause.
DECLARE @Table TABLE (ID INTEGER, Status VARCHAR(32))
INSERT INTO @Table VALUES (1, 'New')
INSERT INTO @Table VALUES (2, 'New')
INSERT INTO @Table VALUES (3, 'Working')
UPDATE @Table
SET Status = 'Working'
OUTPUT Inserted.*
FROM @Table t1
INNER JOIN (
    SELECT TOP 1 ID
    FROM @Table
    WHERE Status = 'New'
) t2 ON t2.ID = t1.ID
Sounds like a queue processing scenario, whereby you want one process only to pick up a given record.
If that is the case, have a look at the answer I provided earlier today which describes how to implement this logic using a transaction in conjunction with UPDLOCK and READPAST table hints:
Row locks - manually using them
Best wrapped up in sproc.
I'm not sure this is what you are wanting to do, hence I haven't voted to close as duplicate.
Not quite, but you can SELECT ... WITH (UPDLOCK) and then UPDATE subsequently. This is as good as an atomic operation, as it tells the database that you are about to update what you previously selected, so it can lock those rows, preventing collisions with other clients. In Oracle and some other databases (MySQL, I think) the syntax is SELECT ... FOR UPDATE.
Note: I think you'll need to ensure the two statements happen within a transaction for it to work.
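A sketch of that pattern in T-SQL (table and column names are assumed from the question):
BEGIN TRAN;

DECLARE @id int;

--UPDLOCK keeps another session from claiming the same row between the
--SELECT and the UPDATE; a competing session will wait here instead.
SELECT TOP (1) @id = ID
FROM dbo.WorkQueue WITH (UPDLOCK)
WHERE Status = 'NEW';

IF @id IS NOT NULL
    UPDATE dbo.WorkQueue
    SET Status = 'WORKING'
    WHERE ID = @id;

COMMIT TRAN;

SELECT @id AS ClaimedId;  --NULL means nothing was available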
You should do three things here:
Lock the row you're working on
Make sure that this and only this row is locked
Do not wait for locked records: skip to the next ones instead.
To do this, you just issue this:
SELECT TOP 1 *
FROM mytable WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE status = 'NEW'
ORDER BY
date
UPDATE …
within a transaction.
A stored procedure is the way to go. You need to look at transactions. SQL Server was born for this kind of thing.
Yes, and maybe use the rowlock hint to keep it isolated from the other threads, eg.
UPDATE
Jobs WITH (ROWLOCK, UPDLOCK, READPAST)
SET Status = 'WORKING'
WHERE JobID =
(SELECT Top 1 JobId FROM Jobs WHERE Status = 'NEW')
EDIT: Rowlock would be better as suggested by Quassnoi, but the same idea applies to do the update in one query.

SQL - Inserting and Updating Multiple Records at Once

I have a stored procedure that is responsible for inserting or updating multiple records at once. I want to perform this in my stored procedure for the sake of performance.
This stored procedure takes in a comma-delimited list of permit IDs and a status. The permit IDs are stored in a variable called @PermitIDs. The status is stored in a variable called @Status. I have a user-defined function that converts this comma-delimited list of permit IDs into a table. I need to go through each of these IDs and do either an insert or update into a table called PermitStatus.
If a record with the permit ID does not exist, I want to add a record. If it does exist, I want to update the record with the given @Status value. I know how to do this for a single ID, but I do not know how to do it for multiple IDs. For single IDs, I do the following:
-- Determine whether to add or edit the PermitStatus
DECLARE @count int
SET @count = (SELECT Count(ID) FROM PermitStatus WHERE [PermitID]=@PermitID)
-- If no records were found, insert the record, otherwise update
IF @count = 0
BEGIN
    INSERT INTO
        PermitStatus
        (
            [PermitID],
            [UpdatedOn],
            [Status]
        )
    VALUES
        (
            @PermitID,
            GETUTCDATE(),
            1
        )
END
ELSE
    UPDATE
        PermitStatus
    SET
        [UpdatedOn]=GETUTCDATE(),
        [Status]=@Status
    WHERE
        [PermitID]=@PermitID
How do I loop through the records in the Table returned by my user-defined function to dynamically insert or update the records as needed?
create a split function, and use it like:
SELECT
*
FROM YourTable y
INNER JOIN dbo.splitFunction(@Parameter) s ON y.ID=s.Value
I prefer the number table approach
For this method to work, you need to do this one time table setup:
SELECT TOP 10000 IDENTITY(int,1,1) AS Number
INTO Numbers
FROM sys.objects s1
CROSS JOIN sys.objects s2
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)
Once the Numbers table is set up, create this function:
CREATE FUNCTION [dbo].[FN_ListToTableAll]
(
     @SplitOn char(1)      --REQUIRED, the character to split the @List string on
    ,@List varchar(8000)   --REQUIRED, the list to split apart
)
RETURNS TABLE
AS
RETURN
(
    ----------------
    --SINGLE QUERY-- --this WILL return empty rows
    ----------------
    SELECT
        ROW_NUMBER() OVER(ORDER BY number) AS RowNumber
        ,LTRIM(RTRIM(SUBSTRING(ListValue, number+1, CHARINDEX(@SplitOn, ListValue, number+1)-number - 1))) AS ListValue
    FROM (
             SELECT @SplitOn + @List + @SplitOn AS ListValue
         ) AS InnerQuery
        INNER JOIN Numbers n ON n.Number < LEN(InnerQuery.ListValue)
    WHERE SUBSTRING(ListValue, number, 1) = @SplitOn
);
GO
You can now easily split a CSV string into a table and join on it:
select * from dbo.FN_ListToTableAll(',','1,2,3,,,4,5,6777,,,')
OUTPUT:
RowNumber ListValue
----------- ----------
1 1
2 2
3 3
4
5
6 4
7 5
8 6777
9
10
11
(11 row(s) affected)
To make what you need work, do the following:
--this would be the existing table
DECLARE #OldData table (RowID int, RowStatus char(1))
INSERT INTO #OldData VALUES (10,'z')
INSERT INTO #OldData VALUES (20,'z')
INSERT INTO #OldData VALUES (30,'z')
INSERT INTO #OldData VALUES (70,'z')
INSERT INTO #OldData VALUES (80,'z')
INSERT INTO #OldData VALUES (90,'z')
--these would be the stored procedure input parameters
DECLARE #IDList varchar(500)
,#StatusList varchar(500)
SELECT #IDList='10,20,30,40,50,60'
,#StatusList='A,B,C,D,E,F'
--stored procedure local variable
DECLARE #InputList table (RowID int, RowStatus char(1))
--convert input prameters into a table
INSERT INTO #InputList
(RowID,RowStatus)
SELECT
i.ListValue,s.ListValue
FROM dbo.FN_ListToTableAll(',',#IDList) i
INNER JOIN dbo.FN_ListToTableAll(',',#StatusList) s ON i.RowNumber=s.RowNumber
--update all old existing rows
UPDATE o
SET RowStatus=i.RowStatus
FROM #OldData o WITH (UPDLOCK, HOLDLOCK) --to avoid race condition when there is high concurrency as per #emtucifor
INNER JOIN #InputList i ON o.RowID=i.RowID
--insert only the new rows
INSERT INTO #OldData
(RowID, RowStatus)
SELECT
i.RowID, i.RowStatus
FROM #InputList i
LEFT OUTER JOIN #OldData o ON i.RowID=o.RowID
WHERE o.RowID IS NULL
--display the old table
SELECT * FROM #OldData order BY RowID
OUTPUT:
RowID RowStatus
----------- ---------
10 A
20 B
30 C
40 D
50 E
60 F
70 z
80 z
90 z
(9 row(s) affected)
EDIT: thanks to @Emtucifor (click here for the tip about the race condition), I have included the locking hints in my answer, to prevent race condition problems when there is high concurrency.
There are various methods to accomplish the parts you are asking about.
Passing Values
There are dozens of ways to do this. Here are a few ideas to get you started:
Pass in a string of identifiers and parse it into a table, then join.
SQL 2008: Join to a table-valued parameter
Expect data to exist in a predefined temp table and join to it
Use a session-keyed permanent table
Put the code in a trigger and join to the INSERTED and DELETED tables in it.
Erland Sommarskog provides a wonderful, comprehensive discussion of lists in SQL Server. In my opinion, the table-valued parameter in SQL 2008 is the most elegant solution for this.
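For instance, a minimal table-valued parameter sketch (SQL Server 2008+; the type and procedure names here are made up):
--Hypothetical TVP type carrying the permit ids.
CREATE TYPE dbo.PermitIdList AS TABLE
(
    PermitID int NOT NULL PRIMARY KEY
);
GO

CREATE PROCEDURE dbo.PermitStatusUpdateTvp
    @PermitIDs dbo.PermitIdList READONLY,
    @Status    int
AS
BEGIN
    SET NOCOUNT ON;
    --Join straight to the parameter; no CSV parsing is needed.
    --The upsert body from the example procedure below would go here,
    --joining to @PermitIDs instead of a split-function result.
    SELECT p.PermitID, @Status AS Status
    FROM @PermitIDs AS p;
END
GO

--Caller side:
DECLARE @ids dbo.PermitIdList;
INSERT INTO @ids (PermitID) VALUES (123), (124), (125);
EXEC dbo.PermitStatusUpdateTvp @PermitIDs = @ids, @Status = 1;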
Upsert/Merge
Perform a separate UPDATE and INSERT (two queries, one for each set, not row-by-row).
SQL 2008: MERGE.
An Important Gotcha
However, one thing that no one else has mentioned is that almost all upsert code, including SQL 2008 MERGE, suffers from race condition problems when there is high concurrency. Unless you use HOLDLOCK and other locking hints depending on what's being done, you will eventually run into conflicts. So you either need to lock, or respond to errors appropriately (some systems with huge transactions per second have used the error-response method successfully, instead of using locks).
One thing to realize is that different combinations of lock hints implicitly change the transaction isolation level, which affects what type of locks are acquired. This changes everything: which other locks are granted (such as a simple read), the timing of when a lock is escalated from update intent to update, and so on.
I strongly encourage you to read more detail on these race condition problems. You need to get this right.
Conditional Insert/Update Race Condition
“UPSERT” Race Condition With MERGE
Example Code
CREATE PROCEDURE dbo.PermitStatusUpdate
    @PermitIDs varchar(8000), -- or (max)
    @Status int
AS
SET NOCOUNT, XACT_ABORT ON -- see note below
BEGIN TRAN

DECLARE @Permits TABLE (
    PermitID int NOT NULL PRIMARY KEY CLUSTERED
)

INSERT @Permits
SELECT Value FROM dbo.Split(@PermitIDs) -- split function of your choice

UPDATE S
SET
    UpdatedOn = GETUTCDATE(),
    Status = @Status
FROM
    PermitStatus S WITH (UPDLOCK, HOLDLOCK)
    INNER JOIN @Permits P ON S.PermitID = P.PermitID

INSERT PermitStatus (
    PermitID,
    UpdatedOn,
    Status
)
SELECT
    P.PermitID,
    GetUTCDate(),
    @Status
FROM @Permits P
WHERE NOT EXISTS (
    SELECT 1
    FROM PermitStatus S
    WHERE P.PermitID = S.PermitID
)

COMMIT TRAN

RETURN @@ERROR;
Note: XACT_ABORT helps guarantee the explicit transaction is closed following a timeout or unexpected error.
To confirm that this handles the locking problem, open several query windows and execute an identical batch like so:
WAITFOR TIME '11:00:00' -- use a time in the near future
EXEC dbo.PermitStatusUpdate @PermitIDs = '123,124,125,126', @Status = 1
All of these different sessions will execute the stored procedure in nearly the same instant. Check each session for errors. If none exist, try the same test a few times more (since it's possible to not always have the race condition occur, especially with MERGE).
The writeups at the links I gave above give even more detail than I did here, and also describe what to do for the SQL 2008 MERGE statement as well. Please read those thoroughly to truly understand the issue.
Briefly, with MERGE, no explicit transaction is needed, but you do need to use SET XACT_ABORT ON and use a locking hint:
SET NOCOUNT, XACT_ABORT ON;
MERGE dbo.Table WITH (HOLDLOCK) AS TableAlias
...
This will prevent concurrency race conditions causing errors.
I also recommend that you do error handling after each data modification statement.
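One hedged way to do that per-statement error handling (TRY/CATCH shown; the values and the single-statement body are purely illustrative):
DECLARE @PermitID int, @Status int;
SELECT @PermitID = 123, @Status = 1;   --made-up values for the sketch

BEGIN TRY
    BEGIN TRAN;

    UPDATE PermitStatus
    SET UpdatedOn = GETUTCDATE(), Status = @Status
    WHERE PermitID = @PermitID;

    COMMIT TRAN;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRAN;

    --Surface the original error to the caller.
    DECLARE @msg nvarchar(2048);
    SET @msg = ERROR_MESSAGE();
    RAISERROR(@msg, 16, 1);
END CATCH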
If you're using SQL Server 2008, you can use table valued parameters - you pass in a table of records into a stored procedure and then you can do a MERGE.
Passing in a table valued parameter would remove the need to parse CSV strings.
Edit:
ErikE has raised the point about race conditions, please refer to his answer and linked articles.
If you have SQL Server 2008, you can use MERGE. Here's an article describing this.
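A hedged MERGE sketch for this case (SQL Server 2008+), reusing a CSV-split function as in the other answers; the HOLDLOCK hint follows the race-condition advice above:
SET NOCOUNT, XACT_ABORT ON;

DECLARE @PermitIDs varchar(8000), @Status int;
SELECT @PermitIDs = '123,124,125', @Status = 1;   --illustrative values

MERGE dbo.PermitStatus WITH (HOLDLOCK) AS target
USING (SELECT Value AS PermitID FROM dbo.Split(@PermitIDs)) AS source
    ON target.PermitID = source.PermitID
WHEN MATCHED THEN
    UPDATE SET UpdatedOn = GETUTCDATE(), Status = @Status
WHEN NOT MATCHED THEN
    INSERT (PermitID, UpdatedOn, Status)
    VALUES (source.PermitID, GETUTCDATE(), @Status);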
You should be able to do your insert and your update as two set based queries.
The code below was based on a data load procedure that I wrote a while ago that took data from a staging table and inserted or updated it into the main table.
I've tried to make it match your example, but you may need to tweak this (and create a table valued UDF to parse your CSV into a table of ids).
-- Update where the join on permitstatus matches
Update
    status
Set
    [UpdatedOn]=GETUTCDATE(),
    [Status]=staging.Status
From
    PermitStatus status
Join
    StagingTable staging
On
    staging.PermitId = status.PermitId
-- Insert the new records, based on the Where Not Exists
Insert
    PermitStatus(UpdatedOn, Status, PermitId)
Select
    GETUTCDATE(), staging.Status, staging.PermitId
From
    StagingTable staging
Where Not Exists
(
    Select 1 from PermitStatus status
    Where status.PermitId = staging.PermitId
)
Essentially you have an upsert stored procedure (e.g. UpsertSinglePermit),
like the code you have given above, for dealing with one row.
So the steps I see are to create a new stored procedure (UpsertNPermits) which does:
a) Parse the input string into n record entries (each record contains a permit id and status)
b) For each entry above, invoke UpsertSinglePermit
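A rough sketch of that approach (assuming a split function as in the other answers, and that UpsertSinglePermit takes @PermitID and @Status parameters):
CREATE PROCEDURE dbo.UpsertNPermits
    @PermitIDs varchar(8000),
    @Status    int
AS
BEGIN
    SET NOCOUNT ON;

    --a) parse the input string into individual permit ids
    DECLARE @ids TABLE (RowNumber int IDENTITY(1,1) PRIMARY KEY, PermitID int);
    INSERT INTO @ids (PermitID)
    SELECT Value FROM dbo.Split(@PermitIDs);   --split function of your choice

    --b) invoke the single-row upsert once per id
    DECLARE @i int, @max int, @PermitID int;
    SELECT @i = 1, @max = COUNT(*) FROM @ids;

    WHILE @i <= @max
    BEGIN
        SELECT @PermitID = PermitID FROM @ids WHERE RowNumber = @i;
        EXEC dbo.UpsertSinglePermit @PermitID = @PermitID, @Status = @Status;
        SET @i = @i + 1;
    END
END
Note that this processes one row at a time, so the set-based answers above will generally perform better for large lists.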

SQL Server READPAST hint

I'm seeing behavior which looks like the READPAST hint is set on the database itself.
The rub: I don't think this is possible.
We have table foo (id int primary key identity, name varchar(50) not null unique);
I have several threads which do, basically
id = select id from foo where name = ?
if id == null
insert into foo (name) values (?)
id = select id from foo where name = ?
Each thread is responsible for inserting its own name (no two threads try to insert the same name at the same time). Client is java.
READ_COMMITTED_SNAPSHOT is ON, transaction isolation is specifically set to READ COMMITTED, using Connection.setTransactionIsolation( Connection.TRANSACTION_READ_COMMITTED );
The symptom is that if one thread is inserting, the other thread can't see its row -- even rows which were committed to the database before the application started -- and tries to insert, but gets a duplicate-key exception from the unique index on name.
Throw me a bone here?
You're at the wrong isolation level. Remember what happens with the snapshot isolation level: if one transaction is making a change, no other concurrent transaction sees it. Period. Other transactions will only see your changes once you have committed, and only if they start after your commit. The solution is to use a different isolation level. Wrap your statements in a transaction and SET TRANSACTION ISOLATION LEVEL SERIALIZABLE. This will ensure that your concurrent transactions work as if they were all run serially, which is what you seem to want here.
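A sketch of that suggestion applied to the insert-if-missing pattern from the question (the UPDLOCK hint is added because, with shared range locks alone, two sessions can still deadlock on the INSERT; see also the HOLDLOCK pattern further down):
DECLARE @name varchar(50);
SET @name = 'some-name';
DECLARE @id int;

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;

--Under SERIALIZABLE the range covered by this SELECT stays locked until
--commit, so no other session can sneak a matching row in.
SELECT @id = id FROM foo WITH (UPDLOCK) WHERE name = @name;

IF @id IS NULL
BEGIN
    INSERT INTO foo (name) VALUES (@name);
    SET @id = SCOPE_IDENTITY();
END

COMMIT TRAN;

SELECT @id AS id;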
Sounds like you're not wrapping the select and insert into a transaction?
As a solution, you could:
insert into foo (col1,col2,col3)
select 'a','b','c'
where not exists (select * from foo where col1 = 'a')
After this, @@ROWCOUNT will be 1 if a row was inserted, so you can check it.
SELECT SCOPE_IDENTITY()
should do the trick here...
plus wrapping it in a transaction like the previous poster mentioned.
The moral of this story is fully explained in my blog post "You can't hold onto nothing" but the short version of this is that you want to use the HOLDLOCK hint. I use the pattern:
INSERT INTO dbo.Foo(Name)
SELECT TOP 1
    @name AS Name
FROM (SELECT 1 AS FakeColumn) AS FakeTable
WHERE NOT EXISTS (SELECT * FROM dbo.Foo WITH (HOLDLOCK)
                  WHERE Name=@name)

SELECT ID FROM dbo.Foo WHERE Name=@name
