How can a constraint be violated when the SQL query excludes it? - sql-server

I have a result table that holds the output of a large, complicated, slow running query.
It's defined something like:
create table ResultsStore (
Foo int not null,
Bar int not null,
... other fields
constraint [PK_ResultsStore_foo_bar] primary key clustered
(
Foo asc,
Bar asc
)
)
I then insert to this table with a query like:
insert ResultsStore (Foo, Bar)
output inserted.*
select subQuery.ID, @bar
from (
-- large complex slow query
) subQuery
where subQuery.ID not in (
select Foo
from ResultsStore
where Bar = @bar
)
In testing this is fine, but in production, with lots of users hitting it regularly, we often get an exception:
Violation of PRIMARY KEY constraint 'PK_ResultsStore_foo_bar'. Cannot insert duplicate key in object 'ResultsStore'.
How is this possible? Surely the where should exclude any combination of the multiple primary key fields where they are already in the table?
How can I best avoid this?

As written, two sessions can run the query concurrently: both check for the existence of the row, neither finds it, and both then attempt the insert. The first one will succeed in READ COMMITTED, and the second one will fail.
You need WITH (UPDLOCK, HOLDLOCK, ROWLOCK) on the subquery to avoid this race condition. At the default read committed isolation level, either the subquery takes S locks, which are released as soon as the data has been read, or row versioning is used and no locks are taken at all.
The HOLDLOCK gives serializable semantics and protects the range. UPDLOCK forces the read to use a U lock, which will block other sessions from reading with UPDLOCK.
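Applied to the insert from the question, the hints would sit on the ResultsStore reference inside the subquery. The following is only a sketch: it assumes the ResultsStore table from the question already exists, the value of @bar is made up, and the two-row derived table is a stand-in for the real slow query.

```sql
-- Assumption: ResultsStore (Foo, Bar) exists as defined in the question.
declare @bar int = 42;  -- hypothetical value, for illustration only

insert ResultsStore (Foo, Bar)
output inserted.*
select subQuery.ID, @bar
from (
    -- stand-in for the large, complex, slow query
    select 1 as ID
    union all
    select 2
) subQuery
where subQuery.ID not in (
    select Foo
    -- U locks plus a held key range: a second session running the same
    -- statement blocks here until this statement's transaction ends
    from ResultsStore with (updlock, holdlock, rowlock)
    where Bar = @bar
);
```

Because the existence check and the insert are a single statement, the locks taken by the subquery are held until the insert has finished.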

You can also use a temp table to hold interim results and perform the final insert at the end.
The following also includes a DISTINCT (which might or might not be needed), changes the duplicate test to use EXISTS, and applies the WITH (UPDLOCK, HOLDLOCK, ROWLOCK) hints to the final insert, as suggested by others.
declare @TempResults table (
Foo int not null,
Bar int not null
)
insert @TempResults
select distinct subQuery.ID, @bar
from (
-- large complex slow query
) subQuery
insert ResultsStore (Foo, Bar)
output inserted.*
select T.Foo, T.Bar
from @TempResults T
where not exists (
select *
from ResultsStore RS with (updlock, holdlock, rowlock)
where RS.Foo = T.Foo
and RS.Bar = T.Bar
)
This lets your long running query run fast and dirty (as you intend), but should maintain integrity and minimize actual lock duration for the final insert.

Related

Can the clustering keys be defined on temporary and transient tables?

I understand that clustering can be defined on a permanent table. But can the same be defined on temporary and transient tables?
Have you tried to test it?
create or replace temporary table x_temp (id number, v varchar ) cluster by (id) as select UNIFORM(0,100,random()), 'xyz' from table(generator(rowcount=>10000000));
create or replace transient table x_transient (id number, v varchar ) cluster by (id) as select UNIFORM(0,100,random()), 'xyz' from table(generator(rowcount=>10000000));
select system$clustering_information( 'x_temp');
select system$clustering_information( 'x_transient');
You may define clustering keys for transient and temp tables. Although it's possible to define a clustering key for a temporary table, I do not think it will provide any benefits, as the clustering service will not be able to re-cluster it.
New sample based on the comment:
create or replace transient table x_transient_v2 (id number, v varchar ) cluster by (id)
as select UNIFORM(0,10,random()), RANDSTR( 1000, random()) from table(generator(rowcount=>500000));
select system$clustering_information( 'x_transient_v2'); -- save the output to compare
insert into x_transient_v2 select UNIFORM(0,10,random()), RANDSTR( 1000, random()) from table(generator(rowcount=>50000));
update x_transient_v2 set v = RANDSTR( 1000, random()) where v ilike 'A%';
insert into x_transient_v2 select UNIFORM(0,10,random()), RANDSTR( 1000, random()) from table(generator(rowcount=>50000));
select system$clustering_information( 'x_transient_v2'); -- compare the output with the first one
select *
from table(information_schema.automatic_clustering_history(
date_range_start=>dateadd(d, -1, current_date),
date_range_end=>current_date,
table_name=>'X_TRANSIENT_V2')); -- wait until you see a row where NUM_ROWS_RECLUSTERED is not zero
select system$clustering_information( 'x_transient_v2'); -- compare the output which should be similar to first
Technically yes, you can ALTER ... CLUSTER BY a transient or temporary table.
But I'm not sure that the induced cost and expected performance will suit your use case.
You should read the Snowflake documentation on clustered tables, because clustering keys are not intended for all tables. You have to consider table size and performance.
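A minimal sketch of adding a clustering key after creation, in Snowflake SQL like the samples above. The table name x_transient_v3 is hypothetical and only used for this illustration.

```sql
-- x_transient_v3 is a hypothetical table, created without a clustering key
create or replace transient table x_transient_v3 (id number, v varchar)
as select UNIFORM(0,100,random()), 'xyz' from table(generator(rowcount=>1000));

-- add a clustering key to the existing table
alter table x_transient_v3 cluster by (id);

-- inspect the clustering state for the new key
select system$clustering_information('x_transient_v3');

-- drop the key again if the cost does not pay off
alter table x_transient_v3 drop clustering key;
```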

Can we control the order of locks in SQL Server?

This is my scenario, I have a table like this:
CREATE TABLE [MyTable]
(
[Id] BIGINT PRIMARY KEY,
[Value] NVARCHAR(100) NOT NULL,
[IndexColumnA] NVARCHAR(100) NOT NULL,
[IndexColumnB] NVARCHAR(100) NOT NULL
)
CREATE INDEX [IX_A] ON [MyTable] ([IndexColumnA])
CREATE INDEX [IX_B] ON [MyTable] ([IndexColumnB])
And I have two use cases with two different update commands:
UPDATE [MyTable] SET [Value] = '...' WHERE [IndexColumnA] = '...'
and
UPDATE [MyTable] SET [Value] = '...' WHERE [IndexColumnB] = '...'
Both update commands may update multiple rows and these commands caused a deadlock when executed concurrently.
My speculation is that the two update commands use different indexes when scanning the rows, so the order in which locks are placed on rows differs. As a result, one update command may try to place a U lock on a row that already has an X lock placed by the other update command. (I am not a database expert, so correct me if I am wrong.)
One possible solution to this should be forcing database to place locks in the same order. According to https://dba.stackexchange.com/questions/257217/why-am-i-getting-a-deadlock-for-a-single-update-query, it seems we can do this by SELECT ... ORDER BY ... FOR UPDATE in PostgreSQL.
Can we (and should we) do this in SQL Server? If not, is handling the deadlock in application code the only solution?

How to prevent deadlock in concurrent T-SQL transactions?

I have a query which inserts hundreds of records. The idea behind the query is:
DELETE the old record with id
INSERT a new record with the same id
If a record with id does not exist, a value for eternal_id is generated
If a record with id exists, its existing eternal_id value is kept
The query executes in a transaction at the Read Committed isolation level.
The query looks like:
DECLARE @id1 int = 100
DECLARE @id2 int = 200
CREATE TABLE #t(
[eternal_id] [uniqueidentifier] NULL,
[id] [int] NOT NULL
)
DELETE FROM [dbo].[SomeTable] WITH (HOLDLOCK)
OUTPUT
DELETED.eternal_id
,DELETED.id
INTO #t
WHERE [id] IN (@id1, @id2)
INSERT INTO [dbo].[SomeTable]
([id]
,[title]
,[eternal_id])
SELECT main.*, ISNULL([eternal_id], NEWID())
FROM
(
SELECT
@id1 Id
,'Some title 1' Title
UNION
SELECT
@id2 Id
,'Some title 2' Title
) AS main
LEFT JOIN #t t ON main.[id] = t.[id]
DROP TABLE #t
I have hundreds of threads executing this query with different @id values. Everything works perfectly when the record already exists in [dbo].[SomeTable], but when records with @id don't exist I get:
Transaction (Process ID 73) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
So the problem appears when two or more concurrent threads pass the same @id and the record does not exist in [dbo].[SomeTable].
I tried to remove WITH (HOLDLOCK) here:
DELETE FROM [dbo].[SomeTable] WITH (HOLDLOCK)
OUTPUT
DELETED.eternal_id
,DELETED.id
INTO #t
WHERE [id] IN (@id1, @id2)
This did not help, and I started to get:
Violation of PRIMARY KEY constraint 'PK__SomeTable__3213E83F5D97F3D0'. Cannot insert duplicate key in object 'dbo.SomeTable'. The duplicate key value is (49).
The statement has been terminated.
So without WITH (HOLDLOCK) it works badly even when the record already exists.
How can I prevent deadlocks when a record with the given id doesn't exist in the table?
Conditional update of eternal_id can be done like this:
update t set
...
eternal_id = ISNULL(t.eternal_id, NEWID())
from [dbo].[SomeTable] t
where t.id = @id
Thus you will keep the old value if it exists. No need to delete/insert. Unless you have some magic in triggers.
I think the comment above from @DaleK helped me the most. I will quote it:
While its a great ambition to try and avoid all deadlocks... its not
always possible... and you can't prevent all future deadlocks from
happens, because as more rows are added to tables query plans change.
Any application code should have some form of retry mechanism to
handle this. – Dale K
So I decided to implement some form of retry mechanism to handle this.
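A retry can also be sketched in T-SQL itself rather than in application code. This is only an illustration under assumptions: the attempt limit and the delay are made-up values, and the comment in the TRY block marks where the DELETE ... OUTPUT and INSERT statements from the question would go.

```sql
DECLARE @attempt int = 1;
DECLARE @maxAttempts int = 3;  -- hypothetical limit

WHILE @attempt <= @maxAttempts
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;

        -- the DELETE ... OUTPUT INTO #t and INSERT statements go here

        COMMIT TRANSACTION;
        BREAK;  -- success: leave the retry loop
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

        -- 1205 is the deadlock-victim error; re-raise anything else,
        -- and re-raise the deadlock too once the attempts are used up
        IF (ERROR_NUMBER() <> 1205 OR @attempt >= @maxAttempts)
            THROW;

        SET @attempt += 1;
        WAITFOR DELAY '00:00:00.100';  -- brief pause before retrying
    END CATCH
END
```

Retrying in the application layer works the same way: catch error 1205, roll back, wait briefly, and rerun the whole transaction.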

SQL server grouping on NULLABLE column

I have a situation in SQL Server (with a legacy DB) that I can't understand.
I have a table A (about 2 million rows) with a column CODE that allows NULL. Only a few rows (fewer than 10) have CODE = NULL. When I run the query:
select code, sum(C1)
from A
-- where code is not null
group by code;
It runs forever. But when I uncomment the where clause, it takes around 1.5s (still too slow, right?).
Could anyone help me by pointing out the possible causes of this?
Execution plan added:
As a general rule, NULL values cannot be stored by a conventional index. So even if you have an index on code, your WHERE condition cannot benefit from that index.
If C1 is included in the index (which I assume is NOT NULL), things are different, because all the tuples (code=NULL, C1=(some value)) can and will be indexed. These are few, according to your question; so SQL Server can get a considerable speedup by just returning the rows for all these tuples.
First of all, a few words about performance. You have several options in your case.
Indexed view:
IF OBJECT_ID('dbo.t', 'U') IS NOT NULL
DROP TABLE dbo.t
GO
CREATE TABLE dbo.t (
ID INT IDENTITY PRIMARY KEY,
Code VARCHAR(10) NULL,
[Status] INT NULL
)
GO
CREATE VIEW dbo.v
WITH SCHEMABINDING
AS
SELECT Code, [Status] = SUM(ISNULL([Status], 0)), Cnt = COUNT_BIG(*)
FROM dbo.t
WHERE Code IS NOT NULL
GROUP BY Code
GO
CREATE UNIQUE CLUSTERED INDEX ix ON dbo.v (Code)
SELECT Code, [Status]
FROM dbo.v
Filtered index:
CREATE NONCLUSTERED INDEX ix ON dbo.t (Code)
INCLUDE ([Status])
WHERE Code IS NOT NULL
I'll wait for your second execution plan.

TSQL ID generation

I have a question regarding locking in T-SQL. Suppose I have the following table:
A(int id, varchar name)
where id is the primary key, but is NOT an identity column.
I want to use the following pseudocode to insert a value into this table:
lock (A)
uniqueID = GenerateUniqueID()
insert into A values (uniqueID, somename)
unlock(A)
How can this be accomplished in T-SQL? The computation of the next id should be done with table A locked, to prevent other sessions from doing the same operation at the same time and getting the same id.
If you have custom logic that you want to apply in generating the ids, wrap it up in a user-defined function, and then use that function as the default for the column. This should reduce concurrency issues, similarly to the built-in id generators, by deferring the generation to the point of insert and piggybacking on the insert's locking behavior.
create table ids (id int, somval varchar(20))
Go
Create function GenerateUniqueID()
returns int as
Begin
declare @ret int
select @ret = max(isnull(id,1)) * 2 from ids
if @ret is null set @ret = 2
return @ret
End
go
alter table ids add Constraint DF_IDS Default(dbo.GenerateUniqueID()) for Id
There are really only three ways to go about this.
1. Change the ID column to be an IDENTITY column where it auto increments by some value on each insert.
2. Change the ID column to be a GUID with a default constraint of NEWID() or NEWSEQUENTIALID(). Then you can insert your own value or let the table generate one for you on each insert.
3. On each insert, start a transaction. Then get the next available ID using something like select max(id)+1. Do this in a single SQL statement if possible in order to limit the possibility of a collision.
On the whole, most people prefer option 1. It's fast, easy to implement, and most people understand it.
I tend to go with option 2 with the apps I work on simply because we tend to scale out (and up) our databases. This means we routinely have apps with a multi-master situation. Be aware that using GUIDs as primary keys can mean your indexes are routinely trashed.
I'd stay away from option 3 unless you just don't have a choice. In which case I'd look at how the datamodel is structured anyway because there's bound to be something wrong.
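For completeness, the single-statement form of the max(id)+1 approach could look like the sketch below. It assumes the table A (id, name) from the question, and the value of @name is made up. The exclusive table lock serializes all writers, which is one reason this approach scales poorly.

```sql
-- Assumption: table A (id int primary key, name varchar) from the question.
DECLARE @name varchar(50) = 'somename';  -- hypothetical value

-- Compute the next id and insert it in one statement. TABLOCKX takes an
-- exclusive table lock, so no other session can read MAX(id) in between.
INSERT INTO A (id, name)
SELECT ISNULL(MAX(id), 0) + 1, @name   -- ISNULL covers the empty table
FROM A WITH (TABLOCKX, HOLDLOCK);
```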
You use the NEWID() function and you do not need any locking mechanism
You tell a column to be IDENTITY and you do not need any locking mechanism
If you generate these IDs manually and there is a chance parallel calls could generate the same IDs then something like this:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
-- scalar user-defined functions must be called with their schema name
DECLARE @NextID int = dbo.GenerateUniqueID()
WHILE EXISTS (SELECT ID FROM A WHERE ID = @NextID)
BEGIN
SET @NextID = dbo.GenerateUniqueID()
END
INSERT INTO A (ID, Text) VALUES (@NextID, 'content')
COMMIT TRANSACTION
@Markus, you should look at using either IDENTITY or NEWID() as noted in the other answers. If you absolutely can't, here's an option for you...
DECLARE @NewID INT
BEGIN TRAN
-- ISNULL covers the case where the table is still empty
SELECT @NewID = ISNULL(MAX(ID), 0) + 1
FROM TableA WITH (TABLOCKX)
INSERT TableA
(ID, OtherFields)
VALUES (@NewID, OtherFields)
COMMIT TRAN
If you're using SQL Server 2005 or later, you can use the OUTPUT clause to do what you're asking, without any kind of lock. (The table test1 simulates the table you're inserting into, and since OUTPUT ... INTO needs a table rather than a scalar variable to hold the results, #result will do that.):
create table test1( test INT)
create table #result (LastValue INT)
insert into test1
output INSERTED.test into #result(LastValue)
select dbo.GenerateUniqueID()
select LastValue from #result
Just to update an old post: since SQL Server 2012 it is possible to use a feature called Sequence. Sequences are created in much the same way as a function, and it is possible to specify the range, direction (ascending or descending) and rollover point. You can then use NEXT VALUE FOR to generate the next value in the range.
See the following documentation from Microsoft.
http://technet.microsoft.com/en-us/library/ff878091.aspx
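A minimal sketch of the sequence approach, assuming SQL Server 2012 or later and the table A (id, name) from the question. The sequence name dbo.ASeq is made up for this example.

```sql
-- dbo.ASeq is a hypothetical sequence name
CREATE SEQUENCE dbo.ASeq
    AS int
    START WITH 1
    INCREMENT BY 1
    NO CYCLE;

-- Each call to NEXT VALUE FOR hands out a distinct number, so concurrent
-- sessions do not need to lock table A to obtain an id.
INSERT INTO A (id, name)
VALUES (NEXT VALUE FOR dbo.ASeq, 'somename');
```

Sequence values are not returned when a transaction rolls back, so gaps in the ids are possible.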
