I have a very simple update statement within one job step:
UPDATE [Table]
SET
[Flag] = 1
WHERE [ID] = (SELECT MAX([ID]) FROM [Table] WHERE [Name] = 'DEV')
Normally there are no issues with this code, but sometimes it ends up in a deadlock.
Is it possible in general for such a stand-alone piece of code to lead to a deadlock?
Table schema:
CREATE TABLE [Table]
(
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[Name] [varchar](100) NOT NULL,
[Flag] [bit] NULL,
CONSTRAINT [Table_ID] PRIMARY KEY CLUSTERED ([ID] ASC)
)
The deadlock cause is quite obvious: there is no index on Name, so the subquery is going to scan the whole table. There is also no UPDLOCK hint on it, which makes deadlocks more likely.
Create an index on Name
CREATE NONCLUSTERED INDEX IX_Name ON [Table] (Name) INCLUDE (ID);
And make sure you use UPDLOCK on the subquery
UPDATE [Table]
SET Flag = 1
WHERE ID = (
SELECT MAX(ID)
FROM [Table] t2 WITH (UPDLOCK)
WHERE t2.Name = 'DEV')
This query is much more efficiently written without a self-join, like this:
UPDATE t
SET Flag = 1
FROM (
SELECT TOP (1)
*
FROM [Table] t
WHERE t.Name = 'DEV'
ORDER BY ID DESC
) t;
Even though the optimizer can often transform the first version into this one, it's better to just write it like this anyway.
This version does not need an UPDLOCK hint; it will be added automatically. You still need the above index, though.
I have a table with a very large number of rows which I wish to migrate via dynamic SQL. The generated statements are basically existence checks and insert statements; I want to migrate data from one production database to another, as we are merging transactional data. I am trying to find the optimal way to execute them.
I've found that the COALESCE method of appending all the statements to one another is not efficient for this, particularly when more than ~100 rows are executed at a time.
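For context, the concatenation approach I mean is roughly the following (just a sketch; [dbo].[MigrationStatements] and [SqlText] are made-up names for wherever the generated statements are held):
-- Hypothetical illustration of the COALESCE concatenation approach described above.
DECLARE @sql NVARCHAR(MAX);

SELECT @sql = COALESCE(@sql + CHAR(10), N'') + [SqlText]
FROM [dbo].[MigrationStatements];

EXEC sys.sp_executesql @sql;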
Assume the structure of the source table is something arbitrary like this:
CREATE TABLE [dbo].[MyTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[DataField1] [int] NOT NULL,
[FK_ID1] [int] NOT NULL,
[LotsMoreFields] [NVARCHAR] (MAX),
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED ([ID] ASC)
)
CREATE TABLE [dbo].[FK1]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [int] NOT NULL, -- Unique constrained value
CONSTRAINT [PK_FK1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
The other requirement is that I am tracking the source table PK vs. the target PK, and whether an insert occurred or whether I have already migrated that row to the target. To do this, I'm tracking migrated rows in another table like so:
CREATE TABLE [dbo].[ChangeTracking]
(
[ReferenceID] BIGINT IDENTITY(1,1),
[Src_ID] BIGINT,
[Dest_ID] BIGINT,
[TableName] NVARCHAR(255),
CONSTRAINT [PK_ChangeTracking] PRIMARY KEY CLUSTERED ([ReferenceID] ASC)
)
My existing method executes dynamic SQL generated by a stored procedure. The stored proc does PK lookups, as the source system has different PK values for table [dbo].[FK1].
E.g.
IF NOT EXISTS (<ignore this existence check for now>)
BEGIN
INSERT INTO [Dest].[dbo].[MyTable] ([DataField1],[FK_ID1],[LotsMoreFields]) VALUES (333,(SELECT [ID] FROM [Dest].[dbo].[FK1] WHERE [Name]=N'ValueFoundInSource'),N'LotsMoreValues');
INSERT INTO [Dest].[dbo].[ChangeTracking] ([Src_ID],[Dest_ID],[TableName]) VALUES (666,SCOPE_IDENTITY(),N'MyTable'); --666 is the PK in [Src].[dbo].[MyTable] for this inserted row
END
So when you have a million of these, it isn't quick.
Is there a recommended performant way of doing this?
As mentioned, the MERGE statement works well when you're looking at a complex JOIN condition (if any of these fields are different, update the record to match). You can also look into creating a HASHBYTES hash of the entire record to quickly find differences between source and target tables, though that can also be time-consuming on very large data sets.
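For illustration, here is a rough sketch of the hash idea against the tables above, matching already-migrated rows through [ChangeTracking] (HASHBYTES needs SQL Server 2016+ to accept inputs over 8000 bytes, such as the NVARCHAR(MAX) column):
-- Find migrated rows whose payload now differs between source and target.
-- The FK column is left out of the hash because the surrogate IDs differ between the two systems.
SELECT ct.[Src_ID], ct.[Dest_ID]
FROM [Dest].[dbo].[ChangeTracking] ct
JOIN [Src].[dbo].[MyTable]  s ON s.[ID] = ct.[Src_ID]
JOIN [Dest].[dbo].[MyTable] d ON d.[ID] = ct.[Dest_ID]
WHERE ct.[TableName] = N'MyTable'
  AND HASHBYTES('SHA2_256', CONCAT(s.[DataField1], N'|', s.[LotsMoreFields]))
   <> HASHBYTES('SHA2_256', CONCAT(d.[DataField1], N'|', d.[LotsMoreFields]));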
It sounds like you're making these updates like a front-end developer, by checking each row for a match and then doing the insert. It will be far more efficient to do the inserts with a single query. Below is an example that looks for names that are in the tblNewClient table, but not in the tblClient table:
INSERT INTO tblClient
( [Name] ,
TypeID ,
ParentID
)
SELECT nc.[Name] ,
nc.TypeID ,
nc.ParentID
FROM tblNewClient nc
LEFT JOIN tblClient cl
ON nc.[Name] = cl.[Name]
WHERE cl.ID IS NULL;
This will be way more efficient than doing it RBAR (row by agonizing row).
Taking the two answers from @RusselFox and putting them together, I reached this tentative solution, which looks a LOT more efficient:
MERGE INTO [Dest].[dbo].[MyTable] [MT_D]
USING (SELECT [MT_S].[ID] AS [SrcID], [MT_S].[DataField1], [FK1_D].[ID] AS [FK_ID1], [MT_S].[LotsMoreFields]
       FROM [Src].[dbo].[MyTable] [MT_S]
       JOIN [Src].[dbo].[FK1] [FK1_S] ON [MT_S].[FK_ID1] = [FK1_S].[ID]
       JOIN [Dest].[dbo].[FK1] [FK1_D] ON [FK1_S].[Name] = [FK1_D].[Name]
      ) [SRC] ON 1 = 0 -- always false, so every source row hits WHEN NOT MATCHED and OUTPUT can pair the source ID with the new identity
WHEN NOT MATCHED THEN
    INSERT ([DataField1], [FK_ID1], [LotsMoreFields])
    VALUES ([SRC].[DataField1], [SRC].[FK_ID1], [SRC].[LotsMoreFields])
OUTPUT [SRC].[SrcID], INSERTED.[ID], 0, N'MyTable'
    INTO [Dest].[dbo].[ChangeTracking] ([Src_ID], [Dest_ID], [AlreadyExists], [TableName]); -- assumes an [AlreadyExists] column has been added to [ChangeTracking]
I was wondering if someone could shed some light on why SQL Server (2016 RTM in my case, but I suspect this is not version-specific) is performing this seemingly unnecessary INNER JOIN.
Consider the following two tables joined by a foreign key:
CREATE TABLE [dbo].[batches](
[Id] [smallint] IDENTITY(1,1) PRIMARY KEY,
[Date] [date] NOT NULL,
[Run] [tinyint] NOT NULL,
[Clean] [bit] NOT NULL)
CREATE TABLE [dbo].[batch_values](
[Batch_Id] [smallint] NOT NULL,
[Key] [int] NOT NULL,
[Value] [int] NOT NULL,
CONSTRAINT [PK_batch_values] PRIMARY KEY CLUSTERED
( [Batch_Id] ASC, [Key] ASC))
GO
ALTER TABLE [dbo].[batch_values] WITH CHECK
ADD CONSTRAINT [FK_batch_values_batches] FOREIGN KEY([Batch_Id])
REFERENCES [dbo].[batches] ([Id])
GO
ALTER TABLE [dbo].[batch_values] CHECK CONSTRAINT [FK_batch_values_batches]
GO
Populate the tables with some data:
SET NOCOUNT ON;
DECLARE
    @BatchCount int,
    @BatchId smallint,
    @KeyCount int;
SET @BatchCount = 1;
WHILE @BatchCount <= 100
BEGIN
    INSERT INTO dbo.[batches]
    VALUES (DATEADD(dd, @BatchCount / 10, '2016-01-01'), @BatchCount % 10, @BatchCount % 2);
    SET @BatchId = SCOPE_IDENTITY();
    SET @KeyCount = 1;
    WHILE @KeyCount <= 1000
    BEGIN
        INSERT INTO dbo.batch_values
        VALUES (@BatchId, @KeyCount, RAND() * 1000000 - 500000);
        SET @KeyCount = @KeyCount + 1;
    END;
    SET @BatchCount = @BatchCount + 1;
END;
Now, if I run the following query, the execution plan shows that SQL Server is performing the INNER JOIN to the [batches] table, even though no columns are selected from it, and no records could be dropped from [batch_values] as a result of the join, due to the foreign key constraint.
[screenshot of query and execution plan]
It seems to me that the Query Optimizer should discard the INNER JOIN as unnecessary and simply do a primary key seek on [batch_values], but it doesn't.
This matters because if I develop views that join multiple tables to present a "bigger picture" of the underlying data for ease of use, I will take a performance hit whenever I query those views.
There are many limitations on when the SQL Server optimizer can use JOIN ELIMINATION.
E.g. it may be skipped if the foreign key uses multiple columns, or if the constraint is not trusted or is marked as 'not for replication', etc.
SQL Server may also decline to use JOIN ELIMINATION if you specify a WHERE predicate on the foreign key column.
Remove the WHERE clause, or remove "Batch_Id = 100" from it, and you should see that the optimizer now uses JOIN ELIMINATION.
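For illustration, using the tables above (the exact query in the screenshot isn't shown, so this is only a guess at its shape):
-- With a WHERE predicate on the foreign key column, the join to [batches] tends to stay in the plan:
SELECT bv.[Key], bv.[Value]
FROM dbo.batch_values bv
INNER JOIN dbo.batches b ON b.[Id] = bv.[Batch_Id]
WHERE bv.[Batch_Id] = 100;

-- Without that predicate, the join can be eliminated, because the trusted foreign key
-- guarantees every [Batch_Id] has a matching row in [batches]:
SELECT bv.[Key], bv.[Value]
FROM dbo.batch_values bv
INNER JOIN dbo.batches b ON b.[Id] = bv.[Batch_Id];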
The documentation is limited on this topic, so I can't provide a link as proof, but many people have reported this behaviour over the past 5-7 years for different versions and agree that it is by design. My recommendation is to raise a support incident with Microsoft and ask them about it directly if it is critical for your system.
While trying to dissect a SQL Server stored proc that's been running slow, we found that simply using a temp table instead of a real table had a drastic impact on performance. The table we're swapping out (ds_location) only has 173 rows.
This query completes in 1 second:
IF OBJECT_ID('tempdb..#Location') IS NOT NULL DROP TABLE #Location
SELECT * INTO #Location FROM ds_location
SELECT COUNT(*)
FROM wip_cubs_hc m
INNER JOIN ds_scenario sc ON sc.Scenario = m.Scenario
INNER JOIN ds_period pe ON pe.Period = m.ReportingPeriod
INNER JOIN #Location l ON l.Location = m.Sh_Location
Compare that to the original, which takes 7 seconds:
SELECT COUNT(*)
FROM wip_cubs_hc m
INNER JOIN ds_scenario sc ON sc.Scenario = m.Scenario
INNER JOIN ds_period pe ON pe.Period = m.ReportingPeriod
INNER JOIN ds_location l ON l.Location = m.Sh_Location
Here's the definition of wip_cubs_hc. It contains 1.7 million rows:
CREATE TABLE wip_cubs_hc(
Scenario varchar(16) NOT NULL,
ReportingPeriod varchar(50) NOT NULL,
Sh_Location varchar(50) NOT NULL,
Department varchar(50) NOT NULL,
ProductName varchar(75) NOT NULL,
Account varchar(50) NOT NULL,
Balance varchar(50) NOT NULL,
Source varchar(50) NOT NULL,
Data numeric(18, 6) NOT NULL,
CONSTRAINT PK_wip_cubs_hc PRIMARY KEY CLUSTERED
(
Scenario ASC,
ReportingPeriod ASC,
Sh_Location ASC,
Department ASC,
ProductName ASC,
Account ASC,
Balance ASC,
Source ASC
)
)
CREATE NONCLUSTERED INDEX IX_wip_cubs_hc_Balance
ON [dbo].[wip_cubs_hc] ([Scenario],[Sh_Location],[Department],[Balance])
INCLUDE ([ReportingPeriod],[ProductName],[Account],[Source])
I'd love to know HOW to determine what's causing the slowdown, too.
I can answer the "How to determine the slowdown" question...
Take a look at the execution plan of both queries. You do this by going to the "Query" menu > "Display Estimated Execution Plan". The default keyboard shortcut is Ctrl+L. You can see the plan for multiple queries at once as well. Look at the type of operation being done. What you want to see are things like Index Seek instead of Index Scan, etc.
This article explains some of the other things to look for.
Without knowing the schema/indexes of all the tables involved, this is where I would suggest starting.
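You can also get a quick numeric comparison by turning on the I/O and time statistics before running each version; the figures show up in the Messages tab:
-- Shows per-table logical reads plus CPU and elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM wip_cubs_hc m
INNER JOIN ds_scenario sc ON sc.Scenario = m.Scenario
INNER JOIN ds_period pe ON pe.Period = m.ReportingPeriod
INNER JOIN ds_location l ON l.Location = m.Sh_Location;

-- Re-run with the #Location version and compare the "logical reads" numbers.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;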
Best of Luck!
I just want to know how I should index this table for optimal performance. It will potentially hold around 20M rows.
CREATE TABLE [dbo].[Table1](
[ID] [bigint] NOT NULL,
[Col1] [varchar](100) NULL,
[Col2] [varchar](100) NULL,
[Description] [varchar](100) NULL
) ON [PRIMARY]
Basically, this table will be queried ONLY in this manner.
SELECT ID FROM Table1
WHERE Col1 = 'exactVal1' AND Col2 = 'exactVal2' AND [Description] = 'exactDesc'
This is what I did:
CREATE NONCLUSTERED INDEX IX_ID
ON Table1(ID)
GO
CREATE NONCLUSTERED INDEX IX_Col1
ON Table1(Col1)
GO
CREATE NONCLUSTERED INDEX IX_Col2
ON Table1(Col2)
GO
CREATE NONCLUSTERED INDEX IX_Description
ON Table1([Description])
GO
Am I right to index all these columns? I'm not really that confident yet; I'm new to SQL, so please let me know if I'm on the right track.
Again, a lot of data will be put into this table. Unfortunately, I cannot test the performance yet since there is no data available, but I will soon be generating some dummy data to test it. It would be great if there were already another option (suggestion) available that I could compare the results with.
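For what it's worth, the dummy data will probably be generated with something rough like this (all of the values here are made up):
-- Quick-and-dirty generator: ~1 million rows of random-ish values, purely for testing.
INSERT INTO dbo.Table1 (ID, Col1, Col2, [Description])
SELECT TOP (1000000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       'Val'  + CAST(ABS(CHECKSUM(NEWID())) % 1000 AS varchar(10)),
       'Val'  + CAST(ABS(CHECKSUM(NEWID())) % 1000 AS varchar(10)),
       'Desc' + CAST(ABS(CHECKSUM(NEWID())) % 1000 AS varchar(10))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;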
Thanks,
jack
I would combine these indexes into one index, instead of having three separate indexes. For example:
CREATE INDEX ix_cols ON dbo.Table1 (Col1, Col2, Description)
If this combination of columns is unique within the table, then you should add the UNIQUE keyword to make the index unique. This is partly for performance, but, more importantly, to enforce uniqueness. The columns could also be made into a primary key if that is appropriate (which would require making them NOT NULL).
Placing all of the columns into one index will give better performance because it will not be necessary for SQL Server to use multiple passes to find the row you are seeking.
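If the combination does turn out to be unique, the unique form is simply this (a sketch; only do this if duplicate combinations can never occur):
-- Enforces uniqueness of the lookup combination as well as supporting the predicate.
CREATE UNIQUE INDEX ix_cols ON dbo.Table1 (Col1, Col2, [Description]);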
Try this -
CREATE TABLE dbo.Table1
(
ID BIGINT NOT NULL
, Col1 VARCHAR(100) NULL
, Col2 VARCHAR(100) NULL
, [Description] VARCHAR(100) NULL
)
GO
CREATE CLUSTERED INDEX IX_Table1 ON dbo.Table1
(
Col1
, Col2
, [Description]
)
Or this -
CREATE TABLE dbo.Table1
(
ID BIGINT PRIMARY KEY NOT NULL
, Col1 VARCHAR(100) NULL
, Col2 VARCHAR(100) NULL
, [Description] VARCHAR(100) NULL
)
GO
CREATE UNIQUE NONCLUSTERED INDEX IX_Table1 ON dbo.Table1
(
Col1
, Col2
, [Description]
)
I have run into an interesting scenario. Using the context.database.sqlquery method available in Entity Framework, I am calling a stored procedure to update a field.
When calling the procedure directly in Management Studio, no deadlocks occur when editing two or more items.
When calling the same stored procedure using context.database.sqlquery, a deadlock is encountered when editing two or more items.
I have found a way to work around this, but I do not see why calling the procedure from the framework would cause a deadlock. So I thought I would ask.
--Create a temp table to hold items based on userid
CREATE TABLE #tableA
(
[Id] int identity (1,1),
[Item_Id] int
)
INSERT INTO #tableA ([Item_Id]) SELECT [Id] FROM [Items] WHERE [UserId] = @UserId
--All of the other processing is omitted for brevity
--Based on final processing above update the IsActive flag on the Items table
UPDATE i
SET [IsActive] = 0
FROM #tableA ta INNER JOIN [Items] i ON ta.[Item_Id] = i.[Item_Id]
WHERE i.[IsActive] = 1
Again, I have a solution that worked, just trying to understand why a deadlock would occur in EF and not when calling the procedure directly.
BTW, the solution that worked was to add the IsActive bit to the temp table, populate it when the temp table is loaded, and then use the temp table's IsActive in the WHERE clause of the join in the update statement.
--Create a temp table to hold items based on userid
CREATE TABLE #tableA
(
[Id] int identity (1,1),
[Item_Id] int,
[IsActive] bit
)
INSERT INTO #tableA ([Item_Id], [IsActive]) SELECT [Id], [IsActive] FROM [Items] WHERE [UserId] = @UserId
--All of the other processing is omitted for brevity
--Based on final processing above update the IsActive flag on the Items table
UPDATE i
SET [IsActive] = 0
FROM #tableA ta INNER JOIN [Items] i ON ta.[Item_Id] = i.[Item_Id]
WHERE ta.[IsActive] = 1