I am using the MERGE feature to insert data into a table, using a bulk import table as the source (as described here).
This is my query:
DECLARE @InsertMapping TABLE (BulkId int, TargetId int);
MERGE dbo.Target T
USING dbo.Source S
ON 0 = 1
WHEN NOT MATCHED THEN
INSERT (Data) VALUES (Data)
OUTPUT S.Id BulkId, inserted.Id INTO @InsertMapping;
When evaluating the performance by looking at the actual execution plan, I saw that a high-cost sort is done on the primary key index. I don't get it: the primary key should already be sorted ascending, so there shouldn't be any need for additional sorting.
[Screenshot of the actual execution plan showing the Sort operator]
Because of this sort cost, the query takes several seconds to complete. Is there a way to speed up the insert? Maybe some index hints or additional indexes? Such an insert shouldn't take that long, even if there are several thousand entries.
I can reproduce this issue with the following:
CREATE TABLE dbo.TargetTable(Id int IDENTITY PRIMARY KEY, Value INT)
CREATE TABLE dbo.BulkTable(Id int IDENTITY PRIMARY KEY, Value INT)
INSERT INTO dbo.BulkTable
SELECT TOP (1000000) 1
FROM sys.all_objects o1, sys.all_objects o2
DECLARE @TargetTableMapping TABLE (BulkId INT, TargetId INT);
MERGE dbo.TargetTable T
USING dbo.BulkTable S
ON 0 = 1
WHEN NOT MATCHED THEN
INSERT (Value)
VALUES (Value)
OUTPUT S.Id AS BulkId,
inserted.Id AS TargetId
INTO @TargetTableMapping;
This gives a plan with a sort before the clustered index merge operator.
The sort is on Expr1011, Action1010, which are both computed columns output by previous operators.
Expr1011 is the result of calling the internal and undocumented function getconditionalidentity to produce an id column for the identity column in TargetTable.
Action1010 is a flag indicating insert, update, delete. It is always 4 in this case as the only action this MERGE statement can perform is INSERT.
The reason the sort is in the plan is because the clustered index merge operator has the DMLRequestSort property set.
The DMLRequestSort property is set based on the number of rows expected to be inserted. Paul White explains in the comments here:
[DMLRequestSort] was added to support the ability to minimally-log
INSERT statements in 2008. One of the preconditions for minimal
logging is that the rows are presented to the Insert operator in
clustered key order.
Inserting into tables in clustered index key order can be more efficient anyway as it reduces random IO and fragmentation.
If the function getconditionalidentity returns generated identity values in ascending order (as would seem reasonable), then the input to the sort will already be in the desired order, and the sort in the plan is logically redundant (there was previously a similar issue with unnecessary sorts with NEWSEQUENTIALID).
It is possible to get rid of the sort by making the expression a bit more opaque.
DECLARE @TargetTableMapping TABLE (BulkId INT, TargetId INT);
DECLARE @N BIGINT = 0x7FFFFFFFFFFFFFFF;
MERGE dbo.TargetTable T
USING (SELECT TOP(@N) * FROM dbo.BulkTable) S
ON 1=0
WHEN NOT MATCHED THEN
INSERT (Value)
VALUES (Value)
OUTPUT S.Id AS BulkId,
inserted.Id AS TargetId
INTO @TargetTableMapping;
This reduces the estimated row count and the plan no longer has a sort. You will need to test whether or not this actually improves performance, though; it might possibly make things worse.
Related
I have a situation in SQL Server (with a legacy DB) that I can't understand.
I have a table A (about 2 million rows) with a column CODE that allows NULL. The number of rows where CODE is NULL is tiny (< 10 rows). When I run the query:
select code, sum(C1)
from A
-- where code is not null
group by code;
It runs forever. But when I un-comment the WHERE clause, it takes around 1.5 s (still too slow, right?).
Could anyone help me by pointing out the possible causes of this situation?
Execution plan added:
As a general rule, NULL values cannot be stored by a conventional index. So even if you have an index on code, your WHERE condition cannot benefit from that index.
If C1 is included in the index (which I assume is NOT NULL), things are different, because all the tuples (code=NULL, C1=(some value)) can and will be indexed. These are few, according to your question; so SQL Server can get a considerable speedup by just returning the rows for all these tuples.
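A minimal sketch of the kind of index being described, using the column names from the question (the index name itself is made up):
-- Carries C1 as an included column so the grouped SUM can be answered from the index alone
CREATE NONCLUSTERED INDEX IX_A_code_incl_C1 ON dbo.A (code) INCLUDE (C1);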
First of all, a few words about performance. There are several options in your case.
Indexed view -
IF OBJECT_ID('dbo.t', 'U') IS NOT NULL
DROP TABLE dbo.t
GO
CREATE TABLE dbo.t (
ID INT IDENTITY PRIMARY KEY,
Code VARCHAR(10) NULL,
[Status] INT NULL
)
GO
CREATE VIEW dbo.v
WITH SCHEMABINDING
AS
SELECT Code, [Status] = SUM(ISNULL([Status], 0)), Cnt = COUNT_BIG(*)
FROM dbo.t
WHERE Code IS NOT NULL
GROUP BY Code
GO
CREATE UNIQUE CLUSTERED INDEX ix ON dbo.v (Code)
SELECT Code, [Status]
FROM dbo.v
Filtered index -
CREATE NONCLUSTERED INDEX ix ON dbo.t (Code)
INCLUDE ([Status])
WHERE Code IS NOT NULL
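With the filtered index in place, a query shaped like the original one (a sketch mirroring the view definition above) can be satisfied from the index without touching the NULL rows:
SELECT Code, [Status] = SUM(ISNULL([Status], 0))
FROM dbo.t
WHERE Code IS NOT NULL
GROUP BY Code;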
I'll wait for your second execution plan.
This is on Azure.
I have a supertype entity and several subtype entities, the latter of which need to obtain their foreign keys from the primary key of the supertype entity on each insert. In Oracle, I use a BEFORE INSERT trigger to accomplish this. How would one accomplish this in SQL Server / T-SQL?
DDL
CREATE TABLE super (
super_id int IDENTITY(1,1)
,subtype_discriminator char(4) CHECK (subtype_discriminator IN ('SUB1', 'SUB2'))
,CONSTRAINT super_id_pk PRIMARY KEY (super_id)
);
CREATE TABLE sub1 (
sub_id int IDENTITY(1,1)
,super_id int NOT NULL
,CONSTRAINT sub_id_pk PRIMARY KEY (sub_id)
,CONSTRAINT sub_super_id_fk FOREIGN KEY (super_id) REFERENCES super (super_id)
);
I wish for an insert into sub1 to fire a trigger that actually inserts a value into super and uses the super_id generated to put into sub1.
In Oracle, this would be accomplished by the following:
CREATE TRIGGER sub_trg
BEFORE INSERT ON sub1
FOR EACH ROW
DECLARE
v_super_id int; -- Ignore the fact that I could have used super_id_seq.CURRVAL
BEGIN
INSERT INTO super (super_id, subtype_discriminator)
VALUES (super_id_seq.NEXTVAL, 'SUB1')
RETURNING super_id INTO v_super_id;
:NEW.super_id := v_super_id;
END;
Please advise on how I would simulate this in T-SQL, given that T-SQL lacks the BEFORE INSERT capability.
Sometimes a BEFORE trigger can be replaced with an AFTER one, but this doesn't appear to be the case in your situation, for you clearly need to provide a value before the insert takes place. So, for that purpose, the closest functionality would seem to be the INSTEAD OF trigger, as @marc_s has suggested in his comment.
Note, however, that, as the names of these two trigger types suggest, there's a fundamental difference between a BEFORE trigger and an INSTEAD OF one. In both cases the trigger executes before the action invoked by the triggering statement has taken place, but with an INSTEAD OF trigger the action is never supposed to take place at all: whatever really needs to happen must be done by the trigger itself. This is very unlike the BEFORE trigger functionality, where the statement always executes unless, of course, you explicitly roll it back.
But there's one other issue to address actually. As your Oracle script reveals, the trigger you need to convert uses another feature unsupported by SQL Server, which is that of FOR EACH ROW. There are no per-row triggers in SQL Server either, only per-statement ones. That means that you need to always keep in mind that the inserted data are a row set, not just a single row. That adds more complexity, although that'll probably conclude the list of things you need to account for.
So, it's really two things to solve then:
replace the BEFORE functionality;
replace the FOR EACH ROW functionality.
My attempt at solving these is below:
CREATE TRIGGER sub_trg
ON sub1
INSTEAD OF INSERT
AS
BEGIN
DECLARE @new_super TABLE (
super_id int
);
INSERT INTO super (subtype_discriminator)
OUTPUT INSERTED.super_id INTO @new_super (super_id)
SELECT 'SUB1' FROM INSERTED;
INSERT INTO sub1 (super_id)
SELECT super_id FROM @new_super;
END;
This is how the above works:
The same number of rows as being inserted into sub1 is first added to super. The generated super_id values are stored in temporary storage (a table variable called @new_super).
The newly inserted super_ids are now inserted into sub1.
Nothing too difficult really, but the above will only work if you have no other columns in sub1 than those you've specified in your question. If there are other columns, the above trigger will need to be a bit more complex.
The problem is to assign the new super_ids to every inserted row individually. One way to implement the mapping could be like below:
CREATE TRIGGER sub_trg
ON sub1
INSTEAD OF INSERT
AS
BEGIN
DECLARE @new_super TABLE (
rownum int IDENTITY (1, 1),
super_id int
);
INSERT INTO super (subtype_discriminator)
OUTPUT INSERTED.super_id INTO @new_super (super_id)
SELECT 'SUB1' FROM INSERTED;
WITH enumerated AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rownum
FROM inserted
)
INSERT INTO sub1 (super_id, other columns)
SELECT n.super_id, i.other columns
FROM enumerated AS i
INNER JOIN @new_super AS n
ON i.rownum = n.rownum;
END;
As you can see, an IDENTITY(1,1) column is added to @new_super, so the captured super_id values will additionally be enumerated starting from 1. To provide the mapping between the new super_ids and the new data rows, the ROW_NUMBER function is used to enumerate the INSERTED rows as well. As a result, every row in the INSERTED set can now be linked to a single super_id and thus complemented to a full data row to be inserted into sub1.
Note that the order in which the new super_ids are inserted may not match the order in which they are assigned. I considered that a no-issue. All the new super rows generated are identical save for the IDs. So, all you need here is just to take one new super_id per new sub1 row.
If, however, the logic of inserting into super is more complex and for some reason you need to remember precisely which new super_id has been generated for which new sub row, you'll probably want to consider the mapping method discussed in this Stack Overflow question:
Using merge..output to get mapping between source.id and target.id
While Andriy's proposal will work well for INSERTs of a small number of records, full table scans will be done on the final join as both 'enumerated' and '@new_super' are not indexed, resulting in poor performance for large inserts.
This can be resolved by specifying a primary key on the #new_super table, as follows:
DECLARE @new_super TABLE (
row_num INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
super_id int
);
This will result in the SQL optimizer scanning through the 'enumerated' table but doing an indexed join on @new_super to get the new key.
I am in the process of adding nonclustered indexes to a SQL Azure DB, and I wonder: what is the difference between having multiple columns in a single nonclustered index, compared to having multiple nonclustered indexes, each with a single column?
the difference between having multiple columns in a single non clustered index, compared to having multiple non clustered indexes with each having a single column in it?
Consider a table T with a clustered PK Id column, and additionally columns A, B, C.
A single nonclustered index containing A, B, and C could support fast lookup for queries such as:
WHERE A > @a
WHERE A = @a AND @b1 < B AND B < @b2
WHERE A = @a AND B = @b AND C < @c
but not
WHERE B = @b
WHERE C = @c
in both of which cases, we can't do any better than a table scan
However, if we have multiple indexes, IX_A on A, etc:
WHERE A = @a
WHERE B = @b
WHERE C = @c
would all benefit from an index, and a composite query such as
WHERE A = @a AND B = @b AND C < @c
would have a small benefit too.
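For concreteness, here is a sketch of the two alternatives being compared (T is the hypothetical table described above; all names are made up):
-- Hypothetical table from the discussion above
CREATE TABLE dbo.T (Id int IDENTITY PRIMARY KEY CLUSTERED, A int, B int, C int);
-- Option 1: one composite index; supports seeks on the prefixes A, (A, B), and (A, B, C)
CREATE NONCLUSTERED INDEX IX_T_A_B_C ON dbo.T (A, B, C);
-- Option 2: three single-column indexes; each supports seeks only on its own column
CREATE NONCLUSTERED INDEX IX_T_A ON dbo.T (A);
CREATE NONCLUSTERED INDEX IX_T_B ON dbo.T (B);
CREATE NONCLUSTERED INDEX IX_T_C ON dbo.T (C);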
The SELECT queries these will optimize are different.
You shouldn't be adding indexes without a need for query performance optimization.
If you have a query to optimize, then you can start thinking about adding indexes corresponding to the query you want to optimize.
Let's say you have a table containing the following structure :
create table emp
(
id int identity primary key clustered,
name nvarchar(200),
dep_id int,
salary int
);
Your clustered index (the primary key clustered clause) will speed up look-up and row data retrieval for queries using the id column as a filter, like:
select id, name, dep_id, salary
from emp
where id = @id
Any non clustered index on any (unique) column will speed up the look-up (not the row data retrieval) using the same column as a filter.
create index emp_name_idx on emp(name);
-- will speed up look-up for
select id, dep_id
from emp
where name = @name
If you also want to speed up the row data retrieval using the same index, you will need to add the columns present in the SELECT clause to the included columns.
create index emp_name_idx on emp(name) include(id, dep_id);
-- will speed up look-up and data row retrieval for
select id, dep_id
from emp
where name = @name
Any non clustered index containing multiple columns will speed up look-ups for queries filtering on the first n columns in the same order as they appear in the index (as long as the filters are equality comparisons; the optimization streak ends at the first inequality comparison).
create index emp_depsalary_idx on emp(dep_id, salary);
-- will speed up look-up and data row retrieval for
select id, dep_id
from emp
where dep_id = @dep_id
and salary > @salary
As with single-column indexes, to also optimize row data retrieval, use the include clause to specify every column in the SELECT clause that is not already present in the index column list.
Remember that for every index you add to a table, you are also adding INSERT overhead to that table. So you should always balance the gains the index brings to SELECT statements against the losses on INSERT ones.
Remember also that indexes use disk (and cache) space, and that should also count in the decision process.
I'm currently confronted with a strange behaviour in my database when querying the minimum ID for a specific date in a table containing about a hundred million rows. The query is quite simple:
SELECT MIN(Id) FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
This query never ends; at least, I let it run for hours. The DateConnection column is not indexed, nor included in an index, so I would understand if this query took quite a while. But I tried the following query, which runs in a few seconds:
SELECT Id FROM Connection WITH(NOLOCK) WHERE DateConnection = '2012-06-26'
It returns 300k rows.
My table is defined as this :
CREATE TABLE [dbo].[Connection](
[Id] [bigint] IDENTITY(1,1) NOT NULL,
[DateConnection] [datetime] NOT NULL,
[TimeConnection] [time](7) NOT NULL,
[Hour] AS (datepart(hour,[TimeConnection])) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection] PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
)
And it has the following index :
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id] ON [dbo].[Connection]
(
[Id] ASC
)ON [PRIMARY]
One solution I found, exploiting this strange behaviour, is the following code. But it seems quite heavy for such a simple query.
create table #TempId
(
[Id] bigint
)
go
insert into #TempId
select id from partitionned_connection with(nolock) where dateconnection = '2012-06-26'
declare @displayId bigint
select @displayId = min(Id) from #TempId
print @displayId
go
drop table #TempId
go
Has anybody been confronted with this behaviour, and what is the cause of it? Is the minimum aggregate scanning the entire table? And if so, why doesn't the simple select?
The root cause of the problem is the non-aligned nonclustered index, combined with the statistical limitation Martin Smith points out (see his answer to another question for details).
Your table is partitioned on [Hour] along these lines:
CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23);
CREATE PARTITION SCHEME PS
AS PARTITION PF ALL TO ([PRIMARY]);
-- Partitioned
CREATE TABLE dbo.Connection
(
Id bigint IDENTITY(1,1) NOT NULL,
DateConnection datetime NOT NULL,
TimeConnection time(7) NOT NULL,
[Hour] AS (DATEPART(HOUR, TimeConnection)) PERSISTED NOT NULL,
CONSTRAINT [PK_Connection]
PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
ON PS ([Hour])
);
-- Not partitioned
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id]
ON dbo.Connection
(
Id ASC
)ON [PRIMARY];
-- Pretend there are lots of rows
UPDATE STATISTICS dbo.Connection WITH ROWCOUNT = 200000000, PAGECOUNT = 4000000;
The query and execution plan are:
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH (READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26';
The optimizer takes advantage of the index (ordered on Id) to transform the MIN aggregate to a TOP (1) - since the minimum value will by definition be the first value encountered in the ordered stream. (If the nonclustered index were also partitioned, the optimizer would not choose this strategy since the required ordering would be lost).
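For illustration, the transformation is roughly equivalent to rewriting the query by hand as follows (a sketch; note that MIN returns NULL when no rows qualify, whereas TOP (1) simply returns an empty set):
-- Roughly what the TOP (1) transformation does with the Id-ordered nonclustered index
SELECT TOP (1) c.Id
FROM dbo.Connection AS c WITH (READUNCOMMITTED)
WHERE c.DateConnection = '2012-06-26'
ORDER BY c.Id;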
The slight complication is that we also need to apply the predicate in the WHERE clause, which requires a lookup to the base table to fetch the DateConnection value. The statistical limitation Martin mentions explains why the optimizer estimates it will only need to check 119 rows from the ordered index before finding one with a DateConnection value that will match the WHERE clause. The hidden correlation between DateConnection and Id values means this estimate is a very long way off.
In case you are interested, the Compute Scalar calculates which partition to perform the Key Lookup into. For each row from the nonclustered index, it computes an expression like [PtnId1000] = Scalar Operator(RangePartitionNew([dbo].[Connection].[Hour] as [c].[Hour],(1),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23))), and this is used as the leading key of the lookup seek. There is prefetching (read-ahead) on the nested loops join, but this needs to be an ordered prefetch to preserve the sorting required by the TOP (1) optimization.
Solution
We can avoid the statistical limitation (without using query hints) by finding the minimum Id for each Hour value, and then taking the minimum of the per-hour minimums:
-- Global minimum
SELECT
MinID = MIN(PerHour.MinId)
FROM
(
-- Local minimums (for each distinct hour value)
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH(READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26'
GROUP BY
c.[Hour]
) AS PerHour;
The execution plan is:
If parallelism is enabled, you will see a plan more like the following, which uses parallel index scan and multi-threaded stream aggregates to produce the result even faster:
Although it might be wise to fix the problem in a way that doesn't require index hints, a quick solution is this:
SELECT MIN(Id) FROM Connection WITH(NOLOCK, INDEX(PK_Connection)) WHERE DateConnection = '2012-06-26'
This forces a scan of the clustered index.
Alternatively, try this although it probably produces the same problem:
select top 1 Id
from Connection
WHERE DateConnection = '2012-06-26'
order by Id
It makes sense that finding the minimum takes longer than going through all the records. Finding the minimum of an unsorted structure takes much longer than traversing it once (unsorted because MIN() doesn't take advantage of the identity column). What you could do, since you're using an identity column, is have a nested select, where you take the first record from the set of records with the specified date.
The NC index scan is the issue in your case. The plan scans the unique nonclustered index and then, for each of the hundred million rows, traverses the clustered index, which causes millions of IOs (say your index height is 4; that could mean 100 million * 4 IOs, plus the scan of the nonclustered index leaf pages). The optimizer must have chosen this index to avoid the stream aggregate needed to get the minimum. There are three main techniques to find a minimum: first, use an index on the column we want the minimum of (efficient when such an index exists, since no calculation is required and the first row found is returned); second, a hash aggregate (but that usually happens when you have GROUP BY); and third, a stream aggregate, which scans through all the qualifying rows, keeps track of the minimum value, and returns it once all rows are scanned.
However, the query without MIN used a clustered index scan and is therefore fast, as it has to read fewer pages and thus does fewer IOs.
Now the question is why the optimizer picked the scan on the nonclustered index. I am sure it is to avoid the computation involved in finding the minimum with a stream aggregate, but in this case not using the stream aggregate is much more costly. This depends on estimates, so I guess the statistics on the table are not up to date.
So first of all, check whether your stats are up to date. When were the stats last updated?
Thus, to avoid the issue, do the following (a sketch of both suggestions appears after this list):
1. First, update the table statistics; I am sure this will remove your issue.
2. If you cannot update statistics, or updating them doesn't change the plan and it still uses the NC index scan, then force the clustered index scan so that it uses fewer IOs, followed by a stream aggregate to get the min value.
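A minimal sketch of both suggestions, using the table and index names from the question:
-- 1. Refresh statistics on the table (FULLSCAN for maximum accuracy)
UPDATE STATISTICS dbo.Connection WITH FULLSCAN;
-- 2. If the plan still scans the nonclustered index, force the clustered index instead
SELECT MIN(Id)
FROM dbo.Connection WITH (NOLOCK, INDEX([PK_Connection]))
WHERE DateConnection = '2012-06-26';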
I have a merge statement that needs to compare on many columns. The source table has 26,000 rows. The destination table has several million rows. The destination table only has a typical primary key index on an int-type column.
I did some selects with group by to count the number of unique values in the source.
The test part of the MERGE is:
Merge Into desttable
Using #temptable
On
(
desttable.ColumnA = #temptable.ColumnA
and
desttable.ColumnB = #temptable.ColumnB
and
desttable.ColumnC = #temptable.ColumnC
and
desttable.ColumnD = #temptable.ColumnD
and
desttable.ColumnE = #temptable.ColumnE
and
desttable.ColumnF = #temptable.ColumnF
)
When Not Matched Then Insert Values (.......)
-- ColumnA: 167 unique values in #temptable
-- ColumnB: 1 unique values in #temptable
-- ColumnC: 13 unique values in #temptable
-- ColumnD: 89 unique values in #temptable
-- ColumnE: 550 unique values in #temptable
-- ColumnF: 487 unique values in #temptable
-- ColumnA: 3690 unique values in desttable
-- ColumnB: 3 unique values (plus null is possible) in desttable
-- ColumnC: 1113 unique values in desttable
-- ColumnD: 2662 unique values in desttable
-- ColumnE: 1770 unique values in desttable
-- ColumnF: 1480 unique values in desttable
The merge right now takes a very, very long time. I think I need to change my primary key but am not sure what the best tactic might be. 26,000 rows can be inserted on the first merge, but subsequent merges might only have ~2,000 inserts to do. Since I have no indexes and only a simple PK, everything is slow. :)
Can anyone point out how to make this better?
Thanks!
Well, an obvious candidate would be an index on the columns you use to do your matching in the MERGE statement - do you have an index on (ColumnA, ColumnB, ColumnC, ColumnD, ColumnE, ColumnF) on your destination table?
This tuple of columns is being used to determine whether or not a row from your source table already exists in the database. If you have neither that index nor any other usable index in place, you basically get a scan of the large destination table for each row in your source table.
If not, I would try adding it and then see how the runtime behavior changes. Does the MERGE now run in a little less than a very, very long time?
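For reference, a sketch of the suggested composite index (the index name is made up; adjust it to your actual schema):
CREATE NONCLUSTERED INDEX IX_desttable_match
ON dbo.desttable (ColumnA, ColumnB, ColumnC, ColumnD, ColumnE, ColumnF);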
My suggestion: if you only need to run it once, then the MERGE statement is acceptable, provided time is not that critical. But if you're going to use the script more often, I think it will be better to do it step by step instead of using the MERGE statement - that is, write your own SELECT, INSERT, UPDATE, and DELETE statements to attain the goal. With this you'll have more control over almost everything (query optimization, indexing, etc.).
In your case, separating the six matching criteria will probably be more efficient than combining them all at once; the downside is a longer script. A rough sketch of this approach follows.
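Here is a hedged sketch of that step-by-step alternative for the insert-only case, using the table and column names from the question as an INSERT ... WHERE NOT EXISTS in place of the MERGE (adjust the insert column list to your full set of columns):
-- Like the MERGE's ON clause, this treats NULLs as non-matching
INSERT INTO desttable (ColumnA, ColumnB, ColumnC, ColumnD, ColumnE, ColumnF)
SELECT t.ColumnA, t.ColumnB, t.ColumnC, t.ColumnD, t.ColumnE, t.ColumnF
FROM #temptable AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM desttable AS d
    WHERE d.ColumnA = t.ColumnA
      AND d.ColumnB = t.ColumnB
      AND d.ColumnC = t.ColumnC
      AND d.ColumnD = t.ColumnD
      AND d.ColumnE = t.ColumnE
      AND d.ColumnF = t.ColumnF
);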