SQL Server Merge and Indexing Speed - sql-server

I have a merge statement that needs to compare on many columns. The source table has 26,000 rows. The destination table has several million rows. The destination table only has a typical Primary Key index on an int-type column.
I did some selects with group by to count the number of unique values in the source.
The test part of the Merge is
Merge Into desttable
Using #temptable
On
(
desttable.ColumnA = #temptable.ColumnA
and
desttable.ColumnB = #temptable.ColumnB
and
desttable.ColumnC = #temptable.ColumnC
and
desttable.ColumnD = #temptable.ColumnD
and
desttable.ColumnE = #temptable.ColumnE
and
desttable.ColumnF = #temptable.ColumnF
)
When Not Matched Then Insert Values (.......)
-- ColumnA: 167 unique values in #temptable
-- ColumnB: 1 unique value in #temptable
-- ColumnC: 13 unique values in #temptable
-- ColumnD: 89 unique values in #temptable
-- ColumnE: 550 unique values in #temptable
-- ColumnF: 487 unique values in #temptable
-- ColumnA: 3690 unique values in desttable
-- ColumnB: 3 unique values (plus null is possible) in desttable
-- ColumnC: 1113 unique values in desttable
-- ColumnD: 2662 unique values in desttable
-- ColumnE: 1770 unique values in desttable
-- ColumnF: 1480 unique values in desttable
The merge right now takes a very, very long time. I think I need to change my primary key but am not sure what the best tactic might be. 26,000 rows can be inserted on the first merge, but subsequent merges might only have ~2,000 inserts to do. Since I have no indexes and only a simple PK, everything is slow. :)
Can anyone point out how to make this better?
Thanks!

Well, an obvious candidate would be an index on the columns you use to do your matching in the MERGE statement - do you have an index on (ColumnA, ColumnB, ColumnC, ColumnD, ColumnE, ColumnF) on your destination table?
This tuple of columns is being used to determine whether or not a row from your source table already exists in the database. If you don't have that index, or any other usable index in place, you basically get a table scan on the large destination table for each row in your source table.
If not, I would try adding it and then see how the runtime behavior changes. Does the MERGE now run in a little less than a very, very long time?
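As a minimal sketch (the index name and the dbo schema are assumptions; the column list mirrors the ON clause of the MERGE above):

-- nonclustered index covering the MERGE match columns
CREATE NONCLUSTERED INDEX IX_desttable_MatchColumns
ON dbo.desttable (ColumnA, ColumnB, ColumnC, ColumnD, ColumnE, ColumnF);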

My suggestion: if you only need to run it once, the MERGE statement is acceptable, provided time is not that critical. But if you're going to use the script more often, it may be better to do the work step by step instead of using MERGE - that is, write your own SELECT, INSERT, UPDATE, and DELETE statements to reach the same goal. That way you have more control over almost everything (query optimization, indexing, etc.).
In your case, handling the six match criteria separately might be more efficient than combining them all at once, as in the sketch below. The downside is that you'll have a longer script.
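For the insert-only case shown above, a step-by-step equivalent might look like this sketch. It assumes, purely for illustration, that ColumnA through ColumnF are the only columns that need populating; adjust the column lists to your real destination table.

-- sketch of an insert-only alternative to the MERGE above; table and match
-- columns are taken from the question, the insert column list is an assumption
INSERT INTO desttable (ColumnA, ColumnB, ColumnC, ColumnD, ColumnE, ColumnF)
SELECT t.ColumnA, t.ColumnB, t.ColumnC, t.ColumnD, t.ColumnE, t.ColumnF
FROM #temptable AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM desttable AS d
    WHERE d.ColumnA = t.ColumnA
      AND d.ColumnB = t.ColumnB
      AND d.ColumnC = t.ColumnC
      AND d.ColumnD = t.ColumnD
      AND d.ColumnE = t.ColumnE
      AND d.ColumnF = t.ColumnF
);

As with the MERGE's ON clause, rows with NULLs in any match column will never be treated as existing, so the semantics stay the same.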

Related

ORA-00001: Unique Constraint: Setting Primary Keys Manually

We have an Oracle database and a table where we store a lot of data.
This table has a primary key and usually those primary keys are just created upon insertion of a new row.
But now we need to manually insert data into this table with certain fixed primary keys. There is no way to change those primary keys.
So for example:
Our table has already 20 entries with the primary keys 1 to 20.
Now we need to add data manually with the primary keys 21 to 23.
When someone wants to enter a row using our standard approach, the insert process will fail because of:
Caused by: java.sql.BatchUpdateException: ORA-00001: unique constraint (VDMA.SYS_C0013552) violated
at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10500)
at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:230)
at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:70)
at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:268)
I totally understand this: The database routine (sequence) that is creating the next primary key fails because the next primary key is already taken.
But: How do I tell my sequence to look at the table again and to realize that the next primary key is 24 and not 21 ?
UPDATE
The reason the IDs need to stay the same is that the records are accessed through a web interface via links that contain the ID.
So either we change the implementation mapping the old IDs to new IDs or we keep the IDs in the database.
UPDATE2
Found a solution: since we are using Hibernate, only one sequence populates all the tables. The primary keys therefore climbed so high during the four days I was looking for an answer that I can now safely import all the data.
How do I tell my sequence to look at the table again and to realize that the next primary key is 24 and not 21 ?
In Oracle, a sequence doesn't know that you intend to use it for any particular table. All the sequence knows is its current value, its increment, its maxval and so on. So you can't tell the sequence to look at a table, but you can tell your stored procedure to check the table and then increment the sequence beyond the maximum value of the primary key. In other words, if you really insist on manually inserting non-sequence values into the primary key, then your code needs to check for non-sequence values in the PK and bring the sequence up to speed before it uses the sequence to generate a new PK.
Here is something simple you can use to bring the sequence up to where it needs to be:
select testseq.nextval from dual;
Each time you run it, the sequence increments by 1. Stick it in a loop and run it until testseq.currval is where you need it to be.
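A rough sketch of that idea as an anonymous PL/SQL block, using the illustrative testseq/testtab/testpk names from the example below:

-- advance testseq until it passes the highest existing testpk value
DECLARE
  v_max  NUMBER;
  v_next NUMBER;
BEGIN
  SELECT NVL(MAX(testpk), 0) INTO v_max FROM testtab;
  LOOP
    SELECT testseq.NEXTVAL INTO v_next FROM dual;
    EXIT WHEN v_next >= v_max;
  END LOOP;
END;
/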
Having said that, I agree with @a_horse_with_no_name and @EdStevens. If you have to insert rows manually, at least use sequence_name.nextval in the insert instead of a literal like '21'. Like this:
create table testtab (testpk number primary key, testval number);
create sequence testseq start with 1 increment by 1;
insert into testtab values (testseq.nextval, '12');
insert into testtab values (testseq.nextval, '123');
insert into testtab values (testseq.nextval, '1234');
insert into testtab values (testseq.nextval, '12345');
insert into testtab values (testseq.nextval, '123456');
select * from testtab;
testpk testval
2 12
3 123
4 1234
5 12345
6 123456

Resetting Primary Key without deleting/truncating the table

I have a table with a primary key. Now, for a reason I don't know, when I am inserting data it is being loaded like this:
Pk_Col Some_Other_Col
1 A
2 B
3 C
1002 D
1003 E
1901 F
Is there any way I can reset my table like below, without deleting/truncating the table?
Pk_Col Some_Other_Col
1 A
2 B
3 C
4 D
5 E
6 F
You can't update the IDENTITY column so DELETE/INSERT is the only way. You can reseed the IDENTITY column and recreate the data, like this:
DBCC CHECKIDENT ('dbo.tbl',RESEED,0);
INSERT INTO dbo.tbl (Some_Other_Col)
SELECT Some_Other_Col
FROM (DELETE FROM tbl OUTPUT deleted.*) d;
That assumes there are no foreign keys referencing this data.
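As an aside (not part of the original answer), you can sanity-check the identity seed afterwards without changing it:

-- reports the current identity value and the current maximum column value
DBCC CHECKIDENT ('dbo.tbl', NORESEED);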
If you really, really want to have neat identity values you can write a cursor (slow but maintainable) or investigate any number of "how can I find gaps in my sequence" questions on SO and perform an UPDATE accordingly (runs faster but tricky to get right). This becomes exponentially harder when you start having foreign keys pointing back to this table. Be prepared to re-run this script any time data is put into, or removed from, this table.
Edit: IDENTITY columns cannot be updated per se. You can, however, SET IDENTITY_INSERT dbo.MyTable ON;, INSERT a row with the desired IDENTITY value plus the values from the other columns of an existing row, then DELETE the existing row. The net effect on the data is the same as an UPDATE.
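A minimal sketch of that swap, using dbo.MyTable from the sentence above and the Pk_Col / Some_Other_Col names from the question (illustrative only):

-- "move" the row with Pk_Col = 1002 to Pk_Col = 4 via INSERT + DELETE
SET IDENTITY_INSERT dbo.MyTable ON;

INSERT INTO dbo.MyTable (Pk_Col, Some_Other_Col)
SELECT 4, Some_Other_Col
FROM dbo.MyTable
WHERE Pk_Col = 1002;

DELETE FROM dbo.MyTable WHERE Pk_Col = 1002;

SET IDENTITY_INSERT dbo.MyTable OFF;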
The only sensible reason to do this is if your table has about two billion rows and you're about to run out of integers for your identity column. If that's the case you have a whole world of other stuff to worry about, too.
But seriously - listen to @Damien, don't worry about it.
ALTER TABLE #temp1 DROP CONSTRAINT PK_Id;
ALTER TABLE #temp1 DROP COLUMN Id;
ALTER TABLE #temp1 ADD Id int IDENTITY(1,1);
Try this one.
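Note that this drops the PK_Id constraint but never recreates it. If the new Id column should be the primary key again, you would follow up with something like the line below (assuming you want to keep the original constraint name):

-- the identity values just assigned to existing rows follow no guaranteed order
ALTER TABLE #temp1 ADD CONSTRAINT PK_Id PRIMARY KEY (Id);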

High Sort Cost on Merge Operation

I am using the MERGE feature to insert data into a table using a bulk import table as source. (as described here)
This is my query:
DECLARE @InsertMapping TABLE (BulkId int, TargetId int);
MERGE dbo.Target T
USING dbo.Source S
ON 0=1 WHEN NOT MATCHED THEN
INSERT (Data) VALUES (Data)
OUTPUT S.Id BulkId, inserted.Id INTO @InsertMapping;
When evaluating the performance by displaying the actual execution plan, I saw that there is a high-cost sort done on the primary key index. I don't get it: the primary key should already be sorted ascending, so there shouldn't be a need for additional sorting.
Because of this sort cost the query takes several seconds to complete. Is there a way to speed up the inserting? Maybe some index hinting or additional indices? Such an insert shouldn't take that long, even if there are several thousand entries.
I can reproduce this issue with the following:
CREATE TABLE dbo.TargetTable(Id int IDENTITY PRIMARY KEY, Value INT)
CREATE TABLE dbo.BulkTable(Id int IDENTITY PRIMARY KEY, Value INT)
INSERT INTO dbo.BulkTable
SELECT TOP (1000000) 1
FROM sys.all_objects o1, sys.all_objects o2
DECLARE @TargetTableMapping TABLE (BulkId INT, TargetId INT);
MERGE dbo.TargetTable T
USING dbo.BulkTable S
ON 0 = 1
WHEN NOT MATCHED THEN
INSERT (Value)
VALUES (Value)
OUTPUT S.Id AS BulkId,
inserted.Id AS TargetId
INTO @TargetTableMapping;
This gives a plan with a sort before the clustered index merge operator.
The sort is on Expr1011, Action1010, which are both computed columns output from previous operators.
Expr1011 is the result of calling the internal and undocumented function getconditionalidentity to produce an id column for the identity column in TargetTable.
Action1010 is a flag indicating insert, update, delete. It is always 4 in this case as the only action this MERGE statement can perform is INSERT.
The reason the sort is in the plan is because the clustered index merge operator has the DMLRequestSort property set.
The DMLRequestSort property is set based on the number of rows expected to be inserted. Paul White explains in the comments here
[DMLRequestSort] was added to support the ability to minimally-log
INSERT statements in 2008. One of the preconditions for minimal
logging is that the rows are presented to the Insert operator in
clustered key order.
Inserting into tables in clustered index key order can be more efficient anyway as it reduces random IO and fragmentation.
If the function getconditionalidentity returns the generated identity values in ascending order (as would seem reasonable), then the input to the sort will already be in the desired order and the sort in the plan is logically redundant. (There was previously a similar issue with unnecessary sorts with NEWSEQUENTIALID.)
It is possible to get rid of the sort by making the expression a bit more opaque.
DECLARE @TargetTableMapping TABLE (BulkId INT, TargetId INT);
DECLARE @N BIGINT = 0x7FFFFFFFFFFFFFFF;
MERGE dbo.TargetTable T
USING (SELECT TOP(@N) * FROM dbo.BulkTable) S
ON 1=0
WHEN NOT MATCHED THEN
INSERT (Value)
VALUES (Value)
OUTPUT S.Id AS BulkId,
inserted.Id AS TargetId
INTO @TargetTableMapping;
This reduces the estimated row count and the plan no longer has a sort. You will need to test whether or not this actually improves performance though. Possibly it might make things worse.

How to emulate a BEFORE INSERT trigger in T-SQL / SQL Server for super/subtype (Inheritance) entities? [duplicate]

This is on Azure.
I have a supertype entity and several subtype entities, the latter of which need to obtain their foreign keys from the primary key of the supertype entity on each insert. In Oracle, I use a BEFORE INSERT trigger to accomplish this. How would one accomplish this in SQL Server / T-SQL?
DDL
CREATE TABLE super (
super_id int IDENTITY(1,1)
,subtype_discriminator char(4) CHECK (subtype_discriminator IN ('SUB1', 'SUB2'))
,CONSTRAINT super_id_pk PRIMARY KEY (super_id)
);
CREATE TABLE sub1 (
sub_id int IDENTITY(1,1)
,super_id int NOT NULL
,CONSTRAINT sub_id_pk PRIMARY KEY (sub_id)
,CONSTRAINT sub_super_id_fk FOREIGN KEY (super_id) REFERENCES super (super_id)
);
I wish for an insert into sub1 to fire a trigger that actually inserts a value into super and uses the super_id generated to put into sub1.
In Oracle, this would be accomplished by the following:
CREATE TRIGGER sub_trg
BEFORE INSERT ON sub1
FOR EACH ROW
DECLARE
v_super_id int; -- Ignore the fact that I could have used super_id_seq.CURRVAL
BEGIN
INSERT INTO super (super_id, subtype_discriminator)
VALUES (super_id_seq.NEXTVAL, 'SUB1')
RETURNING super_id INTO v_super_id;
:NEW.super_id := v_super_id;
END;
Please advise on how I would simulate this in T-SQL, given that T-SQL lacks the BEFORE INSERT capability.
Sometimes a BEFORE trigger can be replaced with an AFTER one, but this doesn't appear to be the case in your situation, for you clearly need to provide a value before the insert takes place. So, for that purpose, the closest functionality would seem to be the INSTEAD OF trigger, as @marc_s has suggested in his comment.
Note, however, that, as the names of these two trigger types suggest, there is a fundamental difference between a BEFORE trigger and an INSTEAD OF one. In both cases the trigger executes before the action of the invoking statement has taken place, but with an INSTEAD OF trigger that action is never supposed to take place at all: whatever really needs to happen must be done by the trigger itself. This is very unlike a BEFORE trigger, where the statement always goes on to execute unless, of course, you explicitly roll it back.
But there's one other issue to address. As your Oracle script shows, the trigger you need to convert uses another feature unsupported by SQL Server: FOR EACH ROW. There are no per-row triggers in SQL Server, only per-statement ones, which means you always need to keep in mind that the inserted data is a row set, not just a single row. That adds some complexity, although it probably concludes the list of things you need to account for.
So, it's really two things to solve then:
replace the BEFORE functionality;
replace the FOR EACH ROW functionality.
My attempt at solving these is below:
CREATE TRIGGER sub_trg
ON sub1
INSTEAD OF INSERT
AS
BEGIN
DECLARE @new_super TABLE (
super_id int
);
INSERT INTO super (subtype_discriminator)
OUTPUT INSERTED.super_id INTO @new_super (super_id)
SELECT 'SUB1' FROM INSERTED;
INSERT INTO sub1 (super_id)
SELECT super_id FROM @new_super;
END;
This is how the above works:
The same number of rows as is being inserted into sub1 is first added to super. The generated super_id values are stored in temporary storage (a table variable called @new_super).
The newly inserted super_ids are now inserted into sub1.
Nothing too difficult really, but the above will only work if you have no other columns in sub1 than those you've specified in your question. If there are other columns, the above trigger will need to be a bit more complex.
The problem is to assign the new super_ids to every inserted row individually. One way to implement the mapping could be like below:
CREATE TRIGGER sub_trg
ON sub1
INSTEAD OF INSERT
AS
BEGIN
DECLARE @new_super TABLE (
rownum int IDENTITY (1, 1),
super_id int
);
INSERT INTO super (subtype_discriminator)
OUTPUT INSERTED.super_id INTO @new_super (super_id)
SELECT 'SUB1' FROM INSERTED;
WITH enumerated AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rownum
FROM inserted
)
INSERT INTO sub1 (super_id, other columns)
SELECT n.super_id, i.other columns
FROM enumerated AS i
INNER JOIN @new_super AS n
ON i.rownum = n.rownum;
END;
As you can see, an IDENTITY(1,1) column is added to @new_super, so the temporarily stored super_id values will additionally be enumerated starting from 1. To provide the mapping between the new super_ids and the new data rows, the ROW_NUMBER function is used to enumerate the INSERTED rows as well. As a result, every row in the INSERTED set can now be linked to a single super_id and thus complemented into a full data row to be inserted into sub1.
Note that the order in which the new super_ids are inserted may not match the order in which they are assigned. I consider that a non-issue: all the new super rows generated are identical save for the IDs, so all you need here is to take one new super_id per new sub1 row.
If, however, the logic of inserting into super is more complex and for some reason you need to remember precisely which new super_id has been generated for which new sub row, you'll probably want to consider the mapping method discussed in this Stack Overflow question:
Using merge..output to get mapping between source.id and target.id
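For reference, that trick relies on MERGE (unlike a plain INSERT) allowing the OUTPUT clause to reference source columns. Inside the INSTEAD OF trigger it might look roughly like this sketch; the @map table variable and the rownum column are assumptions, not part of the linked answer:

-- capture a per-row mapping between the trigger's inserted rows and new super_ids
DECLARE @map TABLE (rownum int, super_id int);

MERGE super AS t
USING (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rownum, *
       FROM inserted) AS s
ON 1 = 0
WHEN NOT MATCHED THEN
    INSERT (subtype_discriminator) VALUES ('SUB1')
OUTPUT s.rownum, inserted.super_id INTO @map (rownum, super_id);

Joining @map back to the same ROW_NUMBER enumeration of the inserted rows then tells you exactly which super_id was generated for which source row.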
While Andriy's proposal will work well for INSERTs of a small number of records, full table scans will be done on the final join, as both 'enumerated' and @new_super are not indexed, resulting in poor performance for large inserts.
This can be resolved by specifying a primary key on the #new_super table, as follows:
DECLARE @new_super TABLE (
row_num INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
super_id int
);
This will result in the SQL optimizer scanning through the 'enumerated' table but doing an indexed join on @new_super to get the new key.

Creating a Primary Key on a temp table - When?

I have a stored procedure that is working with a large amount of data. I have that data being inserted in to a temp table. The overall flow of events is something like
CREATE TABLE #TempTable (
Col1 NUMERIC(18,0) NOT NULL --This will not be an identity column.
,Col2 INT NOT NULL
,Col3 BIGINT
,Col4 VARCHAR(25) NOT NULL
--Etc...
--
--Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is: when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data, because the index needs to be reorganized as the primary key info is being created. But I realized that my underlying assumption might be wrong...
In case it is relevant, the data types shown are the real ones. In the #TempTable table, Col1 and Col4 will be making up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.
Though, this aside, my question still stands: which is faster, assuming both would succeed?
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation: presumably there is a best index for reading the data from it in the next step, and that index will need to either be in place already or be created. This is where you have to trade the speed of this load against reliability (apply the PK and any other constraints first) and against speed later (have at least the clustered index in place if you are going to have one).
If the recovery model of your database is set to SIMPLE or BULK_LOGGED, SELECT ... INTO ... UNION ALL may be the fastest solution, because SELECT ... INTO is a bulk operation and bulk operations are minimally logged.
eg:
-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...
-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)
-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- CLUSTERED key word is optional, primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField)
Otherwise, Cade Roux had good advice re: before or after.
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.
But maybe Sql Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
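A rough way to measure it yourself (a sketch with made-up temp tables and a synthetic number source; substitute your real load):

SET NOCOUNT ON;
SET STATISTICS TIME ON;

-- variant A: primary key declared before the load
CREATE TABLE #A (Col1 int NOT NULL PRIMARY KEY);
INSERT INTO #A (Col1)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects o1 CROSS JOIN sys.all_objects o2;

-- variant B: load a heap first, then add the primary key
CREATE TABLE #B (Col1 int NOT NULL);
INSERT INTO #B (Col1)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects o1 CROSS JOIN sys.all_objects o2;
ALTER TABLE #B ADD PRIMARY KEY (Col1);

SET STATISTICS TIME OFF;
DROP TABLE #A, #B;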
I was wondering whether I could improve a very, very expensive stored procedure that runs a bunch of checks on each insert across tables, and I came across this answer. In that proc, several temp tables are opened and reference each other. I added the primary key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS to insert data and ensure uniqueness) and my execution time was cut down severely. I highly recommend using the primary keys. Always at least try it out, even when you think you don't need it.
I don't think it makes any significant difference in your case:
either you pay the penalty a little bit at a time, with each single insert
or you'll pay a larger penalty after all the inserts are done, but only once
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...
My understanding is a PK is a unique index and when you insert each record, your index is updated and optimized. So ... if you add the data first, then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
When you add the PK at table creation, the insert checks amount to Tn operations in total (where Tn is the n-th triangular number, 1 + 2 + 3 + ... + n), because when you insert the x-th row it is checked against the previously inserted x - 1 rows.
When you add the PK after inserting all the values, the check is n^2 operations, because when you insert the x-th row it is checked against all n existing rows.
The first one is obviously faster, since Tn is less than n^2.
P.S. Example: if you insert 5 rows it is 1 + 2 + 3 + 4 + 5 = 15 operations vs 5^2 = 25 operations
