SQL Server INSERT, Scope_Identity() and physical writing to disc - sql-server

I have a stored procedure that, among other things, performs inserts into different tables inside a loop. See the example below for a clearer picture:
INSERT INTO T1 VALUES ('something')
SET @MyID = Scope_Identity()
... some stuff goes here
INSERT INTO T2 VALUES (@MyID, 'something else')
... The rest of the procedure
These two tables (T1 and T2) each have an IDENTITY(1, 1) column; let's call them ID1 and ID2. Normally, for each record inserted into T1 there is a record inserted into T2, and the identity columns in both tables increment in step. However, after running the procedure in our (very busy) production database, with more than 6250 records in each table, I noticed one incident where ID1 does not match ID2!
The "wrong" records looked something like this:
ID1 Col1
---- ---------
4709 data-4709
4710 data-4710
ID2 ID1 Col1
---- ---- ---------
4709 4710 data-4710
4710 4709 data-4709
Note the "inverted" ID1 in the second table.
Not knowing much about how SQL Server operates under the hood, I have come up with the following "theory"; maybe someone can correct me on this.
What I think is that because the loop is faster than physically writing to the table, and/or something else delayed the writing process, the records were buffered. When the time came to write them, they were written in no particular order.
Is that even possible? If not, how can the scenario above be explained?
If yes, then I have another question to raise. What if the first insert (from the code above) got delayed? Doesn't that mean I won't get the correct IDENTITY to insert into the second table? If the answer to this is also yes, what can I do to ensure the inserts into the two tables happen in sequence with the correct IDENTITY?
I appreciate any comments and information that help me understand this.
Thanks in advance.

There is no way you can rely on IDENTITY to solve this for your second table. If you care about the generated primary key value for that row, you should generate it yourself.
IDENTITY is a way of saying "I don't want the hassle of generating a key myself, just do it for me, and I'll ask for the generated value if and when I need it".
What could be happening here is that two threads are inserting rows at the same time and neither of them has committed yet, so you get this scenario:
Thread 1                        Thread 2
get id for table 1 = 4709
                                get id for table 1 = 4710
insert row for table 1
                                insert row for table 1
                                get id for table 2 = 4709
get id for table 2 = 4710
insert row for table 2
                                insert row for table 2
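The interleaving above can be replayed with a small simulation. This is only a sketch in plain Python, not SQL Server: the two `count` objects stand in for the IDENTITY counters of T1 and T2, the session names "A" and "B" are made up, and the schedule is an assumed ordering that happens to reproduce the rows from the question.

```python
from itertools import count


def run_schedule(schedule, start=4709):
    """Replay an interleaving of sessions against two identity counters."""
    next_id1 = count(start)  # stands in for T1's IDENTITY counter
    next_id2 = count(start)  # stands in for T2's IDENTITY counter
    my_id = {}               # per-session variable, like @MyID
    t1_rows, t2_rows = [], []
    for session, step in schedule:
        if step == "insert_t1":
            my_id[session] = next(next_id1)
            t1_rows.append((my_id[session], f"data-{my_id[session]}"))
        elif step == "insert_t2":
            id2 = next(next_id2)
            t2_rows.append((id2, my_id[session], f"data-{my_id[session]}"))
        else:
            raise ValueError(f"unknown step: {step}")
    return t1_rows, t2_rows


# Assumed schedule: session B overtakes session A between the two inserts.
SCHEDULE = [
    ("A", "insert_t1"),
    ("B", "insert_t1"),
    ("B", "insert_t2"),
    ("A", "insert_t2"),
]

T1_ROWS, T2_ROWS = run_schedule(SCHEDULE)
for row in T2_ROWS:
    print(row)
```

Each session still stores its own correct ID1 in T2; only the assumption that ID2 equals ID1 breaks.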
You have two ways to solve your problem:
1. Remove IDENTITY from the primary key in the second table
2. Use SET IDENTITY_INSERT ... ON for that table so you can provide the key yourself, while keeping the IDENTITY setting
In this case, however, I would use method number 1. Method number 2 is usually used when importing data into an empty table. You don't want to risk the database auto-generating an ID that you later want to use yourself (since it comes from the first table), so you should disable the IDENTITY setting on the primary key of the second table.
Or you could try to avoid relying on the key for that table at all. Since you have a foreign key reference, do you really need the key values to be the same?

Of course your above scenario is possible - and quite likely, too.
If you have two separate, independent tables, both being used for queries and inserts, both with a separate IDENTITY(1,1) field, there's absolutely no guarantee that an insert into one table and then into the second will be executed in the same order!
If you do need to establish a link between the two, insert the first table's ID into the second table as a foreign key. You cannot rely on the IDs generated by the IDENTITY columns being the same in both tables!
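As a rough illustration of linking through a foreign key instead of matching identity values, here is a sketch that uses SQLite from Python as a stand-in for SQL Server. `cursor.lastrowid` plays the role that Scope_Identity() plays in the question, and the helper name `insert_pair` is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (ID1 INTEGER PRIMARY KEY AUTOINCREMENT, Col1 TEXT);
    CREATE TABLE T2 (ID2 INTEGER PRIMARY KEY AUTOINCREMENT,
                     ID1 INTEGER NOT NULL REFERENCES T1(ID1),
                     Col1 TEXT);
""")


def insert_pair(conn, value):
    """Insert one row into T1 and one dependent row into T2 in one transaction."""
    with conn:
        cur = conn.execute("INSERT INTO T1 (Col1) VALUES (?)", (value,))
        id1 = cur.lastrowid  # key generated for the T1 row just inserted
        conn.execute("INSERT INTO T2 (ID1, Col1) VALUES (?, ?)", (id1, value))
    return id1


first = insert_pair(conn, "a")
# An extra T2 row pushes ID2 ahead of ID1, as concurrency might.
with conn:
    conn.execute("INSERT INTO T2 (ID1, Col1) VALUES (?, ?)", (first, "a-extra"))
second = insert_pair(conn, "b")

# The join goes through T2.ID1, so it stays correct although ID2 != ID1.
LINKED = conn.execute(
    "SELECT T2.ID2, T1.ID1, T1.Col1 FROM T2 JOIN T1 ON T1.ID1 = T2.ID1 "
    "WHERE T2.Col1 = 'b'"
).fetchall()
print(LINKED)
```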

Regarding writing:
Whenever you do something that changes data, this is written to the database LOGS at that moment, and you don't get a transaction confirmation until this has happened. That is the D in the ACID conditions (database theory).
Dirty database pages are written to disk "in the background". If too many are dirty, a checkpoint is triggered and they are all dumped out.
So much for the writing part.
What you probably ran into is simply the fact that while individual statements are atomic, a busy database quite possibly has more than one thread running against it. So, basically, a thread switch happened between the statements. One thread got ID1, then another one got priority and obtained its ID1 and ID2, and only then did the first one get its ID2.
Nothing specific here ;) Typical normal database behavior when multiple threads run alongside each other. Nothing to do with writing per se.
Basically, between
SET @MyID = Scope_Identity()
and the next statement, another thread can get priority ;)

Do not rely on the actual values of identity columns for business/application logic; you can only assume that they will be unique!

You should be able to avoid this issue by using a SQL Server 2005 feature, the OUTPUT clause. Link below.
http://msdn.microsoft.com/en-us/library/ms177564.aspx

This is a known bug in SQL Server.
The problem is that when the generated query plan is parallelised, the scope identity can come back incorrect.
Move that part into its own procedure: pass in the params and return the scope identity. Then it should be correct.
If I remember rightly, this only manifests on tables with around a million rows or more.
Aha, here's the KB: http://support.microsoft.com/default.aspx?scid=kb;en-us;2019779&sd=rss&spid=2855

Related

How to store id of a record instead of text in parent table

I have to get input from users regarding their skills. I have a table for skills which has id as its primary key. In another table I am storing user id and skill id as a many-to-many relationship. Now the problem is: how do I know that the skill entered by the user is already in my skills table? I have to put the id of the skill into the many-to-many relationship table. Do I run a select statement each time, or is there a more efficient solution available? Thanks,
how do I know that the skill entered by the user is already in my skills table?
In a concurrent environment, there is no way for you to know that. Even if you do the SELECT, that only tells you whether row existed at the time of the SELECT execution - it doesn't tell you whether the row exists now. For example, even if the SELECT returned an empty result, a concurrent transaction might have inserted the row within the few milliseconds that it took for you to receive the SELECT result.
So you either do a drastic reduction in concurrency (e.g. through table locks), or learn to live with it...
When just the INSERT is needed
I'd recommend you simply attempt the INSERT (without SELECT) and then ignore the possible PRIMARY KEY violation [1].
If you did the separate SELECT and INSERT steps, you'd still have to be prepared for PK violations, since a concurrent transaction could perform the INSERT (and commit) after your SELECT but before your INSERT. So why bother with the SELECT in the first place?
When INSERT or UPDATE is needed
If the junction table contains other fields in addition to FK fields, then you might want to update them to new values, so you'd have to first perform the SELECT to determine if the row needs inserting or updating.
In such a case, consider locking the row early using SELECT ... FOR UPDATE (or equivalent syntax) [2]. Alternatively, some DBMSes offer "insert or update" (aka "UPSERT") in a single command (e.g.: MySQL INSERT ... ON DUPLICATE KEY UPDATE).
[1] But be careful to only ignore PK violations - don't just blindly "swallow" FK or CHECK violations etc...
[2] To avoid it being deleted before you had a chance to UPDATE it (INSERT PK violation is still possible). Even worse, a concurrent transaction could UPDATE the row, leaving your transaction to silently overwrite other transaction's values, without being aware they were ever there.
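The "just INSERT and ignore the duplicate" approach could look roughly like the sketch below. It uses SQLite from Python as a stand-in, and the table `UserSkills`, its CHECK constraint, and the helper `add_user_skill` are all invented for illustration. In SQLite a primary key clash is reported as an `IntegrityError` whose message starts with "UNIQUE constraint failed"; other databases expose this differently, so the error test would need adapting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE UserSkills (
        user_id  INTEGER NOT NULL,
        skill_id INTEGER NOT NULL CHECK (skill_id > 0),
        PRIMARY KEY (user_id, skill_id)
    )
""")


def add_user_skill(conn, user_id, skill_id):
    """Return True if the row was inserted, False if it already existed."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO UserSkills (user_id, skill_id) VALUES (?, ?)",
                (user_id, skill_id),
            )
        return True
    except sqlite3.IntegrityError as exc:
        # Swallow only the duplicate-key case; re-raise CHECK/FK failures.
        if str(exc).startswith("UNIQUE constraint failed"):
            return False
        raise


print(add_user_skill(conn, 1, 2))  # first attempt inserts the row
print(add_user_skill(conn, 1, 2))  # second attempt hits the primary key
```

No SELECT is issued beforehand; the primary key itself decides whether the row is new.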
There is no one-step solution to this problem. First you need to check whether the skill exists in the Skills table and get its ID, or insert it and get the ID if it did not exist. Then insert a row into your PeopleSkills table with the IDs of the person and the skill...

Id of object before insertion into database (Linq to SQL)

From what I gather, Linq to SQL doesn't actually execute any database commands (including the opening of database connections) until the SubmitChanges() method is called. If this is the case, I would like to increase the efficiency of a few methods. Is it possible for me to retrieve the ID of an object before inserting it? I'd rather not call SubmitChanges() twice, if it's possible for me to know the value of the ID before it's actually inserted into the database. From a logical point of view, it would only make sense to have to open a connection to the database in order to find out the value, but does an insertion procedure also have to take place?
Thanks
The usual technique to solve this is to generate a unique identifier in the application layer (such as a GUID) and use this as the ID. That way you do not have to retrieve the ID on a subsequent call.
Of course, using a GUID as a primary key can have its drawbacks. If you decide to go this way, look up COMB GUID.
Well, here is the problem: you somehow get an ID BEFORE inserting into the database and do some processing with it. At the same time another thread does the same and gets the same ID, and you've got a conflict.
I.e. I don't think there is an easy way of doing this.
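To make the COMB idea concrete, here is a sketch in Python rather than in the asker's .NET code. The assumption is the commonly described layout: ten random bytes followed by a six-byte millisecond timestamp, so that values generated later tend to sort later on the trailing bytes. The function name `comb_guid` and its `now_ms` parameter are invented for this example, and the result does not set the standard UUID version bits.

```python
import os
import time
import uuid


def comb_guid(now_ms=None):
    """Build a GUID whose last 6 bytes are a millisecond timestamp."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    random_part = os.urandom(10)  # keeps IDs unique within one millisecond
    time_part = (now_ms & 0xFFFFFFFFFFFF).to_bytes(6, "big")
    return uuid.UUID(bytes=random_part + time_part)


# The ID exists before any database call, so it can be used right away.
new_id = comb_guid()
print(new_id)
```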
I don't necessarily recommend this, but have seen it done. You can calculate your own ID as an integer using a stored procedure and a table to hold the value of the next ID. The stored procedure selects the value from the table to return it, then increments the value. The table will look something like the following
Create Table Keys(
name varchar(128) not null primary key,
nextID int not null
)
One thing to note before doing this: if you select and then update in two different batches, you risk key collisions. Both steps need to be treated as a single atomic transaction.
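A sketch of the key-table approach with both steps inside one transaction, using SQLite from Python as a stand-in for a stored procedure. The key name 'Invoice' and the helper `next_id` are made up. The UPDATE runs first so that the write lock is taken before the value is read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Keys (name VARCHAR(128) NOT NULL PRIMARY KEY, "
    "nextID INT NOT NULL)"
)
conn.execute("INSERT INTO Keys (name, nextID) VALUES ('Invoice', 1)")
conn.commit()


def next_id(conn, name):
    """Reserve and return the next ID for `name` in a single transaction."""
    with conn:  # commits on success, rolls back on any exception
        cur = conn.execute(
            "UPDATE Keys SET nextID = nextID + 1 WHERE name = ?", (name,)
        )
        if cur.rowcount != 1:
            raise KeyError(name)
        (reserved,) = conn.execute(
            "SELECT nextID - 1 FROM Keys WHERE name = ?", (name,)
        ).fetchone()
    return reserved


print(next_id(conn, "Invoice"))
print(next_id(conn, "Invoice"))
```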

SQL Server trigger - copy row before updating

I'd like to copy a table's row before updating and I'm trying to do it like this:
CREATE TRIGGER first_trigger_test
on Triggertest
FOR UPDATE
AS
insert into Triggertest select * from Inserted
Unfortunately, I get the error message
Msg 8101, Level 16, State 1, Procedure first_trigger_test, Line 6
An explicit value for the identity column in table 'Triggertest' can only be specified when a column list is used and IDENTITY_INSERT is ON.
I assume it's because of the id-column; can't I do something like 'except' id? I do not want to list all the columns in the trigger as it should be as dynamic as possible...
You can't, basically. You'll either have to specify the columns, or use a separate table:
CREATE TRIGGER first_trigger_test
on Triggertest
FOR UPDATE
AS
insert into Triggertest_audit select * from deleted
(where Triggertest_audit is a second table that looks like Triggertest, but without the primary key/identity/etc - commonly multiple rows per logical source row; note I assumed you actually wanted to copy the old values, not the new ones)
The problem happens because you are trying to set an identity column in Triggertest.
Is that your plan?
If you want to copy the new identity columns from INSERTED into Triggertest, then define the column in Triggertest without IDENTITY
If Triggertest has its own IDENTITY columns, use this:
insert into Triggertest (col1, col2, col3) select col1, col2, col3 from Inserted
After comment:
No, you can't without dynamic SQL to detect the table and find all non-identity columns.
However, if you add or remove columns you'll then have a mis-match between trigger table and Triggertest and you'll get a different error.
If you really want it that dynamic, you'd have to concat all columns into one or use XML to ignore schema.
Finally:
Do all your tables have exactly the same number of columns and datatypes and nullability as TriggerTest... because this is the assumption here...
If you want the table to be built each time the trigger runs, then you have no choice but to use the system tables to find the columns and create a table with those column definitions. Of course your first step will have to be to drop the existing table, or the trigger won't work the second time someone updates a record.
However, I think you need to rethink this process. Dropping a table and then creating a new one every time you change a record is a seriously bad idea. How is this table in any way useful when it may get wiped out and rebuilt every second or so?
What you might consider doing instead is creating a dynamic process that generates the CREATE TRIGGER scripts with the correct information for each table, but where the scripts themselves are not dynamic. Then your configuration people need to run this process every time table changes are made.
Remember it is critical for triggers to do two things: run as fast as humanly possible, and account for processing all the records in the batch (triggers should never do row-by-row processing or other slow processes, or assume only one row will be in the inserted or deleted tables). Dynamic SQL in a trigger is probably also a bad idea, as you can't test out all the possibilities beforehand and it can bring your whole production server to a screaming halt when some unexpected thing happens.
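One way to picture the "generate static trigger scripts" idea is a small generator that is run whenever a table changes. The sketch below is in Python and only builds the script text. The function names, the `_audit` and `_audit_trigger` naming scheme, and the column list passed in are assumptions for illustration; a real process would read the column list from the database catalog.

```python
def quote_name(name):
    """Bracket-quote a T-SQL identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def build_audit_trigger(table, columns):
    """Return a static CREATE TRIGGER script that copies old rows on UPDATE."""
    if not columns:
        raise ValueError("at least one column is required")
    col_list = ", ".join(quote_name(c) for c in columns)
    return (
        f"CREATE TRIGGER {quote_name(table + '_audit_trigger')}\n"
        f"ON {quote_name(table)}\n"
        "FOR UPDATE\n"
        "AS\n"
        # Set-based copy: handles every row in the batch, no row-by-row loop.
        f"INSERT INTO {quote_name(table + '_audit')} ({col_list})\n"
        f"SELECT {col_list} FROM deleted;\n"
    )


SCRIPT = build_audit_trigger("Triggertest", ["col1", "col2", "col3"])
print(SCRIPT)
```

The generated trigger contains an explicit column list, so nothing dynamic runs inside the trigger itself.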

Integration services and identity columns

I am a bit of an SSIS newbie and while the whole system seems straightforward, I don't conceptually understand the process I need to go through in this scenario:
Need to map Invoice and InvoiceLine tables from a source database to two equivalent tables in a destination database - with different identity values.
For each invoice inserted across, I need to get the identity it was assigned and then insert all its lines referencing that new identity
There is a surrogate key on the invoices (the invoice number), however these might also clash with invoice numbers in the target system, hence they would also have to be renumbered.
This must be a common scenario in integration - is there a common solution?
Chris KL - you are correct that this is harder than one would expect. I have three methods for this, which work in different situations:
1. If the data you are loading is small (hundreds or thousands, but not hundreds OF thousands), then you can do this: use an OLEDB command that performs one insert for each parent row and returns the identity value back; then, downstream from that, join its output to the child rows and insert them. Advantage: intuitive. Disadvantage: scales badly. This method is documented on the web and should be easy to find with Google.
2. If we are talking about a bigger system where you need bulk loading, then there are two other flavors:
a. If you have exclusive access to the table during the load (really exclusive, enforced in some way), then you can grab the max existing ID from the table, use an SSIS script task to number the rows starting above that max ID, then Set Identity Insert On, stuff them in, and Set Identity Insert Off. You then have those script-generated keys in SSIS to assign to the child rows. Advantage: fast and simple, one trip to the DB. Disadvantage: possible errors if some other process inserts into your table at the same time. Brittle.
b. If you don't have exclusive access, then the only way I know of is with a round trip to the DB, thus: insert all parent rows but keep track of a key for them that is not the identity column (a business key, for example). In a second dataflow, process the child records by using a Lookup transform that uses the business key to fetch the parent ID. Make sure the lookup is tuned appropriately with respect to caching, and that the business key is indexed.
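The business-key lookup in the last option can be pictured outside SSIS. The sketch below uses SQLite from Python purely as a stand-in. The sample rows, the column names, and the assumption that invoice numbers are unique in the destination are all invented for illustration.

```python
import sqlite3

# Source data: (source invoice id, business key) and (source invoice id, item).
source_invoices = [(10, "INV-A"), (11, "INV-B")]
source_lines = [(10, "widget"), (10, "gadget"), (11, "gizmo")]

dest = sqlite3.connect(":memory:")
dest.executescript("""
    CREATE TABLE Invoice (
        InvoiceID INTEGER PRIMARY KEY AUTOINCREMENT,
        InvoiceNumber TEXT NOT NULL UNIQUE);
    CREATE TABLE InvoiceLine (
        LineID INTEGER PRIMARY KEY AUTOINCREMENT,
        InvoiceID INTEGER NOT NULL REFERENCES Invoice(InvoiceID),
        Item TEXT NOT NULL);
    INSERT INTO Invoice (InvoiceNumber) VALUES ('EXISTING-1');
""")

with dest:
    # Pass 1: bulk-insert parents; the destination assigns new identities.
    dest.executemany(
        "INSERT INTO Invoice (InvoiceNumber) VALUES (?)",
        [(number,) for _, number in source_invoices],
    )
    # Lookup: business key -> identity assigned by the destination.
    new_id_by_number = dict(
        dest.execute("SELECT InvoiceNumber, InvoiceID FROM Invoice")
    )
    number_by_source_id = dict(source_invoices)
    # Pass 2: insert children with the parent's new identity.
    dest.executemany(
        "INSERT INTO InvoiceLine (InvoiceID, Item) VALUES (?, ?)",
        [
            (new_id_by_number[number_by_source_id[src_id]], item)
            for src_id, item in source_lines
        ],
    )

LOADED = dest.execute(
    "SELECT i.InvoiceNumber, l.Item FROM InvoiceLine l "
    "JOIN Invoice i ON i.InvoiceID = l.InvoiceID ORDER BY l.LineID"
).fetchall()
print(LOADED)
```

The source identity values (10 and 11) never reach the destination; only the business key carries the relationship across.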
OK, this is a good news / bad news situation I'm afraid. First the good news and a bit of background which you may know but I'll put it down in case you don't.
You generally can't insert anything into IDENTITY columns. Of course, like everything else in life there are times when you need to and that can be done with the IDENTITY_INSERT option.
SET IDENTITY_INSERT MyTable ON
INSERT INTO MyTable (
MyIdCol,
Etc…
)
SELECT SourceIdCol,
Etc…
FROM MySourceTable
SET IDENTITY_INSERT MyTable OFF
Now, you say that you have surrogate keys in the target but then you say that they may clash. So I'm a little confused… Are you using the keys from the source (e.g. IDENTITY columns) or are you generating new keys in the target? I would strongly advise against trying to merge the keyspaces in a single key column. If you need to retain the keys then I would suggest a multi-field key using something like SourceSystemId to keep them unique.
Finally the bad news: SSIS doesn't provide a simple means of using the IDENTITY_INSERT option. The only way I've been able to do it is by turning it on in a SQL task that executes before the insert task. You should be able to pass the table name into the script as a variable. Make sure to include another SQL task afterwards to turn it off, because you can only use it on one table at a time.

Identity SQL Server Problem

I have used Identity on the ID primary key.
And then I insert some data.
For example.
Data 1 -> Add Successful without error. ID 1
Data 2 -> Add Successful without error. ID 2
Data 3 -> Add Fail with error.
Data 4 -> Add Fail with error.
Data 5 -> Add Successful without error. ID 5
You can see that the ID has jumped from 2 to 5.
Why? How can I solve this?
Why would that be a problem ?
Normally, you'll use an identity in a primary key column. This primary key is then a surrogate key, which means that it has absolutely no business value / business meaning.
It is just an 'administrative' fact, which is necessary so that the database can uniquely identify a record.
So, it doesn't matter what this value is, and it also doesn't matter that there are gaps. Why do you want them to be consecutive?
And, suppose that they are consecutive -that no gaps appear when an insert fails- what would you do when you delete a row, and insert one later on ? Would you fill in the gaps as well ?
This is by design: SQL Server first increments the counter and then tries to create the row. If that fails, the transaction (there is always an implicit transaction) is rolled back, but the auto-increment value is not reused. I would be very surprised to see that this can be avoided (eventually you could call some command and reset the value to the current maximum). You can always use a trigger to generate these values, but this has performance implications. Usually you should not care about the value of auto_increment; it's just an integer. You would have the same situation later in your application if th
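The "increment first, then try to insert" behaviour described above can be modelled with a toy class. This is plain Python, not SQL Server, and the class name `IdentityTable` and the rule that a `None` value fails are invented for the example. The counter advances before the row is validated, so a failed insert leaves a gap.

```python
class IdentityTable:
    """Toy model: the identity value is consumed before the row is validated."""

    def __init__(self):
        self.next_identity = 1
        self.rows = {}

    def insert(self, value):
        new_id = self.next_identity
        self.next_identity += 1  # not undone if the insert fails below
        if value is None:        # stands in for any constraint violation
            raise ValueError("insert failed")
        self.rows[new_id] = value
        return new_id


table = IdentityTable()
for value in ["Data 1", "Data 2", None, None, "Data 5"]:
    try:
        table.insert(value)
    except ValueError:
        pass  # the failed insert still used up an identity value

print(sorted(table.rows))
```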
If an insert failed, you can, for the next insert, use set identity_insert mytable on and calculate the next identity by hand, using max(myfield)+1. You might have concurrency issues though.
But this is a kludge. There's nothing wrong with gaps.
@Frederik answered most of it -- I would just add that you are mixing up primary keys and business keys. An invoice (or whatever) should be identified by an invoice number -- a business key which should have a UNIQUE constraint in the table. The primary key is there to identify a row in the table and should be used by the database (to join ..) and by DBAs only.
Exposing primary keys to business users will end up in trouble and the database will sooner or later lose referential integrity -- always does, people are creative.
