From what I gather, LINQ to SQL doesn't execute any database commands (including opening a database connection) until SubmitChanges() is called. If that is the case, I would like to make a few methods more efficient. Is it possible to retrieve the ID of an object before inserting it? I'd rather not call SubmitChanges() twice if I can know the value of the ID before the row is actually inserted into the database. Logically, it makes sense that a connection to the database has to be opened to find out the value, but does an insert also have to take place?
Thanks
The usual technique to solve this is to generate a unique identifier in the application layer (such as a GUID) and use this as the ID. That way you do not have to retrieve the ID on a subsequent call.
Of course, using a GUID as a primary key can have its drawbacks. If you decide to go this way, look up COMB GUIDs.
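As a rough illustration of the application-generated key idea, here is a minimal Python sketch. It uses the standard-library sqlite3 module as a stand-in for the real database, and the table and column names (orders, order_lines) are made up for the example. The point is only that the key is known before anything is sent to the database, so parent and child rows can go out in a single commit.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE order_lines ("
    " id TEXT PRIMARY KEY,"
    " order_id TEXT NOT NULL REFERENCES orders(id),"
    " item TEXT NOT NULL)"
)

# The key is generated in the application, before any INSERT is issued.
order_id = str(uuid.uuid4())

with conn:  # one transaction, one commit
    conn.execute("INSERT INTO orders (id, customer) VALUES (?, ?)", (order_id, "alice"))
    conn.execute(
        "INSERT INTO order_lines (id, order_id, item) VALUES (?, ?, ?)",
        (str(uuid.uuid4()), order_id, "widget"),
    )

stored_order_id = conn.execute("SELECT order_id FROM order_lines").fetchone()[0]
```

Random GUIDs like uuid4 are exactly the kind that can fragment a clustered index, which is the drawback COMB GUIDs are meant to reduce.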
Well, here is the problem: you somehow get the ID BEFORE inserting into the database and do some processing with it. At the same time, another thread does the same and gets the same ID, and now you have a conflict.
In other words, I don't think there is an easy way of doing this.
I don't necessarily recommend this, but have seen it done. You can calculate your own ID as an integer using a stored procedure and a table that holds the value of the next ID. The stored procedure selects the value from the table to return it, then increments the value. The table will look something like the following:
Create Table Keys(
name varchar(128) not null primary key,
nextID int not null
)
Note before doing this that if you select and then update in two different batches, you risk key collisions. Both steps need to be treated as a single atomic transaction.
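To make the "select and update as one atomic step" point concrete, here is a small Python sketch against an in-memory SQLite database. It is only an analogy for the stored procedure described above: next_id is a hypothetical helper, and BEGIN IMMEDIATE is SQLite's way of taking the write lock before the read, where another engine would use its own locking or a single UPDATE that returns the value.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so the BEGIN/COMMIT
# statements below are issued explicitly by this code.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Keys (name TEXT NOT NULL PRIMARY KEY, nextID INTEGER NOT NULL)")
conn.execute("INSERT INTO Keys (name, nextID) VALUES ('Invoice', 1)")


def next_id(conn, name):
    """Return the next ID for `name`, reading and incrementing in one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute("SELECT nextID FROM Keys WHERE name = ?", (name,)).fetchone()
        if row is None:
            raise KeyError(name)
        conn.execute("UPDATE Keys SET nextID = nextID + 1 WHERE name = ?", (name,))
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise


first = next_id(conn, "Invoice")
second = next_id(conn, "Invoice")
```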
I'm new to APEX/Oracle DB and just found out that you would use either a sequence + trigger (usually for versions < 12c) or an identity column (versions >= 12c).
Which is better practice, and what are the differences between the two?
Thanks :)
One big difference is in dealing with parent-child insertions - here you first need to insert the parent, and then use the generated ID value from the parent table as a foreign key in the child table's inserts.
In those instances, with an identity column you either need to be able to use the RETURNING clause to get back the just-inserted ID (not supported in all middleware), or you do the insert of the parent record and then query to get the ID that was created so that you can use it as the FK value in the child table. If your table does not have a natural key to easily identify the just-inserted row - this may be problematic.
On the other hand, for those situations, if you do not use IDENTITY you instead first do a SELECT on the sequence to get the next incremental value, and then use that directly in your parent and child insert statements. This is a more portable solution, and is compatible with all Oracle versions if you may need to do an install to an earlier version of Oracle for a given client. In that case you don't have the trigger do the select from the sequence to set the value - you do it yourself.
Yes, it is an extra round-trip to the DB to get the sequence.nextval, but if your middleware doesn't support the RETURNING clause you're going to be doing that round trip to get the inserted ID anyway, and almost certainly using a more expensive query.
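The parent-then-child pattern discussed above can be sketched in Python. This uses SQLite rather than Oracle, so it is an analogy only: cursor.lastrowid plays the role that the RETURNING clause plays in Oracle, and the parent/child tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE child ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " parent_id INTEGER NOT NULL REFERENCES parent(id),"
    " detail TEXT NOT NULL)"
)

with conn:
    cur = conn.execute("INSERT INTO parent (name) VALUES (?)", ("order-1",))
    parent_id = cur.lastrowid  # ID the database generated for this insert
    # The generated parent ID is reused as the foreign key of every child row.
    conn.executemany(
        "INSERT INTO child (parent_id, detail) VALUES (?, ?)",
        [(parent_id, "line A"), (parent_id, "line B")],
    )

child_parent_ids = [r[0] for r in conn.execute("SELECT parent_id FROM child ORDER BY id")]
```

If the driver offers nothing like lastrowid or RETURNING, fetching the key first (from a sequence, in Oracle's case) and passing it to both inserts avoids having to query for the row afterwards.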
Also, if you have a bunch of PL/SQL library code that manipulates data using the very convenient %ROWTYPE conventions, and if your IDENTITY column is set to GENERATED ALWAYS, then you can start running into problems on inserts as noted here. Something to be aware of if thinking of switching to IDENTITY columns underneath an existing code base.
There is a third alternative to the two mentioned in the question (IDENTITY column and sequence + trigger): namely, create a sequence and set a default on the column, e.g.:
CREATE SEQUENCE my_sequence;
CREATE TABLE my_table
( my_column NUMBER DEFAULT my_sequence.nextval NOT NULL
, my_other_column DATE DEFAULT SYSDATE NOT NULL
);
Do you ever use a separate table for "generating" artificial primary keys for DB (and why)? What I mean is to have a table with two columns, table name and current ID - with which you could get new "ID" for some table by simply locking the row with that table name, getting the current value of the key, increment it by one, and unlock the row. Why would you prefer this over standard integer identity column?
P.S. The "idea" is from Fowler's Patterns of Enterprise Application Architecture, btw...
This is called Hi/Lo assignment.
You would do this by having an INSERT trigger on each of your tables that gets the ID from this table and increments the stored value either before or after reading it, depending on your choice.
This is commonly used when you have to deal with multiple database engines. In Oracle, the auto-incrementing identifier comes from a SEQUENCE, which you advance with SEQUENCE.NEXTVAL from within a BEFORE INSERT trigger on your data table.
Conversely, SQL Server has IDENTITY columns, which auto-increment natively and are managed by the database engine itself.
For your software to work on both database engines, you have to settle on some sort of standard, and the most common "standard" used for this is Hi/Lo assignment of the primary key.
This is one approach among others. These days, ORM tools such as NHibernate offer it through configuration, so you need to care less about it on both the application and the database side.
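For readers who have not met the term, here is a minimal sketch of one common reading of Hi/Lo: the application reserves a whole block of IDs from a key table in one short transaction, then hands them out from memory. The HiLoGenerator class, the hilo table and the block size are all hypothetical names chosen for this example, and SQLite stands in for the real engine.

```python
import sqlite3


class HiLoGenerator:
    """Hand out IDs from blocks reserved in a key table (hypothetical helper)."""

    def __init__(self, conn, name, block_size=10):
        self.conn = conn
        self.name = name
        self.block_size = block_size
        self.next_value = 0
        self.block_end = 0  # exclusive; equal to next_value means "block used up"

    def _reserve_block(self):
        # One short transaction reserves block_size IDs at once.
        with self.conn:
            self.conn.execute(
                "UPDATE hilo SET next_hi = next_hi + 1 WHERE name = ?", (self.name,)
            )
            hi = self.conn.execute(
                "SELECT next_hi - 1 FROM hilo WHERE name = ?", (self.name,)
            ).fetchone()[0]
        self.next_value = hi * self.block_size + 1
        self.block_end = self.next_value + self.block_size

    def next_id(self):
        if self.next_value >= self.block_end:
            self._reserve_block()
        value = self.next_value
        self.next_value += 1
        return value


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hilo (name TEXT PRIMARY KEY, next_hi INTEGER NOT NULL)")
conn.execute("INSERT INTO hilo (name, next_hi) VALUES ('customer', 0)")
conn.commit()

gen = HiLoGenerator(conn, "customer", block_size=3)
ids = [gen.next_id() for _ in range(5)]
```

With a block size of 3, five IDs cost two trips to the key table instead of five. The trade-off is that unused IDs in a block are lost if the process stops.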
EDIT #1
Because this kind of maneuver can't be used at a global scope, you'd have to have such a table per database, or per database schema. This way, each schema is independent of the others. However, data in one schema can't implicitly be moved to another with the same key, as it might conflict with an already existing row.
As for a security schema, it accesses the same database as another schema or user, so no additional table should exist for specific security schema.
Whenever you can use SQL Server's identity or GUID features, you should. However, there are a few situations where this may not be possible.
One example is that SQL Server only allows one identity column per table. Rarely, a table will have records that need both a private ID and a public ID, and a limit of one identity column means generating both as integers can be a pain. You could always use a GUID for one, but you want the integer on the private ID for speed, and you may also want the public ID to be more human-readable than a GUID.
In this situation, an extra table for generating the IDs can make sense. However, I'd do it a bit differently. Still have two columns in the table, but make one "shadow" or "ID mapping" table for every real table. One of the columns will be your private ID (unique constraint) and one will be your public ID (identity, with maybe an increment value of 7 or 13 or some other number that's less obvious than 1).
The key difference here is that you don't want to do the locking yourself. Let SQL Server handle it.
The only time I have ever used this was with an application on Btrieve, which didn't have an identity column. I should also say that when they tried to use this table, it caused a massive slowdown when importing data, because of all the extra reads and writes. My friend looked at it and rewrote how they did it to speed it up, but the moral of the story is that if you do something like this incorrectly, there can be brutal consequences.
Personally, I don't think I would ever want to do this. There is too much possibility for error: two people try to use the same key because they forgot to lock the table before grabbing the ID. This just seems like something that should be left up to the RDBMS if at all possible. As Will brought up, it's easy to minimize this situation, but if you don't know what you are doing it can happen.
You wouldn't prefer it at all.
Whatever you gain by using the pattern or becoming DB agnostic, you'll lose in headaches, support and performance.
"locking the row with that table name, getting the current value of the key, increment it by one, and unlock the row"
This sounds simple, doesn't it?
UPDATE TableOfId
    SET Id += 1
    OUTPUT Inserted.Id
    WHERE Name = @Name;
In reality, it's a disaster. No activity in the application occurs as a standalone operation: all operations are part of transactions. One cannot simply 'unlock' the row, because the 'unlock' actually occurs only at commit time. This means that all transactions that need an Id on a table are serialized and only one can proceed at any time. It also means that transactions that access more than one table will likely deadlock on updating the table of Ids, because enforcing a consistent 'get the next Id' update order is hard in practice.
To avoid complete serialization one needs to obtain the Ids on separate, standalone, transactions that can commit immediately (usually implicit auto-commit transaction on the UPDATE itself). But this complicates the application logic tremendously. Every operation needs to maintain two separate connections to the database, one to do the normal transaction logic and another one to obtain the needed Ids. Even then, the update of Ids can become such a hot spot that it can still cause visible contention and blocking (similar to the dreaded 'update page hit count +1' prevalent on web apps).
In short: use IDENTITY. The identity generation is optimized for high concurrency.
I have seen this pattern used when data created in one database needs to be migrated, backed up, clustered or staged to another database. In this situation, you want to ensure, first of all, that the primary keys will not need to change; secondly, the foreign keys; and thirdly, any externally exposed keys or durable references.
I have a stored procedure that does, among other things, some inserts into different tables inside a loop. See the example below for a clearer understanding:
INSERT INTO T1 VALUES ('something')
SET @MyID = Scope_Identity()
-- ... some stuff goes here
INSERT INTO T2 VALUES (@MyID, 'something else')
-- ... the rest of the procedure
These two tables (T1 and T2) each have an IDENTITY(1, 1) column; let's call them ID1 and ID2. However, after running the procedure in our production database (a very busy database) and having more than 6250 records in each table, I noticed one incident where ID1 does not match ID2, although normally for each record inserted into T1 there is a record inserted into T2, and the identity columns in both are incremented consistently.
The "wrong" records were something like this:
T1:
ID1   Col1
----  ---------
4709  data-4709
4710  data-4710

T2:
ID2   ID1   Col1
----  ----  ---------
4709  4710  data-4710
4710  4709  data-4709
Note the "inverted" ID1 in the second table.
Not knowing much about SQL Server's underlying operations, I have come up with the following "theory"; maybe someone can correct me on this.
What I think is that because the loop is faster than physically writing to the table, and/or maybe something else delayed the writing process, the records were buffered. When the time came to write them, they were written in no particular order.
Is that even possible? If not, how can the above scenario be explained?
If yes, then I have another question to raise. What if the first insert (from the code above) got delayed? Doesn't that mean I won't get the correct IDENTITY to insert into the second table? If the answer to this is also yes, what can I do to ensure the insertion into the two tables happens in sequence with the correct IDENTITY?
I appreciate any comment and information that help me understand this.
Thanks in advance.
There is no way you can rely on IDENTITY to solve this for your second table. If you care about the generated primary key value for that row, you should generate it yourself.
IDENTITY is a way of saying "I don't want the hassle of generating a key myself, just do it for me, and I'll ask for the generated value if and when I need it".
What could be happening here is that two threads are inserting rows at the same time and neither has committed yet, so you get this scenario:
Thread 1                      Thread 2
get id for table 1 = 4709
                              get id for table 1 = 4710
insert row for table 1
                              insert row for table 1
                              get id for table 2 = 4709
get id for table 2 = 4710
insert row for table 2
                              insert row for table 2
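The interleaving above can be replayed deterministically in a few lines of Python. This is a simulation only: two counters stand in for the IDENTITY columns, the session names A and B are made up, and the starting value 4709 is taken from the rows shown in the question.

```python
from itertools import count

# Stand-ins for the two IDENTITY(1, 1) counters.
identity_t1 = count(4709)
identity_t2 = count(4709)

t1_rows = []  # (ID1, Col1)
t2_rows = []  # (ID2, ID1, Col1)
session = {"A": {}, "B": {}}  # per-session variables, like @MyID


def insert_t1(name):
    my_id = next(identity_t1)
    session[name]["my_id"] = my_id  # what SCOPE_IDENTITY() would return
    t1_rows.append((my_id, f"data-{my_id}"))


def insert_t2(name):
    my_id = session[name]["my_id"]
    t2_rows.append((next(identity_t2), my_id, f"data-{my_id}"))


# Session A inserts into T1 first, but session B reaches T2 first.
insert_t1("A")
insert_t1("B")
insert_t2("B")
insert_t2("A")
```

Each session still stores its own, correct ID1 in T2. Only the assumption that ID2 equals ID1 breaks, because the two counters advance independently.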
You have two ways to solve your problem:
1. Remove IDENTITY from the primary key in the second table
2. Use SET IDENTITY_INSERT <table> ON to allow you to provide a key for it, while keeping the IDENTITY setting
In this case, however, I would use method number 1. Method number 2 is usually used when importing data into an empty table. You don't want to risk the database auto-generating an ID that you later want to use yourself (since it comes from the first table), so you should remove the IDENTITY setting from the primary key of the second table.
Or you could try to avoid relying on the key for that table at all. Since you have a foreign key reference, do you really need the key values to be the same?
Of course your above scenario is possible - and quite likely, too.
If you have two separate, independent tables, both being used for queries and inserts, both with a separate IDENTITY(1,1) field, there's absolutely no guarantee that an insert into one table and then into the second will be executed in the same order!
If you do need to establish a link between the two, insert the first table's ID into the second table as a foreign key. You cannot rely on the IDs generated by IDENTITY columns being the same in both tables!
Regarding writing:
Whenever you do something that changes data, the change is written to the database log at that moment, and you don't get a transaction confirmation until this has happened. That is the D (durability) in the ACID properties (database theory).
Dirty database pages are written to disk "in the background". If too many are dirty, a checkpoint is triggered and they are all flushed out.
So much for the writing part.
What you probably ran into is simply the fact that while individual statements are atomic, a busy database may well have more than one thread running against it. So, basically, a thread switch happened between the statements: one thread got ID1, another one got priority and obtained its ID1 and ID2, and then the first one got its ID2.
Nothing specific here ;) Typical normal database behavior when multiple threads run along. Nothing to do with writing per se.
Basically, between
SET @MyID = Scope_Identity()
and the next statement, another thread can get priority ;)
Do not rely on the actual values of identity columns for business/application logic; you can only assume that they will be unique!
You should be able to avoid this issue by using a SQL Server 2005 feature, the OUTPUT clause. Link below.
http://msdn.microsoft.com/en-us/library/ms177564.aspx
This is a known bug in SQL Server.
The problem is that when the generated query plan is parallelised, the value returned by SCOPE_IDENTITY() can be incorrect.
Move that part into its own procedure that takes the parameters and returns the scope identity. It should then be correct.
If I remember rightly, this only manifests on tables with around a million rows or more.
Aha, here's the KB: http://support.microsoft.com/default.aspx?scid=kb;en-us;2019779&sd=rss&spid=2855
I am a bit of an SSIS newbie and while the whole system seems straightforward, I don't conceptually understand the process I need to go through in this scenario:
Need to map Invoice and InvoiceLine tables from a source database to two equivalent tables in a destination database - with different identity values.
For each invoice inserted across, I need to get the identity it was assigned and then insert all its lines referencing that new identity
There is a surrogate key on the invoices (the invoice number), however these might also clash with invoice numbers in the target system, hence they would also have to be renumbered.
This must be a common scenario in integration - is there a common solution?
Chris KL - you are correct that this is harder than one would expect. I have three methods for this, which work in different situations:
1. If the data you are loading is small (hundreds or thousands of rows, but not hundreds of thousands), then you can do this: use an OLE DB Command that performs one insert for each parent row and returns the identity value; then, downstream from that, join its output to the child rows and insert them. Advantage: intuitive. Disadvantage: scales badly. This method is documented on the web and should be easy to find with a search.
2. If we are talking about a bigger system where you need bulk loading, then there are two other flavors:
a. If you have exclusive access to the table during the load (really exclusive, enforced in some way), then you can grab the max existing ID from the table, use an SSIS script task to number the rows starting above that max ID, then SET IDENTITY_INSERT ON, stuff them in, and SET IDENTITY_INSERT OFF. You then have those script-generated keys in SSIS to assign to the child rows. Advantage: fast and simple, one trip to the DB. Disadvantage: possible errors if some other process inserts into your table at the same time. Brittle.
b. If you don't have exclusive access, then the only way I know of is with a round trip to the DB, thus: insert all parent rows, but keep track of a key for them that is not the identity column (a business key, for example). In a second data flow, process the child records by using a Lookup transform that uses the business key to fetch the parent ID. Make sure the lookup is tuned appropriately with respect to caching, and that the business key is indexed.
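Method b can be sketched outside SSIS to show the data movement. The following Python example uses SQLite, and everything named in it (the dest_invoice tables, the source_key column, the "srcA-" key format) is an assumption made for illustration. Pass 1 loads the parents and carries a business key along. Pass 2 resolves each child's new parent ID by looking that key up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dest_invoice (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source_key TEXT NOT NULL UNIQUE      -- business key; UNIQUE also indexes it
    );
    CREATE TABLE dest_invoice_line (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        invoice_id INTEGER NOT NULL REFERENCES dest_invoice(id),
        description TEXT NOT NULL
    );
    -- Rows already present in the destination, so new IDs will differ.
    INSERT INTO dest_invoice (source_key) VALUES ('existing-1'), ('existing-2');
    """
)

# Hypothetical source data: invoice IDs, and lines keyed by source invoice ID.
source_invoices = [(1,), (2,)]
source_lines = [(1, "widget"), (1, "gadget"), (2, "sprocket")]


def business_key(source_id):
    return f"srcA-{source_id}"  # assumed key format: source system + source ID


with conn:
    # Pass 1: load all parents, carrying the business key along.
    conn.executemany(
        "INSERT INTO dest_invoice (source_key) VALUES (?)",
        [(business_key(sid),) for (sid,) in source_invoices],
    )
    # Pass 2: look up the new parent ID by business key for each child row.
    conn.executemany(
        "INSERT INTO dest_invoice_line (invoice_id, description) "
        "SELECT id, ? FROM dest_invoice WHERE source_key = ?",
        [(desc, business_key(sid)) for sid, desc in source_lines],
    )

lines = conn.execute(
    "SELECT invoice_id, description FROM dest_invoice_line ORDER BY id"
).fetchall()
```

Because the lookup goes through the business key, the load does not care which identity values the destination assigns, and it tolerates other processes inserting at the same time.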
OK, this is a good news / bad news situation I'm afraid. First the good news and a bit of background which you may know but I'll put it down in case you don't.
You generally can't insert anything into IDENTITY columns. Of course, like everything else in life, there are times when you need to, and that can be done with the IDENTITY_INSERT option.
SET IDENTITY_INSERT MyTable ON

INSERT INTO MyTable (
    MyIdCol,
    Etc…
)
SELECT SourceIdCol,
       Etc…
FROM MySourceTable

SET IDENTITY_INSERT MyTable OFF
Now, you say that you have surrogate keys in the target but then you say that they may clash. So I'm a little confused… Are you using the keys from the source (e.g. IDENTITY columns) or are you generating new keys in the target? I would strongly advise against trying to merge the keyspaces in a single key column. If you need to retain the keys then I would suggest a multi-field key using something like SourceSystemId to keep them unique.
Finally, the bad news: SSIS doesn't provide a simple means of using the IDENTITY_INSERT option. The only way I've been able to do it is by turning it on in a SQL task that executes before the insert task. You should be able to pass the table name into the script as a variable. Make sure to include another SQL task afterwards to turn it off, because you can only use it on one table at a time.
I'm going to be running thousands of queries against SQL and I need to prevent duplication in the field 'domain'. I've never had to do this before, and any help would be appreciated.
You probably want to create a "UNIQUE" constraint on the field "Domain". This constraint will raise an error if you create two rows that have the same domain in the database. For an explanation, see this tutorial on W3Schools:
http://www.w3schools.com/sql/sql_unique.asp
If this doesn't solve your problem, please clarify which database you have chosen to use (MySQL?).
NOTE: This constraint is completely separate from your choice of PHP as a programming language, it is a SQL database definition thing. A huge advantage of expressing this constraint in SQL is that you can trust the database to preserve the constraint even when people import / export data from the database, your application is buggy or another application shares the database.
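To show the constraint doing the work regardless of the client language, here is a small sketch in Python with SQLite (the question uses PHP, so this is an illustration rather than a drop-in). The sites table and the add_domain helper are hypothetical. The database rejects the duplicate and the application only has to handle the resulting error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (domain TEXT NOT NULL UNIQUE)")


def add_domain(conn, domain):
    """Insert a domain; return False if the UNIQUE constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO sites (domain) VALUES (?)", (domain,))
        return True
    except sqlite3.IntegrityError:
        return False


results = [add_domain(conn, d) for d in ["example.com", "example.org", "example.com"]]
row_count = conn.execute("SELECT COUNT(*) FROM sites").fetchone()[0]
```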
If this is an absolute database integrity requirement (It's not likely to change, nor does existing data have this problem), I would enforce it at the database with a unique constraint.
As far as detecting it before or after the attempt in order to notify the user, there are a number of techniques which could be used.
Where is the data coming from? Is this something you only want to run once, or a couple of times, or often? If the domain value already exists, do you just want to skip the insert or do something else (e.g. increment a counter)?
Depending on your answers, there are many possible solutions:
Pre-sort your data, eliminate duplicates, then insert
(assumes relatively static data, empty table to begin with)
Use an associative array in PHP as a local domain-value cache
(if table already contains data, start by reading existing content;
not thread-safe, but works if it only runs once at a time)
Make domain a UNIQUE column and write wrapper code to handle return errors
Make domain a UNIQUE or PRIMARY KEY column and use an ON DUPLICATE KEY clause:
INSERT INTO mydata ( domain, count ) VALUES
( 'firstdomain', 1 ),
( 'seconddomain', 1 ),
( 'thirddomain', 1 )
ON DUPLICATE KEY
UPDATE count = count+1
Insert all data into the table, then remove duplicates
Note that batching inserts (i.e. using multiple value clauses per statement) can be significantly faster.
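The "insert, or bump a counter if the key already exists" option can be tried out without a MySQL server. The sketch below uses Python with SQLite, so the syntax differs from MySQL's ON DUPLICATE KEY UPDATE: INSERT OR IGNORE followed by an UPDATE is used here as a rough equivalent, and the incoming list is made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mydata (domain TEXT PRIMARY KEY, count INTEGER NOT NULL)")

incoming = ["firstdomain", "seconddomain", "firstdomain", "thirddomain", "firstdomain"]

with conn:  # one transaction for the whole batch
    for domain in incoming:
        # Create the row with count 0 if it is missing; do nothing if it exists.
        conn.execute("INSERT OR IGNORE INTO mydata (domain, count) VALUES (?, 0)", (domain,))
        # Then count this occurrence.
        conn.execute("UPDATE mydata SET count = count + 1 WHERE domain = ?", (domain,))

counts = dict(conn.execute("SELECT domain, count FROM mydata"))
```

The primary key on domain is what makes this safe: without it, INSERT OR IGNORE would have nothing to conflict with and duplicates would accumulate.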
I'm not really sure I understood your question, but perhaps you are looking for SQL's "UNIQUE" constraint. If the query tries to insert a pre-existing value to a field, you (PHP) will be notified about this constraint breach.
There are a bunch of ways to approach this. You could set a unique constraint (like a primary key) on that column. This will cause the insert to fail if that domain has already been inserted. You could also insert all of the duplicate domains and just delete them later on. This will work well if not that many of the domains are duplicated. There are a few questions posted already on finding duplicate rows.
This can be done with SQL, rather than with PHP.
I am assuming that you are using MySQL, but the same principles will work with different databases.
Make the Domain column the primary key. (This makes sense, as it has to be unique.)
Rather than using a plain INSERT, use REPLACE (or INSERT ... ON DUPLICATE KEY UPDATE). A plain UPDATE only changes rows that already exist, so it cannot add new domains.
If the primary key you are trying to put into the table already exists, the existing tuple is replaced, rather than a new tuple being created.
So you will overwrite existing data if it is different, and if it is identical the stored values stay the same.