I'm going to running thousands of queries into SQL and I need to prevent the duplication of field 'domain'. Never had to do this before and any help would be appreciated.
You probably want to create a "UNIQUE" constraint on the field "Domain" - this constraint will raise an error if you create two rows that have the same domain in the database. For an explanation, see this tutorial in W3C school -
http://www.w3schools.com/sql/sql_unique.asp
If this doesn't solve your problem, please clarify the database you have chosen to use (MySql?).
NOTE: This constraint is completely separate from your choice of PHP as a programming language, it is a SQL database definition thing. A huge advantage of expressing this constraint in SQL is that you can trust the database to preserve the constraint even when people import / export data from the database, your application is buggy or another application shares the database.
If this is an absolute database integrity requirement (It's not likely to change, nor does existing data have this problem), I would enforce it at the database with a unique constraint.
As far as detecting it before or after the attempt in order to notify the user, there are a number of techniques which could be used.
Where is the data coming from? Is this something you only want to run once, or a couple of times, or often? If the domain-value already exists, do you just want to skip the insert or do something else (ie increment a counter)?
Depending on your answers, there are many possible solutions:
Pre-sort your data, eliminate duplicates, then insert
(assumes relatively static data, empty table to begin with)
Use an associative array in PHP as a local domain-value cache
(if table already contains data, start by reading existing content;
not thread-safe, but works if it only runs once at a time)
Make domain a UNIQUE column and write wrapper code to handle return errors
Make domain a UNIQUE or PRIMARY KEY column and use an ON DUPLICATE KEY clause:
INSERT INTO mydata ( domain, count ) VALUES
( 'firstdomain', 1 ),
( 'seconddomain', 1 ),
( 'thirddomain', 1 )
ON DUPLICATE KEY
UPDATE count = count+1
Insert all data into the table, then remove duplicates
Note that batching inserts (ie using multiple value clauses per statement) can be significantly faster.
I'm not really sure I understood your question, but perhaps you are looking for SQL's "UNIQUE" constraint. If the query tries to insert a pre-existing value to a field, you (PHP) will be notified about this constraint breach.
There are a bunch of ways to approach this. You could set a unique constraint (like a primary key) on that column. This will cause the insert to fail if that domain has also been inserted. You could also insert all of the duplicate domains and just delete them later on. This will work well if not that many of the domains are duplicated. There are a few questions posted already on finding duplicate rows.
This can be doen with sql, rather than with php.
i am assuming that you are using MySQl, but the same principles will work with different databases.
make the Domain column the primary key. (makes sense, as it has to unique.)
Rather than using INSERT, use UPDATE.
if the primary key already exists (that you are trying to put into the table), update will update the existing tuple, rather than creating a new tuple.
so you will overwrite existing data if it is different, and if it is identical the update will be skipped.
Related
Do you ever use a separate table for "generating" artificial primary keys for DB (and why)? What I mean is to have a table with two columns, table name and current ID - with which you could get new "ID" for some table by simply locking the row with that table name, getting the current value of the key, increment it by one, and unlock the row. Why would you prefer this over standard integer identity column?
P.S. The "idea" is from Fowlers Patterns of Enterprise Application Architecture, btw...
This is called Hi/Lo assignment.
You would do this having either a trigger on INSERT on your tables getting the ID from this table and incrementing it before or after you get your ID, depending of your choice.
This is commonly used when you have to deal with multiple database engines. The autoincremental identifier in Oracle is through a SEQUENCE, which you increment with SEQUENCE.NEXTVALUE from within a BEFORE INSERT TRIGGER on your data table.
Oppositly, SQL Server has IDENTITY columns, autoincrementing natively and this is managed by the DBE itself.
In order for your software to work on both DBE, you have to come to some sort of a standard, then the most common "standard" used for this is the Hi/Lo assignment to the primary key.
This is one approach amongst others. These days, with ORM Mapping tools such as NHibernate, it is offered through configuration so that you need less to care on both the application and the database sides.
EDIT #1
Because this kind of maneuvre can't be used for a global scope, you'd have to have such a table per database, or database schema. This way, each schema is indenpendant from the other. However, data in one schema can't implicitly be moved toward another with the same key, as it would perhaps be conflicted with an already existing row.
As for a security schema, it accesses the same database as another schema or user, so no additional table should exist for specific security schema.
Whenever you can use sql server's identity or guid features, you should. However, there are a few situations where this may not be possible.
One example is that sql server only allows one identity column per table. Rarely, a table will have records that need both a private id and a public id, and a limit of one identity column means generating both as integers can be a pain. You could always use a guid for one, but you want the integer on the private id for speed and you may also want the public id to be more human readable than a guid.
In this situation, an extra table for generating the ids can make sense. However, I'd do it a bit differently. Still have two columns in the table, but make one "shadow" or "Id mapping" table for every real table. One of the columns will be your private id (unique constraint) and one will be your public id (identity with maybe an increment value of '7' or '13' or other number that's less obvious than '1').
The key difference here is that you don't want to do the locking yourself. Let sql server handle it.
The only time I have ever used this is when I had an application in BTrieve, and it didn't have an identity column. And I should also say when they tried to use this table, it caused a massive slow down when they tried to import data, because of all the extra reads and writes. My friend looked at it and rewrote how they did it to speed it up, but the moral of the story is that if you do something like this incorrectly, there can be brutal consequences.
Personally, I don't think I would ever want to do this. There is too much possibility for error. Two people try and use the same key, because they forgot to lock the table before grabbing the id. This just seems like something that should be left up to the RDBMS if at all possible. As Will brought up, it's easy to minimize this situation, but if you don't know what you are doing it can happen.
You wouldn't prefer it at all.
Whatever you gain by using the pattern or becoming DB agnostic, you'll lose in headaches, support and performance.
locking the row with that table name,
getting the current value of the key,
increment it by one, and unlock the
row
This sounds simple, doesn't it?
UPDATE TableOfId
SET Id += 1
OUTPUT Inserted.Id
WHERE Name = #Name;
In reality, its a disaster. No activity occurs in the application as a standalone operation: all operations are part of transactions. One cannot simply 'unlock' the row because the 'unlock' will actually occur only at commit time. Which means that all transactions that need an Id on a table are serialized and only one can proceed at any time. It also means that transaction that access more than one table will likely deadlock on updating the table of Ids because enforcing the 'get the next Id' update order is hard in practice.
To avoid complete serialization one needs to obtain the Ids on separate, standalone, transactions that can commit immediately (usually implicit auto-commit transaction on the UPDATE itself). But this complicates the application logic tremendously. Every operation needs to maintain two separate connections to the database, one to do the normal transaction logic and another one to obtain the needed Ids. Even then, the update of Ids can become such a hot spot that it can still cause visible contention and blocking (similar to the dreaded 'update page hit count +1' prevalent on web apps).
In short: use IDENTITY. The identity generation is optimized for high concurrency.
I have seen this pattern used when data created in one database needs to be migrated, backed-up, clustered or staged to another database. In this situation, first of all your want to ensure the primary keys will not need to change. Secondly the foreign keys. Thirdly, externally exposed keys or durable references.
I have a development database that has fees in it. It has a feeid, which is a unique key that is the identifier. The problem I run into is that the feeid/fee amount may not match when putting updating the table on a production server. This obviously could lead to some bad things happening, like overcharging for something or undercharging. Is there a way to match reset identities in sql server or match them or is this an example of when you would not want to use them?
Don't make your primary keys
"mean something" other than
identifying an unique record. If you
need to hard code an ID somewhere,
create another column for it.
So-called "natural keys" are more
trouble than they're worth
If,
for some reason, you decide that
either you will not or cannot follow
the first rule, don't use any
automatically generated key values.
That is the behaviour of an identity column, this is also what makes it so fast because it doesn't lock the table
to reset an identity either use DBCC CHECKIDENT or TRUNCATE TABLE
to insert IDs from one table to another and to keep the same values you need to do
SET IDENTITIY_INSERT ON
--upddate/insert rows
SET IDENTITIY_INSERT OFF
keep in mind that during the time between the two SET IDENTITIY_INSERT statements that your regular inserts will FAIL!
You can set IDENTITIY INSERT ON, update the IDs (make sure there are no conflicts) and then turn it back off.
I want to learn the answer for different DB engines but in our case;
we have some records that are not unique for a column and now we want to make that column unique which forces us to remove duplicate values.
We use Oracle 10g. Is this reasonable? Or is this something like goto statement :) ? Should we really delete? What if we had millions of records?
To answer the question as posted: No, it can't be done on any RDBMS that I'm aware of.
However, like most things you can work around it, by doing the following.
Create a composite key, with a new column and the existing column
You can make it unique without deleting anything by adding a new column, call it PartialKey.
For existing rows you set PartialKey to a unique value (starting at Zero).
Create a unique constraint on the existing column and PartialKey (you can do this because each of these rows will now be unique).
For new rows, only use a default value of Zero for PartialKey (because zero has already been used), this will force the existing column to have unqiue values in the table.
IMPORTANT EDIT
This is weak - if you delete a row with partial key 0. Now another row can be added with a value that is already in the existing column, because the 0 in partial key will guarentee uniqueness.
You would need to ensure that either
You never delete the row with
partial key 0
You always have a dummy row with
partial key 0, and you never delete
it (or you immediately reinsert it automatically)
Edit: Bite the bullet and clean the data
If as you said you've just realised that the column should be unique, then you should (if possible) clean up the data. The above approach is a hack, and you'll find yourself writing more hacks when accessing the table (you may find you've two sets of logic for dealing with queries against that table, one for where the column IS unique, and one where it's NOT. I'd clean this now or it'll come back and bite you in the arse a thousand times over.
This can be done in SQL Server.
When you create a check constraint,
you can set an option to apply it
either to new data only or to existing
data as well. The option of applying
the constraint to new data only is
useful when you know that the existing
data already meets the new check
constraint, or when a business rule
requires the constraint to be enforced
only from this point forward.
for example
ALTER TABLE myTable
WITH NOCHECK
ADD CONSTRAINT myConstraint CHECK ( column > 100 )
You can do this using NOVALIDATE ENABLE constraint state, but deleting is much more preferred way.
You have to set your records straight before adding the constraints.
In Oracle you can put a constraint in a enable novalidate state. When a constraint is in the enable novalidate state, all subsequent statements are checked for conformity to the constraint. However, any existing data in the table is not checked. A table with enable novalidated constraints can contain invalid data, but it is not possible to add new invalid data to it. Enabling constraints in the novalidated state is most useful in data warehouse configurations that are uploading valid OLTP data.
I have the same database running on two different machines. The DB's make extensive use of Identity columns, and the tables have clashed pretty horribly. I now want to merge these two together before sorting out the undelying issue which I may do by
A) Using GUIDs (unweildy but works everywhere)
B) Assigning Identity ranges, kind of naff, but means you can still access records in order, knock up basic Sql and select records easily, and it identifies which machine originated the data.
My question is, what's the best way of re-keying (ie changing the primary keys) on one of the databases so the data no longer clashes. We're only looking at 6 tables total, but lots of rows ~2M in the 3 tables.
Update - is there any real sql code out there that does this, I know about Identity Insert etc. I've solved this issue in a number of in-elegant ways before, and I was looking for the elegant solution, preferable with a nice TSQL SP to do the donkey work - if that doesn't exist I'll code it up and place on wiki.
A simplistic way is to change all keys on the one of the databases by a fixed increment, say 10,000,000, and they will line up. In order to do this, you will have to bring the applications down so the database is quiet and drop all FK references affected by this, recreating them when finished. You will also have to reset the seed value on all affected identity columns to an appropriate value.
Some of the tables will be reference data, which will be more complicated to merge if it is not in sync. You could possibly have issues with conflicting codes meaning the same thing on different instances or the same code having different meanings. This may or may not be an issue with your application but if the instances have been run without having this coordinated between them you might want to check carefully for this.
Also, data like names and addresses are very likely to be out of sync if there wasn't a canonical source for these. You may need to get these out, run a matching query and get the business to tidy up any exceptions.
I would add another column to the table first, populate that with the new Primary key.
Then I'd use update statements to update the new foreign key fields in all related tables.
Then you can drop the old Primary key and old foreign key fields.
I'm currently working on someone else's database where the primary keys are generated via a lookup table which contains a list of table names and the last primary key used. A stored procedure increments this value and checks it is unique before returning it to the calling 'insert' SP.
What are the benefits for using a method like this (or just generating a GUID) instead of just using the Identity/Auto-number?
I'm not talking about primary keys that actually 'mean' something like ISBNs or product codes, just the unique identifiers.
Thanks.
An auto generated ID can cause problems in situations where you are using replication (as I'm sure the techniques you've found can!). In these cases, I generally opt for a GUID.
If you are not likely to use replication, then an auto-incrementing PK will most likely work just fine.
There's nothing inherently wrong with using AutoNumber, but there are a few reasons not to do it. Still, rolling your own solution isn't the best idea, as dacracot mentioned. Let me explain.
The first reason not to use AutoNumber on each table is you may end up merging records from multiple tables. Say you have a Sales Order table and some other kind of order table, and you decide to pull out some common data and use multiple table inheritance. It's nice to have primary keys that are globally unique. This is similar to what bobwienholt said about merging databases, but it can happen within a database.
Second, other databases don't use this paradigm, and other paradigms such as Oracle's sequences are way better. Fortunately, it's possible to mimic Oracle sequences using SQL Server. One way to do this is to create a single AutoNumber table for your entire database, called MainSequence, or whatever. No other table in the database will use autonumber, but anyone that needs a primary key generated automatically will use MainSequence to get it. This way, you get all of the built in performance, locking, thread-safety, etc. that dacracot was talking about without having to build it yourself.
Another option is using GUIDs for primary keys, but I don't recommend that because even if you are sure a human (even a developer) is never going to read them, someone probably will, and it's hard. And more importantly, things implicitly cast to ints very easily in T-SQL but can have a lot of trouble implicitly casting to a GUID. Basically, they are inconvenient.
In building a new system, I'd recommend using a dedicated table for primary key generation (just like Oracle sequences). For an existing database, I wouldn't go out of my way to change it.
from CodingHorror:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
The article provides a lot of good external links on making the decision on GUID vs. Auto Increment. If I can, I go with GUID.
It's useful for clients to be able to pre-allocate a whole bunch of IDs to do a bulk insert without having to then update their local objects with the inserted IDs. Then there's the whole replication issue, as mentioned by Galwegian.
The procedure method of incrementing must be thread safe. If not, you may not get unique numbers. Also, it must be fast, otherwise it will be an application bottleneck. The built in functions have already taken these two factors into account.
My main issue with auto-incrementing keys is that they lack any meaning
That's a requirement of a primary key, in my mind -- to have no other reason to exist other than identifying a record. If it has no real-world meaning, then it has no real-world reason to change. You don't want primary keys to change, generally speaking, because you have to search-replace your whole database or worse. I have been surprised at the sorts of things I have assumed would be unique and unchanging that have not turned out to be years later.
Here's the thing with auto incrementing integers as keys:
You HAVE to have posted the record before you get access to it. That means that until you have posted the record, you cannot, for example, prepare related records that will be stored in another table, or any one of a lot of other possible reasons why it might be useful to have access to the new record's unique id, before posting it.
The above is my deciding factor, whether to go with one method, or the other.
Using a unique identifiers would allow you to merge data from two different databases.
Maybe you have an application that collects data in multiple database and then "syncs" with a master database at various times in the day. You wouldn't have to worry about primary key collisions in this scenario.
Or, possibly, you might want to know what a record's ID will be before you actually create it.
One benefit is that it can allow the database/SQL to be more cross-platform. The SQL can be exactly the same on SQL Server, Oracle, etc...
The only reason I can think of is that the code was written before sequences were invented and the code forgot to catch up ;)
I would prefer to use a GUID for most of the scenarios in which the post's current method makes any sense to me (replication being a possible one). If replication was the issue, such a stored procedure would have to be aware of the other server which would have to be linked to ensure key uniqueness, which would make it very brittle and probably a poor way of doing this.
One situation where I use integer primary keys that are NOT auto-incrementing identities is the case of rarely-changed lookup tables that enforce foreign key constraints, that will have a corresponding enum in the data-consuming application. In that scenario, I want to ensure the enum mapping will be correct between development and deployment, especially if there will be multiple prod servers.
Another potential reason is that you deliberately want random keys. This can be desirable if, say, you don't want nosey browsers leafing through every item you have in the database, but it's not critical enough to warrant actual authentication security measures.
My main issue with auto-incrementing keys is that they lack any meaning.
For tables where certain fields provide uniqueness (whether alone or in combination with another), I'd opt for using that instead.
A useful side benefit of using a GUID primary key instead of an auto-incrementing one is that you can assign the PK value for a new row on the client side (in fact you have to do this in a replication scenario), sparing you the hassle of retrieving the PK of the row you just added on the server.
One of the downsides of a GUID PK is that joins on a GUID field are slower (unless this has changed recently). Another upside of using GUIDs is that it's fun to try and explain to a non-technical manager why a GUID collision is rather unlikely.
Galwegian's answer is not necessarily true.
With MySQL you can set a key offset for each database instance. If you combine this with a large enough increment it will for fine. I'm sure other vendors would have some sort of similar settings.
Lets say we have 2 databases we want to replicate. We can set it up in the following way.
increment = 2
db1 - offset = 1
db2 - offset = 2
This means that
db1 will have keys 1, 3, 5, 7....
db2 will have keys 2, 4, 6, 8....
Therefore we will not have key clashes on inserts.
The only real reason to do this is to be database agnostic (if different db versions use different auto-numbering techniques).
The other issue mentioned here is the ability to create records in multiple places (like in the central office as well as on traveling users' laptops). In that case, though, you would probably need something like a "sitecode" that was unique to each install that was prefixed to each ID.