I have a unique constraint on 5 nullable columns that together identify one row.
Is it okay to create a unique key and a clustered index on it instead of a primary key? I cannot use a primary key on these columns because they are nullable, and I cannot create an identity column because there are a lot of deletes and inserts, and I am worried the identity column will overflow.
Yes, and there's an argument that this is actually "better" than a primary key, as the rule that a primary key column is non nullable is in many ways an artificial constraint.
If you make it a UNIQUE CLUSTERED INDEX then you get just about everything that a primary key brings to the table except the unwanted rule that the columns must be non nullable. However, they must still be unique, so you could only ever have one row where all five columns in your index are null for example.
So you can use your index when creating foreign key constraints, you will guarantee the order data is stored, and each row must be unique. However, the index probably won't be incredibly useful for querying, and, because it's going to be wide, and you said there's a lot of deletes/ inserts, it will have a tendency to fragment your data.
Personally, I would be tempted to make it a unique constraint, but not clustered. Then it will do the job of keeping non-unique data from being created.
You could then add a surrogate key and make this the primary key. I doubt you would ever "run out" (or "overflow"?) of numbers doing this.
So why would I use a surrogate key?
Your surrogate key will be much narrower, so less impact from fragmentation due to so many inserts/ updates/ deletes.
It's then useful if you need to extend your database. Say you only have one table, and this is always going to be the only table in the entire database. In this one scenario it would make sense to not bother with a surrogate key. It doesn't give you any value; it's just an unnecessary overhead.
However, let's assume that you have other tables hanging off your "main" table (the one with 5 columns forming a unique key). Adding a surrogate key here allows you to make any child tables with a single id that links back to the parent table. The alternative would be to enforce the addition of ALL five columns forming the unique (candidate) key every time you create a child table.
Now you have a narrow clustered index that actually serves a purpose, and the fragmentation will not be quite as bad as it would with five columns.
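To make the surrogate-key layout concrete, here is a minimal sketch. It assumes SQLite through Python's sqlite3 module rather than SQL Server, and the table and column names (parent, child, a to e) are made up for illustration. One caveat: SQLite treats NULLs as distinct inside a UNIQUE constraint, whereas SQL Server allows only one all-NULL row, so the example sticks to non-NULL values.

```python
import sqlite3

# In-memory SQLite database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.executescript("""
CREATE TABLE parent (
    id INTEGER PRIMARY KEY,           -- narrow surrogate key
    a INTEGER, b INTEGER, c INTEGER, d INTEGER, e INTEGER,
    UNIQUE (a, b, c, d, e)            -- the wide five-column candidate key
);
CREATE TABLE child (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL REFERENCES parent (id),  -- one column, not five
    note TEXT
);
""")

parent_id = conn.execute(
    "INSERT INTO parent (a, b, c, d, e) VALUES (1, 2, 3, 4, 5)"
).lastrowid
conn.execute("INSERT INTO child (parent_id, note) VALUES (?, ?)", (parent_id, "first"))


def try_insert_parent(values):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO parent (a, b, c, d, e) VALUES (?, ?, ?, ?, ?)", values
        )
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_accepted = try_insert_parent((1, 2, 3, 4, 5))  # same five values again
different_accepted = try_insert_parent((1, 2, 3, 4, 6))  # differs in one column
print(duplicate_accepted, different_accepted)
```

The unique constraint still keeps duplicate five-column combinations out, while the child table only has to carry a single integer to point at its parent.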
If no single column can uniquely identify each row in the table,
then my primary key will consist of at least two fields.
Is that correct?
If it is correct, then when I draw the relationship diagram, do I have to underline both attributes that form the primary key?
Thank you
Here is some terminology:
A superkey is a set of columns that, taken together, uniquely identify rows.
A candidate key (or just: "key") is a minimal[1] superkey. Sometimes a key contains just one column, sometimes it contains several (in which case it is called "composite").
For practical reasons, we classify keys as either primary or alternate. One table has one primary key and zero or more alternate keys.
A key is "natural" if it arises from the intrinsic properties of data. In other words, it "means" something.
A key is "surrogate" if it doesn't have any meaning by itself - it is there only for identification purposes. It's typically implemented as an auto-incrementing integer, but there may be other strategies such as GUIDs (useful for replication). It is quite common for natural keys to be composite, but that almost never happens for surrogates.
If there are no "obvious" natural keys, the whole row can always act as a key[2]. However, this is rarely practical and in such cases you'll typically introduce a surrogate key just for the purpose of identifying rows.
Sometimes, but not always, it is useful to introduce a surrogate in addition to the existing natural key(s).
An ER diagram will clearly identify the PK[3], whether it is natural or surrogate and whether it is composite or not. What exactly this looks like depends on the notation being used, but the PK will typically be drawn in a graphically distinct manner and possibly prefixed with "PK".
[1] I.e. if you were to remove any column from it, it would no longer be unique.
[2] A database table is a physical representation of the mathematical concept of "relation". Since a relation is a set, there is no purpose in having two identical rows, so at the very least the whole row must be unique (an element is either in the set or it isn't; it cannot be "twice" in the set, as opposed to a multiset).
[3] Assuming it is not just an entity-level diagram, in which no attributes are shown at all.
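The superkey and candidate key definitions above can be checked mechanically against sample data. The sketch below is only an illustration: the rows and the column names (account, check_number, amount) are hypothetical, and uniqueness in a sample does not prove that something is a key. Real keys come from the rules of the data, not from whichever rows happen to exist today.

```python
from itertools import combinations


def is_superkey(rows, columns):
    """True if no two rows share the same values for `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def candidate_keys(rows, all_columns):
    """Minimal superkeys for this sample data, smallest first."""
    found = []
    for size in range(1, len(all_columns) + 1):
        for combo in combinations(all_columns, size):
            if any(set(k) <= set(combo) for k in found):
                continue  # contains a smaller key, so it is not minimal
            if is_superkey(rows, combo):
                found.append(combo)
    return found


# Hypothetical sample: check numbers repeat across accounts, amounts repeat freely.
columns = ("account", "check_number", "amount")
rows = [
    {"account": 1, "check_number": 100, "amount": 50},
    {"account": 1, "check_number": 101, "amount": 75},
    {"account": 2, "check_number": 100, "amount": 75},
    {"account": 1, "check_number": 102, "amount": 50},
    {"account": 2, "check_number": 101, "amount": 75},
]

keys = candidate_keys(rows, columns)
print(keys)
```

The whole row is always a superkey here, but it is not minimal, so it is not reported as a candidate key.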
You are correct, after a fashion. Technically, a primary key and a unique key can be two distinct things. You can have a primary key on a table or entity that uniquely identifies that entity. On the same table, you can also have a unique key constraint, which ensures that no two rows share the same values for the columns you choose. So you can have both a primary key and a unique constraint on the same table. Simply have a primary key column that is autogenerated by your DB, and then pick the two columns in your table on which you want to enforce the unique key constraint.
If you don't have a primary key you can still identify your data, but it will not perform well.
As a best practice, put a primary key on your table.
The usual preference is to use an auto-increment column as the primary key.
In my database I have a list of users with information about them, and I also have a feature which allows a user to add other users to a shortlist. My user information is stored in one table with a primary key of the user id, and I have another table for the shortlist. The shortlist table is designed so that it has two columns and is basically just a list of pairs of names. So to find the shortlist for a particular user you retrieve all names from the second column where the id in the first column is a particular value.
The issue is that according to many sources, such as the question "Should each and every table have a primary key?", you should have a primary key in every table of the database.
According to this source, http://www.w3schools.com/sql/sql_primarykey.asp, a primary key is one which uniquely identifies an entry in a database. So my question is:
What is wrong with the table in my database? Why does it need a primary key?
How should I give it a primary key? Just create a new auto-incrementing column so that each entry has a unique id? There doesn't seem much point for this. Or would I somehow encapsulate the multiple entries that represent a shortlist into another entity in another table and link that in? I'm really confused.
If the rows are unique, you can have a two-column primary key, although maybe that's database dependent. Here's an example:
CREATE TABLE my_table
(
    col_1 int NOT NULL,
    col_2 varchar(255) NOT NULL,
    CONSTRAINT pk_cols12 PRIMARY KEY (col_1, col_2)
);
If you already have the table, the example would be:
ALTER TABLE my_table
    ADD CONSTRAINT pk_cols12 PRIMARY KEY (col_1, col_2);
Primary keys must identify each record uniquely and as it was mentioned before, primary keys can consist of multiple attributes (1 or more columns). First, I'd recommend making sure each record is really unique in your table. Secondly, as I understand you left the table without primary key and that's disallowed so yes, you will need to set the key for it.
In this particular case, there is no purpose in the same pair of user IDs being stored more than once in the shortlist table. After all, that table models a set, and an element is either in the set or isn't. Having an element "twice" in the set makes no sense[1]. To prevent that, create a composite key consisting of these two user ID fields.
Whether this composite key will also be primary, or you'll have another key (that would act as surrogate primary key) is another matter, but either way you'll need this composite key.
Please note that under databases that support clustering (aka. index-organized tables), PK is often also a clustering key, which may have significant repercussions on performance.
[1] Unlike in a multiset.
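A minimal sketch of the shortlist table with such a composite key, assuming SQLite through Python's sqlite3 module. The table and column names (users, shortlist, owner_id, listed_id) and the sample users are made up for illustration, not taken from the asker's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE shortlist (
    owner_id  INTEGER NOT NULL REFERENCES users (user_id),
    listed_id INTEGER NOT NULL REFERENCES users (user_id),
    PRIMARY KEY (owner_id, listed_id)   -- each pair can appear only once
);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO shortlist VALUES (1, 2), (1, 3);
""")


def add_to_shortlist(owner_id, listed_id):
    """Return True if the pair was added, False if the key rejected it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO shortlist (owner_id, listed_id) VALUES (?, ?)",
                (owner_id, listed_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


new_pair_added = add_to_shortlist(2, 1)    # a pair not stored yet
duplicate_added = add_to_shortlist(1, 2)   # already on user 1's shortlist

# The shortlist for one user: every listed user whose owner is that user.
names_for_user_1 = [
    row[0]
    for row in conn.execute(
        "SELECT u.name FROM shortlist s "
        "JOIN users u ON u.user_id = s.listed_id "
        "WHERE s.owner_id = ? ORDER BY u.name",
        (1,),
    )
]
print(new_pair_added, duplicate_added, names_for_user_1)
```

No extra auto-incrementing column is needed for this to work; whether you add one as well is the separate question discussed above.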
A table with duplicate rows is not an adequate representation of a relation. It's a bag of rows, not a set of rows. If you let this happen, you'll eventually find that your counts will be off, your sums will be off, and your averages will be off. In short, you'll get confusing errors out of your data when you go to use it.
Declaring a primary key is a convenient way of preventing duplicate rows from getting into the database, even if one of the application programs makes a mistake. The index you obtain is a side effect.
Foreign key references to a single row in a table could be made by referencing any candidate key. However, it's much more convenient if you declare one of those candidate keys as a primary key, and then make all foreign key references refer to the primary key. It's just careful data management.
The one-to-one correspondence between entities in the real world and corresponding rows in the table for that entity is beyond the realm of the DBMS. It's up to your applications and even your data providers to maintain that correspondence by not inventing new rows for existing entities and not letting some new entities slip through the cracks.
Well, since you are asking: it's good practice, but in a few instances (no joins needed to the data) it may not be absolutely required. The biggest problem, though, is that you never really know whether requirements will change, so you really want one now rather than adding one to a 10m record table after the fact.
In addition to a primary key (which can span multiple columns btw) I think it is good practice to have a secondary candidate key which is a single field. This makes joins easier.
First some theory. You may remember the definition of a function from HS or college algebra: y = f(x), where f is a function if and only if for every x there is exactly one y. In relational math we would say that y is functionally dependent on x in this case.
The same is true of your data. Suppose we are storing check numbers, checking account numbers, and amounts. Assuming that we may have several checking accounts and that for each checking account duplicate check numbers are not allowed, then amount is functionally dependent on (account, check_number). In general you want to store data together which is functionally dependent on the same thing, with no transitive dependencies. A primary key will typically be the functional dependency you specify as the primary one. This then identifies the rest of the data in the row (because it is tied to that identifier). Think of this as the natural primary key. Where possible (i.e. not using MySQL) I like to declare the primary key to be the natural one, even if it spans across columns. This gets complicated sometimes where you may have multiple interchangeable candidate keys. For example, consider:
CREATE TABLE country (
id serial not null unique,
name text primary key,
short_name text not null unique
);
This table really could have any column be the primary key. All three are perfectly acceptable candidate keys. Suppose we have a country record (232, 'United States', 'US'). Each of these fields uniquely identifies the record so if we know one we can know the others. Each one could be defined as the primary key.
I also recommend having a second, artificial candidate key which is just a machine identifier used for linking for joins. In the above example country.id does this. This can be useful for linking other records to the country table.
An exception to needing a candidate key might be where duplicate records really are possible. For example, suppose we are tracking invoices. We may have a case where someone is invoiced independently for two items with one showing on each of two line items. These could be identical. In this case you probably want to add an artificial primary key because it allows you to join things to that record later. You might not have a need to do so now but you may in the future!
Create a composite primary key.
To read more about what a composite primary key is, visit
http://www.relationaldbdesign.com/relational-database-analysis/module2/concatenated-primary-keys.php
I'm currently designing a brand new database. In school, we always learned to put a primary key in each table.
I read a lot of articles/discussions/newsgroups posts saying that it's better to use unique constraint (aka unique index for some db) instead of PK.
What's your point of view?
A Primary Key is really just a candidate key that does not allow for NULL. As such, in SQL terms - it's no different than any other unique key.
However, for our non-theoretical RDBMS's, you should have a Primary Key - I've never heard it argued otherwise. If that Primary Key is a surrogate key, then you should also have unique constraints on the natural key(s).
The important bit to walk away with is that you should have unique constraints on all the candidate (whether natural or surrogate) keys. You should then pick the one that is easiest to reference in a Foreign Key to be your Primary Key*.
You should also have a clustered index*. This could be your Primary Key, or a natural key, but it's not required to be either. You should pick your clustered index based on query usage of the table. When in doubt, the Primary Key is not a bad first choice.
Though it's technically only required to refer to a unique key in a foreign key relationship, it's accepted standard practice to greatly favor the primary key. In fact, I wouldn't be surprised if some RDBMS only allow primary key references.
Edit: It's been pointed out that Oracle's terms "clustered table" and "clustered index" mean something different than in SQL Server. The equivalent of what I'm speaking of in Oracle-ese is an Index-Organized Table, and it is recommended for OLTP tables, which, I think, would be the main focus of SO questions. I assume if you're responsible for a large OLAP data warehouse, you should already have your own opinions on database design and optimization.
Can you provide references to these articles?
I see no reason to change the tried and true methods. After all, Primary Keys are a fundamental design feature of relational databases.
Using UNIQUE to serve the same purpose sounds really hackish to me. What is their rationale?
Edit: My attention just got drawn back to this old answer. Perhaps the discussion that you read regarding PK vs. UNIQUE dealt with people making something a PK for the sole purpose of enforcing uniqueness on it. The answer to this is: if it IS a key, then make it a key, otherwise make it UNIQUE.
A primary key is just a candidate key (unique constraint) singled out for special treatment (automatic creation of indexes, etc).
I expect that the folks who argue against them see no reason to treat one key differently than another. That's where I stand.
[Edit] Apparently I can't comment even on my own answer without 50 points.
@chris: I don't think there's any harm. "Primary Key" is really just syntactic sugar. I use them all the time, but I certainly don't think they're required. A unique key is required, yes, but not necessarily a Primary Key.
It would be very rare denormalization that would make you want to have a table without a primary key. Primary keys have unique constraints automatically just by their nature as the PK.
A unique constraint would be used when you want to guarantee uniqueness in a column in ADDITION to the primary key.
The rule of always having a PK is a good one.
http://msdn.microsoft.com/en-us/library/ms191166.aspx
You should always have a primary key.
However, I suspect your question is just worded a bit misleadingly, and you actually mean to ask whether the primary key should always be an automatically generated number (also known as a surrogate key), or some unique field which is actual meaningful data (also known as a natural key), like SSN for people, ISBN for books and so on.
This question is an age old religious war in the DB field.
My take is that natural keys are preferable if they indeed are unique and never change. However, you should be careful: even something seemingly stable like a person's SSN may change under certain circumstances.
Unless the table is a temporary table to stage the data while you work on it, you always want to put a primary key on the table and here's why:
1 - a unique constraint can allow nulls but a primary key never allows nulls. If you run a query with a join on columns with null values you eliminate those rows from the resulting data set because null is not equal to null. This is how even big companies can make accounting errors and have to restate their profits. Their queries didn't show certain rows that should have been included in the total because there were null values in some of the columns of their unique index. Shoulda used a primary key.
2 - a unique index will automatically be placed on the primary key, so you don't have to create one.
3 - some database engines will automatically put a clustered index on the primary key (SQL Server does so by default, and MySQL's InnoDB always clusters on it), making queries faster because the rows are stored contiguously in the data blocks. (In SQL Server this can be altered to place the clustered index on a different index if that would speed up the queries.) If a table doesn't have a clustered index, the rows won't be stored contiguously in the data blocks, making the queries slower because the read/write head has to travel all over the disk to pick up the data.
4 - many front end development environments require a primary key in order to update the table or make deletions.
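Point 1 is easy to reproduce. A minimal sketch, assuming SQLite through Python's sqlite3 module and two made-up tables (ledger and region_names): the row whose join column is NULL silently drops out of the joined total, because NULL = NULL does not evaluate to true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ledger (region TEXT, amount INTEGER);
CREATE TABLE region_names (region TEXT UNIQUE, label TEXT);  -- UNIQUE still admits NULL

INSERT INTO ledger VALUES ('east', 100), ('west', 200), (NULL, 50);
INSERT INTO region_names VALUES ('east', 'East'), ('west', 'West'), (NULL, 'Unassigned');
""")

# Total straight from the ledger: every row counts.
total_all = conn.execute("SELECT SUM(amount) FROM ledger").fetchone()[0]

# Total after joining on the nullable column: the NULL row never matches.
total_joined = conn.execute(
    "SELECT SUM(l.amount) FROM ledger l "
    "JOIN region_names r ON r.region = l.region"
).fetchone()[0]

missing = total_all - total_joined
print(total_all, total_joined, missing)
```

With NOT NULL columns behind a primary key, the two totals cannot drift apart for this reason.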
Primary keys should be used in situations where you will be establishing relationships from this table to other tables that will reference this value. However, depending on the nature of the table and the data that you're thinking of applying the unique constraint to, you may be able to use that particular field as a natural primary key rather than having to establish a surrogate key. Of course, surrogate vs natural keys are a whole other discussion. :)
Unique keys can be used if there will be no relationship established between this table and other tables. For example, a table that contains a list of valid email addresses that will be compared against before inserting a new user record or some such. Or unique keys can be used when you have values in a table that has a primary key but must also be absolutely unique. For example, if you have a users table that has a user name. You wouldn't want to use the user name as the primary key, but it must also be unique in order for it to be used for log in purposes.
We need to make a distinction here between logical constructs and physical constructs, and similarly between theory and practice.
To begin with: from a theoretical perspective, if you don't have a primary key, you don't have a table. It's just that simple. So, your question isn't whether your table should have a primary key (of course it should) but how you label it within your RDBMS.
At the physical level, most RDBMSs implement the Primary Key constraint as a Unique Index. If your chosen RDBMS is one of these, there's probably not much practical difference between designating a column as a Primary Key and simply putting a unique constraint on the column. However, one of these options captures your intent, and the other doesn't. So, the decision is a no-brainer.
Furthermore, some RDBMSs make additional features available if Primary Keys are properly labelled, such as diagramming, and semi-automated foreign-key-constraint support.
Anyone who tells you to use Unique Constraints instead of Primary Keys as a general rule should provide a pretty damned good reason.
The thing is that a primary key is one or more columns which uniquely identify a single record of a table, whereas a Unique Constraint is just a constraint on one or more fields which allows only a single instance of any given value (or combination of values) in a table.
PERSONALLY, I use either GUIDs or auto-incrementing BIGINTs (IDENTITY columns in SQL Server) for unique keys utilized for cross referencing amongst my tables. Then I'll use other data to allow the user to select specific records.
For example, I'll have a list of employees, and have a GUID attached to every record that I use behind the scenes, but when the user selects an employee, they're selecting them based off of the following fields: LastName + FirstName + EmployeeNumber.
My primary key in this scenario is LastName + FirstName + EmployeeNumber while unique key is the associated GUID.
posts saying that it's better to use unique constraint (aka unique index for some db) instead of PK
I guess that the only point here is the same old discussion "natural vs surrogate keys", because unique indexes and PKs are the same thing.
translating:
posts saying that it's better to use natural key instead of surrogate key
I usually use both PK and UNIQUE KEY. Because even if you don't denote PK in your schema, one is always generated for you internally. It's true both for SQL Server 2005 and MySQL 5.
But I don't use the PK column in my SQLs. It is for management purposes like DELETEing some erroneous rows, finding out gaps between PK values if it's set to AUTO INCREMENT. And, it makes sense to have a PK as numbers, not a set of columns or char arrays.
I've written a lot on this subject: if you read anything of mine be clear that I was probably referring specifically to Jet a.k.a. MS Access.
In Jet, the tables are physically ordered on the PRIMARY KEY using a non-maintained clustered index (is clustered on compact). If the table has no PK but does have candidate keys defined using UNIQUE constraints on NOT NULL columns then the engine will pick one for the clustered index (if your table has no clustered index then it is called a heap, arguably not a table at all!) How does the engine pick a candidate key? Can it pick one which includes nullable columns? I really don't know. The point is that in Jet the only explicit way of specifying the clustered index to the engine is to use PRIMARY KEY. There are of course other uses for the PK in Jet e.g. it will be used as the key if one is omitted from a FOREIGN KEY declaration in SQL DDL but again why not be explicit.
The trouble with Jet is that most people who create tables are unaware of or unconcerned about clustered indexes. In fact, most users (I wager) put an autoincrement Autonumber column on every table and define the PRIMARY KEY solely on this column while failing to put any unique constraints on the natural key and candidate keys (whether an autoincrement column can actually be regarded as a key without exposing it to end users is another discussion in itself). I won't go into detail about clustered indexes here but suffice to say that IMO a sole autoincrement column is rarely the ideal choice.
Whatever your SQL engine, the choice of PRIMARY KEY is arbitrary and engine specific. Usually the engine will apply special meaning to the PK, therefore you should find out what it is and use it to your advantage. I encourage people to use NOT NULL UNIQUE constraints in the hope they will give greater consideration to all candidate keys, especially when they have chosen to use 'autonumber' columns which (should) have no meaning in the data model. But I'd rather folk chose one well considered key and used PRIMARY KEY on it than put it on the autoincrement column out of habit.
Should all tables have a PK? I say yes because doing otherwise means at the very least you are missing out on a slight advantage the engine affords the PK and at worst you have no data integrity.
BTW Chris OC makes a good point here about temporal tables, which require sequenced primary keys (lowercase) which cannot be implemented via simple PRIMARY KEY constraints (SQL key words in uppercase).
PRIMARY KEY
1. Null
It doesn’t allow NULL values. In other words, PRIMARY KEY = UNIQUE KEY + NOT NULL constraint.
2. INDEX
By default it adds a clustered index (this is SQL Server's behaviour).
3. LIMIT
A table can have only one PRIMARY KEY (over one or more columns).
UNIQUE KEY
1. Null
Allows NULL, but only one NULL value (in SQL Server).
2. INDEX
By default it adds a UNIQUE non-clustered index.
3. LIMIT
A table can have more than one UNIQUE KEY (each over one or more columns).
If you plan on using LINQ-to-SQL, your tables will require Primary Keys if you plan on performing updates, and they will require a timestamp column if you plan on working in a disconnected environment (such as passing an object through a WCF service application).
If you like .NET, PK's and FK's are your friends.
I submit that you may need both. Primary keys by nature need to be unique and not nullable. They are often surrogate keys, as integers create faster joins than character fields and especially than multiple field character joins. However, as these are often autogenerated, they do not guarantee uniqueness of the data record excluding the id itself. If your table has a natural key that should be unique, you should have a unique index on it to prevent data entry of duplicates. This is a basic data integrity requirement.
Edited to add: It is also a real problem that real world data often does not have a natural key that truly guarantees uniqueness in a normalized table structure, especially if the database is people centered. Names, even name, address and phone number combined (think father and son in the same medical practice) are not necessarily unique.
I was thinking about this problem myself. If you use UNIQUE instead, you violate second normal form (2NF). According to 2NF, every non-PK attribute has to depend on the PK. The pair of attributes in this unique constraint should be considered part of the PK.
Sorry for replying to this 7 years later, but I didn't want to start a new discussion.
I have the following tables in my database that have a many-to-many relationship, which is expressed by a connecting table that has foreign keys to the primary keys of each of the main tables:
Widget: WidgetID (PK), Title, Price
User: UserID (PK), FirstName, LastName
Assume that each User-Widget combination is unique. I can see two options for how to structure the connecting table that defines the data relationship:
UserWidgets1: UserWidgetID (PK), WidgetID (FK), UserID (FK)
UserWidgets2: WidgetID (PK, FK), UserID (PK, FK)
Option 1 has a single column for the Primary Key. However, this seems unnecessary since the only data being stored in the table is the relationship between the two primary tables, and this relationship itself can form a unique key. Thus leading to option 2, which has a two-column primary key, but loses the one-column unique identifier that option 1 has. I could also optionally add a two-column unique index (WidgetID, UserID) to the first table.
Is there any real difference between the two performance-wise, or any reason to prefer one approach over the other for structuring the UserWidgets many-to-many table?
You only have one primary key in either case. The second one is what's called a compound key. There's no good reason for introducing a new column. In practice, you will have to keep a unique index on all candidate keys. Adding a new column buys you nothing but maintenance overhead.
Go with option 2.
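A minimal sketch of the difference between the two options, assuming SQLite through Python's sqlite3 module. The snake_case table names are stand-ins for UserWidgets1 and UserWidgets2, and the foreign keys to Widget and User are left out to keep it short. Option 1 is shown without the optional two-column unique index, which is exactly when duplicates can slip in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Option 1: surrogate key only, no unique index on the pair
CREATE TABLE user_widgets_1 (
    user_widget_id INTEGER PRIMARY KEY,
    widget_id INTEGER NOT NULL,
    user_id   INTEGER NOT NULL
);
-- Option 2: the pair itself is the (compound) primary key
CREATE TABLE user_widgets_2 (
    widget_id INTEGER NOT NULL,
    user_id   INTEGER NOT NULL,
    PRIMARY KEY (widget_id, user_id)
);
""")


def insert_pair(table, widget_id, user_id):
    """Return True if the row was accepted, False if a key rejected it."""
    try:
        conn.execute(
            f"INSERT INTO {table} (widget_id, user_id) VALUES (?, ?)",
            (widget_id, user_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Insert the same (widget, user) pair twice into each table.
results = {
    table: [insert_pair(table, 664, 1), insert_pair(table, 664, 1)]
    for table in ("user_widgets_1", "user_widgets_2")
}
print(results)
```

If you do keep option 1, adding the unique index on (widget_id, user_id) closes the gap, at the cost of maintaining two indexes instead of one.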
Personally, I would have the synthetic/surrogate key column in many-to-many tables for the following reasons:
If you've used numeric synthetic keys in your entity tables then having the same on the relationship tables maintains consistency in design and naming convention.
It may be the case in the future that the many-to-many table itself becomes a parent entity to a subordinate entity that needs a unique reference to an individual row.
It's not really going to use that much additional disk space.
The synthetic key is not a replacement to the natural/compound key nor becomes the PRIMARY KEY for that table just because it's the first column in the table, so I partially agree with the Josh Berkus article. However, I don't agree that natural keys are always good candidates for PRIMARY KEY's and certainly should not be used if they are to be used as foreign keys in other tables.
Option 2 uses a simple compound key, option 1 uses a surrogate key. Option 2 is preferred in most scenarios and is close to the relational model in that it is a good candidate key.
There are situations where you may want to use a surrogate key (Option 1):
You are not certain that the compound key is a good candidate key over time, particularly with temporal data (data that changes over time). What if you wanted to add another row to the UserWidget table with the same UserId and WidgetId? Think of Employment(EmployeeId, EmployerId): it would work in most cases, except if someone went back to work for the same employer at a later date.
If you are creating messages/business transactions or something similar that requires an easier key to use for integration. Replication maybe?
If you want to create your own auditing mechanisms (or similar) and don't want keys to get too long.
As a rule of thumb, when modeling data you will find that most associative entities (many to many) are the result of an event. Person takes up employment, item is added to basket etc. Most events have a temporal dependency on the event, where the date or time is relevant - in which case a surrogate key may be the best alternative.
So, take option 2, but make sure that you have the complete model.
I agree with the previous answers but I have one remark to add.
If you want to add more information to the relation and allow more relations between the same two entities you need option one.
For example, if you want to track all the times user 1 has used widget 664 in the userwidget table, the (userid, widgetid) pair is no longer unique.
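A minimal sketch of that usage-tracking case, assuming SQLite through Python's sqlite3 module. The table name widget_usage, the used_at column and the timestamps are all hypothetical. Here each row is one use, so the surrogate key is what identifies it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE widget_usage (
    usage_id  INTEGER PRIMARY KEY,   -- surrogate key: one row per use
    user_id   INTEGER NOT NULL,
    widget_id INTEGER NOT NULL,
    used_at   TEXT NOT NULL          -- hypothetical timestamp column
);
""")

# The same (user, widget) pair may now legitimately appear many times.
conn.executemany(
    "INSERT INTO widget_usage (user_id, widget_id, used_at) VALUES (?, ?, ?)",
    [
        (1, 664, "2024-01-01T09:00"),
        (1, 664, "2024-01-02T10:30"),
        (2, 664, "2024-01-02T11:00"),
    ],
)

uses_by_user_1 = conn.execute(
    "SELECT COUNT(*) FROM widget_usage WHERE user_id = ? AND widget_id = ?",
    (1, 664),
).fetchone()[0]
distinct_ids = conn.execute(
    "SELECT COUNT(DISTINCT usage_id) FROM widget_usage"
).fetchone()[0]
print(uses_by_user_1, distinct_ids)
```

An alternative with no surrogate key would be to widen the compound key to include the timestamp, which only works if two uses can never share one.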
What is the benefit of a primary key in this scenario? Consider the option of no primary key:
UserWidgets3: WidgetID (FK), UserID (FK)
If you want uniqueness then use either the compound key (UserWidgets2) or a uniqueness constraint.
The usual performance advantage of having a primary key is that you often query the table by the primary key, which is fast. In the case of many-to-many tables you don't usually query by the primary key so there is no performance benefit. Many-to-many tables are queried by their foreign keys, so you should consider adding indexes on WidgetID and UserID.
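A minimal sketch of the no-primary-key variant with an index on each foreign key column, assuming SQLite through Python's sqlite3 module. The table and index names are made up, and the foreign key references themselves are omitted for brevity. The query plan is printed rather than asserted, since its wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_widgets_3 (
    widget_id INTEGER NOT NULL,
    user_id   INTEGER NOT NULL
);
-- One index per lookup direction
CREATE INDEX idx_user_widgets_3_widget ON user_widgets_3 (widget_id);
CREATE INDEX idx_user_widgets_3_user   ON user_widgets_3 (user_id);

INSERT INTO user_widgets_3 VALUES (664, 1), (664, 2), (700, 1);
""")

# Which indexes exist on the table (column 1 of the pragma row is the name).
index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list('user_widgets_3')")
)

# Typical many-to-many lookup: all widgets for one user.
widgets_for_user_1 = [
    row[0]
    for row in conn.execute(
        "SELECT widget_id FROM user_widgets_3 WHERE user_id = ? ORDER BY widget_id",
        (1,),
    )
]

# Ask SQLite how it intends to run the lookup; the last column is the description.
plan = [
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT widget_id FROM user_widgets_3 WHERE user_id = ?",
        (1,),
    )
]
print(index_names, widgets_for_user_1, plan)
```

Note that nothing here stops the same pair being inserted twice; that still needs the compound key or a uniqueness constraint mentioned above.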
Option 2 is the correct answer, unless you have a really good reason to add a surrogate numeric key (which you have done in option 1).
Surrogate numeric key columns are not 'primary keys'. A primary key is technically one of the combinations of columns that uniquely identify a record within a table.
Anyone building a database should read this article http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-i-7327 by Josh Berkus to understand the difference between surrogate numeric key columns and primary keys.
In my experience the only real reason to add a surrogate numeric key to your table is if your primary key is a compound key and needs to be used as a foreign key reference in another table. Only then should you even think to add an extra column to the table.
Whenever I see a database structure where every table has an 'id' column the chances are it has been designed by someone who doesn't appreciate the relational model and it will invariably display one or more of the problems identified in Josh's article.
I would go with both.
Hear me out:
The compound key is obviously the nice, correct way to go in so far as reflecting the meaning of your data goes. No question.
However: I have had all sorts of trouble making Hibernate work properly unless I use a single generated primary key, i.e. a surrogate key.
So I would use a logical and physical data model. The logical one has the compound key. The physical model - which implements the logical model - has the surrogate key and foreign keys.
Since each User-Widget combination is unique, you should represent that in your table by making the combination unique. In other words, go with option 2. Otherwise you may have two entries with the same widget and user IDs but different user-widget IDs.
The userwidgetid in the first table is not needed, as like you said the uniqueness comes from the combination of the widgetid and the userid.
I would use the second table, keep the foreign keys and add a unique index on widgetid and userid.
So:
CREATE TABLE userwidgets (
    widgetid INT NOT NULL REFERENCES Widget (WidgetID),
    userid INT NOT NULL REFERENCES "User" (UserID),
    UNIQUE (widgetid, userid)
);
There is some performance gain in not having the extra primary key, as the database does not need to maintain an index for that key. In the above model the index on the pair (through the unique index) is still maintained, but I believe that this is easier to understand.