Database: Should tables always be normalized and have primary keys? - database

I have a database storing customer enquiries about products.
The enquiry reference (text), product number (int) and revision number (int) together uniquely identify a single discussion between sales and a customer.
As a result, there are many tables, each for a specific detail about a single enquiry, uniquely identified by the enq, pdt and rev values combined.
The CREATE TABLE does not use any AUTO INCREMENT UNIQUE PRIMARY KEY for any field.
My question is, is this database design acceptable?
Should tables always be normalized?
Thanks for the advice.

There's no need to use AUTOINCREMENT, but every table should have a PRIMARY KEY of some kind. A primary key can be a combination of several fields that together identify the record uniquely.
Based on what you've told us, yes, the design is acceptable, provided you explicitly declare the combination of the enquiry reference (text), product number (int) and revision number (int) as a primary key that together uniquely identifies a single discussion.
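For illustration, a minimal sketch of such a declaration (table and column names are assumptions based on the question; adjust the types to match your data):
CREATE TABLE enquiry_detail (
enq VARCHAR(50) NOT NULL, -- enquiry reference (text)
pdt INT NOT NULL,         -- product number
rev INT NOT NULL,         -- revision number
detail VARCHAR(200),      -- whatever this particular table stores
PRIMARY KEY (enq, pdt, rev) -- the three columns together form the key
);
The same composite key, declared on every per-detail table, both documents the design and lets the database enforce it.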
People sometimes denormalize a database for performance reasons. If select queries are far more frequent than inserts and updates, and the select query of interest is slow to return because of the number of tables it has to join, then consider denormalizing.
If you supply a specific query that is running slow for you, you'll get lots of specific advice.

Having a PRIMARY KEY (or a UNIQUE constraint) will, first, ensure that these values are really unique, and, second, will greatly improve the searches for a given enquiry.
A PRIMARY KEY implies creating an index over (enq, pdt, rev), and this query:
SELECT *
FROM enquiries
WHERE enq = 'enquiry'
AND pdt = 1234
AND rev = 2
will complete in a single index seek.
Without the index, this query will require scanning the whole table, and there is no guarantee that you won't end up with duplicates.
Except under very, very special conditions (like heavily inserted log tables), you should always have a PRIMARY KEY on your tables.

Personally, I always have some sort of primary key on all tables, even if it is an auto-increment number used for nothing else.
As to normalization, I think one should strive for normalized tables, but in reality there are many good cases where a table design is good but not normalized. This is where the 'theory' of DB design meets reality. It is good to know what normalization is, strive for it, and have good reasons when you deviate from the rules (as opposed to just being ignorant of the rules or, worse, ignoring good design rules).

These are two questions.
(1) It is not always required to have an auto-increment key. It is practical, though, since you can use it to manipulate your data easily. Also, having no duplicates is not a must.
(2) Normalization is a must when you do homework for school, but if things get tough you can break it to make your life easier, as long as you do not endanger your data integrity.

I am splitting from the herd on this one. Do NOT make your enquiry reference (text), product number (int) and revision number (int) the primary key. You indicated the enquiry reference is a text type: did you mean it would be 25 or 50 or 500 characters wide? If the primary key is made from those fields, it will be too wide in my view. It will be appended to every index created for that table, increasing the size of every index row by the size of the three fields, and any table which needs a foreign key back to this table will also need all three fields.
Make the three fields a unique index. Place an auto-increment value as the primary key and make it the clustered index. The tables which link back to this master table will then have a small footprint in memory when joining the data from table one to table two.
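A rough sketch of that layout, assuming SQL Server syntax (names and column sizes are illustrative only):
CREATE TABLE enquiry (
enquiry_id INT IDENTITY(1,1) NOT NULL, -- narrow surrogate key
enq VARCHAR(50) NOT NULL,              -- enquiry reference
pdt INT NOT NULL,                      -- product number
rev INT NOT NULL,                      -- revision number
CONSTRAINT pk_enquiry PRIMARY KEY CLUSTERED (enquiry_id),
CONSTRAINT uq_enquiry UNIQUE (enq, pdt, rev) -- natural key still enforced
);
Child tables then carry only the small enquiry_id as their foreign key instead of the three-column combination.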
As far as normalization goes, it does not matter whether the data is normalized or not if your data is only a few thousand rows, or even 50,000 or 500,000. It only becomes an issue when the data starts getting bigger than the available RAM cache.
Design a view to present the data to the application to fulfill the business rule. Design stored procedures to accept data to store. Design the table structure to meet the response time in the SLA. If you have to normalize, denormalize, partition, index or get a bigger server to meet the SLA, the app will never know, because you are always supplying the data via the view which meets the business rule.

There is nothing in normalization theory that deals with whether a table should have a simple or compound primary key. Believe it or not, the concept of "primary key" is not a component of the relational model of data.
Having said that, tables should nearly always be defined with a primary key. The primary key need not be a single column, and it need not be filled in by an autoincrement. In your case, it could be the three columns that taken together uniquely identify an enquiry.
If a table has no declared primary key, it could end up with duplicate rows. A table with duplicate rows represents a bag of tuples, not a set of tuples. Once you are dealing with bags instead of sets, the results predicted by the relational model need not apply. That is why preventing duplicate rows is so important.

Related

Many-to-many relationship structure in SQL Server with or without extra primary key column?

Assume that we have two tables: Roles and Reports, and there exists a many-to-many relationship between them. Of course, the only solution that comes to my mind is to create a cross-table, let's name it RoleReport. I can see two approaches to the structure of that table:
1. Columns: RoleReportId, RoleId, ReportId
PK: RoleReportId
2. Columns: RoleId, ReportId
PK: RoleId, ReportId
Is there any real difference between them (performance or whatever else)?
You will need a composite UNIQUE index on (RoleId, ReportId) anyway.
There is no point in not making it the PRIMARY KEY.
If you make it a CLUSTERED PRIMARY KEY (which is the default), this will be better performance-wise, since it will be smaller in size.
A clustered primary key will contain only two columns in each record: RoleID and ReportID, while a secondary index will contain three columns: RoleID, ReportID and RoleReportID (as a row pointer).
You may want to create an additional index on ReportID which may be used to search all Roles for a given Report.
There would be some point in making a surrogate key for this relationship if the two following conditions held:
You have additional attributes in your relationship (i.e. this table contains additional columns, like Date or anything else)
You have lots of tables that refer to this relationship with a FOREIGN KEY
In this case it would be nicer to have a single-column PRIMARY KEY to refer to in FOREIGN KEY relationships.
Since you don't seem to have this need, just make a composite PRIMARY KEY.
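A hedged sketch of option 2, assuming the parent tables expose single-column keys named RoleId and ReportId:
CREATE TABLE RoleReport (
RoleId INT NOT NULL,
ReportId INT NOT NULL,
PRIMARY KEY (RoleId, ReportId), -- composite key, clustered by default
FOREIGN KEY (RoleId) REFERENCES Roles (RoleId),
FOREIGN KEY (ReportId) REFERENCES Reports (ReportId)
);
-- optional: supports "all roles for a given report" lookups
CREATE INDEX ix_RoleReport_ReportId ON RoleReport (ReportId);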
You don't actually need the RoleReportId. It adds nothing to the relationship.
Many people try to avoid using a naturally-unique key in real tables, instead opting for an artificially unique one, but I don't always agree with that. For example, if you can be sure that your SSN will never change, you can use that as a key. If it somehow does change in the future, you can fix it then.
But I don't intend arguing that point, there's good arguments on both sides. However, you certainly don't need an artificially unique key in this case since both your other fields are, and will remain, unique.
Unless you really need the RoleReportId as a foreign key in some other table (which is not usually going to be the case), go with option 2. It's going to require less storage, and that by itself will probably give a performance advantage -- plus why have a column you're never going to use?
Semantically, the difference is what you're using as the primary key.
Typically I let the remainder of my schema dictate what I do in this situation. If the cross-table is exclusively the implementation of the many-to-many relationship, I tend to use the concatenated primary key. If I'm hanging more information off the cross table, making it an entity in its own right, I'm more inclined to give it its own id independent of the two tables it's connecting.
This is, of course, subjective. I don't claim that this is the One True Way (tm).
If you have many rows, then it might be beneficial to have appropriately ordered indexes on your RoleId and/or ReportId columns, since this will speed up look up operations - but inversely this will slow down insert/delete operations. This is a classic usage profile issue...
If not required otherwise, omit the RoleReportId PK. It adds nothing to the relationship, forces the Server to generate a useless number on each insert, and leaves the other two columns unordered, which slows down lookups.
But all in all, we are talking about milliseconds here. This only becomes relevant if there is a huge amount of data (say more than 10,000 rows)...
I would suggest you choose no PK for your second option. You may use indexes or a unique constraint over the combination of both columns.
The benefit of using RoleReportID as a single-column primary key comes when you (or the other guy, depending on the structure of your company) need to write a front end that addresses individual role<->report relationships (for instance, to delete one). At that point, you may prefer the fact that you need to address only one column, instead of two, to identify the linking record.
Other than that, you don't need the RoleReportID column.

Should a database table always have primary keys?

Should I always have a primary key in my database tables?
Let's take the SO tagging. You can see the tags for any revision; they are likely to be in a tag_rev table with the postID and revision number. Would I need a PK for that?
Also, since it is in a rev table and not currently in use, should the tags be stored as a blob of tagIDs instead of multiple post_id/tag_id pairs?
A table should have a primary key so that you could identify each row uniquely with it.
Technically, you can have tables without a primary key, but you'll be breaking good database design rules.
You should strive to have a primary key in any non-trivial table where you're likely to want to access (or update or delete) individual records by that key. Primary keys can consist of multiple columns, and formally speaking, will be the shortest available superkey; that is, the shortest available group of columns which, together, uniquely identify any row.
I don't know what the Stack Overflow database schema looks like (and from some of the things I've read on Jeff's blog, I don't want to), but in the situation you describe, it's entirely possible there is a primary key across the post identifier, revision number and tag value; certainly, that would be the shortest (and only) superkey available.
With regards to your second point, while it may be reasonable to argue in favour of aggregating values in archive tables, it does go against the principle that each row/column intersection in a table ought to contain one single value. While it may slightly simplify development, there is no reason you can't keep to a normalised table with versioned metadata, even for something as trivial as tags.
I tend to agree that most tables should have a primary key. I can only think of two times where it doesn't make sense to do it.
If you have a table that relates keys to other keys. For example, to relate a user_id to an answer_id, that table wouldn't need a primary key.
A logging table, whose only real purpose is to create an audit trail.
Basically, if you are writing a table that may ever need to be referenced in a foreign key relationship then a primary key is important, and if you can't be positive it won't be, then just add the PK. :)
See this related question about whether an integer primary key is required. One of the answers uses tagging as an example:
Are there any good reasons to have a database table without an integer primary key
For more discussion of tagging and keys, see this question:
Id for tags in tag systems
From MySQL 5.5 Reference Manual section 13.1.17:
If you do not have a PRIMARY KEY and an application asks for the PRIMARY KEY in your tables, MySQL returns the first UNIQUE index that has no NULL columns as the PRIMARY KEY.
So, technically, the answer is no. However, as others have stated, in most cases it is quite useful.
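As a hedged illustration of that InnoDB behaviour (table and column names are invented for the tagging example above):
CREATE TABLE tag_rev (
post_id INT NOT NULL,
revision INT NOT NULL,
tag_id INT NOT NULL,
-- no explicit PRIMARY KEY; InnoDB will use this NOT NULL unique index
-- as the table's clustered key
UNIQUE KEY uq_tag_rev (post_id, revision, tag_id)
) ENGINE=InnoDB;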
I firmly believe every table should have a way to uniquely identify a record. For 99% of tables, this is a primary key. For the rest you may get away with a unique index (I'm thinking of one-column lookup-type tables here). Any time I have had to work with a table without a way to uniquely identify records, there has been trouble.
I also believe that if you are using surrogate keys as your PK, you should, wherever possible, have a separate unique index on whatever combination of fields makes up the natural key. I realize there are all too many times when you don't have a true natural key (names are not unique, or what makes something unique might be spread across several parent-child tables), but if you do have one, please, please, please make sure it has a unique index or is created as the PK.
If there is no PK, how will you update or delete a single row? It would be impossible! To be honest, I have used tables without a PK a few times, for instance to store activity logs, but even in those cases it is advisable to have one, because the timestamps might not be granular enough. Temporary tables are another example. But according to relational theory the PK is mandatory.
It is good to have keys and relationships. They help a lot. However, if your app is good enough to handle the relationships, then you could possibly skip the keys (although I recommend that you have them).
Since I use Subsonic, I always create a primary key for all of my tables. Many DB Abstraction libraries require a primary key to work.
Note: that doesn't answer the "Grand Unified Theory" tone of your question, but I'm just saying that in practice, sometimes you MUST make a primary key for every table.
If it's a join table then I wouldn't say that you need a primary key. Suppose, for example, that you have tables PERSONS, SICKPEOPLE, and ILLNESSES. The ILLNESSES table has things like flu, cold, etc., each with a primary key. PERSONS has the usual stuff about people, each also with a primary key. The SICKPEOPLE table only has people in it who are sick, and it has two columns, PERSONID and ILLNESSID, foreign keys back to their respective tables, and no primary key. The PERSONS and ILLNESSES tables contain entities and entities get primary keys. The entries in the SICKPEOPLE table aren't entities and don't get primary keys.
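A minimal sketch of that SICKPEOPLE table, assuming PERSONS and ILLNESSES have single-column keys named PERSONID and ILLNESSID (note that, without a primary key or unique constraint, nothing stops the same person/illness pair from appearing twice):
CREATE TABLE SICKPEOPLE (
PERSONID INT NOT NULL,
ILLNESSID INT NOT NULL,
FOREIGN KEY (PERSONID) REFERENCES PERSONS (PERSONID),
FOREIGN KEY (ILLNESSID) REFERENCES ILLNESSES (ILLNESSID)
);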
Databases don't have keys, per se, but their constituent tables might. I assume you mean that, but just in case...
Anyway, tables with a large number of rows should absolutely have primary keys; tables with only a few rows don't need them, necessarily, though they don't hurt. It depends upon the usage and the size of the table. Purists will put primary keys in every table. This is not wrong; and neither is omitting PKs in small tables.
Edited to add a link to my blog entry on this question, in which I discuss a case in which database administration staff did not consider it necessary to include a primary key in a particular table. I think this illustrates my point adequately.
Cyberherbalist's Blog Post on Primary Keys

Entity relationship

I have a doubt about this design (er_lp). My doubt is about how to create a many-to-many relationship between entities with composite keys, and secondly about using a date type as a PK. Here, each machine works daily for three shifts, for different userDepts, on one or more fields. So to keep a record of the working and down hours of the machinery, I have used shift, taskDay and machinePlate as PKs. As you will see from the ER diagram, I ended up with too many PKs in the link tables in many places. I am worried about getting into trouble in the coding phase.
Is there a better way to do this?
Thank you !!
Dejene
See also extra information posted as a second question Entity Relationship. The material, reformatted, is:
Elaboration: Yes, 'Field' is referring to areas of land. We have several cane-growing fields at different locations. It [each field?] is named and has a budget.
User is not referring to individuals who work on the machine. They are departments. I used the 'isDone' table to link userDept with machine. A machine can be used by several departments, and many machines can work for a userDept.
A particular machine can be used for multiple tasks on a given shift. It can work for, say, 2 hours and then start another task on another field. We have three shifts per day, each of 8 hours!
If I use an auto-increment PK, do you think the other keys are still important? I prefer not to use one!
Usually, I use an auto-increment key alone in a table. How can we create relationships that involve auto-increment keys?
Thank you for thoughtful comment!!
You always create many-to-many relationships between two tables using a third table, the rows of which contain the columns for the primary key of each table, and the combination of all columns is the primary key of the third table. The rule doesn't change for tables with composite primary keys.
CREATE TABLE Table1(Col11 ..., Col12 ..., Col1N ...,
PRIMARY KEY(Col11, Col12));
CREATE TABLE Table2(Col21 ..., Col22 ..., Col2N ...,
PRIMARY KEY(Col21, Col22));
CREATE TABLE RelationTable
(
Col11 ...,
Col12 ...,
FOREIGN KEY (Col11, Col12) REFERENCES Table1,
Col21 ...,
Col22 ...,
FOREIGN KEY (Col21, Col22) REFERENCES Table2,
PRIMARY KEY (Col11, Col12, Col21, Col22)
);
This works fine. It does suggest that you should try to keep keys simple whenever possible, but there is absolutely no need to bend over backwards adding auto-increment columns to the referenced tables if they have a natural composite key that is convenient to use. OTOH, the joins involving the relation table are harder to write if you use composite keys - I'd think several times about what I was doing if either composite key involved more than two columns, not least because it might indicate problems in the design of the referenced tables.
Looking at the actual ER diagram - the 'er_lp' URL in the question - the 'tbl' prefix seems a trifle unnecessary; the things storing data in a database are always tables, so telling me that with the prefix is ... unnecessary. The table called 'Machine' seems to be misnamed; it does not so much describe a machine as the duty allocated to a machine on a particular shift. I'm guessing that the 'Field' table is referring to areas of land, rather than parts of a database. You have the 'IsDone' table (again, not particularly well named) that identifies the user who worked on a machine for a particular shift and hence for a particular task. That involves a link between the Machine table (which has a 3-part primary key) and the User table. It isn't clear whether a particular machine can be used for multiple tasks on a given shift. It isn't clear whether shift numbers cycle around the day or whether each shift number is unique across days, but the presumption must be that there are, say, three shifts per day, and the shift number and date is needed to identify when something occurred. Presumably, the Shift table would identify times and other such information.
The three-part primary key on Machine is fine - but it might be better to have two unique identifiers. One would be the current primary key combination; the other would be an automatically assigned number - auto-increment, serial, sequence or whatever...
Addressing the extended information.
It is not clear to me any more what you are seeking to track. If the 'Machine' table is supposed to track what a given machine was being used for, then you probably need to do some more structuring of the data. Given that a machine can be used for different tasks on different fields during a single shift, you should think, perhaps, in terms of a MachineTasks table which would identify the (date and) time when the operation started and finished and the type of operation. For repair operations, you'd store the information in a table describing repairs; for routine operations in a field, you might not need much extra information. Or maybe that is overkill.
I'm not clear whether particular tasks are performed on behalf of multiple departments, or whether you are simply trying to note that during a single shift a machine might be used by multiple departments, but one department at a time for each task. If each task is for a separate department, then simply include the department info in the main MachineTasks table as a foreign key field.
If you decide on an auto-increment key, you still need to maintain the uniqueness of the composite key. This is the biggest mistake I see people making with auto-increment fields. It isn't quite as simple as "a table with an auto-increment key must also have a second unique constraint on it", but it isn't too far off the mark.
When you use an auto-increment key, you need to retrieve the value assigned when you insert a record into the table; you then use that value in the foreign key columns when you insert other records into the other tables.
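A sketch of that pattern, assuming MySQL and invented column names (other engines offer the same idea via SCOPE_IDENTITY(), sequences or a RETURNING clause):
INSERT INTO Machine (machinePlate, description)
VALUES ('ET-1234', 'Cane harvester');
-- capture the auto-increment value assigned by this session's insert
SET @machine_id = LAST_INSERT_ID();
-- use it as the foreign key when inserting related rows
INSERT INTO MachineTasks (machine_id, taskDay, shift, operation)
VALUES (@machine_id, '2011-06-01', 1, 'harvest');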
You need to read up on database design - I'm not sure what the current good books are as I did most of my learning a decade and more ago, and my books are consequently less likely to be available still.
One good way of not getting into trouble with primary keys is to have a single field as the primary key. Usually a numeric (auto-increment) column is just fine. You can still have unique keys with multiple columns.
tblWorksOn
tblMachine
tblIsDone
...seem to be the problem tables.
It looks like you could use taskDate as the primary key for the tblMachine table. The rest can be foreign keys.
With the changes to the tblMachine table, you can then use the taskDate with the fieldNo for the tblWorksOn table, and the taskDate with the userID for tblIsDone. Use these two fields to create composite keys (CK):
e.g.
tblMachine
taskDate (PK)
tblWorksOn
fieldNo (CK)
taskDate (CK)
tblIsDone
userID (CK)
taskDate (CK)

Should each and every table have a primary key?

I'm creating a database table and I don't have a logical primary key assigned to it. Should each and every table have a primary key?
Short answer: yes.
Long answer:
You need your table to be joinable on something
If you want your table to be clustered, you need some kind of a primary key.
If your table design does not need a primary key, rethink your design: most probably, you are missing something. Why keep identical records?
In MySQL, the InnoDB storage engine always creates a primary key if you didn't specify it explicitly, thus making an extra column you don't have access to.
Note that a primary key can be composite.
If you have a many-to-many link table, you create the primary key on all fields involved in the link. Thus you ensure that you don't have two or more records describing one link.
Besides the logical consistency issues, most RDBMS engines will benefit from including these fields in a unique index.
And since any primary key involves creating a unique index, you should declare it and get both logical consistency and performance.
See this article in my blog for why you should always create a unique index on unique data:
Making an index UNIQUE
P.S. There are some very, very special cases where you don't need a primary key.
Mostly they include log tables which don't have any indexes for performance reasons.
Always best to have a primary key. This way it meets first normal form and allows you to continue along the database normalization path.
As stated by others, there are some reasons not to have a primary key, but most tables will not be harmed by having one.
Disagree with the suggested answer. The short answer is: NO.
The purpose of the primary key is to uniquely identify a row on the table in order to form a relationship with another table. Traditionally, an auto-incremented integer value is used for this purpose, but there are variations to this.
There are cases though, for example logging time-series data, where the existence of such a key is simply not needed and just takes up memory. Making a row unique is simply... not required!
A small example:
Table A: LogData
Columns: DateAndTime, UserId, AttribA, AttribB, AttribC etc...
No Primary Key needed.
Table B: User
Columns: Id, FirstName, LastName etc.
Primary Key (Id) needed in order to be used as a "foreign key" to LogData table.
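For what it's worth, a minimal sketch of those two tables (assuming MySQL syntax; column types are guesses):
CREATE TABLE User (
Id INT NOT NULL AUTO_INCREMENT,
FirstName VARCHAR(100),
LastName VARCHAR(100),
PRIMARY KEY (Id)
);
CREATE TABLE LogData (
DateAndTime DATETIME NOT NULL,
UserId INT NOT NULL,
AttribA VARCHAR(100),
AttribB VARCHAR(100),
-- no PRIMARY KEY here: rows are append-only and never addressed individually
FOREIGN KEY (UserId) REFERENCES User (Id)
);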
Pretty much any time I've created a table without a primary key, thinking I wouldn't need one, I've ended up going back and adding one. I now create even my join tables with an auto-generated identity field that I use as the primary key.
Except for a few very rare cases (possibly a many-to-many relationship table, or a table you temporarily use for bulk-loading huge amounts of data), I would go with the saying:
If it doesn't have a primary key, it's not a table!
Marc
Just add it; you will be sorry later if you don't (selecting, deleting, linking, etc.).
Will you ever need to join this table to other tables? Do you need a way to uniquely identify a record? If the answer is yes, you need a primary key. Assume your data is something like a customer table that has the names of the people who are customers. There may be no natural key because you need the addresses, emails, phone numbers, etc. to determine if this Sally Smith is different from that Sally Smith, and you will be storing that information in related tables, as a person can have multiple phones, addresses, emails, etc. Suppose Sally Smith marries John Jones and becomes Sally Jones. If you don't have an artificial key on the table, then when you update the name you just changed 7 Sally Smiths to Sally Jones, even though only one of them got married and changed her name. And of course, in this case, without an artificial key how do you know which Sally Smith lives in Chicago and which one lives in LA?
You say you have no natural key; therefore you don't have any combination of fields that is unique either, which makes the artificial key critical.
I have found that any time I don't have a natural key, an artificial key is an absolute must for maintaining data integrity. If you do have a natural key, you can use that as the key field instead. But personally, unless the natural key is one field, I still prefer an artificial key and a unique index on the natural key. You will regret it later if you don't put one in.
It is good practice to have a PK on every table, but it's not a MUST. Most probably you will need a unique index and/or a clustered index (which may or may not be the PK), depending on your needs.
Check out the Primary Keys and Clustered Indexes sections on Books Online (for SQL Server)
"PRIMARY KEY constraints identify the column or set of columns that have values that uniquely identify a row in a table. No two rows in a table can have the same primary key value. You cannot enter NULL for any column in a primary key. We recommend using a small, integer column as a primary key. Each table should have a primary key. A column or combination of columns that qualify as a primary key value is referred to as a candidate key."
But then check this out also: http://www.aisintl.com/case/primary_and_foreign_key.html
To make it future proof you really should. If you want to replicate it you'll need one. If you want to join it to another table your life (and that of the poor fools who have to maintain it next year) will be so much easier.
I am in the role of maintaining an application created by an offshore development team. Now I am having all kinds of issues in the application because the original database schema did not contain PRIMARY KEYS on some tables. So please don't let other people suffer because of your poor design. It is always a good idea to have primary keys on tables.
Late to the party but I wanted to add my two cents:
Should each and every table have a primary key?
If you are talking about "relational algebra", the answer is yes. Modelling data this way requires the entities and tables to have a primary key. The problem with relational algebra (apart from the fact that there are like 20 different, mismatching flavors of it) is that it only exists on paper. You can't build real world applications using relational algebra.
Now, if you are talking about databases for real world apps, they partially/mostly adhere to relational algebra, taking the best of it and overlooking other parts of it. Also, database engines offer massive non-relational functionality nowadays (it's 2020 now). So in this case the answer is no. In any case, 99.9% of my real world tables have a primary key, but there are justifiable exceptions. Case in point: event/log tables (multiple indexes, but not a single key in sight).
Bottom line, in transactional applications that follow the entity/relationship model it makes a lot of sense to have primary keys for almost (if not) all of the tables. If you ever decide to skip the primary key of a table, make sure you have a good reason for it, and you are prepared to defend your decision.
I know that in order to use certain features of the gridview in .NET, you need a primary key in order for the gridview to know which row needs updating/deleting. General practice should be to have a primary key or primary key cluster. I personally prefer the former.
I'd like to find something official like this - 15.6.2.1 Clustered and Secondary Indexes - MySQL.
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
So, why not create a primary key (or something like it) yourself? Besides, an ORM cannot identify this hidden ID, meaning that you cannot use it in your code.
I always have a primary key, even if in the beginning I don't have a purpose in mind yet for it. There have been a few times when I eventually need a PK in a table that doesn't have one and it's always more trouble to put it in later. I think there is more of an upside to always including one.
If you are using Hibernate, it is not possible to create an entity without a primary key. This can create problems if you are working with an existing database which was created with plain SQL/DDL scripts and no primary key was added.
In short, no. However, you need to keep in mind that certain client access CRUD operations require it. For future proofing, I tend to always utilize primary keys.

Surrogate vs. natural/business keys [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
Here we go again, the old argument still arises...
Would we better have a business key as a primary key, or would we rather have a surrogate id (i.e. an SQL Server identity) with a unique constraint on the business key field?
Please, provide examples or proof to support your theory.
Just a few reasons for using surrogate keys:
Stability: Changing a key because of a business or natural need will negatively affect related tables. Surrogate keys rarely, if ever, need to be changed because there is no meaning tied to the value.
Convention: Allows you to have a standardized Primary Key column naming convention rather than having to think about how to join tables with various names for their PKs.
Speed: Depending on the PK value and type, a surrogate key of an integer may be smaller, faster to index and search.
Both. Have your cake and eat it.
Remember there is nothing special about a primary key, except that it is labelled as such. It is nothing more than a NOT NULL UNIQUE constraint, and a table can have more than one.
If you use a surrogate key, you still want a business key to ensure uniqueness according to the business rules.
It appears that no one has yet said anything in support of non-surrogate (I hesitate to say "natural") keys. So here goes...
A disadvantage of surrogate keys is that they are meaningless (cited as an advantage by some, but...). This sometimes forces you to join a lot more tables into your query than should really be necessary. Compare:
select sum(t.hours)
from timesheets t
where t.dept_code = 'HR'
and t.status = 'VALID'
and t.project_code = 'MYPROJECT'
and t.task = 'BUILD';
against:
select sum(t.hours)
from timesheets t
join departments d on d.dept_id = t.dept_id
join timesheet_statuses s on s.status_id = t.status_id
join projects p on p.project_id = t.project_id
join tasks k on k.task_id = t.task_id
where d.dept_code = 'HR'
and s.status = 'VALID'
and p.project_code = 'MYPROJECT'
and k.task_code = 'BUILD';
Unless anyone seriously thinks the following is a good idea?:
select sum(t.hours)
from timesheets t
where t.dept_id = 34394
and t.status_id = 89
and t.project_id = 1253
and t.task_id = 77;
"But" someone will say, "what happens when the code for MYPROJECT or VALID or HR changes?" To which my answer would be: "why would you need to change it?" These aren't "natural" keys in the sense that some outside body is going to legislate that henceforth 'VALID' should be re-coded as 'GOOD'. Only a small percentage of "natural" keys really fall into that category - SSN and Zip code being the usual examples. I would definitely use a meaningless numeric key for tables like Person, Address - but not for everything, which for some reason most people here seem to advocate.
See also: my answer to another question
A surrogate key will NEVER have a reason to change. I cannot say the same about natural keys. Last names, emails, ISBN numbers - they can all change one day.
Surrogate keys (typically integers) have the added value of making your table relations faster, and more economical in storage and update speed (even better, foreign keys do not need to be updated when using surrogate keys, in contrast with business key fields, which do change now and then).
A table's primary key should be used for identifying uniquely the row, mainly for join purposes. Think a Persons table: names can change, and they're not guaranteed unique.
Think Companies: you're a happy Merkin company doing business with other companies in Merkia. You are clever enough not to use the company name as the primary key, so you use Merkia's government's unique company ID in its entirety of 10 alphanumeric characters.
Then Merkia changes the company IDs because they thought it would be a good idea. It's OK, you use your DB engine's cascaded updates feature, for a change that shouldn't involve you in the first place. Later on, your business expands, and now you work with a company in Freedonia. Freedonian company IDs are up to 16 characters. You need to enlarge the company ID primary key (and the foreign key fields in Orders, Issues, MoneyTransfers etc.), adding a Country field to the primary key (and to the foreign keys). Ouch! Civil war in Freedonia, and it splits into three countries. The country name of your associate has to be changed to the new one; cascaded updates to the rescue. BTW, what's your primary key? (Country, CompanyID) or (CompanyID, Country)? The latter helps joins, the former avoids another index (or perhaps many, should you want your Orders grouped by country too).
All these are not proof, but an indication that a surrogate key to uniquely identify a row for all uses, including join operations, is preferable to a business key.
I hate surrogate keys in general. They should only be used when there is no quality natural key available. It is rather absurd when you think about it, to think that adding meaningless data to your table could make things better.
Here are my reasons:
When using natural keys, tables are clustered in the way that they are most often searched thus making queries faster.
When using surrogate keys you must add unique indexes on logical key columns. You still need to prevent logical duplicate data. For example, you can’t allow two Organizations with the same name in your Organization table even though the pk is a surrogate id column.
When surrogate keys are used as the primary key it is much less clear what the natural primary keys are. When developing you want to know what set of columns make the table unique.
In one-to-many relationship chains, the logical keys chain. So, for example, Organizations have many Accounts and Accounts have many Invoices. The logical key of Organization is OrgName. The logical key of Accounts is OrgName, AccountID. The logical key of Invoice is OrgName, AccountID, InvoiceNumber.
When surrogate keys are used, the key chains are truncated by only having a foreign key to the immediate parent. For example, the Invoice table does not have an OrgName column; it only has a column for the AccountID. If you want to search for invoices for a given organization, then you will need to join the Organization, Account, and Invoice tables. If you use logical keys, then you could query the Invoice table directly (see the sketch at the end of this answer).
Storing surrogate key values of lookup tables causes tables to be filled with meaningless integers. To view the data, complex views must be created that join to all of the lookup tables. A lookup table is meant to hold a set of acceptable values for a column. It should not be codified by storing an integer surrogate key instead. There is nothing in the normalization rules that suggest that you should store a surrogate integer instead of the value itself.
I have three different database books. Not one of them shows using surrogate keys.
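To make the point about key chains concrete, a hedged sketch (table and column names such as OrgID, OrgName and AccountID are assumed):
-- Surrogate keys everywhere: Invoice only knows its AccountID
SELECT i.*
FROM Invoice i
JOIN Account a ON a.AccountID = i.AccountID
JOIN Organization o ON o.OrgID = a.OrgID
WHERE o.OrgName = 'Acme';
-- Logical key chained down: Invoice carries OrgName itself
SELECT i.*
FROM Invoice i
WHERE i.OrgName = 'Acme';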
I want to share my experience with you on this endless war :D over the natural vs surrogate key dilemma. I think that both surrogate keys (artificial auto-generated ones) and natural keys (composed of column(s) with domain meaning) have pros and cons. So, depending on your situation, it might be more relevant to choose one method or the other.
As it seems that many people present surrogate keys as the almost perfect solution and natural keys as the plague, I will focus on the other point of view's arguments:
Disadvantages of surrogate keys
Surrogate keys are:
Source of performance problems:
They are usually implemented using auto-incremented columns which mean:
A round-trip to the database each time you want to get a new Id (I know that this can be improved using caching or [seq]hilo alike algorithms but still those methods have their own drawbacks).
If one day you need to move your data from one schema to another (it happens quite regularly in my company, at least), then you might encounter Id collision problems. And yes, I know that you can use UUIDs, but those require 32 hexadecimal digits! (If you care about database size, it can be an issue.)
If you are using one sequence for all your surrogate keys then - for sure - you will end up with contention on your database.
Error prone. A sequence has a max_value limit, so - as a developer - you have to pay attention to the following points:
You must cycle your sequence (when the max value is reached, it goes back to 1, 2, ...).
If you are using the sequence as an ordering (over time) of your data, then you must handle the case of cycling (a column with Id 1 might be newer than a row with Id max-value - 1).
Make sure that your code (and even your client interfaces, which should not happen as it is supposed to be an internal Id) supports the 32-bit/64-bit integers that you use to store your sequence values.
They don't guarantee non duplicated data. You can always have 2 rows with all the same column values but with a different generated value. For me this is THE problem of surrogate keys from a database design point of view.
More in Wikipedia...
Myths on natural keys
Composite keys are less efficient than surrogate keys. No! It depends on the database engine used:
Oracle
MySQL
Natural keys don't exist in real life. Sorry, but they do exist! In the aviation industry, for example, the following tuple will always be unique for a given scheduled flight: (airline, departureDate, flightNumber, operationalSuffix). More generally, when a set of business data is guaranteed to be unique by a given standard, then this set of data is a [good] natural key candidate.
Natural keys "pollute the schema" of child tables. For me this is more a feeling than a real problem. Having a 4 columns primary-key of 2 bytes each might be more efficient than a single column of 11 bytes. Besides, the 4 columns can be used to query the child table directly (by using the 4 columns in a where clause) without joining to the parent table.
Conclusion
Use natural keys when it is relevant to do so and use surrogate keys when it is better to use them.
Hope that this helped someone!
Always use a key that has no business meaning. It's just good practice.
EDIT: I was trying to find a link to it online, but I couldn't. However, 'Patterns of Enterprise Application Architecture' [Fowler] has a good explanation of why you shouldn't use anything other than a key with no meaning other than being a key. It boils down to the fact that it should have one job and one job only.
Surrogate keys are quite handy if you plan to use an ORM tool to handle/generate your data classes. While you can use composite keys with some of the more advanced mappers (read: hibernate), it adds some complexity to your code.
(Of course, database purists will argue that even the notion of a surrogate key is an abomination.)
I'm a fan of using UIDs for surrogate keys when suitable. The major win with them is that you know the key in advance, e.g. you can create an instance of a class with the ID already set and guaranteed to be unique, whereas with, say, an integer key you'll need to default to 0 or -1 and update to an appropriate value when you save/update.
UIDs have penalties in terms of lookup and join speed though so it depends on the application in question as to whether they're desirable.
Using a surrogate key is better in my opinion as there is zero chance of it changing. Almost anything I can think of which you might use as a natural key could change (disclaimer: not always true, but commonly).
An example might be a DB of cars - at first glance, you might think that the licence plate could be used as the key. But plates can be changed, so that would be a bad idea. You wouldn't really want to find that out after releasing the app, when someone comes to you wanting to know why they can't change their number plate to their shiny new personalised one.
Always use a single column, surrogate key if at all possible. This makes joins as well as inserts/updates/deletes much cleaner because you're only responsible for tracking a single piece of information to maintain the record.
Then, as needed, stack your business keys as unique constraints or indexes. This will keep your data integrity intact.
Business logic/natural keys can change, but the physical key of a table should NEVER change.
Case 1: Your table is a lookup table with less than 50 records (50 types)
In this case, use manually named keys, according to the meaning of each record.
For Example:
Table: JOB with 50 records
CODE (primary key) NAME DESCRIPTION
PRG PROGRAMMER A programmer is writing code
MNG MANAGER A manager is doing whatever
CLN CLEANER A cleaner cleans
...............
joined with
Table: PEOPLE with 100000 inserts
foreign key JOBCODE in table PEOPLE
looks at
primary key CODE in table JOB
Case 2: Your table is a table with thousands of records
Use surrogate/autoincrement keys.
For Example:
Table: ASSIGNMENT with 1000000 records
joined with
Table: PEOPLE with 100000 records
foreign key PEOPLEID in table ASSIGNMENT
looks at
primary key ID in table PEOPLE (autoincrement)
In the first case:
You can select all programmers in table PEOPLE without use of join with table JOB, but just with: SELECT * FROM PEOPLE WHERE JOBCODE = 'PRG'
In the second case:
Your database queries are faster because your primary key is an integer
You don't need to bother yourself with finding the next unique key because the database itself gives you the next autoincrement.
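A hedged DDL sketch of the two cases (assuming MySQL syntax; the extra columns are invented):
-- Case 1: small lookup table with a meaningful, manually assigned key
CREATE TABLE JOB (
CODE CHAR(3) PRIMARY KEY,
NAME VARCHAR(50),
DESCRIPTION VARCHAR(200)
);
-- Case 2: large tables with surrogate/auto-increment keys
CREATE TABLE PEOPLE (
ID INT AUTO_INCREMENT PRIMARY KEY,
NAME VARCHAR(100),
JOBCODE CHAR(3),
FOREIGN KEY (JOBCODE) REFERENCES JOB (CODE)
);
CREATE TABLE ASSIGNMENT (
ID INT AUTO_INCREMENT PRIMARY KEY,
PEOPLEID INT NOT NULL,
FOREIGN KEY (PEOPLEID) REFERENCES PEOPLE (ID)
);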
Surrogate keys can be useful when business information can change or be identical. Business names don't have to be unique across the country, after all. Suppose you deal with two businesses named Smith Electronics, one in Kansas and one in Michigan. You can distinguish them by address, but that'll change. Even the state can change; what if Smith Electronics of Kansas City, Kansas moves across the river to Kansas City, Missouri? There's no obvious way of keeping these businesses distinct with natural key information, so a surrogate key is very useful.
Think of the surrogate key like an ISBN number. Usually, you identify a book by title and author. However, I've got two books titled "Pearl Harbor" by H. P. Willmott, and they're definitely different books, not just different editions. In a case like that, I could refer to the looks of the books, or the earlier versus the later, but it's just as well I have the ISBN to fall back on.
In a data warehouse scenario I believe it is better to follow the surrogate key path. Two reasons:
You are independent of the source system, and changes there --such as a data type change-- won't affect you.
Your DW will need less physical space since you will use only integer data types for your surrogate keys. Also your indexes will work better.
As a reminder, it is not good practice to place clustered indices on random surrogate keys, i.e. GUIDs that read XY8D7-DFD8S, as SQL Server has no ability to physically sort these data. You should instead place unique indexes on these data, though it may also be beneficial to simply run SQL Profiler for the main table operations and then feed those data into the Database Engine Tuning Advisor.
See thread # http://social.msdn.microsoft.com/Forums/en-us/sqlgetstarted/thread/27bd9c77-ec31-44f1-ab7f-bd2cb13129be
This is one of those cases where a surrogate key pretty much always makes sense. There are cases where you either choose what's best for the database or what's best for your object model, but in both cases, using a meaningless key or GUID is a better idea. It makes indexing easier and faster, and it is an identity for your object that doesn't change.
In the case of a point-in-time database, it is best to have a combination of surrogate and natural keys. e.g. you need to track member information for a club. Some attributes of a member never change, e.g. date of birth, but a name can change.
So create a Member table with a member_id surrogate key and have a column for DOB.
Create another table called person_name and have columns for member_id, member_fname, member_lname, date_updated. In this table the natural key would be member_id + date_updated.
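A minimal sketch of that layout (assuming MySQL syntax; column types are guesses):
CREATE TABLE member (
member_id INT AUTO_INCREMENT PRIMARY KEY, -- surrogate key
date_of_birth DATE NOT NULL               -- attribute that never changes
);
CREATE TABLE person_name (
member_id INT NOT NULL,
date_updated DATE NOT NULL,
member_fname VARCHAR(100),
member_lname VARCHAR(100),
PRIMARY KEY (member_id, date_updated), -- surrogate plus effective date
FOREIGN KEY (member_id) REFERENCES member (member_id)
);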
Horses for courses. To state my bias: I'm a developer first, so I'm mainly concerned with giving the users a working application.
I've worked on systems with natural keys, and had to spend a lot of time making sure that value changes would ripple through.
I've worked on systems with only surrogate keys, and the only drawback has been a lack of denormalised data for partitioning.
Most traditional PL/SQL developers I have worked with didn't like surrogate keys because of the number of tables per join, but our test and production databases never raised a sweat; the extra joins did not affect application performance. With database dialects that don't support clauses like "X inner join Y on X.a = Y.b", or developers who don't use that syntax, the extra joins for surrogate keys do make the queries harder to read and longer to type and check: see Tony Andrews' post above. But if you use an ORM or any other SQL-generation framework you won't notice it. Touch-typing also helps.
Maybe not completely relevant to this topic, but a headache I have had dealing with surrogate keys: Oracle's pre-delivered analytics creates auto-generated SKs on all of its dimension tables in the warehouse, and it also stores those on the facts. So, any time the dimensions need to be reloaded because new columns are added or need to be populated for all items in the dimension, the SKs assigned during the update fall out of sync with the original values stored on the fact, forcing a complete reload of all fact tables that join to it. I would prefer that, even if the SK was a meaningless number, there were some way that it could not change for original/old records. As many know, out-of-the-box rarely serves an organization's needs, and we have to customize constantly. We now have 3 years' worth of data in our warehouse, and complete reloads from the Oracle Financial systems are very large. So in my case, the keys are not generated from data entry, but added in a warehouse to help reporting performance. I get it, but ours do change, and it's a nightmare.
