Related
Disclaimer: This is not the same question as the other topics.
One of the cons we face when using Natural Keys, is that, if the business logic changes and we need to change one key, we need to propagate this change throughout all linked tables.
However, using a ON UPDATE CASCADE declaration we can make the DBMS to propagate the change for us when we change a key value.
What I don't understand is: What are the cons of this approach? Is there any situation where using ON UPDATE CASCADE can be risky to the database?
Because if not, then if in a situation, the single reason to use Surrogate keys is the fact that it is easier to change business Natural keys, then ON UPDATE CASCADE should be used instead of converting all tables to Surrogate keys.
Advantages of surrogates..
If you're using REST then it requires the concept of a resource identifier. Using part of the resource as the resource identifier can be cumbersome. A surrogate key isn't really a surrogate in this case since it becomes a real-world resource identifier, but that is one reason to use a surrogate key pattern.
A lot of ORM tools are just easier to manage with a surrogate. For better or worse, OOP requires the concept of identity remain separate from value. In a lot of ways this is similar to point 1.
If you change the key attributes you can just alter one table. On the other hand, surrogates cannot turn a naturally composite key into a simple one if you wish to maintain referential integrity. For example, Document (UserId, FolderId) references Folder (UserId, Id) keeps a user from putting his documents in someone else's folder where simply Document (FolderId) references Folder (Id) does not. But using a surrogate at least limits mandatory sweeping changes to structural decisions, rather than just someone choosing among several candidate keys.
I won't list the advantages of natural keys because I think they're obvious. Less is more. Surrogates are the concept in need of justification.
So in my work environment we don't use a 'primary key' as defined by SQL Server. In other words, we don't right click a column and select "set as primary key".
We do however still have primary keys, we just use a unique ID column. In stored procedures we use these to access the data like you would in any relational database.
My question is, other than the built in functionality that comes with defining a primary key in SQL Server like Entity Framework stuff etc. Is there a good reason to use the 'primary key' functionality over just using a unique ID column and accessing your tables with that in your own stored procedures?
The biggest drawback I see (again other than being able to use Entity Framework and things like that) is that you have to mentally keep track or otherwise keep track of what ID relates to what tables.
There is nothing "special" about the PRIMARY KEY constraint. It's just a uniqueness constraint and you can achieve the same results by using the UNIQUE NOT NULL syntax to define your keys instead.
However, uniqueness constraints (i.e. keys in general, not "primary" keys specifically) are very important for data integrity reasons. They ensure that your data is unique which means that sensible, meaningful results can be derived from your data. It's extremely difficult to get accurate results from a database that contains duplicate data. Also, uniqueness constraints are required to enforce referential integrity between tables, which is another very important aspect of data integrity. Poor data integrity is a data management problem that costs businesses billions of dollars every year and that's the bottom line of why keys are important.
There is a further reason where unique indexes are important: query optimization and performance. Unique indexes improve query performance. If your data is supposed to be unqiue then creating a unique index on it will give the query optimizer the best chance of picking a good execution plan for your queries.
I think the drawback is not using the primary key at all and using a unique key constraint for something it wasn't intended to do.
Unique keys: You can have many of them. They are meant to offer a way to determine uniqueness among rows.
Primary key: like the Highlander, there can only be one. It's intended use is to identify the rows of the table.
I can't think of any good reason not to use a primary key. My opinion is that without a primary key, your table isn't actually a table. It's just a lump of data.
Follow Up: If you don't believe me, check out this guy who asked a bunch of DBA's if it was OK not to use a primary key.
Is it OK not to use a Primary Key When I don't Need one
There are philosophical and practical answers to your question.
The practical answer is that using the primary key constraint enforces "not null", and "unique". This protects you from application-level bugs.
The philosophical answer is that you want developers to operate at the highest possible level of abstraction, so that they don't have to stuff their brain full of detail when trying to solve problems.
Primary and foreign keys are abstractions that allow us to make assumptions about the underlying data model. We can think in terms of (business) entities, and their relationships.
In your workplace, you're forcing developers to think in terms of tables and indexes and conventions. You no longer think about "customers" and "orders" and "line items", but about software artefacts that represent those business entities, and the "we always represent uniqueness by a combination of a GUID and unique index" rule. That mental model is already complicated enough in most applications; you're just making it harder for yourselves, especially when bringing new developers into the team.
My question is more or less the opposite of this one: Why would one ever want to bother finding a natural primary key in a relation when using a sequence as a surrogate seems so much easier.
BradC mentioned in his answer to a related question that the criteria for choosing a primary key are uniqueness, irreductibility, simplicity, stability and familiarity. It looks to me like using a sequence sacrifices the last criterion in order to provide an optimal solution for the first four.
If I hold those criteria to be correct, I can reformulate my question as: In which circumstances would one ever consider it advantageous to complicate one's life by looking for a unique, irreductible, simple and stable key that is also familiar?
To get a meaningful value from a lookup table without doing unnecessary joins.
Example case: garments references a lookup table of colors, which has an auto-increment primary key. Getting the name of the color requires a join:
SELECT c.color
FROM garments g
JOIN colors c USING (color_id);
Simpler example: the colors.color itself is the primary key of that table, and therefore it's the foreign key column in any table that references it.
SELECT g.color
FROM garments g
The answer is data integrity. Instances of entities in the business domain outside the database are by definition identifiable things. If you fail to give them external, real world identifiers in the database then that database stands little chance of modelling reality correctly.
A natural key[1] is what ensures facts in the database are identifiable with actual things in the reality you are trying to model. They are the means which users rely on when they act on and update the data in the database. The constraints that enforce those keys are an implementation of business rules. If your database is to model the business domain accurately then natural keys are not just desirable but essential. If you doubt that then you haven't done enough business analysis. Just ask your customers how they think their business would operate if they were left looking at screens full of duplicate data!
[1] I recommend calling them business keys or domain keys rather than natural keys. Those are far more appropriate and less overloaded terms even though they mean exactly the same thing.
You generally need to identify what the unique key on the data is anyway, as you still need to be able to ensure that the data is not duplicated.
The strength of the synthetic key is that it allows the values of the unique natural key to be modifiable in future, with child records not needing to be updated.
So you're not really skipping the "identify the key" part of the design by using a synthetic primary key, you're just insulating yourself from the possibility of the values changing.
Below are the benefits of using a natural primary key:
In case you need to have a unique constraint on any column then making it primary key will fulfill the need for that,if you aren't suppose to receive any null value into that.So, anyways it's saving your cost of 1 extra key.
In some RDBMS, the key you are declaring as primary key is automatically creating a btree index on that column and if you make a natural primary key based on your access pattern then it is like Icing on the cake because now you are making two shots with one stone. Saving cost of an extra index and making your queries faster by having that meaningful primary key in where clause.
Last but not least ,you will be able to save space of one extra column/key/index.
Is there a benefit to having a single column primary key vs a composite primary key?
I have a table that consists of two id columns which together make up the primary key.
Are there any disadvantages to this? Is there a compelling reason for me to throw in a third column that would be unique on it's own?
Database Normalization nuts will tell you one thing.
I'm just going to offer my own opinion of what i've learned over the years..I stick an AutoIncrementing ID field to Every ($&(##$)# one of my tables. It makes life a million times easier in the long run to be able to single out with impunity a single row.
This is from a "down in the trenches" developer.
Single column keys are simple to write, simple to maintain, and simple to understand.
If you're going to have a huge number of rows - billions? - maybe saving a byte here and there will help.
But if you're not looking at extreme cases, optimizing for "simple" is often the best way to go.
If you are a coder and the database is nothing to you but a glorified object-store, then sure, by all means inject surrogate keys willy nilly. In fact go one better and just delegate all DB schema design and DB interaction to your favourite ORM and be done with it. Indeed, when I want a small or medium scale object-store, that's exactly what I do.
If you are approaching an information systems or information management problem, then it is a completely different story. When you start dealing with 10's (or more likely 100's) of millions of dirty records integrated from multiple sources, several or all of which are not under your control; at that point the seductive lure of an easy answer to the problems of 'identity' is a trap.
Yes you sometimes still introduce a surrogate key internally to allow for concise FK relationships and improved cache efficiency on covering indices; but, you gain those benefits at the cost of substantial pain at managing the natural-key/surrogate-key relationship.
In this case it will be important to make sure you don't allow the surrogate key to leak. Your public API's at the business-logic layer should use the natural-key, nothing above an document/record-cache should be aware of the existence of a surrogate key. Be aware that the cost of matching updates against the existing surrogate keys can be prohibitive, and a far larger scalability hit than the incremental cost of moving a few extra bytes per request over the internal network.
So in conclusion:
If the DB is just being used as an object-store: let the ORM worry about object identity, and there should almost certainly be a surrogate key.
If the DB is being used as a database: the introduction of a surrogate key is an engineering design decision with serious tradeoffs in both directions. The decision will need to be made on a case by case basis, with full recognition of the resulting costs to be accepted in exchange for the benefits gained either way.
Update
The 'convenience' of a surrogate key is really just the ability to punt on the question of identity. This is often necessary in a database, and reasonable in the caching layer as I allow, but beyond that it leads to brittle data designs. The problem is that identity is no something that has one correct answer. For non-trivial data-intensive systems you will routinely find yourself needing to work in terms of equivalence classes, rather than the reference identity, object-oriented programming lulls us into thinking is normal.
What it really comes down to is a realization that the whole concept of a 'primary key' is a fiction invented to help the relational model work efficiently; but, adopting a surrogate key, cements that fiction and makes the whole system brittle and inflexible. Business logic needs to be able to provide their own definitions of equality — sometimes four copies of the same file need to be considered four files, sometimes they should be considered indistinguishable from the original file; when you edit one of them, is that then a new file? the same file? The answer to both questions is of course yes, when... Working with natural keys provides this critical ability to work in terms of conceptual equivalence classes. If you let surrogate keys infect your business logic, you quickly lose this.
I have had to use multi-column primary keys in the past, and it became quite a nightmare very quickly.
If you have one table that references your first table, how does it contain that primary key? Now add another table that references only the second table but needs to find data in the first. Now another... on down the rabbit hole.
If you know that you will only have the one table, there's probably not an issue either way- use whichever represents your data better. But if you'll be using it in joins, you can lose performance pretty quickly.
Is there a benefit to having a single column primary key vs a composit[sic] primary key?
Yes. If the primary key also happens to be the clustered index, it is common that the clustered index is duplicated fully for each secondary index in the table. Therefore, having a fatter clustered index, which is what one would get with a composite, implies an increase in storage cost. Also, foreign references to this table would need to specify both fields to refer to a unique entry, which implies a further storage cost. There is also an arguably greater cost in development time because there is a slight increase in the complexity of the join.
On the other hand, depending on the distribution of the values of your two key fields, it may be the case that concurrent access to your table is greatly improved because chronologically-successive inserts could occur on different physical pages; this could be the case, for example, if your fields are time-independent (and non-monotonic like an auto-incrementer) like clientID, or something like that. This could be significant for performance in a high concurrency environment.
I have a table that consists of two id columns which together make up the primary key.
Are there any disadvantages to this? Is there a compelling reason for me
to throw in a third column that would be unique on it's own?
If the most common way in which your table is queried is to specify those three fields as restrictions, then having all three in a composite key would likely be the fastest lookup.
And there is another important point that I almost forgot. Since having a composite key means that foreign references to this table from other tables must specify all fields in the key, it also means that some queries performed on the other table that required a restriction on one or more of the parts of the composite index of this table, can be performed without requiring a join. This could be considered similar to the concept of denormalization for the sake of performance (and arguably sacrificing a little ease of maintainability).
In general I prefer to have a surrogate key becasue there are very few truly good natural keys (key problem is not uniqueness but that they change over time) and the longer the natural key, the more it affects performance when used as a PK. If you have a natural key, you should create a unique index on it and then use the surrogate key as the PK used for joining to other tables. That enforces the uniqueness of the natural key data but fixes the problems of join performance and the extra time to update all child records when the natural key changes.
There is one case where I ignore this and that is a joining table. If it is a table that is used to enforce a many to many relationship and consists only of two surrogate keys from other tables, then you really gain nothing from adding a surrogate key. Typically the individual keys are used for joins not the PK and surrogate keys almost never change. In a joining table, I just add the two colmns I need and nothing else.
In most databases I know (MySQL, PostgreSQL) the composite key will generate an index. So if you specify your key as composite the DB should provide you an efficient way to lookup tuples from the DB using that key. I think it is the case for all DBs. I think you do not have to bother about performance there.
Don't use multi-column keys. They get very difficult to maintain, especially if the components of the key are not human-understandable.
Use an internally generated key instead.
Imagine you have a composite primary key (field1 and field2 for example) instead of just one autoincremental identifier. Clients' requirements are very changeable and after some development the client says that field2 is not compulsory and it can be nullable, it won't be possible to continue as the primary key of the table. Imagine this table is one of the most importants in your model. Then all the foreign keys should be changed if field 2 cannot be in the composite primary key. It's a nightmare changing the primary key all over the model.
As well if there is a lot of foreign keys I think is not a very good Idea to add several keys to each table just to make the link.
I'm not sure there's enough information for us to make your call for you. Here are a few observations that might be helpful though.
is the primary key a clustered index? Is the table referenced by other tables through a foreign key? If yes, then you may benefit from a single-column key, because that key will appear in those other tables. This is how you would save space.
If the table is not referenced by other tables, then you would be using extra space in your table without much additional benefit. And, if this table only contains the two columns now, then you would increase the table size by 50%.
If you use an extra column for the primary key, do not forget your natural key (the two-column key). Create a unique constraint on the composite key. You still want to maintain the integrity of the real data.
The decision should always be based on requirements and the intended meaning of the data. A table with only a single attribute key clearly enforces a different kind of constraint and implies that your table has a very different meaning to the same table with a multi attribute key. On the other hand adding an additional unique column would also be a waste of resources and add meaningless complexity if you don't actually need to use it anywhere.
One caveat to the auto-incrementing column is that it can give a false impression of uniqueness. Sure, your identity column is always unique, but that's just a meaningless value you've attached to the table. Unless you also have a unique constraint attached to the set of columns that represent the actual semantic primary key of the table, you have no guarantee of meaningful uniqueness.
Should I always have a primary key in my database tables?
Let's take the SO tagging. You can see the tag in any revision, its likely to be in a tag_rev table with the postID and revision number. Would I need a PK for that?
Also since it is in a rev table and not currently use the tags should be a blob of tagIDs instead of multiple entries of multiple post_id tagid pair?
A table should have a primary key so that you could identify each row uniquely with it.
Technically, you can have tables without a primary key, but you'll be breaking good database design rules.
You should strive to have a primary key in any non-trivial table where you're likely to want to access (or update or delete) individual records by that key. Primary keys can consist of multiple columns, and formally speaking, will be the shortest available superkey; that is, the shortest available group of columns which, together, uniquely identify any row.
I don't know what the Stack Overflow database schema looks like (and from some of the things I've read on Jeff's blog, I don't want to), but in the situation you describe, it's entirely possible there is a primary key across the post identifier, revision number and tag value; certainly, that would be the shortest (and only) superkey available.
With regards to your second point, while it may be reasonable to argue in favour of aggregating values in archive tables, it does go against the principle that each row/column intersection in a table ought to contain one single value. While it may slightly simplify development, there is no reason you can't keep to a normalised table with versioned metadata, even for something as trivial as tags.
I tend to agree that most tables should have a primary key. I can only think of two times where it doesn't make sense to do it.
If you have a table that relates keys to other keys. For example, to relate a user_id to an answer_id, that table wouldn't need a primary key.
A logging table, whose only real purpose is to create an audit trail.
Basically, if you are writing a table that may ever need to be referenced in a foreign key relationship then a primary key is important, and if you can't be positive it won't be, then just add the PK. :)
See this related question about whether an integer primary key is required. One of the answers uses tagging as an example:
Are there any good reasons to have a database table without an integer primary key
For more discussion of tagging and keys, see this question:
Id for tags in tag systems
From MySQL 5.5 Reference Manual section 13.1.17:
If you do not have a PRIMARY KEY and an application asks for the PRIMARY KEY in your tables, MySQL returns the first UNIQUE index that has no NULL columns as the PRIMARY KEY.
So, technically, the answer is no. However, as others have stated, in most cases it is quite useful.
I firmly believe every table should have a way to uniquely identify a record. For 99% of the tables, this is a primary key. For the rest you may get away with a unique index (I'm thinking one column look up type tables here). Any time I have a had to work with a table without a way to uniquely identify records, there has been trouble.
I also believe if you are using surrogate keys as your PK, you should, where at all possible, have a separate unique index on whatever combination of fields make up the natural key. I realize there are all too many times when you don't have a true natural key (names are not unique or what makes something unique might be spread across several parentchild tables), but if you do have one, please please please make sure it has a unique index or is created as the PK.
If there is no PK, how will you update or delete a single row ? It would be impossible ! To be honest I have used a few times tables without PK, for instance to store activity logs, but even in this case it is advisable to have one because the timestamps could not be granular enough. Temporary tables is another example. But according to relational theory the PK is mandatory.
it is good to have keys and relationships . Helps a lot. however if your app is good enough to handle the relationships then you could possibly skip the keys ( although i recommend that you have them )
Since I use Subsonic, I always create a primary key for all of my tables. Many DB Abstraction libraries require a primary key to work.
Note: that doesn't answer the "Grand Unified Theory" tone of your question, but I'm just saying that in practice, sometimes you MUST make a primary key for every table.
If it's a join table then I wouldn't say that you need a primary key. Suppose, for example, that you have tables PERSONS, SICKPEOPLE, and ILLNESSES. The ILLNESSES table has things like flu, cold, etc., each with a primary key. PERSONS has the usual stuff about people, each also with a primary key. The SICKPEOPLE table only has people in it who are sick, and it has two columns, PERSONID and ILLNESSID, foreign keys back to their respective tables, and no primary key. The PERSONS and ILLNESSES tables contain entities and entities get primary keys. The entries in the SICKPEOPLE table aren't entities and don't get primary keys.
Databases don't have keys, per se, but their constituent tables might. I assume you mean that, but just in case...
Anyway, tables with a large number of rows should absolutely have primary keys; tables with only a few rows don't need them, necessarily, though they don't hurt. It depends upon the usage and the size of the table. Purists will put primary keys in every table. This is not wrong; and neither is omitting PKs in small tables.
Edited to add a link to my blog entry on this question, in which I discuss a case in which database administration staff did not consider it necessary to include a primary key in a particular table. I think this illustrates my point adequately.
Cyberherbalist's Blog Post on Primary Keys