In a given table, if there is no primary key and it is even impossible to create a composite primary key, then what is the normal form of that table?
If it is zero (0NF), will adding a new column and making it the primary key convert this table to 1NF?
Normal forms apply to relations, which are mathematical structures. Tables can be used to represent relations, but this requires some rules to ensure that the table doesn't contain more or less information than the corresponding relation.
In order for a table to represent a relation:
all rows and columns must be unique
the order they're in mustn't matter
all significant information must be represented as values in cells (i.e. fonts, highlighting, etc., mustn't matter)
every cell must contain one value (doesn't matter how simple or complex that value is)
Also, the relational model cares about candidate keys, not primary keys. A relation can have multiple candidate keys. A primary key is just a selected candidate key that is used by some disciplines (e.g. the entity-relationship model) or by some database management systems (e.g. for physical record ordering).
With all that said, I can now answer your question. If your table follows the rules and specifically the rows are all unique, then there will be at least one candidate key, on all the columns together at worst. If your table's rows aren't unique, then the table doesn't represent a relation and the normal forms don't apply. A surrogate key (like an auto-increment column) can be added to identify rows uniquely, but that isn't necessarily sufficient on its own to make a table represent a relation (1NF).
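For illustration, here is a rough sketch (PostgreSQL-style syntax; the table and column names are made up) of adding such a surrogate key to a table whose rows are not otherwise unique:

-- A table with no key: identical rows cannot be told apart
CREATE TABLE readings (
    sensor TEXT,
    value  DECIMAL
);

-- Adding an auto-increment surrogate key makes each row uniquely identifiable
ALTER TABLE readings
    ADD COLUMN reading_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY;

-- Note: the surrogate guarantees row uniqueness, but by itself it does not
-- ensure that the table now faithfully represents a relation.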
BTW, I suggest you avoid using "0NF" or "UNF". Non-relational tables don't have a level of normalization, so attaching any kind of "NF" to them is misleading.
As long as you are talking about tables, there is one further case that needs to be covered. It's the case of duplicate rows.
Duplicate rows are rows that are identical in appearance but not in row number. Such a table cannot have a primary key. Sometimes duplicate rows represent the same information. Sometimes not.
For example, consider a table with just four columns: customerid, productid, quantity, price. If a customer orders the same product twice, we'll have two identical rows representing different information. This is not good.
Note that the corresponding thing cannot happen with relations. If two tuples in a relation have the same appearance, then they are the same tuple.
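To make the example concrete, a quick sketch (generic SQL; the values are made up) of how such duplicate rows arise:

CREATE TABLE order_lines (
    customerid INT,
    productid  INT,
    quantity   INT,
    price      DECIMAL
);

-- The same customer orders the same product twice: two identical rows,
-- yet they represent two distinct orders
INSERT INTO order_lines VALUES (1, 42, 1, 9.99);
INSERT INTO order_lines VALUES (1, 42, 1, 9.99);

-- No primary key can be declared on this table as it stands, because
-- no combination of columns distinguishes the two rows.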
As to the other points, they are covered by excellent earlier answers.
Before you check for normalization, your table must have a primary key (the primary key plays the lead role in a relational database).
1NF says that all of your table's attributes must be single-valued.
Answer to question 1: In a given table, if there is no primary key and it is even impossible to create a composite primary key, then what is the normal form of that table?
Answer: If there is no primary key in the relation and it is impossible to create a composite primary key (as I read your question, even combining all the columns of a row into a candidate key would still not identify each row uniquely, because duplicate rows are present), then it is not in any normal form.
Answer to question 2:
If you add a column containing unique values, and every cell contains only one value, then the table is in 1NF.
If you still need some clarification, you can ask in the comment box.
0NF is not a form of normalization; refer to C. J. Date or Henry Korth (their database management system books).
Hope this helps.
I am modeling a database for a webshop and have come across an issue. Basically the question is whether to ignore database normalization rules for simplicity's sake.
Below is the relevant part of my diagram prior to the issue.
Database diagram
Basically, the product can have options (size, flavor, color) but only from one option group. Since an option group can have many options and a product that uses it can take a subset, a ProductOption table is created. Next we have a SpecialOffers table. Next, a special offer can have many products and products can belong to many special offers, hence the association table SpecialOfferProducts. All this works fine until the special offer includes a product that has options. This is where I run into problems. I have a couple of ideas.
First idea:
Create an association table between SpecialOfferProducts and ProductOptions. I don't like this idea since both tables have composite primary keys and creating a table that has a composite primary key composed of two composite primary keys seems really weird and I have never seen anything like it.
Second idea:
Create an association table between SpecialOfferProducts and Options. This seems wrong since Options is not directly tied to Product. Still, this would work and the primary key would be a little simpler.
Third idea:
This is the one that I like the most but it violates a few rules. Change the SpecialOfferProducts table. Make it have its own primary key and have SpecialOffers, Products and Options as foreign keys. Simply make the Options foreign key nullable and problem solved. Of course the problems are that I am not making an association table where I should and am making a foreign key nullable. This would slightly complicate my code to deal with all of this but I still feel that this is much simpler than the other approaches since I reduce the number of composite keys and I don't have to add another table in the case where the product in a special offer uses an option.
My question is, which one of these options is best? Is there a better option I have not mentioned?
Using Martin style notation
OptionGroups has a (0,n) relationship with the table Options. Options has a (1,1) relationship with the table OptionGroups. The purpose of these tables is to store information like color, size, etc. An example would be an OptionGroups entry color that has Option entries black, white, etc.
Product table has a (0,1) relationship with table OptionGroups. OptionGroups has a (0,n) relationship with table Product. Product table has a (0,n) relationship with the table Options. Options table has a (0,n) relationship with the table Product. The many-to-many relation produces the association table ProductOptions. ProductOptions has a composite PK ProductID, OptionsID. The purpose of these tables is to allow a product to have (but not require it to have) options from a certain option group, without needing to have all options from that group.
Example 1. Product does not have any options, hence FK Product_OptionGroups is null. In this case the product does not have any entries in the ProductOptions table.
Example 2. Product has options (let's say color) and so the FK Product_OptionGroups is not null (it has the ID of the corresponding option group). Option group color can have many colors and the product is allowed to use one or many of those colors. The colors in use by the product are entries in the table ProductOptions.
SpecialOffer table has a (1,n) relation to the table Products. Products table has a (0,n) relation to the table SpecialOffer. Many-to-many relation creates the association table SpecialOfferProducts. This table has a PK SpecialOfferID, ProductID. The table has a Quantity attribute indicating the quantity of the product.
Example. SpecialOffer A includes one instance of Product A and two instances of Product B.
Let's say that Product A has options. Now the SpecialOfferProducts table must reference the correct option (maybe the product can be blue and red and the special offer only includes the red product). This is where the current schema does not work and either an additional table must be introduced (ideas 1 and 2) or the existing tables changed (idea 3).
Maybe you have some relation(ship)/association not representable in terms of your first three:
-- special offer S offers the pairing of P and option O
SpecialOfferProductOption(S, P, O)
-- PK (S, P, O)
-- FK (S, P) to SpecialOfferProducts, FK (P, O) to ProductOptions
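Expressed as a table definition, that might look roughly like this (the INT types and the referenced key columns are assumptions based on the diagram description):

CREATE TABLE SpecialOfferProductOption (
    S INT,  -- special offer
    P INT,  -- product
    O INT,  -- option
    PRIMARY KEY (S, P, O),
    FOREIGN KEY (S, P) REFERENCES SpecialOfferProducts (SpecialOfferID, ProductID),
    FOREIGN KEY (P, O) REFERENCES ProductOptions (ProductID, OptionsID)
);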
You don't seem to understand the use of composite keys, CKs (candidate key), FKs (foreign keys) & constraints. Constraints (PKs, UNIQUE, FKs, etc) arise after you design relation(ship)s/associations sufficient to clearly describe your business situations (represented by tables), per the situations that can arise.
From an ER point of view, you are not properly applying the notions of participating entity (type), entity (type) key & associative entity (type).
You are needlessly & vaguely afraid of composite CKs. Even if you wanted to reduce use of composite keys, you should first find a straightforward design. If you don't want to use composite keys, introduce id PKs along with other CKs. But note that when you use ids as FKs that doesn't drop the obligation to properly constrain the tables that they appear in to agree where necessary with other ids or columns per the constraints you would have needed if you had used the composite CKs instead.
First idea:
Create an association table between SpecialOfferProducts and ProductOptions. I don't like this idea since both tables have composite primary keys and creating a table that has a composite primary key composed of two composite primary keys seems really weird and I have never seen anything like it.
It's not clear what you mean by this. Maybe you mean the above (good) design. Maybe you mean having duplicate product columns; but that's not what good design suggests.
From an ER perspective: You may be thinking of this as a relation(ship)/association on special orders & products. But then the entity keys would not be composite, they would identify special orders & products, and also options would participate. Or we can use the ER concept of reifying relation(ship)s/associations SpecialOfferProducts & ProductOptions to associative entities that are the two participants. That would use composite keys. (If options weren't considered entities then ER would call this a weak relation(ship)/association entity with special orders & products as identifying entities.) Regardless, special orders & products must agree on options, and if that isn't enforced via FKs then it still needs constraining.
If you have (been) read(ing) some published text(s) on information modeling & database design (as you should) you will see many uses of composite keys.
Second idea:
Create an association table between SpecialOfferProducts and Options. This seems wrong since Options is not directly tied to Product. Still this would work and the primary key would be a little simpler.
It's not clear what you mean by "directly tied", "seems" or "wrong".
Relational tables represent relation(ship)s/associations among values, certain subrows of which may identify certain entities. Just use the relevant columns & declare the relevant constraints.
From an ER perspective: Considering that you seem to be confused about participant entities (special offer vs SpecialOfferProduct), maybe this is moot, but: Maybe if you tried to express yourself only using technical terms & without the confusion then you would be trying to say that this design needs a constraint that product-option pairs appear in ProductOptions and that it's messy that the constraint involves a relation(ship)/association whose associative entity ProductOption isn't one of the participating entities. I'd agree, but such a design is not "wrong".
Third idea:
This is the one that I like the most but it violates a few rules. Change the SpecialOfferProducts table. Make it have its own primary key and have SpecialOffers, Products and Options as foreign keys. Simply make the Options foreign key nullable and problem solved.
Besides just being needlessly complex, this design is bad. It involves a complex table meaning & complex constraints. When setting the table value you need to decide when to use & not use nulls. When reading you need to figure out what a row means based on whether it has a null. Introducing an id or nulls, possibly while dropping columns, does not remove the obligation to constrain remaining columns if that's not handled by remaining FK constraints. Normally we combine tables while introducing nulls in columns that are not part of every CK--not your case. Here your adding ids doesn't even obviate the need to constrain pairs of products and non-null option column values to be in ProductOptions. And when there is a NULL option column value there should still exist certain rows in ProductOptions and sometimes not certain rows in SpecialOfferProducts. Also this design must be used with complex queries dealing with the presence of NULL. (Which you address.) Justifying this as an ER design is similarly problematic.
PS 1 Please explain your business relation(ship)s/associations with less generic terms than the essentially meaningless "has", "with", "uses", "in" & "belong to"--as you would with a client buying your products & special offers. They refer to relation(ship)s/associations & sets, but they don't explain them. (Similarly, cardinalities are properties of relation(ship)s/associations, but don't explain/characterize them.)
PS 2 ER reasoning about designs involves what (possibly associative) entities are participating in relationships, whereas in the relational model view tables just capture n-ary relation(ship)s/associations for any n. So the ER view is adding needless distinctions. That is why ER-based information modeling & database design approaches are not as effective as fact-based approaches:
This leads to inadequate normalization and constraints, hence redundancy and loss of integrity. Or when those steps are adequately done it leads to the E-R diagram not actually describing the application, which is actually described by the relational database predicates, tables and constraints. Then the E-R diagram is both vague, redundant and wrong.
PS 3 We don't need SpecialOfferProducts if it holds rows where "special offer S offers the pairing of P and some option", because it is select S, P from SpecialOfferProductOption. (This seems to be the case since your option 3 involves having only one table that you call SpecialOfferProducts but is like this table with an added id.) But if it holds rows where say "special offer S offers product P" and that can be so when not all of S's product-option pairs have been recorded then you need it. (Something similar arises re deciding when something is an entity, eg when there should be a table "S is a special option".)
PS 4
seems really weird and I have never seen anything like it
This is the story of life. But in a technical context if we learn and apply clearly defined basic definitions, rules & procedures then we "see" more, and more clearly. (And don't vaguely think we vaguely see things that aren't there.) And "weird" is a rare case where we can explicitly justify that our tools don't apply.
My question is more or less the opposite of this one: why would one ever want to bother finding a natural primary key in a relation when using a sequence as a surrogate seems so much easier?
BradC mentioned in his answer to a related question that the criteria for choosing a primary key are uniqueness, irreducibility, simplicity, stability and familiarity. It looks to me like using a sequence sacrifices the last criterion in order to provide an optimal solution for the first four.
If I hold those criteria to be correct, I can reformulate my question as: in which circumstances would one ever consider it advantageous to complicate one's life by looking for a unique, irreducible, simple and stable key that is also familiar?
To get a meaningful value from a lookup table without doing unnecessary joins.
Example case: garments references a lookup table of colors, which has an auto-increment primary key. Getting the name of the color requires a join:
SELECT c.color
FROM garments g
JOIN colors c USING (color_id);
Simpler example: the colors.color itself is the primary key of that table, and therefore it's the foreign key column in any table that references it.
SELECT g.color
FROM garments g;
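A sketch of that natural-key variant (PostgreSQL-style syntax; the garment_id surrogate and the types are assumptions):

CREATE TABLE colors (
    color TEXT PRIMARY KEY            -- the value itself is the key
);

CREATE TABLE garments (
    garment_id SERIAL PRIMARY KEY,    -- hypothetical surrogate for garments
    color      TEXT NOT NULL REFERENCES colors (color)
);

-- The color name is now available directly, no join needed:
SELECT g.color FROM garments g;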
The answer is data integrity. Instances of entities in the business domain outside the database are by definition identifiable things. If you fail to give them external, real world identifiers in the database then that database stands little chance of modelling reality correctly.
A natural key[1] is what ensures facts in the database are identifiable with actual things in the reality you are trying to model. They are the means which users rely on when they act on and update the data in the database. The constraints that enforce those keys are an implementation of business rules. If your database is to model the business domain accurately then natural keys are not just desirable but essential. If you doubt that then you haven't done enough business analysis. Just ask your customers how they think their business would operate if they were left looking at screens full of duplicate data!
[1] I recommend calling them business keys or domain keys rather than natural keys. Those are far more appropriate and less overloaded terms even though they mean exactly the same thing.
You generally need to identify what the unique key on the data is anyway, as you still need to be able to ensure that the data is not duplicated.
The strength of the synthetic key is that it allows the values of the unique natural key to be modifiable in future, with child records not needing to be updated.
So you're not really skipping the "identify the key" part of the design by using a synthetic primary key, you're just insulating yourself from the possibility of the values changing.
Below are the benefits of using a natural primary key:
If you need a unique constraint on a column and that column will never receive a NULL value, then making it the primary key fulfills that need, so you save the cost of one extra key.
In some RDBMSs, the key you declare as the primary key automatically gets a B-tree index on that column, and if you choose a natural primary key that matches your access pattern, it is icing on the cake: you kill two birds with one stone, saving the cost of an extra index and making your queries faster by having that meaningful primary key in the WHERE clause.
Last but not least, you save the space of one extra column/key/index.
I never use weak entities when I'm doing database modelling, and things have seemed fine so far. I usually ignored the whole issue by giving each entity a primary (auto-generated) key.
However, I came across some posts that mention that some entities should be weak if their existence totally depends on other entities.
But on the other hand, some refer to weak entities as a set which does not possess sufficient attributes to form a primary key. Well, that means all entities in my database were weak at first, before I gave them the auto-incremented key.
Could someone please outline the importance of weak entities and what are the consequences of not using them? Why don't we just give each entity a primary auto generated key and make it strong?
UPDATE:
Maybe someone can explain why weak entities should be identified by the primary key of the parent entity + an identifier instead of creating a surrogate key and relating it to the parent entity using a foreign key (with cascading changes on update and delete)?
Take an order with multiple order line items as an example. The weak entities would be the individual line items stored in their own table. Their primary key could be the primary key of the order, plus a simple integer number (e.g. 1, 2, 3, which is unique only within the order.) Thus, they don't really have their own primary key as a unique numbered column, their key spans two columns and is only unique that way.
The order line items should be deleted if and when the order is deleted - they don't make sense standing on their own. It is this linkage that makes them weak -- one thing being deleted should delete the other.
If you give each order line item its own primary key, you'll still need to relate it back to the order, which means putting in a foreign key for the order or having a cross-reference table. (You may also need to know the line item number within the order, which would mean adding a simple integer column... and at this point you've added enough to have a key without an auto-generated one.) For the design pattern of owned sub-items, either of these alternatives is a bit of overkill.
Using the composite primary key also enforces the relationship between the order and its line items, in that this schema will not allow a line item to be assigned to multiple orders.
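A minimal sketch of that design (generic SQL; the table and column names are illustrative):

CREATE TABLE orders (
    order_id   INT PRIMARY KEY,
    order_date DATE
);

CREATE TABLE order_line_items (
    order_id INT REFERENCES orders (order_id) ON DELETE CASCADE,
    line_no  INT,            -- 1, 2, 3, ... unique only within its order
    product  TEXT,
    quantity INT,
    PRIMARY KEY (order_id, line_no)
);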
Another consideration is that you can shard the orders and order line items according to the order's primary key, since both tables have that key. (Sharding is generally easier to do based on the primary key than on regular columns.)
Hierarchical containment isn't always what you want; but it is such a commonly occurring pattern that it is nice to be clear about it, and composite keys can be used in this case. Here, using orders with line items as sub-items (i.e. contained), we're saying not just that line items are 1-to-many with respect to an order, but that line items are owned and don't exist independently of orders -- that line items compose to create a single order object.
In keeping with that, we’re explicitly not going to manage a separate key space for (all) line items (together as a group), but instead borrow and extend the key space of an order. Instead of asking the system to maintain a separate key space for line items, and manually (i.e. less formally) maintaining a foreign key relation back to the order, and also maintaining an integer line item rather separately (from the order foreign reference), we can ask the system to ensure uniqueness of the whole composite key, which includes the line item number within the order.
Of course, you wouldn’t be able to add a line item that isn’t associated with an order, but additionally, using the composite sub-key, you also won’t be able to add one that overlaps with another (e.g. it won’t let you add two line item #3’s for the same order).
This forces producers and consumers of line items to think about them as being contained within and part of orders, and not as independent items, or, put another way, to reference a line item by going thru an order, or, yet in other words, to get a reference to the order “for free” by referencing one of its line items. (And because you also have a reference to the order as part of such a foreign key, you can also use that order portion of the composite foreign key alone to group or join.)
I recently worked on a project that had to manage large amounts of data samples for lake readings. In this project, we had tables similar to the following, where records is a collection of lake readings by location and uploader, and samples contain the actual lake readings -- things like temperature and intensity.
CREATE TABLE records(
    email       TEXT REFERENCES users(email),
    lat         DECIMAL,
    lon         DECIMAL,
    depth       TEXT,
    upload_date TIMESTAMP,
    comment     TEXT,
    PRIMARY KEY (upload_date, email)
);

CREATE TABLE samples(
    date_taken  TIMESTAMP,
    temp        DECIMAL,
    intensity   DECIMAL,
    upload_date TIMESTAMP,
    email       TEXT,
    PRIMARY KEY (date_taken, upload_date, email),
    FOREIGN KEY (upload_date, email) REFERENCES records(upload_date, email)
);
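A typical query against these tables joins on the inherited key (a sketch; the filter values are made up):

SELECT s.date_taken, s.temp, s.intensity
FROM records r
JOIN samples s
  ON  s.upload_date = r.upload_date
  AND s.email       = r.email
WHERE r.email = 'someone@example.org'
  AND r.upload_date = '2015-06-01 12:00:00';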
samples was modeled as a weak entity, dependent on records. As you know, this means that all of the foreign keys are inherited from records and used to identify a single row in samples. But what would happen if we decided to make it an entity instead? Well, you can look at it a few different ways. Either:
1. The primary key from records would not be present in samples and we would have to assign some kind of arbitrary auto-increment type ID, as you suggest. Each record contains thousands of samples, and users think of samples as part of the records that they recorded in the field. They expect to browse samples by record, so we would have a very large samples table with no obvious mapping to the records they belong to in real life.
2. Or we simply don't model it as a weak entity, but recognize that it needs to be able to identify itself with a records row, so we assign an upload_date and email. If we make these two entries foreign keys, then we have just made a weak entity without realizing it. If we don't, then our application layer has to be responsible for checking to make sure that each upload_date and email are also present in records, instead of the database doing it.
In this case, making samples a weak entity (including foreign keys in its primary key) is the simplest option (and makes the most sense).
Summary
You should model entities as weak when they are actually weak in real life. If you have an entity that needs a portion of a different key to identify itself (having a foreign key that is part of its primary key), then it's probably weak.
Can you remodel the system to avoid using weak entities? Possibly. If we wanted to have unassociated samples, then we would need to be able to make their upload_date and email null, which means they would not be in the primary key and samples would not be a weak entity. We would have to do something like I described in 1.
The primary key must be unique. Forever. That's all there is to it. If the data in the table doesn't provide that naturally you'd create a surrogate key.
Now what are those? A natural key consists of one or more existing columns, whereas a surrogate key is an extra added column, usually auto-incremental.
A good example for a natural key would be an ISO country code in a countries table. You'd gain nothing from adding an auto-increment column here. On the contrary, you may save yourself from JOINing in the countries table in some queries, because you already have the ISO code right there.
A bad one would be the name (or multiple columns) in a contacts table. That's why it's better to use a surrogate key in this case.
That's how I think about it and I rarely - if ever - run into any kind of questionable layout issues.
A practical hint: you never run an UPDATE on columns making up the primary key. You'd delete that row and re-insert it with new values. That can save you a lot of headaches.
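For example, a sketch of that pattern using the countries table mentioned above (the column names iso_code and name, and the values, are assumptions; in practice child rows referencing the old key would also need handling):

BEGIN;
DELETE FROM countries WHERE iso_code = 'XX';                         -- old key value
INSERT INTO countries (iso_code, name) VALUES ('XY', 'New Country'); -- re-insert with the new key
COMMIT;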
Choosing good primary keys, candidate keys and the foreign keys that use them is a vitally important database design task -- as much art as science. The design task has very specific design criteria.
What are the criteria?
The criteria for consideration of a primary key are:
Uniqueness
Irreducibility (no subset of the key uniquely identifies a row in the table)
Simplicity (so that relational representation & manipulation can be simpler)
Stability (should not be altered frequently)
Familiarity (meaningful to the user)
What is a Primary Key?
The primary key is something that uniquely identifies a row/record of data. It can also be multiple columns, which is called a composite.
Ability to Change
Because the primary key is often used for foreign references, it should be as stable as possible. All data in the database is mutable, provided someone is connecting with an account that has appropriate privileges. This is why databases provide the ability to define ON DELETE CASCADE and ON UPDATE CASCADE--to sync referential dependencies without having to disable constraints.
Natural or Artificial/Surrogate?
Ideally, you want a natural key. A natural key is existing data that uniquely identifies the entity you are modeling. For example, the abbreviations of US states are a good natural key because the abbreviation is consistent and everyone knows them:
US_STATE_PRIMARY_KEY US_STATE
--------------------------
AL Alabama
AK Alaska
AZ Arizona
AR Arkansas
CA California
Don't try too hard to find a natural key. They seldom exist. It's unlikely that a US State name would change, but it is plausible.
Realistically, primary keys will typically be artificial (often generated by database functionality). These are typically numbers or GUIDs, and they're considered artificial because on their own - there's nothing to relate their value to the information they uniquely identify. A sales receipt is always numbered, because there's nothing natural about it and it's also for auditing - gaps in the receipt numbers raise suspicions. To demonstrate how arbitrary numbering is, here's the US state table but using an integer for the primary key column, US_STATE_CODE:
US_STATE_PRIMARY_KEY US_STATE
--------------------------
100 Alabama
101 Alaska
102 Arizona
103 Arkansas
104 California
There's no requirement to start the value at one; some shops use this as a security measure to thwart SQL injection. The value is sequential based on the alphabetic ordering of the State name, but that can't be guaranteed. But unlike the natural key, if the state name changed - only one column would have to be updated.
Single Column vs Composite
Ideally one column will be the primary key, but make the decision based on the data at hand--do not combine columns just for the sake of having a single column. If you do shoehorn data together, use a character to separate the data easily (though operations to do this won't be able to take advantage of an index if present).
Performance
From a performance perspective, integers are best because they offer a decent range of values and the number of bytes used is small when you compare to VARCHAR of five or more characters.
Database design starts with a conceptual data model (such as an entity relationship diagram) and finishes up with a database schema or schemas. Entities are mapped to tables; in this process one entity may be split into several tables, several entities may be merged into a single table, and new tables may arise (for instance, intersection tables to implement many-to-many relationships).
In an ERD entities have primary keys. These are natural keys, that is, they are attributes of the entity. For a PERSON entity it might be SocialSecurityNumber. For an ORDER entity it might be OrderRef. For an INVOICE entity it might be InvoiceNo. In the first case that is a real-life identifier; in the second case it is a smart key in an ugly format (2010/DEF/000023); in the third case it is a monotonically incrementing number because that is what the current paper-based system uses.
Natural keys can be fanciful. I once worked on a database design where the analyst had specified the CUSTOMER entity with a key of (FullName, Address, Sex, DateOfBirth, DistinguishingCharacteristics) on the basis that two individuals of the same name, birth date and gender could live at the same address.
The characteristics of an entity's primary key are:
unique
familiar
stable (presumed)
minimal (one or more attributes but as few as necessary)
When it comes to primary keys for database tables, natural keys are not always suitable.
There are many reasons not to use SSN as a physical primary key. Protection of a citizen's personal data is actually the most important but it is also the case that an individual's number can change. Primary keys should be unvarying.
Smart keys are dumb. They are actually compound keys compressed into a single column. They are better represented as separate columns, not least because it is a frequent requirement to search on single elements of the key. Also, the format of such keys can change.
In general, compound keys are a pain as primary keys because we have to cascade multiple columns as foreign keys. This is exacerbated when the child's primary key is defined as a serial number within the parent's primary key. There are systems out there in which dependent tables inherit a nine-column foreign key from a parent when they have a scant two data columns of their own. Sometimes this sort of inheritance can be useful but mostly it is just a hassle.
The characteristics of a table's primary key are:
unique
appropriate (meaningless)
guaranteed stability
minimal, usually a single column (except for intersection tables)
So unless the candidate key is a meaningless identifier (such as InvoiceNo) a table should have a synthetic key (AKA surrogate key). This can be a monotonically incrementing number or a GUID according to your needs. Regarding intersection tables, if they have no other attributes or dependent tables there is no value in replacing a compound primary key (AKA composite key) with a synthetic one.
The crucial thing is: we still enforce the candidate keys. This means applying UNIQUE constraints on those columns - SSN , OrderRef - in the parent table. This is because a synthetic key uniquely identifies a row in a table, it does not uniquely identify the data.
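In SQL terms, that might look like this (a sketch; the types and the customer column are illustrative):

CREATE TABLE orders (
    order_id  SERIAL PRIMARY KEY,     -- synthetic key, used by foreign key references
    order_ref TEXT NOT NULL UNIQUE,   -- the candidate key (OrderRef) is still enforced
    customer  TEXT
);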
Regarding familiarity
Familiarity is a curly one. It is an important consideration when we are identifying primary keys in a conceptual data model, but it is less useful when it comes to database design.
In a comment, @bbadour provides two contrasting examples:
{3296013,840082470,Bob Badour,745} versus {840082470,Bob Badour,PE,CA}
and poses the question:
"What does 3296013 achieve that was not already achieved by 840082470, which happens to be the primary key for my academic records at any or every post-secondary school in Canada."
Well, 840082470 is like a invoice number. Of itself it is a meaningless string of digits. If the system we are designing belongs to the domain of Canadian higher education then it is certainly acceptable as a candidate key. However, because it is a key apparently owned by an external central system (forgive me for not understanding the Canadian academic system), it is open to some of the objections to SSN as a primary key. We are reliant on that external system to ensure uniqueness, guarantee stability and verify identification.
As for 745 versus PE,CA, that is clearly wrong. The Canadian postal abbreviation for "Prince Edward Island" and the ISO digraph for "Canada" identify two distinct pieces of information and derive from different sources, so they should be represented as two separate columns. But let us focus on whether 745 or PE makes the better primary key.
First thing, the database doesn't care which data type we use for the code to represent "Prince Edward Island". It just wants guaranteed uniqueness.
Second thing, the user-facing part of the system is likely to display the full expansion "Prince Edward Island", in which case the application is going to need to execute a look-up anyway. This is because users of a system which also holds addresses from the country of Peru or the state of California will appreciate the clarity of the expanded names[1]. Certainly if we go beyond the few hard cases (such as state abbreviations) the application should always expand codes when displaying them to users.
Thus the only advantage of using PE rather than 745 is that it makes ad hoc querying easier.
Third thing, if the code expansion changes we might want to distinguish records which use the newer version. This is a lot easier if 745='Prince Edward Island' and 746='Prince Edward Is.' than if we use PE as the primary key.
Fourth thing, there are programming considerations. For instance, if the application developers have to provide drop-down lists using Java Enumerations they need numeric codes.
In short, familiarity of natural keys is not as useful as the practicality of surrogate keys.
[1] Canadians will know that CA stands for Canada. But does MO stand for Morocco, Monaco, Moldova, Montenegro, Mongolia or Montserrat? Actually none of them: it's Macau.
A Primary Key is a key that uniquely identifies an entity. When you are choosing a primary key, the best choice is almost always a surrogate key that has absolutely nothing to do with the entity at all other than uniquely identifying it.
And that's it. There are supposedly rare edge cases where a primary key might be a natural key, but I've never seen a valid one.
Most of us use a 32-bit auto-increment integer as a primary key. Another excellent choice (in certain circumstances) is a UUID.
A candidate key is a set of attributes that are irreducibly unique (irreducible meaning that no attribute can be removed from the key without losing the uniqueness property).
Other criteria when choosing what candidate keys to implement are: simplicity, stability, familiarity.
These three criteria are important considerations but not necessarily essential attributes of a key. For instance it may be desirable and quite reasonable to enforce a key that can change often. e.g.: a user login name is required to be unique but the user may change it at will as long as it remains unique.
A primary key is a candidate key.
Hey, it's open again. Here goes.
(1) Choose good candidate keys.
It does not pertain to the database designer to choose candidate keys. The database designer has the responsibility to see to it that all the uniqueness requirements he is informed of by the user will be enforced. So it is the user who "chooses" what the candidate keys are.
There are two scenarios I can think of that relax this unequivocal position a bit.
One is if the user says that some attribute of type 'video' or 'audio' (or some such) is to be unique. It may be infeasible to actually enforce that, and it is the designer's responsibility to point that out to the user (as it is also his responsibility to point out that 'uniqueness' of audio and video content is a very debatable subject, and that any uniqueness on such attribute values, even if enforceable by the system, still has a good chance of not being the same uniqueness that the user wants).
Second is how the picture gets muddied by the possibility of distinct logical designs all addressing the same problem. If D1 and D2 are both valid designs addressing the same problem, then it might be the case that a certain uniqueness rule imposed by the user is enforceable using keys in D1, but not in D2. From this perspective, "choosing candidate keys" can be interpreted as "choosing a particular design such that a given uniqueness rule is enforceable using keys". But that wasn't really the question that you asked.
(2) Choose good primary keys.
A while ago, Darwen launched the question "What are good reasons to single out one particular candidate from among the others as being 'primary'?". Nothing much came out, except then perhaps: "to suggest that this particular key is the preferred one to use whenever making references to this relvar". I suspect they didn't find that convincing enough to change their earlier decision that "no key is more unique than any other".
But, supposing that nonetheless there exists some valid reason to single out one particular key as "primary", I suppose the following considerations apply:
the likeliness, or appropriateness, of using this primary key also as, e.g., the clustering key in the physical design.
and as a consequence of that, the probability of having to change a value of some existing primary key. Key values that are highly stable will be preferable over key values that are more volatile.
the percentage of the business that naturally uses some such key in their daily operations.
if the required space for physically encoding key values is significantly different, which one has the smallest encoding size.
Your answer to Erwin:
"I agree that choosing a primary key merely designates one candidate key as preferred for foreign key references. However, even if we eliminated the name "primary key" entirely, designers must still choose which candidate key to propagate into another relation for reference purposes. If users identify a heavily referenced relation with an unstable, composite key, do you intend to imply that the designer has no business choosing an additional simple, stable key? Or using the simple, stable key for referencing the relation? Your candidate key section seems to imply that. – bbadour 8 hours ago "
Your original question was about 'primary keys'. Now you change your focus to keys and foreign keys. A key is an integrity constraint, so the only criteria are that a minimal set of attributes has to be unique in a relation (uniqueness and irreducibility). If we change our focus to foreign keys, then simplicity, stability and familiarity are the criteria for choosing from all the candidate keys in the referenced relation. There could be more candidate keys that fulfill those criteria to more or less the same extent. If we look at familiarity, one candidate key could be very familiar to one group of users and not to another group, for which another candidate key is more familiar. Think about different views or subschemas of a database. This second group of users should choose a different candidate key for reference purposes (as a foreign key). If you insist on 'primary keys', of which we only have one per relation, then I have to ask what makes a key more primary than others.
I think the term primary key should not be used. At least at the logical level. Also the term 'foreign keys' is not well chosen (foreign keys are not keys, but references).
So, I think the remarks of Erwin about ‘primary’ keys were very much to the point. Or at least this was my interpretation of what he means.
Do you agree with this?
If so, would you change your original question to "What are the design criteria for keys and what are the criteria to choose a foreign key from the available candidate keys?"?
If not, why?
Regards,
Carlos
A primary key is a candidate key chosen for special treatment, so first we must look at the properties of candidate keys. A set of one or more columns is a candidate key if it has the following two properties:
Uniqueness: A candidate key must uniquely identify each row in a table. No table may contain two rows with the same value for the candidate key.
Irreducibility: Removing any column from a candidate key must violate the uniqueness property. In other words, no subset of columns in a candidate key is itself a candidate key.
If no candidate key exists, and sometimes even if one does, a surrogate key is often created using an auto-incrementing integer column, or made up using some other technique. This surrogate key is now also a candidate key.
It is often useful to choose among the available candidate keys and to designate one of them as the primary key. The first criterion often applied is simplicity, indicating the candidate key with the fewest columns. However, there are other potential criteria, like familiarity, familiar values being more useful than non-familiar values, and stability, stable keys being less troublesome than keys that are apt to change. These criteria, however, are strictly outside the scope of the relational model, often conflict with each other, and are often made to deal with implementation limitations.
I would say that the first two concepts "uniqueness" and "irreducability" are less design criteria than fundamental properties of primary keys, while the latter concepts of "simplicity", "familiarity" and "stability" are more properly labeled design criteria, as they involve tradeoffs and subjectivity.
Why choose a primary key? Simplicity and familiarity are not only criteria for choosing among available candidate keys, but are why we should choose a primary key at all. If there are multiple candidate keys in a table, it simplifies things if all foreign keys pointing to that table refer to the same candidate key. Furthermore, the very act of choosing a particular candidate key will help make it familiar.
What are the criteria?
A PRIMARY KEY is something that will define the entity, only the entity and nothing but the entity.
You can take it from the outside world. Say, a star catalog number to identify a star (good example), or an SSN to identify a person (bad example).
In this case, you rely on the outside world.
Do all people have SSN? (They don't).
Are SSN's unique? (They aren't).
Can an SSN be assigned to another person? (It can).
You can generate it inside your model, using AUTOINCREMENT or GUIDs or whatever.
In this case, you rely on yourself and your database skills.
Do all people in your model have an ID? (Yes, they do, otherwise they wouldn't be in the table with ID NOT NULL).
Are these ID's unique? (Yes, they are, the PRIMARY KEY constraint takes care of it).
Can they be assigned to other persons? (No, they cannot, they are either non-repeatable by design or auto incrementing).
Or another set of answers:
Do all people in your model have an ID? (No, they don't, the people table was accidentally dropped, though some other information retained).
Are these ID's unique? (No, we failed to merge two versions of the database properly).
Can they be assigned to other persons? (Yes, we reset the AUTOINCREMENT by mistake).
The most important thing is that a surrogate key is a feast that is always with you. You can always create a surrogate key: nothing on Earth can stop you from declaring an AUTOINCREMENT field. But by far not all things have some kind of identifier everybody agrees upon.
However, a good natural key cannot be overemphasized.
The Guide Star Catalog database is most probably backed up more reliably than yours, and the list of US state codes you can always restore right from memory.
Only one, really: choose a surrogate for each table (identity/auto_number) or something similar that the users will never even see, so you can do whatever is necessary with them whenever you need to, now and in the future.
(Not quite sure how to interpret this question. Sounds like a quiz or something where you are looking for one single "right" answer from a textbook. I'm going to interpret the question as a more practical one, hence my advice below.)
At least in the MS SQL world, discussion about a proper Primary Key is inevitably wrapped up in discussion about the proper clustered index for a table. The two don't have to be the same, but they are by default, and for many tables, making the two the same is often a good idea.
For the purpose of our discussion here, it's important to distinguish between the two:
A PRIMARY KEY is a field or combination of fields that uniquely identify a row.
A CLUSTERED INDEX is a field or combination of fields that represents the physical ordering of a table. (Again, I am speaking about MS SQL Server, not sure how other RDBS might handle this)
Key to the remainder of my discussion is knowing that since SQL 7.0, the clustered index key is used as a row identifier for all non-clustered indexes. This means that many of the same criteria for choosing a good clustering key are the same as for choosing a good primary key.
Let's first look at the criteria for a good clustered index (From Kimberly Tripp's excellent article). A clustered index should be:
Unique - otherwise useless as a row identifier for other indexes
Narrow - this key is used in other indexes, so should be as narrow as possible
Static - If key values change, then references become invalid and will need updating
Ever-increasing - To reduce physical table fragmentation as new rows are added
It is readily apparent the first 3 are also good criteria for a primary key. #4 is a bonus that will reduce table fragmentation as tables grow.
A GUID as a primary key, as popular as that is, actually fails 2 of these criteria (Narrow and Ever-Increasing). As such, it is not recommended as a PK/Clustered index in most circumstances (see Kim's related article here)
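To illustrate, a T-SQL sketch (SQL Server syntax; the table and column names are made up) of a primary key that is deliberately not the clustered index:

CREATE TABLE dbo.Orders (
    OrderGuid  UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED,   -- the PK, but not the clustered index
    OrderDate  DATETIME2 NOT NULL,
    CustomerId INT NOT NULL
);

-- Cluster instead on a narrow, static, ever-increasing key
CREATE CLUSTERED INDEX CIX_Orders_OrderDate ON dbo.Orders (OrderDate);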
I'm going to say something here that is not expected.
All the stuff they teach in database about normalization and keys is all wrong when it comes to choosing primary keys.
The primary key is special when it comes to range queries, and for that reason, if you have a dominant range query, that is your primary key, no exceptions.
If your dominant range query is not on a candidate key you end up with a primary key that is not enforced for uniqueness! This is sometimes called a clustered index, which is a misnomer because there is no index.
Now the normalization and candidate keys are all important, and you will want to enforce unique constraints on at least some of them. But do not assign the primary key because it is the natural key. In fact, this is slower than defining an index and a unique constraint. Define the primary key based on range queries only.
Remember, there is no constraint to actually have primary keys. A table with no primary keys is called a heap table and has either no intrinsic ordering or insertion order intrinsic ordering.
EDIT: definition of range query:
A range query is a query that is an ORDER BY query or contains either a greater than or less than operator. What we are interested in are the columns for which these queries run on. The fundamental idea is a range query fetches several (tens to hundreds to perhaps thousands but not all) rows from the table based on bounding conditions at one or both ends.
There is another kind of range query: when you have a foreign key to another table and an operation selects all rows matching that foreign key. This is in fact also a range query, although not obviously so.
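For example (a sketch; the table and column names are made up, and LIMIT is PostgreSQL/MySQL syntax), all of the following count as range queries in this sense:

-- Bounded scan and ORDER BY on order_date
SELECT * FROM orders
WHERE order_date >= '2020-01-01' AND order_date < '2020-02-01';

SELECT * FROM orders
ORDER BY order_date DESC
LIMIT 100;

-- Fetching all children of a foreign key value is also effectively a range query
SELECT * FROM order_line_items WHERE order_id = 12345;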