Normalization and primary keys - database

In a given table if there is no primary key and even impossible to create a composite primary key then what is the normal form of that table ?
If its zero(0NF) adding a new column and making it primary key will convert this table to 1NF ?

Normal forms apply to relations, which are mathematical structures. Tables can be used to represent relations, but this requires some rules to ensure that the table doesn't contain more or less information than the corresponding relation.
In order for a table to represent a relation:
all rows and columns must be unique
the order they're in mustn't matter
all significant information must be represented as values in cells (i.e. fonts, highlighting, etc, mustn't matter)
every cell must contain one value (doesn't matter how simple or complex that value is)
Also, the relational model cares about candidate keys, not primary keys. A relation can have multiple candidate keys. A primary key is just a selected candidate key that is used by some disciplines (e.g. the entity-relationship model) or by some database management systems (e.g. for physical record ordering).
With all that said, I can now answer your question. If your table follows the rules and specifically the rows are all unique, then there will be at least one candidate key, on all the columns together at worst. If your table's rows aren't unique, then the table doesn't represent a relation and the normal forms don't apply. A surrogate key (like an auto-increment column) can be added to identify rows uniquely, but that isn't necessarily sufficient on its own to make a table represent a relation (1NF).
BTW, I suggest you avoid using "0NF" or "UNF". Non-relational tables don't have a level of normalization, so attaching any kind of "NF" to them is misleading.

As long as you are talking about tables, there is one further case that needs to be covered. It's the case of duplicate rows.
Duplicate rows are rows that are identical in appearance but not in row number. Such a table cannot have a primary key. Sometimes duplicate rows represent the same information. Sometimes not.
For example, consider a table with just four columns: customerid, productid, quentity, price. If a customer orders the same product twice, we'll have two identical rows, representing different inforation. Ths is not good.
Note that the corresonding thing cannot happen with relations. If two tuples in a relation have the same appearance, then they are the same tuple.
As to the other points, they are covered by excellent earlier answers.

before you wan to check for normalization your table must have a Primary key(the primary key is playing lead role in Relational DB,...).
1NF: says that all of your table attributes must be single valued.

Answer of Question 1 : In a given table if there is no primary key and even impossible to create a composite primary key then what is the normal form of that table ?
Answer : If it is no primary key in relation and if it is impossible to create a composite primiary key(According to me your question says ,even if combine all the column of row to make candidate key then also it will not able to identify your relationship uniquly(duplicate rows are there), hence it is not in any normal form.
Answer of Question 2:
If you add some column(having unique values in it) and if all the cell contains only one value then it is in 1NF.
Still if you need some clarification can ask in comment box.
0NF is not any form of normalization. refer C.J. Date or Henry korth(database management system book)
Hope this helps.

Related

Normalize table to 3rd normal form

This questions is obviously a homework question. I can't understand my professor and have no idea what he said during the election. I need to make step by step instructions to normalize the following table first into 1NF, then 2NF, then 3NF.
I appreciate any help and instruction.
Okay, I hope I remember all of them correctly, let's start...
Rules
To make them very short (and not very precise, just to give you a first idea of what it's all about):
NF1: A table cell must not contain more than one value.
NF2: NF1, plus all non-primary-key columns must depend on all primary key columns.
NF3: NF2, plus non-primary key columns may not depend on each other.
Instructions
NF1: find table cells containing more than one value, put those into separate columns.
NF2: find columns depending on less then all primary key columns, put them into another table which has only those primary key columns they really depend on.
NF3: find columns which depend on other non-primary-key columns, in addition to depending on the primary key. Put the dependent columns into another table.
Examples
NF1
a column "state" has values like "WA, Washington". NF1 is violated, because that's two values, abbreviation and name.
Solution: To fulfill NF1, create two columns, STATE_ABBREVIATION and STATE_NAME.
NF2
Imagine you've got a table with these 4 columns, expressing international names of car models:
COUNTRY_ID (numeric, primary key)
CAR_MODEL_ID (numeric, primary key)
COUNTRY_NAME (varchar)
CAR_MODEL_NAME (varchar)
The table may have these two data rows:
Row 1: COUNTRY_ID=1, CAR_MODEL_ID=5, COUNTRY_NAME=USA, CAR_MODEL_NAME=Fox
Row 2: COUNTRY_ID=2, CAR_MODEL_ID=5, COUNTRY_NAME=Germany, CAR_MODEL_NAME=Polo
That says, model "Fox" is called "Fox" in USA, but the same car model is called "Polo" in Germany (don't remember if that's actually true).
NF2 is violated, because the country name does not depend on both car model ID and country ID, but only on the country ID.
Solution: To fulfill NF2, move COUNTRY_NAME into a separate table "COUNTRY" with columns COUNTRY_ID (primary key) and COUNTRY_NAME. To get a result set including the country name, you'll need to connect the two tables using a JOIN.
NF3
Say you've got a table with these columns, expressing climatic conditions of states:
STATE_ID (varchar, primary key)
CLIME_ID (foreign key, ID of a climate zone like "desert", "rainforest", etc.)
IS_MOSTLY_DRY (bool)
NF3 is violated, because IS_MOSTLY_DRY only depends on the CLIME_ID (let's at least assume that), but not on the STATE_ID (primary key).
Solution: to fulfill NF3, put the column MOSTLY_DRY into the climate zone table.
Here are some thoughts regarding the actual table given in the exercise:
I apply the above mentioned NF rules without to challenge the primary key columns. But they actually don't make sense, as we will see later.
NF1 isn't violated, each cell holds just one value.
NF2 is violated by EMP_NM and all the phone numbers, because all of these columns don't depend on the full primary key. They all depend on EMP_ID (PK), but not on DEPT_CD (PK). I assume that phone numbers stay the same when an employee moves to another department.
NF2 is also violated by DEPT_NM, because DEPT_NM does not depend on the full primary key. It depends on DEPT_CD, but not on EMP_ID.
NF2 is also violated by all the skill columns, because they are not department- but only employee-specific.
NF3 is violated by SKILL_NM, because the skill name only depends on the skill code, which is not even part of the composite primary key.
SKILL_YRS violates NF3, because it depends on a primary key member (EMP_ID) and a non-primary key member (SKILL_CD). So it is partly dependent on a non-primary-key attribute.
So if you remove all columns which violate NF2 or NF3, only the primary key remains (EMP_ID and DEPT_CD). That remaining part violates the given business rules: this structure would allow an employee to work in multiple departments at the same time.
Let's review it from a distance. Your data model is about employees, departments, skills and the relationships between these entities. If you normalize that, you'll end up with one table for the employees (containing DEPT_CD as a foreign key), one for the departments, one for the skills, and another one for the relationship between employees and skills, holding the "skill years" for each tuple of EMP_ID and SKILL_CD (my teacher would have called the latter an "associative entity").
Looking at the first two rows in your table, and looking at which columns are tagged "PK" in that table, and assuming that "PK" stands for "Primary Key", and looking at the values that appear for those two columns in those two rows, I would recommend your professor to get the hell out of database teaching and not come back until he got himself educated properly on the subject.
This exercise cannot be taken seriously because the problem statement itself contains hopelessly contradictory information.
(Observe that as a consequence, there simply is not any such thing as a "good" or "right" answer to this question !!!)
Another oversimplified answer coming up.
In a 3NF relational table, every nonkey value is determined by the key, the whole key, and nothing but the key (so help me Codd ;)).
1NF: The key. This means that if you specify the key value, and a named column, there will be at most one value at the intersection of the row and the column. A multivalue, like a series of values separated by commas, is disallowed, because you can't get directly to the value with just a key and acolumn name.
2NF: The whole key. If a column that is not part of the key is determined by a proper subset of the key columns, then 2NF is being violated.
3NF: And nothing but the key. If a column is determined by some set of non key columns, then 3NF is being violated.
3NF satisfies only if it is in 2nd normal form and doesnot have any transitive dependency and all the non-key attributes should depend on the primary key.
Transitive dependency:
R=(A,B,C).
A->B AND B->C THEN A->C

primary key in a entity

If there is no unique column can identify each row in the table,
then my primary key will be at least a set of two fields.
Is that correct?
If it is correct,then when I draw the Relationship Diagram, I have to underline the two attributes that formed the primary key?
Thankyou
Here is some terminology:
A superkey is a set of columns that, taken together, uniquely identify rows.
A candidate key (or just: "key") is a minimal1 superkey. Sometimes a key contains just one column, sometimes it contains several (in which case it is called "composite").
For practical reasons, we classify keys as either primary or alternate. One table has one primary key and zero or more alternate keys.
A key is "natural" if it arises from the intrinsic properties of data. In other words, it "means" something.
A key is "surrogate" if it doesn't have any meaning by itself - it is there only for identification purposes. It's typically implemented as an auto-incrementing integer, but there may be other strategies such as GUIDs (useful for replication). It is quite common for natural keys to be composite, but that almost never happens for surrogates.
If there are no "obvious" natural keys, the whole row can always act as a key2. However, this is rarely practical and in such cases you'll typically introduce a surrogate key just for the purpose of identifying rows.
Sometimes, but not always, it is useful to introduce a surrogate in addition to the existing natural key(s).
An ER diagram will clearly identify the PK3, whether it is natural or surrogate and whether it is composite or not. How exactly this will look like depends on a notation being used, but PK will typically be drawn in a graphically distinct manner and possibly prefixed with "PK".
1 I.e. if you were to remove any column from it, it would no longer be unique.
2 A database table is a physical representation of the mathematical concept of "relation". Since relation is set, there is no purpose in having two identical rows, so at the very least the whole row must be unique (an element is either in the set or isn't - it cannot be "twice" in the set, as opposed to multiset).
3 Assuming it not just entity-level so no attributes are show at all.
You are correct, after a fashion. Technically, a primary key and a unique key can be two distinct things. You can have a primary key on a table or entity uniquely identifying that entity and also. On the same table, you can have a unique key constraint which can then be used to ensure that no two rows, according to criteria chosen by you, end up having the same property. So you can have both a primary key and a unique constraint on the same table. Simply have a primary key column that will be autogenerated in your DB and then pick the two columns in your table that you want to use to enforce the unique key constraint
If you don't have primary key you can identify your datas but it's not performant.
And as best practise you use primary on your table.
The preference is to use auto increment column as primary key

Why does my database table need a primary key?

In my database I have a list of users with information about them, and I also have a feature which allows a user to add other users to a shortlist. My user information is stored in one table with a primary key of the user id, and I have another table for the shortlist. The shortlist table is designed so that it has two columns and is basically just a list of pairs of names. So to find the shortlist for a particular user you retrieve all names from the second column where the id in the first column is a particular value.
The issue is that according to many sources such as this Should each and every table have a primary key? you should have a primary key in every table of the database.
According to this source http://www.w3schools.com/sql/sql_primarykey.asp - a primary key in one which uniquely identifies an entry in a database. So my question is:
What is wrong with the table in my database? Why does it need a primary key?
How should I give it a primary key? Just create a new auto-incrementing column so that each entry has a unique id? There doesn't seem much point for this. Or would I somehow encapsulate the multiple entries that represent a shortlist into another entity in another table and link that in? I'm really confused.
If the rows are unique, you can have a two-column primary key, although maybe that's database dependent. Here's an example:
CREATE TABLE my_table
(
col_1 int NOT NULL,
col_2 varchar(255) NOT NULL,
CONSTRAINT pk_cols12 PRIMARY KEY (col_1,col_2)
)
If you already have the table, the example would be:
ALTER TABLE my_table
ADD CONSTRAINT pk_cols12 PRIMARY KEY (col_1,col_2)
Primary keys must identify each record uniquely and as it was mentioned before, primary keys can consist of multiple attributes (1 or more columns). First, I'd recommend making sure each record is really unique in your table. Secondly, as I understand you left the table without primary key and that's disallowed so yes, you will need to set the key for it.
In this particular case, there is no purpose in same pair of user IDs being stored more than once in the shortlist table. After all, that table models a set, and an element is either in the set or isn't. Having an element "twice" in the set makes no sense1. To prevent that, create a composite key, consisting of these two user ID fields.
Whether this composite key will also be primary, or you'll have another key (that would act as surrogate primary key) is another matter, but either way you'll need this composite key.
Please note that under databases that support clustering (aka. index-organized tables), PK is often also a clustering key, which may have significant repercussions on performance.
1 Unlike in mutiset.
A table with duplicate rows is not an adequate representation of a relation. It's a bag of rows, not a set of rows. If you let this happen, you'll eventually find that your counts will be off, your sums will be off, and your averages will be off. In short, you'll get confusing errors out of your data when you go to use it.
Declaring a primary key is a convenient way of preventing duplicate rows from getting into the database, even if one of the application programs makes a mistake. The index you obtain is a side effect.
Foreign key references to a single row in a table could be made by referencing any candidate key. However, it's much more convenient if you declare one of those candidate keys as a primary key, and then make all foreign key references refer to the primary key. It's just careful data management.
The one-to-one correspondence between entities in the real world and corresponding rows in the table for that entity is beyond the realm of the DBMS. It's up to your applications and even your data providers to maintain that correspondence by not inventing new rows for existing entities and not letting some new entities slip through the cracks.
Well since you are asking, it's good practice but in a few instances (no joins needed to the data) it may not be absolutely required. The biggest problem though is you never really know if requirements will change and so you really want one now so you aren't adding one to a 10m record table after the fact.....
In addition to a primary key (which can span multiple columns btw) I think it is good practice to have a secondary candidate key which is a single field. This makes joins easier.
First some theory. You may remember the definition of a function from HS or college algebra is that y = f(x) where f is a function if and only if for every x there is exactly one y. In this case, in relational math we would say that y is functionally dependent on x on this case.
The same is true of your data. Suppose we are storing check numbers, checking account numbers, and amounts. Assuming that we may have several checking accounts and that for each checking account duplicate check numbers are not allowed, then amount is functionally dependent on (account, check_number). In general you want to store data together which is functionally dependent on the same thing, with no transitive dependencies. A primary key will typically be the functional dependency you specify as the primary one. This then identifies the rest of the data in the row (because it is tied to that identifier). Think of this as the natural primary key. Where possible (i.e. not using MySQL) I like to declare the primary key to be the natural one, even if it spans across columns. This gets complicated sometimes where you may have multiple interchangeable candidate keys. For example, consider:
CREATE TABLE country (
id serial not null unique,
name text primary key,
short_name text not null unique
);
This table really could have any column be the primary key. All three are perfectly acceptable candidate keys. Suppose we have a country record (232, 'United States', 'US'). Each of these fields uniquely identifies the record so if we know one we can know the others. Each one could be defined as the primary key.
I also recommend having a second, artificial candidate key which is just a machine identifier used for linking for joins. In the above example country.id does this. This can be useful for linking other records to the country table.
An exception to needing a candidate key might be where duplicate records really are possible. For example, suppose we are tracking invoices. We may have a case where someone is invoiced independently for two items with one showing on each of two line items. These could be identical. In this case you probably want to add an artificial primary key because it allows you to join things to that record later. You might not have a need to do so now but you may in the future!
Create a composite primary key.
To read more about what a composite primary key is, visit
http://www.relationaldbdesign.com/relational-database-analysis/module2/concatenated-primary-keys.php

Tables with a common primary key

What's the term describing the relationship between tables that share a common primary key?
Here's an example:
Table 1
property(property_id, property_location, property_price, ...);
Table 2
flat(property_id, flat_floor, flat_bedroom_count, ...);
What you have looks like table inheritance. If your table structure is that all flat records represent a single property but not all property records refer to a flat, then that's table inheritance. It's a way of modeling something close to object-oriented relationships (in other words, flat inherits from property) in a relational database.
If I understand your example correctly, the data modeling term is Supertype/Subtype. This is a modeling technique where you define a root table (the supertype) containing common attributes, and one or more referencing tables (subtypes) that contain varying attributes based on the entities being modeled.
For example, you could have a Person table (the supertype) containing columns for attributes pertaining to all people, such as Name. You could then have an Employee table (the subtype) containing attributes specific to employees only, such as rate of pay and hire date. You could then continue this process with additional tables for other specializations of Person, such as Contractor. Each of the subtype tables would have a PersonID key column, which could be the primary key of the subtype table, as well as a foreign key referencing the Person table.
For additional info, search Google for "supertype and subtype entities", and see the links below.
http://www.learndatamodeling.com/dm_super_type.htm
http://technet.microsoft.com/en-us/library/cc505839.aspx
There isn't a good name for this relationship in common database terminology (as far as I know). It's not a one-to-one relationship because there isn't guaranteed to be a record in the "extending" table for each record in the main table. It's not a one-to-many relationship because there a maximum of one record allowed on what would otherwise be the "many" side of the relationship.
The best I can do is a one-to-one-or-none or a one-to-one-at-most relationship. (I will admit to sloppy terminology myself — I just call it a one-to-one relationship.)
Whatever you decide to call it, you can model it properly and maintain integrity in your database by making the property_id column in property a PK and the property_id column in flat a PK and also an FK back to property.
"Logic and Databases" advances the term "at most one to at most one" for this kind of relationship. (Note that it is insane to assign names to tables on account of which relationships they participate in.)
Beware of the people who have suggested things like "foreign key", "table inheritance", brief, all the other answers given here. Those people are making assumptions that you have not explicitly stated to be valid, namely that one of your two tables will be guaranteed to contain all key values that appear in the other.
(Disfunctionality of the site prevents me from adding this as a comment in the proper place.)
"How would you interpret "...that share a common primary key?" "
I interpret that in the only reasonable sense possible: that within table1, attribute values for the primary key are guaranteed to be unique, and that within table2, attribute values for the primary key are guaranteed to be unique. And that furthermore, the primary key in both tables has the same [set of] attribute names, and that the types corresponding to the primary key attribute[s] are also pairwise the same. Nothing more and nothing less.
In particular, "sharing a primary key" means "having a primary key in common", and that means in turn "having a certain 'internal uniqueness rule' in common", but that commonality guarantees in no way that a primary key value appearing in one table must also appear in the second table.
"Can you give an example involving two tables with shared primary keys where one table wouldn't contain all the key values that appear in the other?" "
Table1: column A of type integer, primary key A
Table2: column A of type integer, primary key A
Rows in table1: {A:1}. Satisfies the primary key for table1.
Rows in table2: {A:2}. Satisfies the primary key for table2.
Convinced ?
"Foreign key"?

One or Two Primary Keys in Many-to-Many Table?

I have the following tables in my database that have a many-to-many relationship, which is expressed by a connecting table that has foreign keys to the primary keys of each of the main tables:
Widget: WidgetID (PK), Title, Price
User: UserID (PK), FirstName, LastName
Assume that each User-Widget combination is unique. I can see two options for how to structure the connecting table that defines the data relationship:
UserWidgets1: UserWidgetID (PK), WidgetID (FK), UserID (FK)
UserWidgets2: WidgetID (PK, FK), UserID (PK, FK)
Option 1 has a single column for the Primary Key. However, this seems unnecessary since the only data being stored in the table is the relationship between the two primary tables, and this relationship itself can form a unique key. Thus leading to option 2, which has a two-column primary key, but loses the one-column unique identifier that option 1 has. I could also optionally add a two-column unique index (WidgetID, UserID) to the first table.
Is there any real difference between the two performance-wise, or any reason to prefer one approach over the other for structuring the UserWidgets many-to-many table?
You only have one primary key in either case. The second one is what's called a compound key. There's no good reason for introducing a new column. In practise, you will have to keep a unique index on all candidate keys. Adding a new column buys you nothing but maintenance overhead.
Go with option 2.
Personally, I would have the synthetic/surrogate key column in many-to-many tables for the following reasons:
If you've used numeric synthetic keys in your entity tables then having the same on the relationship tables maintains consistency in design and naming convention.
It may be the case in the future that the many-to-many table itself becomes a parent entity to a subordinate entity that needs a unique reference to an individual row.
It's not really going to use that much additional disk space.
The synthetic key is not a replacement to the natural/compound key nor becomes the PRIMARY KEY for that table just because it's the first column in the table, so I partially agree with the Josh Berkus article. However, I don't agree that natural keys are always good candidates for PRIMARY KEY's and certainly should not be used if they are to be used as foreign keys in other tables.
Option 2 uses a simple compund key, option 1 uses a surrogate key. Option 2 is preferred in most scenarios and is close to the relational model in that it is a good candidate key.
There are situations where you may want to use a surrogate key (Option 1)
You are not certain that the compound key is a good candidate key over time. Particularly with temporal data (data that changes over time). What if you wanted to add another row to the UserWidget table with the same UserId and WidgetId? Think of Employment(EmployeeId,EmployeeId) - it would work in most cases except if someone went back to work for the same employer at a later date
If you are creating messages/business transactions or something similar that requires an easier key to use for integration. Replication maybe?
If you want to create your own auditing mechanisms (or similar) and don't want keys to get too long.
As a rule of thumb, when modeling data you will find that most associative entities (many to many) are the result of an event. Person takes up employment, item is added to basket etc. Most events have a temporal dependency on the event, where the date or time is relevant - in which case a surrogate key may be the best alternative.
So, take option 2, but make sure that you have the complete model.
I agree with the previous answers but I have one remark to add.
If you want to add more information to the relation and allow more relations between the same two entities you need option one.
For example if you want to track all the times user 1 has used widget 664 in the userwidget table the userid and widgetid isn't unique anymore.
What is the benefit of a primary key in this scenario? Consider the option of no primary key:
UserWidgets3: WidgetID (FK), UserID (FK)
If you want uniqueness then use either the compound key (UserWidgets2) or a uniqueness constraint.
The usual performance advantage of having a primary key is that you often query the table by the primary key, which is fast. In the case of many-to-many tables you don't usually query by the primary key so there is no performance benefit. Many-to-many tables are queried by their foreign keys, so you should consider adding indexes on WidgetID and UserID.
Option 2 is the correct answer, unless you have a really good reason to add a surrogate numeric key (which you have done in option 1).
Surrogate numeric key columns are not 'primary keys'. Primary keys are technically one of the combination of columns that uniquely identify a record within a table.
Anyone building a database should read this article http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-i-7327 by Josh Berkus to understand the difference between surrogate numeric key columns and primary keys.
In my experience the only real reason to add a surrogate numeric key to your table is if your primary key is a compound key and needs to be used as a foreign key reference in another table. Only then should you even think to add an extra column to the table.
Whenever I see a database structure where every table has an 'id' column the chances are it has been designed by someone who doesn't appreciate the relational model and it will invariably display one or more of the problems identified in Josh's article.
I would go with both.
Hear me out:
The compound key is obviously the nice, correct way to go in so far as reflecting the meaning of your data goes. No question.
However: I have had all sorts of trouble making hibernate work properly unless you use a single generated primary key - a surrogate key.
So I would use a logical and physical data model. The logical one has the compound key. The physical model - which implements the logical model - has the surrogate key and foreign keys.
Since each User-Widget combination is unique, you should represent that in your table by making the combination unique. In other words, go with option 2. Otherwise you may have two entries with the same widget and user IDs but different user-widget IDs.
The userwidgetid in the first table is not needed, as like you said the uniqueness comes from the combination of the widgetid and the userid.
I would use the second table, keep the foriegn keys and add a unique index on widgetid and userid.
So:
userwidgets( widgetid(fk), userid(fk),
unique_index(widgetid, userid)
)
There is some preformance gain in not having the extra primary key, as the database would not need to calculate the index for the key. In the above model though this index (through the unique_index) is still calculated, but I believe that this is easier to understand.

Resources