How to choose (natural) primary key - sql-server

Suppose information about vendors and customers is kept in a single table named Partners (since one partner can be a vendor at one point in time and a customer at another).
The Partners table has the usual stuff: company name, short name, address, city, country. For domestic partners there is DomesticVatNumber, and for non-domestic ones there is InternationalVatNumber. Usually a VAT number would be a perfect candidate for the primary key, but the problem here is that not all domestic partners have an InternationalVatNumber, and international ones don't have a DomesticVatNumber.
I am trying to find the best way to design this in the database. Is a surrogate key the only option in this case, or should I reconsider keeping domestic and international partners in the same table? Should I split them into two tables, DomesticPartners (which always have a DomesticVatNumber) and InternationalPartners (which always have an InternationalVatNumber), and then put the primary key on the DomesticVat and InternationalVat columns respectively?
What are the pros and cons of each approach?

Personally, I would never make a primary key out of something assigned by an external party, nor would I use a value that the user would ever see. I would always use a meaningless key (either an identity column or a unique identifier).
Given what you are saying, I wouldn't split them into separate tables: any table that referenced your partners through a foreign key would then need either two nullable columns (one per partner table) or a single column with no foreign key relationship (shudder...).
The best option is to have one table, with the domestic and international VAT numbers as separate fields in the table but not as the primary key. Since both will be nullable, you have limited options for a unique constraint on them.
Just my 2 cents
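A minimal sketch of this single-table design in T-SQL, assuming SQL Server 2008 or later (needed for filtered indexes). The PartnerId name, the column sizes, and the CHECK rule requiring at least one VAT number are illustrative assumptions, not something stated in the question:

```sql
CREATE TABLE Partners (
    PartnerId              int IDENTITY(1,1) NOT NULL,  -- meaningless surrogate key
    CompanyName            nvarchar(200) NOT NULL,
    ShortName              nvarchar(50)  NULL,
    Country                nvarchar(100) NOT NULL,
    DomesticVatNumber      varchar(20)   NULL,
    InternationalVatNumber varchar(20)   NULL,
    CONSTRAINT PK_Partners PRIMARY KEY (PartnerId),
    -- Assumed business rule: every partner carries at least one VAT number.
    CONSTRAINT CK_Partners_HasVat
        CHECK (DomesticVatNumber IS NOT NULL OR InternationalVatNumber IS NOT NULL)
);

-- A plain UNIQUE constraint in SQL Server allows only one NULL per column,
-- so uniqueness is enforced only over the rows that actually have a value.
CREATE UNIQUE INDEX UX_Partners_DomesticVat
    ON Partners (DomesticVatNumber)
    WHERE DomesticVatNumber IS NOT NULL;

CREATE UNIQUE INDEX UX_Partners_InternationalVat
    ON Partners (InternationalVatNumber)
    WHERE InternationalVatNumber IS NOT NULL;
```

Other tables can then reference Partners (PartnerId) with a single non-nullable foreign key, whichever kind of partner the row describes.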

As your business grows, your systems get more complex, and it makes more sense to have one table. An example can be an ENTITIES table which stores everyone and everything, including vendors and customers. This can include individuals, groups and businesses, clients and staff, etc. Later on you will be glad you did it this way, because it reduces the number of complex joins you are going to have with multiple tables. You can use ENTITY_NO as a surrogate key and ENTITY_TYPE to differentiate entities. VAT number fields can be indexed separately and made nullable.

Related

Why does my database table need a primary key?

In my database I have a list of users with information about them, and I also have a feature which allows a user to add other users to a shortlist. My user information is stored in one table with a primary key of the user id, and I have another table for the shortlist. The shortlist table is designed so that it has two columns and is basically just a list of pairs of names. So to find the shortlist for a particular user you retrieve all names from the second column where the id in the first column is a particular value.
The issue is that, according to many sources such as "Should each and every table have a primary key?", you should have a primary key in every table of the database.
According to this source, http://www.w3schools.com/sql/sql_primarykey.asp, a primary key is one which uniquely identifies an entry in a database. So my question is:
What is wrong with the table in my database? Why does it need a primary key?
How should I give it a primary key? Just create a new auto-incrementing column so that each entry has a unique id? There doesn't seem much point for this. Or would I somehow encapsulate the multiple entries that represent a shortlist into another entity in another table and link that in? I'm really confused.
If the rows are unique, you can have a two-column primary key, although maybe that's database dependent. Here's an example:
CREATE TABLE my_table
(
    col_1 int NOT NULL,
    col_2 varchar(255) NOT NULL,
    CONSTRAINT pk_cols12 PRIMARY KEY (col_1, col_2)
);
If you already have the table, the example would be:
ALTER TABLE my_table
    ADD CONSTRAINT pk_cols12 PRIMARY KEY (col_1, col_2);
Primary keys must identify each record uniquely and, as was mentioned before, can consist of multiple attributes (one or more columns). First, I'd recommend making sure each record really is unique in your table. Secondly, as I understand it you left the table without a primary key, which is poor practice, so yes, you should set a key for it.
In this particular case, there is no purpose in the same pair of user IDs being stored more than once in the shortlist table. After all, that table models a set, and an element is either in the set or isn't. Having an element "twice" in the set makes no sense (see note 1). To prevent that, create a composite key consisting of these two user ID fields.
Whether this composite key will also be primary, or you'll have another key (that would act as surrogate primary key) is another matter, but either way you'll need this composite key.
Please note that under databases that support clustering (aka. index-organized tables), PK is often also a clustering key, which may have significant repercussions on performance.
Note 1: Unlike in a multiset.
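As a rough illustration of the composite key on the shortlist table, here is a sketch in T-SQL. The table and column names are hypothetical, and it assumes the shortlist stores user IDs rather than names:

```sql
CREATE TABLE Users (
    UserId int NOT NULL PRIMARY KEY
    -- other user columns omitted
);

CREATE TABLE Shortlist (
    OwnerUserId       int NOT NULL,  -- the user who owns the shortlist
    ShortlistedUserId int NOT NULL,  -- a user placed on that shortlist
    CONSTRAINT PK_Shortlist PRIMARY KEY (OwnerUserId, ShortlistedUserId),
    CONSTRAINT FK_Shortlist_Owner
        FOREIGN KEY (OwnerUserId) REFERENCES Users (UserId),
    CONSTRAINT FK_Shortlist_Member
        FOREIGN KEY (ShortlistedUserId) REFERENCES Users (UserId)
);

INSERT INTO Users (UserId) VALUES (1), (2);
INSERT INTO Shortlist (OwnerUserId, ShortlistedUserId) VALUES (1, 2);

-- Repeating the same pair would violate PK_Shortlist and be rejected:
-- INSERT INTO Shortlist (OwnerUserId, ShortlistedUserId) VALUES (1, 2);
```

No extra auto-incrementing column is needed here; the pair itself identifies the row.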
A table with duplicate rows is not an adequate representation of a relation. It's a bag of rows, not a set of rows. If you let this happen, you'll eventually find that your counts will be off, your sums will be off, and your averages will be off. In short, you'll get confusing errors out of your data when you go to use it.
Declaring a primary key is a convenient way of preventing duplicate rows from getting into the database, even if one of the application programs makes a mistake. The index you obtain is a side effect.
Foreign key references to a single row in a table could be made by referencing any candidate key. However, it's much more convenient if you declare one of those candidate keys as a primary key, and then make all foreign key references refer to the primary key. It's just careful data management.
The one-to-one correspondence between entities in the real world and corresponding rows in the table for that entity is beyond the realm of the DBMS. It's up to your applications and even your data providers to maintain that correspondence by not inventing new rows for existing entities and not letting some new entities slip through the cracks.
Well since you are asking, it's good practice but in a few instances (no joins needed to the data) it may not be absolutely required. The biggest problem though is you never really know if requirements will change and so you really want one now so you aren't adding one to a 10m record table after the fact.....
In addition to a primary key (which can span multiple columns btw) I think it is good practice to have a secondary candidate key which is a single field. This makes joins easier.
First some theory. You may remember the definition of a function from high school or college algebra: y = f(x), where f is a function if and only if for every x there is exactly one y. In relational terms, we would say that y is functionally dependent on x.
The same is true of your data. Suppose we are storing check numbers, checking account numbers, and amounts. Assuming that we may have several checking accounts and that for each checking account duplicate check numbers are not allowed, then amount is functionally dependent on (account, check_number). In general you want to store data together which is functionally dependent on the same thing, with no transitive dependencies. A primary key will typically be the functional dependency you specify as the primary one. This then identifies the rest of the data in the row (because it is tied to that identifier). Think of this as the natural primary key. Where possible (i.e. not using MySQL) I like to declare the primary key to be the natural one, even if it spans across columns. This gets complicated sometimes where you may have multiple interchangeable candidate keys. For example, consider:
CREATE TABLE country (
id serial not null unique,
name text primary key,
short_name text not null unique
);
This table really could have any column be the primary key. All three are perfectly acceptable candidate keys. Suppose we have a country record (232, 'United States', 'US'). Each of these fields uniquely identifies the record so if we know one we can know the others. Each one could be defined as the primary key.
I also recommend having a second, artificial candidate key which is just a machine identifier used for linking for joins. In the above example country.id does this. This can be useful for linking other records to the country table.
An exception to needing a candidate key might be where duplicate records really are possible. For example, suppose we are tracking invoices. We may have a case where someone is invoiced independently for two items with one showing on each of two line items. These could be identical. In this case you probably want to add an artificial primary key because it allows you to join things to that record later. You might not have a need to do so now but you may in the future!
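The checking-account example from this answer can be sketched as follows. This is written in T-SQL (SQL Server) syntax as an assumption, whereas the answer's own country example uses PostgreSQL's serial type; the table name, column names and sizes are hypothetical:

```sql
CREATE TABLE Checks (
    CheckId     int IDENTITY(1,1) NOT NULL,  -- artificial candidate key, handy for joins
    AccountNo   varchar(34)   NOT NULL,
    CheckNumber int           NOT NULL,
    Amount      decimal(12,2) NOT NULL,      -- functionally dependent on (AccountNo, CheckNumber)
    -- The natural key is declared as the primary key...
    CONSTRAINT PK_Checks PRIMARY KEY (AccountNo, CheckNumber),
    -- ...and the machine identifier is kept unique as a second candidate key.
    CONSTRAINT UQ_Checks_CheckId UNIQUE (CheckId)
);
```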
Create a composite primary key.
To read more about what a composite primary key is, visit
http://www.relationaldbdesign.com/relational-database-analysis/module2/concatenated-primary-keys.php

Having a table with just the fields of the primary key is a conceptual error?

I'm designing a database which will store information about some artists. These artists can belong to one or more organizations. For these organizations I just want to store their names, and I'm thinking of creating an organizations table which has just the name as primary key and nothing else. Is having a table with just the fields of the primary key a conceptual error? If so, I would appreciate some suggestions to solve it.
Is having a table with just the fields of the primary key a conceptual error?
Not by itself. There are perfectly legitimate situations where all fields comprise a PK.
In this particular case, the organization name is a key, but that doesn't necessarily mean it should be primary key - you could "invent" another key that is smaller (typically integer) and easier to maintain and make it primary, like this:
The organization_id is called a "surrogate key", and some pros of doing that include:
Child FKs will be slimmer (since only the integer is migrated to the child, not the whole string).
You can update the organization_name without updating the organization_id, and consequently without cascading this update to children.
A small integer surrogate may be friendlier to ORMs than a more complex natural key.
Cons:
May require more JOINing.
Requires one more index, and each additional index brings overhead (even in heap-based tables, but especially in clustered tables).
As you can see, it's a matter of balance and you are the only one who has enough domain knowledge to make the right decision.
NOTE: Order of fields in organization_artist matters. Use the order shown above if you need to efficiently query for artists of a given organization and reverse it if you need organizations of a given artist. If you need both directions, you'll need another composite index on these two fields (beside the index underneath PK), but in opposite order. If you can live with only one index, consider clustering this table (if your DBMS supports it).
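The diagram this answer refers to is not reproduced here. A sketch of the schema it appears to describe, in T-SQL, might look like the following; the artist table, the column types, and all constraint and index names are assumptions:

```sql
CREATE TABLE organization (
    organization_id   int IDENTITY(1,1) NOT NULL,
    organization_name nvarchar(200) NOT NULL,
    CONSTRAINT pk_organization PRIMARY KEY (organization_id),
    -- The name is still a key, just not the primary one.
    CONSTRAINT uq_organization_name UNIQUE (organization_name)
);

CREATE TABLE artist (
    artist_id int IDENTITY(1,1) NOT NULL,
    CONSTRAINT pk_artist PRIMARY KEY (artist_id)
    -- artist attributes omitted
);

CREATE TABLE organization_artist (
    organization_id int NOT NULL,
    artist_id       int NOT NULL,
    -- organization_id first: efficient for "artists of a given organization".
    CONSTRAINT pk_organization_artist PRIMARY KEY (organization_id, artist_id),
    CONSTRAINT fk_oa_organization
        FOREIGN KEY (organization_id) REFERENCES organization (organization_id),
    CONSTRAINT fk_oa_artist
        FOREIGN KEY (artist_id) REFERENCES artist (artist_id)
);

-- Second composite index in the opposite order,
-- for "organizations of a given artist".
CREATE INDEX ix_organization_artist_reverse
    ON organization_artist (artist_id, organization_id);
```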
You want an OrganizationId, to handle the situations where the Organization name changes.
You might also have situations where different organizations seem to have the same name. How many "Museum of Modern Art"s are there? (Well, to a New Yorker, only one ;-)
Your organization table might expand over time, with columns such as short name, address, contact person, preferred language, and so on. So, the table should look like:
create table Organizations (
    OrganizationId int not null identity(1,1) primary key,  -- surrogate key
    Name varchar(255),
    CreatedBy varchar(255) default system_user,
    CreatedAt datetime default getdate()
);
In a mature database, you would even recognize that organizations change names, merge, and sometimes split. You can handle this by adding effective dates and end dates to the records.
The standard practice for something like this would be to have 1 table for the artists, 1 table for the organizations, and 1 association table to associate the artist with 1 or more organization.
ARTIST (id, firstName, lastName)
ORGANIZATION (id, name)
ARTIST_ORGANIZATION(artist_id, org_id)
Even though the organization name may/should be unique, it's good to have a numeric id as the primary key so you can do associations. Querying the association by id also performs better than searching on a string.

Do I need to define a new primary key field for each table?

I have a few database tables that really only require a unique id that references another table e.g.
Customer            Holiday
********            *******
ID (PK)     --->    CustomerID (PK)
Forename            From
Surname             To
....
These tables such as Holiday, only really exist to hold information regarding a Customer. Therefore, do I need to specify a separate field to hold the ID for the holiday? i.e.
Holiday
*******
ID (PK)
CustomerID (FK)
...
Or would I be ok, in this instance, to just set the CustomerID as the primary key in the table?
Regards,
James.
This really depends on what you are doing.
If each customer can have only 1 holiday, then yes, you could make the CustomerID the primary key.
If each customer can have multiple holidays, then no; you would want to add a new ID column and make it the primary key. This allows you to select holidays by customer AND to select individual records by their unique ID.
Additionally, if each customer can only have 1 holiday, I'd just add the holiday information to the Customer table, as a one-to-one relationship is typically unnecessary.
If I understand your question correctly, you could only use the customer ID as the primary key in Holiday if there will never be another holiday for that customer in the table. In other words, two holidays for one customer breaks using the customer ID as a primary key.
If there will ever be an object-oriented program associated with this database, each entity (each row) must have a unique key.
Your second design assures that each instance of Holiday can be uniquely identified and processed by an OO application using a simple Object-Relational Mapping.
Generally, it's best to assure that every entity in the database has a unique, immutable, system-assigned ("surrogate") key. Other "natural" keys can have unique indexes, constraints, etc., to fit the business logic.
The previous answer is correct, but also remember that you could have a separate primary key in each table, with the Holiday table holding a foreign key to the customer ID.
You could then manage the assignment of holidays to customers in your code, to make sure that only one holiday can be assigned to a customer. This brings in a concurrency problem, though: two people adding a holiday to the same customer at the same time will most probably result in that customer having 2 holidays.
You could even place the holiday fields in the Customer table if a customer can only be created with a holiday, but this design is messy and not really advised.
So once again, option 2 in your question is still the best way to go; I'm just giving you your options.
In practice I've found that every table should have a unique primary key identifying the records in those tables. All relationships with other tables should be explicitly declared.
This helps others understand the relationships better, especially if they use a tool to reverse-engineer the schema into a visual representation.
In addition, it gives you more flexibility to expand your solution in the future. You may only have one holiday per customer now, but this is much more difficult to change if you make customer ID the primary key.
If you want to mandate the uniqueness of customer in the holiday table, create a unique index on that foreign key. In fact, this could improve performance when querying on customer ID (although I'm guessing you won't see enough records to notice this improvement).
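A sketch of that design in T-SQL, assuming SQL Server 2008 or later for the date type. The column types and the constraint and index names are assumptions, since the question only gives column names:

```sql
CREATE TABLE Customer (
    ID       int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Forename nvarchar(100) NOT NULL,
    Surname  nvarchar(100) NOT NULL
);

CREATE TABLE Holiday (
    ID         int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- the holiday's own key
    CustomerID int  NOT NULL,
    [From]     date NOT NULL,  -- FROM and TO are reserved words, hence the brackets
    [To]       date NOT NULL,
    CONSTRAINT FK_Holiday_Customer
        FOREIGN KEY (CustomerID) REFERENCES Customer (ID)
);

-- Enforces "one holiday per customer" for now.
CREATE UNIQUE INDEX UX_Holiday_CustomerID ON Holiday (CustomerID);
```

If the rule later changes to allow several holidays per customer, dropping the unique index (or replacing it with a non-unique one) is enough; the primary key and foreign key stay as they are.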

Entity relationship

I have doubts about this design (er_lp). The first is how to create a many-to-many relationship between entities with composite keys; the second is about using a date type as a PK. Here each machine works daily for three shifts, for different userDepts, on one or more fields. So to keep a record of the working and down hours of the machinery I have used shift, taskDay and machinePlate as PKs. As you will see from the ER diagram, I ended up with too many PKs in the link tables in many places. I am worried about getting into trouble in the coding phase.
Is there a better way to do this?
Thank you !!
Dejene
See also extra information posted as a second question Entity Relationship. The material, reformatted, is:
Elaboration: Yes, 'Field ' is referring to areas of land. We have several cane growing fields at different location. It [each field?] is named and has budget.
User is not referring to individual who are working on the machine. They are departments. I used 'isDone' table to link userDept with machine. A machine can be used by several departments and many machines can work for a userDept.
A particular machine can be used for multiple tasks on a given shift. It can work for say 2 hours and can start another task on another field. We have three shifts per day, each of 8 hrs!
If I use Auto increment PK, do you think that other key are important? I don't prefer to use it!
Usually, I use auto increment key alone in a table. How can we create relationship that involves auto increment keys?
Thank you for thoughtful comment!!
You always create many-to-many relationships between two tables using a third table, the rows of which contain the columns for the primary key of each table, and the combination of all columns is the primary key of the third table. The rule doesn't change for tables with composite primary keys.
CREATE TABLE Table1(Col11 ..., Col12 ..., Col1N ...,
PRIMARY KEY(Col11, Col12));
CREATE TABLE Table2(Col21 ..., Col22 ..., Col2N ...,
PRIMARY KEY(Col21, Col22));
CREATE TABLE RelationTable
(
Col11 ...,
Col12 ...,
FOREIGN KEY (Col11, Col12) REFERENCES Table1,
Col21 ...,
Col22 ...,
FOREIGN KEY (Col21, Col22) REFERENCES Table2,
PRIMARY KEY (Col11, Col12, Col21, Col22)
);
This works fine. It does suggest that you should try to keep keys simple whenever possible, but there is absolutely no need to bend over backwards adding auto-increment columns to the referenced tables if they have a natural composite key that is convenient to use. OTOH, the joins involving the relation table are harder to write if you use composite keys - I'd think several times about what I'm doing if either composite key involved more than two columns, not least because it might indicate problems in the design of the referenced tables.
Looking at the actual ER diagram - the 'er_lp' URL in the question - the 'tbl' prefix seems a trifle unnecessary; the things storing data in a database are always tables, so telling me that with the prefix is ... unnecessary. The table called 'Machine' seems to be misnamed; it does not so much describe a machine as the duty allocated to a machine on a particular shift. I'm guessing that the 'Field' table is referring to areas of land, rather than parts of a database. You have the 'IsDone' table (again, not particularly well named) that identifies the user who worked on a machine for a particular shift and hence for a particular task. That involves a link between the Machine table (which has a 3-part primary key) and the User table. It isn't clear whether a particular machine can be used for multiple tasks on a given shift. It isn't clear whether shift numbers cycle around the day or whether each shift number is unique across days, but the presumption must be that there are, say, three shifts per day, and the shift number and date is needed to identify when something occurred. Presumably, the Shift table would identify times and other such information.
The three-part primary key on Machine is fine - but it might be better to have two unique identifiers. One would be the current primary key combination; the other would be an automatically assigned number - auto-increment, serial, sequence or whatever...
Addressing the extended information.
It is not clear to me any more what you are seeking to track. If the 'Machine' table is supposed to track what a given machine was being used for, then you probably need to do some more structuring of the data. Given that a machine can be used for different tasks on different fields during a single shift, you should think, perhaps, in terms of a MachineTasks table which would identify the (date and) time when the operation started and finished and the type of operation. For repair operations, you'd store the information in a table describing repairs; for routine operations in a field, you might not need much extra information. Or maybe that is overkill.
I'm not clear whether particular tasks are performed on behalf of multiple departments, or whether you are simply trying to note that during a single shift a machine might be used by multiple departments, but one department at a time for each task. If each task is for a separate department, then simply include the department info in the main MachineTasks table as a foreign key field.
If you decide on an auto-increment key, you still need to maintain the uniqueness of the composite key. This is the biggest mistake I see people making with auto-increment fields. It isn't quite as simple as "a table with an auto-increment key must also have a second unique constraint on it", but it isn't too far off the mark.
When you use an auto-increment key, you need to retrieve the value assigned when you insert a record into the table; you then use that value in the foreign key columns when you insert other records into the other tables.
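The last two points can be sketched in T-SQL as follows. The question does not say which DBMS is in use, so SQL Server is an assumption here, as are the MachineTask and MachineTaskDept table names, the column types, and the sample values:

```sql
CREATE TABLE MachineTask (
    MachineTaskId int IDENTITY(1,1) NOT NULL,
    MachinePlate  varchar(20) NOT NULL,
    TaskDay       date        NOT NULL,
    Shift         tinyint     NOT NULL,
    CONSTRAINT PK_MachineTask PRIMARY KEY (MachineTaskId),
    -- The composite key stays unique even though it is no longer the PK.
    CONSTRAINT UQ_MachineTask_Natural UNIQUE (MachinePlate, TaskDay, Shift)
);

CREATE TABLE MachineTaskDept (
    MachineTaskId int NOT NULL,
    UserDeptId    int NOT NULL,
    CONSTRAINT PK_MachineTaskDept PRIMARY KEY (MachineTaskId, UserDeptId),
    CONSTRAINT FK_MachineTaskDept_Task
        FOREIGN KEY (MachineTaskId) REFERENCES MachineTask (MachineTaskId)
);

DECLARE @NewId int;

INSERT INTO MachineTask (MachinePlate, TaskDay, Shift)
VALUES ('AB-1234', '2010-01-15', 1);

-- Retrieve the identity value just generated in this scope...
SET @NewId = SCOPE_IDENTITY();

-- ...and use it as the foreign key value in the dependent row.
INSERT INTO MachineTaskDept (MachineTaskId, UserDeptId)
VALUES (@NewId, 7);
```

Without the UNIQUE constraint, nothing would stop the same plate, day and shift from being inserted twice under two different generated IDs.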
You need to read up on database design - I'm not sure what the current good books are as I did most of my learning a decade and more ago, and my books are consequently less likely to be available still.
One good way of not getting into trouble with primary keys is to have a single field for primary key. Usually a numeric (auto incremental) column is just fine. You can still have unique keys with multiple columns.
tblWorksOn
tblMachine
tblIsDone
...seem to be the problem tables.
It looks like you could use taskDate as the primary key for the tblMachine table. The rest can be foreign keys.
With the changes to the tblMachine table you can then use the taskDate with the fieldNo for the tblWorksOn table and the taskDate with the userID for the tblIsDone table. Use these two fields to create Composite Keys (CK).
e.g.
tblMachine
    taskDate (PK)
tblWorksOn
    fieldNo (CK)
    taskDate (CK)
tblIsDone
    userID (CK)
    taskDate (CK)

One or Two Primary Keys in Many-to-Many Table?

I have the following tables in my database that have a many-to-many relationship, which is expressed by a connecting table that has foreign keys to the primary keys of each of the main tables:
Widget: WidgetID (PK), Title, Price
User: UserID (PK), FirstName, LastName
Assume that each User-Widget combination is unique. I can see two options for how to structure the connecting table that defines the data relationship:
UserWidgets1: UserWidgetID (PK), WidgetID (FK), UserID (FK)
UserWidgets2: WidgetID (PK, FK), UserID (PK, FK)
Option 1 has a single column for the Primary Key. However, this seems unnecessary since the only data being stored in the table is the relationship between the two primary tables, and this relationship itself can form a unique key. Thus leading to option 2, which has a two-column primary key, but loses the one-column unique identifier that option 1 has. I could also optionally add a two-column unique index (WidgetID, UserID) to the first table.
Is there any real difference between the two performance-wise, or any reason to prefer one approach over the other for structuring the UserWidgets many-to-many table?
You only have one primary key in either case. The second one is what's called a compound key. There's no good reason for introducing a new column. In practise, you will have to keep a unique index on all candidate keys. Adding a new column buys you nothing but maintenance overhead.
Go with option 2.
Personally, I would have the synthetic/surrogate key column in many-to-many tables for the following reasons:
If you've used numeric synthetic keys in your entity tables then having the same on the relationship tables maintains consistency in design and naming convention.
It may be the case in the future that the many-to-many table itself becomes a parent entity to a subordinate entity that needs a unique reference to an individual row.
It's not really going to use that much additional disk space.
The synthetic key is not a replacement to the natural/compound key nor becomes the PRIMARY KEY for that table just because it's the first column in the table, so I partially agree with the Josh Berkus article. However, I don't agree that natural keys are always good candidates for PRIMARY KEY's and certainly should not be used if they are to be used as foreign keys in other tables.
Option 2 uses a simple compound key, option 1 uses a surrogate key. Option 2 is preferred in most scenarios and is close to the relational model in that it is a good candidate key.
There are situations where you may want to use a surrogate key (Option 1):
You are not certain that the compound key is a good candidate key over time. Particularly with temporal data (data that changes over time). What if you wanted to add another row to the UserWidget table with the same UserId and WidgetId? Think of Employment(EmployeeId, EmployerId) - it would work in most cases except if someone went back to work for the same employer at a later date.
If you are creating messages/business transactions or something similar that requires an easier key to use for integration. Replication maybe?
If you want to create your own auditing mechanisms (or similar) and don't want keys to get too long.
As a rule of thumb, when modeling data you will find that most associative entities (many to many) are the result of an event. Person takes up employment, item is added to basket etc. Most events have a temporal dependency on the event, where the date or time is relevant - in which case a surrogate key may be the best alternative.
So, take option 2, but make sure that you have the complete model.
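To illustrate the temporal case, here is a hedged T-SQL sketch of the employment example. The StartDate and EndDate columns and all names are assumptions added for illustration:

```sql
CREATE TABLE Employment (
    EmploymentId int IDENTITY(1,1) NOT NULL,  -- surrogate key
    EmployeeId   int  NOT NULL,
    EmployerId   int  NOT NULL,
    StartDate    date NOT NULL,
    EndDate      date NULL,                   -- NULL while the employment is ongoing
    CONSTRAINT PK_Employment PRIMARY KEY (EmploymentId),
    -- (EmployeeId, EmployerId) alone would block a second stint with the
    -- same employer; including the start date keeps a natural candidate key.
    CONSTRAINT UQ_Employment_Natural UNIQUE (EmployeeId, EmployerId, StartDate)
);
```

Other tables can then reference a single stint through EmploymentId rather than carrying all three columns.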
I agree with the previous answers but I have one remark to add.
If you want to add more information to the relation and allow more relations between the same two entities you need option one.
For example if you want to track all the times user 1 has used widget 664 in the userwidget table the userid and widgetid isn't unique anymore.
What is the benefit of a primary key in this scenario? Consider the option of no primary key:
UserWidgets3: WidgetID (FK), UserID (FK)
If you want uniqueness then use either the compound key (UserWidgets2) or a uniqueness constraint.
The usual performance advantage of having a primary key is that you often query the table by the primary key, which is fast. In the case of many-to-many tables you don't usually query by the primary key so there is no performance benefit. Many-to-many tables are queried by their foreign keys, so you should consider adding indexes on WidgetID and UserID.
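For comparison, a sketch of the keyed variant (UserWidgets2) with indexing for both lookup directions, in T-SQL. The column types and the constraint and index names are assumptions:

```sql
CREATE TABLE Widget (
    WidgetID int NOT NULL PRIMARY KEY,
    Title    nvarchar(200) NOT NULL,
    Price    decimal(10,2) NOT NULL
);

-- USER is a reserved word in T-SQL, hence the brackets.
CREATE TABLE [User] (
    UserID    int NOT NULL PRIMARY KEY,
    FirstName nvarchar(100) NOT NULL,
    LastName  nvarchar(100) NOT NULL
);

CREATE TABLE UserWidgets2 (
    WidgetID int NOT NULL,
    UserID   int NOT NULL,
    CONSTRAINT PK_UserWidgets2 PRIMARY KEY (WidgetID, UserID),
    CONSTRAINT FK_UserWidgets2_Widget
        FOREIGN KEY (WidgetID) REFERENCES Widget (WidgetID),
    CONSTRAINT FK_UserWidgets2_User
        FOREIGN KEY (UserID) REFERENCES [User] (UserID)
);

-- The primary key index already serves lookups by WidgetID (its leading
-- column); this index serves lookups by UserID.
CREATE INDEX IX_UserWidgets2_UserID ON UserWidgets2 (UserID);
```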
Option 2 is the correct answer, unless you have a really good reason to add a surrogate numeric key (which you have done in option 1).
Surrogate numeric key columns are not 'primary keys'. A primary key is technically one of the combinations of columns that uniquely identify a record within a table.
Anyone building a database should read this article http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-i-7327 by Josh Berkus to understand the difference between surrogate numeric key columns and primary keys.
In my experience the only real reason to add a surrogate numeric key to your table is if your primary key is a compound key and needs to be used as a foreign key reference in another table. Only then should you even think to add an extra column to the table.
Whenever I see a database structure where every table has an 'id' column the chances are it has been designed by someone who doesn't appreciate the relational model and it will invariably display one or more of the problems identified in Josh's article.
I would go with both.
Hear me out:
The compound key is obviously the nice, correct way to go in so far as reflecting the meaning of your data goes. No question.
However: I have had all sorts of trouble making Hibernate work properly unless I use a single generated primary key - a surrogate key.
So I would use a logical and physical data model. The logical one has the compound key. The physical model - which implements the logical model - has the surrogate key and foreign keys.
Since each User-Widget combination is unique, you should represent that in your table by making the combination unique. In other words, go with option 2. Otherwise you may have two entries with the same widget and user IDs but different user-widget IDs.
The userwidgetid in the first table is not needed since, as you said, the uniqueness comes from the combination of the widgetid and the userid.
I would use the second table, keep the foreign keys and add a unique index on widgetid and userid.
So:
userwidgets( widgetid(fk), userid(fk),
unique_index(widgetid, userid)
)
There is some performance gain in not having the extra primary key, as the database would not need to maintain the index for that key. In the above model this index (through the unique_index) is still maintained, but I believe that this is easier to understand.

Resources