Surrogate database key and data duplication

I was thinking about this problem. In database design surrogate keys are used most of the time, but how do you prevent data duplication and inconsistent data? I mean, one could have a customer table made of customer_id, name, surname. What would prevent me from inserting the same customer twice with a different customer_id? Sure, I could add a unique index on name and surname, but if one does so, then what's the purpose of the surrogate primary key?

You're asking a business question, not a technical one.
"How do I know whether two people with the same name are the same person or not?"
Well typically customers are not identified by a name alone, there is also one of:
An account number
An email address
A postal address
A credit card number
A passport number
A date of birth
... etc.
The name is simply not a uniquely identifying characteristic, it's just an attribute of a customer that is probably non-unique, so you need something else to help identify them. Within the database this is the primary key of the customer table, but for business purposes it could be any number of attributes.

If there is a natural key, you cannot replace it with a surrogate key. You can only add the surrogate without removing the natural. This has pros and cons, as I described here.
Unfortunately, there is no good natural key in the case you described, since two different human beings can easily have the same combination of first and last name. Therefore, you'll have to come up with some additional attributes that represent better criteria for judging whether two people are "identical" or not, and then create the corresponding natural key. Discovering such criteria is part of requirements gathering and therefore impossible for me to do without knowing more about your domain.
If you are unable to identify such natural key, then you can just leave customer_id alone. That means you made a decision to make it acceptable for two people to be identical in every aspect (except in their customer_id) and still be considered "different". Arguably, such customer_id may no longer be called "surrogate", since its value now has a meaning in your data model, is potentially visible in the UI etc.

What you have said is perfectly logical and correct. Surrogate keys are not any kind of substitute for a natural key (AKA business key or domain key, i.e. the set of attributes used to identify information in the database and relate it to the real world things the database is supposed to model). If you care about data integrity then natural keys are essential, whereas surrogates by definition are optional and supplemental. Add surrogate keys only when and where you find they have a useful benefit.
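As a minimal sketch of that combination, using Python's built-in sqlite3 module: the surrogate customer_id stays the primary key, while a UNIQUE constraint on a business key blocks duplicates. The email column is an assumption (the question only lists customer_id, name and surname), and add_customer is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- surrogate key, assigned by SQLite
        name        TEXT NOT NULL,
        surname     TEXT NOT NULL,
        email       TEXT NOT NULL UNIQUE   -- assumed natural (business) key
    )
""")

def add_customer(name, surname, email):
    """Insert a customer; return the new customer_id, or None if the email is taken."""
    try:
        cur = conn.execute(
            "INSERT INTO customer (name, surname, email) VALUES (?, ?, ?)",
            (name, surname, email),
        )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # The natural key, not the surrogate, is what rejects the duplicate.
        return None

first = add_customer("John", "Smith", "john@example.com")
namesake = add_customer("John", "Smith", "j.smith@example.com")  # same name, different person
duplicate = add_customer("John", "Smith", "john@example.com")    # same person entered twice
```

Two people who share a name are both accepted, because the name is not part of the key; only the repeated business key is refused.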

The only purpose of the id (or "surrogate key" as you call it) is to uniquely identify a record.
First, say you will use name as a key. What will you do if:
the customer changes their name (in some countries women change their surname to their husband's);
you make a typo in the customer's name and have to correct it afterwards?
Then you are in big trouble because, even though you can technically change it, an id should never be changed!
Otherwise, you can make a big mess not only in your database and its consistency across backups, logs, etc., but also in all the external sources referring to it.
Second, how do you know you won't get two customers with the same name?

You cannot stop people from describing the world wrongly in the database. You can only stop them from describing the world wrongly in the database if the way they described it can't ever happen.
When there is no previous "natural" identifying property used in the business outside the database being stored in the database then we have to pick a "surrogate" distinguishing identifier after the system starts. (Some people wouldn't use "natural" for such an identifier picked after the system starts even though it is used in the business outside the database. And some people wouldn't use "surrogate" for such a distinguishing identifier used in the business system outside the database.)

Related

Table has no natural keys in SQL Server

The AP database example provided online by Murach's SQL Server 2016 for developers has an Invoice table with the surrogate key InvoiceID but with no natural keys. Most of the other tables have natural keys that uniquely identify each row, so I was curious: why would they provide a table without a natural key to identify what each row represents?
I got the AP database creation script from here:
https://www.murach.com/shop/murach-s-sql-server-2016-for-developers-detail

I think they made a mistake. The natural key here presumably ought to be (VendorID, InvoiceNumber). I have never seen a real accounts payable system that allowed duplicate invoice numbers for the same vendor. Paying the same invoice twice obviously isn't a good idea!
The most common motivation for creating a surrogate is to reduce the impact of having to change other key values. Natural key (AKA business key) values sometimes need to change. Surrogate keys need to change much less frequently because fewer people ever see them and so there's much less reason to change them. That relative stability may have some technical advantages in situations where the business key values are expected to change. Even in the presence of a surrogate, business keys are still critically important because they are the things that users and business processes depend on.
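A minimal sketch of the constraint suggested above, using Python's built-in sqlite3 module. This is a simplified, hypothetical Invoices table that keeps only the three columns named in this thread, not the actual Murach script; the sample invoice numbers are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Invoices (
        InvoiceID     INTEGER PRIMARY KEY,   -- surrogate key
        VendorID      INTEGER NOT NULL,
        InvoiceNumber TEXT    NOT NULL,
        UNIQUE (VendorID, InvoiceNumber)     -- natural key proposed in the answer
    )
""")

insert = "INSERT INTO Invoices (VendorID, InvoiceNumber) VALUES (?, ?)"
conn.execute(insert, (1, "INV-100"))
conn.execute(insert, (2, "INV-100"))   # same number from a different vendor: allowed

try:
    conn.execute(insert, (1, "INV-100"))   # same vendor, same number: rejected
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The surrogate InvoiceID alone would have accepted the third row with a fresh id, which is exactly the "pay the same invoice twice" case.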

An invoice is a man made piece of data. That's all it is. It has no natural key, because it has no natural identifier. The person or process who creates a new invoice assigns it a number, call it Invoice Number. But that number is just as artificial as Invoice.Id would be. If you want to consider one of those a surrogate, go ahead.
An automobile is a man made piece of gear, but it isn't just data. It's something to drive around. When a new automobile is made it gets assigned a unique identifier, called the Vehicle Identification Number, or VIN. But that key is ultimately just as artificial as Invoice Number. It's just pulled out of thin air, made so that it will be unique, and assigned to the car. There is nothing more "Natural" about VIN than there is about Invoice Number. And there is nothing less "natural" about identifiers that are chosen by the DBMS, perhaps using an autonumber feature.
Edit in response to comments: VIN is assigned at the business level, but its sole legitimate function is to identify a vehicle. There are rules for its formation, but those rules exist to prevent the same VIN from being assigned to two vehicles. If one of the digits in the VIN says the seating capacity of the vehicle, that's the seating capacity on the day of manufacture. It's possible to change the seating capacity of a vehicle after it's in operation, by ripping out one of the seats.
If all keys that are used by the business domain (alternatively the "conceptual domain") are natural, it must be recognized that in certain businesses a key will be generated inside a computerized system and eventually acquire meaning as it is used at the business layer. Arguments have been made in answers to other questions that surrogate keys should never be revealed to the application user, or perhaps even to the application program, lest it begin to be used in a meaningful way. That's ultimately a philosophical question, and not one of database design.

Can a table have a surrogate key without a (natural) alternate key?

I think it doesn't make sense for a table to have a surrogate key without also having a (natural) alternate key (keep in mind that one of the properties for a surrogate key is that it has no meaning outside the database environment).
For example say I have the following table:
Say that employee_id is the surrogate primary key, and there is no (natural) alternate key in the table.
Now let's say that some employee wants to change his phone number. How can we identify the record for this employee in the table? We can't identify it using the surrogate key, because the surrogate key is not known in the real world (i.e. we don't know the employee_id for each employee).
So there must be a (natural) alternate key in the table to identify each employee in the real world (for example: SSN).
Am I correct, or am I missing something?

You are right. Normally the users of a database need to be able to map information in a database table to real concepts or things outside the database. For that they need a usable natural key - or something that can be reliably translated into a natural key.
Your specific example isn't necessarily a good one because many (most?) organisations allocate an employee identifier to employees for the duration of their employment. That employee identifier may be known and used as the natural key by both employee and employer. You have said employee_id in your example is a surrogate but based on its name many people might assume it is not.

Every table has a candidate key. (Possibly, all the columns.) Each value of that key designates something. If employee_ids weren't used externally, they and the set of the other three columns would both identify sets of people with the same name, address and phone number.
You are right that a table must contain a natural key for whatever entity you want to identify--for "natural key" meaning "designator under the current business rules, external to the database".
You seem to be confusing multiple meanings for "surrogate key" and "natural key".
For "surrogate": One use is, a property set where designations are determined since current business rules: surrogate as new. Another use is, a property set where designations are determined since current business rules and only used internally: surrogate as internal.
For "natural": One use is, a property set designating under current business rules and before: natural as old. Another use is, a property set designating under current business rules: natural as external.
The original use of "surrogate" was as internal with "natural" as external. Unfortunately now usually people use "surrogate" as new and "natural" as old. And they seldom either consider or distinguish surrogates as internal. Some people might call a newly introduced external designator as both surrogate (as new) and natural (as external). (Re "meaningless" names.)
All you can do is decipher or ask what someone means when they use these terms.
Note that these definitions are relative to the "current" business rules. You also seem to be assuming that employee ids arrived with the database. At some point they were introduced, so were chosen after some older system started, so were surrogates as new under the new system. But if the database came later then by that time they were natural keys as old. They were natural as external both times; when introduced they were just new natural as external.

The term "natural key" belongs with the subject matter or problem domain, and not with the database design or solution domain.
If you ask people in the organization how they refer to each other, it's usually by name. "That's Mary Jones, sitting by herself". You never hear, "That's employee 79932, sitting by herself."
If you ask a database person, like most of us, what's wrong with using the name, you'll get the standard answers: it's not unique. It's mutable. It's too many characters. All of that is true, but it doesn't change how people work in the real world.
The item "social security number" began life as an artificial key, even though that was before the computer revolution really began (1936). Over time, it has taken on most of the look and feel of a natural key. In your case, I would even call it a natural key, because every employee has one (barring something hokey).

Should the contents of a column acting as a primary key be interpretable or purely unique integers

I have the luxury of designing a database from scratch. When designing columns to act as unique keys should I just use unique integers or should I attempt to make the values interpretable. So if I had a lookup table of ward names in a hospital should the id column contain unique codes that in someway relate to the name of the ward or just unique integers?

Resist the temptation to overload the id values with meaning. Use other attributes to store the info you're considering stuffing into the id.
Overloading the id with "meaning" is bad because:
If the data being stuffed into the ID changes, so must your ID. IDs should never change
If the data type of the data changes, you'll have a problem, for example:
If your ID is numeric, and the stuffed info changes from numeric to text, you'll have big problems
If the stuffed data changes from a simple field to a one-to-many child, your model will break
What you believe has "important" meaning now may not be important in the future. Then your "specially encoded" data will become useless and a burden, even a serious restriction
What currently "identifies" a product may change as the business evolves
I have seen this idea attempted many times, never successfully. In every case, the idea was scrapped and surrogate IDs were introduced to replace the magic IDs, with all the risk and development cost associated with that task.
In my career, I have seen most of the problems listed above actually happen.

You should not be using a lookup table. Make your tables InnoDB and use referential integrity to join tables together. Your id columns should always be set as primary and should be set to auto increment. Never try to make up your own ids. You should really look at some tutorial on referential integrity and learn how to associate tables with other tables.

The best choice for Person table primary key

What is your choice for primary key in tables that represent a person (like Client, User, Customer, Employee etc.)? My first choice would be a social security number (SSN). However, using SSN has been discouraged because of privacy concerns and different regulations. An SSN can change during a person's lifetime, so that is another reason against it.
I guess that one of the functions of well chosen natural primary key is to avoid duplication. I do not want a person to be registered twice in the database. Some surrogate or generated primary key does not help in avoiding duplicate entries. What is the best way to approach this?
What is the best way to guarantee uniqueness in your application for person entity and can this be handled on database level with primary key or uniqueness constraint?

I don't know which database engine you are using, but (at least with MySQL -- see 7.4.1. Make Your Data as Small as Possible), using an integer, the shortest possible, is generally considered best for performance and memory requirements.
I would use an integer, auto_increment, for that primary key.
The idea being:
If the PK is short, it helps identifying each row (it's faster and easier to compare two integers than two long strings)
If a column used in foreign keys is short, it'll require less memory for foreign keys, as the value of that column is likely to be stored in several places.
And, then, set a UNIQUE index on another column -- the one that determines uniqueness -- if that's possible and/or necessary.
Edit: Here are a couple of other questions/answers that might interest you :
What’s the best practice for Primary Keys in tables?
How do you like your primary keys?
Should I have a dedicated primary key field?
Use item specific prefixes and autonumber for primary keys?

As mentioned above, use an auto-increment as your primary key. But I don't believe this is your real question.
Your real question is how to avoid duplicate entries. In theory, there is no way - 2 people could be born on the same day, with the same name, and live in the same household, and not have a social insurance number available for one or the other. (One might be a foreigner visiting the country).
However, the combination of full name, birthdate, address, and telephone number is usually sufficient to avoid duplication. Note that addresses may be entered differently, people may have multiple phone numbers, and people may choose to omit their middle name or use an initial. It depends on how important it is to avoid duplicate entries, and how large is your userbase (and thus the likelihood of a collision).
Of course, if you can get the SSN/SIN then use that to determine uniqueness.

What attributes are available to you? Which ones does your application care about? For example, no two people can be born at exactly the same second at exactly the same place, but you probably don't have access to that data at that level of accuracy! So you need to decide, from the attributes you intend on modeling, which ones are sufficient to provide an acceptable level of data integrity. Whatever you choose, you're right in focusing on the data integrity aspects (preventing insertion of multiple rows for the same person) of your selection.
For Joins/Foreign Keys in other tables, it is best to use a surrogate key.
I've grown to consider the use of the term Primary Key a misnomer, or at best, confusing. Any key, whether you flag it as Primary Key, Alternate Key, Unique Key, or Unique Index, is still a key, and requires that every row in the table contain unique values for the attributes in the key. In that sense, all keys are equivalent. What matters most is whether they are natural keys (dependent on meaningful real domain model data attributes) or surrogates (independent of real data attributes).
Secondly, what also matters is what you use the key for. Surrogate keys are narrow and simple and never change (no reason to, since they don't mean anything), so they are a better choice for joins or for foreign keys in other dependent tables.
But to ensure data integrity, and prevent insertion of multiple rows for the same domain entity, they are totally useless... For that you need some kind of Natural Key, chosen from the data you have available, and which your application is modeling for some purpose.
The key does not have to be 100% immutable. If, for example, you use Name and Phone Number and Birthdate, then even if a person changes their name or their phone number, you can simply change the value in the table. As long as no other row already has the new values in its key attributes, you are fine.
Even if the key you select only works in 99.9% of the cases (say you are unlucky enough to run into two people with the same name and phone number who were coincidentally born the same day), well, at least 99.9% of your data will be guaranteed to be accurate and consistent, and you can, for example, just add a time to their birthdate to make them unique, or add some other attribute to the key to distinguish them. As long as you don't have to update data values in foreign keys throughout your database because of the change (since you are not using this key as a FK elsewhere), you are not facing any significant issue.
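A minimal sketch of that split, using Python's built-in sqlite3 module: the surrogate person_id carries the foreign key, while a UNIQUE constraint on (name, phone, birthdate) guards against duplicates. The visit table and the sample values are hypothetical, added only to have something that references a person.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,     -- surrogate: used for joins and FKs
        name      TEXT NOT NULL,
        phone     TEXT NOT NULL,
        birthdate TEXT NOT NULL,
        UNIQUE (name, phone, birthdate)    -- natural key: prevents duplicates
    );
    CREATE TABLE visit (                   -- hypothetical dependent table
        visit_id  INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(person_id)
    );
""")

conn.execute(
    "INSERT INTO person (name, phone, birthdate) VALUES (?, ?, ?)",
    ("Mary Jones", "555-0100", "1980-02-03"),
)
person_id = conn.execute("SELECT person_id FROM person").fetchone()[0]
conn.execute("INSERT INTO visit (person_id) VALUES (?)", (person_id,))

# The natural key changes; no row in visit has to be touched.
conn.execute("UPDATE person SET name = ? WHERE person_id = ?", ("Mary Smith", person_id))

visit_owner = conn.execute(
    "SELECT p.name FROM visit AS v JOIN person AS p ON p.person_id = v.person_id"
).fetchone()[0]
```

After the update, the visit still joins to the same person, now under the new name.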

Use an autogenerated integer primary key, and then put a unique constraint on anything that you believe should be unique. But SSNs are not unique in the real world so it would be a bad idea to put a uniqueness constraint on this column unless you think turning away customers because your database won't accept them is a good business model.

I prefer natural keys, but a table person is a lost case. SSNs are not unique and not everybody has one.

I'd recommend a surrogate key. Add all the indexes you need for other candidate keys, but keeping business logic out of the key is my recommendation.

I prefer natural keys, when they can be trusted.
Unless you are running a bank or something like that, there is no reason for your clients and users to provide you with a valid SSN, or even necessarily to have one. Thus, for business reasons, you are forced to distrust SSN in the case you outline. A similar argument would hold for any given natural key to "persons".
You have no choice but to assign an artificial (read "surrogate") key. It might as well be an integer. Make sure it's a big enough integer so you aren't going to need to expand it real soon.

To add to @Mark and @Pascal (autoincrement integers are your best bet) -- SSNs are useful and should be modelled correctly. Security concerns are part of application logic. You can normalize them into a separate table, and you can make them unique by providing a date-issued field.
P.S. To those who disagree with the "security in application" point: an enterprise DB will have a granular ACL model, so this won't be a sticking point.

Picking the best primary key + numbering system

We are trying to come up with a numbering system for the asset system that we are creating. There have been a few heated discussions on this topic in the office, so I decided to ask the experts of SO.
Considering the database design below, what would be the better option?
Example 1: Using auto surrogate keys.
=================    ==================
Road_Number(PK)      Segment_Number(PK)
=================    ==================
1                    1
Example 2: Using program generated PK
=================    ==================
Road_Number(PK)      Segment_Number(PK)
=================    ==================
"RD00000001WCK"      "00000001.1"
(The 00000001.1 means it's the first segment of the road. This increases every time you add a new segment, e.g. 00000001.2.)
Example 3: Using a bit of both(adding a new column)
=======================    ==========================
ID(PK)  Road_Number(UK)    ID(PK)  Segment_Number(UK)
=======================    ==========================
1       "RD00000001WCK"    1       "00000001.1"
Just a bit of background information, we will be using the Road Number and Segment Number in reports and other documents, so they have to be unique.
I have always liked keeping things simple so I prefer example 1, but I have been reading that you should not expose your primary keys in reports/documents. So now I'm thinking more along the lines of example 3.
I am also leaning towards example 3 because if we decide to change how our asset numbering is generated it won't have to do cascade updates on a primary key.
What do you think we should do?
Thanks.
EDIT: Thanks everyone for the great answers, they have helped me a lot.

This is really a discussion about surrogate (also called technical or synthetic) vs natural primary keys, a subject that has been extensively covered. I covered this in Database Development Mistakes Made by AppDevelopers.
Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).
In my opinion you should always use surrogate keys. This issue has come up in these questions:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Which format of primary key would you use in this situation.
Surrogate Vs. Natural/Business Keys
Should I have a dedicated primary key field?
Auto number fields are the way to go. If your keys have meaning outside your database (like asset numbers) those will quite possibly change and changing keys is problematic. Just use indexes for those things into the relevant tables.

I would personally say keep it simple and stay with an autoincremented primary key. If you need something more "Readable" in terms of display in the program, then possibly one of your other ideas, but I think that is just adding unneeded complexity to the primary key field.

I'm also very strongly in the "don't use primary keys as meaningful data" camp. Every time I have contravened that policy it has ended in tears. Sooner or later the meaningful data needs to change and if that means you have to change a primary key it can get painful. The primary key will probably be used in foreign key constraints and you can spend ages trying to sort it all out just to make a simple data change.
I always use GUIDs/UUIDs for my primary keys in every table I ever create, but that's just personal preference; serials or such are also good.

Don't put meaning into your PK fields unless...
It is 100% completely impossible that the value will ever change, and
No two people would ever reasonably argue about which value should be used for a particular row.
Go with option one and format the value in the app to look like option two or three when it is displayed.
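A minimal sketch of that display-time formatting, in Python. The function names are hypothetical, and two things are assumed rather than known from the question: that "WCK" is a fixed suffix, and that each segment carries a per-road sequence number alongside its road's integer key.

```python
def format_road_number(road_id: int, suffix: str = "WCK") -> str:
    """Render an integer road key in the style of example 2, e.g. 1 -> RD00000001WCK."""
    return f"RD{road_id:08d}{suffix}"

def format_segment_number(road_id: int, segment_seq: int) -> str:
    """Render a segment as <zero-padded road id>.<sequence within that road>."""
    return f"{road_id:08d}.{segment_seq}"

road_label = format_road_number(1)
segment_labels = [format_segment_number(1, seq) for seq in (1, 2)]
```

If the numbering scheme changes later, only these functions change; the stored keys and every foreign key that points at them stay as they are.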

I think the important thing to remember here is that each table in your database/design might have multiple keys. These are the Candidate Keys.
See the Wikipedia entry for Candidate Keys
By definition, all Candidate Keys are created equal. They are each unique identifiers for the table in question.
Your job then is to select the best candidate from the pool of Candidate Keys to serve as the Primary Key. The Primary Key will be used by other tables to establish the relational constraints, but you are free to continue using Candidate Keys to query the table.
Because Primary Keys are referenced by other structures, and therefore used in join operations, the criteria for Primary Key selection boils down to the following for me (in order of importance):
Immutable/Stable - Primary Key values should not change. If they do, you run the risk of introducing update anomalies
Not Null - most DBMS platforms require that the Primary Key attribute(s) are not null
Simple - simple datatypes and values for physical storage and performance. Integer values work well here, and this is the datatype of choice for most surrogate/auto-gen keys
Once you've identified the Candidate Keys, the criteria above can be used to select the Primary Key. If there is no "Natural" Candidate Key that meets the criteria, then a Surrogate Key that does meet the criteria can be created and used as mentioned in other answers.

Follow the Don't Use policy.
Some problems you can run into:
You need to generate keys from more than one host.
Someone will want to reserve contiguous numbers to use together.
How meaningful will people want it to be? Wars are fought over this, and you're in the first skirmish of one already. "It's already meaningful, and if we just add two more digits we can ..." i.e. you're establishing a design style that will (should) be extensible.
If you are concatenating the two, you're doing typecasts which can mess up your query Optimizer.
You'll need to reclassify roads, and redefine their boundaries (i.e. move the roads), which implies changing the primary key and maybe losing links.
There are workarounds for all this, but this is the kind of issue where workarounds proliferate and get out of control. And it doesn't take more than a couple to get beyond "Simple".

As mentioned before, keep your internal primary keys as just keys, whatever the most optimal datatype is on your platform.
However you do need to let the numbering system argument be fought out, as this is actually a business requirement, and perhaps let's call it an identification system for the asset.
If there is only going to be one identifier, then add it as a column to the main table. If there are likely to be many identification systems (and assets usually have many), you'll need two more tables
Identifier-type table         Identifier-cross-ref table
type-id      ------------>    type-id             (unique
type-name                     identifier-string    key)
                              internal-id
That way different people who need to access the asset can identify in their own way. For example the server team will identify a server differently from the network team and different again from project management, accounts, etc.
Plus, you get to go to all the meetings where everyone argues with each other.
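A minimal sketch of the two-table layout above, using Python's built-in sqlite3 module. The asset table, the sample identifier strings and the find_asset helper are hypothetical; the column names follow the diagram with hyphens turned into underscores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE asset (
        internal_id INTEGER PRIMARY KEY            -- internal surrogate key
    );
    CREATE TABLE identifier_type (
        type_id   INTEGER PRIMARY KEY,
        type_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE identifier_cross_ref (
        type_id           INTEGER NOT NULL REFERENCES identifier_type(type_id),
        identifier_string TEXT    NOT NULL,
        internal_id       INTEGER NOT NULL REFERENCES asset(internal_id),
        UNIQUE (type_id, identifier_string)        -- the unique key in the diagram
    );
    INSERT INTO asset (internal_id) VALUES (1);
    INSERT INTO identifier_type VALUES (1, 'server team'), (2, 'accounts');
    INSERT INTO identifier_cross_ref VALUES
        (1, 'srv-web-01', 1),
        (2, 'FA-2009-0042', 1);
""")

def find_asset(type_name, identifier):
    """Resolve one team's identifier to the internal asset id, or None if unknown."""
    row = conn.execute(
        """
        SELECT x.internal_id
        FROM identifier_cross_ref AS x
        JOIN identifier_type AS t ON t.type_id = x.type_id
        WHERE t.type_name = ? AND x.identifier_string = ?
        """,
        (type_name, identifier),
    ).fetchone()
    return row[0] if row else None
```

Each team looks the asset up under its own scheme, and both lookups land on the same internal key; an identifier is only meaningful together with its type.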

Another thing to keep in mind is that if you're importing a lot of data into this system, you may find out that things like Road_Number are not as unique as you thought, and there may be operational roadblocks to fixing the problem (repainting road signs, etc.).

While natural keys may have great meaning to the business users, if you do not have the agreement that those keys are sacred and should not be altered, you will more than likely be pulling your hair out while maintaining a database where the "product codes have to be changed to accommodate the new product line the company acquired." You need to protect the RI of your data, and integers as primary keys with auto-increment are the best way to go. Performance is also better when indexing and traversing integers than char columns.
While not appropriate as primary keys, natural keys are very appropriate for user consumption and you can enforce uniqueness via an index. They bring a context to the data that will make it easier for all parties to understand. Also, in the event that you need to reload data, the natural keys can help verify that your lookups are still valid.

I would go with the surrogate key, but you may want to have a computed column that "formats" the surrogate key into a more "readable" value if that improves your reporting. The computed column could, for instance, produce example 2 from the surrogate key for display purposes.
I think the surrogate key route is the way to go and the only exceptions that I make for it are join tables, where the primary key could be composed of the foreign key references. Even in these cases I'm finding that having a surrogate primary key is more useful than not.

I suspect that you really should use option #3, as many here have already said. Surrogate PKs (either Integers or GUIDs) are good practice, even if there are adequate business keys. Surrogates will reduce maintenance headaches (as you yourself have already noted).
That being said, something you may want to consider is whether or not your database is:
focused on data maintenance and transactional processing (i.e. Create/Update/Delete operations)
geared towards analysis and reporting (i.e. Queries)
In other words, are the users concerned with maintaining active data or querying largely static data to find answers?
If you are heavily focused on building an analysis and reporting DB (e.g. a data warehouse/mart) that is exposed to technical business users (e.g. report designers) who have a good grasp of the business vocabulary, then you might want to consider using natural keys based on meaningful business values. They help reduce query complexity by eliminating the need for complex joins and help the user focus on their task, not fighting the database structure.
Otherwise you're probably focused on a full CRUD DB that has to cover all the bases to some degree - this is the vast majority of situations. In which case, go with your option #3. You can always optimize for queryability in the future but you'll be hard pressed to retrofit for maintainability.

I hope you will agree with me that every design element should have a single purpose.
The question is: what do you think is the purpose of a PK? If it is to identify a unique record in a table, then surrogate keys win without much trouble. This is simple and straightforward.
As far as the new columns in option 3 are concerned, you should check whether they can be calculated from other elements without too much of a performance penalty (best would be to do the calculation in the model layer, so that it can be changed more easily than if it were done in the RDBMS). For example, you can store the segment number and road number in the corresponding tables and then use them to generate "00000001.1". This will allow you to change asset numbering on the fly.

First off, option 2 is the absolute worst option. As an Index, it's a string, and that makes it slow. And it's generated based on business rules - which can change and cause a rather large headache.
Personally, I always use a separate primary key column; and I always use a GUID. Some developers prefer a simple INT over a GUID for reasons of hard-drive space. However, if the situation arises where you need to merge two databases, GUIDs will almost never collide (whereas INTs are guaranteed to collide).
Primary Keys should NEVER be seen by the user. Making it readable to the user should not be a concern. Primary Keys SHOULD be used to link with Foreign Keys. This is their purpose. The value should be machine readable and, once created, never changed.
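The merge argument above can be illustrated with a small Python sketch. The two "databases" are plain dictionaries with made-up road names, and the integer case assumes both databases started their auto-increment counters at 1, which is what makes the collision certain.

```python
import uuid

# Auto-increment style keys: both databases independently handed out 1, 2, ...
db_a_int = {1: "Main Street", 2: "High Street"}
db_b_int = {1: "Station Road", 2: "Mill Lane"}
int_collisions = db_a_int.keys() & db_b_int.keys()   # keys present in both

# UUID keys: generated independently, with a negligible chance of clashing.
db_a_uuid = {uuid.uuid4(): name for name in db_a_int.values()}
db_b_uuid = {uuid.uuid4(): name for name in db_b_int.values()}
merged = {**db_a_uuid, **db_b_uuid}                   # no rows overwritten
```

With integer keys, a merge has to renumber one side and every foreign key that points at it; with UUIDs the rows can simply be combined.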
