The best choice for Person table primary key - database

What is your choice for primary key in tables that represent a person (like Client, User, Customer, Employee etc.)? My first choice would be an social security number (SSN). However, using SSN has been discouraged because of privacy concerns and different regulations. SSN can change during person lifetime, so that is another reason against it.
I guess that one of the functions of well chosen natural primary key is to avoid duplication. I do not want a person to be registered twice in the database. Some surrogate or generated primary key does not help in avoiding duplicate entries. What is the best way to approach this?
What is the best way to guarantee uniqueness in your application for person entity and can this be handled on database level with primary key or uniqueness constraint?

I don't know which Database engine you are using, but (at least with MySQL -- see 7.4.1. Make Your Data as Small as Possible), using an integer, the shortest possible, is generally considered best for performances and memory requirements.
I would use an integer, auto_increment, for that primary key.
The idea being :
If the PK is short, it helps identifying each row (it's faster and easier to compare two integers than two long strings)
If a column used in foreign keys is short, it'll require less memory for foreign keys, as the value of that column is likely to be stored in several places.
And, then, set a UNIQUE index on an other column -- the one that determines unicity -- if that's possible and/or necessary.
Edit: Here are a couple of other questions/answers that might interest you :
What’s the best practice for Primary Keys in tables?
How do you like your primary keys?
Should I have a dedicated primary key field?
Use item specific prefixes and autonumber for primary keys?

As mentioned above, use an auto-increment as your primary key. But I don't believe this is your real question.
Your real question is how to avoid duplicate entries. In theory, there is no way - 2 people could be born on the same day, with the same name, and live in the same household, and not have a social insurance number available for one or the other. (One might be a foreigner visiting the country).
However, the combination of full name, birthdate, address, and telephone number is usually sufficient to avoid duplication. Note that addresses may be entered differently, people may have multiple phone numbers, and people may choose to omit their middle name or use an initial. It depends on how important it is to avoid duplicate entries, and how large is your userbase (and thus the likelihood of a collision).
Of course, if you can get the SSN/SIN then use that to determine uniqueness.

What attributes are available to you? Which ones does your application care about ? For example no two people can be born at exactly the same second at exactly the same place, but you probably don't have access to that data at that level of accuracy! So you need to decide, from the attributes you intend on modeling, which ones are sufficient to provide an acceptable level of data integrity. Whatever you choose, you're right in focusing on the data integrity aspects (preventing insertion of multiple rows for the same person) of your selection.
For Joins/Foreign Keys in other tables, it is best to use a surrogate key.
I've grown to consider the use of the word Primary Key as a misnomer, or at best, confusing. Any key, whether you flag it as Primary Key, Alternate Key, Unique Key, or Unique Index, is still a Key, and requires that every row in the table contain unique values for the attributes in the key. In that sense, all keys are equivilent. What matters more (Most), is whether they are natural keys (dependant on meaningful real- domain model data attributes), or surrogates (Independendant of real data attributes)
Secondly, what also matters is what you use the key for.. Surrogate keys are narrow and simple and never change (No reason to - they don't mean anything) So they are a better choice for joins or for foreign Keys in other dependant tables.
But to ensure data integrity, and prevent insertion of multiple rows for the same domain entity, they are totally useless... For that you need some kind of Natural Key, chosen from the data you have available, and which your application is modeling for some purpose.
The key does not have to be 100% immutable. If (as an example), you use Name and Phone Number and Birthdate, for example, even if a person changes their name, or their phone number, you can simply change the value in the table. As long as no other row already has the new values in their key attributes, you are fine.
Even if the key you select only works in 99.9% of the cases, (say you are unlucky enough to run into two people with the same name and phone number and were coincidentally born the same day), well, at least 99.9% of your data will be guaranteed to be accurate and consistent - and you can for example, just add time to their birthdate to make them unique, or add some other attribute to the key to distinquish them. As long as you don't have to update data values in Foreign Keys throughout your database because of the change, (since you are not using this key as a FK elsewhere) you are not facing any significant issue.

Use an autogenerated integer primary key, and then put a unique constraint on anything that you believe should be unique. But SSNs are not unique in the real world so it would be a bad idea to put a uniqueness constraint on this column unless you think turning away customers because your database won't accept them is a good business model.

I prefer natural keys, but a table person is a lost case. SSNs are not unique and not everybody has one.

I'd recommend a surrogate key. Add all the indexes you need for other candidate keys, but keeping business logic out of the key is my recommendation.

I prefer natural keys, when they can be trusted.
Unless you are running a bank or something like that, there is no reason for your clients and users to provide you with a valid SSN, or even necessarily to have one. Thus, for business reasons, you are forced to distrust SSN in the case you outline. A similar argumant would hold for any given natural key to "persons".
You have no choice but to assign an artificial (Read "surrogate") key. It might as well be an integer. Make sure it's big enough integer so you aren't going to need toexpand it real soon.

To add to #Mark and #Pascal (autoincrement integers are your best bet) -- SSN's are usefull and should be modelled correctly. Security concerns are part of application logic. You can normalize them into a separate table, and you can make them unique by providing a date-issued field.
p.s., to those who disagree with the `security in application' point, an enterprise DB will have a granular ACL model; so this won't be a sticking point.

Related

What are the benefits of finding a natural primary key

My question is more or less the opposite of this one: Why would one ever want to bother finding a natural primary key in a relation when using a sequence as a surrogate seems so much easier.
BradC mentioned in his answer to a related question that the criteria for choosing a primary key are uniqueness, irreductibility, simplicity, stability and familiarity. It looks to me like using a sequence sacrifices the last criterion in order to provide an optimal solution for the first four.
If I hold those criteria to be correct, I can reformulate my question as: In which circumstances would one ever consider it advantageous to complicate one's life by looking for a unique, irreductible, simple and stable key that is also familiar?
To get a meaningful value from a lookup table without doing unnecessary joins.
Example case: garments references a lookup table of colors, which has an auto-increment primary key. Getting the name of the color requires a join:
SELECT c.color
FROM garments g
JOIN colors c USING (color_id);
Simpler example: the colors.color itself is the primary key of that table, and therefore it's the foreign key column in any table that references it.
SELECT g.color
FROM garments g
The answer is data integrity. Instances of entities in the business domain outside the database are by definition identifiable things. If you fail to give them external, real world identifiers in the database then that database stands little chance of modelling reality correctly.
A natural key[1] is what ensures facts in the database are identifiable with actual things in the reality you are trying to model. They are the means which users rely on when they act on and update the data in the database. The constraints that enforce those keys are an implementation of business rules. If your database is to model the business domain accurately then natural keys are not just desirable but essential. If you doubt that then you haven't done enough business analysis. Just ask your customers how they think their business would operate if they were left looking at screens full of duplicate data!
[1] I recommend calling them business keys or domain keys rather than natural keys. Those are far more appropriate and less overloaded terms even though they mean exactly the same thing.
You generally need to identify what the unique key on the data is anyway, as you still need to be able to ensure that the data is not duplicated.
The strength of the synthetic key is that it allows the values of the unique natural key to be modifiable in future, with child records not needing to be updated.
So you're not really skipping the "identify the key" part of the design by using a synthetic primary key, you're just insulating yourself from the possibility of the values changing.
Below are the benefits of using a natural primary key:
In case you need to have a unique constraint on any column then making it primary key will fulfill the need for that,if you aren't suppose to receive any null value into that.So, anyways it's saving your cost of 1 extra key.
In some RDBMS, the key you are declaring as primary key is automatically creating a btree index on that column and if you make a natural primary key based on your access pattern then it is like Icing on the cake because now you are making two shots with one stone. Saving cost of an extra index and making your queries faster by having that meaningful primary key in where clause.
Last but not least ,you will be able to save space of one extra column/key/index.

Create UserID even though usernames unique?

My web application will have various types of users: a root user, sys-admins, customers, contractors, etc. I'd like to have a table "User" that will have the columns "Username", "Password", and "Role" with role being one of the aforementioned roles. I will, for example, also have a table called "Customer" for storing customer-specific attributes.
Here's my question. Since all usernames will be unique, I could use the username as the primary key for the User table. Would it make more sense, though, to create a "User ID" column and use it rather than the Username column as the PK? There are many other tables that will use this User PK as a foreign key and I would think that, from a performance standpoint, it will be faster to compare user IDs rather than username text strings.
Thank you for your response.
This is the old natural vs. surrogate key debate.
In a nutshell, nothing is cut-and-dry and you'll need to balance between competing interests:
If you anticipate you'll need to update the key, use surrogate (which doesn't need updating) to avoid ON UPDATE CASCADE.
If you need to lookup the key, but not the other columns, use natural key to avoid the JOIN altogether. But if you need to lookup more than just the key, use surrogate to make FKs and accompanying indexes slimmer and more efficient.
Do you anticipate range scans on the natural key? If yes, cluster it. However, secondary indexes can be expensive in clustered tables, which plays against a surrogate key.
Is your table very large? Save space by having only one index (on natural key) instead of two (on natural and surrogate).
Composite natural keys (that are at the child endpoints of identifying relationships) may be necessary for correctly modeling diamond-shaped dependencies.
Surrogates may be more friendly to ORM tools.
In the case of usernames...
You'll probably want to allow the update.
It's unlikely you'll need to lookup only the username.
Range scans on usernames make no sense (unlike equi-searches).
You aren't storing the whole population of the Earth, are you?
We are talking about "stand alone" natural key here, not a natural key that is the result of an identifying relationship.
Unknown?
...so the surrogate key is probably justified in this case.
I would create the userid because while usernames might be unique, they are not generally unchangeable and I would prefer not to have to change thousands or millions of records because HLGEM wanted to become HLGEM_1. Further joins on integers are faster.
There are two schools of thought. Database purists believe that you should always use natural keys wherever possible (and it makes sense). So if you subscribe to that line of thought, and the username is unchangeable, then make the key the username.
Database pragmatists believe that surrogate keys are more useful, and tend to stand up better to changes in requirements. They can also be faster in some cases, particularly with large string keys. There are also security concerns, in that using natural keys you might leak information that wouldn't be available from a surrogate key.

What are the design criteria for primary keys?

Choosing good primary keys, candidate keys and the foreign keys that use them is a vitally important database design task -- as much art as science. The design task has very specific design criteria.
What are the criteria?
The criteria for consideration of a primary key are:
Uniqueness
Irreducibility (no subset of the key uniquely identifies a row in the table)
Simplicity (so that relational representation & manipulation can be simpler)
Stability (should not be altered frequently)
Familiarity (meaningful to the user)
What is a Primary Key?
The primary key is something that uniquely identifies a row/record of data. It can also be multiple columns, which is called a composite.
Ability to Change
Because the primary key is often used for foreign references, it should be as stable as possible. All data in the database is mutable, providing someone is connecting with an account that has appropriate privileges. This is why databases provide the ability to define CASCADE ON DELETE and CASCADE ON UPDATE--to sync referential dependencies without having to disable constraints.
Natural or Artifical/Surrogate?
Ideally, you want a natural key. A natural key is existing data that uniquely identifies the entity you are modeling. For example, the abbreviations of US states is a good natural key because the abbreviation is consistent and everyone knows them:
US_STATE_PRIMARY_KEY US_STATE
--------------------------
AL Alabama
AK Alaska
AZ Arizona
AR Arkansas
CA California
Don't try too hard to find a natural key. They seldom exist. It's unlikely that a US State name would change, but it is plausible.
Realistically, primary keys will typically be artificial (often generated by database functionality). These are typically numbers or GUIDs, and they're considered artificial because on their own - there's nothing to relate their value to the information they uniquely identify. A sales receipt is always numbered, because there's nothing natural about it and it's also for auditing - gaps in the receipt numbers raise suspicions. To demonstrate how arbitrary numbering is, here's the US state table but using an integer for the primary key column, US_STATE_CODE:
US_STATE_PRIMARY_KEY US_STATE
--------------------------
100 Alabama
101 Alaska
102 Arizona
103 Arkansas
104 California
There's no requirement to start the value at one; some shops use this as a security measure to thwart SQL injection. The value is sequential based on the alphabetic ordering of the State name, but that can't be guaranteed. But unlike the natural key, if the state name changed - only one column would have to be updated.
Single Column vs Composite
Ideally one column will be the primary key, but make the decision based on the data at hand--do not combine columns just for the sake of having a single column. If you do shoehorn data together, use a character to separate the data easily (though operations to do this won't be able to take advantage of an index if present).
Performance
From a performance perspective, integers are best because they offer a decent range of values and the number of bytes used is small when you compare to VARCHAR of five or more characters.
Database design starts with a conceptual data model (such as an entity relationship diagram) and finishes up with a database schema or schemas. Entities are mapped to tables; in this process one entity may be split into several table, several entities may be merged into a single table and new tables may arise (for instance, intersection tables to implement many-to-many relationships).
In an ERD entities have primary keys. These are natural keys, that is they are attributes of the entity. For a PERSON entity it might be SocialSecurityNumber. For an ORDER entity if might be OrderRef For an INVOICE entity it might be InvoiceNo. In the first case that is a real-life identifier; in the second case it is a smart key in an ugly format (2010/DEF/000023 ); in the third case it is a monotonically incrementing number because that is what the current paper-based system uses.
Natural keys can be fanciful. I once worked on a database design where the analyst had specified the CUSTOMER entity with a key of (FullName, Address, Sex, DateOfBirth, DistinguishingCharacteristics) on the basis that two individuals of the same name, birth date and gender could live at the same address.
The characteristics of an entity's primary key are:
unique
familiar
stable (presumed)
minimal (one or more attributes but as few as necessary)
When it comes to primary keys for database tables, natural keys are not always suitable.
There are many reasons not to use SSN as a physical primary key. Protection of a citizen's personal data is actually the most important but it is also the case that an individual's number can change. Primary keys should be unvarying.
Smart keys are dumb. They are actually compound keys compressed into a single column. They are better represented as separate columns, not least because it is a frequent requirement to search on single elements of the key. Also, the format of such keys can change.
In general compound keys are a pain as primary keys because we have to cascade multiple columns as foreign keys. This is exacerbated when the child's primary key is defined as a serial number within the parent's primary key. There are systems out there which dependent tables inheriting a nine-column foreign key from a parent when they have a scant two data columns of their own. Sometimes this sort of inheritance can be useful but mostly it is a just a hassle.
The characteristics of an entity's primary key are:
unique
appropriate (meaningless)
guaranteed stability
minimal, usually a single column (except for intersection tables)
So unless the candidate key is a meaningless identifier (such as InvoiceNo) a table should have a synthetic key (AKA surrogate key). This can be a monotonically incrementing number or a GUID according to your needs. Regarding intersection tables, if they have no other attributes or dependent tables there is no value in replacing a compound primary key (AKA composite key) with a synthetic one.
The crucial thing is: we still enforce the candidate keys. This means applying UNIQUE constraints on those columns - SSN , OrderRef - in the parent table. This is because a synthetic key uniquely identifies a row in a table, it does not uniquely identify the data.
Regarding familiarity
Familiarity is a curly one. It is an important consideration when it comes to we are identifying primary keys in a conceptual data model but it is less useful when it comes to database design.
In a commnet #bbadour provides two contrasting examples:
{3296013,840082470,Bob Badour,745} versus {840082470,Bob Badour,PE,CA}
and poses the question:
"What does 3296013 achieve that was not already achieved by 840082470, which happens to be the primary key for my academic records at any or every post-secondary school in Canada."
Well, 840082470 is like a invoice number. Of itself it is a meaningless string of digits. If the system we are designing belongs to the domain of Canadian higher education then it is certainly acceptable as a candidate key. However, because it is a key apparently owned by an external central system (forgive me for not understanding the Canadian academic system), it is open to some of the objections to SSN as a primary key. We are reliant on that external system to ensure uniqueness, guarantee stability and verify identification.
As for 745 versus PE,CA, that is clearly wrong. The Canadian postal abbreviation for "Prince Edward Island" and the ISO digraph for "Canada" identify two distinct pieces of information and derive from different sources, so they should be represented as two separate columns. But let us focus on whether 745 or PE makes the better primary key.
First thing, the database doesn't care which data type we use for the code to represent "Prince Edward Island". It just wants guaranteed uniqueness.
Second thing, the user-facing part of the system is likely to display the full expansion "Prince Edward Island", in which case the application is going to need to execute a look-up anyway. This is because users of a system which also holds addresses from the country of Peru or the state of California will appreciate the clarity of the expanded names[1]. Certainly if we go beyond the few hard cases (such as state abbreviations) the application should always expand codes when displaying them to users.
Thus the only advantage of using PE rather than 745 is that it makes ad hoc querying easier.
Third thing, if the code expansion changes we might want to distinguish records which use the newer version. This is a lot easier if 745='Prince Edward Island' and 746='Prince Edward Is.' than if we use PE as the primary key.
Fourth thing, there are programming considerations. For instance, if the application developers have to provide drop-down lists using Java Enumerations they need numeric codes.
In short, familiarity of natural keys is not as useful as the practicality of surrogate keys.
[1] Canadians will know that CA stands for Canada. But does MO stand for Morocco, Monaco, Moldova, Montenegro, Mongolia or Montserrat? Actually none of them: it's Macau.
A Primary Key is a key that uniquely identifies an entity. When you are choosing a primary key, the best choice is almost always a surrogate key that has absolutely nothing to do with the entity at all other than uniquely identifying it.
And that's it. There are supposedly rare edge cases where a primary key might be a natural key, but I've never seen a valid one.
Most of us use a 32-bit auto-increment integer as a primary key. Another excellent choice (in certain circumstances) is a UUID.
A candidate key is a set of attributes that are irreducibly unique (irreducible meaning that no attribute can be removed from the key without losing the uniqueness property).
Other criteria when choosing what candidate keys to implement are: simplicity, stability, familiarity.
These three criteria are important considerations but not necessarily essential attributes of a key. For instance it may be desirable and quite reasonable to enforce a key that can change often. e.g.: a user login name is required to be unique but the user may change it at will as long as it remains unique.
A primary key is a candidate key.
Hey. it's open again. Here goes.
(1) Choose good candidate keys.
It does not pertain to the database designer to choose candidate keys.
The database designer has the responsibility to see to it that all the
uniqueness requirements he is informed of by the user, will be enforced.
So it is the user who "chooses" what the candidate keys are.
There are two scenario's I can think of that relax this unequivocal
position a bit.
One is if the user says that some attribute of type 'video' or 'audio' (or
some such) is to be unique. It may be infeasible to actually enforce
that, and it is the designer's responsibility to point that out to the
user (as it is also his responsibility to point out that 'uniqueness' of
audio and video content is a very debatable subject, and that any
uniqueness on such attribute values, even if enforcible by the system,
still has a good chance of not being the same uniqueness that the user
wants).
Second is how the picture gets muddied by the possibility of distinct
logical designs all addressing the same problem. If D1 and D2 are both
valid designs addressing the same problem, then it might be the case that
a certain given uniqueness rule imposed by the user, is enforcible using
keys in D1, but not in D2. From this perspective, "choosing candidate
keys" can be interpreted as "choosing a particular design such that a
given uniqueness rule is enforcible using keys". But that wasn't really
the question that you asked.
(2) Choose good primary keys.
A while ago, Darwen launched the question "What are good reasons to single
out one particular candidate from among the others as being 'primary' ?".
Nothing much came out, except then perhaps : "to suggest that this
particular key is the preferred one to use whenever making references to
this relvar". I suspect they didn't find that convincing enough to change
their earlier decision that "no key is more unique than any other".
But, supposing that nonetheless there exists some valid reason to single
out one particular key as "primary", I suppose the following
considerations apply :
the likeliness, or appropriateness, of using this primary key also as,
e.g., the clustering key in the physical design.
and as a consequence of that, the probability of having to change a
value of some existing primary key. Key values that are highly stable
will be preferable over key values that are more volatile.
the percentage of the business that naturally uses some such key in
their daily operations.
if the required space for physically encoding key values is
significantly different, which one has the smallest encoding size.
Your answer to Erwin:
"I agree that choosing a primary key merely designates one candidate key as preferred for foreign key references. However, even if we eliminated the name "primary key" entirely, designers must still choose which candidate key to propagate into another relation for reference purposes. If users identify a heavily referenced relation with an unstable, composite key, do you intend to imply that the designer has no business choosing an additional simple, stable key? Or using the simple, stable key for referencing the relation? Your candidate key section seems to imply that. – bbadour 8 hours ago "
Your original question was about 'primary keys'. Now you change your focus to keys and foreign keys. A key is an integrity constraint, so the only criteria are that a minimal set of attributes has to be unique in a relation (uniqueness and irreducibility). If we change our focus to foreign keys then simplicity, stability and familiarity are the criteria to choose from all the candidate keys in de referenced relation. There could be more candidate keys that fulfill that criteria to more or less the same extend. If we look at familiarity, one candidate key could be very familiar to a group of users and not to another group for which another candidate key is more familiar. Think about different views or subschemas of a database. This second group of users should choose a different candidate key for reference purposes (as foreign key). If you insist in 'primary keys' of which we only have one per relation then I have to ask what makes a key more primary than others.
I think the term primary key should not be used. At least at the logical level. Also the term 'foreign keys' is not well chosen (foreign keys are not keys, but references).
So, I think the remarks of Erwin about ‘primary’ keys were very much to the point. Or at least this was my interpretation of what he means.
Do you agree with this?
If so, would you change your original question to "What are the design criteria for keys and what are the criteria to choose a foreign key from the available candidate keys?"?
If not, why?
Regards,
Carlos
A primary key is a candidate key chosen for special treatment, so first we must look at the properties of candidate keys. A set of one or more columns is a candidate key if it has the following two properties:
Uniqueness: A candidate key must uniquely identify each row in a table. No table may contain two rows with the same value for the candidate key.
Irreducability: Removing any column from a candidate key must violate the uniqness property. In other words, no subset of columns in a candidate key is itself a candidate key.
If no candidate key exists, and sometimes even if one does, a surrogate key is often created using an auto-incrementing integer column, or made up using some other technique. This surrogate key is now also a candidate key.
It is often useful to choose among the available candidate keys and to designate one of them as the primary key. The first criteria often applied is simplicity indicating the candidate key with the fewest columns. However there are other potential criteria, like familiarity, familiar values being more useful than non-familiar values, and stability, stable keys being less troublesome than keys that are apt to change. These criteria however, are strictlty outside the scope the relational model, often conflict with each other, and are often made to deal with implementation limitations.
I would say that the first two concepts "uniqueness" and "irreducability" are less design criteria than fundamental properties of primary keys, while the latter concepts of "simplicity", "familiarity" and "stability" are more properly labeled design criteria, as they involve tradeoffs and subjectivity.
Why choose a primary key? Simplicity and familiarity are not only criteria for choosing among available candididate keys, but are why we should choose a primary key at all. If there are are multiple candidate keys in a table, it simplifys things if all foreign keys pointing to that table refer to the same candidate key. Furthermore, the very act of choosing a particular candidate key will help make it familiar.
What are the criteria?
A PRIMARY KEY is something that will define the entity, only the entity and nothing but the entity.
You can take it from the outside world. Say, a star catalog number to identify a star (good example), or an SSN to identify a person (bad example).
In this case, you rely on the outside world.
Do all people have SSN? (They don't).
Are SSN's unique? (They aren't).
Can an SSN be assigned to another person? (It can).
You can generate it inside your model, using AUTOINCREMENT or GUIDs or whatever.
In this case, you rely on yourself and your database skills.
Do all people in your model have an ID? (Yes, they do, otherwise they wouldn't be in the table with ID NOT NULL).
Are these ID's unique? (Yes, they are, the PRIMARY KEY constraint takes care of it).
Can they be assigned to other persons? (No, they cannot, they are either non-repeatable by design or auto incrementing).
Or another set of answers:
Do all people in your model have an ID? (No, they don't, the people table was accidentally dropped, though some other information retained).
Are these ID's unique? (No, we failed to merge two versions of the database properly).
Can they be assigned to other persons? (Yes, we reset the AUTOINCREMENT by mistake).
The most important thing is that a surrogate key is a feast that is always with you. You can always create a surrogate key: nothing on Earth can stop you from declaring an AUTOINCREMENT field. But by far not all things have some kind of identifier everybody agrees upon.
However, a good natural key cannot be overemphasized.
Guide Star Catalog database is most probably backed up more reliably than yours, and the list of US state codes you always can restore right from the memory.
Only one really, choose a surrogate for each table (identity/auto_number) or something similar that the users will never even see so you can do whatever is necessary with them whenever you need to now and in the future.
(Not quite sure how to interpret this question. Sounds like a quiz or something where you are looking for one single "right" answer from a textbook. I'm going to interpret the question as a more practical one, hence my advice below.)
At least in the MS SQL world, discussion about a proper Primary Key is inevitably wrapped up in discussion about the proper clustered index for a table. The two don't have to be the same, but they are by default, and for many tables, making the two the same is often a good idea.
For the purpose of our discussion here, its important to distinguish between the two:
A PRIMARY KEY is a field or combination of fields that uniquely identify a row.
A CLUSTERED INDEX is a field or combination of fields that represents the physical ordering of a table. (Again, I am speaking about MS SQL Server, not sure how other RDBS might handle this)
Key to the remainder of my discussion is knowing that since SQL 7.0, the clustered index key is used as a row identifier for all non-clustered indexes. This means that many of the same criteria for choosing a good clustering key are the same as for choosing a good primary key.
Let's first look at the criteria for a good clustered index (From Kimberly Tripp's excellent article). A clustered index should be:
Unique - otherwise useless as a row identifier for other indexes
Narrow - this key is used in other indexes, so should be as narrow as possible
Static - If key values change, then references become invalid and will need updating
Ever-increasing - To reduce physical table fragmentation as new rows are added
It is readily apparent the first 3 are also good criteria for a primary key. #4 is a bonus that will reduce table fragmentation as tables grow.
A GUID as a primary key, as popular as that is, actually fails 2 of these criteria (Narrow and Ever-Increasing). As such, it is not recommended as a PK/Clustered index in most circumstances (see Kim's related article here)
I'm going to say something here that is not expected.
All the stuff they teach in database about normalization and keys is all wrong when it comes to choosing primary keys.
The primary key is special when it comes to range queries, and for that reason if you have a dominant range query that is your primary key, no exceptions.
If your dominant range query is not on a candidate key you end up with a primary key that is not enforced for uniqueness! This is sometimes called a clustered index, which is a misnomer because there is no index.
Now the normalization and candidate keys are all important, and you will want to enforce unique constraints on at least some of them. But do not assign the primary key because it is the natural key. In fact, this is slower than defining an index and a unique constraint. Define the primary key based on range queries only.
Remember, there is no constraint to actually have primary keys. A table with no primary keys is called a heap table and has either no intrinsic ordering or insertion order intrinsic ordering.
EDIT: definition of range query:
A range query is a query that is an ORDER BY query or contains either a greater than or less than operator. What we are interested in are the columns for which these queries run on. The fundamental idea is a range query fetches several (tens to hundreds to perhaps thousands but not all) rows from the table based on bounding conditions at one or both ends.
There is another kind of range queries, and that is where you have a foreign key to another table and an operation is select all matching on that foreign key. This is in fact also a range query although not obviously so.

Picking the best primary key + numbering system

We are trying to come up with a numbering system for the asset system that we are creating, there has been a few heated discussions on this topic in the office so I decided to ask the experts of SO.
Considering the database design below what would be the better option.
Example 1: Using auto surrogate keys.
================= ==================
Road_Number(PK) Segment_Number(PK)
================= ==================
1 1
Example 2: Using program generated PK
================= ==================
Road_Number(PK) Segment_Number(PK)
================= ==================
"RD00000001WCK" "00000001.1"
(the 00000001.1 means it's the first segment of the road. This increases everytime you add a new segment e.g. 00000001.2)
Example 3: Using a bit of both(adding a new column)
======================= ==========================
ID(PK) Road_Number(UK) ID(PK) Segment_Number(UK)
======================= ==========================
1 "RD00000001WCK" 1 "00000001.1"
Just a bit of background information, we will be using the Road Number and Segment Number in reports and other documents, so they have to be unique.
I have always liked keeping things simple so I prefer example 1, but I have been reading that you should not expose your primary keys in reports/documents. So now I'm thinking more along the lines of example 3.
I am also leaning towards example 3 because if we decide to change how our asset numbering is generated it won't have to do cascade updates on a primary key.
What do you think we should do?
Thanks.
EDIT: Thanks everyone for the great answers, has help me a lot.
This is really a discussion about surrogate (also called technical or synthetic) vs natural primary keys, a subject that has been extensively covered. I covered this in Database Development Mistakes Made by AppDevelopers.
Natural keys are keys based on
externally meaningful data that is
(ostensibly) unique. Common examples
are product codes, two-letter state
codes (US), social security numbers
and so on. Surrogate or technical
primary keys are those that have
absolutely no meaning outside the
system. They are invented purely for
identifying the entity and are
typically auto-incrementing fields
(SQL Server, MySQL, others) or
sequences (most notably Oracle).
In my opinion you should always
use surrogate keys. This issue has
come up in these questions:
How do you like your primary keys?
What’s the best practice for Primary Keys in tables?
Which format of primary key would you use in this situation.
Surrogate Vs. Natural/Business Keys
Should I have a dedicated primary key field?
Auto number fields are the way to go. If your keys have meaning outside your database (like asset numbers) those will quite possibly change and changing keys is problematic. Just use indexes for those things into the relevant tables.
I would personally say keep it simple and stay with an autoincremented primary key. If you need something more "Readable" in terms of display in the program, then possibly one of your other ideas, but I think that is just adding unneeded complexity to the primary key field.
I'm also very strongly in the "don't use primary keys as meaningful data" camp. Every time I have contravened that policy it has ended in tears. Sooner or later the meaningful data needs to change and if that means you have to change a primary key it can get painful. The primary key will probably be used in foreign key constraints and you can spend ages trying to sort it all out just to make a simple data change.
I always use GUIDs/UUIDs for my primary keys in every table I ever create but that's just personal preference serials or such are also good.
Don't put meaning into your PK fields unless...
It is 100% completely impossible that
the value will never change and that
No two people would ever reasonably
argue about which value should be
used for a particular row.
Go with option one and format the value in the app to look like option two or three when it is displayed.
I think the important thing to remember here is that each table in your database/design might have multiple keys. These are the Candidate Keys.
See wikipedia entry for Candidate Keys
By definition, all Candidate Keys are created equal. They are each unique identifiers for the table in question.
Your job then is to select the best candidate from the pool of Candidate Keys to serve as the Primary Key. The Primary Key will be used by other tables to establish the relational constraints, but you are free to continue using Candidate Keys to query the table.
Because Primary Keys are referenced by other structures, and therefore used in join operations, the criteria for Primary Key selection boils down to the following for me (in order of importance):
Immutable/Stable - Primary Key values should not change. If they do, you run the risk of introducing update anomolies
Not Null - most DBMS platforms require that the Primary Key attribute(s) are not null
Simple - simple datatypes and values for physical storage and performance. Integer values work well here, and this is the datatype of choice for most surrogate/auto-gen keys
Once you've identified the Candidate Keys, the criteria above can be used to select the Primary Key. If there is not a "Natural" Candidate Key meets the criteria, then a Surrogate Key that does meet the criteria can be created and used as mentioned in other answers.
Follow the Don't Use policy.
Some problems you can run into:
You need to generate keys from more than one host.
Someone will want to reserve contiguous numbers to use together.
How meaningful will people want it to be? Wars are fought over this, and you're in the first skirmish of one already. "It's already meaningful, and if we just add two more digits we can ..." i.e. you're establishing a design style that will (should) be extensible.
If you are concatenating the two, you're doing typecasts which can mess up your query Optimizer.
You'll need to reclassify roads, and redefine their boundaries (i.e. move the roads), which implies changing the primary key and maybe losing links.
There are workarounds for all this, but this is the kind of issue where workarounds proliferate and get out of control. And it doesn't take more than a couple to get beyond "Simple".
As mentioned before, keep your internal primary keys as just keys, whatever the most optimal datatype is on your platform.
However you do need to let the numbering system argument be fought out, as this is actually a business requirement, and perhaps let's call it an identification system for the asset.
If there is only going to be one identifier, then add it as a column to the main table. If there are likely to be many identification systems (and assets usually have many), you'll need two more tables
Identifier-type table Identifier-cross-ref table
type-id ------------> type-id (unique
type-name identifier-string key)
internal-id
That way different people who need to access the asset can identify in their own way. For example the server team will identify a server differently from the network team and different again from project management, accounts, etc.
Plus, you get to go to all the meetings where everyone argues with each other.
Another thing to keep in mind is that if you're importing alot of data into this system, you may find out that things like Road_Number are not as unique as you thought, and there may be operational roadblocks to fixing the problem (repainting road signs, etc.) .
While natural keys may have great meaning to the business users, if you do not have the agreement that those keys are sacred and should not be altered, you will more than likely be pulling your hair out while maintaining a database where the "product codes have to be changed to accommodate the new product line the company acquired." You need to protect the RI of your data, and integers as primary keys with auto-increment are the best way to go. Performance is also better when indexing and traversing integers than char columns.
While not appropriate as primary keys, natural keys are very appropriate for user consumption and you can enforce uniques via an index. They bring a context to the data that will make it easier for all parties to understand. Also, in the advent that you need to reload data, the natural keys can help verify that your lookups are still valid.
I would go with the surrogate key, but you may want to have a computed column that "formats" the surrogate key into a more "readable" value if that improves your reporting. The computed colum could produce example 2 from the surrogate key for instance for display purposes.
I think the surrogate key route is the way to go and the only exceptions that I make for it are join tables, where the primary key could be composed of the foreign key references. Even in these cases I'm finding that having a surrogate primary key is more useful than not.
I suspect that you really should use option #3, as many here have already said. Surrogate PKs (either Integers or GUIDs) are good practice, even if there are adequate business keys. Surrogates will reduce maintenance headaches (as you yourself have already noted).
That being said, something you may want to consider is whether or not your database is:
focused on data maintenance and transactional processing (i.e. Create/Update/Delete operations)
geared towards analysis and reporting (i.e. Queries)
In other words, are the users concerned with maintaining active data or querying largely static data to find answers?
If you are heavily focused on building an analysis and reporting DB (e.g. a data warehouse/mart) that is exposed to technical business users (e.g. report designers) who have a good grasp of the business vocabulary, then you might want to consider using natural keys based on meaningful business values. They help reduce query complexity by eliminating the need for complex joins and help the user focus on their task, not fighting the database structure.
Otherwise you're probably focused on a full CRUD DB that has to cover all the bases to some degree - this is the vast majority of situations. In which case, go with your option #3. You can always optimize for queryability in the future but you'll be hard pressed to retrofit for maintainability.
I hope you will agree with me that every design element should have single purpose.
Question is what do you think is purpose of PK? If it is to identify unique record in a table, then surrogate keys wins without much trouble. This is simple and straight.
As far as new columns in option 3 are concerned, you should check if these can be calculated (best would be to do calculation in model layer so that they can be changed easily than if calculation done in RDBMS) without too much of performance penalty from other elements. For example, you can store segment number and road number in corresponding tables and then use them to generate "00000001.1". This will allow to change asset numbering on-the-fly.
First off, option 2 is the absolute worst option. As an Index, it's a string, and that makes it slow. And it's generated based on business rules - which can change and cause a rather large headache.
Personally, I always use a separate primary key column; and I always use a GUID. Some developers prefer a simple INT over a GUID for reasons of hard-drive space. However, if the situation arises where you need to merge two databases, GUIDs will almost never collide (whereas INTs are guaranteed to collide).
Primary Keys should NEVER be seen by the user. Making it readable to the user should not be a concern. Primary Keys SHOULD be used to link with Foreign Keys. This is their purpose. The value should be machine readable and, once created, never changed.

Surrogate vs. natural/business keys [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
Here we go again, the old argument still arises...
Would we better have a business key as a primary key, or would we rather have a surrogate id (i.e. an SQL Server identity) with a unique constraint on the business key field?
Please, provide examples or proof to support your theory.
Just a few reasons for using surrogate keys:
Stability: Changing a key because of a business or natural need will negatively affect related tables. Surrogate keys rarely, if ever, need to be changed because there is no meaning tied to the value.
Convention: Allows you to have a standardized Primary Key column naming convention rather than having to think about how to join tables with various names for their PKs.
Speed: Depending on the PK value and type, a surrogate key of an integer may be smaller, faster to index and search.
Both. Have your cake and eat it.
Remember there is nothing special about a primary key, except that it is labelled as such. It is nothing more than a NOT NULL UNIQUE constraint, and a table can have more than one.
If you use a surrogate key, you still want a business key to ensure uniqueness according to the business rules.
It appears that no one has yet said anything in support of non-surrogate (I hesitate to say "natural") keys. So here goes...
A disadvantage of surrogate keys is that they are meaningless (cited as an advantage by some, but...). This sometimes forces you to join a lot more tables into your query than should really be necessary. Compare:
select sum(t.hours)
from timesheets t
where t.dept_code = 'HR'
and t.status = 'VALID'
and t.project_code = 'MYPROJECT'
and t.task = 'BUILD';
against:
select sum(t.hours)
from timesheets t
join departents d on d.dept_id = t.dept_id
join timesheet_statuses s on s.status_id = t.status_id
join projects p on p.project_id = t.project_id
join tasks k on k.task_id = t.task_id
where d.dept_code = 'HR'
and s.status = 'VALID'
and p.project_code = 'MYPROJECT'
and k.task_code = 'BUILD';
Unless anyone seriously thinks the following is a good idea?:
select sum(t.hours)
from timesheets t
where t.dept_id = 34394
and t.status_id = 89
and t.project_id = 1253
and t.task_id = 77;
"But" someone will say, "what happens when the code for MYPROJECT or VALID or HR changes?" To which my answer would be: "why would you need to change it?" These aren't "natural" keys in the sense that some outside body is going to legislate that henceforth 'VALID' should be re-coded as 'GOOD'. Only a small percentage of "natural" keys really fall into that category - SSN and Zip code being the usual examples. I would definitely use a meaningless numeric key for tables like Person, Address - but not for everything, which for some reason most people here seem to advocate.
See also: my answer to another question
Surrogate key will NEVER have a reason to change. I cannot say the same about the natural keys. Last names, emails, ISBN nubmers - they all can change one day.
Surrogate keys (typically integers) have the added-value of making your table relations faster, and more economic in storage and update speed (even better, foreign keys do not need to be updated when using surrogate keys, in contrast with business key fields, that do change now and then).
A table's primary key should be used for identifying uniquely the row, mainly for join purposes. Think a Persons table: names can change, and they're not guaranteed unique.
Think Companies: you're a happy Merkin company doing business with other companies in Merkia. You are clever enough not to use the company name as the primary key, so you use Merkia's government's unique company ID in its entirety of 10 alphanumeric characters.
Then Merkia changes the company IDs because they thought it would be a good idea. It's ok, you use your db engine's cascaded updates feature, for a change that shouldn't involve you in the first place. Later on, your business expands, and now you work with a company in Freedonia. Freedonian company id are up to 16 characters. You need to enlarge the company id primary key (also the foreign key fields in Orders, Issues, MoneyTransfers etc), adding a Country field in the primary key (also in the foreign keys). Ouch! Civil war in Freedonia, it's split in three countries. The country name of your associate should be changed to the new one; cascaded updates to the rescue. BTW, what's your primary key? (Country, CompanyID) or (CompanyID, Country)? The latter helps joins, the former avoids another index (or perhaps many, should you want your Orders grouped by country too).
All these are not proof, but an indication that a surrogate key to uniquely identify a row for all uses, including join operations, is preferable to a business key.
I hate surrogate keys in general. They should only be used when there is no quality natural key available. It is rather absurd when you think about it, to think that adding meaningless data to your table could make things better.
Here are my reasons:
When using natural keys, tables are clustered in the way that they are most often searched thus making queries faster.
When using surrogate keys you must add unique indexes on logical key columns. You still need to prevent logical duplicate data. For example, you can’t allow two Organizations with the same name in your Organization table even though the pk is a surrogate id column.
When surrogate keys are used as the primary key it is much less clear what the natural primary keys are. When developing you want to know what set of columns make the table unique.
In one to many relationship chains, the logical key chains. So for example, Organizations have many Accounts and Accounts have many Invoices. So the logical-key of Organization is OrgName. The logical-key of Accounts is OrgName, AccountID. The logical-key of Invoice is OrgName, AccountID, InvoiceNumber.
When surrogate keys are used, the key chains are truncated by only having a foreign key to the immediate parent. For example, the Invoice table does not have an OrgName column. It only has a column for the AccountID. If you want to search for invoices for a given organization, then you will need to join the Organization, Account, and Invoice tables. If you use logical keys, then you could Query the Organization table directly.
Storing surrogate key values of lookup tables causes tables to be filled with meaningless integers. To view the data, complex views must be created that join to all of the lookup tables. A lookup table is meant to hold a set of acceptable values for a column. It should not be codified by storing an integer surrogate key instead. There is nothing in the normalization rules that suggest that you should store a surrogate integer instead of the value itself.
I have three different database books. Not one of them shows using surrogate keys.
I want to share my experience with you on this endless war :D on natural vs surrogate key dilemma. I think that both surrogate keys (artificial auto-generated ones) and natural keys (composed of column(s) with domain meaning) have pros and cons. So depending on your situation, it might be more relevant to choose one method or the other.
As it seems that many people present surrogate keys as the almost perfect solution and natural keys as the plague, I will focus on the other point of view's arguments:
Disadvantages of surrogate keys
Surrogate keys are:
Source of performance problems:
They are usually implemented using auto-incremented columns which mean:
A round-trip to the database each time you want to get a new Id (I know that this can be improved using caching or [seq]hilo alike algorithms but still those methods have their own drawbacks).
If one-day you need to move your data from one schema to another (It happens quite regularly in my company at least) then you might encounter Id collision problems. And Yes I know that you can use UUIDs but those lasts requires 32 hexadecimal digits! (If you care about database size then it can be an issue).
If you are using one sequence for all your surrogate keys then - for sure - you will end up with contention on your database.
Error prone. A sequence has a max_value limit so - as a developer - you have to put attention to the following points:
You must cycle your sequence ( when the max-value is reached it goes back to 1,2,...).
If you are using the sequence as an ordering (over time) of your data then you must handle the case of cycling (column with Id 1 might be newer than row with Id max-value - 1).
Make sure that your code (and even your client interfaces which should not happen as it supposed to be an internal Id) supports 32b/64b integers that you used to store your sequence values.
They don't guarantee non duplicated data. You can always have 2 rows with all the same column values but with a different generated value. For me this is THE problem of surrogate keys from a database design point of view.
More in Wikipedia...
Myths on natural keys
Composite keys are less inefficient than surrogate keys. No! It depends on the used database engine:
Oracle
MySQL
Natural keys don't exist in real-life. Sorry but they do exist! In aviation industry, for example, the following tuple will be always unique regarding a given scheduled flight (airline, departureDate, flightNumber, operationalSuffix). More generally, when a set of business data is guaranteed to be unique by a given standard then this set of data is a [good] natural key candidate.
Natural keys "pollute the schema" of child tables. For me this is more a feeling than a real problem. Having a 4 columns primary-key of 2 bytes each might be more efficient than a single column of 11 bytes. Besides, the 4 columns can be used to query the child table directly (by using the 4 columns in a where clause) without joining to the parent table.
Conclusion
Use natural keys when it is relevant to do so and use surrogate keys when it is better to use them.
Hope that this helped someone!
Alway use a key that has no business meaning. It's just good practice.
EDIT: I was trying to find a link to it online, but I couldn't. However in 'Patterns of Enterprise Archtecture' [Fowler] it has a good explanation of why you shouldn't use anything other than a key with no meaning other than being a key. It boils down to the fact that it should have one job and one job only.
Surrogate keys are quite handy if you plan to use an ORM tool to handle/generate your data classes. While you can use composite keys with some of the more advanced mappers (read: hibernate), it adds some complexity to your code.
(Of course, database purists will argue that even the notion of a surrogate key is an abomination.)
I'm a fan of using uids for surrogate keys when suitable. The major win with them is that you know the key in advance e.g. you can create an instance of a class with the ID already set and guaranteed to be unique whereas with, say, an integer key you'll need to default to 0 or -1 and update to an appropriate value when you save/update.
UIDs have penalties in terms of lookup and join speed though so it depends on the application in question as to whether they're desirable.
Using a surrogate key is better in my opinion as there is zero chance of it changing. Almost anything I can think of which you might use as a natural key could change (disclaimer: not always true, but commonly).
An example might be a DB of cars - on first glance, you might think that the licence plate could be used as the key. But these could be changed so that'd be a bad idea. You wouldnt really want to find that out after releasing the app, when someone comes to you wanting to know why they can't change their number plate to their shiny new personalised one.
Always use a single column, surrogate key if at all possible. This makes joins as well as inserts/updates/deletes much cleaner because you're only responsible for tracking a single piece of information to maintain the record.
Then, as needed, stack your business keys as unique contraints or indexes. This will keep you data integrity intact.
Business logic/natural keys can change, but the phisical key of a table should NEVER change.
Case 1: Your table is a lookup table with less than 50 records (50 types)
In this case, use manually named keys, according to the meaning of each record.
For Example:
Table: JOB with 50 records
CODE (primary key) NAME DESCRIPTION
PRG PROGRAMMER A programmer is writing code
MNG MANAGER A manager is doing whatever
CLN CLEANER A cleaner cleans
...............
joined with
Table: PEOPLE with 100000 inserts
foreign key JOBCODE in table PEOPLE
looks at
primary key CODE in table JOB
Case 2: Your table is a table with thousands of records
Use surrogate/autoincrement keys.
For Example:
Table: ASSIGNMENT with 1000000 records
joined with
Table: PEOPLE with 100000 records
foreign key PEOPLEID in table ASSIGNMENT
looks at
primary key ID in table PEOPLE (autoincrement)
In the first case:
You can select all programmers in table PEOPLE without use of join with table JOB, but just with: SELECT * FROM PEOPLE WHERE JOBCODE = 'PRG'
In the second case:
Your database queries are faster because your primary key is an integer
You don't need to bother yourself with finding the next unique key because the database itself gives you the next autoincrement.
Surrogate keys can be useful when business information can change or be identical. Business names don't have to be unique across the country, after all. Suppose you deal with two businesses named Smith Electronics, one in Kansas and one in Michigan. You can distinguish them by address, but that'll change. Even the state can change; what if Smith Electronics of Kansas City, Kansas moves across the river to Kansas City, Missouri? There's no obvious way of keeping these businesses distinct with natural key information, so a surrogate key is very useful.
Think of the surrogate key like an ISBN number. Usually, you identify a book by title and author. However, I've got two books titled "Pearl Harbor" by H. P. Willmott, and they're definitely different books, not just different editions. In a case like that, I could refer to the looks of the books, or the earlier versus the later, but it's just as well I have the ISBN to fall back on.
On a datawarehouse scenario I believe is better to follow the surrogate key path. Two reasons:
You are independent of the source system, and changes there --such as a data type change-- won't affect you.
Your DW will need less physical space since you will use only integer data types for your surrogate keys. Also your indexes will work better.
As a reminder it is not good practice to place clustered indices on random surrogate keys i.e. GUIDs that read XY8D7-DFD8S, as they SQL Server has no ability to physically sort these data. You should instead place unique indices on these data, though it may be also beneficial to simply run SQL profiler for the main table operations and then place those data into the Database Engine Tuning Advisor.
See thread # http://social.msdn.microsoft.com/Forums/en-us/sqlgetstarted/thread/27bd9c77-ec31-44f1-ab7f-bd2cb13129be
This is one of those cases where a surrogate key pretty much always makes sense. There are cases where you either choose what's best for the database or what's best for your object model, but in both cases, using a meaningless key or GUID is a better idea. It makes indexing easier and faster, and it is an identity for your object that doesn't change.
In the case of point in time database it is best to have combination of surrogate and natural keys. e.g. you need to track a member information for a club. Some attributes of a member never change. e.g Date of Birth but name can change.
So create a Member table with a member_id surrogate key and have a column for DOB.
Create another table called person name and have columns for member_id, member_fname, member_lname, date_updated. In this table the natural key would be member_id + date_updated.
Horse for courses. To state my bias; I'm a developer first, so I'm mainly concerned with giving the users a working application.
I've worked on systems with natural keys, and had to spend a lot of time making sure that value changes would ripple through.
I've worked on systems with only surrogate keys, and the only drawback has been a lack of denormalised data for partitioning.
Most traditional PL/SQL developers I have worked with didn't like surrogate keys because of the number of tables per join, but our test and production databases never raised a sweat; the extra joins didn't affect the application performance. With database dialects that don't support clauses like "X inner join Y on X.a = Y.b", or developers who don't use that syntax, the extra joins for surrogate keys do make the queries harder to read, and longer to type and check: see #Tony Andrews post. But if you use an ORM or any other SQL-generation framework you won't notice it. Touch-typing also mitigates.
Maybe not completely relevant to this topic, but a headache I have dealing with surrogate keys. Oracle pre-delivered analytics creates auto-generated SKs on all of its dimension tables in the warehouse, and it also stores those on the facts. So, anytime they (dimensions) need to be reloaded as new columns are added or need to be populated for all items in the dimension, the SKs assigned during the update makes the SKs out of sync with the original values stored to the fact, forcing a complete reload of all fact tables that join to it. I would prefer that even if the SK was a meaningless number, there would be some way that it could not change for original/old records. As many know, out-of-the box rarely serves an organization's needs, and we have to customize constantly. We now have 3yrs worth of data in our warehouse, and complete reloads from the Oracle Financial systems are very large. So in my case, they are not generated from data entry, but added in a warehouse to help reporting performance. I get it, but ours do change, and it's a nightmare.

Resources