What's the difference between a primary key and an RRN?
A primary key uniquely and unambiguously identifies a given record (in a database table/view) or a given row (in a text file). Although it can be convenient for the primary key to be based on a single Field (a single "column"), it is also possible for a primary key to be based on several Fields/Columns.
RRN is an acronym which can be understood either as "Record Row Number" or as "Relative Row Number". The Record Row Number is generally understood to be a number, typically (but not necessarily) assigned by simple increment (based on the value of the previously assigned RRN), which is "added" to the other Fields/Columns of a particular record type. Many DBMSes supply features to support such "auto-incremented" or, more generally, automatically assigned RRNs.
Defined as above, an RRN can be used as a Primary Key.
There are many advantages (and drawbacks) to having a [semantically void] RRN as opposed to a primary key based on [one or several] attribute (Field or Column) values of the record. This is probably discussed in other SO questions; here are a few of the most common arguments:
A primary key may be modified, an RRN is "immutable".
For example, if the primary key is a Social Security Number (SSN), a record may at some time be updated because the SSN was originally entered with a typo. When/if that happens, any related records which use this SSN to refer to the updated record also need to be updated. Had these related records used the [non-significant] RRN, they would be immune to possible changes of the SSN value.
When there is no "natural" Primary key based on a single column, it may be more convenient to use the RRN
RRNs are typically shorter
Related tables and lists which refer to original records by way of a non-RRN primary key, somehow duplicate the underlying information. This can be both an advantage and a drawback: one can know the underlying field value without having to look it up in the original table: good if you want the related table to contain such info, bad if you don't (ex: Social Security Number can be considered sensitive etc.)
RRNs are guaranteed to be unique (short of a bug in the RRN-generation logic), whereas attribute-value based keys have a propensity to turn out non-unique ("oops! We thought we could use phone number as a house ID; dang!, the phone company started reusing numbers...")
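The "immutable RRN" argument above can be sketched with a small, runnable example. This is only an illustration using Python's built-in sqlite3 module; the tables `person` and `claim` and their columns are made-up names, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        rrn  INTEGER PRIMARY KEY,   -- surrogate key, assigned by the DBMS
        ssn  TEXT NOT NULL UNIQUE,  -- natural attribute that may need correcting
        name TEXT NOT NULL
    );
    CREATE TABLE claim (
        claim_id   INTEGER PRIMARY KEY,
        person_rrn INTEGER NOT NULL REFERENCES person(rrn),  -- refers to the RRN, not the SSN
        amount     INTEGER NOT NULL
    );
""")

# Insert a person whose SSN was typed with a wrong last digit.
cur = conn.execute("INSERT INTO person (ssn, name) VALUES (?, ?)", ("123-45-6780", "Ann"))
ann_rrn = cur.lastrowid
conn.execute("INSERT INTO claim (person_rrn, amount) VALUES (?, ?)", (ann_rrn, 250))

# Correct the typo: only the person row changes, the claim row is untouched.
conn.execute("UPDATE person SET ssn = ? WHERE rrn = ?", ("123-45-6789", ann_rrn))

row = conn.execute(
    "SELECT p.ssn, c.amount FROM claim c JOIN person p ON p.rrn = c.person_rrn"
).fetchone()
```

Had `claim` stored the SSN itself, the correction would have required updating every referring row as well.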
A Primary key identifies a row in a table.
An RRN (I presume you mean Relative Record Number) also identifies a row, but by its position in a subset (i.e. a query result).
I've found it useful if you need to extrapolate a sequential order for a set of records not related to the primary key.
In a given table, if there is no primary key and it is impossible to create even a composite primary key, what is the normal form of that table?
If it is zero (0NF), will adding a new column and making it the primary key convert this table to 1NF?
Normal forms apply to relations, which are mathematical structures. Tables can be used to represent relations, but this requires some rules to ensure that the table doesn't contain more or less information than the corresponding relation.
In order for a table to represent a relation:
all rows and columns must be unique
the order they're in mustn't matter
all significant information must be represented as values in cells (i.e. fonts, highlighting, etc, mustn't matter)
every cell must contain one value (doesn't matter how simple or complex that value is)
Also, the relational model cares about candidate keys, not primary keys. A relation can have multiple candidate keys. A primary key is just a selected candidate key that is used by some disciplines (e.g. the entity-relationship model) or by some database management systems (e.g. for physical record ordering).
With all that said, I can now answer your question. If your table follows the rules and specifically the rows are all unique, then there will be at least one candidate key, on all the columns together at worst. If your table's rows aren't unique, then the table doesn't represent a relation and the normal forms don't apply. A surrogate key (like an auto-increment column) can be added to identify rows uniquely, but that isn't necessarily sufficient on its own to make a table represent a relation (1NF).
BTW, I suggest you avoid using "0NF" or "UNF". Non-relational tables don't have a level of normalization, so attaching any kind of "NF" to them is misleading.
As long as you are talking about tables, there is one further case that needs to be covered. It's the case of duplicate rows.
Duplicate rows are rows that are identical in appearance but not in row number. Such a table cannot have a primary key. Sometimes duplicate rows represent the same information. Sometimes not.
For example, consider a table with just four columns: customerid, productid, quantity, price. If a customer orders the same product twice, we'll have two identical rows representing different information. This is not good.
Note that the corresponding thing cannot happen with relations. If two tuples in a relation have the same appearance, then they are the same tuple.
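A small sketch of the duplicate-row situation described above, using Python's built-in sqlite3 module. The table name `order_line` and the sample values are assumptions for illustration; the price is stored in cents to keep the example free of floating point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No key of any kind is declared, so identical rows are accepted.
conn.execute(
    "CREATE TABLE order_line (customerid INTEGER, productid INTEGER, "
    "quantity INTEGER, price INTEGER)"
)
# The same customer orders the same product twice.
conn.executemany(
    "INSERT INTO order_line VALUES (?, ?, ?, ?)",
    [(7, 42, 1, 999), (7, 42, 1, 999)],
)

# The table holds two rows, but viewed as a set of tuples there is only one.
total_rows = conn.execute("SELECT COUNT(*) FROM order_line").fetchone()[0]
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM order_line)"
).fetchone()[0]
```

The gap between `total_rows` and `distinct_rows` is exactly the information the table carries that a relation could not: nothing in the data says whether these are two orders or one order entered twice.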
As to the other points, they are covered by excellent earlier answers.
Before you check a table for normalization, it must have a primary key (the primary key plays the lead role in a relational DB, ...).
1NF says that all of your table's attributes must be single-valued.
Answer to Question 1: In a given table, if there is no primary key and it is impossible to create even a composite primary key, what is the normal form of that table?
Answer: If the relation has no primary key and it is impossible to create a composite primary key (as I read your question, even combining all the columns of a row into a candidate key would not identify rows uniquely, because there are duplicate rows), then the table is not in any normal form.
Answer to Question 2:
If you add a column holding unique values, and every cell contains only one value, then the table is in 1NF.
If you still need clarification, ask in the comments.
0NF is not a form of normalization. Refer to C.J. Date or Henry Korth (database management system book).
Hope this helps.
I've run into a problem where I need to regularly insert records into a table, but the integer primary key column is not an identity column. If it was, inserting records and having them auto-increment to maintain uniqueness would be easy. However, I can't make the primary key column an identity column without causing errors in an application that is still used to sometimes accomplish what I'm doing. Is there a reason why you would want an integer primary key and not have that column as an identity column also? I'm a little new to this and just wondering why someone would structure a table this way.
Edit to add: I've done some Googling and research, and I understand their differences and purposes, but I can't find anything on why you would not want to use them together in this particular instance and even create your table/application in such a way that you couldn't.
It seems like bad database design to me. Tables should have a primary key that you can use for searching and sorting, for example a username as primary key;
an integer primary key without auto-increment is bad design.
Is there a reason to have an integer primary key column not also an Identity column?
If by "identity" you mean (auto-increment as) surrogate, i.e. made up by the DBMS, then yes:
-- ticket #N is held by person P
lotto(N, P) -- PK(N)
A surrogate is just a name/identifier ("identifying" in the everyday sense) for something that got picked arbitrarily by the DBMS, eg user "3508218", rather than not, eg "asp8811" or "eighty-eight" or "Texas". Note that they are surrogates ("meaningless") in the system that exists outside the DBMS. (Although some people don't call such a name/identifier generated by a system a surrogate if it's visible outside the system.)
PK/UNIQUE just says that the subrow values for a column set are unique in a table. Here N does have a "meaning" ie it is a thing's name/identifier that the DBMS does not have control over picking. In fact if a ticket can be held by only one person then P is also a candidate key (PK/UNIQUE) whether or not the value (of whatever kind) that names/identifies people is a surrogate.
Every PK/UNIQUE or superset in every base table & query result names/identifies things of some kind. Ie any column set on any types can name/identify things (be 1:1 or M:1 with them) whether or not it is a candidate key (PK/UNIQUE). So integer (and any other type or type set) primary key (and UNIQUE) columns (and column sets) (and supersets) are all over the place naming/identifying without being surrogates and whether or not they are candidate keys.
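To make the lotto(N, P) case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The ticket numbers and holder names are invented; the point is only that N is an integer primary key whose values come from outside the DBMS, so an identity/auto-increment column would be the wrong tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# n is the number printed on the ticket: an integer key, but not a surrogate.
conn.execute("CREATE TABLE lotto (n INTEGER PRIMARY KEY, p TEXT NOT NULL)")

# The application supplies the ticket number; the DBMS does not pick it.
conn.execute("INSERT INTO lotto (n, p) VALUES (?, ?)", (481516, "alice"))
conn.execute("INSERT INTO lotto (n, p) VALUES (?, ?)", (234200, "bob"))

# The primary key still does its job: the same ticket cannot be recorded twice.
try:
    conn.execute("INSERT INTO lotto (n, p) VALUES (?, ?)", (481516, "carol"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

holder = conn.execute("SELECT p FROM lotto WHERE n = ?", (481516,)).fetchone()[0]
```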
In my database I have a list of users with information about them, and I also have a feature which allows a user to add other users to a shortlist. My user information is stored in one table with a primary key of the user id, and I have another table for the shortlist. The shortlist table is designed so that it has two columns and is basically just a list of pairs of names. So to find the shortlist for a particular user you retrieve all names from the second column where the id in the first column is a particular value.
The issue is that according to many sources, such as "Should each and every table have a primary key?", you should have a primary key in every table of the database.
According to this source, http://www.w3schools.com/sql/sql_primarykey.asp, a primary key is one which uniquely identifies an entry in a database. So my question is:
What is wrong with the table in my database? Why does it need a primary key?
How should I give it a primary key? Just create a new auto-incrementing column so that each entry has a unique id? There doesn't seem much point for this. Or would I somehow encapsulate the multiple entries that represent a shortlist into another entity in another table and link that in? I'm really confused.
If the rows are unique, you can have a two-column primary key, although maybe that's database dependent. Here's an example:
CREATE TABLE my_table
(
col_1 int NOT NULL,
col_2 varchar(255) NOT NULL,
CONSTRAINT pk_cols12 PRIMARY KEY (col_1,col_2)
)
If you already have the table, the example would be:
ALTER TABLE my_table
ADD CONSTRAINT pk_cols12 PRIMARY KEY (col_1,col_2)
Primary keys must identify each record uniquely and, as was mentioned before, primary keys can consist of multiple attributes (one or more columns). First, I'd recommend making sure each record really is unique in your table. Secondly, as I understand it, you left the table without a primary key, and that is not allowed, so yes, you will need to set a key for it.
In this particular case, there is no purpose in the same pair of user IDs being stored more than once in the shortlist table. After all, that table models a set, and an element is either in the set or isn't. Having an element "twice" in the set makes no sense [1]. To prevent that, create a composite key consisting of these two user ID fields.
Whether this composite key will also be primary, or you'll have another key (that would act as surrogate primary key) is another matter, but either way you'll need this composite key.
Please note that under databases that support clustering (aka. index-organized tables), PK is often also a clustering key, which may have significant repercussions on performance.
[1] Unlike in a multiset.
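A sketch of the shortlist table with a composite key, written with Python's built-in sqlite3 module so it can be run as-is. The names `users`, `shortlist`, `owner_id` and `listed_id` are assumptions, since the question does not give the actual schema, and the sketch stores user IDs in both columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE shortlist (
        owner_id  INTEGER NOT NULL REFERENCES users(user_id),
        listed_id INTEGER NOT NULL REFERENCES users(user_id),
        PRIMARY KEY (owner_id, listed_id)   -- each pair may appear only once
    );
""")
conn.executemany("INSERT INTO users (user_id, name) VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cy")])
conn.executemany("INSERT INTO shortlist (owner_id, listed_id) VALUES (?, ?)",
                 [(1, 2), (1, 3)])

# Adding Bob to Ann's shortlist a second time violates the composite key.
try:
    conn.execute("INSERT INTO shortlist (owner_id, listed_id) VALUES (1, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Retrieve the shortlist for one user.
listed = [r[0] for r in conn.execute(
    "SELECT u.name FROM shortlist s JOIN users u ON u.user_id = s.listed_id "
    "WHERE s.owner_id = ? ORDER BY u.name", (1,))]
```

No extra auto-incrementing column is needed here; the pair itself is the key.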
A table with duplicate rows is not an adequate representation of a relation. It's a bag of rows, not a set of rows. If you let this happen, you'll eventually find that your counts will be off, your sums will be off, and your averages will be off. In short, you'll get confusing errors out of your data when you go to use it.
Declaring a primary key is a convenient way of preventing duplicate rows from getting into the database, even if one of the application programs makes a mistake. The index you obtain is a side effect.
Foreign key references to a single row in a table could be made by referencing any candidate key. However, it's much more convenient if you declare one of those candidate keys as a primary key, and then make all foreign key references refer to the primary key. It's just careful data management.
The one-to-one correspondence between entities in the real world and corresponding rows in the table for that entity is beyond the realm of the DBMS. It's up to your applications and even your data providers to maintain that correspondence by not inventing new rows for existing entities and not letting some new entities slip through the cracks.
Well, since you are asking: it's good practice, but in a few instances (no joins needed to the data) it may not be absolutely required. The biggest problem, though, is that you never really know if requirements will change, so you really want one now rather than adding one to a 10-million-record table after the fact.
In addition to a primary key (which can span multiple columns btw) I think it is good practice to have a secondary candidate key which is a single field. This makes joins easier.
First some theory. You may remember the definition of a function from HS or college algebra: y = f(x), where f is a function if and only if for every x there is exactly one y. In relational math we would say that y is functionally dependent on x in this case.
The same is true of your data. Suppose we are storing check numbers, checking account numbers, and amounts. Assuming that we may have several checking accounts and that for each checking account duplicate check numbers are not allowed, then amount is functionally dependent on (account, check_number). In general you want to store data together which is functionally dependent on the same thing, with no transitive dependencies. A primary key will typically be the functional dependency you specify as the primary one. This then identifies the rest of the data in the row (because it is tied to that identifier). Think of this as the natural primary key. Where possible (i.e. not using MySQL) I like to declare the primary key to be the natural one, even if it spans across columns. This gets complicated sometimes where you may have multiple interchangeable candidate keys. For example, consider:
CREATE TABLE country (
id serial not null unique,
name text primary key,
short_name text not null unique
);
This table really could have any column be the primary key. All three are perfectly acceptable candidate keys. Suppose we have a country record (232, 'United States', 'US'). Each of these fields uniquely identifies the record so if we know one we can know the others. Each one could be defined as the primary key.
I also recommend having a second, artificial candidate key which is just a machine identifier used for linking for joins. In the above example country.id does this. This can be useful for linking other records to the country table.
An exception to needing a candidate key might be where duplicate records really are possible. For example, suppose we are tracking invoices. We may have a case where someone is invoiced independently for two items with one showing on each of two line items. These could be identical. In this case you probably want to add an artificial primary key because it allows you to join things to that record later. You might not have a need to do so now but you may in the future!
Create a composite primary key.
To read more about what a composite primary key is, visit
http://www.relationaldbdesign.com/relational-database-analysis/module2/concatenated-primary-keys.php
What is the difference between Primary key And unique Key constraint?
What's the use of it??
Both are used to denote candidate keys for a table.
You can only have one primary key for a table so would just need to pick one if you have multiple candidates.
Either can be used in Foreign Key constraints. In SQL Server the Primary Key columns cannot be nullable. Columns used in Unique Key constraints can be.
By default in SQL Server the Primary Key will become the clustered index if it is created on a heap but it is by no means mandatory that the PK and clustered index should be the same.
A primary key is one which is used to identify the row in question. It might also have some meaning beyond that (if there was already a piece of "real" data that could serve) or it may be purely an implementation artefact (most IDENTITY columns, and equivalent auto-incremented values on other database systems).
A unique key is a more general case, where a key cannot have repeated values. In most cases people cannot have the same social security numbers in relation to the same jurisdiction (an international case could differ). Hence if we were storing social security numbers, then we would want to model them as unique, as any case of them matching an existing number is clearly wrong. Usernames generally must be unique also, so here's another case. External identifiers (identifiers used by another system, standard or protocol) tend to also be unique, e.g. there is only one language that has a given ISO 639 code, so if we were storing ISO 639 codes we would model that as unique.
This uniqueness can also be across more than one column. For example, in most hierarchical categorisation systems (e.g. a folder structure) no item can have both the same parent item and the same name, though there could be other items with the same parent and different names, and others with the same name and different parents. This multi-column capability is also present on primary keys.
A table may also have more than one unique key. E.g. a user may have both an id number and a username, and both will need to be unique.
Any non-nullable unique key can therefore serve as a primary key. Sometimes primary keys that come from the innate data being modelled are referred to as "natural primary keys", because they are a "natural" part of the data, rather than just an implementation artefact. The decision as to which to use depends on a few things:
Likelihood of change of specification. If we modelled a social security number as unique and then had to adapt to allow for multiple jurisdictions where two or more use a similar enough numbering system to allow for collisions, we likely need just remove the uniqueness constraint (other changes may be needed). If it was our primary key, we now also need to use a new primary key, and change any table that was using that primary key as part of a relationship, and any query that joined on it.
Speed of look-up. Key efficiency can be important, as they are used in many WHERE clauses and (more often) in many JOINs. With JOINS in particular, speed of lookup can be vital. The impact will depend on implementation details, and different databases vary according to how they will handle different datatypes (I would have few qualms from a performance perspective in using a large piece of text as a primary key in Postgres where I could specify the use of hash joins, but I'd be very hesitant to do so in SQLServer [Edit: for "large" I'm thinking of perhaps the size of a username, not something the size of the entire Norse Eddas!]).
Frequency of the key being the only interesting data. For example, with a table of languages, and a table of pieces of comments in that language, very often the only reason I would want to join on the language table when dealing with the comments table is either to obtain the language code or to restrict a query to those with a particular language code. Other information about the language is likely to be much more rarely used. In this case while joining on the code is likely to be less efficient than joining on a numeric id set from an IDENTITY column, having the code as the primary key - and hence as what is stored in the foreign key column on the comments table - will remove the need for any JOIN at all, with a considerable efficiency gain. More often though I want more information from the relevant tables than that, so making the JOIN more efficient is more important.
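The last point can be illustrated with a short sketch using Python's built-in sqlite3 module. The tables `language` and `comment`, their columns, and the sample rows are made up for the example; the language codes are ISO 639-1 codes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE language (
        code TEXT NOT NULL PRIMARY KEY,   -- natural key: the ISO 639 code itself
        name TEXT NOT NULL
    );
    CREATE TABLE comment (
        comment_id    INTEGER PRIMARY KEY,
        language_code TEXT NOT NULL REFERENCES language(code),
        body          TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO language (code, name) VALUES (?, ?)",
                 [("en", "English"), ("is", "Icelandic")])
conn.executemany("INSERT INTO comment (language_code, body) VALUES (?, ?)",
                 [("en", "Hello"), ("is", "Halló"), ("en", "Thanks")])

# Because the foreign key column already holds the code, filtering by
# language needs no JOIN against the language table at all.
english = [r[0] for r in conn.execute(
    "SELECT body FROM comment WHERE language_code = ? ORDER BY comment_id", ("en",))]
```

With a numeric surrogate key on `language`, the same filter would need either a JOIN or a prior lookup of the id for 'en'.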
Primary key:
A primary key uniquely identifies each row in a table.
A primary key allows neither duplicate values nor NULL.
In SQL Server, a primary key is backed by a clustered index by default.
A table can have only one primary key.
Unique Key:
A unique key also uniquely identifies each row in a table.
A unique key does not allow duplicate values, but in SQL Server it allows (at most one) NULL.
In SQL Server, a unique key is backed by a non-clustered index by default.
This is a fruitful link for understanding primary keys and other database keys.
Keep in mind that a table can have only one clustered index [talking about SQL Server 2005].
If we want another column to be unique as well, we use a unique key constraint,
because a table can have more than one unique key.
A primary key is just any one candidate key. In principle primary keys are not different from any other candidate key because all keys are equal in the relational model.
SQL, however, has two different syntaxes for implementing candidate keys: the PRIMARY KEY constraint and the UNIQUE constraint (on non-nullable columns, of course). In practice they achieve exactly the same thing, except for the essentially useless restriction that a PRIMARY KEY can only be used once per table whereas a UNIQUE constraint can be used multiple times.
So there is no fundamental "use" for the PRIMARY KEY constraint. It is redundant and could easily be ignored or dropped from the language altogether. However, many people find it convenient to single out one particular key per table as having special significance. There is a very widely observed convention that keys designated with PRIMARY KEY are used for foreign key references, although this is entirely optional.
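As a rough illustration that referencing the PRIMARY KEY is a convention rather than a requirement, the sketch below declares a foreign key against a UNIQUE NOT NULL column instead. It uses Python's built-in sqlite3 module; the tables `account` and `login_event` are invented for the example, and other DBMSs may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE account (
        id       INTEGER PRIMARY KEY,     -- the key designated as "primary"
        username TEXT NOT NULL UNIQUE     -- another candidate key
    );
    CREATE TABLE login_event (
        event_id INTEGER PRIMARY KEY,
        -- The reference targets the UNIQUE key, not the PRIMARY KEY.
        username TEXT NOT NULL REFERENCES account(username)
    );
""")
conn.execute("INSERT INTO account (id, username) VALUES (1, 'alice')")
conn.execute("INSERT INTO login_event (username) VALUES ('alice')")

# A reference to a non-existent username is rejected just as it would be
# for a reference to a primary key.
try:
    conn.execute("INSERT INTO login_event (username) VALUES ('mallory')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```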
Short version:
From the point of view of database theory, there is none. Both are simply candidate keys.
In practice, most DBMSs like to have one "standard key", which can be used e.g. for deciding how to store data, and to tell tools and DB clients which is the best way to identify a record.
So distinguishing one unique key as the "primary key" is just an implementation convenience (but an important one).
Basically, I will need to combine product data from multiple vendors into a single database (it's more complex than that, of course) which has several tables that will need to be joined together for most OLTP operations.
I was going to stick with the default and use an auto-incrementing integer as the primary key, but while one vendor supplies their own "ProductiD" field, the rest do not and I would have to do a lot of manual mapping to the other tables then to load the data (as I would have to first load it into the Products table, then pull the ID out and add that along with the other information I need to the other tables).
Alternatively, I could use the product's SKU as its primary key, since the SKU is unique for a single product and all of the vendors supply a SKU in their data feeds. If I use the SKU as the PK, then I could easily load the data feeds, as everything is based off of the SKU, which is how it works in the real world. However, the SKU is alphanumeric and will probably be slightly less efficient than an integer-based key.
Any ideas on which I should look at?
This is a choice between surrogate and natural primary keys.
IMHO always favour surrogate primary keys. Primary keys shouldn't have meaning, because that meaning can change. Even country names can change, and countries can come into existence and disappear, let alone products. Changing primary keys is definitely not advised, yet it can become necessary with natural keys.
More on surrogate vs primary keys:
So surrogate keys win, right? Well, let's review and see if any of the cons of natural keys apply to surrogate keys:
Con 1: Primary key size – Surrogate keys generally don't have problems with index size since they're usually a single column of type int. That's about as small as it gets.
Con 2: Foreign key size - They don't have foreign key or foreign index size problems either for the same reason as Con 1.
Con 3: Aesthetics - Well, it's an eye of the beholder type thing, but they certainly don't involve writing as much code as with compound natural keys.
Con 4 & 5: Optionality & Applicability – Surrogate keys have no problems with people or things not wanting to or not being able to provide the data.
Con 6: Uniqueness - They are 100% guaranteed to be unique. That's a relief.
Con 7: Privacy - They have no privacy concerns should an unscrupulous person obtain them.
Con 8: Accidental Denormalization – You can't accidentally denormalize non-business data.
Con 9: Cascading Updates - Surrogate keys don't change, so no worries about how to cascade them on update.
Con 10: Varchar join speed - They're generally ints, so they're generally as fast to join over as you can get.
And there's also Surrogate Keys vs Natural Keys for Primary Key?
In all but the simplest internal situations, I recommend always going for the surrogate key.
It gives you options in the future, and protects you from unknowns.
There's no reason why additional keys, like an SKU, couldn't be made non-null to enforce them, but at least by removing your reliance on third-parties you're giving yourself the option to choose, rather than having it taken from you and enduring a painful rewrite at a later stage.
Whether you go for the auto-incremented integer or determine the next primary key yourself, there will be complications. With the auto-incremented method, you can insert the record easily and let it assign its own key, but you may have trouble identifying exactly what key your record was given (and getting the max key isn't guaranteed to return yours).
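As a hedged aside on the "which key was my record given" problem: most database APIs expose the key generated by the insert you just ran on your own connection, which avoids the unreliable "get the max key" approach. The sketch below shows this with Python's built-in sqlite3 module and an invented `product` table; other DBMSs offer their own equivalents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "sku TEXT NOT NULL UNIQUE)"
)

# Let the DBMS assign the key, then read it back from the cursor that
# performed the insert rather than querying MAX(id).
cur = conn.execute("INSERT INTO product (sku) VALUES (?)", ("ABC-123",))
new_id = cur.lastrowid

# The returned id really does identify the row just inserted.
sku = conn.execute("SELECT sku FROM product WHERE id = ?", (new_id,)).fetchone()[0]
```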
I tend to go for the self-assigned key because you have more control and, in sql server, you can retrieve your key from a central keys table and ensure nobody else gets the same key, all in one statement:
-- Increment the stored key and capture the new value in one statement.
DECLARE @Key INT
UPDATE KeyTable
WITH (ROWLOCK)
SET @Key = LastKey = LastKey + 1
WHERE KeyType = 'Product'
The table records the last key used. The sql above increments that key directly in the table and returns the new key, ensuring its uniqueness.
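For readers without SQL Server at hand, the same key-table idea can be approximated elsewhere. The sketch below is an assumption-laden analogue using Python's built-in sqlite3 module: it reuses the `KeyTable`, `KeyType` and `LastKey` names from the statement above, but the helper `next_key` is hypothetical, and it uses an explicit transaction (increment, then read) instead of the single-statement T-SQL form.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that the
# transaction boundaries below are entirely under our control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE KeyTable (KeyType TEXT PRIMARY KEY, LastKey INTEGER NOT NULL)")
conn.execute("INSERT INTO KeyTable (KeyType, LastKey) VALUES ('Product', 0)")

def next_key(conn, key_type):
    """Hypothetical helper: reserve and return the next key for key_type."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before touching the counter
    try:
        conn.execute(
            "UPDATE KeyTable SET LastKey = LastKey + 1 WHERE KeyType = ?", (key_type,))
        row = conn.execute(
            "SELECT LastKey FROM KeyTable WHERE KeyType = ?", (key_type,)).fetchone()
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    if row is None:
        raise KeyError(key_type)  # no counter row exists for this key type
    return row[0]

first = next_key(conn, "Product")
second = next_key(conn, "Product")
```

Because the increment and the read happen inside one write transaction, two callers cannot be handed the same value.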
Why you should avoid alphanumeric primary keys:
Three main problems: performance, collation and space.
Performance - there is a performance cost. Like Razzie below, I can't quote any numbers, but it is less efficient to index alphanumerics than numbers.
Collation - your developers may create the same key with different collations in different tables (it happens) which leads to constantly using the 'collate' commands when joining these tables in queries and that gets old really quickly.
Space - a nine-character SKU like David's takes nine bytes, but an integer takes only four (2 for smallint, 1 for tinyint). Even a bigint takes only 8 bytes.
The ever present danger with natural keys is that either your initial assumptions will be proven wrong now or in the future when some change is made outside your control, or at some place you'll need to reference a record where passing a meaningful field is not desired (ex. a web application that uses an employee's social security number as the primary key, and then has to use urls like /employee.php?ssn=xxxxxxx)
From my own personal experience with "unique" SKU's and vendor data feeds - are you absolutely sure they are sending you a feed with complete, unique, well formed SKUs?
I've had to personally deal with all of the following when getting feeds from vendors who have varying levels of IT and clerical competence:
Products are missing their SKU entirely ("")
Clerks have used placeholder SKUs in their database like 999999999 and 00000000 and never corrected them
Those doing the data entry or importation have confused between various product numbers, mixing up things like UPC with SCC, or even finding ways to mangle them together (I've seen SCC codes with impossible check digits at the end, because they just copied the UPC and added 01 or 10, without correcting the check digit)
For special reasons, or just incompetence, the vendor has entered the same product twice in their database (for example rev. 1 and rev. 2 of the same motherboard have the same SKU, but exist as 2 records in the vendors database and data feed because rev 2. has new features)
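Some of the problems listed above can be screened for before a feed is loaded. The following is only a rough sketch: the function `check_feed`, the `sku` field name, and the "all characters identical" placeholder heuristic are assumptions for illustration, and a real feed would need vendor-specific rules (check-digit validation, for instance, is not attempted here).

```python
from collections import Counter

def check_feed(rows):
    """Split feed rows (dicts with a 'sku' field) into (clean, problems)."""
    def norm(row):
        return (row.get("sku") or "").strip()

    counts = Counter(norm(r) for r in rows)
    clean, problems = [], []
    for r in rows:
        sku = norm(r)
        if not sku:
            problems.append((r, "missing SKU"))
        elif len(set(sku)) == 1:
            # Crude heuristic for placeholders such as 999999999 or 00000000.
            problems.append((r, "placeholder SKU"))
        elif counts[sku] > 1:
            problems.append((r, "duplicate SKU"))
        else:
            clean.append(r)
    return clean, problems

feed = [
    {"sku": "MB-1001", "name": "Motherboard rev. 1"},
    {"sku": "MB-1001", "name": "Motherboard rev. 2"},
    {"sku": "", "name": "Cable"},
    {"sku": "999999999", "name": "Fan"},
    {"sku": "PSU-77", "name": "Power supply"},
]
clean, problems = check_feed(feed)
reasons = [reason for _, reason in problems]
```

Rows that end up in `problems` are exactly the ones that would break a primary key built on the SKU.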
I'd also go with an auto-increment primary key. The performance impact of having an alphanumeric primary key is there, though I don't dare name any numbers. However, if performance is important in your application, all the more reason to go with the auto-increment primary key column.
I'd advise having an auto-incremented "meaningless" integer as primary key. Should someone come up with the idea of reorganizing product IDs, at least your DB won't get into trouble.
Pretty similar to my question a few months ago...
Should I have a dedicated primary key field?
I went with an auto-incrementing PK in the end.
Since you're dealing with data from multiple vendors outside of your control, I would use a surrogate key. You don't want to have to rearchitect your database design one day when one of them happens to send you a duplicate.
A surrogate key (auto increment INT field) will uniquely identify a row in the table. On the other hand, a Unique Natural key (productName) will prevent duplicate product data from entering the table.
With a unique Natural key field, two or more rows can never have same data.
With only a surrogate key field, rows are unique because of the auto-increment INT field, but the data in the rows will not necessarily be unique, because the surrogate key has no relation to the data.
Let's take the example of a User table: the table's natural key field (userName) will prevent the same user from registering twice, but the auto-increment INT field (userId) will not.
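The User example can be sketched as follows with Python's built-in sqlite3 module. The table names `users` and `users_no_natural_key` are invented; `userId` and `userName` follow the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate key plus a unique natural key.
    CREATE TABLE users (
        userId   INTEGER PRIMARY KEY AUTOINCREMENT,
        userName TEXT NOT NULL UNIQUE
    );
    -- Surrogate key only.
    CREATE TABLE users_no_natural_key (
        userId   INTEGER PRIMARY KEY AUTOINCREMENT,
        userName TEXT NOT NULL
    );
""")

# With the UNIQUE constraint, the second registration is rejected.
conn.execute("INSERT INTO users (userName) VALUES ('alice')")
try:
    conn.execute("INSERT INTO users (userName) VALUES ('alice')")
    second_registration_rejected = False
except sqlite3.IntegrityError:
    second_registration_rejected = True

# Without it, both rows are accepted and differ only in their generated id.
conn.execute("INSERT INTO users_no_natural_key (userName) VALUES ('alice')")
conn.execute("INSERT INTO users_no_natural_key (userName) VALUES ('alice')")
duplicates = conn.execute(
    "SELECT COUNT(*) FROM users_no_natural_key WHERE userName = 'alice'"
).fetchone()[0]
```

The two kinds of key are complementary: the surrogate gives other tables a stable reference, and the unique constraint keeps the data itself free of duplicates.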
If every product will have a SKU and the SKU is unique to each product, I don't see why you wouldn't want to use that for a possible primary key.
You could always take a hash of the SKU, which would get rid of the alphas. You'd have to code for possible collisions (which should be very rare), which is an added complication.
I'd use the hash to populate the primary key and make the initial import easy, but when using it in the DB always treat it as if it were a random number. That way the primary key will lose its meaning (and have all the advantages of an auto-incremented key), allowing flexibility in the future.
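One possible shape for the hashing idea, as a sketch only: the functions `sku_to_key` and `assign_key`, the choice of SHA-256 truncated to 63 bits, and the "try the next number" collision rule are all assumptions made for this example, not something the answer specifies.

```python
import hashlib

MASK_63 = 0x7FFFFFFFFFFFFFFF  # keep the key within a signed 64-bit integer column

def sku_to_key(sku: str) -> int:
    """Derive a numeric candidate key from an alphanumeric SKU."""
    digest = hashlib.sha256(sku.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & MASK_63

def assign_key(sku: str, key_to_sku: dict) -> int:
    """Return the key for sku, stepping past keys already held by other SKUs.

    Once a collision has been resolved this way the key depends on load
    order, which is one more reason to treat it as an opaque number.
    """
    key = sku_to_key(sku)
    while key in key_to_sku and key_to_sku[key] != sku:
        key = (key + 1) & MASK_63
    key_to_sku[key] = sku
    return key

assigned = {}
k1 = assign_key("ABC-123", assigned)
k2 = assign_key("ABC-124", assigned)
k1_again = assign_key("ABC-123", assigned)
```

Loading the same SKU again yields the same key, which is what makes the initial import from several feeds straightforward.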