Database design: primary keys and unique constraints - sql-server

I am building my third database and I am unsure how to structure primary keys and unique constraints to ensure data integrity.
The database is updated periodically when public data is released, some monthly, some annually and will have a low query load.
I have some tables where a combination of columns makes each row unique. I am aware that I can use a composite primary key but I am unsure if this is best practise as many of the articles I read on this have conflicting views.
For example, one table requires five columns for uniqueness with datatypes:
smallint
char(7)
varchar(7)
smallint
varchar(3)
I believe this would make the primary key up to 25 bytes and all of these five columns will regular be selected and often used in the where clause. One of the smallints is also a foreign key but no foreign keys will reference this table.
The only alternative I am aware of is to create a unique identifier for each row, setting this as the primary key and creating a unique constraint over the five columns.
What is the best option and if it is to use a composite primary key should it be clustered or non-clustered?
Is there a recommended maximum number of columns and bytes for a composite primary key?

I don't know what is the purpose of your database, but first advise if you can add ,id column as primary key. It'll be easier for every query.
Then If you want just be sur of unique combination of columns in the same table, the easy way is a index with is the keyword UNIQUE:
CREATE UNIQUE NONCLUSTERED INDEX [IX_COLUMN1_COLUMN2]
ON TABLE(COLUMN1 ASC, COLUMN2ASC);

Related

Primary keys, index and constrains in SQL

Please correct if im wrong. And kindly point me to articles on this concept.
When we create a primary key, in the background there is automatically a unique index, clustered index, and a not null constraint created on that coloumn.
Does this also mean that if we create a not null constraint, [clustered index or non clustered index] and unique index on a column, then that column becomes a primary key?
I want to understand the core concept/relation between primary key, index and constrains.
The primary key is the one that is declared as the "primary" key. Just having the characteristics doesn't make a key "primary". It has to be explicitly declared as such.
Different databases implement primary keys in different ways. Although primary keys are usually implemented with a clustered unique index, that is not a requirement.
The primary key is exactly what its name suggests: "primary". Any other column or group of columns can be declared both unique and not null. That does not make them primary keys. In some databases, you could even define another column or group of columns as not null, unique and clustered -- without that being the primary key.
In summary:
You can have any number of unique indexes on a table.
You can have any number of unique indexes on non-NULL columns on a table.
You can have at most one clustered index. In almost all cases, this would be the primary key. But is not required in all databases.
You can have at most one primary key. In almost all cases, this would be clustered, although that is not required in all databases.
For more detail, you should refer to the documentation of the database you are using.
If you have multiple columns comprising non-NULL, unique keys, then only one is "primary" -- that one that has been explicitly declared as primary.
Why would you have a non-clustered primary key? I can give one scenario. Imagine a database where UUIDs are the keys for rows. The company does not want to use auto-generated sequence numbers, because they provide information in the number.
However, UUIDs are remarkably bad candidates for clustered indexes, because inserts are almost never at the end. In this case, you might want to design the table with a clustered auto-generated sequential key, to speed inserts You might make this key the primary key. But, you want all foreign key references to use the UUID -- and you want all foreign key references to be to the primary key of the table.
No.
All the columns could be added with Not null and Non-clustered index and Unique But only ONE column could be PK.
And the Unique allows NULL while Primary Key does not.
You might be talking about Candidate Key, here is the ref:
https://www.techopedia.com/definition/21/candidate-key

Non nullable column with clustered unique index. Why need Primary Key?

In SQL Server, I have a non nullable column with a unique clustered index on it.
If I make this column a Primary Key the exact same index is created automatically plus
the column is recognized as a Primary Key.
I understand the abstract/semantic difference.
(Primary Key identifies the entity, while any other column with this index may not.
For example, a Person can have Email field which is Unique,Non-nullable... but can be changed)
But what bothers me is the actual difference when it comes to the DB engine itself.
What will happen if I will just create an Id column, make it non-nullable, create a unique clustered index for it, make it Identity Increment, but without the Primary Key constraint?
In what scenarios the Primary Key constraint comes into play?
(I've looked at many related questions before asking this, but all the answers I saw ended up with an abstract/theoretical explanation).
Nothing will be different really. You specify PRIMARY KEY to relay your intentions, not so that the engine does anything differently. When constructing a query plan, the optimizer will still use the uniqueness for all of its properties, and will still use the clustered index for all of its properties, regardless of whether you technically created it as a PRIMARY KEY. When creating a FOREIGN KEY, you can still reference the column(s) specified as unique (clustered or not). The difference is solely in the metadata (sys.indexes.is_primary_key) and in SSMS' representation to you (oh and the fact that you can create a unique clustered index on a NULLable column, but you can't create a PRIMARY KEY on that column).
In fact there are many cases where you want to completely separate the clustered index from the PRIMARY KEY. If you have a table where the PK is a GUID, for example, and you are typically running date range queries against the table, you are probably better off having the PK be non-clustered and have a clustered index on a naturally increasing column (the datetime column) - both to minimize page splits on heavy insert activity and also to best assist date range queries. The non-clustered index will be perfectly fine for looking up individual GUIDs. (I wanted to mention that because a lot of people think the primary key has to be clustered. Not true.)
Also interesting to note that if you create a PRIMARY KEY constraint, then create a unique clustered index with the same name using DROP_EXISTING, the is_primary_key column will still be 1 and Object Explorer will still show the index name under Keys.
Here is one scenario - a lot of code to data mapping frameworks look at the database metadata (what are the primary keys, foreign keys, etc) to determine how code is executed. For example Hibernate requires a primary key.
A typical scenario might be generating a where clause for an update.

What is the difference between a primary key and a index key

Can anyone tell me what is the difference between a primary key and index key. And when to use which?
A primary key is a special kind of index in that:
there can be only one;
it cannot be nullable; and
it must be unique.
You tend to use the primary key as the most natural unique identifier for a row (such as social security number, employee ID and so forth, although there is a school of thought that you should always use an artificial surrogate key for this).
Indexes, on the other hand, can be used for fast retrieval based on other columns. For example, an employee database may have your employee number as the primary key but it may also have an index on your last name or your department.
Both of these indexes (last name and department) would disallow NULLs (probably) and allow duplicates (almost certainly), and they would be useful to speed up queries looking for anyone with (for example) the last name 'Corleone' or working in the 'HitMan' department.
A key (minimal superkey) is a set of attributes, the values of which are unique for every tuple (every row in the table at some point in time).
An index is a performance optimisation feature that enables data to be accessed faster.
Keys are frequently good candidates for indexing and some DBMSs automatically create indexes for keys, but that doesn't have to be so.
The phrase "index key" mixes these two quite different words and might be best avoided if you want to avoid any confusion. "Index key" is sometimes used to mean "the set of attributes in an index". However the set of attributes in question are not necessarily a key because they may not be unique.
Oracle Database enforces a UNIQUE key or PRIMARY KEY integrity constraint on a table by creating a unique index on the unique key or primary key. This index is automatically created by the database when the constraint is enabled.
You can create indexes explicitly (outside of integrity constraints) using the SQL statement CREATE INDEX .
Indexes can be unique or non-unique. Unique indexes guarantee that no two rows of a table have duplicate values in the key column (or columns). Non-unique indexes do not impose this restriction on the column values.
Use the CREATE UNIQUE INDEX statement to create a unique index.
Specifying the Index Associated with a Constraint
If you require more explicit control over the indexes associated with UNIQUE and PRIMARY KEY constraints, the database lets you:
1. Specify an existing index that the database is to use
to enforce the constraint
2. Specify a CREATE INDEX statement that the database is to use to create
the index and enforce the constraint
These options are specified using the USING INDEX clause.
Example:
CREATE TABLE a (
a1 INT PRIMARY KEY USING INDEX (create index ai on a (a1)));
http://docs.oracle.com/cd/B28359_01/server.111/b28310/indexes003.htm
Other responses are defining the Primary Key, but not the Primary Index.
A Primary Index isn't an index on the Primary Key.
A Primary Index is your table's data structure, but only if your data structure is ordered by the Primary Key, thus allowing efficient lookups without a requiring a separate data structure to look up records by the Primary Key.
All databases (that I'm aware of) have a Primary Key.
Not all databases have a Primary Index. Most of those that don't build a secondary index on the Primary Key by default.

What's the difference between a Primary Key and Identity?

In a SQL Server db, what is the difference between a Primary Key and an Identity column? A column can be a primary key without being an indentity. A column cannot, however, be an identity without being a primary key.
In addition to the differences, what does a PK and Identity column offer that just a PK column doesn't?
A column can definitely be an identity without being a PK.
An identity is simply an auto-increasing column.
A primary key is the unique column or columns that define the row.
These two are often used together, but there's no requirement that this be so.
This answer is more of WHY identity and primary key than WHAT they are since Joe has answered WHAT correctly above.
An identity is a value your SQL controls. Identity is a row function. It is sequential, either increasing or decreasing in value, at least in SQL Server. It should never be modified and gaps in the value should be ignored. Identity values are very useful in linking table B to table A since the value is never duplicated. The identity is not the best choice for a clustered index in every case. If a table contains audit data the clustered index may be better being created on the date occurred as it will present the answer to the question " what happened between today and four days ago" with less work because the records for the dates are sequential in the data pages.
A primary key makes the column or columns in a row unique. Primary key is a column function. Only one primary key may be defined on any table, but multiple unique indexes may be created which simulates the primary key. Clustering the primary key is not always the correct choice. Consider a phone book. If the phone book is clustered by the primary key(phone number) the query to return the phone numbers on "First Street" will be very costly.
The general rules I follow for identity and primary key are:
Always use an identity column
Create the clustered index on the column or columns which are used in range lookups
Keep the clustered index narrow since the clustered index is added to the end of every other index
Create primary key and unique indexes to reject duplicate values
Narrow keys are better
Create an index for every column or columns used in joins
These are my GENERAL rules.
A primary key (also known as a candidate key) is any set of attributes that have the properties of uniqueness and minimality. That means the key column or columns are constrained to be unique. In other words the DBMS won't permit any two rows to have the same set of values for those attributes.
The IDENTITY property effectively creates an auto-incrementing default value for a column. That column does not have to be unique though, so an IDENTITY column isn't necessarily a key.
However, an IDENTITY column is typically intended to be used as a key and therefore it usually has a uniqueness constraint on it to ensure that duplicates are not permitted.
Major Difference between Primary and Identity Column
Primary Column:
Primary Key cannot have duplicate values.
It creates a clustered index for the Table.
It can be set for any column type.
We need to provide the primary column value while inserting in the table.
Identity Column:
Identity Column can have duplicate value.
It can only be set for Integer related columns like int, bigint, smallint, tinyint or decimal
No need to insert values in the identity column. It is inserted automatically based on the seed.
EDITS MADE BASED ON FEEDBACK
A key is unique to a row. It's a way of identifying a row. Rows may have none, one, or several keys. These keys may consist of one or more columns.
Keys are indexes with a unique constraint. This differentiates them from non-key indexes.
Any index with multi-columns is called a "composite index".
Traditionally, a primary key is viewed as the main key that uniquely identifies a row. There may only be one of these.
Depending on the table's design, one may have no primary key.
A primary key is just that - a "prime key". It's the main one that specifies the unique identity of a row. Depending on a table's design, this can be a misnomer and multiple keys express the uniqueness.
In SQL Server, a primary key may be clustered. This means the remaining columns are attached to this key at the leaf level of the index. In other words, once SQL Server has found the key, it has also found the row (to be clear, this is because of the clustered aspect).
An identity column is simply a method of generating a unique ID for a row.
These two are often used together, but this is not a requirement.
You can use IDENTITY not only with integers, but also with any numeric data type that has a scale of 0
primary key could have scale but its not required.
IDENTITY, combined with a PRIMARY KEY or UNIQUE constraint, lets you provide a simple unique row identifier
Primary key emphasizing on uniqueness and avoid duplication value for all records on the same column, while identity provides increasing numbers in a column without inserting data.
Both features could be on a single column or on difference one.

Can I have multiple primary keys in a single table?

Can I have multiple primary keys in a single table?
A Table can have a Composite Primary Key which is a primary key made from two or more columns. For example:
CREATE TABLE userdata (
userid INT,
userdataid INT,
info char(200),
primary key (userid, userdataid)
);
Update: Here is a link with a more detailed description of composite primary keys.
You can only have one primary key, but you can have multiple columns in your primary key.
You can also have Unique Indexes on your table, which will work a bit like a primary key in that they will enforce unique values, and will speed up querying of those values.
A table can have multiple candidate keys. Each candidate key is a column or set of columns that are UNIQUE, taken together, and also NOT NULL. Thus, specifying values for all the columns of any candidate key is enough to determine that there is one row that meets the criteria, or no rows at all.
Candidate keys are a fundamental concept in the relational data model.
It's common practice, if multiple keys are present in one table, to designate one of the candidate keys as the primary key. It's also common practice to cause any foreign keys to the table to reference the primary key, rather than any other candidate key.
I recommend these practices, but there is nothing in the relational model that requires selecting a primary key among the candidate keys.
This is the answer for both the main question and for #Kalmi's question of
What would be the point of having multiple auto-generating columns?
This code below has a composite primary key. One of its columns is auto-incremented. This will work only in MyISAM. InnoDB will generate an error "ERROR 1075 (42000): Incorrect table definition; there can be only one auto column and it must be defined as a key".
DROP TABLE IF EXISTS `test`.`animals`;
CREATE TABLE `test`.`animals` (
`grp` char(30) NOT NULL,
`id` mediumint(9) NOT NULL AUTO_INCREMENT,
`name` char(30) NOT NULL,
PRIMARY KEY (`grp`,`id`)
) ENGINE=MyISAM;
INSERT INTO animals (grp,name) VALUES
('mammal','dog'),('mammal','cat'),
('bird','penguin'),('fish','lax'),('mammal','whale'),
('bird','ostrich');
SELECT * FROM animals ORDER BY grp,id;
Which returns:
+--------+----+---------+
| grp | id | name |
+--------+----+---------+
| fish | 1 | lax |
| mammal | 1 | dog |
| mammal | 2 | cat |
| mammal | 3 | whale |
| bird | 1 | penguin |
| bird | 2 | ostrich |
+--------+----+---------+
(Have been studying these, a lot)
Candidate keys - A minimal column combination required to uniquely identify a table row.
Compound keys - 2 or more columns.
Multiple Candidate keys can exist in a table.
Primary KEY - Only one of the candidate keys that is chosen by us
Alternate keys - All other candidate keys
Both Primary Key & Alternate keys can be Compound keys
Sources:
https://en.wikipedia.org/wiki/Superkey
https://en.wikipedia.org/wiki/Candidate_key
https://en.wikipedia.org/wiki/Primary_key
https://en.wikipedia.org/wiki/Compound_key
As noted by the others it is possible to have multi-column primary keys.
It should be noted however that if you have some functional dependencies that are not introduced by a key, you should consider normalizing your relation.
Example:
Person(id, name, email, street, zip_code, area)
There can be a functional dependency between id -> name,email, street, zip_code and area
But often a zip_code is associated with a area and thus there is an internal functional dependecy between zip_code -> area.
Thus one may consider splitting it into another table:
Person(id, name, email, street, zip_code)
Area(zip_code, name)
So that it is consistent with the third normal form.
Primary Key is very unfortunate notation, because of the connotation of "Primary" and the subconscious association in consequence with the Logical Model. I thus avoid using it. Instead I refer to the Surrogate Key of the Physical Model and the Natural Key(s) of the Logical Model.
It is important that the Logical Model for every Entity have at least one set of "business attributes" which comprise a Key for the entity. Boyce, Codd, Date et al refer to these in the Relational Model as Candidate Keys. When we then build tables for these Entities their Candidate Keys become Natural Keys in those tables. It is only through those Natural Keys that users are able to uniquely identify rows in the tables; as surrogate keys should always be hidden from users. This is because Surrogate Keys have no business meaning.
However the Physical Model for our tables will in many instances be inefficient without a Surrogate Key. Recall that non-covered columns for a non-clustered index can only be found (in general) through a Key Lookup into the clustered index (ignore tables implemented as heaps for a moment). When our available Natural Key(s) are wide this (1) widens the width of our non-clustered leaf nodes, increasing storage requirements and read accesses for seeks and scans of that non-clustered index; and (2) reduces fan-out from our clustered index increasing index height and index size, again increasing reads and storage requirements for our clustered indexes; and (3) increases cache requirements for our clustered indexes. chasing other indexes and data out of cache.
This is where a small Surrogate Key, designated to the RDBMS as "the Primary Key" proves beneficial. When set as the clustering key, so as to be used for key lookups into the clustered index from non-clustered indexes and foreign key lookups from related tables, all these disadvantages disappear. Our clustered index fan-outs increase again to reduce clustered index height and size, reduce cache load for our clustered indexes, decrease reads when accessing data through any mechanism (whether index scan, index seek, non-clustered key lookup or foreign key lookup) and decrease storage requirements for both clustered and nonclustered indexes of our tables.
Note that these benefits only occur when the surrogate key is both small and the clustering key. If a GUID is used as the clustering key the situation will often be worse than if the smallest available Natural Key had been used. If the table is organized as a heap then the 8-byte (heap) RowID will be used for key lookups, which is better than a 16-byte GUID but less performant than a 4-byte integer.
If a GUID must be used due to business constraints than the search for a better clustering key is worthwhile. If for example a small site identifier and 4-byte "site-sequence-number" is feasible then that design might give better performance than a GUID as Surrogate Key.
If the consequences of a heap (hash join perhaps) make that the preferred storage then the costs of a wider clustering key need to be balanced into the trade-off analysis.
Consider this example::
ALTER TABLE Persons
ADD CONSTRAINT pk_PersonID PRIMARY KEY (P_Id,LastName)
where the tuple "(P_Id,LastName)" requires a uniqueness constraint, and may be a lengthy Unicode LastName plus a 4-byte integer, it would be desirable to (1) declaratively enforce this constraint as "ADD CONSTRAINT pk_PersonID UNIQUE NONCLUSTERED (P_Id,LastName)" and (2) separately declare a small Surrogate Key to be the "Primary Key" of a clustered index. It is worth noting that Anita possibly only wishes to add the LastName to this constraint in order to make that a covered field, which is unnecessary in a clustered index because ALL fields are covered by it.
The ability in SQL Server to designate a Primary Key as nonclustered is an unfortunate historical circumstance, due to a conflation of the meaning "preferred natural or candidate key" (from the Logical Model) with the meaning "lookup key in storage" from the Physical Model. My understanding is that originally SYBASE SQL Server always used a 4-byte RowID, whether into a heap or a clustered index, as the "lookup key in storage" from the Physical Model.
A primary key is the key that uniquely identifies a record and is used in all indexes. This is why you can't have more than one. It is also generally the key that is used in joining to child tables but this is not a requirement. The real purpose of a PK is to make sure that something allows you to uniquely identify a record so that data changes affect the correct record and so that indexes can be created.
However, you can put multiple fields in one primary key (a composite PK). This will make your joins slower (espcially if they are larger string type fields) and your indexes larger but it may remove the need to do joins in some of the child tables, so as far as performance and design, take it on a case by case basis. When you do this, each field itself is not unique, but the combination of them is. If one or more of the fields in a composite key should also be unique, then you need a unique index on it. It is likely though that if one field is unique, this is a better candidate for the PK.
Now at times, you have more than one candidate for the PK. In this case you choose one as the PK or use a surrogate key (I personally prefer surrogate keys for this instance). And (this is critical!) you add unique indexes to each of the candidate keys that were not chosen as the PK. If the data needs to be unique, it needs a unique index whether it is the PK or not. This is a data integrity issue. (Note this is also true anytime you use a surrogate key; people get into trouble with surrogate keys because they forget to create unique indexes on the candidate keys.)
There are occasionally times when you want more than one surrogate key (which are usually the PK if you have them). In this case what you want isn't more PK's, it is more fields with autogenerated keys. Most DBs don't allow this, but there are ways of getting around it. First consider if the second field could be calculated based on the first autogenerated key (Field1 * -1 for instance) or perhaps the need for a second autogenerated key really means you should create a related table. Related tables can be in a one-to-one relationship. You would enforce that by adding the PK from the parent table to the child table and then adding the new autogenerated field to the table and then whatever fields are appropriate for this table. Then choose one of the two keys as the PK and put a unique index on the other (the autogenerated field does not have to be a PK). And make sure to add the FK to the field that is in the parent table. In general if you have no additional fields for the child table, you need to examine why you think you need two autogenerated fields.
Some people use the term "primary key" to mean exactly an integer column that gets its values generated by some automatic mechanism. For example AUTO_INCREMENT in MySQL or IDENTITY in Microsoft SQL Server. Are you using primary key in this sense?
If so, the answer depends on the brand of database you're using. In MySQL, you can't do this, you get an error:
mysql> create table foo (
id int primary key auto_increment,
id2 int auto_increment
);
ERROR 1075 (42000): Incorrect table definition;
there can be only one auto column and it must be defined as a key
In some other brands of database, you are able to define more than one auto-generating column in a table.
Having two primary keys at the same time, is not possible. But (assuming that you have not messed the case up with composite key), may be what you might need is to make one attribute unique.
CREATE t1(
c1 int NOT NULL,
c2 int NOT NULL UNIQUE,
...,
PRIMARY KEY (c1)
);
However note that in relational database a 'super key' is a subset of attributes which uniquely identify a tuple or row in a table. A 'key' is a 'super key' that has an additional property that removing any attribute from the key, makes that key no more a 'super key'(or simply a 'key' is a minimal super key). If there are more keys, all of them are candidate keys. We select one of the candidate keys as a primary key. That's why talking about multiple primary keys for a one relation or table is being a conflict.
Good technical answers were given in better way than I can do.
I am only can add to this topic:
If you want something that not allowed/acceptable it is good reason to take step back.
Understand the core of why it's not acceptable.
Dig more in documentation/journal articles/web and etc.
Analyze/review current design and point major flaws.
Consider and test every step during new design.
Always look forward and try to create adaptive solution.
Hope it will helps someone.
Yes, Its possible in SQL,
but we can't set more than one primary keys in MsAccess.
Then, I don't know about the other databases.
CREATE TABLE CHAPTER (
BOOK_ISBN VARCHAR(50) NOT NULL,
IDX INT NOT NULL,
TITLE VARCHAR(100) NOT NULL,
NUM_OF_PAGES INT,
PRIMARY KEY (BOOK_ISBN, IDX)
);

Resources