I have been reading articles around the net to understand the differences between the following key types, but it just seems hard for me to grasp. Examples would definitely help my understanding:
primary key,
partition key,
composite key,
clustering key
There is a lot of confusion around this; I will try to make it as simple as possible.
The primary key is a general concept to indicate one or more columns used to retrieve data from a table.
The primary key may be SIMPLE and even declared inline:
create table stackoverflow_simple (
key text PRIMARY KEY,
data text
);
That means it is made of a single column.
But the primary key can also be COMPOSITE (aka COMPOUND), made up of multiple columns.
create table stackoverflow_composite (
key_part_one text,
key_part_two int,
data text,
PRIMARY KEY(key_part_one, key_part_two)
);
In the case of a COMPOSITE primary key, the "first part" of the key is called the PARTITION KEY (in this example key_part_one is the partition key) and the second part of the key is the CLUSTERING KEY (in this example key_part_two).
Please note that both the partition and clustering key can be made up of multiple columns; here's how:
create table stackoverflow_multiple (
k_part_one text,
k_part_two int,
k_clust_one text,
k_clust_two int,
k_clust_three uuid,
data text,
PRIMARY KEY((k_part_one, k_part_two), k_clust_one, k_clust_two, k_clust_three)
);
Behind these names ...
The Partition Key is responsible for data distribution across your nodes.
The Clustering Key is responsible for data sorting within the partition.
The Primary Key is equivalent to the Partition Key in a single-field-key table (i.e. Simple).
The Composite/Compound Key is just any multiple-column key.
Further usage information: DataStax documentation
Small usage and content examples
SIMPLE KEY:
insert into stackoverflow_simple (key, data) VALUES ('han', 'solo');
select * from stackoverflow_simple where key='han';
table content
 key | data
-----+------
 han | solo
COMPOSITE/COMPOUND KEY can retrieve "wide rows" (i.e. you can query by just the partition key, even if you have clustering keys defined)
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 9, 'football player');
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 10, 'ex-football player');
select * from stackoverflow_composite where key_part_one = 'ronaldo';
table content
 key_part_one | key_part_two | data
--------------+--------------+--------------------
      ronaldo |            9 |    football player
      ronaldo |           10 | ex-football player
But you can query with all keys (both partition and clustering) ...
select * from stackoverflow_composite
where key_part_one = 'ronaldo' and key_part_two = 10;
query output
 key_part_one | key_part_two | data
--------------+--------------+--------------------
      ronaldo |           10 | ex-football player
Important note: the partition key is the minimum specifier needed to perform a query using a where clause.
If you have a composite partition key, like the following:
PRIMARY KEY((col1, col2), col10, col4)
you can perform a query only by passing at least both col1 and col2, as these are the two columns that define the partition key. The "general" rule for queries is that you must pass at least all partition key columns; then you can optionally add each clustering key in the order in which they're set.
So, the valid queries are (excluding secondary indexes; see the sketch after this list):
col1 and col2
col1 and col2 and col10
col1 and col2 and col10 and col4
Invalid:
col1 and col2 and col4
anything that does not contain both col1 and col2
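Putting those rules into runnable CQL, here's a minimal sketch (the table name, column types and values are assumptions, matching the key layout above):

create table stackoverflow_rules (
    col1 text,
    col2 int,
    col10 text,
    col4 int,
    data text,
    PRIMARY KEY((col1, col2), col10, col4)
);

-- valid: the full partition key, then clustering columns in declared order
select * from stackoverflow_rules where col1 = 'a' and col2 = 1;
select * from stackoverflow_rules where col1 = 'a' and col2 = 1 and col10 = 'x';
select * from stackoverflow_rules where col1 = 'a' and col2 = 1 and col10 = 'x' and col4 = 2;

-- invalid: skips the clustering column col10
-- select * from stackoverflow_rules where col1 = 'a' and col2 = 1 and col4 = 2;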
Adding a summary answer as the accepted one is quite long. The terms "row" and "column" are used in the context of CQL, not how Cassandra is actually implemented.
A primary key uniquely identifies a row.
A composite key is a key formed from multiple columns.
A partition key is the primary lookup to find a set of rows, i.e. a partition.
A clustering key is the part of the primary key that isn't the partition key (and defines the ordering within a partition).
Examples:
PRIMARY KEY (a): The partition key is a.
PRIMARY KEY (a, b): The partition key is a, the clustering key is b.
PRIMARY KEY ((a, b)): The composite partition key is (a, b).
PRIMARY KEY (a, b, c): The partition key is a, the composite clustering key is (b, c).
PRIMARY KEY ((a, b), c): The composite partition key is (a, b), the clustering key is c.
PRIMARY KEY ((a, b), c, d): The composite partition key is (a, b), the composite clustering key is (c, d).
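As a concrete sketch of that last form, the DDL would look like this (the column types are assumptions):

create table example (
    a text,
    b int,
    c text,
    d int,
    value text,
    PRIMARY KEY ((a, b), c, d)
);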
In Cassandra, the difference between primary key, partition key, composite key and clustering key always causes some confusion, so I am going to explain them below and relate them to each other. We use CQL (Cassandra Query Language) for Cassandra database access.
Note: this answer is as per an updated version of Cassandra.
Primary Key:
In Cassandra, there are two different ways to declare a primary key.
CREATE TABLE Cass (
id int PRIMARY KEY,
name text
);
CREATE TABLE Cass (
id int,
name text,
PRIMARY KEY(id)
);
In CQL, the order in which columns are defined for the PRIMARY KEY matters. The first column of the key is called the partition key, which has the property that all rows sharing the same partition key (even across tables, in fact) are stored on the same physical node. Also, insertions/updates/deletions on rows sharing the same partition key for a given table are performed atomically and in isolation. Note that it is possible to have a composite partition key, i.e. a partition key formed of multiple columns, using an extra set of parentheses to define which columns form the partition key.
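For instance, a minimal sketch of that extra set of parentheses (the table and column names are assumptions):

CREATE TABLE Cass2 (
    id int,
    bucket int,
    name text,
    PRIMARY KEY ((id, bucket))
);

Here (id, bucket) together form a composite partition key, and there are no clustering columns.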
Partitioning and Clustering
The PRIMARY KEY definition is made up of two parts: the Partition Key and the Clustering Columns. The first part maps to the storage engine row key, while the second is used to group columns in a row.
CREATE TABLE device_check (
device_id int,
checked_at timestamp,
is_power boolean,
is_locked boolean,
PRIMARY KEY (device_id, checked_at)
);
Here device_id is the partition key and checked_at is the clustering key.
We can have multiple clustering keys, as well as multiple partition key columns, depending on the declaration.
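A couple of assumed sample queries against device_check (the values are made up):

SELECT * FROM device_check WHERE device_id = 42;
-- returns the whole partition, sorted by checked_at

SELECT * FROM device_check WHERE device_id = 42 AND checked_at > '2020-01-01';
-- range filters are allowed on the clustering column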
Primary Key: is composed of partition key(s) [and optional clustering keys (or columns)].
Partition Key: the hash value of the partition key is used to determine the specific node in a cluster to store the data.
Clustering Key: is used to sort the data within each of the partitions (on the responsible node and its replicas).
Compound Primary Key: as said above, the clustering keys are optional in a primary key. If they aren't mentioned, it's a simple primary key. If clustering keys are mentioned, it's a compound primary key.
Composite Partition Key: using just one column as a partition key might result in wide-row issues (depending on use case/data modeling). Hence the partition key is sometimes specified as a combination of more than one column.
Regarding the confusion about which one is mandatory, which one can be skipped etc. in a query, it helps to imagine Cassandra as a giant HashMap. In a HashMap, you can't retrieve the values without the key.
Here, the partition keys play the role of that key, so each query needs to have them specified. Without them, Cassandra wouldn't know which node to search.
The clustering keys (columns, which are optional) help in further narrowing your query search after Cassandra finds out the specific node (and its replicas) responsible for that specific partition key.
In a brief sense:
Partition Key is nothing but identification for a row; that identification is most of the time a single column (called the Primary Key), and sometimes a combination of multiple columns (called a Composite Partition Key).
Cluster key is nothing but indexing and sorting. Cluster keys depend on a few things:
What columns you use in the where clause, apart from the primary key columns.
If you have very large partitions, on what basis can the data be divided for easy management? Example: I have one million records of a country's population. So, for easy management, I cluster the data based on state, then pincode, and so on.
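A hypothetical sketch of that population example (the table layout and column types are mine, not from the original):

create table population (
    country text,
    state text,
    pincode text,
    person_id uuid,
    name text,
    PRIMARY KEY (country, state, pincode, person_id)
);

Here country is the partition key, and the rows within each partition are clustered (sorted) by state, then pincode, then person_id.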
It's worth noting that you will probably use these a lot more than the similar concepts in the relational world (composite keys).
Example: suppose you have to find the last N users who recently joined user group X. How would you do this efficiently, given that reads are predominant in this case? Like this (from the official Cassandra guide):
CREATE TABLE group_join_dates (
groupname text,
joined timeuuid,
join_date text,
username text,
email text,
age int,
PRIMARY KEY ((groupname, join_date), joined)
) WITH CLUSTERING ORDER BY (joined DESC);
Here, the partition key is itself compound, and the clustering key is the joined time. The reason the clustering key is the join time is that the results are then already sorted (and stored, which makes lookups fast). But why do we use a compound partition key? Because we always want to read as few partitions as possible. How does putting join_date in there help? Now users from the same group with the same join date will reside in a single partition! This means we will always read as few partitions as possible (first start with the newest, then move to older, and so on, rather than jumping between them).
In fact, in extreme cases you would also need to use a hash of the join_date rather than the join_date alone, so that if you often query for the last 3 days, those share the same hash and are therefore available from the same partition!
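The read side would then look something like this sketch (the group name and date value are assumptions):

SELECT username FROM group_join_dates
WHERE groupname = 'X' AND join_date = '2018-01-02'
LIMIT 10;

Thanks to CLUSTERING ORDER BY (joined DESC), this returns the newest joiners first without any extra sorting.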
Disclaimer: This answer is specific to DynamoDB; however, the concepts apply to Cassandra as well, since both are NoSQL databases.
When you create a table, in addition to the table name, you must specify the primary key of the table. The primary key uniquely identifies each item in the table, so that no two items can have the same key.
DynamoDB supports two different kinds of primary keys:
Partition key – A simple primary key, composed of one attribute known as the partition key.
DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored.
In a table that has only a partition key, no two items can have the same partition key value.
Partition key and sort key – Referred to as a composite primary key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key.
DynamoDB uses the partition key value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. All items with the same partition key value are stored together, in sorted order by sort key value.
In a table that has a partition key and a sort key, it's possible for two items to have the same partition key value. However, those two items must have different sort key values.
A composite primary key gives you additional flexibility when querying data. For example, if you provide only the value for Artist, DynamoDB retrieves all of the songs by that artist. To retrieve only a subset of songs by a particular artist, you can provide a value for Artist along with a range of values for SongTitle.
Note: The partition key of an item is also known as its hash attribute. The term hash attribute derives from the use of an internal hash function in DynamoDB that evenly distributes data items across partitions, based on their partition key values.
The sort key of an item is also known as its range attribute. The term range attribute derives from the way DynamoDB stores items with the same partition key physically close together, in sorted order by the sort key value.
Reference - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.PrimaryKey
Primary Key: Like in many databases, it is a unique key in a table; essentially it means that for any two records in a table, the primary key cannot be the same. The database, in this case Cassandra, is designed to make sure that this condition holds in all situations. So if you try to write a record with PK1 as the primary key and there is already a record with the same key PK1, it will get overwritten; otherwise a new record will be created.
Partition Key: It is a construct of distributed databases (where the data of a single table is divided into multiple parts called partitions). Partitions are then distributed across nodes using a distribution strategy (usually a hash of the partition key) to get infinite scaling capability. That said, the partition key is the set of columns of a record that decides which partition the record will belong to. Hence, the partition key decides the physical location of a record across the distributed cluster of nodes.
Clustering Key: The clustering key decides the order of records in a particular partition. So, if there are 10K records in a partition, the clustering key decides the order in which these 10K records are physically stored in a sorted manner.
Example:
Let's say you have a table in Cassandra to store sales events of an e-commerce website. Written out as proper CQL (the original gave only the column list and keys; the column types below are assumptions):

CREATE TABLE sales_events (
    order_id int,
    item_id int,
    quantity int,
    amount decimal,
    payment_id int,
    status text,
    order_time timestamp,
    PRIMARY KEY ((order_id, item_id), order_time)
) WITH CLUSTERING ORDER BY (order_time DESC);
So here,
Primary Key is ((order_id, item_id), order_time), and it decides the uniqueness of a record in the table.
Partition Key is (order_id, item_id); the hash of this tuple decides the partition of the record and its location on the distributed cluster.
Clustering Key is order_time: within a particular partition, records are ordered by order_time in descending order. So if you do a LIMIT 1 CQL query for a particular partition, you will always get the record with the maximum timestamp (see the sketch below).
Composite key is just a term to specify that the primary key of a table is not a single column, but multiple columns.
Primary key is a combination of the partition and clustering key.
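A sketch of that LIMIT 1 lookup (the values are made up):

SELECT * FROM sales_events
WHERE order_id = 1001 AND item_id = 7
LIMIT 1;

-- returns the row with the latest order_time for that partition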
I'm not asking HOW to do this, but if it's what I SHOULD be doing.
Two employees can work on the same job. So of course, both FKs, EmployeeID and JobID, can have a MANY relationship in an "Employee_Jobs" table.
Let's take Employee A, Employee B, Job A and Job B. All of the following would be acceptable:
A A
A B
B A
B B
What would NOT be acceptable is a duplicate of a combination of these two PKs... since we cannot have for example, [Employee A working on Job A] twice.
So would it be correct to say that the only way to manage this is to make the combination of the two PKs, EmployeeID and JobID, a Unique, non-clustered index?
I tried to think of how to instead, break this up to more tables but I keep getting back to this same problem.
Yes, not only is it appropriate, but in fact, the combination of these two attributes should be the PRIMARY KEY.
And in any other table where the entity represented by the rows has a logical attribute consisting of the two columns EmployeeID and JobID (representing the work done by an employee on a job, the contribution of the employee to a job, or the association of an employee to a job in any way), the FK in that table should be a composite foreign key consisting of these same two columns.
If you are using a surrogate key on this table to simplify joins and the definition of foreign keys in other tables, then by all means continue to do so, but keep the two-column natural key in this table as either a unique index or an alternate key (a key is a key: anything that is declared or defined to be unique), so as to ensure data integrity in this table. In fact, to make it clear to users of the schema when this situation comes up, I generally make the composite natural key the PRIMARY KEY, and add/define the surrogate (which is used in joins and other tables' FKs) as an alternate key or unique index. This is pretty much only a semantic distinction, as they create almost identical functionality. But because data integrity is more important to me than join syntax and foreign key structure, to me the natural key is the PRIMARY key.
Yes, in that case you should consider making both those fields the primary key; specifically, a composite (compound) primary key like the one below, which will ensure uniqueness of the combination of both fields:
PRIMARY KEY (EmployeeID, JobID)
Though you suggested a unique non-clustered index, marking both fields as the primary key will actually create a UNIQUE clustered index on them.
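A minimal sketch of the full table (the column types are assumptions):

CREATE TABLE Employee_Jobs (
    EmployeeID int NOT NULL,
    JobID int NOT NULL,
    PRIMARY KEY (EmployeeID, JobID)
);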
I have the following table that serves to join 3 tables:
ClientID int
BlogID int
MentionID int
Assuming that queries will always come via ClientID, I can create 1 multi-column index (ClientID, BlogID, MentionID).
The question is, should I create it as a clustered index or a unique key? I understand a clustered index stores the data on its leaf nodes. Of course, in this case, the index is the data, so I don't know if SQL Server will duplicate the data or not. Be that as it may, I can't find anything on MSDN about the significance of using "unique key".
How does this differ from Type = Index & IsUnique = yes?
Can someone tell me the advantages each way?
A clustered index is "the table itself": its index nodes are arranged in a tree, and its leaf nodes contain the row data. A clustered index doesn't have to be declared as unique (though it usually is); if it is not unique, the server implicitly adds a "uniquifier" to the index, so that each row is uniquely identified.
Other indexes store the clustered index value in their leaf nodes (and possibly some other columns, if they are included with the INCLUDE clause in the CREATE INDEX statement).
Any index may be declared as unique, so the server performs an additional check to prevent duplicate values from getting into the table.
It seems you are asking for the difference among:
MYTABLE
id integer primary key autoincrement
clientid integer
blogid integer
mentionid integer
-- with a unique composite index on (clientid, blogid, mentionid) and three foreign key constraints
and
MYTABLE
clientid integer
blogid integer
mentionid integer
-- with a composite primary key on (clientid, blogid, mentionid) and three foreign key constraints
and
MYTABLE
id integer primary key autoincrement
clientid integer
blogid integer
mentionid integer
-- with an index on clientid and also an index on blogid and the three foreign key constraints
In the first, you have the index on the integer primary key and also the alternative unique index on the triad. In the second, you have only the unique index on the triadic primary key. In the third, you have a unique index on the integer primary key and two other non-unique indexes, one on clientid and the other on blogid.
The performance gain with the second option's marginally greater efficiency would be de minimis, and so I'd base the decision on other factors. The third is the most flexible in terms of queries and offers greater simplicity of coding; it offers the benefit of indexes on client and blog both, in case you wanted to have a query with blog, not client, in the WHERE clause. As for coding, some GUI tools and middleware have trouble with multi-part primary keys, and your update/insert/delete logic will be simpler when it has to deal with a single integer PK column. I have found that code simplicity and ease of maintenance are far better things than a few seconds or only a few fractions of seconds of improvement in query response time.
A unique index, a unique key and a unique constraint are basically the same thing. They result in an index that enforces uniqueness.
Clustered means that the index becomes the table itself. It's good to have a clustered index; otherwise the table hangs around in an unordered heap.
Unique and clustered are unrelated properties. You can combine them in any way you like. So in your case, I'd create a unique clustered index. The normal way to do that is by creating the index as a clustered primary key.
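A sketch of that normal way, using the columns from the question (the table and constraint names are assumptions):

ALTER TABLE ClientBlogMentions
    ADD CONSTRAINT PK_ClientBlogMentions
    PRIMARY KEY CLUSTERED (ClientID, BlogID, MentionID);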
The data will not be duplicated if you create a clustered unique index on your three columns.
The unique clustered index will be the data - and the index at the same time :-)
Since this is a three-way join table, this clustered index probably does make a lot of sense. I'd say: go for it!
UNIQUE INDEX and UNIQUE CONSTRAINT are somewhat different concepts.
UNIQUE CONSTRAINT is a logical concept and means "make sure this column is unique, no matter how"
UNIQUE INDEX is a physical concept and means "create a B-Tree index on this column and fail whenever duplicates are inserted there"
The latter implies the former but not vice versa.
For instance, in Oracle, if you have a non-unique index on col1:
CREATE UNIQUE INDEX (col1) will fail and say "these columns are already indexed"
ALTER TABLE ADD CONSTRAINT UNIQUE(col1) will succeed and use the existing index to police the constraint.
Use CONSTRAINT if you just want the column to be unique and INDEX if you know a B-Tree index is what you want (to speed up searches etc).
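A hedged Oracle sketch of that scenario (the table and index names are mine):

CREATE TABLE t (col1 INTEGER);
CREATE INDEX ix_t_col1 ON t (col1);                    -- non-unique index
ALTER TABLE t ADD CONSTRAINT uq_t_col1 UNIQUE (col1);  -- policed by the existing index
-- CREATE UNIQUE INDEX ix2_t_col1 ON t (col1);         -- would fail: column list already indexed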
What is the difference between a unique index and a unique key?
The unique piece is not where the difference lies. The index and key are not the same thing, and are not comparable.
A key is a data column, or several columns, that are forced to be unique with a constraint, either primary key or explicitly defined unique constraint. Whereas an index is a structure for storing data location for faster retrieval.
From the docs:
Unique Index
Creates a unique index on a table or view. A unique index is one in which no two rows are permitted to have the same index key value. A clustered index on a view must be unique.
Unique Key (Constraint)
You can use UNIQUE constraints to make sure that no duplicate values are entered in specific columns that do not participate in a primary key. Although both a UNIQUE constraint and a PRIMARY KEY constraint enforce uniqueness, use a UNIQUE constraint instead of a PRIMARY KEY constraint when you want to enforce the uniqueness of a column, or combination of columns, that is not the primary key.
This MSDN article comparing the two is what you're after. The terminology is such that "constraint" is ANSI, but in SQL Server you can't disable a Unique Constraint...
For most purposes, there's no difference - the constraint is implemented as an index under the covers. The MSDN article backs this up--the difference is in the meta-data, for things like:
tweaking FILLFACTOR
INCLUDE provides more efficient covering indexes (composite constraint)
A filtered index acts like a constraint over a subset of rows (e.g. one that ignores multiple NULLs); see the sketch below.
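For example, a sketch of such a filtered unique index (SQL Server syntax; the table and column names are assumptions):

CREATE UNIQUE NONCLUSTERED INDEX UQ_Users_Email
    ON Users (Email)
    WHERE Email IS NOT NULL;  -- allows many NULL rows while keeping non-NULL emails unique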
"Unique key" is a tautology. A Key (AKA "Candidate Key") is logical feature of the database - a constraint that enforces the uniqueness of a set of attributes in a table.
An index is a physical level feature intended to optimise performance in some way. There are many types of index.
Unique Key: It is a constraint which imposes a limitation on the database: it will not allow duplicate values. For example, if you want to select a column as a primary key, it should be NOT NULL and UNIQUE.
Unique Index: It is an index which improves performance while executing queries on your database. A unique index also does not allow duplicate values in the index, i.e. no two rows will have the same index key value.
Here are few key differences:
Purpose:
Unique Key: Ensures integrity of data at the table level, so that no duplicates can be entered in the table. It is not used for query planning and does not contribute to query speed. (Its purpose is different from that of a primary key: the primary key uniquely identifies each record for data operations such as update/delete etc. In complex tables, a unique key can be a combination of several columns, and it would be inefficient to use that combination for identifying records in transactions. Hence the primary key is a quick way of identifying a particular record in the table, while the unique key guarantees that no two records have the same key attributes.)
Unique Index: Ensures uniqueness of data at the index level; it cannot guarantee uniqueness at the table level, e.g. in the case of a filtered index. It is used for query planning and fetching data, and thus speeds up queries depending on the columns used/queried.
Filter Option:
Unique Key: Filter option is not available
Unique Index: Filter option is available
Storage Option:
Unique Key: Filegroup only
Unique Index: Filegroup or partition
Icon:
Unique Key: The icon is a vertical key.
Unique Index: The icon is a b-tree.
Both the key and the index are identifiers of a table row.
The index, though, is a parallel identification structure containing a pointer to the identified row, while keys are in-situ field members.
The key, as an identifier, implies uniqueness (a constraint) and NOT NULL (a constraint). There is no sense in NULL as an identifier (null cannot identify anything), nor in a non-unique identifying value.
A non-clustered index can contain real data, not serving as an identifier of real data, and so be non-unique [1].
It is unfortunate practice that the key or index (an identifier) is referred to by its constraint (a rule or restriction), which most previous answers here followed.
Keys are used in the context of:
alternate aka secondary aka candidate keys, which can be multiple
composite keys (a few fields combined)
the primary key (superkey), natural or surrogate, only one, really used for referential integrity
foreign keys
A foreign key is a key in another table (where it is the primary key), and not even a key in the table in which it is frequently referred to as such. That usage is explained by the confusing shortcutting of the term "foreign key constraint" to just "foreign key".
A primary key constraint really implies NOT NULL and UNIQUE constraints, plus that the referenced column (or combination of columns) is an identifier. It is also, unfortunately, substituted by "primary key" or "primary key constraint", while it is both, and so cannot be called either just a (primary key) constraint or just a (primary) key.
Update: my related question:
[1] UNIQUE argument for INDEX creation - what's for?
The functionality is more or less the same; it depends on your use case.
Suppose you want to prevent duplicate rows based on CUSTOMER_ID and TEAM_NAME.
In that case you can use either:
UNIQUE INDEX idx_customer_id_name (CUSTOMER_ID, TEAM_NAME)
UNIQUE KEY unique_key_customer_id_name (CUSTOMER_ID, TEAM_NAME)
But you should consider how often you fetch records based on CUSTOMER_ID and TEAM_NAME. If that is frequent, then you should use a unique index, as it helps with faster retrieval of records; otherwise you should go with a unique key, as it avoids the overhead of index-based fetching.
There's a healthy debate out there between surrogate and natural keys:
SO Post 1
SO Post 2
My opinion, which seems to be in line with the majority (it's a slim majority), is that you should use surrogate keys unless a natural key is completely obvious and guaranteed not to change. Then you should enforce uniqueness on the natural key. Which means surrogate keys almost all of the time.
Example of the two approaches, starting with a Company table:
1: Surrogate key: Table has an ID field which is the PK (and an identity). Company names are required to be unique by state, so there's a unique constraint there.
2: Natural key: Table uses CompanyName and State as the PK -- satisfies both the PK and uniqueness.
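In DDL, a sketch of the two approaches (the types and lengths are assumptions):

-- 1: surrogate key, with a unique constraint on the natural key
CREATE TABLE Company (
    ID int IDENTITY(1,1) PRIMARY KEY,
    CompanyName varchar(100) NOT NULL,
    State char(2) NOT NULL,
    CONSTRAINT UQ_Company_Name_State UNIQUE (CompanyName, State)
);

-- 2: natural key as the PK
CREATE TABLE CompanyNatural (
    CompanyName varchar(100) NOT NULL,
    State char(2) NOT NULL,
    PRIMARY KEY (CompanyName, State)
);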
Let's say that the Company PK is used in 10 other tables. My hypothesis, with no numbers to back it up, is that the surrogate key approach would be much faster here.
The only convincing argument I've seen for natural key is for a many to many table that uses the two foreign keys as a natural key. I think in that case it makes sense. But you can get into trouble if you need to refactor; that's out of scope of this post I think.
Has anyone seen an article that compares performance differences on a set of tables that use surrogate keys vs. the same set of tables using natural keys? Looking around on SO and Google hasn't yielded anything worthwhile, just a lot of theorycrafting.
Important update: I've started building a set of test tables that answer this question. It looks like this:
PartNatural - parts table that uses the unique PartNumber as a PK
PartSurrogate - parts table that uses an ID (int, identity) as PK and has a unique index on the PartNumber
Plant - ID (int, identity) as PK
Engineer - ID (int, identity) as PK
Every part is joined to a plant and every instance of a part at a plant is joined to an engineer. If anyone has an issue with this testbed, now's the time.
Use both! Natural keys prevent database corruption (inconsistency might be a better word). When the "right" natural key (the one that eliminates duplicate rows) would perform badly because of its length or the number of columns involved, a surrogate key can be added as well, for performance purposes, to be used as the foreign key in other tables instead of the natural key. But the natural key should remain as an alternate key or unique index to prevent data corruption and enforce database consistency.
Much of the hoohah (in the "debate" on this issue) may be due to a false assumption: that you have to use the primary key for joins and foreign keys in other tables. THIS IS FALSE. You can use ANY key as the target for foreign keys in other tables. It can be the primary key, an alternate key, or any unique index or unique constraint, as long as it is unique in the target relation (table). And as for joins, you can use anything at all for a join condition; it doesn't even have to be a key, or an index, or even unique! (Although if it is not unique, you will get multiple rows in the Cartesian product it creates.) You can even create a join using a non-specific criterion (like >, <, or "like") as the join condition.
Indeed, you can create a join using any valid SQL expression that evaluates to a boolean.
Natural keys differ from surrogate keys in value, not type.
Any type can be used for a surrogate key, like a VARCHAR for the system-generated slug or something else.
However, the most used types for surrogate keys are INTEGER and RAW(16) (or whatever type your RDBMS uses for GUIDs).
Comparing surrogate integers and natural integers (like SSNs) takes exactly the same time.
Comparing VARCHARs may take collation into account, and they are generally longer than integers, making them less efficient.
Comparing a set of two INTEGERs is probably also less efficient than comparing a single INTEGER.
On datatypes small in size, this difference is probably a tiny fraction of the time required to fetch pages, traverse indexes, acquire database latches etc.
And here are the numbers (in MySQL):
CREATE TABLE aint (id INT NOT NULL PRIMARY KEY, value VARCHAR(100));
CREATE TABLE adouble (id1 INT NOT NULL, id2 INT NOT NULL, value VARCHAR(100), PRIMARY KEY (id1, id2));
CREATE TABLE bint (id INT NOT NULL PRIMARY KEY, aid INT NOT NULL);
CREATE TABLE bdouble (id INT NOT NULL PRIMARY KEY, aid1 INT NOT NULL, aid2 INT NOT NULL);
INSERT
INTO aint
SELECT id, RPAD('', FLOOR(RAND(20090804) * 100), '*')
FROM t_source;
INSERT
INTO bint
SELECT id, id
FROM aint;
INSERT
INTO adouble
SELECT id, id, value
FROM aint;
INSERT
INTO bdouble
SELECT id, id, id
FROM aint;
SELECT SUM(LENGTH(value))
FROM bint b
JOIN aint a
ON a.id = b.aid;
SELECT SUM(LENGTH(value))
FROM bdouble b
JOIN adouble a
ON (a.id1, a.id2) = (b.aid1, b.aid2);
t_source is just a dummy table with 1,000,000 rows.
aint and adouble, and bint and bdouble, contain exactly the same data, except that aint has an integer as its PRIMARY KEY, while adouble has a pair of two identical integers.
On my machine, both queries run for 14.5 seconds, +/- 0.1 second
Performance difference, if any, is within the fluctuations range.
Can I have multiple primary keys in a single table?
A Table can have a Composite Primary Key which is a primary key made from two or more columns. For example:
CREATE TABLE userdata (
userid INT,
userdataid INT,
info char(200),
primary key (userid, userdataid)
);
Update: Here is a link with a more detailed description of composite primary keys.
You can only have one primary key, but you can have multiple columns in your primary key.
You can also have Unique Indexes on your table, which will work a bit like a primary key in that they will enforce unique values, and will speed up querying of those values.
A table can have multiple candidate keys. Each candidate key is a column or set of columns that are UNIQUE, taken together, and also NOT NULL. Thus, specifying values for all the columns of any candidate key is enough to determine that there is one row that meets the criteria, or no rows at all.
Candidate keys are a fundamental concept in the relational data model.
It's common practice, if multiple keys are present in one table, to designate one of the candidate keys as the primary key. It's also common practice to cause any foreign keys to the table to reference the primary key, rather than any other candidate key.
I recommend these practices, but there is nothing in the relational model that requires selecting a primary key among the candidate keys.
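A small sketch of those practices (the table and column names, and the types, are assumptions): one candidate key is designated the primary key, and the other is still enforced as unique.

CREATE TABLE employee (
    employee_id INT PRIMARY KEY,    -- the candidate key designated as primary key
    ssn CHAR(11) NOT NULL UNIQUE    -- another candidate key, still enforced
);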
This is the answer for both the main question and for #Kalmi's question of
What would be the point of having multiple auto-generating columns?
This code below has a composite primary key. One of its columns is auto-incremented. This will work only in MyISAM. InnoDB will generate an error "ERROR 1075 (42000): Incorrect table definition; there can be only one auto column and it must be defined as a key".
DROP TABLE IF EXISTS `test`.`animals`;
CREATE TABLE `test`.`animals` (
`grp` char(30) NOT NULL,
`id` mediumint(9) NOT NULL AUTO_INCREMENT,
`name` char(30) NOT NULL,
PRIMARY KEY (`grp`,`id`)
) ENGINE=MyISAM;
INSERT INTO animals (grp,name) VALUES
('mammal','dog'),('mammal','cat'),
('bird','penguin'),('fish','lax'),('mammal','whale'),
('bird','ostrich');
SELECT * FROM animals ORDER BY grp,id;
Which returns:
+--------+----+---------+
| grp | id | name |
+--------+----+---------+
| fish | 1 | lax |
| mammal | 1 | dog |
| mammal | 2 | cat |
| mammal | 3 | whale |
| bird | 1 | penguin |
| bird | 2 | ostrich |
+--------+----+---------+
(Have been studying these, a lot)
Candidate keys - minimal column combinations required to uniquely identify a table row.
Compound keys - keys of 2 or more columns.
Multiple candidate keys can exist in a table.
Primary key - the one candidate key that is chosen by us.
Alternate keys - all other candidate keys.
Both the primary key and alternate keys can be compound keys.
Sources:
https://en.wikipedia.org/wiki/Superkey
https://en.wikipedia.org/wiki/Candidate_key
https://en.wikipedia.org/wiki/Primary_key
https://en.wikipedia.org/wiki/Compound_key
As noted by the others it is possible to have multi-column primary keys.
It should be noted, however, that if you have functional dependencies that are not introduced by a key, you should consider normalizing your relation.
Example:
Person(id, name, email, street, zip_code, area)
There can be a functional dependency id -> name, email, street, zip_code, area.
But often a zip_code is associated with an area, and thus there is an internal functional dependency zip_code -> area.
Thus one may consider splitting it into another table:
Person(id, name, email, street, zip_code)
Area(zip_code, name)
So that it is consistent with the third normal form.
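A sketch of the split in DDL (the column types are assumptions):

CREATE TABLE Area (
    zip_code VARCHAR(10) PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE Person (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100),
    street VARCHAR(100),
    zip_code VARCHAR(10) REFERENCES Area (zip_code)
);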
Primary Key is very unfortunate notation, because of the connotation of "Primary" and the subconscious association in consequence with the Logical Model. I thus avoid using it. Instead I refer to the Surrogate Key of the Physical Model and the Natural Key(s) of the Logical Model.
It is important that the Logical Model for every Entity have at least one set of "business attributes" which comprise a Key for the entity. Boyce, Codd, Date et al refer to these in the Relational Model as Candidate Keys. When we then build tables for these Entities their Candidate Keys become Natural Keys in those tables. It is only through those Natural Keys that users are able to uniquely identify rows in the tables; as surrogate keys should always be hidden from users. This is because Surrogate Keys have no business meaning.
However, the Physical Model for our tables will in many instances be inefficient without a Surrogate Key. Recall that non-covered columns for a non-clustered index can only be found (in general) through a Key Lookup into the clustered index (ignore tables implemented as heaps for a moment). When our available Natural Key(s) are wide, this (1) widens our non-clustered leaf nodes, increasing storage requirements and read accesses for seeks and scans of that non-clustered index; (2) reduces fan-out from our clustered index, increasing index height and index size, again increasing reads and storage requirements for our clustered indexes; and (3) increases cache requirements for our clustered indexes, chasing other indexes and data out of cache.
This is where a small Surrogate Key, designated to the RDBMS as "the Primary Key" proves beneficial. When set as the clustering key, so as to be used for key lookups into the clustered index from non-clustered indexes and foreign key lookups from related tables, all these disadvantages disappear. Our clustered index fan-outs increase again to reduce clustered index height and size, reduce cache load for our clustered indexes, decrease reads when accessing data through any mechanism (whether index scan, index seek, non-clustered key lookup or foreign key lookup) and decrease storage requirements for both clustered and nonclustered indexes of our tables.
Note that these benefits only occur when the surrogate key is both small and the clustering key. If a GUID is used as the clustering key the situation will often be worse than if the smallest available Natural Key had been used. If the table is organized as a heap then the 8-byte (heap) RowID will be used for key lookups, which is better than a 16-byte GUID but less performant than a 4-byte integer.
If a GUID must be used due to business constraints, then the search for a better clustering key is worthwhile. If, for example, a small site identifier and a 4-byte "site-sequence-number" are feasible, then that design might give better performance than a GUID as the Surrogate Key.
If the consequences of a heap (hash join perhaps) make that the preferred storage then the costs of a wider clustering key need to be balanced into the trade-off analysis.
Consider this example::
ALTER TABLE Persons
ADD CONSTRAINT pk_PersonID PRIMARY KEY (P_Id,LastName)
where the tuple (P_Id, LastName) requires a uniqueness constraint and may be a lengthy Unicode LastName plus a 4-byte integer, it would be desirable to (1) declaratively enforce this constraint as "ADD CONSTRAINT pk_PersonID UNIQUE NONCLUSTERED (P_Id, LastName)" and (2) separately declare a small Surrogate Key to be the "Primary Key" of a clustered index. It is worth noting that Anita possibly only wishes to add the LastName to this constraint in order to make it a covered field, which is unnecessary in a clustered index because ALL fields are covered by it.
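A hedged sketch of that arrangement (the surrogate column and constraint names are assumptions):

CREATE TABLE Persons (
    PersonKey int IDENTITY(1,1) NOT NULL,  -- small surrogate, the clustering key
    P_Id int NOT NULL,
    LastName nvarchar(200) NOT NULL,
    CONSTRAINT pk_Persons PRIMARY KEY CLUSTERED (PersonKey),
    CONSTRAINT uq_PersonID UNIQUE NONCLUSTERED (P_Id, LastName)
);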
The ability in SQL Server to designate a Primary Key as nonclustered is an unfortunate historical circumstance, due to a conflation of the meaning "preferred natural or candidate key" (from the Logical Model) with the meaning "lookup key in storage" from the Physical Model. My understanding is that originally SYBASE SQL Server always used a 4-byte RowID, whether into a heap or a clustered index, as the "lookup key in storage" from the Physical Model.
A primary key is the key that uniquely identifies a record and is used in all indexes. This is why you can't have more than one. It is also generally the key that is used in joining to child tables but this is not a requirement. The real purpose of a PK is to make sure that something allows you to uniquely identify a record so that data changes affect the correct record and so that indexes can be created.
However, you can put multiple fields in one primary key (a composite PK). This will make your joins slower (especially if they are larger string-type fields) and your indexes larger, but it may remove the need to do joins to some of the child tables, so as far as performance and design go, take it on a case-by-case basis. When you do this, each field itself is not unique, but the combination of them is. If one or more of the fields in a composite key should also be unique on its own, then you need a unique index on it. It is likely, though, that if one field is unique by itself, it is a better candidate for the PK.
Now at times, you have more than one candidate for the PK. In this case you choose one as the PK or use a surrogate key (I personally prefer surrogate keys for this instance). And (this is critical!) you add unique indexes to each of the candidate keys that were not chosen as the PK. If the data needs to be unique, it needs a unique index whether it is the PK or not. This is a data integrity issue. (Note this is also true anytime you use a surrogate key; people get into trouble with surrogate keys because they forget to create unique indexes on the candidate keys.)
There are occasionally times when you want more than one surrogate key (which are usually the PK if you have them). In this case what you want isn't more PK's, it is more fields with autogenerated keys. Most DBs don't allow this, but there are ways of getting around it. First consider if the second field could be calculated based on the first autogenerated key (Field1 * -1 for instance) or perhaps the need for a second autogenerated key really means you should create a related table. Related tables can be in a one-to-one relationship. You would enforce that by adding the PK from the parent table to the child table and then adding the new autogenerated field to the table and then whatever fields are appropriate for this table. Then choose one of the two keys as the PK and put a unique index on the other (the autogenerated field does not have to be a PK). And make sure to add the FK to the field that is in the parent table. In general if you have no additional fields for the child table, you need to examine why you think you need two autogenerated fields.
Some people use the term "primary key" to mean exactly an integer column that gets its values generated by some automatic mechanism. For example AUTO_INCREMENT in MySQL or IDENTITY in Microsoft SQL Server. Are you using primary key in this sense?
If so, the answer depends on the brand of database you're using. In MySQL, you can't do this, you get an error:
mysql> create table foo (
id int primary key auto_increment,
id2 int auto_increment
);
ERROR 1075 (42000): Incorrect table definition;
there can be only one auto column and it must be defined as a key
In some other brands of database, you are able to define more than one auto-generating column in a table.
Having two primary keys at the same time is not possible. But (assuming you have not confused this with the composite-key case), maybe what you need is to make one attribute unique:

CREATE TABLE t1 (
    c1 int NOT NULL,
    c2 int NOT NULL UNIQUE,
    ...,
    PRIMARY KEY (c1)
);
However, note that in a relational database a 'super key' is a subset of attributes which uniquely identifies a tuple or row in a table. A 'key' is a 'super key' with the additional property that removing any attribute from it makes it no longer a 'super key' (simply put, a 'key' is a minimal super key). If there are more such keys, they are all candidate keys, and we select one of the candidate keys as the primary key. That's why talking about multiple primary keys for one relation or table is a contradiction.
Good technical answers were given above, in a better way than I can.
I can only add to this topic:
If you want something that is not allowed/acceptable, it is a good reason to take a step back.
Understand the core of why it's not acceptable.
Dig more into documentation/journal articles/the web etc.
Analyze/review the current design and point out the major flaws.
Consider and test every step during the new design.
Always look forward and try to create an adaptive solution.
Hope it helps someone.
Yes, it's possible in SQL,
but we can't set more than one primary key in MS Access.
I don't know about the other databases.
CREATE TABLE CHAPTER (
BOOK_ISBN VARCHAR(50) NOT NULL,
IDX INT NOT NULL,
TITLE VARCHAR(100) NOT NULL,
NUM_OF_PAGES INT,
PRIMARY KEY (BOOK_ISBN, IDX)
);