For a large table of transactions (100 million rows, 20 GB) that already has a primary key (a natural composite key of 4 columns), will it help performance to add an identity column and make that the primary key?
The current primary key (the natural composite primary key of 4 columns) does the job, but I have been told that you should always have a surrogate key. So, could I improve performance by creating an identity column and making that the primary key?
I'm using a SQL Server 2008 R2 database.
EDIT: This transaction table is mainly joined to definition tables and used to populate reports.
EDIT: If I did add a surrogate key, it wouldn't be used in any joins. The existing key fields would be used.
EDIT: There would be no child tables to this table
Just adding an IDENTITY column, along with a new constraint and index for it, is very unlikely to improve performance. The table will be larger, so scans and seeks could take longer, and there will be more indexes to update. Of course, it all depends on what you are measuring the performance of... and whether you intend to make other changes to code or database when you add the new column. Adding an IDENTITY column and doing nothing else would probably be unwise.
Only if:
you have child tables that are larger
you have nonclustered indexes
In each of these cases, the PK (assumed clustered) of your table will be in each child entry/NC entry. So making the clustered key narrower will benefit.
If you have only a few nonclustered indexes (maybe one) and no child tables, all you'll achieve is
a wider row (more data pages used)
a slightly smaller B-tree (which is a fraction of total space)
...but you'll still need an index/constraint on the current 4 columns anyway = an increase in space.
If your 4-way key captures parent table keys too (which sounds likely), then you'd lose the advantage of that overlap. It would be covered by the new index/constraint, though.
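To make the key-width point concrete, here's a minimal sketch (hypothetical table and column names; SQL Server syntax). Every nonclustered index entry carries the clustered key as its row locator, so a 4-column clustered key is repeated in every NC entry:

-- Hypothetical 4-int clustered PK: each nonclustered index row silently
-- stores all four key columns (16 bytes) as its row locator.
CREATE TABLE dbo.Txn (
    k1 int NOT NULL, k2 int NOT NULL, k3 int NOT NULL, k4 int NOT NULL,
    amount money NOT NULL,
    CONSTRAINT pk_Txn PRIMARY KEY CLUSTERED (k1, k2, k3, k4)
);
-- This index physically contains (amount, k1, k2, k3, k4) per row; with a
-- surrogate int clustered key it would contain only (amount, id) -- 4 bytes.
CREATE NONCLUSTERED INDEX ix_Txn_amount ON dbo.Txn (amount);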
So no, you probably don't want to do it.
We threw away a surrogate key (bigint) on a billion+ row table and moved to the actual 11-way key, reducing space on disk by 65%+ because of the simpler structure (one less index, slightly more rows per page, etc.)
Given your edits, and notwithstanding all the conversation the question has sparked, I would suggest that adding an IDENTITY column to this table will do a lot more harm than good.
One place where performance is hurt is on a change to the data in the natural key. The change would then have to propagate to all the child records. For instance, suppose one of those fields was a company name and the company changed its name: all the related records, and there could be millions of them, would have to change, but if you used a surrogate key, only one record would have to change. Integer joins tend to be faster (generally much faster than 4-column joins), and writing the code to join is generally faster as well. On the other hand, having the vital four fields in place may mean the join isn't needed as often. Insert performance will take a slight hit as well, since the surrogate key has to be generated and indexed. Usually this is so small a hit as to be unnoticeable, but the possibility is there.
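To illustrate the cascading-update point, a rough sketch (the table and column names here are hypothetical):

-- Natural key: a company rename must touch every referencing row.
UPDATE orders SET company_name = 'NewCo' WHERE company_name = 'OldCo'; -- millions of rows
-- Surrogate key: only the parent row changes; child rows keep the same id.
UPDATE company SET company_name = 'NewCo' WHERE company_id = 42;       -- one row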
A four-column natural key is often not as unique as you think it will be, because with that many columns the data tends to change over time. While it is unique now, will it be unique over time? If you use a surrogate key plus a unique index on the natural key, and the natural key later turns out not to be unique, all you have to do is drop the unique index. If it is the PK and there are child tables, you have to totally redesign your database.
Only you can decide which, if any, of these considerations affects your specific data needs; surrogate keys are better for some applications and worse for others.
---EDIT:
Based on the edits to the question, adding an identity/surrogate key might not be the solution to this problem.
--Original Answer.
One case of performance improvement would be when you use joins and when you have child tables.
In the absence of a surrogate key, you would have to replicate all 4 key columns to the child table and join on the 4 columns.
create table t_parent (
    col1 int not null,  -- column types assumed for illustration
    col2 int not null,
    col3 int not null,
    col4 int not null,
    col5 int null,
    constraint pk_t_parent primary key (col1, col2, col3, col4)
);
create table t_child (
    col1 int not null,
    col2 int not null,
    col3 int not null,
    col4 int not null,
    col7 int not null,
    col8 int null,
    constraint pk_t_child primary key (col1, col2, col3, col4, col7),
    constraint fk_parent_child foreign key (col1, col2, col3, col4)
        references t_parent (col1, col2, col3, col4)
);
The joins will include all 4 columns:
select t2.*
from t_parent t1
join t_child t2
  on  t1.col1 = t2.col1
  and t1.col2 = t2.col2
  and t1.col3 = t2.col3
  and t1.col4 = t2.col4;
If you use a surrogate key and create a unique constraint on the 4 columns (formerly the primary key), it will be both efficient and the data will still be validated as before.
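For comparison, a sketch of that surrogate-key arrangement (column types are assumptions): the parent gets an identity PK, the 4 natural columns keep a unique constraint, and the child references a single column:

create table t_parent (
    parent_id int identity(1,1) not null,
    col1 int not null,
    col2 int not null,
    col3 int not null,
    col4 int not null,
    col5 int null,
    constraint pk_t_parent primary key (parent_id),
    constraint uq_t_parent unique (col1, col2, col3, col4) -- data still validated
);
create table t_child (
    parent_id int not null,
    col7 int not null,
    col8 int null,
    constraint pk_t_child primary key (parent_id, col7),
    constraint fk_parent_child foreign key (parent_id) references t_parent (parent_id)
);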
Related
I'm normally using:
UPDATE table1 SET field1='test' WHERE ID=10
But would it be more efficient to use the following statement:
UPDATE TOP (1) table1 SET field1='test' WHERE ID=10
if I have lots of records?
The ID Column is a primary key and autoincremented too.
If the ID column is a Primary Key, then there will be at most a single record affected by your UPDATE query.
If your Primary Key is by default a Clustered Index, then the performance should be similar in both cases.
Even if, when creating your PK, you specify it as nonclustered, you still get a performance boost when searching / selecting / identifying / filtering records (because you're using WHERE). This might not be as fast as a clustered-index PK, but the performance difference should be negligible.
When creating a PK, you're forced to pick one of the two indexing types for your key, as mentioned and explained here in more detail.
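For reference, a minimal sketch of the two choices (table and constraint names assumed):

-- Default: the PK becomes the clustered index.
ALTER TABLE table1 ADD CONSTRAINT pk_table1 PRIMARY KEY CLUSTERED (ID);
-- Alternative (one or the other, not both): keep the PK nonclustered,
-- leaving the single clustered index free for another key.
ALTER TABLE table1 ADD CONSTRAINT pk_table1 PRIMARY KEY NONCLUSTERED (ID);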
Hence, both versions of the UPDATE query should have similar performance (possibly small differences when running on different occasions due to other ancillary operations).
In conclusion:
If you have a Primary Key on your ID column, and you're using it in the FILTERING part of the query (WHERE), then you should be fine when you're querying thousands, millions and possibly even up to billions of records.
Disclaimer:
The performance / speed of the UPDATE query also depends on what other indexes need to be updated due to the changing values (indexes which contain field1 as part of their key), triggers on your table, cascading rules for foreign keys, etc.
In my system I have temporary entities that are created based on rules stored in my database, and the entities are not persisted.
Now, what I need is to store information about these entities, and because they are created based on rules and are not stored, they have no ID.
I came up with a formula to generate an ID for these temp entities based on the rule that was used to generate them: id = rule id + "-" + entity index in the rule. This formula generates unique strings of the form 164-3, 123-0, 432-2, etc...
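As an aside, a string like this can be derived rather than stored; a sketch, assuming the rule id and index live in two integer columns on a hypothetical table:

-- Computed column rebuilding the '164-3' style id from the two integers.
ALTER TABLE dbo.TempEntities -- hypothetical table name
ADD EntityKey AS (CAST(RuleId AS varchar(11)) + '-' + CAST(IndexId AS varchar(11)));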
My question is how should I build my table (regarding primary key and clustered index) when my keys have no relation or order? Keep in mind that I will only (99.9% of the time) query the table using the id mentioned above.
Options I thought about after much reading, but don't have the knowledge to determine which is better:
1) primary key on a varchar column with clustered index. -According to various sources, this would be bad because of fragmentation and the width of the key. Also, the keys' format is pretty weird for sorting.
2) primary key on a varchar column without clustered index (heap table). -Also a bad idea according to various sources, due to indexing and fragmentation issues.
3) identity int column with clustered index, and a varchar column as primary key with unique index. -Can't really see the benefit of the surrogate key here, since it would mainly help with range queries and ordering, and I would never query the table based on this key because it would be unknown at all times.
4) 2 columns composite key: rule id + rule index columns.
-Now I don't have strings but I have two columns that will be copied to FKs and non clustered indexes. Also I'm not sure what indexes I would use in this case.
Can anybody shine a light here? Any help is appreciated.
--Edit
I will perform more selects than inserts;
I will perform more inserts than updates;
All selects will include at least rule id;
If I use a surrogate primary key, and a unique index on (rule id, index), then I can use the surrogate for subsequent operations after retrieving data by rule id, which would be faster. Also, inserts would be faster.
However, because the data will be stored according to the surrogate key, I might have records that have the same rule id but different indexes stored quite far from each other on disk, which means that even with an index on rule id, retrieving the data could be kinda slow.
If I use (rule id, index) as a clustered primary key, rows with the same rule id would be stored close to each other, and selecting data by rule id would be efficient enough. However, I suspect inserts would be slow.
Is the rationale above correct?
Using a heap is generally a bad idea unless proven otherwise. Even so, you will need a very solid reason for not having a clustered index (almost any clustered index will make things better, even one on an identity column).
Storing this key in a single column is okay; if you want natural sorting, you can pad your numbers with zeroes, for example. However, this will widen the key.
Having a composite primary key (and, subsequently, foreign keys) is completely acceptable, especially when dealing with natural keys, like the one you have. This will give you the narrowest possible key - int + int or some such - while eliminating the sorting issue at the same time. I would recommend to make this PK clustered to reduce additional key lookups.
Fragmentation here will not be a big issue; at least, no bigger than with any other indexing decision. Any index built on such a key will be prone to fragmentation, clustered or no. In any case, your DBA should know how to keep an index such as this in top form.
Regarding the order of columns in the index, the following rules usually apply:
If partial key matches will take place (filtering by one part of the key but not the other), the part which is used most often should go first;
If No. 1 isn't applicable and all parts of the key are used in all queries, the column with the highest cardinality should go first.
The order of the remaining columns (if there are more than one) isn't of much importance, because SQL Server only creates distribution statistics for the first column of a composite index. However, it is a good idea to list them in order of decreasing cardinality.
EDIT: Seeing your update with additional details, here are the most suitable options. Suppose your table looks like this:
-- Sample table
create table dbo.TempEntities (
RuleId int not null,
IndexId int not null,
-- Remaining columns listed here
EntityData xml not null
);
go
From here, the most straightforward way is to use the natural key as a clustered index:
-- Option 1 - natural clustered index
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (RuleId, IndexId);
go
However, if you have any child tables that would reference this one, it might not be the most convenient solution, because natural keys are prone to updates, which creates a mess where you could avoid it. Instead, a surrogate key can be introduced, like this:
-- Option 2 - surrogate clustered, natural nonclustered
alter table dbo.TempEntities add Id bigint identity(1,1) not null;
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (Id);
alter table dbo.TempEntities
add constraint UQ_TempEntities_RuleIdIndexId unique (RuleId, IndexId);
go
It makes sense to have the surrogate PK clustered, because it will result in far fewer page splits, making inserts faster (despite having one more index compared to Option 1). Without any intimate knowledge of your queries, this is probably the most balanced solution.
Shuffling the clustered attribute between the surrogate and natural keys has mostly academic value, and can only make a difference on a high-load system with hundreds of inserts happening every second on a 24*7 schedule. If your system is indeed such, please seek a professional consultant who will analyse your queries and provide a solution tailored to your situation.
For example, we have table A and table B, which have a many-to-many relationship. An intersection table, table C, stores A.id and B.id along with a value that represents the relationship between the two. Or, as a concrete example, imagine stackexchange, which has a user account, a forum, and a karma score. Or a student, a course, and a grade. If tables A and B are very large, table C can and probably will grow monstrously large very quickly (in fact, let's just assume it does). How do we go about dealing with such an issue? Is there a better way to design the tables to avoid this?
There is no magic. If some rows are connected and some aren't, this information has to be represented somehow, and the "relational" way of doing it is a "junction" (aka "link") table. Yes, a junction table can grow large, but fortunately databases are very capable of handling huge amounts of data.
There are good reasons for using junction table versus comma-separated list (or similar), including:
Efficient querying (through indexing and clustering).
Enforcement of referential integrity.
When designing a junction table, ask the following questions:
Do I need to query in only one direction or both?1
If one direction, just create a composite PRIMARY KEY on both foreign keys (let's call them PARENT_ID and CHILD_ID). Order matters: if you query from parent to children, PK should be: {PARENT_ID, CHILD_ID}.
If both directions, also create a composite index in the opposite order, which is {CHILD_ID, PARENT_ID} in this case.
Is the "extra" data small?
If yes, cluster the table and cover the extra data in the secondary index as necessary.2
If no, don't cluster the table and don't cover the extra data in the secondary index.3
Are there any additional tables for which the junction table acts as a parent?
If yes, consider whether adding a surrogate key might be worthwhile to keep child FKs slim. But beware that if you add a surrogate key, this will probably eliminate the opportunity for clustering.
In many cases, the answers to these questions will be: both, yes, and no - in which case your table will look similar to this (Oracle syntax below):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR2(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
) ORGANIZATION INDEX COMPRESS;
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID, EXTRA_DATA) COMPRESS;
Considerations:
ORGANIZATION INDEX: Oracle-specific syntax for what most DBMSes call clustering. Other DBMSes have their own syntax, and some (MySQL/InnoDB) imply clustering without letting the user turn it off.
COMPRESS: Some DBMSes support leading-edge index compression. Since a clustered table is essentially an index, compression can be applied to it as well.
JUNCTION_TABLE_IE1, EXTRA_DATA: Since the extra data is covered by the secondary index, the DBMS can get it without touching the table when querying in the direction from child to parents. The primary key acts as the clustering key, so the extra data is naturally covered when querying from a parent to its children.
Physically, you have just two B-Trees (one is the clustered table and the other is the secondary index) and no table heap at all. This translates to good querying performance (both parent-to-child and child-to-parent directions can be satisfied by a simple index range scan) and fairly small overhead when inserting/deleting rows.
Here is the equivalent MS SQL Server syntax (sans index compression):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
);
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID) INCLUDE (EXTRA_DATA);
Note that MS SQL Server automatically clusters tables, unless PRIMARY KEY NONCLUSTERED is specified.
1 In other words, do you only need to get the "children" of a given "parent", or might you also need to get the parents of a given child?
2 Covering allows the query to be satisfied from the index alone, and avoids expensive double-lookup that would otherwise be necessary when accessing data through a secondary index in the clustered table.
3 This way, the extra data is not repeated (which would be expensive, since it's big), yet you avoid the double-lookup and replace it with (cheaper) table heap access. But, beware of clustering factor that can destroy the performance of range scans in heap-based tables!
I suppose everyone runs into this problem once in a while: you have two tables with autonumber primary keys that need to be merged. There are many good reasons why autonumber primary keys are used in favour of, say, application-generated keys, but merging with other tables must be one of the biggest drawbacks.
Some problems that arise are overlapping ids and out of sync foreign keys. I would like to hear your approach for tackling this. I always run into problems, so I'm very curious if anybody has some sort of a general solution.
-- EDIT --
In response to the answers suggesting to use guids or other non-numeric keys, there are situations where in advance it just seems a better idea to use autonumber keys (and you regret this later), or you're taking over someone else's project, or you get some legacy database that you have to work with. So I'm really looking for a solution where you have no control over the database design anymore.
Solutions include:
Use GUIDs as primary keys instead of a simpler identity field. Very likely to avoid overlaps, but GUIDs are harder to use and don't play nicely with clustered indexes.
Make the primary key into a multi-column key, the second column resolving overlapping values by identifying the source of the merged data. Portable, works better with clustered indexes, but developers hate multi-column keys.
Use natural keys instead of pseudokeys.
Allocate new primary key values for one of the merged tables, and cascade these changes to any dependent rows. This changes a merge operation into an ETL operation. This is the only solution you can use for legacy data, if you can't change the database design.
I'm not sure there's a one-size-fits-all solution. Choose one of these based on the situation.
Hm, I'm kind of enthusiastic about the idea that I just put in a comment on AlexKuznetsov's answer, so I'll make a whole answer out of it.
Consider the tables to be named table1 and table2, with id1 and id2 as autonumber primary keys. They will be merged to table3 with id3 (a non-autonumber primary key).
Why not:
Remove all foreign key constraints to table1 and table2
For all foreign key fields referring to table1, execute UPDATE table SET id1 = id1 * 2, and for FK fields referring to table2, execute UPDATE table SET id2 = id2 * 2 + 1
Fill table3 by executing INSERT INTO table3 SELECT id1 * 2 AS id3, ... FROM table1 UNION ALL SELECT id2 * 2 + 1 AS id3, ... FROM table2
Create new foreign key constraints to table3
It can even work with 3 or more tables, just by using a higher multiplier.
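Spelled out as SQL, the idea looks roughly like this (the table and column names are hypothetical, and payload stands in for whatever non-key columns the tables carry):

-- Remap FKs so ids from table1 become even and ids from table2 become odd.
UPDATE child_of_1 SET ref_id = ref_id * 2;
UPDATE child_of_2 SET ref_id = ref_id * 2 + 1;
-- Merge the parents under the same even/odd scheme.
INSERT INTO table3 (id3, payload)
SELECT id1 * 2, payload FROM table1
UNION ALL
SELECT id2 * 2 + 1, payload FROM table2;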
One of the standard approaches (if not the standard approach), where you're designing for such an eventuality, is to use GUIDs for primary keys rather than integers - merging is then relatively painless as you are guaranteed not to encounter an overlap.
Barring a redesign, tho', I think you're stuck with having to insert into the table, accept that you'll get new primary keys, and ensure that you maintain the mapping from old to new IDs - then insert the referencing data with the FKs remapped, etc. If your data has a "business key" that will remain unique after the insert, this would save having to keep track of the mapping.
If you are sure you have only two such tables, you can just have even IDs in one table (0,2,4,6,...) and odd IDs in the other (1,3,5,7,...)
Assuming you also have a natural key in the tables to be merged then the process isn't difficult. The natural key is used to deduplicate and to correctly reassign any references. You can renumber the surrogate key values at any time - that being one of the principal advantages of using a surrogate in the first place.
So I don't see this as a problem with surrogate keys - provided you always enforce the natural key (actually I much prefer the term "business key"). If you haven't got business keys for these tables, well maybe now would be a good time to redesign so that ALL the necessary keys are properly implemented.
There's a healthy debate out there between surrogate and natural keys:
SO Post 1
SO Post 2
My opinion, which seems to be in line with the majority (it's a slim majority), is that you should use surrogate keys unless a natural key is completely obvious and guaranteed not to change. Then you should enforce uniqueness on the natural key. Which means surrogate keys almost all of the time.
Example of the two approaches, starting with a Company table:
1: Surrogate key: Table has an ID field which is the PK (and an identity). Company names are required to be unique by state, so there's a unique constraint there.
2: Natural key: Table uses CompanyName and State as the PK -- satisfies both the PK and uniqueness.
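In DDL terms, the two approaches might look like this (column types assumed; you'd pick one or the other, of course):

-- 1: Surrogate key, with the natural key enforced by a unique constraint.
CREATE TABLE Company (
    ID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CompanyName varchar(100) NOT NULL,
    State char(2) NOT NULL,
    CONSTRAINT uq_Company_Name_State UNIQUE (CompanyName, State)
);
-- 2: Natural key doing double duty as the PK.
CREATE TABLE Company (
    CompanyName varchar(100) NOT NULL,
    State char(2) NOT NULL,
    CONSTRAINT pk_Company PRIMARY KEY (CompanyName, State)
);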
Let's say that the Company PK is used in 10 other tables. My hypothesis, with no numbers to back it up, is that the surrogate key approach would be much faster here.
The only convincing argument I've seen for natural keys is for a many-to-many table that uses the two foreign keys as a natural key. I think in that case it makes sense. But you can get into trouble if you need to refactor; that's out of the scope of this post, I think.
Has anyone seen an article that compares performance differences on a set of tables that use surrogate keys vs. the same set of tables using natural keys? Looking around on SO and Google hasn't yielded anything worthwhile, just a lot of theorycrafting.
Important Update: I've started building a set of test tables that answer this question. It looks like this:
PartNatural - parts table that uses the unique PartNumber as a PK
PartSurrogate - parts table that uses an ID (int, identity) as PK and has a unique index on the PartNumber
Plant - ID (int, identity) as PK
Engineer - ID (int, identity) as PK
Every part is joined to a plant and every instance of a part at a plant is joined to an engineer. If anyone has an issue with this testbed, now's the time.
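For clarity, a sketch of that testbed in DDL (column types are assumptions):

CREATE TABLE PartNatural (
    PartNumber varchar(20) NOT NULL PRIMARY KEY
);
CREATE TABLE PartSurrogate (
    ID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    PartNumber varchar(20) NOT NULL UNIQUE
);
CREATE TABLE Plant (ID int IDENTITY(1,1) NOT NULL PRIMARY KEY);
CREATE TABLE Engineer (ID int IDENTITY(1,1) NOT NULL PRIMARY KEY);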
Use both! Natural keys prevent database corruption (inconsistency might be a better word). When the "right" natural key (the one that eliminates duplicate rows) would perform badly because of its length or the number of columns involved, a surrogate key can be added as well, to be used as the foreign key in other tables instead of the natural key. But the natural key should remain as an alternate key or unique index, to prevent data corruption and enforce database consistency.
Much of the hoohah (in the "debate" on this issue) may be due to a false assumption: that you have to use the Primary Key for joins and for Foreign Keys in other tables. THIS IS FALSE. You can use ANY key as the target for foreign keys in other tables. It can be the Primary Key, an alternate key, or any unique index or unique constraint, as long as it is unique in the target relation (table). And as for joins, you can use anything at all as a join condition; it doesn't even have to be a key, or an index, or even unique! (Although if it is not unique, you will get multiple rows in the Cartesian product it creates.) You can even create a join using a non-specific criterion (like >, <, or LIKE) as the join condition.
Indeed, you can create a join using any valid SQL expression that evaluate to a boolean.
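For instance, a join on a pure range condition (hypothetical tables and columns):

-- Neither side joins on a key; the ON clause is just a boolean expression.
SELECT o.order_id, h.holiday_name
FROM orders o
JOIN holidays h
    ON o.order_date BETWEEN h.starts_on AND h.ends_on;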
Natural keys differ from surrogate keys in value, not type.
Any type can be used for a surrogate key, like a VARCHAR for the system-generated slug or something else.
However, the most commonly used types for surrogate keys are INTEGER and RAW(16) (or whatever type your RDBMS uses for GUIDs).
Comparing surrogate integers and natural integers (like an SSN) takes exactly the same time.
Comparing VARCHARs may take collation into account, and they are generally longer than integers, which makes them less efficient.
Comparing a set of two INTEGERs is probably also less efficient than comparing a single INTEGER.
With datatypes small in size, this difference is probably fractions of a percent of the time required to fetch pages, traverse indexes, acquire database latches, etc.
And here are the numbers (in MySQL):
CREATE TABLE aint (id INT NOT NULL PRIMARY KEY, value VARCHAR(100));
CREATE TABLE adouble (id1 INT NOT NULL, id2 INT NOT NULL, value VARCHAR(100), PRIMARY KEY (id1, id2));
CREATE TABLE bint (id INT NOT NULL PRIMARY KEY, aid INT NOT NULL);
CREATE TABLE bdouble (id INT NOT NULL PRIMARY KEY, aid1 INT NOT NULL, aid2 INT NOT NULL);
INSERT
INTO aint
SELECT id, RPAD('', FLOOR(RAND(20090804) * 100), '*')
FROM t_source;
INSERT
INTO bint
SELECT id, id
FROM aint;
INSERT
INTO adouble
SELECT id, id, value
FROM aint;
INSERT
INTO bdouble
SELECT id, id, id
FROM aint;
SELECT SUM(LENGTH(value))
FROM bint b
JOIN aint a
ON a.id = b.aid;
SELECT SUM(LENGTH(value))
FROM bdouble b
JOIN adouble a
ON (a.id1, a.id2) = (b.aid1, b.aid2);
t_source is just a dummy table with 1,000,000 rows.
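The post doesn't show how t_source was populated; one possible way to build such a table (using a MySQL 8+ recursive CTE, which is an assumption and not part of the original test):

-- Allow a million recursions, then generate ids 1..1,000,000.
SET SESSION cte_max_recursion_depth = 1000001;
CREATE TABLE t_source (id INT NOT NULL PRIMARY KEY);
INSERT INTO t_source (id)
SELECT id
FROM (
    WITH RECURSIVE seq (id) AS (
        SELECT 1
        UNION ALL
        SELECT id + 1 FROM seq WHERE id < 1000000
    )
    SELECT id FROM seq
) AS numbers;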
aint and adouble, bint and bdouble contain exactly the same data, except that aint has a single integer as its PRIMARY KEY, while adouble has a pair of two identical integers.
On my machine, both queries run for 14.5 seconds, +/- 0.1 second.
The performance difference, if any, is within the range of fluctuation.