SQL Server Performance Suggestion - sql-server

I have been creating database tables using only a primary key of datatype int, and I have always had great performance, but I now need to set up merge replication with updatable subscribers.
The tables use a typical primary key, data type int, and identity increment. Setting up merge replication, I have to add the rowguid to all tables with a newsequentialid() function for the default value. I noticed that the rowguid has indexable on and was wondering if I needed the primary key anymore?
Is it okay to have 2 indexes, the primary key int and the rowguid? What is the best layout for a merge replication table? Do I keep the int id for easy row referencing and just remove the index but keep the primary key? Not sure what route to take, Thanks.

Remember that if you remove the int id column and replace it with a GUID, you may need to rework a good deal of your data and your queries. And do you really want to do queries like:
select * from orders where customer_id = '2053995D-4EFE-41C0-8A04-00009890024A'
Remember if your ids are exposed to any users (often in the case of a customer because the customer table often has no natural key since names are not unique), they will find the guid daunting for doing research.
There is nothing wrong in an existing system with having both. In a new system, you could plan to not use the ints, but there is a great risk of introducing bugs if you try to remove them in a system already using them.

The only downside of replacing the integer primary key with the guid (that I know of) is that GUIDs are larger, so the btree (index space used) will be larger and if you have foreign keys to this table (which you'd also need to change) a lot more space may end up being used across (potentially) many tables.

Related

How to store a "primary" record

Suppose I have the following tables
Companies
--CompanyID
--CompanyName
and
Locations
--LocationID
--CompanyID
--LocationName
Every company has at least one location. I want to track a primary location for each company (and yes, every company will have exactly one primary location). What's the best way to set this up? Add a primaryLocationID in the Companies table?
Add a primaryLocationID in the Companies table?
Yes, however that creates a circular reference (Companies references Locations and Locations references Companies), which could prevent you from inserting new data.
One way to resolve this chicken-and-egg problem is to simply leave Company.PrimaryLocationID NULL-able, so you can temporarily disable one of the circular FKs. This unfortunately means the database will enforce only "1:0..1", but not the strict "1:1" relationship (so you'll have to enforce it in the application code).
However, if your DBMS supports deferred constraints (such as Oracle or PostgreSQL), you can simply defer one of the FKs to break the cycle while the transaction is still in progress. By the end of the transaction both FKs have to be in place, resulting in a real "1:1" relationship.
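As a rough illustration of the deferred-constraint approach, here is a minimal Python sketch using the standard-library sqlite3 module. SQLite is only a stand-in, chosen because it runs anywhere and accepts DEFERRABLE INITIALLY DEFERRED; SQL Server itself does not offer deferred constraints. The column types and the helper add_company_with_primary_location are assumptions made for this example.

```python
import sqlite3

# Autocommit mode: BEGIN/COMMIT are issued explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Company (
    CompanyID         INTEGER PRIMARY KEY,
    CompanyName       TEXT NOT NULL,
    -- Deferred FK: checked at COMMIT, not at INSERT time.
    PrimaryLocationID INTEGER NOT NULL
        REFERENCES Location (LocationID) DEFERRABLE INITIALLY DEFERRED
);
CREATE TABLE Location (
    LocationID   INTEGER PRIMARY KEY,
    CompanyID    INTEGER NOT NULL REFERENCES Company (CompanyID),
    LocationName TEXT NOT NULL
);
""")


def add_company_with_primary_location(conn, company_id, name, location_id, location_name):
    """Insert a company and its primary location in one transaction."""
    conn.execute("BEGIN")
    try:
        # The company row temporarily points at a location that does not exist yet.
        conn.execute("INSERT INTO Company VALUES (?, ?, ?)", (company_id, name, location_id))
        conn.execute("INSERT INTO Location VALUES (?, ?, ?)", (location_id, company_id, location_name))
        conn.execute("COMMIT")  # both FKs must hold here
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise


add_company_with_primary_location(conn, 1, "Acme", 10, "Head Office")
```

A transaction that inserts a company whose primary location never appears is rejected at COMMIT, which is what makes the relationship a real "1:1" in this sketch.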
The alternative solution is to have a flag in the Locations table that is set for a primary location and NULL for non-primary locations (note the U1 constraint, a UNIQUE constraint ensuring a company cannot have multiple primary locations):
CREATE TABLE Location (
LocationID INT PRIMARY KEY,
CompanyID INT NOT NULL, -- References Company table, not shown here.
LocationName VARCHAR(50) NOT NULL, -- Possibly UNIQUE?
IsPrimary INT CHECK (IsPrimary IS NULL OR IsPrimary = 1), -- Use a BIT or BOOLEAN if supported by your DBMS.
CONSTRAINT Locations_U1 UNIQUE (CompanyID, IsPrimary)
);
Unfortunately, this has some problems:
It can only guarantee up to "1:0..1" (but not the real "1:1") even on a DBMS that supports deferred constraints.
It requires an additional index (in support of the UNIQUE constraint). Every index brings certain overhead, mostly for INSERT/UPDATE/DELETE performance. Furthermore, secondary indexes in clustered tables contain a copy of the PK, which may make them "fatter" than expected.
It depends on ANSI-compliant composite UNIQUE constraints, which allow duplicated rows if any (but not necessarily all) of the fields are NULL. Unfortunately not all DBMSes follow the standard, so the above would not work out of the box under Oracle or MS SQL Server (but would work under PostgreSQL and MySQL). You could use a filtered unique index instead of the UNIQUE constraint to work around that, but not all DBMSes support that either.
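To make the NULL behaviour concrete, here is a small Python sketch that runs the Location table from above in SQLite via the standard-library sqlite3 module. SQLite is an assumption of this example: it treats NULLs in a UNIQUE constraint as distinct, like the PostgreSQL and MySQL behaviour described, so it is not evidence of what Oracle or SQL Server would do. The sample rows and the helper try_second_primary are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Location (
    LocationID   INTEGER PRIMARY KEY,
    CompanyID    INTEGER NOT NULL,
    LocationName TEXT NOT NULL,
    IsPrimary    INTEGER CHECK (IsPrimary IS NULL OR IsPrimary = 1),
    CONSTRAINT Locations_U1 UNIQUE (CompanyID, IsPrimary)
)
""")

# One primary location and two non-primary (NULL) locations for company 10.
# The two (10, NULL) pairs do not collide because NULLs count as distinct here.
conn.executemany(
    "INSERT INTO Location VALUES (?, ?, ?, ?)",
    [(1, 10, "HQ", 1), (2, 10, "Warehouse", None), (3, 10, "Branch", None)],
)


def try_second_primary(conn):
    """Return True if a second primary location for company 10 is accepted."""
    try:
        conn.execute("INSERT INTO Location VALUES (4, 10, 'Second HQ', 1)")
        return True
    except sqlite3.IntegrityError:
        return False


second_primary_accepted = try_second_primary(conn)
```

Nothing in this schema forces a company to have a primary location at all, which is the "1:0..1" limitation noted in the list.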
BaBL86's solution models M:N, while your requirement seems to be 1:N. Nonetheless, that model could be "coerced" into 1:N by placing a key on {LocationID} (and on {CompanyID, TypeOfLocation} to ensure there cannot be multiple locations of the same type for the same company), but it is probably over-engineered for a simple "is primary" requirement.
I think your own solution is the best one - this ensures that every company can only have one primary location. By making it a NOT NULL column, you can even enforce that every company should have a primary location.
Using BaBL86's solution, you don't have those constraints: a company can have 0 - unlimited 'primary locations', which obviously shouldn't be possible.
Do note that, if you use foreign key constraints AND define primaryLocationID as a NOT NULL column, you'll run into problems, because you basically have a loop (Location points to Company, Company points to location). You cannot create a new Company (because it needs a primary location), nor can you create a new Location (because it needs a company).
I do it with pivot table:
CompanyLocations
--CompanyID
--LocationID
--TypeOfLocation (primary, office, warehouse etc.)
In this case you can select all locations and then use the type as you like. If you create a PrimaryLocationID, you need two joins to the same table and more complex logic, which is worse than this.

Uniqueidentifier vs. IDENTITY vs. Material Code --which is the best choice for primary key?

Which one is the best choice for primary key in SQL Server?
Here is some example code:
Uniqueidentifiers
e.g.
CREATE TABLE new_employees
(employeeId UNIQUEIDENTIFIER DEFAULT NEWID(),
fname VARCHAR(20) )
GO
INSERT INTO new_employees(fname) VALUES ('Karin')
GO
Identity columns
e.g.
CREATE TABLE new_employees
(
employeeId int IDENTITY(1,1),
fname varchar (20)
);
INSERT new_employees
(fname)
VALUES
('Karin');
Material Code (or Business Code, which identifies a material, e.g. a customer identifier)
e.g.
CREATE TABLE new_employees(
[ClientId] [varchar](20) NOT NULL,
[fName] [varchar](20) NULL
)
INSERT new_employees
(ClientID, fname)
VALUES
('C0101000001',--customer identifier,e.g.'C0101000001' a user-defined code.
'Karin');
Please give me some advice on choosing the primary key from these three types of identifying columns, or suggest other choices.
Thanks!
GUID may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as primary and clustering key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: about 26.7 MB vs. 106.8 MB - and that's just on a single table!
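The figures in this quick calculation can be reproduced with simple arithmetic. The sketch below is an assumption about how they were derived: it counts only the key bytes (4 for INT, 16 for a GUID) once per row in the base table and once per row in each of the 6 nonclustered indexes, and it ignores page, row header, and fill-factor overhead. With those assumptions the components come out as listed, and they sum to roughly 26.7 MB and 106.8 MB.

```python
ROWS = 1_000_000
NONCLUSTERED_INDEXES = 6
MB = 1024 * 1024

INT_BYTES = 4    # width of a SQL Server INT
GUID_BYTES = 16  # width of a SQL Server UNIQUEIDENTIFIER


def key_overhead_mb(key_bytes, rows=ROWS, nonclustered=NONCLUSTERED_INDEXES):
    """Return (base table, nonclustered indexes, total) key storage in MB."""
    base = key_bytes * rows / MB
    # Every nonclustered index entry carries a copy of the clustering key.
    indexes = nonclustered * key_bytes * rows / MB
    return base, indexes, base + indexes


int_base, int_idx, int_total = key_overhead_mb(INT_BYTES)
guid_base, guid_idx, guid_total = key_overhead_mb(GUID_BYTES)

print(f"Base table: {int_base:.2f} MB vs. {guid_base:.2f} MB")
print(f"Indexes:    {int_idx:.2f} MB vs. {guid_idx:.2f} MB")
print(f"Total:      {int_total:.2f} MB vs. {guid_total:.2f} MB")
```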
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Unless you have a very good reason, I would argue to use an INT IDENTITY for almost every "real" data table as the default for their primary key - it's unique, it's stable (never changes), it's narrow, it's ever increasing - all the good properties that you want to have in a clustering key for fast and reliable performance of your SQL Server tables!
If you have some "natural" key value that also has all those properties, then you might also use that instead of a surrogate key. But two variable-length strings of max. 20 chars each do not meet those requirements in my opinion.
IDENTITY
PROS
small storage footprint;
optimal join / index performance (e.g. for time range queries, most rows recently inserted will be on a limited number of pages);
highly useful for data warehousing;
native data type of the OS, and easy to work with in all languages;
easy to debug;
generated automatically (retrieved through SCOPE_IDENTITY() rather than assigned);
not updateable (though some consider this a disadvantage, strangely enough).
CONS
cannot be reliably "predicted" by applications — can only be retrieved after the INSERT;
need a complex scheme in multi-server environments, since IDENTITY is not allowed in some forms of replication;
can be duplicated, if not explicitly set to PRIMARY KEY.
if part of the clustered index on the table, this can create an insert hot-spot;
proprietary and not directly portable;
only unique within a single table;
gaps can occur (e.g. with a rolled back transaction), and this can cause chicken little-style alarms.
GUID
PROS
since they are {more or less} guaranteed to be unique, multiple tables/databases/instances/servers/networks/data centers can generate them independently, then merged without clashes;
required for some forms of replication;
can be generated outside the database (e.g. by an application);
distributed values prevent hot-spot (as long as you don't cluster this column, which can lead to abnormally high fragmentation).
CONS
the wider datatype leads to a drop in index performance (if clustered, each insert almost guaranteed to 'dirty' a different page), and an increase in storage requirements;
cumbersome to debug (where userid = {BAE7DF4-DDF-3RG-5TY3E3RF456AS10});
updateable (need to propagate changes, or prevent the activity altogether);
sensitive to time rollbacks in certain environments (e.g. daylight savings time rollbacks);
GROUP BY and other set operations often require CAST/CONVERT;
not all languages and environments directly support GUIDs;
there is no statement like SCOPE_GUID() to determine the value that was generated, e.g. by NEWID();
One thing you'll need to consider in designing your tables is whether you'll need to replicate, shard, or otherwise move your data from one place to another. Maybe the data is being generated by other applications and will need to be kept in sync with yours. An example of that would be a mobile app that creates data and then syncs it with a server. If anything like that is or might be true, then UNIQUEIDENTIFIER would be a good choice for your primary key.
The UNIQUEIDENTIFIER data type is terrible for performance when used as a clustered index. Yes, you could use newsequentialid(), but that doesn't help you if the values are generated on other devices. The consensus seems to be that clustered indexes are best used with a sequential and narrow data type like an INT or BIGINT.
If you're not concerned with storage space issues then you might try using a combination of both an IDENTITY cluster key and UNIQUEIDENTIFIER primary key. Create a cluster key IDENTITY column and use it for your clustered index (but not as a primary key). Inserts will still be made sequentially and it satisfies the desire for it to be a narrow data type. Now you can use a UNIQUEIDENTIFIER as your primary key. This will allow you to move, replicate, and/or shard your data when you need to.
The cluster key has no other purpose other than to keep your inserts sequential and to be what all the other non-clustered indexes point to when looking up data for a given query. The cluster key is completely throw away and can be regenerated when data is moved, replicated, and/or sharded since uniqueness is handled by the UNIQUEIDENTIFIER primary key.
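The idea of a throw-away cluster key plus an application-generated GUID primary key can be sketched in Python with the standard-library sqlite3 and uuid modules. This is only an analogy under stated assumptions: SQLite's implicit rowid plays the role of the IDENTITY cluster key, the GUID is stored as text, and the Note table and the helpers new_store, add_note, and merge_into are hypothetical names invented for this example.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE Note (
    NoteGuid TEXT NOT NULL PRIMARY KEY,  -- logical key, generated by the application
    Body     TEXT NOT NULL
)
"""  # rows are physically ordered by SQLite's implicit rowid, not by NoteGuid


def new_store():
    """Create an independent in-memory database (one per device or server)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def add_note(conn, body):
    """Insert a row whose GUID is generated outside the database."""
    note_guid = str(uuid.uuid4())
    conn.execute("INSERT INTO Note (NoteGuid, Body) VALUES (?, ?)", (note_guid, body))
    return note_guid


def merge_into(target, source):
    """Copy rows by GUID only; the target assigns fresh rowids."""
    rows = source.execute("SELECT NoteGuid, Body FROM Note").fetchall()
    target.executemany("INSERT INTO Note (NoteGuid, Body) VALUES (?, ?)", rows)


phone, tablet, server = new_store(), new_store(), new_store()
phone_guids = [add_note(phone, "from phone 1"), add_note(phone, "from phone 2")]
tablet_guids = [add_note(tablet, "from tablet 1")]

# Both devices used rowid 1 locally, yet the merge does not clash,
# because identity is carried by the GUID and the rowid is regenerated.
merge_into(server, phone)
merge_into(server, tablet)
```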
Here is a great article that demonstrates what happens internally when using an IDENTITY vs a UNIQUEIDENTIFIER for your clustered index.
Effective Clustered Indexes
GUIDs are large but have the advantage of being unique everywhere: this table or that, this server or that, if you have the GUID then everything else is knowable. If that is useful to you, then great, but you will pay for it in overhead, and continue to pay, and pay, and pay....
Material codes only really work for smaller immutable keys, like colors or classification codes and the like. R will always be red, G will be green, it is one byte, etc.
Identity columns come into their own when there may not be a material code, or the natural key is composed of several material codes together, or the natural key is already composed of other identity columns and/or GUIDs, or the natural key is mutable. Yes you could use a GUID but an integer column is much more efficient in all regards.
Another option, available since SQL Server 2012, is sequences, kind of like a database-level identity column. This is a nice halfway house between GUIDs and identity columns, in the sense that a sequence can be used across many tables, so that from a given value, not just the row is knowable, but the table too--but you can still use an INT or BIGINT (or SMALLINT!) if you think that will be enough for your data. That's kind of nifty for certain purposes, kind of like an object id in the OO world.
Be aware that many of the light-weight ORMs expect tables to have a single-column primary key, preferably an integer column, and may not play well with anything but an INT IDENTITY PK.

What are the best practices for using a GUID as a primary key, specifically regarding performance? [closed]

I have an application that uses GUID as the Primary Key in almost all tables and I have read that there are issues about performance when using GUID as Primary Key. Honestly, I haven't seen any problem, but I'm about to start a new application and I still want to use the GUIDs as the Primary Keys, but I was thinking of using a Composite Primary Key (The GUID and maybe another field.)
I'm using a GUID because they are nice and easy to manage when you have different environments such as "production", "test" and "dev" databases, and also for migration data between databases.
I will use Entity Framework 4.3 and I want to assign the Guid in the application code, before inserting it in the database. (i.e. I don't want to let SQL generate the Guid).
What is the best practice for creating GUID-based Primary Keys, in order to avoid the supposed performance hits associated with this approach?
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: about 26.7 MB vs. 106.8 MB - and that's just on a single table!
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.
Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:
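CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
MyINT INT IDENTITY(1,1) NOT NULL
-- .... add more columns as needed ......
);
-- The primary key stays on the GUID, but is explicitly NONCLUSTERED.
ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID);
-- The clustered index goes on the narrow, ever-increasing INT column.
CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT);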
Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED.
This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!
I've been using GUIDs as PKs since 2005. In this distributed database world, it is absolutely the best way to merge distributed data. You can fire and forget merge tables without all the worry of ints matching across joined tables. GUID joins can be copied without any worry.
This is my setup for using GUIDs:
PK = GUID. GUIDs are indexed similar to strings, so high row tables (over 50 million records) may need table partitioning or other performance techniques. SQL Server is getting extremely efficient, so performance concerns are less and less applicable.
PK Guid is NON-Clustered index. Never cluster index a GUID unless it is NewSequentialID. But even then, a server reboot will cause major breaks in ordering.
Add ClusterID Int to every table. This is your CLUSTERED Index... that orders your table.
Joining on ClusterIDs (int) is more efficient, but I work with 20-30 million record tables, so joining on GUIDs doesn't visibly affect performance. If you want max performance, use the ClusterID concept as your primary key & join on ClusterID.
Here is my Email table...
CREATE TABLE [Core].[Email] (
[EmailID] UNIQUEIDENTIFIER CONSTRAINT [DF_Email_EmailID] DEFAULT (newsequentialid()) NOT NULL,
[EmailAddress] NVARCHAR (50) CONSTRAINT [DF_Email_EmailAddress] DEFAULT ('') NOT NULL,
[CreatedDate] DATETIME CONSTRAINT [DF_Email_CreatedDate] DEFAULT (getutcdate()) NOT NULL,
[ClusterID] INT NOT NULL IDENTITY,
CONSTRAINT [PK_Email] PRIMARY KEY NonCLUSTERED ([EmailID] ASC)
);
GO
CREATE UNIQUE CLUSTERED INDEX [IX_Email_ClusterID] ON [Core].[Email] ([ClusterID])
GO
CREATE UNIQUE NONCLUSTERED INDEX [IX_Email_EmailAddress] ON [Core].[Email] ([EmailAddress] Asc)
I am currently developing a web application with EF Core and here is the pattern I use:
All my classes (tables) have an int PK and FK.
I then have an additional column of type Guid (generated by the C# constructor) with a non clustered index on it.
All the joins of tables within EF are managed through the int keys while all the access from outside (controllers) are done with the Guids.
This solution avoids exposing the int keys in URLs while keeping the model tidy and fast.
This link says it better than I could and helped in my decision making. I usually opt for an int as a primary key, unless I have a specific need not to and I also let SQL server auto-generate/maintain this field unless I have some specific reason not to. In reality, performance concerns need to be determined based on your specific app. There are many factors at play here including but not limited to expected db size, proper indexing, efficient querying, and more. Although people may disagree, I think in many scenarios you will not notice a difference with either option and you should choose what is more appropriate for your app and what allows you to develop easier, quicker, and more effectively (If you never complete the app what difference does the rest make :).
https://web.archive.org/web/20120812080710/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html
P.S. I'm not sure why you would use a Composite PK or what benefit you believe that would give you.
Well, if your data never reaches millions of rows, you are fine. If you ask me, I never use a GUID as a database identity column of any kind, including the PK, even if you forced me to design with a shotgun to my head.
Using a GUID as primary key is a definitive scaling stopper, and a critical one.
I recommend you check the database identity and sequence options. A sequence is table-independent and may provide a solution for your needs (MS SQL has sequences).
If your tables start reaching a few dozen million rows, e.g. 50 million, you will not be able to read/write information in acceptable time, and even standard database index maintenance will become impossible.
Then you need to use partitioning to scale up to half a billion or even 1-2 billion rows. Adding partitioning along the way is not the easiest thing: all read/write statements must include the partition column (full app changes!).
These numbers (50 million and 500 million) are of course for light select usage. If you need to select information in a complex way and/or have lots of inserts/updates/deletes, they could even be 1-2 million and 50 million instead, for a very demanding system. If you also add factors like the full recovery model, high availability and no maintenance window, common for modern systems, things become extremely ugly.
Note at this point that the 2 billion limit of int looks bad, but int is 4 times smaller than a GUID and is a sequential type of data; small size and sequential order are the #1 factors for database scalability. And you can use bigint, which is only half the size of a GUID but still sequential. Being sequential is what is really important, even more important than size, when it comes to many millions or a few billion rows.
If the GUID is also clustered, things are much worse. A newly inserted row will be stored at an effectively random physical position.
Even if it is just a column, not the PK or part of the PK, simply indexing it is trouble from a fragmentation perspective.
Having a GUID column is perfectly OK, like any varchar column, as long as you do not use it as part of the PK or, in general, as a key column to join tables. Your database must have its own PK elements and filter and join data using them; additionally filtering by a GUID afterwards is perfectly OK.
Having sequential ID's makes it a LOT easier for a hacker or data miner to compromise your site and data. Keep that in mind when choosing a PK for a website.
If you use a GUID as the primary key and create a clustered index on it, then I suggest using NEWSEQUENTIALID() as its default value.
Another reason not to expose an Id in the user interface is that a competitor can see your Id incrementing over a day or other period and so deduce the volume of business you are doing.
Most of the time a GUID should not be used as the primary key for a table because it really hurts the performance of the database.
Useful links regarding the impact of GUIDs on performance and as a primary key:
https://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
https://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/

When having an identity column is not a good idea?

In tables where you need only one column as the key, and the values in that column can be integers, when should you not use an identity field?
Conversely, for the same table and column, when would you generate the values manually rather than use an autogenerated value for each record?
I guess that it would be the case when there are lots of inserts and deletes on the table. Am I right? What other situations are there?
If you have already settled on the surrogate side of the Great Primary Key Debacle, then I can't find a single reason not to use identity keys. The usual alternatives are GUIDs (they have many disadvantages, primarily from size and randomness) and application-layer generated keys. But creating a surrogate key in the application layer is a little bit harder than it seems and also does not cover non-application data access (i.e. batch loads, imports, other apps, etc.). The one special case is distributed applications, where GUIDs and even sequential GUIDs may offer a better alternative to site id + identity keys.
I suppose if you are creating a many-to-many linking table, where both fields are foreign keys, you don't need an identity field.
Nowadays I imagine that most ORMs expect there to be an identity field in every table. In general, it is a good practice to provide one.
I'm not sure I understand enough about your context, but I interpret your question to be:
"If I need the database to create a unique column (for whatever reason), when shouldn't it be a monotonically increasing integer (identity) column?"
In those cases, there's no reason to use anything other than the facility provided by the DBMS for the purpose; in your case (SQL Server?) that's an identity.
Except:
If you'll ever need to merge the table with data from another source, use a GUID, which will prevent duplicate keys from colliding.
If you need to merge databases it's a lot easier if you don't have to regenerate keys.
One case of not wanting an identity field would be in a one to one relationship. The secondary table would have as its primary key the same value as the primary table. The only reason to have an identity field in that situation would seem to be to satisfy an ORM.
You cannot (normally) specify values when inserting into identity columns, so, for example, if the column "id" was specified as an identity, the following SQL would fail:
INSERT INTO MyTable (id, name) VALUES (1, 'Smith')
In order to perform this sort of insert you need to have IDENTITY_INSERT set to ON for that table. This is not intended to be on normally, and only one table per session can have it set to ON at any point in time.
If I need a surrogate, I would either use an IDENTITY column or a GUID column depending on the need for global uniqueness.
If there is a natural primary key, or the primary key is defined as a unique combination of other foreign keys, then I typically do not have an IDENTITY, nor do I use it as the primary key.
There is an exception, which is snapshot configuration tables which I am tracking with an audit trigger. In this case, there is usually a logical "primary key" (usually date of the snapshot and natural key of the row - like a cost center or gl account number for which the row is a configuration record), but instead of using the natural "primary key" as the primary key, I add an IDENTITY and make that the primary key and make a unique index or constraint on the date and natural key. Although theoretically the date and natural key shouldn't change, in these tables, if a user does that instead of adding a new row and deleting the old row, I want the audit (which reflects a change to a row identified by its primary key) to really reflect a change in the row - not the disappearance of a key and the appearance of a new one.
I recently implemented a Suffix Trie in C# that could index novels, and then allow searches to be done extremely fast, linear to the size of the search string. Part of the requirements (this was a homework assignment) was to use offline storage, so I used MS SQL, and needed a structure to represent a Node in a table.
I ended up with the following structure : NodeID Character ParentID, etc, where the NodeID was a primary key.
I didn't want this to be done as an autoincrementing identity for two main reasons.
How do I get the value of a NodeID after I add it to the database/data table?
I wanted more control when it came to generating my own IDs.
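On the first point, most database APIs can hand back the generated key right after the insert; in SQL Server that is what SCOPE_IDENTITY() is for. As a hedged illustration only, the Python sketch below uses the standard-library sqlite3 module, where the equivalent is the cursor's lastrowid attribute. The Node table layout, the NodeChar column name, and the add_node helper are assumptions made for this example, not the original assignment's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Node (
    NodeID   INTEGER PRIMARY KEY AUTOINCREMENT,  -- generated by the database
    NodeChar TEXT NOT NULL,
    ParentID INTEGER REFERENCES Node (NodeID)    -- NULL for the root
)
""")


def add_node(conn, node_char, parent_id=None):
    """Insert a node and return the key the database generated for it."""
    cur = conn.execute(
        "INSERT INTO Node (NodeChar, ParentID) VALUES (?, ?)",
        (node_char, parent_id),
    )
    return cur.lastrowid  # key of the row just inserted through this cursor


root_id = add_node(conn, "a")
child_id = add_node(conn, "b", parent_id=root_id)
```

The generated key is available immediately, so a child node can reference its parent without the application having to invent IDs itself.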

Should each and every table have a primary key?

I'm creating a database table and I don't have a logical primary key assigned to it. Should each and every table have a primary key?
Short answer: yes.
Long answer:
You need your table to be joinable on something
If you want your table to be clustered, you need some kind of a primary key.
If your table design does not need a primary key, rethink your design: most probably, you are missing something. Why keep identical records?
In MySQL, the InnoDB storage engine always creates a primary key if you didn't specify it explicitly, thus making an extra column you don't have access to.
Note that a primary key can be composite.
If you have a many-to-many link table, you create the primary key on all fields involved in the link. Thus you ensure that you don't have two or more records describing one link.
Besides the logical consistency issues, most RDBMS engines will benefit from including these fields in a unique index.
And since any primary key involves creating a unique index, you should declare it and get both logical consistency and performance.
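As a rough illustration of the composite key on a link table, here is a sketch using Python's sqlite3 module. The `student_course` table and its columns are hypothetical. The primary key spans both columns of the link, so the same pair cannot be stored twice.

```python
import sqlite3

# Hypothetical many-to-many link table; SQLite is used only as a runnable stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student_course (
        student_id INTEGER NOT NULL,
        course_id  INTEGER NOT NULL,
        PRIMARY KEY (student_id, course_id)  -- composite key over the whole link
    )
""")

conn.execute("INSERT INTO student_course VALUES (1, 10)")
conn.execute("INSERT INTO student_course VALUES (1, 11)")  # same student, other course

# Describing the same link a second time violates the composite primary key.
try:
    conn.execute("INSERT INTO student_course VALUES (1, 10)")
    duplicate_link_rejected = False
except sqlite3.IntegrityError:
    duplicate_link_rejected = True

link_count = conn.execute("SELECT COUNT(*) FROM student_course").fetchone()[0]
```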
See this article in my blog for why you should always create a unique index on unique data:
Making an index UNIQUE
P.S. There are some very, very special cases where you don't need a primary key.
Mostly they include log tables which don't have any indexes for performance reasons.
Always best to have a primary key. This way it meets first normal form and allows you to continue along the database normalization path.
As stated by others, there are some reasons not to have a primary key, but most tables will not be harmed by having one.
Disagree with the suggested answer. The short answer is: NO.
The purpose of the primary key is to uniquely identify a row on the table in order to form a relationship with another table. Traditionally, an auto-incremented integer value is used for this purpose, but there are variations to this.
There are cases though, for example logging time-series data, where such a key is simply not needed and just takes up memory. Making a row unique is simply not required!
A small example:
Table A: LogData
Columns: DateAndTime, UserId, AttribA, AttribB, AttribC etc...
No Primary Key needed.
Table B: User
Columns: Id, FirstName, LastName etc.
Primary Key (Id) needed in order to be used as a "foreign key" to LogData table.
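The two tables in this example could be sketched as follows with Python's sqlite3 module. The snake_case names are adaptations of the columns listed above and the sample values are made up. The log table declares no primary key, so two identical rows are both accepted, while the join back to the user table still works through the user's key.

```python
import sqlite3

# SQLite is a stand-in engine here; like InnoDB, it still keeps an internal
# row ID behind the scenes even when no primary key is declared.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT
    );
    CREATE TABLE log_data (               -- no primary key declared
        date_and_time TEXT NOT NULL,
        user_id       INTEGER NOT NULL REFERENCES users(id),
        attrib_a      TEXT
    );
""")

conn.execute("INSERT INTO users VALUES (1, 'Ada', 'Lovelace')")

# Two identical log rows: nothing forces them to be distinguishable.
row = ("2020-01-01 12:00:00", 1, "login")
conn.execute("INSERT INTO log_data VALUES (?, ?, ?)", row)
conn.execute("INSERT INTO log_data VALUES (?, ?, ?)", row)

log_rows = conn.execute("SELECT COUNT(*) FROM log_data").fetchone()[0]
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM log_data AS l
    JOIN users AS u ON u.id = l.user_id
    WHERE u.last_name = 'Lovelace'
""").fetchone()[0]
```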
Pretty much any time I've created a table without a primary key, thinking I wouldn't need one, I've ended up going back and adding one. I now create even my join tables with an auto-generated identity field that I use as the primary key.
Except for a few very rare cases (possibly a many-to-many relationship table, or a table you temporarily use for bulk-loading huge amounts of data), I would go with the saying:
If it doesn't have a primary key, it's not a table!
Marc
Just add it. You will be sorry later if you don't (selecting, deleting, linking, etc.).
Will you ever need to join this table to other tables? Do you need a way to uniquely identify a record? If the answer is yes, you need a primary key. Assume your data is something like a customer table that has the names of the people who are customers. There may be no natural key because you need the addresses, emails, phone numbers, etc. to determine if this Sally Smith is different from that Sally Smith, and you will be storing that information in related tables as the person can have multiple phones, addresses, emails, etc. Suppose Sally Smith marries John Jones and becomes Sally Jones. If you don't have an artificial key on the table, when you update the name, you just changed 7 Sally Smiths to Sally Jones even though only one of them got married and changed her name. And of course in this case, without an artificial key, how do you know which Sally Smith lives in Chicago and which one lives in LA?
You say you have no natural key, so you don't have any combination of fields that is unique either. This makes the artificial key critical.
I have found that anytime I don't have a natural key, an artificial key is an absolute must for maintaining data integrity. If you do have a natural key, you can use that as the key field instead. But personally, unless the natural key is one field, I still prefer an artificial key and a unique index on the natural key. You will regret it later if you don't put one in.
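The Sally Smith problem can be reproduced in a few lines. This is only a sketch using Python's sqlite3 module, with a hypothetical `customers` table and three customers instead of seven. Updating by name touches every Sally Smith, while updating by the artificial key touches exactly one.

```python
import sqlite3

# Hypothetical customers table; SQLite is used only as a runnable stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- artificial key
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Sally Smith", "Chicago"),
     ("Sally Smith", "Los Angeles"),
     ("Sally Smith", "Boston")],
)
conn.commit()

# Updating by name alone renames every Sally Smith at once.
rows_by_name = conn.execute(
    "UPDATE customers SET name = 'Sally Jones' WHERE name = 'Sally Smith'"
).rowcount
conn.rollback()  # undo the over-broad update

# Updating by the artificial key renames only the customer who got married.
chicago_id = conn.execute(
    "SELECT customer_id FROM customers WHERE city = 'Chicago'"
).fetchone()[0]
rows_by_id = conn.execute(
    "UPDATE customers SET name = 'Sally Jones' WHERE customer_id = ?",
    (chicago_id,),
).rowcount
conn.commit()

remaining_smiths = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE name = 'Sally Smith'"
).fetchone()[0]
```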
It is good practice to have a PK on every table, but it's not a MUST. Most probably you will need a unique index and/or a clustered index (which may or may not be the PK), depending on your needs.
Check out the Primary Keys and Clustered Indexes sections on Books Online (for SQL Server)
"PRIMARY KEY constraints identify the column or set of columns that have values that uniquely identify a row in a table. No two rows in a table can have the same primary key value. You cannot enter NULL for any column in a primary key. We recommend using a small, integer column as a primary key. Each table should have a primary key. A column or combination of columns that qualify as a primary key value is referred to as a candidate key."
But then check this out also: http://www.aisintl.com/case/primary_and_foreign_key.html
To make it future proof you really should. If you want to replicate it you'll need one. If you want to join it to another table your life (and that of the poor fools who have to maintain it next year) will be so much easier.
I maintain an application created by an offshore development team. I am now having all kinds of issues in the application because the original database schema did not contain PRIMARY KEYS on some tables. So please don't let other people suffer because of your poor design. It is always a good idea to have primary keys on tables.
Late to the party but I wanted to add my two cents:
Should each and every table have a primary key?
If you are talking about "Relational Algebra", the answer is Yes. Modelling data this way requires the entities and tables to have a primary key. The problem with relational algebra (apart from the fact that there are like 20 different, mismatching flavors of it) is that it only exists on paper. You can't build real world applications using relational algebra.
Now, if you are talking about databases from real world apps, they partially/mostly adhere to the relational algebra, by taking the best of it and by overlooking other parts of it. Also, database engines offer massive non-relational functionality nowadays (it's 2020 now). So in this case the answer is No. In any case, 99.9% of my real world tables have a primary key, but there are justifiable exceptions. Case in point: event/log tables (multiple indexes, but not a single key in sight).
Bottom line, in transactional applications that follow the entity/relationship model it makes a lot of sense to have primary keys for almost (if not) all of the tables. If you ever decide to skip the primary key of a table, make sure you have a good reason for it, and you are prepared to defend your decision.
I know that in order to use certain features of the gridview in .NET, you need a primary key in order for the gridview to know which row needs updating/deleting. General practice should be to have a primary key or primary key cluster. I personally prefer the former.
I'd like to find something official like this - 15.6.2.1 Clustered and Secondary Indexes - MySQL.
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
So why not create a primary key, or something like it, yourself? Besides, an ORM cannot see this hidden ID, meaning that you cannot use the ID in your code.
I always have a primary key, even if in the beginning I don't have a purpose in mind yet for it. There have been a few times when I eventually need a PK in a table that doesn't have one and it's always more trouble to put it in later. I think there is more of an upside to always including one.
If you are using Hibernate, it's not possible to create an Entity without a primary key. This can cause problems if you are working with an existing database that was created with plain SQL/DDL scripts and no primary key was added.
In short, no. However, you need to keep in mind that certain client access CRUD operations require it. For future proofing, I tend to always utilize primary keys.

Resources