Does having a string column as the primary key instead of an integer column adversely affect search times and/or insertion times?
Scenarios
a. A common scenario for any application is to make this query every time someone creates a new user account:
Does that user name already exist or is it taken by someone else?
b. And when a person logs in, another query that looks up the user name is required to be made like so:
Does a row with that UserName exist in the User table?
c. Similarly, when a user says they've forgotten their password, we need to search based on their email.
Does a row with that Email exist in the User table?
d. It is only in the case of linking up the User table with other user related tables such as UserRole, UserClaim, etc. that we may need to join them based on an integer Id like so:
SELECT *
FROM [User]
JOIN UserClaim ON [User].Id = UserClaim.UserId;
Having an Integer as the Primary Key vs. Having a String as the Primary Key
Till now, I've always just had a user table with an integer primary key (and clustered index thereon), like so:
User
-----
Id int primary key identity(1, 1),
UserName nvarchar(50) not null,
Email nvarchar(100) not null,
PasswordHash nvarchar(32) not null
However, now that I think about the use cases I described above, I am wondering whether it would be better to eliminate the integer primary key entirely and make either the UserName or the Email field the primary key, like so:
User
-----
UserName nvarchar(50) primary key,
Email nvarchar(100) not null,
PasswordHash nvarchar(32) not null
That would create a clustered index on the UserName field, probably speeding up queries in scenarios a and b listed above, but I am not sure of the impact on scenarios c and d, because that would depend on the speed of comparing integers versus the speed of comparing index entries based on a string column.
Questions
However, that leaves me with a few loose ends I need to tie up before I can commit to this design:
1. Does making a clustered index on a text field like the above have any performance implications? How does it affect insertion times? Search times?
I would imagine creating an index on an integer is faster than on a string?
2. We can have only one clustered index. If I allow my users to log in using either a user name or email, whichever they like, then I am going to have to make searches on both the UserName and Email fields just as frequently. How do I manage that? Should I make a non-clustered index on the Email field?
3. Would having a string column as the primary key have an impact on performance of the joins I do with other link tables like so:
SELECT * FROM [User]
JOIN UserRole ON [User].UserName = UserRole.UserName;
4. Considering #3, it looks like I should just keep the integer Id column in the User table and create a non-clustered index each on the UserName and Email columns?
I am using Microsoft SQL Server 2014.
Does making a clustered index on a text field like the above have any
performance implications? How does it affect insertion times? Search
times?
Every row of every non-clustered index contains the clustered index key as its row locator. An INT takes 4 bytes, whereas your Unicode string column Email, declared as NVARCHAR(100), can occupy up to 200 bytes.
Clustered indexes are good for range scans. Range scans on email addresses are hardly to be expected.
An identity-based clustered index all but guarantees close to zero fragmentation and fast inserts, because there are no page splits.
We can have only one clustered index. If I allow my users to login
using either a user name or email, anyone they like, then I am going
to have to make searches on both the UserName and Email fields just as
frequently. How do I manage that? Should I make a non-clustered index
on the Email field?
Yes. If you decide to make a unique clustered index on UserName, you will want another nonclustered index on Email. If a user searches by the Email column, the UserName column will be part of that index automatically (for the reason explained in the first point above), so the index will cover a lookup that needs only those two columns.
Would having a string column as the primary key have an impact on
performance of the joins
A clustered index on the UserName column is optimal for such joins because it keeps the data preordered, so on large datasets HASH joins are more likely to be replaced by MERGE joins.
Considering #3, it looks like I should just keep the integer Id column
in the User table and create a non-clustered index each on the
UserName and Email columns?
It very much depends on your workload. If you frequently join that table on the UserName column, a clustered index on that column may work well for you. In that case, you can create a unique non-clustered index on the Email field and keep the primary key on Id, but make it non-clustered as well.
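A minimal sketch of that layout, assuming the column definitions from the question; the constraint names are made up for illustration, and whether this beats a clustered identity key depends on the workload:

```sql
-- Sketch only: constraint names are hypothetical.
CREATE TABLE dbo.[User]
(
    Id           INT IDENTITY(1, 1) NOT NULL,
    UserName     NVARCHAR(50)       NOT NULL,
    Email        NVARCHAR(100)      NOT NULL,
    PasswordHash NVARCHAR(32)       NOT NULL,
    -- Primary key stays on Id, but is not the clustering key.
    CONSTRAINT PK_User PRIMARY KEY NONCLUSTERED (Id),
    -- Rows are physically ordered by UserName.
    CONSTRAINT UQ_User_UserName UNIQUE CLUSTERED (UserName),
    -- Each Email index row also carries UserName, the clustering key.
    CONSTRAINT UQ_User_Email UNIQUE NONCLUSTERED (Email)
);
GO

-- This lookup needs only Email and UserName, so the Email index
-- can answer it without touching the clustered index.
DECLARE @Email NVARCHAR(100) = N'someone@example.com';
SELECT UserName
FROM dbo.[User]
WHERE Email = @Email;
```

Comparing the execution plan of the last query with one that also selects PasswordHash should show the difference between a covered seek and a seek plus key lookup.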
(This post is based pretty much on personal opinion)
Related
I need to create a table with a key that is a 256 Bit hash number. Fast searching and retrieving is crucial, so I am wondering what data structure to use as key ?
One option would be a varchar[32], but I guess searching will be very slow. The stored data amount will be much higher than a numerical solution.
A second option would be two different decimal[16] integers and combine them into a compound key, but I am sceptical if that would have a faster search performance than option #1.
I googled that topic, but didn't find solutions; perhaps some third option ? Any hints appreciated.
It is good for the PRIMARY KEY of a table to be a surrogate key, and a number if possible, using SMALLINT, INT or BIGINT with IDENTITY applied. A definition such as:
[RowID] INT IDENTITY(1,1)
will help you solve some common issues - most importantly, when new records are created they are appended at the end of the last index page, so there is no page splitting or fragmentation on insert.
An additional column can be added for your hash value, and you can create an index on it to make searching by hash faster.
For example, I have an IP addresses table holding all addresses used in the application (basically addresses used by users to log in).
The table looks like this:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[IPAddresses]
(
[IPAddressID] BIGINT IDENTITY(10000,1) NOT NULL
,[IPAddress] VARBINARY(84) NOT NULL
,[IPAddressHash] VARBINARY(64) NULL
,CONSTRAINT [PK_IPAddresses] PRIMARY KEY
(
[IPAddressID] ASC
)
,INDEX [IX_IPAddresses_IPAddressHash]
(
[IPAddressHash] ASC
)
)
GO
As the IPAddress is personal data, it should be encrypted. And as I want my data to be normalized and I do not want duplicate records, I need to check each time a user logs in whether the address exists and, if not, create it. I am doing this using the following routine:
Here, I am passing the address and calculating a hash by which I am searching. The original version, instead of using a hash, decrypted all the values and searched by the text, but for millions of IPs this was very slow, and this routine executes constantly. On the other hand, I am performing only inserts in this table and records are only appended, so there is no fragmentation at all.
So, my advice is:
use number column with identity as primary key
add the rest of the columns in your table
add hash column and build a hash by the column(s) used for searches
create index on this hash column
then when you need to search for a record, use the hash to locate the PK ID and then use the ID to extract the record
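The steps above could be sketched roughly as follows. This is not the routine from my application; the table, column and index names are hypothetical, and SHA2_256 via HASHBYTES is just one way to get a 256-bit value:

```sql
-- Hypothetical table following the advice above.
CREATE TABLE dbo.Documents
(
    DocumentID  BIGINT IDENTITY(1, 1) NOT NULL,
    Payload     NVARCHAR(400)         NOT NULL,
    PayloadHash BINARY(32)            NOT NULL,  -- 256 bits = 32 bytes
    CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (DocumentID),
    INDEX IX_Documents_PayloadHash NONCLUSTERED (PayloadHash)
);
GO

DECLARE @Payload NVARCHAR(400) = N'example value';
DECLARE @Hash    BINARY(32)    = HASHBYTES('SHA2_256', @Payload);

-- Insert only if no row with this hash exists yet.
-- (Not safe under concurrent inserts without extra locking or a unique index.)
INSERT INTO dbo.Documents (Payload, PayloadHash)
SELECT @Payload, @Hash
WHERE NOT EXISTS (SELECT 1 FROM dbo.Documents WHERE PayloadHash = @Hash);

-- Search by hash to find the ID, then use the ID to fetch the row.
DECLARE @ID BIGINT =
    (SELECT TOP (1) DocumentID FROM dbo.Documents WHERE PayloadHash = @Hash);

SELECT DocumentID, Payload
FROM dbo.Documents
WHERE DocumentID = @ID;
```

Storing the hash as BINARY(32) rather than as text keeps the index key at 32 bytes per row.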
I need to create a table with a single column: email.
Can I create a table with a single column and add a clustered index on email? Or should I create an identity column and a non-clustered index on email?
The table will hold around a million email addresses.
Developers will use this table, and I imagine they will just do where xxxx in (select email from table); the way I see it, there is no other way of using this table.
I will run a merge once a week that will insert new emails. Not sure if I should do a merge, if it is uniquely clustered on email. I can just insert and hopefully if a record is duplicated it would not insert it and continue with the rest, right?
This is mostly a matter of personal choice. There are some performance improvements from having the identity column as your clustered index key when your code is inserting, updating and deleting.
I would create the identity column as the clustered index and make the email a separate column. It's ideal to have the clustered index key as an ever-increasing value.
What happens if you enter the same email twice? Should that be two separate rows or should that cause an error? These are things to think about when you design this table.
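If duplicates should simply be skipped on insert, one option is a unique index created with IGNORE_DUP_KEY = ON. A sketch, assuming a hypothetical table name and an assumed column length:

```sql
-- Hypothetical table: identity as the clustered key, email as a separate column.
CREATE TABLE dbo.EmailList
(
    EmailID INT IDENTITY(1, 1) NOT NULL,
    Email   NVARCHAR(100)      NOT NULL,  -- length is an assumption
    CONSTRAINT PK_EmailList PRIMARY KEY CLUSTERED (EmailID)
);
GO

-- Duplicate emails are skipped with a warning instead of failing the statement.
CREATE UNIQUE NONCLUSTERED INDEX UX_EmailList_Email
    ON dbo.EmailList (Email)
    WITH (IGNORE_DUP_KEY = ON);
GO

-- The third value duplicates the first, so it is ignored and the rest are kept.
INSERT INTO dbo.EmailList (Email)
VALUES (N'a@example.com'), (N'b@example.com'), (N'a@example.com');

SELECT Email FROM dbo.EmailList;
```

Without IGNORE_DUP_KEY, the same INSERT would fail with a unique key violation and insert nothing, which may be the behavior you actually want.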
Let's say I have a table like this:
CREATE TABLE t(
[guid] [uniqueidentifier] NOT NULL,
[category] [nvarchar](400)
{,...other columns}
)
Where guid is my primary key, and has a clustered index.
Now, I want an index that covers both category and guid, because I'm rolling up some other stuff related to t by category, and I want to avoid including the t table itself.
Is it sufficient to create index covering category, or do I need to include guid as well?
I would expect SQL Server indexes to point directly to page offsets in t rather than simply referring to a guid primary key value, which means I would need to explicitly include the PK column to avoid hitting t. Is this the case?
Actually your assumption is wrong - all SQL Server non-clustered indices do include the clustering key (single or multiple columns) and do not point directly at some physical page.
This prevents SQL Server from having to reorganize and update lots of index entries when a page needs to be split in two or relocated. So if you are seeking in a non-clustered index and you find a value, then you have the clustering key and SQL Server will need to do a "bookmark lookup" (or key lookup) to retrieve the actual data page (the leaf page in the clustering index) to get the whole set of data belonging to a single row.
That said - if you ever have a situation where it depends on the ordering of the key columns, then you still might need to create an index specifically on (guid, category) - of course, in that case, SQL Server is smart enough to figure out that the clustering key column is already in the index and won't be adding it one more time.
The fact that the clustering key column(s) are included in every single non-clustered index is another strong reason why your clustering keys should be narrow, static and unique. Making them too wide (anything beyond 8 bytes) is a sure recipe for bloat and slow-down.
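To make this concrete, here is a small sketch based on the table from the question; the constraint and index names and the sample category value are made up:

```sql
-- Table from the question, reduced to the two relevant columns.
CREATE TABLE dbo.t
(
    [guid]     UNIQUEIDENTIFIER NOT NULL,
    [category] NVARCHAR(400)    NULL,
    CONSTRAINT PK_t PRIMARY KEY CLUSTERED ([guid])
);
GO

-- Only category is listed; the clustering key (guid) is carried
-- in the index rows without being declared.
CREATE NONCLUSTERED INDEX IX_t_category ON dbo.t ([category]);
GO

-- Needs only category and guid, so the nonclustered index alone can
-- satisfy it; the plan should show no key lookup into the clustered index.
SELECT [category], [guid]
FROM dbo.t
WHERE [category] = N'books';
```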
Differing slightly to marc_s' answer.
A covering index on (category, guid) will have a different sort on GUID to the primary key sort. Therefore, guid may appear twice in the index because it is in the key column list and the pointer to the clustered index.
If you INCLUDEd (as a non-key column) guid SQL Server won't add it again.
I can't test the key column thing just now, but I have verified the INCLUDE one before on SQL Server 2005.
I have a table that stores user info.
In the User table, username is unique. Do you think I should make username the primary key, or should I use a surrogate key that is an int?
Would using a string key hit performance badly?
Use a surrogate integer key.
Usernames won't change that often, but they could.
As to performance, don't worry about that until you know you have a problem.
SQL Server will create the clustered index on the Primary key column by default. If you use a wide key in the clustered index, all non-clustered indexes will also contain that wide key.
Generally, use an int as the primary key. This is due in part to convention, as well as to saving space when it is used as a foreign key in other tables. In reality, using your username field as the primary key wouldn't hurt performance unless you wind up having thousands of records in multiple tables using it. If you think your tables will remain small, it's up to preference.
I would use an identity surrogate primary key and cluster on that. The clustered index key is included in all non-clustered indexes and should be narrow, static and increasing.
As far as a primary key, you COULD make the username the primary key, BUT since foreign keys will reference it, you also want it to be static (which a username is not). So I would make a non-clustered unique index on username. The identity PK will automatically be included in the NCI.
I would include any other columns in that same index (as included columns) depending upon usage patterns where access is primarily by username - for instance, the password hash, maybe the name. But I'd check the execution plans, and use the profiler and/or the index tuning wizard with expected workloads.
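As a rough sketch of what I mean (table, column and index names are hypothetical, and the included column is just one plausible choice):

```sql
-- Hypothetical user table clustered on an identity surrogate key.
CREATE TABLE dbo.Users
(
    UserID       INT IDENTITY(1, 1) NOT NULL,
    UserName     NVARCHAR(50)       NOT NULL,
    PasswordHash NVARCHAR(32)       NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (UserID)
);
GO

-- Unique on UserName; PasswordHash is stored at the leaf level only,
-- and UserID rides along as the clustering key.
CREATE UNIQUE NONCLUSTERED INDEX UX_Users_UserName
    ON dbo.Users (UserName)
    INCLUDE (PasswordHash);
GO

-- A login lookup that this index can answer on its own.
DECLARE @UserName NVARCHAR(50) = N'alice';
SELECT UserID, PasswordHash
FROM dbo.Users
WHERE UserName = @UserName;
```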
In tables where you need only one column as the key, and the values in that column can be integers, when shouldn't you use an identity field?
Conversely, in the same table and column, when would you generate the values manually rather than use an autogenerated value for each record?
I guess that would be the case when there are lots of inserts and deletes on the table. Am I right? What other situations could there be?
If you have already settled on the surrogate side of the Great Primary Key Debacle, then I can't find a single reason not to use identity keys. The usual alternatives are GUIDs (they have many disadvantages, primarily from size and randomness) and keys generated in the application layer. But creating a surrogate key in the application layer is a little bit harder than it seems, and it also does not cover non-application data access (i.e. batch loads, imports, other apps, etc.). The one special case is distributed applications, where GUIDs and even sequential GUIDs may offer a better alternative to site id + identity keys.
I suppose if you are creating a many-to-many linking table, where both fields are foreign keys, you don't need an identity field.
Nowadays I imagine that most ORMs expect there to be an identity field in every table. In general, it is a good practice to provide one.
I'm not sure I understand enough about your context, but I interpret your question to be:
"If I need the database to create a unique column (for whatever reason), when shouldn't it be a monotonically increasing integer (identity) column?"
In those cases, there's no reason to use anything other than the facility provided by the DBMS for the purpose; in your case (SQL Server?) that's an identity.
Except:
If you'll ever need to merge the table with data from another source, use a GUID, which will prevent duplicate keys from colliding.
If you need to merge databases it's a lot easier if you don't have to regenerate keys.
One case of not wanting an identity field would be in a one to one relationship. The secondary table would have as its primary key the same value as the primary table. The only reason to have an identity field in that situation would seem to be to satisfy an ORM.
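A minimal sketch of such a one-to-one pair, with made-up table and column names:

```sql
-- Primary table: generates the key.
CREATE TABLE dbo.Person
(
    PersonID INT IDENTITY(1, 1) NOT NULL,
    FullName NVARCHAR(100)      NOT NULL,
    CONSTRAINT PK_Person PRIMARY KEY CLUSTERED (PersonID)
);

-- Secondary table: its primary key is the same value, supplied by the
-- caller, and doubles as the foreign key. No identity of its own.
CREATE TABLE dbo.PersonDetail
(
    PersonID INT            NOT NULL,
    Notes    NVARCHAR(400)  NULL,
    CONSTRAINT PK_PersonDetail PRIMARY KEY CLUSTERED (PersonID),
    CONSTRAINT FK_PersonDetail_Person
        FOREIGN KEY (PersonID) REFERENCES dbo.Person (PersonID)
);
```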
You cannot (normally) specify values when inserting into identity columns, so for example if the column "id" was specified as an identity, the following SQL would fail:
INSERT INTO MyTable (id, name) VALUES (1, 'Smith')
In order to perform this sort of insert you need to have IDENTITY_INSERT set to ON for that table - this is not intended to be on normally, and it can be ON for at most one table per session at any point in time.
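For illustration, a sketch with an assumed definition of MyTable (the column types are guesses):

```sql
-- Assumed shape of MyTable; only the identity column matters here.
CREATE TABLE dbo.MyTable
(
    id   INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    name NVARCHAR(50)       NOT NULL
);
GO

-- Allow explicit values for the identity column in this session.
SET IDENTITY_INSERT dbo.MyTable ON;

-- An explicit column list is required while IDENTITY_INSERT is ON.
INSERT INTO dbo.MyTable (id, name) VALUES (1, 'Smith');

-- Turn it back off so normal identity generation resumes.
SET IDENTITY_INSERT dbo.MyTable OFF;
```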
If I need a surrogate, I would either use an IDENTITY column or a GUID column depending on the need for global uniqueness.
If there is a natural primary key, or the primary key is defined as a unique combination of other foreign keys, then I typically do not have an IDENTITY, nor do I use it as the primary key.
There is an exception, which is snapshot configuration tables which I am tracking with an audit trigger. In this case, there is usually a logical "primary key" (usually date of the snapshot and natural key of the row - like a cost center or gl account number for which the row is a configuration record), but instead of using the natural "primary key" as the primary key, I add an IDENTITY and make that the primary key and make a unique index or constraint on the date and natural key. Although theoretically the date and natural key shouldn't change, in these tables, if a user does that instead of adding a new row and deleting the old row, I want the audit (which reflects a change to a row identified by its primary key) to really reflect a change in the row - not the disappearance of a key and the appearance of a new one.
I recently implemented a Suffix Trie in C# that could index novels and then allow searches to be done extremely fast, in time linear in the size of the search string. Part of the requirements (this was a homework assignment) was to use offline storage, so I used MS SQL, and needed a structure to represent a Node in a table.
I ended up with the following structure : NodeID Character ParentID, etc, where the NodeID was a primary key.
I didn't want this to be done as an autoincrementing identity for two main reasons.
How do I get the value of a NodeID after I add it to the database/data table?
I wanted more control when it came to generating my own IDs.
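For reference on the first point, had NodeID been an identity column, the generated value can be read back in two common ways. A sketch, assuming a simplified Node table (the column types are guesses):

```sql
-- Simplified, assumed version of the Node table with an identity key.
CREATE TABLE dbo.Node
(
    NodeID      INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    [Character] NCHAR(1)           NOT NULL,
    ParentID    INT                NULL
);
GO

-- Option 1: SCOPE_IDENTITY() returns the last identity value generated
-- in the current scope, so read it right after the INSERT.
INSERT INTO dbo.Node ([Character], ParentID) VALUES (N'a', NULL);
DECLARE @RootID INT = SCOPE_IDENTITY();

-- Option 2: the OUTPUT clause returns the generated key as a result set,
-- which also works for multi-row inserts.
INSERT INTO dbo.Node ([Character], ParentID)
OUTPUT inserted.NodeID
VALUES (N'b', @RootID);
```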