I need to create a table with a single column: email.
Can I create a table with a single column and add a clustered index on email? Or should I create an identity column and put a non-clustered index on email?
The table will hold around a million email addresses.
Developers will use this table, and I imagine they will just do where xxxx in (select email from table); the way I see it, there is no other way of using this table.
I will run a merge once a week that will insert new emails. I am not sure whether I need a merge at all if the table is uniquely clustered on email. Can I just insert, and if a record is duplicated it will be skipped while the rest are inserted?
This is mostly a matter of personal choice. There are some performance improvements to having the identity column as your clustered index key when your code is inserting, updating, or deleting.
I would create the identity column as the clustered index and make the email a separate column. It's ideal to have the clustered index key as an ever-increasing value.
What happens if you enter the same email twice? Should that be two separate rows or should that cause an error? These are things to think about when you design this table.
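A minimal sketch of the layout suggested above, assuming duplicate emails should not be stored. All object names (dbo.EmailList, dbo.EmailStaging, the constraint names) and the VARCHAR(254) length are hypothetical. Note that by default a duplicate in a unique index fails the whole INSERT statement rather than skipping the offending row, so the weekly load filters out existing addresses first. SQL Server also has an IGNORE_DUP_KEY index option that discards duplicates with a warning, if that behaviour is preferred.

```sql
-- Hypothetical names; adjust to your schema.
CREATE TABLE dbo.EmailList
(
    EmailId INT IDENTITY(1,1) NOT NULL,
    Email   VARCHAR(254)      NOT NULL,
    -- Ever-increasing identity value as the clustered key.
    CONSTRAINT PK_EmailList PRIMARY KEY CLUSTERED (EmailId),
    -- Unique non-clustered index used by WHERE ... IN (SELECT Email ...) lookups.
    CONSTRAINT UQ_EmailList_Email UNIQUE NONCLUSTERED (Email)
);

-- Weekly load: insert only addresses that are not already present.
-- dbo.EmailStaging is an assumed staging table holding the new batch.
INSERT INTO dbo.EmailList (Email)
SELECT DISTINCT s.Email
FROM dbo.EmailStaging AS s
WHERE s.Email IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM dbo.EmailList AS e WHERE e.Email = s.Email);
```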
My boss has assigned a SQL task to me and as I am new to SQL, I am struggling to know where to start.
Task: Create a Customer table to hold the data written in the #Customer temporary table in the PopulateCustomers stored procedure. This table will also need to have a unique id to ensure multiple instances of the populate functionality can be run concurrently.
I know how to create a table in SQL, and I am guessing I can look in the PopulateCustomers stored procedure to see what data is written to the #Customer temp table in order to create the columns for the Customer table.
But what I am really struggling with is the concept of a unique Id for a database table. I immediately thought of a primary key for each row in the table, to which my boss responded no. I didn't want to push for more so as not to come across as a newbie.
I have tried to google this myself, and all I keep coming up with is pages that tell me about identifiers vs primary keys. But nothing ever tells me about a table having its own unique ID, unless it's in reference to the rows within the table each having an identifier or primary key. This is leading me to think that I am not searching for the right keyword for this functionality.
The closest thing I found was here. http://sqlservercodebook.blogspot.com/2008/03/check-if-temporary-table-exists.html
This query looks to me like it's creating a temp table with an id.
CREATE TABLE #temp(id INT)
I have not pasted any of my work queries because I really want to research myself and figure this out. I just want to make sure I am looking in the right direction with what term I need to search for to find out how to create a table that has a unique ID. Or maybe I have misinterpreted the task and there is no such thing.
What I got from your story is that you need a table with a unique id that is automatically generated, and that you want to use this id as the primary key.
This table can be created like:
create table example
(
id int identity(1,1) primary key clustered,
other_data varchar(200)
)
The key terms here are:
identity - makes the id column auto-increment
primary key - makes SQL Server ensure this column is unique
clustered - makes the data in this table physically organized by this column (which makes searching by it faster)
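As a small usage sketch against the example table above: the id column is left out of the INSERT so SQL Server generates it, and SCOPE_IDENTITY() reads back the value generated in the current scope. The inserted text is just illustrative data.

```sql
-- The id column is omitted; the identity property generates its value.
INSERT INTO example (other_data) VALUES ('first row');

-- Returns the identity value generated by the insert above in this scope.
SELECT SCOPE_IDENTITY() AS new_id;
```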
Does having a string column as the primary key instead of an integer column adversely affect search times and/or insertion times?
Scenarios
a. A common scenario for any application is to make this query every time someone creates a new user account:
Does that user name already exist or is it taken by someone else?
b. And when a person logs in, another query that looks up the user name is required to be made like so:
Does a row with that UserName exist in the User table?
c. Similarly, when a user says they've forgotten their password, we need to search based on their email.
Does a row with that Email exist in the User table?
d. It is only in the case of linking up the User table with other user related tables such as UserRole, UserClaim, etc. that we may need to join them based on an integer Id like so:
-- User is a reserved word in T-SQL, so the table name is bracketed.
SELECT *
FROM [User] AS u
JOIN UserClaim AS uc ON u.Id = uc.UserId;
Having an Integer as the Primary Key vs. Having a String as the Primary Key
Till now, I've always just had a user table with an integer primary key (and clustered index thereon), like so:
User
-----
Id int primary key identity(1, 1),
UserName nvarchar(50) not null,
Email nvarchar(100) not null,
PasswordHash nvarchar(32) not null
However, now contemplating over the use-cases I described above, I am wondering if it is more fruitful to instead completely eliminate the integer primary key and instead make one of the UserName or Email field as the primary key like so:
User
-----
UserName nvarchar(50) primary key,
Email nvarchar(100) not null,
PasswordHash nvarchar(32) not null
That would create a clustered index on the UserName field, probably speeding up queries in scenarios a and b listed above, but I am not sure of the impact on scenarios c and d, because that would depend on the speed of comparing integers versus the speed of comparing index entries based on a string column.
Questions
However, that leaves me with a few loose ends I need to tie up before I can commit to this design:
Does making a clustered index on a text field like the above have any performance implications? How does it affect insertion times? Search times?
I would imagine creating an index on an integer is faster than on a string?
We can have only one clustered index. If I allow my users to login using either a user name or email, anyone they like, then I am going to have to make searches on both the UserName and Email fields just as frequently. How do I manage that? Should I make a non-clustered index on the Email field?
Would having a string column as the primary key have an impact on performance of the joins I do with other link tables like so:
SELECT * FROM [User] AS u
JOIN UserRole AS ur ON u.UserName = ur.UserName;
Considering #3, it looks like I should just keep the integer Id column in the User table and create a non-clustered index each on the UserName and Email columns?
I am using Microsoft SQL Server 2014.
Does making a clustered index on a text field like the above have any
performance implications? How does it affect insertion times? Search
times?
Every row of every non-clustered index contains the clustered index key as its row locator. An INT key takes 4 bytes, whereas a Unicode string column such as Email, declared as NVARCHAR(100), can take up to 200 bytes.
Clustered indexes are good for range scans. Range scan on email addresses are hardly expected.
An identity-based clustered index practically guarantees close to zero fragmentation and fast inserts, because new rows are appended at the end of the index and page splits are avoided.
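A minimal sketch of the integer-keyed layout these points favour, using the column definitions from the question. The constraint and index names are hypothetical, and the unique indexes assume that neither a user name nor an email may be shared by two accounts.

```sql
-- Narrow, ever-increasing clustered key; every non-clustered index
-- then carries only the 4-byte Id as its row locator.
CREATE TABLE dbo.[User]
(
    Id           INT IDENTITY(1,1) NOT NULL,
    UserName     NVARCHAR(50)      NOT NULL,
    Email        NVARCHAR(100)     NOT NULL,
    PasswordHash NVARCHAR(32)      NOT NULL,
    CONSTRAINT PK_User PRIMARY KEY CLUSTERED (Id)
);

-- Seek-friendly indexes for the lookups in scenarios a, b and c.
CREATE UNIQUE NONCLUSTERED INDEX UX_User_UserName ON dbo.[User] (UserName);
CREATE UNIQUE NONCLUSTERED INDEX UX_User_Email    ON dbo.[User] (Email);
```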
We can have only one clustered index. If I allow my users to login
using either a user name or email, anyone they like, then I am going
to have to make searches on both the UserName and Email fields just as
frequently. How do I manage that? Should I make a non-clustered index
on the Email field?
Yes, if you decide to make a unique clustered index on UserName, you will want another non-clustered index on Email. When a user is looked up by the Email column, the UserName column is automatically part of that index (for the reason explained in the point above), so the index covers a query that needs only those two columns.
Would having a string column as the primary key have an impact on
performance of the joins
A clustered index on the UserName column is optimal for such joins, because it keeps the data pre-ordered, so on large datasets the optimizer is more likely to choose MERGE joins instead of HASH joins.
Considering #3, it looks like I should just keep the integer Id column
in the User table and create a non-clustered index each on the
UserName and Email columns?
It very much depends on your workload. If you frequently have to join that table on the UserName column, a clustered index on that column may well work for you. In that case, you can create a unique non-clustered index on the Email column and keep the primary key on Id, but make it non-clustered as well.
(This post is based pretty much on personal opinion)
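The UserName-clustered alternative described in this answer could be sketched roughly as follows. Constraint and index names are hypothetical, and the uniqueness of UserName and Email is assumed from the login scenarios in the question.

```sql
-- Primary key stays on Id but is enforced by a non-clustered index.
CREATE TABLE dbo.[User]
(
    Id           INT IDENTITY(1,1) NOT NULL,
    UserName     NVARCHAR(50)      NOT NULL,
    Email        NVARCHAR(100)     NOT NULL,
    PasswordHash NVARCHAR(32)      NOT NULL,
    CONSTRAINT PK_User PRIMARY KEY NONCLUSTERED (Id)
);

-- Rows are stored in UserName order, which suits joins on UserName.
CREATE UNIQUE CLUSTERED INDEX CX_User_UserName ON dbo.[User] (UserName);

-- Its row locator is the clustered key, so this index also carries UserName.
CREATE UNIQUE NONCLUSTERED INDEX UX_User_Email ON dbo.[User] (Email);
```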
I am doing a review of some DB tables that were created in our project and came across this. The table contains an identity column (ID) which is the primary key for the table, and a clustered index has been defined on this ID column. But when I look at the SPROC that retrieves records from this table, I see that the ID column is never used in the query; the records are queried by a USERID column (this column is not unique), and there can be multiple records for the same USERID.
So my question is: is there any advantage or purpose in creating a clustered index when we know that the records won't be queried by that column?
If the IDENTITY column is never used in WHERE and JOIN clauses, or referenced by foreign keys, perhaps USERID should be a clustered primary key. I would question the need for the ID column at all in that case.
The best choice for the clustered index depends much on how the table is queried. If the majority of queries are by USERID, then it should probably be a unique clustered index (or clustered unique constraint) and the ID column non-clustered.
Keep in mind that the clustered index key is implicitly included in all non-clustered indexes as the row locator. The implication is that non-clustered indexes are more likely to cover queries, and that their leaf pages are wider as a result.
I would say your table is mis-designed. Someone apparently thought every table needs a primary key and the primary key is the clustered index. Adding a system-generated unique number as an identifier just adds noise if that number isn't used anywhere. Noise in the clustered index is unhelpful, to say the least.
They are different concepts, by the way. A primary key is a data modeling concern, a logical concept. An index is a physical design issue. A SQL DBMS must support primary keys, but need not have any indexes, clustered or no.
If USERID is what is usually used to search the table, it should be in your clustered index. The clustered index need not be unique and need not be the primary key. I would look at the data carefully to see if some combination of USERID and another column (or two, or more) form a unique identifier for the row. If so, I'd make that the primary key (and clustered index), with USERID as the first column. If query analysis showed that many queries use only USERID and nothing else (for existence testing) I might create a separate index just of USERID.
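As a rough sketch of a non-unique clustered index that is not the primary key: the table name dbo.UserActivity, the ActivityDate column, and the index names are hypothetical stand-ins for the reviewed table. When a clustered index is not declared unique, SQL Server adds a hidden uniquifier to rows that share a key value.

```sql
-- Hypothetical table; ID remains the primary key but is not clustered.
CREATE TABLE dbo.UserActivity
(
    ID           INT IDENTITY(1,1) NOT NULL,
    USERID       INT               NOT NULL,
    ActivityDate DATETIME2         NOT NULL,
    CONSTRAINT PK_UserActivity PRIMARY KEY NONCLUSTERED (ID)
);

-- Rows for the same USERID are stored together, matching how the SPROC searches.
CREATE CLUSTERED INDEX CX_UserActivity_USERID ON dbo.UserActivity (USERID);
```

If USERID plus another column turns out to be unique, the same idea extends to a composite clustered primary key with USERID as the leading column.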
If no combination of columns constitutes a unique identifier, you have a logical problem, to wit: what does the row mean? What aspect of the real world does it represent?
A basic tenet of the Relational Model is that elements in a relation (rows in a table) are unique, that each one identifies something. If two rows are identical, they identify the same thing. What does it mean to delete one of them? Is the thing that they both identify still there, or not? If it is, what purpose did the 2nd row serve?
I hope that gives you another way to think about clustered indexes and keys. I wouldn't be surprised if you find other tables that could be improved, too.
If I create a table with an identity column, should I always make it the PK for the table? I know that when we do, it automatically creates a clustered index for that table. Is there any performance hit from keeping identity columns in tables without making them the PK? Any suggestions?
If you want to use that table from C# with LINQ, then you will need to define a PK on it.
We have a table with about 100,000 records which is used frequently in our applications. We had an identity (ID) column with a clustered index on it, and everything worked well. But for some reason we had to use a uniqueidentifier column as the primary key. So we added a non-clustered index on it and removed the clustered index on the ID column. But now our customers report a lot of performance degradation at peak times. Is it because the table has no clustered index now?
The fact that you added a primary key by no means implies you had to drop the clustered index. The two concepts are distinct. You can have a uniqueidentifier PK implemented by a non-clustered index and a separate clustered index of your choice (e.g. on the old ID column).
But the real question is: how did you change your application when you added the uniqueidentifier PK? Did you also modify the application code to retrieve the records by this new PK (by the uniqueidentifier)? Did you update all joins to reference the new PK? Did you modify all foreign key constraints that referenced the old ID column? Or does the application continue to retrieve the data using the old identity ID column? My expectation is that you changed both the application and the table, and that access is now predominantly of the form SELECT ... FROM table WHERE pk=@uniqueidentifier. If only such access occurs, then the table should perform OK even with a non-clustered uniqueidentifier primary key and no clustered index. So there must be something else at play:
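A minimal sketch of that combination, with hypothetical names (dbo.Item, ItemGuid, Name, and the constraint and index names are not from the question):

```sql
-- uniqueidentifier primary key enforced by a non-clustered index.
CREATE TABLE dbo.Item
(
    ID       INT IDENTITY(1,1) NOT NULL,
    ItemGuid UNIQUEIDENTIFIER  NOT NULL
        CONSTRAINT DF_Item_ItemGuid DEFAULT NEWID(),
    Name     NVARCHAR(100)     NOT NULL,
    CONSTRAINT PK_Item PRIMARY KEY NONCLUSTERED (ItemGuid)
);

-- The old identity column keeps the clustered index, so the table is not a heap.
CREATE UNIQUE CLUSTERED INDEX CX_Item_ID ON dbo.Item (ID);
```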
your application continues to access the table based on the old identity ID column
there are joins in your query based on the old identity ID column
there are foreign key constraints referencing the table on the old ID column
Ultimately you have a performance troubleshooting issue at hand, and you should approach it as a performance troubleshooting problem. I have two great resources for you: the Waits and Queues methodology and the Performance Troubleshooting Flowchart.
Hi, I think you can make the uniqueidentifier column the clustered index if you use NEWSEQUENTIALID() instead of NEWID(). NEWSEQUENTIALID() generates sequential ids, which is the best case for a clustered index.
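One possible starting point for checking the three suspicions listed above is to see how each index on the table is actually being used. This sketch reads the sys.dm_db_index_usage_stats view; dbo.Item is a hypothetical table name, and the counters only cover activity since the last SQL Server restart.

```sql
-- Seeks, scans and lookups per index (or heap) on the table in question.
SELECT i.name       AS index_name,
       i.type_desc  AS index_type,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.Item');
```

A high scan count on the heap, with few seeks on the uniqueidentifier index, would suggest that queries still filter or join on the old ID column.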
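A short sketch of this suggestion, with hypothetical table, column, and constraint names. NEWSEQUENTIALID() can only be used in a DEFAULT constraint on a uniqueidentifier column, so the value is generated by omitting the column from the INSERT.

```sql
-- Clustered primary key on a sequentially generated uniqueidentifier.
CREATE TABLE dbo.ItemSequential
(
    ItemGuid UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_ItemSequential_ItemGuid DEFAULT NEWSEQUENTIALID(),
    Name     NVARCHAR(100)    NOT NULL,
    CONSTRAINT PK_ItemSequential PRIMARY KEY CLUSTERED (ItemGuid)
);

-- ItemGuid is omitted so the default supplies the next sequential value.
INSERT INTO dbo.ItemSequential (Name) VALUES (N'example');
```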