Working with non-clustered index on nvarchar(50) column - sql-server

I have been tasked at looking at an MSSQL db to see if I can enhance performance anywhere. Most of the DB has clustered-indexes using the standard ID Identity primary key field.
In the area I am concerned about I have found where two tables are not linked via a standard primary key -> foreign key relationship. Instead there is an nvarchar(50) field that corresponds in both tables as the contents are the same. So an inner join can be done without a concrete relationship. Here a non-clustered index has been created.
I have queried why we are not doing it the same way as everywhere else in the DB, and explained that I thought this could be where things are being significantly slowed down. I have been told that it won't be an issue because there is a non-clustered index on the field in question. So I should look elsewhere.
So I thought I would ask what people think. Will a non-clustered index, setup on an nvarchar(50) column that is used to join two tables (the two tables in question have non concrete relationship between each other) cause any performance issues, especially when comparing to a primary key, system generated clustered-index.
Would be grateful for any advise here.

Related

One to one relationship. Is Id column needed for both table?

Let's consider the tables:
CREATE TABLE [dbo].[User] (
[Id] INT PRIMARY KEY,
...
);
CREATE TABLE [dbo].[UserInfo_1] (
[Id] INT PRIMARY KEY,
[UserId] INT,
...,
CONSTRAINT FK_UserId FOREIGN KEY ([UserId]) REFERENCES [dbo].[User] (Id)
);
CREATE TABLE [dbo].[UserInfo_2] (
[UserId] INT PRIMARY KEY,
...,
CONSTRAINT FK_UserId FOREIGN KEY ([UserId]) REFERENCES [dbo].[User] (Id)
);
What are the procs and cons of using FOREIGN KEY for UserInfo_1 and UserInfo_2 tables? Also in terms of ORM.
I don't think there would be a con for using Foreign Keys on any form of tables. In fact, I make sure to have a primary key on all tables I use, especially on temps and variable tables since I know I will be joining and filtering with them.
Now your first table User and UserInfo_1 is a one to many relationship. Meaning a single User can have many different UserInfo_1 associated with it.
the second one of User and UserInfo_2 is a one to one relationship. In which a single User can only ever have one UserInfo_2 associated with it.
In terms of performance, since they are Indexed they would perform relatively the same, depending upon your filtering and what plan was cached in the query plan. though you may not entirely run into issues with cached query plans as EF utilizes ad-hock statements, though EF does run up the cached plan memory and that is typically recommended to be disabled when using EF.
One-to-One
I am a fan of one to one relationship, especially in a Domain Driven Design aspect, and when implemented correctly. If each of your rows for User is going to require information from UserInfo_2, then I would theoretically keep them on the User table. Now if you know you will not be querying that information much or not all Users will require the columns on that table, or if your main table is fairly large I would keep it as a one to one relationship.
I personally like to use System Versioning. I have tables which contain certain columns which typically update and columns which almost never update. Those that I know update on a daily/weekly bases I have them congregated on a one to one relationship to the main table that should almost never update. But each business needs and scenario is different. Not any one design fits all situations.
Benefits of Indexing
When you create a Foreign Key, you are creating an Index in the database. This will allow for you to perform faster queries. The SQL optimizer will utilize the Index to better find what you are looking for. Without the index, your query plan will turn into a table scan, which is a row by row search. When doing a row by row search, you can seriously slow your system down as your table grows.
Should you choose to create your tables without an Index, or in this case a Foreign Key index, you might find issues with the ORM aspect. When you query your database from EF, you will call your DbSet. If you had proper Foreign Key connections with your two tables, EF can utilize the .Include to join the two tables searching for what you need. Otherwise you would be forced to utilize two queries into the database for both tables.
In a project I worked on one time, a developer did that. He did not properly attach a Foreign Key connection between two objects and then didn't understand why EF would not properly return his values when he used the .Include and wasn't very fast. He thought it was EF's fault and had to do two queries to obtain the information he needed.
Well User > UserInfo_1 is a one-to-many relationship, as UserInfo_1.UserID is not a key. And EF 6 doesn't support alternate keys. EF Core does, so you could make it a key.
But the simplest design is always to collapse 1-1 relationships into a single table. In EF Core you can still have a main Entity Type and one or more separate Owned Entity Types. But on the database it's typically better to have them in a single table.
The second-simplest is to have have both tables have the same key columns.

What options are there for transferring one particular webapp user's data onto another DB instance without suffering PK collision?

EDITED
When we first architected our web app years ago, we chose auto-increment int for all of our users' data. However, we are now getting burned by how hard it is to transfer specific user's data (multiple tables with one-to-many relationships) to another non-empty database instance (with the same table strictures).
While SET IDENTITY_INSERT table ON | OFF may work for some tables, with our current architecture we will still run into problem because certain 'many' in a 'one-to-many' relation may collide with the destination DB.
Inspired by Pam Lahoud's answer below, I started researching on non-clustered PK and PK alternatives. Then I come across Selecting an Appropriate Primary Key for a Distributed Environment from MSDN, and "Keys That Include a Node Identifier" caught my eye. Anyone has experience with this kind of architecture?
GUIDs as primary keys are awesome, GUIDs as clustered index keys are not so awesome. While the PK does default to be clustered, it doesn't necessarily have to be. If there is another column on the table that would make sense to be clustered, you might consider converting over to non-clustered GUID primary keys and clustering on some other field.
If the PKs on your table are used frequently for filtering and joining, it probably still makes sense for them to be clustered even if they are GUIDs. Using newsequentialid() will get around most of the problems that are caused by GUID clustered index keys - namely logical index fragmentation, page splits and low page density. You still have the issue that GUIDs are a large data type and therefore all your indexes (both clustered and non-clustered since they also contain the clustered index key) will be somewhat larger, but I don't think that's necessarily a deal breaker.
The only other solution I can think of other than converting to GUIDs would be to specify identity ranges on each of your databases and add constraints to ensure that there is no overlap in ranges between them. This of course wouldn't work for existing data, but would prevent the problem from happening in the future as new data being inserted should be unique across your farm.
As with anything in SQL Server, there are very few "always" or "never" rules, GUIDs as primary keys and/or clustered index keys is one of those "it depends" rules, I think in this case the GUID PK might be the right solution.

How can we deal with intersection tables that quickly grow very large?

For example, we have table A, and table B which have a many-to-many relationship. An intersection table, Table C stores A.id and B.id along with a value that represents a relationship between the two. Or as a concrete example, imagine stackexchange which has a user account, a forum, and a karma score. Or, a student, a course, and a grade. If table A and B are very large, table C can and probably will grow monstrously large very quickly(in fact lets just assume it does). How do we go about dealing with such an issue? Is there a better way to design the tables to avoid this?
There is no magic. If some rows are connected and some aren't, this information has to be represented somehow, and the "relational" way of doing it is a "junction" (aka "link") table. Yes, a junction table can grow large, but fortunately databases are very capable of handling huge amounts of data.
There are good reasons for using junction table versus comma-separated list (or similar), including:
Efficient querying (through indexing and clustering).
Enforcement of referential integrity.
When designing a junction table, ask the following questions:
Do I need to query in only one direction or both?1
If one direction, just create a composite PRIMARY KEY on both foreign keys (let's call them PARENT_ID and CHILD_ID). Order matters: if you query from parent to children, PK should be: {PARENT_ID, CHILD_ID}.
If both directions, also create a composite index in the opposite order, which is {CHILD_ID, PARENT_ID} in this case.
Is the "extra" data small?
If yes, cluster the table and cover the extra data in the secondary index as necessary.2
I no, don't cluster the table and don't cover the extra data in the secondary index.3
Are there any additional tables for which the junction table acts as a parent?
If yes, consider whether adding a surrogate key might be worthwhile to keep child FKs slim. But beware that if you add a surrogate key, this will probably eliminate the opportunity for clustering.
In many cases, answers to these questions will be: both, yes and no, in which case your table will look similar to this (Oracle syntax below):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR2(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
) ORGANIZATION INDEX COMPRESS;
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID, EXTRA_DATA) COMPRESS;
Considerations:
ORGANIZATION INDEX: Oracle-specific syntax for what most DBMSes call clustering. Other DBMSes have their own syntax and some (MySQL/InnoDB) imply clustering and user cannot turn it off.
COMPRESS: Some DBMSes support leading-edge index compression. Since clustered table is essentially an index, compression can be applied to it as well.
JUNCTION_TABLE_IE1, EXTRA_DATA: Since extra data is covered by the secondary index, DBMS can get it without touching the table when querying in the direction from child to parents. Primary key acts as a clustering key so the extra data is naturally covered when querying from a parent to the children.
Physically, you have just two B-Trees (one is the clustered table and the other is the secondary index) and no table heap at all. This translates to good querying performance (both parent-to-child and child-to-parent directions can be satisfied by a simple index range scan) and fairly small overhead when inserting/deleting rows.
Here is the equivalent MS SQL Server syntax (sans index compression):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
);
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID) INCLUDE (EXTRA_DATA);
Note that MS SQL Server automatically clusters tables, unless PRIMARY KEY NONCLUSTERED is specified.
1 In other words, do you only need to get "children" of given "parent", or you might also need to get parents of given child.
2 Covering allows the query to be satisfied from the index alone, and avoids expensive double-lookup that would otherwise be necessary when accessing data through a secondary index in the clustered table.
3 This way, the extra data is not repeated (which would be expensive, since it's big), yet you avoid the double-lookup and replace it with (cheaper) table heap access. But, beware of clustering factor that can destroy the performance of range scans in heap-based tables!

What are the best practices for using a GUID as a primary key, specifically regarding performance? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 1 year ago.
The community reviewed whether to reopen this question 2 months ago and left it closed:
Original close reason(s) were not resolved
Improve this question
I have an application that uses GUID as the Primary Key in almost all tables and I have read that there are issues about performance when using GUID as Primary Key. Honestly, I haven't seen any problem, but I'm about to start a new application and I still want to use the GUIDs as the Primary Keys, but I was thinking of using a Composite Primary Key (The GUID and maybe another field.)
I'm using a GUID because they are nice and easy to manage when you have different environments such as "production", "test" and "dev" databases, and also for migration data between databases.
I will use Entity Framework 4.3 and I want to assign the Guid in the application code, before inserting it in the database. (i.e. I don't want to let SQL generate the Guid).
What is the best practice for creating GUID-based Primary Keys, in order to avoid the supposed performance hits associated with this approach?
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 25 MB vs. 106 MB - and that's just on a single table!
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.
Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:
CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
MyINT INT IDENTITY(1,1) NOT NULL,
.... add more columns as needed ...... )
ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID)
CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT)
Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED
This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!
I've been using GUIDs as PKs since 2005. In this distributed database world, it is absolutely the best way to merge distributed data. You can fire and forget merge tables without all the worry of ints matching across joined tables. GUIDs joins can be copied without any worry.
This is my setup for using GUIDs:
PK = GUID. GUIDs are indexed similar to strings, so high row tables (over 50 million records) may need table partitioning or other performance techniques. SQL Server is getting extremely efficient, so performance concerns are less and less applicable.
PK Guid is NON-Clustered index. Never cluster index a GUID unless it is NewSequentialID. But even then, a server reboot will cause major breaks in ordering.
Add ClusterID Int to every table. This is your CLUSTERED Index... that orders your table.
Joining on ClusterIDs (int) is more efficient, but I work with 20-30 million record tables, so joining on GUIDs doesn't visibly affect performance. If you want max performance, use the ClusterID concept as your primary key & join on ClusterID.
Here is my Email table...
CREATE TABLE [Core].[Email] (
[EmailID] UNIQUEIDENTIFIER CONSTRAINT [DF_Email_EmailID] DEFAULT (newsequentialid()) NOT NULL,
[EmailAddress] NVARCHAR (50) CONSTRAINT [DF_Email_EmailAddress] DEFAULT ('') NOT NULL,
[CreatedDate] DATETIME CONSTRAINT [DF_Email_CreatedDate] DEFAULT (getutcdate()) NOT NULL,
[ClusterID] INT NOT NULL IDENTITY,
CONSTRAINT [PK_Email] PRIMARY KEY NonCLUSTERED ([EmailID] ASC)
);
GO
CREATE UNIQUE CLUSTERED INDEX [IX_Email_ClusterID] ON [Core].[Email] ([ClusterID])
GO
CREATE UNIQUE NONCLUSTERED INDEX [IX_Email_EmailAddress] ON [Core].[Email] ([EmailAddress] Asc)
I am currently developing an web application with EF Core and here is the pattern I use:
All my classes (tables) have an int PK and FK.
I then have an additional column of type Guid (generated by the C# constructor) with a non clustered index on it.
All the joins of tables within EF are managed through the int keys while all the access from outside (controllers) are done with the Guids.
This solution allows to not show the int keys on URLs but keep the model tidy and fast.
This link says it better than I could and helped in my decision making. I usually opt for an int as a primary key, unless I have a specific need not to and I also let SQL server auto-generate/maintain this field unless I have some specific reason not to. In reality, performance concerns need to be determined based on your specific app. There are many factors at play here including but not limited to expected db size, proper indexing, efficient querying, and more. Although people may disagree, I think in many scenarios you will not notice a difference with either option and you should choose what is more appropriate for your app and what allows you to develop easier, quicker, and more effectively (If you never complete the app what difference does the rest make :).
https://web.archive.org/web/20120812080710/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html
P.S. I'm not sure why you would use a Composite PK or what benefit you believe that would give you.
Well, if your data never reach millions of rows, you are good. If you ask me, i never use GUID as database identity column of any type, including PK even if you force me to design with a shotgun at the head.
Using GUID as primary key is a definitive scaling stopper, and a critical one.
I recommend you check database identity and sequence option. Sequence is table independent and may provide a solution for your needs(MS SQL has sequences).
If your tables start reaching some dozens of millions of rows the most, e.g. 50 million you will not be able read/write information at acceptable timings and even standard database index maintenance would turn impossible.
Then you need to use partitioning, and be scalable up to half a billion or even 1-2 billion rows. Adding partitioning on the way is not the easiest thing, all read/write statements must include partition column (full app changes!).
These number of course (50 million and 500 million) are for a light selecting useage. If you need to select information in a complex way and/or have lots of inserts/updates/deletes, those could even be 1-2 millions and 50 millions instead, for a very demanding system. If you also add factors like full recovery model, high availability and no maintenance window, common for modern systems, things become extremely ugly.
Note at this point that 2 billion is int limit that looks bad, but int is 4 times smaller and is a sequential type of data, small size and sequential type are the #1 factor for database scalability. And you can use big int which is just twice smaller but still sequential, sequential is what is really deadly important - even more important than size - when to comes to many millions or few billions of rows.
If GUID is also clustered, things are much worst. Just inserting a new row will be actually stored randomly everywhere in physical position.
Even been just a column, not PK or PK part, just indexing it is trouble. From fragmentation perspective.
Having a guid column is perfectly ok like any varchar column as long as you do not use it as PK part and in general as a key column to join tables. Your database must have its own PK elements, filtering and joining data using them - filtering also by a GUID afterwards is perfectly ok.
Having sequential ID's makes it a LOT easier for a hacker or data miner to compromise your site and data. Keep that in mind when choosing a PK for a website.
If you use GUID as primary key and create clustered index then I suggest use the default of NEWSEQUENTIALID() value for it.
Another reason not to expose an Id in the user interface is that a competitor can see your Id incrementing over a day or other period and so deduce the volume of business you are doing.
Most of the times it should not be used as the primary key for a table because it really hit the performance of the database.
useful links regarding GUID impact on performance and as a primary key.
https://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
https://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/

Should GUID be used as a foreign key to many tables

I designed a database to use GUIDs for the UserID but I added UserId as a foreign key to many other tables, and as I plan to have a very large number of database inputs I must design this well.
I also use ASP Membership tables (not profiles only membership, users and roles).
So currently I use a GUID as PK and as FK in every other table, this is maybe bad design?
What I think is maybe better and is the source of my question, should I add in Users table UserId(int) as a primary key and use this field as a foreign key to other tables and user GUID UserId only to reference aspnet_membership?
aspnet_membership{UserID(uniqueIdentifier)}
Users{
UserID_FK(uniqueIdentifier) // FK to aspnet_membership table
UserID(int) // primary key in this table --> Should I add this
...
}
In other tables user can create items in tables and I always must add UserId for example:
TableA{TableA_PK(int)..., CreatedBy(uniqueIdentifier)}
TableB{TableB_PK(int)..., CreatedBy(uniqueIdentifier)}
TableC{TableC_PK(int)..., CreatedBy(uniqueIdentifier)}
...
Ultimately the answer is that it really depends.
Microsoft have documented the performance differences of each here. While the article differs slightly to your situation, as you have to use a UNIQUEIDENTIFIER to link back to asp membership, many of the discussion points still apply.
If you have to create your own users table anyway it would make more sense to have your own int primary key, and use the GUID as a foreign key. It keeps separate entities separate. What if at some point in the future you wanted to add a different membership to a user? You would then need to update a primary key, which would have to cascade to any tables referencing this and could be quite a performance hit. If it is just a unique column in your users table it is a simple update.
All binary datatypes (uniqueidetifier is binary(16)) are fine as foreign keys. But GUID might cause another small problem as primary key. By default primary key is clustered, but GUID is generated randomly not sequentially as IDENTITY, so it will frequently split IO pages, that cause performance degradation a little.
I'd go ahead with your current design.
If you're worried because of inner joins performance in your queries because of comparing strings instead of ints, nowadays there is no difference at all.

Resources