We migrated a lot of data from our old ordering system. This system stored the user's initials in our "Orders" table for each order they created. Now that we have a view that looks at Active Directory, I would like to replace those initials with the Active Directory objectGUID (or something that references the GUID). This would allow us to change a user's initials in Active Directory without having to worry about updating the "Orders" table records.
I've read that index performance is lackluster when using GUIDs vs ints. One possible way to solve this is to have a table that maps GUIDs to ints and then store the int value in our "Orders" table. Would this be a good idea? Any thoughts?
I assume that a USER entity already exists in your database design. If it does not, then I believe your solution will require one.
So assuming the USER entity exists, as you have described, you could place the UserID(int) on the ORDERS table, thereby relating all relevant USER details to an ORDER, rather than just initials (Note: Initials should arguably not be stored on the ORDER table but rather the USER or USER_DETAILS table, albeit not the focus of this discussion).
You can then add the GUID column to the USER table. I don’t think a separate lookup table is necessary.
Make sense?
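To make the suggestion concrete, here is a minimal sketch of that layout. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table and column names (users, object_guid, orders) are my own assumptions rather than anything from the original schema. The orders table carries only the int UserID, while the GUID lives once on the user row.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    user_id     INTEGER PRIMARY KEY,   -- narrow surrogate key used for joins
    object_guid TEXT NOT NULL UNIQUE,  -- Active Directory objectGUID, stored once
    initials    TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES users(user_id)
);
""")

# Hypothetical user: the GUID would come from the Active Directory view.
guid = str(uuid.uuid4())
user_id = conn.execute(
    "INSERT INTO users (object_guid, initials) VALUES (?, ?)", (guid, "JD")
).lastrowid
conn.execute("INSERT INTO orders (user_id) VALUES (?)", (user_id,))

# Changing the initials touches one user row; no order rows are updated.
conn.execute("UPDATE users SET initials = ? WHERE object_guid = ?", ("JS", guid))
row = conn.execute(
    "SELECT u.initials FROM orders o JOIN users u ON u.user_id = o.user_id"
).fetchone()
print(row[0])
```

The order row picks up the new initials through the join, which is the whole point of not storing them on the order.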
Index performance on GUIDs is in general no different from other indexes on a 16-byte field. Where GUIDs are deadly for performance is when you use them as the clustering key on a SQL Server table.
A SQL Server table with a clustered index is physically ordered by the clustering key. Since GUIDs are by nature totally random, this leads to massive index fragmentation very quickly. That a) requires constant care, feeding and reorganization, and b) still hurts performance even if you reorganize every night. So really, the best practice would be to avoid GUIDs (even the new "sequential" GUIDs) for your clustering key.
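To illustrate the fragmentation argument, here is a toy simulation. It is a rough sketch and not a model of SQL Server's actual storage engine: "pages" are plain Python lists with an assumed capacity of 100 keys, a full page splits 50/50, and an insert past the end of the index simply starts a new page. Random 128-bit integers stand in for random GUIDs.

```python
import bisect
import random


def simulate_inserts(keys, page_capacity=100):
    """Insert keys into a toy ordered index; return (page_splits, average_fill)."""
    pages = [[]]   # leaf pages in key order, each a sorted list of keys
    bounds = []    # bounds[i] is the smallest key allowed on pages[i + 1]
    splits = 0
    count = 0
    for key in keys:
        count += 1
        idx = bisect.bisect_right(bounds, key)
        page = pages[idx]
        if len(page) >= page_capacity:
            if idx == len(pages) - 1 and key > page[-1]:
                # Ever-increasing key: start a fresh page at the end, no split.
                pages.append([key])
                bounds.append(key)
                continue
            # Page is full and the key belongs in the middle: split it in half.
            mid = len(page) // 2
            right = page[mid:]
            del page[mid:]
            pages.insert(idx + 1, right)
            bounds.insert(idx, right[0])
            splits += 1
            if key >= right[0]:
                page = right
        bisect.insort(page, key)
    return splits, count / (len(pages) * page_capacity)


rng = random.Random(42)
random_keys = [rng.getrandbits(128) for _ in range(20000)]  # GUID-like keys
sequential_keys = range(20000)                              # IDENTITY-like keys

rand_splits, rand_fill = simulate_inserts(random_keys)
seq_splits, seq_fill = simulate_inserts(sequential_keys)
print("random:     splits =", rand_splits, "fill =", round(rand_fill, 2))
print("sequential: splits =", seq_splits, "fill =", round(seq_fill, 2))
```

In this model the ever-increasing keys never split a page and leave every page full, while the random keys split pages continually and leave them partly empty.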
For more background info and some excellent write-ups on why GUIDs make very poor clustered keys, see the blog of the "Queen of Indexing", Kimberly Tripp:
GUIDs as primary key
The clustered index debate continues....
The clustered index debate....again!
Her insight is most valuable - read it, internalize it, follow it! You won't regret it.
Marc
EDITED
When we first architected our web app years ago, we chose auto-increment int keys for all of our users' data. However, we are now getting burned by how hard it is to transfer a specific user's data (multiple tables with one-to-many relationships) to another non-empty database instance (with the same table structures).
While SET IDENTITY_INSERT table ON | OFF may work for some tables, with our current architecture we will still run into problems, because the keys on the 'many' side of a 'one-to-many' relation may collide with keys that already exist in the destination DB.
Inspired by Pam Lahoud's answer below, I started researching non-clustered PKs and PK alternatives. Then I came across Selecting an Appropriate Primary Key for a Distributed Environment from MSDN, and "Keys That Include a Node Identifier" caught my eye. Does anyone have experience with this kind of architecture?
GUIDs as primary keys are awesome, GUIDs as clustered index keys are not so awesome. While the PK does default to be clustered, it doesn't necessarily have to be. If there is another column on the table that would make sense to be clustered, you might consider converting over to non-clustered GUID primary keys and clustering on some other field.
If the PKs on your table are used frequently for filtering and joining, it probably still makes sense for them to be clustered even if they are GUIDs. Using newsequentialid() will get around most of the problems that are caused by GUID clustered index keys - namely logical index fragmentation, page splits and low page density. You still have the issue that GUIDs are a large data type and therefore all your indexes (both clustered and non-clustered since they also contain the clustered index key) will be somewhat larger, but I don't think that's necessarily a deal breaker.
The only other solution I can think of other than converting to GUIDs would be to specify identity ranges on each of your databases and add constraints to ensure that there is no overlap in ranges between them. This of course wouldn't work for existing data, but would prevent the problem from happening in the future as new data being inserted should be unique across your farm.
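As a rough sketch of the identity-range idea: each database gets its own non-overlapping block of key values. The helper names and the range size below are assumptions for illustration. In SQL Server the lower bound would become the IDENTITY seed and the pair of bounds a CHECK constraint on the key column.

```python
def identity_ranges(node_count, range_size, start=1):
    """Return {node: (low, high)} with one non-overlapping key range per node."""
    return {
        node: (start + node * range_size, start + (node + 1) * range_size - 1)
        for node in range(node_count)
    }


def owner_of(key, ranges):
    """Return the node whose range contains key, or None if no range does."""
    for node, (low, high) in ranges.items():
        if low <= key <= high:
            return node
    return None


ranges = identity_ranges(node_count=3, range_size=1_000_000)
for node, (low, high) in ranges.items():
    # The equivalent of a CHECK constraint each database would enforce.
    print(f"node {node}: seed {low}, CHECK (id BETWEEN {low} AND {high})")
```

Because every key falls in exactly one range, a row's key also tells you which database created it, which is close to what the "Keys That Include a Node Identifier" approach gives you.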
As with anything in SQL Server, there are very few "always" or "never" rules. GUIDs as primary keys and/or clustered index keys is one of those "it depends" cases, and I think in this case the GUID PK might be the right solution.
I've inherited some database creation scripts for a SQL Server 2005 database.
One thing I've noticed is that all primary keys are created as NONCLUSTERED indexes as opposed to clustered.
I know that you can only have one clustered index per table and that you may want to have it on a non-primary-key column for the performance of search queries etc. However, there are no other CLUSTERED indexes on the tables in question.
So my question is: are there any technical reasons not to have clustered indexes on a primary key column, apart from the above?
On any "normal" data or lookup table: no, I don't see any reason whatsoever.
On stuff like bulk import tables, or temporary tables - it depends.
Surprisingly to some people, having a good clustered index can actually speed up operations like INSERT or UPDATE. See Kimberly Tripp's excellent The Clustered Index Debate continues.... blog post, in which she explains in great detail why this is the case.
In this light: I don't see any valid reason not to have a good clustered index (narrow, stable, unique, ever-increasing = INT IDENTITY as the most obvious choice) on any SQL Server table.
To get some deep insights into how and why to choose clustering keys, read all of Kimberly Tripp's excellent blog posts on the topic:
http://www.sqlskills.com/BLOGS/KIMBERLY/category/Clustering-Key.aspx
http://www.sqlskills.com/BLOGS/KIMBERLY/category/Clustered-Index.aspx
Excellent stuff from the "Queen of Indexing" ! :-)
Clustered Tables vs Heap Tables
(Good article on subject at www.mssqltips.com)
HEAP Table (Without clustered index)
Data is not stored in any particular order
Specific data cannot be retrieved quickly, unless there are also non-clustered indexes
Data pages are not linked, so sequential access needs to refer back to the index allocation map (IAM) pages
Since there is no clustered index, additional time is not needed to maintain the index
Since there is no clustered index, there is no need for additional space to store the clustered index tree
These tables have an index_id value of 0 in the sys.indexes catalog view
Clustered Table
Data is stored in order based on the clustered index key
Data can be retrieved quickly based on the clustered index key, if the query uses the indexed columns
Data pages are linked for faster sequential access
Additional time is needed to maintain the clustered index on INSERTs, UPDATEs and DELETEs
Additional space is needed to store the clustered index tree
These tables have an index_id value of 1 in the sys.indexes catalog view
Please read my answer under "No direct access to data row in clustered table - why?", first. Specifically item [2] Caveat.
The people who created the "database" are cretins. They had:
a bunch of unnormalised spreadsheets, not normalised relational tables
the PKs are all IDENTITY columns (the spreadsheets are linked to each other; they have to be navigated one-by-one-by-one); there is no relational access or relational power across the database
they had PRIMARY KEY constraints, which produce UNIQUE CLUSTERED indexes
they found that that prevented concurrency
they removed the CI and made them all NCIs
they were too lazy to finish the reversal; to nominate an alternate (current NCI) to become the new CI, for each table
the IDENTITY column remains the Primary Key (it isn't really, but it is in this hamfisted implementation)
For such collections of spreadsheets masquerading as databases, it is becoming more and more common to avoid CIs altogether, and just have NCIs plus the Heap. Obviously they get none of the power or benefits of the CI, but hell, they get none of the power or benefit of Relational databases, so who cares that they get none of the power of CIs (which were designed for Relational databases, which theirs is not). The way they look at it, they have to "refactor" the darn thing every so often anyway, so why bother. Relational databases do not need "refactoring".
If you need to discuss this response further, please post the CREATE TABLE/INDEX DDL; otherwise it is a time-wasting academic argument.
Here is another possible reason (has it already been provided in other answers?), which I have yet to fully understand:
SQL Server - Poor performance of PK delete
I hope to update this later; for now the aim is mainly to link these topics
Update:
What do I miss in understanding the clustered index?
With some b-tree servers/programming languages still used today, fixed or variable length flat ASCII files are used for storing data. When a new data record/row is added to a file (table), the record is (1) appended to the end of the file (or replaces a deleted record) and (2) the indexes are balanced. When data is stored this way, you don't have to be concerned about system performance (as far as what the b-tree server is doing to return a pointer to the first data record). The response time is only affected by the number of nodes in your index files.
When you get into using SQL, you hopefully come to realize that system performance has to be considered whenever you write an SQL statement. Using an "ORDER BY" statement on a non-indexed column can bring a system to its knees. Using a clustered index might put an unnecessary load on the CPU. It's the 21st century and I wish we didn't have to think about system performance when programming in SQL, but we still do.
With some older programming languages, it was mandatory to use an index whenever sorted data is retrieved. I only wish this requirement was still in place today. I can only wonder how many companies have updated their slow computer systems due to a poorly written SQL statement on non-indexed data.
In my 25 years of programming, I've never needed my physical data stored in a particular order, so maybe that is why some programmers avoid using clustered indexes. It's hard to know what the tradeoff is (storage time versus retrieval time), especially if the system you are designing might store millions of records someday.
We have a database with 500+ tables, in which almost all the tables have a clustered PK that is of datatype guid (uniqueidentifier).
We are in the process of testing a switch from "normal" "random" guids generated through .NET's Guid.NewGuid() method to sequential guids generated through the NHibernate guid.comb algorithm. This seems to be working well, but what about clients that already have millions of rows with "random" primary key values?
Will they benefit from the fact that new ids generated from now on will be sequential?
Could/should anything be done to their existing data?
Thanks in advance for any pointers on this.
You could do this, but I'm not sure you would want to. I don't see any benefit in using sequential guids; in fact, using guids as a primary key is not recommended unless there are distributed/replication reasons involved. Are you using a clustered index?
Having said that, if you go ahead, I recommend loading a table with values from your algorithm first.
You are going to have hassles with foreign keys. You will need to associate the old and new guids in the aforementioned table, drop the foreign keys, perform a transactional update, then reapply the foreign keys.
I don't think it is worth the hassle unless you were moving away from guids altogether to, say, an integer-based system.
It depends whether the tables are clustered on the primary index or on another index. For instance, if you are creating large amounts of new records in a table with a GUID PK and a creation date, it usually makes sense to cluster by the creation date in order to optimize the insert operation.
On the other hand, depending on the queries done, a cluster on the GUID may be better, in which case using sequential GUIDs can help with the insert performance. I'd say that it isn't possible to give a final answer to your question without in-depth knowledge of the usage.
I'm facing a similar issue. I think it would be possible to update existing data by writing an application that replaces your existing keys using the NHibernate guid.comb algorithm. To propagate the new keys to related foreign key tables, maybe it would be possible to temporarily cascade updates? Doing this through .NET code would be slower than an SQL script. Another option might be to duplicate the guid.comb logic in SQL, but I'm not sure if this is possible.
If you choose to retain the existing data, using the guid.comb algorithm should still give some performance improvement. There will still be page splitting when inserts occur, but because new guids are sequential instead of totally random this will be at least somewhat reduced. Another option to consider would be to remove the clustered index on your GUID primary key, although I'm not sure how much existing query performance will be impacted.
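For anyone wondering what a COMB-style guid looks like, here is a minimal sketch of the general idea. It is not NHibernate's exact guid.comb implementation: the timestamp encoding (milliseconds since the Unix epoch, big-endian) is my own simplification. It relies on the assumption that SQL Server compares the last six bytes of a uniqueidentifier first, so the timestamp goes there and the other ten bytes stay random.

```python
import os
import uuid
from datetime import datetime, timezone


def comb_guid(now, random_bytes=None):
    """Build a 16-byte value: 10 random bytes, then a 6-byte timestamp."""
    if random_bytes is None:
        random_bytes = os.urandom(10)
    millis = int(now.timestamp() * 1000)       # milliseconds since the Unix epoch
    return uuid.UUID(bytes=random_bytes + millis.to_bytes(6, "big"))


def tail(guid):
    """The last six bytes, which this sketch assumes are compared first."""
    return guid.bytes[10:]


earlier = comb_guid(datetime(2024, 1, 1, tzinfo=timezone.utc))
later = comb_guid(datetime(2024, 1, 2, tzinfo=timezone.utc))
print(earlier)
print(later)
print(tail(earlier) < tail(later))
```

Values generated later get a larger tail, so new rows land towards the end of the index, while the existing random keys stay wherever they already are.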
I have a question related to database design. The database that I'm working with requires data to be treated in such a way that it is never physically deleted. We started going down the path of adding a "DeleteDateTime" column to some tables, which is NULL by default but once stamped marks a record as deleted.
This gives us the ability to archive our data easily, but I still feel in the dark in a few areas, specifically whether this would be considered in line with best practices and how to go about indexing these tables efficiently.
I'll give you an example: we have a table called "Courses" with a composite primary key made up of the columns "SiteID" and "CourseID". This table also has a column called "DeleteDateTime" that is used in accordance with my description above.
I can't use the SQL Server 2008 filtered index feature because we have to be SQL Server 2005 compatible. Should I include "DeleteDateTime" in the clustered index for this table? If so, should it be the first column in the index (i.e. "DeleteDateTime, SiteID, CourseID")?
Does anyone have any reasons why I should or shouldn't follow this approach?
Thanks!
Is there a chance you could transfer those "dead" records into a separate table? E.g. for your Courses table, have a Courses_deleted table or something like that, with an identical structure.
When you "delete" a record, you basically just move it to the "dead table". That way, the index on your actual, current data stays small and zippy....
If you need to have an aggregate view, you can always define a Courses_View which unions the two tables together.
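Here is a small sketch of that "dead table" pattern, again using Python's sqlite3 module as a stand-in for SQL Server. The Title column, the soft_delete helper and the extra DeleteDateTime column on Courses_deleted are assumptions I added for illustration. The move happens in one transaction so a row is never in both tables or in neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Courses (
    SiteID   INTEGER NOT NULL,
    CourseID INTEGER NOT NULL,
    Title    TEXT NOT NULL,
    PRIMARY KEY (SiteID, CourseID)
);
CREATE TABLE Courses_deleted (
    SiteID         INTEGER NOT NULL,
    CourseID       INTEGER NOT NULL,
    Title          TEXT NOT NULL,
    DeleteDateTime TEXT NOT NULL,
    PRIMARY KEY (SiteID, CourseID)
);
CREATE VIEW Courses_View AS
    SELECT SiteID, CourseID, Title, NULL AS DeleteDateTime FROM Courses
    UNION ALL
    SELECT SiteID, CourseID, Title, DeleteDateTime FROM Courses_deleted;
""")


def soft_delete(conn, site_id, course_id, when):
    """Move one course from the live table to the dead table atomically."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO Courses_deleted "
            "SELECT SiteID, CourseID, Title, ? FROM Courses "
            "WHERE SiteID = ? AND CourseID = ?",
            (when, site_id, course_id),
        )
        conn.execute(
            "DELETE FROM Courses WHERE SiteID = ? AND CourseID = ?",
            (site_id, course_id),
        )


conn.executemany(
    "INSERT INTO Courses VALUES (?, ?, ?)",
    [(1, 10, "Algebra"), (1, 11, "Biology")],
)
soft_delete(conn, 1, 10, "2009-06-01T12:00:00")
print(conn.execute("SELECT COUNT(*) FROM Courses_View").fetchone()[0])
```

The live table only ever holds current rows, and the view still shows everything when you need the full history.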
Your clustered index on your real table should be as small, static and constant as possible, so I would definitely NOT recommend putting such a datetime column into it. Not a good idea.
For excellent info on how to choose a good clustering key, and what it takes, check out Kimberly Tripp's blog entries:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
What are your requirements for data retention? Have you looked into an audit log instead of keeping all non-current data in the database?
I think you have it right on the head for the composite indexes including your "DeleteDateTime" column.
I would create a view that is basically
select {list all columns except the delete flag}
from mytable
where deleteflag is null
This is what I would use for all my queries on the table. The reason is to prevent people from forgetting to consider the deleted flag. SQL Server 2005 can easily handle this kind of view, and it is necessary if you are going to use this design for deleting records. I would have a separate index on the deleted column. I likely would not make it part of the clustered index.
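Applied to the Courses table from the question, the view could look roughly like this. It is a sketch using Python's sqlite3 module rather than SQL Server 2005, and the Title column plus the view and index names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Courses (
    SiteID         INTEGER NOT NULL,
    CourseID       INTEGER NOT NULL,
    Title          TEXT NOT NULL,
    DeleteDateTime TEXT,               -- NULL means the row is still live
    PRIMARY KEY (SiteID, CourseID)
);
-- Separate index on the delete column, kept out of the primary key.
CREATE INDEX IX_Courses_DeleteDateTime ON Courses (DeleteDateTime);
-- Everyday queries go through the view, so deleted rows cannot leak in.
CREATE VIEW ActiveCourses AS
    SELECT SiteID, CourseID, Title
    FROM Courses
    WHERE DeleteDateTime IS NULL;
""")

conn.executemany(
    "INSERT INTO Courses VALUES (?, ?, ?, ?)",
    [(1, 10, "Algebra", None), (1, 11, "Biology", "2009-06-01T12:00:00")],
)
active = conn.execute("SELECT CourseID, Title FROM ActiveCourses").fetchall()
print(active)
```

Only the undeleted course comes back, and nobody querying the view has to remember the DeleteDateTime filter.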
I have several tables whose only unique data is a uniqueidentifier (a Guid) column. Because guids are non-sequential (and they're client-side generated so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.
I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.
My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.
What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?
When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.
As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: https://blog.codinghorror.com/primary-keys-ids-versus-guids/.
As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.
[Edit]
@Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is not a true right or wrong answer. It depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:
For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high concurrency contention. Basically, what happens is most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require an Exclusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page locking mechanism. NewID() on the other hand does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.
Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isn't too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using DBCC DBREINDEX. This effectively delays the performance hits of page splits until off-peak times.
If you even think you might need to enable replication for the table in question - then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one that's there.
Yet another huge benefit for using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You don't have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created.
http://forums.asp.net/t/264350.aspx
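To show the master-detail point from the quoted post, here is a small sketch using Python's uuid module in place of System.Guid.NewGuid. The Order and OrderLine classes are invented for the example. Both parent and child get their permanent keys on the client, before anything is sent to the database. Strictly speaking random GUIDs are not guaranteed unique, but a collision is so unlikely that it is usually treated that way.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class OrderLine:
    order_id: uuid.UUID   # foreign key, known before the order is saved
    product: str
    line_id: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class Order:
    customer: str
    order_id: uuid.UUID = field(default_factory=uuid.uuid4)
    lines: list = field(default_factory=list)

    def add_line(self, product):
        """Attach a detail row that already points at this order's final key."""
        line = OrderLine(order_id=self.order_id, product=product)
        self.lines.append(line)
        return line


order = Order(customer="ACME")
order.add_line("widget")
order.add_line("gadget")
for line in order.lines:
    print(line.line_id, "->", line.order_id)
```

With identity ints the detail rows would need placeholder keys until the parent insert returned its generated value; here nothing has to be patched up afterwards.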
The primary key serves three purposes:
indicates that the column(s) should be unique
indicates that the column(s) should be non-null
document the intent that this is the unique identifier of the row
The first two can be specified in lots of ways, as you have already done.
The third reason is good:
for humans, so they can easily see your intent
for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.
A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.
Just jumping in, because Matt's baited me a bit.
You need to understand that although a clustered index is put on the primary key of a table by default, that the two concepts are separate and should be considered separately. A CIX indicates the way that the data is stored and referred to by NCIXs, whereas the PK provides a uniqueness for each row to satisfy the LOGICAL requirements of a table.
A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.
Rob
Nobody answered the actual question: what are the pluses/minuses of a table with neither a PK nor a CLUSTERED index?
In my opinion, if you optimize for faster inserts (especially incremental bulk inserts, e.g. when you bulk load data into a non-empty table), such a table, with NO clustered index, NO constraints, NO Foreign Keys, NO Defaults and NO Primary Key, in a database with the Simple Recovery Model, is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety), you may want to add non-clustered, non-unique indexes as needed, but keep them to a minimum.
I too have always heard having an auto-incrementing int is good for performance even if you don't actually use it.
A Primary Key needn't be an autoincrementing field, in many cases this just means you are complicating your table structure.
Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.
In technical terms, it should be the field that every other field in the tuple is fully functionally dependent upon. (If it isn't you might need to normalise).
In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...
Since you are doing replication, you are correct that identities are something to steer clear of. I would make your GUID a primary key but nonclustered, since you can't use newsequentialid(). That strikes me as your best course. If you don't make it a PK but put a unique index on it, sooner or later that may cause the people who maintain the system to misunderstand the FK relationships and introduce bugs.