I have a question related to database design. The database I'm working with
requires that data is never physically deleted. We started going
down the path of adding a "DeleteDateTime" column to some tables, which is NULL by default but,
once stamped, marks a record as deleted.
This gives us the ability to archive our data easily, but I still feel in the dark in a few areas: specifically,
whether this would be considered in line with best practices, and how to go about indexing these tables efficiently.
I'll give you an example: We have a table called "Courses" with a composite primary key made up of the columns "SiteID" and "CourseID".
This table also has a column called "DeleteDateTime" that is used in accordance with my description above.
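As a sketch, the table described above might look like this ("Title" is an invented payload column for illustration):

```sql
CREATE TABLE dbo.Courses (
    SiteID         int            NOT NULL,
    CourseID       int            NOT NULL,
    Title          nvarchar(100)  NOT NULL,   -- example column, not from the real schema
    DeleteDateTime datetime       NULL,       -- NULL = active; stamped = soft-deleted
    CONSTRAINT PK_Courses PRIMARY KEY CLUSTERED (SiteID, CourseID)
);
```

A soft delete is then just an UPDATE that stamps the column, e.g. `UPDATE dbo.Courses SET DeleteDateTime = GETDATE() WHERE SiteID = @SiteID AND CourseID = @CourseID;`.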
I can't use the SQL Server 2008 filtered index feature because we have to stay
SQL Server 2005 compatible. Should I include "DeleteDateTime" in the clustered index for this table? If so, should it be
the first column in the index (i.e. "DeleteDateTime, SiteID, CourseID")?
Does anyone have any reasons why I should or shouldn't follow this approach?
Thanks!
Is there a chance you could transfer those "dead" records into a separate table? E.g. for your Courses table, have a Courses_deleted table or something like that, with an identical structure.
When you "delete" a record, you basically just move it to the "dead table". That way, the index on your actual, current data stays small and zippy....
If you need to have an aggregate view, you can always define a Courses_View which unions the two tables together.
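A sketch of that union view, assuming Courses_deleted mirrors the Courses schema (the column list is an assumption):

```sql
CREATE VIEW dbo.Courses_View
AS
    SELECT SiteID, CourseID, Title FROM dbo.Courses
    UNION ALL
    SELECT SiteID, CourseID, Title FROM dbo.Courses_deleted;
```

The "delete" itself then becomes an INSERT into the dead table followed by a DELETE from the live one, ideally wrapped in a single transaction.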
Your clustered index on your real table should be as small, static and constant as possible, so I would definitely NOT recommend putting such a datetime column into it. Not a good idea.
For excellent info on how to choose a good clustering key, and what it takes, check out Kimberly Tripp's blog entries:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
What are your requirements on data retention? Have you looked into an audit log instead of keeping all non-current data in the database?
I think you have it right on the head for the composite indexes including your "DeleteDateTime" column.
I would create a view that is basically
select {list all columns except DeleteDateTime}
from MyTable
where DeleteDateTime is null
This is what I would use for all my queries on the table. The reason is to prevent people from forgetting to consider the deleted flag. SQL Server 2005 can easily handle this kind of view, and it is necessary if you are going to use this design for deleting records. I would have a separate index on the DeleteDateTime column. I likely would not make it part of the clustered index.
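Spelled out, the suggested view might look like this (table and column names follow the Courses example from the question):

```sql
CREATE VIEW dbo.ActiveCourses
AS
    SELECT SiteID, CourseID   -- list every column here except DeleteDateTime
    FROM dbo.Courses
    WHERE DeleteDateTime IS NULL;
```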
To start off, I'm not that great with database strategies, so I don't know really how to even approach this.
What I want to do is store some info in a database. Essentially the data is going to look like this
SensorNumber (int)
Reading (int)
Timestamp (Datetime?)(I just want to track down to the minute, nothing further is needed)
The only thing about this is that over a few months of tracking I'm going to have millions of rows (~5 million rows).
I really only care about searching by Timestamp and/or SensorNumber. The data in here is pretty much going to be never edited (insert once, read many times).
How should I go about building this? Is there anything special I should do other than create the table and create one index on SensorNumber and Timestamp?
Based on your comment, I would put a clustered index on (Sensor, Timestamp).
This will always cover when you want to search for SENSOR alone, but will also cover both fields checked in combination.
If you want to ever search for Timestamp alone, you can add a nonclustered index there as well.
One issue you will have with this design is the need to rebuild the index periodically, since you are going to be inserting rows non-sequentially - the new rows won't always belong at the end of the index.
Also, please do not name a field timestamp - this is a keyword in SQL Server and can cause you all kinds of issues if you don't delimit it everywhere.
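Putting those suggestions together, one possible sketch (names assumed from the question; ReadingTime avoids the reserved word):

```sql
CREATE TABLE dbo.SensorReading (
    SensorNumber int      NOT NULL,
    Reading      int      NOT NULL,
    ReadingTime  datetime NOT NULL   -- smalldatetime would also do for minute precision
);

-- Covers searches on SensorNumber alone and on (SensorNumber, ReadingTime) together.
CREATE CLUSTERED INDEX CIX_SensorReading
    ON dbo.SensorReading (SensorNumber, ReadingTime);

-- Only needed if you also search by time alone.
CREATE NONCLUSTERED INDEX IX_SensorReading_ReadingTime
    ON dbo.SensorReading (ReadingTime);
```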
You definitely want to use a SQL-Server "clustered index" for the most selective data you're likely to search on.
Here's more info:
http://www.sql-server-performance.com/2007/clustered-indexes/
http://odetocode.com/articles/70.aspx
http://www.sql-server-performance.com/2002/index-not-equal/
ELABORATION:
"Sensor" would be a poor choice - you're likely to have few sensors, many rows. This would not be a discriminating index.
"Time" would be discriminating... but it would also be a poor choice. Because the time itself, independent of sensor, temperature, etc, is probably meaningless to your query.
A clustered index on "sensor,time" might be ideal. Or maybe not - it depends on what you're after.
Please review the above links.
PS:
Please, too, consider using "datetime" instead of "timestamp". They're two completely different types under MSSQL ... and "datetime" is arguably the better, more flexible choice:
http://www.sqlteam.com/article/timestamps-vs-datetime-data-types
I agree with using a clustered index; you are almost certainly going to end up with one anyway, so it's better to define it yourself.
A clustered index determines the order that the data is stored, adding to the end is cheaper than inserting into the middle.
Think of a deck of cards you are trying to keep in rank order as you add cards. If the highest rank is a 8, adding a 9 is trivial - put it at the top.
If you add a 5, it gets more complex, you have to work out where to put it and then insert it.
So adding items with a clustered index in order is optimal.
Given that, I would suggest having the clustered index on (Timestamp, Sensor).
Clustering on (Sensor, Timestamp) will cause a LOT of changes to the physical ordering of data, which is very expensive (even using SSDs).
If the (Timestamp, Sensor) combination is unique then define the index as UNIQUE; otherwise SQL Server will add a hidden 4-byte uniqueifier to the index to resolve duplicates.
Primary keys are automatically unique, almost all tables should have a primary key.
If (Timestamp,Sensor) is not unique, or you want to reference this data from another table, consider using an identity column as the clustered Primary Key.
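For example (table and column names are assumptions; declaring uniqueness up front avoids the hidden uniqueifier):

```sql
CREATE UNIQUE CLUSTERED INDEX CIX_SensorReading
    ON dbo.SensorReading (ReadingTime, SensorNumber);
```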
Good Luck!
I've inherited some database creation scripts for a SQL SERVER 2005 database.
One thing I've noticed is that all primary keys are created as NON CLUSTERED indexes as opposed to clustered.
I know that you can only have one clustered index per table and that you may want to have it on a non primary key column for query performance of searches etc. However there are no other CLUSTERED indexes on the tables in questions.
So my question is are there any technical reasons not to have clustered indexes on a primary key column apart from the above.
On any "normal" data or lookup table: no, I don't see any reason whatsoever.
On stuff like bulk import tables, or temporary tables - it depends.
Surprisingly to some people, a good clustered index can actually speed up operations like INSERT or UPDATE. See Kimberly Tripp's excellent blog post The Clustered Index Debate continues...., in which she explains in great detail why this is the case.
In this light: I don't see any valid reason not to have a good clustered index (narrow, stable, unique, ever-increasing = INT IDENTITY as the most obvious choice) on any SQL Server table.
To get some deep insights into how and why to choose clustering keys, read all of Kimberly Tripp's excellent blog posts on the topic:
http://www.sqlskills.com/BLOGS/KIMBERLY/category/Clustering-Key.aspx
http://www.sqlskills.com/BLOGS/KIMBERLY/category/Clustered-Index.aspx
Excellent stuff from the "Queen of Indexing" ! :-)
Clustered Tables vs Heap Tables
(Good article on subject at www.mssqltips.com)
HEAP Table (without a clustered index)
- Data is not stored in any particular order
- Specific data cannot be retrieved quickly, unless there are also non-clustered indexes
- Data pages are not linked, so sequential access needs to refer back to the index allocation map (IAM) pages
- Since there is no clustered index, additional time is not needed to maintain the index
- Since there is no clustered index, there is no need for additional space to store the clustered index tree
- These tables have an index_id value of 0 in the sys.indexes catalog view

Clustered Table
- Data is stored in order based on the clustered index key
- Data can be retrieved quickly based on the clustered index key, if the query uses the indexed columns
- Data pages are linked for faster sequential access
- Additional time is needed to maintain the clustered index on INSERTs, UPDATEs and DELETEs
- Additional space is needed to store the clustered index tree
- These tables have an index_id value of 1 in the sys.indexes catalog view
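You can check which kind a given table is from the catalog view mentioned above; a sketch, with the table name as a placeholder:

```sql
SELECT OBJECT_NAME(object_id) AS table_name,
       index_id,              -- 0 = heap, 1 = clustered index
       type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.MyTable');
```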
Please read my answer under "No direct access to data row in clustered table - why?", first. Specifically item [2] Caveat.
The people who created the "database" are cretins. They had:
a bunch of unnormalised spreadsheets, not normalised relational tables
the PKs are all IDENTITY columns (the spreadsheets are linked to each other; they have to be navigated one-by-one-by-one); there is no relational access or relational power across the database
they had PRIMARY KEY, which produces a UNIQUE CLUSTERED index
they found that that prevented concurrency
they removed the CI and made them all NCIs
they were too lazy to finish the reversal; to nominate an alternate (current NCI) to become the new CI, for each table
the IDENTITY column remains the Primary Key (it isn't really, but it is in this hamfisted implementation)
For such collections of spreadsheets masquerading as databases, it is becoming more and more common to avoid CIs altogether, and just have NCIs plus the Heap. Obviously they get none of the power or benefits of the CI, but hell, they get none of the power or benefit of Relational databases, so who cares that they get none of the power of CIs (which were designed for Relational databases, which theirs is not). The way they look at it, they have to "refactor" the darn thing every so often anyway, so why bother. Relational databases do not need "refactoring".
If you need to discuss this response further, please post the CREATE TABLE/INDEX DDL; otherwise it is a time-wasting academic argument.
Here is another possible reason, still to be understood (has it already been covered in other answers?):
SQL Server - Poor performance of PK delete
I hope to update this later, but for now I mainly want to link these topics.
Update:
What do I miss in understanding the clustered index?
With some b-tree servers/programming languages still used today, fixed- or variable-length flat ASCII files are used for storing data. When a new data record/row is added to a file (table), the record is (1) appended to the end of the file (or replaces a deleted record) and (2) the indexes are rebalanced. When data is stored this way, you don't have to be concerned about system performance (as far as what the b-tree server is doing to return a pointer to the first data record). The response time is only affected by the number of nodes in your index files.
When you get into using SQL, you hopefully come to realize that system performance has to be considered whenever you write an SQL statement. Using an "ORDER BY" statement on a non-indexed column can bring a system to its knees. Using a clustered index might put an unnecessary load on the CPU. It's the 21st century and I wish we didn't have to think about system performance when programming in SQL, but we still do.
With some older programming languages, it was mandatory to use an index whenever sorted data is retrieved. I only wish this requirement was still in place today. I can only wonder how many companies have updated their slow computer systems due to a poorly written SQL statement on non-indexed data.
In my 25 years of programming, I've never needed my physical data stored in a particular order, so maybe that is why some programmers avoid using clustered indexes. It's hard to know what the tradeoff is (storage time, verses retrieval time) especially if the system you are designing might store millions of records someday.
SQL Server provides the [rowguid] type. I like to use this as a unique primary key, to identify a row for update. The benefit shows up if you dump the table and reload it: no mess with SerialNo (identity) columns.
In the special case of distributed databases like offline copies on notebooks or something like that, nothing else works.
What do you think? Too much overhead?
As a primary key in the logical sense (uniquely identifying your rows) - yes, absolutely, makes total sense.
BUT: in SQL Server, the primary key is by default also the clustering key on your table, and using a ROWGUID as the clustering key is a really really bad idea. See Kimberly Tripp's excellent GUIDs as PRIMARY KEYs and/or the clustering key article for in-depth reasons why not to use GUIDs for clustering.
Since the GUID is by definition random, you'll have a horrible index fragmentation and thus really really bad performance on insert, update, delete and select statements.
Also, since the clustering key is added to every entry of every non-clustered index on your table, you're wasting a lot of space - both on disk and in server RAM - when using a 16-byte GUID vs. a 4-byte INT.
So: yes, as a primary key, a ROWGUID has its merits - but if you do use it, definitely avoid using that column as your clustering key in the table! Use an INT IDENTITY() or something similar for that.
For a clustering key, ideally you should look for four features:
stable (never changing)
unique
as small as possible
ever-increasing
INT IDENTITY() ideally suits that need. And yes - the clustering key must be unique since it's used to physically locate a row in the table - if you pick a column that can't be guaranteed to be unique, SQL Server will actually add a four-byte uniqueifier to your clustering key - again, not something you want to have....
Check out The Clustered Index Debate Continues - another wonderful and insightful article by Kim Tripp (the "Queen of SQL Server Indexing") in which she explains all these requirements very nicely and thoroughly.
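A minimal sketch of a clustering key that meets all four criteria (table and column names invented):

```sql
CREATE TABLE dbo.Orders (
    OrderID   int IDENTITY(1,1) NOT NULL,  -- stable, unique, 4 bytes, ever-increasing
    OrderDate datetime          NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
);
```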
Marc
The problem with rowguid is that if you use it for your clustered index, you end up constantly splitting and reshuffling table pages as records are inserted. A sequential GUID ( NEWSEQUENTIALID() ) often works better.
Our offline application is used in branch offices and we have a central database in our main office. To synchronize the database into central database we have used rowguid column in all tables. May be there are better solutions but it is easier for us. We have not faced any major problem till date in last 3 years.
Contrary to the accepted answer, the uniqueidentifier datatype in SQL Server is indeed a good candidate for a primary clustering key; so long as you keep it sequential.
This is easily accomplished using (newsequentialid()) as the default value for the column.
If you actually read Kimberly Tripp's article you will find that sequentially generated GUIDs are actually a good candidate for primary clustering keys in terms of fragmentation and the only downside is size.
If you have large rows with few indexes, the extra few bytes in a GUID may be negligible. Sure the issue compounds if you have short rows with numerous indexes, but this is something you have to weigh up depending on your own situation.
Using sequential uniqueidentifiers makes a lot of sense when you're going to use merge replication, especially when dealing with identity seeding and the woes that ensue.
Server-class storage isn't cheap, but I'd rather have a database that uses a bit more space than one that screeches to a halt when your automatically assigned identity ranges overlap.
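The default described above can be sketched like this (table and column names are assumptions; note that NEWSEQUENTIALID() can only be used as a column default, not called directly):

```sql
CREATE TABLE dbo.Customer (
    CustomerID uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_CustomerID DEFAULT (NEWSEQUENTIALID()),
    Name       nvarchar(100)    NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerID)
);
```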
We migrated a lot of data from our old ordering system. This system stored the users initials in our "Orders" table for each order they created. Now that we have a view that looks at active directory I would like to update those initials with the active directory objectguid (or something that references the guid). This would allow us to change the users initials in active directory without having to worry about updating the "Orders" table records.
I've read that index performance is lackluster when using guids vs ints. One way to possibly solve this is have a table that maps guids to ints then store the int value in our orders table. Would this be a good idea? Any thoughts?
I assume that a USER entity already exists in your database design however, if it does not then I believe your solution will require it.
So assuming the USER entity exists, as you have described, you could place the UserID(int) on the ORDERS table, thereby relating all relevant USER details to an ORDER, rather than just initials (Note: Initials should arguably not be stored on the ORDER table but rather the USER or USER_DETAILS table, albeit not the focus of this discussion).
You can then add the GUID column to the USER table. I don’t think a separate lookup table is necessary.
Make sense?
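A sketch of that layout (every name here is an assumption):

```sql
CREATE TABLE dbo.Users (
    UserID       int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ADObjectGuid uniqueidentifier  NOT NULL UNIQUE,  -- Active Directory objectguid
    Initials     nvarchar(10)      NULL
);

-- Orders then references the int key instead of storing initials:
ALTER TABLE dbo.Orders
    ADD UserID int NULL
        CONSTRAINT FK_Orders_Users REFERENCES dbo.Users (UserID);
```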
Index performance on GUIDs is in general no different from other indexes on a 16-byte field. Where the GUID's are deadly for performance is when you use them as clustering key on a SQL Server table.
A clustered SQL Server table is physically ordered by its clustering key. Since GUIDs are by nature totally random, this leads to massive index fragmentation very quickly, and thus requires a) constant care and reorganization, but b) still suffers even if you reorganize every night. So really, the best practice would be to avoid GUIDs (even the new "sequential" GUIDs) for your clustering key.
For more background info and some excellent write-ups on why GUIDs make very poor clustered keys, see the blog of the "Queen of Indexing", Kimberly Tripp:
GUIDs as primary key
The clustered index debate continues....
The clustered index debate....again!
Her insight is most valuable - read it, internalize it, follow it! You won't regret it.
Marc
I have several tables whose only unique data is a uniqueidentifier (a Guid) column. Because guids are non-sequential (and they're client-side generated so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.
I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.
My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.
What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?
When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.
As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: https://blog.codinghorror.com/primary-keys-ids-versus-guids/.
As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.
[Edit]
#Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is not a true right or wrong answer. It depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:
For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high concurrency contention. Basically, what happens is most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require an Exclusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page locking mechanism. NewID() on the other hand does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.
Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isn't too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using DBCC DBREINDEX. This effectively delays the performance hits of page splits until off-peak times.
If you even think you might need to enable replication for the table in question - then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one that's there.
Yet another huge benefit of using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You don't have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created.
http://forums.asp.net/t/264350.aspx
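For example, flagging the GUID column for replication as described above might look like this (table and column names invented):

```sql
ALTER TABLE dbo.MyTable
    ADD RowGuid uniqueidentifier ROWGUIDCOL NOT NULL
        CONSTRAINT DF_MyTable_RowGuid DEFAULT (NEWID());
```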
The primary key serves three purposes:
indicates that the column(s) should be unique
indicates that the column(s) should be non-null
documents the intent that this is the unique identifier of the row
The first two can be specified in lots of ways, as you have already done.
The third reason is good:
for humans, so they can easily see your intent
for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.
A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.
Just jumping in, because Matt's baited me a bit.
You need to understand that although a clustered index is put on the primary key of a table by default, that the two concepts are separate and should be considered separately. A CIX indicates the way that the data is stored and referred to by NCIXs, whereas the PK provides a uniqueness for each row to satisfy the LOGICAL requirements of a table.
A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.
Rob
Nobody answered the actual question: what are the pluses/minuses of a table with NO PK and NO clustered index?
In my opinion, if you optimize for faster inserts (especially incremental bulk inserts, e.g. when you bulk load data into a non-empty table), such a table - with NO clustered index, NO constraints, NO foreign keys, NO defaults and NO primary key, in a database with the Simple recovery model - is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety) you may want to add non-clustered non-unique indexes as needed, but keep them to a minimum.
I too have always heard having an auto-incrementing int is good for performance even if you don't actually use it.
A Primary Key needn't be an autoincrementing field, in many cases this just means you are complicating your table structure.
Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.
In technical terms, it should be the field that every other field in the tuple is fully functionally dependent upon. (If it isn't you might need to normalise).
In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...
Since you are doing replication, you are correct that identity columns are something to steer clear of. I would make your GUID a primary key, but nonclustered since you can't use newsequentialid. That strikes me as your best course. If you don't make it a PK but put a unique index on it, sooner or later the people who maintain the system may fail to understand the FK relationships properly, introducing bugs.
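That suggestion might be sketched as follows (table and column names are assumptions):

```sql
CREATE TABLE dbo.Widget (
    WidgetID uniqueidentifier NOT NULL,   -- client-generated GUID
    CONSTRAINT PK_Widget PRIMARY KEY NONCLUSTERED (WidgetID)
);
```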