When developing large systems (hundreds of tables) Do you create the indexes (and to lesser extend the other constraints in the DB) when you create the entities (tables), or wait for the system to be running (may be private Beta) to decide where to put the indexes?
I design indexes based on the eventual query scenarios. What will be the most common queries run against the table? That should inform index design - both to optimize query performance as well as to minimize insert/update/delete overhead.
Simply creating a clustered index on the primary key, for example, may make sense in a theoretical world up front, but may not mirror real-world query load.
For example: what if you have a table of order items, where 0-n order items are associated with a parent order? Do you just create an order item ID column, designate it the primary key, and burn your clustered index even though in the real world, 90% of your query activity against this table will be "get order items for order xyz", implying that a clustered index on parent order ID might make more sense than the "default" primary key clustered index on order item ID?
You can do a lot of this up front by knowing what scenarios your application will enable. Then, you can also do traces in the real world and analyze them to find where you are missing indexes; SQL Server, for example, ships with tools to do this, there are third-party tools too. One technique I use sometimes is also to do a big trace, upload the trace info into a table, and query it for distinct SQL statements (based on whatever criteria... e.g. give me all UPDATEs against table xyz...) and then you can do a query plan for those statements and see how good your indexing is by, for example, looking for and addressing table or index scans appropriately - and verifying by re-examining the execution plan for the query.
Some cautionary notes... don't apply indexes willy-nilly based on traces. An index on a table will affect overall performance of all queries against the table. Don't assume that a table or index scan (rather than a seek) is necessarily bad; it doesn't matter in a ten-row table. Index optimization is a combination of science and art, so keeping it simple is critical, testing frequently after small incremental changes is a good way to retain sanity and be able to roll back frequently, and above all, when you have a set of changes, script them out so that your DBA has an exact protocol of what will be done, and can easily determine where/what to roll back if needed.
If you know what fields you are going to be using most of the time (where and order by clauses`), you might as well create them when creating the entities.
You can always revisit later, and any DBA worth his salt would.
Related
I have a table with 7 columns:
ProductId
WarehouseId
StockDate
OnHandQuantity
InTransitQuantity
CreateDate
CreateUser
One of my user's is trying to improve the performance of a query and has created an index that was suggested by SSMS. The index looks like the following:
CREATE NONCLUSTERED INDEX [IDX1_FactStock2016] ON [dbo].[FactStock2016]
(
[WarehouseId] ASC
)
INCLUDE ([ProductId],[StockDate],[OnHandQuantity],[InTransitQuantity])
Now to me, it seems counter intuitive to create an index like this when it basically includes all columns from the underlying table (CreateDate and CreateUser are not relevant). Isn't this basically creating a copy of the table as an index (minus the two audit columns)? Is this actually going to have any impact on the speed?
Such an index would be a covering index for all queries on the table. For many, queries this could be the best index.
There are other options, such as a clustered index that has WarehouseId as the first key in the index.
Whether this is a good idea or not depends on your environment. If your sole concern is the performance of this one user's query, then it sounds like a good idea. Additional indexes do incur additional overhead for inserts, updates, and deletes (although indexes can also help the performance of updates and deletes depending on the conditions). Adding more indexes requires a balance of considerations. There is nothing a priori wrong, though, about having covering indexes for a query.
Yes, a covering index like this is a redundant copy of the data, It can also improve query performance by avoiding the key lookup.
Indexing decisions often involve trade-offs. Evaluate whether the additional costs of storage and maintenance outweigh the gains in performance. A frequently executed query that benefits from the index may very well outweigh the costs.
This index suggestion basically means that, for this particular query, it would be better if table's clustered index would have been built on the WarehouseId column. Since you cannot have more than 1 clustered index on any given table or view, sometimes this excess can be well worth the extra space (and some additional performance degradation of data modification statements).
Whether it is beneficial or not, however, you can only decide for yourself and your system - there are no general recommendations on this matter that would fit every situation except SQL Server index usage statistics, and what you can get from it. Shop around for index suggestion scripts, there are plenty of them everywhere.
We have a very large database and have been using shards which we want to get away from. The shards work by everytime a table gets really big, we start a new table that has the same schema as the previous table and keep a number in another table that helps us find which table the data is in. This is a cumbersome manual process and means we have data spread out over N different tables all with the same schema.
The idea we are trying for is to eliminate this need for sharding by using indexes. Our data lookup queries do not use unique keys and many records are returned that have the same values across columns.
The following illustrates many of our lookup selects for a particular table, the fields with the * indicate that field may or may not be in the select.
where clause: scheduled_test, *script, *label, *error_message
group/order: messenger_id, timeslice, script, label, error_message, step_sequence, *adapter_type
My thought is that I would not want to create an index with all of these 11 fields. I instead picked 3 of the ones that seemed to be used more commonly including the one that is always in the where clause. I had read that it is advisable not to have too wide an index with too many fields. I also had heard that the optimizer will use the indexed fields first and that it is not uncommon to have non unique indexes even though MSDN states to the effect that unique indexes is the big advantage. It's just not how our data is designed. I realize SQL will add something to the index to make it unique, but that doesn't seem to matter for our purposes.
When I look at the execution plan in sql server management studio on a query that is similar to what we might run, it says "clustered index seek cost 100%", but it is using the clustered index that I created so I am hoping this is better than the default clustered index that is just the generated primary key (previously how the table was defined). I am hoping that what I have here is as good or better than our sharding method and will eliminate the need for the shards.
We do insert alot of data into the tables all at once, but these rows all have the same data values across many columns and I think they would even tend to get inserted at the end as well. These inserts don't share values with older data and if the index is just 3 columns hopefully that would not be a very big hit on the inserts.
Does what I am saying seem reasonable or what else should I look into or consider ? Thanks alot, I am not that familiar with these types of indexing issues but have been looking on various websites and experimenting.
Generally, the narrower the clustered index the better as the clustering key of the clustered index will be added to all non-clustered indexes, making them less efficient.
SQL server will add a uniquifier to non-unique clustered indexes, making them (and all non-clustered indexes) even wider still.
If the space used by these indexes is not an issue for you, then you should consider whether the value of the clustered index key is ever increasing (or decreasing) as if it isn't, you will get page splits and fragmentation which will definitely hurt your inserts.
It's probably worth setting this up in a test system if you can to examine the impact different indexing strategies have on your normal queries.
The table in question is part of a database that a vendor's software uses on our network. The table contains metadata about files. The schema of the table is as follows
Metadata
ResultID (PK, int, not null)
MappedFieldname (char(50), not null)
Fieldname (PK, char(50), not null)
Fieldvalue (text, null)
There is a clustered index on ResultID and Fieldname. This table typically contains millions of rows (in one case, it contains 500 million). The table is populated by 24 workers running 4 threads each when data is being "processed". This results in many non-sequential inserts. Later after processing, more data is inserted into this table by some of our in-house software. The fragmentation for a given table is at least 50%. In the case of the largest table, it is at 90%. We do not have a DBA. I am aware we desperately need a DB maintenance strategy. As far as my background, I'm a college student working part time at this company.
My question is this, is a clustered index the best way to go about this? Should another index be considered? Are there any good references for this type and similar ad-hoc DBA tasks?
The indexing strategy entirely depends on how you query the table and how much performance you need to get out of the respective queries.
A clustered index can force re-sorting rows physically (on disk) when out-of-sequence inserts are made (this is called "page split"). In a large table with no free space on the index pages, this can take some time.
If you are not absolutely required to have a clustered index spanning two fields, then don't. If it is more like a kind of a UNIQUE constraint, then by all means make it a UNIQUE constraint. No re-sorting is required for those.
Determine what the typical query against the table is, and place indexes accordingly. The more indexes you have, the slower data changes (INSERTs/UPDATEs/DELETEs) will go. Don't create too many indexes, e.g. on fields that are unlikely to be filtered/sorted on.
Create combined indexes only on fields that are filtered/sorted on together, typically.
Look hard at your queries - the ones that hit the table for data. Will the index serve? If you have an index on (ResultID, FieldName) in that order, but you are querying for the possible ResultID values for a given Fieldname, it is likely that the DBMS will ignore the index. By contrast, if you have an index on (FieldName, ResultID), it will probably use the index - certainly for simple value lookups (WHERE FieldName = 'abc'). In terms of uniqueness, either index works well; in terms of query optimization, there is (at least potentially) a huge difference.
Use EXPLAIN to see how your queries are being handled by your DBMS.
Clustered vs non-clustered indexing is usually a second-order optimization effect in the DBMS. If you have the index correct, there is a small difference between clustered and non-clustered index (with a bigger update penalty for a clustered index as compensation for slightly smaller select times). Make sure everything else is optimized before worrying about the second-order effects.
The clustered index is OK as far as I see. Regarding other indexes you will need to provide typical SQL queries that operate on this table. Just creating an index out of the blue is never a good idea.
You're talking about fragmentation and indexing, does it mean that you suspect that query execution slows down? Or do you simply want to shrink/defragment the database/index?
It is a good idea to have a task to defragment indexes from time to time during off-hours, though you have to consider that with frequent/random inserts it does not hurt to have some spare space in the table to prevent page splits (which do affect performance).
I am aware we desperately need a DB maintenance strategy.
+1 for identifying that need
As far as my background, I'm a college student working part time at this company
Keep studying, gain experience, but get an experienced consultant in in the meantime.
The table is populated by 24 workers running 4 threads each
I presume this is pretty mission critical during the working day, and downtime is bad news? If so don't clutz with it.
There is a clustered index on ResultID and Fieldname
Is ResultID the first column in the PK, as you indicate?
If so I'll bet that it is insufficiently selective and, depending on what the needs are of the queries, the order of the PK fields should be swapped (notwithstanding that this compound key looks to be a poor choice for the clustered PK)
What's the result of:
SELECT COUNT(*), COUNT(DISTINCT ResultID) FROM MyTable
If the first count is, say, 4 x as big as the second, or more, you will most likely be getting scans in preference to seeks, because of the low selectively of ResultsID, and some simple changes will give huge performance improvements.
Also, Fieldname is quite wide (50 chars) so any secondary indexes will have 50 + 4 bytes added to every index entry. Are the fields really CHAR rather than VARCHAR?
Personally I would consider increased the density of the leaf pages. At 90% you will only leave a few gaps - maybe one-per-page. But with a large table of 500 million rows the higher packing density may mean fewer levels in the tree, and thus fewer seeks for retrieval. Against that almost every insert, for a given page, will require a page split. This would favour inserts that are clustered, so may not be appropriate (given that your insert data is probably not clustered). Like many things, you'd need to make a test to establish what index key density works best. SQL Server has tools to help analyse how queries are being parsed, whether they are being cached, how many scans of the table they cause, which queries are "slow running", and so on.
Get a consultant in to take a look and give you some advice. This aint a question that answers here are going to give you a safe solution to implement.
You really REALLY need to have some carefully thought through maintenance policies for tables that have 500 millions rows and shed-loads of inserts daily. Sorry, but I have enormous frustration with companies that get into this state.
The table needs defragmenting (your options will become fewer if you don't have a clustered index, so keep that until you decide that there is a better candidate). "Online" defragmentation methods will have modest impact on performance, and can chug away - and can safely be aborted if they overrun time / CPU constraints [although that will most likely take some programming]. If you have a "quiet" slot then use it for table defragmentation and updating the statistics on indexes. Don't wait until the weekend to try to do all tables in one go - do as much/many as you can during any quiet time daily (during the night presumably).
Defragmenting the tables is likely to lead to a huge increased in Transaction log usage, so make sure that any TLogs are backed up frequently (we have a 10 minute TLog backup policy, which we increase to every minute during table defragging so that the defragging process doesn't become the definition of required Tlog space!)
Suppose you have a very large database, and to simplify lets say it consists of one major table you will be doing your lookups on with one (and only one) primary key field - pk.
Given the fact that all lookups are going to be basically SELECT * FROM table_name WHERE pk=someKeyValue, what is the best way to optimize this database for the fastest lookups?
Edit: just a few more details - INSERTs and UPDATEs are going to be very non-frequent so I don't mind sacrificing performance there to achieve better lookup performance.
Also, seems like clustering is the way to go. Do you have any examples of the kind of increase in performance I can achieve with this method? And how exactly is this done (on any kind of DB)?
If the primary key is clustered, then you won't get any quicker.
If it isn't clustered, and the number of columns in your table is relatively small, then you could in theory create a covering index to speed up the query. But then this negates any insert/update performance enhancements that having the non-clustered primary key would have given you.
If your primary key is an always-increasing field (e.g. a SQL Server identity, or generated from a sequence in Oracle) then the clustered primary key has no drawbacks anyway.
One thing you could do is make the primary key clustered, this results in the actual data being physically ordered on the disk, resulting in faster queries.
It will also mean slower inserts, but if you select much more frequently than you insert, this should not be a problem.
If you're using MySQL, you can do some additional things (beyond tuning your cache values). The table engine can be a factor; for instance, MyISAM is widely held to be faster at SELECTs than InnoDB. If this table is primarily a lookup table, and you were using MySQL, that might be a good thing to do. (InnoDB is pretty good on average; it's better on writes than MyISAM, and also, InnoDB never needs to be repaired.)
I have to add two more options to all that was proposed above (I like dwc’s answer). You should consider partitioning if your table is really big.
First, horizontal partitioning (especially if I/O is bottleneck in your DB). You create several filegroups and locate them on different hard drives. Then, create Partition Function, Partition Scheme to divide your table and put parts of your table on separate HDs (like rows 1-499999 to the F: drive, 500000-999999 to the G: drive, and so on) .
Second, vertical partitioning. This would work if you select column sets (not *) in most of your queries. In this case, divide columns in the table in two groups: first, fields that you need in all queries; second, fields that you rarely need. Create two tables with the same primary key. Use JOINs on the primary key when you need columns from both tables.
(This answer pertains to SQL Server 2005/2008.)
If all your queries are going to be based off the PK, you wouldn't get any added benefit by setting an index on the PK since it should already be indexing by that.
Edit: The only other possible things I would suggest is looking at normalizing your table (if that is even an option or necessity). By splitting off items into other tables, you can refine what is being pulled back in each query and only pull the less-used items when needed using joins.
Based off the limited description of "a very large database with a single table" it is hard to locate any easy and obvious ways to optimize without looking at what kind of data you are actually storing in your fields.
If your PK order matches insertion order, i.e. time or id/autoincrement, then make it clustered. This will reduce disk and cache thrashing on inserts, leaving more resources to devote to lookups.
Consider tweaking page sizes on the table to be an exact multiple of your record size. This requires intimate knowledge of the particular database software for details of how, and record/index overhead, etc.
If practical, use fixed-size for all columns rather than variable size.
Consider putting the index and/or transaction log files on a separate volume.
Install as much RAM as the software and hardware can use.
If you were using Oracle then I'd advise benchmarking three approaches:
Heap table with primary key index
Index-organised table
Single table hash cluster
1 represents a very vanilla approach -- really it's the lowest common denominator, but could mean 5+ logical reads to get each row, with one of those being a probable physical read of the table if it is not completely cached.
2 will save you one of those logical read by avoiding the probe to a separate table segment, but might not save you the physical read because the IOT segment will be larger and harder to cache than the index alone.
3 will potentially get you the row with a single logical read, but unless you have the entire table cached that's probably going to translate into a physical read.
Benchmarking is highly recommended.
I have several tables whose only unique data is a uniqueidentifier (a Guid) column. Because guids are non-sequential (and they're client-side generated so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.
I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.
My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.
What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?
When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.
As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: https://blog.codinghorror.com/primary-keys-ids-versus-guids/.
As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.
[Edit]
#Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is not a true right or wrong answer. It depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:
For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high currency contention. Basically, what happens is most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require Exlusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page locking mechanism. NewID() on the other hand does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.
Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isnt too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using dbcc reindex. This effectively delays the performance hits of page splits until off-peak times.
If you even think you might need to enable replication for the table in question - then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one thats there.
Yet another huge benefit for using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You dont have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created.
http://forums.asp.net/t/264350.aspx
The primary key serves three purposes:
indicates that the column(s) should be unique
indicates that the column(s) should be non-null
document the intent that this is the unique identifier of the row
The first two can be specified in lots of ways, as you have already done.
The third reason is good:
for humans, so they can easily see your intent
for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.
A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.
Just jumping in, because Matt's baited me a bit.
You need to understand that although a clustered index is put on the primary key of a table by default, that the two concepts are separate and should be considered separately. A CIX indicates the way that the data is stored and referred to by NCIXs, whereas the PK provides a uniqueness for each row to satisfy the LOGICAL requirements of a table.
A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.
Rob
Nobody answered actual question: what are pluses/minuses of a table with NO PK NOR a CLUSTERED index.
In my opinion, if you optimize for faster inserts (especially incremental bulk-insert, e.g. when you bulk load data into a non-empty table), such a table: with NO clustered index, NO constraints, NO Foreign Keys, NO Defaults and NO Primary Key, in a database with Simple Recovery Model, is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety) you may want to add a non-clustered non-unique indexes as needed but keep them to the minimum.
I too have always heard having an auto-incrementing int is good for performance even if you don't actually use it.
A Primary Key needn't be an autoincrementing field, in many cases this just means you are complicating your table structure.
Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.
In technical terms, it should be the field that every other field in the tuple is fully functionally dependent upon. (If it isn't you might need to normalise).
In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...
Since you are doing replication, your are correct identities are something to stear clear of. I would make your GUID a primary key but nonclustered since you can't use newsequentialid. That stikes me as your best course. If you don't make it a PK but put a unique index on it, sooner or later that may cause people who maintain the system to not understand the FK relationships properly introducing bugs.