If I have a lookup table with very few records in it (say, less than ten), should I bother putting an index on the Foreign Key of another table to which it is attached? For that matter, does the lookup table even need an index on the Primary Key?
Specifically, is there any performance benefit that outweighs the overhead of maintaining the indexes? If not, are there any benefits other than speed?
Note: an example of a lookup table might be Order Status, where the tuples are:
1 - Order Received
2 - In Process
3 - Shipped
4 - Paid
On a transactional system there may be no significant benefit to putting an index on such a column (i.e. a low cardinality reference column), as the query optimiser probably won't use it. It will also generate additional disk traffic on writes to the table, as the indexes have to be updated. So for low cardinality FKs on a transactional database it is usually better not to index the columns. This particularly applies to high volume systems.
Note that you may still want the FK for referential integrity and that the FK lookup on a small reference table will probably generate no I/O as the lookup table will almost always be cached.
However, you may find that you want to include the column in a composite index for some reason - perhaps to create a covering index for a commonly used query.
On a table that is frequently bulk-loaded (e.g. a data warehouse) the index write traffic will be much larger than that of the table load if you have many indexed columns. You will probably need to drop or disable the FKs and indexes for a bulk load if any indexes are present.
On a Star Schema you can get some benefit from indexing low cardinality columns, even on SQL Server. If you are doing a highly selective query (i.e. one where the query optimiser decides that the row set returned will be small) then it can do a 'star query' plan where it uses a technique known as index intersection.
Generally, query plans on a star schema should be based around a table scan of the fact table or a highly selective process that bookmarks the fact table and then returns a smaller set of rows. Index intersection is efficient for the latter type of query as the selection can be resolved before doing any I/O on the fact table.
Bitmap indexes are a real win for low cardinality columns on platforms such as Oracle that support them, but SQL Server does not. Even so, low cardinality indexes can still participate in star query plans on SQL Server.
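To make index intersection concrete, here is a minimal sketch in Python. It is only a toy model, not how SQL Server implements it: the fact table, its columns (status_id, region_id, amount) and the helper names are all hypothetical. Each "index" maps a low cardinality value to the set of row ids holding it, and the selection is resolved by intersecting those sets before any fact row is read.

```python
from collections import defaultdict

# Hypothetical fact table: row id -> (status_id, region_id, amount).
fact_table = {
    0: (1, 10, 25.0),
    1: (3, 20, 40.0),
    2: (3, 10, 15.0),
    3: (2, 10, 60.0),
    4: (3, 10, 35.0),
    5: (3, 30, 80.0),
}

def build_index(column):
    """Map each distinct value in the given column to the set of row ids that hold it."""
    index = defaultdict(set)
    for row_id, row in fact_table.items():
        index[row[column]].add(row_id)
    return index

status_index = build_index(0)  # index on status_id
region_index = build_index(1)  # index on region_id

def star_query(status_id, region_id):
    """Intersect the two row-id sets first, then fetch only the surviving fact rows."""
    matching_ids = status_index.get(status_id, set()) & region_index.get(region_id, set())
    return [fact_table[row_id] for row_id in sorted(matching_ids)]

result = star_query(status_id=3, region_id=10)
print(result)
```

Neither index is selective on its own (four of the six rows match each predicate), but the intersection leaves only two rows to fetch from the fact table.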
Yes, always have an index.
The query optimizer of a modern database management system (DBMS) will make the determination as to which is faster: (1) actually reading from an index on a column, (2) performing a full table scan.
The table size (in number of rows) needs to be "large enough" for use of the index to be considered.
Yes to both. Always index as a rule of thumb.
Points:
You also can't set up an FK without a unique index on the lookup table.
What if someone deletes or updates a row in the lookup table? Especially accidentally...
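As a small illustration of the accidental-delete point, the sketch below uses SQLite through Python's sqlite3 module as a stand-in for the real database (an assumption; the thread is about SQL Server). The order_status rows come from the question; the orders table and the try_delete_status helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is switched on
conn.executescript("""
CREATE TABLE order_status (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    status_id INTEGER NOT NULL REFERENCES order_status(id)
);
INSERT INTO order_status VALUES
    (1, 'Order Received'), (2, 'In Process'), (3, 'Shipped'), (4, 'Paid');
INSERT INTO orders VALUES (100, 3);
""")

def try_delete_status(status_id):
    """Return True if the lookup row was deleted, False if the FK blocked the delete."""
    try:
        conn.execute("DELETE FROM order_status WHERE id = ?", (status_id,))
        return True
    except sqlite3.IntegrityError:
        return False

print(try_delete_status(3))  # status 3 is referenced by order 100
print(try_delete_status(4))  # status 4 is not referenced by any order
```

To validate such a delete, the engine has to look for referencing rows in the child table, which is the work an index on the FK column speeds up.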
However, saying that, we don't always.
We have a very busy OLTP table (5 million+ rows per day) with several parent tables. We only have indexes on the FK columns where we need them. We assume no deletes or key updates on some parent tables, so we reduce the amount of work needed and the disk space used.
We used the SQL Server 2005 DMVs to establish that the indexes weren't used. We still have the FKs in place though.
My personal opinion is that you should... it may be small now but ALWAYS anticipate your tables growing in size. A good database schema will grow easily with more records. Foreign Keys are almost always a good idea.
In SQL Server, the primary key becomes the clustered index by default if the table doesn't already have one.
I have a large table consisting of 4 Billion+ rows and 50 columns, most of which are either datetime or numeric except a few which are varchar.
Data will be inserted into the table on a weekly basis (about 20 million rows).
I expect queries with where clauses on some of the datetime columns and a couple of the varchar columns. There is no primary key in the table.
There are no indexes, nor is the table partitioned. I am using SQL Server 2016.
I understand that I need to partition or index the table, but I am not sure which approach to take, or whether I in fact need both.
Since the table is large, should I create the indexes first or should I create the partitions first? If I do create the indexes and then create the partitions, what should I do to maintain these with new data coming in weekly.
EDIT: Also, minimal updates and deletes are expected on the table
I understand that I need to partition or index the table
You need to understand what you gain from partitioning. It is not at all the case that SQL Server requires partitioning on big tables to function adequately. SQL Server scales to arbitrary table sizes without any inherent issues.
Common benefits of partitioning are:
Mass deletion in constant time
Different storage for older partitions
Not backing up old partitions
Sometimes in special situations (e.g. columnstore), partitioning can help as a strategy to speed up queries. Normally, indexing is better for that.
Essentially, partitioning splits the table physically into multiple sub tables. Most often this has a negative effect on query plans. Indexes are perfectly capable of restricting the set of data that needs to be touched. Partitions are worse for that.
Most of the queries will be filtering on the datetime columns and on some of the varchar columns. Like, get data for a certain daterange for a certain entity. With the indexes, it will be fragmented a lot because of new inserts and rebuilding/reorganising the indexes will also consume a lot of time. I can do it but again not sure which approach.
It seems you can best solve this by indexing:
Index according to the queries you expect.
Maintain the indexes properly. This is not too hard. For example, rebuild them after the weekly load.
Since the table is large, should I create the indexes first or should I create the partitions first?
Set up the partitioning objects first. Then, create or rebuild the clustered index on the new partitioning scheme. If possible, drop the other indexes first and recreate them afterwards (this might not work due to availability restrictions).
what should I do to maintain these with new data coming in weekly.
What concerns do you have? New data will be stored in the appropriate partitions automatically. Make sure to create new partitions before loading the data. Keep partitions ready for 2 weeks in advance. The latest partitions must always be empty to avoid costly splits.
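The "keep partitions ready in advance" rule can be turned into a small pre-load check. The sketch below is only an illustration under assumptions: weekly boundaries, a hypothetical helper name (boundaries_to_create), and boundary dates invented for the example. It only computes which boundaries are missing; creating them would be a separate step in the database (in SQL Server, typically by splitting the partition function's range while the affected partition is still empty).

```python
from datetime import date, timedelta

def boundaries_to_create(existing, load_date, weeks_ahead=2):
    """Return the weekly boundaries still missing so that partitions
    exist up to weeks_ahead weeks past load_date."""
    needed_until = load_date + timedelta(weeks=weeks_ahead)
    last = max(existing)
    missing = []
    while last < needed_until:
        last += timedelta(weeks=1)
        missing.append(last)
    return missing

# Hypothetical state: boundaries exist up to the week being loaded.
existing = [date(2024, 1, 1), date(2024, 1, 8)]
missing = boundaries_to_create(existing, load_date=date(2024, 1, 8))
print(missing)
```

Running the check before every weekly load keeps the newest partitions empty, so a split never has to move data.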
There is no primary key in the table.
Most often this is not a good design. Most tables should have a primary key and a clustered index. If there is no natural key, use an artificial one such as a bigint identity.
You definitely can apply partitioning but my feeling is that it will not gain you what you maybe expect. But it will force you to take on additional maintenance burdens, possibly reduce performance and there is risk of making mistakes that threaten availability. Simplicity is important.
I have just taken over a database which has around 2200 tables. Over 2000 of these have no clustered index (some have no indexes at all).
All of the tables have been configured to use a GUID as the uniqueidentifier.
Just looking at the query plans, I can see that there are many table scans occurring. Most searches use the uniqueidentifier to search on.
I am wondering if it is better to have a clustered index on the GUID than not to have a clustered index at all. I imagine that a clustered index on a 16-byte column will inevitably lead to fragmentation.
I could arguably cluster on other columns but the majority of searches tend to search by or join via the GUIDS.
Any advice would be very much welcomed. I've never seen so many GUIDs!!
In general, I would recommend having an identity column as the primary key and using that for clustering. This is also a better choice for joins.
Why? First, identity keys are generally shorter than unique ids. So, foreign key references and indexes are smaller.
More importantly, inserts would always go at the "end" of the table. When using GUIDs, inserts are often going to cause fragmentation. If you are inserting rows, I would say that a secondary index on the GUID might be better than a clustered index (the fragmentation is only in the index).
With 2000 tables, I doubt you will change the structure. You can ameliorate the fragmentation using newsequentialid().
A GUID column with random values is usually not the best choice for a clustered index, because it can be the root cause of index fragmentation:
The database's read-ahead won't be effective;
Insert operations become expensive, because you'll get a lot of page-split overhead.
There are three ways you can live with that:
Schedule regular index reorganizing and rebuilding, which reduces index fragmentation (a rebuild also refreshes the index statistics);
Use NEWSEQUENTIALID() to generate the values of this column;
Generate the GUID values sequentially outside of the database (the Guid.Comb identifier is a great example of solving this issue in NHibernate).
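The difference between random and sequential key values can be seen in a toy simulation. The sketch below is a deliberately simplified model, not SQL Server's storage engine: the leaf level is a list of fixed-capacity pages, an insert into a full page in the middle of the table causes a 50/50 split, and an insert past the end of the table simply starts a new page. The page capacity, the key counts and the function name are assumptions made for illustration; shuffled integers stand in for random GUIDs and increasing integers for identity or sequential GUID values.

```python
import bisect
import random

def simulate_inserts(keys, page_capacity=8):
    """Insert keys into a toy leaf level; return (page_splits, average_page_fill)."""
    pages = [[]]  # each page is a sorted list of keys
    splits = 0
    for key in keys:
        # Pick the last page whose first key is <= key (page 0 takes anything smaller).
        idx = bisect.bisect_right([p[0] for p in pages[1:]], key)
        page = pages[idx]
        if len(page) < page_capacity:
            bisect.insort(page, key)
            continue
        if idx == len(pages) - 1 and key > page[-1]:
            # Insert past the end of the table: start a fresh page, no split.
            pages.append([key])
            continue
        # Full page in the middle of the key range: move the upper half to a new page.
        mid = page_capacity // 2
        new_page = page[mid:]
        del page[mid:]
        pages.insert(idx + 1, new_page)
        splits += 1
        bisect.insort(new_page if key >= new_page[0] else page, key)
    fill = sum(len(p) for p in pages) / (len(pages) * page_capacity)
    return splits, fill

sequential_keys = list(range(2000))   # stand-in for identity / sequential GUID values
random_keys = sequential_keys[:]
random.Random(42).shuffle(random_keys)  # stand-in for random GUID values

seq_splits, seq_fill = simulate_inserts(sequential_keys)
rand_splits, rand_fill = simulate_inserts(random_keys)
print("sequential:", seq_splits, "splits, fill", round(seq_fill, 2))
print("random:    ", rand_splits, "splits, fill", round(rand_fill, 2))
```

In this model the sequential load never splits a page and leaves every page full, while the random load splits repeatedly and leaves pages partly empty, which is the fragmentation described above.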
This is really a comment on your question to Gordon's good answer:
Firstly, don't forget to check the index DMVs to see which ones are being used (or not used) and have a look at the expensive query plans in the cache to focus on the tables and queries that will be causing most pain. I would expect that many of those 2200 tables are relatively small & the queries are able to look up pretty quickly even from the guid clustered index.
For those tables that aren't clustered, clustering on the guid would reduce fragmentation, since it forces all the data for the table to be colocated rather than allowing pages to be put in the next free extent & spreading tables all over the disk. This should make some of the I/O more efficient.
Check you have a low enough fillfactor so that your regular index rebuilds avoid page splitting in advance, although it will also be workload dependent (OLTP vs DW and read/write ratio of table)
If you have applications that are doing explicit column selects/inserts then you may be able to add an identity column without breaking anything. That allows you to cluster around the identity and add an index to the guid. Whether this really helps depends on the relative (in)efficiency of the new plans.
You could consider clustering around a non-guid field where queries will lookup against it fairly regularly (eg, a date range) and index the guid separately.
You'd have to look at the queries & relative performance for that more closely.
I would like to optimize the performance of a database that my team is using for an application.
I have been looking for areas to add foreign keys, and in turn index those columns to improve the performance of joins. However, many of our tables are joined on an id that is a GUID type, generated upon insertion of an item, and the data associated with that item in other tables generally has a column item_id containing the GUID.
I have read that adding clustered indexes to GUID type columns is a very bad decision because the index will need to be constantly reconstructed in order to be effective. However, I was wondering, is there any detriment to utilizing a non-clustered index in the scenario described above? Or is it reasonable to assume that it would help performance? I can provide more information if needed.
An index on a <anytype> is by far the best option you have to improve joins and singleton lookups. Lacking this index the query will always have to scan the entire table end-to-end with (often) abysmal performance results and concurrency gone down the drain.
It is true that uniqueidentifier makes a poor choice for indexes for the reasons you mention, but by no means does that imply that you should not create these indexes. Changing the data type to INT or BIGINT would be advisable, if possible. Using NEWSEQUENTIALID() or UuidCreateSequential to generate them would help with fragmentation issues. If all alternatives fail you may have to do index maintenance (rebuild, reorganize) operations more often than for other indexes. But by no means do any of these drawbacks outweigh the benefit of having the index in the first place!
There are two kinds of performance to consider:
- insert
- select
An index should improve select.
An index will slow down insert.
If the inserts are in order, the index does not fragment.
If the inserts are not in order, the index will fragment.
Index fragmentation slows down both insert and select.
Maintenance can defragment the index.
Adding a non-clustered index to the column that references the FK will help the joins.
Since that column is most likely not ordered anyway, the fact that it is a GUID is no loss.
The FK table itself is where a GUID is not a good candidate for the PK (clustered index).
With a GUID as the PK, that index fragments on insert.
An int or sequential ID is a better candidate, as it would not fragment the PK on insert.
But it's no big deal; just defragment those tables.
Yes, you are better off changing the Guid index from clustered to non-clustered. Guid can still be primary key and you don't need to change your query/source code. No reordering of data and increased performance.
In databases like SQL Azure it is mandatory to have a clustered index, so you could use a date/datetime field for it. Creating an additional int identity/autoincrement column is unnecessary: some developers on a team tend to use those and others the GUID, resulting in an inconsistent application. So keep only the GUID, full stop!
Talking about sequential GUIDs, I think GUIDs are better created from code than from the database. Modern DALs and repository patterns prefer not to depend on the DB for CRUD (example scenario: a LINQ query and automated builds with unit testing without a DB dependency). And creating a sequential GUID ourselves is not a good idea (at least for me). So a GUID as the primary key with a non-clustered index is the best option there is.
I have backing from Microsoft on the non-clustered subject http://blogs.msdn.com/b/sqlazure/archive/2010/05/05/10007304.aspx
Edited: Backing is gone ("No Resource Found")
It would usually help performance. But you may wish to create the index with a fillfactor of less than 100% such that the inevitable page-splits don't have to happen quite so often. Regular maintenance on the index would certainly be a plus.
Yes, a non-clustered index would be ideal for your situation. The underlying structure is a B-tree, like the clustered index, but the underlying data in the table is not sorted by it, so the problems with the non-sequential nature of the GUID do not apply to the table itself. The NC index exists separately from the table.
Be careful to not add too many non-clustered indices though. Optimize only where you need to. Run the profiler to see which queries are taking a long time, and optimize only those. Additionally, be sure to set the fill factor to a value <50% unless the database rarely gets any updates, or space is a constraint.
Relevant MSDN: http://msdn.microsoft.com/en-us/library/ms177484(v=sql.105).aspx
The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect statistics of usage (queries run), look at the engine's own statistics like sys.dm_db_index_usage_stats, and then make an informed decision: the partitioning that best balances data size and gives the best affinity for the most often run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (where 'table' = one of the indexes), not per table, so the question is not what to partition on, but which indexes to partition or not and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates obviously (not much sense to partition just a non-clustered index and not partition the clustered one) so, unless you're considering redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I'd venture a guess I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partition is the sliding window.
If you have no other choice you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on your UniqueIdentifier you might need to start allocating ids manually).
When performing a query, you will need to run same query on all tables, and use UNION to merge the result set into a single query result.
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected queries, but it's better than hitting the size limit of a table.
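A minimal sketch of this modulo scheme, using SQLite through Python's sqlite3 module as a stand-in database (an assumption; nothing here is specific to it). The table names Case00 to Case09 and the PrimaryKey, Year and Status columns come from the text above; the helper functions and sample rows are hypothetical. The sketch merges with UNION ALL rather than UNION, since the partitions are disjoint and there are no duplicates to remove.

```python
import sqlite3

NUM_PARTITIONS = 10

def table_for(pk):
    """Name of the partition table that holds the given key."""
    return f"Case{pk % NUM_PARTITIONS:02d}"

conn = sqlite3.connect(":memory:")
for i in range(NUM_PARTITIONS):
    conn.execute(
        f"CREATE TABLE Case{i:02d} (PrimaryKey INTEGER PRIMARY KEY, Year INTEGER, Status TEXT)"
    )

def insert_case(pk, year, status):
    conn.execute(f"INSERT INTO {table_for(pk)} VALUES (?, ?, ?)", (pk, year, status))

def find_by_pk(pk):
    """A lookup by key only touches the one table the key maps to."""
    sql = f"SELECT PrimaryKey, Year, Status FROM {table_for(pk)} WHERE PrimaryKey = ?"
    return conn.execute(sql, (pk,)).fetchone()

def find_by_year(year):
    """Any other predicate has to run against every table and merge the results."""
    union = " UNION ALL ".join(
        f"SELECT PrimaryKey, Year, Status FROM Case{i:02d} WHERE Year = ?"
        for i in range(NUM_PARTITIONS)
    )
    return conn.execute(union + " ORDER BY PrimaryKey", [year] * NUM_PARTITIONS).fetchall()

for pk in range(1, 26):
    insert_case(pk, 2000 + pk % 3, "Open")

print(table_for(17), find_by_pk(17))
print(len(find_by_year(2001)))
```

The two query helpers show the trade-off: key lookups stay cheap, while every query on another column pays for all ten tables.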
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps which could improve performance by different choices in the normalization/de-/partial-normalization? Are there options to transform the data into a Kimball-style dimensional star model which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.
Suppose you have a very large database, and to simplify let's say it consists of one major table you will be doing your lookups on, with one (and only one) primary key field: pk.
Given the fact that all lookups are going to be basically SELECT * FROM table_name WHERE pk=someKeyValue, what is the best way to optimize this database for the fastest lookups?
Edit: just a few more details - INSERTs and UPDATEs are going to be very non-frequent so I don't mind sacrificing performance there to achieve better lookup performance.
Also, seems like clustering is the way to go. Do you have any examples of the kind of increase in performance I can achieve with this method? And how exactly is this done (on any kind of DB)?
If the primary key is clustered, then you won't get any quicker.
If it isn't clustered, and the number of columns in your table is relatively small, then you could in theory create a covering index to speed up the query. But then this negates any insert/update performance enhancements that having the non-clustered primary key would have given you.
If your primary key is an always-increasing field (e.g. a SQL Server identity, or generated from a sequence in Oracle) then the clustered primary key has no drawbacks anyway.
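One way to check that a primary key lookup is already served by the table's own key order is to ask the engine for its plan. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in (an assumption; the question is engine-agnostic). In SQLite an INTEGER PRIMARY KEY is the key the rows are stored by, which plays a role similar to a clustered primary key. The payload column and the plan_for helper are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (pk INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1000)],
)

def plan_for(sql, params):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

# Lookup by primary key: expected to be a keyed search.
pk_plan = plan_for("SELECT * FROM table_name WHERE pk = ?", (42,))
# Lookup by an unindexed column: expected to be a full scan.
scan_plan = plan_for("SELECT * FROM table_name WHERE payload = ?", ("row-42",))

print(pk_plan)
print(scan_plan)
```

If the first plan already reports a search on the primary key, there is no further index to add for that query; the remaining gains come from caching and storage.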
One thing you could do is make the primary key clustered; this results in the actual data being physically ordered on the disk, resulting in faster queries.
It will also mean slower inserts, but if you select much more frequently than you insert, this should not be a problem.
If you're using MySQL, you can do some additional things (beyond tuning your cache values). The table engine can be a factor; for instance, MyISAM is widely held to be faster at SELECTs than InnoDB. If this table is primarily a lookup table, and you were using MySQL, that might be a good thing to do. (InnoDB is pretty good on average; it's better on writes than MyISAM, and also, InnoDB never needs to be repaired.)
I have to add two more options to all that was proposed above (I like dwc’s answer). You should consider partitioning if your table is really big.
First, horizontal partitioning (especially if I/O is the bottleneck in your DB). You create several filegroups and locate them on different hard drives. Then, create a Partition Function and a Partition Scheme to divide your table and put parts of your table on separate HDs (like rows 1-499999 to the F: drive, 500000-999999 to the G: drive, and so on).
Second, vertical partitioning. This would work if you select column sets (not *) in most of your queries. In this case, divide columns in the table in two groups: first, fields that you need in all queries; second, fields that you rarely need. Create two tables with the same primary key. Use JOINs on the primary key when you need columns from both tables.
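A small sketch of the vertical split, again using SQLite through Python's sqlite3 module as a stand-in for the real database (an assumption; the answer itself is about SQL Server). The item_hot and item_cold tables, their columns and the helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Columns needed by almost every query.
CREATE TABLE item_hot (pk INTEGER PRIMARY KEY, name TEXT, price REAL);
-- Rarely needed, bulky columns, sharing the same primary key.
CREATE TABLE item_cold (
    pk INTEGER PRIMARY KEY REFERENCES item_hot(pk),
    long_description TEXT
);
INSERT INTO item_hot VALUES (1, 'widget', 9.5), (2, 'gadget', 12.0);
INSERT INTO item_cold VALUES
    (1, 'A long widget description'), (2, 'A long gadget description');
""")

def get_summary(pk):
    """Common case: read only the narrow table."""
    return conn.execute(
        "SELECT name, price FROM item_hot WHERE pk = ?", (pk,)
    ).fetchone()

def get_full(pk):
    """Rare case: join both halves on the shared primary key."""
    return conn.execute(
        "SELECT h.name, h.price, c.long_description "
        "FROM item_hot AS h JOIN item_cold AS c ON c.pk = h.pk "
        "WHERE h.pk = ?",
        (pk,),
    ).fetchone()

print(get_summary(1))
print(get_full(2))
```

The frequent query reads narrower rows, so more of them fit per page and in cache; only the occasional full read pays for the join.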
(This answer pertains to SQL Server 2005/2008.)
If all your queries are going to be based off the PK, you wouldn't get any added benefit by setting an index on the PK since it should already be indexing by that.
Edit: The only other possible things I would suggest is looking at normalizing your table (if that is even an option or necessity). By splitting off items into other tables, you can refine what is being pulled back in each query and only pull the less-used items when needed using joins.
Based off the limited description of "a very large database with a single table" it is hard to locate any easy and obvious ways to optimize without looking at what kind of data you are actually storing in your fields.
If your PK order matches insertion order, i.e. time or id/autoincrement, then make it clustered. This will reduce disk and cache thrashing on inserts, leaving more resources to devote to lookups.
Consider tweaking page sizes on the table to be an exact multiple of your record size. This requires intimate knowledge of the particular database software for details of how, and record/index overhead, etc.
If practical, use fixed-size for all columns rather than variable size.
Consider putting the index and/or transaction log files on a separate volume.
Install as much RAM as the software and hardware can use.
If you were using Oracle then I'd advise benchmarking three approaches:
1. Heap table with primary key index
2. Index-organised table
3. Single table hash cluster
Option 1 is a very vanilla approach; really it's the lowest common denominator, but it could mean 5+ logical reads to get each row, with one of those being a probable physical read of the table if it is not completely cached.
Option 2 will save you one of those logical reads by avoiding the probe to a separate table segment, but might not save you the physical read because the IOT segment will be larger and harder to cache than the index alone.
Option 3 will potentially get you the row with a single logical read, but unless you have the entire table cached that's probably going to translate into a physical read.
Benchmarking is highly recommended.