Reduce SQL Server table fragmentation without adding/dropping a clustered index? - sql-server

I have a large database (90GB data, 70GB indexes) that's been slowly growing for the past year, and the growth/changes has caused a large amount of internal fragmentation not only of the indexes, but of the tables themselves.
It's easy to resolve the (large number of) very fragmented indexes - a REORGANIZE or REBUILD will take care of that, depending on how fragmented they are - but the only advice I can find on cleaning up actual table fragmentation is to add a clustered index to the table. I'd immediately drop it afterwards, as I don't want a clustered index on the table going forward, but is there another method of doing this without the clustered index? A "DBCC" command that will do this?
Thanks for your help.

Problem
Let's get some clarity, because this is a common problem, a serious issue for every company using SQL Server.
This problem, and the need for CREATE CLUSTERED INDEX, is misunderstood.
Agreed that having a permanent Clustered Index is better than not having one. But that is not the point, and it will lead into a long discussion anyway, so let's set that aside and focus on the posted question.
The point is, you have substantial fragmentation on the Heap. You keep calling it a "table", but there is no such thing at the physical data storage or DataStructure level. A table is a logical concept, not a physical one. It is a collection of physical DataStructures. The collection is one of two possibilities:
Heap
plus all Non-clustered Indices
plus Text/Image chains
or a Clustered Index
(eliminates the Heap and one Non-clustered Index)
plus all Non-clustered Indices
plus Text/Image chains.
Heaps get badly fragmented; the more interspersed (random)Insert/Deletes/Updates there are, the more fragmentation.
There is no way to clean up the Heap, as is. MS does not provide a facility (other vendors do).
Solution
However, we know that Create Clustered Index rewrites and re-orders the Heap, completely. The method (not a trick), therefore, is to Create Clustered Index only for the purpose of de-fragmenting the Heap, and drop it afterward. You need free space in the db of table_size x 1.25.
While you are at it, by all means, use FILLFACTOR, to reduce future fragmentation. The Heap will then take more allocated space, allowing for future Inserts, Deletes and row expansions due to Updates.
Note
Note that there are three Levels of Fragmentation; this deals with Level III only, fragmentation within the Heap, which is caused by Lack of a Clustered Index
As a separate task, at some other time, you may wish to contemplate the implementation of a permanent Clustered Index, which eliminates fragmentation altogether ... but that is separate to the posted problem.
Response to Comment
SqlRyan:
While this doesn't give me a magic solution to my problem, it makes pretty clear that my problem is a result of a SQL Server limitation and adding a clustered index is the only way to "defragment" the heap.
Not quite. I wouldn't call it a "limitation".
The method I have given to eliminate the Fragmentation in the Heap is to create a Clustered Index, and then drop it. Ie. temporarily, the only purpose of which is correct the Fragmentation.
Implementing a Clustered Index on the table (permanently) is a much better solution, because it reduces overall Fragmentation (the DataStructure can still get Fragmented, refer detailed info in links below), which is far less than the Fragmentation that occurs in a Heap.
Every table in a Relational database (except "pipe" or "queue" tables) should have a Clustered Index, in order to take advantage of its various benefits.
The Clustered Index should be on columns that distribute the data (avoiding INSERT conflicts), never be indexed on a monotonically increasing column, such as Record ID 1, which guarantees an INSERT Hot Spot in the last Page.
1. Record IDs on every File renders your "database" a non-relational Record Filing System, using SQL merely for convenience. Such Files have none of the Integrity, Power, or Speed of Relational databases.
Andrew Hill:
would you be able to comment further on "Note that there are three Levels of Fragmentation; this deals with Level III only" -- what are the other two levels of fragmentation?
In MS SQL and Sybase ASE, there are three Levels of Fragmentation, and within each Level, several different Types. Keep in mind that when dealing with Fragmentation, we must focus on DataStructures, not on tables (a table is a collection of DataStructures, as explained above). The Levels are:
Level I • Extra-DataStructure
Outside the DataStructure concerned, across or within the database.
Level II • DataStructure
Within the DataStructure concerned, above Pages (across all Pages)
This is the Level most frequently addressed by DBAs.
Level III • Page
Within the DataStructure concerned, within the Pages
These links provide full detail re Fragmentation. They are specific to Sybase ASE, however, at the structural level, the information applies to MS SQL.
Fragmentation Definition
Fragmentation Impact
Fragmentation Type
Note that the method I have given is Level II, it corrects the Level II and III Fragmentation.

You state that you add a clustered index to alleviate the table fragmentation, to then drop it immediately.
The clustered index removes fragmentation by sorting on the cluster key, but you say that this key would not be possible for future use. This begs the question: why defragment using this key at all?
It would make sense to create this clustered key and keep it, as you obviously want/need the data sorted that way. You say that data changes would incur data movement penalties that can't be borne; have you thought about creating the index with a lower FILLFACTOR than the default value? Depending upon data change patterns, you could benefit from something as low as 80%. You then have 20% 'unused' space per page, but the benefit of lower page splits when the clustered key values are changed.
Could that help you?

You can maybe compact the heap by running DBCC SHRINKFILE with NOTRUNCATE.
Based on comments, I see you haven't tested with a permenent clustered index.
To put this in perspective, we have database with 10 million new rows per day with clustered indexes on all tables. Deleted "gaps" will be removed via scheduled ALTER INDEX (and also forward pointers/page splits).
Your 12GB table may be 2GB after indexing: it merely has 12GB allocated but is massively fragmented too.

I understand your pain in being constrained by the design of a legacy design.
Have you the oppertunity to restore a backup of the table in question on another server and create a clustered index? It is very possible the clustered index if created on a set of narrow unique columns or an identity column will reduce the total table (data and index) size.
In one of my legacy apps all the data was accessed via views. I was able to modify the schema of the underlying table adding an identity column and a clustered index without effecting the application.
Another drawback of having the heap is the extra IO associated with any fowarded rows.
I found the article below effective when I was asked if there was any PROOF that we needed a clusted index permanently on the table
This article is by Microsoft

The problem that no one is talking about is FRAGMENTATION OF THE DATA OR LOG DEVICE FILES ON THE HARD DRIVE(s) ITSELF!! Everyone talks about fragmentation of the indexes and how to avoid/limit that fragmentation.
FYI: When you create a database, you specify the INITIAL size of the .MDF along with how much it will grow by when it needs to grow. You do the same with the .LDF file. THERE IS NO GUARANTEE THAT WHEN THESE TWO FILES GROW THAT THE DISK SPACE ALLOCATED FOR THE EXTRA DISK SPACE NEEDED WILL BE PHYSICALLY CONTIGUOUS WITH THE EXISTING DISK SPACE ALLOCATED!!
Every time one of these two device files needs to expand, there is the possibility of fragmentation of the hard drive disk space. That means the heads on the hard drive need to work harder (and take more time) to move from one section of the hard drive to another section to access the necessary data in the database. It is analogous to buying a small plot of land and building a house that just fits on that land. When you need to expand the house, you have no more land available unless you buy the empty lot next door - except - what if someone else, in the meantime, has already bought that land and built a house on it? Then you CANNOT expand your house. The only possibility is to buy another plot of land in the "neighborhood" and build another house on it. The problem becomes - you and two of your children would live in House A and your wife and third child would live in House B. That would be a pain (as long as you were still married).
The solution to remedy this situation is to "buy a much larger plot of land, pick up the existing house (i.e. database), move it to the larger plot of land and then expand the house there". Well - how do you do that with a database? Do a full backup, drop the database (unless you have plenty of free disk space to keep both the old fragmented database - just in case - as well as the new database), create a brand new database with plenty of initial disk space allocated (no guarantee that the operating system will insure that the space that you request will be contiguous) and then restore the database into the new database space just created. Yes - it is a pain to do but I do not know of any "automatic disk defragmenter" software that will work on SQL database files.

Related

FILLFACTOR on empty database

I have a system which populates an empty database with many millions of records.
The database has various types of indexes, the ones I'm worried about are:
Indices on foreign keys. These are non-clustered, and not necessarily inserted in sequential order.
Indices on BINARY(32) fields. These are content hashes and not ordered at all. Basically, these are like GUIDS and not sequential.
So as the data is bulk-inserted, there is significant fragmentation of these indices.
Question 1: if I set FILLFACTOR=75 to these indices when database is created, will it have any effect at all as the data is inserted? It seems FILLFACTOR has effect after data is created not before. Or will new index pages be allocated with original fillfactor setting?
Question 2: what other recommended strategies can I use to make sure these indices perform optimally?
Question1:
Fill factor is used only when indexes are rebuilt,SQL doesnt try to store pages based on fill factor while doing inserts.
Question2:
It depends on what you are saying as optimal.On a minimal you can check whether your indexes are usefull and your queries are using your indexes.There are tons of best practices around indexes like selective first key,small keys..
Its good to search for any thing about indexes from Kimberly Tripp and DBA.SE
References:
http://www.sqlskills.com/blogs/paul/a-sql-server-dba-myth-a-day-2530-fill-factor/
http://www.sqlskills.com/blogs/kimberly/category/indexes/
Check the index fragmentation as well as the write/read ration. If the write/read ratio is very high AND you see fragmentation, you can experiment with adding a fillfactor during index rebuild operation.
The amount of fill-factor really depends on your fragmentation you are seeing. If you see 0-20% fragmentation (and these happened over a period of time), you may not want any fill-factor. If you see 20-40% fragmentation, you may try 90%.
Lastly get a good index maintenance plan. Ola Hallengren's index script is excellent.
NB: Above suggestions are just suggestions - your

How to design SQL Server index to avoid fragmentation for columns that behave like partitions?

Consider the SQL Server table created by this command:
create table Foo
(
Id identity(1, 1) primary key clustered,
Time datetime2,
Host varchar(64),
Client varchar(64),
... Bunch of columns ...
);
create nonclustered index ix_foo_time on foo (time);
Id and Time columns make good indexes because they are immutable and ever increasing. (Almost) no fragmentation happens on these two.
Now consider that I need queries to work fast for both Client and Host columns. I created nonclustered indexes for each one of them. After a while, these indexes become very fragmented.
The nature of these columns is well known. There are a few hundred values for each. It is as if data is "partitioned" based on these columns.
Is there a way to tell SQL Server how it should behave to prevent index fragmentation?
Fragmentation is part of index management as data changes, some smaller table indices will retain their fragmentation regardless of a rebuild.
The below script links will help you manage index fragmentation extremely well.
I advise against static index maintenance plan tasks as the other author suggests, as maintenance plans in SSMS/SSIS rebuild index(s) regardless of fragmentation percentages, and are thus wasteful on IO, causing contention on your ETL or end users. If a DBA sees you using static index maintenance plan tasks to rebuild indexes across databases they will probably replace them with the first link below.
https://ola.hallengren.com/sql-server-index-and-statistics-maintenance.html
https://www.brentozar.com/blitzindex/
You have correctly diagnosed the reason for the fragmentation. Although there are just a few insertion points fragmentation will occur on newly inserted data.
Fillfactor will not help here because the free space will be consumed almost immediately at the few hundred insertion points. All other pages in the table will have useless free space in them.
Unfortunately, there is no way to avoid the fragmentation for newly inserted data here. You will need to install a maintenance plan.
Note, that existing data (that predates an index build) will not become fragmented. The page splits will be localized at the few hundred insertion points. Therefore fragmentation will become less when expressed as a percentage of the table size when the table grows.

Clustered indexes on non-identity columns to speed up bulk inserts?

My two questions are:
Can I use clustered indexes to speed
up bulk inserts in big tables?
Can I then still efficiently use
foreign key relationships if my
IDENTITY column is not the clustered
index anymore?
To elaborate, I have a database with a couple of very big (between 100-1000 mln rows) tables containing company data. Typically there is data about 20-40 companies in such a table, each as their own "chunk" marked by "CompanyIdentifier" (INT). Also, every company has about 20 departments, each with their own "subchunk" marked by "DepartmentIdentifier" (INT).
It frequently happens that a whole "chunk" or "subchunk" is added or removed from the table. My first thought was to use Table Partitioning on those chunks, but since I am using SQL Server 2008 Standard Edition I am not entitled to it. Still, most queries I have are executed on a "chunk" or "subchunk" rather than on the table as a whole.
I have been working to optimize these tables for the following functions:
Queries that are run on subchunks
"Benchmarking" queries that are run on the table as a whole
Inserting/removing big chunks of data.
For 1) and 2) I haven't encountered a lot of problems. I have created several indexes on key fields (also containing CompanyIdentifier and DepartmentIdentifier where useful) and the queries are running fine.
But for 3) I have struggled to find a good solution.
My first strategy was to always disable indexes, bulk insert a big chunk and rebuild indexes. This was very fast in the beginning, but now that there are a lot of companies in the database, it takes a very long time to rebuild the index each time.
At the moment my strategy has changed to just leaving the index on while inserting, since this seems to be faster now. But I want to optimize the insert speed even further.
I seem to have noticed that by adding a clustered index defined on CompanyIdentifier + DepartmentIdentifier, the loading of new "chunks" into the table is faster. Before I had abandoned this strategy in favour of adding a clustered index on an IDENTITY column, as several articles pointed out to me that the clustered index is contained in all other indexes and so the clustered index should be as small as possible. But now I am thinking of reviving this old strategy to speed up the inserts. My question, would this be wise, or will I suffer performance hits in other areas? And will this really speed up my inserts or is that just my imagination?
I am also not sure whether in my case an IDENTITY column is really needed. I would like to be able to establish foreign key relationships with other tables, but can I also use something like a CompanyIdentifier+DepartmentIdentifier+[uniquifier] scheme for that? Or does it have to be a table-wide, fragmented IDENTITY number?
Thanks a lot for any suggestions or explanations.
Well, I've put it to the test, and putting a clustered index on the two "chunk-defining" columns increases the performance of my table.
Inserting a chunk is now relatively fast compared to the situation where I had a clustered IDENTITY key, and about as fast as when I did not have any clustered index. Deleting a chunk is faster than with or without clustered index.
I think the fact that all the records I want to delete or insert are guaranteed to be all together on a certain part of the harddisk makes the tables faster - it would seem logical to me.
Update: After a year of experience with this design I can say that for this approach to work, it is necessary to schedule regular rebuilding of all the indexes (we do it once a week). Otherwise, the indexes become fragmented very soon and performance is lost. Nevertheless, we are in a process of migration to a new database design with partitioned tables, which is basically better in every way - except for the Enterprise Server license cost, but we've already forgotten about it by now. At least I have.
A clustered index is a physical index, a physical data structure, a row order. If you insert in the middle of the clustered index, the data will be physically inserted in the middle of the present data. I imagine a serious performance issue in this case. I only know this from theory, because if I do this in practice, it will be a mistake according to my theoretical knowledge.
Therefore, I only use (and advise the use) of clustered indexes on fields that are always, physically, inserted at the end, preserving the order.
A clustered index can be placed on a datetime field which marks the moment of insertion or something like that, because physically they will be ordered after appending a row. Identity is a good clustered index also, but not always relevant for querying.
In your solution you place a [uniquifier] field, but why do this when you can put an identity that will do just that? It will be unique, physically ordered, small (for foreign keys in other tables means smaller index), and in some cases faster.
Can't you try this, experiment? I have a similar situation here, where I have 4 billion rows, constantly more are inserting (up to 100 per second), the table has no primary key and no clustered index, so the propositions in this topic are VERY interesting for me too.
Can I use clustered indexes to speed up bulk inserts in big tables?
Never! Imagine another million rows that you need to put in that table and have them physically ordered it is a colossal loss in performance in the long run.
Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?
Absolutely. By the way, clustered index is no silver bullet and may be slower than your ordinary index.
Have a look at the System.Data.SqlClient.SqlBulkCopy API. Given your requirements to write signficant numbers of rows in and out of the database, it might be what you need?
Bulk copy streams the data into the table in a single operation then performs the index check once. I use it to copy 500,000 rows in and out of a database table and it's performance is an order of magnitude better than any other technique I've tried, assuming that your application can be structured to take use of the API?
i've been playing around with some etl stuff the last little bit. i went through jsut regularly inserting into the table, then removing and readding indexes before and after the insert, tried merge statements, then i finally tried ssis. I'm sold on ssis. Just yesterday i managed to cut an etl process (~24 million records, ~6gb) from ~1-1 1/2 hours per run to ~24 minutes, jsut by letting ssis handle the inserts.
i believe with advanced services you should be able to use ssis.
(Given you have already chosen the Answer and given yourself the points, this is provided as a free service, a charitable act !)
A little knowledge is a dangerous thing. There are many issues to be considered; and they must be considered together. Taking any one issue and examining it in isolation is a very fragmented way to go about administering a database: you will forever be finding some new truth and changing eveything you thought before. Before launching into it, please read this ▶question/answer◀ for context.
Do not forget, these days anyone with a keyboard and a modem can get their "papers" published. Some of them work for MS, evangelising the latest "enhancement"; others publish glowing reports of features they have never used, or used only once, in one context, but they publish that it works in every context. (Look at Spence's answer: he is enthusiastic and "sold" but under scrutiny, the statements are false; he is not a bad person, just typical of the masses in the MS world and how they operate; how they publish.)
Note: I use the term MicroSofties to describe those people who believe in the gatesian notion that any unqualified person can administer a database; and that MS will fix everything. It is not intended as an insult, more as an endearment, because of the belief in magic, and the suspension of the laws of physics.
Clustered Indices
Were designed for Relational databases, by real engineers (Sybase, before MS acquired the code) who have more brains than all of MS put together. Relational databases have Relational Keys, not Idiot keys. These are multi-column keys, that automatically distribute the data, and therefore the insert load, eg. inserting Invoices for various Companies all the time (although not in our discussed case of "chunks").
if you have good Relational keys, CIs provide Range Queries (your (1) & (2) ), and other advantages, that NCIs simply do not have.
Starting off with Id columns, before modelling and normalising the data, severely hinders the modelling and normalisation processes.
If you have an Idiot database, then you will have more indices than not. The contents of many MS databases are not "relational", they are commonly just unnormalised filing systems, with way more indices than a Normalised database would have. Therefore there is a big push, a lot of MS "enhancements" to try and give these abortions a bit of speed. Fix the symptom but don't go anywhere near the problem that caused the symptom.
In SQL 2005 and again in 2008 MS has screwed around with CIs, and the result is they are now better in some ways, but worse in other ways; the universality of CIs has been lost.
It is not correct that NCIs carry the CI (the CI is the basic single storage structure; the NCIs are secondary, and dependent on the CI; that's why when you re-create a CI, all the NCIs are automatically re-created). The NCIs carry the CI Key at the leaf level.
Microsoft has its problems, which change with the major releases (but are not eliminated):
and in MS this is not efficiently done, so the NCI index size is large; in enterprise DBMS when this is efficiently done, this is not a consideration.
In the MS world, therefore, it is only half true, that the CI key should be as short as possible. If you understand that the consideration is the size of NCIs, and if you are willing to incur that expense, it return for a table that is very fast due to a carefully constructed CI, then that is the best option.
The common advice that the CI should be theIdiot column is totally and completely wrong. The worst canditate fo a CI key is a monotonically increasing value (IDENTITY, DATETIME, etc). WHy ? because you have guaranteed that all concurrent inserts will fight for the current insert location, the last page on the index.
The real purpose of Partitioning (Which MS provided 10 years after the Enterprise vendors) is to spread this load. Sure, they then have to provide a method of allocating the Partitions, on guess what, nothing but a Relational Key; but to start with, now the Idiot key is spread across 32 or 64 Partitions, providing better concurrency.
the CI must be Unique. Relational dbs demand Unique keys, so that is a no-brainer.
But for the amateurs who have poured non-relational contents into the database, if they do not know this rule, but they know that the CI spreads the data (a little knowledge is a dangerous thing), they keep their Idiot key in a NCI (good) but they create the CI on an almost-but-not-quite Unique Key. Deadly. CI's must be Unique, that is a design demand. Duplicate (remember we are talking CI Key here) rows are off-page, located in Overflow pages, and the (then) last page; and constitute a method of badly fragmenting the Page Chain.
Update, since this point is being questioned elsewhere. I have already stated the MS keeps changing the methods without fixing the problem.
The MS Online manual, with their pretty pictures (not technical diagrams) tells us that In 2008, they have replaced (substitued one for another) Overflow Pages, with the adorable "Uniqueifier".
That totally satisfies the MicroSofties. Non-Unique CIs are not a problem. It is handled by magic. Case closed.
But there is no logic or completeness to the statements, and qualified people will ask the obvious questions: where is this "Uniqueifier" located ? On every row, or just the rows needing "Uniqueifying". DBBC PAGE shows it is on every row. So MS has just added a 4-byte secret column (including handling overhead) to every row, instead of a few Overflow Pages for the non-unique rows only. That's MS idea of engineering.
End Update
Anyway, the point remains, that Non-Unique CIs have a substantial overhead (now more than before) and should be avoided. you would be better off adding a 1- or 2-byte column yourself, to force uniqueness.
.
Therefore, unchanged from the beginning (1984), the best candidate for a CI is a multi-column unique Relational key (I cannot say that yours is for sure, but it certainly looks like it).
And put any monotonically increasing keys (IDENTITY, DATETIME) in an NCI.
Remember also that the CI is a single storage structure, which eliminates the (otherwise) Heap; the CI B-Tree is married to the rows at the Leaf level; the Leaf Level entry is the row. That guarantees one less read on every access.
So it is not possible, that a NCI+Heap can be faster than a CI. Anther common myth in the MS world that defies the laws of physics: navigating a B-Tree and writing to the one place you are already in, has got to be faster than additionally writing the row to a separate storage structure. But MicroSofties do believe in magic, they've suspended the laws of physics.
.
There are many other features you need to learn and use, I will mention at least FILLFACTOR and RESERVEPAGEGAP, to give this post a bit of completeness. Do not use these features until you understand them. All performance features have a cost that you need to understand and accept.
CIs are also self-trimming at both the Page and Extent level, there is no wasted space. PageSplits are something to monitor for (Random inserts only), and that is easily modulated by FILLFACTOR and RESERVEPAGEGAP.
And read the SO site for Clustered Indices, but keep in mind all the above, esp. the first two paras.
Your Specific Case
By all means, get rid of your surrogate keys (Idiot columns), and replace them with true natural Relational keys. Surrogates are always an additional key and index; that is a price that should not be forgotten or taken lightly.
CompanyIdentifier+DepartmentIdentifier+[uniquiefier] is exactly what I am talking about. Now notice that they are already INTs, and very fast, so it is very silly to add a NUMERIC(10,0) Idiot Key. Use a 1- or 2-byte column toforce Uniqueness.
If you get this right, you may not need a Partition licence.
The CompanyIdentifier+DepartmentIdentifier+[uniquifier] is the perfect candidate (not knowing anything about your db other than that which you have posted) for a CI, in the context that you perform mass delete/insert periodically. Detailed above.
Contrary to what others have stated, this is a good thing, and does not fragment the CI. Lets' say ou have 20 Companies, and you delete 1, which constitutes 5% of the data. That entire PageChain which was reasonably contiguous, is now relegated to the FreePageChain, contiguous and intact. To be precise, you have a single point of fragmentation, but not fragmentation in the sense of the normal use of the word. And guess what, if you turn around and perform a mass insert, where do you think that data will go ? That's right the exact same physical location as the Deleted rows. And the FreePageChain moves to the PageChain, extent and page at a time.
.
but what is alarming is that you did not know about the demand for CI to be Unique. Sad that the MicroSofties write rubbish, but not why/what each simplistic rule is based on; not the core information. The exact symptom of non-unique CIs is, the table will be very fast immediately after DROP/CREATE CI, and then slow down over time. An good Unique CI will hold its speed, and it would take a year to slow down (2 years on my large, active banking dbs).
4 hours is a very long time for 1 Billion rows (I can recreate a CI on 16 billion rows with a 6-column key in 3 minutes on an enterprise platform). But in any case, that means you have to schedule it as regular weekly or demand maintenance.
why aren't you using the WITH SORTED_DATA option ? Wasn't your data sorted, before the drop ? This option rewrites the CI Non-leaf pages but not the leaf pages (containing the rows). It can only do that if it is confident that the data was sorted. Not using this option rewrites every page, in physical order.
Now, please be kind. Before you ask me twenty questions, read up a little and understand all the issues I have defined here.

Database indexing - how does it work?

how does indexing increases the performance of data retrieval?
How indexing works?
Database products (RDMS) such as Oracle, MySQL builds their own indexing system, they give some control to the database administrators however nobody exactly knows what happens on the background except people makes research in that area, so why indexing :
Put simply, database indexes help
speed up retrieval of data. The other
great benefit of indexes is that your
server doesn't have to work as hard to
get the data. They are much the same
as book indexes, providing the
database with quick jump points on
where to find the full reference (or
to find the database row).
There are many indexing techiques for example :
Primary indexing, secondary indexing
B-trees and variants (B+-trees,B*-trees)
Hashing and variants (linear hashing, spiral etc.)
for example, just think that you have a database with the primary keys are sorted (simply) and these all data is stored in blocks (in hdd) so everytime you want to access the data you don't want to increase the access time (sometimes called transaction time or i/o time) the indexing helps you which data is stored in which block by using these primary keys.
Alice (primary key is names, not good example but just give an idea)
Alice
...
...
AZ...
Bob
Bri
...
Bza
...
Now you have an index in this index you only store Alice and Bob and the blocks they point, with this way users can access the data faster.The RDMS deals with the details.
I don't give the details but if you want to delve these topics, i offer you take an Database course or look at this popular book which is taught most of the universities.
Database Management Systems Ramakrishn CGherke
Each index keep the indexed fields stored separately, sorted (typically) and in a data structure which makes finding the right entries particularly easy. The database finds the entries in the index then cross-references them to the entries in the tables (Except in the case of clustered indexes and covering indexes, in which case the index has it all already). This cross-referencing takes time but is faster (you hope) than scanning the entire table.
A clustered index is where the rows themselves with all columns* are stored together with the index. Scanning clustered indexes is better than scanning non-clustered non-covering indexes because fewer lookups are required.
A covering index is where the query only requires columns which are part of the index, so the rest of the row does not need to be looked up (This is often good for performance).
* typically excluding blob / long text columns etc
How does an index in a book increase the ease with which you find the right page?
Much easier to look through an alphabetic list and then go to the right page than read every page.
This is a gross oversimplification, but in general, database indexing creates another list of some of the contents of the table, arranged in a way that the database engine can find information quickly. By organizing table contents deliberately, this eliminates the need to look for a row of data by scanning the entire table, creating a create efficiency in searches.
Indexes provide an optimal data structure for lookup queries. If your dataset changes a lot, you might consider the performance of updating/regenerating the index as well.
There are lot of open source indexing engines like lucene available, and you can search online for detailed information about performance benchmarks.

Very large tables in SQL Server

We have a very large table (> 77M records and growing) runing on SQL Server 2005 64bit Standard edition and we are seeing some performance issues. There are up to a hundred thousand records added daily.
Does anyone know if there is a limit to the number of records SQL server Standard edition can handle? Should be be considering moving to Enterprise edition or are there some tricks we can use?
Additional info:
The table in question is pretty flat (14 columns), there is a clustered index with 6 fields, and two other indexes on single fields.
We added a fourth index using 3 fields that were in a select in one problem query and did not see any difference in the estimated performance (the query is part of a process that has to run in the off hours so we don't have metrics yet). These fields are part of the clustered index.
Agreeing with Marc and Unkown above ... 6 indexes in the clustered index is way too many, especially on a table that has only 14 columns. You shouldn't have more than 3 or 4, if that, I would say 1 or maybe 2. You may know that the clustered index is the actual table on the disk so when a record is inserted, the database engine must sort it and place it in it's sorted organized place on the disk. Non clustered indexes are not, they are supporting lookup 'tables'. My VLDBs are laid out on the disk (CLUSTERED INDEX) according to the 1st point below.
Reduce your clustered index to 1 or 2. The best field choices are the IDENTITY (INT), if you have one, or a date field in which the fields are being added to the database, or some other field that is a natural sort of how your data is being added to the database. The point is you are trying to keep that data at the bottom of the table ... or have it laid out on the disk in the best (90%+) way that you'll read the records out. This makes it so that there is no reorganzing going on or that it's taking one and only one hit to get the data in the right place for the best read. Be sure to put the removed fields into non-clustered indexes so you don't lose the lookup efficacy. I have NEVER put more than 4 fields on my VLDBs. If you have fields that are being update frequently and they are included in your clustered index, OUCH, that's going to reorganize the record on the disk and cause COSTLY fragmentation.
Check the fillfactor on your indexes. The larger the fill factor number (100) the more full the data pages and index pages will be. In relation to how many records you have and how many records your are inserting you will change the fillfactor # (+ or -) of your non-clustered indexes to allow for the fill space when a record is inserted. If you change your clustered index to a sequential data field, then this won't matter as much on a clustered index. Rule of thumb (IMO), 60-70 fillfactor for high writes, 70-90 for medium writes, and 90-100 for high reads/low writes. By dropping your fillfactor to 70, will mean that for every 100 records on a page, 70 records are written, which will leave free space of 30 records for new or reorganized records. Eats up more space, but it sure beats having to DEFRAG every night (see 4 below)
Make sure the statistics exist on the table. If you want to sweep the database to create statistics using the "sp_createstats 'indexonly'", then SQL Server will create all the statistics on all the indexes that the engine has accumulated as requiring statistics. Don't leave off the 'indexonly' attribute though or you'll add statistics for every field, that would then not be good.
Check the table/indexes using DBCC SHOWCONTIG to see which indexes are getting fragmented the most. I won't go into the details here, just know that you need to do it. Then based on that information, change the fillfactor up or down in relation to the changes the indexes are experiencing change and how fast (over time).
Setup a job schedule that will do online (DBCC INDEXDEFRAG) or offline (DBCC DBREINDEX) on individual indexes to defrag them. Warning: don't do DBCC DBREINDEX on this large of a table without it being during maintenance time cause it will bring the apps down ... especially on the CLUSTERED INDEX. You've been warned. Test and test this part.
Use the execution plans to see what SCANS, and FAT PIPES exist and adjust the indexes, then defrag and rewrite stored procs to get rid of those hot spots. If you see a RED object in your execution plan, it's because there are not statistics on that field. That's bad. This step is more of the "art than the science".
On off peak times, run the UPDATE STATISTICS WITH FULLSCAN to give the query engine as much information about the data distributions as you can. Otherwise do the standard UPDATE STATISTICS (with standard 10% scan) on tables during the weeknights or more often as you see fit with your observerations to make sure the engine has more information about the data distributions to retrieve the data for efficiently.
Sorry this is so long, but it's extremely important. I've only give you here minimal information but will help a ton. There's some gut feelings and observations that go in to strategies used by these points that will require your time and testing.
No need to go to Enterprise edition. I did though in order to get the features spoken of earlier with partitioning. But I did ESPECIALLY to have much better mult-threading capabilities with searching and online DEFRAGING and maintenance ... In Enterprise edition, it is much much better and more friendly with VLDBs. Standard edition doesn't handle doing DBCC INDEXDEFRAG with online databases as well.
The first thing I'd look at is indexing. If you use the execution plan generator in Management Studio, you want to see index seeks or clustered index seeks. If you see scans, particularly table scans, you should look at indexing the columns you generally search on to see if that improves your performance.
You should certainly not need to move to Enterprise edition for this.
[there is a clustered index with 6 fields, and two other indexes on single fields.]
Without knowing any details about the fields, I would try to find a way to make the clustered index smaller.
With SQL Server, all the clustered-key fields will also be included in all the non-clustered indices (as a way to do the final lookup from non-clustered index to actual data page).
If you have six fields at 8 bytes each = 48 bytes, multiply that by two more indices times 77 million rows - and you're looking at a lot of wasted space which translates into a lot
of I/O operations (and thus degrades performance).
For the clustered index, it's absolutely CRUCIAL for it to be unique, stable, and as small as possible (preferably a single INT or such).
Marc
Do you really need to have access to all 77 million records in a single table?
For example, if you only need access to the last X months worth of data, then you could consider creating an archiving strategy. This could be used to relocate data to an archive table in order to reduce the volume of data and subsequently, query time on your 'hot' table.
This approach could be implemented in the standard edition.
If you do upgrade to the Enterprise edition you can make use of table partitioning. Again depending on your data structure this can offer significant performance improvements. Partitioning can also be used to implement the strategy previously mentioned but with less administrative overhead.
Here is an excellent White paper on table partitioning in SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope what I have detailed is clear and understandable. Please do feel to contact me directly if you require further assistance.
Cheers,
http://msdn.microsoft.com/en-us/library/ms143432.aspx
You've got some room to grow.
As far as performance issues, that's a whole other question. Caching, sharding, normalizing, indexing, query tuning, app code tuning, and so on.
Standard should be able to handle it. I would look at indexing and the queries you use with the table. You want to structure things in such a way that your inserts don't cause too many index recalcs, but your queries can still take advantage of the index to limit lookups to a small portion of the table.
Beyond that, you might consider partitioning the table. This will allow you to divide the table into several logical groups. You can do it "behind-the-scenes", so it still appears in sql server as one table even though it stored separately, or you can do it manually (create a new 'archive' or yearly table and manually move over rows). Either way, only do it after you looked at the other options first, because if you don't get that right you'll still end up having to check every partition. Also: partitioning does require Enterprise Edition, so that's another reason to save this for a last resort.
In and of itself, 77M records is not a lot for SQL Server. How are you loading the 100,000 records? is that a batch load each day? or thru some sort of OLTP application? and is that the performance issue you are having, i.e adding the data? or is it the querying that giving you the most problems?
If you are adding 100K records at a time, and the records being added are forcing the cluster-index to re-org your table, that will kill your performance quickly. More details on the table structure, indexes and type of data inserted will help.
Also, the amount of ram and the speed of your disks will make a big difference, what are you running on?
maybe these are minor nits, but....
(1) relational databases don't have FIELDS... they have COLUMNS.
(2) IDENTITY columns usually mean the data isn't normalized (or the designer was lazy). Some combination of columns MUST be unique (and those columns make up the primary key)
(3) indexing on datetime columns is usually a bad idea; CLUSTERING on datetime columns is also usually a bad idea, especially an ever-increasing datetime column, as all the inserts are contending for the same physical space on disk. Clustering on datetime columns in a read-only table where that column is part of range restrictions is often a good idea (see how the ideas conflict? who said db design wasn't an art?!)
What type of disks do you have?
You might monitor some disk counters to see if requests are queuing.
You might move this table to another drive by putting it in another filegroup. You can also to the same with the indexes.
Initially I wanted to agree with Marc. The width of your clustered index seems suspect, as it will essentially be used as the key to perform lookups on all your records. The wider the clustered index, the slower the access, generally. And a six field clustered index feels really, really suspect.
Uniqueness is not required for a clustered index. In fact, the best candidates for fields that should be in the clustered index are ones that are not unique and used in joins. For example, in a Persons table where each Person belongs to one Group and you frequently join Persons to Groups, while accessing batches of people by group, Person.group_id would be an ideal candidate, for this particular use case.

Resources