We have a 221 GB table in our SQL database, consisting mainly of duplicate data.
The team has created a NON-CLUSTERED index on the HEAP. Does this really help in terms of performance?
Should we add an IDENTITY column to the table, create a CLUSTERED index on it, and only then create the NON-CLUSTERED indexes?
It Depends
On the usage pattern and structure of the data.
Is the non-clustered index covering?
Is the data in the table ever changing?
A heap table with one or more covering non-clustered indexes can outperform a table whose clustered index is the only "index" (a clustered index is obviously always covering, but may not be optimal for seeks).
Remember a clustered index is not an index (in the sense of a lookup based on a key into a location where the data is stored), it's the whole table organized by a choice of index. In a real (non-clustered) index, only the keys and included columns are included in the index and this means that (generally) more rows can be stored per database page and less data is read unnecessarily.
Most tables should have a clustered index, but the choice of non-clustered indexes is where most of your performance comes from.
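To put rough numbers on the "more rows per page" point, here is a back-of-the-envelope sketch. It assumes 8 KB pages with a 96-byte header and a 2-byte slot entry per row, and the two row sizes are made up for illustration. Real rows carry extra per-row overhead, so treat the results as orders of magnitude only.

```python
# Rough rows-per-page arithmetic: wide table row vs. narrow index entry.
# Simplified model: 8 KB page, 96-byte header, 2-byte slot entry per row.
PAGE_SIZE = 8192
PAGE_HEADER = 96
SLOT_ENTRY = 2

FULL_ROW_BYTES = 400     # hypothetical: all columns of the table
INDEX_ENTRY_BYTES = 24   # hypothetical: key + included columns + row locator


def rows_per_page(row_bytes: int) -> int:
    """Rows of the given size that fit on one page in this simplified model."""
    usable = PAGE_SIZE - PAGE_HEADER
    return usable // (row_bytes + SLOT_ENTRY)


def pages_needed(row_count: int, row_bytes: int) -> int:
    """Pages required to hold row_count rows (ceiling division)."""
    per_page = rows_per_page(row_bytes)
    return -(-row_count // per_page)


rows = 1_000_000
print("full-row pages:      ", pages_needed(rows, FULL_ROW_BYTES))
print("covering index pages:", pages_needed(rows, INDEX_ENTRY_BYTES))
```

A query that the narrow index covers only has to touch the second, much smaller set of pages.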
Related
In a row-based clustered index, the entire table data (all columns) is ordered by the clustered index column. Each page holds a certain number of rows with all of their columns.
In a row-based non-clustered index, a separate data structure is created that holds the index column. Each page of that structure holds the row-wise values for the indexed column, and each entry points back (to another page) to the clustered index key or, if no clustered index exists, to the heap row for the rest of the data.
I understand the concept of a columnstore index, in the sense that there are row groups. Each row group contains one (compressed) column segment for every column in the table, and there is a delta store to hold the inserts/updates until the next tuple mover process gets invoked. Based on the above two points about rowstore indexes (pages), can you please tell me how it works for clustered and non-clustered columnstore indexes?
Example: in the case of a non-clustered columnstore index, is the storage conceptually the same as for a rowstore non-clustered index, that is, a separate page for the index column, whose values point to the heap or the clustered index key?
In comparing rowstore vs columnstore, the term clustered means all columns and non-clustered means some columns (unless one included all columns). There is no other similarity between the disparate architectures of rowstore and columnstore organization. I personally don't even use the word index when referring to columnstore, since the structure is optimized for scans rather than lookups and the term often leads to confusion.
Columnstore index segments, whether clustered or not, are essentially just compressed blobs of data stored in pages/extents. Rowstores, OTOH, have a record structure for each row to accommodate multiple columns of varying types and nullability, which is why they do not compress as well as columnstore data.
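As a toy illustration of why per-column storage compresses better, the sketch below pivots a handful of made-up rows into one list per column and run-length encodes each one. Run-length encoding is only a stand-in here. The compression a real columnstore applies is considerably more sophisticated, and the function names are invented for this example.

```python
from itertools import groupby

# Made-up rows: (id, country, status). In a rowstore each tuple is one record.
rows = [
    (1, "DE", "open"),
    (2, "DE", "open"),
    (3, "DE", "closed"),
    (4, "FR", "closed"),
    (5, "FR", "closed"),
    (6, "FR", "closed"),
]


def to_column_segments(table_rows):
    """Pivot row tuples into one list per column (a toy 'row group')."""
    return [list(col) for col in zip(*table_rows)]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


ids, countries, statuses = to_column_segments(rows)
print(run_length_encode(countries))
print(run_length_encode(statuses))
```

Values of the same type and domain sit next to each other in a column segment, so repeats collapse; in a row record they are interleaved with the other columns.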
Here is one of the definitions I found for clustered Index:
When a file is organized so that the ordering of data records is
the same as or close to the ordering of data entries in some index, we
say that the index is clustered.
I'm having trouble understanding the above sentence regarding the clustered Indexes. The things I know about clustered index are:
A clustered index reorders the way the records are physically stored in the table, so only one clustered index is possible
A clustered index is created on a non-key attribute
Well, there are several ways to look at a clustered index.
A clustered index is a type of index where the table records are physically re-ordered to match the index.
Clustered indexes are efficient on columns that are searched for a range of values. After the row with first value is found using a clustered index, rows with subsequent index values are guaranteed to be physically adjacent, thus providing faster access for a user query or an application
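A minimal sketch of that range behaviour, using a sorted Python list as a stand-in for rows stored in clustering-key order (a real clustered index is a B-tree, and the table contents here are made up):

```python
from bisect import bisect_left, bisect_right

# Rows kept sorted by the clustering key (first tuple element).
table = [(10, "a"), (20, "b"), (30, "c"), (40, "d"), (50, "e")]
keys = [row[0] for row in table]


def range_scan(low, high):
    """Return rows with low <= key <= high: one seek, then a contiguous slice."""
    start = bisect_left(keys, low)    # seek to the first qualifying row
    end = bisect_right(keys, high)    # position just past the last qualifying row
    return table[start:end]           # neighbouring rows, no further seeks


print(range_scan(20, 40))
```

Once the first row is located, every other qualifying row is adjacent to it, which is what makes range predicates cheap on the clustering key.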
You also have to understand the Non-Clustered Index
In other words, a clustered index stores the actual data, where a non-clustered index is a pointer to the data. In most DBMSs, you can only have one clustered index per table, though there are systems that support multiple clusters (DB2 being an example).
Like a regular index that is stored unsorted in a database table, a clustered index can be a composite index, such as a concatenation of first name and last name in a table of personal information.
There are several examples and explanations; "What do Clustered and Non clustered index actually mean?" is one of them.
If a table only needs 1 index, it seems like clustered is generally the way to go. It is faster because it does not have to reference back to the data via a key, and it also doesn't take disk space the way a non clustered index does.
My question is: with multiple indexes, is it better to remove the clustered index altogether? The logic behind this is that if you have non-clustered indexes WITH a clustered index, they don't directly refer back to the actual data rows anymore, but to the clustered index instead. So it seems like there would be a significant performance hit from using the clustered index as a proxy. It seems like the best thing to do would be to not use clustered indexes at all if you think you will need more than one index on the table.
If the table has a proper clustered index there is no benefit to removing it.
If you have several indexes then pick the best candidate for clustered.
Typically it is your PK.
When you create a PK, it is clustered by default.
PK is your best candidate for clustered unless you have specific reason not to use it.
I don't follow your assertion.
"If you have non clustered indexes WITH a clustered index, they don't
refer back to the actual data rows anymore, but to the clustered index
instead. So it seems like there would be a significant performance
hit."
If the clustered index is in the data then referring to the clustered index is referring to the data. The data is physically organized by the clustered index. Where is the significant performance hit?
Clustered Index Design Guidelines
With few exceptions, every table should have a clustered index defined
If one of those few exceptions were another index, then it would be called out.
Another non-clustered index is not a reason to not have a clustered index.
Nonclustered Index Structures
The row locators in nonclustered index rows are either a pointer to a row or are a clustered index key for a row, as described in the following:
If the table is a heap, which means it does not have a clustered
index, the row locator is a pointer to the row. The pointer is built
from the file identifier (ID), page number, and number of the row on
the page. The whole pointer is known as a Row ID (RID).
If the table has a clustered index, or the index is on an indexed
view, the row locator is the clustered index key for the row. If the
clustered index is not a unique index, SQL Server makes any duplicate
keys unique by adding an internally generated value called a
uniqueifier. This four-byte value is not visible to users. It is only
added when required to make the clustered key unique for use in
nonclustered indexes. SQL Server retrieves the data row by searching
the clustered index using the clustered index key stored in the leaf
row of the nonclustered index.
They had the option to use a RID even if there was a PK. Why do you think a clustered index is slower?
The documentation for SQL Server 2008 R2 states:
Wide keys are a composite of several columns or several large-size columns. The key values from the clustered index are used by all nonclustered indexes as lookup keys. Any nonclustered indexes defined on the same table will be significantly larger because the nonclustered index entries contain the clustering key and also the key columns defined for that nonclustered index.
Does this mean that when a search uses a non-clustered index, the clustered index is searched as well? I originally thought that the non-clustered index directly contains the address of the page (block) holding the row it references. From the text above it seems that it contains just the key from the clustered index instead of the address.
Could somebody explain please?
Yes, that's exactly what happens:
SQL Server searches for your search value in the non-clustered index
if a match is found, in that index entry, there's also the clustering key (the column or columns that make up the clustered index)
with that clustering key, a key lookup (often also called a bookmark lookup) is now performed - the clustered index is searched for the given value
when the item is found, the entire data record at the leaf level of the clustered index navigation structure is present and can be returned
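The steps above can be sketched with two plain dictionaries. This is a toy model only: the real structures are B-trees, and the table, column and function names are made up for illustration.

```python
# Clustered index: clustering key -> the full data row.
clustered = {
    1: {"id": 1, "last_name": "Smith", "city": "Leeds"},
    2: {"id": 2, "last_name": "Jones", "city": "York"},
    3: {"id": 3, "last_name": "Smith", "city": "Bath"},
}


def build_nonclustered(clustered_index, column):
    """Non-clustered index: indexed value -> list of clustering keys."""
    index = {}
    for key, row in clustered_index.items():
        index.setdefault(row[column], []).append(key)
    return index


ix_last_name = build_nonclustered(clustered, "last_name")


def seek_with_key_lookup(index, value):
    """Seek the non-clustered index, then look up each clustering key."""
    clustering_keys = index.get(value, [])        # search value -> clustering keys
    return [clustered[k] for k in clustering_keys]  # key lookup -> full rows


for row in seek_with_key_lookup(ix_last_name, "Smith"):
    print(row)
```

Note that the non-clustered index never stores where a row lives, only which clustering key identifies it.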
SQL Server does this, because using a physical address would be really really bad:
if a page split occurs, all the entries that are moved to a new page would have to be updated
for all those entries, all nonclustered indices would also have to be updated
and this is really really bad for performance.
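A toy model of that page-split problem, assuming a hypothetical page that holds four rows and splits in half. It compares (page, slot) addresses before and after the split; the function names are made up for this sketch.

```python
def split_page(pages, page_no):
    """Move the upper half of a page to a new page appended at the end."""
    page = pages[page_no]
    half = len(page) // 2
    pages.append(page[half:])
    pages[page_no] = page[:half]


def physical_addresses(pages):
    """Map each clustering key to its current (page, slot) address."""
    return {
        key: (page_no, slot)
        for page_no, page in enumerate(pages)
        for slot, key in enumerate(page)
    }


pages = [[10, 20, 30, 40]]            # one full page of clustering keys
before = physical_addresses(pages)
split_page(pages, 0)
after = physical_addresses(pages)

# Keys whose physical address changed: an address-based locator for these
# rows would now be stale in every non-clustered index.
moved = sorted(key for key in before if before[key] != after[key])
print("moved keys:", moved)
```

A locator that stores the clustering key is unaffected: the key values of the moved rows are the same before and after the split, only their position changed.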
This is one of the reasons why it is beneficial to use limited column lists in SELECT (instead of always SELECT *) and possibly even include a few extra columns in the nonclustered index (to make it a covering index). That way, you can avoid unnecessary and expensive bookmark lookups.
And because the clustering key is included in each and every nonclustered index, it's highly important that this be a small and narrow key - optimally an INT IDENTITY or something like that - and not a huge structure; the clustering key is the most replicated data structure in SQL Server and should be as small as possible.
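A quick bit of arithmetic shows how the clustering key width multiplies across nonclustered indexes. The row count, index count and index key width are made-up figures, and per-row overhead is ignored, so this is an estimate of the difference rather than of real index sizes.

```python
def nonclustered_leaf_bytes(row_count, index_key_bytes, clustering_key_bytes, index_count):
    """Leaf-level bytes across all nonclustered indexes, ignoring row overhead."""
    return row_count * (index_key_bytes + clustering_key_bytes) * index_count


ROWS = 10_000_000     # hypothetical table size
INDEXES = 5           # hypothetical number of nonclustered indexes
INDEX_KEY_BYTES = 8   # hypothetical width of each nonclustered key

narrow = nonclustered_leaf_bytes(ROWS, INDEX_KEY_BYTES, 4, INDEXES)   # 4-byte INT key
wide = nonclustered_leaf_bytes(ROWS, INDEX_KEY_BYTES, 16, INDEXES)    # 16-byte key, e.g. a GUID

extra_bytes = wide - narrow
print(f"extra space for the wide key: {extra_bytes / 1024**2:.0f} MiB")
```

Every extra byte in the clustering key is paid once per row per nonclustered index.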
The fact that these bookmark lookups are relatively expensive is also one of the reasons why the query optimizer might opt for an index scan as soon as you select a larger number of rows - at that point, just scanning the clustered index might be cheaper than doing a lot of key lookups.
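A deliberately crude cost comparison illustrates that tipping point. It assumes each key lookup costs one page read per level of the clustered index and a scan reads every page once. The optimizer's actual cost model is far more detailed (caching, sequential vs. random I/O, statistics), so the numbers and function names below are illustrative assumptions only.

```python
def lookup_plan_reads(matching_rows, index_depth=3):
    """Page reads for key lookups: one walk down the clustered index per row."""
    return matching_rows * index_depth


def scan_plan_reads(table_pages):
    """Page reads for scanning the whole clustered index once."""
    return table_pages


def cheaper_plan(matching_rows, table_pages, index_depth=3):
    """Pick the plan with fewer page reads in this simplified model."""
    if lookup_plan_reads(matching_rows, index_depth) < scan_plan_reads(table_pages):
        return "seek + key lookups"
    return "clustered index scan"


TABLE_PAGES = 50_000  # hypothetical table size in pages
for matching in (100, 10_000, 100_000):
    print(matching, "rows ->", cheaper_plan(matching, TABLE_PAGES))
```

Even in this simple model the lookups stop paying off once the matching rows are a modest fraction of the table.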
I have a few questions about clustered indexes and columnstore indexes. We know that with a clustered index the physical order of the data for a particular column changes and the data is stored in the leaf nodes of the binary tree. So my questions are:
1) If we create a clustered index on columnA, is the data of columnA removed from the actual storage place and added to the binary tree's leaves, or is the data just rearranged in the original storage place?
2) What about a columnstore index? Is the data from the specific column also removed from the actual storage location and loaded into a separate file segment?
3) Following from the two questions above, if the data does move from its original location: say we have a tableA with columns colA and colB, and we create either of the two indexes on colA. Does the original location then contain only the colB data, with colA in some other location?
1) When you create a clustered index, the table is rebuilt as a clustered index. There is no 'original storage': the clustered index is the table and the table is the clustered index. A clustered index is organized as a B-tree structure (not to be confused with a binary tree).
2) A non-clustered columnstore index is a secondary index on a table. As with any secondary (non-clustered) index, every row in the non-clustered index is a copy of the data from the table's base heap or clustered index. A columnstore index is not a B-tree, nor a heap, but a new type of data organization optimized for column-oriented storage. To understand how column-oriented storage works in general, read the C-Store paper.
3) No. Adding non-clustered indexes of any nature, including special indexes like XML, spatial and columnstore, never removes data from the table's base heap or clustered index. Non-clustered indexes always contain a copy of the data. Adding a non-clustered columnstore index does not change in any way the rows and columns in the table's base heap or clustered index (which I suspect is what you call the 'actual storage').