Single Column Huge table (2.5 B rows). Clustered index Vs Clustered Columnstore index - sql-server

We have a huge table, Table1 (2.5 billion rows), with a single column A of type NVARCHAR(255). What is the right approach for seek operations against this table: a clustered index on A, or a clustered columnstore index on A?
We already keep this table in a separate filegroup from the other table, Table2, with which it will be joined.
Do you suggest partitioning this table for better performance? The column will also hold Unicode data, so what kind of partitioning approach works for a Unicode datatype?
UPDATE: To clarify further, the use case for the table is SEEK. The table stores identifiers for individuals. The main concern is seek performance on such a huge table. The table will be referenced inside a transaction, and we want that transaction to be short.

Clustered index vs columnstore index depends on the use case for the table. A columnstore index stores data column by column and compresses it, in part by keeping a dictionary of the distinct values in each column. This makes it very useful for data warehousing tasks such as aggregates over the indexed columns, but less suitable for transactional tasks that need to pull a small number of specific rows. SQL Server 2014 introduced the clustered columnstore index, which stores the whole table in columnstore format rather than combining a clustered rowstore index with a columnstore index; from SQL Server 2016 you can also add nonclustered B-tree indexes to such a table. It has limitations and overhead that you should read up on, though.
Given that this is a seek for specific rows and not an aggregation of the column, I would recommend a clustered index instead of a column store index.
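For a sense of why a B-tree seek stays cheap even at 2.5 billion rows, the number of index levels (roughly the pages touched per seek) can be estimated. This is a rough sketch in Python, not a SQL Server measurement: the rows-per-page and keys-per-page figures below are assumed purely for illustration, and the real values depend on how long the NVARCHAR(255) values actually are.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def btree_levels(row_count, rows_per_leaf_page, keys_per_index_page):
    """Estimate the number of B-tree levels, i.e. pages read by one seek."""
    if rows_per_leaf_page < 1 or keys_per_index_page < 2:
        raise ValueError("need at least 1 row per leaf page and 2 keys per index page")
    pages = ceil_div(row_count, rows_per_leaf_page)
    levels = 1  # the leaf level
    while pages > 1:
        # each upper level needs one entry per page of the level below
        pages = ceil_div(pages, keys_per_index_page)
        levels += 1
    return levels


# Assumed figures: 100 rows per leaf page and 100 keys per upper-level page.
print(btree_levels(2_500_000_000, 100, 100))
```

Under these assumed figures the estimate is five levels, so a single seek reads a handful of pages regardless of the table's total size. Longer keys lower the fan-out and add levels, which is one reason narrow keys are preferred.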

Related

Does sorting benefit insert in SQL Server?

Is there a benefit of sorting the data in *.dat file based on INDEXED column before pushing them to the STAGING table in SQL Server?
OK, the scenario is:
I have a STAGING table with 40 columns and indexes on 5 columns. I need to push data from a file that contains 15 million rows into the STAGING table.
The approach I have followed is:
First, DISABLE the INDEXES
Second, push the data from file to STAGING table
Third, REBUILD INDEXES OFFLINE
Now I need to understand whether sorting the data in the file by the indexed column will be of any benefit:
In the INSERT
In the INDEX REBUILD
General answer: No!
15 million rows is quite a lot... It depends on how you are querying / filtering / sorting your data and it depends on the quality of your data:
Does your table have a clustered key (Are you aware of the difference between a clustered and a non-clustered index)?
Is there a one-column key candidate which is implicitly sorted (like IDENTITY)?
Will the table see a lot of deletes / inserts in future?
SQL Server has no implicit sort order.
Only one case comes to my mind: If there is an active clustered index and you insert your data in a pre-sorted way, the rows should be added at the end and your index will not be fragmented and therefore will not need a rebuild at the end.
If you remove your indexes and insert your data, insertion should be faster, but you'll need a lot of work to get a clustered key in the right physical order at the end.
Many big tables define a non-clustered primary key and no clustered key at all...
My suggestion
remove all non-clustered indexes
If your table has an implicitly sorted PK and new rows are sorted to the end automatically, you should define this as clustered key and do the inserts pre-sorted.
If the above does not apply, you should do your inserts without any index and create the indexes after the insert operation.
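The effect of pre-sorted versus unsorted inserts on an active clustered index can be illustrated with a toy model. This is a simplified sketch in Python, not SQL Server's actual storage engine: the page capacity, the split-in-half rule, and the rule that an append to a full last page simply allocates a new page are all assumptions made for illustration.

```python
import bisect
import random


def count_page_splits(keys, page_capacity=4):
    """Insert keys into sorted fixed-size pages and count mid-index page splits."""
    if page_capacity < 2:
        raise ValueError("page_capacity must be at least 2")
    pages = [[]]  # each page is a sorted list of keys; pages are in key order
    splits = 0
    for key in keys:
        # locate the page whose key range should hold this key
        first_keys = [page[0] for page in pages if page]
        idx = max(bisect.bisect_right(first_keys, key) - 1, 0)
        page = pages[idx]
        if len(page) < page_capacity:
            bisect.insort(page, key)
            continue
        if idx == len(pages) - 1 and key > page[-1]:
            # append past the end of the index: new page, no rows move
            pages.append([key])
            continue
        # insert into the middle of a full page: move half the rows to a new page
        half = page_capacity // 2
        new_page = page[half:]
        del page[half:]
        pages.insert(idx + 1, new_page)
        splits += 1
        bisect.insort(page if key < new_page[0] else new_page, key)
    return splits


sorted_keys = list(range(1000))
shuffled_keys = sorted_keys[:]
random.Random(0).shuffle(shuffled_keys)

print("splits, pre-sorted input:", count_page_splits(sorted_keys))
print("splits, shuffled input:  ", count_page_splits(shuffled_keys))
```

In this model pre-sorted input never splits a page because every row lands at the end, while unsorted input keeps splitting pages in the middle of the index, which is the fragmentation the answer describes.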

SQL Server Clustered Index and Non-Clustered Index for same column

If one column in a table has both clustered and non-clustered index defined due to any reason, is there any disadvantage in that? Just curious.
If both indices are on the same column or columns (and in the same order), then yes: both provide the same select query optimization for individual record selects, and the clustered index additionally provides enhanced performance for select queries that return multiple records filtered on a range of values for that column, so the non-clustered one is redundant.
But by having both in place you incur an additional write (Insert/Update/Delete) performance hit for the process of having to update two indices instead of only one.

Should every User Table have a Clustered Index?

Recently I found a couple of tables in a Database with no Clustered Indexes defined.
But there are non-clustered indexes defined, so they are on HEAP.
On analysis I found that select statements were using filter on the columns defined in non-clustered indexes.
Does not having a clustered index on these tables affect performance?
It's hard to state this more succinctly than SQL Server MVP Brad McGehee:
As a rule of thumb, every table should have a clustered index. Generally, but not always, the clustered index should be on a column that monotonically increases–such as an identity column, or some other column where the value is increasing–and is unique. In many cases, the primary key is the ideal column for a clustered index.
BOL echoes this sentiment:
With few exceptions, every table should have a clustered index.
The reasons for doing this are many and are primarily based upon the fact that a clustered index physically orders your data in storage.
If your clustered index is on a single column that monotonically increases, inserts occur in order on your storage device and mid-index page splits are avoided.
Clustered indexes are efficient for finding a specific row when the indexed value is unique, such as the common pattern of selecting a row based upon the primary key.
A clustered index often allows for efficient queries on columns that are often searched for ranges of values (between, >, etc.).
Clustering can speed up queries where data is commonly sorted by a specific column or columns.
A clustered index can be rebuilt or reorganized on demand to control table fragmentation.
These benefits can even be applied to views.
You may not want to have a clustered index on:
Columns that have frequent data changes, as SQL Server must then physically re-order the data in storage.
Columns that are already covered by other indexes.
Wide keys, as the clustered index is also used in non-clustered index lookups.
GUID columns, which are larger than identities and also effectively random values (not likely to be sorted upon), though newsequentialid() could be used to help mitigate physical reordering during inserts.
A rare reason to use a heap (table without a clustered index) is if the data is always accessed through nonclustered indexes and the RID (SQL Server internal row identifier) is known to be smaller than a clustered index key.
Because of these and other considerations, such as your particular application workloads, you should carefully select your clustered indexes to get maximum benefit for your queries.
Also note that when you create a primary key on a table in SQL Server, it will by default create a unique clustered index (if it doesn't already have one). This means that if you find a table that doesn't have a clustered index, but does have a primary key (as all tables should), a developer had previously made the decision to create it that way. You may want to have a compelling reason to change that (of which there are many, as we've seen). Adding, changing or dropping the clustered index requires rewriting the entire table and any non-clustered indexes, so this can take some time on a large table.
I would not say "Every table should have a clustered index", I would say "Look carefully at every table and how they are accessed and try to define a clustered index on it if it makes sense". It's a plus, like a Joker, you have only one Joker per table, but you don't have to use it. Other database systems don't have this, at least in this form, BTW.
Putting clustered indices everywhere without understanding what you're doing can also kill your performance (in general, the INSERT performance because a clustered index means physical re-ordering on the disk, or at least it's a good way to understand it), for example with GUID primary keys as we see more and more.
So, read Tim Lehner's exceptions and reason.
Performance is a big hairy problem. Make sure you are optimizing for the right thing.
Free advice is always worth its price, and there is no substitute for actual experimentation.
The purpose of an index is to find matching rows and help retrieve the data when found.
A non-clustered index on your search criteria will help to find rows, but there needs to be additional operation to get at the row's data.
If there is no clustered index, SQL uses an internal rowId to point to the location of the data.
However, if there is a clustered index on the table, that rowId is replaced by the key values of the clustered index.
So when the query needs only those key values, the step of reading the row's data is not needed, because the values are already in the non-clustered index.
Even if a clustered index isn't very good at being selective, if those keys are frequently most or all of the results requested - it may be helpful to have them as the leaf of the non-clustered index.
Yes, you should have a clustered index on a table, so that all nonclustered indexes perform better.
Consider using a clustered index on columns that contain a large number of distinct values, so as to avoid the need for SQL Server to add a "uniqueifier" to duplicate key values.
Disadvantage: updates take longer when the fields in the clustering key are changed.
Avoid clustered index designs where there is a risk that many concurrent inserts will happen on almost the same clustering key value.
Searches against a nonclustered index will appear slower if the clustered index isn't built correctly, or if the nonclustered index does not include all the columns needed to return the data to the calling application. When the nonclustered index doesn't contain all the needed data, SQL Server goes to the clustered index to get the missing data (via a lookup), which makes the query run slower because the lookup is done row by row.
Yes, every table should have a clustered index. The clustered index sets the physical order of data in a table. You can compare this to music in a store ordered by band name, or the Yellow Pages ordered by last name. Since this deals with the physical order, you can have only one clustered index per table, although it can comprise many columns.
It’s best to place the clustered index on columns often searched for a range of values. Example would be a date range. Clustered indexes are also efficient for finding a specific row when the indexed value is unique. Microsoft SQL will place clustered indexes on a PRIMARY KEY constraint automatically if no clustered indexes are defined.
Clustered indexes are not a good choice for:
Columns that undergo frequent changes
This results in the entire row moving (because SQL Server must keep
the data values of a row in physical order). This is an important
consideration in high-volume transaction processing systems where
data tends to be volatile.
Wide keys
The key values from the clustered index are used by all
nonclustered indexes as lookup keys and therefore are stored in each
nonclustered index leaf entry.
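The cost of a wide clustered key can be put into rough numbers. The sketch below is simple arithmetic in Python under stated assumptions: it counts only key bytes at the leaf level and ignores page, row, and compression overhead, and the row count and key widths in the example are illustrative, not taken from any real table.

```python
def nonclustered_leaf_bytes(row_count, nc_key_bytes, clustered_key_bytes, index_count=1):
    """Rough leaf-level size of nonclustered indexes, ignoring page and row overhead.

    Every nonclustered leaf entry stores its own key plus the clustered key
    that serves as the row locator.
    """
    return row_count * (nc_key_bytes + clustered_key_bytes) * index_count


rows = 100_000_000  # assumed row count
for label, clustered_width in [("4-byte INT", 4), ("16-byte GUID", 16), ("510-byte NVARCHAR(255)", 510)]:
    size = nonclustered_leaf_bytes(rows, nc_key_bytes=8, clustered_key_bytes=clustered_width, index_count=5)
    print(f"{label:>24}: {size / 1024**3:.1f} GiB across 5 nonclustered indexes")
```

The width of the clustered key is paid once per row in every nonclustered index, so it multiplies with both the row count and the number of indexes.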

SQL Server - Clustered Index Key Issue on FACT Table with millions of rows

We have a FACT table with 237,383,163 rows and a lot of duplicate data.
Queries against this table do a SCAN across all of those rows, resulting in long execution times (because we haven't created a clustered index).
Can someone suggest a way to create a clustered key using some combination of existing fields, possibly along with a new field (like an identity column)?
The non-clustered indexes created on the table are of no help either.
Regards
Thoughts:
Adding a clustered index that is not unique will require a 4 byte uniqueifier
Adding a surrogate IDENTITY column will leave you with duplicates
A clustered index is best when narrow and numeric, especially if you have non-clustered indexes
First thing, de-duplicate data
Then I'd consider one of 2 things based on whether there are non-clustered indexes
Without NC indexes, create a unique clustered index on some or all of the FACT columns
With NC indexes, create an IDENTITY column and use this as the clustered index. Create a unique NC index on the FACT columns
Option 1 will be a lot smaller on disk. I've done this before for a billion+ row fact table and it shrank by 65%. There were no NC indexes.
Both options will need to be tested to see the effect on load and response times etc.

What are the differences between a clustered and a non-clustered index?

What are the differences between a clustered and a non-clustered index?
Clustered Index
Only one per table
Faster to read than non clustered as data is physically stored in index order
Non Clustered Index
Can be used many times per table
Quicker for insert and update operations than a clustered index
Both types of index will improve performance when selecting data by fields that use the index, but will slow down update and insert operations.
Because of the slower insert and update, clustered indexes should be set on a field that is normally incremental, e.g. Id or Timestamp.
SQL Server will normally only use an index if it is highly selective. A figure of 95% selectivity is often quoted, but treat it as a rule of thumb rather than a fixed threshold, since the optimizer's choice is cost-based.
Clustered indexes physically order the data on the disk. This means no extra data is needed for the index, but there can be only one clustered index (obviously). Accessing data using a clustered index is fastest.
All other indexes must be non-clustered. A non-clustered index has a duplicate of the data from the indexed columns kept ordered together with pointers to the actual data rows (pointers to the clustered index if there is one). This means that accessing data through a non-clustered index has to go through an extra layer of indirection. However if you select only the data that's available in the indexed columns you can get the data back directly from the duplicated index data (that's why it's a good idea to SELECT only the columns that you need and not use *)
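The extra layer of indirection, and how selecting only indexed columns avoids it, can be shown with a toy model. This is a minimal sketch in Python using plain dictionaries, not SQL Server internals; the table contents, the column names, and the function `select_by_email` are hypothetical.

```python
# Clustered index: the rows themselves, keyed by the clustering key (id).
clustered = {
    1: {"id": 1, "email": "ann@example.com", "name": "Ann", "city": "Oslo"},
    2: {"id": 2, "email": "bob@example.com", "name": "Bob", "city": "Rome"},
}

# Nonclustered index on email: indexed value -> clustering key (the row locator).
nc_email = {row["email"]: row["id"] for row in clustered.values()}


def select_by_email(email, columns):
    """Return (result_row, lookups) for a query filtered on email."""
    cluster_key = nc_email.get(email)
    if cluster_key is None:
        return None, 0
    # The index entry itself holds the indexed column and the clustering key.
    index_entry = {"email": email, "id": cluster_key}
    if all(col in index_entry for col in columns):
        # Covered query: answered from the index alone.
        return {col: index_entry[col] for col in columns}, 0
    # Otherwise follow the row locator into the clustered index.
    row = clustered[cluster_key]
    return {col: row[col] for col in columns}, 1


print(select_by_email("ann@example.com", ["id", "email"]))
print(select_by_email("ann@example.com", ["name", "city"]))
```

The first call is answered from the index entry with no lookup; the second has to follow the locator to the full row, which is the extra step that SELECT * tends to force.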
Clustered indexes are stored physically on the table. This means they are the fastest and you can only have one clustered index per table.
Non-clustered indexes are stored separately, and you can have as many as you want.
The best option is to set your clustered index on the most used unique column, usually the PK. You should always have a well selected clustered index in your tables, unless a very compelling reason--can't think of a single one, but hey, it may be out there--for not doing so comes up.
Clustered Index
There can be only one clustered index for a table.
Usually made on the primary key.
The leaf nodes of a clustered index contain the data pages.
Non-Clustered Index
There can be up to 249 non-clustered indexes per table in SQL Server 2005 and earlier; later versions support up to 999.
Can be made on any key.
The leaf node of a nonclustered index does not consist of the data pages. Instead, the leaf nodes contain index rows.
Clustered Index
Only one clustered index can be there in a table
Sort the records and store them physically according to the order
Data retrieval is faster than non-clustered indexes
Do not need extra space to store logical structure
Non Clustered Index
There can be any number of non-clustered indexes in a table
Do not affect the physical order. Create a logical order for data rows and use pointers to physical data files
Data insertion/update is faster than clustered index
Use extra space to store logical structure
Apart from these differences, note that when a table has no clustered index, its data rows are stored unordered in a structure called a heap.
Pros:
Clustered indexes work great for ranges (e.g. select * from my_table where my_key between @min and @max)
In some conditions, the DBMS will not have to do any sorting work if you use an ORDER BY clause.
Cons:
Clustered indexes can slow down inserts, because the physical layout of the records has to be modified as records are put in if the new keys are not in sequential order.
Clustered basically means that the data is in that physical order in the table. This is why you can have only one per table.
Unclustered means it's "only" a logical order.
A clustered index actually describes the order in which records are physically stored on the disk, hence the reason you can only have one.
A Non-Clustered Index defines a logical order that does not match the physical order on disk.
An indexed database has two parts: a set of physical records, which are arranged in some arbitrary order, and a set of indexes which identify the sequence in which records should be read to yield a result sorted by some criterion. If there is no correlation between the physical arrangement and the index, then reading out all the records in order may require making lots of independent single-record read operations. Because a database may be able to read dozens of consecutive records in less time than it would take to read two non-consecutive records, performance may be improved if records which are consecutive in the index are also stored consecutively on disk. Specifying that an index is clustered will cause the database to make some effort (different databases differ as to how much) to arrange things so that groups of records which are consecutive in the index will be consecutive on disk.
For example, if one were to start with an empty non-clustered database and add 10,000 records in random sequence, the records would likely be added at the end in the order they were added. Reading out the database in order by the index would require 10,000 one-record reads. If one were to use a clustered database, however, the system might check when adding each record whether the previous record was stored by itself; if it found that to be the case, it might write that record with the new one at the end of the database. It could then look at the physical record before the slots where the moved records used to reside and see if the record that followed that was stored by itself. If it found that to be the case, it could move that record to that spot. Using this sort of approach would cause many records to be grouped together in pairs, thus potentially nearly doubling sequential read speed.
In reality, clustered databases use more sophisticated algorithms than this. A key thing to note, though, is that there is a tradeoff between the time required to update the database and the time required to read it sequentially. Maintaining a clustered database will significantly increase the amount of work required to add, remove, or update records in any way that would affect the sorting sequence. If the database will be read sequentially much more often than it will be updated, clustering can be a big win. If it will be updated often but seldom read out in sequence, clustering can be a big performance drain, especially if the sequence in which items are added to the database is independent of their sort order with regard to the clustered index.
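The read-side half of that tradeoff can be sketched with a small model. This is an illustrative Python sketch, not how any particular database lays out storage: it assumes unique keys, fixed-size pages filled in storage order, and counts one page read each time reading in key order has to move to a different page.

```python
import random


def page_reads_in_key_order(stored_order, page_size):
    """Count page reads needed to read all records in key order.

    stored_order lists the (unique) keys in their physical storage order;
    consecutive records share a page until the page is full.
    """
    page_of = {key: position // page_size for position, key in enumerate(stored_order)}
    reads = 0
    current_page = None
    for key in sorted(stored_order):
        if page_of[key] != current_page:
            current_page = page_of[key]
            reads += 1
    return reads


keys = list(range(10_000))
arrival_order = keys[:]
random.Random(0).shuffle(arrival_order)

print("clustered (stored in key order):    ", page_reads_in_key_order(keys, 50))
print("unclustered (stored in arrival order):", page_reads_in_key_order(arrival_order, 50))
```

When storage order matches key order, each page is read once; when the two are unrelated, almost every record costs its own page read, which is the 10,000 one-record reads of the example above.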
A clustered index is essentially the table's data itself, stored sorted by the indexed columns.
The main advantage of a clustered index is that when your query (seek) locates the data in the index then no additional IO is needed to retrieve that data.
The overhead of maintaining a clustered index, especially in a frequently updated table, can lead to poor performance and for that reason it may be preferable to create a non-clustered index.
You might have gone through theory part from the above posts:
-The clustered index, as we can see, points directly to the record, so a search takes less time. Additionally, it does not take any extra space to store the index.
-A non-clustered index, in contrast, points indirectly to the clustered index and only then reaches the actual record; because of this indirection, access takes somewhat more time. It also needs its own space to store the index.
// Copied from MSDN, the second point of non-clustered index is not clearly mentioned in the other answers.
Clustered
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the
index definition. There can be only one clustered index per table,
because the data rows themselves can be stored in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table
has no clustered index, its data rows are stored in an unordered
structure called a heap.
Nonclustered
Nonclustered indexes have a structure separate from the data rows. A
nonclustered index contains the nonclustered index key values and
each key value entry has a pointer to the data row that contains the
key value.
The pointer from an index row in a nonclustered index to a data row
is called a row locator. The structure of the row locator depends on
whether the data pages are stored in a heap or a clustered table.
For a heap, a row locator is a pointer to the row. For a clustered
table, the row locator is the clustered index key.
Clustered Indexes
Clustered Indexes are faster for retrieval and slower for insertion
and update.
A table can have only one clustered index.
Don't require extra space to store logical structure.
Determines the order of storing the data on the disk.
Non-Clustered Indexes
Non-clustered indexes are slower in retrieving data and faster in
insertion and update.
A table can have multiple non-clustered indexes.
Require extra space to store logical structure.
Has no effect on the order of storing data on the disk.
