Move Table off Partition - sql-server

I have a very large table (800GB) which has a DATETIME field which is part of a partition schema. This field is named tran_date. The problem I'm having is that the indexes are not properly aligned with the partition and I can't include the tran_date field in the PRIMARY KEY because it's set to nullable.
I can drop all foreign key relationships, statistics, and indexes, but I can't modify the column because the partition schema is still dependent on the tran_date column.
In my research I've found one way to move the table off of the partition: drop the clustered index and then re-create the clustered index on the PRIMARY filegroup, which then allows me to modify the column. But this takes several hours to drop, 13 hours to write the temporary CLUSTERED INDEX on PRIMARY, and then I have to drop that, alter the table, and re-create the CLUSTERED INDEX properly, which takes another 13 hours. Additionally, I have more than one table to do this to.
When I tested this deployment in my development environment with a similarly sized data set it took several days to complete, so I'm trying to look for ways to chop down this time.
If I can move the table off the partition without having to write a CLUSTERED INDEX on PRIMARY it would significantly reduce the time required to alter the column.
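For reference, a rough sketch of the sequence being described; the index names, partition scheme name, and key columns here are illustrative, not the real objects:

-- move the table off the partition scheme, alter the column, move it back
DROP INDEX [CIX_BigTable] ON dbo.BigTable;                 -- several hours

CREATE CLUSTERED INDEX [CIX_BigTable_Temp]                 -- ~13 hours
ON dbo.BigTable (id)
ON [PRIMARY];                                              -- rebuilding ON [PRIMARY] moves the table off the partition scheme

DROP INDEX [CIX_BigTable_Temp] ON dbo.BigTable;

ALTER TABLE dbo.BigTable ALTER COLUMN tran_date DATETIME NOT NULL;

CREATE CLUSTERED INDEX [CIX_BigTable]                      -- another ~13 hours
ON dbo.BigTable (id, tran_date)
ON ps_TranDate (tran_date);                                -- back onto the partition scheme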

No matter what, you are going to end up moving data from "point A" (stored in table partitions within the database) to "point B" (not stored within table partitions within the database). The goal is to minimize the number of times you have to work through all that data. The simplest way to do this might be:
Create a new non-partitioned table
Copy the data over to that table
Drop the original table
Rename the new table to the proper name
One problem to deal with is the clustered index. You could either create the new table without the clustered index, copy the data over, and then reindex (extra time and pain), or you could create the table with the clustered index, and copy the data over “in order” (say, low Ids to high). This would be slower than copying it over to a non-clustered table, but it might be faster overall since you wouldn’t then have to build the clustered index.
Of course there's the problem of "what if users change the data while you're copying it"... but table partitioning implies warehousing, so I'm guessing you don't have to worry about that.
One last point: when copying gobs of data, it is best to break the insert into several smaller batches, so as not to bloat the transaction log.
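A minimal sketch of that approach, assuming an id column that can drive the batching; all object names here are illustrative:

-- 1. Create a new, non-partitioned copy of the table on PRIMARY
--    (create it with the clustered index first if you plan to copy "in order", per the discussion above)
CREATE TABLE dbo.BigTable_New (
    id INT NOT NULL,
    tran_date DATETIME NULL
    -- ... remaining columns ...
) ON [PRIMARY];

-- 2. Copy the data over in batches to keep the transaction log small
DECLARE @batch INT = 1000000, @lastId INT = 0, @rows INT = 1;   -- assumes ids start above 0
WHILE @rows > 0
BEGIN
    INSERT INTO dbo.BigTable_New (id, tran_date /* ... other columns ... */)
    SELECT TOP (@batch) id, tran_date /* ... other columns ... */
    FROM dbo.BigTable
    WHERE id > @lastId
    ORDER BY id;                                                -- copy low ids to high

    SET @rows = @@ROWCOUNT;
    SELECT @lastId = MAX(id) FROM dbo.BigTable_New;             -- advance the batching cursor
END;

-- 3. Swap the tables
DROP TABLE dbo.BigTable;
EXEC sp_rename 'dbo.BigTable_New', 'BigTable';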

Related

Single Column Huge table (2.5 B rows). Clustered index Vs Clustered Columnstore index

We have a huge table, Table1 (2.5 billion rows), with a single column A (NVARCHAR(255)). What is the right approach for seek operations against this table: a clustered index on A, or a clustered columnstore index on A?
We are already keeping this table in a separate filegroup from the other table, Table2, with which it will be joined.
Do you suggest partitioning this table for better performance? The column will also contain Unicode data, so what kind of partitioning approach works for a Unicode datatype?
UPDATE: To clarify further, the use case for the table is SEEK. The table stores identifiers for individuals. The major concern is seek performance against a huge table. The table will be referenced inside a transaction, and we want that transaction to be short.
Clustered index vs column store index depends on the use case for the table. Column store keeps track of unique entries in the column and the rows where those entries are stored. This makes it very useful for data warehousing tasks such as aggregates against the indexed columns, however not as optimal for transactional tasks that need to pull a small number of specific rows. If you are using SQL Server 2014 or later you can use both a clustered index and a columnstore index by creating a clustered columnstore index. It does have some limitations and overhead that you should read up on though.
Given that this is a seek for specific rows and not an aggregation of the column, I would recommend a clustered index instead of a column store index.
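For illustration, the two options being compared would look roughly like this (index names are illustrative, and only one of the two can exist on the table at a time since each defines the table's storage):

-- Option 1: a clustered (rowstore) index on the seek column
CREATE CLUSTERED INDEX CIX_Table1_A ON dbo.Table1 (A);

-- Option 2: a clustered columnstore index (SQL Server 2014+), which covers the whole table
-- and is aimed at scans and aggregations rather than single-row seeks
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Table1 ON dbo.Table1;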

Does sorting benefit insert in SQL Server?

Is there any benefit to sorting the data in a *.dat file on an INDEXED column before pushing it into the STAGING table in SQL Server?
OK, the scenario is:
I have a STAGING table with 40 columns and indexes on 5 columns. I need to push data from a file that contains 15 million rows into the STAGING table.
The approach I have followed is:
First, DISABLE the INDEXES
Second, push the data from file to STAGING table
Third, REBUILD INDEXES OFFLINE
Now I need to understand: if I sort the data in the file on the column that is indexed, will it benefit in any way:
the INSERT
the INDEX REBUILD.
General answer: No!
15 million rows is quite a lot... It depends on how you are querying / filtering / sorting your data and it depends on the quality of your data:
Does your table have a clustered key (Are you aware of the difference between a clustered and a non-clustered index)?
Is there a one-column key candidate which is implicitly sorted (like IDENTITY)?
Will the table see a lot of deletes / inserts in the future?
SQL Server does not maintain any implicit sort order.
Only one case comes to my mind: If there is an active clustered index and you insert your data in a pre-sorted way, the rows should be added at the end and your index will not be fragmented and therefore will not need a rebuild at the end.
If you remove your indexes and insert your data, insertion should be faster, but you'll need a lot of work to get a clustered key in the right physical order at the end.
Many big tables define a non-clustered primary key and no clustered key at all...
My suggestion
Remove all non-clustered indexes.
If your table has an implicitly sorted PK and new rows are sorted to the end automatically, you should define this as the clustered key and do the inserts pre-sorted.
If the above does not apply, you should do your inserts without any index and create the indexes after the insert operation.
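A sketch of the suggested sequence, assuming the staging table has a clustered index on an ever-increasing key column; object names, the file path, and the key column are illustrative:

-- 1. Disable the non-clustered indexes before the load
--    (non-clustered only: disabling the clustered index makes the table unreadable)
ALTER INDEX IX_Staging_Col2 ON dbo.Staging DISABLE;
-- ... repeat for the other non-clustered indexes ...

-- 2. Load the file pre-sorted on the clustered key so rows land at the end of the index
--    (file format options such as FIELDTERMINATOR omitted)
BULK INSERT dbo.Staging
FROM 'C:\load\data.dat'
WITH (ORDER (Id ASC), TABLOCK);

-- 3. Rebuild the indexes after the load (rebuilding also re-enables the disabled ones)
ALTER INDEX ALL ON dbo.Staging REBUILD;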

Should every User Table have a Clustered Index?

Recently I found a couple of tables in a Database with no Clustered Indexes defined.
There are non-clustered indexes defined, but the tables themselves are heaps.
On analysis I found that the SELECT statements were filtering on the columns covered by those non-clustered indexes.
Does not having a clustered index on these tables affect performance?
It's hard to state this more succinctly than SQL Server MVP Brad McGehee:
As a rule of thumb, every table should have a clustered index. Generally, but not always, the clustered index should be on a column that monotonically increases–such as an identity column, or some other column where the value is increasing–and is unique. In many cases, the primary key is the ideal column for a clustered index.
BOL echoes this sentiment:
With few exceptions, every table should have a clustered index.
The reasons for doing this are many and are primarily based upon the fact that a clustered index physically orders your data in storage.
If your clustered index is on a single column that monotonically increases, inserts occur in order on your storage device and page splits will not happen.
Clustered indexes are efficient for finding a specific row when the indexed value is unique, such as the common pattern of selecting a row based upon the primary key.
A clustered index often allows for efficient queries on columns that are often searched for ranges of values (between, >, etc.).
Clustering can speed up queries where data is commonly sorted by a specific column or columns.
A clustered index can be rebuilt or reorganized on demand to control table fragmentation.
These benefits can even be applied to views.
You may not want to have a clustered index on:
Columns that have frequent data changes, as SQL Server must then physically re-order the data in storage.
Columns that are already covered by other indexes.
Wide keys, as the clustered index is also used in non-clustered index lookups.
GUID columns, which are larger than identities and also effectively random values (not likely to be sorted upon), though newsequentialid() could be used to help mitigate physical reordering during inserts.
A rare reason to use a heap (table without a clustered index) is if the data is always accessed through nonclustered indexes and the RID (SQL Server internal row identifier) is known to be smaller than a clustered index key.
Because of these and other considerations, such as your particular application workloads, you should carefully select your clustered indexes to get maximum benefit for your queries.
Also note that when you create a primary key on a table in SQL Server, it will by default create a unique clustered index (if it doesn't already have one). This means that if you find a table that doesn't have a clustered index, but does have a primary key (as all tables should), a developer had previously made the decision to create it that way. You may want to have a compelling reason to change that (of which there are many, as we've seen). Adding, changing or dropping the clustered index requires rewriting the entire table and any non-clustered indexes, so this can take some time on a large table.
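To illustrate that default, a PRIMARY KEY constraint becomes a unique clustered index unless you say otherwise; the table and column names below are illustrative:

-- PRIMARY KEY defaults to a unique clustered index (if the table has no clustered index yet)
CREATE TABLE dbo.Orders (
    OrderID INT IDENTITY PRIMARY KEY,              -- clustered by default
    CustomerID INT NOT NULL
);

-- To leave the table a heap (or save the clustered index for another column), declare it NONCLUSTERED
CREATE TABLE dbo.OrdersHeap (
    OrderID INT IDENTITY PRIMARY KEY NONCLUSTERED,
    CustomerID INT NOT NULL
);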
I would not say "Every table should have a clustered index", I would say "Look carefully at every table and how they are accessed and try to define a clustered index on it if it makes sense". It's a plus, like a Joker, you have only one Joker per table, but you don't have to use it. Other database systems don't have this, at least in this form, BTW.
Putting clustered indices everywhere without understanding what you're doing can also kill your performance (in general, the INSERT performance because a clustered index means physical re-ordering on the disk, or at least it's a good way to understand it), for example with GUID primary keys as we see more and more.
So, read Tim Lehner's exceptions and reason.
Performance is a big hairy problem. Make sure you are optimizing for the right thing.
Free advice is always worth its price, and there is no substitute for actual experimentation.
The purpose of an index is to find matching rows and help retrieve the data when found.
A non-clustered index on your search criteria will help to find rows, but there needs to be additional operation to get at the row's data.
If there is no clustered index, SQL uses an internal rowId to point to the location of the data.
However, if there is a clustered index on the table, that rowId is replaced by the key values of the clustered index.
So the step of reading the row's data may not be needed if it is already covered by the values in the index.
Even if a clustered index isn't very selective, it may be helpful to have its keys at the leaf of the non-clustered index if those keys are frequently most or all of what a query requests.
Yes, you should have a clustered index on a table, so that all nonclustered indexes perform better.
Consider placing the clustered index on columns that contain a large number of distinct values, to avoid the need for SQL Server to add a "uniqueifier" to duplicate key values.
Disadvantage: it takes longer to update records, but only when the fields in the clustering index are changed.
Avoid clustered index designs where there is a risk that many concurrent inserts will happen on almost the same clustering index value.
Searches against a nonclustered index will appear slower if the clustered index isn't built correctly, or if it does not include all the columns needed to return the data back to the calling application. If the non-clustered index doesn't contain all the needed data, SQL Server will go to the clustered index to get the missing data (via a lookup), which makes the query run slower because the lookup is done row by row.
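On that lookup point, a covering non-clustered index is the usual fix; a small sketch with illustrative table, column, and index names:

-- Without INCLUDE, a seek on LastName still needs a lookup into the clustered index
-- (or a RID lookup on a heap) to fetch Email and Phone.
CREATE NONCLUSTERED INDEX IX_Customers_LastName
ON dbo.Customers (LastName);

-- With INCLUDE, the index covers the query and the row-by-row lookup disappears.
CREATE NONCLUSTERED INDEX IX_Customers_LastName_Covering
ON dbo.Customers (LastName)
INCLUDE (Email, Phone);

-- Covered by the second index:
-- SELECT Email, Phone FROM dbo.Customers WHERE LastName = N'Smith';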
Yes, every table should have a clustered index. The clustered index sets the physical order of data in a table. You can compare this to music in a store ordered by band name, or the Yellow Pages ordered by last name. Since it determines the physical order, you can have only one per table; it can be composed of many columns, but there is still only one.
It's best to place the clustered index on columns often searched for a range of values; an example would be a date range. Clustered indexes are also efficient for finding a specific row when the indexed value is unique. Microsoft SQL Server will place a clustered index on a PRIMARY KEY constraint automatically if no clustered index is defined.
Clustered indexes are not a good choice for:
Columns that undergo frequent changes: this results in the entire row moving (because SQL Server must keep the data values of a row in physical order). This is an important consideration in high-volume transaction processing systems where data tends to be volatile.
Wide keys: the key values from the clustered index are used by all nonclustered indexes as lookup keys and are therefore stored in each nonclustered index leaf entry.

SQL Server "Write Once" Table Clustered Index

I have a fairly unique table in a SQL Server database that doesn't follow 'typical' usage conventions and am looking for some advice regarding the clustered index.
This is a made-up example, but follows the real data pretty closely.
The table has a 3 column primary key, which are really foreign keys to other tables, and a fourth field that contains the relevant data. For this example, let's say that the table looks like this:
CREATE TABLE [dbo].[WordCountsForPage](
[AuthorID] [int] NOT NULL,
[BookID] [int] NOT NULL,
[PageNumber] [int] NOT NULL,
[WordCount] [int] NOT NULL,
-- the three-column primary key described above (constraint name is illustrative)
CONSTRAINT [PK_WordCountsForPage] PRIMARY KEY CLUSTERED ([AuthorID], [BookID], [PageNumber])
)
So, we have a somewhat hierarchical primary key, with the unique data being that fourth field.
In the real application, there are a total of 2.8 Billion possible records, but that's all. The records are created on the fly as the data is calculated over time, and realistically probably only 1/4 of those records will ever actually be calculated. They are stored in the DB since the calculation is an expensive operation, and we only want to do it once for each unique combination.
Today, the data is read thousands of times a minute, but (at least for now) there are also hundreds of inserts per minute as the table populates itself (and this will continue for quite some time). I would say that there are 10 reads for every insert (today).
I am wondering if we are taking a performance hit on all of those inserts because of the clustered index.
The clustered index makes sense "long term" since the table will eventually become read-only, but it will take some time to get there.
I suppose I could make the index non-clustered during the heavy insert period, and change it to clustered as the table becomes populated, but how do you determine when the cross-over point would be (and how can I notify myself in the future that the 'time has come')?
What I really need is a convertible index that crosses over from non-clustered to clustered at some magical time in the future.
Any suggestions for how to handle this one?
Actually, I would not bother with trying to have a non-clustered index first and convert it to a clustered one (that alone is a really messy affair!) later on.
As The Queen of Indexing, Kimberly Tripp, explains in her post The Clustered Index Debate Continues..., having a clustered index on a table can actually improve your INSERT performance!
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
A heap is a table which has no clustered index defined on it.
Considering this, and the effort and trouble it takes to go from heap to a table with a clustered index - I wouldn't even bother. Just define your indices, and start using that table!

Cluster the index on ever-increasing datetime column on logging table?

I'm not a DBA ("Good!", you'll be thinking in a moment.)
I have a table of logging data with these characteristics and usage patterns:
A datetime column for storing log timestamps whose value is ever-increasing and mostly (but only mostly) unique
Frequent-ish inserts (say, a dozen a minute), only at the end of the timestamp range (new data being logged)
Infrequent deletes, in bulk, from the beginning of the timestamp range (old data being cleared)
No updates at all
Frequent-ish selects using the timestamp column as the primary criterion, along with secondary criteria on other columns
Infrequent selects using other columns as the criteria (and not including the timestamp column)
A good amount of data, but nowhere near enough that I'm worried much about storage space
Additionally, there is currently a daily maintenance window during which I could do table optimization.
I frankly don't expect this table to challenge the server it's going to be on even if I mis-index it a bit, but nevertheless it seemed like a good opportunity to ask for some input on SQL Server clustered indexes.
I know that clustered indexes determine the storage of the actual table data (the data is stored in the leaf nodes of the index itself), and that non-clustered indexes are separate pointers into the data. So in query terms, a clustered index is going to be faster than a non-clustered index -- once we've found the index value, the data is right there. There are costs on insert and delete (and of course an update changing the clustered index column's value would be particularly costly).
But I read in this answer that deletes leave gaps that don't get cleaned up until/unless the index is rebuilt.
All of this suggests to me that I should:
Put a clustered index on the timestamp column with a 100% fill-factor
Put non-clustered indexes on any other column that may be used as a criterion in a query that doesn't also involve the clustered column (which may be any of them in my case)
Schedule the bulk deletes to occur during the daily maintenance interval
Schedule a rebuild of the clustered index to occur immediately after the bulk delete
Relax and get out more
Am I wildly off base there? Do I need to frequently rebuild the index like that to avoid lots of wasted space? Are there other obvious (to a DBA) things I should be doing?
Thanks in advance.
Contrary to what a lot of people believe, having a good clustered index on a table can actually make operations like INSERTs faster - yes, faster!
Check out the seminal blog post The Clustered Index Debate Continues.... by Kimberly Tripp - the ultimate indexing queen.
She mentions (about halfway through the article):
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
The crucial point is: only with the right clustered index will you be able to reap the benefits - when a clustered index is unique, narrow, stable and optimally ever-increasing. This is best served with an INT IDENTITY column.
Kimberly Tripp also has a great article on how to pick the best possible clustering key for your tables, and what criteria it should fulfil - see her post entitled Ever-increasing clustering key - the Clustered Index Debate..........again!
If you have such a column - e.g. a surrogate primary key - use that for your clustering key and you should see very nice performance on your table - even on lots of INSERTs.
I agree with putting the clustered index on the timestamp column. My query would be about the fillfactor: 100% gives the best read performance at the expense of write performance, and you may be hurt by page splits. Choosing a lower fillfactor will delay page splitting at the expense of read performance, so it's a fine balancing act to get the best for your situation.
After the bulk deletes it's worth rebuilding the indexes and updating statistics. This not only keeps performance up but also resets the indexes to the specified fillfactor.
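A sketch of that post-delete maintenance step, with illustrative table and index names and the 100% fillfactor discussed above:

-- after the bulk delete in the maintenance window
ALTER INDEX CIX_Log_Timestamp ON dbo.EventLog
REBUILD WITH (FILLFACTOR = 100);            -- resets the index to the chosen fillfactor

UPDATE STATISTICS dbo.EventLog;             -- refresh statistics after the large delete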
Finally, yes, put nonclustered indexes on other appropriate columns, but only ones that are very selective, e.g. not bit fields. But remember: the more indexes you add, the more this affects write performance.
There's two "best practice" ways to index a high traffic logging table:
an integer identity column as a primary clustered key
a uniqueidentifier column as primary key, with DEFAULT NEWSEQUENTIALID()
Both methods allow SQL Server to grow the table efficiently, because it knows that the index tree will grow in a particular direction.
I would not put any other indexes on the table, or schedule rebuilds of the index, unless there is a specific performance issue.
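The two patterns, sketched with illustrative table and column names:

-- Pattern 1: integer identity as the clustered primary key
CREATE TABLE dbo.AppLog (
    LogID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    LoggedAt DATETIME NOT NULL,
    LogMessage NVARCHAR(1000) NOT NULL
);

-- Pattern 2: uniqueidentifier primary key defaulting to NEWSEQUENTIALID(), which also grows in one direction
CREATE TABLE dbo.AppLogGuid (
    LogID UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_AppLogGuid_LogID DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_AppLogGuid PRIMARY KEY CLUSTERED,
    LoggedAt DATETIME NOT NULL,
    LogMessage NVARCHAR(1000) NOT NULL
);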
The obvious answer is it depends on how you will query it. The point of the index is to lessen the quantity of compares when selecting data. The clustered index helps when you consider what data you will load together and the blocking factor of the storage (you can load a bunch of data in a 64k block with one read). If you include an ID and a datetime as the primary key but do not use them in your selection criteria, they will do nothing but hinder your performance. This is why people usually drop indexes before bulk-loading data.
