Is there a benefit to sorting the data in a *.dat file based on an INDEXED column before pushing it to the STAGING table in SQL Server?
OK, the scenario is:
I have a STAGING table with 40 columns and indexes on 5 columns. I need to push data from a file that contains 15 million rows into the STAGING table.
The approach I have followed is:
First, DISABLE the INDEXES
Second, push the data from file to STAGING table
Third, REBUILD INDEXES OFFLINE
Now I need to understand: if I sort the data in the file based on a column that is indexed, will it benefit in any way:
IN INSERT
IN INDEX REBUILD.
General answer: No!
15 million rows is quite a lot... It depends on how you are querying / filtering / sorting your data, and on the quality of your data:
Does your table have a clustered key (Are you aware of the difference between a clustered and a non-clustered index)?
Is there a one-column key candidate which is implicitly sorted (like IDENTITY)?
Will the table see a lot of deletes / inserts in the future?
SQL Server does not apply any implicit sorting.
Only one case comes to mind: if there is an active clustered index and you insert your data pre-sorted, the rows will be appended at the end, so the index will not become fragmented and will not need a rebuild afterwards.
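For illustration, BULK INSERT can be told that the file is already sorted, so the engine does not have to re-sort it on the way in. A minimal sketch, assuming a hypothetical dbo.Staging table clustered on a LoadDate column and a pipe-delimited file (all names are placeholders):

    BULK INSERT dbo.Staging
    FROM 'C:\loads\data.dat'
    WITH (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR   = '\n',
        ORDER (LoadDate ASC),  -- promise that file rows match the clustered key order
        TABLOCK                -- allows minimal logging under the right recovery model
    );

If the ORDER hint lies about the actual file order, the load fails or the benefit is lost, so only use it when the file really is sorted.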
If you remove your indexes and insert your data, insertion should be faster, but you'll need a lot of work to get a clustered key in the right physical order at the end.
Many big tables define a non-clustered primary key and no clustered key at all...
My suggestion
remove all non-clustered indexes
If your table has an implicitly sorted PK and new rows sort to the end automatically, you should define this as the clustered key and do the inserts pre-sorted.
If the above does not apply, you should do your inserts without any index and create the indexes after the insert operation.
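As a rough T-SQL sketch of that workflow (index and table names are placeholders):

    -- 1. Disable the non-clustered indexes one by one. Do not use
    --    ALTER INDEX ALL here: disabling the clustered index would
    --    make the whole table inaccessible.
    ALTER INDEX IX_Staging_ColA ON dbo.Staging DISABLE;
    ALTER INDEX IX_Staging_ColB ON dbo.Staging DISABLE;

    -- 2. Load the file (e.g. with the BULK INSERT shown above).

    -- 3. Rebuild everything offline once the data is in; rebuilding
    --    a disabled index re-enables it.
    ALTER INDEX ALL ON dbo.Staging REBUILD WITH (ONLINE = OFF);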
Related
We have a huge table, Table1 (2.5 billion rows), with a single column A (NVARCHAR(255) datatype). What is the right approach for seek operations against this table: a clustered index on A vs. a clustered columnstore index on A?
We are already keeping this table in a separate filegroup from the other table, Table2, with which it will be joined.
Do you suggest partitioning this table for better performance? The column will also contain Unicode data, so what kind of partitioning approach works for a Unicode datatype?
UPDATE: To clarify further, the use case for the table is SEEK. The table stores identifiers for individuals. The major concern is seek performance against such a huge table. The table will be referenced inside a transaction, and we want the transaction to be short.
Clustered index vs. columnstore index depends on the use case for the table. A columnstore index keeps track of the unique entries in a column and the rows where those entries are stored. This makes it very useful for data-warehousing tasks such as aggregates over the indexed columns, but not as well suited to transactional tasks that need to pull a small number of specific rows. If you are using SQL Server 2014 or later, you can combine both by creating a clustered columnstore index. It does have some limitations and overhead that you should read up on, though.
Given that this is a seek for specific rows and not an aggregation of the column, I would recommend a clustered index instead of a column store index.
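A minimal sketch of that recommendation, using the names from the question (this assumes the identifiers in A are unique; if they are not, drop UNIQUE and SQL Server will add a hidden 4-byte uniqueifier):

    -- Clustered index to support point seeks on A.
    CREATE UNIQUE CLUSTERED INDEX CIX_Table1_A
        ON dbo.Table1 (A);

Note that an NVARCHAR(255) key can be up to 510 bytes wide, which fits under the 900-byte clustered key limit but still makes for a comparatively large index.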
We have a FACT table with 237,383,163 rows and a lot of duplicate data.
Queries against this table do a SCAN across all those rows, resulting in long execution times (because we haven't created a clustered index).
Can someone suggest a way to create a clustered key, using some combination of existing fields and/or adding a new field (like an identity column)?
The non-clustered indexes created on the table are of no help either.
Regards
Thoughts:
Adding a clustered index that is not unique will require a 4-byte uniqueifier
Adding a surrogate IDENTITY column will still leave you with the duplicate data
A clustered index is best when narrow and numeric, especially if you have non-clustered indexes
First, de-duplicate the data
Then I'd consider one of two options, depending on whether there are non-clustered indexes (both sketched below):
Without NC indexes, create a unique clustered index on some or all of the FACT columns
With NC indexes, create an IDENTITY column and use this as the clustered index. Create a unique NC index on the FACT columns
Option 1 will be a lot smaller on disk. I've done this before for a billion+ row fact table and it shrank by 65%. There were no NC indexes.
Both options will need to be tested to see the effect on load and response times, etc.
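A hedged sketch of the de-duplication step plus the two options, with hypothetical fact columns Col1-Col3 standing in for whatever actually defines a unique row (pick one option, not both):

    -- De-duplicate: keep one row per distinct combination.
    WITH d AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                  ORDER BY (SELECT NULL)) AS rn
        FROM dbo.FactTable
    )
    DELETE FROM d WHERE rn > 1;

    -- Option 1 (no NC indexes): unique clustered index on the fact columns.
    CREATE UNIQUE CLUSTERED INDEX CIX_Fact
        ON dbo.FactTable (Col1, Col2, Col3);

    -- Option 2 (NC indexes exist): narrow IDENTITY clustering key,
    -- plus a unique NC index to keep enforcing uniqueness.
    ALTER TABLE dbo.FactTable ADD FactId INT IDENTITY(1, 1) NOT NULL;
    CREATE CLUSTERED INDEX CIX_Fact ON dbo.FactTable (FactId);
    CREATE UNIQUE NONCLUSTERED INDEX UX_Fact
        ON dbo.FactTable (Col1, Col2, Col3);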
I have a very large table (800GB) which has a DATETIME field which is part of a partition schema. This field is named tran_date. The problem I'm having is that the indexes are not properly aligned with the partition and I can't include the tran_date field in the PRIMARY KEY because it's set to nullable.
I can drop all foreign key relationships, statistics, and indexes, but I can't modify the column because the partition schema is still dependent on the tran_date column.
In my research I've found one way to move the table off the partition: drop the clustered index, then re-write the clustered index onto the PRIMARY filegroup, which then allows me to modify the column. But this operation takes several hours for the drop and 13 hours to write the temporary CLUSTERED INDEX on PRIMARY; then I have to drop that, alter the table, and re-write the CLUSTERED INDEX properly, which takes another 13 hours. Additionally, I have more than one table.
When I tested this deployment in my development environment with a similarly sized data set it took several days to complete, so I'm trying to look for ways to chop down this time.
If I can move the table off the partition without having to write a CLUSTERED INDEX on PRIMARY it would significantly reduce the time required to alter the column.
No matter what, you are going to end up moving data from "point A" (stored in table partitions within the database) to "point B" (not stored within table partitions within the database). The goal is to minimize the number of times you have to work through all that data. The simplest way to do this might be:
Create a new non-partitioned table
Copy the data over to that table
Drop the original table
Rename the new table to the proper name
One problem to deal with is the clustered index. You could either create the new table without the clustered index, copy the data over, and then reindex (extra time and pain), or you could create the table with the clustered index, and copy the data over “in order” (say, low Ids to high). This would be slower than copying it over to a non-clustered table, but it might be faster overall since you wouldn’t then have to build the clustered index.
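A minimal sketch of that second variant, with hypothetical columns (tran_date comes from the question; the rest are stand-ins):

    -- New non-partitioned copy on PRIMARY, clustered index built up front.
    CREATE TABLE dbo.BigTable_New (
        Id        BIGINT   NOT NULL,
        tran_date DATETIME NULL,
        Amount    MONEY    NULL
    ) ON [PRIMARY];

    CREATE CLUSTERED INDEX CIX_BigTable_New ON dbo.BigTable_New (Id);

    -- Copy in clustered-key order so the pages fill sequentially.
    INSERT INTO dbo.BigTable_New WITH (TABLOCK) (Id, tran_date, Amount)
    SELECT Id, tran_date, Amount
    FROM dbo.BigTable
    ORDER BY Id;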
Of course there's the problem of "what if users change the data while you're copying it"... but table partitioning implies warehousing, so I'm guessing you don't have to worry about that.
One last point: when copying gobs of data, it is best to break the insert into several smaller inserts so as not to bloat the transaction log.
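One way to batch it, sketched under the same hypothetical names as above (an increasing key Id, and SIMPLE recovery so the log can truncate at each CHECKPOINT; under FULL recovery you would take log backups between batches instead):

    DECLARE @batch  INT    = 1000000,
            @lastId BIGINT = 0,
            @rows   INT    = 1;

    WHILE @rows > 0
    BEGIN
        INSERT INTO dbo.BigTable_New (Id, tran_date, Amount)
        SELECT TOP (@batch) Id, tran_date, Amount
        FROM dbo.BigTable
        WHERE Id > @lastId
        ORDER BY Id;

        SET @rows = @@ROWCOUNT;                         -- capture before the next statement resets it
        SELECT @lastId = MAX(Id) FROM dbo.BigTable_New; -- resume point for the next batch
        CHECKPOINT;                                     -- let the log reuse space between batches
    END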
I have a table that doesn't have a primary key; the data is already there. I have created a non-clustered index, but when I run a query, the actual execution plan does not show the index being used. I think the non-clustered index is not working. What could be the reason? Please help me.
First of all - why isn't there a primary key?? If it doesn't have a primary key, it's not a table - just add one! That will help on so many levels....
Secondly: even if you have an index, the SQL Server query optimizer will always look at your query to decide whether it makes sense to use the index (or not). If you select all columns and a large portion of the rows, then using an index is pointless.
So things to avoid are:
SELECT * FROM dbo.YourTable is almost guaranteed not to use any indices
if you don't have a good WHERE clause in your query
if your index is on a column that doesn't narrow the data down to a small percentage of rows; an index on a boolean column, or on a Gender column with at most three distinct values, doesn't help at all
Without knowing a lot more about your table structure, the data contained in those tables, the number of rows, and what kind of queries you're executing, no one can really answer your question - it's just way too broad....
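To illustrate the contrast with a hypothetical table and a non-clustered index on LastName:

    -- Selective predicate on an indexed column: an index seek is plausible.
    SELECT Id, LastName
    FROM dbo.YourTable
    WHERE LastName = N'Smith';

    -- No WHERE clause and every column: the optimizer will just scan.
    SELECT * FROM dbo.YourTable;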
Update: if you want to create a clustered index on a column other than your primary key, follow these steps:
1) First, design your table.
2) Then open up the index designer and create a new, clustered index on a column of your choice. Mind you - this is NOT the primary key!
3) After that, you can put your primary key on the ID column - it will create an index, but that index is not clustered!
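The same steps in T-SQL, with placeholder names:

    -- Step 2: clustered index on a column of your choice (not the PK).
    CREATE CLUSTERED INDEX CIX_MyTable_CreatedAt
        ON dbo.MyTable (CreatedAt);

    -- Step 3: the primary key then comes in as a NONCLUSTERED index.
    ALTER TABLE dbo.MyTable
        ADD CONSTRAINT PK_MyTable PRIMARY KEY NONCLUSTERED (ID);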
Without having any more information I'd guess that the reason is that the table is too small for an index seek to be worth it.
If your table has less than a few thousand rows then SQL Server will almost always choose to do a table / index scan regardless of the indexes on that table simply because an index scan is in fact faster.
An index scan in itself doesn't necessarily indicate a performance problem - is the query actually slow?
I am a complete beginner with SQL Server 2005 and I am learning it from an online tutorial. Here are some of my questions:
1: What is the difference between SELECT * FROM XYZ and SELECT ALL * FROM XYZ?
2: The purpose of a clustered index is to make searching easier by physically sorting the table [as far as I know :-)]. Let's say we have a primary key column on a table; is it then still good to create a clustered index on the table, given that we already have a column which is sorted?
3: Why can we create 1 clustered index + 249 nonclustered indexes = 250 indexes on a table? I understand the requirement of 1 clustered index, but why 249? Why not more than 249?
No difference - SELECT ALL is the default, as opposed to SELECT DISTINCT.
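For example (hypothetical table), these two statements produce identical results and identical plans:

    SELECT ALL CustomerId FROM dbo.Orders;
    SELECT     CustomerId FROM dbo.Orders;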
Opinion varies. For performance reasons Clustered indexes should ideally be small, stable, unique, and monotonically increasing. Primary keys should also be stable and unique so there is an obvious fit there. However clustered indexes are well suited for range queries. Looking up individual records by PK can perform well if the PK is nonclustered so some authors suggest not "wasting" the clustered index on the PK.
In SQL Server 2008 you can create up to 999 NCIs on a table. I can't imagine ever doing so, but I think the limit was raised because, with filtered indexes, there might potentially be a viable case for this many. Indexes add a cost to data modification operations, though, as the changes need to be propagated to multiple places, so I would imagine only largely read-only (e.g. reporting) databases would ever reach even double figures of non-clustered, non-filtered indexes.
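For reference, a filtered index (SQL Server 2008+, hypothetical names) only contains the rows matching its WHERE clause, which is why many small, targeted indexes could conceivably add up:

    -- Index only the open orders, not the whole table.
    CREATE NONCLUSTERED INDEX IX_Orders_Open
        ON dbo.Orders (OrderDate)
        WHERE OrderStatus = 'Open';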
For 3:
Every time you insert or delete a record in the table, ALL indexes must be updated. If you have too many indexes, this takes too long.
If your table has more than 5-6 indexes, I think you need to take the time and check for yourself.
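One way to check is the index usage DMV (a standard system view, available since SQL Server 2005): compare how often each index has been read versus written since the last restart.

    SELECT OBJECT_NAME(s.object_id)                     AS table_name,
           i.name                                       AS index_name,
           s.user_seeks + s.user_scans + s.user_lookups AS reads,
           s.user_updates                               AS writes
    FROM sys.dm_db_index_usage_stats AS s
    JOIN sys.indexes AS i
        ON i.object_id = s.object_id
       AND i.index_id  = s.index_id
    WHERE s.database_id = DB_ID();

Indexes with many writes and few reads are costing you on every insert/delete without paying for themselves.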