I have a table in a Sybase DB with more than 33 million records. There is an existing clustered index on the table, and I want to add a new column to the table and include it in the index. But with this many records, if I drop the index and create a new one that includes the newly added column, the index creation keeps running for hours. Is there any way to make it faster?
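For reference, this is roughly the drop-and-recreate pattern being described, written as a hedged Sybase ASE sketch with made-up table/column/index names; the `with consumers` clause is an ASE option for parallel index sorting that depends on server configuration, so treat it as something to verify in the documentation rather than a guaranteed speed-up:

```sql
-- Hypothetical names; assumes Sybase ASE syntax.
alter table big_table add new_col int null
go

drop index big_table.big_table_cidx
go

-- 'with consumers = N' asks ASE to use N worker processes for the index sort;
-- it only helps if parallel processing is enabled on the server.
create clustered index big_table_cidx
    on big_table (existing_col, new_col)
    with consumers = 4
go
```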
We have a huge table, Table1 (2.5 billion rows), with a single column A (NVARCHAR(255)). What is the right approach for seek operations against this table: a clustered index on A vs. a clustered columnstore index on A?
We are already keeping this table in a separate filegroup from Table2, the table it will be joined with.
Do you suggest partitioning this table for better performance? The column will also hold Unicode data, so what kind of partitioning approach is suitable for a Unicode datatype?
UPDATE: To clarify further, the use case for the table is SEEK. The table stores identifiers for individuals. The major concern is seek performance against a huge table. The table will be referenced inside a transaction, and we want that transaction to be short.
Clustered index vs. columnstore index depends on the use case for the table. A columnstore index keeps track of the unique entries in a column and the rows where those entries are stored. This makes it very useful for data-warehousing tasks such as aggregates over the indexed columns, but less suitable for transactional tasks that need to pull a small number of specific rows. If you are using SQL Server 2014 or later, you can combine the two by creating a clustered columnstore index, though it has some limitations and overhead that you should read up on.
Given that this is a seek for specific rows and not an aggregation over the column, I would recommend a clustered index rather than a columnstore index.
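As a minimal illustration of the two alternatives (T-SQL, with assumed object names; the right choice still depends on testing against the actual workload):

```sql
-- Option 1: rowstore clustered index on A, suited to point seeks.
CREATE CLUSTERED INDEX CIX_Table1_A
    ON dbo.Table1 (A);

-- Option 2 (SQL Server 2014+): clustered columnstore index,
-- better for scans and aggregations than for singleton lookups.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Table1
    ON dbo.Table1;
```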
Is there a benefit to sorting the data in the *.dat file on the INDEXED column before pushing it into the STAGING table in SQL Server?
OK, the scenario is:
I have a STAGING table with 40 columns and indexes on 5 of them. I need to push data from a file containing 15 million rows into the STAGING table.
The approach I have followed is:
First, DISABLE the INDEXES
Second, push the data from the file into the STAGING table
Third, REBUILD the INDEXES OFFLINE
Now I need to understand: if I sort the data in the file by the indexed column, will it benefit in any way
in the INSERT
in the INDEX REBUILD.
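For context, a minimal T-SQL sketch of the disable / load / rebuild sequence described above (table name, index names, and file path are assumptions):

```sql
-- Assumed staging table dbo.Staging and data file path.
-- Disabling only the non-clustered indexes: disabling the clustered index
-- would make the table unreadable for the load.
ALTER INDEX IX_Staging_Col1 ON dbo.Staging DISABLE;
ALTER INDEX IX_Staging_Col2 ON dbo.Staging DISABLE;
-- ... repeat for the remaining indexed columns ...

BULK INSERT dbo.Staging
FROM 'C:\data\staging.dat'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);

-- Rebuilding a disabled index re-enables it; OFFLINE is the default anyway.
ALTER INDEX ALL ON dbo.Staging REBUILD WITH (ONLINE = OFF);
```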
General answer: No!
15 million rows is quite a lot... It depends on how you are querying / filtering / sorting your data, and it depends on the quality of your data:
Does your table have a clustered key (are you aware of the difference between a clustered and a non-clustered index)?
Is there a one-column key candidate that is implicitly sorted (like an IDENTITY column)?
Will the table see a lot of deletes / inserts in the future?
SQL Server does not apply any implicit sorting.
Only one case comes to mind: if there is an active clustered index and you insert your data pre-sorted, the rows should be appended at the end, so the index will not get fragmented and will not need a rebuild afterwards.
If you remove your indexes and then insert your data, the insertion should be faster, but you'll need a lot of work afterwards to get a clustered key into the right physical order.
Many big tables define a non-clustered primary key and no clustered key at all...
My suggestion
remove all non-clustered indexes
If your table has an implicitly sorted PK so that new rows land at the end automatically, define it as the clustered key and do the inserts pre-sorted.
If the above does not apply, do your inserts without any index and create the indexes after the insert operation (a sketch follows below).
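A minimal sketch of that suggestion, assuming a staging table with an ascending IDENTITY-style key named Id and made-up index names:

```sql
-- Drop the non-clustered indexes before the load (names assumed).
DROP INDEX IX_Staging_ColB ON dbo.Staging;
DROP INDEX IX_Staging_ColC ON dbo.Staging;

-- Clustered key on the implicitly sorted column, so pre-sorted inserts
-- are appended at the end and cause no fragmentation.
CREATE CLUSTERED INDEX CIX_Staging_Id ON dbo.Staging (Id);

-- Load the file, telling SQL Server the input is already ordered by Id.
BULK INSERT dbo.Staging
FROM 'C:\data\staging.dat'
WITH (ORDER (Id ASC), TABLOCK);

-- Re-create the non-clustered indexes once the load is done.
CREATE INDEX IX_Staging_ColB ON dbo.Staging (ColB);
CREATE INDEX IX_Staging_ColC ON dbo.Staging (ColC);
```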
I just created a table with a two-column (composite) primary key in SQL Server. One column is age, the other is an ID number, and I chose the CLUSTERED INDEX option, so it automatically creates a clustered index on both columns. However, when I query the table, the results only appear to be sorted by ID and completely disregard/ignore AGE (the other PK and clustered-index column). Why is this? Why is it only sorting on the first clustered-index column?
The query optimizer may decide to use the physical ordering of the rows in the table if there is no advantage to ordering any other way. So when you select from the table with a simple query, the results may come back in that order. It is very easy to assume that the rows are physically stored in the order specified in the definition of your clustered index, but that turns out to be a false assumption.
Please view the following article for more details: Clustered Index do “NOT” guarantee Physically Ordering or Sorting of Rows
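To illustrate the point (column names taken from the question, everything else assumed): the order of rows returned without an ORDER BY is never guaranteed, whatever the clustered-index definition says.

```sql
-- Composite primary key; the clustered index key here is (ID, Age).
CREATE TABLE dbo.People
(
    ID  INT     NOT NULL,
    Age TINYINT NOT NULL,
    CONSTRAINT PK_People PRIMARY KEY CLUSTERED (ID, Age)
);

-- Row order here is whatever the engine finds convenient.
SELECT ID, Age FROM dbo.People;

-- The only way to guarantee a sort order is to ask for one explicitly.
SELECT ID, Age FROM dbo.People ORDER BY Age, ID;
```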
We have a FACT table with 237,383,163 rows, which contains a lot of duplicate data.
Queries against this table do a SCAN across that many rows, resulting in long execution times (because we haven't created a clustered index).
Can someone suggest a way to create a clustered key using some combination of the existing fields, possibly along with a new field (like an identity column)?
The non-clustered indexes created on the table are of no help either.
Thoughts:
Adding a clustered index that is not unique will require a 4-byte uniqueifier
Adding a surrogate IDENTITY column will still leave you with duplicates
A clustered index is best when narrow and numeric, especially if you have non-clustered indexes
First, de-duplicate the data
Then I'd consider one of two things, based on whether there are non-clustered indexes:
Without NC indexes: create a unique clustered index on some or all of the FACT columns
With NC indexes: create an IDENTITY column and use it as the clustered index, then create a unique NC index on the FACT columns
Option 1 will be a lot smaller on disk. I've done this before for a billion-plus-row fact table and it shrank by 65%. There were no NC indexes.
Both options will need to be tested to see the effect on load and response times etc.
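A rough sketch of the de-duplication step plus the two options (all object and column names are placeholders; test on a copy of the data first):

```sql
-- 1. De-duplicate: keep one row per distinct combination of the fact columns.
;WITH Dupes AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ColA, ColB, ColC ORDER BY (SELECT 0)) AS rn
    FROM dbo.FactTable
)
DELETE FROM Dupes WHERE rn > 1;

-- 2a. Option 1 (no NC indexes): unique clustered index directly on the fact columns.
-- CREATE UNIQUE CLUSTERED INDEX CIX_Fact ON dbo.FactTable (ColA, ColB, ColC);

-- 2b. Option 2 (NC indexes exist): narrow surrogate clustered key
--     plus a unique NC index enforcing the fact columns.
ALTER TABLE dbo.FactTable ADD FactId BIGINT IDENTITY(1,1) NOT NULL;
CREATE UNIQUE CLUSTERED INDEX CIX_Fact_FactId ON dbo.FactTable (FactId);
CREATE UNIQUE INDEX UIX_Fact_Cols ON dbo.FactTable (ColA, ColB, ColC);
```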
I have a very large table (800 GB) with a DATETIME field named tran_date that is part of a partition scheme. The problem I'm having is that the indexes are not properly aligned with the partition, and I can't include the tran_date field in the PRIMARY KEY because it's nullable.
I can drop all foreign key relationships, statistics, and indexes, but I can't modify the column because the partition scheme is still dependent on the tran_date column.
In my research I've found one way to move the table off the partition: drop the clustered index and then re-create the clustered index on the PRIMARY filegroup, which then allows me to modify the column. But this operation takes several hours for the drop, 13 hours to write the temporary CLUSTERED INDEX on PRIMARY, and then I have to drop that, alter the table, and re-create the CLUSTERED INDEX properly, which takes another 13 hours. Additionally, I have more than one table.
When I tested this deployment in my development environment with a similarly sized data set, it took several days to complete, so I'm trying to find ways to cut this time down.
If I can move the table off the partition without having to write a CLUSTERED INDEX on PRIMARY, it would significantly reduce the time required to alter the column.
No matter what, you are going to end up moving data from "point A" (stored in table partitions within the database) to "point B" (not stored within table partitions within the database). The goal is to minimize the number of times you have to work through all that data. The simplest way to do this might be:
Create a new non-partitioned table
Copy the data over to that table
Drop the original table
Rename the new table to the proper name
One problem to deal with is the clustered index. You could either create the new table without the clustered index, copy the data over, and then reindex (extra time and pain), or you could create the table with the clustered index and copy the data over "in order" (say, low IDs to high). The latter would be slower than copying into a heap, but it might be faster overall since you wouldn't then have to build the clustered index.
Of course there's the problem of "what if users change the data while you're copying it"... but table partitioning implies warehousing, so I'm guessing you don't have to worry about that.
A last point: when copying gobs of data, it is best to break the insert into several smaller inserts, so as not to bloat the transaction log.
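A hedged sketch of the copy-and-swap approach with batched inserts (table, column, and batch-key names are assumptions; in practice the batching key should be something indexed in the source table):

```sql
-- New, non-partitioned table on PRIMARY, created with the clustered
-- index up front so ordered inserts avoid a separate rebuild.
CREATE TABLE dbo.BigTable_new
(
    id        BIGINT   NOT NULL,
    tran_date DATETIME NULL,
    -- ... remaining columns ...
    CONSTRAINT PK_BigTable_new PRIMARY KEY CLUSTERED (id)
) ON [PRIMARY];

-- Copy in batches so each transaction (and the log) stays small.
DECLARE @batch BIGINT = 1000000, @minId BIGINT = 0, @maxId BIGINT;
SELECT @maxId = MAX(id) FROM dbo.BigTable;

WHILE @minId <= @maxId
BEGIN
    INSERT INTO dbo.BigTable_new (id, tran_date /* , ... */)
    SELECT id, tran_date /* , ... */
    FROM dbo.BigTable
    WHERE id > @minId AND id <= @minId + @batch
    ORDER BY id;                     -- copy "in order" per the clustered key

    SET @minId = @minId + @batch;
END;

-- Swap: drop the partitioned original and rename the replacement.
DROP TABLE dbo.BigTable;
EXEC sp_rename 'dbo.BigTable_new', 'BigTable';
```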