Drop index at partition level - SQL Server

Do you know if there's any way of doing this in SQL Server (2008)?
I'm working on a data warehouse loading process. What I want to do is drop the indexes on the partition being loaded so I can perform a quick bulk load, and then rebuild the index at the partition level.
I think that in Oracle it's possible to achieve this, but maybe not in SQL Server.
thanks,
Victor

No, you can't drop a table's indexes for just a single partition. However, SQL 2008 provides a methodology for bulk-loading that involves setting up a second table with exactly the same schema on the same filegroup, loading it, indexing it in precisely the same way, then "switching" your new table in for an existing, empty partition on the production table.
This is a highly simplified description, though. Here's the MSDN article for SQL 2008 on implementing this:
http://msdn.microsoft.com/en-us/library/ms191160.aspx
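For illustration, here is a minimal T-SQL sketch of that switch-in pattern. All object names, the file path and the partition number are hypothetical; the staging table must sit on the same filegroup as the target partition, with matching indexes and a check constraint covering that partition's boundary values.

-- Hypothetical names: FactSales is the partitioned production table,
-- FactSales_Staging is an identically structured table on the same filegroup.

-- 1. Bulk load the staging table while it has no (or minimal) indexes.
BULK INSERT dbo.FactSales_Staging
FROM 'C:\loads\sales_2008_06.dat'
WITH (TABLOCK);

-- 2. Build the same indexes the production table has, plus a CHECK constraint
--    matching the boundary values of the target (empty) partition.
CREATE CLUSTERED INDEX CIX_FactSales_Staging
    ON dbo.FactSales_Staging (SaleDate, SaleID);
ALTER TABLE dbo.FactSales_Staging
    ADD CONSTRAINT CK_Staging_SaleDate
    CHECK (SaleDate >= '20080601' AND SaleDate < '20080701');

-- 3. Switch the loaded table into the empty partition (a metadata-only operation).
ALTER TABLE dbo.FactSales_Staging
    SWITCH TO dbo.FactSales PARTITION 6;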

I know it wasn't possible in SQL 2005. I haven't heard of anything that would let you do this in 2008, but it could be there (I've read about 2008 but have not yet used it). The closest I could get was disabling the index, but if you disable a clustered index you can no longer access the table. Not all that useful, imho.
My solution for our Warehouse ETL project was to create a table listing all the indexes and indexing constraints (PKs, UQs). During ETL, we walk through the table (for the desired set of tables being loaded), drop the indexes/indexing constraints, load the data, then walk through the table again and recreate the indexes/constraints. Kind of ugly and a bit awkward, but once up and running it won't break--and has the added advantage of freshly built indexes (i.e. no fragmentation, and fillfactor can be 100). Adding/modifying/dropping indexes is also awkward, but not all that hard.
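A hedged sketch of that kind of permanent metadata table and the drop/recreate loop follows. All names (the etl schema, table and columns) are hypothetical, and a real version would also need to handle PK/UQ constraints via ALTER TABLE rather than DROP INDEX.

-- Hypothetical metadata table holding one row per index/constraint to drop and rebuild.
CREATE TABLE etl.IndexDefinitions (
    TableName    sysname       NOT NULL,
    IndexName    sysname       NOT NULL,
    DropScript   nvarchar(max) NOT NULL,  -- e.g. DROP INDEX ... / ALTER TABLE ... DROP CONSTRAINT ...
    CreateScript nvarchar(max) NOT NULL   -- full CREATE INDEX / ADD CONSTRAINT statement
);

-- Before the load: drop everything listed for the table being loaded.
DECLARE @sql nvarchar(max);
DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT DropScript FROM etl.IndexDefinitions WHERE TableName = N'dbo.FactSales';
OPEN cur;
FETCH NEXT FROM cur INTO @sql;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC sp_executesql @sql;
    FETCH NEXT FROM cur INTO @sql;
END
CLOSE cur; DEALLOCATE cur;

-- ... bulk load here ...

-- After the load: run the CreateScript column the same way (FILLFACTOR = 100 if desired).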
You could do it dynamically--read and store the indexes/constraints definitions from the target table, drop them, load data, then dynamically build and run the (re)create scripts from your stored data. But, if something crashes during the run, you are so dead. (That's why I settled on permanent tables.)
I find this to work very well with table partitioning, since you do all the work on "Loading" tables, and the live (dbo, for us) tables are untouched.

Related

SQL Server partitioned views and locking

We are using partitioned views (SQL Server 2008 Standard; partitioned tables are not an option), and they work fine as far as partition elimination goes: if we run a query against a partitioned view with a clause on the column we chose as the discriminator, we can see from the actual execution plan that only the table related to the specified discriminator value is hit. But we run into locking problems when there are concurrent INSERT or UPDATE statements, even if those statements are NOT hitting the table selected by the discriminator.
Analyzing the locks, I can see that even though the execution plan shows that only the right table is read, IS locks are still put on ALL the tables in the partitioned view, and of course if someone else has already put an X lock on one of those, the whole query running against the partitioned view gets blocked on it, even though the table holding the X lock is not read at all.
Is this a limitation of partitioned views in general, or is there a way to avoid it while sticking with partitioned views? We created the partitioned view and the related objects following the SQL Server Books Online recommendations.
Thanks
Wasp
This is by design: the intent (IS) locks are taken on every member table of the view. The practical workaround is to avoid taking exclusive (X) locks on entire member tables in the concurrent statements.
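If you want to confirm which objects hold those intent locks while a query against the view runs, a hedged example is to query sys.dm_tran_locks from a second session (the session id is hypothetical):

-- Run in a second session while the query against the partitioned view executes.
-- Shows every object-level lock held or requested by that session (e.g. the IS
-- locks taken on all member tables of the view).
SELECT OBJECT_NAME(l.resource_associated_entity_id) AS object_name,
       l.request_mode,          -- IS, IX, S, X, ...
       l.request_status         -- GRANT or WAIT (WAIT = blocked)
FROM sys.dm_tran_locks AS l
WHERE l.resource_type = 'OBJECT'
  AND l.request_session_id = 53;  -- hypothetical session id of the query being observed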

Do partitions allow multiple bulk loads?

I have a database that contains data for many "clients". Currently, we insert tens of thousands of rows into multiple tables every so often using .Net SqlBulkCopy which causes the entire tables to be locked and inaccessible for the duration of the transaction.
As most of our business processes rely upon accessing data for only one client at a time, we would like to be able to load data for one client, while updating data for another client.
To make things more fun, all PKs, FKs and clustered indexes are on GUID columns (I am looking at changing this).
I'm looking at adding the ClientID into all tables, then partitioning on this. Would this give me the functionality I require?
I haven't used the built-in partitioning functionality of SQL Server, but it's something I am particularly interested in. My understanding is that this would solve your problem.
From this article:
This allows you to operate on a partition even with performance-critical operations, such as reindexing, without affecting the others.
And a great whitepaper on partitioning by Kimberly L. Tripp is here. Well worth a read - I won't even try to paraphrase it - it covers it all in a lot of detail.
Hope this helps.
Can you partition on Client ID? Yes, but partitioning is limited to 1,000 partitions, so that is 1,000 clients before you hit a hard limit. The only way to get around that is to start using partitioned views across multiple partitioned tables - it gets a bit messy.
Will it help your locking situation? In SQL 2005 the lock escalation path is row -> page -> table, but 2008 introduced a new level, allowing row -> page -> partition -> table. So it might get around it, depending on your SQL version (unspecified).
If 2008 is not an option, there are trace flags (TF 1211 / 1224) that turn off lock escalation, but I would not jump in and use them without some serious testing.
Partitioning also remains an Enterprise-and-up feature, which puts some people off.
The ideal way to perform a data load with partitioning while avoiding locks is to bring the data into a staging table and then switch it in as a new partition - but this requires that the data is somewhat sequence-based (such as a datetime), so that new data can be brought into an entirely new partition while older data is eventually removed (rolling the partition window).
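If you are on 2008, a minimal sketch of the relevant pieces looks something like this (partitioning on ClientID and opting into partition-level lock escalation; all names and boundary values are hypothetical):

-- Hypothetical partition function/scheme keyed on ClientID.
CREATE PARTITION FUNCTION pfClient (int)
    AS RANGE LEFT FOR VALUES (100, 200, 300);   -- one boundary per client range

CREATE PARTITION SCHEME psClient
    AS PARTITION pfClient ALL TO ([PRIMARY]);

-- Rebuild (or create) the table's clustered index on the scheme so the table is partitioned.
CREATE CLUSTERED INDEX CIX_Orders ON dbo.Orders (ClientID, OrderID)
    ON psClient (ClientID);

-- SQL Server 2008 only: allow lock escalation to stop at the partition
-- instead of escalating straight to the whole table.
ALTER TABLE dbo.Orders SET (LOCK_ESCALATION = AUTO);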

SQL Server to PostgreSQL - Migration and design concerns

Currently migrating from SQL Server to PostgreSQL and attempting to improve a couple of key areas on the way:
I have an Articles table:
CREATE TABLE [dbo].[Articles](
[server_ref] [int] NOT NULL,
[article_ref] [int] NOT NULL,
[article_title] [varchar](400) NOT NULL,
[category_ref] [int] NOT NULL,
[size] [bigint] NOT NULL
)
Data (comma delimited text files) is dumped on the import server by ~500 (out of ~1000) servers on a daily basis.
Importing:
Indexes are disabled on the Articles table.
For each dumped text file
Data is BULK copied to a temporary table.
Temporary table is updated.
Old data for the server is dropped from the Articles table.
Temporary table data is copied to Articles table.
Temporary table dropped.
Once this process is complete for all servers the indexes are built and the new database is copied to a web server.
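For reference, a rough T-SQL sketch of one iteration of the loop described above (the staging table name, file path and the exact update step are placeholders):

-- One iteration of the per-file import, roughly as described above.
CREATE TABLE dbo.Articles_Staging (
    server_ref    int          NOT NULL,
    article_ref   int          NOT NULL,
    article_title varchar(400) NOT NULL,
    category_ref  int          NOT NULL,
    size          bigint       NOT NULL
);

BULK INSERT dbo.Articles_Staging
FROM 'D:\dumps\server_33.txt'
WITH (FIELDTERMINATOR = ',', TABLOCK);

-- "Temporary table is updated" - whatever cleanup/enrichment the feed needs.
-- UPDATE dbo.Articles_Staging SET ... ;

-- Replace the server's old data with the new dump.
DELETE FROM dbo.Articles WHERE server_ref = 33;

INSERT INTO dbo.Articles (server_ref, article_ref, article_title, category_ref, size)
SELECT server_ref, article_ref, article_title, category_ref, size
FROM dbo.Articles_Staging;

DROP TABLE dbo.Articles_Staging;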
I am reasonably happy with this process but there is always room for improvement as I strive for a real-time (haha!) system. Is what I am doing correct? The Articles table contains ~500 million records and is expected to grow. Searching across this table is okay but could be better. i.e. SELECT * FROM Articles WHERE server_ref=33 AND article_title LIKE '%criteria%' has been satisfactory but I want to improve the speed of searching. Obviously the "LIKE" is my problem here. Suggestions? SELECT * FROM Articles WHERE article_title LIKE '%criteria%' is horrendous.
Partitioning is a SQL Server Enterprise feature, i.e. $$$, so getting it without the extra cost is one of the many exciting prospects of PostgreSQL. What performance hit will be incurred by the import process (drop data, insert data) and by building the indexes? Will the database grow by a huge amount?
The database currently stands at 200 GB and will grow. Copying this across the network is not ideal but it works. I am putting thought into changing the hardware structure of the system. The thought process of having an import server and a web server is so that the import server can do the dirty work (WITHOUT indexes) while the web server (WITH indexes) can present reports. Maybe reducing the system down to one server would work to skip the copying across the network stage. This one server would have two versions of the database: one with the indexes for delivering reports and the other without for importing new data. The databases would swap daily. Thoughts?
This is a fantastic system, and believe it or not there is some method to my madness by giving it a big shake up.
UPDATE: I am not looking for help with relational databases, but hoping to bounce ideas around with data warehouse experts.
I am not a data warehousing expert, but a couple of pointers.
Seems like your data can be easily partitioned. See the PostgreSQL documentation on partitioning for how to split data into different physical tables. This lets you manage the data at your natural per-server granularity.
You can use postgresql transactional DDL to avoid some copying. The process will then look something like this for each input file:
create a new table to store the data.
use COPY to bulk load data into the table.
create any necessary indexes and do any processing that is required.
In a transaction drop the old partition, rename the new table and add it as a partition.
If you do it like this, you can swap out the partitions on the go if you want to. Only the last step requires locking the live table, and it's a quick DDL metadata update.
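A minimal sketch of one such iteration, assuming old-style inheritance partitioning (pre-declarative partitioning) with one child table per server; the table names and file path are hypothetical:

-- Load the new data for one server (33 here) into a fresh table.
CREATE TABLE articles_srv33_new (LIKE articles INCLUDING DEFAULTS);
COPY articles_srv33_new FROM '/data/dumps/server_33.csv' WITH CSV;

-- Per-file processing, indexes, and the partition check constraint,
-- all done before touching the live parent table.
CREATE INDEX articles_srv33_new_title_idx ON articles_srv33_new (article_title);
ALTER TABLE articles_srv33_new
    ADD CONSTRAINT articles_srv33_new_server_chk CHECK (server_ref = 33);

-- The swap itself: a quick, transactional metadata change on the parent.
BEGIN;
DROP TABLE IF EXISTS articles_srv33;
ALTER TABLE articles_srv33_new RENAME TO articles_srv33;
ALTER TABLE articles_srv33 INHERIT articles;
COMMIT;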
Avoid deleting and reloading data in an indexed table - that will lead to considerable table and index bloat due to the MVCC mechanism PostgreSQL uses. If you just swap out the underlying table you get a nice compact table and indexes. If your queries have any data locality on top of the partitioning, then either order your input data on that, or if that's not possible, use PostgreSQL's CLUSTER functionality to reorder the data physically.
To speed up the text searches use a GIN full text index if the constraints are acceptable (can only search at word boundaries). Or a trigram index (supplied by the pg_trgm extension module) if you need to search for arbitrary substrings.
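For the article_title LIKE '%criteria%' case specifically, the trigram index is the one that helps. A hedged sketch, assuming the pg_trgm extension is available and a PostgreSQL version recent enough to accelerate LIKE with trigram indexes:

-- Trigram index: can serve arbitrary-substring searches such as LIKE '%criteria%'.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX articles_title_trgm_idx
    ON articles USING gin (article_title gin_trgm_ops);

-- Full-text (GIN) alternative: fast, but matches on word boundaries only.
CREATE INDEX articles_title_fts_idx
    ON articles USING gin (to_tsvector('english', article_title));

-- The full-text index serves queries of this form:
-- SELECT * FROM articles
-- WHERE to_tsvector('english', article_title) @@ plainto_tsquery('english', 'criteria');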

How reliable is SQL Server replication?

We have a database on SQL Server 2000 which should be truncated from time to time. It looks like the easiest solution would be to create a duplicate database and copy the primary database there. Then the primary database may be safely truncated by specially tailored stored procedures.
One way replication would guarantee that the backup database contains all updates from the primary one.
We plan to use backup database for reporting and primary for operative data.
The primary database will be truncated at night, once every 2 days.
The database is several gigabytes. Only a few tables are quite large (1-2 million rows).
What are possible pitfalls? How reliable would such a solution be? Will it slow down the primary database?
Update: The variant of using DTS to do the copying sounds good but has its own disadvantages. It requires a fairly robust script which would run for about an hour to copy the updated rows. There is also the issue of integrity constraints in the primary database, which would make truncating it a non-trivial task. Because of this, replication could straighten things up considerably.
Using a UNION view is also possible, but not a particularly good variant, because the system works mostly in unattended mode without dedicated support personnel. That is a related issue, though not a technical one.
While replication is usually robust, there are times where it can break and require a refresh. Managing and maintaining replication can become complicated. Once the primary database is truncated, you'll have to make sure that action is not replicated. You may also need an improved system of row identification as after you've truncated the primary database tables a couple of times, you'll still have a complete history in your secondary database.
There is a performance hit on the publisher (primary) as extra threads have to run to read the transaction log. Unless you're under heavy load at the moment, you likely won't notice this effect. Transaction log management can become more important also.
Instead, I'd look at a different solution for your problem. For example, before truncating, you can take a backup of the database, and restore it as a new database name. You then have a copy of the database as it was before the truncation, and you can query both at once using three-part names.
You've mentioned that the purpose of the secondary database is to run reports off it. In this case you can create a view like SELECT * FROM Primary.dbo.Table UNION ALL SELECT * FROM SecondaryDBJune2008.dbo.Table UNION ALL SELECT * FROM SecondaryDBOctober2008.dbo.Table. You would then need to keep this view up to date whenever you perform a truncation.
The other alternative would be to take a snapshot of the current data before truncation and insert it into a single reporting database. Then you'd just have the Primary and the Historical databases - no need to modify views once they're created.
How much data are we talking about in GB?
As you're planning to perform the truncation once every two days, I'd recommend the second alternative, snapshotting the data before truncation into a single Historical database. This can be easily done with a SQL Agent job, without having to worry about replication keeping the two sets of data in synch.
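A hedged sketch of what one step of that SQL Agent job might run (database, table and column names are hypothetical, and the real job would repeat this for each of the handful of large tables):

-- Snapshot the rows that are about to be removed into the single Historical database,
-- stamping them with the snapshot time, then clear the primary table.
-- (Use DELETE instead of TRUNCATE if foreign keys get in the way.)
INSERT INTO Historical.dbo.Orders (SnapshotDate, OrderID, CustomerID, Amount)
SELECT GETDATE(), OrderID, CustomerID, Amount
FROM [Primary].dbo.Orders;

TRUNCATE TABLE [Primary].dbo.Orders;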
I would not use replication for this. We have a fairly complex replication setup running with 80+ branches replicating a few tables to one central database. When connectivity goes down for a few days, the data management issues are hair raising.
If you want to archive older data, rather use DTS. You can then build the copying and truncation/deletion of data into the same DTS package, setting it so that the deletion only happens if the copy was successful.

How do I know which SQL Server 2005 index recommendations to implement, if any?

We're in the process of upgrading one of our SQL Server instances from 2000 to 2005. I installed the performance dashboard (http://www.microsoft.com/downloads/details.aspx?FamilyId=1d3a4a0d-7e0c-4730-8204-e419218c1efc&displaylang=en) for access to some high level reporting. One of the reports shows missing (recommended) indexes. I think it's based on some system view that is maintained by the query optimizer.
My question is: what is the best way to determine when to take an index recommendation? I know that it doesn't make sense to apply all of the optimizer's suggestions. I see a lot of advice that basically says to try the index and to keep it if performance improves, and to drop it if performance degrades or stays the same. I'm wondering if there is a better way to make the decision, and what best practices exist on this subject.
First thing to be aware of:
When you upgrade from 2000 to 2005 (by using detach and attach) make sure that you:
Set the compatibility level to 90
Rebuild the indexes
Run update statistics with full scan
If you don't do this you will get suboptimal plans.
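A hedged sketch of those three steps (the database name is hypothetical; sp_MSforeachtable is an undocumented but widely used helper, so substitute a cursor over sys.tables if you prefer):

-- 1. Set the compatibility level to SQL Server 2005 (90).
EXEC sp_dbcmptlevel 'MyWarehouse', 90;

-- 2. Rebuild every index so 2005 has fresh structures to work with.
EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD';

-- 3. Update statistics with a full scan.
EXEC sp_MSforeachtable 'UPDATE STATISTICS ? WITH FULLSCAN';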
If the table is mostly written to, you want as few indexes as possible.
If the table is used for a lot of read queries, you have to make sure that the WHERE clauses are covered by indexes.
The advice you got is right. Try them all, one by one.
There is NO substitute for testing when it comes to performance. Unless you prove it, you haven't done anything.
You're best off researching the most common types of queries that run against your database and creating indexes based on that research.
For example, take a table which stores website hits and is written to very, very often but hardly ever read from. In that case, don't index the table at all.
If, however, you have a list of users which is accessed more often than it is written to, then I would first create a clustered index on the column that is accessed the most, usually the primary key. I would then create indexes on commonly searched columns, and on those used in ORDER BY clauses.
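As a concrete sketch of that pattern (the Users table and its columns are hypothetical):

-- Clustered index on the most-accessed column, typically the primary key.
CREATE TABLE dbo.Users (
    UserID    int           NOT NULL,
    UserName  nvarchar(50)  NOT NULL,
    Email     nvarchar(100) NOT NULL,
    CreatedOn datetime      NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (UserID)
);

-- Nonclustered indexes on commonly searched columns and ORDER BY columns.
CREATE NONCLUSTERED INDEX IX_Users_UserName ON dbo.Users (UserName);
CREATE NONCLUSTERED INDEX IX_Users_CreatedOn ON dbo.Users (CreatedOn);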
