ETL Performance Problem - sql-server

I have a serious problem running an ETL process in a production environment. While the ETL is running, the OLAP server becomes extremely slow; I think this is because the ETL is updating several existing rows in the fact table and adding new ones. I tried to avoid the problem by replicating the whole database, so that the ETL writes to DB1 and the OLAP server reads from DB2 (the replicated copy). It doesn't work at all.
Can you give me some advice that points me toward the right solution to this problem?
I'm using SQL Server 2005 on a machine with 8 GB of RAM.
The Mondrian OLAP server runs inside a JBoss server, also with 8 GB of RAM.
The ETL runs every 3 hours and takes about 2 hours to complete.
I'd appreciate any help.

Have the ETL write its output into a new table, then use ALTER TABLE ... SWITCH PARTITION to switch the new data into the fact table. See Transferring Data Efficiently by Using Partition Switching. I would also revisit the ETL itself: given that a well-tuned ETL process can import about 2 TB/hour, you can probably squeeze out a bit more performance yourself, since I doubt you import 4 TB every 3 hours...
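A minimal sketch of the switch step, assuming a fact table dbo.FactSales partitioned by a date partition function pf_FactDate and a staging table dbo.FactSales_Stage with identical structure, indexes, and a CHECK constraint matching the target boundary (all names here are placeholders):

    -- Find which partition the newly loaded period belongs to:
    SELECT $PARTITION.pf_FactDate('20100101') AS TargetPartition;

    -- Metadata-only operation: the staged rows become part of the fact table
    -- almost instantly, so readers are blocked only very briefly.
    ALTER TABLE dbo.FactSales_Stage
        SWITCH TO dbo.FactSales PARTITION 3;   -- partition number returned above
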
As for the idea of using replication to alleviate load problems, all I can say is this: replication will always add to the load, as the replication process itself is quite expensive. Every insert, update, or delete on the publisher is also an insert, update, or delete on the subscriber, plus the additional overhead of detecting changes on the publisher, distributing them, and applying them on the subscriber... it all adds up; it doesn't subtract anything.

Related

How to reduce the impact on OLTP when performing an ETL process

I am new to designing ETL processes. Currently I have two databases: one is the live database the application uses for everyday transactions, and the other is the data warehouse.
I have a table in the live database that regularly has new data inserted into it. The goal is that every night the ETL process will transfer the data from the live database to the data warehouse, followed by deleting the data from the live database.
Due to my lack of knowledge, the solution I came up with is to implement something called a rolling table. Basically, on the live database I have two tables with the same structure, which I call tblLive1 and tblLive2. I also have a synonym called tblLive, and all inserts are done through the synonym, which points at one of the two tables.
When I run the ETL process, a stored procedure drops the synonym and recreates it pointing to tblLive2. This allows the ETL process to transform the data from tblLive1 without affecting the application. The assumption is that the ETL process takes an hour to run, and I don't want it to lock the table and prevent the application from inserting new data.
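A stripped-down sketch of that swap procedure looks roughly like this (tblLive, tblLive1 and tblLive2 are the names above; the procedure name is illustrative and error handling is omitted):

    CREATE PROCEDURE dbo.usp_SwapLiveTable
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRANSACTION;

        IF OBJECT_ID('dbo.tblLive', 'SN') IS NOT NULL
            DROP SYNONYM dbo.tblLive;

        -- New application inserts now land in tblLive2 while the ETL
        -- reads and empties tblLive1.
        CREATE SYNONYM dbo.tblLive FOR dbo.tblLive2;

        COMMIT;
    END
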
This solution should theoretically work, but it isn't elegant.
I am sure this is a common problem; are there any other solutions out there?
To add to Bob's answer (above): it is usual in DWH/BI applications for all necessary tables to be copied into a "staging" database, or a "staging" schema on your DWH database (depending on the number of tables, their size, etc.). These would ordinarily be on a different server from your OLTP system, at least for a DWH implementation of any size.
To answer the question about performance impact: it depends on your server spec and I/O configuration.
Is data being inserted into the OLTP system 24 hours a day, or are there downtimes or low-traffic windows?
It might be worthwhile to use data compression, as I/O is going to be your biggest enemy and compression will help considerably.
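For example, a sketch assuming SQL Server 2008 or later on an edition that supports compression, with a hypothetical staging table:

    -- Page compression trades some CPU for considerably less I/O on large scans.
    ALTER TABLE stg.FactStaging
        REBUILD WITH (DATA_COMPRESSION = PAGE);
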
Read the table into a staging area and process the staging table. You usually want to spend as little time on the production system as you have to, especially if it is in use.
You may also want to look into tables loaded by a trigger, or Change Data Capture if you are on SQL Server 2008.
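A rough sketch of the CDC route (SQL Server 2008+; the table name comes from the question, everything else is illustrative):

    -- Enable CDC on the database and on the live table.
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'tblLive1',
        @role_name     = NULL;        -- no gating role, for simplicity

    -- The nightly ETL can then read only the changed rows through the generated
    -- function cdc.fn_cdc_get_all_changes_dbo_tblLive1(@from_lsn, @to_lsn, N'all').
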

Horizontally partitioning data into an "archive" in SQL Server taking months to execute?

There is a project in flight at my organization to move customer data and all the associated records (billing transactions, etc) from one database to another, if the customer has not had account activity within a certain timeframe.
The total number of rows in all the tables is in the millions, perhaps 100 million rows with all the various tables combined. The schema is more or less normalized. The project's designers have decided on SSIS to execute this, and initial analysis is showing 5 months of execution time.
Basically, the process:
Fills an "archive" database that has the same schema as the database of origin
Deletes the original rows from the source database
I can provide more detail if necessary. What I'm wondering is, is SSIS the correct approach? Is there some sort of canonical way to move very large quantities of data around? Are there common performance pitfalls to avoid?
I just can't believe that this is going to take months to run and I'd like to know if there's something else that we should be looking into.
SSIS is just a tool. You can write a 100M-row transfer in SSIS that takes 24 hours, or you can write one that takes 5 months. The problem is what you write (i.e. the workflow, in the SSIS case), not SSIS.
There isn't anything specific to SSIS that would dictate that the transfer cannot be done faster than 5 months.
The guiding principles for such a task (logically partition the data, process each logical partition in parallel, eliminate access and update contention between processes, batch-commit changes, don't transfer more data over the wire than necessary, use set-based processing as much as possible, be able to suspend and resume, etc.) can be implemented in SSIS just as well as in any other technology, if not better.
For the record, the ETL world speed record stands at about 2 TB per hour, using SSIS. And as a matter of fact, I just finished a transfer of 130M rows, ~200 GB of data, which took some 24 hours (I'm lazy and wasn't shooting for the ETL record).
I would understand 5 months for development, testing, and deployment, but not 5 months for the actual processing. That is about 7 rows a second, which is really, really lame.
SSIS is probably not the right choice if you are simply deleting records.
This might be of interest: Performing fast SQL Server delete operations
UPDATE: as Remus correctly points out, SSIS can perform well or badly depending on how the flows are written, and there have been some huge benchmarks (on high-end systems). But for plain deletes there are simpler ways, such as a SQL Agent job running a T-SQL delete in batches.
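A minimal sketch of such a batched delete, assuming dbo.Billing is one of the affected tables, the rows have already been copied to the archive database, and LastActivityDate drives the cutoff (all names and values are placeholders):

    DECLARE @cutoff datetime;
    DECLARE @rows   int;
    SET @cutoff = '20080101';   -- inactivity threshold, illustrative value
    SET @rows   = 1;

    WHILE @rows > 0
    BEGIN
        -- Deleting in small batches keeps lock duration and log growth bounded,
        -- and the job can be stopped and resumed at any point.
        DELETE TOP (10000)
        FROM dbo.Billing
        WHERE LastActivityDate < @cutoff;

        SET @rows = @@ROWCOUNT;
    END
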

Using a duplicate SQL Server database for queries

I have a very large (100+ GB) SQL Server 2005 database that receives a large number of inserts and updates, with less frequent selects. The selects require a lot of indexes to keep them performing well, but it appears the number of indexes is affecting the efficiency of the inserts and updates.
Question: Is there a method for keeping two copies of a database, where one is used for the inserts and updates while the second is used for the selects? The second copy wouldn't need to be updated in real time, but shouldn't be more than an hour old. Is it possible to do this kind of replication while keeping different indexes on each database copy? Perhaps you have other solutions?
You're looking to set up a master/child database topology using replication. With SQL Server you'll need to set up replication between the two databases (preferably on separate hardware). Use the master DB for inserts and updates; the child will service all your select queries. You'll also want to optimize each database's configuration settings for the type of work it will be performing. If you have heavy select queries on the child database, you may also want to set up views that make those queries perform better than complex joins on the tables.
Some reference material on replication:
http://technet.microsoft.com/en-us/library/ms151198.aspx
Just google it and you'll find plenty of information on how to set it up and configure it:
http://search.aim.com/search/search?&query=sql+server+2005+replication&invocationType=tb50fftrab
Transactional replication can do this, since the subscriber can have a number of additional indexes compared with the publisher. But you have to bear in mind a simple fact: all inserts/updates/deletes are going to be replayed on the reporting copy (the subscriber), and the additional indexes will... slow down replication. It is actually possible to slow replication down to a rate at which it is unable to keep up, causing the distribution database to swell. But that only happens when you have a constant high rate of updates. If the problems only occur during spikes, the distribution database will act as a queue that absorbs the spikes and levels them off during off-peak hours.
I would not take on this endeavour without absolute, 100% proof that it is the additional indexes that are slowing down the inserts/updates/deletes, and without testing that the inserts/updates/deletes actually perform significantly better without the extra indexes. Specifically, ensure that the culprit is not the other usual suspect: lock contention.
Generally, all set-based operations (including updating indexes) are faster than non-set-based ones:
1,000 individual inserts will most probably be slower than one insert of 1,000 records.
You can batch the updates to the second database. This will, first, make the index updating faster and, second, smooth out the peaks.
You could schedule a bcp script as a task to copy the data to the other DB.
You could also try transaction log shipping to update the read-only DB.
Don't forget to adjust the fill factor when you create your two databases. It should be low(er) on the database with frequent updates, and 100 on your "data warehouse"/read-only database.
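For example, a sketch with hypothetical index and table names:

    -- On the copy that takes frequent inserts/updates: leave free space on each
    -- page to reduce page splits.
    ALTER INDEX IX_Orders_CustomerId ON dbo.Orders
        REBUILD WITH (FILLFACTOR = 80);

    -- On the read-only reporting copy: pack the pages completely.
    ALTER INDEX IX_Orders_CustomerId ON dbo.Orders
        REBUILD WITH (FILLFACTOR = 100);
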

Speed Up Database Updates

There is a SQL Server 2000 database we have to update over the weekend.
Its size is almost 10 GB.
The updates range from schema changes and primary key updates to some millions of records being updated, corrected, or inserted.
The weekend is hardly enough for the job.
We set up a dedicated server for the job,
turned the database to SINGLE_USER,
and made every optimization we could think of: drop/recreate indexes, relations, etc.
Can you propose anything to speed up the process?
SQL Server 2000 is not negotiable (not my decision). Updates are run through a custom-made program, not BULK INSERT.
EDIT:
Schema updates are done via Query Analyzer T-SQL scripts (one script per version update).
Data updates are done by a C# .NET 3.5 app.
Data comes from a bunch of text files (with many problems) and is written to the local DB.
The computer is not connected to any network.
Although dropping excess indexes may help, make sure you keep the indexes that enable your upgrade script to easily find the rows it needs to update.
Otherwise, make sure you have plenty of memory in the server (although SQL Server 2000 Standard is limited to 2 GB), and if need be pre-grow your MDF and LDF files to cope with any growth.
If possible, your custom program should be processing updates as sets instead of row by row.
EDIT:
Ideally, try and identify which operation is causing the poor performance. If it's the schema changes, it could be because you're making a column larger and causing a lot of page splits to occur. However, page splits can also happen when inserting and updating for the same reason - the row won't fit on the page anymore.
If your C# application is the bottleneck, could you load the changes into a staging table first (before your maintenance window), and then perform a single update against the actual tables? A single update of 1 million rows will be more efficient than an application making 1 million update calls. Admittedly, if you need to do this by this weekend, you might not have a lot of time to set it up.
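As a sketch of that idea, with hypothetical table and column names:

    -- One set-based statement replaces a million single-row update calls.
    UPDATE c
    SET    c.Name    = s.Name,
           c.Balance = s.Balance
    FROM   dbo.Customer         AS c
    JOIN   dbo.Customer_Staging AS s
           ON s.CustomerId = c.CustomerId
    WHERE  c.Name <> s.Name
       OR  c.Balance <> s.Balance;   -- only touch rows that actually changed
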
What exactly does this "custom made program" look like, i.e. how is it talking to the data? Minimising the amount of network I/O (between the DB server and the app) would be a good start... typically this might mean doing a lot of work in T-SQL, but even just running the app on the DB server might help a bit...
If the app is rewriting large chunks of data, it might still be able to use a bulk insert to submit the new table data, either via the command line (bcp etc.) or through code (SqlBulkCopy in .NET). This will typically be quicker than individual inserts.
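For instance, if the app first rewrites the cleaned text files into a delimited form, a single BULK INSERT can load them in one shot (a sketch; the path, table, and delimiters are assumptions):

    BULK INSERT dbo.Customer_Staging
    FROM 'C:\load\customers_clean.txt'
    WITH (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR   = '\n',
        TABLOCK,                 -- enables minimal logging when loading an empty heap
        BATCHSIZE       = 50000
    );
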
But it really depends on this "custom made program".

SQL Server Merge Replication Schedule

We're replicating a database between London and Hong Kong using SQL Server 2005 Merge replication. The replication is set to synchronise every one minute and it works just fine. There is however the option to set the synchronisation to be "Continuous". Is there any real difference between replication every one minute and continuously?
The only reason for us doing every one minute rather than continuous in the first place was that it recovered better if the line went down for a few minutes, but this experience was all from SQL Server 2000 so it might not be applicable any more...
We have been trying the continuous replication option on SQL Server 2005, and it appeared to be less efficient than a scheduled solution: since the process is continuous, you do not get all the information related to your past replications (how many replications failed, how long the process took, why the process was stopped, how many records were updated, how many database structure modifications were replicated to subscribers, and so on), making replication follow-up a lot more difficult.
We have also experienced trouble while modifying the database structure (ALTER TABLE statements) and/or making bulk updates on one of the databases while continuous replication is going on.
Keep your "every minute" synchronisation as it is and just forget about the "continuous" option.