Parallel Bulk Inserting with SqlBulkCopy and Azure

I have an Azure app in the cloud with a SQL Azure database. I have a worker role which needs to do parsing and processing on a file (up to ~30 million rows), so I can't directly use BCP or SSIS.
I'm currently using SqlBulkCopy, but it seems too slow: I've seen load times of 4-5 minutes for 400k rows.
I want to run my bulk inserts in parallel; however, the articles on importing data in parallel and controlling lock behaviour say that SqlBulkCopy requires the table to have no clustered index and a table lock (BU lock) to be specified. But SQL Azure tables must have a clustered index...
Is it even possible to use SqlBulkCopy in parallel on the same table in SQL Azure? If not is there another API (that I can use in code) to do this?

I don't see how you can run any faster than using SqlBulkCopy. On our project we can import 250K rows in about 3 mins, so your rate seems about right.
I don't think that doing it in parallel would help, even if it was technically possible. We only run 1 import at a time otherwise SQL Azure starts timing out our requests.
In fact, sometimes running a large GROUP BY query at the same time as the import isn't even possible. SQL Azure does a lot of work to ensure quality of service; this includes timing out requests that take too long, use too many resources, etc.
So doing several large bulk inserts at the same time will probably cause one to time out.

It is possible to run SqlBulkCopy in parallel against SQL Azure, even when loading the same table. You need to prepare your records in batches yourself before sending them to the SqlBulkCopy API. This absolutely helps with performance, and it allows you to control retry operations for a smaller batch of records when you get throttled for reasons outside your control.
Take a look at my blog post comparing load times of various approaches. There is sample code as well. In separate tests I was able to cut the load time of a table in half.
This is the technique I am using for a couple of tools (Enzo Backup, Enzo Data Copy). It's not a simple thing to do, but when done properly you can optimize load times significantly.
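Here is a minimal sketch of that batching approach, assuming the parsed rows are already in a DataTable. The destination table name, batch size, degree of parallelism and retry policy are all placeholders to tune, not a definitive implementation.

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    static class ParallelLoader
    {
        // Split the source rows into fixed-size batches and push each batch
        // through its own SqlBulkCopy instance, several batches at a time.
        public static void Load(DataTable source, string connectionString,
                                int batchSize = 10000, int degreeOfParallelism = 4)
        {
            var batches = source.AsEnumerable()
                .Select((row, i) => new { row, i })
                .GroupBy(x => x.i / batchSize)
                .Select(g => g.Select(x => x.row).CopyToDataTable())
                .ToList();

            Parallel.ForEach(batches,
                new ParallelOptions { MaxDegreeOfParallelism = degreeOfParallelism },
                batch => LoadBatchWithRetry(batch, connectionString));
        }

        // Retrying a small batch after a throttling error is much cheaper
        // than retrying the whole load.
        static void LoadBatchWithRetry(DataTable batch, string connectionString,
                                       int maxAttempts = 3)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    using (var bulk = new SqlBulkCopy(connectionString))
                    {
                        bulk.DestinationTableName = "dbo.MyTable"; // placeholder name
                        bulk.BulkCopyTimeout = 0;                  // no client-side timeout
                        bulk.WriteToServer(batch);
                    }
                    return;
                }
                catch (SqlException)
                {
                    if (attempt >= maxAttempts) throw;
                    Thread.Sleep(TimeSpan.FromSeconds(5 * attempt)); // simple back-off
                }
            }
        }
    }

How far you can push the degree of parallelism depends on when SQL Azure starts throttling you, so treat it as something to measure rather than a constant.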

Related

Database locked (or slow) with Linq To Entities and Stored Procedures

I'm using Linq To Entities (L2E) ONLY to map all my stored procedures in my database for easy translation into objects. My data is not really sensitive (so I'm considering isolation level "READ UNCOMMITTED" everywhere). I have several tables with millions of rows. I have the website and a bunch of scripts using the same data model created with Entity Framework. I have indexed all tables (max 3 indexes each) so that every filter I use is covered by an index. My scripts mainly consist of:
1) Get data from the DB (~5 seconds)
2) Make an API call (1-3 seconds)
3) Add the result to the database
I have READ_COMMITTED_SNAPSHOT and ALLOW_SNAPSHOT_ISOLATION set to ON.
Using this strategy, most of my queries are fixed and very fast (usually). I still have some queries used by my scripts that can run up to 20 seconds, but they are not called that often.
The problem is that suddenly the whole database gets slow and all my queries return slowly (it can be over 10 seconds). Using SQL Profiler I have tried to find the issue.
As mentioned, I'm considering NOLOCK hints via "READ UNCOMMITTED"... Right now I'm going through each possible database call and adding indexes and/or caching tables to make the calls faster.
I have also considered removing L2E and accessing the data the "old way" to be sure that's not the issue. Is my data context locking my tables? It sure looks that way. I have experimented with having the context live across the API call to minimize context creation, but right now I create a new context for each call, since I thought it was locking the database.
The problem is that I cannot guarantee that every single call stays fast for all eternity; otherwise the whole system slows down.
When I restart SQL Server and rebuild indexes it's really fast for a short period of time before everything gets slow again. Any pointers would be appreciated.
Have you considered checking whether there are any major waits on the server?
Review the following page on sys.dm_os_wait_stats (Transact-SQL) and see if you can get any insight into why the server is slow....
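For example, a quick way to pull the top waits from code. This is only a sketch: the connection string is a placeholder and the DMV requires VIEW SERVER STATE permission.

    using System;
    using System.Data.SqlClient;

    class WaitStats
    {
        // Print the wait types the server has spent the most time on.
        static void Main()
        {
            const string sql = @"
                SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
                FROM sys.dm_os_wait_stats
                ORDER BY wait_time_ms DESC;";

            using (var conn = new SqlConnection("Server=.;Database=master;Integrated Security=true"))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        Console.WriteLine("{0,-40} tasks={1,10} wait_ms={2,12} signal_ms={3,12}",
                            reader.GetString(0), reader.GetInt64(1),
                            reader.GetInt64(2), reader.GetInt64(3));
                    }
                }
            }
        }
    }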

Horizontally partitioning data into an "archive" in SQL Server taking months to execute?

There is a project in flight at my organization to move customer data and all the associated records (billing transactions, etc) from one database to another, if the customer has not had account activity within a certain timeframe.
The total number of rows in all the tables is in the millions. Perhaps 100 million rows, with all the various tables combined. The schema is more-or-less normalized. The project's designers have decided on SSIS to execute this and initial analysis is showing 5 months of execution time.
Basically, the process:
Fills an "archive" database that has the same schema as the database of origin
Deletes the original rows from the source database
I can provide more detail if necessary. What I'm wondering is, is SSIS the correct approach? Is there some sort of canonical way to move very large quantities of data around? Are there common performance pitfalls to avoid?
I just can't believe that this is going to take months to run and I'd like to know if there's something else that we should be looking into.
SSIS is just a tool. You can write a 100M-row transfer in SSIS to take 24 hours, or you can write it to take 5 months. The problem is what you write (i.e. the workflow, in SSIS's case), not SSIS itself.
There isn't anything specific to SSIS that would dictate 'the transfer cannot be done faster than 5 months'.
The guiding principles for such a task (logically partition the data, process each logical partition in parallel, eliminate access and update contention between processing, batch-commit changes, don't transfer more data than is necessary over the wire, use set-based processing as much as possible, be able to suspend and resume, etc.) can be implemented in SSIS just as well as in any other technology (if not better).
For the record, the ETL world speed record stands at about 2 TB per hour, using SSIS. And just as a matter of fact, I just finished a transfer of 130M rows, ~200 GB of data, that took some 24 hours (I'm lazy and not shooting for the ETL record).
I would understand 5 months for development, testing and deployment, but not 5 months for actual processing. That is something like 7 rows a second, which is really, really lame.
SSIS is probably not the right choice if you are simply deleting records.
This might be of interest: Performing fast SQL Server delete operations
UPDATE: as Remus correctly points out, SSIS can perform well or badly depending on how the flows are written, and there have been some huge benchmarks (on high-end systems). But for just deletes there are simpler ways, such as a SQL Agent job running a T-SQL delete in batches.
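A minimal sketch of such a batched delete, driven here from code rather than an Agent job; the table names, the inactive-customer filter and the batch size are placeholders.

    using System;
    using System.Data.SqlClient;

    class BatchedDelete
    {
        // Delete in small batches so each transaction stays short and the
        // log space can be reused between batches.
        static void Main()
        {
            const string sql = @"
                DELETE TOP (10000) FROM dbo.BillingTransaction
                WHERE CustomerId IN (SELECT CustomerId FROM dbo.InactiveCustomer);";

            using (var conn = new SqlConnection("Server=.;Database=Billing;Integrated Security=true"))
            using (var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 })
            {
                conn.Open();
                int deleted;
                do
                {
                    deleted = cmd.ExecuteNonQuery();   // rows removed in this batch
                    Console.WriteLine("Deleted {0} rows", deleted);
                } while (deleted > 0);                 // stop when nothing is left
            }
        }
    }

The same T-SQL wrapped in a WHILE loop works just as well as a SQL Agent job step.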

SQL Server vs. Access insert performance, in particular when using GUID

I'm interested to know how I could improve the insert performance of SQL Server when using sequential GUIDs, with Access 2007 as a front end to SQL Server 2008 (please note that's the only context I'm interested in).
I have made some tests (and gotten some fairly surprising results), in particular from SQL Server when using sequential GUIDs: the insert performance degrades very quickly, and it doesn't seem right to me that it should degrade so fast.
Basically the test is as follows:
From the Access front end, using VBA only, insert 100,000 records in batches of 1,000, sequentially.
I tried it both with an Identity column and with a sequential GUID as the PK.
I tried it in SQL Server 2008 Standard (no special tweaking, just a default install) as well as with an Access 2007 database as the back end. All tables are linked to the front end.
Some of the results (more, with raw data available on my blog entry about the test):
It's clear that, as the database grows, the insert performance is reduced but SQL Server isn't performing very well at all here.
http://blog.nkadesign.com/wp-content/uploads/2009/04/chart02.png
Expanded view of the results for SQL Server:
http://blog.nkadesign.com/wp-content/uploads/2009/04/chart03.png
Edit 13APR2009
I've found an issue with my server configuration and I updated the tests on my blog.
Thanks to all for your replies, they helped me a lot.
There are two things at play here. First, it's important to point out that SQL Server doesn't necessarily work very well for a specific use case out of the box. It is a professional product designed to be tuned by a person who knows what they're doing.
By comparison, Access is designed to work very well for most use cases without any configuration. The downside of this trade-off is covered in the second point:
SQL Server is designed for scalability. Notice how Access severely degrades with only 100,000 records; it would probably drop very steeply below SQL Server's line before a million. By comparison, SQL Server holds almost perfectly steady, with the variation stabilizing after about 45,000 records, and it would continue to hold at many millions.
Edit: I think there may also be something else at play here that we're not seeing. I thought your SQL numbers looked bad, so I ran a test of my own. On my desktop running Windows Vista with a 3.6 GHz CPU and 2 GB of RAM, inserts with sequential GUIDs on SQL Server performed at:
Average of 1382 inserts per second at 0 records
Average of 1426 inserts per second at 500k records
Averaging 1609.6 inserts per second from 0 to 500k with an average floor of 992 inserts/sec and an average ceiling of 1989 inserts/sec.
So accounting for the normal variance incurred by running this on an in-use desktop, I'd say SQL Server inserts basically scale linearly from 0 records to half a million. On a dedicated, tuned server I'd expect even more consistency (not to mention far better performance):
Excel chart, inserts per second http://img24.imageshack.us/img24/9485/insertspersecond.jpg
My question is whether your test setup represents the reality of your application or not. In short, are you testing the right thing?
Is your app going to be appending large numbers of records one at a time?
Or is it going to be appending batches of records based on a SQL SELECT?
If the latter, you might look at trying to do it all server-side, particularly if the source table(s) in the SELECT are on the server. It's important to realize that with ODBC, a batch append is going to be sent to the SQL Server as a single insert for every single row (very similar to the recordset-based approach in your test code). If you move the same process entirely server-side, it can be done as a batch operation.
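To illustrate the difference, here is a rough sketch of the server-side version as a single set-based statement (written in C# for consistency with the other examples in this page; from Access itself the natural equivalent would be a pass-through query). The table and column names are made up.

    using System.Data.SqlClient;

    class ServerSideAppend
    {
        // One INSERT ... SELECT executed on the server is one statement and one
        // round trip, instead of one insert per row sent over ODBC.
        static void Append(string connectionString)
        {
            const string sql = @"
                INSERT INTO dbo.OrderArchive (OrderId, CustomerId, OrderDate, Total)
                SELECT OrderId, CustomerId, OrderDate, Total
                FROM dbo.[Order]
                WHERE OrderDate < DATEADD(year, -1, GETDATE());";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }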
Also, you should test again using ADO instead of DAO. It may optimize the operation completely differently.
Last of all, someone brought to my attention just this past week this fascinating article by Andy Baron:
Optimizing Microsoft Office Access Applications Linked to SQL Server
I'm still absorbing the contents of that very useful article, and it discusses several issues in regard to non-GUID-specific topics that may help you optimize your process for maximum efficiency.
You realize at least part of the decreasing performance is the log filling up, and that a GUID is, what, 40 bytes longer than an int?
But I'm not quibbling; it's good to see someone taking actual metrics rather than just handwaving. Modded up.
Where are you getting the data from?
Does it change the numbers if you use the Access Export menu options rather than record-at-a-time-in-a-loop?
VBA is really sensitive to the connection parameters too, and there are lots of options that aren't necessarily intuitive.
If an identity column is acceptable, why are you even considering a sequential GUID (which is something of a tacked-on facility in MSSQL, last I checked)?
EDIT:
Looking at your code and briefly reviewing the Recordset docs on MSDN, I see you may be able to use more efficient parameters. E.g. your dbSeeChanges and dbOpenDynaset, which are appropriate if you are trying to allow for other users messing with the same rows (or needing to get back the inserted IDENTITY value or probably GUID), but I don't think you need those. In essence, after every INSERT or UPDATE, you're reading the record back from the database into VBA. I'd read through those connection config settings carefully, and I bet you'll come up with something a lot more satisfactory.
The last time I saw something like that (really slow insertion with GUID PK) was because of the log-file filling up. Insertion performance was dropping like a stone, pretty fast (no hard measurement, just looking at live traces, but it sure looked like it was kinda logarithmic). This was pre-loading of historical data.
Moved over to identity PK, took care of actually cleaning up the log file, and everything went much better afterwards (a couple of hours where the first version took several hours and was not finished).
Also, just a thought: are there any transactions involved? Maybe SQL Server transactions create a big performance hit that Access does not have (given that Access is not really geared towards concurrent access).

What is a good SQL Server 2008 solution for handling massive writes so that they don't slow down reads for users of the database?

We have large SQL Server 2008 databases. Very often we'll have to run massive data imports into the databases that take a couple hours. During that time everyone else's read and small write speeds slow down a ton.
I'm looking for a solution where maybe we setup one database server that is used for bulk writing and then two other database servers that are setup to be read and maybe have small writes made to them. The goal is to maintain fast small reads and writes while the bulk changes are running.
Does anyone have an idea of a good way to accomplish this using SQL Server 2008?
Paul, there are two parts to your question.
First, why are writes slow?
When you say you have large databases, you may want to clarify that with some numbers. The Microsoft teams have demonstrated multi-terabyte loads in less than an hour, but of course they're using high-end gear and specialized data warehousing techniques. I've been involved with data warehousing teams that regularly loaded so much data overnight that the transaction log drives had to be over a terabyte just to handle the quick bursts, but not a terabyte per hour.
To find out why writes are slow, you'll want to compare your load methods to data warehousing techniques. For example, have you tried using staging tables? Table partitioning? Data and log files on different arrays? If you're not sure where to start, check out my Perfmon tutorial to measure your system looking for bottlenecks:
http://www.brentozar.com/archive/2006/12/dba-101-using-perfmon-for-sql-performance-tuning/
Second, how do you scale out?
You asked how to set up multiple database servers so that one handles the bulk load while others handle reads and some writes. I would heavily, heavily caution against taking the multiple-servers-for-writes approach because it gets a lot more complicated quickly, but using multiple servers for reads is not uncommon.
The easiest way to do it is with log shipping: every X minutes, the primary server takes a transaction log backup, and that log backup is then applied to the read-only reporting server. There are some catches with this: the data is a little behind, and the restore process has to kick all connections out of the database to apply the restore. This can be a perfectly acceptable solution for things like data warehouses, where the end users want to keep running their own reports while the new day's data loads. You can simply not do transaction log restores while the data warehouse is loading, and the users can maintain their connections the whole time.
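A rough sketch of one log-shipping cycle driven from code; the server names, share path, undo file and database name are all placeholders, and as noted above, connections to the reporting copy have to be dropped before the restore can run.

    using System.Data.SqlClient;

    class LogShipStep
    {
        // Back up the log on the primary, then restore it on the reporting
        // server WITH STANDBY so the database stays readable between restores.
        static void ShipOnce()
        {
            ExecuteOn("Server=PRIMARY;Database=master;Integrated Security=true",
                @"BACKUP LOG Sales TO DISK = N'\\share\logship\Sales.trn' WITH INIT;");

            // Existing connections must be kicked out before this restore is applied.
            ExecuteOn("Server=REPORTING;Database=master;Integrated Security=true",
                @"RESTORE LOG Sales FROM DISK = N'\\share\logship\Sales.trn'
                  WITH STANDBY = N'D:\logship\Sales_undo.dat';");
        }

        static void ExecuteOn(string connectionString, string sql)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 })
            {
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }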
To help find out what solution is right for you, consider adding the following to your question:
The size of your database (GB/TB in size, # of millions of rows in the largest table that's having the writes)
The size of your server & storage (a box with 10 drives has different solutions available than a box hooked up to a SAN)
The method of loading data (is it single-record inserts, are you using bulk load, are you using table partitioning, etc)
Why not use memcached to eliminate the reads? I've got the same situation where I work, and we've been using memcached on Windows with great results. I was surprised how trivial it was to get my code running with it, too. There are open-source wrapper libraries for virtually every mainstream language, and using it could result in 99% of your reads never touching the database (because you set the memcached values on the database's write operations).
Memcached is really just a giant hash table store (and can even be clustered, or run on any machine you like, since it uses sockets to read and store the hashes).
When reading the memcached value, simply check whether it's null (and return it if it's not); otherwise do your usual database read, cache the result, and return it. It can store just about anything, so long as each memcached key/value pair is less than 1 MB.
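A sketch of that read-through/write-through pattern. ICacheClient here is a hypothetical stand-in for whichever memcached wrapper library you pick, not a real client API.

    using System;

    // Hypothetical interface standing in for a real memcached client library.
    public interface ICacheClient
    {
        object Get(string key);
        void Set(string key, object value, TimeSpan expiry);
    }

    public class ReadThroughCache
    {
        private readonly ICacheClient _cache;

        public ReadThroughCache(ICacheClient cache) { _cache = cache; }

        // Check memcached first; only hit the database on a miss, then store
        // the result so the next read skips the database entirely.
        public T GetOrLoad<T>(string key, Func<T> loadFromDatabase) where T : class
        {
            var cached = _cache.Get(key) as T;
            if (cached != null)
                return cached;                    // cache hit: no database read

            var value = loadFromDatabase();       // cache miss: usual database read
            if (value != null)
                _cache.Set(key, value, TimeSpan.FromMinutes(10)); // keep items under 1 MB
            return value;
        }

        // On writes, update the database and refresh the cache; this is why
        // most reads end up never touching the database.
        public void Save<T>(string key, T value, Action<T> writeToDatabase) where T : class
        {
            writeToDatabase(value);
            _cache.Set(key, value, TimeSpan.FromMinutes(10));
        }
    }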
The easiest way would be to slow down the rate at which writes occur and feed them in one record at a time. The import will be slower, but it would make things faster for the other users. If the batches take "a couple hours", you can perhaps spread them out more.
This is just an idea: create a view over your "active" tables, then BCP the data into a "staging" table. When it is done, update the view to include the "staging" tables.
I'm not sure what you mean when you say everyone else's read and write slows down. Does it slow down when they read & write to the same database where the data is currently being imported or from different databases on the same server?
If it is the same database, you could always use the "WITH (NOLOCK)" hint to do the reads even when the table is locked for writes/inserts. However, please be aware that these can be dirty reads. I am not sure how you can make the small writes faster while the table is locked by a write already in progress; you can keep the transactions small so the writes complete faster and release their locks sooner. The other option is to have a separate database for bulk inserts and another database for reading.
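For the first option, a NOLOCK read looks like this from code. It is only a sketch: the table name is a placeholder, and the result may include uncommitted rows.

    using System.Data.SqlClient;

    class ReportWhileLoading
    {
        // A read issued WITH (NOLOCK) is not blocked by the in-flight import,
        // at the cost of possibly seeing uncommitted ("dirty") rows.
        static int CountOrders(string connectionString)
        {
            const string sql = "SELECT COUNT(*) FROM dbo.[Order] WITH (NOLOCK);";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                return (int)cmd.ExecuteScalar();
            }
        }
    }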

SpeedUp Database Updates

There is a SQL Server 2000 database we have to update during the weekend.
Its size is almost 10 GB.
The updates range from schema changes and primary key updates to some millions of records being updated, corrected or inserted.
The weekend is hardly enough for the job.
We set up a dedicated server for the job,
turned the database to SINGLE_USER,
and made all the optimizations we could think of: dropping/recreating indexes, relations, etc.
Can you propose anything to speedup the process?
SQL Server 2000 is not negotiable (not my decision). Updates are run through a custom-made program and not BULK INSERT.
EDIT:
Schema updates are done by Query Analyzer TSQL scripts (one script per version update).
Data updates are done by a C# .NET 3.5 app.
Data comes from a bunch of text files (with many problems) and is written to the local DB.
The computer is not connected to any Network.
Although dropping excess indexes may help, you need to make sure that you keep those indexes that will enable your upgrade script to easily find those rows that it needs to update.
Otherwise, make sure you have plenty of memory in the server (although SQL Server 2000 Standard is limited to 2 GB), and if need be pre-grow your MDF and LDF files to cope with any growth.
If possible, your custom program should be processing updates as sets instead of row by row.
EDIT:
Ideally, try and identify which operation is causing the poor performance. If it's the schema changes, it could be because you're making a column larger and causing a lot of page splits to occur. However, page splits can also happen when inserting and updating for the same reason - the row won't fit on the page anymore.
If your C# application is the bottleneck, could you run the changes first into a staging table (before your maintenance window), and then perform a single update onto the actual tables? A single update of 1 million rows will be more efficient than an application making 1 million update calls. Admittedly, if you need to do this this weekend, you might not have a lot of time to set this up.
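If you do get the chance, here is a sketch of that staging approach (table and column names are invented): bulk copy the corrected rows in ahead of the window, then apply them with a single set-based UPDATE.

    using System.Data;
    using System.Data.SqlClient;

    class StagedUpdate
    {
        // Load corrected rows into a staging table with SqlBulkCopy, then apply
        // them in one set-based UPDATE instead of a million individual update calls.
        static void Apply(DataTable correctedRows, string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();

                using (var bulk = new SqlBulkCopy(conn))
                {
                    bulk.DestinationTableName = "dbo.CustomerStaging";
                    bulk.WriteToServer(correctedRows);
                }

                const string sql = @"
                    UPDATE c
                    SET    c.Name    = s.Name,
                           c.Address = s.Address
                    FROM   dbo.Customer AS c
                    JOIN   dbo.CustomerStaging AS s ON s.CustomerId = c.CustomerId;";

                using (var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 })
                {
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }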
What exactly does this "custom made program" look like, i.e. how is it talking to the data? Minimising the amount of network IO (between the db server and the app) would be a good start... typically this might mean doing a lot of work in TSQL, but even just running the app on the db server might help a bit...
If the app is re-writing large chunks of data, it might still be able to use bulk insert to submit the new table data. Either via command-line (bcp etc), or through code (SqlBulkCopy in .NET). This will typically be quicker than individual inserts etc.
But it really depends on this "custom made program".
