I'm looking for some advice on how to implement a process for mass inserts, like to the tune of 400 records per second. The data is coming from an outside real time trigger and the app will get notified when a data change happens. When that data change happens, I need to consume it.
I've looked at several different implementations for doing batch processing including using datatables/sqlbulkcopy or writing to csv and consuming.
What can you recommend?
400 inserts per second doesn't feel like it should present any major challenge. It depends on what you're inserting, if there are any indexes which could have page splits due to inserts, and if you have any extra logic going on during your insert proc or script.
If you want to insert them one by one, I would recommend just building a barebones stored procedure which does a simple insert of its parameters into a staging table with no indexes, constraints, or anything else. That will let you get the data into the database very quickly, and you can have a separate process come through every minute or so and work off the rows in batches.
Alternatively, you could have your application store up records until you reach a certain number, and then insert them into the database with a proc using a table-valued parameter. Then you'll only have one insert of however many rows you chose to batch up. The cost of that should be pretty trivial. Do note however that if your application crashes before it's inserted enough rows, those will be lost.
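If the table-valued parameter route appeals, the application side could look roughly like the sketch below. The type, procedure, and column names (dbo.ReadingTvp, dbo.InsertReadings, SensorId, Value, ReadAt) are invented for illustration; you'd pair it with a matching CREATE TYPE ... AS TABLE and a proc that does a plain INSERT ... SELECT from the parameter.

using System;
using System.Data;
using System.Data.SqlClient;

// Buffers incoming rows in a DataTable and flushes them to a stored procedure
// that takes a table-valued parameter. All names below are illustrative.
// Note: not thread-safe as written; guard Add/Flush with a lock if multiple
// threads consume the real-time notifications.
public class BatchInserter
{
    private readonly string _connectionString;
    private readonly int _batchSize;
    private readonly DataTable _buffer;

    public BatchInserter(string connectionString, int batchSize = 500)
    {
        _connectionString = connectionString;
        _batchSize = batchSize;
        _buffer = new DataTable();
        _buffer.Columns.Add("SensorId", typeof(int));
        _buffer.Columns.Add("Value", typeof(decimal));
        _buffer.Columns.Add("ReadAt", typeof(DateTime));
    }

    public void Add(int sensorId, decimal value, DateTime readAt)
    {
        _buffer.Rows.Add(sensorId, value, readAt);
        if (_buffer.Rows.Count >= _batchSize)
            Flush();
    }

    public void Flush()
    {
        if (_buffer.Rows.Count == 0)
            return;

        using (var conn = new SqlConnection(_connectionString))
        using (var cmd = new SqlCommand("dbo.InsertReadings", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            var p = cmd.Parameters.AddWithValue("@Readings", _buffer);
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.ReadingTvp";   // CREATE TYPE dbo.ReadingTvp AS TABLE (...)
            conn.Open();
            cmd.ExecuteNonQuery();           // one round trip for the whole batch
        }
        _buffer.Clear();
    }
}

Calling Flush() from a timer as well keeps a quiet stream from leaving rows sitting in the buffer, which also limits how much you can lose if the application crashes.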
SqlBulkCopy is a powerful tool, but as the name suggests, it's built more for bulk loading of tables. If you have a constant stream of insert requests coming in, I would not recommend using it to load up your data. That might be a good approach if you want to batch up a LOT of requests to load all at once, but not as a recurring and frequent activity.
This works pretty well for me, though I can't guarantee you 400 per second:
private async Task BulkInsert(string tableName, DataTable dt)
{
    if (dt == null)
        return;

    // "./sqlserver..." is a placeholder; use your real connection string here
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy("./sqlserver..."))
    {
        bulkCopy.DestinationTableName = tableName;
        await bulkCopy.WriteToServerAsync(dt);
    }
}
I am working with a SQL Server table that contains 80 million (80,000,000) rows. Data space = 198,000 MB. Not surprisingly, queries against this table often churn or time out. To add to the issues, the table rows get updated fairly frequently and new rows also get added on a regular basis, so it continues to grow like a viral outbreak.
My issue is that I would like to write Entity Framework 5 LINQ to Entities queries to grab rows from this monster table. As I've tried, timeouts have become outright epidemic. A few more things: the table's primary key is indexed and it has non-clustered indexes on 4 of its 19 columns.
So far, I am writing simple LINQ queries that use Transaction Scope and Read Uncommitted Isolation Level. I have tried increasing both the command timeout and the connection timeout. I have written queries that return FirstOrDefault() or a collection, such as the following, which attempts to grab a single ID (an int) from seven days before the current date:
public int GetIDForSevenDaysAgo(DateTime sevenDaysAgo)
{
    using (var txn = new TransactionScope(TransactionScopeOption.Required,
        new TransactionOptions { IsolationLevel = IsolationLevel.ReadUncommitted }))
    {
        var GetId = from te in _repo.GetTEvents()
                    where te.cr_date > sevenDaysAgo
                    orderby te.cr_date
                    select te.id;
        return GetId.FirstOrDefault();
    }
}
and
public IEnumerable<int> GetIDForSevenDaysAgo(DateTime sevenDaysAgo)
{
    using (var txn = new TransactionScope(TransactionScopeOption.Required,
        new TransactionOptions { IsolationLevel = IsolationLevel.ReadUncommitted }))
    {
        var GetId = from te in _repo.GetTEvents()
                    where te.cr_date > sevenDaysAgo
                    orderby te.cr_date
                    select te.id;
        return GetId.Take(1);
    }
}
Each query times out repeatedly regardless of the timeout settings. I'm using the repository pattern with Unity DI and fetching the table with IQueryable<> calls. I'm also limiting the repository call to eight days from the current date (hoping to only grab the needed subset of this mammoth table). I'm using Visual Studio 2013 with Update 5 targeting .NET v4.5 and SQL Server 2008 R2.
I looked at the SQL that EF generates, and it didn't look dramatically more complicated than the LINQ statements above. And my brain hurts.
So, have I reached some sort of tolerance limit for EF? Is the table simply too big? Should I revert to Stored Procedures/domain methods when querying this table? Are there other options I should explore? There was some discussion around removing some of the table's rows, but that probably won't happen anytime soon. I did read a little about paging, but I'm not sure if that would help or not. Any thoughts or ideas would be appreciated! Thank you!
As far as I can see, you are only selecting data and not changing it, so why do you need TransactionScope? You generally only need it when you have two or more SaveChanges() calls in your code and you want them to be in one transaction. So get rid of it.
Another thing I would do in your case is disable change tracking and automatic change detection on your context. But be careful if you don't recreate the context for each request: it can hang on to stale data.
To do that, set these options near your context initialization:
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.LazyLoadingEnabled = false;
The other things you should think about are pagination and caching. But as far as I can see, in your example you are trying to get only one row, so I can't say anything in particular.
I recommend reading this article for further optimisation.
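For example, assuming the repository ultimately exposes an EF DbSet (called TEvents on a _context here purely for illustration), the seven-days-ago lookup could be written without the TransactionScope and without tracking roughly like this:

public int GetIDForSevenDaysAgo(DateTime sevenDaysAgo)
{
    // AsNoTracking() skips the change tracker entirely; the query only reads
    // and never saves, so there is nothing worth tracking.
    return _context.TEvents
                   .AsNoTracking()
                   .Where(te => te.cr_date > sevenDaysAgo)
                   .OrderBy(te => te.cr_date)
                   .Select(te => te.id)
                   .FirstOrDefault();
}

How much this helps still depends on having an index on cr_date; without one, SQL Server has to scan the table no matter how the LINQ is written.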
It's not easy to say whether you should go with stored procedures or EF, since we're talking about a monster. :-)
The first thing I would do is run the query in SSMS with the Actual Execution Plan displayed. Sometimes it points out missing indexes that might increase performance.
From your example, I'm pretty sure you need an index on that date column.
In other words, if you have access, make sure the table design is optimal for that amount of data.
My thought is that if a simple query hangs, what more can EF do?
Here is my situation: my client wants to bulk insert 100,000+ rows into the database from a CSV file, which is simple enough, but the values need to be checked against data that is already in the database (does this product type exist? is this product still sold? etc.). To make things worse, these files will also be uploaded into the live system during the day, so I need to make sure I'm not locking any tables for long. The data that is inserted will also be spread across multiple tables.
I've been loading the data into a staging table, which takes seconds. I then tried creating a web service to start processing the table using LINQ, marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done I need to take the valid rows and update/add them to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest
IF EXISTS (SELECT blah FROM blah WHERE....)
UPDATE (blah)
ELSE
INSERT (blah)
You could do this in chunks to avoid server load, but this is by no means a quick solution, so SSIS would be preferable
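If you stay with the set-based approach, one way to keep locks short is to work through the staging table in small chunks. Here is a rough sketch; the table and column names (dbo.Product, dbo.ProductStaging, ProductCode, IsValid) are invented, a connectionString variable is assumed to be in scope, and it only covers the insert-new-rows half (updates can be chunked the same way with UPDATE TOP (n)):

// Each round trip inserts at most @BatchSize new products; rows already
// copied are excluded by the NOT EXISTS, so the loop stops on its own and
// no single statement holds locks for very long.
const string insertChunkSql = @"
    INSERT INTO dbo.Product (ProductCode, Name, Price)
    SELECT TOP (@BatchSize) s.ProductCode, s.Name, s.Price
    FROM dbo.ProductStaging s
    WHERE s.IsValid = 1
      AND NOT EXISTS (SELECT 1 FROM dbo.Product p
                      WHERE p.ProductCode = s.ProductCode);";

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    int affected;
    do
    {
        using (var cmd = new SqlCommand(insertChunkSql, conn))
        {
            cmd.Parameters.AddWithValue("@BatchSize", 5000);
            affected = cmd.ExecuteNonQuery();   // rows inserted this round
        }
    } while (affected > 0);
}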
I have reports that perform some time consuming data calculations for each user in my database, and the result is 10 to 20 calculated new records for each user. To improve report responsiveness, a nightly job was created to run the calculations and dump the results to a snapshot table in the database. It only runs for active users.
So with 50k users, 30k of which are active, the job "updates" 300k to 600k records in the large snapshot table. The method it currently uses is it deletes all previous records for a given user, then inserts the new set. There is no PK on the table, only a business key is used to group the sets of data.
So my question is, when removing and adding up to 600k records every night, are there techniques to optimize the table to handle this? For instance, since the data can be recreated on demand, is there a way to disable logging for the table as these changes are made?
UPDATE:
One issue is that I cannot do this in a single batch because of the way the script works: it examines one user at a time, deletes that user's previous 10-20 records, and inserts a new set of 10-20 records, over and over. I am worried that the transaction log will run out of space or that other performance issues could occur. I would like to configure the table to not worry about data preservation or other things that could slow it down. I cannot drop the indexes and so on because people are querying the table concurrently while it is being updated.
It's also worth noting that indexing could potentially speed up this bulk update rather than slow it down, because UPDATE and DELETE statements still need to locate the affected rows in the first place, and without appropriate indexes they will resort to table scans.
I would, at the very least, consider a non-clustered index on the column(s) that identify the user, and (assuming you are using 2008) consider the MERGE statement, which can definitely avoid the shortcomings of the mass DELETE/INSERT method currently employed.
According to The Data Loading Performance Guide (MSDN), MERGE is minimally logged for inserts with the use of a trace flag.
I won't say too much more until I know which version of SQL Server you are using.
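To make that concrete, here is a rough sketch of what the MERGE could look like, assuming SQL Server 2008+ and that the nightly job first loads the freshly calculated rows into a staging table. The table and column names (dbo.UserSnapshot, dbo.SnapshotStaging, UserId, MetricId, Value, CalculatedAt) and the connectionString variable are invented for illustration:

// Sketch only: rows whose business key no longer exists for a user would
// still need a separate targeted DELETE; this covers update-in-place and
// inserts of new keys in one set-based statement.
const string mergeSql = @"
    MERGE dbo.UserSnapshot AS target
    USING dbo.SnapshotStaging AS source
        ON  target.UserId   = source.UserId
        AND target.MetricId = source.MetricId
    WHEN MATCHED THEN
        UPDATE SET target.Value = source.Value,
                   target.CalculatedAt = source.CalculatedAt
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (UserId, MetricId, Value, CalculatedAt)
        VALUES (source.UserId, source.MetricId, source.Value, source.CalculatedAt);";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(mergeSql, conn))
{
    conn.Open();
    cmd.CommandTimeout = 0;   // the nightly job may legitimately run for a while
    cmd.ExecuteNonQuery();
}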
This is called bulk insert: you have to drop all indexes on the destination table and send the insert commands in large batches (hundreds of insert statements) separated by semicolons.
Another way is to use the BULK INSERT statement (http://msdn.microsoft.com/en-us/library/ms188365.aspx), but it involves dumping the data to a file.
See also: Bulk Insert Sql Server millions of record
It really depends upon many things:
speed of your machine
size of the records being processed
network speed
etc.
Generally it is quicker to add records to a "heap" or an un-indexed table. So dropping all of your indexes and re-creating them after the load may improve your performance.
Partitioning the table may see performance benefits if you partition by active and inactive users (although the data set may be a little small for this)
Make sure you measure how much time each tweak adds to or removes from your load, and work from there.
Every day a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a Windows service that runs early in the AM to read the text file into our SQL Server 2005 DB tables. We don't do a BULK INSERT because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur index maintenance on the table for every row, which can be costly if it's tuned for selects. I believe with MERGE it's a bulk operation.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: it turns out you cannot use MERGE on SQL Server 2005 (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again, I've seen a copy of the current database used in production to perform a long-running action (reporting and aggregation of data in that instance); however, this wasn't merged back in. I don't know what merging capabilities are available in SQL replication, but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update by joining onto this table to populate your main table. The difference is that SQL Server is now working with a set, so it can handle the index maintenance accordingly; it should be faster, even with the join.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (ie, you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT tmp.val1, tmp.val2, tmp.val3
FROM #Tempo tmp
WHERE NOT EXISTS
(
    SELECT *
    FROM MyTable t
    WHERE t.val1 = tmp.val1 AND t.val2 = tmp.val2 AND t.val3 = tmp.val3
)
We do much larger imports than that all the time. Create an SSIS package to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory if you want before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.
I have a project that involves recording data from a device directly into a sql table.
I do very little processing in code before writing to SQL Server (2008 Express, by the way).
Typically I use the SqlHelper class's ExecuteNonQuery method and pass in a stored proc name and a list of parameters that the SP expects.
This is very convenient, but I need a much faster way of doing this.
Thanks.
ExecuteNonQuery with an INSERT statement, or even a stored procedure, will get you into thousands of inserts per second range on Express. 4000-5000/sec are easily achievable, I know this for a fact.
What usually slows down individual updates is the wait time for log flush and you need to account for that. The easiest solution is to simply batch commit. Eg. commit every 1000 inserts, or every second. This will fill up the log pages and will amortize the cost of log flush wait over all the inserts in a transaction.
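A minimal sketch of that batch-commit pattern, with an invented dbo.Reading table and a simple DTO for illustration, might look like this:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public class Reading
{
    public int DeviceId;
    public decimal Value;
    public DateTime ReadAt;
}

public static class ReadingWriter
{
    // Commits every batchSize inserts so the cost of the log flush is paid
    // once per batch instead of once per row. Table and column names are
    // illustrative.
    public static void InsertBatched(string connectionString,
                                     IEnumerable<Reading> readings,
                                     int batchSize = 1000)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            var tx = conn.BeginTransaction();
            int pending = 0;

            foreach (var r in readings)
            {
                using (var cmd = new SqlCommand(
                    "INSERT INTO dbo.Reading (DeviceId, Value, ReadAt) VALUES (@d, @v, @t)",
                    conn, tx))
                {
                    cmd.Parameters.AddWithValue("@d", r.DeviceId);
                    cmd.Parameters.AddWithValue("@v", r.Value);
                    cmd.Parameters.AddWithValue("@t", r.ReadAt);
                    cmd.ExecuteNonQuery();
                }

                if (++pending >= batchSize)
                {
                    tx.Commit();                  // one log flush for the whole batch
                    tx = conn.BeginTransaction();
                    pending = 0;
                }
            }

            tx.Commit();                          // flush whatever is left
        }
    }
}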
With batch commits you'll probably bottleneck on disk log write performance, which there is nothing you can do about short of changing the hardware (e.g. a RAID 0 stripe for the log).
If you hit earlier bottlenecks (unlikely) then you can look into batching statements, ie. send one single T-SQL batch with multiple inserts on it. But this seldom pays off.
Of course, you'll need to reduce the size of your writes to a minimum, meaning reduce the width of your table to the minimally needed columns, eliminate non-clustered indexes, eliminate unneeded constraints. If possible, use a Heap instead of a clustered index, since Heap inserts are significantly faster than clustered index ones.
There is little need to use the fast insert interface (i.e. SqlBulkCopy). Using ordinary INSERTs and ExecuteNonQuery with batch commits, you'll exhaust the drive's sequential write throughput long before you need to deploy bulk insert. Bulk insert is needed on fast SAN-connected machines, and you mention Express, so that's probably not the case. There is a perception to the contrary out there, but that is simply because people don't realize that bulk insert gives them batch commit, and it's the batch commit that speeds things up, not the bulk insert.
As with any performance test, make sure you eliminate randomness, and preallocate the database and the log. You don't want to hit a db or log growth event during test measurements or in production; that is sooo amateurish.
Bulk insert would be the fastest since it is minimally logged.
.NET also has the SqlBulkCopy class.
Here is a good way to insert a lot of records using table variables...
...but it's best to limit it to 1000 records at a time; despite the common belief that table variables are purely "in memory", they are backed by tempdb and have no statistics, so very large batches can perform poorly.
In this example I will insert 2 records into a table with 3 fields -
CustID, Firstname, Lastname
--first create a table variable with the same structure
--you could also use a temporary table, but it would be slower
declare @MyTblVar table (CustID int, FName nvarchar(50), LName nvarchar(50))

insert into @MyTblVar values (100, 'Joe', 'Bloggs')
insert into @MyTblVar values (101, 'Mary', 'Smith')

insert into MyCustomerTable
select * from @MyTblVar
This is typically done by way of a BULK INSERT. Basically, you prepare a file and then issue the BULK INSERT statement, and SQL Server copies all the data from the file into the table with the fastest method possible.
It does have some restrictions (for example, there's no way to do "update or insert" type of behaviour if you have possibly-existing rows to update), but if you can get around those, then you're unlikely to find anything much faster.
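As a rough illustration (the table name, file path, options, and connectionString variable are placeholders; the path is resolved on the SQL Server machine, not the client, so the file must be readable by the SQL Server service account):

const string bulkSql = @"
    BULK INSERT dbo.DeviceReading
    FROM 'C:\Imports\readings.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(bulkSql, conn))
{
    conn.Open();
    cmd.CommandTimeout = 0;   // large files can take a while
    cmd.ExecuteNonQuery();    // SQL Server reads the file directly
}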
If you mean from .NET then use SqlBulkCopy
Things that can slow inserts include indexes and reads or updates (locks) on the same table. You can speed up situations like yours by avoiding both: insert individual transactions into a separate holding table with no indexes or other activity, then batch the holding table's rows over to the main table a little less frequently.
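A sketch of the periodic "drain" step, with invented names (dbo.ReadingHolding as the index-free holding table, dbo.Reading as the main table, plus an assumed connectionString); it could run from a timer in the app or a SQL Agent job every few seconds or minutes:

const string drainSql = @"
    SET XACT_ABORT ON;
    BEGIN TRAN;
        -- TABLOCKX briefly blocks new inserts into the holding table so the
        -- same rows are copied and then removed atomically.
        INSERT INTO dbo.Reading (DeviceId, Value, ReadAt)
        SELECT DeviceId, Value, ReadAt
        FROM dbo.ReadingHolding WITH (TABLOCKX);

        DELETE FROM dbo.ReadingHolding;
    COMMIT;";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(drainSql, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();
}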
It can only really go as fast as your SP will run. Ensure that the table(s) are properly indexed and if you have a clustered index, ensure that it has a narrow, unique, increasing key. Ensure that the remaining indexes and constraints (if any) do not have a lot of overhead.
You shouldn't see much overhead in the ADO.NET layer (I wouldn't necessarily use any other .NET library above SqlCommand). You may be able to use the ADO.NET async methods to queue several calls to the stored proc without blocking a single thread in your application (this could potentially free up more throughput than anything else, just like having multiple machines inserting into the database).
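For example, here is a rough sketch of overlapping several stored-proc calls with the async ADO.NET API (ExecuteNonQueryAsync needs .NET 4.5; on older frameworks the Begin/EndExecuteNonQuery pair plays the same role). The proc, parameter, and type names are invented:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;

public class DeviceReading
{
    public int DeviceId;
    public decimal Value;
}

public static class AsyncInserter
{
    // Fires several proc calls concurrently without tying up a thread while
    // SQL Server does the work. In practice you would cap the number of
    // in-flight calls rather than launching one per record.
    public static Task InsertAllAsync(string connectionString, IEnumerable<DeviceReading> readings)
    {
        var tasks = readings.Select(async r =>
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("dbo.InsertReading", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.AddWithValue("@DeviceId", r.DeviceId);
                cmd.Parameters.AddWithValue("@Value", r.Value);
                await conn.OpenAsync();
                await cmd.ExecuteNonQueryAsync();
            }
        });
        return Task.WhenAll(tasks);
    }
}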
Other than that, you really need to tell us more about your requirements.