How does DBContext SaveChanges work internally?

How does DBContext SaveChanges work internally? - sql-server

Using Entity Framework with SQL Server 2008, we've got an application that writes high volumes of data, say 1000 new rows per minute, each being in their own DBContext.saveChanges call (we're not batching them together)
The issue is that our writes fall way, way behind. To the point that it seems like the thing is thrashing. For example, we'll call saveChanges with new rows a couple thousand times over two minutes, and not a single write will be made, then all of a sudden we'll get a handful of writes (but many are completely lost).
We've taken a SQL trace, and seen that SQL doesn't receive a command to write for even 10% of our saveChanges calls.
So it would seem there's an issue somewhere in between saveChanges and SQL Server. I'm wondering how this call works. Does it use thread pooling? Queueing? some buffer that we could be overrunning? Maybe its silently failing due to the volume of writes?
MSDN is pretty useless on explaining how this stuff actually works

Read the performance considerations in the msdn and also have a look at Fastest Way of Inserting in Entity Framework.

I don't know how it works internally, but with this kind of overload you better insert the data into a queue and use one or more (but limited) threads to empty the queue and write to the database. You can test and adjust the amount of threads so you won't lose data.

Related

Should I keep this "GlobalConnection" or create connection for every query?

I have inherited a legacy Delphi application that uses ADO to connect to SQL Server.
The application has a notion of a "Global Connection" -- that is a single connection that it opens at the start, and then keeps open all throughout the running of the application (which can be days, weeks, or longer....)
So my question is this: Should I keep this way of doing things or should I switch to a "connect-query-disconnect" mode of doing things? Does it matter?
Switching would be a non-trivial task, but I'll do it if it means better performance, data management, etc.

Well, it depends on what you're expecting to get out of it, and what kind of application it is.
There's nothing in particular wrong with using a single long-running connection, as long as the application can gracefully handle disconnections and recover or log/notify when it can't reconnect.
The problem with a connect-query-disconnect setup is that you're adding the overhead of connecting and disconnecting on every query. That's going to slow things down, and in an interactive GUI application users may notice the additional overhead. You also have to make sure that authorization is transparently handled if it isn't already.
At the same time, there may be interactive performance gains to be had if you can push all the queries off onto background threads and asynchronously update the GUI. If contention appears because the queries are serialized, you can migrate to a connection-pool system fairly readily as well and improve things even more. This has a fairly high complexity cost to it though, so now you're looking to balancing what the gains are compared to the work involved.
Right now, my ultimate response is "if it ain't broke, don't fix it." Changes along the lines you propose are a lot of work -- how much do the users of this application stand to gain? Are there other problems to solve that might benefit them more?
Edit: Okay, so it's broke. Well, slow at least, which is all the same to me. If you've ruled out problems with the SQL Server itself, and the queries are performing as fast as they can (i.e. DB schema is sane, the right indexes are available, queries aren't completely braindead, server has enough RAM and fast enough I/O, network isn't flaky, etc.), then yes, it's time to find ways to improve the performance of the app itself.
Simply moving to a connect-query-disconnect is going to make things worse, and the more queries you're issuing the bigger the drop off is going to be. It sounds like you're going to need to rearchitect the app so that you can run fewer queries, run them in the background, cache more aggressively on the client, or some combination of all 3.
Don't forget the making the clients perform better means that server side performance gets more important since it's probably going to be handling a higher load if clients start making multiple connections and issuing multiple queries in parallel.

As mr Frazier told before - the one global connection is not bad per se.
If you intend to change, first detect WHAT is the problem. Let's see some scenarios:
1
Some screens(IOW: an set of 1..n forms to operate in a business entity) are slow. Possible causes:
insuficient filtering resulting in a pletora of records being pulled from database without necessity.
the number of records are ok, but takes too much to render it. Solution: faster controls or intelligent rendering (ex.: Virtual list views)
too much queries each time you open an screen. Possible solutions: use TClientDatasets (or any in-memory dataset) to hold infrequently modified lookup tables. An more sophisticated cache for more extensive tables or opening those datasets in other threads can improve response times.
Scrolling on datasets with controls bound can be slow (just to remember, because those little details can be easily forgotten).
2
Whole app simply slows down. Checklist:
Network cards are ok? An few net cards mal-functioning can wreak havoc even on good structured networks as they create unnecessary noise on the line.
[MSSQL DBA HAT ON] The next on the line of attack is SQL Server. Ask the DBA to trace blocks and deadlocks. Register slow queries and work on them speed up. This relate directly to #1.1 and #1.3
Detect if some naive developer have done SELECT inside transactions. In read committed isolation, it's just overhead, as it'll create more network traffic. Open the query, retrieve the data and close the dataset.
Review the database schema, if you can.
Are any data-bound operations on a bulk of records (let's say, remarking the price of some/majority/all products) being done on the app? Make an SP or refactor the operation on an query, it'll be much faster and will reduce the load of the entire server.
Extensive operations on a group of records? Learn how to do that operations at once on the server instead of one-by-one record. See an examination of most used alternatives on the MSSQL MVP Erland Sommarskog's article on array and list on MSSQL.
Beware of queries with WHERE like : WHERE SomeFunction(table1.blabla) = #SomeParam . Most of time, that ones will not use an index causing to read the entire table to select the desired data. If is a big table.... Indexing on a persisted computed columns can make miracles...[MSSQL HAT OFF]
That's what I can think of without a little more detail... ;-)

Database connections are time consuming resources to create and the rule of thumb should be create as little as possible and reuse as much as possible. That's why some other technologies have database connection pools, which are typically established at application/service startup and then kept as long as possible and shared among threads.
From your comment, the application has performances issues, but it's difficult without more details to make any recommendation.
Should try to nail down what is slow - are all queries slow or just some specific ones?
If just some specific ones is there some correlation.
My 2 cents.

How to decrease the response time when dealing with SQL Server remotely?

I have created a vb.net application that uses a SQL Server database at a remote location over the internet.
There are 10 vb.net clients that are working on the same time.
The problem is in the delay time that happens when inserting a new row or retrieving rows from the database, the form appears to be freezing for a while when it deals with the database, I don't want to use a background worker to overcome the freeze problem.
I want to eliminate that delay time and decrease it as much as possible
Any tips, advises or information are welcomed, thanks in advance

Well, 2 problems:
The form appears to be freezing for a while when it deals with the database, I don't want to use a background worker
to overcome the freeze problem.
Vanity, arroaance and reality rarely mix. ANY operation that takes more than a SHORT time (0.1-0.5 seconds) SHOULD run async, only way to kep the UI responsive. Regardless what the issue is, if that CAN take longer of is on an internet app, decouple them.
But:
The problem is in the delay time that happens when inserting a new records or retrieving records from the database,
So, what IS The problem? Seriously. Is this a latency problem (too many round trips, work on more efficient sql, batch, so not send 20 q1uestions waiting for a result after each) or is the server overlaoded - it is not clear from the question whether this really is a latency issue.
At the end:
I want to eliminate that delay time
Pray to whatever god you believe in to change the rules of physics (mostly the speed of light) or to your local physician tof finally get quantum teleportation workable for a low cost. Packets take time at the moment to travel, no way to change that.
Check whether you use too many ound trips. NEVER (!) use sql server remotely with SQL - put in a web service and make it fitting the application, possibly even down to a 1:1 match to your screens, so you can ask for data and send updates in ONE round trip, not a dozen. WHen we did something similar 12 years ago with our custom ORM in .NET we used a data access layer for that that acepted multiple queries in one run and retuend multiple result sets for them - so a form with 10 drop downs could ask for all 10 data sets in ONE round trip. If a request takes 0.1 seconds internet time - then this saves 0.9 seconds. We had a form with about 100 (!) round trips (creating a tree) and got that down to less than 5 - talk of "takes time" to "whow, there". Plus it WAS async, sorry.
Then realize moving a lot of data is SLOW unless you have instant high bandwidth connections.
THis is exaclty what async is done for - if you have transfer time or latency time issues that can not be optimized, and do not want to use async, go on delivering a crappy experience.

You can execute the SQL call asynchronously and let Microsoft deal with the background process.
http://msdn.microsoft.com/en-us/library/7szdt0kc.aspx
Please note, this does not decrease the response time from the SQL server, for that you'll have to try to improve your network speed or increase the performance of your SQL statements.

There are a few things you could potentially do to speed things up, however it is difficult to say without seeing the code.
If you are using generic inserts - start using stored procedures
If you are closing the connection after every command then... well dont. Establishing a connection is typically one of the more 'expensive' operations
Increase the pipe between the two.
Add an index
Investigate your SQL Server perhaps it not setup in a preferred manner.

Implementing multithreaded application under C

I am implementing a small database like MySQL.. Its a part of a larger project..
Right now i have designed the core database, by which i mean i have implemented a parser and i can now execute some basic sql queries on my database.. it can store, update, delete and retrieve data from files.. As of now its fine.. however i want to implement this on network..
I want more than one user to be able to access my database server and execute queries on it at the same time... I am working under Linux so there is no issue of portability right now..
I know i need to use Sockets which is fine.. I also know that i need to use a concept like Thread Pool where i will be required to create a maximum number of threads initially and then for each client request wake up a thread and assign it to the client..
As for now what i am unable to figure out is how all this is actually going to be bundled together.. Where should i implement multithreading.. on client side / server side.? how is my parser going to be configured to take input from each of the clients separately?(mostly via files i think?)
If anyone has idea about how i can implement this pls do tell me bcos i am stuck here in this project...
Thanks.. :)

If you haven't already, take a look at Beej's Guide to Network Programming to get your hands dirty in some socket programming.
Next I would take his example of a stream client and server and just use that as a single threaded query system. Once you have got this down, you'll need to choose if you're going to actually use threads or use select(). My gut says your on disk database doesn't yet support parallel writes (maybe reads), so likely a single server thread servicing requests is your best bet for starters!
In the multiple client model, you could use a simple per-socket hashtable of client information and return any results immediately when you process their query. Once you get into threading with the networking and db queries, it can get pretty complicated. So work up from the single client, add polling for multiple clients, and then start reading up on and tackling threaded (probably with pthreads) client-server models.

Server side, as it is the only person who can understand the information. You need to design locks or come up with your own model to make sure that the modification/editing doesn't affect those getting served.

As an alternative to multithreading, you might consider event-based single threaded approach (e.g. using poll or epoll). An example of a very fast (non-SQL) database which uses exactly this approach is redis.
This design has two obvious disadvantages: you only ever use a single CPU core, and a lengthy query will block other clients for a noticeable time. However, if queries are reasonably fast, nobody will notice.
On the other hand, the single thread design has the advantage of automatically serializing requests. There are no ambiguities, no locking needs. No write can come in between a read (or another write), it just can't happen.
If you don't have something like a robust, working MVCC built into your database (or are at least working on it), knowing that you need not worry can be a huge advantage. Concurrent reads are not so much an issue, but concurrent reads and writes are.
Alternatively, you might consider doing the input/output and syntax checking in one thread, and running the actual queries in another (query passed via a queue). That, too, will remove the synchronisation woes, and it will at least offer some latency hiding and some multi-core.

Are there any local DB that support multi-threading?

I tried sqlite,
by using multi-thread, only one thread can update db at the same time..
I need multi-thread updating the db at same time.
Is there are any DB can do the job?
ps: I use delphi6.
I found that sqlite can support multi-threading,
But in my test of asgsqlite, when one thread inserting, others will fail to insert.
I'm still in testing.

SQLite can be used in multi-threaded environments.
Check out this link.

Firebird can be used in an embedded version, but it's no problem to use the standard (server) installation locally as well. Very small, easy to deploy, concurrent access. Works good with Delphi, you should look into it as an option.
See also the StackOverflow question "Which embedded database to use in a Delphi application?"

Sqlite locks the entire database when updating (unless this has changed since I last used it). A second thread cannot update the database at the same time (even using entirely separate tables). However there is a timeout parameter that tells the second thread to retry for x milliseconds before failing. I think ASqlite surfaces this parameter in the database component (I think I actually wrote that bit of code, all 3 lines, but it was a couple of years ago).
Setting the timeout to a larger value than 0 will allow multiple threads to update the database. However there may be performance implications.

since version 3.3.1, SQLite's threading requirements have been greatly relaxed. in most cases, it means that it simply works. if you really need more concurrency than that, it might be better to use a DB server.

SQL Server 2008 Express supports concurrency, as well as most other features of SQL Server. And it's free.

Why do you need multiple threads to update it at the same time? I'm sure sqlite will ensure that the updates get done correctly, even if that means one thread waiting for the other one to finish; this is transparent to the application.
Indeed, having several threads updating concurrently would, in all likelihood, not be beneficial to performance. That's to say, it might LOOK like several threads were updating concurrently, but actually the result would be that the updates get done slower than if they weren't (due to the fact that they need to hold many page locks etc to avoid problems).

DBISAM from ElevateSoft works very nicely in multi-threaded mode, and has auto-session naming to make this easy. Be sure to follow the page in the help on how to make it all safe, and job done.

I'm actually at the moment doing performance testing with a multi-threaded Java process on Sybase ASE. The process parses a 1GB file and does inserts into a table.
I was afraid at first, because many of the senior programmers warned me about "table locking" and how dangerous it is to do concurrent access to DB. But I went ahead and did testing (because I wanted to find out for myself).
I created and compared a single threaded process to a process using 4 threads. I only received a 20% reduction in total execution time. I retried the the process using different thread counts and batch insert sizes. The maximum I could squeeze was 20%.
We are going to be switching to Oracle soon, so I'll share how Oracle handles concurrent inserts when that happens.

performance of web app with high number of inserts

What is the best IO strategy for a high traffic web app that logs user behaviour on a website and where ALL of the traffic will result in an IO write? Would it be to write to a file and overnight do batch inserts to the database? Or to simply do an INSERT (or INSERT DELAYED) per request? I understand that to consider this problem properly much more detail about the architecture would be needed, but a nudge in the right direction would be much appreciated.

By writing to the DB, you allow the RDBMS to decide when disk IO should happen - if you have enough RAM, for instance, it may be effectively caching all those inserts in memory, writing them to disk when there's a lighter load, or on some other scheduling mechanism.
Writing directly to the filesystem is going to be bandwidth-limited more-so than writing to a DB which then writes, expressly because the DB can - theoretically - write in more efficient sizes, contiguously, and at "convenient" times.

I've done this on a recent app. Inserts are generally pretty cheap (esp if you put them into an unindexed hopper table). I think that you have a couple of options.
As above, write data to a hopper table, if what ever application framework supports batched inserts, then use these, it will speed it up. Then every x requests, do a merge (via an SP call) into a master table, where you can normalize off data that has low entropy. For example if you are storing if the HTTP type of the request (get/post/etc), this can only ever be a couple of types, and better to store as an Int, and get improved I/O + query performance. Your master tables can also be indexed as you would normally do.
If this isn't good enough, then you can stream the requests to files on the local file system, and then have an out of band (i.e seperate process from the webserver) suck these files up and BCP them into the database. This will be at the expense of more moving parts, and potentially, a greater delay between receiving requests and them finding their way into the database
Hope this helps, Ace

When working with an RDBMS the most important thing is optimizing write operations to disk. Something somewhere has got to flush() to persistant storage (disk drives) to complete each transaction which is VERY expensive and time consuming. Minimizing the number of transactions and maximizing the number of sequential pages written is key to performance.
If you are doing inserts sending them in bulk within a single transaction will lead to more effecient write behavior on disk reducing the number of flush operations.
My recommendation is to queue the messages and periodically .. say every 15 seconds or so start a transaction ... send all queued inserts ... commit the transaction.
If your database supports sending multiple log entries in a single request/command doing so can have a noticable effect on performance when there is some network latency between the application and RDBMS by reducing the number of round trips.
Some systems support bulk operations (BCP) providing a very effecient method for bulk loading data which can be faster than the use of "insert" queries.
Sparing use of indexes and selection of sequential primary keys help.
Making sure multiple instances either coordinate write operations or write to separate tables can improve throughput in some instances by reducing concurrency management overhead in the database.

Write to a file and then load later. It's safer to be coupled to a filesystem than to a database. And the database is more likely to fail than the your filesystem.

The only problem with using the filesystem to back writes is how you extend the log.
A poorly implemented logger will have to open the entire file to append a line to the end of it. I witnessed one such example case where the person logged to a file in reverse order, being the most recent entries came out first, which required loading the entire file into memory, writing 1 line out to the new file, and then writing the original file contents after it.
This log eventually exceeded phps memory limit, and as such, bottlenecked the entire project.
If you do it properly however, the filesystem reads/writes will go directly into the system cache, and will only be flushed to disk every 10 or more seconds, ( depending on FS/OS settings ) which has a negligible performance hit compared to writing to arbitrary memory addresses.
Oh yes, and whatever system you use, you'll need to think about concurrent log appending. If you use a database, a high insert load can cause you to have deadlock conditions, and on files, you need to make sure that you're not going to have 2 concurrent writes cancel each other out.

The insertions will generally impact the (read/update) performance of the table. Perhaps you can do the writes to another table (or database) and have batch job that processes this data. The advantages of the database approach is that you can query/report on the data and all the data is logically in a relational database and may be easier to work with. Depending on how the data is logged to text file, you could open up more possibilities for corruption.

My instinct would be to only use the database, avoiding direct filesystem IO at all costs. If you need to produce some filesystem artifact, then I'd use a nightly cron job (or something like it) to read DB records and write to the filesystem.
ALSO: Only use "INSERT DELAYED" in cases where you don't mind losing a few records in the event of a server crash or restart, because some records almost certainly WILL be lost.

There's an easier way to answer this. Profile the performance of the two solutions.
Create one page that performs the DB insert, another that writes to a file, and another that does neither. Otherwise, the pages should be identical. Hit each page with a load tester (JMeter for example) and see what the performance impact is.
If you don't like the performance numbers, you can easily tweak each page to try and optimize performance a bit or try new solutions... everything from using MSMQ backed by MSSQL to delayed inserts to shared logs to individual files with a DB background worker.
That will give you a solid basis to make this decision rather than depending on speculation from others. It may turn out that none of the proposed solutions are viable or that all of them are viable...

Hello from left field, but no one asked (and you didn't specify) how important is it that you never, ever lose data?
If speed is the problem, leave it all in memory, and dump to the database in batches.

Do you log more than what would be available in the webserver logs? It can be quite a lot, see Apache 2.0 log information for example.
If not, then you can use the good old technique of buffering then batch writing. You can buffer at different places: in memory on your server, then batch insert them in db or batch write them in a file every X requests, and/or every X seconds.
If you use MySQL there are several different options/techniques to load efficiently a lot of data: LOAD DATA INFILE, INSERT DELAYED and so on.
Lots of details on insertion speeds.
Some other tips include:
splitting data into different tables per period of time (ie: per day or per week)
using multiple db connections
using multiple db servers
have good hardware (SSD/multicore)
Depending on the scale and resources available, it is possible to go different ways. So if you give more details, i can give more specific advices.

If you do not need to wait for a response such as a generated ID, you may want to adopt an asynchronous strategy using either a message queue or a thread manager.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight