We are experiencing seemingly random timeouts on a two app (one ASP.Net and one WinForms) SQL Server application. I had SQL Profiler run during an hour block to see what might be causing the problem. I then isolated the times when the timeouts were occurring.
There are a large number of Reads but there is no large difference in the reads when the timeout errors occur and when they don't. There are virtually no writes during this period (primarily because everyone is getting time outs and can't write).
Example:
Timeout occurs 11:37. There are an average of 1500 transactions a minute leading up to the timeout, with about 5709219 reads.
That seems high EXCEPT that during a period in between timeouts (over a ten minute span), there are just as many transactions per minute and the reads are just as high. The reads do spike a little before the timeout (jumping up to over 6005708) but during the non-timeout period, they go as high as 8251468. The timeouts are occurring in both applications.
The bigger problem here is that this only started occurring in the past week and the application has been up and running for several years. So yes, the Profiler has given us a lot of data to work with but the current issue is the timeouts.
Is there something else that I should be possibly looking for in the Profiler or should I move to Performance Monitor (or another tool) over on the server?
One possible culprit might be the Database Size. The database is fairly large (>200 GB) but the AutoGrow setting was set to 1MB. Could it be that SQL Server is resizing itself and that transaction doesn't show itself in the profiler?
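To get a feel for why a 1 MB AutoGrow increment looks suspicious on a database this size, here is a rough back-of-the-envelope sketch. The 1 GB of new data and the 512 MB alternative increment are made-up numbers for illustration, not measurements from this server.

```python
import math

def growth_events(new_data_mb, increment_mb):
    """Number of autogrow events needed to make room for new_data_mb."""
    return math.ceil(new_data_mb / increment_mb)

NEW_DATA_MB = 1024  # assumption: 1 GB of new data, for illustration only

events_at_1mb = growth_events(NEW_DATA_MB, 1)      # the setting from the question
events_at_512mb = growth_events(NEW_DATA_MB, 512)  # hypothetical larger increment

print(f"1 MB increment:   {events_at_1mb} growth events")
print(f"512 MB increment: {events_at_512mb} growth events")
```

The point is only the ratio: with a tiny increment, the same amount of new data triggers hundreds of times more file growths.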
Many thanks
Thanks to the assistance here, I was able to identify a few bottlenecks but I wanted to outline my process to possibly help anyone going through this.
The #1 problem was found to be a high number of LCK_M_S waits reported by SQLDiag and other tools.
Run the Trace Profiler over two different periods of time. Comparing durations for similar methods led me to find that certain UPDATE calls were always taking the same amount of time, over 10 seconds.
Further investigation found that these UPDATE stored procs were updating a table with a trigger that was taking too much time. Since a trigger may lock the table while it completes, it was affecting every other query. (See the comment section: I was incorrectly stating that the trigger would always lock the table. In our case, the trigger was preventing the lock from being released.)
Watch the use of Triggers for doing major updates.
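The step of comparing durations across two trace periods can be sketched roughly like this. It assumes the trace rows have been exported as (statement, duration in ms) pairs; the procedure names, the numbers and the 10 second threshold are invented for illustration and are not from the actual trace.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows exported from two trace periods: (statement, duration in ms).
period_a = [
    ("usp_UpdateOrder", 10250), ("usp_GetOrder", 12),
    ("usp_UpdateOrder", 10410), ("usp_GetOrder", 15),
]
period_b = [
    ("usp_UpdateOrder", 10300), ("usp_GetOrder", 11),
    ("usp_UpdateOrder", 10380), ("usp_GetOrder", 14),
]

def average_durations(rows):
    """Average duration in ms per statement name."""
    grouped = defaultdict(list)
    for statement, duration_ms in rows:
        grouped[statement].append(duration_ms)
    return {name: mean(values) for name, values in grouped.items()}

def consistently_slow(first, second, threshold_ms=10_000):
    """Statements whose average duration exceeds the threshold in both periods."""
    avg_first = average_durations(first)
    avg_second = average_durations(second)
    return sorted(
        name for name in avg_first
        if name in avg_second
        and avg_first[name] > threshold_ms
        and avg_second[name] > threshold_ms
    )

slow = consistently_slow(period_a, period_b)
print(slow)
```

A statement that is slow in both periods, and by about the same amount, is a better lead than one that is slow only once.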
I have a C# application executing a constant flow of SQL statements (queries, inserts, updates) against a SQL Server (2019 Standard if that matters) database. Most of these operations take a few milliseconds (ca. 2-50) to execute.
However, in the course of the day, there are occasional cases where the same SQL statement is several orders of magnitude slower and takes seconds to execute, for example 5 or 10 or even >30 seconds, in which case the operation is aborted.
The issue may not happen for several days, or it may happen 1-2 times per day, and there are also phases of 10-15 minutes with several such cases. It all looks pretty random.
I am desperately searching for the cause of this behaviour.
What I have found out and ruled out so far:
the issue is not specific to a particular statement. The issue is also not specific to certain data. It is the exact same statement (with minor differences in the actual values) that takes 5 milliseconds in most cases, but sometimes so much longer.
I believe that the issue is limited to inserts & updates. I have not seen a query causing the issue so far.
the issue is not specific to a particular date/time of the day. Neither does it appear to be a workload bottleneck. My application is the only software using that SQL Server and the problem happens even at times when the workload is pretty low. Everything runs in a single VMware virtual machine, but the IT department is claiming that the machine has no issues and I have no evidence to prove otherwise.
Quite interestingly, I have seen cases where a particular statement would "hang" while similar other statements execute in milliseconds at the same time. (my application is a multi-threaded application)
The issue does not seem to be caused by a deadlock. SQL Server seems to detect deadlock situation and throws a specific error in such cases, which does not happen here.
it seems as if the statement is blocked or held up by something. I have also seen rare cases where a statement would be submitted & hang, another statement is later also submitted and also hangs, until the first statement completes or is aborted, after which the second statement also completes pretty much immediately.
the application may perform multiple transactions in parallel. However, these are not longer than 5-100 milliseconds. It is therefore not plausible that one statement would hang several seconds and wait for some other long running transaction to finish.
After searching through my code and days of logs, I am running out of ideas. Needless to say, I was not able to reproduce the issue in a development environment.
How can I retrieve more information why individual statements take so long and identify the root cause?
Possible theories/suspects:
could SQL Server be limited or hang due to hardware/OS resources - such as IO? I am pretty sure it is not CPU or memory and network should not matter with everything on the same machine. How would I find out about that?
could it be that problematic statements trigger some SQL Server internal catch-up, e.g. flushing caches, etc.? That should however not take several seconds - I hope.
Any help would be much appreciated.
I have a web service sitting on IIS that has been quite happy for months but now I'm getting timeouts and I don't know how to diagnose what the problem is.
The client sends up basic information in a 'heartbeat' message to IIS which then updates this in a SQL database (on a different server). There are 250 clients in the wild, all sending up their heartbeat every 5 minutes ... so there's only 250 rows in the table, with appropriate indexing on the column being used for the update.
Ordinarily it only takes 50-100ms to do the update, but since last week you can see that the response time in the IIS log has increased and I'm also getting timeouts.
Nothing has changed with the setup so I don't know what I'm looking for to determine the reason. The error I get back is:
System.ServiceModel.FaultException: An error occurred while updating
the entries. See the inner exception for details.An error occurred
while updating the entries. See the inner exception for
details.Execution Timeout Expired. The timeout period elapsed prior to
completion of the operation or the server is not responding. The
statement has been terminated.The wait operation timed out
Any advice on where to start looking? I did enable the failed request log trace in IIS but I don't know what it all means if I'm perfectly honest. The difference between a successful request and a failed one is that the request log stops after the 'AspNetStart' entry.
Thanks!
Mark
There are lots of reasons a service can gradually or suddenly become slow. Poor code structure can lead to things like memory leaks on the server, small enough they don't really show up or cause problems during testing, but when run over weeks/months start to stack up. Unauthorized requests could be targeting your server if this is a public-facing service, or has a link to public-facing services.
Things to look at:
Does this happen at certain times of the day or throughout the day?
Is this a load issue that starts occurring when multiple users are sending updates concurrently? 250 users isn't a lot. Has the # of users grown over the last few months or has it been relatively stable since the start?
What is the memory and CPU usage looking like on the Web server(s) and DB server?
This is the first clue to check to see if either server is under considerable load. From there you can investigate why it might be under load or if it possibly needs a bit more grunt to deal with the load. Look at the running processes. If these servers are managed by an IT department or such, some culprits can include things like virus scanners hogging resources. (I.e. policy changes in the last few months have led to additional load on the servers)
What recovery model is your database set up for?
What is the size of your Tx Log (.ldf file)?
Do you have a regular scheduled database backup and index maintenance?
This is one that new projects tend to forget. An empty database is small and has no Tx Log history being recorded, but as it runs over time that Tx Log grows silently in the background, especially with Full recovery. Larger Tx Logs can lead to slower performance over time especially if the log file needs to be enlarged. A good thing to check is whether the log file is set to grow by a # of bytes or percentage. Percentage is I believe the default but this can cause exponential "grow" time/space issues so it's better to set it to a fixed size per grow. You'll want regular backups being done that allow the Tx Log to reset. Ideally don't shrink the file if the Log size between backups stays consistent.
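To illustrate the percentage versus fixed-size point, here is a small sketch of how the size of each growth step changes over time. The 10 GB starting size, the 10% setting and the 512 MB fixed increment are assumptions picked for the example, not recommendations for any particular server.

```python
def growth_steps_percent(start_mb, percent, steps):
    """Size in MB of each successive growth step when growing by a percentage."""
    increments = []
    size = start_mb
    for _ in range(steps):
        increment = size * percent / 100
        increments.append(round(increment, 1))
        size += increment  # the next step is computed from the larger file
    return increments

def growth_steps_fixed(increment_mb, steps):
    """Size in MB of each growth step when growing by a fixed amount."""
    return [increment_mb] * steps

START_MB = 10_240  # assumption: a 10 GB log file

percent_increments = growth_steps_percent(START_MB, 10, 5)
fixed_increments = growth_steps_fixed(512, 5)

print("10% growth:    ", percent_increments)
print("512 MB growth: ", fixed_increments)
```

With percentage growth every step is bigger than the last, so the wait per growth keeps increasing as the file gets larger; with a fixed size each step costs about the same.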
How many records across all tables are being inserted or updated in a given day?
This is important to build a picture of how much the database will be tracking through the day between backups. You may have 250 clients, but every heartbeat is potentially updating a row and inserting others.
What are you using for PKs for inserted records? (Ints vs. UUIDs) If using UUIDs are you using NEWSEQUENTIALID() or NEWID()/Guid.NewGuid()?
GUIDs can be a time bomb for indexing if done poorly. A GUID combined with NEWID() or Guid.NewGuid() will lead to considerable index fragmentation when inserting rows. Provided the GUIDs are not visible to clients you should use NEWSEQUENTIALID(). If IDs are set via code then there are implementations you can find to generate sequential GUIDs. (It's a matter of re-arranging the parts that make up the GUID) Regular index maintenance is a requirement for using UUID columns in indexed fields.
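A toy simulation of why random keys hurt an ordered index: it counts how many inserts land somewhere in the middle of the index rather than at the end. This is only a sketch. The sorted Python list stands in for the index, random 128-bit integers stand in for NEWID()-style values, and a plain increasing counter stands in for NEWSEQUENTIALID(); none of this models SQL Server's actual page structure.

```python
import bisect
import random

def mid_index_inserts(keys):
    """Count inserts that land before the current end of a sorted index."""
    index = []
    out_of_order = 0
    for key in keys:
        position = bisect.bisect_left(index, key)
        if position < len(index):
            out_of_order += 1  # needs room in the middle of the index
        index.insert(position, key)
    return out_of_order

rng = random.Random(42)  # fixed seed so the run is repeatable
N = 1000

# Stand-in for random GUIDs: 128-bit values with no ordering between inserts.
random_keys = [rng.getrandbits(128) for _ in range(N)]

# Stand-in for sequential GUIDs: each key is larger than the previous one.
sequential_keys = list(range(N))

random_mid = mid_index_inserts(random_keys)
sequential_mid = mid_index_inserts(sequential_keys)

print(f"random keys:     {random_mid} of {N} inserts landed mid-index")
print(f"sequential keys: {sequential_mid} of {N} inserts landed mid-index")
```

Sequential keys always append at the end, while almost every random key has to be squeezed in between existing entries, which is what leads to page splits and fragmentation in a real index.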
Are you using Dependency Injection in your web service?
What is the lifetime scope of the DbContexts performing the updates?
This is a potential time bomb for web servers if the lifetime scope for a DbContext is set up incorrectly. You want a DbContext to be alive for no longer than it is needed. At a maximum the lifetime scope should be set to PerRequest. A DbContext set up for Singleton for instance would be tracking entities across requests. The more entities a DbContext is tracking, the slower read and update operations become. This would be a possible culprit if the web server memory usage is climbing.
Are you running an SQL Profiler?
In a test environment with nothing else touching the database, running scenarios through the application with an SQL Profiler can reveal potential issues such as unexpected queries being kicked off due to things like lazy loading. For one operation you might expect one or a small number of queries to be run, only to find dozens or even hundreds. Multiply this across concurrent requests and you have a recipe for the database server to say "Just sit down and wait, dammit!" :) Any queries you don't expect based on the code that is running should be investigated for either eager loading relationships or implementing projection. (Recommended for best performance)
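To show what "dozens or even hundreds" of unexpected queries looks like, here is a minimal sketch using an in-memory SQLite database as a stand-in for the real one. The tables, the `run` helper and the 50 clients are hypothetical; the loop imitates what lazy loading does behind the scenes, and the JOIN imitates eager loading or projection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE heartbeat (client_id INTEGER, status TEXT);
""")
conn.executemany("INSERT INTO client VALUES (?, ?)",
                 [(i, f"client-{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO heartbeat VALUES (?, ?)",
                 [(i, "ok") for i in range(1, 51)])
conn.commit()

counter = {"queries": 0}

def run(sql, params=()):
    """Run one query and count it as one round trip to the database."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()

# Lazy-loading pattern: one query for the parents, then one more per parent.
counter["queries"] = 0
lazy_rows = []
for client_id, name in run("SELECT id, name FROM client ORDER BY id"):
    status = run("SELECT status FROM heartbeat WHERE client_id = ?",
                 (client_id,))[0][0]
    lazy_rows.append((name, status))
lazy_queries = counter["queries"]

# Eager loading / projection pattern: one joined query returns the same rows.
counter["queries"] = 0
eager_rows = run(
    "SELECT c.name, h.status FROM client c "
    "JOIN heartbeat h ON h.client_id = c.id ORDER BY c.id"
)
eager_queries = counter["queries"]

print(f"lazy: {lazy_queries} queries, eager: {eager_queries} query")
```

Both approaches return the same rows; the difference is only in how many times the database gets asked.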
Do the web servers get restarted periodically?
For some tricky to debug issues and memory leaks, sometimes the easiest "fix" is to schedule regular restarts of the web server. It's a hack, but compared to the considerable cost of trying to track down memory leaks or fix up inefficient code that slows down over time, it is a cheap and effective fix. (At least while you research options to address the issues and optimize the code)
That should give you a start into things to check with the service & database.
I have created a vb.net application that uses a SQL Server database at a remote location over the internet.
There are 10 vb.net clients working at the same time.
The problem is the delay that happens when inserting a new row or retrieving rows from the database. The form appears to freeze for a while when it deals with the database, and I don't want to use a background worker to overcome the freeze problem.
I want to eliminate that delay, or reduce it as much as possible.
Any tips, advice or information are welcome, thanks in advance
Well, 2 problems:
The form appears to be freezing for a while when it deals with the database, I don't want to use a background worker to overcome the freeze problem.
Vanity, arrogance and reality rarely mix. ANY operation that takes more than a SHORT time (0.1-0.5 seconds) SHOULD run async; it is the only way to keep the UI responsive. Regardless of what the issue is, if the operation CAN take longer or is in an internet app, decouple them.
But:
The problem is in the delay time that happens when inserting a new records or retrieving records from the database,
So, what IS the problem? Seriously. Is this a latency problem (too many round trips: work on more efficient SQL and batch, so you do not send 20 questions and wait for a result after each) or is the server overloaded? It is not clear from the question whether this really is a latency issue.
At the end:
I want to eliminate that delay time
Pray to whatever god you believe in to change the rules of physics (mostly the speed of light) or to your local physicist to finally get quantum teleportation workable at a low cost. Packets take time to travel at the moment; there is no way to change that.
Check whether you use too many round trips. NEVER (!) use SQL Server remotely with SQL - put in a web service and make it fit the application, possibly even down to a 1:1 match to your screens, so you can ask for data and send updates in ONE round trip, not a dozen. When we did something similar 12 years ago with our custom ORM in .NET, we used a data access layer that accepted multiple queries in one run and returned multiple result sets for them - so a form with 10 drop downs could ask for all 10 data sets in ONE round trip. If a request takes 0.1 seconds of internet time, this saves 0.9 seconds. We had a form with about 100 (!) round trips (creating a tree) and got that down to fewer than 5 - talk of going from "takes time" to "whoa, there". Plus it WAS async, sorry.
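The arithmetic behind those numbers, as a quick sketch. It only counts network latency and assumes the 0.1 seconds per round trip used above; server processing time and data transfer time are ignored, so real figures will differ.

```python
def total_wait_ms(round_trips, latency_ms):
    """Time spent purely waiting on the network, in milliseconds."""
    return round_trips * latency_ms

LATENCY_MS = 100  # the 0.1 s per request from the text above

# A form with 10 drop downs: 10 separate requests vs. one batched request.
separate = total_wait_ms(10, LATENCY_MS)
batched = total_wait_ms(1, LATENCY_MS)
saved = separate - batched

# The tree form: about 100 round trips reduced to 5.
tree_before = total_wait_ms(100, LATENCY_MS)
tree_after = total_wait_ms(5, LATENCY_MS)

print(f"drop downs: {separate} ms -> {batched} ms (saves {saved} ms)")
print(f"tree form:  {tree_before} ms -> {tree_after} ms")
```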
Then realize moving a lot of data is SLOW unless you have instant high bandwidth connections.
This is exactly what async is for. If you have transfer time or latency issues that cannot be optimized and you do not want to use async, go on delivering a crappy experience.
You can execute the SQL call asynchronously and let Microsoft deal with the background process.
http://msdn.microsoft.com/en-us/library/7szdt0kc.aspx
Please note, this does not decrease the response time from the SQL server, for that you'll have to try to improve your network speed or increase the performance of your SQL statements.
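The idea can be illustrated with a small sketch. This is Python's asyncio rather than the .NET API behind the link, so treat it as an analogy only: `slow_insert` is a made-up stand-in for the remote SQL call (the 0.2 second delay is invented), and `ui_loop` is a made-up stand-in for the form staying responsive.

```python
import asyncio

async def slow_insert():
    """Stand-in for a remote SQL call; the 0.2 s delay is an invented figure."""
    await asyncio.sleep(0.2)
    return "inserted"

async def ui_loop(stop, ticks):
    """Stand-in for the form's message loop: keeps handling events."""
    while not stop.is_set():
        ticks.append("tick")
        await asyncio.sleep(0.01)

async def main():
    stop = asyncio.Event()
    ticks = []
    ui = asyncio.create_task(ui_loop(stop, ticks))
    result = await slow_insert()  # the UI loop keeps running while this waits
    stop.set()
    await ui
    return result, len(ticks)

result, tick_count = asyncio.run(main())
print(f"{result}; UI handled {tick_count} ticks while waiting")
```

The call still takes just as long; the only difference is that the UI keeps ticking while it waits.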
There are a few things you could potentially do to speed things up, however it is difficult to say without seeing the code.
If you are using generic inserts - start using stored procedures
If you are closing the connection after every command then... well don't. Establishing a connection is typically one of the more 'expensive' operations
Increase the pipe between the two.
Add an index
Investigate your SQL Server, perhaps it is not set up in a preferred manner.
Sorry for the long introduction but before I can ask my question, I think giving the background would help understanding our problem much better.
We are using SQL Server 2008 as the backend for our web services, and from time to time it takes far too long to respond to requests that are supposed to run really fast, for example more than 20 seconds for a select that queries a table with only 22 rows. We went through many potential areas that could cause the issue, from indexes to stored procedures, triggers, etc., and tried to optimize whatever we could, such as removing indexes that are written frequently but never read, or adding NOLOCK to our select queries to reduce locking on the tables (we are OK with dirty reads).
We also had our DBAs review the server and benchmark the components to look for bottlenecks in CPU, memory or the disk subsystem, and found that hardware-wise we are OK as well. Since the spikes occur only occasionally, it is really hard to reproduce the error in production or development: most of the time, when we rerun the same query, it yields the short response times we expect, not the long ones experienced earlier.
Having said that, I have been suspicious of I/O almost from the start, although it does not seem to be a bottleneck. But I think I was just able to reproduce the error by running an index fragmentation report for a specific table on the server, which immediately caused spikes not only in requests against that table but also in requests that query other tables. Since the DB and the server are shared with other applications we use, and long-running queries against the server and database are a common scenario for us, my suspicion of an occasional I/O bottleneck is, I believe, becoming a fact.
Therefore I want to find a way to prioritize requests coming from the web services so that they are processed even if other resource-intensive queries are running. I have been looking for this kind of prioritization since the very beginning of the resolution process and found out that SQL Server 2008 has a feature called 'Resource Governor' that allows prioritization of requests.
However, since I am neither an expert on Resource Governor nor a DBA, I would like to hear from people who have used or are using Resource Governor, and to know whether I can prioritize I/O for a specific login or a specific stored procedure. (For example, if an I/O intensive process is running when we receive a web service request, can SQL Server stop, or slow down, I/O activity for that process and give priority to the request we just received?)
Thank you in advance to anyone who spends time reading or helping out.
Some Hardware Details:
CPU: 2x Quad Core AMD Opteron 8354
Memory: 64GB
Disk Subsystem: Compaq EVA8100 series (I am not sure but it should be RAID 0+1 across 8 HP HSV210 SCSI drives)
PS: I am almost 100 percent sure that the application servers are not causing the error and there is no bottleneck we can identify there.
Update 1:
I'll try to answer as much as I can of the questions that gbn asked below. Please let me know if you are looking for something else.
1) What kind of index and statistics maintenance do you have please?
We have a weekly job that defrags indexes every Friday. In addition to that, Auto Create Statistics and Auto Update Statistics are enabled. The spikes also occur at times other than when the defragmentation job runs.
2) What kind of write data volumes do you have?
Hard to answer. In addition to our web services, there is a front end application that accesses the same database, and to my knowledge resource intensive queries need to be run periodically. However, I don't know how to get, let's say, the weekly or daily write volume for the DB.
3) Have you profiled Recompilation and statistics update events?
Sorry, I was not able to figure this one out. I didn't understand what you are asking. Can you provide more information on this question, if possible?
My first thought is that statistics are being updated because the data change threshold has been reached, causing execution plans to be rebuilt.
What kind of index and statistics maintenance do you have please? Note: index maintenance updates index stats, not column stats: you may need separate stats updates.
What kind of write data volumes do you have?
Have you profiled Recompilation and statistics update events?
In response to question 3) of your Update to the original question, take a look at the following reference on SQL Server Pedia. It explains what query recompiles are and how you can monitor for these events. What I believe gbn is asking (feel free to correct me sir :-) ) is whether you are seeing recompile events prior to the slow execution of the troublesome query. You can look for this by using the SQL Server Profiler.
Reasons for Recompiling a Query Execution Plan
While running an application load test, I am observing some weird behavior. Lock requests/sec counter is increasing in a linear fashion throughout the whole test (duration 12 hours, load levels off to a constant level within first 10 minutes). The value reached 6 million at 12 hours. There was no apparent impact to the response time of the application. There was also no impact to lock wait time (200ms average). Database CPU slowly increased from 20% to about 30% at 12 hours.
What could be causing such behaviour?
You are going to need to start profiling the database to see what items are requesting locks, and from there you will be able to see what is happening with the lock requests. Is the amount of data growing in your application? If so, that could be a source of the increased lock numbers.
We were able to get rid of the escalating request locks/sec issue by setting the default transaction isolation level to "READ_COMMITTED_SNAPSHOT". However, there is still no explanation why this was happening in the first place. Any ideas are welcome.