MS SQL server 2005 lock requests/sec rising linearly non-stop - sql-server

While running an application load test, I am observing some weird behavior. Lock requests/sec counter is increasing in a linear fashion throughout the whole test (duration 12 hours, load levels off to a constant level within first 10 minutes). The value reached 6 million at 12 hours. There was no apparent impact to the response time of the application. There was also no impact to lock wait time (200ms average). Database CPU slowly increased from 20% to about 30% at 12 hours.
What could be causing such behaviour?

You are going to need to start profiling the database to see what items are requesting locks, and from there you will be able to see what is happening with the lock requests. Is the amount of data growing in your application? If so, that could be a source of the increased lock numbers.

We were able to get rid of the escalating request locks/sec issue by setting the default transaction isolation level to "READ_COMMITTED_SNAPSHOT". However, there is still no explanation why this was happening in the first place. Any ideas are welcome.

Related

flink backpressure monitoring

From the beginning of our Flink project.
My cluster have suffered from low back-pressure because of heavy parsing code.
So I put monitoring script on the system which keep asking back-pressure status from the task manager. ( Which run every 20 seconds for getting the highest value and average )
By the way when I turn off the script running, I found increment of back pressure ratio become much less slower than when I using the script.
So is there any efficient way to get the back pressure status without losing performance ?
I don't believe requesting back-pressure every 20 seconds would have a major impact on your workflow's performance.
Separately, if you've got available CPU cycles, then increasing the parallelism of the function(s) doing the parsing would be the next thing to try, to improve throughput.

SQL Server Plan Cache Suddenly Misbehaving

I'm running soak tests at the moment and keep coming up against a wierd issue that I've never seen in the past. I've spent quite a while investigating the issue and so far not got to be bottom of it.
At some point during the test (sometimes 1 hour in, other times 4+ hours) the SQL Server machine starts maxing it's CPU. This always corresponds with a sharp decrease in DB cache memory and increase in free memory.
The signs obviously point at memory pressure and it seems that I can sometimes trigger this event by running a particularly heavy query.
I can understand why the plan cache is being flushed however the aspects of this that are confusing me are:
After the plan cache is flushed and my meaty query finishes there is plenty of free memory (even after further increasing the amount of memory SQL Server is allowed) the plan cache doesn't seem to recover. I'm left with loads of free memory which isn't helping anyone.
If I stop my soak test and then re-run it immediatly then things go back to normal, the plan cache grows as expected. SQL Server does not need restarted or to have any settings altered.
After the cache flush the cache hit ratio is still OK-ish, ~90% however this is much lower than the ~99% I am seeing before the flush and really hurting the CPU.
Before the flush a trace of cache misses, inserts and hits looks normal enough. Pre-flush the only issue I see is a non-parameterised ad-hoc query that's being inserted into the cache very frequently however even with this it's a very simple query which has a low cost so would expect these to be flushed from the cache ahead of most other things.
Post flush I'm seeing a very high number of inserts followed immediately by numerous misses on the same object (i.e. stored procedures), and thus memory consumption for the cache remains low.
You can see from the yellow line in the shot of my counters below that the cache memory usage drops off and stays low yet the free memory (royal blue) stays fairly high.
EDIT
After looking into this issue for another good while a pattern that keeps appearing is that if I push the server to it's limit for a short time (adding load above what the soak test is producing) then SQL Server seems to get itself into a mess which it can't recover from on it's own.
The number of connections to the server sharply increases when it hits the point of maximum pressure (I'm assuming due to it not being able to deal with requests quickly enough so new connections are needed to deal with the "constant" flow of requests). This backlog is then placing further pressure on the server which it doesn't appear to be able to recover from.
Now, I'm still puzzled by the metrics. I could accept this as purely a server resource issue if the new connections seemed to be eating up memory, further slowing processing, causing new connections, etc. What I am seeing though is that there is plenty of free memory but SQL Server isn't using it for the plan cache. Because of this it's spending more time compiling, upping CPU and things spiral out of control.
It feels like the connections are the key part of this problem. As mentioned before if I restart the test everything goes back to normal. I've since found that putting the DB into single user mode for a few seconds so that all test related connections die, waiting a few seconds and then going back to multi-user mode resolves the issue. I've tried just killing all active connections based on SPID however it seems there needs to be a pause of a few seconds in order for the server to recover and start using the plan cache properly.
See screenshot below of my counters. I'm trying to push the server over the top up to ~02:33:15 and I set to single user mode at ~02:34:30 and then multi-user mode a few seconds after.
Purple line is user connections, thick red is compilations p/s, bright green is cache memory, aqua connection memory, greyish/brown is free memory.
OK, it's been a long circular road but the best answer I currently have for this is that this issue is due to resource constraints and the unfortunate choices that SQL Server makes in relation to the plan cache for my particular circumstances. I'm not saying SQL Server is wrong, just that for my needs at this time I don't think it's making the right decisions.
I've adjusted my soak test so that if the DB server comes under pressure it pulls on the reigns a bit and drops some connections, until such time that the server comes back under control and the additional connections can be reestablished. The process of SQL Server getting itself back in order can take a few minutes but it does happen!
It seems that the server was getting itself into a vicious cycle, where it was coming under pressure, dropping cached plans and then having to spend more on recompiling these plans later than it gained by dropping them in the first place. This lead to things spiraling out of control and everything grinding to a halt.
In my particular case there is a very high cache hit ratio (above 99.5%) and due to the soak test basically doing the same thing repeatedly for hours for loads of users the cache is very well used. If the cache weren't so well used then SQL Server would have quite possibly made the right choice by dropping plans but I don't think it did here.

Identifying Timeout Causes with SQL Server Profiler

We are experiencing seemingly random timeouts on a two app (one ASP.Net and one WinForms) SQL Server application. I had SQL Profiler run during an hour block to see what might be causing the problem. I then isolated the times when the timeouts were occurring.
There are a large number of Reads but there is no large difference in the reads when the timeout errors occur and when they don't. There are virtually no writes during this period (primarily because everyone is getting time outs and can't write).
Example:
Timeout occurs 11:37. There are an average of 1500 transactions a minute leading up to the timeout, with about 5709219 reads.
That seems high EXCEPT that during a period in between timeouts (over a ten minute span), there are just as many transactions per minute and the reads are just as high. The reads do spike a little before the timeout (jumping up to over 6005708) but during the non-timeout period, they go as high as 8251468. The timeouts are occurring in both applications.
The bigger problem here is that this only started occurring in the past week and the application has been up and running for several years. So yes, the Profiler has given us a lot of data to work with but the current issue is the timeouts.
Is there something else that I should be possibly looking for in the Profiler or should I move to Performance Monitor (or another tool) over on the server?
One possible culprit might be the Database Size. The database is fairly large (>200 GB) but the AutoGrow setting was set to 1MB. Could it be that SQL Server is resizing itself and that transaction doesn't show itself in the profiler?
Many thanks
Thanks to the assistance here, I was able to identify a few bottlenecks but I wanted to outline my process to possibly help anyone going through this.
The #1 problem was found to be a high number of LOCK_MK_S entries found from the SQLDiag and other tools.
Run the Trace Profiler over two different periods of time. Comparing durations for similar methods led me to find that certain UPDATE calls were always taking the same amount of time, over 10 seconds.
Further investigation found that these UPDATE stored procs were updating a table with a trigger that was taking too much time. Since a trigger may lock the table while it completes, it was affecting every other query. ( See the comment section - I was incorrectly stating that the trigger would always lock the table - in our case, the trigger was preventing the lock from being released)
Watch the use of Triggers for doing major updates.

Why is the CPU not maxed out?

I have an application that I'd like to make more efficient - it isn't taxing any one resource enough that I can identify it as a bottleneck, so perhaps the app is doing something that is preventing full efficiency.
The application pulls data from a database on one SQL Server instance, does some manipulation on it, then writes it to a database on another SQL Server instance - all on one machine. It doesn't do anything in parallel.
While the app is running (it can take several hours), none of the 4 CPU cores are maxed out (they hover around 40-60% utilization each), the disks are almost idle and very little RAM is used.
Reported values:
Target SQL Server instance: ~10% CPU utilization, 1.3GB RAM
Source SQL Server instance: ~10% CPU utilization, 300MB RAM
Application: ~6% CPU utilization, 45MB RAM
All the work is happening on one disk, which writes around 100KB/s during the operation, on average. 'Active time' according to task manager is usually 0%, occasionally flickering up to between 1 and 5% for a second or so. Average response time, again according to task manager, moves betweeen 0ms and 20ms, mainly showing between 0.5 and 2ms.
Databases are notorious for IO limitations. Now, seriously, as you say:
The application pulls data from a database on one SQL Server instance,
does some manipulation on it, then writes it to a database on another
SQL Server instance - all on one machine.
I somehow get the idea this is a end user level mashine, maybe a workstation. Your linear code (a bad idea to get full utilization btw, as you never run all 3 parts - read, process, write - in parallel) will be seriously limited by whatever IO subsystem you have.
But that will not come into play as long as you can state:
It doesn't do anything in parallel.
What it must do is do things in parallel:
One task is reading the next data
One task does the data processing
One task does the data writing
You can definitely max out a lot more than your 4 cores. Last time I did something like that (read / manipulate / write) we were maxing out 48 cores with around 96 or so processing threads running in parallel (and a smaller amount doing the writes). But a core of that is that your application msut start actually using multiple CPU's.
If you do not parallelize:
You only will max out one core max,
YOu basically waste time waiting for databases on both ends. The latency while you wait for data to be read or committed is latency you are not processing anything.
;) And once you fix that you will get IO problems. Promised.
I recommend reading How to analyse SQL Server performance. You need to capture and analyze the wait stats. These will tell you what is the execution doing that prevents it from going max out on CPU. You already have a feeling that the workload is causing the SQL engine to wait rather than run, but only after you understand the wait stats you'll be able to get a feel what is waiting for. Follow the article linked for specific analysis techniques.

SQL Server 100% CPU Utilization - One database shows high CPU usage than others

We have an SQL server with about 40 different (about 1-5GB each) databases. The server is an 8 core 2.3G CPU with 32Gigs of RAM. 27Gig is pinned to SQL Server. The CPU utliziation is mostly close to 100% always and memory consumption is about 95%. The problem here is the CPU which is constantly close to 100% and trying to understand the reason.
I have run an initial check to see which database contributes to high CPU by using - this script but I could not substantiate in detail on whats really consuming CPU. The top query (from all DBs) only takes about 4 seconds to complete. IO is also not a bottleneck.
Would Memory be the culprit here? I have checked the memory split and the OBJECT CACHE occupies about 80% of memory allocated (27G) to SQL Server. I hope that is normal provided there are lot of SPs involved. Running profiler, I do see lot of recompiles, but mostly are due to "temp table changed", "deferred compile" etc and am not clear if these recompiles are a result of plans getting thrown out of cache due to memory pressure
Appreciate any thoughts.
You can see some reports in SSMS:
Right-click the instance name / reports / standard / top sessions
You can see top CPU consuming sessions. This may shed some light on what SQL processes are using resources. There are a few other CPU related reports if you look around. I was going to point to some more DMVs but if you've looked into that already I'll skip it.
You can use sp_BlitzCache to find the top CPU consuming queries. You can also sort by IO and other things as well. This is using DMV info which accumulates between restarts.
This article looks promising.
Some stackoverflow goodness from Mr. Ozar.
edit:
A little more advice...
A query running for 'only' 5 seconds can be a problem. It could be using all your cores and really running 8 cores times 5 seconds - 40 seconds of 'virtual' time. I like to use some DMVs to see how many executions have happened for that code to see what that 5 seconds adds up to.
According to this article on sqlserverstudymaterial;
Remember that "%Privileged time" is not based on 100%.It is based on number of processors.If you see 200 for sqlserver.exe and the system has 8 CPU then CPU consumed by sqlserver.exe is 200 out of 800 (only 25%).
If "% Privileged Time" value is more than 30% then it's generally caused by faulty drivers or anti-virus software. In such situations make sure the BIOS and filter drives are up to date and then try disabling the anti-virus software temporarily to see the change.
If "% User Time" is high then there is something consuming of SQL Server.
There are several known patterns which can be caused high CPU for processes running in SQL Server including

Resources