SQL Server Plan Cache Suddenly Misbehaving

I'm running soak tests at the moment and keep coming up against a weird issue that I've never seen before. I've spent quite a while investigating it and so far have not got to the bottom of it.
At some point during the test (sometimes 1 hour in, other times 4+ hours) the SQL Server machine starts maxing out its CPU. This always corresponds with a sharp decrease in DB cache memory and an increase in free memory.
The signs obviously point at memory pressure and it seems that I can sometimes trigger this event by running a particularly heavy query.
I can understand why the plan cache is being flushed however the aspects of this that are confusing me are:
After the plan cache is flushed and my meaty query finishes, there is plenty of free memory (even after further increasing the amount of memory SQL Server is allowed), yet the plan cache doesn't seem to recover. I'm left with loads of free memory, which isn't helping anyone.
If I stop my soak test and then re-run it immediately, things go back to normal and the plan cache grows as expected. SQL Server does not need to be restarted or to have any settings altered.
After the cache flush the cache hit ratio is still OK-ish (~90%), but this is much lower than the ~99% I was seeing before the flush and it is really hurting the CPU.
Before the flush a trace of cache misses, inserts and hits looks normal enough. The only issue I see pre-flush is a non-parameterised ad-hoc query being inserted into the cache very frequently; even so, it's a very simple, low-cost query, so I would expect those plans to be flushed from the cache ahead of most other things.
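For reference, single-use plans like that can be inspected with something along these lines (a rough sketch against the standard plan cache DMV):
SELECT cp.objtype,
       COUNT(*) AS plan_count,
       SUM(CAST(cp.size_in_bytes AS bigint)) / 1024 / 1024 AS size_mb
FROM sys.dm_exec_cached_plans AS cp
WHERE cp.usecounts = 1            -- plans that have only ever been used once
GROUP BY cp.objtype
ORDER BY size_mb DESC;
If the Adhoc bucket dominates, the 'optimize for ad hoc workloads' server option may be worth a look, although a single cheap query is unlikely to be the whole story here.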
Post flush I'm seeing a very high number of inserts followed immediately by numerous misses on the same object (i.e. stored procedures), and thus memory consumption for the cache remains low.
You can see from the yellow line in the shot of my counters below that the cache memory usage drops off and stays low yet the free memory (royal blue) stays fairly high.
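The plan cache stores can also be sized from inside SQL Server with a sketch like the following (pages_kb applies to SQL Server 2012 and later; older builds split it into single_pages_kb and multi_pages_kb):
SELECT type, name, SUM(pages_kb) / 1024 AS cache_mb
FROM sys.dm_os_memory_clerks
WHERE type IN ('CACHESTORE_SQLCP',   -- ad-hoc and prepared plans
               'CACHESTORE_OBJCP',   -- stored procedure / trigger plans
               'CACHESTORE_PHDR')    -- bound trees
GROUP BY type, name
ORDER BY cache_mb DESC;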
EDIT
After looking into this issue for another good while, a pattern that keeps appearing is that if I push the server to its limit for a short time (adding load above what the soak test is producing), SQL Server seems to get itself into a mess that it can't recover from on its own.
The number of connections to the server sharply increases when it hits the point of maximum pressure (I'm assuming due to it not being able to deal with requests quickly enough so new connections are needed to deal with the "constant" flow of requests). This backlog is then placing further pressure on the server which it doesn't appear to be able to recover from.
Now, I'm still puzzled by the metrics. I could accept this as purely a server resource issue if the new connections seemed to be eating up memory, further slowing processing, causing new connections, etc. What I am seeing though is that there is plenty of free memory but SQL Server isn't using it for the plan cache. Because of this it's spending more time compiling, upping CPU and things spiral out of control.
It feels like the connections are the key part of this problem. As mentioned before if I restart the test everything goes back to normal. I've since found that putting the DB into single user mode for a few seconds so that all test related connections die, waiting a few seconds and then going back to multi-user mode resolves the issue. I've tried just killing all active connections based on SPID however it seems there needs to be a pause of a few seconds in order for the server to recover and start using the plan cache properly.
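For reference, the single-user/multi-user bounce described above amounts to the following (the database name is a placeholder for the soak test DB):
ALTER DATABASE [MyDatabase] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;  -- drops all other connections
WAITFOR DELAY '00:00:05';                                             -- give the server a few seconds to settle
ALTER DATABASE [MyDatabase] SET MULTI_USER;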
See the screenshot of my counters below. I'm pushing the server over the top up until ~02:33:15; I set single-user mode at ~02:34:30 and switched back to multi-user mode a few seconds after.
Purple line is user connections, thick red is compilations p/s, bright green is cache memory, aqua connection memory, greyish/brown is free memory.

OK, it's been a long circular road but the best answer I currently have for this is that this issue is due to resource constraints and the unfortunate choices that SQL Server makes in relation to the plan cache for my particular circumstances. I'm not saying SQL Server is wrong, just that for my needs at this time I don't think it's making the right decisions.
I've adjusted my soak test so that if the DB server comes under pressure it pulls on the reins a bit and drops some connections until the server comes back under control, at which point the additional connections can be re-established. The process of SQL Server getting itself back in order can take a few minutes, but it does happen!
It seems that the server was getting itself into a vicious cycle: it was coming under pressure, dropping cached plans, and then having to spend more on recompiling those plans later than it gained by dropping them in the first place. This led to things spiraling out of control and everything grinding to a halt.
In my particular case there is a very high cache hit ratio (above 99.5%) and due to the soak test basically doing the same thing repeatedly for hours for loads of users the cache is very well used. If the cache weren't so well used then SQL Server would have quite possibly made the right choice by dropping plans but I don't think it did here.
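For completeness, the same plan cache hit ratio can be read from T-SQL rather than perfmon with a sketch like this (the ratio counter has to be divided by its base counter):
SELECT 100.0 * a.cntr_value / NULLIF(b.cntr_value, 0) AS plan_cache_hit_ratio_pct
FROM sys.dm_os_performance_counters AS a
JOIN sys.dm_os_performance_counters AS b
  ON a.object_name = b.object_name AND a.instance_name = b.instance_name
WHERE a.object_name LIKE '%Plan Cache%'
  AND a.counter_name = 'Cache Hit Ratio'
  AND b.counter_name = 'Cache Hit Ratio Base'
  AND a.instance_name = '_Total';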

Related

Clickhouse DB slows down on a daily basis at 10am for seemingly no reason

I have been using Clickhouse at work for analytics purposes for a while now.
I am currently running Clickhouse v22.6.3 revision 54455 on-premise on a VM with:
fast storage
200 GB of RAM
no swap
a 40-cores CPU.
I have a few TB of data, but no table bigger than 300 GB. I do not use distributed tables or replication yet, and I write frequently into Clickhouse (but I don't use deletes or updates and prefer using things like the ReplacingMergeTree engine). I also leverage the MaterializedView feature for a few tables. Let me know if you need any more context or parameters; I use a pretty standard configuration.
Now, for a few months I have been experiencing performance issues where the server significantly slows down every day at 10am, and I cannot figure out why.
Based on Clickhouse's built-in Graphite monitoring, the "symptoms" of the issue seem to be as follows:
At 10am:
On the server side:
Both load and RAM usage remain reasonable. Load goes up a little.
Disk write await time goes up (which I suspect is what leads to higher load)
Disk utilization % skyrockets to something between 90 and 100%
On Clickhouse side:
DiskSpaceReservedForMerge stays roughly the same (i.e. between 0 and 70 GB)
both OpenFileForRead and OpenFileForWrite go up by a factor of ~2
BackgroundCommonPoolTask goes slightly up, so does BackgroundSchedulePoolTask (which I found weird, because I thought this pool was dedicated to distributed operations - which I don't use) - both numbers remain seemingly reasonable
The number of active Merge tasks per minute drops significantly, but I'm unsure whether it's a consequence of slow writing or a cause of it
both insert and general querying time are multiplied by ~10 which renders the database effectively unusable even for small tasks
Restarting Clickhouse usually fixes the problem, but I obviously do not want to restart my main database every day at 10am. Most of the heavy load I put on the DB (such as data extraction and transformation) happens earlier in the morning (it ends around 7-8am) and runs fine. I do not have any heavy tasks running at 10am. The Clickhouse VM takes most of its host's resources, and I have confirmed with the DevOps team that there doesn't seem to be a problem on the host or anything else scheduled on it at that time.
Is there any kind of background tasks or process that is run by Clickhouse on a daily basis and that could have a high impact on our disk capacity? What else can I monitor to figure out what is causing this problem?
Again, let me know if I can be more thorough on our settings and the state of the DB when the "bug" occurs.
Do you use https://github.com/innogames/graphite-ch-optimizer?
Do you use TTL?
select * from system.merges;
select * from system.part_log where event_date = today() and toHour(event_time) = 10;
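If it is merge or mutation activity, a rough sketch like this against system.part_log should show which tables are doing the writing around 10am:
SELECT table, event_type, count() AS events,
       round(sum(size_in_bytes) / 1e9, 2) AS gb_written,
       round(sum(duration_ms) / 1000) AS total_seconds
FROM system.part_log
WHERE event_date = today() AND toHour(event_time) BETWEEN 9 AND 11
GROUP BY table, event_type
ORDER BY gb_written DESC
LIMIT 20;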

How do I find the cause of an IIS/SQL timeout?

I have a web service sitting on IIS that has been quite happy for months but now I'm getting timeouts and I don't know how to diagnose what the problem is.
The client sends up basic information in a 'heartbeat' message to IIS which then updates this in a SQL database (on a different server). There are 250 clients in the wild, all sending up their heartbeat every 5 minutes ... so there's only 250 rows in the table, with appropriate indexing on the column being used for the update.
Ordinarily it only takes 50-100ms to do the update, but since last week you can see that the response time in the IIS log has increased and I'm also getting timeouts too.
Nothing has changed with the setup so I don't know what I'm looking for to determine the reason. The error I get back is:
System.ServiceModel.FaultException: An error occurred while updating
the entries. See the inner exception for details.An error occurred
while updating the entries. See the inner exception for
details.Execution Timeout Expired. The timeout period elapsed prior to
completion of the operation or the server is not responding. The
statement has been terminated.The wait operation timed out
Any advice on where to start looking? I did enable the failed request tracing log in IIS but, if I'm perfectly honest, I don't know what it all means. The difference between a successful request and a failed one is that the request log stops after the 'AspNetStart' entry.
Thanks!
Mark
There are lots of reasons a service can gradually or suddenly become slow. Poor code structure can lead to things like memory leaks on the server, small enough they don't really show up or cause problems during testing, but when run over weeks/months start to stack up. Unauthorized requests could be targeting your server if this is a public-facing service, or has a link to public-facing services.
Things to look at:
Does this happen at certain times of the day or throughout the day?
Is this a load issue that starts occurring when multiple users are sending updates concurrently? 250 users isn't a lot. Has the # of users grown over the last few months or has it been relatively stable since the start?
What is the memory and CPU usage looking like on the Web server(s) and DB server?
This is the first clue to check to see whether either server is under considerable load. From there you can investigate why it might be under load, or whether it possibly needs a bit more grunt to deal with it. Look at the running processes. If these servers are managed by an IT department, some culprits can include things like virus scanners hogging resources (i.e. policy changes in the last few months have led to additional load on the servers).
What recovery model is your database set up for?
What is the size of your Tx Log (.ldf file)?
Do you have a regular scheduled database backup and index maintenance?
This is one that new projects tend to forget. An empty database is small and has no Tx Log history being recorded, but as it runs over time that Tx Log grows silently in the background, especially with the Full recovery model. Larger Tx Logs can lead to slower performance over time, especially if the log file needs to be enlarged. A good thing to check is whether the log file is set to grow by a number of bytes or by a percentage. Percentage is, I believe, the default, but it can cause exponential growth time/space issues, so it's better to set a fixed size per growth increment. You'll want regular backups being done that allow the Tx Log to reset. Ideally don't shrink the file if the log size between backups stays consistent.
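To check the current log sizes and growth settings, a rough sketch against sys.master_files looks like this (size is stored in 8 KB pages; the ALTER in the comment uses hypothetical names):
SELECT DB_NAME(database_id) AS database_name,
       name AS logical_name,
       size * 8 / 1024 AS size_mb,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + ' %'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END AS growth_setting
FROM sys.master_files
WHERE type_desc = 'LOG'
ORDER BY size_mb DESC;
-- To switch from percentage growth to a fixed increment (hypothetical names):
-- ALTER DATABASE [MyDb] MODIFY FILE (NAME = MyDb_log, FILEGROWTH = 256MB);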
How many records across all tables are being inserted or updated in a given day?
This is important to build a picture of how much the database will be tracking through the day between backups. You may have 250 clients, but every heartbeat is potentially updating a row and inserting others.
What are you using for PKs for inserted records? (Ints vs. UUIDs) If using UUIDs, are you using NEWSEQUENTIALID() or NEWID()/Guid.NewGuid()?
GUIDs can be a time bomb for indexing if done poorly. A GUID generated with NEWID() or Guid.NewGuid() will lead to considerable index fragmentation when inserting rows. Provided the GUIDs are not visible to clients, you should use NEWSEQUENTIALID(). If IDs are set via code then there are implementations you can find to generate sequential GUIDs. (It's a matter of re-arranging the parts that make up the GUID.) Regular index maintenance is a requirement when using UUID columns in indexed fields.
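As an illustration only (table, column and constraint names here are hypothetical), a sequential GUID default plus a quick fragmentation check would look like:
CREATE TABLE dbo.Heartbeat (
    HeartbeatId uniqueidentifier NOT NULL
        CONSTRAINT DF_Heartbeat_Id DEFAULT NEWSEQUENTIALID(),  -- sequential, index-friendly GUIDs
    ClientId int NOT NULL,
    ReportedAt datetime2 NOT NULL,
    CONSTRAINT PK_Heartbeat PRIMARY KEY CLUSTERED (HeartbeatId)
);

-- check fragmentation of indexes in the current database
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
ORDER BY ips.avg_fragmentation_in_percent DESC;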
Are you using Dependency Injection in your web service?
What is the lifetime scope of the DbContexts performing the updates?
This is a potential time bomb for web servers if the lifetime scope for a DbContext is set up incorrectly. You want a DbContext to be alive for no longer than it is needed. At a maximum the lifetime scope should be set to PerRequest. A DbContext set up for Singleton for instance would be tracking entities across requests. The more entities a DbContext is tracking, the slower read and update operations become. This would be a possible culprit if the web server memory usage is climbing.
Are you running an SQL Profiler?
In a test environment with nothing else touching the database, running scenarios through the application with an SQL Profiler can reveal potential issues such as unexpected queries being kicked off due to things like lazy loading. For one operation you might expect one or a small number of queries to be run, only to find dozens or even hundreds. Multiply this across concurrent requests and you have a recipe for the database server to say "Just sit down and wait, dammit!" :) Any queries you don't expect based on the code that is running should be investigated for either eager loading relationships or implementing projection. (Recommended for best performance)
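If running a Profiler trace isn't convenient, a cheap first pass is the cached query statistics DMV, which shows which statements are executing far more often than the code would suggest (a rough sketch):
SELECT TOP 20
       qs.execution_count,
       qs.total_worker_time / qs.execution_count AS avg_cpu_microseconds,
       SUBSTRING(st.text,
                 qs.statement_start_offset / 2 + 1,
                 (CASE WHEN qs.statement_end_offset = -1
                       THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset END
                  - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.execution_count DESC;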
Do the web servers get restarted periodically?
For some tricky to debug issues and memory leaks, sometimes the easiest "fix" is to schedule regular restarts of the web server. It's a hack, but compared to the considerable cost of trying to track down memory leaks or fix up inefficient code that slows down over time, it is a cheap and effective fix. (At least while you do research options to address the issues and optimize the code)
That should give you a start into things to check with the service & database.

SQL Server DB causes huge disk queue and service outages

I'm facing this situation and need some advice on how to approach it.
From time to time, usually out of business hours, but it's most likely random, the activity on a database we host goes out of scale.
This means that the disk queue grows from accepted levels (below 2) to crazy, sustained values; in the last incident, for example, the queue went to an average of 450 for 30 minutes.
Cases are not always this extreme, and maybe that makes this one easier to spot than the more subtle cases.
When this happens, it causes services depending on the DB to go down / error out / timeout, etc.
I can see it and record it in perfmon, I know it's read queue (at least in the last incident), and I know it's focused on a high-activity SQL Server DB stored exclusively on that disk, but I can't pin-point the exact cause.
I have tried collecting data with profiler, but this tends to happen:
- I monitor query durations (3 secs or more), looking for rogue queries with bad execution plans.
- If I am lucky enough to witness one of these incidents, I see just about every query being captured, which really means the server has slowed down and even good queries are taking long because something else is going on.
I believe query duration is not what I have to chase down, so maybe you could help me figure out how else to approach this? It's a process doing something, a query, a maintenance task, a backup, anything, that is causing massive IO on the disk. Although I don't really believe it's a single query alone, it looks too aggressive for that.
SQL Server logs do not show errors, nor do they point to anything that could help troubleshoot this. Some maintenance activities have logs, others don't generate any, which leads to equivocal interpretations.
So, what are the events that I should hook onto in Profiler to spot the cause of these incidents?
What else can I do, without turning to a paid tool, to pin-point the activity that SQL Server starts doing at random times that causes the disk queue to build up, ultimately going out of scale and causing service outages?
I turn to SO as a last resort, believe me. I have been studying this and looking for ideas, and have tried more than a few things, but I just have to admit I failed and am looking for your advice.
Thank you !
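One cheap thing to try alongside Profiler is the per-file I/O DMV; sampled before and during an incident, it shows which data or log files the read/write stalls are piling up on (a rough sketch):
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads, vfs.io_stall_read_ms,
       vfs.num_of_writes, vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY vfs.io_stall_read_ms DESC;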

What happens when bandwidth is exceeded with a Microsoft SQL Server connection?

The application used by a group of 100+ users was made with VB6 and RDO. A replacement is coming, but the old one is still maintained. Users moved to a different building across the street and problems began. My opinion regarding the problem has been bandwidth, but I've had to argue with others who say it's the database. Users regularly experience network slowness using the application, but also with workstation tasks in general. The application moves large audio files and indexes them on occasion, as well as other things. Occasionally the database becomes hung. We have many top-end, robust SQL Servers, so it is not a server problem.
What I figured out is that a transaction is begun on a connection but fails to complete properly because of a communication error. Updates from other connections become blocked, they continue stacking up, and users are down half a day. What I've begun doing the moment I'm told of a problem, after verifying the database is hung, is setting the database to single user then back to multiuser to clear connections. They must all restart their applications.
Today I found out there is a bandwidth limit at their new location which they regularly max out. I think in the old location there was a big pipe servicing many people, but now they are on a small pipe servicing a small number of people, which is also less tolerant of momentary spikes in bandwidth demand.
What I want to know is exactly what happens to packets, both coming and going, when a bandwidth limit is reached. Also I want to know what happens in SQL Server communication. Do some packets get dropped? Do they start arriving more out of sequence? Do timing problems occur?
I plan to start controlling such things as file moves through the application. But I also want to know what configurations are usually present on network nodes regarding transient high demand.
This is a very broad question. Networking is very key (especially in Availability Groups or any sort of mirroring setup) to good performance. When transactions complete on the SQL Server, the results are placed in the output buffer. The app then needs to 'pick up' that data, clear its output buffer and continue on. I think (without knowing your configuration) that your apps aren't able to complete the round trip because the network pipe is inundated with requests, so the apps can't get what they need to successfully finish and close out. This causes havoc, as the network can't keep up with what the apps and the SQL Server are trying to do. Then you have a 200-car pileup on a 1-lane highway.
Hindsight being what it is, there should have been extensive testing on the network capacity before everyone moved across the street. Clearly, that didn't happen so you are kind of left to do what you can with what you have. If the company can't get a stable networking connection, the situation may be out of your control. If you're the DBA, I highly recommend you speak to your higher ups and explain to them the consequences of the reduced network capacity. Often times, showing the consequences of inaction can lead to action.
Out of curiosity, is there any way you can analyze what waits are happening when the pileup happens? I'm thinking it will be something along the lines of ASYNC_NETWORK_IO, which is usually indicative that SQL is waiting on the app to come back and pick up its data.
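If you can catch a pileup live, a quick sketch against the current requests DMV will show whether the sessions really are sitting in ASYNC_NETWORK_IO:
SELECT r.session_id, r.status, r.wait_type, r.wait_time,
       r.blocking_session_id, st.text AS current_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE r.session_id > 50          -- skip most system sessions
ORDER BY r.wait_time DESC;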

Why is the CPU not maxed out?

I have an application that I'd like to make more efficient - it isn't taxing any one resource enough that I can identify it as a bottleneck, so perhaps the app is doing something that is preventing full efficiency.
The application pulls data from a database on one SQL Server instance, does some manipulation on it, then writes it to a database on another SQL Server instance - all on one machine. It doesn't do anything in parallel.
While the app is running (it can take several hours), none of the 4 CPU cores are maxed out (they hover around 40-60% utilization each), the disks are almost idle and very little RAM is used.
Reported values:
Target SQL Server instance: ~10% CPU utilization, 1.3GB RAM
Source SQL Server instance: ~10% CPU utilization, 300MB RAM
Application: ~6% CPU utilization, 45MB RAM
All the work is happening on one disk, which writes around 100KB/s during the operation, on average. 'Active time' according to Task Manager is usually 0%, occasionally flickering up to between 1 and 5% for a second or so. Average response time, again according to Task Manager, moves between 0ms and 20ms, mainly showing between 0.5 and 2ms.
Databases are notorious for IO limitations. Now, seriously, as you say:
The application pulls data from a database on one SQL Server instance,
does some manipulation on it, then writes it to a database on another
SQL Server instance - all on one machine.
I somehow get the idea this is an end-user-level machine, maybe a workstation. Your linear code (a bad idea for getting full utilization, btw, as you never run all 3 parts - read, process, write - in parallel) will be seriously limited by whatever IO subsystem you have.
But that will not come into play as long as you can state:
It doesn't do anything in parallel.
What it must do is do things in parallel:
One task is reading the next data
One task does the data processing
One task does the data writing
You can definitely max out a lot more than your 4 cores. Last time I did something like that (read / manipulate / write) we were maxing out 48 cores with around 96 or so processing threads running in parallel (and a smaller number doing the writes). But the core point is that your application must actually start using multiple CPUs.
If you do not parallelize:
You will only ever max out one core,
You basically waste time waiting for the databases on both ends. The latency while you wait for data to be read or committed is time during which you are not processing anything.
;) And once you fix that you will get IO problems. Promised.
I recommend reading How to analyse SQL Server performance. You need to capture and analyze the wait stats. These will tell you what the execution is doing that prevents it from maxing out the CPU. You already have a feeling that the workload is causing the SQL engine to wait rather than run, but only after you understand the wait stats will you be able to get a feel for what it is waiting on. Follow the linked article for specific analysis techniques.
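As a starting point, a rough sketch of the usual wait stats query (the NOT IN list below only filters a few benign idle waits and is not exhaustive):
SELECT TOP 10 wait_type,
       wait_time_ms,
       signal_wait_time_ms,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('LAZYWRITER_SLEEP', 'SLEEP_TASK', 'BROKER_TASK_STOP',
                        'XE_TIMER_EVENT', 'DIRTY_PAGE_POLL',
                        'HADR_FILESTREAM_IOMGR_IOCOMPLETION')
ORDER BY wait_time_ms DESC;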
