Sudden spikes in SQL connections causing timeouts - sql-server

For the last week I've been experiencing intermittent mini-outages lasting between 1 and 3 minutes every few hours. We've been running .NET Framework 4.7.2 and EF6 on top of Azure SQL for years, and it has served us well. Starting about 10 days ago, however, we're seeing sudden bursts of SQL connections being opened, and these bursts cause timeouts on any new requests, making our website inaccessible. For context: our platform sees about 1.1 million unique visitors every day, and traffic is always stable and predictable with no sudden surges; even during the mini-outages, traffic is perfectly normal.
The exception we get during these bursts is:
'Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.'
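For context on what this error measures: ADO.NET keeps one connection pool per distinct connection string per process, capped at 100 connections by default, and a request waits up to the connect timeout (15 seconds by default) for a free pooled connection before throwing the exception above. Below is a minimal sketch of where those knobs live, using hypothetical server and database names; raising the cap only buys headroom and does not address whatever is suddenly holding connections open.

    using System.Data.SqlClient;

    class PoolSettingsSketch
    {
        static void Main()
        {
            // Hypothetical names -- only the pool-related settings matter here.
            var builder = new SqlConnectionStringBuilder
            {
                DataSource = "tcp:myserver.database.windows.net",
                InitialCatalog = "MyDatabase",
                UserID = "appUser",
                Password = "...",
                MaxPoolSize = 200,   // the "max pool size" from the error; default is 100
                ConnectTimeout = 30  // seconds to wait for a free pooled connection; default is 15
            };

            using (var conn = new SqlConnection(builder.ConnectionString))
            {
                conn.Open(); // this is the call that throws when the pool is exhausted
            }
        }
    }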
We use a combination of StructureMap with nested containers to inject our DbContext into controllers and services, and in legacy parts of the codebase we religiously wrap our DbContext in usings. We never manually open a connection, so there are no stray SqlConnections floating around.
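For readers unfamiliar with the nested-container pattern mentioned above, here is a rough sketch of what such a registration typically looks like (StructureMap 4.x syntax with a hypothetical AppDbContext; the exact fluent calls may need adjusting for your StructureMap version). The key property is that the context is scoped to the nested container and disposed with it at the end of the request:

    using StructureMap;

    public class AppRegistry : Registry
    {
        public AppRegistry()
        {
            // One AppDbContext per nested container (i.e. per request);
            // disposing the nested container disposes the context and
            // returns its connection to the pool.
            For<AppDbContext>()
                .Use(() => new AppDbContext("name=DefaultConnection"))
                .ContainerScoped();
        }
    }

    // Per request (normally done by the MVC/Web API integration):
    // using (var nested = container.GetNestedContainer())
    // {
    //     var db = nested.GetInstance<AppDbContext>();
    //     // ... handle the request ...
    // }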
Azure metrics (Successful connections, last 48 hours)
Azure SQL usage charts (24 hours)
The spikes here don't align with the outages, so they don't seem suspicious to us.
These bursts resolve themselves within minutes. If I'm fast enough when our platform alerts fire, I can confirm using 'exec sp_who' that there is indeed an excessive number of idle connections (status=sleeping, cmd=AWAITING COMMAND) to our database. We run on 4 similarly specced VMs, and when a burst happens, the idle connections don't originate from one single machine.
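If it helps with triage, a slightly richer view than 'exec sp_who' during a burst is to group the sleeping sessions by originating host and application name, which makes it obvious whether one VM or one connection string is responsible. A rough sketch (placeholder connection string; it assumes the login has VIEW DATABASE STATE on Azure SQL Database):

    using System;
    using System.Data.SqlClient;

    class SleepingSessionCount
    {
        static void Main()
        {
            // Placeholder connection string -- point it at the same database the app uses.
            const string cs = "Server=tcp:myserver.database.windows.net;Database=MyDatabase;User ID=appUser;Password=...;";

            const string sql = @"
                SELECT host_name, program_name, COUNT(*) AS sleeping_sessions
                FROM sys.dm_exec_sessions
                WHERE status = 'sleeping' AND is_user_process = 1
                GROUP BY host_name, program_name
                ORDER BY sleeping_sessions DESC;";

            using (var conn = new SqlConnection(cs))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine("{0}\t{1}\t{2}",
                            reader["host_name"], reader["program_name"], reader["sleeping_sessions"]);
                }
            }
        }
    }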
We've been scratching our heads for the last week, especially since our use of EF6 and SQL Server has been a very smooth ride for several years. We have obviously pored over every single change we've made to the platform codebase over the last 2 weeks to spot anything suspicious, but sadly that hasn't turned up anything yet. We're also diligently squashing and tuning our heavier, un-optimized queries in a bid to fix this, but they've been serving the website fine for years, and this really only started about 10-12 days ago.
If anyone can give some insight into what might cause such sudden bursts, any advice would be greatly appreciated.
Thank you in advance

Related

How many max_connections should I have?

We run a monolithic Spring Boot application set up as multi-tenant, so every customer has their own app service and database (around 30-40 of them), but they all run on the same PostgreSQL server. We previously had problems with not having enough room for connections in our database because our connection pooler (HikariCP) was incorrectly set up; this caused chaos between the environments, where one tenant would hold 15 idle connections while another had none at all, couldn't connect to the database, and crashed.
Our fix was to raise max_connections from 150 to 350, and while it helped with the crashes, I'm unsure if it's the right approach. After doing some research, it seems that simply raising max_connections isn't the smartest move because of performance issues. Today our application sets the minimum pool size to 5 and the maximum to 10, but I'm wondering if that's overkill and whether we should go down to min 2 / max 5 as well as lowering max_connections back to around 150-200. Or is it safe to keep max_connections as it is? In that case, what hardware is recommended? Today we're running on 1 vCPU and 3.5 GB RAM, but we're planning to upgrade. What is optimal in our case? And what should we do when the number of databases (customers) increases: raise max_connections to whatever is needed, or set up a new PostgreSQL server? Thank you in advance
350 connections for 3.5 GB RAM is certainly a lot; that is only 10 MB per backend. This is not guaranteed to cause swapping/paging problems, but it wouldn't be surprising at all if it did. Also note the arithmetic: 30-40 databases each allowed up to 10 pooled connections works out to 300-400 connections in the worst case, which is why 150 left you no headroom. I would try lowering the max connection pool size. Doing that might not cause any problems at all, and if it does, they are likely to be more manageable. Having your database server swapping to death is the worst: your monitoring stops working, and your ability to make changes goes away because the terminal itself stops responding in a reasonable timeframe.

AWS Lightsail Metric graphs "No data available"

We're using an AWS Lightsail PostgreSQL database. We've been experiencing errors with our C# application timing out when connecting to the database. While trying to debug the issue, I went to look at the metric graphs in AWS and noticed that many of the graphs have frequent gaps in the data, labeled "No data available". See the image below.
This graph (and most of the other metrics) shows frequent gaps in the data. I'm trying to understand whether this is normal or could be a symptom of the problem. If I go back to a 2-week timescale, there doesn't appear to be any other strange behavior in any of the metric data; for example, I don't see a point in the past where CPU or memory usage went crazy. The issue started happening about a week ago, so I was hoping the metrics would help explain why the connections to the PostgreSQL database are failing from C#.
So I guess my question is: are those frequent gaps of "No data available" normal for an AWS Lightsail PostgreSQL database?
Other Data about the machine:
1 GB RAM, 1 vCPU, 40 GB SSD
PostgreSQL database (12.11)
In the last two weeks (the average metrics show):
CPU utilization has never gone over 20%
Database connections have never gone over 35 (usually fewer than 5, and often 0)
Disk queue depth never goes over 0.2
Free storage space hovers around 36.5 GB
Network receive throughput is mostly less than 1 kB/s (with one spike to 141 kB/s)
Network transmit throughput is mostly less than 11 kB/s, with all spikes less than 11.5 kB/s
I would love to view the AWS logs, but they start a month back, and when trying to view them they are filled with checkpoint starting/complete entries. Each page update only takes me 2 hours forward in time (and takes ~6 seconds to fetch the logs). This would require ~360 page updates, and when I tried, my auth timed out. 😢
We never figured out the reason why, but this seems to have been a problem with the AWS Lightsail DB itself. We ended up using a snapshot to create a new clone of the DB and pointing the C# servers at the new DB. The latency issues we were having disappeared, and the metric graphs looked normal (without the strange gaps).
I wish we were able to figure out the root of the problem. ATM, we are just hoping the problem does not return.
When in doubt, clone everything! 🙃

IIS response time high every 10-15 minutes for the same simple request

We have a performance issue with an AngularJS website hosted on IIS. This issue only affects our users connected via VPN (working from home).
The problem: regularly, a page that usually takes one or two seconds to load can take over 10 seconds.
This issue first appeared to be random, but we were able to reproduce it in a test environment and found out that the problem seems to arise on a very regular basis (every 10-15 minutes).
What we did: using a tool (ThousandEyes), we send the same simple GET request every minute from 12 clients to the test server. We can see in the IIS logs that this request is processed in less than 50 ms most of the time. However, every 15 minutes or so, the same request takes more than 5 seconds to process for at least 1 client. Example below: the calls made every minute by client #1 take more than 5 seconds at 21:12, 21:13, 21:14, then 21:28, 21:29, then 21:45:
The graph below shows the mean response times for the 12 clients (peak every 10-15 minutes):
For both the test and the production environments, this issue only affects users connected via VPN (but not all users connected via VPN are affected at the same time).
Any idea what could cause this behavior?
All suggestions and questions are welcome.
Notes:
Session State: InProcess. I tried Not Enabled and State Server, but we still get the same results.
Maximum Worker Processes: 1. I tried 2, no change.
Test server usage: as far as I can tell, nothing special happens every 15 minutes on the server (no special events).
Test server configuration: 2 Xeon processors @ 2.6 GHz, 8 GB RAM, 20 GB disk space, Windows Server 2016.
Test server load: almost nothing besides these 12 requests every minute from the 12 test clients.
This issue cost us a lot of time. We finally found out that a VPN server was misconfigured.
Rebuilding this server was the solution.

How do I find the cause of an IIS/SQL timeout?

I have a web service sitting on IIS that has been quite happy for months but now I'm getting timeouts and I don't know how to diagnose what the problem is.
The client sends up basic information in a 'heartbeat' message to IIS, which then updates it in a SQL database (on a different server). There are 250 clients in the wild, all sending their heartbeat every 5 minutes, so there are only 250 rows in the table, with appropriate indexing on the column used for the update.
Ordinarily it only takes 50-100 ms to do the update, but since last week you can see that the response time in the IIS log has increased, and I'm also getting timeouts.
Nothing has changed with the setup so I don't know what I'm looking for to determine the reason. The error I get back is:
System.ServiceModel.FaultException: An error occurred while updating the entries. See the inner exception for details. An error occurred while updating the entries. See the inner exception for details. Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. The statement has been terminated. The wait operation timed out
Any advice on where to start looking? I did enable failed request tracing in IIS, but to be perfectly honest I don't know what it all means. The difference between a successful request and a failed one is that the request log stops after the 'AspNetStart' entry.
Thanks!
Mark
There are lots of reasons a service can gradually or suddenly become slow. Poor code structure can lead to things like memory leaks on the server, small enough that they don't really show up or cause problems during testing, but which stack up when the service runs for weeks or months. Unauthorized requests could be targeting your server if this is a public-facing service or is linked to public-facing services.
Things to look at:
Does this happen at certain times of the day or throughout the day?
Is this a load issue that starts occurring when multiple users are sending updates concurrently? 250 users isn't a lot. Has the # of users grown over the last few months or has it been relatively stable since the start?
What is the memory and CPU usage looking like on the Web server(s) and DB server?
This is the first clue to check to see whether either server is under considerable load. From there you can investigate why it might be under load, or whether it simply needs a bit more grunt to deal with it. Look at the running processes. If these servers are managed by an IT department or the like, culprits can include things like virus scanners hogging resources (i.e. policy changes in the last few months that have led to additional load on the servers).
What recovery model is your database set up for?
What is the size of your Tx Log (.ldf file)?
Do you have a regular scheduled database backup and index maintenance?
This is one that new projects tend to forget. An empty database is small and has no Tx Log history being recorded, but as it runs over time that Tx Log grows silently in the background, especially with the Full recovery model. A larger Tx Log can lead to slower performance over time, especially when the log file needs to be enlarged. A good thing to check is whether the log file is set to grow by a number of bytes or by a percentage. Percentage is, I believe, the default, but it can cause exponential growth time/space issues, so it's better to set a fixed size per growth increment; a quick way to check the current settings is sketched below. You'll want regular backups that allow the Tx Log to reset. Ideally, don't shrink the file if the log size between backups stays consistent.
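As a concrete way to check those growth increments, the current settings are visible in sys.database_files. A small sketch (placeholder connection string); growth is expressed in 8 KB pages unless is_percent_growth is set:

    using System;
    using System.Data.SqlClient;

    class FileGrowthCheck
    {
        static void Main()
        {
            const string cs = "Server=.;Database=MyDatabase;Integrated Security=true;"; // placeholder

            const string sql = @"
                SELECT name, type_desc,
                       CASE WHEN is_percent_growth = 1
                            THEN CAST(growth AS varchar(12)) + ' %'
                            ELSE CAST(growth * 8 / 1024 AS varchar(12)) + ' MB'
                       END AS growth_increment,
                       size * 8 / 1024 AS current_size_mb
                FROM sys.database_files;";

            using (var conn = new SqlConnection(cs))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine("{0} ({1}): grows by {2}, currently {3} MB",
                            reader["name"], reader["type_desc"],
                            reader["growth_increment"], reader["current_size_mb"]);
                }
            }
        }
    }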
How many records across all tables are being inserted or updated in a given day?
This is important to build a picture of how much the database will be tracking through the day between backups. You may have 250 clients, but every heartbeat is potentially updating a row and inserting others.
What are you using for PKs for inserted records? (Ints vs. UUIDs) If using UUIDs, are you using NEWSEQUENTIALID() or NEWID()/Guid.NewGuid()?
GUIDs can be a time bomb for indexing if done poorly. A GUID generated with NEWID() or Guid.NewGuid() will lead to considerable index fragmentation when inserting rows. Provided the GUIDs are not visible to clients, you should use NEWSEQUENTIALID(). If IDs are set via code, there are implementations you can find that generate sequential GUIDs; it's a matter of re-arranging the parts that make up the GUID, and one such generator is sketched below. Regular index maintenance is a requirement when UUID columns are used in indexed fields.
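One well-known code-side approach is the "COMB" GUID: keep the random part but overwrite the last six bytes (the most significant bytes in SQL Server's uniqueidentifier ordering) with a timestamp, so new values sort roughly sequentially. A minimal sketch, not production-hardened:

    using System;

    static class CombGuid
    {
        public static Guid NewComb()
        {
            byte[] guidBytes = Guid.NewGuid().ToByteArray();

            DateTime now = DateTime.UtcNow;
            DateTime baseDate = new DateTime(1900, 1, 1);

            // Days since the base date, and time of day scaled to roughly
            // SQL Server datetime resolution (1/300th of a second).
            byte[] days = BitConverter.GetBytes(new TimeSpan(now.Ticks - baseDate.Ticks).Days);
            byte[] msecs = BitConverter.GetBytes((long)(now.TimeOfDay.TotalMilliseconds / 3.333333));

            // Reverse to big-endian so the values sort correctly on the SQL Server side,
            // then copy the timestamp into the last six bytes of the GUID.
            Array.Reverse(days);
            Array.Reverse(msecs);
            Array.Copy(days, days.Length - 2, guidBytes, guidBytes.Length - 6, 2);
            Array.Copy(msecs, msecs.Length - 4, guidBytes, guidBytes.Length - 4, 4);

            return new Guid(guidBytes);
        }
    }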
Are you using Dependency Injection in your web service?
What is the lifetime scope of the DbContexts performing the updates?
This is a potential time bomb for web servers if the lifetime scope for a DbContext is set up incorrectly. You want a DbContext to be alive for no longer than it is needed; at a maximum, the lifetime scope should be PerRequest. A DbContext set up as a Singleton, for instance, would be tracking entities across requests, and the more entities a DbContext is tracking, the slower read and update operations become. This would be a possible culprit if the web server's memory usage is climbing. A sketch of a short-lived, per-operation context follows below.
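To make the lifetime point concrete, here is a minimal sketch (a hypothetical EF6 model, loosely based on the heartbeat scenario) of keeping the context scoped to a single unit of work instead of caching one instance for the life of the process:

    using System;
    using System.Data.Entity;
    using System.Linq;

    // Hypothetical model, just enough to illustrate the lifetime point.
    public class Client
    {
        public int Id { get; set; }
        public DateTime LastHeartbeatUtc { get; set; }
    }

    public class AppDbContext : DbContext
    {
        public AppDbContext() : base("name=DefaultConnection") { }
        public DbSet<Client> Clients { get; set; }
    }

    public class HeartbeatService
    {
        // One short-lived context per operation: the change tracker stays small and
        // the underlying connection returns to the pool as soon as the using ends.
        public void RecordHeartbeat(int clientId)
        {
            using (var db = new AppDbContext())
            {
                var client = db.Clients.Single(c => c.Id == clientId);
                client.LastHeartbeatUtc = DateTime.UtcNow;
                db.SaveChanges();
            } // never keep a DbContext in a static or singleton field
        }
    }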
Are you running an SQL Profiler?
In a test environment with nothing else touching the database, running scenarios through the application with an SQL Profiler trace can reveal potential issues such as unexpected queries being kicked off by things like lazy loading. For one operation you might expect one or a small number of queries to run, only to find dozens or even hundreds. Multiply this across concurrent requests and you have a recipe for the database server to say "Just sit down and wait, dammit!" :) Any queries you don't expect based on the code that is running should be investigated, either by eager loading the relationships or by implementing projection (recommended for best performance); a short example of the difference follows below.
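To illustrate what those unexpected queries usually look like in code, here is a small sketch with a hypothetical Order/Customer model showing lazy loading (the N+1 pattern a trace exposes), eager loading, and projection:

    using System.Collections.Generic;
    using System.Data.Entity;   // EF6; the Include() lambda overload lives here
    using System.Linq;

    // Hypothetical model, only to illustrate the query shapes a trace will show.
    public class Customer
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    public class Order
    {
        public int Id { get; set; }
        public int CustomerId { get; set; }
        public virtual Customer Customer { get; set; }   // virtual => lazy-loadable
    }

    public class ShopContext : DbContext
    {
        public DbSet<Order> Orders { get; set; }
        public DbSet<Customer> Customers { get; set; }
    }

    public static class QueryShapes
    {
        public static void Compare(ShopContext db)
        {
            // Lazy loading: one query here, then one more per order the first time
            // o.Customer is touched -- the classic N+1 pattern.
            List<Order> lazy = db.Orders.ToList();

            // Eager loading: a single joined query.
            List<Order> eager = db.Orders.Include(o => o.Customer).ToList();

            // Projection: pull only the columns that are actually needed.
            var projected = db.Orders
                .Select(o => new { o.Id, CustomerName = o.Customer.Name })
                .ToList();
        }
    }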
Do the web servers get restarted periodically?
For some tricky to debug issues and memory leaks, sometimes the easiest "fix" is to schedule regular restarts of the web server. It's a hack, but compared to the considerable cost of trying to track down memory leaks or fix up inefficient code that slows down over time, it is a cheap and effective fix. (At least while you do research options to address the issues and optimize the code)
That should give you a start into things to check with the service & database.

Identifying Timeout Causes with SQL Server Profiler

We are experiencing seemingly random timeouts in a two-application (one ASP.NET and one WinForms) SQL Server system. I ran SQL Profiler during an hour-long block to see what might be causing the problem, and then isolated the times when the timeouts were occurring.
There are a large number of reads, but there is no large difference in the reads between when the timeout errors occur and when they don't. There are virtually no writes during this period (primarily because everyone is getting timeouts and can't write).
Example:
A timeout occurs at 11:37. There are an average of 1,500 transactions a minute leading up to the timeout, with about 5,709,219 reads.
That seems high, EXCEPT that during a period in between timeouts (over a ten-minute span), there are just as many transactions per minute and the reads are just as high. The reads do spike a little before the timeout (jumping to over 6,005,708), but during the non-timeout period they go as high as 8,251,468. The timeouts occur in both applications.
The bigger problem here is that this only started occurring in the past week and the application has been up and running for several years. So yes, the Profiler has given us a lot of data to work with but the current issue is the timeouts.
Is there something else that I should be possibly looking for in the Profiler or should I move to Performance Monitor (or another tool) over on the server?
One possible culprit might be the database size. The database is fairly large (>200 GB), but the AutoGrow setting was set to 1 MB. Could it be that SQL Server is resizing itself and that activity doesn't show up in the Profiler trace?
Many thanks
Thanks to the assistance here, I was able to identify a few bottlenecks but I wanted to outline my process to possibly help anyone going through this.
The #1 problem turned out to be a high number of LCK_M_S (shared lock) waits, found via SQLDiag and other tools.
Run a Profiler trace over two different periods of time. Comparing durations for similar statements led me to find that certain UPDATE calls were consistently taking the same amount of time: over 10 seconds.
Further investigation found that these UPDATE stored procs were updating a table with a trigger that was taking too much time. Since a trigger may hold locks on the table while it completes, it was affecting every other query. (See the comment section: I originally stated, incorrectly, that the trigger would always lock the table; in our case, the trigger was preventing the lock from being released.)
Watch the use of Triggers for doing major updates.
