We run a monolithic Spring Boot application set up as multi-tenant: every customer has their own app service and database (around 30-40 of them), but they all run on the same PostgreSQL server. We previously had problems running out of connection slots in the database because our connection pooler (HikariCP) was set up incorrectly. That caused chaos between the environments: one tenant would hold 15 idle connections while another couldn't get any at all, couldn't reach the database, and the application crashed.
Our fix was to raise max_connections from 150 to 350, and while that stopped the crashes, I'm not sure it's the right approach. From what I've read, simply setting max_connections higher isn't the smartest fix because of the performance cost. Today the application uses a minimum pool size of 5 and a maximum of 10, but I'm wondering if that's overkill and whether we should go down to min 2 / max 5 and lower max_connections back to around 150-200. Or is it safe to keep max_connections where it is? If so, what hardware is recommended for that? Today we're running on 1 vCPU and 3.5 GB RAM, but we're planning to upgrade. What is optimal in our case? And what do we do when the number of databases (customers) grows: keep raising max_connections to whatever is needed, or set up a new PostgreSQL server? Thank you in advance.
350 connections on 3.5 GB of RAM is certainly a lot; that is only 10 MB per backend. It isn't guaranteed to cause swapping/paging problems, but it wouldn't be surprising at all if it did. I would try lowering the maximum connection pool size first. That might not cause any problems at all, and if it does, they are likely to be more manageable. Having your database server swapping to death is the worst case: your monitoring stops working, and your ability to make changes goes away because the terminal itself stops responding in a reasonable timeframe.
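For illustration, a "min 2 / max 5" per-tenant HikariCP pool would look roughly like this; it is only a sketch with assumed names and numbers, not your actual configuration:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public final class TenantDataSourceFactory {

    // One pool per tenant database. With ~40 tenants on one PostgreSQL server,
    // the per-tenant maximum is what really drives max_connections:
    // 40 tenants x 5 connections = 200 in the worst case.
    public static HikariDataSource create(String jdbcUrl, String user, String password) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);            // e.g. jdbc:postgresql://dbhost:5432/tenant_db (hypothetical)
        config.setUsername(user);
        config.setPassword(password);
        config.setMinimumIdle(2);              // keep a couple of warm connections per tenant
        config.setMaximumPoolSize(5);          // hard cap per tenant
        config.setIdleTimeout(60_000);         // ms; release idle connections fairly quickly
        config.setConnectionTimeout(10_000);   // ms; fail fast instead of hanging when the pool is empty
        return new HikariDataSource(config);
    }
}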
I have been using Clickhouse at work for analytics purposes for a while now.
I am currently running Clickhouse v22.6.3 revision 54455 on-premise on a VM with:
fast storage
200 GB of RAM
no swap
a 40-core CPU.
I have a few TB of data, but no table bigger than 300 GB. I do not use distributed tables or replication yet, and I write frequently into Clickhouse (but I don't use deletes or updates, and I prefer engines like ReplacingMergeTree). I also use the MaterializedView feature for a few tables. Let me know if you need any more context or parameters; I use a pretty standard configuration.
Now, for a few months I have been experiencing performance issues where the server significantly slows down every day at 10am, and I cannot figure out why.
Based on Clickhouse's built-in Graphite monitoring, the "symptoms" of the issue seem to be as follows:
At 10am:
On the server side:
Both load and RAM usage remain reasonable. Load goes up a little.
Disk write await time goes up (which I suspect is what leads to higher load)
Disk utilization % skyrockets to something between 90 and 100%
On Clickhouse side:
DiskSpaceReservedForMerge stays roughly the same (i.e. between 0 and 70 GB)
both OpenFileForRead and OpenFileForWrite go up by a factor of ~2
BackgroundCommonPoolTask goes up slightly, and so does BackgroundSchedulePoolTask (which I found weird, because I thought that pool was dedicated to distributed operations, which I don't use); both numbers remain seemingly reasonable
The number of active merge tasks per minute drops significantly, but I'm unsure whether that is a consequence of the slow writes or the cause of them
both insert and general query times are multiplied by ~10, which renders the database effectively unusable even for small tasks
Restarting Clickhouse usually fixes the problem, but I obviously do not want to restart my main database every day at 10am. Most of the heavy load I put on the DB (such as data extraction and transformation) happens earlier in the morning (and ends around 7-8am) and runs fine. I do not have any heavy tasks running at 10am. The Clickhouse VM takes most of its host's resources, and I have confirmed with the DevOps team that there doesn't seem to be a problem on the host or anything else scheduled on it at that time.
Is there any kind of background task or process that Clickhouse runs on a daily basis that could have a high impact on our disk capacity? What else can I monitor to figure out what is causing this problem?
Again, let me know if I can be more thorough on our settings and the state of the DB when the "bug" occurs.
Do you use https://github.com/innogames/graphite-ch-optimizer?
Do you use TTL?
select * from system.merges;
-- replace the timestamps with a window around the 10am slowdown on the day it happens:
select * from system.part_log where event_time between '2022-07-18 09:30:00' and '2022-07-18 10:30:00';
For the last week I've been experiencing intermittent mini-outages lasting between 1 and 3 minutes every few hours. We've been running .NET Framework 4.7.2 and EF6 on top of Azure SQL for years, and it has served us well. Starting about 10 days ago, however, we're seeing sudden bursts of SQL connections being opened. These bursts cause timeouts on any new requests, making our website inaccessible. For context: our platform sees about 1.1 million unique visitors every single day, and traffic is always very stable and predictable with no sudden bursts; even during the mini-outages, traffic is perfectly normal.
The exception we get during these bursts is:
'Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.'
We use a combination of StructureMap with nested containers to inject our DbContext into controllers and services, and in legacy parts of the codebase we religiously consume our DbContext inside using blocks. We never manually open a connection (so there are no SqlConnections floating around).
Azure metrics (successful connections, last 48 hours)
Azure SQL usage charts (24 hours)
The spikes here don't align with the outages, so they don't seem suspicious to us.
These bursts automatically resolve themselves within minutes. If I'm fast enough when our platform alerts notify us, I can confirm using 'exec sp_who' that there is indeed an excessive number of idle connections (status=sleeping, cmd=AWAITING COMMAND) to our database. We run constantly on 4 similarly specced VMs, and when a burst happens, the idle connections don't originate from one single machine.
We've been scratching our heads for the last week, especially since the way we've been using EF6 and SQL Server has been a very smooth ride for several years. We have obviously scoured every single change we've made to the platform codebase over the last 2 weeks to spot anything suspicious, but sadly that hasn't turned up anything yet. We're also diligently squashing and tuning all our heavier, un-optimized queries in a bid to fix this, but they've been serving the website fine for years, and this really only started about 10-12 days ago.
Can anyone give some insight into what might cause such very sudden bursts? Any advice would be greatly appreciated.
Thank you in advance
We are running a REST API-based Spring Boot application using AWS Aurora as the database. Our application connects to read-only Aurora MySQL RDS instances.
We are load testing it. Initially we have one database instance, with autoscaling in place that is triggered on high CPU.
We expect that if we get some throughput X with one DB instance, we should get approximately 1.8X when autoscaling kicks in, and connections should be distributed equally across the newly created database instances.
But that is not happening; instead, DB connections go up and down erratically on both database instances. As a result, the load is not distributed equally and we are not getting the desired throughput. Sometimes one database runs at 100% CPU while the other is still at 20%, and a few minutes later it is reversed.
Below is the database connection configuration:
Driver = com.mysql.jdbc.Driver
Maximum active connections = 100
Max age = 300000
Initial pool size = 10
Tomcat JDBC pool is used for connection pooling.
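To make that concrete, it maps roughly onto Tomcat JDBC PoolProperties like this (a sketch; the URL is a made-up reader endpoint and anything not listed above is an assumption):

import org.apache.tomcat.jdbc.pool.DataSource;
import org.apache.tomcat.jdbc.pool.PoolProperties;

public final class ReaderPool {

    public static DataSource create() {
        PoolProperties p = new PoolProperties();
        p.setUrl("jdbc:mysql://mycluster.cluster-ro-xxxx.us-east-1.rds.amazonaws.com:3306/mydb"); // hypothetical reader endpoint
        p.setDriverClassName("com.mysql.jdbc.Driver");
        p.setMaxActive(100);     // "Maximum active connections = 100"
        p.setInitialSize(10);    // "Initial pool size = 10"
        p.setMaxAge(300_000);    // "Max age = 300000" (ms): connections older than 5 minutes are closed on return
        DataSource ds = new DataSource();
        ds.setPoolProperties(p);
        return ds;
    }
}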
NOTE:
1) We have also disabled JVM network DNS caching (see the sketch after these notes).
2) We also tried refreshing the database connections every 5 minutes, even the active ones.
3) We have tried everything suggested by AWS, but nothing is working.
4) We have even written a Lambda function to update Route 53 when a new DB instance comes up, to avoid cluster endpoint caching, but we still see the same issue.
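For reference, the JVM DNS caching in note 1 was disabled along these lines (a rough sketch; the exact property values are assumptions rather than our actual code):

import java.security.Security;

public final class DnsCacheSettings {

    // Must run before the first DNS lookup (e.g. at the very start of main()).
    // A TTL of 0 disables positive caching entirely; a small value such as "5"
    // is often enough when the endpoint's DNS TTL is 5 seconds.
    public static void disableJvmDnsCaching() {
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }
}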
Can anyone please advise on the best practice here? Currently we cannot take this into production.
This is not a great answer, but since you haven't gotten any replies yet, here are some thoughts.
1) The behavior you are seeing looks like the bad routing logic of a load balancer.
This is probably no surprise to you, but it used to be much more common with small web server deployments, especially with long-running queries. With connection pooling, you end up mirroring that situation.
2) Taking this assumption forward, we need to guess how Amazon chose to balance traffic to read-only replicas.
Even in their white paper, they don't mention how they are doing routing: https://www.allthingsdistributed.com/files/p1041-verbitski.pdf
Likely options are route53 or an NLB.
My best guess would be that they are using an NLB. NLBs only became available in Q3 2017 and Aurora launched about two years earlier, but it is still a reasonable guess.
NLBs would let us balance based on least connections (far better than round robin).
3) Validating assumptions
If Route 53 is being used, we should be able to use DNS to find out.
I did a dig against the Route 53 endpoint and got an answer:
dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-0.yyy.us-east-1.rds.amazonaws.com.
zzz-0.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.8.33
I did it again and got a different answer.
dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-2.yyy.us-east-1.rds.amazonaws.com.
zzz-2.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.7.97
What you can see is that the read-only endpoint gives me a CNAME pointing to one of the individual instance endpoints.
zzz is the name of my cluster, xxx came from my CloudFormation stack, and yyy comes from Amazon.
Note: zzz-0 and zzz-2 are the two read-only replicas.
What we can see here is that Route 53 is doing the load balancing.
4) Route53 Load Balancing
They are likely setting up Route 53 with round robin across all healthy read-only replicas.
The TTL is likely 5 s.
Unhealthy nodes will get removed, but there is no balancing based on load or connection count.
5) Ramifications
A) Using the read-only endpoint can only balance traffic away from unhealthy instances.
B) DB pools hold connections for a long time, which means new read replicas won't get touched.
If we have a small number of app servers, we will be unbalanced, and there isn't much we can do about that.
6) Thoughts on what you can do
A) Verify with dig yourself that you are getting correct DNS resolution that keeps rotating between replicas every 5 s.
If you aren't, that is something you need to fix.
B) Periodically recycle DB connections
New replicas will then get used, and while you will still be somewhat unbalanced, the constant churn helps.
What is critical, though, is that you MUST NOT have all your clients recycle at the same time; otherwise you run the risk of them all resolving to the same replica at the same time. I would suggest a random TTL per client (within a min/max window), as in the sketch at the end of this list.
C) Manage it yourself
Summary: when you connect, connect directly to the read replica with the fewest connections / lowest CPU.
How you do this is not entirely simple. I would suggest a Lambda function that keeps this connection string in a queryable location and updates it at some frequency; I would make the frequency of updating the preferred DB about 1/10 of the frequency at which you recycle DB connections. You could add logic so that if the DBs are running at similar load you hand out the read-only endpoint, and only hand out an explicit instance endpoint when there is significant inequity.
One caution: when a new instance comes up, be careful that all the clients don't pile onto it at once.
D) Increase the number of clients or the number of read-only replicas
Both of these decrease the chance that two instances end up with significantly different load.
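A minimal sketch of the "random TTL per client" idea from point B, assuming the Tomcat JDBC pool mentioned in the question (the 3-7 minute window is just an example):

import java.util.concurrent.ThreadLocalRandom;
import org.apache.tomcat.jdbc.pool.PoolProperties;

public final class JitteredConnectionAge {

    // Each application instance picks its own max connection age at startup so
    // the whole fleet does not recycle (and re-resolve the reader endpoint) at
    // the same moment.
    public static void applyJitteredMaxAge(PoolProperties p, long minMs, long maxMs) {
        long maxAge = ThreadLocalRandom.current().nextLong(minMs, maxMs + 1);
        p.setMaxAge(maxAge);   // connections older than this are closed when returned to the pool
    }
}

For example, applyJitteredMaxAge(p, 3 * 60_000L, 7 * 60_000L) recycles connections somewhere between 3 and 7 minutes of age, staggered per client.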
The application used by a group of 100+ users was written in VB6 with RDO. A replacement is coming, but the old one is still maintained. The users moved to a different building across the street, and problems began. My opinion has been that the problem is bandwidth, but I've had to argue with others who say it's the database. Users regularly experience network slowness in the application, and in general workstation tasks as well. The application moves large audio files and occasionally indexes them, among other things.

Occasionally the database becomes hung. We have many high-end, robust SQL Servers, so it is not a server problem. What I figured out is that a transaction is begun on a connection but fails to complete properly because of a communication error. Updates from other connections become blocked, they keep stacking up, and users are down for half a day. What I've begun doing the moment I'm told of a problem, after verifying the database is hung, is to set the database to single-user and then back to multi-user to clear the connections. Everyone must then restart their application.

Today I found out there is a bandwidth limit at the new location which they regularly max out. I think in the old location there was a big pipe servicing many people, but now they are on a small pipe servicing a small number of people, which is also less tolerant of momentary spikes in demand.
What I want to know is exactly what happens to packets, both coming and going, when a bandwidth limit is reached. I also want to know what happens to SQL Server communication. Do some packets get dropped? Do they start arriving out of sequence? Do timing problems occur?
I plan to start controlling things like file moves through the application, but I also want to know what configurations are usually present on network nodes for handling transient spikes in demand.
This is a very broad question. Networking is key to good performance (especially with Availability Groups or any sort of mirroring setup). When transactions complete on the SQL Server, the results are placed in the output buffer. The app then needs to 'pick up' that data, clear the output buffer, and continue on. I think (without knowing your configuration) that your apps aren't able to complete the round trip because the network pipe is inundated with requests, so the apps can't get what they need to finish and close out successfully. This causes havoc, as the network can't keep up with what the apps and the SQL Server are trying to do. Then you have a 200-car pileup on a one-lane highway.
Hindsight being what it is, there should have been extensive testing of the network capacity before everyone moved across the street. Clearly that didn't happen, so you are left to do what you can with what you have. If the company can't provide a stable network connection, the situation may be out of your control. If you're the DBA, I highly recommend you speak to your higher-ups and explain the consequences of the reduced network capacity. Often, showing the consequences of inaction can lead to action.
Out of curiosity, is there any way you can analyze what waits are happening when the pileup occurs? I suspect it will be something along the lines of ASYNC_NETWORK_IO, which usually indicates that SQL Server is waiting on the app to come back and pick up its data.
We are getting the following error on a certain database occasionally under moderate load.
"System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached."
I have combed through the code, and we are closing the connections in finally blocks as we should, except in a few cases which we have established are called very infrequently. We will fix those pieces of code in our next release, but to address the current production issue I am suggesting increasing the max pool size to 300. The maximum number of concurrent users we are currently seeing is around 110, which is obviously over the default pool size (100).
I am also suggesting making sure all our connection strings to a particular SQL Server instance are identical to avoid creating multiple connection pools unnecessarily. I am hoping that we can use the USE [Database] statement before our actual SQL queries when we need to switch databases within a single SQL Server instance.
Do you guys have any ideas, pointers, suggestions, or gotchas for us to watch out for?
You must eliminate the connection leaks. If the cause of the pool exhaustion is leaks, increasing the limit to 300 will just delay the inevitable. If you leak one connection in 10,000 calls (i.e. "very infrequently") and you have 110 concurrent requests at, say, 5 seconds per call, you are leaking at a rate of about one connection every 8 minutes, which will drain the pool in about 13 hours. The timeouts will start showing up much earlier than that, though, as the available pool shrinks.
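Spelling that arithmetic out (110 concurrent requests at ~5 s per call, one leak per 10,000 calls, default pool of 100):

\[
\frac{110\ \text{requests}}{5\ \text{s/request}} \approx 22\ \text{calls/s},\qquad
\frac{10{,}000\ \text{calls}}{22\ \text{calls/s}} \approx 455\ \text{s} \approx 7.6\ \text{min per leaked connection},\qquad
100 \times 7.6\ \text{min} \approx 13\ \text{hours}.
\]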
If you have hard evidence that it's not leaks that are the root cause but indeed the rate of calls vs. pool size, then you should increase the pool size. Whatever pool size you decide to use, if each request requires a connection for its whole duration, then you need to throttle/queue the HTTP accepts so they don't exceed your pool size. If you don't, you can still encounter spikes that exhaust the pool.
You may also consider using a more resilient connection factory, one that retries and falls back to a non-pooled connection if the pool is drained. Of course this goes hand in hand with my prior point: if you calibrate your max HTTP accept count to match the pool size, then the pool cannot be exhausted (unless you leak, which puts you back at square one). I would not recommend this, though; I think it is much better to queue up requests in http.sys territory than in the application's resource-allocation territory (i.e. throttle the maximum number of accepted HTTP calls).
And last but not least, reduce the duration of each call. If your calls take 5 seconds on average, then you're seeing 110 concurrent connections at a mere 22 requests per second. If you reduce the call duration to 1 second by eliminating SQL bottlenecks, you'll be able to service 110 requests per second before hitting the same resource cap (110 concurrent requests); that is a 5x traffic increase. The biggest culprit is usually table scans; make sure all your queries use sensible SQL and have an optimal data access path. As David says, SQL Profiler is your friend.
You can also use SqlConnection.ChangeDatabase to change the database.