I want to use DB connection pooling in Mule with the connector below:
<db:mysql-config name="dbConfig" host="localhost" port="3306" user="root"
password="" database="esb" doc:name="MySQL Configuration">
<db:pooling-profile maxPoolSize="17" minPoolSize="13" acquireIncrement="1"/>
</db:mysql-config>
Reference - https://docs.mulesoft.com/mule-runtime/4.3/tuning-pooling-profiles
My first question: suppose the application at peak load sends 4000 queries/sec to the database, and the turnaround time is 0.4 sec per query. What should maxPoolSize be set to?
Second question: according to the Mule documentation, the acquireIncrement property "Determines how many connections at a time to try to acquire when the pool is exhausted". Does this mean that if maxPoolSize is exhausted, a new connection gets created and later dropped? Is that a fair understanding?
If a query takes 0.4 seconds, then one connection can theoretically execute 2.5 queries per second. So for 4000 queries per second you would need at least 4000/2.5 = 1600 connections. You should add some headroom to cover peaks, longer queries, etc. How much really depends on the usage pattern of your application; try to measure it in load testing or by tracking real-life usage. To be on the safe side I would guess no less than 2000.
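The arithmetic above is Little's law (concurrency ≈ arrival rate × service time). A minimal sketch, where the 4000 q/s and 0.4 s figures come from the question and the 25% headroom factor is my own assumption to tune from load testing:

```python
import math

def estimate_pool_size(queries_per_sec, seconds_per_query, headroom=1.25):
    # Little's law: connections busy at any instant = rate * service time.
    busy = queries_per_sec * seconds_per_query
    return math.ceil(busy * headroom)

print(estimate_pool_size(4000, 0.4))  # 1600 busy connections + 25% headroom -> 2000
```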
acquireIncrement is used when there are no idle connections available, the total number of connections has not yet reached maxPoolSize, and a request for a new connection arrives. You might want to create 3 connections instead of just 1 because of the usage pattern expected for your application. The total number of connections will never exceed maxPoolSize in any case. For example, with the configuration you shared, there might be 14 connections in use when a new query arrives; if acquireIncrement is 2, the pool will create 2 additional connections, for a total of 16.
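A toy sketch of that capping behaviour (the function name is mine, not a c3p0 or Mule API; it only models the growth rule described above):

```python
def connections_after_acquire(in_use, acquire_increment, max_pool_size):
    # The pool tries to grow by acquireIncrement, but the total is always
    # capped at maxPoolSize; connections are not created and then dropped.
    return min(in_use + acquire_increment, max_pool_size)

print(connections_after_acquire(14, 2, 17))  # -> 16
print(connections_after_acquire(16, 2, 17))  # -> 17, never 18
```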
Related
I'm pretty new to the world of databases and I was wondering: what is the difference between "requests per second (RPS)" and "I/O per second (IOPS)"?
From what I know, one request can be made up of several I/Os (input/output operations). Is this a correct understanding?
Also, in the context of databases, are concurrent connections the same as concurrent users? I would assume that every connection equates to a user. Is this right?
Please feel free to correct me or throw me examples, links, resources to read up on. I'm a total newbie at this, and would absolutely love to ramp up on my database basics!
We are running a REST API based spring boot application using AWS Aurora as Database. Our application connects to read-only Aurora MySQL RDS instances.
We are doing load testing on it. Initially we have one database instance, and we have autoscaling in place, triggered on high CPU.
Now we expect that if we get some X throughput with one DB instance, then we should get approximately 1.8X when autoscaling happens, with connections distributed equally among the newly created database instances.
But that is not happening; instead, DB connections go up and down erratically on both database instances. As a result our load is not distributed equally and we do not get the desired throughput. Sometimes one database runs at 100% CPU while the other is still at 20%, and a few minutes later it is reversed.
Below is the database connection configuration:
Driver - com.mysql.jdbc.Driver
Maximum active connections=100
Max age = 300000
Initial pool size = 10
Tomcat jdbc pool is used for connection pooling
NOTE:
1) We have also disabled jvm network DNS caching.
2) We also tried refreshing the database connections every 5 minutes, even the active ones.
3) We have tried everything suggested by AWS but nothing is working.
4) We have even written Lambda code to update Route 53 when a new DB instance comes up, to avoid cluster endpoint caching, but we still see the same issue.
Can anyone please help with the best practice here? Currently we cannot take this into production.
This is not a great answer, but since you haven't gotten any replies yet, here are some thoughts.
1) The behavior you are seeing replicates bad routing logic of load balancers
This is no surprise to you, but it used to be much more common with small web server deployments, especially with long-running queries. With connection pooling, you mirror this situation.
2) Taking this assumption forward, we need to guess how Amazon chose to balance traffic to read-only replicas.
Even in their white paper, they don't mention how they are doing routing: https://www.allthingsdistributed.com/files/p1041-verbitski.pdf
Likely options are route53 or an NLB.
My best guess would be that they are using an NLB. NLBs only became available in Q3 2017, two years after Aurora launched, but it is still a reasonable guess.
An NLB would let them balance based on least connections (far better than round robin).
3) Validating assumptions
If route53 is being used, then we would be able to use DNS to find out.
I did a dig against the Route 53 endpoint and got an answer:
dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-0.yyy.us-east-1.rds.amazonaws.com.
zzz-0.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.8.33
I did it again and got a different answer.
dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-2.yyy.us-east-1.rds.amazonaws.com.
zzz-2.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.7.97
What you can see is that the read-only endpoint gives me a CNAME result pointing at one of the replicas.
zzz is the name of my cluster, xxx came from my CloudFormation stack, and yyy comes from Amazon.
Note: zzz-0 and zzz-2 are the two read-only replicas.
From this we can see that Route 53 is doing our load balancing.
4) Route53 Load Balancing
They are likely setting up Route 53 with round robin across all healthy read-only replicas.
The TTL is likely 5 s.
Unhealthy nodes will get removed from rotation, but there is no balancing based on load or connection count.
5) Ramifications
A) Using the read-only endpoint can only balance traffic away from unhealthy instances.
B) DB pools keep connections alive for a long time, which means that new read replicas won't be touched.
If we have a small number of servers, we will be unbalanced, and there isn't much we can do about it.
6) Thoughts on what you can do
A) Verify for yourself with dig that DNS resolution keeps rotating between replicas every 5 s.
If it doesn't, that is something you need to fix.
B) Periodically recycle DB clients.
New replicas will then get used; you will still be unbalanced, but recycling keeps the distribution changing.
What is critical, though, is that you MUST NOT have all your clients recycle at the same time; otherwise you run the risk of them all landing on the same replica. I would suggest a random TTL per client (within a min/max).
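For example, a sketch of per-client jitter (the 5-minute midpoint mirrors the Max age = 300000 ms in the question; the jitter window is an assumption):

```python
import random

def jittered_lifetime_seconds(base=300, jitter=60):
    # Each client draws its own connection lifetime from [base - jitter, base + jitter]
    # so recycles are spread out instead of happening in lockstep.
    return random.randint(base - jitter, base + jitter)

print(jittered_lifetime_seconds())  # some value between 240 and 360
```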
C) Manage it yourself
Summary: when you connect, connect directly to the read replica with the fewest connections/lowest CPU.
Doing this is not entirely simple. I would suggest a Lambda function that keeps this connection string in a queryable location and updates it at some frequency. I would make the frequency of updating the preferred DB about 1/10 of the frequency at which you recycle DB connections. You could add logic so that when the DBs are loaded similarly you hand out the read-only endpoint, and only give an explicit instance when there is significant inequity.
I would also caution that when a new instance comes up, you want to be careful that all clients do not pile onto it at once.
D) Increase number of clients or number of read only copies
Both of these decrease the chance that two boxes will see significantly different load.
I switched from a join- and view-based strategy on my old dedicated server to a many-small-queries strategy for Google Cloud. For me this is also easier to maintain, and on my dev machine there was no noticeable performance difference. But on App Engine with Cloud SQL it is really slow.
For example, querying the last 50 articles takes 4-5 seconds; on my dev machine, 160 ms. Each article needs at least 12 queries, 15 on average. That is ~750 queries, and when I monitor Cloud SQL I notice that it always caps at ~200 queries per second. The CPU peaks at just 20%; I have just a db-n1-standard-1 with SSD. 200 queries per second also means that fetching the last 100 articles will take 8-9 seconds, and so on.
I already tried setting the App Engine instance class to F4 to see if that would change anything. It didn't; the numbers were the same. I haven't tried increasing the DB instance because I can't see that it is at its limit.
What to do?
Software: I use Go with an unlimited MySQL connection pool.
EDIT: I even changed to the db-n1-standard-2 instance and there was no difference :(
EDIT2: I tried some changes over the weekend (1500 IOPS, 4 cores, etc.) but nothing showed the expected improvement. The usage graphs already indicated that there is no "hardware" limit. I did manage to isolate the slow query though... it was a super simple one where I query the country name via country ISO2 and language ISO3; both keys are indexed, and still it takes 50 ms EACH. So I just cached ALL countries in Memcache and was done.
Google Cloud SQL uses GCE VM instances, so things that apply to GCE apply to Cloud SQL.
When you create a db-n1-standard-1 instance, your network throughput is capped at 250 MB/s by your CPU, but your read/write (a) disk throughput and (b) IOPS are capped by the storage capacity and type, which is:
Write: 4.8 MB/s (a) | 300 (b)
Read: 4.8 MB/s (a) | 300 (b)
You can view what, if anything, is limiting you in your instance details:
https://console.cloud.google.com/sql/instances/[INSTANCE_NAME]
If you want to increase the performance of your instance, raise the number of its vCPUs and its storage capacity/type as suggested above.
I recently experienced a sharp, short-lived increase in the load on my service on Google App Engine. The load went from ~1-2 req/second to about 10 req/second for a couple of hours. My number of dynamic instances scaled up pretty quickly, but in the process I did get a number of "Request waited too long" timeout messages.
So the next time around, I would like to be prepared with enough idle instances to handle my load. But now the question is, how do I determine how many is adequate. I expect a much larger burst in load this time - from practically nothing to an average of 500 requests/second, possibly with a peak of 3000. This is to last between 15 minutes and 1 hour.
My main goal is to ensure that the information passed via HTTP Post is saved to the datastore by means of a single write.
Here are the steps I have taken to prepare for the burst:
I have pruned the fast path to disable analytics and other reporting, which typically generate 2 urlfetch requests.
The datastore write is to be deferred to a taskqueue via the deferred library
What I would like to know is:
1. Tips/insights into calculating how many idle instances one would need per N requests/second.
2. It seems that the maximum throughput of a task queue is 500/second. Is this the rate at which you can push tasks? If not, is there a cap on pushing? I'm guessing not, since these are probably just datastore writes, but I would like to be sure.
My fallback plan if I am not confident of saving all of the information for this flash mob is to set up a beefy Amazon EC2 instance, run a web server on it and make my clients send a backup request to this server.
You must understand that idle instances are only used while new frontend instances are being spun up. This means they are only used during traffic increases; when traffic is steady they are not used.
Now if your instance needs 20 sec to spin up and can handle 10 req/sec of steady traffic, and your traffic INCREASE is 5 req/sec, then you'll need 20 * 5 / 10 = 10 idle instances if you don't want any requests dropped.
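That calculation as a sketch (the numbers are the example's; the function name is mine):

```python
import math

def idle_instances_needed(spinup_seconds, rps_per_instance, traffic_increase_rps):
    # While new instances spin up, idle instances must absorb the extra traffic:
    # extra requests accumulated during spin-up / capacity of one instance.
    return math.ceil(spinup_seconds * traffic_increase_rps / rps_per_instance)

print(idle_instances_needed(20, 10, 5))  # -> 10, as in the example above
```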
What you should do is:
Maximize instance throughput (number of requests it can handle): optimize code, use async db operations and enable Concurrent Requests.
Minimize your instance startup time. This is important because idle instances are used while new instances spin up, and the time it takes to spin up a new instance directly relates to how many idle instances you need. If you use Java, this means getting rid of heavy frameworks that do classpath scanning (Spring, etc.).
Third, the number of frontend instances needed is VERY application specific. But since you have already had a traffic increase, you should know how many requests per second your frontend instance can handle.
Edit: There is one more obvious thing you should do: HTTP caching. GAE has a transparent HTTP cache which can be simply controlled via Cache-Control headers.
Also, if analytics has a big performance impact on your server, consider using client side analytics services (like Google Analytics). They also work for devices.
We are getting the following error on a certain database occasionally under moderate load.
"System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached."
I have combed through the code and we are closing connections in finally blocks as we should, except in a few cases which we have established are called very infrequently. We will fix those pieces of code in our next release, but to address the current production issue I am suggesting increasing the max pool size to 300. The maximum number of concurrent users we currently see is around 110, which is obviously over the default pool size (100).
I am also suggesting making sure all our connection strings to a particular SQL Server instance are identical to avoid creating multiple connection pools unnecessarily. I am hoping that we can use the USE [Database] statement before our actual SQL queries when we need to switch databases within a single SQL Server instance.
Do you guys have any ideas, pointers, suggestions, or gotchas for us to watch out for?
You must eliminate the connection leaks. If the cause of the pool exhaustion is leaks, increasing it to 300 is just going to delay the inevitable. If you leak one connection per 10,000 calls (i.e. "very infrequently") and you have 110 concurrent requests at, say, 5 seconds a call, you are leaking at a rate of about one connection every 8 minutes, which will drain the pool in about 13 hours. The timeouts will start showing up much earlier, though, as the available pool shrinks.
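The arithmetic behind those numbers, as a sketch:

```python
def hours_until_pool_drained(pool_size, calls_per_leak, concurrent_requests, seconds_per_call):
    calls_per_second = concurrent_requests / seconds_per_call  # ~22 req/s in this example
    seconds_per_leak = calls_per_leak / calls_per_second       # one leak every ~455 s (~8 min)
    return pool_size * seconds_per_leak / 3600

print(round(hours_until_pool_drained(100, 10000, 110, 5)))  # -> 13 (hours)
```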
If you have hard evidence that it's not the leaks that are the root cause, but indeed the rate of calls vs. the pool size, then you should increase the pool size. Whatever pool size you decide to use: if your requests each require a connection for the whole duration of the request, then you need to throttle/queue the HTTP accepts so they do not exceed your pool size. If not, you can still encounter spikes that exhaust the pool.
You may also consider a more resilient connection factory, one that retries and attempts a non-pooled connection if the pool is drained. Of course this goes hand in hand with my prior point: if you calibrate your max HTTP accept count to match the pool size, then the pool cannot be exhausted (unless you leak, and then you're back to square one). I would not recommend this, though; I think it is much better to queue up requests in http.sys territory than in the application resource-allocation territory (i.e. throttle the max accepted HTTP calls).
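A minimal sketch of throttling at the application edge so that in-flight work never exceeds the pool size (the names and the limit of 100 are assumptions, not a specific framework's API):

```python
import threading

POOL_SIZE = 100
gate = threading.BoundedSemaphore(POOL_SIZE)

def handle_request(work):
    # Blocks (i.e. queues the caller) once POOL_SIZE requests are in flight,
    # so the connection pool can never be exhausted by concurrency alone.
    with gate:
        return work()

print(handle_request(lambda: "ok"))
```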
And last but not least, reduce the duration of each call. If your calls take 5 seconds on average, then you're seeing 110 concurrent connections at a mere 22 requests per second. If you reduce the call duration to 1 second by eliminating SQL bottlenecks, you'll be able to service 110 requests per second at the same resource cap (110 concurrent connections), a 5x traffic increase. The biggest culprit is usually table scans; make sure all your queries use sensible SQL and have an optimal data access path. As David says, SQL Profiler is your friend.
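The trade-off in numbers, using the same identity as above (concurrency = rate × duration):

```python
def concurrent_connections(requests_per_sec, seconds_per_call):
    # Little's law: connections held at once = arrival rate * call duration.
    return requests_per_sec * seconds_per_call

print(concurrent_connections(22, 5))   # -> 110 at 22 req/s with 5 s calls
print(concurrent_connections(110, 1))  # -> 110: 5x the traffic, same concurrency
```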
You can also use SqlConnection.ChangeDatabase to change the database.