I expected this to be a common problem but couldn't find a definitive answer.
The scenario is the following:
You have a micro-service application deployed in containers. Each micro-service is deployed in its own container and can be vertically/horizontally scaled independently from the others.
One of these micro-services needs to connect to a service such as a database. You use your preferred client library to connect to that specific database, using a connection pool inside the micro-service application.
Your application is elastic, meaning it should scale in and out based on some workload metrics, deploying or removing containers as required.
Now here is the problem. Your database can accept only a limited number of connections, let's say 100. Say also that your micro-service requiring database connections has a connection pool with a max limit of 10. This means that effectively your micro-service can't horizontally scale out beyond 10 containers, otherwise you could go above the maximum number of connections supported by the database.
Ideally you would like to scale the service out independently of the database connection limit, having some sort of stateful pool service across the cluster of containers that is aware of the total number of connections currently active.
What are possible solutions to the above scenario?
I don't know which database you use, but in PostgreSQL, for example, there is a "pool service" as you say, called PgBouncer (https://www.pgbouncer.org/), that acts as a global connection pool for all services or instances that require a connection to the database.
You deploy it as a separate service and configure it to connect to your PostgreSQL instance, along with the number of connections available, whether they can be reused among services, etc. Then you connect your microservices to this PgBouncer, and this way you can be sure that you won't overload the database no matter how many instances of the microservice are running.
If you are not using PostgreSQL, I am pretty sure that other databases have similar solutions.
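As a rough sketch of what such a setup might look like, here is a hypothetical pgbouncer.ini fragment. The host names, paths, and limits below are placeholders, not values from the question:

```ini
[databases]
; microservices connect to PgBouncer as if it were the "appdb" database
appdb = host=db.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; cap the total connections PgBouncer opens to the real database,
; regardless of how many microservice containers connect to it
default_pool_size = 20
max_client_conn = 1000
; transaction pooling lets many clients share a few server connections
pool_mode = transaction
```

With something like this, each container's client pool points at PgBouncer on port 6432, and only PgBouncer's `default_pool_size` determines how many connections the database itself ever sees.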
I have an Odoo front end on an AWS EC2 instance, connected to PostgreSQL hosted on the ElephantSQL site with 15 concurrent connections.
I want to make sure that this connection limit will pose no problem, so I want to use Kafka to perform database writes instead of Odoo doing it directly, but I found no resources online to help me out.
Is your issue about Connection Pooling? PostgreSQL includes two implementations of DataSource for JDBC 2 and two for JDBC 3, as shown here.
- dataSourceName (String): Every pooling DataSource must have a unique name.
- initialConnections (int): The number of database connections to be created when the pool is initialized.
- maxConnections (int): The maximum number of open database connections to allow. When more connections are requested, the caller will hang until a connection is returned to the pool.
The pooling implementations do not actually close connections when the client calls the close method, but instead return the connections to a pool of available connections for other clients to use. This avoids any overhead of repeatedly opening and closing connections, and allows a large number of clients to share a small number of database connections.
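To illustrate that return-to-pool behaviour, here is a minimal Python sketch. It is not the PostgreSQL JDBC implementation, just a toy model of the same idea, and all names in it are invented:

```python
import queue

class ConnectionPool:
    """Toy pool: release() does not really close the connection, it hands
    it back for other callers to reuse (a sketch of the behaviour described
    above, not the JDBC implementation)."""

    def __init__(self, factory, max_connections):
        self._pool = queue.Queue(maxsize=max_connections)
        # pre-create connections up front, like initialConnections
        for _ in range(max_connections):
            self._pool.put(factory())

    def get_connection(self):
        # blocks until a connection is returned, like maxConnections describes
        return self._pool.get()

    def release(self, conn):
        # "closing" just returns the connection to the pool
        self._pool.put(conn)

# usage with a dummy connection object instead of a real database
pool = ConnectionPool(factory=object, max_connections=1)
c1 = pool.get_connection()
pool.release(c1)
c2 = pool.get_connection()
assert c1 is c2  # the same underlying connection was reused, not reopened
```

This is what avoids the open/close overhead: the expensive operation (the factory call) happens once per slot, while get/release are cheap queue operations.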
Additionally, you might want to investigate PgBouncer. PgBouncer is a stable, in-production connection pooler for PostgreSQL. PostgreSQL doesn't even realise the presence of PgBouncer. PgBouncer can do load handling, connection scheduling/routing and load balancing. Read more from this blog that shows how to integrate it with Odoo. There are a lot of references from that page.
Finally, I would second OneCricketeer's comment, above, on using Amazon RDS, unless you are getting a far better deal with ElephantSQL.
On using Kafka, you have to realise that Odoo is a frontend application that is synchronous to user actions, so you will not have an architecturally functional system if you put Kafka in between Odoo and the database. You would input data and see it appear about 2-10 minutes later. I exaggerate, but if that is what you really want to do, then by all means invest the time and effort.
Read more from Confluent, the team behind Kafka that came out of LinkedIn, on how they use a solution called BottledWater to do some cool streams over PostgreSQL; that should be closer to what you want to do.
Do let us know which option you selected and what worked! Keep the community informed.
There are numerous questions about using a DB connection pool with Google App Engine, but a lot has changed recently. Up to this point, I could never get a connection pool to work with GAE. However, I think some recent developments may allow connection pooling to work, which may be why it is mentioned in the Google documentation (which seems to have been updated recently).
https://cloud.google.com/sql/docs/mysql/connect-app-engine
Can someone confirm that connection pools can be used?
1) We used Google Cloud SQL 1st gen and the database could deactivate (go to sleep). This would make any existing connections stale.
With a 2nd gen database, there is no deactivation of databases, so this may address the problem.
2) Many connection pool implementations used threads.
With Java 8 being supported on GAE, it looks like threads are permitted.
3) Some people suggest that GAE's limited number of database connections (12) are a reason to use connection pools. The connection pool size could be set to GAE's limit and thus an app would never exceed the limit.
a) First, documentation indicates a much larger number of connections, based on the size of the database.
https://cloud.google.com/sql/docs/quotas
b) Second, if there is a limit for a GAE app, is the limit per individual server instance or for an entire GAE app?
Any confirmation that the above thinking makes sense would be appreciated.
Regarding 1): Yes, with 2nd generation Cloud SQL, your instances don't deactivate unless it's for maintenance, etc.
2) I don't see why you can't use threads to connect to a 2nd generation Cloud SQL database. With Java 8, you can absolutely do that. To check how many threads you have open, you can run mysql> SHOW STATUS WHERE Variable_name = 'Threads_connected';
For 3a), I would go with the official documentation link that you provided already but remember that database connections consume resources on the server and the connecting application. Always use good connection management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. The limit of 12 connections was indeed in place in the past but it doesn't exist anymore.
3b) When a limit or quota refers to a Google App Engine app, then it's for the whole app unless it's specified that it's per instance. More specifically for Cloud SQL connections, you can find the limits here and there is actually a limit that is specific to instances. You can't have more than 100 concurrent connections for each App Engine instance running in a standard environment.
I hope that helps!
I've noticed that on a NopCommerce site we host (which uses Entity Framework) that if I run a crawler on the site (to check for broken links) it knocks the entire webserver offline for a few minutes and no other hosted sites respond. This seems to be because Entity Framework is opening 30-odd database connections and runs hundreds of queries per second (about 20-40 per page view).
I cannot change how EF is used by NopCommerce (it would take weeks) or change the version of EF being used, so can I mitigate the effects it has on SQL Server by limiting how many concurrent connections it uses, to give other sites hosted on the same server a fairer chance at database access?
What I'm ideally looking to do, is limit the number of concurrent DB connections to about 10, for a particular application.
I think the best you can do is use the Max Pool Size setting in the connection string. This limits the maximum number of connections in the connection pool, which effectively caps the number of connections the application will ever open. If the pool is exhausted, a request for a connection waits up to the Connect Timeout and then fails with an exception (an InvalidOperationException in ADO.NET), rather than opening an extra connection.
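As an illustrative sketch, a connection string with the pool capped at the 10 connections you mentioned might look like this (server, database, and credential values are placeholders):

```text
Server=myServer;Database=myDatabase;User Id=myUser;Password=myPassword;Max Pool Size=10;Connect Timeout=15;
```

The trade-off is that under load, requests beyond the tenth concurrent connection queue up for the Connect Timeout window instead of hitting SQL Server, which is exactly the throttling effect you are after.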
Here's a little reading on the settings you can put in an ADO.NET connection string:
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlconnection.connectionstring%28v=vs.100%29.aspx
And here's a little more reading on "Using Connection Pooling":
http://msdn.microsoft.com/en-us/library/8xx3tyca%28v=vs.100%29.aspx
I read some things about hosted (aka cloud) databases. For example Cloudant offers a hosted CouchDB database or Cassandra.io offers hosted Cassandra. I understand why these services solve some problems.
My question: why do these services work? Suppose I host my own application on my own servers (or somewhere on a cloud hosting platform) and use one of these services to store my data. For every database request (either read or write), I pay a full round trip over the internet (supposing my application is not hosted in the same place as my database cloud provider uses). Why aren't these round trips killing me? Thinking about SQL, every query would cost another x*10 ms just for the network, before any time is spent on the query itself.
How is this problem solved? Or are these services not suitable for applications which need fast responses and can only be used for data processing where latency is not an issue?
Generally, the physical hosts of hosted database services reside in major data centers (e.g. AWS). To reduce network latency, customers can choose to host their application on servers that reside in the same physical data center as their hosted databases.
The majority of high-performance applications and/or websites that do not use hosted database services maintain their application servers and database servers on separate hosts for performance reasons anyway. So, in short, switching to a hosted database service would not necessarily increase network latency.
I'm researching cloud services to host an e-commerce site. And I'm trying to understand some basics on how they are able to scale things.
From what I can gather from AWS, Rackspace, etc documentation:
Setup 1:
You can get an instance of a webserver (AWS - EC2, Rackspace - Cloud Server) up. Then you can grow that instance to have more resources or make replicas of that instance to handle more traffic. And it seems like you can install a database local to these instances.
Setup 2:
You can have instance(s) of a webserver (AWS - EC2, Rackspace - Cloud Server) up. You can also have instance(s) of a database (AWS - RDS, Rackspace - Cloud Database) up. So the webserver instances can communicate with the database instances through a single access point.
When I use the term instances, I'm just thinking of replicas that can be accessed through a single access point, with data synchronized across each replica in the background. This could be the wrong mental image, but it's the best I've got right now.
I can understand how setup 2 can be scalable. Webserver instances don't change at all since it's just the source code, so all the HTTP requests are distributed across the different webserver instances and load balanced. The data queries have a single access point and are then distributed across the different database instances and load balanced, and all the data writes are synced between all database instances, transparently to the application/webserver instance(s).
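As a toy sketch of the round-robin idea behind that single access point (the instance names and the balancing policy are invented for illustration; real load balancers also consider health checks and instance load):

```python
from itertools import cycle

# hypothetical webserver instances behind a single access point
backends = ["web-1", "web-2", "web-3"]
round_robin = cycle(backends)

def route(request_id):
    """Pick the backend for this request using plain round-robin."""
    return next(round_robin)

# six incoming requests are spread evenly across the three instances
assignments = [route(i) for i in range(6)]
# → ['web-1', 'web-2', 'web-3', 'web-1', 'web-2', 'web-3']
```

The same picture applies one layer down: the database access point spreads queries across database instances, with replication handling the writes behind the scenes.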
But for setup 1, where there is a database setup locally within each webserver instance, how is the data able to be synchronized across the other databases local to the other web server instances? Since the instances of each webserver can't talk to each other, how can you spin up multiple instances to scale the app? Is this setup mainly for sites with static content where the data inside the database is not getting changed? So with an e-commerce site where orders are written to the database, this architecture will just not be feasible? Or is there some way to get each webserver instance to update their local database to some master copy?
Sorry for such a simple question. I'm guessing the documentation doesn't say it plainly because it's so simple or I just wasn't able to find the correct document/page.
Thank you for your time!
Update:
Moved question to here:
https://webmasters.stackexchange.com/questions/32273/cloud-architecture
We have one server set up to be the application server, and our database installed across a cluster of separate machines on AWS in the same availability zone (initially three, but scalable). The way we set it up is with "k-safe" replication. This is scalable as the data is distributed across the machines, and duplicated such that one machine could disappear entirely and the site would continue to function. This also allows queries to be distributed.
(Another configuration option was to duplicate all the data on each of the database machines)
Relating to setup #1, you're right: if you duplicate the entire database on each machine with load balancing, you need to worry about replicating the data between the nodes. This will be complex and will take a toll on performance, or you'll need to sacrifice consistency, or synchronize everything to a single big database, at which point you lose the benefit of clustering. Also keep in mind that when throughput increases, adding an additional server is a manual operation that can take hours, so you can't respond to throughput changes on demand.
Relating to setup #2, here scaling the application is easy and the cloud providers do that for you automatically, but the database will become the bottleneck, as you are aware. If the cloud provider scales up your application and all those application instances talk to the same database, you'll get more throughput for the application, but the database will quickly run out of capacity. It has been suggested to solve this by setting up a MySQL cluster on the cloud, which is a valid option but keep in mind that if throughput suddenly increases you will need to reconfigure the MySQL cluster which is complex, you won't have auto scaling for your data.
Another way to do this is a cloud database as a service, there are several options on both the Amazon and RackSpace clouds. You mentioned RDS but it has the same issue because in the end it's limited to one database instance with no auto-scaling. Another MySQL database service is Xeround, which spreads the load over several database nodes, and there is a load balancer that manages the connection between those nodes and synchronizes the data between the partitions automatically. There is a single access point and a round-robin DNS that sends the requests to up to thousands of database nodes. So this might answer your need for a single access point and scalability of the database, without needing to setup a cluster or change it every time there is a scale operation.