Google Cloud Spanner - technical implication of 3 node recommendation in production

I'm wondering if there is any technical implication behind that recommendation.
It can't be split-brain because even when configuring 1 node there are 3 instances in total running in different data-centers within the configured region, each having all the data.
So when configuring 3 nodes, Google will run 9 instances in total, sharding the data as well.
I would be happy if anybody could provide information on what's behind that recommendation.
Thanks,
Christian

"when configuring 1 node there are 3 instances"
Quickly correcting terminology to make sure we're on the same page, 1 node still has 3 replicas. Or, more fully: "When configuring your instance to have only 1 node, there are still 3 replicas". When you have 3 nodes, you still have 3 replicas, however each replica will have more physical machines assigned to it.
The reason 3 nodes is the recommended minimum for a 'production grade' instance is around the number of machines that are assigned. When working with highly available systems we need to consider the probability of a replica being unavailable (zone maintenance, unforeseen network issues, etc.) combined with the probability of other simultaneous issues, such as physical machine failures.
It's true that even with 1 node Cloud Spanner can continue to operate if a replica becomes unavailable. However, to provide the uptime guarantees of Cloud Spanner, running 3 nodes reduces the probability of physical machine failures being catastrophic to the overall system.

Related

MuleSoft RTF Architecture and understanding of Cores compared to Cloudhub

Hi, we are planning to migrate to Mule 4 from Mule 3 and I have a few questions related to sizing of cores in Cloudhub vs RTF.
Currently we have Mule runtimes installed on AWS (on-premises runtime plane): 2 VMs of 2 cores each, so that is a 4-core subscription. We clustered them as a server group and deployed 40 applications on both.
Question 1) My understanding is that we are using 2 cores to run the 40 applications and the other 2 cores for high availability. Let me know if this is correct, and if the same 40 apps have to be moved to Cloudhub with HA, do I need 8 cores?
Coming to RTF, I guess we need to have 3 controller and 3 worker nodes. Suppose I take AWS VMs of 3-core capacity: that would be 3 x 3 = 9 cores in use, and I can deploy the same 40 applications on those 3 worker VMs (it could be more than 40 apps as well). This is with high availability.
When it comes to Cloudhub, if I need to deploy 40 apps with high availability (each app deployed on 2 workers), it would take 8 cores, and I cannot deploy a single application beyond those 40.
Question 2) With RTF, even though I have a 4-core VM, I can deploy 50 or 60 apps, but with Cloudhub, if I take a 4-core subscription, I cannot deploy more than 40 apps. Is this correct?
yes, you're right. Currently (Dec 2021), the minimum allocation of vCores when deploying applications to Cloudhub is 0.1 of a vCore, so to your 1st question, yes, correct, you would strictly need 8 vCores, assuming 2 workers per application for a "somewhat" high-availability. A true end-to-end high-availability would more likely need 3 workers, so that if one dies, you would still have HA within the other 2.
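Just to make the arithmetic explicit, here is a tiny sketch of the sizing math; the worker count and the 0.1 vCore worker size are the assumptions discussed above, so adjust them to your actual licensing:

public class CloudhubSizing {
    public static void main(String[] args) {
        int apps = 40;                 // applications to migrate
        int workersPerApp = 2;         // 2 workers per app for "somewhat" HA (3 for stronger HA)
        double vCorePerWorker = 0.1;   // smallest Cloudhub worker size as of Dec 2021

        double totalVCores = apps * workersPerApp * vCorePerWorker;
        System.out.printf("Required vCores: %.1f%n", totalVCores); // prints 8.0
    }
}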
To the second question: when you deploy to RTF, or even a Mule runtime directly on, say, a VM or a container, you have more flexibility in terms of how much of a vCore you need to allocate to your applications. Your MuleSoft account manager would be able to discuss with you what that would mean in practice.
Last but not least, you could also think of different deployment models and cost-saving approaches, which depending on your scenario could mean using, say, service mesh, so you drastically reduce the number of vCores you use; you could also come up with a strategy of grouping endpoints/resources of different applications into a single one. Example: if you have 2 different applications, both related to customer data or otherwise the same domain, you could group them together.
Ed.

Equal connection distribution is not happening in Aurora autoscaled instances

We are running a REST API based spring boot application using AWS Aurora as Database. Our application connects to read-only Aurora MySQL RDS instances.
We are doing load testing on it. Initially we have one database instance, with autoscaling in place that is triggered on high CPU.
Now we are expecting that if we get some X throughput with one DB instance, then we should get approximately 1.8X when autoscaling happens, and connections should be distributed roughly equally across the existing and newly created database instances.
But that is not happening; instead, DB connections go up and down erratically on both database instances. Because of this, our load is not distributed equally and we are not getting the desired throughput. Sometimes one database runs at 100% CPU while the other is still at 20% CPU, and after a few minutes it is reversed.
Below is the database connection configuration:
Driver - com.mysql.jdbc.Driver
Maximum active connections = 100
Max age = 300000 ms
Initial pool size = 10
The Tomcat JDBC pool is used for connection pooling.
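For reference, wired up in code this is roughly how our Tomcat JDBC pool is configured; the JDBC URL below is a placeholder for our cluster's read-only endpoint:

import org.apache.tomcat.jdbc.pool.DataSource;
import org.apache.tomcat.jdbc.pool.PoolProperties;

public class ReaderPoolConfig {
    public static DataSource readerDataSource() {
        PoolProperties p = new PoolProperties();
        // placeholder URL: the Aurora cluster read-only endpoint
        p.setUrl("jdbc:mysql://my-cluster.cluster-ro-xxxx.us-east-1.rds.amazonaws.com:3306/mydb");
        p.setDriverClassName("com.mysql.jdbc.Driver");
        p.setMaxActive(100);      // maximum active connections
        p.setMaxAge(300000);      // retire connections after 5 minutes, even healthy ones
        p.setInitialSize(10);     // initial pool size

        DataSource ds = new DataSource();
        ds.setPoolProperties(p);
        return ds;
    }
}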
NOTE:
1) We have also disabled JVM network DNS caching (see the sketch after this list).
2) We also tried refreshing the database connections every 5 minutes, even the active ones.
3) We have tried everything suggested by AWS but nothing is working.
4) We have even written Lambda code to update Route 53 when a new DB instance comes up, to avoid cluster endpoint caching, but we still see the same issue.
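For completeness, this is roughly how we disable the JVM DNS cache mentioned in note 1 (it has to run before any lookups are made):

import java.security.Security;

public class DnsCacheSettings {
    public static void disableJvmDnsCache() {
        // a TTL of 0 stops the JVM from caching successful DNS lookups
        Security.setProperty("networkaddress.cache.ttl", "0");
        // also avoid caching failed (negative) lookups
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }
}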
Can anyone please help with what the best practice is here, as we currently cannot take this into production?
This is not a great answer, but since you haven't gotten any replies yet, here are some thoughts.
1) The behavior you are seeing replicates the bad routing logic of load balancers
This is probably no surprise to you, but this used to be much more common with small web server deployments, especially with long running queries. With connection pooling, you mirror this situation.
2) Taking this assumption forward, we need to guess at how Amazon chose to balance traffic to read only replicas.
Even in their white paper, they don't mention how they are doing routing: https://www.allthingsdistributed.com/files/p1041-verbitski.pdf
Likely options are route53 or an NLB.
My best guess would be that they are using an NLB. NLBs became available to us only in Q3 2017 and Aurora was 2 years before, but it still is a reasonable guess.
NLBs would let us balance based on least connections (far better than round robin).
3) Validating assumptions
If route53 is being used, then we would be able to use DNS to find out.
I did a dig against the Route 53 endpoint and found that it gave me an answer:
dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-0.yyy.us-east-1.rds.amazonaws.com.
zzz-0.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.8.33
I did it again and got a different answer.
dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-2.yyy.us-east-1.rds.amazonaws.com.
zzz-2.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.7.97
What you can see is that the read only endpoint is giving me a CNAME result pointing to one of the read replica instance endpoints.
zzz is the name of my cluster, xxx came from my CloudFormation stack, and yyy comes from Amazon.
Note: zzz-0 and zzz-2 are the two read only replicas.
What we can see here is that we have route53 for our load balancing.
4) Route53 Load Balancing
They are likely setting up Route53 with round robin on all healthy read only replicas.
The TTL is likely 5s.
Unhealthy nodes will get removed, but there is no balancing based on load or connection count.
5) Ramifications
A) Using the Read Only end point can only balance traffic away from unhealthy instances
B) DB Pools will keep connections for a long time which means that new read replicas won’t be touched
If we have a small number of servers, we will be unbalanced, and there isn't much we can do about that.
6) Thoughts on what you can do
A) Verify yourself with dig that you are getting correct DNS resolution that keeps rotating between replicas every 5s.
If you don't, this is something you need to fix.
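If you want to check the same thing from inside the JVM (which also confirms your DNS cache settings are taking effect), here is a minimal sketch reusing the placeholder hostname from the dig output above:

import java.net.InetAddress;

public class ResolutionCheck {
    public static void main(String[] args) throws Exception {
        // placeholder: replace with your cluster's read-only endpoint
        String readOnlyEndpoint = "zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com";
        for (int i = 0; i < 12; i++) {
            // the printed IP should rotate between the replicas roughly every 5s
            System.out.println(InetAddress.getByName(readOnlyEndpoint).getHostAddress());
            Thread.sleep(5_000);
        }
    }
}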
B) Periodically recycle DB Clients
New replicas will get used, and while you will still be unbalanced, recycling helps by keeping things changing.
What is critical, though, is that you MUST NOT have all your clients recycle at the same time; otherwise, you run the risk of them all reconnecting at the same moment. I would suggest using a random TTL per client (within a min/max range), as sketched below.
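One simple way to get that per-client jitter with the Tomcat JDBC pool from the question is to randomize maxAge when each client builds its pool, so every instance retires connections on its own schedule; a sketch, assuming a 3 to 7 minute window:

import java.util.concurrent.ThreadLocalRandom;
import org.apache.tomcat.jdbc.pool.PoolProperties;

public class JitteredPool {
    public static PoolProperties withJitteredMaxAge(PoolProperties p) {
        long minMs = 3 * 60_000L;   // assumed lower bound: 3 minutes
        long maxMs = 7 * 60_000L;   // assumed upper bound: 7 minutes
        // each client picks its own connection lifetime, so the clients
        // don't all re-resolve DNS and reconnect at the same moment
        p.setMaxAge(ThreadLocalRandom.current().nextLong(minMs, maxMs));
        return p;
    }
}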
C) Manage it yourself
Summary: when you connect, connect directly to the read replica with the fewest connections / lowest CPU.
How you do this is not entirely straightforward. I would suggest a Lambda function that keeps this connection string in a queryable location and updates it at some frequency. I would say the frequency of updating the preferred DB should be about 1/10 of the frequency at which you recycle the DB connections. You could add logic so that if the DBs are running similarly, you hand out the read only endpoint, and only hand out an explicit instance when there is significant inequity.
I would caution that when a new instance comes up, you want to be careful not to flood it.
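A very rough sketch of the client side of this idea, assuming a hypothetical endpoint-picker location (wherever the Lambda publishes the preferred replica hostname, here an HTTP URL) and falling back to the cluster read-only endpoint when the lookup fails:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PreferredReplicaLookup {
    // hypothetical location where the Lambda publishes the preferred replica hostname
    private static final String PICKER_URL = "https://example.internal/preferred-replica";
    private static final String FALLBACK =
            "zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com";

    public static String preferredReplicaHost() {
        try {
            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(PICKER_URL)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            return resp.statusCode() == 200 ? resp.body().trim() : FALLBACK;
        } catch (Exception e) {
            return FALLBACK; // on any failure, use the cluster read-only endpoint
        }
    }
}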
D) Increase number of clients or number of read only copies
Both of these would decrease the chance that two boxes end up with significantly different load.

SolrCloud - 2 nodes cluster

We are planning to implement SolrCloud in our solution (mainly for data replication reasons and disaster recovery); unfortunately some of our customers have only 2 DCs, and one DC may be completely destroyed.
We are aware that running ZK across 2 locations is problematic, as ZK requires a quorum. Downtime of either site with 2 ZK nodes would cause cluster failure, and cluster failure would also be triggered by a network partition between the locations (the master would cease to be master due to lost quorum, and the slave could not elect itself for the same reason).
--
So our current plan A is to go with a single ZK for both sites and back up ZK to the other site. If the site without ZK dies, we are OK. If the site with ZK dies, we should be able to start a new ZK from the backup and reconfigure Solr.
--
We also considered plan B with classic master-slave replication between the sites. BUT we are using Time Routed Aliases, hence we need SolrCloud features, hence we would also need to replicate the data/configuration in ZooKeeper (not only the Solr index). So this option looks like just more manual work in Solr, while we would still need to back up/restore ZK. So this plan was rejected.
--
Plan C may be to have 2 ZK nodes, but one with a bigger weight. This should survive a partition and the death of the ZK node with the lower weight. The first ZK node would be automatically backed up using standard cluster mechanics. But I do not know of anyone using ZK this way...
--
Is there any smarter way to set up SolrCloud in a 2-node environment? Which solution should we prefer?
We do not expect high availability; we want to achieve disaster recovery. Administrator intervention is expected in case of node failure; we only need to be resilient to short network glitches.
Edit: CDCR (Cross Data Center Replication) with Time Routed Aliases
We are considering TRA because our data are time based and customers are usually interested only in the latest slice/partition. Without TRA, the index grows and performance degrades as more (unused/old) data sits in the index and RAM...
Here comes the problem with CDCR: according to the docs, the source & target collection parameters are required. But with TRA, collections are created automatically from the same solrconfig.xml (every X days/months). This problem with CDCR is known (see the comments), but not resolved yet.
Also it seems that CDCR really does not synchronize ZooKeeper (I have not found any mentions of the functionality in docs, jira and in code), which may be ok with static number of collections, but is very problematic with dynamically created collections (especially by some machinery in background outside users/developers code).
Edit: According to David (the main author of TRA), CDCR&TRA combination is not to be supported.

Should zookeeper be run on the worker machines or independent machines?

We have several kinds of software that use ZooKeeper, such as Solr, Storm, Kafka, HBase, etc.
There are 2 options for installing a ZooKeeper cluster (more than 1 node):
Embedded cluster: Install ZK on some of the same machines as the other software are installed OR
External cluster: Have a few not very powerful but dedicated zookeeper machines (in the same region, cloud and data-center though) to run zookeeper on.
Which is the better option for cluster stability? Note that in both cases we always have an odd number of machines in our ZooKeeper cluster, not just one machine.
It appears that the embedded option is easier to set up and makes better use of the machines, but the external option seems more stable, because the loss of a single machine means the loss of just one component (losing a machine with embedded ZooKeeper means losing a ZooKeeper node as well as the worker node of Solr, Storm, Kafka or whatever the case may be).
What is the industry standard to run zookeepers in production for maximum stability?
ZooKeeper is a critical component of a Kafka cluster, but since the introduction of the new generation of clients the load on ZK has been greatly reduced, and it is now only used by the cluster itself. Even though the load is usually not very high, it can be sensitive to latency, so the best practice is to run a ZooKeeper ensemble on dedicated machines, and optimally even use dedicated disks for the ZK transaction logs to avoid IO contention.
By using larger ZooKeeper ensembles you gain resiliency, but this also increases communication within the ensemble and you could lose some performance. Since ZooKeeper works with simple majority voting, you need an odd number of nodes for it to make sense. A 3 node ensemble allows losing 1 node without impact, a 5 node ensemble allows losing 2 nodes, and so on.
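The majority-voting arithmetic is simple enough to write down; a small illustration:

public class ZkQuorumMath {
    // an ensemble of n nodes keeps quorum as long as a strict majority is up,
    // so it tolerates floor((n - 1) / 2) node failures
    static int tolerableFailures(int ensembleSize) {
        return (ensembleSize - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[]{1, 2, 3, 4, 5}) {
            System.out.printf("%d-node ensemble tolerates %d failure(s)%n", n, tolerableFailures(n));
        }
        // note: 4 nodes tolerate no more failures than 3, which is why odd sizes are preferred
    }
}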
In practice, I've seen small, low-workload clusters run very well with ZooKeeper installed on the same machines as the Kafka nodes, but if you aim for maximum stability and have increasing traffic, separate clusters would be recommended.
You should consider yourself discouraged from using internal ZooKeeper in production.
It's good to have an external ZooKeeper, and best to have a ZooKeeper ensemble (three or more nodes, so a majority can survive a failure).
If you have only one ZooKeeper node, it will create problems when it goes down.
If you have a cluster of ZooKeeper nodes and one node goes down, the remaining majority of nodes keeps running and the ensemble continues to work.
More details
For SolrCloud, we strongly recommend that Zookeeper is external, and that you have at least three of them.
This does NOT mean that it cannot run on the same servers as Solr, but it DOES mean that you should NOT use the zookeeper server that Solr itself can start, embedded within itself.
Here's some information related to performance and SolrCloud that touches on zookeeper:
https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud
Whether or not you need completely separate machines, or even separate disks for the zookeeper database when running on the same machine as Solr, is VERY dependent on the characteristics of your SolrCloud install. If your index is very small and your query load is low, it's possible that you can put zookeeper on the same machines and even the same disks.
For the other services you mentioned, I have no idea what the recommendation is.

Are there any "gotchas" in deploying a Cassandra cluster to a set of Linode VPS instances?

I am learning about the Apache Cassandra database [sic].
Does anyone have any good/bad experiences with deploying Cassandra to less than dedicated hardware like the offerings of Linode or Slicehost?
I think Cassandra would be a great way to scale a web service easily to meet read/write/request load... just add another Linode running a Cassandra node to the existing cluster. Yes, this implies running the public web service and a Cassandra node on the same VPS (which many can take exception with).
Pros of Linode-like deployment for Cassandra:
Private VLAN; the Cassandra nodes could communicate privately
An API to provision a new Linode (and perhaps configure it with a "StackScript" that installs Cassandra and its dependencies, etc.)
The price is right
Cons:
Each host is a VPS and is not dedicated of course
The RAM/cost ratio is not that great once you decide you want 4GB RAM (cf. dedicated at say SoftLayer)
Only 1 disk where one would prefer 2 disks I suppose (1 for the commit log and another disk for the data files themselves). Probably moot since this is shared hardware anyway.
EDIT: found this which helps a bit: http://wiki.apache.org/cassandra/CassandraHardware
I see that 1GB is the minimum but is this a recommendation? Could I deploy with a Linode 720 for instance (say 500 MB usable to Cassandra)? See http://www.linode.com/
How much RAM you need really depends on your workload: if you are write-mostly you can get away with less; otherwise you will want RAM for the read cache.
You do get more RAM for your money at my employer, Rackspace Cloud: http://www.rackspacecloud.com/cloud_hosting_products/servers/pricing. (Our machines also have RAIDed disks, so people typically see better I/O performance vs EC2. Dunno about Linode.)
Since with most VPSes you pay roughly 2x for the next-size instance, i.e., about the same as adding a second small instance, I would recommend going with fewer, larger instances than more, smaller ones, since in small numbers network overhead is not negligible.
I do know someone using Cassandra on 256MB VMs but you're definitely in the minority if you go that small.
