AWS Elasticache Redis failover - database

I am using Redis on ElastiCache for a Node application and today the node went down which means our app stopped working. It took 20 minutes for a new node to be provisioned.
From reading the documentation it seems I can set up a cluster which automatically promotes a slave to primary in case of a failure. The big gotcha seems to be you have to set your client to write to the primary node and read from the slave nodes.
This means in the case of a failure, you have to reconfigure your app to point to the newly created 'read' nodes. It also takes a few minutes for a slave to be promoted to primary.
Is there no way to set this up so if the primary fails, a slave will automatically take over for read/write operations?
I'm not storing much data in redis and low read/write operations, but it is required to run the app (live video sessions!).
If I can't have a seamless failover in redis, is there something I can use which provides this functionality? I'm hoping I don't have to move to a traditional DBMS as everything works perfectly but I need to be able to handle failure well.
Thanks

Multi AZ's should automatically switch over with minimal downtime. Once you have created one of these instances, you will get an endpoint for the cluster. Amazon will point that DNS entry to the proper failover node, and handle the promotion of a slave, if the master instances dies.

Related

MongoDB - set replication to DocumentDB

We're setting up a local MongoDB cluster - Locally, we'll have one primary and one node, and we want to have another node in AWS. Is it possible to have that node as the DocumentDB service instead of an EC2 instace?
Also, I know I must have an odd number of total nodes, is it possible to first add one node and then add another one?
Thanks ahaed.
Also, I know I must have an odd number of total nodes
In a MongoDB replica set, you can have any number of nodes you like. It is possible to have a 2-node replica set, although it's not very practically useful since unavailability of a single node (e.g. a restart for maintenance) would make the whole deployment unavailable for writes. A 4-node replica set is a feasible construction if you wanted an additional replica somewhere (e.g. for geographically close querying from a secondary, or for analytics querying), though if you are simply doing this for redundancy you should probably stick with the standard 3-node configuration and configure proper backups.
Is it possible to first add one node and then add another one?
You can reconfigure a replica set at any time.
Is it possible to have that node as the DocumentDB service instead of an EC2 instace?
Unlikely. DocumentDB is not MongoDB. DocumentDB pretends to be like a MongoDB but it 1) pretends to be an old version of MongoDB, 2) even then many features don't work, and 3) it's not anywhere near the same architecture as MongoDB under the hood. So when you ask a genuine MongoDB database to work with a DocumentDB node, this will probably not work.
This assumes you can even configure DocumentDB in the required manner - I suspect this won't be possible to begin with.
If you're only trying to replicate the data to DocumentDB, Database Migration Service is a good tool for the job: https://aws.amazon.com/dms/
But like others have said, this will be a separate cluster from your MongodDB setup.

2 Node Redis HA

I have two nodes which I want to run as servers in active-active mode and also have HA capability i.e if one is down, the other one should start receiving all the requests but while both are up, both should be taking all the requests. Now since Redis doesn't allow active-active mode for the same hash set and I don't have option to run Sentinel as I can't have a third node, my idea is to run the two nodes in replication and myself decide if master node is down and promote the slave to master. Are there any issues with this? When the original master comes back up, is there a way to configure it as slave?
Does this sound like a good idea? I am open to suggestions other than Redis.
Generally running two node is never a good idea because it is bound to have split brain problem: When the network between the two node is down for a moment or two, the two node will inevitably think each other is offline and will promote/keep itself to be master and start accepting requests from other services. Then the split brain happens.
And if you are OK with this possible situation, then you can look into setup a master-slave with help of a script file and a HA service like pacemaker or keepalived.
Typically you have to tell the cluster manager through a predefined rule that when two machine rejoins under split brain condition, which one is your preferred master.
When a master is elected, execute the script and basically it execute slaveof no one on itself and execute slaveof <new-master-ip> <port> on the other node.
You could go one step further in your script file and try to merge the two data sets together but whether that's achievable or not is entirely down to how you have organized your data in Redis and how long you are prepared to wait for to have all the data in sync.
I have done this way myself before through pacemaker+corosync.
Ok, partial solution with SLAVEOF:
You can manually promote slave to master by running:
SLAVEOF NO ONE
You can manually transition master to slave by running:
SLAVEOF <HOST> <port>
Clustering should be disabled.
If you brought the replica online manually by changing it to replicaof no one, you need to be careful to bring the failed master back online as a replicaof the new node so you dont overwrite more recent data. I would not recommend doing this manually. You want to minimize downtime so automated failover is ideal
You mention being open to other products. Check out KeyDB which has the exact configuration you are looking for. It is a maintained multi-threaded fork of redis which offers the active-replica scenario you are looking for. Check out an example of it here.
Run both nodes as replicas of each other accepting reads and writes simultaneously (depending on upfront proxy config). If one fails the other continues to take the full load and is already sync'd.
Regarding the split brain concern, KeyDB can handle split brain scenarios where the connection between masters is severed, but writes continue to be made. Each write is timestamped and when the connection is restored each master will share their new data. The newest write will win. This prevents stale data from overwriting new data written after the connection was severed.
I would recommendation to have at least 3 nodes with Sentinel Setup for enabling gossip/quorum for auto promotion of slave to master when current master node goes down.
I believe it is possible to create a cluster with two nodes with the commands below:
$ redis-cli --cluster create <ip-node1>:7000 <ip-node1>:7001 <ip-node2>:7000 <ip-node2>:7001 --cluster-replicas 1
To resolve the split-brain problem. you can add a third node without data:
$ cluster meet #IP_node3#:7000
$ cluster nodes
I think it works.

How transparent should losing access to a Postgres-XL datanode be?

I have set-up a testing Postgres-XL cluster with the following architecture:
gtm - vm00
coord1+datanode1 - vm01
coord2+datanode2 - vm02
I created a new database, which contains a table that is distributed by replication. This means that I should have the exact copy of that table in each and every single datanode.
Doing operations on the table works great, I can see the changes replicated when connecting to all coordinator nodes.
However, when I simulate one of the datanodes going down, while I can still read the data in the table just fine, I cannot add or modify anything, and I receive the following error:
ERROR: Failed to get pooled connections
I am considering deploying Postgres-XL as a highly available database backend for a fair number of applications, and I cannot control how those applications interact with the database (it might be big a problem if those applications couldn't write to the database while one datanode is down).
To my understanding, Postgres-XL should achieve high availability for replicated tables in a very transparent way and should be able to support losing one or more datanodes (as long as at least one is still available - again, this is just for replicated tables), but this does not seem the case.
Is this the intended behaviour? What can be done in order to be able to withstand having one or more datanodes down?
So as it turns out not transparent at all. To my jaw dropping surprise at it turns out Postgres-XL has no build in high availably support or recovery. Meaning if you lose one node the database fails. And if you are using the round robbin or hash DISTRIBUTED BY options if you lose a disk in a node you have lost the entire database. I could not believe it, but that is the case.
They do have a "stand by" server option which is just a mirrored node for each node you have, but even this requires manually setting it to recover and doubles the number of nodes you need. For data protection you will have to use the REPLICATION DISTRIBUTED BY option which is MUCH slower and again has no fail over support so you will have to manually restart it and reconfigure it not to use the failing node.
https://sourceforge.net/p/postgres-xl/mailman/message/32776225/
https://sourceforge.net/p/postgres-xl/mailman/message/35456205/

Solr master-master replication alternatives?

Currently we have 2 servers with a load-balancer before them. We want to be able to turn 1 machine off and later on, without the user noticing it.
Our application also uses solr and now i wanted to install & configure solr on both servers and the question is how do i configure a master-master replication?
After my initial research i found out that it's not possible :(
But what are my options here? I want both indices to stay in sync and when a document is commited on one server it should also go to the other.
Thanks for your help!
Not certain of your specific use case (why turn 1 server on and off?), there is no specific "master-master" replication. Solr does however support distributed indexing and querying via SolrCloud. From the documentation for SolrCloud:
Replication ensures redundancy for your data, and enables you to send
an update request to any node in the shard. If that node is a
replica, it will forward the request to the leader, which then
forwards it to all existing replicas, using versioning to make sure
every replica has the most up-to-date version. This architecture
enables you to be certain that your data can be recovered in the event
of a disaster, even if you are using Near Real Time searching.
It's a bit complex so I'd suggest you spend some time going thru the documentation as it's not quite as simple as setting up a couple of masters and load balancing between them. It is a big step up from the previous master/slave replication that Solr used, so even if it's not a perfect fit it will be a lot closer to what you need.
https://cwiki.apache.org/confluence/display/solr/SolrCloud
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
You can just create a simple master - slave replication as described here:
https://cwiki.apache.org/confluence/display/solr/Index+Replication
But be sure you send your inserts, deletes, updates directly to the master, but selects can go through the load balancer.
The other alternative is to create a third server as a master, and 2 slaves, and the lode balancer can be in front of the two slaves.

Solr Master Slave Failover setup for High Availability

While using Solr (we are currently using 3.5), how do we setup the Masters for a Failover?
Lets say in my Setup I have Two Masters and Two Slaves. The Application commits all the writes to One Active Master, and both the slaves get the updates from this Active Master. There is another repeater which serves the same purpose of the Master.
Now my question is if the Master for some reason comes down, how can I make the Repeater as a Master without any Manual intervention. How can the slaves start getting the updates from the Repeater instead of the broken Master. Is there a recommended way to do this? Are there any other recommended Master/Slave setup's to ensure High availability of the Solr systems?
At this time, your best option is probably to investigate the SolrCloud functionality present in the current Solr 4.0 alpha, which at the time of this writing is due for its final release within a few months. The goal of SolrCloud is to handle data distribution and master election, using the ZooKeeper distributed database to maintain consensus within the cluster about which nodes are serving in while roles.
There are other more traditional ways to set up failover for Solr 3's replicated master-slave architecture, but I personally wouldn't want to make that investment with Solr 4.0 so near to release.
Edit: See Linux-HA, for one such traditional approach. Personally, I would create a purpose-built daemon that reconfigures your cores and load balancer, using ZooKeeper for presence detection and distributed locks.
If outsourcing is an option, you might consider a hosted service such as my own humble Websolr. We provide this kind of distribution and hot failover by default, so our customers don't have to worry as much about the mechanics of how it's implemented.
I agree with Nick. The way replication works in Solr 3.x is not always handy, especially for master fail-over. If you are going to consider Solr 4 you might want to have a look at elasticsearch too, which solves this kind of problems in a really brilliant way!
It uses push replication instead of the pull mechanism used by Solr. That means the document is literally reindexed on all nodes. It might sound strange but that allows to reduce the network load (due to segment merge for example). Furthermore, a node is elected as master and if it crashes one other node will automatically replace it becoming the new master.

Resources