Data Sync between Apache Ignite Clusters

We have two Apache Ignite Clusters (Cluster_A and Cluster_B) of version 2.13.0.
We are writing data into Cluster_A tables. We want to sync/copy the data into Cluster_B tables from Cluster_A.
Is there any efficient way?

In general, it is possible to leverage CDC (change data capture) replication through Kafka to transfer updates from one cluster to another. It is worth mentioning, though, that this requires running and maintaining a separate Kafka cluster to store the updates.
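For reference, here is a minimal Java sketch of the source-cluster side, an assumption-laden illustration rather than a verified setup: CDC has to be enabled on a persistent data region so that change records are written for an external CDC consumer, such as the Kafka-based CDC extension, to ship to Cluster_B. The region name is a placeholder.

    // Sketch only: enable CDC on a persistent data region of Cluster_A so a
    // separate CDC consumer (e.g. the Kafka CDC streamer) can forward changes.
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class CdcEnabledNode {
        public static void main(String[] args) {
            DataRegionConfiguration region = new DataRegionConfiguration()
                    .setName("replicated_region")   // placeholder region name
                    .setPersistenceEnabled(true)    // CDC relies on native persistence
                    .setCdcEnabled(true);           // write change-data-capture records

            IgniteConfiguration cfg = new IgniteConfiguration()
                    .setDataStorageConfiguration(new DataStorageConfiguration()
                            .setDefaultDataRegionConfiguration(region));

            Ignite ignite = Ignition.start(cfg);
            // Caches placed in this region now produce CDC records that the
            // separate ignite-cdc process can read and push to Kafka.
        }
    }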
On the other hand, GridGain Enterprise has built-in Data Center Replication functionality for cross-data-center replication cases. It doesn't require any third-party installations; GridGain stores updates persistently and reliably using its native persistence. It's also possible to establish active-active replication out of the box.
Another advantage is that GridGain DR has dedicated functionality to transfer the entire state of a cluster.
For more details on how to configure DCR, follow the documentation link.

Related

Realtime Streaming of SQL SERVER (RDS) transactions to NoSQL

I have a situation where I want to stream all the Updates, Deletes and Inserts from my AWS RDS SQL Server to a NoSQL DB such as DynamoDB or RethinkDB.
What I am trying to achieve is to divide my users into critical and non-critical databases, reducing the load on my RDS server, and to use technologies like RethinkDB or DynamoDB Streams to send the non-critical set of data to the front end.
I have thought of various ways to do this:
The most obvious is to asynchronously write to both databases, though I could end up in a situation where one of the writes fails.
The second is to use RabbitMQ or a queuing service such as AWS SQS to queue the second write and make sure it gets inserted.
The third (which is what I want to achieve) is for a Node.js service to somehow listen to MSSQL change streams and push the content to NoSQL.
What can be done in a situation like this?
The benefit I am looking for is to store a dataset in NoSQL that can be served to over 100k users, since they all want to see the same data (with just some WHERE clause changes) in real time. This in turn will reduce the RDS server transactions to a minimum of reads and writes.
You can use one of the two approaches below (a rough sketch of the second one follows):
AWS DMS
Or, combining EMR, Amazon Kinesis, and Lambda (with custom scripts)
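As a rough sketch of the second option, and assuming some upstream process already publishes the SQL Server changes to a Kinesis stream as JSON, a Java Lambda handler that copies them into DynamoDB could look roughly like this. The table name NonCriticalData, the stream contents, and the attribute mapping are all assumptions made for illustration.

    // Hypothetical sketch: Lambda consumes change records from Kinesis and
    // writes them into DynamoDB for the non-critical read path.
    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    public class ChangeStreamToDynamo implements RequestHandler<KinesisEvent, Void> {

        private final DynamoDbClient dynamo = DynamoDbClient.create();

        @Override
        public Void handleRequest(KinesisEvent event, Context context) {
            for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
                // Raw change payload as emitted by the upstream producer.
                String json = StandardCharsets.UTF_8
                        .decode(record.getKinesis().getData()).toString();

                // For brevity, store the payload as-is, keyed by the Kinesis partition
                // key; a real function would parse the JSON and map columns to attributes.
                dynamo.putItem(PutItemRequest.builder()
                        .tableName("NonCriticalData")
                        .item(Map.of(
                                "pk", AttributeValue.builder()
                                        .s(record.getKinesis().getPartitionKey()).build(),
                                "payload", AttributeValue.builder().s(json).build()))
                        .build());
            }
            return null;
        }
    }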

Synchronizing data from MSSQL to Elasticsearch using Apache Kafka

I'm currently running a text search in SQL Server, which is becoming a bottleneck, and I'd like to move things to Elasticsearch for obvious reasons; however, I know that I have to denormalize data for the best performance and scalability.
Currently, my text search includes some aggregation and joins across multiple tables to get the final output. The tables that are joined aren't that big (up to 20 GB per table) but are changed (inserted, updated, deleted) irregularly (two of them once a week, the other one on demand x times a day).
My plan would be to use Apache Kafka together with Kafka Connect in order to read CDC from my SQL Server, join this data in Kafka and persist it in Elasticsearch, however I cannot find any material telling me how deletes would be handled when data is being persisted to Elasticsearch.
Is this even supported by the default driver? If not, what are the possibilities? Apache Spark, Logstash?
I am not sure whether this is already possible in Kafka Connect now, but it seems that this can be resolved with NiFi.
Hopefully I understand the need; here is the documentation for deleting Elasticsearch records with one of the standard NiFi processors:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-5-nar/1.5.0/org.apache.nifi.processors.elasticsearch.DeleteElasticsearch5/
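If you end up writing a small consumer of your own instead of relying on a connector or NiFi, deletes usually show up on a CDC topic as tombstones (records with a null value), and those can be mapped to Elasticsearch deletes. Below is a hedged Java sketch, assuming a string-keyed topic named search-index, an index named products, and the high-level REST client; it is an illustration, not a drop-in solution.

    // Sketch only: map Kafka tombstones to Elasticsearch deletes, everything
    // else to upserts of the already joined/denormalized JSON document.
    import org.apache.http.HttpHost;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.elasticsearch.action.delete.DeleteRequest;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.xcontent.XContentType;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class TombstoneAwareIndexer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "es-indexer");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 RestHighLevelClient es = new RestHighLevelClient(
                         RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

                consumer.subscribe(List.of("search-index"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        if (record.value() == null) {
                            // Tombstone: the row was deleted upstream, so delete the document.
                            es.delete(new DeleteRequest("products").id(record.key()),
                                    RequestOptions.DEFAULT);
                        } else {
                            // Insert/update: index (upsert) the JSON document under the same id.
                            es.index(new IndexRequest("products").id(record.key())
                                            .source(record.value(), XContentType.JSON),
                                    RequestOptions.DEFAULT);
                        }
                    }
                }
            }
        }
    }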

How does Docker Swarm handle database (PostgreSQL) replication?

I'm learning Docker Swarm mode and I managed to create a Swarm locally with a web application and a PostgreSQL database. I can scale them and I see Swarm creating replicas.
I think I understand how Docker Swarm can load balance regular web servers, but how does it deal out of the box with database containers?
Outside of the Swarm context, databases usually have their own ways to deal with replication, in the form of plugins or extended products like MySQL Cluster. Other databases like Cassandra have replication built directly into their product.
In a Swarm context, do we still need to rely on those database plugins and features?
What is the expected pattern to handle data consistency between replicas of a database container?
I know it's a very open-ended question, but Docker's documentation is very open-ended too and I can't seem to find anything specific to this.
How does it deal out of the box with database containers?
It doesn't.
There is a pretty good description of Swarm services here: How services work (emphasis mine)
When you deploy the service to the swarm, the swarm manager accepts your service definition as the desired state for the service. Then it schedules the service on nodes in the swarm as one or more replica tasks.
Swarm has no idea what's inside the task, all it knows is how many instances of it there are, whether those instances are passing their health checks, and if there are enough of them to satisfy the task definition you gave it. The word overlap between this and database replicas is a little unfortunate, but they are different concepts.
What is the expected pattern to handle data consistency between replicas of a database container?
Setting up data replication is on you. These are probably as good a place to start as any
How to Set Up PostgreSQL for High Availability and Replication with Hot Standby
PostgreSQL Replication Example
Docker Swarm currently scales well for stateless applications. For database replication, you have to rely on each database's own replication mechanism; Swarm cannot manage database replication itself. Volume- or filesystem-level replication can protect a single-instance database, but it is not aware of database replication/clustering.
For databases such as PostgreSQL, additional work is required. There are a few options:
Use the host's local directory. You will need to create one service for every replica and use a placement constraint to schedule each container to one specific host. You will also need a custom PostgreSQL Docker image to set up PostgreSQL replication among the replicas. However, when a node goes down, the PostgreSQL replica on it goes down as well, and you will need to do some work to bring up another replica. See crunchydata's example.
Use a volume plugin, such as Flocker or REX-Ray. You will still need to create one service for every replica and bind one volume to each service. You need to create all services in the same overlay network and configure the PostgreSQL replicas to talk to each other via DNS names (the Docker service name of each replica). You will still need to set up PostgreSQL replication among the replicas.

When to prefer master-slave and when to cluster?

I know there have been many articles written about database replication. Trust me, I spent some time reading those articles, including this SO one that explains the pros and cons of replication. That SO article goes in depth about replication and clustering individually, but it doesn't answer these simple questions that I have:
When do you replicate your database, and when do you cluster?
Can both be performed at the same time? If yes, what are the motivations for each?
Thanks in advance.
MySQL currently supports two different solutions for creating a high availability environment and achieving multi-server scalability.
MySQL Replication
The first form is replication, which MySQL has supported since version 3.23. Replication in MySQL is currently implemented as an asynchronous master-slave setup that uses a logical log-shipping backend.
A master-slave setup means that one server is designated to act as the master. It is then required to receive all of the write queries. The master executes and logs the queries, which are then shipped to the slaves to execute, keeping the same data across all of the replication members.
Replication is asynchronous, which means that the slave server is not guaranteed to have the data when the master performs the change. Normally, replication will be as close to real time as possible. However, there is no guarantee about the time required for the change to propagate to the slave.
Replication can be used for many reasons. Some of the more common reasons include scalability, server failover, and backup solutions.
Scalability can be achieved because you can now run SELECT queries against any of the slaves. Write statements, however, are generally not improved, because writes have to occur on each of the replication members (a read/write-splitting sketch follows below).
Failover can be implemented fairly easily using an external monitoring utility that uses a heartbeat or similar mechanism to detect the failure of the master server. MySQL does not currently do automatic failover, as the logic is generally very application dependent. Keep in mind that, because replication is asynchronous, it is possible that not all of the changes done on the master will have propagated to the slave.
MySQL replication works very well even across slower connections, and with connections that aren't continuous. It can also be used across different hardware and software platforms. It is possible to use replication with most storage engines, including MyISAM and InnoDB.
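To make the read-scaling point concrete, here is an illustrative Java/JDBC sketch of the usual application-side pattern with a replicated setup: writes always go to the master, reads can go to a replica. The host names, credentials, and the orders table are placeholders, not part of any MySQL tooling.

    // Sketch: read/write splitting at the application level against a
    // master-slave MySQL setup.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ReadWriteSplitExample {
        private static final String MASTER_URL  = "jdbc:mysql://master-host:3306/app";
        private static final String REPLICA_URL = "jdbc:mysql://replica-host:3306/app";

        public static void main(String[] args) throws Exception {
            // Writes always go to the master, which ships them to the slaves.
            try (Connection master = DriverManager.getConnection(MASTER_URL, "app", "secret");
                 PreparedStatement insert = master.prepareStatement(
                         "INSERT INTO orders(customer_id, total) VALUES (?, ?)")) {
                insert.setLong(1, 42L);
                insert.setBigDecimal(2, new java.math.BigDecimal("19.99"));
                insert.executeUpdate();
            }

            // Reads can be spread across any replica; remember replication is
            // asynchronous, so a just-written row may not be visible here yet.
            try (Connection replica = DriverManager.getConnection(REPLICA_URL, "app", "secret");
                 PreparedStatement select = replica.prepareStatement(
                         "SELECT COUNT(*) FROM orders WHERE customer_id = ?")) {
                select.setLong(1, 42L);
                try (ResultSet rs = select.executeQuery()) {
                    rs.next();
                    System.out.println("orders for customer 42: " + rs.getLong(1));
                }
            }
        }
    }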
MySQL Cluster
MySQL Cluster is a shared nothing, distributed, partitioning system that uses synchronous replication in order to maintain high availability and performance.
MySQL Cluster is implemented through a separate storage engine called NDB Cluster. This storage engine will automatically partition data across a number of data nodes. The automatic partitioning of data allows for parallelization of queries that are executed. Both reads and writes can be scaled in this fashion since the writes can be distributed across many nodes.
Internally, MySQL Cluster also uses synchronous replication in order to remove any single point of failure from the system. Since two or more nodes are always guaranteed to have each data fragment, at least one node can fail without any impact on running transactions. Failure detection is handled automatically, with the dead node being removed transparently to the application. Upon restart, a node is automatically re-integrated into the cluster and begins handling requests as soon as possible.
There are a number of limitations that currently exist and have to be kept in mind while deciding if MySQL Cluster is the correct solution for your situation.
Currently all of the data and indexes stored in MySQL Cluster are stored in main memory across the cluster. This does restrict the size of the database based on the systems used in the cluster.
MySQL Cluster is designed to be used on an internal network as latency is very important for response time.
As a result, it is not possible to run a single cluster across a wide geographic distance. In addition, while MySQL Cluster will work over commodity network setups, in order to attain the highest performance possible special clustering interconnects can be used.
In a master-slave configuration, write operations are performed by the master and reads by the slaves. All SQL requests first reach the master, a queue of requests is maintained, and the read operations are executed only after the writes complete. A common problem with the master-slave configuration, which I have also witnessed, is that when the queue becomes too large for the master to maintain, the architecture collapses and the slave starts behaving like the master.
For clusters, I have worked with Cassandra, where a request reaches a node (table) and a commit hash is maintained that records the changes made to that node and updates the other nodes based on that commit hash. So here, operations do not depend on a single node.
We used master-slave when the written data is not big in size and count; otherwise we use clusters.
Clusters are expensive in space and master-slave in time, so your decision about which to choose depends on what you want to save.
We can also use both at the same time; I have done this at my current company.
We moved the tables with the most write operations to Cassandra, and we have written four APIs to perform the CRUD operations on the tables in Cassandra. Whenever an HTTP request comes in, it first hits our web server; from the code running on the web server we decide which of the CRUD operations has to be performed and then call that particular API to make changes to the Cassandra database (a minimal sketch of such a call is shown below).
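For illustration, a minimal sketch of what one of those CRUD calls might look like with the DataStax Java driver; the keyspace and table (app.events), the columns, and the contact point are assumptions rather than the poster's actual schema.

    // Sketch: basic CRUD against a Cassandra table using the DataStax Java driver.
    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;

    import java.net.InetSocketAddress;
    import java.util.UUID;

    public class CassandraCrudSketch {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder()
                    .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                    .withLocalDatacenter("datacenter1")
                    .build()) {

                UUID id = UUID.randomUUID();

                // Create
                session.execute(
                        "INSERT INTO app.events (id, payload) VALUES (?, ?)",
                        id, "{\"type\":\"click\"}");

                // Read
                Row row = session.execute(
                        "SELECT payload FROM app.events WHERE id = ?", id).one();
                System.out.println(row.getString("payload"));

                // Update
                session.execute(
                        "UPDATE app.events SET payload = ? WHERE id = ?",
                        "{\"type\":\"view\"}", id);

                // Delete
                session.execute("DELETE FROM app.events WHERE id = ?", id);
            }
        }
    }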

Can the HBase database system be used as a live application database with CRUD features?

I have been reading more about the low-latency capabilities that the HBase database system offers on Hadoop. While most Hadoop data stores are meant for write-only map/reduce workloads, HBase appears to have low-latency update/delete features as well.
Is HBase a good candidate to be used to replace existing live application databases?
I do use HBase as a backend for a client-facing web application. It all depends on how the data is structured in HBase for fast retrieval (it all ties back to row key design) and how updates/CRUD operations are handled (e.g., by adding versions).
Additional references:
HBase as web app backend
hbase as database in web application
The answer is yes: one can replace an existing database with HBase after carefully evaluating the primary objectives of the application (especially performance).
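To illustrate the kind of live CRUD access being discussed, here is a short sketch using the standard HBase Java client; the users table, the info column family, and the row key format are assumptions.

    // Sketch: low-latency CRUD against HBase via point lookups on a row key.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCrudSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table users = conn.getTable(TableName.valueOf("users"))) {

                byte[] rowKey = Bytes.toBytes("user#42");  // row key design drives read latency
                byte[] info   = Bytes.toBytes("info");

                // Create / Update: a Put to the same row key writes a new version of the cell.
                Put put = new Put(rowKey);
                put.addColumn(info, Bytes.toBytes("email"), Bytes.toBytes("user42@example.com"));
                users.put(put);

                // Read: point lookup by row key is the low-latency path.
                Result result = users.get(new Get(rowKey));
                System.out.println(Bytes.toString(
                        result.getValue(info, Bytes.toBytes("email"))));

                // Delete
                users.delete(new Delete(rowKey));
            }
        }
    }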
