Writing to many replicas of MongoDB - database

Let's say I have a distributed application that writes to the database. To decrease latency one of the instances (app + database) is hosted in Australia and another one is hosted in Europe. Both instances of database need to share the same data.
So what we are after here is data locality. The reason for it is obvious: we don't want users in Australia shooting requests to our database in Europe because that would increase latency.
The natural choice would be to deploy both database instances in one replica set. But it seems that with MongoDB you can write to only one Mongo instance within a replica set.
What are the strategies with MongoDB for having two database instances, sharing the same data, that you can write to? Or is MongoDB just the wrong choice for this requirement?

Huge subject, but I'll try to give you a short and simple answer:
As your two instances must share the same data, you can't use a sharded cluster with zones. But a replica set can be your solution:
Create a replica set with at least the following:
a server in a 'neutral' zone. Give it a higher priority so that it becomes the primary. This server, as long as it is still primary, will handle your write operations.
your two existing servers, with lower priority.
Set the Read Preference in your application to 'nearest'. This way, your read operations will be handled by the server with the lowest network latency, regardless of the primary/secondary role of the server.
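For example, a minimal sketch of what this could look like with the PyMongo driver, assuming three members with the made-up hostnames mongo-neutral, mongo-eu and mongo-au:

from pymongo import MongoClient

# Hypothetical replica set config: the 'neutral' member gets the highest
# priority so it is elected primary; the two regional members keep a lower
# priority and mainly serve local reads.
rs_config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo-neutral:27017", "priority": 2},
        {"_id": 1, "host": "mongo-eu:27017", "priority": 1},
        {"_id": 2, "host": "mongo-au:27017", "priority": 1},
    ],
}

# Run once, against the member you want to initiate the set from.
MongoClient("mongo-neutral", 27017, directConnection=True) \
    .admin.command("replSetInitiate", rs_config)

# In the application, 'nearest' routes reads to the member with the lowest
# measured network latency, whether it is the primary or a secondary.
client = MongoClient(
    "mongodb://mongo-neutral:27017,mongo-eu:27017,mongo-au:27017/?replicaSet=rs0",
    readPreference="nearest",
)
orders = client.mydb.orders

Writes issued through that client still go to the primary; only reads are routed by the read preference.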
But I highly recommend checking the documentation to see how to deploy this architecture correctly. Here's a good start.
EDIT
Some considerations about this solution:
This use case is one of the rare cases where it's better to read from secondaries. In general, prefer reading your data from the primary, since a replica set is designed for high availability, not for scalability.
If some of your data can be 'localized' so that it is mostly accessed from one region, sharding the collections with zones may be the better solution (see the sketch below).
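If you go that route, a rough sketch of what zone sharding could look like with PyMongo (the mongos host, shard names, zone names, and the region-prefixed shard key are all made up for illustration):

from bson import MaxKey, MinKey
from pymongo import MongoClient

# All names below are hypothetical. Run against mongos, not a shard member.
admin = MongoClient("mongodb://mongos-host:27017").admin

# Tag each shard with the zone of the region it lives in.
admin.command("addShardToZone", "shard-eu", zone="EU")
admin.command("addShardToZone", "shard-au", zone="AU")

# Shard the collection on a key that starts with the region.
admin.command("enableSharding", "mydb")
admin.command("shardCollection", "mydb.users", key={"region": 1, "_id": 1})

# Pin each region's key range to the shard closest to those users.
admin.command("updateZoneKeyRange", "mydb.users",
              min={"region": "EU", "_id": MinKey()},
              max={"region": "EU", "_id": MaxKey()},
              zone="EU")
admin.command("updateZoneKeyRange", "mydb.users",
              min={"region": "AU", "_id": MinKey()},
              max={"region": "AU", "_id": MaxKey()},
              zone="AU")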

Related

Best practice or design to scale out/horizontal scale database for microservices

The main benefit of microservices is that one service "type" can be scaled out by using multiple container instances and load balancing to improve throughput.
But the thing is, multiple instances (i.e. containers) of a service type share the same database instance, and this could lead to a performance bottleneck when multiple instances write to and read from that database instance.
Traditionally, we would scale up the processing power of that database instance to meet high demand.
The main question for me is: what is the current best practice/design/solution to scale out/horizontally scale, so that we can have multiple instances of that database and gain a performance improvement?
In particular, what I want to achieve is:
If one instance is down, another instance can handle the load -> high availability
The ability to load-balance reads, or maybe even writes, across multiple database instances
Maintaining the persistence and consistency of data in case I want to create more database instances
To my knowledge,
one solution is Microsoft SQL Server's high availability for SQL Server containers, which can meet most of the requirements above (https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-container-ha-overview?view=sql-server-2017). But I wonder: is there a better solution that avoids technology lock-in?
Another solution I'm thinking of is replicating to multiple instances by streaming CDC data from a master database instance to multiple replicas. This allows reads from the replicas.
But I'm still not convinced, because to guarantee consistency every service instance should write to the master database instance, and this could also lead to a bottleneck on that master database instance.
At a broad level, there are 3 possible architectures for a database:
Single leader (e.g. an RDBMS)
Multi-leader (e.g. an RDBMS across multiple data centers)
Leaderless (e.g. Riak, Cassandra)
As you go from top to bottom in the above list, the potential for horizontal scalability increases, but consistency becomes weaker.
Scalability potential increases because more nodes can accept writes as you go down the list. Consistency becomes weaker because writes take time to propagate or replicate to all nodes responsible for the data. Conflicts arise when the same record is written on two different nodes at almost the same time, so at replication time the system does not know which one is correct.
There are various conflict resolution strategies, and different databases use different ones. You need to study these strategies to understand which one suits your use case, and based on that you pick your database.
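As a toy, database-agnostic illustration of two common strategies for the conflict case described above (the names and structure here are mine, not any particular database's):

from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp: float   # wall-clock time of the write; LWW is only as good as your clocks
    node: str          # which node accepted the write

def last_write_wins(a: Version, b: Version) -> Version:
    # Strategy 1 (used by e.g. Cassandra): keep the write with the newer
    # timestamp and silently discard the other.
    return a if a.timestamp >= b.timestamp else b

def keep_both(a: Version, b: Version) -> list[Version]:
    # Strategy 2 (e.g. Riak siblings): keep both versions and let the
    # application merge them on the next read.
    return [a, b]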
There is always a trade-off when making choices. A database has its limitations, and even before scaling the database we can avoid performance hits by following simple best practices. You can't leave it to the database alone to handle a high request rate; keep in mind that scaling the database is an expensive option, and you will eventually hit the database's limits if it isn't done right, so plan the whole system rather than just the database.
Coming to your point: having one master for writes and slaves for reads is a very common approach, but you have to rely on eventual consistency, and SQL Server Always On is something you can have a look at. You can cache the most frequently accessed data. If you have a very high request rate, you may need to consider queues, where you enqueue requests and dequeue them later to avoid a database performance hit.
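A minimal sketch of that master/replica split, assuming a SQL Server primary and a read replica with the made-up hostnames sql-primary and sql-replica, and the pyodbc driver:

import pyodbc

# Hypothetical connection strings: writes always hit the primary,
# reads go to a replica and accept a small replication lag.
WRITE_DSN = ("DRIVER={ODBC Driver 18 for SQL Server};"
             "SERVER=sql-primary;DATABASE=shop;Trusted_Connection=yes;")
READ_DSN = ("DRIVER={ODBC Driver 18 for SQL Server};"
            "SERVER=sql-replica;DATABASE=shop;Trusted_Connection=yes;")

def save_order(order_id: int, amount: float) -> None:
    with pyodbc.connect(WRITE_DSN) as conn:
        conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)",
                     order_id, amount)

def load_order(order_id: int):
    with pyodbc.connect(READ_DSN) as conn:
        return conn.execute("SELECT id, amount FROM orders WHERE id = ?",
                            order_id).fetchone()

Anything that cannot tolerate a stale read (for example, read-your-own-writes right after an update) should still go through the write connection.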

What is a good solution for cross datacenter master-master replication?

Let's say we have a distributed system where data is sharded by user IDs. In most cases each shard is changed by the user who owns that shard. There is more than one datacenter, and users are balanced between these datacenters using DNS, cookies, or some other mechanism; i.e., in most cases every user is served by one datacenter. This means we can replicate data between datacenters in a master-master manner. Of course conflicting writes are possible, but they are rare and could be resolved using, say, CRDTs or vector clocks. And if a datacenter fails, users can simply be redirected to the other datacenters!
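To make the vector-clock idea concrete, here is a toy, database-agnostic sketch of the conflict-detection step I have in mind (purely illustrative):

# Each datacenter keeps a counter per writer. A write "happened before" another
# only if its clock is <= the other's in every component; if neither dominates,
# the writes were concurrent and must be merged (e.g. by a CRDT or app logic).

def happened_before(a: dict, b: dict) -> bool:
    keys = set(a) | set(b)
    return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in keys)

def is_conflict(a: dict, b: dict) -> bool:
    return a != b and not happened_before(a, b) and not happened_before(b, a)

# Two datacenters updated the same user record concurrently:
clock_eu = {"eu": 3, "us": 1}
clock_us = {"eu": 2, "us": 2}
assert is_conflict(clock_eu, clock_us)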
Are there any databases capable of solving the described problem? Maybe some services provided by AWS or Google Cloud offer a solution?
I can't answer the specific use case here, because of the way you may want to shard, but Aerospike has XDR where all clusters are peers.
https://www.aerospike.com/docs/architecture/xdr.html
All clusters will be "eventually consistent" using XDR because of speed-of-light delays, plus we batch the updates between datacenters. Yet within each cluster, the Aerospike DB will be immediately consistent.
Feel free to check out our docs for more answers:
https://www.aerospike.com/docs/

Running Solr in a cluster - high availability only

I would like to run two Solr instances on different computers as a cluster.
My main interest is High availability - meaning, in case one server crashes or is down there will be always another one.
(My performance on a single instance is great; I do not need to split the data across two servers.)
Questions:
1. What is the best practice? Is it different from clustering for index splitting? Do I need shards?
2. Do I need ZooKeeper?
3. Is it a container-based configuration (different for Jetty and Tomcat)?
4. Do I need an external NLB for that?
5. When one computer comes back up after crashing, how does it update its index?
You can define numShards=1 and that's it: you need a single slice, replicated. If you want automated cluster management and hot replication, then yes, you need SolrCloud mode and ZooKeeper. As for load balancing, it depends on your architecture; if you are going to use SolrJ, there is a basic load-balancing implementation there.
When a node initializes, it enters the recovery stage. During the recovery stage it synchronizes with the other existing replicas as well as with its own transaction log. If its index version is old, it gets a newer version from another server.
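For reference, a rough sketch of creating such a one-shard, replicated collection through the SolrCloud Collections API; the hostname, collection name and config name below are made up:

import requests

# One shard, two replicas: high availability without splitting the index.
params = {
    "action": "CREATE",
    "name": "mycollection",
    "numShards": 1,
    "replicationFactor": 2,
    "collection.configName": "myconfig",  # config set previously uploaded to ZooKeeper
}
resp = requests.get("http://solr-node1:8983/solr/admin/collections", params=params)
resp.raise_for_status()
print(resp.json())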

Does memcached share across servers in google app engine?

On the memcached website it says that memcached is a distributed memory cache. It implies that it can run across multiple servers and maintain some sort of consistency. When I make a request in Google App Engine, there is a high probability that requests in the same entity group will be serviced by the same server.
My question is: say there were two servers servicing my requests, is the view of memcached from these two servers the same? That is, are things I put in memcached on one server reflected in the memcached instance for the other server, or are these two completely separate memcached instances (one for each server)?
Specifically, I want each server to actually run its own instance of memcached (no replication in other memcached instances). If it is the case that these two memcached instances update one another concerning changes made to them, is there a way to disable this?
I apologize if these questions are stupid, as I just started reading about it, but these are initial questions I have run into. Thanks.
App Engine does not really use memcached, but rather an API-compatible reimplementation (chiefly by the same guy, I believe -- in his "20% time";-).
Cached values may disappear at any time (via explicit expiration, a crash in one server, or due to memory scarcity in which case they're evicted in least-recently-used order, etc), but if they don't disappear they are consistent when viewed by different servers.
The memcached server chosen doesn't depend on the entity group that you're using (the entity group is a concept from the datastore, a different beast).
Each server runs its own instance of memcached, and each server will store a percentage of the objects that you store in memcache. The way it works is that when you use the memcached API to store something under a given key, a memcached server is chosen (based on the key).
There is no replication between memcached instances; if one of those boxes goes down, you lose 1/N of your memcached data (N being the number of memcached instances running in App Engine).
Typically, memcached does not share data between servers. The application server hashes the key to choose a memcached server, and then communicates with that server to get or set the data.
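A toy illustration of that key-to-server mapping (real clients typically use consistent hashing rather than a plain modulo, and the addresses below are made up):

import hashlib

MEMCACHED_SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def server_for(key: str) -> str:
    # Every app server computes the same mapping, so they all talk to the
    # same memcached server for a given key, with no replication needed.
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return MEMCACHED_SERVERS[digest % len(MEMCACHED_SERVERS)]

print(server_for("user:42"))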
Based on what I know, there is only ONE instance of Memcache for your entire application. There could be many instances of your code running, each with its own memory, and many datastore servers around the world, but there is only one Memcache service at a time. Keep in mind that this service is susceptible to failure, and there is no SLA for it.

Scaling out SQL Server for the web (Single Writer Multiple Readers)

Has anyone had any experience scaling out SQL Server in a multi-reader, single-writer fashion? If not, can anyone suggest a suitable alternative, that they have experience with, for a read-intensive web application?
It probably depends on 2 things:
How big is each single write?
Do readers need real-time data?
A write will block readers when writing, but if each write is small and fast then readers won't notice.
If you offload, say, end-of-day reporting, then you batch your load onto a separate server because those readers do not require real-time data. This makes sense.
Otherwise, a write on your primary server must be synced to your offload secondary server, which will block there as part of the sync process anyway, plus you add overhead to manage the sync.
Most apps are 95%+ read anyway all the time. For example, an update or delete is a read followed by a write.
My choice would be (probably, given the low write volume and the fact that it's a web app) to scale up and put as much RAM as I could into the DB server, with separate disk paths for the database's data and log files.
I don't have any experience with scaling out SQL Server for your scenario.
However, for a read-intensive application I would look at reducing the load on the database and employing a cache strategy using something like Memcached or MS Velocity.
There are two approaches that I'm aware of:
Have the entire database loaded into the cache and manage adding and updating of items in the cache.
Add items to the cache only when they are requested, and remove them when a write operation is performed (a cache-aside approach, sketched below).
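A rough cache-aside sketch of that second approach, assuming a memcached server at a made-up address and a hypothetical db data-access object:

import memcache  # python-memcached client

mc = memcache.Client(["127.0.0.1:11211"])  # hypothetical cache address

def get_product(product_id, db):
    key = f"product:{product_id}"
    cached = mc.get(key)
    if cached is not None:
        return cached                       # cache hit: the database is not touched
    row = db.fetch_product(product_id)      # hypothetical data-access call
    mc.set(key, row, time=300)              # cache miss: populate for 5 minutes
    return row

def update_product(product_id, fields, db):
    db.update_product(product_id, fields)   # the write goes to the database
    mc.delete(f"product:{product_id}")      # invalidate so the next read refreshes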
Some kind of replication would do the trick.
http://msdn.microsoft.com/en-us/library/ms151827.aspx
You of course need to change your app code.
Some people use partitioned tables, with different row ranges stored on different servers and united with views. This would be invisible to the app. This practice is called federation, I think.
By designing your database, application and server configuration (SQL particulars - location of data/log/system/sql binaries/tempdb), you should be able to handle a pretty good load. Try not to complicate things if you don't have to.
