I am currently examining different NoSQL databases and RDBMSes with regard to their replication capabilities, in order to build distributed systems.
Reading through several papers and books, I get the feeling that some vendors and authors use their own definitions of the following terms:
Master-Master Replication (Replication between two servers)
Master-Slave Replication (Replication between multiple servers in order to increase read speed; writes are only possible on the master server)
Multi-Master Replication (= Peer-To-Peer?)
Peer-To-Peer Replication (replication between n nodes, each can read/write)
Merge Replication (?)
E.g.: Some treat Master-Master and Peer-to-Peer as the same thing, while in the MySQL docs, for instance, I found that a distinction is made between Master-Master and Multi-Master (= Peer-to-Peer?) replication.
What is the difference between Multi-Master and Peer-to-Peer replication?
Is Multi-Master replication's use case more oriented towards Clustering while Peer-To-Peer targets distributed content to distributed applications?
I would like to sort things out and be sure that I have the right understanding of these terms, so maybe a discussion here would help to pool some knowledge.
Regards, Chris
Edit: added merge replication to the list and some explanations as I understand them...
Regarding CouchDB, the story is simple. Here it is:
There is only one replication mode for CouchDB. The source copies all its data to the target, subject to an optional yes/no filter. I described CouchDB replication in another question. The key point is that "replication" is simply a DB client. It connects to both couches, reads from the source, and writes to the target.
Any other big-picture architecture (peer-to-peer, multi-master, master-slave) is just something the developers or system administrators build on top of that. For example, if GETs are distributed to many couches, but POSTs go to one central couch which replicates to the others, that is effectively master-slave. If you put a CouchDB in every major city for performance, and they replicate directly with each other, that is multi-master replication.
Within the CouchDB community, and especially in Chris Anderson's projects and presentations, "peer-to-peer" replication is a concept where CouchDB is everywhere: mobile phones, data centers, telephone poles. Replication happens directly between couches in a decentralized way, without a central authority or architecture, like the web itself.
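To make the "replication is simply a DB client" idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the CouchDB API: `ToyCouch`, `replicate`, and `doc_filter` are hypothetical names, and conflict handling is left out entirely. It only shows how the same one-way copy can be arranged into a master-slave or a multi-master topology.

```python
from typing import Callable, Dict, Optional

Doc = Dict[str, object]


class ToyCouch:
    """Hypothetical in-memory stand-in for one database node (not the CouchDB API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.docs: Dict[str, Doc] = {}


def replicate(source: ToyCouch, target: ToyCouch,
              doc_filter: Optional[Callable[[Doc], bool]] = None) -> int:
    """One-way replication: read from source, write to target.

    Returns the number of documents copied. Conflicts are not handled here.
    """
    copied = 0
    for doc_id, doc in source.docs.items():
        # Optional yes/no filter: skip documents the filter rejects.
        if doc_filter is not None and not doc_filter(doc):
            continue
        if target.docs.get(doc_id) != doc:
            target.docs[doc_id] = dict(doc)
            copied += 1
    return copied


# Master-slave: writes go to one node, which replicates outward.
master, replica = ToyCouch("master"), ToyCouch("replica")
master.docs["a"] = {"city": "Berlin"}
replicate(master, replica)

# Multi-master: every node accepts writes and replicates in both directions.
berlin, tokyo = ToyCouch("berlin"), ToyCouch("tokyo")
berlin.docs["b"] = {"city": "Berlin"}
tokyo.docs["t"] = {"city": "Tokyo"}
replicate(berlin, tokyo)
replicate(tokyo, berlin)
```

In both cases the replication function is identical; only who writes where, and in which directions `replicate` is called, differs.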
Can anyone explain to me the precise difference between a distributed database and a decentralized database?
Decentralized
It means that there is no central storage. Several servers provide information to the clients, and the servers are connected with each other.
Distributed
There are no dedicated data stores. All the nodes contain information. The clients are equal and have equal rights.
Main Difference
A distributed database is a single logical database that is installed on a set of computers located at geographically different sites and linked through a data communication network, whereas a decentralized database is a database that is installed on systems located at geographically different sites but not linked through a data communication network.
As for blockchain, it draws on centralized relational databases and especially distributed databases, and it leverages cryptography to provide a multi-version concurrency control mechanism and to maintain consensus about the existence and status of shared facts in a trustless environment.
Source
Database in blockchain video
The hierarchy goes from centralized to decentralized to distributed.
Decentralized is simply centralized on a smaller (larger?) scale. The risk of losing data or of some catastrophic event is reduced because there are lots of little 'centers'.
Distributed databases/ledgers have no 'center'. The risk of a catastrophic event is further reduced, although other risks (forks etc.) arise.
I'm looking for an open source data store that scales as easily as Cassandra but whose data can be queried as documents, like MongoDB.
Are there currently any databases out there that do this?
On this website, http://nosql-database.org, you can find a list of many NoSQL databases sorted by datastore type. You should check the document stores there.
I'm not naming any specific database to avoid a biased/opinion-based answer, but if you are interested in a data store that is as scalable as Cassandra, you probably want to check those which use master-master/multi-master/masterless (you name it, the idea is the same) architecture, where both writes and reads can be split among all nodes in the cluster.
I know Cassandra is optimized for writes rather than reads, but without further details in the question I can't refine the answer with more information.
Update:
Disclaimer: I haven't used CouchDB at all, and haven't tested its performance either.
Since you spotted CouchDB, I'll add what I've found in the official documentation, in the distributed database and replication section.
CouchDB is a peer-based distributed database system. It allows users
and servers to access and update the same shared data while
disconnected. Those changes can then be replicated bi-directionally
later.
The CouchDB document storage, view and security models are designed to
work together to make true bi-directional replication efficient and
reliable. Both documents and designs can replicate, allowing full
database applications (including application design, logic and data)
to be replicated to laptops for offline use, or replicated to servers
in remote offices where slow or unreliable connections make sharing
data difficult.
The replication process is incremental. At the database level,
replication only examines documents updated since the last
replication. Then for each updated document, only fields and blobs
that have changed are replicated across the network. If replication
fails at any step, due to network problems or crash for example, the
next replication restarts at the same document where it left off.
Partial replicas can be created and maintained. Replication can be
filtered by a javascript function, so that only particular documents
or those meeting specific criteria are replicated. This can allow
users to take subsets of a large shared database application offline
for their own use, while maintaining normal interaction with the
application and that subset of data.
That looks quite scalable to me, as it seems you can add new nodes to the cluster and then all the data gets replicated.
Partial replicas also seem an interesting option for really big data sets, although I'd configure them very carefully in order to prevent situations where a given query to the database might not yield valid results, for example in the case of a network partition where only a partial set is accessible.
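The incremental part of the quoted description can be illustrated with a small Python sketch. This is a toy model and not how CouchDB is implemented: `ToyNode`, `replicate_incremental`, and the integer checkpoint are hypothetical, and the sketch copies whole documents rather than only changed fields. It only shows the idea of remembering how far the last replication got and resuming from there.

```python
from typing import Dict, List, Tuple


class ToyNode:
    """Hypothetical node that records every write in an append-only change log."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}
        self.changes: List[Tuple[int, str]] = []  # (sequence number, doc id)

    def put(self, doc_id: str, doc: dict) -> None:
        self.docs[doc_id] = doc
        self.changes.append((len(self.changes) + 1, doc_id))


def replicate_incremental(source: ToyNode, target: ToyNode, checkpoint: int) -> int:
    """Copy only documents changed after `checkpoint`; return the new checkpoint.

    One-way only: the target's own change log is deliberately left untouched.
    """
    for seq, doc_id in source.changes:
        if seq <= checkpoint:
            continue  # already replicated in an earlier run
        target.docs[doc_id] = dict(source.docs[doc_id])
        checkpoint = seq  # if a later step fails, the next run resumes from here
    return checkpoint


source, target = ToyNode(), ToyNode()
source.put("a", {"v": 1})
source.put("b", {"v": 1})
checkpoint = replicate_incremental(source, target, 0)           # copies a and b
source.put("a", {"v": 2})
checkpoint = replicate_incremental(source, target, checkpoint)  # copies only a
```

A real system would persist the checkpoint somewhere durable so that a crash between runs does not force a full re-copy.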
We are building a distributed system that has a hub-and-spoke topology: a central office and remote sites which are connected to the central office over a low-bandwidth (~10Mbps) WAN connection.
The system is managing data which could be updated at any site, and should also be replicated across the whole system.
The system should support working when disconnected from the network. When the network re-connects, data should be synchronized again.
We are considering using a NoSQL DB for managing our data. However we are a bit overwhelmed by the many different alternatives.
We'd love to hear suggestions about suitable solutions.
What kind of data are you talking about? Is it already modeled in some way?
I haven't heard of people using NoSQL systems as a replacement in site-decentralized RDBMS scenarios, where you would normally look at something like replication. Some are probably better suited to this than others, e.g. MongoDB. Of course, the major RDBMS vendors all have their flavors of replication. In many cases, site replication is handled with master-slave, with the site being master and the central office being slave, so that remote operation can run without a connection.
Just for End-Of-Day data there will be billions of rows. What is the best way to store all that data? Is SQL Server 2008 good enough for that, or should I look towards a NoSQL solution like MongoDB? Any suggestions?
It would be nice to have one master DB with read/write permissions and one or more replicas of it for read-only operations. Only the master database will be used for adding new prices to the storage. It would also be nice to be able to replicate OHLC prices for the most popular securities individually in order to optimize read access.
This data then will be streamed to a trading platform on clients' machines.
You should consider Oracle Berkeley DB which is in production doing this within the infrastructure of a few well known stock exchanges. Berkeley DB will allow you to record information at a master as simple key/value pairs, in your case I'd imagine a timestamp for the key and an encoded OHLC set for the value. Berkeley DB supports single master multi-replica replication (called "HA" for High Availability) to support exactly what you've outlined - read scalability. Berkeley DB HA will automatically fail-over to a new master if/when necessary. Using some simple compression and other basic features of Berkeley DB you'll be able to meet your scalability and data volume targets (billions of rows, tens of thousands of transactions per second - depending on your hardware, OS, and configuration of BDB - see the 3n+1 benchmark with BDB for help) without issue.
When you start working on accessing that OHLC data, consider Berkeley DB's support for bulk-get, and make sure you use the B-Tree access method (because your data has order, locality will provide much faster access). Also consider the Berkeley DB partitioning API to split your data (perhaps based on symbol or even based on time). Finally, because you'll be replicating the data, you can relax the durability constraints to DB_TXN_WRITE_NOSYNC as long as your replication acknowledgement policy requires a quorum of replicas to ACK a write before considering it durable. You'll find that a fast network beats a fast disk in this case. Also, to offload some work from your master, enable peer-to-peer log replica distribution.
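As a rough illustration of the "timestamp for the key and an encoded OHLC set for the value" layout, here is a small Python sketch. It does not use Berkeley DB at all: a plain dict stands in for the store, and the formats (epoch milliseconds as a big-endian 64-bit key, four big-endian doubles as the value) are my own assumptions. The point is that a fixed-width big-endian key sorts byte-wise in the same order as the timestamps, which is what an ordered access method such as a B-Tree with byte-wise key comparison can exploit.

```python
import struct

KEY_FORMAT = ">Q"     # assumed: big-endian unsigned 64-bit epoch milliseconds
VALUE_FORMAT = ">4d"  # assumed: open, high, low, close as big-endian doubles


def encode_key(timestamp_ms: int) -> bytes:
    """Fixed-width big-endian key, so byte order equals chronological order."""
    return struct.pack(KEY_FORMAT, timestamp_ms)


def encode_value(open_: float, high: float, low: float, close: float) -> bytes:
    return struct.pack(VALUE_FORMAT, open_, high, low, close)


def decode_value(raw: bytes) -> tuple:
    return struct.unpack(VALUE_FORMAT, raw)


store = {}  # stand-in for an ordered key/value store
# Inserted out of order on purpose.
store[encode_key(1_700_000_060_000)] = encode_value(101.0, 102.5, 100.5, 102.0)
store[encode_key(1_700_000_000_000)] = encode_value(100.0, 101.5, 99.5, 101.0)

# Sorting the raw byte keys recovers chronological order.
ordered_keys = sorted(store)
first_bar = decode_value(store[ordered_keys[0]])
```

A little-endian or variable-length text key would not sort this way under byte-wise comparison, so the encoding choice matters more than it first appears.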
But, first read the replication manager getting started guide and review the rep quote example - which already implements some of what you're trying to do (handy, eh?).
Just for the record, and in the interest of full disclosure: I work as a product manager at Oracle on Berkeley DB products, and have for the past nine years, so I'm a tad biased. I'd guess that the other solutions, SQL-based or not, might eventually give you a working system, but I'm confident that Berkeley DB can without too much effort.
If you're really talking billions of new rows a day (Federal Express' data warehouse isn't that large), then you need an SQL database that can partition across multiple computers, like Oracle or IBM's DB2.
Another alternative would be a heavy-duty system managed storage like IBM's DFSMS.
What is the difference between the three types of replication?
Is replication suitable for data archiving?
What are the replication steps?
The following are the three types of replication in SQL Server:
Transactional replication
Merge replication
Snapshot replication
For more, see http://technet.microsoft.com/en-us/library/ms152531.aspx
Replication can be used for archiving purposes as well, but only with some additional mechanisms. In most cases I have seen, it is used in data warehousing scenarios to reduce load on the OLTP system.
See http://msdn.microsoft.com/en-us/library/ms151198.aspx
From MSDN:
Transactional replication is typically used in server-to-server scenarios that require high throughput, including: improving scalability and availability; data warehousing and reporting; integrating data from multiple sites; integrating heterogeneous data; and offloading batch processing. Merge replication is primarily designed for mobile applications or distributed server applications that have possible data conflicts. Common scenarios include: exchanging data with mobile users; consumer point of sale (POS) applications; and integration of data from multiple sites. Snapshot replication is used to provide the initial data set for transactional and merge replication; it can also be used when complete refreshes of data are appropriate. With these three types of replication, SQL Server provides a powerful and flexible system for synchronizing data across your enterprise.
There are lots of other articles on the Internet and lots of good books on SQL Server. It's a bit of a broad question to ask what it is AND how to implement it.