suggestions for a replicated datastore - database

I am looking for a simple key-value datastore which will automatically replicate itself across different machines. Unfortunately, a Distributed Hash Table will not work for me since I need the whole datastore to be available on every machine. I have looked at Mnesia from the Erlang world, but talking to it from different languages is a pain.
Any suggestions on what I should go for?
Thanks!

DNS. It's a distributed cached mapping table replicated across all masters (and partially to clients).
I think we can safely say it scales.

Related

OrientDB in distributed architecture works with vertex replication across servers?

I have worked on a project with the OrientDB graph database. I've managed to fill the database and perform queries on it without problems. But then I needed to run my queries using OrientDB's distributed feature, and I ran into an important (maybe trivial) question.
I've managed to use the distributed mode without problems using 3 different machines, but I wanted to be sure that OrientDB is really storing my database across the 3 machines that I've used. Is there any way to check that?
While researching this, I came to the conclusion that OrientDB replicates the entire database across all the machines; is that correct? My goal in using the distributed architecture was to improve performance, but if OrientDB works by replication and I run a query on a specific machine, will the query be processed using all machines, or only one?
In short, I want to know whether OrientDB, when using distributed mode, distributes the vertices and edges across the machines and processes queries using all of them.
I've read the entire documentation: http://orientdb.com/docs/2.0/orientdb.wiki/Distributed-Architecture.html and could not find a clear explanation for these questions.
Thanks in advance!
OrientDB, by default, replicates the entire DB on all the servers. What you're looking for is called "sharding". OrientDB supports manual sharding (automatic sharding is planned), which means you (the application) decide where to store the vertices/edges.
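As a rough illustration of what "manual sharding" means in practice, the application can pick the target cluster for each record based on a shard key. This is only a minimal sketch in Python: the execute_sql callable and the cluster names are assumptions for the example, and you should verify the INSERT INTO CLUSTER syntax against your OrientDB version.

```python
# Hypothetical sketch of manual sharding with OrientDB: the application picks
# the target cluster for each vertex based on a shard key (here, the region).
# CLUSTERS and execute_sql are assumptions, not part of any real driver.

CLUSTERS = {"eu": "client_eu", "us": "client_us", "asia": "client_asia"}

def insert_client(execute_sql, name: str, region: str) -> None:
    """execute_sql wraps whatever OrientDB client you use (binary or REST)."""
    cluster = CLUSTERS[region]  # application-level shard decision
    execute_sql(
        f"INSERT INTO CLUSTER:{cluster} SET name = '{name}', region = '{region}'"
    )
```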

Scalable database technology and architecture

I've been trying to learn more about database scaling in a distributed system, and I am stuck in between RDBMS and NoSQL.
Some articles online suggest that NoSQL is the solution to modern Big Data. Others say NoSQL is just hype and that an RDBMS can be just as scalable with good design, while also providing good data structure.
Instead of reading others' opinions, I'd love to judge the two myself, but I do not understand exactly what is required for a scalable RDBMS and a scalable NoSQL.
I've done a bit more reading on RDBMSs, and it seems that the solution requires leveraging memcached and sharding to reduce database size and the number of DB queries. Are there other tricks? Can you still use tables with many columns? Or should you use fewer columns and more joins?
As for NoSQL, I've read a little about MongoDB. I understand that it encourages data aggregation. But how does that make it more scalable? I'm also starting to learn Cassandra because I read that it scales much better than MongoDB, but I have no idea how it is more scalable.
I would very much appreciate a basic (or advanced, if you have the patience to type it out) condensed and down-to-the-core explanation on scaling RDBMS and NoSQL, or good articles online or books that explain the topic. :)
I won't cover ways you can scale by implementing things on your own, such as putting a memcached server in between; I'll just cover what comes right out of the box.
Let's start with RDBMS:
I think setting up an RDBMS cluster is more complicated than a NoSQL cluster, but that's just my opinion. Usually what you have is one master and multiple slaves. You have to send all your writes to the master and can read from any slave you want. Since you have an RDBMS and ACID, the system should somehow guarantee that you won't read old data. The assumption here is that your application writes once and reads often (as is usually the case). For those purposes, one server for read/write and multiple servers for reads is great. The problem is if your writes are so frequent that the one machine can't keep up with them anymore. That is your bottleneck. In addition to the built-in solutions from Oracle, for instance - which are huge - there is also http://www.scalearc.com/ which can cache queries and handle the scaling for you.
NoSQL:
There is no single NoSQL model that all the DBs implement. Every system is a bit different. MongoDB, for instance, is quite similar to an RDBMS: it also has one master and several slaves to which it can replicate data, but additionally you can create shards. Data is split between shards and replicated to slaves, so you can have multiple masters, each responsible for a smaller part of the data. Afterwards, when you read, you can choose whether to read from the master or from a slave - depending on how urgently you need the latest data.
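A short pymongo sketch of that trade-off, assuming a replica set named rs0 on hypothetical hosts: writes always go to the primary, while this collection is configured so reads may be served by a secondary when slightly stale data is acceptable.

```python
from pymongo import MongoClient, ReadPreference

# Hypothetical replica set; host names and database/collection are examples.
client = MongoClient("mongodb://node1:27017,node2:27017/?replicaSet=rs0")

orders = client.shop.get_collection(
    "orders", read_preference=ReadPreference.SECONDARY_PREFERRED
)

orders.insert_one({"item": "book", "qty": 2})   # writes go to the primary
print(orders.find_one({"item": "book"}))        # read may be served by a secondary
```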
Cassandra, on the other hand, works quite differently. I'm not sure exactly how its writes are coordinated across servers, but basically the servers keep a log of all the writes. So even if they can't process the writes immediately, they are stored in a log, which still gives you a fast response. Afterwards, when you read, you can again specify how up to date the data needs to be, and if you really want the very latest data, Cassandra will need to check the log for any pending updates, which will cost you time.
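With the DataStax Python driver, that "how urgently do you need the latest data" knob is the per-query consistency level. A minimal sketch; the host names, keyspace and table are made up for illustration.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical three-node cluster and keyspace.
session = Cluster(["node1", "node2", "node3"]).connect("myks")

# Fast, possibly slightly stale read: one replica is enough to answer.
fast = SimpleStatement("SELECT * FROM events WHERE id = 42",
                       consistency_level=ConsistencyLevel.ONE)

# Stricter read: a majority of replicas must agree, which costs latency.
strict = SimpleStatement("SELECT * FROM events WHERE id = 42",
                         consistency_level=ConsistencyLevel.QUORUM)

rows = session.execute(strict)
```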
Key-value stores like ElasticSearch, CouchDB and Couchbase work differently again. Here the key of the item is hashed and, based on the hash, the item is sent to one node which will be responsible for it. This way, when you read after the key was written, you get up-to-date information, because you'll read from the same node. The idea behind this design is that no single key will be of interest to everyone, so the load will be distributed. These are also the DBs which I think scale the best and make it easiest to add more servers to the cluster, but you lose the power of complex queries like you have in MongoDB and Cassandra - and of course RDBMSs. ElasticSearch has some simple search queries, and CouchDB and Couchbase only have views, which are produced by MapReduce; you can get the data you want if it fits the view, otherwise you can only access it by the key.
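The hash-based placement described above boils down to something like the following sketch. The node names are invented; real stores use consistent hashing so that adding a node moves only a fraction of the keys.

```python
import hashlib

# Hypothetical nodes in the cluster.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key: str) -> str:
    # Hash the key and map it onto one of the nodes; the same key always
    # lands on the same node, so reads after a write see the latest value.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for("user:1001"))  # always the same node for this key
```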
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis is a very comprehensive summary of the most common NoSQL DBs, their strengths and weaknesses, and their most common usage scenarios.
In the end, the question is also: why do you want to scale? How many records are you going to have in the database? A few million is not a problem at all. A few hundred million is also not a problem for most RDBMSs on a powerful enough server. And if you design the DB and its indices properly, even a billion records per year should still be fine.

HBase or Hive - web requests

Is either HBase or Hive a suitable replacement for your traditional (non-)relational database? Will they be able to serve web requests from web clients and respond in a timely manner? Are HBase/Hive only suitable for large dataset analysis? Sorry, I'm a noob at this subject. Thanks in advance!
Hive is not at all suitable for any real-time need such as timely web responses. You can use HBase, though. But don't think of either HBase or Hive as a replacement for traditional RDBMSs. Both were meant to serve different needs. If your data is not huge enough, better to go with an RDBMS. RDBMSs are still the best choice (if they fit your requirements). Technically speaking, HBase is really more a DataStore than a DataBase, because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
And the most important thing that could strike a newbie is HBase's lack of SQL support, since it belongs to the NoSQL family of stores; you access data through its API by row key, as sketched below.
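A minimal sketch of that row-key-oriented access pattern using the happybase client (which talks to HBase over Thrift); the host, table and column family names are assumptions for illustration.

```python
import happybase

# Hypothetical Thrift gateway and table.
connection = happybase.Connection("hbase-thrift.example")
table = connection.table("users")

# Everything revolves around the row key: no SQL, no secondary indexes.
table.put(b"user#1001", {b"info:name": b"Alice",
                         b"info:email": b"alice@example.com"})

row = table.row(b"user#1001")   # fast point lookup by row key
print(row[b"info:name"])
```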
And HBase/Hive are not the only options for handling large datasets. You have several options like Cassandra, Hypertable, MongoDB, Accumulo, etc. But each one is meant for solving some specific problem. For example, MongoDB is used for handling document data. So you need to analyze your use case first, and based on that choose the datastore which suits your requirements.
You might find this list useful which compares different NoSQL datastores.
HTH
Hive is a data warehouse tool, and it is mainly used for batch processing.
HBase is a NoSQL database which allows random access based on the rowkey (primary key). It is used for transactional access. It doesn't have secondary index support, which could be a limitation for your needs.
Thanks,
Dino

Basic Database Question?

I am interested in knowing a little bit more about databases than I currently do. I know how to set up a database backend for any web app that I happen to be creating, but that is all. For example, if I was creating three different apps, I would simply create three different databases and then configure each database for the particular app. This is all simple knowledge, and I would now like to have a deeper understanding of how databases actually work.
Let's say that I developed an application, for example, that needed a lot of space and processing power. This database would then have to be spread over numerous machines. How exactly would a database be spread across numerous machines and still be able to write records and then retrieve them? Would each table get its own machine, and what software is needed to make sure that the different machines have all performed their transactions successfully?
As you can see, I am quite a database ignoramus, lol.
Any help in clearing this up would be greatly appreciated.
I don't know what RDBMS you're using but I have two book suggestions.
For theory (which should come first, in my opinion): Database in Depth: Relational Theory for Practitioners
For implementation: High Performance MySQL: Optimization, Backups, Replication, and More
I own both these books and they are both pretty great, especially the first one.
That's quite a broad topic... You might want to start with Multi-master replication, High-availability clustering and Massively parallel processing.
If you want to know how to keep databases running under ever-increasing load, then it's not a basic question. Several well-known web companies are struggling to find the right way to make their databases scalable.
Using memcached to cache database information is one way to decrease load on your database if your application is read-intensive. If your application is write-intensive, then maybe you would want to consider using a NoSQL datastore like MongoDB or Redis.
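The memcached idea is usually implemented as a cache-aside pattern. A minimal Python sketch using pymemcache; query_db stands in for your real database call, and the key scheme and expiry are just examples.

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))  # local memcached, default port

def get_user(user_id, query_db):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                  # served from memcached, DB not touched
    value = query_db(user_id)          # cache miss: hit the database once
    cache.set(key, value, expire=300)  # keep the result for 5 minutes
    return value
```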
Database Design for Mere Mortals
This is the best book about the subject if you don't have any experience with databases. It has historical background and practical examples. Most books skip the historical stuff because they assume you know what a DB is, or think it doesn't matter, and jump right to the practical. This book gives you the complete picture.

voldemort vs. couchdb

I am trying to decide whether to use Voldemort or CouchDB for an upcoming healthcare project. I want a storage system that has high availability, fault tolerance, and can scale for the massive amounts of data being thrown at it.
What are the pros/cons of each?
Thanks
Project Voldemort looks nice, but I haven't looked deeply into it so far.
In its current state CouchDB might not be the right thing for "massive amounts of data". Distributing data between nodes and routing queries accordingly is on the roadmap but not implemented so far. The biggest known production setups of CouchDB use "tables" ("databases" in couch-speak) of about 200 GB.
HA is not natively supported by CouchDB but can be built easily: all CouchDB nodes replicate the databases between each other in a multi-master setup. We put two Varnish proxies in front of the CouchDB machines, and the Varnish boxes are made redundant with CARP. CouchDB's "built from the Web" design makes such things very easy.
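For reference, that kind of multi-master replication is set up through CouchDB's standard _replicate HTTP endpoint; here is a minimal sketch (the node and database names are made up).

```python
import requests

def replicate(source_node, target_node, db="patients"):
    # Ask source_node to continuously push db to target_node.
    requests.post(
        f"http://{source_node}:5984/_replicate",
        json={"source": db,
              "target": f"http://{target_node}:5984/{db}",
              "continuous": True},
    ).raise_for_status()

# Replicate in both directions to get a multi-master setup.
replicate("couch1", "couch2")
replicate("couch2", "couch1")
```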
The most pressing problem in our setup is that there are still issues with replicating large (multi-MB) attachments to CouchDB documents.
I suggest you also check the traditional RDBMS route. There are huge issues with available talent outside the RDBMS approach and there are very capable offerings available from Oracle & Co.
Not knowing enough from your question, I would nevertheless say that Project Voldemort or distributed hash tables (DHTs) like CouchDB are, in general, a solution to your HA problem.
Those DHTs are very nice for high availability, but harder to write code for than traditional relational databases (RDBMSs) when it comes to consistency.
They are quite good for storing document-type information, which may fit nicely with your healthcare project, but they make development around that data harder.
The biggest limitation of most stores is that they are not transactionally safe (see Scalaris for a transactionally safe store) and you need to ensure data consistency yourself - most use read-time consistency, merging conflicting data. RDBMSs are much easier to use for data consistency (ACID).
Joining data is much harder too. In RDBMSs you can easily query data across several tables; in CouchDB you need to write code to aggregate data. For other stores, Hadoop may be a good choice for aggregating information.
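To illustrate the "write code to aggregate" point: in CouchDB you define a view in a design document (a map function plus an optional reduce) and then query it over HTTP, which is roughly what GROUP BY gives you in SQL. A sketch with invented node, database and view names.

```python
import requests

# A design document holding one view: count visits per patient.
design_doc = {
    "views": {
        "visits_by_patient": {
            "map": "function(doc) { if (doc.type == 'visit') emit(doc.patient_id, 1); }",
            "reduce": "_count",
        }
    }
}
requests.put("http://couch1:5984/patients/_design/stats", json=design_doc)

# Query the view grouped by key -- one count per patient_id.
result = requests.get(
    "http://couch1:5984/patients/_design/stats/_view/visits_by_patient",
    params={"group": "true"},
).json()
print(result["rows"])
```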
Read about BASE and the CAP theorem on consistency vs. availability.
See
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
http://queue.acm.org/detail.cfm?id=1394128
Is memcacheDB an option? I've heard that's how Digg handled HA issues.

Resources