Using etcd as primary store/database? - database

Can etcd be used as reliable database replacement? Since it is distributed and stores key/value pairs in a persistent way, it would be a great alternative nosql database. In addition, it has a great API. Can someone explain why this is not a thing?

etcd
etcd is a highly available key-value store which Kubernetes uses for persistent storage of all of its objects like deployment, pod, service information.
etcd has high access control, that it can be accessed only using API in master node. Nodes in the cluster other than master do not have access to etcd store.
nosql database
There are currently more than than 255 nosql databases, which can be broadly classified into Key-Value based, Column based, Document based and Graph based. Considering etcd as an key-value store, lets see the available nosql key-value data stores.
Redis, memcached and memcacheDB are popular key-value stores. These are general-purpose distributed memory caching system often used to speed up dynamic database-driven websites by caching data and objects in memory.
Why etcd not an alternative
etcd cannot be stored in memory(ram) they can only be persisted in disk storage, whereas redis can be cached in ram and can also be persisted in disk.
etcd does not have various data types. It is made to store only kubernetes objects. But redis and other key-value stores have data-type flexibility.
etcd guarantees only high availabilty, but does not give you the fast querying and indexing. All the nosql key-value stores are built with the goal of fast querying and searching.
Eventhough it is obvious that etcd cannot be used as an alternative nosql database, I think the above explanation will prove it cannot be an suitable alternative.

From the ETCD.IO site:
etcd is a strongly consistent, distributed key-value store that
provides a reliable way to store data that needs to be accessed by a
distributed system or cluster of machines. It gracefully handles
leader elections during network partitions and can tolerate machine
failure, even in the leader node.
It has a simple interface using http and json. It is NOT just for Kubernetes. Kubernetes is just an example of a critical application that uses it.
You are right it should be a thing. A nice reliable data store with an easy to use API and a nice way of telling you when things change using raft protocol. This is great for feature toggles and other items where everything needs to know and is much better than things like putting a trigger in an sql database and getting it to send an event to an external application or really horrible polling.
So if you are writing something like the kubernetes use case >> it is perfect a well proven store for a distributed application.
If you are writing something very different to the kubernetes use case, then you are comparing with all the other no-sql databases. But is very different to something like mongodb so it may be better for you if mongodb or similar does not work for you.
Other example users
M3, a large-scale metrics platform for Prometheus created by Uber, uses etcd for rule storage and other functions
Consistency
There is a nice comparison of NOSQL database consistency by Jepson at https://jepsen.io/analyses
ETCD sum up their result at https://etcd.io/blog/jepsen-343-results/

The only answer I've come to see are those between our ears. Guess we need to show first that it can be done, and what the benefits are.
My colleagues seem to shy off it because "it's for storing secrets, and common truth". The etcd v3 revise made etcd capable of much more, but the news hasn't simply rippled down, yet.
Let's make some show cases, success stories. Personally, I like etcd because of the reasons you mentioned, and because of its focus on dependable performance.

First, no. Etcd is not the next nosql replacement. But there are some sort of scenarios, where it can come in handy.
Let's imagine you have (configuration) data, that is mostly static but may change on runtime. Maybe your frontend needs to know the backend endpoints based on the customers country to comply with legal and you know the world wide rollout is done in phases.
So you could just use a k8s configMap to store the array of data (country -> endpoint) and let your backend watch this configMap for changes.
On change, the application just reads in the list and provides a repository to allow access to the data from your service layer.
All operations need to be implemented in the repository (search, get, update, ...) but your data will be in memory (probably a linked hash map). So it will be very quick to retrieve (like a local cache).
If data get changed by the application just serialize the list and patch the configMap. Any other application watching the configMap will update their internal state.
However there is no locking. So quick changes may result in race conditions.
etcd allows for 1Mb to be stored. That's enough for almost static data.
Another application might be feature toggles. They do not changed that much but when they do, every application needs to know quickly and polling sucks.

See if this checklist of limitations of etcd compared to a more full-featured database will work for you:
Your database size is going to be within 2 GB (extensible to max 8 GB)
No sharding and hence data scalability that NoSQL db clusters (Mongo, Redis,...) provide
Meant for simple value stores with payloads limited to 1.5 MB. Can be increased but impacts other queries. Most dbs can store large BLOBs. Redis can store a value of 512 MB.
No query language for more complex searches beyond key prefix. Other databases provide more complex data types like document, graph storage with querying and indexing. Even key-value db Redis supports more complex types through modules along with querying and search capabilities
No ACID transactions
Having a hammer, everything may look like a potential nail. You need to make sure it is indeed one.

Related

When to Redis? When to Tarantool?

I do not want to get a detailed comparison. Neither do I want to define 'what is the best or fastest' in-memory DB.
They are both similar, so I want to get an overview of critical differences.
So what are they?
Let me be the one to offer a solution to the question above, I did a little research. From my point of view, these are the most critical things to know about them.
Overall
both are extremely fast in-memory technologies
open-source and Enterprise versions
store all the data in-memory
offer RPS of 10 000 and greater
Persistence: they both snapshot data to disk
Support async replication
Redis is a key-value storage
Tarantool supports key-value, but also supports documents and relational model
Why Redis is preferable
easier to get started
more Information on the Internet (look ant the number of question here, for example)
a simplier technology overall
more people are familiar with it
Why Tarantool is preferable
supports secondary indexes
supports iteration over indexes
has a UI for cluster administration
has an app-server onboard by default
Conclusion
Redis is a great caching solution. Not recommended to use it as a main storage
Tarantool is a multi parading DB. Can be used as a main storage
Redis has a lower entry barrier
Tarantool has a higher ceiling as a solution (relational model, distributed NoSQL storage, queues)

Practical example for each type of database (real cases) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
There are several types of database for different purposes, however normally MySQL is used to everything, because is the most well know Database. Just to give an example in my company an application of big data has a MySQL database at an initial stage, what is unbelievable and will bring serious consequences to the company. Why MySQL? Just because no one know how (and when) should use another DBMS.
So, my question is not about vendors, but type of databases. Can you give me an practical example of specific situations (or apps) for each type of database where is highly recommended to use it?
Example:
• A social network should use the type X because of Y.
• MongoDB or couch DB can't support transactions, so Document DB is not good to an app for a bank or auctions site.
And so on...
Relational: MySQL, PostgreSQL, SQLite, Firebird, MariaDB, Oracle DB, SQL server, IBM DB2, IBM Informix, Teradata
Object: ZODB, DB4O, Eloquera, Versant , Objectivity DB, VelocityDB
Graph databases: AllegroGraph, Neo4j, OrientDB, InfiniteGraph, graphbase, sparkledb, flockdb, BrightstarDB
Key value-stores: Amazon DynamoDB, Redis, Riak, Voldemort, FoundationDB, leveldb, BangDB, KAI, hamsterdb, Tarantool, Maxtable, HyperDex, Genomu, Memcachedb
Column family: Big table, Hbase, hyper table, Cassandra, Apache Accumulo
RDF Stores: Apache Jena, Sesame
Multimodel Databases: arangodb, Datomic, Orient DB, FatDB, AlchemyDB
Document: Mongo DB, Couch DB, Rethink DB, Raven DB, terrastore, Jas DB, Raptor DB, djon DB, EJDB, denso DB, Couchbase
XML Databases: BaseX, Sedna, eXist
Hierarchical: InterSystems Caché, GT.M thanks to #Laurent Parenteau
I found two impressive articles about this subject. All credits to highscalability.com. The information in this answer is transcribed from these articles:
35+ Use Cases For Choosing Your Next NoSQL Database
What The Heck Are You Actually Using NoSQL For?
If Your Application Needs...
• complex transactions because you can't afford to lose data or if you would like a simple transaction programming model then look at a Relational or Grid database.
• Example: an inventory system that might want full ACID. I was very unhappy when I bought a product and they said later they were out of stock. I did not want a compensated transaction. I wanted my item!
• to scale then NoSQL or SQL can work. Look for systems that support scale-out, partitioning, live addition and removal of machines, load balancing, automatic sharding and rebalancing, and fault tolerance.
• to always be able to write to a database because you need high availability then look at Bigtable Clones which feature eventual consistency.
• to handle lots of small continuous reads and writes, that may be volatile, then look at Document or Key-value or databases offering fast in-memory access. Also, consider SSD.
• to implement social network operations then you first may want a Graph database or second, a database like Riak that supports relationships. An in-memory relational database with simple SQL joins might suffice for small data sets. Redis' set and list operations could work too.
• to operate over a wide variety of access patterns and data types then look at a Document database, they generally are flexible and perform well.
• powerful offline reporting with large datasets then look at Hadoop first and second, products that support MapReduce. Supporting MapReduce isn't the same as being good at it.
• to span multiple data-centers then look at Bigtable Clones and other products that offer a distributed option that can handle the long latencies and are partition tolerant.
• to build CRUD apps then look at a Document database, they make it easy to access complex data without joins.
• built-in search then look at Riak.
• to operate on data structures like lists, sets, queues, publish-subscribe then look at Redis. Useful for distributed locking, capped logs, and a lot more.
• programmer friendliness in the form of programmer-friendly data types like JSON, HTTP, REST, Javascript then first look at Document databases and then Key-value Databases.
• transactions combined with materialized views for real-time data feeds then look at VoltDB. Great for data-rollups and time windowing.
• enterprise-level support and SLAs then look for a product that makes a point of catering to that market. Membase is an example.
• to log continuous streams of data that may have no consistency guarantees necessary at all then look at Bigtable Clones because they generally work on distributed file systems that can handle a lot of writes.
• to be as simple as possible to operate then look for a hosted or PaaS solution because they will do all the work for you.
• to be sold to enterprise customers then consider a Relational Database because they are used to relational technology.
• to dynamically build relationships between objects that have dynamic properties then consider a Graph Database because often they will not require a schema and models can be built incrementally through programming.
• to support large media then look storage services like S3. NoSQL systems tend not to handle large BLOBS, though MongoDB has a file service.
• to bulk upload lots of data quickly and efficiently then look for a product that supports that scenario. Most will not because they don't support bulk operations.
• an easier upgrade path then use a fluid schema system like a Document Database or a Key-value Database because it supports optional fields, adding fields, and field deletions without the need to build an entire schema migration framework.
• to implement integrity constraints then pick a database that supports SQL DDL, implement them in stored procedures, or implement them in application code.
• a very deep join depth then use a Graph Database because they support blisteringly fast navigation between entities.
• to move behavior close to the data so the data doesn't have to be moved over the network then look at stored procedures of one kind or another. These can be found in Relational, Grid, Document, and even Key-value databases.
• to cache or store BLOB data then look at a Key-value store. Caching can for bits of web pages, or to save complex objects that were expensive to join in a relational database, to reduce latency, and so on.
• a proven track record like not corrupting data and just generally working then pick an established product and when you hit scaling (or other issues) use one of the common workarounds (scale-up, tuning, memcached, sharding, denormalization, etc).
• fluid data types because your data isn't tabular in nature, or requires a flexible number of columns, or has a complex structure, or varies by user (or whatever), then look at Document, Key-value, and Bigtable Clone databases. Each has a lot of flexibility in their data types.
• other business units to run quick relational queries so you don't have to reimplement everything then use a database that supports SQL.
• to operate in the cloud and automatically take full advantage of cloud features then we may not be there yet.
• support for secondary indexes so you can look up data by different keys then look at relational databases and Cassandra's new secondary index support.
• create an ever-growing set of data (really BigData) that rarely gets accessed then look at Bigtable Clone which will spread the data over a distributed file system.
• to integrate with other services then check if the database provides some sort of write-behind syncing feature so you can capture database changes and feed them into other systems to ensure consistency.
• fault tolerance check how durable writes are in the face power failures, partitions, and other failure scenarios.
• to push the technological envelope in a direction nobody seems to be going then build it yourself because that's what it takes to be great sometimes.
• to work on a mobile platform then look at CouchDB/Mobile couchbase.
General Use Cases (NoSQL)
• Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space.
• Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month (in 2010). Twitter, for example, has the problem of storing 7 TB/data per day (in 2010) with the prospect of this requirement doubling multiple times per year. This is the data is too big to fit on one node problem. At 80 MB/s it takes a day to store 7TB so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes in-memory systems can be used.
• Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mind set. When latency is important it's hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access, some are more about reliability, for example. but what people have wanted for a long time was a better memcached and many NoSQL systems offer that.
• Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly compatible datatypes like JSON.
• Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.
• Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.
• Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs as special code doesn't have to be written to scale a system that was never intended to be used that way.
• No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.
• Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.
• Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer-led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc., programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue. If it costs a lot to scale a product then won't you go with the cheaper product, that you control, that's easier to use, and that's easier to scale?
• Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and solution.
• Avoid hitting the wall. Many projects hit some type of wall in their project. They've exhausted all options to make their system scale or perform properly and are wondering what next? It's comforting to select a product and an approach that can jump over the wall by linearly scaling using incrementally added resources. At one time this wasn't possible. It took custom built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.
• Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span datacenters while handling failure scenarios without a hiccup. NoSQL systems, because they have focussed on scale, tend to exploit partitions, tend not use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.
• Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency which means they can't tolerate a partition failure. In the end, this is a business decision and should be decided on a case by case basis. Does your app even care about consistency? Are a few drops OK? Does your app need strong or weak consistency? Is availability more important or is consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice.
• More Specific Use Cases
• Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
• Syncing online and offline data. This is a niche CouchDB has targeted.
• Fast response times under all loads.
• Avoiding heavy joins for when the query load for complex joins become too large for an RDBMS.
• Soft real-time systems where low latency is critical. Games are one example.
• Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads 50% writes, 95% writes, or 95% reads. Read-only applications needing extreme speed and resiliency, simple queries, and can tolerate slightly stale data. Applications requiring moderate performance, read/write access, simple queries, completely authoritative data. A read-only application which complex query requirements.
• Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.
• Real-time inserts, updates, and queries.
• Hierarchical data like threaded discussions and parts explosion.
• Dynamic table creation.
• Two-tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
• Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
• Slicing off part of service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.
• Caching. A high performance caching tier for websites and other applications. Example is a cache for the Data Aggregation System used by the Large Hadron Collider.
Voting.
• Real-time page view counters.
• User registration, profile, and session data.
• Document, catalog management and content management systems. These are facilitated by the ability to store complex documents has a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
• Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time.
• Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.
• Working with heterogeneous types of data, for example, different media types at a generic level.
• Embedded systems. They don’t want the overhead of SQL and servers, so they use something simpler for storage.
• A "market" game, where you own buildings in a town. You want the building list of someone to pop up quickly, so you partition on the owner column of the building table, so that the select is single-partitioned. But when someone buys the building of someone else you update the owner column along with price.
• JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3. (source)
• Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations.
• Fraud detection by comparing transactions to known patterns in real-time.
• Helping diagnose the typology of tumors by integrating the history of every patient.
• In-memory database for high update situations, like a website that displays everyone's "last active" time (for chat maybe). If users are performing some activity once every 30 sec, then you will be pretty much be at your limit with about 5000 simultaneous users.
• Handling lower-frequency multi-partition queries using materialized views while continuing to process high-frequency streaming data.
• Priority queues.
• Running calculations on cached data, using a program friendly interface, without having to go through an ORM.
• Uniq a large dataset using simple key-value columns.
• To keep querying fast, values can be rolled-up into different time slices.
• Computing the intersection of two massive sets, where a join would be too slow.
• A timeline ala Twitter.
Redis use cases, VoltDB use cases and more find here.
This question is almost impossible to answer because of the generality. I think you are looking for some sort of easy answer problem = solution. The problem is that each "problem" becomes more and more unique as it becomes a business.
What do you call a social network? Twitter? Facebook? LinkedIn? Stack Overflow? They all use different solutions for different parts, and many solutions can exist that use polyglot approach. Twitter has a graph like concept, but there are only 1 degree connections, followers and following. LinkedIn on the other hand thrives on showing how people are connected beyond first degree. These are two different processing and data needs, but both are "social networks".
If you have a "social network" but don't do any discovery mechanisms, then you can easily use any basic key-value store most likely. If you need high performance, horizontal scale, and will have secondary indexes or full-text search, you could use Couchbase.
If you are doing machine learning on top of the log data you are gathering, you can integrate Hadoop with Hive or Pig, or Spark/Shark. Or you can do a lambda architecture and use many different systems with Storm.
If you are doing discovery via graph like queries that go beyond 2nd degree vertexes and also filter on edge properties you likely will consider graph databases on top of your primary store. However graph databases aren't good choices for session store, or as general purpose stores, so you will need a polyglot solution to be efficient.
What is the data velocity? scale? how do you want to manage it. What are the expertise you have available in the company or startup. There are a number of reasons this is not a simple question and answer.
A short useful read specific to database selection: How to choose a NoSQL Database?. I will highlight keypoints in this answer.
Key-Value vs Document-oriented
Key-value stores
If you have clear data structure defined such that all the data would have exactly one key, go for a key-value store. It’s like you have a big Hashtable, and people mostly use it for Cache stores or clearly key based data. However, things start going a little nasty when you need query the same data on basis of multiple keys!
Some key value stores are: memcached, Redis, Aerospike.
Two important things about designing your data model around key-value store are:
You need to know all use cases in advance and you could not change the query-able fields in your data without a redesign.
Remember, if you are going to maintain multiple keys around same data in a key-value store, updates to multiple tables/buckets/collection/whatever are NOT atomic. You need to deal with this yourself.
Document-oriented
If you are just moving away from RDBMS and want to keep your data in as object way and as close to table-like structure as possible, document-structure is the way to go! Particularly useful when you are creating an app and don’t want to deal with RDBMS table design early-on (in prototyping stage) and your schema could change drastically over time. However note:
Secondary indexes may not perform as well.
Transactions are not available.
Popular document-oriented databases are: MongoDB, Couchbase.
Comparing Key-value NoSQL databases
memcached
In-memory cache
No persistence
TTL supported
client-side clustering only (client stores value at multiple nodes). Horizontally scalable through client.
Not good for large-size values/documents
Redis
In-memory cache
Disk supported – backup and rebuild from disk
TTL supported
Super-fast (see benchmarks)
Data structure support in addition to key-value
Clustering support not mature enough yet. Vertically scalable (see Redis Cluster specification)
Horizontal scaling could be tricky.
Supports Secondary indexes
Aerospike
Both in-memory & on-disk
Extremely fast (could support >1 Million TPS on a single node)
Horizontally scalable. Server side clustering. Sharded & replicated data
Automatic failovers
Supports Secondary indexes
CAS (safe read-modify-write) operations, TTL support
Enterprise class
Comparing document-oriented NoSQL databases
MongoDB
Fast
Mature & stable – feature rich
Supports failovers
Horizontally scalable reads – read from replica/secondary
Writes not scalable horizontally unless you use mongo shards
Supports advanced querying
Supports multiple secondary indexes
Shards architecture becomes tricky, not scalable beyond a point where you need secondary indexes. Elementary shard deployment need 9 nodes at minimum.
Document-level locks are a problem if you have a very high write-rate
Couchbase Server
Fast
Sharded cluster instead of master-slave of mongodb
Hot failover support
Horizontally scalable
Supports secondary indexes through views
Learning curve bigger than MongoDB
Claims to be faster

Which NoSQL backend to store trace data from webpage

In our web application we need to trace what users click, what they write into search box, etc. Lots of data will be sent by AJAX. Generally functionality is a bit similar to google analytics, but we need to customize it in different ways.
Data will be collected and once per day aggregated and exported to PostgreSQL, so backend should be able to handle dozens of inserts. I don't consider usage of traditional SQL database, because probably it won't handle so many inserts efficiently.
I wonder which backend would you use for such task? Actually I think about MongoDB or Cassandra. But maybe you know better software for that task? Maybe something different then NoSQL database?
Web application is written in Ruby on Rails so support for Ruby would be nice but that's definitely not the most important.
Sounds like you need to analyse your specific requirements.
It may be that the best solution is to split / partition / shard a conventional database and then push the data up from there.
Depending on what your tolerance for data loss is, there are a lot of options. If you choose a system which has single-server durability, a major source of write bottleneck will be fdatasync() (assuming you use hard drives to store your data on).
If you can tolerate syncing less often than on every commit, then you may be able to tune your database to commit at timed intervals.
Depending on your table, index structure etc, I'd expect that you can get rather a lot of inserts with a "conventional" db (e.g. postgresql), if you manage it correctly and tune the durability (if it supports that) to your liking.
Sharding this into several instances of course will enable you to scale this up. However, you need to be mindful of operational requirements (i.e. what happens if some of the instances are down). Talk to your Ops team about what they're comfortable managing.

Distributed Key-Value Data Store with Offline Access (Static Partitioning)

Need to be able to set server(s) that replicate all information, as a master data store that has all the data.
Also need servers that specifically store/replicate certain data, available in local LANs, so that when the internet connection goes down, they can still access their local data. Under normal circumstances, the clients will access most of their data from the local LAN, and may use others when the local LAN server goes down.
This is wanted alongside the benefits of a distributed data store, such as failure resistance and speed.
Which Distributed Key-Value Data Store or other data storage method would be most suited for this?
Try out CouchDB. Your use case reads like it was build for it. Point taken, CouchDB is much more than a key/value store, but on the other hand, not less suitable for it.
Add replication and as an added bonus fault tolerance, conflict detection (and resolution) and an easy API (HTTP).
Let me know if you have any other questions.
Of course you must remember that replication is something completely different from backup, because one system's programmatic failure in handling the data can quickly replicate to other nodes resulting in total mayhem.
Maybe using a Hadoop File System or OpenAFS would be a good solution here?
I haven't used any of those systems in real-life scenarios, only had interest in them during my research on peer-to-peer and distributed storage solutions, but I think they're worth a try.
Have you checked out the new Microsoft's Velocity? http://msdn.microsoft.com/en-us/data/cc655792.aspx. Unlike many other cloud services, you can run the setup (for Velocity) on your premises.

voldemort vs. couchdb

I am trying to decide whether to use voldemort or couchdb for an upcoming healthcare project. I want a storage system that has high availability , fault tolerance, and can scale for the massive amounts of data being thrown at it.
What is the pros/cons of each?
Thanks
Project Voldemort looks nice, but I haven't looked deeply into it so far.
In it current state CouchDB might not be the right thing for "massive amounts of data". Distributing data between nodes and routing queries accordingly is on the roadmap but not implemented so far. The biggest known production setups of CouchDB use "tables" ("databases" in couch-speak) of about 200G.
HA is not natively supported by CouchDB but can build easily: All CouchDB nodes are replicating the database nodes between each other in a multi-master setup. We put two Varnish proxies in front of the CouchDB machines and the Varnish boxes are made redundant with CARP. CouchDBs "build from the Web" design makes such things very easy.
The most pressing issue in our setup is the fact that there are still issues with the replication of large (multi MB) attachments to CouchDB documents.
I suggest you also check the traditional RDBMS route. There are huge issues with available talent outside the RDBMS approach and there are very capable offerings available from Oracle & Co.
Not knowing enough from your question, I would nevertheless say Project Voldemort or distributed hash tables (DHTs) like CouchDB in general are a solution to your problem of HA.
Those DHTs are very nice for high availability but harder to write code for than traditional relational databases (RDBMS) concerning consistency.
They are quite good to store document type information, which may fit nicely with your healthcare project but make development harder for data.
The biggest limitation of most stores is that they are not transactionally safe (See Scalaris for an transactionally safe store) and you need to ensure data consistency by yourself - most use read time consistency by merging conflicting data). RDBMS are much easier to use for consistency of data (ACID)
Joining data is much harder too. In RDBMs you can easily query data over several tables, you need to write code in CouchDB to aggregate data. For other stores Hadoop may be a good choice for aggregating information.
Read about BASE and the CAP theorem on consistency vs. availability.
See
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
http://queue.acm.org/detail.cfm?id=1394128
Is memcacheDB an option? I've heard that's how Digg handled HA issues.

Resources