Can etcd be used as reliable database replacement? Since it is distributed and stores key/value pairs in a persistent way, it would be a great alternative nosql database. In addition, it has a great API. Can someone explain why this is not a thing?
etcd
etcd is a highly available key-value store which Kubernetes uses for persistent storage of all of its objects, such as deployment, pod, and service information.
etcd has strict access control: it can be accessed only through the API on the master node. Nodes in the cluster other than the master do not have access to the etcd store.
nosql database
There are currently more than 255 NoSQL databases, which can be broadly classified into key-value, column, document, and graph based. Considering etcd as a key-value store, let's look at the available NoSQL key-value data stores.
Redis, memcached and memcacheDB are popular key-value stores. These are general-purpose distributed memory caching systems, often used to speed up dynamic database-driven websites by caching data and objects in memory.
Why etcd is not an alternative
etcd data cannot be kept in memory (RAM); it can only be persisted to disk, whereas Redis data can be cached in RAM and can also be persisted to disk.
etcd does not have various data types. It is made to store only Kubernetes objects. But Redis and other key-value stores have data-type flexibility.
etcd guarantees only high availability, but does not give you fast querying and indexing. All the NoSQL key-value stores are built with the goal of fast querying and searching.
Even though it is obvious that etcd cannot be used as an alternative NoSQL database, I think the above explanation shows why it is not a suitable alternative.
From the ETCD.IO site:
etcd is a strongly consistent, distributed key-value store that
provides a reliable way to store data that needs to be accessed by a
distributed system or cluster of machines. It gracefully handles
leader elections during network partitions and can tolerate machine
failure, even in the leader node.
It has a simple interface using HTTP and JSON. It is NOT just for Kubernetes. Kubernetes is just an example of a critical application that uses it.
You are right, it should be a thing. It is a nice reliable data store with an easy-to-use API and a nice way of telling you when things change, built on the Raft protocol. This is great for feature toggles and other items that everything needs to know about, and it is much better than putting a trigger in an SQL database and getting it to send an event to an external application, or really horrible polling.
So if you are writing something like the Kubernetes use case, it is perfect: a well-proven store for a distributed application.
If you are writing something very different from the Kubernetes use case, then you are comparing it with all the other NoSQL databases. But it is very different from something like MongoDB, so it may be better for you if MongoDB or similar does not work for you.
Other example users
M3, a large-scale metrics platform for Prometheus created by Uber, uses etcd for rule storage and other functions
Consistency
There is a nice comparison of NoSQL database consistency by Jepsen at https://jepsen.io/analyses
The etcd team sums up their result at https://etcd.io/blog/jepsen-343-results/
The only answers I've come to see are those between our ears. Guess we need to show first that it can be done, and what the benefits are.
My colleagues seem to shy away from it because "it's for storing secrets, and common truth". The etcd v3 revision made etcd capable of much more, but the news simply hasn't trickled down yet.
Let's make some show cases, success stories. Personally, I like etcd because of the reasons you mentioned, and because of its focus on dependable performance.
First, no. etcd is not the next NoSQL replacement. But there are some scenarios where it can come in handy.
Let's imagine you have (configuration) data that is mostly static but may change at runtime. Maybe your frontend needs to know the backend endpoints based on the customer's country to comply with legal requirements, and you know the worldwide rollout is done in phases.
So you could just use a k8s configMap to store the array of data (country -> endpoint) and let your backend watch this configMap for changes.
On change, the application just reads in the list and provides a repository to allow access to the data from your service layer.
All operations need to be implemented in the repository (search, get, update, ...) but your data will be in memory (probably a linked hash map). So it will be very quick to retrieve (like a local cache).
If data gets changed by the application, just serialize the list and patch the configMap. Any other application watching the configMap will update its internal state.
However, there is no locking, so quick changes may result in race conditions.
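One common way to avoid the lost updates mentioned above without a lock is optimistic concurrency: every write states which revision it was based on and is rejected if someone else wrote in between. The following is only a minimal in-memory sketch of that idea; `VersionedStore` and `put_if_revision` are hypothetical names for illustration, not the etcd or Kubernetes API.

```python
import threading


class VersionedStore:
    """Toy in-memory key-value store where every key carries a revision number."""

    def __init__(self):
        self._data = {}  # key -> (value, revision)
        self._lock = threading.Lock()

    def get(self, key):
        """Return (value, revision); a missing key has revision 0."""
        with self._lock:
            return self._data.get(key, (None, 0))

    def put_if_revision(self, key, value, expected_revision):
        """Write only if nobody else has written since we read; report success."""
        with self._lock:
            _, current = self._data.get(key, (None, 0))
            if current != expected_revision:
                return False  # lost the race: caller should re-read and retry
            self._data[key] = (value, current + 1)
            return True


store = VersionedStore()
_, rev = store.get("endpoints")
first = store.put_if_revision("endpoints", {"de": "https://eu.example"}, rev)
# A second writer still holding the old revision is rejected instead of
# silently overwriting the first writer's change.
second = store.put_if_revision("endpoints", {"de": "https://stale.example"}, rev)
print(first, second)
```

A rejected writer would normally re-read the current value, reapply its change, and try again.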
etcd allows for 1 MB to be stored. That's enough for almost static data.
Another application might be feature toggles. They do not change that much, but when they do, every application needs to know quickly, and polling sucks.
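To make the "watch instead of poll" idea concrete, here is a small sketch of a toggle store that pushes changes to subscribers. It is a plain in-memory model; `ToggleStore`, `watch`, and `set` are made-up names for illustration and do not correspond to a real etcd client.

```python
class ToggleStore:
    """Toy feature-toggle store that pushes changes to registered watchers."""

    def __init__(self):
        self._toggles = {}
        self._watchers = []

    def watch(self, callback):
        """Register a callback invoked as callback(name, enabled) on every change."""
        self._watchers.append(callback)

    def set(self, name, enabled):
        """Store the toggle and notify watchers only if the value actually changed."""
        if self._toggles.get(name) == enabled:
            return
        self._toggles[name] = enabled
        for callback in self._watchers:
            callback(name, enabled)


seen = []
store = ToggleStore()
store.watch(lambda name, enabled: seen.append((name, enabled)))
store.set("new-checkout", True)
store.set("new-checkout", True)   # no change, so no notification
store.set("new-checkout", False)
print(seen)
```

Each application only does work when a toggle really changes, which is the advantage over polling on a timer.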
See if this checklist of limitations of etcd compared to a more full-featured database will work for you:
Your database size is going to be within 2 GB (extensible to a maximum of 8 GB).
No sharding, and hence none of the data scalability that NoSQL database clusters (Mongo, Redis, ...) provide.
Meant for simple value stores with payloads limited to 1.5 MB. Can be increased but impacts other queries. Most dbs can store large BLOBs. Redis can store a value of 512 MB.
No query language for more complex searches beyond key prefix. Other databases provide more complex data types like document, graph storage with querying and indexing. Even key-value db Redis supports more complex types through modules along with querying and search capabilities
No ACID transactions
Having a hammer, everything may look like a potential nail. You need to make sure it is indeed one.
There has been a lot of talk related to Cassandra lately.
Twitter, Digg, Facebook, etc all use it.
When does it make sense to:
use Cassandra,
not use Cassandra, and
use an RDBMS instead of Cassandra.
There is nothing like a silver bullet, everything is built to solve specific problems and has its own pros and cons. It is up to you, what problem statement you have and what is the best fitting solution for that problem.
I will try to answer your questions one by one in the same order you asked them. Since Cassandra is based on the NoSQL family of databases, it's important you understand why use a NoSQL database before I answer your questions.
Why use NoSQL
In the case of RDBMS, making a choice is quite easy because all the databases in this category, like MySQL, Oracle, MS SQL, and PostgreSQL, offer almost the same kind of solutions oriented toward ACID properties. When it comes to NoSQL, the decision becomes difficult because every NoSQL database offers different solutions and you have to understand which one is best suited for your app/system requirements. For example, MongoDB is fit for use cases where your system demands a schema-less document store. HBase might be fit for search engines, analyzing log data, or any place where scanning huge, two-dimensional join-less tables is a requirement. Redis is built to provide in-memory search for varieties of data structures like trees, queues, linked lists, etc. and can be a good fit for making real-time leaderboards or pub-sub kinds of systems. Similarly, there are other databases in this category (including Cassandra) which are fit for different problem statements. Now let's move to the original questions, and answer them one by one.
When to use Cassandra
Being a part of the NoSQL family, Cassandra offers a solution for problems where one of your requirements is to have a very heavy write system and you want to have a quite responsive reporting system on top of that stored data. Consider the use case of web analytics where log data is stored for each request and you want to build an analytical platform around it to count hits per hour, by browser, by IP, etc. in a real-time manner. You can refer to this blog post to understand more about the use cases where Cassandra fits in.
When to use an RDBMS instead of Cassandra
Cassandra is based on a NoSQL database and does not provide ACID and relational data properties. If you have a strong requirement for ACID properties (for example Financial data), Cassandra would not be a fit in that case. Obviously, you can make a workaround for that, however you will end up writing lots of application code to simulate ACID properties and will lose on time to market badly. Also managing that kind of system with Cassandra would be complex and tedious for you.
When not to use Cassandra
I don't think it needs to be answered if the above explanation makes sense.
When evaluating distributed data systems, you have to consider the CAP theorem - you can pick two of the following: consistency, availability, and partition tolerance.
Cassandra is an available, partition-tolerant system that supports eventual consistency. For more information see this blog post I wrote: Visual Guide to NoSQL Systems.
Cassandra is the answer to a particular problem: What do you do when you have so much data that it does not fit on one server? How do you store all your data on many servers without breaking your bank account or driving your developers insane? Facebook gets 4 terabytes of new compressed data EVERY DAY. And this number will most likely more than double within a year.
If you do not have this much data, or if you have millions to pay for an Enterprise Oracle/DB2 cluster installation and the specialists required to set it up and maintain it, then you are fine with an SQL database.
However, Facebook no longer uses Cassandra and now uses MySQL almost exclusively, moving the partitioning up in the application stack for faster performance and better control.
The general idea of NoSQL is that you should use whichever data store is the best fit for your application. If you have a table of financial data, use SQL. If you have objects that would require complex/slow queries to map to a relational schema, use an object or key/value store.
Of course just about any real world problem you run into is somewhere in between those two extremes and neither solution will be perfect. You need to consider the capabilities of each store and the consequences of using one over the other, which will be very much specific to the problem you are trying to solve.
Besides the answers given above about when to use and when not to use Cassandra, if you do decide to use Cassandra you may want to consider not using Cassandra itself, but one of its many cousins out there.
Some answers above already pointed to various "NoSQL" systems which share many properties with Cassandra, with some small or large differences, and may be better than Cassandra itself for your specific needs.
Additionally, recently (several years after this question was originally asked), a Cassandra clone called Scylla (see https://en.wikipedia.org/wiki/Scylla_(database)) was released. Scylla is an open-source re-implementation of Cassandra in C++, which claims to have significantly higher throughput and lower latencies than the original Java Cassandra, while being mostly compatible with it (in features, APIs, and file formats). So if you're already considering Cassandra, you may want to consider Scylla as well.
I will focus here on some of the important aspects which can help you decide if you really need Cassandra. The list is not exhaustive, just some of the points which I have at the top of my mind:
Don't consider Cassandra as the first choice when you have a strict requirement on the relationship (across your dataset).
Cassandra by default is AP system (of CAP). But, it supports tunable consistency which means it can be configured to support as CP as well. So don't ignore it just because you read somewhere that it's AP and you are looking for CP systems. Cassandra is more accurately termed “tuneably consistent,” which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.
Don't use Cassandra if your scale is small or if you can get by with a non-distributed DB.
Think harder if your team thinks that all your problems will be solved if you use distributed DBs like Cassandra. To start with these DBs is very simple as it comes with many defaults but optimizing and mastering it for solving a specific problem would require a good (if not a lot) amount of engineering effort.
Cassandra is column-oriented but at the same time each row also has a unique key. So, it might be helpful to think of it as an indexed, row-oriented store. You can even use it as a document store.
Cassandra doesn't force you to define the fields beforehand. So, if you are in a startup mode or your features are evolving (as in agile) - Cassandra embraces it. So better, first think about queries and then think about data to answer them.
Cassandra is optimized for really high throughput on writes. If your use case is read-heavy (like cache) then Cassandra might not be an ideal choice.
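The "tunable consistency" point in the list above is usually summarized by the rule that a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor (R + W > N). A small sketch of that arithmetic, with hypothetical function names:

```python
def quorum(replication_factor):
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)
print(reads_see_latest_write(rf, q, q))  # quorum writes + quorum reads
print(reads_see_latest_write(rf, 1, 1))  # one replica each: fast but only eventual
```

With a replication factor of 3, quorum reads and writes overlap in at least one replica, while reading and writing a single replica trades that guarantee for availability and latency.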
Right. It makes sense to use Cassandra when you have a huge amount of data, a huge number of queries but very little variety of queries. Cassandra basically works by partitioning and replicating. If all your queries will be based on the same partition key, Cassandra is your best bet. If you get a query on an attribute that is not the partition key, Cassandra allows you to replicate the whole data with a new partition key. So now you have 2 replicas of the same data with 2 different partition keys.
Which brings me to your next question. When not to use Cassandra. As I mentioned, Cassandra scales by replicating the complete database for every new partitioning key. But you can't keep making new copies again and again. So when you have a high variety in queries i.e. each query has a different column in the where clause, Cassandra is not a good option.
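The "one copy of the data per partition key" idea can be sketched with plain dictionaries standing in for tables. This is only an illustration of the modeling trade-off, not Cassandra code; the table and field names are invented.

```python
from collections import defaultdict

# Two "tables" holding the same rows, each partitioned by a different key.
orders_by_customer = defaultdict(list)
orders_by_product = defaultdict(list)


def insert_order(order):
    """Write the row once per query pattern, so each lookup hits one partition."""
    orders_by_customer[order["customer"]].append(order)
    orders_by_product[order["product"]].append(order)


insert_order({"id": 1, "customer": "alice", "product": "book"})
insert_order({"id": 2, "customer": "bob", "product": "book"})
insert_order({"id": 3, "customer": "alice", "product": "pen"})

print([o["id"] for o in orders_by_customer["alice"]])
print([o["id"] for o in orders_by_product["book"]])
```

Every additional query pattern costs another full copy and another write per insert, which is why a high variety of queries makes this approach impractical.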
Now for the third question. The whole point of using RDBMS is when you want the ACID properties. If you are building something like a payment service and want each transaction to be isolated, each transaction to either complete or not happen at all, changes to be persistent despite system failure, and the money to be consistent across bank accounts before and after the transaction completes, an RDBMS is the only option that will help you achieve this.
This article actually explains the whole thing, especially when to use Cassandra or not (as opposed to some other NoSQL option) part of the question -> Choosing the best Database. Do check it out.
EDIT: To answer the question in the comments by proximab, when we think of banking systems we immediately think "ACID is the best solution". But even banking systems are made up of several subsystems that might not even be dealing with any transaction-related data, like account holders' personal information, account statements, credit card details, credit histories, etc.
All of this information needs to be stored in some database or another. Now if you store account-related information like the account balance, that is something that needs to be consistent at all times. For example, if you try to send money from account A to account B, then the money that disappears from account A should instantaneously show up in account B, and it cannot be present in both accounts at the same time. This system cannot be inconsistent at any point. This is where ACID is of utmost importance.
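The account A to account B example can be tried out with any transactional SQL engine. The sketch below uses Python's built-in `sqlite3` module purely as a convenient stand-in for an RDBMS; the table layout and the `transfer` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move money atomically; return False and change nothing on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "A", "B", 70)
rejected = transfer(conn, "A", "B", 70)  # would overdraw A, so it is rolled back
print(ok, rejected, balances(conn))
```

In the rejected transfer the credit to B has already been applied when the debit from A fails, yet the rollback undoes it, so the money is never present in both accounts.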
On the other hand, if you are saving credit card details or credit histories, which should not get into the wrong hands, then you need something that allows access only to authorised users. That, I believe, is supported by Cassandra. That said, data like credit history and credit card transactions is, I think, ever-increasing data. Also, there is only so much you can query on this data, i.e. it has a very finite number of queries. These two conditions make Cassandra a perfect solution.
Talking with someone in the midst of deploying Cassandra, it doesn't handle the many-to-many well. They are doing a hack job to do their initial testing. I spoke with a Cassandra consultant about this and he said he wouldn't recommend it if you had this problem set.
You should ask yourself the following questions:
(Volume, Velocity) Will you be writing and reading TONS of information, so much information that no one computer could handle the writes?
(Global) Will you need this writing and reading capability around the world so that the writes in one part of the world are accessible in another part of the world?
(Reliability) Do you need this database to be up and running all the time and never go down regardless of which Cloud, which country, whether it's VM, Container, or Bare metal?
(Scale-ability) Do you need this database to be able to continue to grow easily and scale linearly?
(Consistency) Do you need TUNABLE consistency where some writes can happen asynchronously whereas others need to be certified?
(Skill) Are you willing to do what it takes to learn this technology and the data modeling that goes with creating a globally distributed database that can be fast for everyone, everywhere?
If for any of these questions you thought "maybe" or "no," you should use something else. If you had "hell yes" as an answer to all of them, then you should use Cassandra.
Use RDBMS when you can do everything on one box. It's probably easier than most and anyone can work with it.
Heavy single query vs. gazillion light query load is another point to consider, in addition to other answers here. It's inherently harder to automatically optimize a single query in a NoSql-style DB. I've used MongoDB and ran into performance issues when trying to calculate a complex query. I haven't used Cassandra but I expect it to have the same issue.
On the other hand, if your load is expected to be that of very many small queries, and you want to be able to easily scale out, you could take advantage of eventual consistency that is offered by most NoSql DBs. Note that eventual consistency is not really a feature of a non-relational data model, but it is much easier to implement and to set up in a NoSql-based system.
For a single, very heavy query, any modern RDBMS engine can do a decent job parallelizing parts of the query and take advantage of as much CPU and memory you throw at it (on a single machine). NoSql databases don't have enough information about the structure of the data to be able to make assumptions that will allow truly intelligent parallelization of a big query. They do allow you to easily scale out more servers (or cores) but once the query hits a complexity level you are basically forced to split it apart manually to parts that the NoSql engine knows how to deal with intelligently.
In my experience with MongoDB, in the end because of the complexity of the query there wasn't much Mongo could do to optimize it and run parts of it on multiple data. Mongo parallelizes multiple queries but isn't so good at optimizing a single one.
Let's read some real world cases:
http://planetcassandra.org/apache-cassandra-use-cases/
In this article: http://planetcassandra.org/blog/post/agentis-energy-stores-over-15-billion-records-of-time-series-usage-data-in-apache-cassandra
They explain that they didn't choose MySQL because database synchronization is too slow.
(Also due to 2-phase commit, FK, PK)
Cassandra is based on the Amazon Dynamo paper
Features:
Stability
High availability
Backup performs well
Read and write performance is better than HBase (a BigTable clone in Java).
wiki http://en.wikipedia.org/wiki/Apache_Cassandra
Their Conclusion is:
We looked at HBase, Dynamo, Mongo and Cassandra.
Cassandra was simply the best storage solution for the majority of our data.
As of 2018,
I would recommend using ScyllaDB to replace classic Cassandra, if you need back support.
The Postgres kv plugin is also quicker than Cassandra. However, it won't have multi-instance scalability.
Another situation that makes the choice easier is when you want to use aggregate functions like sum, min, max, etc. and complex queries (like in the financial system mentioned above). Then a relational database is probably more convenient than a NoSQL database, since neither is possible on a NoSQL database unless you use a lot of inverted indexes. When you do use NoSQL, you have to do the aggregate functions in code or store them separately in their own column family, but this makes it all quite complex and reduces the performance that you gained by using NoSQL.
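What "doing the aggregate functions in code and storing them separately" tends to look like can be sketched as follows. This is an in-memory illustration with invented names (`RunningAggregate`, `record_payment`); a real system would persist the aggregate next to the raw data.

```python
class RunningAggregate:
    """Aggregate kept up to date at write time instead of computed by a query."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def add(self, value):
        """Fold one newly written value into the stored aggregate."""
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)


# Hypothetical "aggregate table": one RunningAggregate per account.
aggregates = {}


def record_payment(account, amount):
    """Application code must remember to update the aggregate on every write."""
    aggregates.setdefault(account, RunningAggregate()).add(amount)


for amount in (40, 15, 60):
    record_payment("acct-1", amount)

stats = aggregates["acct-1"]
print(stats.count, stats.total, stats.minimum, stats.maximum)
```

Every write path now has to maintain the aggregate, which is the extra complexity the paragraph above warns about; in SQL the same numbers come from a single `SELECT` with `SUM`, `MIN`, and `MAX`.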
Cassandra is a good choice if:
You don't require the ACID properties from your DB.
There would be massive and huge number of writes on the DB.
There is a requirement to integrate with Big Data, Hadoop, Hive and Spark.
There is a need of real time data analytics and report generations.
There is a requirement of impressive fault tolerant mechanism.
There is a requirement of homogenous system.
There is a requirement of lots of customisation for tuning.
If you need a fully consistent database with SQL semantics, Cassandra is NOT the solution for you. Cassandra supports key-value lookups. It does not support SQL queries. Data in Cassandra is "eventually consistent". Concurrent lookups of data may be inconsistent, but eventually lookups are consistent.
If you need strict semantics and need support for SQL queries, choose another solution such as MySQL or PostgreSQL, or combine use of Cassandra with Solr.
Apache Cassandra is a distributed database for managing large amounts of structured data across many commodity servers, while providing highly available service and no single point of failure.
The architecture is purely based on the CAP theorem: it provides availability and partition tolerance, and, interestingly, eventual consistency.
Don't use it if you're not storing volumes of data across racks of clusters,
Don't use it if you are not storing time series data,
Don't use it if you are not partitioning your servers,
Don't use it if you require strong consistency.
MongoDB has very powerful aggregate functions and an expressive aggregation framework. It has many of the features developers are accustomed to using from the relational database world. Its document data/storage structure allows for more complex data models than Cassandra, for example.
All this comes with trade-offs of course. So when you select your database (NoSQL, NewSQL, or RDBMS) look at what problem you are trying to solve and at your scalability needs. No one database does it all.
According to DataStax, Cassandra is not the best use case when there is a need for
1- High-end hardware devices.
2- ACID compliance with no rollback (bank transactions).
It does not support complete transaction management across tables.
Secondary indexes are not supported.
You have to rely on Elasticsearch/Solr for secondary indexes, and a custom sync component has to be written.
Not ACID compliant system.
Query support is limited.
Is there any NoSQL data store that is ACID compliant?
I'll post this as an answer purely to support the conversation - Tim Mahy, nawroth, and CraigTP have suggested viable databases. CouchDB would be my preference due to the use of Erlang, but there are others out there.
I'd say ACID does not contradict or negate the concept of NoSQL... While there seems to be a trend following the opinion expressed by dove, I would argue the concepts are distinct.
NoSQL is fundamentally about simple key-value (e.g. Redis) or document-style schema (collected key-value pairs in a "document" model, e.g. MongoDB) as a direct alternative to the explicit schema in classical RDBMSs. It allows the developer to treat things asymmetrically, whereas traditional engines have enforced rigid same-ness across the data model. The reason this is so interesting is because it provides a different way to deal with change, and for larger data sets it provides interesting opportunities to deal with volumes and performance.
ACID provides principles governing how changes are applied to a database. In a very simplified way, it states (my own version):
(A) when you do something to change a database the change should work or fail as a whole
(C) the database should remain consistent (this is a pretty broad topic)
(I) if other things are going on at the same time they shouldn't be able to see things mid-update
(D) if the system blows up (hardware or software) the database needs to be able to pick itself back up; and if it says it finished applying an update, it needs to be certain
The conversation gets a little more excitable when it comes to the idea of propagation and constraints. Some RDBMS engines provide the ability to enforce constraints (e.g. foreign keys) which may have propagation elements (a la cascade). In simpler terms, one "thing" may have a relationship with another "thing" in the database, and if you change an attribute of one it may require the other be changed (updated, deleted, ... lots of options). NoSQL databases, being predominantly (at the moment) focused on high data volumes and high traffic, seem to be tackling the idea of distributed updates which take place within (from a consumer perspective) arbitrary time frames. This is basically a specialized form of replication managed via transaction - so I would say that if a traditional distributed database can support ACID, so can a NoSQL database.
Some resources for further reading:
Wikipedia article on ACID
Wikipedia on propagation constraints
Wikipedia (yeah, I like the site, ok?) on database normalization
Apache documentation on CouchDB with a good overview of how it applies ACID
Wikipedia on Cluster Computing
Wikipedia (again...) on database transactions
UPDATE (27 July 2012):
Link to Wikipedia article has been updated to reflect the version of the article that was current when this answer was posted. Please note that the current Wikipedia article has been extensively revised!
Well, according to an older version of a Wikipedia article on NoSQL:
NoSQL is a movement promoting a
loosely defined class of
non-relational data stores that break
with a long history of relational
databases and ACID guarantees.
and also:
The name was an attempt to describe
the emergence of a growing number of
non-relational, distributed data
stores that often did not attempt to
provide ACID guarantees.
and
NoSQL systems often provide weak
consistency guarantees such as
eventual consistency and transactions
restricted to single data items, even
though one can impose full ACID
guarantees by adding a supplementary
middleware layer.
So, in a nutshell, I'd say that one of the main benefits of a "NoSQL" data store is its distinct lack of ACID properties. Furthermore, IMHO, the more one tries to implement and enforce ACID properties, the further away from the "spirit" of a "NoSQL" data store you get, and the closer to a "true" RDBMS you get (relatively speaking, of course).
However, all that said, "NoSQL" is a very vague term and is open to individual interpretations, and depends heavily upon just how much of a purist viewpoint you have. For example, most modern-day RDBMS systems don't actually adhere to all of Edgar F. Codd's 12 rules of his relational model!
Taking a pragmatic approach, it would appear that Apache's CouchDB comes closest to combining ACID compliance with a loosely-coupled, non-relational "NoSQL" mentality.
Please ensure you read the Martin Fowler introduction about NoSQL databases. And the corresponding video.
First of all, we can distinguish two types of NoSQL databases:
Aggregate-oriented databases;
Graph-oriented databases (e.g. Neo4J).
By design, most Graph-oriented databases are ACID!
Then, what about the other types?
In Aggregate-oriented databases, we can put three sub-types:
Document-based NoSQL databases (e.g. MongoDB, CouchDB);
Key/Value NoSQL databases (e.g. Redis);
Column family NoSQL databases (e.g. HBase, Cassandra).
What we call an Aggregate here is what Eric Evans defined in his Domain-Driven Design as a self-sufficient cluster of Entities and Value-Objects in a given Bounded Context.
As a consequence, an aggregate is a collection of data that we
interact with as a unit. Aggregates form the boundaries for ACID
operations with the database. (Martin Fowler)
So, at Aggregate level, we can say that most NoSQL databases can be as safe as an ACID RDBMS, with the proper settings. Of course, if you tune your server for the best speed, you may end up with something non-ACID. But replication will help.
My main point is that you have to use NoSQL databases as they are, not as a (cheap) alternative to RDBMS. I have seen too many projects abusing relations between documents. This can't be ACID. If you stay at document level, i.e. at Aggregate boundaries, you do not need any transaction. And your data will be as safe as with an ACID database, even if it is not truly ACID, since you do not need those transactions! If you need transactions and update several "documents" at once, you are not in the NoSQL world any more - so use an RDBMS engine instead!
A 2019 update: Starting in version 4.0, for situations that require atomicity for updates to multiple documents or consistency between reads to multiple documents, MongoDB provides multi-document transactions for replica sets.
In this question someone must mention OrientDB:
OrientDB is a NoSQL database, one of the few, that supports fully ACID transactions. ACID is not only for RDBMS because it's not part of the relational algebra. So it IS possible to have a NoSQL database that supports ACID.
This feature is the one I miss the most in MongoDB
FoundationDB is ACID compliant:
http://www.foundationdb.com/
It has proper transactions, so you can update multiple disparate data items in an ACID fashion. This is used as the foundation for maintaining indexes at a higher layer.
ACID and NoSQL are completely orthogonal. One does not imply the other.
I have a notebook on my desk, I use it to keep notes on things that I still have to do. This notebook is a NoSQL database. I query it using a linear search with a "page cache" so I don't always have to search every page. It is also ACID compliant as I ensure that I only write one thing at a time and never while I am reading it.
NoSQL simply means that it isn't SQL. Many people get confused and think it means highly-scaleable-wild-west-super-fast-storage. It doesn't. It doesn't mean key-value store, or eventual consistency. All it means is "not SQL"; there are a lot of databases on this planet and most of them are not SQL[citation needed].
You can find many examples in the other answers so I need not list them here, but there are non-SQL databases with ACID compliance for various operations; some are only ACID for single object writes while some guarantee far more. Each database is different.
"NoSQL" is not a well-defined term. It's a very vague concept. As such, it's not even possible to say what is and what is not a "NoSQL" product. Not nearly all of the products typically branded with the label are key-value stores.
As one of the originators of NoSQL (I was an early contributor to Apache CouchDB, and a speaker at the first NoSQL event held at CBS Interactive / CNET in 2009) I'm excited to see new algorithms create possibilities that didn't exist before. The Calvin protocol offers a new way to think of physical constraints like CAP and PACELC.
Instead of active/passive async replication, or active/active synchronous replication, Calvin preserves correctness and availability during replica outages by using a RAFT-like protocol to maintain a transaction log. Additionally, transactions are processed deterministically at each replica, removing the potential for deadlocks, so agreement is achieved with only a single round of consensus. This makes it fast even on multi-cloud worldwide deployments.
FaunaDB is the only database implementation using the Calvin protocol, making it uniquely suited for workloads that require mainframe-like data integrity with NoSQL scale and flexibility.
Yes, MarkLogic Server is a NoSQL solution (a document database, as I like to call it) that supports ACID transactions.
The grandfather of NoSQL: ZODB is ACID compliant. http://www.zodb.org/
However, it's Python only.
If you are looking for an ACID compliant key/value store, there's Berkeley DB. Among graph databases at least Neo4j and HyperGraphDB offer ACID transactions (HyperGraphDB actually uses Berkeley DB for low-level storage at the moment).
FoundationDB was mentioned, and at the time it wasn't open source. It was open sourced by Apple two days ago:
https://www.foundationdb.org/blog/foundationdb-is-open-source/
I believe it is ACID compliant.
MongoDB announced that its 4.0 version will be ACID compliant for multi-document transactions.
Version 4.2 is supposed to support it in sharded setups.
https://www.mongodb.com/blog/post/multi-document-transactions-in-mongodb
NewSQL
Wikipedia contributors define this concept as:
[…] a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.[1][2][3]
References
[1] Nancy Lynch and Seth Gilbert, “Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services”, ACM SIGACT News, Volume 33 Issue 2 (2002), pg. 51-59.
[2] "Brewer's CAP Theorem", julianbrowne.com, Retrieved 02-Mar-2010
[3] "Brewers CAP theorem on distributed systems", royans.net
Take a look at the CAP theorem.
EDIT: RavenDB seems to be ACID compliant
To add to the list of alternatives, another fully ACID compliant NoSQL database is GT.M.
Hyperdex Warp http://hyperdex.org/warp/
Warp (ACID feature) is proprietary, but Hyperdex is free.
db4o
Unlike roll-your-own persistence or serialization, db4o is ACID transaction safe and allows for querying, replication and schema changes during runtime
http://www.db4o.com/about/productinformation/db4o/
BergDB is a light-weight, open-source, NoSQL database designed from the start to run ACID transactions. Actually, BergDB is "more" ACID than most SQL databases in the sense that the only way to change the state of the database is to run ACID transactions with the highest isolation level (SQL term: "serializable"). There will never be any issues with dirty reads, non-repeatable reads, or phantom reads.
In my opinion, the database is still highly performant; but don't trust me, I created the software. Try it yourself instead.
Tarantool is a fully ACID NoSQL database. You can issue CRUD operations or stored procedures, and everything will be run in strict accordance with the ACID properties. You can also read about that here: http://stable.tarantool.org/doc/mpage/data-and-persistence.html
MarkLogic is also ACID compliant. I think it is one of the biggest players now.
The wait is over.
An ACID compliant NoSQL DB is out: have a look at Citrusleaf.
A lot of modern NoSQL solutions don't support ACID transactions (atomic isolated multi-key updates), but most of them support primitives which allow you to implement transactions on the application level.
If a data store supports per-key linearizability and compare-and-set (document-level atomicity), then that is enough to implement client-side transactions. Moreover, you have several options to choose from:
If you need the Serializable isolation level, then you can follow the same algorithm which Google uses for the Percolator system or Cockroach Labs for CockroachDB. I've blogged about it and created a step-by-step visualization; I hope it will help you to understand the main idea behind the algorithm.
If you expect high contention but the Read Committed isolation level is fine for you, then please take a look at the RAMP transactions by Peter Bailis.
The third approach is to use compensating transactions, also known as the saga pattern. It was described in the late 80s in the Sagas paper but became more relevant with the rise of distributed systems. Please see the Applying the Saga Pattern talk for inspiration.
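To give a feel for the compensating-transaction idea, here is a minimal single-process sketch. `run_saga`, `debit`, `credit` and the `accounts` dict are invented for this example. A real saga would also persist its progress so it can resume after a crash, which is left out here.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps on failure."""
    done = []
    try:
        for action, compensation in steps:
            action()
            done.append(compensation)
    except Exception:
        # Compensate the completed steps in reverse order, then report failure.
        for compensation in reversed(done):
            compensation()
        return False
    return True


accounts = {"alice": 100, "bob": 0}


def debit(name, amount):
    if accounts[name] < amount:
        raise ValueError("insufficient funds")
    accounts[name] -= amount


def credit(name, amount):
    accounts[name] += amount


ok = run_saga([
    (lambda: debit("alice", 30), lambda: credit("alice", 30)),
    (lambda: credit("bob", 30), lambda: debit("bob", 30)),
])
print(ok, accounts)  # True {'alice': 70, 'bob': 30}

failed = run_saga([
    (lambda: debit("alice", 50), lambda: credit("alice", 50)),
    (lambda: debit("bob", 500), lambda: credit("bob", 500)),  # this step fails
])
print(failed, accounts)  # False {'alice': 70, 'bob': 30}
```

Note that between a step and its compensation other readers can see the intermediate state, so a saga gives you atomicity in the end but not isolation.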
The list of data stores suitable for client-side transactions includes Cassandra with lightweight transactions, Riak with consistent buckets, RethinkDB, ZooKeeper, etcd, HBase, DynamoDB, MongoDB and others.
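As a rough sketch of the compare-and-set building block these approaches rely on, here is a toy versioned store and an optimistic retry loop. `CasStore` and `atomic_update` are hypothetical names for this example, not any listed product's API. It shows only a single-key atomic read-modify-write, not a full multi-key protocol such as the Percolator-style one.

```python
import threading


class CasStore:
    """Toy store offering only per-key reads and compare-and-set."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def read(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        """Write only if the key is still at expected_version."""
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False
            self._data[key] = (version + 1, value)
            return True


def atomic_update(store, key, fn, max_retries=1000):
    """Optimistic read-modify-write built from compare-and-set alone."""
    for _ in range(max_retries):
        version, value = store.read(key)
        if store.compare_and_set(key, version, fn(value)):
            return
    raise RuntimeError("too much contention")


store = CasStore()
store.compare_and_set("counter", 0, 0)  # create the key at version 1


def bump():
    for _ in range(100):
        atomic_update(store, "counter", lambda v: v + 1)


threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.read("counter"))  # (401, 400)
```

No increment is lost even though the client never holds a lock across the read and the write; a losing writer simply re-reads and tries again.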
YugaByte DB supports ACID compliant distributed transactions as well as Redis and CQL API compatibility on the query layer.
Google Cloud Datastore is a NoSQL database that supports ACID transactions
DynamoDB is a NoSQL database and has ACID transactions.
VoltDB is an entrant which claims ACID compliance, and while it still uses SQL, its goals are the same in terms of scalability
Whilst it's only an embedded engine and not a server, leveldb has WriteBatch and the ability to turn on Synchronous writes to provide ACID behaviour.
Node levelUP is transactional and built on leveldb https://github.com/rvagg/node-levelup#batch
If you add enough pure water and successfully flip a coin, anything can become acidic. Or basic for that matter.
To say a database is ACID compliant means four specific things. And in defining the system (restricting the range) we can arbitrarily water down the meanings so that the result is ACID compliance.
A: if your NoSQL database only allows one record operation at a time and records either go or they don't, then that's atomic.
C: if the only constraints you allow are simple, like checking JSON documents against a known schema, then that's consistent.
I: if just append-only transactions are supported (and schema changes are disallowed), then it is impossible for anything to depend on anything else; that's isolated.
D: if you turn off all machines at night and synchronize disks, then the transactions will be in or they won't; that's durable.
What exactly is NoSQL? Is it database systems that only work with {key:value} pairs?
As far as I know MemCache is one of such database systems, am I right?
What other popular NoSQL databases are there and where exactly are they useful?
Thanks, Boda Cydo.
I don't agree with the answers I'm seeing. Although it's true that NoSQL solutions tend to break the ACID rules, not all of them are built with that approach.
I think you should first define what a SQL solution is, and then put "Not Only" in front of it; that gives a more accurate definition of what a NoSQL solution is.
With this approach in mind:
SQL databases are the data stores that are accessed using Structured Query Language as the main (and most of the time only) way to communicate with them. This means the database must support the structures that are common to those systems, like "tables", "columns", "rows", "relationships", etc.
Now put "Not Only" in front of the last sentence and you get a definition of what "NoSQL" means. NoSQL groups all the stores created as an attempt to solve problems which do not fit into the table/column/row structures, or even into SQL statements. In most cases these databases do not support relationships; they abandon the well-known structures because the problems have changed since those structures were conceived.
If you have a text file, and you create an API to store/retrieve/organize this information, then you have a NoSQL database in your hands.
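To show how little that takes, here is a minimal sketch of such an API over one JSON text file. `FileStore` and the `notes.json` file name are invented for this example, and the store is deliberately naive (it rereads and rewrites the whole file on every call).

```python
import json
import os
import tempfile


class FileStore:
    """Key-value store kept in a single JSON text file."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path, encoding="utf-8") as f:
            return json.load(f)

    def put(self, key, value):
        data = self._load()
        data[key] = value
        # Write a temp file and rename it over the old one, so readers
        # never see a half-written file.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(data, f)
        os.replace(tmp, self.path)

    def get(self, key, default=None):
        return self._load().get(key, default)


path = os.path.join(tempfile.mkdtemp(), "notes.json")
db = FileStore(path)
db.put("todo", ["buy milk", "file taxes"])
print(db.get("todo"))     # ['buy milk', 'file taxes']
print(db.get("missing"))  # None
```

There is no query language, no tables and no schema, yet it stores and retrieves data through an API, which is all the definition above asks for.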
All of this means that there are several ways to store information that traditional SQL systems do not allow, in order to achieve better performance, flexibility, etc. Every NoSQL provider tries to solve a different problem, and that's why you won't be able to compare two different solutions directly. For example:
djondb is a document store created to be used as a NoSQL enterprise solution supporting transactions, consistency, etc., but it sacrifices some performance compared with its counterparts.
MongoDB is a document store (similar to djondb) which achieves great performance but trades away some of the ACID properties to do so.
CouchDB is another document store which handles queries slightly differently, providing views to retrieve the information without doing a full query every time.
...
As you may have noticed, I only talked about document stores. That's because I wanted to show you that three different document store implementations take different approaches. Therefore you should keep in mind the golden rule of NoSQL stores: "Use the right tool for the right job".
I'm the creator of djondb and I've been doing a lot of research even before trying to start my own NoSQL implementation, but this is a field where the concepts will keep changing the way we see the information storage.
From Wikipedia:
NoSQL is an umbrella term for a loosely defined class of non-relational data stores that break with a long history of relational databases and ACID guarantees. Data stores that fall under this term may not require fixed table schemas, and usually avoid join operations. The term was first popularised in early 2009.
The motivation for such an architecture was high scalability, to support sites such as Facebook, advertising.com, etc...
To quickly get a handle on NoSQL systems, see this blog post I wrote: Visual Guide to NoSQL Systems. Essentially, NoSQL systems sacrifice either consistency or availability in favor of tolerance to network partitions.
What is NoSQL?
NoSQL stands for Not Only SQL. The basic qualities of NoSQL databases are that they are schemaless, distributed and horizontally scalable on commodity hardware. NoSQL databases offer a variety of functions to solve various problems with a variety of data types, whereas "blob" used to be the only data type in an RDBMS for storing unstructured data.
1 Dynamic Schema
NoSQL databases allow the schema to be flexible. New columns can be added at any time. Rows may or may not have values for those columns, and there is no strict enforcement of data types for columns. This flexibility is handy for developers, especially when they expect frequent changes over the course of the product life cycle.
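A small sketch of what that flexibility looks like in practice, using plain Python dicts to stand in for documents in one collection. The field names are made up for illustration.

```python
# Two "rows" in the same collection with different fields.
users = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Raj", "email": "raj@example.com", "tags": ["admin"]},
]

# A field added later simply appears on new documents; old ones are untouched.
# Nothing stops "id" from being a string here either.
users.append({"id": "u-3", "name": "Mia", "signup_year": 2024})

# Readers must tolerate missing fields, since nothing enforces them.
emails = [u.get("email", "<none>") for u in users]
print(emails)  # ['<none>', 'raj@example.com', '<none>']
```

The flip side is visible in the last step: the checks a fixed schema would have done for you move into the application code.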
2 Variety of Data
NoSQL databases support any type of data: structured, semi-structured and unstructured. Logs, image files, videos, graphs, JPEGs, JSON and XML can be stored and operated on as they are, without any pre-processing, which reduces the need for ETL (Extract – Transform – Load).
3 High Availability Cluster
NoSQL databases support distributed storage on commodity hardware. They also support high availability through horizontal scalability. This feature lets NoSQL databases benefit from the elastic nature of cloud infrastructure services.
4 Open Source
Many NoSQL databases are open source software. Using the software is free, and most of them can be used free of charge in commercial products. The open source codebase can be modified to meet business needs. There are minor variations between the open source licenses, so users must be aware of the license agreements.
5 NoSQL – Not Only SQL
NoSQL databases do not depend only on SQL to retrieve data. They provide rich API interfaces to perform DML and CRUD operations. These APIs are more developer friendly and are supported in a variety of programming languages.
Take a look at these:
http://en.wikipedia.org/wiki/Nosql#List_of_NoSQL_open_source_projects
and this:
http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB
I used something called the Raima Data Manager more than a dozen years ago that qualifies as NoSQL. It calls itself a "Set Oriented Database". It's not based on tables, and there is no query "language", just a C API for asking for subsets.
It's fast and easier to work with in C/C++ than SQL: there's no building up strings to pass to a query interpreter, and the data comes back as an enumerable object rather than as an array. Variable-sized records are normal and don't waste space. I never saw the source code, but there were some hints in the interface that, internally, the code used pointers a lot.
I'm not sure that the product I used is even sold anymore, but the company is still around.
MongoDB looks interesting, SourceForge is now using it.
I listened to a podcast with a team member. The idea with NoSQL isn't so much to replace SQL as it is to provide a solution for problems that aren't solved well with traditional RDBMS. As mentioned elsewhere, they are faster and scale better at the cost of reliability and atomicity (different solutions to different degrees). You wouldn't want to use one for a financial system, but a document based system would work great.
Here is a comprehensive list of NoSQL Databases: http://nosql-database.org/.
I'm glad that you have had success with RDM John! I work at Raima so it's great to hear feedback. For those looking for more information, here are a couple of resources:
Video Overview of RDM's General Architecture
Free Evaluation Download of RDM
I am trying to decide whether to use Voldemort or CouchDB for an upcoming healthcare project. I want a storage system that has high availability and fault tolerance, and can scale for the massive amounts of data being thrown at it.
What are the pros/cons of each?
Thanks
Project Voldemort looks nice, but I haven't looked deeply into it so far.
In its current state CouchDB might not be the right thing for "massive amounts of data". Distributing data between nodes and routing queries accordingly is on the roadmap but not implemented so far. The biggest known production setups of CouchDB use "tables" ("databases" in couch-speak) of about 200G.
HA is not natively supported by CouchDB but can be built easily: all CouchDB nodes replicate the databases between each other in a multi-master setup. We put two Varnish proxies in front of the CouchDB machines, and the Varnish boxes are made redundant with CARP. CouchDB's "built from the Web" design makes such things very easy.
The most pressing issue in our setup is the fact that there are still issues with the replication of large (multi MB) attachments to CouchDB documents.
I suggest you also check the traditional RDBMS route. There are huge issues with available talent outside the RDBMS approach and there are very capable offerings available from Oracle & Co.
Not knowing enough from your question, I would nevertheless say Project Voldemort or distributed hash tables (DHTs) like CouchDB in general are a solution to your problem of HA.
Those DHTs are very nice for high availability but harder to write code for than traditional relational databases (RDBMS) concerning consistency.
They are quite good to store document type information, which may fit nicely with your healthcare project but make development harder for data.
The biggest limitation of most stores is that they are not transactionally safe (see Scalaris for a transactionally safe store), so you need to ensure data consistency yourself; most use read-time consistency, merging conflicting data. RDBMSs are much easier to use for consistency of data (ACID).
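To illustrate what merging conflicting data at read time can look like, here is a minimal sketch. `merge_siblings` and the sample allergy data are invented for this example, and set union is only one possible merge rule; the right rule depends on the data.

```python
def merge_siblings(siblings):
    """Resolve conflicting versions of a set-valued record by union."""
    merged = set()
    for version in siblings:
        merged |= version
    return merged


# Two replicas accepted different writes to the same allergy list.
replica_a = {"penicillin", "latex"}
replica_b = {"penicillin", "peanuts"}
print(sorted(merge_siblings([replica_a, replica_b])))
# ['latex', 'peanuts', 'penicillin']
```

A union never loses an addition, but it can bring back an item that one replica had deliberately removed, which is exactly the kind of decision the store leaves to your code.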
Joining data is much harder too. In an RDBMS you can easily query data across several tables, whereas in CouchDB you need to write code to aggregate data. For other stores, Hadoop may be a good choice for aggregating information.
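As a small sketch of what "writing code to aggregate data" means, here is a join done in the application over two lists of documents. The `patients` and `visits` documents and their field names are made up for this example and do not use any CouchDB API.

```python
from collections import defaultdict

patients = [
    {"_id": "p1", "name": "Ann"},
    {"_id": "p2", "name": "Raj"},
]
visits = [
    {"_id": "v1", "patient_id": "p1", "reason": "checkup"},
    {"_id": "v2", "patient_id": "p1", "reason": "x-ray"},
    {"_id": "v3", "patient_id": "p2", "reason": "flu"},
]

# In SQL this would be a single JOIN between two tables.
# Here the application groups the visits by patient itself.
by_patient = defaultdict(list)
for v in visits:
    by_patient[v["patient_id"]].append(v["reason"])

joined = {p["name"]: by_patient[p["_id"]] for p in patients}
print(joined)  # {'Ann': ['checkup', 'x-ray'], 'Raj': ['flu']}
```

With a handful of documents this is trivial, but at scale you have to decide where this grouping runs and how it stays up to date, which the RDBMS query planner would otherwise handle.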
Read about BASE and the CAP theorem on consistency vs. availability.
See
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
http://queue.acm.org/detail.cfm?id=1394128
Is memcacheDB an option? I've heard that's how Digg handled HA issues.