Opinion: Reason to use NoSQL - database

I got two opinions about NoSQL from my friend.
First: Use NoSQL to boost performance and save occasional updated data. Still use sql to save all important dan transaction data.
Second: Don't use NoSQL if you didn't really need it. Use it if you really save big data.
I've used NoSQL and its really fast when selecting data.
I want to know, is first opinion only enough for implementing NoSQL? What all of you think about these?
NOTE: In my case, it still running well with SQL. I want to add NoSQL for improving data reading speed. so it will work alongside.
Is it worth it to use NoSQL this early?
Thanks in advance

It depends on what you are designing.
From my experience scaling out data collection I have found traditional relational storage to be a bottleneck in terms of its inability to scale out over multiple nodes when a databases gets very large. Sure it scales up but this becomes cost prohibitive at some point. In this scenario it would therefore depend on your medium to long term data storage projections. The solution for me was therefore mixture of relational storage for data that may be updated frequently and noSQL (document storage) for data the has a fast rate of growth that is generally not updated post write.
Things to take into account:
Queries
SQL relational storage supports a growing subset languages for queries, as well as a wide range of filters, sorting options, and projections and index queries. NoSQL does all this as well, but SQL can often go beyond it, allowing powerful aggregations of your data as well, beyond what NoSQL can do.
Transactions
Transactions are important because they ensure that you have atomically made changes to your database. Many NoSQL platforms don’t support transactions, so be aware of this feature when you’re figuring out which to use, and what your own needs are.
Consistency
MySQL platforms often use a single master to guarantee strong consistency in your database. These use synchronous replication to ensure you don’t lose important changes queued up to the master. NoSQL, by contrast, does replication of entity groups without a master, so that data is strong within an entity group, and is eventually updated across all groups. The better option depends on the constraints and needs of your database.
Scalability
For years, database administrators relied on scaling up, buying bigger servers as database load increased. However, as transaction rates and demands on the databases continue to expand immensely, emphasis is on scaling out instead. Scaling out is distributing databases across multiple hosts, and that’s something NoSQL does better than standard SQL. They’re designed for optimal use on scaled out databases.
Management
NoSQL databases are generally designed to require less management overall. Repairs are often automatic, and data distribution and simpler data models contribute to less administration required overall. However, you’ve also got less support when there’s a problem. SQL platforms often have vendors waiting to supply support to enterprises.
Schema
Regular SQL platforms often have strictly enforced rules for a schema change, to stave off user-created typos that can put faults in your query. NoSQL platforms will have their own mechanisms for combating this.
Hope that helps.

NoSQL scores over SQL in below areas
It support semi-structured data and volatile data. You can change the structure at any time
It does not have schema
Read/Write through put is very high
Horizontal scalability is easily achieved - Add cheaper hardware and provide right replication factor
Will support Bigdata in volumes of Terra Bytes & Peta Bytes by using cheaper hardware
Good support for Analytic tools on top of Bigdata, especially Hadoop/Hbase family
In memory caching option is available to increase the performance of queries
Faster development life cycles for developers
When you should not use NoSQL and go for SQL
If you require business critical transaction with ACID properties i.e where Consistency is key & Eventual consistency is not an option
If you have heavy aggregation queries spanning multiple entities
In summary, you have to use right technology for right business use case. i.e combination of SQL and NoSQL
Regarding your queries:
Use SQL for business critical transactions. If your SQL is scaling for your business requirements, use SQL.
Use NoSQL for huge volumes of data in magnitudes of Tera/Peta bytes with variety of data , where SQL can't handle that volume & variety.

As others pointed out both SQL and NoSQL (Not only SQL) have their advantages.
There is often temptation to use both side by side and get maximum out of it. Something referred to as Polyglot persistence
Is it a good idea? Sometimes, yes.
Should I do it?
While it may have benefits, the trade off comes with maintenance of multiple stores (note: they would have different ways of database management).
Also the data sync is a bigger one if you are planning this for same transactional system.
If the data you are going to store (in sql and no-sql databases) can be logically separated then you might be ok. But in case they are closely related then you are going to have tough time keeping them consistent.
Overall when i evaluated this option, i came to conclude that it would work only when you can logically partition the data. Another use case may be using Nosql for Analytics and continue with sql for transaction system.
Going back to your use case, did you try JSON storage within your sql database. It may give you benefit of performance without much tradeoffs.

Related

Scaling in nosql vs rdbms?

I am trying to understand the architectural difference in nosql and relational databases in the context of scalability.
My understanding of scalability(horizontal) is that as our data grows, we add more and more server to split the load evenly.
In key-value NO-SQL databases, we can add the new machine and split the keys. However, all of examples I have seen so far to understand eventual consistency in NO-SQL databases, they all have master-slave configuration where data is replicated across all the slaves instead of splitting across various machine to achieve scalability.
My question is doesn't using replicating your whole data defeat the whole point of scalability in No-SQL databases? The same can be done in RDBMS as well, with one master(for write) and slaves(for reads), how is NO-SQL more scalable in this regard?
The goal of scalability is to increase the overall capacity of a given application, and can can be vertical (bigger machines) or horizontal (add more machines). When it comes to horizonal scaling, you can add more machines, but as the number of machines increase, so does the probability of a failure of a node in the cluster, which is something to keep in mind.
When you add more nodes, what you can do is either split the data, which is called sharding, and you can also duplicate the data, which is called replication.
Replication
With replication, the usual architecture is master-slave, where you can only write to the master, who replicates the data to the slaves, so this means that you cannot use replication to split the writes to the cluster, but it is possible to split the reads, deppending on the consistency level (not all of the NoSQL technologies provide the same level) and the cluster configuration.
Sharding
Sharding is more suited to provide scaling, as you split the data set in multiple parts with similar sizes if possible. This clearly allows the benefit to split reads and writes to different nodes. In order to make it work, some mechanisms need to be in place:
routing: to locate in which shard is the data is stored, or to decide in which shard to write
balancing: to maintain the sizes of the different fragments of the dataset with a similar size over time.
But usually these mechanisms are provided by the database vendor, so no need to worry about providing it, but is still necessary to understand to manage the cluster.
The problem here is, as I mentioned at first, the more nodes a cluster has, the higher is the chance to have a failure on a given node, meaning that if a node with a part of the data set goes offline, part of the data won't be available, which is not a desirable scenario. But luckyly, sharding and replication are not exclusive, is it possible to build a sharded cluster, where each shard is a cluster on with replication in place.
But in order to answer your questions
doesn't using replicating your whole data defeat the whole point of
scalability in No-SQL databases?
In master-slave architectures, you cannot split the writes, but you can split the reads, which is somewhat a way to scale, although, the main purpose is high availability.
Anyways, there are new emergent databases that start to provide multi-master architecture, where all the nodes act as master, meaning all can receive both, writes and reads.
The same can be done in RDBMS as well, with one master(for write) and
slaves(for reads), how is NO-SQL more scalable in this regard?
In a single node environment, NoSQL is already faster than RDBMS when there are JOINS involved, as it is an expensive operation, or when there are a lot of integrity checks involved.
So, when you try to shard the dataset in a RDBMS, unless really carefully designed, the most likely scenario will be the desired data located in different shards. This means that the JOIN and the integrity checks need to be performed between different nodes, making them even more expensive operations than they already are.
This means that RDBMS databases, use mechanisms that act as contraints when you intent to scale horizontally, which NoSQL doesn't. Yes, you can still scale RDBMS horizontally, but overall will be more expensive than using NoSQL databases.
Update: special mention for graph databases
Sharding in graph databases is really difficult, since mathematically, the problem of distributing a large graph between different servers is NP complete. And also, when having to query data among different shards, one of the main features of graphs is lost, fast transverses.
I've seen 2 main approaches that graph databases follows to scale horizontally:
1) Let the application/developer decide how to partition the graph, which you can imagine how complex this can be.
2) Replicate all the graph in all the nodes and use cache-sharding, which means, that all the nodes have the entire data set, but each node maintains in memory the part of the graph that is most queried for that node in particular.
I guess, that in the future, graph database companies will develop more solutions to address this issue.
Related to your question, at their current state, graph databases can still outscale RDBMS when it comes to horizontal scaling, due to the lack of the RDBMS contraints, but its hard to compare between different NoSQL database types.
the master-slave configuration in NoSQL databases is for High-Avaibility purposes and data consistency, not to confound with the purpose of scalability wich is to load balance the workload.
In NoSQL, so far as splitting the keys is concerned, only the Master copies matter. Slaves are for HA and Availability in general. This replication, in fact, is responsible for the eventual consistency -- you will get the data right away, may be it is not the most updated one, but eventually you will have the updated data.
On the other hand, RDBMS will have slower data access/modification because it has to follow ACID properties, and mostly that is with strong-consistency.
Replication is not the differentiating factor, as such, between NoSQL and RDBMS, adherence/non-adherence to ACID properties is. Nor does Scalability mean absence of extra copies. Hope that answers.
To answer your question, replicating your data does not defeat the point of scalability.
Scalability roughly refers to the ability to grow a database and is not necessarily tied to having more copies of the database.
More servers with the database information on them allows for more readily available access to it to more users, as the other answers have stated.
I believe this might be a misunderstanding between Scalability and Availability?
If we just consider key-value databases vs SQL databases then the former ones are more suitable for scalability than the latter ones.
This is because a key-value store doesn't have transactions. So your only guarantee is that you can atomically change the one value for the one key. Which results in ease of scaling.
For example you just hash a key and store a key-value pairs on a machine corresponding to the hash of the key.
You can't do the same thing to the SQL database without loosing an ACID (atomicity, consistency, isolation, durability of a transaction) property. Moreover you can't even easily perform a join SELECT if you store different tables or different parts of a table on different machines.
So in general NoSQL databases are more prepared for sharding across many machines than SQL ones.

NoSql vs Relational database

Recently NoSQL has gained immense popularity.
What are the advantages of NoSQL over traditional RDBMS?
Not all data is relational. For those situations, NoSQL can be helpful.
With that said, NoSQL stands for "Not Only SQL". It's not intended to knock SQL or supplant it.
SQL has several very big advantages:
Strong mathematical basis.
Declarative syntax.
A well-known language in Structured Query Language (SQL).
Those haven't gone away.
It's a mistake to think about this as an either/or argument. NoSQL is an alternative that people need to consider when it fits, that's all.
Documents can be stored in non-relational databases, like CouchDB.
Maybe reading this will help.
The history seem to look like this:
Google needs a storage layer for their inverted search index. They figure a traditional RDBMS is not going to cut it. So they implement a NoSQL data store, BigTable on top of their GFS file system. The major part is that thousands of cheap commodity hardware machines provides the speed and the redundancy.
Everyone else realizes what Google just did.
Brewers CAP theorem is proven. All RDBMS systems of use are CA systems. People begin playing with CP and AP systems as well. K/V stores are vastly simpler, so they are the primary vehicle for the research.
Software-as-a-service systems in general do not provide an SQL-like store. Hence, people get more interested in the NoSQL type stores.
I think much of the take-off can be related to this history. Scaling Google took some new ideas at Google and everyone else follows suit because this is the only solution they know to the scaling problem right now. Hence, you are willing to rework everything around the distributed database idea of Google because it is the only way to scale beyond a certain size.
C - Consistency
A - Availability
P - Partition tolerance
K/V - Key/Value
NoSQL is better than RDBMS because of the following reasons/properities of NoSQL
It supports semi-structured data and volatile data
It does not have schema
Read/Write throughput is very high
Horizontal scalability can be achieved easily
Will support Bigdata in volumes of Terra Bytes & Peta Bytes
Provides good support for Analytic tools on top of Bigdata
Can be hosted in cheaper hardware machines
In-memory caching option is available to increase the performance of queries
Faster development life cycles for developers
EDIT:
To answer "why RDBMS cannot scale", please take a look at RDBMS Overheads pdf written by Stavros Harizopoulos,Daniel J. Abadi,Samuel Madden and Michael Stonebraker
RDBMS's have challenges in handling huge data volumes of Terabytes & Peta bytes. Even if you have Redundant Array of Independent/Inexpensive Disks (RAID) & data shredding, it does not scale well for huge volume of data. You require very expensive hardware.
Logging: Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
Locking: Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
Latching: In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
Buffer management: A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
This does not mean that we have to use NoSQL over SQL.
Still, RDBMS is better than NoSQL for the following reasons/properties of RDBMS
Transactions with ACID properties - Atomicity, Consistency, Isolation & Durability
Adherence to Strong Schema of data being written/read
Real time query management ( in case of data size < 10 Tera bytes )
Execution of complex queries involving join & group by clauses
We have to use RDBMS (SQL) and NoSQL (Not only SQL) depending on the business case & requirements
NOSQL has no special advantages over the relational database model. NOSQL does address certain limitations of current SQL DBMSs but it doesn't imply any fundamentally new capabilities over previous data models.
NOSQL means only no SQL (or "not only SQL") but that doesn't mean the same as no relational. A relational database in principle would make a very good NOSQL solution - it's just that none of the current set of NOSQL products uses the relational model.
RDBMS focus more on relationship and NoSQL focus more on storage.
You can consider using NoSQL when your RDBMS reaches bottlenecks. NoSQL makes RDBMS more flexible.
The biggest advantage of NoSQL over RDBMS is Scalability.
NoSQL databases can easily scale-out to many nodes, but for RDBMS it is very hard.
Scalability not only gives you more storage space but also much higher performance since many hosts work at the same time.
If you need to process huge amount of data with high performance
OR
If data model is not predetermined
then
NoSQL database is a better choice.
Just adding to all the information given above
NoSql Advantages:
1) NoSQL is good if you want to be production ready fast due to its support for schema-less and object oriented architecture.
2) NoSql db's are eventually consistent which in simple language means they will not provide any lock on the data(documents) as in case of RDBMS and what does it mean is latest snapshot of data is always available and thus increase the latency of your application.
3) It uses MVCC (Multi view concurrency control) strategy for maintaining and creating snapshot of data(documents).
4) If you want to have indexed data you can create view which will automatically index the data by the view definition you provide.
NoSql Disadvantages:
1) Its definitely not suitable for big heavy transactional applications as it is eventually consistent and does not support ACID properties.
2) Also it creates multiple snapshots (revisions) of your data (documents) as it uses MVCC methodology for concurrency control, as a result of which space get consumed faster than before which makes compaction and hence reindexing more frequent and it will slow down your application response as the data and transaction in your application grows.
To counter that you can horizontally scale the nodes but then again it will be higher cost as compare sql database.
From mongodb.com:
NoSQL databases differ from older, relational technology in four main areas:
Data models: A NoSQL database lets you build an application without having to define the schema first unlike relational databases which make you define your schema before you can add any data to the system. No predefined schema makes NoSQL databases much easier to update as your data and requirements change.
Data structure: Relational databases were built in an era where data was fairly structured and clearly defined by their relationships. NoSQL databases are designed to handle unstructured data (e.g., texts, social media posts, video, email) which makes up much of the data that exists today.
Scaling: It’s much cheaper to scale a NoSQL database than a relational database because you can add capacity by scaling out over cheap, commodity servers. Relational databases, on the other hand, require a single server to host your entire database. To scale, you need to buy a bigger, more expensive server.
Development model: NoSQL databases are open source whereas relational databases typically are closed source with licensing fees baked into the use of their software. With NoSQL, you can get started on a project without any heavy investments in software fees upfront.

When NOT to use Cassandra? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
There has been a lot of talk related to Cassandra lately.
Twitter, Digg, Facebook, etc all use it.
When does it make sense to:
use Cassandra,
not use Cassandra, and
use a RDMS instead of Cassandra.
There is nothing like a silver bullet, everything is built to solve specific problems and has its own pros and cons. It is up to you, what problem statement you have and what is the best fitting solution for that problem.
I will try to answer your questions one by one in the same order you asked them. Since Cassandra is based on the NoSQL family of databases, it's important you understand why use a NoSQL database before I answer your questions.
Why use NoSQL
In the case of RDBMS, making a choice is quite easy because all the databases like MySQL, Oracle, MS SQL, PostgreSQL in this category offer almost the same kind of solutions oriented toward ACID properties. When it comes to NoSQL, the decision becomes difficult because every NoSQL database offers different solutions and you have to understand which one is best suited for your app/system requirements. For example, MongoDB is fit for use cases where your system demands a schema-less document store. HBase might be fit for search engines, analyzing log data, or any place where scanning huge, two-dimensional join-less tables is a requirement. Redis is built to provide In-Memory search for varieties of data structures like trees, queues, linked lists, etc and can be a good fit for making real-time leaderboards, pub-sub kind of system. Similarly there are other databases in this category (Including Cassandra) which are fit for different problem statements. Now lets move to the original questions, and answer them one by one.
When to use Cassandra
Being a part of the NoSQL family, Cassandra offers a solution for problems where one of your requirements is to have a very heavy write system and you want to have a quite responsive reporting system on top of that stored data. Consider the use case of Web analytics where log data is stored for each request and you want to built an analytical platform around it to count hits per hour, by browser, by IP, etc in a real time manner. You can refer to this blog post to understand more about the use cases where Cassandra fits in.
When to Use a RDMS instead of Cassandra
Cassandra is based on a NoSQL database and does not provide ACID and relational data properties. If you have a strong requirement for ACID properties (for example Financial data), Cassandra would not be a fit in that case. Obviously, you can make a workaround for that, however you will end up writing lots of application code to simulate ACID properties and will lose on time to market badly. Also managing that kind of system with Cassandra would be complex and tedious for you.
When not to use Cassandra
I don't think it needs to be answered if the above explanation makes sense.
When evaluating distributed data systems, you have to consider the CAP theorem - you can pick two of the following: consistency, availability, and partition tolerance.
Cassandra is an available, partition-tolerant system that supports eventual consistency. For more information see this blog post I wrote: Visual Guide to NoSQL Systems.
Cassandra is the answer to a particular problem: What do you do when you have so much data that it does not fit on one server ? How do you store all your data on many servers and do not break your bank account and not make your developers insane ? Facebook gets 4 Terabyte of new compressed data EVERY DAY. And this number most likely will grow more than twice within a year.
If you do not have this much data or if you have millions to pay for Enterprise Oracle/DB2 cluster installation and specialists required to set it up and maintain it, then you are fine with SQL database.
However Facebook no longer uses cassandra and now uses MySQL almost exclusively moving the partitioning up in the application stack for faster performance and better control.
The general idea of NoSQL is that you should use whichever data store is the best fit for your application. If you have a table of financial data, use SQL. If you have objects that would require complex/slow queries to map to a relational schema, use an object or key/value store.
Of course just about any real world problem you run into is somewhere in between those two extremes and neither solution will be perfect. You need to consider the capabilities of each store and the consequences of using one over the other, which will be very much specific to the problem you are trying to solve.
Besides the answers given above about when to use and when not to use Cassandra, if you do decide to use Cassandra you may want to consider not using Cassandra itself, but one of the its many cousins out there.
Some answers above already pointed to various "NoSQL" systems which share many properties with Cassandra, with some small or large differences, and may be better than Cassandra itself for your specific needs.
Additionally, recently (several years after this question was originally asked), a Cassandra clone called Scylla (see https://en.wikipedia.org/wiki/Scylla_(database)) was released. Scylla is an open-source re-implementation of Cassandra in C++, which claims to have significantly higher throughput and lower latencies than the original Java Cassandra, while being mostly compatible with it (in features, APIs, and file formats). So if you're already considering Cassandra, you may want to consider Scylla as well.
I will focus here on some of the important aspects which can help you to decide if you really need Cassandra. The list is not exhaustive, just some of the points which I have at top of my mind-
Don't consider Cassandra as the first choice when you have a strict requirement on the relationship (across your dataset).
Cassandra by default is AP system (of CAP). But, it supports tunable consistency which means it can be configured to support as CP as well. So don't ignore it just because you read somewhere that it's AP and you are looking for CP systems. Cassandra is more accurately termed “tuneably consistent,” which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.
Don't use Cassandra if your scale is not much or if you can deal with a non-distributed DB.
Think harder if your team thinks that all your problems will be solved if you use distributed DBs like Cassandra. To start with these DBs is very simple as it comes with many defaults but optimizing and mastering it for solving a specific problem would require a good (if not a lot) amount of engineering effort.
Cassandra is column-oriented but at the same time each row also has a unique key. So, it might be helpful to think of it as an indexed, row-oriented store. You can even use it as a document store.
Cassandra doesn't force you to define the fields beforehand. So, if you are in a startup mode or your features are evolving (as in agile) - Cassandra embraces it. So better, first think about queries and then think about data to answer them.
Cassandra is optimized for really high throughput on writes. If your use case is read-heavy (like cache) then Cassandra might not be an ideal choice.
Right. It makes sense to use Cassandra when you have a huge amount of data, a huge number of queries but very little variety of queries. Cassandra basically works by partitioning and replicating. If all your queries will be based on the same partition key, Cassandra is your best bet. If you get a query on an attribute that is not the partition key, Cassandra allows you to replicate the whole data with a new partition key. So now you have 2 replicas of the same data with 2 different partition keys.
Which brings me to your next question. When not to use Cassandra. As I mentioned, Cassandra scales by replicating the complete database for every new partitioning key. But you can't keep making new copies again and again. So when you have a high variety in queries i.e. each query has a different column in the where clause, Cassandra is not a good option.
Now for the third question. The whole point of using RDBMS is when you want the ACID properties. If you are building something like a payment service and want each transaction to be isolated, each transaction to either complete or not happen at all, changes to be persistent despite system failure, and the money to be consistent across bank accounts before and after the transaction completes, an RDBMS is the only option that will help you achieve this.
This article actually explains the whole thing, especially when to use Cassandra or not (as opposed to some other NoSQL option) part of the question -> Choosing the best Database. Do check it out.
EDIT: To answer the question in the comments by proximab, when we think of banking systems we immidiately think "ACID is the best solution". But even banking systems are made up of several subsystems that might not even be dealing with any transaction related data like account holder's personal information, account statements, credit card details, credit histories, etc.
All of this information needs to be stored in some database or the another. Now if you store the account related information like account balance, that is something that needs to be consistent at all times. For example, if you try to send money from account A to account B, then the money that disappears from account A should instantaneousy show up in account B, and it cannot be present in both accounts at the same time. This system cannot be inconsistant at any point. This is where ACID is of utmost importance.
On the other hand if you are saving credit card details or credit histories, that should not get into the wrong hands, then you need something that allows access only to authorised users. That I believe is supported by Cassandra. That said, data like credit history and credit card transactions, I think that is an ever increasing data. Also there is only so much yo can query on this data i.e. it has a very finite number of queries. These two conditions make Cassandra a perfect solution.
Talking with someone in the midst of deploying Cassandra, it doesn't handle the many-to-many well. They are doing a hack job to do their initial testing. I spoke with a Cassandra consultant about this and he said he wouldn't recommend it if you had this problem set.
You should ask your self the following questions:
(Volume, Velocity) Will you be writing and reading TONS of information , so much information that no one computer could handle the writes.
(Global) Will you need this writing and reading capability around the world so that the writes in one part of the world are accessible in another part of the world?
(Reliability) Do you need this database to be up and running all the time and never go down regardless of which Cloud, which country, whether it's VM , Container, or Bare metal?
(Scale-ability) Do you need this database to be able to continue to grow easily and scale linearly
(Consistency) Do you need TUNABLE consistency where some writes can happen asynchronously where as others need to be certified?
(Skill) Are you willing to do what it takes to learn this technology and the data modeling that goes with creating a globally distributed database that can be fast for everyone, everywhere?
If for any of these questions you thought "maybe" or "no," you should use something else. If you had "hell yes" as an answer to all of them, then you should use Cassandra.
Use RDBMS when you can do everything on one box. It's probably easier than most and anyone can work with it.
Heavy single query vs. gazillion light query load is another point to consider, in addition to other answers here. It's inherently harder to automatically optimize a single query in a NoSql-style DB. I've used MongoDB and ran into performance issues when trying to calculate a complex query. I haven't used Cassandra but I expect it to have the same issue.
On the other hand, if your load is expected to be that of very many small queries, and you want to be able to easily scale out, you could take advantage of eventual consistency that is offered by most NoSql DBs. Note that eventual consistency is not really a feature of a non-relational data model, but it is much easier to implement and to set up in a NoSql-based system.
For a single, very heavy query, any modern RDBMS engine can do a decent job parallelizing parts of the query and take advantage of as much CPU and memory you throw at it (on a single machine). NoSql databases don't have enough information about the structure of the data to be able to make assumptions that will allow truly intelligent parallelization of a big query. They do allow you to easily scale out more servers (or cores) but once the query hits a complexity level you are basically forced to split it apart manually to parts that the NoSql engine knows how to deal with intelligently.
In my experience with MongoDB, in the end because of the complexity of the query there wasn't much Mongo could do to optimize it and run parts of it on multiple data. Mongo parallelizes multiple queries but isn't so good at optimizing a single one.
Let's read some real world cases:
http://planetcassandra.org/apache-cassandra-use-cases/
In this article: http://planetcassandra.org/blog/post/agentis-energy-stores-over-15-billion-records-of-time-series-usage-data-in-apache-cassandra
They elaborated the reason why they didn't choose MySql is because db synchronization is too slow.
(Also due to 2-phrase commit, FK, PK)
Cassandra is based on Amazon Dynamo paper
Features:
Stability
High availability
Backup performs well
Read and Write is better than HBase, (BigTable clone in java).
wiki http://en.wikipedia.org/wiki/Apache_Cassandra
Their Conclusion is:
We looked at HBase, Dynamo, Mongo and Cassandra.
Cassandra was simply the best storage solution for the majority of our data.
As of 2018,
I would recommend using ScyllaDB to replace classic cassandra, if you need back support.
Postgres kv plugin is also quick than cassandra. How ever won't have multi-instance scalability.
another situation that makes the choice easier is when you want to use aggregate function like sum, min, max, etcetera and complex queries (like in the financial system mentioned above) then a relational database is probably more convenient then a nosql database since both are not possible on a nosql databse unless you use really a lot of Inverted indexes. When you do use nosql you would have to do the aggregate functions in code or store them seperatly in its own columnfamily but this makes it all quite complex and reduces the performance that you gained by using nosql.
Cassandra is a good choice if:
You don't require the ACID properties from your DB.
There would be massive and huge number of writes on the DB.
There is a requirement to integrate with Big Data, Hadoop, Hive and Spark.
There is a need of real time data analytics and report generations.
There is a requirement of impressive fault tolerant mechanism.
There is a requirement of homogenous system.
There is a requirement of lots of customisation for tuning.
If you need a fully consistent database with SQL semantics, Cassandra is NOT the solution for you. Cassandra supports key-value lookups. It does not support SQL queries. Data in Cassandra is "eventually consistent". Concurrent lookups of data may be inconsistent, but eventually lookups are consistent.
If you need strict semantics and need support for SQL queries, choose another solution such as MySQL, PostGres, or combine use of Cassandra with Solr.
Apache cassandra is a distributed database for managing large amounts of structured data across many commodity servers, while providing highly available service and no single point of failure.
The archichecture is purely based on the cap theorem, which is availability , and partition tolerance, and interestingly eventual consistently.
Dont Use it, if your not storing volumes of data across racks of clusters,
Dont use if you are not storing Time series data,
Dont Use if you not patitioning your servers,
Dont use if you require strong Consistency.
Mongodb has very powerful aggregate functions and an expressive aggregate framework. It has many of the features developers are accustomed to using from the relational database world. It's document data/storage structure allows for more complex data models than Cassandra, for example.
All this comes with trade-offs of course. So when you select your database (NoSQL, NewSQL, or RDBMS) look at what problem you are trying to solve and at your scalability needs. No one database does it all.
According to DataStax, Cassandra is not the best use case when there is a need for
1- High end hardware devices.
2- ACID compliant with no roll back (bank transaction)
It does not support complete transaction management across the
tables.
Secondary Index not supported.
Have to rely on Elastic search /Solr for Secondary index and the custom sync component has to be written.
Not ACID compliant system.
Query support is limited.

What is NoSQL, how does it work, and what benefits does it provide? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I've been hearing things about NoSQL and that it may eventually become the replacement for SQL DB storage methods due to the fact that DB interaction is often a bottle neck for speed on the web.
So I just have a few questions:
What exactly is it?
How does it work?
Why would it be better than using a SQL Database? And how much better is it?
Is the technology too new to start implementing yet or is it worth taking a look into?
There is no such thing as NoSQL!
NoSQL is a buzzword.
For decades, when people were talking about databases, they meant relational databases. And when people were talking about relational databases, they meant those you control with Edgar F. Codd's Structured Query Language. Storing data in some other way? Madness! Anything else is just flatfiles.
But in the past few years, people started to question this dogma. People wondered if tables with rows and columns are really the only way to represent data. People started thinking and coding, and came up with many new concepts how data could be organized. And they started to create new database systems designed for these new ways of working with data.
The philosophies of all these databases were different. But one thing all these databases had in common, was that the Structured Query Language was no longer a good fit for using them. So each database replaced SQL with their own query languages. And so the term NoSQL was born, as a label for all database technologies which defy the classic relational database model.
So what do NoSQL databases have in common?
Actually, not much.
You often hear phrases like:
NoSQL is scalable!
NoSQL is for BigData!
NoSQL violates ACID!
NoSQL is a glorified key/value store!
Is that true? Well, some of these statements might be true for some databases commonly called NoSQL, but every single one is also false for at least one other. Actually, the only thing NoSQL databases have in common, is that they are databases which do not use SQL. That's it. The only thing that defines them is what sets them apart from each other.
So what sets NoSQL databases apart?
So we made clear that all those databases commonly referred to as NoSQL are too different to evaluate them together. Each of them needs to be evaluated separately to decide if they are a good fit to solve a specific problem. But where do we begin? Thankfully, NoSQL databases can be grouped into certain categories, which are suitable for different use-cases:
Document-oriented
Examples: MongoDB, CouchDB
Strengths: Heterogenous data, working object-oriented, agile development
Their advantage is that they do not require a consistent data structure. They are useful when your requirements and thus your database layout changes constantly, or when you are dealing with datasets which belong together but still look very differently. When you have a lot of tables with two columns called "key" and "value", then these might be worth looking into.
Graph databases
Examples: Neo4j, GiraffeDB.
Strengths: Data Mining
While most NoSQL databases abandon the concept of managing data relations, these databases embrace it even more than those so-called relational databases.
Their focus is at defining data by its relation to other data. When you have a lot of tables with primary keys which are the primary keys of two other tables (and maybe some data describing the relation between them), then these might be something for you.
Key-Value Stores
Examples: Redis, Cassandra, MemcacheDB
Strengths: Fast lookup of values by known keys
They are very simplistic, but that makes them fast and easy to use. When you have no need for stored procedures, constraints, triggers and all those advanced database features and you just want fast storage and retrieval of your data, then those are for you.
Unfortunately they assume that you know exactly what you are looking for. You need the profile of User157641? No problem, will only take microseconds. But what when you want the names of all users who are aged between 16 and 24, have "waffles" as their favorite food and logged in in the last 24 hours? Tough luck. When you don't have a definite and unique key for a specific result, you can't get it out of your K-V store that easily.
Is SQL obsolete?
Some NoSQL proponents claim that their favorite NoSQL database is the new way of doing things, and SQL is a thing of the past.
Are they right?
No, of course they aren't. While there are problems SQL isn't suitable for, it still got its strengths. Lots of data models are simply best represented as a collection of tables which reference each other. Especially because most database programmers were trained for decades to think of data in a relational way, and trying to press this mindset onto a new technology which wasn't made for it rarely ends well.
NoSQL databases aren't a replacement for SQL - they are an alternative.
Most software ecosystems around the different NoSQL databases aren't as mature yet. While there are advances, you still haven't got supplemental tools which are as mature and powerful as those available for popular SQL databases.
Also, there is much more know-how for SQL around. Generations of computer scientists have spent decades of their careers into research focusing on relational databases, and it shows: The literature written about SQL databases and relational data modelling, both practical and theoretical, could fill multiple libraries full of books. How to build a relational database for your data is a topic so well-researched it's hard to find a corner case where there isn't a generally accepted by-the-book best practice.
Most NoSQL databases, on the other hand, are still in their infancy. We are still figuring out the best way to use them.
What exactly is it?
On one hand, a specific system, but it has also become a generic word for a variety of new data storage backends that do not follow the relational DB model.
How does it work?
Each of the systems labelled with the generic name works differently, but the basic idea is to offer better scalability and performance by using DB models that don't support all the functionality of a generic RDBMS, but still enough functionality to be useful. In a way it's like MySQL, which at one time lacked support for transactions but, exactly because of that, managed to outperform other DB systems. If you could write your app in a way that didn't require transactions, it was great.
Why would it be better than using a SQL Database? And how much better is it?
It would be better when your site needs to scale so massively that the best RDBMS running on the best hardware you can afford and optimized as much as possible simply can't keep up with the load. How much better it is depends on the specific use case (lots of update activity combined with lots of joins is very hard on "traditional" RDBMSs) - could well be a factor of 1000 in extreme cases.
Is the technology too new to start implementing yet or is it worth taking a look into?
Depends mainly on what you're trying to achieve. It's certainly mature enough to use. But few applications really need to scale that massively. For most, a traditional RDBMS is sufficient. However, with internet usage becoming more ubiquitous all the time, it's quite likely that applications that do will become more common (though probably not dominant).
Since someone said that my previous post was off-topic, I'll try to compensate :-) NoSQL is not, and never was, intended to be a replacement for more mainstream SQL databases, but a couple of words are in order to get things in the right perspective.
At the very heart of the NoSQL philosophy lies the consideration that, possibly for commercial and portability reasons, SQL engines tend to disregard the tremendous power of the UNIX operating system and its derivatives.
With a filesystem-based database, you can take immediate advantage of the ever-increasing capabilities and power of the underlying operating system, which have been steadily increasing for many years now in accordance with Moore's law. With this approach, many operating-system commands become automatically also "database operators" (think of "ls" "sort", "find" and the other countless UNIX shell utilities).
With this in mind, and a bit of creativity, you can indeed devise a filesystem-based database that is able to overcome the limitations of many common SQL engines, at least for specific usage patterns, which is the whole point behind NoSQL's philosophy, the way I see it.
I run hundreds of web sites and they all use NoSQL to a greater or lesser extent. In fact, they do not host huge amounts of data, but even if some of them did I could probably think of a creative use of NoSQL and the filesystem to overcome any bottlenecks. Something that would likely be more difficult with traditional SQL "jails". I urge you to google for "unix", "manis" and "shaffer" to understand what I mean.
If I recall correctly, it refers to types of databases that don't necessarily follow the relational form. Document databases come to mind, databases without a specific structure, and which don't use SQL as a specific query language.
It's generally better suited to web applications that rely on performance of the database, and don't need more advanced features of Relation Database Engines. For example, a Key->Value store providing a simple query by id interface might be 10-100x faster than the corresponding SQL server implementation, with a lower developer maintenance cost.
One example is this paper for an OLTP Tuple Store, which sacrificed transactions for single threaded processing (no concurrency problem because no concurrency allowed), and kept all data in memory; achieving 10-100x better performance as compared to a similar RDBMS driven system. Basically, it's moving away from the 'One Size Fits All' view of SQL and database systems.
In practice, NoSQL is a database system which supports fast access to large binary objects (docs, jpgs etc) using a key based access strategy. This is a departure from the traditional SQL access which is only good enough for alphanumeric values. Not only the internal storage and access strategy but also the syntax and limitations on the display format restricts the traditional SQL. BLOB implementations of traditional relational databases too suffer from these restrictions.
Behind the scene it is an indirect admission of the failure of the SQL model to support any form of OLTP or support for new dataformats. "Support" means not just store but full access capabilities - programmatic and querywise using the standard model.
Relational enthusiasts were quick to modify the defnition of NoSQL from Not-SQL to Not-Only-SQL to keep SQL still in the picture! This is not good especially when we see that most Java programs today resort to ORM mapping of the underlying relational model. A new concept must have a clearcut definition. Else it will end up like SOA.
The basis of the NoSQL systems lies in the random key - value pair. But this is not new. Traditional database systems like IMS and IDMS did support hashed ramdom keys (without making use of any index) and they still do. In fact IDMS already has a keyword NONSQL where they support SQL access to their older network database which they termed as NONSQL.
It's like Jacuzzi: both a brand and a generic name. It's not just a specific technology, but rather a specific type of technology, in this case referring to large-scale (often sparse) "databases" like Google's BigTable or CouchDB.
NoSQL the actual program appears to be a relational database implemented in awk using flat files on the backend. Though they profess, "NoSQL essentially has no arbitrary limits, and can work where other products can't. For example there is no limit on data field size, the number of columns, or file size" , I don't think it is the large scale database of the future.
As Joel says, massively scalable databases like BigTable or HBase, are much more interesting. GQL is the query language associated with BigTable and App Engine. It's largely SQL tweaked to avoid features Google considers bottle-necks (like joins). However, I haven't heard this referred to as "NoSQL" before.
NoSQL is a database system which doesn't use string based SQL queries to fetch data.
Instead you build queries using an API they will provide, for example Amazon DynamoDB is a good example of a NoSQL database.
NoSQL databases are better for large applications where scalability is important.
Does NoSQL mean non-relational database?
Yes, NoSQL is different from RDBMS and OLAP. It uses looser consistency models than traditional relational databases.
Consistency models are used in distributed systems like distributed shared memory systems or distributed data store.
How it works internally?
NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key-value stores). The reduced run-time flexibility compared to full SQL systems is compensated by marked gains in scalability and performance for certain data models.
It can work on Structured and Unstructured Data. It uses Collections instead of Tables
How do you query such "database"?
Watch SQL vs NoSQL: Battle of the Backends; it explains it all.

Pro's of databases like BigTable, SimpleDB

New school datastore paradigms like Google BigTable and Amazon SimpleDB are specifically designed for scalability, among other things. Basically, disallowing joins and denormalization are the ways this is being accomplished.
In this topic, however, the consensus seems to be that joins on large tables don't necessarilly have to be too expensive and denormalization is "overrated" to some extent
Why, then, do these aforementioned systems disallow joins and force everything together in a single table to achieve scalability? Is it the sheer volumes of data that needs to be stored in these systems (many terabytes)?
Do the general rules for databases simply not apply to these scales?
Is it because these database types are tailored specifically towards storing many similar objects?
Or am I missing some bigger picture?
Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc are doing. (Oracle got in the game, finally, as well, with their recent announcement; Microsoft bought their solition in the name of the company that used to be called DataAllegro).
That being said, when the data scales up into terabytes, these issues become very non-trivial. If you don't need the strict transactionality and consistency guarantees you can get from RDBMs, it is often far easier to denormalize and not do joins. Especially if you don't need to cross-reference much. Especially if you are not doing ad-hoc analysis, but require programmatic access with arbitrary transformations.
Denormalization is overrated. Just because that's what happens when you are dealing with a 100 Tera, doesn't mean this fact should be used by every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.
But if you are in the 100 Tera range, by all means...
Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.
On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.
I'm not too familiar with them (I've only read the same blog/news/examples as everyone else) but my take on it is that they chose to sacrifice a lot of the normal relational DB features in the name of scalability - I'll try explain.
Imagine you have 200 rows in your data-table.
In google's datacenter, 50 of these rows are stored on server A, 50 on B, and 100 on server C. Additionally server D contains redundant copies of data from server A and B, and server E contains redundant copies of data on server C.
(In real life I have no idea how many servers would be used, but it's set up to deal with many millions of rows, so I imagine quite a few).
To "select * where name = 'orion'", the infrastructure can fire that query to all the servers, and aggregate the results that come back. This allows them to scale pretty much linearly across as many servers as they like (FYI this is pretty much what mapreduce is)
This however means you need some tradeoffs.
If you needed to do a relational join on some data, where it was spread across say 5 servers, each of those servers would need to pull data from eachother for each row. Try do that when you have 2 million rows spread across 10 servers.
This leads to tradeoff #1 - No joins.
Also, depending on network latency, server load, etc, some of your data may get saved instantly, but some may take a second or 2. Again, when you have dozens of servers, this gets longer and longer, and the normal approach of 'everyone just waits until the slowest guy has finished' no longer becomes acceptable.
This leads to tradeoff #2 - Your data may not always be immediately visible after it's written.
I'm not sure what other tradeoffs there are, but off the top of my head those are the main 2.
So what I'm getting is that the whole "denormalize, no joins" philosophy exists, not because joins themselves don't scale in large systems, but because they're practically impossible to implement in distributed databases.
This seems pretty reasonable when you're storing largely invariant data of a single type (Like Google does). Am I on the right track here?
If you are talking about data that is virtually read-only, the rules change. Denormalisation is hardest in situations where data changes because the work required is increased and there are more problems with locking. If the data barely changes then denormalisation is not so much of a problem.
Novaday You need to find more interoperational environment for databases. More frequently You don't need only an relational DBs, like MySQL or MS SQL but also Big Data farms as Hadoop or non-relational DBs like MongoDB. In some cases all those DBs will be used in one solution so their performance must be as equal as possible in macro scale. It means, that You will not be able to use let say Azure SQL as relational DB and one VM with 2 cores and 3GB of RAM for MongoDB. You must scale-up Your solution and use DB as a Service when it is possible (if it is not possible, then build Your own cluster in a cloud).

Resources