Recently NoSQL has gained immense popularity.
What are the advantages of NoSQL over traditional RDBMS?
Not all data is relational. For those situations, NoSQL can be helpful.
With that said, NoSQL stands for "Not Only SQL". It's not intended to knock SQL or supplant it.
SQL has several very big advantages:
Strong mathematical basis.
Declarative syntax.
A well-known language in Structured Query Language (SQL).
Those haven't gone away.
It's a mistake to think about this as an either/or argument. NoSQL is an alternative that people need to consider when it fits, that's all.
Documents can be stored in non-relational databases, like CouchDB.
Maybe reading this will help.
The history seem to look like this:
Google needs a storage layer for their inverted search index. They figure a traditional RDBMS is not going to cut it. So they implement a NoSQL data store, BigTable on top of their GFS file system. The major part is that thousands of cheap commodity hardware machines provides the speed and the redundancy.
Everyone else realizes what Google just did.
Brewers CAP theorem is proven. All RDBMS systems of use are CA systems. People begin playing with CP and AP systems as well. K/V stores are vastly simpler, so they are the primary vehicle for the research.
Software-as-a-service systems in general do not provide an SQL-like store. Hence, people get more interested in the NoSQL type stores.
I think much of the take-off can be related to this history. Scaling Google took some new ideas at Google and everyone else follows suit because this is the only solution they know to the scaling problem right now. Hence, you are willing to rework everything around the distributed database idea of Google because it is the only way to scale beyond a certain size.
C - Consistency
A - Availability
P - Partition tolerance
K/V - Key/Value
NoSQL is better than RDBMS because of the following reasons/properities of NoSQL
It supports semi-structured data and volatile data
It does not have schema
Read/Write throughput is very high
Horizontal scalability can be achieved easily
Will support Bigdata in volumes of Terra Bytes & Peta Bytes
Provides good support for Analytic tools on top of Bigdata
Can be hosted in cheaper hardware machines
In-memory caching option is available to increase the performance of queries
Faster development life cycles for developers
EDIT:
To answer "why RDBMS cannot scale", please take a look at RDBMS Overheads pdf written by Stavros Harizopoulos,Daniel J. Abadi,Samuel Madden and Michael Stonebraker
RDBMS's have challenges in handling huge data volumes of Terabytes & Peta bytes. Even if you have Redundant Array of Independent/Inexpensive Disks (RAID) & data shredding, it does not scale well for huge volume of data. You require very expensive hardware.
Logging: Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
Locking: Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
Latching: In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
Buffer management: A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
This does not mean that we have to use NoSQL over SQL.
Still, RDBMS is better than NoSQL for the following reasons/properties of RDBMS
Transactions with ACID properties - Atomicity, Consistency, Isolation & Durability
Adherence to Strong Schema of data being written/read
Real time query management ( in case of data size < 10 Tera bytes )
Execution of complex queries involving join & group by clauses
We have to use RDBMS (SQL) and NoSQL (Not only SQL) depending on the business case & requirements
NOSQL has no special advantages over the relational database model. NOSQL does address certain limitations of current SQL DBMSs but it doesn't imply any fundamentally new capabilities over previous data models.
NOSQL means only no SQL (or "not only SQL") but that doesn't mean the same as no relational. A relational database in principle would make a very good NOSQL solution - it's just that none of the current set of NOSQL products uses the relational model.
RDBMS focus more on relationship and NoSQL focus more on storage.
You can consider using NoSQL when your RDBMS reaches bottlenecks. NoSQL makes RDBMS more flexible.
The biggest advantage of NoSQL over RDBMS is Scalability.
NoSQL databases can easily scale-out to many nodes, but for RDBMS it is very hard.
Scalability not only gives you more storage space but also much higher performance since many hosts work at the same time.
If you need to process huge amount of data with high performance
OR
If data model is not predetermined
then
NoSQL database is a better choice.
Just adding to all the information given above
NoSql Advantages:
1) NoSQL is good if you want to be production ready fast due to its support for schema-less and object oriented architecture.
2) NoSql db's are eventually consistent which in simple language means they will not provide any lock on the data(documents) as in case of RDBMS and what does it mean is latest snapshot of data is always available and thus increase the latency of your application.
3) It uses MVCC (Multi view concurrency control) strategy for maintaining and creating snapshot of data(documents).
4) If you want to have indexed data you can create view which will automatically index the data by the view definition you provide.
NoSql Disadvantages:
1) Its definitely not suitable for big heavy transactional applications as it is eventually consistent and does not support ACID properties.
2) Also it creates multiple snapshots (revisions) of your data (documents) as it uses MVCC methodology for concurrency control, as a result of which space get consumed faster than before which makes compaction and hence reindexing more frequent and it will slow down your application response as the data and transaction in your application grows.
To counter that you can horizontally scale the nodes but then again it will be higher cost as compare sql database.
From mongodb.com:
NoSQL databases differ from older, relational technology in four main areas:
Data models: A NoSQL database lets you build an application without having to define the schema first unlike relational databases which make you define your schema before you can add any data to the system. No predefined schema makes NoSQL databases much easier to update as your data and requirements change.
Data structure: Relational databases were built in an era where data was fairly structured and clearly defined by their relationships. NoSQL databases are designed to handle unstructured data (e.g., texts, social media posts, video, email) which makes up much of the data that exists today.
Scaling: It’s much cheaper to scale a NoSQL database than a relational database because you can add capacity by scaling out over cheap, commodity servers. Relational databases, on the other hand, require a single server to host your entire database. To scale, you need to buy a bigger, more expensive server.
Development model: NoSQL databases are open source whereas relational databases typically are closed source with licensing fees baked into the use of their software. With NoSQL, you can get started on a project without any heavy investments in software fees upfront.
Related
I am facing, in these days, the problem of storing some Time Series Data.
This data is taken from an industrial machine: for each job (about 3 per hour, 24/24h), a software records:
oil pressure;
oil temperature;
some vibrational data.
Vibrational data is taken at very high frequency (> 10 kHz), and leads to very massive memory requirements. This issue made my company evaluate some possibilities to efficiently store this data.
Inserts will be not very frequent (maybe 1 or 2 times per day, when the machine is not operative).
Reads will be potentially very frequent (another software will retrieve data for plotting and analyzing purposes).
For now, a single node will be used for storing data, so I don't want (for now) to take into account partitions and parallelization matters.
What solution should I prefer?
A relational DBMS (such as MySQL or PostgreSQL), or a common-purpose NoSQL DB (e.g. a column-oriented one - consider that all Time Series will be univariate -, like Cassandra, or a document-oriented one, like MongoDB)?
Beyond my particular use case, when generally to prefer RDMBS over NoSQL for Time Series storing? When to prefer NoSQL over RDBMS?
tl;dr:
Typically for time series, I would use a time-series database like InfluxDb. Some NoSQL products, like MongoDB, actually combines these features.
Traditionally, NoSQL was used for unstructured high volumes like logging results, website search data etc. However, with professionals becoming more comfortable with document storage and relational capabilities improved over time, NoSQL is often favoured over traditional relational databases.
When to use a Relational database?
Since the modelling of NoSQL is conceptually somewhat different than modelling a relational database, a relational database is typically preferred in application and data migration scenario's where there is already a big relational database present.
Relational databases uses normalisation to optimise storage. This is an artefact from times in which storage was very expensive. When modelling NoSQL, it is common to optimise for read and write speeds and business clarity.
If not dealing with a legacy system, NoSQL like solutions is a modern substitute for RDBMS.
I got two opinions about NoSQL from my friend.
First: Use NoSQL to boost performance and save occasional updated data. Still use sql to save all important dan transaction data.
Second: Don't use NoSQL if you didn't really need it. Use it if you really save big data.
I've used NoSQL and its really fast when selecting data.
I want to know, is first opinion only enough for implementing NoSQL? What all of you think about these?
NOTE: In my case, it still running well with SQL. I want to add NoSQL for improving data reading speed. so it will work alongside.
Is it worth it to use NoSQL this early?
Thanks in advance
It depends on what you are designing.
From my experience scaling out data collection I have found traditional relational storage to be a bottleneck in terms of its inability to scale out over multiple nodes when a databases gets very large. Sure it scales up but this becomes cost prohibitive at some point. In this scenario it would therefore depend on your medium to long term data storage projections. The solution for me was therefore mixture of relational storage for data that may be updated frequently and noSQL (document storage) for data the has a fast rate of growth that is generally not updated post write.
Things to take into account:
Queries
SQL relational storage supports a growing subset languages for queries, as well as a wide range of filters, sorting options, and projections and index queries. NoSQL does all this as well, but SQL can often go beyond it, allowing powerful aggregations of your data as well, beyond what NoSQL can do.
Transactions
Transactions are important because they ensure that you have atomically made changes to your database. Many NoSQL platforms don’t support transactions, so be aware of this feature when you’re figuring out which to use, and what your own needs are.
Consistency
MySQL platforms often use a single master to guarantee strong consistency in your database. These use synchronous replication to ensure you don’t lose important changes queued up to the master. NoSQL, by contrast, does replication of entity groups without a master, so that data is strong within an entity group, and is eventually updated across all groups. The better option depends on the constraints and needs of your database.
Scalability
For years, database administrators relied on scaling up, buying bigger servers as database load increased. However, as transaction rates and demands on the databases continue to expand immensely, emphasis is on scaling out instead. Scaling out is distributing databases across multiple hosts, and that’s something NoSQL does better than standard SQL. They’re designed for optimal use on scaled out databases.
Management
NoSQL databases are generally designed to require less management overall. Repairs are often automatic, and data distribution and simpler data models contribute to less administration required overall. However, you’ve also got less support when there’s a problem. SQL platforms often have vendors waiting to supply support to enterprises.
Schema
Regular SQL platforms often have strictly enforced rules for a schema change, to stave off user-created typos that can put faults in your query. NoSQL platforms will have their own mechanisms for combating this.
Hope that helps.
NoSQL scores over SQL in below areas
It support semi-structured data and volatile data. You can change the structure at any time
It does not have schema
Read/Write through put is very high
Horizontal scalability is easily achieved - Add cheaper hardware and provide right replication factor
Will support Bigdata in volumes of Terra Bytes & Peta Bytes by using cheaper hardware
Good support for Analytic tools on top of Bigdata, especially Hadoop/Hbase family
In memory caching option is available to increase the performance of queries
Faster development life cycles for developers
When you should not use NoSQL and go for SQL
If you require business critical transaction with ACID properties i.e where Consistency is key & Eventual consistency is not an option
If you have heavy aggregation queries spanning multiple entities
In summary, you have to use right technology for right business use case. i.e combination of SQL and NoSQL
Regarding your queries:
Use SQL for business critical transactions. If your SQL is scaling for your business requirements, use SQL.
Use NoSQL for huge volumes of data in magnitudes of Tera/Peta bytes with variety of data , where SQL can't handle that volume & variety.
As others pointed out both SQL and NoSQL (Not only SQL) have their advantages.
There is often temptation to use both side by side and get maximum out of it. Something referred to as Polyglot persistence
Is it a good idea? Sometimes, yes.
Should I do it?
While it may have benefits, the trade off comes with maintenance of multiple stores (note: they would have different ways of database management).
Also the data sync is a bigger one if you are planning this for same transactional system.
If the data you are going to store (in sql and no-sql databases) can be logically separated then you might be ok. But in case they are closely related then you are going to have tough time keeping them consistent.
Overall when i evaluated this option, i came to conclude that it would work only when you can logically partition the data. Another use case may be using Nosql for Analytics and continue with sql for transaction system.
Going back to your use case, did you try JSON storage within your sql database. It may give you benefit of performance without much tradeoffs.
oracle has a good reputation for handling large-scale applications and it's also flexible to extend to cluster environment. Why everyone wants NOSQL?
because nosql db is much cheaper?
why not swith object-oriented db?
Firstly, not everyone does want NoSQL. Packaged software (eg ERP) is all pretty much mainstream RDBMS stuff. Don't confuse the amount of development effort with the usage.
What has happened is that NoSQL has opened up a whole range of applications that simply didn't suit relational technologies, and so there's been a rush of applications that CAN be developed. Most of the stuff that could be developed on RDBMS platforms already has been and is in more of a maintenance/upgrade phase. Probably with less upgrading than usual because of the whole global financial climate.
So in ten/fifteen years, as those NoSQL apps move into the same level of maturity, the frenzy will have died down and there's be less excitement.
Oracle also have NoSQL solution: http://www.oracle.com/technetwork/database/nosqldb/overview/index.html
A SQL solution isn't necessarily what you want regardless of scale.
There are situations where you can't easily predict the schema of your model, or worst still your data is schemaless. In those situations you want a data model which doesn't limit you but rather allows you the flexibilty you need to evolve your data while still maintaining core abilities such as fast indexes.
Another reason is that SQL doesn't represent the natural way in which you want to look at your data, predominantly graph DBs such as Neo4J or GraphDB allow developers or users to approach a linked graph model in a more intuitive way.
Of course there is a way to address all these issues in Oracle RDBMS, but it feels more like hacking the DB to fit your needs as opposed to using a DB that fits you. This sounds like a perk but it actually goes a long way into the ease of development and analysis of your application.
Now if we are talking about scale, Oracle can probably beat column based DBs such as HBase or Hypertable, but it is important to note that Oracle RDBMS isn't just more expensive it's way way more expensive. In today's world even small time startups have Terabytes of data they need to analyze daily. Even small companies can use clusters of 100 machines in the cloud to store their data, in such a company Oracle isn't a viable option the annual licensing cost and the hiring of DBAs will prevent startups from using it.
Finally the last reason why you would start off with NoSQL is speed, bringing up a MongoDB and starting development can be done in 5 minutes and sometimes you wanna handle problems as they come up and avoid premature optimization
If you are willing to let go consistency, there are no theorical limits to how mouch you can scale out some NoSQL solutions.
Some RDBMS can scale a lot, and Oracle is among the best of them, but no RDBMS let you cut in consistency, and therefor even the best has a pretty clear theorical limit to how much it can scale, not to mention the real world limits.
Some big names on the net just can't rely only on RDBMS anymore, and many others follows just to be like the big ones. Finally, some solutions are realy best fitted on a scheam-less structure, but those don't account for the most part of NoSQL users I would guess. The main point is to scale the web 2.0.
This is a very general question you are asking. Are you comparing the relational database to NOSQL database, or are you comparing commercial or open source database? We need to figure out as it seems we are comparing apples to oranges, and you will not get a straight answer.
Here are the breakdown from my perspective.
DB type: If you are comparing relational db vs NOSQL db, you should refer to this link
instead.
Cost: If you are comparing from a cost perspective, each has its own cost. Oracle will charge for the license, while the NOSQL db (using MongoDB as an example) are open source, and you don't have to pay the license. But you will need someone to have a good understanding of NOSQL to administer and maintain the site, which is a lot harder to find.
Application: What kind of application are you writing? You need to understand the database requirement for your application. If you need to store a lot of unstructured data as a key value entry, you are better off using NOSQL. On the other side of coin, you would prefer SQL db if you will join a lot of tables and execute complex SQL.
You mentioned Object Oriented DB, and that is another type of database from NoSQL or relational DB. At the end, it depends on your need. To expand the choice of horizon, some might heard of hierarchical database, which mainly live in mainframe environment. :-)
We have
BigTable from Google,
Hadoop, actively contributed by Yahoo,
Dynamo from Amazon
all aiming towards one common goal - making data management as scalable as possible.
By scalability what I understand is that the cost of the usage should not go up drastically when the size of data increases.
RDBMS's are slow when the amount of data is large as the number of indirections invariable increases leading to more IO's.
How do these custom scalable friendly data management systems solve the problem?
This is a figure from this document explaining Google BigTable:
Looks the same to me. How is the ultra-scalability achieved?
The "traditional" SQL DBMS market really means a very small number of products, which have traditionally targeted business applications in a corporate setting. Massive shared-nothing scalability has not historically been a priority for those products or their customers. So it is natural that alternative products have emerged to support internet scale database applications.
This has nothing to do with the fact that these new products are not "Relational" DBMSs. The relational model can scale just as well as any other model. Arguably the relational model suits these types of massively scalable applications better than say, network (graph based) models. It's just that the SQL language has a lot of disadvantages and no-one has yet come up with suitable relational NOSQL (non-SQL) alternatives.
Speaking specifically to your question about Bigtable, the difference is that the heirarchy in the diagram above is all there is. Each Bigtable tabletserver is responsible for a set of tablets (contiguous row ranges from a table); the mapping from row range to tablet is maintained in the metadata table, while the mapping from tablet to tabletserver is maintained in the memory of the Bigtable master. Looking up a row, or range of rows, requires looking up the metadata entry (which will almost certainly be in memory on the server that hosts it), then using that to look up the actual row on the server responsible for it - resulting in only one, or a few disk seeks.
In a nutshell, the reason this scales well is because it's possible to throw more hardware at it: given enough resources, the metadata is always in memory, and thus there's no need to go to disk for it, only for the data (and not always for that, either!).
It's about using cheap comodity hardware to build a network/grid/cloud and spread the data and load (for example using map/reduce).
RDBMS databases seem to me like software being (originaly) designed to run on one supercomputer. You can use various hard drive arrays, DB clusters, but still..
The amount of data increased so there's one more reason to design new data storages with this in mind - scalability, high availability, terabytes of data.
Another thing - if you build a grid/cloud from cheap servers, it's fault tolerant because you store all data at three (?) different locations and at the same time it's cheap.
Back to your pictures - the first one is from one computer (typically), the second one from a network of computers.
One theoretical answer on scalability is at http://queue.acm.org/detail.cfm?id=1394128 - the ACID guarantees are expensive. See http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf for a counter-argument.
In fact just surviving power failures is expensive. Years ago now I compared MySQL against Oracle. MySQL was almost unbelieveably faster than Oracle, but we couldn't use it. MySQL of those days was built on top of Berkeley
DB, which was miles faster than Oracle's full blown log-based database, but if the power went off while Berkely DB based MySQL was running, it was a manual process to get the database consistent again when the power went back on, and you'ld probably lose recent updates for good.
I am trying to decide whether to use voldemort or couchdb for an upcoming healthcare project. I want a storage system that has high availability , fault tolerance, and can scale for the massive amounts of data being thrown at it.
What is the pros/cons of each?
Thanks
Project Voldemort looks nice, but I haven't looked deeply into it so far.
In it current state CouchDB might not be the right thing for "massive amounts of data". Distributing data between nodes and routing queries accordingly is on the roadmap but not implemented so far. The biggest known production setups of CouchDB use "tables" ("databases" in couch-speak) of about 200G.
HA is not natively supported by CouchDB but can build easily: All CouchDB nodes are replicating the database nodes between each other in a multi-master setup. We put two Varnish proxies in front of the CouchDB machines and the Varnish boxes are made redundant with CARP. CouchDBs "build from the Web" design makes such things very easy.
The most pressing issue in our setup is the fact that there are still issues with the replication of large (multi MB) attachments to CouchDB documents.
I suggest you also check the traditional RDBMS route. There are huge issues with available talent outside the RDBMS approach and there are very capable offerings available from Oracle & Co.
Not knowing enough from your question, I would nevertheless say Project Voldemort or distributed hash tables (DHTs) like CouchDB in general are a solution to your problem of HA.
Those DHTs are very nice for high availability but harder to write code for than traditional relational databases (RDBMS) concerning consistency.
They are quite good to store document type information, which may fit nicely with your healthcare project but make development harder for data.
The biggest limitation of most stores is that they are not transactionally safe (See Scalaris for an transactionally safe store) and you need to ensure data consistency by yourself - most use read time consistency by merging conflicting data). RDBMS are much easier to use for consistency of data (ACID)
Joining data is much harder too. In RDBMs you can easily query data over several tables, you need to write code in CouchDB to aggregate data. For other stores Hadoop may be a good choice for aggregating information.
Read about BASE and the CAP theorem on consistency vs. availability.
See
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
http://queue.acm.org/detail.cfm?id=1394128
Is memcacheDB an option? I've heard that's how Digg handled HA issues.