Scaling in NoSQL vs RDBMS? - database

I am trying to understand the architectural difference between NoSQL and relational databases in the context of scalability.
My understanding of (horizontal) scalability is that as our data grows, we add more and more servers to split the load evenly.
In key-value NoSQL databases, we can add a new machine and split the keys across it. However, all of the examples I have seen so far to explain eventual consistency in NoSQL databases use a master-slave configuration where data is replicated across all the slaves instead of being split across various machines to achieve scalability.
My question is: doesn't replicating your whole data defeat the whole point of scalability in NoSQL databases? The same can be done in an RDBMS as well, with one master (for writes) and slaves (for reads), so how is NoSQL more scalable in this regard?

The goal of scalability is to increase the overall capacity of a given application, and it can be vertical (bigger machines) or horizontal (adding more machines). With horizontal scaling you can add more machines, but as the number of machines increases, so does the probability of a node failure in the cluster, which is something to keep in mind.
When you add more nodes, you can either split the data, which is called sharding, or duplicate the data, which is called replication.
Replication
With replication, the usual architecture is master-slave, where you can only write to the master, which replicates the data to the slaves. This means you cannot use replication to split the writes across the cluster, but it is possible to split the reads, depending on the consistency level (not all NoSQL technologies provide the same levels) and the cluster configuration.
Sharding
Sharding is better suited to provide scaling, as you split the data set into multiple parts of similar size where possible. This clearly allows reads and writes to be split across different nodes. In order to make it work, some mechanisms need to be in place:
routing: to locate the shard where the data is stored, or to decide which shard to write to
balancing: to keep the different fragments of the dataset at a similar size over time.
These mechanisms are usually provided by the database vendor, so there is no need to implement them yourself, but it is still necessary to understand them in order to manage the cluster.
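For illustration, here is a minimal sketch of the routing mechanism in Python. The shard names and the choice of MD5 are arbitrary assumptions; real databases ship this logic built in.

```python
# Minimal hash-based routing sketch: map each key to one shard.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical node names

def route(key: str) -> str:
    """Return the shard responsible for storing this key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(route("user:42"))  # every client computes the same shard
print(route("user:43"))  # keys spread roughly evenly across shards
```

Note that this naive modulo scheme reshuffles almost every key whenever the shard list changes, which is exactly why the balancing mechanism matters (a consistent-hashing variant is sketched further down this page).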
The problem is, as I mentioned at first, that the more nodes a cluster has, the higher the chance of a failure on a given node. If a node holding part of the data set goes offline, that part of the data won't be available, which is not a desirable scenario. But luckily, sharding and replication are not exclusive: it is possible to build a sharded cluster where each shard is itself a cluster with replication in place.
But to answer your questions:
doesn't replicating your whole data defeat the whole point of scalability in NoSQL databases?
In master-slave architectures, you cannot split the writes, but you can split the reads, which is somewhat of a way to scale, although the main purpose is high availability.
Anyway, there are emerging databases that provide a multi-master architecture, where all the nodes act as masters, meaning all of them can receive both writes and reads.
The same can be done in an RDBMS as well, with one master (for writes) and slaves (for reads), so how is NoSQL more scalable in this regard?
In a single-node environment, NoSQL is already faster than an RDBMS when JOINs are involved, as they are expensive operations, or when there are a lot of integrity checks involved.
So when you try to shard the dataset in an RDBMS, unless it is really carefully designed, the most likely scenario is that the desired data ends up located in different shards. This means that JOINs and integrity checks need to be performed across different nodes, making them even more expensive operations than they already are.
This means that RDBMS databases use mechanisms that act as constraints when you intend to scale horizontally, which NoSQL doesn't. Yes, you can still scale an RDBMS horizontally, but overall it will be more expensive than using NoSQL databases.
Update: special mention for graph databases
Sharding in graph databases is really difficult since, mathematically, the problem of partitioning a large graph between different servers is NP-complete. Also, when data has to be queried across different shards, one of the main features of graphs is lost: fast traversals.
I've seen two main approaches that graph databases follow to scale horizontally:
1) Let the application/developer decide how to partition the graph; you can imagine how complex this can be.
2) Replicate the whole graph on all the nodes and use cache sharding, which means that every node holds the entire data set, but each node keeps in memory the part of the graph that is most queried on that node in particular.
I guess that in the future, graph database companies will develop more solutions to address this issue.
Related to your question: at their current state, graph databases can still outscale RDBMSs when it comes to horizontal scaling, due to the lack of the RDBMS constraints, but it's hard to compare between the different NoSQL database types.

The master-slave configuration in NoSQL databases is for high-availability purposes and data consistency; it should not be confused with the purpose of scalability, which is to load-balance the workload.

In NoSQL, as far as splitting the keys is concerned, only the master copies matter. Slaves are for HA and availability in general. This replication, in fact, is responsible for the eventual consistency: you will get the data right away; maybe it is not the most up-to-date version, but eventually you will have the updated data.
On the other hand, an RDBMS will have slower data access/modification because it has to follow the ACID properties, mostly with strong consistency.
Replication as such is not the differentiating factor between NoSQL and RDBMS; adherence or non-adherence to the ACID properties is. Nor does scalability mean the absence of extra copies. Hope that answers it.

To answer your question, replicating your data does not defeat the point of scalability.
Scalability roughly refers to the ability to grow a database, and it is not necessarily tied to having more copies of the database.
More servers with the database information on them allow more users more readily available access to it, as the other answers have stated.
I believe this might be a confusion between scalability and availability?

If we just consider key-value databases vs SQL databases, then the former are more suitable for scalability than the latter.
This is because a key-value store doesn't have transactions, so your only guarantee is that you can atomically change the one value for the one key. This results in ease of scaling.
For example, you just hash a key and store the key-value pair on the machine corresponding to the hash of the key.
You can't do the same with a SQL database without losing the ACID (atomicity, consistency, isolation, durability) properties of a transaction. Moreover, you can't even easily perform a join SELECT if you store different tables, or different parts of a table, on different machines.
So in general, NoSQL databases are better prepared for sharding across many machines than SQL ones.
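To make the join problem concrete, here is a minimal sketch of what a join turns into once related tables live on different machines; the two dicts below stand in for remote shards reached over the network.

```python
# Simulated shards: users live on machine A, orders on machine B.
USERS_SHARD_A = {1: {"id": 1, "name": "alice"}}
ORDERS_SHARD_B = {1: [{"order_id": 7, "total": 30}]}

def user_with_orders(user_id):
    user = dict(USERS_SHARD_A[user_id])               # round-trip to machine A
    user["orders"] = ORDERS_SHARD_B.get(user_id, [])  # round-trip to machine B
    return user  # the "join" happens here, in application code

print(user_with_orders(1))
```

Two network round-trips and hand-written merge logic replace a single JOIN, and nothing guarantees the two shards are seen in a consistent state.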


Opinion: Reason to use NoSQL

I got two opinions about NoSQL from my friend.
First: Use NoSQL to boost performance and to store occasionally updated data. Still use SQL to store all important and transactional data.
Second: Don't use NoSQL if you don't really need it. Use it only if you really store big data.
I've used NoSQL and it's really fast when selecting data.
I want to know: is the first opinion alone enough reason to implement NoSQL? What do all of you think about these?
NOTE: In my case, it still runs well with SQL. I want to add NoSQL to improve data-reading speed, so it will work alongside SQL.
Is it worth it to use NoSQL this early?
Thanks in advance
It depends on what you are designing.
From my experience scaling out data collection, I have found traditional relational storage to be a bottleneck in terms of its inability to scale out over multiple nodes once a database gets very large. Sure, it scales up, but this becomes cost-prohibitive at some point. In this scenario it therefore depends on your medium- to long-term data storage projections. The solution for me was a mixture of relational storage for data that may be updated frequently and NoSQL (document storage) for data that has a fast rate of growth and is generally not updated post-write.
Things to take into account:
Queries
SQL relational storage supports a growing set of query languages, as well as a wide range of filters, sorting options, projections, and index queries. NoSQL does all this as well, but SQL can often go beyond it, allowing powerful aggregations of your data beyond what NoSQL can do.
Transactions
Transactions are important because they ensure that changes to your database are made atomically. Many NoSQL platforms don't support transactions, so be aware of this feature when you're figuring out which to use and what your own needs are.
Consistency
MySQL platforms often use a single master to guarantee strong consistency in your database. These use synchronous replication to ensure you don't lose important changes queued up for the master. NoSQL, by contrast, replicates entity groups without a master, so that data is strongly consistent within an entity group and is eventually updated across all groups. The better option depends on the constraints and needs of your database.
Scalability
For years, database administrators relied on scaling up, buying bigger servers as database load increased. However, as transaction rates and demands on the databases continue to expand immensely, the emphasis is now on scaling out instead. Scaling out means distributing a database across multiple hosts, and that's something NoSQL does better than standard SQL: NoSQL systems are designed for optimal use on scaled-out databases.
Management
NoSQL databases are generally designed to require less management overall. Repairs are often automatic, and data distribution and simpler data models contribute to less administration required overall. However, you’ve also got less support when there’s a problem. SQL platforms often have vendors waiting to supply support to enterprises.
Schema
Regular SQL platforms often have strictly enforced rules for a schema change, to stave off user-created typos that can put faults in your query. NoSQL platforms will have their own mechanisms for combating this.
Hope that helps.
NoSQL scores over SQL in the areas below:
It supports semi-structured data and volatile data; you can change the structure at any time
It does not have a schema
Read/write throughput is very high
Horizontal scalability is easily achieved: add cheaper hardware and set the right replication factor
It will support big data in volumes of terabytes & petabytes using cheaper hardware
Good support for analytic tools on top of big data, especially the Hadoop/HBase family
An in-memory caching option is available to increase query performance
Faster development life cycles for developers
When you should not use NoSQL and go for SQL:
If you require business-critical transactions with ACID properties, i.e. where consistency is key and eventual consistency is not an option
If you have heavy aggregation queries spanning multiple entities
In summary, you have to use the right technology for the right business use case, i.e. a combination of SQL and NoSQL.
Regarding your queries:
Use SQL for business-critical transactions. If your SQL database is scaling to your business requirements, keep using SQL.
Use NoSQL for huge volumes of data, in magnitudes of tera/petabytes and with a variety of data, where SQL can't handle that volume and variety.
As others have pointed out, both SQL and NoSQL (Not only SQL) have their advantages.
There is often a temptation to use both side by side and get the maximum out of each, something referred to as polyglot persistence.
Is it a good idea? Sometimes, yes.
Should I do it?
While it may have benefits, the trade-off comes with maintaining multiple stores (note: they will have different ways of database management).
Data synchronization is also a bigger issue if you are planning this for the same transactional system.
If the data you are going to store (in the SQL and NoSQL databases) can be logically separated, then you might be OK. But if the data sets are closely related, you are going to have a tough time keeping them consistent.
Overall, when I evaluated this option, I concluded that it would work only when you can logically partition the data. Another use case may be using NoSQL for analytics and continuing with SQL for the transactional system.
Going back to your use case: did you try JSON storage within your SQL database? It may give you the performance benefit without many trade-offs.
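As a hedged illustration of that last suggestion, the sketch below stores and queries JSON inside a SQL database using Python's built-in sqlite3 module; it assumes an SQLite build with the JSON functions enabled (MySQL and PostgreSQL offer comparable JSON column support).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute(
    "INSERT INTO profiles (doc) VALUES (?)",
    (json.dumps({"name": "alice", "tags": ["admin", "beta"]}),),
)

# Query inside the document without a separate document store
# (json_extract requires SQLite's JSON support, an assumption here).
row = conn.execute(
    "SELECT json_extract(doc, '$.name') FROM profiles").fetchone()
print(row[0])  # -> alice
```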

Scalable database technology and architecture

I've been trying to learn more about database scaling in a distributed system, and I am stuck in between RDBMS and NoSQL.
Some articles online suggest that NoSQL is the solution to modern Big Data. Others say NoSQL is just hype and that an RDBMS can be just as scalable with good design, while also providing good data structure.
Instead of reading others' opinions, I'd love to judge the two myself, but I do not understand exactly what is required for a scalable RDBMS and a scalable NoSQL.
I've done a bit more reading on RDBMS, and it seems that the solution requires leveraging memcache and sharding to reduce database size and the number of DB queries. Are there other tricks? Can you still use tables with many columns? Or should you use fewer columns and more joins?
As for NoSQL, I've read a little about MongoDB. I understand that it encourages data aggregation. But how does that make it more scalable? I'm also starting to learn Cassandra because I read that it scales much better than MongoDB, but I have no idea how it is more scalable.
I would very much appreciate a basic (or advanced, if you have the patience to type it out) condensed and down-to-the-core explanation on scaling RDBMS and NoSQL, or good articles online or books that explain the topic. :)
I won't cover ways you can scale by implementing things on your own, such as putting a memcached server in between; I'll just cover what comes right out of the box.
Let's start first with RDBMS:
I think setting up an RDBMS cluster is more complicated than a NoSQL cluster, but that's just my opinion. Usually what you have is one master and multiple slaves. You have to send all your writes to the master and can read from any slave you want. Since you have an RDBMS with ACID, the system should somehow guarantee that you won't read old data. The assumption here is that your application writes once and reads often (as is usually the case); for those purposes, one server for reads/writes and multiple servers for reads is great. The problem arises when your writes are so frequent that the one machine can't keep up with them anymore: that is your bottleneck. In addition to the built-in solutions from Oracle, for instance - which are huge - there is also http://www.scalearc.com/, which can cache queries and handle the scaling for you.
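Purely to illustrate the read/write split just described, here is a minimal application-level sketch; the psycopg2 driver, the DSNs, and the table are all assumptions, the routing pattern is the point.

```python
import random
import psycopg2  # assumed PostgreSQL driver; any RDBMS client works alike

MASTER_DSN = "host=db-master dbname=app"        # hypothetical
REPLICA_DSNS = ["host=db-replica1 dbname=app",  # hypothetical
                "host=db-replica2 dbname=app"]

def get_connection(readonly: bool):
    """Writes must go to the master; reads can fan out to any replica."""
    if readonly:
        return psycopg2.connect(random.choice(REPLICA_DSNS))
    return psycopg2.connect(MASTER_DSN)

# All writes hit the single master (the potential bottleneck) ...
with get_connection(readonly=False) as conn:
    conn.cursor().execute("INSERT INTO events (kind) VALUES (%s)", ("signup",))

# ... while reads are load-balanced across replicas, which may lag
# slightly behind the master (replication delay).
with get_connection(readonly=True) as conn:
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM events")
    print(cur.fetchone())
```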
NoSQL:
There is no single NoSQL scheme implemented by all the DBs; every system is a bit different. MongoDB, for instance, is quite similar to an RDBMS: it also has one master and several slaves to which it can replicate data, but additionally you can create shards. Data is split between shards and replicated to slaves, so you can have multiple different masters that are responsible for smaller parts. Afterwards, when you read, you can choose whether to read from the master or from the slaves, depending on how urgently you need the latest data.
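As a small illustration, PyMongo exposes exactly this choice as a per-collection read preference; the connection string and names below are hypothetical.

```python
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")
db = client.get_database("app")

# Always read the latest acknowledged data from the primary (master).
users_latest = db.get_collection(
    "users", read_preference=ReadPreference.PRIMARY)

# Prefer a secondary (slave): less load on the primary, data may lag.
users_fast = db.get_collection(
    "users", read_preference=ReadPreference.SECONDARY_PREFERRED)

doc = users_fast.find_one({"name": "alice"})
```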
Cassandra, on the other hand, works totally differently. I'm not sure whether you can write to multiple servers or how it works exactly, but basically the servers keep a log of all the writes. So even if they can't process the writes immediately, the writes are stored in a log to still give you a fast response. Afterwards, when you read, you can again say how urgently you want the new data, and if you really want the very latest data, Cassandra will need to check the log for any updates that were written, which will cost you a lot of time.
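For a concrete illustration, the DataStax Python driver exposes this "how fresh must the data be" choice as a per-query consistency level; the contact points, keyspace, and table below are hypothetical.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2"])  # hypothetical contact points
session = cluster.connect("shop")            # hypothetical keyspace

# ONE: fastest response, may return stale data if a replica lags.
fast_read = SimpleStatement(
    "SELECT price FROM products WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)

# QUORUM: a majority of replicas must answer, so the read reflects the
# latest acknowledged write at the cost of extra latency.
fresh_read = SimpleStatement(
    "SELECT price FROM products WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)

row = session.execute(fresh_read, ("sku-42",)).one()
```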
Key-value stores like Elasticsearch, CouchDB, and Couchbase work differently again. Here the key of the item is hashed and, based on the hash, the item is sent to the node that will be responsible for it. This way, when you read after the key was written, you get up-to-date information, because you'll read from the same node. The idea of this design is that no single key will be of everyone's interest; instead the load will be distributed. These are also the DBs which I think scale the best and make it the easiest to add more servers to the cluster, but you lose the power of complex queries like you have in MongoDB and Cassandra - and of course in an RDBMS. Elasticsearch has some simple search queries, and CouchDB and Couchbase have only views, which are produced by MapReduce, where you can get the data you want if it fits the view. Otherwise you can only access it by the key.
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - is a very comprehensive summary of the most common NoSQL DBs, what are their strengths and weaknesses, and the most common usage scenarios.
In the end, the question is also: why do you want to scale? How many records are you going to have in the database? A few million is not a problem at all. A few hundred million is also not a problem for most RDBMSs on a powerful enough server. And if you design the DB and its indices properly, even a billion records per year should still be fine.

Practical example for each type of database (real cases) [closed]

There are several types of databases for different purposes; however, MySQL is normally used for everything, because it is the best-known database. Just to give an example from my company: a big-data application had a MySQL database in its initial stage, which is unbelievable and will bring serious consequences to the company. Why MySQL? Just because no one knew how (and when) another DBMS should be used.
So, my question is not about vendors but about types of databases. Can you give me a practical example of specific situations (or apps) for each type of database where it is highly recommended to use it?
Example:
• A social network should use the type X because of Y.
• MongoDB or CouchDB can't support transactions, so a document DB is not good for an app for a bank or an auction site.
And so on...
Relational: MySQL, PostgreSQL, SQLite, Firebird, MariaDB, Oracle DB, SQL Server, IBM DB2, IBM Informix, Teradata
Object: ZODB, DB4O, Eloquera, Versant, Objectivity/DB, VelocityDB
Graph databases: AllegroGraph, Neo4j, OrientDB, InfiniteGraph, GraphBase, SparkleDB, FlockDB, BrightstarDB
Key-value stores: Amazon DynamoDB, Redis, Riak, Voldemort, FoundationDB, LevelDB, BangDB, KAI, hamsterdb, Tarantool, Maxtable, HyperDex, Genomu, MemcacheDB
Column family: Bigtable, HBase, Hypertable, Cassandra, Apache Accumulo
RDF stores: Apache Jena, Sesame
Multimodel databases: ArangoDB, Datomic, OrientDB, FatDB, AlchemyDB
Document: MongoDB, CouchDB, RethinkDB, RavenDB, Terrastore, JasDB, RaptorDB, djondb, EJDB, DensoDB, Couchbase
XML databases: BaseX, Sedna, eXist
Hierarchical: InterSystems Caché, GT.M (thanks to @Laurent Parenteau)
I found two impressive articles about this subject. All credits to highscalability.com. The information in this answer is transcribed from these articles:
35+ Use Cases For Choosing Your Next NoSQL Database
What The Heck Are You Actually Using NoSQL For?
If Your Application Needs...
• complex transactions because you can't afford to lose data or if you would like a simple transaction programming model then look at a Relational or Grid database.
• Example: an inventory system that might want full ACID. I was very unhappy when I bought a product and they said later they were out of stock. I did not want a compensated transaction. I wanted my item!
• to scale then NoSQL or SQL can work. Look for systems that support scale-out, partitioning, live addition and removal of machines, load balancing, automatic sharding and rebalancing, and fault tolerance.
• to always be able to write to a database because you need high availability then look at Bigtable Clones which feature eventual consistency.
• to handle lots of small continuous reads and writes, that may be volatile, then look at Document or Key-value or databases offering fast in-memory access. Also, consider SSD.
• to implement social network operations then you first may want a Graph database or second, a database like Riak that supports relationships. An in-memory relational database with simple SQL joins might suffice for small data sets. Redis' set and list operations could work too.
• to operate over a wide variety of access patterns and data types then look at a Document database, they generally are flexible and perform well.
• powerful offline reporting with large datasets then look first at Hadoop and second at products that support MapReduce. Supporting MapReduce isn't the same as being good at it.
• to span multiple data-centers then look at Bigtable Clones and other products that offer a distributed option that can handle the long latencies and are partition tolerant.
• to build CRUD apps then look at a Document database, they make it easy to access complex data without joins.
• built-in search then look at Riak.
• to operate on data structures like lists, sets, queues, publish-subscribe then look at Redis. Useful for distributed locking, capped logs, and a lot more.
• programmer friendliness in the form of programmer-friendly data types like JSON, HTTP, REST, Javascript then first look at Document databases and then Key-value Databases.
• transactions combined with materialized views for real-time data feeds then look at VoltDB. Great for data-rollups and time windowing.
• enterprise-level support and SLAs then look for a product that makes a point of catering to that market. Membase is an example.
• to log continuous streams of data that may have no consistency guarantees necessary at all then look at Bigtable Clones because they generally work on distributed file systems that can handle a lot of writes.
• to be as simple as possible to operate then look for a hosted or PaaS solution because they will do all the work for you.
• to be sold to enterprise customers then consider a Relational Database because they are used to relational technology.
• to dynamically build relationships between objects that have dynamic properties then consider a Graph Database because often they will not require a schema and models can be built incrementally through programming.
• to support large media then look at storage services like S3. NoSQL systems tend not to handle large BLOBs, though MongoDB has a file service.
• to bulk upload lots of data quickly and efficiently then look for a product that supports that scenario. Most will not because they don't support bulk operations.
• an easier upgrade path then use a fluid schema system like a Document Database or a Key-value Database because it supports optional fields, adding fields, and field deletions without the need to build an entire schema migration framework.
• to implement integrity constraints then pick a database that supports SQL DDL, implement them in stored procedures, or implement them in application code.
• a very deep join depth then use a Graph Database because they support blisteringly fast navigation between entities.
• to move behavior close to the data so the data doesn't have to be moved over the network then look at stored procedures of one kind or another. These can be found in Relational, Grid, Document, and even Key-value databases.
• to cache or store BLOB data then look at a Key-value store. Caching can be for bits of web pages, or for saving complex objects that were expensive to join in a relational database, to reduce latency, and so on.
• a proven track record like not corrupting data and just generally working then pick an established product and when you hit scaling (or other issues) use one of the common workarounds (scale-up, tuning, memcached, sharding, denormalization, etc).
• fluid data types because your data isn't tabular in nature, or requires a flexible number of columns, or has a complex structure, or varies by user (or whatever), then look at Document, Key-value, and Bigtable Clone databases. Each has a lot of flexibility in their data types.
• other business units to run quick relational queries so you don't have to reimplement everything then use a database that supports SQL.
• to operate in the cloud and automatically take full advantage of cloud features then we may not be there yet.
• support for secondary indexes so you can look up data by different keys then look at relational databases and Cassandra's new secondary index support.
• to create an ever-growing set of data (really BigData) that rarely gets accessed then look at a Bigtable Clone, which will spread the data over a distributed file system.
• to integrate with other services then check if the database provides some sort of write-behind syncing feature so you can capture database changes and feed them into other systems to ensure consistency.
• fault tolerance then check how durable writes are in the face of power failures, partitions, and other failure scenarios.
• to push the technological envelope in a direction nobody seems to be going then build it yourself because that's what it takes to be great sometimes.
• to work on a mobile platform then look at CouchDB or Couchbase Mobile.
General Use Cases (NoSQL)
• Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space.
• Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month (in 2010). Twitter, for example, has the problem of storing 7 TB of data per day (in 2010), with the prospect of this requirement doubling multiple times per year. This is the "data too big to fit on one node" problem. At 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes, in-memory systems can be used.
• Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mindset. When latency is important it's hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access; some are more about reliability, for example. But what people have wanted for a long time was a better memcached, and many NoSQL systems offer that.
• Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly compatible datatypes like JSON.
• Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.
• Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.
• Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs as special code doesn't have to be written to scale a system that was never intended to be used that way.
• No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.
• Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.
• Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer-led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc., programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue. If it costs a lot to scale a product then won't you go with the cheaper product, that you control, that's easier to use, and that's easier to scale?
• Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and solution.
• Avoid hitting the wall. Many projects hit some type of wall in their project. They've exhausted all options to make their system scale or perform properly and are wondering what next? It's comforting to select a product and an approach that can jump over the wall by linearly scaling using incrementally added resources. At one time this wasn't possible. It took custom built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.
• Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span datacenters while handling failure scenarios without a hiccup. NoSQL systems, because they have focused on scale, tend to exploit partitions, tend not to use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.
• Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency which means they can't tolerate a partition failure. In the end, this is a business decision and should be decided on a case by case basis. Does your app even care about consistency? Are a few drops OK? Does your app need strong or weak consistency? Is availability more important or is consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice.
More Specific Use Cases
• Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
• Syncing online and offline data. This is a niche CouchDB has targeted.
• Fast response times under all loads.
• Avoiding heavy joins for when the query load for complex joins becomes too large for an RDBMS.
• Soft real-time systems where low latency is critical. Games are one example.
• Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads/50% writes, 95% writes, or 95% reads. Read-only applications needing extreme speed and resiliency, with simple queries, that can tolerate slightly stale data. Applications requiring moderate performance, read/write access, simple queries, and completely authoritative data. Read-only applications with complex query requirements.
• Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.
• Real-time inserts, updates, and queries.
• Hierarchical data like threaded discussions and parts explosion.
• Dynamic table creation.
• Two-tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
• Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
• Slicing off part of service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.
• Caching. A high performance caching tier for websites and other applications. Example is a cache for the Data Aggregation System used by the Large Hadron Collider.
• Voting.
• Real-time page view counters.
• User registration, profile, and session data.
• Document, catalog management, and content management systems. These are facilitated by the ability to store complex documents as a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
• Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time.
• Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.
• Working with heterogeneous types of data, for example, different media types at a generic level.
• Embedded systems. They don’t want the overhead of SQL and servers, so they use something simpler for storage.
• A "market" game, where you own buildings in a town. You want the building list of someone to pop up quickly, so you partition on the owner column of the building table, so that the select is single-partitioned. But when someone buys the building of someone else you update the owner column along with price.
• JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3. (source)
• Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations.
• Fraud detection by comparing transactions to known patterns in real-time.
• Helping diagnose the typology of tumors by integrating the history of every patient.
• In-memory database for high-update situations, like a website that displays everyone's "last active" time (for chat, maybe). If users are performing some activity once every 30 sec, then you will pretty much be at your limit with about 5000 simultaneous users.
• Handling lower-frequency multi-partition queries using materialized views while continuing to process high-frequency streaming data.
• Priority queues.
• Running calculations on cached data, using a program friendly interface, without having to go through an ORM.
• De-duplicating ("uniq-ing") a large dataset using simple key-value columns.
• To keep querying fast, values can be rolled-up into different time slices.
• Computing the intersection of two massive sets, where a join would be too slow.
• A timeline ala Twitter.
Redis use cases, VoltDB use cases, and more can be found here.
This question is almost impossible to answer because of its generality. I think you are looking for some sort of easy "problem = solution" answer. The problem is that each "problem" becomes more and more unique as it becomes a business.
What do you call a social network? Twitter? Facebook? LinkedIn? Stack Overflow? They all use different solutions for different parts, and many solutions can exist that use a polyglot approach. Twitter has a graph-like concept, but there are only 1-degree connections: followers and following. LinkedIn, on the other hand, thrives on showing how people are connected beyond the first degree. These are two different processing and data needs, but both are "social networks".
If you have a "social network" but don't do any discovery mechanisms, then you can most likely use any basic key-value store. If you need high performance, horizontal scale, and secondary indexes or full-text search, you could use Couchbase.
If you are doing machine learning on top of the log data you are gathering, you can integrate Hadoop with Hive or Pig, or Spark/Shark. Or you can do a lambda architecture and use many different systems with Storm.
If you are doing discovery via graph-like queries that go beyond 2nd-degree vertexes and also filter on edge properties, you will likely consider graph databases on top of your primary store. However, graph databases aren't good choices for a session store or as general-purpose stores, so you will need a polyglot solution to be efficient.
What is the data velocity? The scale? How do you want to manage it? What expertise do you have available in the company or startup? There are a number of reasons why this is not a simple question to answer.
A short, useful read specific to database selection: How to choose a NoSQL Database?. I will highlight key points in this answer.
Key-Value vs Document-oriented
Key-value stores
If you have a clear data structure defined such that all the data has exactly one key, go for a key-value store. It's like having a big hashtable, and people mostly use it for cache stores or clearly key-based data. However, things start getting a little nasty when you need to query the same data on the basis of multiple keys!
Some key-value stores are: memcached, Redis, Aerospike.
Two important things about designing your data model around key-value store are:
You need to know all your use cases in advance, and you cannot change the queryable fields in your data without a redesign.
Remember, if you are going to maintain multiple keys around the same data in a key-value store, updates to multiple tables/buckets/collections/whatever are NOT atomic. You need to deal with this yourself, as the sketch below shows.
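A minimal sketch of that failure mode, with a plain dict standing in for the remote store:

```python
store = {}  # stand-in for a networked key-value store

def save_user(user_id, email):
    store[f"user:{user_id}"] = {"id": user_id, "email": email}
    # If the process dies right here, the email "index" below is never
    # written, and lookups by email silently miss this user. There is
    # no transaction to roll back: the application must detect and
    # repair such partial writes itself.
    store[f"email:{email}"] = user_id

save_user(42, "alice@example.com")
print(store["email:alice@example.com"])  # -> 42
```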
Document-oriented
If you are just moving away from an RDBMS and want to keep your data in an object-like way, as close to a table-like structure as possible, a document structure is the way to go! It is particularly useful when you are creating an app and don't want to deal with RDBMS table design early on (in the prototyping stage), and your schema could change drastically over time. However, note:
Secondary indexes may not perform as well.
Transactions are not available.
Popular document-oriented databases are: MongoDB, Couchbase.
Comparing Key-value NoSQL databases
memcached
In-memory cache
No persistence
TTL supported
Client-side clustering only (the client stores values at multiple nodes). Horizontally scalable through the client.
Not good for large-size values/documents
Redis
In-memory cache
Disk supported – backup and rebuild from disk
TTL supported
Super-fast (see benchmarks)
Data structure support in addition to key-value
Clustering support is not mature enough yet. Vertically scalable (see the Redis Cluster specification)
Horizontal scaling could be tricky.
Supports Secondary indexes
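A small illustration of those features using the redis-py client; the host and key names are hypothetical.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Plain key-value with a TTL: the key expires after 60 seconds.
r.set("session:abc", "user-42", ex=60)

# Data structures beyond key-value: a list used as a simple queue ...
r.lpush("jobs", "resize-image-17")
job = r.rpop("jobs")

# ... and a set, e.g. for counting unique visitors.
r.sadd("visitors:2024-01-01", "user-42")
```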
Aerospike
Both in-memory & on-disk
Extremely fast (could support >1 Million TPS on a single node)
Horizontally scalable. Server side clustering. Sharded & replicated data
Automatic failovers
Supports Secondary indexes
CAS (safe read-modify-write) operations, TTL support
Enterprise class
Comparing document-oriented NoSQL databases
MongoDB
Fast
Mature & stable – feature rich
Supports failovers
Horizontally scalable reads – read from replica/secondary
Writes are not horizontally scalable unless you use Mongo shards
Supports advanced querying
Supports multiple secondary indexes
The shard architecture becomes tricky, and it is not scalable beyond the point where you need secondary indexes. An elementary shard deployment needs 9 nodes at minimum.
Document-level locks are a problem if you have a very high write-rate
Couchbase Server
Fast
Sharded cluster instead of MongoDB's master-slave setup
Hot failover support
Horizontally scalable
Supports secondary indexes through views
Steeper learning curve than MongoDB
Claims to be faster

What are the best ways to mitigate database I/O bottlenecks for large web sites?

For large web sites (traffic-wise) that have a lot of incoming reads and updates that end up being database I/O, what are the best ways to mitigate the performance impact? One solution I can think of is: for writes, cache and then do a delayed write (using a separate job); for reads, use the memcached concept. Any other better solutions?
Here are the most common solutions to database performance:
Caching (Memcache, etc)
Add memory to your database
More database servers (master/slave or sharding)
Use a different database type (NoSQL, Redis, etc)
Indexes to speed up read perf. (careful, too many will affect write performance)
SSDs (fast SSDs will help a lot)
RAID
Optimize/tune SQL queries
Don't forget to optimize your queries. Most of the times it is not the disk I/O, but poorly written queries which turn out to be the bottleneck.
You can also cache query results and also entire web pages if the content isn't going to change too often.
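The classic shape of this is the cache-aside pattern. Below is a minimal sketch with a dict standing in for memcached and a stub standing in for the real SQL query; both are assumptions.

```python
import time

CACHE = {}        # stand-in for memcached/Redis
TTL_SECONDS = 30  # how long a cached result stays valid

def query_database(user_id):
    """Stub for the real (slow) SQL query."""
    return {"id": user_id, "name": "alice"}

def get_user(user_id):
    key = f"user:{user_id}"
    hit = CACHE.get(key)
    if hit and hit["expires"] > time.time():
        return hit["value"]                 # cache hit: no DB I/O
    value = query_database(user_id)         # cache miss: one DB query
    CACHE[key] = {"value": value, "expires": time.time() + TTL_SECONDS}
    return value
```

The TTL bounds how stale a result can get; writes can also delete the key to invalidate it immediately.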
It very much depends on the usage pattern and data type. There are really different things to do depending on whether transactions are going to be supported, whether you are interested in full consistency or "eventual consistency", how big the data is (will it all fit in huge memory?), how complex the data and queries are; the list might go on and on... Lots of variables, and only after listing all the constraints/requirements will you be able to make a proper decision. Two general pieces of advice though:
Use SSDs
Use distributed architecture with distributed "NoSQL" (key/value) approach (only if you do not have to use complex relations and transactions)
10 years ago, the standard answer - besides optimizing your particular database - was scale-out using MySQL in two ways.
Reads can be scaled out in two ways. The first is through caching, which introduces possible inconsistencies and creates a separate cache layer. Reads can also be scaled in MySQL by creating "read replicas", where any replica can be queried. Any write must be applied to all servers, so replication doesn't help write throughput.
Writes are scaled through sharding. For example, imagine all users with a last name starting with 'a' are assigned to a certain server. Now imagine a more complicated shard algorithm, where a particular row's primary ID is hashed using a hash function and distributed to one of a pool of servers.
Facebook is one of the most advanced proponents of a sharded MySQL architecture. You can have individual tables "joined", but you have to write custom code, because you might have to hop from server to server. Imagine you want to get your friend's timeline posts: you can't simply join them, you have to write some application code.
Once you shard your database, you can't do joins, and range lookups become difficult. What remains is sometimes called CRUD operations, and for those MySQL is overkill. Many Chinese social networks realized this and use sharded Redis (which is much quicker than MySQL), having written their own shard layer and application-logic layers.
Imagine the next problem in sharding: you want to add a new server and start assigning some users to that new server.
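One standard answer to this is consistent hashing: servers and keys are placed on a hash ring, so adding a server only moves the keys that fall between the new server and its predecessor. A minimal sketch, with arbitrary server names:

```python
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self.ring, (h(node), node))

    def node_for(self, key):
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, h(key)) % len(self.ring)
        return self.ring[i][1]  # first node clockwise from the key

ring = Ring(["server-1", "server-2", "server-3"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add("server-4")  # modulo rehashing would move ~75% of the keys
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved} of {len(keys)} keys moved")  # only server-4's share moves
```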
Another approach is to use a distributed database, which generally comes under the names NoSQL or NewSQL, with a variety of approaches. Some, like MongoDB, have a sharding system to manage this mapping but require manual steps to add servers. Cassandra has a more flexible clustering scheme, a Chord-style ring architecture. Systems like Couchbase and Aerospike use a random distribution mechanism that removes the need for a shard layer. Some of these databases can exceed 100,000 to 200,000 requests per second per server, with the lateral scale to add new servers - enough for very large operations. With this style of clustering, you can often get a higher level of redundancy and reliability.
Other distributed approaches represent data in a more efficient way, like a graph database. If you have a problem that is better represented as a graph, then a clustered graph database may be more appropriate.

NoSQL vs relational database

Recently NoSQL has gained immense popularity.
What are the advantages of NoSQL over traditional RDBMS?
Not all data is relational. For those situations, NoSQL can be helpful.
With that said, NoSQL stands for "Not Only SQL". It's not intended to knock SQL or supplant it.
SQL has several very big advantages:
Strong mathematical basis.
Declarative syntax.
A well-known language, Structured Query Language (SQL).
Those haven't gone away.
It's a mistake to think about this as an either/or argument. NoSQL is an alternative that people need to consider when it fits, that's all.
Documents can be stored in non-relational databases, like CouchDB.
Maybe reading this will help.
The history seems to look like this:
Google needs a storage layer for their inverted search index. They figure a traditional RDBMS is not going to cut it, so they implement a NoSQL data store, BigTable, on top of their GFS file system. The major part is that thousands of cheap commodity hardware machines provide the speed and the redundancy.
Everyone else realizes what Google just did.
Brewer's CAP theorem is proven. All RDBMS systems in use are CA systems. People begin playing with CP and AP systems as well. K/V stores are vastly simpler, so they are the primary vehicle for the research.
Software-as-a-service systems in general do not provide an SQL-like store. Hence, people get more interested in the NoSQL type stores.
I think much of the take-off can be related to this history. Scaling Google took some new ideas at Google and everyone else follows suit because this is the only solution they know to the scaling problem right now. Hence, you are willing to rework everything around the distributed database idea of Google because it is the only way to scale beyond a certain size.
C - Consistency
A - Availability
P - Partition tolerance
K/V - Key/Value
NoSQL is better than an RDBMS because of the following reasons/properties of NoSQL:
It supports semi-structured data and volatile data
It does not have a schema
Read/write throughput is very high
Horizontal scalability can be achieved easily
Will support big data in volumes of terabytes & petabytes
Provides good support for analytic tools on top of big data
Can be hosted on cheaper hardware machines
An in-memory caching option is available to increase the performance of queries
Faster development life cycles for developers
EDIT:
To answer "why RDBMS cannot scale", please take a look at RDBMS Overheads pdf written by Stavros Harizopoulos,Daniel J. Abadi,Samuel Madden and Michael Stonebraker
RDBMS's have challenges in handling huge data volumes of Terabytes & Peta bytes. Even if you have Redundant Array of Independent/Inexpensive Disks (RAID) & data shredding, it does not scale well for huge volume of data. You require very expensive hardware.
Logging: Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
Locking: Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
Latching: In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
Buffer management: A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
This does not mean that we have to use NoSQL over SQL.
Still, an RDBMS is better than NoSQL for the following reasons/properties of RDBMS:
Transactions with ACID properties - Atomicity, Consistency, Isolation & Durability
Adherence to a strong schema for the data being written/read
Real-time query management (in case of data size < 10 terabytes)
Execution of complex queries involving JOIN & GROUP BY clauses
We have to use RDBMS (SQL) and NoSQL (Not only SQL) depending on the business case & requirements
NOSQL has no special advantages over the relational database model. NOSQL does address certain limitations of current SQL DBMSs, but it doesn't imply any fundamentally new capabilities over previous data models.
NOSQL means only "no SQL" (or "not only SQL"), but that doesn't mean the same as no relational. A relational database in principle would make a very good NOSQL solution; it's just that none of the current set of NOSQL products uses the relational model.
RDBMSs focus more on relationships, and NoSQL focuses more on storage.
You can consider using NoSQL when your RDBMS reaches bottlenecks; NoSQL can make your RDBMS setup more flexible.
The biggest advantage of NoSQL over RDBMS is scalability.
NoSQL databases can easily scale out to many nodes, but for an RDBMS it is very hard.
Scalability not only gives you more storage space but also much higher performance, since many hosts work at the same time.
If you need to process huge amounts of data with high performance
OR
If the data model is not predetermined
then
a NoSQL database is a better choice.
Just adding to all the information given above:
NoSql Advantages:
1) NoSQL is good if you want to be production-ready fast, due to its support for schema-less data and an object-oriented architecture.
2) NoSQL DBs are eventually consistent, which in simple language means they will not lock the data (documents) as an RDBMS does. This means the latest snapshot of the data is always available, which reduces the latency of your application.
3) It uses MVCC (multi-version concurrency control) for maintaining and creating snapshots of data (documents).
4) If you want indexed data, you can create a view which will automatically index the data by the view definition you provide.
NoSql Disadvantages:
1) It is definitely not suitable for big, heavy transactional applications, as it is eventually consistent and does not support ACID properties.
2) It also creates multiple snapshots (revisions) of your data (documents) because it uses the MVCC methodology for concurrency control. As a result, space gets consumed faster, which makes compaction, and hence reindexing, more frequent, and this will slow down your application's response as the data and transactions in your application grow.
To counter that, you can horizontally scale the nodes, but then again it will cost more compared to a SQL database.
From mongodb.com:
NoSQL databases differ from older, relational technology in four main areas:
Data models: A NoSQL database lets you build an application without having to define the schema first unlike relational databases which make you define your schema before you can add any data to the system. No predefined schema makes NoSQL databases much easier to update as your data and requirements change.
Data structure: Relational databases were built in an era where data was fairly structured and clearly defined by their relationships. NoSQL databases are designed to handle unstructured data (e.g., texts, social media posts, video, email) which makes up much of the data that exists today.
Scaling: It’s much cheaper to scale a NoSQL database than a relational database because you can add capacity by scaling out over cheap, commodity servers. Relational databases, on the other hand, require a single server to host your entire database. To scale, you need to buy a bigger, more expensive server.
Development model: NoSQL databases are open source whereas relational databases typically are closed source with licensing fees baked into the use of their software. With NoSQL, you can get started on a project without any heavy investments in software fees upfront.
