Difference between Horizontal Scaling and Clustering of Servers - database

While reading the Cassandra documentation, I came across the term 'clustering'.
After reading some blogs, I learned that clustering is a way of grouping servers (a distributed system) over a LAN, and that behind the scenes it relies on data sharding and partitioning algorithms.
But if we look at a distributed system where we scale servers horizontally and distribute the load across them, those servers seem to be achieving the same clustering properties.
I basically want to know the difference between clustering of servers and replication of servers behind a load balancer.
I originally thought of clustering as a database concept, but I have also seen clustered application servers, so I want to understand the difference between the two.
Is clustering a form of horizontal scaling, or something else?
I haven't found a precise answer.

In Cassandra we don't tend to scale vertically unless there is a scenario where nodes are under-provisioned. The idea of 'clustering' and 'replication' is built into the very nature of how Cassandra is meant to work.
While you can run Cassandra on a single node, because it is designed as a distributed database, it is most common to have multiple nodes. A group of nodes communicating with each other to make up a distributed database are what we refer to as a cluster. The more nodes you add to a cluster, the more data ownership and workload is spread out, which is where the idea of scaling horizontally comes from.
So, to answer your question, 'clustering' is certainly a way of scaling horizontally: nodes are added to a common cluster to increase throughput. You can also think of a cluster as a logical way to organize data. A Cassandra cluster can have one or more datacenters (DCs), each responsible for one or more copies of the data (replicas), depending on how you define things. I would recommend this quick read for a better understanding:
https://cassandra.apache.org/_/cassandra-basics.html
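To make the replica idea concrete, here is a minimal sketch of creating a keyspace with per-datacenter replication, assuming the DataStax Python driver (cassandra-driver); the contact points, keyspace name, and the datacenter name 'dc1' are illustrative, not from the question:

```python
# Minimal sketch, assuming the DataStax Python driver (pip install cassandra-driver).
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])  # hypothetical contact points
session = cluster.connect()

# Keep three replicas of every row in datacenter dc1. The cluster spreads
# ownership of the data across its nodes; adding nodes spreads it further.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app_data
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")
```

With a definition like this, 'scaling horizontally' is literally just joining more nodes to the cluster; the keyspace definition controls how many copies of the data exist per DC.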

Related

Scaling in nosql vs rdbms?

I am trying to understand the architectural difference between NoSQL and relational databases in the context of scalability.
My understanding of (horizontal) scalability is that as our data grows, we add more and more servers to split the load evenly.
In key-value NoSQL databases, we can add a new machine and split the keys. However, all of the examples I have seen so far for understanding eventual consistency in NoSQL databases use a master-slave configuration, where data is replicated across all the slaves instead of being split across various machines to achieve scalability.
My question is: doesn't replicating your whole data set defeat the whole point of scalability in NoSQL databases? The same can be done in an RDBMS as well, with one master (for writes) and slaves (for reads), so how is NoSQL more scalable in this regard?
The goal of scalability is to increase the overall capacity of a given application, and it can be vertical (bigger machines) or horizontal (more machines). When it comes to horizontal scaling, you can keep adding machines, but as the number of machines increases, so does the probability of a node failure somewhere in the cluster, which is something to keep in mind.
When you add more nodes, you can either split the data, which is called sharding, or duplicate the data, which is called replication.
Replication
With replication, the usual architecture is master-slave: you can only write to the master, which replicates the data to the slaves. This means you cannot use replication to split the writes across the cluster, but you can split the reads, depending on the consistency level (not all NoSQL technologies provide the same guarantees) and the cluster configuration.
Sharding
Sharding is better suited to scaling, as you split the data set into multiple parts of similar size where possible. This clearly allows both reads and writes to be split across different nodes. For it to work, some mechanisms need to be in place:
routing: to locate the shard where the data is stored, or to decide which shard to write to
balancing: to keep the different fragments of the dataset at similar sizes over time
These mechanisms are usually provided by the database vendor, so there is no need to implement them yourself, but it is still necessary to understand them in order to manage the cluster.
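As a rough illustration of the routing mechanism (a generic sketch, not any particular vendor's implementation), a hash-based router takes only a few lines of Python; the node names are made up:

```python
import hashlib

NODES = ["db-node-0", "db-node-1", "db-node-2"]  # hypothetical shard servers

def route(key: str) -> str:
    """Hash-mod routing: every key deterministically maps to one shard."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

print(route("user:42"))    # the same key always lands on the same shard
print(route("user:1337"))  # different keys spread across the pool
```

Note that plain hash-mod has no balancing story: adding a node changes almost every key's mapping, which is one reason real systems use consistent hashing or range splitting instead.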
The problem, as I mentioned at first, is that the more nodes a cluster has, the higher the chance of a failure on a given node. If a node holding part of the data set goes offline, part of the data becomes unavailable, which is not a desirable scenario. Luckily, sharding and replication are not exclusive: it is possible to build a sharded cluster where each shard is itself a cluster with replication in place.
But to answer your questions:
doesn't replicating your whole data set defeat the whole point of scalability in NoSQL databases?
In master-slave architectures you cannot split the writes, but you can split the reads, which is somewhat a way to scale, although the main purpose is high availability.
Anyway, there are newer databases that are starting to provide multi-master architectures, where all the nodes act as masters, meaning all of them can receive both writes and reads.
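For illustration, here is a minimal Python sketch of the master-slave read/write split described above; the server names are hypothetical, and real routers classify statements far more carefully:

```python
import random

class MasterSlaveRouter:
    """Send writes to the master; spread reads across the replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def server_for(self, statement: str) -> str:
        # Crude classification: anything that is not a SELECT is a write.
        if statement.lstrip().upper().startswith("SELECT"):
            return random.choice(self.replicas)  # reads scale with replica count
        return self.master                       # writes stay on one node

router = MasterSlaveRouter("master:5432", ["replica-1:5432", "replica-2:5432"])
print(router.server_for("SELECT * FROM users"))           # one of the replicas
print(router.server_for("INSERT INTO users VALUES (1)"))  # always the master
```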
The same can be done in an RDBMS as well, with one master (for writes) and slaves (for reads), so how is NoSQL more scalable in this regard?
In a single-node environment, NoSQL is already faster than an RDBMS when JOINs are involved, as they are expensive operations, or when there are a lot of integrity checks involved.
So when you try to shard the data set in an RDBMS, unless it is really carefully designed, the most likely scenario is that the desired data ends up located in different shards. This means that JOINs and integrity checks need to be performed across different nodes, making already expensive operations even more expensive.
This means that RDBMS databases use mechanisms that act as constraints when you intend to scale horizontally, which NoSQL doesn't. Yes, you can still scale an RDBMS horizontally, but overall it will be more expensive than using a NoSQL database.
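A toy sketch of why the cross-shard JOIN hurts; the 'shards' here are plain dictionaries standing in for remote nodes:

```python
# Hypothetical stand-ins for two shards: after sharding, users and orders
# happen to live on different nodes.
users_shard = {1: {"id": 1, "name": "Alice"}, 2: {"id": 2, "name": "Bob"}}
orders_shard = [{"user_id": 1, "total": 30}, {"user_id": 2, "total": 15}]

def orders_with_names():
    """What used to be one JOIN is now two fetches plus a client-side merge."""
    orders = list(orders_shard)  # round trip 1: the orders node
    users = dict(users_shard)    # round trip 2: the users node
    return [{**o, "name": users[o["user_id"]]["name"]} for o in orders]

print(orders_with_names())
```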
Update: special mention for graph databases
Sharding in graph databases is really difficult, since, mathematically, the problem of distributing a large graph between different servers is NP-complete. And when data has to be queried across different shards, one of the main features of graphs is lost: fast traversals.
I've seen two main approaches that graph databases follow to scale horizontally:
1) Let the application/developer decide how to partition the graph - you can imagine how complex this can get.
2) Replicate the whole graph on every node and use cache sharding, which means that every node holds the entire data set, but each node keeps in memory the part of the graph that is queried most often on that particular node.
I guess, that in the future, graph database companies will develop more solutions to address this issue.
Related to your question: in their current state, graph databases can still out-scale RDBMSs when it comes to horizontal scaling, due to the lack of the RDBMS constraints, but it's hard to compare between different NoSQL database types.
The master-slave configuration in NoSQL databases is for high-availability purposes and data consistency; it should not be confused with the purpose of scalability, which is to load-balance the workload.
In NoSQL, as far as splitting the keys is concerned, only the master copies matter. Slaves are for HA and availability in general. This replication is, in fact, what causes eventual consistency: you will get the data right away; it may not be the most up-to-date copy, but eventually you will have the updated data.
On the other hand, an RDBMS will have slower data access/modification because it has to uphold the ACID properties, usually with strong consistency.
Replication is not, as such, the differentiating factor between NoSQL and RDBMS; adherence or non-adherence to the ACID properties is. Nor does scalability mean an absence of extra copies. Hope that answers it.
To answer your question, replicating your data does not defeat the point of scalability.
Scalability roughly refers to the ability to grow a database and is not necessarily tied to having more copies of the database.
More servers with the database information on them make it more readily accessible to more users, as the other answers have stated.
I believe this might be a misunderstanding between Scalability and Availability?
If we just consider key-value databases vs SQL databases, then the former are more suitable for scalability than the latter.
This is because a key-value store doesn't have transactions, so your only guarantee is that you can atomically change one value for one key. That restriction is what makes scaling easy.
For example, you just hash a key and store the key-value pair on the machine corresponding to the hash of the key.
You can't do the same thing with a SQL database without losing the ACID properties (atomicity, consistency, isolation, durability of a transaction). Moreover, you can't even easily perform a JOIN in a SELECT if you store different tables, or different parts of a table, on different machines.
So, in general, NoSQL databases are better prepared for sharding across many machines than SQL ones.

Scalable database technology and architecture

I've been trying to learn more about database scaling in a distributed system, and I am stuck in between RDBMS and NoSQL.
Some articles online suggest that NoSQL is the solution to modern Big Data. Others say NoSQL is just a hype and RDBMS can be just as scalable with good design, and it provides good data structure.
Instead of reading others' opinions, I'd love to judge the two myself, but I do not understand exactly what is required for a scalable RDBMS and a scalable NoSQL.
I've done a bit more reading on RDBMSs, and it seems that the solution requires leveraging memcache and sharding to reduce database size and the number of DB queries. Are there other tricks? Can you still use tables with many columns, or should you use fewer columns and more joins?
As for NoSQL, I've read a little about MongoDB. I understand that it encourages data aggregation. But how does that make it more scalable? I'm also starting to learn Cassandra because I read that it scales much better than MongoDB, but I have no idea how it is more scalable.
I would very much appreciate a basic (or advanced, if you have the patience to type it out) condensed and down-to-the-core explanation on scaling RDBMS and NoSQL, or good articles online or books that explain the topic. :)
I won't cover ways you can scale by implementing things on your own, like putting a memcache server in between; I'll just cover what comes right out of the box...
Let's start first with RDBMS:
I think setting up an RDBMS cluster is more complicated than a NoSQL cluster, but that's just my opinion. Usually what you have is one master and multiple slaves. You have to send all your writes to the master and can read from any slave you want. Since you have an RDBMS with ACID, the system should somehow guarantee that you won't read old data. The assumption here is that your application writes once and reads often (as is usually the case); for that pattern, one server for read/write and multiple servers for read is great. The problem arises if your writes are so frequent that the single master can no longer keep up with them. That is your bottleneck. In addition to the built-in solutions from vendors like Oracle - which are huge - there is also http://www.scalearc.com/, which can cache queries and handle the scaling for you.
NoSQL:
There is no single NoSQL model implemented by all the DBs; every system is a bit different. MongoDB, for instance, is quite similar to an RDBMS: it also has one master and several slaves to which it can replicate data, but additionally you can create shards. Data is split between shards and replicated to slaves, so you can have multiple masters, each responsible for a smaller part of the data. When you read, you can then choose whether to read from the master or from any slave, depending on how urgently you need the latest data.
Cassandra, on the other hand, works totally differently. I'm not sure if you can write to multiple servers or exactly how it works, but basically the servers keep a log of all the writes. So even if they can't process the writes immediately, the writes are stored in a log in order to still give you a fast response. Afterwards, when you read, you can again say how urgently you want the new data; if you really want the very latest data, Cassandra will need to check the log for any pending updates, and that will cost you time.
Key-value stores like ElasticSearch, CouchDB and Couchbase work differently again. Here the key of the item is hashed and, based on the hash, the item is sent to the one node responsible for it. This way, when you read after the key was written, you get up-to-date information, because you read from the same node. The idea behind this design is that no single key will be of interest to everyone, so the load gets distributed. These are also the DBs which I think scale the best and make it easiest to add more servers to the cluster, but you lose the power of complex queries that you have in MongoDB and Cassandra - and of course in an RDBMS. ElasticSearch has some simple search queries, and CouchDB and Couchbase have only views, which are produced by MapReduce; you can get the data you want if it fits the view. Otherwise you can only access it by key.
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - is a very comprehensive summary of the most common NoSQL DBs, what are their strengths and weaknesses, and the most common usage scenarios.
In the end, the question is also: why do you want to scale? How many records are you going to have in the database? A few million is not a problem at all. A few hundred million is also not a problem for most RDBMSs on a powerful enough server. And if you design the DB and its indices properly, even a billion records per year should still be fine.

Difference between scaling horizontally and vertically for databases [closed]

I have come across many NoSQL databases and SQL databases. There are varying parameters to measure the strength and weaknesses of these databases and scalability is one of them. What is the difference between horizontally and vertically scaling these databases?
Horizontal scaling means that you scale by adding more machines into your pool of resources whereas Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.
An easy way to remember this is to think of a machine on a server rack, we add more machines across the horizontal direction and add more resources to a machine in the vertical direction.
In the database world, horizontal scaling is often based on the partitioning of the data, i.e. each node contains only part of the data; in vertical scaling the data resides on a single node and scaling is done through multi-core, i.e. spreading the load between the CPU and RAM resources of that machine.
With horizontal scaling it is often easier to scale dynamically by adding more machines into the existing pool; vertical scaling is limited to the capacity of a single machine, and scaling beyond that capacity often involves downtime and comes with an upper limit.
Good examples of horizontal scaling are Cassandra, MongoDB and Google Cloud Spanner; a good example of vertical scaling is MySQL on Amazon RDS (the cloud version of MySQL), which provides an easy way to scale vertically by switching from smaller to bigger machines. This process often involves downtime.
In-Memory Data Grids such as GigaSpaces XAP, Coherence etc.. are often optimized for both horizontal and vertical scaling simply because they're not bound to disk. Horizontal-scaling through partitioning and vertical-scaling through multi-core support.
You can read more on this subject in my earlier posts:
Scale-out vs Scale-up and The Common Principles Behind the NOSQL Alternatives
Scaling horizontally ===> Thousands of minions will do the work together for you.
Scaling vertically ===> One big hulk will do all the work for you.
Let's start with the need for scaling, that is, increasing resources so that your system can handle more requests than it could before.
When you realise your system is getting slow and is unable to handle the current number of requests, you need to scale the system.
This gives you two options. Either you increase the resources of the server you are currently using, i.e. increase the amount of RAM, CPU, GPU and other resources; this is known as vertical scaling.
Vertical scaling is typically costly.
It does not make the system fault tolerant: if you are running your application on a single server and that server goes down, your system goes down.
Also, the number of threads remains the same with vertical scaling.
Vertical scaling may also require your system to go down while the upgrade takes place, since increasing the resources of a server requires a restart.
Another solution to this problem is increasing the number of servers in the system. This solution is widely used in the tech industry.
This will eventually decrease the requests-per-second rate that each individual server has to handle.
If you need to scale the system, just add another server, and you are done. You would not be required to restart the system.
The number of threads on each machine decreases, leading to higher throughput.
To distribute the requests equally to each of the application servers, you need to add a load balancer, which acts as a reverse proxy to the web servers. This whole system can be called a single cluster.
Your system may receive a very large number of requests, which would require more clusters like this.
Hope this gives you the whole concept of introducing scaling to a system.
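As a sketch of the load-balancer step (round-robin, the simplest strategy; the backend names are made up):

```python
import itertools

class RoundRobinBalancer:
    """Hand out backend servers in rotation, as a reverse proxy would."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def next_backend(self) -> str:
        return next(self._cycle)

lb = RoundRobinBalancer(["app-1:8080", "app-2:8080", "app-3:8080"])
for _ in range(4):
    print(lb.next_backend())  # app-1, app-2, app-3, app-1, ...
```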
There is an additional architecture that wasn't mentioned - SQL-based database services that enable horizontal scaling without the complexity of manual sharding. These services do the sharding in the background, so they enable you to run a traditional SQL database and scale out like you would with NoSQL engines like MongoDB or CouchDB. Two services I am familiar with are EnterpriseDB for PostgreSQL and Xeround for MySQL. I saw an in-depth post by Xeround which explains why scale-out on SQL databases is difficult and how they do it differently - treat this with a grain of salt as it is a vendor post. Also check out Wikipedia's Cloud Database entry, there is a nice explanation of SQL vs. NoSQL and service vs. self-hosted, a list of vendors and scaling options for each combination. ;)
Yes, scaling horizontally means adding more machines, but it also implies that the machines are equal in the cluster. MySQL can scale horizontally in terms of reading data through the use of replicas, but once you reach the memory/disk capacity of a single server, you have to begin sharding data across servers, which becomes increasingly complex. Keeping data consistent across replicas is often a problem, since replication rates are often too slow to keep up with data change rates.
Couchbase is also a fantastic horizontally scaling NoSQL database, used in many commercial high-availability applications and games, and arguably the highest performer in the category. It partitions data automatically across the cluster, adding nodes is simple, and you can use commodity hardware or cheaper VM instances (using Large instead of High-Mem, High-Disk machines at AWS, for instance). It is built on Membase (Memcached) but adds persistence. Also, in the case of Couchbase, every node can do reads and writes, and all are equals in the cluster, with only failover replication (not full data-set replication across all servers like in MySQL).
Performance-wise, you can see an excellent Cisco benchmark: http://blog.couchbase.com/understanding-performance-benchmark-published-cisco-and-solarflare-using-couchbase-server
Here is a great blog post about Couchbase Architecture: http://horicky.blogspot.com/2012/07/couchbase-architecture.html
Traditional relational databases were designed as client/server database systems. They can be scaled horizontally but the process to do so tends to be complex and error prone. NewSQL databases like NuoDB are memory-centric distributed database systems designed to scale out horizontally while maintaining the SQL/ACID properties of traditional RDBMS.
For more information on NuoDB, read their technical white paper.
SQL databases like Oracle and DB2 also support horizontal scaling through shared-disk clustering, for example Oracle RAC, IBM DB2 pureScale, or Sybase ASE Cluster Edition. New nodes can be added to an Oracle RAC or DB2 pureScale system to achieve horizontal scaling.
But the approach differs from NoSQL databases (like MongoDB, CouchDB or IBM Cloudant) in that data sharding is not part of the horizontal scaling; in NoSQL databases, data is sharded during horizontal scaling.
The accepted answer is spot-on regarding the basic definition of horizontal vs vertical scaling. But unlike the common belief that horizontal scaling of databases is only possible with Cassandra, MongoDB, etc., I would like to add that horizontal scaling is also very much possible with any traditional RDBMS, and without using any third-party solutions.
I know of many companies, especially SaaS companies, that do this. It is done using simple application logic: you basically take a set of users and divide them over multiple DB servers. So, for example, you would typically have a "meta" database/table that stores clients, DB server/connection strings, etc., and a table that stores the client-to-server mapping.
Then simply direct requests from each client to the DB server they are mapped to.
Now, some may say this is akin to horizontal partitioning and not "true" horizontal scaling, and they would be right in some ways. But the end result is that you have scaled your DB over multiple DB servers.
The only difference between the two approaches to horizontal scaling is that in one (MongoDB, etc.) the scaling is done by the DB software itself, so in that sense you are "buying" the scaling, while in the other (RDBMS horizontal scaling) the scaling is built with application code/logic.
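A minimal sketch of the application-logic approach, assuming a hypothetical tenant-to-server mapping (in practice this would come from the "meta" table mentioned above):

```python
# Hypothetical mapping of clients (tenants) to their home DB server.
TENANT_MAP = {
    "acme-corp": "db-server-1.internal:5432",
    "globex":    "db-server-2.internal:5432",
    "initech":   "db-server-2.internal:5432",
}

def server_for_tenant(tenant: str) -> str:
    """Every query for a given client goes only to that client's server."""
    return TENANT_MAP[tenant]

print(server_for_tenant("acme-corp"))  # db-server-1.internal:5432
```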
Adding lots of load balancers creates extra overhead and latency, and that is the drawback of scaling out horizontally in NoSQL databases. It is similar to the question of why people say RPC is not recommended: because it is not robust.
I think in a real system we should use both SQL and NoSQL databases, to utilize both the multicore and the cloud-computing capabilities of today's systems.
On the other hand, complex transactional queries perform well if SQL databases such as Oracle are used. NoSQL can be used for big data and for horizontal scalability through sharding.
You have a company with only one worker, and when you land a new project you hire a new candidate. This is horizontal scaling, where the new candidate is a new machine and the project is new traffic/calls to your APIs.
Whereas having one IIT/NIT guy handle all the requests to your APIs, and whenever there are more requests, firing him and replacing him with a higher-IQ IIT/NIT guy - this is vertical scaling.

What are the best ways to mitigate database I/O bottlenecks for large web sites?

For large web sites (traffic-wise) that have a lot of incoming reads and updates that end up as database I/O, what are the best ways to mitigate the performance impact? One solution I can think of is: for writes, cache them and then do a delayed write (using a separate job); for reads, use the memcached concept. Are there other, better solutions?
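The delayed-write idea from the question could look roughly like this in-process sketch; db.bulk_insert is a hypothetical call, and a real implementation would need a durable buffer so queued writes survive a crash:

```python
import queue
import threading
import time

write_queue: queue.Queue = queue.Queue()

def flush_worker():
    """The 'separate job': drain buffered writes to the database in batches."""
    while True:
        batch = [write_queue.get()]  # block until something arrives
        while not write_queue.empty() and len(batch) < 100:
            batch.append(write_queue.get())
        # db.bulk_insert(batch)  # hypothetical DB call: one I/O for many writes
        print(f"flushed {len(batch)} buffered write(s)")

threading.Thread(target=flush_worker, daemon=True).start()

# The request path only enqueues, so it returns without touching the database.
write_queue.put({"user_id": 42, "action": "click"})
time.sleep(0.1)  # give the worker a moment before the script exits
```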
Here are the most common solutions to database performance:
Caching (Memcache, etc)
Add memory to your database
More database servers (master/slave or sharding)
Use a different database type (NoSQL, Redis, etc)
Indexes to speed up read perf. (careful, too many will affect write performance)
SSDs (fast SSDs will help a lot)
RAID
Optimize/tune SQL queries
Don't forget to optimize your queries. Most of the time it is not the disk I/O but poorly written queries that turn out to be the bottleneck.
You can also cache query results, and even entire web pages if the content isn't going to change too often.
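A read-through cache can be sketched in a few lines; here a dictionary stands in for memcached, and db_fetch_user is a hypothetical stand-in for the real query:

```python
import time

CACHE: dict = {}
TTL = 60  # seconds before a cached row is considered stale

def db_fetch_user(user_id: int) -> dict:
    """Stand-in for the real (slow, I/O-bound) database query."""
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: int) -> dict:
    """Read-through cache: serve from memory, fall back to the DB on a miss."""
    hit = CACHE.get(user_id)
    if hit and time.time() - hit["at"] < TTL:
        return hit["row"]  # cache hit: no database I/O at all
    row = db_fetch_user(user_id)
    CACHE[user_id] = {"row": row, "at": time.time()}
    return row

get_user(7)  # first call hits the "database"
get_user(7)  # second call is served from memory
```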
It very much depends on the usage pattern and data type. There are really different things to do depending on whether transactions are going to be supported, whether you are interested in full consistency or "eventual consistency", how big the data is (will it all fit in huge memory?), how complex the data and queries are - the list might go on and on... Lots of variables, and only after listing all the constraints/requirements will you be able to make a proper decision. Two general pieces of advice though:
Use SSDs
Use distributed architecture with distributed "NoSQL" (key/value) approach (only if you do not have to use complex relations and transactions)
10 years ago, the standard answer - besides optimizing your particular database - was to scale out using MySQL in two ways.
Reads can be scaled out in two ways. The first is through caching, which introduces possible inconsistencies and creates a separate cache layer. Reads can also be scaled in MySQL by creating "read replicas", where any replica can be queried. Any write must be applied to all servers, so replication doesn't help write throughput.
Writes are scaled through sharding. For example, imagine all users with a last name starting with 'A' are assigned to a certain server. Now imagine a more complicated shard algorithm, where a particular row's primary ID is hashed using a hash function and distributed to one of a pool of servers.
Facebook is one of the most advanced proponents of a sharded MySQL architecture. You can have individual tables "joined", but you have to write custom code, because you might have to hop from server to server: imagine you want to get your friends' timeline posts - you can't simply join them; you have to write some application code.
Once you shard your database, you can't do joins, and range lookups become difficult. The subset of operations that remains is sometimes called CRUD operations, and for those MySQL is overkill. Many Chinese social networks realized this and use sharded Redis (which is much quicker than MySQL), having written their own shard layer and application logic layers.
Imagine the next problem in sharding: you want to add a new server and start assigning some users to that new server.
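One classic answer to that problem is consistent hashing, sketched below; production systems place many virtual nodes per server on the ring so the load stays even, which this minimal version omits:

```python
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hashing: adding a node remaps only a slice of the keys."""

    def __init__(self, nodes):
        self._ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        points = [p for p, _ in self._ring]
        i = bisect.bisect(points, h(key)) % len(self._ring)
        return self._ring[i][1]

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (h(node), node))

ring = HashRing(["server-1", "server-2", "server-3"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("server-4")
moved = sum(before[k] != ring.node_for(k) for k in keys)
print(f"{moved} of 1000 keys moved")  # only a fraction, not all of them
```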
Another approach is to use a distributed database, which generally comes under the names NoSQL or NewSQL; these take a variety of approaches. Some, like MongoDB, have a sharding system to manage this mapping, but require manual steps to add servers. Cassandra has a more flexible clustering scheme based on a consistent-hashing ring (a Chord-like architecture). Systems like Couchbase and Aerospike use a random distribution mechanism that removes the need for a shard layer. Some of these databases can exceed 100,000 to 200,000 requests per second per server, with the lateral scalability to add new servers - enough for very large operations. With this style of clustering, you can often get a higher level of redundancy and reliability.
Other distributed approaches represent data in a more efficient way, like a graph database. If you have a problem that is better represented as a graph, then a clustered graph database may be more appropriate.

How do the newer database models achieve better scalability and performance as compared to a traditional RDBMS implementation?

We have
BigTable from Google,
Hadoop, actively contributed to by Yahoo,
Dynamo from Amazon
all aiming towards one common goal - making data management as scalable as possible.
By scalability, what I understand is that the cost of usage should not go up drastically as the size of the data increases.
RDBMSs are slow when the amount of data is large, as the number of indirections invariably increases, leading to more I/Os.
How do these custom scalable friendly data management systems solve the problem?
This is a figure from this document explaining Google BigTable:
Looks the same to me. How is the ultra-scalability achieved?
The "traditional" SQL DBMS market really means a very small number of products, which have traditionally targeted business applications in a corporate setting. Massive shared-nothing scalability has not historically been a priority for those products or their customers. So it is natural that alternative products have emerged to support internet scale database applications.
This has nothing to do with the fact that these new products are not "Relational" DBMSs. The relational model can scale just as well as any other model. Arguably the relational model suits these types of massively scalable applications better than say, network (graph based) models. It's just that the SQL language has a lot of disadvantages and no-one has yet come up with suitable relational NOSQL (non-SQL) alternatives.
Speaking specifically to your question about Bigtable, the difference is that the hierarchy in the diagram above is all there is. Each Bigtable tablet server is responsible for a set of tablets (contiguous row ranges from a table); the mapping from row range to tablet is maintained in the metadata table, while the mapping from tablet to tablet server is maintained in the memory of the Bigtable master. Looking up a row, or range of rows, requires looking up the metadata entry (which will almost certainly be in memory on the server that hosts it), then using that to look up the actual row on the server responsible for it - resulting in only one, or a few, disk seeks.
In a nutshell, the reason this scales well is because it's possible to throw more hardware at it: given enough resources, the metadata is always in memory, and thus there's no need to go to disk for it, only for the data (and not always for that, either!).
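A toy version of that two-level lookup, with a hypothetical in-memory metadata table mapping row ranges to tablet servers:

```python
import bisect

# First row key of each tablet, plus the server hosting that tablet.
# Tablets cover ["", "g"), ["g", "p"), ["p", ...).
TABLET_STARTS = ["", "g", "p"]
TABLET_SERVERS = ["ts-1", "ts-2", "ts-3"]

def server_for_row(row_key: str) -> str:
    """Bigtable-style routing: the metadata lives in memory, so finding the
    owning server costs no disk seeks - only reading the row itself does."""
    i = bisect.bisect_right(TABLET_STARTS, row_key) - 1
    return TABLET_SERVERS[i]

print(server_for_row("alice"))  # ts-1
print(server_for_row("zebra"))  # ts-3
```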
It's about using cheap commodity hardware to build a network/grid/cloud and spread the data and load (for example using map/reduce).
RDBMS databases seem to me like software (originally) designed to run on one supercomputer. You can use various hard-drive arrays, DB clusters, etc., but still...
The amount of data has increased, so there's one more reason to design new data storage with this in mind: scalability, high availability, terabytes of data.
Another thing: if you build a grid/cloud from cheap servers, it's fault tolerant, because you store all data at three (?) different locations, and at the same time it's cheap.
Back to your pictures - the first one is from one computer (typically), the second one from a network of computers.
One theoretical answer on scalability is at http://queue.acm.org/detail.cfm?id=1394128 - the ACID guarantees are expensive. See http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf for a counter-argument.
In fact, just surviving power failures is expensive. Years ago I compared MySQL against Oracle. MySQL was almost unbelievably faster than Oracle, but we couldn't use it. MySQL in those days was built on top of Berkeley DB, which was miles faster than Oracle's full-blown log-based database, but if the power went off while Berkeley-DB-based MySQL was running, it was a manual process to get the database consistent again when the power came back on, and you'd probably lose recent updates for good.
