Difference between scaling horizontally and vertically for databases [closed] - database

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have come across many NoSQL databases and SQL databases. There are varying parameters to measure the strength and weaknesses of these databases and scalability is one of them. What is the difference between horizontally and vertically scaling these databases?

Horizontal scaling means that you scale by adding more machines to your pool of resources, whereas vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.
An easy way to remember this is to picture a server rack: you add more machines in the horizontal direction, and you add more resources to a single machine in the vertical direction.
                 
In the database world, horizontal scaling is often based on partitioning the data, i.e. each node contains only part of the data. In vertical scaling the data resides on a single node, and scaling is done through multi-core, i.e. by spreading the load across the CPU and RAM resources of that machine.
With horizontal scaling it is often easier to scale dynamically by adding more machines to the existing pool. Vertical scaling is limited to the capacity of a single machine; scaling beyond that capacity often involves downtime and comes with an upper limit.
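As a rough illustration of "each node contains only part of the data", here is a minimal sketch of hash partitioning. The `PartitionedStore` class and the use of plain dictionaries as stand-ins for machines are assumptions made for the example, not how any particular database implements it.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store that spreads keys over several in-memory 'nodes'."""

    def __init__(self, node_count):
        # Each dict stands in for one machine holding part of the data.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_index(self, key):
        # Stable hash, so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_index(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_index(key)].get(key)


store = PartitionedStore(node_count=3)
for i in range(9):
    store.put(f"user:{i}", {"id": i})
```

Every read and write touches exactly one node, which is why adding nodes can add capacity. What the sketch leaves out is what happens to existing keys when `node_count` changes.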
Good examples of horizontal scaling are Cassandra, MongoDB and Google Cloud Spanner. A good example of vertical scaling is MySQL on Amazon RDS (the managed cloud version of MySQL), which provides an easy way to scale vertically by switching from smaller to bigger machines. This process often involves downtime.
In-memory data grids such as GigaSpaces XAP and Coherence are often optimized for both horizontal and vertical scaling simply because they are not bound to disk: horizontal scaling through partitioning, and vertical scaling through multi-core support.
You can read more on this subject in my earlier posts:
Scale-out vs Scale-up and The Common Principles Behind the NOSQL Alternatives

Scaling horizontally ===> Thousands of minions will do the work together for you.
Scaling vertically ===> One big hulk will do all the work for you.

Let's start with the need for scaling, that is, increasing resources so that your system can handle more requests than it previously could.
When you realise your system is getting slow and cannot handle the current number of requests, you need to scale it.
You have two options. The first is to increase the resources of the server you are currently using, i.e. add RAM, CPU, GPU and other resources. This is known as vertical scaling.
Vertical scaling is typically costly.
It does not make the system fault tolerant: if the application runs on a single server and that server goes down, the whole system goes down.
Also, the number of threads remains the same in vertical scaling.
Vertical scaling may require downtime while the upgrade takes place, because adding resources to a server usually requires a restart.
The other option is to increase the number of servers in the system. This is horizontal scaling, and it is widely used in the tech industry.
It decreases the number of requests per second that each server has to handle.
If you need to scale the system further, you just add another server. You are not required to restart the system.
The number of threads on each server decreases, which leads to higher throughput.
To distribute the requests equally across the application servers, you need to add a load balancer, which acts as a reverse proxy in front of the web servers. This whole setup can be called a single cluster.
If your system receives a very large number of requests, it may need several clusters like this.
Hope this makes the concept of scaling a system clear.
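To make the load balancer idea concrete, here is a minimal round-robin sketch. The `RoundRobinBalancer` class and the server names are hypothetical; a real reverse proxy also handles health checks, connection handling and much more.

```python
class RoundRobinBalancer:
    """Hands each incoming request to the next server in the pool."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._counter = 0

    def add_server(self, server):
        # Scaling out: the pool grows while the balancer keeps running.
        self.servers.append(server)

    def route(self):
        server = self.servers[self._counter % len(self.servers)]
        self._counter += 1
        return server


balancer = RoundRobinBalancer(["app-1", "app-2"])
before = [balancer.route() for _ in range(4)]  # two servers share the load
balancer.add_server("app-3")
after = [balancer.route() for _ in range(6)]   # three servers share the load
```

With two servers each one receives half of the requests; after `add_server` each receives a third, without restarting anything.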

There is an additional architecture that wasn't mentioned: SQL-based database services that enable horizontal scaling without the complexity of manual sharding. These services do the sharding in the background, so they enable you to run a traditional SQL database and scale out like you would with NoSQL engines like MongoDB or CouchDB. Two services I am familiar with are EnterpriseDB for PostgreSQL and Xeround for MySQL. I saw an in-depth post by Xeround which explains why scale-out on SQL databases is difficult and how they do it differently. Treat this with a grain of salt, as it is a vendor post. Also check out Wikipedia's Cloud Database entry, which has a nice explanation of SQL vs. NoSQL and service vs. self-hosted, plus a list of vendors and scaling options for each combination. ;)

Yes, scaling horizontally means adding more machines, but it also implies that the machines in the cluster are equals. MySQL can scale horizontally for reads through the use of replicas, but once it reaches the memory/disk capacity of a server, you have to begin sharding data across servers, which becomes increasingly complex. Keeping data consistent across replicas is often a problem too, as replication can be too slow to keep up with the rate of data changes.
Couchbase is also a fantastic horizontally scaling NoSQL database, used in many commercial high-availability applications and games, and arguably the highest performer in the category. It partitions data automatically across the cluster, adding nodes is simple, and you can use commodity hardware or cheaper VM instances (for instance Large instead of High-Mem or High-Disk machines at AWS). It is built on Membase (Memcached) but adds persistence. Also, in Couchbase every node can do reads and writes and all nodes are equals in the cluster, with replication used only for failover (not full dataset replication across all servers as in MySQL).
Performance-wise, you can see an excellent Cisco benchmark: http://blog.couchbase.com/understanding-performance-benchmark-published-cisco-and-solarflare-using-couchbase-server
Here is a great blog post about Couchbase Architecture: http://horicky.blogspot.com/2012/07/couchbase-architecture.html

Traditional relational databases were designed as client/server database systems. They can be scaled horizontally but the process to do so tends to be complex and error prone. NewSQL databases like NuoDB are memory-centric distributed database systems designed to scale out horizontally while maintaining the SQL/ACID properties of traditional RDBMS.
For more information on NuoDB, read their technical white paper.

SQL databases like Oracle and DB2 also support horizontal scaling through shared-disk clusters, for example Oracle RAC, IBM DB2 pureScale or Sybase ASE Cluster Edition. A new node can be added to an Oracle RAC or DB2 pureScale system to achieve horizontal scaling.
The difference from NoSQL databases (like MongoDB, CouchDB or IBM Cloudant) is that data sharding is not part of this kind of horizontal scaling. In NoSQL databases, data is sharded during horizontal scaling.

The accepted answer is spot on with the basic definition of horizontal vs vertical scaling. But contrary to the common belief that horizontal scaling of databases is only possible with Cassandra, MongoDB, etc., I would like to add that horizontal scaling is also very much possible with any traditional RDBMS, and without using any third-party solutions.
I know of many companies, especially SaaS companies, that do this with simple application logic. You basically take a set of users and divide them over multiple DB servers. For example, you would typically have a "meta" database/table that stores clients, DB servers/connection strings, etc., and a table that stores the client/server mapping.
Then simply direct requests from each client to the DB server they are mapped to.
Now some may say this is akin to horizontal partitioning and not "true" horizontal scaling, and they would be right in some ways. But the end result is that you have scaled your DB over multiple DB servers.
The only difference between the two approaches to horizontal scaling is that in one approach (MongoDB, etc.) the scaling is done by the DB software itself; in that sense you are "buying" the scaling. In the other approach (RDBMS horizontal scaling), the scaling is built in application code/logic.
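The application-level routing described above could be sketched roughly as follows. The `TenantDirectory` class, the client names and the connection strings are all made up for illustration; in practice the two mappings would live in the "meta" database rather than in memory.

```python
class TenantDirectory:
    """In-memory stand-in for the 'meta' database described above."""

    def __init__(self, servers):
        # server name -> connection string (all values here are hypothetical)
        self.servers = dict(servers)
        # client id -> server name
        self.client_to_server = {}

    def assign(self, client_id, server_name):
        if server_name not in self.servers:
            raise KeyError(f"unknown DB server: {server_name}")
        self.client_to_server[client_id] = server_name

    def connection_string_for(self, client_id):
        # Every request for a client is directed to the server it is mapped to.
        return self.servers[self.client_to_server[client_id]]


directory = TenantDirectory({
    "db-a": "postgresql://db-a.example.internal/app",
    "db-b": "postgresql://db-b.example.internal/app",
})
directory.assign("acme", "db-a")
directory.assign("globex", "db-b")
```

Moving a client to another server then means copying its data and updating one row of the mapping, which is the part the application has to build itself.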

Adding lots of load balancers creates extra overhead and latency, and that is the drawback of scaling out horizontally with NoSQL databases. It is similar to the argument that RPC is not recommended because it is not robust.
I think a real system should use both SQL and NoSQL databases, to take advantage of both the multi-core and the cloud computing capabilities of today's systems.
Complex transactional queries perform well when SQL databases such as Oracle are used, while NoSQL can be used for big data and horizontal scalability through sharding.

Suppose your company has only one worker and you get a new project, so you hire another person: this is horizontal scaling. The new hire corresponds to a new machine, and the project to new traffic/calls to your APIs.
Now suppose a single IIT/NIT graduate handles all the requests to your API. When the request load grows, you fire him and replace him with an even higher-IQ IIT/NIT graduate: this is vertical scaling.

Related

Difference between Horizontal Scaling and Clustering of Servers

While reading the Cassandra documentation, I came across the term "clustering growth".
After reading blogs, I learned that clustering is a way of grouping servers (distributed servers) over a LAN to solve a problem, and that behind the scenes it uses data sharding and partitioning algorithms.
But in a distributed system we also scale servers horizontally and distribute the load, so those servers are in some sense achieving clustering properties too.
I basically want to know the difference between clustering servers and replicating servers behind a load balancer.
I had understood clustering as a database technique, but I have seen clustered application servers as well.
Is clustering a way of horizontal scaling, or something else?
I have not found a precise answer.
In Cassandra we don't tend to scale vertically unless there is a scenario where nodes are under-provisioned. The idea of 'clustering' and 'replication' is built into the very nature of how Cassandra is meant to work.
While you can run Cassandra on a single node, because it is designed as a distributed database, it is most common to have multiple nodes. A group of nodes communicating with each other to make up a distributed database are what we refer to as a cluster. The more nodes you add to a cluster, the more data ownership and workload is spread out, which is where the idea of scaling horizontally comes from.
So, to answer your question, 'clustering' is certainly a way of scaling horizontally when nodes are added to a common cluster to increase throughput. You can also think of a cluster as a logical way to organize data. A Cassandra cluster can have one or more DCs (DataCenters) that are responsible for one or more copies of the data (Replicas) depending on how you define things. I would recommend this quick read for a better understanding:
https://cassandra.apache.org/_/cassandra-basics.html

What are the best ways to mitigate database I/O bottlenecks for large web sites?

For large web sites (traffic-wise) that have a lot of incoming reads and updates that end up as database I/O, what are the best ways to mitigate the performance impact? One solution I can think of: for writes, cache them and then do a delayed write (using a separate job); for reads, use a memcached-style cache. Are there any better solutions?
Here are the most common solutions to database performance:
Caching (Memcache, etc)
Add memory to your database
More database servers (master/slave or sharding)
Use a different database type (NoSQL, Redis, etc)
Indexes to speed up read perf. (careful, too many will affect write performance)
SSDs (fast SSDs will help a lot)
RAID
Optimize/tune SQL queries
Don't forget to optimize your queries. Most of the time the bottleneck is not disk I/O but poorly written queries.
You can also cache query results, and even entire web pages if the content isn't going to change too often.
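Caching query results is usually done with a "cache-aside" pattern: look in the cache first and only run the query on a miss. The sketch below assumes a hypothetical `QueryCache` class with a fixed time-to-live; a production setup would more likely use memcached or a similar external cache.

```python
import time


class QueryCache:
    """Cache-aside helper: serve cached results until they expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expiry time, cached value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not touched
        value = loader()     # cache miss: run the real query
        self._entries[key] = (now + self.ttl, value)
        return value


db_calls = []


def slow_query():
    # Stand-in for an expensive database query.
    db_calls.append("SELECT ...")
    return ["row-1", "row-2"]


cache = QueryCache(ttl_seconds=60)
first = cache.get_or_load("top_posts", slow_query)
second = cache.get_or_load("top_posts", slow_query)  # served from the cache
```

The trade-off is staleness: for up to `ttl_seconds` readers may see results that no longer match the database.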
It very much depends on the usage pattern and data type. There are quite different things to do depending on whether transactions need to be supported, whether you need full consistency or "eventual consistency", how big the data is (will it all fit in a large amount of memory?), how complex the data and queries are, and so on. There are lots of variables, and only after listing all the constraints/requirements will you be able to make a proper decision. Two general pieces of advice, though:
Use SSDs
Use distributed architecture with distributed "NoSQL" (key/value) approach (only if you do not have to use complex relations and transactions)
10 years ago, the standard answer - besides optimizing your particular database - was scale-out using MySQL in two ways.
Reads can be scaled out in two ways. The first is caching, which introduces possible inconsistencies and creates a separate cache layer. Reads can also be scaled in MySQL by creating "read replicas", any of which can be queried. Every write must be applied to all servers, so replication doesn't help write throughput.
Writes are scaled through sharding. For example, imagine all users whose last name starts with 'a' are assigned to a certain server. Now imagine a more complicated shard algorithm, where a row's primary ID is hashed using a hash function and the row is assigned to one of a pool of servers.
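The two shard algorithms mentioned here could look roughly like this. The server names, the letter ranges and both function names are assumptions for the example, and the first function assumes last names start with an ASCII letter.

```python
import bisect
import hashlib

SERVERS = ["db-0", "db-1", "db-2", "db-3"]

# Last letter of each range: a-f -> db-0, g-l -> db-1, m-r -> db-2, s-z -> db-3.
RANGE_ENDS = ["f", "l", "r"]


def shard_by_initial(last_name):
    """Simple scheme: the first letter of the last name picks the server."""
    letter = last_name[0].lower()
    return SERVERS[bisect.bisect_left(RANGE_ENDS, letter)]


def shard_by_hash(primary_id):
    """Hashed scheme: a hash of the primary ID picks one server from the pool."""
    digest = hashlib.sha256(str(primary_id).encode("utf-8")).digest()
    return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]
```

The letter-based scheme keeps range lookups by name on one server but can load servers unevenly, since some initials are far more common than others. The hashed scheme spreads rows more evenly at the price of scattering neighbouring IDs.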
Facebook is one of the most advanced proponents of a sharded MySQL architecture. You can have individual tables "joined", but you have to write custom code, because you might have to hop from server to server. Imagine you want to get your friends' timeline posts: you can't simply join, you have to write some application code.
Once you shard your database, you can't do joins, and range lookups become difficult. What remains is a subset sometimes called CRUD operations, for which MySQL is overkill. Many Chinese social networks realized this and use sharded Redis (which is much quicker than MySQL), having written their own shard layer and application logic layers.
Imagine the next problem in sharding: you want to add a new server and start assigning some users to it.
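One way to see why adding a server is a problem is to count how many keys change owner. The sketch below compares plain modulo placement with a consistent-hash ring, a technique the answer does not name but which is commonly used for this; `HashRing`, the key names and the virtual-node count are all assumptions for illustration.

```python
import bisect
import hashlib


def stable_hash(value):
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


def modulo_owner(key, servers):
    return servers[stable_hash(key) % len(servers)]


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, servers, vnodes=100):
        self._points = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._hashes = [point for point, _ in self._points]

    def owner(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        index = bisect.bisect(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[index][1]


keys = [f"user:{i}" for i in range(10_000)]
old_servers = ["db-0", "db-1", "db-2", "db-3"]
new_servers = old_servers + ["db-4"]

moved_modulo = sum(
    modulo_owner(k, old_servers) != modulo_owner(k, new_servers) for k in keys
)

ring_old, ring_new = HashRing(old_servers), HashRing(new_servers)
moved_keys_ring = [k for k in keys if ring_old.owner(k) != ring_new.owner(k)]
```

With modulo placement most keys end up on a different server once the pool grows, so most of the data would have to be copied. With the ring, only the keys that now belong to the new server move, and everything else stays where it was.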
Another approach is to use a distributed database. These generally come under the names NoSQL or NewSQL and take a variety of approaches. Some, like MongoDB, have a sharding system to manage this mapping but require manual steps to add servers. Cassandra has a more flexible clustering scheme, called a chorded architecture. Systems like Couchbase and Aerospike use a random distribution mechanism that removes the need for a shard layer. Some of these databases can exceed 100,000 to 200,000 requests per second per server, with the lateral scale to add new servers, which is enough for very large operations. With this style of clustering, you can often get a higher level of redundancy and reliability.
Other distributed approaches represent data in a more efficient way, like a graph database. If you have a problem that is better represented as a graph, then a clustered graph database may be more appropriate.

how to scale databases

Can someone give me a quick rundown on the old and latest research on scaling databases or storage?
I have heard of master/slave. What else is there? Thanks!
In general there are two ways to scale a database, horizontal and vertical (which, if the design of your software and database is right, may be mixed together).
Vertical pretty much means bigger computers: more RAM, more CPU, faster disks, etc.
Horizontal means spreading the load across many computers. One example is sharding; another is the use of different machines for different data (one database for customer data, another for product data, etc.).
I am not sure what you mean by master/slave. It is a concept that has more to do with backup and failover than with scalability.

How do the newer database models achieve better scalability and performance as compared to a traditional RDBMS implementation?

We have
BigTable from Google,
Hadoop, actively contributed by Yahoo,
Dynamo from Amazon
all aiming towards one common goal - making data management as scalable as possible.
By scalability I understand that the cost of usage should not go up drastically when the size of the data increases.
RDBMSs are slow when the amount of data is large, as the number of indirections invariably increases, leading to more I/O.
How do these custom, scalability-friendly data management systems solve the problem?
This is a figure from this document explaining Google BigTable:
Looks the same to me. How is the ultra-scalability achieved?
The "traditional" SQL DBMS market really means a very small number of products, which have traditionally targeted business applications in a corporate setting. Massive shared-nothing scalability has not historically been a priority for those products or their customers. So it is natural that alternative products have emerged to support internet scale database applications.
This has nothing to do with the fact that these new products are not "Relational" DBMSs. The relational model can scale just as well as any other model. Arguably the relational model suits these types of massively scalable applications better than say, network (graph based) models. It's just that the SQL language has a lot of disadvantages and no-one has yet come up with suitable relational NOSQL (non-SQL) alternatives.
Speaking specifically to your question about Bigtable, the difference is that the hierarchy in the diagram above is all there is. Each Bigtable tabletserver is responsible for a set of tablets (contiguous row ranges from a table); the mapping from row range to tablet is maintained in the metadata table, while the mapping from tablet to tabletserver is maintained in the memory of the Bigtable master. Looking up a row, or a range of rows, requires looking up the metadata entry (which will almost certainly be in memory on the server that hosts it), then using that to look up the actual row on the server responsible for it, resulting in only one or a few disk seeks.
In a nutshell, the reason this scales well is because it's possible to throw more hardware at it: given enough resources, the metadata is always in memory, and thus there's no need to go to disk for it, only for the data (and not always for that, either!).
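The two-step lookup described in this answer can be sketched with a sorted list and a dictionary. This is only a toy model of the idea, not Bigtable's actual data structures or API; the tablet boundaries, tablet names and server names are invented, and row keys are assumed to sort below the sentinel `"\uffff"`.

```python
import bisect

# Hypothetical metadata table: tablet i holds all row keys up to and
# including TABLET_END_KEYS[i] that are not covered by an earlier tablet.
TABLET_END_KEYS = ["g", "n", "t", "\uffff"]
TABLET_IDS = ["tablet-1", "tablet-2", "tablet-3", "tablet-4"]

# Hypothetical in-memory mapping from tablet to the server hosting it.
TABLET_TO_SERVER = {
    "tablet-1": "tabletserver-a",
    "tablet-2": "tabletserver-b",
    "tablet-3": "tabletserver-a",
    "tablet-4": "tabletserver-c",
}


def locate(row_key):
    """Return (tablet, server) for a row key using two in-memory lookups."""
    index = bisect.bisect_left(TABLET_END_KEYS, row_key)  # row range -> tablet
    tablet = TABLET_IDS[index]
    return tablet, TABLET_TO_SERVER[tablet]               # tablet -> server
```

Both lookups are cheap in-memory operations, so only reading the row itself may need to go to disk. Adding capacity means splitting tablets and spreading them over more servers, which changes the two tables but not the lookup path.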
It's about using cheap commodity hardware to build a network/grid/cloud and spread the data and load (for example using map/reduce).
RDBMS databases seem to me like software (originally) designed to run on one supercomputer. You can use various hard drive arrays and DB clusters, but still...
The amount of data has increased, so there's one more reason to design new data stores with this in mind: scalability, high availability, terabytes of data.
Another thing: if you build a grid/cloud from cheap servers, it's fault tolerant because you store all data at three (?) different locations, and at the same time it's cheap.
Back to your pictures: the first one is (typically) from one computer, the second one from a network of computers.
One theoretical answer on scalability is at http://queue.acm.org/detail.cfm?id=1394128 - the ACID guarantees are expensive. See http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf for a counter-argument.
In fact just surviving power failures is expensive. Years ago I compared MySQL against Oracle. MySQL was almost unbelievably faster than Oracle, but we couldn't use it. MySQL in those days was built on top of Berkeley DB, which was miles faster than Oracle's full-blown log-based database, but if the power went off while Berkeley DB-based MySQL was running, it was a manual process to get the database consistent again when the power came back, and you'd probably lose recent updates for good.

Can you recommend a database that scales horizontally? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
Generally the database server is the biggest, most expensive box we have to buy as scaling vertically is the only option. Are there any databases that scale well horizontally (i.e. across multiple commodity machines) and what are the limitations in this approach?
Oracle RAC is not horizontally scalable at all, because all Oracle instances share the same data storage. Yes, with a SAN you can get a large DB, but it is still a scale-up approach. To scale out (horizontally), you have to partition your data. You can partition by function, which means putting different groups of tables in different databases, or partition per table, which means splitting one table into multiple subtables with the same schema stored in different databases. Either way you get a scale-out solution. There are many resources on this; sharding has been a buzzword in the Web 2.0 architecture blogosphere for a while.
Because sharding is not directly supported by the database itself, you have to build your own solution. But as I said, there are many lessons available already. For Oracle, table partitioning is possible. For MySQL, check this question.
Oracle RAC -- Real Application Cluster
This works nicely, you just add boxes to your cluster. You can fail over from one box to the other. It's not replication, all the boxes are part of the same logical unit.
It's pretty spendy, of course.
Don't worry, good solutions are coming!
CouchDB and Hypertable are open source and still in alpha, but they are clearly designed to make scaling on commodity hardware simple. They work pretty well, and may change how you think about databases.
Also, if it's okay to let someone else do the distributing for you, Google AppEngine and Amazon SimpleDB are extremely cheap distributed database services, though they're both in beta right now so strict limitations are imposed.
There are storage techniques such as JavaSpaces (or a commercial implementation such as Gigaspaces) which provide highly scalable, fast & secure access to objects.
There are also distributed caching systems such as memcached, which offer a similar approach.
Of course, neither of these is a true database, but they can work in conjunction with databases to offer a large amount of horizontal scalability, given a suitable architecture. The real problem is that if you want all of the ACID goodness that comes with a database, there are certain unavoidable performance penalties. The only way out is to figure out the bits where you don't need ACID, and use other technologies to service those bits.
Oracle RAC is the Rolls-Royce of databases, allowing extra hardware nodes to be added relatively easily and providing hardware failover.
However, your commodity hardware costs will be dwarfed by the licence costs.
Why do you feel you need horizontal scaling? A multi-core server with 40GB RAM and SAN storage can support a very sizeable DB installation.
Can you provide any sizing and expected activity information to allow better understanding of your problem?
If you do go down the RAC route, it's worth remembering that it doesn't scale horizontally forever. Even the sales guys admit 90% of RAC customers run 4 nodes or fewer. Beyond that you get diminishing returns. So RAC may work for you, but it's not guaranteed to be the answer.
MySQL: http://www.mysql.com/why-mysql/scaleout.html
Limitations are that it works best with read-mostly workloads. You typically have one 'master' that receives all the writes, and many 'slaves' that replicate the writes. Then you distribute the reads over all the databases.
MySQL replication is asynchronous, so you will probably have to deal with time lag problems (you write to the master, and then read from a slave before the write has been replicated).
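The time lag problem can be reproduced in a few lines. This is a toy simulation, not MySQL's replication protocol: the `Primary` and `Replica` classes (playing the roles of the 'master' and a 'slave' above) and the explicit `apply_from` step are assumptions that stand in for the asynchronous shipping of writes.

```python
from collections import deque


class Primary:
    """Receives all writes and queues them for asynchronous replication."""

    def __init__(self):
        self.data = {}
        self.log = deque()  # writes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Serves reads; only sees a write after it has been replicated."""

    def __init__(self):
        self.data = {}

    def apply_from(self, primary):
        # Stand-in for the replication thread catching up with the primary.
        while primary.log:
            key, value = primary.log.popleft()
            self.data[key] = value


primary, replica = Primary(), Replica()
primary.write("profile:7", "new bio")
stale = replica.data.get("profile:7")  # None: the write has not arrived yet
replica.apply_from(primary)
fresh = replica.data.get("profile:7")  # "new bio" once replication caught up
```

A common workaround is to send a user's reads to the primary for a short time after that user has written something, so they always see their own changes.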
Netezza and other datawarehouse appliances scale this way, but they are not good for OLTP and web app workloads.
The Oracle route for scaling across multiple machines is called Real Application Clusters (Oracle RAC). There's no end of documentation on this elsewhere; you might try starting at http://www.oracle.com/database/rac_home.html.
MongoDB is one of the best databases that scale horizontally.
Oracle Real Application Clusters. If you want the best then take a look at it.
If you seriously think you will outscale a decent multicore "Big Iron" box, then think about partitioning your data. This is a good, database-agnostic way to scale out.
All databases which scale horizontally will come at a serious cost.
Unless you have mega $$'s to throw at the problem, forget about RAC. While it's very good, it's VERY expensive once you scale beyond 2 nodes.
You might look at DashDB for OLAP -- IBM pairs it with Cloudant for OLTP.
https://www.ibm.com/developerworks/community/blogs/5things/entry/5_things_to_know_about_dashdb_placeholder?lang=en
