Going through the documentation I did find cross-region replication for the storage layer, but nothing about the compute layer of Snowflake. I did not see any mention of availability options for virtual warehouses. If a whole AWS region goes down, the database will still be available for serving queries, but what about the virtual warehouse? Do I need to create a new one while the region is down, or is there a way to have a "backup" virtual warehouse in a different AWS region?
A virtual warehouse is essentially a compute cluster (for example, AWS EC2 instances if hosted on AWS). Virtual warehouses are not persistent, i.e. when you suspend a warehouse, its compute is returned to the AWS/Azure/GCP pool, and when you resume it, compute is allocated from the pool.
When a cloud region goes down, virtual warehouses will be allocated and created from the AWS/Azure/GCP pool in the backup region.
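To make that lifecycle concrete, here is a minimal sketch in Snowflake SQL (the warehouse name is hypothetical):

```sql
-- Hypothetical warehouse name. Suspending returns the compute to the cloud
-- provider's pool; resuming allocates fresh compute from the pool.
ALTER WAREHOUSE my_wh SUSPEND;
ALTER WAREHOUSE my_wh RESUME;

-- AUTO_SUSPEND / AUTO_RESUME make this lifecycle automatic:
ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
```

Because no data lives in the warehouse itself, recreating one is just a DDL statement, which is why Snowflake can treat them as disposable.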
The documentation here states clearly what can, and what can't, be replicated:
Snowflake Replication
For example, it states:
Currently, replication is supported for databases only. Other types of objects in an account cannot be replicated. This list includes:
Users
Roles
Warehouses
Resource monitors
Shares
When you set up an environment (into which you are going to replicate a database from another account), you also need to set up the roles, users, warehouses, etc. yourself.
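For illustration, here is a rough sketch of replicating a database to a secondary account and recreating the non-replicated warehouse there; all account, database, and warehouse names below are hypothetical:

```sql
-- In the primary account: allow the database to be replicated
-- to the DR account (myorg, primary_acct, dr_acct are placeholders).
ALTER DATABASE prod_db ENABLE REPLICATION TO ACCOUNTS myorg.dr_acct;

-- In the DR account: create the secondary database and pull changes.
CREATE DATABASE prod_db AS REPLICA OF myorg.primary_acct.prod_db;
ALTER DATABASE prod_db REFRESH;

-- Warehouses are not replicated, so they must be created here explicitly.
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 300
  AUTO_RESUME    = TRUE;
```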
The 'The Snowflake Elastic Data Warehouse' (2016) paper, in paragraph '4.2.1 Fault Resilience', reports:
Virtual Warehouses (VWs) are not distributed across AZs. This choice is for performance reasons. High network throughput is critical for distributed query execution, and network throughput is significantly higher within the same AZ. If one of the worker nodes fails during query execution, the query fails but is transparently re-executed, either with the node immediately replaced, or with a temporarily reduced number of nodes. To accelerate node replacement, Snowflake maintains a small pool of standby nodes. (These nodes are also used for fast VW provisioning.)
If an entire AZ becomes unavailable though, all queries running on a given VW of that AZ will fail, and the user needs to actively re-provision the VW in a different AZ.
With full-AZ failures being truly catastrophic and exceedingly rare events, we today accept this one scenario of partial system unavailability, but hope to address it in the future.
Related
What is database clustering? If you allow the same database to be on 2 different servers, how do they keep the data synchronized between them? And how does this differ from load balancing from a database server perspective?
Database clustering is a bit of an ambiguous term: some vendors consider a cluster to be two or more servers sharing the same storage, while others call a cluster a set of replicated servers.
Replication defines the method by which a set of servers remains synchronized without having to share storage, which allows them to be geographically dispersed. There are two main ways of going about it:
master-master (or multi-master) replication: Any server can update the database. It is usually taken care of by a different module within the database (or in some cases by entirely separate software running on top of it).
The downside is that it is very hard to do well, and some systems lose ACID properties in this mode of replication.
The upside is that it is flexible: you can survive the failure of any server while still being able to update the database.
master-slave replication: There is only a single copy of the authoritative data, which is then pushed to the slave servers.
The downside is that it is less fault tolerant: if the master dies, no further changes reach the slaves.
The upside is that it is easier to do than multi-master, and it usually preserves ACID properties.
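As a concrete illustration of the master-slave variant, here is a minimal sketch using MySQL's classic replication commands; hostnames, credentials, and binlog coordinates are placeholders:

```sql
-- On the master (which must have binary logging enabled and a unique
-- server_id in its config): create an account the slave replicates through.
CREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
SHOW MASTER STATUS;  -- note the current binary log file and position

-- On the slave: point it at the master's binary log and start replicating.
CHANGE MASTER TO
  MASTER_HOST     = 'master.example.com',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = 'repl_password',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 4;
START SLAVE;
SHOW SLAVE STATUS;  -- verify the slave is connected and applying events
```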
Load balancing is a different concept: it consists of distributing the queries sent to those servers so that the load is spread as evenly as possible. It is usually done at the application layer (or with a connection pool). The only direct relation between replication and load balancing is that you need some replication to be able to load balance; otherwise you'd have a single server.
From a SQL Server point of view:
Clustering gives you an active-passive configuration. Meaning, in a 2-node cluster, one node is active (serving) and the other is passive (waiting to take over when the active node fails). It's high availability from a hardware point of view.
You can have an active-active cluster, but it will require multiple instances of SQL Server running on each node (i.e. instance 1 on node A failing over to instance 2 on node B, and instance 1 on node B failing over to instance 2 on node A).
Load balancing (at least from a SQL Server point of view) does not exist (at least not in the same sense as web server load balancing). You can't balance load that way. However, you can split your application to run some databases on server 1 and some databases on server 2, etc. This is the primary means of "load balancing" in the SQL world.
Clustering uses shared storage of some kind (a drive cage or a SAN, for example), and puts two database front-ends on it. The front end servers share an IP address and cluster network name that clients use to connect, and they decide between themselves who is currently in charge of serving client requests.
If you're asking about a particular database server, add that to your question and we can add details on their implementation, but at its core, that's what clustering is.
Database clustering is essentially a mode of synchronous replication between two or more nodes, with the added functionality of fault tolerance, in a shared-nothing architecture. Shared-nothing means that the individual nodes don't share any physical resources, such as disk or memory.
As far as keeping the data synchronized is concerned, there is a management server to which all the data nodes and the SQL nodes are connected to achieve this (talking specifically about MySQL Cluster).
Now about the differences: load balancing is just one result that can be achieved through clustering; the others include high availability, scalability, and fault tolerance.
The system in question is for a company with multiple locations. Unreliable internet speed/availability at some locations has led us down the path of a local server at each location, off which that location runs, plus a central server.
The role of the local server is to let each location keep running whether or not it is connected to the outside world, and to eliminate high latency when the connection speed is less than optimal.
The role of the central server is two-fold:
Configuration, policy, user, etc. management. For example, new products, price changes, promotions, user changes, etc. are done on the central server and then distributed to the local servers so they have the most up-to-date info.
Centralize all data created at each location to run reports, analytics and warehouse data.
The question of how much data to keep on the local server is debatable. For example, some processes depend on more than just that one location, like customer loyalty, so a query must be run against the central server to check user activity and determine incentives. On the other hand, the active customer base should be within the scope of the local server's data.
I lack experience with these types of distributed systems. My question is: what database should we use to facilitate this type of setup, ideally incorporating the functionality to perform the data syncs to/from the central server automatically, without much coding?
Master-Slave Replication:
In this type of replication, one server (the master) accepts writes and replicates the changes to read replicas (slaves).
Characteristics
Asynchronous
Read Scalability
The master is a single point of failure (SPOF) for all the nodes
Master-Master:
In this setup, all the database servers accept reads and writes and synchronize with each other.
Characteristics
Synchronous (hopefully)
Read and Write Scalability
Performance is worse than Master-Slave
No SPOF
Master-Master is harder to set up and maintain, and there is a possibility of id collisions (see the sketch below).
Any popular database server these days supports the features above.
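Regarding the id collisions mentioned above: a common mitigation in MySQL, sketched here, is to give each master a disjoint auto-increment sequence so the two never hand out the same id:

```sql
-- On master 1: generate ids 1, 3, 5, ...
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset    = 1;

-- On master 2: generate ids 2, 4, 6, ...
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset    = 2;
```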
I would like to know how replication works in a distributed database. It would be nice if this could be explained in a thorough, yet easy to understand way.
It would also be nice if you could make a comparison between distributed transactions and distributed replication.
Single point of failure
The database server is a central part of an enterprise system, and, if it goes down, service availability might get compromised.
If the database server is running on a single server, then we have a single point of failure. Any hardware issue (e.g., disk drive failure) or software malfunction (e.g., driver problems, malfunctioning updates) will render the system unavailable.
Limited resources
If there is a single database server node, then vertical scaling is the only option when it comes to accommodating a higher traffic load. Vertical scaling, or scaling up, means buying more powerful hardware, which provides more resources (e.g., CPU, Memory, I/O) to serve the incoming client transactions.
Up to a certain hardware configuration, vertical scaling can be a viable and simple solution to scale a database system. The problem is that the price-performance ratio is not linear, so after a certain threshold, you get diminishing returns from vertical scaling.
Another problem with vertical scaling is that, in order to upgrade the server, the database service needs to be stopped. So, during the hardware upgrade, the application will not be available, which can impact underlying business operations.
Database Replication
To overcome the aforementioned issues associated with having a single database server node, we can set up multiple database server nodes. The more nodes, the more resources we will have to process incoming traffic.
Also, if a database server node is down, the system can still process requests as long as there are spare database nodes to connect to. For this reason, upgrading the hardware or software of a given database server node can be done without affecting the overall system availability.
The challenge of having multiple nodes is data consistency. If all nodes are in-sync at any given time, the system is Linearizable, which is the strongest guarantee when it comes to data consistency across multiple registers.
The process of synchronizing data across all database nodes is called replication, and there are multiple strategies that we can use.
Single-Primary Database Replication
The Single-Primary Replication scheme works as follows:
The primary node, also known as the Master node, is the one accepting writes while the replica nodes can only process read-only transactions. Having a single source of truth allows us to avoid data conflicts.
To keep the replicas in-sync, the primary node must provide the list of changes made by all committed transactions.
Relational database systems have a Redo Log, which contains all data changes that were successfully committed.
PostgreSQL uses the WAL (Write-Ahead Log) records to ensure transaction Durability and for Streaming Replication.
Because the storage engine is separated from the MySQL server, MySQL uses a separate Binary Log for replication. The Redo Log is generated by the InnoDB storage engine, and its goal is to provide transaction Durability, while the Binary Log is created by the MySQL Server and stores logical logging records, as opposed to the physical logging records created by the Redo Log.
By applying the same changes recorded in the WAL or Binary Log entries, the replica node can stay in-sync with the primary node.
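To make this concrete, here is a rough sketch of the primary-side setup for streaming replication in PostgreSQL (recent versions; the role name and settings are illustrative, not a complete recipe):

```sql
-- Create a role the replica will use to stream WAL records.
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replica_password';

-- Ensure enough information is written to the WAL for a standby to replay,
-- and allow several WAL sender processes.
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 5;

-- Once a standby is attached, each connected replica shows up here,
-- along with how far it has replayed the primary's WAL.
SELECT client_addr, state, replay_lsn FROM pg_stat_replication;
```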
Horizontal scaling
The Single-Primary Replication provides horizontal scalability for read-only transactions. If the number of read-only transactions increases, we can create more replica nodes to accommodate the incoming traffic.
This is what horizontal scaling, or scaling out, is all about. Unlike vertical scaling, which requires buying more powerful hardware, horizontal scaling can be achieved using commodity hardware.
On the other hand, read-write transactions can only be scaled up (vertical scaling) as there is a single primary node.
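Since read scaling depends on sending read-only traffic to the replicas, the application (or its connection pool) needs a way to tell the node types apart. A rough PostgreSQL-flavoured sketch, assuming the streaming replication setup above ('orders' is a hypothetical table):

```sql
-- Returns true on a streaming replica, false on the primary, so a pool
-- can route writes to the primary and reads to the replicas.
SELECT pg_is_in_recovery();

-- Read-only work can also be declared explicitly, making it safe to
-- run on a replica:
START TRANSACTION READ ONLY;
SELECT count(*) FROM orders;
COMMIT;
```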
I would recommend initially spending time reviewing the MySQL Docs on Replication. It's a good example of database replication. They are here:
http://dev.mysql.com/doc/refman/5.5/en/replication.html
Covering the entire scope of your question seems like too much for one question.
If you have some specific questions, please feel free to post them. Thanks!
Clustrix is a distributed database with a shared-nothing architecture that supports both distributed transactions and replication. There is some technical documentation available that describes data distribution, the distributed evaluation model, and built-in fault tolerance, as well as an overview of the architecture.
As a MySQL replacement, Clustrix implements MySQL's replication policy and produces binlogs in the MySQL format, which are serialized so that Clustrix can act as either a master or a slave to MySQL.
I have a question for the DBAs out there: if I scale from a single web/DB server setup to a two web/two DB server setup with a load balancer in front of the web servers to route incoming queries evenly, how do solutions like MySQL Cluster work so that a change made to one DB server is immediately known to the other (otherwise, users routed to the other DB server won't see the data, or will see outdated data), or at least so that the other web server is made aware that it's reading "dirty data" and should try again in X seconds to get up-to-date data?
Thank you.
There are TWO ways of doing this:
Active/Active or Active/Passive.
Active/Passive is the most prevalent.
The data is kept in sync on the passive node.
The cluster configuration is useful in as much as, when the active node goes down, the passive node takes over immediately, hence no downtime.
The clustering continuously synchronises the two nodes in the cluster.
I work with SQL Server, but I think the basic premise of clustering is the same for MySQL: no (or no noticeable) downtime on hardware failure.
EDIT: Additionally, the clustering software handles the synchronisation; you don't need to worry about it. You view the cluster nodes as a virtual directory, which behaves like one server in Windows.
Here is a document explaining this:
http://www.sql-server-performance.com/articles/clustering/clustering_intro_p1.aspx
In Windows server clustering (to be distinguished from High Performance Clustering), there is a shared external storage array. The active node takes ownership/control of the storage, and when that node fails, the storage 'fails over' to the previously passive node (which is now the active node). There are also different schemes that allow for independent storage at each node, vs. shared storage. However, these require the application to have enough intelligence to know that it is clustered, and keep the two storage sets in sync.
Clustering is also where a number of nodes handle the workload; this is sometimes called active/active clustering, i.e. all the nodes share the workload and are active. This is normally handled by specialist software like Oracle RAC (RAC#Wikipedia) for the Oracle RDBMS. RAC allows Oracle to scale to very large workloads.