How does Snowflake do instantaneous resizes?

I was using Snowflake and I was surprised at how it is able to do instantaneous resizes. Here is a short 10-second video of how it instantly does a resize, and the query is still 'warm' the next time it is run (note that I have a CURRENT_TIMESTAMP in the query so it never returns from the result cache):
How is Snowflake able to do instantaneous resizes (completely different from something like Redshift)? Does this mean that it just has a fleet of servers that are always on, and a resize is just a virtual allocation of memory/CPU to run that task? Is the underlying data stored on a shared disk or in memory?

To answer your question about resizing in short: yes, you are absolutely right.
As far as I know, Snowflake manages a pool of running servers in the background, and servers from this pool can be assigned to any customer.
Consequence: a resize for you, say from S to XS, is just a reallocation of servers from that pool.
Most probably the Virtual Private Snowflake edition behaves differently, as those accounts don't share resources (e.g. virtual warehouses) with accounts outside that VPS. More info: https://docs.snowflake.com/en/user-guide/intro-editions.html#virtual-private-snowflake-vps
Regarding your storage question:
Snowflake's storage layer is basically a storage service, e.g. Amazon S3, where Snowflake saves the data in columnar format, to be more precise in micro-partitions. More information regarding micro-partitions can be found here: https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
Your virtual warehouse accesses this storage layer (remote disk) or, if the query was run before, a cache. There is a local disk cache (your virtual warehouse using SSD storage) and a result cache (available across virtual warehouses for queries within the last 24 hours): https://community.snowflake.com/s/article/Caching-in-Snowflake-Data-Warehouse
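For illustration, the two caches can be observed with a pair of queries like the following (a minimal sketch; the table name is hypothetical):
-- A repeated, deterministic query can be served from the result cache,
-- with no warehouse compute at all:
SELECT COUNT(*) FROM my_table;
-- Adding CURRENT_TIMESTAMP makes every execution unique, so the result
-- cache is bypassed and the warehouse must use its local disk cache or
-- fetch micro-partitions from remote storage:
SELECT COUNT(*), CURRENT_TIMESTAMP FROM my_table;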

To extend the existing answer: ALTER WAREHOUSE in the standard setup is a non-blocking statement, which means it returns control as soon as it is submitted.
ALTER WAREHOUSE
WAIT_FOR_COMPLETION = FALSE | TRUE
When resizing a warehouse, you can use this parameter to block the return of the ALTER WAREHOUSE command until the resize has finished provisioning all its servers. Blocking the return of the command when resizing to a larger warehouse serves to notify you that your servers have been fully provisioned and the warehouse is now ready to execute queries using all the new resources.
Valid values
FALSE: The ALTER WAREHOUSE command returns immediately, before the warehouse resize completes.
TRUE: The ALTER WAREHOUSE command will block until the warehouse resize completes.
Default: FALSE
For instance:
ALTER WAREHOUSE <warehouse_name> SET WAREHOUSE_SIZE = XLARGE WAIT_FOR_COMPLETION = TRUE;
EDIT:
The Snowflake Elastic Data Warehouse
3.2.1 Elasticity and Isolation
VWs are pure compute resources. They can be created, destroyed, or resized at any point, on demand. Creating or destroying a VW has no effect on the state of the database. It is perfectly legal (and encouraged) that users shut down all their VWs when they have no queries. This elasticity allows users to dynamically match their compute resources to usage demands, independent of the data volume.
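As a minimal sketch of that elasticity (the warehouse name is hypothetical; the statements are standard Snowflake DDL):
-- Create a warehouse that suspends itself after 60 seconds of inactivity
-- and resumes automatically when a query arrives:
CREATE WAREHOUSE my_wh WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
-- Suspending or even dropping it has no effect on the stored data:
ALTER WAREHOUSE my_wh SUSPEND;
DROP WAREHOUSE my_wh;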

Related

Regarding the burden on Snowflake's database storage layer

Snowflake has an architecture consisting of the following three layers:
・Database storage
・Query processing
・Cloud services
I understand that it is possible to create a warehouse for each process in the query processing layer, and to scale up and scale out on a per-process basis.
However, when the created warehouses (processes) run in parallel, I am worried about the burden on database storage.
Even though query processing can be load-balanced, since there is only one database storage layer, wouldn't a lot of parallel processing hit the database storage and cause an error in the storage layer?
Sorry if I don't understand the architecture.
The storage is immutable, thus the query read load is just IO against the cloud provider's IO layer, so for all practical purposes it is infinitely scalable.
When any node updates a table, the new set of file partitions becomes known, and any warehouse that does not yet have the new partition parts does remote IO to read them.
The only downside to this pattern is that it does not scale well for transactional write patterns, which is why Snowflake is not targeted at those markets.
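As a sketch of what that means in practice (warehouse and table names are hypothetical), two warehouses can scan the same table independently, each doing its own IO against cloud storage rather than against a shared storage server:
CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'LARGE';
CREATE WAREHOUSE bi_wh WAREHOUSE_SIZE = 'XSMALL';
USE WAREHOUSE etl_wh;
SELECT COUNT(*) FROM sales;        -- heavy batch read
USE WAREHOUSE bi_wh;
SELECT SUM(amount) FROM sales;     -- dashboard read of the same table, separate compute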

PostgreSQL RDS running out of Free Storage Space while querying

We have a read-only PostgreSQL RDS database which is heavily queried. We don't perform any inserts/updates/deletes during normal traffic, but we can still see that we are running out of Free Storage Space, along with an increase in the Write IOPS metric. During this period, CPU usage is at 100%.
At some point, the storage space seems to be released.
Is this expected?
The issue was in the end related to our logs. log_statement was set to all, so every single query to PG would be logged. In order to troubleshoot long-running queries, we combined log_statement and log_min_duration_statement.
Since this is a read-only database we want to know if there is any insert/update/delete operation, so log_statement: mod (which logs DDL plus data-modifying statements); and we want to know which queries are taking longer than 1s: log_min_duration_statement: 1000.
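On a self-managed PostgreSQL instance those settings could be applied like this (a sketch; on RDS the same parameters are set through the DB parameter group instead, since ALTER SYSTEM is not available there):
ALTER SYSTEM SET log_statement = 'mod';              -- log DDL plus insert/update/delete
ALTER SYSTEM SET log_min_duration_statement = 1000;  -- log statements slower than 1 second
SELECT pg_reload_conf();                             -- apply without a restart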

Choosing a warehouse size for Tasks

I came across this question and was puzzled.
How to determine the size of virtual warehouse used for a task?
A. A root task may be executed concurrently (i.e. multiple instances); it is recommended to leave some margin in the execution window to avoid missing instances of execution.
B. Querying (select) the size of the stream content would help determine the warehouse size. For ex, if querying large stream content, use a larger warehouse size.
C. If using stored procedure to execute multiple SQL statements, it's best to test run the stored procedure separately to size the compute resource first.
D. Since task infrastructure is based on running the task body on a schedule, it's recommended to configure the virtual warehouse for automatic concurrency handling using Multicluster warehouse to match the task schedule.
Check the new "serverless" Snowflake tasks:
https://www.snowflake.com/blog/taking-serverless-to-task/
In this case, Snowflake will automatically determine what's the best warehouse size.
You can give a hint to Snowflake on what size to start with, using USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE.
Specifies the size of the compute resources to provision for the first run of the task, before a task history is available for Snowflake to determine an ideal size. Once a task has successfully completed a few runs, Snowflake ignores this parameter setting. https://docs.snowflake.com/en/sql-reference/sql/create-task.html
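A minimal sketch of a serverless task using that parameter (task name, schedule, and body are hypothetical):
CREATE TASK my_serverless_task
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'  -- initial-size hint; ignored after a few runs
  SCHEDULE = '5 MINUTE'
AS
  INSERT INTO target_table SELECT * FROM my_stream;
-- Omitting the WAREHOUSE parameter is what makes the task serverless;
-- tasks are created suspended, so resume it to start the schedule:
ALTER TASK my_serverless_task RESUME;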
The implications on billing are described here:
https://docs.snowflake.com/en/user-guide/admin-serverless-billing.html

Snowflake Virtual Warehouse replication options

Going through the documentation I found cross-region replication for the storage layer, but nothing about the compute layer of Snowflake. I did not see any mention of availability options for virtual warehouses. In case a whole AWS region goes down, the database will still be available for serving queries, but what about the virtual warehouse? Do I need to create a new one in case the region is still down, or is there a way to have a "back-up" virtual warehouse in a different AWS region?
A virtual warehouse is essentially a set of compute servers (for example, AWS EC2 instances if hosted on AWS). Virtual warehouses are not persistent, i.e. when you suspend a warehouse, its servers are returned to the AWS/Azure/GCP pool, and when you resume, servers are allocated from the pool.
When a cloud region goes down, virtual warehouses will be allocated and created from the AWS/Azure/GCP pool in the backup region.
The documentation here states clearly what can, and what can't, be replicated:
Snowflake Replication
For example, it states:
Currently, replication is supported for databases only. Other types of objects in an account cannot be replicated. This list includes:
Users
Roles
Warehouses
Resource monitors
Shares
When you set up an environment (into which you are going to replicate a database from another account) you also need to set up the roles, users, warehouses, etc.
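A sketch of that workflow (organization, account, database, and warehouse names are hypothetical):
-- In the source account: allow replication to the target account
ALTER DATABASE prod_db ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;
-- In the target account: create the secondary database and refresh it
CREATE DATABASE prod_db AS REPLICA OF myorg.primary_account.prod_db;
ALTER DATABASE prod_db REFRESH;
-- Warehouses are not replicated, so re-create them in the target account:
CREATE WAREHOUSE reporting_wh WAREHOUSE_SIZE = 'MEDIUM';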
The 'Snowflake Elastic Data Warehouse' (2016) paper, in paragraph '4.2.1 Fault Resilience', reports:
Virtual Warehouses (VWs) are not distributed across AZs. This choice is for performance reasons. High network throughput is critical for distributed query execution, and network throughput is significantly higher within the same AZ. If one of the worker nodes fails during query execution, the query fails but is transparently re-executed, either with the node immediately replaced, or with a temporarily reduced number of nodes. To accelerate node replacement, Snowflake maintains a small pool of standby nodes. (These nodes are also used for fast VW provisioning.)
If an entire AZ becomes unavailable though, all queries running on a given VW of that AZ will fail, and the user needs to actively re-provision the VW in a different AZ.
With full-AZ failures being truly catastrophic and exceedingly rare events, we today accept this one scenario of partial system unavailability, but hope to address it in the future.

Difference between a Database & a Storage Engine

I have a small doubt, can anyone help me to clear it?
My doubt is: what is the difference between a normal DB (what we see as a DB user) and a storage engine?
While searching about it I came across this point:
A database engine (or storage engine) is the underlying software component that a database management system (DBMS) uses to create, read, update and delete (CRUD) data from a database.
I just need a simple explanation...
hope I get it soon.
When you submit a query to SQL Server, a number of processes on the server go to work on that query. The purpose of all these processes is to manage the system such that it will SELECT, INSERT, UPDATE or DELETE the data. These processes kick into action every time we submit a query to the system.
The processes for meeting the requirements of queries break down roughly into two stages:
1. Processes that occur in the relational engine.
2. Processes that occur in the storage engine.
In the relational engine, the query is parsed and then processed by the query optimizer, which generates an execution plan. The plan is sent (in a binary format) to the storage engine, which then uses it as a basis to retrieve or modify the underlying data. The storage engine is where processes such as locking, index maintenance, and transactions occur.
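To see where the relational engine's work ends, SQL Server can return the optimizer's plan without ever invoking the storage engine; a sketch in T-SQL (the table name is hypothetical):
SET SHOWPLAN_XML ON;
GO
-- Parsed and optimized by the relational engine; the XML plan is returned
-- instead of being handed to the storage engine for execution:
SELECT * FROM dbo.Orders WHERE OrderID = 42;
GO
SET SHOWPLAN_XML OFF;
GO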
