Choosing a warehouse size for Tasks - snowflake-cloud-data-platform

I came across this question and was puzzled.
How do you determine the size of the virtual warehouse used for a task?
A. The root task may be executed concurrently (i.e. multiple instances), so it is recommended to leave some margin in the execution window to avoid missing instances of execution.
B. Querying (SELECT) the size of the stream content would help determine the warehouse size. For example, if querying large stream content, use a larger warehouse size.
C. If using a stored procedure to execute multiple SQL statements, it's best to test run the stored procedure separately to size the compute resources first.
D. Since the task infrastructure is based on running the task body on a schedule, it's recommended to configure the virtual warehouse for automatic concurrency handling using a multi-cluster warehouse to match the task schedule.

Check the new "serverless" Snowflake tasks:
https://www.snowflake.com/blog/taking-serverless-to-task/
In this case, Snowflake will automatically determine the best warehouse size.
You can give Snowflake a hint about which size to start with, using USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE:
Specifies the size of the compute resources to provision for the first run of the task, before a task history is available for Snowflake to determine an ideal size. Once a task has successfully completed a few runs, Snowflake ignores this parameter setting. https://docs.snowflake.com/en/sql-reference/sql/create-task.html
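For instance, a serverless task is created by omitting the WAREHOUSE parameter; here is a minimal sketch (the task, table, and schedule names are illustrative, not from the question):
-- Serverless task: no WAREHOUSE specified, Snowflake manages the compute
CREATE TASK my_serverless_task
  SCHEDULE = '5 MINUTE'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'  -- hint used only until run history exists
AS
  INSERT INTO my_target_table SELECT * FROM my_source_view;

ALTER TASK my_serverless_task RESUME;  -- tasks are created suspended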
The implications on billing are described here:
https://docs.snowflake.com/en/user-guide/admin-serverless-billing.html

Related

Snowflake - loading queries and controlling the execution of a sequence of steps

As part of our overall flow, data will be ingested into Azure Blob storage from InfluxDB and SQL DB. The thought process is to use Snowflake queries/stored procedures to load the data from Blob storage into Snowflake in a scheduled manner (batch process), and to use Tasks to schedule and orchestrate the execution using Snowflake Scripting. A few questions:
Dynamic queries can be created and executed based on a config table - e.g. a COPY command specifying the exact paths and files to load data from.
As part of Snowflake Scripting, our understanding is that a sequence of steps (queries / stored procedures) stored in a configuration DB can be executed in order, along with some control mechanism.
Are there possibilities for sending email notifications of error records by loading them into a table, or should this be handled outside of Snowflake after the data load process using Azure Data Factory / Logic Apps?
Is the above approach possible, and are there any limitations to doing it this way? Are there any alternative approaches that could be considered?
You can dynamically generate and execute queries with a stored procedure. You can chain activities within a single SP's logic, or with linked tasks running separate SPs. There is no functionality within Snowflake that will generate emails.
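A rough sketch of that pattern using Snowflake Scripting (the config table, procedure, task, and warehouse names below are hypothetical):
CREATE OR REPLACE PROCEDURE run_config_loads()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  stmt STRING;
  c1 CURSOR FOR SELECT copy_statement FROM load_config WHERE enabled = TRUE ORDER BY step_no;
BEGIN
  -- Execute each configured statement (e.g. a COPY INTO built per config row) in order
  FOR rec IN c1 DO
    stmt := rec.copy_statement;
    EXECUTE IMMEDIATE :stmt;
  END FOR;
  RETURN 'done';
END;
$$;

-- Schedule the procedure with a task; further tasks chained with AFTER can run follow-on SPs
CREATE OR REPLACE TASK load_task
  WAREHOUSE = my_wh
  SCHEDULE = '60 MINUTE'
AS
  CALL run_config_loads();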

How does Snowflake do instantaneous resizes?

I was using Snowflake and I was surprised how it is able to do instantaneous resizes. Here is a short 10s video of how it instantly does a resize, and the query is still 'warm' the next time it is run (note that I have a CURRENT_TIMESTAMP in the query so it never returns from cache):
How is Snowflake able to do instantaneous resizes (completely different than something like Redshift)? Does this mean that it just has a fleet of servers that are always on, and a resize is just a virtual allocation of memory/cpu to run that task? Is the underlying data stored on a shared disk or in memory?
To answer your question about resizing in short: Yes, you are absolutely right.
As far as I know, Snowflake manages a pool of running servers in the background, and all customers can be assigned compute from that pool.
Consequence: a resize for you from S to XS is a reallocation of servers from that pool.
Most probably the Virtual Private Snowflake edition behaves differently, as those accounts don't share resources (e.g. virtual warehouses) with other accounts outside that VPS. More info: https://docs.snowflake.com/en/user-guide/intro-editions.html#virtual-private-snowflake-vps
Regarding your storage-question:
Snowflake's storage layer is basically a storage service, e.g. Amazon S3. Here Snowflake saves the data in columnar format, or to be more precise, in micro-partitions. More information regarding micro-partitions can be found here: https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
Your virtual warehouse accesses this storage layer (remote disk) or - if the query was run before - a cache. There is a local disk cache (your virtual warehouse's SSD storage) and a result cache (available across virtual warehouses for queries run within the last 24 hours): https://community.snowflake.com/s/article/Caching-in-Snowflake-Data-Warehouse
To extend the existing answer: ALTER WAREHOUSE in the standard setup is a non-blocking statement, which means it returns control as soon as it is submitted.
ALTER WAREHOUSE
WAIT_FOR_COMPLETION = FALSE | TRUE
When resizing a warehouse, you can use this parameter to block the return of the ALTER WAREHOUSE command until the resize has finished provisioning all its servers. Blocking the return of the command when resizing to a larger warehouse serves to notify you that your servers have been fully provisioned and the warehouse is now ready to execute queries using all the new resources.
Valid values
FALSE: The ALTER WAREHOUSE command returns immediately, before the warehouse resize completes.
TRUE: The ALTER WAREHOUSE command will block until the warehouse resize completes.
Default: FALSE
For instance:
ALTER WAREHOUSE <warehouse_name> SET WAREHOUSE_SIZE = XLARGE WAIT_FOR_COMPLETION = TRUE;
EDIT:
The Snowflake Elastic Data Warehouse
3.2.1 Elasticity and Isolation
VWs are pure compute resources. They can be created, destroyed, or resized at any point, on demand. Creating or destroying a VW has no effect on the state of the database. It is perfectly legal (and encouraged) that users shut down all their VWs when they have no queries. This elasticity allows users to dynamically match their compute resources to usage demands, independent of the data volume.
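In practice that elasticity is just ordinary DDL; a small sketch (the warehouse name and settings are illustrative):
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 60       -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;     -- wake up automatically when a query arrives

ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';  -- resize on demand
ALTER WAREHOUSE analytics_wh SUSPEND;                       -- shut it down when idle
DROP WAREHOUSE analytics_wh;                                -- no effect on stored data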

Row processing data from Redshift to Redshift

We are working on a requirement where we want to fetch incremental data from one Redshift cluster "row wise", process it based on the requirement, and insert it into another Redshift cluster. We want to do it "row wise", not as a "batch operation". For that we are writing a generic service which will do row processing from Redshift -> Redshift. So it is like Redshift -> Service -> Redshift.
For inserting data, we will use INSERT queries. We will commit after a particular batch, not row by row, for performance.
But I am a bit worried about the performance of multiple INSERT queries. Or is there any other tool available which does this? There are many ETL tools available, but they all do batch processing. We want to process row wise. Can someone please suggest something?
Based on experience, I can guarantee that your approach will not be efficient. You can refer to this link for detailed best practices:
https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
But, I would suggest that you do as follows :
Write a Python script to unload the data from your source Redshift to S3 based on a query condition that filters data as per your requirement, i.e. based on some threshold like time, date, etc. This operation should be fast, and you can schedule this script to execute every minute or every couple of minutes, generating multiple files.
Now you basically have a continuous stream of files in S3, where the size of each file or batch can be controlled by the frequency of the previous script.
Now all you have to do is set up a service that keeps polling S3 for objects/files as and when they are created, processes them as needed, and puts each processed file in another bucket. Let's call this bucket B2.
Set up another Python script/ETL step that remotely executes a COPY command from bucket B2.
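The SQL those scripts would issue might look roughly like this (the bucket names, table names, IAM role ARN, and threshold column are assumptions for illustration):
-- Step 1: unload incremental rows from the source cluster to S3
UNLOAD ('SELECT * FROM source_table WHERE updated_at > ''2023-01-01 00:00:00''')
TO 's3://my-unload-bucket/incremental/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

-- Step 4: copy the processed files from bucket B2 into the target cluster
COPY target_table
FROM 's3://my-processed-bucket-b2/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;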
This is just an initial idea though. You have to evolve on this approach and optimize this. Best of luck!

SQL server limit resources for a stored procedure

I have a stored procedure that takes up too many server resources. I want to limit this query to use no more than, say, 10% of the resources. I've heard of Resource Governor, but it is only available in the Enterprise and Developer editions, and I have Standard edition.
Are there any alternatives to this, other than buying a better edition of SQL Server?
Define 'resources'. IO? CPU? Memory? Network? Contention? Log size? tempdb? And you cannot answer 'all'. You need to specify what resources are being consumed now by the procedure.
Your question is too generic to be answered. There is no generic mechanism to limit 'resources' on a query or procedure; even Resource Governor only limits some resources. You describe your problem as high-volume data manipulation over a long period of time, like tens of thousands of inserts and updates throughout the database, which points toward batching the processing. If the procedure does indeed do what you describe, then throttling its execution is probably not what you want, because you want to reduce the duration of the transactions (batches), not increase it.
And first and foremost: did you analyze the procedure's resource consumption as a performance problem, using a methodology like Waits and Queues? Only after you have done so and the procedure runs optimally (meaning it consumes the least resources required to perform its job) can you look into throttling the procedure (and most likely by that time the need to throttle will have magically vanished).
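To illustrate what batching means here, a minimal T-SQL sketch (the table and column names are made up):
-- Process rows in small chunks so each transaction stays short
DECLARE @rows INT;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;
    UPDATE TOP (5000) dbo.StagingOrders
    SET Processed = 1
    WHERE Processed = 0;
    SET @rows = @@ROWCOUNT;   -- capture before any other statement resets it
    COMMIT TRANSACTION;
END;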
You can use the MAXDOP query hint to restrict how many CPUs a given query can utilize. Barring the high-end bells and whistles you mentioned, I don't think there's anything that lets you put single-process-level constraints on memory allocation or disk usage.
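For example (the query itself is just an illustration):
SELECT CustomerId, SUM(Amount) AS Total
FROM dbo.Sales
GROUP BY CustomerId
OPTION (MAXDOP 2);   -- limit this query to 2 CPUs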

Does parallelising a stored procedure yield higher performance on clusters?

I'm currently researching ways to speed up and scale up a long-running matching job which currently runs as a stored procedure in MSSQL 2005. The matching involves multiple fields with many inexact cases. While I'd ultimately like to scale it up to large data sets outside of the database, I need to consider some shorter-term solutions as well.
Given that I don't know much about the internals of how stored procedures are run, I'm wondering whether it would be possible to split the process into parallel procedures by dividing the data set with a master procedure, which then kicks off sub-procedures that work on smaller data sets.
Would this yield any performance gains with a clustered database? Will MSSQL distribute the subprocs across the cluster nodes automatically and sensibly?
Perhaps it's better to have the master process in Java and call worker procedures through JDBC, which would presumably use cluster load balancing effectively? Aside from any arguments about maintainability, could this be faster?
You have a fundamental misunderstanding of what clustering means for SQL Server. Clustering does not allow a single instance of SQL Server to share the resources of multiple boxes. Clustering is a high availability solution that allows the functionality of one box to shift over to another standby box in case of a failure.
