Snowflake Task cost estimation

I have a generic question about Snowflake cost estimation.
I have a task which is scheduled to execute every 5 minutes:
CREATE TASK mytask1
  WAREHOUSE = mywh
  SCHEDULE = '5 MINUTE'
WHEN
  SYSTEM$STREAM_HAS_DATA('MYSTREAM')
AS
  INSERT INTO ... ;
If there is no new data in MYSTREAM, the task run is skipped.
However, will it still cost money, given that the task is still being scheduled?
Please suggest.

The only possible cost is a Cloud Services cost for the metadata check to see if the stream has data. Based on how Cloud Services credits are discounted, there is a really good chance that this will never be something you'll see a charge for. You can read up on the cloud services billing here:
https://docs.snowflake.com/en/user-guide/credits.html#cloud-services-credit-usage
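If you want to see how this nets out on your own account, a query along these lines should show it (a sketch, assuming your role can read the SNOWFLAKE.ACCOUNT_USAGE share; the adjustment column is negative and reflects the discount):

-- Daily cloud services usage vs. the adjustment (discount)
SELECT usage_date,
       SUM(credits_used_cloud_services) AS cloud_services_credits,
       SUM(credits_adjustment_cloud_services) AS adjustment,
       SUM(credits_used_cloud_services + credits_adjustment_cloud_services) AS billed_cloud_services
FROM snowflake.account_usage.metering_daily_history
GROUP BY usage_date
ORDER BY usage_date;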

From the CREATE TASK and SYSTEM$STREAM_HAS_DATA documentation:
Validating the conditions of the WHEN expression does not require a virtual warehouse. The validation is instead processed in the cloud services layer. A nominal charge accrues each time a task evaluates its WHEN condition and does not run. The charges accumulate each time the task is triggered until it runs. At that time, the charge is converted to Snowflake credits and added to the compute resource usage for the task run.
Generally the compute time to validate the condition is insignificant compared to task execution time. As a best practice, align scheduled and actual task runs as closely as possible. Avoid task schedules that are wildly out of synch with actual task runs. For example, if data is inserted into a table with a stream roughly every 24 hours, do not schedule a task that checks for stream data every minute. The charge to validate the WHEN expression with each run is generally insignificant, but the charges are cumulative.
Note that daily consumption of cloud services that falls below the 10% quota of the daily usage of the compute resources accumulates no cloud services charges.
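You can confirm the skipped runs yourself; something like this (a sketch, using the task name from the question) lists evaluations where the WHEN condition returned false:

SELECT name, scheduled_time, state
FROM TABLE(information_schema.task_history(task_name => 'MYTASK1'))
WHERE state = 'SKIPPED'
ORDER BY scheduled_time DESC;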


How does the run queue work in Snowflake? Is there a concept of a timeslice at all?

I am a newbie to Snowflake and the documentation is not clear.
Say I use a Large warehouse with a maximum of 5 concurrent queries.
There are 5 users who fire heavy-duty queries which may take many minutes to finish.
A 6th user has a simple query to execute.
Do the processes running those 5 queries yield at any point in time, or do they run to completion?
Will the 6th user have to wait till the timeout limit is reached and then attempt using a different virtual warehouse?
Thanks!
The queue is a first-in-first-out queue, like in most (all?) other databases. If a query is queued because other queries are consuming all the resources of the cluster, then it will have to wait until the other queries finish (or time out) before it can run. Snowflake won't pause a running query to "sneak in" a smaller one.
You could always resize the warehouse though to push through the query. Here is a good line from the documentation:
Single-cluster or multi-cluster (in Maximized mode): Statements are queued until already-allocated resources are freed or additional resources are provisioned, which can be accomplished by increasing the size of the warehouse.
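For example, bumping the warehouse up one size for the duration of the spike (assuming it is named mywh) is a single statement each way:

ALTER WAREHOUSE mywh SET WAREHOUSE_SIZE = 'XLARGE';
-- ... let the queued queries drain ...
ALTER WAREHOUSE mywh SET WAREHOUSE_SIZE = 'LARGE';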
This is actually a good question to ask, and understanding how this works in Snowflake will help you use it more optimally. As you already know, Snowflake uses virtual warehouses for compute, which are nothing but clusters of compute nodes. Each node has 8 cores. When you submit a query to a virtual warehouse, it is processed by one or more cores (depending on whether the query can be parallelized). So, if the virtual warehouse does not have a free core to execute the 6th query, that query will queue up. If you log on to the Snowflake UI and click on the Warehouses tab, you will see this queueing as the yellow color on the bars. You can also see it under QUEUED_OVERLOAD_TIME if you query the QUERY_HISTORY view.
Now, it is not a good thing for queries to queue up consistently. So the best practice is to have a multi-warehouse strategy: give every unique group of workloads a dedicated warehouse, so that you can scale it horizontally/vertically based on the query load of that workload.
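To put numbers on the queueing, and to stand up a dedicated warehouse per workload, something like this works (a sketch; the warehouse name and thresholds are illustrative, and QUEUED_OVERLOAD_TIME in the ACCOUNT_USAGE view is in milliseconds):

-- How much queueing happened in the last 7 days, per warehouse
SELECT warehouse_name,
       COUNT(*) AS queued_queries,
       AVG(queued_overload_time) / 1000 AS avg_queued_seconds
FROM snowflake.account_usage.query_history
WHERE queued_overload_time > 0
  AND start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name;

-- A dedicated, auto-scaling warehouse for one workload group
CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;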

Thoughts on reducing Snowflake cost by dropping unused objects

Part 1 #
As per the pricing policy of Snowflake, we pay based on usage and are not charged for resources we don't use. This is clear. However, I am trying to understand: is there any chance of reducing cost by dropping unused or rarely used warehouses, and users and roles that are not being used any more? I was hoping for some cost savings in terms of reducing the cloud services cost.
Part 2 #
Which is the most cost-effective way:
1) Allocating a separate warehouse to each team, who use the warehouse at specific times,
(or)
2) Allocating a single warehouse to all of them and monitoring the warehouse load closely, such that if we notice queued load on the warehouse we opt for the scale-out option (multi-cluster)?
Please suggest the best way so that we can reduce the overall cost.
There are only two major things you are charged for: disk and CPU, plus a couple of minor things like compile time and inter-region IO charges. Users, warehouses, and roles are just access control lists in the end, there to control CPU and disk usage.
Prior to per-second billing we found that using one warehouse for a couple of teams meant less wasted CPU billing, and to some degree that is almost still the case with the minimum 60-second billing. We have a shared x-small that most teams do dev on, and then spin up bigger warehouses to run one-off loads (and then shut them down), or have auto-scaling clusters to handle "normal load". We also use cron jobs to limit the "max size", so that in off-peak times we intentionally increase the latency of the total load to shift expenditure budget to peak times. Compared to the always-running clusters, our dev instances are single-digit percentages, so 1 or 2 warehouses is a rounding error.
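The cron-driven "max size" cap mentioned above is just an ALTER statement run on a schedule (a sketch; etl_wh is an illustrative name):

-- Off-peak: trade latency for cost by limiting scale-out
ALTER WAREHOUSE etl_wh SET MAX_CLUSTER_COUNT = 1;
-- Peak: allow scale-out again
ALTER WAREHOUSE etl_wh SET MAX_CLUSTER_COUNT = 4;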
The way we found the most value in reducing cost was to look at the bill, see what cost more $$ than we expected for the bang we were getting, and then experiment to see if there were lower-cost ways to reach the same end goal. Be it differently shaped tables that we multi-inserted into, or finding queries that had long execution times or pruned lots of rows (which might lead back to the first point). If you want to save dollars you have to watch/care how you are spending them, and make trade-offs.
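A good starting point for "looking at the bill" is credits by warehouse (a sketch, assuming access to the SNOWFLAKE.ACCOUNT_USAGE share):

SELECT warehouse_name,
       SUM(credits_used) AS credits_last_30_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_30_days DESC;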
Part #1
The existence of multiple warehouses will not incur any cost; cost only comes when a warehouse is utilized for compute. However, dropping unused objects will certainly ease the operational effort. Also, if a user exists and is not being used, it should fall under your security audit, and it is always better to disable a user instead of dropping them. Validate all downstream application ETL jobs/BI reports (if any) before dropping any users/roles.
Cloud services cost is an entirely different ball game; it follows the 10% rule. You only pay this amount when cloud services usage exceeds 10% of the warehouse usage on that day.
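Disabling rather than dropping is a one-liner (the user name here is illustrative):

ALTER USER inactive_user SET DISABLED = TRUE;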
Part #2
Snowflake suggests warehouses should be created based on activity. Please do not create warehouses to segregate teams/user groups; create users and roles for that.
What we observed:
During development, keep only one virtual warehouse; until a real requirement pops up (project-team-wise segregation for cost sharing, budgeting, or credit assessment), there is no need to have multiple warehouses.
Even for prod, activity-wise segregation is ideal: one each for ETL load / BI reporting / the data analytics team.
Thanks
Palash Chatterjee

How can we increase the speed of a single task execution in the GAE Task Queue?

I have a single large task running in one of the task queues. Sometimes the task takes more than 24 hours to execute. I have optimized my code as much as I can and have achieved some speedup.
The task performs the operation of inserting rows into the Datastore, which can number in the millions.
Is there any way to increase the speed of that task by allocating more resources or by making changes to the instance configuration?
Please advise.
You can get some speedup by choosing an instance type with a faster CPU (also more expensive) in the respective service/module config file.
But the percent reduction of the overall task duration depends significantly on the actual structure/operation of your app.
You'll get a speedup for what is actually executed by your instance (i.e. your app code), but not for the services executed by the GAE infra, like datastore and memcache RPCs for example - which can be significant.
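For example (a sketch for the first-generation runtime; the service name and scaling values are illustrative), the instance class is set in the service's .yaml file:

service: worker          # illustrative service name
runtime: python27
instance_class: B8       # larger class = faster CPU, higher hourly cost
basic_scaling:
  max_instances: 1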

Custom Metrics cron job Datastore timeout

I have written code that writes data to custom metrics in Cloud Monitoring from Google App Engine.
For that I store the data for some amount of time, say 15 min, in the Datastore, and then a cron job runs, gets the data from there, and plots it on the Cloud Monitoring dashboard.
Now my problem is: while fetching a huge amount of data to plot from the Datastore, the cron job may time out. I also wanted to know: what happens when a cron job fails?
Can it fail if the number of records is high? If it can, what alternatives do we have? How many records can a cron job safely process within the 10-minute timeout?
Please let me know if any other info is needed.
Thanks!
You can run your cron job on an instance with basic or manual scaling. Then it can run for as long as you need it.
A cron job is not retried. You need to implement this mechanism yourself.
A better option is to use deferred tasks. Your cron job should create as many tasks to process data as necessary and add them to the queue. In this case you don't have to redo the whole job - or remember a spot from which to resume, because tasks are automatically retried if they fail.
Note that with tasks you may not need to create basic/manual scaling instances if each task takes less than 10 minutes to execute.
NB: If possible, it's better to create a large number of tasks that execute quickly, as opposed to one or a few tasks that take minutes. This way you minimize wasted resources if a task fails, and have a smaller impact on other processes running on the same instance.
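A minimal sketch of that pattern for the first-generation Python runtime (MetricRecord and plot_records are illustrative names; the real API calls are deferred.defer and ndb cursors):

from google.appengine.ext import deferred, ndb

BATCH_SIZE = 500  # small batches keep each task well under the 10-minute limit

class MetricRecord(ndb.Model):  # illustrative model for the buffered points
    value = ndb.FloatProperty()
    recorded_at = ndb.DateTimeProperty()

def plot_records(records):
    # illustrative stub: push these points to the Cloud Monitoring API
    pass

def process_batch(cursor_urlsafe=None):
    """Plot one batch of records, then re-enqueue itself for the next batch."""
    cursor = ndb.Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    records, next_cursor, more = MetricRecord.query().fetch_page(
        BATCH_SIZE, start_cursor=cursor)
    plot_records(records)
    if more and next_cursor:
        # If this task fails, the queue retries it automatically --
        # only this batch is redone, not the whole 15-minute window.
        deferred.defer(process_batch, next_cursor.urlsafe())

def cron_handler():
    # The cron job itself does almost nothing: it just kicks off the chain.
    deferred.defer(process_batch)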

GAE - Execute many small tasks after a fixed time

I'd like to make a Google App Engine app that sends a Facebook message to a user a fixed time (e.g. one day) after they click a button in the app. It's not scalable to use cron or the task queue for potentially millions of tiny jobs. I've also considered implementing my own queue using a background thread, but that's only available using the Backends API as far as I know, which is designed for much larger usage and is not free.
Is there a scalable way for a free Google App Engine app to execute a large number of small tasks after a fixed period of time?
For starters, if you're looking to do millions of tiny jobs, you're going to blow past the free quota very quickly, any way you look at it. The free quota's meant for testing.
It depends on the granularity of your tasks. If you're executing a lot of tasks once per day, cron hooked up to a mapreduce operation (which essentially sends out a bunch of tasks on task queues) works fine. You'll basically issue a datastore query to find the tasks that need to be run, and send them out on the mapreduce.
If you execute this task thousands of times a day (every minute), it may start getting expensive because you're issuing many queries. Note that if most of those queries return nothing, the cost is still minimal.
The other option is to store your tasks in memory rather than in the datastore, that's where you'd want to start using backends. But backends are expensive to maintain. Look into using Google Compute Engine, which gives much cheaper VMs.
EDIT:
If you go the cron/datastore route, you'd store a new entity whenever a user wants to send a deferred message. Most importantly, it'd have a queryable timestamp for when the message should be sent, probably rounded to the nearest minute or the nearest 5 minutes, whatever you decide your granularity should be.
You would then have a cron job that runs at the set interval, say every minute. On each run it would build a query for all the messages it needs to send for the given minute.
If you really do have hundreds of thousands of messages to send each minute, you're not going to want to do it from the cron task. You'd want the cron task to spawn a mapreduce job that will fan out the query and spawn tasks to send your messages.
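A minimal sketch of that design (first-generation Python runtime; the model and handler names are illustrative):

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class DeferredMessage(ndb.Model):
    user_id = ndb.StringProperty()
    send_at = ndb.DateTimeProperty()  # rounded to the chosen granularity

def schedule_message(user_id, when):
    # Called when the user clicks the button; 'when' is now + one day,
    # rounded to the nearest minute.
    DeferredMessage(user_id=user_id, send_at=when).put()

def cron_tick(now):
    # Runs every minute via cron.yaml; fans each due message out to the
    # task queue instead of sending inline, so one slow send can't block
    # the rest.
    due_keys = DeferredMessage.query(
        DeferredMessage.send_at <= now).fetch(1000, keys_only=True)
    for key in due_keys:
        taskqueue.add(url='/tasks/send', params={'key': key.urlsafe()})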
