I know that DDL and SHOW operations do not consume compute credits. Is there a list someone has compiled of which operations in Snowflake do not consume compute credits? I appreciate your help.
There is no hard rule that Snowflake doesn't charge for DDL and SHOW operations. You are charged based on the cost of storage and the cost of the compute resources consumed.
Storage is billed per terabyte, compressed, per month; compute is billed in processing units, referred to as credits, consumed to run your queries.
Please refer to the following link for more details:
https://www.snowflake.com/pricing/
There is a list of statements that can be run in Snowflake without consuming compute (virtual warehouse) credits. It includes the following (a few illustrative statements follow the list):
- DDL statements
- Queries that hit the result cache
- RESULT_SCAN queries
- SHOW commands
- Some COUNT, MIN, and MAX queries that can be answered from table metadata alone
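For illustration, here is a minimal sketch of statements in those categories. The database, schema, table, and column names are made up, and whether a given statement actually avoids warehouse usage depends on the result cache and on whether it can be answered from metadata alone:

-- DDL: handled in the metadata layer, no virtual warehouse needed
CREATE TABLE my_db.my_schema.events_copy LIKE my_db.my_schema.events;

-- SHOW commands: served from the cloud services layer
SHOW TABLES IN SCHEMA my_db.my_schema;

-- Re-read the previous statement's output via the result cache
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- COUNT/MIN/MAX over whole columns can often be answered from
-- micro-partition metadata without scanning data
SELECT COUNT(*), MIN(event_date), MAX(event_date)
FROM my_db.my_schema.events;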
Snowflake announced in November 2019 that, starting in February 2020, some of these features that previously ran free of compute credits will incur Cloud Services billing in certain situations. Here's the recently published blog:
https://www.snowflake.com/blog/whats-new-with-the-snowflake-cloud-services-billing-model/
I have a generic question about Snowflake cost estimation.
I have a task which is scheduled to execute every 5 minutes:
CREATE TASK mytask1
WAREHOUSE = mywh
SCHEDULE = '5 minute'
WHEN
SYSTEM$STREAM_HAS_DATA('MYSTREAM')
AS
INSERT INTO ... ;
If there is no new data in MYSTREAM, the task will be skipped.
However, will it still cost any money, since the task keeps being scheduled?
Please suggest.
The only possible cost is a Cloud Services cost for the metadata check to see if the stream has data. Based on how Cloud Services credits are discounted, there is a really good chance that this will never be something you'll see a charge for. You can read up on the cloud services billing here:
https://docs.snowflake.com/en/user-guide/credits.html#cloud-services-credit-usage
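If you want to see how often the task is skipped versus actually run, here is a quick sketch. It assumes the INFORMATION_SCHEMA.TASK_HISTORY table function and uses MYTASK1 from the question:

SELECT name, scheduled_time, state
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
       TASK_NAME => 'MYTASK1',
       RESULT_LIMIT => 100))
ORDER BY scheduled_time DESC;
-- Rows with STATE = 'SKIPPED' are the evaluations where the WHEN
-- condition found no new stream data and the task did not run.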
From the CREATE TASK and SYSTEM$STREAM_HAS_DATA documentation:
Validating the conditions of the WHEN expression does not require a virtual warehouse. The validation is instead processed in the cloud services layer. A nominal charge accrues each time a task evaluates its WHEN condition and does not run. The charges accumulate each time the task is triggered until it runs. At that time, the charge is converted to Snowflake credits and added to the compute resource usage for the task run.
Generally the compute time to validate the condition is insignificant compared to task execution time. As a best practice, align scheduled and actual task runs as closely as possible. Avoid task schedules that are wildly out of synch with actual task runs. For example, if data is inserted into a table with a stream roughly every 24 hours, do not schedule a task that checks for stream data every minute. The charge to validate the WHEN expression with each run is generally insignificant, but the charges are cumulative.
Note that daily consumption of cloud services that falls below the 10% quota of the daily usage of the compute resources accumulates no cloud services charges.
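To check whether cloud services are actually showing up on your bill, here is a sketch against the ACCOUNT_USAGE share, assuming the METERING_DAILY_HISTORY view and its usual columns:

SELECT usage_date,
       SUM(credits_used_compute)              AS compute_credits,
       SUM(credits_used_cloud_services)       AS cloud_services_credits,
       SUM(credits_adjustment_cloud_services) AS cloud_services_adjustment,
       SUM(credits_billed)                    AS credits_billed
FROM snowflake.account_usage.metering_daily_history
GROUP BY usage_date
ORDER BY usage_date DESC;
-- The adjustment column reflects the daily 10% allowance; cloud services
-- only contribute to credits_billed when usage exceeds that threshold.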
Part #1
As per Snowflake's pricing policy, we pay based on usage and are not charged for resources we don't use. This is clear. However, I am trying to understand whether there is any chance of reducing cost if we drop unused or rarely used warehouses, and users and roles that are no longer used. I was looking for some cost savings in terms of reducing the cloud services cost.
Part #2
Which is the most cost-effective way?
1) Allocating a separate warehouse for each team that uses the warehouse at specific times,
(or)
2) Allocating a single warehouse for all of them and monitoring the warehouse load closely, so that if we notice queued load on the warehouse we opt for the scale-out option (multi-cluster) (S+S)?
Please suggest the best way so that we can reduce overall cost.
There are only two major things you are charged for, disk and CPU, plus a couple of minor things like compile time and inter-region IO charges. Users, warehouses, and roles are, in the end, just access controls that exist to govern CPU and disk usage.
Prior to per-second billing, we found that using one warehouse for a couple of teams meant less wasted CPU billing, and to some degree that is still the case with the 60-second minimum billing. We have a shared X-Small that most teams do development on, and then we spin up bigger warehouses to run one-off loads (and shut them down afterwards), or use auto-scaling clusters to handle the "normal load". We also use cron jobs to limit the "max size", so that in off-peak times we intentionally increase the latency of the total load and shift the expenditure budget to peak times. Compared to the always-running clusters, our dev instances are single-digit percentages of the bill, so one or two dev warehouses is a rounding error.
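As a hedged sketch of that kind of setup (the warehouse names, sizes, and thresholds are made up, and multi-cluster warehouses require the Enterprise edition), the parameters below are standard CREATE/ALTER WAREHOUSE options:

-- Shared dev warehouse: small, suspends quickly when idle
CREATE WAREHOUSE IF NOT EXISTS dev_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60          -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Auto-scaling warehouse for the normal load
CREATE WAREHOUSE IF NOT EXISTS load_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Cron-driven cap for off-peak hours (run from a scheduled job)
ALTER WAREHOUSE load_wh SET MAX_CLUSTER_COUNT = 2;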
The way we found the most value in reducing cost was to look at the bill, see what cost more than we expected for the bang we were getting, and then experiment to see whether there were lower-cost ways to reach the same end goal. That might be differently shaped tables that we multi-insert into, or finding queries that had long execution times or pruned lots of rows (which might lead back to the first point). If you want to save dollars you have to watch, and care, how you are spending them, and make trade-offs.
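A couple of sketches for that kind of review, assuming the ACCOUNT_USAGE views WAREHOUSE_METERING_HISTORY and QUERY_HISTORY (the lookback windows and limits are arbitrary):

-- Credits by warehouse over the last 30 days
SELECT warehouse_name, SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;

-- Longest-running queries: a good place to hunt for rework candidates
SELECT query_id, warehouse_name, total_elapsed_time / 1000 AS seconds, query_text
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 50;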
Part #1
The existence of multiple warehouses will not incur any cost; cost only comes when a warehouse is actually used for compute. However, dropping unused objects will certainly ease the operational effort. Also, if a user exists and is no longer used, it should fall under your security audit, and it is always better to disable the user instead of dropping it. Validate all downstream applications, ETL jobs, and BI reports (if any) before dropping any users or roles.
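A minimal sketch of disabling rather than dropping (the user name is hypothetical):

-- Disable instead of drop; access is blocked but grants and history remain
ALTER USER legacy_etl_user SET DISABLED = TRUE;

-- If a later audit confirms it is safe, the user can still be dropped:
-- DROP USER legacy_etl_user;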
Cloud services cost is an entirely different ball game; it follows the 10% rule. You only pay for cloud services on a given day when cloud services usage exceeds 10% of the warehouse (compute) usage for that day.
Part #2
Snowflake always suggests that warehouses should be created based on your activity. Please do not create warehouses to segregate teams or user groups; create users and roles for that, as in the sketch below.
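For example, a hedged sketch of segregating a team through a role on a shared warehouse (all object names here are made up):

CREATE ROLE IF NOT EXISTS analytics_team;
GRANT USAGE ON WAREHOUSE shared_wh TO ROLE analytics_team;
GRANT USAGE ON DATABASE analytics_db TO ROLE analytics_team;
GRANT ROLE analytics_team TO USER alice;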
What we observed:
During development, keep only one virtual warehouse; until a real requirement pops up (project-team-wise segregation for cost sharing, budgeting, or credit assessment) there is no need to create multiple warehouses.
Even for prod, activity-wise segregation is ideal: one warehouse each for ETL load, BI reporting, and the data analytics team.
Thanks
Palash Chatterjee
I am exploring options to optimize query analysis and the cost of storing data in a BigQuery table. Specifically, I want to know whether we can reuse a query made over a larger dataset, or reuse/extract data from the last queried result, so that we don't pay the cost of running the entire query again.
Limitations
- Cannot use cached results, since the data arrives via streaming inserts and every write invalidates the cached results.
- Even if a programmatic solution can be built, I need to validate whether data inconsistencies can happen, and how to manage them whenever the data is out of sync.
Thanks in advance!
To analyze BigQuery SQL cost usage you can list all BigQuery jobs (via the BigQuery API) and analyze the bytes and slots used and the execution time. Besides caching, you can analyze queries to see if any are candidates for partitioning and clustering, which could significantly reduce cost and execution time. Reading other BigQuery SO posts, I am under the impression that materialized views are around the corner; that would be another great performance and cost optimization.
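For example, a sketch of that job-level analysis using the INFORMATION_SCHEMA jobs view, assuming it is available in your project and region (the region qualifier and the 30-day lookback are assumptions):

SELECT user_email,
       COUNT(*)                               AS queries,
       SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
       SUM(total_slot_ms) / 1000 / 3600       AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email
ORDER BY tib_billed DESC;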
To optimize cost itself, you can compare the on-demand and slot-reservation (flat-rate) pricing models.
To optimize streaming-insert cost, as long as you can accept a roughly 2-minute delay (as opposed to the seconds of delay with streaming), you can consider event-driven serverless data ingestion like BqTail.
When it comes to caching, you may also explore eager caching options, which create a cache for the most commonly used SQL every time the underlying data changes; but in that case you have to control all data ingestion so the cache can be recreated (possible with a BqTail API post-load task).
After reading the pricing for Google's new relational database, Spanner, I see that the cost is based on storage and usage. They charge $0.90 per hour per node.
The question is: if I create the database for development and only use it 6 hours a day, 100 hours a month at most, do I pay only for the hours of active use (receiving queries) or for the whole month? Is the charge similar to App Engine instances?
In the first case, there is no problem spending US$90 on testing this new database, but if they charge for the whole month (whether I use it or not), the cost rises to about US$670/month.
Has anyone been using this database who can share the final invoiced cost?
In the tutorial they recommend deleting the database after testing, but for development, deleting the database and recreating the database and data every day is not suitable.
Correct, you need to maintain at least 1 node to keep the data, and you need at least 1 node for every 2 TiB of data.
So, if you upload 50 TiB of data, you need to keep 25 nodes at a minimum to maintain the data.
More info - https://cloud.google.com/spanner/docs/limits
You are charged for any resources in your instances (while the nodes are running and storage is being used), even if you aren't actively issuing queries. It's like Compute Engine or Cloud SQL.
I have a project where we sample a "large" amount of data on a per-second basis. Some operations are performed on it, such as filtering, and it then needs to be accessed at second, minute, hour, or day intervals.
We currently do this with an SQL-based system and software that updates different tables (daily averages, hourly averages, etc.).
We are currently looking at whether other solutions could fit our needs, and I came across several, such as OpenTSDB, Google Cloud Dataflow, and InfluxDB.
All seem to address time-series needs, but it is difficult to get information about the internals. OpenTSDB does offer downsampling, but it is not clearly specified how.
The concern is that, since we can query a vast amount of data, for instance a year, it may take a very long time if the DB downsamples at query time rather than using pre-computed results.
As well, the downsampling needs to be "updated" whenever "delayed" data points are added.
On top of that, upon data arrival we perform some processing (outlier filtering, calibration), and those operations should not be written to disk. Several solutions could be used, like a RAM-based DB, but perhaps a more elegant solution exists that would work together with the previous requirements.
I believe this application is not something "extravagant" and that tools must exist to do this; I'm thinking of stock tickers, monitoring, and so forth.
Perhaps you have some good suggestions as to which technologies/DBs I should look into.
Thanks.
You can accomplish such use cases pretty easily with Google Cloud Dataflow; data preprocessing and query optimization is one of the major scenarios for Cloud Dataflow.
We don't provide a built-in "downsample" primitive, but you can write such a data transformation easily. If you are simply looking to drop unnecessary data, you can just use a ParDo. For really simple cases, the Filter.byPredicate primitive can be even simpler.
Alternatively, if you are looking at merging many data points into one, a common pattern is to window your PCollection to subdivide it according to the timestamps. Then, you can use a Combine to merge elements per window.
The additional processing that you mention can easily be tacked onto the same data processing pipeline.
In terms of comparison, Cloud Dataflow is not really comparable to databases. Databases are primarily storage solutions with processing capabilities. Cloud Dataflow is primarily a data processing solution, which connects to other products for its storage needs. You should expect your Cloud Dataflow-based solution to be much more scalable and flexible, but that also comes with higher overall cost.
Dataflow is for inline processing as the data comes in. If you are only interested in summary and calculations, dataflow is your best bet.
If you want to later take that data and access it via time (time-series) for things such as graphs, then InfluxDB is a good solution though it has a limitation on how much data it can contain.
If you're OK with a 2-25 second delay on large data sets, then you can just use BigQuery along with Dataflow. Dataflow will receive, summarize, and process your numbers; then you submit the result into BigQuery. HINT: divide your tables by DAYS to reduce costs and make re-calculations much easier, as in the sketch below.
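As a rough sketch of that layout (the dataset, table, and column names are made up, and the schema is only an assumption): a day-partitioned, clustered raw table plus a minute-level rollup query that selects only the columns it needs, so the partition filter and explicit column list keep the bytes scanned, and the bill, down.

-- Raw per-second samples, partitioned by day and clustered by device
CREATE TABLE IF NOT EXISTS metrics.raw_samples (
  ts        TIMESTAMP,
  device_id STRING,
  value     FLOAT64
)
PARTITION BY DATE(ts)
CLUSTER BY device_id;

-- Minute-level downsample for a single day's partition
SELECT TIMESTAMP_TRUNC(ts, MINUTE) AS minute,
       device_id,
       AVG(value) AS avg_value
FROM metrics.raw_samples
WHERE DATE(ts) = DATE '2020-01-01'
GROUP BY minute, device_id;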
We process 187 GB of data each night. That equals 478,439,634 individual data points (each with about 15 metrics and an average of 43,000 rows per device) for about 11,512 devices.
Secrets to BigQuery:
LIMIT your column selection. Don't ever do a select * if you can help it.
;)