If I have run a large query in Snowflake and then execute the same query 5 minutes later without any change to the table, it is my understanding that the results will be fetched from the result cache. In this case, will it consume compute credits?
Not today. BUT, if you use an unusually high amount of result cache compared to the compute credits on your account, you will begin to be billed for your cloud services layer consumption. There was an announcement on this in November that is important to understand. Those using the system in an expected fashion won't be affected by this, but it's important to review:
https://www.snowflake.com/blog/whats-new-with-the-snowflake-cloud-services-billing-model/
A few comments and updates about the product:
(1) Mike Walton's response below about the upcoming services layer billing is indeed important to be aware of for operations like result caching that were previously compute-credit-free.
(2) To understand what conditions are required for Snowflake to reuse the result cache, this documentation link gives a comprehensive list: https://docs.snowflake.net/manuals/user-guide/querying-persisted-results.html#retrieval-optimization
(3) The same doc link also includes the detail on how long the result cache is kept: "Each time the persisted result for a query is reused, Snowflake resets the 24-hour retention period for the result, up to a maximum of 31 days from the date and time that the query was first executed. After 31 days, the result is purged and the next time the query is submitted, a new result is generated and persisted."
Snowflake Support answered your question here: https://community.snowflake.com/s/question/0D50Z000082DhlPSAS/does-a-cached-result-on-a-suspended-warehouse-cost-compute-credits
Compute credits don't get consumed when you use the result cache, so long as the query is exactly the same and the underlying table data hasn't changed. The result cache is purged after 24 hours, too (though, as quoted above, reuse extends the retention up to a maximum of 31 days).
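To see this for yourself, here is a minimal sketch (assuming the snowflake-connector-python package; the connection details, warehouse and query are placeholders) that runs the same query twice. Looking up both query IDs in QUERY_HISTORY, the second run should show a near-zero execution time when it was served from the result cache. The USE_CACHED_RESULT session parameter controls whether cached results are reused at all.

# Rough sketch only: placeholder credentials, warehouse and query.
import snowflake.connector

conn = snowflake.connector.connect(
    user="YOUR_USER",            # placeholder
    password="YOUR_PASSWORD",    # placeholder
    account="YOUR_ACCOUNT",      # placeholder
    warehouse="YOUR_WAREHOUSE",  # placeholder
)
cur = conn.cursor()

# Result reuse is on by default; set to FALSE to force a full recompute for comparison.
cur.execute("ALTER SESSION SET USE_CACHED_RESULT = TRUE")

query = "SELECT COUNT(*) FROM my_large_table"  # placeholder query

cur.execute(query)
first_id = cur.sfqid    # query ID of the first, warehouse-executed run
cur.execute(query)
second_id = cur.sfqid   # query ID of the second run, expected to hit the result cache

# Compare the two IDs in QUERY_HISTORY: the cached run should show near-zero execution time.
print(first_id, second_id)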
What does it mean when COMPILATION_TIME, QUEUED_PROVISIONING_TIME, or both are longer than usual?
I have a query that runs every couple of minutes, and it usually takes less than 200 milliseconds for compilation and 0 for provisioning. In the last couple of days there have been two instances where the values were more than 4,000 for compilation and more than 100,000 for provisioning.
Does that mean the warehouse was being resumed and there was a hiccup?
COMPILATION_TIME:
The SQL is parsed and simplified, and the tables' metadata is loaded. Thus a compile for select a,b,c from table_name will be fractionally faster than select * from table_name, because the metadata is not needed from every partition to know the final shape.
Heavily fragmented tables can give poor compile performance, as there is more metadata to load. Fragmentation comes from many small writes/deletes/updates.
Doing very large INSERT statements can give horrible compile performance. We did a lift-and-shift and loaded all data via INSERT statements; just avoid that.
PROVISIONING_TIME is the amount of time needed to set up the hardware. This occurs for two main reasons. First, you are turning on 3XL, 4XL, 5XL, or 6XL warehouses, and it can take minutes just to allocate that volume of servers.
Second, there is a failure: sometimes around releases there can be a little instability, where a query fails on the "new" release and is rolled back to older instances, which you would see in the profile as 1, 1001. And sometimes there have been problems in the provisioning infrastructure (I have not seen it for a few years, but I am not monitoring for it at present).
But I would think you will mostly see this on an ongoing basis for the first reason.
The compilation process involves query parsing, semantic checks, query rewrite components, reading object metadata, table pruning, evaluating certain heuristics such as filter push-downs, plan generation based on cost-based optimization, and so on, all of which is accounted for in COMPILATION_TIME.
QUEUED_PROVISIONING_TIME refers to the time (in milliseconds) spent in the warehouse queue, waiting for the warehouse compute resources to provision, due to warehouse creation, resume, or resize.
https://docs.snowflake.com/en/sql-reference/functions/query_history.html
To understand in detail why the query has recently been taking a long time, the query ID needs to be analysed. You can raise a support case with Snowflake Support, including the problematic query ID, to have the details checked.
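If you want to look at these two columns yourself before raising a case, a rough sketch (assuming the snowflake-connector-python package and an already-open connection object, here called conn, with a current database set so INFORMATION_SCHEMA resolves) against the QUERY_HISTORY table function could look like this:

# Rough sketch: find recent queries with unusually high compile or provisioning times.
cur = conn.cursor()
cur.execute("""
    SELECT query_id,
           query_text,
           compilation_time,            -- in milliseconds
           queued_provisioning_time     -- in milliseconds
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 1000))
    WHERE compilation_time > 4000
       OR queued_provisioning_time > 100000
    ORDER BY start_time DESC
""")
for query_id, query_text, compile_ms, queued_ms in cur:
    print(query_id, compile_ms, queued_ms)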
I am looking at the queries performed against my warehouse and finding that the credit calculation I'm using doesn't add up to what's being shown in Snowflake. As I understand it, it is supposed to use credits per second of query time with a minimum of 60s. So if a query runs for 5s it would use 60s worth of credits, but if a query runs for 61s it will use 61s worth of credits.
Looking at the query history, limiting only to queries performed on my warehouse, I am only seeing 5 queries for the hour in question (12).
These queries copy their results into an S3 bucket in my AWS account.
If I take the starts and ends of each of these queries and chart time, I am only seeing a total of 455 seconds of query time. With the X-Small warehouse that I'm using (1 credit per hour), that should be only 0.126 credits used for that hour.
But I am seeing 0.66 credits used for that hour.
What am I missing about snowflake credit usage? Why does it appear that I am using more credits than I should?
Moving answer from comments to an actual answer (for completeness):
Snowflake costs don't reflect query runtimes; they reflect how long warehouses are running.
AUTO_SUSPEND can be set to 60 seconds (or less) to more closely match the duration of queries.
You can refer to the official Snowflake documentation for more details:
Virtual Warehouse Credit Usage
How are Credits Charged for Warehouses?
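As an illustration of the AUTO_SUSPEND suggestion above, a small sketch (placeholder warehouse name, executed through an assumed open snowflake-connector-python connection conn):

# Rough sketch: shorten the idle timeout on a warehouse (placeholder name).
cur = conn.cursor()
cur.execute("""
    ALTER WAREHOUSE my_xsmall_wh SET
        AUTO_SUSPEND = 60    -- suspend after 60 seconds of inactivity
        AUTO_RESUME = TRUE   -- resume automatically on the next query
""")

With the typical default AUTO_SUSPEND of 10 minutes, the warehouse also bills for idle time after each query, which is more than enough to turn 455 seconds of query time into 0.66 credits of warehouse time.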
TL;DR
I have a table with about 2 million WRITEs over the month and 0 READs. Every 1st day of a month, I need to read all the rows written on the previous month and generate CSVs + statistics.
How to work with DynamoDB in this scenario? How to choose the READ throughput capacity?
Long description
I have an application that logs client requests. It has about 200 clients. The clients need to receive on every 1st day of a month a CSV with all the requests they've made. They also need to be billed, and for that we need to calculate some stats with the requests they've made, grouping by type of request.
So at the end of the month, a client receives a report with those per-request-type totals.
I've already come to two solutions, but I'm not still convinced on any of them.
1st solution: OK, on the last day of every month I increase the READ throughput capacity and then run a MapReduce job. When the job is done, I decrease the capacity back to the original value.
Cons: not fully automated, risk of the DynamoDB capacity not being available when the job starts.
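For illustration, the capacity bump in the 1st solution can at least be scripted; a rough boto3 sketch with a placeholder table name and placeholder capacity values:

# Rough sketch: temporarily raise read capacity around the monthly job (placeholder names/values).
import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "client_requests"  # placeholder table name

def set_capacity(read_units, write_units):
    # update_table is asynchronous, so wait until the table is ACTIVE again
    # before kicking off the export job.
    dynamodb.update_table(
        TableName=TABLE,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )
    dynamodb.get_waiter("table_exists").wait(TableName=TABLE)

set_capacity(2000, 100)   # before the MapReduce/export job (placeholder values)
# ... run the export + statistics job ...
set_capacity(50, 100)     # back to the normal baseline (placeholder values)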
2nd solution: I can break the generation of CSVs + statistics into small jobs on a daily or hourly routine. I could store partial CSVs on S3, and on every 1st day of a month I could join those files and generate a new one. The statistics would be much easier to generate: just some calculations derived from the daily/hourly statistics.
Cons: I feel like I'm turning something simple into something complex.
Do you have a better solution? If not, what solution would you choose? Why?
Having been in a similar place myself, the approach I used, and now recommend to you, is to process the raw data:
as often as you reasonably can (start with daily)
to a format as close as possible to the desired report output
with as much calculation/CPU intensive work done as possible
leaving as little to do at report time as possible.
This approach is entirely scalable - the incremental frequency can be:
reduced to as small a window as needed
parallelised if required
It also makes it possible to re-run past months' reports on demand, as the report generation time should be quite small.
In my example, I shipped denormalized, pre-processed (financial calculations) data every hour to a data warehouse, then reporting just involved a very basic (and fast) SQL query.
This had the additional benefit of spreading the load on the production database server into lots of small bites, instead of bringing it to its knees once a week at invoice time (30,000 invoices produced every week).
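To make that concrete, here is a minimal sketch of such a daily pre-aggregation job, assuming (purely for illustration) a boto3 DynamoDB table named client_requests with hash key client_id and range key timestamp, and an S3 bucket for the partial results:

# Rough sketch: aggregate one client's requests for one day and store the partial result in S3.
# Table layout (hash key client_id, range key timestamp) and bucket name are assumptions.
import json
from collections import Counter

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("client_requests")  # placeholder
s3 = boto3.client("s3")
BUCKET = "billing-partials"                                  # placeholder

def aggregate_day(client_id, day):  # day like "2015-01-31"
    stats = Counter()
    resp = table.query(
        KeyConditionExpression=Key("client_id").eq(client_id)
        & Key("timestamp").begins_with(day)
    )  # pagination via LastEvaluatedKey omitted for brevity
    for item in resp["Items"]:
        stats[item["request_type"]] += 1  # group by type of request
    # Month-end reporting then just sums these small per-day JSON files.
    s3.put_object(
        Bucket=BUCKET,
        Key="partials/%s/%s.json" % (client_id, day),
        Body=json.dumps(stats),
    )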
I would use the Kinesis service to produce daily and almost real-time billing.
For this purpose, I would create a dedicated DynamoDB table just for the calculated data
(another option is to run it on flat files).
Then I would add a process that sends events to Kinesis just after you update the regular DynamoDB table.
Thus, when you reach the end of the month, you can just execute whatever post-billing calculations you have and create your CSV files from the already-calculated table.
I hope that helps.
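For illustration only, the "send events to Kinesis" step might look roughly like this with boto3 (placeholder stream name and record shape):

# Rough sketch: emit a billing event to Kinesis right after the main table update.
import json
import boto3

kinesis = boto3.client("kinesis")

def emit_billing_event(client_id, request_type):
    kinesis.put_record(
        StreamName="billing-events",    # placeholder stream name
        Data=json.dumps({"client_id": client_id,
                         "request_type": request_type}).encode("utf-8"),
        PartitionKey=client_id,         # keeps each client's events on the same shard, in order
    )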
Take a look at Dynamic DynamoDB. It will increase/decrease the throughput when you need it without any manual intervention. The good news is you will not need to change the way the export job is done.
With the App Engine pricing changes, we've been paying attention to our datastore puts. According to the pricing comparison chart we're making 2.18 million puts a day. This seems a lot higher than expected. We receive about 0.6 queries per second, which means that each request is making about 60 puts!
Using the sample code for db profiling http://code.google.com/appengine/articles/hooks.html
we measured this for a day and the most we counted was ~14,000 which seems more reasonable. Does anyone have experience with something similar on their site?
The discrepancy you're seeing is because every index write is counted separately. When you do a datastore put, you're charged for the number of rows that have to be modified, so if you modified a single indexed field, you'd expect to be charged for:
One write for the entity itself
Two writes for the ascending index for the modified property
Two writes for the descending index for the modified property
For a total of 5 writes. As you can see, setting properties to indexed=False can have a big impact on your quota usage here.
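As a small illustration (hypothetical model, Python db API), marking properties you never filter or sort on as unindexed avoids those extra index writes:

# Hypothetical model: only properties you filter or sort on need to be indexed.
from google.appengine.ext import db

class RequestLog(db.Model):
    client_id = db.StringProperty()                  # queried on, so indexed (extra writes per put)
    user_agent = db.StringProperty(indexed=False)    # never queried: saves the ascending and
                                                     # descending index writes for this property
    payload = db.TextProperty()                      # TextProperty is never indexed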
In many cases it can be useful to know the number of rows in a table (a kind) in the datastore using Google App Engine. There is no clear and fast solution, at least I have not found one. Have you?
You can efficiently get a count of all entities of a particular kind (i.e., number of rows in a table) using the Datastore Statistics. Simple example:
from google.appengine.ext.db import stats

# Fetch the most recent KindStat entity for the model in question.
kind_stats = stats.KindStat.all().filter("kind_name =", "NameOfYourModel").get()
count = kind_stats.count  # total number of entities of that kind
You can find a more detailed example of how to get the latest stats here (GAE may keep multiple copies of the stats - one for 5min ago, one for 30min ago, etc.).
Note that these statistics aren't constantly updated so they lag a little behind the actual counts. If you really need the actual count, then you could track counts in your own custom stats table and update it every time you create/delete an entity (though this will be quite a bit more expensive to do).
Update 03-08-2015: Using the Datastore Statistics can lead to stale results. If that's not an option, two other methods are keeping a counter or using sharded counters. (You can read more about those here.) Only look at these two if you need real-time results.
There's no concept of "Select count(*)" in App Engine. You'll need to do one of the following:
Do a "keys-only" (index traversal) of the Entities you want at query time and count them one by one. This has the cost of slow reads.
Update counts at write time - this has the benefit of extremely fast reads at a greater cost per write/update. Cost: you have to know what you want to count ahead of time. You'll pay a higher cost at write time.
Update all counts asynchronously using Task Queues, cron jobs or the new Mapper API. This has the tradeoff of being semi-fresh.
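Here is a minimal sketch of the second option (counting at write time) using the Python db API and a single counter entity updated in a transaction; for high write rates you would normally shard this across several counter entities, which is omitted here for brevity:

# Sketch of a write-time counter (single, unsharded) using the old db API.
from google.appengine.ext import db

class KindCounter(db.Model):
    count = db.IntegerProperty(default=0)

def _increment(key_name, delta):
    counter = KindCounter.get_by_key_name(key_name)
    if counter is None:
        counter = KindCounter(key_name=key_name)
    counter.count += delta
    counter.put()

def entity_created(kind_name):
    # Run in a transaction so concurrent increments aren't lost.
    db.run_in_transaction(_increment, kind_name, 1)

def entity_deleted(kind_name):
    db.run_in_transaction(_increment, kind_name, -1)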
You can count the number of rows in Google App Engine using com.google.appengine.api.datastore.Query as follows:
// countEntities runs the query just to count the matching entities,
// so it can be slow (and was historically capped) for large kinds.
DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
Query qry = new Query("EmpEntity");
int count = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());