Snowflake query credit calculation - snowflake-cloud-data-platform

One of my users has asked whether it is possible to calculate the credits burned for executing a particular query in Snowflake. Based on my understanding I think it is not possible, because credits are burned at the warehouse level and not at the query level. But I still wanted to ask in case someone has a way to calculate credits per query.
Thanks

I ended up writing the query below:
SELECT query_id
,warehouse_name
,start_time
,end_time
,total_elapsed_sec
,case
when total_elapsed_sec < 60 then 60
else total_elapsed_sec
end as total_elapsed_sec_1
,ROUND(unit_of_credit*total_elapsed_sec_1 / 60/60,2) total_credit
,total_credit*3.00 query_cost --change based on how much you are paying for a credit
FROM (
select query_id
,warehouse_name
,start_time
,end_time
,total_elapsed_time/1000 total_elapsed_sec
,CASE WHEN warehouse_size = 'X-Small' THEN 1
WHEN warehouse_size = 'Small' THEN 2
WHEN warehouse_size = 'Medium' THEN 4
WHEN warehouse_size = 'Large' THEN 8
WHEN warehouse_size = 'X-Large' THEN 16
WHEN warehouse_size = '2X-Large' THEN 32
WHEN warehouse_size = '3X-Large' THEN 64
WHEN warehouse_size = '4X-Large' THEN 128
ELSE 1
END unit_of_credit
from table(information_schema.QUERY_HISTORY_BY_USER
(user_name => 'USERNAME',
END_TIME_RANGE_START => dateadd('hours',-1,current_timestamp()), --you can manipulate this based on your need
END_TIME_RANGE_END => current_timestamp(),RESULT_LIMIT => 10000)));
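To make the formula concrete, here is a worked example with made-up numbers (a 120-second query on a Medium warehouse, and the same assumed $3.00 per credit as above):
-- Hypothetical worked example: 120 seconds on a Medium warehouse (4 credits/hour)
-- at an assumed price of $3.00 per credit.
SELECT ROUND(4 * 120 / 60 / 60, 2)        AS total_credit  -- 0.13 credits
      ,ROUND(4 * 120 / 60 / 60, 2) * 3.00 AS query_cost;   -- roughly $0.39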

If you are running sequential queries, like from the web UI using "run all", and nobody else is sharing the warehouse, then execution_time * warehouse_credits_per_time = cost.
If you have a warehouse that is always queued up/running, then the cost is a proration: total_warehouse_cost * sum(query_execution_time) / total_execution_time.
If your processing is in a loop, then any one query is "free", because without it the other code would still run. But if you have a loop, you care about latency, and therefore about reducing your warehouse size or auto-scaling. Thus it's not really free.
So the first two methods are actually the same thing: you have to prorate the time.
Most of our processing is in a loop, so we are looking to reduce/manage latency, and we watch the 'long running' or 'total time' of parts of our pipeline to find things to improve. If the SQL is running by itself, the time is the cost; if the warehouse is running many concurrent requests, then either they are "slowed down" by the N-way concurrency or they are not (a free lunch), and we discount that last bucket.
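A rough sketch of that proration, assuming access to the SNOWFLAKE.ACCOUNT_USAGE views (WAREHOUSE_METERING_HISTORY and QUERY_HISTORY) and an assumed $3.00 per credit; treat it as an approximation, not an official per-query cost:
WITH wh_credits AS (
    SELECT warehouse_name, SUM(credits_used) AS total_credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
), wh_exec AS (
    SELECT warehouse_name, SUM(execution_time) AS total_exec_ms
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
)
SELECT q.query_id
      ,q.warehouse_name
      ,c.total_credits * q.execution_time / NULLIF(t.total_exec_ms, 0)        AS prorated_credits
      ,c.total_credits * q.execution_time / NULLIF(t.total_exec_ms, 0) * 3.00 AS prorated_cost -- assumed $/credit
FROM snowflake.account_usage.query_history q
JOIN wh_credits c ON c.warehouse_name = q.warehouse_name
JOIN wh_exec    t ON t.warehouse_name = q.warehouse_name
WHERE q.start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP());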

The actual credits burned for a specific query would be a little difficult to calculate because of various factors, but you can get somewhat close with an elapsed-time calculation:
select sum(TOTAL_ELAPSED_TIME), WAREHOUSE_SIZE from query_history
where QUERY_TEXT = 'select * from query_history' -- your query
and WAREHOUSE_NAME = 'XXXXXX' -- replace with your warehouse name
and USER_NAME = 'XXXXXX' -- replace with your user name
group by WAREHOUSE_SIZE
With this elapsed time, and based on some assumptions:
The size of the warehouse was consistent during the various executions.
Warehouse credits are also burned based on the auto-suspend setting (if execution time is 30 seconds, you still pay for 5 minutes when auto-suspend is set to 300 seconds).
As suggested in the post above, credit usage is also shared if multiple users are using the warehouse at the same time for different query executions.
Whether, during query execution, the result is fetched from cache or from remote storage.
If the above factors are known to you, calculate the total credits spent for each warehouse size and sum them up (see the sketch below).
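A hedged sketch of that summation, assuming the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view (the unqualified query_history above may point elsewhere) and the credits-per-hour mapping and $3.00-per-credit price from the first answer, both of which are assumptions:
SELECT warehouse_size
      ,est_credits
      ,est_credits * 3.00 AS est_cost  -- assumed price per credit
FROM (
    SELECT warehouse_size
          ,SUM(total_elapsed_time) / 1000 / 3600
             * DECODE(warehouse_size, 'X-Small', 1, 'Small', 2, 'Medium', 4, 'Large', 8,
                      'X-Large', 16, '2X-Large', 32, '3X-Large', 64, '4X-Large', 128, 1) AS est_credits
    FROM snowflake.account_usage.query_history
    WHERE query_text = 'select * from query_history' -- your query
      AND warehouse_name = 'XXXXXX'                  -- your warehouse
      AND user_name = 'XXXXXX'                       -- your user
    GROUP BY warehouse_size
);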
Thanks
- Palash Chatterjee

Related

Best way to handle time consuming queries in InfluxDB

We have an API that queries an Influx database and a report functionality was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks), we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store that in cache.
The most time-consuming part of the query is probably the comparison and averages we do, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
FROM $MEASUREMENT
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the amount of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase timeout? Batch query? Any database optimization to be able to run this faster?
In such cases where you allow the user to select in days, I would suggest having another table that stores the result (min, max and avg) of each day as a document. This table can be populated by some job after the end of the day.
You can also think about changing the document per day to per week or per month, based on how you plot the values. You can also add more fields, like (in your case) tagname and other fields.
Reason why this is superior to using a cache: with a cache you can only store the result of a query you have already run, so you have to compute every different combination in real time. Here, however, the cumulative results are already available, with a much smaller dataset to compute.
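A minimal InfluxQL sketch of that idea, assuming InfluxDB 1.x; the measurement and field names are placeholders taken from the question, the destination measurement "daily_min_max_mean" is invented here, and the scheduling is left to whatever job runs it once per day:
-- Run once a day (e.g. from a cron job) to materialize yesterday's aggregates.
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "daily_min_max_mean"
FROM $MEASUREMENT
WHERE time >= now() - 1d AND ("field" = 'myFieldValue')
GROUP BY time(1d), "tagname"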
Based on your query, I assume you are using InfluxDB v1.x. You could try Continuous Queries, which are InfluxQL queries that run automatically and periodically on realtime data and store the query results in a specified measurement.
In your case, for each report, you could generate a CQ and let your users query it.
e.g.:
Step 1: create a CQ
CREATE CONTINUOUS QUERY "cq_basic_rp" ON "db"
BEGIN
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' -- note that the time filter is not here
GROUP BY time(1h), "tagname" -- here you can define the job interval
END
Step 2: Query against that CQ
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate -- here you can pass the user's time filter
Since you already ask InfluxDB to run these aggregates continuously based on the specified interval, you should be able to trade space for time.

Two question about Time Travel storage-costs in snowflake

I have read the Snowflake documentation a lot. Snowflake incurs storage costs when data is updated.
"tables-storage-considerations.html" mentions that:
As an extreme example, consider a table with rows associated with
every micro-partition within the table (consisting of 200 GB of
physical storage). If every row is updated 20 times a day, the table
would consume the following storage:
Active 200 GB | Time Travel 4 TB | Fail-safe 28 TB | Total Storage 32.2 TB
The first question is: if a periodic task runs 20 times a day, and each run updates exactly one row in each micro-partition, does the table still consume 32.2 TB of total storage?
"data-time-travel.html" mentions that:
Once the defined period of time has elapsed, the data is moved into
Snowflake Fail-safe and these actions can no longer be performed.
So my second question is: why is Fail-safe 28 TB and not 24 TB (i.e., with the Time Travel cost deducted)?
https://docs.snowflake.com/en/user-guide/data-cdp-storage-costs.html
https://docs.snowflake.com/en/user-guide/tables-storage-considerations.html
https://docs.snowflake.com/en/user-guide/data-time-travel.html
First question: yes. It's the fact that the micro-partition is changing that matters, not how many rows within it change.
Question 2: Fail-safe is 7 days of data: 4 TB x 7 = 28 TB.
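Spelled out, the arithmetic behind the documented example (assuming the default 1-day Time Travel retention) works out as follows:
-- daily churn = 200 GB of active storage rewritten 20 times = 4 TB per day
-- Time Travel = 1 day  * 4 TB =  4 TB
-- Fail-safe   = 7 days * 4 TB = 28 TB
-- Total       = 0.2 TB active + 4 TB + 28 TB = 32.2 TB
SELECT 0.2 + (1 * 4) + (7 * 4) AS total_storage_tb;  -- 32.2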

Querying in Google bigQuery work slow

I have a table which contains nearly a million rows. Searching for a single value in it takes 5 seconds, and around 500 takes 15 seconds. This is quite a long time. Please let me know how I can optimize the query.
My query is:
select a,b,c,d from table where a in ('a1','a2')
Job id : stable-apogee-119006:job_ClLDIUSdDLYA6tC2jfC5GxBXmv0
I'm not sure what you mean by "around 500 in 15 seconds", but I ran some tests against our database trying to simulate what you are running, and I got results similar to yours
(my query is slower than yours as it has a join operation, but still, here we go):
SELECT
  a.fv fv,
  a.v v,
  a.sku sku,
  a.pp pp
FROM (
  SELECT
    fullvisitorid fv,
    visitid v,
    hits.product.productsku sku,
    hits.page.pagepath pp
  FROM (TABLE_DATE_RANGE([40663402.ga_sessions_], DATE_ADD(CURRENT_DATE(), -3, 'day'), DATE_ADD(CURRENT_DATE(), -3, 'day')))
  WHERE
    1 = 1 ) a
JOIN EACH (
  SELECT
    fullvisitorid fv
  FROM (TABLE_DATE_RANGE([40663402.ga_sessions_], DATE_ADD(CURRENT_DATE(), -3, 'day'), DATE_ADD(CURRENT_DATE(), -3, 'day')))
  GROUP EACH BY
    fv
  LIMIT
    1 ) b
ON
  a.fv = b.fv
Querying for just one day and bringing just one fullvisitor took BQ roughly 5 secs to process 1.7 GBs.
And when I ran the same query for the last month and removed the limit operator, it took ~10s to process ~56 GB of data (around 33 million rows).
This is insanely fast.
So you might have to evaluate your project specs. If 5 secs is still too much for you then maybe you'll need to find some other strategy in your architecture that suits you best.
BigQuery does take a few seconds to process its workloads, but it is also able to process hundreds of gigabytes in those same few seconds.
If your project's data consumption is expected to grow and you start processing millions of rows, then you might evaluate whether waiting a few seconds is still acceptable in your application.
Other than that, as far as your query goes, I don't think there's much optimization left to improve its performance.
(ps: I decided to run it for 100 days and it processed around 100 GB in 14s.)

What are some suggested LogParser queries to run to detect sources of high network traffic?

In looking at the network in/out metrics for our AWS/EC2 instance, I would like to find the sources of the high network traffic occurrences.
I have installed Log Parser Studio and run a few queries, primarily looking for responses that took a while:
SELECT TOP 10000 * FROM '[LOGFILEPATH]' WHERE time-taken > 1000
I am also targeting time spans that cover when the network in/out spikes have occurred:
SELECT TOP 20000 * FROM '[LOGFILEPATH]'
WHERE [date] BETWEEN TIMESTAMP('2013-10-20 02:44:00', 'yyyy-MM-dd hh:mm:ss')
AND TIMESTAMP('2013-10-20 02:46:00', 'yyyy-MM-dd hh:mm:ss')
One issue is that the log files are 2-7 GB (targeting single files per query). When I tried Log Parser Lizard, it crashed with an out-of-memory exception on large files (boo).
What are some other queries, and methodologies I should follow to identify the source of the high network traffic, which would hopefully help me figure out how to plug the hole?
Thanks.
One function that may be of particular use to you is the QUANTIZE() function. This allows you to aggregate stats for a period of time thus allowing you to see spikes in a given time period. Here is one query I use that allows me to see when we get scanned:
SELECT QUANTIZE(TO_LOCALTIME(TO_TIMESTAMP(date, time)), 900) AS LocalTime,
COUNT(*) AS Hits,
SUM(sc-bytes) AS TotalBytesSent,
DIV(MUL(1.0, SUM(time-taken)), Hits) AS LoadTime,
SQRROOT(SUB(DIV(MUL(1.0, SUM(SQR(time-taken))), Hits), SQR(LoadTime))) AS StandardDeviation
INTO '[OUTFILEPATH]'
FROM '[LOGFILEPATH]'
WHERE '[WHERECLAUSE]'
GROUP BY LocalTime
ORDER BY LocalTime
I usually output this to a .csv file and then chart it in Excel to visually see where a period of time is out of the normal range. This particular query breaks things down into 15-minute segments based on the 900 passed to QUANTIZE. The TotalBytesSent, LoadTime and StandardDeviation columns allow me to see other aberrations in downloaded content or response times.
Another thing to look at is the number of requests a particular client has made to your site. The following query can help identify scanning or DoS activity coming in:
SELECT
DISTINCT c-ip as ClientIP,
COUNT(*) as Hits,
PROPCOUNT(*) as Percentage
INTO '[OUTFILEPATH]'
FROM '[LOGFILEPATH]'
WHERE '[WHERECLAUSE]'
GROUP BY ClientIP
HAVING (Hits > 50)
ORDER BY Percentage DESC
Adjusting the HAVING clause sets the minimum number of requests an IP needs to make before it shows up. Based on the activity and the WHERE clause, 50 may be too low. The PROPCOUNT() function gives the percentage of the overall value of a particular field; in this case, it gives what percentage of all requests made to the site came from a particular IP. Typically this will surface the IP addresses of search engines as well, but those are pretty easy to weed out.
I hope that gives you some ideas on what you can do.

SQL Server code to duplicate Excel calculation that includes circular reference

Is there a way to duplicate a formula with a circular reference from an Excel file in SQL Server? My client uses an Excel file to calculate a Selling Price. The Selling Price field is Costs / (1 - Projected Margin) = 6.5224 / (1 - 0.6) = 16.3060. One of the numbers that goes into the costs is commission, which is defined as Selling Price times a commission rate.
Costs = 6.5224
Projected Margin = 60%
Commission = 16.3060 (Selling Price) * 0.10 (Commission Rate) = 1.6306 (which is part of the 6.5224)
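To spell out how those numbers fit together (the 4.8918 commission-free cost below is implied, being 6.5224 minus the 1.6306 commission, and is not stated explicitly above):
-- selling price = costs / (1 - projected margin) = 6.5224 / (1 - 0.60) = 16.3060
-- commission    = 16.3060 * 0.10 = 1.6306, which is part of the 6.5224 costs (4.8918 + 1.6306)
SELECT 6.5224 / (1 - 0.60)        AS selling_price  -- 16.3060
      ,6.5224 / (1 - 0.60) * 0.10 AS commission;    -- 1.6306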
They get around the circular reference issue in Excel because Excel allows them to check an Enable Iterative Calculation option, which stops the iterations after 100 times.
Is this possible using SQL Server 2005?
Thanks
Don
This is a business problem, not an IT one, so it follows that you need a business solution, not an IT one. It doesn't sound like you're working for a particularly astute customer. Essentially, you're feeding the commission back into the costs and recalculating commission 100 times. So the salesman is earning commission based on their commission?!? Seriously? :-)
I would try persuading them to calculate costs and commissions separately. In professional organisations with good accounting practices where I've worked before, these costs are often broken down into operating and non-operating or raw-materials costs, which should improve their understanding of their business. To report total costs later on, add commission and raw-materials costs. No circular loops, and good accounting reports.
At banks where I've worked, these costs are often called things like Cost (no commissions or fees), Net Cost (Cost + Commission) and then, bizarrely, Net Net Cost (Cost + Commission + Fees). Depending on the business model, cost breakdowns can get quite interesting.
Here are 2 sensible options you might suggest for them to calculate the selling price.
Option 1: If you're going to calculate margin to exclude commission then
Price before commission = Cost + (Cost * (1 - Projected Margin))
Selling price = Price before commission + (Price before commission * Commission)
Option 2: If your client insists on calculating margin to include commission (which it sounds like they might want to do) then
Cost price = Cost + (Cost * Commission)
Profit per Unit or Contribution per Unit = Cost price * (1-Projected Margin)
Selling Price = Cost Price + Profit per Unit
This is sensible in accounting terms and a doddle to implement with SQL or any other software tool. It also means your customer has a way of analysing their sales to highlight per unit costs and per unit profits when the projected margin is different per product. This invariably happens as the business grows.
Don't blindly accept calculations from spreadsheets. Think them through and don't be afraid to ask your customer what they're trying to achieve. All too often broken business processes make it as far as the IT department before being called into question. Don't be afraid of doing a good job and that sometimes means challenging customer requests when they don't make sense.
Good luck!
No, it is not possible
mysql> select 2+a as a;
ERROR 1054 (42S22): Unknown column 'a' in 'field list'
SQL expressions can only refer to expressions that already exist.
You cannot even write:
mysql> select 2 as a, 2+a as b;
ERROR 1054 (42S22): Unknown column 'a' in 'field list'
The way to look at databases is as transactional engines that take data from one state to another in one step (with a combination of operators that operate not only on scalar values but also on sets).
Whilst I agree with #Sir Wobin's answer, if you do want to write some recursive code, you may be able to do it by abusing Recursive Common Table Expressions:
with RecurseCalc as (
select CAST(1.5 as float) as Value,1 as Iter
union all
select 2 * Value,1+Iter from RecurseCalc where Iter < 100
), FinalResult as (
select top 1 Value from RecurseCalc order by Iter desc
)
select * from FinalResult option (maxrecursion 100)
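Adapting the same pattern to the numbers in the question only needs a small change to the recursive member. This sketch assumes the 6.5224 cost already contains the 1.6306 commission, so the commission-free cost is 4.8918; the CTE name is invented here:
with SellingPriceCalc as (
    select CAST(0.0 as float) as SellingPrice, 1 as Iter
    union all
    -- same iteration Excel performs: price = (base cost + commission) / (1 - margin)
    select (4.8918 + 0.10 * SellingPrice) / (1 - 0.60), 1 + Iter
    from SellingPriceCalc where Iter < 100
)
select top 1 SellingPrice      -- converges to 16.3060
from SellingPriceCalc
order by Iter desc
option (maxrecursion 100)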
