I'm trying to track down Snowflake usage via the snowflake query_history table. I notice that for a few apps, there's near "100%" usage, if I track it by start_time and elapsed time. But, the queries are full of things like "select /* nodejs:heartbeat */ 1;" which I'm told don't actually cost anything. That's cool. But, how can I make my report know that they don't cost anything?
I noticed that this query:
select query_text from query_history where warehouse_size is null;
seems to get me all the queries that are "trivial"... "SELECT current_date", "select /* nodejs:heartbeat */ 1;", etc...
But, sometimes, it also gives me seemingly non-trivial selects: "select * from cst_usr_info where deleted_at is null and lower(usr_email) = ?" for instance...
I guess my questions are:
Does "warehouse_size is null" mean that there is no cost to the query?
If so, then is the reason that the "non-trivial" selects are costless because perhaps they've been cached or something?
Yes, that is a great way to get all the queries that didn't use a warehouse (and therefore incurred no warehouse cost).
Your non-trivial queries were most likely served from Snowflake's query result cache (meaning the same query was executed previously and its results could be returned without a warehouse). There are also queries that can be answered purely from metadata in the services layer, again without a warehouse - a MIN() or MAX() on a column, for example.
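If it's useful, here's a rough sketch (assuming you're reporting from snowflake.account_usage.query_history) that attributes elapsed time only to queries that actually used a warehouse:
-- Heartbeats, result-cache hits, and metadata-only queries all have a null warehouse_size,
-- so excluding them keeps that "100%" utilisation illusion out of the report.
select
    warehouse_name,
    count(*) as num_queries,
    sum(total_elapsed_time) / 1000 / 60 as elapsed_minutes
from snowflake.account_usage.query_history
where start_time >= dateadd(day, -7, current_timestamp)
  and warehouse_size is not null
group by warehouse_name
order by elapsed_minutes desc;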
I hope that helps.
Related
I am trying to reduce the costs of our queries in Snowflake. In the docs, there is a detailed explanation of the credits per second based on the size of the warehouse used. When I look at the history tab in the web console, or at the 'snowflake.account_usage.query_history' view, I see that some queries have a NULL value in the WAREHOUSE_SIZE column (COMMIT statements, DESC TABLE, etc.). Does anyone know how these types of queries are charged? Maybe they are free?
This doesn't seem to be mentioned anywhere in the docs.
The result of any query is available for the following 24 hours through the service layer's result cache. Therefore, queries that appear to have run without any warehouse did not actually use one.
However, that doesn't necessarily mean the warehouse that would otherwise have been used wasn't running at the time.
Say, for example, another query 'Q1' was executed in "your" warehouse 'MyWH' just before you ran your 'Q2': while yours hits the result cache without needing 'MyWH' to be running, 'Q1' will still cause 'MyWH' to resume and therefore consume credits.
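A quick way to see this for yourself (my_big_table is a placeholder): run the same statement twice within 24 hours and compare what query_history reports for each run:
-- The first run scans the table on the warehouse; the second run (identical text, within 24 hours,
-- underlying data unchanged) is served from the result cache and, per the above, shows no warehouse size.
select sum(amount) from my_big_table;
select sum(amount) from my_big_table;

select query_text, warehouse_size, total_elapsed_time
from table(information_schema.query_history())
order by start_time desc
limit 5;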
More details on Snowflake Caching are available here: https://community.snowflake.com/s/article/Caching-in-Snowflake-Data-Warehouse
Queries without a warehouse listed are not charged via the compute credits. These types of queries are using the "cloud services" layer in Snowflake and are charged differently.
Here is a query to determine whether, and how much, cloud services are being used:
use schema snowflake.account_usage;
select query_type, sum(credits_used_cloud_services) cs_credits, count(1) num_queries
from query_history
where true
and start_time >= timestampadd(day, -1, current_timestamp)
group by 1
order by 2 desc
limit 10;
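As far as I know, cloud services credits are only billed to the extent they exceed 10% of your daily compute credit usage; if I'm remembering the account_usage views correctly, metering_daily_history shows that adjustment:
select usage_date,
       sum(credits_used_cloud_services) as cloud_services_credits,
       sum(credits_adjustment_cloud_services) as free_allowance_adjustment,  -- negative while within the 10% allowance
       sum(credits_billed) as credits_billed
from snowflake.account_usage.metering_daily_history
group by usage_date
order by usage_date desc
limit 14;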
I noticed even the simplest 'SELECT MAX(TIMESTAMP) FROM MYVIEW' is somewhat slow (taking minutes) in my environment, and found it's doing a TableScan of 100+GB across 80K micropartitions.
My expectation was that this would finish in milliseconds using the MIN/MAX/COUNT metadata stored for each micro-partition. In fact, I do see Snowflake finishing the job in milliseconds using metadata for an almost identical MIN/MAX lookup in the following article:
http://cloudsqale.com/2019/05/03/performance-of-min-max-functions-metadata-operations-and-partition-pruning-in-snowflake/
Is there any limitation in how Snowflake decides to use metadata? Could it be because I'm querying through a view, instead of querying a table directly?
=== Added for clarity ===
Thanks for answering! Regarding how the view is defined: it seems to add a WHERE clause for additional filtering on a cluster key, so I believe it should still be possible to fully use the micro-partition metadata. But as posted, a TableScan shows up in the query profile output.
I'm a bit concerned about your comment on secure views. The view I'm querying is indeed a SECURE VIEW - does that affect how the optimizer handles my query? Could that be the reason a TableScan is done?
It looks like you're running the query on a view. The metadata you're referring to will be used when you run a simple MIN/MAX etc. directly on the table; however, if there is logic inside your view that requires filtering or joining of data, then Snowflake cannot return results based on the metadata alone.
So yes, you are correct when you say the following because your view is probably doing something other than a simple MAX on the table:
...Could it be because I'm querying through a view, instead of querying a table directly?
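A hedged illustration of the difference (the table, column, and view names below are made up):
-- A bare MIN/MAX directly on a table column can be answered from micro-partition metadata:
select max(event_ts) from raw_events;

-- But once a (secure) view adds its own filtering logic, that filter has to be evaluated first,
-- so the same MAX through the view shows up as a TableScan in the query profile:
create or replace secure view v_my_events as
select * from raw_events where tenant_id = 42;

select max(event_ts) from v_my_events;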
I've recently hit a bottleneck in which, if I keep the current version of a query inside a report (designed in Report Builder, SSRS 2008), it generates loading times of up to 15 minutes for a report with specific parameters. The JOIN in question brings in a sub-query, joined to the main query on a non-indexed column. Let's call this sub-query "Units".
If I delete the "Units" JOIN from the SQL query and set it up as a separate dataset inside the report, linking it to the main dataset with the SSRS Lookup function (equivalent to the JOIN in SQL), the report runs smoothly, in under a minute (approximately 3 to 5 milliseconds).
Keep in mind that the "Units" sub-query, when run separately, completes in under 5 milliseconds for the same parameters that previously took 15 minutes, yet when it is attached to the main query it causes severe performance issues.
Is there a clear benefit to doing this type of separation, or should I just investigate further how to improve the query? What are the performance benefits/downsides of using Lookup versus improving the current query's performance?
My concern is that this is a situational improvement and this will not represent a long term solution. I've used this alternative in the past to avoid tweaking the query and it did not backfire, but I do not fully understand the performance implications of using this workaround.
Thanks,
Radu.
There are a lot of things that could be causing the performance issues, but here are a few simple things that might get the dataset back up to speed again with very little effort.
1. Parameter sniffing
You mention "with specific parameters". If you mean that the query only performs badly with some parameters and performs well with others, and assuming the size of the data does not vary significantly based on those parameters, then it's likely a parameter sniffing issue. This is caused by a query plan that was generated based on one set of parameters and is not suitable for other parameters. The easiest way to prove this is to simply add OPTION (RECOMPILE) to the end of the query. This is not a permanent fix, but it forces a new query plan to be generated; if you see an instant improvement, parameter sniffing is the most likely cause.
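To make the test concrete, here's a minimal sketch; the table, column, and parameter names are invented, since the actual report query isn't shown:
-- Appending OPTION (RECOMPILE) forces a fresh plan for this execution only.
DECLARE @UnitType INT = 3;

SELECT o.OrderID, o.OrderDate, u.UnitName
FROM dbo.Orders AS o
JOIN dbo.Units AS u ON u.UnitID = o.UnitID
WHERE u.UnitType = @UnitType
OPTION (RECOMPILE);   -- if the report is suddenly fast, parameter sniffing is the likely culprit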
2. Refactor dataset query
The other option is to redesign your query. I don't know what your query looks like, but let's take a simple example based on the information you posted.
If your query looks something like:
SELECT * FROM tableA a
JOIN (SELECT * FROM tableB WHERE someValue=someOtherValue) b
ON a.FieldA = b.FieldB
then you could refactor it by putting the subquery into a temp table and joining to that, something like
SELECT *
INTO #t
FROM tableB WHERE someValue=someOtherValue
SELECT * FROM tableA a
JOIN #t b
ON a.FieldA = b.FieldB
This is an approach I often take and it can get round exactly these types of performance issues.
I'm using this SELECT statement:
SELECT ID, Code, ParentID,...
FROM myTable WITH (NOLOCK)
WHERE ParentID = 0x0
This statement is repeated every 15 minutes (through a Windows service).
The problem is that the database becomes slow for other users while this query is running.
What is the best way to avoid the slow performance while the query is running?
Generate an execution plan for your query and inspect it.
Is the ParentId field indexed?
Are there other ways you might optimize the query?
Is it possible to increase the performance of the server that is hosting SQL Server?
Does it need more disk or RAM?
Do you have separate drives (spindles) for operating system, data, transaction logs, temporary databases?
Something else to consider - must you always retrieve the very latest values from this table for your application, or might it be possible to cache prior results and use those for some length of time?
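If it turns out ParentID isn't indexed, here's a hedged sketch of what such an index might look like (the INCLUDE list is a guess based on the columns in your SELECT):
-- Covers the equality filter on ParentID plus the columns the service reads,
-- so the 15-minute job can avoid scanning the whole table.
CREATE NONCLUSTERED INDEX IX_myTable_ParentID
ON dbo.myTable (ParentID)
INCLUDE (ID, Code);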
It seems your table has a huge number of records. You could consider implementing page-wise retrieval of the data: first request, say, the TOP 100 rows and then make subsequent calls to fetch the rest of the data.
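A rough sketch of that page-wise retrieval (OFFSET/FETCH needs SQL Server 2012 or later; on older versions you could use ROW_NUMBER() or TOP with a key filter instead):
-- Fetch one page per call instead of the whole result set; the page size of 100 is arbitrary.
DECLARE @PageSize INT = 100, @PageNumber INT = 0;

SELECT ID, Code, ParentID
FROM myTable WITH (NOLOCK)
WHERE ParentID = 0x0
ORDER BY ID
OFFSET @PageSize * @PageNumber ROWS
FETCH NEXT @PageSize ROWS ONLY;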
I still don't understand the need to run such a query every 15 minutes. You might also consider implementing a stored procedure that performs the majority of the processing and returns only a small subset of the data. That would be a good improvement if it suits your requirements.
I have a query that has been running every day for a little over 2 years now and has typically taken less than 30 seconds to complete. All of a sudden, yesterday, the query started taking 3+ hours to complete and was using 100% CPU the entire time.
The SQL is:
SELECT
    @id,
    alpha.A, alpha.B, alpha.C,
    beta.X, beta.Y, beta.Z,
    alpha.P, alpha.Q
FROM
    [DifferentDatabase].dbo.fnGetStuff(@id) beta
    INNER JOIN vwSomeData alpha ON beta.id = alpha.id
alpha.id is a BIGINT type and beta.id is an INT type. dbo.fnGetStuff() is a simple SELECT statement with 2 INNER JOINs on tables in the same DB, using a WHERE id = @id. The function returns approximately 11,000 results.
The view vwSomeData is a simple SELECT statement with two INNER JOINs that returns about 590000 results.
Both the view and the function will complete in less than 10 seconds when executed by themselves. Selecting the results of the function into a temporary table first and then joining on that makes the query finish in < 10 seconds.
How do I troubleshoot what's going on? I don't see any locks in Activity Monitor.
Look at the query plan. My guess is that there is a table scan (or more than one) in the execution plan. This will cause huge amounts of I/O for the few records you get in the result.
You could use the SQL Server Profiler tool to monitor what queries are running on SQL Server. It doesn't show the locks, but it can for instance also give you hints on how to improve your query by suggesting indexes.
If you've got a reasonably recent version of SQL Server Management Studio, it has a Database Tuning Advisor as well, under Tools. It takes a trace from the Profiler and makes some, sometimes highly useful, suggestions. Make sure there aren't too many queries in the trace - it takes a long time to build its advice.
I'm not an expert on it, but have had some luck with it in the past.
Do you need to use a function? Could you re-write the entire thing as a stored procedure in which you pass in @ID as a parameter?
Even if your table has indexes, passing @ID as a variable in the WHERE clause can prevent them from being used, potentially greatly increasing the amount of time the query takes to run.
The reason the indexes may not be used is that the query optimizer does not know the value of the variable when it selects an access method for the query. Because this is a single batch, only one pass is made over the Transact-SQL code, preventing the optimizer from knowing what it needs to know in order to choose an access method that uses the indexes.
You might want to consider an INDEX query hint if you cannot re-write the SQL.
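A hypothetical example of the index-hint syntax in case the SQL really can't be rewritten (the table and index names are invented):
-- Forces the optimizer to use the named index regardless of the sniffed variable value.
DECLARE @id INT = 42;

SELECT t.id, t.SomeColumn
FROM dbo.SomeTable AS t WITH (INDEX (IX_SomeTable_id))
WHERE t.id = @id;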
It might also be possible, since this just started happening, that the indexes have become fragmented and need to be rebuilt.
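If that's a suspect, a quick hedged first step (the table name is a placeholder for whatever tables sit behind vwSomeData and fnGetStuff):
-- Rebuild the indexes; often just refreshing statistics is enough to fix a plan that suddenly went bad.
ALTER INDEX ALL ON dbo.SomeUnderlyingTable REBUILD;
UPDATE STATISTICS dbo.SomeUnderlyingTable WITH FULLSCAN;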
I've had similar problems with joining functions that return large datasets. I had to do what you've already suggested. Put the results in a temp table and join on that.
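For reference, a sketch of that temp-table workaround using the names from the question (assuming @id is declared in the same batch):
-- Materialize the function's ~11,000 rows first, then join to the view.
SELECT *
INTO #stuff
FROM [DifferentDatabase].dbo.fnGetStuff(@id);

SELECT @id, alpha.A, alpha.B, alpha.C,
       beta.X, beta.Y, beta.Z,
       alpha.P, alpha.Q
FROM #stuff AS beta
INNER JOIN vwSomeData AS alpha ON beta.id = alpha.id;

DROP TABLE #stuff;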
Look at the estimated plan, this will probably shed some light. Typically when query cost gets orders of magnitude more expensive it is because a loop or merge join is being used where a hash join is more appropriate. If you see a loop or merge join in the estimated plan, look at the number of rows it expects to process - is it far smaller than the number of rows you know will actually be in play? You can also specify a hint to use a hash join and see if it performs much better. If so, try updating statistics and see if it goes back to a hash join without a hint.
SELECT
    @id,
    alpha.A, alpha.B, alpha.C,
    beta.X, beta.Y, beta.Z,
    alpha.P, alpha.Q
FROM
    [DifferentDatabase].dbo.fnGetStuff(@id) beta
    INNER HASH JOIN vwSomeData alpha ON beta.id = alpha.id
Having no idea what type of schema is in place and just trying to throw out ideas: like others have said, use Profiler and find the source of the pain, but I'm thinking it is the function on the other database. Since that function might be the source of the pain, have you thought about a little denormalization, or anything else, on [DifferentDatabase]? I think you'll find a bit more scalability in joining to a flattened, indexed table than in a costly function.
Run this command:
SET SHOWPLAN_ALL ON
Then run your query. It will display the execution plan; look for a "SCAN" on an index or a table - that is most likely what is happening to your query now. If that is the case, try to figure out why it is not using the indexes any more (refresh statistics, etc.).
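For completeness, here's the full sequence as a sketch; while SHOWPLAN_ALL is on, statements are compiled but not executed, and the plan rows are returned instead:
SET SHOWPLAN_ALL ON;
GO

DECLARE @id INT = 42;   -- illustrative value
SELECT @id, alpha.A, alpha.B, alpha.C, beta.X, beta.Y, beta.Z, alpha.P, alpha.Q
FROM [DifferentDatabase].dbo.fnGetStuff(@id) beta
INNER JOIN vwSomeData alpha ON beta.id = alpha.id;
GO

SET SHOWPLAN_ALL OFF;
GO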