Snowflake offers various cache options and one of them is result cache. I understand that other users can use query result cache to access the result of repeated query (executed within 24 hours) but should they be in the same role OR users with all roles can access the cache results?
If the tool behavior has recently changed, what holds correct for Snowpro certification exam?
But should they be in the same role OR users with all roles can access the cache results?
The result cache should come into play well after access-authorization checks, so the role-membership of the authenticated user or service does not matter.
The cache is permitted for read in the same manner the table (or view) results are, and the Snowflake architecture separates the storage and cache from the compute layers, so the only rules that matter on its use are the ones defined in the documentation.
Typically, query results are reused if all of the following conditions are met:
The user executing the query has the necessary access privileges for
all the tables used in the query.
The new query syntactically matches the previously-executed query. T
The table data contributing to the query result has not changed.
The persisted result for the previous query is still available.
Any configuration options that affect how the result was produced have
not changed.
The query does not include functions that must be evaluated at
execution (e.g. CURRENT_TIMESTAMP()).
The table’s micro-partitions have not changed (e.g. been re-
clustered or consolidated) due to changes to other data in the table.
Ref : https://community.snowflake.com/s/article/Understanding-Result-Caching
I understood that the role accessing the cached results has the required privileges -
-- If the query was a SELECT query, the role executing the query must have the necessary access privileges for all the tables used in the cached query.
-- If the query was a SHOW query, the role executing the query must match the role that generated the cached results.
Related
I want to test query performance . Example:
select * from vw_testrole.
vw_testrole- has lot of joins. Since the data is cached, it is returning in less time. I want to see the query plan and How to see it or clear cache or that I can see original time taken to execute.
Thanks,
Xi
Some extra info, as you are planning to do some "performance tests" to determine the expected execution time for a query.
The USE_CACHED_RESULT parameter disables to use of cached query results. It doesn't delete the existing caches. If you disable it, you can see the query plan (as you wanted), and your query will be executed each time without checking if the result is already available (because of previous runs of the same query). But you should know that Snowflake has multiple caches.
The Warehouse cache: As Simeon mentioned in the comment, Snowflake caches recently accessed the remote data (on the shared storage) in the local disks of the warehouse nodes. That's not easy to clean. Even suspending a warehouse may not delete it.
The Metadata cache - If your query access very big tables and compile time is long because of accessing metadata (for calculating stats etc), then this cache could be very important. When you re-run the query, it will probably read from the metadata cache, and significantly reduce compile time.
The result cache: This is the one you are disabling.
And, running the following commands will not disable it:
ALTER SESSION UNSET USE_CACHED_RESULT=FALSE;
ALTER SESSION UNSET USE_CACHED_RESULT;
The first one will give an error you experienced. The last one will not give an error but the default value is TRUE, so actually, it enables it. The correct command is:
ALTER SESSION SET USE_CACHED_RESULT=FALSE;
You can clear cache by setting ALTER SESSION UNSET USE_CACHED_RESULT;
To get plan of last query Id , you can use below stmt:
select system$explain_plan_json(last_query_id()) as explain_plan;
In my snowflake account I can see that there is a lot of storage used by stages. I can see this for example using the following query:
select *
from table(information_schema.stage_storage_usage_history(dateadd('days',-10,current_date()),current_date()));
There are no named stages in the databases. All used storage must be in internal stages.
How can I find out which internal stages consumes most storage?
If I know the name of the table I can list all files in the table stage using something like this:
list #SCHEMANAME.%TABLENAME;
My problem is that there are hundreds of tables in the databases and I have no idea which tables to query.
There is an ACCOUNT_USAGE view called STAGE_STORAGE_USAGE_HISTORY in the Snowflake database/share that will give you everything, including internal stages. I would use this over the information_schema view, since that is limited to what your role currently has access to.
https://docs.snowflake.com/en/sql-reference/account-usage/stage_storage_usage_history.html
You can use the view STAGES in Information schema or account usage to get the stages. Do note that Account usage has higher retention period than information schema and data retrieval is faster. You can read more here
If I understood correctly,you want to do something with stages to reduce the overall billing or storage size
Snowflake Stages
'Internal Named' and 'External' stages :
These are the only stages which can be altered or dropped and controlled by user
User Stage and Table stages are one which can not be altered or dropped,
managed by snowflake
So even if you could identify those Table and User specific stages , you can not drop those.
Snowflake Storage consumption includes below three components for billing
1. Databases size
2. Stages size
3. Fail Safe size
The size of the storage occupied can be visible (Only when you have Accountadmin role
or MONITOR privs) under below location webUI tab
Account Tab ---> Usage --> Average Storage Used
Note : on the account tab, No DB object wise storage billing details available
So how you can see the storage consumption of the associated tables (including their fail safe and time travel bit) and stage details
select * from <DB Name>."INFORMATION_SCHEMA"."TABLE_STORAGE_METRICS"
select * from <DB Name>."INFORMATION_SCHEMA"."STAGES"
Hope clarification helped
Thanks
Palash Chatterjee
I have a small warehouse in SnowFlake, with minimum clusters = 1, maximum clusters = 5 and scaling policy set to standard. However, when I was viewing the query history profile, I saw that for some of the queries, size column was set to large, but the cluster number remained 1.
Now, I know that autoscaling helps increasing number of clusters, but how did the warehouse size change for some queries without manual intervention?
I referred to the official documentation of SnowFlake here, but couldn't find any ways to automatically change size of warehouse.
Snowflake does not carry any feature that will automatically alter the size of your warehouse.
It is likely that the tools in use (or users) may have run an ALTER WAREHOUSE SET WAREHOUSE_SIZE=LARGE. The purpose may have been to prepare for a larger operation, ensuring adequate performance temporarily.
Use the various history views to find out who/what and when such a change was run. For example, the QUERY_HISTORY view could be useful in finding the username and role that was used to alter the warehouse size, with the following query:
SELECT DISTINCT user_name, role_name, query_text, session_id, start_time
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE 'ALTER%SET%WAREHOUSE_SIZE%=%LARGE%'
AND start_time > CURRENT_TIMESTAMP() - INTERVAL '7 days';
Then you could use LOGIN_HISTORY view to find which IP the user authenticated from during the time (or use the history UI for precise client information), check all other queries executed in the same session, etc.
To prevent unauthorized users from modifying warehouse sizes, consider restricting warehouse-level grants on their roles (rolename in use can be detected by the query above).
Imagine I have a big table with 20 columns and billion lines of data. Then I run a simple query like:
select [First Name], [Last Name]
from Audience;
After that I read the result set sequentially. Will SQL Server physically create all records (i.e. billion records) on the server side in the result set before I will start reading it? Is there any query plan that will build the result set dynamically while feeding it to the client?
I understand that concurrency reasons may prevent this. Can I give any hint that multiuser access is not possible? Maybe I should use cursors?
Depends on the query plan. If the query does not require any temporary internal structures then yes you get immediate response even before the full recordset has been constructed. If the query does require temporary internal storage (e.g. you are sorting it in a manner that doesn't match any index, or an index is available but a different one is used because it requires less I/O) then you will have to wait until the full recordset is constructed.
The only way to tell is to look at the query plan and examine each and every step. You will need to know how to interpret them... for example, a DISTINCT will require a temporary structure whereas a FLOW DISTINCT will not. If the query plan shows an EAGER SPOOL you will definitely have to wait, although there are a few things you can do to avoid them.
Note: You can't rely on this-- query plans can change depending not just on schema or indexes but on database statistics (e.g. selectivity), which are always changing.
SQL Server doesn't allow creating an view with schema binding where the view query uses OpenQuery as shown below.
Is there a way or a work-around to create an index on such a view?
The best you could do would be to schedule a periodic export of the AD data you are interested in to a table.
The table could of course then have all the indexes you like. If you ran the export every 10 minutes and the possibility of getting data that is 9 minutes and 59 seconds out of date is not a problem, then your queries will be lightning fast.
The only part of concern would be managing locking and concurrency during the export time. One strategy might be to export the data into a new table and then through renames swap it into place. Another might be to use SYNONYMs (SQL 2005 and up) to do something similar where you just point the SYNONYM to two alternating tables.
The data that supplies the query you're performing comes from a completely different system outside of SQL Server. There's no way that SQL Server can create an indexed view on data it does not own. For starters, how would it be notified when something had been changed so it could update its indexes? There would have to be some notification and update mechanism, which is implausible because SQL Server could not reasonably maintain ACID for such a distributed, slow, non-SQL server transaction to an outside system.
Thus my suggestion for mimicking such a thing through your own scheduled jobs that refresh the data every X minutes.
--Responding to your comment--
You can't tell whether a new user has been added without querying. If Active Directory supports some API that generates events, I've never heard of it.
But, each time you query, you could store the greatest creation time of all the users in a table, then through dynamic SQL, query only for new users with a creation date after that. This query should theoretically be very fast as it would pull very little data across the wire. You would just have to look into what the exact AD field would be for the creation date of the user and the syntax for conditions on that field.
If managing the dynamic SQL was too tough, a very simple vbscript, VB, or .Net application could also query active directory for you on a schedule and update the database.
Here are the basics for Indexed views and thier requirements. Note what you are trying to do would probably fall in the category of a Derived Table, therefore it is not possible to create an indexed view using "OpenQuery"
This list is from http://www.sqlteam.com/article/indexed-views-in-sql-server-2000
1.View definition must always return the same results from the same underlying data.
2.Views cannot use non-deterministic functions.
3.The first index on a View must be a clustered, UNIQUE index.
4.If you use Group By, you must include the new COUNT_BIG(*) in the select list.
5.View definition cannot contain the following
a.TOP
b.Text, ntext or image columns
c.DISTINCT
d.MIN, MAX, COUNT, STDEV, VARIANCE, AVG
e.SUM on a nullable expression
f.A derived table
g.Rowset function
h.Another view
i.UNION
j.Subqueries, outer joins, self joins
k.Full-text predicates like CONTAIN or FREETEXT
l.COMPUTE or COMPUTE BY
m.Cannot include order by in view definition
In this case, there is no way for SQL Server to know of any changes (data, schema, whatever) in the remote data source. For a local table, it can use SCHEMABINDING etc to ensure the underlying tables(s) stay the same and it can track datachanges.
If you need to query the view often, then I'd use a local table that is refreshed periodically. In fact, I'd use a table anyway. AD queries are't the quickest at the best of times...