Why is avg cache used very high even when the warehouse cold started? - snowflake-cloud-data-platform

We are using Snowflake and trying to evaluate some queries. What I have done is take random samples (163) and run SELECT queries on them:
select * from <table> where sessionkey=<> and sessionstarttime=<>
where sessionkey and sessionstarttime are numerical values.
So when I used a new (or suspended) warehouse, my assumption was that the cache should not be used, but I see more than 95% cache usage. All samples are distinct, not repeats. I am unable to understand this behavior; I see cache utilization ranging from 0 to 95%.
One thought is that the initial queries don't use the cache, and then once queries start running they load partitions into the cache, and somehow the partitions for all these queries are the same. I am not sure; can someone explain the cache behavior?
Also, is there a way we can check the partitions used by a query?
Tushar Goel

You can run the statement below at the start of your session to disable cached results:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
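To answer the second question (which partitions a query touched): Snowflake exposes per-query micro-partition counts in the ACCOUNT_USAGE.QUERY_HISTORY view (note it can lag behind real time). A rough sketch, with the ILIKE filter as a placeholder for your actual query text:

```sql
-- Disable the query result cache for this session, then re-run the tests.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Afterwards, inspect how many micro-partitions each query scanned
-- versus the table total, and how much came from the local disk cache.
SELECT query_id,
       query_text,
       partitions_scanned,
       partitions_total,
       percentage_scanned_from_cache
FROM   snowflake.account_usage.query_history
WHERE  query_text ILIKE '%sessionkey%'   -- placeholder filter
ORDER  BY start_time DESC
LIMIT  20;
```

If many distinct samples happen to land in the same micro-partitions (for example because the table is well clustered by sessionstarttime), later queries will read those partitions from the warehouse's local disk cache, which would explain the high cache percentage even on a freshly resumed warehouse.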

Related

SQL Azure. Create Index recommendation and performance

I got several CREATE INDEX recommendations on Azure SQL S3 tier.
Before going through, I'd like to know some issues during indexing with 10-million records.
Can we know indexing progress or completion time approximately?
Does indexing work in asynchronous (or we can say lazy index) manner? Or it blocks query to the table/database?
Is there anything we need to know about performance degradation during indexing? If so, can we expect amount of degradation?
Does it perform differently from my CREATE INDEX command?
If the database is configured with read-only geo-redundancy, I assume that the index configuration itself is replicated as well. But does the indexing job operate separately?
If the indexing is performed on each replica's own database, the master tier (S3) and the replica (S1) could have different indexing progress. Is that correct?
Can we know indexing progress or completion time approximately?
You can get to know the amount of space that will be used, but not the index creation time. You can track the progress, though, using sys.dm_exec_requests.
Also, with SQL 2016 (Azure compatibility level 130) there is a new DMV called sys.dm_exec_query_profiles, which can track status more accurately than the exec requests DMV.
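As a sketch of the two DMVs mentioned, run from a separate session while the index build is in flight (the session id is a placeholder, and sys.dm_exec_query_profiles requires profiling to be enabled for the target session):

```sql
-- Find the session running the CREATE INDEX; percent_complete is only
-- populated for certain operation types.
SELECT session_id, command, status, percent_complete
FROM   sys.dm_exec_requests
WHERE  command LIKE '%INDEX%';

-- Per-operator progress for that session (SQL 2016 / compat level 130+).
SELECT physical_operator_name, row_count, estimate_row_count
FROM   sys.dm_exec_query_profiles
WHERE  session_id = 55;  -- placeholder: session id found above
```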
Does indexing work in asynchronous (or we can say lazy index) manner? Or it blocks query to the table/database?
There are two ways to create an index:
1. Online
2. Offline
When you create an index online, your table will not be blocked, since SQL Server maintains a separate copy of the index and updates both indexes in parallel.
With the offline approach, you will experience blocking and the table won't be available.
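For example (ONLINE = ON requires Enterprise edition or Azure SQL Database; table and index names are illustrative):

```sql
-- Online build: the table stays readable and writable while it runs.
CREATE INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    WITH (ONLINE = ON);

-- Offline build (the default): takes a table lock for the duration.
CREATE INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate);
```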
Is there anything we need to know about performance degradation during indexing? If so, can we expect amount of degradation?
You will experience additional IO load and increased memory use. This can't be accurately estimated.
Does it perform differently from my CREATE INDEX command?
CREATE INDEX is altogether a separate statement; I am not sure what you meant here.
If the database is configured with read-only geo-redundancy, I assume that the index configuration itself is replicated as well. But does the indexing job operate separately?
If the indexing is performed on each replica's own database, the master tier (S3) and the replica (S1) could have different indexing progress. Is that correct?
Index creation is logged, and the transaction log is replayed on the secondary as well, so there is no need to do index rebuilds on the secondary.

Does SQL Server randomly sort results when no ORDER BY is used? Why?

I have a query in SSMS that gives me the same number of rows but in a different order each time I hit the F5 key. A similar problem is described in this post:
Query returns a different result every time it is run
The response given is to include an ORDER BY clause because, as the response in that post explains, SQL Server guesses the order if you don't give it one.
OK, that does fix it, but I'm confused about what SQL Server is doing. Tables have a physical order whether they are heaps or have clustered indexes. The physical order of each table does not change with every execution of the query, and the query itself does not change either. We should see the same results each time! What's it doing: accessing tables in their physical order and then, instead of displaying the results in that unchanging physical order, randomly sorting the results? Why? What am I missing? Thanks!
Simple: if you want records in a certain order, then ask for them in a certain order.
If you don't ask for an order, SQL Server does not guess; it just does whatever is convenient.
One way that you can get different ordering is if parallelism is at play. Imagine a simple select (i.e. select * from yourTable). Let's say that the optimizer produces a parallel plan for that query and that the degree of parallelism is 4. Each thread will process (roughly) 1/4 of the table. But if yours isn't the only workload on the server, each thread will alternate between the running and runnable states (just by the nature of how the SQLOS schedules threads, they will go into runnable from time to time even if yours is the only workload on the server, but it is exacerbated if you have to share). Since you can't control which threads are running at any given time, and since each thread is going to return its results as soon as it's retrieved them (since it doesn't have to do any joins, aggregates, etc.), the order in which the rows come back is non-deterministic.
To test this theory, try to force a serial plan with the maxdop = 1 query hint.
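For example, using the table name from the illustration above:

```sql
-- Force a serial plan; if the row order is now stable across runs,
-- parallelism was the source of the varying order.
SELECT *
FROM   yourTable
OPTION (MAXDOP 1);
```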
SQL Server uses a set of statistics for each table to assist with speed, joins, etc. If the stats give an ambiguous choice for the fastest route, the choice SQL Server makes can be arbitrary, and could require slightly different indexing to achieve; hence a different output order. The physical order is only a small factor in predicting order. Any indexes, joins, or WHERE clauses can affect the order, as SQL Server will also create and use its own temporary indexes to help satisfy the query if appropriate indexes do not already exist. Try recalculating the statistics on each table involved and see if there is any change or consistency after that.
You are probably not getting a random order each time, but rather an arbitrary choice between a handful of similarly weighted pathways that produce the same result for the query.

Ehcache, Hibernate, updating cache of very large table when a new entry is added?

I'm new to Ehcache and have been searching for how to do this, but I'm not quite sure if it's a normal use case. I am working on an application that isn't a traditional web app; it's only used by a few people at a time and retrieves data from a very large dataset, so rather than making a call to the DB each time, I want to cache this large table. However, there is a chance that a new entry could be added to the table, and I need this reflected in the cache, but I don't want to reload the entire cache each time as it's quite large. Any advice on how to approach this, or further resources, is appreciated.
You should learn about the Hibernate query cache. In simple words: it works on top of the second-level cache (L2) and stores the results of queries. But it only stores the ids of the records that should be returned by the query, rather than the whole entities. This means that you need to have L2 working and fine-tuned.
In your scenario, suppose you have 1M records in table T and a query that returns 1K of them on average. The first time you run this query it will miss the query cache and:
run the SQL
fetch 1K records
put all of them in L2
put 1K ids in query cache
The next time you execute the query it will hit the query cache and look up all the results in L2. The interesting part comes when you modify table T. Hibernate will figure out that the results in the query cache might be stale, and it will invalidate the whole query cache, but not L2. It will basically repeat points 1-4, but refreshing only the query cache (most entities from table T are already in L2).
In some scenarios it works great; in others it introduces N+1 problems at unpredictable moments. This is just the tip of the iceberg; you should be really careful, as this mechanism is very fragile and requires good understanding.

Using a Cache Table in SQLServer, am I crazy?

I have an interesting dilemma. I have a very expensive query that involves several full table scans and expensive joins, as well as a call to a scalar UDF that calculates some geospatial data.
The end result is a resultset that contains data that is presented to the user. However, I can't return everything I want to show the user in one call, because I subdivide the original resultset into pages and just return a specified page, and I also need to take the original entire dataset, and apply group by's and joins etc to calculate related aggregate data.
Long story short, in order to bind all of the data I need to the UI, this expensive query needs to be called about 5-6 times.
So, I started thinking about how I could calculate this expensive query once, and then each subsequent call could somehow pull against a cached result set.
I hit upon the idea of abstracting the query into a stored procedure that would take in a CacheID (Guid) as a nullable parameter.
This sproc would insert the resultset into a cache table using the cacheID to uniquely identify this specific resultset.
This allows sprocs that need to work on this resultset to pass in a cacheID from a previous query and it is a simple SELECT statement to retrieve the data (with a single WHERE clause on the cacheID).
Then, using a periodic SQL job, flush out the cache table.
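The pattern described above might be sketched roughly like this (the table, procedure, UDF, and result columns are all hypothetical placeholders):

```sql
-- Cache table keyed by a GUID that identifies one materialized resultset.
CREATE TABLE dbo.ResultCache (
    CacheId  uniqueidentifier NOT NULL,
    ItemId   int              NOT NULL,  -- hypothetical resultset columns
    Distance float            NULL,
    CachedAt datetime2        NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_ResultCache PRIMARY KEY (CacheId, ItemId)
);
GO

CREATE PROCEDURE dbo.GetExpensiveResult
    @CacheId uniqueidentifier = NULL
AS
BEGIN
    IF @CacheId IS NULL
    BEGIN
        -- Cache miss: run the expensive query once and materialize it.
        SET @CacheId = NEWID();
        INSERT dbo.ResultCache (CacheId, ItemId, Distance)
        SELECT @CacheId, t.ItemId, dbo.SomeGeoUdf(t.Location)  -- placeholder query
        FROM   dbo.SourceTable AS t;
    END

    -- Cache hit (or freshly filled): a cheap single-predicate SELECT.
    SELECT CacheId, ItemId, Distance
    FROM   dbo.ResultCache
    WHERE  CacheId = @CacheId;
END
GO

-- Periodic cleanup job:
-- DELETE dbo.ResultCache WHERE CachedAt < DATEADD(minute, -30, SYSUTCDATETIME());
```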
This works great, and really speeds things up on zero load testing. However, I am concerned that this technique may cause an issue under load with massive amounts of reads and writes against the cache table.
So, long story short: am I crazy, or is this a good idea?
Obviously I need to be worried about lock contention, and index fragmentation, but anything else to be concerned about?
I have done that before, especially when I did not have the luxury of editing the application. I think it's a valid approach sometimes, but in general having a cache/distributed cache in the application is preferred, because it better reduces the load on the DB and scales better.
The tricky thing with the naive "just do it in the application" solution is that many times you have multiple applications interacting with the DB, which can put you in a bind if you have no application messaging bus (or something like memcached), because it can be expensive to have one cache per application.
Obviously, for your problem the ideal solution is to be able to do the paging in a cheaper manner and not need to churn through ALL the data just to get page N. But sometimes that's not possible. Keep in mind that streaming data out of the DB can be cheaper than streaming it out of the DB and back into the same DB. You could introduce a new service that is responsible for executing these long queries, and have your main application talk to the DB via that service.
Your tempdb could balloon like crazy under load, so I would watch that. It might be easier to put the expensive joins in a view and index the view than trying to cache the table for every user.
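An indexed-view sketch of that suggestion, assuming the expensive part is the joins (SCHEMABINDING and a unique clustered index are required, and a view with GROUP BY must include COUNT_BIG(*); all names are illustrative):

```sql
CREATE VIEW dbo.vExpensiveJoin
WITH SCHEMABINDING
AS
SELECT o.OrderId, c.CustomerId, COUNT_BIG(*) AS RowCnt
FROM   dbo.Orders AS o
JOIN   dbo.Customers AS c ON c.CustomerId = o.CustomerId
GROUP BY o.OrderId, c.CustomerId;
GO

-- Materializes the view; subsequent reads hit the stored rows instead
-- of re-running the joins for every user.
CREATE UNIQUE CLUSTERED INDEX IX_vExpensiveJoin
    ON dbo.vExpensiveJoin (OrderId, CustomerId);
```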

SQL Server & update (or insert) parallelism

I've got a large conversion job: 299 GB of JPEG images, already in the database, to be converted into thumbnail equivalents for reporting and bandwidth purposes.
I've written a thread safe SQLCLR function to do the business of re-sampling the images, lovely job.
Problem is, when I execute it in an UPDATE statement (from the PhotoData field to the ThumbData field), it executes serially to prevent race conditions, using only one processor to resample the images.
So, how would I best utilise the 12 cores and phat raid setup this database machine has? Is it to use a subquery in the FROM clause of the update statement? Is this all that is required to enable parallelism on this kind of operation?
Anyway the operation is split into batches, around 4000 images per batch (in a windowed query of about 391k images), this machine has plenty of resources to burn.
Please check the configuration setting for Maximum Degree of Parallelism (MAXDOP) on your SQL Server. You can also set the value of MAXDOP.
This link might be useful to you http://www.mssqltips.com/tip.asp?tip=1047
cheers
Could you not split the query into batches and execute each batch separately on a separate connection? SQL Server only uses parallelism in a query when it feels like it, and although you can stop it, or even encourage it (a little) by changing the "cost threshold for parallelism" option to 0, I think it's pretty hit and miss.
One thing that's worth noting is that SQL Server decides whether or not to use parallelism only at the time the query is compiled. Also, if the query is compiled at a time when the CPU load is higher, SQL Server is less likely to consider parallelism.
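The two server-wide knobs mentioned in this thread can be inspected and changed with sp_configure (the values here are only examples):

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cap the degree of parallelism per query (0 = use all available cores).
EXEC sp_configure 'max degree of parallelism', 12;

-- Estimated cost above which the optimizer considers a parallel plan.
EXEC sp_configure 'cost threshold for parallelism', 5;
RECONFIGURE;
```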
I too recommend the "round-robin" methodology advocated by kragen2uk and onupdatecascade (I'm voting them up). I know I've read something irritating about CLR routines and SQL parallelism, but I forget what it was just now... I think they don't play well together.
What I've done in the past on similar tasks is to set up a table listing each batch of work to be done. Each connection you fire up goes to this table, gets the next batch, marks it as being processed, processes it, updates it as done, and repeats. This allows you to gauge performance, manage scaling, allow stops and restarts without having to start over, and gives you something to show how complete the task is (let alone show that it's actually doing anything).
Find some criteria to break the set into distinct sub-sets of rows (1-100, 101-200, whatever) and then call your update statement from multiple connections at the same time, where each connection handles one subset of rows in the table. All the connections should run in parallel.
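A sketch of the batch-claiming table described above, assuming the photos are keyed by a numeric id and the SQLCLR function is named ResampleImage (both are assumptions):

```sql
CREATE TABLE dbo.WorkBatch (
    BatchId int IDENTITY PRIMARY KEY,
    FirstId int NOT NULL,
    LastId  int NOT NULL,
    Status  varchar(10) NOT NULL DEFAULT 'Pending'  -- Pending / Working / Done
);
GO

-- Each connection loops on this block: claim one batch atomically,
-- resample its rows, then mark the batch done. READPAST lets concurrent
-- workers skip rows another connection has already locked.
DECLARE @BatchId int, @FirstId int, @LastId int;

UPDATE TOP (1) dbo.WorkBatch WITH (UPDLOCK, READPAST)
SET    Status   = 'Working',
       @BatchId = BatchId,
       @FirstId = FirstId,
       @LastId  = LastId
WHERE  Status = 'Pending';

IF @BatchId IS NOT NULL
BEGIN
    UPDATE dbo.Photos
    SET    ThumbData = dbo.ResampleImage(PhotoData)  -- the SQLCLR UDF
    WHERE  PhotoId BETWEEN @FirstId AND @LastId;

    UPDATE dbo.WorkBatch SET Status = 'Done' WHERE BatchId = @BatchId;
END
```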
