Snowflake caching not working with Snowpark Python 1.0 - snowflake-cloud-data-platform

Snowflake caches the results of already executed queries, so that subsequent executions of the same query finish much faster. As far as I know, this only works if the query is exactly the same.
In my app (an interactive dashboard which uses Snowpark-Python 1.0 to fire Snowflake queries), this caching does not seem to work. Each time the (same Snowpark) query is fired, Snowflake runs the query again.
Depending on whether the warehouse cache is active or not (blue vs green bar), the execution time ranges from several hundred milliseconds up to 2 s. The result is not read from the cache.
I think the cause is that the generated SQL contains random components (column and table name suffixes) which are different for each query.
How can I make use of the Snowflake cache using Snowpark-generated queries?
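To illustrate the suspected cause, here is a minimal sketch of a toy result cache keyed on the exact query text. It assumes, as the question does, that reuse requires identical text. The ToyResultCache class, the fake_execute function and the suffixed alias names are all hypothetical; this is neither Snowflake's actual mechanism nor real Snowpark output.

```python
import hashlib


class ToyResultCache:
    """Toy cache keyed on the exact SQL text (NOT Snowflake's real mechanism)."""

    def __init__(self):
        self._results = {}
        self.hits = 0
        self.misses = 0

    def run(self, sql, execute):
        # The key is a hash of the raw text, so any character difference is a miss.
        key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        result = execute(sql)
        self._results[key] = result
        return result


def fake_execute(sql):
    # Hypothetical stand-in for a real warehouse round trip.
    return [("row",)]


cache = ToyResultCache()

# Identical text twice: the second call is served from the toy cache.
cache.run("SELECT a FROM t", fake_execute)
cache.run("SELECT a FROM t", fake_execute)

# Same logic, but with made-up random suffixes in the generated alias:
cache.run("SELECT a FROM (SELECT * FROM t) AS tmp_x1y2", fake_execute)
cache.run("SELECT a FROM (SELECT * FROM t) AS tmp_q9z8", fake_execute)

print(cache.hits, cache.misses)
```

In this toy model only the literally repeated statement hits the cache; the two suffixed variants both miss even though they are logically the same query.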

Related

Identical simultaneous queries and caching

Say we have two servers making identical queries to the same database roughly once an hour, and the database gets updated rarely (every 30 minutes). Getting the result back fast is not important, but we would like the data warehouse running for as short a time as possible.
Should we make sure that one of the queries completes before the other begins, so that the result is cached? Is Snowflake smart enough to realize when it is being asked to run two identical queries, and only do the work once?
According to the documentation, Snowflake persists query results for 24 hours. So if the query has not changed (and the underlying table data has not changed either), Snowflake does not regenerate the results. We have tested this in all our applications.
The link is below; please check it and let me know if this helps.
https://docs.snowflake.com/en/user-guide/querying-persisted-results.html

Does index rebuilding + updating statistics actually fix the issue of skewed cached plans?

Background
We have an export functionality which allows the user to export measurements from many devices for a certain time range.
The database is continuously being updated with sensor data from many devices (at 10 to 50 measurements per second from each device). So since these CSV exports often contain millions of rows (each row containing data from several different devices, where each device is in a separate database), it was written to fetch data from these tables in chunks (I guess to avoid slowing down inserts, since exporting is a lower priority task).
Problem
When the first chunk is queried, SQL Server will create an execution plan which fits the passed parameters. However, at the beginning of the export, it's possible that ranges of data are missing due to low connectivity, or are flagged differently due to time sync errors, meaning the following queries reusing this cached plan will possibly not get the optimal plan for their parameters.
One option is to add OPTION (RECOMPILE) to each query, but many sources claim that this will impose an unnecessary burden on the CPU and that merely updating statistics on a regular basis should help SQL Server create a better plan.
However, this doesn't make much sense to me, because even if the plan cache is invalidated, the next time a user runs the export, the first chunk will again dictate the "shape" of the execution plan.
Is this reasoning correct?
Should I actually use OPTION (RECOMPILE) with each query, or just add an "Update Statistics" maintenance plan?
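For reference, the first option could be sketched as follows: split the export range into chunks and append OPTION (RECOMPILE) to each chunk query so that every chunk is compiled for its own parameters. The table and column names (dbo.Measurements, DeviceId, MeasuredAt, Value), the one-hour chunk size and the helper names are hypothetical, and the cursor is assumed to be a DB-API cursor whose driver accepts "?" placeholders.

```python
from datetime import datetime, timedelta


def chunk_ranges(start, end, step):
    """Yield half-open (chunk_start, chunk_end) windows covering [start, end)."""
    current = start
    while current < end:
        nxt = min(current + step, end)
        yield current, nxt
        current = nxt


# Hypothetical schema; the query hint goes at the very end of the statement.
CHUNK_SQL = (
    "SELECT DeviceId, MeasuredAt, Value "
    "FROM dbo.Measurements "
    "WHERE MeasuredAt >= ? AND MeasuredAt < ? "
    "ORDER BY MeasuredAt "
    "OPTION (RECOMPILE);"
)


def export_chunks(cursor, start, end, step=timedelta(hours=1)):
    """Run one recompiled query per chunk and yield its rows in order."""
    for lo, hi in chunk_ranges(start, end, step):
        cursor.execute(CHUNK_SQL, (lo, hi))
        yield from cursor.fetchall()
```

Whether the extra compile per chunk is acceptable depends on how many chunks an export issues; this sketch only shows where the hint would go.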

Clear SQL Azure execution plan / query cache

I have a few "inefficient" queries that I am trying to debug on Azure SQL (v12). The problem I have is that after the query executes for the first time (albeit taking many seconds), Azure appears to cache the query / execution plan. I have done some research and several people have suggested that adding and removing a column will clear the cache, but this doesn't seem to work. If I leave the server alone for a few hours / overnight and re-run the query, it takes its usual time to execute, but once again the cache is in place. This makes it very hard to optimise my query. Does anyone know how to force Azure SQL to not cache my queries / execution plans?
ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE is designed to help with this problem.
https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-database-scoped-configuration-transact-sql?view=sql-server-2017
This is closest to the DBCC FREEPROCCACHE you have in SQL Server but is scoped to a database instead of the server instance. This does not prevent caching of query plans - it just invalidates the current cache entries.
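While tuning, it can help to clear the plan cache right before each timed run. A minimal sketch, assuming a DB-API style cursor connected to the database being debugged; the helper name time_after_cache_clear is hypothetical, and the connection setup is left out because it depends on the driver in use.

```python
import time

# Statement from the answer above; it is scoped to the current database.
CLEAR_PLAN_CACHE_SQL = "ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;"


def time_after_cache_clear(cursor, sql):
    """Clear cached plans, then run sql and return the elapsed seconds.

    Only plan cache entries are invalidated, so the measured time includes
    a fresh compile of the query.
    """
    cursor.execute(CLEAR_PLAN_CACHE_SQL)
    started = time.perf_counter()
    cursor.execute(sql)
    cursor.fetchall()  # drain the result set so the full run is timed
    return time.perf_counter() - started
```

Running this in a loop while changing the query or indexes gives comparable timings, since each run starts without a previously cached plan.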
Please note that the query store is there to help you in SQL Azure (on-by-default). It stores a history of plan choices and plan performance (per-plan). So, if you have a prior plan that performs better available in the history of your application, you can force it using SSMS if you'd prefer to have the query optimizer pick this plan each time your query compiles. One common reason for what you are seeing is parameter-sensitivity in the plan choice where the optimizer will use the passed parameter value to try to generate the query plan, assuming it is representing a common pattern when you run that query. If that value is actually not close to a common value (in terms of how frequent it is in the table), then you can sometimes compile and cache a plan that is not better on average for your application.
Query store has an overview here:
https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-2017
Note that SQL Azure also has an automated mechanism to try forcing prior plans if it notices a performance regression. It is somewhat conservative, however, so it may not kick in for every single regression until it sees an obvious pattern over time. So, while you can force things in SSMS, you can also potentially just wait (assuming this is the issue you were seeing).

Parallel Bulk Inserting with SqlBulkCopy and Azure

I have an Azure app in the cloud with a SQL Azure database. I have a worker role which needs to do parsing and processing on a file (up to ~30 million rows), so I can't directly use BCP or SSIS.
I'm currently using SqlBulkCopy; however, this seems too slow, as I've seen load times of up to 4-5 minutes for 400k rows.
I want to run my bulk inserts in parallel. However, the articles on importing data in parallel and controlling lock behaviour say that SqlBulkCopy requires that the table has no clustered index and that a table lock (BU lock) is specified. However, Azure tables must have a clustered index...
Is it even possible to use SqlBulkCopy in parallel on the same table in SQL Azure? If not is there another API (that I can use in code) to do this?
I don't see how you can run any faster than using SqlBulkCopy. On our project we can import 250K rows in about 3 mins, so your rate seems about right.
I don't think that doing it in parallel would help, even if it was technically possible. We only run 1 import at a time otherwise SQL Azure starts timing out our requests.
In fact, sometimes running a large group-by query at the same time as the import isn't possible. SQL Azure does a lot of work to ensure quality of service; this includes timing out requests that take too long, take too many resources, etc.
So doing several large bulk inserts at the same time will probably cause one to time out.
It is possible to run SqlBulkCopy in parallel against SQL Azure, even if you load the same table. You need to prepare your records in batches yourself before sending them to the SqlBulkCopy API. This will absolutely help with performance, and it allows you to control retry operations for a smaller batch of records when you get throttled for reasons outside of your own doing.
Take a look at my blog post comparing load times of various approaches. There is sample code as well. In separate tests I was able to cut the load time of a table in half.
This is the technique I am using for a couple of tools (Enzo Backup; Enzo Data Copy). It's not a simple thing to do, but when done properly you can optimize load times significantly.
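The batching-plus-retry idea can be sketched independently of the bulk copy API. This is a minimal Python sketch, not the code from the blog post: load_batch is a hypothetical callable standing in for whatever sends one batch to the server (for example a wrapper around a bulk copy call), and the batch size, worker count and attempt limit are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_batches(rows, batch_size):
    """Split rows into consecutive lists of at most batch_size items."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def load_with_retry(load_batch, batch, max_attempts=3, backoff_seconds=0.0):
    """Call load_batch(batch), retrying only this batch if it raises.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            load_batch(batch)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Simple linear backoff before retrying the same small batch.
            time.sleep(backoff_seconds * attempt)


def parallel_load(rows, load_batch, batch_size=10_000, workers=4):
    """Load all batches on a thread pool; returns attempts per batch, in order."""
    batches = make_batches(rows, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: load_with_retry(load_batch, b), batches))
```

Because a throttled request only fails its own batch, a retry resends a few thousand rows instead of restarting the whole file. The sketch assumes load_batch is safe to call from several threads, for instance by opening its own connection per call.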

What's the best way to profile a SQL Server 2005 database for performance?

What techniques do you use? How do you find out which jobs take the longest to run? Is there a way to find out the offending applications?
Step 1:
Install the SQL Server Performance Dashboard.
Step 2:
Profit.
Seriously, you do want to start with a look at that dashboard. More about installing and using it can be found here and/or here
To identify problematic queries, start the Profiler and select the following events:
TSQL:BatchCompleted
TSQL:StmtCompleted
SP:Completed
SP:StmtCompleted
Filter the output, for example by:
Duration > x ms (for example 100ms, depends mainly on your needs and type of system)
CPU > y ms
Reads > r
Writes > w
Depending on what you want to optimize.
Be sure to filter the output enough that you don't have thousands of data rows scrolling through your window, because that will impact your server performance!
It's helpful to log the output to a database table so you can analyse it afterwards.
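Once the trace has been saved, the same thresholds can be applied offline. A minimal sketch, assuming the rows have been read out of the trace table into dictionaries with Duration, CPU, Reads and Writes keys, and assuming Duration and CPU are already expressed in milliseconds (check the units of your saved trace first). The function name, the sample rows and the threshold values are made up for illustration.

```python
def slow_events(rows, min_duration_ms=100, min_cpu_ms=None,
                min_reads=None, min_writes=None):
    """Keep rows that exceed every threshold that is set (None = not applied)."""
    thresholds = {
        "Duration": min_duration_ms,
        "CPU": min_cpu_ms,
        "Reads": min_reads,
        "Writes": min_writes,
    }
    active = {name: limit for name, limit in thresholds.items() if limit is not None}
    return [row for row in rows
            if all(row[name] > limit for name, limit in active.items())]


# Made-up sample rows shaped like saved trace output.
SAMPLE_ROWS = [
    {"TextData": "SELECT big report", "Duration": 850, "CPU": 40, "Reads": 52000, "Writes": 0},
    {"TextData": "SELECT lookup", "Duration": 12, "CPU": 1, "Reads": 30, "Writes": 0},
    {"TextData": "UPDATE totals", "Duration": 300, "CPU": 200, "Reads": 900, "Writes": 1500},
]

for row in slow_events(SAMPLE_ROWS, min_duration_ms=100, min_reads=20000):
    print(row["TextData"], row["Duration"], row["Reads"])
```

Which thresholds to enable depends on what you want to optimize, exactly as with the live Profiler filters.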
It's also helpful to run Windows System Monitor in parallel to view CPU load, disk I/O and some SQL Server performance counters. Configure System Monitor to save the data to a file.
Then you need a production-typical query load and data volume on your database to see meaningful values in Profiler.
After getting some output from Profiler, you can stop profiling.
Then load the stored data from the profiling table back into Profiler, and use the import menu to import the output from System Monitor; Profiler will correlate the System Monitor output with your trace data. That's a very nice feature.
In that view you can immediately identify bottlenecks in your memory, disk or CPU subsystem.
When you have identified some queries you want to optimize, go to Query Analyzer, look at the execution plan, and try to optimize index usage and query design.
I have had good success with the Database Tuning tools provided inside SSMS or SQL Profiler when working on SQL Server 2000.
The key is to work with a GOOD sample set: track a portion of TRUE production workload for analysis. That will get you the best overall bang for the buck.
I use the SQL Profiler that comes with SQL Server. Most of the poorly performing queries I've found are not using a lot of CPU but are generating a ton of disk IO.
I tend to put in filters on disk reads and look for queries that tend to do more than 20,000 or so reads. Then I look at the execution plan for those queries which usually gives you the information you need to optimize either the query or the indexes on the tables involved.
I use a few different techniques.
If you're trying to optimize a specific query, use Query Analyzer. Use the tools in there like displaying the execution plan, etc.
For your situation where you're not sure WHICH query is running slowly, one of the most powerful tools you can use is SQL Profiler.
Just pick the database you want to profile, and let it do its thing.
You need to let it run for a decent amount of time (how long depends on the traffic to your application), and then you can dump the results into a table and start analyzing them.
You are going to want to look at queries that have a lot of reads, or take up a lot of CPU time, etc.
Optimization is a bear, but keep going at it, and most importantly, don't assume you know where the bottleneck is. Find proof of where it is and fix it.
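One way to find that proof in the dumped results is to total a metric per statement and rank the statements. A minimal sketch, assuming the trace rows are available as dictionaries with a TextData key plus numeric Reads and CPU keys; the function name and the sample values are made up for illustration.

```python
from collections import defaultdict


def rank_queries(rows, metric="Reads", top=5):
    """Return (text, executions, total_metric) tuples, largest total first."""
    totals = defaultdict(lambda: {"count": 0, "total": 0})
    for row in rows:
        entry = totals[row["TextData"]]
        entry["count"] += 1
        entry["total"] += row[metric]
    ranked = sorted(totals.items(), key=lambda item: item[1]["total"], reverse=True)
    return [(text, stats["count"], stats["total"]) for text, stats in ranked[:top]]


# Made-up trace rows: a cheap query that runs often and an expensive rare one.
TRACE_ROWS = [
    {"TextData": "SELECT lookup", "Reads": 400, "CPU": 2},
    {"TextData": "SELECT lookup", "Reads": 450, "CPU": 3},
    {"TextData": "SELECT big report", "Reads": 52000, "CPU": 40},
    {"TextData": "SELECT lookup", "Reads": 420, "CPU": 2},
]

for text, executions, total in rank_queries(TRACE_ROWS, metric="Reads"):
    print(text, executions, total)
```

Ranking by total rather than per-execution cost also surfaces cheap statements that run so often that they dominate the load.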

Resources