A Snowflake query running on a medium warehouse was re-run twice during resource contention. Can someone share a reference on how Snowflake fails and then retries the execution of the same query? What advantage do we gain from this approach?
The query profile shows the following steps: Step 1, Step 2, Step 1002, Step 2002.
As it has been explained to us by support: when a query fails due to internal errors, Snowflake retries the query, but on the prior software stack version, to see whether the failure was introduced by a recent update on their side.
You do get the joy of paying for both runs. If this is happening a lot, you should open a support case, as there is a repeating bug that should be addressed or worked around. In some situations we have had the dead time from the failed run refunded.
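If you want to see how often this is hitting you, recent versions of the ACCOUNT_USAGE.QUERY_HISTORY view expose retry details; a sketch (the QUERY_RETRY_* columns are an assumption here, as they may not exist on older accounts):

    -- Queries that Snowflake retried in the last 7 days
    SELECT query_id,
           warehouse_name,
           total_elapsed_time,
           query_retry_time_ms,   -- time spent on retry attempts
           query_retry_cause      -- error that triggered the retry
    FROM   snowflake.account_usage.query_history
    WHERE  start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
    AND    query_retry_time_ms > 0
    ORDER  BY query_retry_time_ms DESC;

Queries showing up here repeatedly are the ones worth attaching to a support case.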
Related
I am evaluating Snowflake to find out whether it can be used for a DWH, and I am concerned about query behavior when a warehouse somehow fails.
https://docs.snowflake.com/en/user-guide/warehouses-considerations.html#multi-cluster-warehouses-improve-concurrency
According to the page above, setting the minimum cluster count higher than 1 helps ensure availability and continuity.
I have questions about it.
1. If we set it to 1 and the warehouse fails, do queries that are in progress fail?
2. If we set it to 2 or more and one cluster of the warehouse fails, do queries that are in progress fail and then restart automatically on another cluster?
When a warehouse fails, a new warehouse is automatically started and the query is retried. In the 6 years at my prior job that we ran Snowflake, there were fewer than a dozen times where we experienced warehouse failure.
It was often around releases being rolled out. One thing that does happen on failure is that the release is pushed back. So we noticed blips in our processing rate, or increased total time, and during the trouble periods the query profile might show 1-3 tabs of query plan, one for each retry.
At least one of those outages was in the failure-to-bring-up-new-warehouses class of problem, and in that incident I don't think we were impacted, as we had work running continuously anyway.
A side note: you also get billed for those failures, so that can bite if you are doing large computations and a run fails and retries. We have had refunds (of the extra cost) when we could show there was a cost increase due to a known failure event.
But if you are running Medium and smaller warehouses, those normally start within the same second, so you might not even notice a "failure"; if you are running a really large instance size, it can take longer to bring that capacity online.
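For reference, a sketch of setting the multi-cluster bounds the linked page describes (the warehouse name is made up; multi-cluster warehouses require Enterprise Edition):

    -- MIN_CLUSTER_COUNT = 2 keeps a second cluster available,
    -- so work can continue if one cluster fails
    ALTER WAREHOUSE my_wh SET
      MIN_CLUSTER_COUNT = 2
      MAX_CLUSTER_COUNT = 4
      SCALING_POLICY = 'STANDARD';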
I have a C# application executing a constant flow of SQL statements (queries, inserts, updates) against a SQL Server (2019 Standard if that matters) database. Most of these operations take a few milliseconds (ca. 2-50) to execute.
However, in the course of the day there are occasional cases where the same SQL statement is several orders of magnitude slower and takes seconds to execute, for example 5 or 10 or even >30 seconds, in which case the operation is aborted.
The issue may not happen for several days, then occur 1-2 times per day, and there are also phases of 10-15 minutes with several such cases. It all looks pretty random.
I am desperately searching for the cause of this behaviour.
What I have found out and ruled out so far:
the issue is not specific to a particular statement. The issue is also not specific to certain data. It is the exact same statement (with minor differences in the actual values) that takes 5 milliseconds in most cases, but sometimes so much longer.
I believe that the issue is limited to inserts & updates. I have not seen a query causing the issue so far.
the issue is not specific to a particular date/time of the day. Neither does it appear to be a workload bottleneck. My application is the only software using that SQL Server, and the problem happens even at times when workload is pretty low. Everything runs in a single VMware virtual machine, but the IT department claims that the machine has no issues, and I have no evidence to prove otherwise.
Quite interestingly, I have seen cases where a particular statement would "hang" while similar other statements execute in milliseconds at the same time. (my application is a multi-threaded application)
The issue does not seem to be caused by a deadlock. SQL Server seems to detect deadlock situations and throw a specific error in such cases, which does not happen here.
it seems as if the statement is blocked or held up by something. I have also seen rare cases where a statement would be submitted & hang, another statement is later also submitted and also hangs, until the first statement completes or is aborted, after which the second statement also completes pretty much immediately (see the sketch after this list).
the application may perform multiple transactions in parallel. However, these take no longer than 5-100 milliseconds. It is therefore not plausible that one statement would hang for several seconds waiting for some other long-running transaction to finish.
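To catch what a hung statement is actually waiting on, my plan is to run something like this while a hang is in progress (a minimal sketch using standard DMVs):

    -- Run while a statement is hanging: shows what each active request is
    -- waiting on and which session (if any) is blocking it
    SELECT r.session_id,
           r.status,
           r.wait_type,
           r.wait_time,            -- ms in the current wait
           r.blocking_session_id,  -- 0 = not blocked by another session
           t.text AS sql_text
    FROM   sys.dm_exec_requests r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
    WHERE  r.session_id > 50;      -- skip system sessions

Alternatively, setting sp_configure 'blocked process threshold' and tracing the blocked process report events should capture the same information without having to catch a hang live.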
After searching through my code and days of logs, I am running out of ideas. Needless to say, I was not able to reproduce the issue in a development environment.
How can I retrieve more information why individual statements take so long and identify the root cause?
Possible theories/suspects:
could SQL Server be limited or hang due to hardware/OS resources, such as IO? I am pretty sure it is not CPU or memory, and network should not matter with everything on the same machine. How would I find out about that? (See the sketch after this list.)
could it be that problematic statements trigger some SQL Server internal catch-up, e.g. flushing caches, etc.? That should, however, not take several seconds - I hope.
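For the I/O theory above, one starting point is the per-file stall counters (a sketch; these numbers are cumulative since the instance started):

    -- High stall times relative to the number of reads/writes
    -- point at intermittently slow storage
    SELECT DB_NAME(vfs.database_id) AS db,
           mf.physical_name,
           vfs.num_of_reads,  vfs.io_stall_read_ms,
           vfs.num_of_writes, vfs.io_stall_write_ms
    FROM   sys.dm_io_virtual_file_stats(NULL, NULL) vfs
    JOIN   sys.master_files mf
           ON  mf.database_id = vfs.database_id
           AND mf.file_id     = vfs.file_id
    ORDER  BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;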
Any help would be much appreciated.
I have spent a lot of time optimizing a query for our DB Admin. When I run it in our test server and our live server I get similar results. However, once it is actually running in production with the query being utilized a lot it runs very poorly.
I could post code but the query is over 1400 lines and it would take a lot of time to obfuscate. All data types match and I am using indices on my queries. It is broken down into 58 temp tables. When I test it using SQL Sentry it uses 707 CPU cycles, 90,007 reads, and takes 1.2 seconds to run a particular exec statement. The same parameters in production last week used 10,831 CPU cycles, 2.9 million reads, and took 33.9 seconds to run.
My question is what could be making the optimizer run more cycles and reads in production than my one off tests? Like I mentioned, I could post code if needed, but I am looking for answers that point me in a direction to troubleshoot such a discrepancy. This particular procedure is run a lot during our Billing cycle so it is hitting the server hundreds of times a day, it will become thousands as we near the 15th of the month.
Thanks
ADDENDUM:
I know how to optimize queries, as this is a regular part of my job duties. As I stated in a comment, my test runs don't usually differ this much from actual production. I don't know the SQL Server side of things and wondered if there was something I needed to be aware of that might affect my query when the server is under heavier load. This may be outside the scope of this forum, but I thought I would reach out for ideas from the community.
UPDATE:
I am still troubleshooting this, just replying to some of the comments.
The execution plans are the same between my one-off tests and the production-level executions. I am testing in the same environment, on the same server as production. This is a procedure for report data. The data returned, the records, and the tables hit are all the same. I am testing with the same parameters that during production took astronomical amounts of time to process; the difference between what I am doing and what happened during the production run is the load on the server. Not all production executions take a long time; the vast majority are within acceptable CPU and read thresholds, but the outliers show a huge discrepancy: 500 times the CPU and 150 times the reads of the average execution (even with the same parameters).
I have zero control over the server side of things. I only can control how to code the proc. I realize that my proc is large and without it, it is probably impossible to get a good answer on this forum. I also realize that even with the code posted here I would not likely get a good answer due to the size of the procedure.
I was/am looking only for insights, directions of things to look at, using anecdotal evidence of issues other developers have overcome when dealing with similar problems. Comments that state the size of my code is the reason why performance is in the toilet, and that code that size is rarely needed, are not helpful and quite frankly ignorant. I am working with legacy c# code and a database that deals with millions of transactions for billing. There are thousands of tables in dozens of interconnected databases with an ERD that would blow your mind, I am optimizing a nightmare. That said, I am very good at it, and when myself and the database administrators are stumped as to why we see such stark numbers I thought I would widen my net and see if this larger community had any ideas.
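In case it helps, one way I am quantifying the outliers without touching the proc is to sample the procedure-level DMV (a sketch; the proc name below is a stand-in):

    -- Per-plan stats for the procedure; a big min/max gap on the same plan
    -- suggests contention or waits rather than a plan change
    SELECT ps.execution_count,
           ps.min_worker_time,   ps.max_worker_time,
           ps.min_logical_reads, ps.max_logical_reads,
           ps.min_elapsed_time,  ps.max_elapsed_time
    FROM   sys.dm_exec_procedure_stats ps
    WHERE  ps.object_id = OBJECT_ID('dbo.MyBillingProc');  -- hypothetical name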
Below is an image showing a report of the top 32 executions for this procedure in a 15 min window. Even among the top 32 the numbers are not consistent. The image below that shows all of the temp tables and main query that I just ran on the same server for #1 resource hog of the first image. The rows are different temp tables with a sum at the bottom. The sum shows 1.5 (1.492) seconds to run with 534 CPU and 92,045 reads. Contrast that with the 33.9 seconds, 10,831 CPU, and 2.9 million reads yesterday:
I am attempting to reduce CXPACKET waits in my SQL Server 2012 databases.
I am going to adjust the MAXDOP and Cost Threshold for Parallelism in order to do so. See Brent Ozar's article.
In order to gauge the effect that the changes have on wait times I am tracking the wait time every 15 mins using sys.dm_os_wait_stats and following this advice. I want to take 2 weeks of readings before I adjust and 2 weeks after.
But, I am also interested in tracking overall query performance. What would be a good way to see how the queries are performing before and after the changes - over the same 2 week before / 2 week after timeframe? Are there sprocs that will give me this data?
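For the wait side, one low-overhead approach I am considering is to snapshot the cumulative DMV into a table on a schedule and diff the snapshots afterwards; a minimal sketch (the table name is made up):

    -- One-time setup: a table to hold the snapshots
    CREATE TABLE dbo.WaitStatsHistory (
        capture_time   DATETIME     NOT NULL DEFAULT GETDATE(),
        wait_type      NVARCHAR(60) NOT NULL,
        waiting_tasks  BIGINT       NOT NULL,
        wait_time_ms   BIGINT       NOT NULL,
        signal_wait_ms BIGINT       NOT NULL
    );

    -- Run every 15 minutes (e.g. from a SQL Agent job)
    INSERT dbo.WaitStatsHistory (wait_type, waiting_tasks, wait_time_ms, signal_wait_ms)
    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM   sys.dm_os_wait_stats;

The same snapshot-and-diff pattern should work against sys.dm_exec_query_stats (execution_count, total_worker_time, total_elapsed_time) for the query-performance side; the counters are cumulative since the last restart, so it is the deltas between snapshots that matter.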
In an environment where a lot of large queries are run, you'll always see CXPacket towards the top of the list of waits, and it's not an issue. The only issue is when it starts skyrocketing over what your normal waits are. In those cases it typically means that your stats are so far off that you're not getting an accurate execution plan, and you need to look into better index maintenance.
From what you have said, there's no reason I'd lower MaxDOP based on that alone. Instead I'd go with the typical advice of up to 8 with no more than the number of cores in a single NUMA node for OLTP.
In your case, I'd start looking at your most expensive queries using Query Stats along with your largest queries according to a server-side trace. If you're able to tune the biggest offenders you see here then you'll have a faster server along with a lowered CXPacket wait type.
As a bit of a disclaimer, I am the author of the wait stats article you mentioned as well as the articles linked to in this response.
Please feel free to reply here for this question or on my blog for any others that come up.
CXPACKET waits don't necessarily indicate a problem with parallelism - it's usually a symptom of some other problem.
When a query goes parallel, let's say across 10 threads, and one of those 10 threads takes longer than the others to finish its work, the other 9 threads are going to accumulate CXPACKET waits.
What other high wait types are you seeing?
Based on the above, it seems you could use some utility to help you monitor query performance.
I can suggest ApexSQL Monitor (a commercial tool, but offers a free trial) that can store historic information, so you can review details before and after reducing MAXDOP - and you can review the overall query performance.
Besides that, you can also find a good explanation of possible causes of CXPACKET waits and MAXDOP considerations in this article: http://www.sqlshack.com/troubleshooting-the-cxpacket-wait-type-in-sql-server/. Here are some key takeaways from the article:
The steps that are recommended in diagnosing the cause of high CXPACKET wait stats values (before making any knee-jerk reaction and changing something on SQL Server):
Do not set MAXDOP to 1, as this is never the solution
Investigate the query and CXPACKET history to understand and determine whether it is something that occurred just once or twice, as it could be just the exception in the system that is normally working correctly
Check the indexes and statistics on tables used by the query and make sure they are up to date
Check the Cost Threshold for Parallelism (CTFP) and make sure that the value used is appropriate for your system (see the sketch after this list)
Check whether the CXPACKET is accompanied by a LATCH_XX (possibly with PAGEIOLATCH_XX or SOS_SCHEDULER_YIELD as well). If this is the case, then the MAXDOP value should be lowered to fit your hardware
Check whether the CXPACKET is accompanied by a LCK_M_XX (usually along with IO_COMPLETION and ASYNC_IO_COMPLETION). If this is the case, then parallelism is not the bottleneck; troubleshoot those wait types to find the root cause and its solution
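For reference, both knobs mentioned in the list are server-wide settings changed via sp_configure; a sketch (the values are examples, not recommendations):

    -- Both are advanced, server-wide options
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Raise the cost threshold so only genuinely expensive plans go parallel
    EXEC sp_configure 'cost threshold for parallelism', 50;

    -- Cap parallelism (e.g. at 8, or the size of one NUMA node if smaller)
    EXEC sp_configure 'max degree of parallelism', 8;
    RECONFIGURE;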
Sorry for the long introduction but before I can ask my question, I think giving the background would help understanding our problem much better.
We are using SQL Server 2008 as the backend for our web services, and from time to time it takes too long to respond to requests that are supposed to run really fast, like taking more than 20 seconds for a SELECT that queries a table with only 22 rows. We went through many potential areas that could cause the issue, from indexes to stored procedures, triggers, etc., and tried to optimize whatever we could, like removing indexes that are written frequently but rarely read, or adding NOLOCK to our SELECT queries to reduce locking of the tables (we are OK with dirty reads).
We also had our DBAs review the server and benchmark the components for bottlenecks in CPU, memory, or the disk subsystem, and found that hardware-wise we are OK as well. And since the spikes occur only occasionally, it is really hard to reproduce the error in production or development, because most of the time when we rerun the same query it yields the short response times we expect, not the long ones experienced earlier.
Having said that, I have long been suspicious of I/O, although it did not seem to be a bottleneck. But I think I was just able to reproduce the error after running an index fragmentation report for a specific table on the server, which immediately caused spikes not only in requests against that table but also in requests that query other tables. And since the DB and the server are shared with other applications we use, and long-running queries on the server and database are a common scenario for us, my suspicion of an occasional I/O bottleneck is, I believe, becoming a fact.
Therefore I want to find a way to prioritize requests coming from the web services, so they get processed even while other resource-intensive queries are running. I have been looking for this kind of prioritization since the very beginning of the resolution process and found that SQL Server 2008 has a feature called 'Resource Governor' that allows prioritization of requests.
However, since I am not an expert on Resource Governor, nor a DBA, I would like to hear from people who have used or are using Resource Governor, and to know whether I can prioritize I/O for a specific login or a specific stored procedure. (For example, if one I/O-intensive process is running at the time we receive a web service request, can SQL Server stop, or slow down, I/O activity for that process and give priority to the request we just received?)
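From what I have read so far, a basic Resource Governor configuration looks roughly like the sketch below ('WebSvcLogin' is a placeholder for our web service login). One caveat I came across: in SQL Server 2008, Resource Governor caps CPU and memory, not I/O (I/O limits only arrived in SQL Server 2014), so it may only indirectly help with an I/O bottleneck.

    -- Pool and workload group for web-service sessions
    -- (MIN_CPU_PERCENT only takes effect under CPU contention)
    CREATE RESOURCE POOL WebSvcPool
        WITH (MIN_CPU_PERCENT = 50, MAX_CPU_PERCENT = 100);
    CREATE WORKLOAD GROUP WebSvcGroup USING WebSvcPool;
    GO

    -- Classifier (must live in master): route our web-service login
    CREATE FUNCTION dbo.rg_classifier() RETURNS SYSNAME
    WITH SCHEMABINDING
    AS
    BEGIN
        IF SUSER_SNAME() = N'WebSvcLogin'   -- hypothetical login name
            RETURN N'WebSvcGroup';
        RETURN N'default';
    END;
    GO

    ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
    ALTER RESOURCE GOVERNOR RECONFIGURE;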
Thanks in advance to anyone who spends time reading or helping out.
Some Hardware Details:
CPU: 2x Quad Core AMD Opteron 8354
Memory: 64GB
Disk Subsystem: Compaq EVA8100 series (I am not sure, but it should be RAID 0+1 across 8 HP HSV210 SCSI drives)
PS: And I am almost 100 percent sure that the application servers are not causing the error; there is no bottleneck we can identify there.
Update 1:
I'll try to answer the questions gbn asked below as best I can. Please let me know if you are looking for something else.
1) What kind of index and statistics maintenance do you have please?
We have a weekly job that defrags indexes every Friday. In addition, Auto Create Statistics and Auto Update Statistics are enabled. And the spikes occur at times other than the fragmentation job as well.
2) What kind of write data volumes do you have?
Hard to answer. In addition to our web services, there is a front-end application that accesses the same database, and to my knowledge resource-intensive queries periodically need to be run. However, I don't know how to get the weekly or daily write volume to the DB.
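One rough way I have since found to measure this is to snapshot the cumulative write counters twice and subtract (a sketch):

    -- Cumulative bytes written per database since the instance last started;
    -- sample twice (e.g. 24 hours apart) and diff to get a daily volume
    SELECT DB_NAME(database_id) AS db,
           SUM(num_of_bytes_written) / 1048576.0 AS mb_written
    FROM   sys.dm_io_virtual_file_stats(NULL, NULL)
    GROUP  BY database_id
    ORDER  BY mb_written DESC;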
3) Have you profiled Recompilation and statistics update events?
Sorry, I was not able to figure this one out. I didn't understand what you are asking with this question. Can you provide more information, if possible?
First thought: statistics are being updated because the data-change threshold is reached, causing execution plans to be rebuilt.
What kind of index and statistics maintenance do you have please? Note: index maintenance updates index stats, not column stats: you may need separate stats updates.
What kind of write data volumes do you have?
Have you profiled Recompilation and statistics update events?
In response to question 3) of your update to the original question, take a look at the following reference on SQL Server Pedia. It provides an explanation of what query recompiles are and also goes on to explain how you can monitor for these events. What I believe gbn is asking (feel free to correct me, sir :-) ) is: are you seeing recompile events prior to the slow execution of the troublesome query? You can look for this using SQL Server Profiler.
Reasons for Recompiling a Query Execution Plan
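If Profiler feels heavyweight as a first step, a quick related check is when the statistics on the suspect tables were last updated (a sketch; swap in your own table name):

    -- Last statistics update per statistic on a table; a timestamp just
    -- before a slow spell supports the stats-update/recompile theory
    SELECT s.name AS stat_name,
           STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM   sys.stats s
    WHERE  s.object_id = OBJECT_ID('dbo.MyTable');  -- hypothetical table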