We sometimes experience CommandTimeout exceptions on relatively simple DELETE queries.
They normally execute in a second or two, but occasionally hit our 300-second CommandTimeout.
So far we have seen it happen in two places, both of which send a large number of DELETE statements from loops. In one case a TADOStoredProc with parameters is used; in the other, a TADOCommand with a plain DELETE statement (no parameters). It does not happen in a predictable fashion: two executions of the same code can give different results. The exception is rare, and so far we have gotten past it by re-running the program.
What could be the cause of these timeouts?
What would the best solution be? Currently we are considering async execution and resending the command on timeout.
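Purely as an illustration of the resend-on-timeout idea (sketched in Python with pyodbc rather than Delphi/ADO; the connection string, table name and error handling are assumptions, not taken from our code):

import pyodbc

def delete_with_retry(conn, sql, params, attempts=3):
    # re-issue the DELETE if it times out; any other error propagates
    for attempt in range(attempts):
        try:
            conn.cursor().execute(sql, params)
            conn.commit()
            return
        except pyodbc.OperationalError:  # query timeouts typically surface here ('HYT00')
            conn.rollback()              # discard the failed attempt
            if attempt == attempts - 1:
                raise

conn = pyodbc.connect("DSN=mydb")  # placeholder connection string
conn.timeout = 300                 # per-statement timeout, in seconds
delete_with_retry(conn, "DELETE FROM Orders WHERE Id = ?", (42,))

Note that this only masks the symptom; finding what blocks the DELETE is still worthwhile.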
I have a task running every five minutes on a table 'original_table' in Snowflake.
For backup purposes I copy the data into another table 'back_up_table', and I need to swap 'original_table' with 'back_up_table'.
Do I have to pause the task on 'original_table' before the swap and then resume it?
Short answer: no, there is nothing technically requiring that you stop the task.
However, depending on what you are doing, you may get some unexpected results.
For example, if the task is actively running an INSERT when the swap happens, you may have a bit of a race condition, and it will be unpredictable which table gets that new record.
If the task is doing something more complicated, like if it is also creating/swapping tables, you could get an error in the task or some truly weird results that would be hard to troubleshoot.
So if your task is able to recover after a failure on follow-up runs, and you're not worried about transient results/race conditions, then it's probably fine to leave it running.
Edit: as other comments suggested, for backing up tables you might want to look into the CLONE feature, if you haven't already, for a possibly cleaner overall solution; a sketch follows below.
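If you do want to be conservative about it, here is a minimal sketch of the suspend/swap/resume sequence (and of a CLONE-based backup) using the snowflake-connector-python driver; the task name, table names and connection details are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

cur.execute("ALTER TASK my_task SUSPEND")                          # stop new runs
cur.execute("ALTER TABLE original_table SWAP WITH back_up_table")  # atomic name swap
cur.execute("ALTER TASK my_task RESUME")                           # start it again

# Alternatively, a zero-copy clone avoids the swap entirely:
# cur.execute("CREATE OR REPLACE TABLE back_up_table CLONE original_table")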
I have a bit of a strange problem. I have a module running on GAE that puts a whole lot of little tasks on the default task queue. The tasks access the same ndb models. Each task reads a bunch of data from a few different tables and then calls put.
The first few tasks work fine but as time continues I start getting these on the final put:
suspended generator _put_tasklet(context.py:358) raised TransactionFailedError(too much contention on these datastore entities. please try again.)
So I wrapped the put in a try and put in a randomised timeout so it retries a couple of times. This mitigated the problem a little; it just happens later on.
Here is some pseudocode for my task:
import logging
import random
import time

logger = logging.getLogger(__name__)

def my_task(request):
    stuff = get_ndb_instances()   # this reads a few things from different tables
    better_stuff = process(stuff) # pretty much just a summation
    try_put(better_stuff)
    return {'status': 'Groovy'}

def try_put(oInstance, iCountdown=10):
    if iCountdown < 1:
        return oInstance.put()    # last attempt: let any error propagate
    try:
        return oInstance.put()
    except Exception:
        logger.info("sleeping")
        time.sleep(random.random() * 20)
        return try_put(oInstance, iCountdown - 1)
Without using try_put the queue gets about 30% of the way through until it stops working. With the try_put it gets further, like 60%.
Could it be that a task is holding onto ndb connections after it has completed somehow? I'm not making explicit use of transactions.
EDIT:
There seems to be some confusion about what I'm asking. The question is: why does ndb contention get worse as time goes on? I have a lot of tasks running simultaneously, and they access ndb in a way that can cause contention. If contention is detected, a randomly timed retry happens, and this eliminates the contention perfectly well, for a little while. Tasks keep running and completing, and the more that successfully return, the more contention happens, even though the processes using the contended data should be finished. Is something holding onto datastore handles that shouldn't be? What's going on?
EDIT2:
Here is a little bit about the key structures in play:
My ndb models sit in a hierarchy where we have something like this (the direction of the arrows specifies parent-child relationships, i.e. a Type has a bunch of child Instances, etc.):
Type->Instance->Position
The ids of the Positions are limited to a few different names; there are many thousands of Instances and not many Types.
I calculate a bunch of Positions and then do a try_put_multi (similar to try_put in an obvious way) and get contention. I'm going to run the code again pretty soon and get a full traceback to include here.
Contention will get worse over time if you continually exceed the recommended limit of one write/transaction per entity group per second. The answer lies in how Megastore/Paxos works and how Cloud Datastore handles contention in the backend.
When 2 writes are attempted at the same time on different nodes in Megastore, one transaction will win and the other will fail. Cloud Datastore detects this contention and will retry the failed transaction several times. Usually this results in the transaction succeeding without any errors being raised to the client.
If sustained writes above the recommended limit are being attempted, the chance that a transaction needs to be retried multiple times increases. The number of transactions in an internal retry state also increases. Eventually, transactions will start reaching our internal retry limit and will return a contention error to the client.
A randomized sleep is an incorrect way to handle error-response situations. You should instead look into exponential back-off with jitter (example).
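A minimal sketch of that in Python (the attempt limits and caps are illustrative; ideally you would catch TransactionFailedError rather than Exception):

import random
import time

def call_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0):
    # exponential back-off with "full jitter": after each failure, sleep a
    # random amount between 0 and min(cap, base * 2**attempt)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # ideally: TransactionFailedError
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# e.g. call_with_backoff(better_stuff.put) instead of try_put(better_stuff)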
Similarly, the core of your problem is a high write rate into a single entity group. You should look into whether the explicit parenting is required (and remove it if not), or whether you should shard the entity group in some manner that makes sense for your queries and consistency requirements.
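Purely as an illustration of the sharding idea (the shard count and key naming are made up, not taken from your model):

import hashlib
from google.appengine.ext import ndb

NUM_SHARDS = 16  # each shard is a separate entity group, so writes spread out

def sharded_parent(type_id, instance_id):
    # hash the instance id into one of NUM_SHARDS parent keys per Type,
    # instead of making every Instance a child of the same Type entity
    shard = int(hashlib.md5(instance_id).hexdigest(), 16) % NUM_SHARDS
    return ndb.Key('Type', '%s-shard-%d' % (type_id, shard))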
I have just analyzed a call to a particular SQL Server stored procedure. It is very slow, so I decided to dig into it.
The result is confusing:
When executing the procedure it takes 1 min 24s.
When executing two identical calls simultaneously it takes 2 min 50 s. Obviously some bad blocking and context switching is happening.
When switching on Actual Execution Plan and running one call, it takes 4 min 48 s. The client statistics now tell me the total execution time is 3000 ms.
What is happening?
Does the Actual Execution Plan actually interfere that much?
If so, there should be a warning somewhere that the execution time will be MUCH longer and that the statistics will be WAY off.
The procedure is not huge in size, but it has nasty complexity: cursors, selects nested five levels deep, temporary tables, sub-procedures and function calls.
There are tons of articles on the web discussing why using cursors is bad practice and should be avoided where possible. Depending on database settings, a cursor is registered in the global namespace, making concurrency close to impossible; it is absolutely normal that two instances of your proc take exactly twice as long. Cursor performance and locking are further points of consideration. So, having said that, even a WHILE loop over, say, a temp table with an identity column to loop on might be a better solution than a cursor.
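To illustrate that cursor-free pattern (driven from Python with pyodbc here; the table name and the per-row work are placeholders):

import pyodbc

LOOP_SQL = """
SELECT IDENTITY(int, 1, 1) AS rn, id INTO #work FROM dbo.SomeTable;
DECLARE @i int = 1, @max int = (SELECT MAX(rn) FROM #work), @id int;
WHILE @i <= @max
BEGIN
    SELECT @id = id FROM #work WHERE rn = @i;
    -- per-row work on @id goes here instead of FETCH NEXT
    SET @i += 1;
END
DROP TABLE #work;
"""

conn = pyodbc.connect("DSN=mydb", autocommit=True)  # placeholder DSN
conn.execute(LOOP_SQL)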
My experience is that turning on the actual execution plan always adds some time, and I have noticed that the more complex the plan, the bigger the effect (so I guess it is largely SSMS handling it)... but in your case the effect looks larger than I would expect, so there might be something else going on.
I have a very popular site in ASP.NET MVC/SQL Server, and unfortunately a lot of deadlocks occur. While I'm trying to figure out why they occur via the SQL Profiler, I wonder how I can change SQL Server's default behavior when a deadlock happens.
Is it possible to re-run the transaction(s) that caused problems instead of showing the error screen?
Remus's answer is fundamentally flawed. According to https://stackoverflow.com/a/112256/14731, a consistent locking order does not prevent deadlocks; the best we can do is reduce their frequency.
He is wrong on two points:
The implication that deadlocks can be prevented. You will find that both Microsoft and IBM publish articles about reducing the frequency of deadlocks; nowhere do they claim you can prevent them altogether.
The implication that all deadlocks require you to re-evaluate the state and come to a new decision. It is perfectly correct to retry some actions at the application level, so long as you go back far enough to the decision point.
Side note: Remus's main point is that the database cannot automatically retry the operation on your behalf, and he is completely right on that count. But this doesn't mean that re-running operations is the wrong response to a deadlock.
You are barking up the wrong tree. You will never succeed in getting automated deadlock retries from the SQL engine; such a concept is fundamentally wrong. The very definition of a deadlock is that the state you based your decision on has changed, therefore you need to read the state again and make a new decision. If your process has deadlocked, by definition another process has won the deadlock, and that means it has changed something you've read.
Your only focus should be on figuring out why the deadlocks occur and eliminating the cause. Invariably, the cause will turn out to be queries that scan more data than they should. While it is true that other types of deadlock can occur, I bet that is not your case. Many problems will be solved by deploying appropriate indexes. Some problems will send you back to the drawing board and you will have to rethink your requirements.
There are many, many resources out there on how to identify and solve deadlocks:
Detecting and Ending Deadlocks
Minimizing Deadlocks
You may also consider using snapshot isolation, since the lock-free reads involved in snapshot isolation reduce the surface on which deadlocks can occur (i.e. only write-write deadlocks can occur). See Using Row Versioning-based Isolation Levels.
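For reference, a sketch of turning it on (the database name is a placeholder; switching READ_COMMITTED_SNAPSHOT needs a moment when no other connections are active):

import pyodbc

conn = pyodbc.connect("DSN=master", autocommit=True)  # placeholder DSN
# allow SNAPSHOT transactions, and make plain READ COMMITTED reads use row versions
conn.execute("ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON")
conn.execute("ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON")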
Frequent deadlocks are often an indication that you do not have the correct indexes and/or that your statistics are out of date. Do you have regularly scheduled index rebuilds as part of your maintenance?
Your save code should automatically retry saves when error 1205 is returned (deadlock occurred). There is a standard pattern that looks like this:
try
{
    SaveChanges(); // placeholder for your actual save call
}
catch (SqlException ex)
{
    if (ex.Number == 1205)
    {
        // Handle deadlock by retrying the save...
    }
    else
    {
        throw;
    }
}
The other option is to retry within your stored procedures. There is an example of that here: Using TRY...CATCH in Transact-SQL
One option in addition to those suggested by Mitch and Remus, since your comments suggest you're looking for a fast fix: if you can identify the queries involved in the deadlocks, you can influence which of them gets rolled back and which continues by setting DEADLOCK_PRIORITY for each query, batch or stored procedure.
Looking at your example in the comment to Mitch's answer:
"Let's say deadlock occurs on page A, but page B is trying to access the locked data. The error will be displayed on page B, but it doesn't mean that the deadlock occurred on page B. It still occurred on page A."
If you consistently see a deadlock occurring between the queries issued from page A and page B, you can influence which page gets the error and which completes successfully (see the sketch below). As the others have said, you cannot automatically force a retry.
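For example, a sketch of marking the page-B session as the preferred victim (connection string and query are placeholders, shown here from Python with pyodbc):

import pyodbc

conn = pyodbc.connect("DSN=mydb")  # placeholder DSN
cur = conn.cursor()
# LOW makes this session the preferred deadlock victim, so the page-A
# query (left at the default priority NORMAL) wins and completes
cur.execute("SET DEADLOCK_PRIORITY LOW")
rows = cur.execute("SELECT * FROM dbo.PageBTable").fetchall()  # placeholder query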
Post a question with the problem queries and/or the deadlock trace output and there's a good chance you'll get an explanation of why it's occurring and how it could be fixed.
In some cases, you can do the following. Between BEGIN TRAN and COMMIT it is all or nothing: on success @errorcount is set to 0 and the loop ends; on failure the CATCH block rolls back, decreases the counter by 1, and the transaction is retried. It may not work if you feed variables into the code from outside the BEGIN TRAN/COMMIT. Just an idea :)
DECLARE @errorcount int = 4;  -- number of attempts
WHILE @errorcount > 0
BEGIN
    BEGIN TRY
        BEGIN TRAN;
        <your code here>
        COMMIT;
        SET @errorcount = 0;  -- success: leave the loop
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK;
        SET @errorcount = @errorcount - 1;  -- failure: try again
        IF @errorcount = 0 THROW;           -- out of attempts: re-raise the error
    END CATCH
END
I've got a large conversion job: 299 GB of JPEG images, already in the database, to be converted into thumbnail equivalents for reporting and bandwidth purposes.
I've written a thread-safe SQLCLR function to do the business of re-sampling the images, lovely job.
Problem is, when I execute it in an UPDATE statement (from the PhotoData field to the ThumbData field), it executes linearly to prevent race conditions, using only one processor to resample the images.
So, how would I best utilise the 12 cores and phat RAID setup this database machine has? Is it to use a subquery in the FROM clause of the UPDATE statement? Is that all that's required to enable parallelism on this kind of operation?
Anyway, the operation is split into batches of around 4,000 images each (in a windowed query of about 391k images), and this machine has plenty of resources to burn.
Please check the Maximum Degree of Parallelism (MAXDOP) configuration setting on your SQL Server; you can also change its value.
This link might be useful: http://www.mssqltips.com/tip.asp?tip=1047
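For reference, a sketch of checking/changing it (shown from Python with pyodbc; this needs sysadmin rights, and 0 means "use all available CPUs"):

import pyodbc

conn = pyodbc.connect("DSN=mydb", autocommit=True)  # placeholder DSN
conn.execute("EXEC sp_configure 'show advanced options', 1; RECONFIGURE;")
conn.execute("EXEC sp_configure 'max degree of parallelism', 0; RECONFIGURE;")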
Could you not split the query into batches, and execute each batch separately on a separate connection? SQL Server only uses parallelism in a query when it feels like it, and although you can stop it, or even encourage it (a little) by changing the 'cost threshold for parallelism' option to 0, I think it's pretty hit and miss.
One thing that's worth noting is that it will only decide whether or not to use parallelism at the time the query is compiled. Also, if the query is compiled at a time when the CPU load is higher, SQL Server is less likely to consider parallelism.
I too recommend the round-robin methodology advocated by kragen2uk and onupdatecascade (I'm voting them up). I know I've read something irritating about CLR routines and SQL parallelism, but I forget what it was just now... but I think they don't play well together.
What I've done in the past on similar tasks is to set up a table listing each batch of work to be done. Each connection you fire up goes to this table, gets the next batch, marks it as being processed, processes it, updates it as done, and repeats; a sketch of that claim-and-process loop follows below. This allows you to gauge performance, manage scaling, allow stops and restarts without having to start over, and gives you something to show how complete the task is (let alone show that it's actually doing anything).
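Roughly, and purely as an illustration (the table, columns and per-batch helper are assumed names):

import pyodbc

# atomically claim one pending batch; the UPDATE's lock prevents two
# workers from grabbing the same row
CLAIM_SQL = """
UPDATE TOP (1) dbo.WorkBatches
SET status = 'Processing'
OUTPUT inserted.batch_id, inserted.first_id, inserted.last_id
WHERE status = 'Pending';
"""

def worker(conn_str):
    conn = pyodbc.connect(conn_str, autocommit=True)
    cur = conn.cursor()
    while True:
        row = cur.execute(CLAIM_SQL).fetchone()
        if row is None:
            break  # nothing left to claim
        batch_id, first_id, last_id = row
        process_batch(cur, first_id, last_id)  # assumed helper doing the real UPDATE
        cur.execute("UPDATE dbo.WorkBatches SET status = 'Done' WHERE batch_id = ?",
                    batch_id)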
Find some criteria to break the set into distinct subsets of rows (1-100, 101-200, whatever) and then call your UPDATE statement from multiple connections at the same time, where each connection handles one subset of rows in the table. All the connections should run in parallel; something like the sketch below.
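A minimal sketch of that fan-out, assuming pyodbc and made-up table/column names (dbo.Resample stands in for the SQLCLR function):

from concurrent.futures import ThreadPoolExecutor
import pyodbc

CONN_STR = "DSN=mydb"  # placeholder connection string

def update_range(bounds):
    first_id, last_id = bounds
    conn = pyodbc.connect(CONN_STR, autocommit=True)  # one connection per range
    try:
        conn.execute(
            "UPDATE dbo.Photos SET ThumbData = dbo.Resample(PhotoData) "
            "WHERE PhotoId BETWEEN ? AND ?",
            first_id, last_id)
    finally:
        conn.close()

ranges = [(i, i + 3999) for i in range(1, 391001, 4000)]  # ~4000-image batches
with ThreadPoolExecutor(max_workers=12) as pool:          # one worker per core
    list(pool.map(update_range, ranges))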