I'm trying to figure out why I would be getting a deadlock error when executing a simple query inside a thread. I'm running CF10 with SQL Server 2008 R2, on a Windows 2012 server.
Once per day, I've got a process that caches a bunch of blog feeds in a database. For each blog feed, I create a thread and do all the work inside it. Sometimes it runs fine with no errors; other times I get the following error in one or more of the threads:
[Macromedia][SQLServer JDBC Driver][SQLServer]Transaction (Process ID
57) was deadlocked on lock resources with another process and has been
chosen as the deadlock victim. Rerun the transaction.
This deadlock condition happens when I run a query that sets a flag indicating that the feed is being updated. Obviously, this query could happen concurrently with other threads that are updating other feeds.
From my research, I think I can solve the problem by putting an exclusive named lock around the query, but why would I need to do that? I've never had to deal with deadlocks before, so forgive my ignorance on the subject. How is it possible that I can run into a deadlock condition?
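(Side note for anyone debugging the same thing: SQL Server 2008 R2's default system_health Extended Events session already records deadlock graphs, so a sketch like the one below - assuming the default ring_buffer target is still in place - will show exactly which statements and resources collided, rather than leaving you to guess.)
SELECT XEventData.XEvent.query('.') AS DeadlockReport
FROM (
    -- pull the raw ring-buffer XML from the always-on system_health session
    SELECT CAST(target_data AS xml) AS TargetData
    FROM sys.dm_xe_session_targets st
    JOIN sys.dm_xe_sessions s ON s.address = st.event_session_address
    WHERE s.name = 'system_health'
      AND st.target_name = 'ring_buffer'
) AS Data
CROSS APPLY TargetData.nodes('RingBufferTarget/event[@name="xml_deadlock_report"]')
    AS XEventData (XEvent);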
Since there's too much code to post, here's a rough algorithm:
thread name="#createUUID()#" action="run" idBlog=idBlog {
    try {
        var feedResults = getFeed(idBlog);
        if (feedResults.errorCode != 0)
            throw(message="failed to get feed");
        transaction {
            /* just a simple query to set a flag */
            dirtyBlogCache(idBlog); /* this is where I get the deadlock */
            cacheFeedResults(idBlog, feedResults);
        }
    } catch (any e) {
        reportError(e);
    }
} /* thread */
This approach has been working well for me.
<cffunction name="runQuery" access="private" returntype="query">
    <!--- <cfargument> tags here, if necessary --->
    <cfset var whatever = QueryNew("a")>
    <cfquery name="whatever">
        <!--- sql --->
    </cfquery>
    <cfreturn whatever>
</cffunction>
attempts = 0;
myQuery = "not a query";
while (attempts <= 3 && isQuery(myQuery) == false) {
    attempts += 1;
    try {
        myQuery = runQuery();
    }
    catch (any e) {
        /* swallow the deadlock error and let the loop try again */
    }
}
After all, the message does say to re-run the transaction.
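If you'd rather push that retry down to the database, here is a rough T-SQL sketch of the same idea - catch error 1205 (the deadlock-victim error) and rerun a bounded number of times. The table and column names are made up:
DECLARE @attempts int = 0,
        @maxAttempts int = 3,
        @idBlog int = 42;  -- hypothetical input

WHILE @attempts < @maxAttempts
BEGIN
    BEGIN TRY
        BEGIN TRAN;
        UPDATE dbo.BlogFeed SET IsDirty = 1 WHERE idBlog = @idBlog;  -- the flag update
        COMMIT;
        BREAK;  -- success, stop retrying
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK;
        SET @attempts += 1;
        IF ERROR_NUMBER() = 1205 AND @attempts < @maxAttempts
            WAITFOR DELAY '00:00:00.200';  -- brief pause, then rerun
        ELSE
        BEGIN
            -- not a deadlock, or out of retries: surface the error
            DECLARE @msg nvarchar(2048) = ERROR_MESSAGE();
            RAISERROR(@msg, 16, 1);
            BREAK;
        END
    END CATCH
END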
Related
I have some long-running commands in stored procedures that are at risk of timing out, and I run them using Context.Database.ExecuteSqlCommand.
It would appear that when a command times out, it leaves a lock in the database because the transaction is not rolled back.
I found an explanation for that here: CommandTimeout – How to handle it properly?
Based on the linked example I changed my code to:
Database database = Context.Database;
try
{
    return database.ExecuteSqlCommand(sql, parameters);
}
catch (SqlException e)
{
    // Transactions can stay open after a CommandTimeout,
    // so need to roll back any open transactions
    if (e.Number == -2) // CommandTimeout occurred
    {
        // A single rollback exits all levels of nested transactions;
        // no need to loop.
        database.ExecuteSqlCommand("IF @@TRANCOUNT > 0 ROLLBACK TRAN;");
    }
    throw;
}
However, that threw an exception inside the catch, because the connection is now null:
ArgumentNullException: Value cannot be null.
Parameter name: connection
Following the comments from Annie and usr I changed my code to this:
Database database = Context.Database;
using (var tran = database.BeginTransaction())
{
    try
    {
        int result = database.ExecuteSqlCommand(sql, parameters);
        tran.Commit();
        return result;
    }
    catch (SqlException)
    {
        var debug = database.SqlQuery<Int16>("SELECT @@SPID");
        tran.Rollback();
        throw;
    }
}
I really thought that would do it, but the locks in the database continue to accumulate when I set my CommandTimeout to a really small value to test it out.
I put a breakpoint at the throw, so I know the transaction has been rolled back. The debug variable tells me the session id, and when I check my locks with SELECT * FROM sys.dm_tran_locks, I find a match for that session id in request_session_id - but it's a lock that was already there, not one of the new ones, so I'm a bit confused.
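For reference, a slightly more targeted look at what a given session is holding (a sketch; the @spid value is hypothetical, standing in for whatever debug returned):
DECLARE @spid int = 57;  -- hypothetical: the session id captured in debug

SELECT resource_type,
       request_mode,
       request_status,
       DB_NAME(resource_database_id) AS database_name,
       CASE WHEN resource_type = 'OBJECT'
            THEN OBJECT_NAME(resource_associated_entity_id, resource_database_id)
       END AS locked_object  -- only resolvable for OBJECT-level locks
FROM sys.dm_tran_locks
WHERE request_session_id = @spid;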
So, how should I properly handle CommandTimeout when using ExecuteSqlCommand to ensure locks are released immediately?
I downloaded sp_whoisactive and ran it, the spid appears to be linked to a query on tables used by Hangfire - I am using Hangfire to run the long running queries in a background process. So, I think that perhaps I am barking up the wrong tree. I did have a problem with locking but I've rewritten my own queries to avoid locking too many rows and I've disabled lock escalation on the tables where I had a problem. These last locks may be coming from Hangfire, and may not be significant, nonetheless I've decided to go with XACT_ABORT ON for now.
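Roughly what the XACT_ABORT route looks like (a sketch with made-up procedure and table names): with SET XACT_ABORT ON, the attention event that a CommandTimeout sends causes SQL Server to roll back the open transaction instead of leaving it, and its locks, dangling.
CREATE PROCEDURE dbo.LongRunningWork  -- hypothetical procedure
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;  -- a client-side timeout now aborts and rolls back

    BEGIN TRAN;
    UPDATE dbo.BigTable SET Processed = 1 WHERE Processed = 0;  -- hypothetical work
    COMMIT;
END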
When trying to use multiple successive webApi.post inside a for loop,
for (i = 1; i <= indexLoops; i++) {
    $2sxc(@Dnn.Module.ModuleID).webApi.post('app/auto/content/entity', {}, newItem);
}
if it repeats 2-3 times, it works fine, but if "i" reaches 4+, it will give this error:
The server failed to resume the transaction. Desc:d100000001. The transaction active in this session has been committed or aborted by another session.
Creating a 200ms delayed loop fixes it.
Is this a SQL Server limitation or a 2sxc controller issue? Can it affect users if they both attempt a post simultaneously?
Inspired by this, I wrote a simple mutex on Cassandra 2.1.4.
Here is how the lock/unlock (pseudo)code looks:
public boolean lock(String uuid) {
    try {
        Statement stmt = new SimpleStatement("INSERT INTO LOCK (id) VALUES (?) IF NOT EXISTS", uuid);
        stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
        ResultSet rs = session.execute(stmt);
        if (rs.wasApplied()) {
            return true;
        }
    } catch (Throwable t) {
        Statement stmt = new SimpleStatement("DELETE FROM LOCK WHERE id = ?", uuid);
        stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(stmt); // DATA DELETED HERE REAPPEARS!
    }
    return false;
}

public void unlock(String uuid) {
    try {
        Statement stmt = new SimpleStatement("DELETE FROM LOCK WHERE id = ?", uuid);
        stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(stmt);
    } catch (Throwable t) {
        // best effort: a failed unlock is silently swallowed
    }
}
Now, in a high-load test, I am able to recreate at will a situation where a WriteTimeoutException is thrown in lock(). This means the data may or may not have been written. After this my code deletes the lock - and again a WriteTimeoutException is thrown. However, the lock remains (or reappears).
Why is this?
Now I know I can easily put a TTL on this table (for this use case), but how do I reliably delete that row?
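For completeness, the TTL mitigation is roughly this in CQL (the 30-second window is an assumed value, not something from the original setup):
-- give every row in the table a default lifetime:
ALTER TABLE LOCK WITH default_time_to_live = 30;

-- or set it per acquisition:
INSERT INTO LOCK (id) VALUES (?) IF NOT EXISTS USING TTL 30;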
My guess on seeing this code is that it makes a common error in distributed-systems programming: the assumption that, in case of failure, your attempt to correct the failure will succeed.
In the above code you check that the initial write is successful, but don't make sure that the "rollback" is also successful. This can lead to a variety of unwanted states.
Let's imagine a few scenarios with Replicas A, B and C.
Client creates Lock but an error is thrown. The lock is present on all replicas but the client gets a timeout because that connection is lost or broken.
State of System
A[Lock], B[Lock], C[Lock]
We have an exception on the client and attempt to undo the lock by issuing a delete but this fails with an exception back at the client. This means the system can be in a variety of states.
0 Successful Writes of the Delete
A[Lock], B[Lock], C[Lock]
All quorum requests will see the Lock. There exists no combination of replicas which would show us the Lock has been removed.
1 Successful Write of the Delete
A[Lock], B[Lock], C[]
In this case we are still vulnerable. Any request which excludes C from the quorum call will miss the deletion: if only A and B are polled, then we'll still see the lock existing.
2/3 Successful Writes of the Delete (Quorum CL Is Met)
A[Lock], B[], C[]
In this case we have once more lost the connection to the driver, but internally the delete request was successfully replicated to a quorum. These are the only scenarios in which we are actually safe and future reads will not see the Lock: any quorum read must include at least one replica holding the newer tombstone, which masks the stale Lock on A.
Conclusion
One of the tricky things with situations like this is that if you fail to make your lock correctly because of network instability, it is also unlikely that your correction will succeed, since it has to work in the exact same environment.
This may be an instance where CAS operations can be beneficial. But in most cases it is better not to attempt distributed locking at all, if possible.
I am trying to handle near-simultaneous input to my Entity Framework application. Members (users) can rate things, so I have a table for their ratings, where one column is the member's ID, one is the ID of the thing they're rating, one is the rating, and another is the time they rated it. The most recent rating is supposed to override earlier ratings. When I receive input, I check whether the member has already rated the thing: if they have, I update the rating in the existing row, and if they haven't, I add a new row. I noticed that when input comes in from the same user for the same item at nearly the same time, I end up with two ratings for that user for the same thing.
Earlier I asked this question: How can I avoid duplicate rows from near-simultaneous SQL adds? I followed the suggestion to add a SQL constraint requiring unique combinations of MemberID and ThingID, which makes sense, but I am having trouble getting this technique to work, probably because I don't know the syntax for doing what I want when the exception occurs. The exception says the constraint was violated, and what I would like to do then is forget the attempted illegal addition of a row with the same MemberID and ThingID, fetch the existing row instead, and simply set its values to this slightly more recent data. However, I have not been able to come up with a syntax that will do that. I have tried a few things, and I always get an exception when I try to SaveChanges after catching the first exception - either the unique constraint comes up again, or I get a deadlock exception.
The latest version I tried was like this:
// Get the member's rating for the thing, or create it.
Member_Thing_Rating memPref = (from mip in _myEntities.Member_Thing_Rating
                               where mip.thingID == thingId
                               where mip.MemberID == memberId
                               select mip).FirstOrDefault();
bool RetryGet = false;
if (memPref == null)
{
    using (TransactionScope txScope = new TransactionScope())
    {
        try
        {
            memPref = new Member_Thing_Rating();
            memPref.MemberID = memberId;
            memPref.thingID = thingId;
            memPref.EffectiveDate = DateTime.Now;
            _myEntities.Member_Thing_Rating.AddObject(memPref);
            _myEntities.SaveChanges();
        }
        catch (Exception ex)
        {
            Thread.Sleep(750);
            RetryGet = true;
        }
    }
    if (RetryGet == true)
    {
        memPref = (from mip in _myEntities.Member_Thing_Rating
                   where mip.thingID == thingId
                   where mip.MemberID == memberId
                   select mip).FirstOrDefault();
    }
}
After writing the above, I also tried wrapping the logic in a function call, because it seems like Entity Framework cleans up database transactions when leaving the scope from which changes were submitted. So instead of using TransactionScope and managing the exception at the same level as above, I wrapped the whole thing inside a managing function, like this:
bool Succeeded = false;
while (Succeeded == false)
{
    Thread.Sleep(750);
    Exception Problem = AttemptToSaveMemberIngredientPreference(memberId, ingredientId, rating);
    if (Problem == null)
        Succeeded = true;
    else
    {
        Exception BaseEx = Problem.GetBaseException();
    }
}
But this only results in an unending string of exceptions on the unique constraint, being handled forever at the higher-level function. I have a 3/4 second delay between attempts, so I am surprised that there can be a reported conflict yet still there is nothing found when I query for a row. I suppose that indicates that all of the threads are failing because they are running at the same time and Entity Framework notices them all and fails them all before any succeed. So I suppose there should be a way to respond to the exception by looking at all the submissions and adjusting them? I don't know or see the syntax for that. So again, what is the way to handle this?
Update:
Paddy makes three good suggestions below. I expect his Stored Procedure technique would work around the problem, but I am still interested in the answer to the question. That is, surely one should be able to respond to this exception by manipulating the submission, but I haven't yet found the syntax to get it to insert one row and use the latest value.
To quote Eric Lippert, "if it hurts, stop doing it". If you are anticipating very high volumes and you want to do an 'insert or update', then you may want to consider handling this within a stored procedure instead of using the methods outlined above.
Your problem comes from the small gap between your call to the DB to check for existence and your insert/update.
The sproc could use a MERGE to do the insert or update in a single pass on the table, guaranteeing that you will only see a single row for a rating and that it will be the most recent update you receive.
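Not your exact schema, but a sketch of what that sproc might look like - the WITH (HOLDLOCK) hint matters, since without it two concurrent MERGEs can still both take the NOT MATCHED branch; the Rating column name is an assumption:
CREATE PROCEDURE dbo.UpsertMemberThingRating
    @MemberID int,
    @ThingID  int,
    @Rating   int
AS
BEGIN
    SET NOCOUNT ON;

    -- single atomic pass over the table: update if the pair exists, insert otherwise
    MERGE dbo.Member_Thing_Rating WITH (HOLDLOCK) AS target
    USING (SELECT @MemberID AS MemberID, @ThingID AS thingID) AS source
        ON target.MemberID = source.MemberID
       AND target.thingID  = source.thingID
    WHEN MATCHED THEN
        UPDATE SET Rating = @Rating, EffectiveDate = GETDATE()
    WHEN NOT MATCHED THEN
        INSERT (MemberID, thingID, Rating, EffectiveDate)
        VALUES (@MemberID, @ThingID, @Rating, GETDATE());
END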
Note - you can include the sproc in your EF model and call it using similar EF syntax.
Note 2 - Looking at your code, you don't roll back the transaction scope prior to sleeping your thread in the exception case. That is a relatively long time to hold a transaction open, particularly when you are expecting very high volumes. You may want to update your code to something like this:
try
{
    memPref = new Member_Thing_Rating();
    memPref.MemberID = memberId;
    memPref.thingID = thingId;
    memPref.EffectiveDate = DateTime.Now;
    _myEntities.Member_Thing_Rating.AddObject(memPref);
    _myEntities.SaveChanges();
    txScope.Complete();
}
catch (Exception ex)
{
    txScope.Dispose();
    Thread.Sleep(750);
    RetryGet = true;
}
This may be why you seem to be suffering from deadlocks when you retry, particularly if you are getting rapid concurrent requests.
I've got a Seam web application working with Hibernate (JDBC to SQL Server).
It's working well, but under heavy load (a stress test with JMeter), I get some LockAcquisitionExceptions or OptimisticLockExceptions.
The LockAcquisitionException is caused by a SQLServerException: "Transaction (Process ID 64) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."
I've written a Seam interceptor to rerun such transactions on LockAcquisitionException:
@AroundInvoke
public Object aroundInvoke(final InvocationContext invocationContext) throws Exception {
    if (instanceThreadLocal.get() == null && isMethodInterceptable(invocationContext)) {
        try {
            instanceThreadLocal.set(this);
            int i = 0;
            PersistenceException exception = null;
            do {
                try {
                    return invocationContext.proceed();
                } catch (final PersistenceException e) {
                    final Throwable cause = e.getCause();
                    if (!(cause instanceof LockAcquisitionException)) {
                        throw e;
                    }
                    exception = e;
                    i++;
                    if (i < MAX_RETRIES_LOCK_ACQUISITION) {
                        log.info("Swallowing a LockAcquisitionException - #0/#1", i, MAX_RETRIES_LOCK_ACQUISITION);
                        try {
                            if (Transaction.instance().isRolledBackOrMarkedRollback()) {
                                Transaction.instance().rollback();
                            }
                            Transaction.instance().begin();
                        } catch (final Exception e2) {
                            throw new IllegalStateException("Exception while rolling back the current transaction and beginning a new one.", e2);
                        }
                        Thread.sleep(1000);
                    } else {
                        log.info("Can't swallow any more LockAcquisitionException (#0/#1), will throw it.", i, MAX_RETRIES_LOCK_ACQUISITION);
                        throw e;
                    }
                }
            } while (i < MAX_RETRIES_LOCK_ACQUISITION);
            throw exception;
        } finally {
            instanceThreadLocal.remove();
        }
    }
    return invocationContext.proceed();
}
First question : do you think this interceptor will correctly do the job ?
Googling around, I saw that Alfresco (with a forum talk here), Bonita and Orchestra have methods to rerun such transactions too, and they catch many more exceptions, like StaleObjectStateException for instance (the cause of my OptimisticLockException).
My 2nd question follows: for the StaleObjectStateException ("Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect)"), normally you can't just rerun the transaction, as it's a synchronisation problem with the database and @Version fields, isn't it? Why does Alfresco, for instance, try to rerun transactions caused by such exceptions?
EDIT:
For the LockAcquisitionException caused by SQLServerException, I've looked at some resources on the web, and even if I should double-check my code, it seems it can happen anyway... here are the links:
An article on the subject (with a comment which says it can also happen by running out of resources)
Another article with sublinks:
Microsoft talking about that on support.microsoft.com
A way to profile transactions
And some advice to reduce such problems
Even Microsoft says "Although deadlocks can be minimized, they cannot be completely avoided. That is why the front-end application should be designed to handle deadlocks."
Actually, I finally found how to dodge the famous "Transaction (Process ID 64) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction".
So I will not really answer my question, but I will explain what I saw and how I managed to deal with it.
At first, I thought I had a "lock escalation problem" that would transform my row locks into page locks and produce my deadlocks (my JMeter test runs a scenario which does deletes/updates while selecting rows, but the deletes and updates don't necessarily concern the same rows as the selects).
So I read Lock Escalation in SQL2005 and How to resolve blocking problems that are caused by lock escalation in SQL Server (by MS) and finally Diagnose SQL Server performance issues using sp_lock.
But before trying to detect whether I was in a lock escalation situation, I stumbled on this page: http://community.jboss.org/message/95300. It talks about "transaction isolation", and mentions that SQL Server has a special level called "snapshot isolation".
I then found Using Snapshot Isolation with SQL Server and Hibernate and read Using Snapshot Isolation (by MS).
So I first enabled "snapshot isolation mode" on my database:
ALTER DATABASE [MY_DATABASE]
SET ALLOW_SNAPSHOT_ISOLATION ON
ALTER DATABASE [MY_DATABASE]
SET READ_COMMITTED_SNAPSHOT ON
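A quick way to check that both options actually took effect:
-- sanity check: both columns should report the snapshot settings as enabled
SELECT name,
       snapshot_isolation_state_desc,
       is_read_committed_snapshot_on
FROM sys.databases
WHERE name = 'MY_DATABASE';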
Then I had to set the transaction isolation of the JDBC driver to 4096... and the book "Hibernate in Action", in section "5.1.6 Setting an isolation level", reads:
Note that Hibernate never changes the isolation level of connections obtained from a datasource provided by the application server in a managed environment. You may change the default isolation using the configuration of your application server.
So I read Configuring JDBC DataSources (for JBoss 4) and finally edited my database-ds.xml file to add this:
<local-tx-datasource>
    <jndi-name>myDatasource</jndi-name>
    <connection-url>jdbc:sqlserver://BDDSERVER\SQL2008;databaseName=DATABASE</connection-url>
    <driver-class>com.microsoft.sqlserver.jdbc.SQLServerDriver</driver-class>
    <user-name>user</user-name>
    <password>password</password>
    <min-pool-size>2</min-pool-size>
    <max-pool-size>400</max-pool-size>
    <blocking-timeout-millis>60000</blocking-timeout-millis>
    <background-validation>true</background-validation>
    <background-validation-minutes>2</background-validation-minutes>
    <idle-timeout-minutes>15</idle-timeout-minutes>
    <check-valid-connection-sql>SELECT 1</check-valid-connection-sql>
    <prefill>true</prefill>
    <prepared-statement-cache-size>75</prepared-statement-cache-size>
    <transaction-isolation>4096</transaction-isolation>
</local-tx-datasource>
The most important part is of course <transaction-isolation>4096</transaction-isolation>: 4096 is the value of SQLServerConnection.TRANSACTION_SNAPSHOT in Microsoft's JDBC driver.
And then I got no more deadlock problems! ... so my question is now more or less moot for me... but perhaps someone could still have a real answer!