SQL server, pyodbc and deadlock errors - sql-server

I have some code that writes Scrapy scraped data to a SQL Server db. The data items consist of some basic hotel data (name, address, rating...) and a list of rooms with associated data (price, occupancy, etc.). There can be multiple Celery threads and multiple servers running this code, simultaneously writing different items to the db. I am encountering deadlock errors like:
[Failure instance: Traceback: <class 'pyodbc.ProgrammingError'>:
('42000', '[42000] [FreeTDS][SQL Server]Transaction (Process ID 62)
was deadlocked on lock resources with another process and has been
chosen as the deadlock victim. Rerun the transaction. (1205) (SQLParamData)')
The code that actually does the insert/update schematically looks like this:
1) Check if hotel exists in hotels table, if it does update it, else insert it new.
Get the hotel id either way. This is done by `curs.execute(...)`
2) Python loop over the hotel rooms scraped. For each room check if room exists
in the rooms table (which is foreign keyed to the hotels table).
If not, then insert it using the hotel id to reference the hotels table row.
Else update it. These upserts are done using `curs.execute(...)`.
It is a bit more complicated than this in practice, but this illustrates that the Python code is using multiple curs.executes before and during the loop.
If, instead of upserting the data in the above manner, I generate one big SQL command that does the same thing (checks for the hotel, upserts it, records the id in a temporary variable, then for each room checks whether it exists and upserts it against the hotel id variable, etc.) and do only a single curs.execute(...) in the Python code, then I no longer see deadlock errors.
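To make this concrete, here is a rough sketch of what my single batch looks like (the table and column names are simplified placeholders, not my real schema, and hotel_name, hotel_address, room_name, room_price stand in for the scraped values):
batch_sql = """
SET NOCOUNT ON;
DECLARE @name NVARCHAR(200) = ?, @address NVARCHAR(400) = ?,
        @room NVARCHAR(200) = ?, @price DECIMAL(10, 2) = ?;
DECLARE @hotel_id INT;

-- upsert the hotel and remember its id
SELECT @hotel_id = id FROM hotels WHERE name = @name;
IF @hotel_id IS NULL
BEGIN
    INSERT INTO hotels (name, address) VALUES (@name, @address);
    SET @hotel_id = SCOPE_IDENTITY();
END
ELSE
BEGIN
    UPDATE hotels SET address = @address WHERE id = @hotel_id;
END

-- upsert one room against that hotel id (repeated per room in the real batch)
IF EXISTS (SELECT 1 FROM rooms WHERE hotel_id = @hotel_id AND name = @room)
BEGIN
    UPDATE rooms SET price = @price WHERE hotel_id = @hotel_id AND name = @room;
END
ELSE
BEGIN
    INSERT INTO rooms (hotel_id, name, price) VALUES (@hotel_id, @room, @price);
END
"""
curs.execute(batch_sql, hotel_name, hotel_address, room_name, room_price)
conn.commit()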
However, I don't really understand why this makes a difference, and I'm also not entirely sure it is safe to run big SQL blocks with multiple SELECTs, INSERTs and UPDATEs in a single pyodbc curs.execute. As I understand it, pyodbc is supposed to handle only single statements; however, it does seem to work, and I see my tables populated with no deadlock errors.
Nevertheless, it seems impossible to get any output if I do a big command like this. I tried declaring a variable @output_string and recording various things to it (did we have to insert or update the hotel, for example) before finally doing SELECT @output_string AS outputstring, but doing a fetch after the execute in pyodbc always fails with
<class 'pyodbc.ProgrammingError'>: No results. Previous SQL was not a query.
Experiments within the shell suggest pyodbc ignores everything after the first statement:
In [11]: curs.execute("SELECT 'HELLO'; SELECT 'BYE';")
Out[11]: <pyodbc.Cursor at 0x7fc52c044a50>
In [12]: curs.fetchall()
Out[12]: [('HELLO', )]
So if the first statement is not a query you get that error:
In [13]: curs.execute("PRINT 'HELLO'; SELECT 'BYE';")
Out[13]: <pyodbc.Cursor at 0x7fc52c044a50>
In [14]: curs.fetchall()
---------------------------------------------------------------------------
ProgrammingError Traceback (most recent call last)
<ipython-input-14-ad813e4432e9> in <module>()
----> 1 curs.fetchall()
ProgrammingError: No results. Previous SQL was not a query.
Nevertheless, except for the inability to fetch my @output_string, my real "big query", consisting of multiple selects, updates and inserts, actually works and populates multiple tables in the db.
However, if I try something like
curs.execute('INSERT INTO testX (entid, thecol) VALUES (4, 5); INSERT INTO testX (entid, thecol) VALUES (5, 6); SELECT * FROM testX;')
I see that both rows were inserted into the table testX, even though a subsequent curs.fetchall() fails with the "Previous SQL was not a query." error, so it seems that pyodbc's execute does execute everything, not just the first statement.
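For reference, pyodbc's cursor.nextset() should let me reach the later result sets (and SET NOCOUNT ON would stop the INSERTs from producing row-count "results" in the first place); something along these lines, using the same testX table:
curs.execute("INSERT INTO testX (entid, thecol) VALUES (6, 7); SELECT 'BYE';")
# The INSERT yields a row-count "result" with no columns, so skip ahead until
# we reach a result set that actually has columns, then fetch from it.
while curs.description is None and curs.nextset():
    pass
print(curs.fetchall())   # expecting [('BYE', )]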
If I can trust this, then my main problem is how to get some output for logging.
EDIT: Setting autocommit=True in the dbargs seems to prevent the deadlock errors, even with the multiple curs.executes. But why does this fix it?

Setting autocommit=True in the dbargs seems to prevent the deadlock errors, even with the multiple curs.executes. But why does this fix it?
When establishing a connection, pyodbc defaults to autocommit=False in accordance with the Python DB-API spec. Therefore when the first SQL statement is executed, ODBC begins a database transaction that remains in effect until the Python code does a .commit() or a .rollback() on the connection.
The default transaction isolation level in SQL Server is "Read Committed". Unless the database is configured to support SNAPSHOT isolation by default, a write operation within a transaction under Read Committed isolation will place transaction-scoped locks on the rows that were updated. Under conditions of high concurrency, deadlocks can occur if multiple processes generate conflicting locks. If those processes use long-lived transactions that generate a large number of such locks then the chances of a deadlock are greater.
Setting autocommit=True will avoid the deadlocks because each individual SQL statement will be automatically committed, thus ending the transaction (which was automatically started when that statement began executing) and releasing any locks on the updated rows.
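For example, a minimal sketch of the two modes (the connection string and table/column names are just placeholders):
import pyodbc

conn_str = "DSN=mydsn;UID=user;PWD=secret"   # placeholder connection string

# autocommit=True: every statement commits (and releases its locks) as soon as
# it completes, so no long-lived transaction accumulates locks.
conn = pyodbc.connect(conn_str, autocommit=True)
conn.cursor().execute("UPDATE hotels SET rating = ? WHERE id = ?", 4.5, 123)

# default autocommit=False: the first execute() implicitly opens a transaction
# that keeps its locks until the code explicitly ends it.
conn = pyodbc.connect(conn_str)
curs = conn.cursor()
curs.execute("UPDATE hotels SET rating = ? WHERE id = ?", 4.5, 123)
conn.commit()   # ends the transaction and releases the locks on the updated rows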
So, to help avoid deadlocks you can consider a couple of different strategies:
continue to use autocommit=True, or
have your Python code explicitly .commit() more often, or
use SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED to "loosen up" the transaction isolation level so that reads are not blocked by the persistent locks created by write operations (the writes themselves still take exclusive locks), or
configure the database to use SNAPSHOT isolation which will avoid lock contention but will make SQL Server work harder.
You will need to do some homework to determine the best strategy for your particular usage case.
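Whichever strategy you choose, the "Rerun the transaction" advice in the error message can also be acted on directly in Python by retrying when your process is chosen as the deadlock victim. A minimal sketch (the retry count and back-off are arbitrary choices):
import time
import pyodbc

def run_with_deadlock_retry(conn, work, retries=3):
    """Run work(cursor) inside a transaction, retrying if chosen as deadlock victim."""
    for attempt in range(retries):
        try:
            curs = conn.cursor()
            work(curs)
            conn.commit()
            return
        except pyodbc.Error as exc:
            conn.rollback()
            # SQL Server reports deadlock victims with error 1205 (the "(1205)"
            # in the traceback above).
            if "1205" in str(exc) and attempt < retries - 1:
                time.sleep(0.5 * (attempt + 1))   # brief back-off before retrying
                continue
            raise
This assumes the whole unit of work (e.g. one hotel plus its rooms) goes through work, so that a retry replays it from the start.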

Related

ORA-00060: deadlock detected while waiting for resource while updating multiple rows in a multi-threaded system

Earlier I used this query where I was updating 100 rows one by one.
String query="update table set status=? where rownum< ?";
stmt = conn.prepareStatement(query);
for(int i=0; i < CHUNK_SIZE;i++){
stmt.setString(1, "STATUS_IN_PROGRESS");
stmt.setInt(2, 2);
stmt.executeUpdate();
}
Now I changed this query to update all the rows at once.
String query="update table set status=? where rownum< ?";
stmt = conn.prepareStatement(query);
stmt.setString(1, "STATUS_IN_PROGRESS");
stmt.setInt(2, CHUNK_SIZE);
But the latter one is giving an "ORA-00060: deadlock detected while waiting for resource" exception. I read about this, but I am not able to understand it: if the exception is because of competing DMLs, then it should have happened with the first query too, though with a lower probability, but that was not the case.
Also, we normally do frequent updates at the DB level, so this should be a frequent problem, but it is not.
As Alex Poole suggested, you definitely want to look for the trace file. Every deadlock creates a separate trace file on the database. The file lists all the related objects and SQL statements. Don't assume you know which statements are causing the deadlock; there are several weird ways that deadlocks can happen.
As ibre5041 pointed out, deadlocks depend on the order in which data is retrieved. However, simply adding an ORDER BY likely won't help. The same statement with the same execution plan will always return data in the same order (in practice, but this is not guaranteed!), but the same SQL statement can have different execution plans in some cases. For example, if the CHUNK_SIZE bind variable is different, that could cause a change in the execution plan. It might help to find the SQL statements, check for multiple execution plans, and try to fix one plan. This query can help you find statements and plans:
select sql_id, plan_hash_value from gv$sql where lower(sql_text) like '%table_name%';
Auto-commit might explain why the first version did not cause errors. Deadlocks require two transactions and at least one of them must have performed work and then tried to do more work. A transaction must hold a lock and ask for another one. If the transaction is committed after every single row there is no way for a deadlock to occur. However, I'm not advocating single-row processing.
The solution is usually a schema fix (add an index on a foreign key, convert a bitmap index to a b-tree index), controlled access to the table (serialize access, or at least make sure the statements process rows in the same order), or, as a last resort, exception handling.

Does inserting data into SQL Server lock the whole table?

I am using Entity Framework, and I am inserting records into our database which include a blob field. The blob field can be up to 5 MB of data.
When inserting a record into this table, does it lock the whole table?
So if you are querying any data from the table, will it block until the insert is done (I realise there are ways around this, but I am talking by default)?
How long will it take before it causes a deadlock? Will that time depend on how much load is on the server, e.g. if there is not much load, will it take longer to cause a deadlock?
Is there a way to monitor and see what is locked at any particular time?
If each thread is doing queries on single tables, is there then a case where blocking can occur? So isn't it the case that a deadlock can only occur if you have a query which has a join and is acting on multiple tables?
This is taking into account that most of my code is just a bunch of select statements, not heaps of long running transactions or anything like that.
Holy cow, you've got a lot of questions in here, heh. Here's a few answers:
When inserting a record into this table, does it lock the whole table?
Not by default, but if you use the TABLOCK hint or if you're doing certain kinds of bulk load operations, then yes.
So if you are querying any data from the table will it block until the insert is done (I realise there are ways around this, but I am talking by default)?
This one gets a little trickier. If someone's trying to select data from a page in the table that you've got locked, then yes, you'll block 'em. You can work around that with things like the NOLOCK hint on a select statement or by using Read Committed Snapshot Isolation. For a starting point on how isolation levels work, check out Kendra Little's isolation levels poster.
How long will it take before it causes a deadlock? Will that time depend on how much load is on the server, e.g. if there is not much load will it take longer to cause a deadlock?
Deadlocks aren't based on time - they're based on dependencies. Say we've got this situation:
Query A is holding a bunch of locks, and to finish his query, he needs stuff that's locked by Query B
Query B is also holding a bunch of locks, and to finish his query, he needs stuff that's locked by Query A
Neither query can move forward (think Mexican standoff) so SQL Server calls it a draw, shoots somebody's query in the back, releases his locks, and lets the other query keep going. SQL Server picks the victim based on which one will be less expensive to roll back. If you want to get fancy, you can use SET DEADLOCK_PRIORITY LOW on particular queries to paint targets on their back, and SQL Server will shoot them first.
Is there a way to monitor and see what is locked at any particular time?
Absolutely - there's Dynamic Management Views (DMVs) you can query like sys.dm_tran_locks, but the easiest way is to use Adam Machanic's free sp_WhoIsActive stored proc. It's a really slick replacement for sp_who that you can call like this:
sp_WhoIsActive @get_locks = 1
For each running query, you'll get a little XML that describes all of the locks it holds. There's also a Blocking column, so you can see who's blocking who. To interpret the locks being held, you'll want to check the Books Online descriptions of lock types.
If each thread is doing queries on single tables, is there then a case where blocking can occur? So isn't it the case that a deadlock can only occur if you have a query which has a join and is acting on multiple tables?
Believe it or not, a single query can actually deadlock itself, and yes, queries can deadlock on just one table. To learn even more about deadlocks, check out The Difficulty with Deadlocks by Jeremiah Peschka.
If you have direct control over the SQL, you can force row level locking using:
INSERT INTO MyTable WITH (ROWLOCK) (Id, BigColumn)
VALUES(...)
These two answers might be helpful:
Is it possible to force row level locking in SQL Server?
Locking a table with a select in Entity Framework
To view current held locks in Management Studio, look under the server, then under Management/Activity Monitor. It has a section for locks by object, so you should be able to see whether the inserts are really causing a problem.
Deadlock errors generally return quite quickly. Deadlock states do not occur as a result of a timeout error occurring while waiting for a lock. Deadlock is detected by SQL Server by looking for cycles in the lock requests.
The best answer I can come up with is: It depends.
The best way to check is to find your connection SPID and use sp_lock SPID to check if the lock mode is X on the TAB type. You can also verify the table name with SELECT OBJECT_NAME(objid). I also like to use the below query to check for locking.
SELECT RESOURCE_TYPE, RESOURCE_SUBTYPE, DB_NAME(RESOURCE_DATABASE_ID) AS 'DATABASE', RESOURCE_DATABASE_ID AS DBID,
       RESOURCE_DESCRIPTION, RESOURCE_ASSOCIATED_ENTITY_ID, REQUEST_MODE, REQUEST_SESSION_ID,
       CASE WHEN RESOURCE_TYPE = 'OBJECT' THEN OBJECT_NAME(RESOURCE_ASSOCIATED_ENTITY_ID, RESOURCE_DATABASE_ID) ELSE '' END AS OBJETO
FROM SYS.DM_TRAN_LOCKS (NOLOCK)
WHERE REQUEST_SESSION_ID = --SPID here
In SQL Server 2008 (and later) you can disable lock escalation on the table and enforce a WITH (ROWLOCK) hint in your insert clause, effectively forcing a row lock. This can't be done prior to SQL Server 2008 (you can write WITH (ROWLOCK), but SQL Server can choose to ignore it).
I'm speaking in general terms here, and I don't have much experience with BLOBs, as I usually advise developers to avoid them, especially if larger than 1 MB.

Are all deadlocks caused by a bad query

"Transaction (Process ID 63) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly."
Could this deadlock be caused by something that stored proc uses like SQL mail? Or is it always caused my something like two applications accessing the same table at the same time?
Two processes accessing the same table at the same time happens all the time in an application. Generally that won't cause a deadlock. A deadlock typically happens when you have, say, process 'A' attempting to update Table 1, then Table 2, then Table 3, while process 'B' attempts to update Table 3, then Table 2, then Table 1. Process 'A' will have a resource locked that process 'B' needs, and process 'B' has a resource locked that process 'A' needs. SQL Server detects this as a deadlock and rolls one of the processes back as a failed transaction.
The bottom line is that you have two processes attempting to update the same tables at the same time, but not in the same order. This will often lead to deadlocks.
One easy way to handle this in your application is to handle the failed transaction and simply re-execute the transaction. It will almost always execute successfully. A better solution is to make sure your processes are updating tables in the same order, as much as possible.
Missing indexes are another common cause of deadlocks. If a select query can get the info it needs from an index instead of the base table, then it won't be blocked by updates/inserts on the table itself.
To find out for sure, use the SQL profiler to trace for "Deadlock Graph" events, which will show you the detail of the deadlock itself.
Based on this, I don't think SQL Mail itself would directly be the culprit. I say "directly" because I don't know what you're doing with it. However, I assume SQL Mail is probably slow compared to the rest of your SQL ops, so if you're doing a lot with that, it could indirectly create a bottleneck that leads to a deadlock if you're holding onto tables while sending off the SQL Mail.
It's hard to recommend a specific strategy without more specifics about what you're doing. The short of it is that you should consider whether there's a way to break your dependence on holding onto the table while you're doing this, such as using NOLOCK, using a temp table or a non-temp "holding" table, or just refactoring the SP that is doing the call.

Sql Server 2005 - manage concurrency on tables

In an ASP.NET application I've got this process:
Start a connection
Start a transaction
Insert into a table "LoadData" a lot of values with the SqlBulkCopy class with a column that contains a specific LoadId.
Call a stored procedure that :
read the table "LoadData" for the specific LoadId.
For each line does a lot of calculations which implies reading dozens of tables and write the results into a temporary (#temp) table (process that last several minutes).
Deletes the lines in "LoadDate" for the specific LoadId.
Once everything is done, write the result in the result table.
Commit transaction or rollback if something fails.
My problem is that if 2 users start the process, the second one has to wait until the first has finished (because the insert seems to put an exclusive lock on the table), and my application sometimes falls into a timeout (and the users are not happy to wait :) ).
I'm looking for a way to let the users run everything in parallel, as there is no interaction between them except for the last step: writing the result. I think what is blocking me is the inserts/deletes in the "LoadData" table.
I checked the other transaction isolation levels but it seems that nothing could help me.
What would be perfect would be to be able to remove the exclusive lock on the "LoadData" table once the insert is finished (is it possible to force SQL Server to lock only rows and not the whole table?), but without ending the transaction.
Any suggestion?
Look up READ COMMITTED SNAPSHOT (the ALTER DATABASE ... SET READ_COMMITTED_SNAPSHOT ON option) in Books Online.
Transactions should cover small and fast-executing pieces of SQL/code. They have a tendency to be implemented differently on different platforms. They will lock tables and then expand the lock as the modifications grow, thus locking the other users out of querying or updating the same row/page/table.
Why not forget the transaction and handle processing errors in another way? Is your data integrity truly being secured by the transaction, or can you do without it?
If you're sure that there is no issue with concurrent operations except for the last part, why not start the transaction just before those last statements (whichever ones actually do require isolation) and commit immediately after they succeed? Then all the upfront read operations will not block each other...

Long query prevents inserts

I have a query that runs each night on a table with a bunch of records (200,000+). This application simply iterates over the results (using a DbDataReader in a C# app, if that's relevant) and processes each one. The processing is done outside of the database altogether. During the time that the application is iterating over the results I am unable to insert any records into the table that I am querying. The insert statements just hang and eventually time out. The inserts are done in completely separate applications.
Does SQL Server lock the table down while a query is being done? This seems like an overly aggressive locking policy. I could understand how there could be a conflict between the query and newly inserted records, but I would be perfectly ok if records inserted after the query started were simply not included in the results.
Any ways to avoid this?
Update:
The WITH (NOLOCK) hint definitely did the trick. As some of you pointed out, this isn't the cleanest approach. I can't really query everything into memory given the number of records, and some of the columns in this table are binary (some records are actually about 1 MB of total data).
The other suggestion was to query for batches of records at a time. This isn't a bad idea either, but it does bring up a new issue: database-independent queries. Right now the application can work with a variety of different databases (Oracle, MySQL, Access, etc.). Each database has its own way of limiting the rows returned in a query. But maybe this is better saved for another question?
Back on topic: the "WITH (NOLOCK)" clause is certainly SQL Server specific, so is there any way to keep it out of my query text so that the query still works with other databases? Maybe I could somehow specify a parameter on the DbCommand object? Or can I specify the locking policy at the database level? That is, change some properties in SQL Server itself that will prevent the table from locking like this by default?
If you're using SQL Server 2005+, then how about giving the new MVCC snapshot isolation a try. I've had good results with it:
ALTER DATABASE YourDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;  -- substitute your database name
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;
ALTER DATABASE YourDatabase SET MULTI_USER;
It will stop readers blocking writers and vice-versa. It eliminates many deadlocks, at very little cost.
It depends what isolation level you are using. You might try doing your selects using the WITH (NOLOCK) hint; that will prevent the read locks, but it also means the data being read might change before the selecting transaction completes.
The first thing you could do is try to add the "WITH (NOLOCK)" to any tables you have in your query. This will "Tame down" the locking that SQL Server does. An example of using "NOLOCK" on a join is as follows...
SELECT COUNT(Users.UserID)
FROM Users WITH (NOLOCK)
JOIN UsersInUserGroups WITH (NOLOCK) ON
Users.UserID = UsersInUserGroups.UserID
Another option is to use a DataSet instead of a DataReader. A DataReader is a "fire hose" technique that stays connected to the tables while your program is processing, basically handling the table row by row through the hose. A DataSet uses a "disconnected" methodology where all the data is loaded into memory and then the connection is closed. Your program can then loop over the data in memory without having to worry about locking. However, if this is a really large amount of data, there may be memory issues.
Hope this helps.
If you add the WITH (NOLOCK) hint after a table name in the FROM clause it should make sure it doesn't lock, and it doesn't care about reading data that is locked. You might get "out of date" results if you are writing at the same time, but if you don't care about that then you should be fine.
I reckon your best way of avoiding this is to do it in SQL rather than in the application.
You can add a
WAITFOR DELAY '000:00:01'
at the end of each loop iteration to provide time for other processes to run - just make sure that you haven't initiated a TRANSACTION such that all other processes are locked out anyway
The query is performing a table lock, thus the inserts are failing.
It sounds to me like you're keeping a lock on the table while processing the results.
You should instead load them into an array or collection of some sort, and close the database connection.
Then process the array.
In addition, while you're doing your select use either:
WITH(NOLOCK) or WITH(READPAST)
I'm not a big fan of using lock hints as you could end up with dirty reads or other weirdness. A couple of other ideas:
Can you break the number of rows down so you don't grab 200k at a time? Is there a way to tell whether you've processed a row - a flag, a timestamp - that you could use in the query? Your query could be 'SELECT TOP 5000 ...', getting a different 5k each time. Shorter queries mean shorter-lived locks.
If you can use smaller sets of rows, I like the DataSet vs. IDataReader idea. You will be loading data into memory and not holding any SQL locks, but the amount of memory can cause other problems.
-Brian
You should be able to set the isolation level at the .NET level so that you don't have to include the WITH (NOLOCK) hint.
If you want to go with the batching option, you should be able to specify the rowcount setting from the .NET level, which would tell the database to return only n records. By setting these options at the .NET level they should become database independent and work across all the platforms.
