I have a database with READ_COMMITTED_SNAPSHOT set ON (RCSI), and I cannot change that.
I insert new rows into a table from many parallel sessions,
but only if they don't exist there yet (the classic left-join existence check).
The inserting code looks like this:
INSERT INTO dbo.Destination(OrderID)
SELECT DISTINCT s.OrderID
FROM dbo.Source s
LEFT JOIN dbo.Destination d ON d.OrderID = s.OrderID
WHERE d.OrderID IS NULL;
If I run this from many parallel sessions I get a lot of duplicate key errors,
since different sessions try to insert the same OrderIDs over and over again.
That is expected, given that readers take no shared (S) locks under RCSI.
The recommended solution here (as per my research) would be to use the READCOMMITTEDLOCK hint like this:
LEFT JOIN dbo.Destination d WITH (READCOMMITTEDLOCK) ON d.OrderID = s.OrderID
This somewhat works, as it greatly reduces the duplicate key errors, but (to my surprise) it doesn't completely eliminate them.
As an experiment I removed the unique constraint on the Destination table, and saw that many duplicates enter the table in the very same millisecond, originating from different sessions.
It seems that despite the table hint, I still get false positives on the existence check, and the redundant insert fires.
I tried different hints (SERIALIZABLE), but that made things worse and swarmed me with deadlocks.
How could I make this insert work under RCSI?
The right lock hint for reading a table you are about to insert into is (UPDLOCK, HOLDLOCK), which will place update (U) locks on the rows as you read them, and will also take SERIALIZABLE-style key-range locks where a row doesn't exist.
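Applied to the query from the question, that would look something like this (just a sketch; the table and column names come from the question, and the explicit transaction is an assumption):

BEGIN TRANSACTION;

INSERT INTO dbo.Destination (OrderID)
SELECT DISTINCT s.OrderID
FROM dbo.Source s
LEFT JOIN dbo.Destination d WITH (UPDLOCK, HOLDLOCK)  -- U locks on existing rows, range locks on the gaps
    ON d.OrderID = s.OrderID
WHERE d.OrderID IS NULL;

COMMIT TRANSACTION;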
The problem with your approach is that each client is attempting to insert a batch of rows, and each batch has to either succeed completely or fail. If you use row-level locking, you will always have scenarios where a session inserts one row successfully, but then becomes blocked waiting to read or insert a subsequent row. This inevitably leads to either PK failures or deadlocks, depending on the type of row lock used.
The solution is to either:
1) Insert the rows one-by-one, and don't hold the locks from one row while you check and insert the next row.
2) Simply escalate to a TABLOCKX, or an Application Lock (sketched below), to force your concurrent sessions to serialize through this bit of code.
So you can have highly-concurrent loads, or batch loads, but you can't have both. Well, mostly:
3) You could turn on IGNORE_DUP_KEY on the index, which will just skip any duplicate row on insert instead of raising an error.
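Rough sketches of options 2 and 3 (the application-lock resource name and the index name are made up for illustration):

-- Option 2: serialize the whole batch through an application lock
BEGIN TRANSACTION;

EXEC sp_getapplock @Resource = 'Destination_OrderID_Load',  -- hypothetical resource name
                   @LockMode = 'Exclusive',
                   @LockOwner = 'Transaction';

INSERT INTO dbo.Destination (OrderID)
SELECT DISTINCT s.OrderID
FROM dbo.Source s
LEFT JOIN dbo.Destination d ON d.OrderID = s.OrderID
WHERE d.OrderID IS NULL;

COMMIT TRANSACTION;  -- releases the application lock

-- Option 3: let the unique index silently discard duplicates
CREATE UNIQUE INDEX IX_Destination_OrderID  -- hypothetical index name
    ON dbo.Destination (OrderID)
    WITH (IGNORE_DUP_KEY = ON);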
I want to place DB2 Triggers for Insert, Update and Delete on DB2 Tables heavily used in parallel online Transactions. The tables are shared by several members on a Sysplex, DB2 Version 10.
In each of the DB2 Triggers I want to insert a row into a central table and have one background process calling a Stored Procedure to read this table every second to process the newly inserted rows, ordered by sequence of the insert (sequence number or timestamp).
I'm very concerned about DB2 Index locking contention and want to make sure that I do not introduce Deadlocks/Timeouts to the applications with these Triggers.
Obviously I would take advantage of DB2 features that reduce locking, like row-level locking, but I still see no really good approach to avoiding index contention.
I see three different options to select the newly inserted rows.
Put a sequence number in the table and store the last processed sequence number in the background process. I would use the following SELECT statement:
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
WHERE SEQ_NO > 'last-seq-number'
ORDER BY SEQ_NO;
The isolation level must be CS to avoid selecting uncommitted rows, which might later be rolled back.
I think I need one Index on the table with SEQ_NO ASC
Pro: Background process only reads rows and makes no updates/deletes (only shared locks)
Con: Index contention because of the ascending key.
I can clean up processed records later (e.g. by rolling partitions).
Put a Status field in the table (processed and unprocessed) and change the Select as follows:
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
WHERE STATUS = 'unprocessed'
ORDER BY TIMESTAMP;
Later I would update the STATUS on the selected rows to "processed" (sketched below).
I think I need an Index on STATUS
Pro: No ascending sequence number in the index and no direct deletes
Cons: Concurrent updates by online transactions and the background process
Clean-up would happen in off-hours
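A sketch of option 2's follow-up update (the index name and the idea of using the batch's last TIMESTAMP as the cut-off are assumptions, not from the question):

CREATE INDEX IX_CENTRAL_STATUS ON CENTRAL_TABLE (STATUS);  -- hypothetical index name

-- After the background process has handled the selected rows:
UPDATE CENTRAL_TABLE
SET STATUS = 'processed'
WHERE STATUS = 'unprocessed'
  AND TIMESTAMP <= ?;  -- upper bound taken from the batch just processed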
DELETE the processed records instead of the status field update.
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
ORDER BY TIMESTAMP;
Since the table contains very few records, no index is required that could create a hot spot.
Also I think I could SELECT with isolation level UR, because I would detect potentially uncommitted data on the later delete of the row.
For a primary key index I could use GENERATE_UNIQUE, which is random and not ascending.
Pro: No Index hot spot and the Inserts can be spread across the tablespace by random UNIQUE_ID
Con: Tablespace scan and sort on every call of the Stored Procedure and deleting records in parallel to the online inserts.
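A rough sketch of option 3; the trigger's REFERENCING NEW AS N alias and the column list are assumptions, while UNIQUE_ID, COLUMN_1 and TIMESTAMP come from the question:

-- In the trigger body (AFTER INSERT ... REFERENCING NEW AS N):
INSERT INTO CENTRAL_TABLE (UNIQUE_ID, COLUMN_1, TIMESTAMP)
VALUES (GENERATE_UNIQUE(), N.COLUMN_1, CURRENT TIMESTAMP);

-- In the background stored procedure: read dirty, then delete what was processed
SELECT UNIQUE_ID, COLUMN_1
FROM CENTRAL_TABLE
ORDER BY TIMESTAMP
WITH UR;

DELETE FROM CENTRAL_TABLE
WHERE UNIQUE_ID = ?;  -- per processed row; still-uncommitted rows simply won't be found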
Looking forward to what the community thinks about this problem. This must be a pretty common problem; e.g. SAP should have a similar issue with their Batch Input tables.
I tend to favour Option 3, because it avoids index contention.
Maybe there is still another solution out there in your minds.
I think you are going to have numerous performance problems with your various solutions.
(I know premature optimization is a sin, but experience tells us that some things are just not going to work in a busy system).
You should be able to use DB2's identity (autoincrement) feature to get your sequence number, with little or no performance implications.
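A sketch of what that could look like; the index name and the VARCHAR payload column are placeholders:

CREATE TABLE CENTRAL_TABLE (
    SEQ_NO    BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,  -- DB2-maintained ascending number
    COLUMN_1  VARCHAR(100),                                  -- stands in for the payload columns
    TIMESTAMP TIMESTAMP NOT NULL WITH DEFAULT                -- defaults to the insert timestamp
);

CREATE INDEX IX_CENTRAL_SEQ ON CENTRAL_TABLE (SEQ_NO ASC);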
For the rest, perhaps you should look at a queue-based solution.
Have your trigger drop the operation (INSERT/UPDATE/DELETE) and the keys of the row onto an MQ queue.
Then have a long-running background task (in CICS?) do your post-processing; since it is processing one update at a time, you should not trip over yourself. Having a single loaded and active task with the ability to batch up units of work should give you a throughput in the order of 300 to 500 updates a second.
I need to update an identity column in a very specific scenario (most of the time the identity will be left alone). When I do need to update it, I simply need to give it a new value and so I'm trying to use a DELETE + INSERT combo.
At present I have a working query that looks something like this:
DELETE Test_Id
OUTPUT DELETED.Data,
DELETED.Moredata
INTO Test_id
WHERE Id = 13
(This is only an example, the real query is slightly more complex.)
A colleague brought up an important point. She asked if this won't cause a deadlock, since we are writing to and reading from the same table. Although in the example it works fine (half a dozen rows), in a real-world scenario with tens of thousands of rows this might not work.
Is this a real issue? If so, is there a way to prevent it?
I set up an SQL Fiddle example.
Thanks!
My first thought was: yes, it can. And maybe it is still possible; however, in this simplified version of the statement it would be very hard to hit a deadlock. You're selecting a single row, for which row-level locks are probably acquired, plus the locks required for the delete and the insert are acquired very quickly one after the other.
I did some testing against a table holding a million rows, executing the statement 5 million times on 6 different connections in parallel. I did not hit a single deadlock.
But add the real-life query, a table with indexes and foreign keys, and you just might have a winner. I've had a similar statement which did cause deadlocks.
I have encountered deadlock errors with a similar statement.
UPDATE A
SET x=0
OUTPUT INSERTED.ID, 'a' INTO B
So for this statement to complete, mssql needs to take locks for the updates on table A, locks for the inserts on table B, and shared (read) locks on table A to validate the foreign key that table B has to table A.
And last but not least, mssql decided it would be wise to use parallelism on this particular query, causing the statement to deadlock on itself. To resolve this I simply set the "MAXDOP 1" query hint on the statement to prevent parallelism.
There is, however, no definite answer to preventing deadlocks. As they say with mssql ever so often: it depends. You could take an exclusive lock using the TABLOCKX table hint. This will prevent a deadlock, however it's probably not desirable for other reasons.
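Applied to the statement above, those two mitigations would look roughly like this (the column list on table B is an assumption; "Tag" is a made-up column):

-- Force a serial plan so the statement cannot deadlock on itself
UPDATE A
SET x = 0
OUTPUT INSERTED.ID, 'a' INTO B (ID, Tag)
OPTION (MAXDOP 1);

-- Or serialize writers entirely with an exclusive table lock (heavy-handed)
UPDATE A WITH (TABLOCKX)
SET x = 0
OUTPUT INSERTED.ID, 'a' INTO B (ID, Tag);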
I have a SQL Server 2012 table that will contain 2.5 million rows at any one time. Items are always being written into the table, but the oldest rows in the table get truncated at the end of each day during a maintenance window.
I have .NET-based reporting dashboards that usually report against summary tables though on the odd occasion it does need to fetch a few rows from this table - making use of the indexes set.
When it does report against this table, it can prevent new rows being written to this table for up to 1 minute, which is very bad for the product.
As it is a reporting platform and the rows in this table never get updated (only inserted - think Twitter streaming but for a different kind of data) it isn't always necessary to wait for a gap in the transactions that cause rows to get inserted into this table.
When it comes to selecting data for reporting, would it be wise to use the SNAPSHOT isolation level within a transaction to select the data, or NOLOCK/READ UNCOMMITTED? Would creating a SQLTransaction around the select statement still cause the insert to block? At the moment I am not wrapping my SQLCommand instance in a transaction, though I realise this will still cause locking regardless.
Ideally I'd like an outcome where the writes are never blocked, and the dashboards are as responsive as possible. What is my best play?
Post your query
In theory a select should not be blocking inserts.
By default a select only takes a shared lock.
Shared locks are acquired automatically during read operations and prevent other sessions from modifying the data while it is being read.
This should not block inserts to otherTable or joinTable
select otherTable.*, joinTable.*
from otherTable
join joinTable
  on otherTable.joinID = joinTable.ID
But it does have the overhead of acquiring a read lock (it does not know you don't update).
But if it is only fetching a few rows from joinTable then it should only be taking a few shared locks.
Post your query, query plan, and table definitions.
I suspect you have some weird stuff going on where it is taking a lot more locks than it needs.
It may be taking a lock on each row, or it may be escalating to a page lock or a table lock.
And look at the inserts. Are they taking some crazy locks they don't need?
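If you want to see what the SELECT is actually holding while it runs, something like this against the locking DMV can help (session id 53 is just an example):

SELECT resource_type,
       request_mode,
       request_status,
       COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_session_id = 53  -- the session running the reporting SELECT
GROUP BY resource_type, request_mode, request_status
ORDER BY lock_count DESC;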
My boss keeps on forcing me to write SELECT queries with WITH (NOLOCK) to prevent deadlocks. But AFAIK, SELECT statements by default do not take locks, so selecting with WITH (NOLOCK) and selecting without it don't make any difference. Please correct me if I am wrong.
The two queries:
SELECT * from EMP with (nolock)
SELECT * from EMP
Aren't both the same? If I don't put NOLOCK, will it be prone to deadlocks? Please tell me what I should use.
NOLOCK should be used with extreme caution. The most common understanding of the NOLOCK (read uncommitted) hint is that it reads data that has not been committed yet. However, there are other side effects that can be very dangerous (search for "nolock" and "page splits").
There's a really good write up here... https://www.itprotoday.com/sql-server/beware-nolock-hint
In short, "nolocking"ing everything is not always a good idea... if ever.
Assuming we have the default transaction isolation level READ COMMITTED, there is a chance of a deadlock even with a very simple SELECT statement. Imagine a scenario where User1 is only reading data, User2 tries to update some data, and there is a non-clustered index on that table; it is possible:
User1 is reading some data and obtains a shared lock on the non-clustered index in order to perform a lookup, and then tries to obtain a shared lock on the page containing the data in order to return the data itself.
User2, who is writing/updating, first obtains an exclusive lock on the database page containing the data, and then attempts to obtain an exclusive lock on the index in order to update the index.
SELECT statements do indeed take locks, unless there is a SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED statement at the top of the query.
By all means use WITH (NOLOCK) in SELECT statements on tables that have a clustered index, but it would be wiser to only do so if there's a need to.
Hint: The easiest way to add a clustered index to a table is to add an Id Primary Key column.
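For example (EMP is taken from the question above; the identity column and constraint name are just an illustration):

-- Hint form and session-level form of dirty reads are equivalent for this query
SELECT * FROM EMP WITH (NOLOCK);

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM EMP;

-- The "easiest clustered index": an identity primary key, clustered by default
ALTER TABLE EMP ADD Id INT IDENTITY(1, 1) NOT NULL
    CONSTRAINT PK_EMP PRIMARY KEY CLUSTERED;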
The result set can contain rows that have not yet been committed and that may later be rolled back.
If WITH (NOLOCK) is applied to a table that has a non-clustered index, then row indexes can be changed by other transactions as the row data is being streamed into the result table. This means that the result set can be missing rows or display the same row multiple times.
READ COMMITTED adds an additional issue where data is corrupted within a single column where multiple users change the same cell simultaneously.
Bearing in mind the issues WITH(NOLOCK) causes will help you tune your database.
As for your boss, just think of them as a challenge.
Consider this statement:
update TABLE1
set FormatCode = case when T.FormatCode is null then TABLE1.FormatCode else T.FormatCode end,
CountryCode = case when T.CountryCode is null then TABLE1.CountryCode else T.CountryCode end
<SNIP ... LOTS of similar fields being updated>
FROM TABLE2 AS T
WHERE TABLE1.KEYFIELD = T.KEYFIELD
TABLE1 is used by other applications and so locking on it should be minimal
TABLE2 is not used by anybody else so I do not care about it.
TABLE1 and TABLE2 contain 600K rows each.
Would the above statement cause a table lock on TABLE1?
How can I modify it to cause minimal locking on it?
Maybe use a cursor to read the rows of TABLE2 one by one and then for each row update the respective row of TABLE1?
SQL will use row locks first. If enough rows in an index page are locked, SQL will issue a page lock. If enough pages are locked, SQL will issue a table lock.
So it really depends on how many locks are issued. You could use the ROWLOCK locking hint in your update statement. The downside is that you will probably have thousands of row locks instead of hundreds of page locks or one table lock. Locks use resources, so while the ROWLOCK hint will probably avoid a table lock, it might even be worse, as it could starve your server of resources and slow it down in any case.
You could batch the update, say 1000 rows at a time (sketched below). Cursors are really going to mess things up even more. Experiment, monitor, analyse the results, and make a choice based on the data you have gathered.
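A rough sketch of the batched approach; FormatCode and CountryCode stand in for the long column list, and the "changed rows only" predicate is what lets the loop terminate (rows whose only difference is a NULL being replaced by a value would need extra handling):

DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (1000) T1
    SET FormatCode  = COALESCE(T.FormatCode,  T1.FormatCode),
        CountryCode = COALESCE(T.CountryCode, T1.CountryCode)
    FROM TABLE1 AS T1
    JOIN TABLE2 AS T ON T1.KEYFIELD = T.KEYFIELD
    WHERE COALESCE(T.FormatCode,  T1.FormatCode) <> T1.FormatCode
       OR COALESCE(T.CountryCode, T1.CountryCode) <> T1.CountryCode;

    SET @rows = @@ROWCOUNT;
END;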
As marc_s has suggested, introducing a more restrictive WHERE clause to reduce the number of rows should help here.
Since your update occurs nightly, it seems you'll be looking to only update the records that have changed since the previous update occurred (i.e. a day's worth of updates). But this will only benefit you if a subset of the records has changed rather than all of them.
I'd probably try to SELECT out the IDs of the rows that have changed into a temp table and then join the temp table as part of the update. To determine the list of IDs, a couple of options come to mind: make use of a last-changed column on TABLE2 (if TABLE2 has one); alternatively, you could compare each field between TABLE1 and TABLE2 to see if they differ (watch out for nulls), although this would be a lot of extra SQL to include and probably a maintenance pain. A third option would be to have an UPDATE trigger against TABLE2 that inserts the KEYFIELD of rows as they are updated during the day into our temp table; the temp table could be cleared down following your nightly update.
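A sketch of the temp-table approach, assuming TABLE2 has a last-changed column (the LastChanged column, the #ChangedKeys table and the cut-off variable are all assumptions):

-- 1. Collect only the keys that changed since the last nightly run
DECLARE @LastRunTime datetime = DATEADD(DAY, -1, GETDATE());  -- e.g. a day's worth of changes

SELECT T.KEYFIELD
INTO #ChangedKeys
FROM TABLE2 AS T
WHERE T.LastChanged >= @LastRunTime;

-- 2. Update only those rows, keeping the lock footprint on TABLE1 small
UPDATE T1
SET FormatCode  = COALESCE(T.FormatCode,  T1.FormatCode),
    CountryCode = COALESCE(T.CountryCode, T1.CountryCode)
FROM TABLE1 AS T1
JOIN #ChangedKeys AS K ON K.KEYFIELD = T1.KEYFIELD
JOIN TABLE2 AS T ON T.KEYFIELD = T1.KEYFIELD;

DROP TABLE #ChangedKeys;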