Dealing with race condition in transactional database table - database

Let me lay the scenario out first. Say you have a database for a business app and one of the things it tracks is inventory. The system says you have 5 screws in stock. Say you needed all 5. The system creates an inventory transaction record for -5. After you commit that transaction, since you know you had 5 before and you pulled out 5, if you sum up all the inventory transaction records for that screw the total should be 0. The problem occurs when two people are trying to do this at the same time. Say one person wants 4 and the other wants 2. Both client apps check the quantity beforehand and they are both told 5. At the exact same time one creates a transaction for -4 and the other for -2. This results in a total inventory quantity of -1, which should never be possible because the system should not allow negative inventory.
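To make the failure concrete, here is a minimal sketch of the check-then-insert race in plain Python. The `ledger` list and the `withdraw_unchecked` helper are made up for illustration; they stand in for the inventory transaction table and the client logic, and the interleaving is forced by hand rather than produced by real concurrency.

```python
# Inventory transaction quantities for one item: +5 received.
ledger = [5]

def on_hand(txns):
    """Quantity on hand is the sum of all inventory transactions."""
    return sum(txns)

# Both clients check the quantity before either one has written anything.
seen_by_a = on_hand(ledger)
seen_by_b = on_hand(ledger)

def withdraw_unchecked(txns, seen, qty):
    """Append a withdrawal if the *previously read* quantity covers it."""
    if seen >= qty:
        txns.append(-qty)
        return True
    return False

ok_a = withdraw_unchecked(ledger, seen_by_a, 4)
ok_b = withdraw_unchecked(ledger, seen_by_b, 2)
print(ok_a, ok_b, on_hand(ledger))  # True True -1
```

Both withdrawals are accepted because each one is validated against a stale read, which is exactly the gap a lock has to close.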
How would you solve this if you didn't have a server application to help you? I mention that because a server coordinating the inventory transactions is how I would solve it but right now our product has no server application. We just have client apps which talk to a Firebird database directly. I'm trying to figure out how to do this with just the client apps and database. One thing that might help is that Firebird has something called a Generator which is basically a unique number generator that is atomic so you are guaranteed that if you asked Firebird to increment the generator and give you the next number that it will not give anyone else that same number.
My mind was going down the route of trying to create a makeshift record lock using a generator. I thought I could have them both check a "lock" field on the Item table. If it is null, then no one has a lock. If it is non-null it is locked, so you need to keep checking back until it is not locked. If there is no lock you ask the generator for a unique number and store that in the locking field for the Item you want to lock. You commit that transaction, then go back and check whether the Item table's lock field does indeed contain the number you put there. If it does then you have successfully locked, and if it doesn't then someone was locking it at the same time and you lost the race. Once you are done you null out the lock, and the client that is waiting will then see the null, lock it themselves and repeat.
This itself has a race condition I believe though. Trxn1 (transaction 1) checks lock and finds null. Trxn2 checks lock and finds null. Trxn1 gets new lock number from generator. Trxn2 gets new lock from generator. Trxn1 says update Item record with my lock if lock is still null which it is. Trxn1 commits trxn then starts a new Trxn1 and proves the lock contains his lock id and it does so it knows it has permission to make inventory transactions and it starts doing so. Right after Trxn1 checks to see if it got the lock Trxn2 commits its update statement that stored its lock if the lock was null. If Trxn2 executed his update statement before Trxn1 committed the lock then Trxn2 would still see the value as null and the update would occur. If Trxn2's lock commit happens after Trxn1 committed lock and already verified it we have a problem. Trxn1 is making changes to Item transaction table. Trxn2 got his lock committed because the lock was null in its transaction world when it did it and when it commits Trxn2's update statement will overwrite Trxn1's lock because the null check in the update statement happened before both committed, not at the time of commit. So now both think they have a lock and we will end up with negative inventory.
Can anyone think of a way to solve this short of having a server application with some kind of queueing system (FIFO)? I would prefer if it could all be done via clients "talking to the database" to coordinate this but that may not be possible technically speaking. Sorry If this got a bit wordy :D
Solution Edit:
jtahlborn seems to have the right idea. I somehow didn't realize that Firebird does in fact have row level locking. Simple select statements (no joins, group by, etc) can have "with lock" appended to the end of the statement and any row returned by the statement will be locked until the transaction is committed or rolled back. No one else can obtain a lock on that row nor make changes to it. Because I don't want to lock the entire ITEM table while I'm inserting rows into the Item transaction table, I am going to create a table just for locking that has one column (the ItemID field). Because the second transaction will get an error when it tries to take its own lock, it doesn't matter that I am never actually modifying anything on the locking table itself. Failing to get a lock gives me all the information I need. I will put triggers on the insert / delete of the ITEM table so that for every Item record there is also a record in the ITEMLOCK table. Here is the process I'm going to use.
Start database transaction
Attempt to obtain a lock on the ITEMLOCK row with the ItemID of the Item you want to change
If you can't get a lock keep trying until the record is unlocked
Once locked, go prove that the quantity on hand of that Item is enough to cover what you want to take out; because the client could have old data this might not be the case, and it will drop out here and message the user
If sufficient quantities exist insert your inventory transaction record in the inventory transaction table
Commit transaction which in turn releases the lock
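The steps above can be sketched roughly as follows. This is not Firebird code: it uses Python's built-in sqlite3 module as a stand-in, where `BEGIN IMMEDIATE` takes a database-wide write lock instead of the per-row `WITH LOCK` described above. The table name `inv_txn` and the helpers `open_client` and `withdraw` are assumptions for illustration only.

```python
import os
import sqlite3
import tempfile

def open_client(path):
    # isolation_level=None lets us issue BEGIN/COMMIT ourselves;
    # timeout=0 makes a busy lock raise immediately instead of waiting.
    return sqlite3.connect(path, isolation_level=None, timeout=0)

def withdraw(conn, item_id, qty):
    """Return True if the withdrawal was recorded, False if stock is short."""
    conn.execute("BEGIN IMMEDIATE")  # take the lock BEFORE reading the quantity
    try:
        (on_hand,) = conn.execute(
            "SELECT COALESCE(SUM(qty), 0) FROM inv_txn WHERE item_id = ?",
            (item_id,),
        ).fetchone()
        if on_hand < qty:
            conn.execute("ROLLBACK")  # releases the lock
            return False
        conn.execute(
            "INSERT INTO inv_txn (item_id, qty) VALUES (?, ?)", (item_id, -qty)
        )
        conn.execute("COMMIT")  # releases the lock
        return True
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise

db_path = os.path.join(tempfile.mkdtemp(), "inventory.db")
client_a = open_client(db_path)
client_b = open_client(db_path)
client_a.execute("CREATE TABLE inv_txn (item_id INTEGER, qty INTEGER)")
client_a.execute("INSERT INTO inv_txn VALUES (1, 5)")

# While A holds the lock, B's attempt to take it fails with an error.
client_a.execute("BEGIN IMMEDIATE")
try:
    client_b.execute("BEGIN IMMEDIATE")
    b_was_blocked = False
except sqlite3.OperationalError:
    b_was_blocked = True
client_a.execute("ROLLBACK")

result_a = withdraw(client_a, 1, 4)  # 5 on hand, takes 4
result_b = withdraw(client_b, 1, 2)  # only 1 left, refused
total = client_a.execute("SELECT SUM(qty) FROM inv_txn").fetchone()[0]
print(b_was_blocked, result_a, result_b, total)
```

The important part is the ordering: the lock is taken first and the quantity is read only afterwards, so the check can no longer be based on stale data.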
Note: Matthieu M mentioned the FOR UPDATE clause. It is mentioned in the documentation along with the WITH LOCK clause. As I understand it you can use that when you are locking multiple rows with one statement. I am not one hundred percent sure, but it seems like doing this with WITH LOCK will try an all or nothing approach and FOR UPDATE will lock each row separately, one at a time. I am not sure what happens if it locked the first 100 records you asked for but on the 101st record it couldn't get a lock. Does it then release the 100 locks you did get? I will need to lock more than one Item at a time, but I do not feel comfortable with FOR UPDATE since I feel like I don't truly understand the difference. I also probably want to know which Item was already locked for user messaging purposes (going to put in a timeout so trxns won't wait forever for a lock), so I will be locking one at a time using WITH LOCK.
Note 2: I want to point out to anyone using this in their own code to be careful. I am going to have a very simple loop when waiting for a lock to be released (is it released yet? how about now? now?). If I had a ton of users possibly trying to lock the same row at the same time there may be a deadlock-like scenario. Say you have a slow client. That client may always end up with the short end of the stick because every time the lock was released some other client grabbed it faster than the slow client could. If this happened over and over it would essentially be a deadlock for that client (strictly speaking, starvation rather than a true deadlock). If I was worried about that I would need a way to figure out who is first in line. In my case, database transactions should be short lived, we never have more than 50 users (not a cloud system), and it is highly unlikely that they are all using this part of the system at the same time trying to modify the exact same Item's inventory quantity.
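For what it's worth, the "is it released yet?" loop with a timeout could look something like this sketch. `acquire_with_timeout` and its parameters are hypothetical names, and `try_lock` is whatever callable attempts the real lock (for example, running the WITH LOCK select and returning False on a lock error). It does nothing about fairness, so the slow-client problem described above still applies.

```python
import time

def acquire_with_timeout(try_lock, timeout_s=5.0, poll_s=0.1):
    """Call try_lock() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if try_lock():
            return True
        if time.monotonic() >= deadline:
            return False  # give up so the caller can message the user
        time.sleep(poll_s)

# Fake lock for demonstration: busy on the first two attempts, then free.
attempts = []

def busy_twice_then_free():
    attempts.append(1)
    return len(attempts) >= 3

got_lock = acquire_with_timeout(busy_twice_then_free, timeout_s=1.0, poll_s=0.01)
gave_up = not acquire_with_timeout(lambda: False, timeout_s=0.05, poll_s=0.01)
print(got_lock, len(attempts), gave_up)  # True 3 True
```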

The simplest solution is to lock some primary row (like the main "item") and use this as your distributed locking mechanism. (assuming your database supports row-level locks, as most modern dbs do).

I recommend reading up about the CAP theorem and how it may be an explanation for the scenario you are describing. EDIT: Having read in more detail, my comment may be of limited use because it seems you already know this and are trying to solve the problem within Firebird.

Related

Updating database keys where one table's keys refer to another's

I have two tables in DynamoDB. One has data about homes, one has data about businesses. The homes table has a list of the closest businesses to it, with walking times to each of them. That is, the homes table has a list of IDs which refer to items in the businesses table. Since businesses are constantly opening and closing, both these tables need to be updated frequently.
The problem I'm facing is that, when either one of the tables is updated, the other table will have incorrect data until it is updated itself. To make this clearer: let's say one business closes and another one opens. I could update the businesses table first to remove the old business and add the new one, but the homes table would then still refer to the now-removed business. Similarly, if I updated the homes table first to refer to the new business, the businesses table would not yet have this new business' data yet. Whichever table I update first, there will always be a period of time where the two tables are not in synch.
What's the best way to deal with this problem? One way I've considered is to do all the updates to a secondary database and then swap it with my primary database, but I'm wondering if there's a better way.
Thanks!
Dynamo only offers atomic operations on the item level, not transaction level, but you can have something similar to an atomic transaction by enforcing some rules in your application.
Let's say you need to run a transaction with two operations:
Delete Business(id=123) from the table.
Update Home(id=456) to remove association with Business(id=123) from the home.businesses array.
Here's what you can do to mimic a transaction:
Generate a timestamp for locking the items
Let's say our current timestamp is 1234567890. Using a timestamp will allow you to clean up failed transactions (I'll explain later).
Lock the two items
Update both Business-123 and Home-456 and set an attribute lock=1234567890.
Do not change any other attributes yet on this update operation!
Use a ConditionExpression (check the Developer Guide and API) to verify that attribute_not_exists(lock) before updating. This way you're sure there's no other process using the same items.
Handle update lock responses
Check if both updates succeeded to Home and Business. If yes to both, it means you can proceed with the actual changes you need to make: delete the Business-123 and update the Home-456 removing the Business association.
For extra care, also use a ConditionExpression in both updates again, but now ensuring that lock == 1234567890. This way you're extra sure no other process overwrote your lock.
If both updates succeed again, you can consider the two items updated and consistent to be read by other processes. To do this, run a third update removing the lock attribute from both items.
When one of the operations fails, you may try again X times, for example. If it fails all X times, make sure the process cleans up the other lock that succeeded previously.
Enforce the transaction lock throughout your code
Always use a ConditionExpression in any part of your code that may update/delete Home and Business items. This is crucial for the solution to work.
When reading Home and Business items, you'll need to do this (this may not be necessary in all reads, you'll decide if you need to ensure consistency from start to finish while working with an item read from DB):
Retrieve the item you want to read
Generate a lock timestamp
Update the item with lock=timestamp using a ConditionExpression
If the update succeeds, continue using the item normally; if not, wait one or two seconds and try again;
When you're done, update the item removing the lock
Regularly clean up failed transactions
Every minute or so, run a background process to look for potentially failed transactions. If your processes take at max 60 seconds to finish and there's an item with lock value older than, say 5 minutes (remember lock value is the time the transaction started), it's safe to say that this transaction failed at some point and whatever process running it didn't properly clean up the locks.
This background job would ensure that no items stay locked for eternity.
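A rough sketch of that cleanup pass, using a plain dict in place of a DynamoDB table (so none of this is real DynamoDB API). The function name `clear_stale_locks` and the 300-second threshold are illustrative assumptions; against a real table the removal should itself be conditional on the lock still holding the stale value.

```python
def clear_stale_locks(items, now, max_age_s=300):
    """Remove 'lock' from items whose lock timestamp is older than max_age_s.

    Returns the keys that were unlocked.
    """
    cleared = []
    for key, item in items.items():
        stamp = item.get("lock")
        if stamp is not None and now - stamp > max_age_s:
            del item["lock"]
            cleared.append(key)
    return cleared

items = {
    "Home-456": {"lock": 1000},  # locked 400 s ago: stale
    "Home-457": {"lock": 1290},  # locked 110 s ago: still plausible
    "Home-458": {},              # not locked
}
cleared = clear_stale_locks(items, now=1400)
print(cleared)  # ['Home-456']
```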
Beware that this implementation does not assure a real atomic and consistent transaction in the sense traditional ACID DBs do. If this is mission critical for you (e.g. you're dealing with financial transactions), do not attempt to implement this. Since you said you're ok if atomicity is broken on rare failure occasions, you may live with it happily. ;)
Hope this helps!
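To illustrate the lock / change / unlock sequence described in this answer, here is a small in-memory simulation. A dict stands in for the table, and `conditional_update` is a made-up helper that mimics an update guarded by a ConditionExpression; it assumes each conditional update is atomic per item, which DynamoDB provides and a single-threaded dict gives for free. None of the names below are real DynamoDB API.

```python
class ConditionFailed(Exception):
    pass

def conditional_update(table, key, changes, condition):
    """Apply changes to table[key] only if condition(item) holds."""
    item = table[key]
    if not condition(item):
        raise ConditionFailed(key)
    for attr, value in changes.items():
        if value is None:
            item.pop(attr, None)  # None means "remove this attribute"
        else:
            item[attr] = value

def lock_all(table, keys, stamp):
    """Lock every key or none: release locks already taken if one fails."""
    taken = []
    for key in keys:
        try:
            conditional_update(table, key, {"lock": stamp},
                               lambda it: "lock" not in it)
            taken.append(key)
        except ConditionFailed:
            for k in taken:
                conditional_update(table, k, {"lock": None},
                                   lambda it: it.get("lock") == stamp)
            return False
    return True

def remove_business(table, home_key, business_key, business_id, stamp):
    if not lock_all(table, [business_key, home_key], stamp):
        return False
    owns = lambda it: it.get("lock") == stamp  # "extra care" re-check
    remaining = [b for b in table[home_key]["businesses"] if b != business_id]
    conditional_update(table, home_key, {"businesses": remaining}, owns)
    if not owns(table[business_key]):
        raise ConditionFailed(business_key)
    del table[business_key]  # conditional delete of the business
    conditional_update(table, home_key, {"lock": None}, owns)  # unlock
    return True

table = {"Business-123": {"name": "Cafe"},
         "Home-456": {"businesses": [123, 789]}}
done = remove_business(table, "Home-456", "Business-123", 123, 1234567890)

# Second case: the home is already locked by someone else, so nothing changes
# and the business lock that was taken first is released again.
busy = {"Business-123": {"name": "Cafe"},
        "Home-456": {"businesses": [123], "lock": 1111111111}}
refused = not remove_business(busy, "Home-456", "Business-123", 123, 1234567890)
print(done, table, refused)
```

In the second case the caller would wait and retry, as described above.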

SQL Server long running transaction

I am wondering how resource expensive it is to perform a begin transaction on a connection, immediately update/insert a row into a database, and leave this transaction hanging for several hours. Basically I just want to perform a "series number" reservation for my document management system. My series are very custom, and I want that whenever a user presses the "Add new document" button, the next value will be allocated in my series allocation table. To allocate it I would insert a row into the allocation table. Next time, when a new user asks for the next value, he will read using the NOLOCK hint so that he will see my pending inserted value and will therefore know the next value too. If the user cancels the form that adds a new document, I would simply perform a rollback over my opened connection. If the connection is lost and I am in "add" mode, then I would check if the transaction id on which I allocated my series matches the current one. If not, then I would allocate another one. It is not a problem if a user loses a series due to a lost connection. What do you think? I feel like it's a very bad practice because it contradicts the idea that I learned in my several years of software development: open the connection as late as possible and close it as soon as possible.
Thank you in advance!
I would consider using sequences. If they do not fit, I would do something like the following:
Have separate transactions to manage your "series numbers".
These transactions are very short and only do e.g. "get next number".
Have a "state" column to know whether something is in progress.
Lock the whole table to manage its contents.
Avoid NOLOCK. Avoid long running transactions.
Try to keep your transactions as small as possible. Get the sequence number in a different transaction and then start your actual process; this way fewer connections will be waiting on your in-progress transactions.
You can also consider using Read Uncommitted or another isolation level for certain cases, like last week's, last month's, or yearly sales, where the data needed is already present or a minor error is acceptable.
Consider having proper indexes and properly sequenced joins in order to lower execution time.
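As a sketch of the "short transaction plus state column" idea from these answers: the number is reserved and committed immediately, and the long-lived form only flips a state flag later. This uses Python's sqlite3 module as a stand-in for SQL Server, and the table `series_alloc` and helpers `allocate_series` / `finish` are assumptions for illustration. On SQL Server itself, a sequence object would be the first thing to try, as suggested above.

```python
import sqlite3

def allocate_series(conn):
    """Reserve the next series number in one short transaction."""
    conn.execute("BEGIN IMMEDIATE")  # lock held only for these statements
    (next_no,) = conn.execute(
        "SELECT COALESCE(MAX(series_no), 0) + 1 FROM series_alloc"
    ).fetchone()
    conn.execute(
        "INSERT INTO series_alloc (series_no, state) VALUES (?, 'RESERVED')",
        (next_no,),
    )
    conn.execute("COMMIT")
    return next_no

def finish(conn, series_no, saved):
    """Mark a reservation as used or cancelled once the form is closed."""
    state = "USED" if saved else "CANCELLED"
    conn.execute(
        "UPDATE series_alloc SET state = ? WHERE series_no = ?",
        (state, series_no),
    )

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE series_alloc (series_no INTEGER PRIMARY KEY, state TEXT NOT NULL)"
)
first = allocate_series(conn)     # user 1 opens the form
second = allocate_series(conn)    # user 2 opens the form, no waiting on user 1
finish(conn, first, saved=False)  # user 1 cancels
finish(conn, second, saved=True)  # user 2 saves
states = dict(conn.execute("SELECT series_no, state FROM series_alloc"))
print(first, second, states)  # 1 2 {1: 'CANCELLED', 2: 'USED'}
```

No transaction stays open while the user is filling in the form, and no one needs NOLOCK to see the reserved value because it is already committed.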

When does "select for update" lock and unlock?

Here is my Pseudo-code:
re = [select **result** from table where **condition**=key for update]
if[re satisfies]
{
delete from table where **condition** = key;
}
commit
I want to ask: if the row whose condition equals "key" has already been deleted, is the lock taken by the "select for update" released automatically, so that another process that enters at this point and selects the same "key" is not blocked by this one?
Locks are taken during (usually at or near the beginning of) a command's execution. Locks (except advisory locks) are released only when a transaction commits or rolls back. There is no FOR UNLOCK, nor is there an UNLOCK command to reverse the effects of the table-level LOCK command. This is all explained in the concurrency control section of the PostgreSQL documentation.
You must commit or rollback your transaction to release locks.
Additionally, it doesn't really make sense to ask "has this row already been deleted by another concurrent transaction". It isn't really deleted until the transaction that deleted the row commits... and even then, it might've deleted and re-inserted the row or another concurrent transaction might've inserted the row again.
Are you building a task queue or message queue system by any chance, because if so, that problem is solved and you shouldn't be trying to reinvent that unusually complicated wheel. See PGQ, ActiveMQ, RabbitMQ, ZeroMQ, etc. (Future PostgreSQL versions may include FOR UPDATE SKIP LOCKED as this is being tested, but hasn't been released at time of writing).
I suggest that you post a new question with a more detailed description of the underlying problem you are trying to solve. You're assuming that the solution to your problem is "find out if the row has already been deleted" or "unlock the row". That probably isn't actually the solution. It's a bit like someone saying "where do I buy petrol" when their push-bike doesn't go so they assume it's out of fuel. Fuel isn't the problem, the problem is that push bikes don't take fuel and you have to pedal them.
Explain the background. Explain what you're trying to achieve. Above all else, don't post pseudocode, post the actual code you are having problems with, preferably in a self-contained and runnable form.

Can Lost Update happen in read committed isolation level in PostgreSQL?

I have a query like below in PostgreSQL:
UPDATE
    queue
SET
    status = 'PROCESSING'
WHERE
    queue.status = 'WAITING' AND
    queue.id = (SELECT id FROM queue WHERE STATUS = 'WAITING' LIMIT 1)
RETURNING
    queue.id
and many workers try to process one work at a time (that's why I have sub-query with limit 1). After this update, each worker grabs information about the id and processes the work, but sometimes they grab the same work and process it twice or more. The isolation level is Read Committed.
My question is how can I guarantee one work is going to be processed once? I know there are so many posts out there, but I can say I have tried most of them and it didn't help:
I have tried SELECT FOR UPDATE, but it caused a deadlock situation.
I have tried pg_try_advisory_xact_lock, but it caused out of shared memory.
I tried adding AND pg_try_advisory_xact_lock(queue.id) to the outer query's WHERE clause, but ... [?]
Any help would be appreciated.
A lost update won't occur in the situation you describe, but it won't work properly either.
What will happen in the example you've given above is that given (say) 10 workers started simultaneously, all 10 of them will execute the subquery and get the same ID. They will all attempt to lock that ID. One of them will succeed; the others will block on the first one's lock. Once the first backend commits or rolls back, the 9 others will race for the lock. One will get it, re-check the WHERE clause and see that the queue.status test no longer matches, and return without modifying any rows. The same will happen with the other 8. So you used 10 queries to do the work of one query.
If you fail to explicitly check the UPDATE result and see that zero rows were updated you might think you were getting lost updates, but you aren't. You just have a concurrency bug in your application caused by a misunderstanding of the order-of-execution and isolation rules. All that's really happening is that you're effectively serializing your backends so that only one at a time actually makes forward progress.
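To show what "explicitly check the UPDATE result" means, here is a small sketch using Python's sqlite3 module in place of PostgreSQL. Three workers are simulated one after another on a single connection, all holding the same pre-selected id, which mirrors what happens once each one gets its turn at the row lock. The `claim` helper is a made-up name.

```python
import sqlite3

def claim(conn, job_id):
    """Try to move a job from WAITING to PROCESSING; True only if we did it."""
    cur = conn.execute(
        "UPDATE queue SET status = 'PROCESSING' "
        "WHERE id = ? AND status = 'WAITING'",
        (job_id,),
    )
    return cur.rowcount == 1  # zero rows updated means someone else got it

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO queue VALUES (?, 'WAITING')", [(1,), (2,)])

# Every worker runs the subquery first and sees the same id.
picked = conn.execute(
    "SELECT id FROM queue WHERE status = 'WAITING' ORDER BY id LIMIT 1"
).fetchone()[0]

results = [claim(conn, picked) for _ in range(3)]
print(picked, results)  # 1 [True, False, False]
```

A worker that gets False must not process the job; it should go back and pick another one.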
The only way PostgreSQL could avoid having them all get the same queue item ID would be to serialize them, so it didn't start executing query #2 until query #1 finished. If you want to you can do this by LOCKing the queue table ... but again, you might as well just have one worker then.
You can't get around this with advisory locks, not easily anyway. Hacks where you iterated down the queue using non-blocking lock attempts until you got the first lockable item would work, but would be slow and clumsy.
You are attempting to implement a work queue using the RDBMS. This will not work well. It will be slow, it will be painful, and getting it both correct and fast will be very very hard. Don't roll your own. Instead, use a well established, well tested system for reliable task queueing. Look at RabbitMQ, ZeroMQ, Apache ActiveMQ, Celery, etc. There's also PGQ from Skytools, a PostgreSQL-based solution.
Related:
In PostgreSQL, do multiple UPDATES to different rows in the same table having a locking conflict?
Can multiple threads cause duplicate updates on constrained set?
Why do we need message brokers like rabbitmq over a database like postgres?
SKIP LOCKED can be used to implement queue in PostgreSql. see
In PostgreSQL, lost update happens in READ COMMITTED and READ UNCOMMITTED but if you use SELECT FOR UPDATE in READ COMMITTED and READ UNCOMMITTED, lost update doesn't happen.
In addition, lost update doesn't happen in REPEATABLE READ and SERIALIZABLE whether or not you use SELECT FOR UPDATE. *Error happens if there is a lost update condition.

When I update/insert a single row should it lock the entire table?

I have two long running queries that are both on transactions and access the same table but completely separate rows in those tables. These queries also perform some update and inserts based on those queries.
It appears that when these run concurrently that they encounter a lock of some kind and it’s preventing the task from finishing and locks up when it goes to update one of the rows. I’m using an exclusive row lock on the rows being read and the lock that shows up on the process is a lck_m_ix lock.
Two questions:
When I update/insert a single row does it lock the entire table?
What can be done to work around this sort of issue?
Typically no, but it depends (most often used answer for SQL Server!)
SQL Server will have to lock the data involved in a transaction in some way. It has to lock the data in the table itself, and the data any affected indexes, while you perform a modification. In order to improve concurrency, there are several "granularities" of locking that the server might decide to use, in order to allow multiple processes to run: row locks, page locks, and table locks are common (there are more). Which scale of locking is in play depends on how the server decides to execute a given update. Complicating things, there are also classifications of locks like shared, exclusive, and intent exclusive, that control whether the locked object can be read and/or modified.
It's been my experience that SQL Server mainly uses page locks for changes to small portions of tables, and past some threshold will automatically escalate to a table lock, if a larger portion of a table seems (from stats) to be affected by an update or delete. The idea is that it is faster to lock a table (one lock) than obtaining and managing thousands of individual row or page locks for a big update.
To see what is happening in your specific case, you'd need to look at the query logic and, while your stuff is running, examine the locking/blocking conditions in sys.dm_tran_locks, sys.dm_os_waiting_tasks or other DMV's. You would want to discover what exactly is getting locked by what step in each of your processes, to discover why one is blocking the other.
The short version:
No
Fix your code.
The long version:
LCK_M_IX is an intent lock, meaning the operation will place an X lock on a subordinate element. E.g. when updating a row in a table, the operation takes an IX lock on the table before taking an X lock on the row being updated/inserted/deleted. Intent locks are a common strategy to deal with hierarchies, like table/page/row, because the lock manager cannot understand the physical structure of resources requested to be locked (i.e. it cannot know that an X-lock on page P1 is incompatible with an S-lock on row R1 because R1 is contained in P1). For more details, see Lock Modes.
The fact that you are seeing contention on intent locks means you are trying to obtain high level object locks, like table locks. You will need to analyze your source code for the request being blocked (the one requesting the lock incompatible with LCK_M_IX) and remove the cause of the object level lock request. What that means will depend on your source code, I cannot know what you're doing there. My guess is that you use an erroneous lock hint.
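The reasoning above can be checked against the usual compatibility rules for four of the lock modes (IS, IX, S, X). The sketch below encodes only that subset as a lookup; the names `_COMPATIBLE` and `compatible` are made up for illustration, and SQL Server has more modes (U, SIX, and others) that are left out here.

```python
# Unordered pairs of table-level lock modes that can be held together.
_COMPATIBLE = {
    frozenset(pair)
    for pair in [("IS", "IS"), ("IS", "IX"), ("IS", "S"), ("IX", "IX"), ("S", "S")]
}

def compatible(held, requested):
    """True if a lock in mode `requested` can be granted while `held` exists."""
    return frozenset((held, requested)) in _COMPATIBLE

# Two sessions updating different rows both take IX on the table: no conflict.
two_row_updaters = compatible("IX", "IX")

# A session holding a table-level S or X lock makes an IX request wait,
# which is what a task waiting on LCK_M_IX points at.
blocked_by_table_s = not compatible("S", "IX")
blocked_by_table_x = not compatible("X", "IX")
print(two_row_updaters, blocked_by_table_s, blocked_by_table_x)  # True True True
```

So two plain row-level updates should not block each other at the table level; something has to be asking for a table-level S or X lock for the IX request to wait.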
A more general approach is to rely on SNAPSHOT ISOLATION. But this, most likely, will not solve the problem you're seeing, since snapshot isolation can only benefit row level contention issues, not applications that request table locks.
A frequent aim of using transactions: keep them as short and sweet as possible. I get the sense from your wording in the question that you are opening a transaction, then doing all kinds of things, some of which take a long time. Then expecting multiple users to be able to run this same code concurrently. Unfortunately, if you perform an insert at the beginning of that set of code, then do 40 other things before committing or rolling back, it is possible that that insert will block everyone else from running the same type of insert, essentially turning your operation from free-for-all to serial.
Find out what each query is doing, and if you are getting lock escalations that you wouldn't expect. Just because you say WITH (ROWLOCK) on a query doesn't mean SQL Server will be able to comply... if you are touching multiple indexes, indexed views, persisted computed columns etc. then there are all kinds of reasons why your rowlock may not hold any water. You also might have things later in the transaction that are taking longer than you think, and maybe you don't realize that the locks on all of the objects involved in the transaction (not just the statement that is currently running) can be held for the duration of the transaction.
Different databases have different locking mechanisms, but ones like SQL Server and Oracle have different types of locking.
The default on SQL Server appears to be pessimistic Page locking - so if you have a small number of records then all of them may get locked.
Most databases should not lock when running a script, so I'm wondering whether you're potentially running multiple queries concurrently without transactions.
