SQL Server long-running transaction - sql-server

I am wondering how resource-expensive it is to begin a transaction on a connection, immediately update/insert a row, and then leave that transaction hanging for several hours. Basically I want to reserve a "series number" for my document management system. My series are very custom, and whenever a user presses the "Add new document" button, the next value should be allocated in my series allocation table by inserting a row into it. When another user then asks for the next value, that user would read the table with the NOLOCK hint so that my pending (uncommitted) insert is visible and the next value can be determined. If the user cancels the form that adds a new document, I would simply roll back on my open connection. If the connection is lost while I am in "add" mode, I would check whether the transaction ID under which I allocated my series still matches the current one; if not, I would allocate another value. Losing a series value because of a dropped connection is not a problem. What do you think? It feels like very bad practice to me, because it contradicts a rule I learned over several years of software development: open the connection as late as possible and close it as soon as possible.
Thank you in advance!

I would consider using sequences. If they do not fit, I would do something like the following:
Have separate transactions to manage your "series numbers".
These transactions are very short and only do e.g. "get next number".
Have a "state" column to know whether something is in progress.
Lock the whole table to manage its contents.
Avoid NOLOCK. Avoid long running transactions.
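
For illustration, a minimal sketch of the sequence-based approach in T-SQL; the sequence and table names (dbo.DocumentSeries, dbo.SeriesAllocation) and the state values are made up here:

-- One-time setup (hypothetical names).
CREATE SEQUENCE dbo.DocumentSeries AS bigint START WITH 1 INCREMENT BY 1;

-- "Get next number": a short, self-contained transaction per reservation.
BEGIN TRANSACTION;
    DECLARE @Next bigint = NEXT VALUE FOR dbo.DocumentSeries;
    INSERT INTO dbo.SeriesAllocation (SeriesValue, Status, AllocatedAt)
    VALUES (@Next, 'InProgress', SYSUTCDATETIME());
COMMIT TRANSACTION;

-- Later, in another short transaction, mark the reservation as used or abandoned
-- (the reserved value is passed back from the application).
UPDATE dbo.SeriesAllocation SET Status = 'Used' WHERE SeriesValue = @Next;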

Try to keep your transactions as small as possible: get the sequence number in a separate transaction and then start your actual process. That way there will be fewer connections waiting on your in-flight transactions.
You can also consider using Read Uncommitted or another isolation level for certain cases, such as last-week, last-month, or yearly sales reports, where the data needed is already present or a minor error is acceptable. An example is sketched below.
Also consider having proper indexes and properly sequenced joins in order to lower execution time.
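
As an example of the isolation-level suggestion above, a purely historical report where a dirty read is tolerable might look like this (table and column names are made up):

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT YEAR(SoldAt) AS SalesYear, SUM(Amount) AS Total
FROM dbo.Sales
WHERE SoldAt < DATEADD(MONTH, -1, SYSUTCDATETIME())
GROUP BY YEAR(SoldAt);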

Related

Snowflake: Concurrent queries with CREATE OR REPLACE

When running a CREATE OR REPLACE TABLE AS statement in one session, are other sessions able to query the existing table, before the transaction opened by CORTAS is committed?
From reading the usage notes section of the documentation, it appears this is the case. Ideally I'm looking for someone who's validated this in practice and at scale, with a large number of read operations on the target table.
Using OR REPLACE is the equivalent of using DROP TABLE on the existing table and then creating a new table with the same name; however, the dropped table is not permanently removed from the system. Instead, it is retained in Time Travel. This is important to note because dropped tables in Time Travel can be recovered, but they also contribute to data storage for your account. For more information, see Storage Costs for Time Travel and Fail-safe.
In addition, note that the drop and create actions occur in a single atomic operation. This means that any queries concurrent with the CREATE OR REPLACE TABLE operation use either the old or new table version.
Recreating or swapping a table drops its change data. Any stream on the table becomes stale. In addition, any stream on a view that has this table as an underlying table, becomes stale. A stale stream is unreadable.
I have not proved it via performance tests, but we did run this way for 5 years, reading from the tables of one set of warehouses while rebuilding the underlying tables from others, and we never noticed any "corruption of results".
I always thought of Snowflake like double buffering in computer graphics: you have the active buffer that the video signal is reading from (the existing table state), you write to the back buffer while a MERGE/INSERT/UPDATE/DELETE is running, and when that write transaction completes, the active "current page/files/buffer" is flipped; all reads going forward are from the "new" state.
Given the files are immutable, the double-buffer analogy holds really well (this is also how Time Travel works). Thus there is just a "global state of what is current" maintained in the metadata.
As for CORTAS and transactions, I would assume that because it is a DDL operation it completes any open transaction, like all DDL operations do. So perhaps that is the one hiccup to keep in mind in my double-buffer story.
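
A small illustration of the atomic swap described in the documentation quoted above (table names are hypothetical); a concurrent reader resolves against either the old or the new version, never a half-built one:

-- Session 1: rebuild the table in a single atomic drop-and-create.
CREATE OR REPLACE TABLE reporting.sales AS
SELECT * FROM staging.sales_raw;

-- Session 2 (running concurrently): sees either the old reporting.sales
-- or the new one, depending on when the replace completes.
SELECT COUNT(*) FROM reporting.sales;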

Are there best practices for deduplicating records when using auto-ingest Snowpipes?

Currently in Snowflake we have configured an auto-ingest Snowpipe connected to an external S3 stage as documented here. This works well and we're copying records from the pipe into a "landing" table. The end goal is to MERGE these records into a final table to deal with any duplicates, which also works well. My question is how best to safely perform this MERGE without missing any records. At the moment we perform a single data extraction job per day, so there is normally a point where the Snowpipe queue is empty, which we use as an indicator that it is safe to proceed. However, we are looking to move to more frequent extractions, where it will become harder and harder to guarantee there will be no new records ingested at any given point.
Things we've considered:
Temporarily pause the pipe, MERGE the records, TRUNCATE the landing table, then unpause the pipe. I believe this should technically work but it is not clear to me that this is an advised way to work with Snowpipes. I'm not sure how resilient they are to being paused/unpaused, how long it tends to take to pause/unpause, etc. I am aware that paused pipes can become "stale" after 14 days (link) however we're talking about pausing it for a few minutes, not multiple days.
Utilize transactions in some way. I have a general understanding of SQL transactions, but I'm having a hard time determining exactly if/how they could be used in this situation to guarantee no data loss. The general thought is if the MERGE and DELETE could be contained in a transaction it may provide a safe way to process the incoming data throughout the day but I'm not sure if that's true.
Add in a third "processing" table and a task to swap the landing table with the processing table. The task to swap the tables could run on a schedule (e.g. every hour), and I believe the key is to have the conditional statement check both that there are records in the landing table AND that the processing table is empty. At this point the MERGE and TRUNCATE would work off the processing table and the landing table would continue to receive the incoming records.
Any additional insights into these options or completely different suggestions are very welcome.
Look into table streams, which record insertions/updates/deletions to your Snowpipe table. You can then merge off the stream into your target table, which resets the stream's offset. Use a task to run your merge statement. Also, given it is Snowpipe, when creating your stream it is probably best to use an append-only stream.
However, I had a question here where, in some circumstances, we were missing some rows. Our task was set to 1-minute intervals, which may be partly the reason. I never did get to the bottom of it, even with Snowflake support.
What we did notice, though, was that using a stored procedure with a transaction, and also running a SELECT on the stream before the MERGE, seems to have solved the issue, i.e. no more missing rows.
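
A rough sketch of the stream-plus-task setup suggested above (warehouse, table, and column names are made up; tune the schedule and the MERGE to your data):

-- Append-only stream over the Snowpipe landing table.
CREATE OR REPLACE STREAM landing_stream ON TABLE landing APPEND_ONLY = TRUE;

-- Task that merges new rows into the final table; consuming the stream in a
-- committed DML statement advances its offset.
CREATE OR REPLACE TASK merge_landing
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('LANDING_STREAM')
AS
  MERGE INTO final AS f
  USING landing_stream AS s ON f.id = s.id
  WHEN MATCHED THEN UPDATE SET f.payload = s.payload
  WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);

ALTER TASK merge_landing RESUME;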

Updating database keys where one table's keys refer to another's

I have two tables in DynamoDB. One has data about homes, one has data about businesses. The homes table has a list of the closest businesses to it, with walking times to each of them. That is, the homes table has a list of IDs which refer to items in the businesses table. Since businesses are constantly opening and closing, both these tables need to be updated frequently.
The problem I'm facing is that, when either one of the tables is updated, the other table will have incorrect data until it is updated itself. To make this clearer: let's say one business closes and another one opens. I could update the businesses table first to remove the old business and add the new one, but the homes table would then still refer to the now-removed business. Similarly, if I updated the homes table first to refer to the new business, the businesses table would not have this new business's data yet. Whichever table I update first, there will always be a period of time where the two tables are not in sync.
What's the best way to deal with this problem? One way I've considered is to do all the updates to a secondary database and then swap it with my primary database, but I'm wondering if there's a better way.
Thanks!
Dynamo only offers atomic operations on the item level, not transaction level, but you can have something similar to an atomic transaction by enforcing some rules in your application.
Let's say you need to run a transaction with two operations:
Delete Business(id=123) from the table.
Update Home(id=456) to remove association with Business(id=123) from the home.businesses array.
Here's what you can do to mimic a transaction:
Generate a timestamp for locking the items
Let's say our current timestamp is 1234567890. Using a timestamp will allow you to clean up failed transactions (I'll explain later).
Lock the two items
Update both Business-123 and Home-456 and set an attribute lock=1234567890.
Do not change any other attributes yet on this update operation!
Use a ConditionalExpression (check the Developer Guide and API) to verify that attribute_not_exists(lock) before updating. This way you're sure there's no other process using the same items.
Handle update lock responses
Check whether both updates, to Home and to Business, succeeded. If yes to both, it means you can proceed with the actual changes you need to make: delete Business-123 and update Home-456, removing the Business association.
For extra care, also use a ConditionExpression in both updates again, but now ensuring that lock == 1234567890. This way you're extra sure no other process overwrote your lock.
If both updates succeed again, you can consider the two items updated and consistent to be read by other processes. To do this, run a third update removing the lock attribute from both items.
When one of the operations fails, you may try again X times, for example. If it fails all X times, make sure the process cleans up the other lock that succeeded previously.
Enforce the transaction lock throughout your code
Always use a ConditionExpression in any part of your code that may update/delete Home and Business items. This is crucial for the solution to work.
When reading Home and Business items, you'll need to do this (this may not be necessary in all reads, you'll decide if you need to ensure consistency from start to finish while working with an item read from DB):
Retrieve the item you want to read
Generate a lock timestamp
Update the item with lock=timestamp using a ConditionExpression
If the update succeeds, continue using the item normally; if not, wait one or two seconds and try again;
When you're done, update the item removing the lock
Regularly clean up failed transactions
Every minute or so, run a background process to look for potentially failed transactions. If your processes take at most 60 seconds to finish and there's an item with a lock value older than, say, 5 minutes (remember the lock value is the time the transaction started), it's safe to say that this transaction failed at some point and whatever process was running it didn't properly clean up the locks.
This background job ensures that no item stays locked for eternity.
Beware that this implementation does not ensure a truly atomic and consistent transaction in the sense traditional ACID DBs do. If this is mission critical for you (e.g. you're dealing with financial transactions), do not attempt to implement this. Since you said you're ok if atomicity is broken on rare failure occasions, you may live with it happily. ;)
Hope this helps!

Dealing with race condition in transactional database table

Let me lay the scenario out first. Say you have a database for a business app and one of the things it tracks is inventory. The system says you have 5 screws in stock. Say you needed all 5. The system creates an inventory transaction record for -5. After you commit that transaction, since you know you had 5 before and you pulled out 5, if you sum up all the inventory transaction records for that screw the total should be 0. The problem occurs when two people are trying to do this at the same time. Say one person wants 4 and the other wants 2. Both client apps check the quantity beforehand and they are both told 5. At the exact same time one creates a transaction for -4 and the other for -2. This results in a total inventory quantity of -1, which should never be possible because the system should not allow negative inventory.
How would you solve this if you didn't have a server application to help you? I mention that because a server coordinating the inventory transactions is how I would solve it but right now our product has no server application. We just have client apps which talk to a Firebird database directly. I'm trying to figure out how to do this with just the client apps and database. One thing that might help is that Firebird has something called a Generator which is basically a unique number generator that is atomic so you are guaranteed that if you asked Firebird to increment the generator and give you the next number that it will not give anyone else that same number.
My mind was going down the route of trying to create a makeshift record lock using a generator. I thought I could have them both check a "lock" field on the Item table. If it is null, then no one has a lock. If it is non-null it is locked, so you need to keep checking back until it is not locked. If there is no lock you ask the generator for a unique number and store that in the locking field for the Item you want to lock. You commit that transaction, then go back and check whether the Item table's lock field does indeed contain the number you put there. If it does then you have successfully locked, and if it doesn't then someone was locking it at the same time and you lost the race. Once you are done you null out the lock, and the client that is waiting will then see the null, lock it themselves, and repeat.
This itself has a race condition, I believe. Trxn1 (transaction 1) checks the lock and finds null. Trxn2 checks the lock and finds null. Trxn1 gets a new lock number from the generator. Trxn2 gets a new lock number from the generator. Trxn1 says "update the Item record with my lock if the lock is still null", which it is. Trxn1 commits, then starts a new transaction and verifies that the lock contains its lock id; it does, so it knows it has permission to make inventory transactions and starts doing so. Right after Trxn1 checks that it got the lock, Trxn2 commits its update statement that stored its lock if the lock was null. If Trxn2 executed its update statement before Trxn1 committed the lock, then Trxn2 would still see the value as null and the update would occur. If Trxn2's lock commit happens after Trxn1 committed its lock and already verified it, we have a problem. Trxn1 is making changes to the Item transaction table. Trxn2 got its lock committed because the lock was null in its transaction's view when it did the update, and when it commits, Trxn2's update statement will overwrite Trxn1's lock, because the null check in the update statement happened before both committed, not at the time of commit. So now both think they have a lock and we will end up with negative inventory.
Can anyone think of a way to solve this short of having a server application with some kind of queueing system (FIFO)? I would prefer if it could all be done via clients "talking to the database" to coordinate this, but that may not be possible technically speaking. Sorry if this got a bit wordy :D
Solution Edit:
jtahlborn seems to have the right idea. I somehow didn't realize that Firebird does in fact have row-level locking. Simple select statements (no joins, group by, etc.) can have "with lock" appended to the end of the statement, and any row returned by the statement will be locked until the transaction is committed or rolled back. No one else can obtain a lock on that row nor make changes to it. Because I don't want to lock the entire ITEM table while I'm inserting rows into the Item transaction table, I am going to create a table just for locking that has one column (the ItemID field). Because the second transaction will get an error when it tries to take its own lock, it doesn't matter that I never actually modify anything on the locking table itself. Failing to get a lock gives me all the information I need. I will put triggers on the insert/delete of the ITEM table so that for every Item record there is also a record in the ITEMLOCK table. Here is the process I'm going to use.
Start database transaction
Attempted to obtain lock on ITEMLOCK row with the ItemID of the Item you want to change
If you can't get a lock keep trying until the record is unlocked
Once locked, go verify that the quantity on hand of that Item is enough to cover what you want to take out; because the client could have old data this might not be the case, and if so it drops out here and messages the user
If sufficient quantities exist insert your inventory transaction record in the inventory transaction table
Commit transaction which in turn releases the lock
Note: Matthieu M mentioned the FOR UPDATE clause. It is mentioned in the documentation along with the WITH LOCK clause. As I understand it, you can use that when you are locking multiple rows with one statement. I am not one hundred percent sure, but it seems like doing this with WITH LOCK will try an all-or-nothing approach and FOR UPDATE will lock each row separately, one at a time. I am not sure what happens if it locked the first 100 records you asked for but couldn't get a lock on the 101st record. Does it then release the 100 locks you did get? I will need to lock more than one Item at a time, but I do not feel comfortable with FOR UPDATE since I feel like I don't truly understand the difference. I also probably want to know which Item was already locked for user-messaging purposes (I'm going to put in a timeout so transactions won't wait forever for a lock), so I will be locking one at a time using WITH LOCK.
Note 2: I want to point out to anyone using this in their own code to be careful. I am going to have a very simple loop when waiting for a lock to be released (is it released yet? how about now? now?). If I had a ton of users possibly trying to lock the same row at the same time there may be a deadlock scenario. Say you have a slow client. That client may always end up with the short end of the stick because every time the lock was released some other client grabbed it faster than the slow client could. If this happened over and over it would essentially be a deadlock scenario. If I was worried about that I would need a way to figure out who is first in line. In my case, database transactions should be short-lived, we never have more than 50 users (not a cloud system), and it is highly unlikely that they are all using this part of the system at the same time trying to modify the exact same Item's inventory quantity.
The simplest solution is to lock some primary row (like the main "item") and use this as your distributed locking mechanism. (assuming your database supports row-level locks, as most modern dbs do).
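
A bare-bones sketch of that row-lock approach in Firebird SQL, using the ITEMLOCK helper table from the question (the inventory-transaction table name and the parameters are placeholders; retry and error handling omitted):

-- Inside one database transaction:
SELECT ItemID FROM ITEMLOCK WHERE ItemID = :item_id WITH LOCK;
-- If another transaction already holds the lock, this statement waits or
-- errors (depending on the transaction's wait mode) and the client retries.

-- Lock held: re-check the on-hand quantity with fresh data.
SELECT SUM(Quantity) FROM ITEMTRANSACTION WHERE ItemID = :item_id;

-- If sufficient stock exists, record the withdrawal.
INSERT INTO ITEMTRANSACTION (ItemID, Quantity) VALUES (:item_id, -4);

COMMIT; -- releases the row lock on ITEMLOCK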
I recommend reading up about the CAP theorem and how it may be an explanation for the scenario you are describing. EDIT: Having read in more detail, my comment may be of limited use because it seems you already know this and are trying to solve the problem within Firebird.

Predict next auto-inserted row id (SQLite)

I'm trying to find if there is a reliable way (using SQLite) to find the ID of the next row to be inserted, before it gets inserted. I need to use the id for another insert statement, but don't have the option of instantly inserting and getting the next row.
Is predicting the next id as simple as getting the last id and adding one? Is that a guarantee?
Edit: A little more reasoning...
I can't insert immediately because the insert may end up being canceled by the user. User will make some changes, SQL statements will be stored, and from there the user can either save (inserting all the rows at once), or cancel (not changing anything). In the case of a program crash, the desired functionality is that nothing gets changed.
Try SELECT * FROM SQLITE_SEQUENCE WHERE name='TABLE';. The result contains a column called seq, which is the largest number used so far for the selected table. Add 1 to this value to get the next ID.
Also see the SQLite Autoincrement article, which is where the above info came from.
Cheers!
Either scrapping or committing a series of database operations all at once is exactly what transactions are for. Query BEGIN; before the user starts fiddling and COMMIT; once he/she's done. You're guaranteed that either all the changes are applied (if you commit) or everything is scrapped (if you query ROLLBACK;, if the program crashes, power goes out, etc). Once you read from the db, you're also guaranteed that the data is good until the end of the transaction, so you can grab MAX(id) or whatever you want without worrying about race conditions.
http://www.sqlite.org/lang_transaction.html
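
A minimal sketch of that approach (table names are invented for illustration):

BEGIN;
INSERT INTO documents (title) VALUES ('New document');
SELECT last_insert_rowid();   -- the new id, stable for the rest of this transaction
INSERT INTO document_lines (document_id, line_no) VALUES (42, 1);  -- 42 = the id from above
-- User saves:                        COMMIT;
-- User cancels or the app crashes:   ROLLBACK; (or simply nothing is ever committed)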
You can probably get away with adding 1 to the value returned by sqlite3_last_insert_rowid under certain conditions, for example, using the same database connection and there are no other concurrent writers. Of course, you may refer to the sqlite source code to back up these assumptions.
However, you might also seriously consider using a different approach that doesn't require predicting the next ID. Even if you get it right for the version of sqlite you're using, things could change in the future and it will certainly make moving to a different database more difficult.
Insert the row with an INVALID flag of some kind, get the ID, edit it as needed, then delete it if necessary or mark it as valid. And don't worry about gaps in the sequence.
BTW, you will need to figure out how to do the invalid part yourself. Marking something as NULL might work depending on the specifics.
Edit: If you can, use Eevee's suggestion of using proper transactions. It's a lot less work.
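
A rough sketch of the flag idea, again with invented table and column names:

-- Reserve an id immediately; the rest of the app ignores rows with is_valid = 0.
INSERT INTO documents (title, is_valid) VALUES (NULL, 0);
SELECT last_insert_rowid();                     -- say it returns 42

-- User saves: make the row real.
UPDATE documents SET title = 'New document', is_valid = 1 WHERE id = 42;

-- User cancels: discard the placeholder (leaving a gap in the ids).
DELETE FROM documents WHERE id = 42;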
I realize your application using SQLite is small and SQLite has its own semantics. Other solutions posted here may well have the effect that you want in this specific setting, but in my view every single one of them I have read so far is fundamentally incorrect and should be avoided.
In a normal environment holding a transaction for user input should be avoided at all costs. The way to handle this, if you need to store intermediate data, is to write the information to a scratch table for this purpose and then attempt to write all of the information in an atomic transaction. Holding transactions invites deadlocks and concurrency nightmares in a multi-user environment.
In most environments you cannot assume data retrieved via SELECT within a transaction is repeatable. For example
SELECT Balance FROM Bank ...
UPDATE Bank SET Balance = valuefromselect + 1.00 WHERE ...
By the time the UPDATE runs, the value of Balance read by the SELECT may well have changed. Sometimes you can get around this by first updating the row(s) you are interested in in Bank within a transaction, as this is guaranteed to lock the row, preventing further updates from changing its value until your transaction has completed.
However, sometimes a better way to ensure consistency in this case is to check your assumptions about the contents of the data in the WHERE clause of the update and check row count in the application. In the example above when you "UPDATE Bank" the WHERE clause should provide the expected current value of balance:
WHERE Balance = valuefromselect
If the expected balance no longer matches, neither does the WHERE condition -- the UPDATE does nothing and the row count returns 0. This tells you there was a concurrency issue and you need to rerun the operation later, when something else isn't trying to change your data at the same time.
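
Putting the two halves together, a minimal optimistic-update sketch using the Bank example above (AccountId and the literal values are illustrative; the row-count check happens in the application):

SELECT Balance FROM Bank WHERE AccountId = 7;   -- suppose it returns 100.00

UPDATE Bank
SET Balance = 101.00                            -- 100.00 + 1.00
WHERE AccountId = 7
  AND Balance = 100.00;                         -- the value the SELECT returned

-- If the affected row count is 0, another writer changed Balance in between:
-- re-read and retry instead of overwriting blindly.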
select max(id) from particular_table is unreliable, for the reason below:
http://www.sqlite.org/autoinc.html
"The normal ROWID selection algorithm described above will generate monotonically increasing unique ROWIDs as long as you never use the maximum ROWID value and you never delete the entry in the table with the largest ROWID. If you ever delete rows or if you ever create a row with the maximum possible ROWID, then ROWIDs from previously deleted rows might be reused when creating new rows and newly created ROWIDs might not be in strictly ascending order."
I think this can't be done, because there is no way to be sure that nothing will get inserted between you asking and you inserting. (You might be able to lock the table against inserts, but yuck.)
(BTW, I've only used MySQL, but I don't think that will make any difference.)
Most likely you will be able to add 1 to the most recent id. I would look at all of the existing ids in the ordered table (going back a while): are they consistent, and is each row's ID one more than the last? If so, you'll probably be fine. I'd leave a comment in the code explaining the assumption, however. Taking a lock will also help guarantee that you're not getting additional rows while you do this.
Select the last_insert_rowid() value.
Most of everything that needs to be said in this topic already has... However, be very careful of race conditions when doing this. If two people both open your application/webpage/whatever, and one of them adds a row, the other user will try to insert a row with the same ID and you will have lots of issues.
select max(id) from particular_table;
The next id will be +1 from the maximum id.
