Why are rollbacks so important?
Is it to prevent data (like data in a SQL DB) from being in an inconsistent state?
If so, how come the data "store" (the SQL DB or whatever) allowed itself to get into a corrupt state in the first place?
Are there data storage mechanisms that don't have a need for "rollbacks"?
Rollbacks are important in case any kind of error appears during database operations. They can really save the day when the database server crashes or a critical exception is thrown in an application that modifies the contents of the DB. When a significant DB operation is performed (e.g. updates, inserts, etc.) and the process breaks in the middle, it would be very hard to trace which operations succeeded, and using the DB afterward would be very complicated.
The "store" itself does not generally have a built-in mechanism for consistency control - this is exactly why we use rollbacks and transactions. This can be perceived as a sort of 'live backup' mechanism.
There are cases when you need to insert/update data in many related tables - without transactional logic, an error somewhere in the middle of the process could leave the data inconsistent.
A simple example: say you need to insert both order header data into an orders table and order lines into a lines table. You insert the order header, read back its identity, and start inserting order lines - but this second insert fails for whatever reason. The only reliable way to recover from this situation is to roll back the first insert - either explicitly (when your connection to the DB is still alive) or implicitly (when the link has gone down).
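A minimal T-SQL sketch of that pattern (SQL Server 2012+ syntax; table and column names are illustrative, and orders.id is assumed to be an IDENTITY column):

BEGIN TRY
    BEGIN TRANSACTION;

    -- insert the order header and capture its generated identity
    INSERT INTO orders (customer_id, order_date)
    VALUES (42, GETDATE());

    DECLARE @order_id INT = SCOPE_IDENTITY();

    -- insert the order lines that reference the header
    INSERT INTO lines (order_id, product_id, qty)
    VALUES (@order_id, 1, 3);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- a failure after the header insert undoes the whole order
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;
END CATCH;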
Related
I am extracting data from a business system supplied by a third party to use in reporting. I am using a single SELECT statement issued from an SSIS data flow task source component that joins across multiple tables in the source system to create the dataset I want. We are using the default read-committed isolation level.
To my surprise I regularly find this extraction query is deadlocking and being selected as the victim. I didn't think a SELECT in a read-committed transaction could do this, but according to this SO answer it is possible: Can a readcommitted isolation level ever result in a deadlock (Sql Server)?
Through the use of trace flags 1204 and 1222 I've identified the conflicting statement, and the object and index in question. Essentially, the contention is over a data page in the primary key of one of the tables. I need to extract from this table using a join on its key (so I'm taking out an S lock), while the conflicting statement is performing an INSERT and is requesting an IX lock on the index data page.
(Side note: the above SO talks about this issue occurring with non-clustered indexes, but this appears to be occurring in the clustered PK. At least, that is what I believe based on my interpretation of the deadlock information in the event log and the "associatedObjectId" property.)
Here are my constraints:
The conflicting statement is in an encrypted stored procedure supplied by a third party as part of off-the-shelf software. There is no possibility of getting the plaintext code or having it changed.
I don't want to use dirty-reads as I need my extracted data to maintain its integrity.
It's not clear to me how or if restructuring my extract query could prevent this. The lock is on the PK of the table I'm most interested in, and I can't see any alternatives to using the PK.
I don't mind my extract query being the victim as I prefer this over interrupting the operational use of the source system. However, this does cause the SSIS execution to fail, so if it must be this way I'd like a cleaner, more graceful way to handle this situation.
Can anyone suggest ways to, preferably, prevent the deadlock, or failing that, handle the error more gracefully?
My assumption here is that you are attempting to INSERT into the same table that you are SELECTing from. If no, then a screenshot of the data flow tab would be helpful in determining the problem. If yes, then you're in luck - I have had this problem before.
Add a sort to the data flow, as this is a fully blocking transformation (see the reference below regarding blocking transformations). This means the SELECT is required to finish loading all data into the pipeline buffer before any rows are allowed to pass down to the destination; otherwise, SSIS is attempting to INSERT data while there is still a lock on the table/index. You might be able to get creative with your indexing strategy here (I have not tried this), but a fully blocking transformation will do the trick and eliminates the need for any additional indexes on the table (and the overhead that entails).
Note: never use NOLOCK query hints when selecting data from a table as an attempt to get around this. I have never tried it, nor do I intend to. You (the royal you) run the risk of ingesting uncommitted data into your ETL.
Reference:
https://jorgklein.com/2008/02/28/ssis-non-blocking-semi-blocking-and-fully-blocking-components/
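If the deadlock can't be prevented outright, another option is to make the failure graceful: wrap the extraction query in retry logic so that being chosen as the victim just means trying again. A minimal T-SQL sketch (all object names are illustrative; THROW needs SQL Server 2012+, use RAISERROR on older versions):

DECLARE @retry INT = 3;
WHILE @retry > 0
BEGIN
    BEGIN TRY
        -- the extraction join from the SSIS source component
        SELECT o.id, o.total, l.product_id
        FROM orders AS o
        JOIN lines AS l ON l.order_id = o.id;
        SET @retry = 0;  -- success, stop looping
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() = 1205 AND @retry > 1
        BEGIN
            SET @retry = @retry - 1;   -- we were the deadlock victim
            WAITFOR DELAY '00:00:05';  -- back off, then try again
        END
        ELSE
        BEGIN
            THROW;  -- not a deadlock, or retries exhausted
        END
    END CATCH
END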
I'm writing an application which must log information pretty frequently, say, twice a second. I wish to save the information to an SQLite database; however, I don't mind committing changes to disk only once every ten minutes.
Executing my queries against a file database takes too long, and makes the computer lag.
One possible solution is to use an in-memory database (it will fit, no worries) and synchronize it to disk from time to time.
Is it possible? Is there a better way to achieve this (can you tell SQLite to commit to disk only after X queries)?
Can I solve this with Qt's SQL wrapper?
Let's assume you have an on-disk database called 'disk_logs' with a table called 'events'. You could attach an in-memory database to your existing database:
ATTACH DATABASE ':memory:' AS mem_logs;
Create a table in that database (which would be entirely in-memory) to receive the incoming log events:
CREATE TABLE mem_logs.events(a, b, c);
Then transfer the data from the in-memory table to the on-disk table during application downtime:
INSERT INTO disk_logs.events SELECT * FROM mem_logs.events;
And then delete the contents of the in-memory table so the next batch starts empty:
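DELETE FROM mem_logs.events;
Repeat.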
This is pretty complicated though... If your records span multiple tables and are linked together with foreign keys, it might be a pain to keep them in sync as you copy from the in-memory tables to the on-disk tables.
Before attempting something (uncomfortably over-engineered) like this, I'd also suggest trying to make SQLite go as fast as possible. SQLite should easily be able to handle more than 50K record inserts per second. A few log entries twice a second should not cause significant slowdown.
If you're executing each insert within its own transaction, that could be a significant contributor to the slow-downs you're seeing. Perhaps you could (see the sketch after this list):
Count the number of records inserted so far
Begin a transaction
Insert your record
Increment count
Commit/end transaction when N records have been inserted
Repeat
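In plain SQLite terms, each batch looks like this (a sketch; the table and values are illustrative):

BEGIN TRANSACTION;
INSERT INTO events (a, b, c) VALUES (1, 2, 3);
-- ...inserts keep accumulating inside the open transaction...
INSERT INTO events (a, b, c) VALUES (4, 5, 6);
COMMIT; -- issue this once N records have been inserted, then BEGIN again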
The downside is that if the system crashes during that period you risk losing the uncommitted records (but if you were willing to use an in-memory database, then it sounds like you're OK with that risk).
A brief search of the SQLite documentation turned up nothing useful (not that I expected it to).
Why not use a background thread that wakes up every 10 minutes, copies all of the log rows from the in-memory database to the on-disk database, and deletes them from the in-memory database? When your program is ready to end, wake the background thread one last time to save the last logs, then close all of the connections.
We have a few customers with large data sets and during our upgrade procedure we need to modify the schema of various tables (adding some columns, renaming others, occasionally changing data types, but that's rare).
Previously we've been going via a temporary table with the new schema, then dropping the original and renaming the temp table, but I'm hoping to speed that up dramatically by using ALTER TABLE ... instead.
My question is what data integrity and error handling issues do I need to consider? Should I enclose all changes to a table in a transaction (and if so, how?) or will the DBMS guarantee atomicity and integrity over an ALTER operation?
We already heavily recommend customers backup their data before starting the upgrade so that should always be a fall back option.
We need to target SQL Server 2005 and Oracle, but obviously I can add conditional code if they require different approaches.
Comments for Oracle only:
Table alterations are DDL, so the concept of a transaction doesn't apply - every DDL statement locks the table for the duration of the operation and either succeeds or fails.
Adding (nullable!) columns or renaming existing columns is a relatively lightweight process and shouldn't present any problems if the table lock can be acquired.
If you're adding/modifying constraints (either NOT NULL or other more complex check constraints) Oracle will check existing data to validate the constraints unless you add the ENABLE NOVALIDATE clause to the constraint DDL. The validation of existing data can be a lengthy process for large tables.
If you're scripting the upgrade to be run as a SQL*Plus script, save yourself a lot of headaches by using the "whenever sqlerror exit sql.sqlcode" directive to abort the script on the first failure to make the review of partially implemented upgrades easier.
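For example, at the top of the upgrade script (the ALTER statements are illustrative):

WHENEVER SQLERROR EXIT SQL.SQLCODE

ALTER TABLE customers ADD (loyalty_tier VARCHAR2(20));
ALTER TABLE customers RENAME COLUMN phone TO phone_number;
-- the script now aborts at the first failing statement instead of ploughing on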
If the upgrade must be performed on a live system where you can neither control transactions nor afford to miss them, consider using the Oracle DBMS_REDEFINITION package, which automatically creates a temporary configuration of temp tables and triggers to capture in-flight transactions while redefining the table in the "background". Warning - lots of work and a steep learning curve for this option.
If you're using SQL Server, then DDL statements are transactional, so wrap them in a transaction (I don't think this applies to Oracle, though).
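A minimal sketch of that on SQL Server (TRY/CATCH works on 2005; table and column names are illustrative):

BEGIN TRY
    BEGIN TRANSACTION;
    ALTER TABLE customers ADD loyalty_tier VARCHAR(20) NULL;
    EXEC sp_rename 'customers.phone', 'phone_number', 'COLUMN';
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;  -- both schema changes are undone together
    DECLARE @msg NVARCHAR(2048);
    SELECT @msg = ERROR_MESSAGE();
    RAISERROR(@msg, 16, 1);    -- re-raise so the caller sees the failure
END CATCH;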
We split upgrades into individual patches that go with a particular feature. The patches that have been applied are recorded in a database_patch_history table, so it's easy to see which patches were applied and how to roll them back.
As you say, taking a backup before you start is important.
I have had to do changes like this in the past and have always been very paranoid about data loss. To help mitigate that risk I have always done tons of testing against "sandbox" databases that mirrored the target databases in schema and data as closely as possible. Test out the process as much as possible before rolling it out, just like you would any other area of the application.
If you dramatically change any column data types, for instance changing a VARCHAR to an INT, the DBMS may fail the conversion and you will probably lose that data. Luckily, nowadays DBMSs are intelligent enough to do some data type conversions without losing the data, but you don't want to run the risk of damaging any of it when making the alterations.
You shouldn't lose any data by renaming columns, and you definitely won't by adding new columns; it's when you move the data about that you have to be concerned.
Firstly, back up the entire table, both the schema and data, so at a moment's notice you can roll back to the previous schema. Secondly, look at the alterations you are trying to make and see how drastic they are - try to figure out exactly what needs to change. If you're making data type conversions, push that data to an intermediary table first with three columns: the foreign key (the id, or whatever lets you locate the row), the old data, and the new column. Then either push the old data to the new column directly, or convert it at the application level.
When it's all in the correct types and everything has been successful, run the ALTER statements and repopulate the database. It's simple enough to do; it just needs a logical thought process.
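A rough sketch of the intermediary-table approach for a VARCHAR-to-INT conversion (SQL Server flavour; all names are illustrative):

-- stage the conversion alongside the original value
CREATE TABLE conversion_staging (
    row_id    INT PRIMARY KEY,   -- key back to the source row
    old_value VARCHAR(50),
    new_value INT NULL
);

INSERT INTO conversion_staging (row_id, old_value)
SELECT id, amount_text FROM source_table;

-- convert in bulk; if any row fails the CAST the UPDATE aborts,
-- and the old values are still safely staged
UPDATE conversion_staging SET new_value = CAST(old_value AS INT);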
We have a SQL Server database table that consists of a user id, some numeric value (e.g. a balance), and a version column.
We have multiple threads updating this table's value column in parallel, each in its own transaction and session (we're using a session-per-thread model). Since we want all logical transactions to be applied, each thread does the following:
load the current row (mapped to a type).
make the change to the value, based on old value. (e.g. add 50).
session.update(obj)
session.flush() (since we're optimistic, we want to make sure we had the correct version value prior to the update)
if step 4 (flush) threw a StaleStateException, refresh the object (with LockMode.Read) and go to step 1
we only do this a certain number of times per logical transaction, if we can't commit it after X attempts, we reject the logical transaction.
each such thread commits periodically, e.g. after 100 successful logical transactions, to keep commit-induced I/O at manageable levels. Meaning: we have a single database transaction (per thread) with multiple flushes, at least one per logical change.
What's the problem here, you ask? Well, on commit we see the changes of failed logical transactions persisted.
Specifically, if the value was 50 when we went through step 1 (for the first time), and we tried to update it to 100 but failed (because, e.g., another thread had changed it to 70), then the value 50 gets committed for this row. Obviously this is incorrect.
What are we missing here?
Well, I don't have a ton of experience here, but one thing I remember reading in the documentation is that if an exception occurs, you are supposed to immediately roll back the transaction and dispose of the session. Perhaps your issue is related to the session being in an inconsistent state?
Also, calling update in your code here is not necessary. Since you loaded the object in that session, it is already being tracked by NHibernate.
If you want to make your changes anyway, why do you bother with row versioning? It sounds like you should get the same result if you simply always update the data and let the last transaction win.
As to why the update becomes permanent, it depends on what the SQL statements for the version check/update look like, and on your transaction control, which you left out of the code example. If you turn on NHibernate's SQL logging, it will probably become obvious how this is happening.
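For reference, the versioned UPDATE that NHibernate emits on flush boils down to something like this (a sketch; table and column names are illustrative):

UPDATE balances
SET    value = @newValue,
       version = @oldVersion + 1
WHERE  id = @id
AND    version = @oldVersion;
-- zero rows affected means another session bumped the version first;
-- NHibernate surfaces that as a StaleStateException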
I'm not an NHibernate guru, but the answer seems simple.
When NHibernate loads an object, it expects it not to change in the DB for as long as it's in the NHibernate session cache.
As you mentioned, you've got a multi-threaded app.
This is what happens:
1st thread loads an entity
2nd thread loads the same entity
1st thread changes the entity
2nd thread changes the entity and, on flush, finds out that the entity it loaded has been changed by something else; afraid of clobbering the 1st thread's changes, NHibernate throws an exception to make the programmer aware of it.
You are missing a locking mechanism. I can't tell you much about how to apply one properly and elegantly; maybe a transaction would help.
We had similar problems when we used NHibernate and raw ADO.NET concurrently (luckily just for querying, at least in production code). All we had to do was force updates to the DB on insert/update so that we could actually query some specific entities through full-text search.
We hit StaleStateException in integration tests when we used raw ADO.NET to reset the DB. The NHibernate session stayed alive through a bunch of tests, but every test tried to clean up the DB without NHibernate being aware of it.
Here is the documentation on exception handling in the session:
http://nhibernate.info/doc/nhibernate-reference/best-practices.html
I've got this process in an ASP.NET application:
Start a connection
Start a transaction
Insert a lot of values into a table "LoadData" with the SqlBulkCopy class, with a column that contains a specific LoadId.
Call a stored procedure that:
reads the table "LoadData" for the specific LoadId;
for each line, does a lot of calculations that involve reading dozens of tables, and writes the results into a temporary (#temp) table (a process that lasts several minutes);
deletes the lines in "LoadData" for the specific LoadId.
Once everything is done, write the result in the result table.
Commit transaction or rollback if something fails.
My problem is that if I have 2 users who start the process, the second one has to wait until the first has finished (because the insert seems to put an exclusive lock on the table), and my application sometimes hits a timeout (and the users are not happy to wait :) ).
I'm looking for a way to let the users run everything in parallel, as there is no interaction between their runs except for the last step: writing the result. I think what is blocking me is the inserts/deletes in the "LoadData" table.
I checked the other transaction isolation levels but it seems that nothing could help me.
What would be perfect would be to be able to remove the exclusive lock on the "LoadData" table once the insert is finished, but without ending the transaction (is it possible to force SQL Server to lock only rows and not the table?).
Any suggestion?
Look up READ_COMMITTED_SNAPSHOT (read-committed snapshot isolation) in Books Online.
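It's a database-level option rather than a per-session SET; a sketch (the database name is illustrative, and switching it on requires a moment with no other active connections in the database):

ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;
-- readers now see the last committed version of a row
-- instead of blocking on writers' locks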
Transactions should cover small and fast-executing pieces of SQL/code. They have a tendency to be implemented differently on different platforms. They will lock tables and then escalate the lock as the modifications grow, thus locking the other users out of querying or updating the same row/page/table.
Why not forget the transaction and handle processing errors in another way? Is your data integrity truly being secured by the transaction, or can you do without it?
If you're sure that there is no issue with concurrent operations except for the last part, why not start the transaction just before those last statements (whichever they are that DO require isolation) and commit immediately after they succeed? Then all the upfront read operations won't block each other.