Transactional theory - database

(I have a simple CRUD API in a DAO pattern implementation.)
All operations (save, load, update, delete) have a transaction id that must be supplied.
So, e.g., it's possible to do:
...
id = begintransaction();
dao.save(o, id);
dao.update(o2, id);
rollback(id);
All examples that exclude load invocations seem intuitive. But as soon as you start to load objects from the database, things "feel" a little bit different. Are load operations, by definition, tied to a transaction? Or should my load operations be counted as a single unit of work?

Depends on the transaction isolation level (http://en.wikipedia.org/wiki/Isolation_(database_systems)) you're using, but in general they should be part of the transaction. What if somebody else is just in the middle of updating the data you're trying to read? If the read operation is not transactional, you would get old data, and maybe you're interested in the latest data.

If the database is set to a decent isolation level, uncommitted writes can only be read from the transaction that created them. For example, in Oracle, if a procedure inserts or updates a row and then (without committing) calls another procedure, which uses "pragma autonomous_transaction" to run in a separate transaction, that other procedure does not see the new row. (An excellent way to shoot yourself in the foot, btw.)
For that reason, you should always consider your load operations as tied to the transaction.
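To make the visibility rule concrete, here is a minimal sketch. It assumes SQLite through Python's standard sqlite3 module rather than Oracle, and the names visibility_demo, count_rows and the item table are made up for illustration. One connection inserts a row without committing while a second connection reads the same table.

```python
import os
import sqlite3
import tempfile


def count_rows(conn):
    # fetchall() exhausts the statement, so no read lock lingers afterwards.
    return conn.execute("SELECT COUNT(*) FROM item").fetchall()[0][0]


def visibility_demo():
    """Return (writer's own view, reader before commit, reader after commit)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        writer = sqlite3.connect(path)
        reader = sqlite3.connect(path)
        try:
            writer.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
            writer.commit()
            # sqlite3 opens a transaction implicitly before the INSERT;
            # nothing is committed yet.
            writer.execute("INSERT INTO item (name) VALUES ('o')")
            own = count_rows(writer)      # the writing transaction sees its row
            before = count_rows(reader)   # the other connection does not
            writer.commit()
            after = count_rows(reader)    # visible to everyone once committed
            return own, before, after
        finally:
            reader.close()
            writer.close()
```

If the load in your DAO ran on the reader connection instead of inside the transaction that did the save, it would miss the row you just saved, which is why the load should carry the same transaction id.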

Related

How to prevent update anomalies with multiple clients running non-atomic computations concurrently in PostgreSQL?

I am running three PostgreSQL instances using replication (1 master, 2 slaves) which are accessed by two separate servers:
The first (unexposed) server basically iterates over every row in a particular table and continuously updates specific columns (resources) every tick (based on production rate of those resources) for each user.
The second server is a public API that exposes various functions such as spending a certain amount of those resources.
In order to access and manipulate the data I am using an ORM library which allows me to write code as follows:
const resources = await repository.findById(1337);
// some complex computation
resources.iron = computeNewIron(resources.iron);
await repository.save(resources);
Of course it might occur that the API wants to deduct a specific amount of resources right when the server handling the ticks is trying to update the amount of resources which can cause either of the servers to assume a certain amount of resources that is incorrect, basically your typical UPDATE anomaly.
My problem is that I am not just writing a "simple" atomic query such as UPDATE table SET iron = iron + 42 WHERE id = :id. The ORM library is internally using a direct assignment that does not self-reference the respective columns, which yields something akin to UPDATE table SET iron = 123 WHERE id = :id where the amount has been computed previously.
I can just assume that it's possible to prevent the mentioned anomaly if I use manually written queries that are incrementing/decrementing the values atomically with self-references. I'd like to know which other options can alleviate the issue. Should I wrap my SELECT/computation/UPDATE in a transaction? Does this suffice?
Your question is a bit unclear, but if your transaction spans several statements, yet needs to have a consistent state of the database, there are basically two options:
Use pessimistic locking: when you read values from the database, do it with SELECT ... FOR UPDATE. Then the rows are locked for the duration of your transaction, and no concurrent transaction can modify them.
Use optimistic locking: start your transaction in REPEATABLE READ isolation level. Then you see a consistent snapshot of the database for the whole duration of your transaction. If somebody else modifies your data after you read them, your UPDATE will cause a serialization error and you'll have to retry the transaction.
Optimistic locking is better if conflicts are rare, while pessimistic locking is preferable if conflicts are likely.
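The "retry the transaction" part of the optimistic approach can be sketched as a small wrapper. This is only an illustration in Python: SerializationFailure is a hypothetical stand-in for whatever exception your driver or ORM raises for PostgreSQL's serialization_failure error (SQLSTATE 40001), and run_with_retry and its parameters are made-up names.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for the driver's SQLSTATE 40001 error."""


def run_with_retry(transaction_body, max_attempts=5, base_delay=0.01):
    """Call transaction_body() until it succeeds or attempts run out.

    transaction_body is expected to do the whole unit of work (begin at
    REPEATABLE READ, SELECT, compute, UPDATE, COMMIT) and to roll back
    before raising, so every attempt starts from a fresh snapshot.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction_body()
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # Short randomized pause so competing clients do not collide again at once.
            time.sleep(base_delay * attempt * random.random())
```

The important detail is that the whole read/compute/write sequence is repeated, not just the failed UPDATE, because the values read in the failed attempt are stale.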

Using SQLAlchemy sessions and transactions

While learning SQLAlchemy I came across two ways of dealing with SQLAlchemy's sessions.
One was creating the session once globally while initializing my database like:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
and import this DBSession instance in all my requests (all my insert/update) operations that follow.
When I do this, my DB operations have the following structure:
with transaction.manager:
    for each_item in huge_file_of_million_rows:
        DBSession.add(each_item)
        # More create, read, update and delete operations
I do not commit or flush or rollback anywhere assuming my Zope transaction manager takes care of it for me
(it commits at the end of the transaction or rolls back if it fails)
The second way, and the one most frequently mentioned on the web, was:
create a DBSession once like
DBSession=sessionmaker(bind=engine)
and then create a session instance of this per transaction:
session = DBSession()
for row in huge_file_of_million_rows:
    for item in row:
        try:
            session.add(item)
            # More create, read, update and delete operations
            session.flush()
            session.commit()
        except Exception:
            session.rollback()
session.close()
I do not understand which is BETTER (in terms of memory usage, performance, and health) and how.
In the first method, I accumulate all the objects in the session and then the commit happens at the end. For a bulk insert operation, does adding objects to the session store them in memory (RAM) or elsewhere? Where do they get stored, and how much memory is consumed?
Both ways tend to be very slow when I have about a million inserts and updates. Trying SQLAlchemy Core also takes the same time to execute. 100K rows of select, insert and update take about 25-30 minutes. Is there any way to reduce this?
Please point me in the right direction. Thanks in advance.
Here you have a very generic answer, and with the warning that I don't know much about zope. Just some simple database heuristics. Hope it helps.
How to use SQLAlchemy sessions:
First, take a look at their own explanation here
As they say:
The calls to instantiate Session would then be placed at the point in the application where database conversations begin.
I'm not sure I understand what you mean by method 1; just in case, a warning: you should not have just one session for the whole application. You instantiate Session when a database conversation begins, but you surely have several points in the application where different conversations begin. (I can't tell from your text whether you have different users.)
One commit at the end of a huge number of operations is not a good idea
Indeed it will consume memory, probably in the Session object of your python program, and surely in the database transaction. How much space? That's difficult to say with the information you provide; it will depend on the queries, on the database...
You could easily estimate it with a profiler. Take into account that if you run out of resources everything will go slower (or halt).
One commit per record is also not a good idea when processing a bulk file
It means you are asking the database to persist changes for every single row. Certainly too much. Try an intermediate number: commit every few hundred rows. But then it gets more complicated; one commit at the end of the file guarantees that the file is either fully processed or not at all, while intermediate commits force you to take into account, when something fails, that your file is only half processed, so you have to reposition before resuming.
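A rough sketch of the "commit every n rows" idea, using Python's standard sqlite3 module only so that it runs anywhere. The item table, the load_in_batches name and the batch size of 500 are assumptions for illustration, not something taken from your setup.

```python
import sqlite3


def load_in_batches(conn, rows, batch_size=500):
    """Insert rows, committing every batch_size rows; return the number of commits."""
    commits = 0
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO item (id, name) VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:
        # Commit the final, partially filled batch.
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
rows = ((i, f"row {i}") for i in range(1, 1201))
print(load_in_batches(conn, rows))  # 3 commits: 500 + 500 + 200 rows
```

If a batch fails, everything up to the last commit is already persisted, so a restart has to find out where to resume (for example from the highest id already stored) instead of reprocessing the file from the top.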
As for the times you mention, it is very difficult to judge with the information you provide; it also depends on your database and machine. Anyway, the order of magnitude of your numbers, a select+insert+update every 15 ms or so, probably plus a commit, sounds pretty high but more or less within the expected range (again, it depends on queries + database + machine)... If you frequently have to insert that many records, you could consider other database solutions; that will depend on your scenario, and probably on the dialect, and may not be provided by an ORM like SQLAlchemy.

Why use a READ UNCOMMITTED isolation level?

In plain English, what are the disadvantages and advantages of using
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
in a query for .NET applications and reporting services applications?
This isolation level allows dirty reads. One transaction may see uncommitted changes made by some other transaction.
To maintain the highest level of isolation, a DBMS usually acquires locks on data, which may result in a loss of concurrency and a high locking overhead. This isolation level relaxes this property.
You may want to check out the Wikipedia article on READ UNCOMMITTED for a few examples and further reading.
You may also be interested in checking out Jeff Atwood's blog article on how he and his team tackled a deadlock issue in the early days of Stack Overflow. According to Jeff:
But is nolock dangerous? Could you end up reading invalid data with read uncommitted on? Yes, in theory. You'll find no shortage of database architecture astronauts who start dropping ACID science on you and all but pull the building fire alarm when you tell them you want to try nolock. It's true: the theory is scary. But here's what I think: "In theory there is no difference between theory and practice. In practice there is."
I would never recommend using nolock as a general "good for what ails you" snake oil fix for any database deadlocking problems you may have. You should try to diagnose the source of the problem first.
But in practice adding nolock to queries that you absolutely know are simple, straightforward read-only affairs never seems to lead to problems... As long as you know what you're doing.
One alternative to the READ UNCOMMITTED level that you may want to consider is the READ COMMITTED SNAPSHOT. Quoting Jeff again:
Snapshots rely on an entirely new data change tracking method ... more than just a slight logical change, it requires the server to handle the data physically differently. Once this new data change tracking method is enabled, it creates a copy, or snapshot of every data change. By reading these snapshots rather than live data at times of contention, Shared Locks are no longer needed on reads, and overall database performance may increase.
My favorite use case for read uncommitted is to debug something that is happening inside a transaction.
Start your software under a debugger. While you are stepping through the lines of code, it opens a transaction and modifies your database. While the code is stopped, you can open a query analyzer, set the read uncommitted isolation level and run queries to see what is going on.
You also can use it to see if long running procedures are stuck or correctly updating your database using a query with count(*).
It is great if your company loves to make overly complex stored procedures.
This can be useful to see the progress of long insert queries, to make rough estimates (like COUNT(*) or a rough SUM), etc.
In other words, the results the dirty read queries return are fine as long as you treat them as estimates and don't make any critical decisions based upon them.
The advantage is that it can be faster in some situations. The disadvantage is the result can be wrong (data which hasn't been committed yet could be returned) and there is no guarantee that the result is repeatable.
If you care about accuracy, don't use this.
More information is on MSDN:
Implements dirty read, or isolation level 0 locking, which means that no shared locks are issued and no exclusive locks are honored. When this option is set, it is possible to read uncommitted or dirty data; values in the data can be changed and rows can appear or disappear in the data set before the end of the transaction. This option has the same effect as setting NOLOCK on all tables in all SELECT statements in a transaction. This is the least restrictive of the four isolation levels.
When is it ok to use READ UNCOMMITTED?
Rule of thumb
Good: Big aggregate reports showing constantly changing totals.
Risky: Nearly everything else.
The good news is that the majority of read-only reports fall in that Good category.
More detail...
Ok to use it:
Nearly all user-facing aggregate reports for current, non-static data e.g. Year to date sales.
It risks a margin of error (maybe < 0.1%) which is much lower than other uncertainty factors such as inputting error or just the randomness of when exactly data gets recorded minute to minute.
That probably covers the majority of what a Business Intelligence department would do in, say, SSRS. The exception, of course, is anything with $ signs in front of it. Many people account for money with much more zeal than is applied to the related core metrics required to service the customer and generate that money. (I blame accountants.)
When risky
Any report that goes down to the detail level. If that detail is required it usually implies that every row will be relevant to a decision. In fact, if you can't pull a small subset without blocking it might be for the good reason that it's being currently edited.
Historical data. It rarely makes a practical difference, but whereas users understand that constantly changing data can't be perfect, they don't feel the same about static data. Dirty reads won't hurt here, but double reads occasionally can. Seeing as you shouldn't have blocks on static data anyway, why risk it?
Nearly anything that feeds an application which also has write capabilities.
When even the OK scenario is not OK.
Are any applications or update processes making use of big single transactions? Ones which remove then re-insert a lot of records you're reporting on? In that case you really can't use NOLOCK on those tables for anything.
Use READ_UNCOMMITTED in situations where the source is highly unlikely to change.
When reading historical data, e.g. some deployment logs from two days ago.
When reading metadata again, e.g. in a metadata-based application.
Don't use READ_UNCOMMITTED when you know the source may change during the fetch operation.
Regarding reporting, we use it on all of our reporting queries to prevent a query from bogging down databases. We can do that because we're pulling historical data, not up-to-the-microsecond data.
This will give you dirty reads, and show you transactions that are not committed yet. That is the most obvious answer. I don't think it's a good idea to use this just to speed up your reads. There are other ways of doing that if you use a good database design.
It's also interesting to note what's not happening. READ UNCOMMITTED does not only ignore other table locks. It also doesn't take any locks of its own.
Consider that you are generating a large report, or migrating data out of your database using a large and possibly complex SELECT statement. This will cause a shared lock that may be escalated to a shared table lock for the duration of your transaction. Other transactions may read from the table, but updates are impossible. This may be a bad idea if it's a production database, since production may stop completely.
If you are using READ UNCOMMITTED you will not set a shared lock on the table. You may or may not get the results of some new transactions, depending on where in the table the data were inserted and how far your SELECT has read. You may also get the same data twice if, for example, a page split occurs (the data will be copied to another location in the data file).
So, if it's very important for you that data can be inserted while doing your SELECT, READ UNCOMMITTED may make sense. You have to consider that your report may contain some errors, but if it's based on millions of rows and only a few of them are updated while selecting, the result may be "good enough". Your transaction may also fail altogether since the uniqueness of a row may not be guaranteed.
A better way altogether may be to use SNAPSHOT ISOLATION LEVEL, but your applications may need some adjustments to use it. One example is an application that takes an exclusive lock on a row to prevent others from reading it while it goes into edit mode in the UI. SNAPSHOT ISOLATION LEVEL also comes with a considerable performance penalty (especially on disk). But you may overcome that by throwing hardware at the problem. :)
You may also consider restoring a backup of the database to use for reporting or loading data into a data warehouse.
It can be used for a simple table, for example an insert-only audit table, where there are no updates to existing rows and no FK to other tables. The insert is a simple insert, which has little or no chance of rollback.
I always use READ UNCOMMITTED now. It's fast with the least issues. When using other isolations you will almost always come across some Blocking issues.
As long as you use auto-increment fields and pay a little more attention to inserts, you're fine, and you can say goodbye to blocking issues.
You can make errors with READ UNCOMMITTED but, to be honest, it is very easy to make sure your inserts are foolproof. Inserts/updates which use the results from a select are the only thing you need to watch out for. (Use READ COMMITTED here, or ensure that dirty reads aren't going to cause a problem.)
So go for dirty reads (especially for big reports); your software will run smoother...

Do triggers decrease performance? Inserted and deleted tables?

Suppose I have stored procedures which perform insert/update/delete operations on a table.
Depending upon some criteria, I want to perform some operations.
Should I create a trigger or do the operation in the stored procedure itself?
Does using triggers decrease performance?
Do the two tables, Inserted and Deleted, exist persistently or are they created dynamically?
If they are created dynamically, does that cause a performance issue?
If they are persistent tables, then where are they?
Also, if they exist, can I access the Inserted and Deleted tables in stored procedures?
Will it be less performant than doing the same thing in a stored proc? Probably not, but as with all performance questions, the only way to really know is to test both approaches with a realistic data set (if you have a 2,000,000 record table, don't test with a table with 100 records!)
That said, the choice between a trigger and another method depends entirely on the need for the action in question to happen no matter how the data is updated, deleted, or inserted. If this is a business rule that must always happen no matter what, a trigger is the best place for it or you will eventually have data integrity problems. Data in databases is frequently changed from sources other than the GUI.
When writing a trigger though there are several things you should be aware of. First, the trigger fires once for each batch, so whether you inserted one record or 100,000 records the trigger only fires once. You cannot assume ever that only one record will be affected. Nor can you assume that it will always only be a small record set. This is why it is critical to write all triggers as if you are going to insert, update or delete a million rows. That means set-based logic and no cursors or while loops if at all possible. Do not take a stored proc written to handle one record and call it in a cursor in a trigger.
Also, do not send emails from a trigger; you do not want to stop all inserts, updates, or deletes if the email server is down.
Yes, a table with a trigger will not perform as well as it would without it. Logic dictates that doing something is more expensive than doing nothing.
I think your question would be more meaningful if you asked in terms of whether it is more performant than some other approach that you haven't specified.
Ultimately, I'd select the tool that is most appropriate for the job and only worry about performance if there is a problem, not before you have even implemented a solution.
Inserted and deleted tables are available within the trigger, so calling them from stored procedures is a no-go.
It decreases performance on the query by definition: the query is then doing something it otherwise wasn't going to do.
The other way to look at it is this: if you were going to manually be doing whatever the trigger is doing anyway then they increase performance by saving a round trip.
Take it a step further: that advantage disappears if you use a stored procedure and you're running within one server roundtrip anyway.
So it depends on how you look at it.
Performance on what? The trigger will perform an update on the DB after the event so the user of your system won't even know it's going on. It happens in the background.
Your question is phrased in a manner quite difficult to understand.
If your operation is important and must never be missed, then you have two choices:
Execute your operation immediately after Update/Delete with durability
Delay the operation by making it loosely coupled with durability.
We faced the same issue. Our production MSSQL 2016 DB is > 1 TB with > 500 tables, and we need to send changes (insert, update, delete) of a few columns from 20 important tables to a 3rd party. More than 200 business processes update those few columns in the 20 important tables, and modifying them would be a tedious task because it's a legacy application. Our existing processes must keep working without any dependency on data sharing. The order of data sharing is important: FIFO must be maintained.
E.g. a user's mobile no. is 123-456-789; it changes to 123-456-123 and then changes again to 123-456-456.
The order of sending is 123-456-789 --> 123-456-123 --> 123-456-456. A subsequent request can only be sent if the response to the previous request was successful.
We created 20 new tables with only the limited columns that we want. We compare the main tables and the new tables (MainTable1 JOIN MainTale_LessCol1) using a checksum of all columns and the TimeStamp column to identify changes.
Changes are logged in the APIrequest tables and updated back in MainTale_LessCol1. This logic runs in a scheduled job every 15 min.
A separate process picks from APIrequest and sends the data to the 3rd party.
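The compare-and-queue idea can be sketched in a few lines of Python. This is a simplified illustration, not our production code: tables are modelled as dictionaries keyed by primary key, only the checksum comparison is shown (the TimeStamp check and deletes are left out), and all names (row_checksum, detect_changes, drain, send) are hypothetical.

```python
import hashlib
from collections import deque


def row_checksum(row):
    """Checksum over the tracked column values, in a stable column order."""
    joined = "|".join(f"{col}={row[col]}" for col in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def detect_changes(main_rows, shadow_rows, tracked_columns, queue):
    """Compare main and shadow rows by checksum; enqueue and record differences."""
    for key, main_row in main_rows.items():
        tracked = {col: main_row[col] for col in tracked_columns}
        old = shadow_rows.get(key)
        if old is None or row_checksum(old) != row_checksum(tracked):
            queue.append((key, tracked))   # logged in arrival order (FIFO)
            shadow_rows[key] = tracked     # shadow copy now matches the main table


def drain(queue, send):
    """Send queued changes oldest first; stop at the first failed request."""
    sent = 0
    while queue and send(queue[0]):
        queue.popleft()
        sent += 1
    return sent
```

Here detect_changes plays the role of the scheduled job and drain the role of the separate sender. Because drain only removes a change after send reports success, a failed request blocks everything queued behind it, which is what keeps the FIFO order intact.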
We Explored
Triggers
CDC (Change Data Capture)
200+ Process Changes
Since our deadlines were strict, cumulative changes on those 20 tables were > 1000/sec, and our system was already at peak capacity, we went with our current design, and it works.
You can try CDC and share your experience.

How to Decide to use Database Transactions

How do you guys decide that you should be wrapping the sql in a transaction?
Please throw some light on this.
Cheers !!
A transaction should be used when you need a set of changes to be processed completely to consider the operation complete and valid. In other words, if only a portion executes successfully, will that result in incomplete or invalid data being stored in your database?
For example, if you have an insert followed by an update, what happens if the insert succeeds and the update fails? If that would result in incomplete data (in this case, an orphaned record), you should wrap the two statements in a transaction to get them to complete as a "set".
If you are executing two or more statements that you expect to be functionally atomic, you should wrap them in a transaction.
If you have more than a single data-modifying statement to execute to complete a task, all of them should be within a transaction.
This way, if the first one is successful, but any of the following ones has an error, you can rollback (undo) everything as if nothing was ever done.
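A minimal sketch of that all-or-nothing behaviour, assuming SQLite through Python's standard sqlite3 module. The customer and orders tables and the create_order function are made up for the example.

```python
import sqlite3


def create_order(conn, customer_id, amount):
    """Insert an order and debit the customer as one all-or-nothing unit."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
            (customer_id, amount),
        )
        conn.execute(
            "UPDATE customer SET balance = balance - ? WHERE id = ?",
            (amount, customer_id),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL, amount INTEGER NOT NULL)"
)
conn.execute("INSERT INTO customer (id, balance) VALUES (1, 100)")
conn.commit()

create_order(conn, 1, 60)        # both statements succeed and are committed
try:
    create_order(conn, 1, 60)    # the UPDATE violates the CHECK, so the INSERT is undone too
except sqlite3.IntegrityError:
    pass
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
```

Without the transaction, the second call would leave an order row behind with no matching debit, which is exactly the kind of incomplete data described above.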
Whenever you wouldn't like it if part of the operation can complete and part of it doesn't.
Anytime you want to lock up your database and potentially crash your production application, anytime you want to litter your application with hidden scalability nightmares go ahead and create a transaction. Make it big, slow, and put a loop inside.
Seriously, none of the above answers acknowledge the trade-off and potential problems that come with heavy use of transactions. Be careful, and consider the risk/reward each time.
eBay doesn't use them at all. I'm sure there are many others.
http://www.infoq.com/interviews/dan-pritchett-ebay-architecture
Whenever an operation falls under the ACID (Atomicity, Consistency, Isolation, Durability) criteria, you should use transactions.
Read this article
When you want to use the atomicity or isolation property of the database for a set of changes.
Atomicity: an atomic transaction is an indivisible and irreducible series of database operations such that either all occur, or nothing occurs (according to Wikipedia).
Isolation: isolation determines how transaction integrity is visible to other users and systems (according to Wikipedia).

Resources