Is that good practice to do other things during db transaction? - database

During a database transaction, I want to do a few other things before committing: for example, query ElasticSearch and then write the result to another database table. Logically, I want the entire process in the same transaction, so that either everything succeeds or everything fails, but I'm not sure whether this is doable or good practice. If not, is there a recommended approach for this scenario? Thanks.
I think this is a general question, but I'm using PostgreSQL if that helps.

You can use BEGIN and SAVEPOINT to control the transactions.
Here is the documentation:
https://www.postgresql.org/docs/8.3/tutorial-transactions.html
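As a rough illustration of how BEGIN and SAVEPOINT fit together, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in so that it runs without a server; the `notes` table and the savepoint name `optional_step` are made up for the example. PostgreSQL accepts the same BEGIN / SAVEPOINT / ROLLBACK TO SAVEPOINT / COMMIT statements, but check the documentation for your version.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / SAVEPOINT / COMMIT statements below are sent exactly as written.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO notes (body) VALUES ('kept')")

# Mark a point inside the transaction that we can return to.
conn.execute("SAVEPOINT optional_step")
conn.execute("INSERT INTO notes (body) VALUES ('discarded')")
# Undo only the work done since the savepoint; the outer transaction stays open.
conn.execute("ROLLBACK TO SAVEPOINT optional_step")

conn.execute("COMMIT")

bodies = [row[0] for row in conn.execute("SELECT body FROM notes ORDER BY id")]
print(bodies)
```

Note that a savepoint only undoes database work. It cannot undo a call to an external system such as ElasticSearch.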

That's an excellent question; unfortunately, there's no straight answer. It all depends. Let me give you some examples of how I'd suggest approaching it:
Fetch extra data during transaction:
a) save data in DB
b) fetch some extra data (roll back if this fails)
c) save extra data in DB
This makes some sense: the whole operation looks like one atomic transaction. But consider the reasons step b) might fail. If the data simply is not available, fine. But what if the failure was a temporary network issue? Depending on your business, you might prefer to move b) after the transaction, maybe even into an async job, and fill in the data later.
Saving extra data during transaction:
a) save some data
b) save data in an external system
c) do some more stuff, commit transaction
This is trickier. What if b) was successful, but something failed between b) and c)? How do you roll back the changes made in b)?
There are probably more cases; I just wanted to show the way you might think about it.
My gut feeling tells me that the less happens inside the transaction, the easier it is to reason about what is happening. If you find yourself debugging such weird scenarios, it may be a reasonable investment to remodel your app so that you can make those extra calls outside the transaction.
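One possible way to keep external calls outside the transaction is sketched below. It is only an illustration under assumptions: `fetch_extra_data` is a hypothetical stand-in for the external lookup (for example a search query), the `records` and `extras` tables are invented, and sqlite3 is used instead of PostgreSQL so the example is self-contained.

```python
import sqlite3


def fetch_extra_data(record_id):
    """Hypothetical stand-in for a slow external lookup such as a search query."""
    return f"extra-for-{record_id}"


def save_with_extra(conn, record_id, payload):
    # 1. Make the slow, failure-prone external call BEFORE opening the transaction.
    extra = fetch_extra_data(record_id)
    # 2. Keep the transaction short: only local writes happen inside it.
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO records (id, payload) VALUES (?, ?)", (record_id, payload)
        )
        conn.execute(
            "INSERT INTO extras (record_id, extra) VALUES (?, ?)", (record_id, extra)
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE extras (record_id INTEGER PRIMARY KEY, extra TEXT)")
conn.commit()

save_with_extra(conn, 1, "hello")
saved = conn.execute(
    "SELECT r.payload, e.extra FROM records r JOIN extras e ON e.record_id = r.id"
).fetchall()
print(saved)
```

If the external call fails here, nothing has been written yet, so there is nothing to roll back. Whether the lookup can really happen before the first write depends on your data flow.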

Related

Using SQLAlchemy sessions and transactions

While learning SQLAlchemy I came across two ways of dealing with SQLAlchemy's sessions.
One was creating the session once globally while initializing my database like:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
and import this DBSession instance in all my requests (all my insert/update) operations that follow.
When I do this, my DB operations have the following structure:
with transaction.manager:
    for each_item in huge_file_of_million_rows:
        DBSession.add(each_item)
        # More create, read, update and delete operations
I do not commit or flush or rollback anywhere assuming my Zope transaction manager takes care of it for me
(it commits at the end of the transaction or rolls back if it fails)
The second way, and the one most frequently mentioned on the web, was:
create a DBSession once like
DBSession = sessionmaker(bind=engine)
and then create a session instance of this per transaction:
session = DBSession()
for row in huge_file_of_million_rows:
    for item in row:
        try:
            session.add(item)
            # More create, read, update and delete operations
            session.flush()
            session.commit()
        except Exception:
            # Undo the failed unit of work so the session stays usable
            session.rollback()
session.close()
I do not understand which is BETTER (in terms of memory usage, performance, and robustness) and how.
In the first method, I accumulate all the objects in the session and the commit happens at the end. For a bulk insert operation, does adding objects to the session store them in memory (RAM) or elsewhere? Where do they get stored, and how much memory is consumed?
Both ways tend to be very slow when I have about a million inserts and updates. Trying SQLAlchemy Core also takes the same time to execute. Selecting, inserting and updating 100K rows takes about 25-30 minutes. Is there any way to reduce this?
Please point me in the right direction. Thanks in advance.
Here you have a very generic answer, and with the warning that I don't know much about zope. Just some simple database heuristics. Hope it helps.
How to use SQLAlchemy sessions:
First, take a look at their own explanation in the SQLAlchemy session documentation.
As they say:
The calls to instantiate Session would then be placed at the point in the application where database conversations begin.
I'm not sure I understand what you mean by method 1; just in case, a warning: you should not have just one session for the whole application. You instantiate Session when a database conversation begins, but you surely have several points in the application where different conversations begin. (I can't tell from your text whether you have different users.)
One commit at the end of a huge number of operations is not a good idea
Indeed it will consume memory, probably in the Session object of your python program, and surely in the database transaction. How much space? That's difficult to say with the information you provide; it will depend on the queries, on the database...
You could easily estimate it with a profiler. Take into account that if you run out of resources everything will go slower (or halt).
One commit per record is also not a good idea when processing a bulk file
It means you are asking the database to persist changes for every single row. Certainly too much. Try an intermediate number: commit every few hundred rows. But then it gets more complicated: one commit at the end of the file assures you that the file is either fully processed or not at all, while intermediate commits force you to take into account, when something fails, that your file is only half processed, so you need to resume from the right position.
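The "commit every n rows" idea can be sketched roughly as follows. This uses the standard-library sqlite3 module rather than SQLAlchemy to stay self-contained; the `items` table, the `insert_in_batches` helper and the batch size of 500 are assumptions made for illustration, not tuned values.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=500):
    """Insert rows, committing once per batch_size rows. Returns the commit count."""
    commits = 0
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO items (name) VALUES (?)", (row,))
        pending += 1
        if pending == batch_size:
            conn.commit()  # persist this batch before continuing
            commits += 1
            pending = 0
    if pending:
        conn.commit()  # persist the final, partially filled batch
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

commits = insert_in_batches(conn, (f"item-{i}" for i in range(1200)), batch_size=500)
total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(commits, total)
```

A real loader would also record how far it got (for example the last committed row number), so that a failed run can resume instead of starting over.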
As for the times you mention, it is very difficult to judge with the information you provide; it depends on your database and machine. Anyway, the order of magnitude of your numbers, one select+insert+update per 15 ms, probably plus a commit, sounds pretty high but more or less within the expected range (again, it depends on queries, database and machine)... If you frequently have to insert that many records, you could consider other database solutions; it will depend on your scenario, and probably on the dialect, and they may not be provided by an ORM like SQLAlchemy.

Appengine - querying database right after putting something in

In one place of code I do something like this:
FormModel(.. some data here..).put()
And a couple lines below I select from the database:
FormModel.all().filter(..).fetch(100)
The problem I noticed - sometimes the fetch doesn't notice the data I just added.
My theory is that this happens because I'm using high replication storage, and I don't give it enough time to replicate the data. But how can I avoid this problem?
Unless the data is in the same entity group, there is no way to guarantee that the data will be the most up to date (if I understand this section correctly).
Shay is right: there's no way to know when the datastore will be ready to return the data you just entered.
However, you are guaranteed that the data will be entered eventually, once the call to put completes successfully. That's a lot of information, and you can use it to work around this problem. When you get the data back from fetch, just append/insert the new entities that you know will be in there eventually! In most cases it will be good enough to do this on a per-request basis, I think, but you could do something more powerful that uses memcache to cover all requests (except cases where memcache fails).
The hard part, of course, is figuring out when you should append/insert which entities. It's obnoxious to have to do this workaround, but a relatively low price to pay for something as astonishingly complex as the HRD.
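The append/insert workaround described above could look something like this minimal sketch. Entities are modelled as plain dicts with an `id` key purely for illustration; `merge_recent_writes` is a hypothetical helper, not part of the App Engine API.

```python
def merge_recent_writes(fetched, recent, key=lambda e: e["id"]):
    """Return fetched results plus any just-written entities the query missed.

    If the same entity appears in both lists, the just-written version wins.
    """
    merged = {key(e): e for e in fetched}
    for entity in recent:
        merged[key(entity)] = entity
    return list(merged.values())


# What the query returned (it has not yet seen the new entity) ...
fetched = [{"id": 1, "name": "old form"}]
# ... and what this request knows it has just put().
just_put = [{"id": 2, "name": "new form"}]

result = merge_recent_writes(fetched, just_put)
print(result)
```

This ignores the query's filter and sort order; a fuller version would have to apply the same filter to the recent entities before merging them in.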
From https://developers.google.com/appengine/docs/java/datastore/transactions#Java_Isolation_and_consistency
This consistent snapshot view also extends to reads after writes inside transactions. Unlike with most databases, queries and gets inside a Datastore transaction do not see the results of previous writes inside that transaction. Specifically, if an entity is modified or deleted within a transaction, a query or get returns the original version of the entity as of the beginning of the transaction, or nothing if the entity did not exist then.

Why use a READ UNCOMMITTED isolation level?

In plain English, what are the disadvantages and advantages of using
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
in a query for .NET applications and reporting services applications?
This isolation level allows dirty reads. One transaction may see uncommitted changes made by some other transaction.
To maintain the highest level of isolation, a DBMS usually acquires locks on data, which may result in a loss of concurrency and a high locking overhead. This isolation level relaxes this property.
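To make "dirty read" concrete, here is a small sketch. It does not use SQL Server: it swaps in SQLite's shared-cache mode with `PRAGMA read_uncommitted`, which is SQLite's rough counterpart, and assumes your SQLite build has shared-cache support enabled. The `orders` table and the URI name `dirty_read_demo` are invented for the example.

```python
import sqlite3

# Two connections to the same shared-cache in-memory database.
# isolation_level=None means BEGIN / ROLLBACK are issued explicitly below.
URI = "file:dirty_read_demo?mode=memory&cache=shared"
writer = sqlite3.connect(URI, uri=True, isolation_level=None)
reader = sqlite3.connect(URI, uri=True, isolation_level=None)

writer.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
writer.execute("INSERT INTO orders (amount) VALUES (100)")

# SQLite's counterpart of READ UNCOMMITTED (only meaningful in shared-cache mode).
reader.execute("PRAGMA read_uncommitted = 1")

# The writer changes data but has NOT committed yet.
writer.execute("BEGIN")
writer.execute("INSERT INTO orders (amount) VALUES (999)")

# The reader sees the uncommitted row: this is the dirty read.
dirty_total = reader.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# The writer changes its mind; the row the reader saw never officially existed.
writer.execute("ROLLBACK")
clean_total = reader.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(dirty_total, clean_total)
```

The reader was never blocked, but the first total it reported included data that was later rolled back.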
You may want to check out the Wikipedia article on READ UNCOMMITTED for a few examples and further reading.
You may also be interested in checking out Jeff Atwood's blog article on how he and his team tackled a deadlock issue in the early days of Stack Overflow. According to Jeff:
But is nolock dangerous? Could you end up reading invalid data with read uncommitted on? Yes, in theory. You'll find no shortage of database architecture astronauts who start dropping ACID science on you and all but pull the building fire alarm when you tell them you want to try nolock. It's true: the theory is scary. But here's what I think: "In theory there is no difference between theory and practice. In practice there is."
I would never recommend using nolock as a general "good for what ails you" snake oil fix for any database deadlocking problems you may have. You should try to diagnose the source of the problem first.
But in practice adding nolock to queries that you absolutely know are simple, straightforward read-only affairs never seems to lead to problems... As long as you know what you're doing.
One alternative to the READ UNCOMMITTED level that you may want to consider is the READ COMMITTED SNAPSHOT. Quoting Jeff again:
Snapshots rely on an entirely new data change tracking method ... more than just a slight logical change, it requires the server to handle the data physically differently. Once this new data change tracking method is enabled, it creates a copy, or snapshot of every data change. By reading these snapshots rather than live data at times of contention, Shared Locks are no longer needed on reads, and overall database performance may increase.
My favorite use case for read uncommitted is to debug something that is happening inside a transaction.
Start your software under a debugger. While you are stepping through the lines of code, it opens a transaction and modifies your database. While the code is stopped, you can open a query analyzer, set the read uncommitted isolation level, and run queries to see what is going on.
You can also use it to see whether long-running procedures are stuck or are correctly updating your database, using a query with count(*).
It is great if your company loves to make overly complex stored procedures.
This can be useful to see the progress of long insert queries, make rough estimates (like COUNT(*) or a rough SUM of a column), etc.
In other words, the results that dirty-read queries return are fine as long as you treat them as estimates and don't make any critical decisions based on them.
The advantage is that it can be faster in some situations. The disadvantage is the result can be wrong (data which hasn't been committed yet could be returned) and there is no guarantee that the result is repeatable.
If you care about accuracy, don't use this.
More information is on MSDN:
Implements dirty read, or isolation level 0 locking, which means that no shared locks are issued and no exclusive locks are honored. When this option is set, it is possible to read uncommitted or dirty data; values in the data can be changed and rows can appear or disappear in the data set before the end of the transaction. This option has the same effect as setting NOLOCK on all tables in all SELECT statements in a transaction. This is the least restrictive of the four isolation levels.
When is it ok to use READ UNCOMMITTED?
Rule of thumb
Good: Big aggregate reports showing constantly changing totals.
Risky: Nearly everything else.
The good news is that the majority of read-only reports fall in that Good category.
More detail...
Ok to use it:
Nearly all user-facing aggregate reports for current, non-static data e.g. Year to date sales.
It risks a margin of error (maybe < 0.1%) which is much lower than other uncertainty factors such as inputting error or just the randomness of when exactly data gets recorded minute to minute.
That probably covers the majority of what a Business Intelligence department would do in, say, SSRS. The exception, of course, is anything with $ signs in front of it. Many people account for money with much more zeal than they apply to the related core metrics required to service the customer and generate that money. (I blame accountants.)
When risky
Any report that goes down to the detail level. If that detail is required it usually implies that every row will be relevant to a decision. In fact, if you can't pull a small subset without blocking it might be for the good reason that it's being currently edited.
Historical data. It rarely makes a practical difference, but whereas users understand that constantly changing data can't be perfect, they don't feel the same about static data. Dirty reads won't hurt here, but double reads occasionally can. Seeing as you shouldn't have blocks on static data anyway, why risk it?
Nearly anything that feeds an application which also has write capabilities.
When even the OK scenario is not OK.
Are any applications or update processes making use of big single transactions? Ones which remove then re-insert a lot of records you're reporting on? In that case you really can't use NOLOCK on those tables for anything.
Use READ_UNCOMMITTED in situations where the source is highly unlikely to change.
When reading historical data, e.g. some deployment logs from two days ago.
When reading metadata, e.g. in a metadata-based application.
Don't use READ_UNCOMMITTED when you know the source may change during the fetch operation.
Regarding reporting, we use it on all of our reporting queries to prevent a query from bogging down databases. We can do that because we're pulling historical data, not up-to-the-microsecond data.
This will give you dirty reads and show you transactions that are not committed yet. That is the most obvious answer. I don't think it's a good idea to use this just to speed up your reads. There are other ways of doing that if you use a good database design.
It's also interesting to note what's not happening. READ UNCOMMITTED does not only ignore other table locks. It also doesn't take any locks of its own.
Consider that you are generating a large report, or migrating data out of your database using a large and possibly complex SELECT statement. This will take a shared lock that may be escalated to a shared table lock for the duration of your transaction. Other transactions may read from the table, but updates are impossible. This may be a bad idea on a production database, since production may stop completely.
If you are using READ UNCOMMITTED you will not set a shared lock on the table. You may or may not get results from new transactions, depending on where in the table the data was inserted and how far your SELECT has read. You may also get the same data twice if, for example, a page split occurs (the data will be copied to another location in the data file).
So, if it's very important for you that data can be inserted while doing your SELECT, READ UNCOMMITTED may make sense. You have to consider that your report may contain some errors, but if it's based on millions of rows and only a few of them are updated while selecting, the result may be "good enough". Your transaction may also fail altogether, since the uniqueness of a row may not be guaranteed.
A better way altogether may be to use the SNAPSHOT isolation level, but your applications may need some adjustments to use it. One example is an application that takes an exclusive lock on a row to prevent others from reading it while the UI is in edit mode. The SNAPSHOT isolation level also comes with a considerable performance penalty (especially on disk). But you may overcome that by throwing hardware at the problem. :)
You may also consider restoring a backup of the database to use for reporting, or loading the data into a data warehouse.
It can be used for a simple table, for example an insert-only audit table, where existing rows are never updated and there are no foreign keys to other tables. The insert is a simple insert, which has little or no chance of being rolled back.
I always use READ UNCOMMITTED now. It's fast, with the fewest issues. When using other isolation levels you will almost always come across some blocking issues.
As long as you use auto-increment fields and pay a little more attention to inserts, you're fine, and you can say goodbye to blocking issues.
You can make errors with READ UNCOMMITTED, but to be honest it is very easy to make sure your inserts are foolproof. Inserts/updates which use the results of a select are the only thing you need to watch out for. (Use READ COMMITTED here, or ensure that dirty reads aren't going to cause a problem.)
So go for dirty reads (especially for big reports); your software will run smoother...

Django data creation and commits

I'm not sure I 100% understand what the database does. If I just have some misconception, please point it out.
Let's say I have a function that wants to create 100 new entries in a database which has 100,000 entries.
It seems a lot faster when those 100 entries get created and the commit is made after the last entry is created.
Now, if those 100 entries get created by different users, is there an easy way to commit only after 100 entries are created?
Edit:
Should I maybe write some sort of buffer?
Databases are optimized for set-based operations, so yes, it would be faster to insert 100 records as a set than one at a time. However, when you are talking about users entering records one at a time, you would not want to group them together under any circumstances that I can think of. Why?
First, if there was one bad record, the others would fail. This would make for 99 cranky users out of 100 (actually 100, but one would not really have reason to be cranky because he did the bad data entry to begin with).
Second, users would not see the records immediately after entering them. They also would not be able to do anything further with those records, such as entering data into related tables, until the records are actually saved. Having a delay like this would make users cranky. If users are entering data from customers during a phone call, they will be especially cranky at the wait (I worked at a call center with a horribly slow commercial product, and believe me, I know how upset the users used to get!)
Third, users will have gone on to something else and would not realize that their data was rejected for bad information. Not a good thing at all.
How long are you going to wait to get your set number of records? Five seconds, ten minutes?
What happens if for some reason the network connection is lost during that time? Wouldn't the users lose the data they entered?
You might be able to hack something like that together, but you really shouldn't, because it wrecks your data integrity, which is the whole point of using transactions.
In your proposed solution, a problem with any insert in the batch would cause all the other (possibly totally valid) inserts from completely different users to fail. Also, users wouldn't be able to see the data they just tried to insert because the system was waiting to do the insert until the batch was full.
P.S. Here's a quick intro to transaction processing.
I think you do have a misconception. It sounds like you're looking at the database as something that is only for some sort of "long-term" memory. This is a bad concept; the database is the only memory your application has. Even when this isn't true, it's best to pretend that it is.
To go a little deeper, your application has:
scoped memory: variables that you define within view functions, for example. These all get destroyed when flow leaves the function.
globals: variables that are defined in the outermost part of your code. It is really important not to use these for any sort of state except perhaps configuration constants. The important thing is that you should not rely on them for any dynamic behavior. Otherwise you will have to battle concurrency and forked processes (depending on the server gateway) that aren't aware of each other. Just don't do it.
a caching scheme, if you choose to implement one. This is entirely optional in Django, and there are many ways to do it. However, one typically uses some scheme to ensure that even if the cache crashes, the database reflects the current state of the data accurately.
your local filesystem. From a design point of view, most ways of taking advantage of this will either resemble a caching system (above) or be clumsy and fragile. From a performance point of view, it might be about as slow as a database.
your database.
So you see that there's not much place for you to put your data besides the database.

How to Decide to use Database Transactions

How do you guys decide that you should be wrapping the sql in a transaction?
Please throw some light on this.
Cheers !!
A transaction should be used when you need a set of changes to be processed completely to consider the operation complete and valid. In other words, if only a portion executes successfully, will that result in incomplete or invalid data being stored in your database?
For example, if you have an insert followed by an update, what happens if the insert succeeds and the update fails? If that would result in incomplete data (in this case, an orphaned record), you should wrap the two statements in a transaction to get them to complete as a "set".
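The insert-followed-by-update case can be sketched like this. It is a minimal illustration using the standard-library sqlite3 module; the `orders` and `customers` tables and the `place_order` helper are invented for the example, and the failure is forced by violating a NOT NULL constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, order_count INTEGER NOT NULL)"
)
conn.execute("INSERT INTO customers (id, order_count) VALUES (1, 0)")
conn.commit()


def place_order(conn, customer_id, new_count):
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the INSERT and the UPDATE are applied together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (customer_id) VALUES (?)", (customer_id,))
        conn.execute(
            "UPDATE customers SET order_count = ? WHERE id = ?",
            (new_count, customer_id),
        )


try:
    place_order(conn, 1, None)  # the UPDATE violates NOT NULL and fails
except sqlite3.IntegrityError:
    pass

# The INSERT from the failed attempt was rolled back: no orphaned order row.
orders_after_failure = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

place_order(conn, 1, 1)
orders_after_success = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orders_after_failure, orders_after_success)
```

Without the transaction, the first call would have left an order row behind with no matching update.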
If you are executing two or more statements that you expect to be functionally atomic, you should wrap them in a transaction.
If you have more than a single data-modifying statement to execute to complete a task, all of them should be within a transaction.
This way, if the first one is successful, but any of the following ones has an error, you can rollback (undo) everything as if nothing was ever done.
Whenever you wouldn't like it if part of the operation can complete and part of it doesn't.
Anytime you want to lock up your database and potentially crash your production application, anytime you want to litter your application with hidden scalability nightmares go ahead and create a transaction. Make it big, slow, and put a loop inside.
Seriously, none of the above answers acknowledge the trade-off and potential problems that come with heavy use of transactions. Be careful, and consider the risk/reward each time.
eBay doesn't use them at all. I'm sure there are many others.
http://www.infoq.com/interviews/dan-pritchett-ebay-architecture
Whenever an operation needs to meet the ACID (Atomicity, Consistency, Isolation, Durability) criteria, you should use transactions.
Read this article
When you want to use the atomicity or isolation properties of the database for a set of changes.
Atomicity: An atomic transaction is an indivisible and irreducible series of database operations such that either all occur, or nothing occurs (according to Wikipedia).
Isolation: isolation determines how transaction integrity is visible to other users and systems (according to Wikipedia).

Resources