SQLite: End transaction is taking too long - database

I am inserting a few rows with in a transaction. But when I do a 'END TRANSACTION' its taking around 250ms to execute while the 'BEGIN TRANSACTION' hardly takes around 1ms. I need to fasten the speed here to suite my application. How can I?
[edit]
* A single thread is accessing the database.
* I have 2 tables in this database and both these have primary key on them.
* With in a transaction there is exactly one insert into each the tables.
* OS - windows 7

Using the out of the box, or default, settings of sqlite, 250ms to commit a transaction makes sense. This is due to how sqlite commits your transaction. It waits for the VFS for a guarantee that the writes are committed to disk to return.
Here are a couple of possibilities to optimize.
Encapsulate more inserts per transaction
If possible, execute more inserts per transaction. Try running, say, 100 inserts in one transactions, and see how little difference there will be compared to only 1 insert. You may not even see any (i.e. 100 inserts will probably take marginally more than 250ms).
Bottom line is that you'll get more bang for your buck as each insert will ultimately take less time (on average).
Use WAL journaling
I strongly recommend you try WAL journaling as you should see a dramatic reduction from 250ms. WAL shouldn't be any less safe than regular journaling. The reason WAL is faster is in its name: it appends to a journaling file instead of having the database file absorb the changes of your commit every time you commit. Read this for the full story.
To activate WAL journaling, set the journal_mode pragma to WAL:
PRAGMA journal_mode = WAL;
Change the synchronous pragma
This may or may not be good enough for you as it is less safe. As such, I recommend it only if you understand what the risks are, and also if the two previous suggestions are not good enough for you, or if you cannot use them.
Basically, changing the synchronous pragma to NORMAL or OFF will cause sqlite to not wait after the VFS for a guarantee that the writes are committed to disk to return.
Please read the documentation first, then if you still want to try it, you can set your pragma to either OFF or NORMAL:
PRAGMA synchronous = NORMAL;
or
PRAGMA synchronous = OFF;

Related

Does PostgreSQL run some performance optimizations for read-only transactions

According to the reference documentation the READ ONLY transaction flag is useful other than allowing DEFERRABLE transactions?
SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY;
The DEFERRABLE transaction property has no effect unless the
transaction is also SERIALIZABLE and READ ONLY. When all three of
these properties are selected for a transaction, the transaction may
block when first acquiring its snapshot, after which it is able to run
without the normal overhead of a SERIALIZABLE transaction and without
any risk of contributing to or being canceled by a serialization
failure. This mode is well suited for long-running reports or backups.
Does the database engine runs other optimizations for read-only transactions?
To sum up the comments from Nick Barnes and Craig Ringer in the question comments:
The READ_ONLY flag does not necessarily provide any optimization
The main benefit of setting the READ_ONLY flag is to ensure that no tuple is going to be modified
Actually, it does. Let me just cite source code comment here:
/*
* Check if we have just become "RO-safe". If we have, immediately release
* all locks as they're not needed anymore. This also resets
* MySerializableXact, so that subsequent calls to this function can exit
* quickly.
*
* A transaction is flagged as RO_SAFE if all concurrent R/W transactions
* commit without having conflicts out to an earlier snapshot, thus
* ensuring that no conflicts are possible for this transaction.
*/

is insert-select statement massive?

When multiple inserts are used with a select statement in a transaction, how does the database keep track of the changes during the transaction? Can there be problems with resources (such as memory or hard disk space) if a transaction is held open too long?
The short answer is, it depends on the size of the select. The select is part of the transaction, technically, but most selects don't have to be "rolled back", so the actual log of DB changes wouldn't include the select by itself. What it WILL include is a new row for every result from the select statement as an insert statement. If that select statement is 10k rows, the commit will be rather large, but no more so than if you'd written 10k individual insert statements within an explicit transaction.
Exactly how this works depends on the database. For example, in Oracle, it will require UNDO space (and eventually, if you run out, your transaction will be aborted, or your DBA will yell at you). In PostgreSQL, it'll prevent the vacuuming of old row versions. In MySQL/InnoDB, it'll use rollback space, and possibly cause lock timeouts.
There are several things the database must use space for:
Storing which rows your transaction has changed (the old values, the new values, or both) so that rollback can be performed
Keeping track of which data is visible to your transaction so that a consistent view is maintained (in transaction isolation levels other than read uncommitted). This overhead will often be greater the more isolation you request.
Keeping track of which data is visible to other transactions (unless the whole database is running in read uncommitted)
Keeping track of which objects which transactions have changed, so isolation rules are followed, especially in serializable isolation. (Probably not much space, but plenty of locks).
In general, you want your transactions to commit as soon as possible. So, e.g., you don't want to hold one open on an idle connection. How to best batch inserts depends on the database (often, many inserts on one transaction is better than one transaction per insert). And of course, the primary purpose of transactions is data integrity.
You can have many problems with the large transaction. First, in most databases you do not want to run row-by-row because for a million records that will take hours. But to insert a million records in one complex statement can cause locking on the tables involved and harm performance for everyone else. And a rollback if you kill the transaction can take a good while too. Usually the best alternative is to loop in batches. I usually test 50,000 at a time and raise or lower the set depending on how long that takes. I've had some databases where I do no more that 1000 in one set-based operation. If possible large inserts or updates should be scheduled for the off-peak hours that the database operates. If really large (and one-time - usually a large data migration) you might even want to close the database for maintenance, put it in single user mode and drop the indexes, do the insert and reindex.

To NOLOCK or NOT to NOLOCK, that is the question

This is really more of a discussion than a specific question about nolock.
I took over an app recently that almost every query (and there are lots of them) has the nolock option on them. Now I am pretty new to SQL server (used Oracle for 10 years) but yet I find this pretty disturbing. So this weekend I was talking with one of my friends who runs a rather large ecommerce site (name will be withheld to protect the guilty) and he says he has to do this with all of his SQL servers cause he will always end in deadlocks.
Is this just a huge short fall with SQL server? Is this just a failure in the DB design (mine is not 3rd level, but its close) Is anybody out there running an SQL server app without nolocks? These are issues that Oracle handles better with more grandulare recordlocks.
Is SQL server just not able to handle big loads? Is there some better workaround than reading uncommited data? I would love to hear what people think.
Thanks
SQL Server has added snapshot isolation in SQL Server 2005, this will enable you to still read the latest correct value without having to wait for locks. StackOverflow is also using Snapshot Isolation. The Snapshot Isolation level is more or less the same that Oracle uses, this is why deadlocks are not very common on an Oracle box. Just be aware to have plenty of tempdb space if you do enable it
from Books On Line
When the READ_COMMITTED_SNAPSHOT
database option is set ON, read
committed isolation uses row
versioning to provide statement-level
read consistency. Read operations
require only SCH-S table level locks
and no page or row locks. When the
READ_COMMITTED_SNAPSHOT database
option is set OFF, which is the
default setting, read committed
isolation behaves as it did in earlier
versions of SQL Server. Both
implementations meet the ANSI
definition of read committed
isolation.
If somebody says that without NOLOCK their application always gets deadlocked, then there is (more than likely) a problem with their queries. A deadlock means that two transactions cannot proceed because of resource contention and the problem cannot be resolved. An example:
Consider Transactions A and B. Both are in-flight. Transaction A has inserted a row into table X and Transaction B has inserted a row into table Y, so Transaction A has an exclusive lock on X and Transaction B has an exclusive lock on Y.
Now, Transaction A needs run a SELECT against table Y and Transaction B needs to run a SELECT against table X.
The two transactions are deadlocked: A needs resource Y and B needs resource X. Since neither transaction can proceed until the other completes, the situtation cannot be resolved: neither transactions demand for a resource may be satisified until the other transaction releases its lock on the resource in contention (either by ROLLBACK or COMMIT, doesn't matter.)
SQL Server identifies this situation and select one transaction or the other as the deadlock victim, aborts that transaction and rolls back, leaving the other transaction free to proceed to its presumable completion.
Deadlocks are rare in real life (IMHO). One rectifies them by
ensuring that transaction scope is as small as possible, something SQL server does automatically (SQL Server's default transaction scope is a single statement with an implicit COMMIT), and
ensuring that transactions access resources in the same sequence. In the example above, if transactions A and B both locked resources X and Y in the same sequence, there would not be a deadlock.
Timeouts
A timeout, on the other hand, occurs when a transaction exceeds its wait time and is rolled back due to resource contention. For instance, Transaction A needs resource X. Resource X is locked by Transaction B, so Transaction A waits for the lock to be released. If the lock isn't released within the queries timeout limimt, the waiting transaction is aborted and rolled back. Every query has a query timeout associated with it (the default value is 30s, I believe), after which time the transaction is aborted and rolled back. The query timeout can be set to 0s, in which case SQL Server will let the query wait forever.
This is probably what they are talking about. In my experience, timeouts like this usually occur in big databases when large batch jobs are updating thousands and thousands of records in a single transaction, although they can happen because a transaction goes to long (connect to your production database in Query Abalyzer, execute BEGIN TRANSACTION, update a single row in a frequently hit table in Query Analyzer and go to lunch without executing ROLLBACK or COMMIT TRANSACTION and see how long it takes for the production DBAs to go apes**t on you. Don't ask me how I know this)
This sort of timeout is usually what results in splattering perfectly innocent SQL with all sorts of NOLOCK hints
[TIP: if your going to do that, just execute SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED as the first statement in your stored procedure and have done with it.]
The problem with this approach (NOLOCK/READ UNCOMMITTED) is that you can read uncommitted data from other transaction: stuff that is incomplete or that may get rolled back later, so your data integrity is comprimised. You might be sending out a bill based on data with a high level of bogosity.
My general rule is that one should avoid the use of table hints insofar as possible. Let SQL Server and its query optimizer do their jobs.
The right way to avoid this sort of issue is to avoid the sort of transactions (insert a million rows all at one fell swoop, for instance) that cause the problems. The locking strategy implicit in relational database SQL is designed around small transactions of short scope. Lock should be small in scope and short in duration. Think "bank teller updating somebody's checking account with a deposit." as the underlying use case. Design your processes to work in that model and you'll be much happier all the way 'round.
Instead of inserting a million rows in one mondo insert statement, do the work in independent chunks and commit each chunk independently. If your million row insert dies after processing 999,000 rows, all the work done is lost (not to mention that the rollback can be a b*tch, and the table is still locked during rollback as well.) If you insert the million rows in block of 1000 rows each, committing after each block, you avoid the lock contention that causes deadlocks, as locks will be obtained and released and things will keep moving. If something goes south in the 999th block of 1000 rows, and the transaction get aborted and rolled back, you've still gotten 998,000 rows inserted; you've only lost 1000 rows of work. Restart/Retry is much easier.
Also, lock escalation occurs in large transactions. For effiency, locks escalate to larger and larger scope as the number of locks held by transaction increases. If a single transaction inserts/updates/deletes a single row in a table, I get a row lock. Keep doing that and once the number of row locks held by that transaction against that table hits a threshold value, SQL Server will escalate the locking strategy: the row locks will be consolidated and converted into a smaller number page locks, thus increasing the scope of the locks held. From that point forward, an insert/delete/update of a single row will lock that page in the table. Once the number of page locks held hits its threshold value, the page locks are again consolidated and the locking strategy escalates to table locks: the transaction now locks the entire table and nobody else may play until the transaction commits or rolls back.
Whether you can avoid functionally avoid the use of NOLOCK/READ UNCOMMITTED is entirely dependent on the nature of the processes hitting the underlying database (and the culture of the organization owning it).
Myself, I try to avoid its use as much as possible.
Hope this helps.
No, there is no need to use NOLOCK. Links: SO 1
As for load, we deal with 2000 rows per second which is small change compared to 35k TPS
Deadlocks are caused by lock contention and usually caused by inconsistent write order on tables in transactions. ORMs especially are rubbish at this. We get them very infrequently. A well written DAL should retry too as per MSDN.
In a traditional normalized OLTP environment, NOLOCK is a code smell and almost certainly unnecessary in a properly designed system.
In a dimensional model, I used NOLOCK extensively to avoid locking very large fact and dimension tables which were being populated with later fact data (and dimensions may have been expiring). In the dimensional model, the facts either never change or never change after a certain point. Similarly, any dimension which is referenced will also be static, so for example, the NOLOCK will stop your long analysis operation on yesterday's data from blocking a dimension expiration during a data load for today's data.
You should only use nolock on an unchanging table. Of course, this will be the same then as Read Committed Snapshot. Without the snapshot, you are only saving the time it takes to apply a shared lock, and then to remove it, which for most cases isn't necessary.
As for a changing table... No lock doesn't just mean getting a row before a transaction is done updating all of its rows. You can get ghost data as data pages split, or even index pages split. Or no data. That alone scared me away, but I think there may be even more scenarios where you simply get the wrong data.
Of course, nolock for getting rough estimates or to just check in on a process might be reasonable.
Basic rule of thumb -- if you care about the data at all, and the data is changing, then do not use NoLOCK.

when to prefer pessimistic model of transaction isolation over optimistic one?

Do I understand correctly that table/row lock hints are being used for pessimistic transaction (TX) isolation models of concurrency ONLY?
In other words, when can table/row lock hints be used during engagement of optimistic TX isolation provided by SQL Server (2005 and higher)?
When one would need pessimistic TX isolation levels/hints in SQL Server2005+ if the later provides built-in optimistic (aka snapshot aka versioning) concurrency isolation?
I did read that pessimistic options are legacy and are not needed anymore, though I am in doubt.
Also, having optimistic (aka snapshot aka versioning) TX isolation levels built-in SQL Server2005+,
when one would need to manually code for optimistic concurrency features?
The last question is inspired by having read:
"Optimistic Concurrency in SQL Server" (September 28, 2007)
describing custom coding to provide versioning in SQL Server.
Optimistic concurrency requires more resources and is more expensive when the conflict occurs.
Two sessions can read and modify the values and the conflict only occurs when they try to apply their changes simultaneously. This means that in case of the concurrent update both values should be stored somewhere (which of course requires resources).
Also, when a conflict occurs, usually the whole transaction should be rolled back or the cursor refetched, which is expensive too.
Pessimistic concurrency model uses locking, thus downgrading concurrency but improving performance.
In case of two concurrent tasks, it may be cheaper for the second task to wait for a lock to release than spending CPU time and disk I/O on two simultaneous works and then yet more on rolling back the less fortunate work and redoing it.
Say, you have a query like this:
UPDATE mytable
SET myvalue = very_complex_function(#range)
WHERE rangeid = #range
, with very_complex_function reading some data from mytable itself. In other words, this query transforms a subset of mytable sharing the value of range.
Now, when two functions work on the same range, there may be two scenarios:
Pessimistic: the first query locks, the second query waits for it. The first query completes in 10 seconds, the second one does too. Total: 20 seconds.
Optimistic: both queries work independently (on the same input). This shares CPU time between them plus some overhead on switching. They should keep their intermediate data somewhere, so the data is stored twice (which implies twice I/O or memory). Let's say both complete almost at the same time, in 15seconds.
But when it's time to commit the work, the second query will conflict and will have to rollback its changes (say, it takes the same 15 seconds). Then it needs to reread the data again and do the work again, with the new set of data (10 seconds).
As a result, both queries complete later than with a pessimistic locking: 15 and 40 seconds vs. 10 and 20.
When one would need pessimistic TX isolation levels/hints in SQL Server2005+ if the later provides built-in optimistic (aka snapshot aka versioning) concurrency isolation?
Optimistic isolation levels are, well, optimistic. You should not use them when you expect high contention on your data.
BTW, optimistic isolation (for the read queries) was available in SQL Server 2000 too.
I have a detailed answer here: Developing Modifications that Survive Concurrency
I think there's a bit confusion over terminology here.
The technique of optimistic locking/optimistic concurrency/... is a programming technique used to avoid the following scenario :
start transaction
read data, setting a "read" lock on it to prevent any deletes/modifications to our data
display data on user's screen
await user input, lock remains active
keep awaiting user input, lock still preventing any writes/modifications
user input never comes (for whatever reason)
transaction times out (and this is usually not very rapidly, as the user must be given reasonable time to enter his input).
Optimistic locking replaces this with the following:
start transaction READ
read data, setting a "read" lock on it to prevent any deletes/modifications to our data
end transaction READ, releasing the read lock just set
display data on user's screen
await user input, but data can be modified/deleted meanwhile by other transactions
user input arrives
start transaction WRITE
verify that the data has remained unaltered, raising an exception if it has
apply user updates
end transaction WRITE
So the single "user transaction" to go fetch some data, and change and update them, consists of two distinct "database transactions". What is usually called "isolation levels" applies to those database transactions. The "optimistic locking" that you refer to applies to the "user transaction".
The matter is further complicated in that, broadly speaking, two completely distinct strategies are possible for the "isolating the database transactions part" :
MVCC
2-phase locking
I think the "snapshot versioning isolation level" means that the MVCC technique (well, one of its various possible variations) is being used for the database transaction. The other commonly known isolation levels apply more to transaction isolation using 2PL as the serialization(/isolation) technique. (And mixing them up can get messy ...)

Maximum transaction size in PostgreSQL

I have a utility in my application where i need to perform bulk load of INSERT, UPDATE & DELETE operations. I am trying to create transaction around this so that once this system is invoke and the data is fed to it, it is ensured that it is either all or none added to the database.
The concern what is have is what is the boundary conditions here? How many INSERT, UPDATE & DELETE can i have in one transaction? Is transaction size configurable?
I don't think there's a maximum amount of work that can be performed in a transaction. Data keeps getting added to the table files, and eventually the transaction either commits or rolls backs: AIUI this result gets stored in pg_clog; if it rolls back, the space will eventually be reclaimed by vacuum. So it's not as if the ongoing transaction work is held in memory and flushed at commit time, for instance.
A single transaction can run approximately two billion commands in it (2^31, minus IIRC a tiny bit of overhead. Actually, come to think of it, that may be 2^32 - the commandcounter is unsigned I think).
Each of those commands can modify multiple rows, of course.
For a project I work on, I perform 20 millions of INSERT. I tried with one big transaction and with one transaction for every million of INSERT and the performances seem exactly the same.
PostgreSQL 8.3
I believe the maximum amount of work is limited by your log file size. The database will never allow itself to not be able to rollback, so if you consume all your log space during the transaction, it will halt until you give it more space or rollback. This is a generally true for all databases.
I would recommend chunking your updates into manageable chunks that take a most a couple of minutes of execution time, that way you know if there's a problem earlier (eg what normally takes 1 minute is still running after 10 minutes... hmmm, did someone drop an index?)

Resources