Hallo,
I want to get data into a database on a multicore system with ative WAL using JDBC. I was thinking about spawning multiple threads in my application to insert data parallely.
If the application has multiple threads I will have to increase the isolation level to Repeatable Read which on MVCC-databases should be mapped to Snapshot isolation.
If I were using one thread I wouldn't need isolation levels. As far as I know most Snapshot isolation databases analyze the write sets of all transaction that could have a conflict and then rollback all but one of the real conflict transactions. More specific I'm talking about Oracle, InnoDB and PostgreSQL.
1.) Is this analyzing of the write sets expensive?
2.) Is it a good idea to multithread the inserts for a higher total throughput? Real conflict are nearly impossible because of the application layer feeding the threads conflict free stuff. But the database shall be a safety net.
Oracle does not support Repeatable Read. It supports only Read Committed and Serializable. I might be mistaken, but setting an isolation level of Repeatable Read for Oracle might result in a transaction with an isolation level of Serializable. In short, you are left to mercy of the database support for the isolation levels that you desire.
I cannot speak for InnoDB and PostgreSQL, but the same would apply if they do not support the required isolation levels. The database could automatically upgrade the isolation level to a higher level to meet the desired isolation characteristics. You ought to rethink this approach, if your application's desired isolation level has to be Repeatable Read.
The problem like you've rightly inferred is that optimistic locking will possibly result in transaction rollbacks, if a conflict is detected. Oracle does so by reporting the ORA-08177 SQL error. Since this error is reported when two threads will access the same data range, it could be avoided if the threads work against data sets involving different data ranges. You will have to ensure that this is the case when dividing work across threads.
I think the limiting factor here will be disk IO, not the overhead of moving to Repeatable Read.
Even a single thread may be able to max out the disks on the DB server especially with the amount of DB logging required on insert / update. Are you sure that's not already the case?
Also, in any multi-user system, you probably want to be running with Repeatable Read isolation anyway (Postgres only supports this and serializable). So, I don't think of this as adding any "overhead" above what I would normally see.
Related
MSDN describes the JET transaction isolation for its OLEDB provider as follows:
Jet supports five levels of nesting in transactions. The only
supported mode for transactions is Read Committed. Setting lesser
levels of transactional separation implies Read Committed. Setting
higher levels will cause StartTransaction to fail.
Jet supports only single-phase commit.
MSDN describes Read Committed as follows:
Specifies that shared locks are held while the data is being read to
avoid dirty reads, but the data can be changed before the end of the
transaction, resulting in nonrepeatable reads or phantom data. This
option is the SQL Server default.
My questions are:
What is single-phase commit? What consequence does this have for transactions and isolation?
Would the Read Committed isolation level as described above be suitable for my requirements here?
What is the best way to achieve a Serializable transaction isolation using Jet?
By question number:
Single-phase commit is used where all of your data is in one database -- the activity of the transaction is committed atomically and you're done. If you have a logical transaction which needs to be spread across multiple storage engines (like a relational database for metadata and some sort of document store for a big blob) you can use a transaction manager to coordinate the activities so that the work is persisted in both or neither, if both products support two phase commit. They are just telling you that they don't support two-phase commit, so the product is not suitable for distributed transactions.
Yes, if you check the condition in the UPDATE statement itself; otherwise you might have problems.
They seem to be suggesting that you can't.
As an aside, I worked for decades as a consultant in quite a variety of environments. More than once I was engaged to migrate people off of Jet because of performance problems. In one case a simple "star" type query was running for two minutes because it was joining on the client rather than letting the database do it. As a direct query against the database it was sub-second. In another case there was a report which took 72 hours to run through Jet, which took 2 minutes when run directly against the database. If it generally works OK for you, you might be able to deal with such situations by using stored procedures where Jet is causing performance pain.
I've read that in older versions of SQL Server .. it had a pessimistic locking strategy. I.e. readers wait on writers for access to the same data (row or page level), unlike Oracle.
Is this still the case in newer versions ? I've read that the locking strategy has been changed in recent versions.
What you heard of is the SNAPSHOT ISOLATION, available since SQL Server 2005. Snapshot isolation, aka. row-versioning, is the default behavior in Oracle. You can make it default in SQL Server too, by enabling READ_COMMITTED_SNAPSHOT on the database:
ALTER DATABASE [<dbname>] SET READ_COMMITTED_SNAPSHOT ON;
With row versioning SQL Server does not acquire data locks during reads. If concurrent writes occur, the read will fetch the previous version of the row. For more details, read Row Versioning-based Isolation Levels in the Database Engine.
You should not confuse row versioning and snapshot with dirty reads. Dirty reads offer inconsistent data which makes programming a challenge, to say the least (ie. you should not use it!). Snapshot reads offer always a transactionally consistent image of the data.
by default SQL server uses READ COMMITTED isolation level, which means it will wait on uncommited changes before it tries to read them.
http://msdn.microsoft.com/en-us/library/ms173763.aspx
note that if you don't care about the accuracy of the data returned, you can always set you isolation level to Read Uncommitted this will give you all the records evening the ones that have binding changes
You can also use snapshot isolation level, which will give you all the record, including the latest known version of data that are currently being modified, without the current modification.
The locking strategy is something that is handled on connection by connection basis - this is something that can be set by the application and withing SQL Server itself.
Read about the Transaction Isolation Levels for more details.
Are there any issues using SNAPSHOT isolation to read data consistently for viewing without locking, blocking or dirty/phantom reads, while a separate process is processing continuous incoming data in serializable transactions?
We need readers (guaranteed read-only: web data sync, real-time monitoring views, etc) to be able to read consistent data, without being blocked, or blocking the updates. We were using SNAPSHOT for everything, but had too many consistency failures so switched the updating process to SERIALIZABLE.
I've read about but am not totally clear as to the impacts of using different isolation levels concurrently. I've seen the lock compatibility matrix, and read various info. It seems ok, but I'd really appreciate some wise guidance from people with practical experience about any major pitfalls.
Are there any issues using Snapshot isolation for the readers while SERIALIZABLE transactions are writing? Are there circumstances it will block a SERIALIZABLE transaction? Is there a benefit to using SNAPSHOT vs READ COMMITTED (with READ_COMMITTED_SNAPSHOT ON)?
Thanks, any assistance greatly appreciated :-)
Reads performed under SNAPSHOT isolation level read any modified data from the version store. As such they are affected only by writes. Writes behave identically under all isolation levels. Therefore SNAPSHOT reads behave the same way no matter the isolation level of the concurent transactions.
READ_COMMITTED_SNAPSHOT ON makes READ COMMITTED act as SNAPSHOT. In effect, it is SNAPSHOT: the READ_COMMITTED_SNAPSHOT was provided as a quick way to port applications to SNAPSHOT w/o code changes. So everything said on the first paragraph applies.
I understand that an Isolation level of Serializable is the most restrictive of all isolation levels. I'm curious though what sort of applications would require this level of isolation, or when I should consider using it?
Ask yourself the following question: Would it be bad if someone were to INSERT a new row into your data while your transaction is running? Would this interfere with your results in an unacceptable way? If so, use the SERIALIZABLE level.
From MSDN regarding SET TRANSACTION ISOLATION LEVEL:
SERIALIZABLE
Places a range lock on the data set,
preventing other users from updating
or inserting rows into the data set
until the transaction is complete.
This is the most restrictive of the
four isolation levels. Because
concurrency is lower, use this option
only when necessary. This option has
the same effect as setting HOLDLOCK on
all tables in all SELECT statements in
a transaction.
So your transaction maintains all locks throughout its lifetime-- even those normally discarded after use. This makes it appear that all transactions are running one at a time, hence the name SERIALIZABLE. Note from Wikipedia regarding isolation levels:
SERIALIZABLE
This isolation level specifies that
all transactions occur in a completely
isolated fashion; i.e., as if all
transactions in the system had
executed serially, one after the
other. The DBMS may execute two or
more transactions at the same time
only if the illusion of serial
execution can be maintained.
The SERIALIZABLE isolation level is the highest isolation level based on pessimistic concurrency control where transactions are completely isolated from one another.
The ANSI/ISO standard SQL 92 covers the following read phenomena when one transaction reads data, which is changed by second transaction:
dirty reads
non-repeatable reads
phantom reads
and Microsoft documentation extends with the following two:
lost updates
missing and double reads caused by row updates
The following table shows the concurrency side effects enabled by the different isolation levels:
So, the question is what read phenomena are allowed by your business requirements and then to check if your hardware environment can handle stricter concurrency control?
Note, something very interesting about the SERIALIZABLE isolation level - it is the default isolation level specified by the SQL standard. In the context of SQL Server of course, the default is READ COMMITTED.
Also, the official documentation about Transaction Locking and Row Versioning Guide is a great place where a lot of aspects are covered and explained.
Try accounting. Transactions in accounts are inherently serializable if you want to have proper account values AND adhere to things like credit limits.
It behaves in a way that when you try to update a row, It simply blocks the updation process until the transaction is completed.
The Snapshot Isolation feature helps us to solve the problem where readers lock out writers on high volume sites. It does so by versioning rows using tempdb in SqlServer.
My question is to correctly implement this Snapshot Isolation feature, is it just a matter of executing the following on my SqlServer
ALTER DATABASE MyDatabase
SET ALLOW_SNAPSHOT_ISOLATION ON
ALTER DATABASE MyDatabase
SET READ_COMMITTED_SNAPSHOT ON
Do I still also have to write code that includes TransactionScope, like
using (new TransactionScope(TransactionScopeOption.Required,
new TransactionOptions { IsolationLevel = IsolationLevel.SnapShot}))
Finally, Brent pointed out his concern in this post under section The Hidden Costs of Concurrency, where he mentioned as you version rows in tempdb, tempdb may run out of space, and may have performance issues since it has to lookup versioned rows. So my question is I know this site uses Snapshot Isolation, anyone else uses this feature on large sites and what's your opinion on the performance?
Thx,
Ray.
It is "just a matter of executing the following", as stated in https://msdn.microsoft.com/en-us/library/tcbchxcb(v=vs.110).aspx, "If the READ_COMMITTED_SNAPSHOT option is set to OFF, you must explicitly set the Snapshot isolation level for each session in order to access versioned rows." Your second ALTER DATABASE command sets the READ_COMMITTED_SNAPSHOT ON so code does not need to specify that TransactionScope.
There are two sides to a performance coin, whenever one seeks an opinion about performance is "sufficient" versus "insufficient": Either "supply" is underwhelming or "demand" is overwhelming.... For this post, "supply" could refer to the performance and space used by tempdb, while the "demand" could concern the rate at which writes to tempdb occur. On the supply side, a variety of HW (from a single spindle 5400 RPM disk to arrays of SSDs) can be used. On the demand side, this isn't a SQL Server concern (although failing to properly normalize a database design can be a factor) as much as its a client code concern.
My SQL Servers see clients concurrently demanding approximately 50 writes/minute and 2000 batches/minute, where the writes are usually on the OTLP/short side. I have 1 TB of databases and a 30 GB tempdb, per SQL Server. All databases are in general normalized to 3rd normal form. All databases are running on SSDs. I have no concerns about the tempdb disk's IO throughput capacity being exceeded. As a result, I have had no concerns about enabling snapshot isolation on my systems. But, I have seen other systems where enabling snapshot isolation was attempted, but quickly abandoned.
Your system's experience can vary from any other respondent's system, by orders of magnitude. You should seek to profile/reliably replay your system's writes, along with replaying other uses of tempdb (including sorts), in order to come up with your own conclusions for your own system (for a variety of HW with sufficient space for your system's resulting tempdb size). Load testing should not be an afterthought :). You should also benchmark your tempdb disk's IO throughput capacity - see https://technet.microsoft.com/library/Cc966412, and be prepared to spend more money if its IO throughput capacity ends up being insufficient.