Pessimistic vs. optimistic concurrency control - database

I have a question about pessimistic versus optimistic locking.
Everybody says that "optimistic locking is used when you don't expect many collisions", for example:
which concurrency control is more efficient pessimistic or optimistic concurrency control
Optimistic vs. Pessimistic locking
For a school project I need to find the 'break-even' point at which pessimistic locking becomes more appropriate than optimistic locking.
Now, I would like to know/understand why such a break-even point exists. How is it possible that pessimistic locking is more costly (in speed or memory usage?) than optimistic locking?
I suspect it is because of the extra read-operation that pessimistic locking requires. But with optimistic locking this extra read-operation is also needed (only just before the save operation), right?
Hopefully someone can explain this :)
Thank you!

Pessimism vs. optimism in concurrency control is about transaction implementations interfering with each other. (Notwithstanding any definitions expressed at your links or by specific products.)
The supposed pessimistic attitude is, someone will interfere so lock them out; the supposed optimistic attitude is, maybe no one will interfere so go ahead until completion then roll some process back if there was interference.
The costs are delays due to waiting by locked out processes vs delays due to re-computation by rolled back processes. We wish to optimize throughput given expected process properties and distribution.
(In your question you address only a given process rather than a collection and ignore a process having to wait or having to throw away work on rollback.)
EDIT
Think about what the words mean. Throughput involves work and time. A "break-even point" presumes a dimension (interference) along which a quantity (throughput) differs between schemes (pessimistic/optimistic). You have to come up with a way to characterize and measure work and interference. You can see others' takes on reasonable interference to test in textbooks and their bibliography references, e.g. On Optimistic Methods for Concurrency Control.
Experimentally, calculate throughput for each scheme running your DBMS under varying amounts of interference.
The reality is that different interference workloads (the "expected process properties and distribution" mentioned above) make the problem multidimensional. So you may want to calculate throughput as above for several different interference scenarios.
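As a hedged back-of-the-envelope sketch (my own toy model, not taken from the paper above; every symbol is an assumption for illustration): let T be the no-interference transaction time, p the fraction of transactions that experience interference, L the per-transaction cost of lock management and of blocking that happens even when no real conflict materializes, V the per-transaction validation/version bookkeeping cost of the optimistic scheme, W the average blocking delay when a conflict does occur, and A + R the work thrown away plus re-executed after a rollback. Then, roughly,

E_pessimistic(p) = T + L + p * W
E_optimistic(p)  = T + V + p * (A + R)

With the usual assumptions L > V and A + R > W, the break-even interference level is where the two are equal, p* = (L - V) / (A + R - W). Below p* the optimistic scheme wins because it rarely pays the expensive redo; above it the pessimistic scheme wins because waiting is cheaper than redoing. Your experiment is essentially estimating these quantities for your DBMS and workload.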

Related

Why do we need Rigorous 2PL when we have Strict 2PL?

In 2PL (two phase locking), what advantage(s) does the rigorous model have over the strict model?
I) There is no advantage over the strict model.
II) In contrast to the strict model, it guarantees that starvation cannot occur.
III) In contrast to the strict model, it guarantees that deadlock cannot occur.
IV) In contrast to the strict model, there is no need to predict data needed in the future.
My note says all of the above are false. I am a bit confused. Can someone clarify for me why all of these are false?
What is the Two-Phase Locking (2PL) Protocol?
A transaction is two-phase locked if:
before reading x, it sets a read lock on x
before writing x, it sets a write lock on x
it holds each lock until after it executes the corresponding operation
after its first unlock operation, it requests no new locks
Now, what is strict two-phase locking?
Here a transaction must hold all its exclusive locks till it commits/aborts.
But what's rigorous 2PL?
Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.
Going deeper:
Strict 2PL:
Same as 2PL, but hold all exclusive locks until the transaction has committed or aborted. It guarantees cascadeless recoverability.
Rigorous 2PL:
Same as strict 2PL, but hold all locks (shared and exclusive) until the transaction has committed or aborted. It is used in dynamic environments
where data access patterns are not known beforehand.
If this is combined with a timestamp-based scheme such as wait-die, there is no deadlock: a younger transaction requesting an item held by an
older transaction is aborted and restarted with the same timestamp, so starvation is avoided as well.
I hope the explanation above has made the concept, and the advantages of rigorous 2PL over the others, clear.
Thanks
I - there is an advantage
Look at this lecture note from UCLA:
Rigorous two-phase locking has the advantages of strict 2PL. In addition
it has the property that for two conflicting transactions, their commit order
is their serializability order. In some systems users might expect this behavior.
These lecture notes have an example (the model in the example is strict - not rigorous):
Consider two transactions conducted at the same site, in which a long-running transaction T1 that reads x is ordered before a short transaction T2 that writes x. T2 returns first, showing an updated version of x long before T1, which is based on the old version, completes.
II and III - does not affect deadlocks/starvation
Rigorous 2PL means that all locks are released after the transaction ends as opposed to strict where read-only locks may be released earlier. This doesn't affect deadlocks or starvation as those occur in the expanding phase (a transaction cannot acquire the needed lock). In a deadlock both processes are always in the expanding phase.
IV - both need to know the needed data for locking in the expanding phase - shrinking phase varies
Strict: I don't know the usual implementation details of strict 2PL, but if a read lock is released before a transaction ends, there has to be knowledge (a 100%-sure prediction, if you like) that the lock is not needed later in the transaction.
Rigorous: All the read locks are released at the end of a transaction and the transaction never has to evaluate if it should release a read lock or keep it for later reads in the transaction.
Is rigorous or strict more used/preferred?
Which of the two models to use would depend on the situation. Modern DBMSs use more complex concurrency handling than simple rigorous or strict 2PL. Having said that, judging by the Wikipedia article on two-phase locking, the rigorous (SS2PL) model is more widely used:
SS2PL [rigorous] has been the concurrency control protocol of choice for most database systems and utilized since their early days in the 1970s. [...]
2PL in its general form, as well as when combined with Strictness, i.e., Strict 2PL (S2PL), are not known to be utilized in practice. The popular SS2PL does not require marking "end of phase-1" as 2PL and S2PL do, and thus is simpler to implement. Also, unlike the general 2PL, SS2PL provides, as mentioned above, the useful Strictness and Commitment ordering properties. [...]
SS2PL Vs. S2PL: Both provide Serializability and Strictness. Since S2PL is a super class of SS2PL it may, in principle, provide more concurrency. However, no concurrency advantage is typically practically noticed (exactly same locking exists for both, with practically not much earlier lock release for S2PL), and the overhead of dealing with an end-of-phase-1 mechanism in S2PL, separate from transaction-end, is not justified. Also, while SS2PL provides Commitment ordering, S2PL does not. This explains the preference of SS2PL over S2PL. [...]
(As an illustration of what violates the protocol: a transaction that issues a lock request, e.g. lock(B), after a lock release, e.g. unlock(A), does not follow 2PL or S2PL.)
Rigorous two-phase locking is similar to strict two-phase locking, with two major differences:
In strict two-phase locking the shared locks are released in the shrinking phase, but in rigorous two-phase locking all the shared and exclusive locks are kept until the end of the transaction.
In rigorous two-phase locking we do not need to know the access pattern of locks on data items beforehand, so it is more appropriate for dynamic environments, while in strict two-phase locking the access pattern of locks should be specified at the start of the transaction.
So the fourth option is the correct one.
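To make the strict/rigorous distinction a bit more tangible, here is a hedged T-SQL sketch (SQL Server syntax; the table t and its columns are hypothetical). Under locking READ COMMITTED, shared locks are released as soon as each read finishes while exclusive locks are held to commit, which roughly matches the "strict" flavour described above (only exclusive locks survive to the end). Under SERIALIZABLE (or a HOLDLOCK hint), shared locks are also held until the end of the transaction, which is the rigorous/SS2PL behaviour.

-- t, id and col are hypothetical example objects
-- strict-like: only exclusive locks survive until COMMIT
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN TRAN;
    SELECT col FROM t WHERE id = 1;           -- S lock released right after the read
    UPDATE t SET col = col + 1 WHERE id = 2;  -- X lock held until COMMIT
COMMIT;

-- rigorous (SS2PL-like): every lock held until the transaction ends
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;
    SELECT col FROM t WHERE id = 1;           -- S (key-range) lock kept until COMMIT
    UPDATE t SET col = col + 1 WHERE id = 2;  -- X lock kept until COMMIT
COMMIT;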

Multiple writes in a relational database

I'm pretty sure that with a relational database, it's faster and better to read 50 records at once than to make 50 calls for one record each. Is there a performance benefit from performing multiple writes all at once? If not, why not?
Probably depends on the RDBMS and the storage engine, but at least in MySQL/InnoDB, doing multiple writes in one transaction (as well as using the multi-row insert syntax, which, AFAIK, is a MySQL extension) allows non-unique indexes not to be updated until the transaction is committed, and the index update then happens at once with all the new values (since it's a B-tree, this is much faster). It's possible that the RDBMS optimizes other writes as well, to get sequential instead of random writes.
Also, if there is table-level locking (as in MyISAM), locking the table once, writing multiple records and then unlocking removes the overhead of a lock/unlock for every write.
So generally there is a performance gain, but it depends on the database server used.
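A hedged sketch of the difference in MySQL (the table log_entries and its columns are made up for illustration):

-- 50 separate autocommit statements: 50 commits, 50 rounds of index maintenance
INSERT INTO log_entries (user_id, msg) VALUES (1, 'a');
INSERT INTO log_entries (user_id, msg) VALUES (2, 'b');
-- ...and 48 more of these...

-- one transaction using MySQL's multi-row INSERT syntax
START TRANSACTION;
INSERT INTO log_entries (user_id, msg) VALUES
    (1, 'a'),
    (2, 'b'),
    (3, 'c');   -- ...up to all 50 rows in one statement
COMMIT;

The second form pays the commit cost, the client round trips and (per the answer above) much of the secondary-index maintenance once instead of fifty times.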
Doing all your reads at once makes sense, although there are some problems in it which I'll touch on in a minute.
Doing all your writes at once poses a particular problem: the data isn't in the database until you put it there. If you're waiting for some optimization threshold (let's say 50), then transaction 1 is going to have to wait for (unrelated) transactions 2-50 to complete before it goes to the database. This means that in the meantime (which could be several seconds, minutes, or hours) nobody knows what those records are, or, if they're updated, what the new values are. (Same with reads, but the other way around: your data may be out of date by the time you get to use it.)
Performance-wise, I cannot imagine that grouping writes closer together would not give some performance benefit (if that was confusing to read: you should always get a performance boost by grouping). If nothing else, you have a better chance of hitting memory caches instead of going to disk than if you do the writes separately. @Darhazer brings up a good point about locking. So strictly from a total-time-spent-writing point of view, it is better to group them. From an application performance point of view, it's difficult to say without intricate knowledge of the business requirements of the app.

Databases: More questions about (B-Tree) indexes

I've been studying indexes and there are some questions that bother me and which I think are important.
If you can help or refer to sources, please feel free to do it.
Q1: B-tree indexes can favor a fast access to specific rows on a table. Considering an OLTP system, with many accesses, both Read and Write, simultaneously, do you think it can be a disadvantage having many B-tree indexes on this system? Why?
Q2: Why are B-Tree indexes not fully occupied (typically only 75% occupied, if I'm not mistaken)?
Q1: I've no administration experience with large indexing systems in practice, but the typical multiprocessing-environment drawbacks apply to having multiple B-tree indexes on a system -- cost of context switching, cache invalidation and flushing, poor IO scheduling, and the list goes on. On the other hand, IO is something that inherently ought to be non-blocking for maximal use of resources, and it's hard to do that without some sort of concurrency, even if done in a cooperative manner. (For example, some people recommend event-based systems.) Also, you're going to need multiple index structures for many practical applications, especially if you're looking at OLTP. The biggest things here are good IO scheduling, access patterns, and data caching tuned to those access patterns.
Q2: Because splitting and re-balancing nodes is expensive. The naive methodology for speed is "only split when they're full." Given this, there are two extremes -- a node was just split and is half full, or a node is full so it will split next time. The 'average' between the two cases (50% and 100%) is 75%. Yes, it's somewhat bad logic from a mathematics perspective, but it exposes the underlying reason why the 75% figure appears.
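To spell out that arithmetic (with the same caveat the answer already gives): a node that has just split is about 50% full, a node about to split is 100% full, and the naive average of those two states is (50% + 100%) / 2 = 75%. More careful analyses of B-trees under random insertions put the expected utilization closer to ln 2 ≈ 69%, which is why you will also see figures around two-thirds quoted; 75% is the common rule of thumb.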

when to prefer pessimistic model of transaction isolation over optimistic one?

Do I understand correctly that table/row lock hints are being used for pessimistic transaction (TX) isolation models of concurrency ONLY?
In other words, when can table/row lock hints be used during engagement of optimistic TX isolation provided by SQL Server (2005 and higher)?
When would one need pessimistic TX isolation levels/hints in SQL Server 2005+ if the latter provides built-in optimistic (aka snapshot, aka versioning) concurrency isolation?
I did read that pessimistic options are legacy and are not needed anymore, though I am in doubt.
Also, with optimistic (aka snapshot, aka versioning) TX isolation levels built into SQL Server 2005+,
when would one need to manually code for optimistic concurrency features?
The last question is inspired by having read:
"Optimistic Concurrency in SQL Server" (September 28, 2007)
describing custom coding to provide versioning in SQL Server.
Optimistic concurrency requires more resources and is more expensive when a conflict occurs.
Two sessions can read and modify the values, and the conflict only occurs when they try to apply their changes simultaneously. This means that in the case of a concurrent update both values have to be stored somewhere (which of course requires resources).
Also, when a conflict occurs, usually the whole transaction has to be rolled back or the cursor refetched, which is expensive too.
The pessimistic concurrency model uses locking, thus reducing concurrency but improving performance.
In the case of two concurrent tasks, it may be cheaper for the second task to wait for a lock to be released than to spend CPU time and disk I/O on two simultaneous pieces of work and then yet more on rolling back the less fortunate one and redoing it.
Say, you have a query like this:
UPDATE mytable
SET myvalue = very_complex_function(@range)
WHERE rangeid = @range
, with very_complex_function reading some data from mytable itself. In other words, this query transforms the subset of mytable sharing the given value of rangeid.
Now, when two functions work on the same range, there may be two scenarios:
Pessimistic: the first query locks, the second query waits for it. The first query completes in 10 seconds, the second one does too. Total: 20 seconds.
Optimistic: both queries work independently (on the same input). This shares CPU time between them, plus some overhead for switching. They have to keep their intermediate data somewhere, so the data is stored twice (which implies twice the I/O or memory). Let's say both complete at almost the same time, in 15 seconds.
But when it's time to commit the work, the second query will conflict and will have to roll back its changes (say it takes the same 15 seconds). Then it needs to reread the data and do the work again with the new set of data (10 seconds).
As a result, both queries complete later than with a pessimistic locking: 15 and 40 seconds vs. 10 and 20.
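A hedged T-SQL sketch of the two choices in SQL Server, reusing the mytable/@range query above (the exact locking depends on indexes and the optimizer, so treat this as an illustration, not a recipe):

-- Pessimistic: take an update lock up front; a second session simply waits
BEGIN TRAN;
    SELECT myvalue FROM mytable WITH (UPDLOCK, HOLDLOCK) WHERE rangeid = @range;
    UPDATE mytable SET myvalue = very_complex_function(@range) WHERE rangeid = @range;
COMMIT;

-- Optimistic: snapshot isolation (requires ALTER DATABASE ... SET ALLOW_SNAPSHOT_ISOLATION ON)
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
    -- if another session modified these rows after our snapshot began, this UPDATE
    -- fails with an update-conflict error and the whole transaction must be redone
    UPDATE mytable SET myvalue = very_complex_function(@range) WHERE rangeid = @range;
COMMIT;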
When would one need pessimistic TX isolation levels/hints in SQL Server 2005+ if the latter provides built-in optimistic (aka snapshot, aka versioning) concurrency isolation?
Optimistic isolation levels are, well, optimistic. You should not use them when you expect high contention on your data.
BTW, optimistic isolation (for the read queries) was available in SQL Server 2000 too.
I have a detailed answer here: Developing Modifications that Survive Concurrency
I think there's a bit of confusion over terminology here.
The technique of optimistic locking/optimistic concurrency/... is a programming technique used to avoid the following scenario:
start transaction
read data, setting a "read" lock on it to prevent any deletes/modifications to our data
display data on user's screen
await user input, lock remains active
keep awaiting user input, lock still preventing any writes/modifications
user input never comes (for whatever reason)
transaction times out (and this usually does not happen very rapidly, as the user must be given reasonable time to enter their input).
Optimistic locking replaces this with the following:
start transaction READ
read data, setting a "read" lock on it to prevent any deletes/modifications to our data
end transaction READ, releasing the read lock just set
display data on user's screen
await user input, but data can be modified/deleted meanwhile by other transactions
user input arrives
start transaction WRITE
verify that the data has remained unaltered, raising an exception if it has
apply user updates
end transaction WRITE
So the single "user transaction" to go fetch some data, and change and update them, consists of two distinct "database transactions". What is usually called "isolation levels" applies to those database transactions. The "optimistic locking" that you refer to applies to the "user transaction".
The matter is further complicated in that, broadly speaking, two completely distinct strategies are possible for the "isolating the database transactions" part:
MVCC
2-phase locking
I think the "snapshot versioning isolation level" means that the MVCC technique (well, one of its various possible variations) is being used for the database transaction. The other commonly known isolation levels apply more to transaction isolation using 2PL as the serialization(/isolation) technique. (And mixing them up can get messy ...)

Database deadlocks

One of the classical reasons we have a database deadlock is when two transactions are inserting and updating tables in a different order.
For example, transaction A inserts in Table A then Table B.
And transaction B inserts in Table B followed by A.
Such a scenario is always at risk of a database deadlock (assuming you are not using serializable isolation level).
My questions are:
What kind of patterns do you follow in your design to make sure that all transactions are inserting and updating in the same order?
A book I was reading had a suggestion that you can sort the statements by the name of the table. Have you done something like this or something different - which would enforce that all inserts and updates are in the same order?
What about deleting records? Delete needs to start from child tables and updates and inserts need to start from parent tables. How do you ensure that this would not run into a deadlock?
All transactions insert/update in the same order (see the sketch after this list).
Deletes: identify the records to be deleted outside a transaction, then attempt the deletion in the smallest possible transaction, e.g. looking up by the primary key or similar identified during the lookup stage.
Small transactions generally.
Indexing and other performance tuning, both to speed up transactions and to promote index lookups over table scans.
Avoid 'hot tables', e.g. one table with incrementing counters for other tables' primary keys. Any other 'switchboard' type configuration is risky.
Especially if not using Oracle, learn the locking behaviour of the target RDBMS in detail (optimistic/pessimistic, isolation levels, etc.).
Ensure you do not allow row locks to escalate to table locks, as some RDBMSes will.
Deadlocks are no biggie. Just be prepared to retry your transactions on failure.
And keep them short. Short transactions consisting of queries that touch very few records (via the magic of indexing) are ideal to minimize deadlocks - fewer rows are locked, and for a shorter period of time.
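A hedged T-SQL sketch of that retry advice (SQL Server 2012+ syntax; 1205 is the engine's "deadlock victim" error number, and the body of the transaction is a placeholder):

DECLARE @tries int = 0;
WHILE @tries < 3
BEGIN
    BEGIN TRY
        BEGIN TRAN;
            -- ...your short, index-friendly statements go here...
        COMMIT;
        BREAK;                         -- success: leave the retry loop
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK;
        IF ERROR_NUMBER() = 1205       -- chosen as deadlock victim: try again
            SET @tries = @tries + 1;
        ELSE
            THROW;                     -- anything else: re-raise
    END CATCH;
END;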
You need to know that modern database engines don't lock tables; they lock rows, so deadlocks are a bit less likely.
You can also avoid locking by using MVCC and the CONSISTENT READ transaction isolation level: instead of locking, some threads will just see stale data.
Carefully design your database processes to eliminate as much as possible transactions that involve multiple tables. When I've had database design control there has never been a case of deadlock for which I could not design out the condition that caused it. That's not to say they don't exist and perhaps abound in situations outside my limited experience; but I've had no shortage of opportunities to improve designs causing these kinds of problems. One obvious strategy is to start with a chronological write-only table for insertion of new complete atomic transactions with no interdependencies, and apply their effects in an orderly asynchronous process.
Always use the database default isolation levels and locking settings unless you are absolutely sure what risks they incur, and have proven it by testing. Redesign your process if at all possible first. Then, impose the least increase in protection required to eliminate the risk (and test to prove it.) Don't increase restrictiveness "just in case" - this often leads to unintended consequences, sometimes of the type you intended to avoid.
To repeat the point from another direction, most of what you will read on this and other sites advocating the alteration of database settings to deal with transaction risks and locking problems is misleading and/or false, as demonstrated by how they conflict with each other so regularly. Sadly, especially for SQL Server, I have found no source of documentation that isn't hopelessly confusing and inadequate.
I have found that one of the best investments I ever made in avoiding deadlocks was to use an Object Relational Mapper that could order database updates. The exact order is not important, as long as every transaction writes in the same order (and deletes in exactly the reverse order).
The reason that this avoids most deadlocks out of the box is that your operations are always table A first, then table B, then table C (which perhaps depends on table B).
You can achieve a similar result as long as you exercise care in your stored procedures or data layer's access code. The only problem is that it requires great care to do it by hand, whereas a ORM with a Unit of Work concept can automate most cases.
UPDATE: A delete should run forward to verify that everything is the version you expect (you still need record version numbers or timestamps) and then delete backwards once everything verifies. As this should all happen in one transaction, the possibility of something changing out from under you shouldn't exist. The only reason for the ORM doing it backwards is to obey the key requirements, but if you do your check forward, you will have all the locks you need already in hand.
I analyze all database actions to determine, for each one, if it needs to be in a multiple statement transaction, and then for each such case, what the minimum isolation level is required to prevent deadlocks... As you said serializable will certainly do so...
Generally, only a very few database actions require a multiple statement transaction in the first place, and of those, only a few require serializable isolation to eliminate deadlocks.
For those that do, set the isolation level for that transaction before you begin, and reset it to whatever your default is after it commits.
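A hedged T-SQL sketch of that set-and-reset pattern (assuming READ COMMITTED is the default you want to return to):

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;
    -- ...the few statements that genuinely need this isolation level...
COMMIT;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;   -- back to the assumed default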
Your example would only be a problem if the database locked the ENTIRE table. If your database is doing that...run :)
