Clarification regarding the use of safe_time in YugabyteDB - distributed-transactions

The document https://docs.yugabyte.com/latest/architecture/transactions/transactional-io-path/ says that a distributed txn can choose the safe_time from one of the involved tablets, and that safe_time considers the first uncommitted Raft log entry's hybrid timestamp. Does this mean that YugabyteDB guarantees that every transaction can read the data written by transactions that committed before it started?
[Disclaimer]: This question was first asked on the YugabyteDB Community Slack channel.

There are two components to choosing a read timestamp for a snapshot isolation transaction: (1) it needs to be recent enough to capture everything that was committed before the transaction started; and (2) it needs to be as low as possible to avoid unnecessary waiting. Choosing the safe time from the first tablet that a transaction reads from or writes to is just a heuristic towards the above goal.

Safe time takes the timestamp of the first uncommitted (in the Raft sense) record in that tablet's Raft log as one of its inputs, and what actually goes into the safe time calculation is that uncommitted timestamp minus "epsilon" (the smallest possible hybrid time step), so that the record committing will not change the view of the data as of this timestamp. Safe time is also capped by the hybrid time leader lease of the tablet's leader, so that we are safe against leader changes and a new leader trying to read at a timestamp past the leader lease expiration. All of the above concerns "snapshot safety", i.e. the property that if we are reading at some time read_ht, we are guaranteed that no writes will be done to that data with timestamps <= read_ht. If safe time on a tablet has not reached a particular read_ht when a read request arrives at that tablet, the tablet waits for it to reach read_ht before starting the read operation.

Now, let's address the question of how we guarantee that all the data written prior to a transaction starting is visible to that transaction. This is done through a mechanism called "read restarts" and a clock skew assumption. If a read request on behalf of a snapshot isolation transaction with a read time read_ht encounters a committed record with a commit timestamp in the range between read_ht and read_ht + max_clock_skew, that record might have been committed prior to our transaction starting (due to clock skew), and we have to restart the read request at that record's commit timestamp.

The way this is implemented has some optimizations: the value read_ht + max_clock_skew is computed only once per transaction and does not change with read restarts; we call it global_limit. It is an upper bound on the commit timestamp of any transaction that could have committed prior to our transaction starting, and by setting read_ht = global_limit (which is actually suitable in some cases, such as long-running reporting queries), we can safely avoid read restarts altogether. There is also a similar mechanism called local_limit, which limits the number of restarts to one per tablet. So, with read restarts, we can be sure that a read request will capture all records that were written prior to the transaction starting, even in the presence of clock skew.
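For intuition, here is a minimal sketch of the restart loop described above. The tablet interface, the MAX_CLOCK_SKEW constant, and the function names are hypothetical illustrations of the idea, not YugabyteDB's actual internals.

MAX_CLOCK_SKEW = 500  # assumed upper bound on clock skew, in hybrid-time units

def read_with_restarts(tablet, start_ht):
    read_ht = start_ht                        # e.g. safe time of the first tablet touched
    global_limit = start_ht + MAX_CLOCK_SKEW  # computed once, never changes on restart
    while True:
        tablet.wait_for_safe_time(read_ht)    # snapshot safety: no new writes at <= read_ht
        result, conflict_ht = tablet.read(read_ht, global_limit)
        if conflict_ht is None:
            return result                     # nothing committed in (read_ht, global_limit]
        # A record committed between read_ht and global_limit may actually have
        # been committed before our transaction started (clock skew), so
        # restart the read at that record's commit timestamp.
        read_ht = conflict_ht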

Related

How does a DBMS maintain atomicity during transactions?

I was reading about the ACID properties of a DBMS, one of which is atomicity.
http://ecomputernotes.com/database-system/rdbms/transaction
Scenario:
Suppose that, just prior to the execution of transaction Ti, the values of accounts A and B are Rs.1000 and Rs.2000.
Now, suppose that during the execution of Ti, a power failure occurred that prevented Ti from completing successfully. The point of failure may be after the completion of Write(A, a) and before Write(B, b). It means that the changes to A are performed but not to B. Thus the values of accounts A and B are Rs.950 and Rs.2000 respectively. We have lost Rs.50 as a result of this failure.
Now, our database is in inconsistent state.
My question is: in the case of a power failure that leaves us in an inconsistent state, how do we recover from it?
Can we do it at the application/code level?
How many ways are there to recover from it?
Generally speaking, these ways may differ from one database to another, but usually DBMSs ensure atomicity this way:
When a new request for a data change is received, the database first writes this request to a special log that represents the change vector. Only when this record has been successfully written can the transaction be committed.
In case of a power failure, this log persists, and the database can recover from the inconsistent state using it, applying the changes one by one.
For example, this log is called the Redo Log in Oracle and the WAL (write-ahead log) in PostgreSQL.
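As a rough illustration of the idea (a toy sketch only, not how Oracle's redo log or PostgreSQL's WAL are actually implemented): the log records are appended and flushed to stable storage before the data itself is touched, and recovery replays only the changes of committed transactions.

import json, os

class TinyWAL:
    def __init__(self, path):
        self.log = open(path, "a+")

    def commit(self, txn_id, changes):
        # 1. Append the change records and a commit marker to the log...
        self.log.write(json.dumps({"txn": txn_id, "changes": changes}) + "\n")
        self.log.write(json.dumps({"txn": txn_id, "commit": True}) + "\n")
        # 2. ...and force them to disk BEFORE the data files are updated.
        self.log.flush()
        os.fsync(self.log.fileno())
        # 3. Only now may the data pages be changed; if power fails here,
        #    recovery re-applies the logged changes of committed transactions.

    def recover(self):
        self.log.seek(0)
        records = [json.loads(line) for line in self.log]
        committed = {r["txn"] for r in records if r.get("commit")}
        # Re-apply, one by one, only the changes whose transaction committed.
        return [r for r in records if "changes" in r and r["txn"] in committed]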

Approach for Change tracking mechanism with concurrent writes

I need to implement data synchronization in a distributed system taking into account concurrent writes to the data table.
The export from Main DB should read only changed rows.
Common advice is to use triggers, marking rows with the timestamp of the data update or an incrementing revision number, and passing this timestamp/revision number to the remote system.
E.g. What is the best approach to pull "Delta" data into Analytics DB from a highly transactional DB?
But the problem with concurrent writes lies in the moment when the commit takes place. Here is the problem:
[time 00:00] Transaction A starts writing a big batch of data;
marking rows with timestamp [00:00].
[time 00:02] Transaction B starts writing a small amount of data;
going to mark rows with timestamp [00:02].
[time 00:03] Transaction B finishes writing and commit happens.
[time 00:10] Export is done to the remote part of the system.
The isolation level is ReadCommitted, so it sees only the data from transaction B (timestamp 00:02).
[time 00:15] Transaction A finishes writing and commit happens.
The remote system never receives this data(!)
I like the Change Tracking feature in MS SQL Server, but I keep in mind our intention to migrate to PostgreSQL soon, so I have to implement a general solution.
What is the correct approach to solve this problem?
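To make the race concrete, here is a toy sketch of why a watermark-based delta export misses transaction A's rows (the column and variable names are invented for illustration):

last_exported = "00:00"

def export(now, visible_rows):
    # Typical delta query: rows stamped after the previous watermark.
    global last_exported
    delta = [r for r in visible_rows if r["updated_at"] > last_exported]
    last_exported = now          # watermark advances to "now"
    return delta

# Transaction A stamps its rows at 00:00 but commits at 00:15;
# transaction B stamps at 00:02 and commits at 00:03.
print(export("00:10", [{"id": 2, "updated_at": "00:02"}]))   # sees only B's row
# After A commits, its rows carry updated_at = 00:00, which is already below
# the watermark (00:10), so the next export silently skips them forever.
print(export("00:20", [{"id": 1, "updated_at": "00:00"},
                       {"id": 2, "updated_at": "00:02"}]))    # returns []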

How do Narayana/XA recover from TM failures?

I was trying to reason about the failure recovery actions that can be taken by systems/frameworks which guarantee keeping data sources in sync. I've been unable to find a clear explanation of Narayana's recovery mechanism.
Q1: Does Narayana essentially employ a 2-phase commit to ensure distributed transactions across 2 datasources?
Q2: Can someone explain Narayana's behavior in this scenario?
Application wants to save X to 2 data stores
Narayana's transaction manager (TM) generates a transaction ID and writes info to disk
TM now sends a prepare message to both data stores
Each data store responds back with prepare_success
TM updates local transaction log and sends a commit message to both data stores
TM fails (permanently). Because of packet loss on the network, one data store never receives the commit message, but the other data store receives and successfully processes it.
The two data stores are now out of sync with each other (one source has an additional transaction that is not present in the other source).
When a new TM is brought up, it does not have access to the old transaction state records. So the TM cannot initiate the recovery of the missing transaction in one of the data stores.
So how can 2PC/Narayana/XA claim that they guarantee distributed transactions that can maintain 2 data stores in sync? From where I stand, they can only maintain synchronous data stores with a very high probability, but they cannot guarantee it.
Q3: Another scenario where I'm unclear on the behavior of the application/framework. Consider the following interleaved transactions (both on the same record - or at least with a partially overlapping set of records):
Di = Data source i
Ti = Transaction i
Pi = prepare message for transaction i
D1 receives P1; responds P1_success
D2 receives P2; responds P2_success
D1 receives P2; responds P2_failure
D2 receives P1; responds P1_failure
The order in which the network packets arrive at the different data sources can determine which prepare request succeeds. Does this not mean that at high transaction speeds for a contentious record - it is possible that all transactions will keep failing (until the record experiences a lower transaction request rate)?
One could argue that we are choosing consistency over availability but unlike ACID systems there is no guarantee that at least one of the transactions will succeed (thus avoiding a potentially long-lasting deadlock).
I would refer you to my article on how Narayana 2PC works
https://developer.jboss.org/wiki/TwoPhaseCommit2PC
To your questions
Q1: You already mentioned it in the comment - yes, Narayana uses 2PC; Narayana implements the XA specification (pubs.opengroup.org/onlinepubs/009680699/toc.pdf).
Q2: The steps in the scenario are not precise. Narayana writes to disk at the time prepare is called, not at the time the transaction is started. The corrected sequence is:
Application saves X to 2 data stores
TM now sends a prepare message to both data stores
Each data store responds back with prepare_success
TM saves permanently info about the prepared transaction and its ID to transaction log store
TM sends a commit message to both data stores
...
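Sketched as code (purely illustrative: the resource and log objects are hypothetical, and this is not Narayana's implementation), the ordering above looks roughly like this:

def two_phase_commit(txn_id, resources, txn_log):
    # Phase 1: prepare. Nothing is in the transaction log yet; if the TM
    # crashes here, recovery simply rolls the prepared resources back.
    for r in resources:
        if not r.prepare(txn_id):
            for r2 in resources:
                r2.rollback(txn_id)
            return False

    # The commit decision is made durable only after every prepare succeeded.
    txn_log.record_prepared(txn_id, resources)

    # Phase 2: commit. If the TM crashes (or a commit message is lost) after
    # this point, the recovery manager replays the log and re-sends commit
    # until every resource acknowledges, then removes the log record.
    for r in resources:
        r.commit(txn_id)
    txn_log.remove(txn_id)
    return True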
I don't agree that 2PC claims to guarantee keeping 2 data stores in sync.
I was wondering about this too (e.g. asked here https://developer.jboss.org/message/954043).
2PC claims to guarantee ACID properties. Having 2 stores in sync is more what CAP consistency is about.
Here, Narayana strictly depends on the capabilities of the particular resource managers (the data stores, or their JDBC drivers).
ACID declares:
atomicity - the whole transaction is committed or rolled back (it says nothing about when that happens, nor about the resources being in sync)
consistency - before the transaction starts and after it ends, the system is in a consistent state
durability - everything is stored durably, even when a crash occurs
isolation - (the tricky one, left for last) - to be ACID we have to be serializable; that is, you can observe transactions happening "one by one".
Take a very simplified example to show my point - assuming the DB is implemented in the naive way of locking the whole database when a transaction starts - the JMS message is committed and processed, but you have not yet committed the DB record. When the DB works at the serializable isolation level (which is what ACID requires!), your next write/read operation has to wait until the 'in-flight prepared' transaction is resolved. The DB is just stuck and waiting. If you read, you won't get an answer, so you can't say what the value is.
Narayana's recovery manager then comes to that prepared transaction after the connection is re-established and commits it, and your read action then returns information that is 'correct'.
Q3: I don't understand the question, sorry. But if you state that the order in which the network packets arrive at the different data sources can determine which prepare request succeeds, then you are right: you are doomed to get failing transactions until the network becomes more stable.

Prioritizing Transactions in Google AppEngine

Let's say I need to perform two different kinds of write operations on a datastore entity that might happen simultaneously, for example:
The client that holds a write-lock on the entry updates the entry's content
The client requests a refresh of the write-lock (updates the lock's expiration time-stamp)
As the content-update operation is only allowed if the client holds the current write-lock, I need to perform the lock-check and the content-write in a transaction (unless there is another way that I am missing?). Also, a lock-refresh must happen in a transaction because the client needs to first be confirmed as the current lock-holder.
The lock-refresh is a very quick operation.
The content-update operation can be quite complex. Think of it as the client sending the server a complicated update-script that the server executes on the content.
Given this, if there is a conflict between those two transactions (should they be executed simultaneously), I would much rather have the lock-refresh operation fail than the complex content-update.
Is there a way that I can "prioritize" the content-update transaction? I don't see anything in the docs and I would imagine that this is not a specific feature, but maybe there is some trick I can use?
For example, what happens if my content-update reads the entry, writes it back with a small modification (without committing the transaction), then performs the lengthy operation and finally writes the result and commits the transaction? Would the first write be applied immediately and cause a simultaneous lock-refresh transaction to fail? Or are all writes kept until the transaction is committed at the end?
Is there such a thing as keeping two transactions open? Or doing an intermediate commit in a transaction?
Clearly, I can just split my content-update into two transactions: The first one sets a "don't mess with this, please!"-flag and the second one (later) writes the changes and clears that flag.
But maybe there is some other trick to achieve this with fewer reads/writes/transactions?
Another thought I had was that there are 3 different "blocks" of data: The current lock-holder (LH), the lock expiration (EX), and the content that is being modified (CO). The lock-refresh operation needs to perform a read of LH and a write to EX in a transaction, while the content-update operation needs to perform a read of LH, a read of CO, and a write of CO in a transaction. Is there a way to break the data apart into three entities and somehow have the transactions span only the needed entities? Since LH is never modified by these two operations, this might help avoid the conflict in the first place?
The datastore uses optimistic concurrency control, which means that a (datastore primitive) transaction does its work and then, at commit, succeeds only if someone else hasn't committed first. Typically, the app retries the failed transaction with fresh data. There is no way to modify this first-wins behavior.
It might help to know that datastore transactions are strongly consistent, so a client can first commit a lock refresh with a synchronous datastore call, and when that call returns, the client knows for sure whether it obtained or refreshed the lock. The client can then proceed with its update and lock clear. The case you describe where a lock refresh and an update might occur concurrently from the same client sounds avoidable.
I'm assuming you need the lock mechanism to prevent writes from other clients while the lock owner performs multiple datastore primitive transactions. If a client is actually only doing one update before it releases the lock and it can do so within seconds (well before the datastore RPC timeout), you might get by with just a primitive datastore transaction with optimistic concurrency control and retries. But a lock might be a good idea for simple serialization of, say, edits to a record in a user interface, where a user hits an "edit" button in a UI and you want that to guarantee that the user has some time to prepare and submit changes without the record being changed by someone else. (Whether that's the user experience you want is your decision. :) )
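The retry-on-conflict pattern described above can be sketched generically like this (illustrative only; begin_transaction and the exception types are hypothetical stand-ins, since the datastore SDK provides its own transaction helpers):

import random, time

class ConcurrentModification(Exception): pass   # raised by commit() on conflict
class TooMuchContention(Exception): pass

def run_in_transaction(begin_transaction, work, attempts=5):
    for attempt in range(attempts):
        txn = begin_transaction()
        try:
            result = work(txn)
            txn.commit()          # first committer wins; losers get a conflict
            return result
        except ConcurrentModification:
            txn.rollback()
            # Retry with fresh data after a short randomized backoff.
            time.sleep(random.uniform(0, 0.05 * (2 ** attempt)))
    raise TooMuchContention("gave up after %d attempts" % attempts)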

when to prefer pessimistic model of transaction isolation over optimistic one?

Do I understand correctly that table/row lock hints are being used for pessimistic transaction (TX) isolation models of concurrency ONLY?
In other words, when can table/row lock hints be used while the optimistic TX isolation provided by SQL Server (2005 and higher) is engaged?
When would one need pessimistic TX isolation levels/hints in SQL Server 2005+ if the latter provides built-in optimistic (aka snapshot aka versioning) concurrency isolation?
I did read that the pessimistic options are legacy and are not needed anymore, though I have my doubts.
Also, having optimistic (aka snapshot aka versioning) TX isolation levels built into SQL Server 2005+,
when would one need to manually code optimistic concurrency features?
The last question is inspired by having read:
"Optimistic Concurrency in SQL Server" (September 28, 2007)
describing custom coding to provide versioning in SQL Server.
Optimistic concurrency requires more resources and is more expensive when a conflict occurs.
Two sessions can read and modify the values, and the conflict only occurs when they try to apply their changes simultaneously. This means that in the case of a concurrent update, both versions of the values must be stored somewhere (which of course requires resources).
Also, when a conflict occurs, usually the whole transaction must be rolled back or the cursor refetched, which is expensive too.
The pessimistic concurrency model uses locking, thus reducing concurrency but improving performance.
In the case of two concurrent tasks, it may be cheaper for the second task to wait for a lock to be released than to spend CPU time and disk I/O on two simultaneous pieces of work, and then yet more on rolling back the less fortunate one and redoing it.
Say, you have a query like this:
UPDATE mytable
SET myvalue = very_complex_function(@range)
WHERE rangeid = @range
, with very_complex_function reading some data from mytable itself. In other words, this query transforms the subset of mytable that shares the value of range.
Now, when two functions work on the same range, there may be two scenarios:
Pessimistic: the first query locks, the second query waits for it. The first query completes in 10 seconds, the second one does too. Total: 20 seconds.
Optimistic: both queries work independently (on the same input). This shares CPU time between them, plus some overhead for switching. They also have to keep their intermediate data somewhere, so the data is stored twice (which implies twice the I/O or memory). Let's say both complete almost at the same time, in 15 seconds.
But when it's time to commit the work, the second query will conflict and will have to roll back its changes (say that takes the same 15 seconds). Then it needs to reread the data and do the work again, with the new set of data (10 seconds).
As a result, both queries complete later than with pessimistic locking: 15 and 40 seconds vs. 10 and 20.
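The arithmetic from this scenario, spelled out (the 1.5x slowdown for interleaved execution is simply the assumption behind the numbers above):

work = 10                              # one query running alone, in seconds

# Pessimistic: the second query just waits for the first one's lock.
pess = (work, work + work)             # (10, 20)

# Optimistic: both run interleaved, roughly 1.5x slower each; the loser then
# rolls back its 15 seconds of work and redoes it alone.
interleaved = 1.5 * work               # 15
opt = (interleaved, interleaved + interleaved + work)   # (15, 40)

print(pess, opt)                       # (10, 20) (15.0, 40.0)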
When would one need pessimistic TX isolation levels/hints in SQL Server 2005+ if the latter provides built-in optimistic (aka snapshot aka versioning) concurrency isolation?
Optimistic isolation levels are, well, optimistic. You should not use them when you expect high contention on your data.
BTW, optimistic isolation (for the read queries) was available in SQL Server 2000 too.
I have a detailed answer here: Developing Modifications that Survive Concurrency
I think there's a bit of confusion over terminology here.
The technique of optimistic locking/optimistic concurrency/... is a programming technique used to avoid the following scenario:
start transaction
read data, setting a "read" lock on it to prevent any deletes/modifications to our data
display data on user's screen
await user input, lock remains active
keep awaiting user input, lock still preventing any writes/modifications
user input never comes (for whatever reason)
transaction times out (and this usually does not happen very quickly, as the user must be given reasonable time to enter their input).
Optimistic locking replaces this with the following:
start transaction READ
read data, setting a "read" lock on it to prevent any deletes/modifications to our data
end transaction READ, releasing the read lock just set
display data on user's screen
await user input, but data can be modified/deleted meanwhile by other transactions
user input arrives
start transaction WRITE
verify that the data has remained unaltered, raising an exception if it has changed
apply user updates
end transaction WRITE
So the single "user transaction" (fetch some data, change it, and write it back) consists of two distinct "database transactions". What is usually called "isolation levels" applies to those database transactions. The "optimistic locking" that you refer to applies to the "user transaction".
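A common way to implement the "verify that the data has remained unaltered" step is a version (or timestamp) column. A minimal sketch using Python's sqlite3-style API follows; the table and column names are made up for illustration:

def read_for_edit(conn, row_id):
    # Transaction READ: fetch the data plus its current version, then release locks.
    cur = conn.execute("SELECT payload, version FROM doc WHERE id = ?", (row_id,))
    payload, version = cur.fetchone()
    conn.commit()
    return payload, version

def write_after_edit(conn, row_id, new_payload, version_seen):
    # Transaction WRITE: apply the update only if nobody changed the row meanwhile.
    cur = conn.execute(
        "UPDATE doc SET payload = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_payload, row_id, version_seen))
    if cur.rowcount == 0:
        conn.rollback()
        raise RuntimeError("row was modified by another user transaction")
    conn.commit()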
The matter is further complicated in that, broadly speaking, two completely distinct strategies are possible for the "isolating the database transactions" part:
MVCC
2-phase locking
I think the "snapshot versioning isolation level" means that the MVCC technique (well, one of its various possible variations) is being used for the database transaction. The other commonly known isolation levels apply more to transaction isolation using 2PL as the serialization(/isolation) technique. (And mixing them up can get messy ...)
