How do Narayana/XA recover from TM failures? - distributed-transactions

I was trying to reason about failure recovery actions that can be taken by systems/frameworks which guarantee synchronous data sources. I've been unable to find a clear explanation of Narayana's recovery mechanism.
Q1: Does Narayana essentially employ a 2-phase commit to ensure distributed transactions across 2 datasources?
Q2: Can someone explain Narayana's behavior in this scenario?
Application wants to save X to 2 data stores
Narayana's transaction manager (TM) generates a transaction ID and writes info to disk
TM now sends a prepare message to both data stores
Each data store responds back with prepare_success
TM updates local transaction log and sends a commit message to both data stores
TM fails (permanently). Because of packet loss on the network, only one data store receives and successfully processes the commit message; the other data store never receives it.
The two data stores are now out of sync with each other (one source has an additional transaction that is not present in the other source).
When a new TM is brought up, it does not have access to the old transaction state records. So the TM cannot initiate the recovery of the missing transaction in one of the data stores.
So how can 2PC/Narayana/XA claim that they guarantee distributed transactions that can maintain 2 data stores in sync? From where I stand, they can only maintain synchronous data stores with a very high probability, but they cannot guarantee it.
Q3: Another scenario where I'm unclear on the behavior of the application/framework. Consider the following interleaved transactions (both on the same record - or at least with a partially overlapping set of records):
Di = Data source i
Ti = Transaction i
Pi = prepare message for transaction i
D1 receives P1; responds P1_success
D2 receives P2; responds P2_success
D1 receives P2; responds P2_failure
D2 receives P1; responds P1_failure
The order in which the network packets arrive at the different data sources can determine which prepare request succeeds. Does this not mean that at high transaction speeds for a contentious record - it is possible that all transactions will keep failing (until the record experiences a lower transaction request rate)?
One could argue that we are choosing consistency over availability but unlike ACID systems there is no guarantee that at least one of the transactions will succeed (thus avoiding a potentially long-lasting deadlock).

I would refer you to my article on how Narayana 2PC works
https://developer.jboss.org/wiki/TwoPhaseCommit2PC
To your questions
Q1: As you already mentioned in the comment: yes, Narayana uses 2PC. Narayana implements the XA specification (pubs.opengroup.org/onlinepubs/009680699/toc.pdf).
Q2: The steps in the scenario are not precise. Narayana writes to disk when prepare is called, not when the transaction is started.
Application saves X to 2 data stores
TM now sends a prepare message to both data stores
Each data store responds back with prepare_success
TM durably saves information about the prepared transaction and its ID to the transaction log store
TM sends a commit message to both data stores
...
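The corrected step order above can be sketched as a toy coordinator. This is only an illustration of the general 2PC idea, not Narayana's API: the `Resource` class, the `log` dictionary standing in for the transaction log store, and the `crash_after` parameter are all assumptions made up for the example.

```python
class Resource:
    """Toy resource manager: keeps prepared (in-doubt) work until told the outcome."""

    def __init__(self, name):
        self.name = name
        self.prepared = {}   # txid -> pending value, survives a TM crash
        self.committed = {}  # txid -> committed value

    def prepare(self, txid, value):
        self.prepared[txid] = value
        return True          # vote "prepare_success"

    def commit(self, txid):
        # Idempotent: a repeated commit for the same txid is a no-op.
        if txid in self.prepared:
            self.committed[txid] = self.prepared.pop(txid)


def run_transaction(txid, value, resources, log, crash_after=None):
    """Coordinator side; `log` is a hypothetical stand-in for the durable log store."""
    votes = [r.prepare(txid, value) for r in resources]
    if not all(votes):
        return
    log[txid] = "COMMIT"     # decision is recorded before any commit message is sent
    for sent, r in enumerate(resources):
        if crash_after is not None and sent >= crash_after:
            return           # simulated TM crash: remaining commits are never sent
        r.commit(txid)
    del log[txid]            # every resource committed, the record can be dropped


def recover(resources, log):
    """Recovery pass: re-send the logged decision to every resource."""
    for txid in list(log):
        for r in resources:
            r.commit(txid)
        del log[txid]


d1, d2 = Resource("D1"), Resource("D2")
log = {}
run_transaction("tx1", "X", [d1, d2], log, crash_after=1)
before = (dict(d1.committed), dict(d2.committed), dict(d2.prepared))
recover([d1, d2], log)
```

After the simulated crash, D1 has committed while D2 still holds the transaction as prepared; the recovery pass finishes it because the decision survived in the log. If the log is lost together with the TM, as in the question, nothing in this sketch can resolve the prepared entry on D2.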
I don't agree that 2PC claims to guarantee to maintain 2 data stores in sync.
I was wondering about this too (e.g. asked here https://developer.jboss.org/message/954043).
2PC claims to guarantee ACID properties. Keeping 2 stores in sync is closer to what CAP consistency is about.
In this respect Narayana depends strictly on the capabilities of the particular resource managers (the data stores or the JDBC drivers of the data stores).
ACID declares
atomicity - whole transaction is committed or rolled-back (no info when it happens, no info about resources in sync)
consistency - before and when the transaction ends the system is in consistent state
durability - all is stored even when a crash occurs
isolation - (tricky one, left for the end) - to be ACID we have to be serializable. That is, you can observe transactions happening "one by one".
To show my point with a deliberately simplified example, assume the DB is implemented in a naive way and locks the whole database when a transaction starts. You committed the JMS message, which has been processed, but the DB record has not been committed yet. When the DB works at the serializable isolation level (that's what ACID requires!), your next read or write operation has to wait until the 'in-flight prepared' transaction is resolved. The DB is just stuck waiting. If you read, you won't get an answer, so you can't say what the value is.
Narayana's recovery manager then comes to that prepared transaction once the connection is re-established and commits it. Your read action then returns information that is 'correct'.
Q3: I don't understand the question, sorry. But if you are saying that "the order in which the network packets arrive at the different data sources can determine which prepare request succeeds", then you are right: you are doomed to get failing transactions until the network becomes more stable.
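The interleaving from Q3 can be replayed with a small simulation. It assumes, purely for illustration, that each data source grants an exclusive lock on the contended record to the first prepare it receives and votes failure for any other transaction; `DataSource` and `deliver` are hypothetical names, not part of any XA implementation.

```python
class DataSource:
    def __init__(self, name):
        self.name = name
        self.lock_owner = None  # transaction currently holding the record lock

    def prepare(self, txid):
        # The first transaction to arrive takes the lock; others get a failure vote.
        if self.lock_owner is None:
            self.lock_owner = txid
        return self.lock_owner == txid

    def rollback(self, txid):
        if self.lock_owner == txid:
            self.lock_owner = None


def deliver(arrivals):
    """arrivals: list of (data source, txid) pairs in network arrival order.
    Returns txid -> True if every data source voted success for it."""
    votes = {}
    for source, txid in arrivals:
        votes.setdefault(txid, []).append(source.prepare(txid))
    return {txid: all(v) for txid, v in votes.items()}


d1, d2 = DataSource("D1"), DataSource("D2")

# Arrival order from the question: D1 gets P1 first, D2 gets P2 first.
interleaved = deliver([(d1, "T1"), (d2, "T2"), (d1, "T2"), (d2, "T1")])

# Both transactions abort and release their locks.
for d in (d1, d2):
    for t in ("T1", "T2"):
        d.rollback(t)

# If P1 happens to reach both data sources first, T1 can commit.
ordered = deliver([(d1, "T1"), (d2, "T1"), (d1, "T2"), (d2, "T2")])
```

With the interleaved arrival order neither transaction collects two success votes, so both roll back; with the second order T1 succeeds and only T2 fails.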

Related

Clarification in regards to using safe_time in YugabyteDB

The document https://docs.yugabyte.com/latest/architecture/transactions/transactional-io-path/ says that a distributed txn can choose the safe_time from one of the involved tablets, and that safe_time considers the hybrid timestamp of the first uncommitted record in the Raft log. Does this mean that YugabyteDB guarantees that every txn can read the data written by txns committed before it starts?
[Disclaimer]: This question was first asked on the YugabyteDB Community Slack channel.
There are two components to choosing a read timestamp for a snapshot isolation transaction: (1) it needs to be recent enough to capture everything that has been committed before the transaction started; and (2) it needs to be as low as possible to avoid unnecessary wait. Choosing the safe time from the first tablet that a transaction reads from or writes to is just a heuristic towards the above goal.
Safe time considers the timestamp of the first uncommitted (in the Raft sense) record in that tablet's Raft log as one of the inputs, and what actually goes into safe time calculation is that uncommitted timestamp minus "epsilon" (smallest possible hybrid time step) so that that record committing will not change the view of data as of this timestamp (and also safe time is capped by the hybrid time leader lease of the tablet's leader so that we are safe against leader changes and a new leader trying to read at a new timestamp past the leader lease expiration). So, all of the above concerns the "snapshot safety" (i.e. the property that if we are reading at some time read_ht, we are guaranteed that no writes will be done to that data with timestamps <= read_ht). If safe time on a tablet has not reached a particular read_ht when a read request arrives at that tablet, the tablet will wait for it to reach read_ht before starting the read operation.
Now, let's address the question how we guarantee that all the data written prior to a transaction starting is visible to that transaction. This is done through a mechanism called "read restarts" and a clock skew assumption. If a read request on behalf of a snapshot isolation transaction with a read time read_ht encounters a committed record with a commit timestamp range between read_ht and read_ht + max_clock_skew, that record might have been committed prior to the transaction starting (due to clock skew) and we have to restart the read request at the timestamp of that transaction.
The way this is implemented has some optimizations: the value read_ht + max_clock_skew is only computed once per transaction and does not change with read restarts, and we call it global_limit. It is an upper bound on the commit timestamp of any transaction that could have committed prior to our transaction operation starting, and by setting read_ht = global_limit (which is actually suitable in some cases, like long-running reporting queries), we can safely avoid any read restarts. There is also another similar mechanism called local_limit, which limits the number of restarts to one per tablet. So, with read restarts, we can be sure that a read request will capture all records that were written prior to the transaction starting, even with clock skew.
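A rough sketch of the read-restart rule described above, with hybrid timestamps modelled as plain integers. This is not YugabyteDB code: `Record`, `read_with_restarts`, the choice to restart at the lowest ambiguous timestamp, and the exact inclusive/exclusive bounds of the ambiguity window are all assumptions for illustration, and the local_limit optimization is left out.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    value: str
    commit_ht: int  # commit hybrid timestamp, simplified to an integer


def read_with_restarts(records, key, read_ht, max_clock_skew):
    # global_limit is computed once and does not move when the read restarts.
    global_limit = read_ht + max_clock_skew
    restarts = 0
    while True:
        # Records committed in (read_ht, global_limit] might have been committed
        # before the transaction started, so they cannot be ignored.
        ambiguous = [r for r in records
                     if r.key == key and read_ht < r.commit_ht <= global_limit]
        if not ambiguous:
            break
        # Restart the read at the timestamp of an ambiguous record (assumption:
        # the lowest one is encountered first).
        read_ht = min(r.commit_ht for r in ambiguous)
        restarts += 1
    visible = [r for r in records if r.key == key and r.commit_ht <= read_ht]
    latest = max(visible, key=lambda r: r.commit_ht, default=None)
    return (latest.value if latest else None), read_ht, restarts


history = [
    Record("k", "v1", 90),
    Record("k", "v2", 105),
    Record("k", "v3", 108),
    Record("k", "v4", 130),  # beyond global_limit: certainly committed later
]
result = read_with_restarts(history, "k", read_ht=100, max_clock_skew=10)
```

Here the read starts at 100 with a global_limit of 110, restarts twice (at 105 and 108), and returns v3; the record at 130 lies past global_limit and is never considered. Starting directly at read_ht = global_limit would avoid the restarts, as the answer notes.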

How does dbms maintain atomicity during transactions?

I was reading about the ACID properties of a DBMS, one of which is Atomicity.
http://ecomputernotes.com/database-system/rdbms/transaction
Scenario:
Suppose that, just prior to execution of transaction Ti, the values of accounts A and B are Rs.1000 and Rs.2000.
Now, suppose that during the execution of Ti, a power failure has occurred that prevented Ti from completing successfully. The point of failure may be after the completion of Write(A,a) and before Write(B,b). It means that the changes in A are performed but not in B. Thus the values of accounts A and B are Rs.950 and Rs.2000 respectively. We have lost Rs.50 as a result of this failure.
Now, our database is in an inconsistent state.
My question is: in case of a power failure which leads us to the inconsistent state, how do we recover from it?
Can we do it at application level/ code level?
How many ways are there to recover from it?
Generally speaking, these ways may differ from one database to another, but usually DBMSs ensure atomicity this way:
When a new request for data change is received, the database first writes this request to a special log which represents a change vector. Only once this record has been successfully written can the transaction be committed.
In case of power failure this log persists, and the database can recover from the inconsistent state using this log, applying the changes one by one.
For example, this log in Oracle database is called the Redo Log. In PostgreSQL it's called the WAL.
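As a rough illustration of the log-first idea, here is a toy write-ahead log kept as JSON lines in a file. It is a simplification and not how Oracle or PostgreSQL actually lay out their logs: the record format, the `append` and `recover` helpers, and the rule "replay only transactions that have a commit record" are assumptions for the example.

```python
import json
import os
import tempfile


def append(log_path, record):
    """Append one log record and force it to disk before returning."""
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def recover(log_path, initial):
    """Rebuild state from the log, applying only transactions that committed."""
    state = dict(initial)
    pending = {}  # tx -> list of (key, value) not yet known to be committed
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["op"] == "write":
                pending.setdefault(rec["tx"], []).append((rec["key"], rec["value"]))
            elif rec["op"] == "commit":
                for key, value in pending.pop(rec["tx"], []):
                    state[key] = value
    return state


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")

# T1 transfers Rs.50 from A to B and commits.
append(log_path, {"tx": "T1", "op": "write", "key": "A", "value": 950})
append(log_path, {"tx": "T1", "op": "write", "key": "B", "value": 2050})
append(log_path, {"tx": "T1", "op": "commit"})

# T2 writes A, then the power fails before Write(B) and before commit.
append(log_path, {"tx": "T2", "op": "write", "key": "A", "value": 900})

recovered = recover(log_path, {"A": 1000, "B": 2000})
```

After the simulated power failure, recovery applies both writes of T1 and none of T2, so the half-finished transfer never becomes visible.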

Approach for Change tracking mechanism with concurrent writes

I need to implement data synchronization in a distributed system taking into account concurrent writes to the data table.
The export from Main DB should read only changed rows.
Common advice is to use triggers that mark rows with the timestamp of the data update or a sequential revision number, and to pass this timestamp/revision_number to the remote system.
E.g. What is the best approach to pull "Delta" data into Analytics DB from a highly transactional DB?
But the problem with concurrent writes lies in the moment at which the commit takes place. Here is the problem:
[time 00:00] Transaction A starts writing a big batch of data;
marking rows with timestamp [00:00].
[time 00:02] Transaction B starts writing a small amount of data;
going to mark rows with timestamp [00:02].
[time 00:03] Transaction B finishes writing and commit happens.
[time 00:10] Export is done to the remote part of the system.
Isolation level is ReadCommitted, so it sees only the data from transaction B, timestamp 00:02.
[time 00:15] Transaction A finishes writing and commit happens.
Remote system never receives this data(!)
I like the Change Tracking solution in MSSQL, but I keep in mind our intention to migrate to PostgreSQL soon. So I have to implement a general solution.
What is the correct approach to solve this problem?
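The timeline in the question can be reproduced with a few lines of simulation. It only demonstrates the problem, not a fix. The row layout, the `export` helper, and the assumption that the exporter remembers the highest timestamp it has seen as a watermark are all made up for the example; times are minutes as integers.

```python
def export(rows, now, watermark):
    """Read-committed export: sees rows committed by `now` whose timestamp
    is newer than the watermark remembered from the previous export."""
    visible = [r for r in rows if r["commit_time"] <= now and r["ts"] > watermark]
    new_watermark = max([r["ts"] for r in visible], default=watermark)
    return visible, new_watermark


rows = [
    {"payload": "A", "ts": 0, "commit_time": 15},  # big batch, commits late
    {"payload": "B", "ts": 2, "commit_time": 3},   # small write, commits early
]

first, watermark = export(rows, now=10, watermark=-1)
second, watermark = export(rows, now=20, watermark=watermark)

first_payloads = [r["payload"] for r in first]
second_payloads = [r["payload"] for r in second]
```

The first export picks up only B and advances the watermark to 2. By the second export A is committed, but its rows carry timestamp 0, which is below the watermark, so they are skipped.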

Prioritizing Transactions in Google AppEngine

Let's say I need to perform two different kinds of write operations on a datastore entity that might happen simultaneously, for example:
The client that holds a write-lock on the entry updates the entry's content
The client requests a refresh of the write-lock (updates the lock's expiration time-stamp)
As the content-update operation is only allowed if the client holds the current write-lock, I need to perform the lock-check and the content-write in a transaction (unless there is another way that I am missing?). Also, a lock-refresh must happen in a transaction because the client needs to first be confirmed as the current lock-holder.
The lock-refresh is a very quick operation.
The content-update operation can be quite complex. Think of it as the client sending the server a complicated update-script that the server executes on the content.
Given this, if there is a conflict between those two transactions (should they be executed simultaneously), I would much rather have the lock-refresh operation fail than the complex content-update.
Is there a way that I can "prioritize" the content-update transaction? I don't see anything in the docs and I would imagine that this is not a specific feature, but maybe there is some trick I can use?
For example, what happens if my content-update reads the entry, writes it back with a small modification (without committing the transaction), then performs the lengthy operation and finally writes the result and commits the transaction? Would the first write be applied immediately and cause a simultaneous lock-refresh transaction to fail? Or are all writes kept until the transaction is committed at the end?
Is there such a thing as keeping two transactions open? Or doing an intermediate commit in a transaction?
Clearly, I can just split my content-update into two transactions: The first one sets a "don't mess with this, please!"-flag and the second one (later) writes the changes and clears that flag.
But maybe there is some other trick to achieve this with fewer reads/writes/transactions?
Another thought I had was that there are 3 different "blocks" of data: The current lock-holder (LH), the lock expiration (EX), and the content that is being modified (CO). The lock-refresh operation needs to perform a read of LH and a write to EX in a transaction, while the content-update operation needs to perform a read of LH, a read of CO, and a write of CO in a transaction. Is there a way to break the data apart into three entities and somehow have the transactions span only the needed entities? Since LH is never modified by these two operations, this might help avoid the conflict in the first place?
The datastore uses optimistic concurrency control, which means that a (datastore primitive) transaction waits until it is committed, then succeeds only if someone else hasn't committed first. Typically, the app retries the failed transaction with fresh data. There is no way to modify this first-wins behavior.
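The first-wins behavior can be sketched with a version check at commit time. This is a generic illustration of optimistic concurrency control and not the App Engine datastore API; `Store`, `ConflictError` and `run_with_retries` are hypothetical names.

```python
class ConflictError(Exception):
    pass


class Store:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def commit(self, key, read_version, new_value):
        # The commit succeeds only if nobody else committed since our read.
        current_version, _ = self.data.get(key, (0, None))
        if current_version != read_version:
            raise ConflictError(key)
        self.data[key] = (current_version + 1, new_value)


def run_with_retries(store, key, update, attempts=3):
    """Typical application pattern: retry the transaction with fresh data."""
    for _ in range(attempts):
        version, value = store.read(key)
        try:
            store.commit(key, version, update(value))
            return True
        except ConflictError:
            continue
    return False


store = Store()
store.data["entry"] = (1, {"content": "a", "expires": 10})

# The content update reads the entity, then works on it for a while.
version, value = store.read("entry")

# Meanwhile the quick lock refresh reads and commits first.
store.commit("entry", 1, {**value, "expires": 20})

# The slow content update now fails at commit time.
try:
    store.commit("entry", version, {**value, "content": "b"})
    conflict = False
except ConflictError:
    conflict = True

# Retrying with fresh data succeeds and keeps the refreshed expiration.
retried = run_with_retries(store, "entry", lambda e: {**e, "content": "b"})
```

In this sketch the lock refresh wins simply because it commits first, which is the outcome the question hoped to avoid; the retry is what lets the content update go through afterwards.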
It might help to know that datastore transactions are strongly consistent, so a client can first commit a lock refresh with a synchronous datastore call, and when that call returns, the client knows for sure whether it obtained or refreshed the lock. The client can then proceed with its update and lock clear. The case you describe where a lock refresh and an update might occur concurrently from the same client sounds avoidable.
I'm assuming you need the lock mechanism to prevent writes from other clients while the lock owner performs multiple datastore primitive transactions. If a client is actually only doing one update before it releases the lock and it can do so within seconds (well before the datastore RPC timeout), you might get by with just a primitive datastore transaction with optimistic concurrency control and retries. But a lock might be a good idea for simple serialization of, say, edits to a record in a user interface, where a user hits an "edit" button in a UI and you want that to guarantee that the user has some time to prepare and submit changes without the record being changed by someone else. (Whether that's the user experience you want is your decision. :) )

Integrity and Confidentiality in Distributed Transactions

I've a question regarding distributed transactions. Let's assume I have 3 transaction programs:
Transaction A
begin
a=read(A)
b=read(B)
c=a+b
write(C,c)
commit
Transaction B
begin
a=read(A)
a=a+1
write(A,a)
commit
Transaction C
begin
c=read(C)
c=c*2
write(A,c)
commit
So there are 5 pairs of critical operations: C2-A5, A2-B4, B4-C4, B2-C4, A2-C4.
I should ensure integrity and confidentiality, do you have any idea of how to achieve it?
Thank you in advance!
What you have described in your post is a common situation in multi-user systems. Different sessions simultaneously start transactions using the same tables and indeed the same rows. There are two issues here:
What happens if Session C reads a record after Session A has updated it but before Session A has committed its transaction?
What happens if Session C updates the same record which Session A has updated but not committed?
(Your scenario only illustrates the first of these issues).
The answer to the first question is isolation level. This is the definition of the visibility of uncommitted transactions across sessions. The ANSI standard specifies four levels:
SERIALIZABLE: no changes from another session are ever visible.
REPEATABLE READ: phantom reads allowed, that is the same query executed twice may return different results.
READ COMMITTED: only changes which have been committed by another session are visible.
READ UNCOMMITTED: dirty reads allowed, that is uncommitted changes from one session are visible in another.
Different flavours of database implement these in different fashions, and not all databases support all of them. For instance, Oracle only supports READ COMMITTED and SERIALIZABLE, and it implements SERIALIZABLE as a snapshot (i.e. it is a read-only transaction). However, it uses multiversion concurrency control to prevent non-repeatable reads in READ COMMITTED transactions.
So, coming back to your question, the answer is: set the appropriate Isolation Level. What the appropriate level is depends on what levels your database supports, and what behaviour you wish to happen. Probably you want READ COMMITTED or SERIALIZABLE, that is you want your transactions to proceed on the basis of data values being consistent with the start of the transaction.
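To make the visibility difference concrete, here is a toy in-memory model of the first issue (Session C reading what Session A has written but not yet committed). It is not how any real database implements isolation; `ToyDb` and its methods are invented for the example and only READ UNCOMMITTED and READ COMMITTED are modelled.

```python
class ToyDb:
    def __init__(self, data):
        self.committed = dict(data)
        self.uncommitted = {}  # session -> {key: value} written but not committed

    def write(self, session, key, value):
        self.uncommitted.setdefault(session, {})[key] = value

    def commit(self, session):
        self.committed.update(self.uncommitted.pop(session, {}))

    def read(self, session, key, level):
        # A session always sees its own writes.
        own = self.uncommitted.get(session, {})
        if key in own:
            return own[key]
        # READ UNCOMMITTED also sees other sessions' pending writes (dirty read).
        if level == "READ UNCOMMITTED":
            for writes in self.uncommitted.values():
                if key in writes:
                    return writes[key]
        return self.committed[key]


db = ToyDb({"A": 1, "B": 2, "C": 0})

# Transaction A: c = a + b; write(C, c), not yet committed.
a = db.read("A", "A", "READ COMMITTED")
b = db.read("A", "B", "READ COMMITTED")
db.write("A", "C", a + b)

# Session C reads C before Session A commits.
dirty = db.read("C", "C", "READ UNCOMMITTED")
clean = db.read("C", "C", "READ COMMITTED")

db.commit("A")
after_commit = db.read("C", "C", "READ COMMITTED")
```

Under READ UNCOMMITTED Session C sees the pending value 3, while under READ COMMITTED it keeps seeing 0 until Session A commits.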
As to the other matter, the answer is simpler: transactions must issue locks on tables or preferably just the required rows, before they start to update them. This ensures that the transaction can proceed to change those values without causing a deadlock. This is called pessimistic locking. It is not possible in applications which use connection pooling (i.e. most web-based applications), and the situation there is much gnarlier.

Resources