I am trying to understand the behaviour of jbd2 journaling when one of the many transactions that are written to the journal gets corrupted.
As per my understanding, for a write operation, a write is first done to persist the data to its on-disk location, followed by the corresponding metadata transaction write to the journal. The format of a metadata update is as follows: 1) a transaction descriptor block, 2) the metadata blocks, and 3) the transaction commit block. This continues for multiple transactions. Finally, during a checkpoint, the metadata
corresponding to these transactions is written to its on-disk location.
I also understand that order needs to be maintained between transactions during replay if the file system crashed before a checkpoint occurred, i.e. if we have three transactions T1, T2, T3, they will be replayed sequentially. This is to avoid the scenario where overwrites of the same block occur, or where there is a delete and subsequent re-allocation of the same block in two consecutive transactions.
My question is about a special case: with T1, T2, and T3 being three consecutive transactions, suppose T1 and T3 hold metadata changes for, say, metadata block M1, while T2 holds a metadata change for block M2, and M1 and M2 do not overlap at all. In that case, if T2 gets corrupted, will T3 and all subsequent transactions be discarded?
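To make the scenario concrete, here is a toy model of the replay I have in mind (just my mental model in Python, not actual jbd2 code; the transaction and block names are made up):

```python
from dataclasses import dataclass, field

@dataclass
class Transaction:
    tid: int                                      # journal sequence number
    metadata: dict = field(default_factory=dict)  # block id -> new contents
    commit_valid: bool = True                     # False if the commit block / checksum is bad

def replay(journal, disk):
    """Replay transactions strictly in sequence order."""
    for txn in sorted(journal, key=lambda t: t.tid):
        if not txn.commit_valid:
            # This is the case I am asking about: does replay stop here,
            # discarding T3 even though it only touches an unrelated block?
            print(f"T{txn.tid} is corrupted")
            break
        disk.update(txn.metadata)                 # write the metadata to its final location
    return disk

# T1 and T3 touch block M1, T2 touches the unrelated block M2 and is corrupted.
journal = [
    Transaction(1, {"M1": "v1"}),
    Transaction(2, {"M2": "v2"}, commit_valid=False),
    Transaction(3, {"M1": "v3"}),
]
print(replay(journal, {}))
```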
Why does it say "by a sequence of short transactions"? If transactions are long, there should be no difference, no?
However, care must be taken to avoid the following scenario. Suppose a
transaction T2 has a shared-mode lock on a data item, and another
transaction T1 requests an exclusive-mode lock on the data item. T1
has to wait for T2 to release the shared mode lock. Meanwhile, a
transaction T3 may request a shared-mode lock on the same data item.
The lock request is compatible with the lock granted to T2, so T3 may
be granted the shared-mode lock. At this point T2 may release the
lock, but still T1 has to wait for T3 to finish. But again, there may
be a new transaction T4 that requests a shared-mode lock on the same
data item, and is granted the lock before T3 releases it. In fact, it
is possible that there is a sequence of transactions that each
requests a shared mode lock on the data item, and each transaction
releases the lock a short while after it is granted, but T1 never gets
the exclusive-mode lock on the data item. The transaction T1 may never
make progress, and is said to be starved.
Long transactions (in time) are actually more susceptible to blocking problems than short transactions are. Consequently, it is usually recommended that transactions be designed to hold blocking locks for as short a time as possible.
So, in the scenario above a series of "long" transactions are actually much more likely to cause this problem. However, the writer refers to a series of "short" transactions to emphasize that this problem can happen even when the transactions are short (if there are enough nearly simultaneous compatible transactions).
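To illustrate, here is a small sketch (a hypothetical, deliberately unfair lock manager, not any real DBMS) of how a stream of short, overlapping shared-lock transactions can keep an exclusive request waiting indefinitely:

```python
# Toy lock manager that grants a shared lock whenever it is compatible with the
# locks currently held, with no fairness for waiting exclusive requests.
# T1 wants an exclusive lock; a stream of short shared-lock transactions keeps
# the item continuously share-locked, so T1 starves.

class NaiveLockManager:
    def __init__(self):
        self.shared_holders = set()

    def try_exclusive(self, txn):
        return len(self.shared_holders) == 0   # granted only when nobody holds the item

    def acquire_shared(self, txn):
        self.shared_holders.add(txn)           # always compatible with other shared locks

    def release_shared(self, txn):
        self.shared_holders.discard(txn)

lm = NaiveLockManager()
lm.acquire_shared("T2")
granted = False
for i in range(3, 100):                        # T3, T4, ... arrive one by one
    lm.acquire_shared(f"T{i}")                 # the new reader overlaps the previous one
    lm.release_shared(f"T{i-1}")               # the previous reader finishes shortly after
    if lm.try_exclusive("T1"):
        granted = True
        break
print("T1 got the exclusive lock:", granted)   # False: T1 is starved
```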
Since transactions are only ACID within a single region, and don't replicate until that region completes the transaction, there is an opportunity for transactions to occur simultaneously in different regions that would not be allowed to happen in the same region. Then, when these transactions replicate to the other regions, they cannot be completed because of some condition.
Example:
User 1 buys item 1 from User 2 in region A
User 3 buys item 1 from User 2 in region B at the same time
Region A's transaction completes, Region B's transaction completes, and both begin to replicate to the other regions. Transaction 1 cannot complete in region B, and Transaction 2 cannot complete in region A, because User 2 no longer has that item in either region (it now belongs to User 3 in region B and to User 1 in region A).
How does DynamoDB handle these conflicts? SQL DBs would typically prevent this because of the single Writer node, but since DynamoDB has multiple writer nodes we can see this potential conflict.
Once the transaction is complete in a region, the individual item changes enter the stream to be replicated via global tables, not the transaction itself. The transaction is over.
In your scenario, DynamoDB's conflict resolution works as it normally does on each individual item. DynamoDB uses last writer wins for its conflict resolution.
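As a rough illustration of last-writer-wins (only a sketch of the idea, not DynamoDB's internal logic; the field names and timestamps are made up), a replica that receives a conflicting version of an item simply keeps whichever write carries the later timestamp:

```python
# Rough sketch of "last writer wins": when the same item was written in two
# regions concurrently, each replica keeps whichever version has the later
# write timestamp (the region name is used here only as a tie-breaker).

from datetime import datetime, timezone

def resolve(local_version, incoming_version):
    """Return the version that survives replication."""
    key = lambda v: (v["updated_at"], v["region"])
    return max(local_version, incoming_version, key=key)

region_a = {"item": "item1", "owner": "User 1", "region": "A",
            "updated_at": datetime(2024, 1, 1, 12, 0, 0, 500, tzinfo=timezone.utc)}
region_b = {"item": "item1", "owner": "User 3", "region": "B",
            "updated_at": datetime(2024, 1, 1, 12, 0, 0, 900, tzinfo=timezone.utc)}

# After replication settles, both regions converge on the same winner -
# one of the buyers silently "loses" the item.
print(resolve(region_a, region_b)["owner"])   # User 3
```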
I have some data that is sharded and kept in DB1, DB2 and DB3 hosted on Machine1. Now, to scale the system, I need to move shard DB1 from Machine1 to Machine2. Once the move is complete, all requests to shard DB1 will be routed to Machine2.
Assume that we have reads, writes and updates coming to DB1 all the time. How can I do the migration without any downtime to read/write/update?
We can make DB1 read-only during the migration window and copy the data to Machine2. Once the copy is complete, we can route traffic to Machine2 and allow writes.
But what if we want to do the same while writes are also happening?
After some research I found a couple of ways to do this.
Solution 1
For the purpose of copying the shard to another physical machine, break it into several small segments. Trigger a script to copy it segment by segment from M1 to M2. While the copy is happening, replicate incoming writes to both M1 and M2 (see the sketch after this list).
1. All newly written rows will be written to the new copy as well.
2. Any updates to existing segments which are not yet copied will be taken care of when that segment is copied to M2 in the future.
3. Updates to already-copied segments will be applied to M2 as well, because M2 also has that segment.
4. Writes to a particular segment are blocked while the segment is being copied. Once the copy is complete, the write is completed and it's the same as #3 above.
5. Once all the segments are copied successfully, stop writes to M1. At this point, DB1 on M1 is stale and can be deleted safely.
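A rough single-threaded sketch of this dual-write copy loop (all names are hypothetical; a real implementation would need locking and failure handling):

```python
m1 = {"seg1": ["r1"], "seg2": ["r2"], "seg3": ["r3"]}   # DB1 on Machine1
m2 = {}                                                  # the new copy on Machine2
copied = set()

def handle_write(segment, row, copying_now=None):
    """Apply one incoming client write while the migration is running."""
    if segment == copying_now:
        raise BlockingIOError("write blocked until this segment finishes copying")
    is_new_segment = segment not in m1
    m1.setdefault(segment, []).append(row)               # M1 stays authoritative
    if is_new_segment or segment in copied:
        m2.setdefault(segment, []).append(row)           # mirror the write to M2
    # writes to not-yet-copied segments need no mirror: the later copy picks them up

segments = list(m1)
for i, seg in enumerate(segments):
    if i == 1:
        handle_write("seg3", "r4")    # an update to a segment that is not yet copied
    m2[seg] = list(m1[seg])           # copy this segment over to M2
    copied.add(seg)
handle_write("seg1", "r5")            # an update to an already-copied segment: mirrored

print(m1 == m2)   # True: writes to M1 can now stop and DB1 on M1 can be dropped
```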
Solution 2
1. As before, break the shard into several smaller segments.
2. Schedule a script to copy it segment by segment to M2.
3. Any updates to existing segments which are not yet copied will be taken care of when that segment is copied to M2 in the future.
4. When there is an update to a segment that has already been copied, mark that segment as dirty so it will be copied again.
5. Newly created segments are marked dirty by default.
6. Once the first pass is complete, start another pass to copy all the dirty segments again.
7. Repeat the passes until the number of dirty segments is below a certain threshold. At that point, queue incoming writes (this increases write latency), copy the remaining dirty segments, commit the queued writes to the new machine, change the configuration to write to M2, and start accepting writes (see the sketch after this list).
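A rough sketch of the dirty-segment passes (hypothetical names; error handling and the routing switch are omitted):

```python
m1 = {"seg1": ["r1"], "seg2": ["r2"], "seg3": ["r3"]}   # DB1 on Machine1
m2 = {}                                                  # the new copy on Machine2
dirty = set(m1)                 # everything starts out needing a copy
queued_writes = []
THRESHOLD = 1

def handle_write(segment, row, cutover=False):
    if cutover:
        queued_writes.append((segment, row))   # briefly queued during the final pass
        return
    m1.setdefault(segment, []).append(row)
    dirty.add(segment)                         # copied or not, it must be (re)copied

def copy_pass():
    for seg in list(dirty):
        m2[seg] = list(m1[seg])
        dirty.discard(seg)

copy_pass()
handle_write("seg1", "x")       # write to an already-copied segment: marked dirty again
handle_write("seg4", "y")       # brand-new segment: dirty by default
while len(dirty) > THRESHOLD:
    copy_pass()

# Final cut-over: queue writes, copy the last dirty segments, replay the queue on M2.
handle_write("seg2", "z", cutover=True)
copy_pass()
for seg, row in queued_writes:
    m2.setdefault(seg, []).append(row)
print(m2)                        # M2 now serves reads and writes; M1 is retired
```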
I feel Solution 2 is better because it doesn't write to two places and hence client write requests will be faster.
I was trying to reason about the failure recovery actions that can be taken by systems/frameworks which guarantee to keep multiple data sources in sync. I've been unable to find a clear explanation of Narayana's recovery mechanism.
Q1: Does Narayana essentially employ a 2-phase commit to ensure distributed transactions across 2 datasources?
Q2: Can someone explain Narayana's behavior in this scenario?
Application wants to save X to 2 data stores
Narayana's transaction manager (TM) generates a transaction ID and writes info to disk
TM now sends a prepare message to both data stores
Each data store responds back with prepare_success
TM updates local transaction log and sends a commit message to both data stores
TM fails (permanently). And because of packet loss on the network, only one data store receives the commit message. That data store successfully processes the commit, but the other data store never receives it.
The two data stores are now out of sync with each other (one source has an additional transaction that is not present in the other source).
When a new TM is brought up, it does not have access to the old transaction state records. So the TM cannot initiate the recovery of the missing transaction in one of the data stores.
So how can 2PC/Narayana/XA claim that they guarantee distributed transactions that can maintain 2 data stores in sync? From where I stand, they can only maintain synchronous data stores with a very high probability, but they cannot guarantee it.
Q3: Another scenario where I'm unclear on the behavior of the application/framework. Consider the following interleaved transactions (both on the same record - or at least with a partially overlapping set of records):
Di = Data source i
Ti = Transaction i
Pi = prepare message for transaction i
D1 receives P1; responds P1_success
D2 receives P2; responds P2_success
D1 receives P2; responds P2_failure
D2 receives P1; responds P1_failure
The order in which the network packets arrive at the different data sources can determine which prepare request succeeds. Does this not mean that, at high transaction rates for a contended record, it is possible that all transactions will keep failing (until the record experiences a lower transaction request rate)?
One could argue that we are choosing consistency over availability but unlike ACID systems there is no guarantee that at least one of the transactions will succeed (thus avoiding a potentially long-lasting deadlock).
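To illustrate the interleaving I mean, here is a toy simulation (my own model, not Narayana or XA code) where each data source rejects a prepare for a record that another transaction already has prepared:

```python
# Each data source accepts a prepare only if no other transaction already holds
# the overlapping record in a prepared state, so opposite arrival orders at the
# two sources can make *both* transactions fail.

class DataSource:
    def __init__(self, name):
        self.name = name
        self.prepared_owner = None        # which transaction holds the record prepared

    def prepare(self, txn):
        if self.prepared_owner is None:
            self.prepared_owner = txn
            return f"{self.name}: P{txn}_success"
        return f"{self.name}: P{txn}_failure"

d1, d2 = DataSource("D1"), DataSource("D2")
# The packets arrive in opposite orders at the two sources.
print(d1.prepare(1))   # D1: P1_success
print(d2.prepare(2))   # D2: P2_success
print(d1.prepare(2))   # D1: P2_failure
print(d2.prepare(1))   # D2: P1_failure
# Neither T1 nor T2 got prepare_success from both sources, so both must roll back.
```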
I would refer you to my article on how Narayana 2PC works
https://developer.jboss.org/wiki/TwoPhaseCommit2PC
To your questions
Q1: you already mentioned that in the comment - yes, Narayana uses 2PC = Narayana implements the XA specification (pubs.opengroup.org/onlinepubs/009680699/toc.pdf).
Q2: The steps in the scenario are not precise. Narayana writes to disk at the time prepare is called, not at the time the transaction is started.
Application saves X to 2 data stores
TM now sends a prepare message to both data stores
Each data store responds back with prepare_success
TM permanently saves info about the prepared transaction and its ID to the transaction log store
TM sends a commit message to both data stores
...
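A minimal sketch of that corrected ordering as I understand it (my own model, not Narayana source; the participant and log shapes are invented): the coordinator records its decision durably only after all prepares succeed, so a lost commit message can be retried later.

```python
class Participant:
    """Stand-in for a data store / XA resource manager."""
    def __init__(self, name, vote=True):
        self.name, self.vote = name, vote
    def prepare(self, tx_id):  return self.vote
    def commit(self, tx_id):   pass
    def rollback(self, tx_id): pass

def two_phase_commit(tx_id, participants, log):
    votes = [p.prepare(tx_id) for p in participants]       # phase 1: prepare
    if not all(votes):
        for p in participants:
            p.rollback(tx_id)
        return "rolled back"
    log.append(("prepared", tx_id))                         # durable record of the decision
    for p in participants:
        try:
            p.commit(tx_id)                                 # phase 2: commit
        except ConnectionError:
            # The decision survives in the log, so the recovery manager
            # can retry the commit against this participant later.
            log.append(("commit pending", tx_id, p.name))
    log.append(("committed", tx_id))
    return "committed"

log = []
print(two_phase_commit("tx-1", [Participant("storeA"), Participant("storeB")], log))
print(log)
```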
I don't agree that 2PC claims to guarantee to maintain 2 data stores in sync.
I was wondering about this too (e.g. asked here https://developer.jboss.org/message/954043).
2PC claims to guarantee ACID properties. Having 2 stores in sync is more what CAP consistency is about.
Here Narayana strictly depends on the capabilities of the particular resource managers (the data stores, or the JDBC drivers of the data stores).
ACID declares
atomicity - the whole transaction is committed or rolled back (nothing is said about when that happens, and nothing about the resources being in sync)
consistency - before the transaction starts and after it ends, the system is in a consistent state
durability - everything is stored durably even when a crash occurs
isolation - (the tricky one, left for the end) - to be ACID we have to be serializable. That is, you can observe transactions happening "one by one".
To take a pretty simplified example to show my point - assume a DB implemented in the naive way of locking the whole database when a transaction starts - you committed the JMS message, it's processed, and now you don't commit the DB record. When the DB works at the serializable isolation level (which is what ACID requires!), then your next write/read operation has to wait until the 'in-flight prepared' transaction is resolved. The DB is just stuck and waiting. If you read, you won't get an answer, so you can't say what the value is.
Narayana's recovery manager then comes to that prepared transaction after the connection is established and commits it. And your read action then returns information that is 'correct'.
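As I understand it, the recovery pass is conceptually something like this sketch (simplified, not the actual Narayana implementation): transactions that were decided but never fully committed are re-committed once the resource is reachable again.

```python
def recover(log, resources):
    decided   = {tx for state, tx in log if state == "prepared"}
    completed = {tx for state, tx in log if state == "committed"}
    for tx in decided - completed:
        for r in resources:
            r.commit(tx)            # re-driving an already-committed branch is harmless here
        log.append(("committed", tx))

class Resource:
    """Stand-in for a prepared resource the recovery manager reconnects to."""
    def commit(self, tx):
        print(f"re-committing {tx}")

log = [("prepared", "tx-1")]        # the decision was logged but the commit never arrived everywhere
recover(log, [Resource(), Resource()])
print(log)                          # [('prepared', 'tx-1'), ('committed', 'tx-1')]
```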
Q3: I don't understand the question, sorry. But if you state that the order in which the network packets arrive at the different data sources can determine which prepare request succeeds, then you are right: you are doomed to get failing transactions until the network becomes more stable.
I'm doing some reading up on the advantages/disadvantages of using timestamps for concurrency control in a distributed database. The material I'm reading mentions that, although timestamps overcome the traditional deadlock problems that can affect locking, there is still the problem of "global deadlock" to which they are vulnerable.
The material describes global deadlock as a situation where no cycle exists in any of the local wait-for graphs, but there is a cycle in the global wait-for graph.
I'm wondering how this could happen? Could someone describe a situation where a timestamp system could cause this problem?
Here is an example, probably the simplest possible. We have machines A and B. Machine A has transactions T1 and T2, and locally T2 must wait for T1. Machine B has transactions T3 and T4, and locally T3 must wait for T4.
So the local wait-for graphs are just: T2 waits for T1 on A, and T3 waits for T4 on B. There are no local cycles. But now assume that T1 also has to wait for T3 (for a resource on machine B), and at the same time T4 has to wait for T2 (for a resource on machine A). In this case there is a cycle globally: T1 waits for T3, T3 waits for T4, T4 waits for T2, and T2 waits for T1, even though neither machine sees a cycle on its own.
So how does that cycle happen? The key here is that you never have the full information in a distributed system. We may only learn later that the inter-machine dependencies are there. And then we have a problem.
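A small sketch of that point: neither local wait-for graph has a cycle, but the union of the local graphs plus the cross-machine waits does (the transaction names follow the example above).

```python
def has_cycle(edges):
    """Detect a cycle in a wait-for graph given as (waiter, waited_for) pairs."""
    graph = {}
    for a, b in edges:                      # a waits for b
        graph.setdefault(a, []).append(b)
    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, []))
    return any(visit(n, set()) for n in graph)

local_a = [("T2", "T1")]                    # machine A: T2 waits for T1
local_b = [("T3", "T4")]                    # machine B: T3 waits for T4
cross   = [("T1", "T3"), ("T4", "T2")]      # seen only when the graphs are combined

print(has_cycle(local_a), has_cycle(local_b))   # False False: no local deadlock
print(has_cycle(local_a + local_b + cross))     # True: global deadlock
```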
Timestamping is used to determine conflict resolution between local processes on a machine. It gives a means to solve deadlocks at that level. For distributed processes there is a possibility of two processes on different machines waiting on each other, which is in fact a regular deadlock, but across machines. This is called a 'global' deadlock. IMHO timestamping might be used there as well, but it is apparently impractical.
Some info on this can be found on http://www.cse.scu.edu/~jholliday/dd_9_16.htm