How does TM recovery handle past broadcasted data - apache-flink

In the context of TaskManager (TM) high availability, when a TM goes down, the JobManager (JM) brings up a new one and restores it from the latest checkpoint of the failed TM.
Say we have 3 TMs (tm1, tm2, & tm3).
At a given time t, everyone's checkpoint (cp) is at cp1, and all TMs broadcast data among themselves.
Now tm2 goes down, and the JM brings up tm2' from checkpoint cp1 as part of HA. By the time t+x, when the new TM is up, the others have progressed to cp2.
How is the data broadcast by tm1 and tm3 as part of cp2 replayed on tm2'?

The contents of checkpoints are determined by checkpoint barriers. A given checkpoint captures, across the entire cluster, exactly the effects of every instance having processed all events up to the corresponding barrier, and none of the events after that barrier.
During a restore, the entire cluster is reset to the contents of the most recent checkpoint, and processing then resumes from that consistent starting point.
Broadcast data is checkpointed more or less like everything else, except that each instance stores its own copy of the broadcast data -- with the expectation that these copies are identical. During recovery, the broadcast source is rewound to the point recorded in the checkpoint, and the broadcast state is also recovered from the checkpoint. Any new instance (due to scaling up the cluster) will get a copy of the broadcast state (taken by reading the state intended for one of the other instances).
It may be that at the time of a failure, some machines have completed a new checkpoint, but a checkpoint will not be used for a restore unless every TM has completed that checkpoint, and the Job Manager has finalized it.
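For reference, here is a minimal Kotlin sketch of the broadcast state pattern described above. The Event and Rule types and the stream parameters are hypothetical, and the downstream KeyedBroadcastProcessFunction is left out; the point is that the broadcast map state declared by the descriptor is kept per parallel instance and included in each instance's checkpoint.
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.Types
import org.apache.flink.streaming.api.datastream.BroadcastConnectedStream
import org.apache.flink.streaming.api.datastream.DataStream
// Hypothetical record types; only the broadcast-state wiring matters here.
data class Event(var key: String = "", var payload: String = "")
data class Rule(var name: String = "", var pattern: String = "")
fun wireBroadcast(
    events: DataStream<Event>,
    rules: DataStream<Rule>
): BroadcastConnectedStream<Event, Rule> {
    // Every parallel instance of the downstream operator keeps its own copy of this
    // map state, and each copy is written into the checkpoint -- which is why a new
    // instance can be seeded by reading the copy intended for another instance.
    val ruleDescriptor = MapStateDescriptor("rules", Types.STRING, Types.POJO(Rule::class.java))
    return events
        .keyBy { it.key }
        .connect(rules.broadcast(ruleDescriptor))
    // .process(...) would take a KeyedBroadcastProcessFunction that updates the broadcast
    // state in processBroadcastElement and reads it in processElement.
}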

Related

Clarification in regards to using safe_time in YugabyteDB

The document https://docs.yugabyte.com/latest/architecture/transactions/transactional-io-path/ says that a distributed txn can choose the safe_time from one of the involved tablets, and that safe_time considers the hybrid timestamp of the first uncommitted Raft log entry. Does this mean that YugabyteDB guarantees that every txn can read the data written by txns committed before it started?
[Disclaimer]: This question was first asked on the YugabyteDB Community Slack channel.
There are two components to choosing a read timestamp for a snapshot isolation transaction: (1) it needs to be recent enough to capture everything that has been committed before the transaction started; and (2) it needs to be as low as possible to avoid unnecessary waits. Choosing the safe time from the first tablet that a transaction reads from or writes to is just a heuristic towards this goal.
Safe time takes the timestamp of the first uncommitted (in the Raft sense) record in that tablet's Raft log as one of its inputs, and what actually goes into the safe time calculation is that uncommitted timestamp minus "epsilon" (the smallest possible hybrid time step), so that the record later committing will not change the view of the data as of this timestamp. Safe time is also capped by the hybrid time leader lease of the tablet's leader, so that we are safe against leader changes and a new leader trying to read at a timestamp past the leader lease expiration.
All of the above concerns "snapshot safety", i.e. the property that if we are reading at some time read_ht, we are guaranteed that no writes will be done to that data with timestamps <= read_ht. If safe time on a tablet has not reached a particular read_ht when a read request arrives at that tablet, the tablet will wait for it to reach read_ht before starting the read operation.
Now, let's address the question of how we guarantee that all the data written prior to a transaction starting is visible to that transaction. This is done through a mechanism called "read restarts" and a clock skew assumption. If a read request on behalf of a snapshot isolation transaction with a read time read_ht encounters a committed record with a commit timestamp between read_ht and read_ht + max_clock_skew, that record might have been committed prior to the transaction starting (due to clock skew), and we have to restart the read request at the timestamp of that record.
The way this is implemented has some optimizations: the value read_ht + max_clock_skew is only computed once per transaction and does not change with read restarts; we call it global_limit. It is an upper bound on the commit timestamp of any transaction that could have committed prior to our operation starting, and by setting read_ht = global_limit (which is actually suitable in some cases, like long-running reporting queries) we can safely avoid any read restarts. There is also a similar mechanism called local_limit, which limits the number of restarts to one per tablet.
So, with read restarts, we can be sure that a read request will capture all records that were written prior to the transaction starting, even with clock skew.
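To make the read-restart decision concrete, here is a purely illustrative Kotlin sketch (not YugabyteDB code) of how a single committed record might be classified against read_ht and global_limit, using the names from the explanation above; hybrid timestamps are modeled as plain Longs.
sealed class ReadOutcome
object Visible : ReadOutcome()                           // record is part of our snapshot
object NotVisible : ReadOutcome()                        // committed after anything that could precede our txn
data class Restart(val newReadHt: Long) : ReadOutcome()  // ambiguous because of clock skew
fun classifyRecord(commitHt: Long, readHt: Long, globalLimit: Long): ReadOutcome = when {
    commitHt <= readHt -> Visible
    commitHt > globalLimit -> NotVisible                 // globalLimit = initial readHt + max_clock_skew
    else -> Restart(commitHt)                            // re-run the read with readHt advanced to commitHt
}
// Setting readHt = globalLimit up front (e.g. for long-running reporting queries) makes the
// Restart branch unreachable, at the cost of possibly waiting longer for safe time to catch up.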

Flink app's checkpoint size keeps growing

I have a pipeline like this:
env.addSource(kafkaConsumer, name_source)
.keyBy { value -> value.f0 }
.window(EventTimeSessionWindows.withGap(Time.seconds(2)))
.process(MyProcessor())
.addSink(kafkaProducer)
The keys are guaranteed to be unique in the data that is being currently processed.
Thus I would expect the state size not to grow beyond 2 seconds' worth of data.
However, I notice the state size has been steadily growing over the last day (since the app was deployed).
Is this a bug in flink?
using flink 1.11.2 in aws kinesis data analytics.
Kinesis Data Analytics always uses RocksDB as its state backend. With RocksDB, dead state isn't immediately cleaned up, it's merely marked with a tombstone and is later compacted away. I'm not sure how KDA configures RocksDB compaction, but typically it's done when a level reaches a certain size -- and I suspect your state size is still small enough that compaction hasn't occurred.
With incremental checkpoints (which is what KDA does), checkpointing is done by copying RocksDB's SST files -- which in your case are presumably full of stale data. If you let this run long enough you should eventually see a significant drop in checkpoint size, once compaction has been done.
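For comparison, here is a hedged Kotlin sketch of how a self-managed Flink 1.11 job would enable the same combination (RocksDB state backend with incremental checkpoints) that KDA sets up for you; the checkpoint path and interval are placeholders.
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
fun configureStateBackend(): StreamExecutionEnvironment {
    val env = StreamExecutionEnvironment.getExecutionEnvironment()
    // The second constructor argument enables incremental checkpoints, so each checkpoint
    // uploads RocksDB's new SST files -- including ones still full of tombstoned state.
    env.setStateBackend(RocksDBStateBackend("s3://my-bucket/checkpoints", true))
    env.enableCheckpointing(60_000L)
    return env
}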

How flink checkpoints help in failure recovery

My flink job reads from a kafka consumer using FlinkKafkaConsumer010 and sinks into hdfs using CustomBucketingSink. We have a series of transformations: kafka -> flatmaps (2-3 transformations) -> keyBy -> tumblingWindow (5 mins) -> aggregation -> hdfsSink. We have kafka input of 3 million events/min on average and around 20 million events/min at peak. The checkpointing interval and the minimum pause between two checkpoints are both 3 minutes, and I am using FsStateBackend.
Here are my assumptions:
Flink consumes some fixed number of events from kafka (multiple offsets from multiple partitions at once), waits till they reach the sink, and then checkpoints. On success it commits the kafka partition offsets it read and maintains some state related to the hdfs file it was writing. While multiple transformations run after kafka hands events over to other operators, the kafka consumer sits idle until it gets confirmation of success for the events it sent. So we can say that while the sink is writing data to hdfs, all previous operators are sitting idle. In case of failure flink goes back to the previous checkpoint state, points to the last committed kafka partition offsets, and points to the hdfs file offset it should start writing to.
Here are my doubts based on the above assumptions:
1) Is the above assumption correct?
2) Does it make sense for the tumbling window to have state, since in case of failure we anyway start from the last committed kafka partition offset?
3) If the tumbling window does keep state, when will this state be used by flink?
4) Why do checkpoint and savepoint state sizes vary?
5) In case of any failure, flink always starts from the source operator. Right?
Your assumptions are not correct.
(1) Checkpointing does not depend in any way on events or results reaching the sink(s).
(2) Flink does its own Kafka offset management. When restoring from a checkpoint, after a failure, the offsets in the checkpoint are used, not those that may have been committed back to Kafka (see the sketch below).
(3) No operators are ever idle in the way you've described. The pipeline is not stalled by checkpointing.
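To make (2) concrete, here is a hedged Kotlin sketch of a typical FlinkKafkaConsumer010 setup; the topic, group id, and bootstrap servers are placeholders. Offsets committed back to Kafka are informational (useful for monitoring), while a restore always uses the offsets recorded in the checkpoint.
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
fun buildConsumer(): FlinkKafkaConsumer010<String> {
    val props = Properties().apply {
        setProperty("bootstrap.servers", "kafka:9092")
        setProperty("group.id", "my-group")
    }
    val consumer = FlinkKafkaConsumer010("events", SimpleStringSchema(), props)
    // Commit offsets back to Kafka when a checkpoint completes -- monitoring only;
    // on recovery Flink rewinds to the offsets stored in the checkpoint itself.
    consumer.setCommitOffsetsOnCheckpoints(true)
    // Only consulted when there is no checkpoint or savepoint to restore from.
    consumer.setStartFromGroupOffsets()
    return consumer
}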
The best way to understand how checkpointing works is to go through the Flink operations playground, especially the section on Observing Failure and Recovery. This will give you a much clearer understanding of this topic, because you'll be able to observe exactly what's happening.
I can also recommend reading https://ci.apache.org/projects/flink/flink-docs-master/training/fault_tolerance.html, and following the links contained there.
But to walk through how checkpointing works in your application, here are the basic steps:
(1) When the checkpoint coordinator (part of the job manager) decides it's time to initiate another checkpoint, it informs each of the task managers to start checkpoint n.
(2) All of the source instances checkpoint their own state and insert checkpoint barrier n into their outgoing streams. In your case, the sources are Kafka consumers, and they checkpoint the current offset for each partition.
(3) Whenever the checkpoint barrier reaches the head of the input queue in a stateful operator, that operator checkpoints its state and forwards the barrier. This part has some complexity to it -- but basically, the state is held in a multi-version, concurrency controlled hash map. The operator creates a new version n+1 of the state that can be modified by the events behind the checkpoint barrier, and creates a new thread to asynchronously snapshot all the state in version n.
In your case, the window and sink are stateful. The window's state includes the current window contents, the state of the trigger, and other state you're using for window processing, if any.
(4) Sinks use the arrival of the barrier to flush any queued output, and commit pending transactions. Again, there's some complexity here, as transactional sinks use a two-phase commit protocol.
In your application, if the checkpoint interval is much smaller than the window duration, then the sink will complete many checkpoints before ever receiving any output from the window.
(5) When the checkpoint coordinator has heard back from every task that the checkpoint is complete, it finalizes the checkpoint metadata.
During recovery, the state of every operator is reset to the state in the most recent checkpoint. This means that the sources are rewound to the offsets in the checkpoint, and processing resumes with the state in the window and sink corresponding to what it should be after having consumed the events up to those offsets.
Note: To keep this reasonably simple, I've glossed over a bunch of details. Also, FLIP-76 will introduce a new approach to checkpointing.
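For completeness, here is a hedged Kotlin sketch of checkpoint settings matching what the question describes (FsStateBackend, a 3-minute interval, and a 3-minute minimum pause); the checkpoint path is a placeholder.
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
fun configureCheckpointing(): StreamExecutionEnvironment {
    val env = StreamExecutionEnvironment.getExecutionEnvironment()
    env.setStateBackend(FsStateBackend("hdfs:///flink/checkpoints"))
    // Checkpoint barriers are injected every 3 minutes; exactly-once is the default mode.
    env.enableCheckpointing(180_000L, CheckpointingMode.EXACTLY_ONCE)
    // Guarantee at least 3 minutes between the end of one checkpoint and the start of the next.
    env.checkpointConfig.minPauseBetweenCheckpoints = 180_000L
    return env
}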

How do Narayana/XA recover from TM failures?

I was trying to reason about failure recovery actions that can be taken by systems/frameworks which guarantee synchronous data sources. I've been unable to find a clear explanation of Narayana's recovery mechanism.
Q1: Does Narayana essentially employ a 2-phase commit to ensure distributed transactions across 2 datasources?
Q2: Can someone explain Narayana's behavior in this scenario?
Application wants to save X to 2 data stores
Narayana's transaction manager (TM) generates a transaction ID and writes info to disk
TM now sends a prepare message to both data stores
Each data store responds back with prepare_success
TM updates local transaction log and sends a commit message to both data stores
TM fails (permanently). Because of packet loss on the network, one data store never receives the commit message, while the other data store receives and successfully processes it.
The two data stores are now out of sync with each other (one source has an additional transaction that is not present in the other source).
When a new TM is brought up, it does not have access to the old transaction state records. So the TM cannot initiate the recovery of the missing transaction in one of the data stores.
So how can 2PC/Narayana/XA claim that they guarantee distributed transactions that can maintain 2 data stores in sync? From where I stand, they can only maintain synchronous data stores with a very high probability, but they cannot guarantee it.
Q3: Another scenario where I'm unclear on the behavior of the application/framework. Consider the following interleaved transactions (both on the same record - or at least with a partially overlapping set of records):
Di = Data source i
Ti = Transaction i
Pi = prepare message for transaction i
D1 receives P1; responds P1_success
D2 receives P2; responds P2_success
D1 receives P2; responds P2_failure
D2 receives P1; responds P1_failure
The order in which the network packets arrive at the different data sources can determine which prepare request succeeds. Does this not mean that at high transaction speeds for a contentious record - it is possible that all transactions will keep failing (until the record experiences a lower transaction request rate)?
One could argue that we are choosing consistency over availability but unlike ACID systems there is no guarantee that at least one of the transactions will succeed (thus avoiding a potentially long-lasting deadlock).
I would refer you to my article on how Narayana 2PC works
https://developer.jboss.org/wiki/TwoPhaseCommit2PC
To your questions
Q1: As you already mentioned in the comment - yes, Narayana uses 2PC; Narayana implements the XA specification (pubs.opengroup.org/onlinepubs/009680699/toc.pdf).
Q2: The steps in the scenario are not precise. Narayana writes to disk when prepare is called, not when the transaction is started.
Application saves X to 2 data stores
TM now sends a prepare message to both data stores
Each data store responds back with prepare_success
TM saves permanently info about the prepared transaction and its ID to transaction log store
TM sends a commit message to both data stores
...
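To show where these steps are driven from in application code, here is a hedged Kotlin sketch of enlisting two XA resources with Narayana's JTA TransactionManager; the XAConnection parameters and the omitted SQL work are placeholders, and error handling is reduced to a bare rollback.
import com.arjuna.ats.jta.TransactionManager
import javax.sql.XAConnection
import javax.transaction.Status
fun writeToBothStores(store1: XAConnection, store2: XAConnection) {
    val tm = TransactionManager.transactionManager()  // Narayana's JTA transaction manager
    tm.begin()
    try {
        val tx = tm.transaction
        tx.enlistResource(store1.xaResource)
        tx.enlistResource(store2.xaResource)
        // ... perform the actual work on connections obtained from store1 / store2 ...
        // commit() drives the steps above: prepare both resources, write the decision
        // to the transaction log store, then commit both resources.
        tm.commit()
    } catch (e: Exception) {
        // Roll back only if the transaction is still around; a failed commit after a
        // logged decision is left to Narayana's recovery manager to finish.
        if (tm.status != Status.STATUS_NO_TRANSACTION) tm.rollback()
        throw e
    }
}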
I don't agree that 2PC claims to guarantee to maintain 2 data stores in sync.
I was wondering about this too (e.g. asked here https://developer.jboss.org/message/954043).
2PC claims to guarantee ACID properties. Having the 2 stores in sync is more a matter of CAP consistency.
Here Narayana strictly depends on the capabilities of the particular resource managers (the data stores, or the JDBC drivers of the data stores).
ACID declares
atomicity - the whole transaction is committed or rolled back (it says nothing about when that happens, or about the resources being in sync)
consistency - before the transaction starts and after it ends, the system is in a consistent state
durability - everything is stored durably, even when a crash occurs
isolation - (the tricky one, left for the end) - to be ACID we have to be serializable; that is, you can observe transactions happening "one by one"
To take a much simplified example to show my point - assume the DB is implemented in a naive way that locks the whole database when a transaction starts - you committed the JMS message, that's processed, and now the DB record is not committed yet. When the DB works at the serializable isolation level (which is what ACID requires!), then your next write/read operation has to wait until the 'in-flight prepared' transaction is resolved. The DB is just stuck and waiting. If you read, you won't get an answer, so you can't say what the value is.
Narayana's recovery manager then comes to that prepared transaction once the connection is re-established and commits it. Your read action then returns information that is 'correct'.
Q3: I don't understand the question, sorry. But if you state that "the order in which the network packets arrive at the different data sources can determine which prepare request succeeds", then you are right - you are doomed to get failing transactions until the network becomes more stable.

How can I control log switches and checkpoint frequencies?

What are the differences between LOG_CHECKPOINT_INTERVAL and LOG_CHECKPOINT_TIMEOUT? I need a clear picture of volume based intervals and time based interval. What are the relations among LOG_CHECKPOINT_TIMEOUT,LOG_CHECKPOINT_INTERVAL and FAST_START_IO_TARGET?
A checkpoint is when the database synchronizes the dirty blocks in the buffer cache with the datafiles. That is, it writes changed data to disk. The two LOG_CHECKPOINT parameters you mention govern how often this activity occurs.
The heart of the matter is: if the checkpoint occurs infrequently it will take longer to recover the database in the event of a crash, because it has to apply lots of data from the redo logs. On the other hand, if the checkpoint occurs too often the database can be tied up as various background processes become a bottleneck.
The difference between the two is that the INTERVAL specifies the maximum amount of redo blocks which can exist between checkpoints and the TIMEOUT specifies the maximum number of seconds between checkpoints. We need to set both parameters to cater for spikes of heavy activity. Note that LOG_CHECKPOINT_INTERVAL is measured in OS blocks not database blocks.
FAST_START_IO_TARGET is a different proposition. It specifies a target for the number of I/Os required to recover the database. The database then manages its checkpoints intelligently to achieve this target. Again, this is a trade-off between recovery times and the amount of background activity, although the impact on normal processing should be less than with badly set LOG_CHECKPOINT parameters. This parameter is only available with the Enterprise Edition. It was deprecated in 9i in favour of FAST_START_MTTR_TARGET, and Oracle removed it in 10g. There is a view V$MTTR_TARGET_ADVICE which, er, provides advice on setting the FAST_START_MTTR_TARGET.
We should set either the FAST_START%TARGET or the LOG_CHECKPOINT_% parameters but not both. Setting the LOG_CHECKPOINT_INTERVAL will override the setting of FAST_START_MTTR_TARGET.
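If it helps to see the MTTR-based approach in practice, here is a hedged Kotlin/JDBC sketch that sets FAST_START_MTTR_TARGET and dumps V$MTTR_TARGET_ADVICE; the connection details and target value are placeholders, and ALTER SYSTEM requires the appropriate privilege.
import java.sql.DriverManager
fun tuneCheckpointing(jdbcUrl: String, user: String, password: String) {
    DriverManager.getConnection(jdbcUrl, user, password).use { conn ->
        conn.createStatement().use { stmt ->
            // Target roughly 60 seconds of crash recovery; Oracle then paces checkpoints itself,
            // instead of relying on LOG_CHECKPOINT_INTERVAL / LOG_CHECKPOINT_TIMEOUT.
            stmt.execute("ALTER SYSTEM SET FAST_START_MTTR_TARGET = 60 SCOPE = BOTH")
            // Dump the advice view to see the estimated I/O cost of other target values.
            stmt.executeQuery("SELECT * FROM V\$MTTR_TARGET_ADVICE").use { rs ->
                val meta = rs.metaData
                while (rs.next()) {
                    println((1..meta.columnCount).joinToString(", ") {
                        "${meta.getColumnName(it)}=${rs.getString(it)}"
                    })
                }
            }
        }
    }
}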
