Can I somehow find out which offsets the Kafka consumers of my Flink job were the exact moment I took a savepoint? I assume I could hack up something using the state processor api, peering into the internal state of the Kafka consumer. But I'd like to know whether I've missed a nicer way, which doesn't rely on implementation details.
Related
I'm using MQTT consumer as my flink job's data source. I'm wondering how to save the data offsets into checkpoint to ensure that no data lost when flink cluster restarts after a failure. I've see lots of articles introducing how apache flink manages kafka consumer offsets. Does anyone know whether apache flink has its own function to manage MQTT consumer? Thanks.
If you have a MQTT consumer, you should make sure it uses the Data Source API. You can read about that on https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/sources/ - That also includes how to work integrate with checkpointing. You can also read the details in FLIP-27 https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
You shoul read state-backends part of documentation. And checkpoints section.
When checkpointing is enabled, managed state is persisted to ensure
consistent recovery in case of failures. Where the state is persisted
during checkpointing depends on the chosen Checkpoint Storage.
I am new to Flink. How to know what can be the production cluster requirements for flink. And how to decide the job memory, task memory and task slots for each job execution in yarn cluster mode.
For ex- I have to process around 600-700 million records each day using datastream as it's a real time data.
There's no one-size-fits-all answer to these questions; it depends. It depends on the sort of processing you are doing with these events, whether or not you need to access external resources/services in order to process them, how much state you need to keep and the access and update patterns for that state, how frequently you will checkpoint, which state backend you choose, etc, etc. You'll need to do some experiments, and measure.
See How To Size Your Apache Flink® Cluster: A Back-of-the-Envelope Calculation for an in-depth introduction to this topic. https://www.youtube.com/watch?v=8l8dCKMMWkw is also helpful.
Before calling execute on the StreamExecutionEnvironment and starting the stream job, is there a way to programmatically find out whether or not the job was restored from a savepoint? I need to know such information so that I can set the offset of a Kafka source depending on it while building the job graph.
It seems that the FlinkConnectorKafkaBase class which has a method initializeState has access to such information (code). However, there is no way to intercept the FunctionInitializationContext and retrieve the isRestored() value since initializeState is a final method. Also, the initializeState method gets called after the job graph is executed and so I don't think there is a feasible solution associated to it.
Another attempt I made was to find a Flink job parameter that indicates whether or not the job was started from a savepoint. However, I don't think such parameter exists.
You can get the effect you are looking for by simply doing this:
FlinkKafkaConsumer<String> myConsumer = new FlinkKafkaConsumer<>(...);
myConsumer.setStartFromEarliest();
If you use setStartFromEarliest then Flink will ignore the offsets stored in Kafka, and instead begin reading from the earliest record. Moreover, even if you use setStartFromEarliest, if Flink is resuming from a checkpoint or savepoint, it will instead use the offsets stored in that snapshot.
Note that Flink does its own Kafka offset management, and when recovering from a checkpoint ignores the offsets stored in Kafka. Flink does this as a part of providing exactly-once guarantees, which requires knowing exactly how much of the input was consumed to produce the results present in the rest of the state captured in a checkpoint or savepoint. For this reason, Flink always stores the offsets as part of every state snapshot (checkpoint or savepoint).
This is documented here and here.
As for your original question about initializeState, this is available if you implement the CheckpointedFunction interface, but it's quite rare to actually need this.
I have a Flink Application that reads some events from Kafka, does some enrichment of the data from MySQL, buffers the data using a window function and writes the data inside a window to HBase. I've currently enabled checkpointing, but it turns out that the checkpointing is quite expensive and over time it takes longer and longer and affects my job's latency (falling behind on kafka ingest rate). If I figure out a way to make my HBase writes idempotent, is there a strong reason for me to use checkpointing? I can just configure the internal kafka consumer client to commit every so often right?
If the only thing you are checkpointing is the Kafka provider offset(s), then it would surprise me that the checkpointing time is significant enough to slow down your workflow. Or is state being saved elsewhere as well? If so, you could skip that (as long as, per your note, the HBase writes are idempotent).
Note that you can also adjust the checkpointing interval, and (if need be) use incremental checkpoints with RocksDB.
I'm new to Flink and I'm currently testing the framework for a usecase consisting in enriching transactions coming from Kafka with a lot of historical features (e.g number of past transactions between same source and same target), then score this transaction with a machine learning model.
For now, features are all kept in Flink states and the same job is scoring the enriched transaction. But I'd like to separate the features computation job from the scoring job and I'm not sure how to do this.
The queryable state doesn't seem to fit for this, as the job id is needed, but tell me if I'm wrong !
I've thought about querying directly RocksDB but maybe there's a more simple way ?
Is the separation in two jobs for this task a bad idea with Flink ? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check if it has any positive impact on latency)
Some extra information : I'm using Flink 1.3 (but willing to upgrade if it's needed) and the code is written in Scala
Thanks in advance for your help !
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you could use Flink's REST api to determine the job id.
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.