Flink dataset api delivery guarantee and checkpointing - apache-flink

The Flink documentation mentions delivery guarantees of exactly-once or at-least-once for the DataStream API; however, I found no reference to the same for the DataSet API.
Are messages guaranteed to be delivered exactly once to all transformations in DataSets? Further, in the absence of a checkpoint mechanism, is the only logical recourse to start the job from the beginning?
Can I use the DataStream API for a batch job, and what would I lose?

Fault tolerance for the DataSet API is described here, and yes, it is based on retrying the failed job.
You certainly can use the DataStream API for finite (batch) jobs. There are a few features that are only present in the batch API, such as the machine learning and graph libraries, and the DataSet API has some optimizations that aren't available for DataStreams, but for many applications the differences aren't significant.

Related

Canonical way of retrying in Flink operators

I have a couple of Flink jobs that receive data from a series of Kafka topics, do some aggregation, and publish the result into a Kafka topic.
The aggregation part is what gets somewhat difficult. I have to retrieve some information from several HTTP endpoints and put together the responses in a particular format. The problem is that some of those outbound HTTP calls occasionally time out, so I need a way to retry them.
I was wondering if there is a canonical way to do such a task within Flink operators, without doing something entirely manually. If not, what would be a recommended approach?
In a bit more than a month you'll have Flink 1.16 available with retry support in AsyncIO:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#retry-support
That is probably your best option. In the meantime, one option is to use AsyncIO configured with a long timeout and handle the retries yourself in asyncInvoke.
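Handling the retries yourself could look roughly like the sketch below. It is plain Java with no Flink dependency: `AsyncRetry` and `withRetry` are hypothetical names, and the fixed delay and attempt limit are assumptions. Inside `asyncInvoke`, the returned future would be used to complete the `ResultFuture`, and the AsyncIO timeout would need to be long enough to cover all attempts.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper, not part of Flink: retries an async call a bounded number of times.
final class AsyncRetry {
    private AsyncRetry() {}

    /** Runs the call, retrying with a fixed delay until it succeeds or maxAttempts is reached. */
    static <T> CompletableFuture<T> withRetry(
            Supplier<CompletableFuture<T>> call, int maxAttempts,
            long delayMillis, ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(call, 1, maxAttempts, delayMillis, scheduler, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> call, int attemptNo, int maxAttempts,
            long delayMillis, ScheduledExecutorService scheduler,
            CompletableFuture<T> result) {
        CompletableFuture<T> future;
        try {
            future = call.get();
        } catch (RuntimeException e) {
            // Treat a synchronous failure like a failed future.
            future = new CompletableFuture<>();
            future.completeExceptionally(e);
        }
        future.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (attemptNo >= maxAttempts) {
                result.completeExceptionally(error); // out of attempts: surface the last error
            } else {
                scheduler.schedule(
                        () -> attempt(call, attemptNo + 1, maxAttempts, delayMillis, scheduler, result),
                        delayMillis, TimeUnit.MILLISECONDS);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        // Stand-in for an HTTP call that times out twice and then succeeds.
        Supplier<CompletableFuture<String>> flaky = () -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            if (calls.incrementAndGet() < 3) {
                f.completeExceptionally(new RuntimeException("timeout"));
            } else {
                f.complete("ok");
            }
            return f;
        };
        System.out.println(withRetry(flaky, 5, 10, scheduler).get(5, TimeUnit.SECONDS)); // prints "ok"
        scheduler.shutdown();
    }
}
```

Scheduling the next attempt on a separate executor, rather than sleeping, keeps the calling thread free, which is the point of using AsyncIO in the first place.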

Flink pipeline without a data sink with checkpointing on

I am researching how to build a Flink pipeline without a data sink, i.e. my pipeline ends when it makes a successful API call to a datastore.
In that case, if we don't use a sink operator, how will checkpointing work?
Checkpointing is based on the concept of a pre-checkpoint epoch (all events that are persisted in state or emitted into sinks) and a post-checkpoint epoch. Is having a sink required for a Flink pipeline?
Yes, sinks are required as part of Flink's execution model:
DataStream programs in Flink are regular programs that implement
transformations on data streams (e.g., filtering, updating state,
defining windows, aggregating). The data streams are initially created
from various sources (e.g., message queues, socket streams, files).
Results are returned via sinks, which may for example write the data
to files, or to standard output (for example the command line
terminal)
One could argue that the call to your datastore is the actual sink implementation. You could define your own sink and execute the datastore call there.
I don't know the details of your datastore, but presumably you are serializing these events and sending them to the datastore in some way. In that case, you could send all your elements to the sink operator and store each of them in a ListState, which you continuously offload and send. This way, if your application needs to be upgraded, in-flight records will not be lost; they will be recovered and sent once the job has been restored.
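The buffering idea can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `BufferingWriter` is a hypothetical class, the `sender` callback stands in for the datastore call, and `snapshot`/`restore` stand in for what a real sink would do with ListState when a checkpoint is taken and when the job is restored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java stand-in for a buffering sink; in Flink the buffer would be kept in ListState.
final class BufferingWriter<T> {
    private final int threshold;
    private final Consumer<List<T>> sender; // hypothetical datastore call
    private final List<T> buffer = new ArrayList<>();

    BufferingWriter(int threshold, Consumer<List<T>> sender) {
        this.threshold = threshold;
        this.sender = sender;
    }

    /** Buffers the element and sends a batch once the threshold is reached. */
    void write(T element) {
        buffer.add(element);
        if (buffer.size() >= threshold) {
            flush();
        }
    }

    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Copy of the unsent elements, i.e. what would be written to state on a checkpoint. */
    List<T> snapshot() {
        return new ArrayList<>(buffer);
    }

    /** Reloads unsent elements after a restart. */
    void restore(List<T> saved) {
        buffer.clear();
        buffer.addAll(saved);
    }

    public static void main(String[] args) {
        List<List<String>> sent = new ArrayList<>();
        BufferingWriter<String> before = new BufferingWriter<>(3, sent::add);
        before.write("a");
        before.write("b");
        List<String> checkpoint = before.snapshot(); // job stops here with two unsent records

        BufferingWriter<String> after = new BufferingWriter<>(3, sent::add);
        after.restore(checkpoint);
        after.write("c"); // threshold reached: the recovered records are sent with the new one
        System.out.println(sent); // prints [[a, b, c]]
    }
}
```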

Chaining Flink Sinks

Background
I am new to Flink and come from Apache Storm background
Working on developing a lossless gRPC sink
Crux
A finite number of retries will be made, based on the error codes returned by the gRPC endpoint
After that, the data will be flushed to a Kafka queue for offline processing
The decision to retry will be based on the returned error code.
Problem
Is it possible to chain another sink so that the response (success or error) is also available downstream for any customized processing?
Answer is as per the comment by Dominik Wosiński
It's not possible in general. You will have to work around that, either by providing both functionalities in a single sink or by using an existing function such as AsyncIO to write to gRPC and then sinking the failures to Kafka, but that may be harder if you need strong guarantees.
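The "both functionalities in a single sink" option might be structured like the following plain-Java sketch. Everything here is an assumption made for illustration: `RetryThenFallback` is a hypothetical class, `primary` stands in for the gRPC call and returns a status code as a string, and `fallback` stands in for a Kafka producer. Returning the final code is what would let an AsyncIO-style operator emit the response downstream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper: bounded retries on retryable codes, then hand-off to a fallback.
final class RetryThenFallback<T> {
    private final Function<T, String> primary; // stand-in for the gRPC call, returns a status code
    private final Consumer<T> fallback;        // stand-in for a Kafka producer
    private final Set<String> retryableCodes;
    private final int maxRetries;

    RetryThenFallback(Function<T, String> primary, Consumer<T> fallback,
                      Set<String> retryableCodes, int maxRetries) {
        this.primary = primary;
        this.fallback = fallback;
        this.retryableCodes = retryableCodes;
        this.maxRetries = maxRetries;
    }

    /** Writes the record and returns the final status code so the caller can forward it. */
    String write(T record) {
        String code = primary.apply(record);
        int retries = 0;
        while (!"OK".equals(code) && retryableCodes.contains(code) && retries < maxRetries) {
            retries++;
            code = primary.apply(record);
        }
        if (!"OK".equals(code)) {
            fallback.accept(record); // non-retryable error or retries exhausted
        }
        return code;
    }

    public static void main(String[] args) {
        List<String> deadLetters = new ArrayList<>();
        int[] attempts = {0};
        RetryThenFallback<String> writer = new RetryThenFallback<String>(
                record -> { attempts[0]++; return "UNAVAILABLE"; }, // endpoint that never recovers
                deadLetters::add,
                new HashSet<>(Arrays.asList("UNAVAILABLE", "DEADLINE_EXCEEDED")),
                3);
        String code = writer.write("event-1");
        // prints: UNAVAILABLE after 4 attempts, dead letters: [event-1]
        System.out.println(code + " after " + attempts[0] + " attempts, dead letters: " + deadLetters);
    }
}
```

Note that this sketch retries synchronously and says nothing about delivery guarantees across failures; as the answer points out, strong guarantees are the hard part.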

How can I access states computed from an external Flink job ? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring each transaction with a machine learning model.
For now, the features are all kept in Flink state and the same job scores the enriched transaction. But I'd like to separate the feature computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is separating this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you could use Flink's REST API to determine the job id.
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

Using Apache Flink for data streaming

I am working on building an application with the requirements below, and I am just getting started with Flink.
Ingest data into Kafka with, say, 50 partitions (incoming rate: 100,000 msgs/sec)
Read data from Kafka and process each message (do some computation, compare with old data, etc.) in real time
Store the output in Cassandra
I was looking for a real-time streaming platform and found Flink to be a great fit for both real-time and batch processing.
Do you think Flink is the best fit for my use case, or should I use Storm, Spark Streaming, or another streaming platform?
Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there another way to perform a sequence of steps for real-time streaming?
If each computation takes about 20 milliseconds, how can I design it with Flink to get better throughput?
Can I use Redis or Cassandra to fetch some data within Flink for each computation?
Will I be able to use a JVM in-memory cache inside Flink?
Also, can I aggregate data based on a key for some time window (for example, 5 seconds)? For example, say 100 messages come in and 10 of them have the same key: can I group all messages with the same key together and process them?
Are there any tutorials on best practices using flink?
Thanks and appreciate all your help.
Given your task description, Apache Flink looks like a good fit for your use case.
In general, Flink provides low latency and high throughput and has a parameter to tune the trade-off between them. You can read and write data from and to Redis or Cassandra. However, you can also store state internally in Flink. Flink also has sophisticated support for windows. You can read the blog on the Flink website, check out the documentation for more information, or follow this Flink training to learn the API.
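To make the windowing question concrete, the sketch below shows in plain Java what grouping by key within 5-second tumbling windows means. It is not Flink code: `KeyedTumblingWindows`, `Event`, and `group` are hypothetical names, and non-negative millisecond timestamps are assumed. In Flink the same grouping is expressed by keying the stream and applying a tumbling window, with the framework managing the state and the timing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of keyed tumbling windows; names are hypothetical.
final class KeyedTumblingWindows {
    private KeyedTumblingWindows() {}

    static final class Event {
        final String key;
        final long timestampMillis;
        final String payload;

        Event(String key, long timestampMillis, String payload) {
            this.key = key;
            this.timestampMillis = timestampMillis;
            this.payload = payload;
        }
    }

    /** Groups payloads by window start time, then by key (assumes non-negative timestamps). */
    static Map<Long, Map<String, List<String>>> group(List<Event> events, long windowMillis) {
        Map<Long, Map<String, List<String>>> windows = new TreeMap<>();
        for (Event e : events) {
            // Each timestamp belongs to exactly one window: [windowStart, windowStart + windowMillis).
            long windowStart = e.timestampMillis - (e.timestampMillis % windowMillis);
            windows.computeIfAbsent(windowStart, w -> new TreeMap<>())
                   .computeIfAbsent(e.key, k -> new ArrayList<>())
                   .add(e.payload);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("a", 1000, "m1"),
                new Event("b", 2000, "m2"),
                new Event("a", 4999, "m3"),
                new Event("a", 5000, "m4"));
        // prints {0={a=[m1, m3], b=[m2]}, 5000={a=[m4]}}
        System.out.println(group(events, 5000));
    }
}
```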
