I have a couple of Flink jobs that receive data from a series of Kafka topics, do some aggregation, and publish the result into a Kafka topic.
The aggregation part is what gets somewhat difficult. I have to retrieve some information from several HTTP endpoints and put the responses together in a particular format. The problem is that some of those outbound HTTP calls occasionally time out, so I need a way to retry them.
I was wondering if there is a canonical way to do such a task within Flink operators, without implementing something entirely manually. If not, what would a recommended approach be?
In a bit more than a month you'll have Flink 1.16 available with retry support in AsyncIO:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#retry-support
That is probably your best option. In the meantime, you can use AsyncIO, configure it with long timeouts, and handle the retries yourself inside asyncInvoke.
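For reference, here is roughly what the 1.16 retry support looks like, following the retry section of the linked docs. Treat it as a sketch: `events`, the `Event`/`EnrichedEvent` types, and `HttpEnrichmentFunction` are placeholders for your own stream and async HTTP lookup.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.AsyncRetryStrategy;
import org.apache.flink.streaming.util.retryable.AsyncRetryStrategies;
import org.apache.flink.streaming.util.retryable.RetryPredicates;

// Retry up to 3 times with a fixed 1s delay whenever the async call finished
// exceptionally or came back with an empty result.
AsyncRetryStrategy retryStrategy =
        new AsyncRetryStrategies.FixedDelayRetryStrategyBuilder(3, 1000L)
                .ifResult(RetryPredicates.EMPTY_RESULT_PREDICATE)
                .ifException(RetryPredicates.HAS_EXCEPTION_PREDICATE)
                .build();

DataStream<EnrichedEvent> enriched = AsyncDataStream.unorderedWaitWithRetry(
        events,                        // DataStream<Event> coming from the Kafka sources
        new HttpEnrichmentFunction(),  // your AsyncFunction doing the HTTP calls
        30, TimeUnit.SECONDS,          // total timeout, covering all retry attempts
        100,                           // max in-flight requests
        retryStrategy);
```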
Related
Background
I am new to Flink and come from an Apache Storm background.
I am working on developing a lossless gRPC sink.
Crux
A finite number of retries will be made based on the error codes returned by the gRPC endpoint.
After that, the data will be flushed to a Kafka queue for offline processing.
The decision to retry will be based on the returned error code.
Problem
Is it possible to chain another sink so that the response (success or error) is also available downstream for any customized processing?
The answer is as per the comment by Dominik Wosiński:
It's not possible in general; you will have to work around that, either by providing both functionalities in a single sink or by using some existing functions like AsyncIO to write to gRPC and then sink the failures to Kafka, but that may be harder if you need any strong guarantees.
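A rough sketch of that AsyncIO workaround is below. The gRPC stub (MyServiceGrpc), the Payload type, toRequest, and the Kafka failure sink are all placeholders, so treat this as an outline rather than a drop-in implementation: the async function tags every record with the call's outcome instead of failing the job, and a downstream filter flushes the failures to Kafka.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Calls the gRPC endpoint and tags each record with the outcome (true = success).
public class GrpcAsyncFunction extends RichAsyncFunction<Payload, Tuple2<Payload, Boolean>> {

    private transient ManagedChannel channel;
    private transient MyServiceGrpc.MyServiceBlockingStub stub;
    private transient ExecutorService executor;

    @Override
    public void open(Configuration parameters) {
        channel = ManagedChannelBuilder.forAddress("grpc-host", 9090).usePlaintext().build();
        stub = MyServiceGrpc.newBlockingStub(channel);
        executor = Executors.newFixedThreadPool(10);
    }

    @Override
    public void asyncInvoke(Payload input, ResultFuture<Tuple2<Payload, Boolean>> resultFuture) {
        CompletableFuture
                .supplyAsync(() -> stub.process(toRequest(input)), executor)
                .whenComplete((response, error) ->
                        // On error, emit the record tagged as failed instead of failing the job;
                        // a bounded retry based on the returned error code could be added here.
                        resultFuture.complete(
                                Collections.singleton(Tuple2.of(input, error == null))));
    }

    @Override
    public void close() {
        executor.shutdownNow();
        channel.shutdownNow();
    }
}

// Wiring (inside the job's main method): split the tagged stream and
// flush the failures to a Kafka sink for offline processing.
DataStream<Tuple2<Payload, Boolean>> tagged = AsyncDataStream.unorderedWait(
        records, new GrpcAsyncFunction(), 10, TimeUnit.SECONDS, 100);

tagged.filter(t -> !t.f1)              // failed gRPC calls
      .map(t -> t.f0)
      .returns(Payload.class)          // help Flink's type extraction for the lambda
      .sinkTo(failedRecordsKafkaSink); // e.g. a KafkaSink<Payload> for the offline topic
```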
I am trying to build a Flink job that reads data from a Kafka source, does a bunch of processing including a few REST calls, and then finally sinks into another Kafka topic.
The problem I am trying to address is that of message retries. What if there are transient errors in the REST API? How can I do exponential-backoff-based retries of these messages, the way Storm supports?
I have two approaches that I could think of:
Use TimerService, but then in case of failures the state will start to grow uncontrollably.
Write failed messages to a different Kafka topic and process them with a delay of sorts, but here a problem can arise if the sink itself is down for a few minutes.
Is there a better, more robust, and simpler way to achieve this?
I would use Flink's AsyncFunction to make the REST calls. If needed, it will backpressure the source(s) rather than use more than a configured amount of state. For retries, see AsyncFunction retries.
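To make the backpressure point concrete, here is a rough sketch of retries with exponential backoff inside asyncInvoke, assuming a Java 11 HttpClient and a hypothetical Event type with a buildRequest helper. While a retry is pending, the record keeps occupying one of the async operator's in-flight slots, so the source gets throttled instead of state accumulating.

```java
import java.net.http.HttpClient;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class RetryingRestFunction extends RichAsyncFunction<Event, String> {

    private static final int MAX_ATTEMPTS = 5;

    private transient HttpClient client;
    private transient ScheduledExecutorService scheduler;

    @Override
    public void open(Configuration parameters) {
        client = HttpClient.newHttpClient();
        scheduler = Executors.newSingleThreadScheduledExecutor();
    }

    @Override
    public void asyncInvoke(Event event, ResultFuture<String> resultFuture) {
        attempt(event, resultFuture, 1);
    }

    private void attempt(Event event, ResultFuture<String> resultFuture, int attempt) {
        client.sendAsync(buildRequest(event), HttpResponse.BodyHandlers.ofString())
              .whenComplete((response, error) -> {
                  if (error == null && response.statusCode() == 200) {
                      resultFuture.complete(Collections.singleton(response.body()));
                  } else if (attempt >= MAX_ATTEMPTS) {
                      // Give up: fail the record (triggers a job restart), or instead emit a
                      // marker record that a downstream operator routes to a dead-letter topic.
                      resultFuture.completeExceptionally(new RuntimeException(
                              "REST call failed after " + attempt + " attempts", error));
                  } else {
                      long delayMs = 100L * (1L << attempt); // 200ms, 400ms, 800ms, ...
                      scheduler.schedule(() -> attempt(event, resultFuture, attempt + 1),
                                         delayMs, TimeUnit.MILLISECONDS);
                  }
              });
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```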
I could not find any answer to my question on the web so far, so I thought it would be good to ask here.
I know that Apache Flink is asynchronous by design, but I was wondering if there is any project or design that aims to build a synchronous pipeline with Flink.
By a synchronous response I mean, for example, having an API endpoint that I send my data to, the processing is done by Flink, and the outcome of the processing is returned (in whatever form) in the body of the response to that API call, e.g. with a 200.
I already looked into RabbitMQ RPC but I was not able to successfully implement it.
I'm happy for any direction or suggestion.
Thanks,
Jon
The closest thing that comes to mind is deploying a Flink job with the TcpSource available in Apache Bahir. You could have an HTTP endpoint that receives some data, sends it to Flink on the specified address, lets the job process it, and builds a response. The problem is that only a TcpSource is available in Bahir, which means you would need to write a large part of the code (the whole sink) yourself.
There are also other ways of doing that (like assigning an ID to each message, then waiting for the message with that ID to arrive on Kafka and sending it back as the response, but that seems troublesome and error-prone).
Another way would be to make the response asynchronous (I know the question specifically asks for a synchronous response, but I'm mentioning this just for the sake of completeness).
However, I would like to say that this seems like a misuse of Flink to me. Flink was primarily designed to allow real-time computations on multiple nodes, which doesn't seem to be the case here. I would suggest looking into different streaming libraries that are much more lightweight, easier to compose, and can offer the functionality you want out of the box. You may want to take a look at Akka Streams, for example.
I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring the transaction with a machine learning model.
For now, the features are all kept in Flink state and the same job scores the enriched transaction. But I'd like to separate the feature-computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit for this, as the job ID is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is splitting this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid overly complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
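Sketched with the current Kafka connector API, the hand-off is just a sink in the feature job and a source in the scoring job. The topic name, bootstrap servers, string serialization, enrichedTransactions, env, ScoringFunction, and scoredTransactionsSink are all placeholders, and on Flink 1.3 you would use the version-specific FlinkKafkaConsumer/Producer classes instead.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

// Feature job: write the enriched transactions (serialized as strings here) to a topic.
KafkaSink<String> featuresSink = KafkaSink.<String>builder()
        .setBootstrapServers("kafka:9092")
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("enriched-transactions")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        .build();
enrichedTransactions.sinkTo(featuresSink);

// Scoring job: consume the same topic and apply the model.
KafkaSource<String> featuresSource = KafkaSource.<String>builder()
        .setBootstrapServers("kafka:9092")
        .setTopics("enriched-transactions")
        .setGroupId("scoring-job")
        .setStartingOffsets(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
env.fromSource(featuresSource, WatermarkStrategy.noWatermarks(), "enriched-transactions")
   .map(new ScoringFunction())        // hypothetical model-scoring MapFunction
   .sinkTo(scoredTransactionsSink);
```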
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you could use Flink's REST API to determine the job ID.
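If you go that route, a hedged sketch with the newer (1.5+) queryable state client could look like the following. The job ID comes from a GET against the REST API (e.g. /jobs/overview on recent versions), and the host, port, state name, key, and descriptor here are illustrative; the name and descriptor must match whatever the feature job registered via setQueryable(...).

```java
import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.queryablestate.client.QueryableStateClient;

// Job ID looked up via the REST API, e.g. GET http://jobmanager:8081/jobs/overview,
// picking the entry whose "name" matches the feature job.
JobID jobId = JobID.fromHexString(jobIdFromRestApi);

// 9069 is the default queryable-state proxy port on the task managers.
QueryableStateClient client = new QueryableStateClient("taskmanager-host", 9069);

ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("txn-count", Types.LONG);

// "feature-store" must be the name passed to stateDescriptor.setQueryable("feature-store")
// inside the feature job; the key is the same key the feature job keys by.
CompletableFuture<ValueState<Long>> result =
        client.getKvState(jobId, "feature-store", "source->target", Types.STRING, descriptor);

result.thenAccept(state -> {
    try {
        System.out.println("past transactions: " + state.value());
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
});
```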
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.
The Flink documentation mentions delivery guarantees of exactly-once or at-least-once for the DataStream API; however, I found no reference to the same for the DataSet API.
Are messages guaranteed to be delivered exactly once to all transformations in DataSets? Further, in the absence of a checkpointing mechanism, is the only logical recourse to restart the job from the beginning?
Can I use the DataStream API for batch jobs, and what would I lose?
Fault tolerance for the DataSet API is described here, and yes, it is based on retrying the failed job.
You certainly can use the DataStream API for finite (batch) jobs. There are a few features that are only present in the batch API, such as the machine learning and graph libraries, and the DataSet API has some optimizations that aren't available for DataStreams, but for many applications the differences aren't significant.
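As a tiny illustration of that last point, newer Flink versions (1.12+) even let you run a DataStream program explicitly in batch execution mode when all inputs are bounded; on older versions the same program simply runs as a streaming job that terminates once the bounded input is exhausted. The word count below is just a toy example of a finite DataStream job.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Bounded inputs + batch mode: batch-style scheduling, no checkpointing needed.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("a", "b", "a", "c")
           .map(word -> Tuple2.of(word, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("word-count-as-bounded-datastream-job");
    }
}
```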