Background
I am new to Flink and come from an Apache Storm background.
I am working on developing a lossless gRPC sink.
Crux
A finite number of retries will be made, based on the error codes returned by the gRPC endpoint
After that, the data will be flushed to a Kafka queue for offline processing
The decision to retry will be based on the returned error code.
Problem
Is it possible to chain another sink so that the response (success or error) is also available downstream for any customized processing?
The answer is as per the comment by Dominik Wosiński:
It's not possible in general; you will have to work around that, either by providing both functionalities in a single sink or using some existing functions like AsyncIO to write to gRPC and then sink the failures to Kafka, but that may be harder if you need any strong guarantees.
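To make the second option from that comment concrete, here is a minimal sketch (not a confirmed implementation) of an AsyncFunction that calls the gRPC endpoint, retries a bounded number of times with exponential backoff, and emits a success/failure flag so a downstream filter can route exhausted records to a Kafka sink. The callGrpc() and isRetryable() methods and the event type are placeholders for your generated stub and your error-code policy.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Sketch only: callGrpc() and isRetryable() stand in for your generated
// gRPC stub and your "which error codes are worth retrying" policy.
public abstract class RetryingGrpcAsyncFunction<IN>
        extends RichAsyncFunction<IN, Tuple2<IN, Boolean>> {

    private static final int MAX_RETRIES = 3;
    private transient ScheduledExecutorService scheduler;

    @Override
    public void open(Configuration parameters) {
        scheduler = Executors.newSingleThreadScheduledExecutor();
    }

    @Override
    public void asyncInvoke(IN event, ResultFuture<Tuple2<IN, Boolean>> resultFuture) {
        attempt(event, 0, resultFuture);
    }

    private void attempt(IN event, int tryCount, ResultFuture<Tuple2<IN, Boolean>> resultFuture) {
        callGrpc(event).whenComplete((response, error) -> {
            if (error == null) {
                resultFuture.complete(Collections.singleton(Tuple2.of(event, true)));
            } else if (isRetryable(error) && tryCount < MAX_RETRIES) {
                long backoffMs = 100L << tryCount; // exponential backoff between attempts
                scheduler.schedule(
                        () -> attempt(event, tryCount + 1, resultFuture),
                        backoffMs, TimeUnit.MILLISECONDS);
            } else {
                // Retries exhausted: emit a "failed" record instead of failing the job,
                // so it can be filtered out and flushed to Kafka for offline processing.
                resultFuture.complete(Collections.singleton(Tuple2.of(event, false)));
            }
        });
    }

    /** Asynchronous call to the gRPC endpoint, e.g. via a generated future stub. */
    protected abstract CompletableFuture<?> callGrpc(IN event);

    /** Decide from the returned gRPC error code whether another attempt makes sense. */
    protected abstract boolean isRetryable(Throwable error);
}
```

Downstream, a simple filter on the boolean flag sends the failed records to the Kafka sink. As the comment notes, this gives no strong transactional guarantee tying the gRPC call and the Kafka write together.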
Related
I have a couple of Flink jobs that receive data from a series of Kafka topics, do some aggregation, and publish the result into a Kafka topic.
The aggregation part is what gets somewhat difficult. I have to retrieve some information from several HTTP endpoints and put together the responses in a particular format. The problem is that some of those outbound HTTP calls time out sometimes, so I need a way to retry them.
I was wondering if there is a canonical way to do such a task within Flink operators, without doing something entirely manually. If not, what could be a recommended approach?
In a bit more than a month you'll have Flink 1.16 available with retry support in AsyncIO:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#retry-support
That is probably your best option. In the meantime, using AsyncIO but configuring it to support long timeouts and handling the retries yourself in asyncInvoke may be an option.
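For reference, the retry support on that page looks roughly like this (adapted from the linked documentation; events, httpLookup, and Enriched are placeholders for your own stream, AsyncFunction, and result type):

```java
// Retry up to 3 times with a fixed 100 ms delay whenever the async call
// throws or completes with an empty result.
AsyncRetryStrategy<Enriched> retryStrategy =
        new AsyncRetryStrategies.FixedDelayRetryStrategyBuilder<Enriched>(3, 100L)
                .ifResult(RetryPredicates.EMPTY_RESULT_PREDICATE)
                .ifException(RetryPredicates.HAS_EXCEPTION_PREDICATE)
                .build();

// Apply the async I/O transformation with retry:
// the timeout is the total budget per element (including retries),
// with at most 100 requests in flight.
DataStream<Enriched> enriched =
        AsyncDataStream.unorderedWaitWithRetry(
                events, httpLookup, 30, TimeUnit.SECONDS, 100, retryStrategy);
```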
I have a working Flink job built on the Flink DataStream API. I want to REWRITE the entire job based on Flink Stateful Functions 3.1.
The functions of my current Flink Job are:
Read message from Kafka
Each message is a slice of a data packet, e.g. (s for slice):
s-0, s-1 are for packet 0
s-4, s-5, s-6 are for packet 1
The job merges the slices into data packets and then sinks the packets to HBase
Window functions are applied to deal with out-of-order slice arrival
My Objectives
Currently I already have a Flink Stateful Functions demo running on my k8s cluster. I want to rewrite my entire job on top of Stateful Functions.
Save data into MinIO instead of HBase
My current plan
I have read the docs and have some ideas. My plan is:
There's no need to deal with Kafka directly anymore; the Kafka ingress (https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/apache-kafka/) handles it
Rewrite my job based on the Java SDK. Merging is straightforward, but how about window functions?
Maybe I should use persistent state with TTL to mimic window function behaviors
An egress for MinIO is not in the list of default Flink I/O connectors, so I need to write a custom Flink I/O connector for MinIO myself, following https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/flink-connectors/
I want to avoid the embedded module because it prevents scaling. Auto-scaling is the key reason why I want to migrate to Flink Stateful Functions
My Questions
I don't feel confident in my plan. Is there anything wrong with my understanding/plan?
Are there any best practices I should refer to?
Update:
Windows were used to assemble results:
get a slice, inspect its metadata, and know that it is the last one of the packet
the metadata also says that the packet should contain 10 slices
if there are already 10 slices, merge them
if there are not enough slices yet, wait for some time (e.g. 10 minutes) and then either merge or record a packet error
I want to get rid of windows during the rewrite, but I don't know how.
Background: Use KeyedProcessFunctions Rather than Windows to Assemble Related Events
With the DataStream API, windows are not a good building block for assembling together related events. The problem is that windows begin and end at times that are aligned to the clock, rather than being aligned to the events. So even if two related events are only a few milliseconds apart they might be assigned to different windows.
In general, it's more straightforward to implement this sort of use case with keyed process functions, and use timers as needed to deal with missing or late events.
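As a rough sketch of that pattern (with hypothetical Slice and Packet types, keyed by packet id, and the 10-minute timeout from the update above; expectedSliceCount(), Packet.merge() and Packet.mergePartial() are placeholders for your own metadata and merge logic):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only: Slice/Packet and the slice-count metadata are stand-ins
// for your own types. The stream is keyed by packet id.
public class PacketAssembler extends KeyedProcessFunction<String, Slice, Packet> {

    private static final long TIMEOUT_MS = 10 * 60 * 1000; // wait at most 10 minutes

    private transient ListState<Slice> slices;

    @Override
    public void open(Configuration parameters) {
        slices = getRuntimeContext().getListState(
                new ListStateDescriptor<>("slices", Slice.class));
    }

    @Override
    public void processElement(Slice slice, Context ctx, Collector<Packet> out) throws Exception {
        List<Slice> buffered = new ArrayList<>();
        for (Slice s : slices.get()) {
            buffered.add(s);
        }
        if (buffered.isEmpty()) {
            // first slice of this packet: arm a timeout
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + TIMEOUT_MS);
        }
        buffered.add(slice);

        if (buffered.size() == slice.expectedSliceCount()) {
            out.collect(Packet.merge(buffered));
            slices.clear();
            // (optionally delete the pending timer here)
        } else {
            slices.update(buffered);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Packet> out) throws Exception {
        // Timeout fired before all slices arrived: emit what we have, or record an error.
        List<Slice> buffered = new ArrayList<>();
        for (Slice s : slices.get()) {
            buffered.add(s);
        }
        if (!buffered.isEmpty()) {
            out.collect(Packet.mergePartial(buffered));
            slices.clear();
        }
    }
}
```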
Doing this with the Statefun API
You can use the same pattern mentioned above. The function id will play the same role as the key, and you can use a delayed message instead of a timer:
as each slice arrives, add it to the packet that's being assembled
if it is the first slice, send a delayed message that will act as a timeout
when all the slices have arrived, merge them and send the packet
if the delayed message arrives before the packet is complete, do whatever is appropriate (e.g., go ahead and send the partial packet)
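A sketch of what that could look like with the Java SDK (Slice, SliceBuffer, and Timeout and their TYPE constants are placeholders you would define with the SDK's custom-type support; the function id is the packet id):

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.statefun.sdk.java.Context;
import org.apache.flink.statefun.sdk.java.StatefulFunction;
import org.apache.flink.statefun.sdk.java.ValueSpec;
import org.apache.flink.statefun.sdk.java.message.Message;
import org.apache.flink.statefun.sdk.java.message.MessageBuilder;

// Sketch only: SliceBuffer, Slice, and Timeout (and their TYPEs) are
// placeholder custom types; the function is addressed by packet id.
public class PacketAssemblerFn implements StatefulFunction {

    static final ValueSpec<SliceBuffer> BUFFER =
            ValueSpec.named("buffer").withCustomType(SliceBuffer.TYPE);

    @Override
    public CompletableFuture<Void> apply(Context context, Message message) {
        if (message.is(Slice.TYPE)) {
            SliceBuffer buffer = context.storage().get(BUFFER).orElse(SliceBuffer.empty());

            if (buffer.isEmpty()) {
                // first slice: schedule a timeout as a delayed message to ourselves
                context.sendAfter(
                        Duration.ofMinutes(10),
                        MessageBuilder.forAddress(context.self())
                                .withCustomType(Timeout.TYPE, new Timeout())
                                .build());
            }

            buffer = buffer.add(message.as(Slice.TYPE));

            if (buffer.isComplete()) {
                // all slices arrived: merge and hand the packet to the MinIO egress
                // via context.sendEgress(...), then clean up the state
                context.storage().remove(BUFFER);
            } else {
                context.storage().set(BUFFER, buffer);
            }
        } else if (message.is(Timeout.TYPE)) {
            // timeout arrived before the packet was complete:
            // merge what we have or record a packet error, then clean up
            context.storage().remove(BUFFER);
        }
        return context.done();
    }
}
```

The BUFFER value spec also needs to be registered on the function's StatefulFunctionSpec, and the merged packet would go to your custom MinIO egress.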
I am trying to build a Flink job that would read data from a Kafka source, do a bunch of processing including a few REST calls, and then finally sink into another Kafka topic.
The problem I am trying to address is that of message retries. What if there are transient errors in the REST API? How can I do exponential-backoff-based retries of these messages, the way Storm supports?
I have 2 approaches that I could think of:
Use the TimerService, but then in case of failures the state will start to expand uncontrollably.
Write failed messages to a different Kafka topic and process them with a delay of sorts, but here a problem can arise if the sink itself is down for a few minutes.
Is there a better, more robust, and simpler way to achieve this?
I would use Flink's AsyncFunction to make the REST calls. If needed, it will backpressure the source(s) rather than use more than a configured amount of state. For retries, see AsyncFunction retries.
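To illustrate the backpressure point, the capacity parameter of the async operator caps the number of in-flight REST calls, so pending work stays bounded instead of piling up as state (the stream and function names below are placeholders):

```java
// "requests" is the input DataStream; RestAsyncFunction is your AsyncFunction
// that performs the REST call (and any backoff/retry logic inside asyncInvoke).
DataStream<Result> results =
        AsyncDataStream.unorderedWait(
                requests,
                new RestAsyncFunction(),
                30, TimeUnit.SECONDS,   // per-element timeout
                100);                   // capacity: max concurrent in-flight requests
```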
I could not find any answer to my question on the web so far, so I thought it would be good to ask here.
I know that Apache Flink is by design asynchronous, but I was wondering if there is any project or design, which aims to build a synchronous pipeline with Flink.
By a synchronous response I mean, for example, having an API endpoint where I send my data to, the processing is done by Flink, and the outcome of the processing is returned (in whatever form) in the body of the response to the API call, e.g. a 200.
I already looked into RabbitMQ RPC but I was not able to successfully implement it.
I'm happy for any direction or suggestion.
Thanks,
Jon
The closest thing that comes to my mind seems to be deploying a Flink job with the TcpSource available in Apache Bahir. You could have an HTTP endpoint that receives some data, calls Flink on the specified address, processes it, and creates a response. The problem is that only a TcpSource is available in Bahir, which means you would need to create a large part of the code (the whole sink) yourself.
There can also be other ways of doing that, like trying to assign an id to each message and then waiting for a message with that id to arrive on Kafka and sending it as a response (a rough sketch of this idea is included below), but that seems troublesome and error-prone.
The other way would be to make the response asynchronous (I know the question specifically mentions a sync response, but I mention that just for the sake of completeness).
However, I would like to say that this seems like a misuse of Flink to me. Flink was primarily designed to allow real-time computations on multiple nodes, which doesn't seem to be the case here. I would suggest looking into different streaming libraries that are much more lightweight, easier to compose, and can offer the functionality you want out of the box. You may want to take a look at Akka Streams, for example.
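For completeness, here is a minimal sketch of the correlation-id workaround mentioned above. Topic names, serialization, and the producer/consumer configuration are placeholders, and the Flink job in between is assumed to copy the record key onto its output:

```java
import java.time.Duration;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch only: the HTTP layer writes a request to the job's input topic,
// then blocks until a record with the same correlation id appears on the
// output topic (topic names and serializers are placeholders).
public class SyncOverFlinkGateway {

    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final KafkaProducer<String, String> producer;
    private final KafkaConsumer<String, String> consumer;

    public SyncOverFlinkGateway(KafkaProducer<String, String> producer,
                                KafkaConsumer<String, String> consumer) {
        this.producer = producer;
        this.consumer = consumer; // assumed already subscribed to the output topic
    }

    /** Called from the HTTP handler; blocks until Flink's result arrives or times out. */
    public String process(String payload, long timeoutSeconds) throws Exception {
        String correlationId = UUID.randomUUID().toString();
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(correlationId, future);
        try {
            producer.send(new ProducerRecord<>("flink-requests", correlationId, payload));
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } finally {
            pending.remove(correlationId);
        }
    }

    /** Runs on its own thread, consuming the topic the Flink job writes to. */
    public void pollResponses() {
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(200))) {
                CompletableFuture<String> future = pending.get(record.key());
                if (future != null) {
                    future.complete(record.value());
                }
            }
        }
    }
}
```

Note that every gateway instance has to see all partitions of the response topic (or responses must be routed back to the right instance), which is part of why this approach is troublesome and error-prone.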
The Flink documentation mentions a delivery guarantee of exactly-once or at-least-once for the DataStream API; however, I found no reference to the same for the DataSet API.
Are messages guaranteed to be delivered exactly once to all transformations in data sets? Further, in the absence of a checkpointing mechanism, is the only logical recourse to start the job from the beginning?
Can I use the DataStream API for a batch job? What would I lose?
Fault tolerance for the DataSet API is described here, and yes, it is based on retrying the failed job.
You certainly can use the DataStream API for finite (batch) jobs. There are a few features that are only present in the batch API, such as the machine learning and graph libraries, and the DataSet API has some optimizations that aren't available for DataStreams, but for many applications the differences aren't significant.
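If you are on Flink 1.12 or later, a DataStream program over bounded inputs can also be executed in batch mode, which gets you batch-style scheduling; a minimal sketch:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FiniteDataStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Run the DataStream program with batch semantics over bounded sources.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        // ... build the pipeline from bounded sources (files, bounded Kafka reads, ...) ...
        env.execute("finite-datastream-job");
    }
}
```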