I could not find any answer to my question on the web so far, so I thought it's good to ask here.
I know that Apache Flink is by design asynchronous, but I was wondering if there is any project or design, which aims to build a synchronous pipeline with Flink.
By a synchronous response I mean, for example, having an API endpoint to which I send my data, the processing is done by Flink, and the outcome of the processing is returned (in whatever form) in the body of the response to that API call, e.g. with a 200.
I already looked into RabbitMQ RPC but I was not able to successfully implement it.
I'm happy for any direction or suggestion.
Thanks,
Jon
The closest thing that comes to mind is deploying a Flink job with the TcpSource available in Apache Bahir. You could have an HTTP endpoint that receives some data, hands it to Flink at the specified address, and builds the response from the processing result. The problem is that only a TcpSource is available in Bahir, which means You would need to write a large part of the code (the whole sink) yourself.
There may also be other ways of doing that (like assigning an id to each message, waiting for the message with that id to arrive on Kafka, and sending it back as the response), but that seems troublesome and error-prone.
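For completeness, here is a rough sketch of that correlation-id workaround on the HTTP side (the topic names and the plain kafka-clients usage are illustrative assumptions, nothing Flink-specific): publish the payload keyed by a generated id to the topic Flink reads from, then poll the response topic until a record with the same key appears.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.UUID;

    public class SyncOverKafka {

        private final KafkaProducer<String, String> producer;
        private final KafkaConsumer<String, String> consumer;

        public SyncOverKafka(KafkaProducer<String, String> producer,
                             KafkaConsumer<String, String> consumer) {
            this.producer = producer;
            this.consumer = consumer;
            // hypothetical topic the Flink job writes its results to
            this.consumer.subscribe(Collections.singletonList("flink-responses"));
        }

        // Send the payload to the Flink input topic and wait for the matching response.
        public String call(String payload, Duration timeout) {
            String requestId = UUID.randomUUID().toString();
            producer.send(new ProducerRecord<>("flink-requests", requestId, payload));

            long deadline = System.currentTimeMillis() + timeout.toMillis();
            while (System.currentTimeMillis() < deadline) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                for (ConsumerRecord<String, String> record : records) {
                    if (requestId.equals(record.key())) {
                        return record.value(); // this becomes the HTTP response body
                    }
                }
            }
            throw new IllegalStateException("Timed out waiting for response " + requestId);
        }
    }

In practice every concurrent HTTP request would need its own consumer (or a shared dispatcher keyed by id), which is exactly why this gets troublesome and error-prone.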
The other way would be to make the response asynchronous (I know the question specifically asks for a synchronous response, but I'm mentioning it for the sake of completeness).
However, I would like to say that this seems like a misuse of Flink to me. Flink was primarily designed to allow real-time computations on multiple nodes, which doesn't seem to be the case here. I would suggest looking into different streaming libraries that are much more lightweight, easier to compose, and can offer the functionality You want out of the box. You may want to take a look at Akka Streams, for example.
Related
I have a couple of Flink jobs that receive data from a series of Kafka topics, do some aggregation, and publish the result into a Kafka topic.
The aggregation part is what gets somewhat difficult. I have to retrieve some information from several HTTP endpoints and put the responses together in a particular format. The problem is that some of those outbound HTTP calls time out occasionally, so I need a way to retry them.
I was wondering if there is a canonical way to do such a task within Flink operators, without doing it entirely manually. If not, what would be a recommended approach?
In a bit more than a month you'll have Flink 1.16 available with retry support in AsyncIO:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#retry-support
That is probably your best option. In the meantime, you can use AsyncIO, configure it with long timeouts, and handle the retries yourself inside asyncInvoke.
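Until 1.16 is out, here is a minimal sketch of that manual-retry variant, assuming a hypothetical HTTP enrichment endpoint and an arbitrary retry count:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Collections;

    public class RetryingEnrichment extends RichAsyncFunction<String, String> {

        private static final int MAX_ATTEMPTS = 3; // arbitrary for illustration
        private transient HttpClient client;

        @Override
        public void open(Configuration parameters) {
            client = HttpClient.newHttpClient();
        }

        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            callWithRetry(key, 1, resultFuture);
        }

        private void callWithRetry(String key, int attempt, ResultFuture<String> resultFuture) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/enrich?key=" + key)) // hypothetical endpoint
                    .build();
            client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                    .whenComplete((response, error) -> {
                        if (error == null && response.statusCode() == 200) {
                            resultFuture.complete(Collections.singleton(response.body()));
                        } else if (attempt < MAX_ATTEMPTS) {
                            // retry from inside asyncInvoke's completion callback
                            callWithRetry(key, attempt + 1, resultFuture);
                        } else {
                            resultFuture.completeExceptionally(
                                    error != null ? error : new RuntimeException("HTTP " + response.statusCode()));
                        }
                    });
        }
    }

You would wire it in with something like AsyncDataStream.unorderedWait(stream, new RetryingEnrichment(), 30, TimeUnit.SECONDS), keeping the operator timeout comfortably above MAX_ATTEMPTS times the expected per-call latency.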
Background
I am new to Flink and come from an Apache Storm background
Working on developing a lossless gRPC sink
Crux
A finite number of retries will be made based on the error codes returned by the gRPC endpoint
After that the data will be flushed to Kafka Queue for offline processing
Decision to retry will be based on returned error code.
Problem
Is it possible to chain another sink so that the response (successful or error) is also available downstream for any customized processing?
The answer is as per the comment by Dominik Wosiński:
It's not possible in general, You will have to work around that, either by providing both functionalities in a single sink or by using some existing functions like AsyncIO to write to gRPC and then sink the failures to Kafka, but that may be harder if You need any strong guarantees.
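For illustration, a hedged sketch of that AsyncIO workaround (GrpcWriter and WriteResult are hypothetical stand-ins, not part of any library): the function completes with a result record either way, so both successes and failures are available downstream.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;

    // Result record carrying the original payload plus the outcome of the gRPC write.
    class WriteResult {
        public String payload;
        public boolean success;
        public String errorCode;

        public WriteResult() {}

        public WriteResult(String payload, boolean success, String errorCode) {
            this.payload = payload;
            this.success = success;
            this.errorCode = errorCode;
        }
    }

    public class GrpcWriteFunction extends RichAsyncFunction<String, WriteResult> {

        // Hypothetical wrapper around the generated gRPC future stub.
        public interface GrpcWriter {
            CompletableFuture<Void> write(String payload);
        }

        private transient GrpcWriter writer;

        @Override
        public void open(Configuration parameters) {
            // Placeholder: build the channel and wrap your generated future stub here.
            writer = payload -> CompletableFuture.completedFuture(null);
        }

        @Override
        public void asyncInvoke(String payload, ResultFuture<WriteResult> resultFuture) {
            writer.write(payload).whenComplete((ok, error) ->
                    resultFuture.complete(Collections.singleton(
                            error == null
                                    ? new WriteResult(payload, true, null)
                                    : new WriteResult(payload, false, error.getMessage()))));
        }
    }

Downstream you would filter on success (e.g. keep only records where success is false) and attach the Kafka sink to that failure stream; as noted above, this only gives at-least-once style behaviour rather than strong guarantees.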
I know that by design and out of the box a request-and-reply style of data processing is not possible with Flink. But consider, for example, a legacy TCP application which opens a connection to a server and expects a response on the same connection.
For example consider a legacy application, where the clients connect to a server via TCP and a custom protocol. They send some status information and expect a command as the response, where the command may depend on the current status.
Is it possible to build a combined source, which feeds the TCP message into the processing, and sink, which receives the processing result?
Building a source which accepts TCP connections and creates events from messages seems straightforward, but getting the corresponding response to the correct sink on the same worker(!) so it can send the response back to the client seems tricky.
I know, that this can be implemented with an external component, but I'm wondering if this can be implemented directly in Flink with minimal overhead (e.g. for realtime performance reasons).
If this is possible, what would be the ways to do it and with which pros and cons?
Thank you!
Regards,
Kan
It depends on what your server-processing pipeline looks like.
If the processing can be modeled as a single chain, as in Source -> Map/flatMap/filter -> Map/flatMap/filter -> ... -> sink, then you could pass the TCP connection itself to the next operation together with the data (I suppose wrapped in a tuple or POJO). By virtue of being part of a chain, it is guaranteed that the entire computation happens within a single worker.
But the moment you do anything like grouping, windows, etc., this is no longer possible, since the processing may continue on another worker.
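To make the chain idea concrete, here is a minimal sketch (the port, the made-up protocol, and the enableObjectReuse call are my assumptions; the Socket reference only survives because nothing between source and sink forces serialization or a shuffle):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class TcpRequestReplyJob {

        // POJO wrapping the open connection together with the received status line.
        public static class Request {
            public transient Socket socket;
            public String status;
        }

        // Accepts TCP connections and turns each status line into an event.
        public static class TcpRequestSource implements SourceFunction<Request> {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<Request> ctx) throws Exception {
                try (ServerSocket server = new ServerSocket(9999)) { // hypothetical port
                    while (running) {
                        Socket socket = server.accept();
                        BufferedReader in =
                                new BufferedReader(new InputStreamReader(socket.getInputStream()));
                        Request request = new Request();
                        request.socket = socket;
                        request.status = in.readLine();
                        ctx.collect(request);
                    }
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }

        // Writes the computed command back to the very same connection.
        public static class TcpReplySink implements SinkFunction<Request> {
            @Override
            public void invoke(Request request, Context context) throws Exception {
                request.socket.getOutputStream().write((request.status + "\n").getBytes());
                request.socket.close();
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.getConfig().enableObjectReuse(); // keep the Socket reference intact inside the chain

            env.addSource(new TcpRequestSource())
               .map(request -> {
                   // derive the command from the status; stays in the same chain as the sink
                   request.status = "COMMAND for " + request.status;
                   return request;
               })
               .addSink(new TcpReplySink());

            env.execute("tcp-request-reply");
        }
    }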
Normally if you're talking to an external service in Flink, you'd use an AsyncFunction. This lets you use incoming data to determine what request to make, and emit the results as the operator output. Is there any reason why this approach wouldn't work for you?
Note that you can play some games if you don't have any incoming data, e.g. have a source that regularly emits a "tickler" record, which then triggers the async request.
And if the result needs to feed back into the next request, you can use iterations, though they have limitations.
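For the "tickler" part, a small sketch of what such a source could look like (the interval is arbitrary):

    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class TickSource implements SourceFunction<Long> {
        private final long intervalMs;
        private volatile boolean running = true;

        public TickSource(long intervalMs) {
            this.intervalMs = intervalMs;
        }

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            while (running) {
                ctx.collect(System.currentTimeMillis()); // the "tickler" record
                Thread.sleep(intervalMs);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }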
I've asked similar questions and had great responses, but my request here seems sufficiently different to ask separately.
The Camel Aggregator, as awesome as it is, is not going to cut it for me. I need to aggregate exchange data and, when I hit a certain size, forward it on to a queue. When that happens I can then ACK the original source messages off the queue. The persistence choices of the aggregator aren't really an option for environmental reasons: there is no RDBMS around, and the other options would mean locally managed state. If the route or the box went down, I need to be able to carry on processing, and if I had messages in that db it becomes a recovery job. Thanks to ZK and Camel's integration with it!
I'm basically thinking I need to implement a processor or a bean (what are the subtle differences?) that will take exchanges and put them in a map. When I hit a certain size, I forward the joined exchange on to an endpoint, and then somehow ack all the original messages.
What I want to know is which API I use to control the exchange, so that I can effectively stop it without acking and hold on to what I need in order to ack later.
Can anyone provide some guidance and point me at the relevant functions on the objects of interest?
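For the batching half, a hedged sketch of the processor idea described above (the direct:forward endpoint is hypothetical): collect bodies and forward the joined batch once the size is hit.

    import org.apache.camel.Exchange;
    import org.apache.camel.Processor;
    import org.apache.camel.ProducerTemplate;

    import java.util.ArrayList;
    import java.util.List;

    public class BatchingProcessor implements Processor {

        private final int batchSize;
        private final ProducerTemplate producer;
        private final List<Object> batch = new ArrayList<>();

        public BatchingProcessor(int batchSize, ProducerTemplate producer) {
            this.batchSize = batchSize;
            this.producer = producer;
        }

        @Override
        public synchronized void process(Exchange exchange) throws Exception {
            batch.add(exchange.getIn().getBody());
            if (batch.size() >= batchSize) {
                // forward the joined batch; "direct:forward" is a hypothetical endpoint
                producer.sendBody("direct:forward", new ArrayList<>(batch));
                batch.clear();
            }
        }
    }

The hard part this leaves open is exactly the one in the question: the original exchanges have already moved on by the time the batch is forwarded, so acking them later needs control over the consumer itself.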
I have a nice simple idea for this. I was going to extend the Rabbit* classes, specifically RabbitConsumer.doHandleDelivery, and have that do my noddy aggregation. That would call Exchange exchange = consumer.getEndpoint().createRabbitExchange(envelope, properties, body); once the aggregation has completed. And depending on the result of consumer.getProcessor().process(exchange); it would ack or reject all the messages. On the face of it I would say it would all work quite well. OK, I would need some synchronisation in the RabbitConsumer...
Just to give peeps an update: I built my own batching RMQ consumer.
Pretty simple really, but I just had to make sure I built on the onXXX functions so the route could be paused/resumed and stopped/started.
There is appengine-mapreduce, which seems to be the official way to do things on App Engine. But there seems to be no documentation besides some hacked-together wiki pages and lengthy videos. There are statements that the lib only supports the map step, but the source indicates that there are also implementations for shuffle.
A version of this appengine-mapreduce library also seems to be included in the SDK, but it is not blessed for public use. So you are basically expected to load the library twice into your runtime.
Then there is appengine-pipeline. "A primary use-case of the API is connecting together various App Engine MapReduces into a computational pipeline." But there also seems to be pipeline-related code in the appengine-mapreduce library.
So where do I start to find out how this all fits together? Which is the library to call from my project? Is there any decent documentation on appengine-mapreduce besides parsing change logs?
Which is the library to call from my project?
They serve different purposes, and you've provided no details about what you're attempting to do.
The most fundamental layer here is the task queue, which lets you schedule background work that can be highly parallelized. This is fan-out. Let's say you had a list of 1000 websites, and you wanted to check the response time for each one and send an email for any site that takes more than 5 seconds to load. By running these as concurrent tasks, you can complete the work much faster than if you checked all 1000 sites in sequence.
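A hedged sketch of that fan-out with the Java task queue API (the /check-site worker URL and its parameter are made up):

    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;

    import java.util.List;

    public class FanOut {
        // Enqueue one task per site; each task is handled concurrently by a worker servlet.
        public static void enqueueChecks(List<String> siteUrls) {
            Queue queue = QueueFactory.getDefaultQueue();
            for (String url : siteUrls) {
                queue.add(TaskOptions.Builder
                        .withUrl("/check-site")   // hypothetical worker servlet
                        .param("site", url));
            }
        }
    }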
Now let's say you don't want to send an email for every slow site, you just want to check all 1000 sites and send one summary email that says how many took more than 5 seconds and how many took fewer. This is fan-in. It's trickier with the task queue, because you need to know when all tasks have completed, and you need to collect and summarize their results.
Enter the Pipeline API. The Pipeline API abstracts the task queue to make fan-in easier. You write what looks like synchronous, procedural code, but it uses Python futures and is executed (as much as possible) in parallel. The Pipeline API keeps track of task dependencies and collects results to facilitate building distributed workflows.
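A rough sketch of the slow-site example using the Java flavor of the Pipeline API (CheckSiteJob, CountSlowJob, and SummaryJob are hypothetical names; the Job/Value/futureCall shapes follow the library's documented style):

    import com.google.appengine.tools.pipeline.FutureValue;
    import com.google.appengine.tools.pipeline.Job2;
    import com.google.appengine.tools.pipeline.Value;

    // Child job: reports whether one site took longer than the threshold (URL fetch elided).
    class CheckSiteJob extends Job2<Boolean, String, Integer> {
        @Override
        public Value<Boolean> run(String url, Integer thresholdMs) {
            long start = System.currentTimeMillis();
            // ... fetch the URL here ...
            long elapsed = System.currentTimeMillis() - start;
            return immediate(elapsed > thresholdMs);
        }
    }

    // Fan-in job: combines the two booleans into a count of slow sites.
    class CountSlowJob extends Job2<Integer, Boolean, Boolean> {
        @Override
        public Value<Integer> run(Boolean aSlow, Boolean bSlow) {
            return immediate((aSlow ? 1 : 0) + (bSlow ? 1 : 0));
        }
    }

    // Parent job: fans out two checks in parallel, then fans the results back in.
    public class SummaryJob extends Job2<Integer, String, String> {
        @Override
        public Value<Integer> run(String siteA, String siteB) {
            FutureValue<Boolean> a = futureCall(new CheckSiteJob(), immediate(siteA), immediate(5000));
            FutureValue<Boolean> b = futureCall(new CheckSiteJob(), immediate(siteB), immediate(5000));
            return futureCall(new CountSlowJob(), a, b);
        }
    }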
The MapReduce API wraps the Pipeline API to facilitate a specific type of distributed workflow: mapping the results of a piece of work into a set of key/value pairs, and reducing multiple sets of results to one by combining their values.
So they provide increasing layers of abstraction and convenience around a common system of distributed task execution. The right solution depends on what you're trying to accomplish.
There is official documentation here: https://developers.google.com/appengine/docs/java/dataprocessing/