Flink async IO operator chaining with another sync operator - apache-flink

I have a use case where I am using async IO operators together with normal mappers in Flink. I am using Flink 1.8, so the async operator has to be at the head of an operator chain. My operator flow looks like this:
Source -> Mapper1 -> AsyncOperator -> Mapper2 -> Sink
Because of the requirement that the async operator be at the head of a chain, there are two operator chains and hence two tasks: 1. Source + Mapper1, 2. AsyncOperator + Mapper2 + Sink.
I have a question regarding the second chain. If the operators are chained correctly, the second chain should execute within a single task. What I am not sure about is whether there is a wait time between the async operator and Mapper2 on the task threads, or whether Mapper2 gets bound to the response handler of the async operator internally. Ideally it would be the latter, but I can't find any documentation on this, hence I am wondering.
Reference:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/asyncio.html

The AsyncWaitOperator spins up an Emitter in a thread, so as soon as results are available they get sent to the operator's collector. Note though that if you specify ordered results there can be a "wait time" due to completion order not matching the order of incoming elements.
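For reference, whether completed results are emitted immediately or held back is determined by how the async stream is wired up. Below is a minimal sketch of the question's pipeline in the Flink Scala API; ExternalLookupFunction and the string transformations are made-up stand-ins for the real mappers and async call:

import java.util.concurrent.TimeUnit

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}

// Hypothetical async lookup standing in for the real AsyncOperator.
class ExternalLookupFunction extends AsyncFunction[String, String] {
  override def asyncInvoke(input: String, resultFuture: ResultFuture[String]): Unit =
    resultFuture.complete(Seq(input)) // a real implementation completes this from an async client callback
}

val env = StreamExecutionEnvironment.getExecutionEnvironment

val mapped: DataStream[String] =
  env.fromElements("a", "b").map(_.toUpperCase) // Source -> Mapper1

// unorderedWait emits each result as soon as its future completes;
// orderedWait preserves input order instead, so a completed result may
// have to wait for earlier, still-pending elements -- the "wait time"
// mentioned above.
val enriched: DataStream[String] =
  AsyncDataStream.unorderedWait(mapped, new ExternalLookupFunction,
    1000L, TimeUnit.MILLISECONDS, 100) // timeout and max in-flight requests

enriched.map(_.trim).print() // Mapper2 + Sink, chained onto the async operator
env.execute("async-io-sketch")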

BTW, the restriction that the AsyncWaitOperator must be at the head of a chain was removed in Flink 1.11. See FLINK-16219. The only remaining limitation was that it could not follow a SourceFunction. The AsyncWaitOperator can follow the new sources introduced in Flink 1.12.

Related

Flink: An abstraction that implements CheckpointListener w/o element processing

I'm new to Flink and am looking for a way to run some code once a checkpoint completes (presumably by implementing CheckpointListener) without processing events (void processElement(StreamRecord<IN> element)). Currently, I have an operator MyOperator that runs my code within its notifyCheckpointComplete function. However, I see a lot of traffic sent to that operator. The operator chain looks as follows:
input = KafkaStream
input -> IcebergSink
input -> MyOperator
I can't find how to register a CheckpointListener in the Flink execution environment. Is it possible?
Also, I have the following ideas:
map input stream elements to Void/Unit before sending them to MyOperator
use a Side Output without emitting data to the side output; I'm wondering if notifyCheckpointComplete will still be called.

Akka HTTP streaming API with cycles never completes

I'm building an application where I take a request from a user, call a REST API to get back some data, then based on that response, make another HTTP call and so on. Basically, I'm processing a tree of data where each node in the tree requires me to recursively call this API, like this:
    A
   / \
  B   C
 / \   \
D   E   F
I'm using Akka HTTP with Akka Streams to build the application, so I'm using its streaming API, like this:
val httpFlow = Http().cachedHostConnectionPool[Data](host = "localhost")

val flow = GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._

  val merge = builder.add(Merge[Data](2))
  val bcast = builder.add(Broadcast[ResponseData](2))

  takeUserData ~> merge ~> createRequest ~> httpFlow ~> processResponse ~> bcast
                  merge <~ extractSubtree                               <~ bcast

  FlowShape(takeUserData.in, bcast.out(1))
}
I understand that the best way to handle recursion in an Akka Streams application is to handle recursion outside of the stream, but since I'm recursively calling the HTTP flow to get each subtree of data, I wanted to make sure that the flow was properly backpressured in case the API becomes overloaded.
The problem is that this stream never completes. If I hook it up to a simple source like this:
val source = Source.single(data)
val sink = Sink.seq[ResponseData]
source.via(flow).runWith(sink)
It prints out that it's processing all the data in the tree and then stops printing anything, just idling forever.
I read the documentation about cycles and the suggestion was to put a MergePreferred in there, but that didn't seem to help. This question helped, but I don't understand why MergePreferred wouldn't stop the deadlock, since unlike their example, the elements are removed from the stream at each level of the tree.
Why doesn't MergePreferred avoid the deadlock, and is there another way of doing this?
MergePreferred (in the absence of eagerComplete being true) will complete when all of its inputs have completed, which is generally true of stages in Akka Streams (completion flows downstream from the start).
So the merge can't propagate completion until both the outside input and extractSubtree signal completion. extractSubtree won't signal completion (most likely, without knowing the stages in that flow) until bcast signals completion, which (again, most likely) won't happen until processResponse signals completion, which won't happen until httpFlow signals completion, which won't happen until createRequest signals completion, which won't happen until merge signals completion. Because detecting such a cycle is impossible in general (consider that there are stages for which completion is entirely dynamic), Akka Streams effectively takes the position that if you want to create a cycle like this, it's on you to determine how to break the cycle.
As you've noticed, setting eagerComplete to true changes this behavior, but since the merge will then complete as soon as any input completes (which in this case will always be the outside input, thanks to the cycle), it completes and cancels demand on extractSubtree (which by itself could, depending on whether the Broadcast has eagerCancel set, cause the downstream to cancel), which will likely result in at least some elements emitted by extractSubtree never getting processed.
If you're absolutely sure that the input completing means the cycle will eventually dry up, you can use eagerComplete = false, provided you have some means to complete extractSubtree once the cycle is dry and the input has completed. A broad outline (without knowing what, specifically, is in extractSubtree) for going about this, with a sketch after the list:
map everything coming into extractSubtree from bcast into a Some of the input
prematerialize a Source.actorRef to which you can send a None, and save the ActorRef (which will be the materialized value of this source)
merge the input with that prematerialized source
when extracting the subtree, use a statefulMapConcat stage to track (a) whether a None has been seen and (b) how many subtrees are pending (initial value 1; add the number of first-generation children of this node minus 1, so a node with no children subtracts 1); if a None has been seen and no subtrees are pending, emit a List(None), otherwise emit a List of each subtree wrapped in a Some
have a takeWhile(_.isDefined), which will complete once it sees a None
if you have more complex things (e.g. side effects) in extractSubtree, you'll have to figure out where to put them
before merging the outside input, pass it through a watchTermination stage, and in the future's callback (on success) send a None to the ActorRef you got when prematerializing the Source.actorRef. Thus, when the input completes, watchTermination will fire and effectively signal extractSubtree to watch for when it has finished the in-flight tree.
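To make that outline concrete, here is a minimal sketch. Everything application-specific is stubbed out: Data, ResponseData, Node, toNode and outsideInput are all placeholders, and Akka 2.5-style APIs (ActorMaterializer, the two-argument Source.actorRef) are assumed:

import akka.NotUsed
import akka.actor.{ActorRef, ActorSystem}
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Flow, Source}

implicit val system: ActorSystem = ActorSystem("sketch")
implicit val materializer: ActorMaterializer = ActorMaterializer()
import system.dispatcher

case class Data()                                // placeholder request type from the question
case class ResponseData()                        // placeholder response type from the question
case class Node(children: List[Data])            // hypothetical parsed-tree node
def toNode(resp: ResponseData): Node = Node(Nil) // hypothetical parser

val outsideInput: Source[Data, NotUsed] = Source.single(Data())

// 1. Prematerialize a control source; sending None to it later signals
//    "the outside input has completed" into the cycle.
val (endSignal: ActorRef, endSource: Source[Option[Node], NotUsed]) =
  Source.actorRef[Option[Node]](bufferSize = 1, OverflowStrategy.dropHead)
    .preMaterialize()

// 2. Track pending subtrees: start at 1 for the root; each node adds
//    (number of children - 1), so a leaf subtracts 1. Once the None has
//    been seen and nothing is pending, emit a None so takeWhile completes.
val trackPending: Flow[Option[Node], Option[Node], NotUsed] =
  Flow[Option[Node]].statefulMapConcat { () =>
    var pending = 1L
    var inputDone = false

    {
      case None =>
        inputDone = true
        if (pending == 0L) List(None) else Nil
      case Some(node) =>
        pending += node.children.size - 1
        if (inputDone && pending == 0L) List(Some(node), None)
        else List(Some(node))
    }
  }

// What extractSubtree might look like with the machinery added.
val extractSubtree: Flow[ResponseData, Data, NotUsed] =
  Flow[ResponseData]
    .map(resp => Option(toNode(resp)))  // wrap each node in a Some
    .merge(endSource)                   // mix in the end-of-input signal
    .via(trackPending)
    .takeWhile(_.isDefined)             // completes once it sees the None
    .mapConcat(_.get.children)          // emit each child as a new request

// 3. Fire the end-of-input signal when the outside input completes.
val input: Source[Data, NotUsed] =
  outsideInput.watchTermination() { (m, done) =>
    done.foreach(_ => endSignal ! None)
    m
  }

Once takeWhile completes, extractSubtree's side of the merge is done, so the merge (with eagerComplete = false) can propagate completion through the rest of the graph.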

Akka Streams buffer on SubFlows based on parent Flow

I am using akka-streams and I hit an exception because I maxed out the HTTP connection pool in akka-http.
There is a Source of list elements, which get split and thus transformed into SubFlows.
The SubFlows issue HTTP requests. Although I put a buffer on the SubFlow, the buffer seems to take effect per SubFlow.
Is there a way to have a buffer based on the Source that takes effect on the SubFlows?
My mistake was that I was merging the substreams without taking the parallelism into consideration, by using
def mergeSubstreams(): Flow[In, Out, Mat]
From the documentation
This is identical in effect to mergeSubstreamsWithParallelism(Integer.MAX_VALUE).
Thus my workaround was to use
def mergeSubstreamsWithParallelism(parallelism: Int): Flow[In, Out, Mat]
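For illustration, a minimal sketch of the fix, using groupBy to form the substreams (a split works the same way) and a made-up makeRequest standing in for the real HTTP call. The key point is that the parallelism bound is applied where the substreams merge back, which caps the total in-flight work against the pool:

import scala.concurrent.Future
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem("sketch")
implicit val materializer: ActorMaterializer = ActorMaterializer()
import system.dispatcher

case class Element(listId: Int, payload: String)  // placeholder input type
def makeRequest(e: Element): Future[String] =     // placeholder for the real HTTP call
  Future(s"response for ${e.payload}")

Source(List(Element(1, "a"), Element(1, "b"), Element(2, "c")))
  .groupBy(maxSubstreams = 1024, _.listId)  // one substream per list
  .mapAsync(parallelism = 1)(makeRequest)   // one request at a time within a substream
  .mergeSubstreamsWithParallelism(8)        // at most 8 substreams run at once
  .runWith(Sink.foreach(println))

With plain mergeSubstreams() (i.e. parallelism Integer.MAX_VALUE), every substream may run at once, which is what overwhelmed the pool.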

Is Angular's foreach loop over an object asynchronous?

I read somewhere in the past that angular.forEach is asynchronous, unlike looping over arrays, which is synchronous. For a long time I took this into account and did what was necessary to avoid executing the code that comes after the loop before all its iterations had finished (by wrapping the angular.forEach inside an anonymous JavaScript function which calls a callback that is executed once the loop finishes all iterations).
(function(callback){
  angular.forEach(..);
  callback();
})(callback);
But I had a conversation with a colleague who didn't agree that angular.forEach is asynchronous, and I also couldn't find that information again, which makes me confused now.
No. Take a look at the docs.
Furthermore, your code wouldn't work if forEach were asynchronous.
If forEach were async, the callback would be called immediately after calling forEach, and forEach would be put onto the event queue to be executed some time in the future.
JavaScript's concurrency model does not have threads but instead uses an event loop. This means every async operation is pushed onto the event queue and executed later.
Have a look at the MDN documentation.
There may be a scenario where you want to make code behave asynchronously.
I had a scenario where I used local storage to store an ad-hoc, user-selected collection of jobs that I wanted to perform the same operation on.
I had a web service call that converted a list of job names into a collection of job objects. I initially tried using a forEach loop inside the subscribe of the service layer that operated on the results.
Within the forEach loop I then called another method that performed the operation and removed the job name from local storage once the operation posted to the web service correctly.
The problem was that on the second iteration I read the collection of names from local storage again, before the removal from the first iteration had completed.
There was a lot of manipulation of the job and object properties to create the parameters passed in the function call, so I ended up refactoring the code: I created a value object interface and stored the information for the whole returned job collection in a value object array, including each job's index in its value object.
I also introduced a BehaviorSubject property on the class, initialised with a value of -1.
During the restructuring, the forEach loop just added an entry to the value object array instead; at the end of the loop I sent next(0) to the BehaviorSubject to start the ball rolling.
I converted the service that removes a job name from local storage to return a Promise, and in the then part of the code after the service was called I called next(index from the value object + 1) on the BehaviorSubject.
In the subscription to the BehaviorSubject I ignored -1.
When the index was between 0 and one less than the collection size, I called the method that had originally been in the forEach loop, passing the value object entry whose index matched the BehaviorSubject's value.
When index + 1 was greater than the length of the value object collection, I called a completion routine, which bounced the app back to the prior page.
By doing this I had converted the behaviour of the forEach into something asynchronous.

RxJS: how to reflect an array of observable results in order to the UI?

I have an array of observables. Each of them will make an HTTP call to a REST endpoint and return a result so I can update the UI.
I am using zip to run them all like this:
Observable.zip(allTransactions).subscribe(result=> {blab});
In subscribe, I update a page-level collection, so the UI gets updated via 2-way binding (Angular).
However, there are a few problems:
1) When I construct each observable in the array, I add .delay(1000) to it, so I expect each run to be delayed at least one second relative to the previous one. In fact, that's not true. Based on my log, it seems all those transactions were fired at the same time; only the subscribe was delayed one second. I really need them to run in sequence, in the order I set up the array, because I have some dependencies between those transactions. Running them all together won't work for me.
2) zip doesn't seem to guarantee that my results come back in order, so my UI ends up in a totally random order, because I was doing this.items.push(result), where items is a variable bound to the UI.
I am currently trying to merge all the transactions and add an empty observable with a delay between every two transactions (still working on it).
Can anyone suggest other alternatives, or a better way I can try?
Thanks
1) You are correct that adding .delay(1000) to all observables will not make each one wait for the previous one. The delay operator delays execution from the moment you subscribe, and since you subscribe to them all at the same time and delay them by the same amount, they all execute at the same time.
If you want to execute the observables in sequence, waiting for each one to finish before proceeding to the next, then use the flatMap operator:
obs1.get()
  // First call
  .flatMap(response1 => {
    return obs2.get(response1.something);
  })
  .subscribe(response2 => {
    // Result from second call
  });
2) Looking at the zip documentation, the results should be returned as an ordered list:
Merges the specified observable sequences or Promises into one
observable sequence by using the selector function whenever all of the
observable sequences have produced an element at a corresponding
index. If the result selector function is omitted, a list with the
elements of the observable sequences at corresponding indexes will be
yielded.
But as you have noted: the calls are all executed simultaneously and one observable does not wait for the next one before starting.
So the call:
Observable.zip(obs1, obs2, obs3).subscribe(result => console.log(result));
Will log:
[response1, response2, response3]
