Akka HTTP streaming API with cycles never completes - akka-stream

I'm building an application where I take a request from a user, call a REST API to get back some data, then based on that response, make another HTTP call and so on. Basically, I'm processing a tree of data where each node in the tree requires me to recursively call this API, like this:
      A
     / \
    B   C
   / \   \
  D   E   F
I'm using Akka HTTP with Akka Streams to build the application, so I'm using its streaming API, like this:
val httpFlow = Http().cachedConnection(host = "localhost")
val flow = GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val merge = b.add(Merge[Data](2))
  val bcast = b.add(Broadcast[ResponseData](2))
  // forward path: user data -> request -> HTTP call -> response
  takeUserData ~> merge ~> createRequest ~> httpFlow ~> processResponse ~> bcast
  // feedback path: subtrees found in a response go back into the merge
  merge <~ extractSubtree <~ bcast
  FlowShape(takeUserData.in, bcast.out(1))
}
I understand that the best way to handle recursion in an Akka Streams application is to handle recursion outside of the stream, but since I'm recursively calling the HTTP flow to get each subtree of data, I wanted to make sure that the flow was properly backpressured in case the API becomes overloaded.
The problem is that this stream never completes. If I hook it up to a simple source like this:
val source = Source.single(data)
val sink = Sink.seq[ResponseData]
source.via(flow).runWith(sink)
It prints out that it's processing all the data in the tree and then stops printing anything, just idling forever.
I read the documentation about cycles and the suggestion was to put a MergePreferred in there, but that didn't seem to help. This question helped, but I don't understand why MergePreferred wouldn't stop the deadlock, since unlike their example, the elements are removed from the stream at each level of the tree.
Why doesn't MergePreferred avoid the deadlock, and is there another way of doing this?

MergePreferred (unless eagerComplete is true) completes only when all of its inputs have completed. That is generally true of stages in Akka Streams: completion flows downstream from the start.
So the merge can't propagate completion until both the input and extractSubtree signal completion. Without knowing the stages in that flow, extractSubtree most likely won't signal completion until bcast does, which (again, most likely) won't happen until processResponse signals completion, which won't happen until httpFlow signals completion, which won't happen until createRequest signals completion, which won't happen until merge signals completion. Because detecting such a cycle is impossible in general (there are stages for which completion is entirely dynamic), Akka Streams effectively takes the position that if you create a cycle like this, it's on you to determine how to break it.
As you've noticed, setting eagerComplete to true changes this behavior. The merge then completes as soon as any input completes, and thanks to the cycle that will always be the outside input. On completing, merge cancels demand on extractSubtree, which by itself could cause the downstream to cancel (depending on whether the Broadcast has eagerCancel set). The likely result is that at least some elements emitted by extractSubtree never get processed.
If you're absolutely sure that the input completing means that the cycle will eventually dry up, you can use eagerComplete = false if you have some means to complete extractSubtree once the cycle is dry and the input has completed. A broad outline (without knowing what, specifically, is in extractSubtree) for going about this:
- Map everything coming into extractSubtree from bcast into a Some of the input.
- Prematerialize a Source.actorRef to which you can send a None, and save the ActorRef (which will be the materialized value of this source).
- Merge the input with that prematerialized source.
- When extracting the subtree, use a statefulMapConcat stage to track (a) whether a None has been seen and (b) how many subtrees are pending. The pending count starts at 1; for each node, add its number of (first-generation) children minus 1, so a node with no children subtracts 1. If a None has been seen and no subtrees are pending, emit a List(None); otherwise emit a List of each subtree wrapped in a Some.
- Follow that with a takeWhile(_.isDefined), which will complete once it sees a None.
- If you have more complex things (e.g. side effects) in extractSubtree, you'll have to figure out where to put them.
- Before merging the outside input, pass it through a watchTermination stage, and in the future callback (on success) send a None to the ActorRef you got when prematerializing the Source.actorRef for extractSubtree. Thus, when the input completes, watchTermination will fire successfully and effectively tell extractSubtree to watch for the point at which the in-flight tree has been fully processed.
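The pending-subtree bookkeeping in the statefulMapConcat step does not depend on Akka, so it can be checked in isolation. Below is a minimal sketch in plain C, assuming the example tree A..F from the question; the node type, the FIFO that stands in for the cycle, and all function names are hypothetical and not part of any Akka API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 16

/* Hypothetical stand-in for a node of the tree from the question. */
struct node {
    const char *name;
    size_t child_count;
    const struct node *children[2];
};

/* Feeds nodes through a FIFO (standing in for the cycle) and returns how
 * many nodes had been processed when the pending counter reached zero. */
size_t nodes_processed_until_dry(const struct node *root)
{
    const struct node *fifo[MAX_NODES];
    size_t head = 0, tail = 0, processed = 0;
    long pending = 1;                 /* the root subtree is in flight */

    fifo[tail++] = root;
    while (pending > 0) {
        assert(head < tail);          /* pending > 0 means work is queued */
        const struct node *n = fifo[head++];
        processed++;
        /* This subtree is done (-1); each child starts a new one (+1 each). */
        pending += (long)n->child_count - 1;
        for (size_t i = 0; i < n->child_count; i++) {
            assert(tail < MAX_NODES);
            fifo[tail++] = n->children[i];
        }
    }
    return processed;
}

/* Builds the A..F tree from the question and runs the counter over it. */
size_t demo_tree(void)
{
    static const struct node d = { "D", 0, { NULL, NULL } };
    static const struct node e = { "E", 0, { NULL, NULL } };
    static const struct node f = { "F", 0, { NULL, NULL } };
    static const struct node b = { "B", 2, { &d, &e } };
    static const struct node c = { "C", 1, { &f, NULL } };
    static const struct node a = { "A", 2, { &b, &c } };
    return nodes_processed_until_dry(&a);
}
```

For the example tree the counter goes 1, 2, 3, 3, 2, 1, 0 as A, B, C, D, E and F are handled, so it reaches zero exactly when the last leaf has been processed. In the stream, that is the moment at which a List(None) may be emitted, provided the None from the outside input has also been seen.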

Related

Flink: An abstraction that implements CheckpointListener w/o element processing

I'm new to Flink and am looking for a way to run some code once a checkpoint completes (supposedly by implementing CheckpointListener) without processing events (void processElement(StreamRecord<IN> element)). Currently, I have an operator MyOperator that runs my code within notifyCheckpointComplete function. However, I see a lot of traffic sent to that operator. The operator chain looks as follows:
input = KafkaStream
input -> IcebergSink
input -> MyOperator
I can't find how to register CheckpointListener in Flink execution environment. Is it possible?
Also, I have the following ideas:
- map input stream elements to Void, Unit before sending to MyOperator
- use Side Output without emitting data to side output. I'm wondering if notifyCheckpointComplete will be still called.

Flink async IO operator chaining with another sync operator

I have a use case where I am using async I/O operators with normal mappers in Flink. I am using Flink 1.8, so the async operator has to be at the head of the operator chain. My operator flow looks like this:
Source -> Mapper1 -> AsyncOperator -> Mapper2 -> Sink
Because the async operator must be at the head of a chain, there are two operator chains and hence two tasks: (1) Source + Mapper1 and (2) AsyncOperator + Mapper2 + Sink.
I have a question regarding the second chain. I think the second chain should be contained in a single task if the operators are chained correctly. I am not sure whether there is a wait time between the async operator and Mapper2 on the task threads, or whether Mapper2 is bound internally to the response handler of the async operator. Ideally it would be the latter, but I can't find any documentation on this, hence the question.
Reference:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/asyncio.html
The AsyncWaitOperator spins up an Emitter in a thread, so as soon as results are available they get sent to the operator's collector. Note though that if you specify ordered results there can be a "wait time" due to completion order not matching the order of incoming elements.
BTW, the restriction that the AsyncWaitOperator must be at the head of a chain was removed in Flink 1.11. See FLINK-16219. The only remaining limitation was that it could not follow a SourceFunction. The AsyncWaitOperator can follow the new sources introduced in Flink 1.12.

Akka Streams buffer on SubFlows based on parent Flow

I am using akka-streams and I hit an exception because of maxing out the Http Pool on akka-http.
There is a Source of list-elements, which get split and thus transformed to SubFlows.
The SubFlows issue http requests. Although I put a buffer on the SubFlow, it seems the buffer takes effect per SubFlow.
Is there a way to have a buffer based on the Source that takes effect on the SubFlows?
My mistake was that I was merging the substreams without taking into consideration the parallelism by using
def mergeSubstreams(): Flow[In, Out, Mat]
From the documentation
This is identical in effect to mergeSubstreamsWithParallelism(Integer.MAX_VALUE).
Thus my workaround was to use
def mergeSubstreamsWithParallelism(parallelism: Int): Flow[In, Out, Mat]

threadpools - boss/worker vs peer (workcrew) models

I'm aiming to use a threadpool with pthreads and am trying to choose between these two models of threading and it seems to me that the peer model is more suitable when working with fixed input, whereas the boss/worker model is better for dynamically changing work items. However, I'm a little unsure of how exactly to get the peer model to work with a threadpool.
I have a number of tasks that all need to be performed on the same data set. Here's some simple pseudocode for how I would look at tackling this:
data = [0 ... 999]
data_index = 0
data_size = 1000
tasks = [0 ... 99]
task_index = 0
threads = [0 ... 31]

thread_function()
{
    while (true)
    {
        index = data_index++ (using atomics)
        if index >= data_size    // valid indices run from 0 to data_size - 1
        {
            sync
            if thread_index == 0
            {
                data_index = 0
                task_index++
                sync
            }
            else
            {
                sync
            }
            continue
        }
        tasks[task_index](data[index])
    }
}
(Firstly, it seems like there should be a way of making this use just one synchronisation point, but I'm not sure whether that's possible?)
The above code seems like it will work well for the case where the tasks are known in advance, though I guess a threadpool is unnecessary for this particular problem. However, even if the data items are still predefined across all tasks, if the tasks are not known in advance, it seems like the boss/worker model is better suited? Is it possible to use the boss/worker model but still allow the tasks to be picked up by the threads themselves (as above), where the boss essentially suspends itself until all tasks are complete? (Maybe this is still termed the peer model?)
My final question is about the synchronisation: should it be a barrier or a condition variable, and why?
If anyone can suggest a better approach to this problem, or poke holes in any of my assumptions, that would be great. Unfortunately I'm restricted from using a higher-level library such as tbb for tackling this.
Edit: I should point out, in case this isn't clear, that each task needs to be completed in its entirety before moving on to the next.
I'm a bit confused by your description here; I hope the below is relevant.
I have always found this pattern very useful: the "boss" is responsible for detecting work and dispatching it to a worker pool based on some algorithm; from then on, the worker is independent.
In this scenario, the worker is always waiting for work and is not aware of any other instance. It processes requests and, when it finishes, may trigger a notification of completion.
This has the advantage of a good separation between the work itself and the algorithm that balances it across the threads.
The other option is for the "boss" to maintain a pool of work items, and for the workers to pick them up as soon as they are free. But I guess this is more complex to implement and requires more synchronization. I do not see the benefit of this second approach over the previous one.
Control logic and worker state are maintained by the "boss" in both scenarios.
Since the parallel work is done per task, each "boss" object handles one task. In a simple implementation, this "boss" blocks until its task is finished, after which the next "boss" in line can be called.
Regarding the sync: unless I'm missing something, you only need to sync once, when all the workers have finished. This sync is done at the "boss"; the workers just send notifications that they have finished.
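As for the question of whether the peer model can get by with a single synchronisation point: one possible way, sketched below, is to give every task its own atomic index cursor, so that nothing has to be reset between tasks and one barrier per task is enough. This is only a sketch under assumptions: it relies on C11 atomics and the optional POSIX barrier API being available, and the sizes, the run_task body and all names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define DATA_SIZE    1000
#define TASK_COUNT   4
#define THREAD_COUNT 8

static int data[DATA_SIZE];
static atomic_int next_index[TASK_COUNT];   /* one cursor per task */
static pthread_barrier_t task_barrier;

/* Hypothetical task body: task t adds t + 1 to the item. */
static void run_task(int task, int *item)
{
    *item += task + 1;
}

static void *peer(void *arg)
{
    (void)arg;
    for (int task = 0; task < TASK_COUNT; task++) {
        int i;
        /* Claim items until this task's cursor runs past the data. */
        while ((i = atomic_fetch_add(&next_index[task], 1)) < DATA_SIZE)
            run_task(task, &data[i]);
        /* The only sync point: nobody starts task + 1 before task is done. */
        pthread_barrier_wait(&task_barrier);
    }
    return NULL;
}

/* Runs every task over the data with THREAD_COUNT peers; returns the sum. */
long run_peers(void)
{
    pthread_t threads[THREAD_COUNT];
    long sum = 0;
    int rc;

    for (int i = 0; i < DATA_SIZE; i++)
        data[i] = 0;
    for (int t = 0; t < TASK_COUNT; t++)
        atomic_store(&next_index[t], 0);

    rc = pthread_barrier_init(&task_barrier, NULL, THREAD_COUNT);
    assert(rc == 0);
    for (int t = 0; t < THREAD_COUNT; t++) {
        rc = pthread_create(&threads[t], NULL, peer, NULL);
        assert(rc == 0);
    }
    for (int t = 0; t < THREAD_COUNT; t++)
        pthread_join(threads[t], NULL);
    pthread_barrier_destroy(&task_barrier);

    for (int i = 0; i < DATA_SIZE; i++)
        sum += data[i];
    return sum;
}
```

Each item receives 1 + 2 + 3 + 4 = 10 across the four tasks, so run_peers() returns 10000 for 1000 items. A barrier fits here because every peer must reach the same point before any of them continues; a condition variable would be the more natural choice if a separate "boss" had to sleep until the workers report completion.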

C: pattern for returning asynchronous errors from a background thread?

I'm writing an open source C library. This library is quite complex, and some operations can take a long time. I therefore created a background thread which manages the long-running tasks.
My problem is that I have not yet found an elegant way to return errors from the background thread. Suppose the background thread reorganizes a file or does periodic maintenance, and it fails – what to do?
I currently see two options:
1) if the user is interested in seeing these errors, he can register a callback function.
I don't like this option – the user doesn't even know that there's a background thread, so he will most likely forget about setting the callback function. From usability point of view, this option is bad.
2) the background thread stores the error in a global variable and the next API function returns this error.
That's what I'm currently doing, but I'm also not 100% happy with it, because it means that users have to expect EVERY possible error code being returned from every API function. I.e. if the background thread sets an IO Error, and the user just wants to know the library version, he will get an IO error although the get_version() API call doesn't access the disk at all. Again, bad usability…
Any other suggestions/ideas?
Perhaps for the "long running operations" (the ones you'd like to use a thread for) give users two options:
a blocking DoAction(...) that returns status
a non-blocking DoActionAsync(..., <callback>) that gives the status to a user provided callback function
This gives the user the choice in how they want to handle the long operation (instead of you deciding for them), and it is clear how the status will be returned.
Note: I suppose that if they call DoActionAsync, and the user doesn't specify a callback (e.g. they pass null) then the call wouldn't block, but the user wouldn't have/need to handle the status.
I am interested in how the completion status is reported to the caller of the API.
The background thread carries out all the execution, so the foreground thread either waits for completion (synchronous) or does other work after registering a callback.
The first method is synchronous, like your use of a global variable. Instead of the global variable you can use a message queue with one slot. Then:
- The caller can poll the message queue for the status
- The caller can block waiting on the message queue for the status
That is what I can think of.
If I were the caller, though, I would also want to know the progress when the operation takes a very long time. So it is better to report some kind of percentage completion, which lets the end user build a better application, with a progress bar and so on.
You should keep a thread-safe list (or queue) of error events and warnings. The worker thread can post events to the list, then the main thread can read events from the list, one at a time, or in a batch to prevent race conditions. Ideally, the main thread should fetch a copy of the event queue and flush it so there is no chance of duplicating events in the case of multiple main or worker threads. Events on the list would have a type and details.
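A minimal sketch of such an event list, assuming pthreads and a fixed-capacity array; the event layout, the capacity and the function names are hypothetical. The reader copies a batch out and removes it under the same lock, which is what prevents two readers from seeing the same event.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 64

/* Hypothetical event record: a type code plus a short description. */
struct event {
    int type;
    char details[48];
};

static struct event pending[MAX_EVENTS];
static size_t pending_count;
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* Worker side: returns 0 on success, -1 if the list is full (event dropped). */
int post_event(int type, const char *details)
{
    int rc = -1;
    pthread_mutex_lock(&pending_lock);
    if (pending_count < MAX_EVENTS) {
        struct event *e = &pending[pending_count++];
        e->type = type;
        strncpy(e->details, details, sizeof e->details - 1);
        e->details[sizeof e->details - 1] = '\0';
        rc = 0;
    }
    pthread_mutex_unlock(&pending_lock);
    return rc;
}

/* Main side: copies up to cap events into out, removes them from the list,
 * and returns how many were copied. */
size_t drain_events(struct event *out, size_t cap)
{
    pthread_mutex_lock(&pending_lock);
    size_t n = pending_count < cap ? pending_count : cap;
    memcpy(out, pending, n * sizeof *out);
    memmove(pending, pending + n, (pending_count - n) * sizeof *pending);
    pending_count -= n;
    pthread_mutex_unlock(&pending_lock);
    return n;
}
```

A caller would pass its own array of struct event to drain_events() and then handle the returned batch without holding the lock. Dropping events when the list is full is a choice made for this sketch; blocking or overwriting the oldest entry are alternatives.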
If you're providing a library and try to hide expensive work via a thread I'd suggest to not do it that way. If something is expensive it should be visible to the caller and if it bugs him, he should take care of backgrounding/threading himself. That way he also has full control over the error.
It also takes the control over his process away from the developer who uses your library.
If you still want to use threads I'd suggest to really follow the callback-route but make it clearly visible in the API and documentation that there will be a background thread running on this task and therefore the callback is necessary.
Best way would be if you offered both ways, synchronous and asynchronous, to the users of the library so they can choose what fits best for them in their specific situation.
Thanks for all the good answers. You provided me with a lot of material to think about.
Some of you suggested callbacks. Initially, I thought a callback was a good idea, but it just moves the problem to the user. If a user gets an asynchronous error notification, how will he deal with it? He will have to interrupt and/or notify his synchronous program flow, which is usually tricky and often breaks a design.
The solution I'm using now: if the background thread generates an error, the next API call returns the error BACKGROUND_ERROR_PENDING. With a separate API function (get_background_error()) the user can look at the actual error code if he's interested in it.
Also, I added documentation so users aren't too surprised if this error is returned.
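That scheme could look roughly like the sketch below. Only BACKGROUND_ERROR_PENDING and get_background_error() are named in the post; the numeric codes, report_background_error() and the lib_store() API call are hypothetical, and keeping only the first unread error is an assumption of this sketch.

```c
#include <pthread.h>

/* Hypothetical error codes; only BACKGROUND_ERROR_PENDING is named above. */
enum {
    LIB_OK = 0,
    LIB_IO_ERROR = 1,
    BACKGROUND_ERROR_PENDING = 2
};

static pthread_mutex_t bg_lock = PTHREAD_MUTEX_INITIALIZER;
static int bg_error = LIB_OK;

/* Background thread: remember a failure. The first unread error is kept. */
void report_background_error(int code)
{
    pthread_mutex_lock(&bg_lock);
    if (bg_error == LIB_OK)
        bg_error = code;
    pthread_mutex_unlock(&bg_lock);
}

/* Caller: fetch the stored background error and clear it. */
int get_background_error(void)
{
    pthread_mutex_lock(&bg_lock);
    int code = bg_error;
    bg_error = LIB_OK;
    pthread_mutex_unlock(&bg_lock);
    return code;
}

static int background_error_pending(void)
{
    pthread_mutex_lock(&bg_lock);
    int pending = (bg_error != LIB_OK);
    pthread_mutex_unlock(&bg_lock);
    return pending;
}

/* Hypothetical API call: refuses to run while an error is unread. */
int lib_store(int key, int value)
{
    if (background_error_pending())
        return BACKGROUND_ERROR_PENDING;
    (void)key;
    (void)value;   /* the real work would go here */
    return LIB_OK;
}
```

With this shape every API function only needs to document one extra return code, and the specific cause stays behind get_background_error() for callers who care about it.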
You might take a look at Java's Future API for an alternate mechanism for dealing with asynchronous calls and errors. You could easily substitute the checked exceptions with some isError() or getError() methods if you preferred.
I agree with Sean: a message queue with an event loop.
If there is an error, the background thread can insert it into the queue. The event loop will block until a new message becomes available.
I have used the Apache Portable Runtime (APR) with great success with this design in building telecoms servers with a high transaction rate. It has never failed.
I use one thread to insert into the queue; that would be your background thread. The event loop runs in another thread and blocks until a new message is inserted.
I would recommend the APR thread pool with the APR FIFO queue (which is also thread safe).
Quick design here:
/* Returns TRUE if the error was queued, FALSE otherwise. */
int background_job()
{
    /* There has been an error: insert it into the queue */
    apr_status_t rv = apr_queue_push(queue, data);
    if (rv == APR_EOF) {
        MODULE_LOG(APK_PRIO_WARNING, "Message queue has been terminated");
        return FALSE;
    }
    else if (rv == APR_EINTR) {
        MODULE_LOG(APK_PRIO_WARNING, "Message queue was interrupted");
        return FALSE;
    }
    else if (rv != APR_SUCCESS) {
        char err_buf[BUFFER_SIZE];
        MODULE_LOG(APK_PRIO_CRITICAL, "Failed to push to queue %s", apr_strerror(rv, err_buf, BUFFER_SIZE));
        return FALSE;
    }
    return TRUE;
}

/* Blocks in apr_queue_pop until a message arrives; returns FALSE on a queue failure. */
int evt_loop()
{
    while (continue_loop) {
        void *msg = NULL;   /* apr_queue_pop stores the popped element here */
        apr_status_t rv = apr_queue_pop(queue, &msg);
        if (rv == APR_EOF) {
            MODULE_LOG(APK_PRIO_WARNING, "Message queue has been terminated");
            return FALSE;
        }
        else if (rv == APR_EINTR) {
            MODULE_LOG(APK_PRIO_WARNING, "Message queue was interrupted");
            return FALSE;
        }
        else if (rv != APR_SUCCESS) {
            char err_buf[BUFFER_SIZE];
            MODULE_LOG(APK_PRIO_CRITICAL, "Failed to pop from the queue %s", apr_strerror(rv, err_buf, BUFFER_SIZE));
            return FALSE;
        }
        /* handle the error message in msg here, then wait for the next one */
    }
    return TRUE;
}
The above are just simple code snippets; if you want, I can post more complete code.
Hope that helps
