I have a Flink job with the classic shape of datasource-operator1-operatorN-sink.
From what I can observe, the open() method of operator1 is invoked before the open() method of the datasource.
In the open() method of operator1 I need to handle some business logic that depends on things which only get resolved in datasource.open().
1- Is there any way to ensure that operator1.open() is not invoked until datasource.open() has run?
2- Is there any way to communicate/signal from the datasource.open() method to the operator1.open() method?
Trying to establish some sort of out-of-band communication between operators often gets folks into trouble. At best it can screw up performance, and at worst it can lead to deadlocks.
What you might try instead is to rely on the signaling pathway that already exists between the data source and operator1: emit a specially encoded event from the data source that tells operator1 it can start now, and have operator1 wait for that special record before doing any other processing.
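As a minimal sketch of the receiving side (the names Event, Result, isReadyMarker() and initBusinessLogic() are placeholders, not from the original job): the operator does nothing in open() that depends on the source, and instead finishes its setup when the marker record arrives.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class GatedOperator extends RichFlatMapFunction<Event, Result> {

    private transient boolean ready;

    @Override
    public void open(Configuration parameters) {
        // Don't do anything here that depends on datasource.open();
        // there is no ordering guarantee between the two open() calls.
        ready = false;
    }

    @Override
    public void flatMap(Event event, Collector<Result> out) {
        if (event.isReadyMarker()) {
            // the source emitted this marker once its own setup was done
            initBusinessLogic(event);
            ready = true;
            return; // the marker itself is not forwarded downstream
        }
        if (ready) {
            out.collect(process(event));
        }
        // else: buffer or drop, depending on your guarantees
    }

    private void initBusinessLogic(Event marker) { /* ... */ }

    private Result process(Event event) { /* ... */ return null; }
}

Keep in mind that with parallelism greater than one, a single marker record is routed to only one downstream subtask, so you would either broadcast the marker or run the relevant operators with parallelism 1.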
Related
I have a use case where I need to apply multiple functions to every incoming message, each producing 0 or more results.
Having a loop won't scale for me, and ideally I would like to be able to emit results as soon as they are ready instead of waiting for all the functions to be applied.
I thought about using AsyncIO for this and maintaining a ThreadPool, but if I am not mistaken I can only emit one record per input using this API. That is not a deal-breaker, but I'd like to know if there are other options, such as using a ThreadPool inside a Map/Process function so that I can send the results as they are ready.
Would this be an anti-pattern, or would it cause any problems with checkpointing or at-least-once guarantees?
Depending on the number of different functions involved, one solution would be to fan each incoming message out to n operators, each applying one of the functions.
I fear you'll get into trouble if you try this with a multi-threaded map/process function.
How about this instead:
You could have something like a RichCoFlatMap (or KeyedCoProcessFunction, or BroadcastProcessFunction) that is aware of all of the currently active functions, and for each incoming event, emits n copies of it, each being enriched with info about a specific function to be performed. Following that can be an async i/o operator that has a ThreadPool, and it takes care of executing the functions and emitting results if and when they become available.
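As a rough, untested sketch of the async part of that pipeline (Message, Task, Result, Task.apply() and activeFunctionIds are hypothetical; in the fully dynamic case the list of functions would arrive via the broadcast/connected stream described above):

import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.util.Collector;

// Executes one Task (message + function id) on a private thread pool and emits
// the result as soon as it is ready.
public class TaskRunner extends RichAsyncFunction<Task, Result> {

    private transient ExecutorService pool;

    @Override
    public void open(Configuration parameters) {
        pool = Executors.newFixedThreadPool(10);
    }

    @Override
    public void asyncInvoke(Task task, ResultFuture<Result> resultFuture) {
        pool.submit(() -> resultFuture.complete(Collections.singleton(task.apply())));
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}

// Wiring inside the job: fan each message out into one Task per function,
// then run the tasks asynchronously.
DataStream<Task> tasks = messages
    .flatMap((Message m, Collector<Task> out) -> {
        for (String fnId : activeFunctionIds) {
            out.collect(new Task(fnId, m));
        }
    })
    .returns(Task.class);

DataStream<Result> results =
    AsyncDataStream.unorderedWait(tasks, new TaskRunner(), 30, TimeUnit.SECONDS, 100);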
I am implementing a use case in Flink Stateful Functions. My specification says that, starting from a stateful function f, a business workflow (in other words, a group of stateful functions f1, f2, … fn) is called either sequentially, in parallel, or both. Stateful function f waits for a result to be returned in order to update a local state; it also starts a timeout callback, i.e. a message to itself. At the timeout, f checks whether the local state has been updated (it has received a result); if this is the case, life is good.
However, if at the timeout f discovers that it has not received a result yet, it has to launch a compensating workflow to undo any changes that stateful functions f1, f2, … fn might have made.
Does the Flink Stateful Functions framework support such a design pattern/use case, or should it be implemented at the application level? What is the simplest design to achieve such a solution? For instance, how do I know which of the workflow's stateful functions f1, f2, … fn were affected by the timed-out invocation (where the control flow has timed out)? How do Flink Stateful Functions and the concept of integrated messaging and state facilitate such a pattern?
Thank you.
I posted the question on the Apache Flink mailing list and got the following response from Igal Shilman. Thanks, Igal.
The first thing that I would like to mention is that, if your original motivation for that scenario is a concern about transient failures such as:
did function Y ever receive a message sent by function X?
did sending a message fail?
is the target function there to accept a message sent to it?
did the order of messages get mixed up?
etc.
then StateFun eliminates all of these problems and a whole class of transient errors that you would otherwise have to deal with yourself in your business logic (like retries, backoffs, service discovery, etc.).
Now, if your motivating scenario is not about transient errors but more about transactional workflows, then as Dawid mentioned you would have to implement this in your application logic. I think that the way you have described the flow should map directly to a coordinating function (per flow instance) that keeps track of results/timeouts in its internal state.
Here is a sketch:
A Flow Coordinator Function - it would be invoked with the input necessary to kick off a flow. It would start invoking the relevant functions (as defined by the flow's DAG) and would keep internal state indicating which functions (addresses) were invoked and their completion statuses. When the flow completes successfully, the coordinator can safely discard its state. In any case where the coordinator decides to abort the flow (an internal timeout / an external message / etc.), it would have to check its internal state and kick off a compensating workflow (sending a special message to the already succeeded/in-progress functions).
Each function in the flow has to accept a message from the coordinator, in turn, and reply with either a success or a failure.
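To make that sketch a little more concrete, here is a rough, untested outline of such a coordinator using the embedded Java SDK. StartFlow, StepCompleted, Timeout, DoWork, Compensate, the FlowState POJO and the WORKER function type are all hypothetical names you would replace with your own message and state types:

import java.time.Duration;

import org.apache.flink.statefun.sdk.Context;
import org.apache.flink.statefun.sdk.FunctionType;
import org.apache.flink.statefun.sdk.StatefulFunction;
import org.apache.flink.statefun.sdk.annotations.Persisted;
import org.apache.flink.statefun.sdk.state.PersistedValue;

public class FlowCoordinator implements StatefulFunction {

    static final FunctionType WORKER = new FunctionType("example", "worker");

    // which steps were started and which have completed (FlowState is a simple POJO)
    @Persisted
    private final PersistedValue<FlowState> state =
        PersistedValue.of("flow-state", FlowState.class);

    @Override
    public void invoke(Context context, Object input) {
        if (input instanceof StartFlow) {
            FlowState s = new FlowState();
            for (String step : ((StartFlow) input).getSteps()) {
                context.send(WORKER, step, new DoWork(step));
                s.markStarted(step);
            }
            state.set(s);
            // the timeout is just a delayed message to ourselves
            context.sendAfter(Duration.ofMinutes(5), context.self(), new Timeout());
        } else if (input instanceof StepCompleted) {
            FlowState s = state.get();
            s.markDone(((StepCompleted) input).getStep());
            if (s.allDone()) {
                state.clear(); // flow finished successfully, discard the state
            } else {
                state.set(s);
            }
        } else if (input instanceof Timeout && state.get() != null) {
            // not every result arrived in time: compensate every step we started
            for (String step : state.get().startedSteps()) {
                context.send(WORKER, step, new Compensate(step));
            }
            state.clear();
        }
    }
}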
I have a Flink job that consumes Kafka topics and passes the data through a bunch of operators. I'm wondering what's the best way to deal with exceptions that happen in the middle.
My goal is to have a centralized place to handle those exceptions that may be thrown from different operators and here is my current solution:
Use a ProcessFunction and, in the catch block when an exception occurs, emit to a side output via the Context; then have a separate sink function for that side output at the end, which calls an external service to update the status of another related job.
However, my question is that by doing so it seems I still need to call collector.collect() with a null value in order to proceed to the following operators and reach the last stage, where the side output flows into the separate sink function. Is this the right way to do it?
Also, I'm not sure what actually happens if I don't call collector.collect() inside an operator; would it hang there and cause a memory leak?
It's fine to not call collector.collect(). And you don't need to call collect() with a null value when you use the side output to capture the exception - each operator can have its own side output. Finally, if you have multiple such operators with a side output for exceptions, you can union() the side outputs together before sending that stream to a sink.
If for some reason the downstream operator(s) need to know that there was an exception, then one approach is to output an Either<good result, Exception>, but then each downstream operator would of course need to have code to check what it's receiving.
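A rough sketch of the side-output approach (Event, Enriched, enrich(), normalSink and errorSink are placeholders): failures go only to the side output and nothing is emitted on the main output for that record, so the failed element simply does not continue downstream.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

final OutputTag<String> errorsTag = new OutputTag<String>("errors") {};

SingleOutputStreamOperator<Enriched> enriched = events.process(
    new ProcessFunction<Event, Enriched>() {
        @Override
        public void processElement(Event event, Context ctx, Collector<Enriched> out) {
            try {
                out.collect(enrich(event));
            } catch (Exception e) {
                // no collect() here; the failed record goes only to the side output
                ctx.output(errorsTag, event + ": " + e.getMessage());
            }
        }
    });

enriched.addSink(normalSink);

// side outputs from several such operators can be union()ed into one error stream
DataStream<String> errors = enriched.getSideOutput(errorsTag);
errors.addSink(errorSink);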
When do you use a callback function? I know how they work, I have seen them in use and I have used them myself many times.
An example from the C world would be libcurl which relies on callbacks for its data retrieval.
An opposing example would be OpenSSL: Where I have used it, I use out parameters:
ret = somefunc(&target_value);
if (ret != 0) {
    /* error case */
}
I am wondering when to use which. Is a callback only useful for async stuff? I am currently in the process of designing my application's API and I am wondering whether to use a callback or just an out parameter. Under the hood it will use libcurl and OpenSSL as the main libraries it builds on, and the parameter "returned" is an OpenSSL data type.
I don't see any benefit of a callback over just returning. Is it only useful if I want to process the data in some way instead of just giving it back? But then I could process the returned data. What is the difference?
In the simplest case, the two approaches are equivalent. But if the callback can be called multiple times to process data as it arrives, then the callback approach provides greater flexibility, and this flexibility is not limited to async use cases.
libcurl is a good example: it provides an API that allows specifying a callback for all newly arrived data. The alternative, as you present it, would be to just return the data. But return it — how? If the data is collected into a memory buffer, the buffer might end up very large, and the caller might have only wanted to save it to a file, like a downloader. If the data is saved to a file whose name is returned to the caller, it might incur unnecessary IO if the caller in fact only wanted to store it in memory, like a web browser showing an image. Either approach is suboptimal if the caller wanted to process data as it streams, say to calculate a checksum, and didn't need to store it at all.
The callback approach allows the caller to decide how the individual chunks of data will be processed or assembled into a larger whole.
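As an illustration of that flexibility (sketched in Java rather than C, with invented names; this is not libcurl's actual API): the download routine pushes each chunk to a caller-supplied callback, so the same routine can feed a file writer, an in-memory buffer, or a checksum that never stores the payload at all.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;

public class CallbackDownload {

    interface ChunkHandler {
        void onChunk(byte[] data, int length);
    }

    static void download(String url, ChunkHandler handler) throws IOException {
        try (InputStream in = new URL(url).openStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                handler.onChunk(buf, n); // hand over each chunk as it arrives
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // A caller that only wants a checksum never holds the whole payload in memory.
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        download("https://example.com/big-file", (data, len) -> digest.update(data, 0, len));
        System.out.printf("%064x%n", new java.math.BigInteger(1, digest.digest()));
    }
}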
Callbacks are useful for asynchronous notification. When you register a callback with some API, you are expecting that callback to be run when some event occurs. Along the same vein, you can use them as an intermediate step in a data processing pipeline (similar to an 'insert' if you're familiar with the audio/recording industry).
So, to summarise, these are the two main paradigms that I have encountered and/or implemented callback schemes for:
I will tell you when data arrives or some event occurs - you use it as you see fit.
I will give you the chance to modify some data before I deal with it.
If the value can be returned immediately then yes, there is no need for a callback. As you surmised, callbacks are useful in situations wherein a value cannot be returned immediately for whatever reason (perhaps it is just a long running operation which is better performed asynchronously).
My take on this: I see it as a question of which module has to know about which one. Let's call them Data-User and IO.
Assume you have some IO where data comes in. The IO module might not even know who is interested in the data. The Data-User, however, knows exactly which data it needs. So the IO module should provide a function like subscribe_to_incoming_data(func), and the Data-User module will subscribe to the specific data the IO module has. The alternative would be to change code in the IO module to call the Data-User, but with existing libs you definitely don't want to touch existing code that someone else has provided to you.
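A toy Java illustration of that dependency direction (all names invented for the example): the IO module knows nothing about its consumers; interested parties register themselves.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class IoModule {
    private final List<Consumer<byte[]>> subscribers = new ArrayList<>();

    // the IO module only offers a hook; it does not know who is on the other end
    void subscribeToIncomingData(Consumer<byte[]> subscriber) {
        subscribers.add(subscriber);
    }

    // called internally whenever new data arrives
    void onDataArrived(byte[] data) {
        subscribers.forEach(s -> s.accept(data));
    }
}

class DataUser {
    DataUser(IoModule io) {
        // the Data-User knows about IO, not the other way around
        io.subscribeToIncomingData(this::handle);
    }

    private void handle(byte[] data) {
        System.out.println("got " + data.length + " bytes");
    }
}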
I am using WCF services asynchronously in a WPF application, so I have a class with all the web service calls. The view models call the methods in this class, which in turn call the web service.
So the view Model code looks like this:
WebServiceAgent.GetProductByID(SelectedProductID, (s, e)=>{States = e.Result;});
And the WebService agent looks like:
public static void GetProductByID(int ProductID, EventHandler<GetProductListCompletedEventArgs> callback)
{
    Client.GetProductByIDCompleted += callback;
    Client.GetProductByIDAsync(ProductID);
}
Is this a good approach? I am using the MVVM Light toolkit. The view model is static, so it stays alive for the lifetime of the application. But each time the view model calls this WebServiceAgent, I think I am registering an event handler, and that handler is never unregistered.
Is this a problem? Let's say the view model calls this 20-30 times; am I introducing some kind of memory leak?
Some helpful information, based on mistakes I have made myself:
The Client object seems to be re-used all the time. If you don't unregister the event handlers, they will stack up and fire again when future invocations of the same operations finish, and you'll get unpredictable results.
The States = e.Result statement is executed on the event handler's thread, which is not the UI dispatcher thread. When updating lists or complex properties this will cause problems.
In general, not unregistering event handlers after they have been invoked is a bad idea, as it will indeed cause hard-to-find memory leaks.
You should probably refactor to create or re-use a clean client, wrap the viewmodel callback inside another callback that will take care of unregistering itself, cleaning up the client, and invoking the viewmodel's callback on the main dispatcher thread.
If you think all this is tedious, check out http://blogs.msdn.com/b/csharpfaq/archive/2010/10/28/async.aspx and http://msdn.microsoft.com/en-us/vstudio/async.aspx. In the next version of C# an async keyword will be introduced to make this all easier. A CTP is available already.
Event handlers are death traps and you will leak them if you do not "unsubscribe" with "-=".
One way to avoid is to use RX (Reactive Extensions) that will manage your event subscriptions. Take a look at http://msdn.microsoft.com/en-us/data/gg577609 and specifically creating Observable by using Observable.FromEvent or FromAsync http://rxwiki.wikidot.com/101samples.
This is unfortunately not a good approach.
I learned this the hard way in Silverlight.
Your WebServiceAgent is probably a long-lived object, whereas the model or view is probably short-lived.
Events hold references, and in this case they give the web service agent and the WCF client a reference to the model. A long-lived object holding a reference to a short-lived object means the short-lived object will not be collected, and so you will have a memory leak.
As Pieter-Bias said, the async functionality will make this easier.
Have you looked at RIA Services? This is the exact problem that RIA Services was designed to solve.
Yes, the event handlers are basically going to cause a leak unless removed. To get the near-single line equivalent of what you're expressing in your code, and to remove handlers you're going to need an instance of some sort of class that represents the full lifecycle of the call and does some housekeeping.
What I've done is create a Caller<TResult> class that uses an underlying WCF client proxy following this basic pattern:
create a Caller instance around an existing or new client proxy (the proxy's lifecycle is outside the scope of the call to be made, so you can use a new short-lived one or an existing long-lived one).
use one of Caller's various CallAsync<TArg [,...]> overloads to specify the async method to call and the intended callback to call upon completion. This method will choose the async method that also takes a state parameter. The state parameter will be the Caller instance itself.
I say intended because the real handler that will be wired up will do a bit more housekeeping. The real callback is what will be called at the end of the async call, and will
check that ReferenceEquals(e.UserState, this) in your real handler
if not true, immediately return (the event was not intended to be the result of this particular call and should be ignored; this is very important if your proxy is long lived)
otherwise, immediately remove the real handler
call your intended, actual callback with e.Result
Modify Caller's real handler as needed to execute the intended callback on the right thread (more important for WPF than Silverlight)
The above implementation should also have separate handlers for cases where e.Error is non-null or e.Cancelled is true. This gives you the advantage of not checking these cases in your intended callback. Perhaps your overloads take in optional handlers for those cases.
At any rate, you end up cleaning up handlers aggressively at the expense of some per-call wiring. It's a bit expensive per-call, but with proper optimization ends up being far less expensive than the over-the-wire WCF call anyway.
Here's an example of a call using the class (you'll note I use method groups in many cases to increase readability, though HandleStuff could have been result => use result). The first method group is important, because CallAsync gets the owner of that delegate (i.e. the service instance), which is needed to call the method; alternatively the service could be passed in as a separate parameter.
Caller<AnalysisResult>.CallAsync(
    // line below could also be longLivedAnalyzer.AnalyzeSomeThingsAsync
    new AnalyzerServiceClient().AnalyzeSomeThingsAsync,
    listOfStuff,
    HandleAnalyzedStuff,
    // optional handlers for error or cancelled would go here
    onFailure: TellUserWhatWentWrong);