Using multithreaded library within ProcessFunction with callbacks relying on the out parameter

I am working on a CoProcessFunction that uses a third-party library to detect certain patterns of events based on some rules. In the end, the processElement1 method basically forwards the events to this library and registers a callback so that, when a match is detected, the CoProcessFunction can emit an output event. To achieve this, the callback relies on a reference to the out: Collector[T] parameter of processElement1.
Having said that, I am not sure whether this use case is well supported by Flink, since:
There might be multiple threads spawned by the third-party library (let's say I have no control over the number of threads spawned; this is decided by the library)
I am not sure whether out might be recreated by Flink at some point, invalidating the references in the callbacks and making them crash
So far I have not observed any issues, but I have only run my program at small scale. It would be great to hear from the experts whether my approach is correct and how this could be approached otherwise.
An update based on Arvid's comments: since my current process function already works well for me, except that I don't have access to the mailbox executor, I have simply created a custom operator to inject it:
class MyOperator(myFunction: MyFunction)
    extends KeyedCoProcessOperator(myFunction) {

  private lazy val mailboxExecutor = getContainingTask
    .getMailboxExecutorFactory
    .createExecutor(getOperatorConfig.getChainIndex)

  override def open(): Unit = {
    super.open()
    userFunction.asInstanceOf[MyFunction].mailboxExecutor = mailboxExecutor
  }
}
This way I can register callbacks that will send mails to be processed one by one. In the main application I use it like this:
.transform("wrapping function in operator", new MyOperator(new MyFunction()))
So far everything looks good to me, but if you see problems or know a better way, it would be great to hear your thoughts on this again. In particular, the way of getting access to the mailbox executor is definitely a bit clumsy...

If you have asynchronous callbacks, you really should use async I/O. So use your CoProcessFunction to emit a Tuple2 and have an async I/O operator directly following it.
The OP has now added that he may not get a result back at all, which makes async I/O difficult to use. You could rely on the timeout to trigger so that the element gets removed, but that may slow down processing, as async I/O has a limited queue of "active" elements.
So, the way to go in Flink 1.10 would probably be to implement a custom operator using the MailboxExecutor.
Getting the executor is still a bit clumsy, but you could check AsyncWaitOperator and the AsyncWaitOperatorFactory.
Code sketch for using the executor:
// setup is optional, but if you use timestamped records, you usually do this
void setup(StreamTask<?, ?> containingTask, StreamConfig config, Output<StreamRecord<OUT>> output) {
    super.setup(containingTask, config, output);
    this.timestampedCollector = new TimestampedCollector<>(output);
}

void processElement(record) {
    // the library calls back on one of its own threads
    externalLib.addElement(record, (match) -> {
        // hand the match over to the task thread instead of emitting directly
        mailboxExecutor.execute(() -> {
            timestampedCollector.collect(match);
        });
    });
}
Note that this involves quite a bit of @PublicEvolving code, and we already have some changes on our agenda. So be prepared to adjust the code for 1.11.
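To make the threading idea in the sketch above concrete without depending on Flink, here is a minimal, self-contained Java sketch. It is not Flink code: a single-thread executor stands in for the MailboxExecutor (the one thread allowed to touch the collector), a fixed pool stands in for the library's threads, and a plain list stands in for out. The class name MailboxSketch and the names mailbox, libraryThreads and collected are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class MailboxSketch {

    /** Feeds n matches through a multithreaded "library" and returns what the mailbox thread collected. */
    static List<String> run(int n) {
        // Stand-in for the task thread that drains the mailbox (assumption, not Flink API).
        ExecutorService mailbox = Executors.newSingleThreadExecutor();
        // Stand-in for the third-party library's own threads (assumption).
        ExecutorService libraryThreads = Executors.newFixedThreadPool(4);

        // Deliberately not thread-safe: only the mailbox thread mutates it,
        // just as only the task thread should touch the collector.
        List<String> collected = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(n);
        try {
            for (int i = 0; i < n; i++) {
                String match = "match-" + i;
                // The library invokes the callback on one of its own threads...
                libraryThreads.execute(() ->
                    // ...and the callback only enqueues a mail; it never emits directly.
                    mailbox.execute(() -> {
                        collected.add(match);
                        done.countDown();
                    }));
            }
            // Wait until every mail has been processed (bounded, to avoid hanging).
            done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            libraryThreads.shutdown();
            mailbox.shutdown();
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(run(100).size());
    }
}
```

Because every callback only enqueues work, the non-thread-safe list is mutated by one thread only. The same reasoning is why the callback in the Flink sketch hands the match to mailboxExecutor instead of calling collect itself.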

Related

Multithreading inside Flink's Map/Process function

I have a use case where I need to apply multiple functions to every incoming message, each producing 0 or more results.
Having a loop won't scale for me, and ideally I would like to be able to emit results as soon as they are ready instead of waiting for all the functions to be applied.
I thought about using AsyncIO for this, maintaining a ThreadPool, but if I am not mistaken I can only emit one record using this API. That is not a deal-breaker, but I'd like to know if there are other options, such as using a ThreadPool inside a Map/Process function so that I can send the results as they become ready.
Would this be an anti-pattern, or cause any problems with regard to checkpointing or at-least-once guarantees?

Depending on the number of different functions involved, one solution would be to fan each incoming message out to n operators, each applying one of the functions.
I fear you'll get into trouble if you try this with a multi-threaded map/process function.
How about this instead:
You could have something like a RichCoFlatMap (or KeyedCoProcessFunction, or BroadcastProcessFunction) that is aware of all of the currently active functions, and for each incoming event, emits n copies of it, each being enriched with info about a specific function to be performed. Following that can be an async i/o operator that has a ThreadPool, and it takes care of executing the functions and emitting results if and when they become available.
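The two steps suggested above (fan each event out into one copy per function, then run the copies on a thread pool) can be sketched in plain Java. This is only an illustration under assumptions, not Flink code: FanOutSketch, Work, fanOut and runAsync are hypothetical names, each copy carries the function itself rather than "info about" it, and the results are gathered into a list where a real async I/O operator would hand each one to its result callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

class FanOutSketch {

    /** One copy of an incoming event, enriched with the function to apply to it. */
    static final class Work {
        final String event;
        final Function<String, String> function;

        Work(String event, Function<String, String> function) {
            this.event = event;
            this.function = function;
        }
    }

    /** Step 1: emit one copy of the event per currently active function. */
    static List<Work> fanOut(String event, List<Function<String, String>> functions) {
        List<Work> copies = new ArrayList<>();
        for (Function<String, String> f : functions) {
            copies.add(new Work(event, f));
        }
        return copies;
    }

    /** Step 2: execute every copy on a pool; each future completes independently of the others. */
    static List<String> runAsync(List<Work> work) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (Work w : work) {
                pending.add(CompletableFuture.supplyAsync(() -> w.function.apply(w.event), pool));
            }
            // Joined in submission order here only to keep the demo output deterministic.
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> f : pending) {
                results.add(f.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    static List<String> demo() {
        List<Function<String, String>> functions = new ArrayList<>();
        functions.add(String::toUpperCase);
        functions.add(s -> s + "!");
        functions.add(s -> String.valueOf(s.length()));
        return runAsync(fanOut("hello", functions));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the thread pool in a separate stage means the fan-out function itself stays single-threaded, which is the point of the suggestion above.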

How is SourceFunction#run supposed to work in Flink?

I have implemented a source by extending RichSourceFunction for our message queue, which Flink doesn't support.
I implemented the run method, whose signature is:
override def run(sc: SourceFunction.SourceContext[String]): Unit = {
  val msg = read_from_mq
  sc.collect(msg)
}
When the run method is called and there is no new message in the message queue:
1. Should I return without calling sc.collect, or
2. can I wait until new data arrives (in which case the run method will block)?
I would prefer the 2nd one, but I am not sure whether this is the correct usage.

The run method of a Flink source should loop, endlessly producing output until its cancel method is called. When there's nothing to produce, then it's best if you can find a way to do a blocking wait.
The Apache NiFi source connector is another reasonable example to use as a model. You will note that it sleeps for a configurable interval when there's nothing for it to do.

As you probably know, both options are functionally correct and will yield correct results.
That being said, the second one is preferred because the thread is not kept busy while there is nothing to read. In fact, if you take a look at the RabbitMQ connector implementation, you'll notice that this is exactly how it is implemented: inside its run method it indirectly waits for messages to be placed on a BlockingQueue.
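The "loop until cancelled, block while idle" shape described in these answers can be sketched with a BlockingQueue in plain Java. This is a stand-in, not a Flink source: BlockingSourceSketch, deliver and the emitted list (which plays the role of sc.collect) are hypothetical, and the 100 ms poll timeout is an arbitrary choice so that cancel is noticed promptly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BlockingSourceSketch {

    // Filled by the message-queue client's delivery thread (hypothetical).
    private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>();
    // Written by cancel() from another thread, so it must be volatile.
    private volatile boolean running = true;
    // Stand-in for sc.collect(msg); only the run() thread writes to it.
    private final List<String> emitted = new ArrayList<>();

    void deliver(String msg) {
        incoming.add(msg);
    }

    /** Mirrors the shape of run: loop until cancelled, blocking while there is nothing to emit. */
    void run() {
        while (running) {
            try {
                // Blocks for up to 100 ms instead of spinning; returns null on timeout.
                String msg = incoming.poll(100, TimeUnit.MILLISECONDS);
                if (msg != null) {
                    emitted.add(msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Mirrors cancel: only flips the flag; the loop exits after its current poll. */
    void cancel() {
        running = false;
    }

    static List<String> demo() {
        BlockingSourceSketch source = new BlockingSourceSketch();
        Thread worker = new Thread(source::run);
        worker.start();
        source.deliver("a");
        source.deliver("b");
        try {
            while (!source.incoming.isEmpty()) {
                Thread.sleep(10); // wait until the loop has taken both messages
            }
            source.cancel();
            worker.join(); // after join, everything the worker emitted is visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return source.emitted;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A timed poll is used rather than an unbounded take so that the loop re-checks the running flag even when no message ever arrives.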

Benefits of Future.of() versus Timer.run() in dart?

Dart has many ways of creating, processing and returning async functions. One of the most common methods is to:
import 'dart:async';
var completer = new Completer();
// Previously this would have been new Timer(0, () => ....);
Timer.run(() => completer.complete(doSomethingHere()));
return completer.future;
However, Dart also provides a constructor for Futures directly, such as:
import 'dart:async';
return new Future.of(() => doSomethingHere());
I am aware that the Timer.run() version may be cancelled using the return value of the static method, and that the new Future.of() version is slightly less code. Is there a particular benefit to using new Future.of() over Timer.run() (or vice versa)? Or are the benefits simply those that I just mentioned?

Future.of returns a Future, Timer.run does not return anything. To me, this is the main difference.
I use Future.of if I want to return a value as a Future.
I use Timer.run if I want to run some function later, and I don't care about the value that it produces.
One big difference is when the function is run, though.
For Future.of, the function is run in the current event loop, and only its value is made available in the next event loop.
For Timer.run, the function is run in the next event loop.

Don't forget that they are two different things. Timer can be used anywhere for so many purposes. It could be used on the client-side for waiting for layout to happen before running more code, for example. So, it may not have anything to do with futures. I use the timer sometimes in client-side code to wait for a layout.
Timer is a generic class for delaying the running of some code, whether or not it has anything to do with futures. Another example could be client-side animations, which have nothing to do with futures or with dealing with data asynchronously.
Futures, however, are monads that help you write asynchronous programs. They are basically a replacement for passing plain callback functions.
Use new Future.of() when writing asynchronous programs and it suits your situation (the same goes for new Future.immediate()).
Use the Timer class if you want to delay the running of some code.
If what you want to do is to use futures for asynchronous programming and at the same time delay the code, forget the timer class unless you need some real delay (more than 0/next event).

Asynchronous WCF Services in WPF - events

I am using WCF services asynchronously in a WPF application. I have a class that wraps all the web service calls. The view models call the methods in this class, which in turn call the web service.
So the view Model code looks like this:
WebServiceAgent.GetProductByID(SelectedProductID, (s, e)=>{States = e.Result;});
And the WebService agent looks like:
public static void GetProductByID(int ProductID, EventHandler<GetProductListCompletedEventArgs> callback)
{
    Client.GetProductByIDCompleted += callback;
    Client.GetProductByIDAsync(ProductID);
}
Is this a good approach? I am using the MVVM Light toolkit, so the view model is static and stays alive for the lifetime of the application. But each time the view model calls this WebServiceAgent, I think I am registering an event handler, and that handler is never unregistered.
Is this a problem? Let's say the view model is called 20 to 30 times. Am I introducing some kind of memory leak?

Some helpful information, based on mistakes I have made myself:
The Client object seems to be reused all the time. If event handlers are not unregistered, they stack up, all of them fire when later invocations of the same operation finish, and you get unpredictable results.
The States = e.Result statement is executed on the event handler's thread, which is not the UI dispatcher thread. When updating lists or complex properties this will cause problems.
In general, not unregistering event handlers once they have been invoked is a bad idea, as it will indeed cause hard-to-find memory leaks.
You should probably refactor to create or re-use a clean client, wrap the viewmodel callback inside another callback that will take care of unregistering itself, cleaning up the client, and invoking the viewmodel's callback on the main dispatcher thread.
If you think all this is tedious, check out http://blogs.msdn.com/b/csharpfaq/archive/2010/10/28/async.aspx and http://msdn.microsoft.com/en-us/vstudio/async.aspx. In the next version of C# an async keyword will be introduced to make this all easier. A CTP is available already.

Event handlers are death traps and you will leak them if you do not "unsubscribe" with "-=".
One way to avoid this is to use Rx (Reactive Extensions), which will manage your event subscriptions. Take a look at http://msdn.microsoft.com/en-us/data/gg577609 and specifically at creating an Observable by using Observable.FromEvent or FromAsync: http://rxwiki.wikidot.com/101samples.

This is unfortunately not a good approach.
I learned this the hard way in Silverlight.
Your WebServiceAgent is probably a long-lived object, whereas the model or view is probably short-lived.
Event subscriptions hold references; in this case they give the web service agent and the WCF client a reference to the model. When a long-lived object holds a reference to a short-lived object, the short-lived object cannot be collected, and you have a memory leak.
As Pieter-Bias said, the async functionality will make this easier.
Have you looked at RIA Services? This is the exact problem that RIA Services was designed to solve.

Yes, the event handlers are basically going to cause a leak unless removed. To get the near-single line equivalent of what you're expressing in your code, and to remove handlers you're going to need an instance of some sort of class that represents the full lifecycle of the call and does some housekeeping.
What I've done is create a Caller<TResult> class that uses an underlying WCF client proxy following this basic pattern:
create a Caller instance around an existing or new client proxy (the proxy's lifecycle is outside the scope of the call to be made, so you can use a new short-lived one or an existing long-lived one).
use one of Caller's various CallAsync<TArg [,...]> overloads to specify the async method to call and the intended callback to call upon completion. This method will choose the async method that also takes a state parameter. The state parameter will be the Caller instance itself.
I say intended because the real handler that will be wired up will do a bit more housekeeping. The real callback is what will be called at the end of the async call, and will
check that ReferenceEquals(e.UserState, this) in your real handler
if not true, immediately return (the event was not intended to be the result of this particular call and should be ignored; this is very important if your proxy is long lived)
otherwise, immediately remove the real handler
call your intended, actual callback with e.Result
Modify Caller's real handler as needed to execute the intended callback on the right thread (more important for WPF than Silverlight)
The above implementation should also have separate handlers for cases where e.Error is non-null or e.Cancelled is true. This gives you the advantage of not checking these cases in your intended callback. Perhaps your overloads take in optional handlers for those cases.
At any rate, you end up cleaning up handlers aggressively at the expense of some per-call wiring. It's a bit expensive per-call, but with proper optimization ends up being far less expensive than the over-the-wire WCF call anyway.
Here's an example of a call using the class (you'll note I use method groups in many cases to increase readability, though HandleAnalyzedStuff could have been result => use result). The first method group is important, because CallAsync gets the owner of that delegate (i.e. the service instance), which is needed to call the method; alternatively, the service could be passed in as a separate parameter.
Caller<AnalysisResult>.CallAsync(
    // line below could also be longLivedAnalyzer.AnalyzeSomeThingsAsync
    new AnalyzerServiceClient().AnalyzeSomeThingsAsync,
    listOfStuff,
    HandleAnalyzedStuff,
    // optional handlers for error or cancelled would go here
    onFailure:TellUserWhatWentWrong);
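The housekeeping steps listed in this answer (check that the event belongs to this call, remove the real handler, then invoke the intended callback) can be sketched as follows. This is a language-neutral illustration written in Java rather than C#, under assumptions: Client, OneShot and call are hypothetical stand-ins for the WCF proxy and the Caller class, the "async" completion is simulated synchronously, and dispatching to the UI thread plus the error and cancelled paths are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

class OneShotHandlerSketch {

    /** Hypothetical long-lived client that raises a "completed" event to every subscriber. */
    static final class Client {
        final List<BiConsumer<Object, String>> completed = new CopyOnWriteArrayList<>();

        void getProductAsync(int id, Object userState) {
            // Completion is simulated synchronously to keep the sketch deterministic.
            for (BiConsumer<Object, String> handler : completed) {
                handler.accept(userState, "product-" + id);
            }
        }
    }

    /** The "real" handler: filters by user state, unsubscribes itself, then runs the intended callback. */
    static final class OneShot implements BiConsumer<Object, String> {
        private final Client client;
        private final Consumer<String> intended;

        OneShot(Client client, Consumer<String> intended) {
            this.client = client;
            this.intended = intended;
        }

        @Override
        public void accept(Object userState, String result) {
            if (userState != this) {
                return; // the event belongs to a different call on the same client
            }
            client.completed.remove(this); // unsubscribe before doing anything else
            intended.accept(result);
        }
    }

    static void call(Client client, int id, Consumer<String> callback) {
        OneShot handler = new OneShot(client, callback);
        client.completed.add(handler);
        client.getProductAsync(id, handler); // the handler itself is passed as the user state
    }

    static List<String> demo() {
        Client client = new Client();
        List<String> log = new ArrayList<>();
        for (int id = 1; id <= 3; id++) {
            call(client, id, log::add);
        }
        log.add("handlers=" + client.completed.size());
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Each call leaves the client's handler list empty again, so repeated calls neither accumulate subscriptions nor trigger callbacks that belong to earlier calls.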

Synchronizing Asynchronous request handlers in Silverlight environment

For our senior design project my group is making a Silverlight application that utilizes graph theory concepts and stores the data in a database on the back end. We have a situation where we add a link between two nodes in the graph and upon doing so we run analysis to re-categorize our clusters of nodes. The problem is that this re-categorization is quite complex and involves multiple queries and updates to the database so if multiple instances of it run at once it quickly garbles data and breaks (by trying to re-insert already used primary keys). Essentially it's not thread safe, and we're trying to make it safe, and that's where we're failing and need help :).
The create link function looks like this:
private Semaphore dblock = new Semaphore(1, 1);

// This function is on our service reference and gets called
// by the client code.
public int addNeed(int nodeOne, int nodeTwo)
{
    dblock.WaitOne();
    submitNewNeed(createNewNeed(nodeOne, nodeTwo));
    verifyClusters(nodeOne, nodeTwo);
    dblock.Release();
    return 0;
}

private void verifyClusters(int nodeOne, int nodeTwo)
{
    // Run analysis of nodeOne and nodeTwo in graph
}
All copies of addNeed should wait for the first one that comes in to finish before another can execute. But instead they all seem to be running and conflicting with each other in the verifyClusters method. One solution would be to force our front end calls to be made synchronously. And in fact, when we do that everything works fine, so the code logic isn't broken. But when it's launched our application will be deployed within a business setting and used by internal IT staff (or at least that's the plan) so we'll have the same problem. We can't force all clients to submit data at different times, so we really need to get it synchronized on the back end. Thanks for any help you can give, I'd be glad to supply any additional information that you could need!

I wrote a series to specifically address this situation - let me know if this works for you (sequential asynchronous workflows):
Part 2 (has a link back to Part 1):
http://csharperimage.jeremylikness.com/2010/03/sequential-asynchronous-workflows-part.html
Jeremy

Wrap your database updates in a transaction. Escalate to a table lock if necessary.
