StreamTask.getCheckpointLock deprecation and custom Flink sources - apache-flink

When writing custom checkpointed sources for Flink, one must make sure that emitting elements downstream, checkpointing, and emitting watermarks happen in a synchronized fashion. This is done by acquiring the lock returned by SourceContext.getCheckpointLock.
Flink 1.10 deprecated StreamTask.getCheckpointLock and now recommends using MailboxExecutor for operations that require such synchronization.
I have a custom source implementation which is split into multiple phases: a SourceFunction[T] for reading file locations and a OneInputStreamOperator for downloading these elements and emitting them downstream. Up until now, I used StreamSourceContexts.getSourceContext to obtain the SourceContext used to emit elements, which looked as follows:
ctx = StreamSourceContexts.getSourceContext(
  getOperatorConfig.getTimeCharacteristic,
  getProcessingTimeService,
  getContainingTask.getCheckpointLock,
  getContainingTask.getStreamStatusMaintainer,
  output,
  getRuntimeContext.getExecutionConfig.getAutoWatermarkInterval,
  -1
)
This context is then used throughout the code to emit elements and watermarks:
ctx.getCheckpointLock.synchronized(ctx.collect(item))
ctx.getCheckpointLock.synchronized(ctx.emitWatermark(watermark))
Is using the checkpoint lock still the preferred way to emit elements downstream? Or is it now recommended that we use MailboxExecutor instead and emit elements and watermarks from the mailbox execution thread?

The checkpoint lock in the source context is not deprecated, as there is currently no way to implement a source without the lock. These sources are already dubbed legacy sources for exactly that reason: they spawn their own thread and need the lock to emit data (push-based).
There is currently a larger rework of sources (FLIP-27), which will offer a pull-based interface. This interface is called from the main task thread, so no synchronization is necessary anymore. If some async work needs to be done, then MailboxExecutor is the way to go.
FYI, new operators should (or rather must) use only MailboxExecutor instead of the checkpoint lock.
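To make the difference between the two models concrete, here is a minimal sketch of the mailbox idea in plain Python. It is not Flink code: the `Mailbox` class, `execute`, `close`, and `run_demo` are names invented for illustration. Background threads never emit directly and never take a lock; they only enqueue actions, and a single task thread runs those actions one after another.

```python
import queue
import threading


class Mailbox:
    """Toy stand-in for a mailbox: any thread may enqueue actions, but they
    run one at a time on the single thread that calls run()."""

    _STOP = object()

    def __init__(self):
        self._actions = queue.Queue()

    def execute(self, action):
        # Safe to call from any thread; the action runs later on the task thread.
        self._actions.put(action)

    def close(self):
        self._actions.put(self._STOP)

    def run(self):
        # The "task thread" loop: drain actions strictly sequentially.
        while True:
            action = self._actions.get()
            if action is self._STOP:
                return
            action()


def run_demo():
    mailbox = Mailbox()
    emitted = []               # only touched by the thread that calls run()
    emitting_threads = set()

    def emit(item):
        emitting_threads.add(threading.get_ident())
        emitted.append(item)

    def async_worker(worker_id):
        # Background work hands results to the mailbox instead of taking a lock.
        for i in range(3):
            mailbox.execute(lambda item=(worker_id, i): emit(item))

    workers = [threading.Thread(target=async_worker, args=(w,)) for w in range(2)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    mailbox.close()
    mailbox.run()              # the calling thread plays the role of the task thread
    return emitted, emitting_threads
```

Because every `emit` runs on the same thread, emitting records and anything else scheduled through the mailbox cannot interleave, which is the property the checkpoint lock provides in the push-based model.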

Related

Multithreading inside Flink's Map/Process function

I have a use case where I need to apply multiple functions to every incoming message, each producing 0 or more results.
Having a loop won't scale for me, and ideally I would like to be able to emit results as soon as they are ready instead of waiting for all the functions to be applied.
I thought about using AsyncIO for this, maintaining a ThreadPool, but if I am not mistaken I can only emit one record using this API. That is not a deal-breaker, but I'd like to know if there are other options, like using a ThreadPool in a Map/Process function so that I can send the results as they are ready.
Would this be an anti-pattern, or cause any problems with regard to checkpointing or at-least-once guarantees?
Depending on the number of different functions involved, one solution would be to fan each incoming message out to n operators, each applying one of the functions.
I fear you'll get into trouble if you try this with a multi-threaded map/process function.
How about this instead:
You could have something like a RichCoFlatMap (or KeyedCoProcessFunction, or BroadcastProcessFunction) that is aware of all of the currently active functions, and for each incoming event, emits n copies of it, each being enriched with info about a specific function to be performed. Following that can be an async i/o operator that has a ThreadPool, and it takes care of executing the functions and emitting results if and when they become available.
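As a rough illustration of that shape (fan out first, then run the functions on a pool and emit whatever finishes first), here is a small plain-Python sketch. It is not Flink code; `FUNCTIONS`, `fan_out`, `apply_all`, and `process` are hypothetical names, and the three functions are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical "currently active functions"; each returns 0 or more results.
FUNCTIONS = {
    "double": lambda x: [x * 2],
    "only_even": lambda x: [x] if x % 2 == 0 else [],
    "digits": lambda x: [int(d) for d in str(x)],
}


def fan_out(event, functions):
    # One (event, function name) pair per active function.
    return [(event, name) for name in functions]


def apply_all(event, functions, pool):
    """Yield (function name, result) pairs as soon as each function finishes."""
    futures = {
        pool.submit(functions[name], evt): name
        for evt, name in fan_out(event, functions)
    }
    for future in as_completed(futures):
        for result in future.result():
            yield futures[future], result


def process(event):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Sorted only to make the output deterministic for the example.
        return sorted(apply_all(event, FUNCTIONS, pool))
```

In a real job the fan-out and the pool would live in separate operators, so the framework rather than user threads is responsible for emitting records.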

Reading two streams (main and configs) in sequential in Flink

I have two streams. One is the main stream (in a fraud detection example, the transactions stream) and the second is a configs stream (in the same example, the rules). I connect the main stream to the config stream in order to do the processing. But when the job is first started, Flink consumes from the transactions and configs streams in parallel, and when it wants to process a transaction it sometimes sees that there is no config, so we have to send the transaction to a dead letter queue. What I want to achieve instead is this: if there is a potential config that I could get a bit later, I want to get that config first and then process the transaction, rather than sending it to the dead letter queue. I have the same key for transactions and configs.
Long story short, is there a way to tell Flink, when the job first starts, to consume one stream until there are no new values and only then start processing the main stream? How can I make them more or less sequential?
The recommended way to approach this is to connect the 2 streams and apply a RichCoFlatMap that will allow you to buffer events from main while you're waiting to receive the config events.
Check out this useful section of the Flink tutorials. The very last paragraph actually describes your problem.
It is important to recognize that you have no control over the order in which the flatMap1 and flatMap2 callbacks are called. These two input streams are racing against each other, and the Flink runtime will do what it wants to regarding consuming events from one stream or the other. In cases where timing and/or ordering matter, you may find it necessary to buffer events in managed Flink state until your application is ready to process them. (Note: if you are truly desperate, it is possible to exert some limited control over the order in which a two-input operator consumes its inputs by using a custom Operator that implements the InputSelectable interface.)
So in a nutshell you should connect your 2 streams and have some kind of ListState where you can "buffer" your main elements while waiting to receive the rules. When you receive an element from the config stream, you check whether you had some pending elements "waiting" for that config in your ListState (your buffer). If you do, you can then process these elements and emit them through the collector of your flatmap.
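The buffering logic itself is small. Here is a plain-Python sketch of it, assuming a dict stands in for the per-key config state and a list per key stands in for ListState; the class name, method names, and the "limit" rule are all made up for illustration.

```python
from collections import defaultdict


class BufferUntilConfig:
    """Toy two-input operator: holds transactions per key until that key's
    config has arrived, then processes them in arrival order."""

    def __init__(self):
        self.configs = {}                  # stands in for per-key config state
        self.pending = defaultdict(list)   # stands in for per-key ListState

    def on_config(self, key, config):
        self.configs[key] = config
        # Flush everything that was waiting for this key's config.
        waiting = self.pending.pop(key, [])
        return [self._apply(config, txn) for txn in waiting]

    def on_transaction(self, key, txn):
        config = self.configs.get(key)
        if config is None:
            self.pending[key].append(txn)  # buffer instead of dead-lettering
            return []
        return [self._apply(config, txn)]

    @staticmethod
    def _apply(config, txn):
        # Hypothetical rule: flag amounts above the configured limit.
        return (txn, txn > config["limit"])
```

In a real job you would probably also want a timer that eventually gives up on buffered transactions whose config never shows up, so the buffer cannot grow without bound.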
Starting with version 1.16, you can use the hybrid source support in Flink to read all of one source (configs, in your case) before reading the second source. Though I imagine you'd have to map the events to an Either<config, transaction> so that the data stream has consistent record types.

Process elements after sinking to Destination

I am setting up a Flink pipeline that reads from Kafka and sinks to HDFS. I want to process the elements after the addSink() step, because I want to set up trigger files indicating that writing data (to the sink) for a certain partition/hour is complete. How can this be achieved? Currently I am using the Bucketing sink.
DataStream messageStream = env
    .addSource(flinkKafkaConsumer011);
// some aggregations to convert message stream to keyedStream
keyedStream.addSink(sink);
// How to process elements after addSink()?
The Flink APIs do not support extending the job graph beyond the sink(s). (You can, however, fork the stream and do additional processing in parallel with writing to the sink.)
With the Streaming File Sink you can observe the part files transition to the finished state when they complete. See the JavaDoc for more information.
State lives within a single operator -- only that operator (e.g., a ProcessFunction) can modify it. If you want to modify the keyed value state after the sink has completed, there's no straightforward way to do that. One idea would be to add a processing time timer in the ProcessFunction that has the keyed state that wakes up periodically and checks for newly finished part files, and based on their existence, modifies the state. Or if that's the wrong granularity, write a custom source that does something similar and streams or broadcasts information into the ProcessFunction (which will then have to be a CoProcessFunction or a KeyedBroadcastProcessFunction) that it can use to do the necessary state updates.

Should I use async await on UI Thread or Task.Run to process multiple files?

I am writing an application in WPF using .NET 4.5. The application allows the user to select multiple files and import them. When this happens the application will parse text files and store the data into a database. The files can be very large so I want the UI to remain responsive and allow the user to do other things while this is happening.
I'm new to asynchronous programming and would like to know what the best approach would be. According to a Microsoft article...
"The async and await keywords don't cause additional threads to be created. Async methods don't require multithreading because an async method doesn't run on its own thread. The method runs on the current synchronization context and uses time on the thread only when the method is active. You can use Task.Run to move CPU-bound work to a background thread, but a background thread doesn't help with a process that's just waiting for results to become available."
Since I am processing multiple files, would I be better off using "Task.Run" for each file, since they will run on separate threads? If not, what is the advantage of running everything on the same thread using async/await?
Note that no other UI operation (except progress bars - progress reporting) really cares about when these files are done processing so according to this article it seems like using Task.Run would benefit me. What are your thoughts?
What are you doing with the files? If you're doing anything CPU intensive, it would be better to move that processing off the UI thread. If it's really just going to be the IO, then you can do it all in the UI thread - and still make it parallel.
For example:
private async Task ProcessAllFiles(IEnumerable<string> files)
{
    List<Task> tasks = files.Select(x => ProcessFile(x))
                            .ToList();
    await Task.WhenAll(tasks);
    textBox.Text = "Finished!";
}

private async Task ProcessFile(string file)
{
    // Do stuff here with async IO, and update the UI if you want to...
}
Here you're still doing all the actual processing on the UI thread - no separate threads are running, beyond potentially IO completion ports - but you're still starting multiple asynchronous IO operations at the same time.
Now one other thing to consider is that this might actually slow everything down anyway. If you're processing multiple files on the same physical disk, you might find that accessing them sequentially would be faster just due to the nature of disk IO. It will depend on the disk type though (SSDs would act differently to "regular" disks, for example).
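The "several operations in flight, one thread" point is not specific to C#. As an analogy only, here is the same pattern sketched with Python's asyncio, where `asyncio.sleep` stands in for an awaitable file read and the function names are made up. Every operation is started before any of them is awaited, and recording the thread id shows that nothing leaves the calling thread.

```python
import asyncio
import threading


async def process_file(name, seen_threads):
    seen_threads.add(threading.get_ident())
    await asyncio.sleep(0.01)   # stands in for async IO; yields to the event loop
    seen_threads.add(threading.get_ident())
    return f"{name}: done"


async def process_all_files(names):
    seen_threads = set()
    # Start every operation first, then wait for all of them together.
    results = await asyncio.gather(*(process_file(n, seen_threads) for n in names))
    return results, seen_threads


def demo():
    return asyncio.run(process_all_files(["a.txt", "b.txt", "c.txt"]))
```

If `process_file` did CPU-heavy parsing between awaits, it would block the loop (or the UI thread in the WPF case), which is when moving the work to a background thread starts to pay off.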

Many-to-one gatekeeper task synchronization

I'm working on a design that uses a gatekeeper task to access a shared resource. The basic design I have right now is a single queue that the gatekeeper task is receiving from and multiple tasks putting requests into it.
This is a memory limited system, and I'm using FreeRTOS (Cortex M3 port).
The problem is as follows: To handle these requests asynchronously is fairly simple. The requesting task queues its request and goes about its business, polling, processing, or waiting for other events. To handle these requests synchronously, I need a mechanism for the requesting task to block on such that once the request has been handled, the gatekeeper can wake up the task that called that request.
The easiest design I can think of would be to include a semaphore in each request, but given the memory limitations and the rather large size of a semaphore in FreeRTOS, this isn't practical.
What I've come up with is using the task suspend and task resume feature to manually block the task, passing a handle to the gatekeeper with which it can resume the task when the request is completed. There are some issues with suspend/resume, though, and I'd really like to avoid them. A single resume call will wake up a task no matter how many times it has been suspended by other calls and this can create an undesired behavior.
Some simple pseudo-C to demonstrate the suspend/resume method.
void gatekeeper_blocking_request(void)
{
    put_request_in_queue(request);
    task_suspend(this_task);
}

void gatekeeper_request_complete_callback(request)
{
    task_resume(request->task);
}
A workaround that I plan to use in the meantime is to use the asynchronous calls and implement the blocking entirely in each requesting task. The gatekeeper will execute a supplied callback when the operation completes, and that can then post to the task's main queue or a specific semaphore, or whatever is needed. Having the blocking calls for requests is essentially a convenience feature so each requesting task doesn't need to implement this.
Pseudo-C to demonstrate the task-specific blocking, but this needs to be implemented in each task.
void requesting_task(void)
{
    while(1)
    {
        gatekeeper_async_request(callback);
        pend_on_semaphore(sem);
    }
}

void callback(request)
{
    post_to_semaphore(sem);
}
Maybe the best solution is just to not implement blocking in the gatekeeper and API, and force each task to handle it. That will increase the complexity of each task's flow, though, and I was hoping I could avoid it. For the most part, all calls will want to block until the operation is finished.
Is there some construct that I'm missing, or even just a better term for this type of problem that I can google? I haven't come across anything like this in my searches.
Additional remarks - Two reasons for the gatekeeper task:
Large stack space required. Rather than adding this requirement to each task, the gatekeeper can have a single stack with all the memory required.
The resource is not always accessible in the CPU. It is synchronizing not only tasks in the CPU, but tasks outside the CPU as well.
Use a mutex and make the gatekeeper a subroutine instead of a task.
It's been six years since I posted this question, and I struggled with getting the synchronization working how I needed it to. There were some terrible abuses of OS constructs used. I've considered updating this code, even though it works, to be less abusive, and so I've looked at more elegant ways to handle this. FreeRTOS has also added a number of features in the last six years, one of which I believe provides a lightweight method to accomplish the same thing.
Direct-to-Task Notifications
Revisiting my original proposed method:
void gatekeeper_blocking_request(void)
{
    put_request_in_queue(request);
    task_suspend(this_task);
}

void gatekeeper_request_complete_callback(request)
{
    task_resume(request->task);
}
This method was avoided because the FreeRTOS task suspend/resume calls do not keep count, so several suspend calls are negated by a single resume call. At the time, the suspend/resume feature was being used by the application, and so this was a real possibility.
Beginning with FreeRTOS 8.2.0, direct-to-task notifications essentially provide a lightweight binary semaphore built into the task. When a notification is sent to a task, the notification value may be set. The notification stays pending until the notified task calls some variant of xTaskNotifyWait(); if the task is already blocked in such a call, it is woken.
The above code can be slightly reworked as follows:
void gatekeeper_blocking_request(void)
{
    put_request_in_queue(request);
    xTaskNotifyWait( ... );
}

void gatekeeper_request_complete_callback(request)
{
    xTaskNotify( ... );
}
This is still not an ideal method: if task notifications are used elsewhere, you may run into the same problem as with suspend/resume, where the task is woken by a different source than the one it is expecting. Given that, for me, it was a new feature, it may work out in the revised code.
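For anyone who wants to play with the pattern off-target, here is a rough desktop simulation in Python. It is not FreeRTOS code: `Task`, `notify`, `wait_for_notification`, `gatekeeper`, and `blocking_request` are invented names, and a `threading.Event` per task stands in for the built-in notification. The property it shows is the one that matters here: a notification sent before the task starts waiting stays pending instead of being lost.

```python
import queue
import threading


class Task:
    """Each task carries its own pending-notification flag."""

    def __init__(self):
        self._notified = threading.Event()

    def notify(self):
        self._notified.set()

    def wait_for_notification(self, timeout=None):
        ok = self._notified.wait(timeout)
        self._notified.clear()   # consume it, binary-semaphore style
        return ok


def gatekeeper(requests, handled):
    while True:
        item = requests.get()
        if item is None:         # shutdown marker for the demo
            return
        task, payload = item
        handled.append(payload)  # only the gatekeeper touches the shared resource
        task.notify()


def blocking_request(requests, task, payload):
    requests.put((task, payload))
    return task.wait_for_notification(timeout=5)


def run_demo(n_tasks=3):
    requests, handled, outcomes = queue.Queue(), [], {}
    keeper = threading.Thread(target=gatekeeper, args=(requests, handled))
    keeper.start()

    def requester(i):
        outcomes[i] = blocking_request(requests, Task(), f"req-{i}")

    threads = [threading.Thread(target=requester, args=(i,)) for i in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    requests.put(None)
    keeper.join()
    return handled, outcomes
```

The sketch assumes at most one outstanding request per task. With a single flag per task it has the same weakness described above: any other code that notifies the task would wake it just as well.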
