I wonder if there is a way in Flink to broadcast an event (or something like that) to all the task managers when a specific event is read from the source.
To be more specific, I am aggregating state data with a map state, and when certain events are read from the source I want all task managers to perform a specific action.
Is it possible?
Yes, this is possible. The broadcast state pattern is meant for exactly this sort of use case.
As David noted, using a broadcast stream is the right way to send data to all (parallel) sub-tasks. As for only broadcasting some data, take a look at side outputs as a way to do special processing for a sub-set of your data. So you could have a ProcessFunction that passes through all data un-modified, and if an incoming event is one that wants to be broadcast, then you also emit it as a side output.
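To make the routing concrete, here is a plain-Java simulation of it. This is only a sketch: it does not use Flink classes, and the Event, Subtask and route names are made up for illustration. In a real job the same roles would be played by a ProcessFunction that writes to an OutputTag, getSideOutput() on its result, and broadcast() on that side output.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of the routing idea; none of these classes are Flink classes.
class SideOutputBroadcastSketch {

    // Hypothetical event type: 'control' marks events that every subtask must see.
    static final class Event {
        final String key;
        final boolean control;

        Event(String key, boolean control) {
            this.key = key;
            this.control = control;
        }
    }

    // Stand-in for one parallel subtask.
    static final class Subtask {
        final List<Event> data = new ArrayList<>();        // events routed here by key
        final List<Event> broadcasts = new ArrayList<>();  // copies of control events
    }

    // Passes every event through unmodified to the subtask that owns its key and,
    // for control events only, also delivers a copy to every subtask.
    static void route(List<Event> input, List<Subtask> subtasks) {
        for (Event e : input) {
            int owner = Math.floorMod(e.key.hashCode(), subtasks.size());
            subtasks.get(owner).data.add(e);
            if (e.control) {
                for (Subtask s : subtasks) {
                    s.broadcasts.add(e);
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Subtask> subtasks = List.of(new Subtask(), new Subtask(), new Subtask());
        route(List.of(new Event("a", false), new Event("reset", true), new Event("b", false)), subtasks);
        for (int i = 0; i < subtasks.size(); i++) {
            Subtask s = subtasks.get(i);
            System.out.println("subtask " + i + ": data=" + s.data.size() + ", broadcasts=" + s.broadcasts.size());
        }
    }
}
```

Every event lands in exactly one subtask's data list, while the control event also shows up in every subtask's broadcasts list, which is where the "specific action" would be performed.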
I am trying to process watermarks in a BroadcastConnectedStream. However, I am not able to find any direct way to handle watermark events, similar to what we have in processWatermark1 in a KeyedCoProcessOperator. Following are further details.
In the context of the example given in “A Practical Guide to Broadcast State in Apache Flink”, I have a user actions stream and a pattern stream. The pattern stream is broadcast and connected with the user actions stream. The result is a BroadcastConnectedStream. I want to handle user action events and pattern events in this stream. In addition, I want to use a processWatermark function to perform an action in response to watermark events.
The problem is that a BroadcastConnectedStream has only a process() function, which takes a (Keyed)BroadcastProcessFunction, and no transform(). A BroadcastProcessFunction only allows processing elements; it doesn’t provide an interface for processing watermarks. In contrast, ConnectedStreams (without broadcast) provides a transform() function, which takes an operator that can process watermarks.
I have tried looking into the API and other resources to see the usage of CoBroadcastWithKeyedOperator, but I am unable to find useful resources.
Is there a way to process watermarks in a BroadcastConnectedStream?
We currently have a Flink-based streaming job (the job is a complex DAG of FlatMapFunctions) and an HTTP interface for fetching configuration.
Now I would like to read the configuration from the HTTP interface every 5 minutes through a source function with a parallelism of 1, and then distribute it to all task managers or FlatMapFunctions of the job. The FlatMapFunctions will only read the configuration and never change it.
I have read the documentation on The Broadcast State Pattern, but the method described there seems to apply only to the first function that receives the broadcast; subsequent downstream FlatMapFunctions cannot read the broadcast state. As shown in the figure below, only Co-Process-Broadcast can obtain the broadcast, but map func 1 and map func 2 cannot.
Broadcast state graph
Similar to QUESTION but different, I have many downstream FlatMapFunctions and expect them all to get the broadcast configuration.
You can send the broadcast stream to multiple functions, so if your config state isn't big then that's likely what I'd do.
If the config state is very small (relative to the size of records being processed) then you could attach it to every incoming record in your BroadcastProcessFunction, so downstream operators have it in hand when processing each of their records.
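A minimal plain-Java sketch of the "attach the config to every record" idea follows. It is not Flink code; the class and the onConfig/onRecord methods are hypothetical stand-ins for the broadcast-side and data-side callbacks of a BroadcastProcessFunction, and records and config entries are modeled as plain strings.

```java
import java.util.Map;

// Plain-Java stand-in for a function that sees both the config stream and the data stream.
class ConfigEnrichmentSketch {

    // Hypothetical output type: the record plus the config in effect when it was processed.
    static final class Enriched {
        final String record;
        final Map<String, String> config;

        Enriched(String record, Map<String, String> config) {
            this.record = record;
            this.config = config;
        }
    }

    // Stand-in for the broadcast state; starts empty until the first config arrives.
    private Map<String, String> currentConfig = Map.of();

    // Called for each element of the (broadcast) config stream.
    void onConfig(Map<String, String> config) {
        currentConfig = Map.copyOf(config);  // immutable snapshot
    }

    // Called for each element of the data stream; downstream operators receive the config with it.
    Enriched onRecord(String record) {
        return new Enriched(record, currentConfig);
    }

    public static void main(String[] args) {
        ConfigEnrichmentSketch fn = new ConfigEnrichmentSketch();
        fn.onConfig(Map.of("mode", "fast"));
        Enriched out = fn.onRecord("record-1");
        System.out.println(out.record + " carries config " + out.config);
    }
}
```

The trade-off is the one noted above: every record grows by the size of the config, so this only makes sense when the config is small relative to the records.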
I'm using the CQRS pattern with multiple databases (one for query and another for search). Should I put the inserts inside the repository, like this:
CommunityRepository {
    Add(Community community) {
        Database1.Insert(community);
        Database2.Insert(community);
    }
}
and then:
CommunityCommands {
    Handler(AddCommunityCommand community) {
        communityRepository.Add(community);
    }
}
or should I put this in the command handler, like this:
CommunityCommands {
    Handler(AddCommunityCommand community) {
        db1.Insert(community);
        db2.Insert(community);
    }
}
or maybe something like this, using the main repository plus the second database:
CommunityCommands {
    Handler(AddCommunityCommand community) {
        communityRepository.Add(community);
        db2.Insert(community);
    }
}
I would choose none of those options, as you'd basically be coupling the Command and Query implementations.
Instead, publish events from the Command side, like OrderPlacedEvent, and subscribe to them from the Query side. This not only allows you to separate the implementations of the Command and Query sides, but also lets you implement other side effects of the events without coupling the code of multiple features (e.g. "when an order is placed, send a confirmation email").
You can implement the pub/sub synchronously (in process) or asynchronously (with a messaging system). If you use messaging, note that you'll have to deal with eventual consistency (the read data is slightly behind the write data, but eventually it catches up).
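As a rough illustration of the synchronous, in-process variant, here is a small Java sketch adapted to the community example from the question. All names in it (CommunityAdded, EventBus, AddCommunityHandler) are hypothetical, and the two databases are modeled as in-memory maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class CqrsEventBusSketch {

    // Hypothetical event published by the command side after a successful write.
    static final class CommunityAdded {
        final String id;
        final String name;

        CommunityAdded(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Minimal synchronous publish/subscribe bus: publish() calls each subscriber in turn.
    static final class EventBus {
        private final List<Consumer<CommunityAdded>> subscribers = new ArrayList<>();

        void subscribe(Consumer<CommunityAdded> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(CommunityAdded event) {
            for (Consumer<CommunityAdded> subscriber : subscribers) {
                subscriber.accept(event);
            }
        }
    }

    // Command side: knows only its own store and the bus, not the query/search store.
    static final class AddCommunityHandler {
        private final Map<String, String> writeStore;
        private final EventBus bus;

        AddCommunityHandler(Map<String, String> writeStore, EventBus bus) {
            this.writeStore = writeStore;
            this.bus = bus;
        }

        void handle(String id, String name) {
            writeStore.put(id, name);
            bus.publish(new CommunityAdded(id, name));
        }
    }

    public static void main(String[] args) {
        Map<String, String> writeStore = new HashMap<>();
        Map<String, String> searchIndex = new HashMap<>();
        EventBus bus = new EventBus();
        // Query side: keeps the search store up to date by reacting to events.
        bus.subscribe(e -> searchIndex.put(e.id, e.name));
        new AddCommunityHandler(writeStore, bus).handle("c1", "Flink users");
        System.out.println("write store: " + writeStore + ", search index: " + searchIndex);
    }
}
```

Adding another side effect means adding another subscriber; the command handler does not change.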
Refreshing the Query models should be handled as an offline operation. You should do something like this:
1. Process your Command logic (whatever it is).
2. Right before your Command handler returns, send an Event to a message bus.
3. In a background service, listen for those Events and update the Query side.
Bonus tip: you can use the Outbox pattern to get more reliability. The idea is to store the Event messages in a table on your Write DB in the same transaction as your previous write operation, instead of sending them directly. A background service would check for "pending" messages and dispatch them. The rest is unchanged.
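The Outbox idea can be sketched in a few lines of Java. This is an in-memory illustration only: WriteDb stands in for the write database, its addCommunity method stands in for a single transaction, and every name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class OutboxSketch {

    // One row of the hypothetical outbox table.
    static final class OutboxMessage {
        final String payload;
        boolean dispatched;

        OutboxMessage(String payload) {
            this.payload = payload;
        }
    }

    // In-memory stand-in for the write database.
    static final class WriteDb {
        final Map<String, String> communities = new HashMap<>();
        final List<OutboxMessage> outbox = new ArrayList<>();

        // Stands in for ONE transaction: the data row and the outbox row are written together.
        void addCommunity(String id, String name) {
            communities.put(id, name);
            outbox.add(new OutboxMessage("CommunityAdded:" + id));
        }

        List<OutboxMessage> pending() {
            List<OutboxMessage> result = new ArrayList<>();
            for (OutboxMessage m : outbox) {
                if (!m.dispatched) {
                    result.add(m);
                }
            }
            return result;
        }
    }

    // What the background service would run periodically; returns the number of messages sent.
    static int dispatchPending(WriteDb db, Consumer<String> messageBus) {
        int sent = 0;
        for (OutboxMessage m : db.pending()) {
            messageBus.accept(m.payload);
            m.dispatched = true;  // mark only after the send succeeded
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        WriteDb db = new WriteDb();
        db.addCommunity("c1", "Flink users");
        int sent = dispatchPending(db, payload -> System.out.println("sent " + payload));
        System.out.println("dispatched " + sent + " message(s)");
    }
}
```

Because a message is marked as dispatched only after it has been sent, a crash between the two steps leads to a resend, so consumers on the Query side should tolerate duplicates.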
Description:
Currently I am working on using Flink with an IoT setup. Essentially, devices are sending data such as (device_id, device_type, event_timestamp, etc.), and I don't have any control over when the messages get sent. I then key the stream by device_id and device_type to perform aggregations. I would like to use event time, given that it ensures the timers that are set trigger deterministically after a failure. However, since this isn't always a high-throughput stream, a window could be opened for a 10-minute aggregation period but not receive its next data point until approximately 40 minutes later. Although the aggregation would eventually be completed, it would output my desired result extremely late.
So my workaround for this is to create an additional external source that does nothing other than pump fake messages. By having these fake messages pumped out in alignment with my 10-minute aggregation period, even if a device hasn't sent any data, the event-time windows would have something to force them closed. The critical part here is to make sure that all parallel instances / operators have access to this fake message, because I need to close all the windows with this single fake message. I was thinking that broadcast state might be the most appropriate way to accomplish this goal given: "Broadcast state is replicated across all parallel instances of a function, and might typically be used where you have two streams, a regular data stream alongside a control stream that serves rules, patterns, or other configuration messages." Quote Source
Questions:
Is broadcast state the best method for ensuring all parallel instances (e.g. windows) receive my fake messages?
Once the operators have access to this fake message via the broadcast state can this fake message then be used to advance the event time watermark?
You can make this work with broadcast state, along the lines you propose, but I'm not convinced it's the best solution.
In an ideal world I'd suggest you arrange for the devices to send occasional keepalive messages, but assuming that's not possible, I think a custom Trigger would work well here. You can extend the EventTimeTrigger so that in addition to the event time timer it creates via
ctx.registerEventTimeTimer(window.maxTimestamp());
you also create a processing time timer, as a fallback, and you FIRE the window if the window still exists when that processing time timer fires.
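The decision logic of such a trigger can be sketched in plain Java as follows. This is a simulation, not a Flink Trigger: the class, its methods and the fallbackDelayMs parameter are hypothetical. In a real Trigger implementation the fallback would be registered with ctx.registerProcessingTimeTimer(...) and handled in onProcessingTime(...), returning TriggerResult.FIRE.

```java
// Plain-Java simulation of "event-time timer plus processing-time fallback" for one window.
class FallbackTriggerSketch {

    enum Result { CONTINUE, FIRE }

    static final class WindowTrigger {
        private final long eventTimeTimer;       // fires once the watermark reaches this timestamp
        private final long processingTimeTimer;  // wall-clock fallback deadline
        private boolean fired;                   // models "the window no longer exists"

        WindowTrigger(long windowMaxTimestamp, long createdAtProcessingTime, long fallbackDelayMs) {
            this.eventTimeTimer = windowMaxTimestamp;
            this.processingTimeTimer = createdAtProcessingTime + fallbackDelayMs;
        }

        // Called whenever the watermark advances.
        Result onWatermark(long watermark) {
            return fireIf(watermark >= eventTimeTimer);
        }

        // Called whenever the wall clock advances.
        Result onProcessingTime(long now) {
            return fireIf(now >= processingTimeTimer);
        }

        // Fires at most once, whichever timer is due first.
        private Result fireIf(boolean due) {
            if (fired || !due) {
                return Result.CONTINUE;
            }
            fired = true;
            return Result.FIRE;
        }
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        // Window [0, 10 min); the fallback fires 11 minutes of wall-clock time after creation.
        WindowTrigger trigger = new WindowTrigger(tenMinutes - 1, 0L, 11 * 60 * 1000L);
        System.out.println(trigger.onWatermark(5 * 60 * 1000L));        // watermark is stuck mid-window
        System.out.println(trigger.onProcessingTime(11 * 60 * 1000L));  // fallback deadline reached
    }
}
```

The fallback delay is a tuning choice: it should be longer than the window so that windows with regular traffic still close on event time.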
I'm recommending this approach because it's simpler and more directly addresses the specific need. With the broadcast state approach you'll have to introduce a source for these messages, add a broadcast state descriptor and stream, add special fake watermarks for the non-broadcast stream (set to Watermark.MAX_WATERMARK), connect the broadcast and non-broadcast streams and implement a BroadcastProcessFunction (that probably doesn't really do anything), etc. It's a lot of moving parts spread across several different operators.
I am just trying to understand the use case when to use CoProcessFunction in Flink. Explanation with an example would help me to understand the concept better.
A CoProcessFunction is similar to a RichCoFlatMapFunction, but with the addition of also being able to use timers. The timers are useful for expiring state for stale keys, or for raising alarms when keepalive messages fail to arrive, for example.
A CoProcessFunction allows you to use one stream to influence how another is processed, or to enrich another stream. For example, an e-commerce site might have a stream of order events and a stream of shipment events, and they want to create a stream of events for orders that haven't shipped within 24 hours of the order being placed. The two streams can be keyed by the orderId, and connected together. As an order arrives it's recorded in keyed state, and a timer is created to fire 24 hours later. When a shipment event arrives, the state and timer are cleared. If a timer does fire, the state is used to send the order out to the unfilled order service.
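The order/shipment logic can be simulated in plain Java like this. It is a sketch without Flink: the three methods are hypothetical stand-ins for processElement1, processElement2 and onTimer of a keyed CoProcessFunction, and a single map plays the role of both the keyed state and the registered timers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class UnshippedOrderSketch {

    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // orderId -> time at which its timer fires; stands in for keyed state plus timers.
    private final Map<String, Long> deadlineByOrder = new HashMap<>();

    // Orders reported as not shipped in time (the "unfilled order service" in the text).
    final List<String> unfilled = new ArrayList<>();

    // Order stream: remember the order and set a timer 24 hours later.
    void onOrder(String orderId, long timestamp) {
        deadlineByOrder.put(orderId, timestamp + DAY_MS);
    }

    // Shipment stream: the order shipped, so clear its state and timer.
    void onShipment(String orderId) {
        deadlineByOrder.remove(orderId);
    }

    // Time advances: fire every timer that is due and report the order.
    void advanceTimeTo(long now) {
        Iterator<Map.Entry<String, Long>> it = deadlineByOrder.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= now) {
                unfilled.add(entry.getKey());
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        UnshippedOrderSketch fn = new UnshippedOrderSketch();
        fn.onOrder("order-1", 0L);
        fn.onOrder("order-2", 0L);
        fn.onShipment("order-1");
        fn.advanceTimeTo(DAY_MS);
        System.out.println("unfilled orders: " + fn.unfilled);
    }
}
```

Only the order that never saw a shipment event is reported; the shipped one had its state and timer removed before the deadline.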
For more on this, and examples with code, see connected streams and process function and the labs that accompany those tutorials.