Reading two streams (main and configs) in sequential in Flink - apache-flink

I have two streams, one is main stream let's say in example of fraud detection I have transactions stream and then I have second stream which is configs, in our example it is rules. So I connect main stream to config stream in order to do processing. But when first time flink starts and we are adding job it starts consuming from transactions and configs stream parallel and when wants process transaction it sometimes see that there is no config and we have to send transaction to dead letter queue. However, what I want to achieve is, if there is patential config which I could get a bit later I want to get that config first then get transaction in order to process it rather then sending it to dead letter queue. I have the same key for transactions and configs.
long story short, is there a way telling flink when first time job starts try to consume one stream until there isn't new value then start processing main stream? How I can make them kind of sequential?

The recommended way to approach this is to connect the 2 streams and apply a RichCoFlatMap that will allow you to buffer events from main while you're waiting to receive the config events.
Check out this useful section of the Flink tutorials. The very last paragraph actually describes your problem.
It is important to recognize that you have no control over the order in which the flatMap1 and flatMap2 callbacks are called. These two input streams are racing against each other, and the Flink runtime will do what it wants to regarding consuming events from one stream or the other. In cases where timing and/or ordering matter, you may find it necessary to buffer events in managed Flink state until your application is ready to process them. (Note: if you are truly desperate, it is possible to exert some limited control over the order in which a two-input operator consumes its inputs by using a custom Operator that implements the InputSelectable interface.
So in a nutshell you should connect your 2 streams and have some kind of ListState where you can "buffer" your main elements while waiting to receive the rules. When you receive an element from the config stream, you check whether you had some pending elements "waiting" for that config in your ListState (your buffer). If you do, you can then process these elements and emit them through the collector of your flatmap.

Starting with version 1.16, you can use the hybrid source support in Flink to read all of once source (configs, in your case) before reading the second source. Though I imagine you'd have to map the events to an Either<config, transaction> so that the data stream has consistent record types.

Related

How does order of streams in flink connect matter?

How would the two scenarios compare or differ
I) stream1.connect(stream2).flatMap(new connectFunction())
flatmap1 of connectFunction will process input from stream1 and flatMap2 will process input from stream2.
II) stream2.connect(stream1).flatMap(new connectFunction())
flatmap1 of connectFunction will process input from stream2 and flatMap2 will process input from stream1.
That’s all there is to it. There’s no significant difference.
As David mentioned there's no significant difference but in case you're asking because you're worried about the order in which the inputs arrive, you should always keep in mind that there is no guarantee whatsoever that one stream (for example your "control stream" in case you're doing some kind of "dynamic filter") will actually be processed before the stream you want to filter on. There could be lag due to Kafka consumer lag or many other reasons.
As such it is a common pattern to set up a state for both streams so that you can keep items in a "pending processing" state until you receive what you need from the other stream.
For instance, if you have a control stream called control that contains items from a blocklist/disallowlist that you would like to use to filter an input stream called input, after keying by a common key, it won't matter whether you do input.connect(control) or control.connect(input). However, the important point is to remember that it's not necessary that the control stream would have produced anything by the time you receive your first element from input. For that reason you would set up a state (could be a ListState[InputType] to keep track of all input items that have arrived before you received what you needed from control. In other words, you'd be keeping track of "pending items". Once you do receive the required item from control you will process all "pending" items from the ListState.
This is documented in the Learn Flink > ETL section, in the last paragraph of the Connected streams section:
It is important to recognize that you have no control over the order in which the flatMap1 and flatMap2 callbacks are called. These two input streams are racing against each other, and the Flink runtime will do what it wants to regarding consuming events from one stream or the other. In cases where timing and/or ordering matter, you may find it necessary to buffer events in managed Flink state until your application is ready to process them. (Note: if you are truly desperate, it is possible to exert some limited control over the order in which a two-input operator consumes its inputs by using a custom Operator that implements the InputSelectable interface.

Using Broadcast State To Force Window Closure Using Fake Messages

Description:
Currently I am working on using Flink with an IOT setup. Essentially, devices are sending data such as (device_id, device_type, event_timestamp, etc) and I don't have any control over when the messages get sent. I then key the steam by device_id and device_type to preform aggregations. I would like to use event-time given that is ensures the timers which are set trigger in a deterministic nature given a failure. However, given that this isn't always a high throughput stream a window could be opened for a 10 minute aggregation period, but not have its next point come until approximately 40 minutes later. Although the calculation would aggregation would eventually be completed it would output my desired result extremely late.
So my work around for this is to create an additional external source that does nothing other than pump fake messages. By having these fake messages being pumped out in alignment with my 10 minute aggregation period, even if a device hadn't sent any data, the event time windows would have something to force the windows closed. The critical part here is to make it possible that all parallel instances / operators have access to this fake message because I need to close all the windows with this single fake message. I was thinking that Broadcast state might be the most appropriate way to accomplish this goal given: "Broadcast state is replicated across all parallel instances of a function, and might typically be used where you have two streams, a regular data stream alongside a control stream that serves rules, patterns, or other configuration messages." Quote Source
Questions:
Is broadcast state the best method for ensuring all parallel instances (e.g. windows) receive my fake messages?
Once the operators have access to this fake message via the broadcast state can this fake message then be used to advance the event time watermark?
You can make this work with broadcast state, along the lines you propose, but I'm not convinced it's the best solution.
In an ideal world I'd suggest you arrange for the devices to send occasional keepalive messages, but assuming that's not possible, I think a custom Trigger would work well here. You can extend the EventTimeTrigger so that in addition to the event time timer it creates via
ctx.registerEventTimeTimer(window.maxTimestamp());
you also create a processing time timer, as a fallback, and you FIRE the window if the window still exists when that processing time timer fires.
I'm recommending this approach because it's simpler and more directly addresses the specific need. With the broadcast state approach you'll have to introduce a source for these messages, add a broadcast state descriptor and stream, add special fake watermarks for the non-broadcast stream (set to Watermark.MAX_WATERMARK), connect the broadcast and non-broadcast streams and implement a BroadcastProcessFunction (that probably doesn't really do anything), etc. It's a lot of moving parts spread across several different operators.

Handling "state refresh" in Flink ConnectedStream

We're building an application which has two streams:
A high-volume messages stream
A large static stream (originating from some parquet files we have lying around) which we feed into Flink just to get that Dataset into a saved state
We want to connect the two streams in order to get shared state, so that the 1st stream can use the 2nd state for enrichment.
Every day or so, the parquet files (2nd streams's source) are updated, and that will require us to clear the state of the 2nd stream and rebuild it (will probably take about 2 minutes).
The question is, can we block/delay messages from the 1st stream while this process is running?
Thanks.
There's currently no direct/easy way to block one stream on another stream, unfortunately. The typical solution is to buffer the ingest stream while you load (or re-load) the enrichment stream.
One approach you could try is to wrap your ingest stream in a custom SourceFunction that knows when to not generate data, based on some external trigger (which is the same signal you'd use to know that you have Parquet data to re-load).
Sounds a bit like your case is similar to Flip-23, which explores Model Serving in Apache Flink.
I think it all boils down to how (and if) your static stream is keyed:
if it is keyed in a similar way as your fast data, then you can key both streams, connect them and then have access to the keyed context.
if the static stream events are not keyed in a similar fashion maybe you should consider emitting control events which will trigger a refresh of those static files from an external source (eg s3). That's easier said than done as there is no trivial way to guarantee that all parallel instances of your fast stream will get the control event.
You can use ListState as a buffer, how you can access this though depends on the shape of your data.
It might help, if you shared a bit more info about the shape of your data (eg are you joining on a key? are you simply serving a model? other?).

Process elements after sinking to Destination

I am setting up a flink pipeline that reads from Kafka and sinks to HDFS. I want to process the elements after the addSink() step. This is because I want to setup trigger files indicating that writing data (to the sink) for a certain partition/hour is complete. How can this be achieved? Currently I am using the Bucketing sink.
DataStream messageStream = env
.addSource(flinkKafkaConsumer011);
//some aggregations to convert message stream to keyedStream
keyedStream.addSink(sink);
//How to process elements after 3.?
The Flink APIs do not support extending the job graph beyond the sink(s). (You can, however, fork the stream and do additional processing in parallel with writing to the sink.)
With the Streaming File Sink you can observe the part files transition to the finished state when they complete. See the JavaDoc for more information.
State lives within a single operator -- only that operator (e.g., a ProcessFunction) can modify it. If you want to modify the keyed value state after the sink has completed, there's no straightforward way to do that. One idea would be to add a processing time timer in the ProcessFunction that has the keyed state that wakes up periodically and checks for newly finished part files, and based on their existence, modifies the state. Or if that's the wrong granularity, write a custom source that does something similar and streams or broadcasts information into the ProcessFunction (which will then have to be a CoProcessFunction or a KeyedBroadcastProcessFunction) that it can use to do the necessary state updates.

Flink: Watermarking with Late Elements

I am doing real-time streaming in Flink where the Kafka is the message queue. I am applying EventTimeSlidingWindow of 120 sec. and slide of 1 sec. I am also inserting the watermark at each second of Event Time.
My concern is what happened if the element will come late, after the watermark? Now I my case, Flink simply discard the message which come after its respective watermark. Is there any mechanism provided by the filnk to handle such late message, like maintaining separate window? I have also gone through the documentation but I did not get clear about it.
Apache Flink has a concept called allowed lateness for the windows to handle data that arrives after a watermark.
By default, late elements are dropped when the watermark is past the end of the window. However, Flink allows to specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped, and its default value is 0. Elements that arrive after the watermark has passed the end of the window but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state.
Also another option is SideOoutput i.e. In addition to the main stream that results from DataStream operations, you can also produce any number of additional side output result streams. The type of data in the result streams does not have to match the type of data in the main stream and the types of the different side outputs can also differ. This operation can be useful when you want to split a stream of data where you would normally have to replicate the stream and then filter out from each stream the data that you don’t want to have.
When using side outputs, you first need to define an OutputTag that will be used to identify a side output stream:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html
Allowed lateness can result in multiple outputs. So end of window and end of watermark from the last even is one time and then for each element that’s late another aggregated output.

Resources