Joining stream events only once over a sliding window - apache-flink

I'm trying to evaluate if apache flink would be usable for a distributed event driven system (only-once). The use case is that a user is signed up for a subscription and wants to change for a different subscription.
There are two separate processes that run asynchronously when the users clicks the submit button. One process cancels the existing subscription whilst another signs up for the new subscription. Once these two events have been triggered, the email notification is sent.
I've managed to create two streams in apache flink using the RabbitMQ connector. When I try joining these streams together using a sliding window, the events are duplicated for each slide in the window. I've tried setting a ValueStateDescriptor on the joined streams but this doesn't seem to expire after the window has passed.
Additionally I need to detect the events that have not been paired in the streams and send this event to a different RabbitMQ sink to cope with situations whereby the event has not be fired due to the process not completing successfully.
Do you have any tips/ideas on how I could achieve the above functionality?
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
final RMQConnectionConfig rmqConnectionConfig = new RMQConnectionConfig.Builder()
.setHost("localhost")
.setPort(5672)
.setVirtualHost("/")
.setUserName("admin")
.setPassword("password")
.build();
final DataStream<String> cancellation = environment
.addSource(new RMQSource<>(rmqConnectionConfig, "scratchpad-cancellation", true, new SimpleStringSchema()))
.setParallelism(1);
final DataStream<String> subscription = environment
.addSource(new RMQSource<>(rmqConnectionConfig, "scratchpad-subscription", true, new SimpleStringSchema()))
.setParallelism(1);
cancellation
.join(subscription)
.where(value -> value).equalTo(value -> value)
.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(15)))
.apply((left, right) -> left)
.keyBy(value -> value)
.process(new ProcessFunction<String, String>() {
private ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class);
private ValueState<Boolean> state;
#Override
public void open(Configuration parameters) {
state = this.getRuntimeContext().getState(descriptor);
}
#Override
public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
if (BooleanUtils.isNotTrue(state.value())) {
state.update(true);
out.collect(value);
ctx.timerService().registerEventTimeTimer(ctx.timestamp() + TimeUnit.MILLISECONDS.convert(10, TimeUnit.MINUTES));
}
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
state.clear();
}
})
.print()
.setParallelism(1);
environment.execute();

If you have duplicated values on the output of your window, you can add a reduce function into another window after your already defined slide window and that should be engough in most cases. But i think that should be a better solution than this, but we need an example fo your code to work on improvements.
On the other side, if you need to detect non paired events, i think that you need to work with the CoGroup operator, instead of use joins.

Related

Flink - Need way to notify one stream from another

I have an Apache flink usecase that works as follows:
I have data events coming in through first stream. Part of each event is a foreign key for which I expect data from the second stream. E.g.: I am getting data for major cities in the first stream which has a city-code and I need the average temperature over time for this city code streamed through the second stream. It is not possible to have temperatures streamed for all possible cities, we have to request the city for which we need the data.
So we need some way to "notify" the second stream source that we need data for this city "pushed" when we encounter it the first time in the first stream.
This would have been easy if this notification could be done from the first stream. The problem is that the second stream is coming to us through a websocket part of which is a control channel through which we have to make the request - so the request HAS to be made from the second stream.
Check event in the first stream. Read city code x.
Have we seen this city code? If not, notify the second stream, we need data for city code x.
Second stream sends message to source for data for x.
Data starts flowing in for city x, which is used to join downstream.
If notification from the first stream was possible, this would be easy - I could have done it from step 2, so data starts flowing in the second stream. But that is not possible as the request needs to be send on the same websocket connection that feeds the second stream.
I have explored using CoProcessFunction or RichCoMapFunction for this - but it is not clear how this can be done. I have seen some examples of Broadcast State Pattern - but even that does not seem to fit the usecase.
Can someone help me with some pointers on possible solutions?
So I made it work using the suggestion of the side output stream. Thanks #whatisinthename and #kkrugler for the suggestions.
Still trying to figure out details, but here's a summary
From the notification stream (stream 1), create a side output stream (stream 1-1).
Use an extended class (TempRequester) of KeyedProcessFunction, to process the side output stream 1-1 and create Stream 2 from it. The KeyedProcessFunction has the websocket connection.
In the open method of the KeyedProcessFunction create the connection to websocket (handshaking etc.). Have a ListState state to keep the list of city codes.
In the processElement function of TempRequester, check the city code coming in from side output stream 1-1. If present in ListState, do nothing. Else, send a message through websocket control channel and request city data and add the code to ListState. Create a process timer (this is one time) to fire after 500 milliseconds or so. The websocket server writes the temp data very frequently and that is saved in a queue.
In the onTimer method, check the queue, read the data and push out (out.collect...). Create a timer again. So essentially, once the first city code gets in, we create a timer that runs every 500 milliseconds and dumps the records received out into the second stream.
Now the first and second streams can be joined downstream (I used the table API).
Not sure if this is the most elegant solution, but it worked. Thanks for the suggestions.
Here's the approximate main code:
DataStream<Event> notificationStream =
env.addSource(this.notificationSource)
.returns(TypeInformation.of(Event.class));
notificationStream.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
final OutputTag<String> outputTag = new OutputTag<String>("cities-seen"){};
SingleOutputStreamOperator<Event> mainDataStream = notificationStream.process(new ProcessFunction<Event, Event>() {
#Override
public void processElement(
Event value,
Context ctx,
Collector<Event> out) throws Exception {
// emit data to regular output
out.collect(value);
// emit data to side output
ctx.output(outputTag, event.cityCode);
}
});
DataStream<String> sideOutputStream = mainDataStream.getSideOutput(outputTag);
DataStream<TemperatureData> temperatureStream = sideOutputStream
.keyBy(value -> value)
.process(new TempRequester());
temperatureStream.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
// set up the Java Table API and rest of SQL based joins ...
And the approximate code for TempRequester (ProcessFunction):
public static class TempRequester extends KeyedProcessFunction<String, String, TemperatureData> {
private ListState<String> allCities;
private volatile boolean running = true;
//This is the queue for requesting city codes
private BlockingQueue<String> messagesToSend = new ArrayBlockingQueue<>(100);
//This is the queue for receiving temperature data
private ConcurrentLinkedQueue<TemperatureData> messages = new ConcurrentLinkedQueue<TemperatureData>();
private static final int TIMEOUT = 500;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
allCities = getRuntimeContext().getListState(new ListStateDescriptor<>("List of cities seen", String.class));
... rest of websocket client setup code ...
}
#Override
public void close() throws Exception {
running = false;
super.close();
}
private boolean initialized = false;
#Override
public void processElement(String cityCode, Context ctx, Collector<TemperatureData> collector) throws Exception {
boolean citycodeFound = StreamSupport.stream(allCities.get().spliterator(), false)
.anyMatch(s -> cityCode.equals(s));
if (!citycodeFound) {
allCities.add(cityCode);
messagesToSend.put(.. add city code ..);
if (!initialized) {
ctx.timerService().registerProcessingTimeTimer(ctx.timestamp()+ TIMEOUT);
initialized = true;
}
}
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<TemperatureData> out) throws Exception {
TemperatureData p;
while ((p = messages.poll()) != null) {
out.collect(p);
}
ctx.timerService().registerProcessingTimeTimer(ctx.timestamp() + TIMEOUT);
}
}

Flink Checkpointing mode ExactlyOnce is not working as expected

I am newbie to flink apologize if my understanding is wrong i am building a dataflow application and the flow contains multiple data streams which check if the required fields are present in the incoming DataStream or not. My application validate the incoming data and if the data is validated successfully it should append the data to file in the given if it is already existing. I am trying to simulate if any exception happens in one DataStream other data streams should not get impacted for that i am explicitly throwing an exception in one of the flow. In the below example for simplicity i am using windows text file to append data
Note: My flow don't have states since i don't have any thing to store in state
public class ExceptionTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
// env.setParallelism(1);
//env.setStateBackend(new RocksDBStateBackend("file:///C://flinkCheckpoint", true));
// to set minimum progress time to happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// checkpoints have to complete within 5000 ms, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(5000);
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // DELETE_ON_CANCELLATION
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // number of restart attempts
Time.of(10, TimeUnit.SECONDS) // delay
));
DataStream<String> input1 = env.fromElements("hello");
DataStream<String> input2 = env.fromElements("hello");
DataStream<String> output1 = input.flatMap(new FlatMapFunction<String, String>() {
#Override
public void flatMap(String value, Collector<String> out) throws Exception {
//out.collect(value.concat(" world"));
throw new Exception("=====================NO VALUE TO CHECK=================");
}
});
DataStream<String> output2 = input.flatMap(new FlatMapFunction<String, String>() {
#Override
public void flatMap(String value, Collector<String> out) throws Exception {
out.collect(value.concat(" world"));
}
});
output2.addSink(new SinkFunction<String>() {
#Override
public void invoke(String value) throws Exception {
try {
File myObj = new File("C://flinkOutput//filename.txt");
if (myObj.createNewFile()) {
System.out.println("File created: " + myObj.getName());
BufferedWriter out = new BufferedWriter(
new FileWriter("C://flinkOutput//filename.txt", true));
out.write(value);
out.close();
System.out.println("Successfully wrote to the file.");
} else {
System.out.println("File already exists.");
BufferedWriter out = new BufferedWriter(
new FileWriter("C://flinkOutput//filename.txt", true));
out.write(value);
out.close();
System.out.println("Successfully wrote to the file.");
}
} catch (IOException e) {
System.out.println("An error occurred.");
e.printStackTrace();
}
}
});
env.execute();
}
I have few doubts as below
When i am throwing exception in output1 stream the second flow output2 is running even after encountering the exception and writing data to the file in my local but when i check the file the output as below
hello world
hello world
hello world
hello world
As per my understanding from flink documentation if i use the checkpointing mode as EXACTLY_ONCE it should not write the data to file not more than one time as the process is already completed and written data to file. But its not happening in my case and i am not getting if i am doing anything wrong
Please help me to clear my doubts on checkpointing and how can i achieve the EXACTLY_ONCE mechanism i read about TWO_PHASE_COMMIT in flink but i didn't get any example on how to implement it.
As suggested by #Mikalai Lushchytski i implemented StreamingSinkFunction below
With StreamingSinkFunction
public class ExceptionTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
// env.setParallelism(1);
//env.setStateBackend(new RocksDBStateBackend("file:///C://flinkCheckpoint", true));
// to set minimum progress time to happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// checkpoints have to complete within 5000 ms, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(5000);
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // DELETE_ON_CANCELLATION
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // number of restart attempts
Time.of(10, TimeUnit.SECONDS) // delay
));
DataStream<String> input1 = env.fromElements("hello");
DataStream<String> input2 = env.fromElements("hello");
DataStream<String> output1 = input.flatMap(new FlatMapFunction<String, String>() {
#Override
public void flatMap(String value, Collector<String> out) throws Exception {
//out.collect(value.concat(" world"));
throw new Exception("=====================NO VALUE TO CHECK=================");
}
});
DataStream<String> output2 = input.flatMap(new FlatMapFunction<String, String>() {
#Override
public void flatMap(String value, Collector<String> out) throws Exception {
out.collect(value.concat(" world"));
}
});
String outputPath = "C://flinkCheckpoint";
final StreamingFileSink<String> sink = StreamingFileSink
.forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(
DefaultRollingPolicy.builder()
.withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
.withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
.withMaxPartSize(1)
.build())
.build();
output2.addSink(sink);
});
env.execute();
}
But when i check the Checkpoint folder i can see it created four part files with in progress as below
Is there anything i am doing because of that its creating multipart files?
In order to guarantee end-to-end exactly-once record delivery (in addition to exactly-once state semantics), the data sink needs to take part in the checkpointing mechanism (as well as the data source).
If you are going to write the data to a file, then you can use a StreamingFileSink, which emits its input elements to FileSystem files within buckets. This is integrated with the checkpointing mechanism to provide exactly once semantics out-of-the box.
If you are going to implement your own sink, then the sink function must implement the CheckpointedFunction interface and properly implement snapshotState(FunctionSnapshotContext context) method called when a snapshot for a checkpoint is requested and flushing the current application state. In addition I would recommend implementing the CheckpointListener interface to be notified once a distributed checkpoint has been completed.
Flink already provides an abstract TwoPhaseCommitSinkFunction, which is a recommended base class for all of the SinkFunction that intend to implement exactly-once semantic. It does that by implementing two phase commit algorithm on top of the CheckpointedFunction and
CheckpointListener. As an example, you can have a look at FlinkKafkaProducer.java source code.

Merging rebalanced partition

As one of my last step in a streaming application, I want to sort the out of order events in the system.
To do so I used:
events.keyBy((Event event) -> event.id)
.process(new SortFunction())
.print();
Where sort function is:
public static class SortFunction extends KeyedProcessFunction<String, Event, Event> {
private ValueState<PriorityQueue<Event>> queueState = null;
#Override
public void open(Configuration config) {
ValueStateDescriptor<PriorityQueue<Event>> descriptor = new ValueStateDescriptor<>(
// state name
"sorted-events",
// type information of state
TypeInformation.of(new TypeHint<PriorityQueue<Event>>() {
}));
queueState = getRuntimeContext().getState(descriptor);
}
#Override
public void processElement(Event event, Context context, Collector<Event> out) throws Exception {
TimerService timerService = context.timerService();
if (context.timestamp() > timerService.currentWatermark()) {
PriorityQueue<Event> queue = queueState.value();
if (queue == null) {
queue = new PriorityQueue<>(10);
}
queue.add(event);
queueState.update(queue);
timerService.registerEventTimeTimer(event.timestamp);
}
}
#Override
public void onTimer(long timestamp, OnTimerContext context, Collector<Event> out) throws Exception {
PriorityQueue<Event> queue = queueState.value();
Long watermark = context.timerService().currentWatermark();
Event head = queue.peek();
while (head != null && head.timestamp <= watermark) {
out.collect(head);
queue.remove(head);
head = queue.peek();
}
}
}
What Im trying to do now is try to paralelize it. My current idea is to do the following:
events.keyBy((Event event) -> event.id)
.rebalance()
.process(new SortFunction()).setParalelism(3)
.map(new KWayMerge()).setParalelism(1).
.print();
If what I understand is correct, what should happend in this case, and correct me if I am wrong, is that a section of each of the Events for a given key (ideally 1/3) will go to each of the parallel instances of SortFunction, in which case, to have a complete sort, I need to create a a map, or another processFunction, that receives sorted Events from the 3 diferent instances and merges them back together.
If this thats the case, is there any way to distinguish the origin of the Event received by the map so that I can perform a 3-way merge on the map? If thats not possible my next idea will be to swap the PriorityQueue for a TreeMap and put everything into a window so that the merge happens at the end of the window once the 3 TreeMaps have been received. Does this other option make sense in case option a is non viable or is there a better solution to do something like this?
First of all, you should be aware that using a PriorityQueue or a TreeMap in Flink ValueState is an okay idea if and only if you are using a heap-based state backend. In the case of RocksDB, this will perform quite badly, as the PriorityQueues will be deserialized on every access, and reserialized on every update. In general we recommend sorting based on MapState, and this is how sorting in implemented in Flink's libraries.
What this code will do
events.keyBy((Event event) -> event.id)
.process(new SortFunction())
is to independently sort the stream on a key-by-key basis -- the output will be sorted with respect to each key, but not globally.
On the other hand, this
events.keyBy((Event event) -> event.id)
.rebalance()
.process(new SortFunction()).setParalelism(3)
won't work, because the result of the rebalance is no longer a KeyedStream, and the SortFunction depends on keyed state.
Moreover, I don't believe that doing 3 sorts of 1/3 of the stream and then merging the results will perform noticeably better than a single global sort. If you need to do a global sort, you might want to consider using the Table API instead. See the answer here for an example.

How do I assign an id to a session window in Apache Flink?

How do I assign an id to a session window in Apache Flink?
Ultimately I want to enrich events with a session window id one-by-one while the session windows is open (I don't want to wait until the window closes before emitting the enriched events).
I tried to do this with an AggregateFunction, however I don't think merge() works as I expect. It seems to be for merging windows and not panes (trigger firings). It seems to be never called in my pipeline. It seems therefore that there is no shared state between triggers!
The session window id will be the timestamp of the first event to fall into the window (due to non-guaranteed ordering that may mean some events with could potentially fall into the same session window with an earlier timestamp - I'm ok with this).
public class FooSessionState {
private Long sessionCreationTime;
private FooMatch lastMatch;
}
/**
* Aggregator that assigns session ids to elements of a session window
*/
public class SessionIdAssigner implements
AggregateFunction<FooMatch, FooSessionState, FooSessionEvent> {
static final long serialVersionUID = 0L;
#Override
public FooSessionState createAccumulator() {
return new FooSessionState();
}
#Override
public FooSessionState add(FooMatch value, FooSessionState sessionState) {
if (sessionState.getSessionCreationTime() == null) {
sessionState.setSessionCreationTime(value.getReport().getTimestamp());
}
sessionState.setLastMatch(value);
return sessionState;
}
#Override
public FooSessionEvent getResult(FooSessionState accumulator) {
FooSessionEvent sessionEvent = new FooSessionEvent();
sessionEvent.setFooMatch(accumulator.getLastMatch());
sessionEvent.setSessionCreationTime(accumulator.getSessionCreationTime());
return sessionEvent;
}
#Override
public FooSessionState merge(FooSessionState a, FooSessionState b) {
if ( a.getSessionCreationTime() != null) {
b.setSessionCreationTime(a.getSessionCreationTime());
}
return b;
}
}
My plan was to use it as follows:
stream.keyBy(new FooMatchKeySelector())
.window(EventTimeSessionWindows.withGap(Time.milliseconds(config.getFooSessionWindowTimeout())))
.trigger(PurgingTrigger.of(CountTrigger.of(1L)))
.aggregate(new SessionIdAssigner())
I think session windows are not a good fit for what you want to achieve. They have been designed to aggregate events per session, but not to enrich every event, i.e., they compute a result and emit it when the window is closed. As you noticed, session windows work by creating a new window for every event and merging windows that overlap. This design was chosen, because events might arrive out of order. Hence it might happen, that you have two windows that are later connected by a bridging event.
I would recommend to implement the logic with a ProcessFunction that collects the events and sorts them on their timestamp. When a watermark is received, it emits all collected events with correct session IDs. Hence, you keep only the events between two watermarks in state. In addition to those events, you need to keep the timestamp of the last emitted event and the last emitted session ID to perform correct sessionization.

Flink Kafka consumer StreamExecutionEnvironment only?

I have pulling scenario,
HTTP -> Kafka -> Flink -> some output
If im not wrong i can use kafka consumer on stream only ?
Therefor i need to "block" the stream in order to sum/count the data im receiving from the HTTP call .
The easiest way to "block" is to add window/.
What is the best approach for this pulling scenario .
UPDATE
I want to prevent from the collector to sum each value
SingleOutputStreamOperator<Tuple2<String, Integer>> t =
in.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
#Override
public void flatMap(String s, Collector<Tuple2<String, Integer>>
collector) throws Exception {
ObjectMapper mapper = new ObjectMapper();
JsonNode node = mapper.readTree(s);
node.elements().forEachRemaining(v -> {
collector.collect(new Tuple2<>(v.textValue(), 1));
});
}
}).keyBy(0).sum(1);
If I understand correctly I think what you may want to use is a session window. This will continue to collect messages into the window and will only process the contents of the window when an event hasn't been received after a certain amount of time. See the documentation on session windows here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/windows.html

Resources