Flink - Need way to notify one stream from another - apache-flink

I have an Apache Flink use case that works as follows:
I have data events coming in through a first stream. Part of each event is a foreign key for which I expect data from a second stream. For example, the first stream carries data for major cities, including a city code, and I need the average temperature over time for that city code, which arrives through the second stream. It is not possible to have temperatures streamed for all possible cities; we have to request each city for which we need data.
So we need some way to "notify" the second stream's source that we need data for a city "pushed" when we encounter that city for the first time in the first stream.
This would be easy if the request could be made from the first stream. The problem is that the second stream arrives over a websocket, part of which is a control channel through which the request has to be made, so the request HAS to be made from the second stream.
1. Check the event in the first stream. Read city code x.
2. Have we seen this city code? If not, notify the second stream that we need data for city code x.
3. The second stream sends a message to its source requesting data for x.
4. Data starts flowing in for city x, which is used to join downstream.
If the request could be made from the first stream, this would be easy: I could do it in step 2, and data would start flowing in the second stream. But that is not possible, because the request needs to be sent on the same websocket connection that feeds the second stream.
I have explored using CoProcessFunction or RichCoMapFunction for this, but it is not clear how this can be done. I have seen some examples of the Broadcast State Pattern, but even that does not seem to fit the use case.
Can someone help me with some pointers on possible solutions?

So I made it work using the suggestion of the side output stream. Thanks @whatisinthename and @kkrugler for the suggestions.
I am still trying to figure out the details, but here's a summary:
1. From the notification stream (stream 1), create a side output stream (stream 1-1).
2. Use a subclass of KeyedProcessFunction (TempRequester) to process side output stream 1-1 and create stream 2 from it. This function holds the websocket connection.
3. In the open method of the KeyedProcessFunction, create the websocket connection (handshaking etc.). Keep a ListState holding the city codes seen so far.
4. In the processElement method of TempRequester, check the city code coming in from side output stream 1-1. If it is already in the ListState, do nothing. Otherwise, send a message through the websocket control channel to request data for that city, and add the code to the ListState. Register a one-time processing-time timer to fire after 500 milliseconds or so. The websocket server writes temperature data very frequently, and that data is saved in a queue.
5. In the onTimer method, drain the queue and emit the records (out.collect...), then register the timer again. So essentially, once the first city code arrives, we have a timer that fires every 500 milliseconds and pushes the records received so far into the second stream.
6. Now the first and second streams can be joined downstream (I used the Table API).
Not sure if this is the most elegant solution, but it worked. Thanks for the suggestions.
Here's the approximate main code:
DataStream<Event> notificationStream =
        env.addSource(this.notificationSource)
                .returns(TypeInformation.of(Event.class))
                // assignTimestampsAndWatermarks returns a new stream, so chain it instead of discarding the result
                .assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
final OutputTag<String> outputTag = new OutputTag<String>("cities-seen"){};
SingleOutputStreamOperator<Event> mainDataStream = notificationStream.process(new ProcessFunction<Event, Event>() {
    @Override
    public void processElement(
            Event value,
            Context ctx,
            Collector<Event> out) throws Exception {
        // emit data to regular output
        out.collect(value);
        // emit the city code to the side output
        ctx.output(outputTag, value.cityCode);
    }
});
DataStream<String> sideOutputStream = mainDataStream.getSideOutput(outputTag);
DataStream<TemperatureData> temperatureStream = sideOutputStream
        .keyBy(value -> value)
        .process(new TempRequester())
        .assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
// set up the Java Table API and rest of SQL based joins ...
And the approximate code for TempRequester (a KeyedProcessFunction):
public static class TempRequester extends KeyedProcessFunction<String, String, TemperatureData> {
    // Keyed state: scoped to the current key (the city code), so each key only sees its own list
    private ListState<String> allCities;
    private volatile boolean running = true;
    // Queue of city-code requests to send over the websocket control channel
    private BlockingQueue<String> messagesToSend = new ArrayBlockingQueue<>(100);
    // Queue of temperature data received from the websocket
    private ConcurrentLinkedQueue<TemperatureData> messages = new ConcurrentLinkedQueue<TemperatureData>();
    private static final int TIMEOUT = 500; // milliseconds
    // True once the first polling timer has been registered
    private boolean initialized = false;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        allCities = getRuntimeContext().getListState(new ListStateDescriptor<>("List of cities seen", String.class));
        ... rest of websocket client setup code ...
    }

    @Override
    public void close() throws Exception {
        running = false;
        super.close();
    }

    @Override
    public void processElement(String cityCode, Context ctx, Collector<TemperatureData> collector) throws Exception {
        boolean citycodeFound = StreamSupport.stream(allCities.get().spliterator(), false)
                .anyMatch(s -> cityCode.equals(s));
        if (!citycodeFound) {
            allCities.add(cityCode);
            messagesToSend.put(.. add city code ..);
            if (!initialized) {
                // processing-time timers must be based on the processing-time clock, not the record timestamp
                ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + TIMEOUT);
                initialized = true;
            }
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<TemperatureData> out) throws Exception {
        // drain everything the websocket has delivered since the last timer
        TemperatureData p;
        while ((p = messages.poll()) != null) {
            out.collect(p);
        }
        // re-register so the queue is drained every TIMEOUT milliseconds
        ctx.timerService().registerProcessingTimeTimer(timestamp + TIMEOUT);
    }
}
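For readers who want to reason about the logic without a Flink cluster or a websocket, here is a minimal plain-Java sketch of the two ideas in TempRequester: request each city code only once, and drain the received data on every timer tick. The class name CityRequestSketch and its two in-memory queues are hypothetical stand-ins for the control channel and the websocket listener; they are not part of the code above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Plain-Java model of the request-once / drain-periodically logic (no Flink, no websocket).
final class CityRequestSketch {
    private final Set<String> requestedCities = new HashSet<>();
    // Hypothetical stand-in for the websocket control channel.
    private final Queue<String> outgoingRequests = new ArrayDeque<>();
    // Hypothetical stand-in for the buffer filled by the websocket listener.
    private final Queue<String> receivedReadings = new ArrayDeque<>();

    /** Returns true only the first time a city code is seen, and queues one request for it. */
    boolean onCityCode(String cityCode) {
        if (!requestedCities.add(cityCode)) {
            return false; // already requested, nothing to do
        }
        outgoingRequests.add(cityCode);
        return true;
    }

    /** Called when the (simulated) websocket delivers a reading. */
    void onReading(String reading) {
        receivedReadings.add(reading);
    }

    /** What the timer callback does: emit everything buffered so far. */
    List<String> drain() {
        List<String> out = new ArrayList<>();
        String reading;
        while ((reading = receivedReadings.poll()) != null) {
            out.add(reading);
        }
        return out;
    }

    int pendingRequests() {
        return outgoingRequests.size();
    }
}
```

In the Flink version, the set corresponds to the keyed state, and drain() corresponds to the body of onTimer.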

Related

Check if I'm receiving the stream properly for all keys

I have the following scenario: suppose there are 20 sensors which are sending me streaming feed. I apply a keyBy (sensorID) against the stream and perform some operations such as average etc. This is implemented, and running well (using Flink Java API).
Initially all is well and all the sensors are sending me a feed. After a certain time, it may happen that a couple of sensors start misbehaving and I start getting an irregular feed from them, e.g. I receive a feed from 18 sensors, but 2 don't send anything for long durations.
We can assume that I already know the fixed list of sensor IDs (possibly hard-coded or in a database). How do I identify which two are not sending a feed? Where can I get the list of keys to compare with the list in the database?
I want to raise an alarm if I don't get a feed for some time (e.g. 2 mins, 5 mins, 10 mins etc., with increasing priority).
Has anyone implemented such a scenario using flink-streaming / patterns? Any suggestions please.
You could use a ProcessFunction with timers.
You could register a timer for each record and reset it whenever you receive data. If you schedule the timer to fire after 5 minutes of processing time, then onTimer is called only if no data has arrived in that interval, and from there you can emit an alert. You could also re-register the timer after an alert has fired, to emit further alerts with higher severity.
Note that this only works if all sensors are initially working correctly. Specifically, it only emits alerts for keys that have been seen at least once. But from your description it seems that it would solve your problem.
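As a rough illustration of the escalation idea, the following plain-Java sketch computes a severity level from how long a sensor has been silent, and the time at which the next timer would need to fire. The 2, 5 and 10 minute thresholds come from the question; the class and method names are hypothetical, and a real job would call something like this from onTimer.

```java
// Plain-Java sketch of alert escalation (no Flink dependency).
final class SilenceAlertSketch {
    // Thresholds taken from the question: 2, 5 and 10 minutes, in milliseconds.
    private static final long[] THRESHOLDS_MS = {2 * 60_000L, 5 * 60_000L, 10 * 60_000L};

    /** Severity 0 means no alert; 1..3 is the number of thresholds already exceeded. */
    static int severity(long lastSeenMs, long nowMs) {
        long silentFor = nowMs - lastSeenMs;
        int level = 0;
        for (long threshold : THRESHOLDS_MS) {
            if (silentFor >= threshold) {
                level++;
            }
        }
        return level;
    }

    /** Time at which the next timer should fire, or -1 once the highest severity is reached. */
    static long nextTimer(long lastSeenMs, int currentSeverity) {
        if (currentSeverity >= THRESHOLDS_MS.length) {
            return -1L;
        }
        return lastSeenMs + THRESHOLDS_MS[currentSeverity];
    }
}
```

Each time data arrives for a key, the pending timer would be deleted and a new one registered at nextTimer(lastSeen, 0).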
I just happen to have an example of this pattern lying around. It'll need some adjustment to fit your use case, but should get you started.
public class TimeoutFunction extends KeyedProcessFunction<String, Event, String> {
    private ValueState<Long> lastModifiedState;
    static final int TIMEOUT = 2 * 60 * 1000; // 2 minutes

    @Override
    public void open(Configuration parameters) throws Exception {
        // register our state with the state backend
        lastModifiedState = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<String> out) throws Exception {
        // update our state and timer
        Long current = lastModifiedState.value();
        if (current != null) {
            ctx.timerService().deleteEventTimeTimer(current + TIMEOUT);
        }
        // the state is null for the first event of a key, so guard before calling Math.max
        current = (current == null) ? event.timestamp() : Math.max(current, event.timestamp());
        lastModifiedState.update(current);
        ctx.timerService().registerEventTimeTimer(current + TIMEOUT);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // emit alert: no event arrived for this key within TIMEOUT
        String deviceId = ctx.getCurrentKey();
        out.collect(deviceId);
    }
}
This assumes a main program that does something like this:
DataStream<String> result = stream
.assignTimestampsAndWatermarks(new MyBoundedOutOfOrdernessAssigner(...))
.keyBy(e -> e.deviceId)
.process(new TimeoutFunction());
As @Dominik said, this only emits alerts for keys that have been seen at least once. You could fix that by introducing a secondary source of events that creates an artificial event for every sensor that should exist, and unioning that stream with the primary source.
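To see why the artificial events help, here is a small plain-Java model of the idea: every known sensor is seeded with a "last seen" time at start-up, so a sensor that never reports still times out. The class SeededTimeoutSketch and its methods are hypothetical and only mimic what the unioned stream plus TimeoutFunction would do in Flink.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of "seed one artificial event per expected key" (no Flink dependency).
final class SeededTimeoutSketch {
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final long timeoutMs;

    SeededTimeoutSketch(List<String> knownSensorIds, long startMs, long timeoutMs) {
        this.timeoutMs = timeoutMs;
        // The artificial events: every expected sensor counts as "seen" at start-up.
        for (String id : knownSensorIds) {
            lastSeen.put(id, startMs);
        }
    }

    /** A real event from the primary source. */
    void onEvent(String sensorId, long timestampMs) {
        lastSeen.merge(sensorId, timestampMs, (a, b) -> Math.max(a, b));
    }

    /** Sensors that have been silent for at least the timeout at the given time. */
    Set<String> silentAt(long nowMs) {
        Set<String> silent = new TreeSet<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMs - entry.getValue() >= timeoutMs) {
                silent.add(entry.getKey());
            }
        }
        return silent;
    }
}
```

Without the seeding loop in the constructor, a sensor that never sends anything would never appear in the map and would never be reported.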
The pattern is very clear to me now. I've implemented the solution and it works like a charm.
If anyone needs the code, I'll be happy to share it.

Side input of size around 50Mb causing long GC pause

We are running a Beam application on a Flink cluster with a side input of about 50 MB.
The side input is refreshed (pulled from an external data source) whenever a notification arrives on a notification topic in Kafka.
As the application runs, the side input causes frequent full GCs, each taking ~30 seconds, which stalls the task manager and prevents it from sending heartbeats to the master.
After several consecutive missed heartbeats, the master assumes the worker is dead and starts reassigning its tasks, which restarts the application.
When we remove the side input, the application works fine.
Questions:
1. Is there any limit on the size of a side input in Apache Beam?
2. I created the side input map using asSingleton(). Does that create a separate copy for each task? I have set parallelism to 15. Will it create 15 copies in one JVM (assuming all tasks are assigned to the same worker)?
3. What are the alternatives to side inputs?
This is a sample pipeline:
public class BeamApplication {
public static final CloseableHttpClient httpClient = HttpClients.createDefault();
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
options.as(FlinkPipelineOptions.class).setRunner(FlinkRunner.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<Map<String, Double>> sideInput = pipeline
.apply(KafkaIO.<String, String>read().withBootstrapServers("localhost:9092")
.withKeyDeserializer(StringDeserializer.class).withValueDeserializer(StringDeserializer.class)
.withTopic("testing"))
.apply(ParDo.of(new DoFn<KafkaRecord<String, String>, Map<String, Double>>() {
@ProcessElement
public void processElement(ProcessContext processContext) {
KafkaRecord<String, String> record = processContext.element();
String message = record.getKV().getValue().split("##")[0];
String change = record.getKV().getValue().split("##")[1];
if (message.equals("START_REST")) {
Map<String, Double> map = new HashMap<>();
Map<String,Double> changeMap = new HashMap<>();
HttpGet request = new HttpGet("http://localhost:8080/config-service/currency");
try (CloseableHttpResponse response = httpClient.execute(request)) {
HttpEntity entity = response.getEntity();
String responseString = EntityUtils.toString(entity, "UTF-8");
ObjectMapper objectMapper = new ObjectMapper();
CurrencyDTO jsonObject = objectMapper.readValue(responseString, CurrencyDTO.class);
map.putAll(jsonObject.getQuotes());
System.out.println(change);
Random rand = new Random();
Double db = rand.nextDouble();
System.out.println(db);
changeMap.put(change,db);
entity.getContent();
} catch (Exception e) {
e.printStackTrace();
}
processContext.output(changeMap);
}
}
}));
PCollection<Map<String, Double>> currency = sideInput
.apply(Window.<Map<String, Double>>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.ZERO).discardingFiredPanes());
PCollectionView<Map<String, Double>> sideInputView = currency.apply(View.asSingleton());
PCollection<KafkaRecord<Long, String>> kafkaEvents = pipeline
.apply(KafkaIO.<Long, String>read().withBootstrapServers("localhost:9092")
.withKeyDeserializer(LongDeserializer.class).withValueDeserializer(StringDeserializer.class)
.withTopic("event_testing"));
PCollection<String> output = kafkaEvents
.apply("Extract lines", ParDo.of(new DoFn<KafkaRecord<Long, String>, String>() {
@ProcessElement
public void processElement(ProcessContext processContext) {
String element = processContext.element().getKV().getValue();
Map<String, Double> map = processContext.sideInput(sideInputView);
System.out.println("This is it : " + map.entrySet());
}
}).withSideInputs(sideInputView));
pipeline.run().waitUntilFinish();
}
}
What state backend are you using?
If I'm not mistaken, side inputs are implemented as state in Flink. If you're using MemoryStateBackend, the side input may indeed put pressure on memory.
Also, processing of events blocks until the side input is ready, and events are buffered in the meantime. If preparing the side input takes a long time or the rate of incoming events is high, that buffering can add memory pressure.
Can you try an alternative state backend? Preferably RocksDBStateBackend, which holds in-flight data in a RocksDB database instead of in memory.
It's difficult to guess what the issue is. I would recommend monitoring memory-related metrics - see a good post on that here.
You could also run profiling on the Task Managers and analyse the dumps - see here.
Does memory also increase if you only publish the first message to the "testing" topic?
To isolate the problem, I would try a simpler side input: remove the HTTP call and make the data static, perhaps with a periodic trigger instead of Kafka:
GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L))

Joining stream events only once over a sliding window

I'm trying to evaluate whether Apache Flink would be usable for a distributed event-driven system (only-once). The use case is that a user is signed up for a subscription and wants to change to a different subscription.
There are two separate processes that run asynchronously when the user clicks the submit button. One process cancels the existing subscription whilst the other signs up for the new subscription. Once both events have been triggered, an email notification is sent.
I've managed to create two streams in Apache Flink using the RabbitMQ connector. When I try joining these streams using a sliding window, the events are duplicated for each slide of the window. I've tried setting a ValueStateDescriptor on the joined streams, but this doesn't seem to expire after the window has passed.
Additionally, I need to detect events that have not been paired across the streams and send them to a different RabbitMQ sink, to cope with situations where an event has not been fired because its process did not complete successfully.
Do you have any tips/ideas on how I could achieve the above functionality?
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
final RMQConnectionConfig rmqConnectionConfig = new RMQConnectionConfig.Builder()
.setHost("localhost")
.setPort(5672)
.setVirtualHost("/")
.setUserName("admin")
.setPassword("password")
.build();
final DataStream<String> cancellation = environment
.addSource(new RMQSource<>(rmqConnectionConfig, "scratchpad-cancellation", true, new SimpleStringSchema()))
.setParallelism(1);
final DataStream<String> subscription = environment
.addSource(new RMQSource<>(rmqConnectionConfig, "scratchpad-subscription", true, new SimpleStringSchema()))
.setParallelism(1);
cancellation
.join(subscription)
.where(value -> value).equalTo(value -> value)
.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(15)))
.apply((left, right) -> left)
.keyBy(value -> value)
.process(new ProcessFunction<String, String>() {
private ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class);
private ValueState<Boolean> state;
@Override
public void open(Configuration parameters) {
state = this.getRuntimeContext().getState(descriptor);
}
@Override
public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
if (BooleanUtils.isNotTrue(state.value())) {
state.update(true);
out.collect(value);
ctx.timerService().registerEventTimeTimer(ctx.timestamp() + TimeUnit.MILLISECONDS.convert(10, TimeUnit.MINUTES));
}
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
state.clear();
}
})
.print()
.setParallelism(1);
environment.execute();
If you have duplicated values in the output of your window, you can add a reduce function in another window after your existing sliding window, and that should be enough in most cases. I think there should be a better solution than this, but we need an example of your code to work on improvements.
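The duplication itself follows from how sliding windows overlap. A minimal plain-Java sketch, assuming windows of the form [start, start + size) aligned to multiples of the slide with no offset (the class and method names are hypothetical): with a 5 minute size and a 15 second slide, every element falls into 20 windows, and a joined pair is emitted once for every window that contains both elements.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of sliding-window assignment (no Flink dependency).
final class SlidingWindowSketch {
    /** Start times of all windows [start, start + size) that contain timestampMs. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Assumes timestampMs >= 0 and windows aligned to multiples of slideMs (no offset).
        long lastStart = timestampMs - (timestampMs % slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    /** Number of windows containing both timestamps, i.e. how often a joined pair is emitted. */
    static int sharedWindows(long t1, long t2, long sizeMs, long slideMs) {
        List<Long> shared = windowStarts(t1, sizeMs, slideMs);
        shared.retainAll(windowStarts(t2, sizeMs, slideMs));
        return shared.size();
    }
}
```

For two events one minute apart, sharedWindows with a 5 minute size and 15 second slide gives 16, which is the number of duplicate join results for that pair. With a tumbling window (size equal to slide) the count drops to 1 when both events land in the same window.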
On the other hand, if you need to detect unpaired events, I think you need to work with the CoGroup operator instead of using joins.
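The difference is that a join only produces output for keys present on both sides, whereas a co-group hands you both groups for every key, including the case where one side is empty. The following plain-Java sketch models that for a single window; the class CoGroupSketch and the user IDs in the usage note are hypothetical, and this is not the Flink API itself.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of co-group semantics for one window (no Flink dependency).
final class CoGroupSketch {
    static final class Result {
        final Set<String> paired = new TreeSet<>();
        final Set<String> unpaired = new TreeSet<>();
    }

    /** Splits keys into those seen on both sides and those seen on only one side. */
    static Result coGroup(Collection<String> cancellations, Collection<String> subscriptions) {
        Set<String> left = new HashSet<>(cancellations);
        Set<String> right = new HashSet<>(subscriptions);
        Set<String> allKeys = new HashSet<>(left);
        allKeys.addAll(right);

        Result result = new Result();
        for (String key : allKeys) {
            if (left.contains(key) && right.contains(key)) {
                result.paired.add(key);   // both events arrived: send the email
            } else {
                result.unpaired.add(key); // one side is missing: route to the error sink
            }
        }
        return result;
    }
}
```

For example, co-grouping cancellations for u1 and u2 with subscriptions for u2 and u3 yields u2 as paired and u1, u3 as unpaired.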

Flink Kafka consumer StreamExecutionEnvironment only?

I have a pulling scenario:
HTTP -> Kafka -> Flink -> some output
If I'm not wrong, I can use the Kafka consumer only on a stream?
Therefore I need to "block" the stream in order to sum/count the data I'm receiving from the HTTP call.
The easiest way to "block" is to add a window.
What is the best approach for this pulling scenario?
UPDATE
I want to prevent the collector from summing each value:
SingleOutputStreamOperator<Tuple2<String, Integer>> t =
in.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String s, Collector<Tuple2<String, Integer>>
collector) throws Exception {
ObjectMapper mapper = new ObjectMapper();
JsonNode node = mapper.readTree(s);
node.elements().forEachRemaining(v -> {
collector.collect(new Tuple2<>(v.textValue(), 1));
});
}
}).keyBy(0).sum(1);
If I understand correctly I think what you may want to use is a session window. This will continue to collect messages into the window and will only process the contents of the window when an event hasn't been received after a certain amount of time. See the documentation on session windows here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/windows.html
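To make the session-window behaviour concrete, here is a small plain-Java sketch that groups timestamps into sessions: a new session starts whenever the gap since the previous element is at least the configured session gap. The class name and the gap value in the usage note are hypothetical; in Flink the grouping is done by the session window assigner, not by code like this.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of session-window grouping (no Flink dependency).
final class SessionWindowSketch {
    /**
     * Splits ascending timestamps into sessions. A new session starts when the gap
     * between an element and its predecessor is at least gapMs.
     */
    static List<List<Long>> sessions(List<Long> sortedTimestamps, long gapMs) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) >= gapMs) {
                result.add(current);          // gap reached: close the current session
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }
}
```

With timestamps 0, 1000, 2000, 10000 and 10500 and a gap of 5000 ms, this gives two sessions of 3 and 2 elements, so a count per session would be emitted once per burst rather than once per record.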

How to process a message without message leaving the queue till a condition is met?

This is about a particular use case that I am planning to address with Flink streaming.
A message is sent to Flink for stream processing; the stream is keyed and thus partitioned as expected. However, the messages for each key need to be evaluated in order until a condition is met. For example, in a banking system the transactions (messages) for an account must be processed in sequence; processing a message out of sequence would lead to an inconsistent system state. The system needs to wait for a message to be processed (maybe even for 2-3 days) before processing the next message in the sequence. How can this be achieved in Flink without blocking message processing for other keys?
Thanks in advance!
Have you had a look at the CEP library? You could specify a pattern like:
Pattern<Event, ?> pattern = Pattern.<Event>begin("firstOfSequence").where(new FilterFunction<Event>() {
private static final long serialVersionUID = 5726188262756267490L;
@Override
public boolean filter(Event value) throws Exception {
return value.isFirstOfSequence();
}
}).followedBy("secondOfSequence").where(new FilterFunction<Event>() {
private static final long serialVersionUID = 5726188262756267490L;
@Override
public boolean filter(Event value) throws Exception {
return value.isSecondOfSequence();
}
});
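Independently of CEP, the requirement in the question (hold a message until its predecessor has been processed, without blocking other keys) can be modelled as a per-key buffer. The plain-Java sketch below is only an illustration of that idea, not the CEP approach above: it assumes each message carries a sequence number starting at 0 per key, which the question does not state, and the class and method names are hypothetical. In Flink, the two maps would live in keyed state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of per-key in-order release (no Flink dependency).
final class InOrderPerKeySketch {
    // Next sequence number expected for each key (assumed to start at 0).
    private final Map<String, Long> nextExpected = new HashMap<>();
    // Messages that arrived early, ordered by sequence number, per key.
    private final Map<String, TreeMap<Long, String>> buffered = new HashMap<>();

    /** Accepts a message and returns the payloads that can now be processed, in order. */
    List<String> accept(String key, long sequence, String payload) {
        TreeMap<Long, String> buffer = buffered.computeIfAbsent(key, k -> new TreeMap<>());
        buffer.put(sequence, payload);

        long expected = nextExpected.getOrDefault(key, 0L);
        List<String> ready = new ArrayList<>();
        // Release the longest run of consecutive messages starting at the expected number.
        while (!buffer.isEmpty() && buffer.firstKey() == expected) {
            ready.add(buffer.pollFirstEntry().getValue());
            expected++;
        }
        nextExpected.put(key, expected);
        return ready;
    }
}
```

A message for account A that arrives out of order is simply held, while messages for account B continue to be released as they arrive.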

Resources