ProcessFunction in Flink to output a result every 5 seconds - apache-flink

My input stream is Tuple2<String, String>, I want to group by the first field and sum the integer in the second field. This is my ProcessFunction:
public static class MyKeyedProcessFunction
extends KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Integer>> {
private ValueState<Integer> state;
#Override
public void open(Configuration parameters) throws Exception {
state = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Integer.class));
}
#Override
public void processElement(
Tuple2<String, Integer> value,
Context ctx,
Collector<Tuple2<String, Integer>> out) throws Exception {
Integer sum = state.value();
if (sum == null) {
sum = 0;
}
sum += value.f1;
state.update(sum);
ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 5000);
}
#Override
public void onTimer(
long timestamp,
OnTimerContext ctx,
Collector<Tuple2<String, Integer>> out) throws Exception {
out.collect(Tuple2.of(ctx.getCurrentKey(), state.value()));
state.clear();
}
}
Now the onTimer is called for every element. I specified the input as:
aaa,50
aaa,40
aaa,10
I see the output like:
(aaa,100)
(aaa, null)
(aaa, null)
How can I get the output as (aaa,100)?

You registered a timer for every incoming event. Maybe you can also try to create a new ValueState of type boolean that indicates if an initial timer has been registered.
As soon as the first event comes in, you register a timer in the processElement method like:
if(timerRegistered.value()==false){
ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 5000);
timerRegistered.update(true);
}
Then you will just go on and register new timers for the 5 sec interval in the onTimer method instead of the processElement.
I didn't test the code but it should give an idea.
Kind Regards
Dominik

Related

Process Function Event Time Behaviour

I am referring to the Process Function example mentioned in https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/process_function/
/**
* The data type stored in the state
*/
public class CountWithTimestamp {
public String key;
public long count;
public long lastModified;
}
/**
* The implementation of the ProcessFunction that maintains the count and timeouts
*/
public class CountWithTimeoutFunction
extends KeyedProcessFunction<Tuple, Tuple2<String, String>, Tuple2<String, Long>> {
/** The state that is maintained by this process function */
private ValueState<CountWithTimestamp> state;
#Override
public void open(Configuration parameters) throws Exception {
state = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
}
#Override
public void processElement(
Tuple2<String, String> value,
Context ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// retrieve the current count
CountWithTimestamp current = state.value();
if (current == null) {
current = new CountWithTimestamp();
current.key = value.f0;
}
// update the state's count
current.count++;
// set the state's timestamp to the record's assigned event time timestamp
current.lastModified = ctx.timestamp();
// write the state back
state.update(current);
// schedule the next timer 60 seconds from the current event time
ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}
#Override
public void onTimer(
long timestamp,
OnTimerContext ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// get the state for the key that scheduled the timer
CountWithTimestamp result = state.value();
// check if this is an outdated timer or the latest timer
if (timestamp == result.lastModified + 60000) {
// emit the state on timeout
out.collect(new Tuple2<String, Long>(result.key, result.count));
}
}
}
In this scenario my datastream is being produced by KafkaSource with no idleness behaviour configured
DataStream<Tuple2<Integer, Integer>> inputStream = env.fromSource(inputSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(1)), "Input Kafka Source")
Now consider a scenario where there is only 1 key that is being emitted by source, let's say key1
At time T1 when the first event comes, processElement is called and the CountWithTimestamp object is set accordingly ie count = 1 and lastModified = T1
Now there are no more events for lets say 70 secs(T2). Then another event comes in for the same key key1
Now here are my questions:
When the second event comes, during my debugging, processElement always gets called first then onTimer. This is because watermark gets generated only after the event has been processed. Is my understanding correct?
Since processElement is getting called first the lastModified is getting modified to T2 (earlier it was T1). This means that even if now timer triggers it won't process as lastModified got updated. And it won't process if the above mentioned scenario keeps repeating.
Thanks.
I believe you've got that right.
Yes, watermarks follow the events that justify their creation.
Yes, that example is flawed. It makes (unstated) assumptions about there being events for other keys.

Flink KeyedProcessFunction integrating anonymous methods

I am attempting to write a keyedProcessFunction, the code looks like this below:
DataStream<Tuple2<Long, Integer>> busyMachinesPerWindow = busyMachines
// group by timestamp (window end)
.keyBy(event -> event.getField(1))
.process(new KeyedProcessFunction<Tuple1<Long>, Tuple3<Long, Long, Long>, Tuple2<Long, Integer>>() {
private ValueState<Integer> state;
#Override
public void open(Configuration config) throws IOException {
// initialize the state descriptors here
state = getRuntimeContext().getState(new ValueStateDescriptor<>("machine-counts", Integer.class));
if (state.value() == null) {
state.update(0);
}
}
#Override
public void processElement(Tuple3<Long, Long, Long> inWindow, Context ctx, Collector<Tuple2<Long, Integer>> out) throws Exception {
if (state.value() != null) {
state.update(state.value() + 1);
} else {
state.update(1);
}
ctx.timerService().registerEventTimeTimer(inWindow.f1);
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<Long, Integer>> out) throws Exception {
int counter = state.value();
state.clear();
// we can now output the window and the machine count
out.collect(new Tuple2<>(((Tuple1<Long>) ctx.getCurrentKey()).f0, counter));
}
});
However this pops up an error saying cannot derive anonymous method. I don't see what the problem is with this code. Is there some type ambiguity that I am not doing right?
One problem with this code is that you are calling state.value() and state.update(0) in the open method. This is not allowed. These methods can only be used in processElement and in onTimer, because only then is there a specific event being processed whose key can be used to access/update the appropriate entry in the state backend.
An instance of a KeyedProcessFunction is multiplexed across all of the keys assigned to a given task slot. The open method is called just once, at a time when there is no specific key in the runtime context, so the state cannot be accessed or updated at this time.

Scheduled Task with Apache Flink

I have a flink job with parallelism 5 (for now !!). And one of the richFlatMap stream opens one file in the open(Configuration parameters) method. In the flatMapoperation there is no any open action, it just read the file to search something. (There is a utility class which has a method like utilityClass.searchText('abc')). Here is the boilerplate code:
public class MyFlatMap extends RichFlatMapFunction<...> {
private MyUtilityFile myFile;
#Override
public void open(Configuration parameters) throws Exception {
myFile.Open("fileLocation");
}
#Override
public void flatMap(...) throws Exception {
String text = myFile.searchText('abc');
if (text != null) // take an action
else // another action
}
}
This file is being updated by the python script every day at specific time. Therefore I should also open the newly created file (by python script) in the flatMap stream.
I just though that this can be done by ScheduledExecutorService with only one thread pool.
I can not open this file every flatMap calls because it is big.
Here is the boilerplate code I am trying to write:
public class MyFlatMap extends RichFlatMapFunction<...> implements Runnable {
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
private MyUtilityFile myFile;
#Override
public void run() {
myFile.Open("fileLocation");
}
#Override
public void open(Configuration parameters) throws Exception {
scheduler.scheduleAtFixedRate(this, 1, 1, TimeUnit.HOURS);
myFile.Open("fileLocation");
}
#Override
public void flatMap(...) throws Exception {
String text = myFile.searchText('abc');
if (text != null) // take an action
else // another action
}
}
Is this boilerplate okey for Flink environment? If not, how can i open the file with a scheduled manner? (There is no option such as "after updating file send event with kafka and read event by flink")
Perhaps you can directly implement the ProcessingTimeCallback interface, which supports timer operations
public class MyFlatMap extends RichFlatMapFunction<...> implements ProcessingTimeCallback {
private MyUtilityFile myFile;
#Override
public void open(Configuration parameters) throws Exception {
scheduler.scheduleAtFixedRate(this, 1, 1, TimeUnit.HOURS);
final long now = getProcessingTimeService().getCurrentProcessingTime();
getProcessingTimeService().registerTimer(now + 3600000, this);
myFile.Open("fileLocation");
}
#Override
public void flatMap(...) throws Exception {
String text = myFile.searchText('abc');
if (text != null) // take an action
else // another action
}
#Override
public void onProcessingTime(long timestamp) throws Exception {
myFile.Open("fileLocation");
final long now = getProcessingTimeService().getCurrentProcessingTime();
getProcessingTimeService().registerTimer(now + 3600000, this);
}
}

onTimer method, why timer state is null?

I have a simple application like this (inside keyed process function).
As you can in the code section below, I am always first getting timerObject from state and if it does not exists, I am creating a new one, and update the state. Thus, there will never be a empty/null state.
And basically this state is just for keeping the object last time, for example:
If an object was seen at time 10:15 then register time will be 10:30.
However if an object was seen again at time 10:25, then register time will be updated to 10:40
If process function runs onTimer at time 10:40, that's means there was no object in 15 mins interval, then i am just clearing my state.
Problem is that logger sometimes prints null for the state object. This should not be the case right?
public class ProcessRule extends KeyedProcessFunction<Tuple, LogEntity, Result> {
private static final Logger LOGGER = LoggerFactory.getLogger(ProcessRule.class);
private transient ValueState<TimerObject> timerState;
#Override
public void open(Configuration parameters) throws Exception{
ValueStateDescriptor<TimerObject> timerValueStateDescriptor = new ValueStateDescriptor<TimerObject>(
"timerStateForProcessRule",
TypeInformation.of(TimerObject.class)
);
timerState = getRuntimeContext().getState(timerValueStateDescriptor);
}
#Override
public void processElement(LogEntity value, Context ctx, Collector<Result> out) throws Exception{
registerTimer(value, ctx);
if (conditionTrue) {
convert Result add to collector
}
}
private void registerTimer(LogEntity element, Context ctx) throws Exception{
TimerObject stateTimer = timerState.value();
if (stateTimer == null){
stateTimer = new TimerObject();
long timeInterval = 15 * 60 * 1000;
stateTimer.setTimeInterval(timeInterval);
}
stateTimer.setCurrentTimeInMilliseconds(element.getTimestampMs());
timerState.update(stateTimer);
ctx.timerService().registerProcessingTimeTimer(stateTimer.getNextTimer());
// getNextTimer => currentTime + timeInterval
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<ValidationResult> out) throws Exception{
TimerObject stateTimer = timerState.value();
LOGGER.info("Timer fired at the the timestamps: {} for: {}", timestamp, stateTimer);
timerState.clear();
}
}
The issue here is most probably coming from the fact that You are registering multiple different timers, but You don't seem to delete them when registering new ones. So, this basically means that when first-timer fires the timerState is cleared, but seconds again next timer may also fire since it might have been registered to fire 3 sec after the first one and in this case the timerState may already be null.

Flink: ValueState on RichFlatMapFunktion always returns null

I try to calculate the highest amount of found hashtags in a given Tumbling window.
For this I do kind of a "word count" for hashtags and sum them up. This works fine. After this, I try to find the hashtag with the highest order in the given window. I use a RichFlatMapFunction for this and ValueState to save the current maximum of the appearance of a single hashtag, but this doesn't work.
I have debugged my code and find out that the value of the ValueState "maxVal" is in every flatMap step "null". So the update() and the value() method doesn't work in my scenario.
Do I misunderstood the RichFlatMap function or their usage?
Here is my code, everything except the last flatmap function is working as expected:
public class TwitterHashtagCount {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
DataStream<String> tweetsRaw = env.addSource(new TwitterSource(TwitterConnection.getTwitterConnectionProperties()));
DataStream<String> tweetsGerman = tweetsRaw.filter(new EnglishLangFilter());
DataStream<Tuple2<String, Integer>> tweetHashtagCount = tweetsGerman
.flatMap(new TwitterHashtagFlatMap())
.keyBy(0)
.timeWindow(Time.seconds(15))
.sum(1)
.flatMap(new RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
private transient ValueState<Integer> maxVal;
#Override
public void open(Configuration parameters) throws Exception {
ValueStateDescriptor<Integer> descriptor =
new ValueStateDescriptor<>(
// state name
"max-val",
// type information of state
TypeInformation.of(Integer.class));
maxVal = getRuntimeContext().getState(descriptor);
}
#Override
public void flatMap(Tuple2<String, Integer> value, Collector<Tuple2<String, Integer>> out) throws Exception {
Integer maxCount = maxVal.value();
if(maxCount == null) {
maxCount = 0;
maxVal.update(0);
}
if(value.f1 > maxCount) {
maxVal.update(maxCount);
out.collect(new Tuple2<String, Integer>(value.f0, value.f1));
}
}
});
tweetHashtagCount.print();
env.execute("Twitter Streaming WordCount");
}
}
I'm wondering why the code you've shared runs at all. The result of sum(1) is non-keyed stream, and the managed state interface you are using expects a keyed stream, and will keep a separate instance of the state for each key. I'm surprised you're not getting an error saying "Keyed state can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation."
Since you've previously windowed the stream, if you do key it again (with the same key) before the RichFlatMapFunction, each key will occur once and the maxVal will always be null.
Something like this might do what you intend, if your goal is to find the max across all hashtags in each time window:
tweetsGerman
.flatMap(new TwitterHashtagFlatMap())
.keyBy(0)
.timeWindow(Time.seconds(15))
.sum(1)
.timeWindowAll(Time.seconds(15))
.max(1)

Resources