Flink Custom Trigger giving Unexpected Output - apache-flink

I want to create a Trigger which gets fired in 20 seconds for the first time and in every five seconds after that. I have used GlobalWindows and a custom Trigger
val windowedStream = valueStream
.keyBy(0)
.window(GlobalWindows.create())
.trigger(TradeTrigger.of())
Here is the code in TradeTrigger:
#PublicEvolving
public class TradeTrigger<W extends Window> extends Trigger<Object, W> {
private static final long serialVersionUID = 1L;
static boolean flag=false;
static long ctime = System.currentTimeMillis();
private TradeTrigger() {
}
#Override
public TriggerResult onElement(
Object arg0,
long arg1,
W arg2,
org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext arg3)
throws Exception {
// TODO Auto-generated method stub
if(flag == false){
if((System.currentTimeMillis()-ctime) >= 20000){
flag = true;
ctime = System.currentTimeMillis();
return TriggerResult.FIRE;
}
return TriggerResult.CONTINUE;
} else {
if((System.currentTimeMillis()-ctime) >= 5000){
ctime = System.currentTimeMillis();
return TriggerResult.FIRE;
}
return TriggerResult.CONTINUE;
}
}
#Override
public TriggerResult onEventTime(
long arg0,
W arg1,
org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext arg2)
throws Exception {
// TODO Auto-generated method stub
return TriggerResult.CONTINUE;
}
#Override
public TriggerResult onProcessingTime(
long arg0,
W arg1,
org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext arg2)
throws Exception {
// TODO Auto-generated method stub
return TriggerResult.CONTINUE;
}
public static <W extends Window> TradeTrigger<W> of() {
return new TradeTrigger<>();
}
}
So basically, when flag is false, i.e. the first time, the Trigger should get fired in 20 seconds and set the flag to true. From the next time, it should get fired every 5 seconds.
The problem I am facing is, I am getting only one message in the output every time the Trigger is fired. That is, I get a single message after 20 seconds and single messages in every five seconds.
I am expecting twenty messages in the output on each triggering.
If I use .timeWindow(Time.seconds(5)) and create a time window of five seconds, I get 20 messages in output every 5 seconds.
Please help me get this code right. Is there something I am missing?

There are a few issues with your Trigger implementation:
You should never store the state of a function in a static variable. Flink does not isolate user processes in JVMs. Instead it uses a single JVM per TaskManager and starts multiple threads. Hence, your static boolean flag is shared across multiple instances of triggers. You should store the flag Flink's ValueState interface which is accessible from the TriggerContext. Flink will take care to checkpoint your state and recover it in case of a failure.
Trigger.onEvent() is only called when a new event arrives. So it cannot be used to trigger a Window computation at a specific time. Instead you should register an event time timer or processing time timer (again via the TriggerContext). The timer will call Trigger.onEventTime() or Trigger.onProcessingTime() respectively. Whether to use event or processing time depends on your use case.

Got it working with the help of the answer from Fabian and Flink mailing lists.
Stored the state in a ValueState variable through the TriggerContext. Checked the variable in onEvent() method and if it was the first time, registered a processingTimeTimer for 20 seconds more than the current time and updated the state. In the onProcessingTime method, registered another ProcessingTimeTimer for 5 seconds more than current time, updated the state and fired the Window.

Related

Process Function Event Time Behaviour

I am referring to the Process Function example mentioned in https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/process_function/
/**
* The data type stored in the state
*/
public class CountWithTimestamp {
public String key;
public long count;
public long lastModified;
}
/**
* The implementation of the ProcessFunction that maintains the count and timeouts
*/
public class CountWithTimeoutFunction
extends KeyedProcessFunction<Tuple, Tuple2<String, String>, Tuple2<String, Long>> {
/** The state that is maintained by this process function */
private ValueState<CountWithTimestamp> state;
#Override
public void open(Configuration parameters) throws Exception {
state = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
}
#Override
public void processElement(
Tuple2<String, String> value,
Context ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// retrieve the current count
CountWithTimestamp current = state.value();
if (current == null) {
current = new CountWithTimestamp();
current.key = value.f0;
}
// update the state's count
current.count++;
// set the state's timestamp to the record's assigned event time timestamp
current.lastModified = ctx.timestamp();
// write the state back
state.update(current);
// schedule the next timer 60 seconds from the current event time
ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}
#Override
public void onTimer(
long timestamp,
OnTimerContext ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// get the state for the key that scheduled the timer
CountWithTimestamp result = state.value();
// check if this is an outdated timer or the latest timer
if (timestamp == result.lastModified + 60000) {
// emit the state on timeout
out.collect(new Tuple2<String, Long>(result.key, result.count));
}
}
}
In this scenario my datastream is being produced by KafkaSource with no idleness behaviour configured
DataStream<Tuple2<Integer, Integer>> inputStream = env.fromSource(inputSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(1)), "Input Kafka Source")
Now consider a scenario where there is only 1 key that is being emitted by source, let's say key1
At time T1 when the first event comes, processElement is called and the CountWithTimestamp object is set accordingly ie count = 1 and lastModified = T1
Now there are no more events for lets say 70 secs(T2). Then another event comes in for the same key key1
Now here are my questions:
When the second event comes, during my debugging, processElement always gets called first then onTimer. This is because watermark gets generated only after the event has been processed. Is my understanding correct?
Since processElement is getting called first the lastModified is getting modified to T2 (earlier it was T1). This means that even if now timer triggers it won't process as lastModified got updated. And it won't process if the above mentioned scenario keeps repeating.
Thanks.
I believe you've got that right.
Yes, watermarks follow the events that justify their creation.
Yes, that example is flawed. It makes (unstated) assumptions about there being events for other keys.

Flink DataStream sort program does not output

I have written a small test case code in Flink to sort a datastream. The code is as follows:
public enum StreamSortTest {
;
public static class MyProcessWindowFunction extends ProcessWindowFunction<Long,Long,Integer, TimeWindow> {
#Override
public void process(Integer key, Context ctx, Iterable<Long> input, Collector<Long> out) {
List<Long> sortedList = new ArrayList<>();
for(Long i: input){
sortedList.add(i);
}
Collections.sort(sortedList);
sortedList.forEach(l -> out.collect(l));
}
}
public static void main(final String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);
DataStream<Long> probeSource = env.fromSequence(1, 500).setParallelism(2);
// range partition the stream into two parts based on data value
DataStream<Long> sortOutput =
probeSource
.keyBy(x->{
if(x<250){
return 1;
} else {
return 2;
}
})
.window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
.process(new MyProcessWindowFunction())
;
sortOutput.print();
System.out.println(env.getExecutionPlan());
env.executeAsync();
}
}
However, the code just outputs the execution plan and a few other lines. But it doesn't output the actual sorted numbers. What am I doing wrong?
The main problem I can see is that You are using ProcessingTime based window with very short input data, which surely will be processed in time shorter than 20 seconds. While Flink is able to detect end of input(in case of stream from file or sequence as in Your case) and generate Long.Max watermark, which will close all open event time based windows and fire all event time based timers. It doesn't do the same thing for ProcessingTime based computations, so in Your case You need to assert Yourself that Flink will actually work long enough so that Your window is closed or refer to custom trigger/different time characteristic.
One other thing I am not sure about since I never used it that much is if You should use executeAsync for local execution, since that's basically meant for situations when You don't want to wait for the result of the job according to the docs here.

onTimer method, why timer state is null?

I have a simple application like this (inside keyed process function).
As you can in the code section below, I am always first getting timerObject from state and if it does not exists, I am creating a new one, and update the state. Thus, there will never be a empty/null state.
And basically this state is just for keeping the object last time, for example:
If an object was seen at time 10:15 then register time will be 10:30.
However if an object was seen again at time 10:25, then register time will be updated to 10:40
If process function runs onTimer at time 10:40, that's means there was no object in 15 mins interval, then i am just clearing my state.
Problem is that logger sometimes prints null for the state object. This should not be the case right?
public class ProcessRule extends KeyedProcessFunction<Tuple, LogEntity, Result> {
private static final Logger LOGGER = LoggerFactory.getLogger(ProcessRule.class);
private transient ValueState<TimerObject> timerState;
#Override
public void open(Configuration parameters) throws Exception{
ValueStateDescriptor<TimerObject> timerValueStateDescriptor = new ValueStateDescriptor<TimerObject>(
"timerStateForProcessRule",
TypeInformation.of(TimerObject.class)
);
timerState = getRuntimeContext().getState(timerValueStateDescriptor);
}
#Override
public void processElement(LogEntity value, Context ctx, Collector<Result> out) throws Exception{
registerTimer(value, ctx);
if (conditionTrue) {
convert Result add to collector
}
}
private void registerTimer(LogEntity element, Context ctx) throws Exception{
TimerObject stateTimer = timerState.value();
if (stateTimer == null){
stateTimer = new TimerObject();
long timeInterval = 15 * 60 * 1000;
stateTimer.setTimeInterval(timeInterval);
}
stateTimer.setCurrentTimeInMilliseconds(element.getTimestampMs());
timerState.update(stateTimer);
ctx.timerService().registerProcessingTimeTimer(stateTimer.getNextTimer());
// getNextTimer => currentTime + timeInterval
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<ValidationResult> out) throws Exception{
TimerObject stateTimer = timerState.value();
LOGGER.info("Timer fired at the the timestamps: {} for: {}", timestamp, stateTimer);
timerState.clear();
}
}
The issue here is most probably coming from the fact that You are registering multiple different timers, but You don't seem to delete them when registering new ones. So, this basically means that when first-timer fires the timerState is cleared, but seconds again next timer may also fire since it might have been registered to fire 3 sec after the first one and in this case the timerState may already be null.

APACHE FLINK AggregateFunction with tumblingWindow to count events but also send 0 if no event occurred

I need to count events within a tumbling window. But I also want to send events with 0 Value if there were no events within the window.
Something like.
windowCount: 5
windowCount: 0
windowCount: 0
windowCount: 3
windowCount: 0
...
import com.google.protobuf.Message;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.skydivin4ng3l.cepmodemon.models.events.aggregate.AggregateOuterClass;
public class BasicCounter<T extends Message> implements AggregateFunction<T, Long, AggregateOuterClass.Aggregate> {
#Override
public Long createAccumulator() {
return 0L;
}
#Override
public Long add(T event, Long accumulator) {
return accumulator + 1L;
}
#Override
public AggregateOuterClass.Aggregate getResult(Long accumulator) {
return AggregateOuterClass.Aggregate.newBuilder().setVolume(accumulator).build();
}
#Override
public Long merge(Long accumulator1, Long accumulator2) {
return accumulator1 + accumulator2;
}
}
and used here
DataStream<AggregateOuterClass.Aggregate> aggregatedStream = someEntryStream
.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
.aggregate(new BasicCounter<MonitorOuterClass.Monitor>());
TimeCharacteristics are ingestionTime
I read about a TiggerFunction which might detect if the aggregated Stream has received an event after x time but i am not sure if that is the right way to do it.
I expected the aggregation to happen even is there would be no events at all within the window. Maybe there is a setting i am not aware of?
Thx for any hints.
I chose Option 1 as suggested by #David-Anderson:
Here is my Event Generator:
public class EmptyEventSource implements SourceFunction<MonitorOuterClass.Monitor> {
private volatile boolean isRunning = true;
private final long delayPerRecordMillis;
public EmptyEventSource(long delayPerRecordMillis){
this.delayPerRecordMillis = delayPerRecordMillis;
}
#Override
public void run(SourceContext<MonitorOuterClass.Monitor> sourceContext) throws Exception {
while (isRunning) {
sourceContext.collect(MonitorOuterClass.Monitor.newBuilder().build());
if (delayPerRecordMillis > 0) {
Thread.sleep(delayPerRecordMillis);
}
}
}
#Override
public void cancel() {
isRunning = false;
}
}
and my adjusted AggregateFunction:
public class BasicCounter<T extends Message> implements AggregateFunction<T, Long, AggregateOuterClass.Aggregate> {
#Override
public Long createAccumulator() {
return 0L;
}
#Override
public Long add(T event, Long accumulator) {
if(((MonitorOuterClass.Monitor)event).equals(MonitorOuterClass.Monitor.newBuilder().build())) {
return accumulator;
}
return accumulator + 1L;
}
#Override
public AggregateOuterClass.Aggregate getResult(Long accumulator) {
AggregateOuterClass.Aggregate newAggregate = AggregateOuterClass.Aggregate.newBuilder().setVolume(accumulator).build();
return newAggregate;
}
#Override
public Long merge(Long accumulator1, Long accumulator2) {
return accumulator1 + accumulator2;
}
}
Used them Like this:
DataStream<MonitorOuterClass.Monitor> someEntryStream = env.addSource(currentConsumer);
DataStream<MonitorOuterClass.Monitor> triggerStream = env.addSource(new EmptyEventSource(delayPerRecordMillis));
DataStream<AggregateOuterClass.Aggregate> aggregatedStream = someEntryStream
.union(triggerStream)
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.aggregate(new BasicCounter<MonitorOuterClass.Monitor>());
Flink's windows are created lazily, when the first event is assigned to a window. Thus empty windows do not exist, and can't produce results.
In general there are three ways to workaround this issue:
Put something in front of the window that adds events to the stream, ensuring that every window has something in it, and then modify your window processing to ignore these special events when computing their results.
Use a GlobalWindow along with a custom Trigger that uses processing time timers to trigger the window (with no events flowing, the watermark won't advance, and event time timers won't fire until more events arrive).
Don't use the window API, and implement your own windowing with a ProcessFunction instead. But here you'll still face the issue of needing to use processing time timers.
Update:
Having now made an effort to implement an example of option 2, I cannot recommend it. The issue is that even with a custom Trigger, the ProcessAllWindowFunction will not be called if the window is empty, so it is necessary to always keep at least one element in the GlobalWindow. This appears then to require implementing a rather hacky Evictor and ProcessAllWindowFunction that collaborate to retain and ignore a special element in the window -- and you also have to somehow get that element into the window in the first place.
If you're going to do something hacky, option 1 appears to be much simpler.

Flink executes dataflow twice

I'm new to Flink and I work with DataSet API. After a whole bunch of processing as the last stage I need to normalize one of the values by dividing it by its maximum value. So, I have used the .max() operator to take the max and later I'm passing the result as constructor's argument to the MapFunction.
This works, however all the processing is performed twice. One job is executed to find max values, and later another job is executed to create final result (starting execution from the beginning)... Is there any workaround to execute whole dataflow only once?
final List<Tuple6<...>> maxValues = result.max(2).collect();
assert maxValues.size() == 1;
result.map(new NormalizeAttributes(maxValues.get(0))).writeAsCsv(...)
#FunctionAnnotation.ForwardedFields("f0; f1; f3; f4; f5")
#FunctionAnnotation.ReadFields("f2")
private static class NormalizeAttributes implements MapFunction<Tuple6<...>, Tuple6<...>> {
private final Tuple6<...> maxValues;
public NormalizeAttributes(Tuple6<...> maxValues) {
this.maxValues = maxValues;
}
#Override
public Tuple6<...> map(Tuple6<...> value) throws Exception {
value.f2 /= maxValues.f2;
return value;
}
}
collect() immediately triggers an execution of the program up to the dataset requested by collect(). If you later call env.execute() or collect() again, the program is executed second time.
Besides the side effect of execution, using collect() to distribute values to subsequent transformation has also the drawback that data is transferred to the client and later back into the cluster. Flink offers so-called Broadcast variables to ship a DataSet as a side input into another transformation.
Using Broadcast variables in your program would look as follows:
DataSet maxValues = result.max(2);
result
.map(new NormAttrs()).withBroadcastSet(maxValues, "maxValues")
.writeAsCsv(...);
The NormAttrs function would look like this:
private static class NormAttr extends RichMapFunction<Tuple6<...>, Tuple6<...>> {
private Tuple6<...> maxValues;
#Override
public void open(Configuration config) {
maxValues = (Tuple6<...>)getRuntimeContext().getBroadcastVariable("maxValues").get(1);
}
#Override
public PredictedLink map(Tuple6<...> value) throws Exception {
value.f2 /= maxValues.f2;
return value;
}
}
You can find more information about Broadcast variables in the documentation.

Resources