I have a simple application like this (inside a keyed process function).
As you can see in the code below, I always first get the TimerObject from state; if it does not exist, I create a new one and update the state. Thus, there should never be an empty/null state.
This state just keeps the time at which the object was last seen. For example:
If an object was seen at 10:15, the timer is registered for 10:30.
However, if the object is seen again at 10:25, the timer is updated to 10:40.
If onTimer runs at 10:40, that means no object was seen within the 15-minute interval, so I just clear my state.
The problem is that the logger sometimes prints null for the state object. This should not be the case, right?
public class ProcessRule extends KeyedProcessFunction<Tuple, LogEntity, Result> {
private static final Logger LOGGER = LoggerFactory.getLogger(ProcessRule.class);
private transient ValueState<TimerObject> timerState;
@Override
public void open(Configuration parameters) throws Exception{
ValueStateDescriptor<TimerObject> timerValueStateDescriptor = new ValueStateDescriptor<TimerObject>(
"timerStateForProcessRule",
TypeInformation.of(TimerObject.class)
);
timerState = getRuntimeContext().getState(timerValueStateDescriptor);
}
@Override
public void processElement(LogEntity value, Context ctx, Collector<Result> out) throws Exception{
registerTimer(value, ctx);
if (conditionTrue) {
// convert the element to a Result and add it to the collector
}
}
private void registerTimer(LogEntity element, Context ctx) throws Exception{
TimerObject stateTimer = timerState.value();
if (stateTimer == null){
stateTimer = new TimerObject();
long timeInterval = 15 * 60 * 1000;
stateTimer.setTimeInterval(timeInterval);
}
stateTimer.setCurrentTimeInMilliseconds(element.getTimestampMs());
timerState.update(stateTimer);
ctx.timerService().registerProcessingTimeTimer(stateTimer.getNextTimer());
// getNextTimer => currentTime + timeInterval
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Result> out) throws Exception{
TimerObject stateTimer = timerState.value();
LOGGER.info("Timer fired at timestamp: {} for: {}", timestamp, stateTimer);
timerState.clear();
}
}
The issue most probably comes from the fact that you are registering multiple different timers but never deleting the old ones when you register a new one. When the first timer fires, timerState is cleared. A later timer for the same key may then fire as well (for example, one that was registered to fire 3 seconds after the first), and by that point timerState is already null.
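One way to avoid the stale timers described above is to delete the previously registered timer before registering the new one (in Flink, the TimerService offers deleteProcessingTimeTimer(long) for this). The following is a minimal plain-Java sketch of that logic, not Flink code: the class TimerModel, its TreeSet of pending timers, and the deleteOldTimer switch are hypothetical stand-ins for one key's timer service and state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical plain-Java model of one key's timers and state; not Flink API.
class TimerModel {
    static final long INTERVAL_MS = 15 * 60 * 1000L;

    private final TreeSet<Long> pendingTimers = new TreeSet<>(); // stands in for the timer service
    private Long lastSeen = null;                                // stands in for the ValueState
    private final boolean deleteOldTimer;
    final List<String> log = new ArrayList<>();

    TimerModel(boolean deleteOldTimer) {
        this.deleteOldTimer = deleteOldTimer;
    }

    // Mirrors registerTimer(): update the state, then (re)schedule the timeout.
    void processElement(long timestampMs) {
        if (deleteOldTimer && lastSeen != null) {
            pendingTimers.remove(lastSeen + INTERVAL_MS); // like deleteProcessingTimeTimer(...)
        }
        lastSeen = timestampMs;
        pendingTimers.add(lastSeen + INTERVAL_MS);        // like registerProcessingTimeTimer(...)
    }

    // Fires every pending timer up to 'now', mirroring onTimer(): log, then clear the state.
    void advanceTo(long now) {
        while (!pendingTimers.isEmpty() && pendingTimers.first() <= now) {
            long ts = pendingTimers.pollFirst();
            log.add(ts + ":" + lastSeen);
            lastSeen = null;
        }
    }
}
```

With deleteOldTimer set to false, two elements 3 seconds apart leave two timers behind, and the second one fires after the state has been cleared. With it set to true, only the latest timer remains. An alternative, under the same assumptions, is to keep all timers and simply return early from onTimer when the state is null.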
I am referring to the Process Function example mentioned in https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/process_function/
/**
* The data type stored in the state
*/
public class CountWithTimestamp {
public String key;
public long count;
public long lastModified;
}
/**
* The implementation of the ProcessFunction that maintains the count and timeouts
*/
public class CountWithTimeoutFunction
extends KeyedProcessFunction<Tuple, Tuple2<String, String>, Tuple2<String, Long>> {
/** The state that is maintained by this process function */
private ValueState<CountWithTimestamp> state;
@Override
public void open(Configuration parameters) throws Exception {
state = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
}
@Override
public void processElement(
Tuple2<String, String> value,
Context ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// retrieve the current count
CountWithTimestamp current = state.value();
if (current == null) {
current = new CountWithTimestamp();
current.key = value.f0;
}
// update the state's count
current.count++;
// set the state's timestamp to the record's assigned event time timestamp
current.lastModified = ctx.timestamp();
// write the state back
state.update(current);
// schedule the next timer 60 seconds from the current event time
ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}
@Override
public void onTimer(
long timestamp,
OnTimerContext ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// get the state for the key that scheduled the timer
CountWithTimestamp result = state.value();
// check if this is an outdated timer or the latest timer
if (timestamp == result.lastModified + 60000) {
// emit the state on timeout
out.collect(new Tuple2<String, Long>(result.key, result.count));
}
}
}
In this scenario my data stream is produced by a KafkaSource with no idleness behaviour configured:
DataStream<Tuple2<Integer, Integer>> inputStream = env.fromSource(inputSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(1)), "Input Kafka Source");
Now consider a scenario where the source emits only one key, say key1.
At time T1, when the first event arrives, processElement is called and the CountWithTimestamp object is set accordingly, i.e. count = 1 and lastModified = T1.
No more events arrive for, say, 70 seconds (until T2). Then another event arrives for the same key, key1.
Now here are my questions:
When the second event arrives, in my debugging processElement always gets called first, and then onTimer. This is because the watermark is generated only after the event has been processed. Is my understanding correct?
Since processElement is called first, lastModified is updated to T2 (earlier it was T1). This means that even when the timer fires, onTimer emits nothing, because lastModified has already changed. And it will never emit anything if the scenario above keeps repeating.
Thanks.
I believe you've got that right.
Yes, watermarks follow the events that justify their creation.
Yes, that example is flawed. It makes (unstated) assumptions about there being events for other keys.
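The ordering described above can be traced with a small plain-Java model. This is only a sketch, not Flink code: SingleKeyTimeline and its fields are hypothetical, and it assumes that the watermark reaches an event's timestamp only after that event has been processed (the 1-second out-of-orderness bound is ignored for simplicity).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of CountWithTimeoutFunction for a single key; not Flink API.
class SingleKeyTimeline {
    static final long TIMEOUT_MS = 60_000L;

    private long count = 0;
    private long lastModified = Long.MIN_VALUE;
    private final TreeSet<Long> timers = new TreeSet<>();
    final List<Long> emittedCounts = new ArrayList<>();

    // An event arrives: the element is processed first, then the watermark advances.
    void onEvent(long eventTime) {
        // processElement: update the state and schedule a timer 60 s after this event
        count++;
        lastModified = eventTime;
        timers.add(eventTime + TIMEOUT_MS);
        // the watermark follows the event that justified it
        advanceWatermark(eventTime);
    }

    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.first() <= watermark) {
            long timestamp = timers.pollFirst();
            // onTimer: emit only if this is the latest timer for the key
            if (timestamp == lastModified + TIMEOUT_MS) {
                emittedCounts.add(count);
            }
        }
    }
}
```

Calling onEvent(0) and then onEvent(70_000) emits nothing: the timer for 60_000 fires only after lastModified has already moved to 70_000. In this model a result appears only once something else advances the watermark past 130_000, which is the unstated assumption about events for other keys.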
I am attempting to write a KeyedProcessFunction; the code looks like this:
DataStream<Tuple2<Long, Integer>> busyMachinesPerWindow = busyMachines
// group by timestamp (window end)
.keyBy(event -> event.getField(1))
.process(new KeyedProcessFunction<Tuple1<Long>, Tuple3<Long, Long, Long>, Tuple2<Long, Integer>>() {
private ValueState<Integer> state;
@Override
public void open(Configuration config) throws IOException {
// initialize the state descriptors here
state = getRuntimeContext().getState(new ValueStateDescriptor<>("machine-counts", Integer.class));
if (state.value() == null) {
state.update(0);
}
}
@Override
public void processElement(Tuple3<Long, Long, Long> inWindow, Context ctx, Collector<Tuple2<Long, Integer>> out) throws Exception {
if (state.value() != null) {
state.update(state.value() + 1);
} else {
state.update(1);
}
ctx.timerService().registerEventTimeTimer(inWindow.f1);
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<Long, Integer>> out) throws Exception {
int counter = state.value();
state.clear();
// we can now output the window and the machine count
out.collect(new Tuple2<>(((Tuple1<Long>) ctx.getCurrentKey()).f0, counter));
}
});
However, this produces an error saying "cannot derive anonymous method". I don't see what the problem is with this code. Is there some type ambiguity that I am not handling correctly?
One problem with this code is that you are calling state.value() and state.update(0) in the open method. This is not allowed. These methods can only be used in processElement and in onTimer, because only then is there a specific event being processed whose key can be used to access/update the appropriate entry in the state backend.
An instance of a KeyedProcessFunction is multiplexed across all of the keys assigned to a given task slot. The open method is called just once, at a time when there is no specific key in the runtime context, so the state cannot be accessed or updated at this time.
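To illustrate why open() cannot touch keyed state, here is a plain-Java sketch of how one function instance is multiplexed across keys. It is a simplified model and not Flink's implementation: KeyedValueStateModel, setCurrentKey, and the exception message are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a keyed ValueState: one entry per key, scoped by the "current key".
class KeyedValueStateModel<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private K currentKey; // set by the runtime before each processElement/onTimer call

    void setCurrentKey(K key) {
        this.currentKey = key;
    }

    V value() {
        requireKey();
        return entries.get(currentKey); // null if nothing was stored for this key yet
    }

    void update(V newValue) {
        requireKey();
        entries.put(currentKey, newValue);
    }

    private void requireKey() {
        if (currentKey == null) {
            // This is the situation inside open(): no element, hence no key.
            throw new IllegalStateException("No key set; keyed state is not accessible here");
        }
    }
}
```

Under this model, the null check belongs in processElement, where the question's code already has one, and the initialization in open() can simply be dropped.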
NOTE: Following David's suggestion, I have updated my Flink project to v1.12.3. It seems Flink has made some improvements in v1.12.3, and my problem is now resolved.
My current Flink application runs with 48 task slots on 3 nodes. I am using RocksDB as the state backend. (I do not care about Flink's savepoint and checkpoint mechanisms; I am just creating state with a TTL of about 5 minutes.)
However, memory consumption on all nodes keeps increasing, and I have to stop the Flink application via stop-cluster.sh and then restart it.
I have many keyed streams keyed by client IP address. Millions of users visit my site every day.
Some of the keyed streams use StateTtlConfig, while others use the onTimer mechanism.
My assumption about the memory consumption (or leak) is this: calling registerProcessingTimeTimer creates an entry that is held in memory, and because there are many IP addresses I will have many entries, so memory consumption keeps increasing?
Should I remove the onTimer solution and use only StateTtlConfig? (I am using the onTimer method because with StateTtlConfig every state update also refreshes the TTL, which produces invalid data in my application.)
Examples of state management:
// EXAMPLE FOR STATETTLCONFIG
public class State1 extends KeyedProcessFunction<Tuple, ..., ...>{
private transient ValueState<Integer> state;
@Override
public void open(Configuration parameters) throws Exception{
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.minutes(2))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.cleanupInBackground()
.build();
ValueStateDescriptor<Integer> valueStateDesc = new ValueStateDescriptor<Integer>(
..
);
valueStateDesc.enableTimeToLive(ttlConfig);
state = getRuntimeContext().getState(valueStateDesc);
}
@Override
public void processElement(LogObject value, Context ctx, Collector<LogObject> out) throws Exception{
Integer stateVal = state.value();
// do something and update state
}
}
// EXAMPLE FOR ONTIMER METHOD
public class State2 extends KeyedProcessFunction<Tuple, ..., ...> {
private transient ValueState<Integer> state;
@Override
public void open(Configuration parameters){
ValueStateDescriptor<Integer> stateDesc = new ValueStateDescriptor<>(
...);
state = getRuntimeContext().getState(stateDesc);
}
@Override
public void processElement(LogObject value, Context ctx, Collector<LogObject> out) throws Exception{
Integer stateVal = state.value();
if (stateVal == null)
{
stateVal = 0;
ctx.timerService().registerProcessingTimeTimer(value.getTimestamp() + 5 * 60 * 1000L); // 5 minutes
}
stateVal ++;
// do something and update state
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<LogObject> out)
{
state.clear();
}
}
I am trying to find the hashtag with the highest count in a given tumbling window.
For this I do a kind of "word count" for hashtags and sum them up. This works fine. After that, I try to find the hashtag with the highest count in the given window. I use a RichFlatMapFunction for this, with ValueState to save the current maximum count of a single hashtag, but this doesn't work.
I debugged my code and found that the value of the ValueState "maxVal" is null in every flatMap step. So the update() and value() methods don't work in my scenario.
Have I misunderstood the RichFlatMapFunction or its usage?
Here is my code, everything except the last flatmap function is working as expected:
public class TwitterHashtagCount {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
DataStream<String> tweetsRaw = env.addSource(new TwitterSource(TwitterConnection.getTwitterConnectionProperties()));
DataStream<String> tweetsGerman = tweetsRaw.filter(new EnglishLangFilter());
DataStream<Tuple2<String, Integer>> tweetHashtagCount = tweetsGerman
.flatMap(new TwitterHashtagFlatMap())
.keyBy(0)
.timeWindow(Time.seconds(15))
.sum(1)
.flatMap(new RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
private transient ValueState<Integer> maxVal;
@Override
public void open(Configuration parameters) throws Exception {
ValueStateDescriptor<Integer> descriptor =
new ValueStateDescriptor<>(
// state name
"max-val",
// type information of state
TypeInformation.of(Integer.class));
maxVal = getRuntimeContext().getState(descriptor);
}
@Override
public void flatMap(Tuple2<String, Integer> value, Collector<Tuple2<String, Integer>> out) throws Exception {
Integer maxCount = maxVal.value();
if(maxCount == null) {
maxCount = 0;
maxVal.update(0);
}
if(value.f1 > maxCount) {
maxVal.update(value.f1);
out.collect(new Tuple2<String, Integer>(value.f0, value.f1));
}
}
});
tweetHashtagCount.print();
env.execute("Twitter Streaming WordCount");
}
}
I'm wondering why the code you've shared runs at all. The result of sum(1) is a non-keyed stream, whereas the managed state interface you are using expects a keyed stream and keeps a separate instance of the state for each key. I'm surprised you're not getting an error saying "Keyed state can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation."
Since you've previously windowed the stream, if you do key it again (with the same key) before the RichFlatMapFunction, each key will occur once and the maxVal will always be null.
Something like this might do what you intend, if your goal is to find the max across all hashtags in each time window:
tweetsGerman
.flatMap(new TwitterHashtagFlatMap())
.keyBy(0)
.timeWindow(Time.seconds(15))
.sum(1)
.timeWindowAll(Time.seconds(15))
.max(1)
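The effect of the two-stage pipeline above (a per-hashtag sum, then a maximum over all hashtags in the same window) can be sketched in plain Java for the contents of a single window. This is not Flink code: WindowMax and maxHashtag are hypothetical names, and ties between equally frequent hashtags are broken arbitrarily.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper that mimics keyBy(0).sum(1) followed by a global max(1) for one window.
class WindowMax {
    static Optional<Map.Entry<String, Integer>> maxHashtag(List<String> hashtagsInWindow) {
        // Stage 1: per-hashtag count, like keyBy(0).timeWindow(...).sum(1)
        Map<String, Integer> counts = hashtagsInWindow.stream()
                .collect(Collectors.groupingBy(tag -> tag, Collectors.summingInt(tag -> 1)));
        // Stage 2: maximum over all hashtags, like timeWindowAll(...).max(1)
        return counts.entrySet().stream().max(Map.Entry.comparingByValue());
    }
}
```

For the input #a, #b, #a, #c, #a, #b this yields #a with a count of 3. One caveat when mapping this back to Flink: as far as I know, max(1) only guarantees the maximum of field 1, while maxBy(1) returns the whole record holding that maximum, so maxBy may be the better fit if the hashtag itself is needed.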
I want to create a Trigger that fires after 20 seconds the first time and every five seconds after that. I have used GlobalWindows and a custom Trigger:
val windowedStream = valueStream
.keyBy(0)
.window(GlobalWindows.create())
.trigger(TradeTrigger.of())
Here is the code in TradeTrigger:
@PublicEvolving
public class TradeTrigger<W extends Window> extends Trigger<Object, W> {
private static final long serialVersionUID = 1L;
static boolean flag=false;
static long ctime = System.currentTimeMillis();
private TradeTrigger() {
}
@Override
public TriggerResult onElement(
Object arg0,
long arg1,
W arg2,
org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext arg3)
throws Exception {
// TODO Auto-generated method stub
if(flag == false){
if((System.currentTimeMillis()-ctime) >= 20000){
flag = true;
ctime = System.currentTimeMillis();
return TriggerResult.FIRE;
}
return TriggerResult.CONTINUE;
} else {
if((System.currentTimeMillis()-ctime) >= 5000){
ctime = System.currentTimeMillis();
return TriggerResult.FIRE;
}
return TriggerResult.CONTINUE;
}
}
@Override
public TriggerResult onEventTime(
long arg0,
W arg1,
org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext arg2)
throws Exception {
// TODO Auto-generated method stub
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onProcessingTime(
long arg0,
W arg1,
org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext arg2)
throws Exception {
// TODO Auto-generated method stub
return TriggerResult.CONTINUE;
}
public static <W extends Window> TradeTrigger<W> of() {
return new TradeTrigger<>();
}
}
So basically, when flag is false, i.e. the first time, the Trigger should get fired in 20 seconds and set the flag to true. From the next time, it should get fired every 5 seconds.
The problem I am facing is, I am getting only one message in the output every time the Trigger is fired. That is, I get a single message after 20 seconds and single messages in every five seconds.
I am expecting twenty messages in the output on each triggering.
If I use .timeWindow(Time.seconds(5)) and create a time window of five seconds, I get 20 messages in output every 5 seconds.
Please help me get this code right. Is there something I am missing?
There are a few issues with your Trigger implementation:
You should never store the state of a function in a static variable. Flink does not isolate user processes in JVMs. Instead it uses a single JVM per TaskManager and starts multiple threads. Hence, your static boolean flag is shared across multiple instances of the trigger. You should store the flag in Flink's ValueState interface, which is accessible from the TriggerContext. Flink will take care of checkpointing your state and recovering it in case of a failure.
Trigger.onElement() is only called when a new event arrives, so it cannot be used to trigger a window computation at a specific time. Instead you should register an event-time timer or a processing-time timer (again via the TriggerContext). The timer will call Trigger.onEventTime() or Trigger.onProcessingTime() respectively. Whether to use event time or processing time depends on your use case.
Got it working with the help of the answer from Fabian and the Flink mailing lists.
I stored the state in a ValueState variable obtained through the TriggerContext. In the onElement() method I checked the variable and, if it was the first time, registered a processing-time timer for 20 seconds after the current time and updated the state. In the onProcessingTime() method I registered another processing-time timer for 5 seconds after the current time, updated the state, and fired the window.
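The timer-based approach described above can be sketched as a plain-Java model of one window's trigger. It is not the Flink Trigger API: TriggerModel, its fields, and the advanceTo driver are hypothetical. In a real Trigger the flag would live in ValueState obtained from the TriggerContext, and the timers would be registered through the TriggerContext as well. For simplicity the model schedules each follow-up timer 5 seconds after the timer that just fired rather than after the wall-clock time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: first firing 20 s after the first element, then every 5 s.
class TriggerModel {
    static final long FIRST_DELAY_MS = 20_000L;
    static final long PERIOD_MS = 5_000L;

    private boolean timerRegistered = false; // per-window flag (ValueState in a real Trigger)
    private long nextTimer = -1L;            // the single pending processing-time timer
    final List<Long> fireTimes = new ArrayList<>();

    // Like onElement(): only schedules the first timer, never fires by itself.
    void onElement(long now) {
        if (!timerRegistered) {
            timerRegistered = true;
            nextTimer = now + FIRST_DELAY_MS;
        }
    }

    // Like onProcessingTime(): fire, then schedule the next timer 5 s later.
    void onProcessingTime(long timerTimestamp) {
        fireTimes.add(timerTimestamp);
        nextTimer = timerTimestamp + PERIOD_MS;
    }

    // Drives the model: delivers every timer that is due up to 'now'.
    void advanceTo(long now) {
        while (timerRegistered && nextTimer <= now) {
            onProcessingTime(nextTimer);
        }
    }
}
```

Because firing is driven by timers rather than by arriving elements, every element that reached the window before a timer fires is part of that firing, which is the behaviour the question was after.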