I am referring to the Process Function example mentioned in https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/process_function/
/**
 * The data type stored in the state
 */
public class CountWithTimestamp {
    public String key;
    public long count;
    public long lastModified;
}

/**
 * The implementation of the ProcessFunction that maintains the count and timeouts
 */
public class CountWithTimeoutFunction
        extends KeyedProcessFunction<Tuple, Tuple2<String, String>, Tuple2<String, Long>> {

    /** The state that is maintained by this process function */
    private ValueState<CountWithTimestamp> state;

    @Override
    public void open(Configuration parameters) throws Exception {
        state = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
    }

    @Override
    public void processElement(
            Tuple2<String, String> value,
            Context ctx,
            Collector<Tuple2<String, Long>> out) throws Exception {

        // retrieve the current count
        CountWithTimestamp current = state.value();
        if (current == null) {
            current = new CountWithTimestamp();
            current.key = value.f0;
        }

        // update the state's count
        current.count++;

        // set the state's timestamp to the record's assigned event time timestamp
        current.lastModified = ctx.timestamp();

        // write the state back
        state.update(current);

        // schedule the next timer 60 seconds from the current event time
        ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
    }

    @Override
    public void onTimer(
            long timestamp,
            OnTimerContext ctx,
            Collector<Tuple2<String, Long>> out) throws Exception {

        // get the state for the key that scheduled the timer
        CountWithTimestamp result = state.value();

        // check if this is an outdated timer or the latest timer
        if (timestamp == result.lastModified + 60000) {
            // emit the state on timeout
            out.collect(new Tuple2<String, Long>(result.key, result.count));
        }
    }
}
In this scenario my DataStream is produced by a KafkaSource with no idleness behaviour configured:
DataStream<Tuple2<Integer, Integer>> inputStream = env.fromSource(inputSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(1)), "Input Kafka Source");
Now consider a scenario where only one key is being emitted by the source, let's say key1.
At time T1, when the first event comes in, processElement is called and the CountWithTimestamp object is set accordingly, i.e. count = 1 and lastModified = T1.
Then there are no more events for, let's say, 70 seconds (T2). After that, another event comes in for the same key key1.
Now here are my questions:
When the second event comes in, during my debugging, processElement always gets called first, then onTimer. This is because the watermark is generated only after the event has been processed. Is my understanding correct?
Since processElement is called first, lastModified is updated to T2 (it was T1 earlier). This means that even when the timer now fires, it won't emit anything, because lastModified has been updated. And it never will if the scenario above keeps repeating.
Thanks.
I believe you've got that right.
Yes, watermarks follow the events that justify their creation.
Yes, that example is flawed. It makes (unstated) assumptions about there being events for other keys.
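One way to repair the example for this single-key scenario is to delete the previously registered timer whenever a newer event arrives, so only the most recent timer can ever fire. A minimal sketch (my adaptation, not from the linked docs), reusing the state and types from the example above:
@Override
public void processElement(
        Tuple2<String, String> value,
        Context ctx,
        Collector<Tuple2<String, Long>> out) throws Exception {

    CountWithTimestamp current = state.value();
    if (current == null) {
        current = new CountWithTimestamp();
        current.key = value.f0;
    } else {
        // remove the timer scheduled by the previous element; otherwise it
        // would linger even though it can no longer pass the onTimer check
        ctx.timerService().deleteEventTimeTimer(current.lastModified + 60000);
    }
    current.count++;
    current.lastModified = ctx.timestamp();
    state.update(current);
    ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}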
Related
NOTE: As per David's suggestion, I have updated my Flink project to v1.12.3. With v1.12.3, Flink seems to have made some improvements, and my problem is now resolved.
My current Flink application runs with 48 task slots on 3 nodes, and I am using RocksDB as the state backend. (I do not care about Flink's savepoint and checkpoint mechanisms here; I am just creating state with roughly a 5-minute TTL.)
However, memory consumption on all nodes keeps increasing, and I have to stop the Flink application via stop-cluster.sh and then start it again.
I have many keyed streams based on the client IP address, and millions of users visit my site daily.
Some of the keyed streams use StateTtlConfig while others use the onTimer mechanism.
My assumption about the memory consumption (or leak) is this: does calling registerProcessingTimeTimer create an entry that is held in memory, so that with many IP addresses I end up with many entries and ever-increasing memory consumption?
Should I remove the onTimer solution and use only StateTtlConfig? (I am using the onTimer method because with StateTtlConfig every update of the state also refreshes the TTL, which leaves invalid data in my application.)
Examples of the state management code:
// EXAMPLE FOR STATETTLCONFIG
public class State1 extends KeyedProcessFunction<Tuple, ..., ...> {
    private transient ValueState<Integer> state;

    @Override
    public void open(Configuration parameters) throws Exception {
        StateTtlConfig ttlConfig = StateTtlConfig
            .newBuilder(Time.minutes(2))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .cleanupInBackground()
            .build();

        ValueStateDescriptor<Integer> valueStateDesc = new ValueStateDescriptor<Integer>(
            ..
        );
        valueStateDesc.enableTimeToLive(ttlConfig);
        state = getRuntimeContext().getState(valueStateDesc);
    }

    @Override
    public void processElement(LogObject value, Context ctx, Collector<LogObject> out) throws Exception {
        Integer stateVal = state.value(); // note: the field is named "state"
        // do something and update state
    }
}
// EXAMPLE FOR ONTIMER METHOD
public class State2 extends KeyedProcessFunction<Tuple, ..., ...> {
    private transient ValueState<Integer> state;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Integer> stateDesc = new ValueStateDescriptor<>(
            ...
        );
        state = getRuntimeContext().getState(stateDesc);
    }

    @Override
    public void processElement(LogObject value, Context ctx, Collector<LogObject> out) throws Exception {
        Integer stateVal = state.value();
        if (stateVal == null) {
            stateVal = 0;
            ctx.timerService().registerProcessingTimeTimer(value.getTimestamp() + 5 * 60 * 1000); // 5 minutes
        }
        stateVal++;
        // do something and update state
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<LogObject> out) {
        state.clear(); // clear the state itself, not the local stateVal
    }
}
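Regarding the timer question: each call to registerProcessingTimeTimer does create an entry in the timer service (kept on the heap or in RocksDB, depending on configuration), and timers are only deduplicated per key and timestamp, so registering a fresh timer per element does grow that set. A hedged sketch of one mitigation, keeping the pending timer's timestamp in extra state (the pendingTimer field is my assumption, not part of the code above) so stale timers can be deleted:
private transient ValueState<Long> pendingTimer; // assumed extra state, registered in open()

@Override
public void processElement(LogObject value, Context ctx, Collector<LogObject> out) throws Exception {
    Long previous = pendingTimer.value();
    if (previous != null) {
        // drop the stale timer so at most one timer per key is pending
        ctx.timerService().deleteProcessingTimeTimer(previous);
    }
    long next = ctx.timerService().currentProcessingTime() + 5 * 60 * 1000;
    ctx.timerService().registerProcessingTimeTimer(next);
    pendingTimer.update(next);
    // ... rest of the element handling
}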
I have a simple application like this (inside a keyed process function).
As you can see in the code section below, I always first get the timerObject from state, and if it does not exist, I create a new one and update the state. Thus, there should never be an empty/null state.
Basically, this state just keeps the time the object was last seen, for example:
If an object was seen at 10:15, the registered timer will be for 10:30.
However, if the object is seen again at 10:25, the registered timer is updated to 10:40.
If the process function runs onTimer at 10:40, that means there was no object within a 15-minute interval, so I just clear my state.
The problem is that the logger sometimes prints null for the state object. This should not be the case, right?
public class ProcessRule extends KeyedProcessFunction<Tuple, LogEntity, Result> {
    private static final Logger LOGGER = LoggerFactory.getLogger(ProcessRule.class);

    private transient ValueState<TimerObject> timerState;

    @Override
    public void open(Configuration parameters) throws Exception {
        ValueStateDescriptor<TimerObject> timerValueStateDescriptor = new ValueStateDescriptor<TimerObject>(
                "timerStateForProcessRule",
                TypeInformation.of(TimerObject.class)
        );
        timerState = getRuntimeContext().getState(timerValueStateDescriptor);
    }

    @Override
    public void processElement(LogEntity value, Context ctx, Collector<Result> out) throws Exception {
        registerTimer(value, ctx);
        if (conditionTrue) {
            // convert to Result and add to collector
        }
    }

    private void registerTimer(LogEntity element, Context ctx) throws Exception {
        TimerObject stateTimer = timerState.value();
        if (stateTimer == null) {
            stateTimer = new TimerObject();
            long timeInterval = 15 * 60 * 1000;
            stateTimer.setTimeInterval(timeInterval);
        }
        stateTimer.setCurrentTimeInMilliseconds(element.getTimestampMs());
        timerState.update(stateTimer);
        ctx.timerService().registerProcessingTimeTimer(stateTimer.getNextTimer());
        // getNextTimer => currentTime + timeInterval
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Result> out) throws Exception {
        TimerObject stateTimer = timerState.value();
        LOGGER.info("Timer fired at timestamp: {} for: {}", timestamp, stateTimer);
        timerState.clear();
    }
}
The issue here most probably comes from the fact that you are registering multiple different timers but never deleting them when you register a new one. So when the first timer fires, timerState is cleared, but the next timer may still fire as well, since it may have been registered to fire, say, 3 seconds after the first one, and at that point timerState may already be null.
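A minimal sketch of that fix (assuming TimerObject.getNextTimer() deterministically returns currentTime + timeInterval, as the comment in the code suggests): delete the previously scheduled timer before registering the new one, so at most one timer per key is ever pending.
private void registerTimer(LogEntity element, Context ctx) throws Exception {
    TimerObject stateTimer = timerState.value();
    if (stateTimer == null) {
        stateTimer = new TimerObject();
        stateTimer.setTimeInterval(15 * 60 * 1000);
    } else {
        // remove the timer scheduled by the previous element; otherwise it
        // would still fire and observe an already-cleared (null) state
        ctx.timerService().deleteProcessingTimeTimer(stateTimer.getNextTimer());
    }
    stateTimer.setCurrentTimeInMilliseconds(element.getTimestampMs());
    timerState.update(stateTimer);
    ctx.timerService().registerProcessingTimeTimer(stateTimer.getNextTimer());
}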
I am trying to calculate the highest count of hashtags in a given tumbling window.
For this I do a kind of "word count" for hashtags and sum them up. This works fine. After that, I try to find the hashtag with the highest count in the given window. I use a RichFlatMapFunction for this and a ValueState to save the current maximum count of a single hashtag, but this doesn't work.
I have debugged my code and found that the value of the ValueState "maxVal" is null in every flatMap step, so the update() and value() methods don't work in my scenario.
Have I misunderstood the RichFlatMapFunction or its usage?
Here is my code; everything except the last flatMap function works as expected:
public class TwitterHashtagCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);

        DataStream<String> tweetsRaw = env.addSource(new TwitterSource(TwitterConnection.getTwitterConnectionProperties()));
        DataStream<String> tweetsGerman = tweetsRaw.filter(new EnglishLangFilter());

        DataStream<Tuple2<String, Integer>> tweetHashtagCount = tweetsGerman
            .flatMap(new TwitterHashtagFlatMap())
            .keyBy(0)
            .timeWindow(Time.seconds(15))
            .sum(1)
            .flatMap(new RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                private transient ValueState<Integer> maxVal;

                @Override
                public void open(Configuration parameters) throws Exception {
                    ValueStateDescriptor<Integer> descriptor =
                        new ValueStateDescriptor<>(
                            // state name
                            "max-val",
                            // type information of state
                            TypeInformation.of(Integer.class));
                    maxVal = getRuntimeContext().getState(descriptor);
                }

                @Override
                public void flatMap(Tuple2<String, Integer> value, Collector<Tuple2<String, Integer>> out) throws Exception {
                    Integer maxCount = maxVal.value();
                    if (maxCount == null) {
                        maxCount = 0;
                        maxVal.update(0);
                    }
                    if (value.f1 > maxCount) {
                        maxVal.update(maxCount);
                        out.collect(new Tuple2<String, Integer>(value.f0, value.f1));
                    }
                }
            });

        tweetHashtagCount.print();
        env.execute("Twitter Streaming WordCount");
    }
}
I'm wondering why the code you've shared runs at all. The result of sum(1) is a non-keyed stream, and the managed state interface you are using expects a keyed stream and keeps a separate instance of the state for each key. I'm surprised you're not getting an error saying "Keyed state can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation."
Since you've previously windowed the stream, if you do key it again (with the same key) before the RichFlatMapFunction, each key will occur only once per window, and maxVal will always be null.
Something like this might do what you intend, if your goal is to find the max across all hashtags in each time window:
tweetsGerman
    .flatMap(new TwitterHashtagFlatMap())
    .keyBy(0)
    .timeWindow(Time.seconds(15))
    .sum(1)
    .timeWindowAll(Time.seconds(15))
    .max(1)
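(Worth noting: timeWindowAll produces a non-parallel all-windowed stream, so the final max(1) runs with parallelism 1. That is usually acceptable here, because sum(1) has already reduced each window to one record per hashtag before the all-window step.)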
I want to create a Trigger which fires after 20 seconds the first time and every five seconds after that. I have used GlobalWindows and a custom Trigger:
val windowedStream = valueStream
  .keyBy(0)
  .window(GlobalWindows.create())
  .trigger(TradeTrigger.of())
Here is the code in TradeTrigger:
@PublicEvolving
public class TradeTrigger<W extends Window> extends Trigger<Object, W> {
    private static final long serialVersionUID = 1L;

    static boolean flag = false;
    static long ctime = System.currentTimeMillis();

    private TradeTrigger() {
    }

    @Override
    public TriggerResult onElement(
            Object arg0,
            long arg1,
            W arg2,
            org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext arg3)
            throws Exception {
        if (flag == false) {
            if ((System.currentTimeMillis() - ctime) >= 20000) {
                flag = true;
                ctime = System.currentTimeMillis();
                return TriggerResult.FIRE;
            }
            return TriggerResult.CONTINUE;
        } else {
            if ((System.currentTimeMillis() - ctime) >= 5000) {
                ctime = System.currentTimeMillis();
                return TriggerResult.FIRE;
            }
            return TriggerResult.CONTINUE;
        }
    }

    @Override
    public TriggerResult onEventTime(
            long arg0,
            W arg1,
            org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext arg2)
            throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(
            long arg0,
            W arg1,
            org.apache.flink.streaming.api.windowing.triggers.Trigger.TriggerContext arg2)
            throws Exception {
        return TriggerResult.CONTINUE;
    }

    public static <W extends Window> TradeTrigger<W> of() {
        return new TradeTrigger<>();
    }
}
So basically, when flag is false, i.e. the first time, the Trigger should fire after 20 seconds and set the flag to true. From then on, it should fire every 5 seconds.
The problem I am facing is that I get only one message in the output every time the Trigger fires: a single message after 20 seconds and single messages every five seconds after that.
I am expecting twenty messages in the output on each firing.
If I use .timeWindow(Time.seconds(5)) and create a time window of five seconds, I get 20 messages in the output every 5 seconds.
Please help me get this code right. Is there something I am missing?
There are a few issues with your Trigger implementation:
You should never store the state of a function in a static variable. Flink does not isolate user code in separate JVMs; instead it uses a single JVM per TaskManager and starts multiple threads. Hence, your static boolean flag is shared across multiple trigger instances. You should store the flag in Flink's ValueState interface, which is accessible through the TriggerContext. Flink will take care of checkpointing your state and recovering it in case of a failure.
Trigger.onElement() is only called when a new event arrives, so it cannot be used to trigger a window computation at a specific time. Instead, you should register an event-time or processing-time timer (again via the TriggerContext). The timer will call Trigger.onEventTime() or Trigger.onProcessingTime(), respectively. Whether to use event or processing time depends on your use case.
Got it working with the help of Fabian's answer and the Flink mailing lists.
I stored the state in a ValueState variable obtained through the TriggerContext. In onElement() I checked the variable, and if it was the first time, registered a processing-time timer for 20 seconds past the current time and updated the state. In onProcessingTime() I registered another processing-time timer for 5 seconds past the current time, updated the state, and fired the window.
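A minimal sketch of that solution (the class shape and state name are my assumptions; the intervals are hard-coded to match the question). The "first time" flag lives in partitioned state obtained from the TriggerContext, and the firing schedule is driven by processing-time timers rather than wall-clock checks inside onElement():
public class TradeTrigger<W extends Window> extends Trigger<Object, W> {
    private static final long serialVersionUID = 1L;

    // per-key/window flag instead of a static field
    private final ValueStateDescriptor<Boolean> startedDesc =
            new ValueStateDescriptor<>("trade-trigger-started", TypeInformation.of(Boolean.class));

    @Override
    public TriggerResult onElement(Object element, long timestamp, W window, TriggerContext ctx)
            throws Exception {
        ValueState<Boolean> started = ctx.getPartitionedState(startedDesc);
        if (started.value() == null) {
            // first element for this key: schedule the initial firing in 20 seconds
            ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + 20000);
            started.update(true);
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
        // re-arm a timer 5 seconds from now, then fire the window with its full contents
        ctx.registerProcessingTimeTimer(time + 5000);
        return TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(W window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(startedDesc).clear();
    }
}
Note that FIRE (rather than FIRE_AND_PURGE) emits the whole window contents on every firing, which matches the expectation of seeing all accumulated messages each time.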
I have tried to migrate a simple task to Flink 1.0.0, but it fails with the following exception:
java.lang.RuntimeException: Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?
The code consists of two separate tasks connected via a Kafka topic, where one task is a simple message generator and the other a simple message consumer which uses timeWindowAll to calculate the per-minute message arrival rate.
Similar code worked with version 0.10.2 without any problems, but now it looks like the system wrongly interprets some event timestamps as Long.MIN_VALUE, which causes the task to fail.
The question is: am I doing something wrong, or is this a bug that will be fixed in future releases?
The main Task:
public class Test1_0_0 {
// Max Time lag between events time to current System time
static final Time maxTimeLag = Time.of(3, TimeUnit.SECONDS);
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment
.getExecutionEnvironment();
// Setting Event Time usage
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setBufferTimeout(1);
// Properties initialization
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
// Overwrites the default properties by one provided by command line
ParameterTool parameterTool = ParameterTool.fromArgs(args);
for(Map.Entry<String, String> e: parameterTool.toMap().entrySet()) {
properties.setProperty(e.getKey(),e.getValue());
}
//properties.setProperty("auto.offset.reset", "smallest");
System.out.println("Properties: " + properties);
DataStream<Message> stream = env
.addSource(new FlinkKafkaConsumer09<Message>("test", new MessageSDSchema(), properties)).filter(message -> message != null);
// The call to rebalance() causes data to be re-partitioned so that all machines receive messages
// (for example, when the number of Kafka partitions is fewer than the number of Flink parallel instances).
stream.rebalance()
.assignTimestampsAndWatermarks(new MessageTimestampExtractor(maxTimeLag));
// Counts messages
stream.timeWindowAll(Time.minutes(1)).apply(new AllWindowFunction<Message, String, TimeWindow>() {
#Override
public void apply(TimeWindow timeWindow, Iterable<Message> values, Collector<String> collector) throws Exception {
Integer count = 0;
if (values.iterator().hasNext()) {
for (Message value : values) {
count++;
}
collector.collect("Arrived last minute: " + count);
}
}
}).print();
// execute program
env.execute("Messages Counting");
}
}
The timestamp extractor:
public class MessageTimestampExtractor implements AssignerWithPeriodicWatermarks<Message>, Serializable {
    private static final long serialVersionUID = 7526472295622776147L;

    // Maximum lag between the current processing time and the timestamp of an event
    long maxTimeLag = 0L;
    long currentWatermarkTimestamp = 0L;

    public MessageTimestampExtractor() {
    }

    public MessageTimestampExtractor(Time maxTimeLag) {
        this.maxTimeLag = maxTimeLag.toMilliseconds();
    }

    /**
     * Assigns a timestamp to an element, in milliseconds since the Epoch.
     *
     * <p>The method is passed the previously assigned timestamp of the element.
     * That previous timestamp may have been assigned from a previous assigner,
     * by ingestion time. If the element did not carry a timestamp before, this value is
     * {@code Long.MIN_VALUE}.
     *
     * @param message The element that the timestamp will be assigned to.
     * @param previousElementTimestamp The previous internal timestamp of the element,
     *                                 or a negative value, if no timestamp has been assigned yet.
     * @return The new timestamp.
     */
    @Override
    public long extractTimestamp(Message message, long previousElementTimestamp) {
        long timestamp = message.getTimestamp();
        currentWatermarkTimestamp = Math.max(timestamp, currentWatermarkTimestamp);
        return timestamp;
    }

    /**
     * Returns the current watermark. This method is periodically called by the
     * system to retrieve the current watermark. The method may return null to
     * indicate that no new Watermark is available.
     *
     * <p>The returned watermark will be emitted only if it is non-null and larger
     * than the previously emitted watermark. If the current watermark is still
     * identical to the previous one, no progress in event time has happened since
     * the previous call to this method.
     *
     * <p>If this method returns a value that is smaller than the previously returned watermark,
     * then the implementation does not properly handle the event stream timestamps.
     * In that case, the returned watermark will not be emitted (to preserve the contract of
     * ascending watermarks), and the violation will be logged and registered in the metrics.
     *
     * <p>The interval in which this method is called and Watermarks are generated
     * depends on {@link ExecutionConfig#getAutoWatermarkInterval()}.
     *
     * @see org.apache.flink.streaming.api.watermark.Watermark
     * @see ExecutionConfig#getAutoWatermarkInterval()
     */
    @Override
    public Watermark getCurrentWatermark() {
        if (currentWatermarkTimestamp <= 0) {
            return new Watermark(Long.MIN_VALUE);
        }
        return new Watermark(currentWatermarkTimestamp - maxTimeLag);
    }

    public long getMaxTimeLag() {
        return maxTimeLag;
    }

    public void setMaxTimeLag(long maxTimeLag) {
        this.maxTimeLag = maxTimeLag;
    }
}
The problem is that calling assignTimestampsAndWatermarks returns a new DataStream that uses the timestamp extractor; it does not modify the stream in place. Thus, you have to perform the subsequent operations on the returned DataStream:
DataStream<Message> timestampStream = stream.rebalance()
    .assignTimestampsAndWatermarks(new MessageTimestampExtractor(maxTimeLag));

// Counts messages
timestampStream.timeWindowAll(Time.minutes(1)).apply(new AllWindowFunction<Message, String, TimeWindow>() {
    @Override
    public void apply(TimeWindow timeWindow, Iterable<Message> values, Collector<String> collector) throws Exception {
        Integer count = 0;
        if (values.iterator().hasNext()) {
            for (Message value : values) {
                count++;
            }
            collector.collect("Arrived last minute: " + count);
        }
    }
}).print();