Apache Flink 1.0.0: Event Time related migration problems

I tried to migrate a simple job to Flink 1.0.0, but it fails with the following exception:
java.lang.RuntimeException: Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?
The code consists of two separate tasks connected via a Kafka topic: one task is a simple message generator, and the other is a simple message consumer that uses timeWindowAll to calculate the per-minute message arrival rate.
Similar code worked with version 0.10.2 without any problems, but now it looks like the system wrongly interprets some event timestamps as Long.MIN_VALUE, which causes the task to fail.
Am I doing something wrong, or is this a bug that will be fixed in a future release?
The main Task:
public class Test1_0_0 {
// Max Time lag between events time to current System time
static final Time maxTimeLag = Time.of(3, TimeUnit.SECONDS);
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment
.getExecutionEnvironment();
// Setting Event Time usage
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setBufferTimeout(1);
// Properties initialization
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
// Overwrites the default properties by one provided by command line
ParameterTool parameterTool = ParameterTool.fromArgs(args);
for(Map.Entry<String, String> e: parameterTool.toMap().entrySet()) {
properties.setProperty(e.getKey(),e.getValue());
}
//properties.setProperty("auto.offset.reset", "smallest");
System.out.println("Properties: " + properties);
DataStream<Message> stream = env
.addSource(new FlinkKafkaConsumer09<Message>("test", new MessageSDSchema(), properties)).filter(message -> message != null);
// The call to rebalance() causes data to be re-partitioned so that all machines receive messages
// (for example, when the number of Kafka partitions is fewer than the number of Flink parallel instances).
stream.rebalance()
.assignTimestampsAndWatermarks(new MessageTimestampExtractor(maxTimeLag));
// Counts messages
stream.timeWindowAll(Time.minutes(1)).apply(new AllWindowFunction<Message, String, TimeWindow>() {
@Override
public void apply(TimeWindow timeWindow, Iterable<Message> values, Collector<String> collector) throws Exception {
Integer count = 0;
if (values.iterator().hasNext()) {
for (Message value : values) {
count++;
}
collector.collect("Arrived last minute: " + count);
}
}
}).print();
// execute program
env.execute("Messages Counting");
}
}
The timestamp extractor:
public class MessageTimestampExtractor implements AssignerWithPeriodicWatermarks<Message>, Serializable {
private static final long serialVersionUID = 7526472295622776147L;
// Maximum lag between the current processing time and the timestamp of an event
long maxTimeLag = 0L;
long currentWatermarkTimestamp = 0L;
public MessageTimestampExtractor() {
}
public MessageTimestampExtractor(Time maxTimeLag) {
this.maxTimeLag = maxTimeLag.toMilliseconds();
}
/**
* Assigns a timestamp to an element, in milliseconds since the Epoch.
*
* <p>The method is passed the previously assigned timestamp of the element.
* That previous timestamp may have been assigned from a previous assigner,
* by ingestion time. If the element did not carry a timestamp before, this value is
* {@code Long.MIN_VALUE}.
*
* @param message The element that the timestamp will be assigned to.
* @param previousElementTimestamp The previous internal timestamp of the element,
* or a negative value, if no timestamp has been assigned yet.
* @return The new timestamp.
*/
@Override
public long extractTimestamp(Message message, long previousElementTimestamp) {
long timestamp = message.getTimestamp();
currentWatermarkTimestamp = Math.max(timestamp, currentWatermarkTimestamp);
return timestamp;
}
/**
* Returns the current watermark. This method is periodically called by the
* system to retrieve the current watermark. The method may return null to
* indicate that no new Watermark is available.
*
* <p>The returned watermark will be emitted only if it is non-null and larger
* than the previously emitted watermark. If the current watermark is still
* identical to the previous one, no progress in event time has happened since
* the previous call to this method.
*
* <p>If this method returns a value that is smaller than the previously returned watermark,
* then the implementation does not properly handle the event stream timestamps.
* In that case, the returned watermark will not be emitted (to preserve the contract of
* ascending watermarks), and the violation will be logged and registered in the metrics.
*
* <p>The interval in which this method is called and Watermarks are generated
* depends on {@link ExecutionConfig#getAutoWatermarkInterval()}.
*
* @see org.apache.flink.streaming.api.watermark.Watermark
* @see ExecutionConfig#getAutoWatermarkInterval()
*/
@Override
public Watermark getCurrentWatermark() {
if(currentWatermarkTimestamp <= 0) {
return new Watermark(Long.MIN_VALUE);
}
return new Watermark(currentWatermarkTimestamp - maxTimeLag);
}
public long getMaxTimeLag() {
return maxTimeLag;
}
public void setMaxTimeLag(long maxTimeLag) {
this.maxTimeLag = maxTimeLag;
}
}

The problem is that calling assignTimestampsAndWatermarks returns a new DataStream which uses the timestamp extractor. Thus, you have to use the returned DataStream to perform the subsequent operations on it.
DataStream<Message> timestampStream = stream.rebalance()
.assignTimestampsAndWatermarks(new MessageTimestampExtractor(maxTimeLag));
// Counts Strings
timestampStream.timeWindowAll(Time.minutes(1)).apply(new AllWindowFunction<Message, String, TimeWindow>() {
@Override
public void apply(TimeWindow timeWindow, Iterable<Message> values, Collector<String> collector) throws Exception {
Integer count = 0;
if (values.iterator().hasNext()) {
for (Message value : values) {
count++;
}
collector.collect("Arrived last minute: " + count);
}
}
}).print();

Related

Busy time is too high for simple process function

Finally, after a month of research I found the main reason.
The main reason was IP2Location. I am using the IP2Location Java library to look up IP address locations in the BIN files, and at peak times this causes a problem. I was able to avoid the problem by passing the IP2Proxy.IOModes.IP2PROXY_MEMORY_MAPPED parameter before reading the BIN files.
I also found that a few state objects did not meet the POJO requirements, which causes high load.
I am using Flink 1.13 with 4 task managers (16 CPUs each) and 3800 tasks (the default application parallelism is 28).
In my application, one operator always has a high busy time (around 80-90%).
If I restart the Flink application, the busy time decreases, but after 5-10 hours of running it increases again.
In Grafana, I can see that the busy time for ProcessStream increases. Here is the Prometheus query: avg((avg_over_time(flink_taskmanager_job_task_busyTimeMsPerSecond[1m]))) by (task_name)
There is no backpressure in the ProcessStream task. To calculate the backpressure time, I am using: flink_taskmanager_job_task_backPressuredTimeMsPerSecond
But I couldn't find any reason for the increase.
Here is the code :
private void processOne(DataStream<KafkaObject> kafkaLog) {
kafkaLog
.filter(new FilterRequest())
.name(FilterRequest.class.getSimpleName())
.map(new MapToUserIdAndTimeStampMs())
.name(MapToUserIdAndTimeStampMs.class.getSimpleName())
.keyBy(UserObject::getUserId) // returns of type int
.process(new ProcessStream())
.name(ProcessStream.class.getSimpleName())
.addSink(...)
;
}
// ...
// ...
public class ProcessStream extends KeyedProcessFunction<Integer, UserObject, Output>
{
private static final long STATE_TIMER = 5 * 60 * 1000L; // 5 min in milliseconds
private static final int AVERAGE_REQUEST = 74;
private static final int STANDARD_DEVIATION = 32;
private static final int MINIMUM_REQUEST = 50;
private static final int THRESHOLD = 70;
private transient ValueState<Tuple2<Integer, Integer>> state;
@Override
public void open(Configuration parameters) throws Exception
{
ValueStateDescriptor<Tuple2<Integer, Integer>> stateDescriptor = new ValueStateDescriptor<Tuple2<Integer, Integer>>(
ProcessStream.class.getSimpleName(),
TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {}));
state = getRuntimeContext().getState(stateDescriptor);
}
@Override
public void processElement(UserObject value, KeyedProcessFunction<Integer, UserObject, Output>.Context ctx, Collector<Output> out) throws Exception
{
Tuple2<Integer, Integer> stateValue = state.value();
if (Objects.isNull(stateValue)) {
stateValue = Tuple2.of(1, 0);
ctx.timerService().registerProcessingTimeTimer(value.getTimestampMs() + STATE_TIMER);
}
int totalRequest = stateValue.f0;
int currentScore = stateValue.f1;
if (totalRequest >= MINIMUM_REQUEST && currentScore >= THRESHOLD)
{
out.collect({convert_to_output});
state.clear();
}
else
{
stateValue.f0 = totalRequest + 1;
stateValue.f1 = calculateNextScore(stateValue.f0);
state.update(stateValue);
}
}
private int calculateNextScore(int totalRequest)
{
return (totalRequest - AVERAGE_REQUEST ) / STANDARD_DEVIATION;
}
@Override
public void onTimer(long timestamp, KeyedProcessFunction<Integer, UserObject, Output>.OnTimerContext ctx, Collector<Output> out) throws Exception
{
state.clear();
}
}
Since you're using a timestamp value from your incoming record (value.getTimestampMs() + STATE_TIMER), you want to be running with event time, and setting watermarks based on that incoming record's timestamp. Otherwise you have no idea when the timer is actually firing, as the record's timestamp might be completely different from the current processing time.
This means you also want to use .registerEventTimeTimer().
Without these changes you might be filling up TM heap with uncleared state, which can lead to high CPU load.
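To make the mismatch concrete, here is a minimal plain-Java sketch with no Flink dependency. The class and method names are hypothetical, and it assumes that a processing-time timer whose timestamp is already in the past fires right away. It computes how long a timer registered at recordTimestamp + STATE_TIMER actually waits on the wall clock:

```java
public class TimerSkew {
    // Same 5-minute timeout as STATE_TIMER in the question.
    static final long STATE_TIMER_MS = 5 * 60 * 1000L;

    /**
     * Wall-clock delay before a processing-time timer set to
     * recordTimestamp + STATE_TIMER fires (assumption: a timer in the past fires immediately).
     */
    static long effectiveDelayMs(long recordTimestampMs, long wallClockNowMs) {
        long timerTs = recordTimestampMs + STATE_TIMER_MS;
        return Math.max(0L, timerTs - wallClockNowMs);
    }

    public static void main(String[] args) {
        long now = 1_000_000_000L;
        // Record timestamp equals the wall clock: the timer waits the full 5 minutes.
        System.out.println(effectiveDelayMs(now, now));                 // 300000
        // Record is 10 minutes old: the timer is already due, state is cleared at once.
        System.out.println(effectiveDelayMs(now - 10 * 60_000L, now));  // 0
        // Record is 1 hour ahead of the wall clock: state lingers for 65 minutes.
        System.out.println(effectiveDelayMs(now + 60 * 60_000L, now));  // 3900000
    }
}
```

With event time and registerEventTimeTimer(), the timer is instead compared against the watermark, which is derived from the same record timestamps.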

Process Function Event Time Behaviour

I am referring to the Process Function example mentioned in https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/process_function/
/**
* The data type stored in the state
*/
public class CountWithTimestamp {
public String key;
public long count;
public long lastModified;
}
/**
* The implementation of the ProcessFunction that maintains the count and timeouts
*/
public class CountWithTimeoutFunction
extends KeyedProcessFunction<Tuple, Tuple2<String, String>, Tuple2<String, Long>> {
/** The state that is maintained by this process function */
private ValueState<CountWithTimestamp> state;
@Override
public void open(Configuration parameters) throws Exception {
state = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
}
@Override
public void processElement(
Tuple2<String, String> value,
Context ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// retrieve the current count
CountWithTimestamp current = state.value();
if (current == null) {
current = new CountWithTimestamp();
current.key = value.f0;
}
// update the state's count
current.count++;
// set the state's timestamp to the record's assigned event time timestamp
current.lastModified = ctx.timestamp();
// write the state back
state.update(current);
// schedule the next timer 60 seconds from the current event time
ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}
@Override
public void onTimer(
long timestamp,
OnTimerContext ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// get the state for the key that scheduled the timer
CountWithTimestamp result = state.value();
// check if this is an outdated timer or the latest timer
if (timestamp == result.lastModified + 60000) {
// emit the state on timeout
out.collect(new Tuple2<String, Long>(result.key, result.count));
}
}
}
In this scenario my DataStream is produced by a KafkaSource with no idleness behaviour configured:
DataStream<Tuple2<Integer, Integer>> inputStream = env.fromSource(inputSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(1)), "Input Kafka Source")
Now consider a scenario where the source emits only one key, say key1.
At time T1, when the first event arrives, processElement is called and the CountWithTimestamp object is set accordingly, i.e. count = 1 and lastModified = T1.
Then there are no more events for, say, 70 seconds (until T2), when another event arrives for the same key key1.
Here are my questions:
When the second event arrives, during my debugging, processElement always gets called first and then onTimer. This is because the watermark is generated only after the event has been processed. Is my understanding correct?
Since processElement is called first, lastModified is updated to T2 (earlier it was T1). This means that even when the timer fires now, it won't emit anything, because lastModified has changed. And it will never emit if this scenario keeps repeating.
Thanks.
I believe you've got that right.
Yes, watermarks follow the events that justify their creation.
Yes, that example is flawed. It makes (unstated) assumptions about there being events for other keys.
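The single-key scenario can be replayed in plain Java, without Flink. This is a simplified sketch under stated assumptions: the class name is hypothetical, there is only one key, and the watermark is modeled as event timestamp minus the out-of-orderness bound minus 1, emitted right after each event is processed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class StaleTimerDemo {
    static final long TIMEOUT_MS = 60_000L;
    static final long OUT_OF_ORDERNESS_MS = 1_000L;

    /** Replays event timestamps for one key and returns the timers that emitted a result. */
    static List<Long> run(long[] eventTimestamps) {
        TreeSet<Long> timers = new TreeSet<>();
        List<Long> emitted = new ArrayList<>();
        long lastModified = Long.MIN_VALUE;
        for (long ts : eventTimestamps) {
            // processElement: update the state, then register the timeout timer.
            lastModified = ts;
            timers.add(ts + TIMEOUT_MS);
            // The watermark follows the event that justifies it.
            long watermark = ts - OUT_OF_ORDERNESS_MS - 1;
            while (!timers.isEmpty() && timers.first() <= watermark) {
                long timer = timers.pollFirst();
                // onTimer: only a timer matching the latest update emits.
                if (timer == lastModified + TIMEOUT_MS) {
                    emitted.add(timer);
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Events at T1 = 0 and T2 = 70 s: the T1 timer fires after lastModified became T2.
        System.out.println(run(new long[] {0L, 70_000L})); // prints []
    }
}
```

In this model a timer can only fire while a later event of the same key is being handled, so it is always stale by then. With events for other keys, the watermark could advance between the two events and the first timer would emit.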

Flink DataStream sort program does not output

I have written a small test case code in Flink to sort a datastream. The code is as follows:
public enum StreamSortTest {
;
public static class MyProcessWindowFunction extends ProcessWindowFunction<Long,Long,Integer, TimeWindow> {
@Override
public void process(Integer key, Context ctx, Iterable<Long> input, Collector<Long> out) {
List<Long> sortedList = new ArrayList<>();
for(Long i: input){
sortedList.add(i);
}
Collections.sort(sortedList);
sortedList.forEach(l -> out.collect(l));
}
}
public static void main(final String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);
DataStream<Long> probeSource = env.fromSequence(1, 500).setParallelism(2);
// range partition the stream into two parts based on data value
DataStream<Long> sortOutput =
probeSource
.keyBy(x->{
if(x<250){
return 1;
} else {
return 2;
}
})
.window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
.process(new MyProcessWindowFunction())
;
sortOutput.print();
System.out.println(env.getExecutionPlan());
env.executeAsync();
}
}
However, the code just outputs the execution plan and a few other lines. But it doesn't output the actual sorted numbers. What am I doing wrong?
The main problem I can see is that you are using a processing-time window with very short input data, which will almost certainly be processed in less than 20 seconds. Flink is able to detect the end of input (for a stream from a file or a sequence, as in your case) and generate a Long.MAX_VALUE watermark, which closes all open event-time windows and fires all event-time timers. It doesn't do the same for processing-time computations, so in your case you need to make sure yourself that the job runs long enough for the window to close, or switch to a custom trigger or a different time characteristic.
One other thing I am not sure about, since I have never used it much, is whether you should use executeAsync for local execution. According to the docs, it is meant for situations where you don't want to wait for the result of the job.
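As a rough illustration of the timing argument, the plain-Java sketch below (hypothetical names, no Flink dependency) assumes a tumbling window aligned to multiples of its size that fires once the wall clock reaches the window's last millisecond. It checks whether a job is still running at that moment:

```java
public class ProcessingTimeWindowCheck {
    /** Last millisecond of the tumbling window containing the given time (offset 0 assumed). */
    static long windowMaxTimestamp(long timeMs, long sizeMs) {
        return timeMs - (timeMs % sizeMs) + sizeMs - 1;
    }

    /** True if a job starting at startMs and running for runtimeMs lives long enough to fire the window. */
    static boolean firesBeforeJobEnds(long startMs, long runtimeMs, long sizeMs) {
        return startMs + runtimeMs >= windowMaxTimestamp(startMs, sizeMs);
    }

    public static void main(String[] args) {
        long size = 20_000L; // the 20-second window from the question
        // A job that starts 5 s into a window and finishes after 1 s never reaches the window end.
        System.out.println(firesBeforeJobEnds(5_000L, 1_000L, size));  // false
        // The same job kept alive for 15 s does reach it.
        System.out.println(firesBeforeJobEnds(5_000L, 15_000L, size)); // true
    }
}
```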

Flink window function getResult not fired

I am trying to use event time in my Flink job, using BoundedOutOfOrdernessTimestampExtractor to extract timestamps and generate watermarks.
But my Kafka input is a sparse stream that can have no data for a long time, so getResult in the AggregateFunction is never called. I can see data going into the add function.
I have set getEnv().getConfig().setAutoWatermarkInterval(1000L);
I tried
eventsWithKey
.keyBy(entry -> (String) entry.get(key))
.window(TumblingEventTimeWindows.of(Time.minutes(windowInMinutes)))
.allowedLateness(WINDOW_LATENESS)
.aggregate(new CountTask(basicMetricTags, windowInMinutes))
also session window
eventsWithKey
.keyBy(entry -> (String) entry.get(key))
.window(EventTimeSessionWindows.withGap(Time.seconds(30)))
.aggregate(new CountTask(basicMetricTags, windowInMinutes))
All the watermark metrics show No Watermark.
How can I make Flink cope with the missing watermarks?
FYI, this is commonly referred to as the "idle source" problem. This occurs because whenever a Flink operator has two or more inputs, its watermark is the minimum of the watermarks from its inputs. If one of those inputs stalls, its watermark no longer advances.
Note that Flink does not have per-key watermarking -- a given operator is typically multiplexed across events for many keys. So long as some events are flowing through a given task's input streams, its watermark will advance, and event time timers for idle keys will still fire. For this "idle source" problem to occur, a task has to have an input stream that has become completely idle.
If you can arrange for it, the best solution is to have your data sources include keepalive events. This will allow you to advance your watermarks with confidence, knowing that the source is simply idle, rather than, for example, offline.
If that's not possible, and if you have some sources that aren't idle, then you could put a rebalance() in front of the BoundedOutOfOrdernessTimestampExtractor (and before the keyBy), so that every instance continues to receive some events and can advance its watermark. This comes at the expense of an extra network shuffle.
Perhaps the most commonly used solution is to use a watermark generator that detects idleness and artificially advances the watermark based on a processing time timer. ProcessingTimeTrailingBoundedOutOfOrdernessTimestampExtractor is an example of that.
A watermark strategy with idleness detection was introduced in Flink 1.11. Flink ignores inputs that are marked idle when calculating the minimum watermark, so a single partition with data is enough for the watermark to advance.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/api/java/org/apache/flink/api/common/eventtime/WatermarksWithIdleness.html
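The "minimum over inputs" rule and the effect of idleness can be sketched in plain Java. This is only a model of the idea, not Flink's internal implementation. The class name is hypothetical, and treating the all-idle case as "no watermark" is an assumption of this sketch.

```java
public class MinWatermark {
    /**
     * Combined watermark of an operator: the minimum over all inputs that are not idle.
     * Returns Long.MIN_VALUE ("no watermark") if every input is idle (assumption of this sketch).
     */
    static long combined(long[] watermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (idle[i]) {
                continue; // idle inputs do not hold the watermark back
            }
            anyActive = true;
            min = Math.min(min, watermarks[i]);
        }
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] watermarks = {Long.MIN_VALUE, 5_000L}; // input 0 has never produced an event
        // Without idleness detection the silent input pins the result at Long.MIN_VALUE.
        System.out.println(combined(watermarks, new boolean[] {false, false}));
        // Once input 0 is marked idle, the active input alone decides.
        System.out.println(combined(watermarks, new boolean[] {true, false})); // 5000
    }
}
```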
I have the same issue: a source that may be inactive for a long time.
The solution below is based on WatermarksWithIdleness.
It is a standalone Flink job that demonstrates the concept.
package com.demo.playground.flink.sleepysrc;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.eventtime.WatermarksWithIdleness;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.time.Duration;
public class SleepyJob {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final EventGenerator eventGenerator = new EventGenerator();
WatermarkStrategy<Event> strategy = WatermarkStrategy.
<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5)).
withIdleness(Duration.ofSeconds(Constants.IDLE_TIME_SEC)).
withTimestampAssigner((event, timestamp) -> event.timestamp);
final DataStream<Event> events = env.addSource(eventGenerator).assignTimestampsAndWatermarks(strategy);
KeyedStream<Event, String> eventStringKeyedStream = events.keyBy((Event event) -> event.id);
WindowedStream<Event, String, TimeWindow> windowedStream = eventStringKeyedStream.window(EventTimeSessionWindows.withGap(Time.milliseconds(Constants.SESSION_WINDOW_GAP)));
windowedStream.allowedLateness(Time.milliseconds(1000));
SingleOutputStreamOperator<Object> result = windowedStream.process(new ProcessWindowFunction<Event, Object, String, TimeWindow>() {
@Override
public void process(String s, Context context, Iterable<Event> events, Collector<Object> collector) {
int counter = 0;
for (Event e : events) {
Utils.print(++counter + ") inside process: " + e);
}
Utils.print("--- Process Done ----");
}
});
result.print();
env.execute("Sleepy flink src demo");
}
private static class Event {
public Event(String id) {
this.timestamp = System.currentTimeMillis();
this.eventData = "not_important_" + this.timestamp;
this.id = id;
}
@Override
public String toString() {
return "Event{" +
"id=" + id +
", timestamp=" + timestamp +
", eventData='" + eventData + '\'' +
'}';
}
public String id;
public long timestamp;
public String eventData;
}
private static class EventGenerator implements SourceFunction<Event> {
@Override
public void run(SourceContext<Event> ctx) throws Exception {
/**
* Here is the sleepy src - after NUM_OF_EVENTS events are collected , the code goes to a SHORT_SLEEP_TIME sleep
* We would like to detect this inactivity and FIRE the window
*/
int counter = 0;
while (running) {
String id = Long.toString(System.currentTimeMillis());
Utils.print(String.format("Generating %d events with id %s", 2 * Constants.NUM_OF_EVENTS, id));
while (counter < Constants.NUM_OF_EVENTS) {
Event event = new Event(id);
ctx.collect(event);
counter++;
Thread.sleep(Constants.VERY_SHORT_SLEEP_TIME);
}
// here we create a delay:
// a time of inactivity where
// we would like to FIRE the window
Thread.sleep(Constants.SHORT_SLEEP_TIME);
counter = 0;
while (counter < Constants.NUM_OF_EVENTS) {
Event event = new Event(id);
ctx.collect(event);
counter++;
Thread.sleep(Constants.VERY_SHORT_SLEEP_TIME);
}
Thread.sleep(Constants.LONG_SLEEP_TIME);
}
}
@Override
public void cancel() {
this.running = false;
}
private volatile boolean running = true;
}
private static final class Constants {
public static final int VERY_SHORT_SLEEP_TIME = 300;
public static final int SHORT_SLEEP_TIME = 8000;
public static final int IDLE_TIME_SEC = 5;
public static final int LONG_SLEEP_TIME = SHORT_SLEEP_TIME * 5;
public static final long SESSION_WINDOW_GAP = 60 * 1000;
public static final int NUM_OF_EVENTS = 4;
}
private static final class Utils {
public static void print(Object obj) {
System.out.println(new java.util.Date() + " > " + obj);
}
}
}
For others, make sure there's data coming out of all your topics' partitions if you're using Kafka
I know it sounds dumb, but in my case I had a single source and the problem was still happening, because I was testing with very little data in a single Kafka topic (single source) that had 10 partitions. The dataset was so small that some of the topic's partitions did not have anything to give and, although I had only one source (the one topic), Flink did not increase the Watermark.
The moment I switched my source to a topic with a single partition the Watermark started to advance.

Dynamic flink window creation by reading the details from kafka

Let's say Kafka messages contain the Flink window size configuration.
I want to read the message from Kafka and create a global window in Flink.
Problem Statement:
Can we handle the above scenario by using BroadcastStream ?
Or
Any other approach which will support the above case ?
Flink's window API does not support dynamically changing window sizes.
What you'll need to do is to implement your own windowing using a process function. In this case a KeyedBroadcastProcessFunction, where the window configuration is broadcast.
You can examine the Flink training for an example of how to implement time windows with a KeyedProcessFunction (copied below):
public class PseudoWindow extends KeyedProcessFunction<String, KeyedDataPoint<Double>, KeyedDataPoint<Integer>> {
// Keyed, managed state, with an entry for each window.
// There is a separate MapState object for each sensor.
private MapState<Long, Integer> countInWindow;
boolean eventTimeProcessing;
int durationMsec;
/**
* Create the KeyedProcessFunction.
* @param eventTime whether or not to use event time processing
* @param durationMsec window length
*/
public PseudoWindow(boolean eventTime, int durationMsec) {
this.eventTimeProcessing = eventTime;
this.durationMsec = durationMsec;
}
@Override
public void open(Configuration config) {
MapStateDescriptor<Long, Integer> countDesc =
new MapStateDescriptor<>("countInWindow", Long.class, Integer.class);
countInWindow = getRuntimeContext().getMapState(countDesc);
}
@Override
public void processElement(
KeyedDataPoint<Double> dataPoint,
Context ctx,
Collector<KeyedDataPoint<Integer>> out) throws Exception {
long endOfWindow = setTimer(dataPoint, ctx.timerService());
Integer count = countInWindow.get(endOfWindow);
if (count == null) {
count = 0;
}
count += 1;
countInWindow.put(endOfWindow, count);
}
public long setTimer(KeyedDataPoint<Double> dataPoint, TimerService timerService) {
long time;
if (eventTimeProcessing) {
time = dataPoint.getTimeStampMs();
} else {
time = System.currentTimeMillis();
}
long endOfWindow = (time - (time % durationMsec) + durationMsec - 1);
if (eventTimeProcessing) {
timerService.registerEventTimeTimer(endOfWindow);
} else {
timerService.registerProcessingTimeTimer(endOfWindow);
}
return endOfWindow;
}
@Override
public void onTimer(long timestamp, OnTimerContext context, Collector<KeyedDataPoint<Integer>> out) throws Exception {
// Get the timestamp for this timer and use it to look up the count for that window
long ts = context.timestamp();
KeyedDataPoint<Integer> result = new KeyedDataPoint<>(context.getCurrentKey(), ts, countInWindow.get(ts));
out.collect(result);
countInWindow.remove(timestamp);
}
}
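The window-end arithmetic in setTimer is the part that would change with a dynamic configuration. Here is a minimal plain-Java sketch of that formula with the duration passed per call. The class name is hypothetical. In a KeyedBroadcastProcessFunction the duration would presumably be read from broadcast state instead of a constructor argument, which is an assumption and not shown here.

```java
public class WindowEnd {
    /** Last millisecond of the window of the given length that contains timeMs. */
    static long endOfWindow(long timeMs, long durationMs) {
        return timeMs - (timeMs % durationMs) + durationMs - 1;
    }

    public static void main(String[] args) {
        long eventTime = 125_000L;
        // The same event falls into different windows depending on the configured size.
        System.out.println(endOfWindow(eventTime, 60_000L)); // 179999 (window 120000..179999)
        System.out.println(endOfWindow(eventTime, 10_000L)); // 129999 (window 120000..129999)
    }
}
```

Because the MapState is keyed by the window end, counts collected under an old window size stay under their old end timestamps after the size changes. How to handle those in-flight windows is a design decision this sketch leaves open.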

Resources