How does Flink's TumblingEventTimeWindows avoid overlapping windows? - apache-flink

Page 211 of Stream Processing with Apache Flink contains this sentence:
“The WindowAssigner determines for each arriving element to which windows it is assigned.”
I then studied the source code of TumblingEventTimeWindows:
    public class TumblingEventTimeWindows extends WindowAssigner<Object, TimeWindow> {
        private static final long serialVersionUID = 1L;
        ...............................
        @Override
        public Collection<TimeWindow> assignWindows(
                Object element, long timestamp, WindowAssignerContext context) {
            if (timestamp > Long.MIN_VALUE) {
                if (staggerOffset == null) {
                    staggerOffset =
                            windowStagger.getStaggerOffset(context.getCurrentProcessingTime(), size);
                }
                // Long.MIN_VALUE is currently assigned when no timestamp is present
                long start =
                        TimeWindow.getWindowStartWithOffset(
                                timestamp, (globalOffset + staggerOffset) % size, size);
                return Collections.singletonList(new TimeWindow(start, start + size));
            } else {
                throw new RuntimeException(
                        "Record has Long.MIN_VALUE timestamp (= no timestamp marker). "
                                + "Is the time characteristic set to 'ProcessingTime', or did you forget to call "
                                + "'DataStream.assignTimestampsAndWatermarks(...)'?");
            }
        }
        ...............................
From the source code I can see that elements are indeed assigned to windows: new TimeWindow(start, start + size) means that each element is assigned a new TimeWindow.
But I am confused: how does TumblingEventTimeWindows make sure that windows do not overlap?
If every element is assigned a new TimeWindow, there seems to be no guarantee that the windows will not overlap. Can someone point me in the right direction on how TumblingEventTimeWindows avoids overlap?

The TimeWindow object isn't very important. It is a simple structure that holds the start and end timestamps for the window, and nothing else. Its name makes it sound important, but it's just used to encode a copy of the information describing the time interval the incoming event is being assigned to.
It's actually the WindowOperator that has the important window data. Logically it's keeping something like a map, where the keys are the intervals described by the TimeWindow objects, and the values are the lists of events assigned to those intervals.
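To make the map analogy concrete, here is a minimal plain-Java sketch with no Flink dependency. The class and method names are made up for illustration, the offset is assumed to be 0, and windowStart only mirrors the idea behind TimeWindow.getWindowStartWithOffset rather than reproducing Flink's code. It shows that every timestamp falling in the same interval yields the same start value, so all of those elements end up under the same map key, and the intervals [start, start + size) cannot overlap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative stand-in for the assigner plus the operator's "map of windows".
class TumblingAssignmentSketch {
    /** Start of the tumbling window that contains the timestamp. */
    static long windowStart(long timestamp, long offset, long size) {
        long remainder = (timestamp - offset) % size;
        // Java's % can be negative, so negative timestamps need an extra step
        return remainder < 0 ? timestamp - (remainder + size) : timestamp - remainder;
    }

    /** Groups timestamps by window start, i.e. by the interval [start, start + size). */
    static Map<Long, List<Long>> assign(long[] timestamps, long size) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestamps) {
            windows.computeIfAbsent(windowStart(ts, 0L, size), k -> new ArrayList<>()).add(ts);
        }
        return windows;
    }

    public static void main(String[] args) {
        // prints {0=[1000, 4999], 5000=[5000, 7500], 10000=[12000]}
        System.out.println(assign(new long[] {1_000L, 4_999L, 5_000L, 7_500L, 12_000L}, 5_000L));
    }
}
```

A fresh TimeWindow object per element is therefore harmless: two objects with the same start and end describe the same window, just as two equal Long keys address the same map entry here.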

Related

Absence of event in Apache Flink CEP

I'm new to Apache Flink CEP and I'm struggling to detect a simple absence of an event.
What I'm trying to detect is whether an event of type CurrencyEvent with a certain id does not occur within a certain amount of time. I would like to detect the absence of such an event every time 3000 ms pass without it occurring.
My pattern code looks as follows:
    Pattern<CurrencyEvent, ?> myPattern = Pattern.<Event>begin("CurrencyEvent")
            .subtype(CurrencyEvent.class)
            .where(new SimpleCondition<CurrencyEvent>() {
                @Override
                public boolean filter(CurrencyEvent currencyEvent) throws Exception {
                    return currencyEvent.getId().equalsIgnoreCase("usd");
                }
            })
            .within(Time.milliseconds(3000L));
So now my idea is to use timeout functions in order to detect timeout events:
    DataStreamSource<Event> events = env.addSource(new TestSource(
            Arrays.asList(
                    basicCurrencyWithMivLevelEvent("EUR", 100L, Arrays.asList("1", "2"), 200D),
                    basicCurrencyWithMivLevelEvent("USD", 100L, Arrays.asList("1", "2"), 200D),
                    basicCurrencyWithMivLevelEvent("EUR", 100L, Arrays.asList("1", "2"), 200D)
            ),
            1636040364820L, // initial timestamp for the first element
            7000 // 7 seconds between each event
    ));
    PatternStream<Event> patternStream = CEP.pattern(
            events,
            (Pattern<Event, ?>) myPattern
    );
    OutputTag<Alarm> tag = new OutputTag<Alarm>("currency-timeout"){};
    PatternFlatTimeoutFunction<Event, Alarm> eventAlarmTimeoutPatternFunction = (patterns, timestamp, ctx) -> {
        System.out.println("New alarm, since after 3 seconds an event with id=usd is not detected");
        //TODO: call collect
    };
    PatternFlatSelectFunction<Event, Alarm> eventAlarmPatternSelectFunction = (patterns, ctx) -> {
        System.out.println("Select! (we can ignore it) " + patterns);
        // ignore matched events
    };
    return patternStream.flatSelect(
            tag,
            eventAlarmTimeoutPatternFunction,
            TypeInformation.of(Alarm.class),
            eventAlarmPatternSelectFunction
    );
My test source uses event timestamps and watermarks, as follows:
    public class TestSource implements SourceFunction<Event> {
        private final List<Event> events;
        private final long initialTimestamp;
        private final long timeBetweenInMillis;
        public TestSource(List<Event> events, long initialTimestamp, long timeBetweenInMillis) {
            this.events = events;
            this.initialTimestamp = initialTimestamp;
            this.timeBetweenInMillis = timeBetweenInMillis;
        }
        @Override
        public void run(SourceContext<Event> sourceContext) throws InterruptedException {
            long timestamp = this.initialTimestamp;
            for (Event event : this.events) {
                sourceContext.collectWithTimestamp(event, timestamp);
                sourceContext.emitWatermark(new Watermark(timestamp));
                timestamp += this.timeBetweenInMillis;
            }
        }
        @Override
        public void cancel() {
        }
    }
I'm using TimeCharacteristic.EventTime.
Since the window time (3 seconds) is shorter than the event-time difference between consecutive events (7 seconds), I expect to get some timeout events, but I'm getting 0.
A CEP Pattern matches a sequence of one or more events; the within(interval) clause adds an additional constraint that all of the events in the sequence must occur within the specified interval. When partial matches time out, this can be captured in a TimedOutPartialMatchHandler.
In your case, since a successfully matched Pattern consists of a single event, there can be no partial matches, and a match can never time out. (Your matching sequences are always less than 3 seconds long.)
What you can do is to extend the pattern definition to include a second event, so that to match there must be a start event followed by another event within 3 seconds. When that second event is missing, then you will have a partial match that times out.
For more flexibility than what CEP offers for implementing use cases involving missing events, you can use a KeyedProcessFunction with timers.
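The timer-based idea can be sketched without any Flink dependency. The following plain-Java model is only an illustration under stated assumptions: the class, its methods, and the single-key setup are hypothetical, events are assumed to arrive in timestamp order, and in a real job the same bookkeeping would live in keyed state and event-time timers inside a KeyedProcessFunction. Each matching event replaces the pending timer with one 3 seconds later; if the watermark passes the timer before another event arrives, an alarm is recorded.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "raise an alarm when no event arrives for TIMEOUT_MS" for one key.
class AbsenceDetectorSketch {
    static final long TIMEOUT_MS = 3_000L;
    private static final long NO_TIMER = Long.MIN_VALUE;

    private long pendingTimer = NO_TIMER;        // at most one pending timer for the key
    final List<Long> alarms = new ArrayList<>(); // timer timestamps at which an alarm fired

    /** A matching event arrived: replace the pending timer with a fresh one. */
    void onEvent(long timestamp) {
        pendingTimer = timestamp + TIMEOUT_MS;
    }

    /** Event time advanced: fire the pending timer if it is due. */
    void onWatermark(long watermark) {
        if (pendingTimer != NO_TIMER && pendingTimer <= watermark) {
            alarms.add(pendingTimer);
            pendingTimer = NO_TIMER;
        }
    }

    /** Feeds in-order event timestamps, then a final watermark, and returns the alarms. */
    static List<Long> run(long[] eventTimestamps, long finalWatermark) {
        AbsenceDetectorSketch detector = new AbsenceDetectorSketch();
        for (long ts : eventTimestamps) {
            detector.onWatermark(ts - 1); // event time passes before the event is seen
            detector.onEvent(ts);
        }
        detector.onWatermark(finalWatermark);
        return detector.alarms;
    }

    public static void main(String[] args) {
        // the 2 s gap raises no alarm; the 7 s gap and the silent tail each raise one
        System.out.println(run(new long[] {0L, 2_000L, 9_000L}, 20_000L)); // prints [5000, 12000]
    }
}
```

This model raises one alarm per silent period. Re-arming the timer inside the firing branch would give a repeated alarm every 3 seconds instead, which is the kind of flexibility that is awkward to express as a CEP pattern.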

Flink watermark not advancing at all? Stuck at -9223372036854775808

I'm encountering an issue similar to "Flink EventTime Processing Watermark is always coming as -9223372036854725808". However, the suggested solutions (setting the parallelism and disabling checkpointing) do not have any effect. In this example, I'm simply streaming 1000 events 1 second apart, and then comparing the event timestamp to ctx.timerService().currentWatermark():
>>> v=(61538659200000,0), watermark=-9223372036854775808
>>> v=(61538659201000,1), watermark=-9223372036854775808
>>> v=(61538660198000,998), watermark=-9223372036854775808
>>> v=(61538660199000,999), watermark=-9223372036854775808
    public void watermarks()
            throws Exception
    {
        final var env = StreamExecutionEnvironment.createLocalEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setMaxParallelism(1);
        final long startMs = new Date(2020, 1, 1).getTime();
        final var events = new ArrayList<Tuple2<Long, Integer>>();
        for (var ii = 0; ii < 1000; ++ii) {
            events.add(new Tuple2<Long, Integer>(startMs + ii * 1000, ii));
        }
        env.fromCollection(events)
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<Long, Integer>>forMonotonousTimestamps()
                                .withTimestampAssigner((event, ts) -> event.f0))
                .setParallelism(1)
                .keyBy(row -> row.f1 % 2)
                .process(new ProcessFunction<Tuple2<Long, Integer>, String>()
                {
                    @Override
                    public void processElement(
                            final Tuple2<Long, Integer> value,
                            final Context ctx,
                            final Collector<String> out)
                            throws Exception
                    {
                        out.collect("v=" + value + ", watermark=" + ctx.timerService().currentWatermark());
                    }
                })
                .setParallelism(1)
                .print()
                .setParallelism(1);
        final var result = env.execute();
        System.out.println(result);
    }
forMonotonousTimestamps is a periodic watermark generator that only generates watermarks when triggered by a timer. By default this timer fires every 200 msec (this is the autoWatermarkInterval). Your job doesn't run long enough for this timer to fire.
Bounded sources do generate a watermark with its timestamp set to MAX_WATERMARK when they reach the end of their input -- just before shutting down the job. You're not seeing this watermark in the output from your job because there are no events that follow it.
If you want to generate watermarks with every event, you can implement a custom watermark strategy that emits a watermark in the onEvent method of the WatermarkGenerator. This is usually a bad idea in production, as you'll waste CPU cycles and network bandwidth on these extra watermarks, but it is sometimes helpful for testing.
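The difference between the two emission styles can be modeled in plain Java. This is a sketch under stated assumptions, not Flink code: the Generator interface is a local stand-in loosely shaped after the onEvent and onPeriodicEmit callbacks, the output is just a list of longs, and subtracting 1 from the timestamp is an assumption made so that an emitted value t can be read as "no more events with timestamp <= t are expected".

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of periodic versus per-event watermark emission.
class WatermarkEmissionSketch {
    /** Local stand-in for a generator with an event callback and a timer callback. */
    interface Generator {
        void onEvent(long eventTimestamp, List<Long> output);
        void onPeriodicEmit(List<Long> output);
    }

    /** Periodic style: only tracks the max timestamp; emits when the timer callback runs. */
    static final class Periodic implements Generator {
        private long maxTimestamp = Long.MIN_VALUE;

        @Override
        public void onEvent(long eventTimestamp, List<Long> output) {
            maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        }

        @Override
        public void onPeriodicEmit(List<Long> output) {
            if (maxTimestamp != Long.MIN_VALUE) {
                output.add(maxTimestamp - 1);
            }
        }
    }

    /** Per-event style: emits a watermark directly from onEvent. */
    static final class PerEvent implements Generator {
        @Override
        public void onEvent(long eventTimestamp, List<Long> output) {
            output.add(eventTimestamp - 1);
        }

        @Override
        public void onPeriodicEmit(List<Long> output) {
            // nothing to do: a watermark was already emitted for each event
        }
    }

    /** Feeds all events, then runs the timer callback timerTicks times. */
    static List<Long> feed(Generator generator, long[] timestamps, int timerTicks) {
        List<Long> emitted = new ArrayList<>();
        for (long ts : timestamps) {
            generator.onEvent(ts, emitted);
        }
        for (int i = 0; i < timerTicks; i++) {
            generator.onPeriodicEmit(emitted);
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] events = {1_000L, 2_000L, 3_000L};
        // zero ticks models a job that finishes before the first timer fires
        System.out.println(feed(new Periodic(), events, 0)); // prints []
        System.out.println(feed(new PerEvent(), events, 0)); // prints [999, 1999, 2999]
    }
}
```

With zero timer ticks the periodic generator has seen every event but emitted nothing, which matches the behavior described above for a job that is too short-lived for the timer to fire.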
According to source code comments:
    /**
     * Creates a new enriched {@link WatermarkStrategy} that also does idleness detection in the
     * created {@link WatermarkGenerator}.
     *
     * <p>Add an idle timeout to the watermark strategy. If no records flow in a partition of a
     * stream for that amount of time, then that partition is considered "idle" and will not hold
     * back the progress of watermarks in downstream operators.
     *
     * <p>Idleness can be important if some partitions have little data and might not have events
     * during some periods. Without idleness, these streams can stall the overall event time
     * progress of the application.
     */
    default WatermarkStrategy<T> withIdleness(Duration idleTimeout) ...
So you can try using WatermarkStrategy.forMonotonousTimestamps().withIdleness(...).

Flink counter with timestamp

I was reading the Flink example CountWithTimestamp, and below is a code snippet from the example:
        @Override
        public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, Long>> out)
                throws Exception {
            // retrieve the current count
            CountWithTimestamp current = state.value();
            if (current == null) {
                current = new CountWithTimestamp();
                current.key = value.f0;
            }
            // update the state's count
            current.count++;
            // set the state's timestamp to the record's assigned event time timestamp
            current.lastModified = ctx.timestamp();
            // write the state back
            state.update(current);
            // schedule the next timer 60 seconds from the current event time
            ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
        }
        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out)
                throws Exception {
            // get the state for the key that scheduled the timer
            CountWithTimestamp result = state.value();
            // check if this is an outdated timer or the latest timer
            if (timestamp == result.lastModified + 60000) {
                // emit the state on timeout
                out.collect(new Tuple2<String, Long>(result.key, result.count));
            }
        }
    }
My question is: if I remove the if statement timestamp == result.lastModified + 60000 in onTimer (leaving the collect statement untouched) and instead add another if statement, if (ctx.timestamp() < current.lastModified + 60000) { ctx.timerService().deleteEventTimeTimer(current.lastModified + 60000); }, at the beginning of processElement, would the semantics of the program be the same? And if the semantics are the same, is there any reason to prefer one version over the other?
You are correct to think that the implementation that deletes the timer has the same semantics. And in fact I recently changed the example used in our training materials to do just that, as I prefer this approach. The reason I find it preferable is that all of the complex business logic is then in one place (in processElement), and whenever onTimer is called, you know exactly what to do, no questions asked. Plus, it's more performant, as there are fewer timers to checkpoint and eventually trigger.
This example was written for the docs back before timers could be deleted, and hasn't been updated.
You can find the reworked example I mentioned in these slides -- https://training.ververica.com/decks/process-function/ -- once you get past the registration page.
FWIW, I also recently reworked the reference solution to the corresponding training exercise along the same lines: https://github.com/apache/flink-training/tree/master/long-ride-alerts.
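The equivalence of the two variants, and the saving in timers, can be checked with a small plain-Java simulation. Everything in it is an assumption made for illustration: a single key, events arriving in timestamp order, a TreeSet standing in for the timer service, and hypothetical class and method names. The delete variant here removes the old timer unconditionally, which has the same effect as the guarded delete in the question because removing a timer that has already fired is a no-op.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Simulates the count-with-timeout logic for one key under both timer strategies.
class TimerDeletionSketch {
    static final long TIMEOUT_MS = 60_000L;
    private static final long NO_STATE = Long.MIN_VALUE;

    private final boolean deleteStaleTimers;
    private final TreeSet<Long> timers = new TreeSet<>(); // pending event-time timers
    private long count = 0;
    private long lastModified = NO_STATE;
    final List<Long> emitted = new ArrayList<>(); // counts emitted on timeout
    int timersFired = 0;                          // how often onTimer logic ran

    TimerDeletionSketch(boolean deleteStaleTimers) {
        this.deleteStaleTimers = deleteStaleTimers;
    }

    void processElement(long timestamp) {
        if (deleteStaleTimers && lastModified != NO_STATE) {
            timers.remove(lastModified + TIMEOUT_MS); // drop the timer made obsolete by this event
        }
        count++;
        lastModified = timestamp;
        timers.add(timestamp + TIMEOUT_MS);
    }

    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.first() <= watermark) {
            long timer = timers.pollFirst();
            timersFired++;
            // without deletion, stale timers must be recognized and ignored here
            if (deleteStaleTimers || timer == lastModified + TIMEOUT_MS) {
                emitted.add(count);
            }
        }
    }

    static TimerDeletionSketch run(long[] eventTimestamps, boolean deleteStaleTimers) {
        TimerDeletionSketch sim = new TimerDeletionSketch(deleteStaleTimers);
        for (long ts : eventTimestamps) {
            sim.advanceWatermark(ts - 1);
            sim.processElement(ts);
        }
        sim.advanceWatermark(Long.MAX_VALUE); // end of input
        return sim;
    }

    public static void main(String[] args) {
        long[] events = {0L, 10_000L, 20_000L, 100_000L};
        TimerDeletionSketch check = run(events, false);
        TimerDeletionSketch delete = run(events, true);
        System.out.println(check.emitted + " after " + check.timersFired + " timers");   // [3, 4] after 4 timers
        System.out.println(delete.emitted + " after " + delete.timersFired + " timers"); // [3, 4] after 2 timers
    }
}
```

Both variants emit the same counts, but the deleting variant fires only the timers that actually matter, which is the source of the performance benefit mentioned above.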

MapState does not store the previous session with EventTimeSessionWindows in Flink java

I need to compare the previous session to averages from different sessions for the same user. I'm using MapState to keep the previous session, but somehow the MapState never contains any previous keys, so every session is treated as new. Here's my code:
SessionIdentificationProcessFunction (this is a function that gathers all the events that belong to the same session):
    static SingleOutputStreamOperator<SessionEvent> sessionUser(KeyedStream<Event, String> stream) {
        return stream.window(EventTimeSessionWindows.withGap(Time.minutes(PropertyFileReader.getGAP_SECTION())))
                .allowedLateness(Time.minutes(PropertyFileReader.getLATENCY_ALLOWED()))
                .process(new SessionIdentificationProcessFunction<Event, SessionEvent, String, TimeWindow>() {
                    @Override
                    public void open(Configuration parameters) {
                        /*state configured to live just one day to avoid garbage accumulation*/
                        StateTtlConfig ttlConfig = StateTtlConfig
                                .newBuilder(org.apache.flink.api.common.time.Time.days(1))
                                .cleanupFullSnapshot()
                                .build();
                        MapStateDescriptor<String, SessionEvent> map_descriptor = new MapStateDescriptor<>("prevMapUserSession", String.class, SessionEvent.class);
                        map_descriptor.enableTimeToLive(ttlConfig);
                        previous_user_sessions_state = getRuntimeContext().getMapState(map_descriptor);
                    }
                    @Override
                    public SessionEvent generateSessionRecord(String s, Context context, Iterable<Event> elements) {
                        // order events by timestamp: the earliest is the session start, the latest its end
                        Comparator<Event> sortFunc = (o1, o2) -> o1.timestamp.compareTo(o2.timestamp);
                        Event start = StreamSupport.stream(elements.spliterator(), false).min(sortFunc).orElse(new Event());
                        Event end = StreamSupport.stream(elements.spliterator(), false).max(sortFunc).orElse(new Event());
                        SessionEvent session_user = (end.timestamp.equals(Timestamp.from(Instant.EPOCH))) ? new SessionEvent(start) : new SessionEvent(end);
                        session_user.sessionEvents = StreamSupport.stream(elements.spliterator(), false).count();
                        session_user.sessionDuration = sd;
                        try {
                            if (previous_user_sessions_state.contains(s)) {
                                SessionEvent previous = previous_user_sessions_state.get(s);
                                /*Update values of the session with the values of the previous which never exist and delete the previous session in the map to create a new entry with the new values updated*/
                                previous_user_sessions_state.remove(s);
                            } else {
                                /*always get here and create a new session*/
                            }
                            previous_user_sessions_state.put(s, session_user);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                        return session_user;
                    }
                })
                .name("User Sessions");
    }
Without seeing how SessionIdentificationProcessFunction is implemented, I'm not sure exactly what's going wrong, but Flink's session windows are rather special, so it's not terribly surprising that this isn't working. Part of the problem is that any given session window has a very short lifetime before it is merged with another session window. (As each new event arrives it is initially assigned to its own session window, after which the set of all current session windows is processed and any possible merges are performed (based on the session gap).)
What I can recommend is rather than using getRuntimeContext().getMapState(), use context.globalState().getMapState() instead (where context is the ProcessWindowFunction.Context passed to the process() method of a ProcessWindowFunction). This globalState is a KeyedStateStore meant for precisely this purpose -- keeping keyed state that is global/shared among all window instances for that key.
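The scoping difference behind this recommendation can be illustrated with two ordinary maps. This plain-Java sketch is only a model with hypothetical names: it contrasts state addressed by a (key, window) pair with state addressed by the key alone, which is the property attributed to globalState above. It makes no claim about how the state obtained from getRuntimeContext() is scoped in the code from the question.

```java
import java.util.HashMap;
import java.util.Map;

// Models two state scopes: per (key, window) versus per key.
class StateScopeSketch {
    // Window scope: one entry per (key, window) pair.
    private final Map<String, String> windowScoped = new HashMap<>();
    // Global scope: one entry per key, shared by all windows of that key.
    private final Map<String, String> globalScoped = new HashMap<>();

    /**
     * Called once per fired session window. Stores the session in both scopes and
     * returns what each scope held before: {previousInWindowScope, previousInGlobalScope}.
     */
    String[] fire(String key, String window, String session) {
        String previousInWindow = windowScoped.put(key + "@" + window, session);
        String previousInGlobal = globalScoped.put(key, session);
        return new String[] {previousInWindow, previousInGlobal};
    }

    public static void main(String[] args) {
        StateScopeSketch state = new StateScopeSketch();
        state.fire("alice", "[0,10)", "session-1");
        String[] previous = state.fire("alice", "[50,60)", "session-2");
        System.out.println(previous[0]); // null: a new window starts with empty window-scoped state
        System.out.println(previous[1]); // session-1: the key-scoped state remembers the last session
    }
}
```

Because every session is a different window, only the key-scoped map can carry the previous session forward to the next one.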

Flink session window with getting result on end

I have Kafka messages that look something like this:
{ user: 'someUser', value: 'SomeValue', timestamp: 000000000 }
A Flink streaming job performs some count operation on those items.
Now I want to declare a session that collects all items with the same user + value within a range of X seconds into a single item carrying the latest timestamp, which is then forwarded to the next stream just once.
So I wrote something like this:
    data.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Data>() {
                .....
            })
            .keyBy(new KeySelector<Data, String>() {
                .......
            })
            .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
            .aggregate(new AggregateFunction<Data, Data, Data>() {
                @Override
                public Data createAccumulator() {
                    return null;
                }
                @Override
                public Data add(Data value, Data accumulator) {
                    if (accumulator == null) {
                        accumulator = value;
                    }
                    return accumulator;
                }
                @Override
                public Data getResult(Data accumulator) {
                    return accumulator;
                }
                @Override
                public Data merge(Data a, Data b) {
                    return a;
                }
            });
But the problem is that the getResult function is called on each element, not just at the end of the window.
My question is how to avoid forwarding the aggregation result to the next stream until the end of the window. As far as I know, with process the result also moves forward when there are no more elements, even though the window hasn't ended yet.
Any advice?
Thanks
Flink provides two distinct approaches for evaluating windows. In this case you want to use the other one.
One approach evaluates each window's contents incrementally. This is what you get with reduce and aggregate. As elements are assigned to the window, the ReduceFunction or AggregateFunction is called and that element immediately makes its contribution to the final result.
The alternative is to use process with a ProcessWindowFunction. With this approach, the window isn't evaluated until the window is complete, at which point the ProcessWindowFunction is called once with an Iterable containing all of the elements that were assigned to the window. This has the disadvantage of needing to store all of the elements until the window is triggered, and if the ProcessWindowFunction has to do a lot of work to compute its result that can temporarily disrupt the pipeline, but some calculations need to be done this way -- like counting distinct elements.
See the documentation for more info.
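The trade-off between the two approaches can be sketched in plain Java. The classes below are hypothetical stand-ins, not Flink's AggregateFunction or ProcessWindowFunction: one folds each element into a small accumulator as it arrives, the other buffers everything and computes its result once, using the distinct count mentioned above as the example that needs the full contents.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Contrasts incremental aggregation with buffered, evaluate-once processing.
class WindowEvaluationSketch {
    /** Incremental style: each element is folded into one small accumulator on arrival. */
    static final class IncrementalCount {
        private long count = 0; // the only state kept for the window

        void add(String element) {
            count++;
        }

        long getResult() {
            return count;
        }
    }

    /** Buffered style: every element is stored and the result is computed once at the end. */
    static final class BufferedDistinctCount {
        private final List<String> buffer = new ArrayList<>();

        void add(String element) {
            buffer.add(element);
        }

        int bufferedElements() {
            return buffer.size();
        }

        /** Called once when the window is complete. */
        long process() {
            return new HashSet<>(buffer).size();
        }
    }

    public static void main(String[] args) {
        String[] window = {"a", "b", "a", "c", "b"};
        IncrementalCount incremental = new IncrementalCount();
        BufferedDistinctCount buffered = new BufferedDistinctCount();
        for (String element : window) {
            incremental.add(element);
            buffered.add(element);
        }
        System.out.println(incremental.getResult()); // prints 5
        System.out.println(buffered.process());      // prints 3
    }
}
```

The incremental version holds a single long no matter how many elements arrive, while the buffered version holds all five elements until process runs, which is the storage cost described above.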
