I have Kafka messages that follow a pattern like this:
{ user: 'someUser', value: 'SomeValue', timestamp: 000000000 }
with a Flink stream calculation that performs some count action on those items.
Now I want to declare a session that collects items with the same user + value within a range of X seconds into a single element with the latest timestamp, so that it is forwarded to the next stream only once.
So I wrote something like this:
data.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Data>() {
        .....
    })
    .keyBy(new KeySelector<Data, String>() {
        .......
    })
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .aggregate(new AggregateFunction<Data, Data, Data>() {
        @Override
        public Data createAccumulator() {
            return null;
        }

        @Override
        public Data add(Data value, Data accumulator) {
            if (accumulator == null) {
                accumulator = value;
            }
            return accumulator;
        }

        @Override
        public Data getResult(Data accumulator) {
            return accumulator;
        }

        @Override
        public Data merge(Data a, Data b) {
            return a;
        }
    });
But the problem is that the getResult function is called on each element, not just at the end of the window.
My problem is how to avoid forwarding the aggregation result to the next stream until the end of the window. As far as I know, a process stream also forwards its result when there are no more elements, even though the window hasn't ended yet.
Any advice?
Thanks
Flink provides two distinct approaches for evaluating windows. In this case you want to use the other one.
One approach evaluates each window's contents incrementally. This is what you get with reduce and aggregate. As elements are assigned to the window, the ReduceFunction or AggregateFunction is called and that element immediately makes its contribution to the final result.
The alternative is to use process with a ProcessWindowFunction. With this approach, the window isn't evaluated until the window is complete, at which point the ProcessWindowFunction is called once with an Iterable containing all of the elements that were assigned to the window. This has the disadvantage of needing to store all of the elements until the window is triggered, and if the ProcessWindowFunction has to do a lot of work to compute its result that can temporarily disrupt the pipeline, but some calculations need to be done this way -- like counting distinct elements.
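For the session question above, that could look something like the following sketch (assuming a Data POJO with public user, value, and long timestamp fields; the key selector and the 10-second gap are illustrative, not taken from the question; timestamps and watermarks are assigned upstream as before):

import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

data.keyBy(d -> d.user + "|" + d.value)
    .window(EventTimeSessionWindows.withGap(Time.seconds(10)))
    .process(new ProcessWindowFunction<Data, Data, String, TimeWindow>() {
        @Override
        public void process(String key, Context ctx, Iterable<Data> elements, Collector<Data> out) {
            // keep only the element with the latest timestamp from the buffered session
            Data latest = null;
            for (Data d : elements) {
                if (latest == null || d.timestamp > latest.timestamp) {
                    latest = d;
                }
            }
            out.collect(latest); // emitted exactly once, when the session window fires
        }
    });

This emits a single record per user/value session, and only after the session gap has elapsed.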
See the documentation for more info.
There is this text in Stream Processing with Apache Flink, page 211:
“The WindowAssigner determines for each arriving element to which windows it is assigned.”
Then I studied the source code of TumblingEventTimeWindows:
public class TumblingEventTimeWindows extends WindowAssigner<Object, TimeWindow> {
    private static final long serialVersionUID = 1L;
    ...............................

    @Override
    public Collection<TimeWindow> assignWindows(
            Object element, long timestamp, WindowAssignerContext context) {
        if (timestamp > Long.MIN_VALUE) {
            if (staggerOffset == null) {
                staggerOffset =
                        windowStagger.getStaggerOffset(context.getCurrentProcessingTime(), size);
            }
            // Long.MIN_VALUE is currently assigned when no timestamp is present
            long start =
                    TimeWindow.getWindowStartWithOffset(
                            timestamp, (globalOffset + staggerOffset) % size, size);
            return Collections.singletonList(new TimeWindow(start, start + size));
        } else {
            throw new RuntimeException(
                    "Record has Long.MIN_VALUE timestamp (= no timestamp marker). "
                            + "Is the time characteristic set to 'ProcessingTime', or did you forget to call "
                            + "'DataStream.assignTimestampsAndWatermarks(...)'?");
        }
    }

    ...............................
}
From the source code I can see that it is true that elements are assigned to windows: new TimeWindow(start, start + size) means each element is assigned a new TimeWindow object.
But I am confused: how does TumblingEventTimeWindows achieve this without overlap?
If every element is assigned a new TimeWindow, there is no guarantee that the windows will not overlap. Can someone point me in the direction of how TumblingEventTimeWindows achieves non-overlapping windows?
The TimeWindow object isn't very important. It is a simple structure that holds the start and end timestamps for the window, and nothing else. Its name makes it sound important, but it's just used to encode a copy of the information describing the time interval the incoming event is being assigned to.
It's actually the WindowOperator that has the important window data. Logically it's keeping something like a map, where the keys are the intervals described by the TimeWindow objects, and the values are the lists of events assigned to those intervals.
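The non-overlap comes from TimeWindow.getWindowStartWithOffset, which rounds every timestamp down to an aligned window boundary, so all timestamps within the same size-long interval produce equal TimeWindow objects. A hypothetical illustration (values chosen for the example, no offset or stagger):

long size = 10_000; // 10-second tumbling windows

// both timestamps round down to the same aligned start
long start1 = TimeWindow.getWindowStartWithOffset(3_000, 0, size); // 0
long start2 = TimeWindow.getWindowStartWithOffset(7_000, 0, size); // 0

// events at t=3s and t=7s both land in [0, 10000), while an event at
// t=12s lands in [10000, 20000); the intervals are disjoint by construction

Two equal TimeWindow objects describe the same map key in the WindowOperator, so the events end up in the same window state even though each got its own TimeWindow instance.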
I'm new to Apache Flink CEP and I'm struggling to detect a simple absence of an event.
What I'm trying to detect is whether an event of type CurrencyEvent with a certain id does not occur within a certain amount of time. I would like to detect the absence of such an event every time that, after 3000ms, the event has not occurred.
My pattern code looks as follows:
Pattern<Event, ?> myPattern = Pattern.<Event>begin("CurrencyEvent")
        .subtype(CurrencyEvent.class)
        .where(new SimpleCondition<CurrencyEvent>() {
            @Override
            public boolean filter(CurrencyEvent currencyEvent) throws Exception {
                return currencyEvent.getId().equalsIgnoreCase("usd");
            }
        })
        .within(Time.milliseconds(3000L));
So now my idea is to use timeout functions in order to detect timeout events:
DataStreamSource<Event> events = env.addSource(new TestSource(
        Arrays.asList(
                basicCurrencyWithMivLevelEvent("EUR", 100L, Arrays.asList("1", "2"), 200D),
                basicCurrencyWithMivLevelEvent("USD", 100L, Arrays.asList("1", "2"), 200D),
                basicCurrencyWithMivLevelEvent("EUR", 100L, Arrays.asList("1", "2"), 200D)
        ),
        1636040364820L, // initial timestamp for the first element
        7000 // 7 seconds between each event
));

PatternStream<Event> patternStream = CEP.pattern(events, myPattern);

OutputTag<Alarm> tag = new OutputTag<Alarm>("currency-timeout"){};

PatternFlatTimeoutFunction<Event, Alarm> eventAlarmTimeoutPatternFunction = (patterns, timestamp, ctx) -> {
    System.out.println("New alarm, since after 3 seconds an event with id=usd is not detected");
    //TODO: call collect
};

PatternFlatSelectFunction<Event, Alarm> eventAlarmPatternSelectFunction = (patterns, ctx) -> {
    System.out.println("Select! (we can ignore it) " + patterns);
    // ignore matched events
};

return patternStream.flatSelect(
        tag,
        eventAlarmTimeoutPatternFunction,
        TypeInformation.of(Alarm.class),
        eventAlarmPatternSelectFunction
);
My TestSource uses event timestamps and watermarks, as shown below:
public class TestSource implements SourceFunction<Event> {
    private final List<Event> events;
    private final long initialTimestamp;
    private final long timeBetweenInMillis;

    public TestSource(List<Event> events, long initialTimestamp, long timeBetweenInMillis) {
        this.events = events;
        this.initialTimestamp = initialTimestamp;
        this.timeBetweenInMillis = timeBetweenInMillis;
    }

    @Override
    public void run(SourceContext<Event> sourceContext) throws InterruptedException {
        long timestamp = this.initialTimestamp;
        for (Event event : this.events) {
            sourceContext.collectWithTimestamp(event, timestamp);
            sourceContext.emitWatermark(new Watermark(timestamp));
            timestamp += this.timeBetweenInMillis;
        }
    }

    @Override
    public void cancel() {
    }
}
I'm using TimeCharacteristic.EventTime.
Since the window time (3 seconds) is lower than the event time difference between each event (7 seconds), I expect to get some timeout events, but I'm getting 0.
A CEP Pattern matches a sequence of one or more events; the within(interval) clause adds an additional constraint that all of the events in the sequence must occur within the specified interval. When partial matches time out, this can be captured in a TimedOutPartialMatchHandler.
In your case, since a successfully matched Pattern consists of a single event, there can be no partial matches, and a match can never time out. (Your matching sequences are always less than 3 seconds long.)
What you can do is to extend the pattern definition to include a second event, so that to match there must be a start event followed by another event within 3 seconds. When that second event is missing, then you will have a partial match that times out.
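A sketch of what that extension might look like (the step names and the permissive second condition are illustrative assumptions, not your code):

Pattern<Event, ?> myPattern = Pattern.<Event>begin("currency")
        .subtype(CurrencyEvent.class)
        .where(new SimpleCondition<CurrencyEvent>() {
            @Override
            public boolean filter(CurrencyEvent e) {
                return e.getId().equalsIgnoreCase("usd");
            }
        })
        .next("follower") // a full match now requires a second event...
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return true; // ...of any kind
            }
        })
        .within(Time.milliseconds(3000L)); // or the partial match times out after 3s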
For more flexibility than what CEP offers for implementing use cases involving missing events, you can use a KeyedProcessFunction with timers.
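A minimal sketch of that approach (the Alarm constructor and keying by currency id are assumptions, not your code): each event cancels the previous timer and registers a new one 3 seconds out, so a timer only fires when no follow-up event arrived in time.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class MissingEventDetector extends KeyedProcessFunction<String, Event, Alarm> {

    private transient ValueState<Long> timerState;

    @Override
    public void open(Configuration parameters) {
        timerState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("timer", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Alarm> out) throws Exception {
        Long oldTimer = timerState.value();
        if (oldTimer != null) {
            // the expected event arrived in time; cancel the pending alarm
            ctx.timerService().deleteEventTimeTimer(oldTimer);
        }
        long newTimer = ctx.timestamp() + 3000;
        ctx.timerService().registerEventTimeTimer(newTimer);
        timerState.update(newTimer);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alarm> out) throws Exception {
        // no event arrived within 3 seconds of the last one
        out.collect(new Alarm(ctx.getCurrentKey(), timestamp)); // hypothetical Alarm constructor
        timerState.clear();
    }
}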
I was reading the Flink example CountWithTimestamp, and below is a code snippet from the example:
@Override
public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    // retrieve the current count
    CountWithTimestamp current = state.value();
    if (current == null) {
        current = new CountWithTimestamp();
        current.key = value.f0;
    }

    // update the state's count
    current.count++;

    // set the state's timestamp to the record's assigned event time timestamp
    current.lastModified = ctx.timestamp();

    // write the state back
    state.update(current);

    // schedule the next timer 60 seconds from the current event time
    ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    // get the state for the key that scheduled the timer
    CountWithTimestamp result = state.value();

    // check if this is an outdated timer or the latest timer
    if (timestamp == result.lastModified + 60000) {
        // emit the state on timeout
        out.collect(new Tuple2<String, Long>(result.key, result.count));
    }
}
}
My question is: if I remove the if statement timestamp == result.lastModified + 60000 in onTimer (with the collect statement untouched), and instead add another if statement, if (ctx.timestamp() < current.lastModified + 60000) { deleteEventTimeTimer(current.lastModified + 60000) }, at the beginning of processElement, would the semantics of the program be the same? Is there any preference for one version over the other, given the same semantics?
You are correct to think that the implementation that deletes the timer has the same semantics. And in fact I recently changed the example used in our training materials to do just that, as I prefer this approach. The reason I find it preferable is that all of the complex business logic is then in one place (in processElement), and whenever onTimer is called, you know exactly what to do, no questions asked. Plus, it's more performant, as there are fewer timers to checkpoint and eventually trigger.
This example was written for the docs back before timers could be deleted, and hasn't been updated.
You can find the reworked example I mentioned in these slides -- https://training.ververica.com/decks/process-function/ -- once you get past the registration page.
FWIW, I also recently reworked the reference solution to the corresponding training exercise along the same lines: https://github.com/apache/flink-training/tree/master/long-ride-alerts.
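For reference, a sketch of that delete-timer variant, using the same state and types as the snippet above:

@Override
public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    CountWithTimestamp current = state.value();
    if (current == null) {
        current = new CountWithTimestamp();
        current.key = value.f0;
    } else {
        // cancel the timer registered by the previous element for this key
        ctx.timerService().deleteEventTimeTimer(current.lastModified + 60000);
    }
    current.count++;
    current.lastModified = ctx.timestamp();
    state.update(current);
    ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    // only the most recently registered timer can still fire,
    // so no staleness check is needed here
    CountWithTimestamp result = state.value();
    out.collect(new Tuple2<>(result.key, result.count));
}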
I need to compare the previous session to averages from different sessions for the same user. I'm using MapState to keep the previous session, but somehow the MapState never contains any previous keys, so every session is new. Here's my code:
SessionIdentificationProcessFunction (this is a function that gathers all the events that belong to the same session):
static SingleOutputStreamOperator<SessionEvent> sessionUser(KeyedStream<Event, String> stream) {
    return stream.window(EventTimeSessionWindows.withGap(Time.minutes(PropertyFileReader.getGAP_SECTION())))
            .allowedLateness(Time.minutes(PropertyFileReader.getLATENCY_ALLOWED()))
            .process(new SessionIdentificationProcessFunction<Event, SessionEvent, String, TimeWindow>() {

                @Override
                public void open(Configuration parameters) {
                    /* state configured to live just one day to avoid garbage accumulation */
                    StateTtlConfig ttlConfig = StateTtlConfig
                            .newBuilder(org.apache.flink.api.common.time.Time.days(1))
                            .cleanupFullSnapshot()
                            .build();
                    MapStateDescriptor<String, SessionEvent> map_descriptor = new MapStateDescriptor<>("prevMapUserSession", String.class, SessionEvent.class);
                    map_descriptor.enableTimeToLive(ttlConfig);
                    previous_user_sessions_state = getRuntimeContext().getMapState(map_descriptor);
                }

                @Override
                public SessionEvent generateSessionRecord(String s, Context context, Iterable<Event> elements) {
                    Comparator<Event> sortFunc = (o1, o2) -> o1.timestamp.compareTo(o2.timestamp);
                    Event start = StreamSupport.stream(elements.spliterator(), false).min(sortFunc).orElse(new Event());
                    Event end = StreamSupport.stream(elements.spliterator(), false).max(sortFunc).orElse(new Event());
                    SessionEvent session_user = (end.timestamp.equals(Timestamp.from(Instant.EPOCH))) ? new SessionEvent(start) : new SessionEvent(end);
                    session_user.sessionEvents = StreamSupport.stream(elements.spliterator(), false).count();
                    session_user.sessionDuration = sd;
                    try {
                        if (previous_user_sessions_state.contains(s)) {
                            SessionEvent previous = previous_user_sessions_state.get(s);
                            /* update the session with the values of the previous session (which never
                               exists) and delete the previous session from the map, to create a new
                               entry with the updated values */
                            previous_user_sessions_state.remove(s);
                        } else {
                            /* it always gets here and creates a new session */
                        }
                        previous_user_sessions_state.put(s, session_user);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    return session_user;
                }
            })
            .name("User Sessions");
}
Without seeing how SessionIdentificationProcessFunction is implemented, I'm not sure exactly what's going wrong, but Flink's session windows are rather special, so it's not terribly surprising that this isn't working. Part of the problem is that any given session window has a very short lifetime before it is merged with another session window. (As each new event arrives it is initially assigned to its own session window, after which the set of all current session windows is processed and any possible merges are performed (based on the session gap).)
What I can recommend is rather than using getRuntimeContext().getMapState(), use context.globalState().getMapState() instead (where context is the ProcessWindowFunction.Context passed to the process() method of a ProcessWindowFunction). This globalState is a KeyedStateStore meant for precisely this purpose -- keeping keyed state that is global/shared among all window instances for that key.
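As a sketch (names assumed, not your SessionIdentificationProcessFunction), using globalState from a plain ProcessWindowFunction looks like this:

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class CompareToPreviousSession extends ProcessWindowFunction<Event, SessionEvent, String, TimeWindow> {

    private final MapStateDescriptor<String, SessionEvent> descriptor =
            new MapStateDescriptor<>("prevMapUserSession", String.class, SessionEvent.class);

    @Override
    public void process(String key, Context context, Iterable<Event> elements, Collector<SessionEvent> out)
            throws Exception {
        // keyed state shared across all window instances for this key,
        // so it survives the merging that session windows do
        MapState<String, SessionEvent> previousSessions = context.globalState().getMapState(descriptor);

        SessionEvent current = buildSessionRecord(key, elements); // hypothetical helper
        SessionEvent previous = previousSessions.get(key);
        if (previous != null) {
            // compare the current session against the previous one here
        }
        previousSessions.put(key, current);
        out.collect(current);
    }
}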
I want three values: aggValueInLastHour, aggValueInLastDay, and aggValueInLastThreeDay.
I've tried sliding windows, but I don't want to wait; I'd prefer not to use a sliding window for this aggregation. (A 3-day window must wait for three days' data, which is unbearable for our system.)
How can I get the last 3 days' aggregation value when the first event comes?
Thanks for any advice in advance!
If you want to get more frequent updates, you can use QueryableState, polling the state at a rate that suits your use case.
You can make use of the ContinuousEventTimeTrigger, which will cause your window to fire on a shorter time period than the full window, allowing you to see the intermediate state. You can optionally wrap it in a PurgingTrigger if the downstream consumers of your sink expect each output to be a partial aggregation (rather than the full current state) and sum the outputs up themselves.
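A sketch of the trigger wiring (the key selector and aggregate function are assumptions):

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger;
import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger;

stream.keyBy(e -> e.getKey())
      // a long window that would otherwise only emit after 3 days
      .window(TumblingEventTimeWindows.of(Time.days(3)))
      // fire every hour with the intermediate state instead; wrap in
      // PurgingTrigger.of(...) if downstream should receive hourly deltas
      // rather than the running total
      .trigger(ContinuousEventTimeTrigger.of(Time.hours(1)))
      .aggregate(new MyAggregate()); // hypothetical AggregateFunction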
I've tried CEP. Code:
AfterMatchSkipStrategy strategy = AfterMatchSkipStrategy.skipShortOnes();

Pattern<RiskEvent, ?> loginPattern = Pattern.<RiskEvent>begin("start", strategy)
        .where(eventTypeCondition)
        .timesOrMore(1)
        .greedy()
        .within(Time.hours(1));

KeyedStream<RiskEvent, String> keyedStream = dataStream.keyBy(new KeySelector<RiskEvent, String>() {
    @Override
    public String getKey(RiskEvent riskEvent) throws Exception {
        // key by user for aggregation
        return riskEvent.getEventType() + riskEvent.getDeviceFp();
    }
});

PatternStream<RiskEvent> eventPatternStream = CEP.pattern(keyedStream, loginPattern);

eventPatternStream.select(new PatternSelectFunction<RiskEvent, RiskResult>() {
    @Override
    public RiskResult select(Map<String, List<RiskEvent>> map) throws Exception {
        List<RiskEvent> list = map.get("start");
        ArrayList<Long> times = new ArrayList<>();
        for (RiskEvent riskEvent : list) {
            times.add(riskEvent.getEventTime());
        }
        Long min = Collections.min(times);
        Long max = Collections.max(times);
        Set<String> accountList = list.stream().map(RiskEvent::getUserName).collect(Collectors.toSet());
        logger.info("Time range: " + new Date(min) + " --- " + new Date(max)
                + " Event: " + list.get(0).getEventType()
                + ", Device fingerprint: " + list.get(0).getDeviceFp()
                + ", Associated accounts: " + accountList.toString());
        return null;
    }
});
Maybe you noticed that the skip strategy skipShortOnes is a customized strategy. Here are my modifications to the CEP lib.
First, add the new strategy to the SkipStrategy enum:
public enum SkipStrategy {
    NO_SKIP,
    SKIP_PAST_LAST_EVENT,
    SKIP_TO_FIRST,
    SKIP_TO_LAST,
    SKIP_SHORT_ONES
}
Then add an access method in AfterMatchSkipStrategy.java:
public static AfterMatchSkipStrategy skipShortOnes() {
    return new AfterMatchSkipStrategy(SkipStrategy.SKIP_SHORT_ONES);
}
Finally, add the strategy's actions in the discardComputationStatesAccordingToStrategy method in NFA.java:
case SKIP_SHORT_ONES:
    int i = 0;
    List<Map<String, List<T>>> tempResult = new ArrayList<>(matchedResult);
    for (Map<String, List<T>> resultMap : tempResult) {
        if (i++ == 0) {
            continue;
        }
        // keep only the first match (the longest one, thanks to greedy())
        // and discard the shorter ones
        matchedResult.remove(resultMap);
    }
    break;