I'm trying to aggregate two streams like this:
val joinedStream = finishResultStream.keyBy(_.searchId)
.connect(startResultStream.keyBy(_.searchId))
.process(new SomeCoProcessFunction)
and then work on them in the SomeCoProcessFunction class like this:
class SomeCoProcessFunction extends CoProcessFunction[SearchFinished, SearchCreated, SearchAggregated] {

  override def processElement1(finished: SearchFinished, ctx: CoProcessFunction[SearchFinished, SearchCreated, SearchAggregated]#Context, out: Collector[SearchAggregated]): Unit = {
    // aggregating some "finished" data ...
  }

  override def processElement2(created: SearchCreated, ctx: CoProcessFunction[SearchFinished, SearchCreated, SearchAggregated]#Context, out: Collector[SearchAggregated]): Unit = {
    val timerService = ctx.timerService()
    timerService.registerEventTimeTimer(System.currentTimeMillis + 5000)
    // aggregating some "created" data ...
  }

  override def onTimer(timestamp: Long, ctx: CoProcessFunction[SearchFinished, SearchCreated, SearchAggregated]#OnTimerContext, out: Collector[SearchAggregated]): Unit = {
    val watermark: Long = ctx.timerService().currentWatermark()
    println(s"watermark!!!! $watermark")
    // clean up the state
  }
}
What I want is to clean up the state after a certain time (5000 milliseconds), and that is what onTimer should be used for. But since it never gets fired, I'm asking myself: what am I doing wrong here?
Thanks in advance for any hint.
UPDATE:
The solution was to register the timer like this (thanks to both fabian-hueske and Beckham):
timerService.registerProcessingTimeTimer(timerService.currentProcessingTime() + 5000)
I still haven't really figured out what timerService.registerEventTimeTimer does; the watermark from ctx.timerService().currentWatermark() always shows -9223372036854775808 (Long.MinValue), no matter how long ago the event-time timer was registered.
I see that you are using System.currentTimeMillis, which might not match the TimeCharacteristic (event time, processing time, ingestion time) that your Flink job uses.
Try getting the timestamp of the event with ctx.timestamp() and then adding the 5000 ms on top of it.
The problem is that you are registering an event-time timer (timerService.registerEventTimeTimer) with a processing-time timestamp (System.currentTimeMillis + 5000).
System.currentTimeMillis returns the current machine time, but event time is not based on machine time; it is based on the time computed from watermarks.
Either register a processing-time timer, or register an event-time timer with an event-time timestamp. You can get the timestamp of the current watermark or of the current record from the Context object that is passed as a parameter to processElement1() and processElement2().
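For illustration, here is a minimal sketch of both options, reusing the types from the question; this method body would replace the processElement2 shown above (the 5000 ms delay is just the value from the question):

override def processElement2(created: SearchCreated,
                             ctx: CoProcessFunction[SearchFinished, SearchCreated, SearchAggregated]#Context,
                             out: Collector[SearchAggregated]): Unit = {
  val timerService = ctx.timerService()

  // Option 1: processing-time timer, fires roughly 5 seconds of wall-clock time later
  timerService.registerProcessingTimeTimer(timerService.currentProcessingTime() + 5000)

  // Option 2: event-time timer, fires once the watermark passes this timestamp.
  // ctx.timestamp() is the event-time timestamp of the current record; it is only set
  // if timestamps and watermarks are assigned upstream of this operator.
  // timerService.registerEventTimeTimer(ctx.timestamp() + 5000)

  // aggregating some "created" data ...
}

With the event-time variant, the timer fires when the watermark reaches the registered timestamp, which also explains why currentWatermark() stays at Long.MinValue if no watermarks are being generated.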
Related
First and foremost:
I'm kind of new to Flink (I understand the principles and am able to create any basic streaming job I need).
I'm using Kinesis Analytics to run my Flink job, and by default it uses incremental checkpointing with a 1-minute interval.
The Flink job reads events from a Kinesis stream using a FlinkKinesisConsumer and a custom deserializer (it deserializes the bytes into a simple Java object which is used throughout the job).
What I would like to achieve is simply counting how many events of ENTITY_ID/FOO and ENTITY_ID/BAR there are for the past 24 hours. It is important that this count is as accurate as possible, which is why I'm using this Flink feature instead of doing a running sum myself on a 5-minute tumbling window.
I also want to have a count of 'TOTAL' events from the start (and not just for the past 24 hours), so I also output the count of events for the past 5 minutes, so that the post-processing app can simply take these 5-minute counts and do a running sum. (This count doesn't have to be accurate, and it's OK if there is an outage and I lose some counts.)
Now, this job was working pretty well up until last week, when we had a surge (10 times more) in traffic. From that point on, Flink went bananas.
The checkpoint size slowly grew from ~500 MB to 20 GB, and checkpoint times were around 1 minute and growing over time.
The application started failing and was never able to fully recover, and the event iterator age shot up and never went back down, so no new events were being consumed.
Since I'm new to Flink, I'm not entirely sure whether the way I'm doing the sliding count is completely unoptimised or just plain wrong.
This is a small snippet of the key part of the code:
The source (MyJsonDeserializationSchema extends AbstractDeserializationSchema and simply reads the bytes and creates the Event object):
SourceFunction<Event> source =
new FlinkKinesisConsumer<>("input-kinesis-stream", new MyJsonDeserializationSchema(), kinesisConsumerConfig);
The input event, a simple Java POJO which is used in the Flink operators:
public class Event implements Serializable {
    public String entityId;
    public String entityType;
    public String entityName;
    public long eventTimestamp = System.currentTimeMillis();
}
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Event> eventsStream = kinesis
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(30)) {
        @Override
        public long extractTimestamp(Event event) {
            return event.eventTimestamp;
        }
    });
DataStream<Event> fooStream = eventsStream
    .filter(new FilterFunction<Event>() {
        @Override
        public boolean filter(Event event) throws Exception {
            return "foo".equalsIgnoreCase(event.entityType);
        }
    });

DataStream<Event> barStream = eventsStream
    .filter(new FilterFunction<Event>() {
        @Override
        public boolean filter(Event event) throws Exception {
            return "bar".equalsIgnoreCase(event.entityType);
        }
    });
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

Table fooTable = tEnv.fromDataStream(fooStream, "entityId, entityName, entityType, eventTimestamp.rowtime");
tEnv.registerTable("Foo", fooTable);

Table barTable = tEnv.fromDataStream(barStream, "entityId, entityName, entityType, eventTimestamp.rowtime");
tEnv.registerTable("Bar", barTable);
Table slidingFooCountTable = fooTable
    .window(Slide.over("24.hour").every("5.minute").on("eventTimestamp").as("minuteWindow"))
    .groupBy("entityId, entityName, minuteWindow")
    .select("concat(concat(entityId,'_'), entityName) as slidingFooId, entityId as slidingFooEntityId, entityName as slidingFooEntityName, entityType.count as slidingFooCount, minuteWindow.rowtime as slidingFooMinute");

Table slidingBarCountTable = barTable
    .window(Slide.over("24.hour").every("5.minute").on("eventTimestamp").as("minuteWindow"))
    .groupBy("entityId, entityName, minuteWindow")
    .select("concat(concat(entityId,'_'), entityName) as slidingBarId, entityId as slidingBarEntityId, entityName as slidingBarEntityName, entityType.count as slidingBarCount, minuteWindow.rowtime as slidingBarMinute");
Table tumblingFooCountTable = fooTable
    .window(Tumble.over(tumblingWindowTime).on("eventTimestamp").as("minuteWindow"))
    .groupBy("entityId, entityName, minuteWindow")
    .select("concat(concat(entityId,'_'), entityName) as tumblingFooId, entityId as tumblingFooEntityId, entityName as tumblingFooEntityName, entityType.count as tumblingFooCount, minuteWindow.rowtime as tumblingFooMinute");

Table tumblingBarCountTable = barTable
    .window(Tumble.over(tumblingWindowTime).on("eventTimestamp").as("minuteWindow"))
    .groupBy("entityId, entityName, minuteWindow")
    .select("concat(concat(entityId,'_'), entityName) as tumblingBarId, entityId as tumblingBarEntityId, entityName as tumblingBarEntityName, entityType.count as tumblingBarCount, minuteWindow.rowtime as tumblingBarMinute");
Table aggregatedTable = slidingFooCountTable
    .leftOuterJoin(slidingBarCountTable, "slidingFooId = slidingBarId && slidingFooMinute = slidingBarMinute")
    .leftOuterJoin(tumblingBarCountTable, "slidingFooId = tumblingBarId && slidingFooMinute = tumblingBarMinute")
    .leftOuterJoin(tumblingFooCountTable, "slidingFooId = tumblingFooId && slidingFooMinute = tumblingFooMinute")
    .select("slidingFooMinute as timestamp, slidingFooEntityId as entityId, slidingFooEntityName as entityName, slidingFooCount, slidingBarCount, tumblingFooCount, tumblingBarCount");
DataStream<Result> result = tEnv.toAppendStream(aggregatedTable, Result.class);
result.addSink(sink); // write to an output stream to be picked up by a lambda function
I would greatly appreciate it if someone with more experience working with Flink could comment on the way I have done my counting. Is my code completely over-engineered? Is there a better and more efficient way of counting events over a 24-hour period?
I have read somewhere on Stack Overflow that @DavidAnderson suggested creating our own sliding window using map state and slicing the events by timestamp.
However, I'm not exactly sure what this means, and I didn't find any code example showing it.
You are creating quite a few windows in there. A sliding window with a size of 24 hours and a slide of 5 minutes means every element is assigned to 24 h / 5 min = 288 open windows, so essentially all of the data you received over the past day is held in window state and ends up in the checkpoint. It's therefore expected that the checkpoint size and duration grow as the data itself grows.
To be able to answer whether the code can be rewritten, you would need to provide more details on what exactly you are trying to achieve here.
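For reference, the map-state/slicing idea mentioned in the question (as an alternative to a 24 h / 5 min sliding window) could look roughly like the hypothetical Scala sketch below: it keeps one counter per 5-minute slice per key and sums the slices of the last 24 hours when a timer fires, so state holds at most 288 longs per key instead of a day's worth of events. Event is the POJO from the question; the entityId/entityName concatenation and the foo/bar split are omitted for brevity.

import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

// Hypothetical sketch: keyed by entityId, requires event timestamps to be assigned
// upstream (as in the question).
class SlicedSlidingCount extends KeyedProcessFunction[String, Event, (String, Long)] {

  private val sliceMs  = 5 * 60 * 1000L            // 5-minute slices
  private val windowMs = 24 * 60 * 60 * 1000L      // 24-hour window

  private lazy val slices: MapState[Long, Long] = getRuntimeContext.getMapState(
    new MapStateDescriptor[Long, Long]("slices", classOf[Long], classOf[Long]))

  override def processElement(event: Event,
                              ctx: KeyedProcessFunction[String, Event, (String, Long)]#Context,
                              out: Collector[(String, Long)]): Unit = {
    val slice = ctx.timestamp() / sliceMs * sliceMs                  // start of the current slice
    val current = if (slices.contains(slice)) slices.get(slice) else 0L
    slices.put(slice, current + 1)
    // emit the 24 h total at the end of this slice (duplicate registrations are deduplicated)
    ctx.timerService().registerEventTimeTimer(slice + sliceMs - 1)
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, Event, (String, Long)]#OnTimerContext,
                       out: Collector[(String, Long)]): Unit = {
    var total = 0L
    val expired = scala.collection.mutable.ArrayBuffer[Long]()
    for (entry <- slices.entries().asScala) {
      if (entry.getKey <= timestamp - windowMs) expired += entry.getKey  // older than 24 h
      else total += entry.getValue
    }
    expired.foreach(k => slices.remove(k))                               // clean up old slices
    out.collect((ctx.getCurrentKey, total))
  }
}

Compared to the sliding group window, each incoming event updates a single counter rather than 288 window aggregates, which is what keeps the checkpointed state small.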
How are timestamps treated within an iterative DataStream loop within Flink?
For example, here is an example of a simple iterative loop within Flink where the feedback loop is of a different type to the input stream:
DataStream<MyInput> inputStream = env.addSource(new MyInputSourceFunction());
IterativeStream.ConnectedIterativeStreams<MyInput, MyFeedback> iterativeStream = inputStream.iterate().withFeedbackType(MyFeedback.class);
// define an output tag so we can emit feedback objects via a side output
final OutputTag<MyFeedback> outputTag = new OutputTag<MyFeedback>("feedback-output"){};
// now do some processing
SingleOutputStreamOperator<MyOutput> combinedStreams = iterativeStream.process(new CoProcessFunction<MyInput, MyFeedback, MyOutput>() {
    @Override
    public void processElement1(MyInput value, Context ctx, Collector<MyOutput> out) throws Exception {
        // do some processing of the stream of MyInput values
        // emit MyOutput values downstream by calling out.collect()
        out.collect(someInstanceOfMyOutput);
    }

    @Override
    public void processElement2(MyFeedback value, Context ctx, Collector<MyOutput> out) throws Exception {
        // do some more processing on the feedback classes
        // emit feedback items
        ctx.output(outputTag, someInstanceOfMyFeedback);
    }
});
iterativeStream.closeWith(combinedStreams.getSideOutput(outputTag));
My questions revolve around how Flink uses timestamps within a feedback loop:
Within the ConnectedIterativeStreams, how does Flink treat ordering of the input objects across the streams of regular inputs and feedback objects? If I emit an object into the feedback loop, when will it be seen by the head of the loop with respect to the regular stream of input objects?
How does the behaviour change when using event time processing?
AFAICT, Flink doesn't provide any guarantees on the ordering of input objects. I've run into this when trying to use iterations for a clustering algorithm in Flink, where the centroid updates don't get processed in a timely manner. The only solution I found was to essentially create a single (unioned) stream of the incoming events and the centroid updates, versus using a co-stream.
FYI, there's this proposal to address some of the shortcomings of iterations.
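As a rough illustration of that union-based workaround (a hypothetical Scala sketch, not the asker's code): give inputs and feedback a common wrapper type so the loop body sees a single stream, and therefore a single arrival order, instead of two connected streams. MyUnionProcessFunction and feedbackTag are made-up names; MyInput, MyFeedback and MyOutput are the types from the question.

import org.apache.flink.streaming.api.scala._

// Wrap regular inputs as Left(...); feedback elements are wrapped as Right(...) before
// being fed back, so the whole loop carries one type: Either[MyInput, MyFeedback].
val wrappedInputs: DataStream[Either[MyInput, MyFeedback]] =
  inputStream.map(in => Left(in): Either[MyInput, MyFeedback])

val outputs: DataStream[MyOutput] = wrappedInputs.iterate(
  (loop: DataStream[Either[MyInput, MyFeedback]]) => {
    // one single-input function replaces the CoProcessFunction and pattern matches on the wrapper
    val processed = loop.process(new MyUnionProcessFunction()) // emits MyOutput, feedback via side output
    val feedback = processed
      .getSideOutput(feedbackTag)                              // OutputTag[MyFeedback]
      .map(fb => Right(fb): Either[MyInput, MyFeedback])
    (feedback, processed)
  })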
I'm trying to count the elements in a stream while enriching the result with the end time of the window.
The events are received from Kafka using the Kafka 0.10 consumer provided by Flink. Event time is used.
A simple KeyedStream.count( ... ) works fine.
The stream has a length of 4 minutes. When using a time window of 3 minutes, only one output is received; there should be two. The results are written using a BucketingSink.
val count = stream.map( m => (m.getContext, 1) )
  .keyBy( 0 )
  .timeWindow( Time.minutes(3) )
  .apply( new EndTimeWindow() )
  .map( new JsonMapper() )

count.addSink( countSink )
class EndTimeWindow extends WindowFunction[(String, Int), (String, Int), Tuple, TimeWindow] {
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    var sum: Int = 0
    for (value <- input) {
      sum = sum + value._2
    }
    out.collect((window.getEnd.toString, sum))
  }
}
When using a time window of 3 minutes, only one output is received, and it contains a smaller number of events than expected. There should be two outputs.
To be more precise, an event time window closes when a suitable watermark arrives -- which, with a bounded-out-of-orderness watermark generator, will happen (1) if an event arrives that is sufficiently outside the window, or (2) if the events are coming from a finite source that reaches its end, because in that case Flink will send a watermark with a timestamp of Long.MAX_VALUE that will close all open event time windows. However, with Kafka as your source, that won't happen.
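For illustration (made-up numbers, and MyEvent is a placeholder for whatever event type is used): with a BoundedOutOfOrdernessTimestampExtractor the watermark trails the largest timestamp seen so far by the configured bound, so an event-time window only fires once an event sufficiently past its end has arrived:

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

// watermark = (largest timestamp seen so far) - 10 s
val withTimestamps = stream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[MyEvent](Time.seconds(10)) {
    override def extractTimestamp(e: MyEvent): Long = e.timestamp
  })

// A window covering [00:00, 03:00) fires when the watermark reaches 03:00, i.e. only after
// an event with a timestamp of at least 03:10 has been seen. If the stream simply stops,
// no such event arrives and Kafka does not emit a final watermark, so the last window
// stays open.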
OK, I think I know what went wrong. The mistake happened because I was thinking about the problem incorrectly.
Since I'm using event time, a window closes when an event arrives whose timestamp is greater than the window's end time. When the stream ends, no more elements arrive, so the last window never closes.
The requirement is that I want to write an Akka Streams application that listens to continuous events from Kafka and then sessionizes the event data in a time frame, based on an id value embedded inside each event.
For example, let's say that my time frame window is two minutes, and in the first two minutes I get the four events below:
Input:
{"message-domain":"1234","id":1,"aaa":"bbb"}
{"message-domain":"1234","id":2,"aaa":"bbb"}
{"message-domain":"5678","id":4,"aaa":"bbb"}
{"message-domain":"1234","id":3,"aaa":"bbb"}
Then in the output, after grouping/sessionizing these events, I will have only two events based on their message-domain value.
Output:
{"message-domain":"1234",messsages:[{"id":1,"aaa":"bbb"},{"id":2,"aaa":"bbb"},{"id":4,"aaa":"bbb"}]}
{"message-domain":"5678",messsages:[{"id":3,"aaa":"bbb"}]}
And I want this to happen in real time. Any suggestions on how to achieve this?
To group the events within a time window you can use Flow.groupedWithin:
val maxCount: Int = Int.MaxValue
val timeWindow = FiniteDuration(2L, TimeUnit.MINUTES)

val timeWindowFlow: Flow[String, Seq[String], NotUsed] =
  Flow[String].groupedWithin(maxCount, timeWindow)
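To get from those time-based batches to the per-message-domain output shown in the question, a hypothetical follow-up step could group each batch by its domain (assuming the JSON has already been parsed into a simple case class by an earlier stage; Msg is a made-up name):

import java.util.concurrent.TimeUnit
import scala.concurrent.duration.FiniteDuration
import akka.NotUsed
import akka.stream.scaladsl.Flow

// Msg is a made-up, already-parsed representation of one incoming JSON message.
case class Msg(messageDomain: String, id: Int, aaa: String)

val sessionize: Flow[Msg, (String, Seq[Msg]), NotUsed] =
  Flow[Msg]
    .groupedWithin(Int.MaxValue, FiniteDuration(2L, TimeUnit.MINUTES)) // the 2-minute time frame
    .mapConcat(batch => batch.groupBy(_.messageDomain).toList)         // one element per domain

Each emitted element is a message-domain paired with every message seen for it during the window, which matches the shape of the desired output.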
I am using the "tick" event's delta property in EaselJS in order to create a simple timer in milliseconds. My ticker is set to 60 FPS. When the game is running I am getting roughly 16/17 ms between each tick (1000/60 = 16.6667) - so I am happy with this. However, when I append this value onto my text value (starting from 0) it is going up considerably quicker than it should be. I was expecting that on average it would be displaying a time of 1000 for each second elapsed. My code (in chunks) is below (game.js and gameInit.js are separate files). I am hoping that I am just overlooking something really simple...
//gameInit.js
createjs.Ticker.setFPS(60);
createjs.Ticker.on("tick", this.onTick, this);
...
//game.js
p.run = function (tickerEvent) {
if (this.gameStarted == true ) {
console.log("TICK ms since last tick = " + Math.floor(tickerEvent.delta)); // returns around 16/17
this.timerTextValue += Math.floor(tickerEvent.delta); //FIXME seems too fast!
this.timerText.text = this.timerTextValue;
}
};
Kind Regards,
Rich
Solved it. What a silly mistake! I had another place where I was initialising the ticker, meaning the tick handler was being invoked twice, hence my timer was counting up twice as fast.