Custom Watermarks with Apache Flink

I am investigating the types of watermarks that can be inserted into a data stream.
While this may go beyond the purpose of watermarks, I'll ask anyway.
Can you create a watermark that holds a timestamp and one or more k/v pairs (this=that, that=this)?
The watermark would then hold {12DEC180500GMT, this=that, that=this},
or, more generally,
{Timestamp, kvp1, kvp2, kvpN}
Is something like this possible? I have reviewed the user and API docs, but I may have overlooked something.

No, the Watermark class in Flink
(found in
flink-streaming-java/src/main/java/org/apache/flink/streaming/api/watermark/Watermark.java)
has only one instance variable besides the static MAX_WATERMARK, which is
/** The timestamp of the watermark in milliseconds. */
private final long timestamp;
So watermarks cannot carry any information besides a timestamp, which must be a long value.
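To illustrate, here is a minimal sketch of a punctuated watermark assigner (using the legacy AssignerWithPunctuatedWatermarks API; the MyEvent type and its timestamp field are hypothetical). The only payload a Watermark accepts is the long passed to its constructor, so any k/v metadata would have to travel through the stream as regular elements instead:

import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Hypothetical event type with a long event-time field.
public class PunctuatedAssigner implements AssignerWithPunctuatedWatermarks<MyEvent> {

    @Override
    public long extractTimestamp(MyEvent event, long previousElementTimestamp) {
        return event.timestamp;
    }

    @Override
    public Watermark checkAndGetNextWatermark(MyEvent event, long extractedTimestamp) {
        // Watermark's only constructor parameter is a long timestamp --
        // there is nowhere to attach k/v pairs.
        return new Watermark(extractedTimestamp);
    }
}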

Related

Flink Watermark

In Flink, I found two ways to set up watermarks.
The first is
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.setAutoWatermarkInterval(5000)
the second is
env.addSource(
  new FlinkKafkaConsumer[...](...)
).assignTimestampsAndWatermarks(
  WatermarkStrategy.forBoundedOutOfOrderness[...](Duration.ofSeconds(10)).withTimestampAssigner(...)
)
I would like to know which one will take effect in the end.
There's no conflict at all between those two -- they are dealing with separate concerns. Everything specified will take effect.
The first one,
env.getConfig.setAutoWatermarkInterval(5000)
is specifying how often you want watermarks to be generated (one watermark every 5000 msec). If this wasn't specified, the default of 200 msec would be used instead.
The second,
env.addSource(
  new FlinkKafkaConsumer[...](...)
).assignTimestampsAndWatermarks(
  WatermarkStrategy.forBoundedOutOfOrderness[...](Duration.ofSeconds(10)).withTimestampAssigner(...)
)
is specifying the details of how those watermarks are to be computed. I.e., they should be generated by the FlinkKafkaConsumer using a BoundedOutOfOrderness strategy, with a bounded delay of 10 seconds. The WatermarkStrategy also needs a timestamp assigner.
There's no default WatermarkStrategy, so something like this second code snippet is required if you want to work with event time.
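Putting the two together, a minimal Java sketch (the broker address, topic name, and String payloads are placeholders):

import java.time.Duration;
import java.util.Properties;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class WatermarkConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Concern #1: how often watermarks are generated (default: every 200 msec).
        env.getConfig().setAutoWatermarkInterval(5000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker

        env.addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props)) // placeholder topic
            // Concern #2: how each watermark's value is computed.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<String>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                    // reuse the timestamp already set by the Kafka consumer
                    .withTimestampAssigner((event, recordTimestamp) -> recordTimestamp))
            .print();

        env.execute("watermark config sketch");
    }
}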

Kafka message timestamps, event time, and watermarks

I am reading the book Stream Processing with Apache Flink, where it is stated that “As of version 0.10.0, Kafka supports message timestamps. When reading from Kafka version 0.10 or later, the consumer will automatically extract the message timestamp as an event-time timestamp if the application runs in event-time mode.”
So inside a processElement function, will the call context.timestamp() by default return the Kafka message timestamp?
Could you please provide a simple example of how to implement AssignerWithPeriodicWatermarks/AssignerWithPunctuatedWatermarks that extracts timestamps (and builds watermarks) based on the consumed Kafka message timestamp?
If I am using TimeCharacteristic.ProcessingTime, would ctx.timestamp() return the processing time, and in that case would it be similar to context.timerService().currentProcessingTime()?
Thank you.
The Flink Kafka consumer takes care of this for you, and puts the timestamp where it needs to be. In Flink 1.11 you can simply rely on this, though you still need to take care of providing a WatermarkStrategy that specifies the out-of-orderness (or asserts that the timestamps are in order):
FlinkKafkaConsumer<String> myConsumer = new FlinkKafkaConsumer<>(...);
myConsumer.assignTimestampsAndWatermarks(
    WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)));
In earlier versions of Flink you had to provide an implementation of a timestamp assigner, which would look like this:
@Override
public long extractTimestamp(Long element, long previousElementTimestamp) {
    return previousElementTimestamp;
}
This version of the extractTimestamp method is passed the current value of the timestamp present in the StreamRecord as previousElementTimestamp, which in this case will be the timestamp put there by the Flink Kafka consumer.
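In those earlier versions, the full assigner might look something like this sketch (the 20-second bound, the class name, and the String element type are illustrative):

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Periodic assigner that trusts the timestamps set by the Flink Kafka consumer.
public class KafkaTimestampAssigner implements AssignerWithPeriodicWatermarks<String> {
    private static final long MAX_OUT_OF_ORDERNESS = 20_000L; // 20 seconds
    private long maxTimestampSeen = Long.MIN_VALUE + MAX_OUT_OF_ORDERNESS;

    @Override
    public long extractTimestamp(String element, long previousElementTimestamp) {
        // previousElementTimestamp is the Kafka message timestamp set by the consumer.
        maxTimestampSeen = Math.max(maxTimestampSeen, previousElementTimestamp);
        return previousElementTimestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // Called periodically; allow for up to 20 seconds of out-of-orderness.
        return new Watermark(maxTimestampSeen - MAX_OUT_OF_ORDERNESS);
    }
}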
Flink 1.11 docs
Flink 1.10 docs
As for what is returned by ctx.timestamp() when using TimeCharacteristic.ProcessingTime: it returns null in that case. (Semantically, yes, it is as though the timestamp is the current processing time, but that's not how it's implemented.)

Flink: Evaluate window for each incoming element of stream

I have a stream of Booking elements of the following form:
Booking(id=B1, driverId=D1, time=t1, location=l1)
Booking(id=B2, driverId=D2, time=t2, location=l2)
I need to find, per location, the count of bookings made in the last 15 minutes. But the window should be evaluated for every new booking arriving at a location.
Roughly like:
Assuming the `time` field is set as the timestamp of the record:
bookingStream.keyBy(b => b.location).window(/* any 15-minute window */).trigger(triggerFunction)
Except that the trigger should fire not at the end of the 15 minutes but whenever any booking arrives at a location, and emit the count of bookings in the 15 minutes preceding the timestamp of the newly arrived booking.
Approach:
Use a RichMap function and maintain a priority queue of a location's bookings as managed state (ValueState), with the booking timestamp as the priority. For each element that arrives, first add it to the state, then remove all elements more than 15 minutes older than the newly arrived element. Emit the count of the elements remaining in the priority queue to the collector.
Is this the right way, or could it be achieved in a better way using some other Flink construct?
If you are running on the heap-based state backend, what you propose should behave reasonably well. But with RocksDB you will have to go through serialization/deserialization of the priority queue for every access, which may be rather painful.
An approach that might perform better on RocksDB would be to keep the current count along with the earliest timestamp in ValueState, and the set of bookings in ListState. The RocksDB state backend can append to ListState without going through ser/de, so you would only have to deserialize and reserialize the whole list when the earliest element is too old.
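A hedged sketch of that second approach as a KeyedProcessFunction, assuming the stream is keyed by location and a Booking POJO with a public long time field in milliseconds (all names are illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class RecentBookingCount extends KeyedProcessFunction<String, Booking, Long> {
    private static final long WINDOW_MS = 15 * 60 * 1000L;

    private transient ListState<Booking> bookings; // all bookings still in the window
    private transient ValueState<Long> count;      // current count, avoids scanning the list
    private transient ValueState<Long> earliest;   // earliest timestamp in the list

    @Override
    public void open(Configuration parameters) {
        bookings = getRuntimeContext().getListState(
            new ListStateDescriptor<>("bookings", Booking.class));
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
        earliest = getRuntimeContext().getState(new ValueStateDescriptor<>("earliest", Long.class));
    }

    @Override
    public void processElement(Booking b, Context ctx, Collector<Long> out) throws Exception {
        bookings.add(b); // RocksDB appends to ListState without deserializing the whole list
        long c = (count.value() == null) ? 1L : count.value() + 1;
        long e = (earliest.value() == null) ? b.time : Math.min(earliest.value(), b.time);
        long cutoff = b.time - WINDOW_MS;
        if (e < cutoff) {
            // Only when the earliest element has aged out do we pay for a full ser/de pass.
            List<Booking> keep = new ArrayList<>();
            e = Long.MAX_VALUE;
            for (Booking x : bookings.get()) {
                if (x.time >= cutoff) {
                    keep.add(x);
                    e = Math.min(e, x.time);
                }
            }
            bookings.update(keep);
            c = keep.size();
        }
        count.update(c);
        earliest.update(e);
        out.collect(c);
    }
}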

How to get DataStream key after keyBy() in Flink Java API

I'm reading from a Kafka cluster in a Flink streaming app. After getting the source stream, I want to aggregate events by a composite key in an event-time tumbling window and then write the result to a table.
The problem is that after applying my aggregate function, which just counts the number of clicks by clientId, I can't find a way to get the key of each output record, since the API returns an instance of the accumulated result but not the corresponding key.
DataStream<Event> stream = environment.addSource(mySource);
stream.keyBy(new KeySelector<Event, Integer>() {
        @Override
        public Integer getKey(Event event) { return event.getClientId(); }
    })
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new MyAggregateFunction());
How do I get the key that I specified before? I did not inject the key of the input events into the accumulator, as I felt that wouldn't be clean.
Rather than
.aggregate(new MyAggregateFunction())
you can use
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
and in this case the process method of your ProcessWindowFunction will be passed the key, along with the pre-aggregated result of your AggregateFunction and a Context object with other potentially relevant info. See the section in the docs on ProcessWindowFunction with Incremental Aggregation for more details.
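For example, if MyAggregateFunction's result is a Long count and the key is the Integer clientId, a minimal (hypothetical) process window function might look like this:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class MyProcessWindowFunction
        extends ProcessWindowFunction<Long, Tuple2<Integer, Long>, Integer, TimeWindow> {

    @Override
    public void process(Integer clientId, Context context, Iterable<Long> counts,
                        Collector<Tuple2<Integer, Long>> out) {
        // With incremental aggregation, the iterable holds exactly one element:
        // the final result of MyAggregateFunction for this key and window.
        out.collect(Tuple2.of(clientId, counts.iterator().next()));
    }
}

The window's start and end times are also available via context.window() if you need them in the output.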

Apache Flink window order

Using Apache Flink I want to create a streaming window sorted by the timestamp that is stored in the Kafka event. According to the following article this is not implemented.
https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams
However, the article is dated July 2015, and it is now almost a year later. Has this functionality been implemented, and can somebody point me to any relevant documentation and/or an example?
Apache Flink supports stream windows based on event timestamps.
In Flink, this concept is called event-time.
In order to support event-time, you have to extract a timestamp (a long value) from each event. In addition, you need to provide so-called watermarks, which are needed to deal with events whose timestamps arrive out of order.
Given a stream with extracted timestamps you can define a windowed sum as follows:
val stream: DataStream[(String, Int)] = ...
val windowCnt = stream
  .keyBy(0)                     // partition stream on first field (String)
  .timeWindow(Time.minutes(1))  // 1-minute window on the extracted timestamps
  .sum(1)                       // sum the second field (Int)
Event-time and windows are explained in detail in the documentation (here and here) and in several blog posts (here, here, here, and here).
Sorting by timestamps is still not supported out of the box, but you can do windowing based on the timestamps in elements. We call this event-time windowing. Please have a look here: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/windows.html.
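Emitting the contents of each window in timestamp order is therefore something you have to do yourself. A hedged sketch using a ProcessWindowFunction (from newer Flink versions), assuming a hypothetical Event type with a long timestamp field and a stream keyed by a String:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class SortedWindow extends ProcessWindowFunction<Event, Event, String, TimeWindow> {

    @Override
    public void process(String key, Context context, Iterable<Event> events, Collector<Event> out) {
        // Buffer the window's elements, sort by event timestamp, then emit in order.
        List<Event> sorted = new ArrayList<>();
        events.forEach(sorted::add);
        sorted.sort(Comparator.comparingLong((Event e) -> e.timestamp));
        sorted.forEach(out::collect);
    }
}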
