Is it possible to make different keys have independent watermarks? - apache-flink

I am using Flink 1.12 with a keyed stream. In my code it looks like both A and B share the same watermark, and therefore B is treated as late because A's arrival has advanced the watermark to 2020-08-30 10:50:11.
The output is A(2020-08-30 10:50:08, 2020-08-30 10:50:16):2020-08-30 10:50:15; there is no output for B.
I would like to ask whether it is possible to make different keys have independent watermarks, so that A's watermark and B's watermark advance independently.
The application code is:
import java.text.SimpleDateFormat
import java.util.Date
import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object DemoDiscardLateEvent4_KeyStream {
  def to_milli(str: String) =
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(str).getTime

  def to_char(milli: Long) = {
    val date = if (milli <= 0) new Date(0) else new Date(milli)
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date)
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val data = Seq(
      ("A", "2020-08-30 10:50:15"),
      ("B", "2020-08-30 10:50:07")
    )

    env.fromCollection(data).setParallelism(1)
      .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[(String, String)]() {
        var maxSeen = Long.MinValue

        override def checkAndGetNextWatermark(lastElement: (String, String), extractedTimestamp: Long): Watermark = {
          val eventTime = to_milli(lastElement._2)
          if (eventTime > maxSeen) {
            maxSeen = eventTime
          }
          // allow events to be 4 seconds late
          new Watermark(maxSeen - 4000)
        }

        override def extractTimestamp(element: (String, String), previousElementTimestamp: Long): Long =
          to_milli(element._2)
      })
      .keyBy(_._1)
      .window(TumblingEventTimeWindows.of(Time.of(8, TimeUnit.SECONDS)))
      .apply(new WindowFunction[(String, String), String, String, TimeWindow] {
        override def apply(key: String, window: TimeWindow, input: Iterable[(String, String)], out: Collector[String]): Unit = {
          val start = to_char(window.getStart)
          val end = to_char(window.getEnd)
          val sb = new StringBuilder
          // the start and end of the window
          sb.append(s"$key($start, $end):")
          // the contents of the window
          input.foreach { e =>
            sb.append(e._2 + ",")
          }
          out.collect(sb.toString().substring(0, sb.length - 1))
        }
      })
      .print()

    env.execute()
  }
}

While it would sometimes be helpful if Flink offered per-key watermarking, it does not.
Each parallel instance of your WatermarkStrategy (or in this case, of your AssignerWithPunctuatedWatermarks) is generating watermarks independently, based on the timestamps of the events it observes (regardless of their keys).
One way to work around the lack of this feature is to not use watermarks at all. For example, if you would otherwise use per-key watermarks to trigger keyed event-time windows, you can implement your own windows with a KeyedProcessFunction: instead of relying on watermarks to fire event-time timers, keep track of the largest timestamp seen so far for each key, and whenever that value advances, determine whether one or more windows for that key can now be closed.
See one of the Flink training lessons for an example of how to implement keyed tumbling windows with a KeyedProcessFunction. This example depends on watermarks but should help you get started.
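For illustration, here is a minimal sketch of that workaround (not the training example itself): per-key tumbling windows that close based only on each key's own largest timestamp, with a simplified (key, timestampMs) event type and hypothetical window-size/lateness parameters.

import org.apache.flink.api.common.state.{MapState, MapStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Sketch: per-key tumbling "windows" without watermarks. A window for a key
// is closed as soon as that key sees an event past windowEnd + lateness;
// events for other keys have no effect on it.
class PerKeyTumblingWindows(windowSizeMs: Long, latenessMs: Long)
    extends KeyedProcessFunction[String, (String, Long), String] {

  // window start -> element count (use whatever aggregate you need)
  private var windows: MapState[java.lang.Long, java.lang.Long] = _
  // this key's private "watermark": the largest timestamp seen so far
  private var maxTimestamp: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    windows = getRuntimeContext.getMapState(
      new MapStateDescriptor("windows", classOf[java.lang.Long], classOf[java.lang.Long]))
    maxTimestamp = getRuntimeContext.getState(
      new ValueStateDescriptor("maxTs", classOf[java.lang.Long]))
  }

  override def processElement(
      event: (String, Long),
      ctx: KeyedProcessFunction[String, (String, Long), String]#Context,
      out: Collector[String]): Unit = {
    val ts = event._2
    val windowStart = ts - (ts % windowSizeMs)
    val count = Option(windows.get(windowStart)).map(_.longValue).getOrElse(0L)
    windows.put(windowStart, count + 1)

    // advance this key's max timestamp, independently of all other keys
    val maxSeen = math.max(
      Option(maxTimestamp.value).map(_.longValue).getOrElse(Long.MinValue), ts)
    maxTimestamp.update(maxSeen)

    // close every window whose end (plus lateness) is behind this key's max timestamp
    val it = windows.iterator()
    while (it.hasNext) {
      val entry = it.next()
      if (entry.getKey + windowSizeMs + latenessMs <= maxSeen) {
        out.collect(s"${ctx.getCurrentKey}(${entry.getKey}, ${entry.getKey + windowSizeMs}): ${entry.getValue}")
        it.remove()
      }
    }
  }
}

With the stream from the question, this could be wired up (hypothetically) as env.fromCollection(data).map(e => (e._1, to_milli(e._2))).keyBy(_._1).process(new PerKeyTumblingWindows(8000, 4000)).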

Related

Flink windows do not fire after connecting to broadcast stream

I am trying to use the Broadcast State pattern to extend the functionality of my application.
Some code is below. Main:
// .... //
val gatewayBroadcastStateDescriptor =
  new MapStateDescriptor[String, BCASTDATACLASS]("gatewayEvents", classOf[String], classOf[BCASTDATACLASS])

// broadcast source
val broadcastSource = env
  .addSource(new FlinkKinesisConsumer[String](s"BROADCAST", new SimpleStringSchema, consumerConfig))
val broadcastSourceGatewayEvents = broadcastSource
  .filter(_.contains("someText"))
  .map(json => read[BCASTDATACLASS](json))
val broadcastGatewayEventsConfigurations =
  broadcastSourceGatewayEvents.broadcast(gatewayBroadcastStateDescriptor)

// packet source
val packetSource = env
  .addSource(new FlinkKinesisConsumer[String](s"PACKETS", new SimpleStringSchema, consumerConfig))
val packets = packetSource.disableChaining()
  .map(json => read[MAINDATACLASS](json))
  .assignTimestampsAndWatermarks(WatermarkStrategy
    .forBoundedOutOfOrderness[MAINDATACLASS](Duration.ofSeconds(2))
    .withTimestampAssigner(new PacketWatermarkGenerator))
  .timeWindowAll(Time.seconds(2))
  .process(new OrderPacketWindowFunction)
  .disableChaining()

// connect the main data stream with the broadcast stream
val gwEnrichedPackets = packets
  .keyBy(_.gatewayId)
  .connect(broadcastGatewayEventsConfigurations)
  .process(new EnrichingPackets)
My process function (in this example it does nothing, just forwards the data further):
//....//
class EnrichingPackets()
    extends KeyedBroadcastProcessFunction[String, MAINDATACLASS, BCASTDATACLASS, MAINDATACLASS]
    with LazyLogging {

  private lazy val gatewayEventsStateDescriptor =
    new MapStateDescriptor[String, BCASTDATACLASS]("gatewayEvents", classOf[String], classOf[BCASTDATACLASS])

  override def processBroadcastElement( // broadcast element, context, collector for resulting elements
      broadcastInput: BCASTDATACLASS,
      ctx: KeyedBroadcastProcessFunction[String, MAINDATACLASS, BCASTDATACLASS, MAINDATACLASS]#Context,
      out: Collector[MAINDATACLASS]): Unit = {
    val gatewayEvents = ctx.getBroadcastState(gatewayEventsStateDescriptor)
    println("OK")
  }

  override def processElement(
      packetInput: MAINDATACLASS,
      readOnlyCtx: KeyedBroadcastProcessFunction[String, MAINDATACLASS, BCASTDATACLASS, MAINDATACLASS]#ReadOnlyContext,
      out: Collector[MAINDATACLASS]): Unit = {
    // get read-only broadcast state
    val gatewayEvents = readOnlyCtx.getBroadcastState(gatewayEventsStateDescriptor)
    out.collect(packetInput)
  }
}
After connecting the data and configuration streams I am going to open a window and do some processing.
But when I open a window on gwEnrichedPackets nothing happens; in the Flink UI I can see ONLY messages coming into the window. Even with session windows, stopping the data flow does not make the windows fire.
allowedLateness and sideOutputLateData did not help me investigate the problem.
An interesting point is that if I open windows on packets, everything works properly.
// val sessionWindows = gwEnrichedPackets - does NOT work
// val sessionWindows = packets           - works
val sessionWindows = gwEnrichedPackets
  .keyBy(_.tag.tagId)
  .timeWindow(Time.seconds(20))
  //.window(EventTimeSessionWindows.withGap(Time.seconds(120)))
  //.allowedLateness(Time.seconds(12000))
  //.sideOutputLateData(new OutputTag[MAINDATACLASS]("late-readings"))
  .process(new DetectTagGatewayDisconnections)

val lateStream = sessionWindows
  .getSideOutput(new OutputTag[MAINDATACLASS]("late-readings"))

lateStream.print()
sessionWindows.print()
What am I doing wrong?
The problem here is watermarking. You are assigning watermarks to only one of the streams, and Flink always picks the lowest watermark when more than one stream feeds a given operator.
So in your case Flink has to pick between the watermark generated by packets and the one generated by the broadcast stream. The latter is always Long.MinValue (because the control stream has no watermark generator), so the operator's watermark stays at Long.MinValue and the windows never progress.
You can simply add a watermark assigner to the gwEnrichedPackets stream, and that should solve the issue.
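For example, a minimal sketch (assuming MAINDATACLASS carries an event-time field in milliseconds; the field name eventTime is hypothetical):

import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}

// Sketch: re-assign timestamps/watermarks after the broadcast connect, so the
// windowed operator downstream of gwEnrichedPackets sees a progressing watermark.
val gwEnrichedPacketsWithWm = gwEnrichedPackets
  .assignTimestampsAndWatermarks(
    WatermarkStrategy
      .forBoundedOutOfOrderness[MAINDATACLASS](Duration.ofSeconds(2))
      .withTimestampAssigner(new SerializableTimestampAssigner[MAINDATACLASS] {
        override def extractTimestamp(element: MAINDATACLASS, recordTimestamp: Long): Long =
          element.eventTime // hypothetical event-time field
      }))

// then key and window gwEnrichedPacketsWithWm instead of gwEnrichedPackets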

Flink interval join does not output

We have a Flink job that interval-joins two streams; both streams consume events from Kafka. Here is the example code:
val articleEventStream: DataStream[ArticleEvent] = env.addSource(articleEventSource)
  .assignTimestampsAndWatermarks(new ArticleEventAssigner)
val feedbackEventStream: DataStream[FeedbackEvent] = env.addSource(feedbackEventSource)
  .assignTimestampsAndWatermarks(new FeedbackEventAssigner)

articleEventStream
  .keyBy(article => article.id)
  .intervalJoin(feedbackEventStream.keyBy(feedback => feedback.article.id))
  .between(Time.seconds(-5), Time.seconds(10))
  .process(new ProcessJoinFunction[ArticleEvent, FeedbackEvent, String] {
    override def processElement(left: ArticleEvent, right: FeedbackEvent, ctx: ProcessJoinFunction[ArticleEvent, FeedbackEvent, String]#Context, out: Collector[String]): Unit = {
      out.collect(left.name + " got feedback: " + right.feedback)
    }
  })
class ArticleEventAssigner extends AssignerWithPunctuatedWatermarks[ArticleEvent] {
  val bound: Long = 5 * 1000

  override def checkAndGetNextWatermark(lastElement: ArticleEvent, extractedTimestamp: Long): Watermark = {
    new Watermark(extractedTimestamp - bound)
  }

  override def extractTimestamp(element: ArticleEvent, previousElementTimestamp: Long): Long = {
    element.occurredAt
  }
}

class FeedbackEventAssigner extends AssignerWithPunctuatedWatermarks[FeedbackEvent] {
  val bound: Long = 5 * 1000

  override def checkAndGetNextWatermark(lastElement: FeedbackEvent, extractedTimestamp: Long): Watermark = {
    new Watermark(extractedTimestamp - bound)
  }

  override def extractTimestamp(element: FeedbackEvent, previousElementTimestamp: Long): Long = {
    element.occurredAt
  }
}
However, we do not see any joined output. We checked that each stream continuously emits elements with timestamps and proper watermarks. Does anyone have a hint about possible reasons?
After checking the different parts (timestamps/watermarks, triggers), I noticed that I had made a mistake: the interval I used,
between(Time.seconds(-5), Time.seconds(10))
is just too small, so no elements from both streams fell inside it to be joined. This might sound obvious, but since I am new to Flink, I did not know where to check.
So my lesson is: if the join produces no output, it may be necessary to check the interval size.
And thanks all for the comments!
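For reference, widening the relative bounds is a one-line change in the job above; here is a sketch reusing the question's streams, with one hour on each side as an assumed bound:

import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

articleEventStream
  .keyBy(_.id)
  .intervalJoin(feedbackEventStream.keyBy(_.article.id))
  // widened bounds: match feedback within [articleTime - 1h, articleTime + 1h]
  .between(Time.hours(-1), Time.hours(1))
  .process(new ProcessJoinFunction[ArticleEvent, FeedbackEvent, String] {
    override def processElement(
        left: ArticleEvent,
        right: FeedbackEvent,
        ctx: ProcessJoinFunction[ArticleEvent, FeedbackEvent, String]#Context,
        out: Collector[String]): Unit =
      out.collect(s"${left.name} got feedback: ${right.feedback}")
  })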

What's the relationship between key and Window instance in KeyedStream#timeWindow#process

For KeyedStream#timeWindow#process, I am wondering whether one window instance will only contain the same key, and whether different keys will use different window instances.
From the output of the following application, I see that one window instance only contains the same key, and different keys use different windows.
But I want to ask and confirm. Thanks!
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.util.Random
class KeyByAndWindowAndProcessTestSource extends RichParallelSourceFunction[Int] {
  override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
    while (true) {
      val i = new Random().nextInt(30)
      ctx.collect(i)
      ctx.collect(i)
      ctx.collect(i)
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
  }
}
The application is:
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import org.apache.flink.api.scala._
object KeyByAndWindowTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.getCheckpointConfig.setCheckpointInterval(10 * 1000)
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    val ds: DataStream[Int] = env.addSource(new KeyByAndWindowAndProcessTestSource)
    val ds2 = ds.keyBy(i => i).timeWindow(Time.seconds(4)).process(new MyProcessFunction())
    ds2.print()
    env.execute()
  }
}

class MyProcessFunction extends ProcessWindowFunction[Int, String, Int, TimeWindow] {
  override def process(
      key: Int,
      ctx: Context,
      vals: Iterable[Int],
      out: Collector[String]): Unit = {
    println(new java.util.Date())
    println(s"key=${key}, vals = ${vals.mkString(",")}, hashCode=${System.identityHashCode(ctx.window)}")
  }
}
The output is:
Sat Sep 14 13:08:24 CST 2019
key=26, vals = 26,26,26, hashCode=838523304
Sat Sep 14 13:08:24 CST 2019
key=28, vals = 28,28,28, hashCode=472721641
Sat Sep 14 13:08:24 CST 2019
key=18, vals = 18,18,18,18,18,18, hashCode=1668151956
Actually, with respect to ProcessingTimeWindow, a new window object is created for each element.
Here is the source code of TumblingProcessingTimeWindows#assignWindows:
public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
    final long now = context.getCurrentProcessingTime();
    long start = TimeWindow.getWindowStartWithOffset(now, offset, size);
    return Collections.singletonList(new TimeWindow(start, start + size));
}
So System.identityHashCode will return a different hash code for each of these window objects, and your test code therefore does not prove anything.
Under the hood, elements are grouped by the key of elementKey + assignedWindow, so I think it's right to say "one window instance will only contain the same key, and different keys will use different window instances".
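To make the value-vs-identity distinction concrete, here is a small sketch: two TimeWindow objects covering the same interval are equal by value (TimeWindow implements equals/hashCode on start and end), even though their identity hash codes differ.

import org.apache.flink.streaming.api.windowing.windows.TimeWindow

object WindowEqualityDemo extends App {
  // Two window objects for the same interval: equal by value, distinct by identity.
  val w1 = new TimeWindow(0L, 4000L)
  val w2 = new TimeWindow(0L, 4000L)
  println(w1 == w2) // true: TimeWindow compares start/end
  println(System.identityHashCode(w1) == System.identityHashCode(w2)) // false: different objects
}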
Original Answer:
I hope I get your question right...
ProcessWindowFunction#process will be invoked once per window and key (or multiple times, depending on the window's trigger). Internally, window and key make up a composite partition key.
In terms of Java object instances, one instance of ProcessWindowFunction will deal with many keys. Specifically, there will be as many ProcessWindowFunction instances as the degree of parallelism.
Follow Up:
So I did not get it right :)
For every record processed by the WindowOperator, a new Window object is created with the correct start/end time for that record.
This means that each invocation of ProcessWindowFunction#process will be passed a new Window object.
It is important to understand that a Window in Flink is a very lightweight object, which is just used as an additional part (the namespace) of the overall key. It does not hold any data and/or logic.
May I ask for the background of the question?

Watermarks in a RichParallelSourceFunction

I am implementing a SourceFunction which reads data from a database.
The job should be able to be resumed if it is stopped or crashes (i.e., via savepoints and checkpoints), with the data being processed exactly once.
What I have so far:
@SerialVersionUID(1L)
class JDBCSource(private val clientConfig: Serializable, private val waitTimeMs: Long)
    extends RichParallelSourceFunction[Event] with StoppableFunction with LazyLogging {

  @transient var client: PostGreClient = _
  @volatile var isRunning: Boolean = true

  def this(clientConfig: Serializable) =
    this(clientConfig, JDBCSource.DEFAULT_WAIT_TIME_MS)

  override def stop(): Unit = {
    this.isRunning = false
  }

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    client = new PostGreClient
  }

  override def run(ctx: SourceFunction.SourceContext[Event]): Unit = {
    while (isRunning) {
      val statement = client.getConnection.createStatement()
      val resultSet = statement.executeQuery("SELECT name, timestamp FROM MYTABLE")
      while (resultSet.next()) {
        val name: String = resultSet.getString("name")
        val timestamp: Long = resultSet.getLong("timestamp")
        ctx.collectWithTimestamp(new Event(name, timestamp), timestamp)
      }
      Thread.sleep(waitTimeMs) // poll again after the configured wait
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

object JDBCSource {
  val DEFAULT_WAIT_TIME_MS = 1000L
}
How can I make sure to only get the rows of the database which aren't processed yet?
I assumed the ctx variable would have some information about the current watermark so that I could change my query to something like:
select name, timestamp from myTable where timestamp > ctx.getCurrentWaterMark
But it doesn't have any relevant methods for me. Any ideas on how to solve this problem would be appreciated.
You have to implement CheckpointedFunction so that you can manage checkpointing yourself. The documentation of the interface is pretty comprehensive, but if you need an example, take a look at one.
In essence, your function must implement CheckpointedFunction#snapshotState to store the state you need using Flink's managed state and then, when performing a restore, it will read that same state in CheckpointedFunction#initializeState.
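A minimal sketch of that shape (the Event case class and the query are stand-ins from the question; slicing the table across parallel subtasks is left out):

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

case class Event(name: String, timestamp: Long) // stand-in for the question's Event

class CheckpointedJDBCSource extends RichParallelSourceFunction[Event] with CheckpointedFunction {

  @volatile private var isRunning = true
  // highest timestamp whose rows have already been emitted
  private var lastTimestamp: Long = Long.MinValue
  @transient private var offsetState: ListState[java.lang.Long] = _

  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    offsetState.clear()
    offsetState.add(lastTimestamp)
  }

  override def initializeState(context: FunctionInitializationContext): Unit = {
    offsetState = context.getOperatorStateStore.getListState(
      new ListStateDescriptor("lastTimestamp", classOf[java.lang.Long]))
    val it = offsetState.get().iterator()
    if (it.hasNext) lastTimestamp = it.next() // restore after a failure or savepoint
  }

  override def run(ctx: SourceFunction.SourceContext[Event]): Unit = {
    while (isRunning) {
      // Emitting records and advancing the offset must happen under the
      // checkpoint lock, so a checkpoint never splits the two apart.
      ctx.getCheckpointLock.synchronized {
        // hypothetical query, resuming from the restored offset:
        //   SELECT name, timestamp FROM MYTABLE WHERE timestamp > lastTimestamp
        // for each row:
        //   ctx.collectWithTimestamp(Event(name, ts), ts); lastTimestamp = ts
      }
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}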

How to sum count in last 2 hours specific for each event in real time?

We're using Flink to monitor each event. The detailed scenario: when an event arrives, Flink finds all events with the same userid in the last 2 hours and sums their count fields. For example:
event1<userid1, n1, t0> -> real time result = n1
event2<userid2, n2, t0+1h> -> real time result = n2
event3<userid1, n3, t0+1h> -> real time result = n1+n3
event4<userid1, n4, t0+2.5h> -> real time result = n3+n4
How could we implement such a scenario in Flink? Intuitively, we want to use a sliding window, but there are two problems:
1. In Flink, a sliding window slides by the parameter slide_size. However, in our scenario the window slides for each event, which means the start/end point of the window is different for each event (expected window range: [eventtime-2h, eventtime)). Should we implement this by setting a small slide_size (10ms?)?
2. The process function is executed by the trigger function, which means we can't get the result immediately as soon as an event arrives.
You can achieve this by using a ProcessFunction; a minimal sketch of the idea follows.
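The sketch below is an illustration of that idea (not the original answer's code): it assumes events of the form (userId, count, timestampMs), keeps each key's recent events in MapState, and for every arriving event emits the sum over the last two hours.

import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Sketch: per-event sliding 2h sum. Output is (userId, sum of counts in (ts-2h, ts]).
class SlidingTwoHourSum extends KeyedProcessFunction[String, (String, Long, Long), (String, Long)] {

  private val twoHoursMs = 2 * 60 * 60 * 1000L
  // event timestamp -> accumulated count for this key
  private var events: MapState[java.lang.Long, java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    events = getRuntimeContext.getMapState(
      new MapStateDescriptor("events", classOf[java.lang.Long], classOf[java.lang.Long]))
  }

  override def processElement(
      e: (String, Long, Long),
      ctx: KeyedProcessFunction[String, (String, Long, Long), (String, Long)]#Context,
      out: Collector[(String, Long)]): Unit = {
    val count = e._2
    val ts = e._3
    val prev = Option(events.get(ts)).map(_.longValue).getOrElse(0L)
    events.put(ts, prev + count)

    // sum everything in (ts - 2h, ts]
    var sum = 0L
    val it = events.iterator()
    while (it.hasNext) {
      val entry = it.next()
      if (entry.getKey <= ts && entry.getKey > ts - twoHoursMs) sum += entry.getValue
    }
    out.collect((e._1, sum))
  }
}

Wired up (hypothetically) as src.keyBy(_._1).process(new SlidingTwoHourSum); a production version would also expire old entries with timers to bound the state size.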
Alternatively, you can keep using the sliding window, but use your own Trigger to emit the elements arriving in the window. Sample code may look like:
src.map(x => new Tuple2(x.id, x.value))
  .keyBy(0)
  .timeWindow(Time.seconds(2), Time.seconds(1))
  .trigger(new Trigger[Tuple2[String, Int], TimeWindow] {
    override def onEventTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
      TriggerResult.CONTINUE
    }

    override def onProcessingTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
      TriggerResult.FIRE
    }

    override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
    }

    override def onElement(element: Tuple2[String, Int], timestamp: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
      TriggerResult.FIRE
    }
  })
  .sum(1)
