We have a Flink job that does an interval join of two streams; both streams consume events from Kafka. Here is the example code:
val articleEventStream: DataStream[ArticleEvent] = env.addSource(articleEventSource)
  .assignTimestampsAndWatermarks(new ArticleEventAssigner)
val feedbackEventStream: DataStream[FeedbackEvent] = env.addSource(feedbackEventSource)
  .assignTimestampsAndWatermarks(new FeedbackEventAssigner)

articleEventStream
  .keyBy(article => article.id)
  .intervalJoin(feedbackEventStream.keyBy(feedback => feedback.article.id))
  .between(Time.seconds(-5), Time.seconds(10))
  .process(new ProcessJoinFunction[ArticleEvent, FeedbackEvent, String] {
    override def processElement(left: ArticleEvent, right: FeedbackEvent,
                                ctx: ProcessJoinFunction[ArticleEvent, FeedbackEvent, String]#Context,
                                out: Collector[String]): Unit = {
      out.collect(left.name + " got feedback: " + right.feedback)
    }
  })
class ArticleEventAssigner extends AssignerWithPunctuatedWatermarks[ArticleEvent] {
  val bound: Long = 5 * 1000

  override def checkAndGetNextWatermark(lastElement: ArticleEvent, extractedTimestamp: Long): Watermark = {
    new Watermark(extractedTimestamp - bound)
  }

  override def extractTimestamp(element: ArticleEvent, previousElementTimestamp: Long): Long = {
    element.occurredAt
  }
}

class FeedbackEventAssigner extends AssignerWithPunctuatedWatermarks[FeedbackEvent] {
  val bound: Long = 5 * 1000

  override def checkAndGetNextWatermark(lastElement: FeedbackEvent, extractedTimestamp: Long): Watermark = {
    new Watermark(extractedTimestamp - bound)
  }

  override def extractTimestamp(element: FeedbackEvent, previousElementTimestamp: Long): Long = {
    element.occurredAt
  }
}
However, we do not see any joined output. We checked that each stream continuously emits elements with timestamps and proper watermarks. Does anyone have a hint about possible reasons?
After checking different parts (timestamps/watermarks, triggers), I noticed that I made a mistake: the interval I used,
between(Time.seconds(-5), Time.seconds(10))
is just too small, so no elements from both streams fall into it and nothing can be joined. This might sound obvious, but since I am new to Flink, I did not know where to check.
So, my lesson is that if the join produces no output, it can be necessary to check the interval (window) size.
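For reference, a widened interval would look like the following; the exact bounds here are only an illustration and depend on how far apart in event time the related article and feedback events can actually be:
// illustrative bounds: accept feedback from 5 minutes before to 60 minutes after the article event
.between(Time.minutes(-5), Time.minutes(60))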
And thanks all for the comments!
Related
I am using Flink 1.12 and I have a keyed stream. In my code it looks as if A and B share the same watermark, and therefore B is determined to be late because A's arrival has advanced the watermark to 2020-08-30 10:50:11. Is that right?
The output is A(2020-08-30 10:50:08, 2020-08-30 10:50:16):2020-08-30 10:50:15, and there is no output for B.
I would like to ask whether it is possible to make different keys have independent watermarks, so that A's watermark and B's watermark change independently.
The application code is:
import java.text.SimpleDateFormat
import java.util.Date
import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object DemoDiscardLateEvent4_KeyStream {
def to_milli(str: String) =
new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(str).getTime
def to_char(milli: Long) = {
val date = if (milli <= 0) new Date(0) else new Date(milli)
new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date)
}
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val data = Seq(
("A", "2020-08-30 10:50:15"),
("B", "2020-08-30 10:50:07")
)
env.fromCollection(data).setParallelism(1).assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[(String, String)]() {
var maxSeen = Long.MinValue
override def checkAndGetNextWatermark(lastElement: (String, String), extractedTimestamp: Long): Watermark = {
val eventTime = to_milli(lastElement._2)
if (eventTime > maxSeen) {
maxSeen = eventTime
}
//Allow 4 seconds late
new Watermark(maxSeen - 4000)
}
override def extractTimestamp(element: (String, String), previousElementTimestamp: Long): Long = to_milli(element._2)
}).keyBy(_._1).window(TumblingEventTimeWindows.of(Time.of(8, TimeUnit.SECONDS))).apply(new WindowFunction[(String, String), String, String, TimeWindow] {
override def apply(key: String, window: TimeWindow, input: Iterable[(String, String)], out: Collector[String]): Unit = {
val start = to_char(window.getStart)
val end = to_char(window.getEnd)
val sb = new StringBuilder
//the start and end of the window
sb.append(s"$key($start, $end):")
//The content of the window
input.foreach {
e => sb.append(e._2 + ",")
}
out.collect(sb.toString().substring(0, sb.length - 1))
}
}).print()
env.execute()
}
}
While it would sometimes be helpful if Flink offered per-key watermarking, it does not.
Each parallel instance of your WatermarkStrategy (or in this case, of your AssignerWithPunctuatedWatermarks) is generating watermarks independently, based on the timestamps of the events it observes (regardless of their keys).
One way to work around the lack of this feature is to not use watermarks at all. For example, if you would be using per-key watermarks to trigger keyed event-time windows, you can instead implement your own windows with a KeyedProcessFunction: instead of using watermarks to trigger event-time timers, keep track of the largest timestamp seen so far for each key, and whenever that value is updated, check whether one or more windows for that key can now be closed.
See one of the Flink training lessons for an example of how to implement keyed tumbling windows with a KeyedProcessFunction. This example depends on watermarks but should help you get started.
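If it helps, here is a minimal sketch of that idea (it is not the training example): a KeyedProcessFunction that keeps, per key, a map of open tumbling windows plus the largest timestamp seen so far, and closes a window once that key's own maximum timestamp passes the window end. The Event case class, its fields, and the simple counting logic are assumptions for illustration.

import scala.collection.JavaConverters._

import org.apache.flink.api.common.state.{MapState, MapStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

case class Event(key: String, timestamp: Long)

class PerKeyTumblingCount(windowSizeMs: Long) extends KeyedProcessFunction[String, Event, String] {

  // event count per window start, for the current key
  private lazy val windows: MapState[java.lang.Long, java.lang.Long] =
    getRuntimeContext.getMapState(new MapStateDescriptor[java.lang.Long, java.lang.Long](
      "windows", classOf[java.lang.Long], classOf[java.lang.Long]))

  // largest timestamp seen so far for the current key
  private lazy val maxTimestamp: ValueState[java.lang.Long] =
    getRuntimeContext.getState(new ValueStateDescriptor[java.lang.Long](
      "maxTimestamp", classOf[java.lang.Long]))

  override def processElement(e: Event,
                              ctx: KeyedProcessFunction[String, Event, String]#Context,
                              out: Collector[String]): Unit = {
    // add the element to its tumbling window
    val windowStart = e.timestamp - (e.timestamp % windowSizeMs)
    val count = Option(windows.get(windowStart)).map(_.longValue()).getOrElse(0L)
    windows.put(windowStart, count + 1)

    // advance this key's private "watermark" and close windows that end at or before it
    val previousMax = Option(maxTimestamp.value()).map(_.longValue()).getOrElse(Long.MinValue)
    if (e.timestamp > previousMax) {
      maxTimestamp.update(e.timestamp)
      val closable = windows.keys().asScala.filter(_ + windowSizeMs <= e.timestamp).toList
      closable.foreach { start =>
        out.collect(s"key=${ctx.getCurrentKey} window=[$start, ${start + windowSizeMs}) count=${windows.get(start)}")
        windows.remove(start)
      }
    }
  }
}

Nothing here depends on the operator-wide watermark, so a key whose events lag behind the others does not get its windows closed prematurely by faster keys.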
I am new to Flink and doing something very similar to the below link.
Cannot see message while sinking kafka stream and cannot see print message in flink 1.2
I am also trying to use JSONDeserializationSchema() as the deserializer for my Kafka input JSON messages, which have no key.
But I found that JSONDeserializationSchema() is not present.
Please let me know if I am doing anything wrong.
JSONDeserializationSchema was removed in Flink 1.8, after having been deprecated earlier.
The recommended approach is to write a deserializer that implements DeserializationSchema<T>. Here's an example, which I've copied from the Flink Operations Playground:
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
/**
 * A Kafka {@link DeserializationSchema} to deserialize {@link ClickEvent}s from JSON.
 */
public class ClickEventDeserializationSchema implements DeserializationSchema<ClickEvent> {

    private static final long serialVersionUID = 1L;

    private static final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public ClickEvent deserialize(byte[] message) throws IOException {
        return objectMapper.readValue(message, ClickEvent.class);
    }

    @Override
    public boolean isEndOfStream(ClickEvent nextElement) {
        return false;
    }

    @Override
    public TypeInformation<ClickEvent> getProducedType() {
        return TypeInformation.of(ClickEvent.class);
    }
}
For a Kafka producer you'll want to implement KafkaSerializationSchema<T>, and you'll find examples of that in that same project.
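As a rough sketch (in Scala, assuming the same ClickEvent class from above and JSON serialization; the topic handling is up to you), the producer-side schema could look like this:

import java.lang.{Long => JLong}

import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema
import org.apache.kafka.clients.producer.ProducerRecord

class ClickEventSerializationSchema(topic: String) extends KafkaSerializationSchema[ClickEvent] {

  // ObjectMapper is not serializable, so create it lazily on the task manager
  @transient private lazy val objectMapper = new ObjectMapper()

  override def serialize(element: ClickEvent, timestamp: JLong): ProducerRecord[Array[Byte], Array[Byte]] =
    new ProducerRecord[Array[Byte], Array[Byte]](topic, objectMapper.writeValueAsBytes(element))
}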
To solve the problem of reading keyless JSON messages from Kafka I used a case class and a JSON parser.
The following code defines a case class and parses the JSON fields using the Play API.
import play.api.libs.json.JsValue
object CustomerModel {
  def readElement(jsonElement: JsValue): Customer = {
    val id = (jsonElement \ "id").get.toString().toInt
    val name = (jsonElement \ "name").get.toString()
    Customer(id, name)
  }

  case class Customer(id: Int, name: String)
}
import java.util.Properties
import scala.util.Try
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import play.api.libs.json.Json
import CustomerModel.Customer

def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val properties = new Properties()
  properties.setProperty("bootstrap.servers", "xxx.xxx.0.114:9092")
  properties.setProperty("group.id", "test-grp")
  val consumer = new FlinkKafkaConsumer[String]("customer", new SimpleStringSchema(), properties)
  val stream1 = env.addSource(consumer).rebalance
  val stream2: DataStream[Customer] = stream1.map { str =>
    Try(CustomerModel.readElement(Json.parse(str)))
      .getOrElse(Customer(0, Try(CustomerModel.readElement(Json.parse(str))).toString))
  }
  stream2.print("stream2")
  env.execute("This is Kafka+Flink")
}
The Try lets you recover from the exception thrown while parsing the data: it can return the exception in one of the fields (if we want), or it can just return the case class object with the given or default fields.
The sample output of the code is:
stream2:1> Customer(1,"Thanh")
stream2:1> Customer(5,"Huy")
stream2:3> Customer(0,Failure(com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input
at [Source: ; line: 1, column: 0]))
I am not sure if it is the best approach but it is working for me as of now.
For KeyedStream#timeWindow#process, I am wondering whether one window instance will only contain elements with the same key, and whether different keys will use different window instances.
From the output of the following application, I see that one window instance only contains a single key, and different keys use different windows.
But I want to ask and confirm, thanks!
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.util.Random
class KeyByAndWindowAndProcessTestSource extends RichParallelSourceFunction[Int] {
override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
while (true) {
val i = new Random().nextInt(30)
ctx.collect(i)
ctx.collect(i)
ctx.collect(i)
Thread.sleep(1000)
}
}
override def cancel(): Unit = {
}
}
The application is:
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import org.apache.flink.api.scala._
object KeyByAndWindowTest {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.getCheckpointConfig.setCheckpointInterval(10 * 1000)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
val ds: DataStream[Int] = env.addSource(new KeyByAndWindowAndProcessTestSource)
val ds2 = ds.keyBy(i => i).timeWindow(Time.seconds(4)).process(new MyProcessFunction())
ds2.print()
env.execute()
}
}
class MyProcessFunction extends ProcessWindowFunction[Int, String, Int, TimeWindow] {
override def process(
key: Int,
ctx: Context,
vals: Iterable[Int],
out: Collector[String]): Unit = {
println(new java.util.Date())
println(s"key=${key}, vals = ${vals.mkString(",")}, hashCode=${System.identityHashCode(ctx.window)}")
}
}
The output is:
Sat Sep 14 13:08:24 CST 2019
key=26, vals = 26,26,26, hashCode=838523304
Sat Sep 14 13:08:24 CST 2019
key=28, vals = 28,28,28, hashCode=472721641
Sat Sep 14 13:08:24 CST 2019
key=18, vals = 18,18,18,18,18,18, hashCode=1668151956
Actually, with respect to ProcessingTimeWindow, a new window object is created for each element.
Here is the source code of TumblingProcessingTimeWindows#assignWindows:
public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
final long now = context.getCurrentProcessingTime();
long start = TimeWindow.getWindowStartWithOffset(now, offset, size);
return Collections.singletonList(new TimeWindow(start, start + size));
}
So System.identityHashCode returns a different value for every one of these window objects, regardless of the key, and your test code does not prove anything.
Under the hood, elements are grouped by the key of elementKey + assignedWindow, so I think it's right to say "one window instance will only contain the same key, and different keys will use different window instances".
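As a side note (my suggestion, not part of the original test): TimeWindow implements equals/hashCode based on its start and end, so printing those, or the regular hashCode, shows which elements were assigned to the same logical window even though the objects differ:

// inside MyProcessFunction#process: compare logical windows via start/end (or hashCode), not object identity
println(s"key=$key, window=[${ctx.window.getStart}, ${ctx.window.getEnd}), " +
  s"hashCode=${ctx.window.hashCode()}, vals=${vals.mkString(",")}")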
Original Answer:
I hope I get your question right...
ProcessWindowFunction#process will be invoked once for each window and key (or multiple times, depending on the window's trigger). Internally, window and key make up a composite partition key.
In terms of Java object instances, one instance of ProcessWindowFunction will deal with many keys. Specifically, there will be as many ProcessWindowFunction instances as the degree of parallelism.
Follow Up:
So I did not get it right :)
For every record processed by the WindowOperator, a new Window object is created with the correct start/end time for the record.
This means that each invocation of ProcessWindowFunction#process will be passed a new Window object.
It is important to understand that a Window in Flink is a very lightweight object, which is just used as an additional part (the namespace) of the overall key. It does not hold any data or logic.
May I ask for the background of the question?
I am implementing a SourceFunction which reads data from a database.
The job should be able to be resumed if stopped or crashed (i.e., via savepoints and checkpoints), with the data being processed exactly once.
What I have so far:
@SerialVersionUID(1L)
class JDBCSource(private val waitTimeMs: Long) extends
  RichParallelSourceFunction[Event] with StoppableFunction with LazyLogging {

  @transient var client: PostGreClient = _
  @volatile var isRunning: Boolean = true

  val DEFAULT_WAIT_TIME_MS = 1000

  def this(clientConfig: Serializable) =
    this(clientConfig, DEFAULT_WAIT_TIME_MS)

  override def stop(): Unit = {
    this.isRunning = false
  }

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    client = new JDBCClient
  }

  override def run(ctx: SourceFunction.SourceContext[Event]): Unit = {
    while (isRunning) {
      val statement = client.getConnection.createStatement()
      val resultSet = statement.executeQuery("SELECT name, timestamp FROM MYTABLE")
      while (resultSet.next()) {
        val name: String = resultSet.getString("name")
        val timestamp: Long = resultSet.getLong("timestamp")
        ctx.collectWithTimestamp(new Event(name, timestamp), timestamp)
      }
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
How can I make sure to only get the rows of the database which aren't processed yet?
I assumed the ctx variable would have some information about the current watermark so that I could change my query to something like:
select name, timestamp from myTable where timestamp > ctx.getCurrentWaterMark
But it doesn't have any relevant methods for me. Any ideas on how to solve this problem would be appreciated.
You have to implement CheckpointedFunction so that you can manage checkpointing yourself. The documentation of the interface is pretty comprehensive, but if you need an example I advise you to take a look at one.
In essence, your function must implement CheckpointedFunction#snapshotState to store the state you need using Flink's managed state and then, when performing a restore, it will read that same state in CheckpointedFunction#initializeState.
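Below is a minimal sketch of how that could look for this source (not a drop-in replacement: it is simplified to a non-parallel RichSourceFunction, uses plain JDBC instead of the asker's PostGreClient, and the Event case class and column names are assumptions). The largest emitted timestamp is kept in operator list state, restored in initializeState, and used to restrict the query to rows that have not been processed yet.

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

case class Event(name: String, timestamp: Long)

class CheckpointedJDBCSource(jdbcUrl: String, waitTimeMs: Long)
  extends RichSourceFunction[Event] with CheckpointedFunction {

  @volatile private var isRunning = true
  private var lastTimestamp: Long = Long.MinValue // largest timestamp emitted so far
  @transient private var offsetState: ListState[java.lang.Long] = _

  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    // called under the source's checkpoint lock, so lastTimestamp is consistent with what was emitted
    offsetState.clear()
    offsetState.add(lastTimestamp)
  }

  override def initializeState(context: FunctionInitializationContext): Unit = {
    offsetState = context.getOperatorStateStore.getListState(
      new ListStateDescriptor[java.lang.Long]("lastTimestamp", classOf[java.lang.Long]))
    // on restore, resume from the checkpointed offset
    val it = offsetState.get().iterator()
    while (it.hasNext) lastTimestamp = math.max(lastTimestamp, it.next().longValue())
  }

  override def run(ctx: SourceFunction.SourceContext[Event]): Unit = {
    val connection = java.sql.DriverManager.getConnection(jdbcUrl)
    val query = connection.prepareStatement("SELECT name, timestamp FROM MYTABLE WHERE timestamp > ?")
    while (isRunning) {
      query.setLong(1, lastTimestamp)
      val resultSet = query.executeQuery()
      // emit rows and advance the offset atomically with respect to checkpoints
      ctx.getCheckpointLock.synchronized {
        while (resultSet.next()) {
          val name = resultSet.getString("name")
          val ts = resultSet.getLong("timestamp")
          ctx.collectWithTimestamp(Event(name, ts), ts)
          lastTimestamp = math.max(lastTimestamp, ts)
        }
      }
      resultSet.close()
      Thread.sleep(waitTimeMs)
    }
    connection.close()
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

Note that this assumes the timestamp column only ever grows; a row inserted later with an older timestamp would be skipped, so in practice a strictly increasing id column is often a better offset.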
We're using Flink to monitor each event. The detailed scenario is: when an event arrives, Flink finds all events with the same userid in the last 2 hours and sums their count fields. For example:
event1<userid1, n1, t0> -> real time result = n1
event2<userid2, n2, t0+1h> -> real time result = n2
event3<userid1, n3, t0+1h> -> real time result = n1+n3
event4<userid1, n4, t0+2.5h> -> real time result = n3+n4
How could we implement such a scenario in Flink? Intuitively we would use a sliding window, but there are two problems:
1. In Flink, a sliding window slides by the parameter slide_size. In our scenario, however, the window slides with each event, so the start/end of the window is different for every event (the expected window range is [eventtime-2h, eventtime)). Should we implement this by setting a very small slide_size (10 ms?)?
2. The process function is only executed when the trigger fires, which means we can't get the result immediately as soon as an event arrives.
You can achieve this with a ProcessFunction (a KeyedProcessFunction, since the stream is keyed by userid).
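A minimal sketch of that approach (none of this is from the original answer: the UserEvent type, its field names, and the state layout are assumptions): keep the last two hours of counts per user in keyed state and, for every incoming event, emit the sum over the two hours up to and including that event right away.

import scala.collection.JavaConverters._

import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

case class UserEvent(userId: String, count: Long, eventTime: Long)

class RollingTwoHourSum extends KeyedProcessFunction[String, UserEvent, (String, Long)] {

  private val twoHoursMs = 2 * 60 * 60 * 1000L

  // count per event timestamp, for the current user
  private lazy val counts: MapState[java.lang.Long, java.lang.Long] =
    getRuntimeContext.getMapState(new MapStateDescriptor[java.lang.Long, java.lang.Long](
      "counts", classOf[java.lang.Long], classOf[java.lang.Long]))

  override def processElement(e: UserEvent,
                              ctx: KeyedProcessFunction[String, UserEvent, (String, Long)]#Context,
                              out: Collector[(String, Long)]): Unit = {
    // add this event's count (several events may share a timestamp)
    val existing = Option(counts.get(e.eventTime)).map(_.longValue()).getOrElse(0L)
    counts.put(e.eventTime, existing + e.count)

    // drop entries older than two hours; in production you would also register a timer
    // to clean up the state of users that stop sending events
    val cutoff = e.eventTime - twoHoursMs
    val expired = counts.keys().asScala.filter(_ < cutoff).toList
    expired.foreach(counts.remove)

    // sum what is left (ignoring entries from out-of-order "future" events) and emit immediately
    val sum = counts.entries().asScala
      .filter(_.getKey <= e.eventTime)
      .map(_.getValue.longValue())
      .sum
    out.collect((e.userId, sum))
  }
}

Used as events.keyBy(_.userId).process(new RollingTwoHourSum), the sum is computed and emitted inside processElement, so the result is available as soon as each event arrives, without waiting for any window trigger.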
You can keep using the sliding window, but use your own Trigger to emit the elements arriving in the window. Sample code might look like this:
src.map(x => new Tuple2(x.id, x.value))
.keyBy(0)
.timeWindow(Time.seconds(2), Time.seconds(1))
.trigger(new Trigger[Tuple2[String, Int], TimeWindow] {
override def onEventTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
override def onProcessingTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.FIRE
}
override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
}
override def onElement(element: Tuple2[String, Int], timestamp: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.FIRE
}
})
.sum(1)