I'm assigning timestamps and watermarks like this
def myProcess(dataStream: DataStream[Foo]) = {
  dataStream
    .assignTimestampsAndWatermarks(
      WatermarkStrategy
        .forBoundedOutOfOrderness[Foo](Duration.ofSeconds(5))
        .withTimestampAssigner(new SerializableTimestampAssigner[Foo]() {
          // Long is milliseconds since the Epoch
          override def extractTimestamp(element: Foo, recordTimestamp: Long): Long = element.eventTimestamp
        })
    )
    .keyBy(_.id)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .reduce(new MyReducerFn(), new MyWindowFunction())
}
I have a unit test
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val dataTime = 1641488400000L
val lateTime = dataTime + Duration.ofHours(5).toMillis

val dataSource = env
  .addSource(new SourceFunction[Foo]() {
    override def run(ctx: SourceFunction.SourceContext[Foo]): Unit = {
      ctx.collect(Foo(id = 1, value = 1, eventTimestamp = dataTime))
      ctx.collect(Foo(id = 1, value = 1, eventTimestamp = lateTime))
      // should be dropped for being past the max lateness
      ctx.collect(Foo(id = 1, value = 2, eventTimestamp = dataTime))
    }

    override def cancel(): Unit = {}
  })

myProcess(dataSource).addSink(new MyTestSink)
env.execute("watermark & lateness test")
I expected that the second element would advance the watermark to (lateTime - Duration.ofSeconds(5)) (i.e. the bounded out-of-orderness), and that the third element would therefore not be assigned to the 1-hour tumbling window, since the watermark had advanced considerably past it. However, I see both the first and third elements reach my reduce function.
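To spell out the arithmetic behind my expectation (just my reasoning, not output from the job):

val dataTime = 1641488400000L                           // 2022-01-06 17:00:00 UTC
val lateTime = dataTime + Duration.ofHours(5).toMillis  // dataTime + 5 hours
val outOfOrderness = Duration.ofSeconds(5).toMillis

// After the second element I expect the watermark to be roughly
val expectedWatermark = lateTime - outOfOrderness
// The 1-hour tumbling window containing dataTime closes at dataTime + 1 hour,
// which is hours behind expectedWatermark, so I expected the third element
// (timestamp dataTime) to be dropped as late.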
I must be misunderstanding watermarks here, or what forBoundedOutOfOrderness does? Can someone please clarify?
Thanks!
I must have typo'd something; I have it working now.
Related
There are two data streams with timestamp assigners and watermark generators defined as follows.
val streamA: DataStream[A] = kafkaStreamASourceOutput.assignTimestampsAndWatermarks(
  WatermarkStrategy
    .forBoundedOutOfOrderness[A](Duration.ofSeconds(0))
    .withTimestampAssigner(new SerializableTimestampAssigner[A] {
      override def extractTimestamp(element: A, recordTimestamp: Long): Long = {
        element.lastUpdatedMs
      }
    })
)
val streamB: DataStream[B] = kafkaStreamBSourceOutput.assignTimestampsAndWatermarks(
  WatermarkStrategy
    .forBoundedOutOfOrderness[B](Duration.ofSeconds(0))
    .withTimestampAssigner(new SerializableTimestampAssigner[B] {
      override def extractTimestamp(element: B, recordTimestamp: Long): Long = {
        element.lastUpdatedMs
      }
    })
)
When these two streams are connected in an operator, the minimum of the watermarks from streamA and streamB acts as the watermark of the connecting operator.
class CombineAB extends CoProcessFunction[A, B, C] {
  override def processElement1(elem: A, ctx: Context, out: Collector[C]): Unit = {
    out.collect(C(elem.x, elem.y, time.Now()))
  }

  override def processElement2(elem: B, ctx: Context, out: Collector[C]): Unit = {
    out.collect(C(elem.x, elem.y, time.Now()))
  }
}
val streamC: DataStream[C] = streamA.connect(streamB)
  .process(new CombineAB)
The watermark of the CombineAB operator is the minimum of the watermarks of A and B. Based on that, the elements of type C are marked late or not.
But since we have not attached any timestamp assigner to C, does that mean none of the elements from the CombineAB operator are marked late? Hence windowing on C will not have any late records being dropped?
Let's say we attach a timestamp assigner and watermark generator to C as follows. Does that mean the watermarks from A and B are completely ignored, and the watermark downstream of CombineAB depends only on the timestamp field of C and the lateness defined for C?
streamC.assignTimestampsAndWatermarks(
  WatermarkStrategy
    .forBoundedOutOfOrderness[C](Duration.ofSeconds(0))
    .withTimestampAssigner(new SerializableTimestampAssigner[C] {
      override def extractTimestamp(element: C, recordTimestamp: Long): Long = {
        element.updatedTime
      }
    })
)
Isn't there a way that I can attach a timestamp assigner to C such that the watermark for CombineAB is still the minimum of A and B, and the elements of C are marked late based on C's assigned timestamp and the watermark of CombineAB?
Update: Refined implementation of CombineAB
A few points:
forBoundedOutOfOrderness[A](Duration.ofSeconds(0)) is unusual. Any out-of-order events will be late. Why not use forMonotonousTimestamps()?
The records produced by CombineAB will have timestamps; there's no need to apply assignTimestampsAndWatermarks to this stream. The timestamp of any records produced by the Collector is the timestamp of the incoming record.
If you do call assignTimestampsAndWatermarks on stream C, the incoming watermarks will be filtered out, and you'll need to generate new watermarks.
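For instance (a rough sketch; the key field of C and the window size are assumptions, not taken from the code above), streamC can be windowed directly and will still be driven by the minimum of A's and B's watermarks:

val windowedC = streamC
  .keyBy(_.x)                                            // assuming x is a suitable key field of C
  .window(TumblingEventTimeWindows.of(Time.minutes(1)))  // assumed window size
  .reduce((c1, _) => c1)                                 // placeholder reduce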
This is my application code
object StreamingJob {
  def main(args: Array[String]): Unit = {
    // set up the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // define EventTime and Watermark
    val sensorData: DataStream[SensorReading] = env.addSource(new SensorSource).assignTimestampsAndWatermarks(
      WatermarkStrategy
        .forBoundedOutOfOrderness[SensorReading](Duration.ofSeconds(0))
        .withTimestampAssigner(new SerializableTimestampAssigner[SensorReading] {
          override def extractTimestamp(t: SensorReading, l: Long): Long = t.timestamp
        })
    )

    val outputTag = OutputTag[SensorReading]("late-event")

    val minTemp: DataStream[String] = sensorData
      .map(r => {
        val celsius = (r.temperature - 32) * (5.0 / 9.0)
        SensorReading(r.id, r.timestamp, celsius)
      })
      .keyBy(_.id)
      .window(TumblingEventTimeWindows.of(Time.seconds(5)))
      .allowedLateness(Time.seconds(10))
      .sideOutputLateData(outputTag)
      // compute min temperature
      .process(new TemperatureMiner)

    val lateStream: DataStream[SensorReading] = minTemp.getSideOutput(outputTag)
    lateStream.map(r => s"late event: ${r.id}, ${r.timestamp}, ${r.temperature}").print()
    minTemp.print()

    // execute program
    env.execute("Flink Streaming Scala API Skeleton")
  }
}
I am pretty sure that late data is being captured, because I can see from the logs that TemperatureMiner is invoked multiple times for one window, hence late firing.
But the problem is that nothing is printed by lateStream from the side output for late data. Any idea why?
A side output for late data in a window is only sent data that is so late that it falls outside the allowed lateness. Perhaps none of your late data is late enough.
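As a concrete sketch of that condition (numbers taken from the job above; this is the usual event-time windowing rule, stated roughly rather than to the millisecond):

// With TumblingEventTimeWindows.of(Time.seconds(5)) and allowedLateness(Time.seconds(10)),
// consider a record whose timestamp falls in the window [0 ms, 5000 ms):
val windowEnd       = 5000L   // ms, exclusive end of the window
val allowedLateness = 10000L  // ms

// watermark <  windowEnd                                -> on time, normal firing
// windowEnd <= watermark < windowEnd + allowedLateness  -> late firing of the window (what you observe)
// watermark >= windowEnd + allowedLateness              -> window state is purged, record goes to the side output
val sideOutputThreshold = windowEnd + allowedLateness    // roughly 15000 ms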
I have a simple Flink application to illustrate the usage of KeyedStream#max
import com.huawei.flink.time.Box
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

object KeyStreamMaxTest {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  def main(args: Array[String]): Unit = {
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    env.setParallelism(1)
    env.setMaxParallelism(1)

    val ds = env.fromElements("X,Red,10", "Y,Blue,10", "Z,Black, 22", "U,Green,22", "N,Blue,25", "M,Green,23")

    val ds2 = ds.map { line =>
      val Array(name, color, size) = line.split(",")
      Box(name.trim, color.trim, size.trim.toInt)
    }.keyBy(_.color).max("size")

    ds2.print()
    env.execute()
  }
}
The output is:
Box(X,Red,10)
Box(Y,Blue,10)
Box(Z,Black,22)
Box(U,Green,22)
Box(Y,Blue,25) -- I thought this should be Box(N,Blue,25)
Box(U,Green,23)
It looks like Flink only replaces the size, but keeps the name and color unchanged.
What is the practical use of this behavior? I would have thought it natural to get the whole record that has the max size.
Sometimes all you need to know is the max value, for each key, of one field. I believe max is able to provide this information while doing less work than the more generally useful maxBy, which returns the whole record that has the max size.
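A minimal sketch of the contrast (reusing the input stream ds and the Box mapping from the question; the printed values for the Blue key are what I would expect, not verified output):

val boxes = ds.map { line =>
  val Array(name, color, size) = line.split(",")
  Box(name.trim, color.trim, size.trim.toInt)
}

// maxBy returns the whole record holding the max size per key:
boxes.keyBy(_.color).maxBy("size").print()  // Blue eventually yields Box(N,Blue,25)

// max only guarantees the aggregated field; the other fields stay as in the
// first record seen for the key:
boxes.keyBy(_.color).max("size").print()    // Blue eventually yields Box(Y,Blue,25)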
I would like to aggregate a stream of trades into windows of the same trade volume, which is the sum of the trade size of all the trades in the interval.
I was able to write a custom Trigger that partitions the data into windows. Here is the code:
case class Trade(key: Int, millis: Long, time: LocalDateTime, price: Double, size: Int)

class VolumeTrigger(triggerVolume: Int, config: ExecutionConfig) extends Trigger[Trade, Window] {
  val LOG: Logger = LoggerFactory.getLogger(classOf[VolumeTrigger])

  val stateDesc = new ValueStateDescriptor[Double]("volume", createTypeInformation[Double].createSerializer(config))

  override def onElement(event: Trade, timestamp: Long, window: Window, ctx: TriggerContext): TriggerResult = {
    val volume = ctx.getPartitionedState(stateDesc)
    if (volume.value == null) {
      volume.update(event.size)
      return TriggerResult.CONTINUE
    }
    volume.update(volume.value + event.size)
    if (volume.value < triggerVolume) {
      TriggerResult.CONTINUE
    } else {
      volume.update(volume.value - triggerVolume)
      TriggerResult.FIRE_AND_PURGE
    }
  }

  override def onEventTime(time: Long, window: Window, ctx: TriggerContext): TriggerResult = {
    TriggerResult.FIRE_AND_PURGE
  }

  override def onProcessingTime(time: Long, window: Window, ctx: TriggerContext): TriggerResult = {
    throw new UnsupportedOperationException("Not a processing time trigger")
  }

  override def clear(window: Window, ctx: TriggerContext): Unit = {
    ctx.getPartitionedState(stateDesc).clear()
  }
}
def main(args: Array[String]): Unit = {
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  val trades = env
    .readTextFile("/tmp/trades.csv")
    .map { line =>
      val cells = line.split(",")
      val time = LocalDateTime.parse(cells(0), DateTimeFormatter.ofPattern("yyyyMMdd HH:mm:ss.SSSSSSSSS"))
      val millis = time.toInstant(ZoneOffset.UTC).toEpochMilli
      Trade(0, millis, time, cells(1).toDouble, cells(2).toInt)
    }

  val aggregated = trades
    .assignAscendingTimestamps(_.millis)
    .keyBy("key")
    .window(GlobalWindows.create)
    .trigger(new VolumeTrigger(500, env.getConfig))
    .sum(4)

  aggregated.writeAsText("/tmp/trades_agg.csv")
  env.execute("volume agg")
}
The data looks, for example, as follows:
20180102 04:00:29.715706404,169.10,100
20180102 04:00:29.715715627,169.10,100
20180102 05:08:29.025299624,169.12,100
20180102 05:08:29.025906589,169.10,214
20180102 05:08:29.327113252,169.10,200
20180102 05:09:08.350939314,169.00,100
20180102 05:09:11.532817015,169.00,474
20180102 06:06:55.373584329,169.34,200
20180102 06:07:06.993081961,169.34,100
20180102 06:07:08.153291898,169.34,100
20180102 06:07:20.081524768,169.34,364
20180102 06:07:22.838656715,169.34,200
20180102 06:07:24.561360031,169.34,100
20180102 06:07:37.774385969,169.34,100
20180102 06:07:39.305219107,169.34,200
I have a time stamp, a price and a size.
The above code can partition it into windows of roughly the same size:
Trade(0,1514865629715,2018-01-02T04:00:29.715706404,169.1,514)
Trade(0,1514869709327,2018-01-02T05:08:29.327113252,169.1,774)
Trade(0,1514873215373,2018-01-02T06:06:55.373584329,169.34,300)
Trade(0,1514873228153,2018-01-02T06:07:08.153291898,169.34,464)
Trade(0,1514873242838,2018-01-02T06:07:22.838656715,169.34,600)
Trade(0,1514873294898,2018-01-02T06:08:14.898397117,169.34,500)
Trade(0,1514873299492,2018-01-02T06:08:19.492589659,169.34,400)
Trade(0,1514873332251,2018-01-02T06:08:52.251339070,169.34,500)
Trade(0,1514873337928,2018-01-02T06:08:57.928680090,169.34,1000)
Trade(0,1514873338078,2018-01-02T06:08:58.078221995,169.34,1000)
Now I would like to partition the data so that the volume exactly matches the trigger value. For this I would need to change the data slightly, splitting a trade at the end of an interval into two parts: one that belongs to the actual window being fired, while the remaining volume above the trigger value has to be assigned to the next window.
Can that be handled with some custom aggregation function? It would need to know the results from the previous window(s), though, and I was not able to find out how to do that.
Any ideas from Apache Flink experts how to handle that case?
Adding an evictor does not work as it only purges some elements at the beginning.
I hope the change from Spark Structured Streaming to Flink was a good choice, as I have even more complicated situations to handle later.
Since your key is the same for all records, you may not require a window in this case. Please refer to this page in Flink's documentation https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/state.html#using-managed-keyed-state.
It has a CountWindowAverage class, wherein the aggregation of a value from each record in a stream is done using a state variable. You can implement this, send the output whenever the state variable reaches your trigger volume, and reset the state variable to the remaining volume.
A simple approach (though not super-efficient) would be to put a FlatMapFunction ahead of your windowing flow. If it's keyed the same way, then you can use ValueState to track total volume, and emit two records (the split) when it hits your limit.
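A minimal sketch of that FlatMapFunction idea (the class name, the 500 limit, and the use of trade.copy are assumptions for illustration, not code from this thread): it keeps the accumulated volume in keyed ValueState and splits a trade whenever the running total would cross the limit, so each window can fire on an exact volume boundary.

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical splitter; apply it on a stream keyed the same way as the window.
class VolumeSplitter(limit: Int) extends RichFlatMapFunction[Trade, Trade] {
  private var carried: ValueState[Int] = _

  override def open(parameters: Configuration): Unit = {
    carried = getRuntimeContext.getState(
      new ValueStateDescriptor[Int]("carried-volume", createTypeInformation[Int]))
  }

  override def flatMap(trade: Trade, out: Collector[Trade]): Unit = {
    var remaining = trade.size
    var current = carried.value()  // an unset Int state reads as 0 here
    // Emit a piece that exactly fills the current window whenever the limit is reached.
    while (current + remaining >= limit) {
      val emitted = limit - current
      out.collect(trade.copy(size = emitted))
      remaining -= emitted
      current = 0
    }
    // Whatever is left starts (or continues) the next window.
    if (remaining > 0) out.collect(trade.copy(size = remaining))
    carried.update(current + remaining)
  }
}

Applied on the keyed stream, e.g. trades.assignAscendingTimestamps(_.millis).keyBy(_.key).flatMap(new VolumeSplitter(500)), then re-keyed and windowed with the existing VolumeTrigger, every FIRE_AND_PURGE should land on an exact 500-share boundary.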
I wrote a simple stream using the akka-streams API, assuming it would handle my source, but unfortunately it doesn't. I am sure I am doing something wrong in my source. I simply created an iterator which generates a very large number of elements, assuming it wouldn't matter because the akka-streams API would take care of backpressure. What am I doing wrong? This is my iterator:
def createData(args: Array[String]): Iterator[TimeSeriesValue] = {
  var data = new ListBuffer[TimeSeriesValue]()
  for (i <- 1 to range) {
    sessionId = UUID.randomUUID()
    for (j <- 1 to countersPerSession) {
      time = DateTime.now()
      keyName = s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j"
      for (k <- 1 to snapShotCount) {
        time = time.plusSeconds(2)
        fValue = new Random().nextLong()
        data += TimeSeriesValue(sessionId, keyName, time, fValue)
        totalRows += 1
      }
    }
  }
  data.iterator
}
The problem is primarily in the line
data += TimeSeriesValue(sessionId, keyName, time, fValue)
You are continuously adding to the ListBuffer with a "very large number of elements". This is chewing up all of your RAM. The data.iterator line is simply wrapping the massive ListBuffer blob inside of an iterator to provide each element one at a time, it's basically just a cast.
Your assumption that "it won't matter because ... of backpressure" is only partially true: the akka Stream will process the TimeSeriesValue values reactively, but you are creating a large number of them before you even get to the Source constructor.
If you want this iterator to be "lazy", i.e. only produce values when needed and not consume memory, then make the following modifications (note: I broke apart the code to make it more readable):
def createTimeSeries(startTime: DateTime, snapShotCount: Int, sessionId: UUID, keyName: String) =
  Iterator.range(1, snapShotCount + 1)
    .map(_ * 2)
    .map(startTime plusSeconds _)
    .map(t => TimeSeriesValue(sessionId, keyName, t, ThreadLocalRandom.current().nextLong()))

def sessionGenerator(countersPerSession: Int, sessionId: UUID) =
  Iterator.range(1, countersPerSession + 1)
    .map(j => s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j")
    .flatMap { keyName =>
      createTimeSeries(DateTime.now(), snapShotCount, sessionId, keyName)
    }

object UUIDIterator extends Iterator[UUID] {
  def hasNext: Boolean = true
  def next(): UUID = UUID.randomUUID()
}

def iterateOverIDs(range: Int) =
  UUIDIterator.take(range)
    .flatMap(sessionId => sessionGenerator(countersPerSession, sessionId))
Each one of the above functions returns an Iterator. Therefore, calling iterateOverIDs should be instantaneous, because no work is done up front and de minimis memory is consumed. This iterator can then be passed into your Stream...
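For example (a sketch; range and countersPerSession come from your surrounding code):

import akka.NotUsed
import akka.stream.scaladsl.Source

// Elements are generated on demand, under backpressure, as the stream pulls them.
val timeSeriesSource: Source[TimeSeriesValue, NotUsed] =
  Source.fromIterator(() => iterateOverIDs(range))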