What's the practical use of KeyedStream#max - apache-flink

I have a simple Flink application to illustrate the usage of KeyedStream#max
import com.huawei.flink.time.Box
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}

object KeyStreamMaxTest {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  def main(args: Array[String]): Unit = {
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    env.setParallelism(1)
    env.setMaxParallelism(1)
    val ds = env.fromElements("X,Red,10", "Y,Blue,10", "Z,Black, 22", "U,Green,22", "N,Blue,25", "M,Green,23")
    val ds2 = ds.map { line =>
      val Array(name, color, size) = line.split(",")
      Box(name.trim, color.trim, size.trim.toInt)
    }.keyBy(_.color).max("size")
    ds2.print()
    env.execute()
  }
}
The output is:
Box(X,Red,10)
Box(Y,Blue,10)
Box(Z,Black,22)
Box(U,Green,22)
Box(Y,Blue,25) -- I thought this should be ("N,Blue,25")
Box(U,Green,23)
It looks like Flink only replaces the size but keeps the name and color unchanged.
What is the practical use of this behavior? I would have thought it natural to get the whole record that has the max size.

Sometimes all you need to know is the max value of one field for each key. max can provide this information while doing less work than the more generally useful maxBy, which returns the whole record that contains the max value.
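Here is a minimal sketch contrasting the two operators (it redefines a Box case class with the same shape as the one in the question, since com.huawei.flink.time.Box is not available here):

import org.apache.flink.streaming.api.scala._

case class Box(name: String, color: String, size: Int)

object MaxVsMaxBy {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val boxes = env.fromElements(Box("Y", "Blue", 10), Box("N", "Blue", 25))

    // max("size"): only "size" is guaranteed to hold the maximum;
    // the other fields are carried over from an earlier record of the key.
    boxes.keyBy(_.color).max("size").print()    // last element printed: Box(Y,Blue,25)

    // maxBy("size"): the whole record that holds the maximum size.
    boxes.keyBy(_.color).maxBy("size").print()  // last element printed: Box(N,Blue,25)

    env.execute("max vs maxBy sketch")
  }
}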

Related

Flink Watermark forBoundedOutOfOrderness includes data beyond boundaries

I'm assigning timestamps and watermarks like this:
def myProcess(dataStream: DataStream[Foo]) {
  dataStream
    .assignTimestampsAndWatermarks(
      WatermarkStrategy
        .forBoundedOutOfOrderness[Foo](Duration.ofSeconds(5))
        .withTimestampAssigner(new SerializableTimestampAssigner[Foo]() {
          // Long is milliseconds since the Epoch
          override def extractTimestamp(element: Foo, recordTimestamp: Long): Long = element.eventTimestamp
        })
    )
    .keyBy({ k => k.id })
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .reduce(new MyReducerFn(), new MyWindowFunction())
}
I have a unit test:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val dataTime = 1641488400000L
val lateTime = dataTime + Duration.ofHours(5).toMillis

val dataSource = env
  .addSource(new SourceFunction[Foo]() {
    def run(ctx: SourceFunction.SourceContext[Foo]) {
      ctx.collect(Foo(id = 1, value = 1, eventTimestamp = dataTime))
      ctx.collect(Foo(id = 1, value = 1, eventTimestamp = lateTime))
      // should be dropped due to past max lateness
      ctx.collect(Foo(id = 1, value = 2, eventTimestamp = dataTime))
    }
    override def cancel(): Unit = {}
  })

myProcess(dataSource).collect(new MyTestSink)
env.execute("watermark & lateness test")
I expected the second element to advance the watermark to (lateTime - Duration.ofSeconds(5)), i.e. minus the bounded out-of-orderness, and therefore that the third element would not be assigned to the 1-hour tumbling window, since the watermark had advanced considerably past it. However, I see both the first and the third element reach my reduce function.
Do I misunderstand watermarks here, or what forBoundedOutOfOrderness does? Can I please get some clarity?
Thanks!
I must have typo'd something; I have it working now.
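For reference, a minimal sketch of the watermark arithmetic forBoundedOutOfOrderness uses (this mirrors how Flink's BoundedOutOfOrdernessWatermarks is documented to behave; note that these watermarks are emitted periodically rather than per record, which matters in fast-running tests):

// Sketch only, not Flink's actual class.
class BoundedOutOfOrdernessSketch(outOfOrdernessMillis: Long) {
  // Highest event timestamp seen so far; initialised this way to avoid underflow.
  private var maxTimestamp: Long = Long.MinValue + outOfOrdernessMillis + 1

  def onEvent(eventTimestamp: Long): Unit =
    maxTimestamp = math.max(maxTimestamp, eventTimestamp)

  // Invoked on the periodic emit interval (200 ms by default), not once per element.
  def currentWatermark: Long = maxTimestamp - outOfOrdernessMillis - 1
}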

Slicing the first row of a Dataframe into an Array[String]

import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.expressions.Window
import scala.runtime.ScalaRunTime.{array_apply, array_update}
import scala.collection.mutable.Map

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local")
    val sc = new SparkContext(conf)
    val input = "file:///home/shahid/Desktop/sample1.csv"
    val hdfsOutput = "hdfs://localhost:9001/output.csv"
    val localOutput = "file:///home/shahid/Desktop/output"
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.format("com.databricks.spark.csv").load(input)
    var colLen = df.columns.length
    val df1 = df.filter(!(col("_c1") === ""))
I am capturing the top row into a val named headerArr.
    val headerArr = df1.head
I wanted this val to be Array[String].
println("class = "+headerArr.getClass)
What can I do to either typecast this headerArr into an Array[String] or get the top row directly into an Array[String]?
    val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9001"), sc.hadoopConfiguration)
    fs.delete(new org.apache.hadoop.fs.Path("/output.csv"), true)
    df1.write.csv(hdfsOutput)

    val fileTemp = new File("/home/shahid/Desktop/output/")
    if (fileTemp.exists)
      FileUtils.deleteDirectory(fileTemp)
    df1.write.csv(localOutput)

    sc.stop()
  }
}
I have also tried df1.first, but both return the same type.
The result of the above code on the console is as follows:
class = class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
Help needed.
Thank you for your time. xD
Given the following dataframe:
val df = spark.createDataFrame(Seq(("a", "hello"), ("b", "world"))).toDF("id", "word")
df.show()
+---+-----+
| id| word|
+---+-----+
| a|hello|
| b|world|
+---+-----+
You can get the first row as you already mentioned and then turn the result into a Seq, which is actually backed by a subtype of Array and which you can then "cast" to an array without copying:
// returns: WrappedArray(a, hello)
df.first.toSeq.asInstanceOf[Array[_]]
Casting is usually not good practice in a language with static typing as good as Scala's, so you'd probably want to stick to the Seq unless you really have a need for an Array.
Notice that thus far we always ended up not with an array of strings but with an array of objects, since the Row object in Spark has to accommodate various types. If you want to get to a collection of strings, you can iterate over the fields and extract the strings:
// returns: Vector(a, hello)
for (i <- 0 until df.first.length) yield df.first.getString(i)
This of course will cause a ClassCastException if the Row contains non-strings. Depending on your needs, you may also want to consider using a Try to silently drop non-strings within the for-comprehension:
import scala.util.Try
// same return type as before
// non-string members will be filtered out of the end result
for {
  i <- 0 until df.first.length
  field <- Try(df.first.getString(i)).toOption
} yield field
Until now we returned an IndexedSeq, which is suitable for efficient random access (i.e. has constant access time to any item in the collection) and in particular a Vector. Again, you may really need to return an Array. To return an Array[String] you may want to call toArray on the Vector, which unfortunately copies the whole thing.
You can skip this step and directly output an Array[String] by explicitly using flatMap instead of relying on the for-comprehension and using collection.breakOut:
// returns: Array[String] -- silently keeping strings only
0.until(df.first.length).
flatMap(i => Try(df.first.getString(i)).toOption)(collection.breakOut)
To learn more about builders and collection.breakOut you may want to have a read here.
Well, my problem wasn't solved in the best way, but I tried a workaround:
val headerArr = df1.first
var headerArray = new Array[String](colLen)
for (i <- 0 until colLen) {
  headerArray(i) = headerArr(i).toString
}
But I am still open to new suggestions, although I am slicing the dataframe into a value of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema and then transferring the elements to an Array[String] with an iteration.
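A shorter variant of the same workaround, as a sketch (it assumes none of the columns in the first row are null, since it relies on toString):

// Convert the whole first row to Array[String] in one expression.
val headerArray: Array[String] = df1.first.toSeq.map(_.toString).toArray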

Apache Flink Custom Window Aggregation

I would like to aggregate a stream of trades into windows of the same trade volume, which is the sum of the trade size of all the trades in the interval.
I was able to write a custom Trigger that partitions the data into windows. Here is the code:
case class Trade(key: Int, millis: Long, time: LocalDateTime, price: Double, size: Int)

class VolumeTrigger(triggerVolume: Int, config: ExecutionConfig) extends Trigger[Trade, Window] {
  val LOG: Logger = LoggerFactory.getLogger(classOf[VolumeTrigger])
  val stateDesc = new ValueStateDescriptor[Double]("volume", createTypeInformation[Double].createSerializer(config))

  override def onElement(event: Trade, timestamp: Long, window: Window, ctx: TriggerContext): TriggerResult = {
    val volume = ctx.getPartitionedState(stateDesc)
    if (volume.value == null) {
      volume.update(event.size)
      return TriggerResult.CONTINUE
    }
    volume.update(volume.value + event.size)
    if (volume.value < triggerVolume) {
      TriggerResult.CONTINUE
    } else {
      volume.update(volume.value - triggerVolume)
      TriggerResult.FIRE_AND_PURGE
    }
  }

  override def onEventTime(time: Long, window: Window, ctx: TriggerContext): TriggerResult = {
    TriggerResult.FIRE_AND_PURGE
  }

  override def onProcessingTime(time: Long, window: Window, ctx: TriggerContext): TriggerResult = {
    throw new UnsupportedOperationException("Not a processing time trigger")
  }

  override def clear(window: Window, ctx: TriggerContext): Unit = {
    ctx.getPartitionedState(stateDesc).clear()
  }
}

def main(args: Array[String]): Unit = {
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  val trades = env
    .readTextFile("/tmp/trades.csv")
    .map { line =>
      val cells = line.split(",")
      val time = LocalDateTime.parse(cells(0), DateTimeFormatter.ofPattern("yyyyMMdd HH:mm:ss.SSSSSSSSS"))
      val millis = time.toInstant(ZoneOffset.UTC).toEpochMilli
      Trade(0, millis, time, cells(1).toDouble, cells(2).toInt)
    }

  val aggregated = trades
    .assignAscendingTimestamps(_.millis)
    .keyBy("key")
    .window(GlobalWindows.create)
    .trigger(new VolumeTrigger(500, env.getConfig))
    .sum(4)

  aggregated.writeAsText("/tmp/trades_agg.csv")
  env.execute("volume agg")
}
The data looks for example as follows:
20180102 04:00:29.715706404,169.10,100
20180102 04:00:29.715715627,169.10,100
20180102 05:08:29.025299624,169.12,100
20180102 05:08:29.025906589,169.10,214
20180102 05:08:29.327113252,169.10,200
20180102 05:09:08.350939314,169.00,100
20180102 05:09:11.532817015,169.00,474
20180102 06:06:55.373584329,169.34,200
20180102 06:07:06.993081961,169.34,100
20180102 06:07:08.153291898,169.34,100
20180102 06:07:20.081524768,169.34,364
20180102 06:07:22.838656715,169.34,200
20180102 06:07:24.561360031,169.34,100
20180102 06:07:37.774385969,169.34,100
20180102 06:07:39.305219107,169.34,200
I have a time stamp, a price and a size.
The above code can partition it into windows of roughly the same total volume:
Trade(0,1514865629715,2018-01-02T04:00:29.715706404,169.1,514)
Trade(0,1514869709327,2018-01-02T05:08:29.327113252,169.1,774)
Trade(0,1514873215373,2018-01-02T06:06:55.373584329,169.34,300)
Trade(0,1514873228153,2018-01-02T06:07:08.153291898,169.34,464)
Trade(0,1514873242838,2018-01-02T06:07:22.838656715,169.34,600)
Trade(0,1514873294898,2018-01-02T06:08:14.898397117,169.34,500)
Trade(0,1514873299492,2018-01-02T06:08:19.492589659,169.34,400)
Trade(0,1514873332251,2018-01-02T06:08:52.251339070,169.34,500)
Trade(0,1514873337928,2018-01-02T06:08:57.928680090,169.34,1000)
Trade(0,1514873338078,2018-01-02T06:08:58.078221995,169.34,1000)
Now I would like to partition the data so that the volume exactly matches the trigger value. For this I would need to change the data slightly, by splitting a trade at the end of an interval into two parts: one that belongs to the window currently being fired, and the remaining volume above the trigger value, which has to be assigned to the next window.
Can that be handled with some custom aggregation function? It would need to know the results from the previous window(s), though, and I was not able to find out how to do that.
Any ideas from Apache Flink experts how to handle that case?
Adding an evictor does not work as it only purges some elements at the beginning.
I hope the change from Spark Structured Streaming to Flink was a good choice, as I will have even more complicated situations to handle later.
Since your key is the same for all records, you may not require a window in this case. Please refer to this page in Flink's documentation https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/state.html#using-managed-keyed-state.
It has a CountWindowAverage example, in which the aggregation of a value from each record in a stream is done using a state variable. You can implement the same pattern: send the output whenever the state variable reaches your trigger volume, and reset the state variable to the remaining volume.
A simple approach (though not super-efficient) would be to put a FlatMapFunction ahead of your windowing flow. If it's keyed the same way, then you can use ValueState to track total volume, and emit two records (the split) when it hits your limit.
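To make that idea concrete, here is a minimal, hedged sketch of such a splitter (it assumes the Trade case class and keying from the question, and is meant as a starting point rather than a drop-in implementation):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Splits each Trade so that consecutive emitted records sum up to exactly
// triggerVolume. Apply it on a keyed stream, e.g.:
// trades.keyBy("key").flatMap(new VolumeSplitter(500))
class VolumeSplitter(triggerVolume: Int) extends RichFlatMapFunction[Trade, Trade] {

  // Volume carried over since the last completely filled bucket.
  private var carried: ValueState[Int] = _

  override def open(parameters: Configuration): Unit = {
    carried = getRuntimeContext.getState(
      new ValueStateDescriptor[Int]("carriedVolume", createTypeInformation[Int]))
  }

  override def flatMap(trade: Trade, out: Collector[Trade]): Unit = {
    var soFar = carried.value()   // an unset state yields null, which Scala unboxes to 0
    var remaining = trade.size

    // Emit the piece that completes each full bucket; the window trigger can
    // then fire at exactly triggerVolume every time.
    while (soFar + remaining >= triggerVolume) {
      val fill = triggerVolume - soFar
      out.collect(trade.copy(size = fill))
      remaining -= fill
      soFar = 0
    }

    // Whatever is left starts (or extends) the next bucket.
    if (remaining > 0) {
      out.collect(trade.copy(size = remaining))
      soFar += remaining
    }
    carried.update(soFar)
  }
}

With the splitter ahead of the windowing, no single record straddles a bucket boundary, so the existing VolumeTrigger should fire on windows whose sizes sum to exactly the trigger volume.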

backpressure is not properly handled in akka-streams

I wrote a simple stream using the akka-streams API, assuming it would handle my source, but unfortunately it doesn't. I am sure I am doing something wrong in my source. I simply created an iterator which generates a very large number of elements, assuming it wouldn't matter because the akka-streams API would take care of backpressure. What am I doing wrong? This is my iterator:
def createData(args: Array[String]): Iterator[TimeSeriesValue] = {
  var data = new ListBuffer[TimeSeriesValue]()
  for (i <- 1 to range) {
    sessionId = UUID.randomUUID()
    for (j <- 1 to countersPerSession) {
      time = DateTime.now()
      keyName = s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j"
      for (k <- 1 to snapShotCount) {
        time = time.plusSeconds(2)
        fValue = new Random().nextLong()
        data += TimeSeriesValue(sessionId, keyName, time, fValue)
        totalRows += 1
      }
    }
  }
  data.iterator
}
The problem is primarily in the line
data += TimeSeriesValue(sessionId, keyName, time, fValue)
You are continuously adding to the ListBuffer with a "very large number of elements". This is chewing up all of your RAM. The data.iterator line is simply wrapping the massive ListBuffer blob inside of an iterator to provide each element one at a time, it's basically just a cast.
Your assumption that "it won't matter because ... of backpressure" is only partially true: the akka Stream will process the TimeSeriesValue values reactively, but you are creating a large number of them even before you get to the Source constructor.
If you want this iterator to be "lazy", i.e. only produce values when needed and not consume memory, then make the following modifications (note: I broke apart the code to make it more readable):
def createTimeSeries(startTime: DateTime, snapShotCount: Int, sessionId: UUID, keyName: String) =
  Iterator.range(1, snapShotCount)
    .map(_ * 2)
    .map(startTime plusSeconds _)
    .map(t => TimeSeriesValue(sessionId, keyName, t, ThreadLocalRandom.current().nextLong()))

def sessionGenerator(countersPerSession: Int, sessionId: UUID) =
  Iterator.range(1, countersPerSession)
    .map(j => s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j")
    .flatMap { keyName =>
      createTimeSeries(DateTime.now(), snapShotCount, sessionId, keyName)
    }

object UUIDIterator extends Iterator[UUID] {
  def hasNext: Boolean = true
  def next(): UUID = UUID.randomUUID()
}

def iterateOverIDs(range: Int) =
  UUIDIterator.take(range)
    .flatMap(sessionId => sessionGenerator(countersPerSession, sessionId))
Each one of the above functions returns an Iterator. Therefore, calling iterateOverIDs should be instantaneous, because no work is done immediately and de minimis memory is consumed. This iterator can then be passed into your Stream...
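For completeness, a hedged sketch of wiring that iterator into a Source so elements are pulled on demand (range and countersPerSession are assumed to be in scope, as in the question):

import akka.NotUsed
import akka.stream.scaladsl.Source

// Elements are generated one at a time as downstream demand arrives, so the
// full data set never sits in memory.
val timeSeriesSource: Source[TimeSeriesValue, NotUsed] =
  Source.fromIterator(() => iterateOverIDs(range))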

How do I execute a count command using scala.dbc?

I am trying to connect to an MS SQL server and execute a 'count' statement. I've gotten this far:
import scala.dbc._
import scala.dbc.Syntax._
import scala.dbc.syntax.Statement._
import java.net.URI

object MsSqlVendor extends Vendor {
  val uri = new URI("jdbc:sqlserver://173.248.X.X:Y/DataBaseName")
  val user = "XXX"
  val pass = "XXX"
  val retainedConnections = 5
  val nativeDriverClass = Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
  val urlProtocolString = "jdbc:sqlserver:"
}

object Main {
  def main(args: Array[String]) {
    println("Hello, world!")
    val db = new Database(MsSqlVendor)
    val count = db.executeStatement {
      select (count) from (technical)
    }
    println("%d rows counted", count)
  }
}
I get an error saying: "scala.dbc.syntax.Statement.select of type dbc.syntax.Statement.SelectZygote does not take parameters"
How do I set this up?
This can be a problem:
val count = db.executeStatement {
  select (count) from (technical)
}
The count inside the statement refers to val count, not to some other count. Still, there's the other problem you report. There's neither a count nor a technical definition anywhere. Maybe it is missing from some other place where you found this snippet. The following does compile, though it's anyone's guess as to whether it does what you want:
val countx = db.executeStatement {
  select fields "count" from "technical"
}
At any rate, I thought scala.dbc was long deprecated. I can't find any deprecation notice, however, and it is still linked on the library jar, even on trunk.
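If scala.dbc turns out to be a dead end, a plain-JDBC sketch of the same count might look like this (the credentials and host/port placeholders are taken from the question; the semicolon-separated URL format is an assumption about the SQL Server driver):

import java.sql.DriverManager

// Placeholder host, port, and credentials, as in the question.
val url = "jdbc:sqlserver://173.248.X.X:Y;databaseName=DataBaseName"
val conn = DriverManager.getConnection(url, "XXX", "XXX")
try {
  val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM technical")
  if (rs.next()) println(s"${rs.getLong(1)} rows counted")
} finally {
  conn.close()
}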

Resources