Apache Flink: How to create two datasets from one dataset using Flink DataSet API

I'm writing an application using DataSet API of Flink 0.10.1.
Can I get multiple collectors using a single operator in Flink?
What I want to do is something like below:
val lines = env.readTextFile(...)
val (out_small, out_large) = lines **someOp** {
  (iterator, collector1, collector2) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector1.collect(elem1)
      collector2.collect(elem2)
    }
  }
}
Currently I'm calling mapPartition twice to make two datasets from one source dataset.
val lines = env.readTextFile(...)
val out_small = lines mapPartition {
  (iterator, collector) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector.collect(elem1)
    }
  }
}
val out_large = lines mapPartition {
  (iterator, collector) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector.collect(elem2)
    }
  }
}
As the doParsing function is quite expensive, I want to call it just once per line.
P.S. I would appreciate it if you could point out other, simpler approaches to this kind of task.

Flink does not support multiple collectors. However, you can change the output of your parsing step by adding an additional field that indicates the output type:
val lines = env.readTextFile(...)
val intermediate = lines **someOp** {
  (iterator, collector) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector.collect((0, elem1)) // 0 indicates small
      collector.collect((1, elem2)) // 1 indicates large
    }
  }
}
Next, you consume the intermediate output twice and filter each branch on the first attribute. The first filter keeps records tagged 0, the second keeps records tagged 1 (you can also add a projection to get rid of the first attribute).
+---> filter("0") --->
|
intermediate --+
|
+---> filter("1") --->

Related

Summary of ArrayList ordering in Kotlin (Android)

I am trying to provide a summary of items within an ArrayList (where order matters). Basically, I am setting up an exercise plan with two different types of activities (Training and Assessment). I then will provide a summary of the plan after adding each training/assessment to it.
The structure I have is something along the lines of:
exercisePlan: [
{TRAINING OBJECT},
{TRAINING OBJECT},
{ASSESSMENT OBJECT},
{TRAINING OBJECT}
]
What I want to be able to do is summarise this in a format of:
2 x Training, 1 x Assessment, 1 x Training, which will be displayed in a TextView in a Fragment. So I will have an arbitrarily long string that details the structure and order of the exercise plan.
I have tried to investigate using a HashMap or a plain ArrayList, but it seems pretty messy so I'm looking for a much cleaner way (perhaps a MutableList). Thanks in advance!
ArrayList is just a specific type of MutableList. It's usually preferable to use a plain List, because mutability can make code a little more complex to work with and keep robust.
I'd create a list of some class that wraps an action and the number of consecutive times to do it.
enum class Activity {
    Training, Assessment
}

data class SummaryPlanStep(val activity: Activity, val consecutiveTimes: Int) {
    override fun toString() = "$consecutiveTimes x $activity"
}
If you want to start with your summary, you can create it and later convert it to a plain list of activities like this:
val summary: List<SummaryPlanStep> = listOf(
    SummaryPlanStep(Activity.Training, 2),
    SummaryPlanStep(Activity.Assessment, 1),
    SummaryPlanStep(Activity.Training, 1),
)
val plan: List<Activity> = summary.flatMap { List(it.consecutiveTimes) { _ -> it.activity } }
If you want to do it the other way around, it's more involved because I don't think there's a built-in way to group consecutive duplicate elements. You could write a function for that.
fun <T> List<T>.groupConsecutiveDuplicates(): List<Pair<T, Int>> {
    if (isEmpty()) return emptyList()
    val outList = mutableListOf<Pair<T, Int>>()
    var current = first() to 1
    for (i in 1 until size) {
        val item = this[i]
        current = if (item == current.first)
            current.first to (current.second + 1)
        else {
            outList.add(current)
            item to 1
        }
    }
    outList.add(current)
    return outList
}
val plan: List<Activity> = listOf(
    Activity.Training,
    Activity.Training,
    Activity.Assessment,
    Activity.Training
)
val summary: List<SummaryPlanStep> = plan.groupConsecutiveDuplicates().map { SummaryPlanStep(it.first, it.second) }
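Because SummaryPlanStep overrides toString, the display string from the question can then be produced directly from that summary (a small usage sketch building on the values defined above):

val textToDisplay = summary.joinToString(", ")
// -> "2 x Training, 1 x Assessment, 1 x Training"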
This is what I have set up to work for me at the moment:
if (exercisePlanSummary.isNotEmpty() && exercisePlanSummary[exercisePlanSummary.size - 1].containsKey(trainingAssessment)) {
    exercisePlanSummary[exercisePlanSummary.size - 1][trainingAssessment] = exercisePlanSummary[exercisePlanSummary.size - 1][trainingAssessment]!! + 1
} else {
    exercisePlanSummary.add(hashMapOf(trainingAssessment to 1))
}

var textToDisplay = ""
exercisePlanSummary.forEach {
    textToDisplay = if (textToDisplay.isNotEmpty()) {
        textToDisplay.plus(", ${it.values.toList()[0]} x ${it.keys.toList()[0].capitalize()}")
    } else {
        textToDisplay.plus("${it.values.toList()[0]} x ${it.keys.toList()[0].capitalize()}")
    }
}
where trainingAssessment is a String of "training" or "assessment", and exercisePlanSummary is an ArrayList<HashMap<String, Int>>.
What #Tenfour04 has written above is perhaps more appropriate, and a cleaner way of implementing this. But my method is quite simple.

How can I create a sink from a list of operations

I want to create a sink in akka streams which is made up of many operations.
e.g. map, filter, fold and then sink.
The best I can do at the moment is the following.
I don't like it because I have to specify broadcast even though I am only letting a single value through.
Does anyone know a better way of doing this?
def kafkaSink(): Sink[PartialBatchProcessedResult, NotUsed] = {
  Sink.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._

    val broadcast = b.add(Broadcast[PartialBatchProcessedResult](1))

    broadcast.out(0)
      .fold(new BatchPublishingResponseCollator()) { (c, e) => c.consume(e) }
      .map(_.build())
      .map(a => FunctionalTesterResults(sampleProjectorConfig, 0, a)) ~> Sink.foreach(new KafkaTestResultsReporter().report)

    SinkShape(broadcast.in)
  })
}
One key point to remember with akka-stream is that any number of Flow values plus a Sink value is still a Sink.
A couple of examples demonstrating this property:
val intSink: Sink[Int, _] = Sink.head[Int]

val anotherSink: Sink[Int, _] =
  Flow[Int].filter(_ > 0)
           .to(intSink)

val oneMoreSink: Sink[Int, _] =
  Flow[Int].filter(_ > 0)
           .map(_ + 4)
           .to(intSink)
Therefore, you can implement the map and filter as Flows. The fold that you are asking about can be implemented with Sink.fold.
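Since the question's graph only lets a single value through the Broadcast stage, the same sink can be written as a plain Flow composed with a Sink. This is a hedged sketch reusing the question's own types and values (PartialBatchProcessedResult, BatchPublishingResponseCollator, FunctionalTesterResults, sampleProjectorConfig, KafkaTestResultsReporter are from the question, not a library API); imports of Flow, Sink, Keep, Done and Future are assumed.

def kafkaSink(): Sink[PartialBatchProcessedResult, Future[Done]] =
  Flow[PartialBatchProcessedResult]
    .fold(new BatchPublishingResponseCollator()) { (c, e) => c.consume(e) }
    .map(_.build())
    .map(a => FunctionalTesterResults(sampleProjectorConfig, 0, a))
    .toMat(Sink.foreach(new KafkaTestResultsReporter().report))(Keep.right)

Keep.right exposes the Future[Done] materialized by Sink.foreach, so callers can know when the stream has finished.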

Why can't I update an array in cluster mode when I can in pseudo-distributed mode?

I wrote a Spark program in Scala; the main code is:
val centers: Array[(Vector, Double)] = initCenters(k)
val sumsMap: Map[Int, (Vector, Int)] = data.mapPartitions {
  ***
}.reduceByKey(***).collectAsMap()
sumsMap.foreach { case (index, (sum, count)) =>
  sum /= count
  centers(index) = (sum, sum.norm2())
}
The original code is:
val centers = initCenters.getOrElse(initCenter(data))
val br_centers = data.sparkContext.broadcast(centers)
val trainData = data.map(e => (e._2, e._2.norm2)).cache()
val squareStopBound = stopBound * stopBound
var isConvergence = false
var i = 0
val costs = data.sparkContext.doubleAccumulator

while (!isConvergence && i < maxIters) {
  costs.reset()
  val res = trainData.mapPartitions { iter =>
    val counts = new Array[Int](k)
    util.Arrays.fill(counts, 0)
    val partSum = (0 until k).map(e => new DenseVector(br_centers.value(0)._1.size))
    iter.foreach { e =>
      val (index, cost) = KMeans.findNearest(e, br_centers.value)
      costs.add(cost)
      counts(index) += 1
      partSum(index) += e._1
    }
    counts.indices.filter(j => counts(j) > 0).map(j => (j -> (partSum(j), counts(j)))).iterator
  }.reduceByKey { case ((s1, c1), (s2, c2)) =>
    (s1 += s2, c1 + c2)
  }.collectAsMap()

  br_centers.unpersist(false)
  println(s"cost at iter: $i is: ${costs.value}")
  isConvergence = true

  res.foreach { case (index, (sum, count)) =>
    sum /= count
    val sumNorm2 = sum.norm2()
    val squareDist = math.pow(centers(index)._2, 2.0) + math.pow(sumNorm2, 2.0) - 2 * (centers(index)._1 * sum)
    if (squareDist >= squareStopBound) {
      isConvergence = false
    }
    centers.update(index, (sum, sumNorm2))
  }
  i += 1
}
When this runs in pseudo-distributed mode in IDEA, the centers get updated, but when I run it on a Spark cluster, the centers do not get updated.
LostInOverflow's answer is correct, but not especially descriptive as to what's going on.
Here are some important properties of your code:
declare an array centers
broadcast this array as br_centers
update centers iteratively
So how is this going wrong? Well, broadcasts are static. If I write:
val a = Array(1,2,3)
val aBc = sc.broadcast(a)
a(0) = 67
and access aBc.value(0), I'm going to get different results depending on whether this code was run on the driver JVM or not. Broadcasting takes an object, torrents it across the network to each node, and creates a new reference in each JVM. This reference exists as it did when the base object was broadcasted, and it is NOT updated in real time as you mutate the base object.
What's the solution? I think moving the broadcast inside the while loop so that you broadcast the updated centers should work:
while (!isConvergence && i < maxIters) {
  val br_centers = data.sparkContext.broadcast(centers)
  ...
Please check the "Understanding closures" section in the programming guide.
Spark is a distributed system and behavior of the code you've shown is simply undefined. It works in local mode only by accident because it executes everything in a single JVM.
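Here is a minimal sketch of the pitfall that section describes, using hypothetical sc and rdd handles (not from the question): mutating a driver-side variable inside an action is undefined on a cluster, whereas an accumulator aggregates back to the driver correctly.

var counter = 0
rdd.foreach(_ => counter += 1)   // each executor increments its own deserialized copy
println(counter)                 // still 0 on the driver in cluster mode

val acc = sc.longAccumulator("count")
rdd.foreach(_ => acc.add(1))     // accumulator updates are merged back to the driver
println(acc.value)               // number of elements in rdd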

backpressure is not properly handled in akka-streams

I wrote a simple stream using the akka-streams API assuming it would handle my source, but unfortunately it doesn't. I am sure I am doing something wrong in my source. I simply created an iterator which generates a very large number of elements, assuming it wouldn't matter because the akka-streams API would take care of backpressure. What am I doing wrong? This is my iterator:
def createData(args: Array[String]): Iterator[TimeSeriesValue] = {
  var data = new ListBuffer[TimeSeriesValue]()
  for (i <- 1 to range) {
    sessionId = UUID.randomUUID()
    for (j <- 1 to countersPerSession) {
      time = DateTime.now()
      keyName = s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j"
      for (k <- 1 to snapShotCount) {
        time = time.plusSeconds(2)
        fValue = new Random().nextLong()
        data += TimeSeriesValue(sessionId, keyName, time, fValue)
        totalRows += 1
      }
    }
  }
  data.iterator
}
The problem is primarily in the line
data += TimeSeriesValue(sessionId, keyName, time, fValue)
You are continuously adding to the ListBuffer with a "very large number of elements". This is chewing up all of your RAM. The data.iterator line is simply wrapping the massive ListBuffer blob inside of an iterator to provide each element one at a time, it's basically just a cast.
Your assumption that "it won't matter because ... of backpressure" is only partially true: the akka Stream will process the TimeSeriesValue values reactively, but you are creating a large number of them even before you get to the Source constructor.
If you want this iterator to be "lazy", i.e. only produce values when needed and not consume memory, then make the following modifications (note: I broke apart the code to make it more readable):
def createTimeSeries(startTime: Time, snapShotCount: Int, sessionId: UUID, keyName: String) =
  Iterator.range(1, snapShotCount)
          .map(_ * 2)
          .map(startTime plusSeconds _)
          .map(t => TimeSeriesValue(sessionId, keyName, t, ThreadLocalRandom.current().nextLong()))

def sessionGenerator(countersPerSession: Int, sessionID: UUID) =
  Iterator.range(1, countersPerSession)
          .map(j => s"Encoder-${sessionID.toString}-Controller.CaptureFrameCount.$j")
          .flatMap { keyName =>
            createTimeSeries(DateTime.now(), snapShotCount, sessionID, keyName)
          }

object UUIDIterator extends Iterator[UUID] {
  def hasNext: Boolean = true
  def next(): UUID = UUID.randomUUID()
}

def iterateOverIDs(range: Int) =
  UUIDIterator.take(range)
              .flatMap(sessionID => sessionGenerator(countersPerSession, sessionID))
Each one of the above functions returns an Iterator. Therefore, calling iterateOverIDs should be instantaneous because no work is immediately being done and de minimis memory is being consumed. This iterator can then be passed into your Stream...
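For example, the lazy iterator can be wrapped in a Source so that elements are generated only as downstream demand arrives (a short sketch reusing iterateOverIDs from above; range and TimeSeriesValue come from the question and are assumed to be in scope):

import akka.NotUsed
import akka.stream.scaladsl.Source

val source: Source[TimeSeriesValue, NotUsed] =
  Source.fromIterator(() => iterateOverIDs(range))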

In Flink, stream windowing does not seem to work?

I tried to enhance the Flink example displaying the usage of streams.
My goal is to use the windowing features (see the window function call).
I assume that the code below outputs the sum of the last 3 numbers in the stream.
(the stream is opened thanks to nc -lk 9999 on ubuntu)
Actually, the output sums up ALL numbers entered. Switching to a time window produces the same result, i.e. no windowing is applied.
Is that a bug? (version used: latest master on github )
object SocketTextStreamWordCount {

  def main(args: Array[String]) {
    val hostName = args(0)
    val port = args(1).toInt

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Create streams for names and ages by mapping the inputs to the corresponding objects
    val text = env.socketTextStream(hostName, port)

    val currentMap = text.flatMap { (x: String) => x.toLowerCase.split("\\W+") }
      .filter { (x: String) => x.nonEmpty }
      .window(Count.of(3)).every(Time.of(1, TimeUnit.SECONDS))
      // .window(Time.of(5, TimeUnit.SECONDS)).every(Time.of(1, TimeUnit.SECONDS))
      .map { (x: String) => ("not used; just to have a tuple for the sum", x.toInt) }

    val numberOfItems = currentMap.count
    numberOfItems print

    val counts = currentMap.sum(1)
    counts print

    env.execute("Scala SocketTextStreamWordCount Example")
  }
}
The problem seems to be that there is an implicit conversion from WindowedDataStream to DataStream. This implicit conversion calls flatten() on the WindowedDataStream.
What happens in your case is that the code gets expanded to this:
val currentMap = text.flatMap { (x: String) => x.toLowerCase.split("\\W+") }
  .filter { (x: String) => x.nonEmpty }
  .window(Count.of(3)).every(Time.of(1, TimeUnit.SECONDS))
  .flatten()
  .map { (x: String) => ("not used; just to have a tuple for the sum", x.toInt) }
What flatten() does is similar to a flatMap() on a collection. It takes the stream of windows which can be seen as a collection of collections ([[a,b,c], [d,e,f]]) and turns it into a stream of elements: [a,b,c,d,e,f].
This means that your count really operates only on the original stream that has been windowed and "de-windowed". This looks like it has never been windowed at all.
This is a problem and I will work on fixing this right away. (I'm one of the Flink committers.) You can track the progress here: https://issues.apache.org/jira/browse/FLINK-2096
The way to do it with the current API is this:
val currentMap = text.flatMap { (x: String) => x.toLowerCase.split("\\W+") }
  .filter { (x: String) => x.nonEmpty }
  .map { (x: String) => ("not used; just to have a tuple for the sum", x.toInt) }
  .window(Count.of(3)).every(Time.of(1, TimeUnit.SECONDS))
WindowedDataStream has a sum() method so there will be no implicit insertion of the flatten() call. Unfortunately, count() is not available on WindowedDataStream so for this you have to manually add a 1 field to the tuple and count these.
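A minimal sketch of that count workaround against the same pre-1.0 windowing API used above (the window()/every() calls and field indices mirror the answer's snippets; this API no longer exists in current Flink):

val countPerWindow = text.flatMap { (x: String) => x.toLowerCase.split("\\W+") }
  .filter { (x: String) => x.nonEmpty }
  .map { (x: String) => (x.toInt, 1) }                      // append a constant 1 per element
  .window(Count.of(3)).every(Time.of(1, TimeUnit.SECONDS))
  .sum(1)                                                   // summing the 1s yields the per-window element count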
