Why can I not update an array in cluster mode when I could in pseudo-distributed mode?

I wrote a Spark program in Scala; the main code is:
val centers: Array[(Vector, Double)] = initCenters(k)
val sumsMap: Map[Int, (Vector, Int)] = data.mapPartitions {
  ***
}.reduceByKey(***).collectAsMap()
sumsMap.foreach { case (index, (sum, count)) =>
  sum /= count
  centers(index) = (sum, sum.norm2())
}
The original code is:
val centers = initCenters.getOrElse(initCenter(data))
val br_centers = data.sparkContext.broadcast(centers)
val trainData = data.map(e => (e._2, e._2.norm2)).cache()
val squareStopBound = stopBound * stopBound
var isConvergence = false
var i = 0
val costs = data.sparkContext.doubleAccumulator
while (!isConvergence && i < maxIters) {
  costs.reset()
  val res = trainData.mapPartitions { iter =>
    val counts = new Array[Int](k)
    util.Arrays.fill(counts, 0)
    val partSum = (0 until k).map(e => new DenseVector(br_centers.value(0)._1.size))
    iter.foreach { e =>
      val (index, cost) = KMeans.findNearest(e, br_centers.value)
      costs.add(cost)
      counts(index) += 1
      partSum(index) += e._1
    }
    counts.indices.filter(j => counts(j) > 0).map(j => (j -> (partSum(j), counts(j)))).iterator
  }.reduceByKey { case ((s1, c1), (s2, c2)) =>
    (s1 += s2, c1 + c2)
  }.collectAsMap()
  br_centers.unpersist(false)
  println(s"cost at iter: $i is: ${costs.value}")
  isConvergence = true
  res.foreach { case (index, (sum, count)) =>
    sum /= count
    val sumNorm2 = sum.norm2()
    val squareDist = math.pow(centers(index)._2, 2.0) + math.pow(sumNorm2, 2.0) - 2 * (centers(index)._1 * sum)
    if (squareDist >= squareStopBound) {
      isConvergence = false
    }
    centers.update(index, (sum, sumNorm2))
  }
  i += 1
}
When this runs in pseudo-distributed mode from IDEA, the centers get updated; when I run it on a Spark cluster, they do not.

LostInOverflow's answer is correct, but not especially descriptive as to what's going on.
Here are some important properties of your code:
declare an array centers
broadcast this array as br_centers
update centers iteratively
So how is this going wrong? Well, broadcasts are static. If I write:
val a = Array(1,2,3)
val aBc = sc.broadcast(a)
a(0) = 67
and access aBc.value(0), I'm going to get different results depending on whether this code was run on the driver JVM or not. Broadcasting takes an object, torrents it across the network to each node, and creates a new reference in each JVM. This reference exists as it did when the base object was broadcasted, and it is NOT updated in real time as you mutate the base object.
What's the solution? I think moving the broadcast inside the while loop so that you broadcast the updated centers should work:
while (!isConvergence && i < maxIters) {
val br_centers = data.sparkContext.broadcast(centers)
...
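For reference, a minimal sketch of that shape, reusing the question's names (the per-iteration body is elided): the broadcast is created from the freshly updated centers at the top of each iteration, and the stale copy is released at the bottom.
val centers = initCenters.getOrElse(initCenter(data))
var isConvergence = false
var i = 0
while (!isConvergence && i < maxIters) {
  // broadcast the centers as they are *now*, so this iteration's tasks see the
  // values written by the previous iteration's driver-side update
  val br_centers = data.sparkContext.broadcast(centers)

  // ... mapPartitions / reduceByKey / collectAsMap and the driver-side
  // centers.update(...) loop exactly as in the question, reading br_centers.value ...

  // drop the now-stale executor copies before the next iteration re-broadcasts
  br_centers.unpersist(blocking = false)
  i += 1
}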

Please check the Understanding closures section in the programming guide.
Spark is a distributed system and the behavior of the code you've shown is simply undefined. It works in local mode only by accident, because there everything executes in a single JVM.
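That section's classic counter example illustrates the same failure mode in miniature (a sketch, assuming an existing SparkContext sc; this is not the asker's code): each task mutates its own deserialized copy of the closure, so the driver never sees the update.
var counter = 0
val rdd = sc.parallelize(1 to 100)
// each executor increments its *own* copy of `counter` shipped inside the task closure
rdd.foreach(x => counter += x)
// 0 on a cluster; it may appear to work in local mode because everything shares one JVM
println(s"Counter value: $counter")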

Related

Are steps in Gatling cached and only executed once?

I have a simulation with a step that allows me to publish to different endpoints.
class MySimulation extends Simulation {
  // some init code
  var testTitle = this.getClass.getSimpleName

  val myscenario = scenario("Scn Description")
    .exec(PublishMessageRandom(pConfigTest, testTitle + "-" + numProducers, numProducers))

  if (testMode == "debug") {
    setUp(
      myscenario.inject(
        atOnceUsers(1)
      )
    ).protocols(httpConf)
  } else if (testMode == "open") {
    setUp(
      myscenario.inject(
        rampConcurrentUsers(concurrentUserMin) to (concurrentUserMax) during (durationInMinutes minutes)
      )
    ).protocols(httpConf)
  }
}
Now here is my PublishMessageRandom definition
def PublishMessageRandom(producerConfig: ProducerConfig, testTitle: String, numberOfProducers: Int) = {
  val jsonBody = producerConfig.asJson
  val valuedJsonBody = Printer.noSpaces.copy(dropNullValues = true).print(jsonBody)
  println(valuedJsonBody)
  val nodes: Array[String] = endpoints.split(endpointDelimiter)
  val rnd = scala.util.Random
  val rndIndex = rnd.nextInt(numberOfProducers)
  val endpoint = "http://" + nodes(rndIndex) + perfEndpoint
  println("endpoint:" + endpoint)
  exec(http(testTitle)
    .post(endpoint)
    .header(HttpHeaderNames.ContentType, HttpHeaderValues.ApplicationJson)
    .body(StringBody(valuedJsonBody))
    .check(status.is(200))
    .check(bodyString.saveAs("serverResponse"))
  )
  // the below is only useful in debug mode. Comment it out for longer tests
  /*.exec { session =>
    println("server_response: " + session("serverResponse").as[String])
    println("endpoint:" + endpoint)
    session
  }*/
}
As you can see, it is simply meant to pick endpoints at random. Unfortunately I see the println("endpoint:" + endpoint) above only once, and it looks like it picks one endpoint randomly and keeps hitting that one instead of hitting endpoints randomly as intended.
Can someone explain this behavior? Is Gatling caching the step, and how do I work around that?
Quoting the official documentation:
Warning
Gatling DSL components are immutable ActionBuilder(s) that have to be chained altogether and are only built once on startup. The result is a workflow chain of Action(s). These builders don't do anything by themselves, they don't trigger any side effect, they are just definitions. As a result, creating such DSL components at runtime in functions is completely meaningless.
I had to use a feeder to solve the problem, where the feeder supplies the random endpoint.
// feeder is random endpoint as per number of producers
val endpointFeeder = GetEndpoints(numProducers).random
val myscenario = scenario("Vary number of producers hitting Kafka cluster")
  .feed(endpointFeeder)
  .exec(PublishMessageRandom(pConfigTest, testTitle + "-" + numProducers))
and PublishMessageRandom now looks like this:
def PublishMessageRandom(producerConfig: ProducerConfig, testTitle: String) = {
  val jsonBody = producerConfig.asJson
  val valuedJsonBody = Printer.noSpaces.copy(dropNullValues = true).print(jsonBody)
  println(valuedJsonBody)
  exec(http(testTitle)
    .post("${endpoint}")
    .header(HttpHeaderNames.ContentType, HttpHeaderValues.ApplicationJson)
    .body(StringBody(valuedJsonBody))
    .check(status.is(200))
    .check(bodyString.saveAs("serverResponse"))
  )
}
The line .post("${endpoint}") above ends up hitting the endpoint supplied by the feeder.
The feeder function GetEndpoints is defined as follows; it creates an array of maps with one entry each, where "endpoint" is the key.
def GetEndpoints(numberOfProducers: Int): Array[Map[String, String]] = {
  val nodes: Array[String] = endpoints.split(endpointDelimiter)
  var result: Array[Map[String, String]] = Array()
  for (elt <- 1 to numberOfProducers) {
    val endpoint = "http://" + nodes(elt - 1) + perfEndpoint
    var m: Map[String, String] = Map()
    m += ("endpoint" -> endpoint)
    result = result :+ m
    println("map:" + m)
  }
  result
}
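If the fixed-size array feels clumsy, the same feeder can be built lazily. A minimal sketch, assuming the same endpoints, endpointDelimiter and perfEndpoint values from the question (a Gatling feeder is just an Iterator of Map[String, T]):
// infinite feeder that draws a fresh random endpoint every time it is polled,
// so no .random strategy or pre-built array is needed
def randomEndpointFeeder(numberOfProducers: Int): Iterator[Map[String, String]] = {
  val nodes = endpoints.split(endpointDelimiter)
  Iterator.continually {
    val endpoint = "http://" + nodes(scala.util.Random.nextInt(numberOfProducers)) + perfEndpoint
    Map("endpoint" -> endpoint)
  }
}
It plugs into the scenario the same way: .feed(randomEndpointFeeder(numProducers)).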

Kotlin: initialize 2D array

I am in a loop, reading 2 columns from a file. I read R, T combinations, 50 times. I want R and T to be in an array so I can look up the Nth pair of R, T later in a function. How do I put the R, T pairs in an array and look up the, say, 25th entry later in a function?
For example:
for (nsection in 1 until NS + 1) {
    val list: List<String> = lines[nsection + 1].trim().split("\\s+".toRegex())
    val radius = list[0].toFloat()
    println("Radius = $radius")
    val twist = list[8].toFloat()
    println("twist = $twist")
}
I would like to pull radius and twist pairs from a table in a function later. NS goes up to 50 so far.
You can use map() on your range iterator to produce a List of what you want.
val radiusTwistPairs: List<Pair<Float, Float>> = (1..NS).map { nsection ->
    val list = lines[nsection + 1].trim().split("\\s+".toRegex())
    val radius = list[0].toFloat()
    println("Radius = $radius")
    val twist = list[8].toFloat()
    println("twist = $twist")
    radius to twist
}
Or use an Array constructor:
val radiusTwistPairs: Array<Pair<Float, Float>> = Array(NS) { i ->
    val list = lines[i + 2].trim().split("\\s+".toRegex())
    val radius = list[0].toFloat()
    println("Radius = $radius")
    val twist = list[8].toFloat()
    println("twist = $twist")
    radius to twist
}
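With either approach, the Nth pair can then be looked up by index in the later function, e.g. radiusTwistPairs[24] (with its components as .first and .second) for the 25th radius/twist entry.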

Controlling order of processed elements within CoProcessFunction using custom sources

For testing purposes, I am using the following custom source:
class ThrottledSource[T](
    data: Array[T],
    throttling: Int,
    beginWaitingTime: Int = 0,
    endWaitingTime: Int = 0
) extends SourceFunction[T] {

  private var isRunning = true
  private var offset = 0

  override def run(ctx: SourceFunction.SourceContext[T]): Unit = {
    Thread.sleep(beginWaitingTime)
    val lock = ctx.getCheckpointLock
    while (isRunning && offset < data.length) {
      lock.synchronized {
        ctx.collect(data(offset))
        offset += 1
      }
      Thread.sleep(throttling)
    }
    Thread.sleep(endWaitingTime)
  }

  override def cancel(): Unit = isRunning = false
}
and I use it like this within my test:
val controlStream = new ThrottledSource[Control](
  data = Array(c1, c2), endWaitingTime = 10000, throttling = 0
)
val events = new ThrottledSource[Event](
  data = Array(e1, e2, e3, e4, e5),
  throttling = 1000,
  beginWaitingTime = 2000,
  endWaitingTime = 2000
)
val dataStream = env.addSource(events)
env.addSource(controlStream)
  .connect(dataStream)
  .process(MyProcessFunction)
My intent is to get all the control elements first (that is why I don't specify any beginWaitingTime or throttling for the control source). In processElement1 and processElement2 within MyProcessFunction I print the elements as I receive them. Most of the time I get the two control elements first, as expected, but quite surprisingly, from time to time I get data elements first, despite the two-second delay before the data source starts emitting its elements. Can anyone explain this to me?
The control and data stream source operators are running in different threads, and as you've seen, there's no guarantee that the source instance running the control stream will get a chance to run before the instance running the data stream.
You could look at the answer here and its associated code on github for one way to accomplish this reliably.
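For completeness, one common pattern (a sketch only, not the linked solution) is to make the function itself tolerant of the ordering by buffering early data elements until the control side has been seen. Control and Event are the question's types; Out and handleEvent are placeholders for whatever MyProcessFunction actually emits and does.
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.util.Collector
import scala.collection.mutable.ListBuffer

class BufferUntilControlSeen extends CoProcessFunction[Control, Event, Out] {
  // plain fields are fine for a single-parallelism test harness (not fault tolerant)
  private var controlSeen = false
  private val buffered = ListBuffer.empty[Event]

  override def processElement1(control: Control,
                               ctx: CoProcessFunction[Control, Event, Out]#Context,
                               out: Collector[Out]): Unit = {
    controlSeen = true
    // ... handle the control element, then flush anything that arrived too early ...
    buffered.foreach(handleEvent(_, out))
    buffered.clear()
  }

  override def processElement2(event: Event,
                               ctx: CoProcessFunction[Control, Event, Out]#Context,
                               out: Collector[Out]): Unit = {
    if (controlSeen) handleEvent(event, out)
    else buffered += event // hold data back until the control side is ready
  }

  private def handleEvent(event: Event, out: Collector[Out]): Unit = {
    // ... the real per-event logic ...
  }
}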

Apache Flink: How to create two datasets from one dataset using Flink DataSet API

I'm writing an application using the DataSet API of Flink 0.10.1.
Can I get multiple collectors using a single operator in Flink?
What I want to do is something like below:
val lines = env.readTextFile(...)
val (out_small, out_large) = lines **someOp** {
  (iterator, collector1, collector2) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector1.collect(elem1)
      collector2.collect(elem2)
    }
  }
}
Currently I'm calling mapPartition twice to make two datasets from one source dataset.
val lines = env.readTextFile(...)
val out_small = lines mapPartition {
  (iterator, collector) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector.collect(elem1)
    }
  }
}
val out_large = lines mapPartition {
  (iterator, collector) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector.collect(elem2)
    }
  }
}
As the doParsing function is quite expensive, I want to call it just once per line.
P.S. I would appreciate it if you could let me know about other, simpler ways to do this kind of thing.
Flink does not support multiple collectors. However, you can change the output of your parsing step by adding an additional field that indicates the output type:
val lines = env.readTextFile(...)
val intermediate = lines **someOp** {
  (iterator, collector) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector.collect((0, elem1)) // 0 indicates small
      collector.collect((1, elem2)) // 1 indicates large
    }
  }
}
Next you consume the output intermediate twice and filter each one on the first attribute: the first filter keeps 0, the second keeps 1 (you can also add a projection to get rid of the first attribute).
+---> filter("0") --->
|
intermediate --+
|
+---> filter("1") --->

Backpressure is not properly handled in akka-streams

I wrote a simple stream using the akka-streams API, assuming it would handle my source, but unfortunately it doesn't. I am sure I am doing something wrong in my source. I simply created an iterator which generates a very large number of elements, assuming it wouldn't matter because the akka-streams API would take care of backpressure. What am I doing wrong? This is my iterator:
def createData(args: Array[String]): Iterator[TimeSeriesValue] = {
  var data = new ListBuffer[TimeSeriesValue]()
  for (i <- 1 to range) {
    sessionId = UUID.randomUUID()
    for (j <- 1 to countersPerSession) {
      time = DateTime.now()
      keyName = s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j"
      for (k <- 1 to snapShotCount) {
        time = time.plusSeconds(2)
        fValue = new Random().nextLong()
        data += TimeSeriesValue(sessionId, keyName, time, fValue)
        totalRows += 1
      }
    }
  }
  data.iterator
}
The problem is primarily in the line
data += TimeSeriesValue(sessionId, keyName, time, fValue)
You are continuously adding to the ListBuffer with a "very large number of elements". This is chewing up all of your RAM. The data.iterator line simply wraps the massive ListBuffer blob inside an iterator that provides each element one at a time; it's basically just a cast.
Your assumption that "it won't matter because ... of backpressure" is only partially true: the akka Stream will process the TimeSeriesValue values reactively, but you are creating a very large number of them before you even get to the Source constructor.
If you want this iterator to be "lazy", i.e. only produce values when needed and not consume memory, then make the following modifications (note: I broke apart the code to make it more readable):
import java.util.UUID
import java.util.concurrent.ThreadLocalRandom
import org.joda.time.DateTime

def createTimeSeries(startTime: DateTime, snapShotCount: Int, sessionId: UUID, keyName: String): Iterator[TimeSeriesValue] =
  Iterator.range(1, snapShotCount)
    .map(_ * 2)
    .map(startTime plusSeconds _)
    .map(t => TimeSeriesValue(sessionId, keyName, t, ThreadLocalRandom.current().nextLong()))

def sessionGenerator(countersPerSession: Int, sessionId: UUID): Iterator[TimeSeriesValue] =
  Iterator.range(1, countersPerSession)
    .map(j => s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j")
    .flatMap { keyName =>
      createTimeSeries(DateTime.now(), snapShotCount, sessionId, keyName)
    }

object UUIDIterator extends Iterator[UUID] {
  def hasNext: Boolean = true
  def next(): UUID = UUID.randomUUID()
}

def iterateOverIDs(range: Int): Iterator[TimeSeriesValue] =
  UUIDIterator.take(range)
    .flatMap(sessionId => sessionGenerator(countersPerSession, sessionId))
Each one of the above functions returns an Iterator. Therefore, calling iterateOverIDs should be instantaneous, because no work is done up front and de minimis memory is consumed. This iterator can then be passed into your Stream...
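From there, a minimal sketch of feeding it into a stream with Source.fromIterator, which only pulls elements as downstream demand (backpressure) allows; range is the same value as in the question:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem("timeseries")
implicit val materializer: ActorMaterializer = ActorMaterializer()

// the iterator factory is invoked lazily at materialization time, one element per pull
Source.fromIterator(() => iterateOverIDs(range))
  .runWith(Sink.foreach(println))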

Resources