How to implement pagination with akka-streams - akka-stream

I need to process a large file line by line and do some heavy work (on a 4-core CPU) on every item. I think this code is correct:
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.util.ByteString
import scala.concurrent.Future

implicit val system = ActorSystem("TestSystem")
implicit val materializer = ActorMaterializer()
import system.dispatcher

val sink = Sink.foreach[String](elem => println("element proceed"))

FileIO.fromPath(Paths.get("file.txt"))
  .via(Framing.delimiter(ByteString("\n"), 64).map(_.utf8String))
  .mapAsync(4)(v =>
    // long op
    Future {
      Thread.sleep(500)
      "updated_" + v
    })
  .to(sink)
  .run()
But I want to have output like:
100 element proceed
200 element proceed
300 element proceed
357 element proceed. done
How can I implement this?

You can use Flow.grouped:
val groupSize = 100
val groupedFlow = Flow[String].grouped(groupSize)
This Flow can now be injected before or after your mapAsync:
FileIO.fromPath(Paths.get("file.txt"))
  .via(Framing.delimiter(ByteString("\n"), 64).map(_.utf8String))
  .via(groupedFlow)
  ...
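Building on that, here is a sketch of the full pipeline (not from the original answer, and reusing the imports and implicits from the question's snippet): the grouped batches feed a folding sink that keeps and prints a running total. The last batch may be smaller than groupSize, which gives the final "357 element proceed" style count before completion.
val groupSize = 100

// Prints a progress line per batch of (up to) groupSize processed elements.
val countingSink: Sink[Seq[String], Future[Int]] =
  Sink.fold[Int, Seq[String]](0) { (total, batch) =>
    val newTotal = total + batch.size
    println(s"$newTotal element proceed")
    newTotal
  }

val done: Future[Int] =
  FileIO.fromPath(Paths.get("file.txt"))
    .via(Framing.delimiter(ByteString("\n"), 64).map(_.utf8String))
    .mapAsync(4)(v => Future { Thread.sleep(500); "updated_" + v })
    .grouped(groupSize)
    .runWith(countingSink)

done.foreach(total => println(s"$total element proceed. done"))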

Related

Use feeder next 10 results and repeat the request 10 times in Gatling

I am using Gatling 3.6.1 and I am trying to repeat the request 10 times for the next 10 products from the feeder file. This is what I tried:
feed(products, 10)
  .repeat(10, "index") {
    exec(session => {
      val index = session("index").as[Int]
      val counter = index + 1
      session.set("counter", counter)
    })
    .exec(productIdsRequest())
  }

private def productIdsRequest() = {
  http("ProductId${counter}")
    .get(path + "products/${product_code${counter}}")
    .check(jsonPath("$..code").count.gt(2))
}
I am having trouble getting the counter value to my API URL.
I would like to have something like
products/${product_code1},
products/${product_code2} etc.
But instead, I get the error 'nested attribute definition is not allowed'
So basically I would like every request to be made with one product from the feeder (within the batch of 10 products).
Can you please help?
Thanks!
Disclaimer: I don't know how your products feeder is implemented.
If I understand you correctly, you just need to move .repeat up a level:
.repeat(10, "counter") {
  feed(products)
    .exec(http("ProductId ${counter}")
      .get("products/${product_code}")
      .check(jsonPath("$..code").count.gt(2)))
}
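For reference, a hypothetical end-to-end simulation built around this could look as follows (the csv file, column name, and base URL are assumptions, not from the original post):
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class ProductsSimulation extends Simulation {

  // Hypothetical feeder and protocol; adjust the file, column names, and base URL to your setup.
  val httpProtocol = http.baseUrl("https://example.com/")
  val products = csv("products.csv").circular // expects a product_code column

  val scn = scenario("Product ids")
    .repeat(10, "counter") {
      feed(products)
        .exec(http("ProductId ${counter}")
          .get("products/${product_code}")
          .check(jsonPath("$..code").count.gt(2)))
    }

  setUp(scn.inject(atOnceUsers(1))).protocols(httpProtocol)
}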

Controlling order of processed elements within CoProcessFunction using custom sources

For testing purposes, I am using the following custom source:
class ThrottledSource[T](
  data: Array[T],
  throttling: Int,
  beginWaitingTime: Int = 0,
  endWaitingTime: Int = 0
) extends SourceFunction[T] {

  private var isRunning = true
  private var offset = 0

  override def run(ctx: SourceFunction.SourceContext[T]): Unit = {
    Thread.sleep(beginWaitingTime)
    val lock = ctx.getCheckpointLock
    while (isRunning && offset < data.length) {
      lock.synchronized {
        ctx.collect(data(offset))
        offset += 1
      }
      Thread.sleep(throttling)
    }
    Thread.sleep(endWaitingTime)
  }

  override def cancel(): Unit = isRunning = false
}
and using it like this within my test
val controlStream = new ThrottledSource[Control](
  data = Array(c1, c2), endWaitingTime = 10000, throttling = 0
)
val events = new ThrottledSource[Event](
  data = Array(e1, e2, e3, e4, e5),
  throttling = 1000,
  beginWaitingTime = 2000,
  endWaitingTime = 2000
)
val dataStream = env.addSource(events)

env.addSource(controlStream)
  .connect(dataStream)
  .process(MyProcessFunction)
My intent is to get all the control elements first (that is why I don't specify any beginWaitingTime or throttling for the control source). In processElement1 and processElement2 within MyProcessFunction I print the elements as I receive them. Most of the time I get the two control elements first, as expected, but surprisingly, from time to time I get data elements first, despite the two-second delay before the data source starts emitting its elements. Can anyone explain this to me?
The control and data stream source operators are running in different threads, and as you've seen, there's no guarantee that the source instance running the control stream will get a chance to run before the instance running the data stream.
You could look at the answer here and its associated code on github for one way to accomplish this reliably.
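Purely as an illustration of one way to do it (a sketch, not the linked code): buffer the data elements inside the co-process function until the expected number of control elements has arrived. Control and Event are the types from the question; Out, expectedControls, and handle are hypothetical placeholders, and the plain in-memory buffer makes this suitable for tests only, not for fault-tolerant jobs.
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.util.Collector
import scala.collection.mutable.ArrayBuffer

class ControlFirstFunction(expectedControls: Int)
    extends CoProcessFunction[Control, Event, Out] {

  private val pending = ArrayBuffer.empty[Event]
  private var controlsSeen = 0

  override def processElement1(
      control: Control,
      ctx: CoProcessFunction[Control, Event, Out]#Context,
      out: Collector[Out]): Unit = {
    controlsSeen += 1
    if (controlsSeen == expectedControls) {
      // all controls are in: flush whatever data arrived too early
      pending.foreach(e => out.collect(handle(e)))
      pending.clear()
    }
  }

  override def processElement2(
      event: Event,
      ctx: CoProcessFunction[Control, Event, Out]#Context,
      out: Collector[Out]): Unit = {
    if (controlsSeen < expectedControls) pending += event // too early: hold it back
    else out.collect(handle(event))
  }

  // placeholder for whatever MyProcessFunction does with a data element
  private def handle(e: Event): Out = ???
}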

Manipulate Seq Elements in an Akka Flow

I have 2 flows like the following:
val aToSeqOfB: Flow[A, Seq[B], NotUsed] = ...
val bToC: Flow[B, C, NotUsed] = ...
I want to combine these into a convenience method like the following:
val aToSeqOfC: Flow[A, Seq[C], NotUsed]
So far I have the following, but I know it just ends up with C elements and not Seq[C].
Flow[A].via(aToSeqOfB).mapConcat(_.toList).via(bToC)
How can I preserve the Seq in this scenario?
Indirect Answer
In my opinion your question highlights one of the "rookie mistakes" that is common when dealing with akka streams. It is usually not good organization to put business logic within akka stream constructs. Your question indicates that you have something of the form:
val bToC : Flow[B, C, NotUsed] = Flow[B] map { b : B =>
  //business logic
}
The more ideal scenario would be if you had:
//normal function, no akka involved
val bToCFunc : B => C = { b : B =>
  //business logic
}

val bToCFlow : Flow[B,C,NotUsed] = Flow[B] map bToCFunc
In the above "ideal" example the Flow is just a thin veneer on top of normal, non-akka, business logic.
The separate logic can then simply solve your original question with:
val aToSeqOfC : Flow[A, Seq[C], NotUsed] =
  aToSeqOfB via (Flow[Seq[B]] map (_ map bToCFunc))
Direct Answer
If you cannot reorganize your code then the only available option is to deal with Futures. You'll need to use bToC within a separate sub-stream:
implicit val mat : akka.stream.Materializer = ???

val seqBToSeqC : Seq[B] => Future[Seq[C]] =
  seqB =>
    Source(seqB.toList) // toList provides the immutable Iterable that Source.apply expects
      .via(bToC)
      .runWith(Sink.seq[C]) // runWith keeps the Sink's materialized Future[Seq[C]]
You can then use this function within a mapAsync to construct the Flow you are looking for:
val parallelism = 10

val aToSeqOfC: Flow[A, Seq[C], NotUsed] =
  aToSeqOfB.mapAsync(parallelism)(seqBToSeqC)
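As an aside, and not part of the original answer: the same nested stream can be expressed with flatMapConcat plus fold, which avoids running the sub-stream by hand:
// Each incoming Seq[B] becomes a sub-source run through bToC and folded back into one Seq[C].
val aToSeqOfCAlt: Flow[A, Seq[C], NotUsed] =
  aToSeqOfB.flatMapConcat { seqB =>
    Source(seqB.toList)
      .via(bToC)
      .fold(Seq.empty[C])(_ :+ _)
  }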

Why can't I update an array in cluster mode when I can in pseudo-distributed mode

I wrote a Spark program in Scala; the main code is:
val centers: Array[(Vector, Double)] = initCenters(k)
val sumsMap: Map[Int, (Vector, Int)] = data.mapPartitions {
  ***
}.reduceByKey(***).collectAsMap()

sumsMap.foreach { case (index, (sum, count)) =>
  sum /= count
  centers(index) = (sum, sum.norm2())
}
The original code is:
val centers = initCenters.getOrElse(initCenter(data))
val br_centers = data.sparkContext.broadcast(centers)
val trainData = data.map(e => (e._2, e._2.norm2)).cache()
val squareStopBound = stopBound * stopBound
var isConvergence = false
var i = 0
val costs = data.sparkContext.doubleAccumulator

while (!isConvergence && i < maxIters) {
  costs.reset()
  val res = trainData.mapPartitions { iter =>
    val counts = new Array[Int](k)
    util.Arrays.fill(counts, 0)
    val partSum = (0 until k).map(e => new DenseVector(br_centers.value(0)._1.size))
    iter.foreach { e =>
      val (index, cost) = KMeans.findNearest(e, br_centers.value)
      costs.add(cost)
      counts(index) += 1
      partSum(index) += e._1
    }
    counts.indices.filter(j => counts(j) > 0).map(j => (j -> (partSum(j), counts(j)))).iterator
  }.reduceByKey { case ((s1, c1), (s2, c2)) =>
    (s1 += s2, c1 + c2)
  }.collectAsMap()

  br_centers.unpersist(false)
  println(s"cost at iter: $i is: ${costs.value}")
  isConvergence = true

  res.foreach { case (index, (sum, count)) =>
    sum /= count
    val sumNorm2 = sum.norm2()
    val squareDist = math.pow(centers(index)._2, 2.0) + math.pow(sumNorm2, 2.0) - 2 * (centers(index)._1 * sum)
    if (squareDist >= squareStopBound) {
      isConvergence = false
    }
    centers.update(index, (sum, sumNorm2))
  }
  i += 1
}
When I run this in pseudo-distributed mode from IDEA, the centers get updated, but when I run it on a Spark cluster they do not.
LostInOverflow's answer is correct, but not especially descriptive as to what's going on.
Here are some important properties of your code:
declare an array centers
broadcast this array as br_centers
update centers iteratively
So how is this going wrong? Well, broadcasts are static. If I write:
val a = Array(1,2,3)
val aBc = sc.broadcast(a)
a(0) = 67
and access aBc.value(0), I'm going to get different results depending on whether this code was run on the driver JVM or not. Broadcasting takes an object, torrents it across the network to each node, and creates a new reference in each JVM. This reference exists as it did when the base object was broadcasted, and it is NOT updated in real time as you mutate the base object.
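A minimal way to see this (a sketch assuming a running SparkContext sc; the cluster-side result is what you would typically observe, since local mode shares one JVM):
val a = Array(1, 2, 3)
val aBc = sc.broadcast(a)
a(0) = 67

// On the driver, aBc.value still refers to the same array object, so the mutation is visible:
println(aBc.value(0)) // 67

// Executors work with the copy that was shipped at broadcast time:
println(sc.parallelize(Seq(1)).map(_ => aBc.value(0)).first()) // typically 1 on a real cluster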
What's the solution? I think moving the broadcast inside the while loop so that you broadcast the updated centers should work:
while (!isConvergence && i < maxIters) {
  val br_centers = data.sparkContext.broadcast(centers)
  ...
Please check the "Understanding closures" section in the Spark programming guide.
Spark is a distributed system, and the behavior of the code you've shown is simply undefined. It works in local mode only by accident, because everything executes in a single JVM.
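The canonical illustration from that section of the guide, paraphrased here with a SparkContext sc assumed, is a local variable mutated inside an RDD closure:
var counter = 0
val rdd = sc.parallelize(1 to 10)

// Each task mutates its own deserialized copy of `counter`, not the driver's variable.
rdd.foreach(x => counter += x)

println(counter) // still 0 in cluster mode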

backpressure is not properly handled in akka-streams

I wrote a simple stream using the akka-streams API, assuming it would handle my source, but unfortunately it doesn't. I am sure I am doing something wrong in my source. I simply created an iterator which generates a very large number of elements, assuming it wouldn't matter because the akka-streams API would take care of backpressure. What am I doing wrong? This is my iterator:
def createData(args: Array[String]): Iterator[TimeSeriesValue] = {
  var data = new ListBuffer[TimeSeriesValue]()
  for (i <- 1 to range) {
    sessionId = UUID.randomUUID()
    for (j <- 1 to countersPerSession) {
      time = DateTime.now()
      keyName = s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j"
      for (k <- 1 to snapShotCount) {
        time = time.plusSeconds(2)
        fValue = new Random().nextLong()
        data += TimeSeriesValue(sessionId, keyName, time, fValue)
        totalRows += 1
      }
    }
  }
  data.iterator
}
The problem is primarily in the line
data += TimeSeriesValue(sessionId, keyName, time, fValue)
You are continuously adding to the ListBuffer with a "very large number of elements". This is chewing up all of your RAM. The data.iterator line is simply wrapping the massive ListBuffer blob inside of an iterator to provide each element one at a time, it's basically just a cast.
Your assumption that "it won't matter because ... of backpressure" is only partially true: the akka Stream will process the TimeSeriesValue values reactively, but you are creating a large number of them even before you get to the Source constructor.
If you want this iterator to be "lazy", i.e. only produce values when needed and not consume memory, then make the following modifications (note: I broke apart the code to make it more readable):
def createTimeSeries(startTime: DateTime, snapShotCount: Int, sessionId: UUID, keyName: String) =
  Iterator.range(1, snapShotCount)
    .map(_ * 2)
    .map(startTime plusSeconds _)
    .map(t => TimeSeriesValue(sessionId, keyName, t, ThreadLocalRandom.current().nextLong()))

def sessionGenerator(countersPerSession: Int, sessionId: UUID) =
  Iterator.range(1, countersPerSession)
    .map(j => s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j")
    .flatMap { keyName =>
      createTimeSeries(DateTime.now(), snapShotCount, sessionId, keyName)
    }

object UUIDIterator extends Iterator[UUID] {
  def hasNext: Boolean = true
  def next(): UUID = UUID.randomUUID()
}

def iterateOverIDs(range: Int) =
  UUIDIterator.take(range)
    .flatMap(sessionID => sessionGenerator(countersPerSession, sessionID))
Each one of the above functions returns an Iterator. Therefore, calling iterateOverIDs should be instantaneous, because no work is done immediately and de minimis memory is consumed. This iterator can then be passed into your Stream...
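For example (a sketch, with range assumed to be in scope), the lazy iterator can be wrapped in a Source directly:
// Source.fromIterator re-creates the iterator per materialization and pulls elements on demand.
val timeSeriesSource: Source[TimeSeriesValue, NotUsed] =
  Source.fromIterator(() => iterateOverIDs(range))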
