How can I create a sink from a list of operations - akka-stream

I want to create a sink in akka streams which is made up of many operations, e.g. map, filter, fold and then a sink.
The best I can do at the moment is the following.
I don't like it because I have to specify a broadcast even though I am only letting a single value through.
Does anyone know a better way of doing this?
def kafkaSink(): Sink[PartialBatchProcessedResult, NotUsed] = {
  Sink.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._

    val broadcast = b.add(Broadcast[PartialBatchProcessedResult](1))

    broadcast.out(0)
      .fold(new BatchPublishingResponseCollator()) { (c, e) => c.consume(e) }
      .map(_.build())
      .map(a => FunctionalTesterResults(sampleProjectorConfig, 0, a)) ~> Sink.foreach(new KafkaTestResultsReporter().report)

    SinkShape(broadcast.in)
  })
}

One key point to remember with akka-stream is that any number of Flow values plus a Sink value is still a Sink.
A couple of examples demonstrating this property:
val intSink : Sink[Int, _] = Sink.head[Int]

val anotherSink : Sink[Int, _] =
  Flow[Int].filter(_ > 0)
           .to(intSink)

val oneMoreSink : Sink[Int, _] =
  Flow[Int].filter(_ > 0)
           .map(_ + 4)
           .to(intSink)
Therefore, you can implement the map and filter as Flows. The fold that you are asking about can be implemented with Sink.fold.
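Applying the Flow-plus-Sink idea to the code in the question (here keeping the fold as a Flow stage so the later map steps and the Sink.foreach stay in place), a minimal sketch, assuming the domain types from the question (PartialBatchProcessedResult, BatchPublishingResponseCollator, FunctionalTesterResults, sampleProjectorConfig and KafkaTestResultsReporter) are in scope:

def kafkaSink(): Sink[PartialBatchProcessedResult, NotUsed] =
  Flow[PartialBatchProcessedResult]
    .fold(new BatchPublishingResponseCollator()) { (c, e) => c.consume(e) }
    .map(_.build())
    .map(a => FunctionalTesterResults(sampleProjectorConfig, 0, a))
    .to(Sink.foreach(new KafkaTestResultsReporter().report))

Because to keeps the Flow's materialized value, the result is still a Sink[PartialBatchProcessedResult, NotUsed], matching the original signature, with no GraphDSL or single-output Broadcast required.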

Related

Turn `func(in: Source[A]) : Source[B]` into a `Flow[A, B]`

I am using akka-grpc to generate client bindings. They usually have the form of
func[A, B](in: Source[A]) : Source[B],
i.e. they consume a Source[A] and offer a Source[B].
Now, I want to turn func into a Flow[A, B] to use them with akka-stream.
The solution is:
def SourceProcessor[In, Out](f : Source[In, NotUsed] => Source[Out, NotUsed]): Flow[In, Out, NotUsed] =
  Flow[In].prefixAndTail(0).flatMapConcat { case (Nil, in) => f(in) }
It uses prefixAndTail(0) to hijack the underlying Source.
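As a usage sketch, with a hypothetical generated binding streamingCall of the shape described above (Request and Reply are placeholder message types, not real akka-grpc names):

// Hypothetical stand-in for a generated akka-grpc client method
def streamingCall(in: Source[Request, NotUsed]): Source[Reply, NotUsed] = ???

val callAsFlow: Flow[Request, Reply, NotUsed] = SourceProcessor(streamingCall)

// The wrapped call now composes like any other stage, e.g.
// Source(requests).via(callAsFlow).runWith(Sink.seq)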

How to buffer and drop a chunked bytestring with a delimiter?

Let's say you have a publisher using broadcast with some fast and some slow subscribers, and you would like to be able to drop sets of messages for the slow subscribers without having to keep them in memory. The data consists of chunked ByteStrings, so dropping a single ByteString is not an option.
Each set of ByteStrings is followed by a terminator ByteString("\n"), so I would need to drop a set of ByteStrings ending with that.
Is that something you can do with a custom graph stage? Can it be done without aggregating and keeping the whole set in memory?
Avoid Custom Stages
Whenever possible, try to avoid custom stages; they are tricky to get right and fairly verbose. Usually some combination of the standard akka-stream stages and plain old functions will do the trick.
Group Dropping
Presumably you have some criteria that you will use to decide which group of messages will be dropped:
type ShouldDropTester = () => Boolean
For demonstration purposes I will use a simple switch that drops every other group:
val dropEveryOther : ShouldDropTester = {
  val flipFlop = Iterator.from(1).map(_ % 2 == 0)
  () => flipFlop.next()
}
We will also need a function that will take in a ShouldDropTester and use it to determine whether an individual ByteString should be dropped:
val endOfFile = ByteString("\n")

val dropGroupPredicate : ShouldDropTester => ByteString => Boolean =
  (shouldDropTester) => {
    var dropGroup = shouldDropTester()

    (byteString) =>
      if (byteString equals endOfFile) {
        val returnValue = dropGroup
        dropGroup = shouldDropTester()
        returnValue
      }
      else {
        dropGroup
      }
  }
Combining the above two functions will drop every other group of ByteStrings. This functionality can then be converted into a Flow:
val filterPredicateFunction : ByteString => Boolean =
  dropGroupPredicate(dropEveryOther)

// the predicate returns true for elements that should be dropped, hence filterNot
val dropGroups : Flow[ByteString, ByteString, _] =
  Flow[ByteString] filterNot filterPredicateFunction
As required: the groups of messages do not need to be buffered; the predicate works on individual ByteStrings and therefore consumes a constant amount of memory regardless of file size.
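A rough usage sketch follows; chunks, fastSink and slowSink are hypothetical placeholders for the actual publisher and subscribers. Only the slow branch goes through dropGroups, so the fast subscriber still sees every element:

val chunks   : Source[ByteString, NotUsed] = ???
val fastSink : Sink[ByteString, _]         = ???
val slowSink : Sink[ByteString, _]         = ???

chunks
  .alsoTo(dropGroups to slowSink) // slow branch with group dropping
  .runWith(fastSink)              // fast branch unchanged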

Manipulate Seq Elements in an Akka Flow

I have 2 flows like the following:
val aToSeqOfB: Flow[A, Seq[B], NotUsed] = ...
val bToC: Flow[B, C, NotUsed] = ...
I want to combine these into a convenience method like the following:
val aToSeqOfC: Flow[A, Seq[C], NotUsed]
So far I have the following, but I know it just ends up with C elements and not Seq[C].
Flow[A].via(aToSeqOfB).mapConcat(_.toList).via(bToC)
How can I preserve the Seq in this scenario?
Indirect Answer
In my opinion your question highlights one of the "rookie mistakes" that are common when dealing with akka streams: it is usually not good organization to put business logic within akka stream constructs. Your question indicates that you have something of the form:
val bToC : Flow[B, C, NotUsed] = Flow[B] map { b : B =>
  //business logic
}
The more ideal scenario would be if you had:
//normal function, no akka involved
val bToCFunc : B => C = { b : B =>
  //business logic
}

val bToCFlow : Flow[B, C, NotUsed] = Flow[B] map bToCFunc
In the above "ideal" example the Flow is just a thin veneer on top of normal, non-akka, business logic.
The separate logic can then simply solve your original question with:
val aToSeqOfC : Flow[A, Seq[C], NotUsed] =
  aToSeqOfB via (Flow[Seq[B]] map (_ map bToCFunc))
Direct Answer
If you cannot reorganize your code then the only available option is to deal with Futures. You'll need to use bToC within a separate sub-stream:
implicit val mat : akka.stream.Materializer = ???

val seqBToSeqC : Seq[B] => Future[Seq[C]] =
  (seqB) =>
    Source(seqB.toList)
      .via(bToC)
      .runWith(Sink.seq[C])
You can then use this function within a mapAsync to construct the Flow you are looking for:
val parallelism = 10

val aToSeqOfC : Flow[A, Seq[C], NotUsed] =
  aToSeqOfB.mapAsync(parallelism)(seqBToSeqC)
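One note on this approach: mapAsync preserves the order of the incoming elements while running up to parallelism sub-streams concurrently; if the ordering of the emitted Seq[C] values does not matter, mapAsyncUnordered can be used instead.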

Apache Flink: How to create two datasets from one dataset using Flink DataSet API

I'm writing an application using DataSet API of Flink 0.10.1.
Can I get multiple collectors using a single operator in Flink?
What I want to do is something like below:
val lines = env.readTextFile(...)
val (out_small, out_large) = lines **someOp** {
  (iterator, collector1, collector2) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector1.collect(elem1)
      collector2.collect(elem2)
    }
  }
}
Currently I'm calling mapPartition twice to make two datasets from one source dataset.
val lines = env.readTextFile(...)
val out_small = lines mapPartition {
  (iterator, collector) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector.collect(elem1)
    }
  }
}

val out_large = lines mapPartition {
  (iterator, collector) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector.collect(elem2)
    }
  }
}
As the doParsing function is quite expensive, I want to call it just once per line.
P.S. I would very much appreciate it if you could let me know about other approaches to do this kind of thing in a simpler way.
Flink does not support multiple collectors. However, you can change the output of your parsing step by adding an additional field that indicates the output type:
val lines = env.readTextFile(...)
val intermediate = lines mapPartition {
  (iterator, collector) => {
    for (line <- iterator) {
      val (elem1, elem2) = doParsing(line)
      collector.collect((0, elem1)) // 0 indicates small
      collector.collect((1, elem2)) // 1 indicates large
    }
  }
}
Next you consume the output intermediate twice and filter each on the first attribute. The first filter keeps the records tagged with 0, the second those tagged with 1 (you can also add a projection to get rid of the first attribute); a sketch in code follows the diagram below.
+---> filter("0") --->
|
intermediate --+
|
+---> filter("1") --->

backpressure is not properly handled in akka-streams

I wrote a simple stream using the akka-streams API, assuming it would handle my source, but unfortunately it doesn't. I am sure I am doing something wrong in my source. I simply created an iterator which generates a very large number of elements, assuming it wouldn't matter because the akka-streams API would take care of backpressure. What am I doing wrong? This is my iterator:
def createData(args: Array[String]): Iterator[TimeSeriesValue] = {
  var data = new ListBuffer[TimeSeriesValue]()
  for (i <- 1 to range) {
    sessionId = UUID.randomUUID()
    for (j <- 1 to countersPerSession) {
      time = DateTime.now()
      keyName = s"Encoder-${sessionId.toString}-Controller.CaptureFrameCount.$j"
      for (k <- 1 to snapShotCount) {
        time = time.plusSeconds(2)
        fValue = new Random().nextLong()
        data += TimeSeriesValue(sessionId, keyName, time, fValue)
        totalRows += 1
      }
    }
  }
  data.iterator
}
The problem is primarily in the line
data += TimeSeriesValue(sessionId, keyName, time, fValue)
You are continuously adding to the ListBuffer with a "very large number of elements". This is chewing up all of your RAM. The data.iterator line simply wraps the massive ListBuffer blob inside of an iterator that provides each element one at a time; it's basically just a cast.
Your assumption that "it won't matter because ... of backpressure" is only partially true: the akka Stream will indeed process the TimeSeriesValue values reactively, but you are creating a large number of them even before you get to the Source constructor.
If you want this iterator to be "lazy", i.e. only produce values when needed and not consume memory, then make the following modifications (note: I broke apart the code to make it more readable):
def createTimeSeries(startTime: DateTime, snapShotCount : Int, sessionId : UUID, keyName : String) =
  Iterator.range(1, snapShotCount)
          .map(_ * 2)
          .map(startTime plusSeconds _)
          .map(t => TimeSeriesValue(sessionId, keyName, t, ThreadLocalRandom.current().nextLong()))

def sessionGenerator(countersPerSession : Int, sessionID : UUID) =
  Iterator.range(1, countersPerSession)
          .map(j => s"Encoder-${sessionID.toString}-Controller.CaptureFrameCount.$j")
          .flatMap { keyName =>
            createTimeSeries(DateTime.now(), snapShotCount, sessionID, keyName)
          }

object UUIDIterator extends Iterator[UUID] {
  def hasNext : Boolean = true
  def next() : UUID = UUID.randomUUID()
}

def iterateOverIDs(range : Int) =
  UUIDIterator.take(range)
              .flatMap(sessionID => sessionGenerator(countersPerSession, sessionID))
Each one of the above functions returns an Iterator. Therefore, calling iterateOverIDs should be instantaneous because no work is done immediately and de minimis memory is consumed. This iterator can then be passed into your Stream, for example as sketched below.
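A minimal sketch of feeding the lazy iterator into a Source, assuming Akka 2.6+ where the implicit ActorSystem provides the materializer (the system name and the println sink are placeholders):

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem("timeseries") // hypothetical system name

// Source.fromIterator re-creates the iterator on each materialization and only pulls
// elements as downstream demand allows, so nothing is buffered up front.
val timeSeriesSource: Source[TimeSeriesValue, NotUsed] =
  Source.fromIterator(() => iterateOverIDs(range))

timeSeriesSource.runWith(Sink.foreach(println)) // placeholder sink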
