Akka Streams: Fan-out operator with custom logic - akka-stream

I am looking for an Akka Streams operator that would allow me to split a stream based on custom logic. The set of messages that I am expecting is known in advance so there is no need for dynamic scaling of downstream consumers.
In the earlier versions of the library - when it was still labeled experimental - there was a FlexiRoute operator. I saw that at some point it accumulated a lot of cruft and was subsequently removed in favor of GraphStage.
Nowadays there are operators like Balance and Partition that come close to what I need. Balance requires me to duplicate logic per consumer. Partition works only for two outputs and I need to have N. I could make it happen with a Partition per message type but that seems hacky.
Is building a custom solution the only way?

Partition is what you need: it works for N outputs, not just two. See https://doc.akka.io/api/akka/current/akka/stream/scaladsl/Partition.html for the API and https://blog.colinbreck.com/partitioning-akka-streams-to-maximize-throughput/#partition for an example.
A snapshot of the API doc:
new Partition(outputPorts: Int, partitioner: (T) ⇒ Int, eagerCancel: Boolean)
A snapshot of the example:
import akka.stream.FlowShape
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, Partition, Sink, Source}

// `spin` is the CPU-bound worker function defined in the linked blog post.
val flow = Flow.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val workerCount = 4
  val partition = b.add(Partition[Int](workerCount, _ % workerCount))
  val merge = b.add(Merge[Int](workerCount))
  for (_ <- 1 to workerCount) {
    partition ~> Flow[Int].map(spin).async ~> merge
  }
  FlowShape(partition.in, merge.out)
})

Source(1 to 1000)
  .via(flow)
  .runWith(Sink.ignore)
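If the custom logic is routing by message type, as in the question, the partitioner function is where that logic lives. Below is a rough sketch (not taken from the linked docs or blog): the Msg hierarchy, the three message types and the per-type sinks are all made-up names for illustration.
object PartitionByTypeExample extends App {
  import akka.actor.ActorSystem
  import akka.stream.ClosedShape
  import akka.stream.scaladsl.{GraphDSL, Partition, RunnableGraph, Sink, Source}

  // A hypothetical, known-in-advance set of message types.
  sealed trait Msg
  final case class Order(id: Int)      extends Msg
  final case class Refund(id: Int)     extends Msg
  final case class Audit(note: String) extends Msg

  implicit val system: ActorSystem = ActorSystem("partition-by-type")

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    // One output port per known message type; the partitioner holds the custom routing logic.
    val route = b.add(Partition[Msg](3, {
      case _: Order  => 0
      case _: Refund => 1
      case _: Audit  => 2
    }))
    Source(List[Msg](Order(1), Refund(2), Audit("ok"))) ~> route.in
    route.out(0) ~> Sink.foreach[Msg](m => println(s"orders:  $m"))
    route.out(1) ~> Sink.foreach[Msg](m => println(s"refunds: $m"))
    route.out(2) ~> Sink.foreach[Msg](m => println(s"audit:   $m"))
    ClosedShape
  })

  graph.run() // materialization uses the implicit ActorSystem (Akka 2.6+)
}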

Related

Don't understand interval join in Flink

From Flink's official doc:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/joining.html#interval-join
The example code is:
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.windowing.time.Time
...
val orangeStream: DataStream[Integer] = ...
val greenStream: DataStream[Integer] = ...
orangeStream
  .keyBy(elem => /* select key */)
  .intervalJoin(greenStream.keyBy(elem => /* select key */))
  .between(Time.milliseconds(-2), Time.milliseconds(1))
  .process(new ProcessJoinFunction[Integer, Integer, String] {
    override def processElement(left: Integer, right: Integer,
        ctx: ProcessJoinFunction[Integer, Integer, String]#Context,
        out: Collector[String]): Unit = {
      out.collect(left + "," + right)
    }
  })
From the above code, I would like to know how to specify the starting time (e.g. the beginning of today) from which to perform this interval join, so that data before the starting time is not taken into account.
For example, if I have been running the program for 3 days, I don't want to perform the join over all 3 days of data; I only want to join the data generated today.
I don't think it works the way you think it does.
The interval is calculated relative to the actual timestamps of the orangeStream elements in this case, so you are not really specifying the range of data to take into account; rather, it is something like a window that determines which green elements will be joined with a given element of the orange stream.
So, for the interval described above, if you have an orange element with timestamp 5, it will be joined with green elements that have timestamps from 3 to 6.
I really don't think you can use it to perform joins on only part of the data; the only thing I can think of is to simply filter both streams on their timestamps and drop all elements that were generated earlier than your starting time.
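For example, here is a rough sketch of that filtering approach using the Scala API. It assumes each element carries its own event timestamp in a ts field and that startOfToday is computed by the application; the Event case class and these names are illustrative, not part of Flink.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

object FilterBeforeJoinSketch {

  case class Event(key: String, ts: Long, value: Int)

  // Millisecond timestamp for the start of "today" (UTC); adjust to your time zone.
  val startOfToday: Long =
    java.time.LocalDate.now().atStartOfDay(java.time.ZoneOffset.UTC).toInstant.toEpochMilli

  def joinToday(orange: DataStream[Event], green: DataStream[Event]): DataStream[String] =
    orange
      .filter(_.ts >= startOfToday)   // drop everything generated before today
      .keyBy(_.key)
      .intervalJoin(green.filter(_.ts >= startOfToday).keyBy(_.key))
      .between(Time.milliseconds(-2), Time.milliseconds(1))
      .process(new ProcessJoinFunction[Event, Event, String] {
        override def processElement(left: Event, right: Event,
            ctx: ProcessJoinFunction[Event, Event, String]#Context,
            out: Collector[String]): Unit =
          out.collect(s"${left.value},${right.value}")
      })
}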

Data pipeline for running data records through modules which are acyclically dependent on each other

My question revolves around the best practice for processing data through multiple modules, where each module creates new data from the original data or from data generated by other modules. Let me create an illustrative example, where I'll be considering a MongoDB document and modules which create update parameters for the document. Consider a base document, {"a":2}. Now, consider these modules, which I've written as Python functions:
def mod1(data):
    a = data["a"]
    b = 2 * a
    return {"$set": {"b": b}}

def mod2(data):
    b = data["b"]
    c = "Good" if b > 1 else "Bad"
    return {"$set": {"c": c}}

def mod3(data):
    a = data["a"]
    d = a - 3
    return {"$set": {"d": d}}

def mod4(data):
    c = data["c"]
    d = data["d"]
    e = d if c == "Good" else 0
    return {"$set": {"e": e}}
When applied in the correct order, the updated document will be {"a":2,"b":4,"c":"Good","d":-1,"e":-1}. Note that mod1 and mod3 can run simultaneously, while mod4 must wait for mod2 and mod3 to run on a document. My question is more general though: what's the best way to do something like this? My current method is very similar to this, but with each module getting its own Docker container due to the less trivial nature of their processing. The problem with this is that every module has to continuously query the entire collection in which these documents reside to see whether any documents have become valid for it to process.
You really want an event-driven architecture for pipelines like this. As records are updated, push events onto a topic queue (or ESB) and have the other pipeline stages listen on those queues. As events are published, those stages can pick up the records and run their processing; see the sketch below.
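Here is a toy, in-process sketch of that idea, written in Scala only to illustrate the shape of it: each module declares the fields it needs, and runnable modules are re-checked whenever the document gains new fields, instead of every module polling the whole collection. In a real deployment the "event loop" would be a broker topic (Kafka, RabbitMQ, an ESB, ...) and each module its own container/consumer; the module bodies mirror the Python functions above, and all names are illustrative.
object EventDrivenPipelineSketch extends App {
  type Doc = Map[String, Any]

  // Each module declares which fields it needs and which update it produces.
  final case class Module(name: String, requires: Set[String], run: Doc => Doc)

  val modules = List(
    Module("mod1", Set("a"), d => Map("b" -> 2 * d("a").asInstanceOf[Int])),
    Module("mod3", Set("a"), d => Map("d" -> (d("a").asInstanceOf[Int] - 3))),
    Module("mod2", Set("b"), d => Map("c" -> (if (d("b").asInstanceOf[Int] > 1) "Good" else "Bad"))),
    Module("mod4", Set("c", "d"), d => Map("e" -> (if (d("c") == "Good") d("d") else 0)))
  )

  // In place of a broker: whenever the document changes, re-check which modules
  // have just become runnable and apply them, rather than polling the collection.
  var doc: Doc = Map("a" -> 2)
  var done = Set.empty[String]
  var progressed = true
  while (progressed) {
    progressed = false
    for (m <- modules if !done(m.name) && m.requires.subsetOf(doc.keySet)) {
      doc ++= m.run(doc) // in a real system: apply the $set and publish an "updated" event
      done += m.name
      progressed = true
    }
  }
  println(doc) // contains a=2, b=4, c=Good, d=-1, e=-1
}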

What Erlang data structure to use for ordered set with the possibility to do lookups?

I am working on a problem where I need to remember the order of events I receive, but I also need to look up an event by its id. How can I do this efficiently in Erlang, ideally without a third-party library? Note that I have many, potentially ephemeral, actors, each with their own events (I already considered mnesia, but it requires atoms for table names and the tables would stick around if my actor died).
-record(event, {id, timestamp, type, data}).
Based on the details included in the discussion in comments on Michael's answer, a very simple, workable approach would be to create a tuple in your process state variable that stores the order of events separately from the K-V store of events.
Consider:
%%% Some type definitions so we know exactly what we're dealing with.
-type id() :: term().
-type type() :: atom().
-type data() :: term().
-type ts() :: calendar:datetime().
-type event() :: {id(), ts(), type(), data()}.
-type events() :: dict:dict(id(), {type(), data(), ts()}).
% State record for the process.
% Should include whatever else the process deals with.
-record(s,
        {log    :: [id()],
         events :: events()}).
%%% Interface functions we will expose over this module.
-spec lookup(pid(), id()) -> {ok, event()} | error.
lookup(Pid, ID) ->
    gen_server:call(Pid, {lookup, ID}).

-spec latest(pid()) -> {ok, event()} | error.
latest(Pid) ->
    gen_server:call(Pid, get_latest).

-spec notify(pid(), event()) -> ok.
notify(Pid, Event) ->
    gen_server:cast(Pid, {new, Event}).

%%% gen_server handlers
handle_call({lookup, ID}, _From, State = #s{events = Events}) ->
    Result = find(ID, Events),
    {reply, Result, State};
handle_call(get_latest, _From, State = #s{log = [Last | _], events = Events}) ->
    Result = find(Last, Events),
    {reply, Result, State};
% ... and so on...

handle_cast({new, Event}, State) ->
    {ok, NewState} = catalog(Event, State),
    {noreply, NewState};
% ...

%%% Implementation functions
find(ID, Events) ->
    case dict:find(ID, Events) of
        {ok, {Type, Data, Timestamp}} -> {ok, {ID, Timestamp, Type, Data}};
        Error                         -> Error
    end.

catalog({ID, Timestamp, Type, Data}, State = #s{log = Log, events = Events}) ->
    NewEvents = dict:store(ID, {Type, Data, Timestamp}, Events),
    NewLog = [ID | Log],
    {ok, State#s{log = NewLog, events = NewEvents}}.
This is a completely straightforward implementation and hides the details of the data structure behind the interface of the process. Why did I pick a dict? Just because (it's easy). Without knowing your requirements better I really have no reason to pick a dict over a map over a gb_tree, etc. If you have relatively small data (hundreds or thousands of things to store) the performance usually isn't noticeably different among these structures.
The important thing is that you clearly identify what messages this process should respond to and then force yourself to stick to it elsewhere in your project code by creating an interface of exposed functions over this module. Behind that you can swap out the dict for something else. If you really only need the latest event ID and won't ever need to pull the Nth event from the sequence log then you could ditch the log and just keep the last event's ID in the record instead of a list.
So get something very simple like this working first, then determine if it actually suits your need. If it doesn't then tweak it. If this works for now, just run with it -- don't obsess over performance or storage (until you are really forced to).
If you find later on that you have a performance problem switch out the dict and list for something else -- maybe gb_tree or orddict or ETS or whatever. The point is to get something working right now so you have a base from which to evaluate the functionality and run benchmarks if necessary. (The vast majority of the time, though, I find that whatever I start out with as a specced prototype turns out to be very close to whatever the final solution will be.)
Your question makes it clear you want to look up by ID, but it's not entirely clear whether you also want to look up or traverse your data by time, and what operations you might want to perform in that regard; you say "remember the order of events", but storing your records with an index on the ID field will accomplish that.
If you only have to look up by ID then any of the usual suspects will work as a suitable storage engine, so ets, gb_trees and dict, for example, would be good. Don't use mnesia unless you need the transactions and safety and all those good features; mnesia is good, but there is a high performance price to be paid for all that, and it's not clear you need it, from your question anyway.
If you do want to look up or traverse your data by time, then consider an ets table of type ordered_set. If that can do what you need, it's probably a good choice. In that case you would employ two tables: one of type set to provide a hash lookup by ID, and another of type ordered_set to look up or traverse by timestamp.
If you have two different lookup methods like this, there's no getting around the fact that you need two indexes. You could store the whole record in both or, assuming your IDs are unique, you could store the ID as the data in the ordered_set (see the sketch below). Which you choose is really a trade-off between storage utilisation and read/write performance.
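To make the two-index idea concrete, here is a small sketch; it is written in Scala purely for illustration (in Erlang the two structures would be an ets set plus an ets ordered_set, as described above), and all the names are made up.
import scala.collection.immutable.{HashMap, TreeMap}

final case class Event(id: String, timestamp: Long, eventType: String, data: String)

// One hash index for fast lookup by id, one ordered index by timestamp that
// stores only the id (assumes distinct timestamps; use a composite key otherwise).
final case class EventStore(
    byId: HashMap[String, Event] = HashMap.empty,
    byTime: TreeMap[Long, String] = TreeMap.empty) {

  def add(e: Event): EventStore =
    copy(byId = byId.updated(e.id, e), byTime = byTime.updated(e.timestamp, e.id))

  def lookup(id: String): Option[Event] = byId.get(id)

  // Traverse in time order by walking the ordered index and resolving ids.
  def inOrder: Iterator[Event] = byTime.valuesIterator.flatMap(id => byId.get(id))
}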

Apache Flink: Why do reduce or groupReduce transformations not operate in parallel?

For example:
DataSet<Tuple1<Long>> input = env.fromElements(
        new Tuple1<>(1L), new Tuple1<>(2L), new Tuple1<>(3L), new Tuple1<>(4L), new Tuple1<>(5L),
        new Tuple1<>(6L), new Tuple1<>(7L), new Tuple1<>(8L), new Tuple1<>(9L));
DataSet<Tuple1<Long>> sum = input.reduce(new ReduceFunction<Tuple1<Long>>() {
    public Tuple1<Long> reduce(Tuple1<Long> value1, Tuple1<Long> value2) {
        return new Tuple1<>(value1.f0 + value2.f0);
    }
});
If the above reduce transformation is not a parallel operation, do I need to use two additional transformations, 'partitionByHash' and 'mapPartition', as below:
DataSet<Tuple1<Long>> input = env.fromElements(
        new Tuple1<>(1L), new Tuple1<>(2L), new Tuple1<>(3L), new Tuple1<>(4L), new Tuple1<>(5L),
        new Tuple1<>(6L), new Tuple1<>(7L), new Tuple1<>(8L), new Tuple1<>(9L));
DataSet<Tuple1<Long>> sum = input
        .partitionByHash(0)
        .mapPartition(new MapPartitionFunction<Tuple1<Long>, Tuple1<Long>>() {
            public void mapPartition(Iterable<Tuple1<Long>> values, Collector<Tuple1<Long>> out) {
                long sum = getSum(values); // getSum: the question's own helper, not shown here
                out.collect(new Tuple1<>(sum));
            }
        })
        .reduce(new ReduceFunction<Tuple1<Long>>() {
            public Tuple1<Long> reduce(Tuple1<Long> value1, Tuple1<Long> value2) {
                return new Tuple1<>(value1.f0 + value2.f0);
            }
        });
And why is the result of the reduce transformation still a DataSet and not a single Tuple1<Long>?
Both reduce and reduceGroup are group-wise operations and are applied to groups of records. If you do not specify a grouping key using groupBy, all records of the data set belong to the same group. Therefore, there is only a single group and the final result of reduce and reduceGroup cannot be computed in parallel.
If the reduce transformation is combinable (which is true for any ReduceFunction and all combinable GroupReduceFunctions), Flink can apply combiners in parallel.
Two answers, to your two questions:
(1) Why is reduce() not parallel
Fabian gave a good explanation. The operations are parallel if applied by key. Otherwise only pre-aggregation is parallel.
In your second example, you make it parallel by introducing a key. Instead of your complex workaround with "mapPartition()", you can also simply write (Java 8 Style)
DataSet<Tuple1<Long>> input = ...;
input.groupBy(0).reduce((a, b) -> new Tuple1<>(a.f0 + b.f0));
Note however that your input data is so small that there will be only one parallel task anyways. You can see parallel pre-aggregation if you use larger input, such as:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(10);
DataSet<Long> input = env.generateSequence(1, 100000000);
DataSet<Long> sum = input.reduce ( (a, b) -> a + b );
(2) Why is the result of a reduce() operation still a DataSet ?
A DataSet is still a lazy representation of the result "X" in the cluster. You can continue to use that data in the parallel program without triggering any computation and without fetching the result data back from the distributed workers to the driver program. That allows you to write larger programs that run entirely on the distributed workers and are lazily executed. No data is ever fetched to the client and re-distributed to the parallel workers.
Especially in iterative programs this is very powerful, as entire loops run without ever involving the client or needing to re-deploy operators.
You can always get the "X" by calling "dataSet.collect().get(0);", which makes it explicit that something should be executed and fetched.

Slick: Difficulties working with Column[Int] values

I have a follow-up to another Slick question I recently asked (Slick table Query: Trouble with recognizing values). Please bear with me!! I'm new to databases and Slick seems especially light on documentation. Anyway, I have this table:
object Users extends Table[(Int, String)]("Users") {
  def userId = column[Int]("UserId", O.PrimaryKey, O.AutoInc)
  def userName = column[String]("UserName")
  def * = userId ~ userName
}
Part I
I'm attempting to query with this function:
def findByQuery(where: List[(String, String)]) = SlickInit.dbSlave withSession {
  val q = for {
    x <- Users if foo((x.userId, x.userName), where)
  } yield x
  q.firstOption.map { case (userId, userName) =>
    User(userId, userName)
  }
}
where "where" is a list of search criteria, e.g. ("userId", "1"), ("userName", "Alex"), and "foo" is a helper function that tests equality. I'm running into a type error.
x.userId is of type Column[Int]. How can one manipulate it as an Int? I tried casting, e.g.:
foo(x.userId.asInstanceOf[Int]...)
but am also experiencing trouble with that. How does one deal with Slick return types?
Part II
Is anyone familiar with the casting function:
def * = userId ~ userName <> (User, User.unapply _)
? I know there have been some excellent answers to this question, most notably here: scala slick method I can not understand so far and a very similar question here: mapped projection with companion object in SLICK. But can anyone explain why the compiler responds with
<> method overloaded
for that simple line of code?
Let's start with the problem:
val q = for {
  x <- Users if foo((x.userId, x.userName), where)
} yield x
See, Slick transforms Scala expressions into SQL. To be able to transform conditions like yours into a SQL statement, Slick requires some special types to be used. The way these types work is actually part of the transformation Slick performs.
For example, when you write List(1,2,3) filter { x => x == 2 } the filter predicate is executed for each element in the list. But Slick can't do that! So Query[ATable] filter { arow => arow.id === 2 } actually means "make a select with the condition id = 2" (I am skipping details here).
I wrote a mock of your foo function and asked Slick to generate the SQL for the query q:
select x2."UserId", x2."UserName" from "Users" x2 where false
See the false? That's because foo is a plain predicate that Scala evaluates to the Boolean false. A similar predicate used in a Query, instead of a list, evaluates to a description of what needs to be done during SQL generation. Compare the difference between the filters in List and in Slick:
List[A].filter(A => Boolean):List[A]
Query[E,U].filter[T](f: E => T)(implicit wt: CanBeQueryCondition[T]):Query[E,U]
List.filter evaluates to a list of As, while Query.filter evaluates to a new Query!
Now, a step towards a solution.
It seems that what you want is actually the in operator of SQL. The in operator returns true if an element is contained in a list, e.g. 4 in (1,2,3,4) is true. Notice that (1,2,3,4) is a SQL list, not a tuple like in Scala.
For this use case of the in SQL operator Slick uses the operator inSet.
Now comes the second part of the problem. (I renamed the where variable to list, because where is a Slick method)
You could try:
val q = for {
  x <- Users if (x.userId, x.userName) inSet list
} yield x
But that won't compile! That's because SQL doesn't have tuples the way Scala does. In SQL you can't do (1,"Alfred") in ((1,"Alfred"),(2,"Mary")) (remember, (x,y,z) is the SQL syntax for lists; I am abusing the syntax here only to show that it's invalid -- also, there are many dialects of SQL out there, and it is possible some of them do support tuples and lists in a similar way).
One possible solution is to use only the userId field:
val q = for {
  x <- Users if x.userId inSet list2
} yield x
This generates select x2."UserId", x2."UserName" from "Users" x2 where x2."UserId" in (1, 2, 3)
But since you are explicitly using user id and user name, it's reasonable to assume that user id doesn't uniquely identify a user. So, to amend that we can concatenate both values. Of course, we need to do the same in the list.
val list2 = list map { t => t._1 + t._2 }
val q2 = for {
  x <- Users if (x.userId.asColumnOf[String] ++ x.userName) inSet list2
} yield x
Look at the generated SQL:
select x2."UserId", x2."UserName" from "Users" x2
where (cast(x2."UserId" as VARCHAR)||x2."UserName") in ('1a', '3c', '2b')
See the above ||? It's the string concatenation operator used in H2Db. H2Db is the Slick driver I am using to run your example. This query and the others can vary slightly depending on the database you are using.
Hope this clarifies how Slick works and solves your problem. At least the first one. :)
Part I:
Slick uses Column[...]-types instead of ordinary Scala types. You also need to define Slick helper functions using Column types. You could implement foo like this:
def foo(columns: (Column[Int], Column[String]), values: List[(Int, String)]): Column[Boolean] =
  values
    .map(value => columns._1 === value._1 && columns._2 === value._2)
    .reduce(_ || _)
Also read pedrofurla's answer to better understand how Slick works.
Part II:
Method <> is indeed overloaded and when types don't work out the Scala compiler can easily become uncertain which overload it should use. (We should get rid of the overloading in Slick I think.) Writing
def * = userId ~ userName <> (User.tupled _, User.unapply _)
may slightly improve the error message you get. To solve the problem, make sure that the Column types of userId and userName exactly correspond to the member types of your User case class, which should look something like case class User(id: Int, name: String). Also make sure that you extend Table[User] (not Table[(Int, String)]) when mapping to User.
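Putting that together, a minimal sketch of what the mapped table could look like (same Slick 1.x-style lifted embedding and driver import as the question's code; this is just the advice above spelled out, not tested against your project):
case class User(id: Int, name: String)

object Users extends Table[User]("Users") {
  def userId   = column[Int]("UserId", O.PrimaryKey, O.AutoInc)
  def userName = column[String]("UserName")
  // The Column types (Int, String) line up exactly with User's members,
  // so the <> overload resolves without ambiguity.
  def * = userId ~ userName <> (User, User.unapply _)
}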
