Via/ViaMat/to/toMat in Akka Stream

Can someone explain clearly what the differences between these 4 methods are? When is it more appropriate to use each one? Also, generally speaking, what is the name of this group of methods? Are there more methods that do the same job? A link to the Scaladoc would also help.

All these methods are necessary to join two streams into one stream. For example, you can create a Source out of a Source and a Flow, or you can create a Sink out of a Flow and a Sink, or you can create a Flow out of two Flows.
For this, there are two basic operations, to and via. The former allows one to connect either a Source or a Flow to a Sink, while the latter allows one to connect a Source or a Flow to a Flow:
source.to(sink) -> runnable graph
flow.to(sink) -> sink
source.via(flow) -> source
flow1.via(flow2) -> flow
For reference, a runnable graph is a fully connected reactive stream which is ready to be materialized and executed.
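Since the Akka library itself isn't at hand here, the composition rules above can be modeled with a few plain-Java stub classes. The names mirror Akka's, but this is not the Akka API; the generic signatures alone show what each combination yields:

```java
// Toy typed model of to/via composition (NOT the Akka API): the stub
// classes carry no behavior, only the type signatures of the combinators.
public class ComposeDemo {
    static class Source<O> {
        // source.via(flow) -> source
        <O2> Source<O2> via(Flow<O, O2> flow) { return new Source<>(); }
        // source.to(sink) -> runnable graph
        RunnableGraph to(Sink<O> sink) { return new RunnableGraph(); }
    }
    static class Flow<I, O> {
        // flow1.via(flow2) -> flow
        <O2> Flow<I, O2> via(Flow<O, O2> next) { return new Flow<>(); }
        // flow.to(sink) -> sink
        Sink<I> to(Sink<O> sink) { return new Sink<>(); }
    }
    static class Sink<I> {}
    static class RunnableGraph {}

    public static void main(String[] args) {
        Source<Integer> source = new Source<>();
        Flow<Integer, String> flow = new Flow<>();
        Sink<String> sink = new Sink<>();

        Source<String> composedSource = source.via(flow); // still a Source
        RunnableGraph graph = composedSource.to(sink);    // fully connected
        Sink<Integer> composedSink = flow.to(sink);       // still a Sink
        System.out.println(graph != null);                // true: it type-checks
    }
}
```

The point is that via keeps the "open" input side of whatever it is called on, while to closes the output side; once both ends are closed you have a runnable graph.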
*Mat versions of various operations allow one to specify how materialized values of streams included in the operation should be combined. As you may know, each stream has a materialized value which can be obtained when the stream is materialized. For example, Source.queue yields a queue object which can be used by another part of your program to emit elements into the running stream.
By default, to and via on sources and flows keep only the materialized value of the stream they are called on, ignoring the materialized value of their argument:
source.to(sink) yields mat.value of source
source.via(flow) yields mat.value of source
flow.to(sink) yields mat.value of flow
flow1.via(flow2) yields mat.value of flow1
Sometimes, however, you need to keep both materialized values or to combine them somehow. That's when the Mat variants of these methods are needed. They allow you to specify a combining function which takes the materialized values of both operands and returns the materialized value of the combined stream:
source.to(sink) equivalent to source.toMat(sink)(Keep.left)
flow1.via(flow2) equivalent to flow1.viaMat(flow2)(Keep.left)
For example, to keep both materialized values, you can use the Keep.both method, or if you only need the mat.value of the "right" operand, you can use the Keep.right method:
source.toMat(sink)(Keep.both) yields a tuple (mat.value of source, mat.value of sink)
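The Keep combiners are nothing more than two-argument functions over the materialized values. A minimal plain-Java sketch of their semantics (this models the idea outside Akka; it is not the Akka API):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.function.BiFunction;

// Stand-ins for Akka's Keep combiners: toMat/viaMat take a function
// (leftMat, rightMat) -> combinedMat, and Keep.left / Keep.right /
// Keep.both are just such functions.
public class KeepDemo {
    // Keep.left: discard the right operand's materialized value
    static <L, R> BiFunction<L, R, L> keepLeft() { return (l, r) -> l; }
    // Keep.right: discard the left operand's materialized value
    static <L, R> BiFunction<L, R, R> keepRight() { return (l, r) -> r; }
    // Keep.both: keep the pair of materialized values
    static <L, R> BiFunction<L, R, Map.Entry<L, R>> keepBoth() {
        return SimpleEntry::new;
    }

    public static void main(String[] args) {
        String sourceMat = "queue"; // e.g. the queue from Source.queue
        Integer sinkMat = 42;       // e.g. a value materialized by a sink

        System.out.println(KeepDemo.<String, Integer>keepLeft().apply(sourceMat, sinkMat));
        System.out.println(KeepDemo.<String, Integer>keepRight().apply(sourceMat, sinkMat));
        System.out.println(keepBoth().apply(sourceMat, sinkMat));
    }
}
```

With this reading, source.toMat(sink)(Keep.both) simply applies the "keep both" function to the two materialized values, yielding the tuple described above.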

Related

KeyBy multiple streams in Flink

I have multiple (3 to be precise, as of now) streams (of different types) from different Kafka topics. They have a common property, userId. All I want to do now is to partition by userId and then add some business logic to it. How can I partition all streams by userId and ensure that all the events go to the same task processor, so that the userId state is accessible?
I could have used ConnectedStream, but here the use case involves more than 2 different kinds of streams.
Also, I was wondering whether something like this would guarantee the same task processor:
MyBusinessProcess businessProcess = new MyBusinessProcess();
streamA.keyBy(event -> event.userId).process(businessProcess);
streamB.keyBy(event -> event.userId).process(businessProcess);
streamC.keyBy(event -> event.userId).process(businessProcess);
Edit: I just realised that for businessProcess, how would it differentiate which event is coming in if there are streams of multiple types? This gets me thinking more, since this seems like a naive streams problem.
Thanks.
I would create a class (let's call it Either3) that has a userId field, and then three additional fields (only one of which is ever set) that contain your three different streams' data types (look at Flink's Either class for how to do this for 2 values).
Then use a map function on each of your three streams to convert from class A/B/C to an Either3 with the appropriate value set.
Now you can .union() your three streams together, and run that one stream into your business process function, which can maintain state as needed.
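As for the "same task processor" question: keyBy routes each record by a hash of its key, so records with the same userId always land on the same parallel instance, whichever stream they came from. A toy model of that routing (not Flink code; Flink actually hashes keys into key groups with a murmur hash, but plain hashCode is enough to show the invariant):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyRoutingDemo {
    // Toy stand-in for keyBy: route a record to one of N parallel subtasks
    // by hashing its key.
    static int subtaskFor(String userId, int parallelism) {
        return Math.floorMod(userId.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Events from three different "streams" sharing userIds
        String[] streamA = {"u1", "u2"};
        String[] streamB = {"u1", "u3"};
        String[] streamC = {"u2", "u1"};

        Map<String, Integer> firstSeenSubtask = new HashMap<>();
        for (String[] stream : new String[][]{streamA, streamB, streamC}) {
            for (String userId : stream) {
                int subtask = subtaskFor(userId, parallelism);
                // The same userId always maps to the same subtask, whatever
                // stream it came from -- so keyed state for u1 is co-located.
                firstSeenSubtask.merge(userId, subtask, (old, cur) -> {
                    if (!old.equals(cur)) throw new IllegalStateException();
                    return old;
                });
                System.out.println(userId + " -> subtask " + subtask);
            }
        }
    }
}
```

This is why unioning the three streams into one Either3 stream and then applying a single keyBy gives you one process function instance per key, with shared state across all three event types.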

How to connect more than 2 streams in Flink?

I've 3 keyed data streams of different types.
DataStream<A> first;
DataStream<B> second;
DataStream<C> third;
Each stream has its own processing logic defined, and they share state between them. I want to connect these 3 streams, triggering the respective processing functions whenever data is available in any stream. Connect on two streams is possible.
first.connect(second).process(<CoProcessFunction>)
I can't use union (which allows multiple data streams) as the types are different. I want to avoid creating a wrapper and converting all the streams into the same type.
The wrapper approach isn't too bad, really. You can create an EitherOfThree<T1, T2, T3> wrapper class that's similar to Flink's existing Either<Left, Right>, and then process a stream of those records in a single function. Something like:
DataStream <EitherOfThree<A,B,C>> combo = first.map(r -> new EitherOfThree<A,B,C>(r, null, null))
.union(second.map(r -> new EitherOfThree<A,B,C>(null, r, null)))
.union(third.map(r -> new EitherOfThree<A,B,C>(null, null, r)));
combo.process(new MyProcessFunction());
Flink's Either class has a more elegant implementation, but for your use case something simple should work.
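A minimal version of such a wrapper might look like the following. This is a plain-Java sketch (names and shape are illustrative, not a Flink API); for real Flink use you would likely make it a POJO that Flink can serialize (public no-arg constructor and public fields, or getters/setters):

```java
// Minimal sketch of an EitherOfThree wrapper: exactly one of the three
// fields is ever non-null, so a single process function can branch on
// which variant it received.
public class EitherOfThree<T1, T2, T3> {
    private final T1 first;
    private final T2 second;
    private final T3 third;

    private EitherOfThree(T1 first, T2 second, T3 third) {
        this.first = first;
        this.second = second;
        this.third = third;
    }

    public static <T1, T2, T3> EitherOfThree<T1, T2, T3> ofFirst(T1 v)  { return new EitherOfThree<>(v, null, null); }
    public static <T1, T2, T3> EitherOfThree<T1, T2, T3> ofSecond(T2 v) { return new EitherOfThree<>(null, v, null); }
    public static <T1, T2, T3> EitherOfThree<T1, T2, T3> ofThird(T3 v)  { return new EitherOfThree<>(null, null, v); }

    public boolean isFirst()  { return first != null; }
    public boolean isSecond() { return second != null; }
    public boolean isThird()  { return third != null; }
    public T1 first()  { return first; }
    public T2 second() { return second; }
    public T3 third()  { return third; }

    public static void main(String[] args) {
        EitherOfThree<String, Integer, Double> e = EitherOfThree.ofSecond(7);
        // A process function over the unioned stream branches like this:
        if (e.isSecond()) System.out.println("got a B record: " + e.second());
    }
}
```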
Other than union, the standard approach is to use connect in a cascade, e.g.,
first.connect(second).process(...).connect(third).process(...)
You won't be able to share state between all three streams in one place. You can have the first process function output whatever the subsequent process function will need, but the third stream won't be able to affect the state in the first process function, which is a problem for some use cases.
Another possibility might be to leverage a lower-level mechanism -- see FLIP-92: Add N-Ary Stream Operator in Flink. However, this mechanism is intended for internal use (the Table/SQL API uses this for n-way joins), and would need to be treated with caution. See the mailing list discussion for details. I mention this for completeness, but I'm skeptical this is a good idea until the interface is further developed.
You might also want to look at the Stateful Functions API, which overcomes many of the restrictions of the DataStream API.

What is the difference between Flink join and connect?

I am confused by the definitions. In the documentation, it seems that join requires a key to be defined, but connect does not need a specified key, and its result is a ConnectedStream. What can we do with this ConnectedStream, and is there any concrete example where we would use one rather than the other?
Also, what does the connected stream look like?
Thanks in advance
A connect operation is more general than a join operation. Connect ensures that two streams (keyed or unkeyed) meet at the same location (at the same parallel instance within a CoXXXFunction).
One stream could be a control stream that manipulates the behavior applied to the other stream. For example, you could stream-in new machine learning models or other business rules.
Alternatively, you can use the property of two streams that are keyed and meet at the same location for joining. Flink provides some predefined join operators.
However, joining of data streams often depends on different use case-specific behaviors such as "How long do you want to wait for the other key to arrive?", "Do you only look for one matching pair or more?", or "Are there late elements that need special treatment if no matching record arrives or the other matching record is not stored in state anymore?". A connect() allows you to implement your own joining logic if needed. The data Artisans training here explains one example of connect for joining.
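The shape of such hand-rolled join logic can be sketched outside Flink like this. In a real CoProcessFunction the two maps would be Flink keyed state and you would register timers to expire entries; this toy model only shows the control flow:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a CoProcessFunction-style join: each side buffers its
// records per key and emits a pair whenever the other side already
// holds a match for that key.
public class ConnectJoinDemo {
    private final Map<String, List<String>> leftState = new HashMap<>();
    private final Map<String, List<String>> rightState = new HashMap<>();
    final List<String> output = new ArrayList<>();

    void processElement1(String key, String value) {          // left stream
        leftState.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (String r : rightState.getOrDefault(key, List.of()))
            output.add(value + "|" + r);
    }

    void processElement2(String key, String value) {          // right stream
        rightState.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (String l : leftState.getOrDefault(key, List.of()))
            output.add(l + "|" + value);
    }

    public static void main(String[] args) {
        ConnectJoinDemo join = new ConnectJoinDemo();
        join.processElement1("k1", "a");   // no match yet, buffered
        join.processElement2("k1", "x");   // matches the buffered "a"
        join.processElement2("k2", "y");   // different key, no match
        System.out.println(join.output);
    }
}
```

The use-case-specific questions above (how long to wait, one match or many, what to do with late elements) all become decisions about when entries are added to, read from, and removed from these per-key buffers.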

Iteration over multiple streams in Apache Flink

My question is regarding iteration over multiple streams in Apache Flink.
I am a Flink beginner, and I am currently trying to execute a recursive query (e.g., datalog) on Flink.
For example, a query calculates the transitive closure every 5 minutes (tumbling window). Suppose I have one input stream, inputStream (consisting of the initial edge information), and another stream, outputStream (the transitive closure), which is initialised from the inputStream. I want to iteratively enrich the outputStream by joining it with the inputStream. For each iteration, the feedback should be the outputStream, and the iteration should last until no more edges can be appended to the outputStream. The computation of my transitive closure should trigger periodically, every 5 minutes. During the iteration, the inputStream should be "held" and provide the data for my outputStream.
Is it possible to do this in Flink? Thanks for any help!
This sounds like a side-input issue, where you want to treat the "inputStream" as a batch dataset (with refresh) that's joined to the other "outputStream". Unfortunately Flink doesn't provide an easy way to implement that currently (see https://stackoverflow.com/a/48701829/231762)
If both of these streams are coming from data sources, then one approach is to create a wrapper source that controls the ordering of the records. It would have to emit something like a Tuple2 where one side or the other is null, and then in a downstream (custom) Function you'd essentially split these, and do the joining.
If that's possible, then this source can block the "output" tuples while it emits the "input" tuples, plus other logic it sounds like you need (5 minute refresh, etc). See my response to the other SO issue above for skeleton code that does this.

Stream load balancing

I have two streams. One is an event stream, the other is a database update stream. I want to enrich the event stream with information built from the DB update stream.
The event stream is very voluminous and is partitioned using 5 fields. This gives me good distribution. The DB stream is a lot less chatty, and is partitioned using two fields. I am currently connecting the two streams using the two common fields and using a flatMap to enrich the first stream. The flatMap operator uses ValueState to maintain state, which is automatically keyed by the two common fields.
I find that the load in the event stream tends to be skewed in terms of the two common fields. This causes uneven load balancing across the flatMap instances, and a few instances are around 10 times more loaded than the others.
I am thinking a better approach would be to broadcast the DB update stream across all flatMap instances and simply forward the event stream based on its existing partitioning scheme. However the issue is that because there are no keys specified for the connect operator, I cannot use ValueState.
Other than implementing custom logic to manually extract the key and maintain state, is there anything else I can do?
Is there a simpler approach I am missing?
You can implement the Checkpointed interface with the CoFlatMapFunction to checkpoint the broadcasted DB updates instead of using the key-value state interface.
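The broadcast idea itself can be sketched in plain Java (this is a toy model, not the Flink API; newer Flink versions also offer a dedicated broadcast state pattern for exactly this shape of problem):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of broadcasting DB updates to every parallel instance while
// events keep their existing partitioning. Each "instance" holds a full
// copy of the reference data, so any instance can enrich any event
// without re-keying the connect by the two common fields.
public class BroadcastEnrichDemo {
    static class Instance {
        final Map<String, String> dbCopy = new HashMap<>(); // broadcast state
        String enrich(String eventKey) {
            return eventKey + ":" + dbCopy.getOrDefault(eventKey, "unknown");
        }
    }

    public static void main(String[] args) {
        List<Instance> instances = List.of(new Instance(), new Instance(), new Instance());

        // A DB update is replicated to every instance (the broadcast).
        for (Instance i : instances) i.dbCopy.put("acct-1", "premium");

        // Events stay on whatever partition they already had; any instance
        // can enrich locally because each holds the full DB copy.
        System.out.println(instances.get(2).enrich("acct-1"));
    }
}
```

The trade-off is memory: every instance stores the whole reference dataset, which is fine here precisely because the DB stream is small compared to the event stream.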
