Flink stream going to two sinks based on conditions - apache-flink

Trying to see the possibility of stream going to two sinks based on conditions.
Requirement is stream have events, all events after transformation need to go to one sink ( assume one kafka topic)
And only error events needs to go to another sink ( assume another kafka topic).
did not see use-case of once transformation is done , additional logic putting in sink. Looking if something similar done

The best way to do this is with side outputs.
private static final OutputTag<String> errors = new OutputTag<>("errors") {};
...
// in your main() method
SingleOutputStreamOperator<T> result = events.process(new ProcessFunction());
result.addSink(sink).name("normal output");
result.getSideOutput(errors).addSink(errorSink).name("error output");
...
// in the process function
if (somethingGoesWrong) {
ctx.output(errors, "error message");
}
While there are other ways to split a stream with Flink, side outputs are very flexible (e.g., the side outputs can have different types) and perform well.

Related

How to collect side output from rich sink function in Apache Flink?

I am trying below scenario in Flink
Flink consume data from kafka topic and validate against avro schema
Converting the data into JSON payload in process function after some enrichments on the data
After enrichment of data of it should be written to Postgres database and upload data to Azure blob storage through Flink RichSinkFunction
I am stuck in one place in Sink function the process should happen in transactional way meaning if any exception while persisting data to postgres or any exception happens while uploading data to Azure blob storage the process should through exception and the it should rollback the data from database as well it should data from Azure blob storage. In case of exception the payload received Sink function should be put a kafka topic but i am not getting how to handle that i know that in processfunction it support a side through which we can send the data to different topic but Sink won't support side output.
Is there a way i can publish the payload received in Sink to Kakfa topic in case of any exception.
I am not sure about the programming language which you are using right now but you can do something like below using Scala inside a process function and call sink methods based on the output returned by the process function.
Try {
}
match {
case Success(x) => {
.
.
Right(x)
}
case Failure(err) => {
.
.
Left(err)
}
}
Your process element method will look something like below:
override def process(key: Int,context: Context, elements: Iterable[String], out: Collector[(String, String)]): Unit = {
for (i <- elements) {
println("Inside Process.....")
parseJson(i) match {
case Right(data) => {
context.output(goodOutputTag, data)
out.collect(data) //usually used to collect records and emits them to writer.and so on,collect be called when needs to write data.
}
case Left(err) => {
context.output(badOutputTag, dataTuple) // side outputs, when needed to split a stream of data. Emit data to side output and a new datastream can be created using .getSideOutput(outputTag)
}
}
}
Now, use these output tags from the Success and Failure cases and create a data stream out of it in your invoker object and call your respective sink methods.

Flink 1.12.x DataSet --> Flink 1.14.x DataStream

I am trying to migrate from Flink 1.12.x DataSet api to Flink 1.14.x DataStream api. mapPartition is not available in Flink DataStream.
Our Code using Flink 1.12.x DataSet
dataset
.<few operations>
.mapPartition(new SomeMapParitionFn())
.<few more operations>
public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {
#Override
public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
for (InputModel record : records) {
/*
do some operation
*/
if (/* some condition based on processing *MULTIPLE* records */) {
out.collect(...); // Conditional collect ---> (1)
}
}
// At the end of the data, collect
out.collect(...); // Collect processed data ---> (2)
}
}
(1) - Collector.collect invoked based on some condition after processing few records
(2) - Collector.collect invoked at the end of data
Initially we thought of using flatMap instead of mapPartition, but collector not available in close function.
https://issues.apache.org/jira/browse/FLINK-14709 - Only available in case of chained drivers
How to implement this in Flink 1.14.x DataStream? Please advise...
Note: Our application works with only finite set of data (Batch Mode)
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement similar function, you need to define a window over the stream. Windows discretize streams which is somewhat similar to mini batches but windows offer way more flexibility
Solution provided by Zhipeng
One solution could be using a streamOperator to implement BoundedOneInput
interface.
An example code could be found here [1].
[1]
https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75
Flink user mailing link: https://lists.apache.org/thread/ktck2y96d0q1odnjjkfks0dmrwh7kb3z

Bypass Flink CEP Buffer in an EventTime Streaming Environment

I have an EventTime streaming application that uses the CEP library for a basic three-step pattern on a joined stream. The joined stream is a combination of live, watermarked, and windowed data and a stream of historical items outside of the windowing/watermarking.
The setup is similar to the dataArtisans blog post except with the CEP Pattern as the last step.
Our CEP setup looks like this, and worked before adding in the non-timestamped historical stream. The EscalatingAlertEventIterativeCondition makes sure that the previous event match is of a greater level than the next.
Pattern<AlertEvent, ?> pattern = Pattern.<AlertEvent>
begin("one")
.where((AlertEvent event) -> event.level > 0)
.next("two")
.where(new EscalatingAlertEventIterativeCondition("one"))
.next("three")
.where(new EscalatingAlertEventIterativeCondition("two"));
return CEP.pattern(
alertEventStream,
pattern
);
The problem I'm seeing is that CEP is forever buffering (breakpoints within the filter and iterative conditions are now not hit) and that the filtering/selection never happens. I initially thought this could be due to the CEP buffer but am unsure as I am new to both Flink and Flink CEP. Is there any way to avoid the lateness buffer, or does something else look amiss?
Our job graph, where only the top, live stream of data is timestamped and watermarked:

Non blocking streaming on Flink

Hi, I'm trying to run a Flink job that it should process incoming data as below. In the process operator right after keyBy(), there should be a case that takes too much time according to some property in data. Even though incoming data have different ids (which is used to keyBy() the stream), long processing code in process function blocks other incoming data. I mean the entire stream.
SingleOutputStreamOperator<Envelope> processingStream = deviceStream
.map(e -> (Envelope) e)
.keyBy((KeySelector<Envelope, String>) value -> value.eventId) // key by scenarios
.process(new RuleProcessFunction());
In RuleProcessFunction.java:
...
#Override
public void processElement(Envelope value, Context ctx, Collector<Envelope> out) throws Exception {
//handleEvent(value, ctx, out);
if (value.getEventId().equals("I")) {
System.out.println("hello i");
for (long i = 0; i < 10000000000L; i++) {
}
}
out.collect(value);
}
I expect the long-running code block should not block the entire stream. I know there is AsyncFunction for blocking IO situations but I don't know that it's correct solution for this.
Since you aren't pulling data from an external database like Cassandra, I don't think you need to use an AsyncFunction.
What it could be that you are running the flink job with a single parallelism. Try increasing the parallelism so one core isn't responsible for all of the processing as well as receiving data. Granted, there can still be back pressure if you do this. Since if the core responsible for ingesting data from the source is reading in data faster than the core(s) that are running the processFunction Flink's back pressure handling will slow the rate of ingestion.

Apache Flink DataStream API doesn't have a mapPartition transformation

Spark DStream has mapPartition API, while Flink DataStream API doesn't. Is there anyone who could help explain the reason. What I want to do is to implement a API similar to Spark reduceByKey on Flink.
Flink's stream processing model is quite different from Spark Streaming which is centered around mini batches. In Spark Streaming each mini batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams which is somewhat similar to mini batches but windows offer way more flexibility. Since a window is of finite size, you can call reduce the window.
This could look like:
yourStream.keyBy("myKey") // organize stream by key "myKey"
.timeWindow(Time.seconds(5)) // build 5 sec tumbling windows
.reduce(new YourReduceFunction); // apply a reduce function on each window
The DataStream documentation shows how to define various window types and explains all available functions.
Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT) which will be release as 0.10.0 in the next days.
Assuming your input stream is single partition data (say String)
val new_number_of_partitions = 4
//below line partitions your data, you can broadcast data to all partitions
val step1stream = yourStream.rescale.setParallelism(new_number_of_partitions)
//flexibility for mapping
val step2stream = step1stream.map(new RichMapFunction[String, (String, Int)]{
// var local_val_to_different_part : Type = null
var myTaskId : Int = null
//below function is executed once for each mapper function (one mapper per partition)
override def open(config: Configuration): Unit = {
myTaskId = getRuntimeContext.getIndexOfThisSubtask
//do whatever initialization you want to do. read from data sources..
}
def map(value: String): (String, Int) = {
(value, myTasKId)
}
})
val step3stream = step2stream.keyBy(0).countWindow(new_number_of_partitions).sum(1).print
//Instead of sum(1), you can use .reduce((x,y)=>(x._1,x._2+y._2))
//.countWindow will first wait for a certain number of records for perticular key
// and then apply the function
Flink streaming is pure streaming (not the batched one). Take a look at Iterate API.

Resources