I am new to Apache Flink. I want to create a DataStream and feed it with values from another system.
I found examples of how to add a SourceFunction; in that function I have to wait for values from a source and publish them to Flink by calling ctx.collect, then wait again. It's polling.
But I have a data source that calls a function when values arrive (async). So what I want to do is: when this async call happens, put the value into a Flink DataStream. Pseudo-code:
mysystem.connect_to_values( (value) => { myflinkdatastream.put(value.toString) })
Can this be done? Otherwise I would have to execute my connect and callback inside the SourceFunction and follow it with a sleep loop, but I don't want to do it that way...
I have already seen "Asynchronous I/O for External Data Access" in Flink, but for that I still need a source stream, which is fed by a SourceFunction (poll/loop).
If you do not want to add a SourceFunction to your streaming job, I suggest using Kafka or another message queue: send the data from the async source to the queue, and connect the Flink streaming job to that queue.
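A minimal sketch of that setup, assuming your system's connect_to_values callback from the question, a broker at localhost:9092 and a topic named "values" (both illustrative), the plain Kafka producer client on the callback side, and the FlinkKafkaConsumer connector on the Flink side:

// Callback side: every async value is forwarded to Kafka.
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
mysystem.connect_to_values(value ->   // your system's callback API (from the question)
    producer.send(new ProducerRecord<>("values", value.toString())));

// Flink side: the topic becomes an ordinary source stream.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "values-consumer");
DataStream<String> values = env.addSource(
    new FlinkKafkaConsumer<>("values", new SimpleStringSchema(), consumerProps));

This also decouples the producing system from the Flink job, so the job can be restarted without losing values that arrive in the meantime.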
Related
I am trying the below scenario in Flink:
Flink consumes data from a Kafka topic and validates it against an Avro schema.
The data is converted into a JSON payload in a process function after some enrichment.
After enrichment the data should be written to a Postgres database and uploaded to Azure Blob Storage through a Flink RichSinkFunction.
I am stuck in one place: in the sink function the process should happen transactionally, meaning that if any exception occurs while persisting data to Postgres or while uploading data to Azure Blob Storage, the process should throw the exception and roll back the data from the database as well as remove it from Azure Blob Storage. In case of an exception, the payload received by the sink function should be put on a Kafka topic, but I am not getting how to handle that. I know that a process function supports side outputs through which we can send data to a different topic, but a sink doesn't support side outputs.
Is there a way I can publish the payload received in the sink to a Kafka topic in case of an exception?
I am not sure which programming language you are using right now, but you can do something like the below using Scala inside a process function and call sink methods based on the output returned by the process function.
Try {
  // risky work: validation / enrichment / persistence attempt goes here
} match {
  case Success(x) =>
    // handle the successful result
    Right(x)
  case Failure(err) =>
    // handle the failure
    Left(err)
}
Your process element method will look something like below:
override def process(key: Int, context: Context, elements: Iterable[String], out: Collector[(String, String)]): Unit = {
  for (i <- elements) {
    println("Inside Process.....")
    parseJson(i) match {
      case Right(data) => {
        context.output(goodOutputTag, data)
        out.collect(data) // main output: collect records and emit them to the downstream writer
      }
      case Left(err) => {
        context.output(badOutputTag, (i, err.toString)) // side output: emit the failing element and the error; a new DataStream can be created later with .getSideOutput(badOutputTag)
      }
    }
  }
}
Now use these output tags from the Success and Failure cases, create data streams out of them in your invoker object, and call your respective sink methods.
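For illustration, the wiring could look like the following sketch (shown in Java; the tag names, the process function, and the sink factory methods are placeholders for your own code, and the element types of the tags should match whatever your process function emits):

final OutputTag<String> goodOutputTag = new OutputTag<String>("good-output") {};
final OutputTag<String> badOutputTag = new OutputTag<String>("bad-output") {};

SingleOutputStreamOperator<String> processed = enrichedStream
    .keyBy(record -> record.hashCode())
    .process(new YourValidatingProcessFunction(goodOutputTag, badOutputTag));

// happy path: write to Postgres / Azure Blob Storage
processed.getSideOutput(goodOutputTag).addSink(yourPostgresAndBlobSink());

// failure path: publish the original payload to a Kafka error topic
processed.getSideOutput(badOutputTag).addSink(yourKafkaErrorSink());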
I am trying to migrate from the Flink 1.12.x DataSet API to the Flink 1.14.x DataStream API. mapPartition is not available in the Flink DataStream API.
Our code using the Flink 1.12.x DataSet API:
dataset
.<few operations>
.mapPartition(new SomeMapParitionFn())
.<few more operations>
public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {

    @Override
    public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
        for (InputModel record : records) {
            /*
                do some operation
            */
            if (/* some condition based on processing *MULTIPLE* records */) {
                out.collect(...); // Conditional collect ---> (1)
            }
        }
        // At the end of the data, collect
        out.collect(...); // Collect processed data ---> (2)
    }
}
(1) - Collector.collect invoked based on some condition after processing a few records
(2) - Collector.collect invoked at the end of the data
Initially we thought of using flatMap instead of mapPartition, but the Collector is not available in the close function.
https://issues.apache.org/jira/browse/FLINK-14709 - only available in the case of chained drivers
How can this be implemented in the Flink 1.14.x DataStream API? Please advise...
Note: Our application works only with a finite set of data (batch mode).
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement a similar function, you need to define a window over the stream. Windows discretize streams, which is somewhat similar to mini batches, but windows offer way more flexibility.
Solution provided by Zhipeng:
One solution could be to use a stream operator that implements the BoundedOneInput interface. Example code can be found here [1].
[1] https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75
Flink user mailing list thread: https://lists.apache.org/thread/ktck2y96d0q1odnjjkfks0dmrwh7kb3z
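As a rough illustration of that approach (not the exact code from [1]): a custom operator buffers the records of its subtask and runs the mapPartition-style logic once the bounded input ends. Types are taken from the question, the buffering is simplified (kept on the heap and not checkpointed), and the final emit is only sketched:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.BoundedOneInput;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class MapPartitionOperator
        extends AbstractStreamOperator<OutputModel>
        implements OneInputStreamOperator<InputModel, OutputModel>, BoundedOneInput {

    // Records collected by this subtask; for large inputs this should be
    // spilled to disk or kept in managed state instead of the heap.
    private final List<InputModel> buffer = new ArrayList<>();

    @Override
    public void processElement(StreamRecord<InputModel> element) {
        buffer.add(element.getValue());
    }

    @Override
    public void endInput() {
        // Equivalent of RichMapPartitionFunction.mapPartition: all records of
        // this subtask are available once the bounded input has ended.
        for (InputModel record : buffer) {
            // do some operation; conditional collect ---> (1)
        }
        // collect the processed result at the end of the data ---> (2)
        // output.collect(new StreamRecord<>(result));
    }
}

It would be applied with something like stream.transform("mapPartition", TypeInformation.of(OutputModel.class), new MapPartitionOperator()).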
I have implemented a CEP pattern in Flink which works as expected when connecting to a local Kafka broker. But when I connect to a cluster-based cloud Kafka setup, the Flink CEP does not trigger.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// retain externalized checkpoints on cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
I am using an AscendingTimestampExtractor:
consumer.assignTimestampsAndWatermarks(
    new AscendingTimestampExtractor<ObjectNode>() {
        @Override
        public long extractAscendingTimestamp(ObjectNode objectNode) {
            Instant instant = Instant.parse(objectNode.get("value").get("timestamp").asText());
            return instant.toEpochMilli();
        }
    });
I am also getting this warning message:
AscendingTimestampExtractor:140 - Timestamp monotony violated: 1594017872227 < 1594017873133
I also tried using AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks; neither of them works.
I have attached a Flink console screenshot where the watermark is not being assigned, and an updated Flink console screenshot.
Could anyone help?
CEP must first sort the input stream(s), which it does based on the watermarking. So the problem could be with the watermarking, but you haven't shown us enough to debug the cause. One common issue is having an idle source, which can prevent the watermarks from advancing.
But there are other possible causes. To debug the situation, I suggest you look at some metrics, either in the Flink Web UI or in a metrics system if you have one connected. To begin, check if records are flowing, by looking at numRecordsIn, numRecordsOut, or numRecordsInPerSecond and numRecordsOutPerSecond at different stages of your pipeline.
If there are events, then look at currentOutputWatermark throughout the different tasks of your job to see if event time is advancing.
Update:
It appears you may be calling assignTimestampsAndWatermarks on the Kafka consumer, which results in per-partition watermarking. In that case, if you have an idle partition, that partition won't produce any watermarks, and that will hold back the overall watermark. Try calling assignTimestampsAndWatermarks on the DataStream produced by the source instead, to see if that fixes things. (Of course, without per-partition watermarking, you won't be able to use an AscendingTimestampExtractor, since the stream won't be in order.)
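For example, a hedged sketch using the same legacy watermark API as your code (the 10-second out-of-orderness bound is an assumption):

DataStream<ObjectNode> events = env
    .addSource(consumer) // no assignTimestampsAndWatermarks on the consumer itself
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<ObjectNode>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(ObjectNode objectNode) {
                // same timestamp extraction as before
                return Instant.parse(objectNode.get("value").get("timestamp").asText()).toEpochMilli();
            }
        });

A BoundedOutOfOrdernessTimestampExtractor is used here instead of the AscendingTimestampExtractor because, as noted above, the merged stream from several partitions is no longer guaranteed to be in timestamp order.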
My Flink processor listens to Kafka, and the business logic in the processor involves calling external REST services, which may be down. I would like to replay the tuple back into the processor; is there any way to do it? I have used Storm, where we are able to fail a tuple so that it is not acknowledged and the same tuple is replayed to the processor.
In Flink, the tuple is acknowledged automatically once the message is consumed by the Flink Kafka consumer. There are ways to solve this; one such way is to publish the message back to the same queue or a retry queue. But I am looking for a solution similar to Storm's.
I know that Flink's savepoints/checkpoints are used for fault tolerance. But in my understanding, tuples will only be replayed in case of a Flink failure. I would like to get ideas on how to handle transient failures.
Thank you
When interacting with external systems, I would recommend using Flink's async I/O operator. It allows you to execute asynchronous tasks without blocking the execution of an operator.
If you want to retry failed operations without restarting the Flink job from the last successful checkpoint, then I would suggest implementing the retry policy yourself. It could look the following way:
new AsyncFunction<IN, OUT>() {
    @Override
    public void asyncInvoke(IN input, ResultFuture<OUT> resultFuture) throws Exception {
        FutureUtils
            .retrySuccessfulWithDelay(
                () -> triggerAsyncOperation(input),
                Time.seconds(1L),
                Deadline.fromNow(Duration.ofSeconds(10L)),
                this::decideWhetherToRetry,
                new ScheduledExecutorServiceAdapter(new DirectScheduledExecutorService()))
            .whenComplete((result, throwable) -> {
                if (result != null) {
                    resultFuture.complete(Collections.singleton(result));
                } else {
                    resultFuture.completeExceptionally(throwable);
                }
            });
    }
}
with triggerAsyncOperation encapsulating your asynchronous operation and decideWhetherToRetry encapsulating your retry strategy. If decideWhetherToRetry returns true, then resultFuture will be completed with the value of this operation attempt.
If resultFuture is completed exceptionally, then it will trigger a failover which will cause the job to restart from that last successful checkpoint.
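To actually attach such a function to your pipeline, you wrap the input stream with the async I/O operator. A brief sketch (the stream and class names are placeholders, and the 30-second timeout and capacity of 100 in-flight requests are assumptions):

DataStream<OUT> enriched = AsyncDataStream.unorderedWait(
    kafkaSourceStream,            // the stream consumed from Kafka
    new RetryingAsyncFunction(),  // the AsyncFunction sketched above
    30, TimeUnit.SECONDS,         // timeout per element
    100);                         // maximum number of in-flight async requests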
Spark DStream has a mapPartition API, while the Flink DataStream API doesn't. Could anyone help explain the reason? What I want to do is implement an API similar to Spark's reduceByKey on Flink.
Flink's stream processing model is quite different from Spark Streaming, which is centered around mini batches. In Spark Streaming, each mini batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams, which is somewhat similar to mini batches, but windows offer way more flexibility. Since a window is of finite size, you can call reduce on the window.
This could look like:
yourStream.keyBy("myKey") // organize stream by key "myKey"
.timeWindow(Time.seconds(5)) // build 5 sec tumbling windows
.reduce(new YourReduceFunction); // apply a reduce function on each window
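For completeness, YourReduceFunction could look like the following sketch, assuming the stream carries Tuple2<String, Integer> elements (the types are an assumption; adapt them to your actual key and value):

public class YourReduceFunction implements ReduceFunction<Tuple2<String, Integer>> {
    @Override
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
        // keep the key, sum the counts within the window
        return new Tuple2<>(a.f0, a.f1 + b.f1);
    }
}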
The DataStream documentation shows how to define various window types and explains all available functions.
Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT), which will be released as 0.10.0 in the next few days.
Assuming your input stream is single-partition data (say String):

val new_number_of_partitions = 4

// the line below partitions your data; you can also broadcast data to all partitions
val step1stream = yourStream.rescale.setParallelism(new_number_of_partitions)

// flexibility for mapping
val step2stream = step1stream.map(new RichMapFunction[String, (String, Int)] {
  // var local_val_to_different_part : Type = null
  var myTaskId: Int = _

  // the function below is executed once per mapper instance (one mapper per partition)
  override def open(config: Configuration): Unit = {
    myTaskId = getRuntimeContext.getIndexOfThisSubtask
    // do whatever initialization you want to do, read from data sources, ...
  }

  def map(value: String): (String, Int) = {
    (value, myTaskId)
  }
})

val step3stream = step2stream.keyBy(0).countWindow(new_number_of_partitions).sum(1).print
// Instead of sum(1), you can use .reduce((x, y) => (x._1, x._2 + y._2))
// .countWindow will first wait for a certain number of records for a particular key
// and then apply the function
Flink streaming is pure streaming (not batched). Take a look at the Iterate API.
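As a quick illustration of the Iterate API (Java shown; a toy decrement-until-zero loop similar to the one in the Flink documentation, where someLongs is an assumed DataStream<Long>):

IterativeStream<Long> iteration = someLongs.iterate();

// loop body: decrement each value by one
DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
    @Override
    public Long map(Long value) {
        return value - 1;
    }
});

// values still greater than zero are fed back into the loop
iteration.closeWith(minusOne.filter(value -> value > 0));

// values that have reached zero (or below) leave the iteration
DataStream<Long> done = minusOne.filter(value -> value <= 0);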