Transfer events from Apache Flink to Apache NiFi - poor performance

I have integrated two tools, Apache NiFi and Apache Flink. NiFi takes events and sends them to Flink, and after some processing Flink returns these events to the same NiFi instance.
I built a source and a sink for NiFi in Flink. The whole process works, but the performance of the sink is very poor (about 10 events per second).
If I remove the sink (and only print the output), the processing is much faster.
I figured out that I can change the parallelism of the sink using setParallelism(); of course that helps, but the base throughput is too low.
I also tried requestBatchCount(1000), but nothing changed.
My problem is probably related to transactions: after each event the sink waits for the transaction to close. I'm not sure about that, and I don't know how to change it, e.g. to send hundreds of events in one transaction.
What can I do to increase the performance of the sink?
Here is my sink definition:
SiteToSiteClientConfig sinkConfig = new SiteToSiteClient.Builder()
        .url("http://" + host + ":" + port + "/nifi")
        .portName("Data from Flink")
        .buildConfig();

outStream.addSink(new NiFiSink<String>(sinkConfig, new NiFiDataPacketBuilder<String>() {
    @Override
    public NiFiDataPacket createNiFiDataPacket(String s, RuntimeContext ctx) {
        return new StandardNiFiDataPacket(s.getBytes(), new HashMap<>());
    }
}));
I'm currently using the latest versions of Flink (1.5.1) and NiFi (1.7.1).

The NiFiSink provided by Apache Flink creates a transaction for each event:
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-nifi/src/main/java/org/apache/flink/streaming/connectors/nifi/NiFiSink.java#L60-L67
It was done this way to make the error handling clear: if the transaction fails to commit, an exception is thrown out of the invoke method in the context of the specific event that failed.
You could implement your own custom version that sends many events before calling commit on the transaction, but I don't know enough about Flink to say what happens if the transaction then fails to commit and you no longer have the events.
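As a very rough illustration of such a custom version (the class name, the count-based batchSize trigger, and the flush-on-close behavior are my own assumptions, not part of the Flink connector), something along these lines could buffer events and send a whole batch in a single site-to-site transaction:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.connectors.nifi.NiFiDataPacket;
import org.apache.flink.streaming.connectors.nifi.NiFiDataPacketBuilder;
import org.apache.nifi.remote.Transaction;
import org.apache.nifi.remote.TransferDirection;
import org.apache.nifi.remote.client.SiteToSiteClient;
import org.apache.nifi.remote.client.SiteToSiteClientConfig;

// Hypothetical batching variant of Flink's NiFiSink: buffers events and
// sends them to NiFi in one site-to-site transaction per batch.
public class BatchingNiFiSink<T> extends RichSinkFunction<T> {

    private final SiteToSiteClientConfig clientConfig;
    private final NiFiDataPacketBuilder<T> builder;
    private final int batchSize;

    private transient SiteToSiteClient client;
    private transient List<NiFiDataPacket> buffer;

    public BatchingNiFiSink(SiteToSiteClientConfig clientConfig, NiFiDataPacketBuilder<T> builder, int batchSize) {
        this.clientConfig = clientConfig;
        this.builder = builder;
        this.batchSize = batchSize;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        client = new SiteToSiteClient.Builder().fromConfig(clientConfig).build();
        buffer = new ArrayList<>(batchSize);
    }

    @Override
    public void invoke(T value) throws Exception {
        buffer.add(builder.createNiFiDataPacket(value, getRuntimeContext()));
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    private void flush() throws Exception {
        if (buffer.isEmpty()) {
            return;
        }
        Transaction transaction = client.createTransaction(TransferDirection.SEND);
        if (transaction == null) {
            throw new IllegalStateException("Unable to create a NiFi transaction to send data");
        }
        for (NiFiDataPacket packet : buffer) {
            transaction.send(packet.getContent(), packet.getAttributes());
        }
        transaction.confirm();
        transaction.complete();
        buffer.clear();
    }

    @Override
    public void close() throws Exception {
        try {
            flush();              // send whatever is still buffered before shutting down
        } finally {
            client.close();
            super.close();
        }
    }
}

Note that events sitting in the buffer are lost if the task fails before a flush, so without tying the flush into Flink's checkpointing you give up the per-event error handling of the stock sink, which is exactly the caveat above.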

Related

How does Apache Flink's DataStream API support batch processing of events?

My producer is Apache Kafka, and we want to consume a batch of events, process them, and write the processed events to the database. If I process the stream event by event, every event results in one query to the DB. I don't want every event to turn into its own query. How can I batch some of the events and write the bulk data to the DB in one go?
Note: We are using the DataStream API.
No, there isn't an official Neo4j sink for Flink. If your goal is to implement exactly-once end-to-end by doing buffered, batched transactional updates, you might start by reading An Overview of End-to-End Exactly-Once Processing in Apache Flink, and then reach out to the Flink user mailing list for further guidance.
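For the batching part of the question (independent of the exactly-once discussion), a count-based buffering sink that issues one JDBC batch insert per flush is a common pattern; a rough sketch, where the JDBC URL, table, and batch size are all placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Hypothetical buffering sink: collects events and writes them to the
// database in batches instead of issuing one query per event.
public class BufferedJdbcSink extends RichSinkFunction<String> {

    private static final int BATCH_SIZE = 500;   // made-up threshold

    private transient Connection connection;
    private transient PreparedStatement statement;
    private transient List<String> buffer;

    @Override
    public void open(Configuration parameters) throws Exception {
        // assumed URL, credentials, and schema
        connection = DriverManager.getConnection("jdbc:postgresql://localhost:5432/mydb", "user", "password");
        statement = connection.prepareStatement("INSERT INTO events (payload) VALUES (?)");
        buffer = new ArrayList<>(BATCH_SIZE);
    }

    @Override
    public void invoke(String value) throws Exception {
        buffer.add(value);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    private void flush() throws Exception {
        for (String value : buffer) {
            statement.setString(1, value);
            statement.addBatch();
        }
        statement.executeBatch();   // one round trip for the whole batch
        buffer.clear();
    }

    @Override
    public void close() throws Exception {
        if (!buffer.isEmpty()) {
            flush();
        }
        statement.close();
        connection.close();
    }
}

A failure between flushes can drop or replay the buffered events, which is where the exactly-once material linked above becomes relevant.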

Is Flink aware of Kafka partition additions during runtime?

I have a question about the Flink Kafka source:
A Flink application starts up after being restored from a checkpoint and runs well.
While it is running, several Kafka partitions are added to the Kafka topic. Will the running Flink application become aware of these added partitions and read them without manual effort, or do I have to restart the application so that Flink picks up the new partitions during startup?
Could you please point me to the code where Flink handles Kafka partition changes, if adding partitions doesn't need manual effort? I didn't find the logic in the code.
Thanks!
It looks like Flink becomes aware of new topics and new partitions during runtime. The method call sequence is:
FlinkKafkaConsumerBase#run
FlinkKafkaConsumerBase#runWithPartitionDiscovery
FlinkKafkaConsumerBase#createAndStartDiscoveryLoop
In the last method, a new thread is kicked off to discover new topics/partitions periodically.
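Note that the discovery loop is only started when a discovery interval is configured in the consumer properties; a minimal sketch, where the broker address, group id, topic, and interval value are made up:

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");   // assumed broker address
props.setProperty("group.id", "my-group");                  // assumed consumer group
// enable periodic topic/partition discovery; without this key, discovery stays disabled
props.setProperty("flink.partition-discovery.interval-millis", "10000");

FlinkKafkaConsumer011<String> consumer =
        new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props);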

How to deploy a new job without downtime

I have an Apache Flink application that reads from a single Kafka topic.
I would like to update the application from time to time without experiencing downtime. For now the Flink application executes some simple operators such as map, plus some synchronous IO to external systems via HTTP REST APIs.
I have tried to use the stop command, but I get "Job termination (STOP) failed: This job is not stoppable." I understand that the Kafka connector does not support the stop behavior (link).
A simple solution would be to cancel with a savepoint and redeploy the new jar from that savepoint, but then we get downtime.
Another solution would be to control the deployment from the outside, for example by switching to a new topic.
What would be a good practice?
If you don't need exactly-once output (i.e., you can tolerate some duplicates), you can take a savepoint without cancelling the running job. Once the savepoint has completed, you start a second job. The second job could write to a different topic, but it doesn't have to. When the second job is up, you can cancel the first job.
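A rough sketch of that workflow with the Flink CLI (job IDs, paths, and the jar name are placeholders):

# take a savepoint of the running job without cancelling it
bin/flink savepoint <jobId> hdfs:///flink/savepoints

# start the new version of the job from the savepoint that was created
bin/flink run -d -s <savepointPath> new-version.jar

# once the new job is up and healthy, cancel the old one
bin/flink cancel <jobId>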

Commit Kafka Offsets Manually in Flink

In a Flink streaming application that is ingesting messages from Kafka,
1) How do I disable auto-committing?
2) How do I manually commit from Flink after successfully processing a message?
Thanks.
By default Flink commits offsets on checkpoints. You can disable it as follows:
val consumer = new FlinkKafkaConsumer011[T](...)
consumer.setCommitOffsetsOnCheckpoints(false)
If you don't have checkpoints enabled, see here.
Why would you do that, though? Flink's checkpointing mechanism is there to solve this problem for you. Flink won't commit offsets in the presence of failures. If you throw an exception at some point downstream of the Kafka consumer, Flink will attempt to restart the stream from the previous successful checkpoint. If the error persists, Flink will repeatedly restart for the configured number of times before failing the stream.
This means that it is unlikely you will lose messages due to Flink committing the offsets of messages your code hasn't successfully processed.
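For reference, a minimal sketch of the checkpoint-driven setup described above (broker address, group id, topic, and checkpoint interval are assumptions):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class CheckpointedKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // offsets are committed back to Kafka only when a checkpoint completes
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");   // assumed broker
        props.setProperty("group.id", "my-group");                  // assumed group id

        FlinkKafkaConsumer011<String> consumer =
                new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props);
        // true is the default; set to false to stop Flink from committing offsets on checkpoints
        consumer.setCommitOffsetsOnCheckpoints(true);

        env.addSource(consumer).print();
        env.execute("checkpointed-kafka-job");
    }
}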

Flink Kafka connector 0.10.0 Event time Clarification and ProcessFunction Clarification

I'm struggling with an issue regarding the event time handling of Flink's Kafka consumer connector.
Citing the Flink docs:
Since Apache Kafka 0.10+, Kafka’s messages can carry timestamps, indicating the time the event has occurred (see “event time” in Apache Flink) or the time when the message has been written to the Kafka broker.
The FlinkKafkaConsumer010 will emit records with the timestamp attached, if the time characteristic in Flink is set to TimeCharacteristic.EventTime (StreamExecutionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)).
The Kafka consumer does not emit watermarks.
Some questions and issues come to mind:
How do I know whether the timestamp taken is the time the event occurred or the time it was written to the Kafka broker?
If the consumer does not emit watermarks and TimeCharacteristic.EventTime is set, does this mean that a message that is a few days late can still enter and be processed?
The main flow does not contain a window function and basically looks like the following: source (Kafka) -> filter -> processFunction -> sink. Does this mean that the event is fired at the moment it is consumed by the Kafka connector?
I currently use Kafka connector 0.10.0 with TimeCharacteristic.EventTime set, and I use a processFunction which, as expected, does some state cleanup every X minutes.
However, I'm seeing a strange situation where the OnTimerContext contains timestamps that start from 0 and grow up to the current timestamp when I start the Flink program. Is this a bug?
Thanks in advance to all helpers!
That depends on the configuration of the Kafka topic the producer is writing to: the message.timestamp.type property should be set to either CreateTime or LogAppendTime.
Your Flink application is responsible for creating watermarks; the Kafka consumer will take care of the timestamps, but not the watermarks (see the sketch after these answers for one way to assign them). It doesn't matter how late an event is, it will still enter your pipeline.
Yes.
It's not clear to me what part of this is strange.
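For the watermark point, one way to generate watermarks on top of the timestamps the consumer attaches is to assign a periodic watermark generator directly on the consumer; a rough sketch, where the broker, group, topic, and the 10-second out-of-orderness bound are assumptions:

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");   // assumed broker
props.setProperty("group.id", "my-group");                  // assumed group id

FlinkKafkaConsumer010<String> consumer =
        new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props);

consumer.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<String>() {
    private long maxTimestamp = Long.MIN_VALUE;

    @Override
    public long extractTimestamp(String element, long previousElementTimestamp) {
        // previousElementTimestamp is the Kafka record timestamp attached by the consumer
        maxTimestamp = Math.max(maxTimestamp, previousElementTimestamp);
        return previousElementTimestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // allow events to arrive up to 10 seconds out of order (assumed bound)
        return maxTimestamp == Long.MIN_VALUE
                ? new Watermark(Long.MIN_VALUE)
                : new Watermark(maxTimestamp - 10_000);
    }
});

Without something like this, event-time timers in your ProcessFunction only fire as watermarks advance, which may explain the unexpected OnTimerContext timestamps.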

Resources