Pipelined vs blocking data exchange in a Flink job - apache-flink

I've been reading about pipelined region scheduling in Flink and am a bit confused about what it means. My understanding is that a streaming job is always pipelined, whereas a batch job can produce intermediate results that are blocking. This makes sense: in a batch job, an operator can load the entire data set into memory and process all of it to produce a result, which only then goes into the next operator for further processing.
The blog post then describes a topology that consists of 4 pipelined regions and has both pipelined and blocking data exchanges. My question is: how would one go about creating a Flink job that handles both pipelined and blocking data exchanges? A simple code example that showcases this capability would be very much appreciated.
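To make the distinction concrete, here is a minimal sketch of the two exchange types. It uses plain java.util.stream as an analogy and is not Flink code; the class name ExchangeDemo and the "produce"/"consume" labels are made up for illustration. In the pipelined case each record reaches the consumer before the next one is produced. In the blocking case, sorted() has to buffer the whole input first, which is roughly what a blocking intermediate result does between two regions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Analogy only: java.util.stream, not the Flink API.
class ExchangeDemo {

    // Pipelined: each record flows through both stages before the next one is read.
    static List<String> pipelined() {
        List<String> log = new ArrayList<>();
        Stream.of(3, 1, 2)
              .peek(x -> log.add("produce " + x))
              .forEach(x -> log.add("consume " + x));
        return log;
    }

    // Blocking: sorted() must see the whole input before it can emit anything,
    // so the consumer starts only after the producer has finished.
    static List<String> blocking() {
        List<String> log = new ArrayList<>();
        Stream.of(3, 1, 2)
              .peek(x -> log.add("produce " + x))
              .sorted()
              .forEach(x -> log.add("consume " + x));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(pipelined());
        System.out.println(blocking());
    }
}
```

In the first log, produce and consume entries alternate; in the second, all produce entries come before the first consume entry.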

Related

Is it possible to achieve Exactly Once Semantics using a BASE-fashioned database?

In stream processing applications (e.g. based on Apache Flink or Apache Spark Streaming) it is sometimes necessary to process data exactly once.
In the database world, something similar can be achieved by using databases that follow the ACID criteria (correct me if I'm wrong here).
However, there are a lot of (non-relational) databases that do not follow ACID but BASE.
Now my question is: if I integrate such a BASE database into a stream processing application (exactly once), can I still guarantee exactly-once processing for the whole pipeline? And if this is possible, under what circumstances?
Exactly-once semantics means that the processing framework, such as Flink, can guarantee that each incoming record (event) is processed exactly one time, even if the pipeline fails in any way.
This is done by having checkpoints after each operation in the pipeline, so that when the application recovers from a failure, operations that already succeeded are not executed again.
It depends on what kind of operations you are trying to do with the database. In most cases, databases are used as sinks that processing results are written into. In that case, the operation involving the database is just a simple insert, and it will not be executed again after one successful run, so the pipeline is still exactly-once regardless of the database's ACID support.
You might be tempted to group operations together for databases that support ACID, but that is bad practice in a parallel streaming pipeline, since it creates multiple transactions and the locks might block the whole process. Instead, a BASE (NoSQL) database with fast, intensive read and update performance is preferable. You just need to make your operations idempotent, so that partially re-executed statements (if they fail halfway through, they might all be executed again after recovery) won't result in incorrect data.
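A minimal sketch of what "idempotent" means here, assuming every record carries a unique, deterministic id. A HashMap stands in for the key-value store, and the class name, record ids and amounts are made up for illustration. The upsert keyed by id ends up in the same state after a replay, while a blind increment double-counts.

```java
import java.util.HashMap;
import java.util.Map;

class IdempotentSinkDemo {
    static final String[] IDS = {"order-1", "order-2", "order-3"};
    static final long[] AMOUNTS = {10L, 20L, 30L};

    // Idempotent: upsert keyed by record id. Replayed writes overwrite identical rows.
    static Map<String, Long> upsertWithReplay() {
        Map<String, Long> store = new HashMap<>();
        // First attempt: the job fails after writing two of the three records.
        for (int i = 0; i < 2; i++) store.put(IDS[i], AMOUNTS[i]);
        // After recovery, everything since the last checkpoint is written again.
        for (int i = 0; i < 3; i++) store.put(IDS[i], AMOUNTS[i]);
        return store;
    }

    // Not idempotent: a running total counts the replayed records twice.
    static long incrementWithReplay() {
        long total = 0;
        for (int i = 0; i < 2; i++) total += AMOUNTS[i];
        for (int i = 0; i < 3; i++) total += AMOUNTS[i];
        return total;
    }

    public static void main(String[] args) {
        System.out.println(upsertWithReplay());    // three rows, one per order
        System.out.println(incrementWithReplay()); // 90, although the true sum is 60
    }
}
```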

Increasing Parallelism in Flink decreases/splits the overall throughput

My problem is very similar to this, except that backpressure in my application is reported as "OK".
I thought the problem was that my local machine was not powerful enough, so I created a 72-core Windows machine, where I am reading data from Kafka, processing it in Flink and then writing the output back to Kafka. I have checked that writing to the Kafka sink is not causing any issues.
All I am looking for are the areas that may cause throughput to be split among task slots when parallelism is increased.
Flink Version: 1.7.2
Scala version: 2.12.8
Kafka version: 2.11-2.2.1
Java version: 1.8.231
Working of application: Data is coming from Kafka (1 partition) which is deserialized by Flink (throughput here is 5k/sec). Then the deserialized message is passed through basic schema validation (Throughput here is 2k/sec).
Even after increasing the parallelism to 2, throughput at Level 1 (the deserializing stage) remains the same and doesn't double as expected.
I understand, without the code, it is difficult to debug so I am asking for the points which you can suggest for this problem, so that I can go back to my code and try that.
We are using 1 Kafka partition for our input topic.
If you want to process data in parallel, you actually need to read data in parallel.
There are certain requirements to read data in parallel. The most important one is that the source is actually able to split the data into smaller work chunks. For example, if you read from a file system, you have multiple files, or the system subdivides the files into splits. For Kafka, this necessarily means that you have to have more partitions. Ideally, you have at least as many partitions as your maximum consumer parallelism.
The 5k/s seems to be the maximum throughput that you can achieve on one partition. You can also calculate the number of partitions by the maximum throughput you want to achieve. If you need to achieve 50k/s, you need at least 10 partitions. You should use more to also catch up in case of reprocessing or failure recovery.
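The arithmetic above can be written down as a small helper. This is only a sketch of the back-of-the-envelope calculation; the class and method names are made up, and the result is a lower bound that leaves no headroom for catch-up.

```java
class PartitionSizing {
    // Minimum number of partitions so that targetPerSec can be reached when one
    // partition sustains at most perPartitionPerSec (rounded up).
    static int requiredPartitions(long targetPerSec, long perPartitionPerSec) {
        if (targetPerSec <= 0 || perPartitionPerSec <= 0) {
            throw new IllegalArgumentException("rates must be positive");
        }
        return (int) ((targetPerSec + perPartitionPerSec - 1) / perPartitionPerSec);
    }

    public static void main(String[] args) {
        // 50k/s target at 5k/s per partition
        System.out.println(requiredPartitions(50_000, 5_000));
    }
}
```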
Another way to distribute the work is to add a manual shuffle step. That means, if you keep the single input partition, you would still only reach 5k/s, but after that the work is actually redistributed and processed in parallel, such that you will not see a huge decline in your throughput afterwards. After a shuffle operation, work is somewhat evenly distributed among the parallel downstream tasks.
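To illustrate what such a shuffle does to the load, here is a plain-Java sketch of round-robin redistribution from a single reader to several downstream tasks. It is not Flink code and the names are made up; in Flink's DataStream API, rebalance() is, as far as I know, the call that performs this kind of round-robin redistribution.

```java
class RoundRobinShuffle {
    // Counts how many of `records` each of `parallelism` downstream tasks receives
    // when a single upstream reader hands records out in round-robin order.
    static int[] distribute(int records, int parallelism) {
        int[] perTask = new int[parallelism];
        for (int i = 0; i < records; i++) {
            perTask[i % parallelism]++;
        }
        return perTask;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(distribute(10, 4)));
    }
}
```

The single reader is still the bottleneck for the source stage, but every stage after the shuffle gets a roughly equal share of the records.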

Data/event exchange between jobs

Is it possible in Apache Flink to create an application that consists of multiple jobs which together build a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business logic and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)?
If yes, where can I find documentation about this and can it buffer data if one of the jobs is restarted?
If no, does anyone have experience with such a setup and point me to a possible solution?
Thank you!
If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some max amount of data), but that's not a solution to a persistent difference in performance, if the upstream job is generating data faster than the downstream job can consume it.
I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.
An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, Javascript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.

How to parallel write to sinks in Apache Flink

I have a map DataStream with a parallelism of 8. I add two sinks to the DataStream. One is slow (Elasticsearch), the other one is fast (HDFS). However, my events are only written to HDFS after they have been flushed to ES, so it takes an order of magnitude longer with ES than it takes without ES.
dataStream.setParallelism(8);
dataStream.addSink(elasticsearchSink);
dataStream.addSink(hdfsSink);
It appears to me that both sinks use the same thread. Is this possible using the same source with two sinks, or do I have to add another job, one for each sink, to write the output in parallel?
I checked in the logs that Map(1/8) to Map(8/8) are getting deployed and receive data.
If the Elasticsearch sink cannot keep up with the speed at which its input is produced, it slows down its input operator(s). This concept is called backpressure, which means that a slow consumer blocks a fast producer from processing.
The only way to make your program behave as you expect (HDFS sink writing faster than Elasticsearch sink) is to buffer all records that the HDFS sink wrote but the Elasticsearch sink hasn't written yet. If the Elasticsearch sink is consistently slower you will run out of memory / disk space at some point in time.
Flink's approach to solve issues with slow consumers is backpressure.
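The effect can be sketched with a bounded buffer between a producer and a consumer. This is an analogy in plain Java, not how Flink's network stack is implemented, and the class and method names are made up. Once the buffer is full, the producer cannot hand over more records until the consumer has taken some.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BackpressureDemo {
    // How many of `produced` records fit into the buffer while nothing is consumed.
    static int acceptedBeforeFull(int bufferCapacity, int produced) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(bufferCapacity);
        int accepted = 0;
        for (int i = 0; i < produced; i++) {
            // offer() returns false when the buffer is full;
            // a real producer thread would block in put() at this point.
            if (buffer.offer(i)) accepted++;
        }
        return accepted;
    }

    // The producer can continue as soon as the consumer frees one slot.
    static boolean resumesAfterDrain(int bufferCapacity) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(bufferCapacity);
        for (int i = 0; i < bufferCapacity; i++) buffer.offer(i);
        boolean rejectedWhenFull = !buffer.offer(-1);
        buffer.poll(); // the slow consumer finally takes one record
        boolean acceptedAfterPoll = buffer.offer(-1);
        return rejectedWhenFull && acceptedAfterPoll;
    }

    public static void main(String[] args) {
        System.out.println(acceptedBeforeFull(2, 5));
        System.out.println(resumesAfterDrain(2));
    }
}
```

With two sinks behind the same map operator, the slower sink plays the role of the consumer here, so the shared producer (and with it the faster sink) is throttled to the slower sink's pace.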
I see two ways to fix this issue:
1. Increase the parallelism of the ElasticsearchSink. This might help or not, depending on the capabilities of your Elasticsearch setup.
2. Run both jobs as independent pipelines. In this case you'll have to compute all results twice.

Flink: lazy operations processing

The execution of Flink programs has to be triggered, e.g. with execute(). Otherwise Flink only creates a new execution plan, right? My question is: which components of Flink are activated when processing a lazy operation without triggering the execution?
According to the documentation, there is an optimizer responsible for building a dataflow graph. Are there more processes involved?
And is there a way to find out the id of the optimizer process in order to monitor it?
Flink DataSet programs are optimized when the execution is triggered. Before, the program is only constructed by appending operators and data sinks to other operators and data sources.
The optimization happens within the client process before the program is submitted to the JobManager process. That means, there is no dedicated optimizer process that could be monitored.
The program translation is done in multiple steps:
1. Program construction using the DataSet API
2. Translation into the generic API
3. Program optimization
4. JobGraph generation
The JobGraph is the data flow representation that is scheduled by the JobManager for execution.
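The "construct first, run on execute()" behaviour can be illustrated with a toy class. This is a sketch in plain Java, not the DataSet API; LazyPlan and its methods are made up for illustration. Calling map() only appends an operator to a list, and no user function runs until execute() is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class LazyPlan {
    private final List<Integer> source;
    private final List<Function<Integer, Integer>> operators = new ArrayList<>();
    int invocations = 0; // counts how often any user function was actually called

    LazyPlan(List<Integer> source) {
        this.source = source;
    }

    // Program construction: only appends an operator to the plan, touches no data.
    LazyPlan map(Function<Integer, Integer> f) {
        operators.add(x -> {
            invocations++;
            return f.apply(x);
        });
        return this;
    }

    // Execution is triggered here: every record runs through all operators in order.
    List<Integer> execute() {
        List<Integer> out = new ArrayList<>();
        for (Integer record : source) {
            Integer value = record;
            for (Function<Integer, Integer> op : operators) {
                value = op.apply(value);
            }
            out.add(value);
        }
        return out;
    }
}
```

For example, new LazyPlan(List.of(1, 2, 3)).map(x -> x + 1).map(x -> x * 2) leaves invocations at 0 until execute() is called on it.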
