Flink: lazy operations processing - apache-flink

The execution of Flink programs has to be triggered, e.g. with execute(). Otherwise Flink only creates a new execution plan, right? My question is: which components of Flink are activated when processing a lazy operation without triggering the execution?
According to the documentation, there is an optimizer responsible for building a dataflow graph. Are there more processes involved?
And is there a way to find out the id of the optimizer process in order to monitor it?

Flink DataSet programs are optimized when the execution is triggered. Before, the program is only constructed by appending operators and data sinks to other operators and data sources.
The optimization happens within the client process before the program is submitted to the JobManager process. That means there is no dedicated optimizer process that could be monitored.
The program translation is done in multiple steps:
1. Program construction using the DataSet API
2. Translation into the generic API
3. Program optimization
4. JobGraph generation
The JobGraph is the data flow representation that is scheduled by the JobManager for execution.
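To make the lazy evaluation concrete, here is a minimal DataSet sketch (the output path and element values are illustrative, not from the question). Nothing runs while the program is being constructed; the plan can be inspected with getExecutionPlan(), and only execute() triggers translation, optimization, JobGraph generation, and submission to the JobManager.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class LazyExecutionExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Step 1: program construction -- nothing is executed here, the client
        // only appends operators and a sink to the plan.
        DataSet<String> words = env.fromElements("lazy", "evaluation", "in", "flink");
        words.filter(w -> w.length() > 2)
             .writeAsText("/tmp/flink-lazy-out");

        // Optional: inspect the plan as JSON without running anything.
        System.out.println(env.getExecutionPlan());

        // Steps 2-4 (translation, optimization, JobGraph generation) happen in
        // the client only when execute() is called; the resulting JobGraph is
        // then submitted to the JobManager.
        env.execute("lazy execution example");
    }
}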

Related

Pipelined vs blocking data exchange in a Flink job

I've been reading about pipelined region scheduling in Flink and am a bit confused about what pipelined regions mean. My understanding is that a streaming job is always fully pipelined, whereas a batch job can produce intermediate results that are blocking. This makes sense, since in a batch job an operator can load the entire datastream and process all of it before producing a result that only then goes to the next operator for further processing.
The blog post then describes a topology that consists of 4 different regions and has both pipelined and blocking data exchanges in the same topology. My question is: how would one go about creating such a job in Flink, where it handles both pipelined and blocking data exchanges? A simple code example showcasing this capability would be very much appreciated.
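For reference, a minimal sketch of what such a job could look like, assuming the DataStream API in batch execution mode (available since Flink 1.12): in BATCH mode the exchange at the keyBy boundary is blocking by default, while records still flow in a pipelined fashion through the chained source/map operators, so the topology splits into several pipelined regions. In STREAMING mode the same program would be a single pipelined region.

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchBlockingExchangeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // BATCH mode: exchanges at shuffle boundaries (keyBy) become blocking,
        // splitting the topology into multiple pipelined regions.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("a", "b", "a", "c", "b", "a")
           // pipelined: chained with the source
           .map(word -> Tuple2.of(word, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           // blocking exchange in BATCH mode: the keyed aggregation only starts
           // after the upstream region has produced all of its data
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("batch blocking exchange example");
    }
}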

Flink: What does it mean to embed Flink in other programs?

What does it mean to embed Flink in other programs?
In the link here - https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/api_concepts.html#basic-api-concepts - the second paragraph says that Flink can be embedded in other programs.
I would like to know more about this. Like how to achieve it. A sample program would be very helpful.
Using the above is it possible to achieve the following?
Can we run flink programs as individual Actors?
Can we route data between two flink programs?
Reason: I am asking the above two questions because my requirement is as follows.
I have a set of Flink jobs/programs, and based on a config file I want only certain jobs/programs to process the input data; this keeps changing with the config file. So the Flink jobs/programs (or the code in those jobs) need to be always available, and they need to pass data and communicate.
Kindly share your insights.
Running Flink embedded in other programs refers to Flink's local execution mode. The local execution mode runs a Flink program in your JVM. This entails that the job won't be executed in a distributed fashion.
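As a minimal sketch of this embedded/local mode (element values are illustrative): createLocalEnvironment() spins up a mini cluster inside the current JVM, so the pipeline runs inside the surrounding program instead of being submitted to a remote cluster.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EmbeddedFlinkExample {
    public static void main(String[] args) throws Exception {
        // Starts a mini cluster inside this JVM; nothing is submitted
        // to an external JobManager.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment();

        env.fromElements(1, 2, 3, 4, 5)
           .filter(i -> i % 2 == 1)   // keep odd numbers
           .print();

        env.execute("embedded local job");
    }
}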
What is currently not possible out of the box is to let Flink jobs control other Flink jobs. However, it is possible to build a Flink application which takes as input job descriptions and executes them. RBEA is an example of such a Flink application. The conceptual difference is that you don't have multiple Flink jobs but a single one which processes programs as input records.
Alternatively, you could take a look at Stateful Functions, which is a virtual actor framework built on top of Apache Flink. The idea is to provide a framework for building distributed stateful applications with strong consistency guarantees. With Stateful Functions, you would also build a single Flink application which processes events that could represent a form of computation.

Data/event exchange between jobs

Is it possible in Apache Flink to create an application which consists of multiple jobs that build a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business logic and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)?
If yes, where can I find documentation about this and can it buffer data if one of the jobs is restarted?
If not, does anyone have experience with such a setup and can you point me to a possible solution?
Thank you!
If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some max amount of data), but that's not a solution to a persistent difference in performance, if the upstream job is generating data faster than the downstream job can consume it.
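A sketch of that Kafka bridge, assuming the KafkaSource/KafkaSink connector API (topic name, bootstrap servers, and group id are placeholders): job A publishes its results to a topic and job B subscribes to it; each job is submitted and checkpointed independently.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaBridge {

    // Upstream job A: publishes its results to a topic.
    static void runUpstream() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("preprocessed-events")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromElements("event-1", "event-2", "event-3")
           .sinkTo(sink);
        env.execute("job A: preprocessing");
    }

    // Downstream job B: subscribes to the same topic.
    static void runDownstream() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("preprocessed-events")
                .setGroupId("business-logic-job")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "from job A")
           .print();
        env.execute("job B: business logic");
    }
}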
I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.
An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, Javascript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.
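And a minimal sketch of the broadcast-stream variant (the ports and the string-encoded "rules" are placeholder assumptions): the control stream is broadcast to all parallel instances of the processing operator, which stores the latest rule in broadcast state and applies it to every event, so the logic can be updated while the job keeps running.

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class DynamicLogicExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Main events plus a control stream carrying "rules" (here just strings).
        DataStream<String> events = env.socketTextStream("localhost", 9999);
        DataStream<String> ruleUpdates = env.socketTextStream("localhost", 9998);

        MapStateDescriptor<String, String> rulesDescriptor =
                new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

        BroadcastStream<String> broadcastRules = ruleUpdates.broadcast(rulesDescriptor);

        events.connect(broadcastRules)
              .process(new BroadcastProcessFunction<String, String, String>() {

                  @Override
                  public void processElement(String event, ReadOnlyContext ctx,
                                             Collector<String> out) throws Exception {
                      // Apply whatever logic is currently stored in broadcast state.
                      String rule = ctx.getBroadcastState(rulesDescriptor).get("active");
                      out.collect(rule == null ? event : rule + ": " + event);
                  }

                  @Override
                  public void processBroadcastElement(String rule, Context ctx,
                                                      Collector<String> out) throws Exception {
                      // Update the dynamic logic without restarting the job.
                      ctx.getBroadcastState(rulesDescriptor).put("active", rule);
                  }
              })
              .print();

        env.execute("dynamic processing logic via broadcast state");
    }
}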

Difference between job, task and subtask in flink

I'm new to Flink and am trying to understand:
job
task
subtask
I searched in the docs but still did not get it. What's the main difference between them?
Tasks and sub-tasks are explained here -- https://ci.apache.org/projects/flink/flink-docs-release-1.7/concepts/runtime.html#tasks-and-operator-chains:
A task is an abstraction representing a chain of operators that could be executed in a single thread. Something like a keyBy (which causes a network shuffle to partition the stream by some key) or a change in the parallelism of the pipeline will break the chaining and force operators into separate tasks. In the diagram above, the application has three tasks.
A subtask is one parallel slice of a task. This is the schedulable, runnable unit of execution. In the diagram above, the application is to be run with a parallelism of two for the source/map and keyBy/Window/apply tasks, and a parallelism of one for the sink -- resulting in a total of 5 subtasks.
A job is a running instance of an application. Clients submit jobs to the jobmanager, which slices them into subtasks and schedules those subtasks for execution by the taskmanagers.
Update:
The community decided to re-align the definitions of task and sub-task to match how these terms are used in the code -- which means that task and sub-task now mean the same thing: exactly one parallel instance of an operator or operator chain. See the Glossary for more details.
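To relate these terms to code, here is a sketch that reproduces a topology like the one described above (values are illustrative): with a default parallelism of 2, the source and map chain into one task, the keyBy forces a network shuffle and therefore a second task, and the sink runs as a third task with parallelism 1 -- 2 + 2 + 1 = 5 subtasks in total.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TasksAndSubtasksExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2); // default parallelism: 2 subtasks per task

        env.fromSequence(1, 1000)                        // source, chained with map (task 1)
           .map(n -> Tuple2.of(n % 10, 1L))
           .returns(Types.TUPLE(Types.LONG, Types.LONG))
           // keyBy forces a network shuffle and therefore a separate task (task 2)
           .keyBy(t -> t.f0)
           .countWindow(100)
           .sum(1)
           // the sink runs with parallelism 1, again breaking the chain (task 3)
           .print().setParallelism(1);

        env.execute("tasks and subtasks example");
    }
}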

How to parallel write to sinks in Apache Flink

I have a map DataStream with a parallelism of 8. I add two sinks to the DataStream. One is slow (Elasticsearch), the other one is fast (HDFS). However, my events are only written to HDFS after they have been flushed to ES, so it takes an order of magnitude longer with ES than without it.
dataStream.setParallelism(8);
dataStream.addSink(elasticsearchSink);
dataStream.addSink(hdfsSink);
It appears to me that both sinks use the same thread. Is it possible to use the same source with two sinks, or do I have to add another job, one for each sink, to write the output in parallel?
I checked in the logs that Map(1/8) to Map(8/8) are getting deployed and receive data.
If the Elasticsearch sink cannot keep up with the speed at which its input is produced, it slows down its input operator(s). This concept is called backpressure, which means that a slow consumer blocks a fast producer from processing.
The only way to make your program behave as you expect (HDFS sink writing faster than Elasticsearch sink) is to buffer all records that the HDFS sink wrote but the Elasticsearch sink hasn't written yet. If the Elasticsearch sink is consistently slower you will run out of memory / disk space at some point in time.
Flink's approach to solve issues with slow consumers is backpressure.
I see two ways to fix this issue:
increase the parallelism of the ElasticsearchSink. This might help or not, depending on the capabilities of your Elasticsearch setup.
run two independent jobs, one for each sink. In this case you'll have to compute all results twice.
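For the first option, a sketch of how the sink parallelism could be raised independently of the rest of the pipeline (the value 16 is hypothetical and the dataStream/elasticsearchSink/hdfsSink names are reused from the question):

// Give the slow sink its own, higher parallelism so more subtasks write to
// Elasticsearch concurrently; the HDFS sink keeps the pipeline's parallelism.
dataStream.addSink(elasticsearchSink)
          .setParallelism(16)          // hypothetical value; tune to your ES setup
          .name("elasticsearch-sink");

dataStream.addSink(hdfsSink)
          .setParallelism(8)
          .name("hdfs-sink");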
