Flink: What does it mean to embed Flink in other programs?

What does it mean to embed Flink in other programs?
In the documentation here - https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/api_concepts.html#basic-api-concepts - the second paragraph says that Flink can be embedded in other programs.
I would like to know more about this, in particular how to achieve it. A sample program would be very helpful.
Using the above, is it possible to achieve the following?
Can we run Flink programs as individual actors?
Can we route data between two Flink programs?
Reason: I am asking the above two questions because my requirement is as follows.
I have a set of Flink jobs/programs. Based on a config file, I want only certain jobs/programs to process the input data, and this keeps changing as the config file changes. So the Flink jobs/programs (or the code in those jobs) need to always be available, and they need to pass data and communicate with each other.
Kindly share your insights.

Running Flink embedded in other programs refers to Flink's local execution mode. Local execution runs a Flink program inside your JVM, which means the job will not be executed in a distributed fashion.
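For illustration, a minimal sketch of local execution (the pipeline and names here are just examples):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EmbeddedFlinkJob {
    public static void main(String[] args) throws Exception {
        // createLocalEnvironment() starts an embedded mini-cluster inside the current JVM,
        // so the job runs "embedded" in whatever program executes this code.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

        env.fromElements(1, 2, 3, 4, 5)
           .map(i -> i * i)
           .print();

        // Blocks until the embedded job has finished.
        env.execute("embedded-flink-job");
    }
}
```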
What is currently not possible out of the box is to let Flink jobs control other Flink jobs. However, it is possible to build a Flink application which takes job descriptions as input and executes them. RBEA is an example of such a Flink application. The conceptual difference is that you don't have multiple Flink jobs but a single one which processes programs as input records.
Alternatively, you could take a look at Stateful Functions, which is a virtual-actor framework built on top of Apache Flink. The idea is to provide a framework for building distributed stateful applications with strong consistency guarantees. With Stateful Functions, you would also build a single Flink application which processes events that can represent a form of computation.
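As a very rough sketch, a stateful function in this model could look something like the following (this assumes the Stateful Functions Java SDK 3.x; the function name and message handling are illustrative only):

```java
import java.util.concurrent.CompletableFuture;

import org.apache.flink.statefun.sdk.java.Context;
import org.apache.flink.statefun.sdk.java.StatefulFunction;
import org.apache.flink.statefun.sdk.java.message.Message;

public class ComputeFn implements StatefulFunction {

    @Override
    public CompletableFuture<Void> apply(Context context, Message message) {
        // Each incoming message is an event that triggers one unit of processing logic.
        if (message.isUtf8String()) {
            String payload = message.asUtf8String();
            System.out.println("processing " + payload + " in " + context.self().id());
        }
        return context.done();
    }
}
```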

Related

Flink JobManager or TaskManager instances

I have a few questions about the Flink stream processing framework. Please let me know your comments on these questions.
1. Let's say I build a cluster with n nodes, out of which m nodes are job managers (for HA); are the remaining (n-m) nodes then the task managers?
2. If each node has n cores, how can we control how many cores are used by the task manager/job manager?
3. If we add a new node as a task manager, does the job manager automatically assign tasks to the newly added task manager?
4. Does Flink have a concept of partitions and data skew?
5. If Flink connects to Pulsar and needs to read data from a partitioned topic, what is the parallelism here? (Is the parallelism equal to the number of partitions, or does it depend entirely on the number of task slots of the Flink task managers?)
6. Does Flink have any built-in optimization of the job graph? (For example, my job graph has many filter, map, flatMap, etc. operators.) Can you suggest any docs/materials on Flink job optimization?
7. Is there an option to dedicate one core to Prometheus metrics scraping?
1. Yes.
2. Configuring the number of slots per TM: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#task-slots-and-resources. However, each operator runs in its own thread and you have no control over which core it runs on, so you don't really have fine-grained control over how cores are used. Configuring resource groups also allows you to distribute operators across slots: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups (see the sketch after this list).
3. Not for currently running jobs; you'd need to rescale them. New jobs will use it, though.
4. Yes. https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/
5. It will depend on the Flink source parallelism.
6. Flink automatically optimizes the graph as it sees fit. You have some control by rescaling and chaining/splitting operators: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/ (towards the end). As a rule of thumb, I would start by deploying a full job per slot and then, once you properly understand where the bottlenecks are, try to optimize the graph. Most of the time it is not worth it due to increased serialization and shuffling of data.
7. You can export Prometheus metrics, but you cannot dedicate a core to it: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/#prometheus
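To illustrate point 2, here is a small sketch (the operators themselves are placeholders) of the chaining and slot-sharing-group controls available on the DataStream API:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSharingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
           // startNewChain() breaks operator chaining before this map
           .map(String::toUpperCase).startNewChain()
           // slotSharingGroup() moves this operator (and, by default, downstream ones) into a separate slot group
           .filter(s -> !s.isEmpty()).slotSharingGroup("filters")
           .print();

        env.execute("slot-sharing-sketch");
    }
}
```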

How to fail the whole Flink application if one job fails?

There are two jobs running in Flink, shown in the image below. If one fails, I need to fail the whole Flink application. How can I do that? Suppose the job with parallelism 1 fails due to some exception; how can I then fail the job with parallelism 4?
The details of how you should go about this depend a bit on the type of infrastructure you are using to run Flink and on how you are submitting the jobs. But if you look at ClusterClient and JobClient and the associated classes, you should be able to find a way forward.
If you aren't already, you may want to take advantage of application mode, which was added in Flink 1.11. It makes it possible for a single main() method to launch multiple jobs, and Flink 1.11 also added env.executeAsync() for non-blocking job submission.
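As a rough sketch (assuming application mode and Flink 1.12+ for the JobClient call shown; the pipelines themselves are placeholders), a single main() could submit both jobs non-blockingly and cancel the second one when the first one fails:

```java
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FailTogether {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // First pipeline (e.g. the parallelism-1 job).
        env.fromElements("a", "b", "c").print();
        JobClient jobA = env.executeAsync("job-a");

        // Second pipeline (e.g. the parallelism-4 job); executeAsync() clears the previously
        // built pipeline, so the same environment can be reused.
        env.fromElements(1, 2, 3).map(i -> i * 2).setParallelism(4).print();
        JobClient jobB = env.executeAsync("job-b");

        try {
            // Wait for job A to reach a terminal state.
            jobA.getJobExecutionResult().get();
        } catch (Exception e) {
            // Job A failed, so cancel job B as well and let the application fail.
            jobB.cancel().get();
            throw e;
        }
    }
}
```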

Data/event exchange between jobs

Is it possible in Apache Flink to create an application which consists of multiple jobs that build a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business logic stage and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)?
If yes, where can I find documentation about this, and can it buffer data if one of the jobs is restarted?
If no, does anyone have experience with such a setup and can point me to a possible solution?
Thank you!
If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some max amount of data), but that's not a solution to a persistent difference in performance, if the upstream job is generating data faster than the downstream job can consume it.
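A hedged sketch of what that bridge could look like (this needs the flink-connector-kafka dependency; the topic name, addresses, and pipelines are assumptions):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaBridge {

    // Job A: publishes its output to the bridge topic.
    static void jobA(StreamExecutionEnvironment env) {
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("bridge-topic")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();
        env.fromElements("a", "b", "c").sinkTo(sink); // placeholder upstream pipeline
    }

    // Job B: subscribes to the bridge topic.
    static void jobB(StreamExecutionEnvironment env) {
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("bridge-topic")
                .setGroupId("job-b")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "bridge-source").print();
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        boolean producer = args.length > 0 && args[0].equals("producer");
        if (producer) {
            jobA(env);
        } else {
            jobB(env);
        }
        env.execute(producer ? "kafka-bridge-producer" : "kafka-bridge-consumer");
    }
}
```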
I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.
An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, Javascript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.
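A rough sketch of the broadcast-stream approach (all names, ports, and the "rule" format here are hypothetical):

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class DynamicRulesJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> events = env.socketTextStream("localhost", 9000);      // placeholder event source
        DataStream<String> ruleUpdates = env.socketTextStream("localhost", 9001); // placeholder rule source

        MapStateDescriptor<String, String> rulesDescriptor =
                new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

        BroadcastStream<String> broadcastRules = ruleUpdates.broadcast(rulesDescriptor);

        events.connect(broadcastRules)
              .process(new BroadcastProcessFunction<String, String, String>() {
                  @Override
                  public void processElement(String event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                      // Apply the latest broadcast rule (if any) to each event.
                      String rule = ctx.getBroadcastState(rulesDescriptor).get("active-rule");
                      out.collect(rule == null ? event : rule + ":" + event);
                  }

                  @Override
                  public void processBroadcastElement(String rule, Context ctx, Collector<String> out) throws Exception {
                      // Store/update the dynamic processing logic in broadcast state.
                      ctx.getBroadcastState(rulesDescriptor).put("active-rule", rule);
                  }
              })
              .print();

        env.execute("dynamic-rules-sketch");
    }
}
```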

How can I access state computed by an external Flink job? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), and then scoring the transactions with a machine learning model.
For now, the features are all kept in Flink state and the same job scores the enriched transactions. But I'd like to separate the feature computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit, as the job id is needed - but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is splitting this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
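For the external-scoring route, a hedged sketch of Flink's Async I/O (the scoring call here is a stand-in for a real async HTTP/gRPC client; input and scoring logic are placeholders):

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncScoringJob {

    // Placeholder for a call to an external model-serving endpoint.
    static CompletableFuture<Double> scoreRemotely(String enrichedTransaction) {
        return CompletableFuture.supplyAsync(() -> enrichedTransaction.length() * 0.1);
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> enriched = env.fromElements("tx-1", "tx-2", "tx-3"); // placeholder input

        DataStream<Double> scores = AsyncDataStream.unorderedWait(
                enriched,
                new RichAsyncFunction<String, Double>() {
                    @Override
                    public void asyncInvoke(String tx, ResultFuture<Double> resultFuture) {
                        scoreRemotely(tx).thenAccept(
                                score -> resultFuture.complete(Collections.singleton(score)));
                    }
                },
                1000, TimeUnit.MILLISECONDS); // per-request timeout

        scores.print();
        env.execute("async-scoring-sketch");
    }
}
```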
And if you do want to use queryable state, you could use Flink's REST api to determine the job id.
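If you go that way, the job id can be looked up via the REST endpoint /jobs/overview; for example (host and port are assumptions, and the JSON parsing is left out):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JobIdLookup {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/jobs/overview")) // JobManager REST endpoint
                .GET()
                .build();

        // The response is JSON listing each job's "jid" and "name".
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```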
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

Dynamic number of jobs in Apache Flink - dealing with task slots

I'm in the process of evaluating Apache Flink for a potential use case and I'm struggling with how I should model my computations in Flink itself.
In my case I will have many (unknown upfront) small, heterogeneous processing graphs, each of which could use parts of the standard Flink DataStream API to process data from external sensors. None of these graphs will be computationally expensive. My first thought was to make each of these small graphs a separate job and deploy it to the Flink cluster. The problem is that, because task slots are not shared between subtasks from different jobs, I'm facing a situation where I would need to create task managers with a very high number of task slots, while everywhere I read it is recommended to have the number of slots equal to the number of cores in the system.
I've found an article about using Flink when the nature of the job is somewhat dynamic (https://techblog.king.com/rbea-scalable-real-time-analytics-king/), but implementing some kind of custom DSL on top of Flink, when most of the graphs that will be created can easily be expressed using the DataStream API, does not look like an elegant solution to me.
Is Flink simply not designed to handle a dynamic number of jobs defined using the DataStream API, and would the Flink way be to model my use case as a single meta-job that is generic enough to express all potential computations?
