Getting Flink cluster configuration at runtime - apache-flink

I'm interested in getting at runtime the number of TaskManagers and slots of a Flink cluster, before submitting jobs to it (I'd like to tune some program parameters based on the cluster ones).
Does anybody know which functions I should call to get these parameters?
Thanks!

The parameters are available through Flink's REST API.
Full API documentation: https://ci.apache.org/projects/flink/flink-docs-master/monitoring/rest_api.html
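As a minimal sketch of that approach: the REST API's /overview endpoint reports the number of TaskManagers and slots as JSON. The base URL below (http://localhost:8081) and the helper names are assumptions for illustration; the sample payload only mimics the shape of a response so the parsing can be tried without a running cluster.

```python
import json
from urllib.request import urlopen


def parse_overview(payload: dict) -> dict:
    """Pick the cluster-size fields out of a /overview response body."""
    return {
        "taskmanagers": payload["taskmanagers"],
        "slots_total": payload["slots-total"],
        "slots_available": payload["slots-available"],
    }


def fetch_cluster_size(base_url: str = "http://localhost:8081") -> dict:
    """Query the JobManager REST endpoint (base_url is an assumption)."""
    with urlopen(f"{base_url}/overview", timeout=10) as resp:
        return parse_overview(json.load(resp))


# Offline example: a hand-written payload shaped like a /overview response.
sample = {"taskmanagers": 3, "slots-total": 12, "slots-available": 8, "jobs-running": 1}
cluster = parse_overview(sample)

# One possible tuning rule (an assumption): use every free slot.
parallelism = cluster["slots_available"]
```

Against a live cluster, fetch_cluster_size() would be called before submitting the job, and the result used to set program parameters such as the parallelism.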

Related

Flink JobManager or TaskManager instances

I have a few questions about the Flink stream processing framework. Please let me know your comments on them.
1. Say I build a cluster with n nodes, of which m nodes are JobManagers (for HA). Are the remaining (n-m) nodes then the TaskManagers?
2. If each node has n cores, how can we control how many cores a TaskManager or JobManager uses?
3. If we add a new node as a TaskManager, does the JobManager automatically assign tasks to the newly added TaskManager?
4. Does Flink have a concept of partitions and data skew?
5. If Flink connects to Pulsar and needs to read data from a partitioned topic, what is the parallelism? (Is it equal to the number of partitions, or does it depend entirely on the number of task slots of the Flink TaskManagers?)
6. Does Flink have any built-in optimization of the job graph? (For example, my job graph has many filter, map, flatMap, etc. operators.) Can you suggest any docs/materials on Flink job optimization?
7. Is there an option to dedicate one core to Prometheus metrics scraping?
1. Yes.
2. You can configure the number of slots per TM: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#task-slots-and-resources However, each operator runs in its own thread and you have no control over which core it runs on, so you don't really have fine-grained control of how cores are used. Configuring resource groups also allows you to distribute operators across slots: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
3. Not for currently running jobs; you'd need to re-scale them. New jobs will use it, though.
4. Yes. https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/
5. It will depend on the Flink source parallelism.
6. It automatically optimizes the graph as it sees fit. You have some control through rescaling and chaining/splitting operators: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/ (towards the end). As a rule of thumb, I would start by deploying a full job per slot and then, once you properly understand where the bottlenecks are, try to optimize the graph. Most of the time it is not worth it due to increased serialization and shuffling of data.
7. You can export Prometheus metrics, but you cannot dedicate a core to it: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/#prometheus

Customize Flink - Prometheus metrics

I need to export custom metrics from Flink 1.10 to Prometheus. I have my custom metrics already created and working, but when I print out the metrics (in a terminal, for example), Flink emits a lot of metrics that I don't need, such as flink_taskmanager_job_task_Shuffle_Netty_Input_Buffers_inputQueueLength, and many more.
I'm only interested in sending my custom metrics from Flink to Prometheus, and removing the rest.
So, questions:
Is there any way to remove all the metrics exported by Flink and send just my custom metrics to Prometheus?
Is there any way to create static task_ids so that Prometheus doesn't accumulate a lot of information? I assume these IDs are not fixed, and that with every application change that requires a stop/start, Flink will create a new task_id.
I've been able to remove a few tags using:
"metrics.reporter.cep_reporter.scope.variables.excludes":"job_id;job_name;task_attempt_id;task_attempt_num;task_name;operator_id;operator_name;subtask_index;tm_id;host;Netty"
but that is not enough: there are more than 800 metrics that I don't need. The JVM metrics, for example: I'm using a separate node_exporter to scrape those, so I need to remove these metrics too.
Any help will be appreciated. Thanks a lot.
Disclaimer: I haven't tried this.
What I would try is to set a user scope on your custom Flink metrics, and then configure Prometheus to scrape only those metrics.
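A rough sketch of the filtering idea, not a tested setup: assuming the custom metrics are registered under a user scope (here the hypothetical group name my_app, which is then assumed to show up as a segment of the exported metric name), a name pattern can separate them from the built-in ones. On the Prometheus side the same pattern would typically go into a keep rule on the metric name; the Python below only demonstrates the pattern itself on sample names.

```python
import re

# Hypothetical user scope; assumed to appear as a segment of the metric name.
USER_GROUP = "my_app"

# Keep only names that contain the user-scope segment.
KEEP = re.compile(rf"flink_.*_{re.escape(USER_GROUP)}_.+")


def keep_custom(names):
    """Return only the metric names that belong to the user scope."""
    return [n for n in names if KEEP.fullmatch(n)]


sample_names = [
    "flink_taskmanager_job_task_Shuffle_Netty_Input_Buffers_inputQueueLength",
    "flink_taskmanager_Status_JVM_Memory_Heap_Used",
    "flink_taskmanager_job_task_operator_my_app_events_total",  # made-up custom metric
]
kept = keep_custom(sample_names)
```

Note that this kind of filter only controls what Prometheus stores; Flink would still expose the full set of metrics on the reporter endpoint.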

Flink: What does it mean to embed Flink in other programs?

What does it mean to embed Flink in other programs?
The second paragraph of this page, https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/api_concepts.html#basic-api-concepts, says that Flink can be embedded in other programs.
I would like to know more about this, such as how to achieve it. A sample program would be very helpful.
Using the above, is it possible to achieve the following?
Can we run Flink programs as individual actors?
Can we route data between two Flink programs?
Reason: I am asking these two questions because of the following requirement.
I have a set of Flink jobs/programs. Based on a config file, I want only certain Flink jobs/programs to process the input data, and this selection keeps changing with the config file. So the Flink jobs/programs (or the code in those jobs) need to be always available, and they need to pass data to each other and communicate.
Kindly share your insights.
Running Flink embedded in other programs refers to Flink's local execution mode. The local execution mode runs a Flink program in your JVM. This means that the job won't be executed in a distributed fashion.
What is currently not possible out of the box is to let Flink jobs control other Flink jobs. However, it is possible to build a Flink application which takes as input job descriptions and executes them. RBEA is an example of such a Flink application. The conceptual difference is that you don't have multiple Flink jobs but a single one which processes programs as input records.
Alternatively, you could take a look at Stateful Functions, which is a virtual actor framework built on top of Apache Flink. The idea is to provide a framework for building distributed stateful applications with strong consistency guarantees. With Stateful Functions, you would also build a single Flink application that processes events, which could represent a form of computation.

Flink latency metrics not being shown

While running Flink 1.5.0 with a local environment, I was trying to get latency metrics via REST (with something similar to http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/vertices/e70bbd798b564e0a50e10e343f1ac56b/metrics), but there isn't any reference to latency.
Latency tracking is enabled: I confirmed with the debugger that the LatencyMarksEmitter is emitting the marks.
What could I be doing wrong?
In 1.5 latency metrics aren't exposed for tasks but for jobs instead, the reasoning being that latency metrics inherently contain information about multiple tasks. You have to query "http://localhost:8081/jobs/e779dbbed0bfb25cd02348a2317dc8f1/metrics" instead.
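A small sketch of querying the job-level endpoint instead of the vertex-level one. It assumes the metrics listing comes back as a JSON array of objects with an "id" field; the helper names and the sample metric IDs are made up for illustration.

```python
import json
from urllib.request import urlopen


def job_metrics_url(base_url: str, job_id: str) -> str:
    """Build the job-level metrics URL (no /vertices/... part)."""
    return f"{base_url}/jobs/{job_id}/metrics"


def latency_metric_ids(listing: list) -> list:
    """Filter a metrics listing down to the IDs that mention latency."""
    return [m["id"] for m in listing if "latency" in m["id"].lower()]


def fetch_latency_metric_ids(base_url: str, job_id: str) -> list:
    """Fetch the job's metric listing and keep the latency entries."""
    with urlopen(job_metrics_url(base_url, job_id), timeout=10) as resp:
        return latency_metric_ids(json.load(resp))


# Offline example; these metric IDs are invented placeholders.
sample_listing = [{"id": "example.latency_p99"}, {"id": "example.other_metric"}]
url = job_metrics_url("http://localhost:8081", "e779dbbed0bfb25cd02348a2317dc8f1")
found = latency_metric_ids(sample_listing)
```

Once the latency metric IDs are known, their values can be requested from the same URL by passing them in the get query parameter.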

Debugging on the remote cluster

I have a program that works fine on a local cluster but does not run properly when executed on the remote cluster. What are the best and most common ways of debugging a program running on a remote Flink cluster?
Any help is appreciated!
There are several ways to debug a Flink application on a remote cluster.
Since using a real debugger is complicated, I would first try to log as much as possible to find the error.
Another approach that could be helpful is using Flink's accumulators. With them, you can gather some statistics. For example, when you have a filter, you can determine how many elements passed the filter, and so on.
The last resort is attaching a debugger to one of the Flink TaskManager JVMs.
Also check out my presentation on the topic: http://de.slideshare.net/robertmetzger1/apache-flink-hands-on

Resources