Flink: Unable to collect Task Metrics via JMX

I have been able to run JMX with Flink with the following configuration applied to the flink-conf.yaml file of all nodes in the cluster:
metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 9020-9022
env.java.opts: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
When I run JConsole and connect to master-IP:9999 or slave-IP:9020, I can see system metrics like CPU and memory usage.
How can I access the task metrics and their respective graphs, like bytesRead and latency, which are collected for each subtask and shown in the GUI?

You can go to the MBeans tab in JConsole; there you will see various dropdowns on the right-hand side, named after the job and its tasks. Let me know if you have any issues.
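If you'd rather script this than click through JConsole, the standard javax.management remote API can be pointed at the reporter's port. A minimal sketch; the host, port, service-URL path, and the org.apache.flink* domain pattern are assumptions based on the configuration above and on how Flink's JMXReporter names its MBeans:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import java.util.Set;

public class FlinkJmxBrowser {
    public static void main(String[] args) throws Exception {
        // Hypothetical host/port: one of the TaskManager reporter ports (9020-9022 above).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://slave-IP:9020/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Flink registers task/operator metrics (numBytesIn, latency, ...) under
            // domains starting with org.apache.flink; list everything that matches.
            Set<ObjectName> names =
                    conn.queryNames(new ObjectName("org.apache.flink*:*"), null);
            names.forEach(System.out::println);
        }
    }
}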

Related

Prometheus doesn't have metrics from the taskmanager if a Flink job is started

I operate Flink 1.15.2 on Kubernetes and set the metrics configuration for the Flink cluster as below:
# metrics
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
The problem is that Prometheus doesn't get metrics from the taskmanager once the Flink job has started.
If I stop the job, then I can see the metrics, although some of them are empty.
I tried to reduce CPU usage, but still no metrics from the taskmanager.
I tried to increase the number of task slots, still no metrics.
It happens on both Intel and ARM nodes.
I tried changing the Flink config as below; metrics were collected for a moment (several seconds) and then disappeared again:
# metrics
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
I tried disabling the Kafka connector metric registration as below, but still no metrics:
kafkaSourceBuilder.setProperty("register.consumer.metrics", "false");
var producerProperties = new Properties();
producerProperties.setProperty("register.producer.metrics", "false");
producerSinkBuilder.setKafkaProducerConfig(producerProperties);
If I start the job on Flink 1.15.3, metrics are collected.
If I start the job on Flink 1.16.0, Prometheus doesn't get any metrics from Flink at all.
As mentioned in the release notes of Flink 1.16, configuring reporters by their class has been deprecated. See https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.16/ and FLINK-27206 (https://issues.apache.org/jira/browse/FLINK-27206) for details.
There are also some known issues with metrics reporting in 1.16.0; please upgrade to Flink 1.16.1.
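For reference, the factory-based reporter configuration in flink-conf.yaml would look roughly like this (a sketch; the port is illustrative, 9249 being the Prometheus reporter's default):

metrics.reporters: prom
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249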

Specify slot sharing group for a specific task manager in Flink 1.14.0

I am trying the fine-grained resource management feature in Flink 1.14, hoping it can enable assigning certain operators to certain TaskManagers. Following the sample code in https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/finegrained_resource/, I can now define the slot sharing groups I would like (using the setExternalResource method), but I do not see any option to "assign" a TaskManager worker instance the capabilities of this "external resource".
So, to the question: following the GPU-based example in the documentation linked above, how can I ensure that Flink "knows" which task manager actually has the required GPU?
With help from the excellent Flink mailing list, I now have the solution. Basically, add lines to flink-conf.yaml for the specific task manager, as per the external resource documentation. For a resource entitled 'example', these are the two lines that must be added:
external-resources: example
external-resource.example.amount: 1
These will match a slot sharing group with the added external resource:
.setExternalResource("example", 1.0)
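Putting the two halves together, a job might declare the slot sharing group like this. A minimal sketch, assuming fine-grained resource management is enabled on the cluster (cluster.fine-grained-resource-management.enabled: true); the group name, resource amounts, and pipeline are illustrative:

import org.apache.flink.api.common.operators.SlotSharingGroup;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExternalResourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Request one unit of the external resource "example"; only slots on a
        // task manager that declares it in flink-conf.yaml (the two lines above)
        // can satisfy this profile.
        SlotSharingGroup ssg = SlotSharingGroup.newBuilder("example-group")
                .setCpuCores(1.0)
                .setTaskHeapMemoryMB(256)
                .setExternalResource("example", 1.0)
                .build();

        env.fromElements(1, 2, 3)
                .map(i -> i * 2)
                .returns(Types.INT)
                .slotSharingGroup(ssg) // pin this operator to the group
                .print();

        env.execute("external-resource-example");
    }
}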

Is it possible to add a new embedded worker while the cluster is running on Statefun?

Here is the deal:
I'm trying to add a new (embedded) worker to a running cluster (Flink Statefun 2.2.1).
As you can see, the new task manager does get registered with the cluster:
Screenshot of newly deployed taskmanager
But it doesn't initialize (it doesn't deploy the sources).
What am I missing here? (Do the master and workers have to have the same jar files, or should deploying the taskmanager with the jar file be enough?)
Any help would be appreciated,
Thx.
Flink supports two different approaches to rescaling: active and reactive.
Reactive mode is new in Flink 1.13 (released just this week), and works as you expected: add (or remove) a task manager, and your application will adjust to the new parallelism. You can read about elastic scaling and reactive mode in the docs.
Reactive mode is currently a work in progress, but might meet your needs.
In broad strokes, for active mode rescaling you need to:
Do a stop with savepoint to bring down your current job while taking a snapshot of its state.
Relaunch with the new parallelism, using the savepoint as the starting point.
The exact details depend on how your cluster is deployed.
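As a rough sketch, those two steps might look like this on the CLI (the job id, savepoint path, parallelism, and jar name are illustrative):

$ flink stop --savepointPath /tmp/savepoints <jobId>
$ flink run -s /tmp/savepoints/savepoint-xxxx -p 8 my-statefun-job.jar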
For a step-by-step tutorial, see Upgrading & Rescaling a Job in the Flink Operations Playground.
The above applies to rescaling statefun embedded functions. Being stateless, remote functions can be rescaled more straightforwardly.

Flink - multiple job managers in jobmanager.rpc.address

I'm trying to configure Flink with two job managers for HA. Should I specify both of them in flink-conf.yaml / jobmanager.rpc.address ?
If yes how?
You don't need to. In HA mode, the rpc.address is chosen automatically by default. Have a look at the docs.
By default, the job manager will pick a random port for inter-process communication. You can change this via the high-availability.jobmanager.port key. This key accepts single ports (e.g. 50010), ranges (50000-50025), or a combination of both (50010,50011,50020-50025,50050-50075).
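For context, a ZooKeeper-based HA section of flink-conf.yaml might look roughly like this (a sketch; the quorum hosts and storage path are illustrative, and the leader's rpc address is resolved via leader election rather than listed per job manager):

high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/
high-availability.jobmanager.port: 50000-50025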

Why "Configuration" section of running job is empty?

Can anybody explain to me why the "Configuration" section of a running job in the Apache Flink Dashboard is empty?
How can I use this job configuration in my flow? It seems this is not described in the documentation.
The configuration tab of a running job shows the values of the ExecutionConfig. Depending on the version of Flink, you will experience different behaviour.
Flink <= 1.0
The ExecutionConfig is only accessible for finished jobs. For running jobs, it is not possible to access it. Once the job has finished or has been stopped/cancelled, you should be able to see the ExecutionConfig.
Flink > 1.0
The ExecutionConfig can also be accessed for running jobs.
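To populate that tab with your own values, you can register global job parameters on the ExecutionConfig before submitting the job. A minimal sketch (Flink > 1.0); the parameters and pipeline are illustrative:

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ConfiguredJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Parse e.g. --input /data/in and register the parameters as global job
        // parameters; they then appear in the dashboard's Configuration tab and
        // can be read back in operators via the runtime context's ExecutionConfig.
        ParameterTool params = ParameterTool.fromArgs(args);
        env.getConfig().setGlobalJobParameters(params);

        env.fromElements("a", "b", "c").print();
        env.execute("configured-job");
    }
}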
