Prometheus doesn't get metrics from the taskmanager once a flink job has started - apache-flink

I run Flink 1.15.2 on Kubernetes and set the metrics configuration for the Flink cluster as below:
# metrics
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
The problem is that Prometheus doesn't get metrics from the taskmanager once the Flink job has started.
If I stop the job, I can see the metrics, although some of them are empty.
I tried to reduce CPU usage, but still no metrics from the taskmanager.
I tried to increase the task slots, still no metrics.
It happens on both Intel and ARM nodes.
I tried to change the Flink config as below; metrics were collected for a moment (several seconds) and then disappeared again:
# metrics
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
I tried to disable the Kafka client metrics in the source/sink builders as below, but still no metrics:
// Don't register Kafka consumer metrics in Flink's metric group
kafkaSourceBuilder.setProperty("register.consumer.metrics", "false");
var producerProperties = new Properties();
// Don't register Kafka producer metrics either
producerProperties.setProperty("register.producer.metrics", "false");
producerSinkBuilder.setKafkaProducerConfig(producerProperties);
If I start the job on Flink 1.15.3, metrics are collected.
If I start the job on Flink 1.16.0, Prometheus doesn't get any metrics from Flink at all.

As mentioned in the Flink 1.16 release notes, configuring reporters by their class has been deprecated; see https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.16/#flink-27206httpsissuesapacheorgjirabrowseflink-27206 (FLINK-27206) for details.
There are also some known issues with metrics reporting in 1.16.0; please upgrade to Flink 1.16.1.
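For reference, a factory-based configuration would look roughly like this (a minimal sketch following the 1.16 docs; the port is the reporter's documented default, shown only for illustration):
metrics.reporters: prom
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249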

Related

Flink 1.13.2 does not update metrics in near-real-time when connected to Kafka sources/sinks

I'm building a process to handle millions of records with Apache Flink to support logistics data pipelines. I'm moving from Kinesis sources/sinks to Kafka sources/sinks.
However, in the Flink dashboard, the job metrics are not updated in near-real-time. Do you know what could be wrong with the job/version?
By the way, when the job is closed, it shows all the metrics... but not in near-real-time...
Job non-updating metrics picture
Fixed after cleaning up conflicting dependencies on the kafka-clients lib.
In my case, I was also using some Avro & CloudEvents libs that pulled in a higher kafka-clients version. I just needed to exclude kafka-clients from those libs and prefer Flink's kafka-clients version, and that solved the issue.
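For illustration, the exclusion would look roughly like this in a Maven pom (a sketch; some.vendor:lib-with-newer-kafka-clients is a hypothetical placeholder for whichever dependency drags in the conflicting kafka-clients):
<dependency>
    <groupId>some.vendor</groupId>                        <!-- hypothetical placeholder -->
    <artifactId>lib-with-newer-kafka-clients</artifactId>
    <version>1.0.0</version>
    <exclusions>
        <!-- drop the transitive kafka-clients so Flink's own version wins -->
        <exclusion>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
        </exclusion>
    </exclusions>
</dependency>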

Apache Flink showing custom metrics in UI, but Prometheus metrics reporter not scraping them

I am working on sending custom app metrics to Prometheus via the Prometheus Flink Metrics Reporter. The metrics are being created correctly, since I can see them accurately in the Flink dashboard. I configured the Prometheus metrics reporter similarly to what is described here. When I curl the Prometheus endpoint (curl http://localhost:9090/api/v1/metrics), I can only see the cluster metrics, not the custom metrics I am creating. I suspect this issue has to do with how I configured the Prometheus Flink Metrics Reporter, since when I visit http://localhost:9090 there is no UI, just a list of the cluster metrics mentioned above.
Flink job code to create metrics (visible in the Flink UI):
this.anomalyCounter = getRuntimeContext.getMetricGroup.addGroup("metric1").counter("counter")
flink-conf.yaml:
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9090
prometheus.yml:
scrape_configs:
  - job_name: 'flink'
    static_configs:
      - targets: ['localhost:9090']
Is there anything I am missing in the configuration? Why are my cluster metrics reaching prometheus and not my custom ones?
Hi @sarvad123, depending on your Flink version, you probably need to add flink-metrics-prometheus-{version}.jar to the /lib folder.
I've seen similar issues caused by a bug in the Flink 1.13.6 version we were using: the reporter was blowing up, so you got no custom metrics. This has been fixed in the 1.16 version we are using now, and we can view both custom and RocksDB metrics. For what it's worth, 1.13.6 had lots of issues that apparently made the Flink UI pretty useless for data reporting; 1.16 is much more stable and reports things quite well.
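One thing worth double-checking in the setup above: metrics.reporter.prom.port: 9090 collides with the Prometheus server's own default port (9090), so the scrape target may effectively be pointing at the wrong process. A non-conflicting layout would look roughly like this (a sketch; 9249 is the reporter's documented default port):
flink-conf.yaml:
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249
prometheus.yml:
scrape_configs:
  - job_name: 'flink'
    static_configs:
      - targets: ['localhost:9249']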

Is it possible to add a new embedded worker while the cluster is running on Statefun?

Here is the deal:
I'm trying to add a new (embedded) worker to a running cluster (Flink Statefun 2.2.1).
As you can see, the new task manager can be registered with the cluster:
Screenshot of newly deployed taskmanager
But it doesn't initialize (it doesn't deploy sources):
What am I missing here? (Do master and workers need the same jar files, or is it enough to deploy the taskmanager with the jar file?)
Any help would be appreciated,
Thx.
Flink supports two different approaches to rescaling: active and reactive.
Reactive mode is new in Flink 1.13 (released just this week) and works as you expected: add (or remove) a task manager, and your application will adjust to the new parallelism. You can read about elastic scaling and reactive mode in the docs.
Reactive mode is currently a work in progress, but it might meet your needs.
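A minimal sketch of enabling it (assuming a standalone application-mode deployment, which is what reactive mode requires):
flink-conf.yaml:
scheduler-mode: reactive
With this set, the job automatically rescales to use all available task slots whenever a TaskManager joins or leaves.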
In broad strokes, for active-mode rescaling you need to:
1. Do a stop-with-savepoint to bring down your current job while taking a snapshot of its state.
2. Relaunch with the new parallelism, using the savepoint as the starting point.
The exact details depend on how your cluster is deployed.
For a step-by-step tutorial, see Upgrading & Rescaling a Job in the Flink Operations Playground.
The above applies to rescaling statefun embedded functions. Being stateless, remote functions can be rescaled more straightforwardly.
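For a standalone or session cluster, the active-mode cycle above looks roughly like this on the CLI (a sketch; the job id, savepoint directory, parallelism, and jar path are placeholders):
# 1. Stop the job, writing a savepoint to the given directory
flink stop --savepointPath /tmp/flink-savepoints <job-id>
# 2. Relaunch from that savepoint with the new parallelism
flink run -s /tmp/flink-savepoints/savepoint-xxxx -p 8 my-statefun-job.jar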

Flink: Unable to collect Task Metrics via JMX

I have been able to run JMX with Flink with the following configuration applied to the flink-conf.yaml file of all nodes in the cluster:
metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 9020-9022
env.java.opts: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
When I run JConsole and connect to master-IP:9999 or slave-IP:9020, I can see system metrics like CPU, memory, etc.
How can I access the task metrics and their respective graphs, like bytesRead, latency, etc., which are collected for each subtask and shown in the GUI?
You can go to the MBeans tab in JConsole; there you will see various dropdowns on the right-hand side, named after your jobs and tasks. Let me know if you have any issues.
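If you want the same data programmatically rather than through JConsole, a small JMX client can list the Flink MBeans (a sketch; the host, port, and the org.apache.flink domain filter are assumptions based on the setup above):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FlinkJmxDump {
    public static void main(String[] args) throws Exception {
        // Connect to the JMX port opened by the reporter (e.g. 9020 on a slave node)
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://slave-IP:9020/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Flink registers task metrics under domains starting with org.apache.flink
            for (ObjectName name : conn.queryNames(new ObjectName("org.apache.flink*:*"), null)) {
                System.out.println(name);
            }
        }
    }
}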

Why "Configuration" section of running job is empty?

Can anybody explain to me why the "Configuration" section of a running job in the Apache Flink Dashboard is empty?
How can I use this job configuration in my flow? It seems like this is not described in the documentation.
The configuration tab of a running job shows the values of the ExecutionConfig. Depending on the version of Flink, you will experience different behaviour.
Flink <= 1.0
The ExecutionConfig is only accessible for finished jobs. For running jobs, it is not possible to access it. Once the job has finished or has been stopped/cancelled, you should be able to see the ExecutionConfig.
Flink > 1.0
The ExecutionConfig can also be accessed for running jobs.
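Regarding "how to use this job configuration in my flow": values only appear in that tab if they are registered on the ExecutionConfig. A minimal sketch using ParameterTool (argument names are illustrative):

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobWithVisibleConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Parse e.g. --input /data/in --window.minutes 5 from the command line
        ParameterTool params = ParameterTool.fromArgs(args);
        // Register the parameters on the ExecutionConfig; this is what the
        // dashboard's "Configuration" tab displays for the running job
        env.getConfig().setGlobalJobParameters(params);
        env.fromElements(1, 2, 3).map(i -> i * 2).print();
        env.execute("job-with-visible-config");
    }
}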
