Prometheus not showing flink metrics - apache-flink

I have the following code in my Flink job:
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// enclosing class assumed: getRuntimeContext() requires a rich function
public class UppercaseMapper extends RichMapFunction<String, Tuple2<String, String>> {
    private transient Counter counter;

    @Override
    public void open(Configuration config) {
        this.counter = getRuntimeContext()
            .getMetricGroup()
            .counter("myCounter");
    }

    @Override
    public Tuple2<String, String> map(String s) throws Exception {
        this.counter.inc();
        Thread.sleep(5000);
        return new Tuple2<>(s, s.toUpperCase());
    }
}
In prometheus.yml inside the Prometheus distribution, I have the following:
- job_name: 'flink-prometheus'
  scrape_interval: 5s
  static_configs:
    - targets: ['localhost:9999']
  metrics_path: /
And in flink-conf.yaml inside the Flink distribution:
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.host: 127.0.0.1
metrics.reporter.prom.port: 9999
On the Prometheus dashboard I can see localhost:9999 as a target, along with various Flink metrics. But there is no entry for the counter I added in the code. I searched for the strings "myCounter" and "flink-prometheus", but got zero results.
What else do I need to do for my metrics to show up?

The main difference I see between the example in https://github.com/mbode/flink-prometheus-example and your own config is that the example is scraping the job manager as well as the task manager(s):
scrape_configs:
  - job_name: 'flink'
    static_configs:
      - targets: ['job-cluster:9249', 'taskmanager1:9249', 'taskmanager2:9249']
In my own example -- see Flink Timing Explorer -- I found it necessary to do this as well. Here's what worked for me:
flink-conf.yaml
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250-9260
prometheus.yaml
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: 'flink'
    static_configs:
      - targets: ['host.docker.internal:9250', 'host.docker.internal:9251']
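Once the task manager's reporter port is actually among the scrape targets, the counter should show up, but note that Flink exposes user metrics under their fully scoped name rather than the bare string passed to counter(). As a rough illustration (the label set is abbreviated and the values are placeholders, not taken from the question), the exposed line looks something like:

flink_taskmanager_job_task_operator_myCounter{host="...",job_name="...",task_name="...",operator_name="..."} 1.0

So searching for "myCounter" only returns results once the task manager endpoint has been scraped while the job is running.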

Related

Why is the CamelExchangesFailed_total metric not increased?

I have an Apache Camel application which is monitored by Prometheus. Therefore, I added Micrometer to my POM (see Spring Boot Auto-Configuration) and a MicrometerRoutePolicyFactory to my application (see Using Micrometer Route Policy Factory). But the metric CamelExchangesFailed_total doesn't change, although a route failed.
Source
import org.apache.camel.builder.endpoint.EndpointRouteBuilder;
import org.apache.camel.component.micrometer.routepolicy.MicrometerRoutePolicyFactory;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class TestApplication {

    public static void main(String[] args) {
        SpringApplication.run(TestApplication.class, args);
    }

    @Bean
    public MicrometerRoutePolicyFactory micrometerRoutePolicyFactory() {
        return new MicrometerRoutePolicyFactory();
    }

    @Bean
    public EndpointRouteBuilder route() {
        return new EndpointRouteBuilder() {
            @Override
            public void configure() throws Exception {
                errorHandler(deadLetterChannel("log:dead"));
                from(timer("testTimer").repeatCount(1)).throwException(new RuntimeException());
            }
        };
    }
}
Logs
INFO 5060 --- [ restartedMain] o.a.c.i.e.InternalRouteStartupManager : Route: route1 started and consuming from: timer://testTimer
INFO 5060 --- [ restartedMain] o.a.c.impl.engine.AbstractCamelContext : Total 1 routes, of which 1 are started
INFO 5060 --- [ restartedMain] o.a.c.impl.engine.AbstractCamelContext : Apache Camel 3.5.0 (camel-1) started in 0.007 seconds
INFO 5060 --- [ restartedMain] test.TestApplication : Started TestApplication in 6.626 seconds (JVM running for 7.503)
INFO 5060 --- [on(3)-127.0.0.1] o.a.c.c.C.[Tomcat].[localhost].[/] : Initializing Spring DispatcherServlet 'dispatcherServlet'
INFO 5060 --- [on(3)-127.0.0.1] o.s.web.servlet.DispatcherServlet : Initializing Servlet 'dispatcherServlet'
INFO 5060 --- [on(3)-127.0.0.1] o.s.web.servlet.DispatcherServlet : Completed initialization in 5 ms
INFO 5060 --- [mer://testTimer] dead : Exchange[ExchangePattern: InOnly, BodyType: null, Body: [Body is null]]
Metrics
# HELP CamelExchangesFailed_total
# TYPE CamelExchangesFailed_total counter
CamelExchangesFailed_total{application="test",camelContext="camel-1",routeId="route1",serviceName="MicrometerRoutePolicyService",} 0.0
# HELP CamelExchangesSucceeded_total
# TYPE CamelExchangesSucceeded_total counter
CamelExchangesSucceeded_total{application="test",camelContext="camel-1",routeId="route1",serviceName="MicrometerRoutePolicyService",} 1.0
Research
If I remove the custom error handler, the metric CamelExchangesFailed_total is increased, but then the default error handler is used, which is not desired for several reasons.
Question
Why is the CamelExchangesFailed_total metric not increased? Is there any way to count all failed routes with a custom error handler?
Apache Camel LTS version 3.7.0 added a new metric, CamelExchangesFailuresHandled_total, a counter of handled errors; see CAMEL-15908. This covers exactly the situation above: the dead letter channel marks the exchange as handled, so it is no longer counted as failed. From the issue:
Similar to CAMEL-15255, there are some additional counter metrics we could add to the MicrometerRoutePolicy for:
Total exchanges processed
Number of failures handled
Number of external redeliveries
Metrics (after upgrading to Camel 3.7.0)
# HELP CamelExchangesFailed_total
# TYPE CamelExchangesFailed_total counter
CamelExchangesFailed_total{application="test",camelContext="camel-1",routeId="route1",serviceName="MicrometerRoutePolicyService",} 0.0
# HELP CamelExchangesSucceeded_total
# TYPE CamelExchangesSucceeded_total counter
CamelExchangesSucceeded_total{application="test",camelContext="camel-1",routeId="route1",serviceName="MicrometerRoutePolicyService",} 1.0
# HELP CamelExchangesFailuresHandled_total
# TYPE CamelExchangesFailuresHandled_total counter
CamelExchangesFailuresHandled_total{application="test",camelContext="camel-1",routeId="route1",serviceName="MicrometerRoutePolicyService",} 1.0
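A hedged sketch of picking the fix up in a Maven build, assuming the Camel Spring Boot BOM manages the versions (adjust to your own POM layout):

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.camel.springboot</groupId>
            <artifactId>camel-spring-boot-bom</artifactId>
            <version>3.7.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

With that in place, exchanges handled by the dead letter channel are counted in CamelExchangesFailuresHandled_total, as in the output above.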

concurrentConsumers not created right away from the beginning

I am using Camel in a Spring Boot application to route from an ActiveMQ queue. Messages from this queue are sent to a REST web service. It is already working with this line:
from("amq:queue:MyQueue").process("jmsToHttpProcessor").to(uri);
My uri looks like this:
http4://localhost:28010/application/createCustomer
Now I have the requirement that the routing to the web service should be done in parallel.
In order to achieve that, I configured concurrentConsumers in the JmsConfiguration as follows:
@Bean
public JmsComponent amq(@Qualifier("amqConnectionFactory") ConnectionFactory amqConnectionFactory, AMQProperties amqProperties) {
    JmsConfiguration jmsConfiguration = new JmsConfiguration(amqConnectionFactory);
    jmsConfiguration.setConcurrentConsumers(50);
    jmsConfiguration.setMaxConcurrentConsumers(50);
    return new JmsComponent(jmsConfiguration);
}

@Bean
public ConnectionFactory amqConnectionFactory(AMQProperties amqProperties) throws Exception {
    ConnectionFactoryParser parser = new ConnectionFactoryParser();
    return parser.newObject(parser.expandURI(amqProperties.getUrl()), "amqConnectionFactory");
}
It works as expected, BUT not right away from the beginning. I observe the following:
I have 100 messages in the ActiveMQ queue.
I start my Spring application.
Camel creates only 1 thread, consuming 1 message after the previous one gets its response.
The number of messages in the queue decreases only slowly (99... 98... 97... 96...).
I fill the queue with 100 new messages.
NOW the concurrent consumers are created, and the message count decreases rapidly.
Does someone have any idea why concurrentConsumers does not take effect right from the beginning?
I tried the advice given; unfortunately it doesn't change the behaviour. I found out that the problem is that Camel starts consuming messages from the queue before the Spring Boot application has started. I can observe this from the log:
2021-04-01T20:26:33,901 INFO (Camel (CamelBridgeContext) thread #592 - JmsConsumer[MyQueue]) [message]; ...
2021-04-01T20:26:33,902 INFO (Camel (CamelBridgeContext) thread #592 - JmsConsumer[MyQueue]) [message]; ...
2021-04-01T20:26:33,915 INFO (main) [AbstractConnector]; _; Started ServerConnector@5833f5cd{HTTP/1.1,[http/1.1]}{0.0.0.0:23500}
2021-04-01T20:26:33,920 INFO (main) [BridgeWsApplication]; _; Started BridgeWsApplication in 12.53 seconds (JVM running for 13.429)
In this case, only one consumer, thread #592, is consuming all the messages.
In fact, if I start my Spring application first and then fill the queue with messages, the concurrent consumers are used:
2021-04-01T20:30:20,159 INFO (Camel (CamelBridgeContext) thread #594 - JmsConsumer[MyQueue])
2021-04-01T20:30:20,159 INFO (Camel (CamelBridgeContext) thread #599 - JmsConsumer[MyQueue])
2021-04-01T20:30:20,178 INFO (Camel (CamelBridgeContext) thread #593 - JmsConsumer[MyQueue])
2021-04-01T20:30:20,204 INFO (Camel (CamelBridgeContext) thread #564 - JmsConsumer[MyQueue])
In this case, messages are consumed by the concurrent consumers in parallel.
In order to solve the problem, I tried setting autoStartup to false in my RouteBuilder component:
@Override
public void configure() {
    CamelContext context = getContext();
    context.setAutoStartup(false);
    // My Route
}
In my naive thinking, I let Camel start after Spring Boot is up and running:
public static void main(String[] args) {
    ConfigurableApplicationContext context = SpringApplication.run(BridgeWsApplication.class, args);
    SpringCamelContext camel = (SpringCamelContext) context.getBean("camelContext");
    camel.start();
    try {
        camel.startAllRoutes();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Unfortunately, this does not change the behaviour. There must be a configuration that lets Camel start only after Spring has started.
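One way to express that ordering, as a minimal sketch (not verified against this exact setup): leave autoStartup(false) in place and start the routes from an ApplicationRunner, which Spring Boot invokes only after the application context, including the embedded web server, is fully started. The class and bean names below are made up:

import org.apache.camel.CamelContext;
import org.springframework.boot.ApplicationRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CamelStartupConfig {

    // Spring Boot runs ApplicationRunner beans only after the context is fully started.
    @Bean
    public ApplicationRunner startCamelRoutes(CamelContext camelContext) {
        return args -> camelContext.getRouteController().startAllRoutes();
    }
}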

Failed to register Prometheus Gauge to Flink

I am trying to expose a Prometheus Gauge in a Flink app:
@transient def metricGroup: MetricGroup = getRuntimeContext.getMetricGroup
  .addGroup("site", site)
  .addGroup("sink", counterBaseName)

@transient var failedCounter: Counter = _

def expose(metricName: String, gaugeValue: Int, context: SinkFunction.Context[_]): Unit = {
  try {
    metricGroup
      .addGroup("hostname", metricName)
      .gauge[Int, ScalaGauge[Int]]("test", ScalaGauge[Int](() => gaugeValue))
  } catch {
    case _: Throwable => failedCounter.inc()
  }
}
The app runs fine locally and exposes the metrics without any problem.
While trying to move to production, I encounter the following exception in the Flink task manager:
WARN org.apache.flink.runtime.metrics.MetricRegistryImpl - Error while registering metric. java.lang.NullPointerException
Not sure what I am missing here.
Why does the local app expose metrics while on the cluster it fails to register the gauge?
I use Prometheus to expose other metrics from Flink, for example failedCounter (in the code), which is a counter.
This is the first time I have exposed a gauge in Flink, so I bet something in my implementation is broken.
Please help.
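For comparison, the pattern in the Flink metrics documentation registers a gauge once in open() and has it read a field that the function updates, rather than registering a new gauge per record inside the sink. A minimal Java sketch (class and metric names are made up for illustration):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;

public class GaugeSketch extends RichMapFunction<String, String> {

    // updated by map(), read by the gauge on every scrape
    private transient int lastLength;

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext()
            .getMetricGroup()
            .gauge("lastLength", (Gauge<Integer>) () -> lastLength);
    }

    @Override
    public String map(String value) {
        lastLength = value.length();
        return value;
    }
}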

Unable to run a flink application on a cluster

I have the below example Flink application which I am trying to run on a cluster.
public class ClusterConnect {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment
            .createRemoteEnvironment("X.X.X.X", 6123, "");
        // get input data
        DataSet<String> text = env.fromElements("To be, or not to be,--that is the question:--",
            "Whether 'tis nobler in the mind to suffer", "The slings and arrows of outrageous fortune",
            "Or to take arms against a sea of troubles,");
        DataSet<Tuple2<String, Integer>> counts = text
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                    for (String word : s.split(" ")) {
                        collector.collect(new Tuple2<String, Integer>(word, 1));
                    }
                }
            })
            .groupBy(0)
            .sum(1);
        // execute and print result
        counts.print();
        env.execute();
    }
}
The cluster is set up with one jobmanager (free AWS instance) and two taskmanagers (free AWS instances). While trying to run the above Flink application from a different AWS instance (which can reach the jobmanager and taskmanagers), I hit the following error.
Error from application:
WARN [akka.remote.ReliableDeliverySupervisor] Association with remote system [akka.tcp://flink@172.31.29.190:6123] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
Logs from cluster job manager:
2016-11-30 22:00:42,796 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@172.31.6.190:33619] has failed, address is now gated for [5000] ms. Reason is: [scala.Option; local class incompatible: stream classdesc serialVersionUID = -2062608324514658839, local class serialVersionUID = -114498752079829388]
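No answer is recorded here, but an Akka association failure with "local class incompatible: ... serialVersionUID" is the classic signature of the client and the cluster running different Flink (or Scala) builds. As a hedged sketch of the usual remedy, assuming a Maven build, pin the client dependencies to the exact version deployed on the cluster (the version below is a placeholder):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <!-- placeholder: must match the exact Flink version (and Scala suffix, where applicable) running on the cluster -->
    <version>1.1.3</version>
</dependency>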

Google App Engine : use mapreduce to empty datastore

I am trying to use an early experimental release of the mapper implementation to empty the datastore. This solution was proposed in a similar SO question.
This is the AppEngineMapper I am currently using. It just deletes the entity.
import java.io.IOException;
import java.util.logging.Logger;

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
import com.google.appengine.tools.mapreduce.DatastoreMutationPool;
import org.apache.hadoop.io.NullWritable;

public class EmptyFixesMapper extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> {

    private static final Logger log = Logger.getLogger(EmptyFixesMapper.class.getName());

    public EmptyFixesMapper() {
    }

    @Override
    public void taskSetup(Context context) {
    }

    @Override
    public void taskCleanup(Context context) {
    }

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
    }

    @Override
    public void cleanup(Context context) {
        getAppEngineContext(context).flush();
    }

    @Override
    public void map(Key key, Entity value, Context context) {
        log.warning("Mapping key: " + key);
        DatastoreMutationPool mutationPool =
            this.getAppEngineContext(context).getMutationPool();
        mutationPool.delete(value.getKey());
    }
}
This is my mapreduce.xml configuration file:
<configurations>
  <configuration name="Empty Entities">
    <property>
      <name>mapreduce.map.class</name>
      <value>com.google.appengine.demos.mapreduce.EmptyFixesMapper</value>
    </property>
    <property>
      <name>mapreduce.inputformat.class</name>
      <value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
    </property>
    <property>
      <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
      <value template="optional">Fix</value>
    </property>
  </configuration>
...
When I enter the mapreduce control panel at mydomain/mapreduce/status, I can launch the tasks, but they never complete. This is the screenshot where you can see the field "0/0 shards":
And I can see some tasks are created in the appengine default task queue, with a lot of retries:
And finally, in my GAE application logs I see:
1.
09-11 03:23AM 08.556 /mapreduce/mapperCallback 500 10081ms 0cpu_ms 0kb AppEngine-Google; (+http://code.google.com/appengine)
0.1.0.2 - - [11/Sep/2010:03:23:18 -0700] "POST /mapreduce/mapperCallback HTTP/1.1" 500 0 "http://xxx.appspot.com/mapreduce/command/start_job" "AppEngine-Google; (+http://code.google.com/appengine)" "xxx.appspot.com" ms=10081 cpu_ms=0 api_cpu_ms=0 cpm_usd=0.000057 queue_name=default task_name=worker-attempt-1284198892815-0001-m-000002-1--0
2.
W 09-11 03:23AM 18.638
Request was aborted after waiting too long to attempt to service your request. This may happen sporadically when the App Engine serving cluster is under unexpectedly high or uneven load. If you see this message frequently, please contact the App Engine team.
What could be happening? I'm sure I've followed the steps described in the getting started guide, and I have fewer than 1000 entities in the datastore...
Well, the problem has nothing to do with appengine-mapreduce. I was securing the /mapreduce/** URIs, so the tasks in the default task queue could not reach /mapreduce/mapperCallback, /mapreduce/command/start_job, etc., because no username/password information is sent.
It is an interesting issue anyway, because I don't really want to open /mapreduce/** to everyone...
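One way to keep those URIs closed without breaking the task queue, as a hedged sketch: protect them in web.xml with an admin security constraint. App Engine issues task queue and cron requests with administrator rights, so they pass the constraint while ordinary users are still locked out:

<security-constraint>
    <web-resource-collection>
        <web-resource-name>mapreduce</web-resource-name>
        <url-pattern>/mapreduce/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
        <!-- task queue and cron requests run as admin, so they can still reach these handlers -->
        <role-name>admin</role-name>
    </auth-constraint>
</security-constraint>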
