Customize Flink - Prometheus metrics

I need to export custom metrics from Flink 1.10 to Prometheus. I have my custom metrics already created and working, but the issue is that when I print them out (in the terminal, for example), a lot of metrics that I don't need come out of Flink, such as flink_taskmanager_job_task_Shuffle_Netty_Input_Buffers_inputQueueLength, and many more.
I'm only interested in exposing my custom metrics from Flink to Prometheus and removing the rest.
So, questions:
Is there any way to remove all the metrics exported by Flink and keep only my custom metrics in Prometheus?
Is there any way to create static task_ids so that information doesn't accumulate in Prometheus? I assume those ids are not fixed, and every change to the application that requires a stop/start makes Flink create a new task_id.
I've been able to remove a few tags using:
"metrics.reporter.cep_reporter.scope.variables.excludes":"job_id;job_name;task_attempt_id;task_attempt_num;task_name;operator_id;operator_name;subtask_index;tm_id;host;Netty"
but that is not enough; there are more than 800 metrics I don't need. The JVM metrics, for example: I'm already using a node_exporter to scrape those, so I need to remove them as well.
Any help will be appreciated. Thanks a lot.

Disclaimer: I haven't tried this.
What I would try is setting a user scope on your custom Flink metrics, and then configuring Prometheus to only scrape those metrics.
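
For the Flink side, a minimal sketch in Java of what I mean (the group name "my_app" and the counter name are made up, and I'm registering the metric inside a RichMapFunction; adjust to wherever your metrics actually live):

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.metrics.Counter;

    public class MyMapper extends RichMapFunction<String, String> {
        private transient Counter eventsProcessed;

        @Override
        public void open(Configuration parameters) {
            // "my_app" is an arbitrary user scope; every metric registered under it
            // carries that group, which makes it easy to match on the Prometheus side.
            eventsProcessed = getRuntimeContext()
                    .getMetricGroup()
                    .addGroup("my_app")
                    .counter("events_processed");
        }

        @Override
        public String map(String value) {
            eventsProcessed.inc();
            return value;
        }
    }

With the Prometheus reporter, that metric should come out with the user group embedded in its name (something like flink_taskmanager_job_task_operator_my_app_events_processed). On the Prometheus side you could then add a metric_relabel_configs rule with action: keep and a regex matching that name prefix, which drops every other Flink metric at scrape time.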

Related

Is it possible in Flink to decide which metrics are being sent to Graphite?

So far we have used the GraphiteReporter to monitor Flink. Recently, we decided that we don't need all the metrics being reported to Graphite, and we would like to know if it's possible to still use the GraphiteReporter but control the type and amount of metrics that are being sent.
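One untried option is to wrap the reporter: Flink instantiates whatever class the reporter configuration points at, so a subclass of GraphiteReporter that only forwards whitelisted metrics should work. A rough sketch (the class name and the "myApp" prefix are made up; the notifyOfAddedMetric signature is from Flink's reporter SPI):

    import org.apache.flink.metrics.Metric;
    import org.apache.flink.metrics.MetricGroup;
    import org.apache.flink.metrics.graphite.GraphiteReporter;

    public class FilteringGraphiteReporter extends GraphiteReporter {
        @Override
        public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
            // Forward only the metrics we care about; everything else is never
            // registered with the underlying Graphite reporter.
            if (metricName.startsWith("myApp")) {
                super.notifyOfAddedMetric(metric, metricName, group);
            }
        }
    }

You would then point the reporter's class entry in flink-conf.yaml (metrics.reporter.<your-reporter-name>.class) at this subclass instead of the stock GraphiteReporter.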

How to get the throughput of KafkaSource in Flink?

I want to know the throughput of the KafkaSource. In other words, I want to measure the speed at which Flink reads data. My idea is to add a map operator after the source and use the built-in metrics in the map operator. Will this increase the overhead? I hope to get this metric without adding much overhead. What should I do? Or is there a way to get the output throughput of the topic in Kafka? Or should I get the KafkaSource's numRecordsOutPerSecond through the REST API?
Take a look at Kafka Manager which displays a lot of metrics related to Kafka. It's a tool which is used to manage Kafka and acts as a real-time dashboard. You need to install and configure this separately.
This can be used to check the consumption rate for your Flink consumer.
You can also make use of the built-in metrics publisher on the source operator, without adding a Map just for that purpose.
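If you do go with an extra operator, the overhead of a meter is small. A rough, untested sketch of a pass-through map that exposes a built-in Meter (the metric name is arbitrary):

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.metrics.Meter;
    import org.apache.flink.metrics.MeterView;

    public class ThroughputMeter<T> extends RichMapFunction<T, T> {
        private transient Meter recordsPerSecond;

        @Override
        public void open(Configuration parameters) {
            // MeterView averages the event rate over the given time span in seconds.
            recordsPerSecond = getRuntimeContext()
                    .getMetricGroup()
                    .meter("recordsPerSecond", new MeterView(60));
        }

        @Override
        public T map(T value) {
            recordsPerSecond.markEvent();
            return value;
        }
    }

That said, the source's own numRecordsOutPerSecond is already exposed, so you can also read it from the REST API (/jobs/:jobid/vertices/:vertexid/metrics) without adding any operator at all.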

Getting Flink cluster configuration at runtime

I'm interested in getting, at runtime, the number of TaskManagers and slots of a Flink cluster before submitting jobs to it (I'd like to tune some program parameters based on the cluster's resources).
Does anybody know which functions should I call to get these parameters?
Thanks!
The parameters are available through Flink's REST API.
Full API documentation: https://ci.apache.org/projects/flink/flink-docs-master/monitoring/rest_api.html
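For example, the /overview endpoint already returns the number of task managers and the total/available slots. A minimal sketch (assuming the JobManager's REST endpoint is reachable on the default port 8081; parsing of the JSON is left out):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class ClusterOverview {
        public static void main(String[] args) throws Exception {
            // Returns JSON along the lines of:
            // {"taskmanagers":2,"slots-total":8,"slots-available":8,...}
            URL url = new URL("http://localhost:8081/overview");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }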

How can I access states computed from an external Flink job ? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring the transaction with a machine learning model.
For now, the features are all kept in Flink state and the same job scores the enriched transactions. But I'd like to separate the feature-computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit here, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is splitting this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you could use Flink's REST API to determine the job id.
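For instance, the job-overview endpoint (/jobs/overview on newer Flink versions; the path differs in older ones) lists every job with its id ("jid") and name, so the scoring side could look up the feature job by name at startup. A tiny sketch, with the host/port as a placeholder and the JSON parsing left out:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class FindJobId {
        public static void main(String[] args) throws Exception {
            // Response looks roughly like:
            // {"jobs":[{"jid":"a1b2c3...","name":"feature-job","state":"RUNNING",...}]}
            URL url = new URL("http://jobmanager:8081/jobs/overview");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // pick out the "jid" of the job you want to query
                }
            }
        }
    }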
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

DynamoDB - Do I need lots of read capacities to handle multiple getItem-calls per page?

I'm using DynamoDB to store items that are necessary to deliver a specific webpage. However, for one page load, the web server may easily need hundreds of items from about 2-5 different tables. With only one read capacity unit I can only make 2 eventually consistent reads per second. Of course, if I need these items to deliver a webpage, I cannot wait one second for every DB call.
I already use batchGetItem to reduce the workload. Do I just need a lot more read capacity now, or am I getting something wrong?
You should be thinking caching, not fetching.
Either AWS ElastiCache (memcached) or Varnish-like caching.
You can also implement an in-process cache using Google Guava.
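A minimal Guava sketch of an in-process, load-through cache (the loadItemFromDynamo helper, the String key/value types, and the sizing numbers are all placeholders):

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import java.util.concurrent.TimeUnit;

    public class ItemCache {
        private final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                    // bound memory use
                .expireAfterWrite(5, TimeUnit.MINUTES)  // tolerate slightly stale items
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) {
                        // Only called on a cache miss.
                        return loadItemFromDynamo(key);
                    }
                });

        public String get(String key) {
            return cache.getUnchecked(key);
        }

        private String loadItemFromDynamo(String key) {
            // Hypothetical DynamoDB lookup (getItem / batchGetItem) goes here.
            return "item-for-" + key;
        }
    }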
It's possible to tune your read capacity based on usage, and that's one of the advantages of using a hosted solution like DynamoDB. You can set up CloudWatch alarms, receive notifications through an SNS topic, and create a simple app to increase/decrease your capacity. There is a nice post about it at: http://engineeringblog.txtweb.com/2013/09/txtweb-scaling-with-dynamodb/
