Failed to register Prometheus Gauge to Flink - apache-flink

I am trying to expose a Prometheus Gauge in a Flink app:
@transient def metricGroup: MetricGroup = getRuntimeContext.getMetricGroup
  .addGroup("site", site)
  .addGroup("sink", counterBaseName)
@transient var failedCounter: Counter = _
def expose(metricName: String, gaugeValue: Int, context: SinkFunction.Context[_]): Unit = {
  try {
    metricGroup.addGroup("hostname", metricName).gauge[Int, ScalaGauge[Int]]("test", ScalaGauge[Int](() => gaugeValue))
  } catch {
    case _: Throwable => failedCounter.inc()
  }
}
The app runs locally just fine and exposes the metrics without any problem.
When I move it to production, I get the following exception in the Flink task manager:
WARN org.apache.flink.runtime.metrics.MetricRegistryImpl - Error while registering metric. java.lang.NullPointerException
I'm not sure what I'm missing here.
Why does the app expose metrics locally, while on the cluster it fails to register the gauge?
I already use Prometheus to expose other metrics from Flink, for example failedCounter (in the code), which is a counter.
This is the first time I have exposed a gauge in Flink, so I bet something in my implementation is broken.
Please help.
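One detail of the snippet is certain regardless of the NullPointerException: ScalaGauge[Int](() => gaugeValue) captures the value passed to that particular expose call, so each gauge it creates reports that fixed number forever, and expose tries to register another gauge named "test" on every call. A common alternative is to register the gauge once (for example in open) and let it read a field that you update. The sketch below is a plain-Python model of that difference, not Flink code: ToyMetricGroup, RegisterPerCall and RegisterOnce are hypothetical names, and the rule that the first registration under a name wins is an assumption of the toy registry. It does not reproduce or explain the NPE itself.

```python
from typing import Callable, Dict


class ToyMetricGroup:
    """Hypothetical stand-in for a metric group; NOT Flink's API.

    Assumption: the first metric registered under a name is kept and
    later registrations with the same name are ignored.
    """

    def __init__(self) -> None:
        self.gauges: Dict[str, Callable[[], int]] = {}

    def gauge(self, name: str, fn: Callable[[], int]) -> None:
        # Keep the first registration; ignore duplicates.
        self.gauges.setdefault(name, fn)

    def report(self) -> Dict[str, int]:
        # A reporter calls each gauge's function at scrape time.
        return {name: fn() for name, fn in self.gauges.items()}


class RegisterPerCall:
    """Mirrors the question: a new gauge on every expose() call."""

    def __init__(self, group: ToyMetricGroup) -> None:
        self.group = group

    def expose(self, value: int) -> None:
        # The lambda captures this call's value, so it never changes.
        self.group.gauge("test", lambda: value)


class RegisterOnce:
    """Register once (e.g. in open()) and update a field afterwards."""

    def __init__(self, group: ToyMetricGroup) -> None:
        self.current = 0
        group.gauge("test", lambda: self.current)

    def expose(self, value: int) -> None:
        # Only the field changes; the gauge reads it when reported.
        self.current = value
```

With RegisterPerCall, calling expose(1) and then expose(5) still reports 1; with RegisterOnce the same calls report 5.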

Related

Flink custom metrics are not shown in Datadog

In Flink, I am generating custom metrics in a FlatMapFunction using Python.
from pyflink.datastream.functions import FlatMapFunction, RuntimeContext

class OccupancyEventFlatMap(FlatMapFunction):
    def open(self, runtime_context: RuntimeContext):
        mg = runtime_context.get_metrics_group()
        self.counter_sum = mg.counter("my_counter_sum")
        self.counter_total = mg.counter("my_counter_total")

    def flat_map(self, value):
        self.counter_sum.inc(10)
        self.counter_total.inc()
I am able to query the metric using the REST API
http://localhost:43491/jobs/9a376e28a1bb022b45c127d75fb1b447/vertices/5239a5f0e3e9cdca6a88500e58b5759e/metrics?get=0.FlatMap.my_counter_sum
[{"id":"0.FlatMap.my_counter_sum","value":"28201"}]
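For reference, the same REST check can be scripted. The sketch below assumes the endpoint shape of the URL above (/jobs/<job id>/vertices/<vertex id>/metrics?get=...) and the response format shown in the output; fetch_vertex_metrics and parse_metric_values are hypothetical helper names. Several metric ids are passed as one comma-separated get value.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List


def parse_metric_values(body: str) -> Dict[str, str]:
    """Turn the REST response list into {metric id: value}."""
    entries: List[Dict[str, str]] = json.loads(body)
    return {entry["id"]: entry["value"] for entry in entries}


def fetch_vertex_metrics(base_url: str, job_id: str, vertex_id: str,
                         metric_ids: List[str]) -> Dict[str, str]:
    """Query /jobs/<job>/vertices/<vertex>/metrics?get=a,b,... and parse it."""
    query = urllib.parse.urlencode({"get": ",".join(metric_ids)})
    url = f"{base_url}/jobs/{job_id}/vertices/{vertex_id}/metrics?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_metric_values(resp.read().decode("utf-8"))
```

For example, fetch_vertex_metrics("http://localhost:43491", "9a376e28a1bb022b45c127d75fb1b447", "5239a5f0e3e9cdca6a88500e58b5759e", ["0.FlatMap.my_counter_sum"]) should return the same value as the browser query.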
However, none of my custom metrics show up in Datadog, although all the standard Flink metrics do.
This is my Flink configuration for the Datadog reporter:
# Datadog
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.dataCenter: US
metrics.reporter.dghttp.apikey: ${datadog_api_key}
metrics.reporter.dghttp.tags: env:development
# https://docs.datadoghq.com/integrations/flink/#configuration
metrics.scope.jm: flink.jobmanager
metrics.scope.jm.job: flink.jobmanager.job
metrics.scope.tm: flink.taskmanager
metrics.scope.tm.job: flink.taskmanager.job
metrics.scope.task: flink.task
metrics.scope.operator: flink.operator
It's the first time I am trying to send custom metrics from Flink to Datadog.
Am I doing something wrong?
Thanks
I was using the configuration from the Flink 1.15 documentation, but running Flink 1.16.
It works now. These are the changes needed:
+ metrics.reporters: prom,dghttp
+
# Prometheus
- metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
+ metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249
# Datadog
- metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
+ metrics.reporter.dghttp.factory.class: org.apache.flink.metrics.datadog.DatadogHttpReporterFactory
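If several flink-conf.yaml files need the same change, it can be applied mechanically. This is a minimal sketch under stated assumptions: it handles only simple "key: value" lines, knows only the two reporter classes from the diff above (other reporters would need their own entries), and assumes the file has no metrics.reporters line yet. migrate_reporters is a hypothetical helper name.

```python
from typing import List

# Reporter class -> factory class, taken from the diff above.
FACTORY_FOR_CLASS = {
    "org.apache.flink.metrics.prometheus.PrometheusReporter":
        "org.apache.flink.metrics.prometheus.PrometheusReporterFactory",
    "org.apache.flink.metrics.datadog.DatadogHttpReporter":
        "org.apache.flink.metrics.datadog.DatadogHttpReporterFactory",
}


def migrate_reporters(conf_text: str) -> str:
    """Rewrite metrics.reporter.<name>.class lines to .factory.class and
    prepend a metrics.reporters list naming every reporter rewritten."""
    out: List[str] = []
    names: List[str] = []
    for line in conf_text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        parts = key.split(".")
        # Matches keys shaped like metrics.reporter.<name>.class
        is_reporter_class = (
            sep != ""
            and len(parts) == 4
            and parts[0] == "metrics"
            and parts[1] == "reporter"
            and parts[3] == "class"
        )
        if is_reporter_class and value in FACTORY_FOR_CLASS:
            names.append(parts[2])
            out.append(f"metrics.reporter.{parts[2]}.factory.class: "
                       f"{FACTORY_FOR_CLASS[value]}")
        else:
            out.append(line)  # comments and other keys pass through
    if names:
        out.insert(0, "metrics.reporters: " + ",".join(names))
    return "\n".join(out)
```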

Upgrade dependency camel-spring-batch / spring-batch-component

When upgrading from:
https://camel.apache.org/components/2.x/spring-batch-component.html
to:
https://camel.apache.org/components/3.18.x/spring-batch-component.html
the query parameter "synchronous" has been dropped.
Can I assume synchronous=true
for camel-spring-batch 3.18.2?
Looking at the apache-camel project on GitHub, the option was removed because too few components supported a flag of that kind.
The default job launcher in Spring Batch launches jobs synchronously. If you wanted to run a job asynchronously, you would have to configure your own job launcher and use an asynchronous TaskExecutor such as SimpleAsyncTaskExecutor, like below:
@Bean(name = "asyncJobLauncher")
public JobLauncher asyncJobLauncher() throws Exception {
    // jobRepository is assumed to be an injected JobRepository field
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setJobRepository(jobRepository);
    // Run each job on a new thread instead of the caller's thread
    jobLauncher.setTaskExecutor(new SimpleAsyncTaskExecutor());
    jobLauncher.afterPropertiesSet();
    return jobLauncher;
}
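The difference between the two launchers is only where the job runs. As a language-neutral illustration (Python's concurrent.futures standing in for Spring's TaskExecutor; this is an analogy, not Spring Batch code, and launch is a hypothetical name), a launcher without an executor blocks until the job finishes and returns its status, while one with an executor hands the job off and returns a handle immediately.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional, Union


def launch(job: Callable[[], str],
           executor: Optional[ThreadPoolExecutor] = None) -> Union[str, Future]:
    """Run `job` the way a job launcher would (analogy only).

    Without an executor the job runs in the caller's thread and the final
    status is returned (like the synchronous default). With an executor the
    job is submitted and a Future is returned at once (like a launcher
    configured with an async TaskExecutor).
    """
    if executor is None:
        return job()
    return executor.submit(job)
```

So dropping the "synchronous" flag changes nothing as long as the default, executor-less launcher is used.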

Zeppelin Python Flink cannot print to console

I'm using Kinesis Data Analytics Studio which provides a Zeppelin environment.
Very simple code:
%flink.pyflink
from pyflink.common.serialization import JsonRowDeserializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer
# In Zeppelin, s_env is the interpreter's predefined StreamExecutionEnvironment
env = s_env or StreamExecutionEnvironment.get_execution_environment()
env.add_jars("file:///home/ec2-user/flink-sql-connector-kafka_2.12-1.13.5.jar")
# create a kafka consumer
deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(type_info=Types.ROW_NAMED(
        ['id', 'name'],
        [Types.INT(), Types.STRING()])
    ).build()
kafka_consumer = FlinkKafkaConsumer(
    topics='nihao',
    deserialization_schema=deserialization_schema,
    properties={
        'bootstrap.servers': 'kakfa-brokers:9092',
        'group.id': 'group1'
    })
kafka_consumer.set_start_from_earliest()
ds = env.add_source(kafka_consumer)
ds.print()
env.execute('job1')
I can get this working locally and see change logs being printed to the console. However, I cannot get the same result in Zeppelin.
I also checked the task managers' Stdout in the Flink web console; nothing there either.
Am I missing something? I have searched for days and could not find anything on it.
I'm not 100% sure, but I think you may need a sink to begin pulling data through the DataStream; you could potentially use the included print sink function.

Google Cloud Run pubsub pull listener app fails to start

I'm testing a Pub/Sub "pull" subscriber on Cloud Run using just the listener part of this sample Java code (SubscribeAsyncExample, reworked slightly to fit in my Spring Boot app):
https://cloud.google.com/pubsub/docs/quickstart-client-libraries#java_1
It fails to start up during deployment, but while it's trying to start, it does pull items from the Pub/Sub queue. Originally, I had an HTTP "push" receiver (a @RestController) on a different Pub/Sub topic, and that worked fine. Any suggestions? I'm new to Cloud Run. Thanks.
Deploying...
Creating Revision... Cloud Run error: Container failed to start. Failed to start and then listen on the port defined
by the PORT environment variable. Logs for this revision might contain more information....failed
Deployment failed
In logs:
2020-08-11 18:43:22.688 INFO 1 --- [ main] o.s.web.context.ContextLoader : Root WebApplicationContext: initialization completed in 4606 ms
2020-08-11T18:43:25.287759Z Listening for messages on projects/ce-cxmo-dev/subscriptions/AndySubscriptionPull:
2020-08-11T18:43:25.351650801Z Container Sandbox: Unsupported syscall setsockopt(0x18,0x29,0x31,0x3eca02dfd974,0x4,0x28). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/c/linux/amd64/setsockopt for more information.
2020-08-11T18:43:25.351770555Z Container Sandbox: Unsupported syscall setsockopt(0x18,0x29,0x12,0x3eca02dfd97c,0x4,0x28). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/c/linux/amd64/setsockopt for more information.
2020-08-11 18:43:25.680 WARN 1 --- [ault-executor-0] i.g.n.s.i.n.u.internal.MacAddressUtil : Failed to find a usable hardware address from the network interfaces; using random bytes: ae:2c:fb:e7:92:9c:2b:24
2020-08-11T18:45:36.282714Z Id: 1421389098497572
2020-08-11T18:45:36.282763Z Data: We be pub-sub'n in pull mode2!!
Nothing else is logged after this, and the app stops running.
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import org.springframework.stereotype.Component;

@Component
public class AndyTopicPullRecv {
    public AndyTopicPullRecv() {
        // Note: subscribeAsyncExample() blocks in awaitTerminated() until the
        // subscriber stops, so this constructor, and with it Spring's startup,
        // does not complete while messages are being pulled.
        subscribeAsyncExample("ce-cxmo-dev", "AndySubscriptionPull");
    }

    public static void subscribeAsyncExample(String projectId, String subscriptionId) {
        ProjectSubscriptionName subscriptionName =
            ProjectSubscriptionName.of(projectId, subscriptionId);
        // Instantiate an asynchronous message receiver.
        MessageReceiver receiver =
            (PubsubMessage message, AckReplyConsumer consumer) -> {
                // Handle incoming message, then ack the received message.
                System.out.println("Id: " + message.getMessageId());
                System.out.println("Data: " + message.getData().toStringUtf8());
                consumer.ack();
            };
        Subscriber subscriber = null;
        try {
            subscriber = Subscriber.newBuilder(subscriptionName, receiver).build();
            // Start the subscriber.
            subscriber.startAsync().awaitRunning();
            System.out.printf("Listening for messages on %s:\n", subscriptionName.toString());
            // Allow the subscriber to run for 30s unless an unrecoverable error occurs.
            // subscriber.awaitTerminated(30, TimeUnit.SECONDS);
            subscriber.awaitTerminated();
            System.out.printf("Async subscribe terminated on %s:\n", subscriptionName.toString());
        // } catch (TimeoutException timeoutException) {
        } catch (Exception e) {
            // Stop receiving messages; subscriber is still null if build() failed.
            if (subscriber != null) {
                subscriber.stopAsync();
            }
            System.out.println("Async subscriber exception: " + e);
        }
    }
}
Kolban's question is very important! With the code you shared, I would say "No". The Cloud Run contract is clear:
Your service must respond to HTTP requests. Outside of a request, you pay nothing and no CPU is allocated to your instance (the instance sits idle, like a paused daemon, when no request is being processed).
Your service must be stateless (not your case here, but I won't go into that).
If you want to pull from your Pub/Sub subscription, create an endpoint in your code with a REST controller. While you are processing that request, run your pull mechanism and process the messages.
Cloud Scheduler can call this endpoint regularly to keep the process going.
Be careful: the maximum request processing timeout is 15 minutes (at the time of writing; this may change in the near future), so a single run cannot last longer than 15 minutes. Make the process resilient to interruption, and set your scheduler to call your service every 15 minutes.
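The bounded pull loop described above can be sketched as follows. This is a plain-Python illustration, not Pub/Sub client code: a queue.Queue stands in for the subscription and handle for your per-message processing (both hypothetical), and drain_until_deadline is a hypothetical name. The handler keeps processing until nothing arrives within poll_seconds or the time budget is used up; you would set the budget safely below the request timeout so the request returns normally and the next scheduled call continues.

```python
import queue
import time
from typing import Callable


def drain_until_deadline(messages: "queue.Queue[str]",
                         handle: Callable[[str], None],
                         budget_seconds: float,
                         poll_seconds: float = 0.5) -> int:
    """Process messages until the queue stays empty or the budget runs out.

    Returns the number of messages handled. `messages` stands in for a
    subscription; a real handler would pull and ack instead.
    """
    deadline = time.monotonic() + budget_seconds
    handled = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # stop before the request timeout is reached
        try:
            msg = messages.get(timeout=min(poll_seconds, remaining))
        except queue.Empty:
            break  # nothing left for now; the next scheduled call resumes
        handle(msg)
        handled += 1
    return handled
```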

Stuck at: Could not find a suitable table factory

While playing around with Flink, I have been trying to upsert data into Elasticsearch. I'm getting this error on STDOUT:
Caused by: org.apache.flink.table.api.NoMatchingTableFactoryException: Could not find a suitable table factory for 'org.apache.flink.table.factories.TableSinkFactory' in
the classpath.
Reason: Required context properties mismatch.
The following properties are requested:
connector.hosts=http://elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local:9200
connector.index=transfers-sum
connector.key-null-literal=n/a
connector.property-version=1
connector.type=elasticsearch
connector.version=6
format.json-schema={ \"curr_careUnit\": {\"type\": \"text\"}, \"sum\": {\"type\": \"float\"} }
format.property-version=1
format.type=json
schema.0.data-type=VARCHAR(2147483647)
schema.0.name=curr_careUnit
schema.1.data-type=FLOAT
schema.1.name=sum
update-mode=upsert
The following factories have been considered:
org.apache.flink.streaming.connectors.kafka.Kafka09TableSourceSinkFactory
org.apache.flink.table.sinks.CsvBatchTableSinkFactory
org.apache.flink.table.sinks.CsvAppendTableSinkFactory
at org.apache.flink.table.factories.TableFactoryService.filterByContext(TableFactoryService.java:322)
...
Here is what I have in my Scala Flink code:
def main(args: Array[String]) {
// Create streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// Set properties per KafkaConsumer API
val properties = new Properties()
properties.setProperty("bootstrap.servers", "kafka.kafka:9092")
properties.setProperty("group.id", "test")
// Add Kafka source to environment
val myKConsumer = new FlinkKafkaConsumer010[String]("raw.data4", new SimpleStringSchema(), properties)
// Read from beginning of topic
myKConsumer.setStartFromEarliest()
val streamSource = env
.addSource(myKConsumer)
// Transform CSV (with a header row) per Kafka event into a Transfers object
val streamTransfers = streamSource.map(new TransfersMapper())
// create a TableEnvironment
val tEnv = StreamTableEnvironment.create(env)
// register a Table
val tblTransfers: Table = tEnv.fromDataStream(streamTransfers)
tEnv.createTemporaryView("transfers", tblTransfers)
tEnv.connect(
new Elasticsearch()
.version("6")
.host("elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local", 9200, "http")
.index("transfers-sum")
.keyNullLiteral("n/a"))
.withFormat(new Json().jsonSchema("{ \"curr_careUnit\": {\"type\": \"text\"}, \"sum\": {\"type\": \"float\"} }"))
.withSchema(new Schema()
.field("curr_careUnit", DataTypes.STRING())
.field("sum", DataTypes.FLOAT())
)
.inUpsertMode()
.createTemporaryTable("transfersSum")
val result = tEnv.sqlQuery(
"""
|SELECT curr_careUnit, sum(los)
|FROM transfers
|GROUP BY curr_careUnit
|""".stripMargin)
result.insertInto("transfersSum")
env.execute("Flink Streaming Demo Dump to Elasticsearch")
}
}
I am creating a fat jar and uploading it to my remote Flink instance. Here are my build.gradle dependencies:
compile 'org.scala-lang:scala-library:2.11.12'
compile 'org.apache.flink:flink-scala_2.11:1.10.0'
compile 'org.apache.flink:flink-streaming-scala_2.11:1.10.0'
compile 'org.apache.flink:flink-connector-kafka-0.10_2.11:1.10.0'
compile 'org.apache.flink:flink-table-api-scala-bridge_2.11:1.10.0'
compile 'org.apache.flink:flink-connector-elasticsearch6_2.11:1.10.0'
compile 'org.apache.flink:flink-json:1.10.0'
compile 'com.fasterxml.jackson.core:jackson-core:2.10.1'
compile 'com.fasterxml.jackson.module:jackson-module-scala_2.11:2.10.1'
compile 'org.json4s:json4s-jackson_2.11:3.7.0-M1'
Here is how the fatJar task is defined in Gradle:
jar {
from {
(configurations.compile).collect {
it.isDirectory() ? it : zipTree(it)
}
}
manifest {
attributes("Main-Class": "main" )
}
}
task fatJar(type: Jar) {
zip64 true
manifest {
attributes 'Main-Class': "flinkNamePull.Demo"
}
baseName = "${rootProject.name}"
from { configurations.compile.collect { it.isDirectory() ? it : zipTree(it) } }
with jar
}
Could anybody please help me to see what I am missing? I'm fairly new to Flink and data streaming in general. Hehe
Thank you in advance!
Is the list under "The following factories have been considered:" complete? Does it contain Elasticsearch6UpsertTableSinkFactory? If not, as far as I can tell, there is a problem with the service discovery files.
How do you submit your job? Can you check whether the uber jar contains the file META-INF/services/org.apache.flink.table.factories.TableFactory with an entry for Elasticsearch6UpsertTableSinkFactory?
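To check the uber jar without unpacking it by hand, a short script can read that service file directly, since a jar is a zip archive. The function names are hypothetical; the check matches the factory by its simple class name only, so it does not depend on the exact package.

```python
import zipfile
from typing import List

SERVICE_FILE = "META-INF/services/org.apache.flink.table.factories.TableFactory"


def table_factories_in_jar(jar) -> List[str]:
    """Return the class names listed in the jar's TableFactory service file.

    `jar` may be a path or a binary file object.
    """
    with zipfile.ZipFile(jar) as archive:
        if SERVICE_FILE not in archive.namelist():
            return []
        text = archive.read(SERVICE_FILE).decode("utf-8")
    # Service files list one class per line; '#' starts a comment.
    names = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            names.append(name)
    return names


def has_es6_upsert_factory(jar) -> bool:
    return any(name.endswith("Elasticsearch6UpsertTableSinkFactory")
               for name in table_factories_in_jar(jar))
```

Running has_es6_upsert_factory on the fat jar returning False would point to the service-file problem described here.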
When using maven you have to add a transformer to properly merge service files:
<!-- The service transformer is needed to merge META-INF/services files -->
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
I don't know how to do it in Gradle.
EDIT:
Thanks to Arvid Heise
In Gradle, when using the Shadow plugin, you can merge service files via:
// Merging Service Files
shadowJar {
mergeServiceFiles()
}
You should use the shadow plugin to create the fat jar instead of doing it manually.
In particular, you want to merge service descriptors.
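Why merging matters can be shown with a toy model: several dependency jars each ship their own META-INF/services/org.apache.flink.table.factories.TableFactory, but a fat jar holds one file at that path. The sketch below treats each jar as a dict from path to content (a hypothetical representation; merge_service_files is a hypothetical name). Without merging, one copy replaces the other (which copy survives depends on the build tool; here the last one wins, an assumption), so a factory can drop out of discovery. Merging concatenates the entries, which is the idea behind ServicesResourceTransformer and mergeServiceFiles().

```python
from typing import Dict, List


def merge_service_files(jars: List[Dict[str, str]], merge: bool) -> Dict[str, str]:
    """Combine META-INF/services entries from several jars into one fat jar.

    merge=False: a later jar's file replaces an earlier one at the same path.
    merge=True: entries are concatenated and exact duplicates dropped.
    """
    result: Dict[str, str] = {}
    for jar in jars:
        for path, content in jar.items():
            if merge and path in result:
                seen = result[path].splitlines()
                extra = [line for line in content.splitlines() if line not in seen]
                result[path] = "\n".join(seen + extra)
            else:
                result[path] = content
    return result
```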
