Measuring event-time latency with Flink CEP - apache-flink

I have implemented a pattern with Flink CEP that matches three Events such as A->B->C. After I have defined my pattern I generate a
PatternStream<Event> patternStream = CEP.pattern(eventStream, pattern);
with a PatternSelectFunction such that
patternStream.select(new MyPatternSelectFunction()).print();
This works like a charm but I am interested in the event-time of all matched events. I know that the traditional Flink streaming API offers rich functions which allow you to register Flink's internal latency tracker as described in this question. I have also seen that for Flink 1.8 a new RichPatternSelectFunction has been added. But unfortunately I cannot set up Flink 1.8 with Flink CEP.
Finally, is there a way to get the event-time of all matched events?

You don't need Rich Functions to use Flink's latency tracking. You just need to enable it by setting latencyTrackingInterval to a positive number in either the Flink configuration or ExecutionConfig, e.g.,
env.getConfig().setLatencyTrackingInterval(1000);
and then you can observe the results in your metrics solution, or via the REST API (latency metrics are not reported in the Flink web UI).
Documentation
Update:
The latency statistics are job metrics, and are in the list returned by
http://<job_manager_rest_endpoint>/jobs/<job_id>/metrics
Latency metric values can be fetched from
http://<job_manager_rest_endpoint>/jobs/<job_id>/metrics?get=<metric_name>
These metrics have names like
latency.source_id.<ID>.operator_id.<ID>.operator_subtask_index.<SUBTASK>.<metric>
where the IDs identify the source and operator nodes in the job graph between which the latency is being measured.
For example, I can determine the 95th percentile latency between the source and one of the sinks in a job I am running right now with this request:
http://localhost:8081/jobs/94b189a96b98b3aafaba6db6aa8b770b/metrics?get=latency.source_id.bc764cd8ddf7a0cff126f51c16239658.operator_id.fd0ee602f2fa8d310d9bd9f694e185f5.operator_subtask_index.0.latency_p95
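As a rough illustration of how such a request is put together, here is a minimal plain-Java sketch that assembles the metric name and the request URL from their parts. The class and method names (LatencyMetricUrl, metricName, metricUrl) are made up for this example and are not part of Flink; only the name pattern and the endpoint path come from the description above.

```java
public class LatencyMetricUrl {

    // Builds a metric name following the pattern
    // latency.source_id.<ID>.operator_id.<ID>.operator_subtask_index.<SUBTASK>.<metric>
    static String metricName(String sourceId, String operatorId, int subtaskIndex, String metric) {
        return "latency.source_id." + sourceId
                + ".operator_id." + operatorId
                + ".operator_subtask_index." + subtaskIndex
                + "." + metric;
    }

    // Builds the job-level metrics request for a single metric.
    // No URL encoding is applied; the names used here contain only URL-safe characters.
    static String metricUrl(String restEndpoint, String jobId, String metricName) {
        return restEndpoint + "/jobs/" + jobId + "/metrics?get=" + metricName;
    }

    public static void main(String[] args) {
        String name = metricName(
                "bc764cd8ddf7a0cff126f51c16239658",
                "fd0ee602f2fa8d310d9bd9f694e185f5",
                0,
                "latency_p95");
        System.out.println(metricUrl("http://localhost:8081", "94b189a96b98b3aafaba6db6aa8b770b", name));
    }
}
```

The resulting string can be passed to any HTTP client (curl, a browser, java.net.http.HttpClient).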
Alternatively, you could use a ProcessFunction to add processing time timestamps to your events before they enter the CEP part of your job, and then use another ProcessFunction afterwards to measure the elapsed time.
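The stamping idea can be sketched in plain Java, without the Flink API. Everything here (ElapsedTimeSketch, Stamped, stamp, elapsedMillis) is a hypothetical stand-in: in a real job, stamp would live in the ProcessFunction before CEP and elapsedMillis in the one after it, with the clock value taken from something like ctx.timerService().currentProcessingTime().

```java
public class ElapsedTimeSketch {

    // Wrapper that carries the processing time at which an event entered the pipeline.
    static final class Stamped<T> {
        final T value;
        final long ingestMillis;

        Stamped(T value, long ingestMillis) {
            this.value = value;
            this.ingestMillis = ingestMillis;
        }
    }

    // What the first ProcessFunction would do: attach the current processing time.
    static <T> Stamped<T> stamp(T value, long nowMillis) {
        return new Stamped<>(value, nowMillis);
    }

    // What the second ProcessFunction would do: compare against the current processing time.
    static long elapsedMillis(Stamped<?> stamped, long nowMillis) {
        return nowMillis - stamped.ingestMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        Stamped<String> event = stamp("A", System.currentTimeMillis());
        Thread.sleep(20); // stands in for the time spent in the CEP operator
        System.out.println("elapsed ms: " + elapsedMillis(event, System.currentTimeMillis()));
    }
}
```

Note that the two timestamps may be taken on different machines, so the result is only as accurate as the clock synchronization between them.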

Related

Latency Monitoring in Flink 1.14

I am following this Flink tutorial for reactive scaling and am interested in knowing how overall end-to-end latencies are affected by such rapid changes in the number of worker nodes. As per the documentation, I have added metrics.latency.interval: 1000 to the config map with the understanding that a new latency metric will be added, with markers being sent every second. However, I cannot seem to find the corresponding histogram in Prometheus. Listed below are the available metrics associated with latency:
I am using Flink 1.14. Is there something which I am missing?
I suspect that something happened to the latency metric between releases 1.13.2 and 1.14. As of now, I am not able to see the latency metrics from Flink after migrating to 1.14, despite setting the latency interval to a positive number. Have you tried 1.13.2?
Further exploration led me to believe that it is the migration to the KafkaSource / KafkaSink classes, as opposed to the deprecated FlinkKafkaConsumer and FlinkKafkaProducer, that actually made the latency metric disappear. Currently, I am seeing the latency measures on Flink 1.14, but only when using the deprecated Kafka source and sink.

Apache Flink Watermark Strategies

We are building a stream processing pipeline to process/ingest Kafka messages. And we are using Flink v1.12.2. While defining a source watermark strategy, in the official documentation, I came across two out-of-the-box watermark strategies; forBoundedOutOfOrderness and forMonotonousTimestamps. I did go through javadoc, but did not fully understand when and why you should use one strategy over the other. Timestamps are based on event-time. Thanks.
You should use forMonotonousTimestamps if the timestamps are never out-of-order, or if you are willing for all out-of-order events to be considered late. On the other hand, if out-of-order timestamps are normal for your application, then you should use forBoundedOutOfOrderness.
With Kafka, if you have the Kafka source operator apply the watermark strategy (recommended), then it will apply the strategy to each partition separately. In that case, each instance of the Kafka source will produce watermarks that are the minimum of the per-partition watermarks (for the partitions handled by that instance). You can then use forMonotonousTimestamps if the timestamps are in order within each partition (which will be the case, for example, if you are consuming from a producer that is using log-append timestamps).
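To make the per-partition behavior concrete, here is a small plain-Java simulation. It does not use the Flink API; the class PerPartitionWatermarkSketch and its methods are invented for illustration. It assumes monotonous timestamps within each partition, so each partition's watermark is its largest timestamp minus 1, and it ignores partitions that have not produced data yet (in a real job those would hold the watermark back unless idleness handling is configured).

```java
import java.util.HashMap;
import java.util.Map;

public class PerPartitionWatermarkSketch {

    // Largest timestamp seen so far in each partition.
    private final Map<Integer, Long> maxTimestampPerPartition = new HashMap<>();

    void onEvent(int partition, long timestamp) {
        maxTimestampPerPartition.merge(partition, timestamp, Math::max);
    }

    // The source instance's watermark is the minimum over its partitions' watermarks.
    long combinedWatermark() {
        if (maxTimestampPerPartition.isEmpty()) {
            return Long.MIN_VALUE; // nothing seen yet
        }
        long min = Long.MAX_VALUE;
        for (long maxTimestamp : maxTimestampPerPartition.values()) {
            min = Math.min(min, maxTimestamp - 1);
        }
        return min;
    }

    public static void main(String[] args) {
        PerPartitionWatermarkSketch source = new PerPartitionWatermarkSketch();
        // Each partition is in order, but the interleaved stream is not.
        source.onEvent(0, 1000);
        source.onEvent(0, 1010);
        source.onEvent(1, 1003);
        source.onEvent(1, 1005);
        source.onEvent(0, 1020);
        source.onEvent(1, 1015);
        System.out.println(source.combinedWatermark());
    }
}
```

Because the combined watermark follows the slowest partition, none of the interleaved events above ends up behind it, even though the merged stream is out of order.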
You want to use forMonotonousTimestamps whenever possible, since it minimizes latency and simplifies things.
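The difference between the two strategies can be seen in a small plain-Java simulation. This is a sketch, not Flink code: WatermarkSketch and lateEvents are hypothetical names, and the watermark is recomputed after every event, whereas Flink emits watermarks periodically. It relies on the watermark being the largest timestamp seen so far, minus the out-of-orderness bound, minus 1, with forMonotonousTimestamps behaving like a bound of 0.

```java
import java.util.ArrayList;
import java.util.List;

public class WatermarkSketch {

    // Returns the timestamps that arrive at or behind the current watermark,
    // i.e. the events that would be considered late.
    static List<Long> lateEvents(long[] timestamps, long boundMillis) {
        List<Long> late = new ArrayList<>();
        // Start low enough that the first watermark is Long.MIN_VALUE without underflow.
        long maxTimestamp = Long.MIN_VALUE + boundMillis + 1;
        for (long timestamp : timestamps) {
            long watermark = maxTimestamp - boundMillis - 1;
            if (timestamp <= watermark) {
                late.add(timestamp);
            }
            maxTimestamp = Math.max(maxTimestamp, timestamp);
        }
        return late;
    }

    public static void main(String[] args) {
        long[] timestamps = {1000, 1005, 1002, 1010};
        // Bound 0 (monotonous): the out-of-order event 1002 is late.
        System.out.println(lateEvents(timestamps, 0));
        // Bound 5 ms: 1002 is still within the allowed out-of-orderness.
        System.out.println(lateEvents(timestamps, 5));
    }
}
```

The price of the larger bound is that the watermark trails the newest event by that amount, which delays everything that waits on it.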

Does my Flink application need watermarks? If not, do I need WatermarkStrategy.noWatermarks?

I'm not sure if my Flink application actually requires Watermarks. When are they necessary?
And if I don't need them, what is the purpose of WatermarkStrategy.noWatermarks()?
A Watermark for time t marks a location in a data stream and asserts that the stream, at that point, is now complete up through time t.
The only purpose watermarks serve is to trigger the firing of event-time-based timers.
Event-time-based timers are directly exposed by the KeyedProcessFunction API, and are also used internally by
event-time Windows
the CEP (pattern-matching) library, which uses Watermarks to sort the incoming stream(s) if you specify you want to do event-time based processing
Flink SQL, again only when doing event-time-based processing: e.g., ORDER BY, versioned table joins, windows, MATCH_RECOGNIZE, etc.
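How a watermark triggers event-time timers can be sketched in plain Java. This is a toy model rather than Flink's implementation, and EventTimeTimerSketch, registerTimer, and advanceWatermark are names made up for the example: timers wait in a priority queue, and each time the watermark advances, every timer whose timestamp is at or before the watermark fires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class EventTimeTimerSketch {

    // Pending timers, ordered by the event time at which they should fire.
    private final PriorityQueue<Long> timers = new PriorityQueue<>();

    void registerTimer(long timestamp) {
        timers.add(timestamp);
    }

    // Fires (removes and returns) all timers with timestamp <= watermark, in order.
    List<Long> advanceWatermark(long watermark) {
        List<Long> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.peek() <= watermark) {
            fired.add(timers.poll());
        }
        return fired;
    }

    public static void main(String[] args) {
        EventTimeTimerSketch operator = new EventTimeTimerSketch();
        operator.registerTimer(2000);
        operator.registerTimer(1000);
        operator.registerTimer(3000);
        System.out.println(operator.advanceWatermark(1500));
        System.out.println(operator.advanceWatermark(3000));
    }
}
```

If no watermark ever arrives, advanceWatermark is never called and the timers never fire, which is why event-time windows and CEP produce no output without watermarks.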
Common cases where you don't need watermarks include applications that only rely on processing time, or when doing batch processing. Or when processing data that has timestamps, but never relying on event-time timers (e.g., simple event-by-event processing).
Flink's new source interface, introduced by FLIP-27, does require a WatermarkStrategy:
env.fromSource(source, watermarkStrategy, sourceName);
In cases where you don't actually need watermarks, you can use WatermarkStrategy.noWatermarks() in this interface.

How to get the throughput of KafkaSource in Flink?

I want to know the throughput of KafkaSource. In other words, I want to measure the speed at which Flink reads data. My idea is to add a map operator after the source and use the built-in metrics in the map operator. Will this increase the overhead? I hope to get this metric without adding a lot of overhead. What should I do? Is there a way to get the output throughput of this topic in Kafka? Or should I get KafkaSource's NumberOutPersecond through the REST API?
Take a look at Kafka Manager which displays a lot of metrics related to Kafka. It's a tool which is used to manage Kafka and acts as a real-time dashboard. You need to install and configure this separately.
This can be used to check the consumption rate for your Flink consumer.
You can also use the built-in metrics that the source operator already publishes, instead of adding a map operator only for that purpose.
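If only a record counter is available (for example, two readings of a numRecordsOut-style counter taken from your metrics system), the rate can be derived from the difference. The following plain-Java sketch shows the arithmetic; ThroughputSketch and recordsPerSecond are hypothetical names, and the sample values are made up.

```java
public class ThroughputSketch {

    // Average records per second between two readings of a monotonically increasing counter.
    static double recordsPerSecond(long countStart, long timeStartMillis,
                                   long countEnd, long timeEndMillis) {
        long elapsedMillis = timeEndMillis - timeStartMillis;
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("end time must be after start time");
        }
        return (countEnd - countStart) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // 500,000 records read over 5 seconds.
        System.out.println(recordsPerSecond(0, 0, 500_000, 5_000));
    }
}
```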

Using Apache Flink for data streaming

I am building an application with the requirements below, and I am just getting started with Flink.
Ingest data into Kafka with say 50 partitions (Incoming rate - 100,000 msgs/sec)
Read data from Kafka and process each data (Do some computation, compare with old data etc) real time
Store the output on Cassandra
I was looking for a real-time streaming platform and found Flink to be a great fit for both real-time and batch processing.
Do you think Flink is the best fit for my use case, or should I use Storm, Spark Streaming, or another streaming platform?
Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there another way to perform a sequence of steps for real-time streaming?
If each computation takes about 20 milliseconds, how can I design it with Flink to get better throughput?
Can I use Redis or Cassandra to fetch some data within Flink for each computation?
Will I be able to use a JVM in-memory cache inside Flink?
Also, can I aggregate data based on a key over some time window (for example, 5 seconds)? For example, let's say there are 100 messages coming in and 10 messages have the same key. Can I group all messages with the same key together and process them?
Are there any tutorials on best practices using flink?
Thanks and appreciate all your help.
Given your task description, Apache Flink looks like a good fit for your use case.
In general, Flink provides low latency and high throughput and has a parameter to tune these. You can read and write data from and to Redis or Cassandra. However, you can also store state internally in Flink. Flink also has sophisticated support for windows. You can read the blog on the Flink website, check out the documentation for more information, or follow this Flink training to learn the API.
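The grouping by key over a 5-second window asked about above can be sketched in plain Java. This only simulates the assignment logic of a keyed tumbling window and is not Flink code; TumblingWindowSketch, windowStart, and countPerKeyAndWindow are names invented here, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowSketch {

    // Start of the tumbling window that a (non-negative) timestamp falls into.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    // Counts messages per (key, window), using "key@windowStart" as the bucket name.
    static Map<String, Integer> countPerKeyAndWindow(String[] keys, long[] timestamps,
                                                     long windowSizeMillis) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            String bucket = keys[i] + "@" + windowStart(timestamps[i], windowSizeMillis);
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "a"};
        long[] timestamps = {1000, 2000, 4999, 5000};
        // Two "a" messages share the window starting at 0; the third falls into the next one.
        System.out.println(countPerKeyAndWindow(keys, timestamps, 5000));
    }
}
```

In a Flink job the same effect would come from keying the stream and applying a 5-second tumbling window; the sketch only shows which messages end up together.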
