We are building a stream processing pipeline to ingest and process Kafka messages, and we are using Flink v1.12.2. While defining a source watermark strategy, I came across two out-of-the-box watermark strategies in the official documentation: forBoundedOutOfOrderness and forMonotonousTimestamps. I did go through the Javadoc, but did not fully understand when and why you should use one strategy over the other. Timestamps are based on event time. Thanks.
You should use forMonotonousTimestamps if the timestamps are never out-of-order, or if you are willing for all out-of-order events to be considered late. On the other hand, if out-of-order timestamps are normal for your application, then you should use forBoundedOutOfOrderness.
With Kafka, if you are having the Kafka source operator apply the watermark strategy (recommended), then it will apply the strategy to each partition separately. In that case, each instance of the Kafka source will produce watermarks that are the minimum of the per-partition watermarks (for the partitions handled by that instance). In this case, you can use forMonotonousTimestamps if the timestamps are in order within each partition (which will be the case, for example, if you are consuming from a producer that uses log-append timestamps).
You want to use forMonotonousTimestamps whenever possible, since it minimizes latency and simplifies things.
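For illustration, here is a minimal sketch of wiring this up in Flink 1.12, applying the strategy on the Kafka consumer so that watermarking happens per partition inside the source; the topic name, broker address, and group id are placeholders:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class KafkaWatermarkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", "example-group");           // placeholder group id

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("example-topic", new SimpleStringSchema(), props);

        // Assigning the strategy on the consumer (rather than on the stream) makes
        // the source generate per-partition watermarks from the Kafka record timestamps.
        consumer.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
        // If timestamps can be out of order within a partition, use something like:
        // consumer.assignTimestampsAndWatermarks(
        //         WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)));

        env.addSource(consumer).print();
        env.execute("kafka-per-partition-watermarks");
    }
}
```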
Related
We have a Flink pipeline aggregating data per "client" by combining data with identical keys ("client-id") and within the same window.
The problem is trivially parallelizable, and the input Kafka topic has a few partitions (the same number as the Flink parallelism), each holding a subset of the clients. I.e., a single client's data is always in a specific Kafka partition.
Does Flink take advantage of this automatically, or will it reshuffle the keys? And if the latter is true, can we somehow avoid the reshuffle and keep the data local to each operator as assigned by the input partitioning?
Note: we are actually using Apache Beam with the Flink backend, but I tried to simplify the question as much as possible. Beam uses FixedWindows followed by Combine.perKey.
I'm not familiar with the internals of the Flink runner for Beam, but assuming it is using a Flink keyBy, then this will involve a network shuffle. This could be avoided, but only rather painfully by reimplementing the job to use low-level Flink primitives, rather than keyed windows and keyed state.
Flink does offer reinterpretAsKeyedStream, which can be used to avoid unnecessary shuffles, but it can only be applied in situations where the existing partitioning exactly matches what keyBy would do -- and I see no reason to think that would apply here.
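For completeness, here is a rough sketch of what using it looks like. The data below is just a stand-in; in a real job the stream would have to already be partitioned exactly as keyBy on the same key selector would partition it, otherwise the results are silently wrong:

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReinterpretExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a stream that is (in a real job) already partitioned by client id.
        DataStream<Tuple2<String, Long>> events =
                env.fromElements(Tuple2.of("client-1", 1L), Tuple2.of("client-2", 2L));

        // Tell Flink to treat the existing partitioning as if keyBy had been applied,
        // so no network shuffle is inserted before the keyed operators that follow.
        KeyedStream<Tuple2<String, Long>, String> byClient =
                DataStreamUtils.reinterpretAsKeyedStream(
                        events,
                        new KeySelector<Tuple2<String, Long>, String>() {
                            @Override
                            public String getKey(Tuple2<String, Long> e) {
                                return e.f0;
                            }
                        });

        byClient.sum(1).print();
        env.execute("reinterpret-as-keyed-stream");
    }
}
```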
I'm not sure if my Flink application actually requires Watermarks. When are they necessary?
And if I don't need them, what is the purpose of WatermarkStrategy.noWatermarks()?
A Watermark for time t marks a location in a data stream and asserts that the stream, at that point, is now complete up through time t.
The only purpose watermarks serve is to trigger the firing of event-time-based timers.
Event-time-based timers are directly exposed by the KeyedProcessFunction API, and are also used internally by:
- event-time Windows
- the CEP (pattern-matching) library, which uses Watermarks to sort the incoming stream(s) if you specify that you want to do event-time-based processing
- Flink SQL, again only when doing event-time-based processing: e.g., ORDER BY, versioned table joins, windows, MATCH_RECOGNIZE, etc.
Common cases where you don't need watermarks include applications that rely only on processing time, batch processing, and jobs that process timestamped data but never rely on event-time timers (e.g., simple event-by-event processing).
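To make the connection between watermarks and timers concrete, here is a minimal sketch (the class name and the 10-second delay are made up for illustration) of a KeyedProcessFunction registering an event-time timer; onTimer only runs once a watermark passes the registered timestamp, so without watermarks it never fires:

```java
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class DelayedReminder extends KeyedProcessFunction<String, String, String> {

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        // Assumes the incoming records carry event-time timestamps.
        Long eventTime = ctx.timestamp();
        if (eventTime != null) {
            // Ask to be called back 10 seconds (in event time) after this event.
            ctx.timerService().registerEventTimeTimer(eventTime + 10_000);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // Only reached once a watermark >= timestamp has arrived.
        out.collect("timer fired for key " + ctx.getCurrentKey() + " at " + timestamp);
    }
}
```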
Flink's new source interface, introduced by FLIP-27, does require a WatermarkStrategy:
env.fromSource(source, watermarkStrategy, sourceName);
In cases where you don't actually need watermarks, you can use WatermarkStrategy.noWatermarks() in this interface.
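For instance, a minimal, runnable sketch might look like this; NumberSequenceSource is just a stand-in FLIP-27 source used here so the example is self-contained:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.lib.NumberSequenceSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NoWatermarksExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // fromSource always requires a WatermarkStrategy; when the job never uses
        // event-time timers, noWatermarks() makes that explicit.
        env.fromSource(
                new NumberSequenceSource(1, 100),
                WatermarkStrategy.noWatermarks(),
                "numbers")
           .print();

        env.execute("no-watermarks-example");
    }
}
```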
I am using Flink for a log data pipeline with a data volume of about 100 GB per day.
Flink's checkpointing defaults to exactly-once, but I worry that this will affect latency.
Is exactly-once necessary? How about at-least-once?
If the computation you perform downstream is idempotent (i.e., duplicates do not change the result of your computation), then at-least-once will be more performant (as it requires less synchronisation).
For more info see here.
Is duplicated delivery a problem for your application? Then you may want to use exactly-once. Else at-least-once will offer better performance.
From experience: we started with exactly-once, but soon switched to at-least-once. You may be able to control duplication within Flink, but it is almost impossible to avoid throughout the system. Thus, we built all of our actions to be (testably) idempotent and turned exactly-once off.
Checkpoints in Flink are implemented via a variant of the Chandy/Lamport asynchronous barrier snapshotting algorithm. Docs.
Before Flink 1.11, the only difference between "exactly-once" and "at-least-once" was that exactly-once required barrier alignment on any operator with multiple inputs. This alignment tends to increase latency; how much depends on the job.
Flink 1.11 has introduced the option of unaligned checkpoints. This alternative implementation of exactly-once helps in some cases, sometimes by a lot. Docs.
If you aren't happy with the thought of coping with some duplication during recovery, then you may want to do some benchmarking with your app.
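For reference, a minimal sketch of switching the mode (the 60-second interval is just an example value):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // At-least-once: no barrier alignment, so less checkpoint-induced latency,
        // at the price of possible duplicates during recovery.
        env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);

        // Alternatively, keep exactly-once but enable unaligned checkpoints (Flink 1.11+)
        // to reduce alignment delays under backpressure:
        // env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // env.getCheckpointConfig().enableUnalignedCheckpoints();
    }
}
```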
I am using the Apache Flink processElement1, processElement2 and onTimer streaming design pattern for implementing a timeout use case. I observed that the throughput of the system decreased by orders of magnitude when I included the timeout functionality.
Any hints on the internal implementation of onTimer in Flink: is it one thread per keyed stream (unlikely), or a pool of threads (or a single thread) that continuously polls buffered callbacks and picks up the timed-out ones for execution?
To the best of my knowledge, Flink is based on the actor model and reactive patterns (Akka), which encourages the judicious use of a few non-blocking threads, and hence one thread per keyed stream for onTimer or any other pattern is typically not used!
There are two kinds of timers in Flink: event-time and processing-time timers. The implementations are completely different, but in neither case should you see a significant performance impact. Something else must be going on. Can you share a small, reproducible example, or at least show us more of what's going on and how you are taking the measurements?
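For context, here is a rough sketch of the kind of two-stream timeout pattern being described (the class, state, and timeout values are made up); note that timers registered this way live in the timer service of the keyed state backend, not in per-key threads:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Emit the key if no matching event arrives on the second stream
// within 60 seconds of seeing it on the first stream.
public class TimeoutFunction
        extends KeyedCoProcessFunction<String, String, String, String> {

    private transient ValueState<Long> deadline;

    @Override
    public void open(Configuration parameters) {
        deadline = getRuntimeContext().getState(
                new ValueStateDescriptor<>("deadline", Long.class));
    }

    @Override
    public void processElement1(String event, Context ctx, Collector<String> out) throws Exception {
        long when = ctx.timerService().currentProcessingTime() + 60_000;
        deadline.update(when);
        // Timers are deduplicated per key and timestamp, and stored in the
        // timer service (heap or RocksDB), not handled by dedicated threads.
        ctx.timerService().registerProcessingTimeTimer(when);
    }

    @Override
    public void processElement2(String event, Context ctx, Collector<String> out) throws Exception {
        Long when = deadline.value();
        if (when != null) {
            ctx.timerService().deleteProcessingTimeTimer(when);
            deadline.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        out.collect("timeout for key " + ctx.getCurrentKey());
    }
}
```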
I have implemented a pattern with Flink CEP that matches three Events such as A->B->C. After I have defined my pattern I generate a
PatternStream<Event> patternStream = CEP.pattern(eventStream, pattern);
with a PatternSelectFunction such that
patternStream.select(new MyPatternSelectFunction()).print();
This works like a charm but I am interested in the event-time of all matched events. I know that the traditional Flink streaming API offers rich functions which allow you to register Flink's internal latency tracker as described in this question. I have also seen that for Flink 1.8 a new RichPatternSelectFunction has been added. But unfortunately I cannot set up Flink 1.8 with Flink CEP.
Finally, is there a way to get the event-time of all matched events?
You don't need Rich Functions to use Flink's latency tracking. You just need to enable it by setting latencyTrackingInterval to a positive number in either the Flink configuration or ExecutionConfig, e.g.,
env.getConfig().setLatencyTrackingInterval(1000);
and then you can observe the results in your metrics solution, or via the REST API (latency metrics are not reported in the Flink web UI).
Documentation
Update:
The latency statistics are job metrics, and are in the list returned by
http://<job_manager_rest_endpoint>/jobs/<job_id>/metrics
Latency metric values can be fetched from
http://<job_manager_rest_endpoint>/jobs/<job_id>/metrics?get=<metric_name>
These metrics have names like
latency.source_id.<ID>.operator_id.<ID>.operator_subtask_index.<SUBTASK>.<metric>
where the IDs identify the source and operator nodes in the job graph between which the latency is being measured.
For example, I can determine the 95th percentile latency between the source and one of the sinks in a job I am running right now with this request:
http://localhost:8081/jobs/94b189a96b98b3aafaba6db6aa8b770b/metrics?get=latency.source_id.bc764cd8ddf7a0cff126f51c16239658.operator_id.fd0ee602f2fa8d310d9bd9f694e185f5.operator_subtask_index.0.latency_p95
Alternatively, you could use a ProcessFunction to add processing time timestamps to your events before they enter the CEP part of your job, and then use another ProcessFunction afterwards to measure the elapsed time.
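A rough sketch of that idea, with made-up class names and a generic payload (in a real CEP job you would carry the stamp through the matched events into the select result before measuring):

```java
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical wrapper: the original payload plus the processing time at which it entered the job.
public class Stamped<T> {
    public T payload;
    public long enteredAt;

    public Stamped() {}

    public Stamped(T payload, long enteredAt) {
        this.payload = payload;
        this.enteredAt = enteredAt;
    }
}

// Before the CEP operator: attach the current processing time to each event.
class StampEvents<T> extends ProcessFunction<T, Stamped<T>> {
    @Override
    public void processElement(T value, Context ctx, Collector<Stamped<T>> out) {
        out.collect(new Stamped<>(value, ctx.timerService().currentProcessingTime()));
    }
}

// After the CEP operator: emit how long the match took, measured from the stamped entry time.
class MeasureLatency<T> extends ProcessFunction<Stamped<T>, Long> {
    @Override
    public void processElement(Stamped<T> value, Context ctx, Collector<Long> out) {
        out.collect(ctx.timerService().currentProcessingTime() - value.enteredAt);
    }
}
```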