I am writing a Flink application that consumes time series data from a Kafka
topic. Each time series record has a metric name, a tag key/value
pair, a timestamp, and a value. I have created a tumbling window to aggregate
data based on a metric key (a combination of the metric name, the key/value
pair, and the timestamp). Here is what the main stream looks like:
kafka source -> Flat Map which parses and emits Metric -> Key by metric
key -> Tumbling window of 60 seconds -> Aggregate the data -> write to the
sink.
I also want to check if there is any metric which arrived late outside the
above window. I want to check how many metrics arrived late and calculate
the percentage of late metrics compared to original metrics. I am thinking
of using the "allowedLateness" feature of Flink to send the late metrics to
a different stream. I am planning to add a "MapState" in the main
"Aggregate the data" operator which will have the key as the metric key and
value as the count of the metrics that arrived in the main window.
kafka source -> Flat Map which parses and emits Metric -> Key by metric key
-> Tumbling window of 60 seconds -> Aggregate the data (Maintain a map
state of metric count) -> write to the sink.
    \-> Late data -> Key by metric key -> Collect late metrics and find the
        percentage of late metrics -> Write the result to the sink
My question is: can the "Collect late metrics and find the percentage of late
metrics" operator access the "MapState" that was updated by the
main stream? Even though both are keyed by the same metric key, I guess they
are two different tasks. I want to calculate (number of late metrics /
(number of late metrics + number of metrics that arrived on time)).
There are several different ways you could approach this.
You could store the per-window state in the KeyedStateStore returned by windowState() on the Context passed to your ProcessWindowFunction. Used in combination with allowedLateness, you could compute the late event statistics as late firings occur. (No need for MapState with this approach, since the windowState is already scoped to a specific window and specific key. ValueState will suffice.)
Another idea would be to capture a side output stream of the late events from the primary window and send those late events through another window that counts them over some time frame. Then send both that late event analytics stream and the output of the first (main) window into a KeyedCoProcessFunction (or RichCoFlatMapFunction) that can compute the late event vs on-time event statistics. (Here you will need MapState, since you may need to have several windows open simultaneously for each key of the keyed stream.)
Or you could use a simple process function to split the initial stream into two (by comparing the timestamps to the current watermark) -- one for the late and another for the not-late events -- and then use Flink SQL to compute all of the statistics.
Or just implement the whole thing in one KeyedProcessFunction. See https://ci.apache.org/projects/flink/flink-docs-stable/docs/learn-flink/event_driven/ for an example.
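As a rough illustration of the bookkeeping that all of these approaches share, here is a minimal plain-Java sketch with no Flink dependency. The class LateRatioTracker and its methods are hypothetical names made up for this example. In a real job the two counters would live in Flink keyed state (for instance a ValueState obtained from windowState()), and the watermark would come from the function's context rather than being passed in by hand.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch (not Flink API) of per-key, per-window late-event bookkeeping. */
final class LateRatioTracker {
    // 60-second tumbling windows, as in the question.
    private static final long WINDOW_SIZE_MS = 60_000L;

    // counts[0] = on-time events, counts[1] = late events, keyed by "metricKey@windowStart".
    private final Map<String, long[]> counts = new HashMap<>();

    /** Start of the tumbling window containing the timestamp (assumes non-negative timestamps). */
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    /** Records one event; it counts as late if the watermark has already passed its window. */
    void record(String metricKey, long eventTimeMs, long currentWatermarkMs) {
        long start = windowStart(eventTimeMs);
        long maxTimestamp = start + WINDOW_SIZE_MS - 1; // last timestamp belonging to the window
        boolean late = maxTimestamp <= currentWatermarkMs;
        long[] c = counts.computeIfAbsent(metricKey + "@" + start, k -> new long[2]);
        c[late ? 1 : 0]++;
    }

    /** late / (late + onTime) for one key and window, or 0.0 if nothing was recorded. */
    double lateFraction(String metricKey, long windowStartMs) {
        long[] c = counts.get(metricKey + "@" + windowStartMs);
        if (c == null || c[0] + c[1] == 0) {
            return 0.0;
        }
        return (double) c[1] / (c[0] + c[1]);
    }
}
```

Keeping both counters under the same key and window is what lets a single operator compute the ratio, which sidesteps the question of sharing state between two different tasks.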
Related
I'm using Flink with a Kinesis source and event-time keyed windows. The application will be listening to a live stream of data, windowing it (event-time windows), and processing each keyed stream. I have another use case where I also need to support backfill of older data for certain key streams (these will be new key streams with event time < watermark).
Given that I'm using watermarks, this poses a problem, since Flink doesn't support per-key watermarks. Any keyed stream used for backfill will end up being ignored, because the event time for this stream will be < the application watermark maintained by the live stream.
I have gone through other similar questions but wasn't able to get a possible approach.
Here are possible approaches I'm considering but still have some open questions.
Possible Approach - 1
(i) Maintain a copy of the application specifically for backfill purpose. The backfill job will happen rarely (~ a few times a month). The stream of data sent to the application copy will have an indicator for start and stop in the stream. Using that I plan on starting / resetting the watermark.
Open question: is it possible to reset the watermark using an indicator from the stream? I understand that this is not best practice, but I can't think of an alternative solution.
Follow-up to: Clear Flink watermark state in DataStream [No definitive solution provided.]
Possible Approach - 2
Have a parallel instance for each key, since it's possible to have a different watermark per task. -> Not going with this since I'll have > 5k keyed streams.
Let me know if any other details are needed.
You can address this by running the backfill jobs in BATCH execution mode. When the DataStream API operates in batch mode, the input is bounded (finite), and known in advance. This allows Flink to sort the input by key and by timestamp, and the processing will proceed correctly according to event time without any concern for watermarks or late events.
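To make the contrast concrete, here is a simplified plain-Java simulation, not Flink code. BackfillSketch, Event, and both method names are hypothetical. The first method mimics a streaming job with one shared watermark, and the second mimics what sorting a bounded input by key and timestamp achieves. In an actual Flink job (1.12 or later) the mode is selected with env.setRuntimeMode(RuntimeExecutionMode.BATCH) or the execution.runtime-mode setting; the code below only illustrates why the late-data problem disappears.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation (not Flink API) of shared-watermark vs. sorted-per-key processing. */
final class BackfillSketch {
    /** Hypothetical event type for illustration. */
    static final class Event {
        final String key;
        final long timestampMs;

        Event(String key, long timestampMs) {
            this.key = key;
            this.timestampMs = timestampMs;
        }
    }

    /** One shared watermark follows the largest timestamp seen; events for closed windows are dropped. */
    static int acceptedWithSharedWatermark(List<Event> arrivalOrder, long windowSizeMs) {
        long watermark = Long.MIN_VALUE;
        int accepted = 0;
        for (Event e : arrivalOrder) {
            long windowEnd = e.timestampMs - (e.timestampMs % windowSizeMs) + windowSizeMs;
            if (windowEnd - 1 > watermark) {
                accepted++; // the window is still open
            }
            watermark = Math.max(watermark, e.timestampMs - 1);
        }
        return accepted;
    }

    /** Each key is handled on its own with its events sorted by timestamp, so nothing is late. */
    static int acceptedAfterSortingPerKey(List<Event> boundedInput, long windowSizeMs) {
        Map<String, List<Event>> byKey = new TreeMap<>();
        for (Event e : boundedInput) {
            byKey.computeIfAbsent(e.key, k -> new ArrayList<>()).add(e);
        }
        int accepted = 0;
        for (List<Event> events : byKey.values()) {
            events.sort(Comparator.comparingLong(ev -> ev.timestampMs));
            accepted += acceptedWithSharedWatermark(events, windowSizeMs);
        }
        return accepted;
    }
}
```

With two live events followed by two much older backfill events, the shared-watermark pass drops the backfill events, while the sorted per-key pass keeps all four.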
We're working on calculating a max concurrent count for different types of events within a 1-minute tumbling time window.
These events are like sensor data, collected from our desktop agents every minute. However, some agents report a bad timestamp, for example one that is several hours in the future.
So my question is how to handle or drop these events. Currently I just apply a
filter(s => s.ct.getTime < now) predicate to exclude them.
My first question: if I don't do this, I suspect that such a bad "future" event would trigger the window calculation even for windows whose data is still incomplete. Is that right?
My second question: is there a better way to prevent this?
Thanks
Interesting use case.
So first some background, then some solutions:
Windows in Flink do not fire based on timestamps but based on watermarks. There is a close connection between the two, and often it's okay to treat them the same when it comes to window firing, but in this case it's important to keep them clearly separate. So yes, your suspicion is probably valid if you use a watermark generator that is strictly bound to the timestamp.
So with that in mind, you have a few options:
1. Filter invalid events (timestamp > now())
2. Adjust the timestamp (timestamp = min(timestamp, now())), or find out why specific sensors are off (timezone issues?) and fix that
3. Use a more sophisticated watermark generator
I think the first two options are straightforward, and I would personally go for the second (fixing data is always good). Let's focus on the watermark generator.
There is basically no limit on how you generate watermarks - you can rely on your imagination. Here are some ideas:
Only advance the watermark once you have seen X events with a timestamp greater than the current watermark.
Use some low pass filter = slow moving average.
Ignore events with timestamp > now() (so filter only for watermark generation).
...
I'd be happy to hear which way you have chosen and I can help you further down.
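As one possible reading of the "ignore events with timestamp > now() for watermark generation" idea above, here is a minimal plain-Java sketch. The class FutureTolerantWatermarks and the maxFutureSkewMs tolerance are assumptions made for this example. In Flink the same logic would go into a custom WatermarkGenerator (its onEvent and onPeriodicEmit methods); the sketch passes the wall clock in explicitly so that it can be exercised without Flink.

```java
/** Plain-Java sketch (not Flink API) of a watermark that ignores implausible future timestamps. */
final class FutureTolerantWatermarks {
    private final long maxOutOfOrdernessMs;
    private final long maxFutureSkewMs; // hypothetical tolerance for slightly fast clocks
    private long maxTimestampMs = Long.MIN_VALUE;

    FutureTolerantWatermarks(long maxOutOfOrdernessMs, long maxFutureSkewMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.maxFutureSkewMs = maxFutureSkewMs;
    }

    /** Call for every event; timestamps too far ahead of the wall clock do not advance the watermark. */
    void onEvent(long eventTimestampMs, long wallClockMs) {
        if (eventTimestampMs <= wallClockMs + maxFutureSkewMs) {
            maxTimestampMs = Math.max(maxTimestampMs, eventTimestampMs);
        }
    }

    /** Current watermark, or Long.MIN_VALUE if no plausible event has been seen yet. */
    long currentWatermark() {
        if (maxTimestampMs == Long.MIN_VALUE) {
            return Long.MIN_VALUE;
        }
        return maxTimestampMs - maxOutOfOrdernessMs - 1;
    }
}
```

Note that this only protects the watermark. The future-dated event itself is still assigned to a window far ahead unless it is also filtered out or has its timestamp adjusted.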
I'm using flink with event time keyed windows.
It seems like some of the windows are not being emitted.
Is the watermark being advanced for each key individually?
For example, if my key is (id, type), and no events for a specific pair of id and type are being ingested by the source, will the watermark for that pair's window not advance?
If this is the case, how can I make sure that all my keyed windows get evicted after some time? (We have many keys, so sending a periodic dummy message for each key is not an option.)
I'll appreciate any help
Flink has separate watermarks for each task (i.e., each parallel instance) -- otherwise there would have to be some sort of horribly expensive global coordination -- but not for each key. In the case of a keyed window, each instance of the window operator will be handling the events for some disjoint subset of the keyspace, and all of the windows for those keys will be using the same watermark.
Keep in mind that empty windows do not produce results. So if there is some key for which there are no events during a window, that window will not produce results for that key.
Or it could be that you have an idle source holding back the watermarks. If one of your source tasks becomes idle, then its watermark won't advance. You could inspect the current watermark in the web UI, and check to see if it is advancing in every task.
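The effect of a stalled input can be sketched in plain Java (not Flink code; OperatorWatermarkSketch and combinedWatermark are hypothetical names). A task's watermark is modeled as the minimum over its input channels, so one channel that stops advancing holds everything back unless it is marked idle. In Flink, marking sources idle is typically configured with WatermarkStrategy.withIdleness.

```java
/** Plain-Java sketch (not Flink API) of how one task combines the watermarks of its inputs. */
final class OperatorWatermarkSketch {
    /**
     * Returns the minimum watermark over all inputs that are not marked idle.
     * If every input is idle, this sketch reports Long.MIN_VALUE, meaning "no information".
     */
    static long combinedWatermark(long[] channelWatermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < channelWatermarks.length; i++) {
            if (!idle[i]) {
                anyActive = true;
                min = Math.min(min, channelWatermarks[i]);
            }
        }
        return anyActive ? min : Long.MIN_VALUE;
    }
}
```

For inputs at 1000, 5000, and 200 the combined watermark is 200. Once the third input is marked idle it becomes 1000, and windows that end at or before that point can fire.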
Flink provides an example here: https://www.ververica.com/blog/stream-processing-introduction-event-time-apache-flink. It describes a scenario in which someone playing a game loses the connection in the subway; once they are back online, all of their data arrives and can be sorted and processed.
My understanding is that, with more players, there are two options:
All the other ones will be delayed waiting for this user to get back connection and send the data allowing the watermark to be pushed;
This user is classified as idle allowing the watermark to move forward and when he gets connected all his data will go to late data stream;
I would like to have the following option:
Each user is processed independently, with their own watermark for their session window. Ideally I would even use ingestion time: when the user reconnects, all of their data would go into a single session, which would then be ordered by event timestamp once the session closes. There would be a gap between the current time and the last (ingestion) timestamp of the window being processed; the session window guarantees this through the time gap that terminates the session. I also don't want the watermark to get stuck when one user loses connection, and I don't want to manage idle states. I just want to continue processing all the other events normally and, once this user is back, not classify any of their data as late merely because the watermark has advanced past the moment the user lost connection.
How could I implement the requirement above? I've been having a hard time working on scenarios like this because the watermark is global. Is there an easy explanation for not having watermarks for each key?
Thank you in advance!
The closest Flink's watermarking comes to supporting this directly is probably the support for per-kafka-partition watermarking -- which isn't really a practical solution to the situation you describe (since having a kafka partition per user isn't realistic).
What can be done is to simply ignore watermarking, and implement the logic yourself, using a KeyedProcessFunction.
BTW, there was recently a thread about this on both the flink-user and flink-dev mailing lists under the subject Per Key Grained Watermark Support.
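A minimal plain-Java sketch of the "implement the logic yourself" idea follows. It is not Flink code, and PerKeySessionSketch and its methods are hypothetical. It closes each user's session based only on how long ago that user's last event arrived, and orders the buffered events by event timestamp on close. In a KeyedProcessFunction the two maps would presumably become keyed state (for example ListState and ValueState), and the expiry check would be driven by a processing-time timer rather than called by hand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Plain-Java sketch (not Flink API) of per-user sessions that ignore watermarks entirely. */
final class PerKeySessionSketch {
    private final long gapMs;
    private final Map<String, List<Long>> buffered = new HashMap<>(); // event timestamps per user
    private final Map<String, Long> lastArrival = new HashMap<>();    // arrival time of latest event

    PerKeySessionSketch(long gapMs) {
        this.gapMs = gapMs;
    }

    /** Buffers an event for the user and remembers when it arrived (ingestion time). */
    void onEvent(String userId, long eventTimestampMs, long arrivalTimeMs) {
        buffered.computeIfAbsent(userId, k -> new ArrayList<>()).add(eventTimestampMs);
        lastArrival.put(userId, arrivalTimeMs);
    }

    /** If nothing arrived for gapMs, closes the session and returns its events sorted by event time. */
    Optional<List<Long>> closeIfExpired(String userId, long nowMs) {
        Long last = lastArrival.get(userId);
        if (last == null || nowMs - last < gapMs) {
            return Optional.empty();
        }
        List<Long> session = buffered.remove(userId);
        lastArrival.remove(userId);
        Collections.sort(session);
        return Optional.of(session);
    }
}
```

Because each user's session depends only on that user's own arrivals, a disconnected user neither delays other users nor has their data classified as late after reconnecting.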
Suppose I have a data stream that contains event-time data. I want to collect the input stream into windows of 8 milliseconds and reduce the data in every window. I do that using the following code:
aggregatedTuple
    .keyBy(0)
    .timeWindow(Time.milliseconds(8))
    .reduce(new ReduceFunction<Tuple2<Long, JSONObject>>()
Note: the key of the data stream is the processing-time timestamp rounded down to the nearest multiple of 8 milliseconds; for example, 1531569851297 is mapped to 1531569851296.
However, data may arrive late and enter the wrong time window. For example, suppose I set the window size to 8 milliseconds. The best case is that data enters the Flink engine in order, or at least with a delay smaller than the window size (8 milliseconds). But suppose an element (whose event time is also a field in the data) arrives with a latency of 30 milliseconds. It will then enter the wrong window. I think that if I check the event time of every element as it is about to enter the window, I can filter out such late data.
So I have two questions:
How can I filter data stream as it wants to enter the window and check if the data created at the right timestamp for the window?
How can I gather such late data in a variable to do some processing on them?
Flink has two different, related abstractions that deal with different aspects of computing windowed analytics on streams with event-time timestamps: watermarks and allowed lateness.
First, watermarks, which come into play whenever working with event-time data (whether or not you are using windows). Watermarks provide information to Flink about the progress of event-time, and give you, the application writer, a means of coping with out-of-order data. Watermarks flow with the data stream, and each one marks a position in the stream and carries a timestamp. A watermark serves as an assertion that at that point in the stream, the stream is now (probably) complete up to that timestamp -- or in other words, the events that follow the watermark are unlikely to be from before the time indicated by the watermark. The most common watermarking strategy is to use a BoundedOutOfOrdernessTimestampExtractor, which assumes that events arrive within some fixed, bounded delay.
This now provides a definition of lateness -- events that follow a watermark and have timestamps less than the watermark's timestamp are considered late.
The window API provides a notion of allowed lateness, which is set to zero by default. If allowed lateness is greater than zero, then the default Trigger for event-time windows will accept late events into their appropriate windows, up to the limit of the allowed lateness. The window will fire once at the usual time, and then again for each late event, up to the end of the allowed lateness interval. After that, late events are discarded (or collected to a side output if one is configured).
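The three possible outcomes for an event can be sketched in plain Java as follows. This is not Flink code: LatenessSketch, Outcome, and classify are hypothetical names. The comparisons are modeled on how I understand Flink's tumbling event-time windows to behave, where a window covering [start, start + size) has a maximum timestamp of start + size - 1, so treat the exact boundary conditions as an assumption to check against your Flink version.

```java
/** Plain-Java sketch (not Flink API) of how watermark and allowed lateness classify an event. */
final class LatenessSketch {
    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    /** Classifies one event for a tumbling window (assumes non-negative timestamps). */
    static Outcome classify(long eventTimestampMs, long watermarkMs,
                            long windowSizeMs, long allowedLatenessMs) {
        long windowStart = eventTimestampMs - (eventTimestampMs % windowSizeMs);
        long maxTimestamp = windowStart + windowSizeMs - 1; // last timestamp inside the window
        if (maxTimestamp + allowedLatenessMs <= watermarkMs) {
            return Outcome.DROPPED; // window state already cleaned up; side output if configured
        }
        if (maxTimestamp <= watermarkMs) {
            return Outcome.LATE_BUT_ACCEPTED; // window already fired; this causes a late firing
        }
        return Outcome.ON_TIME;
    }
}
```

For example, with 8 ms windows and 30 ms of allowed lateness, an event with timestamp 17 belongs to the window [16, 24). It is on time while the watermark is below 23, accepted as late for watermarks from 23 to 52, and dropped once the watermark reaches 53.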
How can I filter data stream as it wants to enter the window and check
if the data created at the right timestamp for the window?
Flink's window assigners are responsible for assigning events to the appropriate windows -- the right thing will happen automatically. New window instances will be created as needed.
How can I gather such late data in a variable to do some processing on them?
You can either be sufficiently generous in your watermarking so as to avoid having any late data, and/or configure the allowed lateness to be long enough to accommodate the late events. Be aware, however, that Flink will be forced to keep all windows open that are still accepting late events, which will delay garbage collecting old windows and may consume considerable memory.
Note that this discussion assumes you want to work with time windows -- e.g. the 8msec long windows you are working with. Flink also supports count windows (e.g. group events into batches of 100), session windows, and custom window logic. Watermarks and lateness don't play any role if you are using count windows, for example.
If you want per-key results for your analytics, then use keyBy to partition the stream by key (e.g., by userId) before applying windowing. For example
stream
    .keyBy(e -> e.userId)
    .timeWindow(Time.seconds(10))
    .reduce(...)
will produce separate results for each userId.
Update: Note that in recent versions of Flink it is now possible for windows to collect late events to a side output.
Some relevant documentation:
Event Time and Watermarks
Allowed Lateness