Infinite allowed Lateness for Apache Flink Windows

I have the following use case; sorry if there is an obvious solution, but I am very new to Flink:
Events (containing a value of interest) in a stream are supposed to be assigned to a window based on event time. In my case, events not only arrive out of order and late, but they are also versioned: for a given event time, two events may arrive, and in that case the window should fire again. The time between the arrival of these events might be days (or even weeks). I already found the allowed lateness option for windows. Is this a possible solution, or would it result in too many windows that can never be discarded, since another event might still arrive? (This basically boils down to the question of whether windows are persisted or kept in memory.)
Thanks

In general the allowed lateness needs to be finite in order to avoid keeping an unbounded amount of state. But you can configure Flink to use the RocksDBStateBackend, which will spill state to disk, allowing for as much state as your local disks can hold.
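As a rough illustration, this is how you could combine the RocksDBStateBackend with a long (but still finite) allowed lateness; the checkpoint URI, the events stream, and the key/version fields are placeholders for your own types, not part of the original question:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

// Inside the job's main() method (which typically declares `throws Exception`):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Spill keyed and window state to local disk instead of keeping it on the JVM heap.
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

// 'events' is assumed to be a DataStream of your versioned event type with
// event-time timestamps already assigned. A long, but finite, allowed lateness
// keeps windows around so a re-versioned event arriving days later can make
// the window fire again.
events
    .keyBy(e -> e.key)                       // hypothetical key field
    .timeWindow(Time.hours(1))
    .allowedLateness(Time.days(30))
    .reduce((a, b) -> b.version >= a.version ? b : a);  // keep the latest version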
If very late events are rare, you might be better off accommodating them in some special way, rather than burdening a general purpose pipeline with the overhead of all that state.

Related

Apache Flink: Watermarks per partition?

I see that there are a lot of discussions going on about adding support for watermarks per key. But does Flink support per-partition watermarks?
Currently the minimum of all the watermarks (of the non-idle partitions) is taken into account. Because of this, the last hanging records in a window get stuck as well (even when the watermark is advanced using periodicEmit).
Any info on this is really appreciated!
Some of the sources, such as the FlinkKafkaConsumer, support per-partition watermarking. You get this by calling assignTimestampsAndWatermarks on the source, rather than on the stream produced by the source.
With this, each consumer instance tracks the maximum timestamp within each partition, and takes as its watermark the minimum of these maximums, less the configured bounded out-of-orderness. Idle partitions will be ignored if you configure it to do so.
Not only does this yield more accurate watermarking, but if your events are in order within each partition, it also makes it possible to take advantage of the WatermarkStrategy.forMonotonousTimestamps() strategy.
See Watermark Strategies and the Kafka Connector for more details.
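For illustration, a minimal sketch of what that looks like; the topic name, event type, timestamp field, and deserialization schema are assumptions, not taken from the question:

import java.time.Duration;
import java.util.Properties;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "my-group");

FlinkKafkaConsumer<MyEvent> consumer =
    new FlinkKafkaConsumer<>("events-topic", new MyEventDeserializationSchema(), props);

// Calling this on the consumer (not on the resulting stream) enables
// per-partition watermarking inside each consumer instance.
consumer.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, kafkaTimestamp) -> event.eventTime)
        .withIdleness(Duration.ofMinutes(1)));  // don't let idle partitions hold back the watermark

// 'env' is the job's StreamExecutionEnvironment.
DataStream<MyEvent> stream = env.addSource(consumer);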
As for why the last window isn't being triggered, this is related to watermarking, but not to per-partition watermarking. The problem is simply that windows are triggered by watermarks, and the watermarks are trailing behind the timestamps in the events. So the watermarks can never catch up to the final events, and can never trigger the last window.
This isn't a problem for unbounded streaming jobs, since they never stop and never have a last window. And it isn't a problem for batch jobs, since they are aware of all of the data. But for bounded streaming jobs, you need to do something to work around this issue. Broadly speaking, what you must do is to inform Flink that the input stream has ended -- whenever the Flink sources detect that they have reached the end of an event-time-based input stream, they emit one last watermark whose value is MAX_WATERMARK, and this will trigger any open windows.
One way to do this is to use a KafkaDeserializationSchema with an implementation of isEndOfStream that returns true when the job reaches its end.
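A minimal sketch of such a schema, assuming the event type carries some end-of-stream marker; the MyEvent type, its fromBytes() helper, and its isEndMarker() flag are illustrative, not from the question:

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class BoundedEventSchema implements KafkaDeserializationSchema<MyEvent> {

    @Override
    public MyEvent deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        return MyEvent.fromBytes(record.value());   // assumed parsing helper
    }

    @Override
    public boolean isEndOfStream(MyEvent nextElement) {
        // Returning true stops the consumer; Flink then emits a final
        // MAX_WATERMARK, which fires any windows that are still open.
        return nextElement.isEndMarker();           // assumed flag on the event
    }

    @Override
    public TypeInformation<MyEvent> getProducedType() {
        return TypeInformation.of(MyEvent.class);
    }
}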

How to handle future events in Flink streaming?

We're working on calculating a max concurrent count for different types of events within a 1-minute tumbling time window.
These events are like sensor data, collected from our desktop agents on a per-minute basis; however, some agents produce a bad timestamp, say, a timestamp several hours later than now.
So, my question is how to handle/drop these events. Currently I just apply the predicate
filter(s => s.ct.getTime < now)
to exclude them.
My first question is: if I don't do this, I suspect such a bad "future" event would trigger the window calculation even for windows whose data is still incomplete.
And my second question is: is there a better method to prevent this?
Thanks
Interesting use case.
So first some background, then some solutions:
Windows in Flink do not fire based on timestamps but based on watermarks. There is a close connection between the two, and often it's okay to treat them the same when it comes to window firing, but in this case it's important to keep them clearly separated. So yes, your suspicion is probably valid if you use a watermark generator that is strictly bound to the timestamps.
So with that in mind, you have a few options:
1. Filter out invalid events (timestamp > now()).
2. Adjust the timestamp (timestamp = min(timestamp, now())), or fix the data at the source by understanding why specific sensors are off (timezone issues?).
3. Use a more sophisticated watermark generator.
I think the first two options are straightforward, and personally I would go for the second (fixing data is always good). Let's focus on the watermark generator.
There is basically no limit on how you generate watermarks - you can rely on your imagination. Here are some ideas:
Only advance the watermark when you have seen X events with a timestamp greater than the current watermark.
Use some low-pass filter, i.e. a slowly moving average.
Ignore events with timestamp > now(), so that the filter applies only to watermark generation, not to the events themselves (see the sketch after this list).
...
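As a sketch of the third idea (ignoring "future" timestamps when generating watermarks, while still letting the events through), something along these lines should work; the SensorEvent type and the 3-second out-of-orderness bound are assumptions:

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

public class IgnoreFutureTimestampsWatermarks implements WatermarkGenerator<SensorEvent> {

    private final long maxOutOfOrderness = 3_000;   // assumed 3 s bound
    private long maxSeenTimestamp = Long.MIN_VALUE + maxOutOfOrderness + 1;

    @Override
    public void onEvent(SensorEvent event, long eventTimestamp, WatermarkOutput output) {
        // Only events that are not "from the future" may advance the watermark;
        // the events themselves still flow through unfiltered.
        if (eventTimestamp <= System.currentTimeMillis()) {
            maxSeenTimestamp = Math.max(maxSeenTimestamp, eventTimestamp);
        }
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        output.emitWatermark(new Watermark(maxSeenTimestamp - maxOutOfOrderness - 1));
    }
}

// Wiring it in, for example:
// WatermarkStrategy.forGenerator(ctx -> new IgnoreFutureTimestampsWatermarks())
//                  .withTimestampAssigner((event, ts) -> event.timestamp);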
I'd be happy to hear which way you choose, and I can help you further from there.

Modelling time for complex events generated out of simple ones

My Flink application generates (complex) output events based on the processing of (simple) input events. The generated output events are to be consumed by other external services. My application works with event-time semantics, so I am a bit unsure about what I should use as the output events' timestamp.
Should I use:
the processing time at the moment of generating them?
the event time (given by the watermark value)?
both? (*)
For my use case, I am using both for now. But maybe you can come up with examples/justifications for each of the given options.
(*) In the case of using both, what naming would you use for the two fields? Something along the lines of event_time and processing_time seems to leak implementation details of my app to the external services...
There is no general answer to your question. It often depends on downstream requirements. Let's look at two simple cases:
A typical data processing pipeline is ingesting some kind of movement event (e.g., sensor data, click on web page, search request) and enriches it with master data (e.g., sensor calibration data, user profiles, geographic information) through joins. Then the resulting event should clearly have the same time as the input event.
A second pipeline aggregates the events from the first pipeline in a 15-minute tumbling window and simply counts them. Fair options then are to use the start of the window or the time of the first event, the end of the window or the time of the last event, or both. Using the start/end of the window means that the resulting signal is always defined. Using the first/last event timestamp is more precise when you actually want to see in the aggregates when things happened; usually that also means you probably want a finer window resolution (1 min instead of 15 min). Whether you use the start or the end of a window is often more a matter of taste, and you are usually safer including both.
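For that second pipeline, a hedged sketch of how you could carry several notions of time in the output at once; the PageView and WindowedCount types and their fields are made up for illustration:

import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class CountWithTimes
        extends ProcessWindowFunction<PageView, WindowedCount, String, TimeWindow> {

    @Override
    public void process(String key, Context ctx,
                        Iterable<PageView> events, Collector<WindowedCount> out) {
        long count = 0;
        long firstEventTime = Long.MAX_VALUE;
        long lastEventTime = Long.MIN_VALUE;
        for (PageView e : events) {
            count++;
            firstEventTime = Math.min(firstEventTime, e.timestamp);
            lastEventTime = Math.max(lastEventTime, e.timestamp);
        }
        // Emit both the window boundaries and the observed event times, so
        // downstream consumers can pick whichever notion of time they need.
        out.collect(new WindowedCount(key, count,
                ctx.window().getStart(), ctx.window().getEnd(),
                firstEventTime, lastEventTime));
    }
}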
In neither of these cases is processing time relevant at all. In fact, if your input carries event time, I'd argue that there is no good reason to use processing time. The main reason is that you cannot do meaningful reprocessing with processing time.
You can still add processing time, but for a different reason: to measure the end-to-end latency of a very complex data analytics pipeline including multiple technologies and jobs.

How to gather late data in Flink Stream Processing Windowing

Consider I have a data stream that contains event-time data in it. I want to gather the input data stream into time windows of 8 milliseconds and reduce the data of every window. I do that using the following code:
aggregatedTuple
    .keyBy(0)
    .timeWindow(Time.milliseconds(8))
    .reduce(new ReduceFunction<Tuple2<Long, JSONObject>>() { ... });
Point: the key of the data stream is the processing-time timestamp mapped down to the nearest multiple of 8 milliseconds; for example, 1531569851297 will be mapped to 1531569851296.
But it's possible that data arrives late and enters the wrong time window. For example, suppose I set the window time to 8 milliseconds. If data enters the Flink engine in order, or at least with a delay of less than the window time (8 milliseconds), that is the best case. But suppose an element's event time (which is also a field in the data) arrives with a latency of 30 milliseconds. Then it will enter the wrong window, and I think if I check the event time of every element as it is about to enter the window, I can filter out such late data.
So I have two questions:
How can I filter the data stream as it is about to enter the window and check whether the data was created at the right timestamp for that window?
How can I gather such late data in a variable to do some processing on them?
Flink has two different, related abstractions that deal with different aspects of computing windowed analytics on streams with event-time timestamps: watermarks and allowed lateness.
First, watermarks, which come into play whenever working with event-time data (whether or not you are using windows). Watermarks provide information to Flink about the progress of event-time, and give you, the application writer, a means of coping with out-of-order data. Watermarks flow with the data stream, and each one marks a position in the stream and carries a timestamp. A watermark serves as an assertion that at that point in the stream, the stream is now (probably) complete up to that timestamp -- or in other words, the events that follow the watermark are unlikely to be from before the time indicated by the watermark. The most common watermarking strategy is to use a BoundedOutOfOrdernessTimestampExtractor, which assumes that events arrive within some fixed, bounded delay.
This now provides a definition of lateness -- events that follow a watermark, with timestamps less than the watermark's timestamp, are considered late.
The window API provides a notion of allowed lateness, which is set to zero by default. If the allowed lateness is greater than zero, then the default Trigger for event-time windows will accept late events into their appropriate windows, up to the limit of the allowed lateness. The window action will fire once at the usual time, and then again for each late event, up to the end of the allowed lateness interval, after which late events are discarded (or collected to a side output if one is configured).
How can I filter the data stream as it is about to enter the window and check
whether the data was created at the right timestamp for that window?
Flink's window assigners are responsible for assigning events to the appropriate windows -- the right thing will happen automatically. New window instances will be created as needed.
How can I gather such late data in a variable to do some processing on them?
You can either be sufficiently generous in your watermarking so as to avoid having any late data, and/or configure the allowed lateness to be long enough to accommodate the late events. Be aware, however, that Flink will be forced to keep all windows open that are still accepting late events, which will delay garbage collecting old windows and may consume considerable memory.
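Putting those pieces together (watermarking, allowed lateness, and a side output for late data), a minimal sketch might look like the following; the MyEvent type, its timestamp/key fields, the merge logic, and the concrete lateness values are assumptions rather than part of the question:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

final OutputTag<MyEvent> lateTag = new OutputTag<MyEvent>("late-events") {};

SingleOutputStreamOperator<MyEvent> result = events
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.milliseconds(30)) {
            @Override
            public long extractTimestamp(MyEvent e) {
                return e.timestamp;                    // assumed event-time field
            }
        })
    .keyBy(e -> e.key)                                 // assumed key field
    .timeWindow(Time.milliseconds(8))
    .allowedLateness(Time.milliseconds(100))           // keep windows open for stragglers
    .sideOutputLateData(lateTag)                       // events later than that go here
    .reduce((a, b) -> a.merge(b));                     // assumed merge logic

// Late events can then be handled (or just logged) separately:
DataStream<MyEvent> lateEvents = result.getSideOutput(lateTag);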
Note that this discussion assumes you want to work with time windows -- e.g. the 8msec long windows you are working with. Flink also supports count windows (e.g. group events into batches of 100), session windows, and custom window logic. Watermarks and lateness don't play any role if you are using count windows, for example.
If you want per-key results for your analytics, then use keyBy to partition the stream by key (e.g., by userId) before applying windowing. For example
stream
    .keyBy(e -> e.userId)
    .timeWindow(Time.seconds(10))
    .reduce(...)
will produce separate results for each userId.
Update: Note that in recent versions of Flink it is now possible for windows to collect late events to a side output.
Some relevant documentation:
Event Time and Watermarks
Allowed Lateness

What does 'soft-state' in BASE mean?

BASE stands for 'Basically Available, Soft state, Eventually consistent'
So, I've come this far: "Basically Available: the system is available, but not necessarily all items in it at any given point in time" and "Eventually Consistent: after a certain time all nodes are consistent, but at any given time this might not be the case" (please correct me if I'm wrong).
But, what is meant exactly by 'Soft State'? I haven't been able to find any decent explanations on the internet yet.
This page (originally here, now available only from the web archive) may help:
[soft state] is information (state) the user put into the system that
will go away if the user doesn't maintain it. Stated another way, the
information will expire unless it is refreshed.
By contrast, the position of a typical simple light-switch is
"hard-state". If you flip it up, it will stay up, possibly forever. It
will only change back to down when you (or some other user) explicitly
comes back to manipulate it.
The BASE acronym is a bit contrived, and most NoSQL stores don't actually require data to be refreshed in this way. There's another explanation suggesting that soft-state means that the system will change state without user intervention due to eventual consistency (but then the soft-state part of the acronym is redundant).
There are some specific usages where state must indeed be refreshed by the user; for example, in the Cassandra NoSQL database, one can give all rows a time-to-live to make them completely soft-state (they will expire unless refreshed), but this is an unusual mode of usage (a transient cache, essentially).
"Soft-state" might also apply to the gossip protocol within Cassandra; a new node can determine the state of the cluster from the gossip messages it receives, and this cluster state must be constantly refreshed to detect unresponsive nodes.
I was taught in classes that "soft state" means that the state of the system could change over time (even during periods without input), because there may be changes going on due to eventual consistency. That's why it is called "soft" state.
Some source: link
Soft state means data that is not persisted on the disk, yet in case of failure it could be possible to restore it (e.g. recreate a lower quality image from a high quality one). A good article that addresses this and other interesting issues is Cluster-Based Scalable Network Services
A BASE system gives up on consistency to improve the performance of the database. Hence most of the well-known NoSQL databases are more highly available and scalable than ACID-compliant relational databases.
Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time.
For example, consider two systems, A and B. If a user writes data to system A, there will be some delay before the written data is reflected on B, usually within milliseconds (depending on the network speed and the design of the syncing mechanism).
