In Flink, I am reading a file with readTextFile and applying SlidingProcessingTimeWindows.of(Time.milliseconds(60), Time.milliseconds(60)) to it, i.e., a window of 60 msec with a slide of 60 msec. On the windowed stream I calculate the mean of the second field of the tuple. My text file contains 1100 lines, and each line is a tuple (String, Integer). I have set the parallelism to 1 and keyed the messages on the first field of the tuple.
When I run the code, I get a different answer each time. It seems that sometimes it reads the entire file and sometimes only the first few lines. Is this related to the window size or the slide amount? How can I work out that relation so that I can choose the window size and slide?
The answer in the comment of AlpineGizmo is correct. I'll add a few more details here.
Flink aligns time windows to the beginning of the epoch (1970-01-01 00:00:00 UTC). This means that a window operator with a 1 hour window starts a new window with every new hour (i.e., at 00:00, 01:00, 02:00, ...) and not with the first arriving record.
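To make the alignment concrete, here is a minimal plain-Java sketch of how an epoch-aligned window boundary can be computed from a timestamp. It is an illustration only, not Flink's own code; the class name WindowAlignment, its method names, and the sample timestamps are assumptions made for this example.

```java
// Plain-Java sketch of epoch-aligned window boundaries (illustrative, not Flink source code).
final class WindowAlignment {

    /** Start of the epoch-aligned window of length sizeMs that contains timestampMs. */
    static long windowStart(long timestampMs, long sizeMs) {
        // floorMod keeps the result correct for timestamps before 1970 as well.
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Exclusive end of that window. */
    static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }

    public static void main(String[] args) {
        long hourMs = 3_600_000L;
        long halfPastOne = 5_400_000L; // 01:30:00 UTC on 1970-01-01
        System.out.println(windowStart(halfPastOne, hourMs)); // 3600000, i.e. 01:00:00

        // Hypothetical arrival time: a record that shows up 1 ms before a 60 ms window closes.
        long arrival = 1_000_079L;
        System.out.println(windowEnd(arrival, 60L) - arrival); // 1
    }
}
```

With 60 msec windows, the second call shows how a record that happens to arrive at the very end of a window leaves only 1 msec for any further records to join it.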
Processing time windows are evaluated based on the current time of the system.
As said in the comment above, this means that the amount of data that can be processed within a window depends on the processing resources (hardware, CPU/IO load, ...) of the machine that an operator runs on. Therefore, processing-time windows cannot produce reliable and consistent results.
In your case, both effects described above might cause results that are inconsistent across job runs. Depending on when you start the job, the data will be assigned to different windows (if the first record arrives just before the first 60 msec window is closed, only this element will be in the window). Depending on the IO load of the machine, it might take more or less time to access and read the file.
If you want to have consistent results, you need to use event-time. In this case, the records are processed based on the time which is encoded in the data, i.e., the results depend on the data only and not on external effects such as the starting time of the job or the load of the processing machine.
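As a rough illustration of why event time gives repeatable results, the following plain-Java sketch (not Flink code) computes the mean per key and window using only a timestamp carried in each record. The Event class, its timestampMs field, and the sample values are assumptions; the question's tuples would need such a timestamp field added.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: window assignment driven only by the timestamp stored in each record.
final class EventTimeMean {

    /** Hypothetical record: the question's (String, Integer) tuple plus an event timestamp. */
    static final class Event {
        final String key;
        final int value;
        final long timestampMs;

        Event(String key, int value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /** Mean of value per "key@windowStart"; arrival order and wall-clock time play no role. */
    static Map<String, Double> meanPerWindow(List<Event> events, long sizeMs) {
        Map<String, long[]> acc = new TreeMap<>(); // "key@start" -> {sum, count}
        for (Event e : events) {
            long start = e.timestampMs - Math.floorMod(e.timestampMs, sizeMs);
            long[] a = acc.computeIfAbsent(e.key + "@" + start, k -> new long[2]);
            a[0] += e.value;
            a[1]++;
        }
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, long[]> en : acc.entrySet()) {
            means.put(en.getKey(), (double) en.getValue()[0] / en.getValue()[1]);
        }
        return means;
    }

    /** Made-up sample data for the usage note. */
    static List<Event> sample() {
        return Arrays.asList(
                new Event("a", 10, 0L),
                new Event("a", 20, 59L),
                new Event("a", 30, 60L),
                new Event("b", 4, 61L));
    }
}
```

Calling meanPerWindow(sample(), 60) gives the same map no matter how the list is ordered or when the call is made, which is the property that processing-time windows lack.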
Related
Context: the project I'm working on processes timestamped files that are produced periodically (1 min) and they are ingested in real time into a series of cascading window operators. The timestamp of the file indicates the event time, so I don't need to rely on the file creation time. The result of the processing of each window is sent to a sink which stores the data in several tables.
input -> 1 min -> 5 min -> 15 min -> ...
              \-> SQL  \-> SQL   \-> SQL
I am trying to come up with a solution to deal with possible downtime of the real time process. The input files are generated independently, so in case of severe downtime of the Flink solution, I want to ingest and process the missed files as if they were ingested by the same process.
My first thought is to configure a mode of operation of the same flow which reads only the missed files and has an allowed lateness which covers the earliest file to be processed. However, once a file has been processed, it is guaranteed that no more late files will be ingested, so I don't necessarily need to maintain the earliest window open for the duration of the whole process, especially since there may be many files to process in this manner. Is it possible to do something about closing windows, even with the allowed lateness set, or maybe I should look into reading the whole thing as a batch operation and partition by timestamp instead?
Since you are ingesting the input files in order, using event time processing, I don't see why there's an issue. When the Flink job recovers, it seems like it should be able to resume from where it left off.
If I've misunderstood the situation, and you sometimes need to go back and process (or reprocess) a file from some point in the past, one way to do this would be to deploy another instance of the same job, configured to only ingest the file(s) needing to be (re)ingested. There shouldn't be any need to rewrite this as a batch job -- most streaming jobs can be run on bounded inputs. And with event time processing, this backfill job will produce the same results as if it had been run in (near) real-time.
My Flink application generates (complex) output events based on the processing of (simple) input events. The generated output events are to be consumed by other external services. My application works with event-time semantics, so I am a bit in doubt about what I should use as the output events' timestamp.
Should I use:
the processing time at the moment of generating them?
the event time (given by the watermark value)?
both? (*)
For my use case, I am using both for now. But maybe you can come up with examples/justifications for each of the given options.
(*) In the case of using both, what naming would you use for the two fields? Something along the lines of event_time and processing_time seems to leak implementation details of my app to the external services...
There is no general answer to your question. It often depends on downstream requirements. Let's look at two simple cases:
A typical data processing pipeline is ingesting some kind of movement event (e.g., sensor data, click on web page, search request) and enriches it with master data (e.g., sensor calibration data, user profiles, geographic information) through joins. Then the resulting event should clearly have the same time as the input event.
A second pipeline is aggregating the events from the first pipeline on a 15 min tumbling window and simply counts them. Then fair options would be to use the start of the window or the time of the first event, the end of the window or the time of the last event, or both pieces of information. Using the start/end of a window would mean that we have a resulting signal that is always defined. Using the first/last event timestamp is more precise when you actually want to see in the aggregates when things happen. Usually, that also means that you probably want a finer window resolution though (1 min instead of 15 min). Whether you use the start or the end of a window is often more a matter of taste, and you are usually safer including both.
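The timestamp candidates for a windowed count can be put side by side in a small plain-Java sketch. This is only an illustration of the options discussed above; the WindowTimestamps and Summary names and the field layout are assumptions, not part of any Flink API.

```java
// Sketch: the timestamp options available for one tumbling-window count.
final class WindowTimestamps {

    /** Hypothetical output record carrying both window bounds and both event extremes. */
    static final class Summary {
        final long windowStartMs;
        final long windowEndMs;
        final long firstEventMs;
        final long lastEventMs;
        final int count;

        Summary(long windowStartMs, long windowEndMs, long firstEventMs, long lastEventMs, int count) {
            this.windowStartMs = windowStartMs;
            this.windowEndMs = windowEndMs;
            this.firstEventMs = firstEventMs;
            this.lastEventMs = lastEventMs;
            this.count = count;
        }
    }

    /** Summarizes event timestamps that all fall into the same epoch-aligned tumbling window. */
    static Summary summarize(long[] eventTimesMs, long sizeMs) {
        if (eventTimesMs.length == 0) {
            throw new IllegalArgumentException("no events");
        }
        long start = eventTimesMs[0] - Math.floorMod(eventTimesMs[0], sizeMs);
        long first = Long.MAX_VALUE;
        long last = Long.MIN_VALUE;
        for (long t : eventTimesMs) {
            if (t < start || t >= start + sizeMs) {
                throw new IllegalArgumentException("event outside window: " + t);
            }
            first = Math.min(first, t);
            last = Math.max(last, t);
        }
        return new Summary(start, start + sizeMs, first, last, eventTimesMs.length);
    }
}
```

For two events at 01:03:10 and 01:11:45 in a 15 min window, the window bounds are 01:00:00 and 01:15:00 regardless of the data, while the first/last fields show when activity actually happened.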
In none of these cases, processing time is relevant at all. In fact, if your input is event time, I'd argue that there is no good reason for processing time. The main reason is that you cannot do meaningful reprocessing with processing time.
You can still add processing time, but for a different reason: to measure the end-to-end latency of a very complex data analytics pipeline including multiple technologies and jobs.
Suppose I have a data stream that contains event-time data. I want to group the input stream into time windows of 8 milliseconds and reduce the data of every window. I do that using the following code:
aggregatedTuple
    .keyBy(0)
    .timeWindow(Time.milliseconds(8))
    .reduce(new ReduceFunction<Tuple2<Long, JSONObject>>()
Note: the key of the data stream is the processing-time timestamp rounded down to the nearest multiple of 8 milliseconds; for example, 1531569851297 is mapped to 1531569851296.
However, data may arrive late and end up in the wrong time window. For example, suppose I set the window to 8 milliseconds. The best case is when data enters the Flink engine in order, or at least with a delay shorter than the window (8 milliseconds). But suppose an event (whose event time is also a field in the record) arrives with a latency of 30 milliseconds. It will then enter the wrong window. I think that if I check the event time of every record as it is about to enter a window, I can filter out such late data.
So I have two questions:
How can I filter data stream as it wants to enter the window and check if the data created at the right timestamp for the window?
How can I gather such late data in a variable to do some processing on them?
Flink has two different, related abstractions that deal with different aspects of computing windowed analytics on streams with event-time timestamps: watermarks and allowed lateness.
First, watermarks, which come into play whenever working with event-time data (whether or not you are using windows). Watermarks provide information to Flink about the progress of event-time, and give you, the application writer, a means of coping with out-of-order data. Watermarks flow with the data stream, and each one marks a position in the stream and carries a timestamp. A watermark serves as an assertion that at that point in the stream, the stream is now (probably) complete up to that timestamp -- or in other words, the events that follow the watermark are unlikely to be from before the time indicated by the watermark. The most common watermarking strategy is to use a BoundedOutOfOrdernessTimestampExtractor, which assumes that events arrive within some fixed, bounded delay.
This now provides a definition of lateness -- events that follow a watermark with timestamps less than the watermark's timestamp are considered late.
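A simplified model of a bounded-out-of-orderness watermark and of the lateness test just described might look like the sketch below. It is plain Java rather than Flink code: the class BoundedOutOfOrderness and its methods are assumptions for illustration, and the sketch advances the watermark on every event, whereas Flink normally emits watermarks periodically.

```java
// Sketch: watermark = highest timestamp seen so far minus a fixed maximum delay.
final class BoundedOutOfOrderness {
    private final long maxDelayMs;
    private long maxSeenMs = Long.MIN_VALUE;

    BoundedOutOfOrderness(long maxDelayMs) {
        this.maxDelayMs = maxDelayMs;
    }

    /** Current watermark, or Long.MIN_VALUE if no event has been seen yet. */
    long currentWatermark() {
        return maxSeenMs == Long.MIN_VALUE ? Long.MIN_VALUE : maxSeenMs - maxDelayMs;
    }

    /**
     * Returns true if the event is late, i.e. its timestamp is less than the
     * watermark that was in effect when it arrived; then advances the watermark.
     */
    boolean observe(long eventTimeMs) {
        boolean late = eventTimeMs < currentWatermark();
        maxSeenMs = Math.max(maxSeenMs, eventTimeMs);
        return late;
    }
}
```

With a maximum delay of 10, an event stamped 95 that arrives after one stamped 100 is merely out of order, while an event stamped 105 that arrives after one stamped 120 is late.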
The window API provides a notion of allowed lateness, which is set to zero by default. If allowed lateness is greater than zero, then the default Trigger for event-time windows will accept late events into their appropriate windows, up to the limit of the allowed lateness. The window action will fire once at the usual time, and then again for each late event, up to the end of the allowed lateness interval. After that, late events are discarded (or collected to a side output if one is configured).
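The three possible fates of an event can be sketched in plain Java as follows. This is a model of the behavior described above, not Flink code; the class AllowedLateness, the Outcome names, and the exact boundary conditions (window fires once the watermark reaches the window's last timestamp) are assumptions of this sketch. The tooLate helper mimics what a side output for late data would hand you.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: what happens to an event depending on the watermark at its arrival.
final class AllowedLateness {

    enum Outcome { BEFORE_FIRST_FIRING, LATE_BUT_ACCEPTED, TOO_LATE }

    static Outcome classify(long eventTimeMs, long watermarkMs, long sizeMs, long allowedLatenessMs) {
        long windowStart = eventTimeMs - Math.floorMod(eventTimeMs, sizeMs);
        long maxTimestamp = windowStart + sizeMs - 1; // last timestamp that belongs to the window
        if (watermarkMs < maxTimestamp) {
            return Outcome.BEFORE_FIRST_FIRING; // window has not fired yet
        }
        if (watermarkMs < maxTimestamp + allowedLatenessMs) {
            return Outcome.LATE_BUT_ACCEPTED; // window fires again with this event included
        }
        return Outcome.TOO_LATE; // window state is gone; event is dropped or side-output
    }

    /** Collects the events that are too late; eventTimesMs[i] arrives when the watermark is watermarksMs[i]. */
    static List<Long> tooLate(long[] eventTimesMs, long[] watermarksMs, long sizeMs, long allowedLatenessMs) {
        List<Long> late = new ArrayList<>();
        for (int i = 0; i < eventTimesMs.length; i++) {
            if (classify(eventTimesMs[i], watermarksMs[i], sizeMs, allowedLatenessMs) == Outcome.TOO_LATE) {
                late.add(eventTimesMs[i]);
            }
        }
        return late;
    }
}
```

In a real job the equivalent of tooLate is, as far as I know, configured with sideOutputLateData on the windowed stream and read back with getSideOutput on the result, so the late events end up in their own stream rather than in a local list.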
How can I filter data stream as it wants to enter the window and check if the data created at the right timestamp for the window?
Flink's window assigners are responsible for assigning events to the appropriate windows -- the right thing will happen automatically. New window instances will be created as needed.
How can I gather such late data in a variable to do some processing on them?
You can either be sufficiently generous in your watermarking so as to avoid having any late data, and/or configure the allowed lateness to be long enough to accommodate the late events. Be aware, however, that Flink will be forced to keep all windows open that are still accepting late events, which will delay garbage collecting old windows and may consume considerable memory.
Note that this discussion assumes you want to work with time windows -- e.g. the 8msec long windows you are working with. Flink also supports count windows (e.g. group events into batches of 100), session windows, and custom window logic. Watermarks and lateness don't play any role if you are using count windows, for example.
If you want per-key results for your analytics, then use keyBy to partition the stream by key (e.g., by userId) before applying windowing. For example
stream
.keyBy(e -> e.userId)
.timeWindow(Time.seconds(10))
.reduce(...)
will produce separate results for each userId.
Update: Note that in recent versions of Flink it is now possible for windows to collect late events to a side output.
Some relevant documentation:
Event Time and Watermarks
Allowed Lateness
Could I set a DataStream time window to a large value like 24 hours? The reason for this requirement is that I want to compute statistics over the latest 24 hours of client traffic to the web site. This way, I can check whether there are security violations.
For example, check if a user account used multiple source IPs to log on to the web site. Or check how many unique pages a certain IP accessed in the latest 24 hours. If security violation is detected, the configured action will be taken in real time such as blocking the source IP or locking the relevant user account.
The throughput of the web site is around 200Mb/s. I think setting the time window to a large value will cause memory issues. Should I instead store the statistics of each smaller time window (say, 5 minutes) in a database, and then compute the statistics with a database query over the data generated in the latest 24 hours?
I don't have any experience with big data analysis. Any advice will be appreciated.
It depends on what type of window and aggregations we're talking about:
Window where no eviction is used: in this case Flink only keeps one accumulated result per physical window. This means that a sliding window of 10h with a 1h slide that computes a sum has to keep 10 running sums at a time, one per overlapping window. A tumbling window (regardless of its parameters) keeps the aggregation result only once. However, this is not the whole story: because state is keyed, you have to multiply all of this by the number of distinct values of the field used in the group by.
Window with eviction: saves all events that were processed but still weren't evicted.
In short, generally the memory consumption is not tied to how many events you processed or the window's durations but to:
The number of windows (considering that one sliding window actually maps to several physical windows).
The cardinality of the field you're using in the group by.
All things considered, I'd say a simple 24-hour window has an almost nonexistent memory footprint.
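As a back-of-the-envelope check of the reasoning above, the sketch below estimates how many accumulators an incrementally aggregated window keeps at a time. It is a rough plain-Java estimate under the assumptions that no evictor is used and that the slide divides the window size evenly; the class WindowStateEstimate and the key count in the usage note are made up for illustration.

```java
// Sketch: rough count of accumulators held for an incrementally aggregated window.
final class WindowStateEstimate {

    /** Number of overlapping physical windows each element belongs to. */
    static long windowsPerElement(long sizeMs, long slideMs) {
        if (slideMs <= 0 || sizeMs % slideMs != 0) {
            throw new IllegalArgumentException("slide must be positive and divide the size evenly");
        }
        return sizeMs / slideMs;
    }

    /** One accumulator per physical window per distinct key. */
    static long accumulators(long sizeMs, long slideMs, long distinctKeys) {
        return windowsPerElement(sizeMs, slideMs) * distinctKeys;
    }
}
```

A 24-hour tumbling window (size equal to slide) over a hypothetical 1,000,000 distinct keys keeps about 1,000,000 accumulators regardless of traffic volume, whereas a 10h window sliding by 1h would keep ten per key.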
You can check the relevant code here.
I am currently writing an aggregation use case using Flink 1.0. As part of the use case, I need to get the count of APIs that were logged in the last 10 mins.
I can easily do this using keyBy("api"), then applying a window of 10 min and a sum(count) operation.
But the problem is that my data might arrive out of order, so I need some way to get the count of APIs across the 10 min window.
For example: if the same API log arrives in 2 different windows, I should get a global count, i.e., 2 for it, and not two separate records each displaying a count of 1 for its window.
I also don't want incremental counts, i.e., each record with the same key being displayed many times with a count equal to the incremental value.
I want the record to be displayed once with a global count, something like updateStateByKey() in Spark.
Can we do that?
You should have a look at Flink's event-time feature, which produces consistent results for out-of-order streams. Event time means that Flink processes data depending on timestamps that are part of the events, and not depending on the machine's wall-clock time.
If you use event time (with appropriate watermarks), Flink will automatically handle events that arrive out of order.