How to count unique words in time window stream with Flink? - apache-flink

Is there a way to count the number of unique words in a time window stream with Flink Streaming? I have seen this question, but I don't know how to implement the time window.

Sure, that's pretty straightforward. If you want an aggregation across all of the input records during each time window, then you'll need to use one of the flavors of windowAll(), which means you won't be using a KeyedStream and cannot operate in parallel.
You'll need to decide if you want tumbling windows or sliding windows, and whether you are operating in event time or processing time.
But roughly speaking, you'll do something like this:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource( ... )
    .timeWindowAll(Time.minutes(15))
    .apply(new UniqueWordCounter())
    .print();

env.execute();
Your UniqueWordCounter will be an AllWindowFunction (since the stream is not keyed) that receives an iterable of all the words in a window and returns the number of unique words.
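For example, UniqueWordCounter might look roughly like this (a minimal sketch, assuming the stream elements are plain String words):

// A minimal sketch, assuming the stream elements are plain String words.
// Relevant imports: org.apache.flink.streaming.api.functions.windowing.AllWindowFunction,
// org.apache.flink.streaming.api.windowing.windows.TimeWindow, org.apache.flink.util.Collector.
public static class UniqueWordCounter implements AllWindowFunction<String, Long, TimeWindow> {
    @Override
    public void apply(TimeWindow window, Iterable<String> words, Collector<Long> out) {
        Set<String> unique = new HashSet<>();
        for (String word : words) {
            unique.add(word);
        }
        out.collect((long) unique.size());
    }
}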
On the other hand, if you are using a KeyedStream and want to count unique words for each key, modify your application accordingly:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource( ... )
    .keyBy( ... )
    .timeWindow(Time.minutes(15))
    .apply(new UniqueWordCounter())
    .print();

env.execute();
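In the keyed case, the counter implements WindowFunction, which also receives the key. A sketch, assuming the elements are Tuple2<key, word> pairs keyed by a key selector returning the String key (the element and key types depend on your application):

// A sketch of the keyed variant; the Tuple2<key, word> element type is an assumption.
// Relevant imports: org.apache.flink.streaming.api.functions.windowing.WindowFunction,
// org.apache.flink.api.java.tuple.Tuple2, org.apache.flink.util.Collector.
public static class UniqueWordCounter
        implements WindowFunction<Tuple2<String, String>, Tuple2<String, Long>, String, TimeWindow> {
    @Override
    public void apply(String key, TimeWindow window,
                      Iterable<Tuple2<String, String>> values, Collector<Tuple2<String, Long>> out) {
        Set<String> unique = new HashSet<>();
        for (Tuple2<String, String> value : values) {
            unique.add(value.f1);
        }
        out.collect(Tuple2.of(key, (long) unique.size()));
    }
}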

Related

Persist Apache Flink window

I'm trying to use Flink to consume bounded data from a message queue in a streaming fashion. The data will be in the following format:
{"id":-1,"name":"Start"}
{"id":1,"name":"Foo 1"}
{"id":2,"name":"Foo 2"}
{"id":3,"name":"Foo 3"}
{"id":4,"name":"Foo 4"}
{"id":5,"name":"Foo 5"}
...
{"id":-2,"name":"End"}
The start and end of a batch can be identified using the event id. I want to receive such batches and store the latest batch (by overwriting the previous one) on disk or in memory. I can write a custom window trigger to extract the events using the start and end flags, as shown below:
DataStream<Foo> fooDataStream = ...

AllWindowedStream<Foo, GlobalWindow> fooWindow = fooDataStream
    .windowAll(GlobalWindows.create())
    .trigger(new CustomTrigger<>())
    .evictor(new Evictor<Foo, GlobalWindow>() {
        @Override
        public void evictBefore(Iterable<TimestampedValue<Foo>> elements, int size,
                                GlobalWindow window, EvictorContext evictorContext) {
            // drop the Start/End marker events before the window function sees them
            for (Iterator<TimestampedValue<Foo>> iterator = elements.iterator(); iterator.hasNext(); ) {
                TimestampedValue<Foo> foo = iterator.next();
                if (foo.getValue().getId() < 0) {
                    iterator.remove();
                }
            }
        }

        @Override
        public void evictAfter(Iterable<TimestampedValue<Foo>> elements, int size,
                               GlobalWindow window, EvictorContext evictorContext) {
        }
    });
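// (Not part of the original question.) A hedged sketch of what the CustomTrigger referenced
// above might look like: fire and purge the global window when the End marker arrives
// (id == -2, per the sample data). Uses org.apache.flink.streaming.api.windowing.triggers.*
// and org.apache.flink.streaming.api.windowing.windows.GlobalWindow.
public static class CustomTrigger<T extends Foo> extends Trigger<T, GlobalWindow> {

    @Override
    public TriggerResult onElement(T element, long timestamp, GlobalWindow window, TriggerContext ctx) {
        return element.getId() == -2 ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) {
    }
}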
But how can I persist the output of the latest window? One way would be to use a ProcessAllWindowFunction to receive all the events and write them to disk manually, but that feels like a hack. I'm also looking into the Table API with a Flink CEP pattern (like this question), but couldn't find a way to clear the Table after each batch to discard the events from the previous batch.
There are a couple of things getting in the way of what you want:
(1) Flink's window operators produce append streams, rather than update streams. They're not designed to update previously emitted results. CEP also doesn't produce update streams.
(2) Flink's file system abstraction does not support overwriting files. This is because object stores, like S3, don't support this operation very well.
I think your options are:
(1) Rework your job so that it produces an update (changelog) stream. You can do this with toChangelogStream, or by using Table/SQL operations that create update streams, such as GROUP BY (when it's used without a time window). On top of this, you'll need to choose a sink that supports retractions/updates, such as a database; a minimal sketch of this approach follows below.
(2) Stick to producing an append stream and use something like the FileSink to write the results to a series of rolling files. Then do some scripting outside of Flink to get what you want out of this.
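A minimal sketch of option (1), assuming Foo is a POJO with id and name fields; the source, the view name, and the GROUP BY column are illustrative, not taken from the original job (requires the flink-table-api-java-bridge dependency):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

DataStream<Foo> fooDataStream = env.addSource(new FooSource()); // FooSource is hypothetical
tableEnv.createTemporaryView("Foos", fooDataStream);

// a non-windowed GROUP BY produces an updating (changelog) result
Table latest = tableEnv.sqlQuery("SELECT name, COUNT(*) AS cnt FROM Foos GROUP BY name");

// each element is a Row tagged with a RowKind (INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE)
DataStream<Row> changelog = tableEnv.toChangelogStream(latest);
changelog.print();

env.execute();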

Find max value in a Flink DataStream

I have a DataStream of Tuple2<String, Integer>. I want to find the max of field f1, preferably without doing a keyBy(). Is that possible in Flink?
One "hack" I came up with:
DataStream<Tuple2<String, Integer>> input; // Initialized somewhere

DataStream<Tuple2<String, Integer>> maxEntry =
    input.map(entry -> new Tuple3(entry.f0, entry.f1, "foo"))
         .keyBy(2)
         .maxBy(1)
         .map(entry -> new Tuple2(entry.f0, entry.f1));
Doing the intermediate map() and keyBy() seems to me wasteful/inefficient. Is there a better way?
Thank you,
Ahmed.
You could do this, which is still hacky, but less so:
input.keyBy(e -> "foo").maxBy(1)
but keep in mind that:
(1) Keying by a constant reduces the effective parallelism to 1 (which is fine in this case, as you need to process every event in the same place to find a global maximum).
(2) KeyedStream#maxBy will be removed from Flink in the future. See https://stackoverflow.com/a/66651834/2000823 for more about that.
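If you want to avoid maxBy entirely, here is a sketch of the same idea with reduce (the constant key and the running-maximum semantics are assumptions about what you need):

DataStream<Tuple2<String, Integer>> runningMax =
    input.keyBy(e -> "all")                       // constant key: a single global aggregate, parallelism 1
         .reduce((a, b) -> a.f1 >= b.f1 ? a : b); // emits the current max for every incoming element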

How does Flink treat timestamps within iterative loops?

How are timestamps treated within an iterative DataStream loop within Flink?
For example, here is an example of a simple iterative loop within Flink where the feedback loop is of a different type to the input stream:
DataStream<MyInput> inputStream = env.addSource(new MyInputSourceFunction());
IterativeStream.ConnectedIterativeStreams<MyInput, MyFeedback> iterativeStream = inputStream.iterate().withFeedbackType(MyFeedback.class);
// define an output tag so we can emit feedback objects via a side output
final OutputTag<MyFeedback> outputTag = new OutputTag<MyFeedback>("feedback-output"){};
// now do some processing
SingleOutputStreamOperator<MyOutput> combinedStreams = iterativeStream.process(
    new CoProcessFunction<MyInput, MyFeedback, MyOutput>() {
        @Override
        public void processElement1(MyInput value, Context ctx, Collector<MyOutput> out) throws Exception {
            // do some processing of the stream of MyInput values
            // emit MyOutput values downstream by calling out.collect()
            out.collect(someInstanceOfMyOutput);
        }

        @Override
        public void processElement2(MyFeedback value, Context ctx, Collector<MyOutput> out) throws Exception {
            // do some more processing on the feedback classes
            // emit feedback items
            ctx.output(outputTag, someInstanceOfMyFeedback);
        }
    });
iterativeStream.closeWith(combinedStreams.getSideOutput(outputTag));
My questions revolve around how Flink uses timestamps within a feedback loop:
Within the ConnectedIterativeStreams, how does Flink treat ordering of the input objects across the streams of regular inputs and feedback objects? If I emit an object into the feedback loop, when will it be seen by the head of the loop with respect to the regular stream of input objects?
How does the behaviour change when using event time processing?
AFAICT, Flink doesn't provide any guarantees on the ordering of input objects. I've run into this when trying to use iterations for a clustering algorithm in Flink, where the centroid updates don't get processed in a timely manner. The only solution I found was to essentially create a single (unioned) stream of the incoming events and the centroid updates, rather than using a co-stream.
FYI, there's this proposal to address some of the shortcomings of iterations.

Flink Side output for Sliding time window

I have the following Flink pipeline, which simply counts the elements in a window and reports the late elements on a separate stream:
OutputTag<Tuple3<Long, String, Double>> lateItems = new OutputTag<Tuple3<Long, String, Double>>("Late Items"){};
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
env.getConfig().setAutoWatermarkInterval(-1);
DataStream<Tuple3<Long,String,Double>> stream = env.addSource(new YetAnotherSource(fileName));
DataStream<Tuple3<Long,String,Double>> lateStream;
AllWindowedStream<Tuple3<Long, String, Double>, TimeWindow> tuple3TimeWindowAllWindowedStream = stream.windowAll(SlidingEventTimeWindows.of(Time.milliseconds(100), Time.milliseconds(10)));
tuple3TimeWindowAllWindowedStream.sideOutputLateData(lateItems);
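// (Not in the original snippet.) streamOfResults below is presumably the main window output,
// e.g. the result of applying a counting window function to tuple3TimeWindowAllWindowedStream.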
lateStream = streamOfResults.getSideOutput(lateItems);
lateStream.countWindowAll(1).apply(new CounterFunction22()).writeAsText("FlinkSlidingTimeWindowLateItemsResult.txt",FileSystem.WriteMode.OVERWRITE);
streamOfResults.writeAsText("FlinkSlidingTimeWindowOutputFor" + fileName + ".txt", FileSystem.WriteMode.OVERWRITE);
When I pass the following data as input:
1383451269002,A,22
1383451269006,A,18
1383451269007,A,18
*1383451269010,W,0
1383451269008,A,18
1383451269027,A,20
1383451269028,A,19
1383451269033,A,17
1383451269033,A,17
1383451269030,A,17
*1383451269038,W,0
1383451269008,A,17
The elements with * are watermarks.
As expected, the result shows that the first window contains three elements, because the elements on the fifth and last rows are considered late for that window.
(1383451268910,1383451269010,3)
However, nothing is generated on the side output.
When I use a session window, though, late items are generated on the side output.
Any ideas why nothing is generated for the sliding time window?

How to write the result of each sliding window of a Flink program to a new file instead of appending the results of all windows to one file

Below is a Flink program (Java) which reads tweets from a file, extracts hash tags, counts the number of repetitions of each hash tag, and finally writes the results to a file.
In this program there is a sliding window of size 20 seconds that slides by 5 seconds. In the sink, all output data is written into a file named outfile, meaning that every 5 seconds one window fires and appends its data to outfile.
My Problem:
I want the data for every window firing (i.e. every 5 seconds) to be written to a new file, instead of being appended to the same file.
Where and how can this be done? Do I need to use a custom trigger, some configuration on the sink, or something else?
Code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.getConfig().setAutoWatermarkInterval(100);
env.enableCheckpointing(5000,CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
String path = "C:\\Users\\eventTime";
// Reading data from files of folder eventTime.
DataStream<String> streamSource = env.readFile(new TextInputFormat(new Path(path)), path, FileProcessingMode.PROCESS_CONTINUOUSLY, 1000).uid("read-1");
//Extracting the hash tags of tweets
DataStream<Tuple3<String, Integer, Long>> mapStream = streamSource.map(new ExtractHashTagFunction());
//generating watermarks and extracting the timestamps from tweets
DataStream<Tuple3<String, Integer, Long>> withTimestampsAndWatermarks = mapStream.assignTimestampsAndWatermarks(new MyTimestampsAndWatermarks());
KeyedStream<Tuple3<String, Integer, Long>,Tuple> keyedStream = withTimestampsAndWatermarks.keyBy(0);
//Using sliding window of 20 seconds which slide by 5 seconds.
SingleOutputStreamOperator<Tuple4<String, Integer, Long, String>> aggregatedStream = keyedStream.window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5)))
.aggregate(new AggregateHashTagCountFunction()).uid("agg-123");
aggregatedStream.writeAsText("C:\\Users\\outfile", WriteMode.NO_OVERWRITE).setParallelism(1).uid("write-1");
env.execute("twitter-analytics");
If you are not satisfied with the built-in sinks, you can define your own custom sink:
stream.addSink(new MyCustomSink ...)
MyCustomSink should implement SinkFunction.
Your custom sink will contain a FileWriter and, for example, a counter.
Every time the sink is invoked, it writes to "/path/to/file" + counter + ".yourFileExtension".
https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/org/apache/flink/streaming/api/functions/sink/SinkFunction.html
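A hedged sketch of what such a sink might look like (the class body, the base path handling, and the file-naming scheme are assumptions, not from the original answer):

// Relevant imports: org.apache.flink.streaming.api.functions.sink.SinkFunction,
// org.apache.flink.api.java.tuple.Tuple4, java.nio.file.*, java.nio.charset.StandardCharsets.
public static class MyCustomSink implements SinkFunction<Tuple4<String, Integer, Long, String>> {

    private final String basePath;
    private int counter;

    public MyCustomSink(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void invoke(Tuple4<String, Integer, Long, String> value) throws Exception {
        // invoke() runs once per record, so a plain counter creates one file per record;
        // to get exactly one file per window firing, derive the file name from something
        // that identifies the window (e.g. a window-end timestamp added by the window function).
        Path target = Paths.get(basePath + "-" + counter + ".txt");
        Files.write(target,
                (value.toString() + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        counter++;
    }
}

It would then replace the writeAsText call, e.g. aggregatedStream.addSink(new MyCustomSink("C:\\Users\\outfile")).setParallelism(1);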
