I have the following Flink pipeline, which simply counts the elements in a window and reports the late elements on a separate stream:
OutputTag<Tuple3<Long, String, Double>> lateItems= new OutputTag<Tuple3<Long, String, Double>>("Late Items"){};
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
env.getConfig().setAutoWatermarkInterval(-1);
DataStream<Tuple3<Long,String,Double>> stream = env.addSource(new YetAnotherSource(fileName));
DataStream<Tuple3<Long,String,Double>> lateStream;
AllWindowedStream<Tuple3<Long, String, Double>, TimeWindow> tuple3TimeWindowAllWindowedStream = stream.windowAll(SlidingEventTimeWindows.of(Time.milliseconds(100), Time.milliseconds(10)));
tuple3TimeWindowAllWindowedStream.sideOutputLateData(lateItems);
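// (not shown in the snippet: the window function applied to tuple3TimeWindowAllWindowedStream that produces streamOfResults)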
lateStream = streamOfResults.getSideOutput(lateItems);
lateStream.countWindowAll(1).apply(new CounterFunction22()).writeAsText("FlinkSlidingTimeWindowLateItemsResult.txt",FileSystem.WriteMode.OVERWRITE);
streamOfResults.writeAsText("FlinkSlidingTimeWindowOutputFor" + fileName + ".txt", FileSystem.WriteMode.OVERWRITE);
When I pass as input the following data
1383451269002,A,22
1383451269006,A,18
1383451269007,A,18
*1383451269010,W,0
1383451269008,A,18
1383451269027,A,20
1383451269028,A,19
1383451269033,A,17
1383451269033,A,17
1383451269030,A,17
*1383451269038,W,0
1383451269008,A,17
The elements with * are watermarks.
As expected, the result shows that the first window contains three elements, because the elements on the fifth and last rows are considered late for that window.
(1383451268910,1383451269010,3)
However, nothing is generated on the side output.
When I use a session window, though, late items are generated on the side output.
Any ideas why nothing is generated for the sliding time window?
Related
I am currently working on a problem in Flink, wherein I'll have to compute aggregate functions for three different sliding windows of sizes 7 days, 14 days and 1 month.
From what I've understood, I'd have to run three different consumers in parallel, each with one of the above-mentioned window sizes. Is there a way to implement three sliding windows for a single data stream, all using a single consumer code?
Some code or a reference for implementing this in Flink would be much appreciated.
What I know:
consumer 1 computes over a sliding window of size 7 days
consumer 2 computes over a sliding window of size 14 days
and so on.
What I want:
consumer 1 computing all these sliding windows simultaneously for a single data stream.
Is it possible to implement this in Flink?
The various windows can share a single stream produced by one Kafka consumer, like this:
consumer = new FlinkKafkaConsumer<>("topic", new topicSchema(), kafkaProps);
stream = env.addSource(consumer);
w1 = stream.keyBy(key)
    .window(SlidingEventTimeWindows.of(Time.days(7), Time.days(1)))
    .process(...);
w2 = stream.keyBy(key)
    .window(SlidingEventTimeWindows.of(Time.days(14), Time.days(1)))
    .process(...);
Or to be more efficient, you might structure it like this:
consumer = new FlinkKafkaConsumer<>("topic", new topicSchema(), kafkaProps);
stream = env.addSource(consumer);
dayByDay = stream.keyBy(key)
    .window(TumblingEventTimeWindows.of(Time.days(1)))
    .process(...);
w1 = dayByDay.keyBy(key)
    .window(SlidingEventTimeWindows.of(Time.days(7), Time.days(1)))
    .process(...);
w2 = dayByDay.keyBy(key)
    .window(SlidingEventTimeWindows.of(Time.days(14), Time.days(1)))
    .process(...);
Note, however, that there is no Time.months(), so if you want windows aligned to month boundaries, I guess you'll have to figure that part out.
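One workaround, sketched here in the same style as the snippets above (and not something the original answer prescribes), is to approximate a month as a fixed number of days, accepting that such windows will not align with calendar month boundaries:

w3 = dayByDay.keyBy(key)
    .window(SlidingEventTimeWindows.of(Time.days(30), Time.days(1)))
    .process(...);

Truly calendar-aligned month windows would likely require a custom WindowAssigner, which is beyond this sketch.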
I'm trying to count the elements in a stream while enriching the result with the end time of the window.
The events are received from Kafka using the Kafka 0.10 consumer provided by Flink. Event time is used.
A simple KeyedStream.count( ... ) works fine.
The stream covers a span of 4 minutes. With a time window of 3 minutes, only one output is received; there should be two. The results are written using a BucketingSink.
val count = stream.map( m => (m.getContext, 1) )
  .keyBy( 0 )
  .timeWindow( Time.minutes(3) )
  .apply( new EndTimeWindow() )
  .map( new JsonMapper() )

count.addSink( countSink )

class EndTimeWindow extends WindowFunction[(String, Int), (String, Int), Tuple, TimeWindow] {
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    var sum: Int = 0
    for ( value <- input ) {
      sum = sum + value._2
    }
    out.collect( (window.getEnd.toString, sum) )
  }
}
With a time window of 3 minutes, only one output is received, and it contains a smaller number of events than expected. There should be two outputs.
To be more precise, an event time window closes when a suitable watermark arrives -- which, with a bounded-out-of-orderness watermark generator, will happen (1) if an event arrives that is sufficiently outside the window, or (2) if the events are coming from a finite source that reaches its end, because in that case Flink will send a watermark with a timestamp of Long.MAX_VALUE that will close all open event time windows. However, with Kafka as your source, that won't happen.
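For illustration, here is a rough sketch in Java of the kind of bounded-out-of-orderness assigner described above; the event type MyEvent and its getTimestamp() accessor are made-up names, not from the question:

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Watermarks trail the largest timestamp seen so far by 10 seconds, so a window
// closes once an event arrives whose timestamp is more than 10 seconds past the window end.
public class MyTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<MyEvent> {

    public MyTimestampExtractor() {
        super(Time.seconds(10));
    }

    @Override
    public long extractTimestamp(MyEvent event) {
        return event.getTimestamp(); // hypothetical accessor on a hypothetical event class
    }
}

It would be attached with stream.assignTimestampsAndWatermarks(new MyTimestampExtractor()) before the windowing.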
OK, I think I know what went wrong. The mistake happened because I was thinking about the problem incorrectly.
Since I'm using event time, a window closes when an event arrives whose timestamp is greater than the window's end time. When the stream ends, no more elements arrive, so the last window never closes.
Below is a Flink program (Java) which reads tweets from a file, extracts hash tags, counts the number of repetitions for each hash tag, and finally writes the results to a file.
In this program there is a sliding window of size 20 seconds that slides by 5 seconds. In the sink, all output data is written into a file named outfile, i.e. every 5 seconds one window fires and writes its data into outfile.
My problem:
I want the data for every window firing (i.e. every 5 seconds) to be written to a new file, instead of being appended to the same file.
Where and how can this be done? Do I need a custom trigger, some configuration on the sink, or something else?
Code:
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.getConfig().setAutoWatermarkInterval(100);
env.enableCheckpointing(5000,CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
String path = "C:\\Users\\eventTime";
// Reading data from files of folder eventTime.
DataStream<String> streamSource = env.readFile(new TextInputFormat(new Path(path)), path, FileProcessingMode.PROCESS_CONTINUOUSLY, 1000).uid("read-1");
//Extracting the hash tags of tweets
DataStream<Tuple3<String, Integer, Long>> mapStream = streamSource.map(new ExtractHashTagFunction());
//generating watermarks and extracting the timestamps from tweets
DataStream<Tuple3<String, Integer, Long>> withTimestampsAndWatermarks = mapStream.assignTimestampsAndWatermarks(new MyTimestampsAndWatermarks());
KeyedStream<Tuple3<String, Integer, Long>,Tuple> keyedStream = withTimestampsAndWatermarks.keyBy(0);
//Using sliding window of 20 seconds which slide by 5 seconds.
SingleOutputStreamOperator<Tuple4<String, Integer, Long, String>> aggregatedStream = keyedStream.window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5)))
.aggregate(new AggregateHashTagCountFunction()).uid("agg-123");
aggregatedStream.writeAsText("C:\\Users\\outfile", WriteMode.NO_OVERWRITE).setParallelism(1).uid("write-1");
env.execute("twitter-analytics");
If you are not satisfied with the built-in sinks, you can define a custom sink:
stream.addSink(new MyCustomSink ...)
MyCustomSink should implement SinkFunction.
Your custom sink will contain a FileWriter and e.g. a counter.
Every time the sink is invoked, it will write to "/path/to/file + counter.yourFileExtension"
https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/org/apache/flink/streaming/api/functions/sink/SinkFunction.html
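A minimal sketch of such a sink (the class name and file path are placeholders; this naive version writes one file per invocation of the sink, not one file per window firing):

import java.io.FileWriter;

import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class MyCustomSink implements SinkFunction<String> {

    private int counter = 0;

    @Override
    public void invoke(String value) throws Exception {
        // each invocation writes to a fresh file: /path/to/file-<counter>.<extension>
        try (FileWriter writer = new FileWriter("/path/to/file-" + counter++ + ".txt")) {
            writer.write(value);
        }
    }
}

With the sink's parallelism set to 1, a plain counter field is enough; grouping all records of one window firing into the same file would additionally require deriving the file name from something like the window end timestamp.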
Is there a way to count the number of unique words in a time window stream with Flink Streaming? I see this question, but I don't know how to implement the time window.
Sure, that's pretty straightforward. If you want an aggregation across all of the input records during each time window, then you'll need to use one of the flavors of windowAll(), which means you won't be using a KeyedStream and you cannot operate in parallel.
You'll need to decide if you want tumbling windows or sliding windows, and whether you are operating in event time or processing time.
But roughly speaking, you'll do something like this:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource( ... )
    .timeWindowAll(Time.minutes(15))
    .apply(new UniqueWordCounter())
    .print();

env.execute();
Your UniqueWordCounter will be a WindowFunction that receives an iterable of all the words in a window, and returns the number of unique words.
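In the non-keyed windowAll() case this is an AllWindowFunction; a minimal sketch (assuming the stream elements are individual words as Strings) might look like this:

import java.util.HashSet;
import java.util.Set;

import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Collects the words of one window into a set and emits the set's size.
public class UniqueWordCounter implements AllWindowFunction<String, Integer, TimeWindow> {

    @Override
    public void apply(TimeWindow window, Iterable<String> words, Collector<Integer> out) {
        Set<String> unique = new HashSet<>();
        for (String word : words) {
            unique.add(word);
        }
        out.collect(unique.size());
    }
}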
On the other hand, if you are using a KeyedStream and want to count unique words for each key, modify your application accordingly:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource( ... )
    .keyBy( ... )
    .timeWindow(Time.minutes(15))
    .apply(new UniqueWordCounter())
    .print();

env.execute();
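In the keyed case the counter implements WindowFunction instead, since apply() also receives the key; a rough sketch, again assuming String words and (hypothetically) String keys:

import java.util.HashSet;
import java.util.Set;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Emits (key, number of unique words for that key) once per window.
public class KeyedUniqueWordCounter implements WindowFunction<String, Tuple2<String, Integer>, String, TimeWindow> {

    @Override
    public void apply(String key, TimeWindow window, Iterable<String> words, Collector<Tuple2<String, Integer>> out) {
        Set<String> unique = new HashSet<>();
        for (String word : words) {
            unique.add(word);
        }
        out.collect(Tuple2.of(key, unique.size()));
    }
}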
I'm writing a Hadoop application that calculates map data at a certain resolution. My input files are tiles of a map, named according to the QuadTile principle. I need to subsample those, and stitch them together until I have a certain higher-level tile which covers a larger area at a lower resolution. Like zooming out in Google Maps.
Currently my Mapper subsamples tiles and my Reducer combines tiles at a certain level and forms tiles of one level up. So far so good. But depending on which tile I need, I have to repeat those map and reduce steps x times, which I have not been able to do so far.
What would be the best way to do so? Is it possible without explicitly saving the tiles in some temp directory and starting a new MapReduce job on those temp dirs until I get what I want? What I think would be the perfect solution is something roughly like 'while(context.hasMoreThanOneKey()){iterate mapreduce}'.
Following an answer, I have now written a class TileJob which extends Job. However, the MapReduce jobs are still not chained. Could you tell me what I'm doing wrong?
public boolean waitForCompletion(boolean verbose) throws IOException, InterruptedException, ClassNotFoundException {
    if (desiredkeylength != currentinputkeylength - 1) {
        System.out.println("In loop, setting input at " + tempout);
        String tempin = tempout;
        FileInputFormat.setInputPaths(this, tempin);
        tempout = (output + currentinputkeylength + "/");
        FileOutputFormat.setOutputPath(this, new Path(tempout));
        System.out.println("Setting output at " + tempout);
        currentinputkeylength--;
        Configuration conf = new Configuration();
        TileJob job = new TileJob(conf);
        job.setJobName(getJobName());
        job.setUpJob(tempin, tempout, tiletogenerate, currentinputkeylength);
        return job.waitForCompletion(verbose);
    } else {
        //desiredkeylength == currentkeylength-1
        System.out.println("In else, setting input at " + tempout);
        String tempin = tempout;
        FileInputFormat.setInputPaths(this, tempin);
        tempout = output;
        FileOutputFormat.setOutputPath(this, new Path(tempout));
        System.out.println("Setting output at " + tempout);
        currentinputkeylength--;
        Configuration conf = new Configuration();
        TileJob job = new TileJob(conf);
        job.setJobName(getJobName());
        job.setUpJob(tempin, tempout, tiletogenerate, currentinputkeylength);
        currentinputkeylength--;
        return super.waitForCompletion(verbose);
    }
}
Usually you kick off a MapReduce step by having a driver class main method that configures the Job, the Configuration, and the format types (input and output). Once everything's ready to go, that main method calls Job::waitForCompletion(), which submits the job and waits for it to complete before continuing.
You can wrap some of that logic in a loop that repeatedly calls Job::waitForCompletion() until your criteria are met. You can implement the criteria using counters. Put logic into your reduce() method to set or increment a counter with the number of keys. Your loop in the driver class can get the value of that (distributed) counter from the Job instance, and you can code your while expression using that value.
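For example, the reducer side could look roughly like this (a sketch only; the counter group and name "Total"/"Keys" simply match the driver snippet further down, and the key/value types are hypothetical):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TileReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // one increment per distinct key handled by this reduce call
        context.getCounter("Total", "Keys").increment(1);
        // ... combine the tiles for this key and write the stitched tile ...
    }
}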
What file locations you use is up to you. Inside this driver loop you can change the file location for the inputs and outputs, or keep them the same.
I should probably add that you ought to go ahead and create a new Job and Configuration instance inside the loop. I don't know that those objects are reusable in this situation.
public static void main(String[] args) throws Exception {
    long keys = 2;
    boolean completed = true;
    while (completed && keys > 1) {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Do all your job configuration here
        completed = job.waitForCompletion(true);
        if (completed) {
            keys = job.getCounters().findCounter("Total", "Keys").getValue();
        }
    }
}