I have a question.
Based on the timestamp in the UserData class, I would like to build logic that excludes data that has arrived N or more times within 1 minute.
The UserData class has a timestamp field:
class UserData {
    public Timestamp timestamp;
    public String userId;
}
At first I tried to use a tumbling window.
SingleOutputStreamOperator<UserData> validStream =
    stream.keyBy((KeySelector<UserData, String>) value -> value.userId)
          .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
          .process(new ValidProcessWindow());
public class ValidProcessWindow extends ProcessWindowFunction<UserData, UserData, String, TimeWindow> {

    private int validCount = 10;

    @Override
    public void process(String key, Context context, Iterable<UserData> elements, Collector<UserData> out) throws Exception {
        int count = -1;
        for (UserData element : elements) {
            count++; // starts at 0
            if (count >= validCount) { // past the valid click count: drop
                continue;
            }
            out.collect(element);
        }
    }
}
However, a tumbling processing-time window is aligned to the system clock, so it ignores the timestamp carried by the UserData class.
How can I window the stream based on the UserData class's own timestamp?
Thanks.
Additional Information
I now use code like this:
stream.assignTimestampsAndWatermarks(
        WatermarkStrategy.<UserData>forBoundedOutOfOrderness(Duration.ofSeconds(1))
            .withTimestampAssigner((event, timestamp) -> Timestamps.toMillis(event.timestamp)))
    .keyBy((KeySelector<UserData, String>) value -> value.userId)
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .process(new ValidProcessWindow());
I ran a test with 150 sample records, where each record's timestamp increased by 1 second.
The result was |1,2,3....59| |60,61....119|.
I waited for the last 30 records, but they were never processed.
I expected |1,2,3....59| |60,61....119| |120...149|.
How can I get the remaining records?
Self Answer
I found the cause.
It is because I used only 150 sample records.
With event time, Flink cannot make progress if there are no further elements to process.
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/event_time.html#idling-sources
So I tested with the 150 sample records plus dummy records (the dummy records' timestamps also increased by 1 second each).
I then received the correct output: |1,2,3....59| |60,61....119| |120...149|.
Thank you for the help.
So as far as I understand your problem, you should simply use a different time characteristic. Processing time uses the system time to calculate windows; you should use event time for your application. You can find more info about the proper usage of event time here.
EDIT:
That's how Flink works: there is no data to push the watermark past 150, so the window is never closed and thus produces no output. You can use a custom trigger that closes the window even if no watermark arrives, or inject some extra data to move the watermark forward.
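For illustration, here is a minimal sketch of such a custom trigger (the class name and timeout value are my own, not from Flink or the question): it behaves like the built-in event-time trigger, but additionally registers a processing-time timeout so the window eventually fires even when the watermark stalls.

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class EventTimeTriggerWithTimeout extends Trigger<Object, TimeWindow> {

    private final long timeoutMillis;

    public EventTimeTriggerWithTimeout(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) {
        if (window.maxTimestamp() <= ctx.getCurrentWatermark()) {
            return TriggerResult.FIRE; // watermark already passed the end of the window
        }
        ctx.registerEventTimeTimer(window.maxTimestamp());
        // wall-clock fallback in case the watermark never advances past the window end
        ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + timeoutMillis);
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE; // timeout reached: fire with whatever has arrived
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}

You would attach it with .trigger(new EventTimeTriggerWithTimeout(60_000)) after the .window(...) call. Note this is only a sketch: a production version should also remember and delete the processing-time timer in clear() and guard against the window firing twice.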
Related
First let me ask my question, and then could you please check my assumption about the apply method?
Question: If my application creates approximately 1,500,000 records in every one-minute interval, and the Flink job reads these records from a Kafka consumer with, say, 15+ different operators, could this logic create latency, backpressure, etc.? (You may assume that parallelism is 16.)
public class Sample {
    //op1 =
    kafkaSource
        .keyBy(something)
        .timeWindow(Time.minutes(1))
        .apply(new ApplySomething())
        .name("Name")
        .addSink(kafkaSink);
    //op2 =
    kafkaSource
        .keyBy(something2)
        .timeWindow(Time.seconds(1)) // let's assume that this one is one second
        .apply(new ApplySomething2())
        .name("Name")
        .addSink(kafkaSink);
    // ...
    //op16 =
    kafkaSource
        .keyBy(something16)
        .timeWindow(Time.minutes(1))
        .apply(new ApplySomething16())
        .name("Name")
        .addSink(kafkaSink);
}
// ..
public class ApplySomething ... {

    private AnyObject object;
    private int threshold = 30; // 30, 40, 100 ... (differs per operator)

    @Override
    public void open(Configuration parameters) throws Exception {
        object = new AnyObject();
    }

    @Override
    public void apply(Tuple tuple, TimeWindow window, Iterable<Record> input, Collector<Result> out) throws Exception {
        int counter = 0;
        for (Record each : input) {
            counter += each.getValue();
            if (counter > threshold) {
                out.collect(each.getResult());
                return;
            }
        }
    }
}
If yes, should I use a flatMap with state (RocksDB) instead of a timeWindow?
My prediction is "YES". Let me explain why I think so:
If parallelism is 16, there will be 16 different instances of each individual ApplySomething1(), ApplySomething2()... ApplySomething16(), and also sixteen AnyObject() instances per ApplySomething..() class.
When the application runs, if the number of keyBy(something) partitions is larger than 16 (assume my application sees 1,000,000 different values of something per day), then some ApplySomething..() instances will handle several different keys, so one apply() call has to wait for the others' for loops before processing. Will this create latency?
Flink's time windows are aligned to the epoch (e.g., if you have a bunch of hourly windows, they will all trigger on the hour). So if you do intend to have a bunch of different windows in your job like this, you should configure them to have distinct offsets, so they aren't all being triggered simultaneously. Doing that will spread out the load. That will look something like this:
.window(TumblingProcessingTimeWindows.of(Time.minutes(1), Time.seconds(15)))
(or use TumblingEventTimeWindows as the case may be). This will create minute-long windows that trigger at 15 seconds after each minute.
Whenever your use case permits, you should use incremental aggregation (via reduce or aggregate), rather than using a WindowFunction (or ProcessWindowFunction) that has to collect all of the events assigned to each window in a list before processing them as a sort of mini-batch.
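For example, here is a rough sketch of an incremental version of the counting logic above (it reuses the Record type from your snippet; note that a pure AggregateFunction emits once when the window fires, so the early-exit-at-threshold behavior is not reproduced here):

import org.apache.flink.api.common.functions.AggregateFunction;

// Keeps a single running Long per key and window instead of buffering every event.
public class SumAggregate implements AggregateFunction<Record, Long, Long> {

    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(Record value, Long accumulator) {
        return accumulator + value.getValue();
    }

    @Override
    public Long getResult(Long accumulator) {
        return accumulator;
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b;
    }
}

You would use it as kafkaSource.keyBy(something).timeWindow(Time.minutes(1)).aggregate(new SumAggregate()). If you also need the threshold check or window metadata at emission time, there is an aggregate(aggFn, windowFn) overload that combines incremental aggregation with a ProcessWindowFunction.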
A keyed time window will keep its state in RocksDB, assuming you have configured RocksDB as your state backend. You don't need to switch to using a RichFlatMap to have access to RocksDB. (Moreover, since a flatMap can't use timers, I assume you would really end up using a process function instead.)
While any of the parallel instances of the window operator is busy executing its window function (one of the ApplySomethings) you are correct in thinking that that task will not be doing anything else -- and thus it will (unless it completes very quickly) create temporary backpressure. You will want to increase the parallelism as needed so that the job can satisfy your requirements for throughput and latency.
I am new to Flink and have a use case I do not know how to approach.
I have events coming in like this:
{
    "id" : "AAA",
    "event" : "someEvent",
    "eventTime" : "2019/09/14 14:04:25:235"
}
I want to create a table (in elastic / oracle) that tracks user inactivity.
id || lastEvent || lastEventTime || inactivityTime
My final goal is to alert when some group of users has been inactive for more than X minutes.
This table should be updated every 1 minute.
I do not have prior knowledge of all my ids; new ids can arrive at any time.
I thought I could just use a simple process function that emits the event if one is present, or else emits a timestamp (which will update the inactivity column).
Questions
Regarding my solution - I still need another piece of code that checks whether the event is null or not and updates accordingly: if null --> update inactivity, else update lastEvent.
Can / should this code be in the same Flink/Spark job?
How do I deal with new ids?
Also, how can this use case be handled in Spark Structured Streaming?
input
    .keyBy("id")
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .process(new MyProcessWindowFunction());
public class MyProcessWindowFunction
        extends ProcessWindowFunction<Tuple2<String, Long>, Tuple2<Long, Object>, String, TimeWindow> {

    @Override
    public void process(String key, Context context, Iterable<Tuple2<String, Long>> input, Collector<Tuple2<Long, Object>> out) {
        // keep only the last element of the window; a single iterator must be reused,
        // since calling input.iterator() repeatedly restarts at the first element
        Object obj = null;
        Iterator<Tuple2<String, Long>> it = input.iterator();
        while (it.hasNext()) {
            obj = it.next();
        }
        if (obj != null) {
            out.collect(Tuple2.of(context.timestamp(), obj));
        } else {
            out.collect(Tuple2.of(context.timestamp(), null));
        }
    }
}
I would use a KeyedProcessFunction instead of the Windowing API for these requirements. [1] The stream is keyed by id.
KeyedProcessFunction#processElement is invoked for each record of the stream, and you can keep state and schedule timers. You could schedule a timer every minute and, for each id, store the last event seen in state. When the timer fires, you emit the event and clear the state.
Personally, I would only store the last event seen in the database and calculate the inactivity time when querying the database. This way you can clear the state after each emission, and the possibly unbounded key space does not result in ever-growing managed state in Flink.
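A rough sketch of the timer-based variant described above (the Event POJO and its fields are my assumptions, not from your message):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class LastEventFunction extends KeyedProcessFunction<String, Event, Event> {

    private transient ValueState<Event> lastEvent;

    @Override
    public void open(Configuration parameters) {
        lastEvent = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastEvent", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        if (lastEvent.value() == null) {
            // first event for this id: schedule the per-minute emission
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 60_000);
        }
        lastEvent.update(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
        Event event = lastEvent.value();
        if (event != null) {
            out.collect(event); // the sink writes/updates the row for this id
            lastEvent.clear();  // clearing keeps state bounded even with unbounded ids
        }
    }
}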
Hope this helps.
[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html
I have keyed events coming in on a stream that I would like to accumulate by key, up to a timeout (say, 5 minutes), and then process the events accumulated up to that point (and ignore everything after for that key, but first things first).
I am new to Flink, but conceptually I think I need something like the code below.
DataStream<Tuple2<String, String>> dataStream = see
.socketTextStream("localhost", 9999)
.flatMap(new Splitter())
.keyBy(0)
.window(GlobalWindows.create())
.trigger(ProcessingTimeTrigger.create()) // how do I set the timeout value?
.fold(new Tuple2<>("", ""), new FoldFunction<Tuple2<String, String>, Tuple2<String, String>>() {
public Tuple2<String, String> fold(Tuple2<String, String> agg, Tuple2<String, String> elem) {
if ( agg.f0.isEmpty()) {
agg.f0 = elem.f0;
}
if ( agg.f1.isEmpty()) {
agg.f1 = elem.f1;
} else {
agg.f1 = agg.f1 + "; " + elem.f1;
}
return agg;
}
});
This code does NOT compile because a ProcessingTimeTrigger needs a TimeWindow, and GlobalWindow is not a TimeWindow. So...
How can I accomplish keyed window timeouts in Flink?
You will have a much easier time if you approach this with a KeyedProcessFunction.
I suggest establishing an item of keyed ListState in the open() method of a KeyedProcessFunction. In the processElement() method, if the list is empty, set a processing-time timer (a per-key timer, relative to the current time) to fire when you want the window to end. Then append the incoming event to the list.
When the timer fires the onTimer() method will be called, and you can iterate over the list, produce a result, and clear the list.
To arrange for doing all of this only once per key, add a ValueState<Boolean> to the KeyedProcessFunction to keep track of this, as in the sketch below. (Note that if your key space is unbounded, you should think about a strategy for eventually expiring the state for stale keys.)
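A minimal sketch of this recipe (the class and state names are mine; it mirrors your Tuple2<String, String> stream and the "; "-joining of your fold):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TimeoutAccumulator extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>> {

    private static final long TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes

    private transient ListState<Tuple2<String, String>> buffer;
    private transient ValueState<Boolean> done;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(new ListStateDescriptor<>(
                "buffer", TypeInformation.of(new TypeHint<Tuple2<String, String>>() {})));
        done = getRuntimeContext().getState(new ValueStateDescriptor<>("done", Boolean.class));
    }

    @Override
    public void processElement(Tuple2<String, String> event, Context ctx,
                               Collector<Tuple2<String, String>> out) throws Exception {
        if (Boolean.TRUE.equals(done.value())) {
            return; // this key's window already closed: ignore everything after
        }
        Iterable<Tuple2<String, String>> current = buffer.get();
        if (current == null || !current.iterator().hasNext()) {
            // first event for this key: start the timeout clock
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + TIMEOUT_MS);
        }
        buffer.add(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, String>> out) throws Exception {
        String key = null;
        StringBuilder agg = new StringBuilder();
        for (Tuple2<String, String> e : buffer.get()) {
            if (key == null) {
                key = e.f0;      // keep f0 from the first element, as the fold did
            } else {
                agg.append("; ");
            }
            agg.append(e.f1);
        }
        out.collect(Tuple2.of(key, agg.toString()));
        buffer.clear();      // drop the accumulated events
        done.update(true);   // remember that this key has already been processed
    }
}

You would attach it with .keyBy(t -> t.f0).process(new TimeoutAccumulator()) in place of the window/trigger/fold chain.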
The documentation describes how to use Process Functions and how to work with state. You can find additional examples in the Flink training site, such as this exercise.
I have continuous JSONArray data produced to a Kafka topic, and I want to process the records with the event-time characteristic. In order to reach this goal, I have to assign a timestamp and watermark to each record contained in the JSONArray.
I didn't find a convenient way to achieve this. My solution is consuming data from DataStreamSource<List<MockData>>, then iterating over the List and collecting each object to a downstream with an anonymous ProcessFunction, and finally assigning watermarks to this downstream.
The major code is shown below:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);
SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
    .process(new ProcessFunction<List<MockData>, MockData>() {
        @Override
        public void processElement(List<MockData> value, Context ctx, Collector<MockData> out)
                throws Exception {
            value.forEach(mockData -> out.collect(mockData));
        }
    });
convertToPojo.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
        @Override
        public long extractTimestamp(MockData element) {
            return element.getTimestamp();
        }
    });
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = convertToPojo
    .keyBy("country")
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
    .process(new FlinkEventTimeCountFunction()).name("count elements");
The code looks fine and runs without errors, but the ProcessWindowFunction is never triggered. I traced through the Flink source code and found that EventTimeTrigger never returns TriggerResult.FIRE, because TriggerContext.getCurrentWatermark returns Long.MIN_VALUE all the time.
What's the proper way to process a List in event time? Any suggestion will be appreciated.
The problem is that you are applying the keyBy and window operations to the convertToPojo stream, rather than the stream with timestamps and watermarks (which you didn't assign to a variable).
If you write the code more or less like this, it should work:
listDataStreamSource = KafkaSource ...
convertToPojo = listDataStreamSource.process ...
pojoPlusWatermarks = convertToPojo.assignTimestampsAndWatermarks ...
countStream = pojoPlusWatermarks.keyBy ...
Calling assignTimestampsAndWatermarks on the convertToPojo stream does not modify that stream, but rather creates a new datastream object that includes timestamps and watermarks. You need to apply your windowing to that new datastream.
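Concretely, using the names from your snippet, the fixed wiring might look like this (only the assignment to a new variable changes; your functions stay the same):

SingleOutputStreamOperator<MockData> pojoPlusWatermarks = convertToPojo
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(MockData element) {
                        return element.getTimestamp();
                    }
                });

SingleOutputStreamOperator<Tuple2<String, Long>> countStream = pojoPlusWatermarks
        .keyBy("country")
        .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
        .process(new FlinkEventTimeCountFunction())
        .name("count elements");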
I have read a few articles on Flink, and while reading a blog post about Flink I came across the phrase
"with at most 60 seconds serving delay (events are out of order by max. 1 minute)"
Is defining the out-of-order event duration part of the "watermarks" technique in Flink, and if not, what is its internal purpose?
I'll try to briefly explain how to manage out-of-order events in Flink. Event time, out-of-orderness, and watermarks are very closely related concepts, and I think you will understand that phrase better once you understand their relation.
Watermarks and out-of-orderness are concepts of event-time-based DataStreams. A watermark can be described as a time mark after which you assume no more events with earlier timestamps will arrive. There are several mechanisms to emit watermarks in Flink; for example, you can emit a watermark each time you receive an event. Time windows also use the watermarks to determine when it is the right time to evaluate.
That said, "watermarks" and "out of order" are tightly coupled concepts, as you use the watermark to achieve that out-of-order management. In your case, to allow that 60-second max delay, it's as simple as setting the watermark 60 seconds behind the max timestamp received.
There is a nice example on the official site about managing out-of-order events:
/**
 * This generator generates watermarks assuming that elements arrive out of order to a certain degree only.
 * The latest elements for a certain timestamp t will arrive at most n milliseconds after the earliest
 * elements for timestamp t.
 */
public class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks<MyEvent> {

    private final long maxOutOfOrderness = 3500; // 3.5 seconds

    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        long timestamp = element.getCreationTime();
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
        return timestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as the current highest timestamp minus the out-of-orderness bound
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}