flink / spark stream - track user inactivity - apache-flink

I am new to Flink and have a use case I do not know how to approach.
I have events coming in like this:
{
  "id": "AAA",
  "event": "someEvent",
  "eventTime": "2019/09/14 14:04:25:235"
}
I want to create a table (in elastic / oracle) that tracks user inactivity.
id || lastEvent || lastEventTime || inactivityTime
My final goal is to alert if some group of users has been inactive for more than X minutes.
This table should be updated every 1 minute.
I do not have prior knowledge of all my ids; new ids can arrive at any time.
I thought I could use a simple process function that emits the event if one is present, or else emits a timestamp (which would update the inactivity column).
Questions
Regarding my solution: I still need another piece of code that checks whether the event is null and updates accordingly. If null, update inactivity; else update lastEvent.
Can/should this code be in the same Flink/Spark job?
How do I deal with new ids?
Also, how can this use case be handled in Spark Structured Streaming?
input
    .keyBy("id")
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .process(new MyProcessWindowFunction());

public class MyProcessWindowFunction
        extends ProcessWindowFunction<Tuple2<String, Long>, Tuple2<Long, Object>, String, TimeWindow> {

    @Override
    public void process(String key, Context context, Iterable<Tuple2<String, Long>> input,
                        Collector<Tuple2<Long, Object>> out) {
        // keep only the last element of the window; iterate the Iterable once --
        // calling input.iterator() inside a while condition would create a new
        // iterator each pass and never terminate
        Tuple2<String, Long> obj = null;
        for (Tuple2<String, Long> element : input) {
            obj = element;
        }
        // obj is null when the window contained no events
        out.collect(Tuple2.of(context.timestamp(), obj));
    }
}

I would use a KeyedProcessFunction instead of the Windowing API for these requirements. [1] The stream is keyed by id.
KeyedProcessFunction#processElement is invoked for each record of the stream, and you can keep state and schedule timers. You could schedule a timer every minute and, for each id, store the last event seen in state. When the timer fires, you emit the stored event (if any) and clear the state.
Personally, I would only store the last event seen in the database and calculate the inactivity time when querying the database. This way you can clear state after each emission, and the possibly unbounded key space does not result in ever-growing managed state in Flink.
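A minimal sketch of that approach might look like the following (the Event POJO and field names here are illustrative, not from your message; note that new ids are handled automatically, since keyed state and timers are created lazily per key):
public class InactivityTracker extends KeyedProcessFunction<String, Event, Tuple2<String, Event>> {

    private transient ValueState<Event> lastEvent;

    @Override
    public void open(Configuration config) {
        lastEvent = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-event", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Tuple2<String, Event>> out)
            throws Exception {
        if (lastEvent.value() == null) {
            // first event for this id since the last emission: fire in one minute
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 60_000);
        }
        lastEvent.update(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Event>> out)
            throws Exception {
        // emit the last event seen for this id, then clear state so the
        // unbounded key space does not grow Flink's managed state forever
        out.collect(Tuple2.of(ctx.getCurrentKey(), lastEvent.value()));
        lastEvent.clear();
    }
}
The output stream can then feed your Elasticsearch/Oracle sink, and the inactivity time is derived at query time as now() minus lastEventTime.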
Hope this helps.
[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html

Related

How to handle a Flink window based on the stream data's timestamp?

I have a question.
Based on the timestamp in the class, I would like to build logic that excludes data arriving N or more times within 1 minute.
The UserData class has a timestamp variable.
class UserData {
    public Timestamp timestamp;
    public String userId;
}
At first I tried to use a tumbling window.
SingleOutputStreamOperator<UserData> validStream =
    stream.keyBy((KeySelector<UserData, String>) value -> value.userId)
          .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
          .process(new ValidProcessWindow());

public class ValidProcessWindow extends ProcessWindowFunction<UserData, UserData, String, TimeWindow> {

    private int validCount = 10;

    @Override
    public void process(String key, Context context, Iterable<UserData> elements,
                        Collector<UserData> out) throws Exception {
        int count = -1;
        for (UserData element : elements) {
            count++; // starts at 0
            if (count >= validCount) { // beyond the valid click count
                continue;
            }
            out.collect(element);
        }
    }
}
However, a tumbling processing-time window is computed from the system clock, so it ignores the timestamp in the UserData class and is not suitable.
How can I window the stream based on the UserData class's timestamp?
Thanks.
Additional Information
I use code like this.
stream.assignTimestampsAndWatermarks(
        WatermarkStrategy.<UserData>forBoundedOutOfOrderness(Duration.ofSeconds(1))
            .withTimestampAssigner((event, timestamp) -> Timestamps.toMillis(event.timestamp)))
    .keyBy((KeySelector<UserData, String>) value -> value.userId)
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .process(new ValidProcessWindow());
I ran a test with 150 sample records, where the timestamp of each record increased by 1 second.
The result was |1,2,3....59| |60,61....119|.
I waited for the last 30 records, but they were never processed.
I expected |1,2,3....59| |60,61....119| |120...149|.
How can I get the remaining data?
Self Answer
I found the cause: I was using only 150 sample records.
When using event time, Flink cannot make progress if there are no further elements to be processed.
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/event_time.html#idling-sources
So I tested with the 150 sample records plus dummy records (the timestamp of each dummy record also increased by 1 second).
I then received the correct output: |1,2,3....59| |60,61....119| |120...149|.
Thank you for help.
As far as I understand your problem, you should just use a different time characteristic. Processing time uses the system time to calculate windows; you should use event time for your application. You can find more info about the proper usage of event time here.
EDIT:
That's how Flink works: there is no data to push the watermark past 150, so the window is not closed and thus produces no output. You can use a custom trigger that closes the window even if the watermark has not advanced, or inject some data to move the watermark.
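For the custom-trigger route, a sketch could look like the following, assuming Flink 1.12+ where ProcessingTimeoutTrigger is available (on older versions you would have to write a similar trigger yourself):
stream
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.<UserData>forBoundedOutOfOrderness(Duration.ofSeconds(1))
            .withTimestampAssigner((event, ts) -> Timestamps.toMillis(event.timestamp)))
    .keyBy((KeySelector<UserData, String>) value -> value.userId)
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    // fire via the normal event-time trigger, but fall back to a
    // processing-time firing if no watermark arrives within 10 seconds
    .trigger(ProcessingTimeoutTrigger.of(EventTimeTrigger.create(), Duration.ofSeconds(10)))
    .process(new ValidProcessWindow());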

Instance of object related to flink Parallelism & Apply Method

First let me ask my question, and then could you please clarify my assumption about the apply method?
Question: If my application creates approximately 1,500,000 records in every one-minute interval, and the Flink job reads these records from a Kafka consumer with, let's say, 15+ different operators, could this logic create latency, backpressure, etc.? (You may assume that parallelism is 16.)
public class Sample {
    //op1 =
    kafkaSource
        .keyBy(something)
        .timeWindow(Time.minutes(1))
        .apply(new ApplySomething())
        .name("Name")
        .addSink(kafkaSink);

    //op2 =
    kafkaSource
        .keyBy(something2)
        .timeWindow(Time.seconds(1)) // let's assume that this one is one second
        .apply(new ApplySomething2())
        .name("Name")
        .addSink(kafkaSink);

    // ...

    //op16 =
    kafkaSource
        .keyBy(something16)
        .timeWindow(Time.minutes(1))
        .apply(new ApplySomething16())
        .name("Name")
        .addSink(kafkaSink);
}
// ..
public class ApplySomething ... { // presumably extends RichWindowFunction<Record, Result, Tuple, TimeWindow>
    private AnyObject object;
    private int threshold = 30; // or 40, 100, ... (varies per operator)

    @Override
    public void open(Configuration parameters) throws Exception {
        object = new AnyObject();
    }

    @Override
    public void apply(Tuple tuple, TimeWindow window, Iterable<Record> input,
                      Collector<Result> out) throws Exception {
        int counter = 0;
        for (Record each : input) {
            counter += each.getValue();
            if (counter > threshold) {
                out.collect(each.getResult());
                return;
            }
        }
    }
}
If yes, should I use flatMap with state (RocksDB) instead of timeWindow?
My prediction is "YES". Let me explain why I think so:
If parallelism is 16, there will be 16 different instances of each individual ApplySomething1(), ApplySomething2(), ..., ApplySomething16(), and also sixteen AnyObject() instances per ApplySomething..() class.
When the application runs, if the number of keyBy(something) partitions is larger than 16 (assume my application sees 1,000,000 different values of something per day), then some of the ApplySomething..() instances will handle several different keys, so one apply() has to wait for the others' for loops before processing. Will this create latency?
Flink's time windows are aligned to the epoch (e.g., if you have a bunch of hourly windows, they will all trigger on the hour). So if you do intend to have a bunch of different windows in your job like this, you should configure them to have distinct offsets, so they aren't all being triggered simultaneously. Doing that will spread out the load. That will look something like this
.window(TumblingProcessingTimeWindows.of(Time.minutes(1), Time.seconds(15)))
(or use TumblingEventTimeWindows as the case may be). This will create minute-long windows that trigger at 15 seconds after each minute.
Whenever your use case permits, you should use incremental aggregation (via reduce or aggregate), rather than using a WindowFunction (or ProcessWindowFunction) that has to collect all of the events assigned to each window in a list before processing them as a sort of mini-batch.
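As a sketch of that incremental-aggregation idea (not your exact logic: SumAggregate and the downstream threshold filter are illustrative, and this fires once per window rather than mid-window), an AggregateFunction keeps a single Long accumulator per key and window instead of buffering every Record:
public class SumAggregate implements AggregateFunction<Record, Long, Long> {
    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(Record value, Long accumulator) {
        return accumulator + value.getValue();
    }

    @Override
    public Long getResult(Long accumulator) {
        return accumulator;
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b;
    }
}

// usage: the threshold comparison then happens once per window result
kafkaSource
    .keyBy(something)
    .timeWindow(Time.minutes(1))
    .aggregate(new SumAggregate())
    .filter(sum -> sum > 30)
    .print(); // or a sink that accepts Long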
A keyed time window will keep its state in RocksDB, assuming you have configured RocksDB as your state backend. You don't need to switch to using a RichFlatMap to have access to RocksDB. (Moreover, since a flatMap can't use timers, I assume you would really end up using a process function instead.)
While any of the parallel instances of the window operator is busy executing its window function (one of the ApplySomethings) you are correct in thinking that that task will not be doing anything else -- and thus it will (unless it completes very quickly) create temporary backpressure. You will want to increase the parallelism as needed so that the job can satisfy your requirements for throughput and latency.

Flink re-scalable keyed stream stateful function

I have the following Flink job, where I tried to use a keyed-stream stateful function (MapState) with the RocksDB backend:
environment
    .addSource(consumer).name("MyKafkaSource").uid("kafka-id")
    .flatMap(pojoMapper).name("MyMapFunction").uid("map-id")
    .keyBy(new MyKeyExtractor())
    .map(new MyRichMapFunction()).name("MyRichMapFunction").uid("rich-map-id")
    .addSink(sink).name("MyFileSink").uid("sink-id");
MyRichMapFunction is a stateful function that extends RichMapFunction and has the following code:
public static class MyRichMapFunction extends RichMapFunction<MyEvent, MyEvent> {

    private transient MapState<String, Boolean> cache;

    @Override
    public void open(Configuration config) {
        MapStateDescriptor<String, Boolean> descriptor =
            new MapStateDescriptor<>("seen-values",
                TypeInformation.of(new TypeHint<String>() {}),
                TypeInformation.of(new TypeHint<Boolean>() {}));
        cache = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public MyEvent map(MyEvent value) throws Exception {
        if (cache.contains(value.getEventId())) {
            value.setIsSeenAlready(Boolean.TRUE);
            return value;
        }
        value.setIsSeenAlready(Boolean.FALSE);
        cache.put(value.getEventId(), Boolean.TRUE);
        return value;
    }
}
In the future, I would like to rescale the parallelism (from 2 to 4), so my question is: how can I achieve re-scalable keyed state, so that after changing the parallelism the corresponding keyed cache data ends up in its corresponding task slot? I tried to explore this and found the documentation here. According to it, re-scalable operator state can be achieved by implementing the ListCheckpointed interface, which provides the snapshotState/restoreState methods for that purpose. But I am not sure how re-scalable keyed state (MyRichMapFunction) can be achieved. Do I need to implement the ListCheckpointed interface in my MyRichMapFunction class? If yes, how can I redistribute the cache according to the new parallelism's key hash in the restoreState method? (My MapState will hold a huge number of keys with TTL enabled; say, at most 1 billion keys at any point in time.) Could someone please help me with this, or point me to an example? That would be great too.
The code you've written is already rescalable; Flink's managed keyed state is rescalable by design. Keyed state is rescaled by rebalancing the assignment of keys to instances. (You can think of keyed state as a sharded key/value store. Technically what happens is that consistent hashing is used to map keys to key groups, and each parallel instance is responsible for some of the key groups. Rescaling simply involves redistributing the key groups among the instances.)
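To make the key-group mechanics concrete, here is a toy illustration using Flink's internal KeyGroupRangeAssignment class (a runtime-internal API, shown only for demonstration; application code never needs to call it):
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyGroupDemo {
    public static void main(String[] args) {
        int maxParallelism = 128; // the default number of key groups
        String key = "event-42";

        // a key's key group depends only on the key and maxParallelism,
        // so it stays stable when the job is rescaled
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);

        // which operator instance owns that key group depends on the parallelism
        int ownerAt2 = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(maxParallelism, 2, keyGroup);
        int ownerAt4 = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(maxParallelism, 4, keyGroup);

        System.out.printf("key group %d is owned by subtask %d at p=2 and subtask %d at p=4%n",
                keyGroup, ownerAt2, ownerAt4);
    }
}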
The ListCheckpointed interface is for state used in a non-keyed context, so it's inappropriate for what you are doing. Note also that ListCheckpointed will be deprecated in Flink 1.11 in favor of the more general CheckpointedFunction.
One more thing: if MyKeyExtractor is keying by value.getEventId(), then you could be using ValueState<Boolean> for your cache, rather than MapState<String, Boolean>. This works because with keyed state there is a separate value of ValueState for every key. You only need to use MapState when you need to store multiple attribute/value pairs for each key in your stream.
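For illustration, a sketch of that ValueState variant (assuming MyKeyExtractor keys the stream by value.getEventId()):
public static class MyRichMapFunction extends RichMapFunction<MyEvent, MyEvent> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration config) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Types.BOOLEAN));
    }

    @Override
    public MyEvent map(MyEvent value) throws Exception {
        // one ValueState entry exists per key, so no map lookup is needed
        value.setIsSeenAlready(seen.value() != null);
        seen.update(Boolean.TRUE);
        return value;
    }
}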
Most of this is discussed in the Flink documentation under Hands-on Training, which includes an example that's very close to what you are doing.

how can I implement keyed window timeouts in Flink?

I have keyed events coming in on a stream that I would like to accumulate by key, up to a timeout (say, 5 minutes), and then process the events accumulated up to that point (and ignore everything after for that key, but first things first).
I am new to Flink, but conceptually I think I need something like the code below.
DataStream<Tuple2<String, String>> dataStream = see
    .socketTextStream("localhost", 9999)
    .flatMap(new Splitter())
    .keyBy(0)
    .window(GlobalWindows.create())
    .trigger(ProcessingTimeTrigger.create()) // how do I set the timeout value?
    .fold(new Tuple2<>("", ""), new FoldFunction<Tuple2<String, String>, Tuple2<String, String>>() {
        public Tuple2<String, String> fold(Tuple2<String, String> agg, Tuple2<String, String> elem) {
            if (agg.f0.isEmpty()) {
                agg.f0 = elem.f0;
            }
            if (agg.f1.isEmpty()) {
                agg.f1 = elem.f1;
            } else {
                agg.f1 = agg.f1 + "; " + elem.f1;
            }
            return agg;
        }
    });
This code does NOT compile because a ProcessingTimeTrigger needs a TimeWindow, and GlobalWindow is not a TimeWindow. So...
How can I accomplish keyed window timeouts in Flink?
You will have a much easier time if you approach this with a KeyedProcessFunction.
I suggest establishing an item of keyed ListState in the open() method of a KeyedProcessFunction. In the processElement() method, if the list is empty, set a processing-time timer (a per-key timer, relative to the current time) to fire when you want the window to end. Then append the incoming event to the list.
When the timer fires the onTimer() method will be called, and you can iterate over the list, produce a result, and clear the list.
To arrange for doing all of this only once per key, add a ValueState<Boolean> to the KeyedProcessFunction to keep track of this. (Note that if your key space is unbounded, you should think about a strategy for eventually expiring the state for stale keys.)
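Putting those pieces together, a minimal sketch might look like this (assuming your Tuple2<String, String> elements are keyed by f0 and a 5-minute timeout; the names are illustrative):
public class WindowTimeoutFunction
        extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>> {

    private transient ListState<Tuple2<String, String>> buffer;
    private transient ValueState<Boolean> done;

    @Override
    public void open(Configuration config) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", Types.TUPLE(Types.STRING, Types.STRING)));
        done = getRuntimeContext().getState(
                new ValueStateDescriptor<>("done", Types.BOOLEAN));
    }

    @Override
    public void processElement(Tuple2<String, String> event, Context ctx,
                               Collector<Tuple2<String, String>> out) throws Exception {
        if (Boolean.TRUE.equals(done.value())) {
            return; // ignore everything after the first window for this key
        }
        if (!buffer.get().iterator().hasNext()) {
            // first event for this key: start the 5-minute "window"
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 5 * 60 * 1000);
        }
        buffer.add(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, String>> out) throws Exception {
        // produce one aggregated result from the accumulated events
        StringBuilder agg = new StringBuilder();
        for (Tuple2<String, String> e : buffer.get()) {
            if (agg.length() > 0) {
                agg.append("; ");
            }
            agg.append(e.f1);
        }
        out.collect(Tuple2.of(ctx.getCurrentKey(), agg.toString()));
        buffer.clear();
        done.update(Boolean.TRUE);
    }
}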
The documentation describes how to use Process Functions and how to work with state. You can find additional examples in the Flink training site, such as this exercise.

Apache Flink, what is serving delay in event stream?

I have read a few articles on Flink, and while reading a blog post about Flink I came across the phrase
"with at most 60 seconds serving delay (events are out of order by max. 1 minute) "
Is this out-of-order event duration what the "watermarks" technique in Flink is defined by, and if not, what is its internal purpose?
I'll try to briefly explain how to manage out-of-order events in Flink. Event time, out-of-orderness, and watermarks are closely related concepts, and I think you will understand that phrase better once you understand their relation.
Watermarks and out-of-orderness are concepts of event-time-based DataStreams. A watermark can be described as a time mark past which you assume no more events will occur. There are several mechanisms for emitting watermarks in Flink, e.g., you can emit a watermark each time you receive an event. Also, time windows use watermarks to determine the right time to fire.
That said, the "watermarks" and "out of order" concepts are tightly coupled, as you use the watermark to manage out-of-orderness. In your case, to define that 60-second max delay, it's as simple as setting the watermark 60 seconds before the max timestamp received.
There is a nice example on the official site about managing out-of-order events:
/**
 * This generator generates watermarks assuming that elements come out of order to a certain degree only.
 * The latest elements for a certain timestamp t will arrive at most n milliseconds after the earliest
 * elements for timestamp t.
 */
public class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks<MyEvent> {

    private final long maxOutOfOrderness = 3500; // 3.5 seconds
    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        long timestamp = element.getCreationTime();
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
        return timestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as the current highest timestamp minus the out-of-orderness bound
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}
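In newer Flink versions (1.11+), the same idea can be expressed with the built-in WatermarkStrategy instead of a custom assigner; for instance, the blog's 60-second serving delay would look like this (MyEvent and getCreationTime() are borrowed from the example above):
DataStream<MyEvent> withTimestamps = events.assignTimestampsAndWatermarks(
        // watermark = max timestamp seen so far minus 60 seconds
        WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(60))
                .withTimestampAssigner((event, previousTimestamp) -> event.getCreationTime()));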
