Flink - how to aggregate in state

I have a keyed stream of data that looks like:
{
  summary: Integer,
  uid: String,
  key: String,
  .....
}
I need to aggregate the summary values over some time range, and once I reach a specific number, flush the summary and all of the UIDs that influenced it to a database/log file.
After the first flush, I want to discard all the UIDs from memory and just flush every new item immediately.
So I tried this AggregateFunction:
public class AggFunc implements AggregateFunction<Item, Acc, Tuple2<Integer, List<String>>> {
    private static final long serialVersionUID = 1L;

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    @Override
    public Acc add(Item value, Acc accumulator) {
        accumulator.inc(value.getSummary());
        accumulator.addUid(value.getUid());
        return accumulator;
    }

    @Override
    public Tuple2<Integer, List<String>> getResult(Acc accumulator) {
        List<String> newL = Lists.newArrayList(accumulator.getUids());
        accumulator.setUids(Lists.newArrayList());
        return Tuple2.of(accumulator.getSum(), newL);
    }

    @Override
    public Acc merge(Acc a, Acc b) {
        .....
    }
}
In the downstream process function, I flush the list to state, and if I need to save to the database I clear the state and store a flag in it to indicate that.
But this seems convoluted to me, and I'm not sure it would work well.
Is there a better solution to this situation?

Work with state inside a rich function: keep adding the uid to your state, and when the window triggers, flush the values. This page from the official documentation has an example.
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/state/state.html#using-keyed-state
For your case a ListState will work well.
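A minimal sketch of that approach (assuming the Item type from the question; THRESHOLD, the key selector, and the state names are illustrative placeholders, not a definitive implementation):

import java.util.Collections;
import java.util.List;
import com.google.common.collect.Lists;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class SummaryFlusher extends KeyedProcessFunction<String, Item, Tuple2<Integer, List<String>>> {
    private static final int THRESHOLD = 1000; // hypothetical flush threshold

    private transient ValueState<Integer> sum;
    private transient ListState<String> uids;
    private transient ValueState<Boolean> flushed;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Integer.class));
        uids = getRuntimeContext().getListState(new ListStateDescriptor<>("uids", String.class));
        flushed = getRuntimeContext().getState(new ValueStateDescriptor<>("flushed", Boolean.class));
    }

    @Override
    public void processElement(Item value, Context ctx, Collector<Tuple2<Integer, List<String>>> out) throws Exception {
        if (Boolean.TRUE.equals(flushed.value())) {
            // already flushed once for this key: emit every new item immediately
            out.collect(Tuple2.of(value.getSummary(), Collections.singletonList(value.getUid())));
            return;
        }
        int newSum = (sum.value() == null ? 0 : sum.value()) + value.getSummary();
        sum.update(newSum);
        uids.add(value.getUid());
        if (newSum >= THRESHOLD) {
            // first flush: emit the summary with all contributing uids,
            // then drop the uids from state
            out.collect(Tuple2.of(newSum, Lists.newArrayList(uids.get())));
            uids.clear();
            flushed.update(true);
        }
    }
}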
EDIT:
The solution above is for the non-window case. For the window case, simply use the aggregation together with a window function, which can be a rich window function (a sketch follows below).
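A hedged sketch of that windowed variant, reusing the question's AggFunc (the key selector Item::getKey and the window size are assumptions):

stream
    .keyBy(Item::getKey)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new AggFunc(),
        new ProcessWindowFunction<Tuple2<Integer, List<String>>,
                                  Tuple2<Integer, List<String>>, String, TimeWindow>() {
            @Override
            public void process(String key,
                                Context ctx,
                                Iterable<Tuple2<Integer, List<String>>> preAggregated,
                                Collector<Tuple2<Integer, List<String>>> out) {
                // exactly one pre-aggregated result per window; keyed state is
                // available here via getRuntimeContext() if a flushed flag is needed
                out.collect(preAggregated.iterator().next());
            }
        });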

Related

Flink DataStream sort program does not output

I have written a small test case code in Flink to sort a datastream. The code is as follows:
public enum StreamSortTest {
    ;

    public static class MyProcessWindowFunction extends ProcessWindowFunction<Long, Long, Integer, TimeWindow> {
        @Override
        public void process(Integer key, Context ctx, Iterable<Long> input, Collector<Long> out) {
            List<Long> sortedList = new ArrayList<>();
            for (Long i : input) {
                sortedList.add(i);
            }
            Collections.sort(sortedList);
            sortedList.forEach(l -> out.collect(l));
        }
    }

    public static void main(final String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);
        DataStream<Long> probeSource = env.fromSequence(1, 500).setParallelism(2);
        // range partition the stream into two parts based on data value
        DataStream<Long> sortOutput =
            probeSource
                .keyBy(x -> {
                    if (x < 250) {
                        return 1;
                    } else {
                        return 2;
                    }
                })
                .window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
                .process(new MyProcessWindowFunction());
        sortOutput.print();
        System.out.println(env.getExecutionPlan());
        env.executeAsync();
    }
}
However, the code just outputs the execution plan and a few other lines. But it doesn't output the actual sorted numbers. What am I doing wrong?
The main problem I can see is that you are using a processing-time window with a very short input, which will surely be processed in less than 20 seconds. While Flink is able to detect the end of input (in the case of a stream from a file, or a sequence as in your case) and generate a Long.MAX_VALUE watermark, which closes all open event-time windows and fires all event-time timers, it doesn't do the same thing for processing-time computations. So in your case you need to make sure that Flink actually runs long enough for the window to close, or use a custom trigger/different time characteristic.
One other thing I'm not sure about, since I've never used it much, is whether you should use executeAsync for local execution, since that's basically meant for situations when you don't want to wait for the result of the job, according to the docs here.
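One way to make the example produce output deterministically (a sketch under the assumption that event time is acceptable here, not the only fix) is to switch to event-time windows, so that the Long.MAX_VALUE watermark emitted at end of input closes the window, and to use the blocking execute():

DataStream<Long> sortOutput =
    probeSource
        .assignTimestampsAndWatermarks(
            WatermarkStrategy.<Long>forMonotonousTimestamps()
                .withTimestampAssigner((value, ts) -> value)) // use the element value as its timestamp
        .keyBy(x -> x < 250 ? 1 : 2)
        .window(TumblingEventTimeWindows.of(Time.seconds(20)))
        .process(new MyProcessWindowFunction());
sortOutput.print();
env.execute(); // blocks until the job finishes, unlike executeAsync()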

In Flink ProcessFunction, all MapState is empty in onTimer() function

I want to implement the aggregation with a keyed process function, because the default AggregateFunction does not support rich functions.
Besides, I tried the AggregateFunction + ProcessWindowFunction combination (https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html), but it also cannot satisfy my needs, so I have to use the basic keyed process function to implement the aggregation. The details of my problem are as follows:
In the process function, I define a windowState to stage the aggregated value of the elements. The code is as follows:
public void open(Configuration parameters) throws Exception {
    followCacheMap = FollowSet.getInstance();
    windowState = getRuntimeContext().getMapState(windowStateDescriptor);
    currentTimer = getRuntimeContext().getState(new ValueStateDescriptor<Long>(
            "timer",
            Long.class
    ));
}
In the processElement() function, I use the windowState (a MapState initialized in the open() function) to aggregate the window elements, and register the first timer to clear the current window state. The code is as follows:
@Override
public void processElement(FollowData value, Context ctx, Collector<FollowData> out) throws Exception {
    if ((currentTimer == null || currentTimer.value() == null || (long) currentTimer.value() == 0) && value.getClickTime() != null) {
        currentTimer.update(value.getClickTime() + interval);
        ctx.timerService().registerEventTimeTimer((long) currentTimer.value());
    }
    windowState = doMyAggregation(value);
}
In the onTimer() function, I first register the next timer for one minute later, then emit and clear the window state:
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<FollowData> out) throws Exception {
    currentTimer.update(timestamp + interval); // interval is 1 minute
    ctx.timerService().registerEventTimeTimer((long) currentTimer.value());
    out.collect(windowState);
    windowState.clear();
}
But when the program is running, I found that windowState is always empty in onTimer(), though it is not empty in the processElement() function. I don't know why this happens; maybe the execution logic is different. How can I fix this?
Thanks in advance!
Newly added code for the doMyAggregation() part.
windowState is a MapState whose key is "mykey" and whose value is a self-defined object, AggregateFollow:
public class AggregateFollow {
    private String clicked;
    private String unionid;
    private ArrayList allFollows;
    private int enterCnt;
    private Long clickTime;
}
And the doMyAggregation(value) function is pretty much like this. Its purpose is to collect all the values whose source field is 'follow'; but if no value with source 'click' arrives within 1 minute, the 'follow' values should be discarded. In a word, it's like a join operation of 'follow' data and 'click' data:
AggregateFollow acc = windowState.get(windowkey);
String flag = acc.getClicked();
ArrayList<FollowData> followDataList = acc.getAllFollows();
if ("0".equals(flag)) {
    if ("follow".equals(value.getSource())) {
        followDataList.add(value);
        acc.setAllFollows(followDataList);
    }
    if ("click".equals(value.getSource())) {
        String unionid = value.getUnionid();
        clickTime = value.getClickTime();
        if (followDataList.size() > 0) {
            ArrayList listNew = new ArrayList();
            for (FollowData followData : followDataList) {
                followData.setUnionid(unionid);
                followData.setClickTime(clickTime);
                followData.setSource("joined_flag");
            }
            acc.setAllFollows(listNew);
        }
        acc.setClicked("1");
        acc.setUnionid(unionid);
        acc.setClickTime(clickTime);
        windowState.put(windowkey, acc);
    }
} else if ("1".equals(flag)) {
    if ("follow".equals(value.getSource())) {
        value.setUnionid(acc.getUnionid());
        value.setClickTime(acc.getClickTime());
        value.setSource("joined_flag");
        followDataList.add(value);
        acc.setAllFollows(followDataList);
        windowState.put(windowkey, acc);
    }
}
Because of performance problems, the original window API is not a valid choice for me; the only way here, I think, is to use a process function + onTimer and a Guava cache.
Thanks a lot.
If windowState is empty, it would be helpful to see what doMyAggregation(value) is doing.
It's difficult to debug this, or propose good alternatives, without more context, but out.collect(windowState) isn't going to work as intended. What you might want to do instead is to iterate over this MapState and collect each key/value pair it contains to the output.
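A sketch of what that could look like in onTimer(), assuming the question's MapState<String, AggregateFollow> and that getAllFollows() returns an ArrayList<FollowData>:

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<FollowData> out) throws Exception {
    currentTimer.update(timestamp + interval);
    ctx.timerService().registerEventTimeTimer(currentTimer.value());
    for (Map.Entry<String, AggregateFollow> entry : windowState.entries()) {
        // emit each aggregated record individually instead of the state object itself
        for (FollowData followData : entry.getValue().getAllFollows()) {
            out.collect(followData);
        }
    }
    windowState.clear();
}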
I changed the type of windowState from MapState to ValueState, and the problem is solved. Maybe it is a bug or something; can anyone explain this?

Unbounded Collection based stream in Flink

Is it possible to create an unbounded collection-based stream in Flink? For example, if we add an element to a map, Flink should process it as it does with a socket stream. It should not exit once the initial elements are read.
You can create a custom SourceFunction that never terminates (until cancel() is called) and emits elements as they appear. You'd want to have a class that looks something like:
class MyUnboundedSource extends RichParallelSourceFunction<MyType> {
    ...
    private transient volatile boolean running;
    ...

    @Override
    public void run(SourceContext<MyType> ctx) throws Exception {
        running = true; // the transient flag defaults to false after deserialization, so set it here
        while (running) {
            // Call some method that returns the next record, if available.
            MyType record = getNextRecordOrNull();
            if (record != null) {
                ctx.collect(record);
            } else {
                Thread.sleep(NO_DATA_SLEEP_TIME());
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
Note that you'd need to worry about saving state for this source to support at-least-once or exactly-once generation of records.
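A hedged sketch of what that could look like, using Flink's CheckpointedFunction interface on a simple counter-based source (the record type and pacing are illustrative, not a definitive implementation):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

class CheckpointedCounterSource extends RichParallelSourceFunction<Long>
        implements CheckpointedFunction {

    private transient volatile boolean running;
    private transient ListState<Long> offsetState;
    private long offset;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        running = true;
        while (running) {
            // hold the checkpoint lock so emitting a record and advancing the
            // offset are atomic with respect to state snapshots
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(offset);
                offset++;
            }
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        offsetState.clear();
        offsetState.add(offset); // persist the read position
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("offset", Long.class));
        for (Long restored : offsetState.get()) {
            offset = restored; // resume from the checkpointed position after a failure
        }
    }
}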

APACHE FLINK AggregateFunction with tumblingWindow to count events but also send 0 if no event occurred

I need to count events within a tumbling window. But I also want to send events with a 0 value if there were no events within the window.
Something like:
windowCount: 5
windowCount: 0
windowCount: 0
windowCount: 3
windowCount: 0
...
import com.google.protobuf.Message;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.skydivin4ng3l.cepmodemon.models.events.aggregate.AggregateOuterClass;

public class BasicCounter<T extends Message> implements AggregateFunction<T, Long, AggregateOuterClass.Aggregate> {
    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(T event, Long accumulator) {
        return accumulator + 1L;
    }

    @Override
    public AggregateOuterClass.Aggregate getResult(Long accumulator) {
        return AggregateOuterClass.Aggregate.newBuilder().setVolume(accumulator).build();
    }

    @Override
    public Long merge(Long accumulator1, Long accumulator2) {
        return accumulator1 + accumulator2;
    }
}
and used here:
DataStream<AggregateOuterClass.Aggregate> aggregatedStream = someEntryStream
    .windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
    .aggregate(new BasicCounter<MonitorOuterClass.Monitor>());
The TimeCharacteristic is IngestionTime.
I read about a Trigger, which might detect whether the aggregated stream has received an event after x time, but I am not sure if that is the right way to do it.
I expected the aggregation to happen even if there were no events at all within the window. Maybe there is a setting I am not aware of?
Thanks for any hints.
I chose Option 1 as suggested by @David-Anderson:
Here is my Event Generator:
public class EmptyEventSource implements SourceFunction<MonitorOuterClass.Monitor> {
    private volatile boolean isRunning = true;
    private final long delayPerRecordMillis;

    public EmptyEventSource(long delayPerRecordMillis) {
        this.delayPerRecordMillis = delayPerRecordMillis;
    }

    @Override
    public void run(SourceContext<MonitorOuterClass.Monitor> sourceContext) throws Exception {
        while (isRunning) {
            sourceContext.collect(MonitorOuterClass.Monitor.newBuilder().build());
            if (delayPerRecordMillis > 0) {
                Thread.sleep(delayPerRecordMillis);
            }
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
and my adjusted AggregateFunction:
public class BasicCounter<T extends Message> implements AggregateFunction<T, Long, AggregateOuterClass.Aggregate> {
    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(T event, Long accumulator) {
        if (((MonitorOuterClass.Monitor) event).equals(MonitorOuterClass.Monitor.newBuilder().build())) {
            return accumulator; // ignore the empty trigger events when counting
        }
        return accumulator + 1L;
    }

    @Override
    public AggregateOuterClass.Aggregate getResult(Long accumulator) {
        return AggregateOuterClass.Aggregate.newBuilder().setVolume(accumulator).build();
    }

    @Override
    public Long merge(Long accumulator1, Long accumulator2) {
        return accumulator1 + accumulator2;
    }
}
Used them like this:
DataStream<MonitorOuterClass.Monitor> someEntryStream = env.addSource(currentConsumer);
DataStream<MonitorOuterClass.Monitor> triggerStream = env.addSource(new EmptyEventSource(delayPerRecordMillis));
DataStream<AggregateOuterClass.Aggregate> aggregatedStream = someEntryStream
    .union(triggerStream)
    .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .aggregate(new BasicCounter<MonitorOuterClass.Monitor>());
Flink's windows are created lazily, when the first event is assigned to a window. Thus empty windows do not exist, and can't produce results.
In general there are three ways to work around this issue:
1. Put something in front of the window that adds events to the stream, ensuring that every window has something in it, and then modify your window processing to ignore these special events when computing their results.
2. Use a GlobalWindow along with a custom Trigger that uses processing-time timers to trigger the window (with no events flowing, the watermark won't advance, and event-time timers won't fire until more events arrive).
3. Don't use the window API, and implement your own windowing with a ProcessFunction instead. But here you'll still face the issue of needing to use processing-time timers (a sketch follows after the update below).
Update:
Having now made an effort to implement an example of option 2, I cannot recommend it. The issue is that even with a custom Trigger, the ProcessAllWindowFunction will not be called if the window is empty, so it is necessary to always keep at least one element in the GlobalWindow. This appears then to require implementing a rather hacky Evictor and ProcessAllWindowFunction that collaborate to retain and ignore a special element in the window -- and you also have to somehow get that element into the window in the first place.
If you're going to do something hacky, option 1 appears to be much simpler.
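For completeness, a hedged sketch of option 3 (names are illustrative; the question's non-keyed stream would first be keyed by a constant, e.g. keyBy(m -> (byte) 0)). Note that, like the window API, it only starts emitting once the first event for a key has arrived:

public class CountOrZero extends KeyedProcessFunction<Byte, MonitorOuterClass.Monitor, Long> {
    private static final long INTERVAL = 5_000L; // 5-second "windows"

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(MonitorOuterClass.Monitor value, Context ctx, Collector<Long> out) throws Exception {
        if (count.value() == null) {
            // first event for this key: start the recurring processing-time timer
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + INTERVAL);
            count.update(0L);
        }
        count.update(count.value() + 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) throws Exception {
        out.collect(count.value() == null ? 0L : count.value()); // emits 0 for empty intervals
        count.update(0L);
        ctx.timerService().registerProcessingTimeTimer(timestamp + INTERVAL); // re-arm the timer
    }
}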

Flink executes dataflow twice

I'm new to Flink and I work with the DataSet API. After a whole bunch of processing, as the last stage I need to normalize one of the values by dividing it by its maximum value. So, I have used the .max() operator to take the max, and later I'm passing the result as a constructor argument to the MapFunction.
This works; however, all the processing is performed twice: one job is executed to find the max values, and later another job is executed to create the final result (starting execution from the beginning). Is there any workaround to execute the whole dataflow only once?
final List<Tuple6<...>> maxValues = result.max(2).collect();
assert maxValues.size() == 1;
result.map(new NormalizeAttributes(maxValues.get(0))).writeAsCsv(...);

@FunctionAnnotation.ForwardedFields("f0; f1; f3; f4; f5")
@FunctionAnnotation.ReadFields("f2")
private static class NormalizeAttributes implements MapFunction<Tuple6<...>, Tuple6<...>> {
    private final Tuple6<...> maxValues;

    public NormalizeAttributes(Tuple6<...> maxValues) {
        this.maxValues = maxValues;
    }

    @Override
    public Tuple6<...> map(Tuple6<...> value) throws Exception {
        value.f2 /= maxValues.f2;
        return value;
    }
}
collect() immediately triggers an execution of the program up to the dataset requested by collect(). If you later call env.execute() or collect() again, the program is executed a second time.
Besides the side effect of execution, using collect() to distribute values to subsequent transformations also has the drawback that the data is transferred to the client and later back into the cluster. Flink offers so-called broadcast variables to ship a DataSet as a side input into another transformation.
Using Broadcast variables in your program would look as follows:
DataSet<Tuple6<...>> maxValues = result.max(2);
result
    .map(new NormAttrs()).withBroadcastSet(maxValues, "maxValues")
    .writeAsCsv(...);
The NormAttrs function would look like this:
private static class NormAttrs extends RichMapFunction<Tuple6<...>, Tuple6<...>> {
    private Tuple6<...> maxValues;

    @Override
    public void open(Configuration config) {
        // the broadcast set holds exactly one element here: the tuple with the max values
        maxValues = (Tuple6<...>) getRuntimeContext().getBroadcastVariable("maxValues").get(0);
    }

    @Override
    public Tuple6<...> map(Tuple6<...> value) throws Exception {
        value.f2 /= maxValues.f2;
        return value;
    }
}
You can find more information about Broadcast variables in the documentation.
