I am trying to do pre-shuffle aggregation in Flink. Following is the MapBundleFunction implementation.
public class TaxiFareMapBundleFunction extends MapBundleFunction<Long, TaxiFare, TaxiFare, TaxiFare> {
@Override
public TaxiFare addInput(@Nullable TaxiFare value, TaxiFare input) throws Exception {
if (value == null) {
return input;
}
value.tip = value.tip + input.tip;
return value;
}
@Override
public void finishBundle(Map<Long, TaxiFare> buffer, Collector<TaxiFare> out) throws Exception {
for (Map.Entry<Long, TaxiFare> entry : buffer.entrySet()) {
out.collect(entry.getValue());
}
}
}
I am using "CountBundleTrigger.java" . But the pre-shuffle aggregation is not working as the "count" variable is always 0. Please let me know If I am missing something.
@Override
public void onElement(T element) throws Exception {
count++;
if (count >= maxCount) {
callback.finishBundle();
reset();
}
}
Here is the main code.
MapBundleFunction<Long, TaxiFare, TaxiFare, TaxiFare> mapBundleFunction = new TaxiFareMapBundleFunction();
BundleTrigger<TaxiFare> bundleTrigger = new CountBundleTrigger<>(10);
KeySelector<TaxiFare, Long> taxiFareLongKeySelector = new KeySelector<TaxiFare, Long>() {
@Override
public Long getKey(TaxiFare value) throws Exception {
return value.driverId;
}
};
DataStream<Tuple3<Long, Long, Float>> hourlyTips =
// fares.keyBy((TaxiFare fare) -> fare.driverId)
//     .window(TumblingEventTimeWindows.of(Time.hours(1))).process(new AddTips());
fares.transform("preshuffle", TypeInformation.of(TaxiFare.class),
new TaxiFareStream(mapBundleFunction, bundleTrigger, taxiFareLongKeySelector))
.assignTimestampsAndWatermarks(new
BoundedOutOfOrdernessTimestampExtractor<TaxiFare>(Time.seconds(20)) {
@Override
public long extractTimestamp(TaxiFare element) {
return element.startTime.getEpochSecond();
}
})
.keyBy((TaxiFare fare) -> fare.driverId)
.window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
.process(new AddTips());
DataStream<Tuple3<Long, Long, Float>> hourlyMax =
hourlyTips.windowAll(TumblingEventTimeWindows.of(Time.hours(1))).maxBy(2);
Here is the code for TaxiFareStream.java.
public class TaxiFareStream extends MapBundleOperator<Long, TaxiFare, TaxiFare, TaxiFare> {
private KeySelector<TaxiFare, Long> keySelector;
public TaxiFareStream(MapBundleFunction<Long, TaxiFare,
TaxiFare, TaxiFare> userFunction,
BundleTrigger<TaxiFare> bundleTrigger,
KeySelector<TaxiFare, Long> keySelector) {
super(userFunction, bundleTrigger, keySelector);
this.keySelector = keySelector;
}
@Override
protected Long getKey(TaxiFare input) throws Exception {
return keySelector.getKey(input);
}
}
Update
I have created the following class, but I am seeing an error. I think it is not able to serialize the class. MapStreamBundleOperator.java:
public class MapStreamBundleOperator<K, V, IN, OUT> extends
AbstractMapStreamBundleOperator<K, V, IN, OUT> {
private static final long serialVersionUID = 6556268125924098320L;
/** KeySelector is used to extract key for bundle map. */
private final KeySelector<IN, K> keySelector;
public MapStreamBundleOperator(MapBundleFunction<K, V, IN, OUT> function, BundleTrigger<IN> bundleTrigger,
KeySelector<IN, K> keySelector) {
super(function, bundleTrigger);
this.keySelector = keySelector;
}
@Override
protected K getKey(IN input) throws Exception {
return this.keySelector.getKey(input);
}
}
2021-08-27 05:06:04,814 ERROR FlinkDefaults.class - Stream execution failed
org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot serialize operator object class org.apache.flink.streaming.api.operators.SimpleUdfStreamOperatorFactory.
at org.apache.flink.streaming.api.graph.StreamConfig.setStreamOperatorFactory(StreamConfig.java:247)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setVertexConfig(StreamingJobGraphGenerator.java:497)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createChain(StreamingJobGraphGenerator.java:318)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createChain(StreamingJobGraphGenerator.java:297)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createChain(StreamingJobGraphGenerator.java:297)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setChaining(StreamingJobGraphGenerator.java:264)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:173)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:113)
at org.apache.flink.streaming.api.graph.StreamGraph.getJobGraph(StreamGraph.java:850)
at org.apache.flink.client.StreamGraphTranslator.translateToJobGraph(StreamGraphTranslator.java:52)
at org.apache.flink.client.FlinkPipelineTranslationUtil.getJobGraph(FlinkPipelineTranslationUtil.java:43)
at org.apache.flink.client.deployment.executors.PipelineExecutorUtils.getJobGraph(PipelineExecutorUtils.java:55)
at org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:62)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1810)
at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)
at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1697)
at com.pinterest.xenon.flink.FlinkDefaults$.run(FlinkDefaults.scala:46)
at com.pinterest.xenon.flink.FlinkWorkflow.run(FlinkWorkflow.scala:74)
at com.pinterest.xenon.flink.WorkflowLauncher$.executeWorkflow(WorkflowLauncher.scala:43)
at com.pinterest.xenon.flink.WorkflowLauncher$.delayedEndpoint$com$pinterest$xenon$flink$WorkflowLauncher$1(WorkflowLauncher.scala:25)
at com.pinterest.xenon.flink.WorkflowLauncher$delayedInit$body.apply(WorkflowLauncher.scala:9)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.pinterest.xenon.flink.WorkflowLauncher$.main(WorkflowLauncher.scala:9)
at com.pinterest.xenon.flink.WorkflowLauncher.main(WorkflowLauncher.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:168)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:699)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:232)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:916)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.io.NotSerializableException: visibility.mabs.src.main.java.com.pinterest.mabs.MabsFlinkJob
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
I would not rely on the official MapBundleOperator, since David already said that it is not very well documented. I will answer this question based on my own AbstractMapStreamBundleOperator. I think you are missing the counter numOfElements++; inside the processElement() method. It is also better to use generic types. Use this code:
public abstract class AbstractMapStreamBundleOperator<K, V, IN, OUT>
extends AbstractUdfStreamOperator<OUT, MapBundleFunction<K, V, IN, OUT>>
implements OneInputStreamOperator<IN, OUT>, BundleTriggerCallback {
private static final long serialVersionUID = 1L;
private final Map<K, V> bundle;
private final BundleTrigger<IN> bundleTrigger;
private transient TimestampedCollector<OUT> collector;
private transient int numOfElements = 0;
public AbstractMapStreamBundleOperator(MapBundleFunction<K, V, IN, OUT> function, BundleTrigger<IN> bundleTrigger) {
super(function);
chainingStrategy = ChainingStrategy.ALWAYS;
this.bundle = new HashMap<>();
this.bundleTrigger = checkNotNull(bundleTrigger, "bundleTrigger is null");
}
@Override
public void open() throws Exception {
super.open();
numOfElements = 0;
collector = new TimestampedCollector<>(output);
bundleTrigger.registerCallback(this);
// reset trigger
bundleTrigger.reset();
}
@Override
public void processElement(StreamRecord<IN> element) throws Exception {
// get the key and value for the map bundle
final IN input = element.getValue();
final K bundleKey = getKey(input);
final V bundleValue = this.bundle.get(bundleKey);
// get a new value after adding this element to bundle
final V newBundleValue = userFunction.addInput(bundleValue, input);
// update to map bundle
bundle.put(bundleKey, newBundleValue);
numOfElements++;
bundleTrigger.onElement(input);
}
protected abstract K getKey(final IN input) throws Exception;
@Override
public void finishBundle() throws Exception {
if (!bundle.isEmpty()) {
numOfElements = 0;
userFunction.finishBundle(bundle, collector);
bundle.clear();
}
bundleTrigger.reset();
}
}
Then create the MapStreamBundleOperator like you already have. Use this code:
public class MapStreamBundleOperator<K, V, IN, OUT> extends AbstractMapStreamBundleOperator<K, V, IN, OUT> {
private final KeySelector<IN, K> keySelector;
public MapStreamBundleOperator(MapBundleFunction<K, V, IN, OUT> function, BundleTrigger<IN> bundleTrigger,
KeySelector<IN, K> keySelector) {
super(function, bundleTrigger);
this.keySelector = keySelector;
}
@Override
protected K getKey(IN input) throws Exception {
return this.keySelector.getKey(input);
}
}
The counter inside the trigger is what makes the bundle operator flush the events to the next phase. The CountBundleTrigger looks like the code below. You will also need the BundleTriggerCallback (a sketch of both interfaces follows the trigger code).
public class CountBundleTrigger<T> implements BundleTrigger<T> {
private final long maxCount;
private transient BundleTriggerCallback callback;
private transient long count = 0;
public CountBundleTrigger(long maxCount) {
Preconditions.checkArgument(maxCount > 0, "maxCount must be greater than 0");
this.maxCount = maxCount;
}
@Override
public void registerCallback(BundleTriggerCallback callback) {
this.callback = Preconditions.checkNotNull(callback, "callback is null");
}
@Override
public void onElement(T element) throws Exception {
count++;
if (count >= maxCount) {
callback.finishBundle();
reset();
}
}
@Override
public void reset() { count = 0; }
@Override
public String explain() {
return "CountBundleTrigger with size " + maxCount;
}
}
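For completeness, here is a minimal sketch of the BundleTriggerCallback and BundleTrigger contracts assumed by the classes above. The method names match those used in CountBundleTrigger, but check the repository the answer refers to for the exact versions.
import java.io.Serializable;
// Sketch of the callback the operator implements so the trigger can request a flush.
public interface BundleTriggerCallback {
    /** Flush the buffered bundle and emit the aggregated results downstream. */
    void finishBundle() throws Exception;
}
// Sketch of the trigger contract implemented by CountBundleTrigger above.
public interface BundleTrigger<T> extends Serializable {
    /** Register the callback that is invoked when the bundle should be flushed. */
    void registerCallback(BundleTriggerCallback callback);
    /** Called for every element; may decide to flush the bundle. */
    void onElement(T element) throws Exception;
    /** Reset the trigger to its initial state after a flush. */
    void reset();
    /** Human-readable description of the trigger. */
    String explain();
}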
Then you have to create one of these triggers to pass to your operator. Here I am creating bundles of 100 TaxiFare events. This example uses another POJO: I wrote the MapBundleTaxiFareImpl here, but you can create your own UDF based on it (a sketch of it follows the next snippet).
private OneInputStreamOperator<Tuple2<Long, TaxiFare>, Tuple2<Long, TaxiFare>> getPreAggOperator() {
MapBundleFunction<Long, TaxiFare, Tuple2<Long, TaxiFare>, Tuple2<Long, TaxiFare>> myMapBundleFunction = new MapBundleTaxiFareImpl();
CountBundleTrigger<Tuple2<Long, TaxiFare>> bundleTrigger = new CountBundleTrigger<Tuple2<Long, TaxiFare>>(100);
return new MapStreamBundleOperator<>(myMapBundleFunction, bundleTrigger, keyBundleSelector);
}
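For reference, here is a minimal sketch of what MapBundleTaxiFareImpl could look like (hypothetical; the actual implementation lives in the repository the answer refers to). It assumes the input is a Tuple2 of driverId and TaxiFare, that TaxiFare has a public tip field as in the question, and that keyBundleSelector simply returns f0 of that tuple; tips are merged per driver:
import java.util.Map;
import javax.annotation.Nullable;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
// Hypothetical sketch: pre-aggregates tips per driverId inside a bundle
// before the records are shuffled to the keyed window.
public class MapBundleTaxiFareImpl
        extends MapBundleFunction<Long, TaxiFare, Tuple2<Long, TaxiFare>, Tuple2<Long, TaxiFare>> {

    @Override
    public TaxiFare addInput(@Nullable TaxiFare value, Tuple2<Long, TaxiFare> input) throws Exception {
        if (value == null) {
            return input.f1;
        }
        value.tip = value.tip + input.f1.tip;
        return value;
    }

    @Override
    public void finishBundle(Map<Long, TaxiFare> buffer, Collector<Tuple2<Long, TaxiFare>> out) throws Exception {
        for (Map.Entry<Long, TaxiFare> entry : buffer.entrySet()) {
            out.collect(Tuple2.of(entry.getKey(), entry.getValue()));
        }
    }
}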
In the end, you call this new operator somewhere using transform(). Here is an example with the same POJO:
stream
...
.transform("my-pre-agg",
TypeInformation.of(new TypeHint<Tuple2<Long, TaxiFare>>(){}), getPreAggOperator())
...
I think that is all you need. Try to use those classes, and if something is missing it is probably in the git repository that I linked. I hope you can make it work.
Related
Here's my code.
My question is as follows:
Is it correct to clear state in this way?
Is this the correct way to use keyBy?
// There are 1,000,000+ storeIds
orderStream.keyBy(Order::getStoreId)
.window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(16)))
.trigger(ContinuousEventTimeTrigger.of(Time.seconds(1)))
.evictor(TimeEvictor.of(Time.seconds(0), true))
.process(new ProcessWindowFunction<Order, Object, Long, TimeWindow>() {
MapState<Long, Long> storeCountState;
@Override
public void process(Long storeId, Context context, Iterable<Order> elements, Collector<Object> out) throws Exception {
long sum = 0L;
for (Order element : elements) {
sum++;
}
storeCountState.put(storeId, storeCountState.get(storeId) + sum);
}
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
MapStateDescriptor<Long, Long> mapStateDescriptor = new MapStateDescriptor<>("storeCount", Long.class, Long.class);
storeCountState = getRuntimeContext().getMapState(mapStateDescriptor);
}
@Override
public void close() throws Exception {
super.close();
// I clear state when each window close
storeCountState.clear();
}
})
.addSink(new PrintSinkFunction<>());
I think you should override the public void clear(Context context) throws Exception {} function, not the close() function.
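For illustration, here is a minimal sketch of that change, assuming the same Order/storeId types as in the question: keep the count in per-window state obtained from the Context (rather than from getRuntimeContext()), and release it in clear(Context), which Flink calls when the window is purged.
.process(new ProcessWindowFunction<Order, Object, Long, TimeWindow>() {
    // Descriptor for per-window keyed state; the name is illustrative.
    private final MapStateDescriptor<Long, Long> storeCountDesc =
            new MapStateDescriptor<>("storeCount", Long.class, Long.class);

    @Override
    public void process(Long storeId, Context context, Iterable<Order> elements,
                        Collector<Object> out) throws Exception {
        MapState<Long, Long> storeCountState = context.windowState().getMapState(storeCountDesc);
        long sum = 0L;
        for (Order ignored : elements) {
            sum++;
        }
        Long previous = storeCountState.get(storeId);
        storeCountState.put(storeId, (previous == null ? 0L : previous) + sum);
    }

    @Override
    public void clear(Context context) throws Exception {
        // Called once per key and window when the window is purged.
        context.windowState().getMapState(storeCountDesc).clear();
    }
})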
I am writing my Apache Flink (1.10) job to update records in real time, like this:
public class WalletConsumeRealtimeHandler {
public static void main(String[] args) throws Exception {
walletConsumeHandler();
}
public static void walletConsumeHandler() throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkUtil.initMQ();
FlinkUtil.initEnv(env);
DataStream<String> dataStreamSource = env.addSource(FlinkUtil.initDatasource("wallet.consume.report.realtime"));
DataStream<ReportWalletConsumeRecord> consumeRecord =
dataStreamSource.map(new MapFunction<String, ReportWalletConsumeRecord>() {
@Override
public ReportWalletConsumeRecord map(String value) throws Exception {
ObjectMapper mapper = new ObjectMapper();
ReportWalletConsumeRecord consumeRecord = mapper.readValue(value, ReportWalletConsumeRecord.class);
consumeRecord.setMergedRecordCount(1);
return consumeRecord;
}
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());
consumeRecord.keyBy(
new KeySelector<ReportWalletConsumeRecord, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> getKey(ReportWalletConsumeRecord value) throws Exception {
return Tuple2.of(value.getConsumeItem(), value.getTenantId());
}
})
.timeWindow(Time.seconds(5))
.reduce(new SumField(), new CollectionWindow())
.addSink(new SinkFunction<List<ReportWalletConsumeRecord>>() {
@Override
public void invoke(List<ReportWalletConsumeRecord> reportPumps, Context context) throws Exception {
WalletConsumeRealtimeHandler.invoke(reportPumps);
}
});
env.execute(WalletConsumeRealtimeHandler.class.getName());
}
private static class CollectionWindow extends ProcessWindowFunction<ReportWalletConsumeRecord,
List<ReportWalletConsumeRecord>,
Tuple2<String, Long>,
TimeWindow> {
public void process(Tuple2<String, Long> key,
Context context,
Iterable<ReportWalletConsumeRecord> minReadings,
Collector<List<ReportWalletConsumeRecord>> out) throws Exception {
ArrayList<ReportWalletConsumeRecord> employees = Lists.newArrayList(minReadings);
if (employees.size() > 0) {
out.collect(employees);
}
}
}
private static class SumField implements ReduceFunction<ReportWalletConsumeRecord> {
public ReportWalletConsumeRecord reduce(ReportWalletConsumeRecord d1, ReportWalletConsumeRecord d2) {
Integer merged1 = d1.getMergedRecordCount() == null ? 1 : d1.getMergedRecordCount();
Integer merged2 = d2.getMergedRecordCount() == null ? 1 : d2.getMergedRecordCount();
d1.setMergedRecordCount(merged1 + merged2);
d1.setConsumeNum(d1.getConsumeNum() + d2.getConsumeNum());
return d1;
}
}
public static void invoke(List<ReportWalletConsumeRecord> records) {
WalletConsumeService service = FlinkUtil.InitRetrofit().create(WalletConsumeService.class);
Call<ResponseBody> call = service.saveRecords(records);
call.enqueue(new Callback<ResponseBody>() {
@Override
public void onResponse(Call<ResponseBody> call, Response<ResponseBody> response) {
}
@Override
public void onFailure(Call<ResponseBody> call, Throwable t) {
t.printStackTrace();
}
});
}
}
and now I have found that the Flink task only triggers the sink after receiving at least 2 records. Does the reduce action require this?
You need two records to trigger the window. Flink only knows when to close a window (and fire the subsequent calculation) when it receives a watermark that is larger than the end timestamp of the window.
In your case, you use BoundedOutOfOrdernessGenerator, which updates the watermark according to the incoming records. So it generates a second watermark only after having seen the second record.
You can use a different watermark generator. In the troubleshooting training there is a watermark generator that also generates watermarks on timeout.
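For reference, here is a minimal sketch of such a generator using the AssignerWithPeriodicWatermarks API available in Flink 1.10. The getEventTimeMillis() accessor, the field names, and the timeout value are assumptions, not part of the original code:
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;
// Hypothetical sketch: a bounded-out-of-orderness assigner that also advances the
// watermark on wall-clock timeout, so a single buffered record can still be emitted.
// It assumes event time is roughly aligned with wall-clock time.
public class TimeoutBoundedOutOfOrdernessGenerator
        implements AssignerWithPeriodicWatermarks<ReportWalletConsumeRecord> {

    private final long maxOutOfOrdernessMs;
    private final long idleTimeoutMs;
    private long maxTimestampSeen;
    private long lastUpdatedAt = System.currentTimeMillis();

    public TimeoutBoundedOutOfOrdernessGenerator(long maxOutOfOrdernessMs, long idleTimeoutMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.idleTimeoutMs = idleTimeoutMs;
        this.maxTimestampSeen = Long.MIN_VALUE + maxOutOfOrdernessMs;
    }

    @Override
    public long extractTimestamp(ReportWalletConsumeRecord element, long previousElementTimestamp) {
        long ts = element.getEventTimeMillis(); // assumed accessor for the record's event time
        if (ts > maxTimestampSeen) {
            maxTimestampSeen = ts;
            lastUpdatedAt = System.currentTimeMillis();
        }
        return ts;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // Called periodically (see ExecutionConfig#setAutoWatermarkInterval). If no new record
        // arrived within the timeout, advance the watermark from wall-clock time instead.
        long now = System.currentTimeMillis();
        if (now - lastUpdatedAt > idleTimeoutMs) {
            return new Watermark(now - maxOutOfOrdernessMs);
        }
        return new Watermark(maxTimestampSeen - maxOutOfOrdernessMs);
    }
}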
I have used a global window and a custom trigger. Then I noticed that the state size in every checkpoint keeps increasing. So I set breakpoints in the clear method and found that it does not seem to be invoked. So I guess the state size keeps increasing because the clear method is not invoked.
main method
final StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
see.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
see.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
see.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000L);
see.setStateBackend(new MemoryStateBackend());
see.getCheckpointConfig().setCheckpointTimeout(3000L);
DataStream<String> dataStream = generateData(see);
dataStream.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
@Override
public void flatMap(String line, Collector<Tuple2<String,Integer>> collector) throws Exception {
String[] split = line.split(" ");
for (String s1 : split) {
collector.collect(new Tuple2<>(s1,1));
}
}
}).keyBy(0).window(GlobalWindows.create())
.trigger(PurgingTrigger.of(TimeoutCountTrigger.of(10,1000L)))
.process(new CustomProcessWindow())
.print().setParallelism(1);
see.execute();
Trigger implementation:
public class CountWithTimeoutTrigger<T, W extends Window> extends Trigger<T, W> {
private static final long serialVersionUID = 1L;
private final long maxCount;
private final long timeoutMs;
private final ValueStateDescriptor<Long> countDesc = new ValueStateDescriptor<>("count", LongSerializer.INSTANCE, 0L);
private final ValueStateDescriptor<Long> deadlineDesc = new ValueStateDescriptor<>("deadline", LongSerializer.INSTANCE, Long.MAX_VALUE);
private CountWithTimeoutTrigger(long maxCount, long timeoutMs) {
this.maxCount = maxCount;
this.timeoutMs = timeoutMs;
}
@Override
public TriggerResult onElement(T element, long timestamp, W window, Trigger.TriggerContext ctx) throws IOException {
final ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
final ValueState<Long> count = ctx.getPartitionedState(countDesc);
final long currentDeadline = deadline.value();
final long currentTimeMs = System.currentTimeMillis();
final long newCount = count.value() + 1;
if (currentTimeMs >= currentDeadline || newCount >= maxCount) {
return fire(deadline, count);
}
if (currentDeadline == deadlineDesc.getDefaultValue()) {
final long nextDeadline = currentTimeMs + timeoutMs;
deadline.update(nextDeadline);
ctx.registerProcessingTimeTimer(nextDeadline);
}
count.update(newCount);
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onEventTime(long time, W window, Trigger.TriggerContext ctx) {
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onProcessingTime(long time, W window, Trigger.TriggerContext ctx) throws Exception {
final ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
// fire only if the deadline hasn't changed since registering this timer
if (deadline.value() == time) {
return fire(deadline, ctx.getPartitionedState(countDesc));
}
return TriggerResult.CONTINUE;
}
@Override
public void clear(W window, TriggerContext ctx) throws Exception {
// ***** this method not been invoked *****
final ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
final ValueState<Long> cntState = ctx.getPartitionedState(countDesc);
final long deadlineValue = deadline.value();
if (deadlineValue != deadlineDesc.getDefaultValue()) {
ctx.deleteProcessingTimeTimer(deadlineValue);
}
deadline.clear();
cntState.clear();
}
private TriggerResult fire(ValueState<Long> deadline, ValueState<Long> count) throws IOException {
deadline.update(Long.MAX_VALUE);
count.update(0L);
return TriggerResult.FIRE;
}
public static <T, W extends Window> CountWithTimeoutTrigger<T, W> of(long maxCount, long intervalMs) {
return new CountWithTimeoutTrigger<>(maxCount, intervalMs);
}
}
I expect the clear method to be called and to clear the state there, but it seems the trigger's clear method is never invoked, and the state size in every checkpoint keeps increasing.
The Trigger.clear() method is invoked when the window is closed. This happens when the application time (processing time or event time as defined by WindowAssigner.isEventTime()) reaches the end timestamp of the window.
Since a GlobalWindow never ends, the end timestamp of a GlobalWindow is Long.MAX_VALUE. Hence, the Trigger.clear() method will never be called if the trigger is applied on a GlobalWindow.
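If you keep the GlobalWindow, one possible workaround (a sketch, not the only option) is to release the per-key trigger state when the trigger fires, since clear() will never run for this window type:
// Sketch: because clear() is never invoked for a GlobalWindow, drop the per-key
// trigger state inside fire() instead of resetting it to new values.
private TriggerResult fire(ValueState<Long> deadline, ValueState<Long> count) throws IOException {
    // clear() removes the entries from the state backend; value() returns the
    // descriptor defaults again on the next element for this key.
    deadline.clear();
    count.clear();
    return TriggerResult.FIRE;
}
Note that a processing-time timer registered for the old deadline may still fire once afterwards; onProcessingTime() will then see the default deadline and simply return TriggerResult.CONTINUE.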
The Flink flow has multiple data streams, which I merge with the org.apache.flink.streaming.api.datastream.DataStream#union method.
Then I have the problem that the merged data stream is out of order, and I cannot set a window to sort the data in the stream.
Sorting union of streams to identify user sessions in Apache Flink
I got the answer, but com.liam.learn.flink.example.union.UnionStreamDemo.SortFunction#onTimer was never invoked.
Environment info: Flink version 1.7.0.
In general, I hope to sort the unioned data stream without watermarks.
You need watermarks so that the sorting function knows when it can safely emit sorted elements. Without watermarks, you could get a record from stream B that has an earlier date than any of the first N records of stream A, right?
But adding watermarks is easy, especially if you know that "event time" is strictly increasing for any one stream. Below is some code I wrote that extends what David Anderson posted in his answer to the other SO issue you referenced above - hopefully this will get you started.
-- Ken
package com.scaleunlimited.flinksnippets;
import java.util.PriorityQueue;
import java.util.Random;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.TimerService;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.util.Collector;
import org.junit.Test;
public class MergeAndSortStreamsTest {
@Test
public void testMergeAndSort() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(2);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStream<Event> streamA = env.addSource(new EventSource("A"))
.assignTimestampsAndWatermarks(new EventTSWAssigner());
DataStream<Event> streamB = env.addSource(new EventSource("B"))
.assignTimestampsAndWatermarks(new EventTSWAssigner());
streamA.union(streamB)
.keyBy(r -> r.getKey())
.process(new SortByTimestampFunction())
.print();
env.execute();
}
private static class Event implements Comparable<Event> {
private String _label;
private long _timestamp;
public Event(String label, long timestamp) {
_label = label;
_timestamp = timestamp;
}
public String getLabel() {
return _label;
}
public void setLabel(String label) {
_label = label;
}
public String getKey() {
return "1";
}
public long getTimestamp() {
return _timestamp;
}
public void setTimestamp(long timestamp) {
_timestamp = timestamp;
}
@Override
public String toString() {
return String.format("%s # %d", _label, _timestamp);
}
@Override
public int compareTo(Event o) {
return Long.compare(_timestamp, o._timestamp);
}
}
@SuppressWarnings("serial")
private static class EventTSWAssigner extends AscendingTimestampExtractor<Event> {
@Override
public long extractAscendingTimestamp(Event element) {
return element.getTimestamp();
}
}
@SuppressWarnings("serial")
private static class SortByTimestampFunction extends KeyedProcessFunction<String, Event, Event> {
private ValueState<PriorityQueue<Event>> queueState = null;
@Override
public void open(Configuration config) {
ValueStateDescriptor<PriorityQueue<Event>> descriptor = new ValueStateDescriptor<>(
// state name
"sorted-events",
// type information of state
TypeInformation.of(new TypeHint<PriorityQueue<Event>>() {
}));
queueState = getRuntimeContext().getState(descriptor);
}
@Override
public void processElement(Event event, Context context, Collector<Event> out) throws Exception {
TimerService timerService = context.timerService();
long currentWatermark = timerService.currentWatermark();
System.out.format("processElement called with watermark %d\n", currentWatermark);
if (context.timestamp() > currentWatermark) {
PriorityQueue<Event> queue = queueState.value();
if (queue == null) {
queue = new PriorityQueue<>(10);
}
queue.add(event);
queueState.update(queue);
timerService.registerEventTimeTimer(event.getTimestamp());
}
}
@Override
public void onTimer(long timestamp, OnTimerContext context, Collector<Event> out) throws Exception {
PriorityQueue<Event> queue = queueState.value();
long watermark = context.timerService().currentWatermark();
System.out.format("onTimer called with watermark %d\n", watermark);
Event head = queue.peek();
while (head != null && head.getTimestamp() <= watermark) {
out.collect(head);
queue.remove(head);
head = queue.peek();
}
}
}
@SuppressWarnings("serial")
private static class EventSource extends RichParallelSourceFunction<Event> {
private String _prefix;
private transient Random _rand;
private transient boolean _running;
private transient int _numEvents;
public EventSource(String prefix) {
_prefix = prefix;
}
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
_rand = new Random(_prefix.hashCode() + getRuntimeContext().getIndexOfThisSubtask());
}
@Override
public void cancel() {
_running = false;
}
@Override
public void run(SourceContext<Event> context) throws Exception {
_running = true;
_numEvents = 0;
long timestamp = System.currentTimeMillis() + _rand.nextInt(10);
while (_running && (_numEvents < 100)) {
long deltaTime = timestamp - System.currentTimeMillis();
if (deltaTime > 0) {
Thread.sleep(deltaTime);
}
context.collect(new Event(_prefix, timestamp));
_numEvents++;
// Generate a timestamp every 5...15 ms, average is 10.
timestamp += (5 + _rand.nextInt(10));
}
}
}
}
Flink 0.10.0 was just released recently. I have some code that needs to be migrated from 0.9.1, but I got the following error:
org.apache.flink.api.common.functions.InvalidTypesException: Type of TypeVariable 'K' in 'class fi.aalto.dmg.frame.FlinkPairWorkloadOperator' could not be determined. This is most likely a type erasure problem. The type extraction currently supports types with generic variables only in cases where all variables in the return type can be deduced from the input type(s).
Here is the code:
public class FlinkPairWorkloadOperator<K,V> implements PairWorkloadOperator<K,V> {
private DataStream<Tuple2<K, V>> dataStream;
public FlinkPairWorkloadOperator(DataStream<Tuple2<K, V>> dataStream1) {
this.dataStream = dataStream1;
}
public FlinkGroupedWorkloadOperator<K, V> groupByKey() {
KeyedStream<Tuple2<K, V>, K> keyedStream = this.dataStream.keyBy(new KeySelector<Tuple2<K, V>, K>() {
@Override
public K getKey(Tuple2<K, V> value) throws Exception {
return value._1();
}
});
return new FlinkGroupedWorkloadOperator<>(keyedStream);
}
}
To understand how the InvalidTypesException occurs, I have another example that also throws this exception, and I have no idea why. In this demo, the program works with scala.Tuple2, but not with Flink's Tuple2.
public class StreamingWordCount {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> counts = env
.socketTextStream("localhost", 9999)
.flatMap(new Splitter());
DataStream<Tuple2<String, Integer>> pairs = mapToPair(counts, mapToStringIntegerPair);
pairs.print();
env.execute("Socket Stream WordCount");
}
public static class Splitter implements FlatMapFunction<String, String> {
@Override
public void flatMap(String sentence, Collector<String> out) throws Exception {
for (String word: sentence.split(" ")) {
out.collect(word);
}
}
}
public static <K,V,T> DataStream<Tuple2<K,V>> mapToPair(DataStream<T> dataStream , final MapPairFunction<T, K, V> fun){
return dataStream.map(new MapFunction<T, Tuple2<K, V>>() {
@Override
public Tuple2<K, V> map(T t) throws Exception {
return fun.mapPair(t);
}
});
}
public interface MapPairFunction<T, K, V> extends Serializable {
Tuple2<K,V> mapPair(T t);
}
public static MapPairFunction<String, String, Integer> mapToStringIntegerPair = new MapPairFunction<String, String, Integer>() {
public Tuple2<String, Integer> mapPair(String s) {
return new Tuple2<String, Integer>(s, 1);
}
};
}
The problem is that you're using scala.Tuple2 instead of org.apache.flink.api.java.tuple.Tuple2 in combination with Flink's Java API. The TypeExtractor of the Java API does not understand Scala tuples. Therefore, it cannot extract the type for the type variable K.
If you use org.apache.flink.api.java.tuple.Tuple2 instead, then the TypeExtractor will be able to resolve the type variable.
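A minimal sketch of that change in the word-count demo, using Flink's Java Tuple2 (field access is f0/f1 rather than _1()/_2()); the rest of the program stays the same:
import java.io.Serializable;
import org.apache.flink.api.java.tuple.Tuple2;
// The pair interface and pair function now use the Java-API Tuple2.
public interface MapPairFunction<T, K, V> extends Serializable {
    Tuple2<K, V> mapPair(T t);
}

public static MapPairFunction<String, String, Integer> mapToStringIntegerPair =
        new MapPairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> mapPair(String s) {
                return new Tuple2<>(s, 1);
            }
        };

// In FlinkPairWorkloadOperator the key selector would likewise become:
// return value.f0;   // instead of value._1()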