Here's my code.
My question is as follows
Is it correct to clear state in this way?
Is this the correct way to use keyBy ?
//There are 1000,000 + storeId
orderStream.keyBy(Order::getStoreId)
.window(TumblingEventTimeWindows.of(Time.days(1), Time.hours(16)))
.trigger(ContinuousEventTimeTrigger.of(Time.seconds(1)))
.evictor(TimeEvictor.of(Time.seconds(0), true))
.process(new ProcessWindowFunction<Order, Object, Long, TimeWindow>() {
MapState<Long, Long> storeCountState;
#Override
public void process(Long storeId, Context context, Iterable<Order> elements, Collector<Object> out) throws Exception {
long sum = 0L;
for (Order element : elements) {
sum++;
}
storeCountState.put(storeId, storeCountState.get(storeId) + sum);
}
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
MapStateDescriptor<Long, Long> mapStateDescriptor = new MapStateDescriptor();
storeCountState = getRuntimeContext().getMapState(mapStateDescriptor);
}
#Override
public void close() throws Exception {
super.close();
// I clear state when each window close
storeCountState.clear();
}
})
.addSink(new PrintSinkFunction<>());
I think you should override the public void clear(Context context) throws Exception {} function, not the close() function.
Documentation
Related
I am trying to do pre shuffle aggregation in flink. Following is the MapBundle implementation.
public class TaxiFareMapBundleFunction extends MapBundleFunction<Long, TaxiFare, TaxiFare, TaxiFare> {
#Override
public TaxiFare addInput(#Nullable TaxiFare value, TaxiFare input) throws Exception {
if (value == null) {
return input;
}
value.tip = value.tip + input.tip;
return value;
}
#Override
public void finishBundle(Map<Long, TaxiFare> buffer, Collector<TaxiFare> out) throws Exception {
for (Map.Entry<Long, TaxiFare> entry : buffer.entrySet()) {
out.collect(entry.getValue());
}
}
}
I am using "CountBundleTrigger.java" . But the pre-shuffle aggregation is not working as the "count" variable is always 0. Please let me know If I am missing something.
#Override
public void onElement(T element) throws Exception {
count++;
if (count >= maxCount) {
callback.finishBundle();
reset();
}
}
Here is the main code.
MapBundleFunction<Long, TaxiFare, TaxiFare, TaxiFare> mapBundleFunction = new TaxiFareMapBundleFunction();
BundleTrigger<TaxiFare> bundleTrigger = new CountBundleTrigger<>(10);
KeySelector<TaxiFare, Long> taxiFareLongKeySelector = new KeySelector<TaxiFare, Long>() {
#Override
public Long getKey(TaxiFare value) throws Exception {
return value.driverId;
}
};
DataStream<Tuple3<Long, Long, Float>> hourlyTips =
// fares.keyBy((TaxiFare fare) -> fare.driverId)
//
.window(TumblingEventTimeWindows.of(Time.hours(1))).process(new AddTips());;
fares.transform("preshuffle", TypeInformation.of(TaxiFare.class),
new TaxiFareStream(mapBundleFunction, bundleTrigger,
taxiFareLongKeySelector
))
.assignTimestampsAndWatermarks(new
BoundedOutOfOrdernessTimestampExtractor<TaxiFare>(Time.seconds(20)) {
#Override
public long extractTimestamp(TaxiFare element) {
return element.startTime.getEpochSecond();
}
})
.keyBy((TaxiFare fare) -> fare.driverId)
.window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
.process(new AddTips());
DataStream<Tuple3<Long, Long, Float>> hourlyMax =
hourlyTips.windowAll(TumblingEventTimeWindows.of(Time.hours(1))).maxBy(2);
Here is the code for TaxiFareStream.java.
public class TaxiFareStream extends MapBundleOperator<Long, TaxiFare, TaxiFare, TaxiFare> {
private KeySelector<TaxiFare, Long> keySelector;
public TaxiFareStream(MapBundleFunction<Long, TaxiFare,
TaxiFare, TaxiFare> userFunction,
BundleTrigger<TaxiFare> bundleTrigger,
KeySelector<TaxiFare, Long> keySelector) {
super(userFunction, bundleTrigger, keySelector);
this.keySelector = keySelector;
}
#Override
protected Long getKey(TaxiFare input) throws Exception {
return keySelector.getKey(input);
}
}
Update
I have created the following class but I am seeing an error. I think it is not able to serialize the class MapStreamBundleOperator.java
public class MapStreamBundleOperator<K, V, IN, OUT> extends
AbstractMapStreamBundleOperator<K, V, IN, OUT> {
private static final long serialVersionUID = 6556268125924098320L;
/** KeySelector is used to extract key for bundle map. */
private final KeySelector<IN, K> keySelector;
public MapStreamBundleOperator(MapBundleFunction<K, V, IN, OUT> function, BundleTrigger<IN> bundleTrigger,
KeySelector<IN, K> keySelector) {
super(function, bundleTrigger);
this.keySelector = keySelector;
}
#Override
protected K getKey(IN input) throws Exception {
return this.keySelector.getKey(input);
}
}
`
2021-08-27 05:06:04,814 ERROR FlinkDefaults.class - Stream execution failed
org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot serialize operator object class org.apache.flink.streaming.api.operators.SimpleUdfStreamOperatorFactory.
at org.apache.flink.streaming.api.graph.StreamConfig.setStreamOperatorFactory(StreamConfig.java:247)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setVertexConfig(StreamingJobGraphGenerator.java:497)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createChain(StreamingJobGraphGenerator.java:318)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createChain(StreamingJobGraphGenerator.java:297)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createChain(StreamingJobGraphGenerator.java:297)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.setChaining(StreamingJobGraphGenerator.java:264)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:173)
at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:113)
at org.apache.flink.streaming.api.graph.StreamGraph.getJobGraph(StreamGraph.java:850)
at org.apache.flink.client.StreamGraphTranslator.translateToJobGraph(StreamGraphTranslator.java:52)
at org.apache.flink.client.FlinkPipelineTranslationUtil.getJobGraph(FlinkPipelineTranslationUtil.java:43)
at org.apache.flink.client.deployment.executors.PipelineExecutorUtils.getJobGraph(PipelineExecutorUtils.java:55)
at org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:62)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1810)
at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)
at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1697)
at com.pinterest.xenon.flink.FlinkDefaults$.run(FlinkDefaults.scala:46)
at com.pinterest.xenon.flink.FlinkWorkflow.run(FlinkWorkflow.scala:74)
at com.pinterest.xenon.flink.WorkflowLauncher$.executeWorkflow(WorkflowLauncher.scala:43)
at com.pinterest.xenon.flink.WorkflowLauncher$.delayedEndpoint$com$pinterest$xenon$flink$WorkflowLauncher$1(WorkflowLauncher.scala:25)
at com.pinterest.xenon.flink.WorkflowLauncher$delayedInit$body.apply(WorkflowLauncher.scala:9)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.pinterest.xenon.flink.WorkflowLauncher$.main(WorkflowLauncher.scala:9)
at com.pinterest.xenon.flink.WorkflowLauncher.main(WorkflowLauncher.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:168)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:699)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:232)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:916)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: java.io.NotSerializableException: visibility.mabs.src.main.java.com.pinterest.mabs.MabsFlinkJob
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
`
I would not rely on the official MapBundleOperator since David already said that this is not very well documented. I will answer this question based on my own AbstractMapStreamBundleOperator. I think that you are missing the counter numOfElements++; inside the processElement() method. And it is also better to use generic types. Use this code:
public abstract class AbstractMapStreamBundleOperator<K, V, IN, OUT>
extends AbstractUdfStreamOperator<OUT, MapBundleFunction<K, V, IN, OUT>>
implements OneInputStreamOperator<IN, OUT>, BundleTriggerCallback {
private static final long serialVersionUID = 1L;
private final Map<K, V> bundle;
private final BundleTrigger<IN> bundleTrigger;
private transient TimestampedCollector<OUT> collector;
private transient int numOfElements = 0;
public AbstractMapStreamBundleOperator(MapBundleFunction<K, V, IN, OUT> function, BundleTrigger<IN> bundleTrigger) {
super(function);
chainingStrategy = ChainingStrategy.ALWAYS;
this.bundle = new HashMap<>();
this.bundleTrigger = checkNotNull(bundleTrigger, "bundleTrigger is null");
}
#Override
public void open() throws Exception {
super.open();
numOfElements = 0;
collector = new TimestampedCollector<>(output);
bundleTrigger.registerCallback(this);
// reset trigger
bundleTrigger.reset();
}
#Override
public void processElement(StreamRecord<IN> element) throws Exception {
// get the key and value for the map bundle
final IN input = element.getValue();
final K bundleKey = getKey(input);
final V bundleValue = this.bundle.get(bundleKey);
// get a new value after adding this element to bundle
final V newBundleValue = userFunction.addInput(bundleValue, input);
// update to map bundle
bundle.put(bundleKey, newBundleValue);
numOfElements++;
bundleTrigger.onElement(input);
}
protected abstract K getKey(final IN input) throws Exception;
#Override
public void finishBundle() throws Exception {
if (!bundle.isEmpty()) {
numOfElements = 0;
userFunction.finishBundle(bundle, collector);
bundle.clear();
}
bundleTrigger.reset();
}
}
Then create the MapStreamBundleOperator like you already have. Use this code:
public class MapStreamBundleOperator<K, V, IN, OUT> extends AbstractMapStreamBundleOperator<K, V, IN, OUT> {
private final KeySelector<IN, K> keySelector;
public MapStreamBundleOperator(MapBundleFunction<K, V, IN, OUT> function, BundleTrigger<IN> bundleTrigger,
KeySelector<IN, K> keySelector) {
super(function, bundleTrigger);
this.keySelector = keySelector;
}
#Override
protected K getKey(IN input) throws Exception {
return this.keySelector.getKey(input);
}
}
The counter inside the trigger is that makes the Bundle operator flush the events to the next phase. The CountBundleTrigger is like below. Use this code. You will need also the BundleTriggerCallback.
public class CountBundleTrigger<T> implements BundleTrigger<T> {
private final long maxCount;
private transient BundleTriggerCallback callback;
private transient long count = 0;
public CountBundleTrigger(long maxCount) {
Preconditions.checkArgument(maxCount > 0, "maxCount must be greater than 0");
this.maxCount = maxCount;
}
#Override
public void registerCallback(BundleTriggerCallback callback) {
this.callback = Preconditions.checkNotNull(callback, "callback is null");
}
#Override
public void onElement(T element) throws Exception {
count++;
if (count >= maxCount) {
callback.finishBundle();
reset();
}
}
#Override
public void reset() { count = 0; }
#Override
public String explain() {
return "CountBundleTrigger with size " + maxCount;
}
}
Then you have to create one of this trigger to pass on your operator. Here I am creating a bundle of 100 TaxiFare events. Take this example with another POJO. I wrote the MapBundleTaxiFareImpl here but you can create your UDF based on this one.
private OneInputStreamOperator<Tuple2<Long, TaxiFare>, Tuple2<Long, TaxiFare>> getPreAggOperator() {
MapBundleFunction<Long, TaxiFare, Tuple2<Long, TaxiFare>, Tuple2<Long, TaxiFare>> myMapBundleFunction = new MapBundleTaxiFareImpl();
CountBundleTrigger<Tuple2<Long, TaxiFare>> bundleTrigger = new CountBundleTrigger<Tuple2<Long, TaxiFare>>(100);
return new MapStreamBundleOperator<>(myMapBundleFunction, bundleTrigger, keyBundleSelector);
}
In the end you call this new operator somewhere using the transform(). Take this example with another POJO.
stream
...
.transform("my-pre-agg",
TypeInformation.of(new TypeHint<Tuple2<Long, TaxiFare>>(){}), getPreAggOperator())
...
I this that it is all that you need. Try to use those class and if it is missing something it is probably on the gitrepository that I put the links. i hope you can make it work.
I'm working with Apache Flink and using the machanism ConnectedStreams. Here is my code:
public class StreamingJob {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> control = env.fromElements("DROP", "IGNORE");
DataStream<String> streamOfWords = env.fromElements("Apache", "DROP", "Flink", "IGNORE");
control
.connect(datastreamOfWords)
.flatMap(new ControlFunction())
.print();
env.execute();
}
public static class ControlFunction extends RichCoFlatMapFunction<String, String, String> {
private boolean found;
#Override
public void open(Configuration config) {
this.found = false;
}
#Override
public void flatMap1(String control_value, Collector<String> out) throws Exception {
if (control_value.equals("DROP")) {
this.found = true;
} else {
this.found = false;
}
}
#Override
public void flatMap2(String data_value, Collector<String> out) throws Exception {
if (this.found) {
out.collect(data_value);
this.found = false;
} else {
// nothing to do
}
}
}
}
As you see, I used a boolean variable to control the process of stream. The boolean variable found is read and written in flatMap1 and in flatMap2. So I'm thinking if I need to worry about the thread-safe issue.
Can the ConnectedStreams ensure thread safe? If not, does it mean that I need to lock the variable found in flatMap1 and in flatMap2?
The calls to flatMap1() and flatMap2() are guaranteed to not overlap, so you don't need to worry about concurrent access to your class's variables.
I know that keyed state belongs to the its key and only current key accesses its state value, other keys can not access to the different key's state value.
I tried to access the state with the same key but in different stream. Is it possible?
If it is not possible then I will have 2 duplicate data?
Not: I need two stream because each of them will have different timewindow and also different implementations.
Here is the example (I know that keyBy(sommething) is the same for both stream operations):
public class Sample{
streamA
.keyBy(something)
.timeWindow(Time.seconds(4))
.process(new CustomMyProcessFunction())
.name("CustomMyProcessFunction")
.print();
streamA
.keyBy(something)
.timeWindow(Time.seconds(1))
.process(new CustomMyAnotherProcessFunction())
.name("CustomMyProcessFunction")
.print();
}
public class CustomMyProcessFunction extends ProcessWindowFunction<..>
{
private Logger logger = LoggerFactory.getLogger(CustomMyProcessFunction.class);
private transient ValueState<SimpleEntity> simpleEntityValueState;
private SimpleEntity simpleEntity;
#Override
public void open(Configuration parameters) throws Exception
{
ValueStateDescriptor<SimpleEntity> simpleEntityValueStateDescriptor = new ValueStateDescriptor<SimpleEntity>(
"sample",
TypeInformation.of(SimpleEntity.class)
);
simpleEntityValueState = getRuntimeContext().getState(simpleEntityValueStateDescriptor);
}
#Override
public void process(...) throws Exception
{
SimpleEntity value = simpleEntityValueState.value();
if (value == null)
{
SimpleEntity newVal = new SimpleEntity("sample");
logger.info("New Value put");
simpleEntityValueState.update(newVal);
}
...
}
...
}
public class CustomMyAnotherProcessFunction extends ProcessWindowFunction<..>
{
private transient ValueState<SimpleEntity> simpleEntityValueState;
#Override
public void open(Configuration parameters) throws Exception
{
ValueStateDescriptor<SimpleEntity> simpleEntityValueStateDescriptor = new ValueStateDescriptor<SimpleEntity>(
"sample",
TypeInformation.of(SimpleEntity.class)
);
simpleEntityValueState = getRuntimeContext().getState(simpleEntityValueStateDescriptor);
}
#Override
public void process(...) throws Exception
{
SimpleEntity value = simpleEntityValueState.value();
if (value != null)
logger.info(value.toString()); // I expect that SimpleEntity("sample")
out.collect(...);
}
...
}
As has been pointed out already, state is always local to a single operator instance. It cannot be shared.
What you can do, however, is stream the state updates from the operator holding the state to other operators that need it. With side outputs you can create complex dataflows without needing to share state.
I tried with your idea to share state between two operators using same key.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.io.IOException;
public class FlinkReuseState {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3);
DataStream<Integer> stream1 = env.addSource(new SourceFunction<Integer>() {
#Override
public void run(SourceContext<Integer> sourceContext) throws Exception {
int i = 0;
while (true) {
sourceContext.collect(1);
Thread.sleep(1000);
}
}
#Override
public void cancel() {
}
});
DataStream<Integer> stream2 = env.addSource(new SourceFunction<Integer>() {
#Override
public void run(SourceContext<Integer> sourceContext) throws Exception {
while (true) {
sourceContext.collect(1);
Thread.sleep(1000);
}
}
#Override
public void cancel() {
}
});
DataStream<Integer> windowedStream1 = stream1.keyBy(Integer::intValue)
.timeWindow(Time.seconds(3))
.process(new ProcessWindowFunction<Integer, Integer, Integer, TimeWindow>() {
private ValueState<Integer> value;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<Integer>("value", Integer.class);
value = getRuntimeContext().getState(desc);
}
#Override
public void process(Integer integer, Context context, Iterable<Integer> iterable, Collector<Integer> collector) throws Exception {
iterable.forEach(x -> {
try {
if (value.value() == null) {
value.update(1);
} else {
value.update(value.value() + 1);
}
} catch (IOException e) {
e.printStackTrace();
}
});
collector.collect(value.value());
}
});
DataStream<String> windowedStream2 = stream2.keyBy(Integer::intValue)
.timeWindow(Time.seconds(3))
.process(new ProcessWindowFunction<Integer, String, Integer, TimeWindow>() {
private ValueState<Integer> value;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<Integer>("value", Integer.class);
value = getRuntimeContext().getState(desc);
}
#Override
public void process(Integer s, Context context, Iterable<Integer> iterable, Collector<String> collector) throws Exception {
iterable.forEach(x -> {
try {
if (value.value() == null) {
value.update(1);
} else {
value.update(value.value() + 1);
}
} catch (IOException e) {
e.printStackTrace();
}
});
collector.collect(String.valueOf(value.value()));
}
});
windowedStream2.print();
windowedStream1.print();
env.execute();
}
}
It doesn't work, each stream only update its own value state, the output is listed below.
3> 3
3> 3
3> 6
3> 6
3> 9
3> 9
3> 12
3> 12
3> 15
3> 15
3> 18
3> 18
3> 21
3> 21
3> 24
3> 24
keyed state
Based on the official docs, *Each keyed-state is logically bound to a unique composite of <parallel-operator-instance, key>, and since each key “belongs” to exactly one parallel instance of a keyed operator, we can think of this simply as <operator, key>*.
I think it is not possible to share state by giving same name to states in different operators.
Have u tried coprocess function? By doing so, you can also implement two proccess funcs for each stream, the only problem will be the timewindow then. Can you provide more details about your process logic?
Why cant you return the state as part of map operation and that stream can be used to connect to other stream
I am write my Apache Flink(1.10) to update records real time like this:
public class WalletConsumeRealtimeHandler {
public static void main(String[] args) throws Exception {
walletConsumeHandler();
}
public static void walletConsumeHandler() throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkUtil.initMQ();
FlinkUtil.initEnv(env);
DataStream<String> dataStreamSource = env.addSource(FlinkUtil.initDatasource("wallet.consume.report.realtime"));
DataStream<ReportWalletConsumeRecord> consumeRecord =
dataStreamSource.map(new MapFunction<String, ReportWalletConsumeRecord>() {
#Override
public ReportWalletConsumeRecord map(String value) throws Exception {
ObjectMapper mapper = new ObjectMapper();
ReportWalletConsumeRecord consumeRecord = mapper.readValue(value, ReportWalletConsumeRecord.class);
consumeRecord.setMergedRecordCount(1);
return consumeRecord;
}
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());
consumeRecord.keyBy(
new KeySelector<ReportWalletConsumeRecord, Tuple2<String, Long>>() {
#Override
public Tuple2<String, Long> getKey(ReportWalletConsumeRecord value) throws Exception {
return Tuple2.of(value.getConsumeItem(), value.getTenantId());
}
})
.timeWindow(Time.seconds(5))
.reduce(new SumField(), new CollectionWindow())
.addSink(new SinkFunction<List<ReportWalletConsumeRecord>>() {
#Override
public void invoke(List<ReportWalletConsumeRecord> reportPumps, Context context) throws Exception {
WalletConsumeRealtimeHandler.invoke(reportPumps);
}
});
env.execute(WalletConsumeRealtimeHandler.class.getName());
}
private static class CollectionWindow extends ProcessWindowFunction<ReportWalletConsumeRecord,
List<ReportWalletConsumeRecord>,
Tuple2<String, Long>,
TimeWindow> {
public void process(Tuple2<String, Long> key,
Context context,
Iterable<ReportWalletConsumeRecord> minReadings,
Collector<List<ReportWalletConsumeRecord>> out) throws Exception {
ArrayList<ReportWalletConsumeRecord> employees = Lists.newArrayList(minReadings);
if (employees.size() > 0) {
out.collect(employees);
}
}
}
private static class SumField implements ReduceFunction<ReportWalletConsumeRecord> {
public ReportWalletConsumeRecord reduce(ReportWalletConsumeRecord d1, ReportWalletConsumeRecord d2) {
Integer merged1 = d1.getMergedRecordCount() == null ? 1 : d1.getMergedRecordCount();
Integer merged2 = d2.getMergedRecordCount() == null ? 1 : d2.getMergedRecordCount();
d1.setMergedRecordCount(merged1 + merged2);
d1.setConsumeNum(d1.getConsumeNum() + d2.getConsumeNum());
return d1;
}
}
public static void invoke(List<ReportWalletConsumeRecord> records) {
WalletConsumeService service = FlinkUtil.InitRetrofit().create(WalletConsumeService.class);
Call<ResponseBody> call = service.saveRecords(records);
call.enqueue(new Callback<ResponseBody>() {
#Override
public void onResponse(Call<ResponseBody> call, Response<ResponseBody> response) {
}
#Override
public void onFailure(Call<ResponseBody> call, Throwable t) {
t.printStackTrace();
}
});
}
}
and now I found the Flink task only receive at least 2 records to trigger sink, is the reduce action need this?
You need two records to trigger the window. Flink only knows when to close a window (and fire subsequent calculation) when it receives a watermark that is larger than the configured value of the end of the window.
In your case, you use BoundedOutOfOrdernessGenerator, which updates the watermark according to the incoming records. So it generates a second watermark only after having seen the second record.
You can use a different watermark generator. In the troubleshooting training there is a watermark generator that also generates watermarks on timeout.
I am writting a pipe to group session for a user keyed by id and window using eventSessionWindow. I am using the Periodic WM and a custom session accumulator which will count the event is a given session.
What is happenning is my window operator is consuming records but not emmiting out. I am not sure what is missing here.
FlinkKafkaConsumer010<String> eventSource =
new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), properties);
eventSource.setStartFromLatest();
DataStream<Event> eventStream = env.addSource(eventSource
).flatMap(
new FlatMapFunction<String, Event>() {
#Override
public void flatMap(String value, Collector<Event> out) throws Exception {
out.collect(Event.toEvent(value));
}
}
).assignTimestampsAndWatermarks(
new AssignerWithPeriodicWatermarks<Event>() {
long maxTime;
#Override
public long extractTimestamp(Event element, long previousElementTimestamp) {
maxTime = Math.max(previousElementTimestamp, maxTime);
return previousElementTimestamp;
}
#Nullable
#Override
public Watermark getCurrentWatermark() {
return new Watermark(maxTime);
}
}
);
DataStream <Session> session_stream =eventStream.keyBy((KeySelector<Event, String>)value -> value.id)
.window(EventTimeSessionWindows.withGap(Time.minutes(5)))
.aggregate(new AggregateFunction<Event, pipe.SessionAccumulator, Session>() {
#Override
public pipe.SessionAccumulator createAccumulator() {
return new pipe.SessionAccumulator();
}
#Override
public pipe.SessionAccumulator add(Event e, pipe.SessionAccumulator sessionAccumulator) {
sessionAccumulator.add(e);
return sessionAccumulator;
}
#Override
public Session getResult(pipe.SessionAccumulator sessionAccumulator) {
return sessionAccumulator.getLocalValue();
}
#Override
public pipe.SessionAccumulator merge(pipe.SessionAccumulator prev, pipe.SessionAccumulator next) {
prev.merge(next);
return prev;
}
}, new WindowFunction<Session, Session, String, TimeWindow>() {
#Override
public void apply(String s, TimeWindow timeWindow, Iterable<Session> iterable, Collector<Session> collector) throws Exception {
collector.collect(iterable.iterator().next());
}
});
public static class SessionAccumulator implements Accumulator<Event, Session>{
Session session;
public SessionAccumulator(){
session = new Session();
}
#Override
public void add(Event e) {
session.add(e);
}
#Override
public Session getLocalValue() {
return session;
}
#Override
public void resetLocal() {
session = new Session();
}
#Override
public void merge(Accumulator<Event, Session> accumulator) {
session.merge(Collections.singletonList(accumulator.getLocalValue()));
}
#Override
public Accumulator<Event, Session> clone() {
SessionAccumulator sessionAccumulator = new SessionAccumulator();
sessionAccumulator.session = new Session(
session.id,
);
return sessionAccumulator;
}
}
public static class SessionAccumulator implements Accumulator<Event, Session>{
Session session;
public SessionAccumulator(){
session = new Session();
}
#Override
public void add(Event e) {
session.add(e);
}
#Override
public Session getLocalValue() {
return session;
}
#Override
public void resetLocal() {
session = new Session();
}
#Override
public void merge(Accumulator<Event, Session> accumulator) {
session.merge(Collections.singletonList(accumulator.getLocalValue()));
}
#Override
public Accumulator<Event, Session> clone() {
SessionAccumulator sessionAccumulator = new SessionAccumulator();
sessionAccumulator.session = new Session(
session.id,
session.lastEventTime,
session.earliestEventTime,
session.count;
);
return sessionAccumulator;
}
}
If your watermarks are not advancing, this would explain why no results are being emitted by the window. Possible causes include:
Your events haven't been timestamped by Kafka, and thus previousElementTimestamp isn't set.
You have an idle Kafka partition holding back the watermarks. (This is a somewhat complex topic. If this turns out to be the cause of your problems, and you get stuck on it, please come back with a new question.)
Another possibility is that there is never a 5 minute-long gap in the events, in which case the events will accumulate in a never-ending session.
Also, you don't appear to have included a sink. If you don't print or otherwise send the results to a sink, Flink won't do anything.
And don't forget that you must call env.execute() to get anything to happen.
A few other things:
Your watermark generator isn't allowing for any out-of-orderness, so the window is going to ignore all out-of-order events (because they will be late). If your events have strictly ascending timestamps you should go ahead and use a AscendingTimestampExtractor; if they can be out-of-order, then a BoundedOutOfOrdernessTimestampExtractor is appropriate.
Your WindowFunction is superfluous. It is simply forwarding downstream the result from the aggregator, so you could remove it.
You have posted two different implementations of SessionAccumulator.