Apache Flink needs at least 2 records to trigger the sink - apache-flink

I wrote an Apache Flink (1.10) job to update records in real time, like this:
public class WalletConsumeRealtimeHandler {
public static void main(String[] args) throws Exception {
walletConsumeHandler();
}
public static void walletConsumeHandler() throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkUtil.initMQ();
FlinkUtil.initEnv(env);
DataStream<String> dataStreamSource = env.addSource(FlinkUtil.initDatasource("wallet.consume.report.realtime"));
DataStream<ReportWalletConsumeRecord> consumeRecord =
dataStreamSource.map(new MapFunction<String, ReportWalletConsumeRecord>() {
@Override
public ReportWalletConsumeRecord map(String value) throws Exception {
ObjectMapper mapper = new ObjectMapper();
ReportWalletConsumeRecord consumeRecord = mapper.readValue(value, ReportWalletConsumeRecord.class);
consumeRecord.setMergedRecordCount(1);
return consumeRecord;
}
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());
consumeRecord.keyBy(
new KeySelector<ReportWalletConsumeRecord, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> getKey(ReportWalletConsumeRecord value) throws Exception {
return Tuple2.of(value.getConsumeItem(), value.getTenantId());
}
})
.timeWindow(Time.seconds(5))
.reduce(new SumField(), new CollectionWindow())
.addSink(new SinkFunction<List<ReportWalletConsumeRecord>>() {
@Override
public void invoke(List<ReportWalletConsumeRecord> reportPumps, Context context) throws Exception {
WalletConsumeRealtimeHandler.invoke(reportPumps);
}
});
env.execute(WalletConsumeRealtimeHandler.class.getName());
}
private static class CollectionWindow extends ProcessWindowFunction<ReportWalletConsumeRecord,
List<ReportWalletConsumeRecord>,
Tuple2<String, Long>,
TimeWindow> {
public void process(Tuple2<String, Long> key,
Context context,
Iterable<ReportWalletConsumeRecord> minReadings,
Collector<List<ReportWalletConsumeRecord>> out) throws Exception {
ArrayList<ReportWalletConsumeRecord> employees = Lists.newArrayList(minReadings);
if (employees.size() > 0) {
out.collect(employees);
}
}
}
private static class SumField implements ReduceFunction<ReportWalletConsumeRecord> {
public ReportWalletConsumeRecord reduce(ReportWalletConsumeRecord d1, ReportWalletConsumeRecord d2) {
Integer merged1 = d1.getMergedRecordCount() == null ? 1 : d1.getMergedRecordCount();
Integer merged2 = d2.getMergedRecordCount() == null ? 1 : d2.getMergedRecordCount();
d1.setMergedRecordCount(merged1 + merged2);
d1.setConsumeNum(d1.getConsumeNum() + d2.getConsumeNum());
return d1;
}
}
public static void invoke(List<ReportWalletConsumeRecord> records) {
WalletConsumeService service = FlinkUtil.InitRetrofit().create(WalletConsumeService.class);
Call<ResponseBody> call = service.saveRecords(records);
call.enqueue(new Callback<ResponseBody>() {
@Override
public void onResponse(Call<ResponseBody> call, Response<ResponseBody> response) {
}
@Override
public void onFailure(Call<ResponseBody> call, Throwable t) {
t.printStackTrace();
}
});
}
}
Now I have found that the Flink task needs to receive at least 2 records before the sink is triggered. Does the reduce operation require this?

You need two records to trigger the window. Flink only knows when to close a window (and fire the subsequent computation) once it receives a watermark that is later than the end of the window.
In your case, you use BoundedOutOfOrdernessGenerator, which updates the watermark according to the incoming records. So it generates a second, later watermark only after it has seen the second record.
You can use a different watermark generator. In the troubleshooting training there is a watermark generator that also generates watermarks on timeout.
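As a rough illustration of the "watermarks on timeout" idea, here is a minimal plain-Java model. It deliberately does not use Flink's classes; IdleAwareWatermarkModel, its fields and the two timing parameters are hypothetical names chosen for this sketch, and it assumes that event time advances at roughly the same rate as wall-clock time. It also assumes an event-time window fires once the watermark reaches the window's last millisecond (end - 1).

```java
// Plain-Java model of a watermark generator that keeps advancing while the source is idle.
// Not a Flink API: in a real job this logic would sit inside a periodic watermark assigner.
final class IdleAwareWatermarkModel {
    private final long maxOutOfOrdernessMs; // how far behind the max timestamp the watermark trails
    private final long idleTimeoutMs;       // silence longer than this counts as "idle"
    private long maxEventTimeMs = Long.MIN_VALUE;
    private long lastArrivalWallClockMs;
    private long lastWatermarkMs = Long.MIN_VALUE;

    IdleAwareWatermarkModel(long maxOutOfOrdernessMs, long idleTimeoutMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.idleTimeoutMs = idleTimeoutMs;
    }

    // Called for every record: remember the largest event time and when the record arrived.
    void onEvent(long eventTimeMs, long wallClockMs) {
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        lastArrivalWallClockMs = wallClockMs;
    }

    // Called periodically: normally trail the max timestamp; when idle, add the idle duration.
    long currentWatermark(long wallClockMs) {
        if (maxEventTimeMs == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing seen yet
        }
        long idleForMs = wallClockMs - lastArrivalWallClockMs;
        long candidate = maxEventTimeMs - maxOutOfOrdernessMs;
        if (idleForMs > idleTimeoutMs) {
            candidate += idleForMs;
        }
        lastWatermarkMs = Math.max(lastWatermarkMs, candidate); // watermarks never move backwards
        return lastWatermarkMs;
    }

    public static void main(String[] args) {
        IdleAwareWatermarkModel wm = new IdleAwareWatermarkModel(3_000L, 1_000L);
        wm.onEvent(10_000L, 0L);                         // a single record, window [10000, 15000)
        System.out.println(wm.currentWatermark(500L));   // 7000: window end not reached
        System.out.println(wm.currentWatermark(8_000L)); // 15000: past the window end, no second record needed
    }
}
```

With a purely record-driven generator the watermark would stay at 7000 until another record arrived, which matches the behaviour described in the question.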

Related

Is ConnectedStreams thread safe in Apache Flink

I'm working with Apache Flink and using the ConnectedStreams mechanism. Here is my code:
public class StreamingJob {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> control = env.fromElements("DROP", "IGNORE");
DataStream<String> streamOfWords = env.fromElements("Apache", "DROP", "Flink", "IGNORE");
control
.connect(streamOfWords)
.flatMap(new ControlFunction())
.print();
env.execute();
}
public static class ControlFunction extends RichCoFlatMapFunction<String, String, String> {
private boolean found;
@Override
public void open(Configuration config) {
this.found = false;
}
@Override
public void flatMap1(String control_value, Collector<String> out) throws Exception {
if (control_value.equals("DROP")) {
this.found = true;
} else {
this.found = false;
}
}
@Override
public void flatMap2(String data_value, Collector<String> out) throws Exception {
if (this.found) {
out.collect(data_value);
this.found = false;
} else {
// nothing to do
}
}
}
}
As you can see, I use a boolean variable to control the processing of the stream. The boolean variable found is read and written in both flatMap1 and flatMap2, so I'm wondering whether I need to worry about thread safety.
Does ConnectedStreams guarantee thread safety? If not, does that mean I need to lock the variable found in flatMap1 and flatMap2?
The calls to flatMap1() and flatMap2() are guaranteed to not overlap, so you don't need to worry about concurrent access to your class's variables.

Flink Event Session Window not emitting records

I am writing a pipeline to group sessions for a user, keyed by id, using EventTimeSessionWindows. I am using a periodic watermark assigner and a custom session accumulator that counts the events in a given session.
What is happening is that my window operator is consuming records but not emitting any. I am not sure what is missing here.
FlinkKafkaConsumer010<String> eventSource =
new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), properties);
eventSource.setStartFromLatest();
DataStream<Event> eventStream = env.addSource(eventSource
).flatMap(
new FlatMapFunction<String, Event>() {
@Override
public void flatMap(String value, Collector<Event> out) throws Exception {
out.collect(Event.toEvent(value));
}
}
).assignTimestampsAndWatermarks(
new AssignerWithPeriodicWatermarks<Event>() {
long maxTime;
@Override
public long extractTimestamp(Event element, long previousElementTimestamp) {
maxTime = Math.max(previousElementTimestamp, maxTime);
return previousElementTimestamp;
}
@Nullable
@Override
public Watermark getCurrentWatermark() {
return new Watermark(maxTime);
}
}
);
DataStream <Session> session_stream =eventStream.keyBy((KeySelector<Event, String>)value -> value.id)
.window(EventTimeSessionWindows.withGap(Time.minutes(5)))
.aggregate(new AggregateFunction<Event, pipe.SessionAccumulator, Session>() {
@Override
public pipe.SessionAccumulator createAccumulator() {
return new pipe.SessionAccumulator();
}
@Override
public pipe.SessionAccumulator add(Event e, pipe.SessionAccumulator sessionAccumulator) {
sessionAccumulator.add(e);
return sessionAccumulator;
}
@Override
public Session getResult(pipe.SessionAccumulator sessionAccumulator) {
return sessionAccumulator.getLocalValue();
}
@Override
public pipe.SessionAccumulator merge(pipe.SessionAccumulator prev, pipe.SessionAccumulator next) {
prev.merge(next);
return prev;
}
}, new WindowFunction<Session, Session, String, TimeWindow>() {
@Override
public void apply(String s, TimeWindow timeWindow, Iterable<Session> iterable, Collector<Session> collector) throws Exception {
collector.collect(iterable.iterator().next());
}
});
public static class SessionAccumulator implements Accumulator<Event, Session>{
Session session;
public SessionAccumulator(){
session = new Session();
}
@Override
public void add(Event e) {
session.add(e);
}
@Override
public Session getLocalValue() {
return session;
}
@Override
public void resetLocal() {
session = new Session();
}
@Override
public void merge(Accumulator<Event, Session> accumulator) {
session.merge(Collections.singletonList(accumulator.getLocalValue()));
}
@Override
public Accumulator<Event, Session> clone() {
SessionAccumulator sessionAccumulator = new SessionAccumulator();
sessionAccumulator.session = new Session(
session.id,
);
return sessionAccumulator;
}
}
public static class SessionAccumulator implements Accumulator<Event, Session>{
Session session;
public SessionAccumulator(){
session = new Session();
}
@Override
public void add(Event e) {
session.add(e);
}
@Override
public Session getLocalValue() {
return session;
}
@Override
public void resetLocal() {
session = new Session();
}
@Override
public void merge(Accumulator<Event, Session> accumulator) {
session.merge(Collections.singletonList(accumulator.getLocalValue()));
}
@Override
public Accumulator<Event, Session> clone() {
SessionAccumulator sessionAccumulator = new SessionAccumulator();
sessionAccumulator.session = new Session(
session.id,
session.lastEventTime,
session.earliestEventTime,
session.count
);
return sessionAccumulator;
}
}
If your watermarks are not advancing, this would explain why no results are being emitted by the window. Possible causes include:
Your events haven't been timestamped by Kafka, and thus previousElementTimestamp isn't set.
You have an idle Kafka partition holding back the watermarks. (This is a somewhat complex topic. If this turns out to be the cause of your problems, and you get stuck on it, please come back with a new question.)
Another possibility is that there is never a 5 minute-long gap in the events, in which case the events will accumulate in a never-ending session.
Also, you don't appear to have included a sink. If you don't print or otherwise send the results to a sink, Flink won't do anything.
And don't forget that you must call env.execute() to get anything to happen.
A few other things:
Your watermark generator isn't allowing for any out-of-orderness, so the window is going to ignore all out-of-order events (because they will be late). If your events have strictly ascending timestamps, you should go ahead and use an AscendingTimestampExtractor; if they can be out of order, then a BoundedOutOfOrdernessTimestampExtractor is appropriate.
Your WindowFunction is superfluous. It is simply forwarding downstream the result from the aggregator, so you could remove it.
You have posted two different implementations of SessionAccumulator.
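To make the "never-ending session" and "watermarks not advancing" cases above concrete, here is a small plain-Java sketch. It does not use Flink; SessionFiringModel and its methods are hypothetical names, and it assumes the usual event-time rule that a window fires once the watermark reaches the window's last millisecond (end - 1), with a session ending one gap after its last event.

```java
// Plain-Java model of when an event-time session window can fire (not a Flink API).
final class SessionFiringModel {
    // Exclusive end of a session whose last event carries the given timestamp.
    static long sessionEnd(long lastEventTs, long gapMs) {
        return lastEventTs + gapMs;
    }

    // The session fires once the watermark reaches its last included millisecond.
    static boolean fires(long lastEventTs, long gapMs, long watermark) {
        return watermark >= sessionEnd(lastEventTs, gapMs) - 1;
    }

    public static void main(String[] args) {
        long gapMs = 5 * 60 * 1000L;               // 5-minute gap, as in the question
        long[] eventTimes = {0L, 60_000L, 120_000L};

        long lastEventTs = Long.MIN_VALUE;
        long watermark = Long.MIN_VALUE;
        for (long ts : eventTimes) {
            lastEventTs = Math.max(lastEventTs, ts);
            watermark = Math.max(watermark, ts);   // watermark = max timestamp seen so far
        }

        // The watermark is stuck at the last event, well before the session end at 420000.
        System.out.println(fires(lastEventTs, gapMs, watermark)); // false
        // Only a later watermark (from a later event or an idle-aware generator) closes the session.
        System.out.println(fires(lastEventTs, gapMs, 419_999L));  // true
    }
}
```

If events keep arriving less than five minutes apart, lastEventTs keeps moving and the session end moves with it, so the condition never becomes true.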

Why does CEP print the first event only after I input a second event when using ProcessingTime?

I sent one event with isStart set to true to Kafka and had Flink consume it. I also set the TimeCharacteristic to ProcessingTime and used within(Time.seconds(5)), so I expected CEP to print the event 5 seconds after I sent it. It did not: it printed the first event only after I sent a second event to Kafka. Why was the first event printed only after I had sent two events? Shouldn't it be printed 5 seconds after I sent the first one when using ProcessingTime?
The following is the code:
public class LongRidesWithKafka {
private static final String LOCAL_ZOOKEEPER_HOST = "localhost:2181";
private static final String LOCAL_KAFKA_BROKER = "localhost:9092";
private static final String RIDE_SPEED_GROUP = "rideSpeedGroup";
private static final int MAX_EVENT_DELAY = 60; // rides are at most 60 sec out-of-order.
public static void main(String[] args) throws Exception {
final int popThreshold = 1; // threshold for popular places
// set up streaming execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
Properties kafkaProps = new Properties();
//kafkaProps.setProperty("zookeeper.connect", LOCAL_ZOOKEEPER_HOST);
kafkaProps.setProperty("bootstrap.servers", LOCAL_KAFKA_BROKER);
kafkaProps.setProperty("group.id", RIDE_SPEED_GROUP);
// always read the Kafka topic from the start
kafkaProps.setProperty("auto.offset.reset", "earliest");
// create a Kafka consumer
FlinkKafkaConsumer011<TaxiRide> consumer = new FlinkKafkaConsumer011<>(
"flinktest",
new TaxiRideSchema(),
kafkaProps);
// assign a timestamp extractor to the consumer
//consumer.assignTimestampsAndWatermarks(new CustomWatermarkExtractor());
DataStream<TaxiRide> rides = env.addSource(consumer);
DataStream<TaxiRide> keyedRides = rides.keyBy("rideId");
// A complete taxi ride has a START event followed by an END event
Pattern<TaxiRide, TaxiRide> completedRides =
Pattern.<TaxiRide>begin("start")
.where(new SimpleCondition<TaxiRide>() {
@Override
public boolean filter(TaxiRide ride) throws Exception {
return ride.isStart;
}
})
.next("end")
.where(new SimpleCondition<TaxiRide>() {
@Override
public boolean filter(TaxiRide ride) throws Exception {
return !ride.isStart;
}
});
// We want to find rides that have NOT been completed within 5 seconds
PatternStream<TaxiRide> patternStream = CEP.pattern(keyedRides, completedRides.within(Time.seconds(5)));
OutputTag<TaxiRide> timedout = new OutputTag<TaxiRide>("timedout") {
};
SingleOutputStreamOperator<TaxiRide> longRides = patternStream.flatSelect(
timedout,
new LongRides.TaxiRideTimedOut<TaxiRide>(),
new LongRides.FlatSelectNothing<TaxiRide>()
);
longRides.getSideOutput(timedout).print();
env.execute("Long Taxi Rides");
}
public static class TaxiRideTimedOut<TaxiRide> implements PatternFlatTimeoutFunction<TaxiRide, TaxiRide> {
@Override
public void timeout(Map<String, List<TaxiRide>> map, long l, Collector<TaxiRide> collector) throws Exception {
TaxiRide rideStarted = map.get("start").get(0);
collector.collect(rideStarted);
}
}
public static class FlatSelectNothing<T> implements PatternFlatSelectFunction<T, T> {
@Override
public void flatSelect(Map<String, List<T>> pattern, Collector<T> collector) {
}
}
private static class TaxiRideTSExtractor extends AscendingTimestampExtractor<TaxiRide> {
private static final long serialVersionUID = 1L;
@Override
public long extractAscendingTimestamp(TaxiRide ride) {
// Watermark Watermark = getCurrentWatermark();
if (ride.isStart) {
return ride.startTime.getMillis();
} else {
return ride.endTime.getMillis();
}
}
}
private static class CustomWatermarkExtractor implements AssignerWithPeriodicWatermarks<TaxiRide> {
private static final long serialVersionUID = -742759155861320823L;
private long currentTimestamp = Long.MIN_VALUE;
@Override
public long extractTimestamp(TaxiRide ride, long previousElementTimestamp) {
// the inputs are assumed to be of format (message,timestamp)
if (ride.isStart) {
this.currentTimestamp = ride.startTime.getMillis();
return ride.startTime.getMillis();
} else {
this.currentTimestamp = ride.endTime.getMillis();
return ride.endTime.getMillis();
}
}
@Nullable
@Override
public Watermark getCurrentWatermark() {
return new Watermark(currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1);
}
}
}
The reason is that Flink's CEP library currently only checks the timestamps if another element arrives and is processed. The underlying assumption is that you have a steady flow of events.
I think this is a limitation of Flink's CEP library. To work correctly, Flink should register processing time timers with arrivalTime + timeout which trigger the timeout of patterns if no events arrive.
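The difference between the two behaviours described above can be sketched in plain Java with a simulated clock. This is only a model, not Flink's CEP implementation: PatternTimeoutModel, onElement and onClockTick are hypothetical names, and the timeout rule is simplified to "a pending START expires once arrivalTime + timeout has been reached".

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model: element-driven timeout checks versus timer-driven timeout checks.
final class PatternTimeoutModel {
    private final long timeoutMs;
    private final boolean useTimers;              // true = the behaviour the answer proposes
    private long pendingStartMs = -1;             // arrival time of an unmatched START, -1 if none
    private final List<Long> timedOut = new ArrayList<>();

    PatternTimeoutModel(long timeoutMs, boolean useTimers) {
        this.timeoutMs = timeoutMs;
        this.useTimers = useTimers;
    }

    // Every arriving element first triggers a timeout check, then updates the partial match.
    void onElement(long nowMs, boolean isStart) {
        expire(nowMs);
        pendingStartMs = isStart ? nowMs : -1;    // an END completes (and clears) a pending START
    }

    // The clock advancing only matters if timers are registered.
    void onClockTick(long nowMs) {
        if (useTimers) {
            expire(nowMs);
        }
    }

    private void expire(long nowMs) {
        if (pendingStartMs >= 0 && nowMs >= pendingStartMs + timeoutMs) {
            timedOut.add(pendingStartMs);
            pendingStartMs = -1;
        }
    }

    List<Long> timedOut() {
        return timedOut;
    }

    public static void main(String[] args) {
        PatternTimeoutModel elementDriven = new PatternTimeoutModel(5_000L, false);
        elementDriven.onElement(0L, true);
        elementDriven.onClockTick(10_000L);
        System.out.println(elementDriven.timedOut()); // []  (nothing reported yet)
        elementDriven.onElement(12_000L, true);
        System.out.println(elementDriven.timedOut()); // [0] (reported only when the second event arrives)

        PatternTimeoutModel timerDriven = new PatternTimeoutModel(5_000L, true);
        timerDriven.onElement(0L, true);
        timerDriven.onClockTick(5_000L);
        System.out.println(timerDriven.timedOut());   // [0] (reported after 5 seconds without a second event)
    }
}
```

The element-driven variant corresponds to what the question observed; the timer-driven variant corresponds to the arrivalTime + timeout timers suggested in the answer.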

Flink -- get data from Cassandra as generic ResultSet and convert it to DataSet

I have a StreamExecutionEnvironment job that consumes simple CQL select queries from Kafka.
I try to handle these queries asynchronously using the following code:
public class GenericCassandraReader extends RichAsyncFunction<UserDefinedType, ResultSet> {
private static final Logger logger = LoggerFactory.getLogger(GenericCassandraReader.class);
private ExecutorService executorService;
private final Properties props;
private Session client;
public ExecutorService getExecutorService() {
return executorService;
}
public GenericCassandraReader(Properties props, ExecutorService executorService) {
super();
this.props = props;
this.executorService = executorService;
}
@Override
public void open(Configuration parameters) throws Exception {
client = Cluster.builder().addContactPoint(props.getProperty("cqlHost"))
.withPort(Integer.parseInt(props.getProperty("cqlPort"))).build()
.connect(props.getProperty("keyspace"));
}
@Override
public void close() throws Exception {
client.close();
synchronized (GenericCassandraReader.class) {
try {
if (!getExecutorService().awaitTermination(1000, TimeUnit.MILLISECONDS)) {
getExecutorService().shutdownNow();
}
} catch (InterruptedException e) {
getExecutorService().shutdownNow();
}
}
}
@Override
public void asyncInvoke(final UserDefinedType input, final AsyncCollector<ResultSet> asyncCollector) throws Exception {
getExecutorService().submit(new Runnable() {
@Override
public void run() {
ListenableFuture<ResultSet> resultSetFuture = client.executeAsync(input.query);
Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
public void onSuccess(ResultSet resultSet) {
asyncCollector.collect(Collections.singleton(resultSet));
}
public void onFailure(Throwable t) {
asyncCollector.collect(t);
}
});
}
});
}
}
Each response from this code provides a Cassandra ResultSet with a different number of fields.
Any ideas for handling the Cassandra ResultSet in Flink, or should I use another technique to reach my goal?
Thanks for any help in advance!
The Cassandra ResultSet is not thread-safe. It is better to use the Flink Cassandra connector, or at least to write your implementation in a similar way.

Evaluate only the latest window for event time based sliding windows

I would like to process events in EventTime using sliding windows. The window size is 24 hours and the slide is 30 minutes. The problem is that the code below produces 48 calculations for each event. In our case events arrive in order, so we need only the latest window to be evaluated.
Thanks,
Dejan
public static void processEventsa(
DataStream<Tuple2<String, MyEvent>> events) throws Exception {
events.assignTimestampsAndWatermarks(new MyWatermark()).
keyBy(0).
timeWindow(Time.hours(windowSizeHour), Time.seconds(windowSlideSeconds)).
apply(new WindowFunction<Tuple2<String, MyEvent>, Tuple2<String, MyEvent>, Tuple, TimeWindow>() {
@Override
public void apply(Tuple key, TimeWindow window, Iterable<Tuple2<String, MyEvent>> input,
Collector<Tuple2<String, MyEvent>> out) throws Exception {
for (Tuple2<String, MyEvent> record : input) {
}
}
});
}
public class MyWatermark implements
AssignerWithPunctuatedWatermarks<Tuple2<String, MyEvent>> {
@Override
public long extractTimestamp(Tuple2<String, MyEvent> event, long previousElementTimestamp) {
return event.f1.eventTime;
}
@Override
public Watermark checkAndGetNextWatermark(Tuple2<String, MyEvent> event, long previousElementTimestamp) {
return new Watermark(event.f1.eventTime);
}
}
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
The problem was in the watermark assigner. AssignerWithPeriodicWatermarks should be used:
public class MyWatermark implements
AssignerWithPeriodicWatermarks<Tuple2<String, MyEvent>> {
private final long maxTimeLag = 5000;
@Override
public long extractTimestamp(Tuple2<String, MyEvent> event, long previousElementTimestamp) {
try {
return event.f1.eventTime;
}
catch(NullPointerException ex) {}
return System.currentTimeMillis() - maxTimeLag;
}
@Override
public Watermark getCurrentWatermark() {
return new Watermark(System.currentTimeMillis() - maxTimeLag);
}
}
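For background on the figure of 48 mentioned in the question: with a 24-hour window sliding every 30 minutes, each event belongs to size / slide = 48 overlapping windows. The plain-Java sketch below reproduces that arithmetic without using Flink. SlidingWindowAssignmentModel and windowStarts are hypothetical names, and the sketch assumes windows aligned to the epoch with no offset.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of sliding-window assignment (not a Flink API).
final class SlidingWindowAssignmentModel {
    // Returns the start of every window [start, start + size) that contains the timestamp,
    // latest window first.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long sizeMs = 24L * 60 * 60 * 1000;   // 24-hour window
        long slideMs = 30L * 60 * 1000;       // 30-minute slide
        List<Long> starts = windowStarts(360_000_123L, sizeMs, slideMs);
        System.out.println(starts.size());    // 48
        System.out.println(starts.get(0));    // 360000000, the start of the latest window
    }
}
```

The first element of the returned list is the latest window, which is the one the question wants to evaluate.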

Resources