I'm fairly new to Flink and would be grateful for any advice with this issue.
I wrote a job that receives some input events and compares them with some rules before forwarding them on to kafka topics based on whatever rules match. I implemented this using a flatMap and found it worked well, with one downside: I was loading the rules just once, during application startup, by calling an API from my main() method, and passing the result of this API call into the flatMap function. This worked, but it means that if there are any changes to the rules I have to restart the application, so I wanted to improve it.
I found this page in the documentation which seems to be an appropriate solution to the problem. I wrote a custom source to poll my Rules API every few minutes, and then used a BroadcastProcessFunction, with the Rules added to to the broadcast state using processBroadcastElement and the events processed by processElement.
The solution is working, but with one problem. My first approach using a FlatMap would process the events almost instantly. Now that I changed to a BroadcastProcessFunction each event takes 60 seconds to process, and it seems to be more or less exactly 60 seconds every time with almost no variation. I made no changes to the rule matching logic itself.
I've had a look through the documentation and I can't seem to find a reason for this, so I'd appreciate if anyone more experienced in flink could offer a suggestion as to what might cause this delay.
The job:
public static void main(String[] args) throws Exception {
// set up the streaming execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
// read the input from Kafka
DataStream<KafkaEvent> documentStream = env.addSource(
createKafkaSource(getSourceTopic(), getSourceProperties())).name("Kafka[" + getSourceTopic() + "]");
// Configure the Rules data stream
DataStream<RulesEvent> ruleStream = env.addSource(
new RulesApiHttpSource(
getApiRulesSubdomain(),
getApiBearerToken(),
DataType.DataTypeName.LOGS,
getRulesApiCacheDuration()) // Currently set to 120000
);
MapStateDescriptor<String, RulesEvent> ruleStateDescriptor = new MapStateDescriptor<>(
"RulesBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<RulesEvent>() {
}));
// broadcast the rules and create the broadcast state
BroadcastStream<RulesEvent> ruleBroadcastStream = ruleStream
.broadcast(ruleStateDescriptor);
// extract the resources and attributes
documentStream
.connect(ruleBroadcastStream)
.process(new FanOutLogsRuleMapper()).name("FanOut Stream")
.addSink(createKafkaSink(getDestinationProperties()))
.name("FanOut Sink");
// run the job
env.execute(FanOutJob.class.getName());
}
The custom HTTP source which gets the rules
public class RulesApiHttpSource extends RichSourceFunction<RulesEvent> {
private static final Logger LOGGER = LoggerFactory.getLogger(RulesApiHttpSource.class);
private final long pollIntervalMillis;
private final String endpoint;
private final String bearerToken;
private final DataType.DataTypeName dataType;
private final RulesApiCaller caller;
private volatile boolean running = true;
public RulesApiHttpSource(String endpoint, String bearerToken, DataType.DataTypeName dataType, long pollIntervalMillis) {
this.pollIntervalMillis = pollIntervalMillis;
this.endpoint = endpoint;
this.bearerToken = bearerToken;
this.dataType = dataType;
this.caller = new RulesApiCaller(this.endpoint, this.bearerToken);
}
#Override
public void open(Configuration configuration) throws Exception {
// do nothing
}
#Override
public void close() throws IOException {
// do nothing
}
#Override
public void run(SourceContext<RulesEvent> ctx) throws IOException {
while (running) {
if (pollIntervalMillis > 0) {
try {
RulesEvent event = new RulesEvent();
event.setRules(getCurrentRulesList());
event.setDataType(this.dataType);
event.setRetrievedAt(Instant.now());
ctx.collect(event);
Thread.sleep(pollIntervalMillis);
} catch (InterruptedException e) {
running = false;
}
} else if (pollIntervalMillis <= 0) {
cancel();
}
}
}
public List<Rule> getCurrentRulesList() throws IOException {
// call API and get rulles
}
#Override
public void cancel() {
running = false;
}
}
The BroadcastProcessFunction
public abstract class FanOutRuleMapper extends BroadcastProcessFunction<KafkaEvent, RulesEvent, KafkaEvent> {
protected final String RULES_EVENT_NAME = "rulesEvent";
protected final MapStateDescriptor<String, RulesEvent> ruleStateDescriptor = new MapStateDescriptor<>(
"RulesBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<RulesEvent>() {
}));
#Override
public void processBroadcastElement(RulesEvent rulesEvent, BroadcastProcessFunction<KafkaEvent, RulesEvent, KafkaEvent>.Context ctx, Collector<KafkaEvent> out) throws Exception {
ctx.getBroadcastState(ruleStateDescriptor).put(RULES_EVENT_NAME, rulesEvent);
LOGGER.debug("Added to broadcast state {}", rulesEvent.toString());
}
// omitted rules matching logic
}
public class FanOutLogsRuleMapper extends FanOutRuleMapper {
public FanOutLogsJobRuleMapper() {
super();
}
#Override
public void processElement(KafkaEvent in, BroadcastProcessFunction<KafkaEvent, RulesEvent, KafkaEvent>.ReadOnlyContext ctx, Collector<KafkaEvent> out) throws Exception {
RulesEvent rulesEvent = ctx.getBroadcastState(ruleStateDescriptor).get(RULES_EVENT_NAME);
ExportLogsServiceRequest otlpLog = extractOtlpMessageFromJsonPayload(in);
for (Rule rule : rulesEvent.getRules()) {
boolean match = false;
// omitted rules matching logic
if (match) {
for (RuleDestination ruleDestination : rule.getRulesDestinations()) {
out.collect(fillInTheEvent(in, rule, ruleDestination, otlpLog));
}
}
}
}
}
Maybe you can give the complete code of the FanOutLogsRuleMapper class, currently the match variable is always false
Related
It runs with processing time and using a broadcast state.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
BroadcastStream<List<TableOperations>> broadcastOperationsState = env
.addSource(new LoadCassandraOperations(10000L, cassandraHost, cassandraPort)).broadcast(descriptor);
SingleOutputStreamOperator<InternalVariableValue> stream =
env.addSource(new SourceMillisInternalVariableValue(5000L));
SingleOutputStreamOperator<InternalVariableOperation> streamProcessed =
stream
.keyBy(InternalVariableValue::getUuid)
.connect(broadcastOperationsState)
.process(new AddOperationInfo())
;
streamProcessed.print();
SourceMillisIntervalVariableValues create a event every 5s . The events are stored in a static collection. The run method looks like :
public class SourceMillisInternalVariableValue extends RichSourceFunction<InternalVariableValue>{
private boolean running;
long millis;
public SourceMillisInternalVariableValue(long millis) {
super();
this.millis = millis;
}
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
running = true;
}
#Override
public void cancel() {
running = false;
}
#Override
public void run(SourceContext<InternalVariableValue> ctx) throws Exception {
//Espera inicial
Thread.sleep(1500);
PojoVariableValues[] pojoData =
new PojoVariableValues[]{
new PojoVariableValues("id1", "1"),
new PojoVariableValues("id2", "2"),
....
....
new PojoVariableValues("id21", "21")
};
int cont = 0;
while (cont<pojoData.length) {
System.out.println("Iteration "+cont+" "+pojoData.length);
ctx.collect(generateVar(pojoData[0+cont].getUUID(), pojoData[0+cont].getValue()));
ctx.collect(generateVar(pojoData[1+cont].getUUID(), pojoData[1+cont].getValue()));
ctx.collect(generateVar(pojoData[2+cont].getUUID(), pojoData[2+cont].getValue()));
cont = cont +3;
Thread.sleep(millis);
}
}
private InternalVariableValue generateVar(String uuid, String value)
{
return InternalVariableValueMessage.InternalVariableValue.newBuilder()
.setUuid(uuid)
.setTimestamp(new Date().getTime()).setValue(value).setKeyspace("nest").build();
}
class PojoVariableValues {
private String UUID;
private String Value;
public PojoVariableValues(String uUID, String value) {
super();
UUID = uUID;
Value = value;
}
public String getUUID() {
return UUID;
}
public void setUUID(String uUID) {
UUID = uUID;
}
public String getValue() {
return Value;
}
public void setValue(String value) {
Value = value;
}
}
}
LoadCassandraOperations emits events every 10 seconds. It works fine.
When I run this code, SourceMillisIntervalVariableValues stops in the first iteration, emiting only three events. If I comment the process function, both sources works properly, but if I run the process , the source is cancel...
I spect than the source emits all events ( 21 exactly ) , and all of them are processing in the aggregate function. If I run this code, the while loop in the sources only complete one iteration.
Any idea ?
Thank youuu . cheers
EDIT:
Important. This code is for explore the processint time and broadcast feature. I know that I'm not using the best practices in the sources. Thanks
EDIT 2:
The problem starts when I try to run the process function.
Solved !!
The problem was that I try to run it using a TestContainer and I can't watch any logs.
I ran it with a simple main method and I can see some code errors ( like the commented in the comments. Tnks !!! ).
When I need to work with I/O (Query DB, Call to the third API,...), I can use RichAsyncFunction. But I need to interact with Google Sheet via GG Sheet API: https://developers.google.com/sheets/api/quickstart/java. This API is sync. I wrote below code snippet:
public class SendGGSheetFunction extends RichAsyncFunction<Obj, String> {
#Override
public void asyncInvoke(Obj message, final ResultFuture<String> resultFuture) {
CompletableFuture.supplyAsync(() -> {
syncSendToGGSheet(message);
return "";
}).thenAccept((String result) -> {
resultFuture.complete(Collections.singleton(result));
});
}
}
But I found that message send to GGSheet very slow, It seems to send by synchronous.
Most of the code executed by users in AsyncIO is sync originally. You just need to ensure, it's actually executed in a separate thread. Most commonly a (statically shared) ExecutorService is used.
private class SendGGSheetFunction extends RichAsyncFunction<Obj, String> {
private transient ExecutorService executorService;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
executorService = Executors.newFixedThreadPool(30);
}
#Override
public void close() throws Exception {
super.close();
executorService.shutdownNow();
}
#Override
public void asyncInvoke(final Obj message, final ResultFuture<String> resultFuture) {
executorService.submit(() -> {
try {
resultFuture.complete(syncSendToGGSheet(message));
} catch (SQLException e) {
resultFuture.completeExceptionally(e);
}
});
}
}
Here are some considerations on how to tune AsyncIO to increase throughput: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Async-IO-operator-tuning-micro-benchmarks-td35858.html
I am newbie to Flink i am trying a POC in which if no event is received in x amount of time greater than time specified in within time period in CEP
public class MyCEPApplication {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment streamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecutionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "consumer_group");
FlinkKafkaConsumer<String> inputSignal = new FlinkKafkaConsumer<>("consumer_topic",
new SimpleStringSchema(),
properties);
DataStream<String> inputSignalKafka = streamExecutionEnvironment.addSource(inputSignal);
DataStream<MyEvent> eventDataStream = inputSignalKafka.map(new MapFunction<String, MyEvent>() {
#Override
public MyEvent map(String value) throws Exception {
ItmAtomicEvent MyEvent = new ItmAtomicEvent();
MyEvent.setJsonObject(new JSONObject(value));
MyEvent.setAttribute("time",System.currentTimeMillis());
return MyEvent;
}
}).assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks<MyEvent>() {
#Override
public long extractTimestamp(MyEvent event, long currentTimestamp) {
System.out.println("TIMESTAMP: " +(long)event.getAttribute("time").get());
System.out.println("Time Difference: " +((long)(event.getAttribute("time").get()) - timeDifference));
timeDifference = (long)(event.getAttribute("time").get());
return (long)event.getAttribute("time").get();
}
#Override
public Watermark checkAndGetNextWatermark(MyEvent lastElement, long extractedTimestamp) {
return new Watermark(extractedTimestamp);
}
});
eventDataStream.print("Source=======================================>");
Pattern<MyEvent, ?> pattern =
Pattern.<MyEvent>begin("start")
.where(new SimpleCondition<MyEvent>() {
#Override
public boolean filter(MyEvent event) {
return event.equals("Event1");
}
})
.next("end")
.where(new SimpleCondition<MyEvent>() {
#Override
public boolean filter(MyEvent event) {
return event.equals("Event2");
}
}).within(Time.seconds(10));
PatternStream<MyEvent> patternStream = cepPatternMatching.compile(pattern, eventDataStream);
OutputTag<MyEvent> timedout = new OutputTag<MyEvent>("timedout") {};
SingleOutputStreamOperator<MyEvent> result = patternStream.flatSelect(
timedout,
new PatternFlatTimeoutFunction<MyEvent,MyEvent>() {
#Override
public void timeout(Map<String, List<MyEvent>> pattern, long timeoutTimestamp, Collector<MyEvent> out) throws Exception {
if(null != pattern.get("CustomerId")){
for (MyEvent timedOutEvent :pattern.get("CustomerId")){
System.out.println("TimedOut Event : "+timedOutEvent.getField(0));
out.collect(timedOutEvent);
}
}
}
},
new PatternFlatSelectFunction<MyEvent, MyEvent>() {
#Override
public void flatSelect(Map<String, List<MyEvent>> pattern,
Collector<MyEvent> out) throws Exception {
out.collect(pattern.get("CustomerId").get(0));
}
}
);
result.print("=============CEP Pattern Detected==================");
DataStream<MyEvent> timedOut = result.getSideOutput(timedout);
timedOut.print("=============CEP TimedOut Pattern Detected==================");
streamExecutionEnvironment.execute("CEPtest");
}
}
Even if no event received after 10 seconds its printing timedout event i tried even commenting out code PatternFlatSelectFunction method, is there a way or work around to make timeout of pattern if no event received in a given x seconds. Someone asked same question i referred below solution but nothing worked for me, please help me in solving the issue
1)Apache Flink CEP how to detect if event did not occur within x seconds?
2)https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/libs/cep.html#handling-timed-out-partial-patterns
3)https://github.com/ververica/flink-training-exercises/blob/master/src/main/java/com/ververica/flinktraining/solutions/datastream_java/cep/LongRidesCEPSolution.java
Your application is using event time, so you will need to arrange for a sufficiently large Watermark to be generated despite the lack of incoming events. You could use this example if you want to artificially advance the current watermark when the source is idle.
Given that your events don't have event-time timestamps, why don't you simply use processing time instead, and thereby avoid this problem? (Note, however, the limitation mentioned in https://stackoverflow.com/a/50357721/2000823).
Can someone please help me understand when and how is the window (session) in flink happens? Or how the samples are processed?
For instance, if I have a continuous stream of events flowing in, events being request coming in an application and response provided by the application.
As part of the flink processing we need to understand how much time is taken for serving a request.
I understand that there are time tumbling windows which gets triggered every n seconds which is configured and as soon as the time lapses then all the events in that time window will be aggregated.
So for example:
Let's assume that the time window defined is 30 seconds and if an event arrives at t time and another arrives at t+30 then both will be processed, but an event arrivng at t+31 will be ignored.
Please correct if I am not right in saying the above statement.
Question on the above is: if say an event arrives at t time and another event arrives at t+3 time, will it still wait for entire 30 seconds to aggregate and finalize the results?
Now in case of session window, how does this work? If the event are being processed individually and the broker time stamp is used as session_id for the individual event at the time of deserialization, then the session window will that be created for each event? If yes then do we need to treat request and response events differently because if we don't then doesn't the response event will get its own session window?
I will try posting my example (in java) that I am playing with in short time but any inputs on the above points will be helpful!
process function
DTO's:
public class IncomingEvent{
private String id;
private String eventId;
private Date timestamp;
private String component;
//getters and setters
}
public class FinalOutPutEvent{
private String id;
private long timeTaken;
//getters and setters
}
===============================================
Deserialization of incoming events:
public class IncomingEventDeserializationScheme implements KafkaDeserializationSchema {
private ObjectMapper mapper;
public IncomingEventDeserializationScheme(ObjectMapper mapper) {
this.mapper = mapper;
}
#Override
public TypeInformation<IncomingEvent> getProducedType() {
return TypeInformation.of(IncomingEvent.class);
}
#Override
public boolean isEndOfStream(IncomingEvent nextElement) {
return false;
}
#Override
public IncomingEvent deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
if (record.value() == null) {
return null;
}
try {
IncomingEvent event = mapper.readValue(record.value(), IncomingEvent.class);
if(event != null) {
new SessionWindow(record.timestamp());
event.setOffset(record.offset());
event.setTopic(record.topic());
event.setPartition(record.partition());
event.setBrokerTimestamp(record.timestamp());
}
return event;
} catch (Exception e) {
return null;
}
}
}
===============================================
main logic
public class MyEventJob {
private static final ObjectMapper mapper = new ObjectMapper();
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
MyEventJob eventJob = new MyEventJob();
InputStream inStream = eventJob.getFileFromResources("myConfig.properties");
ParameterTool parameter = ParameterTool.fromPropertiesFile(inStream);
Properties properties = parameter.getProperties();
Integer timePeriodBetweenEvents = 120;
String outWardTopicHostedOnServer = localhost:9092";
DataStreamSource<IncomingEvent> stream = env.addSource(new FlinkKafkaConsumer<>("my-input-topic", new IncomingEventDeserializationScheme(mapper), properties));
SingleOutputStreamOperator<IncomingEvent> filteredStream = stream
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<IncomingEvent>() {
long eventTime;
#Override
public long extractTimestamp(IncomingEvent element, long previousElementTimestamp) {
return element.getTimestamp();
}
#Override
public Watermark getCurrentWatermark() {
return new Watermark(eventTime);
}
})
.map(e -> { e.setId(e.getEventId()); return e; });
SingleOutputStreamOperator<FinalOutPutEvent> correlatedStream = filteredStream
.keyBy(new KeySelector<IncomingEvent, String> (){
#Override
public String getKey(#Nonnull IncomingEvent input) throws Exception {
return input.getId();
}
})
.window(GlobalWindows.create()).allowedLateness(Time.seconds(defaultSliceTimePeriod))
.trigger( new Trigger<IncomingEvent, Window> (){
private final long sessionTimeOut;
public SessionTrigger(long sessionTimeOut) {
this.sessionTimeOut = sessionTimeOut;
}
#Override
public TriggerResult onElement(IncomingEvent element, long timestamp, Window window, TriggerContext ctx)
throws Exception {
ctx.registerProcessingTimeTimer(timestamp + sessionTimeOut);
return TriggerResult.CONTINUE;
}
#Override
public TriggerResult onProcessingTime(long time, Window window, TriggerContext ctx) throws Exception {
return TriggerResult.FIRE_AND_PURGE;
}
#Override
public TriggerResult onEventTime(long time, Window window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
#Override
public void clear(Window window, TriggerContext ctx) throws Exception {
//check the clear method implementation
}
})
.process(new ProcessWindowFunction<IncomingEvent, FinalOutPutEvent, String, SessionWindow>() {
#Override
public void process(String arg0,
ProcessWindowFunction<IncomingEvent, FinalOutPutEvent, String, SessionWindow>.Context arg1,
Iterable<IncomingEvent> input, Collector<FinalOutPutEvent> out) throws Exception {
List<IncomingEvent> eventsIn = new ArrayList<>();
input.forEach(eventsIn::add);
if(eventsIn.size() == 1) {
//Logic to handle incomplete request/response events
} else if (eventsIn.size() == 2) {
//Logic to handle the complete request/response and how much time it took
}
}
} );
FlinkKafkaProducer<FinalOutPutEvent> kafkaProducer = new FlinkKafkaProducer<>(
outWardTopicHostedOnServer, // broker list
"target-topic", // target topic
new EventSerializationScheme(mapper));
correlatedStream.addSink(kafkaProducer);
env.execute("Streaming");
}
}
Thanks
Vicky
From your description, I think you want to write a custom ProcessFunction, which is keyed by the session_id. You'll have a ValueState, where you store the timestamp for the request event. When you get the corresponding response event, you calculate the delta and emit that (with the session_id) and clear out state.
It's likely you'd also want to set a timer when you get the request event, so that if you don't get a response event in safe/long amount of time, you can emit a side output of failed requests.
So, with the default trigger, each window is finalized after it's time fully passes. Depending on whether You are using EventTime or ProcessingTime this may mean different things, but in general, Flink will always wait for the Window to be closed before it is fully processed. The event at t+31 in Your case would simply go to the other window.
As for the session windows, they are windows too, meaning that in the end they simply aggregate samples that have a difference between timestamps lower than the defined gap. Internally, this is more complicated than the normal windows, since they don't have defined starts and ends. The Session Window operator gets sample and creates a new Window for each individual sample. Then, the operator verifies, if the newly created window can be merged with already existing ones (i.e. if their timestamps are closer than the gap) and merges them. This finally results with window that has all elements with timestamps closer to each other than the defined gap.
You are making this more complicated than it needs to be. The example below will need some adjustment, but will hopefully convey the idea of how to use a KeyedProcessFunction rather than session windows.
Also, the constructor for BoundedOutOfOrdernessTimestampExtractor expects to be passed a Time maxOutOfOrderness. Not sure why you are overriding its getCurrentWatermark method with an implementation that ignores the maxOutOfOrderness.
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Event> events = ...
events
.assignTimestampsAndWatermarks(new TimestampsAndWatermarks(OUT_OF_ORDERNESS))
.keyBy(e -> e.sessionId)
.process(new RequestReponse())
...
}
public static class RequestReponse extends KeyedProcessFunction<KEY, Event, Long> {
private ValueState<Long> requestTimeState;
#Override
public void open(Configuration config) {
ValueStateDescriptor<Event> descriptor = new ValueStateDescriptor<>(
"request time", Long.class);
requestState = getRuntimeContext().getState(descriptor);
}
#Override
public void processElement(Event event, Context context, Collector<Long> out) throws Exception {
TimerService timerService = context.timerService();
Long requestedAt = requestTimeState.value();
if (requestedAt == null) {
// haven't seen the request before; save its timestamp
requestTimeState.update(event.timestamp);
timerService.registerEventTimeTimer(event.timestamp + TIMEOUT);
} else {
// this event is the response
// emit the time elapsed between request and response
out.collect(event.timestamp - requestedAt);
}
}
#Override
public void onTimer(long timestamp, OnTimerContext context, Collector<Long> out) throws Exception {
//handle incomplete request/response events
}
}
public static class TimestampsAndWatermarks extends BoundedOutOfOrdernessTimestampExtractor<Event> {
public TimestampsAndWatermarks(Time t) {
super(t);
}
#Override
public long extractTimestamp(Event event) {
return event.eventTime;
}
}
I sent one event with isStart true to kafka ,and made Flink consumed the event from the kafka, also set the TimeCharacteristic to ProcessingTime and set within(Time.seconds(5)), so I expected that CEP would print the event after 5 seconds I sent the first event, however it didn't, and it printed the first event only after I sent the second event to kafka. Why it printed the first event only I after sent two events? Didn't it should be print the event just after 5 seconds I sent the first one when using ProcessingTime ?
The following is the code:
public class LongRidesWithKafka {
private static final String LOCAL_ZOOKEEPER_HOST = "localhost:2181";
private static final String LOCAL_KAFKA_BROKER = "localhost:9092";
private static final String RIDE_SPEED_GROUP = "rideSpeedGroup";
private static final int MAX_EVENT_DELAY = 60; // rides are at most 60 sec out-of-order.
public static void main(String[] args) throws Exception {
final int popThreshold = 1; // threshold for popular places
// set up streaming execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
Properties kafkaProps = new Properties();
//kafkaProps.setProperty("zookeeper.connect", LOCAL_ZOOKEEPER_HOST);
kafkaProps.setProperty("bootstrap.servers", LOCAL_KAFKA_BROKER);
kafkaProps.setProperty("group.id", RIDE_SPEED_GROUP);
// always read the Kafka topic from the start
kafkaProps.setProperty("auto.offset.reset", "earliest");
// create a Kafka consumer
FlinkKafkaConsumer011<TaxiRide> consumer = new FlinkKafkaConsumer011<>(
"flinktest",
new TaxiRideSchema(),
kafkaProps);
// assign a timestamp extractor to the consumer
//consumer.assignTimestampsAndWatermarks(new CustomWatermarkExtractor());
DataStream<TaxiRide> rides = env.addSource(consumer);
DataStream<TaxiRide> keyedRides = rides.keyBy("rideId");
// A complete taxi ride has a START event followed by an END event
Pattern<TaxiRide, TaxiRide> completedRides =
Pattern.<TaxiRide>begin("start")
.where(new SimpleCondition<TaxiRide>() {
#Override
public boolean filter(TaxiRide ride) throws Exception {
return ride.isStart;
}
})
.next("end")
.where(new SimpleCondition<TaxiRide>() {
#Override
public boolean filter(TaxiRide ride) throws Exception {
return !ride.isStart;
}
});
// We want to find rides that have NOT been completed within 120 minutes
PatternStream<TaxiRide> patternStream = CEP.pattern(keyedRides, completedRides.within(Time.seconds(5)));
OutputTag<TaxiRide> timedout = new OutputTag<TaxiRide>("timedout") {
};
SingleOutputStreamOperator<TaxiRide> longRides = patternStream.flatSelect(
timedout,
new LongRides.TaxiRideTimedOut<TaxiRide>(),
new LongRides.FlatSelectNothing<TaxiRide>()
);
longRides.getSideOutput(timedout).print();
env.execute("Long Taxi Rides");
}
public static class TaxiRideTimedOut<TaxiRide> implements PatternFlatTimeoutFunction<TaxiRide, TaxiRide> {
#Override
public void timeout(Map<String, List<TaxiRide>> map, long l, Collector<TaxiRide> collector) throws Exception {
TaxiRide rideStarted = map.get("start").get(0);
collector.collect(rideStarted);
}
}
public static class FlatSelectNothing<T> implements PatternFlatSelectFunction<T, T> {
#Override
public void flatSelect(Map<String, List<T>> pattern, Collector<T> collector) {
}
}
private static class TaxiRideTSExtractor extends AscendingTimestampExtractor<TaxiRide> {
private static final long serialVersionUID = 1L;
#Override
public long extractAscendingTimestamp(TaxiRide ride) {
// Watermark Watermark = getCurrentWatermark();
if (ride.isStart) {
return ride.startTime.getMillis();
} else {
return ride.endTime.getMillis();
}
}
}
private static class CustomWatermarkExtractor implements AssignerWithPeriodicWatermarks<TaxiRide> {
private static final long serialVersionUID = -742759155861320823L;
private long currentTimestamp = Long.MIN_VALUE;
#Override
public long extractTimestamp(TaxiRide ride, long previousElementTimestamp) {
// the inputs are assumed to be of format (message,timestamp)
if (ride.isStart) {
this.currentTimestamp = ride.startTime.getMillis();
return ride.startTime.getMillis();
} else {
this.currentTimestamp = ride.endTime.getMillis();
return ride.endTime.getMillis();
}
}
#Nullable
#Override
public Watermark getCurrentWatermark() {
return new Watermark(currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1);
}
}
}
The reason is that Flink's CEP library currently only checks the timestamps if another element arrives and is processed. The underlying assumption is that you have a steady flow of events.
I think this is a limitation of Flink's CEP library. To work correctly, Flink should register processing time timers with arrivalTime + timeout which trigger the timeout of patterns if no events arrive.