Can I use Flink CEP to sort a stream? - apache-flink

I know I can use Flink SQL to sort a stream by timestamp, but as I'm already using CEP, I'd like to use it for sorting instead.

Sorting with CEP is pretty easy, since CEP always sorts its input by timestamp. Something like this will do the trick:
DataStream<Event> streamWithTimestampsAndWatermarks = ...

Pattern<Event, ?> matchEverything =
    Pattern.<Event>begin("any")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event event) throws Exception {
                return true;
            }
        });

PatternStream<Event> patternStream = CEP.pattern(
    streamWithTimestampsAndWatermarks, matchEverything);

SingleOutputStreamOperator<Event> sorted = patternStream
    .select(new PatternSelectFunction<Event, Event>() {
        @Override
        public Event select(Map<String, List<Event>> map) throws Exception {
            return map.get("any").get(0);
        }
    });
If you want to sort the stream key-by-key, rather than globally, then use keyBy before applying a pattern to it.
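For example, a minimal sketch of the keyed variant (the getKey() accessor on Event is an assumption here, not part of the original code):

// Apply the same match-everything pattern to a keyed stream to sort per key
PatternStream<Event> keyedPatternStream = CEP.pattern(
    streamWithTimestampsAndWatermarks.keyBy(event -> event.getKey()),
    matchEverything);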

Related

Upgrading Flink deprecated function calls

I am currently trying to upgrade a method call assignTimestampsAndWatermarks that is applied to a data stream. The data stream looks something like this:
DataStream<Auction> auctions = env.addSource(new AuctionSourceFunction(auctionSrcRates))
    .name("Custom Source")
    .setParallelism(params.getInt("p-auction-source", 1))
    .assignTimestampsAndWatermarks(new AuctionTimestampAssigner());
The AssignerWithPeriodicWatermarks looks like this:
private static final class AuctionTimestampAssigner implements AssignerWithPeriodicWatermarks<Auction> {
    private long maxTimestamp = Long.MIN_VALUE;

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(maxTimestamp);
    }

    @Override
    public long extractTimestamp(Auction element, long previousElementTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, element.dateTime);
        return element.dateTime;
    }
}
What are the steps I would need to take to upgrade from deprecated calls to the current best practices? Thanks.
Your watermark generator assumes that the events are in order by timestamp, or at least accepts that any out-of-order events will be late. This is equivalent to:
assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<Auction>forMonotonousTimestamps()
        .withTimestampAssigner((event, timestamp) -> event.dateTime))
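If the auction stream can actually contain out-of-order events, the bounded-out-of-orderness strategy from the same API is the usual alternative (the 5-second bound below is only an illustrative assumption; Duration is java.time.Duration):

assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<Auction>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, timestamp) -> event.dateTime))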

Absence of event in Apache Flink CEP

I'm new to Apache Flink CEP and I'm struggling to detect a simple absence of an event.
What I'm trying to detect is whether an event of type CurrencyEvent with a certain id does not occur within a certain amount of time. I would like to detect the absence of such an event every time it fails to arrive within 3000 ms.
My pattern code looks as follows:
Pattern<CurrencyEvent, ?> myPattern = Pattern.<Event>begin("CurrencyEvent")
    .subtype(CurrencyEvent.class)
    .where(new SimpleCondition<CurrencyEvent>() {
        @Override
        public boolean filter(CurrencyEvent currencyEvent) throws Exception {
            return currencyEvent.getId().equalsIgnoreCase("usd");
        }
    })
    .within(Time.milliseconds(3000L));
So now my idea is to use timeout functions in order to detect timeout events:
DataStreamSource<Event> events = env.addSource(new TestSource(
    Arrays.asList(
        basicCurrencyWithMivLevelEvent("EUR", 100L, Arrays.asList("1", "2"), 200D),
        basicCurrencyWithMivLevelEvent("USD", 100L, Arrays.asList("1", "2"), 200D),
        basicCurrencyWithMivLevelEvent("EUR", 100L, Arrays.asList("1", "2"), 200D)
    ),
    1636040364820L, // initial timestamp for the first element
    7000 // 7 seconds between each event
));

PatternStream<Event> patternStream = CEP.pattern(
    events,
    (Pattern<Event, ?>) myPattern
);

OutputTag<Alarm> tag = new OutputTag<Alarm>("currency-timeout"){};

PatternFlatTimeoutFunction<Event, Alarm> eventAlarmTimeoutPatternFunction = (patterns, timestamp, ctx) -> {
    System.out.println("New alarm, since after 3 seconds an event with id=usd is not detected");
    //TODO: call collect
};

PatternFlatSelectFunction<Event, Alarm> eventAlarmPatternSelectFunction = (patterns, ctx) -> {
    System.out.println("Select! (we can ignore it) " + patterns);
    // ignore matched events
};

return patternStream.flatSelect(
    tag,
    eventAlarmTimeoutPatternFunction,
    TypeInformation.of(Alarm.class),
    eventAlarmPatternSelectFunction
);
My test source uses event timestamps and watermarks, as shown below:
public class TestSource implements SourceFunction<Event> {
    private final List<Event> events;
    private final long initialTimestamp;
    private final long timeBetweenInMillis;

    public TestSource(List<Event> events, long initialTimestamp, long timeBetweenInMillis) {
        this.events = events;
        this.initialTimestamp = initialTimestamp;
        this.timeBetweenInMillis = timeBetweenInMillis;
    }

    @Override
    public void run(SourceContext<Event> sourceContext) throws InterruptedException {
        long timestamp = this.initialTimestamp;
        for (Event event : this.events) {
            sourceContext.collectWithTimestamp(event, timestamp);
            sourceContext.emitWatermark(new Watermark(timestamp));
            timestamp += this.timeBetweenInMillis;
        }
    }

    @Override
    public void cancel() {
    }
}
I'm using TimeCharacteristic.EventTime.
Since the window time (3 seconds) is shorter than the event-time gap between consecutive events (7 seconds), I expect to get some timeout events, but I'm getting 0.
A CEP Pattern matches a sequence of one or more events; the within(interval) clause adds an additional constraint that all of the events in the sequence must occur within the specified interval. When partial matches time out, this can be captured in a TimedOutPartialMatchHandler.
In your case, since a successfully matched Pattern consists of a single event, there can be no partial matches, and a match can never time out. (Your matching sequences are always less than 3 seconds long.)
What you can do is extend the pattern definition to include a second event, so that a match requires a start event followed by another event within 3 seconds. When that second event is missing, you will have a partial match that times out.
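A minimal sketch of that extended pattern, reusing the Event and CurrencyEvent types from the question (the permissive second condition is just for illustration):

Pattern<Event, ?> withFollowUp = Pattern.<Event>begin("currency")
    .subtype(CurrencyEvent.class)
    .where(new SimpleCondition<CurrencyEvent>() {
        @Override
        public boolean filter(CurrencyEvent currencyEvent) throws Exception {
            return currencyEvent.getId().equalsIgnoreCase("usd");
        }
    })
    .followedBy("next") // any follow-up event completes the match
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event event) throws Exception {
            return true;
        }
    })
    .within(Time.milliseconds(3000L));

A lone "usd" event then leaves a partial match that times out after 3 seconds and surfaces through your timeout function.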
For more flexibility than CEP offers for use cases involving missing events, you can use a KeyedProcessFunction with timers, as sketched below.
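A minimal sketch of that approach, assuming the stream is keyed by currency id and that Alarm has a String constructor (both assumptions): each event re-arms an event-time timer 3 seconds ahead, and if the timer fires before the next event arrives, an alarm is emitted.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class MissingEventDetector extends KeyedProcessFunction<String, Event, Alarm> {
    // timestamp of the currently registered timer, per key
    private transient ValueState<Long> pendingTimer;

    @Override
    public void open(Configuration parameters) {
        pendingTimer = getRuntimeContext().getState(
            new ValueStateDescriptor<>("pending-timer", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Alarm> out) throws Exception {
        // cancel the previous deadline, if any, and set a new one 3 seconds ahead
        Long previous = pendingTimer.value();
        if (previous != null) {
            ctx.timerService().deleteEventTimeTimer(previous);
        }
        long deadline = ctx.timestamp() + 3000L;
        ctx.timerService().registerEventTimeTimer(deadline);
        pendingTimer.update(deadline);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alarm> out) throws Exception {
        // no event arrived for this key within 3 seconds of the last one
        out.collect(new Alarm("no event for key " + ctx.getCurrentKey() + " within 3000 ms"));
        pendingTimer.clear();
    }
}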

How to create batch or slide windows using Flink CEP?

I'm just starting with Flink CEP and I come from the Esper CEP engine. As you may (or may not) know, in Esper you can use its syntax (EPL) to create a batch or sliding window easily, grouping the events in those windows and allowing you to apply functions to them (avg, max, min, ...).
For example, with the following pattern you can create a batch windows of 5 seconds and calculate the average value of the attribute price of all the Stock events that you have received in that specified window.
select avg(price) from Stock#time_batch(5 sec)
The thing is, I would like to know how to implement this in Flink CEP. I'm aware that the goal or approach in Flink CEP is probably different, so the way to implement this may not be as simple as in Esper.
I have taken a look at the docs regarding time windows, but I'm not able to combine these windows with Flink CEP. So, given the following code:
DataStream<Stock> stream = ...; // Consume events from Kafka

// Filtering events with negative price
Pattern<Stock, ?> pattern = Pattern.<Stock>begin("start")
    .where(
        new SimpleCondition<Stock>() {
            public boolean filter(Stock event) {
                return event.getPrice() >= 0;
            }
        }
    );

PatternStream<Stock> patternStream = CEP.pattern(stream, pattern);

/**
 * CREATE A BATCH WINDOW OF 5 SECONDS IN WHICH
 * I COMPUTE OVER THE AVERAGE PRICES AND, IF IT IS
 * GREATER THAN A THRESHOLD, AN ALERT IS DETECTED
 * return avg(allEventsInWindow.getPrice()) > 1;
 */
DataStream<Alert> result = patternStream.select(
    new PatternSelectFunction<Stock, Alert>() {
        @Override
        public Alert select(Map<String, List<Stock>> pattern) throws Exception {
            return new Alert(pattern.toString());
        }
    }
);
How can I create that window in which, starting from the first event received, I calculate the average over the following events within 5 seconds? For example:
t = 0 seconds
Stock(price = 1); (...starting batch window...)
Stock(price = 1);
Stock(price = 1);
Stock(price = 2);
Stock(price = 2);
Stock(price = 2);
t = 5 seconds (...end of batch window...)
Avg = 1.5 => Alert detected!
The average after 5 seconds would be 1.5, and will trigger the alert. How can I code this?
Thanks!
With Flink's CEP library this behavior is not expressible. I would rather recommend using Flink's DataStream or Table API to calculate the averages. Based on that you could again use CEP to generate other events.
final DataStream<Stock> input = env
    .fromElements(
        new Stock(1L, 1.0),
        new Stock(2L, 2.0),
        new Stock(3L, 1.0),
        new Stock(4L, 2.0))
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Stock>(Time.seconds(0L)) {
        @Override
        public long extractTimestamp(Stock element) {
            return element.getTimestamp();
        }
    });

final DataStream<Double> windowAggregation = input
    .timeWindowAll(Time.milliseconds(2))
    .aggregate(new AggregateFunction<Stock, Tuple2<Integer, Double>, Double>() {
        @Override
        public Tuple2<Integer, Double> createAccumulator() {
            return Tuple2.of(0, 0.0);
        }

        @Override
        public Tuple2<Integer, Double> add(Stock value, Tuple2<Integer, Double> accumulator) {
            return Tuple2.of(accumulator.f0 + 1, accumulator.f1 + value.getValue());
        }

        @Override
        public Double getResult(Tuple2<Integer, Double> accumulator) {
            return accumulator.f1 / accumulator.f0;
        }

        @Override
        public Tuple2<Integer, Double> merge(Tuple2<Integer, Double> a, Tuple2<Integer, Double> b) {
            return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
        }
    });
final DataStream<Double> result = windowAggregation.filter((FilterFunction<Double>) value -> value > THRESHOLD);

Is there a work-around to handle multiple "temporal constraints" in Flink CEP?

As stated in the CEP documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/libs/cep.html), only one temporal constraint is allowed in a pattern sequence, so I'm struggling to find a way to handle a business case that contains two temporal constraints.
I need to monitor some business events and alert on the events that meet the following rules:
a new account is registered
the account gets authenticated within 5 minutes after registration
the account completes at least 2 transactions whose amount is greater than 1000.00 within the next hour.
And the code is something like this:
Pattern<Event, ?> pattern = Pattern.<Event>begin("register").where(new SimpleCondition<Event>() {
    @Override
    public boolean filter(Event value) throws Exception {
        return (value.getEventType() == EventType.REGISTER);
    }
}).followedBy("authentication").where(new SimpleCondition<Event>() {
    @Override
    public boolean filter(Event value) throws Exception {
        return (value.getEventType() == EventType.AUTHENTICATION);
    }
}).where(new IterativeCondition<Event>() {
    @Override
    public boolean filter(Event value, Context<Event> ctx) throws Exception {
        for (Event event : ctx.getEventsForPattern("register")) {
            if (value.getEventTime() - event.getEventTime() <= 1000 * 60 * 5) {
                return true;
            }
        }
        return false;
    }
}).followedBy("transaction").where(new SimpleCondition<Event>() {
    @Override
    public boolean filter(Event value) throws Exception {
        return (value.getEventType() == EventType.TRANSACTION && value.getAmount() > 1000.00);
    }
}).where(new IterativeCondition<Event>() {
    @Override
    public boolean filter(Event value, Context<Event> ctx) throws Exception {
        for (Event event : ctx.getEventsForPattern("authentication")) {
            if (value.getEventTime() - event.getEventTime() <= 1000 * 60 * 60) {
                return true;
            }
        }
        return false;
    }
}).timesOrMore(2);
You can see that I use 2 IterativeConditions to handle the temporal constraints. Is there a better way to make the code more concise?
As you said, right now the CEP library lets you apply only one time constraint to the whole pattern. What you could do, though, is split your pattern into two sub-patterns. First apply a pattern that looks for REGISTER -> AUTHENTICATE and generate a complex event out of those matches (let's call it REGISTER_AUTHENTICATED). Then use it in the subsequent pattern REGISTER_AUTHENTICATED -> 2+ TRANSACTIONS.
That way you can apply a separate time constraint to each of the two patterns.
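A minimal sketch of the first sub-pattern, reusing the Event type from the question (the within(Time.minutes(5)) replaces the first IterativeCondition):

Pattern<Event, ?> registerThenAuth = Pattern.<Event>begin("register")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event value) throws Exception {
            return value.getEventType() == EventType.REGISTER;
        }
    })
    .followedBy("authentication")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event value) throws Exception {
            return value.getEventType() == EventType.AUTHENTICATION;
        }
    })
    .within(Time.minutes(5));

The select function of the first PatternStream would emit the REGISTER_AUTHENTICATED composite event; a second pattern over that stream (together with the transaction events) would then start from REGISTER_AUTHENTICATED, require timesOrMore(2) large transactions, and carry its own within(Time.hours(1)) constraint.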

How to handle exception while parsing JSON in Flink

I am reading data from Kafka using Flink 1.4.2 and parsing it to ObjectNode using JSONDeserializationSchema. If an incoming record is not valid JSON, my Flink job fails. I would like to skip the broken record instead of failing the job.
FlinkKafkaConsumer010<ObjectNode> kafkaConsumer =
    new FlinkKafkaConsumer010<>(TOPIC, new JSONDeserializationSchema(), consumerProperties);

DataStream<ObjectNode> messageStream = env.addSource(kafkaConsumer);
messageStream.print();
I am getting the following exception if the data in Kafka is not a valid JSON.
Job execution switched to status FAILING.
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'This': was expecting ('true', 'false' or 'null')
at [Source: [B@4f522623; line: 1, column: 6]
Job execution switched to status FAILED.
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
The easiest solution is to implement your own DeserializationSchema and wrap JSONDeserializationSchema. You can then catch the exception and either ignore it or perform a custom action.
As suggested by @twalthr, I implemented my own DeserializationSchema by copying JSONDeserializationSchema and adding exception handling.
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.IOException;

public class CustomJSONDeserializationSchema extends AbstractDeserializationSchema<ObjectNode> {
    private ObjectMapper mapper;

    @Override
    public ObjectNode deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        ObjectNode objectNode;
        try {
            objectNode = mapper.readValue(message, ObjectNode.class);
        } catch (Exception e) {
            ObjectMapper errorMapper = new ObjectMapper();
            ObjectNode errorObjectNode = errorMapper.createObjectNode();
            errorObjectNode.put("jsonParseError", new String(message));
            objectNode = errorObjectNode;
        }
        return objectNode;
    }

    @Override
    public boolean isEndOfStream(ObjectNode nextElement) {
        return false;
    }
}
In my streaming job:
messageStream
    .filter((event) -> {
        if (event.has("jsonParseError")) {
            LOG.warn("JsonParseException was handled: " + event.get("jsonParseError").asText());
            return false;
        }
        return true;
    }).print();
Flink has improved null record handling for FlinkKafkaConsumer.
There are two possible design choices when the DeserializationSchema encounters a corrupted message. It can either throw an IOException, which causes the pipeline to be restarted, or it can return null, in which case the Flink Kafka consumer will silently skip the corrupted message.
For more details, you can see this link.
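If you prefer to drop broken records entirely rather than tagging them, a minimal sketch of the null-returning variant (the class name SkippingJSONDeserializationSchema is made up here; same shaded Jackson imports as above):

public class SkippingJSONDeserializationSchema extends AbstractDeserializationSchema<ObjectNode> {
    private transient ObjectMapper mapper;

    @Override
    public ObjectNode deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        try {
            return mapper.readValue(message, ObjectNode.class);
        } catch (Exception e) {
            // returning null tells the Flink Kafka consumer to skip this record
            return null;
        }
    }

    @Override
    public boolean isEndOfStream(ObjectNode nextElement) {
        return false;
    }
}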
