Using Apache Flink I want to create a streaming window sorted by the timestamp that is stored in the Kafka event. According to the following article this is not implemented.
https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams
However, the article is dated July 2015, and it is now almost a year later. Has this functionality been implemented since, and can somebody point me to any relevant documentation and/or an example?
Apache Flink supports stream windows based on event timestamps.
In Flink, this concept is called event-time.
In order to support event-time, you have to extract a timestamp (long value) from each event. In addition, you need to generate so-called watermarks, which are needed to deal with events that arrive with out-of-order timestamps.
Given a stream with extracted timestamps you can define a windowed sum as follows:
val stream: DataStream[(String, Int)] = ...
val windowCnt = stream
.keyBy(0) // partition stream on first field (String)
.timeWindow(Time.minutes(1)) // window in extracted timestamp by 1 minute
.sum(1) // sum the second field (Int)
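To make the windowing arithmetic concrete, here is a minimal plain-Java sketch with no Flink dependency. It maps each event timestamp to its one-minute tumbling window and sums values per key and window. The class and method names are hypothetical, and the sketch ignores watermarks and late events; it only illustrates how an extracted timestamp decides which window an event belongs to, regardless of arrival order.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Flink: illustrates event-time tumbling windows.
class TumblingWindowSketch {
    static final long WINDOW_MS = 60_000L; // one-minute windows

    // Start of the tumbling window that contains the given event timestamp.
    static long windowStart(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, WINDOW_MS);
    }

    // Sums values per (key, window start); events may arrive in any order.
    static Map<String, Integer> sumPerWindow(String[] keys, long[] timestamps, int[] values) {
        Map<String, Integer> sums = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            String slot = keys[i] + "@" + windowStart(timestamps[i]);
            sums.merge(slot, values[i], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "a", "b", "a"};
        long[] ts = {5_000L, 61_000L, 10_000L, 59_999L}; // out of order on purpose
        int[] vals = {1, 2, 3, 4};
        System.out.println(sumPerWindow(keys, ts, vals));
    }
}
```

The last event for key "a" arrives after an event from the next minute, yet it is still counted in the first window because only its timestamp matters.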
Event-time and windows are explained in detail in the documentation (here and here) and in several blog posts (here, here, here, and here).
Sorting by timestamps is still not supported out of the box, but you can do windowing based on the timestamps in the elements. We call this event-time windowing. Please have a look here: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/windows.html.
I have a use case where a large number of logs will be consumed by Apache Flink CEP. I want to detect brute-force attacks and port-scanning attacks. The challenge is that in ordinary CEP we compare a value against a constant, such as "event" = login. Here the criteria are different. For a brute-force attack they are as follows.
The username is constant and event = "login failure" (the condition is that the event happens 5 times within 5 minutes).
That is, logs with the login failure event are received for the same username 5 times within 5 minutes.
For port scanning we have the following criteria.
The IP address is constant and the destination port varies (the condition is that the event happens 10 times within 1 minute). That is, logs with the same IP address are received for 10 different ports within 1 minute.
With Flink, when you want to process the events for something like one username or one IP address in isolation, the way to do this is to partition the stream by a key, using keyBy(). The training materials in the Flink docs have a section on Keyed Streams that explains this part of the DataStream API in more detail. keyBy() is roughly the same concept as a GROUP BY in SQL, if that helps.
With CEP, if you first key the stream, then the pattern will be matched separately for each distinct value of the key, which is what you want.
However, rather than CEP, I would instead recommend Flink SQL, perhaps in combination with MATCH_RECOGNIZE, for this use case. MATCH_RECOGNIZE is a higher-level API, built on top of CEP, and it's easier to work with. In combination with SQL, the result is quite powerful.
You'll find some Flink SQL training materials and examples (including examples that use MATCH_RECOGNIZE) in Ververica's GitHub account.
Update
To be clear, I wouldn't use MATCH_RECOGNIZE for these specific rules; neither it nor CEP is needed for this use case. I mentioned it in case you have other rules where it would be helpful. (My reason for not recommending CEP in this case is that implementing the distinct constraint might be messy.)
For example, for the port scanning case you can do something like this:
SELECT e1.ip, COUNT(DISTINCT e2.port)
FROM events e1, events e2
WHERE e1.ip = e2.ip AND timestampDiff(MINUTE, e1.ts, e2.ts) < 1
GROUP BY e1.ip HAVING COUNT(DISTINCT e2.port) >= 10;
The login case is similar, but easier.
Note that when working with streaming SQL, you should give some thought to state retention.
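To illustrate the rule the query is meant to express, and why state retention matters, here is a minimal plain-Java sketch. It is not Flink code: the class name, the per-IP buffer, and the assumption that events arrive in timestamp order are all hypothetical simplifications. It keeps only the last minute of events per IP and reports when 10 distinct ports have been seen in that minute.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical plain-Java illustration of the port-scan rule; not Flink code.
class PortScanSketch {
    static final long WINDOW_MS = 60_000L;
    static final int MIN_DISTINCT_PORTS = 10;

    // Per-IP buffer of recent {timestamp, port} pairs, oldest first.
    private final Map<String, Deque<long[]>> recent = new HashMap<>();

    // Feeds one event (assumed to arrive in timestamp order) and reports
    // whether this IP has now hit >= 10 distinct ports within one minute.
    boolean onEvent(String ip, int port, long timestampMs) {
        Deque<long[]> buffer = recent.computeIfAbsent(ip, k -> new ArrayDeque<>());
        buffer.addLast(new long[] {timestampMs, port});
        // Evict events that fell out of the one-minute window.
        while (timestampMs - buffer.peekFirst()[0] >= WINDOW_MS) {
            buffer.removeFirst();
        }
        Set<Long> ports = new HashSet<>();
        for (long[] e : buffer) {
            ports.add(e[1]);
        }
        return ports.size() >= MIN_DISTINCT_PORTS;
    }
}
```

The eviction loop is what keeps the buffer bounded. A streaming SQL job needs an equivalent retention setting, otherwise the state for the self-join grows without limit.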
Further update
This query is likely to return a given IP address many times, but it's not desirable to generate multiple alerts.
This could be handled by inserting matching IP addresses into an Alert table, and only generating alerts for IPs that aren't already there.
Or the output of the SQL query could be processed by a de-duplicator implemented using the DataStream API, similar to the example in the Flink docs. If you only want to suppress duplicate alerts for some period of time, use a KeyedProcessFunction instead of a RichFlatMapFunction, and use a Timer to clear the state when it's time to re-enable alerts for a given IP.
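The time-limited suppression idea can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the map stands in for Flink keyed state, and the expiry check on each call stands in for the timer that would clear the state in a KeyedProcessFunction.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plain-Java illustration of alert suppression; in Flink the map
// would be keyed state and the expiry would be driven by a timer.
class AlertSuppressorSketch {
    private final long suppressMs;
    private final Map<String, Long> lastAlertAt = new HashMap<>();

    AlertSuppressorSketch(long suppressMs) {
        this.suppressMs = suppressMs;
    }

    // Returns true if an alert for this IP should be emitted at time nowMs.
    boolean shouldAlert(String ip, long nowMs) {
        Long last = lastAlertAt.get(ip);
        if (last != null && nowMs - last < suppressMs) {
            return false; // still inside the suppression period
        }
        lastAlertAt.put(ip, nowMs);
        return true;
    }
}
```

Unlike a real timer-based implementation, this sketch never removes entries for IPs that stop appearing, so it should not be read as a pattern for long-running state.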
Yet another update (concerning CEP and distinctness)
Implementing this with CEP should be possible. You'll want to key the stream by the IP address, and have a pattern that has to match within one minute.
The pattern can be roughly like this:
Pattern<Event, ?> pattern = Pattern
.<Event>begin("distinctPorts")
.where(iterative condition 1)
.oneOrMore()
.followedBy("end")
.where(iterative condition 2)
.within(1 minute)
The first iterative condition returns true if the event being added to the pattern has a port distinct from those of all previously matched events. This is somewhat similar to the example here, in the docs.
The second iterative condition returns true if size("distinctPorts") >= 9 and this event also has yet another distinct port.
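The logic of the two conditions can be written down as plain-Java predicates. This is a sketch only: the class and method names are hypothetical, and the list of already matched ports is an assumption standing in for what a CEP iterative condition would read from its context for the "distinctPorts" pattern.

```java
import java.util.List;

// Hypothetical plain-Java predicates mirroring the two iterative conditions;
// in Flink CEP this logic would live inside an iterative condition's filter method.
class DistinctPortConditions {
    // Condition 1: the candidate port differs from every port matched so far.
    static boolean isNewPort(int candidatePort, List<Integer> matchedPorts) {
        return !matchedPorts.contains(candidatePort);
    }

    // Condition 2: at least 9 distinct ports already matched, and this is one more.
    static boolean completesScan(int candidatePort, List<Integer> matchedPorts) {
        return matchedPorts.size() >= 9 && isNewPort(candidatePort, matchedPorts);
    }
}
```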
See this Flink Forward talk (youtube video) for a somewhat similar example at the end of the talk.
If you try this and get stuck, please ask a new question, showing us what you've tried and where you're stuck.
I am reading the book Stream Processing with Apache Flink and it is stated that “As of version 0.10.0, Kafka supports message timestamps. When reading from Kafka version 0.10 or later, the consumer will automatically extract the message timestamp as an event-time timestamp if the application runs in event-time mode”
So inside a processElement function, will the call context.timestamp() return the Kafka message timestamp by default?
Could you please provide a simple example of how to implement AssignerWithPeriodicWatermarks/AssignerWithPunctuatedWatermarks so that it extracts timestamps (and builds watermarks) based on the consumed Kafka message timestamp?
If I am using TimeCharacteristic.ProcessingTime, would ctx.timestamp() return the processing time, and in that case would it be similar to context.timerService().currentProcessingTime()?
Thank you.
The Flink Kafka consumer takes care of this for you, and puts the timestamp where it needs to be. In Flink 1.11 you can simply rely on this, though you still need to take care of providing a WatermarkStrategy that specifies the out-of-orderness (or asserts that the timestamps are in order):
FlinkKafkaConsumer<String> myConsumer = new FlinkKafkaConsumer<>(...);
myConsumer.assignTimestampsAndWatermarks(
WatermarkStrategy
.<String>forBoundedOutOfOrderness(Duration.ofSeconds(20)));
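For intuition about what a bounded-out-of-orderness strategy does with the Kafka timestamps, here is a minimal plain-Java sketch. It mirrors the idea only and is not Flink's implementation; the class name, the method names, and the notion of "late" used here are assumptions made for illustration.

```java
// Hypothetical plain-Java illustration of a bounded-out-of-orderness watermark;
// it mirrors the idea only and is not Flink's implementation.
class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // Start low enough that the first watermark cannot underflow.
        this.maxTimestampSeen = Long.MIN_VALUE + maxOutOfOrdernessMs + 1;
    }

    // Called for every record with the timestamp taken from the Kafka message.
    void onEvent(long timestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, timestampMs);
    }

    // The watermark trails the highest timestamp seen by the allowed out-of-orderness.
    long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrdernessMs - 1;
    }

    // In this sketch, an event counts as late if its timestamp is at or behind the watermark.
    boolean isLate(long timestampMs) {
        return timestampMs <= currentWatermark();
    }
}
```

With a 20-second bound, an event that arrives up to 20 seconds behind the newest timestamp seen so far is still considered on time.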
In earlier versions of Flink you had to provide an implementation of a timestamp assigner, which would look like this:
public long extractTimestamp(Long element, long previousElementTimestamp) {
return previousElementTimestamp;
}
This version of the extractTimestamp method is passed the current value of the timestamp present in the StreamRecord as previousElementTimestamp, which in this case will be the timestamp put there by the Flink Kafka consumer.
Flink 1.11 docs
Flink 1.10 docs
As for what is returned by ctx.timestamp() when using TimeCharacteristic.ProcessingTime, this method returns null in that case. (Semantically, yes, it is as though the timestamp is the current processing time, but that's not how it's implemented.)
I'm reading from a Kafka cluster in a Flink streaming app. After getting the source stream I want to aggregate events by a composite key and an event-time tumbling window, and then write the result to a table.
The problem is that after applying my AggregateFunction, which just counts the number of clicks by clientId, I can't find a way to get the key of each output record, since the API returns an instance of the accumulated result but not the corresponding key.
DataStream<Event> stream = environment.addSource(mySource);
stream.keyBy(new KeySelector<Event, Integer>() {
public Integer getKey(Event event) { return event.getClientId(); }
})
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new MyAggregateFunction());
How do I get the key that I specified before? I did not put the key of the input events into the accumulator, as I felt it wouldn't be nice.
Rather than
.aggregate(new MyAggregateFunction())
you can use
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
and in this case the process method of your ProcessWindowFunction will be passed the key, along with the pre-aggregated result of your AggregateFunction and a Context object with other potentially relevant info. See the section in the docs on ProcessWindowFunction with Incremental Aggregation for more details.
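The division of labor can be sketched in plain Java without any Flink types. Everything here is hypothetical (class name, method names, output format): the map plays the role of the per-key accumulator kept by the AggregateFunction, and fireWindow plays the role of the ProcessWindowFunction, which is the point where the key and the pre-aggregated count are finally seen together.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java illustration (no Flink types): counts are pre-aggregated
// per key, and the key is attached only when the window result is emitted.
class KeyedCountSketch {
    // Plays the role of the AggregateFunction: one running count per key.
    private final Map<Integer, Long> countsByClient = new LinkedHashMap<>();

    void add(int clientId) {
        countsByClient.merge(clientId, 1L, Long::sum);
    }

    // Plays the role of the ProcessWindowFunction: receives the key together
    // with the already aggregated count and builds the output record.
    List<String> fireWindow(long windowEndMs) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Long> e : countsByClient.entrySet()) {
            out.add("client=" + e.getKey() + " clicks=" + e.getValue()
                    + " windowEnd=" + windowEndMs);
        }
        countsByClient.clear();
        return out;
    }
}
```

The accumulator itself never stores the key, which matches the concern in the question about not wanting to inject the key into the accumulated value.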
I am investigating the types of watermarks that can be inserted into the data stream.
While this may go outside of the purpose of watermarks, I'll ask it anyway.
Can you create a watermark that holds a timestamp and k/v pair(s) (this=that, that=this)?
Hence the watermark will hold {12DEC180500GMT,this=that, that=this}.
Or
{Timestamp, kvp1, kvp2, kvpN}
Is something like this possible? I have reviewed the user and API docs but may have overlooked something.
No, the Watermark class in Flink (found in flink/flink-streaming/java/src/main/java/org/apache/flink/streaming/api/watermark/Watermark.java) has only one field besides MAX_WATERMARK, which is
/** The timestamp of the watermark in milliseconds. */
private final long timestamp;
So watermarks cannot carry any information besides a timestamp, which must be a long value.
I need your advice.
In my task I need to aggregate events using two types of aggregation.
The first type is onCount, the second is onTime.
An event for onCount aggregation has the fields number (the number of the event) and totalCount (how many events we should accumulate before aggregating).
An event for onTime aggregation has the field time, the date after which we should take all accumulated events and start aggregating.
I can group events by type, start a window, and set a trigger:
stream
.keyBy(e => (e.clientSystemId, e.onMode))
.window(GlobalWindows.create())
.trigger(new WindowAggregationTrigger())
But in the trigger I need to keep state: the total count or the time.
Ideally I would have two different triggers, one for count-based and one for time-based aggregation.
My question is: what is a clean way to solve this problem?
That is, when I need two triggers with different logic, the first for counting and the second for time.
I am not asking you to solve the problem for me; I am asking for advice.
We are developing on Apache Flink 1.4.
It is not possible to apply two different triggers in the same window operator, but you can implement a single trigger that distinguishes the onCount and onTime cases.
However, I would recommend splitting the stream into two streams (using split() or side outputs), applying window operators with different triggers to the split streams, and later combining the streams with union() (if that is necessary).
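Whichever option is chosen, the two firing rules themselves are simple. The following plain-Java sketch is not a Flink Trigger: the class and method names are hypothetical, the mode strings "onCount" and "onTime" are assumed to match the onMode field of the key, and treating the deadline as inclusive is also an assumption.

```java
// Hypothetical plain-Java illustration of the two firing rules; not a Flink Trigger.
class AggregationRulesSketch {
    // onCount: fire once the number of buffered events reaches the announced total.
    static boolean shouldFireOnCount(int bufferedEvents, int totalCount) {
        return bufferedEvents >= totalCount;
    }

    // onTime: fire once the current time (or watermark) has reached the deadline.
    static boolean shouldFireOnTime(long currentTimeMs, long deadlineMs) {
        return currentTimeMs >= deadlineMs;
    }

    // A single trigger could branch on the mode carried in the key.
    static boolean shouldFire(String onMode, int bufferedEvents, int totalCount,
                              long currentTimeMs, long deadlineMs) {
        if ("onCount".equals(onMode)) {
            return shouldFireOnCount(bufferedEvents, totalCount);
        }
        if ("onTime".equals(onMode)) {
            return shouldFireOnTime(currentTimeMs, deadlineMs);
        }
        throw new IllegalArgumentException("unknown mode: " + onMode);
    }
}
```

Splitting the stream first means each window operator only needs one of the two rules, which keeps the per-trigger state smaller and easier to reason about.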