Is there a way to achieve the following with Flink CEP
Continue matching pattern of all events have not arrived within time. For example A -> B -> C are supposed to match within 20 seconds, and only event A has arrived, right now I see timeout happens at 20 seconds. After timeout if B arrives the pattern doesn't match. How do we make sure it continues to match after the timeout has occurred.
Multiple timeouts - Is it possible to alert multiple times on a given pattern. I have a use case where in I need to alert at time t1, t2, t3. Is there a way to achieve that ?
Multiple Patterns - Is have seen some articles where NFACompiler.NFAFactory is maintained in a map and based on the data, the right one is looked up and used. Is there an example that I can find how to do it with current version of Flink.
Related
I have a use case where a large no of logs will be consumed to the apache flink CEP. My use case is to find the brute force attack and port scanning attack. The challenge here is that while in ordinary CEP we compare the value against a constant like "event" = login. In this case the Criteria is different as in the case of brute force attack we have the criteria as follows.
username is constant and event="login failure" (Delimiter the event happens 5 times within 5 minutes).
It means the logs with the login failure event is received for the same username 5 times within 5 minutes
And for port Scanning we have the following criteira.
ip address is constant and dest port is variable (Delimiter is the event happens 10 times within 1 minute). It means the logs with constant ip address is received for the 10 different ports within 1 minute.
With Flink, when you want to process the events for something like one username or one ip address in isolation, the way to do this is to partition the stream by a key, using keyBy(). The training materials in the Flink docs have a section on Keyed Streams that explains this part of the DataStream API in more detail. keyBy() is the roughly same concept as a GROUP BY in SQL, if that helps.
With CEP, if you first key the stream, then the pattern will be matched separately for each distinct value of the key, which is what you want.
However, rather than CEP, I would instead recommend Flink SQL, perhaps in combination with MATCH_RECOGNIZE, for this use case. MATCH_RECOGNIZE is a higher-level API, built on top of CEP, and it's easier to work with. In combination with SQL, the result is quite powerful.
You'll find some Flink SQL training materials and examples (including examples that use MATCH_RECOGNIZE) in Ververica's github account.
Update
To be clear, I wouldn't use MATCH_RECOGNIZE for these specific rules; neither it nor CEP is needed for this use case. I mentioned it in case you have other rules where it would be helpful. (My reason for not recommending CEP in this case is that implementing the distinct constraint might be messy.)
For example, for the port scanning case you can do something like this:
SELECT e1.ip, COUNT(DISTINCT e2.port)
FROM events e1, events e2
WHERE e1.ip = e2.ip AND timestampDiff(MINUTE, e1.ts, e2.ts) < 1
GROUP BY e1.ip HAVING COUNT(DISTINCT e2.port) >= 10;
The login case is similar, but easier.
Note that when working with streaming SQL, you should give some thought to state retention.
Further update
This query is likely to return a given IP address many times, but it's not desirable to generate multiple alerts.
This could be handled by inserting matching IP addresses into an Alert table, and only generate alerts for IPs that aren't already there.
Or the output of the SQL query could be processed by a de-duplicator implemented using the DataStream API, similar to the example in the Flink docs. If you only want to suppress duplicate alerts for some period of time, use a KeyedProcessFunction instead of a RichFlatMapFunction, and use a Timer to clear the state when it's time to re-enable alerts for a given IP.
Yet another update (concerning CEP and distinctness)
Implementing this with CEP should be possible. You'll want to key the stream by the IP address, and have a pattern that has to match within one minute.
The pattern can be roughly like this:
Pattern<Event, ?> pattern = Pattern
.<Event>begin("distinctPorts")
.where(iterative condition 1)
.oneOrMore()
.followedBy("end")
.where(iterative condition 2)
.within(1 minute)
The first iterative condition returns true if the event being added to the pattern has a distinct port from all of the previously matching events. Somewhat similar to the example here, in the docs.
The second iterative condition returns true if size("distinctPorts") >= 9 and this event also has yet another distinct port.
See this Flink Forward talk (youtube video) for a somewhat similar example at the end of the talk.
If you try this and get stuck, please ask a new question, showing us what you've tried and where you're stuck.
I use Flink SQL and CEP to recognize some really simple patterns. However, I found a weird thing (likely a bug). I have two example tables password_change and transfer as below.
transfer
transid,accountnumber,sortcode,value,channel,eventtime,eventtype
1,123,1,100,ONL,2020-01-01T01:00:01Z,transfer
3,123,1,100,ONL,2020-01-01T01:00:02Z,transfer
4,123,1,200,ONL,2020-01-01T01:00:03Z,transfer
5,456,1,200,ONL,2020-01-01T01:00:04Z,transfer
password_change
accountnumber,channel,eventtime,eventtype
123,ONL,2020-01-01T01:00:05Z,password_change
456,ONL,2020-01-01T01:00:06Z,password_change
123,ONL,2020-01-01T01:00:08Z,password_change
123,ONL,2020-01-01T01:00:09Z,password_change
Here are my SQL queries.
First create a temporary view event as
(SELECT accountnumber,rowtime,eventtype FROM password_change WHERE channel='ONL')
UNION ALL
(SELECT accountnumber,rowtime, eventtype FROM transfer WHERE channel = 'ONL' )
rowtime column is the event time extracted directly from original eventtime col with watermark periodic bound 1 second.
Then output the query result of
SELECT * FROM `event`
MATCH_RECOGNIZE (
PARTITION BY accountnumber
ORDER BY rowtime
MEASURES
transfer.eventtype AS event_type,
transfer.rowtime AS transfer_time
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (transfer password_change ) WITHIN INTERVAL '5' SECOND
DEFINE
password_change AS eventtype='password_change',
transfer AS eventtype='transfer'
)
It should output
123,transfer,2020-01-01T01:00:03Z
456,transfer,2020-01-01T01:00:04Z
But I got nothing when running Flink 1.11.1 (also no output for 1.10.1).
What's more, I change the pattern to only password_change, it still output nothing, but if I change the pattern to transfer then it outputs several rows but not all transfer rows. If I exchange the eventtime of two tables which means let password_changes happen first, then the pattern password_change will output several rows while transfer not.
On the other hand, if I extract those columns from two tables and merge them in one table manually, then emit them into Flink, the running result is correct.
I searched and tried a lot to get it right including changing the SQL statement, watermark, buffer timeout and so on, but nothing helped. Hope anyone here can help. Thanks.
10/10/2020 update:
I use Kafka as the table source. tEnv is the StreamTableEnvironment.
Kafka kafka=new Kafka()
.version("universal")
.property("bootstrap.servers", "localhost:9092");
tEnv.connect(
kafka.topic("transfer")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
.field("transid",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("value",DataTypes.DECIMAL(38,18))
).createTemporaryTable("transfer");
tEnv.connect(
kafka.topic("pchange")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
).createTemporaryTable("password_change");
Thank #Dawid Wysakowicz's answer. To confirm that, I added 4,123,1,200,ONL,2020-01-01T01:00:10Z,transfer to the end of transfer table, then the output becomes right, which means it is really some problem about watermarks.
So now the question is how to fix it. Since a user will not change his/her password frequently, the time gap between these two table is unavoidable. I just need the UNION ALL table has the same behavior as that I merged manually.
Update Nov. 4th 2020:
WatermarkStrategy with idle sources may help.
Most likely the problem is somewhere around watermark generation in conjunction with the UNION ALL operator. Could you share how you create the two tables including how you define the time attributes and what are the connectors? It could let me confirm my suspicions.
I think the problem is that one of the sources stops emitting Watermarks. If the transfer table (or the table with lower timestamps) does not finish and produces no records it emits no Watermarks. After emitting the fourth row it will emit Watermark = 3 (4-1 second). The Watermark of a union of inputs is the smallest of values of the two. Therefore the first table will pause/hold the Watermark with value Watermark = 3 and thus you see no progress for the original query and you see some records emitted for the table with smaller timestamps.
If you manually join the two tables, you have just a single input with a single source of Watermarks and thus it progresses further and you see some results.
I need your advice, really
In my task i need to aggregate events by two type of aggregation.
First type - is onCount, second type - is onTime.
If event is for onCount aggregation - it has fields number - number of event, and totalCount - what count of events we should accumulate before aggregate.
If event is for onTime aggregation - it has field time - it's date after which we should get all accumulate events and start aggregating.
I can groupped events by type, start window and set trigger:
stream
.keyBy(e => (e.clientSystemId, e.onMode))
.window(GlobalWindows.create())
.trigger(new WindowAggregationTrigger())
But in trigger i need to have state - total count or time.
And in best solution - i need two different triggers - first is about counting and second - is about time aggregation.
My question is - how beautifully to solve this problem?
When i need two triggers with different logic - first about counting, second- about time trigger.
I do not ask to solve the problem for me, I ask for advice.
We developing on Apache Flink 1.4.
It is not possible to apply two different triggers in the same window operator, but you can implement a single trigger to distinguish the onCount and onTime cases.
However, I would recommend to split the stream into two streams (using split() or side outputs), apply window operators with different triggers on the splitted streams, and later union() the streams together (if that is necessary).
I am currently writing a streaming application where:
as an input, I am receiving some alerts from a kafka topic (1 alert is linked to 1 resource, for example 1 alert will be linked to my-router-1 or to my-switch-1 or to my-VM-1 or my-VM-2 or ...)
I need then to do a query to an external system in order to enrich the alert with some additional information linked to the resource on which the alert is linked
When querying the external system:
I do not want to do 1 query per alert and not even 1 query per resource
I rather want to do group queries (1 query for several alerts linked to several resources)
My idea was to have something like n buffer (n being a small number representing the nb of queries that I will do in parallel), and then for a given time period (let's say 100ms), put all alerts within one of those buffer and at the end of those 100ms, do my n queries in parallel (1 query being responsible for enriching several alerts belonging to several resources).
In Spark, it is something that I would do through a mapPartitions (if I have n partition, then I will do only n queries in parallel to my external system and each query will be for all the alerts received during the micro-batch for one partition).
Now, I am currently looking at Flink and I haven't really found what is the best way of doing such kind of grouping when requesting an external system.
When looking at this kind of use case and especially at asyncio (https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/asyncio.html), it seems that it deals with 1 query per key.
For example, I can very easily:
define the resouce id as a key
define a processing time window of 100ms
and then do my query to the external system (synchronously or maybe better asynchrously through the asyncio feature)
But by doing so, I will do 1 query per resource (maybe for several alerts but linked to the same key, ie the same resource).
It is not what I want to do as it will lead to too much queries to the external system.
I've then explored the option of defining a kind of technical key for my requests (something like the hashCode of my resource id % nb of queries I want to perform).
So, if I want to do max 4 queries in parallel, then my key will be something like "resourceId.hashCode % 4".
I was thinking that it was ok, but when looking more deeply to some metrics when running my job, I found that that my queries were not well distributed to my 4 operator instances (only 2 of them were doing something).
It comes for the mechanism used to assign a key to a given operator instance:
public static int assignKeyToParallelOperator(Object key, int maxParallelism, int parallelism) {
return computeOperatorIndexForKeyGroup(maxParallelism, parallelism, assignToKeyGroup(key, maxParallelism));
}
(in my case, parallelism being 4, maxParallelism 128 and my key value in the range [0,4[ ) (in such a context, 2 of my keys goes to operator instance 3 and 2 to operator instance 4) (operator instance 1 and 2 will have nothing to do).
I was thinking that key=0 will go to operator 0, key 1 to operator 1, key 2 to operator 2 and key 3 to operator 3, but it is not the case.
So do you know what will be the best approach to do this kind of grouping while querying an external system ?
ie 1 query per operator instance for all the alerts "received" by this operator instance during the last 100ms.
You can put an aggregator function upstream of the async function, where that function (using a timed window) outputs a record with <resource id><list of alerts to query>. You'd key the stream by the <resource id> ahead of the aggregator, which should then get pipelined to the async function.
I effectively want a flush, or a completionSize but for all the aggregations in the aggregator. Like a global completionSize.
Basically I want to make sure that every message that comes in a batch is aggregated and then have all the aggregations in that aggregator complete at once when the last one has been read.
e.g. 1000 messages arrive (the length is not known beforehand)
aggregate on correlation id into bins
A 300
B 400
C 300 (size of the bins is not known before hand)
I want the aggregator not to complete until the 1000th exchange is aggregated
thereupon I want all of the aggregations in the aggregator to complete at once
The CompleteSize applies to each aggregation, and not the aggregator as a whole unfortunately. So if I set CompleteSize( 1000 ) it will just never finish, since each aggregation has to exceed 1000 before it is 'complete'
I could get around it by building up a single Map object, but this is kind of sidestepping the correlation in aggregator2, that I would prefer to use ideally
so yeah, either global-complete-size or flushing, is there a way to do this intelligently?
one option is to simply add some logic to keep a global counter and set the Exchange.AGGREGATION_COMPLETE_ALL_GROUPS header once its reached...
Available as of Camel 2.9...You can manually complete all current aggregated exchanges by sending in a message containing the header Exchange.AGGREGATION_COMPLETE_ALL_GROUPS set to true. The message is considered a signal message only, the message headers/contents will not be processed otherwise.
I suggest to take a look at the Camel aggregator eip doc http://camel.apache.org/aggregator2, and read about the different completion conditions. And as well that special message Ben refers to you can send to signal to complete all in-flight aggregates.
If you consume from a batch consumer http://camel.apache.org/batch-consumer.html then you can use a special completion that complets when the batch is done. For example if you pickup files or rows from a JPA database table etc. Then when all messages from the batch consumer has been processed then the aggregator can signal completion for all these aggregated messages, using the completionFromBatchConsumer option.
Also if you have a copy of the Camel in Action book, then read chapter 8, section 8.2, as its all about the aggregate EIP covered in much more details.
Using Exchange.AGGREGATION_COMPLETE_ALL_GROUPS_INCLUSIVE worked for me:
from(endpoint)
.unmarshal(csvFormat)
.split(body())
.bean(CsvProcessor())
.choice()
// If all messages are processed,
// flush the aggregation
.`when`(simple("\${property.CamelSplitComplete}"))
.setHeader(Exchange.AGGREGATION_COMPLETE_ALL_GROUPS_INCLUSIVE, constant(true))
.end()
.aggregate(simple("\${body.trackingKey}"),
AggregationStrategies.bean(OrderAggregationStrategy()))
.completionTimeout(10000)