I use a CEP pattern in Flink SQL, which works as expected when connecting to a Kafka broker. But when I connect to a cluster-based cloud Kafka setup, the Flink CEP does not trigger. Here is my SQL:
create table agent_action_detail
(
agent_id String,
room_id String,
create_time Bigint,
call_type String,
application_id String,
connect_time Bigint,
row_time TIMESTAMP_LTZ(3),
WATERMARK for row_time as row_time - INTERVAL '1' MINUTE
)
with ('connector'='kafka', 'topic'='agent-action-detail', ...)
Then I send messages in JSON format like:
{"agent_id":"agent_221","room_id":"room1","create_time":1635206828877,"call_type":"inbound","application_id":"app1","connect_time":1635206501735,"row_time":"2021-10-25 16:07:09.019Z"}
In the Flink web UI, the watermark works fine.
I run my CEP SQL:
select * from agent_action_detail
match_recognize(
partition by agent_id
order by row_time
measures
last(BF.create_time) as create_time,
first(AF.connect_time) as connect_time
one row per match
AFTER MATCH SKIP PAST LAST ROW
pattern (BF+ AF)
define
BF as BF.connect_time > 0,
AF as AF.connect_time > 0
)
In every Kafka message, connect_time is > 0, but Flink is not triggering any matches.
Can somebody help with this issue? Thanks in advance!
select * from agent_action_detail
match_recognize(
partition by agent_id
order by row_time
measures AF.connect_time as connect_time
one row per match
pattern (BF AF) WITHIN INTERVAL '1' second
define
BF as (last(BF.connect_time, 1) < 1),
AF as AF.connect_time >= 100
)
Here is another CEP SQL query that is still not working.
And the agent_action_detail table is populated by another Flink SQL statement:
insert into agent_action_detail select data.agent_id, data.room_id, data.create_time, data.call_type, data.application_id, data.connect_time, now() from source_table where type = 'xxx'
There are several things that can cause pattern matching to produce no results:
the input doesn't actually contain the pattern
watermarking is being done incorrectly
the pattern is pathological in some way
This particular pattern loops with no exit condition. This sort of pattern doesn't allow the internal state of the pattern matching engine to ever be cleared, which will lead to problems.
If you were using Flink CEP directly, I would tell you to try adding until(condition) or within(time) to constrain the number of possible matches (a rough sketch follows below).
With MATCH_RECOGNIZE, see if you can add a distinct terminating element to the pattern.
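For the CEP-level until/within advice, a rough sketch might look like the following. It assumes a POJO Event with a getConnectTime() accessor; the exit condition and the one-minute bound are illustrative choices, not something taken from the question:
Pattern<Event, ?> bounded = Pattern
    .<Event>begin("BF")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event e) {
            return e.getConnectTime() > 0;
        }
    })
    .oneOrMore()
    // until() gives the looping part an exit condition, so partial matches can be discarded
    .until(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event e) {
            return e.getConnectTime() == 0;
        }
    })
    .followedBy("AF")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event e) {
            return e.getConnectTime() > 0;
        }
    })
    // within() bounds how long the engine keeps state for an incomplete match
    .within(Time.minutes(1));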
Update: since you are still getting no results after modifying the pattern, you should determine if watermarking is the source of your problem. CEP relies on sorting the input stream by time, which depends on watermarking -- but only if you are using event time.
The easiest way to test this would be to switch to using processing time:
create table agent_action_detail
(
agent_id String,
...
row_time AS PROCTIME()
)
with (...)
If that works, then either the timestamps or the watermarks are the problem. For example, if all of the events are late, you'll get no results. In your case, I'm wondering whether the row_time column has any data in it.
If that doesn't reveal the problem, please share a minimal reproducible example, including the data needed to observe the problem.
Related
I'm using Flink SQL to compute event-time-based windowed analytics. Everything works fine until my data source becomes idle each evening, after which the results for the last minute aren't produced until the next day when data begins to flow again.
CREATE TABLE input (
id STRING,
data BIGINT,
rowtime TIMESTAMP(3) METADATA FROM 'timestamp',
WATERMARK FOR rowtime AS rowtime - INTERVAL '1' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'input',
'properties.bootstrap.servers' = 'localhost:9092',
'scan.startup.mode' = 'latest-offset',
'format' = 'json'
)
SELECT ...
FROM
(SELECT * FROM
TABLE(TUMBLE(TABLE input, DESCRIPTOR(rowtime), INTERVAL '1' MINUTES)))
GROUP BY ..., window_start, window_end
I've tried setting table.exec.source.idle-timeout, but it didn't help. What can I do?
table.exec.source.idle-timeout (and the corresponding withIdleness construct used with the DataStream API for WatermarkStrategy) detects idle input partitions and prevents them from holding back the progress of the overall watermark. However, for the overall watermark to advance, there must still be some input, somewhere.
Some options:
(1) Live with the problem, which means waiting until the watermark can advance normally, based on observing larger timestamps in the input stream. As you've indicated, in your use case this can require waiting several hours.
(2) Arrange for the input stream(s) to contain keep-alive messages. This way the watermark generator will have evidence (based on the timestamps in the keep-alive messages) that it can advance the watermark. You'll have to modify your queries to ignore these otherwise extraneous events.
(3) Upon reaching the point where the job has fully ingested all of the daily input, but hasn't yet produced the final set of results, stop the job and specify --drain. This will send a watermark with value MAX_WATERMARK through the pipeline, which will close all pending windows. You can then restart the job.
(4) Implement a custom watermark strategy that uses a processing-time timer to detect idleness and artificially advance the watermark based on the passage of wall clock time. This will require converting your table input to a DataStream, adding the watermarks there, and then converting back to a table for the windowing. See https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/data_stream_api/ for examples of these conversions.
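For option (4), here is a minimal sketch of what such a watermark generator could look like. The event type MyEvent, the one-minute idle threshold, and the one-second out-of-orderness bound are assumptions for illustration, not part of the original job:
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

public class IdleAwareWatermarkGenerator implements WatermarkGenerator<MyEvent> {

    private static final long MAX_IDLE_MILLIS = 60_000;      // assumed: 1 minute of idleness
    private static final long MAX_OUT_OF_ORDERNESS = 1_000;  // assumed: matches the 1 second bound

    private long maxTimestamp = Long.MIN_VALUE;
    private long lastEventWallClock = System.currentTimeMillis();

    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        // track the highest event timestamp and remember when we last saw any input
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        lastEventWallClock = System.currentTimeMillis();
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        long now = System.currentTimeMillis();
        if (now - lastEventWallClock > MAX_IDLE_MILLIS) {
            // no input for a while: artificially advance the watermark using wall clock time
            output.emitWatermark(new Watermark(now - MAX_IDLE_MILLIS));
        } else if (maxTimestamp != Long.MIN_VALUE) {
            // normal case: bounded-out-of-orderness behavior based on event timestamps
            output.emitWatermark(new Watermark(maxTimestamp - MAX_OUT_OF_ORDERNESS - 1));
        }
    }
}
It could then be attached to the converted DataStream with something like WatermarkStrategy.forGenerator(ctx -> new IdleAwareWatermarkGenerator()) before converting back to a table.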
I have a use case where a large number of logs will be consumed by Apache Flink CEP. My use case is to detect brute force attacks and port scanning attacks. The challenge here is that in ordinary CEP we compare a value against a constant, like "event" = login, but in this case the criteria are different. For a brute force attack the criteria are as follows:
username is constant and event="login failure" (the event happens 5 times within 5 minutes).
It means that logs with the login failure event are received for the same username 5 times within 5 minutes.
And for port scanning we have the following criteria:
IP address is constant and dest port is variable (the event happens 10 times within 1 minute). It means that logs with a constant IP address are received for 10 different ports within 1 minute.
With Flink, when you want to process the events for something like one username or one IP address in isolation, the way to do this is to partition the stream by a key, using keyBy(). The training materials in the Flink docs have a section on Keyed Streams that explains this part of the DataStream API in more detail. keyBy() is roughly the same concept as a GROUP BY in SQL, if that helps.
With CEP, if you first key the stream, then the pattern will be matched separately for each distinct value of the key, which is what you want.
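For example (a fragment only, assuming a POJO LogEvent with a getIpAddress() accessor and an already-defined logs stream and portScanPattern), keying the stream before applying the pattern makes CEP evaluate it separately per IP address:
// each distinct IP address gets its own, independent pattern matching state
KeyedStream<LogEvent, String> byIp = logs.keyBy(LogEvent::getIpAddress);
PatternStream<LogEvent> perIpMatches = CEP.pattern(byIp, portScanPattern);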
However, rather than CEP, I would instead recommend Flink SQL, perhaps in combination with MATCH_RECOGNIZE, for this use case. MATCH_RECOGNIZE is a higher-level API, built on top of CEP, and it's easier to work with. In combination with SQL, the result is quite powerful.
You'll find some Flink SQL training materials and examples (including examples that use MATCH_RECOGNIZE) in Ververica's github account.
Update
To be clear, I wouldn't use MATCH_RECOGNIZE for these specific rules; neither it nor CEP is needed for this use case. I mentioned it in case you have other rules where it would be helpful. (My reason for not recommending CEP in this case is that implementing the distinct constraint might be messy.)
For example, for the port scanning case you can do something like this:
SELECT e1.ip, COUNT(DISTINCT e2.port)
FROM events e1, events e2
WHERE e1.ip = e2.ip AND timestampDiff(MINUTE, e1.ts, e2.ts) < 1
GROUP BY e1.ip HAVING COUNT(DISTINCT e2.port) >= 10;
The login case is similar, but easier.
Note that when working with streaming SQL, you should give some thought to state retention.
Further update
This query is likely to return a given IP address many times, but it's not desirable to generate multiple alerts.
This could be handled by inserting matching IP addresses into an Alert table, and only generating alerts for IPs that aren't already there.
Or the output of the SQL query could be processed by a de-duplicator implemented using the DataStream API, similar to the example in the Flink docs. If you only want to suppress duplicate alerts for some period of time, use a KeyedProcessFunction instead of a RichFlatMapFunction, and use a Timer to clear the state when it's time to re-enable alerts for a given IP.
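Here is a sketch of that timed de-duplicator, assuming the SQL query's matches have been converted to a DataStream<String> of IP addresses; the one-hour suppression interval is an arbitrary illustrative choice:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class AlertDeduplicator extends KeyedProcessFunction<String, String, String> {

    private static final long SUPPRESS_MILLIS = 60 * 60 * 1000;  // assumed: re-enable alerts after 1 hour

    private transient ValueState<Boolean> alerted;

    @Override
    public void open(Configuration parameters) {
        alerted = getRuntimeContext().getState(
            new ValueStateDescriptor<>("alerted", Boolean.class));
    }

    @Override
    public void processElement(String ip, Context ctx, Collector<String> out) throws Exception {
        if (alerted.value() == null) {
            out.collect(ip);           // first alert for this IP: emit it
            alerted.update(true);
            // the timer clears the state later, so alerts for this IP are eventually re-enabled
            ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + SUPPRESS_MILLIS);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        alerted.clear();
    }
}
It would be applied with something like ipStream.keyBy(ip -> ip).process(new AlertDeduplicator()).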
Yet another update (concerning CEP and distinctness)
Implementing this with CEP should be possible. You'll want to key the stream by the IP address, and have a pattern that has to match within one minute.
The pattern can be roughly like this:
Pattern<Event, ?> pattern = Pattern
.<Event>begin("distinctPorts")
.where(iterative condition 1)
.oneOrMore()
.followedBy("end")
.where(iterative condition 2)
.within(Time.minutes(1));
The first iterative condition returns true if the event being added to the pattern has a distinct port from all of the previously matching events. Somewhat similar to the example here, in the docs.
The second iterative condition returns true if size("distinctPorts") >= 9 and this event also has yet another distinct port.
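A sketch of those two conditions (assuming an Event POJO with a getPort() accessor) could share one parameterized IterativeCondition:
import org.apache.flink.cep.pattern.conditions.IterativeCondition;
import java.util.HashSet;
import java.util.Set;

public class DistinctPortCondition extends IterativeCondition<Event> {

    private final int minDistinctPorts;  // 0 for condition 1, 9 for condition 2

    public DistinctPortCondition(int minDistinctPorts) {
        this.minDistinctPorts = minDistinctPorts;
    }

    @Override
    public boolean filter(Event event, Context<Event> ctx) throws Exception {
        // collect the ports of all events already matched into "distinctPorts"
        Set<Integer> seenPorts = new HashSet<>();
        for (Event previous : ctx.getEventsForPattern("distinctPorts")) {
            seenPorts.add(previous.getPort());
        }
        // condition 1: the new event's port must be distinct from everything matched so far;
        // condition 2 additionally requires that 9 distinct ports have already been seen
        return seenPorts.size() >= minDistinctPorts && !seenPorts.contains(event.getPort());
    }
}
Condition 1 is then new DistinctPortCondition(0) and condition 2 is new DistinctPortCondition(9).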
See this Flink Forward talk (youtube video) for a somewhat similar example at the end of the talk.
If you try this and get stuck, please ask a new question, showing us what you've tried and where you're stuck.
I use Flink SQL and CEP to recognize some really simple patterns. However, I found a weird thing (likely a bug). I have two example tables password_change and transfer as below.
transfer
transid,accountnumber,sortcode,value,channel,eventtime,eventtype
1,123,1,100,ONL,2020-01-01T01:00:01Z,transfer
3,123,1,100,ONL,2020-01-01T01:00:02Z,transfer
4,123,1,200,ONL,2020-01-01T01:00:03Z,transfer
5,456,1,200,ONL,2020-01-01T01:00:04Z,transfer
password_change
accountnumber,channel,eventtime,eventtype
123,ONL,2020-01-01T01:00:05Z,password_change
456,ONL,2020-01-01T01:00:06Z,password_change
123,ONL,2020-01-01T01:00:08Z,password_change
123,ONL,2020-01-01T01:00:09Z,password_change
Here are my SQL queries.
First create a temporary view event as
(SELECT accountnumber,rowtime,eventtype FROM password_change WHERE channel='ONL')
UNION ALL
(SELECT accountnumber,rowtime, eventtype FROM transfer WHERE channel = 'ONL' )
The rowtime column is the event time extracted directly from the original eventtime column, with a periodic bounded watermark of 1 second.
Then output the query result of
SELECT * FROM `event`
MATCH_RECOGNIZE (
PARTITION BY accountnumber
ORDER BY rowtime
MEASURES
transfer.eventtype AS event_type,
transfer.rowtime AS transfer_time
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (transfer password_change ) WITHIN INTERVAL '5' SECOND
DEFINE
password_change AS eventtype='password_change',
transfer AS eventtype='transfer'
)
It should output
123,transfer,2020-01-01T01:00:03Z
456,transfer,2020-01-01T01:00:04Z
But I got nothing when running Flink 1.11.1 (also no output for 1.10.1).
What's more, if I change the pattern to only password_change, it still outputs nothing, but if I change the pattern to transfer, it outputs several rows, though not all of the transfer rows. If I swap the eventtime values of the two tables, so that the password_change events happen first, then the pattern password_change outputs several rows while transfer does not.
On the other hand, if I extract those columns from the two tables, merge them into one table manually, and then emit them into Flink, the result is correct.
I searched and tried a lot to get it right, including changing the SQL statement, the watermark, the buffer timeout and so on, but nothing helped. I hope someone here can help. Thanks.
10/10/2020 update:
I use Kafka as the table source. tEnv is the StreamTableEnvironment.
Kafka kafka=new Kafka()
.version("universal")
.property("bootstrap.servers", "localhost:9092");
tEnv.connect(
kafka.topic("transfer")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
.field("transid",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("value",DataTypes.DECIMAL(38,18))
).createTemporaryTable("transfer");
tEnv.connect(
kafka.topic("pchange")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
).createTemporaryTable("password_change");
Thanks to @Dawid Wysakowicz's answer. To confirm it, I added 4,123,1,200,ONL,2020-01-01T01:00:10Z,transfer to the end of the transfer table, and then the output became correct, which means it really is a watermark problem.
So now the question is how to fix it. Since a user will not change his/her password frequently, the time gap between these two tables is unavoidable. I just need the UNION ALL table to have the same behavior as the table I merged manually.
Update Nov. 4th 2020:
WatermarkStrategy with idle sources may help.
Most likely the problem is somewhere around watermark generation in conjunction with the UNION ALL operator. Could you share how you create the two tables, including how you define the time attributes and what the connectors are? That would let me confirm my suspicions.
I think the problem is that one of the sources stops emitting watermarks. If the transfer table (or whichever table has the lower timestamps) does not finish and produces no further records, it emits no more watermarks. After emitting the fourth row it will emit Watermark = 3 (4 - 1 second). The watermark of a union of inputs is the smaller of the two input watermarks. Therefore the first table holds the overall watermark back at Watermark = 3, so you see no progress for the original query, while you do see some records emitted for the pattern over the table with smaller timestamps.
If you manually join the two tables, you have just a single input with a single source of Watermarks and thus it progresses further and you see some results.
I would like to emit the last record of a time window. This can easily be done with maxBy in regular Flink, but I cannot get it to work through the SQL API. What I want is:
SELECT LAST(attribute) FROM [table]
GROUP BY key, TUMBLE(ts, INTERVAL '1' DAY)
which behaves similar to
ds.keyBy(key)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.maxBy(x -> x.getTs())
Any way to achieve that in SQL API?
I don't think there's a built-in function for this in Flink yet, but you could implement a user-defined aggregate function for this.
You need to adjust the query a little bit and pass the timestamp field into the aggregation function, because SQL does not assume any order of the rows within a GROUP BY group:
SELECT last_by(attribute, ts) FROM [table]
GROUP BY key, TUMBLE(ts, INTERVAL '1' DAY)
See the documentation for details on how to implement and register a user-defined aggregation function.
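As a sketch (not an official Flink function), a last_by aggregate could look roughly like this, assuming the attribute is a STRING and the timestamp arrives as java.sql.Timestamp; adjust the types for your schema and Flink version:
import org.apache.flink.table.functions.AggregateFunction;

public class LastByFunction extends AggregateFunction<String, LastByFunction.LastByAccumulator> {

    public static class LastByAccumulator {
        public String value = null;
        public long maxTs = Long.MIN_VALUE;
    }

    @Override
    public LastByAccumulator createAccumulator() {
        return new LastByAccumulator();
    }

    @Override
    public String getValue(LastByAccumulator acc) {
        return acc.value;
    }

    // called once per row: keep the attribute value with the largest timestamp seen so far
    public void accumulate(LastByAccumulator acc, String attribute, java.sql.Timestamp ts) {
        long t = ts.getTime();
        if (t >= acc.maxTs) {
            acc.maxTs = t;
            acc.value = attribute;
        }
    }
}
Depending on the Flink version, it can be registered with something like tEnv.createTemporarySystemFunction("last_by", LastByFunction.class) or the older tEnv.registerFunction(...).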
There is a built-in LAST_VALUE function in Flink.
Check: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/systemfunctions/
Recently I have been trying to use Apache Flink for fast batch processing.
I have a table with a value column and an irrelevant index column.
Basically I want to calculate the mean and range of every 5 rows of value. Then I am going to calculate the mean and standard deviation based on those means I just calculated. So I guess the best way is to use a tumbling window.
It looks like this
DataSet<Tuple2<Double, Integer>> rawData = {get the source data};
Table table = tableEnvironment.fromDataSet(rawData);
Table groupedTable = table
.window(Tumble.over("5.rows").on({what should I write?}).as("w"))
.groupBy("w")
.select("f0.avg, f0.max-f0.min");
{The next step is to use groupedTable to calculate overall mean and stdDev}
But I don't know what to write in .on(). I have tried "proctime" but it said there is no such input. I just want it to group rows in the order they are read from the source. But it has to be a time attribute, so I cannot use "f2", the index column, for ordering either.
Do I have to add a timestamp to do this? Is it necessary in batch processing and will it slow down the calculation? What is the best way to solve this?
Update:
I tried to use a sliding window in the Table API and it gives me an exception.
// Calculate mean value in each group
Table groupedTable = table
.groupBy("f0")
.select("f0.cast(LONG) as groupNum, f1.avg as avg")
.orderBy("groupNum");
//Calculate moving range of group Mean using sliding window
Table movingRangeTable = groupedTable
.window(Slide.over("2.rows").every("1.rows").on("groupNum").as("w"))
.groupBy("w")
.select("groupNum.max as groupNumB, (avg.max - avg.min) as MR");
The Exception is:
Exception in thread "main" java.lang.UnsupportedOperationException: Count sliding group windows on event-time are currently not supported.
at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.createEventTimeSlidingWindowDataSet(DataSetWindowAggregate.scala:456)
at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.translateToPlan(DataSetWindowAggregate.scala:139)
...
Does that mean that sliding windows are not supported in the Table API? If I recall correctly, there is no window function in the DataSet API. Then how do I calculate a moving range in batch processing?
The window clause is used to define a grouping based on a window function, such as Tumble or Session. Grouping every 5 rows is not well defined in the Table API (or SQL) unless you specify the order of the rows. This is done in the on clause of the Tumble function. Since this feature originates from stream processing, the on clause expects a timestamp attribute.
You can fetch the timestamp of the current time using the currentTimestamp() function. However, I should point out that Flink will sort the data, as it is not aware of the monotonic property of the function. Moreover, all of that will happen with a parallelism of 1, because there is no clause that would allow for partitioning.
Alternatively, you can also implement a user-defined scalar function that converts the index attribute into a timestamp (effectively a Long value). But again, Flink will do a full sort of the data.
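For example, a sketch of such a scalar function (the class name and the index-to-millisecond mapping are illustrative only):
import org.apache.flink.table.functions.ScalarFunction;
import java.sql.Timestamp;

public class IndexToTimestamp extends ScalarFunction {
    // any strictly increasing mapping works; here the index is treated as milliseconds past the epoch.
    // Flink will still perform a full sort on the result, as noted above.
    public Timestamp eval(Integer index) {
        return new Timestamp(index.longValue());
    }
}
After registering it (e.g. with tableEnvironment.registerFunction), its result can be used in the on() clause, keeping in mind the sorting and parallelism caveats above.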