Flink CEP cannot get correct results on a unioned table - apache-flink

I use Flink SQL and CEP to recognize some really simple patterns. However, I found a weird thing (likely a bug). I have two example tables password_change and transfer as below.
transfer
transid,accountnumber,sortcode,value,channel,eventtime,eventtype
1,123,1,100,ONL,2020-01-01T01:00:01Z,transfer
3,123,1,100,ONL,2020-01-01T01:00:02Z,transfer
4,123,1,200,ONL,2020-01-01T01:00:03Z,transfer
5,456,1,200,ONL,2020-01-01T01:00:04Z,transfer
password_change
accountnumber,channel,eventtime,eventtype
123,ONL,2020-01-01T01:00:05Z,password_change
456,ONL,2020-01-01T01:00:06Z,password_change
123,ONL,2020-01-01T01:00:08Z,password_change
123,ONL,2020-01-01T01:00:09Z,password_change
Here are my SQL queries.
First create a temporary view event as
(SELECT accountnumber,rowtime,eventtype FROM password_change WHERE channel='ONL')
UNION ALL
(SELECT accountnumber,rowtime, eventtype FROM transfer WHERE channel = 'ONL' )
rowtime column is the event time extracted directly from original eventtime col with watermark periodic bound 1 second.
Then output the query result of
SELECT * FROM `event`
MATCH_RECOGNIZE (
PARTITION BY accountnumber
ORDER BY rowtime
MEASURES
transfer.eventtype AS event_type,
transfer.rowtime AS transfer_time
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (transfer password_change ) WITHIN INTERVAL '5' SECOND
DEFINE
password_change AS eventtype='password_change',
transfer AS eventtype='transfer'
)
It should output
123,transfer,2020-01-01T01:00:03Z
456,transfer,2020-01-01T01:00:04Z
But I got nothing when running Flink 1.11.1 (also no output for 1.10.1).
What's more, I change the pattern to only password_change, it still output nothing, but if I change the pattern to transfer then it outputs several rows but not all transfer rows. If I exchange the eventtime of two tables which means let password_changes happen first, then the pattern password_change will output several rows while transfer not.
On the other hand, if I extract those columns from two tables and merge them in one table manually, then emit them into Flink, the running result is correct.
I searched and tried a lot to get it right including changing the SQL statement, watermark, buffer timeout and so on, but nothing helped. Hope anyone here can help. Thanks.
10/10/2020 update:
I use Kafka as the table source. tEnv is the StreamTableEnvironment.
Kafka kafka=new Kafka()
.version("universal")
.property("bootstrap.servers", "localhost:9092");
tEnv.connect(
kafka.topic("transfer")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
.field("transid",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("value",DataTypes.DECIMAL(38,18))
).createTemporaryTable("transfer");
tEnv.connect(
kafka.topic("pchange")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
).createTemporaryTable("password_change");
Thank #Dawid Wysakowicz's answer. To confirm that, I added 4,123,1,200,ONL,2020-01-01T01:00:10Z,transfer to the end of transfer table, then the output becomes right, which means it is really some problem about watermarks.
So now the question is how to fix it. Since a user will not change his/her password frequently, the time gap between these two table is unavoidable. I just need the UNION ALL table has the same behavior as that I merged manually.
Update Nov. 4th 2020:
WatermarkStrategy with idle sources may help.

Most likely the problem is somewhere around watermark generation in conjunction with the UNION ALL operator. Could you share how you create the two tables including how you define the time attributes and what are the connectors? It could let me confirm my suspicions.
I think the problem is that one of the sources stops emitting Watermarks. If the transfer table (or the table with lower timestamps) does not finish and produces no records it emits no Watermarks. After emitting the fourth row it will emit Watermark = 3 (4-1 second). The Watermark of a union of inputs is the smallest of values of the two. Therefore the first table will pause/hold the Watermark with value Watermark = 3 and thus you see no progress for the original query and you see some records emitted for the table with smaller timestamps.
If you manually join the two tables, you have just a single input with a single source of Watermarks and thus it progresses further and you see some results.

Related

flink cep sql Event Not triggering

I use CEP Pattern in Flink SQL which is working as expected connecting to Kafka broker. But when i connecting to cluster based cloud kafka setup, the Flink CEP is not triggering. Here is my sql:
create table agent_action_detail
(
agent_id String,
room_id String,
create_time Bigint,
call_type String,
application_id String,
connect_time Bigint,
row_time TIMESTAMP_LTZ(3), WATERMARK for row_time as row_time - INTERVAL '1' MINUTE)
with ('connector'='kafka', 'topic'='agent-action-detail', ...)
then I send messages in json format like
{"agent_id":"agent_221","room_id":"room1","create_time":1635206828877,"call_type":"inbound","application_id":"app1","connect_time":1635206501735,"row_time":"2021-10-25 16:07:09.019Z"}
in flink web ui, watermark works fine
flink web ui
I run my cep sql :
select * from agent_action_detail
match_recognize(
partition by agent_id
order by row_time
measures
last(BF.create_time) as create_time,
first(AF.connect_time) as connect_time
one row per match AFTER MATCH SKIP PAST LAST ROW
pattern (BF+ AF) define BF as BF.connect_time > 0 ,AF as AF.connect_time > 0
)
every kafka message, connect_time is > 0, but flink not triggering.
Can somebody help to this issue, thanks in advance!
select * from agent_action_detail match_recognize( partition by agent_id order by row_time measures AF.connect_time as connect_time one row per match pattern (BF AF) WITHIN INTERVAL '1' second define BF as (last(BF.connect_time, 1) < 1), AF as AF.connect_time >= 100)
Here is another cep sql still not working.
And the agent_action_detail table is insert by another flink sql as
insert into agent_action_detail select data.agent_id, data.room_id, data.create_time, data.call_type, data.application_id, data.connect_time, now() from source_table where type = 'xxx'
There are several things that can cause pattern matching to produce no results:
the input doesn't actually contain the pattern
watermarking is being done incorrectly
the pattern is pathological in some way
This particular pattern loops with no exit condition. This sort of pattern doesn't allow the internal state of the pattern matching engine to ever be cleared, which will lead to problems.
If you were using Flink CEP directly, I would tell you to
try adding until(condition) or within(time) to constrain the number of possible matches.
With MATCH_RECOGNIZE, see if you can add a distinct terminating element to the pattern.
Update: since you are still getting no results after modifying the pattern, you should determine if watermarking is the source of your problem. CEP relies on sorting the input stream by time, which depends on watermarking -- but only if you are using event time.
The easiest way to test this would be to switch to using processing time:
create table agent_action_detail
(
agent_id String,
...
row_time AS PROCTIME()
)
with (...)
If that works, then either the timestamps or watermarks are the problem. For example, if all of the events are late, you'll get no results. In your case, I'm wondering the row_time column has any data in it.
If that doesn't reveal the problem, please share a minimal reproducible example, including the data needed to observe the problem.

Apache Flink: How to group every n rows with the Table API?

Recently I am trying to use Apache Flink for fast batch processing.
I have a table with a column:value and an irrelevant index column
Basically I want to calculate the mean and range of every 5 rows of value. Then I am going to calculate the mean and standard deviation based on those mean I just calculated. So I guess the best way is to use Tumble window.
It looks like this
DataSet<Tuple2<Double, Integer>> rawData = {get the source data};
Table table = tableEnvironment.fromDataSet(rawData);
Table groupedTable = table
.window(Tumble.over("5.rows").on({what should I write?}).as("w")
.groupBy("w")
.select("f0.avg, f0.max-f0.min");
{The next step is to use groupedTable to calculate overall mean and stdDev}
But I don't know what to write in .on(). I have tried "proctime" but it said there is no such input. I just want it to group by the order as it reads from the source. But it has to be a time attribute so I cannot use "f2" - the index column as ordering as well.
Do I have to add a timestamp to do this? Is it necessary in batch processing and will it slow down the calculation? What is the best way to solve this?
Update :
I tried to use a sliding window in the table API and it gets me Exception.
// Calculate mean value in each group
Table groupedTable = table
.groupBy("f0")
.select("f0.cast(LONG) as groupNum, f1.avg as avg")
.orderBy("groupNum");
//Calculate moving range of group Mean using sliding window
Table movingRangeTable = groupedTable
.window(Slide.over("2.rows").every("1.rows").on("groupNum").as("w"))
.groupBy("w")
.select("groupNum.max as groupNumB, (avg.max - avg.min) as MR");
The Exception is:
Exception in thread "main" java.lang.UnsupportedOperationException: Count sliding group windows on event-time are currently not supported.
at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.createEventTimeSlidingWindowDataSet(DataSetWindowAggregate.scala:456)
at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.translateToPlan(DataSetWindowAggregate.scala:139)
...
Does that mean that sliding window is not supported in Table API? If I recall correctly there is no window function in DataSet API. Then how do I calculate moving range in batch process?
The window clause is used to define a grouping based on a window function, such as Tumble or Session. Grouping every 5 rows is not well defined in the Table API (or SQL) unless you specify the order of the rows. This is done in the on clause of the Tumble function. Since this feature originates from stream processing, the on clause expects a timestamp attribute.
You can fetch the timestamp of the current time using the currentTimestamp() function. However, I should point out that Flink will sort the data as it is not aware of the monotonic property of the function. Moreover, all of that will with a parallelism of 1 because there is no clause that would allow for partitioning.
Alternatively, you can also implement a user-defined scalar function that converts the index attribute into a timestamp (effectively a Long value). But again, Flink will do a full sort of the data.

Continuous queries in Influxdb ignoring where clause?

I'm having a bit of a trouble with the continuous queries in influxdb 0.8.8.
I'm trying to create a continuous query but it seems that the where clauses are ignored. I'm aware about the restrictions mentioned here: http://influxdb.com/docs/v0.8/api/continuous_queries.html but I don't consider that this would be the case here.
One row in the time series would contain data like this:
{"hex":"06a0b6", "squawk":"3421", "flight":"QTR028 ", "lat":99.867630, "lon":66.447365, "validposition":1, "altitude":39000, "vert_rate":-64,"track":125, "validtrack":1,"speed":482, "messages":201, "seen":219}
The query I'm running and works is the following:
select * from flight_series where time > now() - 30m and flight !~ /^$/ and validtrack = 1 and validposition = 1;
Trough it I'm trying to take the last 30 minutes from the current time, check that the flight field is no whitespaces and that the track/position are valid.
The query returns successfully but when I'm adding the
into filtered_log
part the 'where' clause is ignored.
How can I create a continuous query which takes the above-mentioned conditions into consideration? At least, how could I extract with one continuous query only the rows which have the valid track/heading set to 1 and the flight is not whitespace/empty string? The time constraint I could eliminate from the query and translate into shard retention/duration.
Also, could I specify to in the continuous query to save the data into a time-series which is located into another database (which has a more relaxed retention/duration policy)?
Thank you!
Later edit:
I've managed to do something closer to my need by using the following cq:
"select time, sequence_number, altitude, vert_rate, messages, squawk, lon, lat, speed, hex, seen from current_flights where ((flight !~ /^$/) AND (validtrack = 1)) AND (validposition = 1) into flight.[flight]"
This creates a series for each 'flight' even for those which have a whitespace in the 'flight' field -- for which a flight. series is built.
How could I specify the retention/duration policies for the series generated by the cq above? Can I do something like:
"spaces": [
{
"name": "flight",
"retentionPolicy": "1h",
"shardDuration": "30m",
"regex": "/.*/",
"replicationFactor": 1,
"split": 1
},
...
which would give me a retention of 1h and shard duration of 30m?
I'm a bit confused about where those series are stored, which shard space?
Thanks!
P.S.: My final goal would be the following:
Have a 'window' of 15-30min max with all the flights around, process some data from them and after that period is over discard the data but in the same time move/copy it to another long-term db/series which can be used for historical purposes.
You cannot put time restrictions into the WHERE clause of a continuous query. The server will generate the time restrictions as needed when the CQ runs and must ignore all others. I suspect if you leave out the time restriction the rest of the WHERE clause will be fine.
I don't believe CQs in 0.8 require an aggregation in the SELECT, but you do need to have GROUP BY clause to tell the CQ how often to run. I'm not sure what you would GROUP BY, perhaps the flight?
You can specify a different retention policy when writing to the new series but not a new database. In 0.8 the retention policy for a series is determined by regex matching on the series name. As long as you select a series name correctly it will go into your desired retention policy.
EDIT: updates for new questions
How could I specify the retention/duration policies for the series
generated by the cq above?
In 0.8.x, the shard space to which a series belongs controls the retention policy. The regex on the shard space determines which series belong to that shard. The shard space regex is evaluated newest to oldest, meaning the first created shard space will be the last regex evaluated. Unfortunately, I do know if it is possible to create new shard spaces once the database exists. See this discussion on the mailing list for more: https://groups.google.com/d/msgid/influxdb/ce3fc641-fbf2-4b39-9ce7-77e65c67ea24%40googlegroups.com
Can I do something like:
"spaces": [
{
"name": "flight",
"retentionPolicy": "1h",
"shardDuration": "30m",
"regex": "/.*/",
"replicationFactor": 1,
"split": 1
}, ... which would give me a retention of 1h and shard duration of 30m?
That shard space would have a shard duration of 30 minutes, retaining data for 1 hour, meaning any series would only exist in three shards, the current hot shard, the current cold shard, and the shard waiting for deletion.
The regex is /./, meaning it would match any series, not just the 'flight.' series. Perhaps /flight../ is a better regex if you only want those series generated by the CQ in that shard space.

Query Performance in Access 2007 - drawing on a SQL Server Express backend

I've been banging my head on this issue for a little while now, and decided I should ask for help. I have a table which holds temperature/humidity chart recorder data (currently over 775,000 records) from which I am trying to run a statistical query against it. The problem is that this often will take up to two minutes, and sometimes will not come back at all - causing me to force close the program (Control-Alt-Delete). At first, I didn't have as much of a problem - it was only after I hit the magical 500k records mark that I started getting serious slowdowns, getting progressively worse as more data was compiled and imported into the table.
Here is the query (pass-through):
SELECT dbo.tblRecorderLogs.strAreaAssigned, Min(dbo.tblRecorderLogs.datDateRecorded) AS FirstRecorderDate, Max(dbo.tblRecorderLogs.datDateRecorded) AS LastRecordedDate,
Round(Avg(dbo.tblRecorderLogs.intTempCelsius),2) AS AverageTempC,
Round(Avg(dbo.tblRecorderLogs.intRHRecorded),2) AS AverageRH,
Count(dbo.tblRecorderLogs.strAreaAssigned) AS Records
FROM dbo.tblRecorderLogs
GROUP BY dbo.tblRecorderLogs.strAreaAssigned
ORDER BY dbo.tblRecorderLogs.strAreaAssigned;
Here is the table structure in which the chart data is stored:
idRecorderDataID Number Primary Key
datDateEntered Date/Time (indexed, duplicates OK)
datTimeEntered Date/Time
intTempCelcius Number
intDewPointCelcius Number
intWetBulbCelcius Number
intMixingGPP Number
intRHRecorded Number
strAssetRecorder Text (indexed, duplicates OK)
strAreaAssigned Text (indexed, duplicates OK)
I am trying to write a program which will allow people to pull data from this table based on Area Assigned, as well as start and end dates. With the dataset size I currently have, this kind of report is simply too much for it to handle (it seems) and the machine doesn't ever return an answer. I've had to extend the ODBC timeout to almost 180 seconds in any queries dealing with this table, simply because of the size. I could use some serious help, if people have some. Thank you in advance!
-- Edited 08/13/2012 # 1050 hours --
I have not been able to test the query on the SQL Server due to the fact that the IT department has taken control of the machine in question, and has someone logged into it full-time using the remote management console. I have tried an interim step to lessen the impact of the performance issue, but I am still looking for a permanent solution to this issue.
Interim step:
I created a local table mirroring the structure of the dbo.tblRecorderLogs SQL Server table, to which I do a INSERT INTO using the former SELECT statement as it's subquery. Then any subsequent statistical analysis is drawn from this 'temporary' local table. After the process is complete, the local table is truncated.
-- Edited 08/13/2012 # 1217 hours --
Ran the shown query on the SQL Server Management Console, took 1 minute 38 seconds to complete according to the query timer provided by the console.
-- Edit 08/15/2012 # 1531 hours --
Tried to run query as VBA DoCmd.RunSQL statement to populate a temporary table using the following code:
INSERT INTO tblTempRecorderDataStatsByArea ( strAreaAssigned, datFirstRecord,
datLastRecord, intAveTempC, intAveRH, intRecordCount )
SELECT dbo_tblRecorderLogs.strAreaAssigned, Min(dbo_tblRecorderLogs.datDateRecorded)
AS MinOfdatDateRecorded, Max(dbo_tblRecorderLogs.datDateRecorded) AS MaxOfdatDateRecorded,
Round(Avg(dbo_tblRecorderLogs.intTempCelsius),2) AS AveTempC,
Round(Avg(dbo_tblRecorderLogs.intRHRecorded),2) AS AveRHRecorded,
Count(dbo_tblRecorderLogs.strAreaAssigned) AS CountOfstrAreaAssigned FROM
dbo_tblRecorderLogs GROUP BY dbo_tblRecorderLogs.strAreaAssigned ORDER BY
dbo_tblRecorderLogs.strAreaAssigned
The problem arises when the code is executed, the query takes so long - it encounters Timeout before it finishes. Still hoping for a 'magic bullet' to fix this...
-- Edited 08/20/2012 # 1241 hours --
The only 'quasi' solution I've found is running the failed query repeatedly (sort of priming the pump, as it were) so that when the query is called again by my program - it has a relative chance of actually completing before the ODBC SQL Server driver times out. Basically, a filthy filthy hack - but I don't have a better one to combat this issue.
I've tried creating a view, which works on the server side - but doesn't speed things up.
The proper fields being aggregated are indexed properly, so I can't make any changes there.
I am only pulling information from the database that is immediately useful to user - no 'SELECT * madness' going on here.
I think I am, officially, out of things to try - aside from throwing raw computing horsepower at the problem, which isn't a solution right now as the item isn't live, and I have no budget to procure better hardware. I will post this as an 'answer' and leave it up until Sept 3rd - where if I do not have better answers, I will accept my own answer and accept defeat.
When I've had to run min/max functions on several fields from the same table I've often found it quicker to do each column separately as a subquery in the from line of the main/outer query.
So your query would be like this:
SELECT rLogs1.strAreaAssigned, rLogs1.FirstRecorderDate, rLogs2.LastRecorderDate, rLog3.AverageTempC, rLogs4.AverageRH, rLogs5.Records
FROM (((
(SELECT strAreaAssigned, min(datDateRecorded) as FirstRecorderDate FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs1
inner join
(SELECT strAreaAssigned, Max(datDateRecorded) as LastRecordedDate, FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs2
on rLogs1.strAreaAssigned = rLogs2.strAreaAssigned)
inner join
(SELECT strAreaAssigned, Round(Avg(intTempCelsius),2) AS AverageTempC, FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs3
on rLogs1.strAreaAssigned = rLogs3.strAreaAssigned)
inner join
(SELECT strAreaAssigned, Round(Avg(intRHRecorded),2) AS AverageRH, FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs4
on rLogs1.strAreaAssigned = rLogs4.strAreaAssigned)
inner join
(SELECT strAreaAssigned, Count(strAreaAssigned) AS Records, FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs5
on rLogs1.strAreaAssigned = rLogs5.strAreaAssigned
ORDER BY rLogs1.strAreaAssigned;
If you take your query and the one above, copy them into the same query window in SQL Server and run the estimated execution plan you should be able to compare them and see which one works better.

MDX MEMBER causing NON EMPTY to not filter

I'm using an MDX query to pull information to support a set of reports. A high degree of detail is required for the reports so they take some time to generate. To speed up the access time we pull the data we need and store it in a flat Oracle table and then connect to the table in Excel. This makes the reports refresh in seconds instead of minutes.
Previously the MDX was generated and run by department for 100 departments and then for a number of other filters. All this was done in VB.Net. The requirements for filters have grown to the point where this method is not sustainable (and probably isn't the best approach regardless).
I've built the entire dataset into one MDX query that works perfectly. One of my sets that I cross join includes members from three different levels of hierarchy, it looks like this:
(
Descendants([Merch].[Merch CHQ].[All], 2),
Descendants([Merch].[Merch CHQ].[All], 3),
[Merch].[Merch CHQ].[Department].&[1].Children
)
The problem for me is in our hierarchy (which I can't change), each group (first item) and each department (second item) have the same structure to their naming, ie 15-DeptName and it's confusing to work with.
To address it I added a member:
MEMBER
[Measures].[Merch Level] AS
(
[Merch].[Merch CHQ].CurrentMember.Level.Name
)
Which returns what type the member is and it works perfectly.
The problem is that it updates for every member so none of the rows get filtered by NON BLANK, instead of 65k rows I have 130k rows which will hurt my access performance.
Can my query be altered to still filter out the non blanks short of using IIF to check each measurement for null?
You can specify Null for your member based on your main measure like:
MEMBER
[Measures].[Merch Level] AS
IIf(IsEmpty([Measures].[Normal Measure]),null,[Merch].[Merch CHQ].CurrentMember.Level.Name)
That way it will only generate when there is data. You can go further and add additional dimensions to the empty check if you need to get more precise.

Resources