I am using session windows in Flink SQL (1.13).
Is there a way (it must be in SQL, no UDFs etc.) to get the last value of a certain field (in other words, the value at window_end)?
I was trying with:
SELECT user_account_id,
SESSION_START(request_timestamp, INTERVAL '30' MINUTE) AS window_start,
SESSION_END(request_timestamp, INTERVAL '30' MINUTE) AS window_end,
LAST_VALUE(package)
FROM [table]
GROUP BY SESSION(request_timestamp, INTERVAL '30' MINUTE), user_account_id
but I am getting this error:
Could not find an implementation method 'merge' in class 'org.apache.flink.table.planner.functions.aggfunctions.LastValueAggFunction' for function 'LAST_VALUE' that matches the following signature:
void merge(org.apache.flink.table.data.RowData, java.lang.Iterable)
I guess that using window functions (OVER(...)) would not work here.
Any hints appreciated!
I've got the answer that this is not supported yet.
A workaround would be to create a custom user-defined aggregate function (UDAGG).
Apart from that, a Jira ticket has been opened for this functionality.
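A minimal sketch of such a UDAGG, assuming the "last" value is the one with the greatest timestamp (the class names and the String/Long column types are assumptions). The key detail is the merge method: session windows merge window panes as rows arrive, which is exactly what the built-in LastValueAggFunction does not implement:

import org.apache.flink.table.functions.AggregateFunction;

// Accumulator: the value seen so far together with its timestamp.
public static class LastValueAcc {
    public String value;
    public long ts = Long.MIN_VALUE;
}

public static class LastValueByTs extends AggregateFunction<String, LastValueAcc> {

    @Override
    public LastValueAcc createAccumulator() {
        return new LastValueAcc();
    }

    @Override
    public String getValue(LastValueAcc acc) {
        return acc.value;
    }

    // Keep the value belonging to the largest timestamp seen so far.
    public void accumulate(LastValueAcc acc, String value, Long ts) {
        if (ts != null && ts > acc.ts) {
            acc.ts = ts;
            acc.value = value;
        }
    }

    // Required for session windows, which merge panes as rows arrive.
    public void merge(LastValueAcc acc, Iterable<LastValueAcc> others) {
        for (LastValueAcc other : others) {
            if (other.ts > acc.ts) {
                acc.ts = other.ts;
                acc.value = other.value;
            }
        }
    }
}

After registering it, the LAST_VALUE(package) call in the query above would become something like last_value_by(package, request_timestamp).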
Related
I am trying to perform aggregation using the Flink Table API by accepting the group-by field and the aggregation expressions as string parameters from the user.
Input
GroupBy Field = department
Aggregation Field Expressions = count(employeeId), max(salary)
Is there any way we can do this using the Flink Table API? I tried the following, but it didn't help. Does Flink have anything equivalent to Spark's selectExpr function?
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.selectExpr.html
employeeTable
.groupBy($("department"))
.select(
$("department"),
$("count(employeeId)").as("numberOfEmployees"),
$("max(salary)").as("maxSalary")
)
It is throwing the following exception
Exception in thread "main" org.apache.flink.table.api.ValidationException: Cannot resolve field [count(employeeId)], input field list:[department].
No, I don't believe this will work. Flink's SQL planner wants to know what the query is doing at compile time.
What you can do is construct a SQL query and create a new job to run that query. The SQL gateway that's coming in Flink 1.16 (see FLIP-91) should make this easier.
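For example, a rough sketch of that approach, building the SQL text from the user-supplied strings and submitting it (the employees table name and the environment setup are assumptions, and real code would need to validate the user input before concatenating it into SQL):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv = TableEnvironment.create(
    EnvironmentSettings.newInstance().inStreamingMode().build());

// User-supplied strings from the question.
String groupByField = "department";
String aggExpressions = "count(employeeId) AS numberOfEmployees, max(salary) AS maxSalary";

// "employees" is assumed to be a table registered in the catalog.
Table result = tEnv.sqlQuery(
    "SELECT " + groupByField + ", " + aggExpressions
        + " FROM employees GROUP BY " + groupByField);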
I think you have the wrong syntax.
.select(
$("department"),
$("count(employeeId)").as("numberOfEmployees"),
$("max(salary)").as("maxSalary")
)
You should call count and max like this:
$("employeeId").count().as("numberOfEmployees"),
$("salary").max().as("maxSalary")
You can check the built-in functions here.
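Putting it together, the corrected query would be:

Table result = employeeTable
    .groupBy($("department"))
    .select(
        $("department"),
        $("employeeId").count().as("numberOfEmployees"),
        $("salary").max().as("maxSalary")
    );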
I would like to emit the last record of a time window. This can easily be done with maxBy in the regular DataStream API, but I cannot get it to work through the SQL API. What I want is:
SELECT LAST(attribute) FROM [table]
GROUP BY key, TUMBLE(ts, INTERVAL '1' DAY)
which behaves similar to
ds.keyBy(key)
.window(TumblingEventTimeWindows.of(Time.days(1)))
.maxBy(x -> x.getTs())
Any way to achieve that in SQL API?
I don't think there's a built-in function for this in Flink yet, but you could implement a user-defined aggregate function for this.
You need to adjust the query a little bit and pass the timestamp field to the aggregation function, because SQL does not assume an order among the rows of a GROUP BY group:
SELECT last_by(attribute, ts) FROM [table]
GROUP BY key, TUMBLE(ts, INTERVAL '1' DAY)
Refer to the documentation for details on how to implement and register a user-defined aggregate function.
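In recent Flink versions, registering and calling such a function could look like this (a sketch; LastBy would be an AggregateFunction that keeps the value paired with the largest ts, along the same lines as the session-window sketch earlier, and my_table stands in for [table]):

// Register the UDAGG under the name used in the query.
tEnv.createTemporarySystemFunction("last_by", LastBy.class);

Table result = tEnv.sqlQuery(
    "SELECT key, last_by(attribute, ts) "
        + "FROM my_table "
        + "GROUP BY key, TUMBLE(ts, INTERVAL '1' DAY)");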
There is a built-in LAST_VALUE function in Flink.
Check: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/systemfunctions/
I'm trying to do an exponentially decaying moving average over a hopping window in Flink SQL. I need access to one of the borders of the window, the HOP_START in the following:
SELECT
lb_index one_key,
-- I have access to this one:
HOP_START(proctime, INTERVAL '0.05' SECOND, INTERVAL '5' SECOND) start_time,
-- Aggregation primitive:
SUM(
Y * EXP(TIMESTAMPDIFF(
SECOND,
proctime,
-- This one throws:
HOP_START(proctime, INTERVAL '0.05' SECOND, INTERVAL '5' SECOND)
)))
FROM write_position
GROUP BY lb_index, HOP(proctime, INTERVAL '0.05' SECOND, INTERVAL '5' SECOND)
I'm getting the following stack trace:
11:55:37.011 [main] DEBUG o.a.c.p.RelOptPlanner - For final plan, using Aggregate(groupBy: (lb_index), window: (SlidingGroupWindow('w$, 'proctime, 5000.millis, 50.millis)), select: (lb_index, SUM($f2) AS Y, start('w$) AS w$start, end('w$) AS w$end, proctime('w$) AS w$proctime))
11:55:37.011 [main] DEBUG o.a.c.p.RelOptPlanner - For final plan, using Calc(select: (lb_index, proctime, *(payload.Y, EXP(/(CAST(/INT(Reinterpret(-(HOP_START(PROCTIME(proctime), 50, 5000), PROCTIME(proctime))), 1000)), 1000))) AS $f2))
11:55:37.011 [main] DEBUG o.a.c.p.RelOptPlanner - For final plan, using rel#459:DataStreamScan.DATASTREAM.true.Acc(table=[_DataStreamTable_0])
Exception in thread "main" org.apache.flink.table.codegen.CodeGenException: Unsupported call: HOP_START
If you think this function should be supported, you can create an issue and start a discussion for it.
at org.apache.flink.table.codegen.CodeGenerator$$anonfun$visitCall$3.apply(CodeGenerator.scala:1027)
at org.apache.flink.table.codegen.CodeGenerator$$anonfun$visitCall$3.apply(CodeGenerator.scala:1027)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.flink.table.codegen.CodeGenerator.visitCall(CodeGenerator.scala:1027)
at org.apache.flink.table.codegen.CodeGenerator.visitCall(CodeGenerator.scala:66)
It says it is unsupported, yet it works outside the aggregating SUM. That's what makes me think this is a scoping issue.
Now, the thing is: I could transform this expression and do the final processing outside the aggregation, since exp(x+y) = exp(x)*exp(y). But I'm stuck with using TIMESTAMPDIFF (which did wonders in my previous issue): I have not found a way to cast TIME ATTRIBUTEs to NUMERIC types, and I'm not comfortable exponentiating UNIX timestamps, even if I scale them down.
Anyway, this workaround would be sort of clunky, and there might be another way. I don't know how I could massage the scopes in this SQL so as to still 'be' in the window scope and have access to the start time without throwing.
I suggest you experiment with HOP_PROCTIME() rather than HOP_START(). The differences are explained here, but the effect will be that you'll have a proctime attribute rather than a timestamp, which I'm hoping will make TIMESTAMPDIFF happy.
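Concretely, the experiment would be to swap the failing call (a sketch; whether the code generator accepts HOP_PROCTIME inside the aggregate is exactly what needs testing):

SELECT
  lb_index one_key,
  HOP_START(proctime, INTERVAL '0.05' SECOND, INTERVAL '5' SECOND) start_time,
  SUM(
    Y * EXP(TIMESTAMPDIFF(
      SECOND,
      proctime,
      -- swapped in for the HOP_START call that threw:
      HOP_PROCTIME(proctime, INTERVAL '0.05' SECOND, INTERVAL '5' SECOND)
    )))
FROM write_position
GROUP BY lb_index, HOP(proctime, INTERVAL '0.05' SECOND, INTERVAL '5' SECOND)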
Recently I have been trying to use Apache Flink for fast batch processing.
I have a table with a value column and an irrelevant index column.
Basically, I want to calculate the mean and range of every 5 rows of value. Then I am going to calculate the mean and standard deviation based on those means I just calculated. So I guess the best way is to use a Tumble window.
It looks like this
DataSet<Tuple2<Double, Integer>> rawData = {get the source data};
Table table = tableEnvironment.fromDataSet(rawData);
Table groupedTable = table
.window(Tumble.over("5.rows").on({what should I write?}).as("w"))
.groupBy("w")
.select("f0.avg, f0.max-f0.min");
{The next step is to use groupedTable to calculate overall mean and stdDev}
But I don't know what to write in .on(). I have tried "proctime" but it said there is no such input. I just want it to group rows in the order they are read from the source. But it has to be a time attribute, so I cannot use "f2", the index column, for ordering either.
Do I have to add a timestamp to do this? Is it necessary in batch processing and will it slow down the calculation? What is the best way to solve this?
Update:
I tried to use a sliding window in the Table API, and it gives me an exception.
// Calculate mean value in each group
Table groupedTable = table
.groupBy("f0")
.select("f0.cast(LONG) as groupNum, f1.avg as avg")
.orderBy("groupNum");
//Calculate moving range of group Mean using sliding window
Table movingRangeTable = groupedTable
.window(Slide.over("2.rows").every("1.rows").on("groupNum").as("w"))
.groupBy("w")
.select("groupNum.max as groupNumB, (avg.max - avg.min) as MR");
The Exception is:
Exception in thread "main" java.lang.UnsupportedOperationException: Count sliding group windows on event-time are currently not supported.
at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.createEventTimeSlidingWindowDataSet(DataSetWindowAggregate.scala:456)
at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.translateToPlan(DataSetWindowAggregate.scala:139)
...
Does that mean that sliding windows are not supported in the Table API? If I recall correctly, there are no window functions in the DataSet API. Then how do I calculate a moving range in batch processing?
The window clause is used to define a grouping based on a window function, such as Tumble or Session. Grouping every 5 rows is not well defined in the Table API (or SQL) unless you specify the order of the rows. This is done in the on clause of the Tumble function. Since this feature originates from stream processing, the on clause expects a timestamp attribute.
You can fetch the timestamp of the current time using the currentTimestamp() function. However, I should point out that Flink will sort the data, as it is not aware of the monotonic property of the function. Moreover, all of that will happen with a parallelism of 1, because there is no clause that would allow for partitioning.
Alternatively, you can also implement a user-defined scalar function that converts the index attribute into a timestamp (effectively a Long value). But again, Flink will do a full sort of the data.
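A sketch of that scalar-function alternative, assuming f1 is the index column and interpreting it as milliseconds since the epoch (the function and field names are made up); as noted, Flink will still fully sort the data:

import org.apache.flink.table.functions.ScalarFunction;
import java.sql.Timestamp;

// Interprets the index as milliseconds since the epoch, just to obtain
// a sortable timestamp column.
public static class IndexToTimestamp extends ScalarFunction {
    public Timestamp eval(long index) {
        return new Timestamp(index);
    }
}

// Register the function, derive the "time" column, and window on it.
tableEnvironment.registerFunction("idxToTs", new IndexToTimestamp());
Table grouped = table
    .select("f0, idxToTs(f1) as ts")
    .window(Tumble.over("5.rows").on("ts").as("w"))
    .groupBy("w")
    .select("f0.avg, f0.max - f0.min");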
I am currently working on a query that is registered for Query Notifications. In accordance with the rules of Notification Services, I can only use deterministic functions in queries set up for subscription. However, GetDate() (and almost any other means I can think of) is non-deterministic. Whenever I pull my data, I would like to limit the result set to only the relevant records, which is determined by the current day.
Does anyone know of a work around that I could use that would allow me to use the current date to filter my results but not invalidate the query for query notifications?
Example Code:
SELECT fcDate as RecordDate, fcYear as FiscalYear, fcPeriod as FiscalPeriod, fcFiscalWeek as FiscalWeek, fcIsPeriodEndDate as IsPeriodEnd, fcPeriodWeek as WeekOfPeriod
FROM dbo.bFiscalCalendar
WHERE fcDate >= GetDate() -- This line invalidates the query for notification...
Other thoughts:
We have an application controls table in our database that we use to store application-level settings. I had thought to write a small script that keeps a record up to date with the current smalldatetime. However, my join to this table is failing for notification as well, and I am not sure why. I surmise that it has something to do with me specifying a text type (the column name), which is frustrating.
Example Code 2:
SELECT fcDate as RecordDate, fcYear as FiscalYear, fcPeriod as FiscalPeriod, fcFiscalWeek as FiscalWeek, fcIsPeriodEndDate as IsPeriodEnd, fcPeriodWeek as WeekOfPeriod
FROM dbo.bFiscalCalendar
INNER JOIN dbo.xApplicationControls ON fcDate >= acValue AND acName = N'Cache_CurrentDate'
Does anyone have any suggestions?
EDIT: Here is a link on MSDN that gives the rules for Notification Services
As it turns out, I figured out the solution. Basically, I was invalidating my query attempts because I was casting a value as a DateTime, which marks the query as non-deterministic. Even though you don't explicitly call out a cast, doing something akin to:
RecordDate = 'date_string_value'
still ends up as a date cast. Hopefully this will help out someone else who hits this issue.
This link helped me quite a bit.
http://msdn.microsoft.com/en-us/library/ms178091.aspx
A good way to bypass this is simply to create a view that just says "SELECT GetDate() AS Now", then use the view in your query.
EDIT: I see nothing in the rules about not using user-defined functions (which is where I've used the 'view today' bit). So could you use a UDF in the query that points at the view?
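For illustration, the view-based workaround would look something like this (the view name is made up, and whether it passes the notification rules still has to be verified):

-- Wrap the non-deterministic call in a view:
CREATE VIEW dbo.vwCurrentDate AS
SELECT GetDate() AS Now;
GO

-- Then reference the view instead of calling GetDate() directly:
SELECT fcDate AS RecordDate, fcYear AS FiscalYear, fcPeriod AS FiscalPeriod
FROM dbo.bFiscalCalendar
CROSS JOIN dbo.vwCurrentDate
WHERE fcDate >= Now;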