Is there a Flink Table API equivalent to Window Functions using row_number(), rank(), dense_rank()? - apache-flink

In an attempt to discover the possibilities and limitations of the Flink Table API for use in a current project, I am trying to translate a Flink SQL statement into its equivalent Flink Table API version.
For the most part, I am able to translate the statement using the documentation, except for the window function row_number().
Flink SQL (working)
final Table someTable = tableEnvironment.sqlQuery("SELECT" +
        " T.COLUMN_A," +
        " T.COLUMN_B," +
        " T.COLUMN_C," +
        " row_number() OVER (" +
        " PARTITION BY" +
        " T.COLUMN_A" +
        " ORDER BY" +
        " T.EVENT_TIME DESC" +
        " ) AS ROW_NUM" +
        " FROM SOME_TABLE T")
    .where($("ROW_NUM").isEqual(1))
    .select(
        $("COLUMN_A"),
        $("COLUMN_B"),
        $("COLUMN_C"));
The closest I get is the code below, but I can't seem to find what should be placed at the location of the question marks (/* ??? */).
Flink Table API (not working)
final Table someTable = tableEnvironment.from("SOME_TABLE")
    .window(Over.partitionBy($("COLUMN_A"))
        .orderBy($("EVENT_TIME").desc())
        .as($("window")))
    .select(
        $("COLUMN_A"),
        $("COLUMN_B"),
        $("COLUMN_C"),
        /* ??? */.over($("window")).as("ROW_NUM"))
    .where($("ROW_NUM").isEqual(1));
On https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tableapi/#over-window-aggregation I can see how it works for other window functions such as avg(), min(), and max(), but the ones I require (row_number(), rank(), dense_rank()) are not (yet) described on that page.
My question is twofold:
Does an equivalent exist in the Flink Table API?
If so, what does it look like?
Additional information:
The Flink SQL variant works without issues (for this specific part).
I am experimenting with Flink 1.15.1.
Thank you in advance for your help!

The page where you can look this up is at https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/functions/systemfunctions/. You will see that ROW_NUMBER, RANK, and DENSE_RANK have examples for SQL, but not for Table API.
In the end, it shouldn't matter though. As you've done, you can just use SQL directly in your Table API program.
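For example, here is a minimal sketch of mixing the two in Flink 1.15 (intermediateResult is an illustrative name for a Table produced by earlier Table API calls; the column names are taken from the question). You can register any intermediate Table as a temporary view, express the deduplication in SQL, and keep chaining Table API operations on the result:
// Register an intermediate Table API result so SQL can reference it
// (names here are illustrative, not from the question).
tableEnvironment.createTemporaryView("INTERMEDIATE", intermediateResult);

// Express the deduplication, which has no Table API counterpart yet, in SQL.
final Table ranked = tableEnvironment.sqlQuery(
        "SELECT COLUMN_A, COLUMN_B, COLUMN_C, " +
        " ROW_NUMBER() OVER (" +
        "   PARTITION BY COLUMN_A ORDER BY EVENT_TIME DESC) AS ROW_NUM " +
        "FROM INTERMEDIATE");

// Continue with regular Table API calls on the SQL result.
final Table result = ranked
    .where($("ROW_NUM").isEqual(1))
    .select($("COLUMN_A"), $("COLUMN_B"), $("COLUMN_C"));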

Related

Npgsql: Correctly performing a full text search using an expression index

Npgsql docs suggest performing a full text search based on an expression index using ToTsVector
.Where(p => EF.Functions.ToTsVector("english", p.Title + " " + p.Description).Matches("Npgsql"))
As I understand expression indexes, they require that the query use the same expression that was used to create the index, i.e. "Name" || ' ' || "Description".
However, it seems to me that p.Title + " " + p.Description is evaluated before being translated to SQL, since ToTsVector takes a plain string:
public static NpgsqlTsVector ToTsVector(this DbFunctions _, string config, string document);
Am I wrong or will the index not be utilized? If I'm correct, is there a way to query correctly without using raw SQL?
First, you may want to look at the other method, i.e. setting up a TsVector column with HasGeneratedTsVectorColumn.
Regardless, p.Title + " " + p.Description definitely isn't evaluated before being translated to SQL - that can't happen assuming p refers to a database column. If you turn on SQL logging, you should see the exact SQL being generated by EF Core against your database. To be extra sure that the query uses your expression index, you can use EXPLAIN on that SQL and examine the query plan.

SSRS: is there a way of passing the values of a parameter to a "Values" keyword in the SQL query?

I would like to know the correct method for passing the values of a parameter to a "VALUES" keyword in the SQL in the underlying dataset. I'm using Microsoft Report Builder v3.0, querying an MS-SQL database.
At the moment, after a lot of googling and stack-overflowing, I have come up with the following nicely-working SQL in order to find patients with diagnosis codes starting with either "AB" or "XC":
SELECT
    x.PatientId
FROM
    (
        VALUES
            ('AB%'),
            ('XC%')
    ) AS v (pattern)
CROSS APPLY
    (
        SELECT
            p.PatientId,
            p.LastName
        FROM
            dbo.Patient p
            INNER JOIN Course c ON (c.PatientSer = p.PatientSer)
            INNER JOIN CourseDiagnosis cd ON (cd.CourseSer = c.CourseSer)
            INNER JOIN Diagnosis diag ON (diag.DiagnosisSer = cd.DiagnosisSer)
        WHERE
            diag.DiagnosisCode LIKE v.pattern
    ) AS x
;
However, what I want is for the patterns searched for, as supplied to the "VALUES" keyword, to be generated when the user selects a drop-down box corresponding to a particular group of patterns. I have used a parameter for this named @Diagnoses, with the label "Grouping1" (there will be other groupings later; I intend to make the parameter multi-valued) and the value "'AB%', 'XC%'", but this doesn't work: the report runs but returns nothing at all, so clearly I'm doing something wrong.
I have tried to avoid specifying these diagnosis codes directly in the WHERE clause using the "OR" keyword, as everything I can find along those lines seems to involve separately declared functions, and the pattern-specification / cross-apply solution seemed the neatest.
Can someone help me out?
Thanks in Advance.
You can use a JOIN to combine your parameter values and use the Dataset Expression to build the query text.
="SELECT x.PatientId FROM (VALUES ('" & JOIN(Parameters!VALUES.Value, "'),('") & "') ) AS v (pattern) " & VBCRLF &
"CROSS APPLY " & VBCRLF &
<rest of your query>
and the resulting part of the query is:
SELECT x.PatientId FROM (VALUES ('AB%'),('XC%') ) AS v (pattern)

Optimized Top-N query using Flink SQL

I'm trying to run a streaming top-n query using Flink SQL but can't get the "optimized version" outlined in the Flink docs working. The setting is as follows:
I've got a Kafka topic where each record contains a tuple (GUID, reached score, maximum possible score). Think of it as a student taking an assessment, with the tuple representing how many points they achieved.
What I want to get is a list of the five GUIDs with the highest score measured as a percentage (i.e. sorted by SUM(reached_score) / SUM(maximum possible score)).
I started by aggregating the scores and grouping them by GUID:
EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, bsSettings);
Table scores = tableEnv.fromDataStream(/* stream from kafka */, "guid, reached_score, max_score");
tableEnv.registerTable("scores", scores);
Table aggregatedScores = tableEnv.sqlQuery(
        "SELECT " +
        " guid, " +
        " SUM(reached_score) as reached_score, " +
        " SUM(max_score) as max_score, " +
        " SUM(reached_score) / CAST(SUM(max_score) AS DOUBLE) as score " +
        "FROM scores " +
        "GROUP BY guid");
tableEnv.registerTable("agg_scores", aggregatedScores);
The resulting table contains an unsorted list of aggregated scores. I then tried to feed it into the Top-N query as it is used in the Flink documentation:
Table topN = tableEnv.sqlQuery(
        "SELECT guid, reached_score, max_score, score, row_num " +
        "FROM (" +
        " SELECT *," +
        " ROW_NUMBER() OVER (ORDER BY score DESC) as row_num" +
        " FROM agg_scores)" +
        "WHERE row_num <= 5");
tableEnv.toRetractStream(topN, Row.class).print();
This query runs about as expected and results in multiple updates if the order of the elements changes.
// add first entry
6> (true,63992935-9684-4285-8c2b-1fd57b51b48f,97,200,0.485,1)
// add a second entry with lower score below the first one
7> (true,d7847f58-a4d9-40f8-a38d-161821b48481,67,200,0.335,2)
// update the second entry with a much higher score
8> (false,d7847f58-a4d9-40f8-a38d-161821b48481,67,200,0.335,2)
1> (true,d7847f58-a4d9-40f8-a38d-161821b48481,229,400,0.5725,1)
3> (true,63992935-9684-4285-8c2b-1fd57b51b48f,97,200,0.485,2)
2> (false,63992935-9684-4285-8c2b-1fd57b51b48f,97,200,0.485,1)
I then followed the advice from the docs and removed the row_number from the projection:
Table topN = tableEnv.sqlQuery(
        "SELECT guid, reached_score, max_score, score " +
        "FROM (" +
        " SELECT *," +
        " ROW_NUMBER() OVER (ORDER BY score DESC) as row_num" +
        " FROM agg_scores)" +
        "WHERE row_num <= 5");
Running it on a similar dataset:
// add first entry
4> (true,63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56)
// add a second entry with lower score below the first one
5> (true,d7847f58-a4d9-40f8-a38d-161821b48481,76,200,0.38)
// update the second entry with a much higher score
7> (true,d7847f58-a4d9-40f8-a38d-161821b48481,354,400,0.885)
1> (true,63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56) <-- ???
8> (false,63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56) <-- ???
6> (false,d7847f58-a4d9-40f8-a38d-161821b48481,76,200,0.38)
What I don't understand is:
why the first entry (63992935-9684-4285-8c2b-1fd57b51b48f) is removed and added again / still touched at all
why the second entry gets added first (a second time) and then removed. Wouldn't this result in it being technically removed from the data stream?
Both are obviously related to the order of the sorting changing, but isn't this what the optimized top-n query (written further down in the documentation) is supposed to solve?
I've checked this issue and can also reproduce it in my local environment. I did some investigation, and the reason for this is:
"we didn't do such optimization for some scenarios, and your case seems to be one of them".
However, according to the user documentation, I think it's a valid request to include such an optimization for your scenario as well. It looks like a BUG to me: we claimed some optimizations that don't work out here.
I've created an issue, https://issues.apache.org/jira/browse/FLINK-15497, to track this; hopefully we can fix it in the upcoming 1.9.2 and 1.10.0 versions.
Thanks for reporting this.

Dynamic SQL Query in Flink

I have a SQL query like this
String ipdetailsSql = "select sid, _zpsbd6 as ip_address, ssresp, reason, " +
        "SUM(CASE WHEN botcode='r1' THEN 1 ELSE 0 END ) as icf_count, " +
        "SUM(CASE WHEN botcode='r2' THEN 1 ELSE 0 END ) as dc_count, " +
        "SUM(CASE WHEN botcode='r5' THEN 1 ELSE 0 END ) as badua_count, " +
        "COUNT(*) as hits, TUMBLE_START(ts, INTERVAL '1' MINUTE) AS fseen " +
        "from sourceTopic " +
        "GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE), sid, _zpsbd6, ssresp, reason";
Based on the user input, I want to change botcode='r1' to the given input, say botcode='r10', without restarting the job.
Is there a way to do this? I am on Flink 1.7 using the streaming environment. I tried a config stream to read the inputs,
but I am stuck at how to change the query on the fly. Can anyone help me with this? Thanks in advance.
A stream SQL query isn't something that is executed once and is done, but rather is a declarative expression of a continuous computation. It's not possible to make arbitrary changes to that computation without starting a new job with a new query.
In simple cases, though, there are things you can do. You might consider whether it could work to join your sourceTopic with another stream that effectively provides some query parameters. Or you might find it affordable to compute all conceivably desired results and then select the actually desired ones downstream, as sketched below.
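As a rough sketch of that second idea (illustrative only; tableEnv is assumed to be the StreamTableEnvironment of your job, the query is trimmed down from the one above, and the downstream filtering against the user's current input would live in a separate operator, e.g. a CoProcessFunction connected to the config stream):
// Aggregate per (sid, botcode) for every rule up front; since the query
// never mentions a concrete rule id, it never needs to change.
String allRulesSql =
        "SELECT sid, _zpsbd6 AS ip_address, botcode, COUNT(*) AS hits, " +
        "TUMBLE_START(ts, INTERVAL '1' MINUTE) AS fseen " +
        "FROM sourceTopic " +
        "GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE), sid, _zpsbd6, botcode";

Table perRuleCounts = tableEnv.sqlQuery(allRulesSql);
// A downstream filter then keeps only the botcode values the user currently
// cares about, updated at runtime from the config stream.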

Apache Flink: How to use DISTINCT in a TUMBLE time window?

I have a stream like this: <_time(timestamp), uri(string), userId(int)>.
The _time attribute is rowtime and I register it as a table:
tableEnv.registerDataStream("userVisitPage", stream, "_time.rowtime, uri,userId");
Then I query the table:
final String sql =
        "SELECT tumble_start(_time, interval '10' second) as timestart, " +
        " count(distinct userId) as uv, " +
        " uri as uri, " +
        " count(1) as pv " +
        "FROM userVisitPage " +
        "GROUP BY tumble(_time, interval '10' second), uri";
final Table table = tableEnv.sqlQuery(sql);
However, the query throws an exception:
org.apache.flink.table.codegen.CodeGenException: Unsupported call: TUMBLE
If you think this function should be supported, you can create an issue and start a discussion for it.
at org.apache.flink.table.codegen.CodeGenerator$$anonfun$visitCall$3.apply(CodeGenerator.scala:1006)
at org.apache.flink.table.codegen.CodeGenerator$$anonfun$visitCall$3.apply(CodeGenerator.scala:1006)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.flink.table.codegen.CodeGenerator.visitCall(CodeGenerator.scala:1006)
at org.apache.flink.table.codegen.CodeGenerator.visitCall(CodeGenerator.scala:67)
at org.apache.calcite.rex.RexCall.accept(RexCall.java:107)
at org.apache.flink.table.codegen.CodeGenerator.generateExpression(CodeGenerator.scala:234)
at org.apache.flink.table.codegen.CodeGenerator$$anonfun$7.apply(CodeGenerator.scala:321)
at org.apache.flink.table.codegen.CodeGenerator$$anonfun$7.apply(CodeGenerator.scala:321)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.flink.table.codegen.CodeGenerator.generateResultExpression(CodeGenerator.scala:321)
at org.apache.flink.table.plan.nodes.CommonCalc$class.generateFunction(CommonCalc.scala:44)
at org.apache.flink.table.plan.nodes.datastream.DataStreamCalc.generateFunction(DataStreamCalc.scala:43)
at org.apache.flink.table.plan.nodes.datastream.DataStreamCalc.translateToPlan(DataStreamCalc.scala:116)
at org.apache.flink.table.plan.nodes.datastream.DataStreamGroupAggregate.translateToPlan(DataStreamGroupAggregate.scala:113)
at org.apache.flink.table.plan.nodes.datastream.DataStreamGroupAggregate.translateToPlan(DataStreamGroupAggregate.scala:113)
at org.apache.flink.table.plan.nodes.datastream.DataStreamCalc.translateToPlan(DataStreamCalc.scala:97)
at org.apache.flink.table.api.StreamTableEnvironment.translateToCRow(StreamTableEnvironment.scala:837)
at org.apache.flink.table.api.StreamTableEnvironment.translate(StreamTableEnvironment.scala:764)
at org.apache.flink.table.api.StreamTableEnvironment.translate(StreamTableEnvironment.scala:734)
at org.apache.flink.table.api.java.StreamTableEnvironment.toRetractStream(StreamTableEnvironment.scala:414)
at org.apache.flink.table.api.java.StreamTableEnvironment.toRetractStream(StreamTableEnvironment.scala:357)
How can I implement this query?
Update: Flink 1.6.0 is available and supports DISTINCT aggregates on streaming tables.
Flink (version 1.4.x) does not support SQL queries with DISTINCT aggregations on streaming tables yet. Support is targeted for Flink 1.6, which won't be released before mid-2018.
You can, however, implement a user-defined aggregate function to compute distinct counts and use that function in your queries after registering it. The query syntax will be different, of course.
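For illustration, here is a minimal sketch of such a function (the class and accumulator names are made up; the createAccumulator/getValue/accumulate contract comes from Flink's AggregateFunction API):
import org.apache.flink.table.functions.AggregateFunction;

import java.util.HashSet;
import java.util.Set;

// Distinct-count UDAGG sketch: the accumulator tracks the user ids already
// seen within the current group/window.
public class CountDistinct extends AggregateFunction<Long, CountDistinct.Acc> {

    public static class Acc {
        public Set<Integer> seen = new HashSet<>();
    }

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    @Override
    public Long getValue(Acc acc) {
        return (long) acc.seen.size();
    }

    // Invoked by the framework once per input row.
    public void accumulate(Acc acc, Integer userId) {
        if (userId != null) {
            acc.seen.add(userId);
        }
    }
}
After registering it with tableEnv.registerFunction("countDistinct", new CountDistinct()), the query above would use countDistinct(userId) as uv in place of count(distinct userId).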
