Is it possible to implement left outer join in Flink's CoGroupedStreams.coGroup() - apache-flink

In reference to post - Flink co group outer join fails with High Backpressure
would "CalculateCoGroupFunction.coGroup" method implementation support left outer join on multiple streams (in my use case I have three stream sources that would need to be joined). I would appreciate if you could share any examples you may have on left outer joins using coGroup().
Thanks in advance!
DataStream<Message> pStream =
stream1
.coGroup(stream2)
.where(obj -> obj.getid())
.equalTo(ev -> ev.getid())
.window(TumblingEventTimeWindows.of(Time.minutes(Constants.VALIDTY_WINDOW_MIN)))
.evictor(TimeEvictor.of(Time.minutes(Constants.VALIDTY_WINDOW_MIN)))
.apply(new CalculateCoGroupFunction());
Looking for a working example of coGroup implementation for left outer joins on multiple streams ( > 3 streams)

You can do a union on stream 2 and 3 (right side), then coGroup on stream1. Not sure if this is what you are looking for.
stream1
.coGroup(stream2.union(stream3))
...

Related

Apache Flink - enable join ordering

I have noticed that Apache Flink does not optimise the order in which the tables are joined. At the moment, it keeps the user-specified join order (basically, it takes the the query literally). I suppose that Apache Calcite can optimise the order of joins but for some reason these rules are not in use in Apache Flink.
If, for example, we have two tables 'R' and 'S'
private val tableEnv: BatchTableEnvironment = TableEnvironment.getTableEnvironment(env)
private val fileNumber = 1
tableEnv.registerTableSource("R", getDataSourceR(fileNumber))
tableEnv.registerTableSource("S", getDataSourceS(fileNumber))
private val r = tableEnv.scan("R")
private val s = tableEnv.scan("S")
and we suppose that 'S' is empty and we want to join these tables in two ways:
val tableOne = r.as("x1, x2").join(r.as("x3, x4")).where("x2 === x3").select("x1, x4")
.join(s.as("x5, x6")).where("x4 === x5 ").select("x1, x6")
val tableTwo = s.as("x1, x2").join(r.as("x3, x4")).where("x2 === x3").select("x1, x4")
.join(r.as("x5, x6")).where("x4 === x5 ").select("x1, x6")
If we want to count the number of rows in tableOne and in tableTwo the result will be zero in both cases.
The problem is that evaluating tableOne will take much longer than evaluating tableTwo.
Is there any way by which we can automatically optimise the order of how the join are executed, or even enable a possible plan cost operation by adding some statistics? How can these statistic can be added?
In the documentation at this link it is written that maybe it is necessary to change the Table environment CalciteConfig but it is not clear to me how to do it.
Please help.
Join reordering is not enabled because Flink does not handle statistics well. Reordering joins without somewhat accurate cardinality estimates is basically gambling. Therefore, join reordering is disabled and tables are joined in the order as provided by the user. This gives a deterministic and controllable behavior.
However, you can pass optimization rules into the optimizer by passing a TableConfig with a CalciteConfig when creating the TableEnvironment, i.e., TableEnvironment.getTableEnvironment(env, yourTableConfig). In the CalciteConfig you can add optimization rules to different optimization phases. You probably want to add JoinCommuteRule and JoinAssociateRule to the logical optimization phase. You probably also have to dig into the code to check how to pass statistics into the optimizer.

Flink: no outer joins on DataStream?

I was surprised to find that there are no outer joins for DataStream in Flink (DataStream docs).
For DataSet you have all the options: leftOuterJoin, rightOuterJoin and fullOuterJoin, apart from the regular join (DataSet docs). But for DataStream you just have the plain old join.
Is this due to some fundamental properties of the DataStream that make it impossible to have outer joins? Or maybe we can expect this in the (close?) future?
I could really use an outer join on DataStream for the problem I'm working on... Is there any way to achieve a similar behaviour?
You can implement outer joins using the DataStream.coGroup() transformation. A CoGroupFunction receives two iterators (one for each input), which serve all elements of a certain key and which may be empty if no matching element is found. This allows to implement outer join functionality.
First-class support for outer joins might be added to the DataStream API in one of the next releases of Flink. I am not aware of any such efforts at the moment. However, creating an issue in the Apache Flink JIRA could help.
One way would be to go from a stream -> table -> stream, using the following api: FLINK TABLE API - OUTER JOIN
Here is a java example:
DataStream<String> data = env.readTextFile( ... );
DataStream<String> data2Merge = env.readTextFile( ... );
...
tableEnv.registerDataStream("myDataLeft", data, "left_column1, left_column2");
tableEnv.registerDataStream("myDataRight", data2Merge, "right_column1, right_column2");
String queryLeft = "SELECT left_column1, left_column2 FROM myDataLeft";
String queryRight = "SELECT right_column1, right_column2 FROM myDataRight";
Table tableLeft = tableEnv.sqlQuery(queryLeft);
Table tableRight = tableEnv.sqlQuery(queryRight);
Table fullOuterResult = tableLeft.fullOuterJoin(tableRight, "left_column1 == right_column1").select("left_column1, left_column2, right_column2");
DataStream<Tuple2<Boolean, Row>> retractStream = tableEnv.toRetractStream(fullOuterResult, Row.class);

CTE performance on execution plan. Is it displayed two times, or two times processed?

This is the SQL with CommonTableExpression. Note, that USERS_PROJECTS_CTE used twice.
WITH USERS_PROJECTS_CTE (PRO_ID, SHOW_IAS, USERNAME)
AS
(
SELECT up.PRO_ID, up.SHOW_IAS, ISNULL(u.FIRST_NAME, '') + ' ' + ISNULL(u.SECOND_NAME, '')
FROM SFMIS07_PRO.USERS_PROJECTS up
INNER JOIN SFMIS07_ADM.USERS AS u
ON up.USER_ID = u.ID
WHERE up.IS_RESP_PERSON = 1 AND up.valid_to is null
)
SELECT up.PRO_ID,
up1.USERNAME as RESP_USER1,
up2.USERNAME as RESP_USER2,
up.COUNT_
FROM SFMIS07_PRO.PRO_RESP_USERS_KERNEL_MV AS up
LEFT JOIN USERS_PROJECTS_CTE AS up1 ON up.PRO_ID = up1.PRO_ID AND up1.SHOW_IAS=1
LEFT JOIN USERS_PROJECTS_CTE AS up2 ON up.PRO_ID = up2.PRO_ID AND up2.SHOW_IAS=0
The Execution Plan. Note that CTE displayed twice:
Questions:
am I right that CTE is not only displayed twice but processed twice?
is it possible to inform QO to reuse CTE ?
is it possible for QO in principle to detect "the same SQL fragment" and reuse results (I imagine the realization of this - by coping already prepared data)?
how to optimize the query (without using temporal tables :) ?
Am I right that CTE is not only displayed twice but processed twice?
Yes
Is it possible to inform QO to reuse CTE ?
Not directly but there are some hacks to encourage this.
is it possible for QO in principle to detect "the same SQL fragment"
and reuse results (I imagine the realization of this - by coping
already prepared data)?
In principle yes. See Microsoft Research Paper Efficient Exploitation of Similar Subexpressions for Query
Processing for examples.
how to optimize the query (without using temporal tables :) ?
The most reliable way would be to use a temporary (not temporal) table. See Provide a hint to force intermediate materialization of CTEs or derived tables for a more hacky workaround.

Non standard join

I ran into a problem and was trying to find a general solution to it as a join.
I have 2 tables :
http://pastebin.com/q5yws5Ym (not sure how to enforce the styling)
And i want to generate something like
http://pastebin.com/GscBUrYS
(while there are more parameters i'm interested in how i would do it for something like this)
While i was able to reach a similar effect with self-joins and equi-joins it would generate a lot of unneeded rows, which i'm not sure how to delete automatically.
Try something along the lines of:
SELECT user.user_id, j1.user_param, j1.user_value, j2.user_param, j2.user_value
FROM user
JOIN Users_info j1 ON user.user_id = j1.user_id
JOIN users_info j2 on user.user_id = j2.user_id
where j1.user_param != j2.user_param
GROUP BY user.user_id
It may be possible that you will need some more "exclusion" clauses for the where to make sure that every row is only selected once but the general idea should work (for a given and limited number of different user_param`s).

PostGIS geometry database query

I have several tables that contain multipolygons. I need to find points within these polygons that I can use in my java test class. What I have been doing is sending a query to return all the multi polygons, choose a vertex to use as a point, and most times it works.
However these tables represent risk data, 1 in 100, 1 in 200 etc, and so some of the points are shared between tables (the higher risk multi polygons are encapsulated by the lower risk). what query can I use to return a point that will be within 1 multipolygon in 1 table, but not in any others that I specify?
the tables are river_100_1k, river_200_1k, and river_1000_1k
Well you could do a multiple left join:
SELECT a.gid, a.the_geom FROM pointsTable a
LEFT JOIN river_100_1k b
ON ST_Intersects(a.the_geom, b.the_geom)
LEFT JOIN
river_200_1k c
ON NOT ST_Intersects(a.the_geom, c.the_geom) -- Not Intersects
LEFT JOIN
river_1000_1k d
ON NOT ST_Intersects(a.the_geom, d.the_geom) -- Not Intersects
WHERE
AND c.gid IS NULL AND d.gid IS NULL AND b.gid=2 AND c.gid=2 AND d.gid=2 ;
I'm not sure if I understand correctly but this is the path you should take.
Use ST_PointOnSurface(polygon) to get a point within a polygon.

Resources