I'm trying to coGroup two DataStreams using Flink's DataStream API.
stream1.coGroup(stream2)
.where(stream1Item -> stream1Item.field1)
.equalTo(stream2Item -> stream2Item.field1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
.apply(someCoGroupFunction)
.sinkTo(someSink)
The above code works and coGroups on stream1.field1 = stream2.field1. But how can the following kind of composite field equality be built for coGrouping?
stream1.field1 = stream2.field1 AND stream1.field2 = stream2.field2 AND (stream1.field3 = stream2.fieldN OR stream1.field4 = stream2.fieldN)
With Flink's SQL API, such composite join conditions can be written easily, for example:
LEFT JOIN ON
Table1.field1 = Table2.field1 AND
Table1.field2 = Table2.field2 AND
(Table1.field3 = Table2.fieldN OR Table1.field4 = Table2.fieldN)
But how can the same complex composite join/cogroup conditions be implemented in the DataStream API?
DataStream joins only support equality predicates, but I think you could implement the additional constraint inside the coGroupFunction. Or perhaps you could take the union of two coGroupings, one for each side of the OR.
But given that the DataStream and Table APIs are interoperable, why not just use the Table/SQL API for that part of the pipeline?
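For illustration, here is a minimal sketch of the first idea: key the coGroup on the two AND-ed fields and evaluate the OR-ed condition inside the CoGroupFunction. Item1, Item2 and Result are placeholders for your element types, and the fields are assumed to be Strings:
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

stream1.coGroup(stream2)
    // composite key covering the two AND-ed equalities
    .where(a -> a.field1 + "|" + a.field2)
    .equalTo(b -> b.field1 + "|" + b.field2)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
    .apply(new CoGroupFunction<Item1, Item2, Result>() {
        @Override
        public void coGroup(Iterable<Item1> as, Iterable<Item2> bs, Collector<Result> out) {
            for (Item1 a : as) {
                for (Item2 b : bs) {
                    // the remaining OR-ed condition, checked per candidate pair
                    if (a.field3.equals(b.fieldN) || a.field4.equals(b.fieldN)) {
                        out.collect(new Result(a, b));
                    }
                }
            }
        }
    })
    .sinkTo(someSink);
The union alternative would run two coGroupings, one keyed on (field1, field2, field3) = (field1, field2, fieldN) and one on (field1, field2, field4) = (field1, field2, fieldN), and union their outputs; pairs that satisfy both sides of the OR would then appear twice and need deduplication.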
I have inherited a very old DB and I'm just starting out with .NET Core 3 (preview) to see if I can create a proof-of-concept WebApi that uses EF Core to query the DB.
So my problem is that in this DB, much of the data is untrimmed (so has lots of white space around values). There's a mix of varchar(n) and char(n) fields. And no, I'm not in a position to re-engineer this.
So my challenge is that the WebApi will receive a query which will contain a List<People>, and I need to return all relevant data about those people.
If I were doing this in SQL I guess I would do something like:
SELECT *
FROM Table1
INNER JOIN Table2
ON RTRIM(LTRIM(Table1.[P1])) = RTRIM(LTRIM(Table2.[P9]))
I need to be able to do something similar using Entity Framework, where I would join the DB table with the .NET List<People>. However, I'm a) not sure how best to go about this with the necessary trims, and b) the number of rows in this DB is very large, so it has to be a set-based operation performed on SQL Server - I can't bring everything back to the client and join in-memory.
[EDIT (24th June)]
To add some examples of what I've tried, and to re-focus the question onto the approach I should be employing:
1 - I attempted this using a JOIN
var y = from a in query
join b in request.Items
on new { a.Prop1, a.Prop2 } equals new { b.Prop1, b.Prop2 }
select new { a.Prop1, a.Prop2, a.Prop3 };
return await y.ToListAsync().ConfigureAwait(false);
The compile error returned is:
The type of one of the expressions in the join clause is incorrect.
Type inference failed in the call to 'Join'.
Not sure why, but I suspect that this is because I'm trying to join strings to DB char fields.
Also tried:
var y = from a in query
from b in request.Items
where a.Prop1 == b.Prop1 && a.Prop2 == b.Prop2
select new {a.Prop1, a.Prop2, a.Prop3};
return await y.ToListAsync().ConfigureAwait(false);
This compiles, but gives a run-time error:
System.InvalidOperationException: 'collection selector was not
NavigationExpansionExpression'
So, yes, I do need to be able to ensure that I'm trimming the data when doing the "join", but I think the main problem is that this needs to be a server-side operation (because of the size of the data sets), while I'm doing a join between a DB table and an in-memory List on the client.
Not sure what the best approach is here (with EF Core 3).
I have noticed that Apache Flink does not optimise the order in which tables are joined. At the moment, it keeps the user-specified join order (basically, it takes the query literally). I suppose that Apache Calcite can optimise the order of joins, but for some reason these rules are not in use in Apache Flink.
If, for example, we have two tables 'R' and 'S'
private val tableEnv: BatchTableEnvironment = TableEnvironment.getTableEnvironment(env)
private val fileNumber = 1
tableEnv.registerTableSource("R", getDataSourceR(fileNumber))
tableEnv.registerTableSource("S", getDataSourceS(fileNumber))
private val r = tableEnv.scan("R")
private val s = tableEnv.scan("S")
and we suppose that 'S' is empty and we want to join these tables in two ways:
val tableOne = r.as("x1, x2").join(r.as("x3, x4")).where("x2 === x3").select("x1, x4")
.join(s.as("x5, x6")).where("x4 === x5 ").select("x1, x6")
val tableTwo = s.as("x1, x2").join(r.as("x3, x4")).where("x2 === x3").select("x1, x4")
.join(r.as("x5, x6")).where("x4 === x5 ").select("x1, x6")
If we want to count the number of rows in tableOne and in tableTwo the result will be zero in both cases.
The problem is that evaluating tableOne will take much longer than evaluating tableTwo.
Is there any way to automatically optimise the order in which the joins are executed, or even to enable cost-based planning by adding some statistics? How can these statistics be added?
The documentation at this link says that it may be necessary to change the table environment's CalciteConfig, but it is not clear to me how to do that.
Please help.
Join reordering is not enabled because Flink does not handle statistics well. Reordering joins without somewhat accurate cardinality estimates is basically gambling. Therefore, join reordering is disabled and tables are joined in the order provided by the user. This gives deterministic and controllable behavior.
However, you can pass optimization rules into the optimizer by passing a TableConfig with a CalciteConfig when creating the TableEnvironment, i.e., TableEnvironment.getTableEnvironment(env, yourTableConfig). In the CalciteConfig you can add optimization rules to different optimization phases. You probably want to add JoinCommuteRule and JoinAssociateRule to the logical optimization phase. You probably also have to dig into the code to check how to pass statistics into the optimizer.
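For example, here is a sketch of what this could look like with the legacy Flink planner; CalciteConfigBuilder, addLogicalOptRuleSet, and the rule constants are assumed from Flink 1.x and Calcite, so verify the exact names in your version:
import org.apache.calcite.rel.rules.JoinAssociateRule;
import org.apache.calcite.rel.rules.JoinCommuteRule;
import org.apache.calcite.tools.RuleSets;
import org.apache.flink.table.api.TableConfig;
import org.apache.flink.table.calcite.CalciteConfig;
import org.apache.flink.table.calcite.CalciteConfigBuilder;

// Add the Calcite join reordering rules to the logical optimization phase.
CalciteConfig calciteConfig = new CalciteConfigBuilder()
    .addLogicalOptRuleSet(RuleSets.ofList(
        JoinCommuteRule.INSTANCE,     // may swap the inputs of a join
        JoinAssociateRule.INSTANCE))  // may re-associate join trees
    .build();

TableConfig tableConfig = new TableConfig();
tableConfig.setCalciteConfig(calciteConfig);

// env is the existing ExecutionEnvironment.
BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env, tableConfig);
Without statistics, these rules alone may not pick a better order, which is why feeding cardinality estimates into the optimizer is the other half of the work.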
In my Flink batch program (DataSet / Table API), I am reading multiple files, which produces different flows; I do some processing on each and save the results with an output format.
As Flink uses the dataflow model and my flows are not really related, they are processed in parallel.
Yet I want Flink to respect the order of my output operations at least, because I want flow1 to be saved before flow2.
For example, I have something like:
Table table1 = tableEnv.fromTableSource(new MyTableSource1());
DataSet<Obj1> dataSet1 = tableEnv.toDataSet(table1.select("toto",..), Obj1.class);
dataSet1.output(new WateverdatasinkSQL());
Table table2 = tableEnv.fromTableSource(new MyTableSource2());
DataSet<Obj2> dataSet2 = tableEnv.toDataSet(table2.select("foo","bar",..), Obj2.class);
dataSet2.output(new WateverdatasinkSQL());
I want Flink to wait for dataSet1 to be saved before continuing.
How can I run these as successive operations?
I have already looked at the execution modes, but they do not provide this.
Regards,
Bastien
The easiest solution is to separate both flows into individual jobs and execute them one after the other.
Table table1 = tableEnv.fromTableSource(new MyTableSource1());
DataSet<Obj1> dataSet1 = tableEnv.toDataSet(table1.select("toto",..), Obj1.class);
dataSet1.output(new WateverdatasinkSQL());
env.execute();
Table table2 = tableEnv.fromTableSource(new MyTableSource2());
DataSet<Obj2> dataSet2 = tableEnv.toDataSet(table2.select("foo","bar",..), Obj2.class);
dataSet2.output(new WateverdatasinkSQL());
env.execute();
I was trying to use the Table API inside a flatMap by passing the Flink environment object to the flatMap object, but I was getting a serialization exception telling me that I have added a field which is not serializable.
Could you please shed some light on this?
Regards,
Sajeev
You cannot pass the ExecutionEnvironment into a Function. It would be like passing Flink into Flink.
The Table API is an abstraction on top of the DataSet/DataStream APIs. If you want to use both the Table API and the lower-level API, you can use TableEnvironment#toDataSet/fromDataSet to switch between the APIs, even between DataSet operators.
DataSet<Integer> ds = env.fromElements(1, 2, 3);
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
Table t = tEnv.fromDataSet(ds, "intCol"); // continue in Table API
Table t2 = t.select("intCol.cast(STRING)"); // do something with table
DataSet<String> ds2 = tEnv.toDataSet(t2, String.class); // continue in DataSet API
I was surprised to find that there are no outer joins for DataStream in Flink (DataStream docs).
For DataSet you have all the options: leftOuterJoin, rightOuterJoin and fullOuterJoin, apart from the regular join (DataSet docs). But for DataStream you just have the plain old join.
Is this due to some fundamental properties of the DataStream that make it impossible to have outer joins? Or maybe we can expect this in the (close?) future?
I could really use an outer join on DataStream for the problem I'm working on... Is there any way to achieve a similar behaviour?
You can implement outer joins using the DataStream.coGroup() transformation. A CoGroupFunction receives two iterables (one for each input), which serve all elements of a certain key and which may be empty if no matching element is found. This allows you to implement outer-join functionality.
First-class support for outer joins might be added to the DataStream API in one of the next releases of Flink. I am not aware of any such efforts at the moment. However, creating an issue in the Apache Flink JIRA could help.
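For illustration, here is a minimal sketch of a left outer join on top of coGroup(); Left, Right, Joined and the key fields are placeholder names, and the window choice is just an example:
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

DataStream<Joined> joined = left.coGroup(right)
    .where(l -> l.key)
    .equalTo(r -> r.key)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
    .apply(new CoGroupFunction<Left, Right, Joined>() {
        @Override
        public void coGroup(Iterable<Left> lefts, Iterable<Right> rights, Collector<Joined> out) {
            for (Left l : lefts) {
                boolean matched = false;
                for (Right r : rights) {
                    matched = true;
                    out.collect(new Joined(l, r));
                }
                if (!matched) {
                    // no right-side match in this window: emit with null, as a LEFT OUTER JOIN would
                    out.collect(new Joined(l, null));
                }
            }
        }
    });
A right outer join mirrors this by iterating over the right side first; a full outer join additionally tracks which right-side elements were matched and emits the unmatched ones with a null left side.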
One way would be to go from stream -> table -> stream, using the following API: FLINK TABLE API - OUTER JOIN
Here is a Java example:
DataStream<String> data = env.readTextFile( ... );
DataStream<String> data2Merge = env.readTextFile( ... );
...
tableEnv.registerDataStream("myDataLeft", data, "left_column1, left_column2");
tableEnv.registerDataStream("myDataRight", data2Merge, "right_column1, right_column2");
String queryLeft = "SELECT left_column1, left_column2 FROM myDataLeft";
String queryRight = "SELECT right_column1, right_column2 FROM myDataRight";
Table tableLeft = tableEnv.sqlQuery(queryLeft);
Table tableRight = tableEnv.sqlQuery(queryRight);
Table fullOuterResult = tableLeft.fullOuterJoin(tableRight, "left_column1 == right_column1").select("left_column1, left_column2, right_column2");
DataStream<Tuple2<Boolean, Row>> retractStream = tableEnv.toRetractStream(fullOuterResult, Row.class);