How to implement a LEFT OUTER JOIN with streams in Apache Flink - apache-flink

I have two streams, left and right.
For the same time window, let's say that:
the left stream contains the elements L1, L2 (the number is the key)
the right stream contains the elements R1, R3
I wonder how to implement a LEFT OUTER JOIN in Apache Flink so that the result obtained when processing this window is the following:
(L1, R1), (L2, null)
L1 and R1 match by key (1), while L2 and R3 do not. L2 is still included because it comes from the left stream.

Well, you should be able to obtain the proper results with the coGroup operator and a properly implemented CoGroupFunction. The function gives you access to both whole groups in its coGroup method, and the documentation states that for a CoGroupFunction one of the groups may be empty, which is what allows you to implement the outer join. The only issue is the fact that groups are currently built in memory, so you need to verify that your groups won't grow too big, as they can effectively kill the JVM.
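For illustration, here is a minimal sketch of that approach in Java. Everything apart from the coGroup/CoGroupFunction pattern itself is an assumption made for the sketch: the Tuple2<String, String> (key, payload) element type, the key selectors on field f0, and the 10-second processing-time tumbling window. Adapt those to your actual types and windowing.

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class StreamLeftOuterJoin {

    // left/right are assumed to carry (key, payload) pairs, e.g. ("1", "L1").
    public static DataStream<String> leftOuterJoin(DataStream<Tuple2<String, String>> left,
                                                   DataStream<Tuple2<String, String>> right) {
        return left.coGroup(right)
            .where(new KeySelector<Tuple2<String, String>, String>() {
                @Override
                public String getKey(Tuple2<String, String> value) {
                    return value.f0;                      // join key of the left side
                }
            })
            .equalTo(new KeySelector<Tuple2<String, String>, String>() {
                @Override
                public String getKey(Tuple2<String, String> value) {
                    return value.f0;                      // join key of the right side
                }
            })
            // The window choice is an assumption for the sketch; use whatever fits your job.
            .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
            .apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
                @Override
                public void coGroup(Iterable<Tuple2<String, String>> lefts,
                                    Iterable<Tuple2<String, String>> rights,
                                    Collector<String> out) {
                    // Emit each left element once per matching right element,
                    // or once with "null" when the right group for this key is empty.
                    for (Tuple2<String, String> l : lefts) {
                        boolean matched = false;
                        for (Tuple2<String, String> r : rights) {
                            matched = true;
                            out.collect("(" + l.f1 + ", " + r.f1 + ")");
                        }
                        if (!matched) {
                            out.collect("(" + l.f1 + ", null)");
                        }
                    }
                }
            });
    }
}

Because coGroup hands the function the complete left and right groups for the window, the empty right group is exactly where the (L2, null) row comes from.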

Related

Database merge join cost evaluation problem

I have a question about evaluating the minimum page I/O cost of the query π_{A,B,C,D}(R ⋈_{A=C} S) using the merge-join method. I need to evaluate the following:
Page I/O cost to sort R.
Page I/O cost to sort S.
Page I/O cost to join R and S.
My question is: since the query projects only a subset of the attributes (A, B, C, D), is it possible to eliminate the unwanted attributes during the separate sorts of R and S (provided that A and B are in R, and C and D are in S)? If so, then the formula 2·b_R·(⌈log_{M−1}(b_R/M)⌉ + 1) seems not to apply directly.
Or, more precisely: when is the best point to eliminate the unwanted attributes?
This question has had me stuck for a long time. I hope to get some insight on it.
Thanks.
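For reference, a worked evaluation of the sort-cost formula quoted above, with purely illustrative numbers that are not from the question: taking b_R = 1,000 pages and M = 11 buffer pages gives 2·b_R·(⌈log_{M−1}(b_R/M)⌉ + 1) = 2 · 1,000 · (⌈log_10(1,000/11)⌉ + 1) = 2,000 · (2 + 1) = 6,000 page I/Os. If the unwanted attributes were projected away before sorting, b_R here would be the page count of the reduced relation rather than of R itself, which is exactly what makes the direct application of the formula questionable.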

CASE statement versus temporary table

I have a choice between two different techniques for converting codes to text:
create table #TMP_CONVERT (code int, value varchar(16));  -- assumed definition
insert into #TMP_CONVERT (code, value)
values (1, 'Uno')
     , (2, 'Dos')
     , (3, 'Tres');
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN #TMP_CONVERT tc
on tc.code = x.code
Or
case x.code
when 1 then 'Uno'
when 2 then 'Dos'
when 3 then 'Tres'
else 'Unknown'
end as THE_VALUE
The table has about 20 million rows.
The typical size of the code lookup table is 10 rows.
I would rather do #1, but I don't like left outer joins.
My questions are:
Is one faster than the other in any really meaningful way?
Does any SQL engine optimize this out anyway? That is, does it just read the table into memory and essentially do the CASE-statement logic anyway?
I happen to be using T-SQL, but I would like to know for any number of RDBMSs, because I use several.
[Edit to clarify not liking LEFT OUTER JOIN]
I use LEFT OUTER JOINs when I need them, but whenever I use them I double-check my logic and data to confirm I actually need them. Then I add a comment to the code that indicates why I am using a LEFT OUTER JOIN. Of course I have to do a similar exercise when I use an INNER JOIN; that is, make sure I am not dropping data.
There is some overhead to using a join. However, if the code column is made a clustered primary key, then the performance of the two might be comparable -- with the join even winning out. Without an index, I would expect the CASE to be slightly better than the left join.
These are just guesses. As with all performance questions, though, you should check on your data on your systems.
I also want to react to your not liking left joins. These provide important functionality for SQL and are the right way to address this problem.
The code executed for the join will likely be substantially more than the code executed for the hardcoded case options.
The execution plan will have an extra join iterator along with an extra scan or seek operator (depending on the availability of a suitable index). On the positive side, the 10 rows will likely all fit on a single page in #TMP_CONVERT, and that page will be in memory anyway; also, being a temp table, it won't bother with taking and releasing row locks each time. Still, the code to latch the page, locate the correct row, and crack the desired column value out of it, over 20,000,000 iterations, would likely add some measurable CPU time compared with looking up the value in a hardcoded list (you could potentially try nested CASE statements too, to perform a binary search and avoid the need for 10 branches there).
But even if there is a measurable time difference it still may not be particularly significant as a proportion of the query time as a whole. Test it. Let us know what you find...
You can also avoid creating a temporary table in this case by using a WITH construction (a common table expression). So your query might be something like this.
WITH TMP_CONVERT(code,value) AS -- Semicolon can be required before WITH.
(
SELECT * FROM (VALUES (1,'UNO'),
(2,'DOS'),
(3,'Tres')
) tbl(code,value)
)
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN TMP_CONVERT tc
on tc.code = x.code
Or even a subquery can be used:
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN (VALUES (1,'UNO'),
(2,'DOS'),
(3,'Tres')
) tc(code,value)
ON tc.code = x.code
Hope this can be helpful.

query processing (natural join)

I need a little help. Can anyone please answer this question?
Consider a join of 3 relations:
r natural join s natural join t
Since join is commutative and associative, the system could join r and s first, join s and t first, or join r and t first, and then join the remaining relation with the result. If the system is able to accurately estimate how large the result of a join will be without actually computing it, should it first choose
1. the join with the largest result, or
2. the join with the smallest result?
Why?
Doing the join with the smallest result set allows you to reduce the amount of work to be done in the future.
Consider the case where you join 10 relations with 1,000,000 elements each, knowing that each join will produce on the order of (10^6)^2 elements, and only then join with a relation of 10 elements (knowing that the result will be only 10 elements). Compare this to the case where you start with the 10-element relation first.
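To put rough numbers on that for the three relations in the question (the sizes here are purely illustrative assumptions): suppose |r| = |s| = 1,000,000 and |t| = 10, and suppose the optimizer estimates that r ⋈ s yields about 10^9 tuples while s ⋈ t yields only about 10^3. Joining r and s first means the second join must consume roughly 10^9 intermediate tuples; joining s and t first means it consumes only about 10^3, even though both orders produce the same final result. That is why the join with the smallest estimated result is the better first choice.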

Spark speed up multiple join operations

Suppose I have a rule like this:
p(v3,v4) :- t1(k1,v1), t2(k1,v2), t3(v1,v3), t4(v2,v4).
The task is to join t1, t2, t3, and t4 together to produce the relation p.
Suppose t1, t2, t3, and t4 already have the same partitioner for their keys.
A common strategy is to join the relations one by one, but it forces at least 3 shuffle/repartition operations. Details are below (suppose I have 10 partitions).
1.join: x = t1.join(t2)
2.repartition: x = x.map(lambda (k1, (v1,v2)): (v1,v2)).partitionBy(10)
3.join: x = x.join(t3)
4.repartition: x = x.map(lambda (v1, (v2,v3)): (v2,v3)).partitionBy(10)
5.join: x = x.join(t4)
6.repartition: x = x.map(lambda (v2, (v3,v4)): (v3,v4)).partitionBy(10)
Because t1 to t4 all have the same partitioner, and I repartition the intermediate result after every join, none of the join operations itself involves a shuffle.
However, the intermediate result (i.e. the variable x) is huge in my practical code, so 3 shuffle operations are still too many for me.
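For concreteness, here is roughly what that chain looks like written against Spark's Java pair-RDD API; the JavaPairRDD<String, String> element types, the method wrapper, and the explicit HashPartitioner(10) are assumptions made for this sketch, not part of the original code.

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class JoinChainSketch {

    // t1..t4 are assumed to be (key, value) pair RDDs that already share the
    // same partitioner, as stated in the question.
    static JavaPairRDD<String, String> evaluateRule(JavaPairRDD<String, String> t1,
                                                    JavaPairRDD<String, String> t2,
                                                    JavaPairRDD<String, String> t3,
                                                    JavaPairRDD<String, String> t4) {
        HashPartitioner tenPartitions = new HashPartitioner(10);

        // Steps 1-2: join t1 and t2 on k1 (no shuffle, same partitioner),
        // then re-key the result by v1 -- this partitionBy is shuffle #1.
        JavaPairRDD<String, String> keyedByV1 = t1.join(t2)
                .mapToPair(kv -> new Tuple2<String, String>(kv._2()._1(), kv._2()._2()))  // (v1, v2)
                .partitionBy(tenPartitions);

        // Steps 3-4: join with t3 on v1, then re-key by v2 -- shuffle #2.
        JavaPairRDD<String, String> keyedByV2 = keyedByV1.join(t3)
                .mapToPair(kv -> new Tuple2<String, String>(kv._2()._1(), kv._2()._2()))  // (v2, v3)
                .partitionBy(tenPartitions);

        // Steps 5-6: join with t4 on v2, then re-key by v3 -- shuffle #3.
        return keyedByV2.join(t4)
                .mapToPair(kv -> new Tuple2<String, String>(kv._2()._1(), kv._2()._2()))  // (v3, v4)
                .partitionBy(tenPartitions);
    }
}

The three partitionBy calls are the three shuffles in question; the join calls themselves stay shuffle-free because, at each step, both inputs share the same partitioner.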
My questions are:
Is there anything wrong with my strategy to evaluate this rule? Is there any better, more efficient solution?
My understanding of the shuffle operation is that, for each partition, Spark repartitions the data independently and writes the repartitioned results for that partition to disk (the so-called shuffle write). Then, for each partition, Spark reads the new repartitioned results back from disk (the so-called shuffle read). If my understanding is correct, every shuffle/repartition always costs disk reads and writes, which seems like a waste if I can guarantee that my memory is large enough to store all the data, as described in http://www.trongkhoanguyen.com/2015/04/understand-shuffle-component-in-spark.html. Is there any workaround to disable this kind of shuffle write and read? I think my program's performance bottleneck is shuffle I/O overhead.
Thank you.

How should I treat multivalued dependencies in a canonical cover calculation?

The Situation:
B->TCW
C->WG
TCB->OMW
L->UNM
TL->CGM
Now I have to calculate the canonical cover and remove the redundancies.
After left and right reduction I get the following result:
B -> TC
C -> WG
TCB -> OM
L -> UNM
TL -> C
So the question is: what would happen if I had two more dependencies?
e.g.
B->TCW
C->WG
TCB->OMW
L->UNM
TL->CGM
BL ->-> F
BL ->-> G
From my point of view, or as I understand it, the two new multivalued dependencies have no influence on the canonical cover, so the result would be the same, just with the two multivalued dependencies carried along.
But what would happen if we had a multivalued dependency that did influence the result?
Is it permitted to cross out the duplicates, or is that not allowed, because with a multivalued dependency every letter becomes a key attribute?
