query processing (natural join) - database

I need a little help with this question.
Consider a join of 3 relations:
r natural join s natural join t
Since join is commutative and associative, the system could join r and s first, s and t first, or r and t first, and then join the remaining relation with the result. If the system is able to accurately estimate how large the result of a join will be without actually computing the join, should it choose first
1. the join with the largest result, or
2. the join with the smallest result?
Why?

Doing the join with the smallest result first reduces the amount of work left for the later joins.
Consider the case where you join 10 relations with 1,000,000 elements each, knowing that each join will produce on the order of (10^6)^2 elements, and then join with a relation with 10 elements (knowing that the final result will have only 10 elements). Compare this to the case where you start with the 10-element relation: the intermediate results can stay small from the first step on.
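To make the effect of join order concrete, here is a minimal sketch. The relations r(a,b), s(b,c), t(c,d) and their contents are made up for illustration, and `natural_join` is a naive nested-loop helper, not how a real engine evaluates joins. Both orders give the same final result, but the intermediate result differs by several orders of magnitude.

```python
def natural_join(left, right):
    """Naive natural join of two lists of dicts on all shared attribute names."""
    if not left or not right:
        return []
    shared = set(left[0]) & set(right[0])
    return [{**l, **r} for l in left for r in right
            if all(l[k] == r[k] for k in shared)]

def as_set(rel):
    """Order-independent view of a relation, for comparing results."""
    return {tuple(sorted(row.items())) for row in rel}

# Hypothetical toy data: every r tuple matches every s tuple on b,
# but only one s tuple matches the single t tuple on c.
r = [{"a": i, "b": 0} for i in range(200)]
s = [{"b": 0, "c": i} for i in range(200)]
t = [{"c": 7, "d": "x"}]

rs = natural_join(r, s)                  # largest join first
st = natural_join(s, t)                  # smallest join first
plan_large_first = natural_join(rs, t)
plan_small_first = natural_join(r, st)

print(len(rs), len(st))                  # intermediate result sizes
print(as_set(plan_large_first) == as_set(plan_small_first))
```

With this data the first plan materializes 40,000 intermediate tuples and the second only 1, yet both end with the same 200 tuples.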

Related

RA translation to natural language

I'm stuck on an exercise where I need to translate relational algebra expressions (unary relational operations) based on the Mondial III database into natural language. I need help with the last two, and I'd like to know whether the ones I have answered contain any errors. Note that I used 6 for sigma (the SELECT operation) and |><| for the THETA JOIN operation, because I couldn't find the sigma or the real theta join operator on my keyboard. Any help is much appreciated! Thanks in advance.
Here's the meaning for symbols :
SELECT :
Selects all tuples that satisfy the selection condition from a relation R :
6<selection condition>(R)
PROJECT : Produces a new relation with only some of the attributes of R, and removes duplicate tuples :
π<attribute list>(R)
THETA JOIN : Produces all combinations of tuples from R1 and R2 that satisfy the join condition :
R1 |><|<join condition>(R2)
πname(6elevation>1000(MOUNTAIN)) -> Find the name of all mountains whose elevation is higher than 1000.
6elevation>1000(6population>100000(CITY)) -> Select the city's tuples whose elevation is higher than 1000 with a population greater than 100000
6population>100000(6elevation>1000(CITY)) -> Select the city's tuples whose population is greater than 100000 with an elevation higher than 1000
COUNTRY|><|code=country(LANGUAGE) -> ?
πCountry.name(COUNTRY|><|code=country(6Language.name='English' AND percentage>50(LANGUAGE))) -> ?
The fourth expression returns all the information about the countries together with all the languages spoken in them (the information about a country is repeated for each different language spoken there).
The fifth expression returns the names of all the countries where English is spoken by more than 50% of the population.
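The two readings can be checked on a tiny example. This is only a sketch: the rows are invented (they are not Mondial data), the helpers `select`, `project` and `theta_join` are hypothetical names, and the two `name` attributes are written as `Country.name` and `Language.name` to keep them apart after the join.

```python
def select(pred, rel):
    """SELECT: keep the tuples that satisfy the condition."""
    return [row for row in rel if pred(row)]

def project(attrs, rel):
    """PROJECT: keep only the listed attributes and drop duplicate tuples."""
    seen, out = set(), []
    for row in rel:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def theta_join(r1, r2, cond):
    """THETA JOIN: all combinations of tuples that satisfy the join condition."""
    return [{**a, **b} for a in r1 for b in r2 if cond(a, b)]

# Made-up sample rows, for illustration only.
COUNTRY = [
    {"Country.name": "Exampleland", "code": "EX"},
    {"Country.name": "Samplestan", "code": "SA"},
]
LANGUAGE = [
    {"country": "EX", "Language.name": "English", "percentage": 90},
    {"country": "EX", "Language.name": "French", "percentage": 10},
    {"country": "SA", "Language.name": "English", "percentage": 20},
    {"country": "SA", "Language.name": "Samplish", "percentage": 80},
]

def on_code(c, l):
    return c["code"] == l["country"]

# Fourth expression: each country repeated once per language spoken there.
fourth = theta_join(COUNTRY, LANGUAGE, on_code)

# Fifth expression: filter LANGUAGE first, then join, then project the name.
english_majority = select(
    lambda l: l["Language.name"] == "English" and l["percentage"] > 50,
    LANGUAGE,
)
fifth = project(["Country.name"], theta_join(COUNTRY, english_majority, on_code))

print(len(fourth))
print(fifth)
```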

Trivial join dependency

I'm having difficulty understanding how to "work" with join dependencies, and I would like to ask a question that will help me clarify things for myself.
Here's the simple definition from Wikipedia:
A table T is subject to a join dependency if T can always be recreated
by joining multiple tables each having a subset of the attributes of
T.
A trivial join dependency is defined as follows:
If one of the tables in the join has all the attributes of the table
T, the join dependency is called trivial.
My question is: If we decompose a relation R into a lossless decomposition, is it possible that every join dependency of R could be a trivial join dependency?
An example would be awesome.
If we decompose a relation R into a lossless decomposition, is it possible that the join dependency (or dependencies) of R would be trivial?
If you mean, if we decompose a relation R losslessly is it possible that all the JDs of R are trivial: yes.
Whenever all the JDs of R are trivial, you can decompose it losslessly, because by definition a JD is just a description of a lossless decomposition. And there are such relations. Every R, calling its attribute set S, satisfies the JDs *(S,S), *(S,S,S), etc. Some satisfy no other JDs. Some satisfy others but they're also trivial.
Eg: This R only satisfies *(S,S), *(S,S,S), etc:
x y
1 2
5 2
5 4
Eg: Say S = {x,y} and FD {x}->{y} holds, so *({x},S) holds. But say JD *({x},{y}) doesn't hold. Then the only way a JD can have sets unioning to S is if S is one of them. So R has only trivial JDs, but not just the ones using only S.
x y
1 2
5 2
6 4
If you mean, if we decompose a relation R losslessly into smaller components is it possible that all the JDs of R are trivial: no. By definition a trivial JD has one set that is all the attributes of R, i.e. one component that is R itself, so it doesn't decompose R into components smaller than R.
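The claims about the two example relations can be checked mechanically. Below is a minimal sketch, assuming a relation is modelled as a set of rows and a JD as a list of attribute sets; `satisfies_jd` and the other helper names are made up for illustration. A JD holds for a given relation value exactly when joining its projections gives the relation back.

```python
def relation(attrs, rows):
    """Build a relation as a set of rows; each row is a frozenset of (attr, value)."""
    return {frozenset(zip(attrs, row)) for row in rows}

def project(rel, attrs):
    """Projection onto attrs, with duplicates removed by the set."""
    return {frozenset((a, v) for a, v in row if a in attrs) for row in rel}

def natural_join(left, right):
    """Natural join: merge every pair of rows that agree on shared attributes."""
    out = set()
    for l in left:
        for r in right:
            merged = dict(l)
            if all(merged.get(a, v) == v for a, v in r):
                merged.update(dict(r))
                out.add(frozenset(merged.items()))
    return out

def satisfies_jd(rel, components):
    """True if joining the projections onto each component recreates rel."""
    joined = None
    for attrs in components:
        part = project(rel, attrs)
        joined = part if joined is None else natural_join(joined, part)
    return joined == rel

S = {"x", "y"}
R1 = relation(("x", "y"), [(1, 2), (5, 2), (5, 4)])   # first example
R2 = relation(("x", "y"), [(1, 2), (5, 2), (6, 4)])   # second example, FD x -> y

print(satisfies_jd(R1, [S, S]))          # trivial JD *(S,S)
print(satisfies_jd(R1, [{"x"}, {"y"}]))  # nontrivial JD *({x},{y})
print(satisfies_jd(R2, [{"x"}, S]))      # trivial JD *({x},S)
print(satisfies_jd(R2, [{"x"}, {"y"}]))  # nontrivial JD *({x},{y})
```

Note that this only tests one relation value; a JD as a constraint has to hold for every value the relation may take.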

CASE statement versus temporary table

I have my choice between two different techniques in converting codes to text:
insert into #TMP_CONVERT(code, value) values
(1, 'Uno')
,(2, 'Dos')
,(3, 'Tres')
;
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN #TMP_CONVERT tc
on tc.code = x.code
Or
case x.code
when 1 then 'Uno'
when 2 then 'Dos'
when 3 then 'Tres'
else 'Unknown'
end as THE_VALUE
The table has about 20 million rows.
The typical size of the code lookup table is 10 rows.
I would rather do #1, but I don't like left outer joins.
My questions are:
Is one faster than the other in any really meaningful way?
Does any SQL engine optimize this away anyway? That is, does it just read the lookup table into memory and essentially apply the CASE statement logic?
I happen to be using tsql, but I would like to know for any number of RDBM systems because I use several.
[Edit to clarify not liking LEFT OUTER JOIN]
I use LEFT OUTER JOINs when I need them, but whenever I use them I double check my logic and data to confirm I actually need them. Then I add a comment to the code that indicates why I am using a LEFT OUTER JOIN. Of course I have to do a similar exercise when I use INNER JOIN; that is, make sure I am not dropping data.
There is some overhead to using a join. However, if code is made the clustered primary key of the lookup table, then the performance of the two might be comparable, with the join possibly even winning out. Without an index, I would expect the case to be slightly better than the left join.
These are just guesses. As with all performance questions, though, you should test on your data and on your systems.
I also want to react to your not liking left joins. These provide important functionality for SQL and are the right way to address this problem.
The code executed for the join will likely be substantially more than the code executed for the hardcoded case options.
The execution plan will have an extra join iterator along with an extra scan or seek operator (depending on the availability of a suitable index). On the positive side likely the 10 rows will all fit on a single page in #TMP_CONVERT and this will be in memory anyway, also being a temp table it won't bother with taking and releasing row locks each time, but still the code to latch the page, locate the correct row, and crack the desired column value out of it over 20,000,000 iterations would likely add some amount of measurable CPU time compared with looking up in a hardcoded list of values (potentially you could try nested CASE statements too, to perform a binary search and avoid the need for 10 branches there).
But even if there is a measurable time difference it still may not be particularly significant as a proportion of the query time as a whole. Test it. Let us know what you find...
You can also avoid creating a temporary table in this case by using a WITH clause (a common table expression). So your query might look something like this.
WITH TMP_CONVERT(code,value) AS -- Semicolon can be required before WITH.
(
SELECT * FROM (VALUES (1,'Uno'),
(2,'Dos'),
(3,'Tres')
) tbl(code,value)
)
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN TMP_CONVERT tc
on tc.code = x.code
Or even a subquery can be used:
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN (VALUES (1,'Uno'),
(2,'Dos'),
(3,'Tres')
) tc(code,value)
ON tc.code = x.code
Hope this can be helpful.
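Before timing anything, it is worth confirming that the two techniques return the same rows, including for unmapped and NULL codes. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine (not T-SQL); the table `x`, its columns and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id INTEGER PRIMARY KEY, code INTEGER)")
# Hypothetical data: three mapped codes, one unmapped code, one NULL.
conn.executemany(
    "INSERT INTO x (id, code) VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 3), (4, 9), (5, None)],
)

case_sql = """
SELECT x.id,
       CASE x.code
            WHEN 1 THEN 'Uno'
            WHEN 2 THEN 'Dos'
            WHEN 3 THEN 'Tres'
            ELSE 'Unknown'
       END AS the_value
FROM x
ORDER BY x.id
"""

join_sql = """
WITH tmp_convert(code, value) AS (
    VALUES (1, 'Uno'), (2, 'Dos'), (3, 'Tres')
)
SELECT x.id, COALESCE(tc.value, 'Unknown') AS the_value
FROM x
LEFT OUTER JOIN tmp_convert tc ON tc.code = x.code
ORDER BY x.id
"""

case_rows = conn.execute(case_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(case_rows == join_rows)
print(join_rows)
```

One difference to keep in mind: if the lookup list ever contains the same code twice, the join returns the matching row twice, while the CASE expression never changes the row count.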

finding max value among two table without using max function in relational algebra

Suppose I have two tables A{int m} and B{int m}, and I have to find the maximum m across the two tables using relational algebra, but I cannot use the max function. How can I do it? I think we can do it using a join, but I am not sure whether my guess is correct.
Note: this is an interview question.
Hmm, I'm puzzled why the question involves two tables. For the question as asked, I would just UNION the two (as StilesCrisis has done), then solve for a single table.
So: how to find the maximum m in a table using only NatJOIN? This is a simplified version of finding the top node on a table that holds a hierarchy (think assembly/component explosions or org charts).
The key idea is that we need to 'copy' the table into something with a different attribute name so that we can compare the tuples pair-wise. (And this will therefore use the degenerate form of NatJOIN, aka cross-product.) See the example in "How can I find MAX with relational algebra?"
A NOT MATCHING
((A x (A RENAME m AS mm)) WHERE m < mm)
The subtrahend is all tuples with m less than some other tuple's m. The anti-join is all the tuples except those, i.e. the MAX. (Using NOT MATCHING I think is both more understandable than MINUS, and doesn't need the relations to be UNION-compatible. It's roughly equivalent to SQL NOT EXISTS.)
[I've used Tutorial D syntax, to avoid mucking about with greek letters.]
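The same steps can be traced with plain Python sets. This is only a sketch with made-up values for A and B; the set comprehensions stand in for the union, the cross product with the renamed copy, the restriction m < mm, and the anti-join.

```python
# Hypothetical contents of A{m} and B{m}.
A = {3, 8, 5}
B = {7, 2}

# UNION the two tables, then solve for a single table.
U = A | B

# Cross product of U with a renamed copy of itself (m, mm).
pairs = {(m, mm) for m in U for mm in U}

# Restriction WHERE m < mm, projected on m:
# every value that is smaller than some other value.
not_max = {m for (m, mm) in pairs if m < mm}

# Anti-join (NOT MATCHING): what is left is the maximum.
maximum = U - not_max
print(maximum)
```

The result is a one-element set as long as the input is non-empty; on empty input it is empty, which matches the relational expression.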
SELECT M FROM (SELECT M FROM A UNION SELECT M FROM B) AS U ORDER BY M DESC LIMIT 1
This doesn't use MAX, just plain vanilla SQL.

Estimating a Size of Joining a Relation with itself

I'm studying size estimation of logical query plans in order to select a physical query plan.
I was wondering what is the size of joining (natural join) a relation to itself?
E.g. R(a,b) JOIN R(a,b), where the total number of tuples is 100 and attributes a and b each have 20 distinct values.
Will the join size (the number of tuples in the result) be equal to 100?
I'm so confused!
To answer the question as asked:
Natural join of a relation to itself is the identity operation; you'll get exactly the tuples you started with (yes, 100 tuples in this case).
The equivalent SQL for what you ask is:
SELECT R1.a, R1.b FROM R AS R1, R AS R2 WHERE R1.a = R2.a AND R1.b = R2.b
This is because RA's (Natural) Join always matches by attribute name.
What could be more sensible? What's to be confused about?
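A quick sketch of the numbers in the question, with a made-up R that has 100 tuples and 20 distinct values in each attribute. The last lines are an assumption about where the confusion may come from: a commonly taught estimate for a natural join on two shared attributes divides T(R)*T(S) by the larger distinct-value count of each attribute, which treats the two relations as independent and so does not fit a self-join.

```python
# Hypothetical R(a, b): 100 distinct tuples, 20 distinct values per attribute.
R = {(a, (a + k) % 20) for a in range(20) for k in range(5)}

distinct_a = len({a for a, b in R})
distinct_b = len({b for a, b in R})

# Natural join of R with itself: tuples pair up only when they agree on
# both a and b, so each tuple matches exactly itself.
self_join = {(a1, b1) for (a1, b1) in R for (a2, b2) in R
             if a1 == a2 and b1 == b2}

print(len(R), distinct_a, distinct_b)
print(self_join == R)

# Independence-based estimate (assumed formula, see the note above).
estimate = (len(R) * len(R)) // (distinct_a * distinct_b)
print(estimate)
```

Here the actual size is 100 while the independence-based estimate comes out at 25, so the two should not be expected to agree for a self-join.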

Resources