I read that Oracle supports merge join with inequality join predicates.
Is there an online reference to the algorithm used to implement such a join? If anyone knows how it works, can you put it in an answer?
This is what you're looking for.
7.4 Sort Merge Joins
Sort merge joins can join rows from two independent sources. In
general, hash joins perform better than sort merge joins. However,
sort merge joins can perform better than hash joins if both of the
following conditions exist:
The row sources are sorted. A sort operation is not required. However,
if a sort merge join involves choosing a slower access method (an
index scan as opposed to a full table scan), then the benefit of using
a sort merge might be lost.
Sort merge joins are useful when the join condition between two tables
is an inequality condition such as <, <=, >, or >=. Sort merge joins
perform better than nested loops joins for large data sets. Hash joins
require an equality condition.
In a merge join, there is no concept of a driving table. The join
consists of two steps:
Sort join operation
Both the inputs are sorted on the join key.
Merge join operation
The sorted lists are merged.
If the input is sorted by the join column, then a sort join operation
is not performed for that row source. However, a sort merge join
always creates a positionable sort buffer for the right side of the
join so that it can seek back to the last match in the case where
duplicate join key values come out of the left side of the join.
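As a rough illustration of the mechanism described above (a minimal Python sketch, not Oracle's actual implementation), the following sort merge equi-join sorts both inputs and keeps a saved position ("mark") into the right side so it can seek back whenever duplicate join key values come out of the left side:

def sort_merge_join(left, right, key=lambda row: row[0]):
    # Sort join operation: both inputs are sorted on the join key.
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    results = []
    r = 0  # current position in the right input
    for l_row in left:  # Merge join operation: the sorted lists are merged.
        # Advance the right side past keys smaller than the current left key.
        while r < len(right) and key(right[r]) < key(l_row):
            r += 1
        mark = r  # remember the first possible match
        while r < len(right) and key(right[r]) == key(l_row):
            results.append((l_row, right[r]))
            r += 1
        r = mark  # seek back: the next left row may repeat this key
    return results

# Example, joining on the first column of each tuple:
print(sort_merge_join([(1, 'a'), (2, 'b'), (2, 'c')],
                      [(2, 'x'), (2, 'y'), (3, 'z')]))
# [((2, 'b'), (2, 'x')), ((2, 'b'), (2, 'y')), ((2, 'c'), (2, 'x')), ((2, 'c'), (2, 'y'))]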
There's an example here: http://www.serkey.com/oracle-skyline-query-challenge-bdh859.html
Is this what you're looking to do? (key word is "sort-merge")
What is the difference between a JoinFunction and a CoGroupFunction in Apache Flink? How do semantics and execution differ?
Both Join and CoGroup transformations join two inputs on key fields. The difference is how the user functions are called:
the Join transformation calls the JoinFunction with pairs of matching records from both inputs that have the same values for key fields. This behavior is very similar to an equality inner join.
the CoGroup transformation calls the CoGroupFunction with iterators over all records of both inputs that have the same values for key fields. If an input has no records for a certain key value an empty iterator is passed. The CoGroup transformation can be used, among other things, for inner and outer equality joins. It is hence more generic than the Join transformation.
Looking at the execution strategies of Join and CoGroup, Join can be executed using sort- and hash-based join strategies whereas CoGroup is always executed using sort-based strategies. Hence, joins are often more efficient than cogroups and should be preferred if possible.
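To make the calling semantics concrete, here is a small Python sketch (plain Python with made-up helper names, not the Flink API): join calls the user function once per matching pair, while co_group calls it once per key with the full groups from both sides, which may be empty.

from collections import defaultdict

def join(left, right, key, join_fn):
    # Calls join_fn once per PAIR of matching records (equality inner join).
    index = defaultdict(list)
    for r in right:
        index[key(r)].append(r)
    return [join_fn(l, r) for l in left for r in index[key(l)]]

def co_group(left, right, key, cogroup_fn):
    # Calls cogroup_fn once per KEY with iterators over both groups;
    # either iterator may be empty, which enables outer-join semantics.
    groups = defaultdict(lambda: ([], []))
    for l in left:
        groups[key(l)][0].append(l)
    for r in right:
        groups[key(r)][1].append(r)
    out = []
    for ls, rs in groups.values():
        out.extend(cogroup_fn(iter(ls), iter(rs)))
    return out

left = [(1, 'a'), (2, 'b')]
right = [(2, 'x'), (3, 'y')]
key = lambda t: t[0]
print(join(left, right, key, lambda l, r: (l, r)))                        # only key 2 produces a pair
print(co_group(left, right, key, lambda ls, rs: [(list(ls), list(rs))]))  # keys 1 and 3 appear with an empty side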
I'm not clear about the difference in how the queries mentioned below work.
Specifically, I'm unclear about the concept of OPTION (LOOP JOIN).
1st approach: a traditional join, which is more expensive than all of the ones below.
SELECT *
FROM [Item Detail] a
LEFT JOIN [Order Detail] b ON a.[ItemId] = b.[fkItemId] OPTION (FORCE ORDER);
2nd approach: it adds a LOOP join hint to the statement along with OPTION (FORCE ORDER), and appears to be better optimized.
SELECT *
FROM [Item Detail] a
LEFT LOOP JOIN [Order Detail] b ON a.[ItemId] = b.[fkItemId] OPTION (FORCE ORDER);
3rd approach: here I am not clear how the query works, since it combines a LOOP join hint with OPTION (LOOP JOIN).
SELECT *
FROM [Item Detail] a
LEFT LOOP JOIN [Order Detail] b ON a.[ItemId] = b.[fkItemId] OPTION (LOOP JOIN);
Can anybody explain the difference, how each one works, and the advantages of each over the others?
Note: These are not Nested OR Hash loops!
From Query Hints (Transact-SQL)
FORCE ORDER Specifies that the join order indicated by the query
syntax is preserved during query optimization. Using FORCE ORDER does
not affect possible role reversal behavior of the query optimizer.
also
{ LOOP | MERGE | HASH } JOIN Specifies that all join operations are
performed by LOOP JOIN, MERGE JOIN, or HASH JOIN in the whole query.
If more than one join hint is specified, the optimizer selects the
least expensive join strategy from the allowed ones.
Advanced Query Tuning Concepts
If one join input is small (fewer than 10 rows) and the other join
input is fairly large and indexed on its join columns, an index nested
loops join is the fastest join operation because they require the
least I/O and the fewest comparisons.
If the two join inputs are not small but are sorted on their join
column (for example, if they were obtained by scanning sorted
indexes), a merge join is the fastest join operation.
Hash joins can efficiently process large, unsorted, nonindexed inputs.
And Join Hints (Transact-SQL)
Join hints specify that the query optimizer enforce a join strategy
between two tables
Your option 1 tells the optimizer to keep the join order as is. The join type can still be decided by the optimizer, so it might be a MERGE JOIN.
Your option 2 tells the optimizer to use a LOOP JOIN for this specific join. If there were any other joins in the FROM clause, the optimizer would still be free to decide for them. You are also still specifying the order of joins for the optimizer.
Your last option OPTION (LOOP JOIN) would enforce LOOP JOIN across all joins in the query.
This all said, it is very seldom that the optimizer chooses an incorrect plan; when it does, that usually indicates bigger underlying issues, such as outdated statistics or fragmented indexes.
Indexed Nested-Loop Join :
For each tuple tr in the outer relation R, use the index to look up tuples in S that satisfy the join condition with tuple tr
Some materials mention that an "Indexed Nested-Loop Join" is only applicable when the join is an equi-join or natural join and an index is available on the inner relation's join attribute.
SELECT *
FROM tableA as a
JOIN tableB as b
ON a.col1 > b.col1;
Suppose we have an index on b.col1.
Why is an indexed nested-loop join not applicable in this case?
You are quoting slides for Database System Concepts (c) Silberschatz, Korth and Sudarshan.
We want the DBMS to calculate a join. There are lots of special cases where it can do it various ways. These might involve whether there are indexes, selection conditions, etc.
The particular technique that that book calls by that name works in certain cases:
Indexed Nested-Loop Join
If an index is available on the inner loop's join attribute and join
is an equi-join or natural join
The answer is: because your query does not meet the conditions. It is not an equi-join (i.e. ON or WHERE a.col1 = b.col1) or a natural join (USING (col1) or NATURAL JOIN).
As to why not meeting those conditions means not using that technique, it would be because it doesn't work and/or some other technique is better. You gave the technique:
For each tuple tr in the outer relation r, use the index to look up
tuples in s that satisfy the join condition with tuple tr
If it's an inequality, you can't "look up in" the index; you have to search through the index. Not this method.
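A minimal Python sketch of the textbook technique (illustrative only, with made-up names, not a real DBMS implementation) makes the point: the "index" below is a hash index on the inner relation's join attribute, and its lookup only answers "which inner tuples have exactly this key value", which is why an equi-join condition is needed.

def indexed_nested_loop_join(outer, inner, key=lambda row: row[0]):
    index = {}
    for ts in inner:  # build an index on the inner relation's join attribute
        index.setdefault(key(ts), []).append(ts)
    for tr in outer:  # for each tuple tr in the outer relation
        # look up the inner tuples that satisfy the (equality) join condition
        for ts in index.get(key(tr), []):
            yield (tr, ts)

# For a predicate like a.col1 > b.col1 there is no single key to look up;
# you would have to range-scan or search the index instead.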
Well, I read the second answer, so I checked the book Database System Concepts, 7th Edition, by Silberschatz, Korth and Sudarshan.
First, as you can see, an indexed nested-loop join can be used if an index is available on the inner loop's join attribute, and there is no other restriction related to equality joins.
I think the confusion is with the merge join. On page 708, in Chapter 15 (Query Processing), we can see that this algorithm can be used only to compute natural joins and equi-joins.
While we're on the topic, a quick mention of the hash join: just like the merge join, it can be used only to compute natural joins and equi-joins.
I know the algorithm for a hash left outer join is to build a hash table on the right table, then loop through the left table and probe the hash table to see whether there is a match. But how does a full outer join work? After you scan through the values in the left table, you would still need a way to get the tuples in the right table that didn't have matches in the left.
While looping through the probe records, you record which tuples in the build table have found a match: you just set a boolean to true for each one that matched. As a final pass in the algorithm, you scan the build table and output all tuples that did not match previously.
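A minimal Python sketch of that flag-based approach (illustrative only, not any particular engine's code), building a hash table on the right input and probing it with the left:

def hash_full_outer_join(left, right, key=lambda row: row[0]):
    build = {}  # join key -> list of [right_row, matched_flag]
    for r in right:
        build.setdefault(key(r), []).append([r, False])
    for l in left:  # probe phase
        matches = build.get(key(l), [])
        if matches:
            for entry in matches:
                entry[1] = True           # remember that this right row matched
                yield (l, entry[0])
        else:
            yield (l, None)               # left row with no right match
    for entries in build.values():        # final pass over the build table
        for row, matched in entries:
            if not matched:
                yield (None, row)         # right row that never matched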
There is an alternate strategy which is not used in RDBMSs as far as I'm aware: Build a combined hash table of left and right tuples. Treat that table as a map from hash key to a list of left tuples plus a list of right tuples. Build that table by looping through both input tables and adding all tuples to the hash table. After all tuples have been consumed, iterate over the hash table once and output the equality groups accordingly (either all left-tuples or all right-tuples or a cross-product of all left and all right tuples in the equality group).
The latter algorithm is nice for in-memory workloads (like in client applications). The former is good for an extremely (or unpredictably) large probe input, so RDBMSs use that one.
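And a sketch of the combined-table variant described above (again just illustrative Python): both inputs are hashed into one map of key -> (left rows, right rows), and each equality group is emitted once at the end.

from collections import defaultdict
from itertools import product

def combined_hash_full_outer_join(left, right, key=lambda row: row[0]):
    table = defaultdict(lambda: ([], []))
    for l in left:
        table[key(l)][0].append(l)
    for r in right:
        table[key(r)][1].append(r)
    for ls, rs in table.values():
        if ls and rs:
            yield from product(ls, rs)           # cross-product of the equality group
        elif ls:
            yield from ((l, None) for l in ls)   # left rows with no right match
        else:
            yield from ((None, r) for r in rs)   # right rows with no left match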
I want some help regarding join processing:
Nested loop join
Block nested loop join
Merge join
Hash join
I searched but did not find a link that also provides mathematical examples of the calculations.
e.g.
Consider the natural join R ⋈ S of relations R and S, with the following information about those relations:
Relation R contains 8,000 records and has 10 records per page
Relation S contains 2,000 records and has 10 records per page
Both relations are stored as sorted files on the join attribute
How many disk operations would it take to process each of the four joins above?
Do you have a specific DBMS in mind?
For Oracle, you'd have to know the block size, the setting of db_file_multiblock_read_count, the expected number of blocks already in the cache, the high water mark for each table, and the existing indexes and their clustering factor, to mention a few things that will affect the answer.
As a general rule, whenever I fully join two tables, I expect to see two full table scans and a hash join. Whenever I join parts of two tables, I expect to see a nested loop driven from the table with the most selective filter predicate.
Whenever I get surprised, I investigate the statistics and the above-mentioned things to validate the optimizer's choice.
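For the textbook-style numbers the question asks about, here is a rough Python sketch using the standard formulas from Silberschatz et al., counting only block transfers and ignoring seeks, caching and multiblock reads, so it will not match what a real optimizer estimates:

# S: 2,000 records at 10 per page -> 200 pages; R: 8,000 records at 10 per page -> 800 pages
n_S, n_R = 2_000, 8_000          # record counts
b_S, b_R = n_S // 10, n_R // 10  # 10 records per page -> 200 and 800 pages

# Tuple-at-a-time nested loop join, using the smaller relation S as the outer input:
nested_loop = n_S * b_R + b_S        # 2,000 * 800 + 200 = 1,600,200

# Block nested loop join (worst case, one buffer block per input), S as outer:
block_nested_loop = b_S * b_R + b_S  # 200 * 800 + 200 = 160,200

# Merge join: both files are already sorted on the join attribute, so each is read once:
merge_join = b_R + b_S               # 800 + 200 = 1,000

# Grace hash join with one partitioning pass (read and write both inputs, then read both):
hash_join = 3 * (b_R + b_S)          # 3 * 1,000 = 3,000

print(nested_loop, block_nested_loop, merge_join, hash_join)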