I know the algorithm for a hash left outer join is to build a hash table on the right table and then loop through the left table, probing the hash table to see whether there is a match. But how does a full outer join work? After you scan through the values in the left table, you still need a way to get the tuples in the right table that didn't have matches in the left.
While looping through the probe (left) records you record which build-side (right) tuples have found a match: you just set a boolean flag to true on each hash table entry that matched. As a final pass in the algorithm you scan the build table and output all tuples that did not match previously, padded with NULLs on the probe side.
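A minimal sketch of that match-flag approach in Python (hypothetical row format and key-accessor functions, assuming the build side fits in memory):

# Hash full outer join sketch: build on the right input, probe with the left,
# and remember which build entries found a match.
def hash_full_outer_join(left, right, left_key, right_key):
    build = {}                                    # join key -> list of [right_row, matched_flag]
    for r in right:
        build.setdefault(right_key(r), []).append([r, False])
    out = []
    for l in left:                                # probe phase
        entries = build.get(left_key(l), [])
        if entries:
            for entry in entries:
                out.append((l, entry[0]))
                entry[1] = True                   # mark the build tuple as matched
        else:
            out.append((l, None))                 # left row with no match (left outer part)
    for entries in build.values():                # final pass over the build table
        for row, matched in entries:
            if not matched:
                out.append((None, row))           # right rows that never matched (right outer part)
    return out

Here left_key and right_key are just accessors for the join column; in a real engine the matched flag would live inside the hash table entry itself.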
There is an alternate strategy, which as far as I'm aware is not used in RDBMSs: build a combined hash table of left and right tuples. Treat that table as a map from the hash key to a list of left tuples plus a list of right tuples. Build it by looping through both input tables and adding every tuple to the hash table. After all tuples have been consumed, iterate over the hash table once and output each equality group accordingly (either all left tuples, or all right tuples, or the cross-product of the left and right tuples in the group).
The latter algorithm is nice for in-memory workloads (like in client applications). The former is good for an extremely (or unpredictably) large probe input, so RDBMSs use that one.
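A hedged sketch of that combined-table variant (same hypothetical helpers as above):

# Combined-table variant: one map from join key to (left rows, right rows),
# filled from both inputs, then each equality group is emitted in one pass.
def combined_hash_full_outer_join(left, right, left_key, right_key):
    groups = {}
    for l in left:
        groups.setdefault(left_key(l), ([], []))[0].append(l)
    for r in right:
        groups.setdefault(right_key(r), ([], []))[1].append(r)
    out = []
    for lefts, rights in groups.values():
        if lefts and rights:
            out.extend((l, r) for l in lefts for r in rights)   # cross-product of the group
        elif lefts:
            out.extend((l, None) for l in lefts)                # left-only group
        else:
            out.extend((None, r) for r in rights)               # right-only group
    return out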
While studying database related content, I came across the following:
When table R is vertically partitioned into R1, R2, R3..., Rn, it is
said that it can be expressed as R = R1⋈R2⋈R3...⋈Rn. (⋈ is the join
symbol)
However, if the join symbol is used without any special conditions it becomes a Cartesian product, so doesn't that result in many more tuples than in the original table R? Can you please explain why the existing table can be represented as a join of its partitions?
⋈ is natural join (when used with 2 relation arguments and no others). Natural join is a Cartesian product only when there are no common columns. But in the quote the assumption is that the partitioning is done so that there is a set of key columns that all the partitions keep. So the Cartesian product doesn't come up: every partition has the same number of rows and the same key content, and they differ only in the additional columns they carry. Each join adds some more of that additional column data until you get back the original table.
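A made-up example to illustrate: suppose R(id, name, city) is vertically partitioned so that every partition keeps the key column:

R1(id, name): (1, Ann), (2, Bob)
R2(id, city): (1, Oslo), (2, Riga)

R1 ⋈ R2 matches rows on the shared id column and reattaches the remaining columns, giving back exactly (1, Ann, Oslo) and (2, Bob, Riga), i.e. the original two rows of R rather than a 2 × 2 Cartesian product.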
That operator means natural join. See https://en.m.wikipedia.org/wiki/Relational_algebra:
Natural join (⋈) is a binary operator that is written as (R ⋈ S) where R and S are relations.[2] The result of the natural join is the set of all combinations of tuples in R and S that are equal on their common attribute names.
In a report I have the following join from a FACT table:
Join…
LEFT JOIN DimState AS s
ON s.StateCode = l.Province AND l.Locale LIKE (s.CountryCode + '%')
More information:
Fact table has 59,567,773 rows
L.Province can match a StateCode in DimState: 42,346,471 rows (71%)
L.Province can't match a StateCode in DimState: 13,742,966 rows (23%); most of them have a blank value in L.Province.
L.Province is NULL in 3,500,000 rows (6%)
4 questions:
-The correct thing to do would be to replace the L.Province NULLs and blanks with "other", and have an entry in DimState with StateCode "other", right?
-Is it acceptable to LEFT JOIN to a dimension, or should it always be an INNER JOIN?
-Is it correct to join to a dimension on 2 columns?
-To make it a plain l.Locale = s.CountryCode comparison, should I modify the values in l.Locale or in s.CountryCode?
In order of your four questions:
Yes, you should not have blanks for dimension keys in your fact tables. If the value in the source data is in fact null or empty, there should be members in your dimension tables which are set aside to reflect this.
Therefore, building off 1, you should GENERALLY not do left joins when joining facts to dimensions. I say generally because there might be a situation where this is necessary, but I can't think of anything off the top of my head. You should not have to with properly designed fact and dimension tables.
Generally, no. I would recommend using a surrogate key in this case since your business key is spread across two columns.
Not sure what you are asking here. If you keep this design, you would need to change both. If you switch to using a surrogate key for DimState, you would only have to update the dimension table whenever anything changes.
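To make point 1 concrete, here is a minimal sketch (Python, with invented surrogate key values, not your actual schema) of an ETL lookup that maps NULL/blank Province values to a dedicated "Unknown" dimension member so the fact row always carries a valid dimension key:

# Hypothetical surrogate keys reserved for special DimState members.
UNKNOWN_STATE_KEY = -1         # value missing or blank in the source
NOT_APPLICABLE_STATE_KEY = -2  # the dimension does not apply to this fact row

def resolve_state_key(province, state_lookup):
    # state_lookup is a dict built from DimState: StateCode -> surrogate key.
    if province is None or province.strip() == "":
        return UNKNOWN_STATE_KEY
    return state_lookup.get(province, UNKNOWN_STATE_KEY)

With every fact row resolved this way during the load, the cube can safely INNER JOIN to DimState.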
To build on what mallan1121 said:
1: There are generally three different meanings for null/blank in data warehousing.
A. I don't know the value
B. The value is known and it is blank
C. The value does not apply.
Make sure you consider the relevance for each option as you design your warehouse. The fact should ALWAYS reference a dimension key or you will end up with data quality issues.
2: It can be useful to use left joins if you are abstracting your tables from your cube using views (a good idea) and you may also use those views for non-cube reporting. The reason is that an inner join is a filtering join: the result set is filtered by every inner-joined table even if only a single column is returned.
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
JOIN DimA ON DimA.Key = Fact.DimAKey --join keys here are placeholders
JOIN DimB ON DimB.Key = Fact.DimBKey --filters result
JOIN DimC ON DimC.Key = Fact.DimCKey --filters result
If you use left joins and you only want columns from some of the tables, the other joins can be eliminated by the optimizer and those tables are never accessed (assuming their join keys are unique, as dimension keys should be).
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
LEFT JOIN DimA ON DimA.Key = Fact.DimAKey
LEFT JOIN DimB ON DimB.Key = Fact.DimBKey --ignored
LEFT JOIN DimC ON DimC.Key = Fact.DimCKey --ignored
This can speed up reporting queries run directly against the SQL database. However, you must make sure your ETL process enforces the integrity so that the results returned are identical whether inner or left joins are used.
4: Requiring multiple columns in the join is not a problem, but I'd be very concerned about a multiple column join using a wildcard. I expect you have a granularity issue in your dimension. I don't know your data, but using a wildcard risks getting multiple values back from that dimension.
Do not do this, for one simple reason: you would end up with 13M records with the key L.Province = 'Other' feeding your dimension table, and each record from the fact table with s.StateCode = 'Other' would be joined with every matching 'Other' dimension record, leading to massive duplication of the measures.
The proper answer is to enforce the primary key on your dimension. Typically a dimension has one record with the key Other (meaning the key is not known) and possibly one other record NA (the dimension has no meaning for this fact record).
The problem is not in the OUTER join; what should be enforced by design is that all foreign keys in the fact table are defined in the dimension table.
One step to achieve this is the definition of NA and Other as described in 1.
The rationale behind this approach is to enforce that INNER and OUTER joins lead to the same result, i.e. do not cause confusion with different results.
Again, each dimension should have a PRIMARY KEY defined. If the PK consists of two columns, a join on those columns is fine. (The typical scenario in a DWH, though, is a single-column numeric PK.)
What should be avoided is a join on LIKE or SUBSTR; this indicates that the dimension PK is not well defined.
If your dimension has a two-column PK (Locale + Province), the fact table must be updated to contain these two columns as a foreign key.
Wikipedia says:
First prepare a hash table of the smaller relation. The hash table
entries consist of the join attribute and its row. Because the hash
table is accessed by applying a hash function to the join attribute,
it will be much quicker to find a given join attribute's rows by using
this table than by scanning the original relation.
It appears as if the speed of this join algorithm is due to the fact that we hash R (the smaller relation) but not S (the other, larger one).
My question is: how do we compare the hashed versions of R's rows with S without running the hash function on S as well? Do we presume the DB stores one for us?
Or am I wrong in assuming that S is not hashed, and the speed advantage comes from comparing hashes (small, mostly unique) as opposed to reading through the actual data of the rows (not unique, possibly large)?
The hash function will also be used on the join attribute in S.
I think that the meaning of the quoted paragraph is that applying the hash function on the attribute, finding the correct hash bucket and following the linked list will be faster than searching for the corresponding row of the table R with a table or index scan.
The trade-off for this speed gain is the cost of building the hash.
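In other words, both sides go through the same hash function; only R is materialized into a table. A hedged Python sketch (the dict applies hash() to the join key on both insert and lookup):

# Build a hash table on R (the smaller relation), then probe it with S.
def hash_join(r_rows, s_rows, r_key, s_key):
    table = {}
    for r in r_rows:
        table.setdefault(r_key(r), []).append(r)   # hash(r_key(r)) chooses the bucket
    out = []
    for s in s_rows:
        for r in table.get(s_key(s), []):          # hash(s_key(s)) chooses the same bucket
            out.append((r, s))                     # actual key values are compared, not just hashes
    return out

Each S row costs one in-memory lookup instead of a scan of R; the price paid up front is building the table.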
I read that Oracle supports merge join with inequality join predicates.
Is there an online reference to the algorithm used in the implementation of such a join?
If anyone knows how this is done, can you put it in an answer?
This is what you're looking for.
7.4 Sort Merge Joins
Sort merge joins can join rows from two independent sources. In
general, hash joins perform better than sort merge joins. However,
sort merge joins can perform better than hash joins if both of the
following conditions exist:
The row sources are sorted. A sort operation is not required. However,
if a sort merge join involves choosing a slower access method (an
index scan as opposed to a full table scan), then the benefit of using
a sort merge might be lost.
Sort merge joins are useful when the join condition between two tables
is an inequality condition such as <, <=, >, or >=. Sort merge joins
perform better than nested loops joins for large data sets. Hash joins
require an equality condition.
In a merge join, there is no concept of a driving table. The join
consists of two steps:
Sort join operation
Both the inputs are sorted on the join key.
Merge join operation
The sorted lists are merged.
If the input is sorted by the join column, then a sort join operation
is not performed for that row source. However, a sort merge join
always creates a positionable sort buffer for the right side of the
join so that it can seek back to the last match in the case where
duplicate join key values come out of the left side of the join.
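Oracle does not publish its implementation, but the duplicate-handling behavior described above can be sketched roughly like this (an equality-join sketch in Python; the mark variable plays the role of the positionable sort buffer that lets the right side seek back):

# Sort merge join sketch that seeks back on the right side when duplicate
# join keys come out of the left side.
def sort_merge_join(left, right, key):
    left = sorted(left, key=key)       # sort join operation
    right = sorted(right, key=key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if key(left[i]) < key(right[j]):
            i += 1
        elif key(left[i]) > key(right[j]):
            j += 1
        else:
            mark = j                   # remember where this key group starts on the right
            k = key(left[i])
            while i < len(left) and key(left[i]) == k:
                j = mark               # seek back to the last match
                while j < len(right) and key(right[j]) == k:
                    out.append((left[i], right[j]))
                    j += 1
                i += 1
    return out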
There's an example here: http://www.serkey.com/oracle-skyline-query-challenge-bdh859.html
Is this what you're looking to do? (key word is "soft-merge")
I want some help regarding join processing
Nested Loop Join
Block Nested loop join
Merge join
Hash join
I searched but did not find a link that also provides mathematical examples of the calculations.
e.g.
Consider the natural join of relations R and S, with the following information about those relations:
Relation R contains 8,000 records and has 10 records per page
Relation S contains 2,000 records and has 10 records per page
Both relations are stored as sorted files on the join attribute
How many disk operations would it take to process each of the four joins above?
Do you have a specific dbms in mind?
For Oracle, you'd have to know the block size, the configuration for db_file_multiblock_read_count, the expected number of blocks already in cache, the high-water mark for each table, and existing indexes and their clustering factor, to mention a few things that will affect the answer.
As a general rule, whenever I fully join two tables, I expect to see two full table scans and a hash join. Whenever I join parts of two tables, I expect to see a nested loop driven from the table with the most selective filter predicate.
Whenever I get surprised, I investigate the statistics and the above-mentioned things to validate the optimizer's choice.
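That said, if you only want the classic textbook page-I/O numbers for the example (ignoring caching, block size and all the Oracle specifics above), then with M = 8,000 / 10 = 800 pages for R, N = 2,000 / 10 = 200 pages for S, and B buffer pages, the usual cost formulas give roughly:

- Sort merge join: both files are already sorted on the join attribute, so no sort passes are needed; one pass over each file is about M + N = 800 + 200 = 1,000 page reads.
- Tuple-at-a-time nested loop join (S as the outer relation): N + |S| * M = 200 + 2,000 * 800 = 1,600,200 page reads.
- Page-oriented nested loop join: N + N * M = 200 + 200 * 800 = 160,200 page reads.
- Block nested loop join with B buffer pages (S outer): N + ceil(N / (B - 2)) * M; for example, with B = 52 that is 200 + 4 * 800 = 3,400 page reads.
- Hash join (partitioned/Grace, neither input memory-resident): roughly 3 * (M + N) = 3,000 page reads; if the 200 pages of S fit in memory, a single pass of M + N = 1,000 suffices.

These are back-of-the-envelope estimates, not what any particular DBMS will report.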