Preferred INNER JOIN linking sequence

I am linking three tables from the FDA's MedWatch system that have a common Individual Safety Report, or ISR, field. The three tables are Demographics (unique record), Drugs (unique record, but linked to the Demographics record) and Reactions (one or more records, each linked to either the Drugs or the Demographics record).
My question is: since Reactions.ISR can be INNER JOINed to either the Drugs or the Demographics table, is there a preferred way?
E.g. it can be:
SELECT Demographics.Case, Demographics.ISR, Drugs.DrugName, Reactions.PT
FROM (Reactions INNER JOIN Drugs ON Reactions.ISR = Drugs.ISR)
INNER JOIN Demographics ON Drugs.ISR = Demographics.ISR
which links in a Demographics <- Drugs <- Reactions hierarchy
or:
SELECT Demographics.Case, Demographics.ISR, Drugs.DrugName, Reactions.PT
FROM Reactions INNER JOIN (Drugs INNER JOIN Demographics ON Drugs.ISR = Demographics.ISR)
ON Reactions.ISR = Demographics.ISR
which independently links the Drugs record and the Reactions record(s) to the Demographics record.
Both return the same recordset, but I wondered if one method was preferred over the other as a best practice, perhaps making the query execute faster. In other words, is it possible to improve query performance by altering the JOIN sequence?

To compare the performance of two queries, look at their execution plans. In this case, for instance, you might find that you get identical plans, which would prove that the ordering doesn't matter.

How a query is optimized depends on the database.
In terms of inner joins (and WHERE clauses), it's often best to start from the most restrictive table (or condition) first and then link out to the "wider" joins/conditions.
Given that these joins aren't that restrictive, i.e. there are no further conditions applied, they'll probably be equivalent; look at the query plans of the two approaches to check.
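For what it's worth, the equivalence of the two orderings is easy to check empirically. Below is a minimal sketch in Python with sqlite3 (made-up data; the Case column is renamed CaseNo here to avoid the SQL reserved word) showing that both join orderings return the same recordset:

```python
import sqlite3

# Toy versions of the MedWatch tables (made-up data for illustration).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Demographics (ISR INTEGER PRIMARY KEY, CaseNo TEXT);
CREATE TABLE Drugs        (ISR INTEGER, DrugName TEXT);
CREATE TABLE Reactions    (ISR INTEGER, PT TEXT);
INSERT INTO Demographics VALUES (1, 'C1'), (2, 'C2');
INSERT INTO Drugs        VALUES (1, 'DrugA'), (2, 'DrugB');
INSERT INTO Reactions    VALUES (1, 'Nausea'), (1, 'Rash'), (2, 'Headache');
""")

# Variant 1: Demographics <- Drugs <- Reactions (Reactions joined to Drugs).
q1 = """
SELECT d.CaseNo, d.ISR, g.DrugName, r.PT
FROM Reactions r
INNER JOIN Drugs g        ON r.ISR = g.ISR
INNER JOIN Demographics d ON g.ISR = d.ISR
ORDER BY d.ISR, r.PT
"""

# Variant 2: Drugs and Reactions each joined directly to Demographics.
q2 = """
SELECT d.CaseNo, d.ISR, g.DrugName, r.PT
FROM Drugs g
INNER JOIN Demographics d ON g.ISR = d.ISR
INNER JOIN Reactions r    ON r.ISR = d.ISR
ORDER BY d.ISR, r.PT
"""

rows1 = con.execute(q1).fetchall()
rows2 = con.execute(q2).fetchall()
```

With one-to-one ISR links between Demographics and Drugs, both variants produce the identical recordset; only the plan (if any difference exists) would distinguish them.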

Related

Merging Legacy Data on Best Key

I am bringing in a field from a legacy system that does not have a Primary Key-Foreign Key relationship with the new table. The data is transactional, and each line has a customer and sales rep.
The legacy field has a many-to-many relationship with customer (but only for some customers); it becomes one-to-many when you link on customer and sales rep. However, the data is messy and a transaction may not match a sales rep exactly.
It seems that the best way to tackle this problem is to join on customer and sales rep when possible, if there is not a match, then just join on customer.
I was able to do this in Excel by using the following:
=IFERROR(VLOOKUP(Customer_SalesRep_Combo, DataTable, 3, FALSE),VLOOKUP(Customer,Datatable,3,FALSE))
This function works in Excel, but the spreadsheet is so large that it tends to crash, so I am trying to duplicate the logic in SQL.
Note that the legacy system just outputs CSV files, so I uploaded that CSV to the cloud, and now I am using Databricks to convert that into a Spark dataframe, so I can use SQL logic on it.
Initially, my idea was to do a left join using both conditions (which matches 50k of my 80k rows) and a second left join using only the customer condition. I would then bring in the legacy field twice (twice if matched, once if not), and use a CASE statement to take the "soft match" only when there was no hard match. However, due to the many-to-many relationship, I get join duplication on the left join. Since I am also bringing in sales data, I cannot have any duplication. I could live with some inaccuracy, though, if I could just use the first match and suppress the duplicates.
I have seen examples of using case statements in joins, but I do not know how to use that in this case. If I cannot get this to work, I will resort to iterating over the dataframes to match the logic in Scala, but I would prefer a SQL solution.
My code is below. The real version contains more fields, but this is the simplest I could get while retaining the basic logic.
SELECT
InnerQry.Customer,
InnerQry.SalesRep,
InnerQry.Sales,
CASE
WHEN InnerQry.LegacyFieldHard IS NULL
THEN InnerQry.LegacyFieldSoft
ELSE InnerQry.LegacyFieldHard
END AS LegacyField
FROM
(SELECT
A.Customer,
A.SalesRep,
A.Sales,
B.LegacyFieldHard,
C.LegacyFieldSoft
FROM
DBS AS A
LEFT JOIN
LEGACY AS B ON A.Customer = B.Customer AND A.SalesRep = B.SalesRep
LEFT JOIN
LEGACY AS C ON A.Customer = C.Customer) AS InnerQry
The main problem here is that you get multiple rows when you map based on Customer alone (LEGACY C). To avoid this you can create a row number field and restrict it to 1, provided you don't really care which of that customer's records gets mapped:
SELECT
A.Customer,
A.SalesRep,
A.Sales,
COALESCE(B.LegacyField,C.LegacyField) as LegacyField
FROM DBS AS A
LEFT JOIN LEGACY AS B ON A.Customer=B.Customer AND A.SalesRep=B.SalesRep
LEFT JOIN
(select *,
row_number() Over (partition by Customer order by SalesRep) as rownum1
from LEGACY) AS C ON A.Customer=C.Customer and C.rownum1=1
Also, you could use the COALESCE function directly instead of the CASE statement. It automatically takes the first non-null value, i.e. C's value is used only when B's is NULL. Hope this helps.
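The answer above can be sketched end to end with Python's sqlite3 (made-up data; ROW_NUMBER needs SQLite 3.25+). The Acme/Bob row has no exact rep match, so it falls back to that customer's first legacy row, and no join duplication occurs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DBS    (Customer TEXT, SalesRep TEXT, Sales INTEGER);
CREATE TABLE LEGACY (Customer TEXT, SalesRep TEXT, LegacyField TEXT);
INSERT INTO DBS    VALUES ('Acme','Ann',100), ('Acme','Bob',200), ('Zeta','Cal',50);
INSERT INTO LEGACY VALUES ('Acme','Ann','L-Ann'), ('Acme','Zed','L-Zed');
""")

rows = con.execute("""
SELECT a.Customer, a.SalesRep, a.Sales,
       COALESCE(b.LegacyField, c.LegacyField) AS LegacyField
FROM DBS AS a
-- Hard match: customer and sales rep.
LEFT JOIN LEGACY AS b
       ON a.Customer = b.Customer AND a.SalesRep = b.SalesRep
-- Soft match: customer only, restricted to one row per customer.
LEFT JOIN (SELECT *,
                  ROW_NUMBER() OVER (PARTITION BY Customer
                                     ORDER BY SalesRep) AS rownum1
           FROM LEGACY) AS c
       ON a.Customer = c.Customer AND c.rownum1 = 1
ORDER BY a.Customer, a.SalesRep
""").fetchall()
```

The same SQL (minus the SQLite-specific setup) should run unchanged on Spark SQL in Databricks, which also supports ROW_NUMBER and COALESCE.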

SQL Server Performance: LEFT JOIN vs NOT IN

The output of these two queries is the same. Just wondering if there's a performance difference between these 2 queries, if the requirement is to list down all the customers who have not purchased anything? The database used is the Northwind sample database. I'm using T-SQL.
select companyname
from Customers c
left join Orders o on c.customerid = o.customerid
where o.OrderID is null
select companyname
from Customers c
where Customerid not in (select customerid from orders)
If you want to find out empirically, I'd try them on a database with one customer who placed 1,000,000 orders.
And even then you should definitely keep in mind that the results you'll be seeing are valid only for the particular optimiser you're using (comes with particular version of particular DBMS) and for the particular physical design you're using (different sets of indexes or different detailed properties of some index might yield different performance characteristics).
Potentially the second is faster if the tables are indexed. So if orders has an index on customer ID, then NOT IN will mean that you aren't bringing back the entire ORDERS table.
But as Erwin said, a lot depends on how things are set up. I'd tend to go for the second option as I don't like bringing in tables unless I need data from them.
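While testing the two forms empirically, one correctness caveat is worth checking alongside performance: they stop being equivalent if Orders.customerid can be NULL, because NOT IN over a set containing NULL matches nothing. A sketch with Python's sqlite3 (made-up data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customers (customerid INTEGER, companyname TEXT);
CREATE TABLE Orders    (orderid INTEGER, customerid INTEGER);
INSERT INTO Customers VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO Orders    VALUES (10, 1);
""")

left_join = """
SELECT companyname FROM Customers c
LEFT JOIN Orders o ON c.customerid = o.customerid
WHERE o.orderid IS NULL
"""
not_in = """
SELECT companyname FROM Customers
WHERE customerid NOT IN (SELECT customerid FROM Orders)
"""

# With no NULLs in Orders.customerid the two queries agree.
same = con.execute(left_join).fetchall() == con.execute(not_in).fetchall()

# Add an order whose customerid is NULL: NOT IN now returns nothing,
# while the LEFT JOIN form still finds the customer with no orders.
con.execute("INSERT INTO Orders VALUES (11, NULL)")
after_null_left = con.execute(left_join).fetchall()
after_null_notin = con.execute(not_in).fetchall()
```

In Northwind, Orders.CustomerID allows NULLs in some editions, so NOT EXISTS is often suggested as the NULL-safe (and typically well-optimized) alternative.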

Oracle Implicit Partition Pruning

I am trying to optimize a long-running stored procedure at my company. From checking the query plan, it looks like we could make some nice gains by writing the query to allow for better partition pruning. The trouble is, it seems like doing so would create a very verbose query. Essentially, we have a bunch of tables that have a foreign key to client and "sub-client". In many cases, data is not shared between clients/sub-clients so we partitioned on those IDs for each table. Here's a sample query to show what I mean:
SELECT ...
FROM CLIENT_PRODUCT cp
INNER JOIN ORDER o ON o.product_id = cp.id
INNER JOIN PRICE_HISTORY ph on ph.product_id = cp.id
WHERE cp.id = ?
All of the tables have a foreign key that references a client and sub-client. The same client product cannot belong to two different clients or sub-clients. (Sorry, this example uses made-up tables and is a bit contrived.)
I can improve partition pruning by doing the following:
SELECT ...
FROM CLIENT_PRODUCT cp
INNER JOIN ORDER o ON o.product_id = cp.id and o.client_id = l_client_id and o.sub_client_id = l_sub_client_id
INNER JOIN PRICE_HISTORY ph on ph.product_id = cp.id and ph.client_id = l_client_id and ph.sub_client_id = l_sub_client_id
WHERE cp.id = ? and cp.client_id = l_client_id and cp.sub_client_id = l_sub_client_id
With this change, I just explicitly say which partition Oracle can look at for each join. This feels pretty gross, though, because I've added a bunch of mostly repeated SQL that doesn't change what is returned. The same pattern would need to be applied to many joins (more than in the example).
I know that our application has an invariant that any Order for a Product must belong to the same Client and Sub-Client. Likewise, any Price-History item must belong to the same Client and Sub-Client as the Product. The same idea applies to many pairs of tables. In an ideal world, Oracle would be able to infer the Client and Sub-Client for each join from the other tables in the join because of that invariant. It does not seem to be doing that (and I understand that my specific invariant does not apply to everyone). Is there a way I can get Oracle to do this implicit partition pruning without me needing to add all those additional conditions? It seems like that would add a lot of value across the codebase and remove the need for all these "unnecessary" explicit joins.
There's also the possibility that I'm just totally overthinking / misunderstanding this so other suggestions would be great.

Is it more efficient to use the CASE in the original query or in a separate query?

I can't show the real query, but here is an example of the type of thing I'm doing:
select
t1.contract,
t1.state,
t1.status,
t2.product,
case
when t2.product_cost > 100 and t3.status = 'Closed' then 'Success'
when t2.product_cost <= 100 and t3.status = 'Closed' then 'Semi-Success'
else 'Failure'
end as performance,
t3.sales_date
from contract_table as t1
left join product_table as t2 on t1.prodkey = t2.prodkey
left join sales_table as t3 on (t1.client_number = t3.client_number and t1.contract=t3.contract)
where t1.client_number in (1, 2, 5, 8, 10)
The tables involved currently have 27 million records and are growing fast.
This query will be put in a view and then used to generate various summaries.
My question is this. Would it be more efficient to join the 3 tables into 1 view that has the detail needed to do the case statements and then run a second query that creates the new variables using the case statements? Or is it more efficient to do what I'm doing here?
Basically, I'm ignorant as to how SQL processes the SELECT statement and handles the WHERE clause filtering on clients from the contract table but not the sales table, even though the client_number field is in both.
All other things being equal, the only thing I can see that would change the efficiency one way or the other is whether your outer query has WHERE conditions. If the outer query performed on the view limits the number of records returned, then it is more efficient to put the CASE statements there: the CASE is then evaluated only on the rows that pass those conditions, rather than on every row that passes the view's conditions, only to have those values thrown away by the outer query.
With views, I tend to keep to pretty raw data, as much as possible. Partly, for this reason, so that any query operating on the view, after deciding what rows to use, can do the necessary operations only on those rows.
As for how SQL accounts for filtering on clients from the contract table but not the sales table, think through both the WHERE clause and the joins. The WHERE clause says to grab only the records where the contract table's client is 1, 2, 5, 8, or 10. The join condition says to grab only the records from sales where that table's client number matches the contract table's client number. If the only records grabbed from contract are for clients 1, 2, 5, 8, and 10, then the only records from sales that satisfy the join will be the ones whose client numbers are also 1, 2, 5, 8, or 10. Does that make sense?
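To make the advice concrete, here is a minimal sketch in Python's sqlite3 (made-up tables and data modeled on the question): the view keeps raw detail only, and the CASE runs in the outer query, so it is evaluated just on the rows that survive the filter:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE contract_table (contract TEXT, prodkey INTEGER, client_number INTEGER);
CREATE TABLE product_table  (prodkey INTEGER, product_cost INTEGER);
CREATE TABLE sales_table    (client_number INTEGER, contract TEXT, status TEXT);
INSERT INTO contract_table VALUES ('K1', 1, 1), ('K2', 2, 1);
INSERT INTO product_table  VALUES (1, 150), (2, 50);
INSERT INTO sales_table    VALUES (1, 'K1', 'Closed'), (1, 'K2', 'Closed');

-- The view exposes raw columns only; no derived values.
CREATE VIEW contract_detail AS
SELECT t1.contract, t1.client_number, t2.product_cost, t3.status AS sale_status
FROM contract_table t1
LEFT JOIN product_table t2 ON t1.prodkey = t2.prodkey
LEFT JOIN sales_table  t3 ON t1.client_number = t3.client_number
                         AND t1.contract = t3.contract;
""")

# Outer query filters first, then derives `performance` with CASE.
rows = con.execute("""
SELECT contract,
       CASE WHEN product_cost > 100 AND sale_status = 'Closed' THEN 'Success'
            WHEN product_cost <= 100 AND sale_status = 'Closed' THEN 'Semi-Success'
            ELSE 'Failure'
       END AS performance
FROM contract_detail
WHERE contract = 'K1'
""").fetchall()
```

Whether the optimizer would in practice evaluate the CASE before or after the filter varies by engine, so treat this as an illustration of the structure, not a performance guarantee.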

TSQL Joining three tables where none have a common column

This will probably be pretty simple, as I am very much a novice.
I have two tables I have imported from Excel, and I need to update an existing table of e-mail addresses based on the email addresses from the spreadsheet.
My only issue is I cannot join on a common column, as none of the tables share a column.
So I am wondering if, in a JOIN, I can just put something like:
FROM table_a AS a
INNER JOIN table_b AS b ON b.name = a.nameplus
Any help would be appreciated!
A join without matching predicates is effectively a cross join: i.e. every row in table A matched with every row in table B.
If you specify an INNER JOIN then you have to have an ON term, which either matches something or it doesn't: in your example you may get a technical match (i.e. b.name really does, perhaps totally coincidentally, equal a.nameplus) that makes no business sense.
So you have either:
a CROSS JOIN: no way of linking the tables, but the result is all possible combinations of rows
Or:
an INNER JOIN, where you must specify how the rows are to be combined (or a LEFT/RIGHT OUTER JOIN if you want to include all rows from one side, regardless of matched-ness)
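A small sketch of the two options in Python's sqlite3 (hypothetical table and column names): a CROSS JOIN yields every combination, while an INNER JOIN with a computed predicate keeps only meaningful pairs. Here I assume, purely for illustration, that nameplus is just name with a "-1" suffix:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table_a (nameplus TEXT);
CREATE TABLE table_b (name TEXT);
INSERT INTO table_a VALUES ('x-1'), ('y-1');
INSERT INTO table_b VALUES ('x'), ('y');
""")

# CROSS JOIN: all 2 x 2 = 4 combinations, no linking condition at all.
cross = con.execute(
    "SELECT b.name, a.nameplus FROM table_b b CROSS JOIN table_a a"
).fetchall()

# INNER JOIN on a derived predicate: match rows where nameplus equals
# name plus a '-1' suffix (a made-up rule for illustration).
inner = con.execute("""
SELECT b.name, a.nameplus
FROM table_b b
INNER JOIN table_a a ON a.nameplus = b.name || '-1'
ORDER BY b.name
""").fetchall()
```

If the real data has some derivable relationship like this (a prefix, a concatenation, a substring), an expression in the ON clause is usually the cleanest way to express it.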