SQL Server Performance: LEFT JOIN vs NOT IN

The output of these two queries is the same. I'm just wondering whether there is a performance difference between them when the requirement is to list all customers who have not purchased anything. The database used is the Northwind sample database, and I'm using T-SQL.
select companyname
from Customers c
left join Orders o on c.CustomerID = o.CustomerID
where o.OrderID is null;

select companyname
from Customers c
where CustomerID not in (select CustomerID from Orders);

If you want to find out empirically, I'd try them on a database with one customer who placed 1,000,000 orders.
And even then, keep in mind that the results you see are valid only for the particular optimiser you're using (i.e. the particular version of the particular DBMS) and for the particular physical design you have (different sets of indexes, or different properties of a given index, may yield different performance characteristics).
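A minimal sketch of such an empirical test, assuming the Northwind table and column names above; the statistics output is only meaningful for your own data and indexes:

-- Run in SSMS with "Include Actual Execution Plan" enabled to compare the two variants.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant 1: LEFT JOIN ... IS NULL
select companyname
from Customers c
left join Orders o on c.CustomerID = o.CustomerID
where o.OrderID is null;

-- Variant 2: NOT IN
select companyname
from Customers c
where CustomerID not in (select CustomerID from Orders);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;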

Potentially the second is faster if the tables are indexed. So if orders has an index on customer ID, then NOT IN will mean that you aren't bringing back the entire ORDERS table.
But as Erwin said, a lot depends on how things are set up. I'd tend to go for the second option as I don't like bringing in tables unless I need data from them.

Related

How to compare tables for possible combinations to match people

I have two tables: a table that we call "The Vault" (Table B), which has a ton of demographic data, and Table A, which has some of the demographic data. I am new to SQL Server, and I was tasked with finding a list of 21 students in The Vault table (Table B). Problem: there is no primary key or anything distinctive besides FirstName, LastName, BirthMonth, Birthday, Birthyear.
Goal: We could not match these people in the conventional way, so we are attempting a Hail Mary to see which of these shared column combinations will perhaps land us a match.
What I have tried: I have placed both tables into temp tables, Table A and Table B, and I am trying to do an INNER JOIN, but then I realized that in order to make it work I would have to write a crazy join statement (see the code below).
But the problem, as you can imagine, is that it brings back far more than my current 21 records (it's in the thousands), so I would have to make that join statement even longer, and I am not sure this is the right way to do it. I believe that is where the WHERE clause would come in, no?
Question: How do I compare these two tables for possible matches using a WHERE clause, where I can mix and match different columns without having to put all the constraints in the ON clause of the JOIN? I don't want to JOIN on six different columns. Is there another way to do this, so I can perhaps learn? I understand this is easier when the tables share a primary key, which would then be the JOIN criterion, but I have never had to compare two tables just to find possible matches.
FROM #TableA a
INNER JOIN #TableB b ON (a.LAST_NAME = b.LAST_NAME AND a.FIRST_NAME = b.FIRST_NAME.....)
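A minimal sketch of one way to do this kind of mix-and-match comparison, assuming the temp tables are #TableA and #TableB with the columns named in the question: join on a single loose condition such as last name, score each candidate pair with CASE expressions, and keep only pairs above a threshold.

-- Hypothetical sketch: score candidate pairs instead of requiring every column to match.
SELECT
    a.FIRST_NAME, a.LAST_NAME,
    b.FIRST_NAME AS Vault_FIRST_NAME, b.LAST_NAME AS Vault_LAST_NAME,
    -- One point per matching attribute; tune the weights and threshold to your data.
      CASE WHEN a.FIRST_NAME = b.FIRST_NAME THEN 1 ELSE 0 END
    + CASE WHEN a.BirthMonth = b.BirthMonth THEN 1 ELSE 0 END
    + CASE WHEN a.Birthday   = b.Birthday   THEN 1 ELSE 0 END
    + CASE WHEN a.Birthyear  = b.Birthyear  THEN 1 ELSE 0 END AS MatchScore
FROM #TableA a
INNER JOIN #TableB b
    ON a.LAST_NAME = b.LAST_NAME   -- loose join keeps the candidate set small
WHERE
      CASE WHEN a.FIRST_NAME = b.FIRST_NAME THEN 1 ELSE 0 END
    + CASE WHEN a.BirthMonth = b.BirthMonth THEN 1 ELSE 0 END
    + CASE WHEN a.Birthday   = b.Birthday   THEN 1 ELSE 0 END
    + CASE WHEN a.Birthyear  = b.Birthyear  THEN 1 ELSE 0 END >= 3   -- e.g. at least 3 of the 4 other fields agree
ORDER BY MatchScore DESC;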

Merging Legacy Data on Best Key

I am bringing in a field from a legacy system that does not have a Primary Key-Foreign Key relationship with the new table. The data is transactional, and each line has a customer and sales rep.
The legacy field has a many-to-many relationship with customer (but only for some records); it becomes one-to-many when you link on customer and sales rep. However, the data is messy and a transaction may not match to a sales rep exactly.
It seems that the best way to tackle this problem is to join on customer and sales rep when possible; if there is not a match, then just join on customer.
I was able to do this in Excel by using the following:
=IFERROR(VLOOKUP(Customer_SalesRep_Combo, DataTable, 3, FALSE),VLOOKUP(Customer,Datatable,3,FALSE))
This formula works in Excel, but the spreadsheet is so large that it tends to crash, so I am trying to duplicate the logic in SQL.
Note that the legacy system just outputs CSV files, so I uploaded that CSV to the cloud, and now I am using Databricks to convert that into a Spark dataframe, so I can use SQL logic on it.
Initially, my idea was to do a left join using both conditions (which matches 50k of my 80k rows), and a left join using one condition. I would then bring in the legacy field twice (twice if matched, once if not), and use a CASE statement to only bring in the "soft match" if there was not a hard match. However, due to the many-to-many relationship, I would get join duplication on the left join. Since I am also bringing in sales data, I cannot have any duplication. I could live with some inaccuracy if I could just use the first match and suppress any duplication.
I have seen examples of using case statements in joins, but I do not know how to use that in this case. If I cannot get this to work, I will resort to iterating over the dataframes to match the logic in Scala, but I would prefer a SQL solution.
My code is below. The real version contains more fields, but this is the simplest I could get while retaining the basic logic.
SELECT
InnerQry.Customer,
InnerQry.SalesRep,
InnerQry.Sales,
CASE
WHEN InnerQry.LegacyFieldHard IS NULL
THEN InnerQry.LegacyFieldSoft
ELSE InnerQry.LegacyFieldHard
END AS LegacyField
FROM
(SELECT
A.Customer,
A.SalesRep,
A.Sales,
B.LegacyFieldHard,
C.LegacyFieldSoft
FROM
DBS AS A
LEFT JOIN
LEGACY AS B ON A.Customer = B.Customer AND A.SalesRep = B.SalesRep
LEFT JOIN
LEGACY AS C ON A.Customer = C.Customer) AS InnerQry
The main problem here is that you get multiple rows when you map based just on Customer (LEGACY C). To avoid this you can create a row-number field and restrict it to 1, provided you don't really care which of that customer's records gets mapped:
SELECT
A.Customer,
A.SalesRep,
A.Sales,
COALESCE(B.LegacyField,C.LegacyField) as LegacyField
FROM DBS AS A
LEFT JOIN LEGACY AS B ON A.Customer=B.Customer AND A.SalesRep=B.SalesRep
LEFT JOIN
(select *,
row_number() Over (partition by Customer order by SalesRep) as rownum1
from LEGACY) AS C ON A.Customer=C.Customer and C.rownum1=1
Also, you can use the COALESCE function directly instead of the CASE statement. It returns the first non-null value, i.e. the C value is taken only when B is NULL. Hope this helps.
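If the hard match on Customer and SalesRep can itself return more than one legacy row, the same ROW_NUMBER trick can be applied to both joins. A sketch under that assumption, reusing the table and column names from the answer above:

SELECT
    A.Customer,
    A.SalesRep,
    A.Sales,
    COALESCE(B.LegacyField, C.LegacyField) AS LegacyField
FROM DBS AS A
LEFT JOIN
    (SELECT *,
            ROW_NUMBER() OVER (PARTITION BY Customer, SalesRep ORDER BY Customer) AS rownum2
     FROM LEGACY) AS B
    ON A.Customer = B.Customer AND A.SalesRep = B.SalesRep AND B.rownum2 = 1   -- at most one hard match
LEFT JOIN
    (SELECT *,
            ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY SalesRep) AS rownum1
     FROM LEGACY) AS C
    ON A.Customer = C.Customer AND C.rownum1 = 1                               -- at most one soft match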

DB Performance - Left outer join over database function

This is a somewhat complex query which has multiple joins and returns a lot of records with several data fields. Let's say it is basically used to retrieve manager details.
First set of tables (already implemented query):
Select m.name, d.name, d.address, m.salary , m.age,……
From manager m,department d,…..etc
JOINS …..
Assume a manager can have zero or more employees.
Let's say I need to list all employee names for each and every manager in the result of the first set of tables, including managers who have no employees (which means I want to keep the manager list from the first set of tables as it is).
To do that I have to access the "employee" table through the "party" table (a few more tables might be involved).
Second set of tables (to be newly connected):
That means there are one or more joins with "employee", "party" and ….. etc.
I have two approaches on this.
1. Make a left outer join from the first set of tables to the second set of tables.
2. Create a user-defined function (UDF) at the DB level over the second set of tables. Then I pass the manager id into this UDF as a parameter and get all the employees (e1, e2, …) back as a formatted string by calling it in the select clause of the first query.
Can someone please suggest which of these two options is best in terms of DB performance?
Go for the JOIN, using appropriate WHERE clauses and indexes.
The database engine is far better at optimizing than you'll ever be. Let it do its job.
Your way sounds like (n+1) query death.
Write a sample query and ask your database to EXPLAIN PLAN to see what the cost is. If you spot a TABLE SCAN on a large table, revisit your indexes.
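A rough illustration of the JOIN approach; this is a sketch only, the manager/department/party/employee table and key names are assumptions based on the question, and STRING_AGG requires SQL Server 2017 or later:

-- Keep every manager (LEFT JOIN) and collapse their employees into one formatted string.
SELECT
    m.name AS manager_name,
    d.name AS department_name,
    STRING_AGG(e.name, ', ') AS employee_names   -- NULL when the manager has no employees
FROM manager m
JOIN department d    ON d.department_id = m.department_id
LEFT JOIN party p    ON p.manager_id    = m.manager_id
LEFT JOIN employee e ON e.party_id      = p.party_id
GROUP BY m.name, d.name;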

Joining against views in SQLServer with strange query optimizer behavior

I have a complex view that I use to pull a list of primary keys that indicate rows in a table that have been modified between two temporal points.
This view has to query 13 related tables and look at a changelog table to determine whether an entity is "dirty" or not.
Even with all of this going on, doing a simple query:
select * from vwDirtyEntities;
Takes only 2 seconds.
However, if I change it to
select
e.Name
from
Entities e
inner join vwDirtyEntities de
on e.Entity_ID = de.Entity_ID
This takes 1.5 minutes.
However, if I do this:
declare @dirtyEntities table
(
    Entity_ID uniqueidentifier
)
insert into @dirtyEntities
select * from vwDirtyEntities;
select
e.Name
from
Entities e
inner join @dirtyEntities de
on e.Entity_ID = de.Entity_ID
I get the same results in only 2 seconds.
This leads me to believe that SQL Server is evaluating the view per row when it is joined to Entities, instead of constructing a query plan that folds the single inner join above into the other joins in the view.
Note that I want to join against the full result set from this view, as it filters out only the keys I want internally.
I know I could make it into a materialized view, but that would involve schema binding the view and its dependencies, and I don't like the overhead maintaining the index would cause (this view is only queried for exports, while there are far more writes to the underlying tables).
So, aside from using a table variable to cache the view results, is there any way to tell SQL Server to cache the view while evaluating the join? I tried changing the join order (Select from the view and join against Entities), however that did not make any difference.
The view itself is also very efficient, and there is no room to optimize there.
There is nothing magical about a view. It's a macro that gets expanded: when you JOIN to it, the optimiser decides how to expand the view into the main query.
I'll address other points in your post:
you have ruled out an indexed view. A view can only be a discrete entity when it is indexed
SQL Server will never do a RBAR (row-by-agonizing-row) query on its own. Only developers can write loops.
there is no concept of caching: every query uses the latest data unless you use temp tables
you insist that the view is very efficient, but you have no idea how the optimizer treats it, and it has 13 tables
SQL is declarative: join order usually does not matter
Many serious DB developers don't use views because of limitations like this: they are not reusable because they are macros
Edit: another possibility is predicate pushing on SQL Server 2005. That is, SQL Server cannot push the JOIN condition "deeper" into the view.
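For reference, the indexed-view route that was ruled out would look roughly like this. This is a sketch only; dbo.ChangeLog and the single-key shape of the view are assumptions, and indexed views impose further restrictions on what the SELECT may contain:

-- Indexed view: requires WITH SCHEMABINDING, two-part table names,
-- COUNT_BIG(*) when grouping, and a unique clustered index to materialize the rows.
CREATE VIEW dbo.vwDirtyEntities
WITH SCHEMABINDING
AS
    SELECT e.Entity_ID, COUNT_BIG(*) AS ChangeCount
    FROM dbo.Entities e
    JOIN dbo.ChangeLog c ON c.Entity_ID = e.Entity_ID
    GROUP BY e.Entity_ID;
GO

CREATE UNIQUE CLUSTERED INDEX IX_vwDirtyEntities
    ON dbo.vwDirtyEntities (Entity_ID);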

In SQL Server, what is the most efficient way to compare records to other records for duplicates within a given range of values?

We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.
Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is with a cursor-based process, and it places a heavy burden on the database.
So I wanted to ask if there's a more efficient way to do this. I've heard it said that cursors can almost always be replaced by clever JOINs. But I have to admit I'm having a lot of trouble with this one.
For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.
(order_id, customer_id, product_id, quantity, sale_date, price)
We want to look through the records to find suspect duplicates on the following example criteria. These get increasingly harder.
Records that have the same product_id, sale_date, and quantity but different customer_id's should be marked as suspect duplicates for review
Records that have the same customer_id, product_id, quantity and have sale_dates within five days of each other should be marked as suspect duplicates for review
Records that have the same customer_id and product_id, quantities that differ by no more than 20 units, and sale dates within five days of each other should be considered suspect.
Is it possible to satisfy each one of these criteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?
If this gets much more involved, then you might be looking at a simple ETL process to do the heavy lifting for you: the load to the database should be manageable in the sense that you will be loading into your ETL environment, running transformations/checks/comparisons, and then writing your results to perhaps a staging table that outputs the stats you need. It sounds like a lot of work, but once it is set up, tweaking it is no great pain.
On the other hand, if you are looking at comparing vast amounts of data, then that might entail significant network traffic.
I am thinking "efficient" will mean adding indexes on the fields whose contents you are comparing. Not sure offhand whether a mega-join is what you need, or just to list the primary keys of the suspect records into a holding table so you can report the problems later. I.e. do you need to know why each record is suspect in the result set?
You could
-- Assuming some pkid (primary key) has been added
1.
select o.pkid, o.order_id, o.customer_id, o.product_id, o.quantity, o.sale_date
from orders o
join orders o2 on o.product_id = o2.product_id and o.sale_date = o2.sale_date
    and o.quantity = o2.quantity and o.customer_id <> o2.customer_id
then keep joining up more copies of orders, I suppose
You can do this with a single CASE statement. In the scenario below, the value of MarkedForReview tells you which of your three tests (1, 2, or 3) triggered the review. Note that I have to check the conditions of the third test before the second test.
With InputData As
(
Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
, Case
When O.sale_date = O2.sale_date Then 1
When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
And Abs( O.quantity - O2.quantity ) <= 20 Then 3
When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 Then 2
Else 0
End As MarkedForReview
From Orders As O
Left Join Orders As O2
On O2.order_id <> O.order_id
And O2.customer_id = O.customer_id
And O2.product_id = O.product_id
)
Select order_id, product_id, sale_date, quantity, customer_id
From InputData
Where MarkedForReview <> 0
Btw, if you are using something prior to SQL Server 2005, you can achieve the equivalent query using a derived table. Also note that you can return the id of the complementary order that triggered the review. Both orders that trigger a review will obviously be returned.
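For example, the same query rewritten with a derived table instead of the CTE might look roughly like this (a sketch that reuses the column logic above):

Select order_id, product_id, sale_date, quantity, customer_id
From
    (Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
        , Case
            When O.sale_date = O2.sale_date Then 1
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
                And Abs(O.quantity - O2.quantity) <= 20 Then 3
            When Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5 Then 2
            Else 0
          End As MarkedForReview
     From Orders As O
     Left Join Orders As O2
        On O2.order_id <> O.order_id
        And O2.customer_id = O.customer_id
        And O2.product_id = O.product_id) As InputData
Where MarkedForReview <> 0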
