How to compare tables for possible combinations to match people - sql-server

I have two tables: a table we call "The Vault" (Table B), which has a ton of demographic data, and Table A, which has some of the demographic data. I am new to SQL Server, and I was tasked with finding a list of 21 students in The Vault. The problem: there is no primary key or anything distinctive besides FirstName, LastName, BirthMonth, Birthday, and Birthyear.
Goal: We could not match these people in the conventional way we have, and so we are attempting a Hail Mary to try to see which of these shared combinations will perhaps land us with a match.
What I have tried: I put both tables into temp tables, table A and table B, and tried an INNER JOIN, but then I realized that to make it work I would need a crazy join statement (see the code below).
The problem, as you can imagine, is that it returns far more than my 21 records (it is in the thousands), so I would have to make the join statement even longer, and I am not sure this is the right way to do it. I believe that is where the WHERE clause would come in, no?
Question: How do I compare these two tables for possible matches using a WHERE clause, where I can mix and match different columns, without having to put all the constraints in the ON clause of the JOIN? I don't want to JOIN on something like 6 different columns. Is there another way to do this, so I can perhaps learn? I understand this is easier when the tables share a primary key, which would be my JOIN criterion, but I have never compared two tables to find possible matches.
FROM #Table a
INNER JOIN #table b ON (a.LAST_NAME = b.LAST_NAME AND a.FIRST_NAME = b.FIRST_NAME.....)
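One possible way to "mix and match" columns, sketched here under assumptions rather than taken from the thread: join on a deliberately loose condition, count how many columns agree for each candidate pair, and keep only the pairs above a threshold. The table names `table_a` and `vault`, the `BIRTH_*` column names, the sample rows, and the threshold of 3 are all hypothetical. The sketch runs the SQL through Python's built-in `sqlite3` so it can be tried without a server; the `CASE WHEN ... THEN 1 ELSE 0 END` scoring should carry over to SQL Server largely unchanged.

```python
import sqlite3

# Hypothetical sample data; names other than LAST_NAME/FIRST_NAME are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (FIRST_NAME TEXT, LAST_NAME TEXT,
                      BIRTH_MONTH INT, BIRTH_DAY INT, BIRTH_YEAR INT);
CREATE TABLE vault   (FIRST_NAME TEXT, LAST_NAME TEXT,
                      BIRTH_MONTH INT, BIRTH_DAY INT, BIRTH_YEAR INT);
INSERT INTO table_a VALUES ('Ann', 'Lee', 5, 14, 2001),
                           ('Bob', 'Cruz', 1, 2, 2000);
INSERT INTO vault VALUES ('Ann',  'Lee',  5, 14, 2001),
                         ('Anne', 'Lee',  5, 14, 2001),
                         ('Bob',  'Cruz', 2,  1, 2000),
                         ('Zed',  'Cruz', 9,  9, 1999);
""")

# Loose join (same last name OR same full birth date), then score each pair
# by the number of agreeing columns and filter on the score.
SCORED_MATCHES = """
SELECT a_first, a_last, vault_first, score
FROM (
    SELECT a.FIRST_NAME AS a_first,
           a.LAST_NAME  AS a_last,
           b.FIRST_NAME AS vault_first,
           CASE WHEN a.FIRST_NAME  = b.FIRST_NAME  THEN 1 ELSE 0 END
         + CASE WHEN a.LAST_NAME   = b.LAST_NAME   THEN 1 ELSE 0 END
         + CASE WHEN a.BIRTH_MONTH = b.BIRTH_MONTH THEN 1 ELSE 0 END
         + CASE WHEN a.BIRTH_DAY   = b.BIRTH_DAY   THEN 1 ELSE 0 END
         + CASE WHEN a.BIRTH_YEAR  = b.BIRTH_YEAR  THEN 1 ELSE 0 END AS score
    FROM table_a AS a
    JOIN vault AS b
      ON a.LAST_NAME = b.LAST_NAME
      OR (a.BIRTH_YEAR = b.BIRTH_YEAR
          AND a.BIRTH_MONTH = b.BIRTH_MONTH
          AND a.BIRTH_DAY = b.BIRTH_DAY)
) AS scored
WHERE score >= 3
ORDER BY a_last, a_first, score DESC
"""

matches = conn.execute(SCORED_MATCHES).fetchall()
for row in matches:
    print(row)
```

With only 21 people on one side, the candidate list stays small enough to review by hand; the highest-scoring rows per person are the likeliest matches. An OR in the join condition can be slow against a very large table, so the loose condition may need tightening in practice.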

Related

Merging Legacy Data on Best Key

I am bringing in a field from a legacy system that does not have a Primary Key-Foreign Key relationship with the new table. The data is transactional, and each line has a customer and sales rep.
The legacy field has a many to many relationship with customer (but only on some), but it goes to one to many when you link customer and sales rep. However, the data is messy and the transaction may not match to a sales rep exactly.
It seems that the best way to tackle this problem is to join on customer and sales rep when possible, if there is not a match, then just join on customer.
I was able to do this in Excel by using the following:
=IFERROR(VLOOKUP(Customer_SalesRep_Combo, DataTable, 3, FALSE),VLOOKUP(Customer,Datatable,3,FALSE))
This function in excel works, but the spreadsheet is so large that it tends to crash, so I am trying to duplicate this using SQL code.
Note that the legacy system just outputs CSV files, so I uploaded that CSV to the cloud, and now I am using Databricks to convert that into a Spark dataframe, so I can use SQL logic on it.
Initially, my idea was to do a left join using both conditions (which matches 50k of my 80k rows), and a second left join using one condition. I would then bring in the legacy field twice (twice if matched, once if not). Then I would use a CASE statement to only bring in the "soft match" if there was not a hard match. However, due to the many to many relationship, I would experience join duplication on the second left join. Since I am also bringing in Sales Data, I cannot have any duplication. However, I would be able to live with some inaccuracy if I could just use the first match and suppress any duplication.
I have seen examples of using case statements in joins, but I do not know how to use that in this case. If I cannot get this to work, I will resort to iterating over the dataframes to match the logic in Scala, but I would prefer a SQL solution.
My code is below. The real version contains more fields, but this is the simplest I could get while retaining the basic logic.
SELECT
    InnerQry.Customer,
    InnerQry.SalesRep,
    InnerQry.Sales,
    CASE
        WHEN InnerQry.LegacyFieldHard IS NULL
        THEN InnerQry.LegacyFieldSoft
        ELSE InnerQry.LegacyFieldHard
    END AS LegacyField
FROM
    (SELECT
        A.Customer,
        A.SalesRep,
        A.Sales,
        B.LegacyField AS LegacyFieldHard,
        C.LegacyField AS LegacyFieldSoft
    FROM
        DBS AS A
    LEFT JOIN
        LEGACY AS B ON A.Customer = B.Customer AND A.SalesRep = B.SalesRep
    LEFT JOIN
        LEGACY AS C ON A.Customer = C.Customer) AS InnerQry
The main problem here is that you get multiple rows when you map based just on Customer (Legacy C). To avoid this you can create a row number field and restrict it to 1, provided you don't really care which of that customer's records gets mapped:
SELECT
    A.Customer,
    A.SalesRep,
    A.Sales,
    COALESCE(B.LegacyField, C.LegacyField) AS LegacyField
FROM DBS AS A
LEFT JOIN LEGACY AS B
    ON A.Customer = B.Customer AND A.SalesRep = B.SalesRep
LEFT JOIN
    (SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY SalesRep) AS rownum1
    FROM LEGACY) AS C
    ON A.Customer = C.Customer AND C.rownum1 = 1
Also, you can use the COALESCE function directly instead of the CASE statement. It returns the first non-null value, i.e. the C value is taken only when B is NULL. Hope this helps.
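To see the effect on a small case, here is a minimal sketch that runs the query above through Python's built-in `sqlite3` (used only so it is runnable anywhere; it needs an SQLite build with window functions, 3.25 or later). The sample rows are made up, and the sketch assumes LEGACY has at most one row per Customer and SalesRep pair, otherwise the hard join could still duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS (Customer TEXT, SalesRep TEXT, Sales REAL);
CREATE TABLE LEGACY (Customer TEXT, SalesRep TEXT, LegacyField TEXT);
-- Made-up rows: Acme has two legacy records, Zeta has none.
INSERT INTO DBS VALUES ('Acme', 'Kim', 100.0),
                       ('Acme', 'Lou', 50.0),
                       ('Zeta', 'Kim', 70.0);
INSERT INTO LEGACY VALUES ('Acme', 'Kim', 'L1'),
                          ('Acme', 'Abe', 'L0');
""")

QUERY = """
SELECT A.Customer, A.SalesRep, A.Sales,
       COALESCE(B.LegacyField, C.LegacyField) AS LegacyField
FROM DBS AS A
LEFT JOIN LEGACY AS B
    ON A.Customer = B.Customer AND A.SalesRep = B.SalesRep
LEFT JOIN
    (SELECT *,
            ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY SalesRep) AS rownum1
     FROM LEGACY) AS C
    ON A.Customer = C.Customer AND C.rownum1 = 1
ORDER BY A.Customer, A.SalesRep
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)

# Every DBS row appears exactly once, so the Sales total is preserved.
total_sales = sum(r[2] for r in rows)
```

Acme/Kim gets its hard match, Acme/Lou falls back to the first legacy row for Acme (ordered by SalesRep), and Zeta stays NULL because it has no legacy record at all.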

SQL Multiple Table Join - Best Optimization

Hi, I am hoping someone can help with my SQL theory. I have to create a set of reports which use joins from multiple tables. These reports are running far slower than I would like, and I am hoping to optimize my SQL, although my knowledge has hit a wall and I can't seem to find anything on Google.
I am hoping someone here can give me some best practice guidance.
Essentially I am trying to filter on the results set as it comes back to reduce the number of rows included in later joins
Items INNER JOIN BlueItems ON Items.ItemID = BlueItems.ItemID AND BlueItems.shape = 'square'
LEFT JOIN ItemHistory ON Items.ItemID = ItemHistory.ItemsID
LEFT JOIN ItemDates ON Items.ItemID = ItemDates.ItemID
WHERE ItemDates.ManufactureDate BETWEEN '01/01/2017' AND '01/05/2017'
I figure that Inner Joining on Blue items that are squares vastly reduces the data set at this point?
I also understand that the Where clause is intelligent enough to reduce the data set on run time? Am I mistaken? Is it returning all the data and then just filtering on that data?
Any guidance on how to speed this kind of query up would be fantastic. Indexes and such have already been put in place. Unfortunately the database is actually looked after by someone else and I am simply creating reports based on their database. This limits me to optimizing my queries rather than the data itself.
I guess at this point it's time for me to improve my knowledge of how SQL handles the various ways you can filter data, and to understand which ones actually reduce the dataset used and which simply filter on it. Any guidance would be very appreciated!
You mentioned that the primary keys are all indexed, but this is always the case for primary key fields. The only portion of your current query which would possibly benefit from this is the first join with Items. For the other joins and the WHERE clause, these primary key fields are not being used.
For this particular query, I would suggest the following indices:
ALTER TABLE BlueItems ADD INDEX bi_item_idx (ItemID, shape)
ALTER TABLE ItemHistory ADD INDEX ih_item_idx (ItemID)
ALTER TABLE ItemDates ADD INDEX id_idx (ItemID, ManufactureDate)
For the ItemHistory table, the index ih_item_idx should speed up the join involving the ItemID foreign key. A column by the same name is also involved with the other two joins, and hence is part of the other indices. The reason for the composite indices (i.e. indices involving more than one column) is that we want to cover all the columns which appear in either the join or the WHERE clause.
These comments are not really an answer but too big to put in a comment...
If the dates are being passed in as parameters (I'm guessing they are), then parameter sniffing might be causing the issue. The query may be using a bad plan.
I've seen this a lot, especially when using the BETWEEN operator. A quick thing to try is adding OPTION(RECOMPILE) to the end of your query. This might seem counterintuitive, but just try it. Although reusing a compiled plan should be faster than recompiling, if a bad plan is being used, it can slow things down A LOT.
Also, if ItemDates is big, try dumping your filtered results to a temp table and joining to that, so something like:
SELECT * INTO #id FROM ItemDates i WHERE i.ManufactureDate BETWEEN '01/01/2017' AND '01/05/2017'
Then change your main query to be something like
SELECT *
FROM Items
JOIN BlueItems ON Items.ItemID = BlueItems.ItemID AND BlueItems.shape = 'square'
JOIN #id i ON Items.ItemID = i.ItemID
LEFT JOIN ItemHistory ON Items.ItemID = ItemHistory.ItemsID
I also changed the JOIN from being a LEFT JOIN to a JOIN (implicitly an inner join) as you are only selecting items which have a match in ItemDates so LEFT joining makes no sense.
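The point about the LEFT JOIN can be checked on a tiny data set. The sketch below is only an illustration: it uses Python's built-in `sqlite3`, made-up rows, and ISO date strings with an assumed range of 1 to 5 January 2017 (the original '01/05/2017' literal is ambiguous). It compares filtering the outer-joined table in the WHERE clause with putting the same filter in the ON clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Items (ItemID INTEGER);
CREATE TABLE ItemDates (ItemID INTEGER, ManufactureDate TEXT);
INSERT INTO Items VALUES (1), (2), (3);
-- Item 1 is inside the date range, item 2 is outside, item 3 has no date row.
INSERT INTO ItemDates VALUES (1, '2017-01-03'), (2, '2017-03-01');
""")

# Filter in WHERE: rows where ItemDates is NULL fail the test and are dropped,
# so the LEFT JOIN ends up returning the same rows as an inner join.
where_rows = conn.execute("""
    SELECT Items.ItemID, ItemDates.ManufactureDate
    FROM Items
    LEFT JOIN ItemDates ON Items.ItemID = ItemDates.ItemID
    WHERE ItemDates.ManufactureDate BETWEEN '2017-01-01' AND '2017-01-05'
    ORDER BY Items.ItemID
""").fetchall()

inner_rows = conn.execute("""
    SELECT Items.ItemID, ItemDates.ManufactureDate
    FROM Items
    JOIN ItemDates ON Items.ItemID = ItemDates.ItemID
    WHERE ItemDates.ManufactureDate BETWEEN '2017-01-01' AND '2017-01-05'
    ORDER BY Items.ItemID
""").fetchall()

# Filter in ON: every Items row survives; non-matching rows get NULL dates.
on_rows = conn.execute("""
    SELECT Items.ItemID, ItemDates.ManufactureDate
    FROM Items
    LEFT JOIN ItemDates
        ON Items.ItemID = ItemDates.ItemID
       AND ItemDates.ManufactureDate BETWEEN '2017-01-01' AND '2017-01-05'
    ORDER BY Items.ItemID
""").fetchall()

print(where_rows, inner_rows, on_rows)
```

This shows which rows come back, not how fast; whether the optimizer filters early or late is best judged from the actual execution plan on the real database.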

TSQL Joining three tables where none have a common column

This will probably be pretty simple, as I am very much a novice.
I have two tables I have imported from Excel, and pretty much I need to update an existing table of e-mail addresses based off of the email addresses from the spreadsheet.
My only issue is I cannot join on a common column, as none of the tables share a column.
So I am wondering if, in a Join, I can just put something like
FROM table a
INNER JOIN table b ON b.column 'name' = a.column 'nameplus'
Any help would be appreciated!
A join without matching predicates is effectively a cross join, i.e. every row in table A is matched with every row in table B.
If you specify an INNER JOIN then you have to have an ON term, which either matches something or it doesn't: in your example you may have a technical match (i.e. b.column really does - perhaps totally coincidentally - match a.column) that makes no business sense.
So you either have
a CROSS JOIN: no way of linking the tables but the result is all possible combinations of rows
Or:
an inner join where you must specify how the rows are to be combined (or a left/right outer join if you want to include all rows from either side, regardless of matched-ness)
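The difference between the two options is easy to see on a few rows. This is a minimal sketch using Python's built-in `sqlite3`; the table names `table_a` and `table_b` and the sample addresses are hypothetical, and the `name` and `nameplus` columns simply follow the wording of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (nameplus TEXT);
CREATE TABLE table_b (name TEXT);
INSERT INTO table_a VALUES ('ann@example.com'), ('bob@example.com');
INSERT INTO table_b VALUES ('ann@example.com'), ('cat@example.com'), ('dan@example.com');
""")

# CROSS JOIN: no predicate, so every row of table_a pairs with every row of table_b.
cross_count = conn.execute(
    "SELECT COUNT(*) FROM table_a AS a CROSS JOIN table_b AS b"
).fetchone()[0]

# INNER JOIN: the columns may have different names, but only rows whose
# values are actually equal are combined.
inner_rows = conn.execute("""
    SELECT a.nameplus, b.name
    FROM table_a AS a
    INNER JOIN table_b AS b ON b.name = a.nameplus
""").fetchall()

print(cross_count)
print(inner_rows)
```

So the columns do not need to share a name for an inner join to work; what matters is that the values being compared mean the same thing in both tables.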

Is it possible to query two access tables where you want to know if a value/range in the first table is between two fields in the second table?

I'm trying to query two tables, ASSAYS, AND LITHO in a diamond drillhole database.
I was given values (SAMPLE_NO) to search for in the ASSAYS table, to return values such as HOLE-ID, FROM, and TO. So each sample that we take has a HOLE-ID, SAMPLE_NO, FROM AND TO. One hole-id can have multiple sample numbers, but each sample number is unique. The from and to will be unique in each hole-id. This I can find no problem.
My coworker also wanted to know what rock type was associated with each sample. This info is located in another table so I'll need to figure out how to query for this. The information that this table holds is HOLE-ID, FROM, TO, and ROCKTYPE.
You're looking for what is called a JOIN. This allows you to JOIN data from multiple tables based on matching column values.
This could be your starting point:
SELECT a.*, l.*
FROM ASSAYS a LEFT JOIN LITHO l ON a.[HOLE-ID] = l.[HOLE-ID]
WHERE a.sample_no = 'XXXX'
Please google for JOIN and SQL to find out about the exact syntax.
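The starting point above matches on the hole only, so each sample would come back with every rock type logged in that hole. To pick the rock type for the sample's own interval, the join also needs a range condition on FROM and TO. The sketch below is an assumption about how that could look, not something from the original answer: it treats two intervals as related when they overlap, uses made-up rows, and runs through Python's built-in `sqlite3`, which accepts the same bracketed field names as Access.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ASSAYS ([HOLE-ID] TEXT, SAMPLE_NO TEXT, [FROM] REAL, [TO] REAL);
CREATE TABLE LITHO  ([HOLE-ID] TEXT, [FROM] REAL, [TO] REAL, ROCKTYPE TEXT);
-- Made-up intervals for illustration.
INSERT INTO ASSAYS VALUES ('DH-1', 'S100', 10.0, 12.0),
                          ('DH-1', 'S101', 19.0, 21.0),
                          ('DH-2', 'S200',  0.0,  2.0);
INSERT INTO LITHO VALUES ('DH-1',  0.0, 20.0, 'granite'),
                         ('DH-1', 20.0, 50.0, 'basalt'),
                         ('DH-2',  0.0, 30.0, 'schist');
""")

# Same hole, and the sample interval overlaps the litho interval:
# sample starts before the litho unit ends AND ends after it starts.
RANGE_JOIN = """
SELECT a.SAMPLE_NO, l.ROCKTYPE
FROM ASSAYS a
LEFT JOIN LITHO l
    ON a.[HOLE-ID] = l.[HOLE-ID]
   AND a.[FROM] < l.[TO]
   AND a.[TO]   > l.[FROM]
WHERE a.SAMPLE_NO IN ('S100', 'S101')
ORDER BY a.SAMPLE_NO, l.[FROM]
"""

rock_rows = conn.execute(RANGE_JOIN).fetchall()
for row in rock_rows:
    print(row)
```

A sample that straddles a contact (S101 here) returns one row per rock type it touches. In Access itself the compound ON condition may need to be wrapped in parentheses, or the range test moved to the WHERE clause.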

cakephp linked data HABTM by JOIN need both related data?

This should be a simple Yes/No answer so here goes.
If I set up 3 tables, 2 typical recordsets and 1 that joins them by the ids of the 2 tables, do I need the id from both tables in order to have an entry in the join table?
The scenario is a Jobs table and a Parts table linked by JobsParts table. But some parts are not in the Parts table, they are just freetext entries (so as to avoid stock control issues) belonging to a Job.
Hope this is enough to explain my question.
Thanks
BTW using CakePHP 2.0
For database sanity, I'd say the join table 'jobs_parts' should have both IDs.
If you try entering free-form parts into the join table, you're not only going to increase the size of the join table, but you've effectively lost the ability to grow/expand - ie. what if you want to add a few more fields to this unknown part? Or what if it turns out to be a part that you actually want in your normal parts table.... it just gets confusing.
There are other options for dealing with free-form parts vs actual parts...
have a field in the parts table that's a tinyint(1) for whether or not it's a verified part
OR make an UnknownParts model/table
In my opinion, go with what makes logical sense for ease of understanding and for future updates to your database/website...etc. And IMO, adding a freeform part into the join table would not fit that bill.

Resources