I have two tables with similar data and want to find the closest matches for comparison. Here's what I was trying to do:
select a.field1 as a1, b.field1 as b1, a.field2 as a2, b.field2 as b2
from foo a
left join (
select top 1 tmp.field1, tmp.field2
from foo2 tmp
-- The closest match will match the most fields. Add up these.
order by case when tmp.field1 = a.field1 then 1 else 0 end
+ case when tmp.field2 = a.field2 then 1 else 0 end
desc) b on 1 = 1
I can't reference the main selection table in the join though.
Perhaps I'm going about it all wrong. The actual goal is that I was given a spreadsheet of data and told to update a database. The spreadsheet has no PK and is missing many fields that the database has. Also, the database has foreign keys and child data all over, so I don't want to delete/insert. Instead I want to compare values and update wherever possible. So I created two temporary tables and pulled the database records into one and the spreadsheet records into the other. Now I want to work with those two tables to update records, and finally delete/insert where no update is available.
Have you looked up the MERGE statement? It does what you want, although the syntax is a bit tricky.
Article here with half decent examples:
http://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
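As a rough illustration of the shape of a MERGE between the two temporary tables described in the question, here is a minimal T-SQL sketch. The table names #db_rows and #sheet_rows, the column types, the sample rows, and the choice of field1 as the match key are all assumptions made for the example; the question itself has no reliable key, so the ON clause is the part that would need real thought.

```sql
-- Hypothetical temp tables standing in for the two described in the question.
CREATE TABLE #db_rows    (field1 INT, field2 VARCHAR(50));
CREATE TABLE #sheet_rows (field1 INT, field2 VARCHAR(50));

INSERT INTO #db_rows    VALUES (1, 'old'), (2, 'same'), (3, 'gone');
INSERT INTO #sheet_rows VALUES (1, 'new'), (2, 'same'), (4, 'added');

MERGE #db_rows AS tgt
USING #sheet_rows AS src
    ON tgt.field1 = src.field1                 -- assumed match key
WHEN MATCHED AND tgt.field2 <> src.field2 THEN
    UPDATE SET field2 = src.field2             -- update rows that differ
WHEN NOT MATCHED BY TARGET THEN
    INSERT (field1, field2)                    -- spreadsheet rows missing from the database
    VALUES (src.field1, src.field2)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE                                     -- database rows absent from the spreadsheet
OUTPUT $action, inserted.field1, deleted.field1;
```

The OUTPUT clause is optional; it simply reports which action was taken for each row, which helps when checking the result before pointing the statement at real tables.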
I have two tables, Table1 and Table2.
Suppose Table1 has the following columns: T1c1, T1c2, T1c3
and Table2 has the following columns: T2c1, T2c2, T2c3
I need to add the values of Table2.T2c3 to Table1.T1c3 for rows that match between the two tables on their other two columns, or on just one column when the other is NULL in one table or both. That is, I need to match Table1.T1c1 with Table2.T2c1 and Table1.T1c2 with Table2.T2c2, or match on only one of those pairs when the other contains a NULL.
The problem is that my tables are very large: several hundred million rows. I need the fastest matching approach to fill in the Table1.T1c3 values.
I think you could abuse Coalesce() to do this in the ON clause of your join:
SELECT table2.c3 + table1.c3
FROM table1
INNER JOIN table2
ON COALESCE(table1.c1, table2.c1) = COALESCE(table2.c1, table1.c1)
AND COALESCE(table1.c2, table2.c2) = COALESCE(table2.c2, table1.c2);
That feels like it might be speedier than dropping CASE all over the place or hitting this with a bunch of OR conditions, but it might be six of one, half a dozen of the other...
This is assuming that there would never be a null in both c1 and c2 of either table at the same time. Otherwise you may end up with some funky cross-join shenanigans.
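To see how a NULL on one side acts as a wildcard in that join, here is a small self-contained sketch. The column types and the sample rows are made up for illustration, and the tables use the short c1/c2/c3 names from the answer rather than the T1c1-style names from the question.

```sql
-- Hypothetical sample data; only the join condition comes from the answer.
CREATE TABLE table1 (c1 INT, c2 INT, c3 INT);
CREATE TABLE table2 (c1 INT, c2 INT, c3 INT);

INSERT INTO table1 VALUES (1, 10, 100), (2, NULL, 200);
INSERT INTO table2 VALUES (1, 10, 5),   (2, 20, 7);

-- Each COALESCE pair falls back to the other table's value when one side is NULL,
-- so a NULL column compares equal to whatever the other table holds.
SELECT table1.c1, table2.c3 + table1.c3 AS new_c3
FROM table1
INNER JOIN table2
    ON  COALESCE(table1.c1, table2.c1) = COALESCE(table2.c1, table1.c1)
    AND COALESCE(table1.c2, table2.c2) = COALESCE(table2.c2, table1.c2);
-- Two rows come back: c1 = 1 with 105 (both columns match)
-- and c1 = 2 with 207 (c1 matches, table1.c2 is NULL).
```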
I have an Access database table which sometimes contains duplicate ProfileIDs. I would like to create a query that excludes one (or more, if necessary) of the duplicate records.
The condition for a duplicate record to be excluded is: if the PriceBefore and PriceAfter fields are NOT equal, they are considered duplicate. If they are equal, the duplicate field remains.
In the example table above, the records with ID 7 and 8 have the same ProfileIDs. For ID 8, PriceBefore and PriceAfter are not equal, so this record should be excluded from the query. For ID 7, the two are equal, so it remains. Also note that PriceBefore and PriceAfter for ID 4 are the same, but as the ProfileID is not a duplicate, the record must remain.
What is the best way to do this? I am happy to use multiple queries if necessary.
Create a pointer query. Call it pQuery:
SELECT ProfileID, Sum(1) as X
FROM MyTableName
GROUP BY ProfileID
HAVING Sum(1) > 1
This will give you the ProfileID of every record that's part of a dupe.
Next, find the records where prices don't match. Call this pNoMatchQuery:
SELECT MyTableName.*
FROM MyTableName
INNER JOIN pQuery
ON pQuery.ProfileID = MyTableName.ProfileID
WHERE PriceBefore <> PriceAfter
You now have a query of every record that should be excluded from your dataset. If you want to permanently delete all of these records, run a DELETE query that removes every row of your source table whose ID appears in pNoMatchQuery:
Delete MyTableName.*
From MyTableName
Where Exists( Select 1 From pNoMatchQuery Where pNoMatchQuery.ID = MyTableName.ID ) = True
First, make absolutely sure that pQuery and pNoMatchQuery are returning what you expect before you delete anything from your source table, because once it's gone it's gone for good (unless you make a backup first, which I would highly suggest before you run that delete the first time).
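If the goal is only a query that hides the unwanted records, as the question asks, rather than deleting them, the same rule can be sketched as a single non-destructive SELECT. This is an illustration under the question's stated rule (keep a record when its prices are equal, or when its ProfileID is not duplicated), not a tested replacement for the queries above; the aliases t and d are arbitrary.

```sql
SELECT t.*
FROM MyTableName AS t
WHERE t.PriceBefore = t.PriceAfter
   OR (SELECT Count(*)
       FROM MyTableName AS d
       WHERE d.ProfileID = t.ProfileID) = 1;
```

Note that, like the delete approach, this drops every record of a duplicated ProfileID if none of them has equal prices, and a NULL in either price column makes the equality test fail.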
**EDIT** Thanks for all the input, and sorry for the late reply. I was away over the weekend without internet access. I realized from the answers that I needed to provide more information so people could understand the problem more thoroughly, so here it comes:
I am migrating an old database design to a new design. The old one is a mess and very confusing (I wasn't involved in the old design). I've attached a picture of the relevant part of the old design below:
The table called Item will exist in the new design as well, and it has all the columns I need in the new design except one, and this is where my problem begins. I need the column I named 'neededProp' to be associated with each migrated row from Item (by associated I mean as a column in the Item table of the new design).
So for a particular eid in table Environment there can be n entries in table Item. The "corresponding" set exists in table Room. The only way to know which rows in Item and Room belong together is with the help of the columns "itemId" and "objectId" in the respective tables. So, for example, for a particular eid there might be 100 entries in Item and Room, and their "itemId" and "objectId" can be values from 1 to 100, so that column is only unique within a particular eid (or baseSeq, as it is called in table BaseFile).
Basically you can say that the tables Environment and BaseFile resemble each other, and the tables Item and Room resemble each other. The difference is that some tables lack some columns and others have some extra. I have no idea why it was designed like this in the first place.
My question: can someone help me create a query that finds the proper "neededProp" for each row in the Item table, so I can get that data into the new design?
**OLD PART** This might be a trivial question but I can't get it to work as I want. I want to join a few tables as in the SQL statement below. If I start like this and run this query
select * from Environment e
join items ei on e.eid = ei.eid
I get like 400000 rows which is what I want. However if I add one more line so it looks like this:
select * from Environment e
join items ei on e.eid= ei.eid
left join Room r on e.roomnr = r.roomobjectnr
I get an insane number of rows, so there must be some multiplication going on. I want to get the same number of rows (about 400000 in this case) even after joining the third table. Is that possible somehow? Maybe by creating a temporary view from the first two lines of the query?
I am using MSSQL server.
Without knowing what data your second query returns, it's difficult to say exactly how to write this. Most likely there is an additional column in Room that you need to join on and have left out, such as something identifying a facility or hallway, so that you have multiple 'Room 1' entries, for example.
However, to answer your question about doing this without a temp table, I've put together the example below, which uses a common table expression to return only one Room record per source row.
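-- Rank the Room rows within each roomobjectnr so that exactly one row per room gets RN = 1.
-- UpdateDate is an assumed column; order by whichever column decides which duplicate to keep.
-- Environment and Items are joined in the main query: wrapping them in a CTE with SELECT *
-- fails when both tables have a column with the same name, such as eid.
;WITH cte_RankedRoom AS (
SELECT R.*
,ROW_NUMBER() OVER (PARTITION BY R.roomobjectnr ORDER BY R.UpdateDate DESC) [RN]
FROM Room R
)
SELECT *
FROM Environment E
INNER JOIN Items I ON I.eid = E.eid
LEFT JOIN cte_RankedRoom R ON E.roomnr = R.roomobjectnr
AND R.RN = 1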
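Before picking one Room row per key, it may be worth confirming that repeated roomobjectnr values really are what multiplies the rows. A quick check along these lines, using only the table and column names from the question, lists the keys that occur more than once:

```sql
-- Each row returned is a roomobjectnr that would multiply the joined rows.
SELECT r.roomobjectnr, COUNT(*) AS copies
FROM Room r
GROUP BY r.roomobjectnr
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

If this returns rows, the duplicates usually point to a missing join column rather than to bad data.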
By the way, do you need columns from the Room table? If not:
select * from Environment e
join items ei on e.eid= ei.eid
where e.roomnr in (select r.roomobjectnr from Room r )
Otherwise:
select * from Environment e
join items ei on e.eid= ei.eid
left join (select distinct roomobjectnr from Room) r on e.roomnr = r.roomobjectnr
I have a very simple database. It contains 3 tables. The first is the primary input table where values go in. The 2nd and 3rd are there purely for translating values to names for a view. When I view the rows in table 1, I can see that column businessUnit contains valid values in all rows. When I add the Business_Units table (The 3rd table in this DB), all but 2 rows go away and despite the businessUnit value both being 1 in the 1st table, the view gives them different names.
I created a DB diagram and uploaded a screenshot to imgur. Link: http://imgur.com/jXF7L1R
I only have 2 relationships in the table. One from equipType on New_Equipment_User to id in Equipment_Type and one from businessUnit in New_Equipment_User to id in Business_Units. The weird thing is that the Equipment Type works perfectly, yet when I replicate the table, relationship and view information exactly, it doesn't work. Instead of 6 rows appearing, there are only 2 which share the same value in businessUnit, but gives 2 different names for it.
In case it matters, here is my view Query:
SELECT dbo.New_Equipment_User.id, dbo.Equipment_Type.name AS equipType, dbo.New_Equipment_User.jobNumber, dbo.New_Equipment_User.costCode,
dbo.New_Equipment_User.reason, dbo.New_Equipment_User.mobile, dbo.New_Equipment_User.mobileQty, dbo.New_Equipment_User.mobileComment,
dbo.New_Equipment_User.laptop, dbo.New_Equipment_User.laptopQty, dbo.New_Equipment_User.laptopComment, dbo.New_Equipment_User.desktop,
dbo.New_Equipment_User.desktopQty, dbo.New_Equipment_User.desktopComment, dbo.New_Equipment_User.modem, dbo.New_Equipment_User.modemQty,
dbo.New_Equipment_User.modemComment, dbo.New_Equipment_User.printer, dbo.New_Equipment_User.printerQty, dbo.New_Equipment_User.printerComment,
dbo.New_Equipment_User.camera, dbo.New_Equipment_User.cameraQty, dbo.New_Equipment_User.cameraComment, dbo.New_Equipment_User.dateRequired,
dbo.New_Equipment_User.requestedBy, dbo.New_Equipment_User.dateRequested, dbo.New_Equipment_User.approvalStatus,
dbo.Business_Units.name AS businessUnit
FROM dbo.New_Equipment_User
JOIN dbo.Equipment_Type ON dbo.New_Equipment_User.equipType = dbo.Equipment_Type.id
JOIN dbo.Business_Units ON dbo.New_Equipment_User.id = dbo.Business_Units.id
WHERE (dbo.New_Equipment_User.approvalStatus = '0')
And here is an image of the view since it is easier to read: http://imgur.com/pZ97ehQ
Is anyone able to assist with why this might be happening?
Try using a LEFT JOIN
SELECT ...
FROM dbo.New_Equipment_User
JOIN dbo.Equipment_Type ON dbo.New_Equipment_User.equipType = dbo.Equipment_Type.id
LEFT JOIN dbo.Business_Units ON dbo.New_Equipment_User.id = dbo.Business_Units.id
This will ensure that every dbo.New_Equipment_User row with a matching dbo.Equipment_Type is present, whether or not it has a match in dbo.Business_Units.
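One more thing that may be worth checking: the question describes the relationship as running from businessUnit in New_Equipment_User to id in Business_Units, while the view joins New_Equipment_User.id to Business_Units.id. A sketch of the join on the column the relationship describes, with the select list shortened and the aliases neu, et and bu added only for readability, could look like this:

```sql
SELECT neu.id,
       et.name AS equipType,
       bu.name AS businessUnit
FROM dbo.New_Equipment_User AS neu
JOIN dbo.Equipment_Type AS et
    ON neu.equipType = et.id
LEFT JOIN dbo.Business_Units AS bu
    ON neu.businessUnit = bu.id      -- join on the foreign key column, not on neu.id
WHERE neu.approvalStatus = '0';
```

Joining on the row's own id would explain both symptoms in the question: only rows whose id happens to exist in Business_Units survive an inner join, and the name shown depends on the row's id instead of its businessUnit value.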
I'm reading a book in which the author talks about fetching a row plus all of its linked rows in one step, like fetching an order plus all its items at once. Okay, sounds nice, but really: I've never seen a way in SQL to ask for, let's say, one order plus 100 items. What would this record set look like? Would I get 101 rows with merged fields of both the order and the item table, where 100 rows have a lot of NULL values for the order fields, while one row has a lot of NULL values for the item fields? Is that the way to go? Or is there something much cooler? I mean... I've never heard of fetching arrays into a field.
A simple JOIN would do the trick:
SELECT o.*
, i.*
FROM orders o
INNER JOIN order_items i
ON o.id = i.order_id
This will return one row for each row in order_items. Each returned row consists of all fields from the orders table with all fields from the order_items table concatenated to them (quite literally, the records from the two tables are joined: they are combined by record concatenation).
So if orders has (id, order_date, customer_id) and order_items has (order_id, product_id, price) the result of the statement above will consist of records with (id, order_date, customer_id, order_id, product_id, price)
One thing you need to be aware of is that this approach breaks down whenever there are two distinct 'detail' tables for one 'master'. Let me explain.
In the orders/order_items example, orders is the master and order_items is the detail: each row in order_items belongs to, or is dependent on exactly one row in orders. The reverse is not true: one row in the orders table can have zero or more related rows in the order_items table. The join condition
ON o.id = i.order_id
ensures that only related rows are combined and returned (leaving out the condition would return all possible combinations of rows from the two tables, assuming the database allows you to omit the join condition).
Now, suppose you have one master with two details, for example customers as master, with customer_orders as detail 1 and customer_phone_numbers as detail 2. Suppose you want to retrieve a particular customer along with all its orders and all its phone numbers. You might be tempted to write:
SELECT c.*, o.*, p.*
FROM customers c
INNER JOIN customer_orders o
ON c.id = o.customer_id
INNER JOIN customer_phone_numbers p
ON c.id = p.customer_id
This is valid SQL, and it will execute (assuming the tables and column names are in place).
But the problem is that it will give you a rubbish result. Assuming you have one customer with two orders (1, 2) and two phone numbers (A, B), you get these records:
customer-data | order 1 | phone A
customer-data | order 2 | phone A
customer-data | order 1 | phone B
customer-data | order 2 | phone B
This is rubbish, as it suggests there is some relationship between order 1 and phone numbers A and B and order 2 and phone numbers A and B.
What's worse is that these results can completely explode in numbers of records, much to the detriment of database performance.
So, JOIN is excellent to "flatten" a hierarchy of items of known depth (customer -> orders -> order_items) into one big table which only duplicates the master items for each detail item. But it is awful for extracting a true graph of related items. This is a direct consequence of the way SQL is designed - it can only output normalized tables without repeating groups. This is why object relational mappers exist: to allow object definitions that can have multiple dependent collections of subordinate objects to be stored in and retrieved from a relational database without losing your sanity as a programmer.
This is normally done through a JOIN clause. This will not result in many NULL values, but many repeated values for the parent row.
Another option, if your database and programming language support it, is to return both result sets over one connection - one select for the parent row, another for the related rows.
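A minimal sketch of that second option, assuming the orders and order_items tables from the answer above and a made-up order id of 42: the two statements are sent together as one batch, and the client reads the first result set for the order and the second for its items.

```sql
-- Result set 1: the parent row (42 is a hypothetical order id).
SELECT o.*
FROM orders o
WHERE o.id = 42;

-- Result set 2: the related rows, without repeating the order's columns on every line.
SELECT i.*
FROM order_items i
WHERE i.order_id = 42;
```

Whether both results can be fetched in a single round trip depends on the database driver; where it cannot, running the two queries one after the other gives the same data.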