Iterating 1 row at a time with massive amounts of links/joins - sql-server

Basically, what I need is a way to generate row numbers in a query that has a lot of joins, and to filter on those row numbers in the WHERE clause.
Something like:
select ADDRESS.ADDRESS FROM ADDRESS
INNER JOIN WORKHISTORY ON WORKHISTORY.ADDRESSRID=ADDRESS.ADDRESSRID
INNER JOIN PERSON ON PERSON.PERSONRID=WORKHISTORY.PERSONRID
WHERE PERSON.PERSONRID=<some number> AND WORKHISTORY.ROWNUMBER=1
ROWNUMBER needs to be generated for this query on that one table, though, so that if we want to access the second WORKHISTORY record's address, we can just use WORKHISTORY.ROWNUMBER=2. And if, say, two addresses matched, we could cycle through the addresses for one WORKHISTORY record using ADDRESS.ROWNUMBER=1 and ADDRESS.ROWNUMBER=2.
This should be capable of being an automatically generated query. Thus, there could be more than 10 inner joins in order to get to the relevant table, and we need to be able to cycle through each table's records independently of the rest of the tables.
I'm aware there are the RANK and ROW_NUMBER functions, but I don't see how they would work for me because of all the inner joins.
Note: in this example query, ROWNUMBER should be automatically generated! It should never be stored in the actual table.

Can you use a temp table?
I ask because you can write the code like this:
select a.field1, b.field2, c.field3, identity (int, 1,1) as TableRownumber into #temp
from table1 a
join table2 b on a.table1id = b.table1id
join table3 c on b.table2id = c.table2id
select * from #temp where ...
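If a temp table is not an option, window functions may also work across the joins. The sketch below is not the temp-table approach above but a different technique: DENSE_RANK numbers the WORKHISTORY rows per person, and ROW_NUMBER numbers the addresses within each WORKHISTORY row, so each table can be cycled independently. It runs on SQLite through Python only so that it is self-contained; the WORKHISTORYRID column, the ordering columns, and the sample data are assumptions, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PERSON (PERSONRID INTEGER PRIMARY KEY);
-- ADDRESSRID is deliberately not unique: two addresses can match one work history row
CREATE TABLE ADDRESS (ADDRESSRID INTEGER, ADDRESS TEXT);
-- WORKHISTORYRID is a hypothetical key used only to give the rows a stable order
CREATE TABLE WORKHISTORY (WORKHISTORYRID INTEGER PRIMARY KEY,
                          PERSONRID INTEGER, ADDRESSRID INTEGER);
INSERT INTO PERSON VALUES (7);
INSERT INTO WORKHISTORY VALUES (10, 7, 100), (11, 7, 200);
INSERT INTO ADDRESS VALUES (100, '1 Main St'), (100, '2 Oak Ave'), (200, '9 Elm Rd');
""")

QUERY = """
WITH numbered AS (
    SELECT a.ADDRESS,
           -- same number for every joined row that belongs to one WORKHISTORY record
           DENSE_RANK() OVER (PARTITION BY p.PERSONRID
                              ORDER BY w.WORKHISTORYRID) AS WORKHISTORY_ROWNUMBER,
           -- restarts at 1 for each WORKHISTORY record
           ROW_NUMBER() OVER (PARTITION BY w.WORKHISTORYRID
                              ORDER BY a.ADDRESS) AS ADDRESS_ROWNUMBER
    FROM ADDRESS a
    INNER JOIN WORKHISTORY w ON w.ADDRESSRID = a.ADDRESSRID
    INNER JOIN PERSON p ON p.PERSONRID = w.PERSONRID
    WHERE p.PERSONRID = ?
)
SELECT ADDRESS FROM numbered
WHERE WORKHISTORY_ROWNUMBER = ? AND ADDRESS_ROWNUMBER = ?
"""

def address_at(person_id, workhistory_n, address_n):
    """Return the address at the given pair of generated row numbers, or None."""
    row = conn.execute(QUERY, (person_id, workhistory_n, address_n)).fetchone()
    return row[0] if row else None

print(address_at(7, 1, 1))  # first address of the first work history record
print(address_at(7, 1, 2))  # second address of the same record
print(address_at(7, 2, 1))  # first address of the second record
```

The row numbers exist only in the query result, so nothing is stored in the tables. The window functions sit in a CTE because a window function cannot be referenced directly in a WHERE clause.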

Related

SQL Left Outer Join?

I have a table that should join to another table based on the unique id. If I do a LEFT OUTER JOIN I get duplicates. If I put DISTINCT in my SELECT I get the correct number of records. But if I include any field from the table that I LEFT OUTER JOINed, I get duplicates again. Here is my query:
SELECT DISTINCT
Table1.fname,
Table1.lname,
Table2.address
FROM Table1
LEFT OUTER JOIN Table2
ON Table2.user_id = Table1.user_id
In the example above I'm getting duplicates. I have also tried:
LEFT OUTER JOIN (
SELECT user_id
FROM Table2
GROUP BY user_id
) AS t2 ON Table1.user_id = t2.user_id
This gave me the correct number of records, but I need some additional columns from that second table. After I include the extra columns I get duplicates again, for example:
LEFT OUTER JOIN (
SELECT user_id, address
FROM Table2
GROUP BY user_id, address
) AS t2 ON Table1.user_id = t2.user_id
I'm wondering if I missed something or if there is a better way to handle this type of problem. If anyone sees something or knows a better solution, please let me know.
It is impossible for you to pick the correct answer here without understanding your data.
It seems that Table2 supports multiple addresses per user_id. This is a common design. If you want to return only one address per user_id you have several options:
Fix the data - Remove the duplicate addresses from Table2 and add a constraint that prevents this situation from recurring. You will need to determine which addresses are incorrect.
Reduce the left join to only include one address per user - How you do this will depend on your other data. You could use min() or max() with a group by if you don't care which address is returned when there are several. Otherwise you might order by an effective date and take the latest one, or, if there are billing and shipping addresses, pick the appropriate type.
Accept that there are multiple addresses per user - this may be correct - and adjust the rest of your code.
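The second option can be sketched as follows, assuming Table2 has an effective_date column (hypothetical; the question does not show one) and that the latest address is the one wanted. The sketch runs on SQLite through Python so that it is self-contained, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (user_id INTEGER PRIMARY KEY, fname TEXT, lname TEXT);
-- effective_date is an assumed column used to decide which address wins
CREATE TABLE Table2 (user_id INTEGER, address TEXT, effective_date TEXT);
INSERT INTO Table1 VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Fox');
INSERT INTO Table2 VALUES
    (1, 'old addr',  '2020-01-01'),
    (1, 'new addr',  '2023-05-01'),
    (2, 'only addr', '2021-03-03');
""")

rows = conn.execute("""
SELECT t1.fname, t1.lname, t2.address
FROM Table1 t1
LEFT OUTER JOIN (
    SELECT user_id, address,
           -- number each user's addresses, newest first
           ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY effective_date DESC) AS rn
    FROM Table2
) t2 ON t2.user_id = t1.user_id AND t2.rn = 1   -- keep only the newest address
ORDER BY t1.user_id
""").fetchall()

for row in rows:
    print(row)
```

Each user appears exactly once: with the latest address if there are several, and with a NULL address if Table2 has no row for that user. No DISTINCT is needed because the derived table can contribute at most one row per user_id.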

TSQL query to merge data from multiple tables that may or may not have matching rows?

For example, suppose we're conducting research where students can take up to 10 different tests, and each table in the database stores all the students' responses for one test. The tables are named after each test as: T1, T2, ... , T10. Suppose each table has a primary key column 'Username' that identifies each student. Students may or may not have completed each test, so there may or may not be a record in each table for each student.
What is the correct SQL Query to return all the test data from all tables, with one row per student (one row per username)? I want the simplest query possible that returns the correct results. I would also like to coalesce the Username fields into a single Username field in the final query.
To clarify, I understand that SQL has a major limitation in that it does not support a syntax to select all columns except one or more fields like "select *[^ExcludeColumn1][^ExcludeColumn2]". To avoid specifically naming all columns in the final query, it would be acceptable to leave all the Username columns there, as long as it includes a coalesced Username field at the beginning named something like RowID.
As for the overall query, one option would be to perform a union all on the username column of all ten tables, then select the distinct usernames across all tables, then perform a series of left joins against the list of distinct usernames on all 10 tables. That would result in a very straightforward query where each left join is performed on the same distinct set of usernames, but I want to avoid a separate up-front query for distinct usernames. (Although if that's the best option, let me know). It would look something like this:
select * from
(select distinct coalesce(t1.Username,t2.Username,...,t10.Username) as RowID from t1,t2,t3,t4,t5,t6,t7,t8,t9,t10) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
Although that is short and easy to write, it is incredibly inefficient and would take hours to run on test tables with 5000+ rows each, so with an adjustment, an equivalent version that runs in a few seconds is:
select * from (
select distinct Username as RowID from (
select Username from t1
union all
select Username from t2
union all
...
select Username from t10
) all_usernames) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
I think that what I have above might be the most efficient and correct query (takes only a couple seconds to run and returns correct result set), but I also thought perhaps it could be simplified with some kind of full join. The problem is that full joins get confusing with more than two tables, because without pre-determining the usernames, each subsequent table would have to match records against any of the preceding tables, resulting in a query where each additional table has "[previous table count] + 1" conditions on matching the username.
Assuming that Username is unique in each table, your second query would be the way I would try first, with the slight modifications of removing distinct and simply using union (which implies distinct) rather than union all:
select *
from (
select Username from t1
union
select Username from t2
union
-- ...
select Username from t10
) distinct_usernames
left join t1 on t1.Username = distinct_usernames.Username
left join t2 on t2.Username = distinct_usernames.Username
-- ...
left join t10 on t10.Username = distinct_usernames.Username
From there I would make sure that Username is indexed, possibly even using it as the clustered index. I've also had optimization luck in the past by implementing your distinct_usernames as a temp table (possibly indexed, or an indexed view) at the beginning of the proc, but only testing would determine if that were worthwhile.
A full outer join would require a bunch of or conditions or coalesce arguments, though it could be worth a try on just a few tables to see if the performance is there. I can't try to out-guess what your query engine will like best.
Also, getting just the column names that you want could be done with a query to sys.columns or information_schema.columns and using dynamic SQL to build your query as a string and then executing that.
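As a rough illustration of the union-plus-left-joins shape and of building the statement as a string, here is a small sketch. It uses SQLite through Python rather than T-SQL dynamic SQL, only three tables instead of ten, and an assumed score column per table; all of that is for illustration only. The table names come from a fixed list, not from user input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (Username TEXT PRIMARY KEY, score INTEGER);
CREATE TABLE t2 (Username TEXT PRIMARY KEY, score INTEGER);
CREATE TABLE t3 (Username TEXT PRIMARY KEY, score INTEGER);
INSERT INTO t1 VALUES ('alice', 90), ('bob', 80);
INSERT INTO t2 VALUES ('bob', 70), ('carol', 60);
INSERT INTO t3 VALUES ('alice', 50);
""")

tables = ["t1", "t2", "t3"]  # fixed, trusted list of test tables

# UNION (not UNION ALL) already removes duplicate usernames
usernames = "\n  UNION\n".join(f"  SELECT Username FROM {t}" for t in tables)
# name the wanted columns explicitly instead of SELECT *
columns = ", ".join(f"{t}.score AS {t}_score" for t in tables)
# every table joins against the same distinct username list
joins = "\n".join(f"LEFT JOIN {t} ON {t}.Username = u.Username" for t in tables)

query = f"""SELECT u.Username AS RowID, {columns}
FROM (
{usernames}
) u
{joins}
ORDER BY u.Username"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The result has one row per username, with NULL in the columns of any test the student did not take. Adding a table only means appending its name to the list.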

SQL FROM clause using n>1 tables

If you add more than one table to the FROM clause (in a query), how does this impact the result set? Does it first select from the first table then from the second and then create a union (i.e., only the rowspace is impacted?) or does it actually do something like a join (i.e., extend the column space)? And when you use multiple tables in the FROM clause, does the WHERE clause filter both sub-result-sets?
Specifying two tables in your FROM clause will execute a JOIN. You can then use the WHERE clause to specify your JOIN conditions. If you fail to do this, you will end up with a Cartesian product (every row in the first table indiscriminately joined to every row in the second).
The code will look something like this:
SELECT a.*, b.*
FROM table1 a, table2 b
WHERE a.id = b.id
However, I always try to explicitly specify my JOINs (with JOIN and ON keywords). That makes it abundantly clear (for the next developer) as to what you're trying to do. Here's the same JOIN, but explicitly specified:
SELECT a.*, b.*
FROM table1 a
INNER JOIN table2 b ON b.id = a.id
Note that now I don't need a WHERE clause. This method also helps you avoid generating an inadvertent Cartesian product (if you happen to forget your WHERE clause), because the ON is specified explicitly.
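The difference is easy to see with row counts. The sketch below uses SQLite through Python with made-up rows for table1 and table2 (the name and val columns are assumptions); it compares the comma form without a WHERE clause, the comma form with one, and the explicit INNER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, name TEXT);
CREATE TABLE table2 (id INTEGER, val TEXT);
INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO table2 VALUES (1, 'x'), (2, 'y');
""")

# Two tables in FROM and no join condition: every row pairs with every row.
cartesian = conn.execute(
    "SELECT COUNT(*) FROM table1 a, table2 b"
).fetchone()[0]

# Same FROM clause, with the join condition in WHERE.
implicit = conn.execute(
    "SELECT a.id, a.name, b.val FROM table1 a, table2 b "
    "WHERE a.id = b.id ORDER BY a.id"
).fetchall()

# Explicit JOIN ... ON, which returns the same rows.
explicit = conn.execute(
    "SELECT a.id, a.name, b.val FROM table1 a "
    "INNER JOIN table2 b ON b.id = a.id ORDER BY a.id"
).fetchall()

print(cartesian)             # 3 rows x 2 rows
print(implicit == explicit)  # both forms return the same matched rows
```

So the column space is extended (a join), not the row space (a union), and the WHERE clause filters the combined rows.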

SQL Same Column in one row

I have a lookup table that has a Name and an ID in it. Example:
ID NAME
-----------------------------------------------------------
5499EFC9-925C-4856-A8DC-ACDBB9D0035E CANCELLED
D1E31B18-1A98-4E1A-90DA-E6A3684AD5B0 31PR
The first record indicates an order status. The next indicates a service type.
In a query from an orders table I do the following:
INNER JOIN order.id = lut.Statusid
This returns the 'cancelled' name from my lookup table. I also need the service type in the same row. This is connected in the order table by orders.serviceid. How would I go about doing this?
Cancelled doesn't connect to 31PR.
Orders connects to both. Orders has 2 fields in it called Servicetypeid and orderstatusid. That is how those 2 connect to the order. I need to return both names in the same order row.
I think many will tell you that having two different pieces of data in the same column violates first normal form. There is a reason why having one lookup table to rule them all is a bad idea. However, you can do something like the following:
Select ..
From order
Join lut
On lut.Id = order.StatusId
Left Join lut As l2
On l2.id = order.ServiceTypeId
If order.ServiceTypeId (or whatever the column is named) is not nullable, then you can use a Join (inner join) instead.
A lot of info left out, but here it goes:
SELECT orders.id, lut1.Name AS OrderStatus, lut2.Name AS ServiceType
FROM orders
INNER JOIN lut lut1 ON lut1.ID = orders.orderstatusid
INNER JOIN lut lut2 ON lut2.ID = orders.Servicetypeid
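Joining the lookup table twice under two aliases can be tried out with the sketch below. It uses SQLite through Python, the two lookup rows from the question, and made-up orders; the column names orderstatusid and Servicetypeid follow the question's description, and the second order with no service type is an assumption added to show why the left join matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lut (ID TEXT PRIMARY KEY, NAME TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, orderstatusid TEXT, Servicetypeid TEXT);
INSERT INTO lut VALUES
    ('5499EFC9-925C-4856-A8DC-ACDBB9D0035E', 'CANCELLED'),
    ('D1E31B18-1A98-4E1A-90DA-E6A3684AD5B0', '31PR');
INSERT INTO orders VALUES
    (1, '5499EFC9-925C-4856-A8DC-ACDBB9D0035E', 'D1E31B18-1A98-4E1A-90DA-E6A3684AD5B0'),
    (2, '5499EFC9-925C-4856-A8DC-ACDBB9D0035E', NULL);  -- no service type yet
""")

rows = conn.execute("""
SELECT o.id, status.NAME AS OrderStatus, service.NAME AS ServiceType
FROM orders o
-- first alias resolves the status id
INNER JOIN lut AS status ON status.ID = o.orderstatusid
-- second alias resolves the service type id; LEFT JOIN keeps orders without one
LEFT JOIN lut AS service ON service.ID = o.Servicetypeid
ORDER BY o.id
""").fetchall()

for row in rows:
    print(row)
```

Each alias is an independent reference to the same table, so both names land in the same order row. With an inner join on the second alias, order 2 would drop out of the result.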

Performance problem on a query

I have a performance problem with a query.
The first table is a Customer table which has millions of records in it. It has email address columns and some other information about the customer.
The second table is a CommunicationInfo table which contains just email addresses.
What I want is the number of times each email address in the CommunicationInfo table occurs in the Customer table. What would be the best-performing query?
A basic query that illustrates the situation is:
Select ci.Email, count(*) from Customer c left join
CommunicationInfo ci on c.Email1 = ci.Email or c.Email2 = ci.Email
Group by ci.Email
But it takes about 5 to 6 minutes to execute.
Thanks in Advance.
This query is about as good as it gets if you have indexes on Customer.Email1 and Customer.Email2 and another on CommunicationInfo.Email:
Select
c.Email, count(*)
from Customer c
left join CommunicationInfo ci on c.Email1 = ci.Email
left join CommunicationInfo ci2 on c.Email2 = ci2.Email
Group by c.Email
You mention:
"What I want is the number of times each email address in the CommunicationInfo table occurs in the Customer table. What would be the best-performing query?"
To me, that sounds like you could easily use an INNER JOIN - this would most likely be a lot faster, since it will limit the search scope to just those customers who really do have an e-mail - anyone who doesn't have an e-mail at all (and thus a count(*) = 0) will not even be looked at - that might make a big difference even just in the number of rows SQL Server has to count and group.
So try this:
SELECT
ci.Email, COUNT(*)
FROM
dbo.Customer c
INNER JOIN dbo.CommunicationInfo ci
ON c.Email1 = ci.Email OR c.Email2 = ci.Email
GROUP BY
ci.Email
How does that perform in your case?
Using the OR condition robs the optimizer of the opportunity to use a HASH JOIN or MERGE JOIN.
Use this:
SELECT q2.Email, SUM(q2.cnt)
FROM (
SELECT ci.Email, COUNT(c.Email1) AS cnt
FROM CommunicationInfo ci
LEFT JOIN
Customer c
ON c.Email1 = ci.Email
GROUP BY
ci.Email
UNION ALL
SELECT ci.Email, COUNT(c.Email2) AS cnt
FROM CommunicationInfo ci
LEFT JOIN
Customer c
ON c.Email2 = ci.Email
GROUP BY
ci.Email
) q2
GROUP BY
q2.Email
or this:
SELECT ci.Email, COUNT(q.email)
FROM CommunicationInfo ci
LEFT JOIN
(
SELECT Email1 AS email
FROM Customer c
UNION ALL
SELECT Email2
FROM Customer
) q
ON q.Email = ci.Email
GROUP BY
ci.Email
Make sure that you have indexes on Customer(Email1) and Customer(Email2).
The first query will be more efficient if most of your email fields are empty, the second one if most of them are filled.
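To check that stacking Email1 and Email2 with UNION ALL gives the intended counts, here is a small sketch on SQLite through Python with made-up rows. It says nothing about performance on SQL Server, only about the result. Counting q.Email rather than * is what keeps an address with no matching customer at 0 instead of 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (Email1 TEXT, Email2 TEXT);
CREATE TABLE CommunicationInfo (Email TEXT);
INSERT INTO Customer VALUES
    ('a@x.com', 'b@x.com'),
    ('a@x.com', NULL),
    ('c@x.com', 'a@x.com');
INSERT INTO CommunicationInfo VALUES ('a@x.com'), ('b@x.com'), ('z@x.com');
""")

rows = conn.execute("""
SELECT ci.Email, COUNT(q.Email) AS cnt
FROM CommunicationInfo ci
LEFT JOIN (
    -- stack both email columns into one, so the join needs no OR
    SELECT Email1 AS Email FROM Customer
    UNION ALL
    SELECT Email2 FROM Customer
) q ON q.Email = ci.Email
GROUP BY ci.Email
ORDER BY ci.Email
""").fetchall()

for row in rows:
    print(row)
```

Here a@x.com is counted three times (twice as Email1, once as Email2), b@x.com once, and z@x.com not at all.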
Depending on your environment there may not be much you can do to optimize this.
A couple of questions:
How many records in CommunicationInfo?
How often do you really need to run this query? Is it a one time analysis, or are multiple people going to be running this every 10 minutes?
Are the fields indexed? I'll make a guess that neither Email1 nor Email2 field is indexed. However, I wouldn't suggest adding an index without taking the balance of the whole system into consideration.
Why are you using a left join? Do you really need EVERYTHING from the Customer table? You're counting, so no harm in doing an INNER JOIN.
Suggestions:
Run the query through the Query Optimization wizard to see if there is anything SQL Server would recommend.
An extreme suggestion would be to dump the Email1 and Email2 columns into a temp table and join to that. I've seen queries run slowly because of a large amount of stress on a particular table, so sometimes copying the records into a temp table is faster, but this technique is very dependent on how much memory there is, how fast IO is, and the amount of stress on a particular table.
