SQL Server inner join - sql-server

How should I choose the main table when joining multiple tables with an inner join?
A) Should I choose the main table based on its number of columns/rows (for example, use the larger table as the main table, or keep the larger table as the joined table)?
B) If I choose the table containing the column used in the where condition as the main table, will there be any performance benefit?
For example, let's say there are 2 tables, Table1 and Table2. Will there be any performance difference between the two solutions given below?
Solution 1 :
select t1.empid , t1.name , t1.dept , t2.add , t2.city , t2.country
from Table1 t1
inner join Table2 t2 on t2.empid = t1.empid
where t1.year = 2010
Solution 2 :
select t1.empid , t1.name , t1.dept , t2.add , t2.city , t2.country
from Table2 t2
inner join Table1 t1 on t1.empid = t2.empid
where t1.year = 2010

There is no difference. SQL Server will pick the "main" table and the join type based on table statistics.
Example:
Table1 contains 5 rows (only one of them with year 2010). Table2 contains 10000 rows.
SQL Server will generate a Nested Loops join with Table1 as the outer input and Table2 as the inner input, so it makes 1 pass over 10000 rows. It will definitely not generate 10000 passes over 1 row.
You can still get different execution plans for the statements above, but only if SQL Server decides that the plan is trivial and skips the optimization phase (because the tables are almost empty, for example).
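To illustrate that the two solutions are logically the same query, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made so the example runs anywhere), with made-up rows and a reduced column list. It only shows that both join orders return the same rows; it says nothing about SQL Server's actual plan choice.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (empid INTEGER PRIMARY KEY, name TEXT, dept TEXT, year INTEGER);
CREATE TABLE Table2 (empid INTEGER, city TEXT, country TEXT);
INSERT INTO Table1 VALUES (1, 'Ann', 'IT', 2010), (2, 'Bob', 'HR', 2011), (3, 'Cy', 'IT', 2010);
INSERT INTO Table2 VALUES (1, 'Oslo', 'NO'), (2, 'Rome', 'IT'), (3, 'Lima', 'PE');
""")

# Solution 1: Table1 written first.
solution_1 = """
SELECT t1.empid, t1.name, t2.city
FROM Table1 t1
INNER JOIN Table2 t2 ON t2.empid = t1.empid
WHERE t1.year = 2010
"""

# Solution 2: Table2 written first.
solution_2 = """
SELECT t1.empid, t1.name, t2.city
FROM Table2 t2
INNER JOIN Table1 t1 ON t1.empid = t2.empid
WHERE t1.year = 2010
"""

# Sort because neither query has an ORDER BY.
rows_1 = sorted(conn.execute(solution_1).fetchall())
rows_2 = sorted(conn.execute(solution_2).fetchall())
print(rows_1 == rows_2)
```

Since an inner join is commutative, the optimizer is free to pick whichever physical order it estimates to be cheaper, regardless of how the tables are written in the FROM clause.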

The main table should be the one that you are using the most columns from, so if you are selecting 4 columns from table1 and only need one column from table2, table1 would be the main table.

Related

Efficiently group query by one column, taking the maximum of another column and a third column that comes from the same row as the maximum column

I have a table of 100,000,000+ values, so efficiency is very important to me. I need to take information from table A, join it to an index table B, then join to table C using the index retrieved from table B. The problem is, there are multiple indexes for each value in table A, and I want to retrieve the one with the most recent date.
The query below creates duplicates:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM Table_1 t1
INNER JOIN Table_2 t2 ON t1.ID_1=t2.ID_1
INNER JOIN Table_3 t3 ON t2.ID_2=t3.ID_2
This one does not, but when going from 35,000 to 40,000 elements, the execution time goes from <5sec to >1min:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM
(SELECT * FROM Table_1 t1 CROSS APPLY Table_2 t2 WHERE t1.ID_1=t2.ID_1) t_temp
LEFT JOIN Table_3 t3 ON t_temp.ID_2=t3.ID_2
How can I decrease my execution time as much as possible?
Here is an example table:
For this table, I would be trying to get the most recent location for each person.
None of the columns are indexed and I cannot create indexes on this table.
First of all, when you are working with 100 million+ records and joining them to other tables, the first thing I would ask is what the rationale is for not creating indexes that can cover your query. If you are not the admin of that system, I would suggest that you bring this up with the admin group and try to understand the exact reason (if any) they do not want an index on that huge table, especially because you mentioned "efficiency is very important to me".
Remember that 'SQL Tuning' is only one of the steps of 'Database Performance Tuning', and you can tune only so much by writing a good SQL query. When the data volume gets huge, a good SQL query is never sufficient without other performance tuning measures.
Apart from what Roger has already provided, here are a few solutions that you can try out:
Solution 1
SELECT T1.ID_1, OA.ID_2, OA.Location
FROM Table1 T1
OUTER APPLY (
SELECT TOP 1 T3.ID_2, T3.Location
FROM Table2 T2
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2
WHERE T2.ID_1 = T1.ID_1
ORDER BY T3.Date DESC
) OA;
Solution 2:
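SELECT DISTINCT
T1.ID_1
-- ID_2 must also come from the latest row; selecting T2.ID_2 directly would keep one row per (ID_1, ID_2)
,ID_2 = FIRST_VALUE(T3.ID_2) OVER (PARTITION BY T1.ID_1 ORDER BY T3.Date DESC)
,Location = FIRST_VALUE(T3.Location) OVER (PARTITION BY T1.ID_1 ORDER BY T3.Date DESC)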
FROM Table1 T1
INNER JOIN Table2 T2
ON T1.ID_1 = T2.ID_1
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2;
Data Preparation:
DROP TABLE IF EXISTS Table1
DROP TABLE IF EXISTS Table2
DROP TABLE IF EXISTS Table3
SELECT TOP 10000 ID_1 = object_id, name
INTO Table1
FROM sys.all_objects
ORDER BY object_id
SELECT ID_1 = T1.ID_1, ID_2 = IDENTITY(INT, 1, 1)
INTO Table2
FROM Table1 T1
CROSS JOIN Table1 T2
SELECT ID_2, Location = 'City_'+ CAST(ID_2 AS VARCHAR(100)), Date = CAST(DATEADD(DAY, ID_2/10000, GETDATE()) AS DATE)
INTO Table3
FROM Table2
Indexes to cover the Solution 1:
CREATE NONCLUSTERED INDEX IX_TABLE1_ID_1 ON Table1 (ID_1)
CREATE NONCLUSTERED INDEX IX_TABLE2_ID_2 ON Table2 (ID_1, ID_2)
CREATE NONCLUSTERED INDEX IX_TABLE3_ID_2 ON Table3 (ID_2, Date DESC) INCLUDE (Location)
Execution Plan:
You can see that all operators are 'Index Seek' except for Table1, which is a legitimate 'Index Scan' because the query visits every ID_1 value in Table1. If you put a where clause on the outer query to search for a few specific ID_1 values, then that 'Index Scan' will turn into an 'Index Seek' as well.
I will leave the index strategy for the 2nd solution to you (as homework :) ). Tip: you have to make Location a key column as well. Or you can go with a COLUMNSTORE index approach.
You can use something like this:
select top (1) with ties
a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = b.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
order by row_number() over(partition by a.A_Id order by b.Date desc);
Alternatively, you can try an olde fashioneth approache:
select a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = b.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
where not exists (
select 0 from dbo.TableB pb where pb.A_Id = b.A_Id and pb.Date > b.Date
);
However, as in all such situations, its performance will depend heavily on indices. SSMS can suggest some if you look at the execution plan; off the top of my head, you will need all Id columns to be indexed, and you will need either a single-column (Date) or a composite (A_Id, Date, B_Id) index on TableB.
UPD: If you can't create or modify any indices, and performance is paramount, I would suggest copying the data in question into a separate schema or database, where you might have appropriate permissions. Apart from that... it's impossible to get something out of nothing.
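As a rough illustration of the "latest row per key" pattern, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption, chosen so the example is runnable). The rows and the Location column on TableC are made up. The NOT EXISTS subquery is correlated on A_Id, so each TableB row is only compared against other rows for the same TableA key, and the composite index mirrors the (A_Id, Date, B_Id) suggestion.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (A_Id INTEGER PRIMARY KEY);
CREATE TABLE TableB (B_Id INTEGER PRIMARY KEY, A_Id INTEGER, Date TEXT);
CREATE TABLE TableC (B_Id INTEGER PRIMARY KEY, Location TEXT);  -- Location is hypothetical
INSERT INTO TableA VALUES (1), (2);
INSERT INTO TableB VALUES
    (10, 1, '2020-01-01'),
    (11, 1, '2021-06-15'),
    (20, 2, '2019-03-03');
INSERT INTO TableC VALUES (10, 'Paris'), (11, 'Berlin'), (20, 'Madrid');
-- Composite index supporting the correlated lookup below.
CREATE INDEX IX_TableB_A_Date ON TableB (A_Id, Date, B_Id);
""")

# Keep a TableB row only if no later row exists for the same A_Id.
latest = conn.execute("""
SELECT a.A_Id, b.B_Id, b.Date, c.Location
FROM TableA a
INNER JOIN TableB b ON a.A_Id = b.A_Id
INNER JOIN TableC c ON c.B_Id = b.B_Id
WHERE NOT EXISTS (
    SELECT 1 FROM TableB pb
    WHERE pb.A_Id = b.A_Id AND pb.Date > b.Date
)
ORDER BY a.A_Id
""").fetchall()

for row in latest:
    print(row)
```

Note that if two rows for the same A_Id share the same latest Date, this pattern returns both of them, so a tie-breaker would be needed in that case.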

SQL get counts using subqueries from multiple linked tables

Suppose I have tables 1-4, and all the other tables are linked to table1. For what it's worth, table1, table2 and table3 are relatively small but table4 contains a lot of data.
Now I have the following query:
SELECT t1.id
, (SELECT COUNT(*) FROM table2 WHERE table1_id = t1.id) AS t2_count
, (SELECT COUNT(*) FROM table3 WHERE table1_id = t1.id) AS t3_count
, (SELECT COUNT(*) FROM table4 WHERE table1_id = t1.id) AS t4_count
FROM table1 t1
Because the subqueries are dependent/correlated, I assumed that there must be a better way (performance-wise) to get the data.
I tried the following, but it drastically increased the execution time (from about 2s to 35s). I'm guessing that the multiple left joins create a very big data set?!
SELECT t1.id
, COUNT(t2.id) AS t2_count
, COUNT(t3.id) AS t3_count
, COUNT(t4.id) AS t4_count
FROM table1 t1
LEFT JOIN table2 t2 ON t2.table1_id = t1.id
LEFT JOIN table3 t3 ON t3.table1_id = t1.id
LEFT JOIN table4 t4 ON t4.table1_id = t1.id
GROUP BY t1.id
Is there better way to get the counts? I don't need the data from the other tables.
UPDATE:
Bart's answer got me thinking that the table1_id columns are nullable. I added an IS NOT NULL check to the WHERE clauses and this brought the time down to 1s.
SELECT t1.id
, (SELECT COUNT(*) FROM table2 WHERE table1_id IS NOT NULL AND table1_id = t1.id) AS t2_count
, (SELECT COUNT(*) FROM table3 WHERE table1_id IS NOT NULL AND table1_id = t1.id) AS t3_count
, (SELECT COUNT(*) FROM table4 WHERE table1_id IS NOT NULL AND table1_id = t1.id) AS t4_count
FROM table1 t1
I guess not. If you execute a SELECT COUNT(*) FROM [table], it should perform a count on the table's PK. That should be pretty fast, even for very large tables.
Is your table4 a real table (and not a view, or a table-valued function, or something else that looks like a table)? And does it have a primary key? If so, I don't think that the performance of a SELECT COUNT(*) FROM [table4] query can be increased significantly.
It may also be the case, that your table4 is heavily targeted (in concurrent transactions over multiple connections), or perhaps your SQL Server is doing some heavy IO or computations. I cannot assume anything about that, however. You may try to check if your query is also slow on a restored database backup on a physically separate test server.
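For comparison, here is a minimal sketch of why the multi-LEFT JOIN version behaves differently from the correlated subqueries, plus one alternative that counts each child table in its own derived table before joining. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption), only two child tables, and made-up rows. Whether the pre-aggregated form is faster on a real server depends on indexes and data, so treat it as something to measure, not a guaranteed win.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER PRIMARY KEY);
CREATE TABLE table2 (id INTEGER PRIMARY KEY, table1_id INTEGER);
CREATE TABLE table3 (id INTEGER PRIMARY KEY, table1_id INTEGER);
INSERT INTO table1 VALUES (1), (2);
INSERT INTO table2 (table1_id) VALUES (1), (1);
INSERT INTO table3 (table1_id) VALUES (1), (1), (1), (2);
""")

# Joining both child tables first multiplies their rows per parent (2 x 3 = 6 for id 1).
joined = conn.execute("""
SELECT t1.id, COUNT(t2.id), COUNT(t3.id)
FROM table1 t1
LEFT JOIN table2 t2 ON t2.table1_id = t1.id
LEFT JOIN table3 t3 ON t3.table1_id = t1.id
GROUP BY t1.id
ORDER BY t1.id
""").fetchall()

# Correlated subqueries, as in the original query.
correlated = conn.execute("""
SELECT t1.id
     , (SELECT COUNT(*) FROM table2 WHERE table1_id = t1.id)
     , (SELECT COUNT(*) FROM table3 WHERE table1_id = t1.id)
FROM table1 t1
ORDER BY t1.id
""").fetchall()

# Count each child table once, then join the small aggregated results.
preaggregated = conn.execute("""
SELECT t1.id, COALESCE(c2.n, 0), COALESCE(c3.n, 0)
FROM table1 t1
LEFT JOIN (SELECT table1_id, COUNT(*) AS n FROM table2 GROUP BY table1_id) c2
       ON c2.table1_id = t1.id
LEFT JOIN (SELECT table1_id, COUNT(*) AS n FROM table3 GROUP BY table1_id) c3
       ON c3.table1_id = t1.id
ORDER BY t1.id
""").fetchall()

print(joined)
print(correlated)
print(preaggregated)
```

The joined version both inflates the intermediate row count and reports inflated counts, which is consistent with the slowdown described in the question.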

Custom order for multiple joins in SQL Server

I am trying to write a query in SQL Server that replicates the following figure:
I want the result of the first left join (order_defect & ncdef) to be left joined with the third table (filter), and then the result of these joins to be left joined with the last one (nsdic).
Each of these tables is huge, so I'm trying to find the most efficient way to do it, because I have limited space and I get an "out of memory" error... any suggestions for an efficient query?
If I do:
Select *
from A
left join B on a.id = B.id
left join C on a.id = c.id
it joins A and B first and then A and C... but I want the result of "A & B" to be joined with "C".
Basically, my question is how to use the result of one join to join with another table.
Thank you
select
c.id
,c.colum1
,c.colum2
,c.colum3
,c.colum4
,t3.colum1
from
(
select
t1.id as id
,t1.colum1 as colum1
,t1.colum2 as colum2
,t2.colum1 as colum3
,t2.colum2 as colum4
from table1 t1
left join table2 t2
on t1.id = t2.id
) as c
left join table3 t3
on c.id = t3.id
It's difficult to help without the table design (fields/columns, keys, ...).
But I would consider:
1 - The primary key fields, and how the tables relate to each other
2 - How to add "filters" to the left joins, or otherwise reduce the number of results
3 - Whether it would be better to use subqueries
Plus: try the query with TOP 100 to test it.
And remember: sometimes it's impossible to optimize queries because of hardware limits, such as RAM; in those cases you have to retrieve the data in sections.
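As a small illustration of how chained left joins relate to the derived-table form, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption). The tables A, B, C and their b_val/c_val columns are hypothetical. Chained LEFT JOINs are logically evaluated left to right, so the third table is already joined against the result of the first two, and both forms return the same rows.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (id INTEGER PRIMARY KEY);
CREATE TABLE B (id INTEGER PRIMARY KEY, b_val TEXT);
CREATE TABLE C (id INTEGER PRIMARY KEY, c_val TEXT);
INSERT INTO A VALUES (1), (2);
INSERT INTO B VALUES (1, 'b1');
INSERT INTO C VALUES (1, 'c1'), (2, 'c2');
""")

# Chained form: C is joined to the combined result of A and B.
chained = conn.execute("""
SELECT a.id, b.b_val, c.c_val
FROM A a
LEFT JOIN B b ON a.id = b.id
LEFT JOIN C c ON a.id = c.id
ORDER BY a.id
""").fetchall()

# Derived-table form: the A-and-B join is made explicit as a subquery.
derived = conn.execute("""
SELECT ab.id, ab.b_val, c.c_val
FROM (
    SELECT a.id AS id, b.b_val AS b_val
    FROM A a
    LEFT JOIN B b ON a.id = b.id
) AS ab
LEFT JOIN C c ON ab.id = c.id
ORDER BY ab.id
""").fetchall()

print(chained == derived)
```

The derived table mainly helps readability, or when the later join condition needs a column computed from the first join; it does not by itself reduce memory use.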

SQL Server many-to-many query

I have two tables, let's call them table1 and table2. They have columns called ID1 and ID2 respectively, which are the primary keys of the two tables.
I also have another table, called table3, which contains ID1 and ID2 and establishes a many-to-many relation between these two tables.
What I need is to get all the records from table2 that are not related to any record in table1.
Ex.
Table 1 contains a record with ID = 1
Table 2 contains two records with ID 1, 2
Table 3 contains one record with values 1 - 1
I need a query that will give me as result 2.
Can anyone suggest a way to proceed?
Thanks
SELECT t2.ID2
FROM table2 t2
WHERE NOT EXISTS(SELECT NULL
FROM table3 t3
WHERE t3.ID2 = t2.ID2);
You could also use a LEFT JOIN:
SELECT t2.ID2
FROM table2 t2
LEFT JOIN table3 t3
ON t2.ID2 = t3.ID2
WHERE t3.ID2 IS NULL;
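Here is a minimal sketch that runs both queries against the example data from the question (table1 has ID 1, table2 has IDs 1 and 2, table3 has the pair 1 - 1). It uses Python's built-in sqlite3 module as a stand-in for SQL Server, which is an assumption made only so the example is runnable.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (ID1 INTEGER PRIMARY KEY);
CREATE TABLE table2 (ID2 INTEGER PRIMARY KEY);
CREATE TABLE table3 (ID1 INTEGER, ID2 INTEGER);
INSERT INTO table1 VALUES (1);
INSERT INTO table2 VALUES (1), (2);
INSERT INTO table3 VALUES (1, 1);
""")

# Anti-join with NOT EXISTS: keep table2 rows that have no link row in table3.
not_exists = conn.execute("""
SELECT t2.ID2
FROM table2 t2
WHERE NOT EXISTS (SELECT NULL
                  FROM table3 t3
                  WHERE t3.ID2 = t2.ID2)
""").fetchall()

# Same result with LEFT JOIN, keeping only the unmatched rows.
left_join = conn.execute("""
SELECT t2.ID2
FROM table2 t2
LEFT JOIN table3 t3 ON t2.ID2 = t3.ID2
WHERE t3.ID2 IS NULL
""").fetchall()

print(not_exists, left_join)
```

Both forms return the single unlinked record, ID2 = 2, for this data.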

Query too long in .NET 3.5 and SQL Compact 3.5

This is a rather pathetic problem:
In Visual Basic 2008 Express with SQL Server Compact 3.5 and the usual DataSet / TableAdapters, my query is too long.
With the names changed, it is a query like this:
SELECT Table1.*, Table3.*
FROM Table1
INNER JOIN Table2
ON Table1.ID = Table2.T1ID
INNER JOIN Table3
ON Table2.T3ID = Table3.PK
The problem is that Table1 and Table3 have around 10 and 5 columns respectively, with rather descriptive names, and the TableAdapter insists on writing all the columns out and therefore cuts off my command. (It won't take *'s; it always says it can't find the column Table1.*)
Is there a way around this?
If you add an alias for the tables, will the table adapter take it?
SELECT t1.*, t3.*
FROM Table1 t1
INNER JOIN Table2 t2
ON t1.ID = t2.T1ID
INNER JOIN Table3 t3
ON t2.T3ID = t3.PK
Put the query in a stored procedure and then just call it.
(It's also cleaner, more efficient, and helps encapsulate the logic within the database instead of your application just doing anything it likes)
