SQL get counts using subqueries from multiple linked tables - sql-server

Suppose I have tables 1-4, all the other tables are linked to table1. For what its worth, table1, table2 and table3 are relatively small but table4 contains a lot of data.
Now I have the following query:
SELECT t1.id
, (SELECT COUNT(*) FROM table2 WHERE table1_id = t1.id) AS t2_count
, (SELECT COUNT(*) FROM table3 WHERE table1_id = t1.id) AS t3_count
, (SELECT COUNT(*) FROM table4 WHERE table1_id = t1.id) AS t4_count
FROM table1 t1
Due to the fact that the subqueries are dependent/correlated I assumed that there must be a better way (performance wise) to get the data.
I tried to do the following but it drastically increased the execution time (from about 2s to 35s). I'm guessing that the multiple left joins creates a very big data set?!
SELECT t1.id
, COUNT(t2.id) AS t2_count
, COUNT(t3.id) AS t3_count
, COUNT(t4.id) AS t4_count
FROM table1 t1
LEFT JOIN table2 t2 ON t2.table1_id = t1.id
LEFT JOIN table3 t3 ON t3.table1_id = t1.id
LEFT JOIN table4 t4 ON t4.table1_id = t1.id
GROUP BY t1.id
Is there better way to get the counts? I don't need the data from the other tables.
UPDATE:
Bart's answer got me thinking that the table1_id columns are nullable. I added a IS NOT NULL check to the WHERE clauses and this brought the time down to 1s.
SELECT t1.id
, (SELECT COUNT(*) FROM table2 WHERE table1_id IS NOT NULL AND table1_id = t1.id) AS t2_count
, (SELECT COUNT(*) FROM table3 WHERE table1_id IS NOT NULL AND table1_id = t1.id) AS t3_count
, (SELECT COUNT(*) FROM table4 WHERE table1_id IS NOT NULL AND table1_id = t1.id) AS t4_count
FROM table1 t1

I guess not. If you execute a SELECT COUNT(*) FROM [table], it should perform a count on the table's PK. That should be pretty fast, even for very large tables.
Is your table4 a real table (and not a view, or a table-valued function, or something else that looks like a table)? And does it have a primary key? If so, I don't think that the performance of a SELECT COUNT(*) FROM [table4] query can be increased significantly.
It may also be the case, that your table4 is heavily targeted (in concurrent transactions over multiple connections), or perhaps your SQL Server is doing some heavy IO or computations. I cannot assume anything about that, however. You may try to check if your query is also slow on a restored database backup on a physically separate test server.

Related

Efficiently group query by one column, taking the maximum of another column and a third column that comes from the same row as the maximum column

I have a table of 100,000,000+ values, so efficiency is very important to me. I need to take information from table A, join it to an index table B, then join to table C using the index retrieved from table B. The problem is, there are multiple indexes for each value in table A, and I want to retrieve the one with the most recent date.
The query below creates duplicates:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM Table_1 t1
INNER JOIN Table_2 t2 ON t1.ID_1=t2.ID_1
INNER JOIN Table_3 t3 ON t2.ID_2=t3.ID_2
This one does not, but when running with more than 35,000 vs 40,000 elements, the execution time goes from <5sec to >1min:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM
(SELECT * FROM Table_1 l CROSS APPLY Table_2 t2 WHERE t1.ID_1=t2.ID_1) t_temp
LEFT JOIN Table_3 t3 ON t_temp.ID_2=t3.ID_2
How can I decrease my execution time as much as possible?
Here is an example table:
For this table, I would be trying to get the most recent location for each person.
None of the columns are indexed and I cannot create indexes on this table.
First of all, when you are working on 100 Million+ records and that
too joining to other tables, first thing I would ask is what is the
rationale behind not creating indexes which can cover your query. If
you are not the admin of that system, I would suggest that you
should bring this up to admin group and try to understand what is
the exact reason (if any) they do not want index on that huge table.
Specially because you mentioned "efficiency is very important to
me".
Remember that 'SQL Tuning' is only one of the steps of 'Database Performance Tuning' and you can tune only as much with writing a good SQL Query. When the data volume gets huge, a good SQL Query is never sufficient without taking other Performance Tuning Measures.
Apart from what Roger has already provided, here are a few solutions that you can try out:
Solution 1
SELECT T1.ID_1, OA.ID_2, OA.Location
FROM Table1 T1
OUTER APPLY (
SELECT TOP 1 T3.ID_2, T3.Location
FROM Table2 T2
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2
WHERE T2.ID_1 = T1.ID_1
ORDER BY T3.Date DESC
) OA;
Solution 2:
SELECT DISTINCT
T1.ID_1
,T2.ID_2
,Location = FIRST_VALUE(T3.Location) OVER (PARTITION BY T1.ID_1 ORDER BY T3.Date DESC)
FROM Table1 T1
INNER JOIN Table2 T2
ON T1.ID_1 = T2.ID_1
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2;
Data Preparation:
DROP TABLE IF EXISTS Table1
DROP TABLE IF EXISTS Table2
DROP TABLE IF EXISTS Table3
SELECT TOP 10000 ID_1 = object_id, name
INTO Table1
FROM sys.all_objects
ORDER BY object_id
SELECT ID_1 = T1.ID_1, ID_2 = IDENTITY(INT, 1, 1)
INTO Table2
FROM Table1 T1
CROSS JOIN Table1 T2
SELECT ID_2, Location = 'City_'+ CAST(ID_2 AS VARCHAR(100)), Date = CAST(DATEADD(DAY, ID_2/10000, GETDATE()) AS DATE)
INTO Table3
FROM Table2
Indexes to cover the Solution 1:
CREATE NONCLUSTERED INDEX IX_TABLE1_ID_1 ON Table1 (ID_1)
CREATE NONCLUSTERED INDEX IX_TABLE2_ID_2 ON Table2 (ID_1, ID_2)
CREATE NONCLUSTERED INDEX IX_TABLE3_ID_2 ON Table3 (ID_2, Date DESC) INCLUDE (Location)
Execution Plan:
You can see that all are 'Index Seek' except for Table1 which is an legitimate 'Index Scan' because you are doing scans for each value of Table1's ID_1 value. If you put a where clause in the outer loop to search for a few specific ID_1 values, then that 'Index Scan' will turn to a 'Index Seek' as well.
I will leave the Index Strategy for the 2nd solution to you (as a homework :) ). Tips: You have to make the Location as a key as well. Or you can go with COLUMNSTORE index approach.
You can use something like this:
select top (1) with ties
a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = it.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
order by row_number() over(partition by a.A_Id order by b.Date desc);
Alternatively, you can try an olde fashioneth approache:
select a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = b.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
where not exists (
select 0 from dbo.TableB pb where pb.B_Id = b.B_Id and pb.Date > b.Date
);
However, as with all such situations, its performance will heavily depend on indices. SSMS can suggest you some, if you will look at the execution plan; off the top of my head, you will need all Id columns to be indexed, and you will need either a single (Date) or a composite (A_Id, Date, B_Id) on the TableB.
UPD: If you can't create or modify any indices, and performance is paramount, I would suggest copying the data in question into a separate schema or database, where you might have appropriate permissions. Apart from that... it's impossible to get something out of nothing.

Does this query working like cross join?

I am trying to run this query but couldn't understand its working process.
SELECT *
FROM TABLE1 T1
INNER JOIN TABLE2 T2 ON T1.ID = T2.ID
LEFT JOIN TABLE1 T3 ON T1.ID = T2.ID
table1 contains 3 records with sequential id 1, 2, 3 and table2 contains 4 records with sequential id 1, 2, 3, 4
and one another thing i also want to know that is, does this query executing from right to left? i mean left join process executes first then inner join? i am saying this according to query execution plan.
Here is your current query:
SELECT *
FROM table1 t1
INNER JOIN table2 t2
ON t1.id = t2.id
LEFT JOIN table1 t3
ON t1.id = t2.id
The first join is just a normal inner join between table1 and table2. The second join is a left join, but the ON condition is superfluous, and I believe the behavior would be the same without this condition even being there. The reason for this is that the join condition t1.id = t2.id will already always be true at that point in the query, for every record in the intermediate table. Hence, it appears that the second join would effectively be a cross join with table1.
Typically, your join condition will involve the two tables being joined.
Yes it's working like a cross join. If this is the intention you should rewrite it correctly. What you have there is misleading and confusing (where did it come from originally?)
select * from table1 t1
inner join table2 t2
on t1.id=t2.id
cross join table1 t3
The order that tables, filters, joins are evalulated are dictated by the query plan (press CTRL-L). This may change at any time. You shouldn't be concerned about the ordered these run in - you just need to know that you will get the same results no matter how it is executed. The query planner might choose one method over the other if it thinks it will be faster

Convert T-SQL Cross Apply to Oracle

I'm looking to convert this SQL Server (T-SQL) query that uses a cross apply to Oracle 11g. Oracle does not support Cross Apply until 12g, so I have to find a work-around. The idea behind the query is for each Tab.Name that = 'Foobar', I need find the previous row's name with the same ID ordered by Tab.Date. (This table contains multiple rows for 1 ID with different Name and Date).
Here is the T-SQL code:
SELECT DISTINCT t1.ID
t1.Name,
t1.Date,
t2.Date as 'PreviousDate',
t2.Name as 'PreviousName'
FROM Tab t1
OUTER apply (SELECT TOP 1 t2.Date,
t2.Name
FROM Tab t2
WHERE t1.Id = t2.Id
ORDER BY t2.Date DESC) t2
WHERE t1.Name = 'Foobar' )
Technically, I was able to recreate this same functionality in Oracle using LEFT JOIN and LAG() function:
SELECT DISTINCT t1.ID
t1.Name,
t1.Date,
t2.PreviousDate as PreviousDate,
t2.PreviousName as PreviousName
FROM Tab t1
LEFT JOIN (
SELECT ID,
LAG(Name) OVER (PARTITION BY ID ORDER BY PreviousDate) as PreviousName,
LAG(Date) OVER (PARTITION BY ID ORDER BY PreviousDate) as PreviousDate
FROM Tab) t2 ON t2.ID = t1.ID
WHERE t1.Name = 'Foobar'
The issue is the order it executes the Oracle query. It will pull back ALL rows from Tab, order them (because of the LAG function), then it will filter them down using the ON statement when it joins it to the main query. That table has millions of records, so doing that for EACH ID is not feasible. Basically, I want to change the order of operations in the sub-query to just pull back rows for a single ID, sort those rows to find the previous, and join that. Any ideas on how to tweak it?
TL;DR
SQL Server: filters, orders, joins
Oracle: orders, filters, joins
You can look for the latest row per (id) group with row_number():
select *
from tab t1
left join
(
select row_number() over (
partition by id
order by Date desc) as rn
, *
from t2
) t2
on t1.id = t2.id
and t2.rn = 1 -- Latest row per id

SQL Server inner join

How to choose main table when joining multiple tables using inner join?
A) Should I choose main table depending on its number of columns/rows (For example large main as main table or to keep larger table as join table)?
B) If I choose the table containing column that I use in where condition as main table , will there be any performance benefit ?
For example lets say there are 2 tables. Table1 & Table2 . Will there be any performance difference between two solutions given below
Solution 1 :
select t1.empid , t1.name , t1.dept , t2.add , t2.city , t2.country
from Table1 t1
inner join Table2 t2 on t2.empid = t1.empid
where t1.year = 2010
Solution 2 :
select t1.empid , t1.name , t1.dept , t2.add , t2.city , t2.country
from Table2 t2
inner join Table1 t1 on t1.empid = t2.empid
where t1.year = 2010
There is no difference. SQL Server will pick "main" table and join type based on table statistics.
Example:
Table1 contains 5 rows (and only one with year 2010). Table2 contains 10000 rows.
SQL Server will generate Nested Loops join with Table1 as outer input, Table2 as inner input, to get 1 run over 1000 rows. It will definitely not generate 10000 cycles over 1 row.
You still can get different execution plans for statements above, but only in case if SQL Server will decide that plan should be trivial and will skip the optimization phase (because tables are almost empty, for example).
The main table should be the one that you are using the most columns from, so if you are using table1 with 4 columns and you need to get one column out of table2.

CROSS APPLY Performance

Is it possible to improve performance by taking the following SQL:
SELECT t1.id,
t1.name,
t2.subname,
t2.refvalue
FROM table1 AS t1
CROSS apply (SELECT TOP 1 t2.subid,
t2.subname,
t3.refvalue
FROM table2 AS t2
INNER JOIN table3 AS t3
ON t2.subid = t3.subid
ORDER BY lastupdated DESC) AS t2
And rewriting it so that it looks like this:
SELECT t1.id,
t1.name,
t2.subname,
t3.refvalue
FROM table1 AS t1
CROSS apply (SELECT TOP 1 t2.subid,
t2.subname
FROM table2 AS t2
ORDER BY lastupdated DESC) AS t2
INNER JOIN table3 AS t3
ON t2.subid = t3.subid
Firstly, does it give the same result?
If so, what does the query plan say, and also set statistics io on?
How many rows in Table1, Table2 and Table3? How many intersect and end up in the result? I'm trying to figure out the purpose of rewriting the query, and agree with gbn... do you get the same result, does the query plan look the same in both cases, do the statistics i/o get any better, and does the rewritten query run any faster?

Resources