I am aware of the sequence of the execution of SQL statements but I still want to make sure few things with the help of SQL experts here. I have a big SQL query which returns thousands of rows. Here is the minimized version of the query which I wrote and think that it is correct.
Select *
from property
inner join tenant t on (t.hproperty = p.hmy **and p.hmy = 7**)
inner join commtenant ct on ct.htenant = t.hmyperson
where 1=1
My colleague says that above query is equivalent to below query performance wise(He is very confident about it)
Select *
from property
inner join tenant t on (t.hproperty = p.hmy)
inner join commtenant ct on ct.htenant = t.hmyperson
where **p.hmy = 7**
Could anybody help me with the explanation about why above queries are not equivalent or equivalent? Thanks.
If you want to know if two queries are equivalent, learn how to look at the execution plans in SQL Server Management Studio. You can put the two queries in different windows, look at the estimated execution plans, and see for yourself if they are the same.
In this case, they probably are the same. SQL is intended to be a descriptive language, not a procedural language. That is, it describes the output you want, but the SQL engine is allowed to rewrite the query to be as efficient as possible. The two forms you have describe the same output. Do note that if there were a left outer join instead of an inner join, then the queries would be different.
In all likelihood, the engine will read the table and filter the records during the read or use an index for the read. The key idea, though, is that the output is the same and SQL Server can recognize this.
"p.hmy = 7" is not a join condition, as it relates only to a single table. As such, it doesn't really belong in the ON clause of the join. Since you are not adding any information by placing the condition in the ON clause, having it in the WHERE clause (in which it really belongs) will not make any difference to the query plan generated. If in doubt, look at the query plans.
Related
Some context: I have two tables, smalltable and bigtable. Smalltable contains 10,000 rows, whereas bigtable contains 2,000,000, and I am using SQL Server 2008. I started with a query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
This query was running for over 15 minutes before I killed it - on inspecting the query plan, it was using a nested loop to do the inner join.
I then rewrote the query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname)
UNION
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name6=t2.firstname and t1.name1=t2.lastname)
The two queries above then instead executed using a Hash Match, and the whole query ran in 4 seconds.
My question is, why does SQL Server get the query plan so wrong, and was there anyway that I could have fixed the original query without rewriting it? I tried adding a hint to use a Hash Match to the first query, but it seems that you are not allowed to with multiple join criteria?
Update: Added examples of the kind of data in the tables as requested. Note, the code is looking for name matches where names may have been swapped around:
Smalltable(Columns name1,name6)
John, Smith
Johnny, Smith
Smythe, Jon
Michaels, Robert
Bob, Brown
Bigtable (Columns firstname,lastname)
John, Smith
John, Smythe
Johnny, Smith
Alison, Roberts
Robert, Michaels
Janet, Green
It is a problem within SQL Server optimizer.
The condition (t1.name1=t2.firstname and t1.name6=t2.lastname) uses clustered index seek only and thus is very fast and executes almost instantly even with very large tables.
But the condition with OR
(t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
generally performs much worse usually performing full scan. You should see execution plans for your queries.
The Query Optimizer will always perform a table scan or a clustered
index scan on a table if the WHERE clause in the query contains an OR
operator and if any of the referenced columns in the OR clause are not
indexed (or do not have a useful index). Because of this, if you use
many queries with OR clauses, you will want to ensure that each
referenced column in the WHERE clause has an index.
If you have a query that uses ORs and it is not making the best use
of indexes, consider rewriting it as a UNION and then testing
performance. Only through testing can you be sure that one version of
your query will be faster than another.
See here. So shortly your first OR query does not make good use of indexes.
I believe there is no other way than rewrite the query with UNION (as you did) or APPLY, optimizer will not do it.
I have a table in my MS SQL Database called PolicyTransactions. This table has two important columns:
trans_id INT IDENTITY(1,1),
policy_id INT NOT NULL,
I need help writing a query that will, for each trans_id/policy_id in the table, join it to the last previous trans_id for that policy_id. This seems like a simple enough query, but for some reason I can't get it the gel in my brain right now.
Thanks!
I cooked this up for you.... Hopefully its what you're looking for: http://sqlfiddle.com/#!6/e7dc39/8
Basically, a cross apply is different from a subquery or regular join. It is a query that gets executed per each row that the outer portion of the query returns. This is why it has visibility into the outer tables (a subquery would not have this ability) and this is why its using the old school join syntax (old school meaning the join condition on _ = _ is in the where clause).
Just be really careful with this solution as cross apply isn't necessarily the fastest thing on earth. However, if the indexing on the tables is decent, that tiny query should run pretty quickly.
Its the only way I could think of to solve it, but it doesn't mean its the only way!
just a super quick edit: If you notice, some rows are not returned because they are the FIRST policy and therefore don't have a tran_id less than them with the same policy number. If you want to simulate an outer join with an apply, use outer apply :)
If you are using SQL Server 2012 or later you should use the LAG() function. See snippet below, I feel that its much cleaner than the other answer given here.
SELECT trans_id, policy_id, LAG(trans_id) OVER (PARTITION BY policy_id ORDER BY trans_id)
FROM PolicyTransaction
I've been pondering the question which of those 2 Statements might have a higher performance (and why):
select * from formelement
where formid = (select id from form where name = 'Test')
or
select *
from formelement fe
inner join form f on fe.formid = f.id
where f.name = 'Test'
One form contains several form elements, one form element is always part of one form.
Thanks,
Dennis
look at the execution plan, most likely it will be the same if you add the filtering to the join, that said the join will return everything from both tables, the in will not
I actually prefer EXISTS over those two
select * from formelement fe
where exists (select 1 from form f
where f.name='Test'
and fe.formid =f.id)
The performance depends on the query plan choosen by the SQL Server Engine. The query plan depends on a lot of factors, including (but not limited to) the SQL, the exact table structure, the statistics of the tables, available indexes, etc.
Since your two queries are quite simple, my guess would be that they result in the same (or a very similar) execution plan, thus yielding comparable performance.
(For large, complicated queries, the exact wording of the SQL can make a difference, the book SQL Tuning by Dan Tow gives a lot of great advice on that.)
Which of the following query is better... This is just an example, there are numerous situations, where I want the user name to be displayed instead of UserID
Select EmailDate, B.EmployeeName as [UserName], EmailSubject
from Trn_Misc_Email as A
inner join
Mst_Users as B on A.CreatedUserID = B.EmployeeLoginName
or
Select EmailDate, GetUserName(CreatedUserID) as [UserName], EmailSubject
from Trn_Misc_Email
If there is no performance benefit in using the First, I would prefer using the second... I would be having around 2000 records in User Table and 100k records in email table...
Thanks
A good question and great to be thinking about SQL performance, etc.
From a pure SQL point of view the first is better. In the first statement it is able to do everything in a single batch command with a join. In the second, for each row in trn_misc_email it is having to run a separate BATCH select to get the user name. This could cause a performance issue now, or in the future
It is also eaiser to read for anyone else coming onto the project as they can see what is happening. If you had the second one, you've then got to go and look in the function (I'm guessing that's what it is) to find out what that is doing.
So in reality two reasons to use the first reason.
The inline SQL JOIN will usually be better than the scalar UDF as it can be optimised better.
When testing it though be sure to use SQL Profiler to view the cost of both versions. SET STATISTICS IO ON doesn't report the cost for scalar UDFs in its figures which would make the scalar UDF version appear better than it actually is.
Scalar UDFs are very slow, but the inline ones are much faster, typically as fast as joins and subqueries
BTW, you query with function calls is equivalent to an outer join, not to an inner one.
To help you more, just a tip, in SQL server using the Managment studio you can evaluate the performance by Display Estimated execution plan. It shown how the indexs and join works and you can select the best way to use it.
Also you can use the DTA (Database Engine Tuning Advisor) for more info and optimization.
I have 2 tables Person_Organization and Person_Organization_other and nested query is :
SELECT
Person_Organization_id
FROM
Person_Organization_other
WHERE
company_name IN (SELECT company_name
FROM Person_Organization_other
WHERE Person_Organization_id IN (SELECT Person_Organization_Id
FROM Person_Organization
WHERE person_id = 117
AND delete_flag = 0)
)
Whereas the above query's corresponding query with join that I tried is :-
SELECT
poo.Person_Organization_id
FROM
Person_Organization_other poo, Person_Organization_other poo1, Person_Organization po
WHERE
poo1.Person_Organization_id = po.Person_Organization_Id
AND po.person_id = 117
AND po.delete_flag = 0
AND poo.company_name = poo1.company_name
GROUP BY
poo.Person_Organization_id
However the nested query is found to take less time as compared to it's corresponding query with joins. I used SQL profiler trace to compare times of executed queries. For the nested query it took 30 odd ms. For the joined query it took 41 odd ms
I was under the impression that as a rule nested queries are less perfomant and should be "flattened out" using joins.
Could someone explain what I am doing wrong?
regards
Nitin
You are using cross joins. Try inner joins.
select poo.Person_Organization_id
from Person_Organization po
INNER JOIN Person_Organization_other poo ON
poo.Person_Organization_id=po.Person_Organization_Id
INNER JOIN Person_Organization_other poo1 ON
poo1.Person_Organization_id=po.Person_Organization_Id AND
poo.company_name=poo1.company_name
where po.person_id=117 AND po.delete_flag=0
group by poo.Person_Organization_id
By separating your tables with commas, you are effectively CROSS JOINing them together. I would try doing explicit INNER JOINs between the tables and see if that helps performance.
The view that nested queries are less performant and should be flattened out using joins is a myth - it is true that inappropriate nested subqueries can cause performance issues, however in many cases using a subquery is just as good as using a join.
In fact the SQL server optimises all queries that it executes by reducing them to an execution tree - often queries that use a JOIN end up with identical execution trees to equivalent sql statements that use nested queries instead.
In this case the execution time of these is really low anyway - the difference could just as easily be explained as due to caches etc... not being filled.
My advice would be to use whatever syntax makes more sense to you - if you have a performance problem then by all means go back and check to see if a nested subquery is the cause of your problem, however I definitely wouldn't spend time worrying about "flattening out" queries that aren't causing problems.
Your order of tables might reduce the performance your table order in from clause should be in increasing order of number of rows