SQL Server performance - Subselect or Inner Join? - sql-server

I've been wondering which of these two statements performs better (and why):
select * from formelement
where formid = (select id from form where name = 'Test')
or
select *
from formelement fe
inner join form f on fe.formid = f.id
where f.name = 'Test'
One form contains several form elements, one form element is always part of one form.
Thanks,
Dennis

Look at the execution plan; most likely it will be the same if you add the filtering to the join. That said, the join will return the columns from both tables, whereas the subselect will not.
I actually prefer EXISTS over those two:
select * from formelement fe
where exists (select 1 from form f
where f.name='Test'
and fe.formid =f.id)

The performance depends on the query plan chosen by the SQL Server engine. The query plan depends on a lot of factors, including (but not limited to) the SQL, the exact table structure, the statistics of the tables, available indexes, etc.
Since your two queries are quite simple, my guess would be that they result in the same (or a very similar) execution plan, thus yielding comparable performance.
(For large, complicated queries, the exact wording of the SQL can make a difference; the book SQL Tuning by Dan Tow gives a lot of great advice on that.)
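One way to check this on your own data is to run the variants side by side with I/O and timing statistics switched on. The sketch below assumes form.id is the key of form; it also selects fe.* in the join variant so that all three statements return the same columns, which is an adjustment to the original queries rather than something the question specified.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Scalar subquery: raises an error if more than one form is named 'Test'
SELECT fe.*
FROM formelement AS fe
WHERE fe.formid = (SELECT f.id FROM form AS f WHERE f.name = 'Test');

-- Join: fe.* keeps the column list identical to the other variants
SELECT fe.*
FROM formelement AS fe
INNER JOIN form AS f ON fe.formid = f.id
WHERE f.name = 'Test';

-- EXISTS: never duplicates formelement rows, however many forms match
SELECT fe.*
FROM formelement AS fe
WHERE EXISTS (SELECT 1
              FROM form AS f
              WHERE f.name = 'Test'
                AND fe.formid = f.id);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

If the logical reads and the plans come out the same for all three, the choice is mostly a matter of readability and of which semantics you want when form.name is not unique.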

Related

Does SQL Server expand a view's sql inline during execution?

Let's say I have a (hypothetical) table called Table1 with 500 columns and there is a view called View1 which is basically
select Column1, Column2,..., Column500, ComputedOrForeignKeyColumn1,...
from Table1
inner join ForeignKeyTables .....
Now, when I execute something like
Select Column32, Column56
from View1
which one of the below 3 does SQL Server turn it into?
Query #1:
select Column32, Column56
from
(select
Column1, Column2,..., Column500, ComputedOrForeignKeyColumn1,...
from
Table1
inner join
ForeignKeyTables ......) v
Query #2:
Select Column32, Column56
from Table1
Query #3:
select Column32, Column56
from
(select Column32, Column56
from Table1) v
The reason I'm asking is that I have a very wide table with a view sitting on top of it (the view basically inner joins to bring in the texts for all foreign key ids), and I can't figure out whether SQL Server fetches all columns and then selects the ones that are needed, or fetches only those that are needed (while also ignoring unnecessary joins etc.). If it is the former, then a view would not be the best for performance.
SQL Server query compilation can be split into phases:
Parsing
Binding
Optimization
View resolution is performed during binding. At this stage the view reference is replaced with its definition. At this point, unused view columns will be present.
The next stage is optimization, where the bound syntax tree is transformed into an execution plan. The optimizer considers many kinds of manipulations on the execution plan to increase efficiency, and removing unused columns is one of the most basic. At this point, the unused column references will be removed.
So to answer your question, unused columns in the view definition will not impact performance, since the optimizer will be smart enough to remove them.
Note: this answer assumes the view is not indexed. For indexed views, the resolution process works differently, and there is view maintenance overhead for UPDATEs of the base tables.
None of the above. SQL Server will parse the query and create an execution plan. The resulting execution plan is calculated based on many factors, like indexes, joins, etc.
Your question cannot be truly answered by anyone other than you, by examining that execution plan.
See How do I obtain a Query Execution Plan? for more information.
The view definition is merged with the outer query at a very early stage of compilation. You may or may not get the same execution plan for a query on a view as for an equivalent query touching the base tables, depending on the complexity of the view and the limitations of the query optimizer (QO).
For your particular case it's worth noting that an inner join doesn't only fetch data from joined tables, but it also limits the result (in the same way as an IF EXISTS check does). If there is a declarative FK between the tables, the QO will be smart enough not to check the referenced tables, as the existence is guaranteed by the constraint, but otherwise it has to.
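A minimal sketch of how to see this for yourself. The tables dbo.Status and dbo.Wide and the view dbo.WideView are hypothetical stand-ins for the lookup table, the wide table and the view in the question. The foreign key column is declared NOT NULL and the constraint is created with checking enabled (so it is trusted), which are the conditions under which the optimizer can usually skip the join.

```sql
-- Hypothetical schema; GO is the batch separator used by SSMS and sqlcmd
CREATE TABLE dbo.Status (
    StatusId   int          NOT NULL PRIMARY KEY,
    StatusText nvarchar(50) NOT NULL
);

CREATE TABLE dbo.Wide (
    Id       int NOT NULL PRIMARY KEY,
    Column32 int NULL,
    Column56 int NULL,
    StatusId int NOT NULL
        CONSTRAINT FK_Wide_Status FOREIGN KEY REFERENCES dbo.Status (StatusId)
);
GO

CREATE VIEW dbo.WideView
AS
SELECT w.Id, w.Column32, w.Column56, s.StatusText
FROM dbo.Wide AS w
INNER JOIN dbo.Status AS s ON s.StatusId = w.StatusId;
GO

-- Show the plan as text instead of executing the query
SET SHOWPLAN_TEXT ON;
GO
SELECT Column32, Column56 FROM dbo.WideView;
GO
SET SHOWPLAN_TEXT OFF;
GO
```

With the constraint in place, the plan would typically reference only dbo.Wide. Dropping the constraint, or making StatusId nullable, and rerunning the last three batches should bring dbo.Status back into the plan, because the join then has to filter rows.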

Simple join takes far too long to run due to query plan

Some context: I have two tables, smalltable and bigtable. Smalltable contains 10,000 rows, whereas bigtable contains 2,000,000, and I am using SQL Server 2008. I started with a query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
This query was running for over 15 minutes before I killed it - on inspecting the query plan, it was using a nested loop to do the inner join.
I then rewrote the query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname)
UNION
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name6=t2.firstname and t1.name1=t2.lastname)
The two queries above then instead executed using a Hash Match, and the whole query ran in 4 seconds.
My question is, why does SQL Server get the query plan so wrong, and was there any way that I could have fixed the original query without rewriting it? I tried adding a hint to use a Hash Match to the first query, but it seems that this is not allowed with multiple join criteria?
Update: Added examples of the kind of data in the tables as requested. Note, the code is looking for name matches where names may have been swapped around:
Smalltable(Columns name1,name6)
John, Smith
Johnny, Smith
Smythe, Jon
Michaels, Robert
Bob, Brown
Bigtable (Columns firstname,lastname)
John, Smith
John, Smythe
Johnny, Smith
Alison, Roberts
Robert, Michaels
Janet, Green
It is a problem within the SQL Server optimizer.
The condition (t1.name1=t2.firstname and t1.name6=t2.lastname) needs only a clustered index seek and is therefore very fast, executing almost instantly even with very large tables.
But the condition with OR
(t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
generally performs much worse, usually resulting in a full scan. You should look at the execution plans for your queries.
The Query Optimizer will always perform a table scan or a clustered
index scan on a table if the WHERE clause in the query contains an OR
operator and if any of the referenced columns in the OR clause are not
indexed (or do not have a useful index). Because of this, if you use
many queries with OR clauses, you will want to ensure that each
referenced column in the WHERE clause has an index.
If you have a query that uses ORs and it is not making the best use
of indexes, consider rewriting it as a UNION and then testing
performance. Only through testing can you be sure that one version of
your query will be faster than another.
See here. In short, your first query, the one with OR, does not make good use of indexes.
I believe there is no way around rewriting the query with UNION (as you did) or APPLY; the optimizer will not do that rewrite for you.
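As a variation on the UNION rewrite, here is a sketch that uses UNION ALL and explicitly excludes the pairs already returned by the first branch. Unlike UNION, it does not collapse rows that happen to be identical, so it returns the same rows as the original OR join. It uses only the table and column names from the question.

```sql
-- Branch 1: names in the same order
SELECT t1.*, t2.*
FROM dbo.smalltable AS t1
INNER JOIN dbo.bigtable AS t2
    ON t1.name1 = t2.firstname AND t1.name6 = t2.lastname

UNION ALL

-- Branch 2: names swapped, skipping pairs that branch 1 already returned
SELECT t1.*, t2.*
FROM dbo.smalltable AS t1
INNER JOIN dbo.bigtable AS t2
    ON t1.name6 = t2.firstname AND t1.name1 = t2.lastname
WHERE NOT (t1.name1 = t2.firstname AND t1.name6 = t2.lastname);
```

Each branch has a plain pair of equality predicates, so the optimizer is free to pick a hash join for it. The original ON clause has an OR at the top level and therefore no single equality that must always hold, which is probably why the hash join hint was rejected.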

Execution of TSQL statement

I am aware of the sequence in which SQL statements are executed, but I still want to confirm a few things with the help of the SQL experts here. I have a big SQL query which returns thousands of rows. Here is a minimized version of the query, which I wrote and believe to be correct.
Select *
from property p
inner join tenant t on (t.hproperty = p.hmy and p.hmy = 7)
inner join commtenant ct on ct.htenant = t.hmyperson
where 1=1
My colleague says that the query above is equivalent, performance-wise, to the query below (he is very confident about it):
Select *
from property p
inner join tenant t on (t.hproperty = p.hmy)
inner join commtenant ct on ct.htenant = t.hmyperson
where p.hmy = 7
Could anybody explain why the two queries are, or are not, equivalent? Thanks.
If you want to know if two queries are equivalent, learn how to look at the execution plans in SQL Server Management Studio. You can put the two queries in different windows, look at the estimated execution plans, and see for yourself if they are the same.
In this case, they probably are the same. SQL is intended to be a descriptive language, not a procedural language. That is, it describes the output you want, but the SQL engine is allowed to rewrite the query to be as efficient as possible. The two forms you have describe the same output. Do note that if there were a left outer join instead of an inner join, then the queries would be different.
In all likelihood, the engine will read the table and filter the records during the read or use an index for the read. The key idea, though, is that the output is the same and SQL Server can recognize this.
"p.hmy = 7" is not a join condition, as it relates only to a single table. As such, it doesn't really belong in the ON clause of the join. Since you are not adding any information by placing the condition in the ON clause, having it in the WHERE clause (in which it really belongs) will not make any difference to the query plan generated. If in doubt, look at the query plans.
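To illustrate the outer join caveat mentioned above, here is a short sketch using the property and tenant tables from the question (the two-column select list is only there to keep the output small). With a left outer join, moving the filter between ON and WHERE changes the result.

```sql
-- Filter in ON: every property row is returned;
-- only property 7 can pick up tenant matches, the rest get NULLs
SELECT p.hmy, t.hmyperson
FROM property AS p
LEFT OUTER JOIN tenant AS t
    ON t.hproperty = p.hmy AND p.hmy = 7;

-- Filter in WHERE: only property 7 is returned
SELECT p.hmy, t.hmyperson
FROM property AS p
LEFT OUTER JOIN tenant AS t
    ON t.hproperty = p.hmy
WHERE p.hmy = 7;
```

With inner joins, as in the question, both placements describe the same set of rows.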

FULL TEXT INDEX - Huge Performance Decrease on Multiple Tables

I've recently been learning something very new to me - FULLTEXT Indexes.
It seems that I can run two separate queries (using CONTAINSTABLE) with the same parameters against two separate tables and get an almost instantaneous answer (sub 10ms); however, when I combine the two, the query takes 1.3 seconds, or 130+ times slower!
Below are the queries (simplified for the purpose of this question).
Query 1:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), @query) FBCONT ON FB.ID = FBCONT.[KEY]
WHERE
FBCONT.[KEY] IS NOT NULL
Query 2:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBalls, (Col1), @query) FBSCONT ON FBS.ID = FBSCONT.[KEY]
WHERE
FBSCONT.[KEY] IS NOT NULL
Query Combined:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), @query) FBCONT ON FB.ID = FBCONT.[KEY]
LEFT JOIN CONTAINSTABLE(dbo.FooBalls, (Col1), @query) FBSCONT ON FBS.ID = FBSCONT.[KEY]
WHERE
(FBCONT.[KEY] IS NOT NULL OR FBSCONT.[KEY] IS NOT NULL)
Perhaps my research has missed something but can someone give me an indicator as to why having both clauses together reduces performance by over 130 times?
NOTES:
I've checked the relevant indexes for joining exist - verified by the speed of the individual queries.
There are actually more joins involved in the process; however, they are completely unrelated to the tables being queried, and again responses are under 10ms when searching for results in 100,000-plus records.
I tried replacing the CONTAINSTABLE with individual CONTAINS statements - performance was massively degraded as my research would lead me to expect.
A catalog has been set up that references ONLY the four columns from the two tables being queried
The @query parameter is set to NVARCHAR(50) at present. I've read that using NVARCHAR is faster, as implicit conversions are not required.
I know I could do a dirty UNION ALL on both queries separately, but I'd prefer to write better queries if possible rather than hack it together. Additionally, UNION ALL would leave me with potential duplicates if the @query value was in two columns from separate tables linked to one record.
Any further suggestions would be gratefully received.
Your question comments suggest you improved performance to a satisfactory level by rewriting an unrelated part of the query (not shown in the question).
This is fair enough if it works, but doesn't explain why the two separate queries and the combined query differ so significantly, when other unrelated parts of the query are kept constant.
It's difficult to say confidently without seeing a query plan and statistics results; however I can think of two possibilities based solely on reasoning about how the SQL queries are written:
One or both of the ID columns (from FooBar and FooBalls) may be non-unique in the row set after these two tables have been inner joined. Joining to two CONTAINSTABLE result sets, rather than one, may thus be "breeding" considerably more records than a single join does; larger result sets take longer to be passed back to the client and displayed. To test this: compare the row counts returned by the two separate queries and by the combined query, and compare these to the row counts of each query when the WHERE clause is omitted. Larger row counts will typically suggest a longer query elapsed time (all other things being equal).
Each of the separate queries has been written with a left outer join, but the result set is then restricted to only include rows where the join has succeeded. This is effectively an inner join: SQL Server's query planner may well be identifying this fact and choosing an execution plan as if an inner join had been specified. Conversely, the combined query requires rows where either join (but not necessarily both) has succeeded, which is a true left join. The execution plan is likely to use different, slower approaches for these joins. To test this: look at the execution plans, and compare them to the execution plans for the separate queries with inner joins requested instead of left joins.
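The two tests described above can be sketched as follows. The search term assigned to @query is a hypothetical placeholder; everything else uses the names from the question.

```sql
DECLARE @query nvarchar(50) = N'"example"';  -- hypothetical search term

-- Test 1a: row count of the plain join, without any full-text restriction
SELECT COUNT(*) AS joined_rows
FROM dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS ON FB.ID = FBS.ID;

-- Test 1b: row count of the combined query
SELECT COUNT(*) AS combined_rows
FROM dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS ON FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), @query) FBCONT ON FB.ID = FBCONT.[KEY]
LEFT JOIN CONTAINSTABLE(dbo.FooBalls, (Col1), @query) FBSCONT ON FBS.ID = FBSCONT.[KEY]
WHERE FBCONT.[KEY] IS NOT NULL OR FBSCONT.[KEY] IS NOT NULL;

-- Test 2: Query 1 written as an explicit inner join;
-- compare its plan with the plan of the original left join version
SELECT *
FROM dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS ON FB.ID = FBS.ID
INNER JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), @query) FBCONT ON FB.ID = FBCONT.[KEY];
```

If the plan for Test 2 matches the plan for the original Query 1, the optimizer was already treating that left join as an inner join, which would support the second explanation.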

is index still effective after data has been selected?

I have two tables that I want to join; both have an index on the column I am joining on.
QUERY 1
SELECT * FROM [A] INNER JOIN [B] ON [A].F = [B].F;
QUERY 2
SELECT * FROM (SELECT * FROM [A]) [A1] INNER JOIN (SELECT * FROM B) [B1] ON [A1].F=[B1].F
The first query will clearly utilize the index, but what about the second one?
After the two select statements in the brackets are executed, the join would occur, but my guess is that the index wouldn't help to speed up the query, because it is pretty much joining new tables.
The query isn't executed quite so literally as you suggest, where the inner queries are executed first and then their results are combined with the outer query. The optimizer will take your query and will look at many possible ways to get your data through various join orders, index usages, etc. etc. and come up with a plan that it feels is optimal enough.
If you execute both queries and look at their respective execution plans, I think you will find that they use the exact same one.
Here's a simple example of the same concept. I created my schema like so:
CREATE TABLE A (id int, value int)
CREATE TABLE B (id int, value int)
INSERT INTO A (id, value)
VALUES (1,900),(2,800),(3,700),(4,600)
INSERT INTO B (id, value)
VALUES (2,800),(3,700),(4,600),(5,500)
CREATE CLUSTERED INDEX IX_A ON A (id)
CREATE CLUSTERED INDEX IX_B ON B (id)
And ran queries like the ones you provided.
SELECT * FROM A INNER JOIN B ON A.id = B.id
SELECT * FROM (SELECT * FROM A) A1 INNER JOIN (SELECT * FROM B) B1 ON A1.id = B1.id
The plans that were generated (the screenshot is not reproduced here) both utilize the index.
Chances are high that the SQL Server Query Optimizer will be able to detect that Query 2 is in fact the same as Query 1 and use the same indexed approach.
Whether this happens depends on a lot of factors: your table design, your table statistics, the complexity of your query, etc. If you want to know for certain, let SQL Server Query Analyzer show you the execution plan. Here are some links to help you get started:
Displaying Graphical Execution Plans
Examining Query Execution Plans
SQL Server uses predicate pushing (a.k.a. predicate pushdown) to move query conditions as far toward the source tables as possible. It doesn't slavishly do things in the order you parenthesize them. The optimizer uses complex rules--what is essentially a kind of geometry--to determine the meaning of your query, and restructure its access to the data as it pleases in order to gain the most performance while still returning the same final set of data that your query logic demands.
When queries become more and more complicated, there is a point where the optimizer cannot exhaustively search all possible execution plans and may end up with something that is suboptimal. However, you can pretty much assume that a simple case like you have presented is going to always be "seen through" and optimized away.
So the answer is that you should get just as good performance as if the two queries were combined. Now, if the values you are joining on are composite, that is they are the result of a computation or concatenation, then you are almost certainly not going to get the predicate push you want that will make the index useful, because the server won't or can't do a seek based on a partial string or after performing reverse arithmetic or something.
May I suggest that in the future, before asking questions like this here, you simply examine the execution plan for yourself to validate that it is using the index? You could have answered your own question with a little experimentation. If you still have questions, then come post, but in the meantime try to do some of your own research as a sign of respect for the people who are helping you.
To see execution plans, in SQL Server Management Studio (2005 and up) or SQL Query Analyzer (SQL 2000) you can just click the "Show Execution Plan" button on the menu bar, run your query, and switch to the tab at the bottom that displays a graphical version of the execution plan. A little poking around and hovering your mouse over various pieces will quickly show you which indexes are being used on which tables.
However, if things aren't as you expect, don't automatically think that the server is making a mistake. It may decide that scanning your main table without using the index costs less--and it will almost always be right. There are many reasons that scanning can be less expensive, one of which is a very small table, another of which is that the number of rows the server statistically guesses it will have to return exceeds a significant portion of the table.
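If you prefer a script to the graphical plan, the same check can be sketched with SET SHOWPLAN_TEXT. The example assumes the small A and B tables with clustered indexes on id, as in the schema posted in another answer to this question. The third query, which joins on computed values, is a hypothetical addition that illustrates the point about computations getting in the way of index seeks.

```sql
-- Return plans as text instead of executing the queries;
-- GO is the batch separator used by SSMS and sqlcmd
SET SHOWPLAN_TEXT ON;
GO

-- Direct join
SELECT * FROM A INNER JOIN B ON A.id = B.id;

-- Same join through derived tables
SELECT *
FROM (SELECT * FROM A) A1
INNER JOIN (SELECT * FROM B) B1 ON A1.id = B1.id;

-- Join on computed values on both sides:
-- neither index can be used to seek on the join key
SELECT * FROM A INNER JOIN B ON A.id * 2 = B.id * 2;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

On tables this small the optimizer may well choose scans for all three queries, so the comparison is more telling on realistic data volumes.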
Both queries are the same. During optimization, the second query will be transformed into the same form as the first one.
However, if you have a specific requirement, I would suggest that you post the whole code. Then it would be much easier to answer your question.
