Can anyone explain this behavior or how to get around it?
If you execute this query:
select *
from TblA
left join freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
inner join DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
It will be very very very slow.
If you change that query to use two inner joins instead of a left join, it will be very fast. If you change it to use two left joins instead of an inner join, it will be very fast.
You can observe this same behavior if you use a sql table variable instead of the freetexttable as well.
The performance problem arises any time you have a table variable (or freetexttable) and a table in a different database catalog where one is in an inner join and the other is in a left join.
Does anyone know why this is slow, or how to speed it up?
A general rule of thumb is that OUTER JOINs cause the number of rows in a result set to increase, while INNER JOINs cause the number of rows in a result set to decrease. Of course, there are plenty of scenarios where the opposite is true as well, but it's more likely to work this way than not. What you want to do for performance is keep the size of the result set (working set) as small as possible for as long as possible.
Since both joins match on the first table, changing up the order won't affect the accuracy of the results. Therefore, you probably want to do the INNER JOIN before the LEFT JOIN:
SELECT *
FROM TblA
INNER JOIN DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
LEFT JOIN freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
As a practical matter, the query optimizer should be smart enough to compile to use the faster option, regardless of which order you specified for the joins. However, it's good practice to pretend that you have a dumb query optimizer, and that query operations happen in order. This helps future maintainers spot potential errors or assumptions about the nature of the tables.
Because the optimizer should re-write things, this probably isn't good enough to fully explain the behavior you're seeing, so you'll still want to examine the execution plan used for each query, and probably add an index as suggested earlier. This is still a good principle to learn, though.
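As a rough sketch of how the two plans could be captured side by side from a query window: the table names are the ones from the question, while the aliases ft and c are added here only for illustration.

```sql
-- Return the actual execution plan (as XML) together with each result set.
SET STATISTICS XML ON;

-- Original order: LEFT JOIN first, then INNER JOIN.
SELECT *
FROM TblA
LEFT JOIN FREETEXTTABLE(TblB, *, 'query') AS ft ON TblA.ID = ft.[KEY]
INNER JOIN DifferentDbCatalog.dbo.TblC AS c ON TblA.ID = c.TblAID;

-- Reordered: INNER JOIN first, then LEFT JOIN.
SELECT *
FROM TblA
INNER JOIN DifferentDbCatalog.dbo.TblC AS c ON TblA.ID = c.TblAID
LEFT JOIN FREETEXTTABLE(TblB, *, 'query') AS ft ON TblA.ID = ft.[KEY];

SET STATISTICS XML OFF;
```

If both statements produce the same plan, the join order in the text is not what is causing the slowdown.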
What you should usually do is turn on the "Include Actual Execution Plan" option and then take a close look at what is causing the slowdown (hover your mouse over each join to see the details). You'll want to make sure that you are getting an index seek and not a table scan.
I would assume what is happening is that SQL is being forced to pull everything from one table into memory in order to do one of the joins. Sometimes reversing the order that you join the tables will also help things.
Putting freetexttable(TblB, *, 'query') into a temp table may help if it's getting called repeatedly in the execution plan.
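A minimal sketch of the temp table idea, assuming the tables from the question. The temp table name #Matches and the alias m are made up for this example.

```sql
-- Run the full-text search once and keep only what the join needs.
SELECT [KEY], [RANK]
INTO #Matches
FROM FREETEXTTABLE(TblB, *, 'query');

-- Join against the materialized matches instead of the function call.
SELECT *
FROM TblA
INNER JOIN DifferentDbCatalog.dbo.TblC ON TblA.ID = TblC.TblAID
LEFT JOIN #Matches AS m ON TblA.ID = m.[KEY];

DROP TABLE #Matches;
```

Unlike a table variable, a temp table has statistics, which may give the optimizer a better row estimate for the join.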
Index the field you use to perform the join.
A good rule of thumb is to assign an index to any commonly referenced foreign or candidate keys.
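For the query in the question, that could look something like the following. The index name is hypothetical, and this assumes TblC does not already have an index that leads with TblAID.

```sql
-- Index the foreign key column used in the join to TblA.
USE DifferentDbCatalog;

CREATE NONCLUSTERED INDEX IX_TblC_TblAID
    ON dbo.TblC (TblAID);
```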
I've always thought of cross applies as a different way of doing an inner join. I've been having to rewrite a bunch of code because my senior is convinced that cross applies aren't ANSI supported and also use row-by-row processing.
I get that an inner join is more intuitive. I also understand that I should not use a cross apply if the same thing can be accomplished with an inner join. It's just that sometimes I try cross applies before inner joins. I've looked at the IO statistics for cases where I can switch cross apply to inner join and there are no differences.
My Questions then:
1. Does cross apply use row by row processing?
2. Should cross applies be regarded and treated like cursors (i.e., performance hogs)?
3. Is cross apply ANSI supported?
4. What are the best real life examples of when to use and avoid cross applies?
Does cross apply use row by row processing?
Sometimes. So do many regular joins. Depends on the query. Show actual query plan in SSMS and you can see what it is doing. Often times you will see that the CROSS APPLY and the equivalent traditional joins use the same query plan. Sometimes CROSS APPLY will be faster. Sometimes the JOIN will be faster. Depends on the data, indexes, statistics, etc.
Should cross applies be regarded and treated like cursors (i.e., performance hogs)?
No. They are not like cursors. If not optimized by the query optimizer, they are like LOOP JOINS. But they might be performance hogs. Just like any other query.
Is cross apply ANSI supported?
I don't think so, but I am not certain.
What are the best real life examples of when to use and avoid cross applies?
If you have a query that returns a lot of rows in the outer part of the query, you might consider joining to a subquery rather than using a CROSS APPLY, anticipating that SQL Server will do a HASH JOIN on the two queries. However, if SQL Server does a LOOP JOIN you will likely end up with the same query plan as the CROSS APPLY. If you have a query with few rows in the outer and you want to look up values in another table just based on those few, then you might favor the CROSS APPLY, though SQL Server may choose the LOOP JOIN for you anyway.
As a general rule, you shouldn't use JOIN hints unless you have a darn good reason to do so. Similarly, I wouldn't fret over using CROSS APPLY vs a join to a sub-query based solely on performance. Choose the one that makes the most sense in fetching your data, and let SQL Server figure out the best way to execute it. If it runs slowly for a particular query, then think about changing it to the other approach or providing join hints.
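To make the "few outer rows, look up values per row" case concrete, here is a sketch that returns the three most recent orders per customer both ways. The tables dbo.Customers and dbo.Orders and their columns are hypothetical, not from the question.

```sql
-- CROSS APPLY: the inner query is expressed per outer row.
SELECT c.CustomerID, o.OrderID, o.OrderDate
FROM dbo.Customers AS c
CROSS APPLY (
    SELECT TOP (3) o2.OrderID, o2.OrderDate
    FROM dbo.Orders AS o2
    WHERE o2.CustomerID = c.CustomerID
    ORDER BY o2.OrderDate DESC
) AS o;

-- Join to a derived table: rank all orders once, then filter.
SELECT c.CustomerID, o.OrderID, o.OrderDate
FROM dbo.Customers AS c
INNER JOIN (
    SELECT OrderID, OrderDate, CustomerID,
           ROW_NUMBER() OVER (PARTITION BY CustomerID
                              ORDER BY OrderDate DESC) AS rn
    FROM dbo.Orders
) AS o
    ON o.CustomerID = c.CustomerID
   AND o.rn <= 3;
```

Both return the same rows apart from how ties on OrderDate are broken. Which one runs faster depends on the data and indexes, so compare the actual plans rather than assuming.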
I have a large query with many joins that I am trying to tune, and one warning sign is that there are many, many hash joins being used throughout. I went down to the base of the query tree and went to the first join there, which is an inner join.
Table A is using a clustered index scan to retrieve its data, which is sorted on the join column.
Table B is using a nonclustered index scan, which is also sorted on the join column.
When I join just these two tables in isolation, and select the same set of columns, the optimizer uses a merge join. The sets being joined are approximately the same size, and not very large (<5,000 rows).
What could explain the optimizer choosing the hash join over the merge join in this case?
EDIT
As requested, I have added a few more details. The index definitions are:
CREATE NONCLUSTERED INDEX NCL_Asset_Issuer_MergeInduce ON Asset.IssuerCompanyId (CompanyId) INCLUDE (IsPrivate,HasPublicEquity,Ticker,FinancialTemplateID,BondTicker,SICOther1ID,SICOther4ID,SICSandPID,SICOther3ID,SICMoodyID,CurrencyTypeID,SecondaryAnalystID,AnalystID,SICOshaID,SecondaryBondTicker,FiscalYearEnd,EquityExchangeID);
CREATE NONCLUSTERED INDEX NCL_Asset_IssuerCustom_IssuerId ON Asset.IssuerCustom (IssuerID) INCLUDE (Text3,ListItem1ID,ListItem5ID,ListItem3ID,ListItem2ID,Version,text4,TextLong15,ListItem6ID);
The following query will return a merge join, as I mentioned earlier:
SELECT IsPrivate,HasPublicEquity,Ticker,FinancialTemplateID,BondTicker,SICOther1ID,SICOther4ID,SICSandPID,SICOther3ID,SICMoodyID,CurrencyTypeID,SecondaryAnalystID,AnalystID,SICOshaID,SecondaryBondTicker,FiscalYearEnd,EquityExchangeID,ic.ListItem2Id,ic.ListItem3ID,ic.IssuerId
FROM Asset.Issuer i
INNER JOIN Asset.IssuerCustom ic ON i.CompanyId = ic.IssuerId;
As you can see, the query is using both the indices above. On the other hand, this same join occurs in a much larger query, and the below image shows the corner of the plan, where this join is occurring as a hash join:
The one difference that I can see is that there is a reversal in terms of which table is the "inner" table vs which is the "outer" table. Still, why would this impact the execution plan if both queries are inner joins on the same column?
The SQL Server Query optimiser does not guarantee the optimum query. It looks for the best query in the time that it sets itself. As queries become large, with multiple joins, the number of different combinations grows exponentially and it becomes impossible to explore every possible path and therefore guarantee an optimum solution.
Usually, you should be able to trust the query optimiser to do a good job if your table design is sound (with appropriate indices) and statistics are up to date.
The different choice of join may be due to different resources being available (CPU and memory), considering other parts of the plan being executed in parallel.
If you want to investigate further, I would test running with join hints to find out if the execution plan has made the best decision. It may also be worth testing with the query hint OPTION (MAXDOP 1) to find out if parallel execution affects the choices made by the optimiser.
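A sketch of what such a test might look like on the two-table join from the question (the column list is shortened here for readability). These hints are meant for experimenting, not for leaving in production code.

```sql
-- Force a merge join for this one join and disable parallelism.
-- Note: a join hint in the FROM clause also fixes the join order
-- for the whole query.
SELECT i.CompanyId, ic.IssuerId, ic.ListItem2Id, ic.ListItem3ID
FROM Asset.Issuer AS i
INNER MERGE JOIN Asset.IssuerCustom AS ic
    ON i.CompanyId = ic.IssuerId
OPTION (MAXDOP 1);

-- Alternative: ask for merge joins throughout the query
-- without pinning the join order in the FROM clause.
SELECT i.CompanyId, ic.IssuerId, ic.ListItem2Id, ic.ListItem3ID
FROM Asset.Issuer AS i
INNER JOIN Asset.IssuerCustom AS ic
    ON i.CompanyId = ic.IssuerId
OPTION (MERGE JOIN, MAXDOP 1);
```

Comparing the estimated cost and actual runtime of the hinted plan against the unhinted one shows whether the optimiser's hash join was really the worse choice.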
Which one gives better performance on a large set of records?
select name from tablea,tableb where tablea.id = tableb.id
Or
select name from tablea inner join tableb on tablea.id = tableb.id
Here I have given a simple example, but in our project we use a lot of tables and joins to fetch records. In this more complicated case, which one will give higher performance, links or joins?
Neither. You should always use ANSI joins, in any case.
There are more than just style considerations for using ANSI joins:
In future versions of SQL Server, non-standard joins may not be supported.
You cannot perform all types of joins using old-style syntax. For example, the following ANSI-style query cannot be expressed that way:
SELECT *
FROM
dbo.Customer C
LEFT JOIN dbo.Address A
ON C.CustomerID = A.CustomerID
AND A.AddressType = 'Home' -- this can't be done with non-ANSI
WHERE
A.CustomerID IS NULL
;
Using JOIN expresses the intent of the query more clearly to the developer, by separating conditions that are specific to the mechanics of how tables always relate to each other in the JOIN clause, and putting conditions that are specific to the needs of this query in the WHERE clause. Putting all the conditions, joining and filtering, in the WHERE clause clutters the WHERE clause and makes it exceedingly hard to understand complex queries.
Using JOIN allows the conditions that join a table to be physically near where that table is introduced in the script, reducing the likelihood of making mistakes and further aiding comprehension.
I will state absolutely and categorically that you will gain no performance benefit from using old-style joins instead of ANSI joins. If a query using ANSI joins is too complex for the optimizer to do a good job, it is only a matter of chance whether the old-style join will work better or worse. The reverse is exactly as true, that if a query using old-style joins is too complex for the optimizer to do a good job, then it is only a matter of chance whether an ANSI join will work better or worse.
The correct solution to queries that are not performing well is for an expert to study the query, the tables, and the underlying indexes and make recommendations. The query may benefit from being separated into two (first inserting to a temp table, then joining to that). There may be missing indexes. Data types may be chosen poorly. A clustered index may need to move from the PK to another column. The conditions on the query might be transformable to SARGable ones. Date functions on newly-introduced columns might be eliminated in favor of date inequality conditions against pre-calculable expressions using constants or earlier-introduced columns. There may be denormalization that is actually hurting performance.
There are a host of factors that can affect performance, and I guarantee you with every shred of conviction I possess that going back to old-style joins will never, ever be the answer.
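To illustrate just one of those recommendations, here is a sketch of turning a date function on a column into a SARGable range condition. The table dbo.Orders and its OrderDate column are hypothetical.

```sql
-- Not SARGable: the function wraps the column, so an index on
-- OrderDate generally cannot be used for a seek.
SELECT OrderID
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2009;

-- SARGable: the bare column is compared against constants,
-- so an index seek on OrderDate becomes possible.
SELECT OrderID
FROM dbo.Orders
WHERE OrderDate >= '20090101'
  AND OrderDate <  '20100101';
```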
In that simplified example, the performance should be exactly the same. If you run Query Analyzer on the two options, you'll see that the optimizer will translate your WHERE into a JOIN in any case.
You might be able to write a query complex enough to confound the optimizer, though, so stick with JOINs.
I find myself unwilling to push to using JOIN when I can easily solve the same problem by using an inner query:
e.g.
SELECT COLUMN1, ( SELECT COLUMN1 FROM TABLE2 WHERE TABLE2.ID = TABLE1.TABLE2ID ) AS COLUMN2 FROM TABLE1;
My question is, is this a bad programming practice? I find it easier to read and maintain as opposed to a join.
UPDATE
I want to add that there's some great feedback in here which in essence is pushing me back to using JOIN. I am finding myself less and less involved with using TSQL directly these days as a result of ORM solutions (LINQ to SQL, NHibernate, etc.), but when I do it's things like correlated subqueries which I find are easier to type out linearly.
Personally, I find this incredibly difficult to read. It isn't the structure a SQL developer expects. By using JOIN, you are keeping all of your table sources in a single spot instead of spreading it throughout your query.
What happens if you need to have three or four joins? Putting all of those into the SELECT clause is going to get hairy.
A join is usually faster than a correlated subquery as it acts on the set of rows rather than one row at a time. I would never let this code go to my production server.
And I find a join much much easier to read and maintain.
If you needed more than one column from the second table, then you would require two subqueries. This typically would not perform as well as a join.
This is not equivalent to JOIN.
If you have multiple rows in TABLE2 for each row in TABLE1, you won't get them.
For each row in TABLE1 you get one row output so you can't get multiple from TABLE2.
This is why I'd use "JOIN": to make sure I get the data I wanted...
After your update: I rarely use correlation except with EXISTS...
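For comparison, here is the query from the question next to its closest join form. The aliases t1 and t2 are added for this example. The closest match is a LEFT JOIN, not an INNER JOIN, because the subquery version keeps TABLE1 rows that have no match in TABLE2.

```sql
-- Scalar subquery: exactly one output row per TABLE1 row.
-- In SQL Server this raises error 512 at run time if the subquery
-- returns more than one row for some TABLE1 row.
SELECT COLUMN1,
       (SELECT COLUMN1
        FROM TABLE2
        WHERE TABLE2.ID = TABLE1.TABLE2ID) AS COLUMN2
FROM TABLE1;

-- LEFT JOIN: unmatched TABLE1 rows still appear with NULL,
-- but multiple matches in TABLE2 produce multiple output rows.
SELECT t1.COLUMN1,
       t2.COLUMN1 AS COLUMN2
FROM TABLE1 AS t1
LEFT JOIN TABLE2 AS t2
    ON t2.ID = t1.TABLE2ID;
```

The two return the same rows only when TABLE2.ID is unique.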
The query you use was often used as a replacement for a LEFT JOIN for the engines that lacked it (most notably, PostgreSQL before 7.2)
This approach has some serious drawbacks:
It may fail if TABLE2.ID is not UNIQUE
Some engines will not be able to use anything other than NESTED LOOPS for this query
If you need to select more than one column, you will need to write the subquery several times
If your engine supports LEFT JOIN, use the LEFT JOIN.
In MySQL, however, there are some cases when an aggregate function in a select-level subquery can be more efficient than that in a LEFT JOIN with a GROUP BY.
See this article in my blog for the examples:
Aggregates: subqueries vs. GROUP BY
This is not a bad programming practice at all IMO, though it is a little bit ugly. It can actually be a performance boost in situations where the sub-select is from a very large table while you are expecting a very small result set (you have to consider indexes and platform; SQL Server 2000 has a different optimizer from 2005, for example). Here is how I format it to be easier to read:
select
column1,
[column2] = (subselect...)
from
table1
Edit:
This of course assumes that your subselect will only return one value; if not, it could return bad results. See gbn's response.
Using a JOIN also makes it a lot easier to use other types of joins (left outer, cross, etc.) because the syntax for those in subquery terms is less than ideal for readability.
At the end of the day, the goal when writing code, beyond functional requirements, is to make the intent of your code clear to a reader. If you use a JOIN, the intent is obvious. If you use a subquery in the manner you describe, it begs the question of why you did it that way. What were you trying to achieve that a JOIN would not have accomplished? In short, you waste the reader's time in trying to determine if the author was solving some problem in an ingenious fashion or if they were writing the code after a hard night of drinking.
I'm curious which of the following would be more efficient?
I've always been a bit cautious about using IN because I believe SQL Server turns the result set into a big IF statement. For a large result set, this could result in poor performance. For small result sets, I'm not sure either is preferable. For large result sets, wouldn't EXISTS be more efficient?
WHERE EXISTS (SELECT * FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
vs.
WHERE bx.BoxID IN (SELECT BoxID FROM Base WHERE [Rank] = 2)
EXISTS will be faster because once the engine has found a hit, it will quit looking as the condition has proved true.
With IN, it will collect all the results from the sub-query before further processing.
The accepted answer is shortsighted and the question a bit loose in that:
1) Neither explicitly mentions whether a covering index is present on the left, right, or both sides.
2) Neither takes into account the size of the left-side input set and the right-side input set. (The question just mentions an overall large result set.)
I believe the optimizer is smart enough to convert between "in" vs "exists" when there is a significant cost difference due to (1) and (2), otherwise it may just be used as a hint (e.g. exists to encourage use of a seekable index on the right side).
Both forms can be converted to join forms internally, have the join order reversed, and run as loop, hash or merge--based on the estimated row counts (left and right) and index existence in left, right, or both sides.
I've done some testing on SQL Server 2005 and 2008, and on both the EXISTS and the IN come back with the exact same actual execution plan, as others have stated. The Optimizer is optimal. :)
Something to be aware of though, EXISTS, IN, and JOIN can sometimes return different results if you don't phrase your query just right: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
I'd go with EXISTS over IN, see below link:
SQL Server: JOIN vs IN vs EXISTS - the logical difference
There is a common misconception that IN behaves equally to EXISTS or JOIN in terms of returned results. This is simply not true.
IN: Returns true if a specified value matches any value in a subquery or a list.
Exists: Returns true if a subquery contains any rows.
Join: Joins 2 resultsets on the joining column.
Blog credit: https://stackoverflow.com/users/31345/mladen-prajdic
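A small self-contained sketch of those logical differences, using two made-up table variables (the multi-row VALUES syntax needs SQL Server 2008 or later). The row counts in the comments follow from duplicate and NULL handling, not from any particular plan.

```sql
DECLARE @t1 TABLE (id INT NOT NULL);
DECLARE @t2 TABLE (id INT NULL);

INSERT INTO @t1 (id) VALUES (1), (2), (3);
INSERT INTO @t2 (id) VALUES (1), (1), (NULL);

-- IN: returns id 1 once; duplicates in @t2 do not multiply rows.
SELECT id FROM @t1 WHERE id IN (SELECT id FROM @t2);

-- EXISTS: also returns id 1 once.
SELECT t1.id FROM @t1 AS t1
WHERE EXISTS (SELECT 1 FROM @t2 AS t2 WHERE t2.id = t1.id);

-- JOIN: returns id 1 twice, once per matching row in @t2.
SELECT t1.id FROM @t1 AS t1
INNER JOIN @t2 AS t2 ON t2.id = t1.id;

-- NOT IN: returns no rows, because the NULL in @t2 makes the
-- comparison UNKNOWN for every id that is not matched.
SELECT id FROM @t1 WHERE id NOT IN (SELECT id FROM @t2);

-- NOT EXISTS: returns ids 2 and 3.
SELECT t1.id FROM @t1 AS t1
WHERE NOT EXISTS (SELECT 1 FROM @t2 AS t2 WHERE t2.id = t1.id);
```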
There are many misleading answers here, including the highly upvoted one (although I don't believe their authors meant harm). The short answer is: These are the same.
There are many keywords in the (T-)SQL language, but in the end, the only thing that really happens on the hardware is the operations as seen in the execution query plan.
The relational (maths theory) operation we do when we invoke [NOT] IN and [NOT] EXISTS is the semi join (anti-join when using NOT). It is not a coincidence that the corresponding SQL Server operations have the same name. There is no operation that mentions IN or EXISTS anywhere, only (anti-)semi joins. Thus, there is no way that a logically-equivalent IN vs EXISTS choice could affect performance, because there is one and only one way, the (anti-)semi join execution operation, to get their results.
An example:
Query 1 ( plan )
select * from dt where dt.customer in (select c.code from customer c where c.active=0)
Query 2 ( plan )
select * from dt where exists (select 1 from customer c where c.code=dt.customer and c.active=0)
The execution plans are typically going to be identical in these cases, but until you see how the optimizer factors in all the other aspects of indexes etc., you really will never know.
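One way to check this on your own data is to run both forms with I/O and timing statistics switched on and compare the output in the Messages tab. This sketch reuses the dt and customer tables from the example above.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Query 1: IN
SELECT *
FROM dt
WHERE dt.customer IN (SELECT c.code FROM customer AS c WHERE c.active = 0);

-- Query 2: EXISTS
SELECT *
FROM dt
WHERE EXISTS (SELECT 1 FROM customer AS c
              WHERE c.code = dt.customer AND c.active = 0);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

If the logical reads and the plan shapes match, the choice between the two is purely a matter of readability.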
So, IN is not the same as EXISTS, nor will it produce the same execution plan.
Usually EXISTS is used in a correlated subquery, that means you will JOIN the EXISTS inner query with your outer query. That will add more steps to produce a result as you need to solve the outer query joins and the inner query joins then match their where clauses to join both.
Usually IN is used without correlating the inner query with the outer query, and that can be solved in only one step (in the best case scenario).
Consider this:
If you use IN and the inner query result is millions of rows of distinct values, it will probably perform SLOWER than EXISTS given that the EXISTS query is performant (has the right indexes to join with the outer query).
If you use EXISTS and the join with your outer query is complex (takes more time to perform, no suitable indexes) it will slow the query by the number of rows in the outer table, sometimes the estimated time to complete can be in days. If the number of rows is acceptable for your given hardware, or the cardinality of data is correct (for example fewer DISTINCT values in a large data set) IN can perform faster than EXISTS.
All of the above will be noted when you have a fair amount of rows on each table (by fair I mean something that exceeds your CPU processing and/or ram thresholds for caching).
So the ANSWER is it DEPENDS. You can write a complex query inside IN or EXISTS, but as a rule of thumb, you should try to use IN with a limited set of distinct values and EXISTS when you have a lot of rows with a lot of distinct values.
The trick is to limit the number of rows to be scanned.
Regards,
MarianoC
To optimize the EXISTS, be very literal; something just has to be there, but you don't actually need any data returned from the correlated sub-query. You're just evaluating a Boolean condition.
So:
WHERE EXISTS (SELECT TOP 1 1 FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
Because the correlated sub-query is RBAR, the first result hit makes the condition true, and it is processed no further.
I know that this is a very old question but I think my answer would add some tips.
I just came across a blog post on mssqltips, SQL EXISTS vs IN vs JOIN, and it turns out that they are generally the same performance-wise.
But the downsides of one vs the other are as follows:
The IN statement has the downside that it can only compare the two tables on one column.
The JOIN statement will return duplicate rows when the joined table contains duplicate values, while IN and EXISTS will ignore duplicates.
But when you look at the execution time there is no big difference.
The interesting thing is when you create an index on the table, the execution from the join is better.
And I think that join has another upside that it's easier to write and understand especially for newcomers.
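To illustrate the single-column limitation of IN in T-SQL, here is a sketch of matching on two columns with EXISTS instead. The tables dbo.Orders and dbo.Shipments and their columns are hypothetical.

```sql
-- IN can only test one expression against the subquery's single column.
-- EXISTS can correlate on as many columns as needed.
SELECT o.OrderID
FROM dbo.Orders AS o
WHERE EXISTS (
    SELECT 1
    FROM dbo.Shipments AS s
    WHERE s.OrderID     = o.OrderID
      AND s.WarehouseID = o.WarehouseID
);
```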
Off the top of my head and not guaranteed to be correct: I believe the second will be faster in this case.
In the first, the correlated subquery will likely cause the subquery to be run for each row.
In the second example, the subquery should only run once, since not correlated.
In the second example, the IN will short-circuit as soon as it finds a match.