I've always thought of cross applies as a different way of doing an inner join. I've been having to re-write a bunch of code because my SR. is convinced that cross applies aren't ansi supported and also use row by row processing.
I get that an inner join is more intuitive. I also understand that I should not use a cross apply if the same thing can be accomplished with an inner join. It's just that some times I try cross applies before inner joins. I've looked at the IO statistics for cases where I can switch cross apply to inner join and they're are no differences.
My Questions then:
1. Does cross apply use row by row processing?
2. Should cross applies be regarded and treated like cursor's? (I.e performance hogs)
3. Is cross apply ansi supported?
4. What are the best real life examples of when to use and avoid cross applies?
Does cross apply use row by row processing?
Sometimes. So do many regular joins. Depends on the query. Show actual query plan in SSMS and you can see what it is doing. Often times you will see that the CROSS APPLY and the equivalent traditional joins use the same query plan. Sometimes CROSS APPLY will be faster. Sometimes the JOIN will be faster. Depends on the data, indexes, statistics, etc.
Should cross applies be regarded and treated like cursor's? (I.e performance hogs)
No. They are not like cursors. If not optimized by the query optimizer, they are like LOOP JOINS. But they might be performance hogs. Just like any other query.
Is cross apply ansi supported?
I don't think so, but I am not certain
What are the best real life examples of when to use and avoid cross applies?
If you have a query that returns a lot of rows in the outer part of the query, you might consider joining to a subquery rather than using a CROSS APPLY, anticipating that SQL Server will do a HASH JOIN on the two queries. However, if SQL Server does a LOOP JOIN you will likely end up with the same query plan as the CROSS APPLY. If you have a query with few rows in the outer and you want to look up values in another table just based on those few, then you might favor the CROSS APPLY, though SQL Server may choose the LOOP JOIN for you anyway.
As a general rule, you shouldn't use JOIN hints unless you have a darn good reason to do so. Similarly, I wouldn't fret over using CROSS APPLY vs a join to a sub-query based solely on performance. Choose the one that makes the most sense in fetching your data, and let SQL Server figure out the best way to execute it. If it runs slowly for a particular query, then think about changing it to the other approach or providing join hints.
Related
This is the description from Microsoft TechNet explaining Trace Flag 342 (emphasis added):
Disables the costing of pseudo-merge joins, thus significantly
reducing time spent on the parse for certain types of large,
multi-table joins. One can also use SET FORCEPLAN ON to disable the
costing of pseudo-merge joins because the query is forced to use the
order specified in the FROM clause.
Does any of you know what is a pseudo-merge join? As far as I know, SQL Server has 3 Join Algorithms (Nest Loop Join, Merge Join, and Hash Join - which encompass Bitmap Join). So what is a pseudo-merge join, and what is the difference between it and a regular Merge Join or any other join for that matter?
I know this is a kind of old question but I will try to answer it as specific as I can.
Pseudo-merge is not a type of Join used as a T-SQL language operator, my interpretation of Microsoft's explanation that using the Trace Flag 342 is as folows:
Disables the costing of pseudo-merge joins, thus significantly
reducing time spent on the parse for certain types of large,
multi-table joins.
Pseudo-merge is the concept to represent that the query optimiser is trying to calculate a better query execution plan, trying to obtain the best way to join the several tables.
One can also use SET FORCEPLAN ON to disable the costing of
pseudo-merge joins because the query is forced to use the order
specified in the FROM clause.
This option prevents the optimizer from trying to calculate and simply execute the joins as they are listed in the query.
An article on SET FORCEPLAN ON for reference.
Which one gives better performance on a large set of records???
select name from tablea,tableb where tablea.id = tableb.id
Or
select name from tablea inner join tableb on tablea.id = tableb.id
Here I have given a simple example, but in our project we use a lot of tables and joins to fetch records. In this more complicated case, which one wil give higher performance, links or joins?
Neither. You should always use ANSI joins, in any case.
There are more than just style considerations for using ANSI joins:
In future versions of SQL Server, non-standard joins may not be supported.
You cannot perform all types of joins using old-style syntax. For example, the following ANSI-style query:
SELECT
FROM
dbo.Customer C
LEFT JOIN dbo.Address A
ON C.CustomerID = A.CustomerID
AND A.AddressType = 'Home' -- this can't be done with non-ANSI
WHERE
A.CustomerID IS NULL
;
Using JOIN expresses the intent of the query more clearly to the developer, by separating conditions that are specific to the mechanics of how tables always relate to each other in the JOIN clause, and putting conditions that are specific to the needs of this query in the WHERE clause. Putting all the conditions, joining and filtering, in the WHERE clause clutters the WHERE clause and makes it exceedingly hard to understand complex queries.
Using JOIN allows the conditions that join a table to be physically near where that table is introduced in the script, reducing the likelihood of making mistakes and further aiding comprehension.
I will state absolutely and categorically that you will gain no performance benefit from using old-style joins instead of ANSI joins. If a query using ANSI joins is too complex for the optimizer to do a good job, it is only a matter of chance whether the old-style join will work better or worse. The reverse is exactly as true, that if a query using old-style joins is too complex for the optimizer to do a good job, then it is only a matter of chance whether an ANSI join will work better or worse.
The correct solution to queries that are not performing well is for an expert to study the query, the tables, and the underlying indexes and make recommendations. The query may benefit from being separated into two (first inserting to a temp table, then joining to that). There may be missing indexes. Data types may be chosen poorly. A clustered index may need to move from the PK to another column. The conditions on the query might be transformable to SARGable ones. Date functions on newly-introduced columns might be eliminated in favor of date inequality conditions against pre-calculable expressions using constants or earlier-introduced columns. There may be denormalization that is actually hurting performance.
There are a host of factors that can affect performance, and I guarantee you with every shred of conviction I possess that going back to old-style joins will never, ever be the answer.
In that simplified example, the performance should be exactly the same. If you run Query Analyzer on the two options, you'll see that the optimizer will translate your WHERE into a JOIN in any case.
You might be able to write a query complex enough to confound the optimizer, though, so stick with JOINs.
I am creating a report, for that I have written 2 different type of query. But I am seeing a huge performance difference between these 2 methods. What may be the reason? My main table (suppose table A) contain a date column. I am filtering the data based on date. Around 10 table join I have to do with this table.
First method:
select A.id,A1.name,...
from table A
join A1
join A2 ....A10
where A.Cdate >= #date
and A.Cdate <= #date
Second method:
With CTE as(select A.id from A where A.Cdate>=#date and A.Cdate<=#date)
select CTE.id, A1.name,.... from CTE join A1 join A2....A10
Here second method is fast. What is the reason? In first method, the filtered data of A only will be join with other tables data right?
Execution plans will tell us for sure, but likely if the CTE is able to filter out a lot of rows before the join, a more optimal join approach may have been chosen (for example merge instead of hash). SQL isn't supposed to work that way - in theory those two should execute the same way. But in practice we find that SQL Server's optimizer isn't perfect and can be swayed in different ways based on a variety of factors, including statistics, selectivity, parallelism, a pre-existing version of that plan in the cache, etc.
One suggestion.You should be able to answer this question yourself as you have both the plans. Did you compare the two plans? Are those similar? Also, when performance is bad what do you mean ,is it time or cpu time or IO's or what did you compare?
Thus before you post any question you should check these counters and I am sure they will provide with some kind of answers in most of cases.
CTE is for managing the code it wont improve the performance of query automatically. CTE's will be expanded by optimizer and thus in your case these both should have same queries after transformation or expansion and thus similar plans.
CTE is basically for the handling the more complex code(be it recursive or having subquery), but you need to check the execution plan regarding the improvement of the 2 different queries.
You can check for the use of CTE at : http://msdn.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
Just a note, in many scenarios, temp tables gives better performance then CTE also, so you should give a try to temp tables as well.
Reference : http://social.msdn.microsoft.com/Forums/en/transactsql/thread/d040d19d-016e-4a21-bf44-a0359fb3c7fb
I find myself unwilling to push to using JOIN when I can easily solve the same problem by using an inner query:
e.g.
SELECT COLUMN1, ( SELECT COLUMN1 FROM TABLE2 WHERE TABLE2.ID = TABLE1.TABLE2ID ) AS COLUMN2 FROM TABLE1;
My question is, is this a bad programming practice? I find it easier to read and maintain as opposed to a join.
UPDATE
I want to add that there's some great feedback in here which in essence is pushing be back to using JOIN. I am finding myself less and less involved with using TSQL directly these days as of a result of ORM solutions (LINQ to SQL, NHibernate, etc.), but when I do it's things like correlated subqueries which I find are easier to type out linearly.
Personally, I find this incredibly difficult to read. It isn't the structure a SQL developer expects. By using JOIN, you are keeping all of your table sources in a single spot instead of spreading it throughout your query.
What happens if you need to have three or four joins? Putting all of those into the SELECT clause is going to get hairy.
A join is usually faster than a correlated subquery as it acts on the set of rows rather than one row at a time. I would never let this code go to my production server.
And I find a join much much easier to read and maintain.
If you needed more than one column from the second table, then you would require two subqueries. This typically would not perform as well as a join.
This is not equivalent to JOIN.
If you have multiple rows in TABLE2 for each row in TABLE1, you won't get them.
For each row in TABLE1 you get one row output so you can't get multiple from TABLE2.
This is why I'd use "JOIN": to make sure I get the data I wanted...
After your update: I rarely use correlation except with EXISTS...
The query you use was often used as a replacement for a LEFT JOIN for the engines that lacked it (most notably, PostgreSQL before 7.2)
This approach has some serious drawbacks:
It may fail if TABLE2.ID is not UNIQUE
Some engines will not be able to use anything else than NESTED LOOPS for this query
If you need to select more than one column, you will need to write the subquery several times
If your engine supports LEFT JOIN, use the LEFT JOIN.
In MySQL, however, there are some cases when an aggregate function in a select-level subquery can be more efficient than that in a LEFT JOIN with a GROUP BY.
See this article in my blog for the examples:
Aggregates: subqueries vs. GROUP BY
This is not a bad programming practice at all IMO, it is a little bit ugly though. It can actually be a performance boost in situations where the sub-select is from a very large table while you are expecting a very small result set (you have to consider indexes and platform, 2000 having a different optimizer and all from 2005). Here is how I format it to be easier to read.
select
column1
[column2] = (subselect...)
from
table1
Edit:
This of course assumes that your subselect will only return one value, if not it could be returning you bad results. See gbn's response.
it makes it a lot easier to use other types of joins (left outer, cross, etc) because the syntax for those in subquery terms is less than ideal for readability
At the end of the day, the goal when writing code, beyond functional requirements, is to make the intent of your code clear to a reader. If you use a JOIN, the intent is obvious. If you use a subquery in the manner you describe, it begs the question of why you did it that way. What were you trying to achieve that a JOIN would not have accomplished? In short, you waste the reader's time in trying to determine if the author was solving some problem in an ingenious fashion or if they were writing the code after a hard night of drinking.
Can anyone explain this behavior or how to get around it?
If you execute this query:
select *
from TblA
left join freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
inner join DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
It will be very very very slow.
If you change that query to use two inner joins instead of a left join, it will be very fast. If you change it to use two left joins instead of an inner join, it will be very fast.
You can observe this same behavior if you use a sql table variable instead of the freetexttable as well.
The performance problem arises any time you have a table variable (or freetexttable) and a table in a different database catalog where one is in an inner join and the other is in a left join.
Does anyone know why this is slow, or how to speed it up?
A general rule of thumb is that OUTER JOINs cause the number of rows in a result set to increase, while INNER JOINs cause the number of rows in a result set to decrease. Of course, there are plenty of scenarios where the opposite is true as well, but it's more likely to work this way than not. What you want to do for performance is keep the size of the result set (working set) as small as possible for as long as possible.
Since both joins match on the first table, changing up the order won't effect the accuracy of the results. Therefore, you probably want to do the INNER JOIN before the LEFT JOIN:
SELECT *
FROM TblA
INNER JOIN DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
LEFT JOIN freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
As a practical matter, the query optimizer should be smart enough to compile to use the faster option, regardless of which order you specified for the joins. However, it's good practice to pretend that you have a dumb query optimizer, and that query operations happen in order. This helps future maintainers spot potential errors or assumptions about the nature of the tables.
Because the optimizer should re-write things, this probably isn't good enough to fully explain the behavior you're seeing, so you'll still want to examine the execution plan used for each query, and probably add an index as suggested earlier. This is still a good principle to learn, though.
What you should usually do is turn on the "Show Actual Execution Plan" option and then take a close look at what is causing the slowdown. (hover your mouse over each join to see the details) You'll want to make sure that you are getting an index seek and not a table scan.
I would assume what is happening is that SQL is being forced to pull everything from one table into memory in order to do one of the joins. Sometimes reversing the order that you join the tables will also help things.
Putting freetexttable(TblB, *, 'query') into a temp table may help if it's getting called repeatedly in the execution plan.
Index the field you use to perform the join.
A good rule of thumb is to assign an index to any commonly referenced foreign or candidate keys.