Which one gives better performance on a large set of records?
select name from tablea,tableb where tablea.id = tableb.id
Or
select name from tablea inner join tableb on tablea.id = tableb.id
Here I have given a simple example, but in our project we use a lot of tables and joins to fetch records. In this more complicated case, which one will give higher performance: the comma-separated WHERE-clause style or explicit JOINs?
Neither. You should always use ANSI joins, in any case.
There are more than just style considerations for using ANSI joins:
In future versions of SQL Server, non-standard joins may not be supported.
You cannot perform all types of joins using old-style syntax. For example, the following ANSI-style query, which finds customers with no home address, relies on an outer-join condition (on AddressType) that the old-style syntax cannot express:
SELECT
    C.CustomerID
FROM
    dbo.Customer C
    LEFT JOIN dbo.Address A
        ON C.CustomerID = A.CustomerID
        AND A.AddressType = 'Home' -- this can't be done with non-ANSI
WHERE
    A.CustomerID IS NULL
;
Using JOIN expresses the intent of the query more clearly to the developer, by separating conditions that are specific to the mechanics of how tables always relate to each other in the JOIN clause, and putting conditions that are specific to the needs of this query in the WHERE clause. Putting all the conditions, joining and filtering, in the WHERE clause clutters the WHERE clause and makes it exceedingly hard to understand complex queries.
Using JOIN allows the conditions that join a table to be physically near where that table is introduced in the script, reducing the likelihood of making mistakes and further aiding comprehension.
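To see that separation in practice, here is a minimal sketch; dbo.Orders and dbo.Customer (and their columns) are made up purely for illustration:
-- Old style: how the tables relate and how the result is filtered are mixed together
SELECT O.OrderID, C.CustomerName
FROM dbo.Orders O, dbo.Customer C
WHERE O.CustomerID = C.CustomerID
  AND O.OrderDate >= '20200101';

-- ANSI style: the relationship lives in ON, the filter lives in WHERE
SELECT O.OrderID, C.CustomerName
FROM dbo.Orders O
INNER JOIN dbo.Customer C
    ON O.CustomerID = C.CustomerID
WHERE O.OrderDate >= '20200101';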
I will state absolutely and categorically that you will gain no performance benefit from using old-style joins instead of ANSI joins. If a query using ANSI joins is too complex for the optimizer to do a good job, it is only a matter of chance whether the old-style join will work better or worse. The reverse is exactly as true, that if a query using old-style joins is too complex for the optimizer to do a good job, then it is only a matter of chance whether an ANSI join will work better or worse.
The correct solution to queries that are not performing well is for an expert to study the query, the tables, and the underlying indexes and make recommendations. The query may benefit from being separated into two (first inserting to a temp table, then joining to that). There may be missing indexes. Data types may be chosen poorly. A clustered index may need to move from the PK to another column. The conditions on the query might be transformable to SARGable ones. Date functions on newly-introduced columns might be eliminated in favor of date inequality conditions against pre-calculable expressions using constants or earlier-introduced columns. There may be denormalization that is actually hurting performance.
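As one concrete illustration of the SARGable rewrite mentioned above (the table, column, and supporting index are assumed for the sake of the example):
-- Non-SARGable: the function wrapped around the column rules out an index seek on OrderDate
SELECT OrderID
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2020;

-- SARGable: the column is left bare, so an index on OrderDate can be used for a seek
SELECT OrderID
FROM dbo.Orders
WHERE OrderDate >= '20200101'
  AND OrderDate <  '20210101';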
There are a host of factors that can affect performance, and I guarantee you with every shred of conviction I possess that going back to old-style joins will never, ever be the answer.
In that simplified example, the performance should be exactly the same. If you run Query Analyzer on the two options, you'll see that the optimizer will translate your WHERE into a JOIN in any case.
You might be able to write a query complex enough to confound the optimizer, though, so stick with JOINs.
Related
I've always thought of cross applies as a different way of doing an inner join. I've been having to re-write a bunch of code because my senior is convinced that cross applies aren't ANSI-supported and also use row-by-row processing.
I get that an inner join is more intuitive. I also understand that I should not use a cross apply if the same thing can be accomplished with an inner join. It's just that sometimes I try cross applies before inner joins. I've looked at the IO statistics for cases where I can switch a cross apply to an inner join and there are no differences.
My Questions then:
1. Does cross apply use row by row processing?
2. Should cross applies be regarded and treated like cursors? (i.e. performance hogs)
3. Is cross apply ANSI supported?
4. What are the best real life examples of when to use and avoid cross applies?
Does cross apply use row by row processing?
Sometimes. So do many regular joins. It depends on the query. Show the actual query plan in SSMS and you can see what it is doing. Often you will see that the CROSS APPLY and the equivalent traditional joins use the same query plan. Sometimes CROSS APPLY will be faster. Sometimes the JOIN will be faster. It depends on the data, indexes, statistics, etc.
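If you want to measure it yourself rather than eyeball the plan, you can run both shapes with statistics enabled and compare the reads; dbo.Orders and dbo.Payments here are placeholder tables, and the two queries are logically equivalent:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- CROSS APPLY form: the inner query is correlated to each outer row
SELECT O.OrderID, P.Amount
FROM dbo.Orders O
CROSS APPLY (SELECT Pay.Amount
             FROM dbo.Payments Pay
             WHERE Pay.OrderID = O.OrderID) P;

-- Traditional INNER JOIN form
SELECT O.OrderID, Pay.Amount
FROM dbo.Orders O
INNER JOIN dbo.Payments Pay
    ON Pay.OrderID = O.OrderID;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;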
Should cross applies be regarded and treated like cursors? (i.e. performance hogs)
No. They are not like cursors. If not optimized by the query optimizer, they are like LOOP JOINS. But they might be performance hogs. Just like any other query.
Is cross apply ANSI supported?
I don't think so, but I am not certain.
What are the best real life examples of when to use and avoid cross applies?
If you have a query that returns a lot of rows in the outer part of the query, you might consider joining to a subquery rather than using a CROSS APPLY, anticipating that SQL Server will do a HASH JOIN on the two queries. However, if SQL Server does a LOOP JOIN you will likely end up with the same query plan as the CROSS APPLY. If you have a query with few rows in the outer and you want to look up values in another table just based on those few, then you might favor the CROSS APPLY, though SQL Server may choose the LOOP JOIN for you anyway.
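A rough sketch of the two shapes being compared (dbo.Requests and dbo.Status are made-up tables, and for a plain lookup like this the two queries return the same rows):
-- CROSS APPLY: a per-outer-row lookup, which often plans as a LOOP JOIN
SELECT R.RequestID, X.StatusName
FROM dbo.Requests R
CROSS APPLY (SELECT S.StatusName
             FROM dbo.Status S
             WHERE S.StatusID = R.StatusID) X;

-- Join to a subquery: leaves the optimizer freer to pick a HASH JOIN for a large outer set
SELECT R.RequestID, S.StatusName
FROM dbo.Requests R
INNER JOIN (SELECT StatusID, StatusName FROM dbo.Status) S
    ON S.StatusID = R.StatusID;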
As a general rule, you shouldn't use JOIN hints unless you have a darn good reason to do so. Similarly, I wouldn't fret over using CROSS APPLY vs a join to a sub-query based solely on performance. Choose the one that makes the most sense in fetching your data, and let SQL Server figure out the best way to execute it. If it runs slowly for a particular query, then think about changing it to the other approach or providing join hints.
This is the description from Microsoft TechNet explaining Trace Flag 342 (emphasis added):
Disables the costing of pseudo-merge joins, thus significantly
reducing time spent on the parse for certain types of large,
multi-table joins. One can also use SET FORCEPLAN ON to disable the
costing of pseudo-merge joins because the query is forced to use the
order specified in the FROM clause.
Does anyone know what a pseudo-merge join is? As far as I know, SQL Server has 3 join algorithms (Nested Loops Join, Merge Join, and Hash Join - which encompasses the Bitmap Join). So what is a pseudo-merge join, and what is the difference between it and a regular Merge Join, or any other join for that matter?
I know this is a kind of old question, but I will try to answer it as specifically as I can.
Pseudo-merge is not a type of join used as a T-SQL language operator; my interpretation of Microsoft's explanation of Trace Flag 342 is as follows:
Disables the costing of pseudo-merge joins, thus significantly
reducing time spent on the parse for certain types of large,
multi-table joins.
"Pseudo-merge" here represents the work the query optimiser does while costing candidate execution plans, trying to find the best way to join the several tables involved.
One can also use SET FORCEPLAN ON to disable the costing of
pseudo-merge joins because the query is forced to use the order
specified in the FROM clause.
This option stops the optimizer from attempting that calculation and simply executes the joins in the order they are listed in the query.
An article on SET FORCEPLAN ON for reference.
I find myself unwilling to switch to using JOIN when I can easily solve the same problem by using an inner query:
e.g.
SELECT COLUMN1, ( SELECT COLUMN1 FROM TABLE2 WHERE TABLE2.ID = TABLE1.TABLE2ID ) AS COLUMN2 FROM TABLE1;
My question is, is this a bad programming practice? I find it easier to read and maintain as opposed to a join.
UPDATE
I want to add that there's some great feedback in here which, in essence, is pushing me back to using JOIN. I am finding myself less and less involved with using TSQL directly these days as a result of ORM solutions (LINQ to SQL, NHibernate, etc.), but when I do, it's things like correlated subqueries which I find easier to type out linearly.
Personally, I find this incredibly difficult to read. It isn't the structure a SQL developer expects. By using JOIN, you are keeping all of your table sources in a single spot instead of spreading them throughout your query.
What happens if you need to have three or four joins? Putting all of those into the SELECT clause is going to get hairy.
A join is usually faster than a correlated subquery as it acts on the set of rows rather than one row at a time. I would never let this code go to my production server.
And I find a join much much easier to read and maintain.
If you needed more than one column from the second table, then you would require two subqueries. This typically would not perform as well as a join.
This is not equivalent to JOIN.
If you have multiple rows in TABLE2 for each row in TABLE1, you won't get them.
For each row in TABLE1 you get one row output so you can't get multiple from TABLE2.
This is why I'd use "JOIN": to make sure I get the data I wanted...
After your update: I rarely use correlation except with EXISTS...
The query you use was often used as a replacement for a LEFT JOIN for the engines that lacked it (most notably, PostgreSQL before 7.2)
This approach has some serious drawbacks:
It may fail if TABLE2.ID is not UNIQUE
Some engines will not be able to use anything else than NESTED LOOPS for this query
If you need to select more than one column, you will need to write the subquery several times
If your engine supports LEFT JOIN, use the LEFT JOIN.
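Using the tables from the question, the LEFT JOIN version of that correlated subquery would look something like this (assuming TABLE2.ID is unique; if it is not, the JOIN returns one row per match instead of failing):
SELECT TABLE1.COLUMN1,
       TABLE2.COLUMN1 AS COLUMN2
FROM TABLE1
LEFT JOIN TABLE2
    ON TABLE2.ID = TABLE1.TABLE2ID;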
In MySQL, however, there are some cases when an aggregate function in a select-level subquery can be more efficient than that in a LEFT JOIN with a GROUP BY.
See this article in my blog for the examples:
Aggregates: subqueries vs. GROUP BY
This is not a bad programming practice at all IMO; it is a little bit ugly, though. It can actually be a performance boost in situations where the sub-select is from a very large table while you are expecting a very small result set (you have to consider indexes and the platform - SQL Server 2000 has a different optimizer than 2005). Here is how I format it to be easier to read.
select
    column1,
    [column2] = (subselect...)
from
    table1
Edit:
This of course assumes that your subselect will only return one value; if not, it could return bad results. See gbn's response.
It also makes it a lot easier to use other types of joins (left outer, cross, etc.), because the syntax for those in subquery terms is less than ideal for readability.
At the end of the day, the goal when writing code, beyond functional requirements, is to make the intent of your code clear to a reader. If you use a JOIN, the intent is obvious. If you use a subquery in the manner you describe, it begs the question of why you did it that way. What were you trying to achieve that a JOIN would not have accomplished? In short, you waste the reader's time in trying to determine if the author was solving some problem in an ingenious fashion or if they were writing the code after a hard night of drinking.
What are the advantages, if any, of explicitly doing a HASH JOIN over a regular JOIN (wherein SQL Server will decide the best JOIN strategy)? Eg:
select pd.*
from profiledata pd
inner hash join profiledatavalue val on val.profiledataid=pd.id
In the simplistic sample code above, I'm specifying the JOIN strategy, whereas if I leave off the "hash" keyword SQL Server will do a MERGE JOIN behind the scenes (per the "actual execution plan").
The optimiser does a good enough job for everyday use. However, in theory it might need 3 weeks to find the perfect plan in an extreme case, so there is a chance that the generated plan will not be ideal.
I'd leave it alone unless you have a very complex query or huge amounts of data where it simply can't produce a good plan. Then I'd consider it.
But over time, as data changes/grows or indexes change etc., your JOIN hint will become obsolete and prevent an optimal plan. A JOIN hint can only optimise for that single query at the time of development, with the set of data you had then.
Personally, I've never specified a JOIN hint in any production code.
I've normally solved a bad join by changing my query around, adding/changing an index or breaking it up (eg load a temp table first). Or my query was just wrong, or I had an implicit data type conversion, or it highlighted a flaw in my schema etc.
I've seen other developers use them but only where they had complex views nested upon complex views and they caused later problems when they refactored.
Edit:
I had a conversation today where some colleagues are going to use them to force a bad query plan (with NOLOCK and MAXDOP 1) to "encourage" migration away from legacy complex nested views that one of their downstream systems calls directly.
Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.
When to try a hash hint, how about:
After checking that adequate indices exist on at least one of the tables.
After having tried to re-arrange the query: things like converting joins to "in" or "exists" (see the sketch below), changing the join order (which is only really a hint anyway), moving logic from the where clause to the join condition, etc.
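As a sketch of the "convert a join to exists" idea (the tables are made up; both queries return customers that have at least one order):
-- Join used only as a filter; it can multiply rows and then needs DISTINCT
SELECT DISTINCT C.CustomerID, C.CustomerName
FROM dbo.Customer C
INNER JOIN dbo.Orders O
    ON O.CustomerID = C.CustomerID;

-- Same intent as a semi-join, which sometimes gives the optimizer a better shape to work with
SELECT C.CustomerID, C.CustomerName
FROM dbo.Customer C
WHERE EXISTS (SELECT 1
              FROM dbo.Orders O
              WHERE O.CustomerID = C.CustomerID);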
A basic rule of thumb is that a hash join is effective when the join condition does not exist as a table index and when the table sizes are quite different. If you are looking for a technical description, there are some good explanations out there of how a hash join works.
Why use any join hints (hash/merge/loop with side effect of force order)?
To avoid extremely slow execution (.5 -> 10.0s) of corner cases.
When the optimizer consistently chooses a mediocre plan.
A supplied hint is likely to be non-ideal for some circumstances but provides more consistently predictable runtimes. The expected worst-case and best-case scenarios should be pre-tested when using a hint. Predictable runtimes are critical for web services, where a rigidly optimized nominal [0.3s, 0.6s] query is preferred over one that can range anywhere in [0.25s, 10.0s], for example. Large runtime variances can happen even with statistics freshly updated and best practices followed.
When testing in a development environment, one should turn off "cheating" as well to avoid hot/cold runtime variances. From another post...
CHECKPOINT -- flushes dirty pages to disk
DBCC DROPCLEANBUFFERS -- clears data cache
DBCC FREEPROCCACHE -- clears execution plan cache
The last command may have the same effect as the OPTION (RECOMPILE) hint.
The MAXDOP setting and the load on the machine can also make a huge difference in runtime. Materializing a CTE into a temp table is also a good locking-down mechanism and something to consider.
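A minimal sketch of materializing an intermediate result into a temp table instead of repeating a CTE (the query and table names are placeholders):
-- Materialize once; the temp table gets its own statistics and pins down that part of the plan
SELECT O.CustomerID, SUM(O.Amount) AS TotalAmount
INTO #CustomerTotals
FROM dbo.Orders O
GROUP BY O.CustomerID;

SELECT C.CustomerName, T.TotalAmount
FROM dbo.Customer C
INNER JOIN #CustomerTotals T
    ON T.CustomerID = C.CustomerID;

DROP TABLE #CustomerTotals;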
The only hint I've ever seen in shipping code was OPTION (FORCE ORDER). A stupid bug in the SQL query optimizer would generate a plan that tried to join an unfiltered varchar and a uniqueidentifier. Adding FORCE ORDER caused it to run the filter first.
I know, overloading columns is bad. Sometimes, you've got to live with it.
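For illustration only (the tables and the overloaded column are invented; the point is just where the hint sits), it looked roughly like this:
SELECT E.EventID, D.Description
FROM dbo.EventLog E
INNER JOIN dbo.Detail D
    ON D.OverloadedKey = CONVERT(varchar(36), E.EventGuid)  -- varchar column overloaded to hold GUIDs
WHERE E.EventType = 'X'
OPTION (FORCE ORDER);  -- join in the order written, so the filtered table is processed first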
The logical plan optimizer doesn't guarantee that it will find the optimal solution: an exact algorithm would be too slow to use on a production server, so some greedy algorithms are used instead.
Hence, the rationale behind those hints is to let the user specify the optimal join strategy in case the optimizer can't work out which one is really the best to adopt.
Can anyone explain this behavior or how to get around it?
If you execute this query:
select *
from TblA
left join freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
inner join DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
It will be very very very slow.
If you change that query to use two inner joins instead of a left join, it will be very fast. If you change it to use two left joins instead of an inner join, it will be very fast.
You can observe this same behavior if you use a sql table variable instead of the freetexttable as well.
The performance problem arises any time you have a table variable (or freetexttable) and a table in a different database catalog where one is in an inner join and the other is in a left join.
Does anyone know why this is slow, or how to speed it up?
A general rule of thumb is that OUTER JOINs cause the number of rows in a result set to increase, while INNER JOINs cause the number of rows in a result set to decrease. Of course, there are plenty of scenarios where the opposite is true as well, but it's more likely to work this way than not. What you want to do for performance is keep the size of the result set (working set) as small as possible for as long as possible.
Since both joins match on the first table, changing up the order won't affect the accuracy of the results. Therefore, you probably want to do the INNER JOIN before the LEFT JOIN:
SELECT *
FROM TblA
INNER JOIN DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
LEFT JOIN freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
As a practical matter, the query optimizer should be smart enough to compile to use the faster option, regardless of which order you specified for the joins. However, it's good practice to pretend that you have a dumb query optimizer, and that query operations happen in order. This helps future maintainers spot potential errors or assumptions about the nature of the tables.
Because the optimizer should re-write things, this probably isn't good enough to fully explain the behavior you're seeing, so you'll still want to examine the execution plan used for each query, and probably add an index as suggested earlier. This is still a good principle to learn, though.
What you should usually do is turn on the "Show Actual Execution Plan" option and then take a close look at what is causing the slowdown. (hover your mouse over each join to see the details) You'll want to make sure that you are getting an index seek and not a table scan.
I would assume what is happening is that SQL is being forced to pull everything from one table into memory in order to do one of the joins. Sometimes reversing the order that you join the tables will also help things.
Putting freetexttable(TblB, *, 'query') into a temp table may help if it's getting called repeatedly in the execution plan.
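A sketch of that temp table approach, reusing the names from the question:
-- materialize the full-text results once
select [Key], [Rank]
into #FreeTextHits
from freetexttable ( TblB, *, 'query' );

select *
from TblA
left join #FreeTextHits ft on TblA.ID = ft.[Key]
inner join DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID;

drop table #FreeTextHits;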
Index the field you use to perform the join.
A good rule of thumb is to assign an index to any commonly referenced foreign or candidate keys.
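For the query in this question, that would mean something along these lines (the index name is illustrative, and it assumes no suitable index already exists on the foreign key column):
-- support the cross-catalog join on TblC.TblAID
create index IX_TblC_TblAID
    on DifferentDbCatalog.dbo.TblC (TblAID);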