What is a pseudo-merge join? - sql-server

This is the description from Microsoft TechNet explaining Trace Flag 342 (emphasis added):
Disables the costing of pseudo-merge joins, thus significantly
reducing time spent on the parse for certain types of large,
multi-table joins. One can also use SET FORCEPLAN ON to disable the
costing of pseudo-merge joins because the query is forced to use the
order specified in the FROM clause.
Does anyone know what a pseudo-merge join is? As far as I know, SQL Server has three join algorithms (Nested Loops Join, Merge Join, and Hash Join, which encompasses Bitmap Join). So what is a pseudo-merge join, and how does it differ from a regular Merge Join, or from any other join for that matter?

I know this is a fairly old question, but I will try to answer it as specifically as I can.
Pseudo-merge is not a type of join that you can request as a T-SQL language operator. My interpretation of Microsoft's explanation of Trace Flag 342 is as follows:
Disables the costing of pseudo-merge joins, thus significantly
reducing time spent on the parse for certain types of large,
multi-table joins.
As I read it, pseudo-merge names part of the work the query optimiser does while searching for a better execution plan, that is, for the best way to join the several tables.
One can also use SET FORCEPLAN ON to disable the costing of
pseudo-merge joins because the query is forced to use the order
specified in the FROM clause.
This option stops the optimizer from evaluating alternative join orders; it simply executes the joins in the order in which they are listed in the query.
An article on SET FORCEPLAN ON for reference.

Related

Does cross apply use row by row processing?

I've always thought of cross applies as a different way of doing an inner join. I've been having to rewrite a bunch of code because my senior is convinced that cross applies aren't ANSI-supported and also use row by row processing.
I get that an inner join is more intuitive. I also understand that I should not use a cross apply if the same thing can be accomplished with an inner join. It's just that sometimes I try cross applies before inner joins. I've looked at the IO statistics for cases where I can switch a cross apply to an inner join, and there are no differences.
My Questions then:
1. Does cross apply use row by row processing?
2. Should cross applies be regarded and treated like cursors (i.e. performance hogs)?
3. Is cross apply ANSI-supported?
4. What are the best real-life examples of when to use and avoid cross applies?
Does cross apply use row by row processing?
Sometimes. So do many regular joins. Depends on the query. Show actual query plan in SSMS and you can see what it is doing. Often times you will see that the CROSS APPLY and the equivalent traditional joins use the same query plan. Sometimes CROSS APPLY will be faster. Sometimes the JOIN will be faster. Depends on the data, indexes, statistics, etc.
Should cross applies be regarded and treated like cursors (i.e. performance hogs)?
No. They are not like cursors. If the query optimizer does not transform them, they behave like LOOP JOINS. They might still be performance hogs, just like any other query.
Is cross apply ANSI-supported?
I don't think so, but I am not certain.
What are the best real life examples of when to use and avoid cross applies?
If you have a query that returns a lot of rows in the outer part of the query, you might consider joining to a subquery rather than using a CROSS APPLY, anticipating that SQL Server will do a HASH JOIN on the two queries. However, if SQL Server does a LOOP JOIN you will likely end up with the same query plan as the CROSS APPLY. If you have a query with few rows in the outer and you want to look up values in another table just based on those few, then you might favor the CROSS APPLY, though SQL Server may choose the LOOP JOIN for you anyway.
As a general rule, you shouldn't use JOIN hints unless you have a darn good reason to do so. Similarly, I wouldn't fret over using CROSS APPLY vs a join to a sub-query based solely on performance. Choose the one that makes the most sense in fetching your data, and let SQL Server figure out the best way to execute it. If it runs slowly for a particular query, then think about changing it to the other approach or providing join hints.
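To make the "row by row" question concrete, here is a minimal Python sketch of the logical meaning of CROSS APPLY: the right-hand side is evaluated once per outer row, which is also how a nested loops join works. This is not how SQL Server implements it, and the optimizer remains free to pick a different physical plan, as described above. The `customers` and `orders` data and the `latest_orders` helper are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple  # a row is just a tuple in this sketch


def cross_apply(
    outer: Iterable[Row],
    table_fn: Callable[[Row], Iterable[Row]],
) -> Iterator[Tuple[Row, Row]]:
    """Evaluate table_fn once per outer row and pair each result with it.

    Outer rows for which table_fn returns nothing are dropped,
    which mirrors CROSS APPLY (as opposed to OUTER APPLY).
    """
    for outer_row in outer:
        for inner_row in table_fn(outer_row):
            yield outer_row, inner_row


# Hypothetical data: (customer_id, name) and (order_id, customer_id, order_date)
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [
    (10, 1, "2024-01-05"),
    (11, 1, "2024-02-01"),
    (12, 1, "2024-03-09"),
    (13, 2, "2024-01-20"),
]


def latest_orders(customer: Row, n: int = 2) -> list:
    """Correlated 'subquery': the n most recent orders of one customer."""
    mine = [o for o in orders if o[1] == customer[0]]
    return sorted(mine, key=lambda o: o[2], reverse=True)[:n]


result = list(cross_apply(customers, latest_orders))
for customer, order in result:
    print(customer[1], order[0], order[2])
```

A "top N per outer row" lookup like this is awkward to express as a plain inner join, which is one reason to reach for an apply; when the inner side is a simple equality lookup, the two forms often end up with the same plan.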

When will SQL Server Choose a Hash Join over a Merge Join?

I have a large query with many joins that I am trying to tune, and one warning sign is that there are many, many hash joins being used throughout. I went down to the base of the query tree and went to the first join there, which is an inner join.
Table A is using a clustered index scan to retrieve its data, which is sorted on the join column.
Table B is using a nonclustered index scan, which is also sorted on the join column.
When I join just these two tables in isolation, and select the same set of columns, the optimizer uses a merge join. The sets being joined are approximately the same size, and not very large (<5,000 rows).
What could explain the optimizer choosing the hash join over the merge join in this case?
EDIT
As requested, I have added a few more details. The index definitions are:
CREATE NONCLUSTERED INDEX NCL_Asset_Issuer_MergeInduce ON Asset.IssuerCompanyId (CompanyId) INCLUDE (IsPrivate,HasPublicEquity,Ticker,FinancialTemplateID,BondTicker,SICOther1ID,SICOther4ID,SICSandPID,SICOther3ID,SICMoodyID,CurrencyTypeID,SecondaryAnalystID,AnalystID,SICOshaID,SecondaryBondTicker,FiscalYearEnd,EquityExchangeID);
CREATE NONCLUSTERED INDEX NCL_Asset_IssuerCustom_IssuerId ON Asset.IssuerCustom (IssuerID) INCLUDE (Text3,ListItem1ID,ListItem5ID,ListItem3ID,ListItem2ID,Version,text4,TextLong15,ListItem6ID);
The following query will return a merge join, as I mentioned earlier:
SELECT IsPrivate,HasPublicEquity,Ticker,FinancialTemplateID,BondTicker,SICOther1ID,SICOther4ID,SICSandPID,SICOther3ID,SICMoodyID,CurrencyTypeID,SecondaryAnalystID,AnalystID,SICOshaID,SecondaryBondTicker,FiscalYearEnd,EquityExchangeID,ic.ListItem2Id,ic.ListItem3ID,ic.IssuerId
FROM Asset.Issuer i
INNER JOIN Asset.IssuerCustom ic ON i.CompanyId = ic.IssuerId;
As you can see, the query is using both the indices above. On the other hand, this same join occurs in a much larger query, and the below image shows the corner of the plan, where this join is occurring as a hash join:
The one difference that I can see is that there is a reversal in terms of which table is the "inner" table vs which is the "outer" table. Still, why would this impact the execution plan if both queries are inner joins on the same column?
The SQL Server Query optimiser does not guarantee the optimum query. It looks for the best query in the time that it sets itself. As queries become large, with multiple joins, the number of different combinations grows exponentially and it becomes impossible to explore every possible path and therefore guarantee an optimum solution.
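As a rough illustration of how quickly the search space grows, the following Python sketch counts candidate join orders for n tables. It uses the standard textbook counts (n! left-deep orders, (2n-2)!/(n-1)! ordered binary join trees) and says nothing about which of these SQL Server actually enumerates; the function names are made up for this example.

```python
from math import factorial


def left_deep_orders(n: int) -> int:
    """Number of left-deep join orders for n tables: n!"""
    return factorial(n)


def bushy_join_trees(n: int) -> int:
    """Number of ordered binary join trees over n tables: (2n-2)! / (n-1)!"""
    return factorial(2 * n - 2) // factorial(n - 1)


for n in (2, 4, 8, 12):
    print(f"{n:>2} tables: {left_deep_orders(n):>12,} left-deep, "
          f"{bushy_join_trees(n):>22,} bushy")
```

These counts cover join order only; multiplying by the choice of physical join algorithm and access path at each step makes the space larger still, which is why an optimizer has to stop searching at some point.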
Usually, you should be able to trust the query optimiser to do a good job if your table design is sound (with appropriate indices) and statistics are up to date.
The different choice of join may be due to the different resources available (CPU and memory), considering that other parts of the larger plan are being executed in parallel.
If you want to investigate further, I would test with join hints to find out whether the optimiser made the best decision. It may also be worth testing with the query hint OPTION (MAXDOP 1) to find out whether parallel execution affects the choices made by the optimiser.
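For readers comparing the two operators, here is a simplified Python sketch of both algorithms on an equality join. It is a teaching sketch, not SQL Server's implementation: it ignores memory grants, spills, parallelism and costing. The row layout (join key in position 0) and the sample data are assumptions made for the example.

```python
from collections import defaultdict


def merge_join(left, right):
    """Join two lists of rows that are ALREADY SORTED on row[0]."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            run_end = j
            while run_end < len(right) and right[run_end][0] == lk:
                run_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:run_end]:
                    joined.append((left[i], r))
                i += 1
            j = run_end
    return joined


def hash_join(build, probe):
    """Join two lists of rows on row[0]; no sort order is required."""
    table = defaultdict(list)
    for b in build:                      # build phase: hash one input
        table[b[0]].append(b)
    joined = []
    for p in probe:                      # probe phase: look up each row
        for b in table.get(p[0], []):
            joined.append((b, p))
    return joined


# Hypothetical rows: (join key, payload)
issuer_rows = [(1, "A"), (2, "B"), (3, "C")]
custom_rows = [(1, "x"), (3, "y"), (3, "z")]

print(merge_join(issuer_rows, custom_rows))
print(hash_join(issuer_rows, custom_rows))
```

The merge join needs both inputs ordered on the join key, which the two index scans in the isolated query provide for free. Inside a larger plan, the rows reaching this join may no longer arrive in that order, or a different input may be chosen as the build side, and a hash join can then be estimated as cheaper than sorting first.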

Sql Server Performance Issue with link and join

Which one gives better performance on a large set of records?
select name from tablea,tableb where tablea.id = tableb.id
Or
select name from tablea inner join tableb on tablea.id = tableb.id
Here I have given a simple example, but in our project we use a lot of tables and joins to fetch records. In this more complicated case, which one will give higher performance, links or joins?
Neither. You should always use ANSI joins, in any case.
There are more than just style considerations for using ANSI joins:
In future versions of SQL Server, non-standard joins may not be supported.
You cannot perform all types of joins using old-style syntax. For example, the following ANSI-style query:
SELECT *
FROM
    dbo.Customer C
    LEFT JOIN dbo.Address A
        ON C.CustomerID = A.CustomerID
        AND A.AddressType = 'Home' -- this can't be done with non-ANSI
WHERE
    A.CustomerID IS NULL
;
Using JOIN expresses the intent of the query more clearly to the developer, by separating conditions that are specific to the mechanics of how tables always relate to each other in the JOIN clause, and putting conditions that are specific to the needs of this query in the WHERE clause. Putting all the conditions, joining and filtering, in the WHERE clause clutters the WHERE clause and makes it exceedingly hard to understand complex queries.
Using JOIN allows the conditions that join a table to be physically near where that table is introduced in the script, reducing the likelihood of making mistakes and further aiding comprehension.
I will state absolutely and categorically that you will gain no performance benefit from using old-style joins instead of ANSI joins. If a query using ANSI joins is too complex for the optimizer to do a good job, it is only a matter of chance whether the old-style join will work better or worse. The reverse is exactly as true, that if a query using old-style joins is too complex for the optimizer to do a good job, then it is only a matter of chance whether an ANSI join will work better or worse.
The correct solution to queries that are not performing well is for an expert to study the query, the tables, and the underlying indexes and make recommendations. The query may benefit from being separated into two (first inserting to a temp table, then joining to that). There may be missing indexes. Data types may be chosen poorly. A clustered index may need to move from the PK to another column. The conditions on the query might be transformable to SARGable ones. Date functions on newly-introduced columns might be eliminated in favor of date inequality conditions against pre-calculable expressions using constants or earlier-introduced columns. There may be denormalization that is actually hurting performance.
There are a host of factors that can affect performance, and I guarantee you with every shred of conviction I possess that going back to old-style joins will never, ever be the answer.
In that simplified example, the performance should be exactly the same. If you run Query Analyzer on the two options, you'll see that the optimizer will translate your WHERE into a JOIN in any case.
You might be able to write a query complex enough to confound the optimizer, though, so stick with JOINs.
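The claim that the two spellings are logically the same query is easy to check on a small scale. The sketch below uses Python's built-in sqlite3 module rather than SQL Server, so it only demonstrates that the results match, not anything about SQL Server's plans; the sample rows are invented, and the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablea (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tableb (id INTEGER PRIMARY KEY);
INSERT INTO tablea VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO tableb VALUES (2), (3), (4);
""")

# Old-style "link": join condition lives in the WHERE clause.
old_style = conn.execute(
    "SELECT name FROM tablea, tableb "
    "WHERE tablea.id = tableb.id ORDER BY name"
).fetchall()

# ANSI join: join condition lives in the ON clause.
ansi_style = conn.execute(
    "SELECT name FROM tablea INNER JOIN tableb "
    "ON tablea.id = tableb.id ORDER BY name"
).fetchall()

print(old_style)
print(ansi_style)
print("same result:", old_style == ansi_style)
conn.close()
```

On SQL Server, the equivalent check is to compare the actual execution plans of the two statements in SSMS.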

What is the reason for performance difference?

I am creating a report, and I have written two different versions of the query for it. I am seeing a huge performance difference between the two. What might be the reason? My main table (call it table A) contains a date column, and I filter the data on that date. I have to join around 10 tables to this table.
First method:
select A.id,A1.name,...
from table A
join A1
join A2 ....A10
where A.Cdate >= #date
and A.Cdate <= #date
Second method:
With CTE as(select A.id from A where A.Cdate>=#date and A.Cdate<=#date)
select CTE.id, A1.name,.... from CTE join A1 join A2....A10
Here the second method is faster. What is the reason? In the first method, only the filtered rows of A will be joined with the other tables' data, right?
Execution plans will tell us for sure, but likely if the CTE is able to filter out a lot of rows before the join, a more optimal join approach may have been chosen (for example merge instead of hash). SQL isn't supposed to work that way - in theory those two should execute the same way. But in practice we find that SQL Server's optimizer isn't perfect and can be swayed in different ways based on a variety of factors, including statistics, selectivity, parallelism, a pre-existing version of that plan in the cache, etc.
One suggestion: you should be able to answer this question yourself, since you have both plans. Did you compare the two plans? Are they similar? Also, when you say performance is bad, what do you mean: elapsed time, CPU time, I/O, or something else? What did you compare?
Before you post a question like this, you should check these counters; I am sure they will provide some kind of answer in most cases.
A CTE is for managing the code; it won't improve the performance of a query automatically. CTEs are expanded by the optimizer, so in your case both queries should be the same after expansion and should therefore get similar plans.
A CTE is basically for handling more complex code (whether recursive or containing a subquery), but you need to check the execution plans to see what differs between the two queries.
You can check for the use of CTE at : http://msdn.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
Just a note: in many scenarios temp tables give better performance than CTEs, so you should give temp tables a try as well.
Reference : http://social.msdn.microsoft.com/Forums/en/transactsql/thread/d040d19d-016e-4a21-bf44-a0359fb3c7fb
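To see that the two formulations are logically equivalent, so that any speed difference must come from the plan the optimizer picks rather than from the result being computed, here is a small check using Python's built-in sqlite3 module. It is not SQL Server and proves nothing about SQL Server's plans. The columns of A and A1, the join condition, the sample rows and the date range are all assumptions, since the question elides them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A  (id INTEGER PRIMARY KEY, Cdate TEXT);
CREATE TABLE A1 (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO A  VALUES (1, '2024-01-10'), (2, '2024-02-15'), (3, '2024-03-20');
INSERT INTO A1 VALUES (1, 'one'), (2, 'two'), (3, 'three');
""")

# Hypothetical date range standing in for the question's date placeholders.
params = {"date_from": "2024-02-01", "date_to": "2024-03-31"}

# First method: filter in the WHERE clause of the joined query.
inline_filter = conn.execute("""
    SELECT A.id, A1.name
    FROM A JOIN A1 ON A1.id = A.id
    WHERE A.Cdate >= :date_from AND A.Cdate <= :date_to
    ORDER BY A.id
""", params).fetchall()

# Second method: filter inside a CTE, then join to the CTE.
cte_filter = conn.execute("""
    WITH CTE AS (
        SELECT id FROM A
        WHERE Cdate >= :date_from AND Cdate <= :date_to
    )
    SELECT CTE.id, A1.name
    FROM CTE JOIN A1 ON A1.id = CTE.id
    ORDER BY CTE.id
""", params).fetchall()

print(inline_filter)
print(cte_filter)
print("same result:", inline_filter == cte_filter)
conn.close()
```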

advantages in specifying HASH JOIN over just doing a JOIN?

What are the advantages, if any, of explicitly doing a HASH JOIN over a regular JOIN (wherein SQL Server will decide the best JOIN strategy)? Eg:
select pd.*
from profiledata pd
inner hash join profiledatavalue val on val.profiledataid=pd.id
In the simplistic sample code above, I'm specifying the JOIN strategy, whereas if I leave off the "hash" key word SQL Server will do a MERGE JOIN behind the scenes (per the "actual execution plan").
The optimiser does a good enough job for everyday use. However, in theory it might need 3 weeks to find the perfect plan in the extreme, so there is a chance that the generated plan will not be ideal.
I'd leave it alone unless you have a very complex query or huge amounts of data where it simply can't produce a good plan. Then I'd consider it.
But over time, as data changes or grows, or as indexes change, your JOIN hint will become obsolete and prevent an optimal plan. A JOIN hint can only optimise for that single query, at the time of development, with the set of data you have then.
Personally, I've never specified a JOIN hint in any production code.
I've normally solved a bad join by changing my query around, adding/changing an index or breaking it up (eg load a temp table first). Or my query was just wrong, or I had an implicit data type conversion, or it highlighted a flaw in my schema etc.
I've seen other developers use them but only where they had complex views nested upon complex views and they caused later problems when they refactored.
Edit:
I had a conversation today where some colleagues are going to use them to force a bad query plan (with NOLOCK and MAXDOP 1) to "encourage" migration away from legacy complex nested views that one of their downstream systems calls directly.
Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.
When to try a hash hint? How about:
After checking that adequate indices exist on at least one of the tables.
After having tried to re-arrange the query: things like converting joins to "in" or "exists", changing join order (which is only really a hint anyway), moving logic from the where clause to the join condition, etc.
As a basic rule, a hash join is effective when the join condition is not covered by an index on either table and when the table sizes are different. If you are looking for a technical description, there are some good descriptions out there of how a hash join works.
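The rule of thumb about missing indexes and unequal table sizes can be illustrated by counting work. The Python sketch below compares an unindexed nested loops join with a hash join that builds on the smaller input. It counts key comparisons and hash operations only, ignores memory, I/O and spills, and is not a model of SQL Server's cost formulas; the function names and data sizes are made up.

```python
def nested_loops_work(outer_keys, inner_keys):
    """Unindexed nested loops: compare every outer key with every inner key."""
    comparisons = 0
    matches = 0
    for o in outer_keys:
        for i in inner_keys:
            comparisons += 1
            if o == i:
                matches += 1
    return matches, comparisons


def hash_join_work(small_keys, large_keys):
    """Hash join: one insert per build key, one lookup per probe key."""
    counts = {}
    operations = 0
    for k in small_keys:             # build on the smaller input
        counts[k] = counts.get(k, 0) + 1
        operations += 1
    matches = 0
    for k in large_keys:             # probe with the larger input
        matches += counts.get(k, 0)
        operations += 1
    return matches, operations


small = list(range(10))       # e.g. a 10-row lookup table
large = list(range(1000))     # e.g. a 1,000-row detail table

print("nested loops:", nested_loops_work(small, large))
print("hash join:   ", hash_join_work(small, large))
```

With a usable index on the inner table, the nested loops join would seek instead of scanning and the picture changes, which is why the advice above is to check indexes before reaching for a hint.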
Why use any join hints (hash/merge/loop with side effect of force order)?
To avoid extremely slow execution (.5 -> 10.0s) of corner cases.
When the optimizer consistently chooses a mediocre plan.
A supplied hint is likely to be non-ideal for some circumstances but provides more consistently predictable runtimes. The expected worst case and best case scenarios should be pre-tested when using a hint. Predictable runtimes are critical for web services where a rigidly optimized nominal [.3s, .6s] query is preferred over one that can range [.25, 10.0s] for example. Large runtime variances can happen with statistics freshly updated and best practices followed.
When testing in a development environment, one should turn off "cheating" as well to avoid hot/cold runtime variances. From another post...
CHECKPOINT -- flushes dirty pages to disk
DBCC DROPCLEANBUFFERS -- clears data cache
DBCC FREEPROCCACHE -- clears execution plan cache
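The last command is similar in effect to the OPTION (RECOMPILE) hint, except that it clears the plan cache for the whole instance rather than forcing a fresh plan for a single query.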
The MAXDOP and loading of the machine can also make a huge difference in runtime. Materialization of CTE into temp tables is also a good locking down mechanism and something to consider.
The only hint I've ever seen in shipping code was OPTION (FORCE ORDER). Stupid bug in SQL query optimizer would generate a plan that tried to join an unfiltered varchar and a unique identifier. Adding FORCE ORDER caused it to run the filter first.
I know, overloading columns is bad. Sometimes, you've got to live with it.
The query optimizer does not guarantee that it finds the optimal plan: an exact algorithm would be too slow to use on a production server, so greedy algorithms are used instead.
Hence, the rationale behind these hints is to let the user specify the join strategy in cases where the optimizer cannot work out which one is really best.

Resources