When will SQL Server Choose a Hash Join over a Merge Join? - sql-server

I have a large query with many joins that I am trying to tune, and one warning sign is that many, many hash joins are being used throughout. I went down to the base of the query tree to the first join there, which is an inner join.
Table A is using a clustered index scan to retrieve its data, which is sorted on the join column.
Table B is using a nonclustered index scan, which is also sorted on the join column.
When I join just these two tables in isolation, and select the same set of columns, the optimizer uses a merge join. The sets being joined are approximately the same size, and not very large (<5,000 rows).
What could explain the optimizer choosing the hash join over the merge join in this case?
EDIT
As requested, I have added a few more details. The index definitions are:
CREATE NONCLUSTERED INDEX NCL_Asset_Issuer_MergeInduce ON Asset.Issuer (CompanyId) INCLUDE (IsPrivate,HasPublicEquity,Ticker,FinancialTemplateID,BondTicker,SICOther1ID,SICOther4ID,SICSandPID,SICOther3ID,SICMoodyID,CurrencyTypeID,SecondaryAnalystID,AnalystID,SICOshaID,SecondaryBondTicker,FiscalYearEnd,EquityExchangeID);
CREATE NONCLUSTERED INDEX NCL_Asset_IssuerCustom_IssuerId ON Asset.IssuerCustom (IssuerID) INCLUDE (Text3,ListItem1ID,ListItem5ID,ListItem3ID,ListItem2ID,Version,text4,TextLong15,ListItem6ID)
The following query will return a merge join, as I mentioned earlier:
SELECT IsPrivate,HasPublicEquity,Ticker,FinancialTemplateID,BondTicker,SICOther1ID,SICOther4ID,SICSandPID,SICOther3ID,SICMoodyID,CurrencyTypeID,SecondaryAnalystID,AnalystID,SICOshaID,SecondaryBondTicker,FiscalYearEnd,EquityExchangeID,ic.ListItem2Id,ic.ListItem3ID,ic.IssuerId
FROM Asset.Issuer i
INNER JOIN Asset.IssuerCustom ic ON i.CompanyId = ic.IssuerId;
As you can see, the query is using both the indices above. On the other hand, this same join occurs in a much larger query, and the below image shows the corner of the plan, where this join is occurring as a hash join:
The one difference that I can see is that there is a reversal in terms of which table is the "inner" table vs which is the "outer" table. Still, why would this impact the execution plan if both queries are inner joins on the same column?

The SQL Server query optimiser does not guarantee the optimal plan. It looks for the best plan it can find within the time it allots itself. As queries become large, with multiple joins, the number of possible plan combinations grows exponentially, and it becomes impossible to explore every path and therefore to guarantee an optimal solution.
Usually, you should be able to trust the query optimiser to do a good job if your table design is sound (with appropriate indices) and statistics are up to date.
The different choice of join may be due to the different resources available (CPU and memory), considering other parts of the plan being executed in parallel.
If you want to investigate further, I would test running with join hints to find out whether the optimiser has made the best decision. It may also be worth testing with the query hint OPTION (MAXDOP 1) to find out whether parallel execution affects the choices made by the optimiser.
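For example, the join from the question could be forced to a merge join while ruling out parallelism (a sketch; the select list is trimmed to the join keys so you can compare cost and duration against the hash-join plan):
SELECT i.CompanyId, ic.IssuerId
FROM Asset.Issuer AS i
INNER MERGE JOIN Asset.IssuerCustom AS ic
    ON i.CompanyId = ic.IssuerId
OPTION (MAXDOP 1);  -- also removes parallelism as a factor in the plan choice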

Related

Possible steps to improve SQL Server query performance

I have a search procedure that is passed around 15-20 (optional) parameters, and for each parameter the procedure calls a function to check whether the value passed in exists in the database. So it is basically a search structure based on a number of parameters.
Now, since the database is going to have millions of records, I expect the simple plain search procedure to fail right away. What are the ways that can improve query performance?
What I have tried so far:
Clustered index on the FirstName column (as I expect it to be used very frequently)
Nonclustered indexes on the rest of the columns that form the basis of the user search, using the INCLUDE keyword as well.
Note:
I am looking for more ways to optimize my queries.
Most of the queries are nothing but select statements checked against a condition.
One of the queries uses GroupBy clause.
I have also created a temporary table in which I am inserting all the matched entries.
First, run the query from SQL Server Management Studio and look at the query plan to see where the bottleneck is. Any place you see a "table scan" or "index scan", the engine has to go through all the data to find what it is looking for. If you create appropriate indexes that can be used for these operations, performance should increase.
Listed below are some tips for improving the performance of a SQL query.
Avoid Multiple Joins in a Single Query
Try to avoid writing a SQL query using multiple joins that include outer joins, CROSS APPLY, OUTER APPLY and other complex subqueries. This reduces the choices available to the optimizer when deciding the join order and join type. Sometimes the optimizer is forced to use nested loop joins, irrespective of the performance consequences, for queries with excessively complex CROSS APPLY clauses or subqueries.
Eliminate Cursors from the Query
Try to remove cursors from the query and use set-based queries instead; a set-based query is more efficient than a cursor-based one. If there is a need to use a cursor, then avoid dynamic cursors, as they tend to limit the choice of plans available to the query optimizer. For example, a dynamic cursor limits the optimizer to using nested loop joins.
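A minimal sketch of the kind of rewrite this means, with hypothetical table and column names:
-- Set-based replacement for a cursor that walks dbo.Orders row by row
UPDATE dbo.Orders
SET Status = 'Processed'
WHERE Status = 'New';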
Avoid Use of Non-correlated Scalar Sub Query
You can rewrite your query to run a non-correlated scalar subquery as a separate query instead of as part of the main query, and store the output in a variable that can be referred to in the main query or a later part of the batch. This gives the optimizer better options, which may help it produce accurate cardinality estimates along with a better plan.
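For instance (hypothetical names), the scalar value can be computed once and reused:
DECLARE @LastOrderDate datetime;

-- Run the non-correlated scalar subquery separately
SELECT @LastOrderDate = MAX(OrderDate) FROM dbo.Orders;

-- Refer to the variable in the main query instead of embedding the subquery
SELECT CustomerID, @LastOrderDate AS LastOrderDate
FROM dbo.Customers;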
Avoid Multi-statement Table Valued Functions (TVFs)
Multi-statement TVFs are more costly than inline TVFs. SQL Server expands inline TVFs into the main query the way it expands views, but it evaluates multi-statement TVFs in a separate context from the main query and materializes their results into temporary work tables. The separate context and work table make multi-statement TVFs costly.
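As an illustration (hypothetical function and table), an inline TVF is a single RETURN (SELECT ...) body that the optimizer can expand into the calling query:
-- Inline TVF: single-statement body, expanded like a view at optimization time
CREATE FUNCTION dbo.fn_OpenOrders (@CustomerID int)
RETURNS TABLE
AS
RETURN
(
    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
      AND Status = 'Open'
);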
Create a Highly Selective Index
Selectivity is the percentage of qualifying rows in the table (qualifying rows / total rows). If the ratio of qualifying rows to total rows is low, the index is highly selective and is most useful. A nonclustered index is most useful when the ratio is around 5% or less, which means the index can eliminate 95% of the rows from consideration. If an index returns more than 5% of the rows in a table, it probably will not be used; either a different index will be chosen or created, or the table will be scanned.
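A quick way to check the selectivity of a predicate before indexing it (a sketch with a hypothetical table and column):
-- Fraction of rows the predicate qualifies; roughly 0.05 or less favors a nonclustered index
SELECT CAST(SUM(CASE WHEN Status = 'Open' THEN 1 ELSE 0 END) AS float) / COUNT(*) AS selectivity
FROM dbo.Orders;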
Position a Column in an Index
The order, or position, of a column in an index also plays a vital role in SQL query performance. An index can help to improve performance if the criteria of the query match the columns that are leftmost in the index key. As a best practice, the most selective columns should be placed leftmost in the key of a non-clustered index.
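For example (hypothetical table), if queries always filter on CustomerID and only sometimes also on OrderDate, CustomerID should lead the key:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
ON dbo.Orders (CustomerID, OrderDate);  -- most selective / most frequently filtered column first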
Drop Unused Indexes
Dropping unused indexes can help to speed up data modifications without affecting data retrieval. You also need to define a strategy for batch processes that run infrequently and use certain indexes: in such cases, creating the indexes in advance of the batch processes and dropping them when the batch processes are done helps to reduce the overhead on the database.
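One way to find candidates is the index usage DMV (a sketch; interpret the counts over a full business cycle, since they reset when the instance restarts):
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND s.user_seeks + s.user_scans + s.user_lookups = 0;  -- maintained on writes but never read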
Statistic Creation and Updates
You need to take care of statistics creation and regular updates for computed columns and for the multiple columns referred to in the query; the query optimizer uses information about the distribution of values in one or more columns of a table's statistics to estimate the cardinality, or number of rows, in the query result. These cardinality estimates enable the query optimizer to create a high-quality query plan.
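For example (hypothetical table), multi-column statistics can be created explicitly and refreshed after large data changes:
-- Multi-column statistics on columns that are filtered together
CREATE STATISTICS st_Orders_CustomerID_OrderDate
ON dbo.Orders (CustomerID, OrderDate);

-- Refresh all statistics on the table after a big load
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;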
Revisit Your Schema Definitions
Last but not least, revisit your schema definitions; keep an eye out for whether appropriate FOREIGN KEY, NOT NULL and CHECK constraints are in place. Having the right constraint in the right place always helps query performance: a FOREIGN KEY constraint helps to simplify joins by converting some outer or semi-joins to inner joins, and a CHECK constraint also helps a bit by removing unnecessary or redundant predicates.
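A sketch of the constraints meant here, with hypothetical tables:
ALTER TABLE dbo.Orders
ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (CustomerID) REFERENCES dbo.Customers (CustomerID);

ALTER TABLE dbo.Orders
ADD CONSTRAINT CK_Orders_Quantity CHECK (Quantity > 0);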

Sql Server Performance Issue with link and join

Which one gives better performance on a large set of records?
select name from tablea,tableb where tablea.id = tableb.id
Or
select name from tablea inner join tableb on tablea.id = tableb.id
Here I have given a simple example, but in our project we use a lot of tables and joins to fetch records. In this more complicated case, which one will give higher performance, links or joins?
Neither. You should always use ANSI joins, in any case.
There are more than just style considerations for using ANSI joins:
In future versions of SQL Server, non-standard joins may not be supported.
You cannot perform all types of joins using old-style syntax. For example, the following ANSI-style query:
SELECT
C.CustomerID
FROM
dbo.Customer C
LEFT JOIN dbo.Address A
ON C.CustomerID = A.CustomerID
AND A.AddressType = 'Home' -- this can't be done with non-ANSI syntax
WHERE
A.CustomerID IS NULL
;
Using JOIN expresses the intent of the query more clearly to the developer, by separating conditions that are specific to the mechanics of how tables always relate to each other in the JOIN clause, and putting conditions that are specific to the needs of this query in the WHERE clause. Putting all the conditions, joining and filtering, in the WHERE clause clutters the WHERE clause and makes it exceedingly hard to understand complex queries.
Using JOIN allows the conditions that join a table to be physically near where that table is introduced in the script, reducing the likelihood of making mistakes and further aiding comprehension.
I will state absolutely and categorically that you will gain no performance benefit from using old-style joins instead of ANSI joins. If a query using ANSI joins is too complex for the optimizer to do a good job, it is only a matter of chance whether the old-style join will work better or worse. The reverse is exactly as true, that if a query using old-style joins is too complex for the optimizer to do a good job, then it is only a matter of chance whether an ANSI join will work better or worse.
The correct solution to queries that are not performing well is for an expert to study the query, the tables, and the underlying indexes and make recommendations. The query may benefit from being separated into two (first inserting to a temp table, then joining to that). There may be missing indexes. Data types may be chosen poorly. A clustered index may need to move from the PK to another column. The conditions on the query might be transformable to SARGable ones. Date functions on newly-introduced columns might be eliminated in favor of date inequality conditions against pre-calculable expressions using constants or earlier-introduced columns. There may be denormalization that is actually hurting performance.
There are a host of factors that can affect performance, and I guarantee you with every shred of conviction I possess that going back to old-style joins will never, ever be the answer.
In that simplified example, the performance should be exactly the same. If you run Query Analyzer on the two options, you'll see that the optimizer will translate your WHERE into a JOIN in any case.
You might be able to write a query complex enough to confound the optimizer, though, so stick with JOINs.

What is the reason for performance difference?

I am creating a report, and for that I have written two different types of query. But I am seeing a huge performance difference between these two methods. What might be the reason? My main table (suppose table A) contains a date column, and I am filtering the data based on that date. I have to join around 10 tables to this table.
First method:
select A.id,A1.name,...
from table A
join A1
join A2 ....A10
where A.Cdate >= #date
and A.Cdate <= #date
Second method:
With CTE as(select A.id from A where A.Cdate>=#date and A.Cdate<=#date)
select CTE.id, A1.name,.... from CTE join A1 join A2....A10
Here the second method is fast. What is the reason? In the first method, only the filtered data of A will be joined with the other tables' data, right?
Execution plans will tell us for sure, but likely if the CTE is able to filter out a lot of rows before the join, a more optimal join approach may have been chosen (for example merge instead of hash). SQL isn't supposed to work that way - in theory those two should execute the same way. But in practice we find that SQL Server's optimizer isn't perfect and can be swayed in different ways based on a variety of factors, including statistics, selectivity, parallelism, a pre-existing version of that plan in the cache, etc.
One suggestion: you should be able to answer this question yourself, as you have both plans. Did you compare the two plans? Are they similar? Also, when you say performance is bad, what do you mean: elapsed time, CPU time, I/O? What did you compare?
So before you post any question, you should check these counters; I am sure they will provide some kind of answer in most cases.
A CTE is for managing the code; it won't improve the performance of the query automatically. CTEs are expanded by the optimizer, so in your case both versions should be the same query after transformation or expansion, and thus produce similar plans.
A CTE is basically for handling more complex code (be it recursive or containing a subquery), but you need to check the execution plans to understand the difference between the two queries.
You can check for the use of CTE at : http://msdn.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
Just a note: in many scenarios, temp tables give better performance than CTEs, so you should give temp tables a try as well.
Reference : http://social.msdn.microsoft.com/Forums/en/transactsql/thread/d040d19d-016e-4a21-bf44-a0359fb3c7fb
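For instance, the question's CTE could be materialized into a temp table like this (a sketch; the @date value and the join column are placeholders, since the original query elides them):
DECLARE @date datetime;
SET @date = '20240101';  -- placeholder filter value

-- Materialize the filtered rows of A first
SELECT A.id
INTO #filteredA
FROM A
WHERE A.Cdate >= @date
  AND A.Cdate <= @date;

-- Then join the much smaller temp table to the remaining tables
SELECT f.id, A1.name
FROM #filteredA AS f
JOIN A1 ON A1.AId = f.id;  -- placeholder join column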

Is My T-SQL Query Written Efficiently?

SELECT o.oxxxxID,
m.mxxxx,
txxxx,
exxxxName,
paxxxxe,
fxxxxe,
pxxxx,
axxxx,
nxxxx,
nxxxx,
nxxxx,
ixxxx,
CONVERT(VARCHAR, o.dateCreated, 103)
FROM Offer o INNER JOIN Mxxxx m ON o.mxxxxID = m.mxxxxID
INNER JOIN EXXXX e ON e.exxxxID = o.exxxxID
INNER JOIN PXXXX p ON p.pxxxxxxID = o.pxxxxID
INNER JOIN Fxxxx f ON f.fxxxxxID = o.fxxxxxID
WHERE o.cxxxxID = 11
The above query is expected to be executed via website by approximately 1000 visitors daily. Is it badly written and has a high chance to cause lack of performance? If yes, can you please suggest me how to improve it.
NOTE: every table has only one index (Primary key).
Looks good to me.
Now for the performance piece you need to make sure you have the proper indexes covering the columns you are filtering and joining (Foreign Keys, etc).
A good start would be to look at the actual execution plan or, the easy route, run the query through the Database Engine Tuning Advisor.
The actual execution plan in SQL 2008 (perhaps 2005 as well) will give you missing indexes hints already on the top.
It's hard to tell without knowing the content of the data, but it looks like a perfectly valid SQL statement. The many joins will likely degrade performance a bit, but you can use a few strategies for improving performance... I have a few ideas.
indexed views can often improve performance
stored procedures allow the query to be compiled once and the optimized plan to be cached and reused
or, if possible, create a one-off table that's not live but contains the data from this statement in a non-normalized format. This one-off table would need to be updated regularly, but you can get some huge performance boosts using this strategy if it's possible in your situation.
For general performance issues and ideas, this is a good place to start, if you haven't already: http://msdn.microsoft.com/en-us/library/ff647793.aspx
This one is very good as well: http://technet.microsoft.com/en-us/magazine/2006.01.boostperformance.aspx
That depends mostly on the keys and indexes defined on the tables. If you could provide those, a better answer could be given. While the query looks OK (other than the xxx's in all the names), if you're joining on fields with no indexes, or the field in the WHERE clause has no index, then you may run into performance issues on larger data sets.
It looks pretty good to me. Probably the only improvement I might make is to output o.datecreated as is and let the client format it.
You could also add indexes to the join columns.
There may also be a potential to create an indexed view if performance is an issue and space isn't.
Actually, your query looks perfectly well written. The only thing we can't know is whether indexes and keys exist on the columns that you are using in the JOINs and the WHERE clause. Other than that, I don't see anything that can be improved.
If you only have single indexes on the primary keys, then it is unlikely the indexes will be covering for all the data output in your select statement. So what will happen is that the query can efficiently locate the rows for each primary key but it will need to use bookmark lookups to find the data rows and extract the additional columns.
So, although the query itself is probably fine (except for the date conversion), as long as all these columns are truly needed in the output, the execution plan could probably be improved by adding additional columns to your indexes. A clustered index key is not allowed to have included columns, and the clustered index is probably also what enforces your primary key; since you are unlikely to want to add other columns to your primary key, this means creating an additional nonclustered index with the PK column first and then including the additional columns.
At this point the indexes will cover the query and it will not need to do the bookmark lookups. Note that the indexes need to support the most common usage scenarios and that the more indexes you add, the slower your write performance will be, since all the indexes will need to be updated.
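A sketch of such a covering index for the query in the question (the obfuscated names are carried over; the INCLUDE list would need to be extended with every other Offer column in the select list):
CREATE NONCLUSTERED INDEX IX_Offer_cxxxxID
ON Offer (cxxxxID)
INCLUDE (oxxxxID, mxxxxID, exxxxID, pxxxxID, fxxxxxID, dateCreated);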
In addition, you might also want to review your constraints, since the optimizer can use them to eliminate joins when a table contributes no output columns and the optimizer can determine that there will be no outer join or cross join that would eliminate or multiply rows.

Slow SQL Query due to inner and left join?

Can anyone explain this behavior or how to get around it?
If you execute this query:
select *
from TblA
left join freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
inner join DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
It will be very very very slow.
If you change that query to use two inner joins instead of a left join, it will be very fast. If you change it to use two left joins instead of an inner join, it will be very fast.
You can observe this same behavior if you use a sql table variable instead of the freetexttable as well.
The performance problem arises any time you have a table variable (or freetexttable) and a table in a different database catalog where one is in an inner join and the other is in a left join.
Does anyone know why this is slow, or how to speed it up?
A general rule of thumb is that OUTER JOINs cause the number of rows in a result set to increase, while INNER JOINs cause the number of rows in a result set to decrease. Of course, there are plenty of scenarios where the opposite is true as well, but it's more likely to work this way than not. What you want to do for performance is keep the size of the result set (working set) as small as possible for as long as possible.
Since both joins match on the first table, changing up the order won't affect the accuracy of the results. Therefore, you probably want to do the INNER JOIN before the LEFT JOIN:
SELECT *
FROM TblA
INNER JOIN DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
LEFT JOIN freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
As a practical matter, the query optimizer should be smart enough to compile to use the faster option, regardless of which order you specified for the joins. However, it's good practice to pretend that you have a dumb query optimizer, and that query operations happen in order. This helps future maintainers spot potential errors or assumptions about the nature of the tables.
Because the optimizer should re-write things, this probably isn't good enough to fully explain the behavior you're seeing, so you'll still want to examine the execution plan used for each query, and probably add an index as suggested earlier. This is still a good principle to learn, though.
What you should usually do is turn on the "Show Actual Execution Plan" option and then take a close look at what is causing the slowdown. (hover your mouse over each join to see the details) You'll want to make sure that you are getting an index seek and not a table scan.
I would assume what is happening is that SQL is being forced to pull everything from one table into memory in order to do one of the joins. Sometimes reversing the order that you join the tables will also help things.
Putting freetexttable(TblB, *, 'query') into a temp table may help if it's getting called repeatedly in the execution plan.
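A sketch of that, materializing the full-text results once (FREETEXTTABLE exposes KEY and RANK columns):
SELECT [KEY] AS TblBKey, [RANK] AS Rnk
INTO #ft
FROM FREETEXTTABLE(TblB, *, 'query');

SELECT *
FROM TblA
LEFT JOIN #ft ON TblA.ID = #ft.TblBKey
INNER JOIN DifferentDbCatalog.dbo.TblC ON TblA.ID = TblC.TblAID;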
Index the field you use to perform the join.
A good rule of thumb is to assign an index to any commonly referenced foreign or candidate keys.
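For the query in the question, that would look something like this (a sketch; it assumes TblC.TblAID is the foreign key used in the join):
CREATE NONCLUSTERED INDEX IX_TblC_TblAID
ON DifferentDbCatalog.dbo.TblC (TblAID);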
