Some context: I have two tables, smalltable and bigtable. Smalltable contains 10,000 rows, whereas bigtable contains 2,000,000, and I am using SQL Server 2008. I started with a query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
This query was running for over 15 minutes before I killed it - on inspecting the query plan, it was using a nested loop to do the inner join.
I then rewrote the query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname)
UNION
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name6=t2.firstname and t1.name1=t2.lastname)
The two halves of this query were instead executed using a Hash Match, and the whole query ran in 4 seconds.
My question is: why does SQL Server get the query plan so wrong, and was there any way I could have fixed the original query without rewriting it? I tried adding a hint to use a Hash Match to the first query, but it seems that this is not allowed with multiple join criteria?
Update: Added examples of the kind of data in the tables as requested. Note, the code is looking for name matches where names may have been swapped around:
Smalltable(Columns name1,name6)
John, Smith
Johnny, Smith
Smythe, Jon
Michaels, Robert
Bob, Brown
Bigtable (Columns firstname,lastname)
John, Smith
John, Smythe
Johnny, Smith
Alison, Roberts
Robert, Michaels
Janet, Green
This is a limitation of the SQL Server optimizer.
The condition (t1.name1=t2.firstname and t1.name6=t2.lastname) on its own can be satisfied with a clustered index seek, so it is very fast and executes almost instantly even with very large tables.
But the condition with OR
(t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
generally performs much worse, usually resulting in a full scan. You should look at the execution plans for your queries.
The Query Optimizer will always perform a table scan or a clustered
index scan on a table if the WHERE clause in the query contains an OR
operator and if any of the referenced columns in the OR clause are not
indexed (or do not have a useful index). Because of this, if you use
many queries with OR clauses, you will want to ensure that each
referenced column in the WHERE clause has an index.
If you have a query that uses ORs and it is not making the best use
of indexes, consider rewriting it as a UNION and then testing
performance. Only through testing can you be sure that one version of
your query will be faster than another.
See here. In short, your first OR query does not make good use of indexes.
I believe there is no way around rewriting the query with UNION (as you did) or APPLY; the optimizer will not do this transformation for you.
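A minimal sketch of one more way to phrase the rewrite, assuming the table and column names from the question. UNION ALL skips the duplicate-removal step that UNION performs, and the extra filter on the second branch keeps a pair from being returned twice (a pair can only satisfy both branches when name1 and name6 hold the same value). This is not exactly equivalent to UNION, which would also collapse rows that are duplicated in the source tables themselves, so treat it as something to test rather than a drop-in replacement.

```sql
-- Branch 1: names match in the same order.
SELECT t1.*, t2.*
FROM dbo.smalltable AS t1
INNER JOIN dbo.bigtable AS t2
    ON t1.name1 = t2.firstname
   AND t1.name6 = t2.lastname

UNION ALL

-- Branch 2: names match in swapped order.
-- A pair matches both branches only when name1 = name6,
-- so excluding those rows here avoids returning the pair twice.
SELECT t1.*, t2.*
FROM dbo.smalltable AS t1
INNER JOIN dbo.bigtable AS t2
    ON t1.name6 = t2.firstname
   AND t1.name1 = t2.lastname
WHERE t1.name1 <> t1.name6;
```

Each branch has a plain equality join condition, so the optimizer is free to pick a hash or merge join for it, as it did for the UNION version in the question.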
Related
Let's say I have a (hypothetical) table called Table1 with 500 columns and there is a view called View1 which is basically
select Column1, Column2,..., Column500, ComputedOrForeignKeyColumn1,...
from Table1
inner join ForeignKeyTables .....
Now, when I execute something like
Select Column32, Column56
from View1
which one of the below 3 does SQL Server turn it into?
Query #1:
select Column32, Column56
from
(select
Column1, Column2,..., Column500, ComputedOrForeignKeyColumn1,...
from
Table1
inner join
ForeignKeyTables ......) v
Query #2:
Select Column32, Column56
from Table1
Query #3:
select Column32, Column56
from
(select Column32, Column56
from Table1) v
The reason I'm asking is that I have a very wide table with a view sitting on top of it (the view basically inner joins to bring in the texts for all foreign key ids). I can't figure out whether SQL Server fetches all columns and then selects the ones that are needed, or fetches only those that are needed (while also ignoring unnecessary joins etc.). If it is the former, then a view would not be the best choice for performance.
SQL Server query compilation can be split into phases:
Parsing
Binding
Optimization
View resolution is performed during binding. At this stage the view reference is replaced with its definition. At this point, unused view columns will be present.
The next stage is optimization, where the bound syntax tree is transformed into an execution plan. The optimizer considers many kinds of manipulations on the execution plan to increase efficiency, and removing unused columns is one of the most basic. At this point, the unused column references will be removed.
So to answer your question, unused columns in the view definition will not impact performance, since the optimizer will be smart enough to remove them.
Note: this answer assumes the view is not indexed. For indexed views, the resolution process works differently, and there is view maintenance overhead for UPDATEs of the base tables.
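One way to check this on your own schema is to compare the I/O reported for the view query and for the equivalent base-table query. This is only a sketch: View1, Table1 and the column names come from the question, and the dbo schema is an assumption.

```sql
-- Report per-table I/O for each statement in the Messages output.
SET STATISTICS IO ON;

-- Query through the view, asking for two columns only.
SELECT Column32, Column56
FROM dbo.View1;

-- The same columns straight from the base table, for comparison.
SELECT Column32, Column56
FROM dbo.Table1;

SET STATISTICS IO OFF;
```

The output lists each table a statement touched along with its logical reads. If a joined table from the view definition shows up for the first statement, that join was not eliminated; comparing the two actual execution plans shows the same thing in more detail.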
None of the above. SQL Server will parse the query and create an execution plan. The resulting execution plan is calculated based on many factors, such as indexes, joins, etc.
Your question cannot be truly answered by anyone other than you, by examining that execution plan.
See How do I obtain a Query Execution Plan? for more information.
The view definition is merged with the outer query at a very early stage of compilation. You may or may not get the same execution plan for a query on a view as for an equivalent query touching the base tables, depending on the complexity of the view and the limitations of the QO.
For your particular case it's worth noting that an inner join doesn't only fetch data from joined tables, but it also limits the result (in the same way as an IF EXISTS check does). If there is a declarative FK between the tables, the QO will be smart enough not to check the referenced tables, as the existence is guaranteed by the constraint, but otherwise it has to.
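As far as I know, the optimizer only relies on a foreign key in this way when the constraint is enabled and trusted. A small sketch for checking that, assuming the referencing table is dbo.Table1 as in the question:

```sql
-- List the foreign keys declared on dbo.Table1 and whether
-- SQL Server currently considers each one enabled and trusted.
SELECT fk.name,
       OBJECT_NAME(fk.referenced_object_id) AS referenced_table,
       fk.is_disabled,
       fk.is_not_trusted
FROM sys.foreign_keys AS fk
WHERE fk.parent_object_id = OBJECT_ID(N'dbo.Table1');
```

A constraint that was disabled and re-enabled without validating existing rows shows is_not_trusted = 1; re-enabling it with ALTER TABLE ... WITH CHECK CHECK CONSTRAINT makes SQL Server validate the data again.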
I've recently been learning something very new to me - FULLTEXT Indexes.
It seems that I can run two separate queries (using CONTAINSTABLE) with the same parameters against two separate tables and get an almost instantaneous answer (sub 10ms). However, when I combine the two, the query takes 1.3 seconds - or 130+ times slower!!
Below are the queries (simplified for the purpose of this question).
Query 1:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), @query) FBCONT ON FB.ID = FBCONT.[KEY]
WHERE
FBCONT.[KEY] IS NOT NULL
Query 2:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBalls, (Col1), @query) FBSCONT ON FBS.ID = FBSCONT.[KEY]
WHERE
FBSCONT.[KEY] IS NOT NULL
Query Combined:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), @query) FBCONT ON FB.ID = FBCONT.[KEY]
LEFT JOIN CONTAINSTABLE(dbo.FooBalls, (Col1), @query) FBSCONT ON FBS.ID = FBSCONT.[KEY]
WHERE
(FBCONT.[KEY] IS NOT NULL OR FBSCONT.[KEY] IS NOT NULL)
Perhaps my research has missed something but can someone give me an indicator as to why having both clauses together reduces performance by over 130 times?
NOTES:
I've checked the relevant indexes for joining exist - verified by the speed of the individual queries.
There are actually more joins involved in the process - however they are completely unrelated to the tables being queried, and again responses are under 10ms when searching for results in 100,000+ records.
I tried replacing the CONTAINSTABLE with individual CONTAINS statements - performance was massively degraded as my research would lead me to expect.
A catalog has been set up that references ONLY the four columns from the two tables being queried
The @query parameter is currently declared as NVARCHAR(50). I've read that using NVARCHAR is faster as implicit conversions are not required.
I know I could do a dirty UNION ALL on both queries separately, but I'd prefer to write better queries if possible rather than hack it together. Additionally, UNION ALL would leave me with potential duplicates if the @query value was in two columns from separate tables linked to one record.
Any further suggestions would be gratefully received.
Your question comments suggest you improved performance to a satisfactory level by rewriting an unrelated part of the query (not shown in the question).
This is fair enough if it works, but doesn't explain why the two separate queries and the combined query differ so significantly, when other unrelated parts of the query are kept constant.
It's difficult to say confidently without seeing a query plan and statistics results; however I can think of two possibilities based solely on reasoning about how the SQL queries are written:
One or both of the ID columns (from FooBar and FooBalls) may be non-unique in the row set after these two tables have been inner joined. Doing two, rather than one, join to CONTAINSTABLE result sets may thus be "breeding" rather more records than a single join does; larger result sets take longer to be passed back to the client and displayed. To test this: compare the row counts returned by the two separate queries, and compare these to the row counts of each separate query if the WHERE clauses are omitted. Larger row counts will typically suggest a longer query elapsed time (all other things being equal).
Each of the separate queries has been written with a left outer join, but the result set is then restricted to only include rows where the join has succeeded. This is effectively an inner join: SQL Server's query planner may well be identifying this fact and choosing an execution plan as if an inner join had been specified. Conversely, the combined query requires rows where either join (but not necessarily both) has succeeded, which is a true left join. The execution plan is likely to use different, slower approaches for these joins. To test this: look at the execution plans, and compare them to the execution plans for the separate queries with inner joins requested instead of left joins.
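If the second explanation applies, one thing worth testing is to combine the two full-text result sets first and then inner join to the combined key list. The sketch below assumes the table and column names from the question, that ID is the full-text key column of both tables, and a search string held in a T-SQL variable written here as @query (the sample value is made up). UNION, unlike UNION ALL, removes duplicate keys, so a record that matches in both tables is still returned only once.

```sql
DECLARE @query nvarchar(50) = N'"example"';  -- hypothetical search term

SELECT FB.*, FBS.*
FROM dbo.FooBar AS FB
INNER JOIN dbo.FooBalls AS FBS
    ON FB.ID = FBS.ID
INNER JOIN (
    -- Keys of rows matching in either full-text index.
    SELECT [KEY] FROM CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), @query)
    UNION
    SELECT [KEY] FROM CONTAINSTABLE(dbo.FooBalls, (Col1), @query)
) AS hits
    ON hits.[KEY] = FB.ID;
```

Because FB.ID = FBS.ID is already required by the first join, matching the combined keys against FB.ID covers hits from either table. Whether this produces a better plan than the OR version is something only the execution plan can confirm.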
I am using SQL Server 2008 full-text search and I am having serious performance issues depending on how I use Contains or ContainsTable.
Here are some samples. (table1 has about 5000 records, and there is a covering index on table1 which includes all the fields in the where clause. I tried to simplify the statements, so forgive me if there are syntax issues.)
Scenario 1:
select * from table1 as t1
where t1.field1=90
and t1.field2='something'
and Exists(select top 1 * from containstable(table1,*, 'something') as t2
where t2.[key]=t1.id)
results: 10 seconds (very slow)
Scenario 2:
select * from table1 as t1
join containstable(table1,*, 'something') as t2 on t2.[key] = t1.id
where t1.field1=90
and t1.field2='something'
results: 10 seconds (very slow)
Scenario 3:
Declare @tbl Table(id uniqueidentifier primary key)
insert into @tbl select [key] from containstable(table1,*, 'something')
select * from table1 as t1
where t1.field1=90
and t1.field2='something'
and Exists(select id from @tbl as tbl where tbl.id=t1.id)
results: fraction of a second (super fast)
Bottom line: it seems that if I use Containstable in any kind of join or where clause condition of a select statement that also has other conditions, the performance is really bad. In addition, if you look at Profiler, the number of reads from the database goes through the roof. But if I first do the full-text search, put the results in a table variable and use that variable, everything goes super fast. The number of reads is also much lower. It seems that in the "bad" scenarios it somehow gets stuck in a loop which causes it to read many times from the database, but of course I don't understand why.
Now the first question is: why is that happening? And question two: how scalable are table variables? What if the search results in tens of thousands of records? Is it still going to be fast?
Any ideas?
Thanks
I spent quite some time on this issue, and based on running many scenarios, this is what I figured out:
If you have Contains or ContainsTable anywhere in your query, that is the part that gets executed first, and rather independently. That means that even if the rest of the conditions limit your search to only one record, neither Contains nor ContainsTable cares about that. So this is like a parallel execution.
Now, since full-text search only returns a Key field, it immediately looks for the Key as the first field of the other indexes chosen for the query. So for the example above, it looks for an index on [key], field1, field2. The problem is that it chooses an index for the rest of the query based on the fields in the where clause, so for the example above it picks the covering index that I have, which is something like field1, field2, Id. (The Id of the table is the same as the [Key] returned from the full-text search.) So in summary, it:
executes containstable
executes the rest of the query and picks an index based on the where clause of the query
It then tries to merge these two. If the index that it picked for the rest of the query starts with the [key] field, it is fine. However, if the index doesn't have the [key] field as the first key, it starts doing loops. It does not even do a table scan, otherwise going through 5000 records would not be that slow. The way it does the loop is that it runs the loop for the total number of results from FTS multiplied by the total number of results from the rest of the query. So if the FTS returns 2000 records and the rest of the query returns 3000, it loops 2000*3000 = 6,000,000 times. I do not understand why.
So in my case it does the full-text search, then it does the rest of the query but picks the covering index that I have, which is based on field1, field2, id (which is wrong), and as a result it performs badly. If I change my covering index to Id, field1, field2, everything is very fast.
My expectation was that FTS returns a bunch of [key] values, the rest of the query returns a bunch of [Id] values, and then the Ids are matched against the [key] values.
Of course, I tried to simplify my query here, but the actual query is much more complicated and I cannot just change the index. I also have scenarios where the text passed to full-text search is blank, and in those scenarios I do not even want to join with containstable.
In those cases, changing my covering index to have the id field as the first field would be a disaster.
Anyway, for now I chose the temp table solution since it is working for me. I am also limiting the result to a few thousand rows, which helps with the potential performance issues of table variables when the number of records gets too high.
thanks
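For reference, a sketch of what the temp table variant of Scenario 3 might look like. The table and column names follow the question; the uniqueidentifier key type is taken from the table variable in Scenario 3, and the row cap of 5000 is an arbitrary illustrative value. Unlike table variables, temp tables get column statistics, which can help the optimizer when the number of hits grows.

```sql
-- Materialise the full-text hits first.
CREATE TABLE #hits (id uniqueidentifier PRIMARY KEY);

INSERT INTO #hits (id)
SELECT TOP (5000) [KEY]          -- arbitrary cap on the number of hits
FROM CONTAINSTABLE(table1, *, 'something')
ORDER BY [RANK] DESC;

-- Then filter the main query against the materialised keys.
SELECT t1.*
FROM table1 AS t1
WHERE t1.field1 = 90
  AND t1.field2 = 'something'
  AND EXISTS (SELECT 1 FROM #hits AS h WHERE h.id = t1.id);

DROP TABLE #hits;
```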
Normally it works very fast:
select t1.*, t2.Rank
from containstable(table1, field2, 'something') as t2
join table1 as t1 ON t1.id = t2.[Key] AND t1.field1=90
order by t2.Rank desc
There is a big difference where you put your search criteria: in JOIN or in WHERE.
I'm going to take a guess here that your issue is the same as on the other thread I linked to. Are you finding the issue arises with multiple word search terms?
If so my answer from that thread will apply.
From http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506240
The most important thing is that the
correct join type is picked for
full-text query. Cardinality
estimation on the FulltextMatch STVF
is very important for the right plan.
So the first thing to check is the
FulltextMatch cardinality estimation.
This is the estimated number of hits
in the index for the full-text search
string. For example, in the query in
Figure 3 this should be close to the
number of documents containing the
term ‘word’. In most cases it should
be very accurate but if the estimate
was off by a long way, you could
generate bad plans. The estimation for
single terms is normally very good,
but estimating multiple terms such as
phrases or AND queries is more complex
since it is not possible to know what
the intersection of terms in the index
will be based on the frequency of the
terms in the index. If the cardinality
estimation is good, a bad plan
probably is caused by the query
optimizer cost model. The only way to
fix the plan issue is to use a query
hint to force a certain kind of join
or OPTIMIZE FOR.
So it simply cannot know from the information it stores whether the 2 search terms together are likely to be quite independent or commonly found together. Maybe you should have 2 separate procedures one for single word queries that you let the optimiser do its stuff on and one for multi word search terms that you force a "good enough" plan on (sys.dm_fts_index_keywords might help if you want to do a rough estimate of cardinality yourself).
If you are getting the issue with single word queries this passage from the linked article might apply.
In SQL Server 2008 full-text search we have the ability to alter the plan that is
generated based on a cardinality estimation of the search term used. If the query plan is fixed (as it is in a parameterized query inside a stored procedure), this step does
not take place. Therefore, the compiled plan always serves this query, even if this plan is not ideal for a given search term.
So you might need to use the RECOMPILE option.
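A minimal sketch of what that could look like, reusing the table and column names from the question. The variable name @search and its value are assumptions made for illustration.

```sql
DECLARE @search nvarchar(200) = N'"something"';  -- hypothetical search term

SELECT t1.*
FROM table1 AS t1
INNER JOIN CONTAINSTABLE(table1, *, @search) AS ft
    ON ft.[KEY] = t1.id
WHERE t1.field1 = 90
  AND t1.field2 = 'something'
-- Compile a fresh plan on every execution, so the plan can reflect
-- the cardinality estimate for the current search term.
OPTION (RECOMPILE);
```

If the estimate is reasonable but the plan is still poor, the quoted article's other suggestion is a join hint; for example, replacing the last line with OPTION (HASH JOIN) asks the optimizer to use hash joins for the whole query. Either hint is worth comparing against the unhinted plan before keeping it.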
I have 2 tables, Person_Organization and Person_Organization_other, and the nested query is:
SELECT
Person_Organization_id
FROM
Person_Organization_other
WHERE
company_name IN (SELECT company_name
FROM Person_Organization_other
WHERE Person_Organization_id IN (SELECT Person_Organization_Id
FROM Person_Organization
WHERE person_id = 117
AND delete_flag = 0)
)
The corresponding query with joins that I tried is:
SELECT
poo.Person_Organization_id
FROM
Person_Organization_other poo, Person_Organization_other poo1, Person_Organization po
WHERE
poo1.Person_Organization_id = po.Person_Organization_Id
AND po.person_id = 117
AND po.delete_flag = 0
AND poo.company_name = poo1.company_name
GROUP BY
poo.Person_Organization_id
However, the nested query is found to take less time than its corresponding query with joins. I used a SQL Profiler trace to compare the times of the executed queries. The nested query took 30-odd ms; the joined query took 41-odd ms.
I was under the impression that as a rule nested queries are less performant and should be "flattened out" using joins.
Could someone explain what I am doing wrong?
regards
Nitin
You are using cross joins. Try inner joins.
select poo.Person_Organization_id
from Person_Organization po
INNER JOIN Person_Organization_other poo1 ON
poo1.Person_Organization_id=po.Person_Organization_Id
INNER JOIN Person_Organization_other poo ON
poo.company_name=poo1.company_name
where po.person_id=117 AND po.delete_flag=0
group by poo.Person_Organization_id
By separating your tables with commas, you are effectively CROSS JOINing them together. I would try doing explicit INNER JOINs between the tables and see if that helps performance.
The view that nested queries are less performant and should be flattened out using joins is a myth - it is true that inappropriate nested subqueries can cause performance issues, however in many cases using a subquery is just as good as using a join.
In fact the SQL server optimises all queries that it executes by reducing them to an execution tree - often queries that use a JOIN end up with identical execution trees to equivalent sql statements that use nested queries instead.
In this case the execution time of these is really low anyway - the difference could just as easily be explained as due to caches etc... not being filled.
My advice would be to use whatever syntax makes more sense to you - if you have a performance problem then by all means go back and check to see if a nested subquery is the cause of your problem, however I definitely wouldn't spend time worrying about "flattening out" queries that aren't causing problems.
The order of your tables might be reducing performance: the tables in the FROM clause should be listed in increasing order of number of rows.
I have a query that originally looks like this:
select c.Id, c.Name, c.CountryCode, c.CustomerNumber, cacc.AccountNumber, ca.Line1, ca.CityName, ca.PostalCode
from dbo.Customer as c
left join dbo.CustomerAddress as ca on ca.CustomerId = c.Id
left join dbo.CustomerAccount as cacc on cacc.CustomerId = c.Id
where c.CountryCode = 'XX' and (cacc.AccountNumber like '%C17%' or c.Name like '%op%'
or ca.Line1 like '%ae%' or ca.CityName like '%ab%' or ca.PostalCode like '%10%')
On a database with 90,000 records this query takes around 7 seconds to execute (obviously all the joins and likes are taxing).
I have been trying to find a way to bring the query execution time down with full-text search on the columns concerned. However, I haven't seen an example of a full-text search that has three table joins like this, especially since my join condition is not part of the search term.
Is there a way to do this in full-text search?
@David
Yep, there are indexes on the Ids.
I've tried adding indexes on the CustomerAddress stuff (CityName, PostalCode, etc.) and it brought down the query to 3 seconds, but I still find that too slow for something like this.
Note that all of the text fields (with the exception of the ids) are nvarchars, and Line1 is an nvarchar(1000), so that might affect the speed, but still.
Run it through the query analyzer and see what the query plan is. My guess would be that the double root (i.e. %ae%) searches are causing it to do a table scan when looking for the matching rows. Double root searches are inherently slow, as you usually can't use any kind of index to match them.
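To illustrate the difference, here is a small sketch using the CustomerAddress table from the question and assuming there is an index on CityName. Comparing the two execution plans should show whether the leading wildcard is what forces the scan.

```sql
-- Leading wildcard: the match can start anywhere in the string,
-- so every CityName value has to be examined.
SELECT ca.CustomerId
FROM dbo.CustomerAddress AS ca
WHERE ca.CityName LIKE '%ab%';

-- Prefix only: the pattern fixes the start of the string,
-- so an index on CityName can be used to seek to the matching range.
SELECT ca.CustomerId
FROM dbo.CustomerAddress AS ca
WHERE ca.CityName LIKE 'ab%';
```

The two queries are not interchangeable, since the second only finds values that begin with 'ab'; the point is only to see how the plan changes.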
NOTE: This isn't really an answer, just an attempt to clarify what might actually be causing the performance problem(s).
90,000 records is really a fairly small data set and the query is relatively simple with just two joins. Do you have indexes on CustomerAddress.CustomerId and CustomerAccount.CustomerId? That seems more likely to be causing performance issues than the LIKE predicates in the where condition. Are you typically searching to find a match on all of those columns at the same time?
I would echo David's suggestion. You'd probably want to examine how the RDBMS is executing your query (e.g., via table scans or using indexes).
One quick check would be to time just the part of the query involving the text search. Something like this:
SELECT ca.Line1, ca.CityName, ca.PostalCode
FROM CustomerAddress as ca
WHERE ca.CustomerId = <some id number>
AND (ca.Line1 LIKE '%ae%' OR ca.CityName LIKE '%ab%' OR ca.PostalCode LIKE '%10%');
If that takes a long time, then the LIKEs are the issue (remove one expression at a time from the ORed line to see if just one of those columns is causing the slowdown). If it's quick, then the joins are suspect.
You could write a similar query for the CustomerAccount table as well.