I've recently been learning something very new to me - FULLTEXT Indexes.
It seems that I can run off two separate queries (using CONTAINSTABLE) on the same parameters against two separate tables are gain an almost instantaneous answer (sub 10ms) however when I combined the two together, the query takes 1.3 seconds - or 130+ times slower!!
Below are the queries (simplified for the purpose of this question).
Query 1:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), #query) FBCONT ON FB.ID = FBCONT.[KEY]
WHERE
FBCONT.[KEY] IS NOT NULL
Query 2:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBalls, (Col1), #query) FBSCONT ON FBS.ID = FBSCONT.[KEY]
WHERE
FBSCONT.[KEY] IS NOT NULL
Query Combined:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), #query) FBCONT ON FB.ID = FBCONT.[KEY]
LEFT JOIN CONTAINSTABLE(dbo.FooBalls, (Col1), #query) FBSCONT ON FBS.ID = FBSCONT.[KEY]
WHERE
(FBCONT.[KEY] IS NOT NULL OR FBSCONT.[KEY] IS NOT NULL)
Perhaps my research has missed something but can someone give me an indicator as to why having both clauses together reduces performance by over 130 times?
NOTES:
I've checked the relevant indexes for joining exist - verified by the speed of the individual queries.
There are actually more joins involved in the process - however they are completely unlinked to the tables being queries and again response are under 10ms when searching for results in 100,000 plus records.
I tried replacing the CONTAINSTABLE with individual CONTAINS statements - performance was massively degraded as my research would lead me to expect.
A catalog has been set up that references ONLY the four columns from the two tables being queried
The #query parameter is set to NVARCHAR (50) at the present. I've read that using NVACHAR is faster as implicit conversions are not required.
I know I could do a dirty UNION ALL on both queries separately, but I'd prefer to writer better queries if possible rather than hack it together. Additionally UNION ALL would leave me with potential duplicates if #query value was in two columns from separate tables linked to one record.
Any further suggestions would be greatly received.
Your question comments suggest you improved performance to a satisfactory level by rewriting an unrelated part of the query (not shown in the question).
This is fair enough if it works, but doesn't explain why the two separate queries and the combined query differ so significantly, when other unrelated parts of the query are kept constant.
It's difficult to say confidently without seeing a query plan and statistics results; however I can think of two possibilities based solely on reasoning about how the SQL queries are written:
One or both of the ID columns (from FooBar and FooBalls) may be non-unique in the row set after these two tables have been inner joined. Doing two, rather than one, join to CONTAINSTABLE result sets may thus be "breeding" rather more records than a single join does; larger result sets take longer to be passed back to the client and displayed. To test this: compare the row counts returned by the two separate queries, and compare these to the row counts of each separate query if the WHERE clauses are omitted. Larger row counts will typically suggest a longer query elapsed time (all other things being equal).
Each of the separate queries has been written with a left outer join, but the result set is then restricted to only include rows where the join has succeeded. This is effectively an inner join: SQL Server's query planner may well be identifying this fact and choosing an execution plan as if an inner join had been specified. Conversely, the combined query requires rows where either join (but not necessarily both) have succeeded, which is a true left join. The execution plan is likely to use different, slower, approaches for these joins. To test this: look at the execution plans, and compare to execution plans for the separate queries with inner joins requested instead of left joins.
Related
Is it possible to use join hint with a cross join in T-SQL? If so what is the syntax?
select *
from tableA
cross ? join tableB
Based on your comments
I am trying to fix my execution plan, my estimated rows are very off
in the nested loop join. I have changed a cursor to a cross join...
The code is faster now with the cross join, but I want to make it even
faster. So I just want to experiment with a join hint...
I have 900 out af 2000000 as actual and estimated for the nested loop
join..And I think it is the step where the cross join is happening...
it is a table from ETL so a lot of new data every day ..
I have a few suggestions
Don't go straight for a cross join. If it's doing a nested loop join because of really bad cardinality estimation, try using a hash join hint instead
It definitely can help to have statistics up-to-date (research the 'Ascending Key Problem' for info). However, you may want to check if your statistics are set to auto-update and whether they get triggered (e.g., after the ETL, view the properties of the statistics to see when they were last updated etc)
Try to fix the bad cardinality estimate. One way is to split the bigger tasks into smaller tasks (e.g., into temporary tables).
On the chance you're using table variables (e.g., DECLARE #temptable TABLE) rather than temporary tables (e.g., CREATE TABLE #TempTable) then stop it. Variables (including table variables) don't have statistics. Older versions often assume 1 row in table variables. SQL Server 2019 (as long as you're in the latest compatibility mode) has some changes to this, but still has some big issues.
When you get it down to the one operation that has the bad cardinality estimate, you can also do things like adding indexes/etc to help with that estimate (remember - you can put indexes and primary keys on temporary tables - they can speed up processing too if the table is accessed multiple times).
Some context: I have two tables, smalltable and bigtable. Smalltable contains 10,000 rows, whereas bigtable contains 2,000,000, and I am using SQL Server 2008. I started with a query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
This query was running for over 15 minutes before I killed it - on inspecting the query plan, it was using a nested loop to do the inner join.
I then rewrote the query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname)
UNION
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name6=t2.firstname and t1.name1=t2.lastname)
The two queries above then instead executed using a Hash Match, and the whole query ran in 4 seconds.
My question is, why does SQL Server get the query plan so wrong, and was there anyway that I could have fixed the original query without rewriting it? I tried adding a hint to use a Hash Match to the first query, but it seems that you are not allowed to with multiple join criteria?
Update: Added examples of the kind of data in the tables as requested. Note, the code is looking for name matches where names may have been swapped around:
Smalltable(Columns name1,name6)
John, Smith
Johnny, Smith
Smythe, Jon
Michaels, Robert
Bob, Brown
Bigtable (Columns firstname,lastname)
John, Smith
John, Smythe
Johnny, Smith
Alison, Roberts
Robert, Michaels
Janet, Green
It is a problem within SQL Server optimizer.
The condition (t1.name1=t2.firstname and t1.name6=t2.lastname) uses clustered index seek only and thus is very fast and executes almost instantly even with very large tables.
But the condition with OR
(t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
generally performs much worse usually performing full scan. You should see execution plans for your queries.
The Query Optimizer will always perform a table scan or a clustered
index scan on a table if the WHERE clause in the query contains an OR
operator and if any of the referenced columns in the OR clause are not
indexed (or do not have a useful index). Because of this, if you use
many queries with OR clauses, you will want to ensure that each
referenced column in the WHERE clause has an index.
If you have a query that uses ORs and it is not making the best use
of indexes, consider rewriting it as a UNION and then testing
performance. Only through testing can you be sure that one version of
your query will be faster than another.
See here. So shortly your first OR query does not make good use of indexes.
I believe there is no other way than rewrite the query with UNION (as you did) or APPLY, optimizer will not do it.
Here is the example:
SELECT <columns>
FROM (..........<subquery>..........) AS xxx
INNER JOIN(s) with xxx.............
LEFT OUTER JOIN(s) with xxx........
WHERE <filter conditions>
Please correct me if I'm wrong:
Is that <subquery> a derived table?
Is it a problem if it returns too much data (say millions of rows) regarding server memory, since i know that WHERE clause is applied to the final result set and leaving the server processing too much from the subquery even if the final result has 10 rows?
What if there was no inner join (to reduce the data) and only left outer join, does that make things even worse/slow since it has to make the join with all the rows from the subquery?
If (2) is a problem then one solution I think of would be to limit the data returned by the subquery by adding other joins inside which would make things slower (I've tried that). Any other thoughts on this?
What if I can't limit the result from the subquery since the where clause depends on the joins from after the subquery?
To clarify things out, the reason the subquery returns too much data is because I'm trying to combine data from multiple tables using UNION ALL (with no filtering conditions) and then, foreach row returned by the subquery, join to get the info I need to use it in the WHERE clause. Another way to do this is to do all the joins that you see outside the subquery for each of the UNION ALL from inside the subquery, which yes, does limit the result sets but makes more joins which, as I said, slow things down. In other words, I have to choose between a subquery that does this:
(
SELECT * FROM A UNION ALL
SELECT * FROM B UNION ALL
SELECT * FROM C...
) AS xxx
left outer join T with xxx
AND
SELECT * FROM A
LEFT OUTER JOIN T ...
WHERE....
UNION ALL
SELECT * FROM B
LEFT OUTER JOIN T ...
WHERE....
UNION ALL
SELECT * FROM C
LEFT OUTER JOIN T ...
WHERE....
Yes it is.
No, the query optimizer treats the whole query as one block. It doesn't run a derived table then run the outer statement on the result. It 'optimizes through' derived tables.
Again, no. Having a derived table doesn't mean bad performance. You always have to look at your query as a whole.
It's not a problem.
Then that's just fine. Trust the query optimizer. Have you ever met the people that wrote it? They are scary intelligent.
In each individual case, it is worth looking at your query execution plan and finding pain points. Looks for things that are doing scans when they could be doing seeks, and that will usually give you a significant boost. Things do scans and not seeks when:
There is no index to seek upon
The thing you are seeking is the result of a function (e.g. WHERE function(field) = value)
The optimizer decides that a scan is actually faster.
But the bottom line answer to the question is - no, you should not be worried that derived tables would contain a lot of data if you selected them out in isolation.
I've been pondering the question which of those 2 Statements might have a higher performance (and why):
select * from formelement
where formid = (select id from form where name = 'Test')
or
select *
from formelement fe
inner join form f on fe.formid = f.id
where f.name = 'Test'
One form contains several form elements, one form element is always part of one form.
Thanks,
Dennis
look at the execution plan, most likely it will be the same if you add the filtering to the join, that said the join will return everything from both tables, the in will not
I actually prefer EXISTS over those two
select * from formelement fe
where exists (select 1 from form f
where f.name='Test'
and fe.formid =f.id)
The performance depends on the query plan choosen by the SQL Server Engine. The query plan depends on a lot of factors, including (but not limited to) the SQL, the exact table structure, the statistics of the tables, available indexes, etc.
Since your two queries are quite simple, my guess would be that they result in the same (or a very similar) execution plan, thus yielding comparable performance.
(For large, complicated queries, the exact wording of the SQL can make a difference, the book SQL Tuning by Dan Tow gives a lot of great advice on that.)
I have 2 tables Person_Organization and Person_Organization_other and nested query is :
SELECT
Person_Organization_id
FROM
Person_Organization_other
WHERE
company_name IN (SELECT company_name
FROM Person_Organization_other
WHERE Person_Organization_id IN (SELECT Person_Organization_Id
FROM Person_Organization
WHERE person_id = 117
AND delete_flag = 0)
)
Whereas the above query's corresponding query with join that I tried is :-
SELECT
poo.Person_Organization_id
FROM
Person_Organization_other poo, Person_Organization_other poo1, Person_Organization po
WHERE
poo1.Person_Organization_id = po.Person_Organization_Id
AND po.person_id = 117
AND po.delete_flag = 0
AND poo.company_name = poo1.company_name
GROUP BY
poo.Person_Organization_id
However the nested query is found to take less time as compared to it's corresponding query with joins. I used SQL profiler trace to compare times of executed queries. For the nested query it took 30 odd ms. For the joined query it took 41 odd ms
I was under the impression that as a rule nested queries are less perfomant and should be "flattened out" using joins.
Could someone explain what I am doing wrong?
regards
Nitin
You are using cross joins. Try inner joins.
select poo.Person_Organization_id
from Person_Organization po
INNER JOIN Person_Organization_other poo ON
poo.Person_Organization_id=po.Person_Organization_Id
INNER JOIN Person_Organization_other poo1 ON
poo1.Person_Organization_id=po.Person_Organization_Id AND
poo.company_name=poo1.company_name
where po.person_id=117 AND po.delete_flag=0
group by poo.Person_Organization_id
By separating your tables with commas, you are effectively CROSS JOINing them together. I would try doing explicit INNER JOINs between the tables and see if that helps performance.
The view that nested queries are less performant and should be flattened out using joins is a myth - it is true that inappropriate nested subqueries can cause performance issues, however in many cases using a subquery is just as good as using a join.
In fact the SQL server optimises all queries that it executes by reducing them to an execution tree - often queries that use a JOIN end up with identical execution trees to equivalent sql statements that use nested queries instead.
In this case the execution time of these is really low anyway - the difference could just as easily be explained as due to caches etc... not being filled.
My advice would be to use whatever syntax makes more sense to you - if you have a performance problem then by all means go back and check to see if a nested subquery is the cause of your problem, however I definitely wouldn't spend time worrying about "flattening out" queries that aren't causing problems.
Your order of tables might reduce the performance your table order in from clause should be in increasing order of number of rows