Should Full Outer Joins be avoided in Power Query? - sql-server

I created a full outer join in Power Query that joins 2 views from SQL Server, both of which run quickly on their own. The query sometimes takes a while (a few minutes) to run and, despite the views being READ UNCOMMITTED, has blocked other users. Now I need to create another full outer join, but I am nervous about doing so. Full outer joins in Power Query also seem to break query folding, although in this new case that's irrelevant: the join is not to another query/view in SQL Server, so it can't be folded anyway.
Should full outer joins in Power Query be avoided?

Related

Query speeds using Access front with SQL backend

I have been working on converting an Access database over to MS SQL. For my initial testing I have imported the back end data onto my system with SQL Server Express 2014. So far I have been able to get everything to work with my Access front end except for one query.
Setting the primary keys on the tables in question has helped somewhat, but not fully. When I run the query in Access it takes about 10 seconds, and when I run it from a second computer it takes upwards of 30 seconds. However, if I run the query directly in SQL Server Management Studio it runs within a second.
I'm not sure whether the slowdown is because I'm running SQL Server off my laptop, because it's SQL Server Express, or a combination of the two. I'm hoping someone can provide some more information for me.
Here is a copy of the query:
SELECT tbl_defects.*,
tbl_parts.part_type,
tbl_parts.number,
tbl_parts.mold,
tbl_parts.date_created,
tbl_parts.blade,
tbl_parts.product,
tbl_defects.defects_id
FROM tbl_parts
RIGHT JOIN (tbl_dispositions
RIGHT JOIN tbl_defects
ON tbl_dispositions.dispositions_id =
tbl_defects.disposition)
ON tbl_parts.parts_id = tbl_defects.part
ORDER BY tbl_defects.defects_id DESC;
On tbl_defects the primary key is defects_id, and it is indexed. On tbl_dispositions the primary key is dispositions_id, and it is indexed. On the third table, tbl_parts, the primary key is parts_id, and it is indexed as well.
If I switch any of the Right Joins to be Inner Joins the query will run properly but it will be missing about 2000 records.
It has been my experience too that some queries with multiple LEFT (or RIGHT) JOINs perform badly when run in Access on linked tables.
I suggest creating a Pass-Through query for this, if possible (if you don't need to edit the result set). This will run directly on the server, as it would in SSMS.
Or you could create a view on SQL Server and link that.
Can you try running the query using the more conventional LEFT JOIN approach?
FROM tbl_Defects LEFT JOIN
tbl_Dispositions
ON tbl_Dispositions.Dispositions_ID = tbl_Defects.Disposition LEFT JOIN
tbl_Parts
ON tbl_Parts.Parts_ID = tbl_Defects.Part
If you need parentheses:
FROM (tbl_Defects LEFT JOIN
tbl_Dispositions
ON tbl_Dispositions.Dispositions_ID = tbl_Defects.Disposition
) LEFT JOIN
tbl_Parts
ON tbl_Parts.Parts_ID = tbl_Defects.Part
Perhaps something is getting garbled in the execution plan due to the joins. RIGHT JOIN can be quite tricky, because the parsing of the FROM clause goes from left to right. I'm pretty sure this will return the same result set, however.

Simple join takes far too long to run due to query plan

Some context: I have two tables, smalltable and bigtable. Smalltable contains 10,000 rows, whereas bigtable contains 2,000,000, and I am using SQL Server 2008. I started with a query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
This query was running for over 15 minutes before I killed it - on inspecting the query plan, it was using a nested loop to do the inner join.
I then rewrote the query as follows:
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name1=t2.firstname and t1.name6=t2.lastname)
UNION
select * from [dbo].[smalltable] t1
INNER JOIN
[dbo].[bigtable] t2
on (t1.name6=t2.firstname and t1.name1=t2.lastname)
The rewritten query instead executed using a Hash Match, and the whole query ran in 4 seconds.
My question is, why does SQL Server get the query plan so wrong, and was there any way that I could have fixed the original query without rewriting it? I tried adding a hint to use a Hash Match to the first query, but it seems that this is not allowed with multiple join criteria?
Update: Added examples of the kind of data in the tables as requested. Note, the code is looking for name matches where names may have been swapped around:
Smalltable(Columns name1,name6)
John, Smith
Johnny, Smith
Smythe, Jon
Michaels, Robert
Bob, Brown
Bigtable (Columns firstname,lastname)
John, Smith
John, Smythe
Johnny, Smith
Alison, Roberts
Robert, Michaels
Janet, Green
This is a limitation of the SQL Server optimizer.
The condition (t1.name1=t2.firstname and t1.name6=t2.lastname) can be satisfied with a clustered index seek alone, so it is very fast and executes almost instantly even on very large tables.
But the condition with OR
(t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)
generally performs much worse, usually resulting in a full scan. You should compare the execution plans for your queries.
The Query Optimizer will always perform a table scan or a clustered
index scan on a table if the WHERE clause in the query contains an OR
operator and if any of the referenced columns in the OR clause are not
indexed (or do not have a useful index). Because of this, if you use
many queries with OR clauses, you will want to ensure that each
referenced column in the WHERE clause has an index.
If you have a query that uses ORs and it is not making the best use
of indexes, consider rewriting it as a UNION and then testing
performance. Only through testing can you be sure that one version of
your query will be faster than another.
See here. In short, your first query, the one with OR, does not make good use of indexes.
I believe there is no other way than to rewrite the query with UNION (as you did) or APPLY; the optimizer will not do it for you.
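As a quick sanity check on the rewrite, here is a minimal sketch that loads the sample rows from the question and confirms that the OR form and the UNION form return the same matches. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made only so the example is runnable), so it says nothing about which plan SQL Server would pick, only that the results agree.

```python
import sqlite3

# SQLite in-memory database as a stand-in for SQL Server (assumption).
# This checks result equivalence only, not SQL Server's plan choice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE smalltable (name1 TEXT, name6 TEXT);
    CREATE TABLE bigtable (firstname TEXT, lastname TEXT);
    INSERT INTO smalltable VALUES
        ('John', 'Smith'), ('Johnny', 'Smith'), ('Smythe', 'Jon'),
        ('Michaels', 'Robert'), ('Bob', 'Brown');
    INSERT INTO bigtable VALUES
        ('John', 'Smith'), ('John', 'Smythe'), ('Johnny', 'Smith'),
        ('Alison', 'Roberts'), ('Robert', 'Michaels'), ('Janet', 'Green');
""")

# Original form: a single join with an OR between the two name orderings.
or_query = """
    SELECT * FROM smalltable t1
    INNER JOIN bigtable t2
      ON (t1.name1 = t2.firstname AND t1.name6 = t2.lastname)
      OR (t1.name6 = t2.firstname AND t1.name1 = t2.lastname)
"""

# Rewritten form: one equality join per name ordering, combined with UNION.
union_query = """
    SELECT * FROM smalltable t1
    INNER JOIN bigtable t2
      ON (t1.name1 = t2.firstname AND t1.name6 = t2.lastname)
    UNION
    SELECT * FROM smalltable t1
    INNER JOIN bigtable t2
      ON (t1.name6 = t2.firstname AND t1.name1 = t2.lastname)
"""

or_rows = sorted(conn.execute(or_query).fetchall())
union_rows = sorted(conn.execute(union_query).fetchall())

print(or_rows == union_rows)  # True for this sample data
for row in union_rows:
    print(row)
```

One caveat: UNION removes duplicate rows, so the two forms can differ if the join itself produces exact duplicate rows (for example, when either table contains repeated names). That does not happen with the sample data above.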

Execution of TSQL statement

I am aware of the order in which SQL statements are executed, but I still want to confirm a few things with the help of the SQL experts here. I have a big SQL query which returns thousands of rows. Here is a minimized version of the query that I wrote and believe to be correct.
Select *
from property p
inner join tenant t on (t.hproperty = p.hmy and p.hmy = 7)
inner join commtenant ct on ct.htenant = t.hmyperson
where 1=1
My colleague says that the above query is equivalent, performance-wise, to the query below (he is very confident about it):
Select *
from property p
inner join tenant t on (t.hproperty = p.hmy)
inner join commtenant ct on ct.htenant = t.hmyperson
where p.hmy = 7
Could anybody explain why the two queries above are, or are not, equivalent? Thanks.
If you want to know if two queries are equivalent, learn how to look at the execution plans in SQL Server Management Studio. You can put the two queries in different windows, look at the estimated execution plans, and see for yourself if they are the same.
In this case, they probably are the same. SQL is intended to be a descriptive language, not a procedural language. That is, it describes the output you want, but the SQL engine is allowed to rewrite the query to be as efficient as possible. The two forms you have describe the same output. Do note that if there were a left outer join instead of an inner join, then the queries would be different.
In all likelihood, the engine will read the table and filter the records during the read or use an index for the read. The key idea, though, is that the output is the same and SQL Server can recognize this.
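To make the inner-versus-outer point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption, as are the sample rows; the commtenant join is left out for brevity). It runs the filter in the ON clause and in the WHERE clause, first with an inner join and then with a left join.

```python
import sqlite3

# SQLite in-memory stand-in for SQL Server (assumption); rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE property (hmy INTEGER PRIMARY KEY);
    CREATE TABLE tenant (hmyperson INTEGER PRIMARY KEY, hproperty INTEGER);
    INSERT INTO property VALUES (7), (8);
    INSERT INTO tenant VALUES (1, 7), (2, 7), (3, 8);
""")

def rows(join_type, filter_in_on):
    """Run the query with the p.hmy = 7 filter in either ON or WHERE."""
    if filter_in_on:
        sql = f"""SELECT p.hmy, t.hmyperson FROM property p
                  {join_type} JOIN tenant t
                    ON t.hproperty = p.hmy AND p.hmy = 7"""
    else:
        sql = f"""SELECT p.hmy, t.hmyperson FROM property p
                  {join_type} JOIN tenant t
                    ON t.hproperty = p.hmy
                  WHERE p.hmy = 7"""
    # key=repr avoids comparing None with integers while sorting.
    return sorted(conn.execute(sql).fetchall(), key=repr)

inner_on = rows("INNER", True)
inner_where = rows("INNER", False)
left_on = rows("LEFT", True)
left_where = rows("LEFT", False)

print(inner_on == inner_where)  # True: same rows either way
print(left_on)     # [(7, 1), (7, 2), (8, None)]: property 8 is preserved
print(left_where)  # [(7, 1), (7, 2)]: WHERE filters property 8 out
```

With the inner join the placement of the filter does not change the result. With the left join it does, because a condition in the ON clause only decides which tenants match, while a condition in the WHERE clause removes whole rows.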
"p.hmy = 7" is not a join condition, as it relates only to a single table. As such, it doesn't really belong in the ON clause of the join. Since you are not adding any information by placing the condition in the ON clause, having it in the WHERE clause (in which it really belongs) will not make any difference to the query plan generated. If in doubt, look at the query plans.

FULL TEXT INDEX - Huge Performance Decrease on Multiple Tables

I've recently been learning something very new to me - FULLTEXT Indexes.
It seems that I can run two separate queries (using CONTAINSTABLE) with the same parameters against two separate tables and get an almost instantaneous answer (under 10 ms). However, when I combine the two, the query takes 1.3 seconds, or 130+ times slower!
Below are the queries (simplified for the purpose of this question).
Query 1:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), @query) FBCONT ON FB.ID = FBCONT.[KEY]
WHERE
FBCONT.[KEY] IS NOT NULL
Query 2:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBalls, (Col1), @query) FBSCONT ON FBS.ID = FBSCONT.[KEY]
WHERE
FBSCONT.[KEY] IS NOT NULL
Query Combined:
SELECT
*
FROM
dbo.FooBar FB
INNER JOIN dbo.FooBalls FBS on FB.ID = FBS.ID
LEFT JOIN CONTAINSTABLE(dbo.FooBar, (Col1, Col2, Col3), @query) FBCONT ON FB.ID = FBCONT.[KEY]
LEFT JOIN CONTAINSTABLE(dbo.FooBalls, (Col1), @query) FBSCONT ON FBS.ID = FBSCONT.[KEY]
WHERE
(FBCONT.[KEY] IS NOT NULL OR FBSCONT.[KEY] IS NOT NULL)
Perhaps my research has missed something, but can someone give me an indication as to why having both clauses together reduces performance by over 130 times?
NOTES:
I've checked that the relevant indexes for joining exist; this is borne out by the speed of the individual queries.
There are actually more joins involved in the process. However, they are completely unrelated to the tables being queried, and again responses are under 10 ms when searching for results in 100,000-plus records.
I tried replacing the CONTAINSTABLE calls with individual CONTAINS statements. Performance was massively degraded, as my research led me to expect.
A catalog has been set up that references ONLY the four columns from the two tables being queried.
The @query parameter is set to NVARCHAR(50) at present. I've read that using NVARCHAR is faster because implicit conversions are not required.
I know I could do a dirty UNION ALL on both queries separately, but I'd prefer to write better queries if possible rather than hack it together. Additionally, UNION ALL would leave me with potential duplicates if the @query value was in two columns from separate tables linked to one record.
Any further suggestions would be gratefully received.
Your question comments suggest you improved performance to a satisfactory level by rewriting an unrelated part of the query (not shown in the question).
This is fair enough if it works, but doesn't explain why the two separate queries and the combined query differ so significantly, when other unrelated parts of the query are kept constant.
It's difficult to say confidently without seeing a query plan and statistics results; however I can think of two possibilities based solely on reasoning about how the SQL queries are written:
One or both of the ID columns (from FooBar and FooBalls) may be non-unique in the row set after these two tables have been inner joined. Performing two joins to CONTAINSTABLE result sets, rather than one, may thus be "breeding" rather more records than a single join does; larger result sets take longer to be passed back to the client and displayed. To test this: compare the row counts returned by the two separate queries, and compare these to the row counts of each separate query if the WHERE clauses are omitted. Larger row counts will typically suggest a longer query elapsed time (all other things being equal).
Each of the separate queries has been written with a left outer join, but the result set is then restricted to only include rows where the join has succeeded. This is effectively an inner join: SQL Server's query planner may well be identifying this fact and choosing an execution plan as if an inner join had been specified. Conversely, the combined query requires rows where either join (but not necessarily both) has succeeded, which is a true left join. The execution plan is likely to use different, slower approaches for these joins. To test this: look at the execution plans, and compare them to the execution plans for the separate queries with inner joins requested instead of left joins.
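The second possibility can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module in place of SQL Server, and two ordinary tables, BarHits and BallsHits, as hypothetical stand-ins for the two CONTAINSTABLE result sets (SQLite has no CONTAINSTABLE, so all of this is an assumption for illustration). It shows that a left join filtered with IS NOT NULL returns the same rows as an inner join, while the combined OR query has to keep rows where only one side matched.

```python
import sqlite3

# SQLite stand-in for SQL Server; BarHits/BallsHits are hypothetical tables
# playing the role of the two CONTAINSTABLE result sets (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FooBar (ID INTEGER PRIMARY KEY);
    CREATE TABLE FooBalls (ID INTEGER PRIMARY KEY);
    CREATE TABLE BarHits (hit_key INTEGER PRIMARY KEY);
    CREATE TABLE BallsHits (hit_key INTEGER PRIMARY KEY);
    INSERT INTO FooBar VALUES (1), (2), (3), (4);
    INSERT INTO FooBalls VALUES (1), (2), (3), (4);
    INSERT INTO BarHits VALUES (1), (2);
    INSERT INTO BallsHits VALUES (2), (3);
""")

BASE = "SELECT FB.ID FROM FooBar FB INNER JOIN FooBalls FBS ON FB.ID = FBS.ID "

def ids(sql):
    """Return the matching IDs as a sorted list."""
    return sorted(r[0] for r in conn.execute(sql))

# Left join restricted to successful matches behaves like an inner join.
left_filtered = ids(BASE + "LEFT JOIN BarHits H ON FB.ID = H.hit_key "
                           "WHERE H.hit_key IS NOT NULL")
inner = ids(BASE + "INNER JOIN BarHits H ON FB.ID = H.hit_key")

# Combined query: either match may succeed, so both must stay outer joins.
combined = ids(BASE + "LEFT JOIN BarHits H1 ON FB.ID = H1.hit_key "
                      "LEFT JOIN BallsHits H2 ON FBS.ID = H2.hit_key "
                      "WHERE H1.hit_key IS NOT NULL OR H2.hit_key IS NOT NULL")

# Two inner-join queries glued together, with and without de-duplication.
q1 = BASE + "INNER JOIN BarHits H ON FB.ID = H.hit_key"
q2 = BASE + "INNER JOIN BallsHits H ON FBS.ID = H.hit_key"
union = ids(q1 + " UNION " + q2)
union_all = ids(q1 + " UNION ALL " + q2)

print(left_filtered == inner)  # True
print(combined, union)         # [1, 2, 3] [1, 2, 3]
print(union_all)               # [1, 2, 2, 3]: ID 2 matched on both sides
```

In this sketch a plain UNION (rather than UNION ALL) of the two inner-join queries gives the same IDs as the combined query without the duplicate. That only holds when the selected columns are identical in both halves; if rank columns from CONTAINSTABLE are selected, rows for the same ID may differ and would not be collapsed.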

Join queries taking more execution time than their corresponding nested queries

I have 2 tables, Person_Organization and Person_Organization_other, and the nested query is:
SELECT
Person_Organization_id
FROM
Person_Organization_other
WHERE
company_name IN (SELECT company_name
FROM Person_Organization_other
WHERE Person_Organization_id IN (SELECT Person_Organization_Id
FROM Person_Organization
WHERE person_id = 117
AND delete_flag = 0)
)
The corresponding query with joins that I tried is:
SELECT
poo.Person_Organization_id
FROM
Person_Organization_other poo, Person_Organization_other poo1, Person_Organization po
WHERE
poo1.Person_Organization_id = po.Person_Organization_Id
AND po.person_id = 117
AND po.delete_flag = 0
AND poo.company_name = poo1.company_name
GROUP BY
poo.Person_Organization_id
However, the nested query turns out to take less time than its corresponding query with joins. I used a SQL Profiler trace to compare the execution times. The nested query took about 30 ms; the joined query took about 41 ms.
I was under the impression that, as a rule, nested queries are less performant and should be "flattened out" using joins.
Could someone explain what I am doing wrong?
regards
Nitin
You are using cross joins. Try inner joins.
select poo.Person_Organization_id
from Person_Organization po
INNER JOIN Person_Organization_other poo1 ON
poo1.Person_Organization_id=po.Person_Organization_Id
INNER JOIN Person_Organization_other poo ON
poo.company_name=poo1.company_name
where po.person_id=117 AND po.delete_flag=0
group by poo.Person_Organization_id
By separating your tables with commas, you are effectively CROSS JOINing them together. I would try doing explicit INNER JOINs between the tables and see if that helps performance.
The view that nested queries are less performant and should be flattened out using joins is a myth. It is true that inappropriate nested subqueries can cause performance issues; however, in many cases using a subquery is just as good as using a join.
In fact, SQL Server optimises every query it executes by reducing it to an execution tree. Queries that use a JOIN often end up with execution trees identical to those of equivalent SQL statements that use nested queries instead.
In this case the execution times are really low anyway; the difference could just as easily be explained by caches etc. not being filled.
My advice would be to use whatever syntax makes more sense to you - if you have a performance problem then by all means go back and check to see if a nested subquery is the cause of your problem, however I definitely wouldn't spend time worrying about "flattening out" queries that aren't causing problems.
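If you want to check that the different spellings really describe the same result before comparing timings, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, with made-up rows (both are assumptions), and runs the nested IN form, the comma-separated form from the question, and an explicit INNER JOIN form.

```python
import sqlite3

# SQLite in-memory stand-in for SQL Server (assumption); rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person_Organization (
        Person_Organization_Id INTEGER PRIMARY KEY,
        person_id INTEGER,
        delete_flag INTEGER);
    CREATE TABLE Person_Organization_other (
        Person_Organization_id INTEGER PRIMARY KEY,
        company_name TEXT);
    INSERT INTO Person_Organization VALUES
        (1, 117, 0), (2, 117, 1), (3, 200, 0), (4, 201, 0);
    INSERT INTO Person_Organization_other VALUES
        (1, 'Acme'), (2, 'Beta'), (3, 'Acme'), (4, 'Gamma');
""")

nested = """
    SELECT Person_Organization_id FROM Person_Organization_other
    WHERE company_name IN (
        SELECT company_name FROM Person_Organization_other
        WHERE Person_Organization_id IN (
            SELECT Person_Organization_Id FROM Person_Organization
            WHERE person_id = 117 AND delete_flag = 0))
"""

comma_join = """
    SELECT poo.Person_Organization_id
    FROM Person_Organization_other poo, Person_Organization_other poo1,
         Person_Organization po
    WHERE poo1.Person_Organization_id = po.Person_Organization_Id
      AND po.person_id = 117 AND po.delete_flag = 0
      AND poo.company_name = poo1.company_name
    GROUP BY poo.Person_Organization_id
"""

explicit_join = """
    SELECT poo.Person_Organization_id
    FROM Person_Organization po
    INNER JOIN Person_Organization_other poo1
        ON poo1.Person_Organization_id = po.Person_Organization_Id
    INNER JOIN Person_Organization_other poo
        ON poo.company_name = poo1.company_name
    WHERE po.person_id = 117 AND po.delete_flag = 0
    GROUP BY poo.Person_Organization_id
"""

def ids(sql):
    """Return the selected IDs as a sorted list."""
    return sorted(r[0] for r in conn.execute(sql))

nested_ids, comma_ids, explicit_ids = ids(nested), ids(comma_join), ids(explicit_join)
print(nested_ids, comma_ids, explicit_ids)  # [1, 3] [1, 3] [1, 3]
```

All three forms return the same IDs on this data. The comma-separated form still carries its join predicates in the WHERE clause, so it is not an unrestricted cross join in terms of the result.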
The order of your tables might be reducing performance; try listing the tables in the FROM clause in increasing order of row count.

Resources