How can I perform a conditional join in mssql? - sql-server

I want to join a table to one of two possible tables, depending on data. Here's an attempt that did not work, but gets the idea across, I hope. Also, this is a mocked up example that may not be very realistic, so don't get too hung up on the idea this is representing real students and classes.
SELECT *
FROM
student
INNER JOIN class ON class.student_id = student.student_id
CASE
WHEN class.complete=0
THEN RIGHT OUTER JOIN report ON report.label_id = inprogress.class_id
WHEN class.complete=1
THEN RIGHT OUTER JOIN report ON report.label_id = completed.class_id
END
Any ideas?

You have two join conditions, and if either is true you want the join to apply - that's a boolean OR operation.
You simply need to:
RIGHT OUTER JOIN report ON (CONDITION1) OR (CONDITION2)
Let's unravel that for a moment, though: what is condition 1 and what is condition 2?
WHEN class.complete=0
THEN RIGHT OUTER JOIN report ON report.label_id = inprogress.class_id
WHEN class.complete=1
THEN RIGHT OUTER JOIN report ON report.label_id = completed.class_id
Here you're combining two conditions in each branch, so your condition 1 is:
class.complete = 0 AND report.label_id = inprogress.class_id
and your condition 2 is
class.complete = 1 AND report.label_id = completed.class_id
So the completed SQL should be something like this (untested, off the top of my head):
RIGHT OUTER JOIN report ON (
    class.complete = 0 AND report.label_id = inprogress.class_id
) OR (
    class.complete = 1 AND report.label_id = completed.class_id
)
Worth mentioning...
I haven't run the above join, but I know from experience that its execution plan will be absolutely abominable. That won't matter if performance isn't important or your data set is small, but if performance does matter, I strongly suggest you post a broader scope of what you want here so we can talk about a better approach to getting your particular data set, one that won't perform so terribly. I would personally write a join like the above only as a last resort, or if I was hacking on something truly unimportant.

try this (untested) -
SELECT *
FROM student s
INNER JOIN class c ON c.student_id = s.student_id
LEFT OUTER JOIN report ir ON ir.label_id = inprogress.class_id AND c.complete = 0
LEFT OUTER JOIN report cr ON cr.label_id = completed.class_id AND c.complete = 1

If you want to join to either of 2 tables with reasonable performance, write a stored procedure for each path (one that joins table 1 to table A, one that joins table 1 to table B). Make a third stored procedure that calls either the 1-A stored procedure or the 1-B stored procedure. This way an efficient query plan is used in each case, without forcing a recompile on each call or generating a strange query plan.
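A minimal sketch of that dispatch pattern, using the generic table 1 / table A / table B naming from this answer (every object, column, and parameter name below is a placeholder, not the asker's schema):
-- Hypothetical sketch: one procedure per join path, plus a dispatcher.
CREATE PROCEDURE dbo.Get_1_A @key INT
AS
    SELECT t1.*, a.*
    FROM dbo.Table1 AS t1
    INNER JOIN dbo.TableA AS a ON a.t1_id = t1.id  -- placeholder join key
    WHERE t1.id = @key;
GO
CREATE PROCEDURE dbo.Get_1_B @key INT
AS
    SELECT t1.*, b.*
    FROM dbo.Table1 AS t1
    INNER JOIN dbo.TableB AS b ON b.t1_id = t1.id  -- placeholder join key
    WHERE t1.id = @key;
GO
-- The dispatcher picks a path; each inner procedure keeps its own stable plan.
CREATE PROCEDURE dbo.Get_1 @key INT, @useA BIT
AS
    IF @useA = 1
        EXEC dbo.Get_1_A @key;
    ELSE
        EXEC dbo.Get_1_B @key;
GO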
In your case, you actually want some records from both of the tables you might join to. So union the results together rather than picking one or the other (you can combine them in one stored procedure if you want; it shouldn't hurt the query plan). If you are sure the records won't duplicate between the two queries (it seems like they wouldn't), use UNION ALL rather than UNION for performance, as usual.
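And a hedged sketch of the UNION ALL variant, under the same placeholder names (each branch must return the same column list):
-- Each branch handles one join path; UNION ALL skips the duplicate-removal
-- step that UNION would add.
SELECT t1.id, a.some_column AS detail
FROM dbo.Table1 AS t1
INNER JOIN dbo.TableA AS a ON a.t1_id = t1.id  -- placeholder join key
UNION ALL
SELECT t1.id, b.some_column AS detail
FROM dbo.Table1 AS t1
INNER JOIN dbo.TableB AS b ON b.t1_id = t1.id; -- placeholder join key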

Related

Is there an equivalent to OR clause in CONTAINSTABLE - FULL TEXT INDEX

I am trying to find a way to improve the string-searching process, and I selected a FULL-TEXT INDEX strategy.
However, after implementing it, I can still see a performance hit when searching for multiple strings across multiple full-text indexed tables with OR clauses.
(e.g. WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%'))
As a solution, I am trying to use CONTAINSTABLE expecting a performance improvement.
Now, I am facing an issue with CONTAINSTABLE when it comes to joining tables with a LEFT JOIN.
Please go through the example below.
Query 1
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
LEFT JOIN CONTAINSTABLE(P.Building,*,'%John%') AS FFTIndex ON F.ID = FFTIndex.[Key]
LEFT JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
LEFT JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
LEFT JOIN P.Person p ON pr2.ID = p.PID
LEFT JOIN CONTAINSTABLE(P.Person,FirstName,'%John%') AS PFTIndex ON P.ID = PFTIndex.[Key]
WHERE F.Name IS NOT NULL
This produces the below result.
Query 2
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
INNER JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
INNER JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
INNER JOIN P.Person p ON pr2.ID = p.PID
WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%')
AND F.Name IS NOT NULL
Result
Expectation
I want Query 1 to behave like the SQL Server OR clause in Query 2. As far as I understand, Query 1's first CONTAINSTABLE joins against the Building table and the remaining results are ignored, so the CONTAINSTABLE on the Person table only ever sees data that has already been filtered by the keyword match on the Building table.
If the keyword is, say, 'Building', I want it matched against both tables without requiring the same saved record to match in both; a matching record in either table is enough.
Summary
Query 2 performs well, but it slows down as the number of words in the indexes grows. Query 1 seems to be the optimized form (according to multiple online resources and the MS documentation);
however, it does not give me the expected output.
Is there any way to solve this problem?
I am not strictly attached to CONTAINSTABLE. Suggestions for another optimization method would also be welcome.
Thank you.
Hard to say definitively without your full data set, but here are a couple of options to explore.
Remove Invalid % Wildcards
Why are you using '%SearchTerm%'? Does performance improve if you use the search term without the wildcards (%)? If you want a word that matches a prefix, try something like
WHERE CONTAINS (String,'"SearchTerm*"')
Try Temp Tables
My guess is CONTAINS is slightly faster than CONTAINSTABLE as it doesn't calculate a rank, but I don't know if anyone has ever attempted to benchmark it. Either way, I'd try saving off the matches to a temp table before joining up to the rest of the tables. This will allow the optimizer to create a better execution plan
SELECT ID INTO #Temp
FROM YourTable
WHERE CONTAINS (String,'"SearchTerm"')
SELECT *
FROM #Temp
INNER JOIN...
Optimize Full Text Index by Removing Noisy Words
You might find you have some noisy words, i.e. words that recur many times in your data but carry no meaning, like "the" or perhaps some business jargon. Adding these to your stoplist means your full-text index will ignore them, making the index smaller and thus faster (a sketch of adding words to a stoplist follows the query below).
The query below lists the indexed words with the most frequent at the top:
Select *
From sys.dm_fts_index_keywords(Db_Id(),Object_Id('dbo.YourTable') /*Replace with your table name*/)
Order By document_count Desc
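Once you've identified candidates, adding them to a stoplist looks roughly like this (a sketch; the stoplist name and noise word are placeholders):
-- Start from the system stoplist, add a custom noise word,
-- then point the table's full-text index at the new stoplist.
CREATE FULLTEXT STOPLIST MyStopList FROM SYSTEM STOPLIST;
ALTER FULLTEXT STOPLIST MyStopList ADD 'somejargonword' LANGUAGE 1033;
ALTER FULLTEXT INDEX ON dbo.YourTable SET STOPLIST MyStopList;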
This OR That Criteria
Your WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%') criteria, where you want this or that, is tricky. OR clauses generally perform poorly even when using simple equality operators.
I'd try doing two queries and unioning the results, like:
SELECT * FROM Table1 F
/*Other joins and stuff*/
WHERE CONTAINS(F.*,'%Gayan%')
UNION
SELECT * FROM Table2 P
/*Other joins and stuff*/
WHERE CONTAINS(P.FirstName,'%John%')
Or, and this is much more work, you could load all your data into a giant denormalized table with all your columns, apply a full-text index to that table, and adjust your search criteria that way. It'd probably be the fastest method for searching, but then you'd have to keep the data in sync between the denormalized table and the underlying normalized tables (the join key and index details below are placeholders):
SELECT B.*, P.* INTO DenormalizedTable
FROM Building AS B
INNER JOIN People AS P ON P.BuildingID = B.ID -- placeholder join key
CREATE FULLTEXT INDEX ON DenormalizedTable (/* columns to search */) KEY INDEX PK_DenormalizedTable
etc...

Possible causes of a slow ORDER BY in a SQL Server statement

I have the following query, which returns 1550 rows.
SELECT *
FROM V_InventoryMovements -- 2 seconds
ORDER BY V_InventoryMovements.TransDate -- 23 seconds
It takes about 2 seconds to return the results.
But when I include the ORDER BY clause, then it takes about 23 seconds.
It is a BIG change just for adding an ORDER BY.
I would like to know what is happening, and a way to improve the query with the ORDER BY. Removing the ORDER BY is not an acceptable solution.
Here is a bit of information; please let me know if you need more.
V_InventoryMovements
CREATE VIEW [dbo].[V_InventoryMovements]
AS
SELECT some_fields
FROM FinTime
RIGHT OUTER JOIN V_Outbound ON FinTime.StdDate = dbo.TruncateDate(V_Outbound.TransDate)
LEFT OUTER JOIN ReasonCode_Grouping ON dbo.V_Outbound.ReasonCode = dbo.ReasonCode_Grouping.ReasonCode
LEFT OUTER JOIN Items ON V_Outbound.ITEM = Items.Item
LEFT OUTER JOIN FinTime ON V_Outbound.EventDay = FinTime.StdDate
V_Outbound
CREATE VIEW [dbo].[V_Outbound]
AS
SELECT V_Outbound_WMS.*
FROM V_Outbound_WMS
UNION
SELECT V_Transactions_Calc.*
FROM V_Transactions_Calc
V_OutBound_WMS
CREATE VIEW [dbo].[V_OutBound_WMS]
AS
SELECT some_fields
FROM Transaction_Log
INNER JOIN MFL_StartDate ON Transaction_Log.TransDate >= MFL_StartDate.StartDate
LEFT OUTER JOIN Rack ON Transaction_Log.CHARGE = Rack.CHARGE AND Transaction_Log.CHARGE_LFD = Rack.CHARGE_LFD
V_Transactions_Calc
CREATE VIEW [dbo].[V_Transactions_Calc]
AS
SELECT some_fields
FROM Transactions_Calc
INNER JOIN MFL_StartDate ON dbo.Transactions_Calc.EventDay >= dbo.MFL_StartDate.StartDate
Here I will also share the part of the execution plan where you can see the main cost. I don't know exactly how to read it or how to improve the query from it; let me know if you need to see the rest of the execution plan. All the other parts show 0% cost; the main cost is in the Nested Loops (Left Outer Join) operator, at 95%.
Execution Plan With ORDER BY
Execution Plan Without ORDER BY
I think the short answer is that the optimizer is executing in a different order in an attempt to minimize the cost of the sorting, and doing a poor job of it. Its job is made very hard by the views within views within views, as GuidoG suggests. You might be able to convince it to execute differently by creating some additional indexes or statistics, but it's going to be hard to advise on that remotely.
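As a purely hypothetical example of that kind of nudge (assuming the sort column ultimately comes from Transaction_Log.TransDate, which only you can confirm from the view definitions):
-- Hypothetical: an index on the underlying date column can sometimes let the
-- optimizer avoid a late, expensive sort. Compare the actual plans before and after.
CREATE NONCLUSTERED INDEX IX_Transaction_Log_TransDate
    ON dbo.Transaction_Log (TransDate);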
A possible workaround might be to select into a temp table, then apply the ordering afterwards:
SELECT *
INTO #temp
FROM V_InventoryMovements;
SELECT *
FROM #temp
ORDER BY TransDate

PostgreSQL: Using AND statement in LEFT JOIN is not working as expected

This query returns all the rows in the table la, with NULLs for the fields coming from the lar table, which is not what I expected.
SELECT
la.listing_id,
la.id,
lar.*
FROM la
LEFT JOIN lar
ON lar.application_id = la.id AND la.listing_id = 2780;
This query returns the correct and expected results, but shouldn't both queries do the same thing?
SELECT
la.listing_id,
la.id,
lar.*
FROM la
LEFT JOIN lar
ON lar.application_id = la.id
WHERE la.listing_id = 2780;
What am I missing here?
I want to make conditional joins, because I have noticed that for complex queries PostgreSQL does the join first and then applies the WHERE clause, which is very slow. How can I make the database filter out some records before doing the JOIN?
The confusion around LEFT JOIN and WHERE clause has been clarified many times:
SQL / PostgreSQL left join ignores "on = constant" predicate, on left table
This interesting question remains:
How to make the database filter out some records before doing the JOIN?
There are no explicit query hints in Postgres. (Which is a matter of ongoing debate.) But there are still various tricks to make Postgres bend your way.
But first, ask yourself: Why did the query planner estimate the chosen plan to be cheaper to begin with? Is your server configuration basically sane? Cost settings adequate? autovacuum running? Postgres version outdated? Are you working around an underlying problem that should really be fixed?
If you force Postgres to do it your way, be sure it won't backfire after a version upgrade or a change to the server configuration ... You'd better know exactly what you are doing.
That said, you can force Postgres to "filter out some records before doing the JOIN" with a subquery where you add OFFSET 0 - which is just noise, logically, but prevents Postgres from rearranging it into the form of a regular join. (Query hint after all)
SELECT la.listing_id, la.id, lar.*
FROM (
SELECT listing_id, id
FROM la
WHERE listing_id = 2780
OFFSET 0
) la
LEFT JOIN lar ON lar.application_id = la.id;
Or you can use a CTE (less obscure, but more expensive; a sketch follows the LATERAL example below). Or other tricks like setting certain config parameters. Or, in this particular case, I would use a LATERAL join to the same effect:
SELECT la.listing_id, la.id, lar.*
FROM la
LEFT JOIN LATERAL (
SELECT *
FROM lar
WHERE application_id = la.id
) lar ON true
WHERE la.listing_id = 2780;
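For the CTE variant mentioned above, a minimal sketch (assuming PostgreSQL 12 or later, where a plain CTE is inlined by the planner, so the MATERIALIZED keyword is needed to keep the optimization fence):
WITH la_filtered AS MATERIALIZED (  -- MATERIALIZED keeps the fence on PG 12+
   SELECT listing_id, id
   FROM   la
   WHERE  listing_id = 2780
)
SELECT la_filtered.listing_id, la_filtered.id, lar.*
FROM   la_filtered
LEFT   JOIN lar ON lar.application_id = la_filtered.id;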
Related:
Sample Query to show Cardinality estimation error in PostgreSQL
Here is an extensive blog post on query hints by 2ndQuadrant. Five years old but still valid.
The LEFT JOIN keyword returns all rows from the left table (table1), together with the matching rows from the right table (table2). The result is NULL on the right side when there is no match.
So no matter how you try to filter with AND la.listing_id = 2780, you still get all the rows from the first table; only those with la.listing_id = 2780 will have something other than NULL on the right side.
The behaviour is different if you use an INNER JOIN: in that case only the matching rows are produced, and the AND condition does filter the rows (see the example below).
So to make the first query work you need to add WHERE la.listing_id IS NOT NULL.
The problem with the second query is that it will try to JOIN every row and then filter down to only the ones you need.
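For illustration, the INNER JOIN variant described above, reusing the asker's table names (a sketch, untested against their schema):
-- With INNER JOIN, the extra AND condition really does restrict the result,
-- because rows without a match are dropped instead of padded with NULLs.
SELECT la.listing_id, la.id, lar.*
FROM la
INNER JOIN lar
   ON lar.application_id = la.id
  AND la.listing_id = 2780;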

Force joined view not to be optimized

I have a somewhat complex view which includes a join to another view. For some reason the generated query plan is highly inefficient. The query runs for many hours. However if I select the sub-view into a temporary table first and then join with this, the same query finished in a few minutes.
My question is: Is there some kind of query hint or other trick which will force the optimizer to execute the joined sub-view in isolation before performing the join, just as when using a temp table? Clearly the default strategy chosen by the optimizer is not optimal.
I cannot use the temporary-table trick since views do not allow temporary tables. I understand I could probably rewrite everything as a stored procedure, but that would break the composability of views, and it also seems bad for maintenance to rewrite everything just to trick the optimizer out of a bad optimization.
Adam Machanic explained one such way at a SQL Saturday I recently attended. The presentation was called Clash of the Row Goals. The method involves using a TOP X at the beginning of the sub-select. He explained that when doing a TOP X, the query optimizer assumes it is more efficient to grab the TOP X rows one at a time. As long as you set X as a sufficiently large number (limit of INT or BIGINT?), the query will always get the correct results.
So one example that Adam provided:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
becomes:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT TOP(2147483647)
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
It is a super cool trick and very useful.
When things get messy the query optimizer often resorts to loop joins.
If materializing to a temp table fixed it, then most likely that is the problem.
The optimizer often does not deal with views very well.
I would rewrite your view so it does not use nested views.
Join Hints (Transact-SQL)
You may be able to use these hints on views.
Try merge and hash (see the sketch at the end of this answer).
Try changing the order of the joins.
Move conditions into the join whenever possible:
select *
from table1
join table2
on table1.FK = table2.Key
where table2.desc = 'cat1'
should be
select *
from table1
join table2
on table1.FK = table2.Key
and table2.desc = 'cat1'
Now, the query optimizer will get that simple case correct, but as the query gets more complex the optimizer goes into what I call stupid mode and falls back to loop joins. That is also done to protect the server and keep as little in memory as possible.
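For the merge and hash join hints mentioned above, a minimal sketch (table names are placeholders; treat hints as a last resort and check that the plan actually improves):
-- Force a hash join for one specific join...
SELECT *
FROM table1
INNER HASH JOIN table2
    ON table1.FK = table2.[Key];
-- ...or keep the join as written and restrict the physical join types
-- for the whole query instead.
SELECT *
FROM table1
JOIN table2
    ON table1.FK = table2.[Key]
OPTION (HASH JOIN, MERGE JOIN);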

WHERE clause better execute before IN and JOIN or after

I read this article:
Logical Processing Order of the SELECT statement
At the end of the article it says the ON and JOIN clauses are considered before WHERE.
Suppose we have a master table with 10 million records and a detail table (with a FK reference to the master table) with 50 million records. We have a query that wants just 100 records of the detail table, according to a PK in the master table.
In this situation, do ON and JOIN execute before WHERE? I mean, do we have 500 million records after the JOIN and then WHERE is applied to that, or is WHERE applied first and then JOIN and ON are considered? If the second answer is true, doesn't that contradict the article above?
Thanks.
In the case of an INNER JOIN, or a table on the left in a LEFT JOIN, in many cases the optimizer will find that it is better to perform any filtering first (highest selectivity) before actually performing whatever type of physical join - so there are obviously physical orders of operations which are better than others.
To some extent you can sometimes control this (or interfere with this) with your SQL, for instance, with aggregates in subqueries.
The logical order of processing the constraints in the query can only be transformed according to known invariant transformations.
So:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is still logically equivalent to:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and they will generally have the same execution plan.
On the other hand:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is NOT equivalent to:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and so the optimizer isn't going to transform them into the same execution plan.
The optimizer is very smart and is able to move things around pretty successfully, including collapsing views and inline table-valued functions as well as even pushing things down through certain kinds of aggregates fairly successfully.
Typically, when you write SQL, it needs to be understandable, maintainable and correct. As far as efficiency in execution, if the optimizer is having difficulty turning the declarative SQL into an execution plan with acceptable performance, the code can sometimes be simplified or appropriate indexes or hints added or broken down into steps which should perform more quickly - all in successive orders of invasiveness.
It doesn't matter
Logical processing order is always honoured, regardless of actual processing order.
INNER JOINs and WHERE conditions are effectively associative and commutative (hence the ANSI-89 "join in the where" syntax) so actual order doesn't matter
Logical order becomes important with outer joins and more complex queries: applying WHERE on an OUTER table changes the logic completely.
Again, it doesn't matter how the optimiser does it internally so long as the query semantics are maintained by following logical processing order.
And the key word here is "optimiser": it does exactly what it says
Just re-reading Paul White's excellent series on the Query Optimiser and remembered this question.
It is possible to use an undocumented command to disable specific transformation rules and get some insight into the transformations applied.
For (hopefully!) obvious reasons only try this on a development instance and remember to re-enable them and remove any suboptimal plans from the cache.
USE AdventureWorks2008;
/*Disable the rules*/
DBCC RULEOFF ('SELonJN');
DBCC RULEOFF ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
You can see that with those two rules disabled it does a Cartesian join and filters afterwards.
/*Re-enable them*/
DBCC RULEON ('SELonJN');
DBCC RULEON ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
With them enabled the predicate is pushed right down into the index seek and so reduces the number of rows processed by the join operation.
There is no defined order. The SQL engine determines in what order to perform the operations, based on the execution strategy chosen by its optimizer.
I think you have misread ON as IN in the article.
However, the order shown in the article is correct (it is MSDN, after all). ON and JOIN are logically processed before WHERE, because WHERE has to be applied as a filter on the intermediate result set produced by the JOINs.
The article only describes the logical order of execution, and at the end of the paragraph it adds this line too ;)
"Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list."
