SQL query showing drastically different performance depending on the parameters

We have a stored procedure that searches products based on a number of input parameters that differ from one scenario to the next. Depending on the input parameters the search involves anywhere from two to about a dozen different tables. In order to avoid unnecessary joins we build the actual search query as dynamic SQL and execute it inside the stored procedure.
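Roughly, the construction looks like this (a heavily simplified sketch, not the actual procedure; assume @Keywords and @DepartmentID are the procedure's parameters):
DECLARE @sql nvarchar(max);
SET @sql = N'SELECT DISTINCT Product.ProductID, Product.Title
FROM Product
INNER JOIN ProductVariant ON ProductVariant.ProductID = Product.ProductID
WHERE (1=1)';
-- Append one predicate per supplied parameter, so unused tables are never joined.
IF @Keywords IS NOT NULL
    SET @sql = @sql + N' AND (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))';
IF @DepartmentID IS NOT NULL
    SET @sql = @sql + N' AND Product.DepartmentID = @DepartmentID';
-- Execute with parameters so the plan can be cached and reused.
EXEC sp_executesql @sql,
    N'@Keywords nvarchar(4000), @DepartmentID int',
    @Keywords, @DepartmentID;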
In one of the most basic scenarios, the user searches products by a keyword alone (see Query 1 below), which usually takes less than a second. However, if they search by a keyword and department (Query 2 below), the execution time goes up to well over a minute, and the execution plan looks somewhat different (the attached snapshots of the plans show just the parts that differ).
Query 1 (fast)
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
WHERE (1=1)
AND (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))
AND (Product.SourceID = @SourceID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
Query 2 (slow)
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
WHERE (1=1)
AND (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))
AND (Product.SourceID = @SourceID)
AND (Product.DepartmentID = @DepartmentID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
Both the Product and ProductVariant tables have some string columns that participate in the full-text index. The Product table has a non-clustered index on the SourceID column and another non-clustered index on SourceID+DepartmentID (this redundancy is not an oversight but is intended). ProductVariant.ProductID is a FK to Product and has a non-clustered index on it. Statistics are up to date for all indexes and columns, and no missing indexes are reported by SQL Server Management Studio.
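For reference, the index layout described above corresponds roughly to the following (index names are assumed, not the actual schema):
-- Sketch of the indexes described; names are assumed.
CREATE NONCLUSTERED INDEX IX_Product_SourceID ON Product (SourceID);
CREATE NONCLUSTERED INDEX IX_Product_SourceID_DepartmentID ON Product (SourceID, DepartmentID);
CREATE NONCLUSTERED INDEX IX_ProductVariant_ProductID ON ProductVariant (ProductID);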
Any suggestions on what might be causing this drastically different performance?
P.S. Forgot to mention that Product.DepartmentID is a FK to a table of departments, in case it makes any difference.

Thanks to @MartinSmith for the suggestion to break the full-text search logic out into temp tables and then use them to filter the results of the main query. The following returns in just 2 seconds:
SELECT
[Key] AS ProductID
INTO
#matchingProducts
FROM
CONTAINSTABLE(Product, *, @Keywords)
SELECT
[Key] AS VariantID
INTO
#matchingVariants
FROM
CONTAINSTABLE(ProductVariant, *, @Keywords)
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
LEFT OUTER JOIN #matchingProducts ON #matchingProducts.ProductID = Product.ProductID
LEFT OUTER JOIN #matchingVariants ON #matchingVariants.VariantID = ProductVariant.VariantID
WHERE (1=1)
AND (Product.SourceID = @SourceID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
AND (Product.DepartmentID = @DepartmentID)
AND (NOT #matchingProducts.ProductID IS NULL OR NOT #matchingVariants.VariantID IS NULL)
Curiously, when I tried to simplify the above solution using nested queries as shown below, the results were somewhere in-between in terms of speed (around 25 secs). Theoretically, the query below should be identical to the one above, yet somehow SQL Server internally compiles the second one differently.
SELECT DISTINCT
Product.ProductID, Product.Title
FROM
Product
INNER JOIN ProductVariant ON (ProductVariant.ProductID = Product.ProductID)
LEFT OUTER JOIN
(
SELECT
[Key] AS ProductID
FROM
CONTAINSTABLE(Product, *, @Keywords)
) MatchingProducts
ON MatchingProducts.ProductID = Product.ProductID
LEFT OUTER JOIN
(
SELECT
[Key] AS VariantID
FROM
CONTAINSTABLE(ProductVariant, *, @Keywords)
) MatchingVariants
ON MatchingVariants.VariantID = ProductVariant.VariantID
WHERE (1=1)
AND (Product.SourceID = @SourceID)
AND (Product.ProductStatus = @ProductStatus)
AND (ProductVariant.ProductStatus = @ProductStatus)
AND (Product.DepartmentID = @DepartmentID)
AND (NOT MatchingProducts.ProductID IS NULL OR NOT MatchingVariants.VariantID IS NULL)

This may have been your mistake
In order to avoid unnecessary joins we build the actual search query as dynamic SQL and execute it inside the stored procedure.
Dynamic SQL cannot be optimized by the server in most cases. There are certain techniques to mitigate this; read The Curse and Blessings of Dynamic SQL for more.
Get rid of your dynamic SQL and build a decent query using proper indices. I assure you, SQL Server knows better than you when it comes to optimizing. Define ten different queries if you must (or a hundred).
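For instance, a sketch of that approach using the two scenarios from the question, so each branch compiles to its own static plan:
-- Sketch: one static statement per scenario.
IF @DepartmentID IS NULL
BEGIN
    SELECT DISTINCT Product.ProductID, Product.Title
    FROM Product
    INNER JOIN ProductVariant ON ProductVariant.ProductID = Product.ProductID
    WHERE (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))
    AND Product.SourceID = @SourceID
    AND Product.ProductStatus = @ProductStatus
    AND ProductVariant.ProductStatus = @ProductStatus;
END
ELSE
BEGIN
    SELECT DISTINCT Product.ProductID, Product.Title
    FROM Product
    INNER JOIN ProductVariant ON ProductVariant.ProductID = Product.ProductID
    WHERE (CONTAINS((Product.*), @Keywords) OR CONTAINS((ProductVariant.*), @Keywords))
    AND Product.SourceID = @SourceID
    AND Product.DepartmentID = @DepartmentID
    AND Product.ProductStatus = @ProductStatus
    AND ProductVariant.ProductStatus = @ProductStatus;
END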
Secondly, why would you ever expect the same execution plan when running different queries, using different columns and indices? The execution plans and results you get seem perfectly natural to me, given your approach.
You will not get the same execution plan / performance because you are not querying the 'DepartmentID' column in the first query.

Related

Is there an equivalent to OR clause in CONTAINSTABLE - FULL TEXT INDEX

I am trying to improve the string-searching process and selected the FULL-TEXT INDEX strategy.
However, after implementing it, I still see a performance hit when searching for multiple strings across multiple full-text indexed tables with OR clauses.
(e.g. WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%'))
As a solution, I am trying to use CONTAINSTABLE expecting a performance improvement.
Now, I am facing an issue with CONTAINSTABLE when it comes to joining tables with a LEFT JOIN
Please go through the example below.
Query 1
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
LEFT JOIN CONTAINSTABLE(P.Building,*,'%John%') AS FFTIndex ON F.ID = FFTIndex.[Key]
LEFT JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
LEFT JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
LEFT JOIN P.Person p ON pr2.ID = p.PID
LEFT JOIN CONTAINSTABLE(P.Person,FirstName,'%John%') AS PFTIndex ON P.ID = PFTIndex.[Key]
WHERE F.Name IS NOT NULL
This produces the below result.
Query 2
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
INNER JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
INNER JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
INNER JOIN P.Person p ON pr2.ID = p.PID
WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%')
AND F.Name IS NOT NULL
Result
Expectation
To use Query 1 in a way that behaves like a SQL Server OR clause. As far as I can tell, Query 1's first CONTAINSTABLE joins against the Building table and non-matching rows are discarded, so the CONTAINSTABLE against the Person table only sees data already filtered by the Building keyword match.
If the keyword is, say, 'Building', I want to match it in either table, without requiring a matching record in both tables; a match in either one should be enough.
Summary
Query 2 performs well but slows down as the number of words in the indexes grows. Query 1 seems to be the optimized form (according to multiple online resources and the MS documentation);
however, it does not give me the expected output.
Is there any way to solve this problem?
I am not strictly attached to CONTAINSTABLE; suggestions for another optimization method would also be welcome.
Thank you.
Hard to say definitively without your full data set, but here are a couple of options to explore.
Remove Invalid % Wildcards
Why are you using '%SearchTerm%'? CONTAINS is not LIKE, and % is not a valid wildcard there. Does performance improve if you use the search term without the wildcards (%)? If you want to match a word prefix, try something like
WHERE CONTAINS (String,'"SearchTerm*"')
Try Temp Tables
My guess is CONTAINS is slightly faster than CONTAINSTABLE, as it doesn't calculate a rank, but I don't know if anyone has ever attempted to benchmark it. Either way, I'd try saving off the matches to a temp table before joining up to the rest of the tables. This will allow the optimizer to create a better execution plan:
SELECT ID INTO #Temp
FROM YourTable
WHERE CONTAINS (String,'"SearchTerm"')
SELECT *
FROM #Temp
INNER JOIN...
Optimize Full Text Index by Removing Noisy Words
You might find you have some noise words, i.e. meaningless words that recur many times in your data, like "the", or perhaps some business jargon. Adding these to your stoplist means your full-text index will ignore them, making your index smaller and thus faster.
The query below will list indexed words with the most frequent at the top:
Select *
From sys.dm_fts_index_keywords(Db_Id(),Object_Id('dbo.YourTable') /*Replace with your table name*/)
Order By document_count Desc
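If you do find such words, a sketch of moving them to a stoplist (stoplist name is assumed; 1033 is the English LCID):
-- Create a stoplist based on the system one, add a noise word, and bind it to the table's full-text index.
CREATE FULLTEXT STOPLIST MyStoplist FROM SYSTEM STOPLIST;
ALTER FULLTEXT STOPLIST MyStoplist ADD 'jargonword' LANGUAGE 1033;
ALTER FULLTEXT INDEX ON dbo.YourTable SET STOPLIST MyStoplist;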
This OR That Criteria
Your WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%') criteria, where you want this or that, is tricky. OR clauses generally perform poorly, even when using simple equality operators.
I'd try either doing two queries and unioning the results, like:
SELECT * FROM Table1 F
/*Other joins and stuff*/
WHERE CONTAINS(F.*,'%Gayan%')
UNION
SELECT * FROM Table2 P
/*Other joins and stuff*/
WHERE CONTAINS(P.FirstName,'%John%')
OR, and this is much more work: you could load all your data into a giant denormalized table containing all your columns, then apply a full-text index to that table and adjust your search criteria accordingly. It'd probably be the fastest search method, but then you'd have to ensure the data stays in sync between the denormalized table and the underlying normalized tables.
SELECT B.*, P.* INTO DenormalizedTable
FROM Building AS B
INNER JOIN People AS P ON B.ID = P.BuildingID -- join condition assumed; adjust to your schema
CREATE FULLTEXT INDEX ON DenormalizedTable
etc...

SQL query plan and minimum row size exceeds the maximum allowable of 8060 bytes

I'm getting the following error when running a query against my local database :
The query processor could not produce a query plan because a worktable is required, and its minimum row size exceeds the maximum allowable of 8060 bytes. A typical reason why a worktable is required is a GROUP BY or ORDER BY clause in the query. If the query has a GROUP BY or ORDER BY clause, consider reducing the number and/or size of the fields in the clause. Consider using prefix (LEFT()) or hash (CHECKSUM()) of fields for grouping or prefix for ordering. Note however that this will change the behavior of the query.
This occurs when running a query that looks similar to this (apologies for the lack of detail):
SELECT <about 183 columns>
FROM tableA
INNER JOIN tvf1(<params>) tvf1
ON tvf1.id = tableA.X1
INNER JOIN tvf2(<params>) tvf2
ON tvf2.id = tableA.X2
INNER JOIN tvf3(<params>) tvf3
ON tvf3.id = tableA.X3
INNER JOIN tvf4(<params>) tvf4
ON tvf4.id = tableA.X4
INNER JOIN tvf5(<params>) tvf5
ON tvf5.id = tableA.X5
The table-valued-functions above all use a combination of GROUP BY, ROW_NUMBER() and other aggregation functions.
While debugging, I found that commenting out any two of the above joins makes the error go away; it doesn't matter which two.
My database is running at the SQL Server 2019 compatibility level (150).
If I try setting Legacy Cardinality Estimation to On, the error no longer happens, but I don't understand what this setting does.
Edit:
If the database compatibility level is SQL Server 2016 (130), everything works as expected as well.
A concern I have is that the production database might be upgraded in the future and this error could occur.
Edit:
I've managed to get the column count down to a handful now; however, my results are inconsistent.
SELECT
Other = TvfGroupData.Other
,GroupA = TvfGroupData.GroupA
,GroupB = TvfGroupData.GroupB
,GroupC = TvfGroupData.GroupC
, [Max Created Date] =
(SELECT MAX(Value)
FROM (VALUES
(Tvf1.CreatedDate)
,(Tvf2.CreatedDate)
,(Tvf3.CreatedDate)
--,(TvfGroupData.CreatedDate)
,(Tvf3.CreatedDate)
,(Tvf4.CreatedDate)
) AS AllValues(Value)
)
FROM TableA
LEFT JOIN Tvf1() ...
LEFT JOIN Tvf2() ...
LEFT JOIN TvfGroupData() ...
LEFT JOIN Tvf3() ...
LEFT JOIN Tvf4() ...
In the above query the following scenarios work:
excluding only the GroupA column
excluding only the GroupB and GroupC columns
Other combinations all fail with the error:
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.
The cardinality estimator is how SQL Server generates the execution plan, i.e. how the engine will execute and assemble the disparate sets of data.
The recent changes in the engine usually (but not always) result in better execution plans, leading to faster query response while consuming fewer resources.
If you look into how SQL Server processes a query statement, the full data set is gathered before the unwanted columns are excluded. When multiple data sets are combined, the engine may exclude columns before joining to another set, or it may join the sets before excluding columns. It is based on what the engine "sees" in your data patterns (statistics).
UDFs are a frequent stumbling block for the query plan optimizer as a poorly formed function masks the data statistics and prevents the engine from efficiently piecing the data together.
All that to say, the updated engine is looking at your data and determining it is more efficient to combine multiple sets before eliminating unwanted columns.
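If the new estimator is indeed the trigger, note that the legacy estimator can also be requested per query instead of database-wide; a sketch of both forms:
-- Database-wide (what the question describes switching on):
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;
-- Per query, leaving the rest of the database on the new estimator:
SELECT <about 183 columns>
FROM tableA
INNER JOIN tvf1(<params>) tvf1 ON tvf1.id = tableA.X1
OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));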
I believe you may be able to fix this by subselecting the columns you need from your functions before joining to the outer set.
SELECT *
FROM (select SpecificColumns from tableA) as tableA
INNER JOIN (select SpecificColumns from tvf1(<params>)) as tvf1
ON tvf1.id = tableA.X1
INNER JOIN (select SpecificColumns from tvf2(<params>)) as tvf2
ON tvf2.id = tableA.X2
Alternatively, you may want to reconsider the Do-Everything query approach to reporting.
At ~8kb per row, you're likely passing a tremendous amount of data to your reporting system.
You may also give Cross Apply a try.
SELECT <about 183 columns>
FROM tableA
CROSS APPLY tvf1(<params>) tvf1
CROSS APPLY tvf2(<params>) tvf2
WHERE tvf1.id = tableA.X1
AND tvf2.id = tableA.X2
CROSS APPLY can instruct the optimizer to process sets in a different order.
I've still been unable to track down the exact problem.
My workaround for now has been to evaluate the table-valued functions into table variables/temp tables and then join onto them instead.
So it has been changed to something like this:
DECLARE @tvf1 AS TABLE ....
INSERT INTO @tvf1 SELECT * FROM tvf1()...
DECLARE @tvf2 AS TABLE ....
INSERT INTO @tvf2 SELECT * FROM tvf2()...
DECLARE @tvf3 AS TABLE ....
INSERT INTO @tvf3 SELECT * FROM tvf3()...
SELECT <about 183 columns>
FROM tableA
INNER JOIN @tvf1 tvf1
ON tvf1.id = tableA.X1
INNER JOIN @tvf2 tvf2
ON tvf2.id = tableA.X2
INNER JOIN @tvf3 tvf3
ON tvf3.id = tableA.X3
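One note on this workaround: table variables expose very little cardinality information to the optimizer, so if the functions return many rows, #temp tables (which do have statistics) may yield better plans. A sketch of the same idea:
-- Materialize each TVF into a temp table instead of a table variable.
SELECT * INTO #tvf1 FROM tvf1(<params>);
SELECT * INTO #tvf2 FROM tvf2(<params>);
SELECT <about 183 columns>
FROM tableA
INNER JOIN #tvf1 tvf1 ON tvf1.id = tableA.X1
INNER JOIN #tvf2 tvf2 ON tvf2.id = tableA.X2;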

Trying to find a solution to long running SQL code where I think NESTED SQL statement is the culprit

I have a SQL statement with a weird second, nested SQL statement that I think is causing this query to run for 6+ minutes, and any suggestions/help would be appreciated. I tried creating a TEMP table for the values in the nested SQL statement and then doing a simple join, but there is nothing to join on, which is why they used 1=1 in the ON clause. Here is the SQL code:
Declare @TransactionEndDate datetime;
Select @TransactionEndDate = lastmonth_end from dbo.DTE_udfCommonDates(GETDATE());
Select ''''+TreatyName as Treaty,
cast(EndOfMonth as Date) as asOfDate,
Count(Distinct ClaimSysID) as ClaimCount,
Count(Distinct FeatureSysID) as FeatureCount,
Sum(OpenReserve) as OpenReserve
From (
Select
TreatyName,
EndOfMonth,
dbo.CMS_Claims.ClaimSysID,
FeatureSysID,
sum(IW_glGeneralLedger.TransactionAmount)*-1 as OpenReserve
From dbo.CMS_Claims
Inner Join dbo.CMS_Claimants
On dbo.CMS_Claims.ClaimSysID = dbo.CMS_Claimants.ClaimSysID
Inner Join dbo.CMS_Features
On dbo.CMS_Features.ClaimantSysID = dbo.CMS_Claimants.ClaimantSysID
Left Join dbo.IW_glGeneralLedger
On IW_glGeneralLedger.FeatureID = dbo.CMS_Features.FeatureSysID
Left Join dbo.IW_glSubChildAccount
On dbo.IW_glSubChildAccount.glSubChildAccountID = dbo.IW_glGeneralLedger.glSubChildAccountSysID
Left Join dbo.IW_glAccountGroup
On dbo.IW_glAccountGroup.glAccountGroupID = dbo.IW_glSubChildAccount.glAccountGroupSysID
Left Join dbo.IW_BankRegister
On dbo.IW_BankRegister.BankRegisterSysID = dbo.IW_glGeneralLedger.BankRegisterID
Left Join dbo.IW_BankRegisterStatus
On dbo.IW_BankRegisterStatus.BankRegisterStatusSysID = dbo.IW_BankRegister.BankRegisterStatusID
Left Join (Select Distinct dbo.DTE_get_month_end(dt) as EndOfMonth
From IW_Calendar
Where dt Between '3/1/2004'
and @TransactionEndDate) as dates
on 1=1
Left Join dbo.IW_ReinsuranceTreaty
On dbo.IW_ReinsuranceTreaty.TreatySysID = IW_glGeneralLedger.PolicyTreatyID
Where dbo.IW_glGeneralLedger.TransactionDate Between '1/1/2004 00:00:00' And EndOfMonth
And dbo.IW_glAccountGroup.Code In ('RESERVEINDEMNITY')
And (
(dbo.IW_glGeneralLedger.BankRegisterID Is Null)
Or (
(IW_BankRegister.PrintedDate Between '1/1/2004 00:00:00' And EndOfMonth Or dbo.IW_glGeneralLedger.BankRegisterID = 0)
And
(dbo.IW_BankRegisterStatus.EnumValue In ('Approved','Outstanding','Cleared','Void') Or dbo.IW_glGeneralLedger.BankRegisterID = 0))
)
Group By TreatyName, dbo.CMS_Claims.ClaimSysID, FeatureSysID, EndOfMonth
Having sum(IW_glGeneralLedger.TransactionAmount) <> 0
) As Data
Group By TreatyName,EndOfMonth
Order By EndOfMonth, TreatyName
This nested SQL code only provides a table of end-of-month values in one column called EndOfMonth, and this is what I'm trying to fix:
Select Distinct dbo.DTE_get_month_end(dt) as EndOfMonth
From IW_Calendar
Where dt Between '3/1/2004'
and @TransactionEndDate
Please use the methods below to improve query performance.
Use temporary tables (load relevant data into temporary tables with the necessary WHERE conditions, then join); see the sketch below.
Add clustered and non-clustered indexes to your tables.
Create multiple-column indexes.
Index the ORDER BY / GROUP BY / DISTINCT columns for better response time.
Use parameterized queries.
Use query hints where appropriate.
NOLOCK: In the event that data is locked, this tells SQL Server to read data from the last known value available, also known as a dirty read. Since it is possible to use some old values and some new values, data sets can contain inconsistencies. Do not use this in any place in which data quality is important.
RECOMPILE: Adding this to the end of a query will result in a new execution plan being generated each time the query is executed. This should not be used on a query that is executed often, as the cost to optimize a query is not trivial. For infrequent reports or processes, though, this can be an effective way to avoid undesired plan reuse. This is often used as a bandage when statistics are out of date or parameter sniffing is occurring.
MERGE/HASH/LOOP: This tells the query optimizer to use a specific type of join as part of a join operation. This is super-risky as the optimal join will change as data, schema, and parameters evolve over time. While this may fix a problem right now, it will introduce an element of technical debt that will remain for as long as the hint does.
OPTIMIZE FOR: Can specify a parameter value to optimize the query for. This is often used when we want performance to be controlled for a very common use case so that outliers do not pollute the plan cache. Similar to join hints, this is fragile and when business logic changes, this hint usage may become obsolete.
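Applying the temporary-table suggestion to the calendar subquery from the question, a sketch might look like this (names taken from the question):
-- Materialize the month-end list once, then reference it in the main query.
Select Distinct dbo.DTE_get_month_end(dt) as EndOfMonth
Into #MonthEnds
From IW_Calendar
Where dt Between '3/1/2004' and @TransactionEndDate;
-- In the main query, replace the nested derived table with:
-- Cross Join #MonthEnds
-- (equivalent to the Left Join ... on 1=1, assuming the calendar range is never empty)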

Predicates on views: How can I evaluate the predicate before the view joins?

I have a query:
SELECT TOP 1 * FROM vwTimeBlocks
WHERE vwTimeBlocks.id = @id
AND vwTimeBlocks.organization = @org
organization is indexed in some of the tables within the view. I expected that the WHERE would be evaluated on the underlying tables in the view before any JOINs, but this isn't the case:
The execution plan suggests that the WHERE is being applied to the result of vwTimeBlocks. Inlining the vwTimeBlocks definition gives me this:
SELECT TOP 1 * FROM
( SELECT ... all the things...
FROM tblTimeBlocks
LEFT OUTER JOIN tblA
ON tblA.id = tblTimeBlocks.F
AND tblTimeBlocks.B = 1000
LEFT OUTER JOIN tblB
ON tblB.id = tblTimeBlocks.F
AND tblTimeBlocks.B = 2000
) AS subTimeBlocks
WHERE subTimeBlocks.id = @timeblock AND subTimeBlocks.organization = @org;
Comparing the execution plan of this against the original query shows they're identical. However, when I place the WHERE clause inside the subquery:
SELECT TOP 1 * FROM
( SELECT ... all the things...
FROM tblTimeBlocks
LEFT OUTER JOIN tblA
ON tblA.id = tblTimeBlocks.F
AND tblTimeBlocks.B = 1000
LEFT OUTER JOIN tblB
ON tblB.id = tblTimeBlocks.F
AND tblTimeBlocks.B = 2000
WHERE tblTimeBlocks.id = @timeblock AND tblTimeBlocks.organization = @org
) AS subTimeBlocks
The overall query cost drops, and the execution plan for the new query is much better; here's a comparison, with the top execution plan showing the WHERE outside the subquery, and the bottom plan with the WHERE inside the subquery.
Note that the relative costs of the queries are 91%-9%, and the first filters on organization quite late, while the second uses two seeks.
Why is this the case? I expected the query optimizer to optimize better than it did.
Given this situation, I'd obviously like to use the second query and somehow push the predicate into the view portion of the query. Because the predicate relies on a parameter, I can't just alter the view definition. One option is to replace the view with an inline table-valued function, but this situation is replicated all over my database. Is there any other way?
Is there any way I can hint, or otherwise ask SQL Server, to apply the predicate inside my view? I don't know why the query optimizer doesn't look at the query and push the WHERE as early as possible.
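For what it's worth, a sketch of the inline table-valued function approach mentioned above (names assumed; because an inline TVF is expanded with the parameter already in place, the optimizer can seek on organization directly):
-- Sketch: parameterized inline TVF in place of the view; column list elided as in the question.
CREATE FUNCTION dbo.fnTimeBlocks (@org int)
RETURNS TABLE
AS RETURN
SELECT ... all the things...
FROM tblTimeBlocks
LEFT OUTER JOIN tblA
ON tblA.id = tblTimeBlocks.F
AND tblTimeBlocks.B = 1000
LEFT OUTER JOIN tblB
ON tblB.id = tblTimeBlocks.F
AND tblTimeBlocks.B = 2000
WHERE tblTimeBlocks.organization = @org;
-- Usage: SELECT TOP 1 * FROM dbo.fnTimeBlocks(@org) WHERE id = @id;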
In case it's relevant, this is running on:
Microsoft SQL Azure (RTM) - 12.0.2000.8

SQL Server 2005 Performance: Distinct or full table in WHERE IN statement

We have two Tables:
Document: id, title, document_type_id, showon_id
DocumentType: id, name
Relationship: DocumentType hasMany Documents. (Document.document_type_id = DocumentType.id)
We wish to retrieve a list of all document types for one given ShowOn_Id.
We see two possibilities:
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42
);
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT Document.document_type_id FROM Document WHERE showon_id = 42
);
Our question is: when, if ever, is it better to use the DISTINCT to get the smaller record set, versus retrieving the whole table and letting the IN statement walk it to the first match? (We guess that's what it does ;-))
Is this different for different databases, is there a common answer?
Or is there a better way of doing it? (We are in .NET land)
You can use a join:
SELECT DISTINCT DocumentType.*
FROM DocumentType
INNER JOIN Document
ON DocumentType.id=Document.document_type_id
WHERE Document.showon_id = 42
I think it's the best way to do it.
For the best performance you should use:
SELECT DISTINCT dt.*
FROM
DocumentType dt
INNER JOIN Document d ON dt.id=d.document_type_id and d.showon_id = 42
Joins are very efficient at bridging multiple tables, whereas the nested query in the WHERE clause needs to perform a separate result selection that filters down the FROM clause results. The join statement is also much more readable.
I would also put an index on showon_id, in addition to the primary keys and foreign key relationship.
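As a sketch (index name assumed; the INCLUDE column makes the subquery covering):
-- Index to support the showon_id filter.
CREATE NONCLUSTERED INDEX IX_Document_ShowOnID
ON Document (showon_id)
INCLUDE (document_type_id);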
My answer differs from wmasm's answer only by moving the showon_id filter up to the inner join. For MS SQL 2k5, I think the interpreter is smart enough to do this automatically, but you always want to work with the smallest result set possible. Bringing your filters up to inner join statements can limit the number of rows the query has to work with when joining many tables together. If you do this though, you should understand that this happens for every row comparison so complex filters (such as like x = '%a' or function calls) are better left for the Where clause so that the inner joins may filter out unnecessary comparisons.
Use an EXISTS. It is sometimes faster and, in my opinion, more readable than a DISTINCT and a JOIN. Just for kicks, please reply with the query plan for this query and the JOIN above, and see if anything is different (they may be optimized down to the same plan). If they are the same, I'd recommend the EXISTS, as it is closer to a "plain language" description than a JOIN (because you don't want any of the data from Document, etc.).
SELECT whatever
FROM DocumentType dt
WHERE EXISTS( SELECT *
FROM Document
WHERE dt.id = document_type_id
AND showon_id = 42)
To get the query plan (ref: http://msdn.microsoft.com/en-us/library/ms180765(SQL.90).aspx), do:
SET SHOWPLAN_TEXT ON
GO
SELECT ...
GO
From my point of view it should not make any difference inside SQL Server (but who knows how this is implemented).
Think of it this way: to return the result set, the server needs to go into the Document table and retrieve all document_type_ids WHERE showon_id = 42. In the process of retrieving the document_type_ids (e.g. by index seeking) it puts them into a hash table. When this process has finished, the hash table contains distinct values anyway. After that, the query execution goes into the Document_Type table, scans the primary key, and probes into the hash table. Note that this depends; e.g. maybe it's more efficient not to use a hash table when the expected row count from the Document table is low compared to Document_Type, but in general you get the same query plan as for the query wmasm just suggested.
Follow up on Matt's answer:
I've enabled the query plan and tested the following four different queries that have come up so far:
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DISTINCT DocumentType.* FROM DocumentType INNER JOIN Document ON DocumentType.id=Document.document_type_id WHERE Document.showon_id = 42;
SELECT DocumentType.* FROM DocumentType WHERE EXISTS ( SELECT * FROM Document WHERE DocumentType.id=Document.document_type_id AND showon_id = 42);
The query plan for all four queries turned out to be the same:
|--Hash Match(Right Semi Join, HASH:([Document].[document_type_id])=([DocumentType].[Id]))
|--Hash Match(Inner Join, HASH:([Document].[Title], [Uniq1005])=([Document].[Title], [Uniq1005]), RESIDUAL:([Document].[Title] as [Document].[Title] = [Document].[Title] as [Document].[Title] AND [Uniq1005] = [Uniq1005]))
| |--Index Seek(OBJECT:([Document].[IX_Document_3] AS [Document]), SEEK:([Document].[showon_id]=(1)) ORDERED FORWARD)
| |--Index Scan(OBJECT:([Document].[IX_Document_1] AS [Document]))
|--Table Scan(OBJECT:([DocumentType] AS [DocumentType]))
I am not sure what every line and element means, but it seems that from the performance perspective it does not matter how you construct the query for this kind of problem...
