Snowflake: Dynamic Data Masking Impact On query results

Snowflake: Dynamic Data Masking Impact On query results - snowflake-cloud-data-platform

I have discovered that a query against a table that has a masked policy will return different results based on the role IF a masked column is part of the predicate or a JOIN.
The following query will return no data:
use role NO_READ_ROLE;
select *
from customer_table
where ssn = '***-**-0102';
But if the the role used applies the mask, it will return the result set.
For the below query, the result set is only returned if the role is FULL READ (unmasked):
select *
from customer_table
where ssn = '100-10-0102';
Similarly, if we have a self join on a masked column, the results returned will depend on the role used.
select c1.customer_id
, c2.first_name
, c1.ssn
, c2.ph_no
, c1.email
from customer_table c1
inner join customer_table c2
on c1.ssn = c2.ssn;
I had hoped/expected that the optimizer would treat the predicates and the JOIN operators the same, regardless of the masking policy impact. Instead, it appears that the masked columns are also masked as part of the query plan.
Based on the above findings, it seems that a good rule of thumb is to not use masked columns as part of the predicate and/or JOINs UNLESS the role used has unrestricted read access to the column.
Does anyone have any other guidance when using masked columns as part of the queries JOIN or WHERE clause?
Reference:
https://docs.snowflake.com/en/user-guide/security-column-intro.html#masking-policies-at-query-runtime
At runtime, Snowflake rewrites the query to apply the masking policy expression to the columns specified in the masking policy. The masking policy is applied to the column regardless of where in a SQL expression the column is referenced, including:
PROJECTIONS.
JOIN predicates.
WHERE clause predicates.
ORDER BY and GROUP BY clauses.
Thanks,

masking converts column values to the masked values appropriate to the role running the query; the underlying value effectively doesn't exist from the perspective of the role.
If this wasn't the case then masking would be pointless. For example, I could write a Stored Procedure like this (pseudo-code):
for n= 1 to 999999999
select c1.customer_id
, c1.ssn
, c1.email
, n
from customer_table c1
where c1.ssn = n
next n
This would allow me to list the ssn for every customer, bypassing the masking policy.
I'm not sure what guidance you are looking for, you just need to understand the behaviour when writing queries.

Have you tried to use context function INVOKER_ROLE() in the mix?
In my use case, I have created a view from two table with inner join on masked columns. By using INVOKER_ROLE() in masked policy definition, I was able to get the same outputs irrespective of roles accessing the view.
Some helpful links:
https://docs.snowflake.com/en/sql-reference/functions/invoker_role.html
https://www.youtube.com/watch?v=6l49PxnvYKI&t=623s

Related

SQL query plan and minimum row size exceeds the maximum allowable of 8060 bytes

I'm getting the following error when running a query against my local database :
The query processor could not produce a query plan because a worktable is required, and its minimum row size exceeds the maximum allowable of 8060 bytes. A typical reason why a worktable is required is a GROUP BY or ORDER BY clause in the query. If the query has a GROUP BY or ORDER BY clause, consider reducing the number and/or size of the fields in the clause. Consider using prefix (LEFT()) or hash (CHECKSUM()) of fields for grouping or prefix for ordering. Note however that this will change the behavior of the query.
This occurs when running a query that looks similar to this (apolgies for the lack of detail) :
SELECT <about 183 columns>
FROM tableA
INNER JOIN tvf1(<params>) tvf1
ON tvf1.id = tableA.X1
INNER JOIN tvf2(<params>) tvf2
ON tvf2.id = tableA.X2
INNER JOIN tvf3(<params>) tvf3
ON tvf3.id = tableA.X3
INNER JOIN tvf4(<params>) tvf4
ON tvf4.id = tableA.X4
INNER JOIN tvf5(<params>) tvf5
ON tvf5.id = tableA.X5
The table-valued-functions above all use a combination of GROUP BY, ROW_NUMBER() and other aggregation functions.
While binary debugging, commenting out any 2 of the above joins results in the error not occurring, doesnt matter which though.
My database is running on Compatibility Level 2019.
If i try setting Legacy Cardinality Estimation to On then the error no longer happens but I dont understand what this setting does.
edit :
If the database compatibility level is 2016 then everything works as expected as well
A concern i have is that the production database might be upgraded in future and this error could occur.
Edit :
I've managed to get the column count down to a handful now however my results are inconsistent.
SELECT
Other = TvfGroupData.Other
,GroupA = TvfGroupData.GroupA
,GroupB = TvfGroupData.GroupB
,GroupC = TvfGroupData.GroupC
, [Max Created Date] =
(SELECT MAX(Value)
FROM (VALUES
(Tvf1.CreatedDate)
,(Tvf2.CreatedDate)
,(Tvf3.CreatedDate)
--,(TvfGroupData.CreatedDate)
,(Tvf3.CreatedDate)
,(Tvf4.CreatedDate)
) AS AllValues(Value)
)
FROM TableA
LEFT JOIN Tvf1() ...
LEFT JOIN Tvf2() ...
LEFT JOIN TvfGroupData() ...
LEFT JOIN Tvf3() ...
LEFT JOIN Tvf4() ...
In the above query the following scenarios work :
excluding only GroupA column.
excluding only GroupB, GroupC column
Other combinations all fail with the error :
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.

The cardinality estimator is how SQL Server generates the execution plan, meaning how the engine will execute and assemble the disparate sets of data.
The recent changes in the engine usually (but not always) result in better execution plans, leading to faster queries response while consuming less resources.
If you look into how SQL Server processes a query statement, the full data set is gathered before the unwanted columns are excluded. When multiple data sets are combined, the engine may exclude columns before joining to another set, or it may join the sets before excluding columns. It is based on what the engine "sees" in your data patterns (statistics).
UDFs are a frequent stumbling block for the query plan optimizer as a poorly formed function masks the data statistics and prevents the engine from efficiently piecing the data together.
All that to say, the updated engine is looking at your data and determining it is more efficient to combine multiple sets before eliminating unwanted columns.
I believe you may be able to fix this by subselecting the columns you need from your functions before joining to the outer set.
SELECT *
FROM (select SpecificColumns from tableA) as tableA
INNER JOIN (select SpecificColumns from tvf1(<params>)) as tvf1
ON tvf1.id = tableA.X1
INNER JOIN (select SpecificColumns from tvf2(<params>)) as tvf2
ON tvf2.id = tableA.X2
Alternatively, you may want to reconsider the Do-Everything query approach to reporting.
At ~8kb per row, you're likely passing a tremendous amount of data to your reporting system.
You may also give Cross Apply a try.
SELECT <about 183 columns>
FROM tableA
CROSS APPLY tvf1(<params>) tvf1
CROSS APPLY tvf2(<params>) tvf2
WHERE tvf1.id = tableA.X1
AND tvf2.id = tableA.X2
CROSS APPLY can instruct the optimizer to process sets in a different order.

Ive stil been unable to track down the exact problem.
My workaround for now has been to evaluate the table-valued-functions into table variables/temp tables and then join onto them instead
So it has been changed to something like this
DECLARE #tvf1 AS TABLE ....
INSERT INTO #tvf1 SELECT * FROM tvf1()...
DECLARE #tvf2 AS TABLE ....
INSERT INTO #tvf2 SELECT * FROM tvf2()...
DECLARE #tvf3 AS TABLE ....
INSERT INTO #tvf3 SELECT * FROM tvf3()...
SELECT <about 183 columns>
FROM tableA
INNER JOIN #tvf1 tvf1
ON tvf1.id = tableA.X1
INNER JOIN #tvf2 tvf2
ON tvf2.id = tableA.X2
INNER JOIN #tvf3 tvf3
ON tvf3.id = tableA.X3

Prefix in field name SQL

I am an extreme rookie when it comes to SQL and I am trying to self teach myself. I have a few questions regarding SQL and writing queries:
I've been given some examples of queries by a colleague and many of them have the field names beginning with either m. or t. or o.email (eg. m.email or t.email or o.email). What do the prefixes indicate?
I am attempting to write a JOIN query but keep receiving an error message saying: An expression of non-boolean type specified in a context where a condition is expected, near ')' What would cause this? Neither of the data extensions I am trying to join contain boonlean fields.
I've written it as:
SELECT DISTINCT email, status_type, status_value_text FROM ent.[Table 1] JOIN [Table 2] ON email
Again, I am extremely new, any help would be greatly appreciated!

The general form of an SQL SELECT query is:
SELECT (columns, expressions, aggregate functions)
FROM (data sources)
WHERE (filters on data sources)
GROUP BY (columns used to group the data, needed if you use aggregate functions)
HAVING (filters on data AFTER it's grouped)
About the FROM clause, if you use more than one data source (table or view), you should think how your data will be related:
A cross join returns the cartesian product of the tables involved.Example:
FROM foo, bar will return all the rows of table foo and all the rows from table bar, without any rule on how they are related.
An inner join returns only the data that fulfills a relation rule.
Example:
FROM foo INNER JOIN bar ON foo.aField = bar.aField (alternative: FROM foo JOIN bar ON foo.aField = b.aField) will return only the rows of foo and bar that fulfils the condition provided (the field aField in each table must match). Important: An INNER JOIN must have a boolean expression, i.e. the relation must either be true or false; most times it will be an equality relationship (=) but it can be anything that returns a TRUE/FALSE value (>, <, >=, etcetera).
An outer join returns the full data of one table and only the rows of the other table that match the relation rule (for any other non-matching row of the first table, the columns from the second table will have a NULL value).
There are two possible outer joins: LEFT JOIN will return all the rows from the table on the left of the relation rule, and only the matching rows of the right table of the relation. RIGHT JOIN does just the opposite.
Example:
FROM foo LEFT JOIN bar ON foo.aField = bar.aField
As it was the case with the INNER JOIN, the relation must have a boolean expression.
Now, as you've noticed, you need to tell the system from which field you're getting the data. That's the reason for the "prefixes": They are the table (or schema.database.table) names. If you want, you can use aliases on this table names (just as you can use aliases on fields):
Example: FROM foo AS f INNER JOIN bar AS b on f.aField = b.aField
The aliases used in the FROM clause must be used every time you use the field in the same SELECT query.
Now, talking explicitly about your query: The JOIN in your post is missing the relation rule: You're telling the database server that the tables are related, but you're not defining the relation rule. Wich fields must match? Complete the JOIN expression with the columns that must match between the tables.

What do the prefixes indicate?
If you look at the FROM clause of the example queries you are questioning, you should notice table aliases specified after the full table names. Aliases are used sometimes just to shorten the table names from needing to be fully spelled out, but most often they are needed to disambiguate between similarly named columns in more than one of the tables in the FROM.
SELECT
-- Get the email column from the table aliased
-- as `m` (table1)
m.email
FROM
-- table1 aliased as m
table1 AS m
-- table1 aliased as e
INNER JOIN table2 AS e ON....
Consider two related tables, table1 and table2. Both of them has a column called email. In the SELECT list, you may not specify merely email because the RDBMS won't know which one you want. Instead, you must qualify it with the table name. Since the tables were aliased as m, e in the FROM, you must use the alias in SELECT instead of the full table name.
MS SQL Server documentation on table aliases...
An expression of non-boolean type specified in a context where a condition is expected, near ')'
In your join's ON clause, you simply supplied the column name email, which we can assume is a common value relating the two tables. A join's ON clause expects a boolean expression wherein a true value doing a row-wise comparison between the tables results in a row being returned.
So the ON clause needs a boolean expression with two sides, or something returning TRUE. In your case, it is equality between the email columns
SELECT DISTINCT
-- Must qualify email since both tables have it
-- Using the full table name, or its alias if an alias
-- was provided in `FROM`.
[Table 1].email,
status_type,
status_value_text
FROM
ent.[Table 1]
-- Equality between email columns completes the join
JOIN [Table 2] ON [Table 1].email = [Table 2].email
The expression in ON can be anything which evaluates to TRUE. It need not be an exact match between column values, though an exact match is by far the most common use case. You could say for example ON 1 = 1, which is always true. The resultant rowset would match every row from [Table 1] to every row of [Table 2] (which is a cartesian product). It could also be ON 1 = 2 which is never true, and therefore basically pointless since it would never return rows.
Using a syntax similar to the ON email you attempted, some RDBMS systems support a USING() in place of ON, allowing you to specify the equal column instead of a boolean expression. You might therefore also express it as
FROM
ent.[Table 1]
JOIN [Table 2] USING (email)
See also What's the difference between ON and USING

The m. or t. are aliases for tables within your query (so you are probably using tables that begin with the letter m and t). The alias is specified after the table name in the FROM part of your query.
The syntax looks off for your JOIN. The query should probably look more similar to this:
SELECT DISTINCT t1.email, t1.status_type, t2.status_value_text
FROM
[Table 1] t1 JOIN [Table 2] t2 ON t1.email = t2.email
Check out this link to see more examples.http://www.w3schools.com/sql/sql_join.asp

Right way to use distinct in SQL Server

I am trying to retrieve some records based on the query
Select distinct
tblAssessmentEcosystemCredit.AssessmentEcosystemCreditID,
tblSpecies.CommonName
from
tblAssessmentEcosystemCredit
left join
tblSpeciesVegTypeLink on tblAssessmentEcosystemCredit.VegTypeID = tblSpeciesVegTypeLink.VegTypeID
left join
tblSpecies on tblSpecies.SpeciesID = tblSpeciesVegTypeLink.SpeciesID
where
tblAssessmentEcosystemCredit.SpeciesTGValue < 1
The above query returns 17,000 records but when I remove tblSpecies.CommonName, it retrieves only 4200 (that's actually correct).
I have no idea how to distinct only tblAssessmentEcosystemCredit.AssessmentEcosystemCreditID column and retrieve all other table columns in the query.

This query selects the different COMBINATION of AssessmentEcosystemCreditID and CommonName; if you want only one row per value of AssessmentEcosystemCreditID then you need to use a GROUP BY, as suggested by #JonasB; however, in that case, there could be several values of CommonName per value of AssessmentEcosystemCreditID , and so SQL requires you to specify WHICH one you want
Select tblAssessmentEcosystemCredit.AssessmentEcosystemCreditID ,
max(tblSpecies.CommonName) as CommonName,
min(tblSpecies.CommonName) as CommonName2, -- so you can verify you only have one value
from tblAssessmentEcosystemCredit
left join tblSpeciesVegTypeLink
on tblAssessmentEcosystemCredit.VegTypeID = tblSpeciesVegTypeLink.VegTypeID
left join tblSpecies on tblSpecies.SpeciesID= tblSpeciesVegTypeLink.SpeciesID
where tblAssessmentEcosystemCredit.SpeciesTGValue <1
GROUP BY tblAssessmentEcosystemCredit.AssessmentEcosystemCreditID

See this topic: mySQL select one column DISTINCT, with corresponding other columns
You probably have to deactivate ONLY_FULL_GROUP_BY, see http://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_only_full_group_by

Order Of Execution of the SQL query

I am confused with the order of execution of this query, please explain me this.
I am confused with when the join is applied, function is called, a new column is added with the Case and when the serial number is added. Please explain the order of execution of all this.
select Row_number() OVER(ORDER BY (SELECT 1)) AS 'Serial Number',
EP.FirstName,Ep.LastName,[dbo].[GetBookingRoleName](ES.UserId,EP.BookingRole) as RoleName,
(select top 1 convert(varchar(10),eventDate,103)from [3rdi_EventDates] where EventId=13) as EventDate,
(CASE [dbo].[GetBookingRoleName](ES.UserId,EP.BookingRole)
WHEN '90 Day Client' THEN 'DC'
WHEN 'Association Client' THEN 'DC'
WHEN 'Autism Whisperer' THEN 'DC'
WHEN 'CampII' THEN 'AD'
WHEN 'Captain' THEN 'AD'
WHEN 'Chiropractic Assistant' THEN 'AD'
WHEN 'Coaches' THEN 'AD'
END) as Category from [3rdi_EventParticipants] as EP
inner join [3rdi_EventSignup] as ES on EP.SignUpId = ES.SignUpId
where EP.EventId = 13
and userid in (
select distinct userid from userroles
--where roleid not in(6,7,61,64) and roleid not in(1,2))
where roleid not in(19, 20, 21, 22) and roleid not in(1,2))
This is the function which is called from the above query.
CREATE function [dbo].[GetBookingRoleName]
(
#UserId as integer,
#BookingId as integer
)
RETURNS varchar(20)
as
begin
declare #RoleName varchar(20)
if #BookingId = -1
Select Top 1 #RoleName=R.RoleName From UserRoles UR inner join Roles R on UR.RoleId=R.RoleId Where UR.UserId=#UserId and R.RoleId not in(1,2)
else
Select #RoleName= RoleName From Roles where RoleId = #BookingId
return #RoleName
end

Queries are generally processed in the follow order (SQL Server). I have no idea if other RDBMS's do it this way.
FROM [MyTable]
ON [MyCondition]
JOIN [MyJoinedTable]
WHERE [...]
GROUP BY [...]
HAVING [...]
SELECT [...]
ORDER BY [...]

SQL is a declarative language. The result of a query must be what you would get if you evaluated as follows (from Microsoft):
Logical Processing Order of the SELECT statement
The following steps show the logical
processing order, or binding order,
for a SELECT statement. This order
determines when the objects defined in
one step are made available to the
clauses in subsequent steps. For
example, if the query processor can
bind to (access) the tables or views
defined in the FROM clause, these
objects and their columns are made
available to all subsequent steps.
Conversely, because the SELECT clause
is step 8, any column aliases or
derived columns defined in that clause
cannot be referenced by preceding
clauses. However, they can be
referenced by subsequent clauses such
as the ORDER BY clause. Note that the
actual physical execution of the
statement is determined by the query
processor and the order may vary from
this list.
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
The optimizer is free to choose any order it feels appropriate to produce the best execution time. Given any SQL query, is basically impossible to anybody to pretend it knows the execution order. If you add detailed information about the schema involved (exact tables and indexes definition) and the estimated cardinalities (size of data and selectivity of keys) then one can take a guess at the probable execution order.
Ultimately, the only correct 'order' is the one described ion the actual execution plan. See Displaying Execution Plans by Using SQL Server Profiler Event Classes and Displaying Graphical Execution Plans (SQL Server Management Studio).
A completely different thing though is how do queries, subqueries and expressions project themselves into 'validity'. For instance if you have an aliased expression in the SELECT projection list, can you use the alias in the WHERE clause? Like this:
SELECT a+b as c
FROM t
WHERE c=...;
Is the use of c alias valid in the where clause? The answer is NO. Queries form a syntax tree, and a lower branch of the tree cannot be reference something defined higher in the tree. This is not necessarily an order of 'execution', is more of a syntax parsing issue. It is equivalent to writing this code in C#:
void Select (int a, int b)
{
if (c = ...) then {...}
int c = a+b;
}
Just as in C# this code won't compile because the variable c is used before is defined, the SELECT above won't compile properly because the alias c is referenced lower in the tree than is actually defined.
Unfortunately, unlike the well known rules of C/C# language parsing, the SQL rules of how the query tree is built are somehow esoteric. There is a brief mention of them in Single SQL Statement Processing but a detailed discussion of how they are created, and what order is valid and what not, I don't know of any source. I'm not saying there aren't good sources, I'm sure some of the good SQL books out there cover this topic.
Note that the syntax tree order does not match the visual order of the SQL text. For example the ORDER BY clause is usually the last in the SQL text, but as a syntax tree it sits above everything else (it sorts the output of the SELECT, so it sits above the SELECTed columns so to speak) and as such is is valid to reference the c alias:
SELECT a+b as c
FROM t
ORDER BY c;

SQL query is not imperative but declarative, so you have no idea which the statement is executed first, but since SQL is evaluated by SQL query engines, most of the SQL engines follows similar process to obtain the results. You may have to understand how the query engine works internally to understand some SQL execution behavior.
Julia Evens has a great post explaining this, it is worth to check it out:
https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/

SQL is a declarative language, meaning that it tells the SQL engine what to do, not how. This is in contrast to an imperative language such as C, in which how to do something is clearly laid out.
This means that not all statements will execute as expected. Of particular note are boolean expressions, which may not evaluate from left-to-right as written. For example, the following code is not guaranteed to execute without a divide by zero error:
SELECT 'null' WHERE 1 = 1 OR 1 / 0 = 0
The reason for this is the query optimizer chooses the best (most efficient) way to execute a statement. This means that, for example, a value may be loaded and filtered before a transforming predicate is applied, causing an error. See the second link above for an example
See: here and here.

"Order of execution" is probably a bad mental model for SQL queries. Its hard to actually write a single query that would actually depend on order of execution (this is a good thing). Instead you should think of all join and where clauses happening simultaneously (almost like a template)
That said you could run display the Execution Plans which should give you insight into it.
However since its's not clear why you want to know the order of execution, I'm guessing your trying to get a mental model for this query so you can fix it in some way. This is how I would "translate" your query, although I've done well with this kind of analysis there's some grey area with how precise it is.
FROM AND WHERE CLAUSE
Give me all the Event Participants rows. from [3rdi_EventParticipants
Also give me all the Event Signup rows that match the Event Participants rows on SignUpID inner join 3rdi_EventSignup] as ES on EP.SignUpId = ES.SignUpId
But Only for Event 13 EP.EventId = 13
And only if the user id has a record in the user roles table where the role id is not in 1,2,19,20,21,22
userid in (
select distinct userid from userroles
--where roleid not in(6,7,61,64) and roleid not in(1,2))
where roleid not in(19, 20, 21, 22) and roleid not in(1,2))
SELECT CLAUSE
For each of the rows give me a unique ID
Row_number() OVER(ORDER BY (SELECT 1)) AS 'Serial Number',
The participants First Name EP.FirstName
The participants Last Name Ep.LastName
The Booking Role name GetBookingRoleName
Go look in the Event Dates and find out what the first eventDate where the EventId = 13 that you find
(select top 1 convert(varchar(10),eventDate,103)from [3rdi_EventDates] where EventId=13) as EventDate
Finally translate the GetBookingRoleName in Category. I don't have a table for this so I'll map it manually (CASE [dbo].[GetBookingRoleName](ES.UserId,EP.BookingRole)
WHEN '90 Day Client' THEN 'DC'
WHEN 'Association Client' THEN 'DC'
WHEN 'Autism Whisperer' THEN 'DC'
WHEN 'CampII' THEN 'AD'
WHEN 'Captain' THEN 'AD'
WHEN 'Chiropractic Assistant' THEN 'AD'
WHEN 'Coaches' THEN 'AD'
END) as Category
So a couple of notes here. You're not ordering by anything when you select TOP. You should probably have na order by there. You could also just as easily put that in your from clause e.g.
from [3rdi_EventParticipants] as EP
inner join [3rdi_EventSignup] as ES on EP.SignUpId = ES.SignUpId,
(select top 1 convert(varchar(10),eventDate,103)
from [3rdi_EventDates] where EventId=13
Order by eventDate) dates

There is a logical order to evaluation of the query text, but the database engine can choose what order execute the query components based upon what is most optimal. The logical text parsing ordering is listed below. That is, for example, why you can't use an alias from SELECT clause in a WHERE clause. As far as the query parsing process is concerned, the alias doesn't exist yet.
FROM
ON
OUTER
WHERE
GROUP BY
CUBE | ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
See the Microsoft documentation (see "Logical Processing Order of the SELECT statement") for more information on this.

Simplified order for T-SQL -> SELECT statement:
1) FROM
2) Cartesian product
3) ON
4) Outer rows
5) WHERE
6) GROUP BY
7) HAVING
8) SELECT
9) Evaluation phase in SELECT
10) DISTINCT
11) ORDER BY
12) TOP
as I had done so far - same order was applicable in SQLite.
Source => SELECT (Transact-SQL)
... of course there are (rare) exceptions.

SQL Server 2005 Performance: Distinct or full table in WHERE IN statement

We have two Tables:
Document: id, title, document_type_id, showon_id
DocumentType: id, name
Relationship: DocumentType hasMany Documents. (Document.document_type_id = DocumentType.id)
We wish to retrieve a list of all document types for one given ShowOn_Id.
We see two possiblities:
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42
);
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT Document.document_type_id FROM Document WHERE showon_id = 42
);
Our question is: when and if is it better to use the DISTINCT to get the smaller record set versus retrieving the whole table and the IN statement walking the table to the first match. (We guess that's what it does ;-))
Is this different for different databases, is there a common answer?
Or is there a better way of doing it? (We are in .NET land)

You can use a join:
SELECT DISTINCT DocumentType.*
FROM DocumentType
INNER JOIN Document
ON DocumentType.id=Document.document_type_id
WHERE Document.showon_id = 42
I think it's the best way to do it.

For the best performance you should use:
SELECT DISTINCT dt.*
FROM
DocumentType dt
INNER JOIN Document d ON dt.id=d.document_type_id and d.showon_id = 42
Joins are very efficient at bridging multiple tables where as the nested query in the Where clause will need to perform a separate result selection that will filter down the From clause results. The join statement is also much more readable.
I would also put an index on showon_id, in addition to the primary keys and foreign key relationship.
My answer differs from wmasm's answer only by moving the showon_id filter up to the inner join. For MS SQL 2k5, I think the interpreter is smart enough to do this automatically, but you always want to work with the smallest result set possible. Bringing your filters up to inner join statements can limit the number of rows the query has to work with when joining many tables together. If you do this though, you should understand that this happens for every row comparison so complex filters (such as like x = '%a' or function calls) are better left for the Where clause so that the inner joins may filter out unnecessary comparisons.

Use an EXISTS. It sometimes is faster, but in my opinion, more readable than a DISTINCT and JOIN. Just for kicks, pls reply with the query plan for this query and the JOIN above, and see if anything is different (they may be optimized down to the same plan). If they are the same, I'd recommend the EXISTS as it is closer to a "plain language" description than a JOIN (because you don't want any of the data from Document, etc.)
SELECT whatever
FROM DocumentType dt
WHERE EXISTS( SELECT *
FROM Document
WHERE dt.id = document_type_id
AND showon_id = 42)
To get the query plan (ref: http://msdn.microsoft.com/en-us/library/ms180765(SQL.90).aspx), do:
SET SHOWPLAN_TEXT ON
GO
SELECT ...
GO

From my point of view it should not make any difference inside SQL Server (but who knows how this is implemented).
Think of it this way: to return the resultset the server needs to go into the Document table and retrieve all document_type_id WHERE showon_id = 42. In the process of retrieving the document_type_ids (e.g. by index seeking) it puts them into a hash table. When this process has finished the hash table will contain distinct values anyway. After that the query execution goes inside the Document_Type table, scans the primary key and probes into the hash table. Note that this depends, e.g. maybe it's more efficient to not use a hash table, when the expected row count from the Document table it low compared to Document_Type, but in general you get the same query plan as for the query wmasm just suggested.

Follow up on Matt's answer:
I've enabled the query plan and tested the following four different queries that have come up so far:
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DISTINCT DocumentType.* FROM DocumentType INNER JOIN Document ON DocumentType.id=Document.document_type_id WHERE Document.showon_id = 42;
SELECT DocumentType.* FROM DocumentType WHERE EXISTS ( SELECT * FROM Document WHERE DocumentType.id=Document.document_type_id AND showon_id = 42);
The query plan for all four queries turned out to be the same:
|--Hash Match(Right Semi Join, HASH:([Document].[document_type_id])=([DocumentType].[Id]))
|--Hash Match(Inner Join, HASH:([Document].[Title], [Uniq1005])=([Document].[Title], [Uniq1005]), RESIDUAL:([Document].[Title] as [Document].[Title] = [Document].[Title] as [Document].[Title] AND [Uniq1005] = [Uniq1005]))
| |--Index Seek(OBJECT:([Document].[IX_Document_3] AS [Document]), SEEK:([Document].[showon_id]=(1)) ORDERED FORWARD)
| |--Index Scan(OBJECT:([Document].[IX_Document_1] AS [Document]))
|--Table Scan(OBJECT:([DocumentType] AS [DocumentType]))
I am not sure what every line and element means, but it seems that from the performance perspective it does not matter how you construct the query for this kind of problem...

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Snowflake: Dynamic Data Masking Impact On query results - snowflake-cloud-data-platform

Related

SQL query plan and minimum row size exceeds the maximum allowable of 8060 bytes

Prefix in field name SQL

Right way to use distinct in SQL Server

Order Of Execution of the SQL query

SQL Server 2005 Performance: Distinct or full table in WHERE IN statement

Categories

Resources