I have a huge view with many queries concatenated using UNION ALL with the first column of every query being constant.
e.g.
CREATE VIEW M AS (
SELECT 'A' ID, Value FROM A
UNION ALL
SELECT 'B' ID, Value FROM B
...
)
The queries are more complex in reality but the purpose here is just to switch on what queries to run like this:
SELECT * FROM M WHERE ID = 'A'
The execution plan shows that the queries that don't match on ID never run.
I thought this was a really nice (feature?) that I could use to combine querying different but similar things through the same view.
However, I'm ending up with an even better execution plan when querying against a CTE like this:
WITH M AS (
SELECT 'A' ID, Value FROM A
UNION ALL
SELECT 'B' ID, Value FROM B
...
)
SELECT * FROM M WHERE ID = 'A'
Here's a partial sample of the actual query:
SELECT CONVERT(char(4), 'T ') EntityTypeID, SystemID, TaskId EntityID
FROM [dbo].[Task]
UNION ALL
SELECT CONVERT(char(4), 'T ') EntityTypeID, s.SystemID, [dbo].[Task].TaskId EntityID
FROM [dbo].[Task]
INNER JOIN [dbo].[System] s ON s.MasterSystemID = [dbo].[Task].SystemID
INNER JOIN SystemEntitySettings ON SystemEntitySettings.SystemID = s.SystemID
AND SystemEntitySettings.EntityTypeID = 'T '
AND SystemEntitySettings.IsSystemPrivate = 0
Given the above T-SQL, if I ran something like WHERE EntityTypeID <> 'T' it would ignore the first query entirely but still do something with the second (never returning any actual rows).
My question is: why can't it eliminate the query entirely from the view, when it does so in the CTE case?
EDIT
I've observed some interesting things so far. I'm not ruling out parameterization as the cause, but I can also achieve the desired effect by either specifying a query hint (apparently any will do) or rewriting the second join as an IN predicate, since it is only a filter.
INNER JOIN SystemEntitySettings ON SystemEntitySettings.SystemID = s.SystemID
AND SystemEntitySettings.EntityTypeID = 'T '
AND SystemEntitySettings.IsSystemPrivate = 0
...becomes...
WHERE s.SystemID IN (
SELECT SystemID
FROM dbo.SystemEntitySettings
WHERE EntityTypeID = 'T ' AND IsSystemPrivate = 0
)
But the following query has the same issue. It appears to be related to the JOIN operations somehow. (NOTE the additional JOIN with [Group] taking place in this query.)
SELECT CONVERT(char(4), 'CF ') EntityTypeID, s.SystemID, [dbo].[CareerForum].GroupID EntityID
FROM [dbo].[CareerForum]
INNER JOIN [dbo].[Group] ON [dbo].[Group].GroupID = [dbo].[CareerForum].GroupID
INNER JOIN [dbo].[System] s ON s.MasterSystemID = [dbo].[Group].SystemID
WHERE s.SystemID IN (SELECT SystemID FROM dbo.SystemEntitySettings WHERE EntityTypeID = 'CF ' AND IsSystemPrivate = 0)
Reproducible
The following script can be used to reproduce the issue. Notice how the execution plan is completely different when the query is run with a query hint, or when the view definition is run as a CTE (the desired result).
CREATE DATABASE test_jan_20
USE test_jan_20
create table source (
x int not null primary key,
)
insert into source values (1)
insert into source values (2)
insert into source values (3)
insert into source values (4)
insert into source values (5)
insert into source values (6)
create table other (
y int not null primary key,
)
insert into other values (1)
insert into other values (2)
insert into other values (3)
insert into other values (4)
insert into other values (5)
insert into other values (6)
GO
create view dummy AS (
SELECT 'A' id, x, NULL y
FROM SOURCE
WHERE x BETWEEN 1 AND 2
UNION ALL
SELECT 'B' id, x, NULL y
FROM SOURCE
WHERE x BETWEEN 3 AND 4
UNION ALL
SELECT 'B' id, source.x, y
FROM SOURCE
INNER JOIN other ON y = source.x
INNER JOIN source s2 ON s2.x = y - 1 --i need this join for the issue to occur in the execution plan
WHERE source.x BETWEEN 5 AND 6
)
GO
--this one fails to remove the JOIN, not OK
SELECT * FROM dummy WHERE id = 'c'
--this is OK
SELECT * FROM dummy WHERE id = 'c' OPTION (HASH JOIN) --NOTE: any query hint seems to do the trick
--this is OK
;
WITH a AS (
SELECT 'A' id, x, NULL y
FROM SOURCE
WHERE x BETWEEN 1 AND 2
UNION ALL
SELECT 'B' id, x, NULL y
FROM SOURCE
WHERE x BETWEEN 3 AND 4
UNION ALL
SELECT 'B' id, source.x, y
FROM SOURCE
INNER JOIN other ON y = source.x
INNER JOIN source s2 ON s2.x = y - 1 --i need this join for the issue to occur in the execution plan
WHERE source.x BETWEEN 5 AND 6
)
SELECT * FROM a WHERE id = 'c'
In your test case this is what is happening.
For the query with the view and the query hint or the CTE the Query Optimiser is using "contradiction detection". You can see in the execution plan properties that the OPTIMIZATION LEVEL is TRIVIAL. The trivial plan churned out is exactly the same as the one shown in point 8 of this article.
For the query with the view without the query hint this gets auto parameterised. This can prevent the contradiction detection from kicking in as covered here.
"The execution plan is showing that the queries that don't match on ID never run."
That is correct since you provided a constant 'A', so the plan is built against the specific string 'A', which cuts off one part.
"The issue I'm having, or rather, my question is, why is it that it cannot eliminate the query entirely from the view, when it does so in the CTE case?"
I thought you just stated that it did? I guess you are using it in a parameterized way, either in an SP, function or parameterized query. This causes a plan to be created that MUST be able to take various parameters - so cutting one part out is out of the question.
To achieve what you want, you would have to generate dynamic SQL that would present the query with a constant value to the query optimizer. This is true whether you use View or inline Table-valued function.
EDIT: following addition of reproducible
These two forms seem to work as well
select * from (SELECT * FROM dummy) y WHERE id = 'c'
with a as (Select * from dummy) SELECT * FROM a WHERE id = 'c'
With your last update, the query is optimized too.
If you provide c (or any other missing value) as a filter, you will get this plan:
|--Compute Scalar(DEFINE:([Union1019]=[Expr1018], [Union1020]=[ee].[dbo].[source].[x], [Union1021]=[ee].[dbo].[other].[y]))
|--Compute Scalar(DEFINE:([Expr1018]='B'))
|--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1022]))
|--Constant Scan
|--Clustered Index Seek(OBJECT:([ee].[dbo].[source].[PK__source__3BD019E5171A1207] AS [s2]), SEEK:([s2].[x]=[Expr1022]) ORDERED FORWARD)
, with the constant scan expanding as follows:
<RelOp AvgRowSize="19" EstimateCPU="1.57E-07" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="0" LogicalOp="Constant Scan" NodeId="3" Parallel="false" PhysicalOp="Constant Scan" EstimatedTotalSubtreeCost="1.57E-07">
<OutputList>
<ColumnReference Database="[ee]" Schema="[dbo]" Table="[source]" Column="x" />
<ColumnReference Database="[ee]" Schema="[dbo]" Table="[other]" Column="y" />
<ColumnReference Column="Expr1022" />
</OutputList>
<ConstantScan />
</RelOp>
In other words, source s and other o are never touched. The constant scan does not produce any real output, so there is no input for the Nested Loops and no actual seeks are performed.
If you substitute the parameter with b, you will see a more complex plan with actual JOIN operations against all three tables.
Related
I'm trying to optimize or completely rewrite this query. It currently takes about 1500 ms to run. I know the DISTINCTs are fairly inefficient, as is the UNION, but I'm struggling to figure out where to go from here.
I am thinking that the first select statement might not be needed to return the output of:
[Key | User_ID,(User_ID)]
Note; Program and Program Scenario are both using Clustered Indexes. I can provide a screenshot of the Execution Plan if needed.
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID] (@_CompKey INT)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @UserIDs AS VARCHAR(8000);
SET @UserIDs = '';
SELECT @UserIDs = @UserIDs + ', ' + x.User_ID
FROM
(SELECT DISTINCT (UPPER(p.User_ID)) as User_ID FROM [dbo].[Program] AS p WITH (NOLOCK)
WHERE p.CompKey = @_CompKey
UNION
SELECT DISTINCT (UPPER(ps.User_ID)) as User_ID FROM [dbo].[Program] AS p WITH (NOLOCK)
LEFT OUTER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK) ON p.ProgKey = ps.ProgKey
WHERE p.CompKey = @_CompKey
AND ps.User_ID IS NOT NULL) x
RETURN Substring(@UserIDs, 3, 8000);
END
There are two things happening in this query
1. Locating rows in the [Program] table matching the specified CompKey (@_CompKey)
2. Locating rows in the [Program_Scenario] table that have the same ProgKey as the rows located in (1) above.
Finally, non-null UserIDs from both these sets of rows are concatenated into a scalar.
For step 1 to be efficient, you'd need an index on the CompKey column (clustered or non-clustered)
For step 2 to be efficient, you'd need an index on the join key which is ProgKey on the Program_Scenario table (this likely is a non-clustered index as I can't imagine ProgKey to be PK). Likely, SQL would resort to a loop join strategy - i.e., for each row found in [Program] matching the CompKey criteria, it would need to lookup corresponding rows in [Program_Scenario] with same ProgKey. This is a guess though, as there is not sufficient information on the cardinality and distribution of data.
Ensure the above two indexes are present.
Also, as others have noted the second left outer join is a bit confusing as an inner join is the right way to deal with it.
Per my interpretation, the inner part of the query can be rewritten this way. This is also the query you'd ideally run and optimize before tackling the string concatenation part. The DISTINCT is dropped as it is automatic with a UNION. Try this version of the query along with the indexes above, and if it provides the necessary boost, then include the string concatenation or the XML STUFF approach to return a scalar.
SELECT UPPER(p.User_ID) as User_ID
FROM
[dbo].[Program] AS p WITH (NOLOCK)
WHERE
p.CompKey = @_CompKey
UNION
SELECT UPPER(ps.User_ID) as User_ID
FROM
[dbo].[Program] AS p WITH (NOLOCK)
INNER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK) ON p.ProgKey = ps.ProgKey
WHERE
p.CompKey = @_CompKey
AND ps.User_ID IS NOT NULL
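To see that UNION alone already removes duplicates (so the extra DISTINCT adds nothing), here is a small sketch using Python's built-in sqlite3 module rather than SQL Server. The sample rows and the user_ids helper are made up for illustration, and the string concatenation is done on the client side:

```python
import sqlite3

# In-memory stand-ins for the two tables; the data is invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Program (ProgKey INTEGER PRIMARY KEY, CompKey INTEGER, User_ID TEXT);
    CREATE TABLE Program_Scenario (ProgKey INTEGER, User_ID TEXT);
    INSERT INTO Program VALUES (1, 7, 'alice'), (2, 7, 'Alice'), (3, 8, 'carol');
    INSERT INTO Program_Scenario VALUES (1, 'bob'), (2, 'ALICE'), (2, NULL), (3, 'dave');
""")

def user_ids(comp_key):
    """Return the distinct upper-cased user ids for one CompKey as 'A, B, ...'."""
    rows = conn.execute("""
        SELECT UPPER(p.User_ID) AS User_ID
        FROM Program AS p
        WHERE p.CompKey = :k
        UNION                      -- UNION (without ALL) de-duplicates the combined result
        SELECT UPPER(ps.User_ID)
        FROM Program AS p
        INNER JOIN Program_Scenario AS ps ON p.ProgKey = ps.ProgKey
        WHERE p.CompKey = :k AND ps.User_ID IS NOT NULL
        ORDER BY User_ID
    """, {"k": comp_key}).fetchall()
    return ", ".join(r[0] for r in rows)

print(user_ids(7))
```

'alice', 'Alice' and 'ALICE' collapse into a single entry even though no DISTINCT is written anywhere.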
I am taking a shot in the dark here. I am guessing that the last code you posted is still a scalar function. It also did not have all the logic of your original query. Again, this is a shot in the dark since there is no table definitions or sample data posted.
This might be how this would look as an inline table valued function.
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID]
(
@_CompKey INT
) RETURNS TABLE AS RETURN
select MyResult = STUFF(
(
-- FOR XML cannot be applied directly to a UNION inside a subquery,
-- so the UNION is wrapped in a derived table first
SELECT ', ' + x.User_ID
FROM (
SELECT UPPER(p.User_ID) as User_ID
FROM dbo.Program AS p
WHERE p.CompKey = @_CompKey
UNION -- UNION already removes duplicates
SELECT UPPER(ps.User_ID) as User_ID
FROM dbo.Program AS p
LEFT OUTER JOIN dbo.Program_Scenario AS ps ON p.ProgKey = ps.ProgKey
WHERE p.CompKey = @_CompKey
AND ps.User_ID IS NOT NULL
) x
for xml path ('')
), 1, 2, '') -- strip the leading ', '
I use PostgreSQL 9.6 and my table schema is as follows: department, key, value1, value2, value3, ... Each department has hundreds of millions of unique keys, but the set of keys is more or less the same for all departments. It's possible that some keys don't exist for some departments, but such situations are rare.
I would like to prepare a report that for two departments points out differences in values for each key (comparison involves some logic based only on values for the key).
My first approach was to write an external tool in python that:
creates a server-side cursor for query: SELECT * FROM my_table WHERE department = 'ABC' ORDER BY key;
creates another server-side cursor for query: SELECT * FROM my_table WHERE department = 'XYZ' ORDER BY key;
iterates over both cursors, and compares the values.
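For reference, the comparison loop in that tool can be sketched as a merge over two key-sorted streams. This is only an illustration: diff_sorted and the (key, values) row shape are made up here, the real comparison logic would go into differ, and it assumes that Python's ordering of the keys matches the ORDER BY used by the database.

```python
def diff_sorted(rows_a, rows_b, differ=lambda a, b: a != b):
    """Yield (key, values_a, values_b) for keys that differ or exist on one side only.

    rows_a and rows_b are iterables of (key, values) tuples, each sorted by key,
    for example two server-side cursors ordered by key.
    """
    it_a, it_b = iter(rows_a), iter(rows_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a[0] < b[0]):
            # Key only present in the first stream.
            yield a[0], a[1], None
            a = next(it_a, None)
        elif a is None or b[0] < a[0]:
            # Key only present in the second stream.
            yield b[0], None, b[1]
            b = next(it_b, None)
        else:
            # Same key on both sides: report it only if the values differ.
            if differ(a[1], b[1]):
                yield a[0], a[1], b[1]
            a, b = next(it_a, None), next(it_b, None)

abc = [(1, (10,)), (2, (20,)), (4, (40,))]
xyz = [(1, (10,)), (2, (21,)), (3, (30,))]
for diff in diff_sorted(abc, xyz):
    print(diff)
```

Because both inputs are consumed lazily, memory use stays constant no matter how many keys each department has.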
It worked fine, but I thought it would be more efficient to perform the comparison inside a stored procedure in PostgreSQL. I wrote a stored procedure that takes two cursors as arguments, iterates over them and compares the values. Any differences are written into a temporary table. At the end, the external tool iterates just over the temporary table, which shouldn't contain many rows.
I thought that the latter approach would be more efficient, because it doesn't require transferring lots of data outside the database.
To my surprise, it turned out to be slower by almost 40%.
To isolate the problem I compared the performance of iterating a cursor inside a stored procedure, and in python:
FETCH cur_1 INTO row_1;
WHILE (row_1 IS NOT NULL) LOOP
rows = rows + 1;
FETCH FROM cur_1 INTO row_1;
END LOOP;
vs.
conn = psycopg2.connect(PG_URI)
cur = conn.cursor('test')
cur.execute(query)
cnt = 0
for row in cur:
cnt += 1
Query was the same in both cases. Again, the external tool was faster.
My hypothesis is that this is because the stored procedure fetches rows one by one (FETCH FROM cur_1 INTO row_1) while the application fetches rows in batches of 2000. But I couldn't find a way to FETCH a batch of rows from a cursor inside a PL/pgSQL procedure, so I can't test the hypothesis.
So my question is: is it possible to speed up my stored procedure?
What is the best approach for problems like this?
Why can you not do a self-join rather than using cursors? Something like:
SELECT t1.key, t1.value1 - t2.value1 as diff1,
t1.value2 - t2.value2 as diff2, ...
FROM my_table t1 inner join my_table t2 on t1.key = t2.key
WHERE t1.department = 'XYZ' and t2.department = 'ABC'
UNION
SELECT t1.key, t1.value1 as diff1,
t1.value2 as diff2, ...
FROM my_table t1 WHERE NOT EXISTS (SELECT 1 FROM my_table t2 WHERE
t1.key = t2.key AND t2.department = 'ABC') AND t1.department = 'XYZ'
UNION
SELECT t1.key, t1.value1 as diff1,
t1.value2 as diff2, ...
FROM my_table t1 WHERE NOT EXISTS (SELECT 1 FROM my_table t2 WHERE
t1.key = t2.key AND t2.department = 'XYZ') AND t1.department = 'ABC';
The first part deals with all the common cases and the two unions pick up the missing values. I would have thought this would be much faster than a cursor approach.
This might be faster as it will only return those rows that are different in at least one of the values:
select *
from (
SELECT key, value1, value2, value3
FROM my_table
WHERE department = 'ABC'
) d1
full join (
SELECT key, value1, value2, value3
FROM my_table
WHERE department = 'XYZ'
) d2 using (key)
where d1 is distinct from d2;
Is there a way to have a column from another table, with a value that is always the same, inside a View? Example:
SELECT *,
(SELECT value FROM tblStudentPrefixes WHERE PrefixName = 'SeniorPrefix')
AS StudentPrefix
FROM tblStudents
Will the above nested query get executed for each row? Is there a way to execute it once and use the result for all rows?
Please note, I'm specifically talking about a View, not a Stored Procedure. I know this can be done in a Stored Procedure.
This actually depends on your table setup. Unless PrefixName is constrained to be unique, you could run into errors where the subquery returns more than one row. If it is not constrained to be unique but happens to be unique for SeniorPrefix, then the subquery will be executed once per row, i.e. 10,000 times in the example below. To demonstrate I have used the following DDL:
CREATE TABLE #tblStudents (ID INT IDENTITY(1, 1), Filler CHAR(100));
INSERT #tblStudents (Filler)
SELECT TOP 10000 NULL
FROM sys.all_objects a, sys.all_objects b;
CREATE TABLE #tblStudentPrefixes (Value VARCHAR(10), PrefixName VARCHAR(20));
INSERT #tblStudentPrefixes (Value, PrefixName) VALUES ('A Value', 'SeniorPrefix');
Running your query gives the following IO output:
Table '#tblStudentPrefixes'. Scan count 10000, logical reads 10000
Table '#tblStudents'. Scan count 1, logical reads 142
The key being the 10,000 logical reads on tblStudentPrefixes. The other problem with it not being constrained to be unique is that if you have duplicates your query will fail with the error:
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
If you can't constrain PrefixName to be unique, then you can stop it executing for each row and avoid the errors by using TOP:
SELECT *,
(SELECT TOP 1 value FROM #tblStudentPrefixes WHERE PrefixName = 'SeniorPrefix' ORDER BY Value)
AS StudentPrefix
FROM #tblStudents
The IO now becomes:
Table '#tblStudentPrefixes'. Scan count 1, logical reads 1
Table '#tblStudents'. Scan count 1, logical reads 142
However, I would still recommend switching to a CROSS JOIN here:
SELECT s.*, p.Value AS StudentPrefix
FROM #tblStudents AS s
CROSS JOIN
( SELECT TOP 1 value
FROM #tblStudentPrefixes
WHERE PrefixName = 'SeniorPrefix'
ORDER BY Value
) AS p;
Inspection of the execution plans shows the sub-select using a table spool, which is unnecessary for a single value.
So in summary, it depends on your table set up whether it will execute for each row, but regardless you are giving the optimiser a better chance if you switch to a cross join.
EDIT
In light of the fact that you need to return rows from tblStudents when there is no match for SeniorPrefix in tblStudentPrefixes, and that PrefixName is not currently constrained to be unique, the best solution is:
SELECT *,
(SELECT MAX(value) FROM #tblStudentPrefixes WHERE PrefixName = 'SeniorPrefix')
AS StudentPrefix
FROM #tblStudents;
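The difference between the two forms when no SeniorPrefix row exists can be checked with a small sketch. It uses Python's built-in sqlite3 module instead of SQL Server, so LIMIT 1 stands in for TOP 1 and the tables are plain tables with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblStudents (ID INTEGER PRIMARY KEY);
    CREATE TABLE tblStudentPrefixes (Value TEXT, PrefixName TEXT);
    INSERT INTO tblStudents (ID) VALUES (1), (2), (3);
    -- Deliberately no 'SeniorPrefix' row.
    INSERT INTO tblStudentPrefixes VALUES ('X', 'JuniorPrefix');
""")

# Scalar subquery with MAX: always yields exactly one value (NULL when nothing matches).
scalar = conn.execute("""
    SELECT s.ID,
           (SELECT MAX(Value) FROM tblStudentPrefixes
            WHERE PrefixName = 'SeniorPrefix') AS StudentPrefix
    FROM tblStudents AS s
    ORDER BY s.ID
""").fetchall()

# CROSS JOIN against an empty derived table: every student row disappears.
cross = conn.execute("""
    SELECT s.ID, p.Value AS StudentPrefix
    FROM tblStudents AS s
    CROSS JOIN (SELECT Value FROM tblStudentPrefixes
                WHERE PrefixName = 'SeniorPrefix'
                ORDER BY Value LIMIT 1) AS p
    ORDER BY s.ID
""").fetchall()

print(scalar)
print(cross)
```

The MAX form keeps all three students with a NULL prefix, while the CROSS JOIN form returns no rows at all, which is why it is ruled out once unmatched students must still be returned.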
If you do constrain it to be unique, then the following 3 queries produce (essentially) the same plan and the same results, it is simply personal preference:
SELECT *,
(SELECT value FROM #tblStudentPrefixes WHERE PrefixName = 'SeniorPrefix')
AS StudentPrefix
FROM #tblStudents;
SELECT s.*, p.Value AS StudentPrefix
FROM #tblStudents AS s
LEFT JOIN #tblStudentPrefixes AS p
ON p.PrefixName = 'SeniorPrefix';
SELECT s.*, p.Value AS StudentPrefix
FROM #tblStudents AS s
OUTER APPLY
( SELECT Value
FROM #tblStudentPrefixes
WHERE PrefixName = 'SeniorPrefix'
) AS p;
I hope I understand your question right, but try this
SELECT *
FROM tblStudents
Outer Apply
(
SELECT value
FROM tblStudentPrefixes
WHERE PrefixName = 'SeniorPrefix'
) as tble
This is OK. The subquery would be executed for every row, though, which could perform badly.
You could try also:
SELECT tblStudents.*,StudentPrefix.value
FROM tblStudents,
(SELECT value
FROM tblStudentPrefixes
WHERE PrefixName = 'SeniorPrefix')StudentPrefix
I've come across a scenario where I need to return a complex set of calculated values at a crossover point from "legacy" to current.
To cut a long story short I have something like this ...
with someofit as
(
select id, col1, col2, col3 from table1
)
select someofit.*,
case when id < @lastLegacyId then
(select ... from table2 where something = id) as 'bla'
,(select ... from table2 where something = id) as 'foo'
,(select ... from table2 where something = id) as 'bar'
else
(select ... from table3 where something = id) as 'bla'
,(select ... from table3 where something = id) as 'foo'
,(select ... from table3 where something = id) as 'bar'
end
from someofit
Now here lies the problem ...
I don't want to be constantly doing that case check for each sub selection but at the same time when that condition applies I need all of the selections within the relevant case block.
Is there a smarter way to do this?
if I was in a proper OO language I would use something like this ...
var common = GetCommonSuff()
foreach (object item in common)
{
if(item.id <= lastLegacyId)
{
AppendLegacyValuesTo(item);
}
else
{
AppendCurrentValuesTo(item);
}
}
I did initially try doing 2 complete selections with a union all but this doesn't work very well due to efficiency / number of rows to be evaluated.
The sub selections are looking for total row counts where some condition is met other than the id match on either table 2 or 3 but those tables may have millions of rows in them.
The cte is used for 2 reasons ...
firstly it pulls only the rows from table 1 i am interested in so straight away im only doing a fraction of the sub selections in each case.
secondly its returning the common stuff in a single lookup on table 1
Any ideas?
EDIT 1 :
Some context to the situation ...
I have a table called "imports" (table 1 above) this represents an import job where we take data from a file (csv or similar) and pull the records in to the db.
I then have a table called "steps" this represents the processing / cleaning rules we go through and each record contains a sproc name and a bunch of other stuff about the rule.
There is then a join table that represents the rule for a particular import "ImportSteps" (table 2 above - for current data), this contains a "rowsaffected" column and the import id
so for the current jobs my sql is quite simple ...
select 123 456
from imports
join importsteps
for the older legacy stuff however I have to look through table 3 ... table 3 is the holding table, it contains every record ever imported, each row has an import id and each row contains key values.
on the new data rowsaffected on table 2 for import id x where step id is y will return my value.
on the legacy data i have to count the rows in holding where col z = something
i need data on about 20 imports and this data is bound to a "datagrid" on my mvc web app (if that makes any difference)
the cte i use determines through some parameters the "current 20 im interested in" those params represent start and end record (ordered by import id).
My biggest issue is that holding table ... it's massive .. individual jobs have been known to contain 500k + records on their own and this table holds years of imported rows so i need my lookups on that table to be as fast as possible and as few as possible.
EDIT 2:
The actual solution (pseudocode only) ...
-- declare and populate the subset to reduce reads on the big holding table
declare @holding table ( ... )
insert into @holding
select .. from holding
select
... common stuff from inner select in "from" below
... bunch of ...
case when id < @legacy then (select getNewValue(id, stepid))
else (select x from @holding where id = ID and ... ) end as 'bla'
from
(
select ROW_NUMBER() over (order by importid desc) as 'RowNum'
, ...
) as I
-- this bit handles the paging
where RowNum >= @StartIndex
and RowNum < @EndIndex
I'm still confident I can clean it up more, but my original query, which looked something like Bill's solution, took about 45 seconds to execute; this takes about 7.
I take it the subqueries must return a single scalar value, correct? This point is important because it is what ensures the LEFT JOINs will not multiply the result.
;with someofit as
(
select id, col1, col2, col3 from table1
)
select someofit.*,
bla = coalesce(t2.col1, t3.col1),
foo = coalesce(t2.col2, t3.col2),
bar = coalesce(t2.bar, t3.bar)
from someofit
left join table2 t2 on t2.something=someofit.id and someofit.id < @lastLegacyId
left join table3 t3 on t3.something=someofit.id and someofit.id >= @lastLegacyId
Beware that I have used id >= @lastLegacyId as the complement of the condition, by assuming that id is not nullable. If it is, you need an IsNull there, i.e. someofit.id >= isnull(@lastLegacyId,someofit.id).
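A minimal way to check the behaviour of the two mutually exclusive LEFT JOINs is to run them against toy data. The sketch below uses Python's built-in sqlite3 module rather than SQL Server; the bla column and the sample rows are invented, and the named parameter :cut plays the role of @lastLegacyId:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, col1 TEXT);
    CREATE TABLE table2 (something INTEGER, bla TEXT);
    CREATE TABLE table3 (something INTEGER, bla TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- Both lookup tables hold a row for every id, to show that only one is used per row.
    INSERT INTO table2 VALUES (1, 't2-1'), (2, 't2-2'), (3, 't2-3');
    INSERT INTO table3 VALUES (1, 't3-1'), (2, 't3-2'), (3, 't3-3');
""")

last_legacy_id = 2  # hypothetical cut-over id

rows = conn.execute("""
    SELECT s.id, COALESCE(t2.bla, t3.bla) AS bla
    FROM table1 AS s
    LEFT JOIN table2 AS t2 ON t2.something = s.id AND s.id <  :cut
    LEFT JOIN table3 AS t3 ON t3.something = s.id AND s.id >= :cut
    ORDER BY s.id
""", {"cut": last_legacy_id}).fetchall()

print(rows)
```

Ids below the cut-over pick up their value from table2 and all others from table3; because the two join conditions can never both hold for the same row, COALESCE only ever sees one non-NULL candidate.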
Your edit to the question doesn't change the fact that this is an almost literal translation of the O-O syntax.
foreach (object item in common) --> "from someofit"
{
if(item.id <= lastLegacyId) --> the precondition to the t2 join
{
AppendLegacyValuesTo(item); --> putting t2.x as first argument of coalesce
}
else --> sql would normally join to both tables
--> hence we need an explicit complement
--> condition as an "else" clause
{
AppendCurrentValuesTo(item); --> putting t3.x as 2nd argument
--> tbh, the order doesn't matter since t2/t3
--> are mutually exclusive
}
}
function AppendCurrentValuesTo --> the correlation between t2/t3 to someofit.id
Now, if you have actually tried this and it doesn't solve your problem, I'd like to know where it broke.
Assuming you know that there are no conflicting ID's between the two tables, you can do something like this (DB2 syntax, because that's what I know, but it should be similar):
with combined_tables as (
select ... as id, ... as bla, ...as bar, ... as foo from table2
union all
select ... as id, ... as bla, ...as bar, ... as foo from table3
)
select someofit.*, combined_tables.bla, combined_tables.foo, combined_tables.bar
from someofit
join combined_tables on someofit.id = combined_tables.id
If you had cases like overlapping ids, you could handle that within the combined_tables() section
I'm trying to select records with a statement
SELECT *
FROM A
WHERE
LEFT(B, 5) IN
(SELECT * FROM
(SELECT LEFT(A.B,5), COUNT(DISTINCT A.C) c_count
FROM A
GROUP BY LEFT(B,5)
) p1
WHERE p1.c_count = 1
)
AND C IN
(SELECT * FROM
(SELECT A.C , COUNT(DISTINCT LEFT(A.B,5)) b_count
FROM A
GROUP BY C
) p2
WHERE p2.b_count = 1)
which takes a long time to run ~15 sec.
Is there a better way of writing this SQL?
If you would like to represent set difference (A-B) in SQL, here is a solution for you.
Let's say you have two tables A and B, and you want to retrieve all records that exist only in A but not in B, where A and B have a relationship via an attribute named ID.
An efficient query for this is:
-- (A-B)
SELECT DISTINCT A.* FROM (A LEFT OUTER JOIN B on A.ID=B.ID) WHERE B.ID IS NULL
-from Jayaram Timsina's blog.
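As a quick sanity check of the anti-join pattern, the following sketch runs it next to an equivalent NOT EXISTS query using Python's built-in sqlite3 module. The tables and rows are made up for illustration; B deliberately contains a duplicate ID:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER, name TEXT);
    CREATE TABLE B (ID INTEGER);
    INSERT INTO A VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO B VALUES (2), (2);
""")

# LEFT JOIN ... IS NULL keeps only the A rows that found no partner in B.
left_join = conn.execute("""
    SELECT DISTINCT A.*
    FROM A LEFT OUTER JOIN B ON A.ID = B.ID
    WHERE B.ID IS NULL
    ORDER BY A.ID
""").fetchall()

# The same set difference written with NOT EXISTS.
not_exists = conn.execute("""
    SELECT A.*
    FROM A
    WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.ID = A.ID)
    ORDER BY A.ID
""").fetchall()

print(left_join, not_exists)
```

Both queries return the rows with ID 1 and 3. Which form performs better depends on the database and the available indexes, so it is worth comparing the plans on the real tables.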
You don't need to return data from the nested subqueries. I'm not sure this will make a difference without indexing, but it's easier to read.
And EXISTS/JOIN is probably nicer IMHO than using IN.
SELECT *
FROM
A
JOIN
(SELECT LEFT(B,5) AS b1
FROM A
GROUP BY LEFT(B,5)
HAVING COUNT(DISTINCT C) = 1
) t1 On LEFT(A.B, 5) = t1.b1
JOIN
(SELECT C AS C1
FROM A
GROUP BY C
HAVING COUNT(DISTINCT LEFT(B,5)) = 1
) t2 ON A.C = t2.c1
But you'll need a computed column as marc_s said at least
And 2 indexes: one on (computed, C) and another on (C, computed)
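The JOIN/HAVING rewrite can be tried out on a handful of rows. The sketch below uses Python's built-in sqlite3 module, so substr(B, 1, 5) stands in for LEFT(B, 5); the sample data is invented and no computed column or index is created here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (B TEXT, C TEXT);
    INSERT INTO A VALUES
        ('abcde1', 'c1'), ('abcde2', 'c1'),   -- prefix abcde <-> c1 only
        ('fghij1', 'c2'), ('fghij2', 'c3'),   -- prefix fghij maps to two C values
        ('klmno1', 'c3');                     -- c3 maps to two prefixes
""")

rows = conn.execute("""
    SELECT A.B, A.C
    FROM A
    JOIN (SELECT substr(B, 1, 5) AS b1
          FROM A
          GROUP BY substr(B, 1, 5)
          HAVING COUNT(DISTINCT C) = 1
         ) t1 ON substr(A.B, 1, 5) = t1.b1
    JOIN (SELECT C AS c1
          FROM A
          GROUP BY C
          HAVING COUNT(DISTINCT substr(B, 1, 5)) = 1
         ) t2 ON A.C = t2.c1
    ORDER BY A.B
""").fetchall()

print(rows)
```

Only the rows whose prefix maps to exactly one C, and whose C maps to exactly one prefix, survive both joins.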
Well, not sure what you're really trying to do here - but obviously, that LEFT(B, 5) expression keeps popping up. Since you're using a function, you're giving up any chance to use an index.
What you could do in your SQL Server table is to create a computed, persisted column for that expression, and then put an index on that:
ALTER TABLE A
ADD LeftB5 AS LEFT(B, 5) PERSISTED
CREATE NONCLUSTERED INDEX IX_LeftB5 ON dbo.A(LeftB5)
Now use the new computed column LeftB5 instead of LEFT(B, 5) anywhere in your query - that should help to speed up certain lookups and GROUP BY operations.
Also - you have a GROUP BY C in there - is that column C indexed?
If you are looking for just the set difference between table1 and table2, the query below is a simple way to get the rows that are in table1 but not in table2, provided both tables share the same schema with column names columnone, columntwo, ...
with
col1 as (
select columnone from table2
),
col2 as (
select columntwo from table2
)
...
select * from table1
where (
columnone not in (select columnone from col1)
and columntwo not in (select columntwo from col2)
...
);