Rewrite SQL Query- I need to replace NOT IN with Join - sql-server

I have a query in my production environment which is taking long time to execute. I did not write this query but I must find a way to make it quicker since it is causing a big performance issue at the moment. I need to replace NOT IN with Left Join but not sure how to rewrite it. It looks like following at the moment
SELECT TOP 1 IT.ITEMID
FROM (SELECT CAST(ITEMID AS NUMERIC) + 1 ITEMID
FROM Items
WHERE ISNUMERIC(ITEMID) = 1
AND CAST(ITEMID AS NUMERIC) >= 50000) IT
WHERE IT.ITEMID NOT IN (SELECT CAST(ITEMID AS NUMERIC) ITEMID
FROM Items
WHERE ISNUMERIC(ITEMID) = 1)
ORDER BY IT.ITEMID
Kindly suggest how am I supposed to rewrite it using Left Join for better performance. Any help/guidance is greatly appreciated.

Try this one -
;WITH cte AS
(
SELECT DISTINCT ITEMID =
CASE WHEN ISNUMERIC(ITEMID) = 1
THEN ITEMID
END
FROM Items
)
SELECT TOP 1 ITEMID = ITEMID + 1
FROM cte t
WHERE ITEMID >= 50000
AND NOT EXISTS(
SELECT 1
FROM cte t2
WHERE t.ITEMID + 1 = t2.ITEMID
)
ORDER BY t.ITEMID

As mentioned in the comments, the NOT EXISTS version of the query is usually faster in SQLServer than the LEFT JOIN - for completeness, here's both versions:
Left join variant of existing query:
with cte as
(SELECT CAST(it.ITEMID AS NUMERIC) ITEMID
FROM Items
WHERE ISNUMERIC(ITEMID) = 1)
select top 1 i.ITEMID + 1 ITEMID
FROM cte i
LEFT JOIN cte ni ON i.ITEMID + 1 = ni.ITEMID
WHERE i.ITEMID >= 50000 AND ni.ITEMID IS NULL
Not exists variant of existing query:
with cte as
(SELECT CAST(it.ITEMID AS NUMERIC) ITEMID
FROM Items
WHERE ISNUMERIC(ITEMID) = 1)
select top 1 i.ITEMID + 1 ITEMID
FROM cte i
WHERE i.ITEMID >= 50000 AND NOT EXISTS
(SELECT NULL
FROM cte ni
WHERE i.ITEMID + 1 = ni.ITEMID)

As #gbn pointed at the comments, the CAST and functions on predicates which invalidates index use anyway, so there is no point in converting this from NOT IN to LEFT JOIN / IS NULL or to NOT EXISTS. And NOT EXISTS usually performs better than LEFT NULL in SQL-Server.
NOT IN is not advised due to the problems (wrong, unexpected results) when there are nulls (in the compared columns or produced by the expressions) and the inefficient plans because of the nullability of the columns/expessions.
And ISNUMERIC() is not doing always what you think it does (as # Damien_The_Unbeliever noted in another comment.) There are cases where the IsNumeric result is 1 but the cast fails.
So, the sane thing to do would be - in my opinion - to add another column in the table and convert (the values that can be converted) to numeric and store them in that column. Then you could write the query without casting and an index on that column could be used.
If you cannot alter the tables in any way (by adding a new column or a materialized view), then you can try and test the various rewritings the other answers offer.

I agree with #ypercube that the sane thing to do is to fix your schema.
If for some reason this is not an option maybe materialising the whole thing into an indexed temporary table at runtime would make the best of a bad job.
CREATE TABLE #T
(
ITEMID NUMERIC(18,0) PRIMARY KEY
WITH ( IGNORE_DUP_KEY = ON)
)
INSERT INTO #T
SELECT CASE WHEN ISNUMERIC(ITEMID) = 1 THEN ITEMID END
FROM Items
WHERE CASE WHEN ISNUMERIC(ITEMID) = 1 THEN ITEMID END >= 50000
SELECT TOP 1 ITEMID+1
FROM #T T1
WHERE NOT EXISTS (SELECT * FROM #T T2 WHERE T2.ITEMID = T1.ITEMID +1)
ORDER BY ITEMID

Related

How to optimize the insert query from multiple tables?

I have 2 tables, Table 1 (temp table in SP) has around 400 records. Table 2 has around 30,550,284 records.
I need to run a loop on table 1 for each record and get the top 1 from table 2 based on a few conditions (where clause) and then order by modified date in decreasing order.
There is an index on the modified date.
declare #iPos int;
declare #iCount int;
select #iCount = count(*) from Table1;
set #iPos = 1;
declare #Table2 table(......)
declare #timestampLocal2 datetime
while (#iPos <= #iCount)
BEGIN
select #val1 = Col1, #timestampLocal = TimeStamp
from #Table1 where ID = #iPos
set #timestampLocal2 = DATEADD(HH,-96,#timestampLocal)
INSERT INTO #Temp3 ( .... ),....)
select top 1 r.LastModified, r.[Col2], r.Col3, #iPos
from Table2 (NOLOCK) r
where Col1 =#val1 and
r.LastModified <= #timestampLocal
and r.LastModified >= #timestampLocal2
and (r.Col2 is not null and r.Col3 is not null)
order by LastModified desc
SELECT #iPos = #iPos + 1;
END
This query is very slow.
I have also thought to archive table 2, But I want to keep that as the second option for now.
Do I really need to add an index on the columns which are involved in the where clause?
So my question is, in terms of performance is there a better way to do this?
I believe a CROSS APPLY or OUTER APPLY may do the trick. These can be thought of as being similar to INNER JOIN or LEFT JOIN, except that they allow you to reference a subquery having more complex conditions such as TOP 1 and ORDER BY. Ideal for cases like this.
-- INSERT INTO #Temp3 ( .... )
select r.LastModified, r.[Col2], r.Col3, t1.ID
from #Table1 t1
cross apply (
SELECT TOP 1 r.*
from Table2 r -- Don't use (NOLOCK)
where r.Col1 = t.Col1
and r.LastModified <= t1.[TimeStamp]
and r.LastModified >= DATEADD(HH,-96,t1.[TimeStamp])
and (r.Col2 is not null and r.Col3 is not null)
order by r.LastModified desc
) r
For efficiency, I recommend an index on Table2(Col1,LastModified) or as an absolute minimum, an index on Table2(Col1).
I would strongly discourage the use of (NOLOCK) or 'READ UNCOMMITTED` in queries that update the database (like the insert into table3 above). While the query may appear to work most of the time, seemingly random occurrences of missing or duplicate rows may result.
Do you need to handle cases where no matching Table2 record is found? The above will quietly ignore such cases. Changing the CROSS APPLY to an OUTER APPLY together with logic to handle null r.xxx values could be what you need.

Creating permutation via recursive CTE in SQL server?

Looking at :
;WITH cte AS(
SELECT 1 AS x UNION
SELECT 2 AS x UNION
SELECT 3 AS x
)
I can create permutation table for all 3 values :
SELECT T1.x , y=T2.x , z=t3.x
FROM cte T1
JOIN cte T2
ON T1.x != T2.x
JOIN cte T3
ON T2.x != T3.x AND T1.x != T3.x
This uses the power of SQL's cartesian product plus eliminating equal values.
OK.
But is it possible to enhance this recursive pseudo CTE :
;WITH cte AS(
SELECT 1 AS x , 2 AS y , 3 AS z
UNION ALL
...
)
SELECT * FROM cte
So that it will yield same result as :
NB there are other solutions in SO that uses recursive CTE , but it is not spread to columns , but string representation of the permutations
I tried to do the lot in a CTE.
However trying to "redefine" a rowset dynamically is a little tricky. While the task is relatively easy using dynamic SQL doing it without poses some issues.
While this answer may not be the most efficient or straight forward, or even correct in the sense that it's not all CTE it may give others a basis to work from.
To best understand my approach read the comments, but it might be worthwhile looking at each CTE expression in turn with by altering the bit of code below in the main block, with commenting out the section below it.
SELECT * FROM <CTE NAME>
Good luck.
IF OBJECT_ID('tempdb..#cteSchema') IS NOT NULL
DROP Table #cteSchema
GO
-- BASE CTE
;WITH cte AS( SELECT 1 AS x, 2 AS y, 3 AS z),
-- So we know what columns we have from the CTE we extract it to XML
Xml_Schema AS ( SELECT CONVERT(XML,(SELECT * FROM cte FOR XML PATH(''))) AS MySchema ),
-- Next we need to get a list of the columns from the CTE, by querying the XML, getting the values and assigning a num to the column
MyColumns AS (SELECT D.ROWS.value('fn:local-name(.)','SYSNAME') AS ColumnName,
D.ROWS.value('.','SYSNAME') as Value,
ROW_NUMBER() OVER (ORDER BY D.ROWS.value('fn:local-name(.)','SYSNAME')) AS Num
FROM Xml_Schema
CROSS APPLY Xml_Schema.MySchema.nodes('/*') AS D(ROWS) ),
-- How many columns we have in the CTE, used a coupld of times below
ColumnStats AS (SELECT MAX(NUM) AS ColumnCount FROM MyColumns),
-- create a cartesian product of the column names and values, so now we get each column with it's possible values,
-- so {x=1, x =2, x=3, y=1, y=2, y=3, z=1, z=2, z=3} -- you get the idea.
PossibleValues AS (SELECT MyC.ColumnName, MyC.Num AS ColumnNum, MyColumns.Value, MyColumns.Num,
ROW_NUMBER() OVER (ORDER BY MyC.ColumnName, MyColumns.Value, MyColumns.Num ) AS ID
FROM MyColumns
CROSS APPLY MyColumns MyC
),
-- Now we have the possibly values of each "column" we now have to concat the values together using this recursive CTE.
AllRawXmlRows AS (SELECT CONVERT(VARCHAR(MAX),'<'+ISNULL((SELECT ColumnName FROM MyColumns WHERE MyColumns.Num = 1),'')+'>'+Value) as ConcatedValue, Value,ID, Counterer = 1 FROM PossibleValues
UNION ALL
SELECT CONVERT(VARCHAR(MAX),CONVERT(VARCHAR(MAX), AllRawXmlRows.ConcatedValue)+'</'+(SELECT ColumnName FROM MyColumns WHERE MyColumns.Num = Counterer)+'><'+(SELECT ColumnName FROM MyColumns WHERE MyColumns.Num = Counterer+1)+'>'+CONVERT(VARCHAR(MAX),PossibleValues.Value)) AS ConcatedValue, PossibleValues.Value, PossibleValues.ID,
Counterer = Counterer+1
FROM AllRawXmlRows
INNER JOIN PossibleValues ON AllRawXmlRows.ConcatedValue NOT LIKE '%'+PossibleValues.Value+'%' -- I hate this, there has to be a better way of making sure we don't duplicate values....
AND AllRawXmlRows.ID <> PossibleValues.ID
AND Counterer < (SELECT ColumnStats.ColumnCount FROM ColumnStats)
),
-- The above made a list but was missing the final closing XML element. so we add it.
-- we also restict the list to the items that contain all columns, the section above builds it up over many columns
XmlRows AS (SELECT DISTINCT
ConcatedValue +'</'+(SELECT ColumnName FROM MyColumns WHERE MyColumns.Num = Counterer)+'>'
AS ConcatedValue
FROM AllRawXmlRows WHERE Counterer = (SELECT ColumnStats.ColumnCount FROM ColumnStats)
),
-- Wrap the output in row and table tags to create the final XML
FinalXML AS (SELECT (SELECT CONVERT(XML,(SELECT CONVERT(XML,ConcatedValue) FROM XmlRows FOR XML PATH('row'))) FOR XML PATH('table') )as XMLData),
-- Prepare a CTE that represents the structure of the original CTE with
DataTable AS (SELECT cte.*, XmlData
FROM FinalXML, cte)
--SELECT * FROM <CTE NAME>
-- GETS destination columns with XML data.
SELECT *
INTO #cteSchema
FROM DataTable
DECLARE #XML VARCHAR(MAX) ='';
SELECT #Xml = XMLData FROM #cteSchema --Extract XML Data from the
ALTER TABLE #cteSchema DROP Column XMLData -- Removes the superflous column
DECLARE #h INT
EXECUTE sp_xml_preparedocument #h OUTPUT, #XML
SELECT *
FROM OPENXML(#h, '/table/row', 2)
WITH #cteSchema -- just use the #cteSchema to define the structure of the xml that has been constructed
EXECUTE sp_xml_removedocument #h
How about translating 1,2,3 into a column, which will look exactly like the example you started from, and use the same approach ?
;WITH origin (x,y,z) AS (
SELECT 1,2,3
), translated (x) AS (
SELECT col
FROM origin
UNPIVOT ( col FOR cols IN (x,y,z)) AS up
)
SELECT T1.x , y=T2.x , z=t3.x
FROM translated T1
JOIN translated T2
ON T1.x != T2.x
JOIN translated T3
ON T2.x != T3.x AND T1.x != T3.x
ORDER BY 1,2,3
If I understood correctly the request, this might just do the trick.
And to run it on more columns, just need to add them origin cte definition + unpivot column list.
Now, i dont know how you pass your 1 - n values for it to be dynamic, but if you tell me, i could try edit the script to be dynamic too.

How to develop a recursive CTE in T-SQL?

I am new to recursive CTEs. I am trying to develop a CTE which will return all of the employees under each manager name. So I have two tables: people_rv and staff_rv
People_rv table contains all of the people, both managers and employees. Staff_rv only contains manager information. Uniqueidentifier staff values are stored in Staff_rv. Uniqueidentifier employee values are stored in people_rv. People_rv contains varchar first and last name values for both managers and employees.
But when I run the following CTE I get an error:
WITH
cteStaff (ClientID, FirstName, LastName, SupervisorID, EmpLevel)
AS
(
SELECT p.people_id, p.first_name, p.last_name, s.supervisor_id,1
FROM people_rv p JOIN staff_rv s on s.people_id = p.people_id
WHERE s.supervisor_id = '95E16819-8C3A-4098-9430-08F0E3B764E1'
UNION ALL
SELECT p2.people_id, p2.first_name, p2.last_name, s2.supervisor_id, r.EmpLevel + 1
FROM people_rv p2 JOIN staff_rv s2 on s2.people_id = p2.people_id
INNER JOIN cteStaff r on s2.staff_id = r.ClientID
)
SELECT
FirstName + ' ' + LastName AS FullName,
EmpLevel,
(SELECT first_name + ' ' + last_name FROM people_rv p join staff_rv s on s.people_id = p.people_id
WHERE s.staff_id = cteStaff.SupervisorID) AS Manager
FROM cteStaff
OPTION (MAXRECURSION 0);
My output is:
Barbara G 1 Melanie K
Dawn P 1 Melanie K
Garrett M 1 Melanie K
Stephanie P 1 Melanie K
Amanda F 1 Melanie K
Amanda T 1 Melanie K
Stephanie G 1 Melanie K
Carlos H 1 Melanie K
So it is not iterating any more than the first level. Why not?
Melanie is the top most supervisor, but each of the persons in the leftmost column are also supervisors. So this query should also return level 2.
You may be in an infinite loop with your join. I would check how many levels you expect the table to actually go down. Generally you join a recursion on something similar to do
ID = ParentID
of something either contained in a table or in an expression. Keep in mind you can also create a CTE prior to a recursive CTE if you have to make up your relationship.
Here is an example that will self execute, it may help.
Declare #table table ( PersonId int identity, PersonName varchar(512), Account int, ParentId int, Orders int);
insert into #Table values ('Brett', 1, NULL, 1000),('John', 1, 1, 100),('James', 1, 1, 200),('Beth', 1, 2, 300),('John2', 2, 4, 400);
select
PersonID
, PersonName
, Account
, ParentID
from #Table
; with recursion as
(
select
t1.PersonID
, t1.PersonName
, t1.Account
--, t1.ParentID
, cast(isnull(t2.PersonName, '')
+ Case when t2.PersonName is not null then '\' + t1.PersonName else t1.PersonName end
as varchar(255)) as fullheirarchy
, 1 as pos
, cast(t1.orders +
isnull(t2.orders,0) -- if the parent has no orders than zero
as int) as Orders
from #Table t1
left join #Table t2 on t1.ParentId = t2.PersonId
union all
select
t.PersonID
, t.PersonName
, t.Account
--, t.ParentID
, cast(r.fullheirarchy + '\' + t.PersonName as varchar(255))
, pos + 1 -- increases
, r.orders + t.orders
from #Table t
join recursion r on t.ParentId = r.PersonId
)
, b as
(
select *, max(pos) over(partition by PersonID) as maxrec -- I find the maximum occurrence of position by person
from recursion
)
select *
from b
where pos = maxrec -- finds the furthest down tree
-- and Account = 2 -- I could find just someone from a different department
Your problem as far as I can tell is is you have no join connecting managers to their employees.
This join
INNER JOIN cteStaff r on r.StaffID = s2.staff_id
Just joins the same initial level 1 staffer back to himself.
UPDATE:
Still not quite right! You have a supervisor_id, but again you're still not actually using that to join back to the CTE.
So for each recursion of this CTE you need to (excluding the name join):
select {Level 1 Boss}, NULL (no supervisor)
union
select {new employee}, {that employee's boss}
So the join must connect the CTE's ClientID (the level 1 boss) to the second UNION query's supervisor field, which looks to be supervisor_id , not staff_id.
The JOIN to accomplish this second task is (from what I can tell of your staff_rv table schema):
SELECT p2.people_id, p2.first_name, p2.last_name, s2.supervisor_id, r.EmpLevel + 1
FROM people_rv p2 JOIN staff_rv s2 on s2.people_id = p2.people_id
INNER JOIN cteStaff r on s2.supervisor_id = r.ClientID
Note the bottom join joins the r.ClientID (the level 1 boss) to the staffer's supervisor_id field.
(NB: I think your staff_id and supervisor_id's mimic your people_id values from the people_rv table, so this join should work fine. But if they are different (i.e. a staffer's supervisor_id isn't that supervisor's people_id) then you'll need to write the join such that the staffer's supervisor_id can be joined to their people_id you're storing as ClientID in the CTE.)
Here's a good simple Recursive CTE to review (it may not be the answer, but someone else searching on how to make a recursive CTE may need it):
-- Recursive CTE
;
WITH Years ( myYear )
AS (
-- Base case
SELECT DATEPART(year, GETDATE())
UNION ALL
-- Recursive
SELECT Years.myYear - 1
FROM Years
WHERE Years.myYear >= 2002
)
SELECT *
FROM Years
Note that this probably won't solve your problem, but is a means to hopefully seeing where you're going wrong in the original query.
The default is 100 levels of recursion - you can set it to unlimited by using the MAXRECURSION query hint where you're selecting from your CTE:
...
FROM cteStaff
OPTION (MAXRECURSION 0);
From MSDN:
MAXRECURSION number
Specifies the maximum number of recursions allowed for this query. number is a nonnegative integer between 0 and 32767. When 0 is
specified, no limit is applied. If this option is not specified, the
default limit for the server is 100.
When the specified or default number for MAXRECURSION limit is reached during query execution, the query is ended and an error is
returned.
Because of this error, all effects of the statement are rolled back. If the statement is a SELECT statement, partial results or no
results may be returned. Any partial results returned may not include
all rows on recursion levels beyond the specified maximum recursion
level.

SQL Server - Insert into with select and union - duplicates being inserted

When I execute my "select union select", I get the correct number or rows (156)
Executed independently, select #1 returns 65 rows and select #2 returns 138 rows.
When I use this "select union select" with an Insert into, I get 203 rows (65+138) with duplicates.
I would like to know if it is my code structure that is causing this issue ?
INSERT INTO dpapm_MediaObjectValidation (mediaobject_id, username, checked_date, expiration_date, notified)
(SELECT FKMediaObjectId, CreatedBy,#checkdate,dateadd(ww,2,#checkdate),0
FROM dbo.gs_MediaObjectMetadata
LEFT JOIN gs_MediaObject mo
ON gs_MediaObjectMetadata.FKMediaObjectId = mo.MediaObjectId
WHERE UPPER([Description]) IN ('CAPTION','TITLE','AUTHOR','DATE PHOTO TAKEN','KEYWORDS')
AND FKMediaObjectId >=
(SELECT TOP 1 MediaObjectId
FROM dbo.gs_MediaObject
WHERE DateAdded > #lastcheck
ORDER BY MediaObjectId)
GROUP BY FKMediaObjectId, CreatedBy
HAVING count(*) < 5
UNION
SELECT FKMediaObjectId, CreatedBy,getdate(),dateadd(ww,2,getdate()),0
FROM gs_MediaObjectMetadata yt
LEFT JOIN gs_MediaObject mo
ON yt.FKMediaObjectId = mo.MediaObjectId
WHERE UPPER([Description]) = 'KEYWORDS'
AND FKMediaObjectId >=
(SELECT TOP 1 MediaObjectId
FROM dbo.gs_MediaObject
WHERE DateAdded > #lastcheck
ORDER BY MediaObjectId)
AND NOT EXISTS
(
SELECT *
FROM dbo.fnSplit(Replace(yt.Value, '''', ''''''), ',') split
WHERE split.item in (SELECT KeywordEn FROM gs_Keywords) or split.item in (SELECT KeywordFr FROM gs_Keywords)
)
)
I would appreciate any clues into resolving this problem ...
Thank you !
The UNION keyword should only return distinct records between the two queries. However, if I recall correctly, this is only true if the datatypes are the same. The date variables might be throwing that off. Depending on the collation type, whitespace might be handled differently as well. You might want to do a SELECT DISTINCT on the dpapm_MediaObjectValidation table after doing your insert, but be sure to trim whitespace from both sides in your comparison. Another approach is to do your first insert, then on your second insert, forgo the UNION altogether and do a manual EXISTS check to see if the items to be inserted already exist.

set difference in SQL query

I'm trying to select records with a statement
SELECT *
FROM A
WHERE
LEFT(B, 5) IN
(SELECT * FROM
(SELECT LEFT(A.B,5), COUNT(DISTINCT A.C) c_count
FROM A
GROUP BY LEFT(B,5)
) p1
WHERE p1.c_count = 1
)
AND C IN
(SELECT * FROM
(SELECT A.C , COUNT(DISTINCT LEFT(A.B,5)) b_count
FROM A
GROUP BY C
) p2
WHERE p2.b_count = 1)
which takes a long time to run ~15 sec.
Is there a better way of writing this SQL?
If you would like to represent Set Difference (A-B) in SQL, here is solution for you.
Let's say you have two tables A and B, and you want to retrieve all records that exist only in A but not in B, where A and B have a relationship via an attribute named ID.
An efficient query for this is:
# (A-B)
SELECT DISTINCT A.* FROM (A LEFT OUTER JOIN B on A.ID=B.ID) WHERE B.ID IS NULL
-from Jayaram Timsina's blog.
You don't need to return data from the nested subqueries. I'm not sure this will make a difference withiut indexing but it's easier to read.
And EXISTS/JOIN is probably nicer IMHO then using IN
SELECT *
FROM
A
JOIN
(SELECT LEFT(B,5) AS b1
FROM A
GROUP BY LEFT(B,5)
HAVING COUNT(DISTINCT C) = 1
) t1 On LEFT(A.B, 5) = t1.b1
JOIN
(SELECT C AS C1
FROM A
GROUP BY C
HAVING COUNT(DISTINCT LEFT(B,5)) = 1
) t2 ON A.C = t2.c1
But you'll need a computed column as marc_s said at least
And 2 indexes: one on (computed, C) and another on (C, computed)
Well, not sure what you're really trying to do here - but obviously, that LEFT(B, 5) expression keeps popping up. Since you're using a function, you're giving up any chance to use an index.
What you could do in your SQL Server table is to create a computed, persisted column for that expression, and then put an index on that:
ALTER TABLE A
ADD LeftB5 AS LEFT(B, 5) PERSISTED
CREATE NONCLUSTERED INDEX IX_LeftB5 ON dbo.A(LeftB5)
Now use the new computed column LeftB5 instead of LEFT(B, 5) anywhere in your query - that should help to speed up certain lookups and GROUP BY operations.
Also - you have a GROUP BY C in there - is that column C indexed?
If you are looking for just set difference between table1 and table2,
the below query is simple that gives the rows that are in table1, but not in table2, such that both tables are instances of the same schema with column names as
columnone, columntwo, ...
with
col1 as (
select columnone from table2
),
col2 as (
select columntwo from table2
)
...
select * from table1
where (
columnone not in col1
and columntwo not in col2
...
);

Resources