Nested Loops performance issue on a very simple query - sql-server

I have a very simple table, a very simple INNER JOIN query, and a huge number of rows.
IF OBJECT_ID('tempdb..#blackIPAndMACs') IS NOT NULL
DROP TABLE #blackIPAndMACs
CREATE TABLE #blackIPAndMACs
(
ResourceID dsidentifier,
MACAddress VARCHAR(500),
IPAddress VARCHAR(50)
)
CREATE INDEX #blackIPAndMACs_idx1 ON #blackIPAndMACs(MACAddress)
CREATE INDEX #blackIPAndMACs_idx2 ON #blackIPAndMACs(IPAddress)
CREATE INDEX #blackIPAndMACs_idx3 ON #blackIPAndMACs(MACAddress, IPAddress)
CREATE INDEX #blackIPAndMACs_idx4 ON #blackIPAndMACs(ResourceID)
After this table has been filled with 2,514,000 rows, I am trying to find all ResourceIDs that were accessed from the same IP or MAC address:
SELECT b1.*,
b2.*
FROM #blackIPAndMACs b1 with(NOLOCK, INDEX=#blackIPAndMACs_idx3)
INNER JOIN #blackIPAndMACs b2 with(NOLOCK, INDEX=#blackIPAndMACs_idx3)
ON (
b1.MACAddress = b2.MACAddress
OR b1.IPAddress = b2.IPAddress
)
WHERE 1 = 1
As a result, this query runs (possibly) forever. Our server is really powerful. I don't think I can disclose the details, but I can say that its RAM is counted in hundreds of GB.
What kind of optimization should I use to speed up the query execution?
Update 1:
OK, I removed the OR and changed the SELECT to count(b1.ResourceID), but that didn't solve the issue. Even this simple query takes too long:
SELECT count (b1.ResourceID)
FROM #blackIPAndMACs b1 with(NOLOCK)
INNER JOIN #blackIPAndMACs b2 with(NOLOCK)
ON b1.MACAddress = b2.MACAddress
WHERE 1 = 1
AND b1.ResourceID != b2.ResourceID

As a matter of habit I would refrain from using SELECT *, even if you need a whole bunch of fields in the result. Having said that, my approach would be something like this:
SELECT
b1.ResourceID
, b2.MACAddress
, b2.IPAddress
, b3.MACAddress
, b3.IPAddress
FROM #blackIPAndMACs AS b1
LEFT JOIN #blackIPAndMACs AS b2 ON b1.MACAddress = b2.MACAddress
LEFT JOIN #blackIPAndMACs AS b3 ON b1.IPAddress = b3.IPAddress;
This uses a much more efficient query plan.
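A related way to get rid of the OR in the join predicate is to run one equality join per column and combine the results with UNION. This is a different rewrite from the two LEFT JOINs above, shown only as a sketch. It uses Python's built-in sqlite3 as a stand-in for SQL Server, and the table name `black` and its four rows are made up for illustration.

```python
import sqlite3

# SQLite stands in for SQL Server; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE black (ResourceID INTEGER, MACAddress TEXT, IPAddress TEXT)")
conn.executemany(
    "INSERT INTO black VALUES (?, ?, ?)",
    [
        (1, "aa", "10.0.0.1"),
        (2, "aa", "10.0.0.2"),  # shares a MAC with resource 1
        (3, "bb", "10.0.0.2"),  # shares an IP with resource 2
        (4, "cc", "10.0.0.9"),  # shares nothing
    ],
)

# Original shape: a single join with OR in the predicate.
or_join = """
    SELECT b1.ResourceID, b2.ResourceID
    FROM black b1
    JOIN black b2
      ON (b1.MACAddress = b2.MACAddress OR b1.IPAddress = b2.IPAddress)
    WHERE b1.ResourceID <> b2.ResourceID
"""

# Rewritten shape: two equality joins combined with UNION.
union_join = """
    SELECT b1.ResourceID, b2.ResourceID
    FROM black b1
    JOIN black b2 ON b1.MACAddress = b2.MACAddress
    WHERE b1.ResourceID <> b2.ResourceID
    UNION
    SELECT b1.ResourceID, b2.ResourceID
    FROM black b1
    JOIN black b2 ON b1.IPAddress = b2.IPAddress
    WHERE b1.ResourceID <> b2.ResourceID
"""

or_pairs = sorted(conn.execute(or_join).fetchall())
union_pairs = sorted(conn.execute(union_join).fetchall())
print(or_pairs)
print(or_pairs == union_pairs)
```

Each branch of the UNION is a plain equality join, which an optimizer can usually serve from a single-column index. Note that UNION removes duplicate pairs, so the two forms only match when the selected columns identify rows uniquely, as they do in this toy data.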

Related

Why is using Table Spool slower than not?

There are two similar SQL statements running in SQL Server. The table TBSFA_DAT_CUST has millions of rows and no constraints (no index and no primary key);
the other two have just a few rows and a normal primary key:
s for slower one:
SELECT A.CUST_ID, C.CUST_NAME, A.xxx --and several specific columns
FROM TBSFA_DAT_ORD_LIST A JOIN VWSFA_ORG_EMPLOYEE B ON A.EMP_ID = B.EMP_ID
LEFT JOIN TBSFA_DAT_CUST C ON A.CUST_ID = B.CUST_ID
JOIN VWSFA_ORG_EMPLOYEE D ON A.REVIEW_ID = D.EMP_ID
WHERE ISNULL(A.BATCH_ID, '') != ''
execution plan of slower one
f for faster one:
SELECT *
FROM TBSFA_DAT_ORD_LIST A JOIN VWSFA_ORG_EMPLOYEE B ON A.EMP_ID = B.EMP_ID
LEFT JOIN TBSFA_DAT_CUST C ON A.CUST_ID = B.CUST_ID
JOIN VWSFA_ORG_EMPLOYEE D ON A.REVIEW_ID = D.EMP_ID
WHERE ISNULL(A.BATCH_ID, '') != ''
execution plan of faster one
f (above 0.6 s) is much faster than s (above 4.6 s).
I also found two ways to make s as fast as f:
1. Add a constraint and a primary key on TBSFA_DAT_CUST.CUST_ID;
2. Specify more than 61 columns of table TBSFA_DAT_CUST (80 columns in total).
My question is: why does the SQL optimizer use a Table Spool when I specify columns in the SELECT clause rather than '*', and why does the query with the Table Spool run slower?
In the slower query you are limiting your result set to specific columns. Since this is an unindexed, unconstrained table, the optimizer creates a temporary table from the original table scan containing only the required columns, and then runs the nested loop operator against that temporary table. When it knows it is going to need every column of the table (SELECT *), it can run the nested loop operator directly off the table scan, because the result set of the scan will be joined in full to the top table.
Outside of that, your query has a couple of other possible problems:
LEFT JOIN TBSFA_DAT_CUST C ON A.CUST_ID = B.CUST_ID
This predicate never references C, so you aren't joining to anything here: the entire table is joined to every record that satisfies it. Did you mean a.cust_id = c.cust_id, or b.cust_id = c.cust_id, or a.cust_id = c.cust_id and b.cust_id = c.cust_id?
Also, this function in the WHERE clause is unnecessary and can degrade performance:
WHERE ISNULL(A.BATCH_ID, '') != ''
change it to:
WHERE A.BATCH_ID is not null and A.Batch_ID <> ''
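The effect of the join predicate that never references C can be seen on a tiny data set. The sketch below uses Python's built-in sqlite3 as a stand-in for SQL Server. The tables `ord_list`, `employee` and `cust` and their rows are hypothetical simplifications of the tables in the question, and `A.CUST_ID = C.CUST_ID` is only one of the possible intended predicates.

```python
import sqlite3

# SQLite stands in for SQL Server; tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ord_list (ORD_ID INTEGER, CUST_ID INTEGER, EMP_ID INTEGER);
    CREATE TABLE employee (EMP_ID INTEGER, CUST_ID INTEGER);
    CREATE TABLE cust (CUST_ID INTEGER, CUST_NAME TEXT);
    INSERT INTO ord_list VALUES (1, 10, 100);
    INSERT INTO employee VALUES (100, 10);
    INSERT INTO cust VALUES (10, 'Acme'), (20, 'Globex'), (30, 'Initech');
""")

# Predicate as posted: it compares A with B and never mentions C,
# so every row of cust is attached to the order.
posted = """
    SELECT A.ORD_ID, C.CUST_NAME
    FROM ord_list A
    JOIN employee B ON A.EMP_ID = B.EMP_ID
    LEFT JOIN cust C ON A.CUST_ID = B.CUST_ID
"""

# One possible intended predicate (an assumption): match the order to its customer.
fixed = """
    SELECT A.ORD_ID, C.CUST_NAME
    FROM ord_list A
    JOIN employee B ON A.EMP_ID = B.EMP_ID
    LEFT JOIN cust C ON A.CUST_ID = C.CUST_ID
"""

posted_rows = conn.execute(posted).fetchall()
fixed_rows = conn.execute(fixed).fetchall()
print(len(posted_rows), "rows with the posted predicate")
print(fixed_rows)
```

With millions of rows in the customer table, the posted form multiplies every qualifying order by the whole table, which by itself can explain a slow nested loop.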

RIGHT\LEFT Join does not provide null values without condition

I have two tables one is the lookup table and the other is the data table. The lookup table has columns named cycleid, cycle. The data table has SID, cycleid, cycle. Below is the structure of the tables.
In the data table, a SID may or may not have all the cycles. I want to output both the completed and the missed cycles for each SID.
I right joined the lookup table to retrieve the missing as well as the completed cycles. Below is the query I used.
SELECT TOP 1000 [SID]
,s4.[CYCLE]
,s4.[CYCLEID]
FROM [dbo].[data] s3 RIGHT JOIN
[dbo].[lookup_data] s4 ON s3.CYCLEID = s4.CYCLEID
The query does not show the missed values when I query for all the SIDs. When I query for a specific SID with the query below, I get the correct result, including the missed ones.
SELECT TOP 1000 [SID]
,s4.[CYCLE]
,s4.[CYCLEID]
FROM [dbo].[data] s3 RIGHT JOIN [dbo].[lookup_data] s4
ON s3.CYCLEID = s4.CYCLEID
AND s3.SID = 101002
ORDER BY [SID], s4.[CYCLEID]
As I am feeding this query into Tableau, I cannot provide the SID value in the query. I want to return all the SIDs, and I will do the rest in Tableau.
The expected output that I need is as shown below.
I wrote a cross join query like the one below to achieve my expected output:
SELECT DISTINCT
tab.CYCLEID
,tab.SID
,d.CYCLE
FROM ( SELECT d.SID
,d.[CYCLE]
,e.CYCLEID
FROM ( SELECT e.sid
,e.CYCLE
FROM [db_temp].[dbo].[Sheet3$] e
) d
CROSS JOIN [db_temp].[dbo].[Sheet4$] e
) tab
LEFT OUTER JOIN [db_temp].[dbo].[Sheet3$] d
ON d.CYCLEID = tab.CYCLEID
AND d.SID = tab.SID
ORDER BY tab.SID
,tab.CYCLEID;
However, I am not able to use this query for more scenarios, as my data set has nearly 20 to 40 columns and I run into issues when I use the query above.
Is there any way to do this more simply, with only a left or right join? I want the query to return all the missing and completed values for all the SIDs, instead of supplying a single SID in the query.
You can create a master table first (combining every SID with every CYCLEID), then left join it with the data table:
;with ctxMaster as (
select distinct d.SID, l.CYCLE, l.CYCLEID
from lookup_data l
cross join data d
)
select d.SID, m.CYCLE, m.CYCLEID
from ctxMaster m
left join data d on m.SID = d.SID and m.CYCLEID = d.CYCLEID
order by m.SID, m.CYCLEID
Or, if you don't want to use a common table expression, a subquery version:
select d.SID, m.CYCLE, m.CYCLEID
from (select distinct d.SID, l.CYCLE, l.CYCLEID
from lookup_data l
cross join data d) m
left join data d on m.SID = d.SID and m.CYCLEID = d.CYCLEID
order by m.SID, m.CYCLEID
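The master-table approach can be tried on a small data set. This is a minimal sketch using Python's built-in sqlite3 as a stand-in for SQL Server. The sample rows and the `status` column are assumptions added for illustration, and `m.SID` is selected so that missed rows still show which SID they belong to.

```python
import sqlite3

# SQLite stands in for SQL Server; the rows below are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lookup_data (CYCLEID INTEGER, CYCLE TEXT);
    CREATE TABLE data (SID INTEGER, CYCLEID INTEGER, CYCLE TEXT);
    INSERT INTO lookup_data VALUES (1, 'C1'), (2, 'C2'), (3, 'C3');
    INSERT INTO data VALUES (101, 1, 'C1'), (101, 2, 'C2'), (101, 3, 'C3'),
                            (102, 1, 'C1'), (102, 3, 'C3');
""")

# Build every (SID, cycle) combination, then left join the real data onto it.
# Combinations without a matching data row come back with d.SID IS NULL.
query = """
    SELECT m.SID, m.CYCLEID, m.CYCLE,
           CASE WHEN d.SID IS NULL THEN 'missed' ELSE 'completed' END AS status
    FROM (SELECT DISTINCT s.SID, l.CYCLEID, l.CYCLE
          FROM lookup_data l
          CROSS JOIN data s) m
    LEFT JOIN data d ON m.SID = d.SID AND m.CYCLEID = d.CYCLEID
    ORDER BY m.SID, m.CYCLEID
"""

rows = conn.execute(query).fetchall()
missed = [(sid, cycle_id) for sid, cycle_id, _cycle, status in rows if status == "missed"]
for row in rows:
    print(row)
print("missed:", missed)
```

Here SID 102 has no row for cycle 2, so that combination is the only one reported as missed, while every other SID and cycle pair is reported as completed.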

Sql Server right side restrictions on left join

Please read it slowly. This isn't a dup.
Tables:
CREATE TABLE [dbo].[TEST] (
[TEST_ID] [integer] IDENTITY (1, 1) NOT NULL ,
....
[TEST_TYPE_ID] [char](1) NULL ,
....
)
CREATE TABLE [dbo].[TEST_A] (
[TEST_ID] [integer] NOT NULL ,
....
)
CREATE TABLE [dbo].[TEST_B] (
[TEST_ID] [integer] NOT NULL ,
....
)
Normally you would write:
select *
from dbo.TEST as t
left join dbo.TEST_A as ta on ta.TEST_ID = t.TEST_ID
left join dbo.TEST_B as tb on tb.TEST_ID = t.TEST_ID
...
However, Sql Server can save a lot of work - IF it knows that only some of table TEST's rows potentially join to TEST_A:
select *
from dbo.TEST as t
left join dbo.TEST_A as ta on t.TEST_TYPE_ID = 'A'
and ta.TEST_ID = t.TEST_ID
left join dbo.TEST_B as tb on t.TEST_TYPE_ID = 'B'
and tb.TEST_ID = t.TEST_ID
...
These queries return the exact same result. Adding TEST_TYPE_ID = X does not change the result.
Note: You CAN'T put the restriction on TEST_TYPE_ID in the WHERE clause. That would change the number of rows returned.
My question is: in a left join, if you place a restriction on the right side, will SQL Server use this information first? Order of operations is very important here. This matters when TEST and TEST_A are large, but only a few records join.
I have tested this, and the execution plan seems to indicate: no. It appears SQL Server first does a normal left join, trying to join all the records in TEST to TEST_A, and then applies a "filter". However, I'm not certain I'm reading the execution plan correctly. If TEST_TYPE_ID = X is applied second, it is effectively a no-op. If TEST_TYPE_ID = X is applied first, it will limit the left join to only the rows that will actually join.
Note: My actual case looks very different. I have distilled the question down to this bare bones example to demonstrate the issue.
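The claim that both forms return the same rows depends on the data: it holds as long as TEST_A only contains rows for tests of type 'A' (and likewise for TEST_B). The sketch below checks this on toy data, using Python's built-in sqlite3 as a stand-in for SQL Server. The sample rows are made up, and the example says nothing about which plan SQL Server picks.

```python
import sqlite3

# SQLite stands in for SQL Server; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TEST (TEST_ID INTEGER, TEST_TYPE_ID TEXT);
    CREATE TABLE TEST_A (TEST_ID INTEGER);
    CREATE TABLE TEST_B (TEST_ID INTEGER);
    INSERT INTO TEST VALUES (1, 'A'), (2, 'B'), (3, 'A'), (4, NULL);
    INSERT INTO TEST_A VALUES (1);
    INSERT INTO TEST_B VALUES (2);
""")

plain = """
    SELECT t.TEST_ID, ta.TEST_ID, tb.TEST_ID
    FROM TEST t
    LEFT JOIN TEST_A ta ON ta.TEST_ID = t.TEST_ID
    LEFT JOIN TEST_B tb ON tb.TEST_ID = t.TEST_ID
    ORDER BY t.TEST_ID
"""

restricted = """
    SELECT t.TEST_ID, ta.TEST_ID, tb.TEST_ID
    FROM TEST t
    LEFT JOIN TEST_A ta ON t.TEST_TYPE_ID = 'A' AND ta.TEST_ID = t.TEST_ID
    LEFT JOIN TEST_B tb ON t.TEST_TYPE_ID = 'B' AND tb.TEST_ID = t.TEST_ID
    ORDER BY t.TEST_ID
"""

# TEST_A holds only type 'A' tests: both forms agree.
same_before = conn.execute(plain).fetchall() == conn.execute(restricted).fetchall()

# Break the invariant: a TEST_A row for test 2, which is of type 'B'.
conn.execute("INSERT INTO TEST_A VALUES (2)")
same_after = conn.execute(plain).fetchall() == conn.execute(restricted).fetchall()

print(same_before, same_after)
```

Because the equivalence rests on the data rather than on the schema, an optimizer cannot assume it on its own, so the extra predicate is real information and not something it could derive.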

Get a collision free hash for a specific query or a view with SQL Server 2008

I am working on a project where I need to synchronize data from our system to an external system. What I want to achieve, is to periodically send only changed items (rows) from a custom query. This query looks like this (but with many more columns) :
SELECT T1.field1,
T1.field2,
T1.field2,
T1.field3,
CASE WHEN T1.field4 = 'some-value' THEN 1 ELSE 0 END,
T2.field1,
T3.field1,
T4.field1
FROM T1
INNER JOIN T2 ON T2.pk = T2.fk
INNER JOIN T3 ON T3.pk = T2.fk
INNER JOIN T4 ON T4.pk = T2.fk
I want to avoid having to compare every field one by one between synchronizations. I came up with the idea of generating a hash for every row of my query and comparing it with the hash from the previous synchronization, which would return only the changed rows. I am aware of the CHECKSUM function, but it is very collision-prone and might sometimes miss changes. However, I like that I could just make a temp table and use CHECKSUM(*), which makes maintenance easier (no need to add fields both in the query and in the CHECKSUM):
SELECT T1.field1,
T1.field2,
T1.field2,
T1.field3,
CASE WHEN T1.field4 = 'some-value' THEN 1 ELSE 0 END,
T2.field1,
T3.field1,
T4.field1
INTO #tmp
FROM T1
INNER JOIN T2 ON T2.pk = T2.fk
INNER JOIN T3 ON T3.pk = T2.fk
INNER JOIN T4 ON T4.pk = T2.fk;
-- get all columns from the query, plus a hash of the row
SELECT *, CHECKSUM(*)
FROM #tmp;
I am aware of the HASHBYTES function (which supports SHA-1 and MD5, which are less prone to collisions), but it only accepts varchar or varbinary input, not a list of columns or * the way CHECKSUM does. Having to cast/convert every column from the query is a pain in the ... and opens the door to errors (forgetting to include a new field, for instance).
I also looked at the Change Data Capture and Change Tracking features of SQL Server, but they both seem complicated and overkill for what I am doing.
So my question: is there another method to generate a hash from a query or a temp table that meets my criteria?
If not, is there another way to achieve this kind of work (syncing differences from a query)?
I found a way to do exactly what I wanted, thanks to the FOR XML clause :
SELECT T1.field1,
T1.field2,
T1.field2,
T1.field3,
CASE WHEN T1.field4 = 'some-value' THEN 1 ELSE 0 END,
T2.field1,
T3.field1,
T4.field1
INTO #tmp
FROM T1
INNER JOIN T2 ON T2.pk = T2.fk
INNER JOIN T3 ON T3.pk = T2.fk
INNER JOIN T4 ON T4.pk = T2.fk;
-- get all columns from the query, plus a hash of the row (converted to a hex string)
SELECT T.*, CONVERT(VARCHAR(100), HASHBYTES('sha1', (SELECT T.* FOR XML RAW)), 2) AS sHash
FROM #tmp AS T;
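Once every row carries a hash, the synchronization step is just a comparison against the hashes stored from the previous run. The sketch below illustrates that comparison outside the database, in Python. The `id` key, the sample rows and the JSON serialization are all assumptions; in the setup above the hash would come from the `sHash` column instead.

```python
import hashlib
import json


def row_hash(row):
    """Return a SHA-1 hex digest of a row serialized with its column names."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def changed_keys(previous_hashes, current_rows, key="id"):
    """Return keys of rows that are new or whose hash differs from last time."""
    changed = []
    for row in current_rows:
        if previous_hashes.get(row[key]) != row_hash(row):
            changed.append(row[key])
    return changed


# Hypothetical result of the first synchronization.
first_sync = [
    {"id": 1, "field1": "a", "flag": 0},
    {"id": 2, "field1": "b", "flag": 1},
]
previous_hashes = {row["id"]: row_hash(row) for row in first_sync}

# Second run: row 1 is unchanged, row 2 has changed, row 3 is new.
second_sync = [
    {"id": 1, "field1": "a", "flag": 0},
    {"id": 2, "field1": "b", "flag": 0},
    {"id": 3, "field1": "c", "flag": None},
]

to_send = changed_keys(previous_hashes, second_sync)
print(to_send)
```

Serializing with column names, as FOR XML RAW does, keeps a NULL distinct from an empty string and keeps values from sliding between adjacent columns. Rows deleted since the last run are not detected by this comparison and need a separate check on the stored keys.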

SQL Server 2008 Stored Procedure Performance issue

Hi, I have a stored procedure:
ALTER PROCEDURE [dbo].[usp_EP_GetTherapeuticalALternates]
(
@NDCNumber CHAR(11) ,
@patientid INT ,
@pbmid INT
)
AS
BEGIN
TRUNCATE TABLE TempTherapeuticAlt
INSERT INTO TempTherapeuticAlt
SELECT --PR.ProductID AS MedicationID ,
NULL AS MedicationID ,
PR.ePrescribingName AS MedicationName ,
U.Strength AS MedicationStrength ,
FRM.FormName AS MedicationForm ,
PR.DEAClassificationID AS DEASchedule ,
NULL AS NDCNumber
--INTO #myTemp
FROM DatabaseTwo.dbo.Product PR
JOIN ( SELECT MP.MarketedProductID
FROM DatabaseTwo.dbo.Therapeutic_Concept_Tree_Specific_Product TCTSP
JOIN DatabaseTwo.dbo.Marketed_Product MP ON MP.SpecificProductID = TCTSP.SpecificProductID
JOIN ( SELECT TCTSP.TherapeuticConceptTreeID
FROM DatabaseTwo.dbo.Marketed_Product MP
JOIN DatabaseTwo.dbo.Therapeutic_Concept_Tree_Specific_Product TCTSP ON MP.SpecificProductID = TCTSP.SpecificProductID
JOIN ( SELECT
PR.MarketedProductID
FROM
DatabaseTwo.dbo.Package PA
JOIN DatabaseTwo.dbo.Product PR ON PA.ProductID = PR.ProductID
WHERE
PA.NDC11 = @NDCNumber
) PAPA ON MP.MarketedProductID = PAPA.MarketedProductID
) xxx ON TCTSP.TherapeuticConceptTreeID = xxx.TherapeuticConceptTreeID
) MPI ON PR.MarketedProductID = MPI.MarketedProductID
JOIN ( SELECT P.ProductID ,
O.Strength ,
O.Unit
FROM DatabaseTwo.dbo.Product AS P
INNER JOIN DatabaseTwo.dbo.Marketed_Product
AS M ON P.MarketedProductID = M.MarketedProductID
INNER JOIN DatabaseTwo.dbo.Specific_Product
AS S ON M.SpecificProductID = S.SpecificProductID
LEFT OUTER JOIN DatabaseTwo.dbo.OrderableName_Combined
AS O ON S.SpecificProductID = O.SpecificProductID
GROUP BY P.ProductID ,
O.Strength ,
O.Unit
) U ON PR.ProductID = U.ProductID
JOIN ( SELECT PA.ProductID ,
S.ScriptFormID ,
F.Code AS NCPDPScriptFormCode ,
S.FormName
FROM DatabaseTwo.dbo.Package AS PA
INNER JOIN DatabaseTwo.dbo.Script_Form
AS S ON PA.NCPDPScriptFormCode = S.NCPDPScriptFormCode
INNER JOIN DatabaseTwo.dbo.FormCode AS F ON S.FormName = F.FormName
GROUP BY PA.ProductID ,
S.ScriptFormID ,
F.Code ,
S.FormName
) FRM ON PR.ProductID = FRM.ProductID
WHERE
( PR.OffMarketDate IS NULL )
OR ( PR.OffMarketDate = '' )
OR (PR.OffMarketDate = '1899-12-30 00:00:00.000')
OR ( PR.OffMarketDate <> '1899-12-30 00:00:00.000'
AND DATEDIFF(dd, GETDATE(),PR.OffMarketDate) > 0
)
GROUP BY PR.ePrescribingName ,
U.Strength ,
FRM.FormName ,
PR.DEAClassificationID
-- ORDER BY pr.ePrescribingName
SELECT LL.ProductID AS MedicationID ,
temp.MedicationName ,
temp.MedicationStrength ,
temp.MedicationForm ,
temp.DEASchedule ,
temp.NDCNumber ,
fs.[ReturnFormulary] AS FormularyStatus ,
copay.CopaTier ,
copay.FirstCopayTerm ,
copay.FlatCopayAmount ,
copay.PercentageCopay ,
copay.PharmacyType,
dbo.udf_EP_GetBrandGeneric(LL.ProductID) AS BrandGeneric
FROM TempTherapeuticAlt temp
OUTER APPLY ( SELECT TOP 1
ProductID
FROM DatabaseTwo.dbo.Product
WHERE ePrescribingName = temp.MedicationName
) AS LL
OUTER APPLY [dbo].[udf_EP_tbfGetFormularyStatus](@patientid,
LL.ProductID,
@pbmid) AS fs
OUTER APPLY ( SELECT TOP 1
*
FROM udf_EP_CopayDetails(LL.ProductID,
@PBMID,
fs.ReturnFormulary)
) copay
--ORDER BY LL.ProductID
TRUNCATE TABLE TempTherapeuticAlt
END
On my dev server I have about 63k rows in each table,
so this procedure takes about 30 seconds to return a result.
On my production server it times out or takes more than 1 minute.
My production tables hold about 1,400 million records;
could this be the reason?
If so, what can be done? I have all the required indexes on the tables.
Any help would be greatly appreciated.
Thanks
Execution Plan
http://www.sendspace.com/file/hk8fao
Major Leakage
OUTER APPLY [dbo].[udf_EP_tbfGetFormularyStatus](@patientid,
LL.ProductID,
@pbmid) AS fs
Some strategies that may help:
Remove the first ORDER BY statement; those are killers on complex queries and shouldn't be necessary.
Use CTEs to break the query into smaller pieces that can be individually addressed.
Reduce the nesting in the first set of JOINs
Extract the second and third set of joins (the GROUPED ones) and insert those into a temporary indexed table before joining and grouping everything.
You did not include the definition for function1 or function2 -- custom functions are often a place where performance issues can hide.
Without seeing the execution plan, it's difficult to see where the particular problems may be.
You have a query that selects data from 4 or 5 tables, some of them multiple times. It's really hard to say how to improve it without a deep analysis of what you are trying to achieve and what the table structure actually is.
Data size is definitely an issue; the more data has to be processed, the longer the query will take. Some general advice: run the query directly and check the execution plan, which may reveal bottlenecks. Then check whether statistics are up to date. Also review your tables; partitioning may help a lot in some cases. In addition, you can try altering the tables to create the clustered index not on the PK (the default unless otherwise specified) but on other column(s), so that your query benefits from a particular physical order of records. Note: do this only if you are absolutely sure what you are doing.
Finally, try refactoring your query. I have a feeling that there is a better way to get the desired results (sorry, without understanding the table structure and expected results I cannot give an exact solution, but multiple joins of the same tables and a bunch of derived tables don't look good to me).
