SQL Server 2008: how does NOLOCK work for CTEs? - sql-server

I have CTEs that all use NOLOCK inside. But the parent CTEs that select from those child CTEs don't use NOLOCK, on the presumption that the data is already NOLOCK'd. And the final select doesn't use NOLOCK either.
Something like this:
with cte1 as
(select * from tab1 (nolock)),
cte2 as
(select * from cte1)
select * from cte2
or shall I write
with cte1 as
(select * from tab1 (nolock)),
cte2 as
(select * from cte1 (nolock))
select * from cte2 (nolock)
thanks

You don't need the outer nolock to avoid taking shared locks on tab1. You can easily verify this by setting up a SQL Profiler trace capturing the various events in the locks category, filtering on the spid of an SSMS connection and trying both versions.
nolock is quite a dangerous setting though, are you aware of all of the possible downsides to using it (dirty reads, reading data twice or not at all)?

NOLOCK on an "outer" query applies to any inner queries too. A CTE is just a macro like a view or inline table udf: nothing more, nothing less. So you actually have (ignoring NOLOCK hints)
select * from (
select * from (
select * from tab1
) t1
) t2
From Table Hints on MSDN, under "Remarks"
All lock hints are propagated to all the tables and views that are accessed by the query plan, including tables and views referenced in a view.
In this case, you need only one. Doesn't matter where.
Where it does matter is where you have JOINs. If cte1 was a join of 2 tables, you'd need it for each table. Or specify it once at a higher/outer level.
Oh, and I'll join in with everyone else: NOLOCK is a bad idea
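A minimal sketch of the join case described above, assuming a hypothetical second table tab2 and hypothetical columns id and val (none of these appear in the question). The first statement hints each table individually. The second is a swapped-in alternative rather than a table hint: it sets READ UNCOMMITTED for the session, which behaves like NOLOCK on every table read afterwards.

```sql
-- Hypothetical schema for illustration: tab1(id), tab2(id, val)
with cte1 as
(
    select t1.id, t2.val
    from tab1 as t1 with (nolock)      -- hint on the first table
    join tab2 as t2 with (nolock)      -- hint repeated on the second table
        on t2.id = t1.id
)
select id, val from cte1;

-- Alternative: one session-level setting instead of per-table hints
set transaction isolation level read uncommitted;

with cte1 as
(
    select t1.id, t2.val
    from tab1 as t1
    join tab2 as t2
        on t2.id = t1.id
)
select id, val from cte1;
```

The session-level form is easier to keep consistent, but it carries the same dirty-read risks as the hints.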

The innermost nolock is sufficient, no need to repeat it for the outer selects.
You can test this by starting a transaction without ending it:
begin transaction
; with YourCte ( ...
Then you can view the locks using Management Studio. They'll be there until the transaction is committed or rolled back.
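The same check can be scripted instead of using the Management Studio UI. This is a sketch, not a definitive test: under the default READ COMMITTED level shared locks are normally released as soon as the statement finishes, so it assumes REPEATABLE READ to keep them visible for the rest of the transaction. tab1 is the table from the question.

```sql
-- Assumption: REPEATABLE READ so shared locks are held until the transaction ends
set transaction isolation level repeatable read;

begin transaction;

with cte1 as
(select * from tab1 (nolock)),
cte2 as
(select * from cte1)
select * from cte2;

-- Locks currently held or requested by this session
select resource_type, request_mode, request_status
from sys.dm_tran_locks
where request_session_id = @@spid;

rollback transaction;
```

Running it again with the (nolock) hint removed should show additional shared (S) locks on tab1's rows or pages, which makes the difference between the two versions easy to compare.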

NOLOCK for CTEs works the same way as with everything else: it causes inconsistent results. Use SNAPSHOT instead, see SQL Server 2005 Row Versioning-Based Transaction Isolation.
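A rough sketch of what switching to SNAPSHOT could look like for the query in the question. YourDatabase is a placeholder name, and enabling the option is a database-wide change that stores row versions in tempdb, so treat this as an outline to adapt rather than a drop-in fix.

```sql
-- One-time setup; YourDatabase is a placeholder
alter database YourDatabase set allow_snapshot_isolation on;

-- In the session that runs the query
set transaction isolation level snapshot;

with cte1 as
(select * from tab1),
cte2 as
(select * from cte1)
select * from cte2;   -- reads a consistent snapshot without taking shared row locks
```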

Related

Does SQL Server execute a CTE expression if it is not used?

I'm trying to debug a query that is performing slowly. It has several with expressions which are left joined. When I remove the joins, it speeds up considerably.
Original query:
;with CTE as
(
Select *
from table1
)
SELECT *
FROM table2
LEFT JOIN CTE ON table2.CTEID
Better performing query:
;with CTE as
(
Select *
from table1
)
SELECT *
FROM table2
In the above, does it not execute CTE since it is not joined, or does it execute it regardless?
My guess is probably not-- the query optimizer is pretty smart about not executing unnecessary stuff. Every query is different, and the query optimizer uses statistics about your actual data to decide how to evaluate it, so the only way to know for sure, is to get SQL Server to tell you how it evaluated your query.
To do this, execute your query in SQL Server Management Studio with 'Include Actual Execution Plan' enabled and you will see clearly how it evaluated the query.
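If you prefer a script to the toolbar button, SET STATISTICS XML returns the actual plan alongside the results. A small sketch using the table names from the question:

```sql
set statistics xml on;    -- return the actual execution plan with the results

with CTE as
(
    select *
    from table1
)
select *
from table2;

set statistics xml off;
```

If no operator in the returned plan touches table1, the unused CTE was not evaluated.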

Force joined view not to be optimized

I have a somewhat complex view which includes a join to another view. For some reason the generated query plan is highly inefficient. The query runs for many hours. However, if I select the sub-view into a temporary table first and then join with this, the same query finishes in a few minutes.
My question is: Is there some kind of query hint or other trick which will force the optimizer to execute the joined sub-view in isolation before performing the join, just as when using a temp table? Clearly the default strategy chosen by the optimizer is not optimal.
I cannot use the temporary table trick since views do not allow temporary tables. I understand I could probably rewrite everything as a stored procedure, but that would break the composability of views, and it also seems bad for maintenance to rewrite everything just to trick the optimizer into not using a bad optimization.
Adam Machanic explained one such way at a SQL Saturday I recently attended. The presentation was called Clash of the Row Goals. The method involves using a TOP X at the beginning of the sub-select. He explained that when doing a TOP X, the query optimizer assumes it is more efficient to grab the TOP X rows one at a time. As long as you set X as a sufficiently large number (limit of INT or BIGINT?), the query will always get the correct results.
So one example that Adam provided:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
becomes:
SELECT
x.EmployeeId,
y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
SELECT TOP(2147483647)
y0.ManagerId,
COUNT(*) AS totalWorkers
FROM HumanResources.Employee AS y0
GROUP BY
y0.ManagerId
) AS y ON
y.ManagerId = x.ManagerId
It is a super cool trick and very useful.
When things get messy the query optimizer often resorts to loop joins
If materializing to a temp table fixed it then most likely that is the problem
The optimizer often does not deal with views very well
I would rewrite your view to not use views
Join Hints (Transact-SQL)
You may be able to use these hints on views
Try merge and hash
Try changing the order of join
Move condition into the join whenever possible
select *
from table1
join table2
on table1.FK = table2.[Key]
where table2.[desc] = 'cat1'
should be
select *
from table1
join table2
on table1.FK = table2.[Key]
and table2.[desc] = 'cat1'
Now the query optimizer will get that correct but as the query gets more complex the query optimizer goes into what I call stupid mode and loop joins. But that is also done to protect the server and have as little in memory as possible.
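As a sketch of the "try merge and hash" suggestion, using the same illustrative tables: the hint can be given for the whole statement or for a single join. The brackets are there because KEY and DESC are reserved words in T-SQL.

```sql
-- Statement-level hint: every join in the query must be a hash join
select *
from table1
join table2
    on table1.FK = table2.[Key]
where table2.[desc] = 'cat1'
option (hash join);

-- Join-level hint: applies to this join only
select *
from table1
inner merge join table2
    on table1.FK = table2.[Key]
where table2.[desc] = 'cat1';
```

A join-level hint also makes the optimizer keep the join order as written, so it is worth comparing the resulting plans before leaving either form in place.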

Query in a batch takes from few millisec to 2 minutes on SQL Server

I have a batch that loads configuration data parsed from about a hundred XLS workbooks, with information on cells position, type of cells, meaning of cells etc.
This is a very large batch that uses several temporary tables.
Since it can start with an existing configuration, it needs to merge the two configurations, so some temporary tables are filled out-of-transaction and some are filled in-transaction.
On Oracle 10g the batch executes without any problem.
On SQL Server 2008 R2 I'm experiencing random hangs on 2-3 in-transaction INSERT queries. It doesn't always happen, and when it does, it may hit a query that normally executes in a few milliseconds but not another query that hung the time before.
I initially suspected a deadlock, but after raising the query timeout to 3 minutes, the query finally completes in about 2 minutes. I repeat, these queries sometimes execute in a few milliseconds.
I should also mention that all queries are run with the ROWLOCK hint only.
Monitoring SQL Server with profiler and activity monitor I don't see anything strange.
The CPU is not maxed out, there is free memory, disk reads are about 0, and disk writes fluctuate between 0 and 200 KB/s.
No one else is using this DB schema.
I really cannot work out how to solve this.
EDIT:
This is one of the puzzling queries
INSERT INTO "LOAD"."META_CELLS_UOM" WITH (ROWLOCK)
("UDA_DOMAIN_ID", "META_DOC_ID", "META_SECT_ID", "META_SET_ID", "UDA_ID", "META_CELL_UROW", "META_CELL_UCOL", "CEM_ID")
SELECT "UDA_DOMAIN_ID", "META_DOC_ID", "META_SECT_ID", "META_SET_ID", "UDA_ID", "META_CELL_UROW", "META_CELL_UCOL", "CEM_ID"
FROM (
SELECT "META_DOCUMENTS"."UDA_DOMAIN_ID", "META_DOCUMENTS"."META_DOC_ID", "META_SECTIONS"."META_SECT_ID", "META_SET_ID", "UDA_ID", "META_CELL_ROW" AS "META_CELL_UROW", "META_CELL_COL" AS "META_CELL_UCOL", "CEM_ID"
FROM "LOAD"."LOADER_CELLS_STEP_1"
INNER JOIN "LOAD"."META_DOCUMENTS" ON ("META_DOCUMENTS"."META_DOC_CODE"="LOADER_CELLS_STEP_1"."META_DOC_CODE")
INNER JOIN "LOAD"."META_SECTIONS" ON ("META_SECTIONS"."META_DOC_ID"="META_DOCUMENTS"."META_DOC_ID") AND ("META_SECTIONS"."META_SECT_NAME"="LOADER_CELLS_STEP_1"."META_SECT_NAME")
INNER JOIN "LOAD"."META_SETS" ON ("META_SETS"."META_SECT_ID"="META_SECTIONS"."META_SECT_ID") AND ("META_SETS"."META_DOC_ID"="META_DOCUMENTS"."META_DOC_ID") AND ("META_SETS"."META_SET_NAME"="LOADER_CELLS_STEP_1"."META_SET_NAME")
INNER JOIN "LOAD"."USER_DEF_ATTRIBUTES" ON ("USER_DEF_ATTRIBUTES"."UDA_NAME"="LOADER_CELLS_STEP_1"."UDA_NAME")
WHERE ("META_CELL_WUOM"=1)
AND ("UDA_TYPE"='MATCH')
) "SELECTION"
WHERE NOT EXISTS (
SELECT *
FROM "LOAD"."META_CELLS_UOM"
WHERE ("SELECTION"."META_SET_ID"="META_CELLS_UOM"."META_SET_ID")
AND ("SELECTION"."UDA_ID"="META_CELLS_UOM"."UDA_ID")
)
The destination table META_CELLS_UOM is empty, the source table LOADER_CELLS_STEP_1 has about 80,000 records from which I select about 3000.
EDIT 2:
There are no concurrent queries. When the program hangs on the execution of the above query this is a screenshot from SQL Server Management Studio's Activity Monitor:
Without looking at the execution plan, I think the reason the query sometimes takes longer is that SQL Server tries to cut some corners using the statistics from the last query, but it backfires.
I would say that it should be solved if you
Select the 3000 rows from LOADER_CELLS_STEP_1 in a subquery first and then join in the other data; otherwise the server might consider some other table a better starting point, join it with all 80000 rows, and only then filter.
Skip the "not exists" clause and write it another way, like left outer join with the META_CELLS_UOM and only include the rows not matching. And also here do it as separate sub query to avoid bad optimization.
Try this (I don't know if I understand the relations right, and I might have missed a few things, but I hope you get the idea).
INSERT INTO "LOAD"."META_CELLS_UOM" WITH (ROWLOCK)
("UDA_DOMAIN_ID", "META_DOC_ID", "META_SECT_ID", "META_SET_ID", "UDA_ID", "META_CELL_UROW", "META_CELL_UCOL", "CEM_ID")
SELECT "UDA_DOMAIN_ID", "META_DOC_ID", "META_SECT_ID", "META_SET_ID", "UDA_ID", "META_CELL_UROW", "META_CELL_UCOL", "CEM_ID"
FROM (
SELECT "META_DOCUMENTS"."UDA_DOMAIN_ID", "META_DOCUMENTS"."META_DOC_ID", "META_SECTIONS"."META_SECT_ID", "META_SET_ID", "UDA_ID", "META_CELL_ROW" AS "META_CELL_UROW", "META_CELL_COL" AS "META_CELL_UCOL", "CEM_ID"
FROM
(
select * from "LOAD"."LOADER_CELLS_STEP_1" where ("META_CELL_WUOM"=1) AND ("UDA_TYPE"='MATCH')
) LCS
INNER JOIN "LOAD"."META_DOCUMENTS" ON ("META_DOCUMENTS"."META_DOC_CODE"=LCS."META_DOC_CODE")
INNER JOIN "LOAD"."META_SECTIONS" ON ("META_SECTIONS"."META_DOC_ID"="META_DOCUMENTS"."META_DOC_ID") AND ("META_SECTIONS"."META_SECT_NAME"=LCS."META_SECT_NAME")
INNER JOIN "LOAD"."META_SETS" ON ("META_SETS"."META_SECT_ID"="META_SECTIONS"."META_SECT_ID") AND ("META_SETS"."META_DOC_ID"="META_DOCUMENTS"."META_DOC_ID") AND ("META_SETS"."META_SET_NAME"=LCS."META_SET_NAME")
) "SELECTION"
left outer join "LOAD"."META_CELLS_UOM" on ("SELECTION"."META_SET_ID"="META_CELLS_UOM"."META_SET_ID") AND ("SELECTION"."UDA_ID"="META_CELLS_UOM"."UDA_ID")
where "META_CELLS_UOM"."META_SET_ID" is null

Inner join vs select statements on multiple tables

The 2 queries below perform the same operation, but I'm wondering which would be faster and preferable?
NUM is the primary key on table1 & table2...
select *
from table1 tb1,
table2 tb2
where tb1.num = tb2.num
select *
from table1 tb1
inner join
table2 tb2
on tb1.num = tb2.num
They are the same query. The first is an older alternate syntax, but they both mean do an inner join.
You should avoid using the older syntax. It's not just readability, but as you build more complex queries, there are things that you simply can't do with the old syntax. Additionally, the old syntax is going through a slow process of being phased out, with the equivalent outer join syntax marked as deprecated in most products, and iirc dropped already in at least one.
The 2 SQL statements are equivalent. You can look at the execution plan to confirm. As a rule, given 2 SQL statements which affect/return the same rows in the same way, the server is free to execute them the same way.
They're equivalent queries - both are inner joins, but the first uses an older, implicit join syntax. Your database should execute them in exactly the same way.
If you're unsure, you could always use the SQL Management Studio to view and compare the execution plans of both queries.
The first example is what I have seen referred to as an Oracle Join. As mentioned already there appears to be little performance difference. I prefer the second example from a readability standpoint because it separates join conditions from filter conditions.

Why is this non-correlated query so slow?

I have this query...
SELECT Distinct([TargetAttributeID]) FROM
(SELECT distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = #intProfileID
union all
SELECT distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= #cutoffdate) x
Execution Plan for the above query
The two inner distincts are looking at 32 and 10,000 rows respectively. This query returns 5 rows and executes in under 1 second.
If I then use the result of this query as the subject of an IN like so...
SELECT attx.intAttributeID,attx.txtAttributeName,attx.txtAttributeLabel,attx.txtType,attx.txtEntity FROM
AST_tblAttributes attx WHERE attx.intAttributeID
IN
(SELECT Distinct([TargetAttributeID]) FROM
(SELECT Distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = #intProfileID
union all
SELECT Distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= #cutoffdate) x)
Execution Plan for the above query
Then it takes over 3 minutes! If I just take the result of the query and perform the IN "manually" then again it comes back extremely quickly.
However if I remove the two inner DISTINCTS....
SELECT attx.intAttributeID,attx.txtAttributeName,attx.txtAttributeLabel,attx.txtType,attx.txtEntity FROM
AST_tblAttributes attx WHERE attx.intAttributeID
IN
(SELECT Distinct([TargetAttributeID]) FROM
(SELECT att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = #intProfileID
union all
SELECT ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= #cutoffdate) x)
Execution Plan for the above query
..then it comes back in under a second.
What is SQL Server thinking? Can it not figure out that it can perform the two sub-queries and use the result as the subject of the IN. It seems as slow as a correlated sub-query, but it isn't correlated!!!
In Show Estimated Execution Plan there are three Clustered Index Scans each with a cost of 100%! (Execution Plan is here)
Can anyone tell me why the inner DISTINCTS make this query so much slower (but only when used as the subject of an IN...) ?
UPDATE
Sorry it's taken me a while to get these execution plans up...
Query 1
Query 2 (The slow one)
Query 3 - No Inner Distincts
Honestly I think it comes down to the fact that, in terms of relational operators, you have a gratuitously baroque query there, and SQL Server stops searching for alternate execution plans within the time it allows itself to find one.
After the parse and bind phase of plan compilation, SQL Server will apply logical transforms to the resulting tree, estimate the cost of each, and choose the one with the lowest cost. It doesn't exhaust all possible transformations, just as many as it can compute within a given window. So presumably, it has burned through that window before it arrives at a good plan, and it's the addition of the outer semi-self-join on AST_tblAttributes that pushed it over the edge.
How is it gratuitously baroque? Well, first off, there's this (simplified for noise reduction):
select distinct intAttributeID from (
select distinct intAttributeID from AST_tblAttributes ....
union all
select distinct intAttributeID from AST_tblAttributes ....
)
Concatenating two sets, and projecting the unique elements? Turns out there's an operator for that, it's called UNION. So given enough time during plan compilation and enough logical transformations, SQL Server will realize what you really mean is:
select intAttributeID from AST_tblAttributes ....
union
select intAttributeID from AST_tblAttributes ....
But wait, you put this in an IN subquery. Well, an IN subquery is a semi-join, and the right relation does not require logical deduping in a semi-join. So SQL Server may logically rewrite the query as this:
select * from AST_tblAttributes
where intAttributeID in (
select intAttributeID from AST_tblAttributes ....
union all
select intAttributeID from AST_tblAttributes ....
)
And then go about physical plan selection. But to get there, it has to see through the cruft first, and that may fall outside the optimization window.
EDIT:
Really, the way to explore this for yourself, and corroborate the speculation above, is to put both versions of the query in the same window and compare estimated execution plans side-by-side (Ctrl-L in SSMS). Leave one as is, edit the other, and see what changes.
You will see that some alternate forms are recognized as logically equivalent and generate the same good plan, and others generate less optimal plans, as you bork the optimizer.**
Then, you can use SET STATISTICS IO ON and SET STATISTICS TIME ON to observe the actual amount of work SQL Server performs to execute the queries:
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT ....
SELECT ....
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
The output will appear in the messages pane.
** Or not--if they all generate the same plan, but actual execution time still varies like you say, something else may be going on--it's not unheard of. Try comparing actual execution plans and go from there.
First of all a possible explanation:
You say that: "This query returns 5 rows and executes in under 1 second.". But how many rows does it ESTIMATE are returned? If the estimate is very much off, using the query as part of the IN part could cause you to scan the entire: AST_tblAttributes in the outer part, instead of index seeking it (which could explain the big difference)
If you shared the query plans for the different variants (as a file, please), I think I should be able to get you an idea of what is going on under the hood here. It would also allow us to validate the explanation.
Edit: each DISTINCT keyword adds a new Sort node to your query plan. Basically, by having those other DISTINCTs in there, you're forcing SQL to re-sort the entire table again and again to make sure that it isn't returning duplicates. Each such operation can quadruple the cost of the query. Here's a good review of the effects that the DISTINCT operator can have, intended and unintended. I've been bitten by this, myself.
Are you using SQL 2008? If so, you can try this, putting the DISTINCT work into a CTE and then joining to your main table. I've found CTEs to be pretty fast:
WITH DistinctAttribID
AS
(
SELECT Distinct([TargetAttributeID])
FROM (
SELECT distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = #intProfileID
UNION ALL
SELECT distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= #cutoffdate
) x
)
SELECT attx.intAttributeID,
attx.txtAttributeName,
attx.txtAttributeLabel,
attx.txtType,
attx.txtEntity
FROM AST_tblAttributes attx
JOIN DistinctAttribID attrib
ON attx.intAttributeID = attrib.TargetAttributeID
