I would like to create a subquery that produces a list of numbers as a single-column result, something like MindLoggedOut did here but without the @x XML variable, so that it can be appended to a WHERE expression as a pure string (subquery) without SQL parameters. The problem is that inlining the parameter (or variable) makes the query run 5000 times slower, and I don't understand why. What causes this big difference?
Example:
/* Create a minimalistic xml like <b><a>78</a><a>91</a>...</b> */
DECLARE @p_str VARCHAR(MAX) =
    '78 91 01 12 34 56 78 91 01 12 34 56 78 91 01 12 34 56';
DECLARE @p_xml XML = CONVERT(XML,
    '<b><a>'+REPLACE(@p_str,' ','</a><a>')+'</a></b>'
);

SELECT a.value('(child::text())[1]','INT')
FROM (VALUES (@p_xml)) AS t(x)
CROSS APPLY x.nodes('//a') AS x(a);
This returns one number per row and is quite fast (20x faster than the string-splitter approaches I was using so far, similar to these; I measured the 20x speed-up in terms of SQL Server CPU time, with @p_str containing 3000 numbers).
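For context, the goal is to be able to splice the whole subquery into a larger statement as plain text, with the number list baked into the string and no parameters involved. A hypothetical usage sketch (the table and column names are invented for illustration):
-- dbo.SomeTable and SomeIntColumn are placeholders for the real target of the WHERE clause
SELECT *
FROM dbo.SomeTable
WHERE SomeIntColumn IN (
    SELECT a.value('(child::text())[1]','INT')
    FROM (VALUES (CONVERT(XML, '<b><a>78</a><a>91</a><a>01</a></b>'))) AS t(x)
    CROSS APPLY x.nodes('//a') AS x(a)
);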
Now if I inline the definition of @p_xml into the query:
SELECT a.value('(child::text())[1]','INT')
FROM (VALUES (CONVERT(XML,
'<b><a>'+REPLACE(@p_str,' ','</a><a>')+'</a></b>'
))) AS t(x)
CROSS APPLY x.nodes('//a') AS x(a);
then it becomes 5000x slower (when @p_str contains thousands of numbers). Looking at the query plan, I cannot find the reason for it.
Plan of the first query (…VALUES(@p_xml)…), and the second (…VALUES(CONVERT(XML,'...'))…)
Could somebody shed some light on it?
UPDATE
Clearly the plan of the first query doesn't include the cost
of the @p_xml = CONVERT(XML, ...REPLACE(...)... ) assignment, but this
cost is not the culprit: it cannot explain the 46 ms vs. 234 sec
difference in execution time of the whole script (when
@p_str is large). This difference is systematic (not random)
and was in fact observed on SQL Azure (S1 tier).
Furthermore, when I rewrote the query, replacing CONVERT(XML, ...) with a user-defined scalar function:
SELECT a.value('(child::text())[1]','INT')
FROM (VALUES (dbo.MyConvertToXmlFunc(
'<b><a>'+REPLACE(@p_str,' ','</a><a>')+'</a></b>'
))) AS t(x)
CROSS APPLY x.nodes('//a') AS x(a);
where dbo.MyConvertToXmlFunc() is:
CREATE FUNCTION dbo.MyConvertToXmlFunc(@p_str NVARCHAR(MAX))
RETURNS XML BEGIN
    RETURN CONVERT(XML, @p_str);
END;
the difference disappeared (plan). So at least I have a workaround... but I would like to understand it.
This is basically the same issue as described in this answer by Paul White.
I tried with a string of length 10,745 characters containing 3,582 items.
The execution plan with the string literal ends up performing the string replace and casting this entire string to XML twice for each item (so 7,164 times in total).
The problematic sqltses.dll!CEsExec::GeneralEval4 calls are highlighted in the traces below. The CPU time for the entire call stack was 22.38% (nearly maxing out a single core on a quad-core machine); 92% of that was taken up by these two calls.
Within each call, sqltses.dll!ConvertFromStringTypesAndXmlToXml and sqltses.dll!BhReplaceBhStrStr take nearly equal time.
I have used the same colour coding for the plan below.
The bottom branch of the execution plan is executed once for each split item in the string.
The problematic table-valued function in the bottom right is in its open method. The parameter list for the function is:
Scalar Operator([Expr1000]),
Scalar Operator((7)),
Scalar Operator(XML Reader with XPath filter.[id]),
Scalar Operator(getdescendantlimit(XML Reader with XPath filter.[id]))
For the Stream Aggregate, the issue is in its getrow method.
[Expr1010] = Scalar Operator(MIN(
SELECT CASE
WHEN [Expr1000] IS NULL
THEN NULL
ELSE
CASE
WHEN datalength([XML Reader with XPath filter].[value]) >= ( 128 )
THEN CONVERT_IMPLICIT(int, [XML Reader with XPath filter].[lvalue], 0)
ELSE CONVERT_IMPLICIT(int, [XML Reader with XPath filter].[value], 0)
END
END
))
Both of these expressions refer to Expr1000 (though the stream aggregate only does so to check whether it was NULL).
This is defined in the constant scan at the top right as below.
(Scalar Operator(CONVERT(xml,'<b><a>'+replace([@p_str],' '
,CONVERT_IMPLICIT(varchar(max),'</a><a>',0))+'</a></b>',0)))
It is clear from the trace that the issue is the same as in the previously linked answer and that this expression is getting repeatedly re-evaluated in the slow plan. When the value is passed in as a parameter, the expensive calculation only happens once.
Edit: I just realised this is in fact almost exactly the same plan and issue as Paul White blogged about here. The only difference in my tests compared to those described there is that I found the string replace and the XML conversion to be about as expensive as each other in the VARCHAR(MAX) case, and the string replace to outweigh the conversion cost in the non-max case.
Max case: (trace screenshot)
Non-max case: (trace screenshot; 2000-character source string with 668 items, 6010 chars after the replace)
In this test the replace was nearly double the CPU cost of the XML conversion. It seems to be implemented using code from the familiar T-SQL functions CHARINDEX and STUFF, with a large chunk of time taken up converting the string to Unicode. I think this discrepancy between my results and those reported by Paul is down to collation (switching to SQL_Latin1_General_CP1_CS_AS from Latin1_General_CS_AS reduces the cost of the string replace significantly).
Related
Consider this setup:
create table #test (val varchar(10))
insert into #test values ('20100101'), ('1')
Now if I run this query
select *
from #test
where ISDATE(val) = 1
and CAST(val as datetimeoffset) > '2005-03-01 00:00:00 +00:00'
it will fail with
Conversion failed when converting date and/or time from character string
which tells me that the where conditions are not short-circuited and both functions are evaluated. OK.
However if I run
select *
from #test
where LEN(val) > 2
and CAST(val as datetimeoffset) > '2005-03-01 00:00:00 +00:00'
it doesn't fail, which tells me that the WHERE clause is short-circuited in this case.
This
select *
from #test
where ISDATE(val) = 1
and CAST(val as datetimeoffset) > '2005-03-01 00:00:00 +00:00'
and LEN(val) > 2
fails again, but if I move the length check before the cast, it works. So it looks like the functions are evaluated in the order they appear in the query.
Can anyone explain why first query fails?
It fails because SQL is declarative, so the order of your conditions is not taken into account when the plan is generated (nor is the engine required to honour it).
The usual way to get around this is to use CASE, which has strict rules about evaluation order and when to stop.
In your case you will probably need nested CASEs, something like this:
WHERE
(
case when ISDATE(val) = 1 then
case when CAST(val as datetimeoffset) > '2005-03-01 00:00:00 +00:00' and
LEN(val) > 2
THEN 1 ELSE 0 END
ELSE 0
END
) = 1
(note this is unlikely to be actually correct SQL as I just typed it in).
By the way, even if you get it "working" by rearranging the conditions, I'd advise you don't. Accept that SQL simply doesn't work that way. As the data and statistics change, SQL Server is upgraded, the workload varies, or indexes are added, the query plan could change. Any attempt to "get it working" is going to be short-lived at best, so go with the CASE, which will continue to work once you've got it right (provided you nest CASE expressions where required and don't fall into the same precedence trap within the CASE conditions!).
The mystery is answered if you examine the Execution Plan. Both the CAST() and the LEN() are applied as part of the Table Scan step, while the test for IsDate() is a separate Filter test after the Table Scan.
It appears that the SQL Engine's internal optimizations use certain filtering functions as part of the retrieval of the data, and others as separate filters, almost certainly as a form of query optimization to minimize the load from disk into main memory. However, more complex functions, such as IsDate(), which is dependent on system variables such as system date format in some cases (is '01/02/2017' Jan 2nd or Feb 1st?), need to have the data retrieved before the filter is applied.
Although I have no hard information on this, I strongly suspect that any filter more resource intensive than a certain level is delegated to the Filter steps in the query plan, and anything simple/fast enough to be checked as the data is being read in is applied during the Scan/Seek steps. Also, if a filter could be applied on the data in the index, I am certain that it will be tested before any non-index data is tested, solely to minimize disk reads, which are bad performance juju (this may not apply on the Clustered index of the table). In these cases, the short-circuiting might not be straightforward, with an IsDate() test specified on a non-index field being executed after a similar test on an indexed field, no matter where they are in the list of conditions.
That said, it appears to be true that conditions short-circuit when they are executed in the same step of the query plan. If you insert a string like '201612123' into the temp table, then add a check on Len(val) < 9 after the date comparison, it still generates an error, instead of checking both LEN() conditions at the same time in a tiny optimization.
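A minimal sketch of that experiment, reusing the #test table from the setup above:
-- '201612123' is nine characters long and not a valid date
insert into #test values ('201612123')

select *
from #test
where LEN(val) > 2
  and CAST(val as datetimeoffset) > '2005-03-01 00:00:00 +00:00'
  and LEN(val) < 9   -- placed after the cast, yet the conversion error is still raised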
which tells me that where conditions are not short-circuited and both functions are evaluated.
To expand on LoztInSpace's answer, your terminology suggests you are not interpreting SQL correctly, on its own terms.
The various parts of a SELECT statement are not "functions". The entire statement is atomic. You supply the query as unit, and the DBMS responds. There is no "before" and no "after". There is just the query.
Those are the rules. Your job in formulating the query is to supply one that is valid. It's a logical progression: valid question, valid answer, etc. The moment you step out of that frame, you might as well be asking, "why is the sky seven?".
One small clarification to @LoztInSpace's answer. When he refers to the order of your statements, he's presumably talking about the phrasing of your query, which for purposes of evaluation is inconsequential. Sequential SQL statements are executed sequentially, as presented. That is guaranteed by the SQL standard.
I have a CLR aggregate concatenation function, similar to https://gist.github.com/FilipDeVos/5b7b4addea1812067b09. When the number of rows is small, the sequence of concatenated strings follows the input data set. When the number of rows is larger (dozens and more), the sequence seems indeterminate. There is a difference in the execution plan, but I'm not that familiar with the optimizer and what hints to apply (I've tried MAXDOP 1, without success). From a different test than the example below, with similar results, here's what seems to be the difference in the plan: the separate sorts, then a merge join. The row count where it tipped over here was 60.
yielded expected results: (plan screenshot)
yielded unexpected results: (plan screenshot)
Below is the query that demonstrates the issue in the AdventureWorks2014 sample database with the above CLR aggregate (renamed to TestConcatenate). The intended result is a dataset with a row for each order and a column with a delimited list of products for that order in quantity sequence.
;with cte_ordered_steps AS (
SELECT top 100000 sd.SalesOrderID, SalesOrderDetailID, OrderQty
FROM [Sales].[SalesOrderDetail] sd
--WHERE sd.SalesOrderID IN (53598, 53595)
ORDER BY sd.SalesOrderID, OrderQty
)
select
sd.SalesOrderID,
dbo.TestConcatenate(' QTY: ' + CAST(sd.OrderQty AS VARCHAR(9)) + ': ' + IsNull(p.Name, ''))
FROM [Sales].[SalesOrderDetail] sd
JOIN [Production].[Product] p ON p.ProductID = sd.ProductID
JOIN cte_ordered_steps r ON r.SalesOrderID = sd.SalesOrderID AND r.SalesOrderDetailID = sd.SalesOrderDetailID
where sd.SalesOrderID IN (53598, 53595)
GROUP BY sd.SalesOrderID
When the SalesOrderID is constrained in the cte for 53598, 53595, the sequence is correct (top set); when it's constrained in the main select for 53598, 53595, the sequence is not (bottom set).
So what's my question? How can I build the query, with hints or other changes, so that it returns consistently (and correctly) sequenced concatenated values independent of the number of rows?
Just like a normal query, if there isn't an ORDER BY clause, return order isn't guaranteed. If I recall correctly, the SQL-92 spec allows for an ORDER BY clause to be passed in to an aggregate via an OVER clause, but SQL Server doesn't implement it. So there's no way to guarantee ordering in your CLR function (unless you implement it yourself by collecting everything in the Accumulate and Merge methods into some sort of collection and then sorting the list in the Terminate method before returning it, but you'll pay a cost in terms of memory grants, as you now need to serialize the collection).
As to why you're seeing different behavior based on the size of your result set, I notice that a different join operator is being used between the two. A loop join and a merge join walk through the two sets being joined differently and so that might account for the difference you're seeing.
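As a side note (not from the original answers): on SQL Server 2017 and later, the built-in STRING_AGG aggregate accepts a WITHIN GROUP (ORDER BY ...) clause, which gives ordered concatenation without a CLR aggregate at all. A sketch against the AdventureWorks2014 tables from the question:
SELECT sd.SalesOrderID,
       STRING_AGG(' QTY: ' + CAST(sd.OrderQty AS VARCHAR(9)) + ': ' + ISNULL(p.Name, ''), '')
           WITHIN GROUP (ORDER BY sd.OrderQty, sd.SalesOrderDetailID) AS ProductList
FROM [Sales].[SalesOrderDetail] sd
JOIN [Production].[Product] p ON p.ProductID = sd.ProductID
WHERE sd.SalesOrderID IN (53598, 53595)
GROUP BY sd.SalesOrderID;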
Why not try the aggregate dbo.GROUP_CONCAT_S available at http://groupconcat.codeplex.com. The S is for Sorted output. It does exactly what you want.
While this answer doesn't have a solution, the additional information that Ben and Orlando provided (thanks!) has given me what I need to move on. I'll take the approach that Orlando pointed to, which was also my plan B, i.e. sorting in the CLR.
Very odd problem. This function has randomly started hanging and timing out when I call it like this:
DECLARE @persId int
SET @persId = 336

SELECT * FROM [CIDER].[dbo].[SMAN_ACL_getPermissions] (
null
,@persId
,1
,null)
GO
But returns super quick when I call it like this:
SELECT * FROM [CIDER].[dbo].[SMAN_ACL_getPermissions] (
null
,336
,1
,null)
GO
Could someone please highlight the difference between these two for me? It's making debugging very hard...
The variable could be a null value, whereas the static value definitely is not. This can lead to different execution plans.
You could be falling prey to parameter sniffing. Take a look at the execution plan for the one that isn't performing well. In the plan XML, you'll see two values in the ParameterList tag, ParameterCompiledValue and ParameterRuntimeValue, which are self-explanatory. If the data distribution is wildly different for the two, you could be getting a sub-optimal plan for your runtime value. You could try adding a "with (recompile)" hint to the statement that is running slowly within your function and see if it helps.
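One cheap experiment along the same lines (my suggestion, not part of the answer above): add OPTION (RECOMPILE) to the calling statement. With that hint the optimizer can embed the runtime value of the variable into the plan, much as the literal 336 is embedded in the fast version:
DECLARE @persId int = 336

SELECT * FROM [CIDER].[dbo].[SMAN_ACL_getPermissions] (
   null
  ,@persId
  ,1
  ,null)
OPTION (RECOMPILE)
GO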
Just ran into a major headache when concatenating several VARCHAR(MAX) variables together to build an email based on several different queries.
For efficiency's sake, I was using several varchars to build the email at once, rather than going through roughly the same query two or three or more times to build it using only one varchar.
This worked, right up until my varchars got longer than 8000 characters. Then the concatenation of them all into one varchar (which I could shove into the @body parameter of msdb.dbo.sp_send_dbmail) returned "", and even LEN() wouldn't actually give me a length.
Anyhow, I've gotten around this by doing roughly the same queries several times and building the email with only one varchar(max).
TL;DR
I'm not happy with the solution. How could I have appended these varchar(max) variables to each other?
One thing I've hit in the past which may or may not help here: SQL seems to "forget" what datatype it's working with when you concatenate varchar(max). Instead of maintaining the MAX, it devolves to conventional varcharnitude, meaning truncation at 8000 characters or so. To get around this, we use the following trick:
Start with
SET @MyMaxVarchar = @aVarcharMaxValue + @SomeString + @SomeOtherString + @etc
and revise like so:
SET @MyMaxVarchar = cast(@aVarcharMaxValue as varchar(max)) + @SomeString + @SomeOtherString + @etc
Again, this may not help with your particular problem, but remembering it might save you major headaches down the road some day.
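A related quick demonstration of the same non-MAX capping, using REPLICATE rather than concatenation; the 8000-character limit applies whenever every input to an expression is a non-MAX type:
DECLARE @short VARCHAR(MAX) = REPLICATE('x', 10000)                        -- the literal 'x' is a non-MAX varchar
DECLARE @long  VARCHAR(MAX) = REPLICATE(CAST('x' AS VARCHAR(MAX)), 10000)  -- forcing MAX semantics

SELECT LEN(@short) AS capped_length,   -- 8000
       LEN(@long)  AS full_length      -- 10000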
This may not have happened in your case, but there's a "gotcha" embedded in SQL Management Studio involving VARCHAR(MAX): SQL Studio will only output so many characters in the results grid. You can test this:
SELECT @MyLongVar, LEN(@MyLongVar)
You may find that the length of the actual data returned (most text editors can give you this) is less than the length of the data stored in the variable.
The fix is in Tools | Options | Query Results | SQL Server | Results to Grid; increase Maximum Characters Retrieved | Non XML data to some very large number. Unfortunately the maximum is 65,535, which may not be enough.
If your problem does not involve outputting the variable's value in SQL Studio, please disregard.
I have found that MS SQL silently does NOTHING when attempting to concatenate a string to a NULL value (the result is simply NULL). Therefore this solution always works for me:
UPDATE myTable
SET myCol = ISNULL(myCol, '') + 'my text'
WHERE myColumnID = 9999
I've been battling this one for a while now. I have a stored proc that takes in 3 parameters that are used to filter. If a specific value is passed in, I want to filter on that. If -1 is passed in, give me all.
I've tried it the following two ways:
First way:
SELECT field1, field2...etc
FROM my_view
WHERE
parm1 = CASE WHEN @PARM1 = -1 THEN parm1 ELSE @PARM1 END
AND parm2 = CASE WHEN @PARM2 = -1 THEN parm2 ELSE @PARM2 END
AND parm3 = CASE WHEN @PARM3 = -1 THEN parm3 ELSE @PARM3 END
Second Way:
SELECT field1, field2...etc
FROM my_view
WHERE
(@PARM1 = -1 OR parm1 = @PARM1)
AND (@PARM2 = -1 OR parm2 = @PARM2)
AND (@PARM3 = -1 OR parm3 = @PARM3)
I read somewhere that the second way will short-circuit and never evaluate the second part if the first is true. My DBA said it forces a table scan. I have not verified this, but it seems to run slower in some cases.
The main table that this view selects from has somewhere around 1.5 million records, and the view proceeds to join on about 15 other tables to gather a bunch of other information.
Both of these methods are slow...taking me from instant to anywhere from 2-40 seconds, which in my situation is completely unacceptable.
Is there a better way that doesn't involve breaking it down into each separate case of specific vs -1 ?
Any help is appreciated. Thanks.
I read somewhere that the second way will short circuit and never eval the second part if true. My DBA said it forces a table scan.
You read wrong; it will not short circuit. Your DBA is right; it will not play well with the query optimizer and likely force a table scan.
The first option is about as good as it gets. Your options to improve things are dynamic SQL, or a long stored procedure with every possible combination of filter columns so that you get independent query plans. You might also try using the "WITH RECOMPILE" option, but I don't think it will help you.
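For completeness, a minimal sketch of the dynamic SQL route using sp_executesql (names taken from the question; only fixed predicate text is concatenated, the values stay parameterized):
DECLARE @sql NVARCHAR(MAX) = N'SELECT field1, field2 FROM my_view WHERE 1 = 1'

IF @PARM1 <> -1 SET @sql += N' AND parm1 = @PARM1'
IF @PARM2 <> -1 SET @sql += N' AND parm2 = @PARM2'
IF @PARM3 <> -1 SET @sql += N' AND parm3 = @PARM3'

EXEC sp_executesql @sql,
     N'@PARM1 int, @PARM2 int, @PARM3 int',
     @PARM1, @PARM2, @PARM3
Each combination of supplied filters then gets its own cached plan.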
If you are running SQL Server 2005 or above, you can use IFs to make multiple versions of the query, each with the proper WHERE clause, so that an index can be used. Each query plan will be placed in the query cache.
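A sketch of that IF approach for a single optional filter (with three parameters you end up with one branch per combination, which is why the procedure gets long):
IF @PARM1 = -1
    SELECT field1, field2 FROM my_view
ELSE
    SELECT field1, field2 FROM my_view WHERE parm1 = @PARM1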
Also, here is a very comprehensive article on this topic:
Dynamic Search Conditions in T-SQL by Erland Sommarskog
It covers all the issues and methods of trying to write queries with multiple optional search conditions.
Here is the table of contents:
Introduction
The Case Study: Searching Orders
The Northgale Database
Dynamic SQL
Introduction
Using sp_executesql
Using the CLR
Using EXEC()
When Caching Is Not Really What You Want
Static SQL
Introduction
x = @x OR @x IS NULL
Using IF statements
Umachandar's Bag of Tricks
Using Temp Tables
x = @x AND @x IS NOT NULL
Handling Complex Conditions
Hybrid Solutions – Using both Static and Dynamic SQL
Using Views
Using Inline Table Functions
Conclusion
Feedback and Acknowledgements
Revision History
If you pass in a null value when you want everything, then you can write your where clause as
Where colName = IsNull(@Parameter, ColName)
This is basically the same as your first method... it will work as long as the column itself is not nullable; NULL values in the column will mess it up slightly.
The only approach to speed it up is to add an index on the column being filtered on in the Where clause. Is there one already? If not, that will result in a dramatic improvement.
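For example, something along these lines (a sketch only; the index name is made up, and the index would go on the base table behind my_view rather than on the view itself):
CREATE INDEX IX_BaseTable_parm1 ON dbo.BaseTable (parm1)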
No other way I can think of than doing:
WHERE
(MyCase IS NULL OR MyCase = @MyCaseParameter)
AND ....
The second one is simpler and more readable to other developers, if you ask me.
SQL 2008 and later make some improvements to optimization for things like (MyCase IS NULL OR MyCase = @MyCaseParameter) AND ....
If you can upgrade, and if you add an OPTION (RECOMPILE) to get decent perf for all possible param combinations (this is a situation where there is no single plan that is good for all possible param combinations), you may find that this performs well.
http://blogs.msdn.com/b/bartd/archive/2009/05/03/sometimes-the-simplest-solution-isn-t-the-best-solution-the-all-in-one-search-query.aspx
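Putting it together, a sketch of the pattern that answer describes (the same predicates as in the question, with the hint added so each execution is optimized for the actual parameter values):
SELECT field1, field2
FROM my_view
WHERE (@PARM1 = -1 OR parm1 = @PARM1)
  AND (@PARM2 = -1 OR parm2 = @PARM2)
  AND (@PARM3 = -1 OR parm3 = @PARM3)
OPTION (RECOMPILE)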