sql constant function causes inferior query plan to be used - sql-server

I'm on SQL Server 2016 and am seeing the following:
I have a simple query similar to:
select distinct col1
from tbl
where
col2 > 12345
If I move the constant value into a function, the query plan changes (for the worse, by A LOT):
select distinct col1
from tbl
where
col2 > dbo.fn12345()
where the function is
create function dbo.fn12345()
returns int
as begin
return 12345
end
Here are screenshots of the plans (using my actual schema, so the identifiers differ from the illustrative example).
without function:
with function:
With the 2nd plan my execution time goes from 22s to 96s.
Is there any way to fix this while still using functions?
Please, no questions asking why I can't just inline the constant. The same issue occurs for more complex functions that include sargable logic: inlining what is effectively a complex constant calculation changes the query plan.
I am also aware that my index is not optimal. This is by design. The table is very large and this particular query doesn't warrant the storage for a dedicated index.

Functions in WHERE clauses tend to cause problems of this kind; even something as straightforward as ISNULL() can change the plan.
Is there any way you can persist the computed result in a table (even a temp table)? Then you can cross join to it.
NB: create statistics on your table, as this will help the optimizer.
SELECT 12345 as val into #t
select distinct col1
from tbl
CROSS JOIN #t
where
col2 > val
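As a rough sketch of the suggestion above, the function can be evaluated once, its result persisted in a temp table, and statistics created on that table before the cross join. The names tbl, col1, col2 and dbo.fn12345() come from the question; the statistics name st_t_val is a placeholder invented for illustration, and whether this actually restores the faster plan would need to be checked against the real schema.

```sql
-- Sketch only: tbl, col1, col2 and dbo.fn12345() come from the question;
-- the statistics name st_t_val is a made-up placeholder.
IF OBJECT_ID('tempdb..#t') IS NOT NULL
    DROP TABLE #t;

-- Evaluate the function once and persist the result.
SELECT dbo.fn12345() AS val
INTO #t;

-- Create statistics on the persisted value so the optimizer can see it.
CREATE STATISTICS st_t_val ON #t (val);

-- The predicate now compares against a column, not a scalar UDF call.
SELECT DISTINCT t.col1
FROM tbl AS t
CROSS JOIN #t AS c
WHERE t.col2 > c.val;
```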

Related

SQL Server - UNION with WHERE clause outside is extremely slow on simple join

I have a simple query and it works fast (<1sec):
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
WHERE a = '1/1/2020'
However, if I join with another small table in the final SELECT statement it is extremely slow (> 30 sec)
DECLARE @anotherTable TABLE (A DATE, B INT)
INSERT INTO @anotherTable (A, B)
VALUES ('1/1/2020', 1)
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
JOIN @anotherTable T ON T.A = D.A AND T.B = D.B
In the real application, I have a complex UPDATE as the final statement, so to avoid copy-paste I introduced the UNION to consolidate the code.
But now I am experiencing an unexpected slowdown.
I tried using UNION ALL instead of UNION, with the same result.
It looks like SQL Server pushes a simple condition down into each branch of the UNION, but when I join with another table this doesn't happen and a table scan occurs.
Any advice?
UPDATE: Here are the estimated plans
for the first simple condition query: https://www.brentozar.com/pastetheplan/?id=SJ5fynTgP
for the query with a join table: https://www.brentozar.com/pastetheplan/?id=H1eny3pxP
Please keep in mind that the estimated plans are not for exactly the query above but for a more realistic one that has exactly the same problem.
When I'm doing complex updates, I normally declare a temp table and insert into it the rows that I intend to update. There are two benefits to this approach. First, by explicitly collecting the rows to be updated you simplify the logic and make the update itself really simple (just update the rows whose primary key is in your temp table). Second, you can do some sanity checking before actually running your update, and "throw an error" by returning a different value.
I think it's generally a good practice to break down queries into simple steps like this, because it makes them much easier to troubleshoot in the future.
The following is based on the "similar" execution plan you shared. It would be better to have the actual plan, to know whether your estimates and memory grants are OK.
Key lookup
The index IX_dperf_date_fund should be extended to INCLUDE the following columns: nav, equity.
Why? For every row the index returns, SQL Server performs a lookup in the clustered index to retrieve the values of nav and equity.
Do this only if it is reasonable for the application, for example if other queries may benefit as well.
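A minimal sketch of that index change might look like the following. Only the index name and the included columns (nav, equity) come from the plan; the table name dbo.DPerf and the key columns AsOfDate and FundId are guesses made for illustration and need to be replaced with the real definition.

```sql
-- Sketch only: dbo.DPerf, AsOfDate and FundId are assumed names.
-- The index name and the INCLUDE columns are the ones named in the answer.
CREATE NONCLUSTERED INDEX IX_dperf_date_fund
    ON dbo.DPerf (AsOfDate, FundId)
    INCLUDE (nav, equity)          -- covers the columns behind the key lookup
    WITH (DROP_EXISTING = ON);     -- rebuilds the existing index in place
```

With nav and equity stored at the leaf level of the nonclustered index, the query no longer needs the per-row lookup into the clustered index, at the cost of a larger index.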
CTE
Change your CTE to a temp table.
Example:
SELECT *
INTO #JointIncomingData
FROM (
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM
ETL.tblIncomingData
UNION ALL
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM ETL.vIncomingDataDPerf
) x
Why? CTEs are not materialized.
Bonus: parameter sniffing
If you pass in parameters, you might be suffering from parameter sniffing.

Difference between SQL Function that Returns a Table or a Select From a table [duplicate]

A few examples to show, just in case:
Inline Table Valued
CREATE FUNCTION MyNS.GetUnshippedOrders()
RETURNS TABLE
AS
RETURN SELECT a.SaleId, a.CustomerID, b.Qty
FROM Sales.Sales a INNER JOIN Sales.SaleDetail b
ON a.SaleId = b.SaleId
INNER JOIN Production.Product c ON b.ProductID = c.ProductID
WHERE a.ShipDate IS NULL
GO
Multi Statement Table Valued
CREATE FUNCTION MyNS.GetLastShipped(@CustomerID INT)
RETURNS @CustomerOrder TABLE
(SaleOrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NOT NULL,
OrderQty INT NOT NULL)
AS
BEGIN
DECLARE @MaxDate DATETIME
SELECT @MaxDate = MAX(OrderDate)
FROM Sales.SalesOrderHeader
WHERE CustomerID = @CustomerID
INSERT @CustomerOrder
SELECT a.SalesOrderID, a.CustomerID, a.OrderDate, b.OrderQty
FROM Sales.SalesOrderHeader a INNER JOIN Sales.SalesOrderHeader b
ON a.SalesOrderID = b.SalesOrderID
INNER JOIN Production.Product c ON b.ProductID = c.ProductID
WHERE a.OrderDate = @MaxDate
AND a.CustomerID = @CustomerID
RETURN
END
GO
Is there an advantage to using one type (inline or multi-statement) over the other? Are there certain scenarios when one is better than the other, or are the differences purely syntactical? I realise the two example queries are doing different things, but is there a reason I would write them in that way?
I have read about them, but the advantages and differences haven't really been explained.
In researching Matt's comment, I have revised my original statement. He is correct: there will be a difference in performance between an inline table-valued function (ITVF) and a multi-statement table-valued function (MSTVF) even if they both simply execute a SELECT statement. SQL Server will treat an ITVF somewhat like a VIEW in that it will calculate an execution plan using the latest statistics on the tables in question. An MSTVF is equivalent to stuffing the entire contents of your SELECT statement into a table variable and then joining to that. Thus, the compiler cannot use any table statistics on the tables in the MSTVF. So, all things being equal (which they rarely are), the ITVF will perform better than the MSTVF. In my tests, the performance difference in completion time was negligible; however, from a statistics standpoint, it was noticeable.
In your case, the two functions are not functionally equivalent. The MSTV function does an extra query each time it is called and, most importantly, filters on the customer id. In a large query, the optimizer would not be able to take advantage of other types of joins as it would need to call the function for each customerId passed. However, if you re-wrote your MSTV function like so:
CREATE FUNCTION MyNS.GetLastShipped()
RETURNS @CustomerOrder TABLE
(
SaleOrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NOT NULL,
OrderQty INT NOT NULL
)
AS
BEGIN
INSERT @CustomerOrder
SELECT a.SalesOrderID, a.CustomerID, a.OrderDate, b.OrderQty
FROM Sales.SalesOrderHeader a
INNER JOIN Sales.SalesOrderHeader b
ON a.SalesOrderID = b.SalesOrderID
INNER JOIN Production.Product c
ON b.ProductID = c.ProductID
WHERE a.OrderDate = (
SELECT MAX(SH1.OrderDate)
FROM Sales.SalesOrderHeader AS SH1
WHERE SH1.CustomerID = a.CustomerID
)
RETURN
END
GO
In a query, the optimizer would be able to call that function once and build a better execution plan, but it still would not be better than an equivalent, non-parameterized ITVF or a VIEW.
ITVFs should be preferred over MSTVFs when feasible because the datatypes, nullability and collation are derived from the columns in the underlying tables, whereas you declare those properties yourself in a multi-statement table-valued function, and, importantly, you will get better execution plans from the ITVF. In my experience, I have not found many circumstances where an ITVF was a better option than a VIEW, but mileage may vary.
Thanks to Matt.
Addition
Since I saw this come up recently, here is an excellent analysis done by Wayne Sheffield comparing the performance difference between Inline Table Valued functions and Multi-Statement functions.
His original blog post.
Copy on SQL Server Central
Internally, SQL Server treats an inline table valued function much like it would a view and treats a multi-statement table valued function similar to how it would a stored procedure.
When an inline table-valued function is used as part of an outer query, the query processor expands the UDF definition and generates an execution plan that accesses the underlying objects, using the indexes on these objects.
For a multi-statement table valued function, an execution plan is created for the function itself and stored in the execution plan cache (once the function has been executed the first time). If multi-statement table valued functions are used as part of larger queries then the optimiser does not know what the function returns, and so makes some standard assumptions - in effect it assumes that the function will return a single row, and that the returns of the function will be accessed by using a table scan against a table with a single row.
Where multi-statement table valued functions can perform poorly is when they return a large number of rows and are joined against in outer queries. The performance issues are primarily down to the fact that the optimiser will produce a plan assuming that a single row is returned, which will not necessarily be the most appropriate plan.
As a general rule of thumb we have found that where possible inline table valued functions should be used in preference to multi-statement ones (when the UDF will be used as part of an outer query) due to these potential performance issues.
There is another difference. An inline table-valued function can be inserted into, updated, and deleted from - just like a view. Similar restrictions apply - can't update functions using aggregates, can't update calculated columns, and so on.
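To illustrate the updatability point, here is a small sketch, not tested here. The function MyNS.GetUnshippedSales is hypothetical, as is the customer id 42; Sales.Sales and its columns SaleId, CustomerID and ShipDate are borrowed from the question's first example. It assumes the function selects from a single base table, since the same restrictions as for views apply.

```sql
-- Hypothetical single-table inline TVF; table and column names are
-- borrowed from the question's GetUnshippedOrders example.
CREATE FUNCTION MyNS.GetUnshippedSales()
RETURNS TABLE
AS
RETURN
    SELECT s.SaleId, s.CustomerID, s.ShipDate
    FROM Sales.Sales AS s
    WHERE s.ShipDate IS NULL;
GO

-- Update the base table through the function, as one would through a view.
UPDATE u
SET ShipDate = GETDATE()
FROM MyNS.GetUnshippedSales() AS u
WHERE u.CustomerID = 42;   -- arbitrary example customer
GO
```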
Your examples, I think, answer the question very well. The first function can be done as a single select, and is a good reason to use the inline style. The second could probably be done as a single statement (using a sub-query to get the max date), but some coders may find it easier to read or more natural to do it in multiple statements as you have done. Some functions just plain can't get done in one statement, and so require the multi-statement version.
I suggest using the simplest (inline) whenever possible, and using multi-statements when necessary (obviously) or when personal preference/readability makes it worth the extra typing.
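For instance, the question's multi-statement GetLastShipped could plausibly be written inline by moving the max-date lookup into a sub-query. This is only a sketch: the name GetLastShippedInline is invented here, and the tables and joins are copied from the question as posted rather than checked against a real schema.

```sql
-- Sketch of an inline equivalent; GetLastShippedInline is a made-up name.
-- Tables, joins and column names are copied from the question as posted.
CREATE FUNCTION MyNS.GetLastShippedInline (@CustomerID INT)
RETURNS TABLE
AS
RETURN
    SELECT a.SalesOrderID AS SaleOrderID,
           a.CustomerID,
           a.OrderDate,
           b.OrderQty
    FROM Sales.SalesOrderHeader AS a
    INNER JOIN Sales.SalesOrderHeader AS b
        ON a.SalesOrderID = b.SalesOrderID
    INNER JOIN Production.Product AS c
        ON b.ProductID = c.ProductID
    WHERE a.CustomerID = @CustomerID
      -- The sub-query replaces the separate @MaxDate assignment.
      AND a.OrderDate = (SELECT MAX(sh.OrderDate)
                         FROM Sales.SalesOrderHeader AS sh
                         WHERE sh.CustomerID = @CustomerID);
GO
```

Because the body is a single SELECT, the optimizer can expand it into the calling query instead of treating the result as an opaque table variable.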
Another case for a multi-statement function would be to prevent SQL Server from pushing down the WHERE clause.
For example, I have a table of table names. Some of the names are formatted like C05_2019 and C12_2018, and all tables named that way have the same schema. I wanted to merge all that data into one table and parse out 05 and 12 into a CompNo column and 2018 and 2019 into a year column. However, there are other tables, like ACA_StupidTable, from which I cannot extract CompNo and CompYr, and I would get a conversion error if I tried. So my query was in two parts: an inner query that returned only tables formatted like 'C_______', and an outer query that did a substring and int conversion, i.e. Cast(Substring(2, 2) as int) as CompNo. All looks good, except that SQL Server decided to evaluate my Cast before the results were filtered, so I got a mind-scrambling conversion error. A multi-statement table function may prevent that from happening, since it is basically a "new" table.
Look at Comparing Inline and Multi-Statement Table-Valued Functions; you can find good descriptions and performance benchmarks there.
I have not tested this, but a multi-statement function caches the result set. There may be cases where there is too much going on for the optimizer to inline the function. For example, suppose you have a function that returns a result from different databases depending on what you pass as a "Company Number". Normally, you could create a view with a UNION ALL and then filter by company number, but I found that sometimes SQL Server pulls back the entire union and is not smart enough to call the one select. A table function can have logic to choose the source.
Maybe in a very condensed way:
ITVF (inline TVF): more for a DB person; a kind of parameterized view that takes a single SELECT statement.
MTVF (multi-statement TVF): more for a developer; creates and loads a table variable.
If you are going to do a query, you can join to your inline table-valued function like this:
SELECT
a.*,b.*
FROM AAAA a
INNER JOIN MyNS.GetUnshippedOrders() b ON a.z=b.z
It will incur little overhead and run fine.
If you try to use your multi-statement table-valued function in a similar query, you will have performance issues:
SELECT
x.a,x.b,x.c,(SELECT OrderQty FROM MyNS.GetLastShipped(x.CustomerID)) AS Qty
FROM xxxx x
Because the function executes once for each row returned, the query runs slower and slower as the result set gets larger.

Dynamic inner query

Is there a way to code a dynamic inner query? Basically, I find myself typing something like the following query over and over:
;with tempData as (
--this inner query is the part that changes, but there's always a timeGMT column.
select timeGMT, dataCol2, dataCol3
from tbl1 t1
join tbl2 t2 on t1.ID=t2.ID
)
select dateadd(ss,d.gmtOffset,t.timeGMT) timeLocal,
t.*
from tempData t
join dst d on t.timeGMT between d.sTimeGMT and d.eTimeGMT
where d.zone = 'US-Eastern'
The only thing I can think of is a stored proc with the inner query text as the input for some dynamic sql... However, my understanding of the optimizer (which is, admittedly, limited) says this isn't really a good idea.
From a performance perspective, what you have there is the version on which I would expect the optimizer to do the best job.
If the "outer" part of your example is static and code maintenance overrides performance, I'd look to encapsulating the dateadd result in a table-valued function (TVF). Since the time conversion is very much the common thread in these queries, I would definitely focus on that part of the workload.
For example, your query that can vary would look like this:
select timeGMT, dataCol2, dataCol3, lt.timeLocal
from tbl1 t1
join tbl2 t2 on t1.ID = t2.ID
cross apply dbo.LocalTimeGet(timeGMT, 'US-Eastern') AS lt
Where the TVF dbo.LocalTimeGet contains the logic for dateadd(ss,d.gmtOffset,t.timeGMT) and the lookup of the time zone offset value based on the time zone name. The implementation of that function would look something like:
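CREATE FUNCTION dbo.LocalTimeGet (
@TimeGMT datetime,
@TimeZone varchar(20)
)
RETURNS TABLE
AS
RETURN (
SELECT DATEADD(ss, d.gmtOffset, @TimeGMT) AS timeLocal
FROM dst AS d
WHERE d.zone = @TimeZone
-- pick the offset row whose range covers the input time, as in the original join
AND @TimeGMT BETWEEN d.sTimeGMT AND d.eTimeGMT
);
GO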
The upside of this approach is when you upgrade to 2008 or later, there are system functions you could use to make this conversion a lot easier to code and you'll only have to alter the TVF. If your result sets are small, I'd consider a system scalar function (SQL 2008) over a TVF, even if it implements those same system functions. Based on your comment, it sounds like the system functions won't do what you need, but you could still stick with your implementation of a dst table, which is encapsulated in the TVF above.
Multi-statement TVFs can be a performance problem because the optimizer assumes they return only 1 row; an inline TVF like the one above is expanded into the calling query instead.
If you need to combine encapsulation and performance, then I'd do the time zone calc in the application code instead. Even though you'd have to apply it to each project that uses it, you would only have to implement it 1x in each project (in the Data Access Layer) and treat it as a common utility library if you'll be using across projects.
To answer the OP's follow-on question, a SQL Server 2008 solution would look like this:
First, create permanent definitions:
CREATE TYPE dbo.tempDataType AS TABLE (
timeGMT DATETIME,
dataCol2 int,
dataCol3 int)
GO
CREATE PROCEDURE ComputeDateWithDST
@tempData tempDataType READONLY
AS
SELECT dateadd(ss,d.gmtOffset,t.timeGMT) timeLocal, t.*
FROM @tempData t
JOIN dst d ON t.timeGMT BETWEEN d.sTimeGMT AND d.eTimeGMT
WHERE d.zone = 'US-Eastern'
GO
Thereafter, whenever you want to plug a subquery (which has now become a separate query, no longer a CTE) into the stored procedure:
DECLARE @tempData tempDataType
INSERT @tempData
-- sample subquery:
SELECT timeGMT, dataCol2, dataCol3
FROM tbl1 t1
JOIN tbl2 t2 ON t1.ID=t2.ID
EXEC ComputeDateWithDST @tempData;
GO
Performance could be an issue because you'd be running separately what used to be a CTE instead of letting SQL Server combine it with the main query to optimize the execution plan.

Is it okay to use table look up functionalities in scalar function?

In our case we have some business logic that looks into several tables in a certain order, so that the first non-null value found is used. The lookup is not hard, but it does take several lines of SQL code to accomplish. I have read about scalar-valued functions in SQL Server, but I don't know whether the recompilation issue affects me enough to do it in a less convenient way.
So what's the general rule of thumb?
Would you rather have something like
select id, dbo.udfGetFirstNonNull(id) from mytable
Or are table-valued functions any better than scalar ones?
select id,
(select firstNonNull from udfGetFirstNonNull(id)) as firstNonNull
from myTable
The scalar UDF will do its lookup for each row in myTable, so the run time grows with the row count. Effectively you have a CURSOR. If you have only a few rows, it won't matter, of course.
I do the same myself where I don't expect a lot of rows (no more than a few hundred).
However, I would consider a table-valued function where I've placed "foo" here. "foo" could also be a CTE in a UDF (not tested):
select M.id, foo.firstNonNull
from
myTable M
JOIN
(SELECT id, MIN(value) AS firstNonNull -- MIN stands in for "first when ordered by value"
FROM OtherTable
WHERE value IS NOT NULL
GROUP BY id) foo ON M.id = foo.id
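One way the ordered lookup from the question might be packaged as an inline table-valued function is sketched below. Everything except myTable and its id column is an assumption: the function name dbo.tvfGetFirstNonNull, the three source tables dbo.SourceA, dbo.SourceB and dbo.SourceC, and their (id, value) columns are invented for illustration, and id is assumed to be unique in each source table.

```sql
-- Sketch only: dbo.SourceA/B/C and their (id, value) columns are
-- hypothetical, and id is assumed unique in each of them.
CREATE FUNCTION dbo.tvfGetFirstNonNull (@id INT)
RETURNS TABLE
AS
RETURN
    -- COALESCE returns the first non-null value in argument order,
    -- which encodes the "look in table A, then B, then C" rule.
    SELECT COALESCE(
               (SELECT a.value FROM dbo.SourceA AS a WHERE a.id = @id),
               (SELECT b.value FROM dbo.SourceB AS b WHERE b.id = @id),
               (SELECT c.value FROM dbo.SourceC AS c WHERE c.id = @id)
           ) AS firstNonNull;
GO

-- CROSS APPLY lets the optimizer expand the function into the outer query.
SELECT m.id, f.firstNonNull
FROM myTable AS m
CROSS APPLY dbo.tvfGetFirstNonNull(m.id) AS f;
GO
```

Because the function is inline, it is expanded into the calling query rather than executed row by row the way a scalar UDF is.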
Your first query is fine. One place I work for is absolutely obsessed with speed and optimization, and they use UDF's heavily in this way.
I think for readability and maintainability, I would prefer to use the scalar function, as that is what it is returning.
