Is there a way to create "asserts" on the parameters of a table-valued UDF?
I'd like to use a table-valued UDF for performance reasons; however, I know that certain parameter combinations (such as start and end dates more than a month apart) will cause performance problems on the server for all users.
End users query the database from Excel using UDFs. UDFs (and table-valued UDFs in particular) are useful when the data is too large for Excel. Users write simple SQL queries that categorize the data into groups to reduce the number of rows. For example, a user may be interested in weekly aggregates rather than hourly ones, so a GROUP BY SELECT statement reduces the row count by a factor of 24 x 7 = 168. I know I can write RAISERROR statements in multistatement UDFs, but table-valued UDFs are integrated into the query optimizer, so queries are more efficient with table-valued UDFs.
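For example, a weekly rollup over a hypothetical hourly TVF (the function and column names here are illustrative, not from the actual schema) might look like this:
SELECT DATEADD(week, DATEDIFF(week, 0, r.ReadingTime), 0) AS WeekStart,  -- bucket each hourly row into its week
       AVG(r.Value) AS AvgValue
FROM dbo.fn_HourlyReadings('20090101', '20090201') AS r                  -- hypothetical table-valued UDF
GROUP BY DATEADD(week, DATEDIFF(week, 0, r.ReadingTime), 0);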
So, can I define assertions on the parameters passed to a table-valued UDF?
The short answer is no - an inline (single-statement) TVF can only contain its one SELECT statement, so there is nowhere to add separate validation logic.
There are a couple of alternatives you could try. One would be to carry out validation of the parameters within the SQL statement itself by extending the WHERE clause, like this:
...
WHERE ...
AND DATEDIFF(day, @startDate, @endDate) < 31
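Expanded into a full inline TVF, that approach looks like this minimal sketch (the dbo.Readings table and its columns are hypothetical):
CREATE FUNCTION dbo.fn_HourlyReadings (@startDate DATETIME, @endDate DATETIME)
RETURNS TABLE
AS
RETURN
(
    SELECT r.ReadingTime, r.Value
    FROM dbo.Readings AS r                            -- hypothetical table
    WHERE r.ReadingTime >= @startDate
      AND r.ReadingTime < @endDate
      AND DATEDIFF(day, @startDate, @endDate) < 31    -- the parameter "assert": no rows if the range is too wide
);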
This may not be ideal for a few reasons. First, it may lead users to think that no data exists meeting their criteria, since there's no means to communicate why no results have been returned. Second, there's no guarantee that the DB engine won't run the data-access parts of the query anyway before evaluating the parameter check. Third, it may lead to a bad plan being cached.
If you're on SQL Server 2008, an alternative approach would be to look into the SQL Server Resource Governor, which provides a means to limit users or groups of users to running queries whose estimated execution time in seconds is below a given threshold.
Another approach again would be to build some parameter validation into the Excel sheets your users use for their queries, but this may not be practical depending on the details of your setup.
Related
What is the difference between scalar-valued, table-valued, and aggregate functions in SQL Server? And does calling them from a query require a different method, or do we call them all in the same way?
Scalar Functions
Scalar functions (sometimes referred to as User-Defined Functions / UDFs) return a single value as a return value, not as a result set, and can be used in most places within a query or SET statement, except for the FROM clause (and maybe other places?). Also, scalar functions can be called via EXEC, just like Stored Procedures, though there are not many occasions to make use of this ability (for more details on this ability, please see my answer to the following question on DBA.StackExchange: Why scalar valued functions need execute permission rather than select?). These can be created in both T-SQL and SQLCLR.
T-SQL (UDF):
Prior to SQL Server 2019: these scalar functions are typically a performance issue because they generally run for every row returned (or scanned) and always prohibit parallel execution plans.
Starting in SQL Server 2019: certain T-SQL scalar UDFs can be inlined, that is, have their definitions placed directly into the query such that the query does not call the UDF (similar to how iTVFs work (see below)). There are restrictions that can prevent a UDF from being inlineable (if that wasn't a word before, it is now), and UDFs that can be inlined will not always be inlined due to several factors. This feature can be disabled at the database, query, and individual UDF levels. For more information on this really cool new feature, please see: Scalar UDF Inlining (be sure to review the "requirements" section).
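As a quick illustration, here is a minimal scalar UDF of the kind that tends to be inlineable (the name and logic are hypothetical): it is deterministic and does no data access.
CREATE FUNCTION dbo.ufn_DiscountedPrice (@price MONEY, @discountPct DECIMAL(5, 2))
RETURNS MONEY
AS
BEGIN
    -- Deterministic, no data access: a typical inlining candidate in SQL Server 2019+.
    RETURN @price * (1 - @discountPct / 100);
END;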
SQLCLR (UDF): these scalar functions also typically run per each row returned or scanned, but there are two important benefits over T-SQL UDFs:
Starting in SQL Server 2012, return values can be constant-folded into the execution plan IF the UDF does not do any data access, and if it is marked IsDeterministic = true. In this case the function wouldn't run per each row.
SQLCLR scalar functions can work in parallel plans ( 😃 ) if they do not do any database access.
Table-Valued Functions
Table-Valued Functions (TVFs) return result sets, and can be used in a FROM clause, JOIN, or CROSS APPLY / OUTER APPLY of any query, but unlike simple Views, cannot be the target of any DML statements (INSERT / UPDATE / DELETE). These can also be created in both T-SQL and SQLCLR.
T-SQL MultiStatement (TVF): these TVFs, as their name implies, can have multiple statements, similar to a Stored Procedure. Whatever results they are going to return are stored in a Table Variable and returned at the very end; meaning, nothing is returned until the function is done processing. The estimated number of rows that they will return, as reported to the Query Optimizer (which impacts the execution plan), depends on the version of SQL Server (a minimal sketch follows the list below):
Prior to SQL Server 2014: these always report 1 (yes, just 1) row.
SQL Server 2014 and 2016: these always report 100 rows.
Starting in SQL Server 2017: default is to report 100 rows, BUT under some conditions the row count will be fairly accurate (based on current statistics) thanks to the new Interleaved Execution feature.
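Here is that minimal multistatement TVF sketch (all names hypothetical); note that every row is buffered in the table variable before anything is returned:
CREATE FUNCTION dbo.fn_TopCustomers (@minOrders INT)
RETURNS @result TABLE (CustomerID INT, OrderCount INT)
AS
BEGIN
    -- Results accumulate in @result and are only handed back at RETURN.
    INSERT INTO @result (CustomerID, OrderCount)
    SELECT o.CustomerID, COUNT(*)
    FROM dbo.Orders AS o                      -- hypothetical table
    GROUP BY o.CustomerID
    HAVING COUNT(*) >= @minOrders;

    RETURN;
END;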
T-SQL Inline (iTVF): these TVFs can only ever be a single statement, and that statement is a full query, just like a View. And in fact, Inline TVFs are essentially a View that accepts input parameters for use in the query. They also do not cache their own query plan as their definition is placed into the query in which they are used (unlike the other objects described here), hence they can be optimized much better than the other types of TVFs ( 😃 ). These TVFs perform quite well and are preferred if the logic can be handled in a single query.
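For comparison, the same hypothetical logic as an inline TVF is just a single wrapped query:
CREATE FUNCTION dbo.fn_TopCustomersInline (@minOrders INT)
RETURNS TABLE
AS
RETURN
(
    SELECT o.CustomerID, COUNT(*) AS OrderCount
    FROM dbo.Orders AS o                      -- hypothetical table
    GROUP BY o.CustomerID
    HAVING COUNT(*) >= @minOrders
);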
SQLCLR (TVF): these TVFs are similar to T-SQL MultiStatement TVFs in that they build up the entire result set in memory (even if it is swap / page file) before releasing all of it at the very end. The estimated number of rows that they will return, as reported to the Query Optimizer (which impacts the execution plan) is always 1000 rows. Given that a fixed row count is far from ideal, please support my request to allow for specifying the row count: Allow TVFs (T-SQL and SQLCLR) to provide user-defined row estimates to query optimizer
SQLCLR Streaming (sTVF): these TVFs allow for complex C# / VB.NET code just like regular SQLCLR TVFs, but are special in that they return each row to the calling query as they are generated ( 😃 ). This model allows the calling query to start processing the results as soon as the first one is sent so the query doesn't need to wait for the entire process of the function to complete before it sees any results. And it requires less memory since the results aren't being stored in memory until the process completes. The estimated number of rows that they will return, as reported to the Query Optimizer (which impacts the execution plan) is always 1000 rows. Given that a fixed row count is far from ideal, please support my request to allow for specifying the row count: Allow TVFs (T-SQL and SQLCLR) to provide user-defined row estimates to query optimizer
Aggregate Functions
User-Defined Aggregates (UDA) are aggregates similar to SUM(), COUNT(), MIN(), MAX(), etc. and typically require a GROUP BY clause. These can only be created in SQLCLR, and that ability was introduced in SQL Server 2005. Also, starting in SQL Server 2008, UDAs were enhanced to allow for multiple input parameters ( 😃 ). One particular deficiency is that there is no knowledge of row ordering within the group, so creating a running total, which would be relatively easy if ordering could be guaranteed, is not possible within a SAFE Assembly.
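Usage reads like any built-in aggregate; assuming a SQLCLR UDA deployed under the hypothetical name dbo.Median, a call might look like:
SELECT e.Department, dbo.Median(e.Salary) AS MedianSalary   -- dbo.Median is a hypothetical SQLCLR UDA
FROM dbo.Employees AS e                                     -- hypothetical table
GROUP BY e.Department;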
Please also see:
CREATE FUNCTION (MSDN documentation)
CREATE AGGREGATE (MSDN documentation)
CLR Table-Valued Function Example with Full Streaming (STVF / TVF) (article I wrote)
A scalar function returns a single value. It might not even be related to tables in your database.
A table-valued function returns your specified columns for rows in your table meeting your selection criteria.
An aggregate-valued function returns a calculation across the rows of a table -- for example summing values.
Scalar function
Returns a single value. It is just like writing functions in other programming languages using T-SQL syntax.
Table Valued function
Is a little different from the above: it returns a table. Inside the body of this function you write a query that returns that table.
For example:
CREATE FUNCTION <function name> (@parameter datatype)
RETURNS TABLE
AS
RETURN
(
-- *write your query here* --
);
Note that there are no BEGIN and END statements here.
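Calling such a function then works like querying a table; a minimal sketch with a hypothetical function name:
SELECT o.*
FROM dbo.fn_GetOrders(2024) AS o   -- hypothetical inline TVF taking a year parameter
WHERE o.Total > 100;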
Aggregate Functions
These are built-in functions that are used alongside the GROUP BY clause. For example: SUM(), MAX(), MIN(), AVG(), and COUNT() are aggregate functions.
Aggregate and Scalar functions both return a single value, but Scalar functions operate on a single input value argument while Aggregate functions operate on a set of input values (a collection or column name). Examples of Scalar functions are the string functions, ISNULL, and ISNUMERIC; examples of Aggregate functions are AVG, MAX, and the others you can find in the Aggregate Functions section of the Microsoft documentation.
Table-Valued functions return a table regardless of whether any input arguments exist. These functions are executed by querying them like a regular physical table, e.g.: SELECT * FROM fnGetMulEmployee()
The following link is very useful for understanding the difference: https://www.dotnettricks.com/learn/sqlserver/different-types-of-sql-server-functions
I have a simple SELECT statement with a couple of columns referenced in the WHERE clause. Normally I do these simple ones in the VB code (set up a Command object, set CommandType to Text, set CommandText to the SELECT statement). However, I'm seeing timeout problems. We've optimized just about everything we can with our tables, etc.
I'm wondering if there'd be a big performance hit just because I'm doing the query this way, versus creating a simple stored procedure with a couple of parameters. I'm thinking maybe the inline code forces SQL Server to do extra work compiling, creating a query plan, etc., which wouldn't occur if I used a stored procedure.
An example of the actual SQL being run:
SELECT TOP 1 * FROM MyTable WHERE Field1 = @Field1 ORDER BY ID DESC
A well formed "inline" or "ad-hoc" SQL query - if properly used with parameters - is just as good as a stored procedure.
But this is absolutely crucial: you must use properly parametrized queries! If you don't - if you concatenate together your SQL for each request - then you don't benefit from these points...
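To make "properly parameterized" concrete: a parameterized ADO.NET command arrives at the server as a call to sp_executesql, roughly like the sketch below (the INT type for Field1 is an assumption):
EXEC sp_executesql
    N'SELECT TOP 1 * FROM MyTable WHERE Field1 = @Field1 ORDER BY ID DESC',
    N'@Field1 INT',     -- assuming Field1 is an INT
    @Field1 = 42;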
Just like with a stored procedure, upon first executing, a query execution plan must be found - and then that execution plan is cached in the plan cache - just like with a stored procedure.
That query plan is reused over and over again, if you call your inline parametrized SQL statement multiple times - and the "inline" SQL query plan is subject to the same cache eviction policies as the execution plan of a stored procedure.
Just from that point of view - if you really use properly parametrized queries - there's no performance benefit for a stored procedure.
Stored procedures have other benefits (like being a "security boundary" etc.), but just raw performance isn't one of their major plus points.
It is true that the DB has to do the extra work you mention, but that should not result in a big performance hit (unless you are running the query very, very frequently).
Use SQL Profiler to see what is actually getting sent to the server. Use Activity Monitor to see if there are other queries blocking yours.
Your query couldn't be simpler. Is Field1 indexed? As others have said, there is no performance hit associated with "ad-hoc" queries.
As for where to put your queries, this is one of the oldest debates in tech. I would argue that your queries "belong" to your application: they will be versioned with your app, tested with your app, and should disappear when your app disappears. Putting them anywhere other than in your app is walking into a world of pain. But for goodness' sake, use .sql files, compiled as embedded resources.
Inline query:
A SELECT statement that is part of the FROM clause of another statement is called an inline query.
Cannot take parameters.
Not a database object.
Procedure:
Can take parameters.
Is a database object.
Can be reused globally if the same action needs to be performed.
In SQL Server, what is the best way to allow for multiple execution plans to exist for a query in a SP without having to recompile every time?
For example, I have a case where the query plan varies significantly depending on how many rows are in a temp table that the query uses. Since there was no "one size fits all" plan that was satisfactory, and since it was unacceptable to recompile every time, I ended up copy/pasting (ick) the main query in the SP multiple times within several IF statements, forcing the SQL engine to give each case its own optimal plan. It actually seemed to work beautifully performance-wise, but it feels a bit clunky. (I know I could similarly break this part out into multiple SPs to do the same thing.) Is there a better way to do this?
IF @RowCount < 1
[paste query here]
ELSE IF @RowCount < 50
[paste query here]
ELSE IF @RowCount < 200
[paste query here]
ELSE
[paste query here]
You can use OPTIMIZE FOR in certain situations, to create a plan targeted at a certain value of a parameter (but not multiple plans per se). This allows you to specify what parameter value you want SQL Server to use when creating the execution plan. This is a SQL Server 2005 onwards hint.
Optimize Parameter Driven Queries with the OPTIMIZE FOR Hint in SQL Server
There is also OPTIMIZE FOR UNKNOWN – a SQL Server 2008 onwards feature (use judiciously):
This hint directs the query optimizer to use the standard algorithms it has always used if no parameter values had been passed to the query at all. In this case the optimizer will look at all available statistical data to reach a determination of what the values of the local variables used to generate the query plan should be, instead of looking at the specific parameter values that were passed to the query by the application.
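Syntactically, both hints attach to a statement via an OPTION clause; a minimal sketch with hypothetical table, column, and variable names:
DECLARE @groupId INT;
SET @groupId = 42;

SELECT t.ID, t.Amount
FROM dbo.MyTable AS t                          -- hypothetical table
WHERE t.GroupID = @groupId
OPTION (OPTIMIZE FOR (@groupId = 100));        -- build the plan as if @groupId were 100

-- Or, SQL Server 2008+, ignore the sniffed value and use average statistics instead:
-- OPTION (OPTIMIZE FOR UNKNOWN);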
Perhaps also look into the optimize for ad hoc workloads option.
SQL Server 2005+ has statement-level recompilation and is better at dealing with this kind of branching. You still have one plan, but the plan can be partially recompiled at the statement level.
But it is ugly.
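The explicit statement-level form is an OPTION (RECOMPILE) hint on just the sensitive statement, so only that statement is recompiled to fit the temp table's current row count; a minimal sketch with hypothetical names:
SELECT t.ID, t.Amount
FROM #WorkTable AS t                  -- hypothetical temp table
JOIN dbo.Details AS d ON d.ID = t.ID  -- hypothetical permanent table
OPTION (RECOMPILE);                   -- recompile this statement only, on each execution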
I'd go with @Mitch Wheat's option personally, because you have recompilations anyway with the stored procedure using a temp table. See Temp table and stored proc compilation
I'm trying to use SQL Server Profiler (2005) to track down some application performance problems. One of the calls being made is to a table-valued user-defined function. This function wraps a select that joins several tables together.
In SQL Server Profiler, the call to the UDF is logged. However, the select that underlies the UDF isn't being logged at all. Because of this, I'm not getting useful data on which tables & indexes are being hit. I'd like to feed this info into the Database Tuning Advisor for some indexing advice.
Is there any way (short of unwrapping the queries themselves) to log the tables called by UDFs in Profiler?
You can't: a multi-statement TVF is a black box, and you can only get CPU, Reads, Writes, etc.
by "black box" I mean it's a fully encapsulated and opaque series of statements inside another query, and there is no "flow" like you'd get line by line through a stored proc.
An in-line TVF is expanded like a view or macro into the main query and can be seen.
Edit: related: Table Valued Function where did my query plan go?
Is there an inherent cost to using inline-table-valued functions in SQL Server 2008 that is not incurred if the SQL is inlined directly? Our application makes very heavy use of inline-table-valued functions to reuse common queries, but recently, we've found that queries run much faster if we don't use them.
Consider this:
CREATE FUNCTION dbo.fn_InnerQuery (@asOfDate DATETIME)
RETURNS TABLE
AS
RETURN
(
SELECT ... -- common, complicated query here
)
Now, when I do this:
SELECT TOP 10 Amount FROM dbo.fn_InnerQuery(dbo.Date(2009,1,1)) ORDER BY Amount DESC
The query returns with results in about 15 seconds.
However, when I do this:
SELECT TOP 10 Amount FROM
(
SELECT ... -- inline the common, complicated query here
) inline
ORDER BY Amount DESC
The query returns in less than 1 second.
I'm a little baffled by the overhead of using the table valued function in this case. I did not expect that. We have a ton of table valued functions in our application, so I'm wondering if there is something I'm missing here.
In this case, the UDF should be unnested/expanded like a view and it should be transparent.
Obviously, it's not...
In this case, my guess is that the column is smalldatetime and is cast to datetime because of the UDF parameter, while the constant is correctly evaluated (to match the column datatype) when inlined.
datetime has a higher precedence than smalldatetime, so the column would be cast.
What do the query plans say? The UDF would show a scan, the inline a seek most likely (not 100%, just based on what I've seen before)
Edit: Blog post by Adam Machanic
One thing that can slow functions down is omitting dbo. from table references inside the function. That causes SQL Server to do a security check for every call, which can be expensive.
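In other words, inside the function body (hypothetical table name):
SELECT SUM(o.Amount) FROM Orders AS o;        -- unqualified: re-checked on every call
SELECT SUM(o.Amount) FROM dbo.Orders AS o;    -- schema-qualified: avoids the per-call check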
Try running the table-valued function independently to see how fast or slow it executes.
Also, I am not sure how to clear the execution cache that SQL Server might retain from the execution of the UDF. I mean, if you run the UDF first, SQL Server could have the actual query with it and could cache the plan/result; so if you then run the complicated query separately, it could be running from cache.
In your second example (the call through the TVF), the table-valued function has to return the entire data set before the query can apply the filter. Hopping across the TVF boundary is not something that the optimiser can always do.
In the third example (the inlined query), the query optimiser can work out that the user only wants the top few 'amounts'. If this isn't an aggregate value, the optimiser can push that processing right to the start of the query and not bother with any other data. If it is an aggregate amount then the slowdown is for a different reason.
If you compare the query plans of the two queries you should see that they are different.