SQL Server Lookup Functions - sql-server

Is it possible to build lookup type functions in SQL Server or are these always inferior (performance) to just writing subqueries/joins?
I would like to take some code like this
SELECT
ContactId,
ProductType,
SUM(OrderAmount) TotalOrders
FROM
(
SELECT
ContactId,
ProductType,
OrderAmount
FROM
UserOrders ord
JOIN
(
SELECT
ProductCode,
CASE
--Complex business logic
END ProductType
FROM
ItemTable
) item
ON
item.ProductCode=ord.ProductCode
) a
GROUP BY
ContactId,
ProductType
And instead be able to write a query like this
SELECT
ContactId,
UDF_GET_PRODUCT(ProductCode) ProductType,
SUM(OrderAmount) TotalOrders
FROM
UserOrders
GROUP BY
ContactId,
UDF_GET_PRODUCT(ProductCode)

It is possible, but not quite in the format you have described. Whether it is advisable or not really depends.
I agree with the other answer in that scalar functions are performance killers, and I personally do not use them at all.
That being said, I don't think that is a reason to ignore the DRY principle where feasible. I would not take a shortcut
if it had an impact on performance, but I also don't like the idea of having complex logic repeated in multiple places.
When anything changes you then have multiple queries to change, and inevitably some get missed, so if you will be re-using this
logic then it is a good idea to encapsulate it in a single place.
Based on your example perhaps a view would be most appropriate:
CREATE VIEW dbo.ItemTableWithLogic
AS
SELECT ProductCode,
ProductType = <your logic>
FROM ItemTable;
Then you can simply use:
SELECT ord.ContactId, item.ProductType, SUM(ord.OrderAmount) AS TotalOrders
FROM UserOrders AS ord
INNER JOIN dbo.ItemTableWithLogic AS item
ON item.ProductCode=ord.ProductCode
GROUP BY ord.ContactId, item.ProductType;
Which simplifies things somewhat.
Another alternative is an inline table valued function, something like:
CREATE FUNCTION dbo.GetProductType (@ProductCode INT)
RETURNS TABLE
AS
RETURN
( SELECT ProductType = <your logic>
FROM ItemTable
WHERE ProductCode = @ProductCode
);
Which can be called using:
SELECT ord.ContactId, item.ProductType, SUM(ord.OrderAmount) AS TotalOrders
FROM UserOrders AS ord
CROSS APPLY dbo.GetProductType(ord.ProductCode) AS item
GROUP BY ord.ContactId, item.ProductType;
My preference is for views over table-valued functions. However, which one I would recommend really depends on your usage, so I don't want to pick a side and will stay on the fence.
In summary, if you only need the logic in one place and won't reuse it in many queries, just stick to a subquery. If you need to reuse the same logic multiple times, don't use a scalar-valued function the way you might in a procedural language, but also don't let this rule out other ways of keeping your logic in a single place.

Stick to sub-queries and joins.
They use a set-based approach: the inner query is executed once, the aggregate is applied to the result set it returns, and the final result set is returned.
On the other hand, if you use a scalar function like the one in your second query, all the code inside the function (the sub-query in your original question) will be executed for each row returned.
Scalar functions are performance killers and you should avoid them whenever possible. It is the .NET mentality that if you have to write a piece of code again and again, you put it inside a method and call the method; that does not hold for SQL Server.


BigQuery GENERATE_UUID() and CTE's

This behavior surprised me a little bit.
When you generate a uuid in a CTE (to make a row id, etc) and reference it in the future you'll find that it changes. It seems that generate_uuid() is being called twice instead of once. Anyone know why this is the case w/ BigQuery and what this is called?
I was using generate_uuid() to create a row_id and was finding that in my eventual joins that no matches were occurring because of this. Best way to get around it I've found is by just creating a table from the first CTE which cements the uuid in place for future use.
Still curious to know more about the why and what behind this.
with _first as (
select generate_uuid() as row_id
)
,_second as (
select * from _first
)
select row_id from _first
union all
select row_id from _second
curious to know more about the why and what behind this
This is by design:
WITH clauses are not materialized. Placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause.
If a query appears in more than one WITH clause, it executes in each clause.
You can see in documentation - Do not treat WITH clauses as prepared statements
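The workaround mentioned in the question (creating a table from the first CTE) might be sketched as below. This assumes BigQuery multi-statement queries are available; the temp table name first_rows is made up for illustration.

```sql
-- Hypothetical name: first_rows. Assumes a BigQuery multi-statement query.
-- GENERATE_UUID() is evaluated once here, when the temp table is written.
CREATE TEMP TABLE first_rows AS
SELECT GENERATE_UUID() AS row_id;

-- Both branches now read the stored value instead of re-running the expression.
WITH _second AS (
  SELECT * FROM first_rows
)
SELECT row_id FROM first_rows
UNION ALL
SELECT row_id FROM _second;
```

Because the UUID is stored rather than recomputed, later joins on row_id should see the same value on both sides.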

Return table from a user defined function, which is best? [duplicate]

A few examples to show, just in case:
Inline Table Valued
CREATE FUNCTION MyNS.GetUnshippedOrders()
RETURNS TABLE
AS
RETURN SELECT a.SaleId, a.CustomerID, b.Qty
FROM Sales.Sales a INNER JOIN Sales.SaleDetail b
ON a.SaleId = b.SaleId
INNER JOIN Production.Product c ON b.ProductID = c.ProductID
WHERE a.ShipDate IS NULL
GO
Multi Statement Table Valued
CREATE FUNCTION MyNS.GetLastShipped(@CustomerID INT)
RETURNS @CustomerOrder TABLE
(SaleOrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NOT NULL,
OrderQty INT NOT NULL)
AS
BEGIN
DECLARE @MaxDate DATETIME
SELECT @MaxDate = MAX(OrderDate)
FROM Sales.SalesOrderHeader
WHERE CustomerID = @CustomerID
INSERT @CustomerOrder
SELECT a.SalesOrderID, a.CustomerID, a.OrderDate, b.OrderQty
FROM Sales.SalesOrderHeader a INNER JOIN Sales.SalesOrderDetail b
ON a.SalesOrderID = b.SalesOrderID
INNER JOIN Production.Product c ON b.ProductID = c.ProductID
WHERE a.OrderDate = @MaxDate
AND a.CustomerID = @CustomerID
RETURN
END
GO
Is there an advantage to using one type (inline or multi-statement) over the other? Are there certain scenarios when one is better than the other, or are the differences purely syntactical? I realise the two example queries are doing different things, but is there a reason I would write them in that way?
Reading about them and the advantages/differences haven't really been explained.
In researching Matt's comment, I have revised my original statement. He is correct, there will be a difference in performance between an inline table valued function (ITVF) and a multi-statement table valued function (MSTVF) even if they both simply execute a SELECT statement. SQL Server will treat an ITVF somewhat like a VIEW in that it will calculate an execution plan using the latest statistics on the tables in question. A MSTVF is equivalent to stuffing the entire contents of your SELECT statement into a table variable and then joining to that. Thus, the compiler cannot use any table statistics on the tables in the MSTVF. So, all things being equal, (which they rarely are), the ITVF will perform better than the MSTVF. In my tests, the performance difference in completion time was negligible however from a statistics standpoint, it was noticeable.
In your case, the two functions are not functionally equivalent. The MSTV function does an extra query each time it is called and, most importantly, filters on the customer id. In a large query, the optimizer would not be able to take advantage of other types of joins as it would need to call the function for each customerId passed. However, if you re-wrote your MSTV function like so:
CREATE FUNCTION MyNS.GetLastShipped()
RETURNS @CustomerOrder TABLE
(
SaleOrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NOT NULL,
OrderQty INT NOT NULL
)
AS
BEGIN
INSERT @CustomerOrder
SELECT a.SalesOrderID, a.CustomerID, a.OrderDate, b.OrderQty
FROM Sales.SalesOrderHeader a
INNER JOIN Sales.SalesOrderDetail b
ON a.SalesOrderID = b.SalesOrderID
INNER JOIN Production.Product c
ON b.ProductID = c.ProductID
WHERE a.OrderDate = (
Select Max(SH1.OrderDate)
FROM Sales.SalesOrderHeader As SH1
WHERE SH1.CustomerID = A.CustomerId
)
RETURN
END
GO
In a query, the optimizer would be able to call that function once and build a better execution plan, but it still would not be better than an equivalent, non-parameterized ITVF or a VIEW.
ITVFs should be preferred over MSTVFs when feasible because the datatypes, nullability and collation are derived from the columns in the underlying tables, whereas you have to declare those properties yourself in a multi-statement table-valued function and, importantly, you will get better execution plans from the ITVF. In my experience, I have not found many circumstances where an ITVF was a better option than a VIEW, but mileage may vary.
Thanks to Matt.
Addition
Since I saw this come up recently, here is an excellent analysis done by Wayne Sheffield comparing the performance difference between Inline Table Valued functions and Multi-Statement functions.
His original blog post.
Copy on SQL Server Central
Internally, SQL Server treats an inline table valued function much like it would a view and treats a multi-statement table valued function similar to how it would a stored procedure.
When an inline table-valued function is used as part of an outer query, the query processor expands the UDF definition and generates an execution plan that accesses the underlying objects, using the indexes on these objects.
For a multi-statement table valued function, an execution plan is created for the function itself and stored in the execution plan cache (once the function has been executed the first time). If multi-statement table valued functions are used as part of larger queries then the optimiser does not know what the function returns, and so makes some standard assumptions - in effect it assumes that the function will return a single row, and that the returns of the function will be accessed by using a table scan against a table with a single row.
Where multi-statement table valued functions can perform poorly is when they return a large number of rows and are joined against in outer queries. The performance issues are primarily down to the fact that the optimiser will produce a plan assuming that a single row is returned, which will not necessarily be the most appropriate plan.
As a general rule of thumb we have found that where possible inline table valued functions should be used in preference to multi-statement ones (when the UDF will be used as part of an outer query) due to these potential performance issues.
There is another difference. An inline table-valued function can be inserted into, updated, and deleted from - just like a view. Similar restrictions apply - can't update functions using aggregates, can't update calculated columns, and so on.
Your examples, I think, answer the question very well. The first function can be done as a single select, and is a good reason to use the inline style. The second could probably be done as a single statement (using a sub-query to get the max date), but some coders may find it easier to read or more natural to do it in multiple statements as you have done. Some functions just plain can't get done in one statement, and so require the multi-statement version.
I suggest using the simplest (inline) whenever possible, and using multi-statement functions when necessary (obviously) or when personal preference/readability makes it worth the extra typing.
Another case for a multi-statement function would be to prevent SQL Server from pushing down the WHERE clause.
For example, I have a table of table names, and some of the names are formatted like C05_2019 and C12_2018; all tables formatted that way have the same schema. I wanted to merge all that data into one table and parse out 05 and 12 into a CompNo column and 2018 and 2019 into a year column. However, there are other tables like ACA_StupidTable from which I cannot extract CompNo and CompYr, and I would get a conversion error if I tried. So my query was in two parts: an inner query that returned only tables formatted like 'C_______', and an outer query that did a substring and int conversion, i.e. Cast(Substring(2, 2) as int) as CompNo. All looks good, except that SQL Server decided to apply my Cast before the results were filtered, and so I got a mind-scrambling conversion error. A multi-statement table function may prevent that from happening, since it is basically a "new" table.
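The single-statement rewrite suggested above (a sub-query to get the max date) could look roughly like the following. It is only a sketch: the name GetLastShippedInline is made up, and it assumes the AdventureWorks-style layout in which OrderQty lives in Sales.SalesOrderDetail.

```sql
-- Hypothetical inline version of GetLastShipped; the function name is an assumption.
CREATE FUNCTION MyNS.GetLastShippedInline (@CustomerID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT h.SalesOrderID, h.CustomerID, h.OrderDate, d.OrderQty
    FROM Sales.SalesOrderHeader AS h
    INNER JOIN Sales.SalesOrderDetail AS d   -- assumed detail table holding OrderQty
        ON d.SalesOrderID = h.SalesOrderID
    WHERE h.CustomerID = @CustomerID
      -- the sub-query replaces the separate @MaxDate lookup
      AND h.OrderDate = (SELECT MAX(h2.OrderDate)
                         FROM Sales.SalesOrderHeader AS h2
                         WHERE h2.CustomerID = @CustomerID)
);
GO
```

Since it is a single SELECT, the optimizer can expand it into the calling query the way it would a view.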
look at Comparing Inline and Multi-Statement Table-Valued Functions you can find good descriptions and performance benchmarks
I have not tested this, but a multi statement function caches the result set. There may be cases where there is too much going on for the optimizer to inline the function. For example suppose you have a function that returns a result from different databases depending on what you pass as a "Company Number". Normally, you could create a view with a union all then filter by company number but I found that sometimes sql server pulls back the entire union and is not smart enough to call the one select. A table function can have logic to choose the source.
Maybe in a very condensed way.
ITVF (inline TVF): more for a database person; it is a kind of parameterized view and takes a single SELECT statement.
MSTVF (multi-statement TVF): more for a developer; it creates and loads a table variable.
If you are going to write a query, you can join to your inline table-valued function like this:
SELECT
a.*,b.*
FROM AAAA a
INNER JOIN MyNS.GetUnshippedOrders() b ON a.z=b.z
It will incur little overhead and run fine.
If you try to use the multi-statement table-valued function in a similar query, you will have performance issues:
SELECT
x.a,x.b,x.c,(SELECT OrderQty FROM MyNS.GetLastShipped(x.CustomerID)) AS Qty
FROM xxxx x
because you will execute the function once for each row returned; as the result set gets larger, it will run slower and slower.

Can I sort data for an aggregate function?

I have a custom CLR aggregate function. This function concats strings within a group. Now the question is, can I make this function process the data in some specific order or will it always be some random order the DB found suitable? I understand that for most mathematical aggregate functions (MIN, MAX, AVG etc.) it makes no difference in which order the data is processed, but let's say I want to concat strings alphabetically within a group is there something I can do to achieve this result?
Note that it has to be an aggregate function (don't get misled by the examples below) and that altering the existing CLR function is out of the question (all it does is a basic string concat and nothing more).
I tested adding ORDER BY to the SELECT that contains the GROUP BY, but it produced no meaningful results.
SELECT
user.Id, dbo.concat(cat.Name)
FROM
Users user
JOIN Cats cat ON (cat.Owner = user.Id)
GROUP BY user.Id
ORDER BY user.Id, MAX(cat.Name) --kind of meaningless really
I also tried to ORDER BY the table that contains the data which I want to concat before doing a JOIN, but the result was the same.
SELECT
user.Id, dbo.concat(cat.Name)
FROM
Users user
JOIN (SELECT TOP 100 PERCENT /*hack*/ c.* FROM Cats c ORDER BY c.Name) cat ON (cat.Owner = user.Id)
GROUP BY user.Id
Ordering data in a subquery and then doing a GROUP BY didn't work either.
SELECT
t1.Id, dbo.concat(t1.Name)
FROM
(
SELECT TOP 100 PERCENT /*hack*/
user.Id, cat.Name
FROM
Users user
JOIN Cats cat ON (cat.Owner = user.Id)
ORDER BY user.Id, cat.Name
) t1
GROUP BY t1.Id
I was kind of expecting that neither of those will work, but at least now no one can say I haven't tried anything.
P.S. Yes, I have reasons not to use FOR XML PATH. If what I'm asking here cannot be done, I'll live with it.
Based on information from Damien_The_Unbeliever, Vladimir Baranov, Microsoft pages and from few other users (see comments to the question), I can deduce that:
Ordering rows for an aggregate function cannot be done directly in the database. However, there are hints that this is, or might have been, a planned feature (see here and here). If MS ever implements this, some existing CLR aggregate functions might start acting weird (as by default those aggregate functions are flagged to be dependent on order).
Ordering has to be implemented directly in the CLR function. It can be a little tricky due to how CLR aggregate functions are run, but it can be done.
Unfortunately I don't have a piece of code to present here, as I didn't have time to alter my CLR function (and doing an unordered concat was good enough in my case).
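For readers who are not tied to a CLR aggregate: SQL Server 2017 and later include a built-in STRING_AGG aggregate that accepts WITHIN GROUP (ORDER BY ...). A minimal sketch against the Users and Cats tables from the question follows; the aliases u and c, the ', ' separator and the CatNames column name are my own choices. It does not change the order in which a custom CLR aggregate such as dbo.concat receives its rows.

```sql
-- Assumes SQL Server 2017+; aliases, separator and output column name are illustrative.
SELECT
    u.Id,
    STRING_AGG(c.Name, ', ') WITHIN GROUP (ORDER BY c.Name) AS CatNames
FROM Users AS u
JOIN Cats AS c ON c.Owner = u.Id
GROUP BY u.Id;
```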
You can include a function in the order by clause.
try it with this dummy date returner:
create function testDate ()
returns datetime
as
begin
declare @returnDate datetime
select @returnDate = CURRENT_TIMESTAMP
return @returnDate
end
run the function with any table (replace SomeTable with a real table) and order by it:
select dbo.testDate (),
*
from SomeTable
order by dbo.testDate () desc
@jahu
EDIT: I thought you wanted to order by a user defined function. Perhaps I am misunderstanding the question. You can order a query by an aggregate function like this:
select CustomerID,
avg(OrderID)
from Orders
group by CustomerID
order by avg(OrderID) desc
The table above has OrderID as a unique column and there can be multiple CustomerID records

Is it okay to use table look up functionalities in scalar function?

In our case we have some business logic that looks into several tables in a certain order, so that the first non-null value from one table is used. The lookup is not hard, but it does take several lines of SQL code to accomplish. I have read about scalar-valued functions in SQL Server, but don't know if the recompilation issue affects me enough to do it in a less convenient way.
So what's the general rule of thumb?
Would you rather have something like
select id, dbo.udfGetFirstNonNull(id) from mytable
Or are table-valued functions any better than scalar?
select id,
(select firstNonNull from udfGetFirstNonNull(id)) as firstNonNull
from myTable
The scalar UDF will do its lookup once for each row in myTable, so the run time keeps growing as the data increases. Effectively you have a CURSOR. If you have a few rows, it won't matter of course.
I do the same myself where I don't expect a lot of rows (no more than a few hundred).
However, I would consider a table value function where I've placed "foo" here. "foo" could also be a CTE in a UDF too (not tested):
select id,
(select firstNonNull from udfGetFirstNonNull(id)) as firstNonNull
from
myTable M
JOIN
(SELECT value, id as firstNonNull
FROM OtherTable
WHERE value IS NOT NULL
GROUP BY id
ORDER BY value) foo ON M.id = foo.id
Your first query is fine. One place I work for is absolutely obsessed with speed and optimization, and they use UDF's heavily in this way.
I think for readability and maintainability, I would prefer to use the scalar function, as that is what it is returning.
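For comparison, an inline table-valued version of the lookup might look like the sketch below. Everything named here is an assumption for illustration: the function name itvfGetFirstNonNull, the source tables dbo.SourceA, dbo.SourceB and dbo.SourceC, and the premise that each of them has at most one row per id.

```sql
-- Hypothetical inline TVF; table and function names are illustrative only.
-- Assumes each source table holds at most one row per id.
CREATE FUNCTION dbo.itvfGetFirstNonNull (@id INT)
RETURNS TABLE
AS
RETURN
(
    -- COALESCE returns the first non-null value, checked in the listed order
    SELECT firstNonNull = COALESCE(
        (SELECT a.value FROM dbo.SourceA AS a WHERE a.id = @id),
        (SELECT b.value FROM dbo.SourceB AS b WHERE b.id = @id),
        (SELECT c.value FROM dbo.SourceC AS c WHERE c.id = @id))
);
GO

SELECT m.id, f.firstNonNull
FROM myTable AS m
CROSS APPLY dbo.itvfGetFirstNonNull(m.id) AS f;
```

Because the function is a single SELECT, the optimizer can fold it into the outer query instead of calling it row by row.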

How to simplify this Sql query

The Table - Query has 2 columns (functionId, depFunctionId)
I want all values that are either in functionid or in depfunctionid
I am using this:
select distinct depfunctionid from Query
union
select distinct functionid from Query
How to do it better?
I think that's the best you'll get.
That's as good as it gets, I think...
Lose the DISTINCT clauses, as your UNION (vs UNION ALL) will take care of removing duplicates.
An alternative - but perhaps less clear and probably with the same execution plan - would be to do a FULL JOIN across the 2 columns.
SELECT
COALESCE(Query1.FunctionId, Query2.DepFunctionId) as FunctionId
FROM Query as Query1
FULL OUTER JOIN Query as Query2 ON
Query1.FunctionId = Query2.DepFunctionId
I am almost sure you can lose the DISTINCTs.
When you use UNION instead of UNION ALL, duplicated results are thrown away.
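With the DISTINCT keywords dropped, the original query reduces to the following; the AS functionid alias is only added here to give the single output column a name.

```sql
-- UNION (unlike UNION ALL) already removes duplicate rows from the combined result.
SELECT depfunctionid AS functionid FROM Query
UNION
SELECT functionid FROM Query;
```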
It all depends on how heavy your inline view query is. The key to better performance would be to execute it only once, but that is not possible given the data that it returns.
If you do it like this :
select depfunctionid , functionid from Query
group by depfunctionid , functionid
It is very likely that you'll get repeated results for depfunctionid or functionid.
I may be wrong, but it seems to me that you're trying to retrieve a tree of dependencies. If that's the case, I personally would try to use a materialized path approach.
If the materialized path is stored in a self-referencing table, I would retrieve the tree using something like
select asrt2.function_id
from a_self_referencig_table asrt1,
a_self_referencig_table asrt2
where asrt1.function_name = 'blah function'
and asrt2.materialized_path like (asrt1.materialized_path || '%')
order by asrt2.materialized_path, asrt2.some_child_node_ordering_column
This would retrieve the whole tree in the proper order. What sucks is having to construct the materialized path based on the function_id and parent_function_id (or in your case, functionid and depfunctionid), but a trigger could take care of it quite easily.
