The question is: Can inline table-valued functions (ITVFs) be used to encapsulate and reuse code? Or will this result in performance issues?
I was researching inline table valued functions, which led me to this discussion:
When would you use a table-valued function?
An answer in the discussion states that an inline table valued function "allows the optimizer to treat these functions no differently than the objects they encapsulate giving you optimum performance (assuming that your indexes and statistics are ideal)."
My original problem was that I was trying to reformat different data sources into a standard format and then union them. I tested unioning 6 different ITVFs versus performing the unions and transformations all in one query. The execution plans were identical.
Since my background is in OOP, I would prefer to split queries up into smaller functions, but before I commit to doing this throughout future projects, I was wondering if using too many ITVFs will eventually cause performance issues.
Can inline table-valued functions (ITVFs) be used to encapsulate and reuse code?
Yes. And they are superior in this respect to multi-statement TVFs, because with multi-statement TVFs the encapsulation prevents the query optimizer from pushing predicates into the TVF logic, and prevents it from accurately estimating the number of rows returned.
Or will this result in performance issues?
Short answer, not typically.
Longer answer:
There are five ways to encapsulate and reuse query logic (whole queries, not just scalar expressions).
Views
Inline Table Valued Functions
Multi-Statement Table Valued Functions
Temporary Tables
Table Variables
Views and Inline TVFs don't inherently degrade performance, but they add to the complexity of query optimization.
Where the optimizer fails to consistently find low-cost plans, you may need to intervene. A common way to do that is by forcing spooling (i.e. materializing) of intermediate results, for instance by replacing an inline TVF with a multi-statement TVF, or by spooling results to a temp table ahead of time.
Spooling reduces the complexity of the encapsulating query at the cost of possible optimization of the encapsulated query when run in the context of the larger query.
When spooling results, temp tables are typically the best choice, as they can have indexes and statistics that enable SQL Server to accurately assess the cost of the plans that will consume the intermediate results.
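For illustration, a minimal sketch of spooling to a temp table ahead of time (all object names here are hypothetical):

DECLARE @SomeParam int = 42;

-- Materialize the intermediate result once.
SELECT s.CustomerId, s.Total
INTO #Spool
FROM dbo.SomeInlineTvf(@SomeParam) AS s;   -- hypothetical inline TVF

-- Unlike table variables, temp tables support indexes and statistics.
CREATE INDEX IX_Spool_CustomerId ON #Spool (CustomerId);

SELECT b.OrderId, s.Total
FROM dbo.BigTable AS b                     -- hypothetical consuming table
JOIN #Spool AS s ON s.CustomerId = b.CustomerId;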
ITVFs are perfect for encapsulating query logic for re-use. I have a dozen or so financial reports that all query the same set of tables for roughly the same information, and by creating a function to provide that data I can be sure that all of my reports are pulling from the same body of data with the same filters and transformations, etc.
That being said, you could just as easily create a view instead of an ITVF, but ITVFs also provide a way to filter or otherwise transform data based on the parameters sent in. For example, my financial functions could accept a district name as an optional input parameter and only return data for that district. By using ITVFs this way, the optimizer can, over time, optimize the query plan based on the parameters sent in, which helps rather than hinders performance.
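A hedged sketch of that pattern (the function, table, and column names are illustrative, not the poster's actual objects):

CREATE FUNCTION dbo.uf_FinancialData (@DistrictName nvarchar(50) = NULL)
RETURNS TABLE
AS
RETURN
    SELECT f.DistrictName, f.AccountCode, f.Amount
    FROM dbo.FinancialFacts AS f        -- hypothetical source table
    WHERE @DistrictName IS NULL         -- NULL means "all districts"
       OR f.DistrictName = @DistrictName;
GO

-- Note: T-SQL requires callers to pass DEFAULT explicitly to use a parameter default.
SELECT * FROM dbo.uf_FinancialData(N'North District');
SELECT * FROM dbo.uf_FinancialData(DEFAULT);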
I'd recommend that, instead of a union of six different ITVFs, you just pull all of your tables together into a single ITVF: that way you only have one place to make updates if your table schemas or reporting demands change.
Related
I am currently in the process of a database migration from MS Access to SQL Server. To improve the performance of a specific query, I am translating it from Access to T-SQL and executing it server-side. The query in question is essentially made up of almost 15 subqueries branching off in different directions with varying levels of complexity. The top-level query is a culmination (final SELECT) of all of these queries.
Without actually going into the specifics of the fields and relationships in my queries, I want to ask a question on a generic example.
Take the following:
                        Top Level Query
                               |
               ________________|________________
              |                                 |
           Query 1   <------ Views? ------>  Query 2
         _____|_____                       _____|_____
        |           |                     |           |
    Query 1.1   Query 1.2             Query 2.1   Query 2.2
     ____|____                         ____|____
    |         |                       |         |
Query 1.1.1  Query 1.1.2        Query 2.1.1  Query 2.1.2
     |            |                  |            |
    ...          ...                ...          ...
I am attempting to convert the above MS Access query structure to T-SQL, whilst maximising performance. So far I have converted all of Query 1 into a single query, starting from the bottom and working my way up. I have achieved this by using CTEs to represent every single subquery and then finally selecting from this entire CTE tree to produce Query 1. Due to the original design of the query, there is a high level of dependency between the subqueries.
Now my question is quite simple actually. With regards to Query 2, should I continue to use this same method within the same query window, or should I make both Query 1 and Query 2 separate entities (views) and then do a select from each? Or should I just continue adding more CTEs and then get the final Top Level Query result from this one super query?
This is an extremely bastardised version of the actual query I am working with, which has a large number of calculated fields and more subquery levels.
What do you think is the best approach here?
There is no way to say for sure from a diagram like this, but I suspect that you want to use Views for a number of reasons.
1) If the sub-query/view is used in more than one place, there is a high chance that caching will allow results to be shared across those places. The effect is not as strong as with a CTE, but it can be mitigated with a materialized (indexed) view.
2) It is easy to turn a view into a materialized (indexed) view; see the sketch at the end of this answer. Then you get a huge bonus if it is used multiple times, or is used many times before it needs to be refreshed.
3) If you find a slow part, it will be isolated to one view -- then you can optimize and change that small section more easily.
I would recommend using views for EVERY sub-query if you can, unless you can demonstrate (via execution plan or testing) that the CTE runs faster.
Final note, as someone who has migrated Access to SQL Server in the past: Access encourages more sub-queries than are needed with modern SQL and window functions. It is very likely that with some analysis these Access queries can be made much simpler. Try to find cases where you can roll them up into the parent query.
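Regarding point 2, a minimal sketch of materializing (indexing) a view in SQL Server. The object names are hypothetical, and Amount is assumed NOT NULL, since an indexed view cannot SUM a nullable expression:

CREATE VIEW dbo.vSalesByCustomer
WITH SCHEMABINDING                      -- required for an indexed view
AS
SELECT CustomerId,
       SUM(Amount)  AS TotalAmount,
       COUNT_BIG(*) AS RowCnt           -- indexed views require COUNT_BIG(*), not COUNT(*)
FROM dbo.Sales
GROUP BY CustomerId;
GO

-- The unique clustered index is what actually materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByCustomer
    ON dbo.vSalesByCustomer (CustomerId);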
A query you submit and a "view" are, to the database engine, much the same thing.
Your BASIC simple question stands!
Should you use a query on a query, or use a CTE?
Well, first, let's clear up some confusion here.
A CTE is great for eliminating the need to build a query, save the query (say as a view), AND THEN query against it.
However, in your question we are mixing up TWO VERY different issues.
Are you doing a query against a query, or, in YOUR case, using a sub-query? While these two things SEEM similar, they really are not!
In the case of a sub-query, using a CTE will in fact save you the pain of having to build and save a separate query/view. In this case you are, I suppose, doing a query on a query, but it is REALLY a sub-query. From a performance point of view, I don't believe you will find any difference, so do what works best for you. I do in some ways like adopting CTEs, since THEN you have the "whole thing" in one spot, and updates to the whole mess occur in one place. This can especially be an advantage if you have several sites: to update things, you ONLY have to update this one big saved "thing". I find this a significant advantage.
The advantage of breaking out into separate views (as opposed to using CTEs) can often come down to the simple question: how do you eat an elephant?
Answer: One bite at a time.
However, I in fact consider the concept and approach of a sub-query a DIFFERENT issue than building a query on a query. One of the really big reasons to use CTEs in SQL Server is that SQL Server has one REALLY big limitation compared to Access SQL: the ability to re-use derived columns.
e.g. this:
SELECT ID, Company, State, TaxRate, Purchased, Payments,
       (Purchased - Payments) AS Balance,
       (Balance * TaxRate) AS BalanceWithTax
FROM Customers
Of course, in T-SQL you can't re-use an expression like you can in Access SQL. So the above is a GREAT use case for CTEs: Balance cannot be re-used in T-SQL, so you end up constantly repeating expressions (my big pet peeve with T-SQL). Using a CTE means we CAN get that ability to re-use an expression.
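A minimal sketch of regaining that re-use with a CTE, against the same hypothetical Customers table:

WITH Calc AS (
    SELECT ID, Company, State, TaxRate, Purchased, Payments,
           (Purchased - Payments) AS Balance   -- computed once, named here
    FROM Customers
)
SELECT ID, Company, State, TaxRate, Purchased, Payments,
       Balance,
       (Balance * TaxRate) AS BalanceWithTax   -- Balance is now re-usable
FROM Calc;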
So I tend to think of the CTE as solving two issues, and you should keep these concepts separate:
I want to eliminate the need for a query on a query, and that includes sub-queries.
So, sure, using a CTE for the above is a good use.
The SECOND use case is the ability to re-use expression columns. This is VERY painful in T-SQL, and CTEs go a long way to reducing this pain point (Access SQL is still far better), but CTEs are at least very helpful.
So, from a performance point of view, using CTEs to eliminate a sub-query should not affect performance, and as noted you save having to create 2-5 separate queries to make this work.
Then there is the query on a query (especially the above use case of being able to re-use columns in expressions). In this case, I believe some performance advantages exist, but again, likely not enough to justify one approach or the other. So once again, adopting CTEs should come down to which road is LESS work for you! (But for a very nasty sub-query that sums() and does some real work, and THEN needs its columns re-used -- that is really when CTEs shine.)
So as a general coding approach, I use CTEs to avoid a query on a query (but NOT for a sub-query as you are doing), and I use CTEs to gain re-use of a complex expression column.
Using CTEs just to eliminate sub-queries is not really all that great a benefit (I mean, most of the time you can just shove the sub-query into the main query, and a CTE will not help you). You can do it, but I don't see great gains from a developer point of view.
For column re-use, on the other hand, you have NO CHOICE in the matter: you MUST do a query on a query (say, against a saved view) to re-use a column expression in additional expressions, and a CTE eliminates the need for that separate saved view. While this is somewhat a matter of opinion, the CTE's ability to allow re-use of columns is its main use case here -- not so much that you eliminated a separate view, but that you gained a re-usable column expression.
As far as I can tell, so far you don't really need a CTE unless the issue is being able to re-use some columns in other expressions like we could/can in Access SQL. If column re-use is the goal, then yes, CTEs are a great solution; it is more a column-re-use issue than a choice between query-on-query styles. And if you did not have additional saved views in Access, then no question: adopting CTEs to keep a similar approach and design makes a lot of sense. The ability to re-use column expressions is what we lost by going to SQL Server, and CTEs do a lot to regain it.
Scenario
Quick background on this one: I am attempting to optimize the use of an inline table-valued function uf_GetVisibleCustomers(@cUserId). The iTVF wraps a view CustomerView and filters out all rows containing data for customers that the requesting user is not permitted to see. This way, should selection criteria ever change in the future for certain user types, we won't have to implement that new condition a hundred times (hyperbole) all over the SQL codebase.
Performance is not great, however, so I want to fix that before encouraging use of the iTVF. I've changed database object names here just to make it easier to demonstrate (hopefully).
Queries
In attempting to optimize our iTVF uf_GetVisibleCustomers, I've noticed that the following SQL …
CREATE TABLE #tC ( idCustomer INT );

INSERT #tC
SELECT idCustomer
FROM [dbo].[uf_GetVisibleCustomers]('requester');

SELECT T.fAmount
FROM [Transactions] T
JOIN #tC C ON C.idCustomer = T.idCustomer;
… is orders of magnitude faster than my original (IMO more readable and more likely to be used) SQL here …
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
I don't get why this is. The former (top block of SQL) returns ~700k rows in 17 seconds on a fairly modest development server. The latter (second block of SQL) returns the same number of rows in about ten minutes when there is no other user activity on the server. It may be worth noting that there is a WHERE clause; I have omitted it here for simplicity, and it is the same for both queries.
Execution Plan
Below is the execution plan for the first query. It enjoys automatic parallelism, while the plan for the latter query isn't worth displaying here because it's just massive (it expands the entire iTVF and the underlying view, subqueries and all). Anyway, the latter also does not execute in parallel (AFAIK) to any extent.
My Questions
Is it possible to achieve performance comparable to the first block without a temp table?
That is, with the relative simplicity and human-readability of the slower SQL.
Why is a join to a temp table faster than a join to an iTVF?
Why is it faster to use a temp table than an in-memory table populated the same way?
Beyond those explicit questions, if someone can point me in the right direction toward understanding this better in general then I would be very grateful.
Without seeing the DDL for your inline function - it's hard to say what the issue is. It would also help to see the actual execution plans for both queries (perhaps you could try: https://www.brentozar.com/pastetheplan/). That said, I can offer some food for thought.
As you mentioned, the iTVF accesses the underlying tables, views and associated indexes. If your statistics are not up to date, you can get a bad plan; that won't happen with your temp table. On that note: how long does it take to populate that temp table?
Another thing to look at (again, this is why DDL is helpful): are the data types the same for Transactions.idCustomer and #tC.idCustomer? I see a hash match in the plan you posted, which seems bad for a join between two IDs (a nested loops or merge join would be better). This could be slowing both queries down, but would appear to have a more dramatic impact on the query that leverages your iTVF.
Again, this ^^^ is speculation based on my experience. A couple of quick things to try (not as a permanent fix, but for troubleshooting):
1. Check to see if re-compiling your query when using the iTVF speeds things up (this would be a sign of bad stats or of a bad execution plan being cached and re-used).
2. Try forcing a parallel plan for the iTVF query. You can do this by adding OPTION (QUERYTRACEON 8649) to the end of your query, or by using make_parallel() by Adam Machanic. Sketches of both checks follow.
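For instance (hedged sketches; QUERYTRACEON 8649 is an undocumented trace flag, suited to troubleshooting rather than production use):

-- 1. Rule out a bad cached plan:
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (RECOMPILE);

-- 2. Encourage a parallel plan:
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (QUERYTRACEON 8649);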
Why use temporary tables in stored procedures that return large result sets? How does this help performance? Is there an example out there -- maybe a join of several tables returning a large set of data -- of how a temporary table may help the performance of such a query in a stored procedure?
In my experience they may be helpful in limited situations, when a query is so complex that the query optimizer is struggling to come up with a decent plan. Breaking such a query apart and storing intermediate results in temp tables may help if done right. I use this strategy as a last resort, because temp tables are expensive, and for large result sets they may be very expensive.
I found this excellent article quite useful in answering this question:
Paul White - temporary tables in stored procedures
Just to underline some concepts from the article:
Temporary tables can be very useful as a way of simplifying a large query into smaller parts, giving the optimizer a better chance of finding good execution plans, and providing statistical information about an intermediate result set
Temporary objects may be cached across executions, despite explicit CREATE and DROP statements
Statistics associated with a cached temporary object are also cached
On my behalf, I would just add that if the stored procedure accesses the same data on another server more than once and we have a slow connection, it may be useful to bring it into a temporary table. Of course, a table variable would also be valid for this purpose.
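A minimal sketch of that last point (the linked server and table names are hypothetical):

-- Pull the remote data across the slow link once...
SELECT CustomerId, Amount, SaleDate
INTO #RemoteSales
FROM [RemoteServer].[SalesDb].[dbo].[Sales]
WHERE SaleDate >= '20240101';

-- ...then hit the local copy as many times as needed.
SELECT CustomerId, SUM(Amount) AS Total
FROM #RemoteSales
GROUP BY CustomerId;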
What are the downsides of using SqlServer Views?
I create views frequently to show my data in a denormalized form.
I find it much easier -- and therefore faster, less error prone, and more self-documenting -- to query one of these joins rather than to generate complex queries with complicated joins between many tables, especially when I am analyzing the same data (many of the same fields, same table joins) from different angles.
But is there a cost to creating and using these views?
Am I slowing down (or speeding up?) query processing?
When it comes to views, there are advantages and disadvantages.
Advantages:
They are virtual tables and not stored in the database as a distinct object. All that is stored is the SELECT statement.
It can be used as a security measure by restricting what the user can see.
It can make commonly used complex queries easier to read by encapsulating them into a view. This is a double-edged sword though -- see disadvantage #3.
Disadvantages:
It does not have an optimized execution plan cached so it will not be as fast as a stored procedure.
Since it is basically just an abstraction of a SELECT it is marginally slower than doing a pure SELECT.
It can hide complexity and lead to gotchas. (Gotcha: ORDER BY not honored).
My personal opinion is to not use Views but to instead use stored procedures as they provide the security and encapsulation of Views but also come with improved performance.
One possible downside of using views is that you abstract the complexity of the underlying design which can lead to abuse by junior developers and report creators.
For a particularly large and complex project I designed a set of views which were to be used mostly by report designers to populate Crystal Reports. I found out weeks later that junior devs had started using these views to fetch aggregates and to join these already large views, simply because they were there and were easy to consume. (There was a strong element of EAV design in the database.) I found out about this after the junior devs started asking why seemingly simple reports were taking many minutes to execute.
The efficiency of a view depends in large part on the underlying tables. A view really is just an organized and consistent way to look at query results. If the query used to form the view is good, and uses proper indexes on the underlying tables, then the view shouldn't negatively impact performance.
In SQL Server you can also create materialized or indexed views (since SQL Server 2000), which increase speed somewhat.
I use views regularly as well. One thing to note, however, is that using lots of views could be hard to maintain if your underlying tables change frequently (especially during development).
EDIT: Having said that, I find the convenience and advantage of being able to simplify and re-use complex queries outweighs the maintenance issue, especially if the views are used responsibly.
Views can be a detriment to performance when the view contains logic, columns, rows, or tables that aren't ultimately used by your final query. I can't tell you how many times I've seen stuff like:
SELECT ...
FROM (View with complex UNION of ActiveCustomer and InactiveCustomer tables)
WHERE Active = True
(thus filtering out all rows that were included in the view from the InactiveCustomer table), or
SELECT (one column)
FROM (view that returns 50 columns)
(SQL has to retrieve lots of data that is then discarded at a later step. It's possible those other columns are expensive to retrieve, like through a bookmark lookup), or
SELECT ...
FROM (view with complex filters)
WHERE (entirely different filters)
(it's likely that SQL could have used a more appropriate index if the tables were queried directly),
or
SELECT (only fields from a single table)
FROM (view that contains crazy complex joins)
(lots of CPU overhead through the join, and unnecessary IO for the table reads that are later discarded), or my favorite:
SELECT ...
FROM (Crazy UNION of 12 tables each containing a month of data)
WHERE OrderDate = @OrderDate
(Reads 12 tables when it only really needs to read 1).
In most cases, SQL is smart enough to "see through the covers" and come up with an effective query plan anyway. But in other cases (especially very complex ones), it can't. In each of the above situations, the answer was to remove the view and query the underlying tables instead.
At the very least (even if you think SQL would be smart enough to optimize it anyway), eliminating the view can sometimes make your own query debugging and optimization easier (it's a bit more obvious what needs to be done).
A downside to views that I've run into is a dive in performance when incorporating them into distributed queries. This SQLMag article discusses it -- and whilst I use highly artificial data in the demo, I've run into this problem time and time again in the "real world".
Respect your views, and they'll treat you well.
What are Various Limitations of the Views in SQL Server?
Top 11 Limitations of Views
Views do not support COUNT(*); however, they can support COUNT_BIG(*)
The ORDER BY clause does not work in a view
Regular queries or stored procedures give us flexibility when we need another column; we can add a column to regular queries right away. If we want to do the same with views, then we will have to modify them first
An index created on a view is often not used
Once the view is created, if the base table has any column added or removed, it is usually not reflected in the view till it is refreshed
A UNION operation is not allowed in an indexed view
We cannot create an index in a nested view situation, i.e. we cannot create an index on a view which is built from another view
A SELF JOIN is not allowed in an indexed view
An OUTER JOIN is not allowed in an indexed view
Cross-database queries are not allowed in an indexed view
Source: SQL MVP Pinal Dave
http://blog.sqlauthority.com/2010/10/03/sql-server-the-limitations-of-the-views-eleven-and-more/
When I started, I always thought views added performance overhead; however, experience paints a different story (the view mechanism itself has negligible overhead).
It all depends on what the underlying query is. Check out indexed views; ultimately you should test the performance both ways to obtain a clear performance profile.
My biggest 'gripe' is that ORDER BY does not work in a view. While that makes sense, it is a case which can jump up and bite you if not expected. Because of this I have had to switch away from using views to SPROCS (which have more than enough problems of their own) in a few cases where I could not specify an ORDER BY later. (I wish there were a construct with "FINAL VIEW" -- e.g. possibly including ORDER BY -- semantics.)
http://blog.sqlauthority.com/2010/10/03/sql-server-the-limitations-of-the-views-eleven-and-more/ (Limitation #1 is about ORDER BY :-)
The following is a SQL hack that allows an order by to be referenced in a view:
create view toto1 as
select top 99.9999 percent F1
from Db1.dbo.T1 as a
order by 1
But my preference is to use Row_Number:
create view toto2 as
select *, ROW_NUMBER() over (order by [F1]) as RowN
from (
    select f1
    from Db1.dbo.T1
) as a
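Either way, note that a view's row order is never guaranteed; the outer query still has to ask for it, e.g.:

select * from toto2 order by RowN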
I want to memoize function results for performance, i.e. lazily populate a cache indexed on the function arguments. The first time I call a function, the cache won't have anything for the input arguments, so it will calculate it and store it before returning it. Subsequent calls just use the cache.
However, it seems that SQL Server 2000 has a stupid arbitrary rule about functions being "deterministic": INSERTs, UPDATEs, and regular stored procedure calls are forbidden, yet extended stored procedures are allowed. How is that deterministic? If another session modifies the database state, the function output will change anyway.
I'm steaming mad. I had thought I could make caching transparent to the user. Is this possible? I don't have the permissions to deploy extended stored procedures.
EDIT:
This limitation is still in 2008. You can't call RAND, for God's sake!
The cache would be implemented by me in the DB. A cache is any data store used for caching...
EDIT:
There are no cases where the same arguments to a function will yield different results, outside of changes to the underlying data. This is a BI platform, and the only changes come from scheduled ETL, at which time I would TRUNCATE the cache table.
These are I/O-intensive time-series calculations, on the order of O(n^4). I don't have the mandate to change the underlying table or indexes. Also, a lot of these functions use the same intermediate functions, and caching allows those intermediate results to be reused.
UDFs are not truly deterministic unless they account for changes in database state. What's the point? Is SQL Server caching? (Ironic.) If SQL Server is caching, then it must be expiring on changes to tables that are schema-bound. If they're schema-bound, then why not bind the tables that the function modifies? I can see why procs aren't allowed, although that's just sloppy; just schema-bind procs. And, BTW, why allow extended stored procs? You can't possibly track what those do to ensure determinism! Argh!
EDIT:
My question is: Is there any way to lazily cache function results in a way that can be used in a view?
Deterministic means that the same inputs return the same output, independent of time and database state.
SQL Server (any version) does no caching of UDFs - I believe it will avoid calling the UDF twice on a single row, but that's it.
One trick I've used (I think I posted it here on SO) is to:
Refactor the UDF, if you can, so that there is effectively a usable, discrete subset of values returned for a given set of inputs. For numerical calculations, one can sometimes refactor the logic to return a factor or rate which is multiplied outside the UDF, instead of being applied inside the UDF to a passed-in value.
Call the UDF over the DISTINCT rowset and cache the results to a temporary table. If you are only calling the UDF with 100,000 parameter tuples over a 17,000,000-row set, this is very much more efficient.
JOIN to the temporary table (basically converting from code-based logic to table-based logic) to get values.
This table can be re-used as necessary or even kept.
Additions to the table can be made by first LEFT JOINing to find missing cached entries.
This works for both single-row table-valued UDFs and scalar UDFs. I'm mainly using it for table-valued UDFs. There is a hotfix for SQL Server 2005 which is supposed to address UDF performance -- I'm waiting on the DBAs to test it before deploying to production.
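A minimal sketch of the trick above (all names are hypothetical, and a scalar UDF is assumed for brevity):

-- 1. Collect the distinct parameter tuples (e.g. 100,000 rows instead of 17,000,000).
SELECT DISTINCT Arg1, Arg2
INTO #Args
FROM dbo.BigTable;

-- 2. Call the UDF once per distinct tuple and cache the results.
SELECT Arg1, Arg2, dbo.ExpensiveUdf(Arg1, Arg2) AS Result
INTO #UdfCache
FROM #Args;

-- 3. Convert code-based logic to table-based logic with a JOIN.
SELECT b.Arg1, b.Arg2, c.Result
FROM dbo.BigTable AS b
JOIN #UdfCache AS c
  ON c.Arg1 = b.Arg1 AND c.Arg2 = b.Arg2;

-- To add entries later, LEFT JOIN #UdfCache to find missing tuples and insert them.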