Can anyone break down in plain English the performance differences between temp tables, CTEs, and table variables in MSSQL? I have used temporary tables quite frequently and have started using CTEs just because of the clear syntax, but I have found them to be slower. I think that temp tables use system memory, and that is why they seem fast, but they may become a bottleneck when trying to do multiple jobs. Table variables I have used sparingly and do not know a great deal about. Looking for some advice from the gurus out there!
This question is well covered in Books Online, MSDN and this site.
You can read about temp tables and table variables here: What's the difference between a temp table and table variable in SQL Server?.
There you will find that in many cases temp tables cause recompilation of a procedure, which is their main disadvantage.
CTEs are well described here http://blogs.msdn.com/b/craigfr/archive/2007/10/18/ctes-common-table-expressions.aspx
CTEs are performance-neutral. They simplify a query for the developer by abstracting out SQL statements - usually complicated JOINs or built-in functions applied to fields. The database engine just in-lines the CTE into the query that uses it. So the CTE itself isn't "slow", but you may find you get better performance with temp tables because the database engine creates better query plans for the queries that use them.
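As a rough illustration (the table and column names here are made up), compare a CTE, which is simply in-lined into the outer query, with a temp table, which is materialised once, gets its own statistics, and can be reused:

-- CTE version: the engine expands TopCustomers into the outer SELECT,
-- so this is purely a readability device.
;WITH TopCustomers AS
(
    SELECT idCustomer, SUM(fAmount) AS fTotal
    FROM   dbo.Orders
    GROUP BY idCustomer
)
SELECT c.sName, tc.fTotal
FROM   TopCustomers AS tc
JOIN   dbo.Customers AS c ON c.idCustomer = tc.idCustomer;

-- Temp table version: the intermediate result is written to tempdb once,
-- gets statistics, and can be reused by several later queries.
SELECT idCustomer, SUM(fAmount) AS fTotal
INTO   #TopCustomers
FROM   dbo.Orders
GROUP BY idCustomer;

SELECT c.sName, tc.fTotal
FROM   #TopCustomers AS tc
JOIN   dbo.Customers AS c ON c.idCustomer = tc.idCustomer;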
This question was answered here and here.
Briefly, these are different tools for different tasks.
Table variables can lead to fewer stored procedure recompilations than temporary tables.
A temp table is good for re-use or for performing multiple processing passes on a set of data; a rough sketch of both follows.
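For example (table and column names are made up):

-- Table variable: no statistics, scoped to the batch/procedure,
-- less likely to trigger recompilation.
DECLARE @tCustomers TABLE (idCustomer INT PRIMARY KEY);
INSERT @tCustomers (idCustomer)
SELECT idCustomer FROM dbo.Customers WHERE bActive = 1;

-- Temp table: gets statistics and can be indexed after creation,
-- better suited to multiple processing passes over a larger set.
CREATE TABLE #tCustomers (idCustomer INT PRIMARY KEY);
INSERT #tCustomers (idCustomer)
SELECT idCustomer FROM dbo.Customers WHERE bActive = 1;

-- ...several processing passes against #tCustomers...
DROP TABLE #tCustomers;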
Related
While attempting to improve performance on a stored procedure, the execution plan reported some missing indexes (obvious wins). I now see awful indexes on the table: some repeated, some overlapping, some missing columns. I expect to drop some indexes entirely, update or consolidate others, and I might get to add one or two new ones (though I doubt it).
I've tuned indexes in the past, but on tables with relatively few stored procedures. This table has been identified as a problem, but nobody's clear on how to effectively test hundreds of dependent stored procedures. I believe I'll have to run every stored procedure, repeatedly, both before and after indexing, to demonstrate that any change is useful.
I've seen load-testing tools, and that inspired my first plan of attack. Is there an open-source tool that analyses the code and tables, provides meaningful parameters, and then executes hundreds of stored procedures in independent, multi-threaded loops? I'm hoping not to have to hand-curate the parameter values. The server is rebooted weekly, so historical patterns take a while to collect.
Second, is this the best approach? I've tuned indexes where only a few stored procedures were impacted, never anything at this scope. Is there a better approach?
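One thing I could do for the before/after numbers is snapshot sys.dm_exec_procedure_stats while the normal workload runs; a rough sketch (the database name is a placeholder):

-- Snapshot cumulative per-procedure timings; run once before the index
-- changes and again after, then compare average cost per execution.
-- Note: counters reset when a plan leaves the cache or the server restarts.
SELECT  OBJECT_NAME(ps.object_id, ps.database_id)    AS procName,
        ps.execution_count,
        ps.total_elapsed_time / ps.execution_count   AS avgElapsedMicroseconds,
        ps.total_logical_reads / ps.execution_count  AS avgLogicalReads,
        ps.last_execution_time
FROM    sys.dm_exec_procedure_stats AS ps
WHERE   ps.database_id = DB_ID(N'MyDatabase')        -- placeholder database name
ORDER BY ps.total_elapsed_time DESC;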
Thanks!
Scenario
Quick background on this one: I am attempting to optimize the use of an inline table-valued function, uf_GetVisibleCustomers(@cUserId). The iTVF wraps a view, CustomerView, and filters out all rows containing data for customers that the requesting user is not permitted to see. This way, should the selection criteria ever change in the future for certain user types, we won't have to implement the new condition a hundred times (hyperbole) all over the SQL codebase.
Performance is not great, however, so I want to fix that before encouraging use of the iTVF. I've changed the database object names here just to make it easier to demonstrate (hopefully).
Queries
In attempting to optimize our iTVF uf_GetVisibleCustomers, I've noticed that the following SQL …
-- Materialise the visible-customer IDs into a temp table first
CREATE TABLE #tC ( idCustomer INT );

INSERT #tC
SELECT idCustomer
FROM [dbo].[uf_GetVisibleCustomers]('requester');

-- ...then join Transactions to the temp table
SELECT T.fAmount
FROM [Transactions] T
JOIN #tC C ON C.idCustomer = T.idCustomer;
… is orders of magnitude faster than my original (IMO more readable, likely to be used) SQL here…
-- Same logic, but joining directly to the iTVF
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer;
I don't understand why this is. The former (top block of SQL) returns ~700k rows in 17 seconds on a fairly modest development server. The latter (second block of SQL) returns the same number of rows in about ten minutes when there is no other user activity on the server. It may be worth noting that both queries have a WHERE clause, which I have omitted here for simplicity; it is identical in both.
Execution Plan
The execution plan for the first query goes parallel, while the plan for the second isn't worth displaying here because it's just massive (it expands the entire iTVF, the underlying view, and subqueries). As far as I can tell, the second query does not execute in parallel to any extent.
My Questions
Is it possible to achieve performance comparable to the first block without a temp table?
That is, with the relative simplicity and human-readability of the slower SQL.
Why is a join to a temp table faster than a join to iTVF?
Why is it faster to use a temp table than an in-memory table populated the same way?
Beyond those explicit questions, if someone can point me in the right direction toward understanding this better in general then I would be very grateful.
Without seeing the DDL for your inline function, it's hard to say what the issue is. It would also help to see the actual execution plans for both queries (perhaps you could use https://www.brentozar.com/pastetheplan/). That said, I can offer some food for thought.
As you mentioned, the iTVF accesses the underlying tables, views, and associated indexes. If your statistics are not up to date you can get a bad plan; that won't happen with your temp table. On that note, how long does it take to populate that temp table?
Another thing to look at (again, this is why DDL is helpful): are the data types the same for Transactions.idCustomer and #tC.idCustomer? I see a hash match in the plan you posted, which seems bad for a join between two IDs (a nested loops or merge join would be better). This could be slowing both queries down, but it would appear to have a more dramatic impact on the query that uses your iTVF.
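If you want to check the data types quickly, something like this should do it (object names are taken from your question, so adjust the schema as needed):

-- Compare the declared types of the two join columns in the base objects.
SELECT OBJECT_NAME(c.object_id) AS objectName,
       c.name                   AS columnName,
       t.name                   AS typeName,
       c.max_length, c.precision, c.scale
FROM   sys.columns AS c
JOIN   sys.types   AS t ON t.user_type_id = c.user_type_id
WHERE  c.name = N'idCustomer'
  AND  c.object_id IN (OBJECT_ID(N'dbo.Transactions'),
                       OBJECT_ID(N'dbo.CustomerView'));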
Again, this is speculation based on my experience. A couple of quick things to try (not as a permanent fix, but for troubleshooting; both are sketched below):
1. Check whether recompiling your query when using the iTVF speeds things up (if so, that would be a sign of bad statistics or a bad execution plan being cached and re-used).
2. Try forcing a parallel plan for the iTVF query. You can do this by adding OPTION (QUERYTRACEON 8649) to the end of your query, or by using make_parallel() by Adam Machanic.
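For example, roughly (the query shape is taken from your post; QUERYTRACEON 8649 is an undocumented trace flag, so treat it as a troubleshooting aid only):

-- 1) Rule out a stale cached plan / stale statistics:
SELECT T.fAmount
FROM   [Transactions] AS T
JOIN   [dbo].[uf_GetVisibleCustomers]('requester') AS C
       ON C.idCustomer = T.idCustomer
OPTION (RECOMPILE);

-- 2) See whether a parallel plan helps (troubleshooting only, not a permanent fix):
SELECT T.fAmount
FROM   [Transactions] AS T
JOIN   [dbo].[uf_GetVisibleCustomers]('requester') AS C
       ON C.idCustomer = T.idCustomer
OPTION (QUERYTRACEON 8649);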
What's the difference between this and rebuilding the index?
ANALYZE TABLE <table_name> COMPUTE STATISTICS;
A few things to discuss here
1) ANALYZE TABLE COMPUTE STATISTICS;
Don't use this command; it is obsolete. It was designed to collect information about the table so that queries against it can be run in the best fashion. Use DBMS_STATS.GATHER_TABLE_STATS instead. That is also an obvious lead-in to saying that you should have a good read of the Performance Tuning guide to get your head around the optimizer, SQL execution, and so on:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/tgdba/index.html
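A rough sketch of the supported alternative (the schema and table names are placeholders):

-- Gather optimizer statistics for a table and, with CASCADE, its indexes.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'APP_OWNER',    -- placeholder schema
    tabname => 'ORDERS',       -- placeholder table
    cascade => TRUE);          -- also gather index statistics
END;
/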
2) Rebuild index
This has nothing to do with the table at all. It is about regenerating the structure that certain queries use to access table data efficiently. It is rare that rebuilds are required. If you are interested, there's a very good whitepaper at
https://richardfoote.wordpress.com/2007/12/11/index-internals-rebuilding-the-truth/
I've been looking for tips on how to speed up a SQL-intensive application and found this particularly useful link.
In point 6 he says:
Do pre-stage data. This is one of my favorite topics because it's an old technique that's often overlooked. If you have a report or a procedure (or better yet, a set of them) that will do similar joins to large tables, it can be a benefit for you to pre-stage the data by joining the tables ahead of time and persisting them into a table. Now the reports can run against that pre-staged table and avoid the large join.
You're not always able to use this technique, but when you can, you'll find it is an excellent way to save server resources.
Note that many developers get around this join problem by concentrating on the query itself and creating a view-only around the join so that they don't have to type the join conditions again and again. But the problem with this approach is that the query still runs for every report that needs it. By pre-staging the data, you run the join just once (say, 10 minutes before the reports) and everyone else avoids the big join. I can't tell you how much I love this technique; in most environments, there are popular tables that get joined all the time, so there's no reason why they can't be pre-staged.
From what I understood, you join the tables once and several SQL queries can "benefit" from it. That looks extremely interesting for the application I'm working on.
The thing is, I've been searching around for "pre-staging data" and couldn't find anything that seems related to this technique. Maybe I'm missing a few keywords?
I'd like to know how to use the described technique within SQL Server. The link says it's an old technique, so it shouldn't be a problem that I'm using SQL Server 2008.
What I would like is the following: I have several SELECT queries that run in a row. All of them join the same 7-8 tables, and they're all really heavy, which impacts performance. So I'm thinking of joining the tables once, running the queries against the result, and then dropping the intermediate table. Can it be done, and if so, how?
If your query meets the requirements for an indexed view, then you can just create such an object, materialising the result of your query. This means it will always be up to date.
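A rough sketch of that approach, with made-up table and column names (note that the requirements are strict: SCHEMABINDING, two-part names, COUNT_BIG(*) when grouping, no outer joins, and so on):

CREATE VIEW dbo.vw_CustomerTotals
WITH SCHEMABINDING
AS
SELECT  o.idCustomer,
        SUM(ISNULL(o.fAmount, 0)) AS fTotalAmount,   -- wrapped so the SUM is over a non-nullable expression
        COUNT_BIG(*)              AS cRows           -- required when grouping
FROM    dbo.Orders AS o                              -- hypothetical table
GROUP BY o.idCustomer;
GO

-- The unique clustered index is what actually materialises the view.
CREATE UNIQUE CLUSTERED INDEX IX_vw_CustomerTotals
    ON dbo.vw_CustomerTotals (idCustomer);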
Otherwise you would need to write code to materialise it yourself, either eagerly on a schedule, or potentially on first request and then cached for some amount of time that you deem acceptable.
The second approach is not rocket science and can be done with a TRUNCATE ... INSERT ... SELECT for moderate amounts of data or perhaps an ALTER TABLE ... SWITCH if the data is large and there are concurrent queries that might be accessing the cached result set.
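A rough sketch of the eager, scheduled variant, assuming a hypothetical permanent staging table dbo.PreStagedReport that matches the join's output:

-- Run on a schedule (e.g. a SQL Agent job) shortly before the reports.
BEGIN TRANSACTION;

TRUNCATE TABLE dbo.PreStagedReport;

INSERT dbo.PreStagedReport (idCustomer, fAmount, dtOrder)
SELECT  c.idCustomer, o.fAmount, o.dtOrder
FROM    dbo.Orders    AS o
JOIN    dbo.Customers AS c ON c.idCustomer = o.idCustomer;
-- ...the other five or six joined tables would go here...

COMMIT;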
Regarding your edit, it seems like you just need to create a #temp table, insert the results of the join into it, reference the #temp table in the several SELECTs, and then drop it. There is no guarantee that this will improve performance, though, and there are insufficient details in the question to even hazard a guess.
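If you do go the #temp route, the shape would be roughly this (placeholder names again):

-- Do the heavy 7-8 table join once...
SELECT  c.idCustomer, o.fAmount, o.dtOrder   -- plus whatever the SELECTs need
INTO    #Staged
FROM    dbo.Orders    AS o
JOIN    dbo.Customers AS c ON c.idCustomer = o.idCustomer;
-- ...remaining joins omitted...

-- ...then point the several SELECTs at the staged result...
SELECT idCustomer, SUM(fAmount) AS fTotal FROM #Staged GROUP BY idCustomer;
SELECT COUNT(*) AS cRows FROM #Staged WHERE dtOrder >= '20080101';

-- ...and clean up when done.
DROP TABLE #Staged;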
Why use temporary tables in stored procedures that return large result sets? How does this help performance? Is there an example out there of, say, a join of several tables returning a large set of data, and of how a temporary table may help the performance of such a query in a stored procedure?
In my experience they may be helpful in limited situations, when a query is so complex that the query optimizer struggles to come up with a decent plan. Breaking such a query apart and storing intermediate results in temp tables may help if done right. I use this strategy as a last resort, because temp tables have a cost of their own, and for large result sets that cost can be significant.
I found this excellent article quite useful in answering this question:
Paul White - temporary tables in stored procedures
Just to underline some concepts from the article:
Temporary tables can be very useful as a way of simplifying a large query into smaller parts, giving the optimizer a better chance of finding good execution plans, and providing statistical information about an intermediate result set.
Temporary objects may be cached across executions, despite explicit CREATE and DROP statements.
Statistics associated with a cached temporary object are also cached.
For my part, I would just add that, if the stored procedure accesses the same data on another server more than once and the connection is slow, it may be useful to bring that data into a temporary table. Of course, a table variable would also be valid for this purpose.
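A rough sketch of that last point, assuming a hypothetical linked server [RemoteSrv]:

-- Pull the remote rows across the slow link once...
SELECT  idCustomer, sName
INTO    #RemoteCustomers
FROM    [RemoteSrv].[RemoteDb].dbo.Customers;   -- hypothetical linked server / database

-- ...then every later step in the procedure reads the local copy.
SELECT  t.fAmount, rc.sName
FROM    dbo.Transactions AS t
JOIN    #RemoteCustomers AS rc ON rc.idCustomer = t.idCustomer;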