Use temp table instead of CTE for disposable purpose - sql-server

1 As I known, CTE is more used for Query readability, not for performance.
Without talking about indexing, should we just use the temp table in all cases instead of CTE because there is no performance benefit anyway and temp table can achieve the same outcome as CTE?
2 But I still cannot get this: Temp Tables are physically created in the Tempdb database while CTE is not materialized. They are created in memory and got disposed of straight away after the next statement.
Should we have fast better performance to access the same data in memory instead of materialized disk space?

Temp Tables are physically created in the Tempdb database while CTE is
not materialized. They are created in memory and got disposed of
straight away after the next statement.
This is wrong. CTE is not created, it's not a structure containing data. It's just like a view with the only difference that its definition is not saved as an object. As a consequence, it cannot be reused in other statements. It's just another form of writing a derived table.
For example, when you write
with USAcusts as
(
select custid, companyname
from sales.customers
where country = N'USA'
)
select * from USAcusts;
It's the same as the code below:
select *
from
(
select custid, companyname
from sales.customers
where country = N'USA'
) t
None of CTE or derived table is "created in memory", exactly like in view case, SQL Server expands the definition of the table expression and queries the underlying objects directly.

If you are looking for performance, always use temp table.
I will only use CTE for these reasons:
very small and 1 time use temp data
recursive query

Related

Persistent WITH statement in SQL Server 2008 [duplicate]

I've got a question which occurs when I was using the WITH-clause in one of my script. The question is easy to pointed out I wanna use the CTE alias multiple times instead of only in outer query and there is crux.
For instance:
-- Define the CTE expression
WITH cte_test (domain1, domain2, [...])
AS
-- CTE query
(
SELECT domain1, domain2, [...]
FROM table
)
-- Outer query
SELECT * FROM cte_test
-- Now I wanna use the CTE expression another time
INSERT INTO sometable ([...]) SELECT [...] FROM cte_test
The last row will lead to the following error because it's outside the outer query:
Msg 208, Level 16, State 1, Line 12 Invalid object name 'cte_test'.
Is there a way to use the CTE multiple times resp. make it persistent? My current solution is to create a temp table where I store the result of the CTE and use this temp table for any further statements.
-- CTE
[...]
-- Create a temp table after the CTE block
DECLARE #tmp TABLE (domain1 DATATYPE, domain2 DATATYPE, [...])
INSERT INTO #tmp (domain1, domain2, [...]) SELECT domain1, domain2, [...] FROM cte_test
-- Any further DML statements
SELECT * FROM #tmp
INSERT INTO sometable ([...]) SELECT [...] FROM #tmp
[...]
Frankly, I don't like this solution. Does anyone else have a best practice for this problem?
Thanks in advance!
A CommonTableExpression doesn't persist data in any way. It's basically just a way of creating a sub-query in advance of the main query itself.
This makes it much more like an in-line view than a normal sub-query would be. Because you can reference it repeatedly in one query, rather than having to type it again and again.
But it is still just treated as a view, expanded into the queries that reference it, macro like. No persisting of data at all.
This, unfortunately for you, means that you must do the persistance yourself.
If you want the CTE's logic to be persisted, you don't want an in-line view, you just want a view.
If you want the CTE's result set to be persisted, you need a temp table type of solution, such as the one you do not like.
A CTE is only in scope for the SQL statement it belongs to. If you need to reuse its data in a subsequent statement, you need a temporary table or table variable to store the data in. In your example, unless you're implementing a recursive CTE I don't see that the CTE is needed at all - you can store its contents straight in a temporary table/table variable and reuse it as much as you want.
Also note that your DELETE statement would attempt to delete from the underlying table, unlike if you'd placed the results into a temporary table/table variable.

Speed of using SQL temp table vs long list of variables in stored procedure

I have a stored procedure with a list of about 50 variables of different types repeated about 8 times as part of different groups (declaration, initialization, loading, calculations, result, e.t.c.).
In order to avoid duplication I want to use temp tables instead (not table variable, which does not bring advantages that I seek - inferred type).
I've read that temp tables may start as "in memory table" and then are spilled to disk as they grow depending on amount of memory and many other conditions.
My question is - if I use temp table to store and manipulate one record with 50 fields, will it be much slower than using 50 variables ?
I would not use a temp #Table unless I need to store temporary results for multiple rows. Our code uses lots of variables in some stored procedures. The ability to initialize during declaration helps reduce clutter.
Temp #Tables have some interesting side effects with regards to query compilation. If your stored procedure calls any child procedures, and queries in the child procs refer to this #Table, then these queries will be recompiled upon every execution.
Also, note that if you modify the temp #Table schema in any way, then SQL Server will not be able to cache the table definition. You'll be incurring query recompilation penalties in every query that refers to the table. Also, SQL Server will hammer various system tables as it continually creates and drops the table metadata.
On the other hand, if you don't call child procs, and you don't change the #Table schema, it might perform OK.
But stylistically, it does not make sense to me to add another join to a query just to get a variable for use in a WHERE clause. In other words, I'd rather see a lot of this:
declare #id
select #id = ...
select tbl.id, ...
from tbl
inner join tbl2 ...
where tbl.id = #id
Instead of this:
create table #VarTbl (...)
insert into #VarTbl (...) select ...
select tbl.id, ...
from tbl
inner join tbl2 ...
cross join #VariableTable
where tbl.id = VarTbl_ID
Another thought: can you break apart the stored procedure into logical groups of operations? That might help readability. It can also help reduce query recompilations. If one child proc needs to be recompiled, this will not affect the parent proc or other child procs.
No, it will not be much slower; you would probably even have a hard time showing it is slower at all in normal use cases.
I always use temp tables in this instance; the performance difference is negligible and readability and ease of use is better in my opinion. I normally start looking at using a temp table if I get above 10 variables, especially if those are related.

How to force SQL Server to process CONTAINS clauses before WHERE clauses?

I have a SQL query that uses both standard WHERE clauses and full text index CONTAINS clauses. The query is built dynamically from code and includes a variable number of WHERE and CONTAINS clauses.
In order for the query to be fast, it is very important that the full text index be searched before the rest of the criteria are applied.
However, SQL Server chooses to process the WHERE clauses before the CONTAINS clauses and that causes tables scans and the query is very slow.
I'm able to rewrite this using two queries and a temporary table. When I do so, the query executes 10 times faster. But I don't want to do that in the code that creates the query because it is too complex.
Is there an a way to force SQL Server to process the CONTAINS before anything else? I can't force a plan (USE PLAN) because the query is built dynamically and varies a lot.
Note: I have the same problem on SQL Server 2005 and SQL Server 2008.
You can signal your intent to the optimiser like this
SELECT
*
FROM
(
SELECT *
FROM
WHERE
CONTAINS
) T1
WHERE
(normal conditions)
However, SQL is declarative: you say what you want, not how to do it. So the optimiser may decide to ignore the nesting above.
You can force the derived table with CONTAINS to be materialised before the classic WHERE clause is applied. I won't guarantee performance.
SELECT
*
FROM
(
SELECT TOP 2000000000
*
FROM
....
WHERE
CONTAINS
ORDER BY
SomeID
) T1
WHERE
(normal conditions)
Try doing it with 2 queries without temp tables:
SELECT *
FROM table
WHERE id IN (
SELECT id
FROM table
WHERE contains_criterias
)
AND further_where_classes
As I noted above, this is NOT as clean a way to "materialize" the derived table as the TOP clause that #gbn proposed, but a loop join hint forces an order of evaluation, and has worked for me in the past (admittedly usually with two different tables involved). There are a couple of problems though:
The query is ugly
you still don't get any guarantees that the other WHERE parameters don't get evaluated until after the join (I'll be interested to see what you get)
Here it is though, given that you asked:
SELECT OriginalTable.XXX
FROM (
SELECT XXX
FROM OriginalTable
WHERE
CONTAINS XXX
) AS ContainsCheck
INNER LOOP JOIN OriginalTable
ON ContainsCheck.PrimaryKeyColumns = OriginalTable.PrimaryKeyColumns
AND OriginalTable.OtherWhereConditions = OtherValues

SQL Server CTE referred in self joins slow

I have written a table-valued UDF that starts by a CTE to return a subset of the rows from a large table.
There are several joins in the CTE. A couple of inner and one left join to other tables, which don't contain a lot of rows.
The CTE has a where clause that returns the rows within a date range, in order to return only the rows needed.
I'm then referencing this CTE in 4 self left joins, in order to build subtotals using different criterias.
The query is quite complex but here is a simplified pseudo-version of it
WITH DataCTE as
(
SELECT [columns] FROM table
INNER JOIN table2
ON [...]
INNER JOIN table3
ON [...]
LEFT JOIN table3
ON [...]
)
SELECT [aggregates_columns of each subset] FROM DataCTE Main
LEFT JOIN DataCTE BananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality = 100
LEFT JOIN DataCTE DamagedBananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality < 20
LEFT JOIN DataCTE MangosSubset
ON [...]
GROUP BY [
I have the feeling that SQL Server gets confused and calls the CTE for each self join, which seems confirmed by looking at the execution plan, although I confess not being an expert at reading those.
I would have assumed SQL Server to be smart enough to only perform the data retrieval from the CTE only once, rather than do it several times.
I have tried the same approach but rather than using a CTE to get the subset of the data, I used the same select query as in the CTE, but made it output to a temp table instead.
The version referring the CTE version takes 40 seconds. The version referring the temp table takes between 1 and 2 seconds.
Why isn't SQL Server smart enough to keep the CTE results in memory?
I like CTEs, especially in this case as my UDF is a table-valued one, so it allowed me to keep everything in a single statement.
To use a temp table, I would need to write a multi-statement table valued UDF, which I find a slightly less elegant solution.
Did some of you had this kind of performance issues with CTE, and if so, how did you get them sorted?
Thanks,
Kharlos
I believe that CTE results are retrieved every time. With a temp table the results are stored until it is dropped. This would seem to explain the performance gains you saw when you switched to a temp table.
Another benefit is that you can create indexes on a temporary table which you can't do to a cte. Not sure if there would be a benefit in your situation but it's good to know.
Related reading:
Which are more performant, CTE or temporary tables?
SQL 2005 CTE vs TEMP table Performance when used in joins of other tables
http://msdn.microsoft.com/en-us/magazine/cc163346.aspx#S3
Quote from the last link:
The CTE's underlying query will be
called each time it is referenced in
the immediately following query.
I'd say go with the temp table. Unfortunately elegant isn't always the best solution.
UPDATE:
Hmmm that makes things more difficult. It's hard for me to say with out looking at your whole environment.
Some thoughts:
can you use a stored procedure instead of a UDF (instead, not from within)?
This may not be possible but if you can remove the left join from you CTE you could move that into an indexed view. If you are able to do this you may see performance gains over even the temp table.

Optimizing ROW_NUMBER() in SQL Server

We have a number of machines which record data into a database at sporadic intervals. For each record, I'd like to obtain the time period between this recording and the previous recording.
I can do this using ROW_NUMBER as follows:
WITH TempTable AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
FROM dbo.DataTable
)
SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM TempTable AS [Current]
INNER JOIN TempTable AS Previous
ON [Current].Machine_ID = Previous.Machine_ID
AND Previous.Ordering = [Current].Ordering + 1
The problem is, it goes really slow (several minutes on a table with about 10k entries) - I tried creating separate indicies on Machine_ID and Date_Time, and a single joined-index, but nothing helps.
Is there anyway to rewrite this query to go faster?
The given ROW_NUMBER() partition and order require an index on (Machine_ID, Date_Time) to satisfy in one pass:
CREATE INDEX idxMachineIDDateTime ON DataTable (Machine_ID, Date_Time);
Separate indexes on Machine_ID and Date_Time will help little, if any.
How does it compare to this version?:
SELECT x.*
,(SELECT MAX(Date_Time)
FROM dbo.DataTable
WHERE Machine_ID = x.Machine_ID
AND Date_Time < x.Date_Time
) AS PreviousDateTime
FROM dbo.DataTable AS x
Or this version?:
SELECT x.*
,triang_join.PreviousDateTime
FROM dbo.DataTable AS x
INNER JOIN (
SELECT l.Machine_ID, l.Date_Time, MAX(r.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS l
LEFT JOIN dbo.DataTable AS r
ON l.Machine_ID = r.Machine_ID
AND l.Date_Time > r.Date_Time
GROUP BY l.Machine_ID, l.Date_Time
) AS triang_join
ON triang_join.Machine_ID = x.Machine_ID
AND triang_join.Date_Time = x.Date_Time
Both would perform best with an index on Machine_ID, Date_Time and for correct results, I'm assuming that this is unique.
You haven't mentioned what is hidden away in * and that can sometimes means a lot since a Machine_ID, Date_Time index will not generally be covering and if you have a lot of columns there or they have a lot of data, ...
If the number of rows in dbo.DataTable is large then it is likely that you are experiencing the issue due to the CTE self joining onto itself. There is a blog post explaining the issue in some detail here
Occasionally in such cases I have resorted to creating a temporary table to insert the result of the CTE query into and then doing the joins against that temporary table (although this has usually been for cases where a large number of joins against the temp table are required - in the case of a single join the performance difference will be less noticable)
I have had some strange performance problems using CTEs in SQL Server 2005. In many cases, replacing the CTE with a real temp table solved the problem.
I would try this before going any further with using a CTE.
I never found any explanation for the performance problems I've seen, and really didn't have any time to dig into the root causes. However I always suspected that the engine couldn't optimize the CTE in the same way that it can optimize a temp table (which can be indexed if more optimization is needed).
Update
After your comment that this is a view, I would first test the query with a temp table to see if that performs better.
If it does, and using a stored proc is not an option, you might consider making the current CTE into an indexed/materialized view. You will want to read up on the subject before going down this road, as whether this is a good idea depends on a lot of factors, not the least of which is how often the data is updated.
What if you use a trigger to store the last timestamp an subtract each time to get the difference?
If you require this data often, rather than calculate it each time you pull the data, why not add a column and calculate/populate it whenever row is added?
(Remus' compound index will make the query fast; running it only once should make it faster still.)

Resources