Indexing a single-use temporary table - sql-server

A colleague works in a business which uses Microsoft SQL Server. Their team creates stored procedures that are executed daily to create data extracts. The underlying tables are huge (some have billions of rows), so most stored procedures are designed such that first they extract only the relevant rows of these huge tables into temporary tables, and then the temp tables are joined with each other and with other smaller tables to create a final extract. Something similar to this:
SELECT COL1, COL2, COL3
INTO #TABLE1
FROM HUGETABLE1
WHERE COL4 IN ('foo', 'bar');

SELECT COL1, COL102, COL103
INTO #TABLE2
FROM HUGETABLE2
WHERE COL14 = 'blah';

SELECT T1.COL1, T2.COL103, ST.COL306
FROM #TABLE1 AS T1
JOIN #TABLE2 AS T2
    ON T1.COL1 = T2.COL1
LEFT JOIN SMALLTABLE AS ST
    ON T1.COL3 = ST.COL3
ORDER BY T1.COL1;
Generally, the temporary tables are not modified after their creation (so no subsequent ALTER, UPDATE or INSERT operations). For the purpose of this discussion, let's assume the temporary tables are only used once later on (so only one SELECT query would rely on them).
Here is the question: is it a good idea to index these temporary tables after they are created and before they are used in the subsequent query?
My colleague believes that creating an index will make the join and the sort operations faster. I believe, however, that the total time will be longer, because index creation takes time. In other words, I assume that except for edge cases (like a temporary table that is itself extremely large, or a final SELECT query that is very complex), SQL Server will use the statistics it has on the temporary tables to optimize the final query, and in doing so it will effectively index the temp tables as it sees fit.
In other words, I am used to thinking that creating an index is only useful if you know the table will be used often; a single-use temporary table that is dropped once the stored procedure completes is not worth indexing.
Neither of us knows enough about the SQL Server optimizer to know in what ways we are right or wrong. Can you please help us better understand which of our assumptions is closer to the truth?

Your colleague is probably correct: even if a table is only going to be used in a single query, without seeing that query (and even with it, without its actual execution plan) we have no idea how many times SQL Server will need to find data within the various columns of each of those tables for joins, sorts, and so on.
However, we'll never know for sure until it's actually done both ways and the results are measured and compared.
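One low-tech way to settle it (a sketch reusing the tables from the question; the choice of clustered indexes on COL1 is just an example) is to run the final SELECT both ways inside the procedure and compare the time and I/O statistics, counting the index build time against the indexed version:

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Version A: no index, rely on the auto-created temp-table statistics.
SELECT T1.COL1, T2.COL103, ST.COL306
FROM #TABLE1 AS T1
JOIN #TABLE2 AS T2 ON T1.COL1 = T2.COL1
LEFT JOIN SMALLTABLE AS ST ON T1.COL3 = ST.COL3
ORDER BY T1.COL1;

-- Version B: build indexes first (their cost is part of the total), then rerun the query.
CREATE CLUSTERED INDEX IX_T1_COL1 ON #TABLE1 (COL1);
CREATE CLUSTERED INDEX IX_T2_COL1 ON #TABLE2 (COL1);

SELECT T1.COL1, T2.COL103, ST.COL306
FROM #TABLE1 AS T1
JOIN #TABLE2 AS T2 ON T1.COL1 = T2.COL1
LEFT JOIN SMALLTABLE AS ST ON T1.COL3 = ST.COL3
ORDER BY T1.COL1;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

In practice you would time the two versions in separate runs so that a warm buffer cache does not skew the comparison.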

If you are doing daily data extracts from tables with billions of rows, I would recommend using staging tables instead of temporary tables. This will isolate your extracts from other workloads that use tempdb.
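A minimal sketch of that idea (the Staging schema and table name are made up for illustration); the staging table is created once up front and then reloaded on each run:

-- Assumes a permanent table Staging.Extract1(COL1, COL2, COL3) was created beforehand.
TRUNCATE TABLE Staging.Extract1;

INSERT INTO Staging.Extract1 (COL1, COL2, COL3)
SELECT COL1, COL2, COL3
FROM HUGETABLE1
WHERE COL4 IN ('foo', 'bar');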
Here is the question: is it a good idea to index these temporary tables after they are created and before they are used in the subsequent query?
Create the index after loading the data into the temp table. This will eliminate fragmentation, and statistics will be created.
The optimizer uses statistics to generate an optimal plan, so if you don't have statistics, it can dramatically affect your query performance, especially for large datasets.
The example below compares creating the index before and after the data load into a temp table:
/* Create index after data load into temp table -- statistics are created */
CREATE TABLE #temp ([text] varchar(50), [num] int);
INSERT INTO #temp ([text], [num]) VALUES ('aaa', 1), ('bbb', 2), ('ccc', 3);
CREATE UNIQUE CLUSTERED INDEX [IX_num] ON #temp (num);
DBCC SHOW_STATISTICS ('tempdb..#temp', 'IX_num');

/* Create index before data load into temp table -- statistics are not created */
CREATE TABLE #temp_nostats ([text] varchar(50), [num] int);
CREATE UNIQUE CLUSTERED INDEX [IX_num] ON #temp_nostats (num);
INSERT INTO #temp_nostats ([text], [num]) VALUES ('aaa', 1), ('bbb', 2), ('ccc', 3);
DBCC SHOW_STATISTICS ('tempdb..#temp_nostats', 'IX_num');
You need to test whether the index actually helps you. You also need to balance how many indexes you create, because too many indexes can hurt performance as well.

Related

Delete large amount of data on SQL server

I need to delete 900,000,000 records in SQL Server.
I would like to know the best way.
I wrote the following batched DELETE:
DeleteTable:
DELETE TOP (1000) TAB1
FROM TABLE1 TAB1
LEFT JOIN TABLE2 TAB2 ON TAB1.ID_PRODUCT = TAB2.ID_PRODUCT
WHERE TAB2.ID_PRODUCT IS NULL;
IF @@ROWCOUNT <> 0 GOTO DeleteTable;
I would like to know how I can optimize this query for better delete performance.
Thank you.
Deleting 900,000,000 rows is going to take a long time and you might run out of temporary storage -- unless you have lots and lots of storage. Deleting rows in increments, as you are doing, is one approach.
If your recovery model is not set to SIMPLE, then you might want to consider that. Combined with your incremental delete approach, it will at least keep the log from filling up.
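If point-in-time recovery is not required on that database (a sketch; the database name is a placeholder), switching the recovery model looks like this:

-- Placeholder database name; only do this if full or bulk-logged recovery is not needed.
ALTER DATABASE [YourDatabase] SET RECOVERY SIMPLE;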
For your query, you want tab2(id_product) to have an index. I'm not sure if an index on tab1(id_product) would really help.
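A minimal sketch of that index (the name is illustrative):

-- Supports the anti-join lookup on TAB2.ID_PRODUCT
CREATE INDEX IX_TABLE2_ID_PRODUCT ON TABLE2 (ID_PRODUCT);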
Another option is to recreate the table, because inserts and table creation are much more efficient than deletes.
For this, you can essentially do:
select t1.*
into temp_tab1
from tab1 t1
where exists (select 1 from table2 t2 where t2.id_product = t1.id_product);
truncate table tab1; -- back it up first!
insert into tab1
select *
from temp_tab1;
Note: If you have an identity column, you may want to set identity insert on. Also, if you have foreign key constraints to this table, then you need extra care.
Finally, if this is something that you need to do repeatedly, then you should consider partitioning the table. It is much more efficient to drop partitions than to delete rows.
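As a rough sketch of what that could look like (the partition key, names, and boundary values are hypothetical, and TRUNCATE ... WITH PARTITIONS needs SQL Server 2016 or later):

-- Hypothetical example: partition TABLE1 by a date column, one partition per month.
CREATE PARTITION FUNCTION pf_ByMonth (date)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

CREATE PARTITION SCHEME ps_ByMonth
    AS PARTITION pf_ByMonth ALL TO ([PRIMARY]);

-- TABLE1 would have to be created (or rebuilt) on ps_ByMonth(date_column).
-- Removing the oldest month is then a metadata-only operation:
TRUNCATE TABLE TABLE1 WITH (PARTITIONS (1));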
You need to be careful if the database is highly transactional and the table has heavy read-write activity, mainly because you may block other sessions while the delete is in progress. A slower but less impactful approach is to work through the rows with a cursor or in small batches: throw the product_ids into a #table and delete from the actual table using product_id as the predicate.
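A sketch of that temp-table-driven approach (table names follow the question; the batch size of 1000 is arbitrary):

-- Collect the ids to delete once, then remove them in small batches.
SELECT TAB1.ID_PRODUCT
INTO #to_delete
FROM TABLE1 TAB1
LEFT JOIN TABLE2 TAB2 ON TAB1.ID_PRODUCT = TAB2.ID_PRODUCT
WHERE TAB2.ID_PRODUCT IS NULL;

WHILE 1 = 1
BEGIN
    DELETE TOP (1000) TAB1
    FROM TABLE1 TAB1
    JOIN #to_delete d ON d.ID_PRODUCT = TAB1.ID_PRODUCT;

    IF @@ROWCOUNT = 0 BREAK;
END;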

Does MS SQL Server automatically create a temp table if the query contains a lot of ids in an IN clause?

I have a big query to get multiple rows by id, like:
SELECT *
FROM TABLE
WHERE Id in (1001..10000)
This query runs very slowly and ends up with a timeout exception.
A temporary fix is to query with a limit, breaking the query into 10 parts of 1,000 ids each.
I heard that using temp tables may help in this case, but it also looks like SQL Server does this automatically under the hood.
What is the best way to handle problems like this?
You could write the query as follows using a temporary table:
CREATE TABLE #ids(Id INT NOT NULL PRIMARY KEY);
INSERT INTO #ids(Id) VALUES (1001),(1002),/*add your individual Ids here*/,(10000);
SELECT t.*
FROM [Table] AS t
INNER JOIN #ids AS ids
    ON ids.Id = t.Id;
DROP TABLE #ids;
My guess is that it will probably run faster than your original query. The lookup can be done directly using an index (if one exists on the [Table].Id column).
Your original query translates to
SELECT *
FROM [TABLE]
WHERE Id=1001 OR Id=1002 OR /*...*/ OR Id=10000;
This would require evaluation of the expression Id=1001 OR Id=1002 OR /*...*/ OR Id=10000 for every row in [Table], which probably takes longer than with a temporary table. The example with a temporary table takes each Id in #ids and looks up the corresponding Id in [Table] using an index.
This all assumes that there are gaps in the Ids between 1001 and 10000. Otherwise it would be simpler to write
SELECT *
FROM [TABLE]
WHERE Id BETWEEN 1001 AND 10000;
This would also require an index on [Table].Id to speed it up.
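If that index does not already exist, creating it is a one-liner (the index name is illustrative, and this assumes Id is not already the clustered primary key):

CREATE INDEX IX_Table_Id ON [Table] (Id);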

Sql Server Performance: table variable inner join vs multiple conditions in where clause

What is faster in MS SQL Server, a where clause with multiple conditions or an inner join after creating a table variable? For example:
select A.* from A where A.fk='one ' or A.fk='two ' or A.fk='three' ...etc.
vs
declare @temp table ([key] char(matchingWidth));
insert into @temp values ('one ');
insert into @temp values ('two ');
insert into @temp values ('three');
select A.* from A inner join @temp t on A.fk = t.[key];
I know normally the difference would be negligible; however, sadly the database I am querying uses the char type for primary keys...
If it helps, in my particular case, table A has a few million records, and there would usually be about a hundred ids I'd be querying for. The column is indexed, but not a clustered index.
EDIT: I am also open to the same thing with a temp table... although I was under the impression that a temp table and a table variable were virtually identical in terms of performance.
Thanks!
In most cases the first approach will win, because a table variable does not have statistics. You'll notice a big performance decrease with a large amount of data. When you have just a few values, there shouldn't be any noticeable difference.
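Since the edit also mentions temp tables: a temp table, unlike a table variable, does get statistics and can be indexed, so it is the usual middle ground. A sketch, with char(5) standing in for the real key width:

CREATE TABLE #keys (fk CHAR(5) PRIMARY KEY);
INSERT INTO #keys (fk) VALUES ('one'), ('two'), ('three');

SELECT A.*
FROM A
INNER JOIN #keys AS k ON A.fk = k.fk;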

Speed of using SQL temp table vs long list of variables in stored procedure

I have a stored procedure with a list of about 50 variables of different types, repeated about 8 times as part of different groups (declaration, initialization, loading, calculations, result, etc.).
In order to avoid duplication I want to use temp tables instead (not table variables, which do not bring the advantage I seek: inferred column types).
I've read that temp tables may start as "in memory" tables and then spill to disk as they grow, depending on the amount of memory and many other conditions.
My question is: if I use a temp table to store and manipulate one record with 50 fields, will it be much slower than using 50 variables?
I would not use a temp #Table unless I need to store temporary results for multiple rows. Our code uses lots of variables in some stored procedures. The ability to initialize during declaration helps reduce clutter.
Temp #Tables have some interesting side effects with regards to query compilation. If your stored procedure calls any child procedures, and queries in the child procs refer to this #Table, then these queries will be recompiled upon every execution.
Also, note that if you modify the temp #Table schema in any way, then SQL Server will not be able to cache the table definition. You'll be incurring query recompilation penalties in every query that refers to the table. Also, SQL Server will hammer various system tables as it continually creates and drops the table metadata.
On the other hand, if you don't call child procs, and you don't change the #Table schema, it might perform OK.
But stylistically, it does not make sense to me to add another join to a query just to get a variable for use in a WHERE clause. In other words, I'd rather see a lot of this:
declare @id int
select @id = ...

select tbl.id, ...
from tbl
inner join tbl2 ...
where tbl.id = @id
Instead of this:
create table #VarTbl (...)
insert into #VarTbl (...) select ...
select tbl.id, ...
from tbl
inner join tbl2 ...
cross join #VariableTable
where tbl.id = VarTbl_ID
Another thought: can you break apart the stored procedure into logical groups of operations? That might help readability. It can also help reduce query recompilations. If one child proc needs to be recompiled, this will not affect the parent proc or other child procs.
No, it will not be much slower; you would probably even have a hard time showing it is slower at all in normal use cases.
I always use temp tables in this situation; the performance difference is negligible, and readability and ease of use are better in my opinion. I normally start looking at using a temp table once I get above 10 variables, especially if they are related.
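As a sketch of what that can look like (the column names are hypothetical), SELECT ... INTO gives the inferred types the question asks about and keeps all the "variables" in a single row:

-- Hypothetical columns; SELECT ... INTO infers the column types.
SELECT c.CustomerId, c.Region, c.CreditLimit
INTO #params
FROM dbo.Customers AS c
WHERE c.CustomerId = 42;

-- Later steps read or update the single row instead of 50 separate variables.
UPDATE #params SET CreditLimit = CreditLimit * 1.1;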

Data Warehousing with minimal changes

Ok, I have a table that has 10 years' worth of data, and performance is taking a hit. I am planning on moving the older data to a separate historical table. The problem is I need to select from the first table if the row is in there and from the 2nd table if not. I do not want to do a join, because then it will always do a lookup on the 2nd table. HELP?
If you still need to query the data, in no way would I move it to another table. How big is the table now? What are the indexes? Have you considered partitioning the table?
If you must move to another table, you could query in stored procs with an IF statement: query the main table first, and then if the rowcount = 0, query the other table. It will be slower for records not in the main table, but should stay fast when they are in there. However, it won't handle cases where you need records from both tables.
Sample code to do this:
CREATE PROC myproc (@test INT)
AS
-- keyfield is a placeholder for whatever column @test filters on
SELECT field1, field2 FROM table1 WHERE keyfield = @test;

IF @@ROWCOUNT = 0
BEGIN
    SELECT field1, field2 FROM table2 WHERE keyfield = @test;
END
But really, partitioning and indexing correctly is probably your best choice. Also optimize existing queries: if you are using known poorly performing techniques such as cursors, correlated subqueries, views that call views, scalar functions, non-sargable where clauses, etc., just fixing your queries may mean you don't have to archive.
Sometimes, buying a better server would help as well.
Rather than using a separate historical table, you might want to look into partitioning the table by some function of the date (year perhaps?) to improve performance instead.
