Delete a large amount of data on SQL Server - sql-server

I need to delete 900,000,000 records in SQL Server.
I would like to know the best way.
I wrote the following DELETE:
DeleteTable:
DELETE TOP(1000) TAB1
FROM TABLE1 TAB1
LEFT JOIN TABLE2 TAB2 ON TAB1.ID_PRODUCT = TAB2.ID_PRODUCT
WHERE TAB2.ID_PRODUCT IS NULL;
IF @@ROWCOUNT <> 0 GOTO DeleteTable;
I would like to know if there is a way to optimize this query for better delete performance.
Thank you.

Deleting 900,000,000 rows is going to take a long time, and you might run out of temporary storage unless you have lots and lots of it. Your approach of deleting rows in increments is one approach.
If your recovery model is not set to "simple", then you might want to consider switching it. Combined with your incremental delete approach, that will at least keep the log from filling up.
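For reference, a minimal sketch of checking and changing the recovery model (the database name is a placeholder; weigh the backup implications before switching in production):
-- Check the current recovery model (database name is a placeholder)
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'YourDatabase';

-- Switch to simple recovery so the log can truncate at checkpoints
ALTER DATABASE YourDatabase SET RECOVERY SIMPLE;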
For your query, you want tab2(id_product) to have an index. I'm not sure if an index on tab1(id_product) would really help.
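For example, using the table and column names from the question, the supporting index might look like this (a sketch; adjust to your actual schema):
-- Index to support the anti-join lookup against TABLE2
CREATE INDEX IX_TABLE2_ID_PRODUCT ON TABLE2 (ID_PRODUCT);
-- An index on TABLE1 as well, though it may not help as much here
CREATE INDEX IX_TABLE1_ID_PRODUCT ON TABLE1 (ID_PRODUCT);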
Another option is just to recreate the table, because inserts and table creation are much more efficient than deletes.
For this, you can essentially do:
select t1.*
into temp_tab1
from tab1 t1
where exists (select 1 from table2 t2 where t2.id_product = t1.id_product);
truncate table tab1; -- back it up first!
insert into tab1
select *
from temp_tab1;
Note: If you have an identity column, you may want to set identity insert on. Also, if you have foreign key constraints to this table, then you need extra care.
Finally, if this is something that you need to do repeatedly, then you should consider partitioning the table. It is much more efficient to drop partitions than to delete rows.
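To illustrate the partitioning point: if the table were partitioned on a suitable key, removing old data becomes a metadata operation instead of a row-by-row delete. A rough sketch, assuming a hypothetical partitioned table and an empty archive table with the same structure (TRUNCATE ... WITH PARTITIONS requires SQL Server 2016 or later):
-- Switch an entire partition out to an archive table, then drop it
ALTER TABLE tab1 SWITCH PARTITION 1 TO tab1_archive;
DROP TABLE tab1_archive;
-- Or, on SQL Server 2016+, truncate just that partition in place
TRUNCATE TABLE tab1 WITH (PARTITIONS (1));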

You need to be careful if the database is highly transactional and the table has heavy read-write activity, mainly because you may be blocking other sessions while the delete is in progress. A slower but less impactful approach is to use a cursor to delete the records. The way to do it is to throw the product_id values into a #table and delete from the actual table using product_id as the predicate.
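A rough sketch of that idea, using the column names from the question and small batches instead of a literal cursor (the batch size and temp table name are arbitrary):
-- Stage the candidate keys once, so the expensive anti-join runs only one time
SELECT TAB1.ID_PRODUCT
INTO #to_delete
FROM TABLE1 TAB1
LEFT JOIN TABLE2 TAB2 ON TAB1.ID_PRODUCT = TAB2.ID_PRODUCT
WHERE TAB2.ID_PRODUCT IS NULL;

CREATE CLUSTERED INDEX IX_to_delete ON #to_delete (ID_PRODUCT);

-- Delete in small batches so locks and log growth stay manageable
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) T
    FROM TABLE1 T
    INNER JOIN #to_delete D ON T.ID_PRODUCT = D.ID_PRODUCT;

    IF @@ROWCOUNT = 0 BREAK;
END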

Related

How to use results from first query in later queries within a DB Transaction

A common case for DB transactions is performing operations on multiple tables, as you can then easily rollback all operations if one fails. However, a common scenario I run into is wanting to insert records to multiple tables where the later inserts need the serial ID from the previous inserts.
Since the ID is not generated/available until the transaction is actually committed, how can one accomplish this? If you have to commit after the first insert in order to get the ID and then execute the second insert, it seems to defeat the purpose of the transaction in the first place because after committing (or if I don't use a transaction at all) I cannot rollback the first insert if the second insert fails.
This seems like such a common use case for DB transactions that I can't imagine it would not be supported in some way. How can this be accomplished?
A CTE (common table expression) with data-modifying statements should cover your need; see the manual.
A typical example:
WITH cte AS (INSERT INTO table_A (id) VALUES ... RETURNING id)
INSERT INTO table_B (id) SELECT id FROM cte
see the demo in dbfiddle
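For a more self-contained illustration of the pattern (PostgreSQL syntax; the tables and columns below are made up for the example):
-- Parent and child tables invented for the illustration
CREATE TABLE table_a (id serial PRIMARY KEY, name text);
CREATE TABLE table_b (a_id int REFERENCES table_a(id), note text);

-- Insert the parent row and reuse its generated id for the child row,
-- all in a single atomic statement
WITH cte AS (
    INSERT INTO table_a (name) VALUES ('example') RETURNING id
)
INSERT INTO table_b (a_id, note)
SELECT id, 'child row' FROM cte;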

Indexing a single-use temporary table

A colleague works in a business which uses Microsoft SQL Server. Their team creates stored procedures that are executed daily to create data extracts. The underlying tables are huge (some have billions of rows), so most stored procedures are designed such that first they extract only the relevant rows of these huge tables into temporary tables, and then the temp tables are joined with each other and with other smaller tables to create a final extract. Something similar to this:
SELECT COL1, COL2, COL3
INTO #TABLE1
FROM HUGETABLE1
WHERE COL4 IN ('foo', 'bar');
SELECT COL1, COL102, COL103
INTO #TABLE2
FROM HUGETABLE2
WHERE COL14 = 'blah';
SELECT COL1, COL103, COL306
FROM #TABLE1 AS T1
JOIN #TABLE2 AS T2
ON T1.COL1 = T2.COL1
LEFT JOIN SMALLTABLE AS ST
ON T1.COL3 = ST.COL3
ORDER BY T1.COL1;
Generally, the temporary tables are not modified after their creation (so no subsequent ALTER, UPDATE or INSERT operations). For the purpose of this discussion, let's assume the temporary tables are only used once later on (so only one SELECT query would rely on them).
Here is the question: is it a good idea to index these temporary tables after they are created and before they are used in the subsequent query?
My colleague believes that creating an index will make the join and the sort operations faster. I believe, however, that the total time will be larger, because index creation takes time. In other words, I assume that except for edge cases (like a temporary table which itself is extremely large, or the final SELECT query is very complex), SQL Server will use the statistics it has on the temporary tables to optimize the final query, and in doing so it will effectively index the temp tables as it sees fit.
In other words, I am used to thinking that creating an index is only useful if you know the table is used often; a single-use temporary table that is dropped once the stored procedure is complete is not worth indexing.
Neither of us knows enough about SQL Server optimizer to know in what ways we are right or wrong. Can you please help us better understand which of our assumptions are closer to truth?
Your colleague is probably correct: even if a table is only going to be used in a single query, without seeing the query (and even if we do, we still don't have a great idea of what its execution plan looks like) we have no idea how many times SQL Server will need to find data within various columns of each of those tables for joins, sorts, etc.
However, we'll never know for sure until it's actually done both ways and the results are measured and compared.
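One way to run that comparison is simply to execute the extract both ways with session statistics switched on and compare the output; a sketch (the procedure name is hypothetical):
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Run the extract both with and without the temp-table indexes
EXEC dbo.MyExtractProc;  -- hypothetical procedure name

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;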
If you are doing daily data extracts with billions of rows, I would recommend you use staging tables instead of temporary tables. This will isolate your extracts from other processes using tempdb.
Here is the question: is it a good idea to index these temporary tables after they are created and before they are used in the subsequent query?
Create the index after loading the data into the temp table. This will eliminate fragmentation, and statistics will be created.
The optimizer uses statistics to generate the optimal plan, so if you don't have statistics, it can dramatically affect your query performance, especially for large datasets.
The example below compares index creation after versus before the data load into a temp table:
/* Create index after data load into temp table -- stats is created */
CREATE TABLE #temp ( [text] varchar(50), [num] int);
INSERT INTO #temp([text], [num]) VALUES ('aaa', 1), ('bbb', 2) , ('ccc',3);
CREATE UNIQUE CLUSTERED INDEX [IX_num] ON #temp (num);
DBCC SHOW_STATISTICS ('tempdb..#temp', 'IX_num');
/* Create index before data load into temp table -- stats is not created */
CREATE TABLE #temp_nostats ( [text] varchar(50), [num] int);
CREATE UNIQUE CLUSTERED INDEX [IX_num] ON #temp_nostats (num);
INSERT INTO #temp_nostats([text], [num]) VALUES ('aaa', 1), ('bbb', 2) , ('ccc',3);
DBCC SHOW_STATISTICS ('tempdb..#temp_nostats', 'IX_num');
You need to test whether the index will help you or not. You also need to balance how many indexes you create, because too many indexes can hurt performance as well.
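Applied to the example in the question, the test could be as simple as adding clustered indexes on the join column after the temp tables are loaded and comparing timings (a sketch using the question's names; whether it pays off depends on the data sizes):
-- Index the join column once the temp tables are populated
CREATE CLUSTERED INDEX IX_T1_COL1 ON #TABLE1 (COL1);
CREATE CLUSTERED INDEX IX_T2_COL1 ON #TABLE2 (COL1);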

Fast replace table in T-SQL with another

I have two tables with the same structure (keys/columns/etc.). I want to replace the data in the first table with the data from the second one. I use the following code to do it:
DROP TABLE T1
SELECT *
INTO T1
FROM T2
DROP TABLE T2
but this code is quite slow when T2 is large. The T2 table is temporary, so I want to rewrite it as:
drop table T1
EXEC sp_rename 'T2', 'T1'
This should execute very fast for any size of table, but am I missing something here? Are there side effects that may break this code? I'm not very familiar with dynamic SQL, so please advise.
Renaming the tables should be fine. Sometimes, there can be issues with triggers or foreign key constraints (and the like). However, you are dropping T1 anyway, so this is not a concern.
The one issue is where the data is actually stored. If by temporary table you mean a table whose name starts with #, then this is not a good approach, because temporary tables live in tempdb, separately from your other tables, and sp_rename cannot move a table between databases. Instead, create the table in the same place where T1 would be stored, perhaps calling it something like temp_T1.
You might want to revisit your logic to see if there is a way to "reconstruct" T1 in place. However, when there are large numbers of updates and deletes in the processing, recreating the table is often the fastest approach.
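If you do go with the rename approach (with T2 as a regular table in the same database, e.g. the temp_T1 suggested above), wrapping the swap in a transaction keeps other sessions from catching a moment when T1 does not exist; they simply block briefly on the schema lock. A minimal sketch:
BEGIN TRANSACTION;
DROP TABLE T1;
EXEC sp_rename 'T2', 'T1';
COMMIT TRANSACTION;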

Netezza left outer join query performance

I have a question related to Netezza query performance. I have two tables, Table A and Table B, where Table B is a subset of Table A with some data altered. I need to update those new values in Table A from Table B.
We can take two approaches here:
1) Left outer join the tables, select the relevant columns, and insert into the target table
2) Insert Table A's data into the target table, then update those values from Table B using a join
I tried both, and logically they are the same, but the explain plan gives different costs.
For the normal select:
a) Sub-query Scan table "TM2" (cost=0.1..1480374.0 rows=8 width=4864 conf=100)
For the update:
b) Hash Join (cost=356.5..424.5 rows=2158 width=27308 conf=21)
For the left outer join:
Sub-query Scan table "TM2" (cost=51.0..101474.8 rows=10000000 width=4864 conf=100)
From this, I feel the left outer join is better. Can anyone put some thought into this and guide me?
Thanks
The reason that the cost of insert into table_c select ... from table_a; update table_c set ... from table_b; is higher is that you're inserting, deleting, then inserting again. Updates in Netezza mark the records being updated as deleted and then insert new rows with the updated values. Once the data is written to an extent, it's never (to my knowledge) altered.
With insert into table_c select ... from table_a join table_b using (...); you're only inserting once, thereby only updating all the zone maps once. The cost will be noticeably lower.
Netezza does an excellent job of keeping you away from the disk on reads, but it will write to the disk as often as you tell it to. In the case of updates, seemingly more so. Try to only write as often as is necessary to gain benefits of new distributions and co-located joins. Any more than that, and you're just using excess commit actions.
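A sketch of the single-insert approach described above, with made-up table and column names; COALESCE takes the altered value from table_b when one exists (this assumes table_b never carries intentional NULLs):
-- Build the target in one pass: take table_b's value when present,
-- otherwise keep table_a's original value
INSERT INTO table_c (id, col1, col2)
SELECT a.id,
       COALESCE(b.col1, a.col1),
       COALESCE(b.col2, a.col2)
FROM table_a a
LEFT OUTER JOIN table_b b ON a.id = b.id;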

Data Warehousing with minimal changes

OK, I have a table that has 10 years' worth of data, and performance is taking a hit. I am planning on moving the older data to a separate historical table. The problem is I need to select from the first table if the record is in there, and from the second table if not. I do not want to do a join, because then it will always do a lookup on the second table. Help?
If you still need to query the data, I would in no way move it to another table. How big is the table now? What are the indexes? Have you considered partitioning the table?
If you must move it to another table, you could query in stored procs with an IF statement: query the main table first, and then, if the rowcount is 0, query the other table. It will be slower for records not in the main table but should stay fast if they are in there. However, it won't handle cases where you need records from both tables.
Sample code to do this:
CREATE PROC myproc (@test INT)
AS
SELECT field1, field2 FROM table1 WHERE field1 = @test -- assuming field1 is the lookup key
IF @@ROWCOUNT = 0
BEGIN
    SELECT field1, field2 FROM table2 WHERE field1 = @test
END
But really, partitioning and indexing correctly is probably your best choice. Also optimize existing queries. If you are using known poorly performing techniques such as cursors, correlated subqueries, views that call views, scalar functions, non-sargable WHERE clauses, etc., just fixing your queries may mean you don't have to archive.
Sometimes, buying a better server would help as well.
Rather than using a separate historical table, you might want to look into partitioning the table by some function of the date (year perhaps?) to improve performance instead.
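A rough sketch of what date-based partitioning could look like, assuming a hypothetical OrderDate column and yearly boundaries (filegroup mapping simplified to PRIMARY; the boundary values are examples only):
-- One partition per year
CREATE PARTITION FUNCTION pf_ByYear (date)
AS RANGE RIGHT FOR VALUES ('2015-01-01', '2016-01-01', '2017-01-01');

CREATE PARTITION SCHEME ps_ByYear
AS PARTITION pf_ByYear ALL TO ([PRIMARY]);

-- Create (or rebuild) the table on the partition scheme, keyed on the date
CREATE TABLE dbo.SalesHistory
(
    OrderDate date NOT NULL,
    Amount money NOT NULL
) ON ps_ByYear (OrderDate);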
