Fastest way to delete all the data in a large table - sql-server

I had to delete all the rows from a log table that contained about 5 million rows. My initial try was to issue the following command in Query Analyzer:
delete from client_log
which took a very long time.

Check out truncate table which is a lot faster.

I discovered TRUNCATE TABLE in the MSDN Transact-SQL reference. For anyone interested, here are the remarks:
TRUNCATE TABLE is functionally identical to DELETE statement with no WHERE clause: both remove all rows in the table. But TRUNCATE TABLE is faster and uses fewer system and transaction log resources than DELETE.
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log.
TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes and so on remain. The counter used by an identity for new rows is reset to the seed for the column. If you want to retain the identity counter, use DELETE instead. If you want to remove table definition and its data, use the DROP TABLE statement.
You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint; instead, use DELETE statement without a WHERE clause. Because TRUNCATE TABLE is not logged, it cannot activate a trigger.
TRUNCATE TABLE may not be used on tables participating in an indexed view.
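For illustration, a minimal sketch of the identity-reset behaviour described above (the demo_log table and its columns are made up):
CREATE TABLE demo_log (id INT IDENTITY(1,1) PRIMARY KEY, msg VARCHAR(50));
INSERT INTO demo_log (msg) VALUES ('a');
INSERT INTO demo_log (msg) VALUES ('b');
DELETE FROM demo_log;                      -- logged row by row; identity counter is NOT reset
INSERT INTO demo_log (msg) VALUES ('c');   -- gets id = 3
TRUNCATE TABLE demo_log;                   -- deallocates pages; identity counter reset to the seed
INSERT INTO demo_log (msg) VALUES ('d');   -- gets id = 1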

There is a common myth that TRUNCATE somehow skips the transaction log.
This is a misunderstanding, and it is clearly addressed in MSDN.
The myth is invoked in several comments here. Let's eradicate it together ;)

For reference, TRUNCATE TABLE also works on MySQL.

I use the following method to zero out tables, with the added bonus that it leaves me with an archive copy of the table.
CREATE TABLE `new_table` LIKE `table`;                         -- empty copy with the same definition
RENAME TABLE `table` TO `old_table`, `new_table` TO `table`;   -- atomic swap; the data stays in `old_table`

Forget TRUNCATE and DELETE. Keep your table definition around (in case you want to recreate it) and just use DROP TABLE.

truncate table client_log
is your best bet; TRUNCATE kills all content in the table and its indexes, and resets any identity seeds you've got, too.

TRUNCATE TABLE is not platform-independent SQL. If you suspect that you might ever change database providers, be wary of using it.

On SQL Server you can use the Truncate Table command which is faster than a regular delete and also uses less resources. It will reset any identity fields back to the seed value as well.
The drawbacks of TRUNCATE are that it can't be used on tables that are referenced by foreign keys, and it won't fire any triggers. Also, you won't be able to roll back the data if anything goes wrong.

Note that TRUNCATE will also reset any auto-incrementing keys, if you are using those.
If you do not wish to lose your auto-incrementing keys, you can speed up the delete by deleting in sets (e.g., DELETE FROM table WHERE id > 1 AND id < 10000). This speeds things up significantly and in some cases prevents data from being locked up.
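A minimal T-SQL sketch of deleting in batches (SQL Server 2005+ syntax; the batch size is made up, and it assumes you don't need everything removed in a single transaction):
DECLARE @batch INT;
SET @batch = 10000;
WHILE 1 = 1
BEGIN
    -- Each pass removes at most @batch rows, keeping each transaction (and its locks) small.
    DELETE TOP (@batch) FROM client_log;
    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete
END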

Yes, well, deleting 5 million rows is probably going to take a long time. The only potentially faster way I can think of would be to drop the table, and re-create it. That only works, of course, if you want to delete ALL data in the table.

The suggestion of "Drop and recreate the table" is probably not a good one because that goofs up your foreign keys.
You ARE using foreign keys, right?

If you cannot use TRUNCATE TABLE because of foreign keys and/or triggers, you can consider the following:
drop all indexes;
do the usual DELETE;
re-create all indexes.
This may speed up DELETE somewhat.
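A rough sketch of that sequence (the index names and columns are made up; leave the clustered index/primary key in place):
-- Drop the nonclustered indexes so the DELETE doesn't have to maintain them.
DROP INDEX IX_client_log_date ON client_log;
DROP INDEX IX_client_log_user ON client_log;
DELETE FROM client_log;   -- still fully logged, but with less index maintenance
-- Recreate the indexes afterwards.
CREATE NONCLUSTERED INDEX IX_client_log_date ON client_log (log_date);
CREATE NONCLUSTERED INDEX IX_client_log_user ON client_log (user_id);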

I am revising my earlier statement:
You should understand that by using TRUNCATE the data will be cleared but nothing will be logged to the transaction log. Writing to the log is why DELETE will take forever on 5 million rows. I use TRUNCATE often during development, but you should be wary about using it on a production database because you will not be able to roll back your changes. You should immediately make a full database backup after doing a TRUNCATE to establish a new basis for restoration.
The above statement was intended to prompt you to be sure that you understand there is a difference between the two. Unfortunately, it is poorly written and makes unsupported claims, as I have not actually done any testing myself between the two. It is based on statements that I have heard from others.
From MSDN:
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log.
I just wanted to say that there is a fundamental difference between the two and because there is a difference, there will be applications where one or the other may be inappropriate.

DELETE FROM table_name;
Premature optimization may be dangerous. Optimizing may mean doing something weird, but if it works you may want to take advantage of it.
SELECT DbVendor_SuperFastDeleteAllFunction(tablename, BOZO_BIT) FROM dummy;
For speed I think it depends on...
The underlying database: Oracle, Microsoft, MySQL, PostgreSQL, others, custom...
The table, its contents, and related tables:
There may be deletion rules. Is there an existing procedure to delete all content in the table? Can this be optimized for the specific underlying database engine? How much do we care about breaking things or related data? Performing a DELETE may be the 'safest' way, assuming that other related tables do not depend on this table. Are there other tables and queries that are related to, or depend on, the data within this table? If we don't care much about this table being around, using DROP might be a fast method, again depending on the underlying database.
DROP TABLE table_name;
How many rows are being deleted? Is there other information that is quickly gleaned that will optimize the deletion? For example, can we tell if the table is already empty? Can we tell if there are hundreds, thousands, millions, billions of rows?

Related

What is the scope of a temporary index on a permanent table?

Within a stored procedure, I created this index:
CREATE NONCLUSTERED INDEX #IX_MyTempIndex ON dbo.MyPermTable (ColumnA, ColumnB) INCLUDE (ColumnC);
Days later, from a different session, a different user got the error "...an index or statistics with name '#IX_MyTempIndex' already exists on table 'dbo.MyPermTable'."
1) Is this the correct way to specify a temporary index on a permanent table?
2) What event or scope will cause the temporary index to disappear?
There is no such thing as a "Temporary Index".
You can make a temp table with an index, and by virtue of the table being temporary the index will be too, but that's not the same as what you're describing.
If you were allowed to make the index, why not keep the index that is necessary for your query? Simply evaluate it and make sure it is a good index for your table. You don't want an additional index that is nearly identical to an existing one except for one extra column, or some other inefficient arrangement.
At this point you need to ask yourself some serious questions about the query you're running:
Are you aggregating items in this table, and only this table?
Are you joining to other tables? How many? Are they indexed properly?
How often is this table updated, deleted from, inserted into, etc?
How often is my procedure run?
Given the answers to these, and possibly other, questions, you'll know whether you should in fact have an index on the table, or whether you should be creating a temp table or view to work on in your procedure. In either case, you will not want to create an index, do some work, and then drop the index; you'll lose more than you'll gain.
As an example, if you're doing some aggregations on values only in this table, and they take a while, it may be beneficial to simply copy the whole table into a view or temporary table. Copying releases your locks on the base table faster than running the aggregations against it directly would; if it doesn't, just do your work on the base table.
If you will use it over and over, use a view; you won't have to recreate it each time, and it will be up to date when you run your sproc. If performing your aggregations on the clone is still slow, you can put indexes on the view or temp table.
If your sproc requires joins, you should probably be indexing the involved tables. Otherwise, no matter what you do with one table, the unoptimized table(s) involved will eventually drag you down.
Creating indexes on a permanent table through a stored procedure is not recommended.
The index will be created on your permanent table and will be there forever unless you delete it.
If you create an index on a temporary table, it will be dropped when the session ends.
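As a rough illustration of the difference (names are made up): an index on a #temp table goes away with the table at the end of the session, while an index on a permanent table stays until it is explicitly dropped.
-- Index on a temp table: disappears when the temp table (or the session) does.
CREATE TABLE #work (ColumnA INT, ColumnB INT, ColumnC INT);
CREATE NONCLUSTERED INDEX IX_work_AB ON #work (ColumnA, ColumnB) INCLUDE (ColumnC);
-- Index on the permanent table: visible to every session until dropped.
CREATE NONCLUSTERED INDEX IX_MyPermTable_AB ON dbo.MyPermTable (ColumnA, ColumnB) INCLUDE (ColumnC);
DROP INDEX IX_MyPermTable_AB ON dbo.MyPermTable;   -- the only way it ever goes away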

What is the best way to implement soft deletion in a large relational Database? [duplicate]

Working on a project at the moment and we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field on each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance, each SELECT query will now need to ensure it does not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECT just get rows WHERE deleted_at IS NULL
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
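For example, a minimal sketch of such a view (table and column names are made up):
-- Applications query the view, so the soft-delete filter lives in one place.
CREATE VIEW dbo.active_users
AS
SELECT user_id, login_name, created_at
FROM   dbo.users
WHERE  is_deleted = '0';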
Having an is_deleted column is a reasonably good approach.
If it is Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on the is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform the 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent for you as a user: you'll be able to select across the entire table no matter if it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan all 101,000 rows.
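A rough sketch of that kind of list partitioning in Oracle (the table and column definitions are made up):
-- Live and deleted rows end up in physically separate partitions, so a query
-- filtering on is_deleted only has to scan the matching partition.
CREATE TABLE user_records (
    id         NUMBER PRIMARY KEY,
    payload    VARCHAR2(200),
    is_deleted NUMBER(1) NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_live    VALUES (0),
    PARTITION p_deleted VALUES (1)
);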
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc because they only carry one piece of meaning. A nullable deleted_at, archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table that has additional info like the time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
The second create results in a constraint violation because login=JOE is already present in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint across the login and deleted_at timestamp columns (a sketch follows after the comment below)
My own opinion is +1 for moving to a new table. It takes a lot of discipline to maintain the *AND deleted_at IS NULL* across all your queries (for all of your developers).
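A minimal sketch of the second technique (table and column names are made up); on SQL Server 2008+ a filtered unique index is another way to say "only one live JOE at a time":
-- Option 2: the live row has deleted_at = NULL, so only one live 'JOE' can exist,
-- while soft-deleted 'JOE' rows with distinct deleted_at timestamps are allowed.
ALTER TABLE users
    ADD CONSTRAINT uq_users_login_deleted_at UNIQUE (login, deleted_at);
-- Alternative (SQL Server 2008+): enforce uniqueness only over rows that are not soft-deleted.
CREATE UNIQUE NONCLUSTERED INDEX ux_users_login_live
    ON users (login)
    WHERE deleted_at IS NULL;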
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having record of when it was deleted, why, and by whom.
Adding where deleted=0 to all your queries will slow them down significantly and hinder the usage of any of the indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention what product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted=0, mitigating some of the negatives of this particular approach.
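For instance, a sketch of such a filtered covering index on SQL Server 2008+ (column names are made up, and it assumes is_deleted is a bit/int column):
-- Covers common lookups on live rows only; soft-deleted rows are excluded from the index.
CREATE NONCLUSTERED INDEX ix_users_live_name
    ON dbo.users (last_name, first_name)
    INCLUDE (email)
    WHERE is_deleted = 0;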
Something that I use on projects is a statusInd TINYINT NOT NULL DEFAULT 0 column.
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
This scales well and is data-centric, keeping the data footprint pretty small (key for 350GB+ databases with realtime concerns). Using alternatives such as extra tables or triggers has some overhead that, depending on the need, may or may not work for you.
SOX-related audits may require more than a single field to help in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different configs, e.g. published, private, deleted, needsApproval...
Create another schema and grant it all on your data schema.
Implement VPD (Virtual Private Database) on your new schema so that each and every query has a predicate appended to it that allows selection of only the non-deleted rows.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete

Cascade on delete performance: What's the fastest way to delete a row and its 1-many rows?

I have a database in which there is a parent "Account" row that then has a 1-Many relationship with another table, and that table has a 1-Many relationship with another table. This goes on about 6 levels deep (with Account at the top). At the very bottom there could possibly be thousands (can even go beyond 100k) of rows. On each table there is a foreign key set to cascade on delete.
The issue is, that if I try to delete the very top row (an "Account"), it can take minutes, sometimes well over 10 minutes. Is there a faster way to delete all the rows (such as maybe going from the bottom up in individual delete statements) or is cascading pretty much it?
I am using MSSQL 2005 & MSSQL 2008 for the server, and LINQ to SQL to perform the delete, although I can use a T-SQL statement if it is faster.
I've tried doing the delete from SQL Server Management Studio too, and that takes just as long.
Edit: we have tried re-indexing the database, with negligible difference (maybe a minute or two). I appreciate all your answers; it looks like I am going to have to start writing some code to do soft deletes!
A delete is a delete, and if you want to delete massive amounts of rows (100k), it will take a while.
If you do a soft delete (set a status to "D", for example), you can then run a job to actually delete the rows in batches of, say, 1,000 or so over time; that may work better for you. The soft delete should update only the header row and would be very fast. You'd need to code your application to ignore these "D" status rows and their children, though.
EDIT
To expand on @Kane's comment: you could do only the soft delete, or you could do a soft delete followed by a batch process to do the actual deletes if you really want to. I'd just stick with the soft deletes if drive space is not an issue.
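A rough sketch of that pattern (the table, column, and status values are made up, and only one child level is shown; the real schema would repeat this from the bottom level up):
-- Immediate, cheap "delete" from the application's point of view.
UPDATE dbo.Account SET status = 'D' WHERE account_id = 42;
-- A scheduled job later removes the flagged data a batch at a time.
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) d
    FROM   dbo.ChildTable AS d
    JOIN   dbo.Account    AS a ON a.account_id = d.account_id
    WHERE  a.status = 'D';
    IF @@ROWCOUNT = 0 BREAK;
END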
Have you indexed all the foreign keys? That's a common issue.
It sounds like you might have indexing issues.
Assume a parent-to-child relationship on column ParentId. By definition, column ParentId in the Parent table must have a primary or unique constraint, and thus be indexed. The child table, however, need not be indexed on ParentId. When you delete a parent entry, SQL has to delete all rows in the child table that have been assigned that foreign key... and if that column is not indexed, the work will have to be done with table scans. This could occur once for each table in your "deletion chain".
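A minimal sketch of indexing such a foreign key column (the Child table and ParentId column are made up):
-- Without this, every cascaded delete from Parent scans Child for matching rows.
CREATE NONCLUSTERED INDEX IX_Child_ParentId ON dbo.Child (ParentId);
-- Repeat for the foreign key column of every table down the deletion chain.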
Of course, it might just be volume. Deleting a few k rows from tables of 100k+ rows with multiple indexes, even if the "delete lookup" field is indexed, could take significant time, and don't forget locking and blocking if you've got users accessing your system during the delete!
Deferring the delete until a scheduled maintenance window, as KM suggests, would definitely be an option, though it might require a serious modification to your code base.

Delete all rows in table

Normally I would do a DELETE FROM XXX, but on this table that's very slow; it normally has about 500k to 1m rows in it (one column is a varbinary(MAX), if that matters).
Basically I'm wondering if there is a quick way to empty the table of all content; it's actually quicker to drop and recreate it than to delete the content via the DELETE SQL statement.
The reason I don't want to recreate the table is that it's heavily used, and I assume drop/recreate will destroy indexes and stats gathered by SQL Server.
I'm also hoping there is a way to do this because there is a "clever" way to get the row count via sys.sysindexes, so I'm hoping there is an equally clever way to delete content.
TRUNCATE TABLE is faster than DELETE FROM XXX. DELETE is slow because it works one row at a time. There are a few situations where TRUNCATE doesn't work, which you can read about on MSDN.
As others have said, TRUNCATE TABLE is far quicker, but it does have some restrictions (taken from here):
You cannot use TRUNCATE TABLE on tables that:
- Are referenced by a FOREIGN KEY constraint. (You can truncate a table that has a foreign key that references itself.)
- Participate in an indexed view.
- Are published by using transactional replication or merge replication.
For tables with one or more of these characteristics, use the DELETE statement instead.
The biggest drawback is that if the table you are trying to empty has foreign keys pointing to it, then the truncate call will fail.
You can rename the table in question, create a table with an identical schema, and then drop the original table at your leisure.
See the MySQL 5.1 Reference Manual for the RENAME TABLE and CREATE TABLE commands.
RENAME TABLE tbl TO tbl_old;
CREATE TABLE tbl LIKE tbl_old;
DROP TABLE tbl_old; -- at your leisure
This approach can help minimize application downtime.
I would suggest using TRUNCATE TABLE; it's quicker and uses fewer resources than DELETE FROM xxx.
Here are the related reference articles:
Truncate table in MS SQL Server
Truncate table in MySQL

Deleting Database Rows and their References - Best Practices

How do I go about deleting a row that is referenced by many other tables, either as a primary key or as a foreign key?
Do I need to delete each reference in the appropriate order, or is there an 'auto' way to perform this in, for example, linq to sql?
If you're performing all of your data access through stored procedures then your delete stored procedure for the master should take care of this. You need to maintain it when you add a new related table, but IMO that requires you to think about what you're doing, which is a good thing.
Personally, I stay away from cascading deletes. It's too easy to accidentally delete a slew of records when the user should have been warned about existing children instead.
Many times the best way to delete something in a database is to just "virtually" delete it by setting an IsDeleted column, and then ignoring the row in all other queries.
Deletes can be very expensive for heavily linked tables, and the locks can cause other queries to fail while the delete is happening.
You can just leave the "IsDeleted" rows in the system forever (which might be helpful for auditing), or go back and delete them for real when the system is idle.
If you have the foreign keys set with ON DELETE CASCADE, they'll take care of pruning your database with just DELETE FROM master WHERE id = :x
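For example, a minimal sketch of such a foreign key (the master/detail tables and columns are made up):
-- Deleting a master row automatically removes the detail rows that reference it.
ALTER TABLE detail
    ADD CONSTRAINT fk_detail_master
    FOREIGN KEY (master_id) REFERENCES master (id)
    ON DELETE CASCADE;
DELETE FROM master WHERE id = 42;   -- detail rows referencing id 42 are removed as well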
