Deleting Database Rows and their References-Best Practices

How do I go about deleting a row that is referenced by many other tables, either as a primary key or as a foreign key?
Do I need to delete each reference in the appropriate order, or is there an 'auto' way to perform this in, for example, linq to sql?

If you're performing all of your data access through stored procedures then your delete stored procedure for the master should take care of this. You need to maintain it when you add a new related table, but IMO that requires you to think about what you're doing, which is a good thing.
Personally, I stay away from cascading deletes. It's too easy to accidentally delete a slew of records when the user should have been warned about existing children instead.

Many times the best way to delete something in a database is to just "virtually" delete it by setting an IsDeleted column, and then ignoring the row in all other queries.
Deletes can be very expensive for heavily linked tables, and the locks can cause other queries to fail while the delete is happening.
You can just leave the "IsDeleted" rows in the system forever (which might be helpful for auditing), or go back and delete them for real when the system is idle.
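A minimal sketch of the "virtual delete" idea, assuming SQLite through Python's sqlite3 module as a stand-in for the real database. The Customer table, the ActiveCustomer view and the sample rows are made up for illustration; only the IsDeleted column comes from the answer above.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the production database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        Id        INTEGER PRIMARY KEY,
        Name      TEXT NOT NULL,
        IsDeleted INTEGER NOT NULL DEFAULT 0
    );
    -- Queries read from the view so they never see soft-deleted rows.
    CREATE VIEW ActiveCustomer AS
        SELECT Id, Name FROM Customer WHERE IsDeleted = 0;
    INSERT INTO Customer (Id, Name) VALUES (1, 'Alice'), (2, 'Bob');
""")

# "Virtual" delete: flag the row instead of removing it.
conn.execute("UPDATE Customer SET IsDeleted = 1 WHERE Id = ?", (2,))
conn.commit()

active = [row[0] for row in conn.execute("SELECT Name FROM ActiveCustomer ORDER BY Id")]
stored = conn.execute("SELECT COUNT(*) FROM Customer").fetchone()[0]
print(active, stored)
```

Routing reads through a view is just one way to make sure "all other queries" skip the flagged rows; adding the filter to each query by hand works too, but is easier to forget.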

If you have the foreign keys set with ON DELETE CASCADE, it'll take care of pruning your database with just DELETE FROM master WHERE id = :x
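Roughly what that looks like end to end, as a sketch that assumes SQLite via Python's sqlite3 rather than SQL Server. The detail table and the sample rows are invented; note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, which is a SQLite quirk and not something SQL Server needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: FK enforcement is off by default
conn.executescript("""
    CREATE TABLE master (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE detail (
        id        INTEGER PRIMARY KEY,
        master_id INTEGER NOT NULL REFERENCES master(id) ON DELETE CASCADE
    );
    INSERT INTO master VALUES (1, 'keep'), (2, 'drop');
    INSERT INTO detail VALUES (10, 1), (20, 2), (21, 2);
""")

# One statement against the parent; the children of row 2 go with it.
conn.execute("DELETE FROM master WHERE id = :x", {"x": 2})
conn.commit()

remaining_details = [r[0] for r in conn.execute("SELECT id FROM detail ORDER BY id")]
print(remaining_details)
```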

Related

Self foreign key may cause cycles or multiple cascade paths

In SQL Server 2005 I just struck the infamous error message:
Introducing FOREIGN KEY constraint XXX on table YYY may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
Now, StackOverflow has several topics about this error message, so I've already got the solution (in my case I'll have to use triggers), but I'm curious as to why there is such a problem at all.
As I understand it, there are basically two scenarios that they want to avoid - a cycle and multiple paths. A cycle would be where two tables have cascading foreign keys to each other. OK, a cycle can span several tables too, but this is the basic case and will be easier to analyze.
Multiple paths would be when TableA has foreign keys to TableB and TableC, and TableB also has a foreign key to TableC. Again - this is the minimum basic case.
I cannot see any problems that would arise when a record would get deleted or updated in any of those tables. Sure, you might need to query the same table multiple times to see which records need updating/deleting, but is that really a problem? Is this a performance issue?
In other SO topics people go as far as to label using cascades as "risky" and state that "resolving cascade paths is a complex problem". Why? Where is the risk? Where is the problem?

You have a child table with 2 cascade paths from the same parent: one "delete", one "null".
What takes precedence? What do you expect afterwards? etc
Note: A trigger is code and can add some intelligence or conditions to a cascade.

The reason we forbid using cascade delete has to do with performance and locking. Yes, it's not so bad when you delete one record, but sooner or later you will need to delete a large group of records and your database will come to a standstill.
If you are deleting enough records, SQL Server might escalate to a table lock and no one can do anything with the table until it is finished.
We recently moved one of our clients to his own server. As part of the deal we also then had to delete all of that client's records from our original server. Deleting all his information in batches (so as not to cause problems with other users) took a couple of months. If we had cascade delete set up, the database would have been inaccessible to the other clients for a long time as millions of records were deleted in one transaction and hundreds of tables were locked until the transaction was done.
I could also see a scenario where a deadlock might have occurred in using cascade delete because we have no control over the order the cascade path would have taken and our database is somewhat denormalized, with clientid appearing in most tables. So if it locked the one table that had a foreign key also to a third table as well as the client table that was in a different path, it possibly couldn't check that table in order to delete from the third table because this is all one transaction and the locks wouldn't be released until it was done. So possibly it wouldn't have let us set up cascade deletes if it saw the possibility of creating deadlocks in the transaction.
Another reason to avoid cascading deletes is that sometimes the existence of a child record is reason enough not to delete the parent record. For instance, if you have a customer table and that customer has had orders in the past, you would not want to delete him and lose the information on the actual order.
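To illustrate that last point, here is a small sketch of a foreign key declared without any cascade, so that existing orders block the delete of their customer. It assumes SQLite via Python's sqlite3 as a stand-in for SQL Server, and the Customer and CustomerOrder tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE CustomerOrder (
        OrderId    INTEGER PRIMARY KEY,
        -- no ON DELETE clause: the default action refuses to orphan the order
        CustomerId INTEGER NOT NULL REFERENCES Customer(CustomerId)
    );
    INSERT INTO Customer VALUES (1, 'Acme');
    INSERT INTO CustomerOrder VALUES (100, 1);
""")

try:
    conn.execute("DELETE FROM Customer WHERE CustomerId = 1")
    delete_blocked = False
except sqlite3.IntegrityError:
    # The database rejected the delete; the application can now warn the user.
    conn.rollback()
    delete_blocked = True

customers_left = conn.execute("SELECT COUNT(*) FROM Customer").fetchone()[0]
print(delete_blocked, customers_left)
```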

Consider a table of employees:
CREATE TABLE Employees
(
EmpID INTEGER NOT NULL PRIMARY KEY,
Name VARCHAR(40) NOT NULL,
-- every employee, including the boss, points at a manager row
MgrID INTEGER NOT NULL REFERENCES Employees(EmpID) ON DELETE CASCADE
);
INSERT INTO Employees VALUES (1, 'Bill', 1);
INSERT INTO Employees VALUES (23, 'Steve', 1);
INSERT INTO Employees VALUES (234212, 'Helen', 23);
Now suppose Bill retires:
DELETE FROM Employees WHERE Name = 'Bill';
Ooooppps; everyone just got sacked!
[We can debate whether the details of the syntax are correct; the concept stands, I think.]
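For anyone who wants to watch the concept play out, the following sketch runs the same scenario against SQLite through Python's sqlite3. That is an assumption made purely so the example is runnable: SQLite accepts a self-referencing cascade, whereas SQL Server may refuse this constraint with the "cycles or multiple cascade paths" error quoted in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE Employees (
        EmpID INTEGER NOT NULL PRIMARY KEY,
        Name  VARCHAR(40) NOT NULL,
        MgrID INTEGER NOT NULL REFERENCES Employees(EmpID) ON DELETE CASCADE
    );
    INSERT INTO Employees VALUES (1, 'Bill', 1);
    INSERT INTO Employees VALUES (23, 'Steve', 1);
    INSERT INTO Employees VALUES (234212, 'Helen', 23);
""")

# Bill retires; the cascade follows MgrID down the whole reporting chain.
conn.execute("DELETE FROM Employees WHERE Name = 'Bill'")
conn.commit()

survivors = [r[0] for r in conn.execute("SELECT Name FROM Employees")]
print(survivors)
```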

I think the problem is that when you make one path "ON DELETE CASCADE" and the other "ON DELETE RESTRICT" or "NO ACTION", the outcome is unpredictable. It depends on which delete trigger (a cascade is also a trigger, just one you don't have to build yourself) is executed first.

I agree that cascades are "risky" and should be avoided. (I personally prefer cascading the changes manually rather than having SQL Server automatically take care of them.) This is also because even if SQL Server deleted millions of rows, the output would still show up as
(1 row(s) affected)

I think whether or not to use a ON DELETE CASCADE option is a question of the business model you are implementing. A relationship between two business objects could be either a simple "association", where both ends of the relationship are related, but otherwise independent objects the lifecycle of which are different and controlled by other logic. There are, however, also "aggregation" relationships, where one object might actually be seen as the "parent" or "owner" of a "child" or "detail" object. There is the even stronger notion of a "composition" relationship, where an object solely exists as a composition of a number of parts.
In the "association" case, you usually won't declare an ON DELETE CASCADE constraint. For aggregations or compositions, however, ON DELETE CASCADE helps you map your business model to the database more accurately and in a declarative way.
This is why it annoys me that MS SQL Server restricts the use of this option to a single cascade path. If I'm not mistaken, many other widely used SQL database systems do not impose such restrictions.

Is there a way to specify a cascading delete, on a query by query basis?

On a project I'm working on we've got some tables with numerous foreign key relationships, and because it's in early development the number of relationships will likely change.
We'd like to be able to delete records from certain tables but are reluctant to set up cascading deletes on the foreign key relationships.
We've considered the following options:
Ignore our instincts and set up cascading deletes anyway
Instead of a cascading delete use set null
Write and maintain a custom script to delete all the foreign key records manually
None of these options are great :-(
We don't want to set up cascading deletes because we don't want that to be the default behaviour.
We don't want to use cascading nulls because leaving lots of orphans would be useless.
Writing a custom script would work, but it's not very scalable or maintainable. Writing a script for a single table or even a few tables is ok, but for every table? Seriously? There must be a better way! At least I hope there's a better way.
For the "Too Long Didn't Read" crowd; A quick summary
Is there a way of specifying that you'd like a delete to cascade, on a query by query basis?
Perhaps something that looks like this:
-- wouldn't it be nice if this was a real command!
CASCADE DELETE FROM MyTable WHERE ID = @ID

I am not sure I really see the usefulness of manual cascade option in your case.
The CASCADE is there to maintain certain relationships between entities and if you, and I quote,
don't want it to be default behaviour
then you can still:
issue multiple queries that will do your clean-up 'manually'
use stored procedures if you want to do it with a single call, for example you might then have CALL CASCADE_DELETE('table_name', 'id = 3') (and you get a single point where you would maintain your clean-up scripts)
use triggers if you want to be fancy (for example, you could create simple views on the original tables where deleting from these views would cascade through INSTEAD OF triggers, and deleting from the original tables would not cascade)
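The first two options can be sketched as one helper that deletes the children and then the parent inside a single transaction. This is only an illustration: it assumes SQLite via Python's sqlite3 instead of a real stored procedure, and Parent, ChildA, ChildB and delete_parent_with_children are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE Parent (Id INTEGER PRIMARY KEY);
    CREATE TABLE ChildA (Id INTEGER PRIMARY KEY,
                         ParentId INTEGER NOT NULL REFERENCES Parent(Id));
    CREATE TABLE ChildB (Id INTEGER PRIMARY KEY,
                         ParentId INTEGER NOT NULL REFERENCES Parent(Id));
    INSERT INTO Parent VALUES (1), (2);
    INSERT INTO ChildA VALUES (1, 1), (2, 2);
    INSERT INTO ChildB VALUES (1, 1);
""")

# The single place to update when a new child table is added.
CHILD_TABLES = ("ChildA", "ChildB")

def delete_parent_with_children(connection, parent_id):
    """Delete the child rows first, then the parent, all in one transaction."""
    with connection:  # commits on success, rolls back if any statement fails
        for table in CHILD_TABLES:
            # Table names come from the fixed tuple above, never from user input.
            connection.execute(f"DELETE FROM {table} WHERE ParentId = ?", (parent_id,))
        connection.execute("DELETE FROM Parent WHERE Id = ?", (parent_id,))

delete_parent_with_children(conn, 1)
counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("Parent",) + CHILD_TABLES
}
print(counts)
```

Because the foreign keys have no cascade, a plain DELETE on Parent still fails while children exist; only callers that go through the helper get the cascading behaviour.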
But, please note that basically you are increasing complexity of the system because of what might be simply unfinished system design. That will not really help you in the long run and the design decisions should be addressed before hacking the functionality (if at all possible).
EDIT:
If the purpose is to clean test data to conform to the actual integrity rules, then you could create proper tables with proper rules and then move the data from the test data tables to the proper tables.
Rows that don't conform to the proper integrity will fail the insert and you will have the clean data.

Cascade on delete performance: What's the fastest way to delete a row and its 1-Many rows?

I have a database in which there is a parent "Account" row that then has a 1-Many relationship with another table, and that table has a 1-Many relationship with another table. This goes on about 6 levels deep (with Account at the top). At the very bottom there could possibly be thousands (can even go beyond 100k) of rows. On each table there is a foreign key set to cascade on delete.
The issue is, that if I try to delete the very top row (an "Account"), it can take minutes, sometimes well over 10 minutes. Is there a faster way to delete all the rows (such as maybe going from the bottom up in individual delete statements) or is cascading pretty much it?
I am using MSSQL 2005 & MSSQL 2008 for the server, and L2S to perform the delete, although I can use a T-SQL statement if it is faster.
I've tried doing the delete from the SQL Management Studio too, and that takes just as long.
edit: we have tried re-indexing the database, with negligible difference, maybe a minute or two. I appreciate all your answers; it looks like I am going to have to start writing some code to do soft deletes!

A delete is a delete, and if you want to delete massive amounts of rows (100k), it will take a while.
If you do a soft delete (set a status to "D", for example), you can then run a job that actually deletes the rows in batches of, say, 1,000 or so over time, which may work better for you. The soft delete should update only the header row and would be very fast. You'd need to code your application to ignore these "D" status rows and their children, though.
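A sketch of such a purge job, assuming SQLite via Python's sqlite3 as a stand-in for SQL Server. The Item table, the sample data and the purge_soft_deleted helper are hypothetical; the "D" status and the batch size of 1,000 follow the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Item (Id INTEGER PRIMARY KEY, Status TEXT NOT NULL)")
# Sample data: every even Id has already been soft-deleted (status 'D').
conn.executemany(
    "INSERT INTO Item (Id, Status) VALUES (?, ?)",
    [(i, "D" if i % 2 == 0 else "A") for i in range(1, 5001)],
)
conn.commit()

def purge_soft_deleted(connection, batch_size=1000):
    """Delete rows flagged 'D' in small transactions; return the number of batches run."""
    batches = 0
    while True:
        with connection:  # one short transaction per batch keeps locks brief
            cur = connection.execute(
                "DELETE FROM Item WHERE Id IN "
                "(SELECT Id FROM Item WHERE Status = 'D' LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:  # nothing left to purge
            break
        batches += 1
    return batches

batches_run = purge_soft_deleted(conn)
remaining = conn.execute("SELECT COUNT(*) FROM Item").fetchone()[0]
print(batches_run, remaining)
```

On SQL Server the same loop would more likely be written with DELETE TOP (n) inside a scheduled job; the point here is only the shape of the loop, with each batch committed separately.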
EDIT
To further @Kane's comment: you could do only a soft delete, or you could do a soft delete followed by a batch process to do the actual deletes if you really want to. I'd just stick with the soft deletes if drive space is not an issue.

Have you indexed all the foreign keys? That's a common issue.

It sounds like you might have indexing issues.
Assume a parent-to-child relationship on column ParentId. By definition, column ParentId in the Parent table must have a primary or unique constraint, and thus be indexed. The child table, however, need not be indexed on ParentId. When you delete a parent entry, SQL has to delete all rows in the child table that have been assigned that foreign key... and if that column is not indexed, the work will have to be done with table scans. This could occur once for each table in your "deletion chain".
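A quick way to see the difference, sketched with SQLite through Python's sqlite3 (an assumption; SQL Server has its own execution plan tools). The index name IX_Child_ParentId and the child_lookup_plan helper are made up. The query plan for the "find children of this parent" lookup, which is the work a cascading delete has to do, is inspected before and after the child column is indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Parent (ParentId INTEGER PRIMARY KEY);
    CREATE TABLE Child (
        ChildId  INTEGER PRIMARY KEY,
        ParentId INTEGER NOT NULL REFERENCES Parent(ParentId) ON DELETE CASCADE
    );
""")

def child_lookup_plan(connection):
    """Return SQLite's plan text for locating the children of one parent."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT ChildId FROM Child WHERE ParentId = ?", (1,)
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(str(row[-1]) for row in rows)

plan_without_index = child_lookup_plan(conn)   # expected to be a full table scan
conn.execute("CREATE INDEX IX_Child_ParentId ON Child (ParentId)")
plan_with_index = child_lookup_plan(conn)      # expected to search via the new index

print(plan_without_index)
print(plan_with_index)
```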
Of course, it might just be volume. Deleting a few thousand rows from tables of 100k+ rows with multiple indexes, even if the "delete lookup" field is indexed, could take significant time -- and don't forget locking and blocking if you've got users accessing your system during the delete!
Deferring the delete until a scheduled maintenance window, as KM suggests, would definitely be an option, though it might require a serious modification to your code base.

Fastest way to delete all the data in a large table

I had to delete all the rows from a log table that contained about 5 million rows. My initial try was to issue the following command in query analyzer:
delete from client_log
which took a very long time.

Check out truncate table which is a lot faster.

I discovered the TRUNCATE TABLE in the msdn transact-SQL reference. For all interested here are the remarks:
TRUNCATE TABLE is functionally identical to DELETE statement with no WHERE clause: both remove all rows in the table. But TRUNCATE TABLE is faster and uses fewer system and transaction log resources than DELETE.
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log.
TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes and so on remain. The counter used by an identity for new rows is reset to the seed for the column. If you want to retain the identity counter, use DELETE instead. If you want to remove table definition and its data, use the DROP TABLE statement.
You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint; instead, use DELETE statement without a WHERE clause. Because TRUNCATE TABLE is not logged, it cannot activate a trigger.
TRUNCATE TABLE may not be used on tables participating in an indexed view.

There is a common myth that TRUNCATE somehow skips the transaction log.
This is a misunderstanding, and the point is clearly addressed in MSDN.
This myth is invoked in several comments here. Let's eradicate it together ;)

For reference TRUNCATE TABLE also works on MySQL

I use the following method to zero out tables, with the added bonus that it leaves me with an archive copy of the table.
CREATE TABLE `new_table` LIKE `table`;
RENAME TABLE `table` TO `old_table`, `new_table` TO `table`;

Forget truncate and delete. Maintain your table definitions (in case you want to recreate the table) and just use drop table.

truncate table client_log
is your best bet, truncate kills all content in the table and indices and resets any seeds you've got too.

truncate table is not SQL-platform independent. If you suspect that you might ever change database providers, you might be wary of using it.

On SQL Server you can use the Truncate Table command which is faster than a regular delete and also uses less resources. It will reset any identity fields back to the seed value as well.
The drawbacks of truncate are that it can't be used on tables that are referenced by foreign keys and it won't fire any triggers. Also, unless you run it inside an explicit transaction, you won't be able to roll back the data if anything goes wrong.

Note that TRUNCATE will also reset any auto incrementing keys, if you are using those.
If you do not wish to lose your auto incrementing keys, you can speed up the delete by deleting in sets (e.g., DELETE FROM table WHERE id > 1 AND id < 10000). It will speed it up significantly and in some cases prevent data from being locked up.

Yes, well, deleting 5 million rows is probably going to take a long time. The only potentially faster way I can think of would be to drop the table, and re-create it. That only works, of course, if you want to delete ALL data in the table.

The suggestion of "Drop and recreate the table" is probably not a good one because that goofs up your foreign keys.
You ARE using foreign keys, right?

If you cannot use TRUNCATE TABLE because of foreign keys and/or triggers, you can consider to:
drop all indexes;
do the usual DELETE;
re-create all indexes.
This may speed up DELETE somewhat.

I am revising my earlier statement:
"You should understand that by using TRUNCATE the data will be cleared but nothing will be logged to the transaction log. Writing to the log is why DELETE will take forever on 5 million rows. I use TRUNCATE often during development, but you should be wary about using it on a production database because you will not be able to roll back your changes. You should immediately make a full database backup after doing a TRUNCATE to establish a new basis for restoration."
The above statement was intended to prompt you to be sure that you understand there is a difference between the two. Unfortunately, it is poorly written and makes unsupported statements, as I have not actually done any testing myself between the two. It is based on statements that I have heard from others.
From MSDN:
"The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log."
I just wanted to say that there is a fundamental difference between the two and because there is a difference, there will be applications where one or the other may be inappropriate.

DELETE FROM table_name;
Premature optimization may be dangerous. Optimizing may mean doing something weird, but if it works you may want to take advantage of it.
SELECT DbVendor_SuperFastDeleteAllFunction(tablename, BOZO_BIT) FROM dummy;
For speed I think it depends on...
The underlying database: Oracle, Microsoft, MySQL, PostgreSQL, others, custom...
The table, its content, and related tables:
There may be deletion rules. Is there an existing procedure to delete all content in the table? Can this be optimized for the specific underlying database engine? How much do we care about breaking things / related data? Performing a DELETE may be the 'safest' way assuming that other related tables do not depend on this table. Are there other tables and queries that are related / depend on the data within this table? If we don't care much about this table being around, using DROP might be a fast method, again depending on the underlying database.
DROP TABLE table_name;
How many rows are being deleted? Is there other information that is quickly gleaned that will optimize the deletion? For example, can we tell if the table is already empty? Can we tell if there are hundreds, thousands, millions, billions of rows?

When/Why to use Cascading in SQL Server?

When setting up foreign keys in SQL Server, under what circumstances should you have it cascade on delete or update, and what is the reasoning behind it?
This probably applies to other databases as well.
I'm looking most of all for concrete examples of each scenario, preferably from someone who has used them successfully.

Summary of what I've seen so far:
Some people don't like cascading at all.
Cascade Delete
Cascade Delete may make sense when the semantics of the relationship can involve an exclusive "is part of" description. For example, an OrderLine record is part of its parent order, and OrderLines will never be shared between multiple orders. If the Order were to vanish, the OrderLine should as well, and a line without an Order would be a problem.
The canonical example for Cascade Delete is SomeObject and SomeObjectItems, where it doesn't make any sense for an items record to ever exist without a corresponding main record.
You should not use Cascade Delete if you are preserving history or using a "soft/logical delete" where you only set a deleted bit column to 1/true.
Cascade Update
Cascade Update may make sense when you use a real key rather than a surrogate key (identity/autoincrement column) across tables.
The canonical example for Cascade Update is when you have a mutable foreign key, like a username that can be changed.
You should not use Cascade Update with keys that are Identity/autoincrement columns.
Cascade Update is best used in conjunction with a unique constraint.
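A small sketch of the mutable-username case, assuming SQLite via Python's sqlite3 as a stand-in for SQL Server. The AppUser and Post tables and the sample usernames are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE AppUser (Username TEXT PRIMARY KEY);
    CREATE TABLE Post (
        PostId INTEGER PRIMARY KEY,
        -- natural key as foreign key: renames must follow the user
        Author TEXT NOT NULL REFERENCES AppUser(Username) ON UPDATE CASCADE
    );
    INSERT INTO AppUser VALUES ('jsmith');
    INSERT INTO Post VALUES (1, 'jsmith'), (2, 'jsmith');
""")

# Renaming the user rewrites the referencing rows automatically.
conn.execute("UPDATE AppUser SET Username = 'jane.smith' WHERE Username = 'jsmith'")
conn.commit()

authors = [r[0] for r in conn.execute("SELECT Author FROM Post ORDER BY PostId")]
print(authors)
```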
When To Use Cascading
You may want to get an extra strong confirmation back from the user before allowing an operation to cascade, but it depends on your application.
Cascading can get you into trouble if you set up your foreign keys wrong. But you should be okay if you do that right.
It's not wise to use cascading before you understand it thoroughly. However, it is a useful feature and therefore worth taking the time to understand.

Foreign keys are the best way to ensure referential integrity of a database. Avoiding cascades due to being magic is like writing everything in assembly because you don't trust the magic behind compilers.
What is bad is the wrong use of foreign keys, like creating them backwards, for example.
Juan Manuel's example is the canonical example: if you do it in code, there are many more chances of leaving spurious DocumentItems in the database that will come back to bite you.
Cascading updates are useful, for instance, when you have references to the data by something that can change, say a primary key of a users table is the name,lastname combination. Then you want changes in that combination to propagate to wherever they are referenced.
@Aidan, that clarity you refer to comes at a high cost: the chance of leaving spurious data in your database, which is not small. To me, it's usually just lack of familiarity with the DB and inability to find which FKs are in place before working with the DB that foster that fear. Either that, or constant misuse of cascade, using it where the entities were not conceptually related, or where you have to preserve history.

I never use cascading deletes.
If I want something removed from the database I want to explicitly tell the database what I want taking out.
Of course they are a function available in the database and there may be times when it is okay to use them, for example if you have an 'order' table and an 'orderItem' table you may want to clear the items when you delete an order.
I like the clarity that I get from doing it in code (or stored procedure) rather than 'magic' happening.
For the same reason I am not a fan of triggers either.
Something to notice is that if you do delete an 'order' you will get '1 row affected' report back even if the cascaded delete has removed 50 'orderItem's.
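The same effect can be reproduced in a sketch that assumes SQLite via Python's sqlite3 rather than SQL Server. The CustomerOrder and OrderItem tables are hypothetical, and the count comes from SQLite's changes() function, which reports only the rows touched directly by the last statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE CustomerOrder (OrderId INTEGER PRIMARY KEY);
    CREATE TABLE OrderItem (
        ItemId  INTEGER PRIMARY KEY,
        OrderId INTEGER NOT NULL REFERENCES CustomerOrder(OrderId) ON DELETE CASCADE
    );
    INSERT INTO CustomerOrder VALUES (1);
""")
conn.executemany(
    "INSERT INTO OrderItem (ItemId, OrderId) VALUES (?, 1)",
    [(i,) for i in range(1, 51)],  # 50 items on the one order
)
conn.commit()

conn.execute("DELETE FROM CustomerOrder WHERE OrderId = 1")
# changes() ignores rows removed by foreign key actions.
reported = conn.execute("SELECT changes()").fetchone()[0]
conn.commit()

items_left = conn.execute("SELECT COUNT(*) FROM OrderItem").fetchone()[0]
print(reported, items_left)
```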

I work a lot with cascading deletes.
It feels good to know that whoever works against the database will never leave any unwanted data behind. If dependencies grow I just change the constraints in the diagram in Management Studio and I don't have to tweak stored procedures or data access code.
That said, I have one problem with cascading deletes, and that's circular references. This often leads to parts of the database that have no cascading deletes.

I do a lot of database work and rarely find cascade deletes useful. The one time I have used them effectively is in a reporting database that is updated by a nightly job. I make sure that any changed data is imported correctly by deleting any top level records that have changed since the last import, then reimport the modified records and anything that relates to them. It saves me from having to write a lot of complicated deletes that look from the bottom to the top of my database.
I don't consider cascade deletes to be quite as bad as triggers as they only delete data, triggers can have all kinds of nasty stuff inside.
In general I avoid real Deletes altogether and use logical deletes (ie. having a bit column called isDeleted that gets set to true) instead.

One example is when you have dependencies between entities... ie: Document -> DocumentItems (when you delete Document, DocumentItems don't have a reason to exist)

ON Delete Cascade:
When you want rows in the child table to be deleted if the corresponding row is deleted in the parent table.
If ON DELETE CASCADE isn't used, an error will be raised to protect referential integrity.
ON Update Cascade:
When you want change in primary key to be updated in foreign key

Use cascade delete where you would want the record with the FK to be removed if the PK record it refers to was removed. In other words, where the record is meaningless without the referenced record.
I find cascade delete useful to ensure that dead references are removed by default rather than cause null exceptions.

I have heard of DBAs and/or "Company Policy" that prohibit using "On Delete Cascade" (and others) purely because of bad experiences in the past. In one case a guy wrote three triggers which ended up calling one another. Three days to recover resulted in a total ban on triggers, all because of the actions of one idjit.
Of course sometimes triggers are needed instead of "On Delete cascade", like when some child data needs to be preserved. But in other cases, it's perfectly valid to use the On Delete cascade method. A key advantage of "On Delete cascade" is that it captures ALL the children; a custom written trigger/stored procedure may not if it is not coded correctly.
I believe the developer should be allowed to make the decision based upon what the development is and what the spec says. A blanket ban based on a bad experience should not be the criterion; the "Never use" thought process is draconian at best. A judgement call needs to be made each and every time, and changes made as the business model changes.
Isn't this what development is all about?

One reason to put in a cascade delete (rather than doing it in the code) is to improve performance.
Case 1: With a cascade delete
DELETE FROM table WHERE SomeDate < 7 years ago;
Case 2: Without a cascade delete
FOR EACH R IN (SELECT tableId FROM table WHERE SomeDate < 7 years ago) LOOP
DELETE FROM ChildTable WHERE tableId = R.tableId;
DELETE FROM table WHERE tableId = R.tableId;
/* More child tables here */
NEXT
Secondly, when you add in an extra child table with a cascade delete, the code in Case 1 keeps working.
I would only put in a cascade where the semantics of the relationship is "part of". Otherwise some idiot will delete half of your database when you do:
DELETE FROM CURRENCY WHERE CurrencyCode = 'USD'

I try to avoid deletes or updates that I didn't explicitly request in SQL Server, whether through cascading or through the use of triggers. They tend to bite you in the ass some time down the line, either when trying to track down a bug or when diagnosing performance problems.
Where I would use them is in guaranteeing consistency for not very much effort. To get the same effect you would have to use stored procedures.

I, like everyone else here, find that cascade deletes are really only marginally helpful (it's really not that much work to delete referenced data in other tables -- if there are a lot of tables, you simply automate this with a script) but really annoying when someone accidentally cascade deletes some important data that is difficult to restore.
The only case where I'd use them is if the data in the table is highly controlled (e.g., limited permissions) and only updated or deleted through a controlled process (like a software update) that has been verified.

A deletion or update to S that removes a foreign-key value found in some tuples of R can be handled in one of three ways:
Rejection
Propagation
Nullification
Propagation is referred to as cascading.
There are two cases:
‣ If a tuple in S was deleted, delete the R tuples that referred to it.
‣ If a tuple in S was updated, update the value in the R tuples that refer to it.

If you're working on a system with many different modules in different versions, it can be very helpful, if the cascade deleted items are part of / owned by the PK holder. Else, all modules would require immediate patches to clean up their dependent items before deleting the PK owner, or the foreign key relation would be omitted completely, possibly leaving tons of garbage in the system if cleanup is not performed correctly.
I just introduced cascade delete for a new intersection table between two already existing tables (the intersection to delete only), after cascade delete had been discouraged for quite some time. It's also not too bad if data gets lost.
It is, however, a bad thing on enum-like list tables: somebody deletes entry 13 - yellow from table "colors", and all yellow items in the database get deleted. Also, these sometimes get updated in a delete-all-insert-all manner, leading to referential integrity totally omitted. Of course it's wrong, but how will you change a complex software which has been running for many years, with introduction of true referential integrity being at risk of unexpected side effects?
Another problem is when original foreign key values shall be kept even after the primary key has been deleted. One can create a tombstone column and an ON DELETE SET NULL option for the original FK, but this again requires triggers or specific code to maintain the redundant (except after PK deletion) key value.

Cascade deletes are extremely useful when implementing logical super-type and sub-type entities in a physical database.
When separate super-type and sub-type tables are used to physically implement super-types/sub-types (as opposed to rolling up all sub-type attributes into a single physical super-type table), there is a one-to-one relationship between these tables and the issue then becomes how to keep the primary keys 100% in sync between these tables.
Cascade deletes can be a very useful tool to:
1) Make sure that deleting a super-type record also deletes the corresponding single sub-type record.
2) Make sure that any delete of a sub-type record also deletes the super-type record. This is achieved by implementing an "instead-of" delete trigger on the sub-type table that goes and deletes the corresponding super-type record, which, in turn, cascade deletes the sub-type record.
Using cascade deletes in this manner ensures that no orphan super-type or sub-type records ever exist, regardless of whether you delete the super-type record first or the sub-type record first.

I would make a distinction between
Data integrity
Business logic/rules
In my experience it is best to enforce integrity as far as possible in the database using PK, FK, and other constraints.
However business rules/logic IMO is best implemented using code for the reason of cohesion (google "coupling and cohesion" to learn more).
Is cascade delete/update data integrity or business rules? This could of course be debated but I would say it is usually a logic/rule. For example a business rule may be that if an Order is deleted all OrderItems should be automatically deleted. But it could also be that it should never be possible to delete an Order if it still has OrderItems. So this may be up to the business to decide. How do we know how this rule is currently implemented? If it is all in code we can just look at the code (high cohesion). If the rule is maybe implemented in the code or maybe implemented as cascade in the database then we need to look in multiple places (low cohesion).
Of course if you go all-in with putting your business rules only in the database and use triggers, stored proc then cascade may make sense.
I usually consider database vendor lock-in before using any stored proc or triggers. A SQL database that just stores data and enforces integrity is IMO easier to port to another vendor. So for that reason I usually don't use stored proc or triggers.