When setting up foreign keys in SQL Server, under what circumstances should you have it cascade on delete or update, and what is the reasoning behind it?
This probably applies to other databases as well.
I'm looking most of all for concrete examples of each scenario, preferably from someone who has used them successfully.
Summary of what I've seen so far:
Some people don't like cascading at all.
Cascade Delete
Cascade Delete may make sense when the semantics of the relationship can involve an exclusive "is part of" description. For example, an OrderLine record is part of its parent order, and OrderLines will never be shared between multiple orders. If the Order were to vanish, the OrderLine should as well, and a line without an Order would be a problem.
The canonical example for Cascade Delete is SomeObject and SomeObjectItems, where it doesn't make any sense for an items record to ever exist without a corresponding main record.
You should not use Cascade Delete if you are preserving history or using a "soft/logical delete" where you only set a deleted bit column to 1/true.
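A minimal sketch of the canonical case (table and column names are illustrative):
CREATE TABLE Orders
(
    OrderID INT NOT NULL PRIMARY KEY
);
CREATE TABLE OrderLines
(
    OrderLineID INT NOT NULL PRIMARY KEY,
    OrderID     INT NOT NULL
        REFERENCES Orders (OrderID) ON DELETE CASCADE
);
-- Deleting an order removes its lines with it:
DELETE FROM Orders WHERE OrderID = 42;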
Cascade Update
Cascade Update may make sense when you use a real key rather than a surrogate key (identity/autoincrement column) across tables.
The canonical example for Cascade Update is when you have a mutable foreign key, like a username that can be changed.
You should not use Cascade Update with keys that are Identity/autoincrement columns.
Cascade Update is best used in conjunction with a unique constraint.
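For instance, a hedged sketch of the mutable-username case (names are illustrative); note the referenced column must be a primary key or carry a unique constraint:
CREATE TABLE Users
(
    Username VARCHAR(50) NOT NULL PRIMARY KEY
);
CREATE TABLE Posts
(
    PostID   INT NOT NULL PRIMARY KEY,
    Username VARCHAR(50) NOT NULL
        REFERENCES Users (Username) ON UPDATE CASCADE
);
-- Renaming the user rewrites the key everywhere it is referenced:
UPDATE Users SET Username = 'newname' WHERE Username = 'oldname';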
When To Use Cascading
You may want to get an extra strong confirmation back from the user before allowing an operation to cascade, but it depends on your application.
Cascading can get you into trouble if you set up your foreign keys wrong, but you should be fine if you set them up correctly.
It's not wise to use cascading before you understand it thoroughly. However, it is a useful feature and therefore worth taking the time to understand.
Foreign keys are the best way to ensure referential integrity of a database. Avoiding cascades because they feel like magic is like writing everything in assembly because you don't trust the magic behind compilers.
What is bad is the wrong use of foreign keys, such as creating them backwards.
Juan Manuel's example is the canonical one. If you do the cleanup in code instead, there are many more chances of leaving spurious DocumentItems in the database that will come back to bite you.
Cascading updates are useful, for instance, when you have references to the data by something that can change: say the primary key of a users table is the (name, lastname) combination. Then you want changes to that combination to propagate wherever it is referenced.
#Aidan, that clarity you refer to comes at a high cost: the chance of leaving spurious data in your database, which is not small. To me, what fosters that fear is usually just a lack of familiarity with the DB and failing to check which FKs are in place before working with it. Either that, or constant misuse of cascade, using it where the entities were not conceptually related, or where you have to preserve history.
I never use cascading deletes.
If I want something removed from the database I want to explicitly tell the database what I want taking out.
Of course they are a feature available in the database, and there may be times when it is okay to use them; for example, if you have an 'order' table and an 'orderItem' table, you may want to clear the items when you delete an order.
I like the clarity that I get from doing it in code (or stored procedure) rather than 'magic' happening.
For the same reason I am not a fan of triggers either.
Something to notice: if you do delete an 'order', you will get a '1 row(s) affected' report back even if the cascaded delete has removed 50 'orderItem' rows.
I work a lot with cascading deletes.
It feels good to know that whoever works against the database can never leave unwanted data behind. If dependencies grow, I just change the constraints in the diagram in Management Studio, and I don't have to tweak stored procedures or data access code.
That said, I have one problem with cascading deletes, and that's circular references. They often leave parts of the database with no cascading deletes.
I do a lot of database work and rarely find cascade deletes useful. The one time I have used them effectively is in a reporting database that is updated by a nightly job. I make sure that any changed data is imported correctly by deleting any top-level records that have changed since the last import, then reimporting the modified records and anything that relates to them. It saves me from having to write a lot of complicated deletes that work from the bottom of my database to the top.
I don't consider cascade deletes to be quite as bad as triggers, since they only delete data; triggers can have all kinds of nasty stuff inside.
In general I avoid real deletes altogether and use logical deletes instead (i.e. a bit column called isDeleted that gets set to true).
One example is when you have dependencies between entities, i.e. Document -> DocumentItems: when you delete a Document, the DocumentItems have no reason to exist.
ON DELETE CASCADE:
Use it when you want rows in the child table to be deleted automatically whenever the corresponding row in the parent table is deleted.
If the cascade isn't defined, deleting the parent row will instead fail with a referential-integrity error.
ON UPDATE CASCADE:
Use it when you want a change to the parent's primary key to propagate to the foreign key columns that reference it.
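A combined sketch with hypothetical names; without the two clauses (the default is NO ACTION), the parent delete or key update would instead fail with a referential-integrity error (error 547 in SQL Server):
CREATE TABLE Parent
(
    ParentID INT NOT NULL PRIMARY KEY
);
CREATE TABLE Child
(
    ChildID  INT NOT NULL PRIMARY KEY,
    ParentID INT NOT NULL
        REFERENCES Parent (ParentID)
        ON DELETE CASCADE
        ON UPDATE CASCADE
);
-- Each statement (run independently) now propagates to Child instead of failing:
UPDATE Parent SET ParentID = 2 WHERE ParentID = 1;
DELETE FROM Parent WHERE ParentID = 2;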
Use cascade delete where you would want the FK record to be removed if the PK record it refers to is removed. In other words, where the record is meaningless without the record it references.
I find cascade delete useful to ensure that dead references are removed by default rather than causing null exceptions.
I have heard of DBAs and/or "Company Policy" that prohibit using "On Delete Cascade" (and others) purely because of bad experiences in the past. In one case a guy wrote three triggers which ended up calling one another. Three days to recover resulted in a total ban on triggers, all because of the actions of one idjit.
Of course sometimes triggers are needed instead of "On Delete Cascade", like when some child data needs to be preserved. But in other cases, it's perfectly valid to use the "On Delete Cascade" method. A key advantage of "On Delete Cascade" is that it captures ALL the children; a custom-written trigger/stored procedure may not, if it is not coded correctly.
I believe the developer should be allowed to make the decision based upon what is being developed and what the spec says. A blanket ban based on one bad experience should not be the criterion; the "never use" thought process is draconian at best. A judgement call needs to be made each and every time, and changes made as the business model changes.
Isn't this what development is all about?
One reason to put in a cascade delete (rather than doing it in the code) is to improve performance.
Case 1: With a cascade delete
DELETE FROM ParentTable WHERE SomeDate < DATEADD(YEAR, -7, GETDATE());
Case 2: Without a cascade delete
-- Children first, then the parent rows (each child table needs its own delete):
DELETE FROM ChildTable
WHERE tableId IN (SELECT tableId FROM ParentTable
                  WHERE SomeDate < DATEADD(YEAR, -7, GETDATE()));
/* More child-table deletes here */
DELETE FROM ParentTable
WHERE SomeDate < DATEADD(YEAR, -7, GETDATE());
Secondly, when you add in an extra child table with a cascade delete, the code in Case 1 keeps working.
I would only put in a cascade where the semantics of the relationship are "part of". Otherwise some idiot will delete half of your database when you do:
DELETE FROM CURRENCY WHERE CurrencyCode = 'USD'
I try to avoid deletes or updates that I didn't explicitly request in SQL Server, whether through cascading or through triggers. They tend to bite you some time down the line, either when you're trying to track down a bug or when you're diagnosing performance problems.
Where I would use them is to guarantee consistency for not very much effort; to get the same effect otherwise you would have to use stored procedures.
I, like everyone else here, find that cascade deletes are really only marginally helpful (it's not that much work to delete referenced data in other tables; if there are a lot of tables, you can automate this with a script) but really annoying when someone accidentally cascade-deletes some important data that is difficult to restore.
The only case where I'd use them is if the data in the table is highly controlled (e.g., limited permissions) and only updated or deleted through a controlled process (like a software update) that has been verified.
A deletion or update to S that removes a foreign-key value found in some tuples of R can be handled in one of three ways:
Rejection
Propagation
Nullification
Propagation is referred to as cascading.
There are two cases:
‣ If a tuple in S was deleted, delete the R tuples that referred to it.
‣ If a tuple in S was updated, update the value in the R tuples that refer to it.
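In SQL Server those three options map onto the referential actions that can be declared on the foreign key; a sketch with hypothetical names (the same actions exist for ON UPDATE):
CREATE TABLE S
(
    SID INT NOT NULL PRIMARY KEY
);
CREATE TABLE R
(
    RID INT NOT NULL PRIMARY KEY,
    SID INT NULL
        REFERENCES S (SID)
        ON DELETE NO ACTION  -- rejection (the default)
     -- ON DELETE CASCADE       propagation
     -- ON DELETE SET NULL      nullification
);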
If you're working on a system with many different modules in different versions, cascading deletes can be very helpful when the deleted items are part of, or owned by, the PK holder. Otherwise, all modules would require immediate patches to clean up their dependent items before deleting the PK owner, or the foreign key relationship would be omitted completely, possibly leaving tons of garbage in the system if cleanup is not performed correctly.
I just introduced cascade delete for a new intersection table between two existing tables (the cascade deletes only the intersection rows), after cascade delete had been discouraged here for quite some time. It also doesn't hurt too much if that data gets lost.
It is, however, a bad thing on enum-like list tables: somebody deletes entry 13 - yellow from the "colors" table, and all yellow items in the database get deleted with it. Also, such tables are sometimes updated in a delete-all-insert-all manner, so referential integrity is ignored entirely. Of course that's wrong, but how will you change complex software that has been running for many years, when introducing true referential integrity risks unexpected side effects?
Another problem is when the original foreign key values must be kept even after the primary key has been deleted. One can create a tombstone column and an ON DELETE SET NULL option for the original FK, but this again requires triggers or specific code to maintain the redundant key value (redundant except after the PK's deletion).
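A hedged sketch of that tombstone arrangement (hypothetical names); keeping OriginalParentID populated is the part that needs the extra trigger or application code:
CREATE TABLE Parent
(
    ParentID INT NOT NULL PRIMARY KEY
);
CREATE TABLE Child
(
    ChildID          INT NOT NULL PRIMARY KEY,
    ParentID         INT NULL
        REFERENCES Parent (ParentID) ON DELETE SET NULL,
    OriginalParentID INT NULL  -- tombstone: survives deletion of the parent
);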
Cascade deletes are extremely useful when implementing logical super-type and sub-type entities in a physical database.
When separate super-type and sub-type tables are used to physically implement super-types/sub-types (as opposed to rolling all sub-type attributes up into a single physical super-type table), there is a one-to-one relationship between these tables, and the issue then becomes how to keep the primary keys 100% in sync between them.
Cascade deletes can be a very useful tool to:
1) Make sure that deleting a super-type record also deletes the corresponding single sub-type record.
2) Make sure that any delete of a sub-type record also deletes the super-type record. This is achieved by implementing an "instead-of" delete trigger on the sub-type table that goes and deletes the corresponding super-type record, which, in turn, cascade deletes the sub-type record.
Using cascade deletes in this manner ensures that no orphan super-type or sub-type records ever exist, regardless of whether you delete the super-type record first or the sub-type record first.
I would make a distinction between
Data integrity
Business logic/rules
In my experience it is best to enforce integrity as far as possible in the database using PK, FK, and other constraints.
However, business rules/logic are IMO best implemented in code, for reasons of cohesion (google "coupling and cohesion" to learn more).
Is cascade delete/update data integrity or business rules? This could of course be debated, but I would say it is usually a logic/rule. For example, a business rule may be that if an Order is deleted, all OrderItems should be automatically deleted. But it could also be that it should never be possible to delete an Order if it still has OrderItems. So this may be up to the business to decide. How do we know how this rule is currently implemented? If it is all in code, we can just look at the code (high cohesion). If the rule might be implemented in the code or might be implemented as a cascade in the database, then we need to look in multiple places (low cohesion).
Of course, if you go all-in with putting your business rules only in the database, using triggers and stored procedures, then cascade may make sense.
I usually consider database vendor lock-in before using any stored procedures or triggers. A SQL database that just stores data and enforces integrity is IMO easier to port to another vendor. So for that reason I usually don't use stored procedures or triggers.
Related
In SQL Server 2005 I just struck the infamous error message:
Introducing FOREIGN KEY constraint XXX on table YYY may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
Now, StackOverflow has several topics about this error message, so I've already got the solution (in my case I'll have to use triggers), but I'm curious as to why there is such a problem at all.
As I understand it, there are basically two scenarios that they want to avoid - a cycle and multiple paths. A cycle would be where two tables have cascading foreign keys to each other. OK, a cycle can span several tables too, but this is the basic case and will be easier to analyze.
Multiple paths would be when TableA has foreign keys to TableB and TableC, and TableB also has a foreign key to TableC. Again - this is the minimum basic case.
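A minimal schema that reproduces the error (hypothetical names); the third CREATE TABLE is the one SQL Server rejects, because a delete in TableC would reach TableA both directly and via TableB:
CREATE TABLE TableC
(
    CID INT NOT NULL PRIMARY KEY
);
CREATE TABLE TableB
(
    BID INT NOT NULL PRIMARY KEY,
    CID INT NOT NULL REFERENCES TableC (CID) ON DELETE CASCADE
);
CREATE TABLE TableA
(
    AID INT NOT NULL PRIMARY KEY,
    BID INT NOT NULL REFERENCES TableB (BID) ON DELETE CASCADE,
    CID INT NOT NULL REFERENCES TableC (CID) ON DELETE CASCADE
);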
I cannot see any problems that would arise when a record would get deleted or updated in any of those tables. Sure, you might need to query the same table multiple times to see which records need updating/deleting, but is that really a problem? Is this a performance issue?
In other SO topics people go as far as to label using cascades as "risky" and state that "resolving cascade paths is a complex problem". Why? Where is the risk? Where is the problem?
You have a child table with two cascade paths from the same parent: one "delete", one "set null".
What takes precedence? What do you expect afterwards? And so on.
Note: A trigger is code and can add some intelligence or conditions to a cascade.
The reason we forbid using cascade delete has to do with performance and locking. Yes, it's not so bad when you delete one record, but sooner or later you will need to delete a large group of records, and your database will come to a standstill.
If you are deleting enough records, SQL Server might escalate to a table lock and no one can do anything with the table until it is finished.
We recently moved one of our clients to his own server. As part of the deal we also then had to delete all of that client's records from our original server. Deleting all his information in batches (so as not to cause problems with other users) took a couple of months. If we had had cascade delete set up, the database would have been inaccessible to the other clients for a long time, as millions of records were deleted in one transaction and hundreds of tables were locked until the transaction was done.
I could also see a scenario where cascade delete might have caused a deadlock, because we have no control over the order the cascade path takes, and our database is somewhat denormalized, with clientid appearing in most tables. If the cascade locked a table that had foreign keys both to a third table and to the client table via a different path, it might not have been able to take the locks it needed to delete from that third table; since this is all one transaction, the locks wouldn't be released until it was done. So possibly SQL Server wouldn't even have let us set up cascade deletes, if it saw the possibility of creating deadlocks in the transaction.
Another reason to avoid cascading deletes is that sometimes the existence of a child record is reason enough not to delete the parent record. For instance, if you have a customer table and that customer has had orders in the past, you would not want to delete him and lose the information on the actual order.
Consider a table of employees:
CREATE TABLE Employees
(
    EmpID INTEGER NOT NULL PRIMARY KEY,
    Name  VARCHAR(40) NOT NULL,
    MgrID INTEGER NOT NULL REFERENCES Employees(EmpID) ON DELETE CASCADE
);
INSERT INTO Employees VALUES (     1, 'Bill',   1);
INSERT INTO Employees VALUES (    23, 'Steve',  1);
INSERT INTO Employees VALUES (234212, 'Helen', 23);
Now suppose Bill retires:
DELETE FROM Employees WHERE Name = 'Bill';
Ooooppps; everyone just got sacked!
[We can debate whether the details of the syntax are correct; the concept stands, I think.]
I think the problem is that when you make one path ON DELETE CASCADE and the other ON DELETE RESTRICT or NO ACTION, the outcome (the result) is unpredictable. It depends on which delete trigger (this is also a trigger, just one you don't have to build yourself) is executed first.
I agree that cascades are "risky" and should be avoided. (I personally prefer cascading the changes manually rather than having SQL Server automatically take care of them.) This is also because even if SQL Server deleted millions of rows, the output would still show up as
(1 row(s) affected)
I think whether or not to use a ON DELETE CASCADE option is a question of the business model you are implementing. A relationship between two business objects could be either a simple "association", where both ends of the relationship are related, but otherwise independent objects the lifecycle of which are different and controlled by other logic. There are, however, also "aggregation" relationships, where one object might actually be seen as the "parent" or "owner" of a "child" or "detail" object. There is the even stronger notion of a "composition" relationship, where an object solely exists as a composition of a number of parts.
In the "association" case, you usually won't declare an ON DELETE CASCADE constraint. For aggregations or compositions, however, the ON DELETE CASCADE helps you mapping your business model to the database more accurately and in a declarative way.
This is why it annoys me that MS SQL Server restricts the use of this option to a single cascade path. If I'm not mistaken, many other widely used SQL database systems do not impose such restrictions.
What is the advantage of defining a foreign key when working with an MVC framework that handles the relation?
I'm using a relational database with a framework that allows model definitions with relations. Because the relations are already defined through the models, foreign keys in the database seem redundant. And when it comes to managing the database of an application in development, editing/deleting tables that use foreign keys is a hassle.
Is there any advantage to using foreign keys that I'm forgoing by dropping the use of them altogether?
Foreign key constraints (in some DB engines) give you data integrity at the lowest level, the database itself.
It means you can't physically create a record that doesn't satisfy the relation.
It's just a way to be safer.
It gives you data integrity that's enforced at the database level. This helps guard against possible errors in application logic that might cause invalid data.
If any data manipulation is ever done directly in SQL that bypasses your application logic, it also guards against bad data that breaks those constraints.
An additional side benefit is that it allows tools to automatically generate database diagrams, with relationships inferred from the schema itself. In theory all the diagramming should be done before the database is created, but as the database evolves beyond its initial incarnation these diagrams often aren't kept up to date; the ability to generate a diagram from an existing database is helpful both for review and for explaining the structure to new developers joining a project.
It might be helpful to disable FKs while the database structure is still in flux, but they're a good safeguard to have once the schema stabilizes.
A foreign key guarantees a matching record exists in a foreign table. Imagine a table called Books that has a FK constraint on a table called Authors. Every book is guaranteed to have an Author.
Now, you can do a query such as:
SELECT B.Title, A.Name FROM Books B
INNER JOIN Authors A ON B.AuthorId = A.AuthorId;
Without the FK constraint, a Book whose Author row was missing would simply drop out of the join, leaving silently missing books in your result set.
Also, with the FK constraint, attempting to delete an author that was referred to by at least one Book would result in an error, rather than corrupting your database.
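The schema behind that example might look like this (a sketch; the column types are assumed):
CREATE TABLE Authors
(
    AuthorId INT NOT NULL PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL
);
CREATE TABLE Books
(
    BookId   INT NOT NULL PRIMARY KEY,
    Title    VARCHAR(200) NOT NULL,
    AuthorId INT NOT NULL REFERENCES Authors (AuthorId)
);
-- Fails with a constraint violation if any book still references author 42:
DELETE FROM Authors WHERE AuthorId = 42;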
Whilst they may be a pain when manipulating development/test data, they have saved me a lot of hassle in production.
Think of them as a way to maintain data integrity, especially as a safeguard against orphaned records.
For example, if you had a database relating many PhoneNumber records to a Person, what happens to PhoneNumber records when the Person record is deleted for whatever reason?
They will still exist in the database, but the ID of the Person they relate to will no longer exist in the relevant Person table and you have orphaned records.
Yes, you could write a trigger to delete the PhoneNumber whenever a Person gets removed, but this could get messy if you accidentally delete a Person and need to roll back.
Yes, you may remember to get rid of the PhoneNumber records manually, but what about other developers or methods you write 9 months down the line?
By creating a Foreign Key that ensures any PhoneNumber is related to an existing Person, you both insure against destroying this relationship and also add 'clues' as to the intended data structure.
The main benefits are data integrity and cascading deletes. You can also get a performance gain when FKs are defined and those fields are properly indexed. For example, you wouldn't be able to create a phone number that didn't belong to a contact, and when you delete a contact you can have all of their phone numbers deleted automatically. Yes, you can make those connections in your UI or middle tier, but you'll still end up with orphans if someone runs an update directly against the server using SQL rather than your UI. The "hassle" part is just forcing you to consider those connections before you make a bulk change. FKs have saved my bacon many times.
On a project I'm working on we've got some tables with numerous foreign key relationships, and because it's in early development the number of relationships will likely change.
We'd like to be able to delete records from certain tables but are reluctant to set up cascading deletes on the foreign key relationships.
We've considered the following options:
Ignore our instincts and set up cascading deletes anyway
Instead of a cascading delete use set null
Write and maintain a custom script to delete all the foreign key records manually
None of these options are great :-(
We don't want to set up cascading deletes because we don't want that to be the default behaviour.
We don't want to use cascading nulls because leaving lots of orphans around would be useless.
Writing a custom script would work, but it's not very scalable or maintainable. Writing a script for a single table or even a few tables is ok, but for every table? Seriously? There must be a better way! At least I hope there's a better way.
For the "Too Long Didn't Read" crowd; A quick summary
Is there a way of specifying that you'd like a delete to cascade, on a query by query basis?
Perhaps something that looks like this:
-- wouldn't it be nice if this was a real command!
CASCADE DELETE FROM MyTable WHERE ID = #ID
I am not sure I really see the usefulness of a manual cascade option in your case.
CASCADE is there to maintain certain relationships between entities, and if you, and I quote,
don't want that to be the default behaviour
then you can still:
issue multiple queries that will do your clean-up 'manually'
use stored procedures if you want to do it with a single call; for example, you might then have CALL CASCADE_DELETE('table_name', 'id = 3'), giving you a single point where you maintain your clean-up scripts (see the sketch after this list)
use triggers if you want to be fancy (for example, you could create simple views over the original tables where deleting from a view cascades, via INSTEAD OF triggers, while deleting from the original tables does not)
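A hedged sketch of the stored-procedure option in T-SQL (the table, column, and procedure names are hypothetical):
CREATE PROCEDURE DeleteOrderCascade
    @OrderID INT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        DELETE FROM OrderItems WHERE OrderID = @OrderID;
        -- more child tables here, deepest descendants first
        DELETE FROM Orders WHERE OrderID = @OrderID;
    COMMIT TRANSACTION;
END;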
But please note that you are basically increasing the complexity of the system to work around what might simply be an unfinished system design. That will not really help you in the long run, and the design decisions should be addressed before hacking in the functionality (if at all possible).
EDIT:
If the purpose is to clean test data so it conforms to the actual integrity rules, then you could create proper tables with proper rules and move the data from the test tables into them.
Rows that don't conform to the integrity rules will fail the insert, and you will be left with clean data.
How do I go about deleting a row that is referenced by many other tables, either as a primary key or as a foreign key?
Do I need to delete each reference in the appropriate order, or is there an 'auto' way to perform this in, for example, LINQ to SQL?
If you're performing all of your data access through stored procedures then your delete stored procedure for the master should take care of this. You need to maintain it when you add a new related table, but IMO that requires you to think about what you're doing, which is a good thing.
Personally, I stay away from cascading deletes. It's too easy to accidentally delete a slew of records when the user should have been warned about existing children instead.
Many times the best way to delete something in a database is to just "virtually" delete it by setting an IsDeleted column, and then ignoring the row in all other queries.
Deletes can be very expensive for heavily linked tables, and the locks can cause other queries to fail while the delete is happening.
You can just leave the "IsDeleted" rows in the system forever (which might be helpful for auditing), or go back and delete them for real when the system is idle.
If you have the foreign keys set up with ON DELETE CASCADE, it'll take care of pruning your database with just DELETE FROM master WHERE id = :x
I have an application where the majority of the database tables have a strong relationship to one other table. Currently I am enforcing referential integrity with foreign keys, but I'm wondering if this is really the best approach. Data in the primary table can be deleted from an admin interface by business users, which means having to do a cascading delete, (or writing several delete statements), but I'm not sure if I really want to remove all that other data at the same time. It could be a lot of data that *might* be useful at a later date (reporting maybe?). However, the data in the secondary tables is basically useless to the application itself unless the relationship exists with the primary table.
Given the option, I always keep data around. And since you already have foreign keys in place, you have some built-in protection from integrity violations.
If what your users want is to "delete" a record, therefore hiding it from the application, consider the "virtual delete" strategy -- mark a record as inactive, instead of physically removing it from the database.
As for implementation, depending on your db, add whatever equates to boolean/bit logic for your table. All rows get assigned true/1 by default; "deletes" are marked as false/0.
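A minimal sketch of that, T-SQL flavored (the table and column names are hypothetical):
ALTER TABLE Person ADD IsActive BIT NOT NULL DEFAULT 1;
-- "Deleting" a record just flips the flag:
UPDATE Person SET IsActive = 0 WHERE PersonID = 42;
-- Application queries then ignore inactive rows:
SELECT PersonID, Name FROM Person WHERE IsActive = 1;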
You can use foreign keys and relationships to enforce referential integrity without having to use cascading deletes. I seldom use cascading deletes, as I've always found it's better to keep the data and manage/archive it well than it is to delete it.
Just write your own delete logic to support your own business rules.
Logical deletions work excellently as well and I use them extensively.
You don't want to delete only some of the data; you'd likely end up with rogue data that you have no idea where it belonged in the first place. It's all or nothing.
Soft delete, i.e. having a bit field on every row that determines whether the record is "deleted" or not, is the way to go. That way, you simply check whether the record is deleted == true in the API, and hide it from the application.
You keep the data, but no one can retrieve it through the application.
I would say use foreign key constraints as a rule - this "safeguards" your DB design long-term, as well as data integrity itself. Constraints are there also to explicitly state a designer's decision.
I've seen constraints ditched on extremely large databases; that would be one reason not to use them, if you have compared performance and the foreign key overhead is significant.
I'd use logical/soft delete. This basically means adding one more column (possibly bit column Deleted) to the table in question, which would mark a particular row as deleted.
That said, "deleted" data is just that: deleted. Thus it cannot logically be used in reporting and similar stuff. In order to overcome this, I'd also introduce Hidden column to hide certain rows retaining their logical meaning.
Never do physical deletes. You can add a BIT flag IsDeleted to indicate that a record is deleted. When you want to "delete" a record, simply set the flag to true.