I have a large database and, as you'd expect, a lot of foreign keys referencing its tables. From a database design perspective, how should I handle the deletion of a record that is referenced by a foreign key?
One option I thought of was adding a boolean column to the table that determines whether the record is active or not. So instead of deleting a record, I'd just set its active value to false.
The database may end up bloated, but then not only do all the referencing foreign keys remain intact, the database also holds more information.
I would like to hear your thoughts on this matter regarding a system critical database.
As far as I understand your question, you have two tables:
| main      |                    | child              |
|-----------|                    |--------------------|
| id | data |   (1) ----> (n)    | id | main_id (FK)  |
And you don't want to delete data from the main table while there are records in the child table referencing it.
You didn't say which RDBMS you use, but in MySQL (for example) you can choose the foreign key's ON DELETE action. If you set it to RESTRICT, the system won't allow you to delete data from the main table while matching data exists in the child table.
Or you can set it to CASCADE; then when you delete data in the main table, the matching rows are automatically deleted from the child table.
So there is no need to create additional 'active' field.
The answer to your question depends highly on your application.
If you need "historic" data, then using an "enabled" flag seems the right choice. However, if there is sensitive data in your database and you want to ensure that deleted data is not too easy to recover, then an "enabled" flag is a no-go.
Other aspects:
Do you need "undelete"?
How often do delete operations occur? Many deletes create many disabled entries.
Is there an easy way (e.g., database triggers) to ensure that entries referencing disabled entries are disabled as well?
Do you have processes or mechanisms to ensure that applications will only see/handle enabled entries? Consider views.
Consider creating proper indices to speed up lookups of enabled entries; keep in mind that additional indices need extra space.
Do you have requirements (technical, organizational, legal) for finally removing deleted entries? Do you need the deletion date for that purpose?
Is there a need for a cleanup script?
I know, that's a lot of questions. However, I hope they will help you find an answer to your question.
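The "consider views" point above can be sketched like this. A minimal SQLite illustration where the table, view, and column names are all invented for the example: applications query only the view, so "deleted" rows become invisible to them without any physical delete.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        enabled INTEGER NOT NULL DEFAULT 1   -- the soft-delete flag
    )
""")
# Applications are pointed at the view, never at the base table
con.execute("""
    CREATE VIEW active_customer AS
        SELECT id, name FROM customer WHERE enabled = 1
""")

con.executemany("INSERT INTO customer (id, name) VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob")])
con.execute("UPDATE customer SET enabled = 0 WHERE id = 2")  # 'delete' Bob

print([row[1] for row in con.execute("SELECT id, name FROM active_customer")])  # ['Alice']
```

A partial index such as `CREATE INDEX idx_active ON customer(id) WHERE enabled = 1` (supported by SQLite and PostgreSQL) would address the indexing question as well.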
Related
We have a database structure where one table is linked to by many other tables.
That linked table often contains orphaned entries that we want to delete (based on a soft delete flag). However, there are occasionally rows which aren't fully orphaned (i.e. they still have a foreign key reference to them) even though the soft delete flag is set.
Is there a way to identify which rows are still referenced by a foreign key, so we can skip them when finding rows to delete? I believe SQL Server itself must have a mechanism for this, as it fails a delete if there is a foreign key reference to the row.
(We cannot turn on cascade on delete, and it's not practical to iterate over all of the tables referencing this one to find out from that direction.)
Edit
Those foreign keys can change over time. I am trying to investigate what is technically possible, as building a static query that relies on 'external' knowledge about the rest of the db structure has its own downsides. That still might be the best approach, but I don't want to rule out other options just because I haven't been able to think of them.
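One workable sketch is a NOT EXISTS filter per referencing table. In SQL Server the list of referencing tables could be generated at runtime from the `sys.foreign_keys` catalog view, which addresses the "keys change over time" concern; the illustration below just hard-codes two hypothetical child tables in SQLite to show the shape of the generated query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE linked (id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE ref_a (id INTEGER PRIMARY KEY, linked_id INTEGER REFERENCES linked(id));
    CREATE TABLE ref_b (id INTEGER PRIMARY KEY, linked_id INTEGER REFERENCES linked(id));
    INSERT INTO linked VALUES (1, 1), (2, 1), (3, 0);
    INSERT INTO ref_a VALUES (10, 2);   -- row 2 is soft-deleted but still referenced
""")

# Select only soft-deleted rows that no child table still points at
deletable = con.execute("""
    SELECT l.id FROM linked l
    WHERE l.deleted = 1
      AND NOT EXISTS (SELECT 1 FROM ref_a WHERE linked_id = l.id)
      AND NOT EXISTS (SELECT 1 FROM ref_b WHERE linked_id = l.id)
""").fetchall()
print(deletable)  # [(1,)] -- row 2 is skipped because ref_a still references it
```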
In SQL Server 2005 I just struck the infamous error message:
Introducing FOREIGN KEY constraint XXX on table YYY may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
Now, StackOverflow has several topics about this error message, so I've already got the solution (in my case I'll have to use triggers), but I'm curious as to why there is such a problem at all.
As I understand it, there are basically two scenarios that they want to avoid - a cycle and multiple paths. A cycle would be where two tables have cascading foreign keys to each other. OK, a cycle can span several tables too, but this is the basic case and will be easier to analyze.
Multiple paths would be when TableA has foreign keys to TableB and TableC, and TableB also has a foreign key to TableC. Again - this is the minimum basic case.
I cannot see any problems that would arise when a record would get deleted or updated in any of those tables. Sure, you might need to query the same table multiple times to see which records need updating/deleting, but is that really a problem? Is this a performance issue?
In other SO topics people go as far as to label using cascades as "risky" and state that "resolving cascade paths is a complex problem". Why? Where is the risk? Where is the problem?
You have a child table with 2 cascade paths from the same parent: one "delete", one "null".
What takes precedence? What do you expect afterwards? etc
Note: A trigger is code and can add some intelligence or conditions to a cascade.
The reason we forbid cascade delete has to do with performance and locking. Yes, it's not so bad when you delete one record, but sooner or later you will need to delete a large group of records, and your database will come to a standstill.
If you are deleting enough records, SQL Server might escalate to a table lock and no one can do anything with the table until it is finished.
We recently moved one of our clients to his own server. As part of the deal we also then had to delete all of that client's records from our original server. Deleting all his information in batches (so as not to cause problems with other users) took a couple of months. If we had had cascade delete set up, the database would have been inaccessible to the other clients for a long time, as millions of records were deleted in one transaction and hundreds of tables were locked until it was done.
I could also see a scenario where a deadlock might have occurred using cascade delete, because we have no control over the order the cascade path would take, and our database is somewhat denormalized, with clientid appearing in most tables. So if it locked one table that had a foreign key to a third table, as well as the client table that was in a different path, it possibly couldn't check that table in order to delete from the third table, because this is all one transaction and the locks wouldn't be released until it was done. So possibly SQL Server wouldn't have let us set up cascade deletes at all if it saw the possibility of creating deadlocks in the transaction.
Another reason to avoid cascading deletes is that sometimes the existence of a child record is reason enough not to delete the parent record. For instance, if you have a customer table and that customer has had orders in the past, you would not want to delete him and lose the information on the actual order.
Consider a table of employees:
CREATE TABLE Employee
(
    EmpID INTEGER NOT NULL PRIMARY KEY,
    Name  VARCHAR(40) NOT NULL,
    MgrID INTEGER NOT NULL REFERENCES Employee(EmpID) ON DELETE CASCADE
);

INSERT INTO Employee VALUES (     1, 'Bill',   1);
INSERT INTO Employee VALUES (    23, 'Steve',  1);
INSERT INTO Employee VALUES (234212, 'Helen', 23);

Now suppose Bill retires:

DELETE FROM Employee WHERE Name = 'Bill';
Ooooppps; everyone just got sacked!
[We can debate whether the details of the syntax are correct; the concept stands, I think.]
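The effect is easy to reproduce. Here is a sketch of the same scenario in SQLite (used here because SQL Server itself would reject a self-referencing cascade with the very "cycles or multiple cascade paths" error discussed elsewhere on this page; the seed rows are inserted before enabling FK enforcement so that Bill can be his own manager):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Employee (
        EmpID INTEGER NOT NULL PRIMARY KEY,
        Name  TEXT NOT NULL,
        MgrID INTEGER NOT NULL REFERENCES Employee(EmpID) ON DELETE CASCADE
    )
""")
# Bill is his own manager; Steve reports to Bill; Helen reports to Steve.
# (Seeded while FK enforcement is still off, to sidestep the self-reference insert.)
con.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                [(1, "Bill", 1), (23, "Steve", 1), (234212, "Helen", 23)])
con.commit()
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

con.execute("DELETE FROM Employee WHERE Name = 'Bill'")
remaining = con.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print("employees left:", remaining)  # 0: the cascade walked the whole chain
```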
I think the problem is that when you make one path ON DELETE CASCADE and the other ON DELETE RESTRICT or NO ACTION, the outcome is unpredictable. It depends on which delete trigger (this is also a trigger, just one you don't have to build yourself) gets executed first.
I agree that cascades are "risky" and should be avoided. (I personally prefer cascading the changes manually rather than having SQL Server automatically take care of them.) This is also because even if SQL Server deletes millions of rows, the output would still show up as
(1 row(s) affected)
I think whether or not to use a ON DELETE CASCADE option is a question of the business model you are implementing. A relationship between two business objects could be either a simple "association", where both ends of the relationship are related, but otherwise independent objects the lifecycle of which are different and controlled by other logic. There are, however, also "aggregation" relationships, where one object might actually be seen as the "parent" or "owner" of a "child" or "detail" object. There is the even stronger notion of a "composition" relationship, where an object solely exists as a composition of a number of parts.
In the "association" case, you usually won't declare an ON DELETE CASCADE constraint. For aggregations or compositions, however, ON DELETE CASCADE helps you map your business model to the database more accurately and in a declarative way.
This is why it annoys me that MS SQL Server restricts the use of this option to a single cascade path. If I'm not mistaken, many other widely used SQL database systems do not impose such restrictions.
How do I go about deleting a row that is referenced by many other tables, either as a primary key or as a foreign key?
Do I need to delete each reference in the appropriate order, or is there an 'auto' way to perform this in, for example, linq to sql?
If you're performing all of your data access through stored procedures then your delete stored procedure for the master should take care of this. You need to maintain it when you add a new related table, but IMO that requires you to think about what you're doing, which is a good thing.
Personally, I stay away from cascading deletes. It's too easy to accidentally delete a slew of records when the user should have been warned about existing children instead.
Many times the best way to delete something in a database is to just "virtually" delete it by setting an IsDeleted column, and then ignoring the row in all other queries.
Deletes can be very expensive for heavily linked tables, and the locks can cause other queries to fail while the delete is happening.
You can just leave the "IsDeleted" rows in the system forever (which might be helpful for auditing), or go back and delete them for real when the system is idle.
If you have the foreign keys set up with ON DELETE CASCADE, it'll take care of pruning your database with just DELETE FROM master WHERE id = :x
I have an application where the majority of the database tables have a strong relationship to one other table. Currently I am enforcing referential integrity with foreign keys, but I'm wondering if this is really the best approach. Data in the primary table can be deleted from an admin interface by business users, which means having to do a cascading delete, (or writing several delete statements), but I'm not sure if I really want to remove all that other data at the same time. It could be a lot of data that *might* be useful at a later date (reporting maybe?). However, the data in the secondary tables is basically useless to the application itself unless the relationship exists with the primary table.
Given the option, I always keep data around. And since you already have foreign keys in place, you have some built-in protection from integrity violations.
If what your users want is to "delete" a record, therefore hiding it from the application, consider the "virtual delete" strategy -- mark a record as inactive, instead of physically removing it from the database.
As for implementation, depending on your db, add whatever equates to boolean/bit logic for your table. All rows get assigned true/1 by default; "deletes" are marked as false/0.
You can use foreign keys and relationships to enforce referential integrity without having to use cascading deletes. I seldom use cascading deletes as I've always found it's often better to have the data and manage/archive it well than it is to delete it.
Just write your own delete logic to support your own business rules.
Logical deletions work excellently as well and I use them extensively.
You don't want to delete only some of the data; you'll likely end up with rogue data that you have no idea where it belonged in the first place. It's either all or nothing.
Soft delete, i.e. having a bit field on every row that determines whether the record is "deleted", is the way to go. That way, you simply check whether the record has deleted == true in the API, and hide it from the application.
You keep the data, but no one can retrieve it through the application.
I would say use foreign key constraints as a rule - this "safeguards" your DB design long-term, as well as data integrity itself. Constraints are there also to explicitly state a designer's decision.
I've seen constraints ditched on extremely large databases - that would be one reason not to use them, if you compare the performance and there is a significant foreign key overhead.
I'd use logical/soft delete. This basically means adding one more column (possibly bit column Deleted) to the table in question, which would mark a particular row as deleted.
That said, "deleted" data is just that: deleted. Thus it cannot logically be used in reporting and the like. To overcome this, I'd also introduce a Hidden column to hide certain rows while retaining their logical meaning.
Never do physical deletes. You can add a BOOL flag IsDeleted to indicate the record is deleted. When you want to "Delete" a record, simply set the flag to True.
When setting up foreign keys in SQL Server, under what circumstances should you have it cascade on delete or update, and what is the reasoning behind it?
This probably applies to other databases as well.
I'm looking most of all for concrete examples of each scenario, preferably from someone who has used them successfully.
Summary of what I've seen so far:
Some people don't like cascading at all.
Cascade Delete
Cascade Delete may make sense when the semantics of the relationship can involve an exclusive "is part of" description. For example, an OrderLine record is part of its parent order, and OrderLines will never be shared between multiple orders. If the Order were to vanish, the OrderLine should as well, and a line without an Order would be a problem.
The canonical example for Cascade Delete is SomeObject and SomeObjectItems, where it doesn't make any sense for an items record to ever exist without a corresponding main record.
You should not use Cascade Delete if you are preserving history or using a "soft/logical delete" where you only set a deleted bit column to 1/true.
Cascade Update
Cascade Update may make sense when you use a real key rather than a surrogate key (identity/autoincrement column) across tables.
The canonical example for Cascade Update is when you have a mutable foreign key, like a username that can be changed.
You should not use Cascade Update with keys that are Identity/autoincrement columns.
Cascade Update is best used in conjunction with a unique constraint.
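The Cascade Update case above can be sketched concretely. A minimal SQLite illustration, assuming a mutable username used as the key (the users/posts schema is invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
    CREATE TABLE users (username TEXT PRIMARY KEY);
    CREATE TABLE posts (
        id     INTEGER PRIMARY KEY,
        author TEXT NOT NULL REFERENCES users(username) ON UPDATE CASCADE
    );
    INSERT INTO users VALUES ('jsmith');
    INSERT INTO posts VALUES (1, 'jsmith');
""")

# Renaming the user propagates to every referencing row automatically
con.execute("UPDATE users SET username = 'john.smith' WHERE username = 'jsmith'")
print(con.execute("SELECT author FROM posts").fetchone()[0])  # john.smith
```

With an identity/autoincrement surrogate key the parent value never changes, which is why Cascade Update buys nothing there.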
When To Use Cascading
You may want to get an extra strong confirmation back from the user before allowing an operation to cascade, but it depends on your application.
Cascading can get you into trouble if you set up your foreign keys wrong. But you should be okay if you do that right.
It's not wise to use cascading before you understand it thoroughly. However, it is a useful feature and therefore worth taking the time to understand.
Foreign keys are the best way to ensure the referential integrity of a database. Avoiding cascades because they feel like magic is like writing everything in assembly because you don't trust the magic behind compilers.
What is bad is the wrong use of foreign keys, like creating them backwards, for example.
Juan Manuel's example is the canonical example, if you use code there are many more chances of leaving spurious DocumentItems in the database that will come and bite you.
Cascading updates are useful, for instance, when you have references to the data by something that can change, say a primary key of a users table is the name,lastname combination. Then you want changes in that combination to propagate to wherever they are referenced.
#Aidan, that clarity you refer to comes at a high cost: the chance of leaving spurious data in your database, which is not small. To me, it's usually just lack of familiarity with the DB, and the inability to find out which FKs are in place before working with it, that foster that fear. Either that, or constant misuse of cascade: using it where the entities were not conceptually related, or where history had to be preserved.
I never use cascading deletes.
If I want something removed from the database I want to explicitly tell the database what I want taking out.
Of course they are a function available in the database and there may be times when it is okay to use them, for example if you have an 'order' table and an 'orderItem' table you may want to clear the items when you delete an order.
I like the clarity that I get from doing it in code (or stored procedure) rather than 'magic' happening.
For the same reason I am not a fan of triggers either.
Something to note: if you do delete an 'order', you will get '1 row affected' reported back even if the cascaded delete has removed 50 'orderItem's.
I work a lot with cascading deletes.
It feels good to know that whoever works against the database can never leave unwanted data behind. If dependencies grow, I just change the constraints in the diagram in Management Studio and don't have to tweak stored procedures or data access code.
That said, I have one problem with cascading deletes, and that's circular references. This often leads to parts of the database that have no cascading deletes.
I do a lot of database work and rarely find cascade deletes useful. The one time I have used them effectively is in a reporting database that is updated by a nightly job. I make sure that any changed data is imported correctly by deleting any top-level records that have changed since the last import, then reimporting the modified records and anything that relates to them. It saves me from having to write a lot of complicated deletes that work from the bottom to the top of my database.
I don't consider cascade deletes to be quite as bad as triggers; they only delete data, whereas triggers can have all kinds of nasty stuff inside.
In general I avoid real Deletes altogether and use logical deletes (ie. having a bit column called isDeleted that gets set to true) instead.
One example is when you have dependencies between entities... ie: Document -> DocumentItems (when you delete Document, DocumentItems don't have a reason to exist)
ON Delete Cascade:
When you want rows in the child table to be deleted if the corresponding row in the parent table is deleted.
If cascade delete isn't used, a referential integrity error will be raised instead.
ON Update Cascade:
When you want a change to the parent's primary key to be propagated to the child's foreign key.
Use cascade delete where you would want the row holding the FK to be removed when the PK row it refers to is removed. In other words, where the record is meaningless without the record it references.
I find cascade delete useful to ensure that dead references are removed by default rather than cause null exceptions.
I have heard of DBAs and/or "Company Policy" that prohibit using "On Delete Cascade" (and others) purely because of bad experiences in the past. In one case a guy wrote three triggers which ended up calling one another. Three days to recover resulted in a total ban on triggers, all because of the actions of one idjit.
Of course sometimes triggers are needed instead of "On Delete Cascade", such as when some child data needs to be preserved. But in other cases, it's perfectly valid to use the On Delete Cascade method. A key advantage of "On Delete Cascade" is that it captures ALL the children; a custom-written trigger or stored procedure may not, if it is not coded correctly.
I believe the developer should be allowed to make the decision based upon what is being developed and what the spec says. A blanket ban based on one bad experience should not be the criterion; the "never use" thought process is draconian at best. A judgement call needs to be made each and every time, and changes made as the business model changes.
Isn't this what development is all about?
One reason to put in a cascade delete (rather than doing it in the code) is to improve performance.
Case 1: With a cascade delete
DELETE FROM table WHERE SomeDate < 7 years ago;
Case 2: Without a cascade delete
FOR EACH R IN (SELECT tableId FROM table WHERE SomeDate < 7 years ago) LOOP
    DELETE FROM ChildTable WHERE tableId = R.tableId;
    /* More child-table deletes here */
    DELETE FROM table WHERE tableId = R.tableId;
END LOOP;
Secondly, when you later add an extra child table with a cascade delete, the code in Case 1 keeps working unchanged.
I would only put in a cascade where the semantics of the relationship is "part of". Otherwise some idiot will delete half of your database when you do:
DELETE FROM CURRENCY WHERE CurrencyCode = 'USD'
I try to avoid deletes or updates that I didn't explicitly request in SQL Server, whether through cascading or through the use of triggers. They tend to bite you in the ass some time down the line, either when trying to track down a bug or when diagnosing performance problems.
Where I would use them is in guaranteeing consistency for not very much effort. To get the same effect you would have to use stored procedures.
I, like everyone else here, find that cascade deletes are really only marginally helpful (it's really not that much work to delete referenced data in other tables; if there are a lot of tables, you simply automate this with a script) but really annoying when someone accidentally cascade-deletes some important data that is difficult to restore.
The only case where I'd use them is if the data in the table is highly controlled (e.g., limited permissions) and is only updated or deleted through a controlled, verified process (like a software update).
A deletion or update to S that removes a foreign-key value found in some tuples of R can be handled in one of three ways:
Rejection
Propagation
Nullification
Propagation is referred to as cascading.
There are two cases:
‣ If a tuple in S was deleted, delete the R tuples that referred to it.
‣ If a tuple in S was updated, update the value in the R tuples that refer to it.
If you're working on a system with many different modules at different versions, cascade delete can be very helpful when the cascade-deleted items are part of / owned by the PK holder. Otherwise, all modules would require immediate patches to clean up their dependent items before deleting the PK owner, or the foreign key relation would be omitted completely, possibly leaving tons of garbage in the system if cleanup is not performed correctly.
I just introduced cascade delete for a new intersection table between two already existing tables (the cascade deletes only the intersection rows), after cascade delete had been discouraged for quite some time. It's also not too bad if that data gets lost.
It is, however, a bad thing on enum-like list tables: somebody deletes entry 13, "yellow", from the "colors" table, and all yellow items in the database get deleted with it. Also, such tables sometimes get updated in a delete-all-insert-all manner, so referential integrity gets omitted entirely. Of course it's wrong, but how will you change complex software that has been running for many years, when introducing true referential integrity risks unexpected side effects?
Another problem arises when the original foreign key values must be kept even after the primary key has been deleted. One can create a tombstone column and an ON DELETE SET NULL option for the original FK, but this again requires a trigger or specific code to maintain the redundant key value (redundant only until the PK is deleted).
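That tombstone idea can be sketched in SQLite (all names here are invented for the example), with a trigger maintaining the redundant copy while the parent row still exists:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
    CREATE TABLE color (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item (
        id         INTEGER PRIMARY KEY,
        color_id   INTEGER REFERENCES color(id) ON DELETE SET NULL,
        color_copy INTEGER   -- tombstone: survives deletion of the parent
    );
    -- the 'specific code' mentioned above: keep the redundant copy in sync
    CREATE TRIGGER item_color_copy AFTER INSERT ON item BEGIN
        UPDATE item SET color_copy = NEW.color_id WHERE id = NEW.id;
    END;
    INSERT INTO color VALUES (13, 'yellow');
    INSERT INTO item (id, color_id) VALUES (1, 13);
""")

con.execute("DELETE FROM color WHERE id = 13")
row = con.execute("SELECT color_id, color_copy FROM item WHERE id = 1").fetchone()
print(row)  # (None, 13): the live FK was nulled, the tombstone kept the old value
```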
Cascade deletes are extremely useful when implementing logical super-type and sub-type entities in a physical database.
When separate super-type and sub-type tables are used to physically implement super-types/sub-types (as opposed to rolling all sub-type attributes up into a single physical super-type table), there is a one-to-one relationship between these tables, and the issue then becomes how to keep the primary keys 100% in sync between them.
Cascade deletes can be a very useful tool to:
1) Make sure that deleting a super-type record also deletes the corresponding single sub-type record.
2) Make sure that any delete of a sub-type record also deletes the super-type record. This is achieved by implementing an "instead-of" delete trigger on the sub-type table that goes and deletes the corresponding super-type record, which, in turn, cascade deletes the sub-type record.
Using cascade deletes in this manner ensures that no orphan super-type or sub-type records ever exist, regardless of whether you delete the super-type record first or the sub-type record first.
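Point 1) can be sketched as follows, using SQLite with invented party/person tables as the super-type/sub-type pair; the instead-of delete trigger from point 2) is not shown here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
    CREATE TABLE party (id INTEGER PRIMARY KEY, name TEXT);          -- super-type
    CREATE TABLE person (                                            -- sub-type
        id  INTEGER PRIMARY KEY REFERENCES party(id) ON DELETE CASCADE,
        dob TEXT
    );
    INSERT INTO party VALUES (1, 'Ada');
    INSERT INTO person VALUES (1, '1815-12-10');
""")

# Deleting the super-type record also removes its one-to-one sub-type record
con.execute("DELETE FROM party WHERE id = 1")
print(con.execute("SELECT COUNT(*) FROM person").fetchone()[0])  # 0
```

The shared primary key doubles as the foreign key here, which is what keeps the two tables in lockstep.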
I would make a distinction between
Data integrity
Business logic/rules
In my experience it is best to enforce integrity as far as possible in the database using PK, FK, and other constraints.
However business rules/logic IMO is best implemented using code for the reason of cohesion (google "coupling and cohesion" to learn more).
Is cascade delete/update data integrity or a business rule? This could of course be debated, but I would say it is usually a rule. For example, a business rule may be that if an Order is deleted, all its OrderItems should be deleted automatically. But it could also be that it should never be possible to delete an Order while it still has OrderItems. So this may be up to the business to decide. How do we know how this rule is currently implemented? If it is all in code, we can just look at the code (high cohesion). If the rule might be implemented in the code or might be implemented as a cascade in the database, then we need to look in multiple places (low cohesion).
Of course if you go all-in with putting your business rules only in the database and use triggers, stored proc then cascade may make sense.
I usually consider database vendor lock-in before using any stored procedures or triggers. A SQL database that just stores data and enforces integrity is, IMO, easier to port to another vendor. So for that reason I usually don't use stored procedures or triggers.