SQL: How to find rows which can't be deleted

SQL: How to find rows which can't be deleted - sql-server

We have a database structure where one table is linked to by many other tables.
That linked table often contains orphaned entries that we want to delete (based on a soft delete flag). However there are occasionally rows which aren't fully orphaned (i.e. they still have a foreign key reference to them) but the soft delete flag is set.
Is there a way to identify which rows are still referenced by a foreign key so we can ignore them when finding those to delete. I believe SQL Server itself must have a mechanism to do this as it fails a delete if there is a Foreign Key reference to it.
(We can not turn on Cascade On Delete, and it's not practical to iterate over all of the tables referencing this one to find out from that direction).
Edit
Those foreign keys can change over time. I am trying to investigate what is technically possible, as building a static query that relies on 'external' information about the rest of the db structure has it's own downsides. That still might be the best approach but I don't want to rule out other options just because I haven't been able to think of them.

Related

Self foreign key may cause cycles or multiple cascade paths [duplicate]

In SQL Server 2005 I just struck the infamous error message:
Introducing FOREIGN KEY constraint XXX on table YYY may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
Now, StackOverflow has several topics about this error message, so I've already got the solution (in my case I'll have to use triggers), but I'm curious as to why there is such a problem at all.
As I understand it, there are basically two scenarios that they want to avoid - a cycle and multiple paths. A cycle would be where two tables have cascading foreign keys to each other. OK, a cycle can span several tables too, but this is the basic case and will be easier to analyze.
Multiple paths would be when TableA has foreign keys to TableB and TableC, and TableB also has a foreign key to TableC. Again - this is the minimum basic case.
I cannot see any problems that would arise when a record would get deleted or updated in any of those tables. Sure, you might need to query the same table multiple times to see which records need updating/deleting, but is that really a problem? Is this a performance issue?
In other SO topics people go as far as to label using cascades as "risky" and state that "resolving cascade paths is a complex problem". Why? Where is the risk? Where is the problem?

You have a child table with 2 cascade paths from the same parent: one "delete", one "null".
What takes precedence? What do you expect afterwards? etc
Note: A trigger is code and can add some intelligence or conditions to a cascade.

The reason we forbid using cascade delete has to do with performance and locking. Yes it's not so bad when you delete one record but sooner or later you will need to delete a large group of records and your database will comes to a standstill.
If you are deleting enough records, SQL Server might escalate to a table lock and no one can do anything with the table until it is finished.
We recently moved one of our clients to his own server. As part of the deal we also then had to delete all of that client's records form our original server. Deleting all his information in batches (so as not to cause problems with other users) took a couple of months. If we had cascade delete set up, the database would have been inaccessible to the other clients for a long time as millions of records were deleted in one transaction and hundreds of tables were locked until the transaction was done.
I could also see a scenario where a deadlock might have occured in using cascade delete because we have no control over the order the cascade path would have taken and our database is somewhat denormalized with clientid appearing in most tables. So if it locked the one table that had a foreign key also to a third table as well as the client table that was in a differnt path, it possibly couldn't check that table in order to delete from the third table because this is all one transaction and the locks wouldn't be released until it was done. So possibly it wouldn't have let us set up cascade deletes if it saw the possibility of creating deadlocks in the transaction.
Another reason to avoid cascading deletes is that sometimes the existence of a child record is reason enough not to delete the parent record. For instance, if you have a customer table and that customer has had orders in the past, you would not want to delete him and lose the information on the actual order.

Consider a table of employees:
CREATE TABLE Employee
(
EmpID INTEGER NOT NULL PRIMARY KEY,
Name VARCHAR(40) NOT NULL,
MgrID INTEGER NOT NULL REFERENCES Employee(EmpID) ON DELETE CASCADE
);
INSERT INTO Employees( 1, "Bill", 1);
INSERT INTO Employees( 23, "Steve", 1);
INSERT INTO Employees(234212, "Helen", 23);
Now suppose Bill retires:
DELETE FROM Employees WHERE Name = "Bill";
Ooooppps; everyone just got sacked!
[We can debate whether the details of the syntax are correct; the concept stands, I think.]

I think the problem is that when you make one path "ON DELETE CASCADE" and the other "ON DELETE RESTRICT", or "NO ACTION" the outcome (the result) is unpredicable. It depends on which delete-trigger (this is also a trigger, but one you don't have to build yourself) will be executed first.

I agree with that cascades being "risky" and should be avoided. (I personally prefer cascading the changes manually rather that having sql server automatically take care of them). This is also because even if sql server deleted millions of rows, the output would still show up as
(1 row(s) affected)

I think whether or not to use a ON DELETE CASCADE option is a question of the business model you are implementing. A relationship between two business objects could be either a simple "association", where both ends of the relationship are related, but otherwise independent objects the lifecycle of which are different and controlled by other logic. There are, however, also "aggregation" relationships, where one object might actually be seen as the "parent" or "owner" of a "child" or "detail" object. There is the even stronger notion of a "composition" relationship, where an object solely exists as a composition of a number of parts.
In the "association" case, you usually won't declare an ON DELETE CASCADE constraint. For aggregations or compositions, however, the ON DELETE CASCADE helps you mapping your business model to the database more accurately and in a declarative way.
This is why it annoys me that MS SQL Server restricts the use of this option to a single cascade path. If I'm not mistaken, many other widely used SQL database systems do not impose such restrictions.

Deleting foreign key referenced ID

I have a large database, and as expected, with a lot of foreign keys referencing tables. From a database design perspective, how should I handle the deletion of a record that is referenced by a foreign key?
One option I thought of was adding a boolean column to the table which determines whether the record is active or not. So if I was to delete a record, I'd just set its boolean active value to false.
Database may end up being bloated, but then not only will all the referenced foreign keys remain unchanged, the database will hold more information.
I would like to hear your thoughts on this matter regarding a system critical database.

As far as I understood your question, you have 2 tables:
| main | | child |
|-------| |---------------|
|id|data| (1) ----> (n) |id|main_id (FK)|
And you don't want to delete the data from main table, when there are records in the child table.
You didn't say, what RDBMS you use. But in MySQL you can set up the foreign key type. If you set it to RESTRICT, then the system won't allow you to delete the data from the main table, if there is data in child table.
Or you can set it to CASCADE, then when you delete the data in main table, it will be automatically deleted from the child table.
So there is no need to create additional 'active' field.

The answer to Your question is highly dependent on Your application.
If You need "historic" data, then using an "enabled"-flag seems the right choice. However, if there is sensitive data in Your database and You want to ensure that deleted data is not too easy to recover, then an "enabled"-flag is a no-go.
Other aspects:
Do You need "undelete"?
How often do delete operations occur? Many deletes create many disabled entries.
Is there an easy way (e.g., database triggers) to ensure that entries referencing disabled
entries are disabled as well?
Do You have processes or mechanisms to ensure that applications will only see/handle enabled entries? Consider views.
Consider creating proper indices to speed up looking for enabled entries and need a lot of space.
Do You have requirements (technical, organizational, legal) for finally removing
deleted entries? Do You need the deletion date for that purpose?
Is there a need for a cleanup-script?
I know, a lot of questions. However, I hope those questions will help You to find an answer to Your question.

Cascade Delete Use Case

I am pretty new to Business Analysis. I have to write requirements that show both (for now) cascade delete (for two tables) and the rest of the tables will delete explicitly.
I need some guidance for how to write the requirements for cascade deletion.

Delete child entities on parent
deletion.
Delete collection members if collection entity is deleted.
Actually it is hard to understand the task without context and also it smells like university/colledge homework (we had one very similar to this).

Use the ON DELETE CASCADE option to specify whether you want rows deleted in a child table when corresponding rows are deleted in the parent table. If you do not specify cascading deletes, the default behavior of the database server prevents you from deleting data in a table if other tables reference it.
If you specify this option, later when you delete a row in the parent table, the database server also deletes any rows associated with that row (foreign keys) in a child table. The principal advantage to the cascading-deletes feature is that it allows you to reduce the quantity of SQL statements you need to perform delete actions.
For example, the all_candy table contains the candy_num column as a primary key. The hard_candy table refers to the candy_num column as a foreign key. The following CREATE TABLE statement creates the hard_candy table with the cascading-delete option on the foreign key:
CREATE TABLE all_candy
(candy_num SERIAL PRIMARY KEY,
candy_maker CHAR(25));
CREATE TABLE hard_candy
(candy_num INT,
candy_flavor CHAR(20),
FOREIGN KEY (candy_num) REFERENCES all_candy
ON DELETE CASCADE)
Because ON DELETE CASCADE is specified for the dependent table, when a row of the all_candy table is deleted, the corresponding rows of the hard_candy table are also deleted. For information about syntax restrictions and locking implications when you delete rows from tables that have cascading deletes, see Considerations When Tables Have Cascading Deletes.
Source: http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.sqls.doc/sqls292.htm

You don't write use cases for functionality - that is the reason why it is hard to properly answer your question - we don't know the actor who interacts with the system and of course we know nothing about the system, so we cannot tell you how to write description of their interactions.
You should write your use cases first and from them derive the functionality.

Deleting Database Rows and their References-Best Practices

How do I go about deleting a row that is referenced by many other tables, either as a primary key or as a foreign key?
Do I need to delete each reference in the appropriate order, or is there an 'auto' way to perform this in, for example, linq to sql?

If you're performing all of your data access through stored procedures then your delete stored procedure for the master should take care of this. You need to maintain it when you add a new related table, but IMO that requires you to think about what you're doing, which is a good thing.
Personally, I stay away from cascading deletes. It's too easy to accidentally delete a slew of records when the user should have been warned about existing children instead.

Many times the best way to delete something in a database is to just "virtually" delete it by setting an IsDeleted column, and then ignoring the row in all other queries.
Deletes can be very expensive for heavily linked tables, and the locks can cause other queries to fail while the delete is happening.
You can just leave the "IsDeleted" rows in the system forever (which might be helpful for auditing), or go back and delete them for real when the system is idle.

if you have the foreign keys set with ON DELETE CASCADE, it'll take care of pruning your database with just DELETE master WHERE id = :x

When/Why to use Cascading in SQL Server?

When setting up foreign keys in SQL Server, under what circumstances should you have it cascade on delete or update, and what is the reasoning behind it?
This probably applies to other databases as well.
I'm looking most of all for concrete examples of each scenario, preferably from someone who has used them successfully.

Summary of what I've seen so far:
Some people don't like cascading at all.
Cascade Delete
Cascade Delete may make sense when the semantics of the relationship can involve an exclusive "is part of" description. For example, an OrderLine record is part of its parent order, and OrderLines will never be shared between multiple orders. If the Order were to vanish, the OrderLine should as well, and a line without an Order would be a problem.
The canonical example for Cascade Delete is SomeObject and SomeObjectItems, where it doesn't make any sense for an items record to ever exist without a corresponding main record.
You should not use Cascade Delete if you are preserving history or using a "soft/logical delete" where you only set a deleted bit column to 1/true.
Cascade Update
Cascade Update may make sense when you use a real key rather than a surrogate key (identity/autoincrement column) across tables.
The canonical example for Cascade Update is when you have a mutable foreign key, like a username that can be changed.
You should not use Cascade Update with keys that are Identity/autoincrement columns.
Cascade Update is best used in conjunction with a unique constraint.
When To Use Cascading
You may want to get an extra strong confirmation back from the user before allowing an operation to cascade, but it depends on your application.
Cascading can get you into trouble if you set up your foreign keys wrong. But you should be okay if you do that right.
It's not wise to use cascading before you understand it thoroughly. However, it is a useful feature and therefore worth taking the time to understand.

Foreign keys are the best way to ensure referential integrity of a database. Avoiding cascades due to being magic is like writing everything in assembly because you don't trust the magic behind compilers.
What is bad is the wrong use of foreign keys, like creating them backwards, for example.
Juan Manuel's example is the canonical example, if you use code there are many more chances of leaving spurious DocumentItems in the database that will come and bite you.
Cascading updates are useful, for instance, when you have references to the data by something that can change, say a primary key of a users table is the name,lastname combination. Then you want changes in that combination to propagate to wherever they are referenced.
#Aidan, That clarity you refer to comes at a high cost, the chance of leaving spurious data in your database, which is not small. To me, it's usually just lack of familiarity with the DB and inability to find which FKs are in place before working with the DB that foster that fear. Either that, or constant misuse of cascade, using it where the entities were not conceptually related, or where you have to preserve history.

I never use cascading deletes.
If I want something removed from the database I want to explicitly tell the database what I want taking out.
Of course they are a function available in the database and there may be times when it is okay to use them, for example if you have an 'order' table and an 'orderItem' table you may want to clear the items when you delete an order.
I like the clarity that I get from doing it in code (or stored procedure) rather than 'magic' happening.
For the same reason I am not a fan of triggers either.
Something to notice is that if you do delete an 'order' you will get '1 row affected' report back even if the cascaded delete has removed 50 'orderItem's.

I work a lot with cascading deletes.
It feels good to know whoever works against the database might never leave any unwanted data. If dependencies grow I just change the constraints in the diagramm in Management Studio and I dont have to tweak sp or dataacces.
That said, I have 1 problem with cascading deletes and thats circular references. This often leads to parts of the database that have no cascading deletes.

I do a lot of database work and rarely find cascade deletes useful. The one time I have used them effectively is in a reporting database that is updated by a nightly job. I make sure that any changed data is imported correctly by deleting any top level records that have changed since the last import, then reimport the modified records and anything that relates to them. It save me from having to write a lot of complicated deletes that look from the bottom to the top of my database.
I don't consider cascade deletes to be quite as bad as triggers as they only delete data, triggers can have all kinds of nasty stuff inside.
In general I avoid real Deletes altogether and use logical deletes (ie. having a bit column called isDeleted that gets set to true) instead.

One example is when you have dependencies between entities... ie: Document -> DocumentItems (when you delete Document, DocumentItems don't have a reason to exist)

ON Delete Cascade:
When you want rows in child table to be deleted If corresponding row is deleted in parent table.
If on cascade delete isn't used then an error will be raised for referential integrity.
ON Update Cascade:
When you want change in primary key to be updated in foreign key

Use cascade delete where you would want the record with the FK to be removed if its referring PK record was removed. In other words, where the record is meaningless without the referencing record.
I find cascade delete useful to ensure that dead references are removed by default rather than cause null exceptions.

I have heard of DBAs and/or "Company Policy" that prohibit using "On Delete Cascade" (and others) purely because of bad experiences in the past. In one case a guy wrote three triggers which ended up calling one another. Three days to recover resulted in a total ban on triggers, all because of the actions of one idjit.
Of course sometimes Triggers are needed instead of "On Delete cascade", like when some child data needs to be preserved. But in other cases, its perfectly valid to use the On Delete cascade method. A key advantage of "On Delete cascade" is that it captures ALL the children; a custom written trigger/store procedure may not if it is not coded correctly.
I believe the Developer should be allowed to make the decision based upon what the development is and what the spec says. A carpet ban based on a bad experience should not be the criteria; the "Never use" thought process is draconian at best. A judgement call needs to be made each and every time, and changes made as the business model changes.
Isn't this what development is all about?

One reason to put in a cascade delete (rather than doing it in the code) is to improve performance.
Case 1: With a cascade delete
DELETE FROM table WHERE SomeDate < 7 years ago;
Case 2: Without a cascade delete
FOR EACH R IN (SELECT FROM table WHERE SomeDate < 7 years ago) LOOP
DELETE FROM ChildTable WHERE tableId = R.tableId;
DELETE FROM table WHERE tableId = R.tableid;
/* More child tables here */
NEXT
Secondly, when you add in an extra child table with a cascade delete, the code in Case 1 keeps working.
I would only put in a cascade where the semantics of the relationship is "part of". Otherwise some idiot will delete half of your database when you do:
DELETE FROM CURRENCY WHERE CurrencyCode = 'USD'

I try to avoid deletes or updates that I didn't explicitly request in SQL server.
Either through cascading or through the use of triggers. They tend to bite you in the ass some time down the line, either when trying to track down a bug or when diagnosing performance problems.
Where I would use them is in guaranteeing consistency for not very much effort. To get the same effect you would have to use stored procedures.

I, like everyone else here, find that cascade deletes are really only marginally helpful (it's really not that much work to delete referenced data in other tables -- if there are lot of tables, you simply automate this with a script) but really annoying when someone accidentally cascade deletes some important data that is difficult to restore.
The only case where I'd use is if the data in the table table is highly controlled (e.g., limited permissions) and only updated or deleted from through a controlled process (like a software update) that has been verified.

A deletion or update to S that removes a foreign-key value found in some tuples of R can be handled in one of three ways:
Rejection
Propagation
nullification.
Propagation is referred to as cascading.
There are two cases:
‣ If a tuple in S was deleted, delete the R tuples that referred to it.
‣ If a tuple in S was updated, update the value in the R tuples that refer to it.

If you're working on a system with many different modules in different versions, it can be very helpful, if the cascade deleted items are part of / owned by the PK holder. Else, all modules would require immediate patches to clean up their dependent items before deleting the PK owner, or the foreign key relation would be omitted completely, possibly leaving tons of garbage in the system if cleanup is not performed correctly.
I just introduced cascade delete for a new intersection table between two already existing tables (the intersection to delete only), after cascade delete had been discouraged from for quite some time. It's also not too bad if data gets lost.
It is, however, a bad thing on enum-like list tables: somebody deletes entry 13 - yellow from table "colors", and all yellow items in the database get deleted. Also, these sometimes get updated in a delete-all-insert-all manner, leading to referential integrity totally omitted. Of course it's wrong, but how will you change a complex software which has been running for many years, with introduction of true referential integrity being at risk of unexpected side effects?
Another problem is when original foreign key values shall be kept even after the primary key has been deleted. One can create a tombstone column and an ON DELETE SET NULL option for the original FK, but this again requires triggers or specific code to maintain the redundant (except after PK deletion) key value.

Cascade deletes are extremely useful when implementing logical super-type and sub-type entities in a physical database.
When separate super-type and sub-type tables are are used to physically implement super-types/sub-types (as opposed to rolling up all sub-type attributes into a single physical super-type table), there is a one-to-one relationship between these tables and the issue then becomes how to keep the primary keys 100% in sync between these tables.
Cascade deletes can be a very useful tool to:
1) Make sure that deleting a super-type record also deletes the corresponding single sub-type record.
2) Make sure that any delete of a sub-type record also deletes the super-type record. This is achieved by implementing an "instead-of" delete trigger on the sub-type table that goes and deletes the corresponding super-type record, which, in turn, cascade deletes the sub-type record.
Using cascade deletes in this manner ensures that no orphan super-type or sub-type records ever exist, regardless of whether you delete the super-type record first or the sub-type record first.

I would make a distinction between
Data integrity
Business logic/rules
In my experience it is best to enforce integrity as far as possible in the database using PK, FK, and other constraints.
However business rules/logic IMO is best implemented using code for the reason of cohesion (google "coupling and cohesion" to learn more).
Is cascade delete/update data integrity or business rules? This could of course be debated but I would say it is usually a logic/rule. For example a business rule may be that if an Order is deleted all OrderItems should be automatically deleted. But it could also be that it should never be possible to delete an Order if it still have OrderItems. So this may be up to the business to decide. How do we know how this rule is currently implemented? If it is all in code we can just look at the code (high cohesion). If the rule is maybe implemented in the code or maybe implemented as cascade in the database then we need to look in multiple places (low cohesion).
Of course if you go all-in with putting your business rules only in the database and use triggers, stored proc then cascade may make sense.
I usually consider database vendor lock-in before using any stored proc or triggers. A SQL database that just stores data and enforces integrity is IMO easier to port to another vendor. So for that reason I usually don't use stored proc or triggers.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight