Database: To delete or not to delete records - database

I don't think I am the only person wondering about this. What do you usually practice about database behavior? Do you prefer to delete a record from the database physically? Or is it better to just flag the record with a "deleted" flag or a boolean column to denote the record is active or inactive?

It definitely depends on the actual content of your database. If you're using it to store session information, then by all means wipe it immediately when the session expires (or is closed), you don't want that garbage lying around. As it cannot really be used again for any practical purposes.
Basically, what you need to ask yourself, might I need to restore this information? Like deleted questions on SO, they should definitely just be marked 'deleted', as we're actively allowing an undelete. We also have the option to display it to select users as well, without much extra work.
If you're not actively seeking to fully restore the data, but you'd still like to keep it around for monitoring (or similar) purposes. I would suggest that you figure out (to the extent possible of course) an aggregation scheme, and shove that off to another table. This will keep your primary table clean of 'deleted' data, as well as keep your secondary table optimized for monitoring purposes (or whatever you had in mind).
For temporal data, see: http://talentedmonkeys.wordpress.com/2010/05/15/temporal-data-in-a-relational-database/

Pros of using a delete flag:
You can get the data back later if you need it,
Delete operation (updating the flag) is probably quicker than really deleting it
Cons of using a delete flag:
It is very easy to miss AND DeletedFlag = 'N' somewhere in your SQL
Slower for the database to find the rows that you are interested in amongst all the crap
Eventually, you'll probably want to really delete it anyway (assuming your system is successful. What about when that record is 10 years old and it was "deleted" 4 minutes after originally created)
It can make it impossible to use a natural key. You may have one or more deleted rows with the natural key and a real row wanting to use that same natural key.
There may be legal/compliance reasons why you are meant to actually delete data.

As a complement to all posts...
However, if you plan to mark the record, its good to consider making a view, for active records. This would save you from writing or forgetting the flag in your SQL query. You might consider a view for non-active records too, if you think that also would serve some purpose.

I am glad to have found this thread. I too was wondering what people thought about this issue. I have implemented the 'marked as deleted' for about 15 years on many systems. Whenever a user would call to say something was accidentally deleted it was certainly a lot easier to mark it un-deleted than recreate it or restore from a backup.
We are using postgresql and Ruby on rails it looks like we could do this in 1 of two ways, modify rails or add an ondelete trigger and does instead a pl/pgsql function to mark as deleted. I am leaning toward the latter.
As for performance hits, it will be interesting to see the results of EXPLAIN-ANALYZE on large tables to few deleted items as well as many deleted items.
In systems used over time I have found, new users tend to do silly things like delete things accidentally. So when people are new in a position they have all the access rights of the person previously in that position except with zero experience. Accidentally deleting something and being able to quickly recover gets everyone back to work quickly.
But as someone said, sometimes you may need that particular key back for some reason, at that point you would need to really delete it, then re-create the records (on undelete it and modify the record).

I mark them as deleted, and don't really delete. However every once in a while I sweep out all the junk and archive it, so it doesn't kill performance.

There are also legal issues either way if personal data is involved. I think it greatly depends on where you are (or where the database is), and what the terms of use are.
In some cases people can ask to be removed from your system, in which case a hard delete is needed (or at least clearing out all of the personal information).
I would check with your legal department before you adopt a strategy either way if personal information is involved.

If you are concerned about "dormant" records slowing down your database access, you may want to move those rows into another table acting as an "archive" table.

For user-entered/managed data I've used the flag method you describe and given the user an "empty the trash" interface to actually delete items if they choose to.

I have a database with lots of dependencies. Hence, I cannot delete some records because others still depend on the data. This is what I usually do; I try to delete the data, if it works, I know it didn't have any dependencies and didn't matter. If it doesn't, I catch the error and flag it as inactive:
try
{
_context.SomeTable.Remove(someEntity);
await _context.SaveChangesAsync();
}
catch (DbUpdateException ex) when (ex.InnerException is SqlException && (ex.InnerException as SqlException).Number == 547)
{
// Mark as inactive
someEntity.Active = false;
await _context.SaveChangesAsync();
}

Related

Postgres: How to clear transaction ID for anonymity and data reduction

We're running an evaluation platform where users can comment on certain things. A key feature is that people can comment only once, and every comment is made in anonymity.
We're using Postgres for all our data. We want to save a flag in the database that a user created a comment (so they cannot comment again). In a separate table but within the same transaction, we want to save the comment itself without any link to the user.
However, postgres saves the transaction ID of every tuple inserted into the database (xmin of the system columns). So now there's a link between the user and their comment which we have to avoid!
Possible (Non)Solutions
Vacuuming alone does not help as it does not clear the transaction ID. See the "Note" box in the "24.1.5. Preventing Transaction ID Wraparound Failures" section in the postgres docs.
Putting those inserts in different transactions, doesn't really solve anything since transaction IDs are consecutive.
We could aggregate comments from multiple users to one large text in the database with some separators, but since old versions of this large text would be kept by postgres at least until the next vacuum, that doesn't seem like a full solution. Also, we'd still have the order of when the user added their comment, which would be nice to not save as well.
Re-writing all the tuples in those tables periodically (by a dummy UPDATE to all of them), followed by a vacuum would probably erase the "insert history" sufficiently, but that too seems like a crude hack.
Is there any other way within postgres to make it impossible to reconstruct the insertion history of a table?
Perhaps you could use something like dblink or postgres_fdw to write to tables using a remote connection (either to the current database or another database), and thereby separate xmin values, even though you as a user think you are doing it all in the "same transaction."
Regarding the concerns about tracking via reverse-engineering sequential xmin values, since dblink is asynchronous, this issue may become moot at scale, when many users are simultaneously adding comments to the system. This might not work if you need to be able to rollback after encountering an error—it really depends on how important it is for you to confine the operations into one transaction.
I don't think there is a problem.
In your comment you write that you keep a flag with the user (however exactly you store it) that keeps track of which posting the user commented on. To keep that information private, you have to keep that flag private so that nobody except the user itself could read it.
If no other user can see that information, then no other user can see the xmin on the corresponding table entries. Then nobody could make a correlation with the xmin on the comment, so the problem is not there.
The difficult part is how you want to keep the information private which postings a user commented on. I see two ways:
Don't use database techniques to do it, but write the application so that it hides that information from the users.
Use PostgreSQL Row Level Security to do it.
There is no way you can keep the information from a superuser. Don't even try.
You could store the users with their flags and the comments on different database clusters (and use distributed transactions), then the xmins would be unrelated.
Make sure to disable track_commit_timestamp.
To make it impossible to correlate the transactions in the databases, you could issue random
SELECT txid_current();
which do nothing but increment the transaction counter.

Is it OK to ignore DbUpdateConcurrencyException?

I have a table that effectively implements a cache - we put stuff in, we take stuff out. It has a last_accessed field that indicates when it was inserted or read, and it has a max_age field indicating how long to keep the record.
We want to purge old records, rather than having them accumulate. Seemed to me that the easiest way to handle that was just to call a purge routine, every time we accessed a record.
This is simple enough to do, in EF:
var oldCacheItems =
this.xxDbContext.XXtokencaches.Where(
t => EntityFunctions.AddMinutes(t.lastAccessed, t.expirationMinutes) < DateTime.Now);
this.xxDbContext.XXtokencaches.RemoveRange(oldCacheItems);
this.xxDbContext.SaveChanges();
Works fine, alone. But when I have multiple clients running, I occasionally get DbUpdateConcurrencyExceptions:
System.Data.Entity.Infrastructure.DbUpdateConcurrencyException was unhandled by user code
HResult=-2146233087
Message=Store update, insert, or delete statement affected an unexpected number of rows (0). Entities may have been modified or deleted since entities were loaded. Refresh ObjectStateManager entries.
Source=EntityFramework
Browsing around, I see a lot of discussion about concurrency issues in EF. But in this case, I'm not sure any of it matters. In essence, this particular error is EF complaining that it can't delete the records because someone else already deleted them.
So I'm considering simply catching DbUpdateConcurrencyException and ignoring it.
Are there any possible consequences to this that I might not have considered?
Or some alternative approach that would eliminate the problem?
I think it depends on your business logic. You can assume that database version wins over yours.

GUID VS Auto Increment. (In comfortably wise)

A while a go, my sysadmin restored my database by mistake to a much earlier point.
After 3 hours we noticed this, and during this time 80 new rows (auto increment with foreign keys dependency) were created.
So at this point we had 80 different customers with the same ids in two tables that needed to be merged.
I dont remember how but we resolved this but it took a long time.
Now, I am designing a new database and my first thought is to use a GUID index even though this use case is rare.
My question: How do you get along with such long string as your ID?
I mean, when 2 programmers are talking about a customer, it is possible to say:
"Hey. We have a problem with client 874454".
But how do you keep it as simple with GUID, This is really a problem that can cause some trouble and dis-communications.
Thanks
GUIDs can create more problems than they solve if you are not using replication. First,you need to make sure they aren't the clustered index (which is the default for the PK in SQL Server at least) because you can really slow down insert performance. Second they are longer than ints and thus take up not only more space but make joins slower. Every join in every query.
You are going to create a bigger problem trying to solve a rare occurance. Instead think of ways to set things up so that you don't take hours to recover from a mistake.
You could create an auditing solution. That way you can easily recover from all sorts of missteps. And write the code in advance to do the recovering. Then it is relatively easy to fix when things go wrong. Frankly I would never allow a database that contains company critical data to be set up without some form of auditing. It's just too dangerous not to.
Or you could even have a script ready to go to move records to a temporary place and then reinsert them with a new identity (and update the identities on the child records to the new one). You did this once, the dba should have created a script (and put it in source control) so it is available the next time you need to do a similar fix. If your dba is so incompetent he doesn't create and save these sort of scripts, then get rid of him and hire someone who knows what he is doing.
just show a prefix in most views. That's what DVCSs do, since most of them identify most objects by a hexcoded hash.
(OTOH, I know it's fashionable in many circles to use UUIDs for primary keys; but it would take a lot more than a few scary stories to convince me)

Delete data or just flag it as deleted?

I'm building a website that lets people create vocabulary lessons. When a lesson is created, a news items is created that references the lesson. When another user practices the lesson, the user also stores a reference to it together with the practice result.
My question is what to do when a user decides to remove the lesson?
The options I've considered are:
Actually delete the lesson from
the database and remove all
referencing news items, practise
results etc.
Just flag it as deleted and
exclude the link from referencing
news items, results etc.
What are your thoughts? Should data never be removed, ala Facebook? Should references be avoided all together?
By the way, I'm using Google App Engine (python/datastore). A db.ReferenceProperty is not set to None when the referenced object is deleted as far as I can see?
Thanks!
Where changes to data need to be audited, marking data as deleted (aka "soft deletes") helps greatly particularly if you record the user that actioned the delete and the time when it occurred. It also allows data to be "un-deleted" very easily.
Having said that there is no reason to prevent "hard deletes" (where data is actually deleted) as an administrative function to help tidy up mistakes.
Marking the data as "deleted" is simplest. If you currently have no use for it, this keeps everything in your database very tidy and makes it easy to add new functionality.
On the other hand, if you're doing something like showing the user where their "vocabulary points" came from, or how many lessons they've completed, then the reference to soft deleted items might be necessary.
I'd start with the first one and change it later if you need to. Here's why:
If you're not using soft deletes, assume they won't work in the way that future requests actually want them to. You'll have to rewrite them anyway.
If you are using them, assume that nobody is using the feature which uses them. Now you've done a lot of work and tied yourself into maintenance of something nobody cares about.
If you create them, you'll find yourself creating a feature to use them. See the above.
If you don't create them, you can always create them later, once you have better knowledge about what the users of your system really want.
Not creating soft deletes gives you more options going forward. Options have value. Options expire. Never commit early unless you know why.

Database design question. BIT column for deletions

Part of my table design is to include a IsDeleted BIT column that is set to 1 whenever a user deletes a record. Therefore all SELECTS are inevitable accompanied by a WHERE IsDeleted = 0 condition.
I read in a previous question (I cannot for the love of God re-find that post and reference it) that this might not be the best design and an 'Audit Trail' table might be better.
How are you guys dealing with this problem?
Update
I'm on SQL Server. Solutions for other DB's are welcome albeit not as useful for me but maybe for other people.
Update2
Just to encapsulate what everyone said so far. There seems to be basically 3 ways to deal with this.
Leave it as it is
Create an audit table to keep track of all the changes
Use of views with WHERE IsDeleted = 0
Therefore all SELECTS are inevitable accompanied by a WHERE IsDeleted = 0 condition.
This is not a really good way to do it, as you probably noticed, it is quite error-prone.
You could create a VIEW which is simply
CREATE VIEW myview AS SELECT * FROM yourtable WHERE NOT deleted;
Then you just use myview instead of mytable and you don't have to think about this damn column in SELECTs.
Or, you could move deleted records to a separate "archive" table, which, depending on the proportion of deleted versus active records, might make your "active" table a lot smaller, better cached in RAM, ie faster.
If you have to have this kind of Deleted Bit column, then you really should consider setting up some VIEWs with the WHERE clause in it, and use those rather than the underlying tables. Much less error prone.
For example, if you have this view:
CREATE VIEW [Current Product List] AS
SELECT ProductID,ProductName
FROM Products
WHERE Discontinued=No
Then someone who wants to see current products can simply write:
SELECT * FROM [Current Product List]
This is much less error prone than writing:
SELECT ProductID,ProductName
FROM Products
WHERE Discontinued=No
As you say, people will forget that WHERE clause, and get confusing and incorrect results.
P.S. the example SQL comes from Microsoft's Northwind database. Normally I would recommend NOT using spaces in column and table names.
We're actively using the "Deleted" column in our enterprise software. It is however a source of constant errors when forgetting to add "WHERE Deleted = 0" to an SQL query.
Not sure what is meant by "Audit Trail". You may wish to have a table to track all deleted records. Or there may be an option of moving the deleted content to paired tables (like Customer_Deleted) to remove the passive content from tables to minimize their size and optimize performance.
A while ago there was some blog uproar on this issue, Ayende and Udi Dahan both posted on this.
Nai this is totally up to you.
Do you need to be able to see who has deleted / modified / inserted what and when? If so, you should design the tables for this and adjust your procs to write these values when they are called.
If you dont need an audit trail, dont waste time with one. Just do as you are with IsDeleted.
Personally, I flag things right now, as an audit trail wasn't specified in my spec, but that said, I don't like to actually delete things. Hence, I chose to flag it. I'm not going to waste a clients time writing something they diddn't request. I wont mess about with other tables because that's another thing for me to think about. I'd just make sure my index's were up to the job.
Ask your manager or client. Plan out how long the audit trail would take so they can cost it and let them make the decision for you ;)
Udi Dahan said this:
Model the task, not the data
Looking back at the story our friend from marketing told us, his intent is to discontinue the product – not to delete it in any technical sense of the word. As such, we probably should provide a more explicit representation of this task in the user interface than just selecting a row in some grid and clicking the ‘delete’ button (and “Are you sure?” isn’t it).
As we broaden our perspective to more parts of the system, we see this same pattern repeating:
Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late.
Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled.
Jobs aren’t deleted – they’re filled (or their requisition is revoked).
In all cases, the thing we should focus on is the task the user wishes to perform, rather than on the technical action to be performed on one entity or another. In almost all cases, more than one entity needs to be considered.
If you have Oracle DB, then you can use audit trail for auditing. Check the AUDIT VAULT tool form OTN, here. It even supports SQL Server.
Views (or stored procs) to get at the underlying table data are the best way. However, if you have the problem with "too many cooks in the kitchen" like we do (too many people have rights to the data and may just use the table without knowing enough to use the view/proc) you should try using another table.
We have a complete mimic of the base table with a few extra columns for tracking. So Employee table has an EmployeeDeleted table with the same schema but extra columns for when it was deleted and who deleted it and sometimes even the reason for deletion. You can even get fancy and have triggers do the insertion directly instead of going through applications/procs.
Biggest Advantage: no flag to worry about during selects
Biggest Disadvantage: any schema changes to the base table also have to be made on the "deleted" table
Best for: situations where for whatever reason (usually political with us) many not-as-experienced people have rights to the data but still expect it to be accurate without having to understand flags or schemas, etc
I've used soft deletes before on a number of applications I've worked on, and overall it's worked out quite well. Yes, there is the issue of always having to remember to add AND IsActive = 1 to all of your SELECT queries, but really that's not so bad. You can create views if you don't want to have to remember to always do that.
The reason we've done this is because we had very specific business needs to be able to report on records that have been deleted. The reporting needs varied widely - sometimes they'd need to see just the active records, or just the inactive records, or sometimes a mix of both - so pushing all the deleted records into an audit table wasn't a very good option.
So, depending on your particular business needs, I think this approach is certainly a viable option.

Resources