When to implement soft delete logic in the code over the database?


When I want to soft delete resources as a policy of my company I can do it in one of two places.
I can do it in my database with some "instead of DELETE" trigger. Like so:
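-- Create the function first: CREATE TRIGGER needs it to exist already.
CREATE FUNCTION resource_soft_delete() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
  -- Flag the row instead of removing it.
  UPDATE resource SET deleted_at = now() WHERE id = OLD.id;
  -- Returning NULL from a BEFORE trigger cancels the DELETE for this row.
  RETURN NULL;
END;
$$;
CREATE TRIGGER prevent_resource_delete
BEFORE DELETE ON resource
FOR EACH ROW EXECUTE PROCEDURE resource_soft_delete();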
That's how pretty much every article about soft deletes suggests doing it, other than articles written by an ORM's maintainers, who have their own in-house solution.
I like this approach. The logic in my APIs looks like I am just deleting the resource.
Resource.query().deleteById(id); // Using a query builder
db.query('DELETE FROM resource WHERE id = $1;', [id]); // Using native library
To me it seems more natural and I don't have to worry about other developers accidentally hard deleting stuff. But it can also be confusing to those who don't know what is actually going on. And having any logic in the database means I can have bugs there (soft deleting logic is usually dead simple, but still...), which would be hard to debug. At least compared to those in my APIs.
But also I can instead have the logic in the APIs themselves. Keeping logic next to the other logic. Less elegant but more straightforward. No hidden logic somewhere else. I do lose the protection from people accidentally hard deleting resources.
Resource.query().findById(id).patch({deleted_at: new Date()}); // Using a query builder
db.query('UPDATE resource SET deleted_at = now() WHERE id = $1;', [id]); // Using native library
I am inclined to choose the former option as I consider the choice of whether to soft delete a database matter. The database chooses what to do with deleted data. Deleted data, soft or hard, is in principle not part of the application anymore. The APIs can't retrieve it. It is for me, the developer, to use for analytics, legal reasons or to manually aid a user who wants to recover something he/she considers lost.
But I don't like the downsides. I just talked to a colleague who was worried because he thought we were actually deleting stuff. Now, that could be solved with better onboarding and documentation. But should it be like that?
When to implement soft delete logic in the code over the database? Why does every article I find directly suggest the database without even considering the code? It looks like there is a strong reason I can't find.

In my opinion there isn't any strong reason; it is up to the architect and the developers where to put the logic. Still, these are the likely reasons behind the advice:
First, the delete happens in the DB, so the logic is kept where it is best suited.
Second, writing the logic in each and every API is redundant, whereas doing it once in the DB, for all tables (or nodes, or collections), is less work. :)
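To make the "once, in the DB" idea concrete, here is a minimal runnable sketch. It is an assumption-laden analogue, not the PostgreSQL setup from the question: it uses SQLite through Python's sqlite3 module, and it puts an INSTEAD OF DELETE trigger on a view (the name active_resource is hypothetical) rather than a BEFORE DELETE trigger on the table.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE resource (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            deleted_at TEXT            -- NULL means "not deleted"
        );

        -- Hypothetical view: the application only ever talks to this.
        CREATE VIEW active_resource AS
            SELECT id, name FROM resource WHERE deleted_at IS NULL;

        -- A DELETE against the view is rewritten into a soft delete.
        CREATE TRIGGER active_resource_soft_delete
        INSTEAD OF DELETE ON active_resource
        BEGIN
            UPDATE resource SET deleted_at = CURRENT_TIMESTAMP WHERE id = OLD.id;
        END;
        """
    )
    return conn


conn = make_db()
conn.execute("INSERT INTO resource (id, name) VALUES (1, 'report.pdf')")

# Application code still reads like a plain delete.
conn.execute("DELETE FROM active_resource WHERE id = ?", (1,))

visible = conn.execute("SELECT COUNT(*) FROM active_resource").fetchone()[0]
kept = conn.execute(
    "SELECT COUNT(*) FROM resource WHERE deleted_at IS NOT NULL"
).fetchone()[0]
```

The API code stays unaware of the flag, which is the appeal of this approach; the cost, as the question notes, is that the behaviour is invisible to anyone who only reads the API code.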

Related

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by user.
The new version of the DB instead has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So, now when the user adds a custom property, a new column is added to the current ENTITY_PROPERTIES_n table until the maximum number of columns (managed by the application) is reached; then a new table is created.
So, my question is: is this a correct way to design a DB structure? Is this the only way to "increase performance"? The old structure required many joins or sub-selects, but this structure doesn't seem very smart (or even correct) to me...

I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.

No, it's not. It's terrible.
until the max number of column (handled by application) is reached,
then a new table is created.
This sentence says it all. Under no circumstance should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be like this.
Consider this:
You lose all type-safety as you have to store all values in the column "PROPERTY_VALUE"
Depending on your users, you could have them change the schema beforehand and then let them run some kind of database update batch job, so at least all the properties would be declared in the right datatype. Also, you could lose the entity_id/key thing.
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it.
Maybe an RDBMS isn't the right thing for your app. Consider using a key/value or document store such as MongoDB, or another NoSQL database. (http://nosql-database.org/)

From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know how many max custom properties a user might have, I'd say you'd better set the table number of columns to that value.
Then again, I'm not an expert, but making new columns on the fly isn't the kind of operations databases like. It's gonna bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties, or stick with the old system.

I believe creating a new table for each entity to store properties is a bad design as you could end up bulking the database with tables. The only pro to applying the second method would be that you are not traversing through all of the redundant rows that do not apply to the Entity selected. However using indexes on your database on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes and let the database engine determine the best methods for selecting the data rather than separating each entity property into a new table.

There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from entity where STANDARD_PROPERTY_1 = 'banana'".
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound where clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is (probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property, and how many columns are too many. The likelihood of entertaining bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.
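To show what a compound where clause on custom properties costs in the old entity-attribute-value layout, here is a small sketch using SQLite from Python. The table and column names follow the question; the property keys 'fruit' and 'band', the sample rows, and the find_entities helper are made up for illustration. Each custom property in the filter needs its own join against ENTITY_PROPERTIES.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entity (entity_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE entity_properties (
        entity_id      INTEGER REFERENCES entity(entity_id),
        property_key   TEXT,
        property_value TEXT,
        PRIMARY KEY (entity_id, property_key)
    );
    INSERT INTO entity VALUES (1, 'first'), (2, 'second');
    INSERT INTO entity_properties VALUES
        (1, 'fruit', 'banana'), (1, 'band', 'abba'),
        (2, 'fruit', 'banana'), (2, 'band', 'kylie');
    """
)


def find_entities(conn, fruit, excluded_bands):
    """Entities whose 'fruit' matches and whose 'band' is not excluded."""
    placeholders = ", ".join("?" for _ in excluded_bands)
    sql = f"""
        SELECT e.entity_id
        FROM entity AS e
        -- One self-join per custom property used in the filter.
        JOIN entity_properties AS p1
          ON p1.entity_id = e.entity_id AND p1.property_key = 'fruit'
        JOIN entity_properties AS p2
          ON p2.entity_id = e.entity_id AND p2.property_key = 'band'
        WHERE p1.property_value = ?
          AND p2.property_value NOT IN ({placeholders})
        ORDER BY e.entity_id
    """
    rows = conn.execute(sql, [fruit, *excluded_bands]).fetchall()
    return [row[0] for row in rows]


matches = find_entities(conn, "banana", ["kylie", "pussycat dolls", "giraffe"])
```

In the column-per-property design the same filter would be a single WHERE over one table, which is where the claimed speed-up would come from. Whether it is measurable still has to be checked against a representative dataset, as suggested above.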

Delete data or just flag it as deleted?

I'm building a website that lets people create vocabulary lessons. When a lesson is created, a news item is created that references the lesson. When another user practices the lesson, the user also stores a reference to it together with the practice result.
My question is what to do when a user decides to remove the lesson?
The options I've considered are:
Actually delete the lesson from the database and remove all referencing news items, practise results etc.
Just flag it as deleted and exclude the link from referencing news items, results etc.
What are your thoughts? Should data never be removed, ala Facebook? Should references be avoided all together?
By the way, I'm using Google App Engine (python/datastore). A db.ReferenceProperty is not set to None when the referenced object is deleted as far as I can see?
Thanks!

Where changes to data need to be audited, marking data as deleted (aka "soft deletes") helps greatly particularly if you record the user that actioned the delete and the time when it occurred. It also allows data to be "un-deleted" very easily.
Having said that there is no reason to prevent "hard deletes" (where data is actually deleted) as an administrative function to help tidy up mistakes.
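As a rough sketch of that combination (audited soft delete, easy undelete, and a hard delete kept as an administrative function), the following uses SQLite from Python. The lesson table, the deleted_at and deleted_by columns, and the three helper names are assumptions for illustration; restricting the purge to rows that are already flagged is also just one possible policy.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lesson ("
    " id INTEGER PRIMARY KEY, title TEXT NOT NULL,"
    " deleted_at TEXT, deleted_by TEXT)"
)


def soft_delete(conn, lesson_id, user):
    """Flag the row as deleted, recording who did it and when."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE lesson SET deleted_at = ?, deleted_by = ? WHERE id = ?",
        (now, user, lesson_id),
    )


def undelete(conn, lesson_id):
    """Clear the flag so the row is live again."""
    conn.execute(
        "UPDATE lesson SET deleted_at = NULL, deleted_by = NULL WHERE id = ?",
        (lesson_id,),
    )


def purge(conn, lesson_id):
    """Administrative hard delete; only touches rows already flagged."""
    conn.execute(
        "DELETE FROM lesson WHERE id = ? AND deleted_at IS NOT NULL",
        (lesson_id,),
    )
```

A typical flow would be soft_delete(conn, 1, "alice") from the user-facing code, undelete(conn, 1) when the user calls support, and purge(conn, 1) only from an admin tool.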

Marking the data as "deleted" is simplest. If you currently have no use for it, this keeps everything in your database very tidy and makes it easy to add new functionality.
On the other hand, if you're doing something like showing the user where their "vocabulary points" came from, or how many lessons they've completed, then the reference to soft deleted items might be necessary.
I'd start with the first one and change it later if you need to. Here's why:
If you're not using soft deletes, assume they won't work in the way that future requests actually want them to. You'll have to rewrite them anyway.
If you are using them, assume that nobody is using the feature which uses them. Now you've done a lot of work and tied yourself into maintenance of something nobody cares about.
If you create them, you'll find yourself creating a feature to use them. See the above.
If you don't create them, you can always create them later, once you have better knowledge about what the users of your system really want.
Not creating soft deletes gives you more options going forward. Options have value. Options expire. Never commit early unless you know why.

How to use Data aware controls "correctly"?

I would like to ask experienced users whether you prefer to use data-aware controls to add, insert, delete and edit data in a DB, or whether you favor doing it manually.
I have developed some DB applications in which, for the sake of a "user-friendly policy", I ran into a complicated web of table events (afterinsert, afteredit, after... and beforeedit, beforeinsert, before...). After that, debugging the application was quite nasty work.
Aware of this risk, in a later application I tried to avoid the problem by paying more attention to writing clean, readable and comprehensible code. Everything seemed all right at the beginning, but as soon as I needed to do some preprocessing before sending and loading data, I ran into the same problems again, "slowly and inevitably". Sometimes I could not use data-aware controls at all, and what seemed to be a "cool" feature of data-aware controls at the beginning turned into an obstacle in the end. I "had to" write special routines to make non-data-aware controls behave as if they were data-aware. Then I asked myself: why on earth should I use data-aware controls? Is it better to base the application architecture on non-data-aware controls? It takes more time to write bug-proof code, of course, but is it worth it? I do not know...
It has happened to me several times, as if jinxed: paradise at the beginning, hell at the end...
I do not know whether I am using the wrong method to write DB programs, whether there is some standard practice for how to proceed, or whether this is a common problem for everybody.
Thanks for your advice and experiences.

I've written applications that used data aware components against TTable style components and applications which used non-data aware components.
My preference these days is to use data aware components but with TClientDataSets rather than TTable style components.
Using a TClientDataSet I don't have to make my user interface structure mimic my database structure. It's flexible enough to fill it with the data from several tables and then when you are applying the updates back to the database you can manually add/delete/update records as you see fit.

The secret should be in DataSet parameter automation: you can create a control that glues datasets together in a master-slave way, just by defining connections between them. Of course, such a control should be fed with form parameters in some other generalized way. In this case, when the form is called with an entity identifier, all datasets get filled in the proper order, and the provider can update the data in the database automatically.
Generally it is better to have DataSets being an exact representation of tables with optional calculated fields (fkInternalCalc sometimes works better as it updates with row change not field change) bound to data aware controls. Data aware controls are the most optimal approach, and less error prone. Like in every aspect, there are exceptions to that.
If you must write too many glue functions, the problem probably is in design pattern not in VCL.

A lot of the time I use data aware controls linked to an in-memory table (kbmMemTable) that is filled from a query.
The benefits I see are:
I have full control over all inserts/updates/posts/edits to the database.
No need to worry about a user leaving a record in update mode (potentially locking other users)
Did I mention full control over all inserts/updates/posts/edits?
Using the in-memory table is as easy as:
dataset.sql.add('select a.field,b.field from a,b');
dataset.open;
inMemoryTable.loadfromdataset(dataset);
inMemoryTable.checkpoint;
And then, when "resolving" back to the database, you are given access to the original and new data for each field in each record (similar in a way to a trigger) - you can easily wrap it in a transaction and resolve a whole edit back in milliseconds - even if it took the end user 30 mins to fill in the data aware controls.

Have you considered an O/R mapper for Delphi like tiOPF or hcOPF?
This will separate the business domain logic from the database layer. For big and legacy systems, it is even common to add another layer, the 'Anti Corruption Layer', which protects the model from changes in the database design.

Database: To delete or not to delete records

I don't think I am the only person wondering about this. What do you usually practice about database behavior? Do you prefer to delete a record from the database physically? Or is it better to just flag the record with a "deleted" flag or a boolean column to denote the record is active or inactive?

It definitely depends on the actual content of your database. If you're using it to store session information, then by all means wipe it immediately when the session expires (or is closed); you don't want that garbage lying around, as it cannot really be used again for any practical purpose.
Basically, what you need to ask yourself is: might I need to restore this information? Deleted questions on SO, for example, should definitely just be marked 'deleted', as we're actively allowing an undelete. We also have the option to display them to select users, without much extra work.
If you're not actively seeking to fully restore the data, but you'd still like to keep it around for monitoring (or similar) purposes, I would suggest that you figure out (to the extent possible, of course) an aggregation scheme, and shove that off to another table. This will keep your primary table clean of 'deleted' data, as well as keep your secondary table optimized for monitoring purposes (or whatever you had in mind).
For temporal data, see: http://talentedmonkeys.wordpress.com/2010/05/15/temporal-data-in-a-relational-database/

Pros of using a delete flag:
You can get the data back later if you need it,
Delete operation (updating the flag) is probably quicker than really deleting it
Cons of using a delete flag:
It is very easy to miss AND DeletedFlag = 'N' somewhere in your SQL
Slower for the database to find the rows that you are interested in amongst all the crap
Eventually, you'll probably want to really delete it anyway (assuming your system is successful). What about when that record is 10 years old and it was "deleted" 4 minutes after it was originally created?
It can make it impossible to use a natural key. You may have one or more deleted rows with the natural key and a real row wanting to use that same natural key.
There may be legal/compliance reasons why you are meant to actually delete data.
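The natural-key con above can be reproduced, and sometimes worked around, with a partial unique index. The sketch below is only one possible mitigation and is not something the answer proposes; it assumes SQLite 3.8.0 or later (PostgreSQL supports partial indexes as well), and the account table, its email natural key, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE account (
        id         INTEGER PRIMARY KEY,
        email      TEXT NOT NULL,      -- the natural key
        deleted_at TEXT                -- NULL means "not deleted"
    );
    -- Uniqueness is enforced only among rows that are not soft deleted.
    CREATE UNIQUE INDEX account_email_active
        ON account (email) WHERE deleted_at IS NULL;
    """
)

conn.execute("INSERT INTO account (email) VALUES ('ann@example.com')")
conn.execute(
    "UPDATE account SET deleted_at = CURRENT_TIMESTAMP"
    " WHERE email = 'ann@example.com'"
)

# The key is free again because the old row is flagged as deleted.
conn.execute("INSERT INTO account (email) VALUES ('ann@example.com')")

# A second live row with the same key is still rejected.
try:
    conn.execute("INSERT INTO account (email) VALUES ('ann@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

With a plain UNIQUE constraint on email, the second insert would already have failed, which is exactly the problem described in the list.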

As a complement to all posts...
However, if you plan to mark the record, it's good to consider making a view for active records. This would save you from writing, or forgetting, the flag in your SQL queries. You might consider a view for non-active records too, if you think that would also serve some purpose.

I am glad to have found this thread. I too was wondering what people thought about this issue. I have implemented the 'marked as deleted' for about 15 years on many systems. Whenever a user would call to say something was accidentally deleted it was certainly a lot easier to mark it un-deleted than recreate it or restore from a backup.
We are using PostgreSQL and Ruby on Rails, and it looks like we could do this in one of two ways: modify Rails, or add an on-delete trigger that instead calls a PL/pgSQL function to mark the record as deleted. I am leaning toward the latter.
As for performance hits, it will be interesting to see the results of EXPLAIN ANALYZE on large tables with few deleted items as well as with many deleted items.
In systems used over time I have found, new users tend to do silly things like delete things accidentally. So when people are new in a position they have all the access rights of the person previously in that position except with zero experience. Accidentally deleting something and being able to quickly recover gets everyone back to work quickly.
But as someone said, sometimes you may need that particular key back for some reason. At that point you would need to really delete it and then re-create the record (or undelete it and modify the record).

I mark them as deleted, and don't really delete. However every once in a while I sweep out all the junk and archive it, so it doesn't kill performance.

There are also legal issues either way if personal data is involved. I think it greatly depends on where you are (or where the database is), and what the terms of use are.
In some cases people can ask to be removed from your system, in which case a hard delete is needed (or at least clearing out all of the personal information).
I would check with your legal department before you adopt a strategy either way if personal information is involved.

If you are concerned about "dormant" records slowing down your database access, you may want to move those rows into another table acting as an "archive" table.
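A minimal sketch of such an archive sweep, using SQLite from Python. The resource_archive table, the cutoff parameter, and the archive_deleted helper are assumptions for illustration; the copy and the delete share one transaction so a row is never lost or duplicated halfway through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE resource (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT);
    CREATE TABLE resource_archive (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT);
    INSERT INTO resource VALUES
        (1, 'kept', NULL),
        (2, 'old', '2020-01-01 00:00:00'),
        (3, 'recent', '2999-01-01 00:00:00');
    """
)


def archive_deleted(conn, cutoff):
    """Move rows soft deleted before `cutoff` into the archive table."""
    with conn:  # commit both statements together, or roll both back
        conn.execute(
            "INSERT INTO resource_archive SELECT id, name, deleted_at"
            " FROM resource WHERE deleted_at < ?",
            (cutoff,),
        )
        # Rows with deleted_at IS NULL never match the comparison,
        # so live rows stay where they are.
        moved = conn.execute(
            "DELETE FROM resource WHERE deleted_at < ?", (cutoff,)
        ).rowcount
    return moved


moved = archive_deleted(conn, "2024-01-01 00:00:00")
```

Running this periodically keeps the main table small while recently deleted rows remain cheap to undelete.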

For user-entered/managed data I've used the flag method you describe and given the user an "empty the trash" interface to actually delete items if they choose to.

I have a database with lots of dependencies. Hence, I cannot delete some records because others still depend on the data. This is what I usually do; I try to delete the data, if it works, I know it didn't have any dependencies and didn't matter. If it doesn't, I catch the error and flag it as inactive:
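try
{
    _context.SomeTable.Remove(someEntity);
    await _context.SaveChangesAsync();
}
// SQL Server error 547: the delete conflicted with a foreign key constraint.
catch (DbUpdateException ex) when (ex.InnerException is SqlException sqlEx && sqlEx.Number == 547)
{
    // The failed delete leaves the entity tracked as Deleted; reset it first,
    // otherwise the next SaveChangesAsync would retry the same DELETE.
    _context.Entry(someEntity).State = EntityState.Unchanged;
    // Mark as inactive
    someEntity.Active = false;
    await _context.SaveChangesAsync();
}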
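The same "try a hard delete, fall back to a flag" idea can be sketched outside Entity Framework. The version below is an illustration in Python with SQLite, not a translation of the code above: the customer and invoice tables and the delete_or_deactivate helper are hypothetical, and it catches any integrity error, which is broader than checking for SQL Server's error 547.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript(
    """
    CREATE TABLE customer (
        id     INTEGER PRIMARY KEY,
        active INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE invoice (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id)
    );
    INSERT INTO customer (id) VALUES (1), (2);
    INSERT INTO invoice (id, customer_id) VALUES (10, 1);
    """
)


def delete_or_deactivate(conn, customer_id):
    """Hard delete when nothing depends on the row, otherwise flag it."""
    try:
        with conn:  # rolls back automatically if the delete is rejected
            conn.execute("DELETE FROM customer WHERE id = ?", (customer_id,))
        return "deleted"
    except sqlite3.IntegrityError:
        # Something still references the row, so keep it and mark it inactive.
        with conn:
            conn.execute(
                "UPDATE customer SET active = 0 WHERE id = ?", (customer_id,)
            )
        return "deactivated"
```

Here delete_or_deactivate(conn, 2) removes the unreferenced customer, while delete_or_deactivate(conn, 1) leaves customer 1 in place with active = 0 because an invoice still points at it.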

Creating a Notifications type feed in GAE Objectify

I'm working on a notification feed for my mobile app and am looking for some help on an issue.
The app is a Twitter/Facebook like app where users can post statuses and other users can like, comment, or subscribe to them.
One thing I want to have in my app is to have a notifications feed where users can see who liked/comment on their post or subscribed to them.
The first part of this system I have figured out: when a user likes/comments/subscribes, a Notification entity is written to the datastore with details about the event. To show a user's Notifications, all I have to do is query for all Notifications for that user, sorted by date created descending, and we have a nice little feed of actions other users took on a specific user's account.
The issue I have is what to do when someone unlikes a post, unsubscribes or deletes a comment. Currently, if I were to query for that specific notification, it is possible that nothing would be returned from the datastore because of eventual consistency. We could imagine someone liking, then immediately unliking a post (b/c who hasn't done that? =P). The query to find that Notification might return null and nothing would get deleted when calling ofy().delete().entity(notification).now(); And now the user has a notification in their feed saying Sally liked his post when in reality she liked then quickly unliked it!
A wrench in this whole system is that I cannot delete by Key<Notification>, because I don't really have a way to know the id of the Notification when trying to delete it.
A potential solution I am experimenting with is to not delete any Notifications. Instead I would always write Notifications and simply indicate if the notification was positive or negative. Then in my query to display notifications to a specific user, I could somehow only display the sum-positive Notifications. This would save some money on datastore too because deleting entities is expensive.

There are three main ways I've solved this problem before:
deterministic key
for example
{user-Id}-{post-id}-{liked-by} for likes
{user-id}-{post-id}-{comment-by}-{comment-index} for comments
This will work for most basic use cases for the problem you defined, but you'll have some hairy edge cases to figure out (like managing indexes of comments as they get edited and deleted). This will allow get and delete by key
parallel data structures
The idea here is to create more than one entity at a time in a transaction, but to make sure they have related keys. For example, when someone comments on a feed item, create a Comment entity, then create a CommentedOn entity which has the same ID, but make it have a parent key of the commenter user.
Then, you can make a strongly consistent query for the CommentedOn, and use the same id to do a get by key on the Comment. You can also just store a key, rather than having matching IDs if that's too hard. Having matching IDs in practice was easier each time I did this.
The main limitation of this approach is that you're effectively creating an index yourself out of entities, and while this can give you strongly consistent queries where you need them the throughput limitations of transactional writes can become harder to understand. You also need to manage state changes (like deletes) carefully.
State flags on entities
Assuming the Notification object just shows the user that something happened but links to another entity for the actual data, you could store a state flag (deleted, hidden, private etc) on that entity. Then listing your notifications would be a matter of loading the entities server side and filtering in code (or possibly subsequent filtered queries).
At the end of the day, the complexity of the solution should mirror the complexity of the problem. I would start with approach 3 then migrate to approach 2 when the fuller set of requirements is understood. It is a more robust and flexible approach, but complexity of XG transaction limitations will rear its head - but ultimately a distributed feed like this is a hard problem.
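As an illustration of the deterministic-key approach, here is a small sketch in Python. A plain dict stands in for a key-value datastore, and the names like_key, on_like and on_unlike are made up; the key layout follows the {user-id}-{post-id}-{liked-by} pattern from the answer. It also assumes the ids themselves contain no "-", otherwise two different triples could produce the same key.

```python
from typing import Dict


def like_key(owner_id: str, post_id: str, liked_by: str) -> str:
    """Build the same key every time for the same (owner, post, liker)."""
    return f"{owner_id}-{post_id}-{liked_by}"


# A plain dict stands in for a key-value datastore in this sketch.
notifications: Dict[str, dict] = {}


def on_like(owner_id: str, post_id: str, liked_by: str) -> None:
    notifications[like_key(owner_id, post_id, liked_by)] = {
        "type": "like",
        "post": post_id,
        "by": liked_by,
    }


def on_unlike(owner_id: str, post_id: str, liked_by: str) -> None:
    # Delete by key: no query is needed, so the result does not depend
    # on an eventually consistent index having caught up.
    notifications.pop(like_key(owner_id, post_id, liked_by), None)
```

Because the key can be recomputed from data the unlike request already carries, a like followed immediately by an unlike removes the notification without ever having to look it up by query.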

What I ended up doing, and what worked for my specific model, was that before creating a Notification Entity I would first allocate an ID for it:
// Allocate an ID for a Notification
final Key<Notification> notificationKey = factory().allocateId(Notification.class);
final Long notificationId = notificationKey.getId();
Then when creating my Like or Follow Entity, I would set the property Like.notificationId = notificationId; or Follow.notificationId = notificationId;
Then I would save both Entities.
Later, when I want to delete the Like or Follow, I can do so and at the same time get the Id of the Notification, load the Notification by key (a get by key is strongly consistent), and delete it too.
Just another approach that may help someone =D
