Delete data or just flag it as deleted? - google-app-engine

I'm building a website that lets people create vocabulary lessons. When a lesson is created, a news item is created that references the lesson. When another user practices the lesson, the user also stores a reference to it together with the practice result.
My question is what to do when a user decides to remove the lesson?
The options I've considered are:
1. Actually delete the lesson from the database and remove all referencing news items, practise results etc.
2. Just flag it as deleted and exclude the link from referencing news items, results etc.
What are your thoughts? Should data never be removed, à la Facebook? Should references be avoided altogether?
By the way, I'm using Google App Engine (python/datastore). A db.ReferenceProperty is not set to None when the referenced object is deleted as far as I can see?
Thanks!

Where changes to data need to be audited, marking data as deleted (aka "soft deletes") helps greatly, particularly if you record the user who actioned the delete and the time when it occurred. It also allows data to be "un-deleted" very easily.
Having said that there is no reason to prevent "hard deletes" (where data is actually deleted) as an administrative function to help tidy up mistakes.
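To make that concrete, here is a minimal sketch of an audited soft delete with an administrative hard delete alongside it. It uses Python's sqlite3 module as a stand-in for whatever store you actually have (it is not App Engine datastore code), and the table and column names (lesson, deleted_at, deleted_by) are assumptions made up for the example.

```python
import sqlite3
from datetime import datetime, timezone


def make_db():
    """Create an in-memory database with a hypothetical lesson table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE lesson (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               deleted_at TEXT,   -- NULL means the row is live
               deleted_by TEXT    -- who actioned the delete
           )"""
    )
    return conn


def soft_delete(conn, lesson_id, user):
    """Flag a live lesson as deleted, recording who did it and when."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE lesson SET deleted_at = ?, deleted_by = ? "
        "WHERE id = ? AND deleted_at IS NULL",
        (now, user, lesson_id),
    )


def undelete(conn, lesson_id):
    """Bring a soft-deleted lesson back (this also clears the audit columns)."""
    conn.execute(
        "UPDATE lesson SET deleted_at = NULL, deleted_by = NULL WHERE id = ?",
        (lesson_id,),
    )


def hard_delete(conn, lesson_id):
    """Administrative tidy-up: really remove the row."""
    conn.execute("DELETE FROM lesson WHERE id = ?", (lesson_id,))


def live_titles(conn):
    """Titles of lessons that are not flagged as deleted."""
    rows = conn.execute(
        "SELECT title FROM lesson WHERE deleted_at IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]
```

If you need the delete history to survive an undelete, you would probably write the who/when pair to a separate audit table instead of clearing the columns.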

Actually deleting the data is simplest. If you currently have no use for it, this keeps everything in your database very tidy and makes it easy to add new functionality.
On the other hand, if you're doing something like showing the user where their "vocabulary points" came from, or how many lessons they've completed, then the reference to soft deleted items might be necessary.
I'd start with the first one and change it later if you need to. Here's why:
If you're not using soft deletes, assume they won't work in the way that future requests actually want them to. You'll have to rewrite them anyway.
If you are using them, assume that nobody is using the feature which uses them. Now you've done a lot of work and tied yourself into maintenance of something nobody cares about.
If you create them, you'll find yourself creating a feature to use them. See the above.
If you don't create them, you can always create them later, once you have better knowledge about what the users of your system really want.
Not creating soft deletes gives you more options going forward. Options have value. Options expire. Never commit early unless you know why.

Related

When to implement soft delete logic in the code over the database?

When I want to soft delete resources as a policy of my company I can do it in one of two places.
I can do it in my database with some "instead of DELETE" trigger. Like so:
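CREATE FUNCTION resource_soft_delete() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- Flag the row instead of removing it.
    UPDATE resource SET deleted_at = now() WHERE id = OLD.id;
    -- Returning NULL from a BEFORE row trigger cancels the DELETE for this row.
    RETURN NULL;
END;
$$;
-- The function has to exist before the trigger can reference it.
CREATE TRIGGER prevent_resource_delete
BEFORE DELETE ON resource
FOR EACH ROW EXECUTE PROCEDURE resource_soft_delete();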
That's how pretty much every article about soft deletes suggests doing it, other than articles written by the maintainers of an ORM, because they have their own in-house solution.
I like this approach. The logic in my APIs looks like I am just deleting the resource.
Resource.query().deleteById(id); // Using a query builder
db.query('DELETE FROM resource WHERE id = $1;', [id]); // Using native library
To me it seems more natural and I don't have to worry about other developers accidentally hard deleting stuff. But it can also be confusing to those who don't know what is actually going on. And having any logic in the database means I can have bugs there (soft deleting logic is usually dead simple, but still...), which would be hard to debug. At least compared to those in my APIs.
But also I can instead have the logic in the APIs themselves. Keeping logic next to the other logic. Less elegant but more straightforward. No hidden logic somewhere else. I do lose the protection from people accidentally hard deleting resources.
Resource.query().findById(id).patch({deleted_at: new Date()}); // Using a query builder
db.query('UPDATE resource SET deleted_at = now() WHERE id = $1;', [id]); // Using native library
I am inclined to choose the former option as I consider the choice of whether to soft delete a database matter. The database chooses what to do with deleted data. Deleted data, soft or hard, is in principle not part of the application anymore. The APIs can't retrieve it. It is for me, the developer, to use for analytics, legal reasons or to manually aid a user who wants to recover something he/she considers lost.
But I don't like the downsides. I just talked to a colleague who was worried because he thought we were actually deleting stuff. Now, that could be solved with better onboarding and documentation. But should it be like that?
When to implement soft delete logic in the code over the database? Why does every article I find directly suggest the database without even considering the code? It looks like there is a strong reason I can't find.
In my opinion there isn't any strong reason; it depends on where the architect and developers decide to put the logic. The likely reasons behind it are:
First, since we are deleting something from the DB, the logic is kept where it is best suited.
Second, writing the logic in each and every API is redundant; doing it once in the DB, for all tables, nodes or collections, is less work. :)
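If you want to try the database-side option without a PostgreSQL instance at hand, here is a rough sketch using Python's sqlite3 module. Note that it swaps in a different mechanism from the BEFORE DELETE trigger in the question: SQLite allows INSTEAD OF triggers on views, so the application deletes from a view named resource while the rows live in an underlying table. The names resource_rows and delete_resource are assumptions for the example.

```python
import sqlite3


def make_db():
    """Build a table, a view hiding soft-deleted rows, and the trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE resource_rows (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            deleted_at TEXT            -- NULL means the row is live
        );

        -- The application only ever sees live rows through this view.
        CREATE VIEW resource AS
            SELECT id, name FROM resource_rows WHERE deleted_at IS NULL;

        -- A DELETE against the view is turned into a flag update.
        CREATE TRIGGER resource_soft_delete
        INSTEAD OF DELETE ON resource
        FOR EACH ROW
        BEGIN
            UPDATE resource_rows
               SET deleted_at = datetime('now')
             WHERE id = OLD.id;
        END;
        """
    )
    return conn


def delete_resource(conn, resource_id):
    """Looks like a hard delete to the caller; the trigger makes it soft."""
    conn.execute("DELETE FROM resource WHERE id = ?", (resource_id,))
```

In this sketch inserts still go to resource_rows directly; a fuller version would need INSTEAD OF INSERT and UPDATE triggers on the view as well, which is part of the hidden-logic cost the question worries about.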

Override delete method of custom object

We have a custom object in our instance that effectively is a junction object. Right now, if a relationship is removed, the record in the junction object is deleted.
We want to change this behavior so that the junction object is marked as deleted, but not physically deleted (please understand that I cannot go into details of why; there are good business reasons to do so). Since we have multiple clients accessing our instance through SOAP and REST APIs, I would like to implement a solution whereby I override the standard delete functionality of the custom object to just check a custom field is_deleted, instead of deleting the record.
Is this possible?
Cheers,
Dan
I suppose you can't just put an on-delete trigger on the object?
If you can, then just add the trigger code to update the field, and then attach an error to the record being deleted (so the deletion doesn't go through). There are plenty of examples in the official docs for how to do this.
Remember to keep everything bulkified (process all the records being deleted at once, from a list)...
On a side note, deleted records in Salesforce are kept in the org's Recycle Bin for 15 days after deletion. So you can also select them from the object by using the SELECT... ALL ROWS query form.
I don't think you can really override delete action. You could override a button (with a Visualforce page) but that won't help you in any way if delete is fired from API.
I suspect you want to pretend to API (SOAP, REST etc) users that record was deleted while in reality retaining it somewhere? Smells like some shady business practice to be honest but whatever, let's assume it really is legit... For sure you can't suddenly throw errors at the operation because your end users will notice.
I think I'd go with a hidden 1-to-1 matching "shadow" object and sync each action to it. You'd need a trigger on insert/update/delete/undelete of your junction that would replicate the action (difference being this custom "soft delete" flag). This has lots of concerns like storage usage but well.
One thing that comes to mind is that (if I recall correctly) the triggers on junction object don't fire if you delete one of masters. So if it's a real junction object (you wrote "acts like") you'd have to deal with this scenario too and put logic into master objects' triggers.
If it's not a real junction object (i.e. it has OwnerId field visible) and your sharing rules permit - maybe you could transfer the ownership of record to some special user/queue outside of roles hierarchy so it becomes invisible... But I doubt it'll work, in the end the delete should appear to complete successfully, right? Maybe in combination with some @future method that'd immediately undelete them & transfer... Still - messy!

Is this a functional syncing algorithm?

I'm working on a basic syncing algorithm for a user's notes. I've got most of it figured out, but before I start programming it, I want to run it by here to see if it makes sense. Usually I end up not realizing one huge important thing that someone else easily saw that I couldn't. Here's how it works:
I have a table in my database where I insert objects called SyncOperation. A SyncOperation is a sort of metadata on the nature of what every device needs to perform to be up to date. Say a user has 2 registered devices, firstDevice and secondDevice. firstDevice creates a new note and pushes it to the server. Now, a SyncOperation is created with the note's Id, operation type, and processedDeviceList. I create a SyncOperation with type "NewNote", and I add the originating device ID to that SyncOperation's processedDeviceList. So now secondDevice checks in to the server to see if it needs to make any updates. It makes a query to get all SyncOperations where secondDeviceId is not in the processedDeviceList. It finds out its type is NewNote, so it gets the new note and adds itself to the processedDeviceList. Now this device is in sync.
When I delete a note, I find the already created SyncOperation in the table with type "NewNote". I change the type to Delete and remove all devices from processedDeviceList except for the device that deleted the note. So now when other devices call in to see what they need to update, since their deviceId is not in the processedDeviceList, they'll have to process that SyncOperation, which tells their device to delete that respective note.
And that's generally how it'd work. Is my solution too complicated? Can it be simplified? Can anyone think of a situation where this wouldn't work? Will this be inefficient on a large scale?
Sounds very complicated - the central database shouldn't be responsible for determining which devices have received which updates. Here's how I'd do it:
The database keeps a table of SyncOperations, one row for each change. Each SyncOperation has a change_id numbered in ascending order (that is, change_id INTEGER PRIMARY KEY AUTOINCREMENT).
Each device keeps a current_change_id number representing what change it last saw.
When a device wants to update, it does SELECT * FROM SyncOperations WHERE change_id > current_change_id. This gets it the list of all changes it needs to be up-to-date. Apply each of them in chronological order.
This has the charming feature that, if you wanted to, you could initialise a new device simply by creating a new client with current_change_id = 0. Then it would pull in all updates.
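The scheme above could be sketched roughly like this, with Python's sqlite3 module standing in for the server database. The SyncOperations table, change_id and current_change_id come from the description; the other column names, the Device class and the rule that any operation type other than "Delete" is treated as an upsert are assumptions for the example.

```python
import sqlite3


def make_server():
    """Server side: an append-only log of changes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE SyncOperations (
               change_id INTEGER PRIMARY KEY AUTOINCREMENT,
               op_type TEXT NOT NULL,     -- e.g. 'NewNote' or 'Delete'
               note_id TEXT NOT NULL,
               body TEXT
           )"""
    )
    return conn


def record_change(server, op_type, note_id, body=None):
    """Append one change and return its change_id."""
    cur = server.execute(
        "INSERT INTO SyncOperations (op_type, note_id, body) VALUES (?, ?, ?)",
        (op_type, note_id, body),
    )
    return cur.lastrowid


class Device:
    """Client side: remembers only the last change it has applied."""

    def __init__(self):
        self.current_change_id = 0   # 0 means "nothing seen yet"
        self.notes = {}

    def pull(self, server):
        """Fetch and apply every change newer than current_change_id."""
        rows = server.execute(
            "SELECT change_id, op_type, note_id, body FROM SyncOperations "
            "WHERE change_id > ? ORDER BY change_id",
            (self.current_change_id,),
        ).fetchall()
        for change_id, op_type, note_id, body in rows:
            if op_type == "Delete":
                self.notes.pop(note_id, None)
            else:
                # Assumption: every non-delete operation carries the full note.
                self.notes[note_id] = body
            self.current_change_id = change_id
        return len(rows)
```

A brand-new device is just Device() followed by pull(server): it starts at 0 and replays the whole log. The sketch does nothing about conflicting edits.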
Note that this won't really work if two users can be doing concurrent edits (which edit "wins"?). You can try and merge edits automatically, or you can raise a notification to the user. If you want some inspiration, look at the operation of the git version control system (or Mercurial, or CVS...) for conflicting edits.
You may want to take a look at SyncML for ideas on how to handle sync operations (http://www.openmobilealliance.org/tech/affiliates/syncml/syncml_sync_protocol_v11_20020215.pdf). SyncML has been around for a while, and as a public standard, has had a fair amount of scrutiny and review. There are also open source implementations (Funambol comes to mind) that can also provide some coding clues. You don't have to use the whole spec, but reading it may give you a few "ahah" moments about syncing data - I know it helped to think through what needs to be done.
Mark
P.S. A later version of the protocol - http://www.openmobilealliance.org/technical/release_program/docs/DS/V1_2_1-20070810-A/OMA-TS-DS_Protocol-V1_2_1-20070810-A.pdf
I have seen the basic idea of keeping track of operations in a database elsewhere, so I dare say it can be made to work. You may wish to think about what should happen if different devices are in use at much the same time, and end up submitting conflicting changes - e.g. two different attempts to edit the same note. This may surface as a change to the user interface, to allow them to intervene to resolve such conflicts manually.

Database: To delete or not to delete records

I don't think I am the only person wondering about this. What do you usually practice about database behavior? Do you prefer to delete a record from the database physically? Or is it better to just flag the record with a "deleted" flag or a boolean column to denote the record is active or inactive?
It definitely depends on the actual content of your database. If you're using it to store session information, then by all means wipe it immediately when the session expires (or is closed); you don't want that garbage lying around, as it cannot really be used again for any practical purpose.
Basically, what you need to ask yourself is: might I need to restore this information? Deleted questions on SO, for example, should definitely just be marked 'deleted', as we're actively allowing an undelete. We also have the option to display them to select users without much extra work.
If you're not actively seeking to fully restore the data, but you'd still like to keep it around for monitoring (or similar) purposes, I would suggest that you figure out (to the extent possible, of course) an aggregation scheme, and shove that off to another table. This will keep your primary table clean of 'deleted' data, as well as keep your secondary table optimized for monitoring purposes (or whatever you had in mind).
For temporal data, see: http://talentedmonkeys.wordpress.com/2010/05/15/temporal-data-in-a-relational-database/
Pros of using a delete flag:
You can get the data back later if you need it,
Delete operation (updating the flag) is probably quicker than really deleting it
Cons of using a delete flag:
It is very easy to miss AND DeletedFlag = 'N' somewhere in your SQL
Slower for the database to find the rows that you are interested in amongst all the crap
Eventually, you'll probably want to really delete it anyway (assuming your system is successful). What about when that record is 10 years old and it was "deleted" 4 minutes after it was originally created?
It can make it impossible to use a natural key. You may have one or more deleted rows with the natural key and a real row wanting to use that same natural key.
There may be legal/compliance reasons why you are meant to actually delete data.
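On the natural key point, one possible workaround (not something suggested in this thread, and not available in every database) is a partial unique index that only covers live rows. A minimal sketch with Python's sqlite3 module, assuming a hypothetical account table where username is the natural key:

```python
import sqlite3


def make_db():
    """Uniqueness of username is enforced only among rows not flagged deleted."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE account (
            id INTEGER PRIMARY KEY,
            username TEXT NOT NULL,
            deleted_at TEXT            -- NULL means the row is live
        );

        -- Partial index: soft-deleted rows are left out of the uniqueness check.
        CREATE UNIQUE INDEX account_live_username
            ON account (username) WHERE deleted_at IS NULL;
        """
    )
    return conn
```

With this in place a deleted 'bob' and a live 'bob' can coexist, while a second live 'bob' raises sqlite3.IntegrityError. Undeleting the old row would then fail for the same reason, so the conflict is moved rather than removed.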
As a complement to all posts...
However, if you plan to mark the record, it's good to consider making a view for active records. This would save you from writing, or forgetting, the flag in your SQL queries. You might consider a view for non-active records too, if you think that would also serve some purpose.
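A small sketch of that idea, using Python's sqlite3 module. The DeletedFlag column with 'Y'/'N' values follows the earlier answer; the record table and the view names active_record and deleted_record are assumptions for the example.

```python
import sqlite3


def make_db():
    """One base table plus a view for each side of the flag."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE record (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            DeletedFlag TEXT NOT NULL DEFAULT 'N'
                CHECK (DeletedFlag IN ('Y', 'N'))
        );

        -- Everyday queries go through this view, so the filter
        -- cannot be forgotten.
        CREATE VIEW active_record AS
            SELECT id, name FROM record WHERE DeletedFlag = 'N';

        -- Optional: a view for the trash, e.g. for an undelete screen.
        CREATE VIEW deleted_record AS
            SELECT id, name FROM record WHERE DeletedFlag = 'Y';
        """
    )
    return conn


def names(conn, view):
    """List names from one of the two views (view name is checked first)."""
    if view not in ("active_record", "deleted_record"):
        raise ValueError("unknown view: " + view)
    return [row[0] for row in conn.execute(f"SELECT name FROM {view} ORDER BY id")]
```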
I am glad to have found this thread. I too was wondering what people thought about this issue. I have implemented the 'marked as deleted' for about 15 years on many systems. Whenever a user would call to say something was accidentally deleted it was certainly a lot easier to mark it un-deleted than recreate it or restore from a backup.
We are using PostgreSQL and Ruby on Rails. It looks like we could do this in one of two ways: modify Rails, or add an on-delete trigger that instead runs a PL/pgSQL function to mark the record as deleted. I am leaning toward the latter.
As for performance hits, it will be interesting to see the results of EXPLAIN ANALYZE on large tables with few deleted items as well as with many deleted items.
In systems used over time I have found, new users tend to do silly things like delete things accidentally. So when people are new in a position they have all the access rights of the person previously in that position except with zero experience. Accidentally deleting something and being able to quickly recover gets everyone back to work quickly.
But as someone said, sometimes you may need that particular key back for some reason; at that point you would need to really delete it and then re-create the record (or undelete it and modify the record).
I mark them as deleted, and don't really delete. However every once in a while I sweep out all the junk and archive it, so it doesn't kill performance.
There are also legal issues either way if personal data is involved. I think it greatly depends on where you are (or where the database is), and what the terms of use are.
In some cases people can ask to be removed from your system, in which case a hard delete is needed (or at least clearing out all of the personal information).
I would check with your legal department before you adopt a strategy either way if personal information is involved.
If you are concerned about "dormant" records slowing down your database access, you may want to move those rows into another table acting as an "archive" table.
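A periodic sweep into an archive table could look something like the following sketch (Python's sqlite3 module; the item and item_archive tables and the sweep_deleted function are hypothetical names, and deleted_at is assumed to hold ISO-8601 text so that string comparison orders correctly).

```python
import sqlite3


def make_db():
    """A live table and an archive table with the same columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE item (
            id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT
        );
        CREATE TABLE item_archive (
            id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT
        );
        """
    )
    return conn


def sweep_deleted(conn, older_than):
    """Move rows soft-deleted before `older_than` into the archive.

    Returns the number of rows removed from the live table.
    """
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO item_archive (id, name, deleted_at) "
            "SELECT id, name, deleted_at FROM item "
            "WHERE deleted_at IS NOT NULL AND deleted_at < ?",
            (older_than,),
        )
        cur = conn.execute(
            "DELETE FROM item WHERE deleted_at IS NOT NULL AND deleted_at < ?",
            (older_than,),
        )
    return cur.rowcount
```

Doing the copy and the delete in one transaction means a failure halfway leaves the live table untouched.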
For user-entered/managed data I've used the flag method you describe and given the user an "empty the trash" interface to actually delete items if they choose to.
I have a database with lots of dependencies. Hence, I cannot delete some records because others still depend on the data. This is what I usually do; I try to delete the data, if it works, I know it didn't have any dependencies and didn't matter. If it doesn't, I catch the error and flag it as inactive:
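try
{
    _context.SomeTable.Remove(someEntity);
    await _context.SaveChangesAsync();
}
// SQL Server error 547 is a constraint conflict (here: another row still references this one).
catch (DbUpdateException ex) when (ex.InnerException is SqlException sqlEx && sqlEx.Number == 547)
{
    // The failed delete leaves the entity tracked as Deleted; reset it so the
    // next save issues an UPDATE instead of retrying the DELETE.
    _context.Entry(someEntity).State = EntityState.Unchanged;
    // Mark as inactive
    someEntity.Active = false;
    await _context.SaveChangesAsync();
}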

Creating a Notifications type feed in GAE Objectify

I'm working on a notification feed for my mobile app and am looking for some help on an issue.
The app is a Twitter/Facebook like app where users can post statuses and other users can like, comment, or subscribe to them.
One thing I want to have in my app is a notifications feed where users can see who liked/commented on their post or subscribed to them.
The first part of this system I have figured out: when a user likes/comments/subscribes, a Notification entity will be written to the datastore with details about the event. To show a user's Notifications, all I have to do is query for all Notifications for that user, sort by date created descending, and we have a nice little feed of actions other users took on a specific user's account.
The issue I have is what to do when someone unlikes a post, unsubscribes or deletes a comment. Currently, if I were to query for that specific notification, it is possible that nothing would return from the datastore because of eventual consistency. We could imagine someone liking, then immediately unliking a post (b/c who hasn't done that? =P). The query to find that Notification might return null and nothing would get deleted when calling ofy().delete().entity(notification).now(); And now the user has a notification in their feed saying Sally liked his post when in reality she liked then quickly unliked it!
A wrench in this whole system is that I cannot delete by Key<Notification>, because I don't really have a way to know id of the Notification when trying to delete it.
A potential solution I am experimenting with is to not delete any Notifications. Instead I would always write Notification's and simply indicate if the notification was positive or negative. Then in my query to display notifications to a specific user, I could somehow only display the sum-positive Notification's. This would save some money on datastore too because deleting entities is expensive.
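As a rough illustration of that "sum-positive" idea (plain Python, no datastore involved), each event could carry a delta of +1 or -1 and the feed would keep only the (actor, action, target) combinations whose total is above zero. The tuple layout and the function name visible_notifications are assumptions for the sketch.

```python
from collections import defaultdict


def visible_notifications(events):
    """Return the (actor, action, target) keys whose deltas sum to > 0.

    `events` is an iterable of (actor, action, target, delta) tuples,
    where delta is +1 for like/comment/subscribe and -1 for the undo.
    Keys come back in order of first appearance.
    """
    totals = defaultdict(int)
    for actor, action, target, delta in events:
        totals[(actor, action, target)] += delta
    return [key for key, total in totals.items() if total > 0]
```

A real feed would also need to carry the timestamp of the latest positive event for sorting, which this sketch leaves out.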
There are three main ways I've solved this problem before:
1. Deterministic key
For example:
{user-Id}-{post-id}-{liked-by} for likes
{user-id}-{post-id}-{comment-by}-{comment-index} for comments
This will work for most basic use cases for the problem you defined, but you'll have some hairy edge cases to figure out (like managing indexes of comments as they get edited and deleted). This will allow get and delete by key.
2. Parallel data structures
The idea here is to create more than one entity at a time in a transaction, but to make sure they have related keys. For example, when someone comments on a feed item, create a Comment entity, then create a CommentedOn entity which has the same ID, but make it have a parent key of the commenter user.
Then, you can make a strongly consistent query for the CommentedOn, and use the same id to do a get by key on the Comment. You can also just store a key, rather than having matching IDs if that's too hard. Having matching IDs in practice was easier each time I did this.
The main limitation of this approach is that you're effectively creating an index yourself out of entities, and while this can give you strongly consistent queries where you need them the throughput limitations of transactional writes can become harder to understand. You also need to manage state changes (like deletes) carefully.
3. State flags on entities
Assuming the Notification object just shows the user that something happened but links to another entity for the actual data, you could store a state flag (deleted, hidden, private etc) on that entity. Then listing your notifications would be a matter of loading the entities server side and filtering in code (or possibly subsequent filtered queries).
At the end of the day, the complexity of the solution should mirror the complexity of the problem. I would start with approach 3 then migrate to approach 2 when the fuller set of requirements is understood. It is a more robust and flexible approach, but complexity of XG transaction limitations will rear its head - but ultimately a distributed feed like this is a hard problem.
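For the deterministic key approach, the point is that the key can be rebuilt from facts you already have at unlike time, so no query is needed. A minimal sketch in plain Python, with an in-memory dictionary standing in for the datastore (NotificationStore, like_key, on_like and on_unlike are all hypothetical names, not Objectify API):

```python
def like_key(user_id, post_id, liked_by):
    """Build the notification key from the ids involved.

    Assumes the ids themselves contain no '-', otherwise two different
    combinations could produce the same key.
    """
    return f"{user_id}-{post_id}-{liked_by}"


class NotificationStore:
    """In-memory stand-in for a key-value datastore."""

    def __init__(self):
        self._by_key = {}

    def put(self, key, payload):
        self._by_key[key] = payload

    def get(self, key):
        return self._by_key.get(key)

    def delete(self, key):
        # Deleting a missing key is a no-op, like a delete-by-key would be.
        self._by_key.pop(key, None)


def on_like(store, user_id, post_id, liked_by):
    """Write the notification under its deterministic key."""
    key = like_key(user_id, post_id, liked_by)
    store.put(key, {"type": "like", "post": post_id, "by": liked_by})


def on_unlike(store, user_id, post_id, liked_by):
    """Recompute the same key and delete by key; no query involved."""
    store.delete(like_key(user_id, post_id, liked_by))
```

Liking twice simply overwrites the same entry, so a like followed by a quick unlike always leaves the feed clean.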
What I ended up doing, and what worked for my specific model, was that before creating a Notification entity I would first allocate an ID for it:
// Allocate an ID for a Notification
final Key<Notification> notificationKey = factory().allocateId(Notification.class);
final Long notificationId = notificationKey.getId();
Then when creating my Like or Follow Entity, I would set the property Like.notificationId = notificationId; or Follow.notificationId = notificationId;
Then I would save both Entities.
Later, when I want to delete the Like or Follow, I can do so and at the same time get the ID of the Notification, load the Notification by key (a get by key is strongly consistent), and delete it too.
Just another approach that may help someone =D

Resources