When I want to soft delete resources as a matter of company policy, I can do it in one of two places.
I can do it in my database, with a trigger that intercepts DELETE and soft deletes the row instead. Like so:
CREATE FUNCTION resource_soft_delete() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
  UPDATE resource SET deleted_at = now() WHERE id = OLD.id;
  RETURN NULL; -- returning NULL cancels the original DELETE
END;
$$;

CREATE TRIGGER prevent_resource_delete
BEFORE DELETE ON resource
FOR EACH ROW EXECUTE PROCEDURE resource_soft_delete();
That's how pretty much every article about soft deletes suggests doing it, other than articles written specifically by an ORM maintainer, because they have their own in-house solution.
I like this approach. The logic in my APIs looks like I am just deleting the resource.
Resource.query().deleteById(id); // Using a query builder
db.query('DELETE FROM resource WHERE id = $1;', [id]); // Using native library
To me it seems more natural, and I don't have to worry about other developers accidentally hard deleting stuff. But it can also be confusing to those who don't know what is actually going on. And having any logic in the database means I can have bugs there (soft delete logic is usually dead simple, but still...), and those are harder to debug than bugs in my APIs.
But I can instead have the logic in the APIs themselves, keeping it next to the rest of the application logic. Less elegant but more straightforward: no hidden logic somewhere else. I do lose the protection against people accidentally hard deleting resources.
Resource.query().findById(id).patch({deleted_at: new Date()}); // Using a query builder
db.query('UPDATE resource SET deleted_at = now() WHERE id = $1;', [id]); // Using native library
I am inclined to choose the former option, as I consider the choice of whether to soft delete to be a database matter. The database chooses what to do with deleted data. Deleted data, soft or hard, is in principle not part of the application anymore. The APIs can't retrieve it. It is for me, the developer, to use for analytics, legal reasons, or to manually aid a user who wants to recover something they consider lost.
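One way to make that concrete (a sketch, assuming the deleted_at column above; the view name is illustrative) is to expose only a view that hides soft-deleted rows and have the APIs query that view:

-- Application code only ever reads from this view,
-- so soft-deleted rows are invisible to it.
CREATE VIEW active_resource AS
SELECT *
FROM resource
WHERE deleted_at IS NULL;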
But I don't like the downsides. I just talked to a colleague who was worried because he thought we were actually deleting stuff. Now, that could be solved with better onboarding and documentation. But should it be like that?
When to implement soft delete logic in the code over the database? Why does every article I find directly suggest the database without even considering the code? It looks like there is a strong reason I can't find.
In my opinion there isn't any strong reason; it depends on where the architect and developers decide to put the logic. But the usual reasoning goes like this:
First, since we are deleting something from the DB, the logic is kept where it is best suited.
Second, writing the logic in each and every API is redundant; doing it once in the DB covers all tables (or nodes, or collections) with much less work.
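To illustrate the "once for all tables" point, here is a minimal PostgreSQL sketch, assuming every soft-deletable table has id and deleted_at columns (the generic function name and the account/invoice tables are illustrative, not from the question):

CREATE FUNCTION generic_soft_delete() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
  -- TG_TABLE_NAME lets one function serve every table it is attached to
  EXECUTE format('UPDATE %I SET deleted_at = now() WHERE id = $1', TG_TABLE_NAME)
  USING OLD.id;
  RETURN NULL; -- cancel the hard DELETE
END;
$$;

CREATE TRIGGER prevent_account_delete
BEFORE DELETE ON account
FOR EACH ROW EXECUTE PROCEDURE generic_soft_delete();

CREATE TRIGGER prevent_invoice_delete
BEFORE DELETE ON invoice
FOR EACH ROW EXECUTE PROCEDURE generic_soft_delete();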
What is the proper way to keep data slices refreshed? Imagine I have a table with various columns, but importantly a DATE_CREATED and a DATE_MODIFIED column.
If my data slicing strategy is based on DATE_CREATED, I could periodically reprocess old slices. This follows the ADF guidance of "repeatability". I don't think ADF has a way of doing this automatically, but I could externally trigger the refresh via the API (I'm guessing.) This seems like perhaps the most correct way, but given that ADF doesn't seem to support this as a feature, it makes me feel like there's a better way of doing it ... it also seems mildly wasteful.
If my data slicing strategy is based on DATE_MODIFIED, I run into issues with the ADF activities not being repeatable. An old slice, when refreshed, would give different results because rows that were within the window may have moved to a different window. On the other hand, the latest slice would always catch rows that have changed. The other issue is preventing row duplication. The pre-activity clean up actions would need to somehow be able to clean up records in the destination table prior to the copy. Or some type of UPSERT method must be employed.
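If the sink is SQL Server / Azure SQL, the upsert for a DATE_MODIFIED slice could look something like this T-SQL sketch (table, column, and parameter names are assumptions, not from your pipeline):

DECLARE @SliceStart datetime2 = '2024-01-01',
        @SliceEnd   datetime2 = '2024-01-02';

MERGE dbo.TargetTable AS t
USING (
    SELECT Id, Col1, DATE_CREATED, DATE_MODIFIED
    FROM dbo.StagingSlice
    WHERE DATE_MODIFIED >= @SliceStart
      AND DATE_MODIFIED <  @SliceEnd
) AS s
ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Col1 = s.Col1,
               t.DATE_MODIFIED = s.DATE_MODIFIED
WHEN NOT MATCHED THEN
    INSERT (Id, Col1, DATE_CREATED, DATE_MODIFIED)
    VALUES (s.Id, s.Col1, s.DATE_CREATED, s.DATE_MODIFIED);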
The final option is to TRUNCATE the destination table every day. This is fine for smaller tables but has its own downsides, (1) we're not really "slicing" at all anymore. This is just scorched earth. (2) Any time any slice is being processed, all downstream slices from all dates are in danger of failing, due to the table being blown away. (3) practically impossible if your table has any respectable amount of data in it.
No option seems excellent but the first option seems better. Looking for advice from someone who has solved this problem or is experienced with ADF.
I have a production SQL Server DB (reporting) that has many stored procedures.
The SPs are publicly exposed to the external world in different ways:
- some users have access directly to the SP,
- some are exposed via a WebService,
- while others are encapsulated as interfaces through a DCOM layer.
The user base is large and we do not know exactly which user-set uses which method of accessing the DB.
We get frequent requests (about one every other month) from user-sets to modify an existing SP by adding one column or a group of columns to the existing output, all else remaining the same.
We initially started doing this by modifying the existing SP and adding the newly requested columns to the end of the output. But this broke the custom tools built by some other user bases as their tool had the number of columns hardcoded, so adding a column meant they had to modify their tool as well.
Also, for some columns complex logic is required to get them into the report, which degraded the SP's performance, affecting all users - even those who did not need the new column.
We are thinking of various ways to fix this:
1 Default Parameters to control flow
Update the existing SP and control the new functionality with a flag added as a default parameter. Only when the parameter is set to true is the new functionality executed; by default it is false, so existing callers keep the old behaviour. (A sketch follows after this list.)
Advantages
No new object is required.
Ongoing maintenance is not affected.
Testing overhead remains under control.
Disadvantages
Since an existing SP is modified, it will need testing of the existing functionality as well as the new functionality.
Since we have no inkling of how the client tools are calling the SPs, we can never be sure that we have not broken anything.
It will be difficult to handle if the same report gets modified again with more requests – it will mean more flags, and the code will become unreadable.
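A sketch of option 1 in T-SQL (procedure, table, and column names are illustrative, not from the actual reporting DB):

CREATE OR ALTER PROCEDURE dbo.GetResourceReport
    @IncludeExtraColumns bit = 0  -- default keeps the old behaviour
AS
BEGIN
    SET NOCOUNT ON;

    IF @IncludeExtraColumns = 0
    BEGIN
        SELECT r.Id, r.Name, r.CreatedAt
        FROM dbo.Resource AS r;
    END
    ELSE
    BEGIN
        SELECT r.Id, r.Name, r.CreatedAt,
               r.ExtraColumn1, r.ExtraColumn2  -- newly requested columns
        FROM dbo.Resource AS r;
    END
END;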
2 New Stored procedure
A new stored procedure will be created for any requirement that changes the signature (input/output) of the SP. The new SP will call the original stored procedure for the existing output and add the logic for the new requirement on top of it.
Advantage
The benefit is that there is no impact on the existing procedure, hence no testing is required for the old logic.
Disadvantages
Need to create new objects in the database whenever changes are requested. This is overhead in database maintenance.
Will the execution plan change based on adding a new parameter? If yes, this could adversely affect users who did not request the new column.
Considering that an SP is a public interface to the DB, and that interfaces should be immutable, should we go for option 2?
What is the best practice, or does it depend on the case, and what should be the main driving factors when choosing an option?
Thanks in advance!
Quoting from a disadvantage for your first option:
It will be difficult to handle if the same report gets modified again with more requests – it will mean more flags, and the code will become unreadable.
Personally I feel this is the biggest reason not to modify an existing stored procedure to accommodate the new columns.
When bugs come up with a stored procedure that has several branches, it can become very difficult to debug. Also as you hinted at, the execution plan can change with branching/if statements. (sql using different execution plans when running a query and when running that query inside a stored procedure?)
This is very similar to object oriented coding and your instinct is correct that it's best to extend existing objects instead of modify them.
I would go for approach #2. You will have more objects, but at least when an issue comes up, you will be able to know the affected stored procedure has limited scope/impact.
Over time I've learned to grow objects/data structures horizontally, not vertically. In other words, just make something new, don't keep making existing things bigger and bigger and bigger.
Ok. #2. Definitely. No doubt.
#1 says: "change the existing procedure", causing things to break. No way that's a good thing! Your customers will hate you. Your code just gets more complex meaning it is harder and harder to avoid breaking things leading to more hatred. It will go horribly slowly, and be impossible to tune. And so on.
For #2 you have a stable interface. No hatred. Yay! Seriously, "yay" as in "I still have a job!" as opposed to "boo, I got fired for annoying the hell out of my customers". Seriously. Never ever do #1 for that reason alone. You know this is true. You know it!
Having said that, record what people are doing. Take a user-id as a parameter. Log it. Know your users. Find the ones using old crappy code and ask them nicely to upgrade if necessary.
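A sketch of that logging idea (the log table, parameter, procedure, and column names are assumptions, not from the actual system):

CREATE TABLE dbo.ProcUsageLog
(
    LogId    int IDENTITY PRIMARY KEY,
    ProcName sysname       NOT NULL,
    CallerId nvarchar(100) NULL,
    CalledAt datetime2     NOT NULL DEFAULT (SYSDATETIME())
);
GO

CREATE OR ALTER PROCEDURE dbo.GetOrdersReport
    @CallerId nvarchar(100) = NULL  -- optional, so existing callers keep working
AS
BEGIN
    SET NOCOUNT ON;

    -- Record who is calling and which procedure they used
    INSERT INTO dbo.ProcUsageLog (ProcName, CallerId)
    VALUES (OBJECT_NAME(@@PROCID), @CallerId);

    -- ...existing report query goes here...
    SELECT o.Id, o.OrderDate, o.Total
    FROM dbo.Orders AS o;
END;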
Your reason given to avoid number 2 is proliferation. But that is only a problem if you don't test stuff. If you do test stuff properly, then proliferation is happening anyway, in your tests. And you can always tune things in #2 if you have to, or at least isolate performance problems.
If the fatter procedure is really great, then retrofit the skinny version with a slimmer version of the fat one. In SQL this is tricky, but copy/paste and cutting down your select column list works. Generally I just don't bother to do this. Life is too short. Having really good test code is a much better investment of time, and data schemas tend to rarely change in ways that break existing queries.
Okay. Rant over. Serious message. Do #2, or at the very least do NOT do #1 or you will get yourself fired, or hated, or both. I can't think of a better reason than that.
Easier to go with #2. Nullable SP parameters can create some very difficult-to-locate bugs, although I do employ them from time to time.
Especially when you start getting into joins on nulls and ANSI settings: the way you write the query will change the results dramatically. KISS (keep it simple, stupid).
Also, if it's a parameterized search for reporting or displaying, I might consider a super-fast fetch of data into a LINQ-able object. Then you can search an in-memory list rather than re-fetching from the database.
#2 could be a better option than #1, particularly considering the third disadvantage of #1, since requirements keep changing most of the time. I feel this because, on either side, the disadvantages dominate the advantages.
I would also vote for #2. I've seen a few stored procedures which take #1 to the extreme: the SP has a parameter @Option and a few parameters @param1, @param2, .... The net effect is a single stored procedure that tries to play the role of many stored procedures.
The main disadvantage to #2 is that there are more stored procedures. It may be more difficult to find the one you're looking for, but I think that is a small price to pay for the other advantages you get.
I also want to make sure that you don't just copy and paste the original stored procedure and add some columns; I've seen too many of those as well. If you are only adding a few columns, you can call the original stored procedure and join in the new columns. This will incur a performance penalty if those columns were readily available before, but you won't have to change your original stored procedure (refactoring to allow for good performance and no duplication of code), nor will you have to maintain two copies of the code (copy and paste for performance).
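A sketch of the "call the original and join in the new columns" idea, using hypothetical names (note that INSERT ... EXEC requires knowing the original procedure's column list):

CREATE PROCEDURE dbo.GetResourceReport_V2
AS
BEGIN
    SET NOCOUNT ON;

    -- Capture the untouched original procedure's output
    CREATE TABLE #Original
    (
        Id        int,
        Name      nvarchar(200),
        CreatedAt datetime2
    );

    INSERT INTO #Original (Id, Name, CreatedAt)
    EXEC dbo.GetResourceReport;

    -- Join in the newly requested columns on top of the existing result
    SELECT o.Id, o.Name, o.CreatedAt,
           r.ExtraColumn1, r.ExtraColumn2
    FROM #Original AS o
    JOIN dbo.Resource AS r ON r.Id = o.Id;
END;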
I am going to suggest a couple of other options based on the options you gave.
Alternative option #1: Add another parameter, but instead of making it a default flag, base it on the customer name. That way Customer A can get his specialized report and Customer B can get his slightly different customized report. This adds a lot of work, as updates to the 'Main' portion would have to be copied to all the specialty customer versions.
You could do this with branching 'if' statements.
Alternative option #2: Add new stored procedures, just add the customer's name to the stored procedure. Maintenance wise, this might be a little more difficult but it will achieve the same end results, each customer gets his own report type.
Option #2 is the one to choose.
You yourself mentioned (dis)advantages.
When adding new objects to the DB as requirements change, add only the necessary ones, so that your new SPs don't grow large and difficult to maintain.
For me, the classic wisdom is to store enum values (OrderStatus, UserTypes, etc) as Lookup tables in your db. This lets me enforce data integrity in the database, preventing false or null values, etc.
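For example, the lookup-table version of an OrderStatus enum might look roughly like this (illustrative names):

CREATE TABLE order_status
(
    id   int PRIMARY KEY,
    name varchar(50) NOT NULL UNIQUE
);

INSERT INTO order_status (id, name)
VALUES (1, 'Pending'), (2, 'Shipped'), (3, 'Cancelled');

CREATE TABLE orders
(
    id        int PRIMARY KEY,
    status_id int NOT NULL REFERENCES order_status (id)  -- unknown or NULL statuses are rejected
);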
However, more and more, this feels like unnecessary duplication to me. Not only do I have to create tables for these values (or have an unwieldy central lookup table), but if I want to add a value, I have to remember to add it in two places (or more, counting production, testing, and live DBs), and things can get out of sync easily.
Still I have a hard time letting go of lookup tables.
I know there are probably certain scenarios where one had an advantage over the other, but what are your general thoughts?
I've done both, but I now much prefer defining them in classes in code.
New files cost nothing, and the benefits that you seek by having it in the database should be handled as business rules.
Also, I have an aversion to holding data in a database that really doesn't change. And it seems an enum fits this description. It doesn't make sense for me to have a States lookup table, but a States enum class makes sense to me.
If it has to be maintained I would leave them in a lookup table in the DB. Even if I think they won't need to be maintained I would still go towards a lookup table so that if I am wrong it's not a big deal.
EDIT:
I want to clarify that if the Enum is not part of the DB model then I leave it in code.
I put them in the database, but I really can't defend why I do that. It just "seems right". I guess I justify it by saying there's always a "right" version of what the enums can be by checking the database.
Schema dependencies should be stored in the database itself to ensure any changes to your architecture can be performed transparently to the app.
I prefer enums, as they enforce early binding of values in code, so that exceptions aren't caused by missing values.
It's also helpful if you can use code generation that can bring in the associations of the integer columns to an enumeration type, so that in business logic you only have to deal with easily memorable enumeration values.
Consider it a form of documentation.
If you've already documented the enum constants properly in the code that uses the DB, do you really need a duplicate set of documentation (to use and maintain)?
How do you implement a last-modified column in SQL?
I know for a date-created column you can just set the default value to getdate(). For last-modified I have always used triggers, but it seems like there must be a better way.
Thanks.
Triggers are the best way, because this logic is intimately associated with the table, and not the application. This is just about the only obvious proper use of a trigger that I can think of, apart from more granular referential integrity.
But I think "last-modified" dates are a flawed concept anyway. What makes the "last-changed" timestamp any more valuable than the ones that came before it? You generally need a more complete audit trail.
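A minimal SQL Server sketch of the trigger approach, assuming an illustrative table dbo.Widget with a WidgetId key and a LastModified column:

CREATE TRIGGER trg_Widget_LastModified
ON dbo.Widget
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Stamp every updated row; direct recursion is blocked unless the
    -- RECURSIVE_TRIGGERS database option has been turned on.
    UPDATE w
    SET LastModified = GETDATE()
    FROM dbo.Widget AS w
    JOIN inserted AS i ON i.WidgetId = w.WidgetId;
END;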
The only other way to perform this without using triggers is to disallow any inserts/updates/deletes on the table directly via permissions, and insist all these actions are performed via stored procedures that will take care of setting the modified date.
An administrator might still be able to modify data without using the stored procedures, but an administrator can also disable triggers.
If there are a lot of tables that require this sort of functionality, I would favour triggers as it simplifies the code. Simple, well written and well-indexed auditing triggers are generally not too evil - they only get bad when you try to put too much logic in the trigger.
You can use the keyword DEFAULT, assuming you have a default constraint.
On insert, there is no need to specify a value; you could use the keyword there too.
No trigger is needed, and it happens in the same write as the "real" data:
UPDATE dbo.SomeTable
SET SomeColumn = 'new value',        -- the "real" data change
    LastUpdatedDateTime = DEFAULT    -- re-applies the column default, e.g. GETDATE()
WHERE Foo = 'bar';
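The default constraint that the statement above relies on would look something like this (table and column names are illustrative):

ALTER TABLE dbo.SomeTable
ADD CONSTRAINT DF_SomeTable_LastUpdated
    DEFAULT (GETDATE()) FOR LastUpdatedDateTime;

-- On INSERT the column can simply be omitted (or given DEFAULT explicitly)
INSERT INTO dbo.SomeTable (Foo, SomeColumn)
VALUES ('bar', 'initial value');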
Using a trigger to update the last modified column is the way to go. Almost every record in the system at work is stamped with an add and change timestamp and this has helped me quite a bit. Implementing it as a trigger will let you stamp it whenever there is any change, no matter how it was initiated.
Another nice thing about a trigger is that you can easily expand it to store an audit trail as well without too much trouble.
And as long as you are doing so, add a field for last_updated_by and update it with the user every time the record is updated. Not as good as a real audit table, but much better than just the date last updated.
A trigger is the only method you can use to do this. And to those who said triggers are evil: no, they aren't. They are the best way - by far - to maintain data integrity that is more complex than a simple default or constraint. They do need to be written by people who actually know what they are doing, though. But then that is true of most things that affect database design and integrity.