I have a 300+ column table and a trigger.
I have 40(ish) "columns of interest" and the rest I don't care about.
The purpose of the trigger is to determine if any of the "columns of interest" were updated.
So: determining the COLUMNS_UPDATED() seems to be the way to go and do some bit-magic with that.
Right now: I'm Faking the 300+ table and ramming in a complete row to force the correct number of columns to ensure COLMNS_UPDATED() gets the correct bytes/bits flagged for columns changed.
My question is: is this unnecessary because the genius of FakeTable encompasses the use of COLUMNS_UPDATED() and it will return the "correct" byte indicators even if I only faked a subset of the columns?
tSQLt.FakeTable reconstructs the table based on the original and always includes all columns. So, after using tSQLt.ApplyTrigger, to move the trigger over to the faked table, COLUMS_UPDATED will work as designed.
That being said, if you encounter a piece of functionality in any external tool or API that you are going to rely on in your product, and you have the feeling that there is a piece of functionality that is not documented well, it is always a good idea to write an exploratory test and check that test in with your other tests.
In this case I would create a wide table on the fly within the test, take the table, create an update trigger on the faked table and then call tSQLt.AssertEquals within that trigger to make sure the right value is returned by COLUMNS_UPDATED.
You achieve two things with that:
A) you document the expected behavior that your code relies on. That will help future developers that have to maintain your code.
B) You’ll get notified if that behavior changes.
Now, to be clear, there’s no reason to do that with well documented functionality. But if there are documentation gaps, it’s a good practice.
Related
First, I know this is a rather subjective question but I need some kind of formal documentation to help me educate my client.
Background - a large enterprise application with hundreds of tables and SP's, all neatly designed with normalized tables and foreign keys using identity columns.
Our client has a few employees writing complex reports in Crystal enterprise using a replicated copy of our production Db.
We have tables that store what I would classify as 'system' base information, such as a list of office locations, or departments within the company, standard set of roles for users, statuses of other objects (open/closed etc), basically data that doesn't change often.
The issue - the report designers and financial analysts are writing queries with hardcoded identity values inside of them. Something like this
SELECT xxx FROM OFFICE WHERE OFFICE_ID = 6
I'm greatly simplifying here, but basically they're using these hard coded int values inside their procedures all over the place.
For SQL developers seeing this will obviously make you facepalm as it's just a built-in instinct not to do this.
However, surprisingly I can't find any documentation or even best practices articles as to why this shouldn't be done.
They would argue it's fine to do this since the values never change, and they're right, within that single system those values won't change, however across multiple environments (staging/QA/Dev) those values can and are absolutely different, making their reporting design approach non-portable and only able to function in 1 isolated server environment.
Do any of the SQL guru's out there have any more in-depth information/articles etc that I can use to help educate my client on why they should avoid this approach?
Seems to me the strongest argument to your report writers is your second to last sentence "...those values can and are absolutely different [between environments]". That would be pretty much the gist of my response to them.
Of course there's always gray area to any question. Identity columns are essentially magic numbers. They have the benefit to the database of being...
Small
Sequential
Fast to seek and join on, sort by and create
...but have the downside of being of completely meaningless, and in effect, randomly assigned (sort the inserts into that table one way, you get a different identity per row than if you sorted the other way). As such, in cases where you have to look up something specific like that, it's common use also include a "business/natural/alternate" key (e.g. maybe (a completely made up example) [CategoryName] where CatgoryName is something short, unique and human readable, while. [CategoryId] is an identity, but not something intended to be sought on)
If you have a website with, say, a dropdown menu, usually the natural key gets put into the visible part of the drop down, and the surrogate/identity key gets passed around on the back end, invisible to the end user.
This gets a little trickier when you have people writing queries directly against the database. If they're owners of the data, they may know things about the larger data structure which they can take advantage of in *cough "clever" ways. If you know the keys wont change and you know what those values are, there might be a case to be made just referencing those. But again, not if they're going to be different when you query a different server.
Of course the flip side is, if you don't want them to use the identity values, you'll have to give them an alternative. And if your tables don't already include a business/natural/alternate key, you're going to have to add one wherever one doesn't already exist.
Also, there's nothing wrong with that alternate key being an integer too (maybe you already have company-wide identifiers for your offices of 1, 2, 3 etc), but the point is that it's deterministic no matter where you run your query.
I have a production SQL-Server DB (reporting) that has many Stored Procedures.
The SPs are publicly exposed to the external world in different ways
- some users have access directly to the SP,
- some are exposed via a WebService
- while others are encapsulated as interfaces thru a DCOM layer.
The user base is large and we do not know exactly which user-set uses which method of accessing the DB.
We get frequent (about 1 every other month) requests from user-sets for modifying an existing SP by adding one column to the output or a group of columns to the existing output, all else remaining same.
We initially started doing this by modifying the existing SP and adding the newly requested columns to the end of the output. But this broke the custom tools built by some other user bases as their tool had the number of columns hardcoded, so adding a column meant they had to modify their tool as well.
Also for some columns complex logic is required to get that column into the report which meant the SP performance degraded, affecting all users - even those who did not need the new column.
We are thinking of various ways to fix this:
1 Default Parameters to control flow
Update the existing SP and control the new functionality by adding a flag as a default parameter to control the code path. By using default parameters, if value of the Parameter is set to true then only call the new functionality. By default it is set to False.
Advantage
New Object is not required.
On going maintenance is not affected.
Testing overhead remains under control.
Disadvantage
Since an existing SP is modified, it will need testing of existing functionality as well as new functionality.
Since we have no inkling on how the client tools are calling the SPs we can never be sure that we have not broken anything.
It will be difficult to handle if same report gets modified again with more requests – will mean more flags and code will become un-readable.
2 New Stored procedure
A new stored procedure will be created for any requirement which changes the signature(Input/Output) of the SP. The new SP will call the original stored procedure for existing stuff and add the logic for new requirement on top of it.
Advantage
Here benefit will be that there will be No impact on the existing procedure hence No Testing required for old logic.
Disadvantage
Need to create new objects in database whenever changes are requested. This will be overhead in database maintenance.
Will the execution plan change based on adding a new parameter? If yes then this could adversely affect users who did not request the new column.
Considering a SP is a public interface to the DB and interfaces should be immutable should we go for option 2?
What is the best practice or does it depend on a case by case basis, and what should be the main driving factors when choosing a option?
Thanks in advance!
Quoting from a disadvantage for your first option:
It will be difficult to handle if same report gets modified again with more requests – will mean more flags and code will become un-readable.
Personally I feel this is the biggest reason not to modify an existing stored procedure to accommodate the new columns.
When bugs come up with a stored procedure that has several branches, it can become very difficult to debug. Also as you hinted at, the execution plan can change with branching/if statements. (sql using different execution plans when running a query and when running that query inside a stored procedure?)
This is very similar to object oriented coding and your instinct is correct that it's best to extend existing objects instead of modify them.
I would go for approach #2. You will have more objects, but at least when an issue comes up, you will be able to know the affected stored procedure has limited scope/impact.
Over time I've learned to grow objects/data structures horizontally, not vertically. In other words, just make something new, don't keep making existing things bigger and bigger and bigger.
Ok. #2. Definitely. No doubt.
#1 says: "change the existing procedure", causing things to break. No way that's a good thing! Your customers will hate you. Your code just gets more complex meaning it is harder and harder to avoid breaking things leading to more hatred. It will go horribly slowly, and be impossible to tune. And so on.
For #2 you have a stable interface. No hatred. Yay! Seriously, "yay" as in "I still have a job!" as opposed to "boo, I got fired for annoying the hell out of my customers". Seriously. Never ever do #1 for that reason alone. You know this is true. You know it!
Having said that, record what people are doing. Take a user-id as a parameter. Log it. Know your users. Find the ones using old crappy code and ask them nicely to upgrade if necessary.
Your reason given to avoid number 2 is proliferation. But that is only a problem if you don't test stuff. If you do test stuff properly, then proliferation is happening anyway, in your tests. And you can always tune things in #2 if you have to, or at least isolate performance problems.
If the fatter procedure is really great, then retrofit the skinny version with a slimmer version of the fat one. In SQL this is tricky, but copy/paste and cut down your select column list works. Generally I just don't bother to do this. Life is too short. Having really good test code is a much better investment of time, and data schema tend to rarely change in ways that break existing queries.
Okay. Rant over. Serious message. Do #2, or at the very least do NOT do #1 or you will get yourself fired, or hated, or both. I can't think of a better reason than that.
Easier to go with #2. Nullable SP parameters can create some very difficult to locate bugs. Although, I do employ them from time to time.
Especially when you start getting into joins on nulls and ANSI settings. The way you write the query will change the results dramatically. KISS. (Keep things simple stupid).
Also, if it's a parameterized search for reporting or displaying, I might consider a super-fast fetch of data into a LINQ-able object. Then you can search an in-memory list rather than re-fetching from the database.
#2 could be better option than #1 particularly considering the bullet 3 of disadvantages of #1 since requirements keep changing on most of the time. I feel this because disadvantages are dominating here than advantages on either side.
I would also vote for #2. I've seen a few stored procedures which take #1 to the extreme: The SPs has a parameter #Option and a few parameters #param1, #param2, .... The net effect is that this is a single stored procedure that tries to play the role of many stored procedures.
The main disadvantage to #2 is that there are more stored procedures. It may be more difficult to find the one you're looking for, but I think that is a small price to pay for the other advantages you get.
I want to make sure also, that you don't just copy and paste the original stored procedure and add some columns. I've also seen too many of those. If you are only adding a few columns, you can call the original stored procedure and join in the new columns. This will definitely incur a performance penalty if those columns were readily available before, but you won't have to change your original stored procedure (refactoring to allow for good performance and no duplication of the code), nor will you have to maintain two copies of the code (copy and paste for performance).
I am going to suggest a couple of other options based on the options you gave.
Alternative option #1: Add another variable, but instead of making it a default variable base the variable off of customer name. That way Customer A can get his specialized report and Customer B can get his slightly different customized report. This adds a ton of work as updates to the 'Main' portion would have to get copied to all the specialty customer ones.
You could do this with branching 'if' statements.
Alternative option #2: Add new stored procedures, just add the customer's name to the stored procedure. Maintenance wise, this might be a little more difficult but it will achieve the same end results, each customer gets his own report type.
Option #2 is the one to choose.
You yourself mentioned (dis)advantages.
While you consider adding new objects to db based on requirement changes, add only necessary objects that don't make your new SP bigger and difficult to maintain.
I am working on an application that someone else wrote and it appears that they are using IDs throughout the application that are not defined in the database. For a simplified example, lets say there is a table called Question:
Question
------------
Id
Text
TypeId
SubTypeId
Currently the SubTypeId column is populated with a set of IDs that do not reference another table in the database. In the code these SubTypeIds are mapped to a specific string in a configuration file.
In the past when I have had these types of values I would create a lookup table and insert the appropriate values, but in this application there is a mapping between the IDs and their corresponding text values in a configuration file.
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Absolutely, yes. It brings in a heavy dependence on the code to manage and maintain references, fetch necessary values, etc. In a situation where you now need to create additional functionality, you would rely on copy-pasting the mapping (or importing them, etc.) which is more likely to cause an issue.
It's similar to why DB constraints should be in the DB rather than in the program/application that's accessing it - any maintenance or new application needs to replicate all the behaviour and rules. Having things this way has similar side-affects I've mentioned here in another answer.
Good reasons to have a lookup table:
Since DBs can generally naturally have these kinds of relations, it would be obvious to use them.
Queries first need to be constructed in code for the Type- and SubType- Text vs ID instead of having them as part of the where/having clause of the query that is actually executed.
Speed/Performance - with the right indexes and table structures, you'd benefit from this (and reduce code complexity that manages it)
You don't need to update your code for to add a new Type or SubType, or to edit/delete them.
Possible reasons it was done that way, which I don't think are valid reasons:
The TypeID and SubTypeID are related and the original designer did not know how to create a complex foreign key. (Not a good reason though.)
Another could be 'translation' but that could also be handled using foreign key relations.
In some pieces of code, there may not be a strict TypeID-to-SubTypeID relation and that logic was handled in code rather than in the DB. Again, can be managed using 'flag' values or NULLs if possible. Those specific cases could be handled by designing the DB right and then working around a unique/odd situation in code instead of putting all the dependence on the code.
NoSQL: Original designer may be under the impression that such foreign keys or relations cannot be done in a NoSQL db.
And the obvious 'people' problem vs technical challenge: The original designer may not have had a proper understanding of databases and may have been a programmer who did that application (or was made to do it) without the right knowledge or assistance.
Just to put it out there: If the previous designer was an external contractor, he may have used the code maintenance complexity or 'support' clause as a means to get more business/money.
As a general rule of thumb, I'd say that keeping all the related data in a DB is a better practice since it removes a tacit dependency between the DB and your app, and because it makes the DB more "comprehensible." If the definitions of the SubTypeIDs are in a lookup table it becomes possible to create queries that return human-readable results, etc.
That said, the right answer probably depends a bit on the specifics of the application. If there's very tight coupling between the DB and app to begin with (eg, if the DB isn't going to be accessed by other clients) this is probably a minor concern particularly if the set of SubTypeIDs is small and seldom changes.
Marginally related to Should I delete or disable a row in a relational database?
Given that I am going to go with the strategy of warehousing changes to my tables in a history table, I am faced with the following options for implementing a status for a given row in MySQL:
An isActive booelan
An activeStatus enum
An activeStatus INT referencing a small ActiveStatus lookup table
An activeStatus INT not referencing another table
The first approach is rather inflexible in my opinion, since I might need more booleans in the future to support other types of active statuses (I'm not sure what they would be, but maybe something like "being phased out" or "active for a random group of users", etc).
I'm told that MySQL enum is bad, so the second approach probably won't fly.
I like the third approach, but I'm wondering if it is a heavy handed solution to a relatively small problem.
The fourth approach requires that we know in advance what each status INT means and seems like an outdated way to do things.
Is there a canonical right answer? Am I ignoring another approach?
Personally I would go with your third option.
Boolean values often turn out to be more complex in reality, as you suggested. ENUMs can be nice, but they have the downside that as soon as you want to store additional information about each value - who added it, when, is it only valid for a certain time period or source system, comments etc. - it becomes difficult, whereas with a lookup table those data can easily be maintained in additional columns. ENUMs are a good tool to constrain data to certain values (like a CHECK constraint), but not such a good tool if those values have significant meaning and need to be exposed to users.
It's not entirely clear from your question if you plan to treat your history table like a fact table and use it in reports, but if so then you could consider the ActiveStatus lookup table as a dimension. In this case a table is much easier, because your reporting tool can read the possible values from the dimension table in order to let the user choose his query conditions; such tools generally don't know anything about ENUMs.
From my point of view your 2nd approach is better if u have more than 2 status.Because ENUM is great for data that you know will fall within a static set. But if u have only two status active and inactive then its always better to use boolean.
EDIT:
If u r sure in future u r not gonna change the value of your ENUM then its great to use ENUM for such field.
How do you implement a last-modified column in SQL?
I know for a date-created column you can just set the default value to getdate(). For last-modified I have always used triggers, but it seems like there must be a better way.
Thanks.
Triggers are the best way, because this logic is intimately associated with the table, and not the application. This is just about the only obvious proper use of a trigger that I can think of, apart from more granular referential integrity.
But I think "last-modified" dates are a flawed concept anyway. What makes the "last-changed" timestamp any more valuable than the ones that came before it? You generally need a more complete audit trail.
The only other way to perform this without using triggers is to disallow any inserts/updates/deletes on the table directly via permissions, and insist all these actions are performed via stored procedures that will take care of setting the modified date.
An administrator might still be able to modify data without using the stored procedures, but an administrator can also disable triggers.
If there are a lot of tables that require this sort of functionality, I would favour triggers as it simplifies the code. Simple, well written and well-indexed auditing triggers are generally not too evil - they only get bad when you try to put too much logic in the trigger.
You can use the keyword DEFAULT, assuming you have a default constraint.
On insert, there is no need to specify a value, You could use the keyword here too.
No trigger and is done in the same write as the "real" data
UPDATE
table
SET
blah,
LastUpdatedDateTime = DEFAULT
WHERE
foo = bar
Using a trigger to update the last modified column is the way to go. Almost every record in the system at work is stamped with an add and change timestamp and this has helped me quite a bit. Implementing it as a trigger will let you stamp it whenever there is any change, no matter how it was initiated.
Another nice thing about a trigger is that you can easily expand it to store an audit trail as well without too much trouble.
And as long as you are doing so, add a field for last_updated_by and update the user everytime the record is updated. Not as good as a real audit table but much better than the date last updated.
A trigger is the only method you can use to do this. And to those who said triggers are evil, no they aren't. They are the best way - by far - to maintain data integrity that is more complex than a simple default or constraint. They do need to written by people who actually know what they are doing though. But then that is true of most things that affect database design and integrity.