What is better- Add an optional parameter to an existing SP or add a new SP? - sql-server

I have a production SQL-Server DB (reporting) that has many Stored Procedures.
The SPs are publicly exposed to the external world in different ways
- some users have access directly to the SP,
- some are exposed via a WebService
- while others are encapsulated as interfaces thru a DCOM layer.
The user base is large and we do not know exactly which user-set uses which method of accessing the DB.
We get frequent (about 1 every other month) requests from user-sets for modifying an existing SP by adding one column to the output or a group of columns to the existing output, all else remaining same.
We initially started doing this by modifying the existing SP and adding the newly requested columns to the end of the output. But this broke the custom tools built by some other user bases as their tool had the number of columns hardcoded, so adding a column meant they had to modify their tool as well.
Also for some columns complex logic is required to get that column into the report which meant the SP performance degraded, affecting all users - even those who did not need the new column.
We are thinking of various ways to fix this:
1 Default Parameters to control flow
Update the existing SP and control the new functionality by adding a flag as a default parameter to control the code path. By using default parameters, if value of the Parameter is set to true then only call the new functionality. By default it is set to False.
Advantage
New Object is not required.
On going maintenance is not affected.
Testing overhead remains under control.
Disadvantage
Since an existing SP is modified, it will need testing of existing functionality as well as new functionality.
Since we have no inkling on how the client tools are calling the SPs we can never be sure that we have not broken anything.
It will be difficult to handle if same report gets modified again with more requests – will mean more flags and code will become un-readable.
2 New Stored procedure
A new stored procedure will be created for any requirement which changes the signature(Input/Output) of the SP. The new SP will call the original stored procedure for existing stuff and add the logic for new requirement on top of it.
Advantage
Here benefit will be that there will be No impact on the existing procedure hence No Testing required for old logic.
Disadvantage
Need to create new objects in database whenever changes are requested. This will be overhead in database maintenance.
Will the execution plan change based on adding a new parameter? If yes then this could adversely affect users who did not request the new column.
Considering a SP is a public interface to the DB and interfaces should be immutable should we go for option 2?
What is the best practice or does it depend on a case by case basis, and what should be the main driving factors when choosing a option?
Thanks in advance!

Quoting from a disadvantage for your first option:
It will be difficult to handle if same report gets modified again with more requests – will mean more flags and code will become un-readable.
Personally I feel this is the biggest reason not to modify an existing stored procedure to accommodate the new columns.
When bugs come up with a stored procedure that has several branches, it can become very difficult to debug. Also as you hinted at, the execution plan can change with branching/if statements. (sql using different execution plans when running a query and when running that query inside a stored procedure?)
This is very similar to object oriented coding and your instinct is correct that it's best to extend existing objects instead of modify them.
I would go for approach #2. You will have more objects, but at least when an issue comes up, you will be able to know the affected stored procedure has limited scope/impact.
Over time I've learned to grow objects/data structures horizontally, not vertically. In other words, just make something new, don't keep making existing things bigger and bigger and bigger.

Ok. #2. Definitely. No doubt.
#1 says: "change the existing procedure", causing things to break. No way that's a good thing! Your customers will hate you. Your code just gets more complex meaning it is harder and harder to avoid breaking things leading to more hatred. It will go horribly slowly, and be impossible to tune. And so on.
For #2 you have a stable interface. No hatred. Yay! Seriously, "yay" as in "I still have a job!" as opposed to "boo, I got fired for annoying the hell out of my customers". Seriously. Never ever do #1 for that reason alone. You know this is true. You know it!
Having said that, record what people are doing. Take a user-id as a parameter. Log it. Know your users. Find the ones using old crappy code and ask them nicely to upgrade if necessary.
Your reason given to avoid number 2 is proliferation. But that is only a problem if you don't test stuff. If you do test stuff properly, then proliferation is happening anyway, in your tests. And you can always tune things in #2 if you have to, or at least isolate performance problems.
If the fatter procedure is really great, then retrofit the skinny version with a slimmer version of the fat one. In SQL this is tricky, but copy/paste and cut down your select column list works. Generally I just don't bother to do this. Life is too short. Having really good test code is a much better investment of time, and data schema tend to rarely change in ways that break existing queries.
Okay. Rant over. Serious message. Do #2, or at the very least do NOT do #1 or you will get yourself fired, or hated, or both. I can't think of a better reason than that.

Easier to go with #2. Nullable SP parameters can create some very difficult to locate bugs. Although, I do employ them from time to time.
Especially when you start getting into joins on nulls and ANSI settings. The way you write the query will change the results dramatically. KISS. (Keep things simple stupid).
Also, if it's a parameterized search for reporting or displaying, I might consider a super-fast fetch of data into a LINQ-able object. Then you can search an in-memory list rather than re-fetching from the database.

#2 could be better option than #1 particularly considering the bullet 3 of disadvantages of #1 since requirements keep changing on most of the time. I feel this because disadvantages are dominating here than advantages on either side.

I would also vote for #2. I've seen a few stored procedures which take #1 to the extreme: The SPs has a parameter #Option and a few parameters #param1, #param2, .... The net effect is that this is a single stored procedure that tries to play the role of many stored procedures.
The main disadvantage to #2 is that there are more stored procedures. It may be more difficult to find the one you're looking for, but I think that is a small price to pay for the other advantages you get.
I want to make sure also, that you don't just copy and paste the original stored procedure and add some columns. I've also seen too many of those. If you are only adding a few columns, you can call the original stored procedure and join in the new columns. This will definitely incur a performance penalty if those columns were readily available before, but you won't have to change your original stored procedure (refactoring to allow for good performance and no duplication of the code), nor will you have to maintain two copies of the code (copy and paste for performance).

I am going to suggest a couple of other options based on the options you gave.
Alternative option #1: Add another variable, but instead of making it a default variable base the variable off of customer name. That way Customer A can get his specialized report and Customer B can get his slightly different customized report. This adds a ton of work as updates to the 'Main' portion would have to get copied to all the specialty customer ones.
You could do this with branching 'if' statements.
Alternative option #2: Add new stored procedures, just add the customer's name to the stored procedure. Maintenance wise, this might be a little more difficult but it will achieve the same end results, each customer gets his own report type.

Option #2 is the one to choose.
You yourself mentioned (dis)advantages.
While you consider adding new objects to db based on requirement changes, add only necessary objects that don't make your new SP bigger and difficult to maintain.

Related

How Does FakeTable Work with the COLUMNS_UPDATED() function?

I have a 300+ column table and a trigger.
I have 40(ish) "columns of interest" and the rest I don't care about.
The purpose of the trigger is to determine if any of the "columns of interest" were updated.
So: determining the COLUMNS_UPDATED() seems to be the way to go and do some bit-magic with that.
Right now: I'm Faking the 300+ table and ramming in a complete row to force the correct number of columns to ensure COLMNS_UPDATED() gets the correct bytes/bits flagged for columns changed.
My question is: is this unnecessary because the genius of FakeTable encompasses the use of COLUMNS_UPDATED() and it will return the "correct" byte indicators even if I only faked a subset of the columns?
tSQLt.FakeTable reconstructs the table based on the original and always includes all columns. So, after using tSQLt.ApplyTrigger, to move the trigger over to the faked table, COLUMS_UPDATED will work as designed.
That being said, if you encounter a piece of functionality in any external tool or API that you are going to rely on in your product, and you have the feeling that there is a piece of functionality that is not documented well, it is always a good idea to write an exploratory test and check that test in with your other tests.
In this case I would create a wide table on the fly within the test, take the table, create an update trigger on the faked table and then call tSQLt.AssertEquals within that trigger to make sure the right value is returned by COLUMNS_UPDATED.
You achieve two things with that:
A) you document the expected behavior that your code relies on. That will help future developers that have to maintain your code.
B) You’ll get notified if that behavior changes.
Now, to be clear, there’s no reason to do that with well documented functionality. But if there are documentation gaps, it’s a good practice.

Stored procedure with multiple selects - interaction with client tool?

Suppose I have a stored procedure as follows:
create procedure p_x
as
begin
select 'a','b','c'
select 'c','d','e'
select 'e','f','g'
end
go
This is of course not the real code, but it illustrates enough to be able to ask my questions.
I'm looking for the best performance and the best practices to deal with it.
How will the client tool (eg Informatica Data Quality) calling this procedure react?
Will it receive 3 separate results, just the last query result or all results at once?
Will each separate query be send to the client directly (and will the procedure halt till completed)? or is this done after the procedure finished?
Is it good practice to work this way? I was looking for the exchange of an OUTPUT table type parameter, but this doesn't seem possible if I'm correct (based on other stories)(just as input)
Is there a performance impact in this way? And if so what is the way to do this as efficient as possible (e.g. to just send one result back to the client)
You would be better served by posting your question to the Informatica forums. They should be able to answer your questions precisely and accurately. But I'll give it a go.
How will the tool react? Don't know, but often tools that support using stored procedures as a datasource will assume and will consume a single (and the first) resultset. Any others will be ignored. Go ask in their forums.
Will it receive 3 ...? Roughly the same question and answer as the first.
Will each separate query ...? Your procedure produces three resultsets. How the client consumes them is, again, something you should ask in their forums. The procedure itself will not "halt" waiting for the client to do anything.
Is it good practice...? Not in my opinion. Nor is posting a complete nonsense procedure a useful tool for discussing the pros/cons of this approach. Can it be a useful thing to do? Likely. But it is not often done IME. In addition, you are dealing with a tool with which you are not familiar. The simpler you keep things the better you are off in the long run regardless of your tools.
A procedure is a unit of work and should do one "thing". If it produces multiple resultsets, one can argue that it ceases to do a single thing since, logically, each resultset represents a set of different (even if related) things. And typically one would expect to see some relationship among the resultsets. If there are no relationships, then the resultsets are obviously different things which violates the idea of a procedure. You might want to review the topic of coupling and cohesion. But I think I see a bigger issue - which I'll address with the next item.
Is there a performance impact ...? This can't really be answered. Performance is always, ALWAYS specific to a particular situation (query, schema, etc). Based on that last sentence, I think you have not made the adjustment to thinking in terms of sets - something that is critical to writing efficient sql. Rather, I'll guess that you are thinking in terms of a loop which includes a select statement and each iteration will produce a set of (perhaps 1 but who knows) rows. If you think you have the "option" to produce just one resultset of 3 rows vs. 3 resultsets of 1 row, then you are most likely stuck in RBAR land. Regardless, this can't really be answered. It is also a question for the Informatica people.

Bad practice to have IDs that are not defined in the database?

I am working on an application that someone else wrote and it appears that they are using IDs throughout the application that are not defined in the database. For a simplified example, lets say there is a table called Question:
Question
------------
Id
Text
TypeId
SubTypeId
Currently the SubTypeId column is populated with a set of IDs that do not reference another table in the database. In the code these SubTypeIds are mapped to a specific string in a configuration file.
In the past when I have had these types of values I would create a lookup table and insert the appropriate values, but in this application there is a mapping between the IDs and their corresponding text values in a configuration file.
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Absolutely, yes. It brings in a heavy dependence on the code to manage and maintain references, fetch necessary values, etc. In a situation where you now need to create additional functionality, you would rely on copy-pasting the mapping (or importing them, etc.) which is more likely to cause an issue.
It's similar to why DB constraints should be in the DB rather than in the program/application that's accessing it - any maintenance or new application needs to replicate all the behaviour and rules. Having things this way has similar side-affects I've mentioned here in another answer.
Good reasons to have a lookup table:
Since DBs can generally naturally have these kinds of relations, it would be obvious to use them.
Queries first need to be constructed in code for the Type- and SubType- Text vs ID instead of having them as part of the where/having clause of the query that is actually executed.
Speed/Performance - with the right indexes and table structures, you'd benefit from this (and reduce code complexity that manages it)
You don't need to update your code for to add a new Type or SubType, or to edit/delete them.
Possible reasons it was done that way, which I don't think are valid reasons:
The TypeID and SubTypeID are related and the original designer did not know how to create a complex foreign key. (Not a good reason though.)
Another could be 'translation' but that could also be handled using foreign key relations.
In some pieces of code, there may not be a strict TypeID-to-SubTypeID relation and that logic was handled in code rather than in the DB. Again, can be managed using 'flag' values or NULLs if possible. Those specific cases could be handled by designing the DB right and then working around a unique/odd situation in code instead of putting all the dependence on the code.
NoSQL: Original designer may be under the impression that such foreign keys or relations cannot be done in a NoSQL db.
And the obvious 'people' problem vs technical challenge: The original designer may not have had a proper understanding of databases and may have been a programmer who did that application (or was made to do it) without the right knowledge or assistance.
Just to put it out there: If the previous designer was an external contractor, he may have used the code maintenance complexity or 'support' clause as a means to get more business/money.
As a general rule of thumb, I'd say that keeping all the related data in a DB is a better practice since it removes a tacit dependency between the DB and your app, and because it makes the DB more "comprehensible." If the definitions of the SubTypeIDs are in a lookup table it becomes possible to create queries that return human-readable results, etc.
That said, the right answer probably depends a bit on the specifics of the application. If there's very tight coupling between the DB and app to begin with (eg, if the DB isn't going to be accessed by other clients) this is probably a minor concern particularly if the set of SubTypeIDs is small and seldom changes.

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by user.
The new version of the DB insted has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So, now when the user add a custom property, a new column is added to the current ENTITY_PROPERTY table until the max number of columns (managed by application) is reached, then a new table is created.
So, my question is: Is this a correct way to design a DB structure? Is this the only way to "increase performances"? The old structure required many join or sub-select, but this structute don't seems to me very smart (or even correct)...
I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.
No, it's not. It's terrible.
until the max number of column (handled by application) is reached,
then a new table is created.
This sentence says it all. Under no circumstance should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be like this.
Consider this:
You lose all type-safety as you have to store all values in the column "PROPERTY_VALUE"
Depending on your users, you could have them change the schema beforehand and then let them run some kind of database update batch job, so at least all the properties would be declared in the right datatype. Also, you could lose the entity_id/key thing.
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it
Maybe a RDBMS isn't the right thing for your app. Consider using a key/value based store like MongoDB or another NoSQL database. (http://nosql-database.org/)
From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know how many max custom properties a user might have, I'd say you'd better set the table number of columns to that value.
Then again, I'm not an expert, but making new columns on the fly isn't the kind of operations databases like. It's gonna bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties, or stick with the old system.
I believe creating a new table for each entity to store properties is a bad design as you could end up bulking the database with tables. The only pro to applying the second method would be that you are not traversing through all of the redundant rows that do not apply to the Entity selected. However using indexes on your database on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes and let the database engine determine the best methods for selecting the data rather than separating each entity property into a new table.
There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from entity where STANDARD_PROPERTY_1 = 'banana'.
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound where clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is e`(probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property, and how many columns are too many. The likelihood of entertaining bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.

Find redundant pages, files, stored procedures in legacy applications

I have several horrors of old ASP web applications. Does anyone have any easy ways to find what scripts, pages, and stored procedures are no longer needed? (besides the stuff in "old___code", "delete_this", etc ;-)
Chances are if the stored proc won't run, it isn't being used because nobody ever bothered to update it when sonmething else changed. Table colunms that are null for every single record are probably not being used.
If you have your sp and database objects in source control (and if you don't why don't you?), you might be able to reaach through and find what other code it was moved to production with which should give you a clue as to what might call it. YOu will also be able to see who touched it last and that person might know if it is still needed.
I generally approach this by first listing all the procs (you can get this from the system tables) and then marking the ones I know are being used off the list. Profiler can help you here as you can see which are commonly being called. (But don't assume that because profiler didn't show the proc that it isn't being used, that just gives you a list of the ones to research.) This makes the ones that need to be rearched a much smaller list. Depending on your naming convention it might be relatively easy to see what part of the code should use them. When researching don't forget that procs are called in places other than the application, so you will need to check through jobs, DTS or SSIS packages, SSRS reports, other applications, triggers etc to be sure something is not being used.
Once you have identified a list of ones you don't think you need, share it with the rest of the development staff and ask if anyone knows if the proc is needed. You'll probably get a a couple more taken off the list this way that are used for something specialized. Then when you have the list, change the names to some convention that allows you to identify them as a candidate for deletion. At the same time set a deletion date (how far out that date is depends on how often something might be called, if it is called something like AnnualXYZReport, then make that date a year out). If no one complains by the deletion date, delete the proc (of course if it is in source control you can alawys get it back even then).
Onnce you have gone through the hell of identifying the bad ones, then it is time to realize you need to train people that part of the development process is to identify procs that are no longer being used and get rid of them as part of a change to a section of code. Depending on code reuse, this may mean searching the code base to see if someother part of the code base uses it and then doing the same thing discussed as above, let everyone know it will be deleted on this date, change the name so that any code referncing it will break and then on the date to delete getting rid of it. Or maybe you can have a meta data table where you can put candidates for deletion at the time you know that you have stopped using something and send a report around to everyone once a month or so to determine if anyone else needs it.
I can't think of any easy way to do this, it's just a matter of identifying what might not be used and slogging through.
For SQL Server only, 3 options that I can think of:
modify the stored procs to log usage
check if code has no permissions set
run profiler
And of course, remove access or delete it and see who calls...

Resources