SQL Best Practices for Identity value hard coding - sql-server

First, I know this is a rather subjective question but I need some kind of formal documentation to help me educate my client.
Background - a large enterprise application with hundreds of tables and SP's, all neatly designed with normalized tables and foreign keys using identity columns.
Our client has a few employees writing complex reports in Crystal enterprise using a replicated copy of our production Db.
We have tables that store what I would classify as 'system' base information, such as a list of office locations, or departments within the company, standard set of roles for users, statuses of other objects (open/closed etc), basically data that doesn't change often.
The issue - the report designers and financial analysts are writing queries with hardcoded identity values inside of them. Something like this
SELECT xxx FROM OFFICE WHERE OFFICE_ID = 6
I'm greatly simplifying here, but basically they're using these hard coded int values inside their procedures all over the place.
For SQL developers seeing this will obviously make you facepalm as it's just a built-in instinct not to do this.
However, surprisingly I can't find any documentation or even best practices articles as to why this shouldn't be done.
They would argue it's fine to do this since the values never change, and they're right, within that single system those values won't change, however across multiple environments (staging/QA/Dev) those values can and are absolutely different, making their reporting design approach non-portable and only able to function in 1 isolated server environment.
Do any of the SQL guru's out there have any more in-depth information/articles etc that I can use to help educate my client on why they should avoid this approach?

Seems to me the strongest argument to your report writers is your second to last sentence "...those values can and are absolutely different [between environments]". That would be pretty much the gist of my response to them.
Of course there's always gray area to any question. Identity columns are essentially magic numbers. They have the benefit to the database of being...
Small
Sequential
Fast to seek and join on, sort by and create
...but have the downside of being of completely meaningless, and in effect, randomly assigned (sort the inserts into that table one way, you get a different identity per row than if you sorted the other way). As such, in cases where you have to look up something specific like that, it's common use also include a "business/natural/alternate" key (e.g. maybe (a completely made up example) [CategoryName] where CatgoryName is something short, unique and human readable, while. [CategoryId] is an identity, but not something intended to be sought on)
If you have a website with, say, a dropdown menu, usually the natural key gets put into the visible part of the drop down, and the surrogate/identity key gets passed around on the back end, invisible to the end user.
This gets a little trickier when you have people writing queries directly against the database. If they're owners of the data, they may know things about the larger data structure which they can take advantage of in *cough "clever" ways. If you know the keys wont change and you know what those values are, there might be a case to be made just referencing those. But again, not if they're going to be different when you query a different server.
Of course the flip side is, if you don't want them to use the identity values, you'll have to give them an alternative. And if your tables don't already include a business/natural/alternate key, you're going to have to add one wherever one doesn't already exist.
Also, there's nothing wrong with that alternate key being an integer too (maybe you already have company-wide identifiers for your offices of 1, 2, 3 etc), but the point is that it's deterministic no matter where you run your query.

Related

Bad practice to have IDs that are not defined in the database?

I am working on an application that someone else wrote and it appears that they are using IDs throughout the application that are not defined in the database. For a simplified example, lets say there is a table called Question:
Question
------------
Id
Text
TypeId
SubTypeId
Currently the SubTypeId column is populated with a set of IDs that do not reference another table in the database. In the code these SubTypeIds are mapped to a specific string in a configuration file.
In the past when I have had these types of values I would create a lookup table and insert the appropriate values, but in this application there is a mapping between the IDs and their corresponding text values in a configuration file.
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Absolutely, yes. It brings in a heavy dependence on the code to manage and maintain references, fetch necessary values, etc. In a situation where you now need to create additional functionality, you would rely on copy-pasting the mapping (or importing them, etc.) which is more likely to cause an issue.
It's similar to why DB constraints should be in the DB rather than in the program/application that's accessing it - any maintenance or new application needs to replicate all the behaviour and rules. Having things this way has similar side-affects I've mentioned here in another answer.
Good reasons to have a lookup table:
Since DBs can generally naturally have these kinds of relations, it would be obvious to use them.
Queries first need to be constructed in code for the Type- and SubType- Text vs ID instead of having them as part of the where/having clause of the query that is actually executed.
Speed/Performance - with the right indexes and table structures, you'd benefit from this (and reduce code complexity that manages it)
You don't need to update your code for to add a new Type or SubType, or to edit/delete them.
Possible reasons it was done that way, which I don't think are valid reasons:
The TypeID and SubTypeID are related and the original designer did not know how to create a complex foreign key. (Not a good reason though.)
Another could be 'translation' but that could also be handled using foreign key relations.
In some pieces of code, there may not be a strict TypeID-to-SubTypeID relation and that logic was handled in code rather than in the DB. Again, can be managed using 'flag' values or NULLs if possible. Those specific cases could be handled by designing the DB right and then working around a unique/odd situation in code instead of putting all the dependence on the code.
NoSQL: Original designer may be under the impression that such foreign keys or relations cannot be done in a NoSQL db.
And the obvious 'people' problem vs technical challenge: The original designer may not have had a proper understanding of databases and may have been a programmer who did that application (or was made to do it) without the right knowledge or assistance.
Just to put it out there: If the previous designer was an external contractor, he may have used the code maintenance complexity or 'support' clause as a means to get more business/money.
As a general rule of thumb, I'd say that keeping all the related data in a DB is a better practice since it removes a tacit dependency between the DB and your app, and because it makes the DB more "comprehensible." If the definitions of the SubTypeIDs are in a lookup table it becomes possible to create queries that return human-readable results, etc.
That said, the right answer probably depends a bit on the specifics of the application. If there's very tight coupling between the DB and app to begin with (eg, if the DB isn't going to be accessed by other clients) this is probably a minor concern particularly if the set of SubTypeIDs is small and seldom changes.

When is it appropriate to use UUIDs for a web project?

I'm busy with the database design of a new project, and I'm not sure whether to use UUIDs or normal table-unique auto-increment ids.
Up to now, the sites I've built have all run on a single server, and very heavy traffic has never been too much of a concern. However, this web application will eventually run concurrently on multiple servers, serve an API, and need to process thousands of requests per second, and I want to make sure that the design I choose now doesn't cripple any of those possibilities later.
I have my suspicions, of course, and they should be clear through the way I phrased my question, but I would like to hear from those with more experience what trouble I can run into later if I do or don't have UUIDs, and what I should really be basing my decision on.
So, in short: What are the considerations I should give into deciding whether or not to use UUIDs for all database models, so that any one object can be identified uniquely by one string, and when is it appropriate to use this as the primary key, instead of table-by-table auto-increment?
Note: I've seen this question (When are you truly forced to use UUID as part of the design?), and read all the answers, but they mostly answer "How rarely do UUIDs collide", instead of "When is it appropriate to use them".
One consideration that I've used when deciding on UUIDs vs. auto-increment ids is whether they're going to be user-visible, and if so, whether I want users to know how many I have of that table. For example, if I didn't want to make public the number of registered users my site has, I wouldn't assign auto-increment user ids.
And to address one other specific point you raised, it's still possible to use auto-incrementing ids with multiple servers (though not with the built-in MySQL). You just need to start all the ids at different offsets, and increment accordingly. That is, if you had 3 servers, you could start server A at 1, server B at 2, and server C at 3, and then increment the ids by 10 each time instead of 1. That way, you could guarantee no collisions.
And finally, the last thing I consider is how important performance is to my application. Integers are much more easily indexed than UUIDs that are string-based, so indexes are smaller, more quickly searched, etc.
UUID's or GUID's can be very useful especially for the web. If you use auto-increment values to store UserId anyone can view the source of your web pages and see the simplicity of it's use. They could then try any integer value to get data they are not supposed to see.
GUID's are not created in any sequential format, therefore if you create them one right after the other, there sequence can not easily be guessed.
I don't think it's necessary to use GUID's for simple lookup type data such as ColorId 1=Blue, 2=Red, 3=Green.
GUID's are also very useful for session and state management.
That's my $0.02

Adding relations to an Access Database

I have an MS Access database with plenty of data. It's used by an application me and my team are developing. However, we've never added any foreign keys to this database because we could control relations from the code itself. Never had any problems with this, probably never will either.
However, as development has developed further, I fear there's a risk of losing sight over all the relationships between the 30+ tables, even though we use well-normalized data. So it would be a good idea go get at least the relations between the tables documented.
Altova has created DatabaseSpy which can show the structure of a database but without the relations, there isn't much to display. I could still use to add relations to it all but I don't want to modify the database itself.
Is there any software that can analyse a database by it's structures and data and then do a best-guess about its relations? (Just as documentation, not to modify the database.)
This application was created more than 10 years ago and has over 3000 paying customers who all use it. It's actually document-based, using an XML document for it's internal storage. The database is just used as storage and a single import/export routine converts it back and to XML. Unfortunately, the XML structure isn't very practical to use for documentation and there's a second layer around this XML document to expose it as an object model. This object model is far from perfect too, but that's what 10 years of development can do to an application. We do want to improve it but this takes time and we can't disappoint the current users by delaying new updates.Basically, we're stuck with its current design and to improve it, we need to make sure things are well-documented. That's what I'm working on now.
Only 30+ tables? Shouldn't take but a half hour or an hour to create all the relationships required. Which I'd urge you to do. Yes, I know that you state your code checks for those. But what if you've missed some? What if there are indeed orphaned records? How are you going to know? Or do you have bullet proof routines which go through all your tables looking for all these problems?
Use a largish 23" LCD monitor and have at it.
If your database does not have relationships defined somewhere other than code, there is no real way to guess how tables relate to each other.
Worse, you can't know the type of relationship and whether cascading of update and deletion should occur or not.
Having said that, if you followed some strict rules for naming your foreign key fields, then it could be possible to reconstruct the structure of the relationships.
For instance, I use a scheme like this one:
Table Product
- Field ID /* The Unique ID for a Product */
- Field Designation
- Field Cost
Table Order
- Field ID /* the unique ID for an Order */
- Field ProductID
- Field Quantity
The relationship is easy to detect when looking at the Order: Order.ProductID is related to Product.ID and this can easily be ascertain from code, going through each field.
If you have a similar scheme, then how much you can get out of it depends on how well you follow your own convention, but it could go to 100% accuracy although you're probably have some exceptions (that you can build-in your code or, better, look-up somewhere).
The other solution is if each of your table's unique ID is following a different numbering scheme.
Say your Order.ID is in fact following a scheme like OR001, OR002, etc and Product.ID follows PD001, PD002, etc.
In that case, going through all fields in all tables, you can search for FK records that match each PK.
If you're following a sane convention for naming your fields and tables, then you can probably automate the discovery of the relations between them, store that in a table and manually go through to make corrections.
Once you're done, use that result table to actually build the relationships from code using the Database.CreateRelation() method (look up the Access documentation, there is sample code for it).
You can build a small piece of VBA code, divided in 2 parts:
Step 1 implements the database relations with the database.createrelation method
Step 2 deleted all created relations with the database.delete command
As Tony said, 30 tables are not that much, and the script should be easy to set. Once this set, stop the process after step 1, run the access documenter (tools\analyse\documenter) to get your documentation ready, launch step 2. Your database will then be unchanged and your documentation ready.
I advise you to keep this code and run it regularly against your database to check that your relational model sticks to the data.
There might be a tool out there that might be able to "guess" the relations but I doubt it. Frankly I am scared of databases without proper foreign keys in particular and multi user apps that uses Access as a DBMS as well.
I guess that the app must be some sort of internal tool, otherwise I would suggest that you move to a proper DBMS ( SQL Express is for free) and adds the foreign keys.

Obfuscate a SQL Server Db schema

When posting example code or filing bug reports based on a real production app, it would be helpful to have some way to change the table and column names to not potentially give away information about the internals of the app. Doing it by hand without breaking things is time consuming. Does anything automatic exist? Ideally it would use real English words so they are more easily referred to than random text strings.
As long as you don't use real data, I don't see what the issue is. Most apps are fairly obvious based on the requirements. ie CRM system = (customer name, address, etc...) or (customer name, addressid, etc.. with some address table with parts of the address, etc...). By knowing your schema I have no idea how you implement your app. Generally without the stored procedures/program code it would be hard to steal any intellectual property. Even if you were the NSA or something (InternetIP, PacketHeadingID, PacketDetailID, TimeStampID). Even with the structure of the tables I still would have no information on how your system to log all the internet traffic actually works. I also wouldn't know anything that is logged.
I don't know of anything off hand to do what you are requesting, but I would think it is fairly easy to write a script to do it on your own. Look at the table columns and datatypes and call text columns "TextColumn1", int columns "IntColumn2", etc. and build a table of substitutions, then perform the substitutions globally in the script file. I would think this is a fairly easy Python/Perl/PowerShell/Ruby/VbScript program.
I agree that there's no real need to do so, but if you feel that way, take a look at anonymizers, usually used to protect the data and not the schemas, but you could easily apply those approaches to schemas as well.
See this paper (which is the description of this framework) especially page 8 an onwards for different anonymization methods, although replacing column names for static strings might probably be good enough anyway.

Default database IDs; system and user values

As part of our current database work, we are looking at a dealing with the process of updating databases.
A point which has been brought up recurrently, is that of dealing with system vs. user values; in our project user and system vals are stored together. For example...
We have a list of templates.
1, <system template>
2, <system template>
3, <system template>
These are mapped in the app to an enum (1, 2, 3)
Then a user comes in and adds...
4, <user template>
...and...
5, <user template>
Then.. we issue an upgrade.. and insert as part of our upgrade scripts...
<new id> [6], <new system template>
THEN!!... we find a bug in the new system template and need to update it... The problem is how? We cannot update record using ID6 (as we may have inserted it as 9, or 999, so we have to identify the record using some other mechanism)
So, we've come to two possible solutions for this.
In the red corner (speed)....
We simply start user Ids at 5000 (or some other value) and test data at 10000 (or some other value). This would allow us to make modifications to system values and test them up to the lower limit of the next ID range.
Advantage...Quick and easy to implement,
Disadvantage... could run out of values if we don't choose a big enough range!
In the blue corner (scalability)...
We store, system and user data separately, use GUIDs as Ids and merge the two lists using a view.
Advantage...Scalable..No limits w/regard to DB size.
Disadvantage.. More complicated to implement. (many to one updatable views etc.)
I plump squarely for the first option, but looking for some ammo to back me up!
Does anyone have any thoughts on these approaches, or even one(s) that we've missed?
I have never had problems (performance or development - TDD & unit testing included) using GUIDs as the ID for my databases, and I've worked on some pretty big ones. Have a look here, here and here if you want to find out more about using GUIDs (and the potential GOTCHAS involved) as your primary keys - but I can't recommend it highly enough since moving data around safely and DB synchronisation becomes as easy as brushing your teeth in the morning :-)
For your question above, I would either recommend a third column (if possible) that indicates whether or not the template is user or system based, or you can at the very least generate GUIDs for system templates as you insert them and keep a list of those on hand, so that if you need to update the template, you can just target that same GUID in your DEV, UAT and /or PRODUCTION databases without fear of overwriting other templates. The third column would come in handy though for selecting all system or user templates at will, without the need to seperate them into two tables (this is overkill IMHO).
I hope that helps,
Rob G
I recommend using the second with the modification that you store the system and user values in one table. GUID is quite reliable in this manner.
Another idea: use any text-based ID (not necessary GUID), which you give for the system values and is generated by a random string or a string based on some kind of custom logic for the user values.
Another idea: use the first approach, but extend the table with a flag which shows if a value is system or user. Maybe this is the easiest. Ok, you have to write some kind of mechanism to update the correct system value, but it can be done easily.
+1 for Biri's text based ID - define a "template_mnemonic" text based column and make it the primary key. This will be a known value when you insert it as you, the developers will have decided on it (or auto-generated it) and you will always be able to reference a template by its mnemonic regardless of how many user specified templates there are. It also allows users to have a meaningful naming convention for their templates.
Maybe I didn't get it, but couldn't you use GUIDs as Ids and still have user and system data together? Then you can access the system data by the (non-changable) GUIDs.
I don't think that GUID should make any problem.
If you want to avoid it, then use a flag:
ID int
template whatever
flag enum/int/bool
Flag shows whether the actual value is a system or a user value.
If you would like to update a system value, then ask only for system values ordered by ID, and it will show you actual order of insertion (you should have a bigint or something for ID to make sure that it doesn't get full and it doesn't get the deleted IDs back to work). With this list the x. record is the x. inserted system value.
I think there is a better third solution.
It strikes me that you're storing two different things in the same table and that you might be better off creating 2 separate tables one for user templates and one for system templates. You might then be able to create a view over the two tables to make them appear as a single object to your application.
Obviously I don't have full knowledge of your application and this may be impossible for you for any number of reasons but I think it's a neater solution than GUIDs and way safer than ranges of IDs (seriously don't do ID ranges it'll bite you one day)

Resources