good database design: enum values: ints or strings?

I have a column in a table that will store an enum value, e.g. Large, Medium, Small, or the days of the week. This will correspond to text displayed on a web page or a user selection from a drop-down list. What is the best design?
Store the values as an int, and then perhaps have a table that maps each int to its corresponding string.
Just store the values in the column as a string, to make queries a little more self-explanatory.
At what point/quantity of values is it best to use ints rather than strings?
Thanks.

Assuming your RDBMS of choice doesn't have an ENUM type (which handles this for you), I think it's best to use ids instead of the strings directly whenever the values can change (either in value or in quantity).
You might think that days of the week won't change, but what if your application needs to add internationalization support? (or an evil multinational corporation decides to rename them after taking control of the world?)
Also, that Large, Medium and Small categorization will probably change after a while. Most values you think cannot change will change eventually.
So, mainly to anticipate change, I think it's best to use ids: you just need to change the translation table and everything keeps working painlessly. For i18n, you can simply expand the translation table and pull the proper records automatically.
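As a rough illustration of that translation table (the table and column names here are just hypothetical), one row per value and locale is enough:
CREATE TABLE size_label (
    size_id  INT         NOT NULL,  -- the id stored in the main table
    locale   CHAR(5)     NOT NULL,  -- e.g. 'en-US', 'fr-FR'
    label    VARCHAR(50) NOT NULL,  -- the text shown to the user
    PRIMARY KEY (size_id, locale)
);
-- pull the right wording for the current user's locale
SELECT label FROM size_label WHERE size_id = 2 AND locale = 'fr-FR';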
Most likely (it'll depend on various factors) ints are going to perform better, at the very least in the amount of required storage. But I wouldn't do ints for performance reasons, I'd do ints for flexibility reasons.

This is an interesting question. You definitely have to take performance targets into consideration here. If you want to go for speed, int is a must; a database can index integers a bit better than strings, although I must say the performance loss is not that bad at all.
One example is the Oracle database itself, where they have the luxury of using all-caps string enums in their system tables. Things like USER_ALLOCATION_TYPE are the norm there. It's like you say, strings can be more "extensible" and more readable, but in any case in the code you will end up with:
static final String USER_ALLOCATION_TYPE = "USER_ALLOCATION_TYPE";
in place of
static final int USER_ALLOCATION_TYPE = 5;
because if you don't, you will end up with string literals scattered around that are just aching for someone to go there and misplace a char! :)
In my company we use tables with integer primary keys; every table has a serial primary key, because even if you don't think you need one, sooner or later you'll regret not having one.
In the case you are describing, what we do is have a table with (PK Int, Description String) and then define views over the master tables, with joins, to get the descriptions. That way we can see the joined description fields when we need to, and we keep the performance up.
Also, with a separate description table you can attach EXTRA information to those ids that you would never otherwise think about. For example, let's say a user can have access to some entries in the combo box if and only if they have a certain property. You could use extra fields in the description table to store that, in place of ad-hoc code.
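A minimal sketch of that layout (all table and column names here are made up, and the master table is assumed to exist already):
CREATE TABLE allocation_type (
    id          INT         NOT NULL PRIMARY KEY,
    description VARCHAR(50) NOT NULL,
    admin_only  INT         NOT NULL DEFAULT 0  -- example of "extra" info attached to an id
);
CREATE VIEW user_allocation_v AS
SELECT ua.id, ua.allocation_type_id, t.description
FROM   user_allocation ua
JOIN   allocation_type t ON t.id = ua.allocation_type_id;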
My two cents.

Going with your first example, let's say you create a lookup table, Sizes. It has the following columns:
Id - primary key + identity
Name - varchar / nvarchar
You'd have three rows in the table, Small, Medium and Large, with Id values 1, 2 and 3 if you inserted them in that order.
If you have another table that uses those values, you can use the identity value as the foreign key... or you could create a third column holding a shorthand value for the three sizes. It would have the values S, M and L, and you could use that as the foreign key instead; you'd have to create a unique constraint on the column.
As far as the dropdown, you could use either one as the value behind the scenes.
You could also make the S/M/L value the primary key itself.
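For illustration only (SQL Server flavour, since identity columns were mentioned; the referencing Products table is hypothetical):
CREATE TABLE Sizes (
    Id   INT IDENTITY(1,1) PRIMARY KEY,
    Name NVARCHAR(20) NOT NULL,
    Code CHAR(1) NOT NULL CONSTRAINT UQ_Sizes_Code UNIQUE  -- 'S', 'M', 'L'
);
INSERT INTO Sizes (Name, Code) VALUES ('Small', 'S'), ('Medium', 'M'), ('Large', 'L');
-- a referencing table can key on either the identity value or the shorthand code
CREATE TABLE Products (
    ProductId INT IDENTITY(1,1) PRIMARY KEY,
    SizeCode  CHAR(1) NOT NULL REFERENCES Sizes (Code)
);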
For your other question, about when it's best to use ints vs strings: there is probably a lot of debate on the subject. A lot of people only like using identity values as their primary keys; other people say it's better to use a natural key. If you are not using an identity as the primary key, then it's just important to make sure you have a good candidate for the primary key (making sure it will always be unique and that the value does not change).

I too would be interested in people's thinking on this. I've always gone the route of storing the enum in a lookup table, and then in any data tables that referenced the enum I would store the ID and use an FK relationship. In a certain way I still like that approach, but there is something plain and simple about putting the string value directly in the table.
Going purely by size, an int is 4 bytes, whereas the string is n bytes (where n is the number of characters). The shortest value in your lookup is 5 characters and the longest is 6, so storing the actual value would use up more space eventually (if that were a problem).
Going by performance, I'm not sure if an index on an int or on a varchar would return any difference in speed / optimisation / index size?

Related

Need help to decide about enum implementation for SQL Server in connection with C#

I'm looking for some more insight on implementing enums with SQL Server. SQL Server has no direct enum support and C# does not have string enums. For SQL Server the main options are CHECK constraints and lookup tables. For enums that don't change (e.g. On, Off), CHECK constraints work. For enums that will have new members added (e.g. category, type), lookup tables don't require changes to the database definition; new values can simply be added to the lookup table. Lookup table data can also serve to fill comboboxes in the frontend, and lookup tables can additionally hold a sort order if needed.
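For illustration, the two options might look roughly like this in T-SQL (all table and column names here are hypothetical):
-- option 1: CHECK constraint, fine for values that never change
CREATE TABLE Device (
    DeviceId   INT IDENTITY(1,1) PRIMARY KEY,
    PowerState VARCHAR(3) NOT NULL
        CONSTRAINT CK_Device_PowerState CHECK (PowerState IN ('On', 'Off'))
);
-- option 2: lookup table, new members only need an INSERT
CREATE TABLE CategoryType (
    CategoryTypeId TINYINT PRIMARY KEY,
    Name           VARCHAR(50) NOT NULL UNIQUE,
    SortOrder      INT NOT NULL DEFAULT 0
);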
Most recommendations advocate a tinyint field as the primary and foreign key, for efficiency and easy mapping to C# enums. The major drawback is that, without joining each enum to its lookup table, the records are not human-readable. I favor natural keys where possible: they are easier to read and understand and less error-prone. Our database holds a lot of tables with many enums, so I don't feel comfortable using lots of integer keys for enums, since 1, 2, 3, ... will mean something different in each one.
I see two solutions:
Use a char(1) field as the key. The character code can be used in C# as the enum value. The problem is that sometimes enum members start with the same letter, and a different, non-intuitive letter has to be used, which negates the readability. It would still be better than numbers.
Use a varchar field of appropriate length and store the full value (e.g. AddressType: 'Home', 'Work', 'Shipping'). This provides optimal readability, and the lookup table still ensures integrity. In C# the enum's integer values are only used internally; for database operations the enum name can be used.
The drawback is increased space in the database. Considering the amount of other data, the extra characters per record are negligible. The join would cost more, but it is only needed on inserts and updates for the integrity check. The lookup tables can still feed combobox values.
The main drawback of both solutions is that if changing an enum required changing the key for readability, all affected records would need to be changed. Since enums usually don't change, I don't see this as a practical problem; if enums change substantially due to an application update, there is normally more to update anyway.
For our case I favor giving up easy updates in exchange for readability. Is there anything I'm overlooking when using char or varchar as a key? If not, is there anything else that makes option 1 or 2 the better choice?
Thanks for any help.

Is storing a value many times considered a normal form failure?

If you store a user's religion in a "User Table", so that looking down the column you would see "Christian" many times, "Muslim" many times, etc., is that considered a failure of a normal form? Which form?
The way I see it:
1NF: There are no repeating columns.
2NF: There is no concatenated primary key, so this does not apply.
3NF: There is no dependency on a nonkey attribute.
Storing user religion this way does not seem to fail any normal form, however it seems very inefficient. Comments?
Your design supports all normal forms. It's fine that your attribute has a string value. The size of the data type is irrelevant for normalization.
The goal of normalization is not physical storage efficiency -- the goal is to prevent anomalies. And to support logical efficiency, i.e. store a given fact only once. In this case, the fact that the user on a given row is Christian.
The principal disadvantage of storing the column in that manner is the storage space required as the number of rows scales up.
Rather than a character column, you could use an ENUM() if you have a fixed set of choices that will rarely, if ever, change, and still avoid creating an additional table of religion options to which this one has a foreign key. However, if the choices will be fluid, normalization rules would prefer that the choices be placed into their own table with a foreign key in your user table.
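A minimal sketch in MySQL, assuming a users table like the one in the UPDATE examples below (the value list is only a placeholder):
ALTER TABLE users
    ADD COLUMN religion ENUM('Christianity', 'Islam', 'Judaism', 'Other') NOT NULL DEFAULT 'Other';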
There are other advantages besides storage space to keeping them in another table. Modifying them is a snap: to change Christian to Christianity, rather than running the potentially expensive (if you have lots of rows and religion is not indexed)
UPDATE users SET religion='Christianity' WHERE religion='Christian'
... you can do the much simpler and cheaper
UPDATE religions SET name='Christianity' WHERE id=123
Of course, you also enforce data integrity by keying against a religions table. It becomes impossible to insert an invalid value like the misspelled Christain.
I'm assuming that there's a list of valid religions; if you've just got the user entering their own string, then you have to store it in the user table and this is all moot.
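A sketch of that lookup-table arrangement (the table names match the UPDATE examples above; the columns and the religion_id foreign key are assumed for illustration):
CREATE TABLE religions (
    id   INT         NOT NULL PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE
);
CREATE TABLE users (
    id          INT NOT NULL PRIMARY KEY,
    -- ... other user columns ...
    religion_id INT REFERENCES religions (id)  -- an invalid or misspelled value simply cannot be inserted
);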
Assume that religions are stored in their own table. If you're following well-established practices, this table will have a primary key which is an integer, and all references to entries in the table in other tables (such as the user table) will be foreign keys. The string method of storing religion doesn't violate any normal form (since the name of a religion is a candidate key for the religion table), but it does violate the practice of not using strings as keys.
(This is an interesting difference between the theory and practice of relational algebra. In theory, a string is no different from an integer; they're both atomic mathematical values. In practice, strings have a lot of overhead that leads programmers not to use them as keys.)
Of course, there are other ways (such as ENUM for some RDBMSes) of storing a list of possible values, each with their own advantages and disadvantages.
Your normal forms are a little awry. Second normal form says that the rest of the row must depend on "the whole key". Third normal form says that the rest of the row must depend on "nothing but the key". (So help me Codd.)
No, your situation as described does not violate any of the first three normal forms. (It might violate the sixth, depending on other factors).
There are a few cons with this approach (compared to using a foreign key) that you will need to make sure you are ok with.
1 - wastes storage.
2 - slower to query by religion
3 - someone might put data in there that doesn't match, e.g. manually insert "Jedi" or something you might not consider correct
4 - there's no way to have a list of possible religions (e.g. if there is no one of a certain religion, say Zoroastrian, in your table, but you still want it to be a valid possibility)
5 - incorrect capitalization might cause problems
6 - white space around the string might cause problems
The main pro with this technique is the data is quicker to pull out (no joining on a table) and it is also quicker for a human to read.

Use of specifying lengths for surrogate keys

In one of my database class assignments, I wrote that I specifically didn't assign lengths to my NUMBER columns acting as surrogate keys since it would unnecessarily limit the number of records able to be stored in the table, and because there is literally no difference in performance or physical storage between NUMBER(n) and NUMBER.
My professor wrote back that it would be technically possible but "impractical" for large databases, and that most DBAs in real-life situations would not do that.
There is no difference whatsoever between NUMBER(n) and NUMBER as far as physical storage or performance goes, and thus no reason to specify a length for a NUMBER-based surrogate key column. Why does this professor think that using NUMBER alone would be "impractical"?
In my experience, most production DBAs in real life would likely do as you suggested and declare key columns as NUMBER rather than NUMBER(n).
It would be worthwhile to ask the professor what makes this approach impractical in his or her opinion. There are a couple of possibilities I can think of:
Assuming that you are using a data modeling tool to design your schema, a reasonable tool will ensure that the data type of a key is the same in the table where it is defined as a primary key and in the child table where it appears as a foreign key. If you specify a length for the primary key, it would be impractical to force the tool to generate foreign keys without that length limit. Of course, the counter to this is that you can just declare both the primary and foreign key columns as NUMBER.
DBAs tend to be extremely picky (and I mean this as a compliment). They like to see everything organized "just so". Adding a length qualifier to a field, whether it is a NUMBER or a VARCHAR2, serves as an implicit constraint that ensures incorrect data does not get stored. Ideally, you would know when you are designing a table a reasonable upper bound on the number of rows you'll insert over the table's lifetime (i.e. if your PERSON table ended up with more than 10 billion rows, something would likely be seriously wrong). Applying length constraints to numeric columns demonstrates to the DBA that you've done this sort of analysis.
Today, however, that is rather unlikely to actually happen, at least with respect to numeric columns, both because it is more in keeping with waterfall planning methods that would involve that sort of detailed design discussion, and because people are less concerned with the growth analysis that would traditionally have been done at the same time. If you were designing a database schema 20 or 30 years ago, it wouldn't be uncommon to provide the DBA with a table-by-table breakdown of the projected size of each table at go-live and over the next few years. Today, it's more cost-effective to potentially spend more on disk than to invest the time to do that analysis up front.
It would probably be better, from a readability and self-documentation standpoint, to limit what can be stored in the column to the numbers that are expected. I would agree that I don't see how it's impractical.
From this thread about NUMBER:
number(n) is an edit -- restricting the number to n digits in length.
if you store the number '55', it takes the same space in a number(2)
as it does in a number(38).
the storage required = function of the number actually stored.
Left to my own devices, I would declare surrogate primary keys as NUMBER(38) on Oracle instead of NUMBER, and possibly add a check constraint to make the key > 0, primarily to serve as documentation to outside systems about what they can expect in the column and what they need to be able to handle.
In theory, when building an application that reads the surrogate primary key, seeing NUMBER means one needs to handle the full floating-point range of NUMBER, whereas NUMBER(38) means the application needs to handle an integer of up to 38 digits.
If I were working in an environment where all the front ends were going to use a 32-bit integer for surrogate keys, I'd define it as NUMBER(10) with an appropriate check constraint.
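In Oracle terms, that might look something like this (the table name is purely illustrative):
CREATE TABLE person (
    person_id NUMBER(38) NOT NULL,
    -- ... other columns ...
    CONSTRAINT person_pk    PRIMARY KEY (person_id),
    CONSTRAINT person_id_ck CHECK (person_id > 0)
);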

DATABASE DESIGN - Primary key for COUNTRY, CURRENCY int or varchar

For my country table, I used the country code as the primary key: "AU, US, UK, FR" etc.
For my currency table, I used the currency code as the primary key: "AUD, GBP, USD" etc.
I think what I did is OK, but another developer wants me to change all the primary keys to an int, because he says the country code or currency code might change sometime in the future. We just don't know that, and in that sense he is right: his path is the safest one to take.
Should I change the primary keys to an int to be safe rather than be sorry? Can't I just keep it?
I would use the ISO codes with char columns.
If a country ever splits then you'd get new ISO codes (say SC, WL, EN) but UK will still be valid for historic data.
It's the same for currency. A transaction in 2000 would be in the currency at that time: French Francs, Deutschmarks, Belgium Banana but not Euro.
I would say the "birth of a nation" or the disappearance of a currency is, overall, a rather rare occurrence, not likely to happen several times a year, every year.
So in this regard, I would think using the ISO defined country and currency codes for your primary key should be OK.
Yes, if something happens to the Euro zone, or if another country is split into two, you might have to do some manual housekeeping - but you'd have to do this with an INT as well. In a case like this, I would argue that an artificial surrogate key (like such an INT) really only adds overhead and doesn't really help keep things easier/more explicit.
Since those codes are really short, and typically all the same length, I would however recommend using a CHAR(3) or CHAR(5); there's no point in using VARCHAR for such a short string, and VARCHAR, being a variable-length type, behaves quite differently (and not "better" in terms of performance) than fixed-length types like INT or CHAR.
From a logical point of view, adding a surrogate means extra columns, additional key constraints and more complex logic to query and manipulate the data. That's one thing to consider.
From a physical standpoint, in SQL Server an INTEGER key will take up more space than a CHAR(2) or CHAR(3) key (4 bytes versus 2 or 3). That means your referencing tables and indexes get larger. It also makes any updates to those foreign key values more expensive. I don't know your data, but it seems quite possible that the referencing data in those foreign key columns would be updated much more frequently than the country code and currency code values in the parent table. In contrast, the ISO codes for currency and country almost never change, so that is probably very little to worry about. By changing to INTEGER keys you could very well increase the cost of updating those foreign key values.
If you are considering such a change as a performance optimisation then I suggest you evaluate very carefully whether INTEGER keys will make updates of those values more costly or less costly. I suggest you ignore people who say "always do X". Dogma is no help in database design. Evaluate what the real impact will be in practice and make your decision accordingly.
I think your system will become obsolete ten times over before the ISO standards for country and currency codes do.
So I really don't see any benefit in using 01010101 01010011 or 21843 instead of "US".
So long as any foreign keys that reference these primary keys are declared with ON UPDATE CASCADE, who cares if these codes change?
There's an additional benefit in querying any of the referencing tables - if all you need is the country/currency code, then there's no need to join to these tables - you've already got the code included in these tables.
And if you do decide to move to an INT surrogate, please remember to still place a unique constraint on these columns: they are a real key for these tables.
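A sketch of the natural-key approach combined with the ON UPDATE CASCADE idea above (the payment table and its columns are made up for illustration):
CREATE TABLE currency (
    currency_code CHAR(3) NOT NULL PRIMARY KEY,  -- 'AUD', 'GBP', 'USD', ...
    name          VARCHAR(50) NOT NULL
);
CREATE TABLE payment (
    payment_id    INT NOT NULL PRIMARY KEY,
    amount        DECIMAL(18,2) NOT NULL,
    currency_code CHAR(3) NOT NULL
        REFERENCES currency (currency_code) ON UPDATE CASCADE
);
-- no join needed just to show the currency code on a payment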
I would use INT ids as the key instead of ISO codes, and here is why:
The organization I worked for uses its "own currency" (LBP): when a user performs some transaction, he receives some amount of LBP as a bonus. He can then exchange those LBPs for USD, EUR, etc. and vice versa, pay for services with LBPs, and so on. Also, I didn't find the BTC (Bitcoin) currency in the ISO standard.
Yes, these are not official currencies, but from the system's and the users' point of view it is more flexible to treat them as currencies rather than as an additional product which the user can buy and sell.
The organization I worked for does not use INTs as primary keys; they use ISO codes as ids (plus those additional currencies).
Officially, LBP is the ISO code for the Lebanese Pound, so they will not be able to add the Lebanese Pound to the system smoothly.
If you identify your currencies by code, and in the future some new currency is registered in the ISO standard (say LBE, or BTC), then those currencies will conflict with "your" currencies.
Somebody mentioned here that an additional int key for currencies means an additional index.
But excuse me, is that a problem for 300 records (the approximate number of currencies)? Moreover, using INTs as the primary key for currencies has an additional benefit: imagine a table with 1M transactions holding amounts and currencies; which is more efficient, INTs or CHARs?
So I would go for INTs.
Yes, changing to an integer key would be a good idea before it's too late.
E.g. what if Great Britain joins the Euro-zone?
It is poor practice to use something that changes as a primary key. Suppose the value changed and you then had to update all the child records: doing so could lock up your database for hours or even days. This is why an integer FK, with a unique index on the natural key, is the better practice for information that is volatile.

Is it ok to use character values for primary keys?

Is there a performance gain or best practice when it comes to using unique, numeric ID fields in a database table compared to using character-based ones?
For instance, if I had two tables:
athlete
id    name                teamid
17    Rickey Henderson    28
team
teamid    teamname
28        Oakland
The athlete table, with thousands of players, would be easier to read if the teamid was, say, "OAK" or "SD" instead of "28" or "31". Let's take for granted the teamid values would remain unique and consistent in character form.
I know you CAN use characters, but is it a bad idea for indexing, filtering, etc for any reason?
Please ignore the normalization argument as these tables are more complicated than the example.
I find primary keys that are meaningless numbers cause fewer headaches in the long run.
Text is fine, for all the reasons you mentioned.
If the string is only a few characters, then it will be nearly as small as an integer anyway. The biggest potential drawback to using strings is the size: database performance is related to how many disk accesses are needed. Making the index twice as big, for example, could create disk-cache pressure and increase the number of disk seeks.
I'd stay away from using text as your key. What happens in the future when you want to change the team ID for some team? You'd have to cascade that key change all through your data, which is exactly what a primary key lets you avoid. Also, though I don't have any empirical evidence, I'd think an INT key would be significantly faster than a text one.
Perhaps you can create views for your data that make it easier to consume, while still using a numeric primary key.
I'm just going to roll with your example. Doug is correct when he says that text is fine. Even for a medium-sized (~50 GB) database, having a 3-letter code as a primary key won't kill the database. If it makes development easier, reduces joins on the other table, and it's a field that users would be typing in... I say go for it. Don't do it if it's just an abbreviation that you show on a page, or because it makes the athletes table look pretty. I think the key question is: "Is this a code that the user will type in and not just pick from a list?"
Let me give you an example of when I used a text column for a key. I was making software for processing medical claims. After the claim got all digitized a human had to look at the claim and then pick a code for it that designated what kind of claim it was. There were hundreds of codes...and these guys had them all memorized or crib sheets to help them. They'd been using these same codes for years. Using a 3 letter key let them just fly through the claims processing.
I recommend using ints or bigints for primary keys. Benefits include:
This allows for faster joins.
Having no semantic meaning in your primary key allows you to change the fields with semantic meaning without affecting relationships to other tables.
You can always have another column to hold a team code or something for "OAK" and "SD".
The standard answer is to use numbers because they are faster to index; no need to compute a hash or whatever.
If you use a meaningful value as a primary key, you'll have to update it all through your database if the team name changes.
To satisfy the above, but still make the database directly readable,
use a number field as the primary key
immediately create a view Athlete_And_Team that joins the Athlete and Team tables
Then you can use the view when you're going through the data by hand.
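For example, the view could be as simple as this (column names taken from the tables sketched in the question):
CREATE VIEW Athlete_And_Team AS
SELECT a.id, a.name, t.teamid, t.teamname
FROM   athlete a
JOIN   team    t ON t.teamid = a.teamid;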
Are you talking about your primary key or your clustered index? Your clustered index should be the column you will most often use to uniquely identify a row. It also defines the logical ordering of the rows in your table. The clustered index will almost always be your primary key, but there are circumstances where they can be different.
