Is it best to store the enum value or the enum name in a database table field?
For example, should I store 'TJLeft' as a string or its equivalent numeric value in the database?
Public Enum TextJustification
TJLeft
TJCenter
TJRight
End Enum
I'm currently leaning towards the name, since someone could come along later and explicitly assign a different numeric value.
Edit -
Some of the enums are under my control but some are from third parties.
Another reason to store the numeric value is if you're using the [Flags] attribute on your enumeration, in cases where you may want to allow multiple enumeration values. Say, for example, you want to let someone pick which days of the week they're available for something...
[Flags]
public enum WeekDays
{
Monday=1,
Tuesday=2,
Wednesday=4,
Thursday=8,
Friday=16
}
In this case, you can store the numeric value in the db for any combination of the values (for example, 3 == Monday and Tuesday)
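For instance, here is a small C# sketch of that round trip, reusing the WeekDays enum above (the database access itself is omitted):
// Combine flags and take the numeric value for the int column.
WeekDays available = WeekDays.Monday | WeekDays.Tuesday;
int toStore = (int)available;                        // 3
// Later, after reading the integer back from the database:
WeekDays fromDb = (WeekDays)toStore;                 // Monday | Tuesday
bool worksMondays = fromDb.HasFlag(WeekDays.Monday); // true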
I always use lookup tables consisting of the fields
OID int (pk) as the numeric value
ProgID varchar (unique) as the value's identifier in C# (i.e. const name, or enum symbol)
ID nvarchar as the display value (UI)
dbscript lets me generate C# code from my lookup tables, so my code is always in sync with the database.
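For a rough idea of what such generated code might look like (the table, enum, and member names here are made up, and dbscript's real output will differ), each row's OID becomes the numeric value, ProgID the enum symbol, and ID the display text:
// Generated from a hypothetical OrderStatus lookup table.
public enum OrderStatus
{
    [System.ComponentModel.Description("Open")]
    Open = 1,
    [System.ComponentModel.Description("Shipped")]
    Shipped = 2,
    [System.ComponentModel.Description("Closed")]
    Closed = 3
}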
For your own enums, use the numeric values, for one simple reason: it allows for every part of enum functionality, out of the box, with no hassle. The only caveat is that in the enum definition, every member must be explicitly given a numeric value, which can never change (or, at least, not after you've made the first release). I always add a prominent comment to enums that get persisted to the database, so people don't go changing the constants.
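For example, a persisted version of the enum from the question might look like this (the explicit values and the warning comment are illustrative of the approach, not taken from any particular project):
// WARNING: these numeric values are persisted to the database.
// Never change or reuse an existing value; only append new members.
public enum TextJustification
{
    TJLeft = 0,
    TJCenter = 1,
    TJRight = 2
}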
Here are some reasons why numeric values are better than string identifiers:
It is the simplest way to represent the value
Database searching/sorting is faster
Lower database storage cost (which could be a serious issue for some applications)
You can add [Flags] to your enum and not break your code and/or existing data
For [Flags] stored in a string field:
Poorly normalized data
Could generate false-positive anomalies when doing matching (i.e., if you have members "Sales" and "RetailSales", merely doing a substring search for "Sales" will match on either type). This has to be constrained either by using a regex on word boundaries (finicky to do in a database, and slow), or by constraining the member names in the enum itself, which is nonstandard, error-prone, and very difficult to debug.
For string fields (either [Flags] or not), if the database is obfuscated this field has to be handled specially, which hurts both the ability to search/sort and the efficiency of doing so, as mentioned in the previous point
You can rename any of the members without breaking the database code and/or existing client data.
Less over-the-wire data transfer space/time needed
There are only two situations where using the member names in the database may be an advantage:
If you're doing a lot of data editing manually... but who does that? And if you are, there's a good chance you're not going to be using an enum anyway.
Third-party enums where they may not be so diligent as to maintain the numeric value constants. But I have to say, anyone releasing a decently-written API is overwhelmingly likely to be smart enough to keep the enum values constant. (The identifiers have to stay the same since changing them would break existing code.)
On lookup tables, which I strongly discourage because they are a one-way bullet train to a maintenance nightmare:
Adding [Flags] functionality requires the use of a junction table, which means more complicated queries (existing ones need to be rewritten), and added complexity. What about existing client data?
If the identifier is stored in the data table, what's the point of having a lookup table in the first place?
If the numeric value is stored in the data table, you gain nothing since you still have to look up the identifier from the lookup table. To make it easier, you could create a view... for every table that has an enum value in it. And then let's not even think about [Flags] enums.
Introducing any kind of synchronization between database and code is just asking for trouble. What about existing client data?
Store an ID (value) and a varchar name; this lets you query either way. Searching on the name is reasonable if your IDs (values) may get out of sync later.
It is better to use the integer representation... If you have to change the Enum later (add more values etc) you can explicitly assign the integer value to the Enum value so that your Enum representation in code still matches what you have in the database.
It depends on how important performance is versus readability. Databases can index numeric values a lot easier than strings, which means you can get better performance without using as much memory. It would also reduce the amount of data going across the wire somewhat. On the other hand, when you look at a numeric value in your database which you then have to refer to a code file to translate, that can be annoying.
In most cases, I'd suggest using the value, but you will need to make sure you're explicitly setting those values so that if you add a value in the future it doesn't shift the references around.
As so often, it depends on many things:
Do you want to sort by the natural order of the enums? Use the numeric values.
Do you work directly in the database using a low-level tool? Use the name.
Do you have huge amounts of data, and performance is an issue? Use the number.
For me, the most important issue is usually maintainability:
If your enums change in the future, names will either match correctly or fail hard and loud. With numbers, someone can add an enum member and shift the values of all the members that follow, so you have to update every table where the enum is used, with almost no way to know whether you missed one.
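A small C# illustration of that hazard, using the enum from the question (the inserted TJJustify member is hypothetical):
// Original: implicit values are TJLeft = 0, TJCenter = 1, TJRight = 2.
public enum TextJustification { TJLeft, TJCenter, TJRight }
// Someone later inserts a member without assigning explicit values:
// TJJustify becomes 1, TJCenter silently shifts to 2, and TJRight to 3,
// so every row that stored 1 now means TJJustify instead of TJCenter.
public enum TextJustificationAfterEdit { TJLeft, TJJustify, TJCenter, TJRight }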
If you are trying to get the enum value stored in the database back, then try this:
EnumValue = DirectCast([Enum].Parse(GetType(TextJustification), reader.Item("put_field_name_here").ToString), TextJustification)
tell me if it works for you
Related
FYI: I explicitly mean SQL Server 2000-2008 and C#, so DBMSs with native enum support, like MySQL, are not the subject of my question.
I know this question has been asked multiple times on SO. But still, I see answers taking different approaches to storing enum values in the db.
Save enum as int in db and extract the enum value (or enum description attribute using reflection) in code:
This is the approach I usually use. The problem is that when I query the database in SSMS, the retrieved data is hard to understand.
Save enum as string (varchar) in db and parse it back to the enum in code.
Actually, this might be the best solution, but (don't laugh!) it doesn't feel right. I'm not sure about the cons, except using more space in the db, which is usually acceptable. So is there anything else against this approach?
Have a separate table in db which is synchronized with code's enum definition and make a foreign key relationship between your main table and the enum table.
The problem is that when another enum value is added later, both the code and the db need to be updated. Also, there might be typos, which can be a pain!
So, in general, when we can accept the db overhead of the 2nd solution, what would be the best way to store enum values in the db? Is there a general, definite design-pattern rule about this?
Thanks.
There is no definite design rule (that I know of), but I prefer approach #1.
This is the approach I prefer. It's simple, and enums are usually compact enough that I end up remembering what the numbers mean.
It's more readable, but it can get in the way of refactoring or renaming your enumeration values when you want to. You lose some freedom in your code. All of a sudden you need to get a DBA involved (depending on where/how you work) just to change an enumeration value, or you suffer with it. Parsing an enum has some performance impact as well, since things like locale come into play, but it's probably negligible.
What problem does that solve? You still have unreadable numbers in a table somewhere, unless you want to add the overhead of a join. But sometimes, this is the correct answer too depending on how the data is used.
EDIT:
Chris in the comments had a good point: if you do go down the numeric approach, you should explicitly assign values so you can re-order the members later without changing their stored values. For example:
public enum Foo
{
Bar = 1,
Baz = 2,
Cat = 9,
//Etc...
}
One idea I've seen before, which is more or less your option 3:
A table in the database (for foreign keys etc)
A matching Enum in the client code
A startup check (via database call) to ensure they match
The database table can have a trigger or check constraint to reduce the risk of changes. It shouldn't have any write permissions, because the data is tied to a client code release, but this adds a safety factor in case the DBA bollixes something up.
If you have other clients reading the database (which is very common), then the database has complete data.
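A minimal sketch of such a startup check, assuming a hypothetical OrderStatus enum in code, a lookup table also named OrderStatus with Id and Name columns, and classic System.Data.SqlClient for data access (all of these names are illustrative):
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public enum OrderStatus { Open = 1, Shipped = 2, Closed = 3 }

public static class EnumDbCheck
{
    // Throws at application startup if the lookup table and the code enum disagree.
    public static void VerifyOrderStatus(string connectionString)
    {
        var dbRows = new Dictionary<int, string>();

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Id, Name FROM OrderStatus", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    dbRows[reader.GetInt32(0)] = reader.GetString(1);
            }
        }

        foreach (OrderStatus value in Enum.GetValues(typeof(OrderStatus)))
        {
            if (!dbRows.TryGetValue((int)value, out var name) || name != value.ToString())
                throw new InvalidOperationException(
                    $"Enum/table mismatch for {value} ({(int)value}).");
        }
    }
}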
Is it a common practice to use special naming conventions when you're denormalizing for performance?
For example, let's say you have a customer table with a date_of_birth column. You might then add an age_range column because sometimes it's too expensive to calculate that customer's age range on the fly. However, one could see this getting messy because it's not abundantly clear which values are authoritative and which ones are derived. So maybe you'd want to name that column denormalized_age_range or something.
Is it common to use a special naming convention for these columns? If so, are there established naming conventions for such a thing?
Edit: Here's another, more realistic example of when denormalization would give you a performance gain. This is from a real-life case. Let's say you're writing an app that keeps track of college courses at all the colleges in the US. You need to be able to show, for each degree, how many credits you graduate with if you choose that degree. A degree's credit count is actually ridiculously complicated to calculate and it takes a long time (more than one second per degree). If you have a report comparing 100 different degrees, it wouldn't be practical to calculate the credit count on the fly. What I did when I came across this problem was I added a credit_count column to our degree table and calculated each degree's credit count up front. This solved the performance problem.
I've seen column names use the word "derived" when they represent that kind of value. I haven't seen a generic style guide for other kinds of denormalization.
I should add that in every case I've seen, the derived value is always considered secondary to the data from which it is derived.
In some programming languages (e.g., Java), variable names with a _ prefix are conventionally used for private methods or variables; private meaning they should not be modified or invoked by anything outside the class.
I wonder if this convention can be borrowed in naming derived database columns.
In Postgres, column names can start with _, eg _average_product_price.
It can convey the meaning that you can read this column, but don't write it because it's derived.
I'm in the same situation right now, designing a database schema that can benefit from denormalisation of central values. For example, table partitioning requires the partition key to exist in the table. So even if the data can be retrieved by following some levels of foreign keys, I need the data right there in most tables.
Maybe the suffix "copy" could be used for this; after all, the data is just a copy of some other location where the primary data is stored. Since it's a word, it works with all naming conventions, like .NET PascalCase mapped to SQL snake_case, e.g. CompanyIdCopy and company_id_copy. And it's a short word, so you don't have to write too much. And it's not an abbreviation, so you don't have to spell it out or ever wonder what it means. ;-)
I could also think of the suffix "cache" or "cached" but a cache is usually filled on demand and invalidated some time later, which is usually not the case with denormalised columns. That data should exist at all times and never be outdated or missing.
The word "derived" is just a bit longer than "copy". I know that one special DBMS, an expensive one, has a column name limit of 30 characters, so that could be an issue.
If all of the values required to derive the calculation are in the table already, then it is extremely unlikely that you will gain any meaningful (or even measurable) performance benefit by persisting these calculated values.
I realize this doesn't answer the question directly, but it would seem that the premise is faulty: if such conditions existed for the question to apply, then you don't need to denormalize it to begin with.
We have a table layout with property names in one table, values in a second table, and items in a third. (Yes, we're re-implementing tables in SQL.)
We join all three to get a value of a property for a specific item.
Unfortunately, the values can have multiple data types: double, varchar, bit, etc. Currently the consensus is to stringly-type all the values and store the type name in a column next to the value.
tblValues
DataTypeName nvarchar
Is there a better, cleaner way to do this?
Clarifications:
Our requirements state that we must add new "attributes" at run time without modifying the db schema
I would prefer not to use EAV, but that is the direction we are headed right now.
This system currently exists in SQL server using a traditional db design, but I can't see a way to fulfill our requirement of not modifying the db schema without moving to EAV.
There are really only two patterns for implementing an 'EAV model' (assuming that's what you want to do):
Implement it as you've described, where you explicitly store the property value type along with the value, and use that to convert the string values stored into the appropriate 'native' types in the application(s) that access the DB.
Add a separate column for each possible datatype you might store as a property value. You could also include a column that indicates the property value type, but it wouldn't be strictly necessary.
Solution 1 is a simpler design, but it incurs the overhead of converting the string values stored in the table into the appropriate data type as needed.
Solution 2 has the benefit of storing values as the appropriate native type, but it will necessarily require more space (though not necessarily much more). This may be moot if there aren't a lot of rows in this table. You may want to add a check constraint that allows only one non-NULL value across the different value columns, or, if you're including a type column (so as to avoid checking for non-NULL values in the different value columns), one that prevents mismatches between the value stored in the type column and which value column contains the non-NULL value.
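As an illustration of how option 2 looks from the application side, here is a small sketch that reconstitutes the stored value from whichever typed column is non-NULL (the column names DoubleValue, StringValue, and BitValue are invented to match the double/varchar/bit types mentioned in the question):
using System.Data;

public static class PropertyValueReader
{
    // Returns the stored property value, taken from whichever typed column is non-NULL.
    public static object ReadValue(IDataRecord row)
    {
        int doubleCol = row.GetOrdinal("DoubleValue");
        int stringCol = row.GetOrdinal("StringValue");
        int bitCol = row.GetOrdinal("BitValue");

        if (!row.IsDBNull(doubleCol)) return row.GetDouble(doubleCol);
        if (!row.IsDBNull(stringCol)) return row.GetString(stringCol);
        if (!row.IsDBNull(bitCol)) return row.GetBoolean(bitCol);

        return null; // No value stored in any typed column.
    }
}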
As HLGEM states in her answer, this is less preferred than a standard relational design, but I'm more sympathetic to the use of EAV model table designs for data such as application option settings.
Well, don't do that! You lose all the value of having data types if you do. You can't properly constrain them (and will, I guarantee it, eventually get bad data), and you have to cast them back to the proper type to use them in mathematical or date calculations. All in all, a performance loser.
Your whole design will not scale well. Read up on why you don't want to use EAV tables in a relational database. It is not only generally slower but also unusually difficult to query, especially for reporting.
Perhaps a noSQL database would better suit your needs or a proper relational design and NOT an EAV design. Is it really too hard to figure out what fields each table would really need or are your developers just lazy? Are you sacrificing performance for flexibility - a flexibility that most users will hate? Especially when it means bad performance? Have you ever used a database designed that way to try to do anything?
I have a situation where I need to store a general piece of data (could be an int, float, or string) in my database, but I don't know ahead of time which it will be. I need a table (or less preferably tables) to store this unknown typed data.
What I think I am going to do is have a column for each data type, only use one for each record and leave the others NULL. This requires some logic above the database, but this is not too much of a problem because I will be representing these records in models anyway.
Basically, is there a best practice way to do something like this? I have not come up with anything that is less of a hack than this, but it seems like this is a somewhat common problem. Thanks in advance.
EDIT: Also, is this considered 3NF?
You could easily do that if you used SQLite as a database backend:
Any column in a version 3 database, except an INTEGER PRIMARY KEY column, may be used to store any type of value.
For other RDBMS systems, I would go with Philip's solution.
Note that in my line of software (business applications), I cannot think of any situation where this kind of requirement would be needed (a value with an unknown datatype). Unless the domain model was flawed, of course... I can imagine that other lines of software may incur different practices, but I suggest that you consider rethinking your overall design.
If your application can reliably convert datatypes, you might consider a single column solution based on a variable-length binary column, with a second column to track original data type. (I did a very small routine based on this once before, and it worked well enough.) Testing would show if conversion is more efficiently handled on the application or database side.
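A rough sketch of that idea (the type codes and class name are invented for illustration): each supported type is converted to bytes for the binary column, and a small code recording the original type goes in the second column.
using System;
using System.Text;

public static class TypedBlob
{
    // Invented codes for the companion "original data type" column.
    public const byte TypeInt = 1;
    public const byte TypeFloat = 2;
    public const byte TypeString = 3;

    public static (byte typeCode, byte[] data) ToBlob(object value)
    {
        switch (value)
        {
            case int i: return (TypeInt, BitConverter.GetBytes(i));
            case float f: return (TypeFloat, BitConverter.GetBytes(f));
            case string s: return (TypeString, Encoding.UTF8.GetBytes(s));
            default: throw new ArgumentException("Unsupported type.");
        }
    }

    public static object FromBlob(byte typeCode, byte[] data)
    {
        switch (typeCode)
        {
            case TypeInt: return BitConverter.ToInt32(data, 0);
            case TypeFloat: return BitConverter.ToSingle(data, 0);
            case TypeString: return Encoding.UTF8.GetString(data);
            default: throw new ArgumentException("Unknown type code.");
        }
    }
}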
If I were to do this I would choose either your method, or I would cast everything to string and use only one column. Of course there would be another column with the type (which would probably be useful for the first method too).
For faster code I would probably go with your method.
Like most apps, mine has a "users" table that describes the entities that can log in to it. It has info like their alias, their email address, their salted password hashes, and all the usual candidates.
However, as my app has grown, I've needed more and more special case "flags" that I've generally just stuck in the users table. Stuff like whether their most recent monthly email has been transmitted yet, whether they've dismissed the tutorial popup, how many times they clicked the "I am awesome" button, etc.
I am beginning to have quite a few of these fields, and the majority of these flags I don't need for the majority of the webpages that I handle.
Is there anything wrong with keeping all of these flags in the users table? Is there somewhere better to put them? Would creating other tables with a 1:1 relationship with the users table provide additional overhead to retrieving the data when I do need it?
Also, I use Hibernate as my ORM, and I worry that creating a bunch of extra tables for this information means that I'd also have to dirty up my User domain object. Advice?
There are several common solutions:
EAV
Store one flag per row in a child table, with a reference to the user row, the name of the flag, and the value. Disadvantages: Can't guarantee a row exists for each flag. Need to define another lookup table for flag names. Reconstituting a User record with all its flags is a very costly query (requires a join per flag).
Bit field
Store one flag per bit in a single long binary column. Use bitmasking in application code to interpret the flags. Disadvantages: Artificial limit on number of flags. Hard to drop a flag when it becomes obsolete. Harder to change flag values, search for specific flag values, or aggregate based on flag values without resorting to confusing bitwise operators.
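For instance, a minimal C# sketch of the bit-field approach, using flag names taken from the question (the single column holding the combined value is assumed to be a bigint):
using System;

[Flags]
public enum UserFlags : long
{
    None              = 0,
    MonthlyEmailSent  = 1 << 0,
    TutorialDismissed = 1 << 1
    // ... one bit per flag, up to the width of the column
}

public static class UserFlagsDemo
{
    public static void Main()
    {
        // Numeric value read from the single flags column in the users table.
        long stored = 2;

        var flags = (UserFlags)stored;
        bool dismissed = flags.HasFlag(UserFlags.TutorialDismissed); // true

        // Set another flag and write the numeric value back.
        flags |= UserFlags.MonthlyEmailSent;
        long toStore = (long)flags; // 3
        Console.WriteLine(toStore);
    }
}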
Normalized design
Store one BIT column per flag, all in the Users table. Most "correct" design from the perspective of relational theory and normalization. Disadvantages: Adding a flag requires ALTER TABLE ADD COLUMN. Also, you might exceed the number of columns or row size supported by your brand of RDBMS.
I'd say a better design would be something like this:
create table users (
  id integer primary key,
  username varchar(32) unique -- avoid "user", which is a reserved word in many DBMSs
)

create table flags (
  id integer,
  flagname varchar(32),
  flagval char(1),
  primary key (id, flagname)
)
with a primary key of id + flagname. The flags entries then look like:
1, 'administrator', 'Y',
1, 'editor', 'Y',
2, 'editor', 'Y'
and so on. I'd create a view to access the joined tables.
It's interesting to see how the crappiest answer of all is the only one that got an upvote.
The question did not include sufficient information to actually give a sensible answer.
For one, it failed to say whether the question was about some logical design of a database, or some physical design of a database.
If the question was about logical design, then the answer is rather simple : NEVER include a boolean in your logical designs. The relational model already has a way of representing yes/no information (by virtue of the closed-world assumption) : namely as the presence of some tuple in some relation.
If the question was about physical design, then any sensible answer must necessarily depend on other information such as update frequency, querying frequency, volumes of data queried, etc. etc. None of those were provided, making the question unanswerable.
EDIT
"The relational model prescribes just one such type, BOOLEAN (the most fundamental type of all)." -- C. J. Date, SQL and Relational Theory (2009)."
That reply was of course bound to appear.
But does that quote really say that a type boolean should be available FOR INCLUSION IN SOME RELATION TYPE? Or does that quote (or better, the larger piece of text it appears in) only say that the existence of type boolean is inevitable because otherwise the system has no way of returning a result for any invocation of the equality operator, and that it is really the existence of the equality operator that is "prescribed"?
IOW, should type boolean be available for inclusion in relation types or should type boolean be available because otherwise there wouldn't be a single DML language we could define to operate on the database ?
Date is also on record saying (slightly paraphrased) that "if there are N ways of representing information, with N > 1, then there are also >1 sets of operators to learn, >1 ways for the developer to make mistakes, >1 sets of operators for the DBMS developer to implement, and >1 ways for the DBMS developer to make mistakes".
EDIT EDIT
"Date says "a relational attribute can be of any type whatsoever." He does not say an attribute can be of any type except boolean"
You have read Date very well.
One other thing that Date definitely does not say is that an attribute cannot be relation-typed. Quite the contrary. Yet, there is a broad consensus, and I know for a fact that even Date shares that consensus, that it is probably a bad idea to have base relvars that include an attribute that is relation-typed.
Likewise, nowhere does Date say that it is a GOOD idea to include boolean attributes in base relation types. He is absolutely silent on that particular issue. The opinion I expressed was mine. I don't think I gave the impression I was expressing somebody else's opinion in what I wrote originally.
Representing "the truth (or falseness) of any given proposition" can be done by including/omitting a tuple in the relation value of a certain relvar (at least logically !). Now, being able to include/exclude some given tuple from the value of some given relvar is most certainly fundamental. Given that, there is no need what so ever to be able to represent the truth (or falseness) of any given proposition (logically !) by using an attribute of type boolean. And what else would you use an attribute of type boolean for, but to say explicitly that some proprosition is either true or false ?
If you really only need this information on a few pages, why not have a table and relation for each flag? The existence of a record in that table means the flag is set; getting back no row (null) means it is unset.
The count of "I am awesome" clicks can then also be done by adding a record for each click (this avoids the race condition of updating a count stored in a record in the users table):
select count(*) from AwesomeClicks where userid = 1234
Use a unique constraint on the userid field for bit-only information (real flags as opposed to the count in the above example).
select userid from DismissedTutorialPopup where userid = 1234
This will result in either 1234 (flag is set) or null (flag is not set).
Also, by adding a CreateDate field, you can store when the flag was set etc.
Some people don't seem to like this pattern for a number of reasons but I've developed a method for doing binary comparisons on base64 strings so I can handle a virtually unlimited number of flags inside a single varchar field. (6 per character technically)
I admit one frustration with this technique is that it's next to impossible to read the flags from inside the database. But it works for me. My flags are defined in my application like so:
public class Flags
{
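// Per the note above: each character of the stored string packs six flags,
// using the author's own base64-style alphabet, which is why the first six
// flags are single characters and Flag7 spills over into a second character.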
public const string Flag1 = "1";
public const string Flag2 = "2";
public const string Flag3 = "4";
public const string Flag4 = "8";
public const string Flag5 = "g";
public const string Flag6 = "w";
public const string Flag7 = "10";
// ... etc ...
}
"Is there anything wrong with keeping all of these flags in the users table?"
Hi, I am not sure which DB you are using currently, but if you are using SQL Server, make sure the row size won't exceed 8060 bytes (the max row size is 8060 bytes).
Max row sizes:
SQL Server 2005 - 8060 bytes
MySQL - 8052 bytes
Oracle 9i - 255000 bytes