Like most apps, mine has a "users" table that describes the entities that can log in to it. It has info like their alias, their email address, their salted password hashes, and all the usual candidates.
However, as my app has grown, I've needed more and more special case "flags" that I've generally just stuck in the users table. Stuff like whether their most recent monthly email has been transmitted yet, whether they've dismissed the tutorial popup, how many times they clicked the "I am awesome" button, etc.
I am beginning to have quite a few of these fields, and most of these flags aren't needed on most of the pages my app serves.
Is there anything wrong with keeping all of these flags in the users table? Is there somewhere better to put them? Would creating other tables with a 1:1 relationship with the users table provide additional overhead to retrieving the data when I do need it?
Also, I use Hibernate as my ORM, and I worry that creating a bunch of extra tables for this information means that I'd also have to dirty up my User domain object. Advice?
There are several common solutions:
EAV
Store one flag per row in a child table, with a reference to the user row, the name of the flag, and the value. Disadvantages: Can't guarantee a row exists for each flag. Need to define another lookup table for flag names. Reconstituting a User record with all its flags is a very costly query (requires a join per flag).
Bit field
Store one flag per bit in a single long binary column. Use bitmasking in application code to interpret the flags. Disadvantages: Artificial limit on number of flags. Hard to drop a flag when it becomes obsolete. Harder to change flag values, search for specific flag values, or aggregate based on flag values without resorting to confusing bitwise operators.
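As a rough sketch of the bit-field approach (Python here purely to illustrate the masking; the flag names are made up):

```python
# Each flag gets one bit position inside a single integer column.
SENT_MONTHLY_EMAIL = 1 << 0   # hypothetical flag names
DISMISSED_TUTORIAL = 1 << 1
IS_ADMINISTRATOR   = 1 << 2

def set_flag(flags: int, mask: int) -> int:
    return flags | mask

def clear_flag(flags: int, mask: int) -> int:
    return flags & ~mask

def has_flag(flags: int, mask: int) -> bool:
    return flags & mask != 0

flags = 0
flags = set_flag(flags, SENT_MONTHLY_EMAIL)
flags = set_flag(flags, IS_ADMINISTRATOR)
print(has_flag(flags, SENT_MONTHLY_EMAIL))  # True
print(has_flag(flags, DISMISSED_TUTORIAL))  # False
```

Note how every read and write goes through bitwise operators; this is exactly the "confusing" part once you try to do it in SQL WHERE clauses instead of application code.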
Normalized design
Store one BIT column per flag, all in the Users table. Most "correct" design from the perspective of relational theory and normalization. Disadvantages: Adding a flag requires ALTER TABLE ADD COLUMN. Also, you might exceed the number of columns or row size supported by your brand of RDBMS.
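A quick sketch of the normalized design, using SQLite via Python just for illustration (flag and column names are made up); note the ALTER TABLE needed for every new flag:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Users (id integer primary key, email text)")
# Each flag is its own column; adding a flag later means altering the table.
conn.execute("alter table Users add column monthly_email_sent integer not null default 0")
conn.execute("alter table Users add column dismissed_tutorial integer not null default 0")
conn.execute("insert into Users (id, email, dismissed_tutorial) values (1, 'a@example.com', 1)")
row = conn.execute(
    "select monthly_email_sent, dismissed_tutorial from Users where id = 1"
).fetchone()
print(row)  # (0, 1)
```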
I'd say a better design would be something like this:
create table users (
id integer primary key,
username varchar(32) unique
)
create table flags (
id integer,
flagname varchar(32),
flagval char(1),
primary key (id, flagname)
)
The primary key is id + flagname. flags entries then look like:
1, 'administrator', 'Y',
1, 'editor', 'Y',
2, 'editor', 'Y'
and so on. I'd create a view to access the joined tables.
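A minimal sketch of that view, using SQLite via Python purely for illustration (the column is named username here, since user is a reserved word in some databases):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table users (
    id integer primary key,
    username varchar(32) unique
);
create table flags (
    id integer,
    flagname varchar(32),
    flagval char(1),
    primary key (id, flagname)
);
create view user_flags as
    select u.username, f.flagname, f.flagval
    from users u join flags f on f.id = u.id;
""")
conn.execute("insert into users values (1, 'alice'), (2, 'bob')")
conn.executemany("insert into flags values (?, ?, ?)",
                 [(1, 'administrator', 'Y'), (1, 'editor', 'Y'), (2, 'editor', 'Y')])
rows = conn.execute(
    "select username, flagname from user_flags order by username, flagname"
).fetchall()
print(rows)  # [('alice', 'administrator'), ('alice', 'editor'), ('bob', 'editor')]
```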
It's interesting to see how the crappiest answer of all is the only one that got an upvote.
The question did not include sufficient information to actually give a sensible answer.
For one, it failed to say whether the question was about some logical design of a database, or some physical design of a database.
If the question was about logical design, then the answer is rather simple: NEVER include a boolean in your logical designs. The relational model already has a way of representing yes/no information (by virtue of the closed-world assumption): namely, as the presence of some tuple in some relation.
If the question was about physical design, then any sensible answer must necessarily depend on other information such as update frequency, querying frequency, volumes of data queried, etc. etc. None of those were provided, making the question unanswerable.
EDIT
"The relational model prescribes just one such type, BOOLEAN (the most fundamental type of all)." -- C. J. Date, SQL and Relational Theory (2009)."
That reply was of course bound to appear.
But does that quote really say that a type boolean should be available FOR INCLUSION IN SOME RELATION TYPE? Or does that quote (or better, the larger piece of text that it appears in) only say that the existence of type boolean is inevitable because otherwise the system has no way of returning a result for any invocation of the equality operator, and that it is really the existence of the equality operator that is "prescribed"?
IOW, should type boolean be available for inclusion in relation types or should type boolean be available because otherwise there wouldn't be a single DML language we could define to operate on the database ?
Date is also on record saying (slightly paraphrased) that "if there are N ways of representing information, with N > 1, then there are also >1 sets of operators to learn, >1 ways for the developer to make mistakes, >1 sets of operators for the DBMS developer to implement, and >1 ways for the DBMS developer to make mistakes".
EDIT EDIT
"Date says "a relational attribute can be of any type whatsoever." He does not say an attribute can be of any type except boolean"
You have read Date very well.
One other thing that Date definitely does not say is that an attribute cannot be relation-typed. Quite the contrary. Yet, there is a broad consensus, and I know for a fact that even Date shares that consensus, that it is probably a bad idea to have base relvars that include an attribute that is relation-typed.
Likewise, nowhere does Date say that it is a GOOD idea to include boolean attributes in base relation types. He is absolutely silent on that particular issue. The opinion I expressed was mine. I don't think I gave the impression I was expressing somebody else's opinion in what I wrote originally.
Representing "the truth (or falseness) of any given proposition" can be done by including/omitting a tuple in the relation value of a certain relvar (at least logically !). Now, being able to include/exclude some given tuple from the value of some given relvar is most certainly fundamental. Given that, there is no need what so ever to be able to represent the truth (or falseness) of any given proposition (logically !) by using an attribute of type boolean. And what else would you use an attribute of type boolean for, but to say explicitly that some proprosition is either true or false ?
If you really only need this information on a few pages, why not have a table & relation for each flag? Existence of a record in that table sets the bit, selecting null is an unset bit.
The count of awesome clicks can then also be done by adding a record for each click (this solves the race problem of updating a count column in a record on the user table):
select count(*) from AwesomeClicks where userid = 1234
Use a unique constraint on the userid field for bit-only information (real flags as opposed to the count in the above example).
select userid from DismissedTutorialPopup where userid = 1234
This will result in either 1234 (flag is set) or null (flag is not set).
Also, by adding a CreateDate field, you can store when the flag was set etc.
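Here is a sketch of this pattern in SQLite via Python, using the table names from this answer (the CreateDate default is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table DismissedTutorialPopup (
    userid integer unique,          -- unique: a pure flag, set at most once
    CreateDate text default current_timestamp
);
create table AwesomeClicks (
    userid integer,                 -- one row per click, so counting is race-free
    CreateDate text default current_timestamp
);
""")
conn.execute("insert into DismissedTutorialPopup (userid) values (1234)")
conn.executemany("insert into AwesomeClicks (userid) values (?)",
                 [(1234,), (1234,), (1234,)])

flag_set = conn.execute(
    "select userid from DismissedTutorialPopup where userid = 1234"
).fetchone()          # (1234,) if the flag is set, None if not
clicks = conn.execute(
    "select count(*) from AwesomeClicks where userid = 1234"
).fetchone()[0]
print(flag_set, clicks)  # (1234,) 3
```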
Some people don't seem to like this pattern for a number of reasons but I've developed a method for doing binary comparisons on base64 strings so I can handle a virtually unlimited number of flags inside a single varchar field. (6 per character technically)
I admit one frustration with this technique is that it's next to impossible to read the flags from inside the database. But it works for me. My flags are defined in my application like so:
public class Flags
{
public const string Flag1 = "1";
public const string Flag2 = "2";
public const string Flag3 = "4";
public const string Flag4 = "8";
public const string Flag5 = "g";
public const string Flag6 = "w";
public const string Flag7 = "10";
// ... etc ...
}
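Here's a sketch of the general idea in Python. The alphabet is an assumption; it happens to reproduce the constants above ('g' for 16, 'w' for 32, '10' for 64), but the original answer doesn't spell out its full digit ordering:

```python
# Pack flag bits into a string, 6 bits per character.
# The alphabet is assumed: digits, then lowercase, uppercase, '-' and '_'.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_"

def encode_flags(bits: int) -> str:
    if bits == 0:
        return ALPHABET[0]
    chars = []
    while bits:
        chars.append(ALPHABET[bits & 0x3F])  # low 6 bits -> one character
        bits >>= 6
    return "".join(reversed(chars))

def decode_flags(s: str) -> int:
    bits = 0
    for ch in s:
        bits = (bits << 6) | ALPHABET.index(ch)
    return bits

def has_flag(s: str, flag_bit: int) -> bool:
    return decode_flags(s) & (1 << flag_bit) != 0

stored = encode_flags((1 << 0) | (1 << 4) | (1 << 6))  # flags 0, 4, 6 set
print(stored, has_flag(stored, 4), has_flag(stored, 5))  # 1h True False
```

As the answer concedes, the stored string is opaque to the database: every check requires decoding in application code, and SQL can't index or filter on individual flags.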
**Is there anything wrong with keeping all of these flags in the users table?**
I am not sure which DB you are currently using, but if you are using SQL Server, make sure the row size won't exceed 8060 bytes (the maximum row size).
Max row size:
SQL Server 2005 - 8060 bytes
MySQL - 8052 bytes
Oracle 9i - 255,000 bytes
Related
Marginally related to Should I delete or disable a row in a relational database?
Given that I am going to go with the strategy of warehousing changes to my tables in a history table, I am faced with the following options for implementing a status for a given row in MySQL:
An isActive boolean
An activeStatus enum
An activeStatus INT referencing a small ActiveStatus lookup table
An activeStatus INT not referencing another table
The first approach is rather inflexible in my opinion, since I might need more booleans in the future to support other types of active statuses (I'm not sure what they would be, but maybe something like "being phased out" or "active for a random group of users", etc).
I'm told that MySQL enum is bad, so the second approach probably won't fly.
I like the third approach, but I'm wondering if it is a heavy handed solution to a relatively small problem.
The fourth approach requires that we know in advance what each status INT means and seems like an outdated way to do things.
Is there a canonical right answer? Am I ignoring another approach?
Personally I would go with your third option.
Boolean values often turn out to be more complex in reality, as you suggested. ENUMs can be nice, but they have the downside that as soon as you want to store additional information about each value - who added it, when, is it only valid for a certain time period or source system, comments etc. - it becomes difficult, whereas with a lookup table those data can easily be maintained in additional columns. ENUMs are a good tool to constrain data to certain values (like a CHECK constraint), but not such a good tool if those values have significant meaning and need to be exposed to users.
It's not entirely clear from your question if you plan to treat your history table like a fact table and use it in reports, but if so then you could consider the ActiveStatus lookup table as a dimension. In this case a table is much easier, because your reporting tool can read the possible values from the dimension table in order to let the user choose his query conditions; such tools generally don't know anything about ENUMs.
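A sketch of the third option in SQLite via Python; the table and column names follow the question's example, and the comment column is only there to show the kind of metadata a lookup table can carry that an ENUM can't:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table ActiveStatus (
    id integer primary key,
    name text not null unique,
    comment text                      -- room for extra metadata, unlike an ENUM
);
create table history (
    row_ref integer,
    activeStatus integer not null references ActiveStatus(id)
);
""")
conn.executemany("insert into ActiveStatus (id, name) values (?, ?)",
                 [(1, 'active'), (2, 'inactive'), (3, 'phased out')])
conn.execute("insert into history values (42, 3)")
row = conn.execute("""
    select h.row_ref, s.name
    from history h join ActiveStatus s on s.id = h.activeStatus
""").fetchone()
print(row)  # (42, 'phased out')
```

Adding a new status is now a plain INSERT into ActiveStatus, with no schema change.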
From my point of view, your 2nd approach is better if you have more than two statuses, because ENUM is great for data that you know will fall within a static set. But if you have only two statuses, active and inactive, then it's always better to use a boolean.
EDIT:
If you are sure you are not going to change the values of your ENUM in the future, then it's fine to use an ENUM for such a field.
I am looking to let the users of my web application define their own attributes for products and then enter data for those products. I have found out that this technique is called n(th) normal form.
The following is DB structure I am currently considering deploying and was wondering what the positives and negatives would be in regards to integrity and scalability (and any other -ity's you can think of)
EDIT
(Sorry, This is more what I mean)
I have been staring at this for the last 15 minutes, and I know the part where the red arrow is induces duplication, and hence you would have to have integrity checks. But I just don't understand how else what I want could be done.
The products would number no more then 10. The variables would number no more then 200 (max 20 per product). The number of product instances would not exceed 100,000, therefore the maximum size of pVariable_data would not exceed 2 million
This model is called a "database in a database" and is not nice, though sometimes it is unavoidable. First check whether you really need it, and whether your database is really the right tool for the job.
With PostgreSQL you could use hstore (http://www.postgresql.org/docs/8.4/static/hstore.html), which is a well-established solution for this kind of issue.
Assuming that pVariable is more of a pVariable type, drop the reference to product_fk. It would mean that you need a new entry in that table for every Product record. Maybe try something like this:
Product(id, active, allow_new)
pVariable_type(id, name)
pVariable_data(id, product_fk, pvariable_fk, non_typed_value, bool, int, etc)
I would use the non_typed_value as your text value, and (unless you are keeping streams) write a record into that field along with the typed value. It will mean keeping the value of a record twice (and more of a pain on updates etc) but it will make querying easier, along with reporting (anything you just need to display the value for).
Note: it would also be a good idea to pull anything that is common to all products and put it in the product table. For example, all products will most likely have a name, suggested price, etc.
Is it a common practice to use special naming conventions when you're denormalizing for performance?
For example, let's say you have a customer table with a date_of_birth column. You might then add an age_range column because sometimes it's too expensive to calculate that customer's age range on the fly. However, one could see this getting messy because it's not abundantly clear which values are authoritative and which ones are derived. So maybe you'd want to name that column denormalized_age_range or something.
Is it common to use a special naming convention for these columns? If so, are there established naming conventions for such a thing?
Edit: Here's another, more realistic example of when denormalization would give you a performance gain. This is from a real-life case. Let's say you're writing an app that keeps track of college courses at all the colleges in the US. You need to be able to show, for each degree, how many credits you graduate with if you choose that degree. A degree's credit count is actually ridiculously complicated to calculate and it takes a long time (more than one second per degree). If you have a report comparing 100 different degrees, it wouldn't be practical to calculate the credit count on the fly. What I did when I came across this problem was I added a credit_count column to our degree table and calculated each degree's credit count up front. This solved the performance problem.
I've seen column names use the word "derived" when they represent that kind of value. I haven't seen a generic style guide for other kinds of denormalization.
I should add that in every case I've seen, the derived value is always considered secondary to the data from which it is derived.
In some programming languages, eg Java, variable names with the _ prefix are used for private methods or variables. Private means it should not be modified/invoked by any methods outside the class.
I wonder if this convention can be borrowed in naming derived database columns.
In Postgres, column names can start with _, eg _average_product_price.
It can convey the meaning that you can read this column, but don't write it because it's derived.
I'm in the same situation right now, designing a database schema that can benefit from denormalisation of central values. For example, table partitioning requires the partition key to exist in the table. So even if the data can be retrieved by following some levels of foreign keys, I need the data right there in most tables.
Maybe the suffix "copy" could be used for this. Because after all, the data is just a copy of some other location where the primary data is stored. Since it's a word, it can work with all naming conventions, like .NET PascalCase which can be mapped to SQL snake_case, e.g. CompanyIdCopy and company_id_copy. And it's a short word, so you don't have to write too much. And it's not an abbreviation, so you don't have to spell it out or ever wonder what it means. ;-)
I could also think of the suffix "cache" or "cached" but a cache is usually filled on demand and invalidated some time later, which is usually not the case with denormalised columns. That data should exist at all times and never be outdated or missing.
The word "derived" is just a bit longer than "copy". I know that one special DBMS, an expensive one, has a column name limit of 30 characters, so that could be an issue.
If all of the values required to derive the calculation are in the table already, then it is extremely unlikely that you will gain any meaningful (or even measurable) performance benefit by persisting these calculated values.
I realize this doesn't answer the question directly, but it would seem that the premise is faulty: if such conditions existed for the question to apply, then you don't need to denormalize it to begin with.
We have a table layout with property names in one table, values in a second table, and items in a third. (Yes, we're re-implementing tables in SQL.)
We join all three to get a value of a property for a specific item.
Unfortunately the values can have multiple data types double, varchar, bit, etc. Currently the consensus is to stringly type all the values and store the type name in the column next to the value.
tblValues
DataTypeName nvarchar
Is there a better, cleaner way to do this?
Clarifications:
Our requirements state that we must add new "attributes" at run time without modifying the db schema
I would prefer not to use EAV, but that is the direction we are headed right now.
This system currently exists in SQL server using a traditional db design, but I can't see a way to fulfill our requirement of not modifying the db schema without moving to EAV.
There are really only two patterns for implementing an 'EAV model' (assuming that's what you want to do):
Implement it as you've described, where you explicitly store the property value type along with the value, and use that to convert the string values stored into the appropriate 'native' types in the application(s) that access the DB.
Add a separate column for each possible datatype you might store as a property value. You could also include a column that indicates the property value type, but it wouldn't be strictly necessary.
Solution 1 is a simpler design, but it incurs the overhead of converting the string values stored in the table into the appropriate data type as needed.
Solution 2 has the benefit of storing values as the appropriate native type, but it will necessarily require more, though not necessarily much more, space. This may be moot if there aren't a lot of rows in this table. You may want to add a check constraint that only allows one non-NULL value in the different value columns, or if you're including a type column (so as to avoid checking for non-NULL values in the different value columns), prevent mismatches between the value stored in the type column and which value column contains the non-NULL value.
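A sketch of Solution 2 in SQLite via Python (the column names are illustrative), with a check constraint enforcing that exactly one typed value column is non-NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
create table tblValues (
    item_id      integer not null,
    attribute    text    not null,
    string_value text,
    int_value    integer,
    float_value  real,
    -- exactly one of the typed columns may be set
    check ((string_value is not null)
         + (int_value is not null)
         + (float_value is not null) = 1)
)
""")
conn.execute("insert into tblValues (item_id, attribute, int_value) values (1, 'weight', 12)")
try:
    conn.execute("""insert into tblValues (item_id, attribute, string_value, int_value)
                    values (1, 'colour', 'red', 5)""")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True: setting two typed columns at once is rejected
```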
As HLGEM states in her answer, this is less preferred than a standard relational design, but I'm more sympathetic to the use of EAV model table designs for data such as application option settings.
Well don't do that! You lose all the values of having datatypes if you do. You can't properly constrain them (and will, I guarantee it, get bad data eventually) and you have to cast them back to the proper type to use in mathematical or date calculations. All in all a performance loser.
Your whole design will not scale well. Read up on why you don't want to use EAV tables in a relational database. It is not only generally slower but unusually difficult to query especially for reporting.
Perhaps a noSQL database would better suit your needs or a proper relational design and NOT an EAV design. Is it really too hard to figure out what fields each table would really need or are your developers just lazy? Are you sacrificing performance for flexibility - a flexibility that most users will hate? Especially when it means bad performance? Have you ever used a database designed that way to try to do anything?
Is it best to store the enum value or the enum name in a database table field?
For example, should I store 'TJLeft' as a string or its equivalent value in the database?
Public Enum TextJustification
TJLeft
TJCenter
TJRight
End Enum
I'm currently leaning towards the name, as someone could come along later and explicitly assign a different value.
Edit -
Some of the enums are under my control but some are from third parties.
Another reason to store the numeric value is if you're using the [Flags] attribute on your enumeration, in cases where you may want to allow multiple enumeration values. Say, for example, you want to let someone pick which days of the week they're available for something...
[Flags]
public enum WeekDays
{
Monday=1,
Tuesday=2,
Wednesday=4,
Thursday=8,
Friday=16
}
In this case, you can store the numeric value in the db for any combination of the values (for example, 3 == Monday and Tuesday)
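The same idea sketched with Python's IntFlag, mirroring the C# enum above:

```python
from enum import IntFlag

class WeekDays(IntFlag):
    Monday = 1
    Tuesday = 2
    Wednesday = 4
    Thursday = 8
    Friday = 16

# Store the combined numeric value in the database...
stored = int(WeekDays.Monday | WeekDays.Tuesday)
print(stored)  # 3

# ...and reconstruct the combination when reading it back.
days = WeekDays(stored)
print(WeekDays.Monday in days, WeekDays.Friday in days)  # True False
```

A string column can't round-trip such combinations without inventing a custom delimiter format.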
I always use lookup tables consisting of the fields
OID int (pk) as the numeric value
ProgID varchar (unique) as the value's identifier in C# (i.e. const name, or enum symbol)
ID nvarchar as the display value (UI)
dbscript lets me generate C# code from my lookup tables, so my code is always in sync with the database.
For your own enums, use the numeric values, for one simple reason: it allows for every part of enum functionality, out of the box, with no hassle. The only caveat is that in the enum definition, every member must be explicitly given a numeric value, which can never change (or, at least, not after you've made the first release). I always add a prominent comment to enums that get persisted to the database, so people don't go changing the constants.
Here are some reasons why numeric values are better than string identifiers:
It is the simplest way to represent the value
Database searching/sorting is faster
Lower database storage cost (which could be a serious issue for some applications)
You can add [Flags] to your enum and not break your code and/or existing data
For [Flags] stored in a string field:
Poorly normalized data
Could generate false-positive anomalies when doing matching (i.e., if you have members "Sales" and "RetailSales", merely doing a substring search for "Sales" will match on either type). This has to be constrained either by using a regex on word boundaries (finicky in databases, and slow), or by constraining the names in the enum itself, which is nonstandard, error-prone, and very difficult to debug.
For string fields (either [Flags] or not), if the database is obfuscated, this field has to be handled, which greatly affects the ability and efficiency when doing searching/sorting code, as mentioned in the previous point
You can rename any of the members without breaking the database code and/or existing client data.
Less over-the-wire data transfer space/time needed
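A tiny illustration of the substring false positive mentioned above (the member names are from the example; the comma-joined storage format is hypothetical):

```python
# A [Flags] value stored as a comma-joined string of member names.
stored = "RetailSales,Marketing"

# Substring search: matches "RetailSales" too -- a false positive.
print("Sales" in stored)             # True

# Exact member matching requires splitting on the delimiter first.
print("Sales" in stored.split(","))  # False
```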
There are only two situations where using the member names in the database may be an advantage:
If you're doing a lot of data editing manually... but who does that? And if you are, there's a good chance you're not going to be using an enum anyway.
Third-party enums where they may not be so diligent as to maintain the numeric value constants. But I have to say, anyone releasing a decently-written API is overwhelmingly likely to be smart enough to keep the enum values constant. (The identifiers have to stay the same since changing them would break existing code.)
On lookup tables, which I strongly discourage because they are a one-way bullet train to a maintenance nightmare:
Adding [Flags] functionality requires the use of a junction table, which means more complicated queries (existing ones need to be rewritten), and added complexity. What about existing client data?
If the identifier is stored in the data table, what's the point of having a lookup table in the first place?
If the numeric value is stored in the data table, you gain nothing since you still have to look up the identifier from the lookup table. To make it easier, you could create a view... for every table that has an enum value in it. And then let's not even think about [Flags] enums.
Introducing any kind of synchronization between database and code is just asking for trouble. What about existing client data?
Store an ID (value) and a varchar name; this lets you query on either way. Searching on the name is reasonable if your IDs (values) may get out of sync later.
It is better to use the integer representation... If you have to change the Enum later (add more values etc) you can explicitly assign the integer value to the Enum value so that your Enum representation in code still matches what you have in the database.
It depends on how important performance is versus readability. Databases can index numeric values a lot easier than strings, which means you can get better performance without using as much memory. It would also reduce the amount of data going across the wire somewhat. On the other hand, when you look at a numeric value in your database which you then have to refer to a code file to translate, that can be annoying.
In most cases, I'd suggest using the value, but you will need to make sure you're explicitly setting those values so that if you add a value in the future it doesn't shift the references around.
As often it depends on many things:
Do you want to sort by the natural order of the enums? Use the numeric values.
Do you work directly in the database using a low-level tool? Use the name.
Do you have huge amounts of data and performance is an issue? Use the number.
For me the most important issue is most of the time maintainability:
If your enums change in the future, names will either match correctly or fail hard and loud. With numbers, someone can add an enum instance in the middle, shifting the numbers of all the following members, so you have to update all the tables where the enum is used. And there is almost no way to know if you missed a table.
If you are trying to get the enum values stored in the database back, try this:
EnumValue = DirectCast([Enum].Parse(GetType(TextJustification), reader.Item("put_field_name_here").ToString), TextJustification)
Tell me if it works for you.