Alternatives to isActive - database

Marginally related to Should I delete or disable a row in a relational database?
Given that I am going to go with the strategy of warehousing changes to my tables in a history table, I am faced with the following options for implementing a status for a given row in MySQL:
An isActive booelan
An activeStatus enum
An activeStatus INT referencing a small ActiveStatus lookup table
An activeStatus INT not referencing another table
The first approach is rather inflexible in my opinion, since I might need more booleans in the future to support other types of active statuses (I'm not sure what they would be, but maybe something like "being phased out" or "active for a random group of users", etc).
I'm told that MySQL enum is bad, so the second approach probably won't fly.
I like the third approach, but I'm wondering if it is a heavy handed solution to a relatively small problem.
The fourth approach requires that we know in advance what each status INT means and seems like an outdated way to do things.
Is there a canonical right answer? Am I ignoring another approach?

Personally I would go with your third option.
Boolean values often turn out to be more complex in reality, as you suggested. ENUMs can be nice, but they have the downside that as soon as you want to store additional information about each value - who added it, when, is it only valid for a certain time period or source system, comments etc. - it becomes difficult, whereas with a lookup table those data can easily be maintained in additional columns. ENUMs are a good tool to constrain data to certain values (like a CHECK constraint), but not such a good tool if those values have significant meaning and need to be exposed to users.
It's not entirely clear from your question if you plan to treat your history table like a fact table and use it in reports, but if so then you could consider the ActiveStatus lookup table as a dimension. In this case a table is much easier, because your reporting tool can read the possible values from the dimension table in order to let the user choose his query conditions; such tools generally don't know anything about ENUMs.

From my point of view your 2nd approach is better if u have more than 2 status.Because ENUM is great for data that you know will fall within a static set. But if u have only two status active and inactive then its always better to use boolean.
EDIT:
If u r sure in future u r not gonna change the value of your ENUM then its great to use ENUM for such field.

Related

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by user.
The new version of the DB insted has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So, now when the user add a custom property, a new column is added to the current ENTITY_PROPERTY table until the max number of columns (managed by application) is reached, then a new table is created.
So, my question is: Is this a correct way to design a DB structure? Is this the only way to "increase performances"? The old structure required many join or sub-select, but this structute don't seems to me very smart (or even correct)...
I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.
No, it's not. It's terrible.
until the max number of column (handled by application) is reached,
then a new table is created.
This sentence says it all. Under no circumstance should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be like this.
Consider this:
You lose all type-safety as you have to store all values in the column "PROPERTY_VALUE"
Depending on your users, you could have them change the schema beforehand and then let them run some kind of database update batch job, so at least all the properties would be declared in the right datatype. Also, you could lose the entity_id/key thing.
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it
Maybe a RDBMS isn't the right thing for your app. Consider using a key/value based store like MongoDB or another NoSQL database. (http://nosql-database.org/)
From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know how many max custom properties a user might have, I'd say you'd better set the table number of columns to that value.
Then again, I'm not an expert, but making new columns on the fly isn't the kind of operations databases like. It's gonna bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties, or stick with the old system.
I believe creating a new table for each entity to store properties is a bad design as you could end up bulking the database with tables. The only pro to applying the second method would be that you are not traversing through all of the redundant rows that do not apply to the Entity selected. However using indexes on your database on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes and let the database engine determine the best methods for selecting the data rather than separating each entity property into a new table.
There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from entity where STANDARD_PROPERTY_1 = 'banana'.
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound where clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is e`(probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property, and how many columns are too many. The likelihood of entertaining bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.

Storing enum values in database

FYI: I explicitly mean SQL Server 2000-8 and C#. So DBMSs with enum support like MySql is not the subject of my question.
I know this question has been asked multiple times in SO. But still, I see in answers that different approaches are taken to store enum values in db.
Save enum as int in db and extract the enum value (or enum description attribute using reflection) in code:
this is the approach I usually use. The problem is when I try to query from database in SSMS, the retrieved data is hard to understand.
Save enum as string (varchar) in db and cast back to int in code.
Actually, this might the best solution. But (don't laugh!) it doesn't feel right. I'm not sure about the cons. (Except more space in db which is usually acceptable) So anything else against this approach?
Have a separate table in db which is synchronized with code's enum definition and make a foreign key relationship between your main table and the enum table.
The problem is when another enum value should be added later, Both code and db need to get updated. Also, there might be typos which can be a pain!
So in general when we can accept the overhead on db in 2nd solution, What would be the best way to store enum values in db? Is there a general definite design pattern rule about this?
Thanks.
There is no definite design rule (that I know of), but I prefer approach #1.
Is the approach I prefer. It's simple, and enums are usually compact enough that I start remembers what the numbers mean.
It's more readable, but can get in the way of refactoring or renaming your enumeration values when you want to. You lose some freedom of your code. All of the sudden you need to get a DBA involved (depending on where/how you work) just to change an enumeration value, or suffer with it. Parsing an enum has some performance impact as well since things like Locale come into play, but probably negligible.
What problem does that solve? You still have unreadable numbers in a table somewhere, unless you want to add the overhead of a join. But sometimes, this is the correct answer too depending on how the data is used.
EDIT:
Chris in the comments had a good point: If you do go down the numeric approach, you should explicitly assign values so you can re-order them as well. For example:
public enum Foo
{
Bar = 1,
Baz = 2,
Cat = 9,
//Etc...
}
One idea I've seen before which is your option 3 more or less
A table in the database (for foreign keys etc)
A matching Enum in the client code
A startup check (via database call) to ensure they match
The database table table can have a trigger or check constraint to reduce risk of changes. It shouldn't have any write permissions because the data is tied to a client code release, but it adds a safety factor in case the DBA bollixes up
If you have other clients reading the code (which is very common) then the database has complete data.

Database design question - which is the best solution?

I'm using Firebird 2.1 and I'm looking for the best way to solve this issue.
I'm writing a calendaring application. Different users' calendar entries are stored in a big Calendar table. Each calendar entry can have a reminder set - only one reminder/entry.
Statistically, the Calendar table could grow to hundreds of thousands of records over time, while there are going to be much less reminders.
I need to query the reminders on a constant basis.
Which is the best option?
A) Store the reminders' info in the Calendar table (in which case I'm going to query hundreds of thousands of records for IsReminder = 1)
B) Create a separate Reminders table which contains only the ID of calendar entries which have reminders set, then query the two tables with a JOIN operation (or maybe create a view on them)
C) I can store all information about reminders in the Reminders table, then query only this table. The downside is that some information needs to be duplicated in both tables, like in order to show the reminder, I'll need to know and store the event's starttime in the Reminders table - thus I'm maintaining two tables with the same values.
What do you think?
And one more question: The Calendar table will contain the calender of multiple users, separated only by a UserID field. Since there can be only 4-5 users, even if I put an index on this field, its selectivity is going to be very bad - which is not good for a table with hundreds of thousands of records. Is there a workaround here?
Thanks!
There are advantages and drawbacks to all three choices. Whis one is best depends on details you have not provided. In general, don't worry too much about selecting three or four entries out of a hundred thousand, provided the indexes you have set up allow the right retrieval strategy. If don't understand indexing, you're likely to be in trouble no matter which of the three choices you make.
If it were me, I would go with choice B. I'd also store any attributes of a reminder in the table for reminders.
Be very careful about whether you identify an event by EventId alone or by (UserId, EventId). If you choose the latter, it behooves you to use a compound primary key for the Event table. Don't worry too much about compound primary keys, especially with Firebird.
If you declare a compound primary key, be aware that declaring (UserId, EventId) will not have the same consequences as declaring (EventId, UserId). They are logically equivalent, but the structure of the automatically generated index will be different in the two cases.
This in turn will affect the speed of queries like "find all the reminders for a given user".
Again, if it were me, I'd avoid choice C. the introduction of harmful redundancy into a schema carries with it the responsibility for some very careful programming when you go to update the data. Otherwise, you can end up with a database that stores contradictory versions of the same fact in different places of the database.
And, if you really want to know the effect on perfromance, try all three ways, load with test data, and do your own benchmarks.
I think you need to create realistic, fake user data and measure the difference with some typical queries you expect to run.
Indexing, query optimization and the types of query results you need can make a big difference,
so it's not easy to say what's best without knowing more.
When choosing Option (A) you should
provide an index on "IsReminder" (or a combined index on IsReminder, UserId, whatever fits best to your intended queries)
make sure your queries use this index
Option B is preferable over A if you have more than a boolean flag for each reminder to store (for example, the number of minutes the user shall be notified before the event). You should, however, make some guessing how often in your program you will have to JOIN both tables.
If you can, avoid option C. If you don't want to benchmark all three cases, I suggest start with A or B, according to the described circumstances, and probably the solution you choose will be fast enough, so you don't have to bother with the other cases.

Enums in the DB or NO Enums in the DB

For me, the classic wisdom is to store enum values (OrderStatus, UserTypes, etc) as Lookup tables in your db. This lets me enforce data integrity in the database, preventing false or null values, etc.
However more and more, this feels like unnecessary duplication to me. Not only do I have to create tables for these values (or have an unwieldy central lookup table), but if I want to add a value, i have to remember to add it to 2 (or more, counting production, testing, live db's) and things can get out of sync easily.
Still I have a hard time letting go of lookup tables.
I know there are probably certain scenarios where one had an advantage over the other, but what are your general thoughts?
I've done both, but I now much prefer defining them as in classes in code.
New files cost nothing, and the benefits that you seek by having it in the database should be handled as business rules.
Also, I have an aversion to holding data in a database that really doesn't change. And it seems an enum fits this description. It doesn't make sense for me to have a States lookup table, but a States enum class makes sense to me.
If it has to be maintained I would leave them in a lookup table in the DB. Even if I think they won't need to be maintained I would still go towards a lookup table so that if I am wrong it's not a big deal.
EDIT:
I want to clarify that if the Enum is not part of the DB model then I leave it in code.
I put them in the database, but I really can't defend why I do that. It just "seems right". I guess I justify it by saying there's always a "right" version of what the enums can be by checking the database.
Schema dependencies should be stored in the database itself to ensure any changes to your architecture can be easily perform transparently to the app..
I prefer enums as it enforces early binding of values in code, so that exceptions aren't caused by missing values
It's also helpful if you can use code generation that can bring in the associations of the integer columns to an enumeration type, so that in business logic you only have to deal with easily memorable enumeration values.
Consider it a form of documentation.
If you've already documented the enum constants properly in the code that uses the dB, do you really need a duplicate set of documentation (to use and maintain)?

Default database IDs; system and user values

As part of our current database work, we are looking at a dealing with the process of updating databases.
A point which has been brought up recurrently, is that of dealing with system vs. user values; in our project user and system vals are stored together. For example...
We have a list of templates.
1, <system template>
2, <system template>
3, <system template>
These are mapped in the app to an enum (1, 2, 3)
Then a user comes in and adds...
4, <user template>
...and...
5, <user template>
Then.. we issue an upgrade.. and insert as part of our upgrade scripts...
<new id> [6], <new system template>
THEN!!... we find a bug in the new system template and need to update it... The problem is how? We cannot update record using ID6 (as we may have inserted it as 9, or 999, so we have to identify the record using some other mechanism)
So, we've come to two possible solutions for this.
In the red corner (speed)....
We simply start user Ids at 5000 (or some other value) and test data at 10000 (or some other value). This would allow us to make modifications to system values and test them up to the lower limit of the next ID range.
Advantage...Quick and easy to implement,
Disadvantage... could run out of values if we don't choose a big enough range!
In the blue corner (scalability)...
We store, system and user data separately, use GUIDs as Ids and merge the two lists using a view.
Advantage...Scalable..No limits w/regard to DB size.
Disadvantage.. More complicated to implement. (many to one updatable views etc.)
I plump squarely for the first option, but looking for some ammo to back me up!
Does anyone have any thoughts on these approaches, or even one(s) that we've missed?
I have never had problems (performance or development - TDD & unit testing included) using GUIDs as the ID for my databases, and I've worked on some pretty big ones. Have a look here, here and here if you want to find out more about using GUIDs (and the potential GOTCHAS involved) as your primary keys - but I can't recommend it highly enough since moving data around safely and DB synchronisation becomes as easy as brushing your teeth in the morning :-)
For your question above, I would either recommend a third column (if possible) that indicates whether or not the template is user or system based, or you can at the very least generate GUIDs for system templates as you insert them and keep a list of those on hand, so that if you need to update the template, you can just target that same GUID in your DEV, UAT and /or PRODUCTION databases without fear of overwriting other templates. The third column would come in handy though for selecting all system or user templates at will, without the need to seperate them into two tables (this is overkill IMHO).
I hope that helps,
Rob G
I recommend using the second with the modification that you store the system and user values in one table. GUID is quite reliable in this manner.
Another idea: use any text-based ID (not necessary GUID), which you give for the system values and is generated by a random string or a string based on some kind of custom logic for the user values.
Another idea: use the first approach, but extend the table with a flag which shows if a value is system or user. Maybe this is the easiest. Ok, you have to write some kind of mechanism to update the correct system value, but it can be done easily.
+1 for Biri's text based ID - define a "template_mnemonic" text based column and make it the primary key. This will be a known value when you insert it as you, the developers will have decided on it (or auto-generated it) and you will always be able to reference a template by its mnemonic regardless of how many user specified templates there are. It also allows users to have a meaningful naming convention for their templates.
Maybe I didn't get it, but couldn't you use GUIDs as Ids and still have user and system data together? Then you can access the system data by the (non-changable) GUIDs.
I don't think that GUID should make any problem.
If you want to avoid it, then use a flag:
ID int
template whatever
flag enum/int/bool
Flag shows whether the actual value is a system or a user value.
If you would like to update a system value, then ask only for system values ordered by ID, and it will show you actual order of insertion (you should have a bigint or something for ID to make sure that it doesn't get full and it doesn't get the deleted IDs back to work). With this list the x. record is the x. inserted system value.
I think there is a better third solution.
It strikes me that you're storing two different things in the same table and that you might be better off creating 2 separate tables one for user templates and one for system templates. You might then be able to create a view over the two tables to make them appear as a single object to your application.
Obviously I don't have full knowledge of your application and this may be impossible for you for any number of reasons but I think it's a neater solution than GUIDs and way safer than ranges of IDs (seriously don't do ID ranges it'll bite you one day)

Resources