Conflicting desires in Database Design, with fields of two similar functions

Okay, so I'm making a table right now for "Box Items".
Now, a Box Item, depending on what it's being used for/the status of the item, may end up being related to a "Shipping" box or a "Returns" box.
A Box Item may be defective: if it is, a flag will be set in the Box Item's row (IsDefective), and the Box Item will be put in a "Returns" box (with other items to be returned to that vendor). Otherwise, the Box Item will eventually be put into a "Shipping" box (with other items to be shipped). (Note that Shipping and Returns boxes have their own tables: there's not one common table for all boxes... though maybe I should consider doing that if possible as a third possibility?)
Maybe I'm just not thinking clearly today, but I started questioning what should be done in this situation.
My gut tells me that I should have a separate field for each possible relation, even if only one of the relations can happen at any given time, which would make the schema for Box Items look like:
BoxItemID
Description
IsDefective
ShippingBoxID
ReturnBoxID
etc...
This would make the relations clear, but it seems wasteful (since only one of the relations will be used at any time). So then I thought I could have just one field for the BoxID, and determine which BoxID it's referring to (a Shipping or a Returns Box ID) based on the IsDefective field:
BoxItemID
Description
IsDefective
BoxID
etc...
This seems less wasteful, but doesn't sit right with me. The relation isn't obvious.
So, I put it to you, database gurus of Stackoverflow. What would you do in this situation?
EDIT: Thank you everyone for your input! It's given me a lot to think about. For one, I'm going to use an ORM next time I start a project like this. =) For two, since I'm not right now, I'll bite the four bytes and use two fields.
Thanks everyone again!
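For reference, a minimal sketch of the two-field design settled on above, in generic SQL (the box tables, column types and CHECK support are assumptions; some engines only parse but do not enforce CHECK). The constraint keeps the two relations mutually exclusive:

CREATE TABLE BoxItem (
    BoxItemID     INT PRIMARY KEY,
    Description   VARCHAR(255),
    IsDefective   BOOLEAN NOT NULL DEFAULT FALSE,
    ShippingBoxID INT NULL,
    ReturnBoxID   INT NULL,
    FOREIGN KEY (ShippingBoxID) REFERENCES ShippingBox (ShippingBoxID),
    FOREIGN KEY (ReturnBoxID)   REFERENCES ReturnBox (ReturnBoxID),
    -- at most one of the two relations may be populated at a time
    CHECK (ShippingBoxID IS NULL OR ReturnBoxID IS NULL)
);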

I'm with Psychotic Venom and mattlant.
Going the polymorphic route (having to figure out which table your foreign key points to based on the contents of another field) is going to be a pain. Coding the constraints for that may be tough (I'm not sure most databases would support it natively; I think you'd have to use a trigger).
Do items ever move between the tables? Sticking with two tables with identical definitions, where one is for returns and one is for shipping, may be the easiest route. If you want to stick with the definition you first proposed (with the two separate fields), that is perfectly reasonable too.
"Premature optimization is the root of all evil" and all that. While it seems wasteful, remember what you're storing. Since they are IDs they are probably just integers, maybe 4 bytes. Wasting four bytes per record is basically nothing. In fact, due to padding to put things on even addresses or other such things it may be "free" to put that extra field in there. It all depends on the DB design.
Unless you have a very good reason to go the polymorphic route (like you're on an embedded system with little memory or you have to replicate across some really slow 9600bps link) it probably won't be worth the headaches you can end up with. Having to write all those special cases into your queries can get annoying.
Quick example: doing a join between two tables where whether you join at all depends on whether the isDefective flag is set is going to be a pain. Being able to just use one of the two columns on its own is probably enough hassle saved to make it worthwhile, at least for me.
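To illustrate the special-casing: with the single-BoxID design, a query that just needs the box description ends up doing something like the following (a hedged sketch; the table and column names are assumptions):

SELECT bi.BoxItemID,
       bi.Description,
       COALESCE(sb.Description, rb.Description) AS BoxDescription
FROM BoxItem bi
LEFT JOIN ShippingBox sb
       ON sb.ShippingBoxID = bi.BoxID AND bi.IsDefective = FALSE
LEFT JOIN ReturnBox rb
       ON rb.ReturnBoxID = bi.BoxID AND bi.IsDefective = TRUE;

With two dedicated columns, each join condition collapses to a plain equality on its own foreign key.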

I would consider making a single table for the boxes, with the box type as a column of the box table. This would simplify the relationships and still make it easy to query by box type, so the box item only needs one foreign key, to the BoxID.

I'd use what Hibernate calls Table-per-subclass, so my DB would wind up with 3 tables for Boxes: Box, ShippingBox, and ReturnBox. The FK in BoxItem would point to Box.
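A rough DDL sketch of that table-per-subclass layout (the subtype-specific columns are placeholders, not part of the original answer):

CREATE TABLE Box (
    BoxID       INT PRIMARY KEY,
    Description VARCHAR(255)
);

CREATE TABLE ShippingBox (
    BoxID       INT PRIMARY KEY,
    -- shipping-specific columns would go here
    FOREIGN KEY (BoxID) REFERENCES Box (BoxID)
);

CREATE TABLE ReturnBox (
    BoxID       INT PRIMARY KEY,
    -- return-specific columns would go here
    FOREIGN KEY (BoxID) REFERENCES Box (BoxID)
);

CREATE TABLE BoxItem (
    BoxItemID   INT PRIMARY KEY,
    Description VARCHAR(255),
    IsDefective BOOLEAN,
    BoxID       INT,
    FOREIGN KEY (BoxID) REFERENCES Box (BoxID)
);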

What you're talking about is polymorphic relations. A single ID that can reference multiple other tables. There are several frameworks that support this, however, it is (potentially) bad for database integrity (that could be a whole other discussion whether or not your database or your application should maintain referential integrity).
What about this?
BoxItem:
BoxItemID, Description, IsDefective
Box:
BoxID, Description
BoxItemMap:
BoxID, BoxItemID, BoxItemType
Then you can have BoxItemType be an enumeration, or an integer where you define constants in your application as "Return" or "Shipping" as the type of box.
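In DDL form, that suggestion might look something like this (a sketch; the exact types are assumptions):

CREATE TABLE BoxItem (
    BoxItemID   INT PRIMARY KEY,
    Description VARCHAR(255),
    IsDefective BOOLEAN
);

CREATE TABLE Box (
    BoxID       INT PRIMARY KEY,
    Description VARCHAR(255)
);

CREATE TABLE BoxItemMap (
    BoxID       INT NOT NULL,
    BoxItemID   INT NOT NULL,
    BoxItemType INT NOT NULL,  -- e.g. 1 = Shipping, 2 = Return, defined as constants in the application
    PRIMARY KEY (BoxID, BoxItemID),
    FOREIGN KEY (BoxID)     REFERENCES Box (BoxID),
    FOREIGN KEY (BoxItemID) REFERENCES BoxItem (BoxItemID)
);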

I agree with the polymorphic discussion above: although it has the potential to be used poorly, it is still a viable solution.
Basically you have a base table called Box. Then you have two other tables, ShippingBox and ReturnBox, which add any extra fields that are special to them; they are related to Box with a 1:1 foreign key. The Box base table has the common fields of all box types.
You relate BoxItem to the Box table. The way you get the proper box type is by doing a query that joins the child box with the root box on the key. A record that exists in both the base Box table and a child table is of that child's type.
You just have to be careful, as mentioned, that when you create a box type it is done correctly. But that's what testing is for, and the code to add them only needs to be written once. Or use an ORM.
Almost all ORMs support this strategy.
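The lookup described above might be sketched like this (table and column names assumed from the discussion):

-- A box that has a row in both the base table and a child table is of that child's type
SELECT b.BoxID,
       CASE
           WHEN sb.BoxID IS NOT NULL THEN 'Shipping'
           WHEN rb.BoxID IS NOT NULL THEN 'Return'
       END AS BoxType
FROM Box b
LEFT JOIN ShippingBox sb ON sb.BoxID = b.BoxID
LEFT JOIN ReturnBox   rb ON rb.BoxID = b.BoxID;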

I'd go with just a single BoxItems table with IsDefective, ShippingBoxID, the shipping-box-related fields, ReturnBoxID and the return-box-related fields. Some fields will always be NULL for each record.
This is a very simple and self-evident design that the next developer is unlikely to be confused by. In theory this design is inefficient because of the guaranteed empty fields for each row. In practice, databases tend to have a minimum required storage size for each row anyway, so (unless the number of fields is huge) this design is about as efficient as it can be, and much easier to code against.

I'd probably go with:
BoxTable:
box_id, box_descrip, box_status_id ...
1, Lovely Box, 1
2, Borked box, 2
3, Ugly Box, 3
4, Flammable Box, 4
BoxStatus:
box_status_id, box_status_name, box_type_id, ....
1, Shippable, 1
2, Return, 2
3, Ugly, 2
4, Dangerous, 3
BoxType:
box_type_id, box_type_name, ...
1, Shipping box, ...
2, Return box, ....
3, Hazmat box, ...
That way the Box Status defines the box type, and it's flexible if you need to expand into a few more status levels or box types later on.
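A sample query against that layout, just to show how the status resolves the type (column names taken from the listing above):

SELECT b.box_id,
       b.box_descrip,
       bs.box_status_name,
       bt.box_type_name
FROM BoxTable b
JOIN BoxStatus bs ON bs.box_status_id = b.box_status_id
JOIN BoxType   bt ON bt.box_type_id   = bs.box_type_id;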

Related

System design: whether to normalize the departments or not

I'm working with two consultants on one project. We've reached a point where the two of them cannot come to an agreement and each offers a different approach.
We have a store with four departments and we want to find the best approach for working with all of them in the same database.
Each department sells different products: Cars, Boats, Jet Skis and Motorbikes.
When data is inserted or updated in each department, triggers fire so that different workflows begin; when adding a new car there are certain requirements that need to be checked, as well as details of the car that are completely different from a boat's. Also, regarding the data, there are not many fields in common; so far only the brand, color, model and year. Everything else is specific to each department due to the different products and how they work with them.
Consultant one says:
Create one table for all the departments and use a column to identify which department the row belongs to. This way you will have only one trigger, and inside the trigger you call the function/method you need for each record type.
Reason: you only have one table (with over 200 fields) and one trigger, which is easier to maintain. Also, if you need to report you just query one table and filter on the record type, and if you need to report on all the items you don't need multiple joins.
Consultant two says:
Create one table for each department and a trigger for each table.
Reason: you will have smaller tables (approx. 50 fields each), it is more flexible, and everything is kept separate. If you want to report across departments you need to join the tables to include data from different places.
I see the advantages of having everything in one place, but if I want to expand or change anything I have the feeling I will be creating a beast of a table as the data grows.
On the other side, keeping things separated looks more appealing, but I will need to set everything up for each different table.
What would you say is the best approach?
You should probably listen to consultant number two.
The thing is, all design is trade-offs. You need to assess the pros and cons of each approach and you need to think about the risks that each design entails.
What happens when your design grows? (department 5, more details per product type,...)
What happens when the system scales up to higher transaction volumes?
What happens when your business rules change?
I've been doing this for a long time and I've seen some pendulums swing back and forth when it comes to what is "in fashion" as far as database and software best practices.
I'd say right now the prevailing wisdom is that separation of concerns is innately good. This means you should keep your program logic (trigger code) separate for each department. This makes sense because your logic will vary from one product type to the next since they mostly have distinct columns.
This second point is also important, because your stake in the ground for a transactional system should always be to start with third normal form (or higher, if necessary). Sometimes you can get away without it, but four different types of objects with 40 or more distinct attributes each doesn't sound like a good candidate for jamming everything into one table. How do you keep track of which columns belong to which type of product, for example? A separate table for each product type keeps this clean and simple - and, importantly, easy for your support programmers to understand.
Contrary to what consultant one is saying, having one trigger instead of four is not likely to be easier to maintain if that one trigger is a big bowl of spaghetti, or even four tidy, well written subroutines joined together with a switch type statement.
These days, programmers favour short, atomic, single-purpose functions (triggers, in your case).
If there is enough common data and common business logic that doing it four times seems awkward, then maybe you have a good candidate for a super-type / sub-type design.
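For what it's worth, a super-type / sub-type layout for this case might be sketched roughly like this (table and column names are illustrative only; the common columns come from the question):

CREATE TABLE Product (
    ProductID INT PRIMARY KEY,
    Brand     VARCHAR(100),
    Color     VARCHAR(50),
    Model     VARCHAR(100),
    ModelYear INT
);

CREATE TABLE Car (
    ProductID INT PRIMARY KEY,
    -- the ~50 car-specific columns go here
    FOREIGN KEY (ProductID) REFERENCES Product (ProductID)
);

CREATE TABLE Boat (
    ProductID INT PRIMARY KEY,
    -- the boat-specific columns go here
    FOREIGN KEY (ProductID) REFERENCES Product (ProductID)
);

-- ...and likewise for Jetski and Motorbike, each with its own trigger.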
I'll say one.
These are all Products; it doesn't matter that one is a Bike and another is a Car. You can control the fields and the object with Record Types and Page Layouts, and that will save you from having 4 objects, which means potentially 8 new classes (if it follows my pattern it could be up to 20+) plus all of the workflow rules and validation rules across these new objects. It will be very hard to maintain a structure that has 4 objects which are all the same thing: tracking Products.
Down the road, if you decide to add a new product such as planes, it will be very easy to add a plane to this object and the code will be able to pick it up from there if needed. You will definitely need Record Types to manage each Product. The trigger code shouldn't be an issue if the consultants are building it properly: a trigger should never contain any business logic, so as long as that is followed, all of the code will be maintainable.
I will go with one.
I assume you have a large number of products and this list will grow in future. At the end of the day these are all Products; they will have some common fields and common logic.
If you use Process Builder with Invocable classes instead of Triggers, you may be able to get away with just configuration changes when adding a new object, if its fields and functionality are the same as or similar to an existing object.
There may also be limitations on the number of different objects a profile has access to, based on your license types.
Salesforce has a standard object called Product. It's a single object that is classified based on record type.
I would have gone with approach two if this were not Salesforce. Based on how Salesforce works and the limitations it imposes, one seems like the better and cleaner solution.
I would say option 2.
Why?
(1) I would find one table with 200+ columns harder to maintain. You're also then going to have to expose fields for an object that doesn't need said fields.
(2) You are also going to have to "hide" logic inside the trigger which then decides to do different actions based on the type of department etc...
(3) Option 2 involves more "scaffolding" and separate objects, but those objects are inherently smaller and easier to maintain and don't hide logic or cause any sort of ambiguity.
(4) Option 2 abides by the single responsibility principle. Not everyone follows this, I understand, but I find it a good guiding principle: the responsibility for the data lies with the individual table and the responsibility for triggering the action lies with the individual trigger, as opposed to one mammoth entity/trigger.
** I would state that I am simply looking at this from a software development perspective, I am not sure whether or not SalesForce would handle this setup, but it is the way I would personally prefer to design it. :)
Option 2 for me.
You've said that there is little common data and the trigger logic is completely different. Here are some additional technical considerations.
Option 1 Warnings
The trigger would be a single point of failure and errors will be trickier to debug. I have worked with large triggers where broken logic near the top has stopped logic near the bottom from running, sometimes silently! You also have to maintain conditional guards to control the flow of logic based on the data which is another opportunity for error.
I'm not red hot on indexes but I believe performance will suffer due to no natural order of the multi-purpose data. More specific tables will yield better indexing strategies. Also, large rows can lead to fragmented indexes.
https://blogs.msdn.microsoft.com/pamitt/2010/12/23/notes-sql-server-index-fragmentation-types-and-solutions/
You would need extra consideration when setting nullable/default constraints on each surplus field not relevant to the product in question. These subtleties can introduce bugs and might make it harder if/when you decide to work with a data layer technology such as Entity Framework. E.g. the logical difference between NULL, 0 and 'None', especially on shared columns.

Alternatives to isActive

Marginally related to Should I delete or disable a row in a relational database?
Given that I am going to go with the strategy of warehousing changes to my tables in a history table, I am faced with the following options for implementing a status for a given row in MySQL:
An isActive boolean
An activeStatus enum
An activeStatus INT referencing a small ActiveStatus lookup table
An activeStatus INT not referencing another table
The first approach is rather inflexible in my opinion, since I might need more booleans in the future to support other types of active statuses (I'm not sure what they would be, but maybe something like "being phased out" or "active for a random group of users", etc).
I'm told that MySQL enum is bad, so the second approach probably won't fly.
I like the third approach, but I'm wondering if it is a heavy handed solution to a relatively small problem.
The fourth approach requires that we know in advance what each status INT means and seems like an outdated way to do things.
Is there a canonical right answer? Am I ignoring another approach?
Personally I would go with your third option.
Boolean values often turn out to be more complex in reality, as you suggested. ENUMs can be nice, but they have the downside that as soon as you want to store additional information about each value - who added it, when, is it only valid for a certain time period or source system, comments etc. - it becomes difficult, whereas with a lookup table those data can easily be maintained in additional columns. ENUMs are a good tool to constrain data to certain values (like a CHECK constraint), but not such a good tool if those values have significant meaning and need to be exposed to users.
It's not entirely clear from your question if you plan to treat your history table like a fact table and use it in reports, but if so then you could consider the ActiveStatus lookup table as a dimension. In this case a table is much easier, because your reporting tool can read the possible values from the dimension table in order to let the user choose his query conditions; such tools generally don't know anything about ENUMs.
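A minimal sketch of the third option in MySQL terms (the names and extra columns are assumptions), showing where the additional metadata would live:

CREATE TABLE ActiveStatus (
    activeStatusId INT PRIMARY KEY,
    name           VARCHAR(50) NOT NULL,
    addedBy        VARCHAR(50),      -- room for the "who added it, when, comments" metadata
    addedAt        DATETIME,
    comments       VARCHAR(255)
);

INSERT INTO ActiveStatus (activeStatusId, name) VALUES
    (1, 'Active'),
    (2, 'Inactive'),
    (3, 'Being phased out');   -- hypothetical extra status from the question

CREATE TABLE SomeEntity (
    id             INT PRIMARY KEY,
    activeStatusId INT NOT NULL,
    FOREIGN KEY (activeStatusId) REFERENCES ActiveStatus (activeStatusId)
);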
From my point of view your 2nd approach is better if you have more than 2 statuses, because ENUM is great for data that you know will fall within a static set. But if you have only two statuses, active and inactive, then it's always better to use a boolean.
EDIT:
If you are sure that you are not going to change the values of your ENUM in the future, then it's fine to use an ENUM for such a field.

Too many columns in a single preference db table?

I have an application that is essentially built out of many smaller applications. Each application has their own individual preferences, but all of them share the same 5 preferences, for example, whether the application is displayed in the nav, whether it is public, whether reports should be generated, etc.
All of these common preferences need to be known by any page in the web app because the navigation is constructed from it. So originally I put all these preferences in a single table. However as the number of applications grow (10 now, eventually around 30), the number of columns will end up being around 150-200 total. Most of these columns are just booleans, but it still worries me having that many columns in one table. On the other hand, if I were to split them apart into separate tables (preferences per app), I'd have to join them all together anyway every time I need to see the preferences, so why not just leave them all together?
In the application I can break the preferences into smaller objects so they are easier to work with, but from a db perspective they are a single entity. Is it better to leave them in one giant table, or break them apart into smaller ones but force many joins every time they are requested?
Which database engine are you using? Normally you will find recommendations about the number of columns per table for your DB engine, mostly row-size limitations, which should keep you safe.
Other options and suggestions include:
Assign a bit per config key in an integer, and use the logical AND operation to read just the key you are interested in at a given point in time: a single value read from the DB and one quick logical operation per config key (see the sketch after this list).
Cache the preferences in memory for fewer round trips to the DB servers. Depending on the frequency of changes, you may also have to clear the cached preference whenever it is updated.
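A hedged sketch of the bit-per-key idea, assuming MySQL-style bitwise operators (the flag values and table name are assumptions):

-- bit 1 = show in nav, bit 2 = public, bit 4 = generate reports, ...
CREATE TABLE app_preferences (
    app_id INT PRIMARY KEY,
    flags  INT UNSIGNED NOT NULL DEFAULT 0
);

-- Read a single key: is "generate reports" (bit value 4) set for app 7?
SELECT (flags & 4) <> 0 AS reports_enabled
FROM app_preferences
WHERE app_id = 7;

-- Turn that flag on without touching the others
UPDATE app_preferences SET flags = flags | 4 WHERE app_id = 7;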
Why not turn the columns into rows and use something like this:
This is a typical approach for maintaining lists of settings values.
The APP_SETTING table contains the value of the setting. The SETTING table gives you the context of what the setting is.
There are ways of extending this to add information such as which settings apply to which applications and whether or not the possible values for a particular setting are constrained to a specific list.
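A rough sketch of what those two tables might look like (column names are assumptions):

CREATE TABLE SETTING (
    setting_id   INT PRIMARY KEY,
    setting_name VARCHAR(100) NOT NULL,   -- e.g. 'ShowInNav', 'IsPublic', 'GenerateReports'
    description  VARCHAR(255)
);

CREATE TABLE APP_SETTING (
    app_id        INT NOT NULL,
    setting_id    INT NOT NULL,
    setting_value VARCHAR(255),           -- the value of the setting for that application
    PRIMARY KEY (app_id, setting_id),
    FOREIGN KEY (setting_id) REFERENCES SETTING (setting_id)
);

Adding a new application or a new setting then becomes an INSERT rather than a schema change.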
Well CommonPreferences and ApplicationPreferences would certainly make sense, and perhaps even segregating them in code (two queries instead of a join).
After that a table per application will make more sense.
Another way is going down the route suggested by Joel Brown.
A third would be, instead of having individual columns or a row per setting, to stuff all the non-common ones into an XML snippet or serialize them from a preferences class.
Which decision you make revolves around how your application does (or could) use the data.
If you go down the settings-table route, getting an application's settings as a single row will be, erm, painful. Go down the XML snippet route and querying for a setting across applications will be even more painful than several joins.
No way to say what you should compromise on from here. I think I'd go for CommonPreferences first and see where I was at after that.

Examples of good UI for selecting multiple records

I'm currently revisiting an area of my Windows-based software and looking at changing the relationship from 1->M to M->M. As a result, I need to adjust the UI to accommodate selecting multiple related records.
There are a lot of ways to handle this that are common, but usually pretty clunky. Examples include the two-pane list of all items, and list of selected items, or a list of all records and a checkbox beside each one that applies.
In my case, there may be an awful lot (in the tens of thousands) of records that could be associated, so I'll probably need to include some kind of search mechanism.
I'm not looking for a hard and fast answer -- I can implement something pretty easily that's functional, I'm looking to see if anyone here has come up with (or seen) any great UIs for doing this kind of thing, whether it's web based, Windows, Mac, Unix, whatever.
Images or links to them would be appreciated!
Edit: here's an example of what I'm considering:
I like the way StackOverflow relates many tags with many questions:
Items are displayed as user types
You start obviously with the record you want to associate multiple items with.
As you type the search displays the matches ( no need to press on "Search" )
The user selects the desired record. (Sorting would be nice. SO uses "tag relevance": for instance, typing 'a' brings up Java rather than asp because Java has more questions than asp; in your case the relevance might be the user name.)
The system creates the relationship (in memory).
If a number of records (5+) are filling the input field, they are moved into a semi-rigid area (not an SO problem because it only has 5 tags within a single question, but in your case something like the "interesting tags" feature would be needed).
Associated items are moved to a "rigid" area
Of course in an ordered manner ( using a table )
Finally, when the user is done with the association, they click the SAVE or CANCEL button.
This approach is more efficient because the user never needs to press "Search" or "Add other", which would distract them from what they're doing; it's said that it interrupts their train of thought.
Also, if you make the user grab the mouse to click on something while they are typing, the UI is less efficient (I think there is something called Hick's law about that, but quite frankly I may be wrong).
As you can see, this approach is pretty much what you already have in mind, but with some facilities added to make the user happier. (The danger would be if the user loves this approach and wants it in other parts of the system.)
It's an interesting and fairly common UI problem: how to efficiently select items. I'm assuming that you intend to have the user first select a single item, and that the mechanism you are interested in is how to choose the other items that get related to this first single item.
There are various selection methods. From a usability standpoint, it would be preferable to use just ONE method for each scenario. Then when the user sees it, they will know what to do.
Various selection techniques:
1. Dropdown list - obvious for single selects.
2. Open list multi-select - e.g. a multiline textbox that shows 10 or 20 lines and has a scroll bar.
3. Dropdown list where you select, then hit an 'add' link or button, to build up multiple selections.
4. List moving - where you have two open lists, with all the choices available in the left list; you select a few, then click a button to move your selection to the right list.
5. Check boxes - good for just a few choices of multiple selection possibilities.
6. List of items, each with an 'add' button next to them - good for short lists.
You've said that you'll have thousands of possible choices, so that eliminates 1 and 5. Really, thousands will eliminate all of them, as the usability doesn't scale well with more than a few hundred in the list.
If you can count on the user to filter the list, like in your example, then 6 may be suitable. Think of how Facebook picture tagging works; I think it is fairly efficient for long lists. Background: Facebook picture tagging is a mechanism that allows you to assign one or more people to portions of an image, i.e. 'tag' them.
When you select an image to tag (i.e. the 'single item') and wish to relate other items (people) to it, a dialog box pops up. It contains the top 6 or so names that you've used in the past, and a textbox where you can start to type the name of the person you wish to use. As you type, the list dynamically changes to show only those people whose names contain the letter sequence you've typed. This works very well for large lists, but it does rely on the user typing to filter. It also relies on scripting to intelligently reduce the list based on the user's input.
For your application it would rely on the user performing this step once for each association, as I'm assuming that the other items won't all have similar names!
Here's an image of the Facebook tagging application: http://screencast.com/t/9MPlpJQJzBQ
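If the candidate records live in a database, the as-you-type filtering described above usually reduces to a prefix query of roughly this shape (purely illustrative names; the real matching could be smarter):

SELECT person_id, full_name
FROM people
WHERE full_name LIKE 'jo%'      -- 'jo' stands in for whatever the user has typed so far
ORDER BY full_name
LIMIT 10;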
A search feature that filters records in real time as you type would probably be a good idea to include. Another would be the possibility to sort the records.
Since there may be a lot of records, the best choice in this case is probably to have a separate area which displays what you have already chosen, so that the user won't have to scroll around the selection areas to find what they already have.
Mockup of the suggested GUI: http://img25.imageshack.us/img25/8568/28666917.png
Another thing: in my opinion your problem is not about selecting multiple records, but about filtering those tens of thousands of records. An M->M association can be implemented in a variety of ways, but the tricky part is providing a convenient and logical way to browse/search the huge amount of data.
I'd suggest not requiring a click on "Add more" before the user can search. The warning at the right is nice, but IMHO it should just say that the search displays results as the user types.
Sorting a column (maybe along with the search) would also be nice functionality. I'd suggest doing it by clicking on the header of the table, with an icon indicating whether the sort is ascending or descending.
I'd also suggest having the search do approximate string matching when there are no or few results. It is so annoying not being able to find something you don't remember exactly.
Finally, for testing the first impression (though not the functionality itself), I'd suggest uploading it to the 5 second test and see what you get.
I think that what you have mocked up is a pretty good way to do it. When you think about the tags-to-posts relationship on a blog (or on SO even), that is many-to-many and it is usually implemented very similarly: for one post, you search for (or, since they are simple strings, directly enter) as many tags as you want to associate with it. I can't really think of any many-to-many relationships I encounter often, although I know there are probably many...
There are a number of important questions to consider - how many records will typically be used (as opposed to available for association)? Will there be a large number of records on one side of the association (given the switch from 1->M, this seems likely)?
If one of the quantities of records is usually very small (<10, I'd say), call it the LHS (because it usually is), then the best way to associate may be to allow searches for LHS and RHS items, then drag-and-drop them onto a list - LHS items onto the list proper; RHS items onto the existing LHS items. That way, it's intuitive to specify a relation between two items. You could also add other options like "associate with all", or a grouping pen so you can assign several records to several other records - nothing is as tedious as having to do 15 drag-and-drops of the same record.
In fact, I think that's the most crucial bit of any M->M UI design - minimize repetition. Doing the same thing over for 100s of records (remember that if "nobody will ever...", they will) is unfun, especially if it's complex. I know this seems to contradict my earlier advice, but I don't think it does - design for the typical use case, but make sure that the atypical ones do not make the program unusable.

Default database IDs; system and user values

As part of our current database work, we are looking at dealing with the process of updating databases.
A point which has been brought up recurrently is that of dealing with system vs. user values; in our project, user and system values are stored together. For example...
We have a list of templates.
1, <system template>
2, <system template>
3, <system template>
These are mapped in the app to an enum (1, 2, 3)
Then a user comes in and adds...
4, <user template>
...and...
5, <user template>
Then.. we issue an upgrade.. and insert as part of our upgrade scripts...
<new id> [6], <new system template>
THEN!!... we find a bug in the new system template and need to update it. The problem is how? We cannot update the record using ID 6 (it may actually have been inserted as 9, or 999), so we have to identify the record using some other mechanism.
So, we've come to two possible solutions for this.
In the red corner (speed)....
We simply start user Ids at 5000 (or some other value) and test data at 10000 (or some other value). This would allow us to make modifications to system values and test them up to the lower limit of the next ID range.
Advantage...Quick and easy to implement,
Disadvantage... could run out of values if we don't choose a big enough range!
In the blue corner (scalability)...
We store, system and user data separately, use GUIDs as Ids and merge the two lists using a view.
Advantage...Scalable..No limits w/regard to DB size.
Disadvantage.. More complicated to implement. (many to one updatable views etc.)
I plump squarely for the first option, but looking for some ammo to back me up!
Does anyone have any thoughts on these approaches, or even one(s) that we've missed?
I have never had problems (performance or development - TDD & unit testing included) using GUIDs as the IDs for my databases, and I've worked on some pretty big ones. There is plenty of material out there on using GUIDs (and the potential gotchas involved) as your primary keys, but I can't recommend them highly enough, since moving data around safely and DB synchronisation becomes as easy as brushing your teeth in the morning :-)
For your question above, I would either recommend a third column (if possible) that indicates whether the template is user- or system-based, or you can at the very least generate GUIDs for system templates as you insert them and keep a list of those on hand, so that if you need to update a template you can target that same GUID in your DEV, UAT and/or PRODUCTION databases without fear of overwriting other templates. The third column would come in handy, though, for selecting all system or user templates at will, without the need to separate them into two tables (which is overkill IMHO).
I hope that helps,
Rob G
I recommend the second approach, with the modification that you store the system and user values in one table. GUIDs are quite reliable in this regard.
Another idea: use any text-based ID (not necessarily a GUID), which you assign yourself for the system values and generate for the user values from a random string, or a string based on some kind of custom logic.
Another idea: use the first approach, but extend the table with a flag which shows if a value is system or user. Maybe this is the easiest. Ok, you have to write some kind of mechanism to update the correct system value, but it can be done easily.
+1 for Biri's text based ID - define a "template_mnemonic" text based column and make it the primary key. This will be a known value when you insert it as you, the developers will have decided on it (or auto-generated it) and you will always be able to reference a template by its mnemonic regardless of how many user specified templates there are. It also allows users to have a meaningful naming convention for their templates.
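A sketch of the mnemonic-key idea (the table, columns and example mnemonic are all hypothetical):

CREATE TABLE template (
    template_mnemonic VARCHAR(50) PRIMARY KEY,  -- decided by the developers, or auto-generated
    is_system         BOOLEAN NOT NULL DEFAULT FALSE,
    body              TEXT
);

-- An upgrade script can then target a system template without caring about numeric IDs
UPDATE template
SET body = '...corrected template body...'
WHERE template_mnemonic = 'NEW_SYSTEM_TEMPLATE';  -- hypothetical mnemonic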
Maybe I didn't get it, but couldn't you use GUIDs as IDs and still keep user and system data together? Then you can access the system data by their (unchanging) GUIDs.
I don't think that GUID should make any problem.
If you want to avoid it, then use a flag:
ID int
template whatever
flag enum/int/bool
Flag shows whether the actual value is a system or a user value.
If you would like to update a system value, then ask only for system values ordered by ID, and that will show you the actual order of insertion (you should use a bigint or similar for the ID to make sure it doesn't run out, and don't reuse deleted IDs). With this list, the xth record is the xth inserted system value.
I think there is a better third solution.
It strikes me that you're storing two different things in the same table and that you might be better off creating 2 separate tables one for user templates and one for system templates. You might then be able to create a view over the two tables to make them appear as a single object to your application.
Obviously I don't have full knowledge of your application, and this may be impossible for you for any number of reasons, but I think it's a neater solution than GUIDs and way safer than ranges of IDs (seriously, don't do ID ranges; it'll bite you one day).
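A hedged sketch of that split-plus-view layout (all names are illustrative; ID collisions between the two tables would still need a strategy, which is part of the extra complexity the question mentions):

CREATE TABLE system_template (
    system_template_id INT PRIMARY KEY,
    body               TEXT
);

CREATE TABLE user_template (
    user_template_id INT PRIMARY KEY,
    body             TEXT
);

-- Present both lists to the application as a single object
CREATE VIEW all_templates AS
    SELECT system_template_id AS template_id, body, 'system' AS source FROM system_template
    UNION ALL
    SELECT user_template_id   AS template_id, body, 'user'   AS source FROM user_template;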

Resources