Table with hierarchy or multiple tables? - database

I am designing database model for some application, and I have one table Post which belong to some category. OK, Category will logically be other table.
But, more categories belong to some super category or domain or area, and my question is next:
Whether create other table for super categories or domains, or to do this hierarchy in table Category with some combination of key to point to parent.
I hope I was clear with problem?
PS.I know that I can do this problem with both solution, but is there any benefits with using first over second solution, and contrary.
Thanks

It depends: if nearly each category has a parent, you could add a parent serial as a column. Then your category table will look like
+--+----+------+
|ID|Name|Parent|
+--+----+------+
The problem with this representation is that, as long the hierarchy is not cyclic, some categories will have no parent. Furthermore a category can only have one parent.
Therefore I would suggest using a category_hierarchy table. An additional table:
+-----+------+
|Child|Parent|
+-----+------+
The disadvantage of this approach is that nearly each category will be repeated. And therefore if nearly all categories have parents, the redundancy will approximately scale with that number. If relations however are quite sparse, one saves space. Furthermore using an intelligent join will prevent the second representation from taking long execution times. You can for instance define a view to handle such requests.
Furthermore there are situations where the second approach can improve speed. For instance if you don't need the hierarchy all the time (for instance when mapping serials to the category-name), lookups in the category table can be faster, simply because the table is more compact and thus more parts of the table will be cached.

Related

I'm unable to normalize my Product table as I have 4 different product types

So because I have 4 different product types (books, magazines, gifts, food) I can't just put all products in one "products" table without having a bunch of null values. So I decided to break each product up into their own tables but I know this is just wrong (https://c1.staticflickr.com/1/742/23126857873_438655b10f_b.jpg).
I also tried creating an EAV model for this (https://c2.staticflickr.com/6/5734/23479108770_8ae693053a_b.jpg), but I got stuck as I'm not sure how to link the publishers and authors tables.
I know this question has been asked a lot but I don't understand ANY of the answer's I've seen. I think this is because I'm a very visual learner and this makes it hard to understand what's being talked about when not a lot of information is given.
Your model is on the right track, except that the product name should be sufficient you don't need Gift name, book name etc. What you put in those tables is the information that is specific to the type of product that the other products don't need. The Product table contains all the common fields. I would use productid in the child tables rather than renaming it giftID, magazineID etc. It is easier to remember what things are celled when you are consistent in nameing them.
Now to be practical, you put as much as you can into the product table especially if you are going to do calculations. I prefer the child tables in this specific case to have what is mostly display information. So product contains the product name, the cost, the type of product, the units the product is sold in etc. The stuff that generally is needed to calculate the cost of an order or to have a report of what was ordered. There may be one or two fields that can contain nulls, but it simplifies the calculation type queries so much it might be worth it.
The meat of the descriptive details though would go in the child table for the type of product. These would usually only be referenced when displaying the product in the shopping area and only one at a time, so you can use the product type to let you only join to the one child table you need for display. So while the order cares about the product number and name and cost calculations, it probably doesn't need to go line by line describing the book ISBN number or the megapixels in a camera. But the description page of the product does need those things.
This approach is not purely relational, although it mostly is, but it does group the information by the meanings of the data and how they will be used which will make the database easier to understand and query. I am a big fan of relational tables because database just work better when they hit at least the third normal form but sometimes you can go too far for practicality, so the meaning of the data and the way you are grouping to use the data (and not just for the user interface, but for later reporting as well) is almost always one of my considerations in design.
Breaking each product type into its own table is fine - let the child tables use the same id as the parent Product table, and create views for the child tables that join with Product
Your case is a classic case of types and subtypes. This is often called class/subclass in object modeling and generalization/specialization in ER modeling. It's a well understood pattern. There are known techniques for dealing with this pattern.
Visit the following tabs, and read the description under the info tab (presented as "learn more"). Also look over the questions grouped under these tags.
single-table-inheritance class-table-inheritance shared-primary-key
If you want to rean in more depth use these buzzwords to search for articles on the web.
You've already discovered and discarded single table inheritance on your own. Other answers have pointed you at shared primary key. Class table inheritance involves a single table for generalized data as well as the four specialized tables. Shared primary key is generally used in conjunction with class table inheritance.

Onetomany with parent database design

Below is a database design which represents my problem(it is not my actual database design). For each city I need to know which restaurants, bars and hotels are available. I think the two designs speak for itself, but:
First design: create one-to-many relations between city and restaurants, bars and hotels.
Second design: only create an one-to-many relation between city and place.
Which design would be best practice? The second design has less relations, but would I be able to get all the restaurants, bars and hotels for a city and there own data (property_x/y/z)?
Update: this question is going wrong, maybe my fault for not being clear.
the restaurant/bar/hotel classes are subclasses of "place" (in both
designs).
the restaurant/bar/hotel classes must have the parent "place"
the restaurant/bar/hotel classes have there own specific data (property_X/Y/X)
Good design first
Your data, and the readability/understandability of your SQL and ERD, are the most important factors to consider. For the purpose of readability:
Put city_id into place. Why: Places are in cities. A hotel is not a place that just happens to be in a city by virtue of being a hotel.
Other design points to consider are how this structure will be extended in the future. Let's compare adding a new subtype:
In design one, you need to add a new table, relationship to 'place' and a relationship to city
In design two, you simply add a new table and relationship to 'place'.
I'd again go with the second design.
Performance second
Now, I'm guessing, but the reason for putting city_id in the subtype is probably that you anticipate that it's more efficient or faster in some specific use cases and this may be a very good reason to ignore readability/understandability. However, until you measure performance on the actual hardware you'll deploy on, you don't know:
Which design is faster
Whether the difference in performance would actually degrade the overall system
Whether other optimization approaches (tuning SQL or database parameters) is actually a better way to handle it.
I would argue that design one is an attempt to physically model the database on an ERD, which is a bad practice.
Premature optimization is the root of a lot of evil in SW Engineering.
Subtype approaches
There are two solutions to implementing subtypes on an ERD:
A common-properties table, and one table per subtype, (this is your second model)
A single table with additional columns for subtype properties.
In the single-table approach, you would have:
A subtype column, TYPE INT NOT NULL. This specifies whether the row is a restaurant, bar or hotel
Extra columns property_X, property_Y and property_Z on place.
Here is a quick table of pros and cons:
Disadvantages of a single-table approach:
The extension columns (X, Y, Z) cannot be NOT NULL on a single table approach. You can implement row-level constraints, but you lose the simplicity and visibility of a simple NOT NULL
The single table is very wide and sparse, especially as you add additional subtypes. You may hit the max. number of columns on some databases. This can make this design quite wasteful.
To query a list of a specific subtype, you have to filter using a WHERE TYPE = ? clause, whereas the table-per-subtype is a much more natural `FROM HOTEL INNER JOIN PLACE ON HOTEL.PLACE_ID = PLACE.ID"
IMHO, mapping into classes in an object-oriented languages is harder and less obvious. Consider avoiding if this DB is going to be mapped by Hibernate, Entity Beans or similar
Advantages of a single-table approach:
By consolidating into a single table, there are no joins, so queries and CRUD operations are more efficient (but is this small difference going to cause problems?)
Queries for different types are parameterized (WHERE TYPE = ?) and therefore more controllable in code rather than in the SQL itself (FROM PLACE INNER JOIN HOTEL ON PLACE.ID = HOTEL.PLACE_ID).
There is no best design, you have to pick based on the type of SQL and CRUD operations you are doing most frequently, and possibly on performance (but see above for a general warning).
Advice
All things being equal, I would advise the default option is your second design. But, if you have an overriding concern such as those I listed above, do choose another implementation. But don't optimize prematurely.
Both of them and none of them at all.
If I need to choose one, I would keep the second one, because of the number of foreign keys and indexes needed to be created after. But, a better approach would be: create a table with all kinds of places (bars, restaurants, and so on) and assign to each row a column with a value of the type of the place (apply a COMPRESS clause with the types expected at the column). It would improve both performance and readability of the structure, plus being more easier to maintain. Hope this helps. :-)
you do not show alternate columns in any of the sub-place tables. i think you should not split type data into table names like 'bar','restaurant', etc - these should be types inside the place table.
i think further you should have an address table - one column of which is city. then each place has an address and you can easily group by city when needed. (or state or zip code or country etc)
I think the best option is the second one. In the first design, there is a possibility of data errors as one place can be assigned to a particular restaurant (or any other type) in one city (e.g. A) and at the same time can be assigned to another restaurant in a different city (e.g. B). In the second design, a place is always bound to a particular city.
First:
Both designs can get you all the appropriate data.
Second:
If all extending classes are going to implement the location (which sound obvious for your implementation) then it would be a better practice to include it as part of the parent object. This would suggest option 2.
Thingy:
The thing is that even-tough you can find out the type of each particular PLACE, it is easier to just know that a type (CHILD) is always a place (PARENT). You can think of that while you visualize the result-set of option 2. With that in mind, I recommend the first approach.
NOTE:
First one doesn't have more relations, it just splits them.
If bar, restaurant and hotel have different sets of attributes, then they are different entities and should be represented by 3 different tables. But why do you need the place table? My advice it to ditch it and have 3 tables for your 3 entities and that's that.
In code, collecting common attributes into a parent class is more organised and efficient than repeating them in each child class - of course. But as spathirana comments above, database design is not like OOP. Sure, you'll save on the repetition of column names by sticking common attributes of places into a "place" table. But it will also add complication:
- you have to join on that table whenever you want to reference a bar, restaurant or hotel
- you have to insert into two tables whenever you want to add a new bar, restaurant or hotel
- you have to update two tables when ... etc.
Having 3 tables without the place table is ALSO, PROBABLY, the most performance-optimal design. But that's not where I'm coming from. I'm thinking of clean, simple database design where a single entity means a single row in a single table. There are no "is-a" relationships in a relational DB. Foreign key relationships are "has-a". OK, there are exceptions I'm sure, but your case is not exceptional.

When I should use one to one relationship?

Sorry for that noob question but is there any real needs to use one-to-one relationship with tables in your database? You can implement all necessary fields inside one table. Even if data becomes very large you can enumerate column names that you need in SELECT statement instead of using SELECT *. When do you really need this separation?
1 to 0..1
The "1 to 0..1" between super and sub-classes is used as a part of "all classes in separate tables" strategy for implementing inheritance.
A "1 to 0..1" can be represented in a single table with "0..1" portion covered by NULL-able fields. However, if the relationship is mostly "1 to 0" with only a few "1 to 1" rows, splitting-off the "0..1" portion into a separate table might save some storage (and cache performance) benefits. Some databases are thriftier at storing NULLs than others, so a "cut-off point" where this strategy becomes viable can vary considerably.
1 to 1
The real "1 to 1" vertically partitions the data, which may have implications for caching. Databases typically implement caches at the page level, not at the level of individual fields, so even if you select only a few fields from a row, typically the whole page that row belongs to will be cached. If a row is very wide and the selected fields relatively narrow, you'll end-up caching a lot of information you don't actually need. In a situation like that, it may be useful to vertically partition the data, so only the narrower, more frequently used portion or rows gets cached, so more of them can fit into the cache, making the cache effectively "larger".
Another use of vertical partitioning is to change the locking behavior: databases typically cannot lock at the level of individual fields, only the whole rows. By splitting the row, you are allowing a lock to take place on only one of its halfs.
Triggers are also typically table-specific. While you can theoretically have just one table and have the trigger ignore the "wrong half" of the row, some databases may impose additional limits on what a trigger can and cannot do that could make this impractical. For example, Oracle doesn't let you modify the mutating table - by having separate tables, only one of them may be mutating so you can still modify the other one from your trigger.
Separate tables may allow more granular security.
These considerations are irrelevant in most cases, so in most cases you should consider merging the "1 to 1" tables into a single table.
See also: Why use a 1-to-1 relationship in database design?
My 2 cents.
I work in a place where we all develop in a large application, and everything is a module. For example, we have a users table, and we have a module that adds facebook details for a user, another module that adds twitter details to a user. We could decide to unplug one of those modules and remove all its functionality from our application. In this case, every module adds their own table with 1:1 relationships to the global users table, like this:
create table users ( id int primary key, ...);
create table users_fbdata ( id int primary key, ..., constraint users foreign key ...)
create table users_twdata ( id int primary key, ..., constraint users foreign key ...)
If you place two one-to-one tables in one, its likely you'll have semantics issue. For example, if every device has one remote controller, it doesn't sound quite good to place the device and the remote controller with their bunch of characteristics in one table. You might even have to spend time figuring out if a certain attribute belongs to the device or the remote controller.
There might be cases, when half of your columns will stay empty for a long while, or will not ever be filled in. For example, a car could have one trailer with a bunch of characteristics, or might have none. So you'll have lots of unused attributes.
If your table has 20 attributes, and only 4 of them are used occasionally, it makes sense to break the table into 2 tables for performance issues.
In such cases it isn't good to have everything in one table. Besides, it isn't easy to deal with a table that has 45 columns!
If data in one table is related to, but does not 'belong' to the entity described by the other, then that's a candidate to keep it separate.
This could provide advantages in future, if the separate data needs to be related to some other entity, also.
The most sensible time to use this would be if there were two separate concepts that would only ever relate in this way. For example, a Car can only have one current Driver, and the Driver can only drive one car at a time - so the relationship between the concepts of Car and Driver would be 1 to 1. I accept that this is contrived example to demonstrate the point.
Another reason is that you want to specialize a concept in different ways. If you have a Person table and want to add the concept of different types of Person, such as Employee, Customer, Shareholder - each one of these would need different sets of data. The data that is similar between them would be on the Person table, the specialist information would be on the specific tables for Customer, Shareholder, Employee.
Some database engines struggle to efficiently add a new column to a very large table (many rows) and I have seen extension-tables used to contain the new column, rather than the new column being added to the original table. This is one of the more suspect uses of additional tables.
You may also decide to divide the data for a single concept between two different tables for performance or readability issues, but this is a reasonably special case if you are starting from scratch - these issues will show themselves later.
First, I think it is a question of modelling and defining what consist a separate entity. Suppose you have customers with one and only one single address. Of course you could implement everything in a single table customer, but if, in the future you allow him to have 2 or more addresses, then you will need to refactor that (not a problem, but take a conscious decision).
I can also think of an interesting case not mentioned in other answers where splitting the table could be useful:
Imagine, again, you have customers with a single address each, but this time it is optional to have an address. Of course you could implement that as a bunch of NULL-able columns such as ZIP,state,street. But suppose that given that you do have an address the state is not optional, but the ZIP is. How to model that in a single table? You could use a constraint on the customer table, but it is much easier to divide in another table and make the foreign_key NULLable. That way your model is much more explicit in saying that the entity address is optional, and that ZIP is an optional attribute of that entity.
not very often.
you may find some benefit if you need to implement some security - so some users can see some of the columns (table1) but not others (table2)..
of course some databases (Oracle) allow you to do this kind of security in the same table, but some others may not.
You are referring to database normalization. One example that I can think of in an application that I maintain is Items. The application allows the user to sell many different types of items (i.e. InventoryItems, NonInventoryItems, ServiceItems, etc...). While I could store all of the fields required by every item in one Items table, it is much easier to maintain to have a base Item table that contains fields common to all items and then separate tables for each item type (i.e. Inventory, NonInventory, etc..) which contain fields specific to only that item type. Then, the item table would have a foreign key to the specific item type that it represents. The relationship between the specific item tables and the base item table would be one-to-one.
Below, is an article on normalization.
http://support.microsoft.com/kb/283878
As with all design questions the answer is "it depends."
There are few considerations:
how large will the table get (both in terms of fields and rows)? It can be inconvenient to house your users' name, password with other less commonly used data both from a maintenance and programming perspective
fields in the combined table which have constraints could become cumbersome to manage over time. for example, if a trigger needs to fire for a specific field, that's going to happen for every update to the table regardless of whether that field was affected.
how certain are you that the relationship will be 1:1? As This question points out, things get can complicated quickly.
Another use case can be the following: you might import data from some source and update it daily, e.g. information about books. Then, you add data yourself about some books. Then it makes sense to put the imported data in another table than your own data.
I normally encounter two general kinds of 1:1 relationship in practice:
IS-A relationships, also known as supertype/subtype relationships. This is when one kind of entity is actually a type of another entity (EntityA IS A EntityB). Examples:
Person entity, with separate entities for Accountant, Engineer, Salesperson, within the same company.
Item entity, with separate entities for Widget, RawMaterial, FinishedGood, etc.
Car entity, with separate entities for Truck, Sedan, etc.
In all these situations, the supertype entity (e.g. Person, Item or Car) would have the attributes common to all subtypes, and the subtype entities would have attributes unique to each subtype. The primary key of the subtype would be the same as that of the supertype.
"Boss" relationships. This is when a person is the unique boss or manager or supervisor of an organizational unit (department, company, etc.). When there is only one boss allowed for an organizational unit, then there is a 1:1 relationship between the person entity that represents the boss and the organizational unit entity.
The main time to use a one-to-one relationship is when inheritance is involved.
Below, a person can be a staff and/or a customer. The staff and customer inherit the person attributes. The advantage being if a person is a staff AND a customer their details are stored only once, in the generic person table. The child tables have details specific to staff and customers.
In my time of programming i encountered this only in one situation. Which is when there is a 1-to-many and an 1-to-1 relationship between the same 2 entities ("Entity A" and "Entity B").
When "Entity A" has multiple "Entity B" and "Entity B" has only 1 "Entity A"
and
"Entity A" has only 1 current "Entity B" and "Entity B" has only 1 "Entity A".
For example, a Car can only have one current Driver, and the Driver can only drive one car at a time - so the relationship between the concepts of Car and Driver would be 1 to 1. - I borrowed this example from #Steve Fenton's answer
Where a Driver can drive multiple Cars, just not at the same time. So the Car and Driver entities are 1-to-many or many-to-many. But if we need to know who the current driver is, then we also need the 1-to-1 relation.
Another use case might be if the maximum number of columns in the database table is exceeded. Then you could join another table using OneToOne

Many tables to a single row in relational database

Consider we have a database that has a table, which is a record of a sale. You sell both products and services, so you also have a product and service table.
Each sale can either be a product or a service, which leaves the options for designing the database to be something like the following:
Add columns for each type, ie. add Service_id and Product_id to Invoice_Row, both columns of which are nullable. If they're both null, it's an ad-hoc charge not relating to anything, but if one of them is satisfied then it is a row relating to that type.
Add a weird string/id based system, for instance: Type_table, Type_id. This would be a string/varchar and integer respectively, the former would contain for example 'Service', and the latter the id within the Service table. This is obviously loose coupling and horrible, but is a way of solving it so long as you're only accessing the DB from code, as such.
Abstract out the concept of "something that is chargeable" for with new tables, of which Product and Service now are an abstraction of, and on the Invoice_Row table you would link to something like ChargeableEntity_id. However, the ChargeableEntity table here would essentially be redundant as it too would need some way to link to an abstract "backend" table, which brings us all the way back around to the same problem.
Which way would you choose, or what are the other alternatives to solving this problem?
What you are essentially asking is how to achieve polymorphism in a relational database. There are many approaches (as you yourself demonstrate) to this problem. One solution is to use "table per class" inheritance. In this setup, there will be a parent table (akin to your "chargeable item") that contains a unique identifier and the fields that are common to both products and services. There will be two child tables, products and goods: Each will contain the unique identifier for that entity and the fields specific to it.
One benefit to this approach over others is you don't end up with one table with many nullable columns that essentially becomes a dumping ground to describe anything ("schema-less").
One downside is as your inheritance hierarchy grows, the number of joins needed to grab all the data for an entity also grows.
I believe it depends on use case(s).
You could put the common columns in one table and put product and service specific columns in its own tables.Here the deal is that you need to join stuff.
Else if you maintain two separate tables, one for Product and another for Sale. You use application logic to determine which table to insert into. And getting all sales will essentially mean , union of getting all products and getting all sale.
I would go for approach 2 personally to avoid joins and inserting into two tables whenever a sale is made.

Database design for a product aggregator

I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, price, color, etc. Users need to able to search and filter results based on any of those product categories. I also expect to have a large number of users. My initial thought was having one big table with every product in it with a column for each piece of information and an index on anything I need to be able to search by but I think this might be inefficient with a lot of users pounding on this one table. My other thought was to organize the database to promote a tree-like navigation of tables but because you can search by anything I'm not sure how I would organize the tables.
Any thoughts on some good practices?
One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at the all the data you have, determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of what data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All this is going to inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables, others (price) would be simple scalars. Appropriate indexing and you'll be fine. For advanced analytics, consider a dimensionally modeled star-schema, but perhaps not for your live transaction system - depends what your data flow/workflow/transactions are. Or consider some benefits of its principles in your transactional database. Ralph Kimball is source of good information on dimensional modeling.
I dont see any need for the tree structure here. You can do with single table.
if you insist on tree structure with hierarchy here is an example to get you started.
For text based search, and ease of startup & design, I strongly recommend Apache SOLR. The SOLR API is easy to use (especially JSON). Databases do text search poorly, and I would instead recommend that you just make sure that they respond to primary/unique key queries properly, and those are the fields you should index.
One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index on the top 5 or 10 columns you think users are likely to search for, unless it's possible for a user to search on ANY column. In that case you might want to look at building a data warehouse. Maybe you'll want to look into data cubes to see if those will help...?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID PARENT_ID NAME
1 ROOT
2 1 SOCKS
3 1 HELICOPTER PARTS
4 2 ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purposes.

Resources