Database design (lists of many different items, with custom fields)

I’m working on a project where you work with all kinds of items. What the project is doesn’t matter; it’s the database design I’m worried about. If someone could give me some insight into how I should lay out my database for this, or just point me in the right direction, I would be most thankful.
All kinds of items in one list
Imagine you have lists of items. You could have a list of CDs, a list of DVDs and a list of books. This translates to 1 list has many items in database terms, with the id of the list in the item row.
But what if you wanted to have a list with all Super Mario related stuff, containing soundtrack DVDs, that horrible live action film and some fanfiction novels based on the plumber’s life?
I suddenly realized, when drawing out my database, that those items, which belong to the same list, couldn’t be in the same table, as they would all need different columns to support artist/album title, director/movie title, author/novel title, etc., which I couldn’t possibly put all in one giant table.
On top of that, I want to have the track titles of the soundtrack albums and the actors of the film in my database. If I had only CDs, I could easily attach an album_track table to my item table, but I can’t just attach all kinds of different tables to my item table, as that wouldn’t be too good for performance if I wanted to get all items with all their details for a certain list. The procedure would have to search all attached tables for references to the list, even if the list doesn’t contain any books, vinyls, manga, tv-series, plants, furniture, etc.
What I have right now is the following layout (but I can’t imagine this is the best way to do this):
t_list (id) --> t_item (id, id_list, image)
t_item --> t_cd (id, id_item, artist, title)
t_item --> t_dvd (id, id_item, director, title)
t_item --> …
t_cd --> t_cd_track (id, id_cd, track_title, length)
t_dvd --> t_dvd_actors (id, id_dvd, actor_name, image)
…
Custom columns
Now, imagine that to add these items to a cd list, you’d have a form with input fields, according to the columns in the table t_cd (artist, album title, genre, …). I want to be able to add a custom input field for example for the average price of albums.
This is set for a certain user for a certain list. This is not set on an item level, because that would mean it would be added to everyone’s form. I just want to add that field to my own CD list.
But it still needs to be related to items, because that value needs to be filled in in the database.
I’m thinking about something like this:
t_list (id) --> t_extra_field (id, description, id_list)
t_extra_field --> t_field_value (id, id_extra_field, value)
But I’m not entirely sure where to attach this in my database scheme.
Could this kind of structure also be an answer to my previous question? (t_field --> t_field_value) If so, I also don’t know where to attach that. Perhaps to list, like I suggested in the above example?
That would mean that all details for a certain item are in one table, but value by value rather than in one single record, keyed by a category id of some sort that comes from another table attached to item. That would be a table with a lot of records, which again raises my question: isn’t this bad for performance?
I sincerely hope someone could give me some insight into the matter.

A completely generic database is probably a bad idea - it usually means you have to enforce the data consistency completely at the application level. This might be justified for highly "untyped" or "volatile" data when you want to avoid DDL at run-time, but the data you describe here looks "typed" enough for a more conventional database design.
Judging by your description, you'd need something similar to this:
The symbol denotes the "category" (aka. inheritance, sub-type, generalization hierarchy etc.).
For the specific cases where we know exactly how the items should be connected, we can model that directly through a link (aka. junction) table between specific sub-types, as in case of the TRACK table.
Also, we can group items of different kinds through GROUP and GROUP_ITEM (so, say, a Mario soundtrack(s), movie(s) and book(s) can be grouped together, under the same GROUP_ID).
Artists are also handled in a fairly general way, so we can easily represent a situation where (for example) a same person writes both a song and a book.
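A minimal sketch of this kind of structure, written as Python driving an in-memory SQLite database. All table and column names here (item, cd, dvd, book, item_group, group_item, label, runtime_min, isbn) are assumptions for illustration and are not taken from the original model; GROUP is renamed item_group because GROUP is a reserved word in SQL. The TRACK and artist tables are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
-- Parent ("category") table: columns common to every kind of item.
CREATE TABLE item (
    item_id   INTEGER PRIMARY KEY,
    item_type TEXT NOT NULL CHECK (item_type IN ('cd', 'dvd', 'book')),
    title     TEXT NOT NULL
);
-- One table per sub-type, sharing the parent's key.
CREATE TABLE cd (
    item_id INTEGER PRIMARY KEY REFERENCES item(item_id),
    label   TEXT
);
CREATE TABLE dvd (
    item_id     INTEGER PRIMARY KEY REFERENCES item(item_id),
    runtime_min INTEGER
);
CREATE TABLE book (
    item_id INTEGER PRIMARY KEY REFERENCES item(item_id),
    isbn    TEXT
);
-- Grouping of items of different kinds (hypothetical names).
CREATE TABLE item_group (
    group_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE group_item (
    group_id INTEGER NOT NULL REFERENCES item_group(group_id),
    item_id  INTEGER NOT NULL REFERENCES item(item_id),
    PRIMARY KEY (group_id, item_id)
);
""")

# Placeholder rows, invented purely for the example.
conn.executemany("INSERT INTO item VALUES (?, ?, ?)", [
    (1, "cd", "Mario soundtrack"),
    (2, "dvd", "Mario live-action film"),
    (3, "book", "Mario fan novel"),
    (4, "book", "Unrelated cookbook"),
])
conn.execute("INSERT INTO cd VALUES (1, 'Example Records')")
conn.execute("INSERT INTO dvd VALUES (2, 104)")
conn.execute("INSERT INTO book VALUES (3, '000-0000000000')")
conn.execute("INSERT INTO book VALUES (4, '111-1111111111')")
conn.execute("INSERT INTO item_group VALUES (1, 'Super Mario')")
conn.executemany("INSERT INTO group_item VALUES (1, ?)", [(1,), (2,), (3,)])

# Everything in one group, with sub-type details where they exist.
rows = conn.execute("""
    SELECT i.title, i.item_type, cd.label, dvd.runtime_min, book.isbn
    FROM group_item gi
    JOIN item i      ON i.item_id = gi.item_id
    LEFT JOIN cd     ON cd.item_id = i.item_id
    LEFT JOIN dvd    ON dvd.item_id = i.item_id
    LEFT JOIN book   ON book.item_id = i.item_id
    WHERE gi.group_id = 1
    ORDER BY i.item_id
""").fetchall()

for row in rows:
    print(row)
```

The outer joins are on primary keys, so each one is an index lookup per item rather than a scan of every sub-type table. Whether that is fast enough for a given DBMS and data volume would still need measuring.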
As for things such as "average price of albums", ideally you shouldn't store them at all - you should calculate them when needed, based on the existing data, so the possibility of an out-of-date result is eliminated.
If this becomes problematic performance-wise, either:
do it periodically, cache the result and live with the somewhat out-of-date result.
or cache the result whenever the data is modified (through triggers), but do it very carefully to avoid anomalies in the concurrent environment.
For example...
SELECT AVG(PRICE) FROM TABLE1;
INSERT INTO TABLE2 (AVERAGE_PRICE) VALUES (result_of_the_previous_query);
...is almost certainly unsafe, but depending on the DBMS even...
INSERT INTO TABLE2 (AVERAGE_PRICE) SELECT AVG(PRICE) FROM TABLE1;
...might not be completely safe without proper locking. You'll need to learn about your DBMS's transaction isolation and locking.
In the specific case of calculating an average, there are other tricks that you might consider, such as separately incrementing/decrementing the COUNT and adding/subtracting SUM of the price through triggers with each INSERT/UPDATE/DELETE, and then calculating the AVG on the fly. SQL guarantees that things such as UPDATE ... SET MY_COUNT = MY_COUNT + 1 will be "atomic".
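The COUNT/SUM trick can be sketched as follows, using Python with an in-memory SQLite database. The table, column, and trigger names (album, album_stats, price_cents, my_count, my_sum, album_ai, album_au, album_ad) are assumptions made up for this example, and prices are kept as integer cents to sidestep floating-point drift. Trigger syntax and concurrency behaviour differ between DBMSs, so treat this as an illustration of the idea rather than a drop-in solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE album (
    album_id    INTEGER PRIMARY KEY,
    price_cents INTEGER NOT NULL
);
-- Single-row table holding the running totals.
CREATE TABLE album_stats (
    id       INTEGER PRIMARY KEY CHECK (id = 1),
    my_count INTEGER NOT NULL,
    my_sum   INTEGER NOT NULL
);
INSERT INTO album_stats VALUES (1, 0, 0);

-- Each trigger adjusts the totals with a single relative UPDATE.
CREATE TRIGGER album_ai AFTER INSERT ON album BEGIN
    UPDATE album_stats
    SET my_count = my_count + 1, my_sum = my_sum + NEW.price_cents
    WHERE id = 1;
END;
CREATE TRIGGER album_au AFTER UPDATE OF price_cents ON album BEGIN
    UPDATE album_stats
    SET my_sum = my_sum - OLD.price_cents + NEW.price_cents
    WHERE id = 1;
END;
CREATE TRIGGER album_ad AFTER DELETE ON album BEGIN
    UPDATE album_stats
    SET my_count = my_count - 1, my_sum = my_sum - OLD.price_cents
    WHERE id = 1;
END;
""")

conn.executemany("INSERT INTO album VALUES (?, ?)",
                 [(1, 1000), (2, 2000), (3, 3000)])
conn.execute("UPDATE album SET price_cents = 4000 WHERE album_id = 3")
conn.execute("DELETE FROM album WHERE album_id = 1")

# Average computed on the fly from the cached totals (NULL when count is 0).
cached_avg = conn.execute(
    "SELECT my_sum * 1.0 / NULLIF(my_count, 0) FROM album_stats"
).fetchone()[0]
# The same figure computed directly, for comparison.
direct_avg = conn.execute("SELECT AVG(price_cents) FROM album").fetchone()[0]
print(cached_avg, direct_avg)
```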

Related

I'm unable to normalize my Product table as I have 4 different product types

So because I have 4 different product types (books, magazines, gifts, food) I can't just put all products in one "products" table without having a bunch of null values. So I decided to break each product up into their own tables but I know this is just wrong (https://c1.staticflickr.com/1/742/23126857873_438655b10f_b.jpg).
I also tried creating an EAV model for this (https://c2.staticflickr.com/6/5734/23479108770_8ae693053a_b.jpg), but I got stuck as I'm not sure how to link the publishers and authors tables.
I know this question has been asked a lot but I don't understand ANY of the answers I've seen. I think this is because I'm a very visual learner and this makes it hard to understand what's being talked about when not a lot of information is given.
Your model is on the right track, except that the product name should be sufficient; you don't need Gift name, book name, etc. What you put in those tables is the information that is specific to the type of product that the other products don't need. The Product table contains all the common fields. I would use productid in the child tables rather than renaming it giftID, magazineID, etc. It is easier to remember what things are called when you are consistent in naming them.
Now to be practical, you put as much as you can into the product table especially if you are going to do calculations. I prefer the child tables in this specific case to have what is mostly display information. So product contains the product name, the cost, the type of product, the units the product is sold in etc. The stuff that generally is needed to calculate the cost of an order or to have a report of what was ordered. There may be one or two fields that can contain nulls, but it simplifies the calculation type queries so much it might be worth it.
The meat of the descriptive details though would go in the child table for the type of product. These would usually only be referenced when displaying the product in the shopping area and only one at a time, so you can use the product type to let you only join to the one child table you need for display. So while the order cares about the product number and name and cost calculations, it probably doesn't need to go line by line describing the book ISBN number or the megapixels in a camera. But the description page of the product does need those things.
This approach is not purely relational, although it mostly is, but it does group the information by the meaning of the data and how it will be used, which will make the database easier to understand and query. I am a big fan of relational tables because databases just work better when they hit at least the third normal form, but sometimes you can go too far for practicality. So the meaning of the data and the way you are grouping it for use (and not just for the user interface, but for later reporting as well) is almost always one of my considerations in design.
Breaking each product type into its own table is fine - let the child tables use the same id as the parent Product table, and create views for the child tables that join with Product.
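A rough sketch of that suggestion in Python with an in-memory SQLite database. Only two of the four product types are shown, and every table, view, and column name (product, book, food, book_view, isbn, expires_on, price_cents) is an assumption chosen for the example, not something taken from the linked diagrams.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_type TEXT NOT NULL,
    name         TEXT NOT NULL,
    price_cents  INTEGER NOT NULL
);
-- Child tables reuse product_id as both primary key and foreign key.
CREATE TABLE book (
    product_id INTEGER PRIMARY KEY REFERENCES product(product_id),
    isbn       TEXT NOT NULL
);
CREATE TABLE food (
    product_id INTEGER PRIMARY KEY REFERENCES product(product_id),
    expires_on TEXT NOT NULL
);
-- A view per child table presents the joined row as one unit.
CREATE VIEW book_view AS
    SELECT p.product_id, p.name, p.price_cents, b.isbn
    FROM product p
    JOIN book b ON b.product_id = p.product_id;
""")

conn.execute("INSERT INTO product VALUES (1, 'book', 'Example Novel', 1299)")
conn.execute("INSERT INTO book VALUES (1, '000-0000000000')")
conn.execute("INSERT INTO product VALUES (2, 'food', 'Example Jam', 499)")
conn.execute("INSERT INTO food VALUES (2, '2030-01-01')")

books = conn.execute("SELECT * FROM book_view").fetchall()
print(books)

# A child row without a matching parent is rejected by the foreign key.
fk_rejected = False
try:
    conn.execute("INSERT INTO book VALUES (99, '999-9999999999')")
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

Order and reporting queries can read the product table alone, while display pages query the one view that matches the product type.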
Your case is a classic case of types and subtypes. This is often called class/subclass in object modeling and generalization/specialization in ER modeling. It's a well understood pattern. There are known techniques for dealing with this pattern.
Visit the following tags, and read the description under the info tab (presented as "learn more"). Also look over the questions grouped under these tags.
single-table-inheritance class-table-inheritance shared-primary-key
If you want to read in more depth, use these buzzwords to search for articles on the web.
You've already discovered and discarded single table inheritance on your own. Other answers have pointed you at shared primary key. Class table inheritance involves a single table for generalized data as well as the four specialized tables. Shared primary key is generally used in conjunction with class table inheritance.

Store multiple values in one database field in Access (hear me out)

So I've done extensive searching on this and I can't seem to find a good solution that actually applies to my situation.
I have a list of projects in a table, then a list of people. I want to assign multiple people to one project. Seems pretty common. Obviously, I can't make multiple columns on my projects table for each person, as the people will change fairly frequently.
I need to display this information very quickly in a continuous list of projects (the ultimate way would be a multiple-select combobox as a listbox is too tall, but they don't exist outside of the dreaded lookup fields)
I can think of two ways:
- Store multiple employee IDs delimited by commas in one field in my projects table (I know this goes against good database design). Would require some code to store and retrieve the data.
- Have a separate table for employees assigned to projects (ID, ProjectID, EmployeeID). One to many relationship between projects table and this new table. One to many relationship between employees table and this new table. If a project has 3 employees assigned, it would store 3 records in this table. It seems a bit odd joining both tables in this way, and it would also require code to store and retrieve the data in a control like the one mentioned above.
Does anyone know if there is a better way (including displaying in an easy control) or how you usually tackle this problem?
The usual way to tackle this problem would be with a Junction Table. This is what you describe where you have a separate table maybe called EmployeeProject which has an EmployeeProjectID(PK), EmployeeID(FK) and ProjectID(FK).
In this way you model a Many-to-Many relationship where each project can have many employees involved and each employee can be involved in many projects. It's not actually all that difficult to do the SQL etc. required to pull the information back together again for display.
I would definitely stay away from storing comma-delimited values as this becomes significantly more complicated when you want to display or manipulate the data.
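To give a feel for how little SQL is involved, here is a small sketch of the junction table. It uses Python with an in-memory SQLite database as a stand-in, since Access itself is not scriptable this way. The EmployeeProject columns follow the names given above; the Employee and Project tables, their columns, and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, Name  TEXT NOT NULL);
CREATE TABLE Project  (ProjectID  INTEGER PRIMARY KEY, Title TEXT NOT NULL);
CREATE TABLE EmployeeProject (
    EmployeeProjectID INTEGER PRIMARY KEY,
    EmployeeID INTEGER NOT NULL REFERENCES Employee(EmployeeID),
    ProjectID  INTEGER NOT NULL REFERENCES Project(ProjectID),
    UNIQUE (EmployeeID, ProjectID)  -- no duplicate assignments
);
""")

conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cy")])
conn.executemany("INSERT INTO Project VALUES (?, ?)",
                 [(1, "Alpha"), (2, "Beta"), (3, "Gamma")])
conn.executemany(
    "INSERT INTO EmployeeProject (EmployeeID, ProjectID) VALUES (?, ?)",
    [(1, 1), (2, 1), (2, 2)],
)

# One row per (project, employee); projects without staff still appear.
rows = conn.execute("""
    SELECT p.Title, e.Name
    FROM Project p
    LEFT JOIN EmployeeProject ep ON ep.ProjectID = p.ProjectID
    LEFT JOIN Employee e         ON e.EmployeeID = ep.EmployeeID
    ORDER BY p.ProjectID, e.Name
""").fetchall()

# Collapse the rows into one list of names per project for display.
staff = {}
for title, name in rows:
    staff.setdefault(title, [])
    if name is not None:
        staff[title].append(name)

for title, names in staff.items():
    print(title, ", ".join(names))
```

The comma-separated string only exists at display time; the stored data stays one row per assignment.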
There's a good guide here: http://en.tekstenuitleg.net/articles/software/create-a-many-to-many-relationship-in-access but if you google "many to many junction table" or similar, there are thousands of pages/articles about implementation.

Database Design without inheritance

I have come up with the following schema for a client of mine. Does anything look off here, especially the Order Line Items? Should I use inheritance? I'm pretty sure that this site will only allow you to order courses, lessons, and giftcards, and that's it.
Any feedback would be appreciated
Just my thinking on the design:
You have Courses, Lessons and GiftCards tables for the possible purchase objects, and OrderLines contains IDs for each of the tables. But if a customer purchases a Lesson and a GiftCard, they should be shown as 2 lines in the order. Also, what will you do if your client wants to sell more kinds of objects?
Therefore I think it might be better to redesign this part, like this:
OrderLines rename to OrderItems;
add ItemType table with 3 rows: Courses, Lessons, GiftCards;
add Items table with (ItemId, ItemType, Title, Price, LanguageCode, SortOrder, etc.) fields.
This way it will also be possible to add reviews not only for Lessons, but for all possible items.
You will have to come up with the preferred way to keep fields for the Items details. Right now Courses and Lessons share a lot of fields, therefore it might be reasonable to move all of them into the new Items table, as such fields also seem to be valid for the GiftCards. And in case you have some specific details, like for GiftCards, you might add specific tables, like GiftCardItems with Items.id and a set of special fields not shared with other Item types.
A minor note: I would split Users into a couple of tables, as I suppose that this table will contain both customers and support staff. This means that this table might grow big (depending on how many customers are expected). Maintaining so many fields in a single table might be problematic when the table grows in number of rows.
And I agree with Matt — it is difficult to tell anything without requirements.
It is really hard to tell without knowing the requirements from your client. Everything looks good but I can't really tell if it is all inclusive of what the client wants without their requirements documentation.

Modeling A Food Recipes Database

I'm trying to design a "recipe box" database and I'm having trouble getting it right. I have no idea if I'm on the right track or not, but here's what I have.
recipes(recipeID, etc.)
ingredient(ingredientID, etc.)
recipeIngredient(recipeID, ingredientID, amount)
category(categoryID, name)
recipeCategory(recipeID, categoryID, name, etc.)
So I have a couple of questions.
How am I doing so far? Is this design okay from what you all know?
How would I implement the preparation steps? Should I create an additional many-to-many implementation (something like preparation(prepID, etc.) and recipePrep(recipeID, prepID)) or just add the directions in the recipes table? I would like this to be an ordered list in the UI (webpage).
Thank you for your help.
Have you looked at any of the existing schemas out there, such as this one at DatabaseAnswers?
some thoughts:
You might want to use the same table for Recipe and Ingredient, with a type indicator column. The reason is that Recipes can contain sub-recipes. Let's call the combined table "Item". Then your RecipeIngredient table would look like
RecipeIngredient (RecipeId, ItemId, Amount).
I'd expect that the table would also have a sequencing column.
If you want to do any calculations with these recipes (e.g., scaling, nutritional analysis, production planning) then your quantities will need to specify a unit of measure. You can do that explicitly (by having a separate column for uofm) or you can use a text field for quantity and expect the user to enter values like "1 cup", or "2 tbs". If you take that approach, you'll need to make sure that what they enter is recognizable, and parse it every time you need to use it. This can become surprisingly complex, especially if you want to represent recipe yields in a formalized manner.
Assuming you want 1:M from recipe to category, I'm still not sure why your RecipeCategory table would have a Name column. I'd think that the name comes from the Category definition.
I agree with Dave that it's unlikely that you'd reuse preparation steps from recipe to recipe, and so a RecipePreparationSteps table (or something like it) would be more appropriate.
However, recipes are often presented with ingredients and instructions intermixed. eg.
Intro text
some ingredients.
prep instructions
some more ingredients
baking instructions.
To accommodate that, you need to cleverly set sequencing values in the RecipeIngredient and RecipePreparation step tables so that you can combine data from both in the proper order for presentation. Another approach would be, instead of these two tables, to use a "RecipeLine" table such that each row can represent either an instruction OR an ingredient. I think that may be what you were suggesting. Purists would frown on this kind of table overloading, but I'm not a purist.
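The "RecipeLine" idea might look roughly like this, sketched in Python with an in-memory SQLite database. The column names (Seq, LineType, Body, Amount, Uofm), the CHECK constraints, and the sample recipe are all assumptions for illustration. A fuller version would probably point ingredient lines at an Item table instead of storing the ingredient name as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Recipe (RecipeId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
-- Each row is either an ingredient or an instruction; Seq fixes the order.
CREATE TABLE RecipeLine (
    RecipeId INTEGER NOT NULL REFERENCES Recipe(RecipeId),
    Seq      INTEGER NOT NULL,
    LineType TEXT NOT NULL CHECK (LineType IN ('ingredient', 'instruction')),
    Body     TEXT NOT NULL,
    Amount   REAL,
    Uofm     TEXT,
    PRIMARY KEY (RecipeId, Seq),
    -- Only ingredient lines may carry an amount and a unit of measure.
    CHECK (LineType = 'ingredient' OR (Amount IS NULL AND Uofm IS NULL))
);
""")

conn.execute("INSERT INTO Recipe VALUES (1, 'Example cake')")
# Inserted out of order on purpose; Seq decides the presentation order.
conn.executemany("INSERT INTO RecipeLine VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 40, "instruction", "Mix the dry ingredients.", None, None),
    (1, 10, "instruction", "Preheat the oven.", None, None),
    (1, 20, "ingredient", "flour", 2.0, "cup"),
    (1, 60, "instruction", "Combine and bake.", None, None),
    (1, 30, "ingredient", "sugar", 1.0, "cup"),
    (1, 50, "ingredient", "milk", 1.5, "cup"),
])

rows = conn.execute("""
    SELECT LineType, Body, Amount, Uofm
    FROM RecipeLine
    WHERE RecipeId = 1
    ORDER BY Seq
""").fetchall()

# Render ingredients as "<amount> <unit> <name>", instructions as-is.
lines = []
for line_type, body, amount, uofm in rows:
    if line_type == "ingredient":
        lines.append(f"{amount:g} {uofm} {body}")
    else:
        lines.append(body)

print("\n".join(lines))
```

Leaving gaps in Seq (10, 20, 30, ...) is one way to make room for inserting a line later without renumbering the rest.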
This is a topic I happen to know a lot about, so ask anything.
Looks like a good start. A few thoughts:
Don't see a need for a recipeCategory table. One-to-many between recipe and category should do fine.
A PreparationSteps table should contain 1-n steps for each recipe. I wouldn't try to reuse steps between recipes.

Designing an 'Order' schema in which there are disparate product definition tables

This is a scenario I've seen in multiple places over the years; I'm wondering if anyone else has run across a better solution than I have...
My company sells a relatively small number of products, however the products we sell are highly specialized (i.e. in order to select a given product, a significant number of details must be provided about it). The problem is that while the amount of detail required to choose a given product is relatively constant, the kinds of details required vary greatly between products. For instance:
Product X might have identifying characteristics like (hypothetically)
'Color',
'Material'
'Mean Time to Failure'
but Product Y might have characteristics
'Thickness',
'Diameter'
'Power Source'
The problem (one of them, anyway) in creating an order system that utilizes both Product X and Product Y is that an Order Line has to refer, at some point, to what it is "selling". Since Product X and Product Y are defined in two different tables - and denormalization of products using a wide table scheme is not an option (the product definitions are quite deep) - it's difficult to see a clear way to define the Order Line in such a way that order entry, editing and reporting are practical.
Things I've Tried In the Past
(1) Create a parent table called 'Product' with columns common to Product X and Product Y, then use 'Product' as the reference for the OrderLine table, and create a FK relationship with 'Product' as the primary side between the tables for Product X and Product Y. This basically places the 'Product' table as the parent of both OrderLine and all the disparate product tables (e.g. Products X and Y). It works fine for order entry, but causes problems with order reporting or editing since the 'Product' record has to track what kind of product it is in order to determine how to join 'Product' to its more detailed child, Product X or Product Y. Advantages: key relationships are preserved. Disadvantages: reporting, editing at the order line/product level.
(2) Create 'Product Type' and 'Product Key' columns at the Order Line level, then use some CASE logic or views to determine the customized product to which the line refers. This is similar to item (1), without the common 'Product' table. I consider it a more "quick and dirty" solution, since it completely does away with foreign keys between order lines and their product definitions. Advantages: quick solution. Disadvantages: same as item (1), plus lost RI.
(3) Homogenize the product definitions by creating a common header table and using key/value pairs for the customized attributes (OrderLine [n] <- [1] Product [1] <- [n] ProductAttribute). Advantages: key relationships are preserved; no ambiguity about product definition. Disadvantages: reporting (retrieving a list of products with their attributes, for instance), data typing of attribute values, performance (fetching product attributes, inserting or updating product attributes etc.)
If anyone else has tried a different strategy with more success, I'd sure like to hear about it.
Thank you.
The first solution you describe is the best if you want to maintain data integrity, and if you have relatively few product types and seldom add new product types. This is the design I'd choose in your situation. Reporting is complex only if your reports need the product-specific attributes. If your reports need only the attributes in the common Products table, it's fine.
The second solution you describe is called "Polymorphic Associations" and it's no good. Your "foreign key" isn't a real foreign key, so you can't use a DRI constraint to ensure data integrity. OO polymorphism doesn't have an analog in the relational model.
The third solution you describe, involving storing an attribute name as a string, is a design called "Entity-Attribute-Value" and you can tell this is a painful and expensive solution. There's no way to ensure data integrity, no way to make one attribute NOT NULL, no way to make sure a given product has a certain set of attributes. No way to restrict one attribute against a lookup table. Many types of aggregate queries become impossible to do in SQL, so you have to write lots of application code to do reports. Use the EAV design only if you must, for instance if you have an unlimited number of product types, the list of attributes may be different on every row, and your schema must accommodate new product types frequently, without code or schema changes.
Another solution is "Single-Table Inheritance." This uses an extremely wide table with a column for every attribute of every product. Leave NULLs in columns that are irrelevant to the product on a given row. This effectively means you can't declare an attribute as NOT NULL (unless it's in the group common to all products). Also, most RDBMS products have a limit on the number of columns in a single table, or the overall width in bytes of a row. So you're limited in the number of product types you can represent this way.
Hybrid solutions exist, for instance you can store common attributes normally, in columns, but product-specific attributes in an Entity-Attribute-Value table. Or you could store product-specific attributes in some other structured way, like XML or YAML, in a BLOB column of the Products table. But these hybrid solutions suffer because now some attributes must be fetched in a different way.
The ultimate solution for situations like this is to use a semantic data model, using RDF instead of a relational database. This shares some characteristics with EAV but it's much more ambitious. All metadata is stored in the same way as data, so every object is self-describing and you can query the list of attributes for a given product just as you would query data. Special products exist, such as Jena or Sesame, implementing this data model and a special query language that is different than SQL.
There's no magic bullet that you've overlooked.
You have what are sometimes called "disjoint subclasses". There's the superclass (Product) with two subclasses (ProductX) and (ProductY). This is a problem that -- for relational databases -- is Really Hard. [Another hard problem is Bill of Materials. Another hard problem is Graphs of Nodes and Arcs.]
You really want polymorphism, where OrderLine is linked to a subclass of Product, but doesn't know (or care) which specific subclass.
You don't have too many choices for modeling. You've pretty much identified the bad features of each. This is pretty much the whole universe of choices.
Push everything up to the superclass. That's the uni-table approach where you have Product with a discriminator (type="X" and type="Y") and a million columns. The columns of Product are the union of columns in ProductX and ProductY. There will be nulls all over the place because of unused columns.
Push everything down into the subclasses. In this case, you'll need a view which is the union of ProductX and ProductY. That view is what's joined to create a complete order. This is like the first solution, except it's built dynamically and doesn't optimize well.
Join Superclass instance to subclass instance. In this case, the Product table is the intersection of ProductX and ProductY columns. Each Product has a reference to a key either in ProductX or ProductY.
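As an illustration of the second choice (everything pushed down into the subclasses, with a view that unions them), here is a minimal sketch in Python with an in-memory SQLite database. The view name AllProduct, the columns, and the sample rows are assumptions for the example. Note that OrderLine can only carry a (product_type, product_key) pair here, with no real foreign key behind it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductX (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL,
    color TEXT, material TEXT
);
CREATE TABLE ProductY (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL,
    thickness_mm REAL, diameter_mm REAL
);
-- The view exposes only the shared columns plus a discriminator.
CREATE VIEW AllProduct AS
    SELECT 'X' AS product_type, id AS product_key, name FROM ProductX
    UNION ALL
    SELECT 'Y' AS product_type, id AS product_key, name FROM ProductY;
-- No FK is possible on (product_type, product_key): a view cannot be
-- the target of a foreign key constraint.
CREATE TABLE OrderLine (
    order_id     INTEGER NOT NULL,
    line_no      INTEGER NOT NULL,
    product_type TEXT NOT NULL,
    product_key  INTEGER NOT NULL,
    quantity     INTEGER NOT NULL,
    PRIMARY KEY (order_id, line_no)
);
""")

# Both products have id 1, which is why the discriminator is needed.
conn.execute("INSERT INTO ProductX VALUES (1, 'Widget X', 'red', 'steel')")
conn.execute("INSERT INTO ProductY VALUES (1, 'Gadget Y', 2.5, 40.0)")
conn.executemany("INSERT INTO OrderLine VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, "X", 1, 3), (1, 2, "Y", 1, 5)])

order_rows = conn.execute("""
    SELECT ol.line_no, p.product_type, p.name, ol.quantity
    FROM OrderLine ol
    JOIN AllProduct p
      ON p.product_type = ol.product_type
     AND p.product_key  = ol.product_key
    WHERE ol.order_id = 1
    ORDER BY ol.line_no
""").fetchall()

for row in order_rows:
    print(row)
```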
There isn't really a bold new direction. In the relational database world-view, those are the choices.
If, however, you elect to change the way you build application software, you can get out of this trap. If the application is object-oriented, you can do everything with first-class, polymorphic objects. You have to map from the kind-of-clunky relational processing; this happens twice: once when you fetch stuff from the database to create objects and once when you persist objects back to the database.
The advantage is that you can describe your processing succinctly and correctly. As objects, with subclass relationships.
The disadvantage is that your SQL devolves to simplistic bulk fetches, updates and inserts.
This becomes an advantage when the SQL is isolated into an ORM layer and managed as a kind of trivial implementation detail. Java programmers use iBatis (or Hibernate or TopLink or Cocoon), Python programmers use SQLAlchemy or SQLObject. The ORM does the database fetches and saves; your application directly manipulates Orders, Lines and Products.
This might get you started. It will need some refinement
Table Product ( id PK, name, price, units_per_package)
Table Product_Attribs (id FK ref Product, AttribName, AttribValue)
Which would allow you to attach a list of attributes to the products. -- This is essentially your option 3
If you know a max number of attributes, you could go
Table Product (id PK, name, price, units_per_package, attrName_1, attrValue_1 ...)
Which would of course de-normalize the database, but make queries easier.
I prefer the first option because
It supports an arbitrary number of attributes.
Attribute names can be stored in another table, and referential integrity enforced so that those damn Canadians don't stick a "colour" in there and break reporting.
Does your product line ever change?
If it does, then creating a table per product will cost you dearly, and the key/value pairs idea will serve you well. That's the kind of direction down which I am naturally drawn.
I would create tables like this:
Attribute(attribute_id, description, is_listed)
-- contains values like "colour", "width", "power source", etc.
-- "is_listed" tells us if we can get a list of valid values:
AttributeValue(attribute_id, value)
-- lists of valid values for different attributes.
Product (product_id, description)
ProductAttribute (product_id, attribute_id)
-- tells us which attributes apply to which products
Order (order_id, etc)
OrderLine (order_id, order_line_id, product_id)
OrderLineProductAttributeValue (order_line_id, attribute_id, value)
-- tells us things like: order line 999 has "colour" of "blue"
The SQL to pull this together is not trivial, but it's not too complex either... and most of it will be write once and keep (either in stored procedures or your data access layer).
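For a sense of what "not trivial, but not too complex" means, here is a sketch of these tables and two typical queries, in Python with an in-memory SQLite database. Table and column names follow the outline above; the Order table is left out for brevity, and the column types, keys, and sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Attribute (
    attribute_id INTEGER PRIMARY KEY,
    description  TEXT NOT NULL UNIQUE,
    is_listed    INTEGER NOT NULL
);
CREATE TABLE AttributeValue (
    attribute_id INTEGER NOT NULL REFERENCES Attribute(attribute_id),
    value        TEXT NOT NULL,
    PRIMARY KEY (attribute_id, value)
);
CREATE TABLE Product (
    product_id INTEGER PRIMARY KEY, description TEXT NOT NULL
);
CREATE TABLE ProductAttribute (
    product_id   INTEGER NOT NULL REFERENCES Product(product_id),
    attribute_id INTEGER NOT NULL REFERENCES Attribute(attribute_id),
    PRIMARY KEY (product_id, attribute_id)
);
CREATE TABLE OrderLine (
    order_id      INTEGER NOT NULL,
    order_line_id INTEGER PRIMARY KEY,
    product_id    INTEGER NOT NULL REFERENCES Product(product_id)
);
CREATE TABLE OrderLineProductAttributeValue (
    order_line_id INTEGER NOT NULL REFERENCES OrderLine(order_line_id),
    attribute_id  INTEGER NOT NULL REFERENCES Attribute(attribute_id),
    value         TEXT NOT NULL,
    PRIMARY KEY (order_line_id, attribute_id)
);
""")

conn.executemany("INSERT INTO Attribute VALUES (?, ?, ?)",
                 [(1, "colour", 1), (2, "width", 0)])
conn.executemany("INSERT INTO AttributeValue VALUES (?, ?)",
                 [(1, "blue"), (1, "red")])
conn.execute("INSERT INTO Product VALUES (10, 'Widget')")
conn.executemany("INSERT INTO ProductAttribute VALUES (?, ?)",
                 [(10, 1), (10, 2)])
conn.executemany("INSERT INTO OrderLine VALUES (?, ?, ?)",
                 [(500, 999, 10), (500, 1000, 10)])
conn.executemany(
    "INSERT INTO OrderLineProductAttributeValue VALUES (?, ?, ?)",
    [(999, 1, "blue"), (999, 2, "20"), (1000, 1, "purple"), (1000, 2, "35")],
)

# All attributes that apply to an order line's product, with chosen values.
line_attrs = conn.execute("""
    SELECT a.description, v.value
    FROM OrderLine ol
    JOIN ProductAttribute pa ON pa.product_id = ol.product_id
    JOIN Attribute a         ON a.attribute_id = pa.attribute_id
    LEFT JOIN OrderLineProductAttributeValue v
           ON v.order_line_id = ol.order_line_id
          AND v.attribute_id  = pa.attribute_id
    WHERE ol.order_line_id = ?
    ORDER BY a.description
""", (999,)).fetchall()

# Values of "listed" attributes that are missing from the list of valid values.
invalid = conn.execute("""
    SELECT v.order_line_id, a.description, v.value
    FROM OrderLineProductAttributeValue v
    JOIN Attribute a ON a.attribute_id = v.attribute_id
    LEFT JOIN AttributeValue av
           ON av.attribute_id = v.attribute_id AND av.value = v.value
    WHERE a.is_listed = 1 AND av.value IS NULL
""").fetchall()

print(line_attrs)
print(invalid)
```

The second query shows the price of this design: because every value is stored as text in one column, "is this value allowed?" becomes a check you run yourself rather than a constraint the database enforces.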
We do similar things with a number of types of entity.
Chris and AJ: Thanks for your responses. The product line may change, but I would not term it "volatile".
The reason I dislike the third option is that it comes at the cost of metadata for the product attribute values. It essentially turns columns into rows, losing most of the advantages of the database column in the process (data type, default value, constraints, foreign key relationships etc.)
I've actually been involved in a past project where the product definition was done in this way. We essentially created a full product/product attribute definition system (data types, min/max occurrences, default values, 'required' flags, usage scenarios etc.) The system worked, ultimately, but came with a significant cost in overhead and performance (e.g. materialized views to visualize products, custom "smart" components to represent and validate data entry UI for product definition, another "smart" component to represent the product instance's customizable attributes on the order line, blahblahblah).
Again, thanks for your replies!
