Designing a data model that incorporates logical operators - database

I am new to data modeling and I'm having trouble coming up with a data model that can store logic.
The data model would be used to store location and marketing attributes.
When a customer visits one of the company's websites, they would enter in their zip code, and based on their location the attributes would be used to arrange the online catalog of items.
The catalog of items would be separate from the database, so the data model would only produce the output of attributes used to arrange the items. Each item in the catalog has attributes such as ItemNumber, Price, Condition, Manufacture, and marketing segments (Age:Adult, Education: College, Income:High, etc.).
**For example:**
**Input zip code**: 90210
**Output Attributes**: (ItemNumber:123456, Segment:HighIncome, Condition:New)
This example is saying for zip 90210, first show item #123456, followed by all of the items with the HighIncome segment, and then display all of the non-refurbished items.
So far I have 2 tables with a many to many relationship and I would like to add an additional table(s) so I can incorporate logic (AND & OR).
The first table would have location and other information about which of the company's sites the user is on.
Table Location(
Location_Unique_Identifier number
ZipCode varchar2
State varchar2
Site varchar2
..
)
The second table would have the attribute types (Manufacture, Price, Condition, etc.) and the attribute values (IBM, 10.00, Refurbished, etc.).
Table Attributes(
Attribute_Unique_Identifier number
Attribute_Type varchar2
Attribute_Value varchar2
..
..
)
In-between these two tables to break up the many to many relationship I would add the logic table. This table should allow me to output
item#123456 AND (item#768900 OR Condition:New)
The problem I am having with the logic table is trying to make it flexible enough to handle an unknown amount of AND/ORs and to handle the grouping.

This is a typical scenario: you JOIN two (or more) tables and then apply AND/OR/XOR or some other logical operation to the result.
The best choice is to build a materialized view that denormalizes the attributes from multiple tables into a single table-like object (the view).
In your case, the view may be:
table location_join_attributes{
number,
zipcode,
state,
site,
Manufacture,
Price,
Condition,
......
}
You would then run your logical statement against this table/view (modified from your example):
item#123456 OR (item#768900 AND Condition:New) AND (more condition)
Without this view, the operation first has to fetch all the records that have item#768900 and then filter against the second table to find which of them have Condition:New. That takes a long time, and with complex conditions the performance is terrible.
For fast queries, you should build secondary indexes on the columns you filter on.
On the scalability side, if your business logic changes, you can build a new view and discard the old one. The original tables do not change, which is another advantage of a materialized view.
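A minimal sketch of the denormalized-table idea, using Python's sqlite3 module. SQLite has no materialized views, so the sketch stands one in with CREATE TABLE ... AS SELECT. The location_attribute link table, the column names and the sample rows are assumptions made for illustration; they are not part of the schema in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (location_id INTEGER PRIMARY KEY, zipcode TEXT, state TEXT, site TEXT);
CREATE TABLE attributes (attribute_id INTEGER PRIMARY KEY, attribute_type TEXT, attribute_value TEXT);
-- Hypothetical link table resolving the many-to-many relationship.
CREATE TABLE location_attribute (location_id INTEGER, attribute_id INTEGER);

INSERT INTO location VALUES (1, '90210', 'CA', 'main'), (2, '10001', 'NY', 'main');
INSERT INTO attributes VALUES
    (1, 'ItemNumber', '123456'),
    (2, 'Segment', 'HighIncome'),
    (3, 'Condition', 'New'),
    (4, 'Condition', 'Refurbished');
INSERT INTO location_attribute VALUES (1, 1), (1, 2), (1, 3), (2, 4);

-- Stand-in for a materialized view: the join result stored as a table.
CREATE TABLE location_join_attributes AS
SELECT l.zipcode, l.state, l.site, a.attribute_type, a.attribute_value
FROM location l
JOIN location_attribute la ON la.location_id = l.location_id
JOIN attributes a ON a.attribute_id = la.attribute_id;

-- Secondary index on the columns used for filtering.
CREATE INDEX idx_lja_zip_type ON location_join_attributes (zipcode, attribute_type);
""")

# AND/OR grouping is expressed directly in the WHERE clause of one table.
rows = conn.execute("""
    SELECT attribute_type, attribute_value
    FROM location_join_attributes
    WHERE zipcode = ?
      AND (attribute_type = 'ItemNumber'
           OR (attribute_type = 'Condition' AND attribute_value = 'New'))
    ORDER BY attribute_type
""", ("90210",)).fetchall()
print(rows)
```

Because the stored table is only a snapshot, it would have to be rebuilt whenever the source tables change; a database with real materialized views handles that refresh for you.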

Related

Dynamic columns in database tables vs EAV

I'm trying to decide which way to go if I have an app that needs to be able to change the db schema based on the user input.
For example, if I have a "car" object that contains car properties, like year, model, # of doors etc, how do I store it in the DB in such a way, that the user should be able to add new properties?
I read about EAV tables and they seem right for this thing, but the problem is that queries will get pretty complicated when I try to get a list of cars filtered by a set of properties.
Could I generate the tables dynamically instead? I see that SQLite has support for ADD COLUMN, but how fast is it when the table reaches many records? And it looks like there's no way to remove a column. I have to create a new table without the column I want to remove, and copy the data from the old table. That's certainly slow on large tables :(
I will assume that SQLite (or another relational DBMS) is a requirement.
EAVs
I have worked with EAVs and generic data models, and I can say that the data model is very messy and hard to work with in the long run.
Let's say that you design a data model with three tables: entities, attributes, and _entity_attributes_:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attributes
(attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE entity_attributes
(entity_id INTEGER, attribute_id INTEGER, value TEXT,
PRIMARY KEY(entity_id, attribute_id));
In this model, the entities table will hold your cars, the attributes table will hold the attributes that you can associate to your cars (brand, model, color, ...) and its type (text, number, date, ...), and the _entity_attributes_ will hold the values of the attributes for a given entity (for example "red").
Take into account that with this model you can store as many entities as you want and they can be cars, houses, computers, dogs or whatever (ok, maybe you need a new field on entities, but it's enough for the example).
INSERTs are pretty straightforward. You only need to insert a new object, a bunch of attributes and their relations. For example, to insert a new entity with 3 attributes you will need to execute 7 inserts (one for the entity, three more for the attributes, and three more for the relations).
When you want to perform an UPDATE, you will need to know which entity you want to update, and update the desired attribute by joining through the relation between the entity and its attributes.
When you want to perform a DELETE, you will also need to know which entity you want to delete, delete its attributes, delete the relation between your entity and its attributes and then delete the entity.
But when you want to perform a SELECT the thing becomes nasty (you need to write really difficult queries) and the performance drops horribly.
Imagine a data model to store car entities and its properties as in your example (say that we want to store brand and model). A SELECT to query all your records will be
SELECT brand, model FROM cars;
If you design a generic data model as in the example, the SELECT to query all your stored cars will be really difficult to write and will involve a 3 table join. The query will perform really badly.
Also, think about the definition of your attributes. All your attributes are stored as TEXT, and this can be a problem. What if somebody makes a mistake and stores "red" as a price?
Indexes are another thing that you could not benefit from (or at least not as much as would be desirable), and they are very necessary as the stored data grows.
As you say, the main concern as a developer is that the queries are really hard to write, hard to test and hard to maintain (how much would a client have to pay to buy all red, 1980, Pontiac Firebirds that you have?), and will perform very poorly when the data volume increases.
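To make the point concrete, here is a small sketch in Python (sqlite3) that asks the three-table EAV schema above for all red 1980 Pontiac Firebirds. The sample rows and the query-building loop are assumptions made for illustration; the thing to notice is that every filtered attribute costs two more joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (entity_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attributes (attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE entity_attributes (
    entity_id INTEGER, attribute_id INTEGER, value TEXT,
    PRIMARY KEY (entity_id, attribute_id));

INSERT INTO entities VALUES (1, 'car A'), (2, 'car B');
INSERT INTO attributes VALUES
    (1, 'brand', 'text'), (2, 'model', 'text'),
    (3, 'color', 'text'), (4, 'year', 'number');
INSERT INTO entity_attributes VALUES
    (1, 1, 'Pontiac'), (1, 2, 'Firebird'), (1, 3, 'red'),  (1, 4, '1980'),
    (2, 1, 'Pontiac'), (2, 2, 'Firebird'), (2, 3, 'blue'), (2, 4, '1980');
""")

# Each filtered attribute needs its own pair of joins (value row + attribute name).
wanted = {"brand": "Pontiac", "model": "Firebird", "color": "red", "year": "1980"}
joins, params = [], []
for i, (attr, value) in enumerate(wanted.items()):
    joins.append(
        f"JOIN entity_attributes ea{i} ON ea{i}.entity_id = e.entity_id "
        f"JOIN attributes a{i} ON a{i}.attribute_id = ea{i}.attribute_id "
        f"AND a{i}.name = ? AND ea{i}.value = ?"
    )
    params += [attr, value]

sql = "SELECT e.name FROM entities e " + " ".join(joins)
red_firebirds = [row[0] for row in conn.execute(sql, params)]
print(red_firebirds)
```

With a plain cars table the same question would be a single WHERE clause on four columns. Note also that the year is compared as text here, because every EAV value is stored as TEXT.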
The only advantage of using EAVs is that you can store virtually everything with the same model, but it is like having a box full of stuff where you want to find one concrete, small item.
Also, to use an argument from authority, I will say that Tom Kyte argues strongly against generic data models:
http://tkyte.blogspot.com.es/2009/01/this-should-be-fun-to-watch.html
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
Dynamic columns in database tables
On the other hand, you can, as you say, generate the tables dynamically, adding (and removing) columns when needed. In this case, you can, for example create a car table with the basic attributes that you know that you will use and then add columns dynamically when you need them (for example the number of exhausts).
The disadvantage is that you will need to add columns to an existing table and (maybe) build new indexes.
This model, as you say, also has another problem when working with SQLite as there's no direct way to delete columns and you will need to do this as stated on http://www.sqlite.org/faq.html#q11
BEGIN TRANSACTION;
CREATE TEMPORARY TABLE t1_backup(a,b);
INSERT INTO t1_backup SELECT a,b FROM t1;
DROP TABLE t1;
CREATE TABLE t1(a,b);
INSERT INTO t1 SELECT a,b FROM t1_backup;
DROP TABLE t1_backup;
COMMIT;
Anyway, I don't really think that you will need to delete columns (or at least it will be a very rare scenario). Maybe someone adds the number of doors as a column, and stores a car with this property. You will need to ensure that any of your cars have this property to prevent from losing data before deleting the column. But this, of course depends on your concrete scenario.
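The FAQ recipe can be wrapped in a small helper. The following is a sketch in Python (sqlite3); the function name drop_columns and the cars table are assumptions made for illustration. Like the recipe it copies, it recreates the table without column types, constraints or indexes.

```python
import sqlite3

def drop_columns(conn, table, keep):
    """Rebuild `table` with only the columns in `keep`.

    Like the FAQ recipe, this loses column types, constraints and indexes.
    """
    for name in (table, *keep):
        if not name.isidentifier():  # names are interpolated into SQL below
            raise ValueError(f"unsafe identifier: {name!r}")
    cols = ", ".join(keep)
    conn.execute("BEGIN")
    try:
        conn.execute(f"CREATE TEMPORARY TABLE {table}_backup({cols})")
        conn.execute(f"INSERT INTO {table}_backup SELECT {cols} FROM {table}")
        conn.execute(f"DROP TABLE {table}")
        conn.execute(f"CREATE TABLE {table}({cols})")
        conn.execute(f"INSERT INTO {table} SELECT {cols} FROM {table}_backup")
        conn.execute(f"DROP TABLE {table}_backup")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None leaves transaction control to the explicit BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE cars(brand, model, doors)")
conn.execute("INSERT INTO cars VALUES ('Pontiac', 'Firebird', 2)")
drop_columns(conn, "cars", ["brand", "model"])

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(cars)")]
print(columns)
```

Newer SQLite releases (3.35.0 and later) also accept ALTER TABLE ... DROP COLUMN, with some restrictions, so the rebuild is mainly needed on older versions.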
Another drawback of this solution is that you will need a table for each entity you want to store (one table to store cars, another to store houses, and so on...).
Another option (pseudo-generic model)
A third option could be to have a pseudo-generic model, with a table having columns to store id, name, and type of the entity, and a given (enough) number of generic columns to store the attributes of your entities.
Let's say that you create a table like this:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY,
name TEXT,
type TEXT,
attribute1 TEXT,
attribute2 TEXT,
...
attributeN TEXT
);
In this table you can store any entity (cars, houses, dogs) because you have a type field and you can store as many attributes for each entity as you want (N in this case).
If you need to know what attribute37 stands for when the type is "car", you would need to add another table that relates the types and attributes with the description of the attributes.
And what if you find that one of your entities needs more attributes? Then simply add new columns to the entities table (attributeN+1, ...).
In this case, the attributes are always stored as TEXT (as in EAVs), with its disadvantages.
But you can use indexes, the queries are really simple, the model is generic enough for your case, and in general, I think that the benefits of this model are greater than the drawbacks.
Hope it helps.
Follow up from the comments:
With the pseudo-generic model your entities table will have a lot of columns. From the documentation (https://www.sqlite.org/limits.html), the default setting for SQLITE_MAX_COLUMN is 2000. I have worked with SQLite tables with over 100 columns with great performance, so 40 columns shouldn't be a big deal for SQLite.
As you say, most of your columns will be empty for most of your records, and you will need to index all of your columns for performance, but you can use partial indexes (https://www.sqlite.org/partialindex.html). This way, your indexes will be small, even with a high number of rows, and the selectivity of each index will be great.
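A brief sketch of such partial indexes, in Python (sqlite3). The two-attribute table, the index names and the sample rows are assumptions made for illustration. Each index only contains the rows where its column is not NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (
    entity_id INTEGER PRIMARY KEY, name TEXT, type TEXT,
    attribute1 TEXT, attribute2 TEXT);

-- Only rows that actually have a value are indexed.
CREATE INDEX idx_entities_attribute1 ON entities (attribute1)
    WHERE attribute1 IS NOT NULL;
CREATE INDEX idx_entities_attribute2 ON entities (attribute2)
    WHERE attribute2 IS NOT NULL;

INSERT INTO entities (name, type, attribute1, attribute2) VALUES
    ('car A', 'car', 'Pontiac', NULL),
    ('car B', 'car', NULL, 'red'),
    ('house A', 'house', NULL, NULL);
""")

pontiacs = [row[0] for row in conn.execute(
    "SELECT name FROM entities WHERE attribute1 = ?", ("Pontiac",))]
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM entities WHERE attribute1 = 'Pontiac'"
).fetchall()
print(pontiacs)
print(plan)  # the planner's report; exact wording varies by SQLite version
```

An equality test on attribute1 can only match non-NULL values, which is what allows the planner to consider the partial index for that query.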
If you implement an EAV with only two tables, the number of joins between tables will be less than in my example, but the queries will still be hard to write and maintain, and you will need to do several (outer) joins to extract data, which will reduce performance, even with a great index, when you store a lot of data. For example, imagine that you want to get the brand, model and color of your cars. Your SELECT would look like this:
SELECT e.name, a1.value brand, a2.value model, a3.value color
FROM entities e
LEFT JOIN entity_attributes a1 ON (e.entity_id = a1.entity_id and a1.attribute_id = 'brand')
LEFT JOIN entity_attributes a2 ON (e.entity_id = a2.entity_id and a2.attribute_id = 'model')
LEFT JOIN entity_attributes a3 ON (e.entity_id = a3.entity_id and a3.attribute_id = 'color');
As you see, you would need one (left) outer join for each attribute you want to query (or filter). With the pseudo-generic model the query will be like this:
SELECT name, attribute1 brand, attribute7 model, attribute35 color
FROM entities;
Also, take into account the potential size of your _entity_attributes_ table. If you can potentially have 40 attributes for each entity, let's say that you have 20 not null for each of them. If you have 10,000 entities, your _entity_attributes_ table will have 200,000 rows, and you will be querying it using one huge index. With the pseudo-generic model you will have 10,000 rows and one small index for each column.
It all depends on the way in which your application needs to reason about the data.
If you need to run queries which need to do complicated comparisons or joins on data whose schema you don't know in advance, SQL and the relational model are rarely a good fit.
For instance, if your users can set up arbitrary data entities (like "car" in your example), and then want to find cars whose engine capacity is greater than 2000cc, with at least 3 doors, made after 2010, whose current owner is part of the "little old ladies" table, I'm not aware of an elegant way of doing this in SQL.
However, you could achieve something like this using XML, XPath etc.
If your application has a set of data entities with known attributes, but users can extend those attributes (a common requirement for products like bug trackers), "add column" is a good solution. However, you may need to invent a custom query language to allow users to query those columns. For instance, Atlassian Jira's bug tracking solution has JQL, a SQL-like language for querying bugs.
EAV is great if your task is to store and then show data. However, even moderately complex queries become very hard in an EAV schema - imagine how you'd execute my made up example above.
For your use case, a document oriented database like MongoDB would do great.
Another option that I haven't seen mentioned above is to use denormalized tables for the extended attributes. This is a combination of the pseudo-generic model and the dynamic columns in database tables. Instead of adding columns to existing tables, you add columns or groups of columns into new tables with FK indexes to the source table. Of course, you'll want a good naming convention (car, car_attributes_door, car_attributes_littleOldLadies)
Your selection problem becomes that of applying a LEFT OUTER JOIN to include the extended attributes that you want to include.
Slower than normalized, but not as slow as EAV.
Adding new extended attributes becomes a problem of adding a new table.
Harder than EAV, easier/faster than modifying table schema.
Deleting attributes becomes a problem of dropping whole tables.
Easier/faster than modifying table schema.
These new attributes can be strongly typed.
As good as modifying table schema, faster than EAV or generic columns.
The biggest advantage to this approach that I can see is that deleting unused attributes is quite easy compared to any of the others via a single DROP TABLE command. You also have the option to later normalize often-used attributes into larger groups or into the main table using a single ALTER TABLE process rather than one for each new column you were adding as you added them, which helps with the slow LEFT OUTER JOIN queries.
The biggest disadvantage is that you're cluttering up your table list, which admittedly is often not a trivial concern. That, and I'm not sure how much better LEFT OUTER JOINs actually perform than EAV table joins. It's definitely closer to EAV join performance than normalized table performance.
If you're doing a lot of comparisons/filters of values that benefit greatly from strongly typed columns, but you add/remove these columns frequently enough to make modifying a huge normalized table intractable, this seems like a good compromise.
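A minimal sketch of this extension-table approach, in Python (sqlite3). The car_attributes_engine table, the column names and the sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE car (car_id INTEGER PRIMARY KEY, brand TEXT, model TEXT);
-- One strongly typed extension table per attribute group.
CREATE TABLE car_attributes_door (
    car_id INTEGER PRIMARY KEY REFERENCES car(car_id),
    door_count INTEGER NOT NULL);
CREATE TABLE car_attributes_engine (
    car_id INTEGER PRIMARY KEY REFERENCES car(car_id),
    capacity_cc INTEGER NOT NULL);

INSERT INTO car VALUES (1, 'Pontiac', 'Firebird'), (2, 'Fiat', '500');
INSERT INTO car_attributes_door VALUES (1, 2), (2, 3);
INSERT INTO car_attributes_engine VALUES (1, 4900);
""")

# Cars without a row in an extension table simply get NULL for that attribute.
rows = conn.execute("""
    SELECT c.brand, d.door_count, e.capacity_cc
    FROM car c
    LEFT OUTER JOIN car_attributes_door d ON d.car_id = c.car_id
    LEFT OUTER JOIN car_attributes_engine e ON e.car_id = c.car_id
    ORDER BY c.car_id
""").fetchall()
print(rows)

# Removing an attribute group is a single statement.
conn.execute("DROP TABLE car_attributes_engine")
remaining = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(remaining)
```

Because door_count and capacity_cc are real INTEGER columns, range filters such as capacity_cc > 2000 compare numbers rather than text.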
I would try EAV.
Adding columns based on user input doesn't sound nice to me, and you can quickly run out of capacity. Queries on a very flat table can also be a problem. Do you want to create hundreds of indexes?
Instead of writing everything to one table, I would store as many common properties as possible (price, name, color, ...) in the main table and the less common properties in an "extra" attributes table. You can always rebalance them later with a little effort.
EAV can perform well for small to medium-sized data sets. Since you want to use SQLite, I guess it won't be a problem.
You may also want to avoid "over"-normalizing your data. With the cheap storage we currently have, you can use one table to store all "extra" attributes, instead of two:
ent_id, ent_name, ...
ent_id, attr_name, attr_type, attr_value ...
People against EAV will say its performance is poor on large databases. It certainly won't perform as well as a normalized structure, but you don't want to change the structure of a 3TB table either.
I have a low-quality but workable answer, inspired by HTML tags such as: <tag width="10px" height="10px" ... />
With this dirty approach you have just one column, a varchar(max) holding all properties (call it Props), and you store data in it like this:
Props
------------------------------------------------------------
Model:Model of car1|Year:2010|# of doors:4
Model:Model of car2|NewProp1:NewValue1|NewProp2:NewValue2
With this approach all the work moves into the business-layer code, using functions such as concatCustom, which takes an array and returns a string, and unconcatCustom, which takes a string and returns an array.
To cope with special characters like ':' and '|' appearing in the data, I suggest '#:#' and '#|#', or something even rarer, as the separators.
In a similar way, you can use a text or binary field and store XML data in the column.
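A sketch of the two helper functions in Python. The Python names concat_custom and unconcat_custom mirror the hypothetical concatCustom and unconcatCustom above, and the use of a dict instead of an array is an assumption made for illustration.

```python
PAIR_SEP = "#|#"   # between properties
KV_SEP = "#:#"     # between a name and its value

def unconcat_custom(packed):
    """Unpack a string produced by concat_custom back into a dict."""
    if not packed:
        return {}
    return dict(pair.split(KV_SEP, 1) for pair in packed.split(PAIR_SEP))

def concat_custom(props):
    """Pack a dict of string properties into one delimited string."""
    packed = PAIR_SEP.join(f"{name}{KV_SEP}{value}" for name, value in props.items())
    # Refuse data that would not survive a round trip through the separators.
    if unconcat_custom(packed) != props:
        raise ValueError("a name or value collides with the separators")
    return packed

packed = concat_custom({"Model": "Model of car1", "Year": "2010", "# of doors": "4"})
print(packed)
print(unconcat_custom(packed))
```

All filtering and validation still has to happen in application code, since the database only sees one opaque string per row.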

How can I create a hierarchy in SSAS?

I have the table order with following fields:
ID
Serial
Visitor
Branch
Company
Assume there are relations between Visitor, Branch and Company in the database, but every visitor can belong to more than one Branch. How can I create a hierarchy between these three fields for my order table?
How can I do that?
You would need to create a denormalised dimension table holding the distinct result of denormalising the order table. In this case, you would have many rows for the same visitor, one for each branch.
In your fact table, the activity record which would have BranchKey in the primary key, would reference this dimension. This obviously would be together with the VisitorKey...
Then in SSAS you would need to build the hierarchy, and set the relationships between the keys... When displaying this data in a client, such as excel, you would drag the hierarchy in the rows, and when expanding, data from your fact would fit in according to the visitors branch...
With regard to dimensions, it's important to set relationships between the attributes, as this will give you a massive performance gain when processing the dimension and the cube. Take a look at this article for help regarding that matter: http://www.bidn.com/blogs/DevinKnight/ssis/1099/ssas-defining-attribute-relationships-in-2005-and-2008. The same approach also applies to 2012.

Many tables to a single row in relational database

Consider we have a database that has a table, which is a record of a sale. You sell both products and services, so you also have a product and service table.
Each sale can either be a product or a service, which leaves the options for designing the database to be something like the following:
1. Add columns for each type, i.e. add Service_id and Product_id to Invoice_Row, both of which are nullable. If they're both null, it's an ad-hoc charge not relating to anything, but if one of them is set then it is a row relating to that type.
2. Add a weird string/id based system, for instance: Type_table, Type_id. This would be a string/varchar and integer respectively; the former would contain for example 'Service', and the latter the id within the Service table. This is obviously loose coupling and horrible, but is a way of solving it so long as you're only accessing the DB from code, as such.
3. Abstract out the concept of "something that is chargeable" with new tables, of which Product and Service are now specializations, and on the Invoice_Row table you would link to something like ChargeableEntity_id. However, the ChargeableEntity table here would essentially be redundant as it too would need some way to link to an abstract "backend" table, which brings us all the way back around to the same problem.
Which way would you choose, or what are the other alternatives to solving this problem?
What you are essentially asking is how to achieve polymorphism in a relational database. There are many approaches (as you yourself demonstrate) to this problem. One solution is to use "table per class" inheritance. In this setup, there will be a parent table (akin to your "chargeable item") that contains a unique identifier and the fields that are common to both products and services. There will be two child tables, products and services: each will contain the unique identifier for that entity and the fields specific to it.
One benefit to this approach over others is you don't end up with one table with many nullable columns that essentially becomes a dumping ground to describe anything ("schema-less").
One downside is as your inheritance hierarchy grows, the number of joins needed to grab all the data for an entity also grows.
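A minimal sketch of that parent/child layout, in Python (sqlite3). The table and column names (chargeable_item, stock_level, hourly and so on) and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
-- Parent table: the identifier plus the fields common to every chargeable thing.
CREATE TABLE chargeable_item (
    item_id INTEGER PRIMARY KEY,
    item_type TEXT NOT NULL CHECK (item_type IN ('product', 'service')),
    name TEXT NOT NULL,
    unit_price NUMERIC NOT NULL);
-- Child tables share the parent's key and add their own fields.
CREATE TABLE product (
    item_id INTEGER PRIMARY KEY REFERENCES chargeable_item(item_id),
    stock_level INTEGER NOT NULL);
CREATE TABLE service (
    item_id INTEGER PRIMARY KEY REFERENCES chargeable_item(item_id),
    hourly INTEGER NOT NULL);
CREATE TABLE invoice_row (
    row_id INTEGER PRIMARY KEY,
    item_id INTEGER REFERENCES chargeable_item(item_id),  -- NULL = ad-hoc charge
    quantity INTEGER NOT NULL);

INSERT INTO chargeable_item VALUES (1, 'product', 'Widget', 10), (2, 'service', 'Install', 50);
INSERT INTO product VALUES (1, 7);
INSERT INTO service VALUES (2, 1);
INSERT INTO invoice_row VALUES (1, 1, 3), (2, 2, 1), (3, NULL, 1);
""")

# Invoice rows only need the parent table for the common fields.
rows = conn.execute("""
    SELECT r.row_id, c.item_type, c.name, r.quantity * c.unit_price AS total
    FROM invoice_row r
    LEFT JOIN chargeable_item c ON c.item_id = r.item_id
    ORDER BY r.row_id
""").fetchall()
print(rows)

# A row pointing at a non-existent item is rejected by a real foreign key.
try:
    conn.execute("INSERT INTO invoice_row VALUES (4, 99, 1)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)
```

The join to product or service is only needed when a query wants the type-specific fields, which is where the extra joins mentioned above come in.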
I believe it depends on use case(s).
You could put the common columns in one table and put the product-specific and service-specific columns in their own tables. The catch here is that you need to join them.
Alternatively, you maintain two separate tables, one for Product and another for Service, and use application logic to determine which table to insert into. Getting all sales then essentially means a union of all products and all services.
I would personally go for the second approach, to avoid joins and to avoid inserting into two tables whenever a sale is made.

Storing Preferences/One-to-One Relationships in Database

What is the best way to store settings for certain objects in my database?
Method one: Using a single table
Table: Company {CompanyID, CompanyName, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
Method two: Using two tables
Table Company {CompanyID, CompanyName}
Table2 CompanySettings{CompanyID, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
I would take things a step further...
Table 1 - Company
CompanyID (int)
CompanyName (string)
Example
CompanyID 1
CompanyName "Swift Point"
Table 2 - Contact Types
ContactTypeID (int)
ContactType (string)
Example
ContactTypeID 1
ContactType "AutoEmail"
Table 3 Company Contact
CompanyID (int)
ContactTypeID (int)
Addressing (string)
Example
CompanyID 1
ContactTypeID 1
Addressing "name#address.blah"
This solution gives you extensibility as you won't need to add columns to cope with new contact types in the future.
SELECT
[company].CompanyID,
[company].CompanyName,
[contacttype].ContactTypeID,
[contacttype].ContactType,
[companycontact].Addressing
FROM
[company]
INNER JOIN
[companycontact] ON [companycontact].CompanyID = [company].CompanyID
INNER JOIN
[contacttype] ON [contacttype].ContactTypeID = [companycontact].ContactTypeID
This would give you multiple rows for each company. A row for "AutoEmail" a row for "AutoPrint" and maybe in the future a row for "ManualEmail", "AutoFax" or even "AutoTeleport".
Response to HLEM.
Yes, this is indeed the EAV model. It is useful where you want to have an extensible list of attributes with similar data. In this case, varying methods of contact with a string that represents the "address" of the contact.
If you didn't want to use the EAV model, you should next consider relational tables, rather than storing the data in flat tables. This is because this data will almost certainly extend.
Neither EAV model nor the relational model significantly slow queries. Joins are actually very fast, compared with (for example) a sort. Returning a record for a company with all of its associated contact types, or indeed a specific contact type would be very fast. I am working on a financial MS SQL database with millions of rows and similar data models and have no problem returning significant amounts of data in sub-second timings.
In terms of complexity, this isn't the most technical design in terms of database modelling and the concept of joining tables is most definitely below what I would consider to be "intermediate" level database development.
I would decide whether you need one or two tables based on the following criteria:
First, are you close to the record storage limit? Then definitely two tables.
Second, will you usually be querying the information you plan to put in the second table most of the time you query the first table? Then one table might make more sense. If you usually do not need the extended information, a separate (and less wide) table should improve performance on the main data queries.
Third, how strong a possibility is it that you will ever need multiple values? If it is one to one now, but something like email address or phone number that has a strong possibility of morphing into multiple rows, go ahead and make it a related table. If you know there is no chance or only a small chance, then it is OK to keep it as one table, assuming the table isn't too wide.
EAV tables look like they are nice and will save future work, but in reality they don't. Generally, if you need to add another type, you need to do future work to adjust queries etc. Writing a script to add a column takes all of five minutes; the other work will need to be done regardless of the structure. EAV tables are also very hard to query when you don't know how many records you will need to pull, because normally you want them on one line and will get the information by joining to the same table multiple times. This causes performance problems and locking, especially if this table is central to your design. Don't use this method.
It depends on whether you will ever need more information about a company. If you notice yourself adding fields like companyphonenumber1, companyphonenumber2, etc., then method 2 is better, as you would separate your entities and just reference a company ID. If you do not plan to make these changes and you feel that this table will never change, then method 1 is fine.
Usually, if you don't have data duplication then a single table is fine.
In your case you don't, so the first method is OK.
I use one table if I estimate the data from the "second" table will be used in more than 50% of my queries. I use two tables if I need multiple copies of the data (i.e. multiple phone numbers, email addresses, etc.).

Designing an 'Order' schema in which there are disparate product definition tables

This is a scenario I've seen in multiple places over the years; I'm wondering if anyone else has run across a better solution than I have...
My company sells a relatively small number of products, however the products we sell are highly specialized (i.e. in order to select a given product, a significant number of details must be provided about it). The problem is that while the amount of detail required to choose a given product is relatively constant, the kinds of details required vary greatly between products. For instance:
Product X might have identifying characteristics like (hypothetically)
'Color',
'Material'
'Mean Time to Failure'
but Product Y might have characteristics
'Thickness',
'Diameter'
'Power Source'
The problem (one of them, anyway) in creating an order system that utilizes both Product X and Product Y is that an Order Line has to refer, at some point, to what it is "selling". Since Product X and Product Y are defined in two different tables - and denormalization of products using a wide table scheme is not an option (the product definitions are quite deep) - it's difficult to see a clear way to define the Order Line in such a way that order entry, editing and reporting are practical.
Things I've Tried In the Past
1. Create a parent table called 'Product' with columns common to Product X and Product Y, then using 'Product' as the reference for the OrderLine table, and creating a FK relationship with 'Product' as the primary side between the tables for Product X and Product Y. This basically places the 'Product' table as the parent of both OrderLine and all the disparate product tables (e.g. Products X and Y). It works fine for order entry, but causes problems with order reporting or editing since the 'Product' record has to track what kind of product it is in order to determine how to join 'Product' to its more detailed child, Product X or Product Y. Advantages: key relationships are preserved. Disadvantages: reporting, editing at the order line/product level.
2. Create 'Product Type' and 'Product Key' columns at the Order Line level, then use some CASE logic or views to determine the customized product to which the line refers. This is similar to item (1), without the common 'Product' table. I consider it a more "quick and dirty" solution, since it completely does away with foreign keys between order lines and their product definitions. Advantages: quick solution. Disadvantages: same as item (1), plus lost RI.
3. Homogenize the product definitions by creating a common header table and using key/value pairs for the customized attributes (OrderLine [n] <- [1] Product [1] <- [n] ProductAttribute). Advantages: key relationships are preserved; no ambiguity about product definition. Disadvantages: reporting (retrieving a list of products with their attributes, for instance), data typing of attribute values, performance (fetching product attributes, inserting or updating product attributes etc.)
If anyone else has tried a different strategy with more success, I'd sure like to hear about it.
Thank you.
The first solution you describe is the best if you want to maintain data integrity, and if you have relatively few product types and seldom add new product types. This is the design I'd choose in your situation. Reporting is complex only if your reports need the product-specific attributes. If your reports need only the attributes in the common Products table, it's fine.
The second solution you describe is called "Polymorphic Associations" and it's no good. Your "foreign key" isn't a real foreign key, so you can't use a DRI constraint to ensure data integrity. OO polymorphism doesn't have an analog in the relational model.
The third solution you describe, involving storing an attribute name as a string, is a design called "Entity-Attribute-Value" and you can tell this is a painful and expensive solution. There's no way to ensure data integrity, no way to make one attribute NOT NULL, no way to make sure a given product has a certain set of attributes. No way to restrict one attribute against a lookup table. Many types of aggregate queries become impossible to do in SQL, so you have to write lots of application code to do reports. Use the EAV design only if you must, for instance if you have an unlimited number of product types, the list of attributes may be different on every row, and your schema must accommodate new product types frequently, without code or schema changes.
Another solution is "Single-Table Inheritance." This uses an extremely wide table with a column for every attribute of every product. Leave NULLs in columns that are irrelevant to the product on a given row. This effectively means you can't declare an attribute as NOT NULL (unless it's in the group common to all products). Also, most RDBMS products have a limit on the number of columns in a single table, or the overall width in bytes of a row. So you're limited in the number of product types you can represent this way.
Hybrid solutions exist, for instance you can store common attributes normally, in columns, but product-specific attributes in an Entity-Attribute-Value table. Or you could store product-specific attributes in some other structured way, like XML or YAML, in a BLOB column of the Products table. But these hybrid solutions suffer because now some attributes must be fetched in a different way from the others.
The ultimate solution for situations like this is to use a semantic data model, using RDF instead of a relational database. This shares some characteristics with EAV but it's much more ambitious. All metadata is stored in the same way as data, so every object is self-describing and you can query the list of attributes for a given product just as you would query data. Special products exist, such as Jena or Sesame, implementing this data model and a special query language that is different than SQL.
There's no magic bullet that you've overlooked.
You have what are sometimes called "disjoint subclasses". There's the superclass (Product) with two subclasses (ProductX) and (ProductY). This is a problem that -- for relational databases -- is Really Hard. [Another hard problem is Bill of Materials. Another hard problem is Graphs of Nodes and Arcs.]
You really want polymorphism, where OrderLine is linked to a subclass of Product, but doesn't know (or care) which specific subclass.
You don't have too many choices for modeling. You've pretty much identified the bad features of each. This is pretty much the whole universe of choices.
Push everything up to the superclass. That's the uni-table approach where you have Product with a discriminator (type="X" and type="Y") and a million columns. The columns of Product are the union of columns in ProductX and ProductY. There will be nulls all over the place because of unused columns.
Push everything down into the subclasses. In this case, you'll need a view which is the union of ProductX and ProductY. That view is what's joined to create a complete order. This is like the first solution, except it's built dynamically and doesn't optimize well.
Join Superclass instance to subclass instance. In this case, the Product table is the intersection of ProductX and ProductY columns. Each Product has a reference to a key either in ProductX or ProductY.
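As a sketch of the third choice (join superclass to subclass), here is one common variant in sqlite3 where each subclass table shares the Product primary key. The sharing of keys, the column names x_only and y_only, and the sample rows are assumptions for this example. OrderLine only references Product, so pricing an order never needs to know the subclass, and the subclass columns are joined in only when wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    id INTEGER PRIMARY KEY, type TEXT NOT NULL,
    name TEXT NOT NULL, price REAL NOT NULL);          -- common columns only
CREATE TABLE product_x (
    product_id INTEGER PRIMARY KEY REFERENCES product(id),
    x_only TEXT NOT NULL);                             -- NOT NULL is possible here
CREATE TABLE product_y (
    product_id INTEGER PRIMARY KEY REFERENCES product(id),
    y_only INTEGER NOT NULL);
CREATE TABLE order_line (
    id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES product(id),
    qty INTEGER NOT NULL);

INSERT INTO product VALUES (1, 'X', 'Widget', 10.0), (2, 'Y', 'Gadget', 25.0);
INSERT INTO product_x VALUES (1, 'red');
INSERT INTO product_y VALUES (2, 42);
INSERT INTO order_line VALUES (1, 1, 3), (2, 2, 1);
""")

# The order line does not know or care which subclass it points at.
order_totals = conn.execute(
    "SELECT ol.id, p.name, ol.qty * p.price"
    " FROM order_line ol JOIN product p ON p.id = ol.product_id"
    " ORDER BY ol.id"
).fetchall()

# Subclass-specific columns are pulled in with outer joins when needed.
details = conn.execute(
    "SELECT p.name, x.x_only, y.y_only FROM product p"
    " LEFT JOIN product_x x ON x.product_id = p.id"
    " LEFT JOIN product_y y ON y.product_id = p.id"
    " ORDER BY p.id"
).fetchall()

print(order_totals)
print(details)
```

The second query is effectively the uni-table of the first choice, rebuilt on demand.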
There isn't really a bold new direction. In the relational database world-view, those are the choices.
If, however, you elect to change the way you build application software, you can get out of this trap. If the application is object-oriented, you can do everything with first-class, polymorphic objects. You have to map from the kind-of-clunky relational processing; this happens twice: once when you fetch stuff from the database to create objects and once when you persist objects back to the database.
The advantage is that you can describe your processing succinctly and correctly. As objects, with subclass relationships.
The disadvantage is that your SQL devolves to simplistic bulk fetches, updates and inserts.
This becomes an advantage when the SQL is isolated into an ORM layer and managed as a kind of trivial implementation detail. Java programmers use iBatis (or Hibernate or TopLink or Cocoon); Python programmers use SQLAlchemy or SQLObject. The ORM does the database fetches and saves; your application directly manipulates Orders, Lines and Products.
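The idea can be sketched without any ORM library at all. The following is a hand-rolled stand-in for what a real ORM such as SQLAlchemy would do for you, not an example of any ORM's API. The classes, the color and voltage attributes, and the load_products function are all hypothetical. The SQL is a simplistic bulk fetch, and the interesting behaviour lives in polymorphic objects.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Product:
    id: int
    name: str
    price: float

    def describe(self) -> str:
        return self.name

@dataclass
class ProductX(Product):
    color: str = ""

    def describe(self) -> str:
        return f"{self.name} ({self.color})"

@dataclass
class ProductY(Product):
    voltage: int = 0

    def describe(self) -> str:
        return f"{self.name} ({self.voltage}V)"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    id INTEGER PRIMARY KEY, type TEXT NOT NULL, name TEXT NOT NULL,
    price REAL NOT NULL, color TEXT, voltage INTEGER);
INSERT INTO product VALUES
    (1, 'X', 'Widget', 10.0, 'red', NULL),
    (2, 'Y', 'Gadget', 25.0, NULL, 12);
""")

def load_products(conn):
    """Bulk fetch, then map each row to the right subclass via the discriminator."""
    products = []
    query = "SELECT id, type, name, price, color, voltage FROM product ORDER BY id"
    for pid, ptype, name, price, color, voltage in conn.execute(query):
        if ptype == "X":
            products.append(ProductX(pid, name, price, color))
        elif ptype == "Y":
            products.append(ProductY(pid, name, price, voltage))
        else:
            products.append(Product(pid, name, price))
    return products

# Calling code never inspects the type; each object knows how to describe itself.
descriptions = [p.describe() for p in load_products(conn)]
print(descriptions)
```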
This might get you started. It will need some refinement.
Table Product ( id PK, name, price, units_per_package)
Table Product_Attribs (id FK ref Product, AttribName, AttribValue)
This would allow you to attach a list of attributes to the products. It is essentially your option 3.
If you know the maximum number of attributes, you could go with
Table Product (id PK, name, price, units_per_package, attrName_1, attrValue_1 ...)
This would of course denormalize the database, but it would make queries easier.
I prefer the first option because:
It supports an arbitrary number of attributes.
Attribute names can be stored in another table, and referential integrity enforced so that those damn Canadians don't stick a "colour" in there and break reporting.
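The referential-integrity point can be sketched like this with sqlite3. The lookup table name attrib_name and the sample rows are assumptions added for the example; Product and Product_Attribs follow the tables above. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; other databases enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE product (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL,
    price REAL, units_per_package INTEGER);
CREATE TABLE attrib_name (name TEXT PRIMARY KEY);   -- the allowed attribute names
CREATE TABLE product_attribs (
    id INTEGER NOT NULL REFERENCES product(id),
    attrib_name TEXT NOT NULL REFERENCES attrib_name(name),
    attrib_value TEXT,
    PRIMARY KEY (id, attrib_name));

INSERT INTO product VALUES (1, 'Widget', 10.0, 12);
INSERT INTO attrib_name VALUES ('color');
INSERT INTO product_attribs VALUES (1, 'color', 'blue');
""")

# An attribute name that is not in the lookup table is rejected by the database.
try:
    conn.execute("INSERT INTO product_attribs VALUES (1, 'colour', 'red')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT COUNT(*) FROM product_attribs").fetchone()[0]
print(rejected, stored)
```

This only constrains the attribute names. The values are still free-form text, so the other EAV drawbacks remain.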
Does your product line ever change?
If it does, then creating a table per product will cost you dearly, and the key/value pairs idea will serve you well. That's the kind of direction down which I am naturally drawn.
I would create tables like this:
Attribute(attribute_id, description, is_listed)
-- contains values like "colour", "width", "power source", etc.
-- "is_listed" tells us if we can get a list of valid values:
AttributeValue(attribute_id, value)
-- lists of valid values for different attributes.
Product (product_id, description)
ProductAttribute (product_id, attribute_id)
-- tells us which attributes apply to which products
Order (order_id, etc)
OrderLine (order_id, order_line_id, product_id)
OrderLineProductAttributeValue (order_line_id, attribute_id, value)
-- tells us things like: order line 999 has "colour" of "blue"
The SQL to pull this together is not trivial, but it's not too complex either... and most of it will be write once and keep (either in stored procedures or your data access layer).
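As a rough idea of what that SQL could look like, here is a sketch of the tables above in sqlite3 with one query that lists every attribute of the ordered product alongside the value chosen on the order line. The snake_case table names, the column types, the use of "orders" instead of the reserved word Order, and the sample data are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY, description TEXT NOT NULL,
    is_listed INTEGER NOT NULL);
CREATE TABLE attribute_value (attribute_id INTEGER NOT NULL, value TEXT NOT NULL);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, description TEXT NOT NULL);
CREATE TABLE product_attribute (
    product_id INTEGER NOT NULL, attribute_id INTEGER NOT NULL);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
CREATE TABLE order_line (
    order_id INTEGER NOT NULL, order_line_id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL);
CREATE TABLE order_line_product_attribute_value (
    order_line_id INTEGER NOT NULL, attribute_id INTEGER NOT NULL, value TEXT);

INSERT INTO attribute VALUES (1, 'colour', 1), (2, 'width', 0);
INSERT INTO attribute_value VALUES (1, 'blue'), (1, 'red');
INSERT INTO product VALUES (10, 'desk');
INSERT INTO product_attribute VALUES (10, 1), (10, 2);
INSERT INTO orders VALUES (500);
INSERT INTO order_line VALUES (500, 999, 10);
INSERT INTO order_line_product_attribute_value VALUES
    (999, 1, 'blue'), (999, 2, '120');
""")

# One row per (order line, applicable attribute), with the chosen value if any.
rows = conn.execute("""
SELECT ol.order_line_id, p.description, a.description, v.value
FROM order_line ol
JOIN product p ON p.product_id = ol.product_id
JOIN product_attribute pa ON pa.product_id = p.product_id
JOIN attribute a ON a.attribute_id = pa.attribute_id
LEFT JOIN order_line_product_attribute_value v
       ON v.order_line_id = ol.order_line_id
      AND v.attribute_id = a.attribute_id
WHERE ol.order_id = ?
ORDER BY a.attribute_id
""", (500,)).fetchall()
print(rows)
```

The LEFT JOIN keeps attributes that apply to the product but have no value on the order line yet, which is handy for spotting incomplete lines.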
We do similar things with a number of types of entity.
Chris and AJ: Thanks for your responses. The product line may change, but I would not term it "volatile".
The reason I dislike the third option is that it comes at the cost of metadata for the product attribute values. It essentially turns columns into rows, losing most of the advantages of the database column in the process (data type, default value, constraints, foreign key relationships etc.)
I've actually been involved in a past project where the product definition was done in this way. We essentially created a full product/product attribute definition system (data types, min/max occurrences, default values, 'required' flags, usage scenarios etc.) The system worked, ultimately, but came with a significant cost in overhead and performance (e.g. materialized views to visualize products, custom "smart" components to represent and validate data entry UI for product definition, another "smart" component to represent the product instance's customizable attributes on the order line, blahblahblah).
Again, thanks for your replies!

Resources