Database design, huge number of parameters, denormalise?

Given the table tblProject. This has a myriad of properties - width, height, and so on. Dozens of them.
I'm adding a new module which lets you specify settings for your project for mobile devices. This is a 1-1 relationship, so all the mobile settings should be stored in tblProject. However, the list is getting huge, and there will be some ambiguity amongst properties (i.e. I will have to prefix all mobile fields with MOBILE so that Mobile_width isn't confused with width).
How bad is it to denormalise and store the mobile settings in another table? Or is there a better way to store the settings? The properties are becoming unwieldy and hard to modify/find in the table.

I want to respond to @Alexander Sobolev's suggestion and provide my own.
@Alexander Sobolev suggests an EAV model. This trades maximum flexibility for poor performance and complexity, as you need to join multiple times to get all values for an entity. The way you typically work around those issues is to keep all the entity metadata in memory (i.e. tblProperties) so you don't join to it at runtime, and to denormalize the values (i.e. tblProjectProperties) as a CLOB (e.g. XML) off the root table. Thus you only use the values table for querying and sorting, but not to actually retrieve the data. You also usually end up caching the actual entities by ID so you don't pay the cost of deserialization each time. The issues you run into then are cache invalidation of the entities and their metadata. So overall it is a non-trivial approach.
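A minimal sketch of that hybrid layout, assuming SQL Server (the column name is illustrative, and tblProjectProperties is the values table shown in the schema further down):
alter table tblProject
add properties_xml xml; -- full property set, deserialized when loading the entity
-- tblProjectProperties is then used only for querying and sorting,
-- never for retrieving the entity itself.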
What I would do instead is create a separate table, perhaps more than one depending on your data, with a discriminator/type column:
create table properties (
root_id int,
type_id int,
height int,
width int,
...etc...
)
Make the unique key a combination of root_id and type_id, where type_id would represent mobile, for instance - assuming a separate lookup table in my example, sketched below.
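A sketch of that unique constraint and the assumed lookup table (names are illustrative):
create table property_types (
type_id int primary key,
type_name nvarchar(32) -- e.g. 'desktop', 'mobile'
)
alter table properties
add constraint UQ_properties_root_type unique (root_id, type_id);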

There is nothing bad about storing the mobile section in another table. It could even bring some savings, depending on how much this information is used.
You can store it in another table, or use an even more elaborate version with three tables. One is your tblProject, one is tblProperties and one is tblProjectProperties.
create table tblProperties (
id int identity(1,1) not null primary key,
prop_name nvarchar(32),
prop_description nvarchar(1024)
)
create table tblProjectProperties
(
ProjectUid int not null,
PropertyUid int not null,
PropertyValue nvarchar(256)
)
with foreign key tblProjectProperties.ProjectUid -> tblProject.uid
and foreign key tblProjectProperties.PropertyUid -> tblProperties.id
The thing is, if you have different types of projects which use different properties, you have no need to store all those unused NULLs; you store only the properties you really need for a given project. The above schema gives you some flexibility. You can create views for the different project types and use them to avoid too many joins in user selects.
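As a hedged example of such a view, pivoting two assumed property names so that user selects need no joins:
create view vwProjectMobileSettings as
select pp.ProjectUid,
max(case when p.prop_name = 'mobile_width' then pp.PropertyValue end) as mobile_width,
max(case when p.prop_name = 'mobile_height' then pp.PropertyValue end) as mobile_height
from tblProjectProperties pp
join tblProperties p on p.id = pp.PropertyUid
group by pp.ProjectUid;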

Related

Dynamic columns in database tables vs EAV

I'm trying to decide which way to go if I have an app that needs to be able to change the db schema based on the user input.
For example, if I have a "car" object that contains car properties, like year, model, # of doors etc, how do I store it in the DB in such a way, that the user should be able to add new properties?
I read about EAV tables and they seem right for this thing, but the problem is that queries will get pretty complicated when I try to get a list of cars filtered by a set of properties.
Could I generate the tables dynamically instead? I see that Sqlite has support for ADD COLUMN, but how fast is it when the table reaches many records? And it looks like there's no way to remove a column. I have to create a new table without the column I want to remove, and copy the data from the old table. That's certainly slow on large tables :(
I will assume that SQLite (or another relational DBMS) is a requirement.
EAVs
I have worked with EAVs and generic data models, and I can say that the data model is very messy and hard to work with in the long run.
Let's say that you design a data model with three tables: entities, attributes, and entity_attributes:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attributes
(attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE entity_attributes
(entity_id INTEGER, attribute_id INTEGER, value TEXT,
PRIMARY KEY(entity_id, attribute_id));
In this model, the entities table will hold your cars, the attributes table will hold the attributes that you can associate with your cars (brand, model, color, ...) and their type (text, number, date, ...), and the entity_attributes table will hold the values of the attributes for a given entity (for example "red").
Take into account that with this model you can store as many entities as you want and they can be cars, houses, computers, dogs or whatever (ok, maybe you need a new field on entities, but it's enough for the example).
INSERTs are pretty straightforward. You only need to insert a new object, a bunch of attributes and their relations. For example, to insert a new entity with 3 attributes you will need to execute 7 inserts (one for the entity, three more for the attributes, and three more for the relations).
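For instance, storing one car with three attributes using the schema above (values are illustrative):
INSERT INTO entities VALUES (1, 'my car');
INSERT INTO attributes VALUES (1, 'brand', 'text');
INSERT INTO attributes VALUES (2, 'model', 'text');
INSERT INTO attributes VALUES (3, 'color', 'text');
INSERT INTO entity_attributes VALUES (1, 1, 'Pontiac');
INSERT INTO entity_attributes VALUES (1, 2, 'Firebird');
INSERT INTO entity_attributes VALUES (1, 3, 'red');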
When you want to perform an UPDATE, you will need to know which entity you want to update, and update the desired attribute by joining through the relation between the entity and its attributes.
When you want to perform a DELETE, you will also need to know which entity you want to delete, delete its attributes, delete the relations between your entity and its attributes, and then delete the entity.
But when you want to perform a SELECT the thing becomes nasty (you need to write really difficult queries) and the performance drops horribly.
Imagine a data model to store car entities and its properties as in your example (say that we want to store brand and model). A SELECT to query all your records will be
SELECT brand, model FROM cars;
If you design a generic data model as in the example, the SELECT to query all your stored cars will be really difficult to write and will involve a 3-table join. The query will perform really badly.
Also, think about the definition of your attributes. All your attributes are stored as TEXT, and this can be a problem. What if somebody makes a mistake and stores "red" as a price?
Indexes are another thing you cannot benefit from (or at least not as much as would be desirable), and they become very necessary as the stored data grows.
As you say, the main concern as a developer is that the queries are really hard to write, hard to test and hard to maintain (how much would a client have to pay to buy all red, 1980, Pontiac Firebirds that you have?), and will perform very poorly when the data volume increases.
The only advantage of using EAVs is that you can store virtually everything with the same model, but it is like having a box full of stuff in which you want to find one concrete, small item.
Also, to use an argument from authority, I will say that Tom Kyte argues strongly against generic data models:
http://tkyte.blogspot.com.es/2009/01/this-should-be-fun-to-watch.html
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
Dynamic columns in database tables
On the other hand, you can, as you say, generate the tables dynamically, adding (and removing) columns when needed. In this case, you can, for example, create a car table with the basic attributes that you know you will use and then add columns dynamically when you need them (for example, the number of exhausts).
The disadvantage is that you will need to add columns to an existing table and (maybe) build new indexes.
This model, as you say, also has another problem when working with SQLite as there's no direct way to delete columns and you will need to do this as stated on http://www.sqlite.org/faq.html#q11
BEGIN TRANSACTION;
CREATE TEMPORARY TABLE t1_backup(a,b);
INSERT INTO t1_backup SELECT a,b FROM t1;
DROP TABLE t1;
CREATE TABLE t1(a,b);
INSERT INTO t1 SELECT a,b FROM t1_backup;
DROP TABLE t1_backup;
COMMIT;
Anyway, I don't really think that you will need to delete columns (or at least it will be a very rare scenario). Maybe someone adds the number of doors as a column, and stores a car with this property. You will need to ensure that none of your cars still has a value for this property before deleting the column, to prevent losing data. But this, of course, depends on your concrete scenario.
Another drawback of this solution is that you will need a table for each entity you want to store (one table to store cars, another to store houses, and so on...).
Another option (pseudo-generic model)
A third option could be to have a pseudo-generic model, with a table having columns to store id, name, and type of the entity, and a given (enough) number of generic columns to store the attributes of your entities.
Let's say that you create a table like this:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY,
name TEXT,
type TEXT,
attribute1 TEXT,
attribute2 TEXT,
...
attributeN TEXT
);
In this table you can store any entity (cars, houses, dogs) because you have a type field and you can store as many attributes for each entity as you want (N in this case).
If you need to know what attribute37 stands for when type is "car", you would need to add another table that relates the types and attributes with the descriptions of the attributes.
And what if you find that one of your entities needs more attributes? Then simply add new columns to the entities table (attributeN+1, ...).
In this case, the attributes are always stored as TEXT (as in EAVs), with the disadvantages that implies.
But you can use indexes, the queries are really simple, the model is generic enough for your case, and in general, I think that the benefits of this model are greater than the drawbacks.
Hope it helps.
Follow up from the comments:
With the pseudo-generic model your entities table will have a lot of columns. From the documentation (https://www.sqlite.org/limits.html), the default setting for SQLITE_MAX_COLUMN is 2000. I have worked with SQLite tables with over 100 columns with great performance, so 40 columns shouldn't be a big deal for SQLite.
As you say, most of your columns will be empty for most of your records, and you will need to index all of your columns for performance, but you can use partial indexes (https://www.sqlite.org/partialindex.html). This way, your indexes will be small, even with a high number of rows, and the selectivity of each index will be great.
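For example, a partial index on one of the generic columns (SQLite syntax):
CREATE INDEX idx_entities_attribute1
ON entities (attribute1)
WHERE attribute1 IS NOT NULL;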
If you implement a EAV with only two tables, the number of joins between tables will be less than in my example, but the queries will still be hard to write and maintain, and you will need to do several (outer) joins to extract data, which will reduce performance, even with a great index, when you store a lot of data. For example, imagine that you want to get the brand, model and color of your cars. Your SELECT would look like this:
SELECT e.name, a1.value brand, a2.value model, a3.value color
FROM entities e
LEFT JOIN entity_attributes a1 ON (e.entity_id = a1.entity_id and a1.attribute_id = 'brand')
LEFT JOIN entity_attributes a2 ON (e.entity_id = a2.entity_id and a2.attribute_id = 'model')
LEFT JOIN entity_attributes a3 ON (e.entity_id = a3.entity_id and a3.attribute_id = 'color');
As you see, you would need one (left) outer join for each attribute you want to query (or filter). With the pseudo-generic model the query will be like this:
SELECT name, attribute1 brand, attribute7 model, attribute35 color
FROM entities;
Also, take into account the potential size of your entity_attributes table. If you can potentially have 40 attributes for each entity, let's say that you have 20 not null for each of them. If you have 10,000 entities, your entity_attributes table will have 200,000 rows, and you will be querying it using one huge index. With the pseudo-generic model you will have 10,000 rows and one small index for each column.
It all depends on the way in which your application needs to reason about the data.
If you need to run queries which need to do complicated comparisons or joins on data whose schema you don't know in advance, SQL and the relational model are rarely a good fit.
For instance, if your users can set up arbitrary data entities (like "car" in your example), and then want to find cars whose engine capacity is greater than 2000cc, with at least 3 doors, made after 2010, whose current owner is part of the "little old ladies" table, I'm not aware of an elegant way of doing this in SQL.
However, you could achieve something like this using XML, XPath etc.
If your application has a set of data entities with known attributes, but users can extend those attributes (a common requirement for products like bug trackers), "add column" is a good solution. However, you may need to invent a custom query language to allow users to query those columns. For instance, Atlassian Jira's bug tracking solution has JQL, a SQL-like language for querying bugs.
EAV is great if your task is to store and then show data. However, even moderately complex queries become very hard in an EAV schema - imagine how you'd execute my made up example above.
For your use case, a document oriented database like MongoDB would do great.
Another option that I haven't seen mentioned above is to use denormalized tables for the extended attributes. This is a combination of the pseudo-generic model and the dynamic columns in database tables. Instead of adding columns to existing tables, you add columns or groups of columns into new tables with FK indexes to the source table. Of course, you'll want a good naming convention (car, car_attributes_door, car_attributes_littleOldLadies)
Your selection problem becomes that of applying a LEFT OUTER JOIN to include the extended attributes that you want to include.
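For example, a minimal sketch with assumed names following the convention above:
CREATE TABLE car (
car_id INTEGER PRIMARY KEY,
name TEXT
);
CREATE TABLE car_attributes_door (
car_id INTEGER PRIMARY KEY REFERENCES car (car_id),
door_count INTEGER NOT NULL -- strongly typed, unlike an EAV value column
);
SELECT c.name, d.door_count
FROM car c
LEFT OUTER JOIN car_attributes_door d ON d.car_id = c.car_id;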
Slower than normalized, but not as slow as EAV.
Adding new extended attributes becomes a problem of adding a new table.
Harder than EAV, easier/faster than modifying table schema.
Deleting attributes becomes a problem of dropping whole tables.
Easier/faster than modifying table schema.
These new attributes can be strongly typed.
As good as modifying table schema, faster than EAV or generic columns.
The biggest advantage to this approach that I can see is that deleting unused attributes is quite easy compared to any of the others via a single DROP TABLE command. You also have the option to later normalize often-used attributes into larger groups or into the main table using a single ALTER TABLE process rather than one for each new column you were adding as you added them, which helps with the slow LEFT OUTER JOIN queries.
The biggest disadvantage is that you're cluttering up your table list, which admittedly is often not a trivial concern. That, and I'm not sure how much better LEFT OUTER JOINs actually perform than EAV table joins. It's definitely closer to EAV join performance than normalized table performance.
If you're doing a lot of comparisons/filters of values that benefit greatly from strongly typed columns, but you add/remove these columns frequently enough to make modifying a huge normalized table intractable, this seems like a good compromise.
I would try EAV.
Adding columns based on user input doesn't sound nice to me, and you can quickly run out of capacity. Queries on a very flat table can also be a problem. Do you want to create hundreds of indexes?
Instead of writing everything to one table, I would store as many common properties as possible (price, name, color, ...) in the main table and the less common properties in an "extra" attributes table. You can always rebalance them later with a little effort.
EAV can perform well for small to medium-sized data sets. Since you want to use SQLite, I guess it won't be a problem.
You may also want to avoid "over"-normalizing your data. With the cheap storage we currently have, you can use one table to store all "extra" attributes, instead of two:
main table: ent_id, ent_name, ...
extra attributes: ent_id, attr_name, attr_type, attr_value, ...
People against EAV will say its performance is poor on large databases. It's true that it won't perform as well as a normalized structure, but you don't want to change the structure of a 3TB table either.
I have a low-quality but possible answer, inspired by HTML tags like <tag width="10px" height="10px" ... />.
In this dirty way you will have just one varchar(max) column for all properties, say a Props column, and you will store data in it like this:
Props
------------------------------------------------------------
Model:Model of car1|Year:2010|# of doors:4
Model:Model of car2|NewProp1:NewValue1|NewProp2:NewValue2
This way, all the work goes into the programming code in the business layer, using functions like concatCustom (takes an array, returns a string) and unconcatCustom (takes a string, returns an array).
To cope with special characters like ':' and '|' appearing in values, I suggest '#:#' and '#|#', or something rarer, as the delimiters.
In a similar way you can use a text or binary field and store XML data in the column.

Organizing database tables - large number of properties

I have a database that stores some users in it. Each user has its account settings, privacy settings and lots of other properties to set. The number of those properties started to grow and I could end up with 30 properties or so.
Till now, I used to keep it in "UserInfo" table having User and UserInfo related as One-To-Many (keeping a log of all changes). Putting it in a single "UserInfo" table doesn't sound nice and, at least in the database model, it would look messy. What's the solution?
Separating privacy settings, account settings and other "groups" of settings into separate tables, with 1-1 relations between UserInfo and each group-of-settings table, is one solution, but would that be too slow (or much slower) when retrieving the data? I guess not all data would be presented on a single page at the same moment. So maybe having one-to-many relationships to each table is a solution too (keeping a log of each group separately)?
If it's only 30 properties, I'd recommend just creating 30 columns. That's not too much for a modern database to handle.
But I would guess that if you have 30 properties today, you will continue to invent new properties as time goes on, and the number of columns will keep growing. Restructuring your table to add columns every day may become time-consuming as you get lots of rows.
For an alternative solution check out this blog for a nifty solution for storing lots of dynamic attributes in a "schemaless" way: How FriendFeed Uses MySQL.
Basically, collect all the properties into some format and store it in a single TEXT column. The format is semi-structured, that is your application can separate the properties if needed but you can also add more at any time, or even have different properties per row. XML or YAML or JSON are example formats, or some object serialization format supported by your application code language.
CREATE TABLE Users (
user_id SERIAL PRIMARY KEY,
user_properties TEXT
);
This makes it hard to search for a given value in a given property. So in addition to the TEXT column, create an auxiliary table for each property you want to be searchable, with two columns: values of the given property, and a foreign key back to the main table where that particular value is found. Now you can index the column so lookups are quick.
CREATE TABLE UserBirthdate (
user_id BIGINT UNSIGNED PRIMARY KEY,
birthdate DATE NOT NULL,
FOREIGN KEY (user_id) REFERENCES Users(user_id),
KEY (birthdate)
);
SELECT u.* FROM Users AS u INNER JOIN UserBirthdate b USING (user_id)
WHERE b.birthdate = '2001-01-01';
This means as you insert or update a row in Users, you also need to insert or update into each of your auxiliary tables, to keep it in sync with your data. This could grow into a complex chore as you add more auxiliary tables.
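A sketch of that dual write (assuming MySQL and a JSON payload; the serialization format and property names are illustrative):
INSERT INTO Users (user_properties)
VALUES ('{"birthdate": "2001-01-01", "nickname": "sam"}');
-- keep the searchable auxiliary table in sync with the same data
INSERT INTO UserBirthdate (user_id, birthdate)
VALUES (LAST_INSERT_ID(), '2001-01-01');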

What is the best way to keep this schema clear?

Currently I'm working on an RFID project where each tag is attached to an object. An object could be a person, a computer, a pencil, a box, or whatever comes to my boss's mind.
And of course each object has different attributes.
So I'm trying to have a tags table where I can keep a register of each tag in the system (registration of the tag), and other tables where I can relate a tag with an object and describe some other attributes. This is what I have done (no real schema, just a simplified version).
Suddenly, I realized that this schema could have the same tag in several tables.
For example, tag 123 could be in C and B at the same time, which is impossible because each tag can be attached to just a single object.
To put it simply: I want each tag to appear no more than once in the database.
My current approach
What I really want
Update:
Yeah, the TagID is chosen by the end user. Moreover, the TagID is given by a tag reader, and it is a 128-bit number.
New Update:
The objects until now are:
-- Medicament(TagID, comercial_name, generic_name, amount, ...)
-- Machine(TagID, name, description, model, manufacturer, ...)
-- Patient(TagID, firstName, lastName, birthday, ...)
All the attributes (columns, or whatever you name them) are very different.
Update after update
I'm working on a system with RFID tags for a hospital. Each RFID tag is attached to an object in order to keep watch over them, and unfortunately each object has a lot of different attributes.
An object could be a person, a machine or a medicine, or maybe a new object with other attributes.
So, I just want a flexible and clever schema that allows me to introduce new object types and also lets me easily add new attributes to an object, keeping in mind that this system could be very large.
Examples:
Tag(TagID)
Medicine(generic_name, comercial_name, expiration_date, dose, price, laboratory, ...)
Machine(model, name, description, price, buy_date, ...)
Patient(PatientID, first_name, last_name, birthday, ...)
We must relate exactly one tag to exactly one object.
Note: I don't really speak (or write) English well :P sorry for that. Not a native speaker here.
You can enforce these rules using relational constraints. Check out the use of a persisted computed column to enforce the constraint Tag:{Pencil or Computer}. This model gives you great flexibility to model each child table (Person, Machine, Pencil, etc.) and at the same time prevents any conflicts between tags. It's also good that we don't have to resort to triggers or UDFs called via check constraints to enforce the relation; the relation is built into the model.
create table dbo.TagType (TagTypeID int primary key, TagTypeName varchar(10));
insert into dbo.TagType
values(1, 'Computer'), (2, 'Pencil');
create table dbo.Tag
( TagId int primary key,
TagTypeId int references TagType(TagTypeId),
TagName varchar(10),
TagDate datetime,
constraint UX_Tag unique (TagId, TagTypeId)
)
go
create table dbo.Computer
( TagId int primary key,
TagTypeID as 1 persisted,
CPUType varchar(25),
CPUSpeed varchar(25),
foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeID)
)
go
create table dbo.Pencil
( TagId int primary key,
TagTypeId as 2 persisted,
isSharp bit,
Color varchar(25),
foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeId)
)
go
-----------------------------------------------------------
-- create a new tag of type Pencil:
-----------------------------------------------------------
insert into dbo.Tag(TagId, TagTypeId, TagName, TagDate)
values(1, 2, 'Tag1', getdate());
insert into dbo.Pencil(TagId, isSharp, Color)
values(1, 1, 'Yellow');
-----------------------------------------------------------
-- try to make it a Computer too (fails FK)
-----------------------------------------------------------
insert into dbo.Computer(TagId, CPUType, CPUSpeed)
values(1, 'Intel', '2.66ghz')
Have a Tag table whose PK is an identity column, TagID.
This will ensure that each TagID only shows up once no matter what...
Then in the Tag table have a TagType column that can either be free-form (the table name) or, better yet, have a TagType table with entries A, B, C, and then a FK in Tag pointing to TagType.
I would move the tag attributes into tables A, B, and C to minimize extra data in Tag, or have a series of junction tables between Tag and A, B, and C.
EDIT:
Assuming the TagID is created when the object is created, this will work fine (insert into Tag first and capture the new TagID using SCOPE_IDENTITY()).
This assumes users cannot edit the TagID itself.
If users can choose the TagID, then still use a Tag table with the TagID, but have another field called DisplayID where the user can type in a number. Just put a unique constraint on Tag.DisplayID...
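A sketch of that arrangement (T-SQL; names are illustrative):
create table dbo.Tag
( TagID int identity(1,1) primary key, -- internal key, never edited
TagTypeID int references dbo.TagType(TagTypeID),
DisplayID nvarchar(50) not null, -- the user-chosen identifier
constraint UQ_Tag_DisplayID unique (DisplayID)
)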
EDIT:
What attributes are you needing and are they nullable? If they are different for A, B, and C then it is cleaner to put them in A, B, and C especially if there might be some for A and B but not C...
Talked with Raz to clear up what he's trying to do. What he wants is a flexible way to store attributes related to tags. A tag can be attached to one of multiple types of objects, and each object has a specific list of attributes. He also wants to be able to add objects/attributes without having to change the schema. Here's the model I came up with:
If each tag can be in A, B, or C only once, I'd just combine A, B, and C into one table. It'd be easier to give you a better idea of how to build your schema if you gave an example of exactly what you're wanting to collect.
To me, from what I've read, it sounds like you have a list of tags and a list of objects, and you need to assign a tag to an object. If that is the case, I'd have a Tags table, an Objects table, and an ObjectTag table. In the ObjectTag table you would have a foreign key to the Tags table and a foreign key to the Objects table. Then you put a unique index on the tag foreign key, and now you've enforced your requirement of only using a tag once.
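A minimal sketch of that arrangement (table and column names assumed):
create table Tags ( TagID int primary key );
create table Objects ( ObjectID int primary key );
create table ObjectTag
( ObjectID int not null references Objects(ObjectID),
TagID int not null references Tags(TagID),
constraint UQ_ObjectTag_TagID unique (TagID) -- a tag can be assigned only once
)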
I would tackle this using your original structures. Relational databases are a lot better at aggregating/combining atomic data than they are at parsing complex data structures.
Keep the design of each "tag-able" object type in its own table. Data types, check constraints, default values, etc. are still easily implemented this way. Also, continue to define a FK from each object table to the Tags table.
I'm assuming you already have this in place, but if you place a unique constraint on the TagId column in each of the object tables (A, B, C, etc.) then you can guarantee uniqueness within that object type.
There are no built-in SQL Server constraints to guarantee uniqueness among all the object types, if implemented as separate tables. So, you will have to make your own validation. An INSTEAD OF trigger on your object tables can do this cleanly.
First, create a view to access the TagId list across all your object tables.
CREATE VIEW TagsInUse AS
SELECT A.TagId FROM A
UNION
SELECT B.TagId FROM B
UNION
SELECT C.TagId FROM C
;
Then, for each of your object tables, define an INSTEAD OF trigger to test your TagId.
CREATE TRIGGER dbo.T_IO_Insert_TableA ON dbo.A
INSTEAD OF INSERT
AS
IF EXISTS (SELECT 0 FROM dbo.TagsInUse t INNER JOIN inserted i ON t.TagId = i.TagId)
BEGIN
--The tag(s) is/are already in use. Create the necessary notification(s).
RAISERROR ('You attempted to re-use a TagId. This is not allowed.', 16, 1);
ROLLBACK TRANSACTION;
END
ELSE
BEGIN
--The tag(s) is/are available, so proceed with the INSERT.
INSERT INTO dbo.A (TagId, Attribute1, Attribute2, Attribute3)
SELECT i.TagId, i.Attribute1, i.Attribute2, i.Attribute3
FROM inserted AS i;
END
GO
Keep in mind that you can also (and probably should) encapsulate that IF EXISTS test in a T-SQL function for maintenance and performance reasons.
You can write supplementary stored procedures for doing things like finding what object type a TagId is associated with.
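For example, a sketch of such a procedure (assuming only the object tables A, B, and C from above):
CREATE PROCEDURE dbo.GetTagObjectType
@TagId int
AS
SELECT CASE
WHEN EXISTS (SELECT 1 FROM dbo.A WHERE TagId = @TagId) THEN 'A'
WHEN EXISTS (SELECT 1 FROM dbo.B WHERE TagId = @TagId) THEN 'B'
WHEN EXISTS (SELECT 1 FROM dbo.C WHERE TagId = @TagId) THEN 'C'
END AS ObjectType;
GO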
Pros
You are still taking advantage of SQL Server's data integrity features, which are all quite fast and self-documenting. Don't underestimate the usefulness of data types.
The view is an encapsulation of the domain that must be unique, without combining the underlying sets of attributes. Now, you won't have to write any messy code to decipher the object's type. You can base that determination on which table contains the matching tag.
Your options remain open...
Because you didn't store everything in an EAV-friendly nvarchar(300) column, you can tweak the data types for whatever makes the most sense for each attribute.
If you run into any performance issues, you can index the view.
You (or your DBA) can move the object tables to different file groups on different disks if you need to balance things out and help with parallel disk I/O. Think of it as a form of horizontal partitioning. For example, if you have 8 times as many RFID tags applied to medicine containers as you have for patients, you can place the medicine table on a different disk without having to create the partitioning function that you would need for a monolithic table (one table for all types).
If you need to eventually partition your tables vertically (for archiving data onto a read-only partition), you can more easily create a partitioning function for each object type. This would be useful where the business rules for archiving differ by object type.
Most importantly, implementing different business rules based on object type is much simpler. You don't have to implement any nasty conditional logic like "IF type = 'needle' THEN ... ELSE IF type = 'patient' THEN ... ELSE IF....". If you need to apply different rules, then apply them to the relevant object table without having to test a "type" value.
Cons
Triggers have to be maintained. However, this would have to be done in your application anyway, so you are performing the same data integrity checking at the database. That means that you will have no extra network overhead and this will be available for any application that uses this database.
What you're describing is a classical "table-per-type" ORM mapping. Entity Framework has built-in support of this, which you should look into.
Otherwise, I don't think most databases have easy integrity constraints that are enforced over primary keys of multiple tables.
However, is there any reason why you can't just use a single tags table to hold all the fields? Use a type field to hold the type of object. NULL all the irrelevant fields -- this way they don't consume disk space. You'll end up with far fewer tables (only one) that you can maintain as one single coherent object; it also makes you write far fewer SQL queries to work on tags that may span multiple object types.
Implementing it as one single table also saves you disk space because you can implement tiers of inheritance -- for example, "patient" and "doctor" and "nurse" can be three different object types, each having similar fields (e.g. firstname, lastname etc.) and some unique fields. Right now you'll need three tables, with duplicated fields.
It is also simpler when you add an object type. Before, you need to add a new table, and duplicate some SQL statements that span multiple object types. Now you only need to add new fields to the same table (maybe reuse some). The SQL you need to change are far fewer.
The only reason you wouldn't go with one single table is when the number of fields makes a row too large to fit inside a SQL Server page (which I believe is 8K). Then SQL will complain and won't allow you to add any more fields. The solution, in this case, is to adopt an ORM tool (like Entity Framework), and then "reuse" fields. For example, if "Field1" is only used by object type #1, there is no reason why object type #3 can't use it to store something as well. You only need to be able to distinguish it in your programs.
You could have the Tags table such that it can have a pointer to any of those tables, and include a Type that tells you which of the tables it is:
Tags
-
ID
Type (A,B, or C)
A (nullable)
B (nullable)
C (nullable)
A
-
ID
(other attributes)

How to design DB table / schema with ease?

Is there a simple method to decide on what fields and indexes are needed for each table in an app you design?
For example, if it is a webapp that simply lets people create lists (any number of lists; users can create a "things to do" list or a "shopping" list), where the user can assign other users to edit the list and can control whether the list is viewable publicly or only by certain users - how can the tables be designed so that they are very accurate and designed quickly? What about the indexes?
I did that in college and then revisited the question some time ago and have a method, but would like to find out if there are standard and good ways to do it out in the field.
Database design is hard ...
As with many things in life, it's a series of tradeoffs. The first thing you need to decide is what DBMS you will use (MySQL, SQL Server, Oracle, PostgreSQL, one of the "object-oriented" databases, etc.).
Then you need to decide on normalization v. insane numbers of JOINs to get to your data. Questions like "how much logic will I implement in triggers, stored procedures, in app code, etc" need to be addressed.
There is no "Quick'n'Easy" way to design anything but the most trivial of databases.
'Course, that's just my experience. YMMV.
It is beyond the scope of this answer to fully explain database design.
I generally break my design into three parts (part 1 and 2 happen up front, while 3 is usually near the project end)
1) create the tables based on relationships (parent/child/etc)
2) create fields based on content (parent has x attributes, etc)
3) create indexes last based on how you select data from your tables
Haven't heard of any formal approaches to this problem but there are rules of thumb. All nouns and business objects become tables, normalized of course. And I'd think the attributes sort of speak for themselves. I guess?
As for indexes, it just comes with working with the data. Any column that's joined on deserves an index (maybe even clustered). It's very much "it depends". But there are patterns. Other than optimizing for joins, many indexes are directly related to how the data is used, and this isn't something a rule of thumb can provide. For example, if you look up users by PK and elsewhere by last_name, last_name deserves an index.
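For instance, for that last case (assuming a users table as described):
CREATE INDEX idx_users_last_name ON users (last_name);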
I think the solution is a subjective one. When I have to design tables I look at the Java object that will represent that particular data model and go from there. You'll find a lot of frameworks (Django, CakePHP, RoR) have you develop the model and the frameworks will build the corresponding tables.
So I would suggest evaluating what functionality and data you need to store and develop your tables from that. Also look into whether the tool set you have at your disposal offers to generate the tables for you from the object structure.
I would go for the straightforward (almost) normalized design:
CREATE TABLE users (
userid serial primary key,
name varchar
)
CREATE TABLE lists (
listid serial primary key,
name varchar,
ownerid int references users(userid)
)
CREATE TABLE list_items (
listid int references lists(listid),
value varchar,
date datetime
)
CREATE TABLE permissions (
permissionid serial primary key,
description varchar
)
CREATE TABLE list_permissions (
listid int references lists(listid),
permissionid int references permissions(permissionid),
userid int references users(userid)
)
Which indexes to create would depend on what are the actual most used queries and how are they performing. For instance, if you query a lot on the lists and list_items (likely) you'd want an index on listid and on name, if you'll be searching by name.
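For example (assuming the schema above):
CREATE INDEX idx_list_items_listid ON list_items (listid);
CREATE INDEX idx_lists_name ON lists (name);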
Just some ideas. Hope they're helpful.
I'd try not to lock yourself in if you're still trying to see what works.
Just from your description, you'd want a table for your users' information, as well as:
tbl_lists:
ID_list (primary key)
UserID (foreign key to list owner)
ListName
tbl_listItems:
ID_listItem (primary key)
ListID (foreign key to list)
ItemDescription
tbl_permissions:
ID_permission (primary key)
ListID
UserID (foreign key to user you're granting permission to)
PermissionTypeID (what kind of permission)
tbl_permissionTypes:
ID_permissionType (primary key)
Description ("can view", "can edit", etc.)
The more flexible you can make things while you're designing, the better. You can optimize later.
If you want to keep things very simple and are not too concerned with normalization, you could create one big table that stores the main object your webapp is based around (ex: lists), and have other smaller supporting tables link to the big table (ex: tbl_listType, tbl_permission, tbl_list_items).
Then when you write queries, you almost certainly include the main table and you can link in other supporting tables for more granular details.

Designing an 'Order' schema in which there are disparate product definition tables

This is a scenario I've seen in multiple places over the years; I'm wondering if anyone else has run across a better solution than I have...
My company sells a relatively small number of products, however the products we sell are highly specialized (i.e. in order to select a given product, a significant number of details must be provided about it). The problem is that while the amount of detail required to choose a given product is relatively constant, the kinds of details required vary greatly between products. For instance:
Product X might have identifying characteristics like (hypothetically)
'Color',
'Material'
'Mean Time to Failure'
but Product Y might have characteristics
'Thickness',
'Diameter'
'Power Source'
The problem (one of them, anyway) in creating an order system that utilizes both Product X and Product Y is that an Order Line has to refer, at some point, to what it is "selling". Since Product X and Product Y are defined in two different tables - and denormalization of products using a wide table scheme is not an option (the product definitions are quite deep) - it's difficult to see a clear way to define the Order Line in such a way that order entry, editing and reporting are practical.
Things I've Tried In the Past
Create a parent table called 'Product' with columns common to Product X and Product Y, then using 'Product' as the reference for the OrderLine table, and creating a FK relationship with 'Product' as the primary side between the tables for Product X and Product Y. This basically places the 'Product' table as the parent of both OrderLine and all the disparate product tables (e.g. Products X and Y). It works fine for order entry, but causes problems with order reporting or editing since the 'Product' record has to track what kind of product it is in order to determine how to join 'Product' to its more detailed child, Product X or Product Y. Advantages: key relationships are preserved. Disadvantages: reporting, editing at the order line/product level.
Create 'Product Type' and 'Product Key' columns at the Order Line level, then use some CASE logic or views to determine the customized product to which the line refers. This is similar to item (1), without the common 'Product' table. I consider it a more "quick and dirty" solution, since it completely does away with foreign keys between order lines and their product definitions. Advantages: quick solution. Disadvantages: same as item (1), plus lost RI.
Homogenize the product definitions by creating a common header table and using key/value pairs for the customized attributes (OrderLine [n] <- [1] Product [1] <- [n] ProductAttribute). Advantages: key relationships are preserved; no ambiguity about product definition. Disadvantages: reporting (retrieving a list of products with their attributes, for instance), data typing of attribute values, performance (fetching product attributes, inserting or updating product attributes etc.)
If anyone else has tried a different strategy with more success, I'd sure like to hear about it.
Thank you.
The first solution you describe is the best if you want to maintain data integrity, and if you have relatively few product types and seldom add new product types. This is the design I'd choose in your situation. Reporting is complex only if your reports need the product-specific attributes. If your reports need only the attributes in the common Products table, it's fine.
The second solution you describe is called "Polymorphic Associations" and it's no good. Your "foreign key" isn't a real foreign key, so you can't use a DRI constraint to ensure data integrity. OO polymorphism doesn't have an analog in the relational model.
The third solution you describe, involving storing an attribute name as a string, is a design called "Entity-Attribute-Value" and you can tell this is a painful and expensive solution. There's no way to ensure data integrity, no way to make one attribute NOT NULL, no way to make sure a given product has a certain set of attributes. No way to restrict one attribute against a lookup table. Many types of aggregate queries become impossible to do in SQL, so you have to write lots of application code to do reports. Use the EAV design only if you must, for instance if you have an unlimited number of product types, the list of attributes may be different on every row, and your schema must accommodate new product types frequently, without code or schema changes.
Another solution is "Single-Table Inheritance." This uses an extremely wide table with a column for every attribute of every product. Leave NULLs in columns that are irrelevant to the product on a given row. This effectively means you can't declare an attribute as NOT NULL (unless it's in the group common to all products). Also, most RDBMS products have a limit on the number of columns in a single table, or the overall width in bytes of a row. So you're limited in the number of product types you can represent this way.
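For illustration, a Single-Table Inheritance sketch using the hypothetical attributes from the question (names assumed):
CREATE TABLE Products (
product_id INT PRIMARY KEY,
product_type CHAR(1) NOT NULL, -- discriminator: 'X' or 'Y'
-- Product X attributes (NULL on Product Y rows):
color VARCHAR(20),
material VARCHAR(30),
mean_time_to_failure INT,
-- Product Y attributes (NULL on Product X rows):
thickness DECIMAL(8,2),
diameter DECIMAL(8,2),
power_source VARCHAR(30)
);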
Hybrid solutions exist; for instance, you can store common attributes normally, in columns, but product-specific attributes in an Entity-Attribute-Value table. Or you could store product-specific attributes in some other structured way, like XML or YAML, in a BLOB column of the Products table. But these hybrid solutions suffer because now some attributes must be fetched in a different way than others.
The ultimate solution for situations like this is to use a semantic data model, using RDF instead of a relational database. This shares some characteristics with EAV but it's much more ambitious. All metadata is stored in the same way as data, so every object is self-describing and you can query the list of attributes for a given product just as you would query data. Special products exist, such as Jena or Sesame, implementing this data model and a special query language that is different from SQL.
There's no magic bullet that you've overlooked.
You have what are sometimes called "disjoint subclasses". There's the superclass (Product) with two subclasses (ProductX) and (ProductY). This is a problem that -- for relational databases -- is Really Hard. [Another hard problem is Bill of Materials. Another hard problem is Graphs of Nodes and Arcs.]
You really want polymorphism, where OrderLine is linked to a subclass of Product, but doesn't know (or care) which specific subclass.
You don't have too many choices for modeling. You've pretty much identified the bad features of each. This is pretty much the whole universe of choices.
Push everything up to the superclass. That's the uni-table approach where you have Product with a discriminator (type="X" or type="Y") and a million columns. The columns of Product are the union of the columns in ProductX and ProductY. There will be nulls all over the place because of unused columns.
Push everything down into the subclasses. In this case, you'll need a view which is the union of ProductX and ProductY. That view is what's joined to create a complete order. This is like the first solution, except it's built dynamically and doesn't optimize well.
Join Superclass instance to subclass instance. In this case, the Product table is the intersection of ProductX and ProductY columns. Each Product has a reference to a key either in ProductX or ProductY.
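A sketch of that third choice, with assumed names; OrderLine references only the superclass:
CREATE TABLE Product (
product_id INT PRIMARY KEY,
product_type CHAR(1) NOT NULL -- discriminator: 'X' or 'Y'
);
CREATE TABLE ProductX (
product_id INT PRIMARY KEY REFERENCES Product (product_id),
color VARCHAR(20),
material VARCHAR(30)
);
CREATE TABLE ProductY (
product_id INT PRIMARY KEY REFERENCES Product (product_id),
thickness DECIMAL(8,2),
diameter DECIMAL(8,2)
);
CREATE TABLE OrderLine (
order_line_id INT PRIMARY KEY,
product_id INT REFERENCES Product (product_id)
);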
There isn't really a bold new direction. In the relational database world-view, those are the choices.
If, however, you elect to change the way you build application software, you can get out of this trap. If the application is object-oriented, you can do everything with first-class, polymorphic objects. You have to map from the kind-of-clunky relational processing; this happens twice: once when you fetch stuff from the database to create objects and once when you persist objects back to the database.
The advantage is that you can describe your processing succinctly and correctly. As objects, with subclass relationships.
The disadvantage is that your SQL devolves to simplistic bulk fetches, updates and inserts.
This becomes an advantage when the SQL is isolated into an ORM layer and managed as a kind of trivial implementation detail. Java programmers use iBatis (or Hibernate or TopLink or Cocoon), Python programmers use SQLAlchemy or SQLObject. The ORM does the database fetches and saves; your application directly manipulates Orders, Lines and Products.
This might get you started. It will need some refinement:
Table Product ( id PK, name, price, units_per_package)
Table Product_Attribs (id FK ref Product, AttribName, AttribValue)
Which would allow you to attach a list of attributes to the products. -- This is essentially your option 3
If you know a max number of attributes, you could go
Table Product (id PK, name, price, units_per_package, attrName_1, attrValue_1 ...)
Which would of course de-normalize the database, but make queries easier.
I prefer the first option because
It supports an arbitrary number of attributes.
Attribute names can be stored in another table, and referential integrity enforced so that those damn Canadians don't stick a "colour" in there and break reporting.
Does your product line ever change?
If it does, then creating a table per product will cost you dearly, and the key/value pairs idea will serve you well. That's the kind of direction down which I am naturally drawn.
I would create tables like this:
Attribute(attribute_id, description, is_listed)
-- contains values like "colour", "width", "power source", etc.
-- "is_listed" tells us if we can get a list of valid values:
AttributeValue(attribute_id, value)
-- lists of valid values for different attributes.
Product (product_id, description)
ProductAttribute (product_id, attribute_id)
-- tells us which attributes apply to which products
Order (order_id, etc)
OrderLine (order_id, order_line_id, product_id)
OrderLineProductAttributeValue (order_line_id, attribute_id, value)
-- tells us things like: order line 999 has "colour" of "blue"
The SQL to pull this together is not trivial, but it's not too complex either... and most of it will be write once and keep (either in stored procedures or your data access layer).
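For example, fetching the attribute values for one order line (a sketch using the tables above):
SELECT a.description AS attribute, v.value
FROM OrderLineProductAttributeValue v
JOIN Attribute a ON a.attribute_id = v.attribute_id
WHERE v.order_line_id = 999;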
We do similar things with a number of types of entity.
Chris and AJ: Thanks for your responses. The product line may change, but I would not term it "volatile".
The reason I dislike the third option is that it comes at the cost of metadata for the product attribute values. It essentially turns columns into rows, losing most of the advantages of the database column in the process (data type, default value, constraints, foreign key relationships etc.)
I've actually been involved in a past project where the product definition was done in this way. We essentially created a full product/product attribute definition system (data types, min/max occurrences, default values, 'required' flags, usage scenarios etc.) The system worked, ultimately, but came with a significant cost in overhead and performance (e.g. materialized views to visualize products, custom "smart" components to represent and validate data entry UI for product definition, another "smart" component to represent the product instance's customizable attributes on the order line, blahblahblah).
Again, thanks for your replies!
