I'm developing an Asset management application.
Looking through the Excel tracker that was used previously, I identified some attributes that are common to all categories of assets (basically non-technical attributes such as Purchase Order No., Warranty Info, etc.), for which I think I will make a separate table.
But when storing technical-attributes, there are many categories of assets for which I need only one or two additional attributes to be stored.
Should I make a single table for all these attributes and store NULLs wherever applicable, or should I make a separate table for each category containing just the asset ID and the additional columns? Which approach is better/more pragmatic?
Is cluttering the database with too many tables ok? I have around 10 such categories.
There are 3 known approaches to this:
1. Single table
In this model, you have a single table with all known columns, and allow them to be null for types that don't have that attribute. This gives you a simple database, and fairly simple SQL, but doesn't allow support for common features that relational databases give you, like insisting on non-null columns for a data type, or creating unique indices where that makes sense.
It also tends to lead to messy SQL, with developers forgetting over time what columns mean, so you could get a column being used for multiple purposes.
It does make it easy to join to other tables - so if you have an asset and a purchase related to that asset, the "purchase" table joins to the "asset" table on "assetID".
2. Table per subtype
In this case, you build a table for each subtype, and enforce the data characteristics of that subtype with not null, unique etc.
This creates a clearer separation of subtypes, and is less likely to degrade into a big ball of mud, but makes joins very hard - to join from "purchase" to "asset", you have to know which table holds that particular asset.
3. Common table for common fields, table per subtype
In this model, you have a single table for the fields that are common between subtypes - you say you've identified this already - and have further tables for each subtype to store the unique attributes.
This solves the joining problem between "asset" and "purchase", and keeps the data pretty self-describing.
It does mean client logic needs to handle the "asset_master to asset_subtype" join.
I prefer option 3 - it's the best trade-off between maintainability and manageability.
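A minimal sketch of option 3, using Python's built-in sqlite3 module. The table and column names (asset, asset_laptop, asset_monitor, ram_gb and so on) are hypothetical and only illustrate the shape; a real schema would follow the attributes found in the tracker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# Hypothetical schema: one common table, one small table per category.
conn.executescript("""
CREATE TABLE asset (
    asset_id     INTEGER PRIMARY KEY,
    category     TEXT NOT NULL,
    po_number    TEXT NOT NULL,
    warranty_end TEXT
);
CREATE TABLE asset_laptop (
    asset_id INTEGER PRIMARY KEY REFERENCES asset(asset_id),
    ram_gb   INTEGER NOT NULL       -- NOT NULL is possible per subtype
);
CREATE TABLE asset_monitor (
    asset_id    INTEGER PRIMARY KEY REFERENCES asset(asset_id),
    diagonal_in REAL NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    asset_id    INTEGER NOT NULL REFERENCES asset(asset_id),
    amount      REAL NOT NULL
);
INSERT INTO asset VALUES (1, 'laptop', 'PO-1001', '2026-01-31');
INSERT INTO asset_laptop VALUES (1, 16);
INSERT INTO asset VALUES (2, 'monitor', 'PO-1002', NULL);
INSERT INTO asset_monitor VALUES (2, 27.0);
INSERT INTO purchase VALUES (10, 1, 1200.0);
""")

# "purchase" joins to the common table without knowing the subtype.
purchase_rows = conn.execute("""
    SELECT p.purchase_id, a.po_number
    FROM purchase p JOIN asset a ON a.asset_id = p.asset_id
""").fetchall()

# Subtype-specific attributes cost one extra join.
laptop_rows = conn.execute("""
    SELECT a.asset_id, a.po_number, l.ram_gb
    FROM asset a JOIN asset_laptop l ON l.asset_id = a.asset_id
""").fetchall()
```

The extra join in the second query is the client-side cost mentioned above; everything that only needs common attributes never touches the subtype tables.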
Databases should be able to handle lots of columns and lots of tables, so both approaches should work from that perspective.
If you don't have any additional requirements, I'd use the single table approach. It is the easiest, and the only thing you are losing is the ability to put not null constraints on the fields that exist only for some categories.
Recently I encountered an application where a Master Table is maintained which contains the data of more than 20 categories. For example, it has categories named Country, State and City.
So my question is: is it better to move each category out into a separate table and fetch the data through joins, or should everything stay inside a single table?
P.S. In the future the category count might increase to 50 or more.
P.S. The application is based on EF6 + SQL Server.
Edited Version
I just want to know what the best approach is in the above scenario: should one go with a single table with proper indexing, or follow the DB normalization approach, putting each category into a separate table and maintaining relationships through FKs?
Normally, categories are put into separate tables. This conforms more closely with normalized database structures and the definition of entities. In particular, it allows for proper foreign key relationships to be defined. That is a big win for data integrity.
Sometimes categories are put into a single table. This can, of course, be confusing; consider, for instance, "Florida, Massachusetts" or "Washington, Iowa" (these are real places).
Putting categories in one table has one major advantage: all the text is in a single location. That can be very handy for internationalization efforts. To be honest, that is the situation where I have seen this used.
I have an ERP application with about 50 small lookup tables containing non-transactional data. Examples are ItemTypes, SalesOrderStatuses etc. There are so many different types and categories and statuses and with every new module new lookup tables are being added. I have a service to provide List objects out of these tables. These tables usually contain only two columns, (Id and Description). They have only a couple of rows, 8 - 10 rows at max.
I am thinking about putting all of them in one table with ID, Description and LookupTypeID. With this one table I will be able to get rid of 50 tables. Is it a good idea? Bad idea? Very bad idea?
Are there any standards/best-practices for managing small lookup tables?
Some professionals consider the single common lookup table a design error you should avoid. At the very least, it will slow down performance. The reason is that you will have to have a compound primary key for the common table, and lookups via a compound key will take longer than lookups via a simple key.
According to Anith Sen, this is the first of five design errors you should avoid. See this article: Five Simple Design Errors
Merging lookup tables is a bad idea if you care about integrity of your data (and you should!):
It would allow "client" tables to reference the data they were not meant to reference. E.g. the DBMS will not protect you from referencing SalesOrderStatuses where only ItemTypes should be allowed - they are now in the same table and you cannot (easily) separate the corresponding FKs.
It would force all lookup data to share the same columns and types.
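The first point can be demonstrated with a small sketch using Python's sqlite3 module. The schema is hypothetical (lookup, item_types, sales_order_statuses and item are illustrative names); the same insert is tried against a merged lookup table and against separate lookup tables.

```python
import sqlite3

def build(merged: bool) -> sqlite3.Connection:
    """Create an in-memory database with either one merged lookup table
    or separate lookup tables (all names here are illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    if merged:
        conn.executescript("""
        CREATE TABLE lookup (
            id          INTEGER PRIMARY KEY,
            lookup_type TEXT NOT NULL,
            description TEXT NOT NULL
        );
        INSERT INTO lookup VALUES (1, 'ItemType', 'Raw material'),
                                  (2, 'SalesOrderStatus', 'Shipped');
        CREATE TABLE item (
            item_id      INTEGER PRIMARY KEY,
            item_type_id INTEGER NOT NULL REFERENCES lookup(id)
        );
        """)
    else:
        conn.executescript("""
        CREATE TABLE item_types (id INTEGER PRIMARY KEY, description TEXT NOT NULL);
        CREATE TABLE sales_order_statuses (id INTEGER PRIMARY KEY, description TEXT NOT NULL);
        INSERT INTO item_types VALUES (1, 'Raw material');
        INSERT INTO sales_order_statuses VALUES (2, 'Shipped');
        CREATE TABLE item (
            item_id      INTEGER PRIMARY KEY,
            item_type_id INTEGER NOT NULL REFERENCES item_types(id)
        );
        """)
    return conn

def try_insert(conn: sqlite3.Connection, item_type_id: int) -> bool:
    """Return True if the DBMS accepts an item with this type id."""
    try:
        conn.execute("INSERT INTO item VALUES (100, ?)", (item_type_id,))
        return True
    except sqlite3.IntegrityError:
        return False

# Id 2 is a sales order status, not an item type.
merged_accepts_status = try_insert(build(merged=True), 2)
separate_accepts_status = try_insert(build(merged=False), 2)
```

With the merged table the foreign key is satisfied by any row of the lookup table, so the wrong kind of value goes in; with separate tables the DBMS rejects it.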
Unless you have performance problems due to excessive JOINs, I recommend you stay with your current design.
If you do, then you could consider using natural instead of surrogate keys in the lookup tables. This way, the natural keys get "propagated" through foreign keys to the "client" tables, resulting in less need for JOINing, at the price of increased storage space. For example, instead of having ItemTypes {Id PK, Description AK}, only have ItemTypes {Description PK}, and you no longer have to JOIN with ItemTypes just to get the Description - it was automatically propagated down the FK.
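A rough sketch of the natural-key variant, again with sqlite3. The items table and the ON UPDATE CASCADE clause are assumptions added for illustration; the cascade is one way to keep client rows in step when a description is renamed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- ItemTypes {Description PK}: the description itself is the key.
CREATE TABLE item_types (description TEXT PRIMARY KEY);
INSERT INTO item_types VALUES ('Raw material'), ('Finished good');

-- Hypothetical client table; the FK column already holds the readable value.
CREATE TABLE items (
    item_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    item_type TEXT NOT NULL
        REFERENCES item_types(description) ON UPDATE CASCADE
);
INSERT INTO items VALUES (1, 'Steel sheet', 'Raw material');
""")

# No JOIN needed to show the description.
rows = conn.execute("SELECT name, item_type FROM items").fetchall()

# Renaming a lookup value propagates to client rows through the cascade.
conn.execute(
    "UPDATE item_types SET description = 'Raw materials' "
    "WHERE description = 'Raw material'"
)
renamed_rows = conn.execute("SELECT name, item_type FROM items").fetchall()
```

The foreign key still restricts items.item_type to known values, so integrity is kept while the JOIN disappears.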
You can store them in a text search (i.e. NoSQL) store like Lucene. Such lookups are ridiculously fast.
I have implemented this to great effect. Note though that there is some initial setup to overcome, but not much. Lucene queries on ids are a snap to write.
The "one big lookup table" approach has the problem of allowing for silly values -- for example "color: yellow" for trucks in the inventory when you only have cars with "color: yellow". One Big Lookup Table: Just Say No.
Off-hand, I would go with the natural keys for the lookup tables unless you would have cases like "the 2012 model CX300R was red but the 2010-2011 models CX300R were blue (and model ID also denotes color)".
Traditionally, if you ask a DBA, they will say you should have separate tables. If you ask a programmer, they will say the single table is easier. (It makes building an Edit Status web page very easy: you make one page and pass it a different LookupTypeID instead of building lots of similar pages.)
However, with an ORM, the SQL and code to access different status tables is not really any extra effort.
I have used both methods and both work fine. I must admit using a single status table is easiest. I have done this for small apps and also enterprise apps and have noticed no performance impact.
Finally, the other field I normally like to add to these generic status tables is an OrderBy field, so you can sort the statuses in your UI by something other than the description if needed.
Sounds like a good idea to me. You can have the ID and LookupTypeID as a multi-attribute primary key. You just need to know what all of the different LookupTypeIDs represent and you should be good as gold.
EDIT: As for the standards/best-practices, I honestly don't have an answer for you. I've only had one semester of SQL/database design so I haven't been all too exposed to the matter.
Consider we have a database that has a table, which is a record of a sale. You sell both products and services, so you also have a product and service table.
Each sale can either be a product or a service, which leaves the options for designing the database to be something like the following:
Add columns for each type, i.e. add Service_id and Product_id to Invoice_Row, both nullable. If both are null, it's an ad-hoc charge not relating to anything; if one of them is set, the row relates to that type.
Add a weird string/id based system, for instance Type_table and Type_id columns, a string/varchar and an integer respectively. The former would contain, for example, 'Service', and the latter the id within the Service table. This is obviously loose coupling and horrible, but it is a way of solving the problem so long as you're only accessing the DB from code.
Abstract out the concept of "something that is chargeable" with a new table, of which Product and Service are now specializations, and have the Invoice_Row table link to something like ChargeableEntity_id. However, the ChargeableEntity table would essentially be redundant, as it too would need some way to link to an abstract "backend" table, which brings us all the way back around to the same problem.
Which way would you choose, or what are the other alternatives to solving this problem?
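For reference, the first option (two nullable foreign key columns) can be sketched as follows with Python's sqlite3 module. The CHECK constraint is an addition not mentioned in the question: it is one way to stop a row from pointing at both a product and a service. Column names other than Product_id and Service_id are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE service (service_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE invoice_row (
    row_id     INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES product(product_id),
    service_id INTEGER REFERENCES service(service_id),
    amount     REAL NOT NULL,
    -- Assumption: a row may reference a product, a service, or neither
    -- (ad-hoc charge), but never both.
    CHECK (product_id IS NULL OR service_id IS NULL)
);
INSERT INTO product VALUES (1, 'Widget');
INSERT INTO service VALUES (1, 'Installation');
""")

def add_row(product_id, service_id, amount) -> bool:
    """Try to insert an invoice row; return False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO invoice_row (product_id, service_id, amount) "
            "VALUES (?, ?, ?)",
            (product_id, service_id, amount),
        )
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    add_row(1, None, 9.99),     # product row
    add_row(None, 1, 50.0),     # service row
    add_row(None, None, 5.0),   # ad-hoc charge
    add_row(1, 1, 1.0),         # both set: rejected by the CHECK
]
```

Both columns remain real foreign keys, which is the main thing this option has over the Type_table/Type_id approach.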
What you are essentially asking is how to achieve polymorphism in a relational database. There are many approaches (as you yourself demonstrate) to this problem. One solution is to use "table per class" inheritance. In this setup, there will be a parent table (akin to your "chargeable item") that contains a unique identifier and the fields that are common to both products and services. There will be two child tables, products and services: each will contain the unique identifier for that entity and the fields specific to it.
One benefit to this approach over others is you don't end up with one table with many nullable columns that essentially becomes a dumping ground to describe anything ("schema-less").
One downside is as your inheritance hierarchy grows, the number of joins needed to grab all the data for an entity also grows.
I believe it depends on use case(s).
You could put the common columns in one table and put the product-specific and service-specific columns in their own tables. The catch here is that you need to join.
Otherwise, maintain two separate tables, one for Product and another for Service, and use application logic to determine which table to insert into. Getting all sales then essentially means a union of all product rows and all service rows.
I would go for approach 2 personally to avoid joins and inserting into two tables whenever a sale is made.
I have a table that has a bunch of fields. The fields can be broken into logical groups - like a job's project manager info. The groupings themselves aren't really entity candidates as they don't and shouldn't have their own PKs.
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Is there anything I should watch out for when I do this? Is this just a poor choice?
I can see that maybe my queries will get more complicated with all the extra joins but that can be mitigated with views right? If we're talking about a table with less than 100k records is this going to have a noticeable effect on performance?
Edit: I'll justify the non-entity-candidate thinking a little further. This information is entered by our user base. Users don't know or care about each other, so it's possible that the same user will submit the same "projectManager name" or whatever, which at this point wouldn't violate any constraint. It's for us to determine later on down the pipeline whether we want to correlate entries from separate users. If I were to give these things their own key, they would grow at the same rate the main table grows, since they are essentially part of the same entity. At no point is a user picking from a list of available "project managers".
So, given the above, I don't think they are entities. But maybe not - if you have further thoughts please post.
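To make the views idea concrete, here is a small sketch with Python's sqlite3 module. The job and job_pm tables and the job_full view are hypothetical names; only PmFirstName comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE job (job_id INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- 1:1 side table: shares the main table's key instead of having its own.
CREATE TABLE job_pm (
    job_id     INTEGER PRIMARY KEY REFERENCES job(job_id),
    first_name TEXT,
    last_name  TEXT
);

-- The view restores the original wide shape, prefixes included.
CREATE VIEW job_full AS
SELECT j.job_id,
       j.title,
       pm.first_name AS PmFirstName,
       pm.last_name  AS PmLastName
FROM job j
LEFT JOIN job_pm pm ON pm.job_id = j.job_id;

INSERT INTO job VALUES (1, 'Bridge repair'), (2, 'Road survey');
INSERT INTO job_pm VALUES (1, 'Ada', 'Jones');
""")

rows = conn.execute(
    "SELECT job_id, title, PmFirstName FROM job_full ORDER BY job_id"
).fetchall()
```

The LEFT JOIN keeps jobs that have no project manager row yet; they simply show NULL in the prefixed columns.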
I don't usually use 1 to 1 relations unless there is a specific performance reason for it. For example storing an infrequently used large text or BLOB type field in a separate table.
I would suspect that there is something else going on here though. In the example you give - PmFirstName - it seems like maybe there should be a single pm_id relating to a "ProjectManagers" or "Employees" table. Are you sure none of those groupings are really entity candidates?
To me, they smell unless for some rows or queries you won't be interested in the extra columns. e.g. if for a large portion of your queries you are not selecting the PmFirstName columns, or if for a large subset of rows those columns are NULL.
I like the smells tag.
I use 1 to 1 relationships for inheritance-like constructs.
For example, all bonds have some basic information like CUSIP, Coupon, DatedDate, and MaturityDate. This all goes in the main table.
Now each type of bond (Treasury, Corporate, Muni, Agency, etc.) also has its own set of columns unique to it.
In the past we would just have one incredibly wide table with all that information. Now we break out the type-specific info into separate tables, which gives us much better performance.
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Create a person table, every database needs this. Then in your project table have a column called PMKey which points to the person table.
Why do you feel that the group of fields is not an entity candidate? If it is not, then why try to identify the fields with a prefix?
Either drop the prefixes or extract them into their own table.
It is valuable splitting them up into separate tables if they are separate logical entities that could be used elsewhere.
So a "Project Manager" could be 1:1 with all the projects currently, but it makes sense that later you might want to be able to have a Project Manager have more than one project.
So having the extra table is good.
If you have PrimaryFirstName, PrimaryLastName, PrimaryPhone, SecondaryFirstName, SecondaryLastName, SecondaryPhone
You could just have a "Person" table with FirstName, LastName, Phone
Then your original Table only needs "PrimaryId" and "SecondaryId" columns to replace the 6 columns you previously had.
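A sketch of that refactoring using Python's sqlite3 module. The project table and the sample rows are made up; the point is that the six prefixed columns collapse into two foreign keys against the same Person table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL,
    phone      TEXT
);
-- Hypothetical "original table": two FK columns replace six name/phone columns.
CREATE TABLE project (
    project_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    primary_id   INTEGER NOT NULL REFERENCES person(person_id),
    secondary_id INTEGER REFERENCES person(person_id)
);
INSERT INTO person VALUES (1, 'Ana', 'Lee', '555-0100'),
                          (2, 'Raj', 'Patel', '555-0101');
INSERT INTO project VALUES (1, 'Website', 1, 2),
                           (2, 'Audit', 2, NULL);
""")

# Join the same table twice, once per role.
rows = conn.execute("""
    SELECT pr.name, p1.first_name, p2.first_name
    FROM project pr
    JOIN person p1      ON p1.person_id = pr.primary_id
    LEFT JOIN person p2 ON p2.person_id = pr.secondary_id
    ORDER BY pr.project_id
""").fetchall()
```

Here Raj is the secondary contact on one project and the primary on another without his details being stored twice.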
Also, using SQL you can split up filegroups and tables across physical locations.
So you could have a POST table, and a COMMENT Table, that have a 1:1 relationship, but the COMMENT table is located on a different filegroup, and on a different physical drive with more memory.
1:1 does not always smell. Unless it has no purpose.
This is a scenario I've seen in multiple places over the years; I'm wondering if anyone else has run across a better solution than I have...
My company sells a relatively small number of products, however the products we sell are highly specialized (i.e. in order to select a given product, a significant number of details must be provided about it). The problem is that while the amount of detail required to choose a given product is relatively constant, the kinds of details required vary greatly between products. For instance:
Product X might have identifying characteristics like (hypothetically)
'Color',
'Material'
'Mean Time to Failure'
but Product Y might have characteristics
'Thickness',
'Diameter'
'Power Source'
The problem (one of them, anyway) in creating an order system that utilizes both Product X and Product Y is that an Order Line has to refer, at some point, to what it is "selling". Since Product X and Product Y are defined in two different tables - and denormalization of products using a wide table scheme is not an option (the product definitions are quite deep) - it's difficult to see a clear way to define the Order Line in such a way that order entry, editing and reporting are practical.
Things I've Tried In the Past
1. Create a parent table called 'Product' with columns common to Product X and Product Y, then use 'Product' as the reference for the OrderLine table, and create an FK relationship with 'Product' as the primary side between the tables for Product X and Product Y. This basically places the 'Product' table as the parent of both OrderLine and all the disparate product tables (e.g. Products X and Y). It works fine for order entry, but causes problems with order reporting or editing since the 'Product' record has to track what kind of product it is in order to determine how to join 'Product' to its more detailed child, Product X or Product Y. Advantages: key relationships are preserved. Disadvantages: reporting, editing at the order line/product level.
2. Create 'Product Type' and 'Product Key' columns at the Order Line level, then use some CASE logic or views to determine the customized product to which the line refers. This is similar to item (1), without the common 'Product' table. I consider it a more "quick and dirty" solution, since it completely does away with foreign keys between order lines and their product definitions. Advantages: quick solution. Disadvantages: same as item (1), plus lost RI.
3. Homogenize the product definitions by creating a common header table and using key/value pairs for the customized attributes (OrderLine [n] <- [1] Product [1] <- [n] ProductAttribute). Advantages: key relationships are preserved; no ambiguity about product definition. Disadvantages: reporting (retrieving a list of products with their attributes, for instance), data typing of attribute values, performance (fetching product attributes, inserting or updating product attributes etc.)
If anyone else has tried a different strategy with more success, I'd sure like to hear about it.
Thank you.
The first solution you describe is the best if you want to maintain data integrity, and if you have relatively few product types and seldom add new product types. This is the design I'd choose in your situation. Reporting is complex only if your reports need the product-specific attributes. If your reports need only the attributes in the common Products table, it's fine.
The second solution you describe is called "Polymorphic Associations" and it's no good. Your "foreign key" isn't a real foreign key, so you can't use a DRI constraint to ensure data integrity. OO polymorphism doesn't have an analog in the relational model.
The third solution you describe, involving storing an attribute name as a string, is a design called "Entity-Attribute-Value" and you can tell this is a painful and expensive solution. There's no way to ensure data integrity, no way to make one attribute NOT NULL, no way to make sure a given product has a certain set of attributes. No way to restrict one attribute against a lookup table. Many types of aggregate queries become impossible to do in SQL, so you have to write lots of application code to do reports. Use the EAV design only if you must, for instance if you have an unlimited number of product types, the list of attributes may be different on every row, and your schema must accommodate new product types frequently, without code or schema changes.
Another solution is "Single-Table Inheritance." This uses an extremely wide table with a column for every attribute of every product. Leave NULLs in columns that are irrelevant to the product on a given row. This effectively means you can't declare an attribute as NOT NULL (unless it's in the group common to all products). Also, most RDBMS products have a limit on the number of columns in a single table, or the overall width in bytes of a row. So you're limited in the number of product types you can represent this way.
Hybrid solutions exist, for instance you can store common attributes normally, in columns, but product-specific attributes in an Entity-Attribute-Value table. Or you could store product-specific attributes in some other structured way, like XML or YAML, in a BLOB column of the Products table. But these hybrid solutions suffer because now some attributes must be fetched in a different way.
The ultimate solution for situations like this is to use a semantic data model, using RDF instead of a relational database. This shares some characteristics with EAV but it's much more ambitious. All metadata is stored in the same way as data, so every object is self-describing and you can query the list of attributes for a given product just as you would query data. Special products exist, such as Jena or Sesame, implementing this data model and a special query language that is different than SQL.
There's no magic bullet that you've overlooked.
You have what are sometimes called "disjoint subclasses". There's the superclass (Product) with two subclasses (ProductX) and (ProductY). This is a problem that -- for relational databases -- is Really Hard. [Another hard problem is Bill of Materials. Another hard problem is Graphs of Nodes and Arcs.]
You really want polymorphism, where OrderLine is linked to a subclass of Product, but doesn't know (or care) which specific subclass.
You don't have too many choices for modeling. You've pretty much identified the bad features of each. This is pretty much the whole universe of choices.
Push everything up to the superclass. That's the uni-table approach where you have Product with a discriminator (type="X" and type="Y") and a million columns. The columns of Product are the union of columns in ProductX and ProductY. There will be nulls all over the place because of unused columns.
Push everything down into the subclasses. In this case, you'll need a view which is the union of ProductX and ProductY. That view is what's joined to create a complete order. This is like the first solution, except it's built dynamically and doesn't optimize well.
Join Superclass instance to subclass instance. In this case, the Product table is the intersection of ProductX and ProductY columns. Each Product has a reference to a key either in ProductX or ProductY.
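As an illustration of the second choice (push everything down and union the subclasses in a view), here is a small sketch with Python's sqlite3 module. The table layout, the 'X'/'Y' discriminator in the view, and the product_type column on order_line are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
CREATE TABLE product_x (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    color      TEXT NOT NULL,
    material   TEXT NOT NULL
);
CREATE TABLE product_y (
    product_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    thickness_mm REAL NOT NULL,
    diameter_mm  REAL NOT NULL
);

-- The "superclass" exists only as a view over the subclass tables.
CREATE VIEW product AS
    SELECT 'X' AS product_type, product_id, name FROM product_x
    UNION ALL
    SELECT 'Y' AS product_type, product_id, name FROM product_y;

-- A foreign key cannot reference a view, so order_line carries
-- the type and the id with no declared constraint.
CREATE TABLE order_line (
    order_line_id INTEGER PRIMARY KEY,
    product_type  TEXT NOT NULL,
    product_id    INTEGER NOT NULL,
    quantity      INTEGER NOT NULL
);

INSERT INTO product_x VALUES (1, 'Gasket', 'red', 'rubber');
INSERT INTO product_y VALUES (1, 'Pipe', 2.5, 40.0);
INSERT INTO order_line VALUES (1, 'X', 1, 3), (2, 'Y', 1, 5);
""")

rows = conn.execute("""
    SELECT ol.order_line_id, p.name, ol.quantity
    FROM order_line ol
    JOIN product p
      ON p.product_type = ol.product_type AND p.product_id = ol.product_id
    ORDER BY ol.order_line_id
""").fetchall()
```

Note that both subclass tables can hold product_id 1, which is why the join needs the type as well as the id, and why referential integrity from order_line is lost in this layout.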
There isn't really a bold new direction. In the relational database world-view, those are the choices.
If, however, you elect to change the way you build application software, you can get out of this trap. If the application is object-oriented, you can do everything with first-class, polymorphic objects. You have to map from the kind-of-clunky relational processing; this happens twice: once when you fetch stuff from the database to create objects and once when you persist objects back to the database.
The advantage is that you can describe your processing succinctly and correctly. As objects, with subclass relationships.
The disadvantage is that your SQL devolves to simplistic bulk fetches, updates and inserts.
This becomes an advantage when the SQL is isolated into an ORM layer and managed as a kind of trivial implementation detail. Java programmers use iBatis (or Hibernate or TopLink or Cocoon), Python programmers use SQLAlchemy or SQLObject. The ORM does the database fetches and saves; your application directly manipulates Orders, Lines and Products.
This might get you started. It will need some refinement.
Table Product ( id PK, name, price, units_per_package)
Table Product_Attribs (id FK ref Product, AttribName, AttribValue)
Which would allow you to attach a list of attributes to the products. -- This is essentially your option 3
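A rough sketch of those two tables with Python's sqlite3 module, including the kind of pivot query needed to read attributes back as columns. The composite primary key on Product_Attribs and the sample data are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE product (
    id                INTEGER PRIMARY KEY,
    name              TEXT NOT NULL,
    price             REAL NOT NULL,
    units_per_package INTEGER NOT NULL
);
CREATE TABLE product_attribs (
    id           INTEGER NOT NULL REFERENCES product(id),
    attrib_name  TEXT NOT NULL,
    attrib_value TEXT NOT NULL,      -- every value is stored as text
    PRIMARY KEY (id, attrib_name)    -- assumption: one value per attribute
);
INSERT INTO product VALUES (1, 'Product X', 10.0, 4),
                           (2, 'Product Y', 25.0, 1);
INSERT INTO product_attribs VALUES
    (1, 'Color', 'red'), (1, 'Material', 'steel'),
    (2, 'Thickness', '2.5'), (2, 'Diameter', '40');
""")

# Turning attribute rows back into columns takes one CASE per attribute.
pivot = conn.execute("""
    SELECT p.name,
           MAX(CASE WHEN a.attrib_name = 'Color'     THEN a.attrib_value END),
           MAX(CASE WHEN a.attrib_name = 'Thickness' THEN a.attrib_value END)
    FROM product p
    LEFT JOIN product_attribs a ON a.id = p.id
    GROUP BY p.id
    ORDER BY p.id
""").fetchall()
```

The thickness comes back as the string '2.5' rather than a number, which is the data-typing cost of key/value storage raised in the question.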
If you know a maximum number of attributes, you could go with:
Table Product (id PK, name, price, units_per_package, attrName_1, attrValue_1 ...)
Which would of course de-normalize the database, but make queries easier.
I prefer the first option because
It supports an arbitrary number of attributes.
Attribute names can be stored in another table, and referential integrity enforced so that those damn Canadians don't stick a "colour" in there and break reporting.
Does your product line ever change?
If it does, then creating a table per product will cost you dearly, and the key/value pairs idea will serve you well. That's the kind of direction down which I am naturally drawn.
I would create tables like this:
Attribute(attribute_id, description, is_listed)
-- contains values like "colour", "width", "power source", etc.
-- "is_listed" tells us if we can get a list of valid values:
AttributeValue(attribute_id, value)
-- lists of valid values for different attributes.
Product (product_id, description)
ProductAttribute (product_id, attribute_id)
-- tells us which attributes apply to which products
Order (order_id, etc)
OrderLine (order_id, order_line_id, product_id)
OrderLineProductAttributeValue (order_line_id, attribute_id, value)
-- tells us things like: order line 999 has "colour" of "blue"
The SQL to pull this together is not trivial, but it's not too complex either... and most of it will be write once and keep (either in stored procedures or your data access layer).
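As one possible reading of how "is_listed" and AttributeValue work together, the sketch below (Python's sqlite3 module) checks a proposed value in the data access layer. The is_valid helper, the column types and the sample data are hypothetical; the original design only names the tables and columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY,
    description  TEXT NOT NULL,
    is_listed    INTEGER NOT NULL       -- 1 if a list of valid values exists
);
CREATE TABLE attribute_value (
    attribute_id INTEGER NOT NULL REFERENCES attribute(attribute_id),
    value        TEXT NOT NULL,
    PRIMARY KEY (attribute_id, value)
);
INSERT INTO attribute VALUES (1, 'colour', 1), (2, 'width', 0);
INSERT INTO attribute_value VALUES (1, 'blue'), (1, 'red');
""")

def is_valid(attribute_id: int, value: str) -> bool:
    """Hypothetical check: unknown attributes are rejected, unlisted
    attributes accept any value, listed ones must match a stored value."""
    row = conn.execute(
        "SELECT is_listed FROM attribute WHERE attribute_id = ?",
        (attribute_id,),
    ).fetchone()
    if row is None:
        return False
    if not row[0]:
        return True
    found = conn.execute(
        "SELECT 1 FROM attribute_value WHERE attribute_id = ? AND value = ?",
        (attribute_id, value),
    ).fetchone()
    return found is not None

checks = [
    is_valid(1, 'blue'),    # listed attribute, known value
    is_valid(1, 'purple'),  # listed attribute, unknown value
    is_valid(2, '12.5'),    # free-form attribute
    is_valid(3, 'x'),       # attribute does not exist
]
```

A check like this would run before inserting into OrderLineProductAttributeValue; it lives in code rather than in a declared constraint, which is part of the trade-off with this design.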
We do similar things with a number of types of entity.
Chris and AJ: Thanks for your responses. The product line may change, but I would not term it "volatile".
The reason I dislike the third option is that it comes at the cost of metadata for the product attribute values. It essentially turns columns into rows, losing most of the advantages of the database column in the process (data type, default value, constraints, foreign key relationships etc.)
I've actually been involved in a past project where the product definition was done in this way. We essentially created a full product/product attribute definition system (data types, min/max occurrences, default values, 'required' flags, usage scenarios etc.) The system worked, ultimately, but came with a significant cost in overhead and performance (e.g. materialized views to visualize products, custom "smart" components to represent and validate data entry UI for product definition, another "smart" component to represent the product instance's customizable attributes on the order line, blahblahblah).
Again, thanks for your replies!