I'm working on a personal Grails project and want to put together a domain model to represent a product catalog, but I can't decide on the best way to go about it. I will have a number of different product categories, though many will share a common base set of properties (e.g. product name, description, price). However, some products will have additional properties specific to their category.
I've looked into the Entity Attribute Value (EAV) model, which provides a very extensible solution. I've also considered an explicit OO inheritance model, with subclasses of a base Product class representing any product that has additional properties.
Obviously, the second approach is less extensible: adding a new product category requires a new entity and likely a custom view/editor for the front end. However, as a developer, I find the programming model significantly clearer and much more logical to code against.
The EAV approach would allow dynamic extensibility but would lead to a more cryptic programming model and would have a performance overhead in the DB (complex table joins). Views/editors on the front end could be dynamically generated to include any number of the custom attributes for a product category - though I'm sure situations would arise where such dynamic generation wouldn't suffice from a usability perspective.
When I consider a framework like Grails, it would seem to make sense to go down the route of creating an explicit inheritance model. I'm not convinced a framework like Grails would fit the EAV approach so well - a lot of the benefits of Grails would be lost in the complexity. However, I'm not sure this approach would scale practically as the number of product categories increases.
I'd be really interested to hear of others' experience with this type of modelling challenge!
I've had a situation similar to this and went with the inheritance solution. Going in, I knew I'd never have more than about 10 classes, so I wasn't worried about runaway growth in complexity. Although you will need views and controllers for each class, there are some things you can do to reduce code duplication. The first is to put all common view code in templates. For example, if all your classes have a price, name, and description, the view code for displaying and editing those fields should go into a template. Then, instead of duplicating those lines in every view, you can simply do:
<g:render template="/baseView" />
For more info on templates see http://www.grails.org/Tag+-+render
The second thing I found useful was to move all shared controller code into one class and define closures there that I could call from my actual controllers. This got quite ugly, since my save method would not only ensure the fields of the base class were handled properly but also contained code for corner cases of the inherited classes. Looking back, a better option may have been to define the custom behavior as methods on the domain classes that required it, or to use a service. That said, putting code into closures callable from the controllers was still helpful, since it let me have one-line controller bodies instead of 30- or 40-line ones. If I had to modify code dealing with the base class, I could edit it where the closures were defined, and the change would be reflected across all my controllers without touching their source files. This came in quite useful and allowed me to edit code in one place instead of editing duplicate code across 10 controllers.
Inheritance works fine with Hibernate and GORM. Consider using the table-per-subclass mapping as you cannot define NOT NULL constraints with the (default) table-per-hierarchy inheritance mapping.
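For illustration, table-per-subclass gives you roughly one table per domain class, joined by primary key, which is what makes per-subtype NOT NULL constraints possible. A sketch of the kind of schema it produces (the "book" subtype and its columns are invented for this example):

CREATE TABLE product (
    id          BIGINT PRIMARY KEY,
    name        VARCHAR(255) NOT NULL,
    description TEXT,
    price       DECIMAL(10,2) NOT NULL
);

-- One table per subclass, joined to the base by primary key.
-- NOT NULL works here; with table-per-hierarchy it wouldn't,
-- because rows of other subtypes would share the same table.
CREATE TABLE book_product (
    id   BIGINT PRIMARY KEY REFERENCES product (id),
    isbn VARCHAR(13) NOT NULL
);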
You can also use composition for "not so" common, but shared, attributes.
"The" criteria for EAV is, do you need to introduce new attributes without changing the data model?
In practice, applications like yours use a combination of inheritance and EAV.
You're concerned about performance when querying joined tables. That's normally not an issue if you index the columns that appear in the SQL WHERE clause. (GORM/Hibernate will automatically create the foreign keys, which are important as well.) Given that the necessary indexes are in place and a DBMS with a decent query optimizer (e.g., PostgreSQL or SQL Server; maybe not MySQL), you can select from millions of records using 10 joins in 50 milliseconds or less.
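As a minimal sketch (index names invented, reusing the tables from the earlier sketch):

-- Index the columns your WHERE clauses actually filter on:
CREATE INDEX idx_product_price ON product (price);

-- With the join and filter columns indexed, a query like this
-- stays fast even on large tables:
SELECT p.name, b.isbn
  FROM product p
  JOIN book_product b ON b.id = p.id
 WHERE p.price < 20;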
Finally, there's been an excellent recent discussion on your issue.
I've worked on big projects before, but I'm trying to improve my best practices, and one thing I'm stuck on is whether to avoid creating many models.
This might seem a little bit confusing, so let me put an example:
Let's suppose I have a Post model and an Answer model, where Answer relates to Post in a one-to-many relationship.
Then, I want to add a Comment model, both to Post and Answer.
I could add two nullable foreign key columns to Comment to indicate which model it belongs to.
But I could also create PostComment and AnswerComment models, removing the nullable columns but creating more boilerplate.
Which practice is the best?
It depends.
I'm assuming the design is primarily to support a transactional application (OLTP), and not reporting (OLAP). I'm also assuming that model = table.
There's nothing inherently wrong with having multiple tables, as long as the design makes sense (can be easily supported), can be extended/modified with relative ease (maintained), and does not lead to poorly performing queries (e.g. if there's a mismatch between the database schema and how calling applications want to consume its data).
If the data is the same, it should probably go into the same table; e.g. if you're dealing with birds, don't have tblHawk, tblParrot, etc. But if you had all animals, then sure, you'd probably want to separate them out somehow - tblBird, tblFish, tblMammal, etc. - because the data would be too different and too hard to model effectively.
You have answers and posts - I assume these are different enough that having separate tables makes sense? If so, what about comments to them? If comments are essentially the same regardless of post/answer then one table, as you described, is probably a good idea.
Also consider the application: if you have separate post/answer comment tables, there's more code to be developed and maintained - but it's separate, so more code yet possibly more flexibility with less complexity. Using one table will have the opposite effect. Neither is wrong, but one approach is probably better than the other depending on your situation.
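If it helps, a sketch of the single-table variant (names invented), with a check constraint so each comment has exactly one parent:

CREATE TABLE comment (
    id        INTEGER PRIMARY KEY,
    post_id   INTEGER REFERENCES post (id),
    answer_id INTEGER REFERENCES answer (id),
    body      TEXT NOT NULL,
    -- exactly one of the two parent keys must be set
    CHECK ((post_id IS NULL AND answer_id IS NOT NULL)
        OR (post_id IS NOT NULL AND answer_id IS NULL))
);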
Here is something I've wondered for quite some time, and have not seen a real (good) solution for yet. It's a problem I imagine many games having, and that I can't easily think of how to solve (well). Ideas are welcome, but since this is not a concrete problem, don't bother asking for more details - just make them up! (and explain what you made up).
Ok, so, many games have the concept of (inventory) items, and often there are hundreds of different kinds of items, often with wildly varying data structures - some items are very simple ("a rock"), others can have insane complexity or data behind them ("a book", "a programmed computer chip", "a container with more items"), etc.
Now, programming something like that is easy - just have everything implement an interface, or maybe extend an abstract root item. Since objects in the programming world don't have to look the same on the inside as on the outside, there is really no issue with how many and what kind of private fields any type of item has.
But when it comes to database serialization (binary serialization is of course no problem), you are facing a dilemma: how would you represent that in, say, a typical SQL database?
Some attempts at a solution that I have seen, none of which I find satisfying:
Binary serialization of the items; the database just holds an ID and a blob.
Pros: takes like 10 seconds to implement.
Cons: basically sacrifices every database feature, hard to maintain, near impossible to refactor.
A table per item type.
Pros: clean, flexible.
Cons: with a wide variety of items you end up with hundreds of tables, and every search for an item has to query them all, since SQL has no concept of a table/type 'reference'.
One table with a lot of fields that aren't used by every item.
Pros: takes like 10 seconds to implement, still searchable.
Cons: wastes space and performance, and it's hard to tell from the database which fields are in use.
A few tables with a few 'base profiles' for storage where similar items get thrown together and use the same fields for different data.
Pros: I've got nothing.
Cons: wastes space and performance, and it's hard to tell from the database which fields are in use.
What ideas do you have? Have you seen another design that works better or worse?
It depends on whether you need to sort, filter, count, or analyze those attributes.
If you use EAV, then you will screw yourself nicely. Try doing reports on an EAV schema.
The best option is to use Table Inheritance:
PRODUCT
    id pk
    type
    att1

PRODUCT_X
    id pk fk PRODUCT
    att2
    att3

PRODUCT_Y
    id pk fk PRODUCT
    att4
    att5
For attributes that you don't need to search, sort, or analyze, use a blob or XML column.
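With that schema, fetching a PRODUCT_X item together with its base attributes is a single join, for example:

SELECT p.id, p.type, p.att1, x.att2, x.att3
  FROM product p
  JOIN product_x x ON x.id = p.id;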
I have two alternatives for you:
One table for the base type and supplemental tables for each “class” of specialized types.
In this schema, properties common to all “objects” are stored in one table, so you have a unique record for every object in the game. For special types like books, containers, usable items, etc, you have another table for each unique set of properties or relationships those items need. Every special type will therefore be represented by two records: the base object record and the supplemental record in a particular special type table.
PROS: You can use column-based features of your database like custom domains, checks, and xml processing; you can have simpler triggers on certain types; your queries differ exactly at the point of diverging concerns.
CONS: You need two inserts for many objects.
Use a “kind” enum field and a JSONB-like field for the special type data.
This is kind of like your #1 or #3, except with some database help. Postgres added JSONB, giving you an improvement over the old EAV pattern. Other databases have a similar complex field type. In this strategy you roll your own mini schema that you stash in the JSONB field. The kind field declares what you expect to find in that JSONB field.
PROS: You can extract special type data in your queries; you can add check constraints and have a simple schema to deal with; you can benefit from indexing even though your data is heterogeneous; your queries and inserts are simple.
CONS: Your data types within JSONB-like fields are pretty limited and you have to roll your own validation.
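A minimal Postgres sketch of this second alternative (all names invented):

CREATE TABLE item (
    id    SERIAL PRIMARY KEY,
    name  TEXT NOT NULL,
    kind  TEXT NOT NULL,                -- 'book', 'container', 'chip', ...
    extra JSONB NOT NULL DEFAULT '{}'   -- the per-kind attributes
);

-- A GIN index keeps the heterogeneous data searchable:
CREATE INDEX idx_item_extra ON item USING gin (extra);

-- Special-type attributes can still be filtered on:
SELECT id, name
  FROM item
 WHERE kind = 'book'
   AND extra->>'author' = 'Banks';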
Yes, it is a pain to design database formats like this. I'm designing a notification system and hit the same problem. My notification system is less complex than yours, though - the data it holds is at most IDs and usernames. My current solution is a mix of 1 and 3 - I serialize the data that differs between notifications, and use columns for the two usernames (some notifications have two, some have one). I shy away from method 2 because I hate that design, but it's probably just me.
However, if you can afford it, I would suggest thinking outside the realm of RDBMS - a non-relational store (especially a key/value one) may be a better fit for this data, especially if the items differ from one another a lot.
I'm sure this has been asked here a million times before, but in addition to the options you've discussed in your question, you can look at an EAV schema, which is very flexible but has its own set of cons.
Another alternative is database systems which are not relational. There are object databases as well as various key/value stores and document databases.
Typically all these things break down to some extent when you need to query against the flexible attributes. This is kind of an intrinsic problem, however: conceptually, what does it really mean to accurately query things that are unstructured?
First of all, do you actually need the concurrency, scalability and ACID transactions of a real database? Unless you are building an MMO, your game structures will likely fit in memory anyway, so you can search and otherwise manipulate them there directly. In a scenario like this, the "database" is just a store for serialized objects, and you can replace it with the file system.
If you conclude that you do (need a database), then the key is in figuring out what "atomicity" means from the perspective of the data management.
For example, if a game item has a bunch of attributes, but none of these attributes are manipulated individually at the database level (even though they could well be at the application level), then it can be considered as "atomic" from the data management perspective. OTOH, if the item needs to be searched on some of these attributes, then you'll need a good way to index them in the database, which typically means they'll have to be separate fields.
Once you have identified attributes that should be "visible" versus the attributes that should be "invisible" from the database perspective, serialize the latter to BLOBs (or whatever), then forget about them and concentrate on structuring the former.
That's where the fun starts and you'll probably need to use "all of the above" strategy for reasonable results.
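A minimal generic-SQL sketch of that visible/invisible split (all names invented):

CREATE TABLE game_item (
    id    INTEGER PRIMARY KEY,
    owner INTEGER NOT NULL,      -- searched on: a real, indexed column
    kind  VARCHAR(30) NOT NULL,  -- searched on: a real, indexed column
    state BLOB NOT NULL          -- everything "invisible", serialized
);

CREATE INDEX idx_game_item_owner ON game_item (owner);
CREATE INDEX idx_game_item_kind  ON game_item (kind);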
BTW, some databases support "deep" indexes that can go into heterogeneous data structures. For example, take a look at Oracle's XMLIndex, though I doubt you'll use Oracle for a game.
You seem to be trying to solve this for a gaming context, so maybe you could consider a component-based approach.
I have to say that I personally haven't tried this yet, but I've been looking into it for a while and it seems to me something similar could be applied.
The idea would be that all the entities in your game would basically be a bag of components. These components can be Position, Energy or for your inventory case, Collectable, for example. Then, for this Collectable component you can add custom fields such as category, numItems, etc.
When you're going to render the inventory, you can simply query your entity system for items that have the Collectable component.
How can you save this into a DB? You can define the components independently in their own table and then for the entities (each in their own table as well) you would add a "Components" column which would hold an array of IDs referencing these components. These IDs would effectively be like foreign keys, though I'm aware that this is not exactly how you can model things in relational databases, but you get the idea.
Then, when you load the entities and their components at runtime, based on the component being loaded you can set the corresponding flag in their bag of components so that you know which components this entity has, and they'll then become queryable.
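A more conventional relational rendering of that idea would be one table per component, keyed by entity ID, instead of an array column (a sketch; all names invented):

CREATE TABLE entity (
    id INTEGER PRIMARY KEY
);

-- One table per component type; the presence of a row means
-- "this entity has this component".
CREATE TABLE collectable (
    entity_id INTEGER PRIMARY KEY REFERENCES entity (id),
    category  VARCHAR(50),
    num_items INTEGER
);

-- "All entities with the Collectable component" is a plain join:
SELECT e.id, c.category, c.num_items
  FROM entity e
  JOIN collectable c ON c.entity_id = e.id;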
Here's an interesting read about component-based entity systems.
I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by the user.
The new version of the DB instead has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So now, when the user adds a custom property, a new column is added to the current ENTITY_PROPERTIES_n table until the maximum number of columns (managed by the application) is reached; then a new table is created.
So, my question is: is this a correct way to design a DB structure? Is this really the way to "increase performance"? The old structure required many joins or sub-selects, but this structure doesn't seem very smart (or even correct) to me...
I have seen this done before based on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.
No, it's not. It's terrible.
until the maximum number of columns (managed by the application) is reached,
then a new table is created.
This sentence says it all. Under no circumstance should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be like this.
Consider this:
You lose all type-safety, as you have to store all values in the PROPERTY_VALUE column.
Depending on your users, you could have them change the schema beforehand and then run some kind of database update batch job, so at least all the properties would be declared with the right datatype (see the sketch after these points). Also, you could then lose the entity_id/key arrangement.
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it
Maybe a RDBMS isn't the right thing for your app. Consider using a key/value based store like MongoDB or another NoSQL database. (http://nosql-database.org/)
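For example, such a batch job could add a properly typed column up front rather than storing everything as text (an illustrative sketch; the column name is invented):

ALTER TABLE entity ADD COLUMN warranty_months INTEGER;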
From what I know of databases (though I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know the maximum number of custom properties a user might have, I'd say you'd be better off fixing the table at that number of columns.
Then again, I'm not an expert, but adding new columns on the fly isn't the kind of operation databases like. It's going to bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties or stick with the old system.
I believe creating a new table per entity to store properties is bad design, as you could end up bloating the database with tables. The only pro of the second method is that you avoid traversing all the redundant rows that do not apply to the selected entity. However, using indexes on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes and let the database engine determine the best methods for selecting the data rather than separating each entity property into a new table.
There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from entity where STANDARD_PROPERTY_1 = 'banana'".
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound WHERE clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is (probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
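To illustrate the difference (the property keys and values here are invented):

-- Old structure: each custom property in the predicate needs
-- its own join against the EAV table:
SELECT e.entity_id
  FROM entity e
  JOIN entity_properties p1
    ON p1.entity_id = e.entity_id
   AND p1.property_key = 'custom_1' AND p1.property_value = 'true'
  JOIN entity_properties p2
    ON p2.entity_id = e.entity_id
   AND p2.property_key = 'custom_2' AND p2.property_value = 'banana';

-- New structure: the same predicate is a plain column filter:
SELECT c.entity_id_1
  FROM entity_properties_1 c
 WHERE c.custom_property_1 = 'true'
   AND c.custom_property_2 = 'banana';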
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property, and how many columns are too many. The likelihood of entertaining bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.
I'd like suggestions for the design of a CRUD business app using Silverlight 4, the Business Application Template, WCF RIA Services and the Entity Framework 4. The app tracks lab test results performed on material samples. It replaces a (difficult to maintain) existing web application. Lab tests results are stored in two "SampleData" tables made up of hundreds of fields. The tables have a one to one relationship. I combined the two tables into one using Entity Framework's Table Per Type Inheritance which I'm very happy with. Note: I've decided not to change the database design to avoid destroying the existing application, but it was considered.
My dilemma is how to break up this huge table. Each record represents a material sample that is tested. The logical grouping of fields is by lab test. I envision my UI having multiple tabs or separate pages - one for each test. The problem at this point is that I'm sucking in ALL the fields yet only displaying a few in a paged DataGrid and there is a noticeable delay. Instead of one giant entity it might be nice to have several "Lab Test" entities (each representing a type of test) that are sub-sets of my one giant TPT Inheritance table. How would I do this? The base SampleData table/entity contains header fields plus several child test results fields. The second derived table/entity contains more test result fields linked to the base by SampleID. If split up I'd need to maintain the header info with each Lab Test entity.
I'm willing to stick with one giant table/entity (despite a slight performance penalty). Still, I'm wondering the best way to create my UI with this one entity. Can a DataForm be tabbed? If I make a dashboard with links to lab tests how do I keep header info in sync with each test page?
I know this is a broad question. I'm hoping to get suggestions on a good design path that will allow me to grow the app as new lab tests are added (making an even bigger entity). I'd hope to find a path that simplifies maintenance and takes advantage of the RAD experience Microsoft is promoting.
Thanks in advance!
I scanned the post discussing the database design and must say that, based on what you said and the fact that you've already got users asking for more tests (repeating values), I wish you'd reconsider the DB redesign. You can create a flat view to simulate the existing flat samples-data table and use that to minimize breakage in the existing application.
But you've already made that decision, so how about reversing the situation? Instead of fixing the database, add code to the domain service that transforms the data from its flat layout, leaving out all the null values.
One idea is to write a view that un-flattens the data, leaving out the null no-test rows. The query will raise eyebrows (I'll probably get flamed for this) because it looks nasty, but in reality the DBMS does a fine job optimizing and performing it (in Oracle, anyway). I've had great results with a view something like:
create view programmer_exp_unflat as
select programmer_id, 'C#' as lang, csharp_yrs as yrs
  from programmer_exp_flat where csharp_yrs is not null
union
select programmer_id, 'Java', java_yrs
  from programmer_exp_flat where java_yrs is not null
union
select programmer_id, 'Cobol', cobol_yrs
  from programmer_exp_flat where cobol_yrs is not null
-- ...and so on, repeated once per test column
It's backwards and ugly no matter how you look at it, but it reduces your result set to a bare minimum with no need to break things into categories. New test values require modifying the view and, depending on UI flexibility and business rules, might not require any other changes.
It makes coding at the UI more difficult, as it would have been with the right database design in the first place, but your query result is reduced to only the tests that have been completed. If your users are flexible, the UI could show the test results as a list, making display a piece of cake. Your current design pretty much forces you to modify the UI and database with each and every new test.
These are the type of challenges that make being a developer so much fun - and why all the marketing-gimmick sample CRUD applications that can be built in five minutes are worthless in the real world.
I'm answering (and accepting) my own question to increase my stack overflow accept rate, but my "answer" is that I have found no answer yet. Because I've had to move on with the project I continue to use one giant entity. I've also moved away from Silverlight and turned the project into a WPF app due to various struggles with Silverlight such as inherent asynchronous data access.
I work for a billing service that uses some complicated mainframe-based billing software for its core services. We have all kinds of codes we set up that are used for tracking things: payment codes, provider codes, write-off codes, etc. Each type of code has a completely different set of data items that control what the code does and how it behaves.
I am tasked with building a new system for tracking changes made to these codes. We want to know who requested what code, who/when it was reviewed, approved, and implemented, and what the exact setup looked like for that code. The current process only tracks two of the different types of code. This project will add immediate support for a third, with the goal of also making it easy to add additional code types into the same process at a later date. My design conundrum is that each code type has a different set of data that needs to be configured with it, of varying complexity. So I have a few choices available:
I could give each code type its own table(s) and build them independently. Considering we only have three codes I'm concerned about at the moment, this would be simplest. However, this concept has already failed, or I wouldn't be building a new system in the first place. It's also weak in that writing generic presentation-level code to display request data for any code type (even those not yet implemented) is not trivial.
Build a DB schema capable of storing the data points associated with each code type: not only the values, but what type they are and how they should be displayed (a dropdown list from an enum of some kind). I have a decent DB schema for this started, but it just feels wrong: overly complicated to query and maintain, and it ultimately requires a custom query to view the full data in a nice tabular form for each code type anyway.
Store the data points for each code request as XML. This greatly simplifies the database design and will hopefully make it easier to build the interface: just set up a schema for each code type, then have code that validates requests against their schema, transforms a schema into display widgets, and maps an actual request item onto the display. What this option lacks is a way to handle changes to the schema.
My questions are: how would you do it? Am I missing any big design options? Any other pros/cons to those choices?
My current inclination is to go with the xml option. Given the schema updates are expected but extremely infrequent (probably less than one per code type per 18 months), should I just build it to assume the schema never changes, but so that I can easily add support for a changing schema later? What would that look like in SQL Server 2000 (we're moving to SQL Server 2005, but that won't be ready until after this project is supposed to be completed)?
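For what it's worth, on SQL Server 2000 (which has no native XML type) the storage side might be as simple as this sketch, with validation against the per-code-type schema happening in application code (table and column names invented):

CREATE TABLE code_request (
    request_id INT IDENTITY PRIMARY KEY,
    code_type  VARCHAR(50) NOT NULL,
    payload    NTEXT NOT NULL  -- the serialized XML document
);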
[Update]:
One reason I'm thinking xml is that some of the data will be complex: nested/conditional data, enumerated drop down lists, etc. But I really don't need to query any of it. So I was thinking it would be easier to define this data in xml schemas.
However, le dorfier's point about introducing a whole new technology hit very close to home. We currently use very little xml anywhere. That's slowly changing, but at the moment this would look a little out of place.
I'm also not entirely sure how to build an input form from a schema, and then merge a record that matches that schema into the form in an elegant way. It will be very common to only store a partially-completed record and so I don't want to build the form from the record itself. That's a topic for a different question, though.
Based on all the comments so far, XML is still the leading candidate. Separate tables may be as good or better, but I have the feeling that my manager would see that as not different or generic enough compared to what we're currently doing.
There is no simple, generic solution to a complex, meticulous problem. You can't have both simple storage and simple app logic at the same time. Either the database structure must be complex, or else your app must be complex as it interprets the data.
I outline five solutions to this general problem in "product table, many kind of product, each product have many parameters."
For your situation, I would lean toward Concrete Table Inheritance or Serialized LOB (the XML solution).
The reason that XML might be a good solution is that:
You don't need to use SQL to pick out individual fields; you're always going to display the whole form.
Your XML can annotate fields for data type, user interface control, etc.
But of course you need to add code to parse and validate the XML. You should use an XML schema to help with this. In which case you're just replacing one technology for enforcing data organization (RDBMS) with another (XML schema).
You could also use an RDF solution instead of an RDBMS. In RDF, metadata is queryable and extensible, and you can model entities with "facts" about them. For example:
Payment code XYZ contains attribute TradeCredit (Net-30, Net-60, etc.)
Attribute TradeCredit is of type CalendarInterval
Type CalendarInterval is displayed as a drop-down
.. and so on
Re your comments: Yeah, I am wary of any solution that uses XML. To paraphrase Jamie Zawinski:
Some people, when confronted with a problem, think "I know, I'll use XML." Now they have two problems.
Another solution would be to invent a little Domain-Specific Language to describe your forms. Use that to generate the user-interface. Then use the database only to store the values for form data instances.
Why do you say "this concept has already failed or I wouldn't be building a new system in the first place"? Is it because you suspect there must be a scheme for handling them in common?
Else I'd say to continue the existing philosophy, and establish additional tables. At least it would be sharing an existing pattern and maintaining some consistency in that respect.
Do a web search on "generalized specialized relational modeling". You'll find articles on how to set up tables that store the attributes of each kind of code, and the attributes common to all codes.
If you’re interested in object modeling, just search on “generalized specialized object modeling”.