I believe the following is a pretty common use case, yet even after thinking about it for a couple of hours and discussing it with a friend I found no satisfactory solution.
Basic problem: How do you store and efficiently query objects/entities with connections to many different relations?
The objects
Imagine you have a system that keeps track of a group of cars, their positions and their drivers (each is an entity in your DB/system). From monitoring the activity of the cars you are generating events such as speeding violations, collisions between two cars and fuel fillups. Each of these events is a little different; modeled as objects, they might have the following attributes:
Speeding violation
speed (integer)
car (reference)
driver (reference)
Collision
car1 (reference)
car2 (reference)
driver1 (reference)
driver2 (reference)
position (reference)
date & time
state (fixed or new)
Fuel fillup
car (reference)
amount (float)
position (reference)
Additionally they all share some attributes such as a creation date and an owning company. It is also possible that new events will be generated in the future and these should be simple to add to whatever storage system/model gets decided on.
Query demands
Queries are roughly ordered by importance (most important first). The system should be able to efficiently
query all notifications (with their attributes) for a given company or time frame
query all notifications belonging to a certain car or driver
query all notifications of a certain type (e.g. all fillup notifications)
The question
How do you store the above described objects in a database (not necessarily relational although the referenced entities are in a relational DB) such that the queries described can be performed efficiently?
The definition of efficiency here can be pretty flexible. What is important to me is that situations where e.g. all of the dependencies have to be queried individually are avoided.
Potential solutions
Here are some of the ideas I came up with:
2 tables model: A first table event holds the common general information of the events such as an id, company, event_type and creation date. A second table event_objects then holds all the different attachments and contains the columns id, event_id, object_id and object_type.
Good:
Most queries can be answered efficiently
Very easy to scale for additional events
Very easy to add new attributes for an event
Bad:
When the objects for a specific event have to be retrieved they each have to be fetched with an individual query
If the DB is relational this goes against good practice/designed use (essentially use the DB as a key-value store)
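For what it's worth, the references of many events can be pulled in one joined query rather than one query per event. Below is a minimal sketch of the 2 tables model using Python's built-in sqlite3 module, which is my choice for illustration only. The table and column names follow the description above; the index names, the sample rows and the two helper functions are assumptions. Note that this returns the (object_type, object_id) references, not the referenced cars or drivers themselves; those would still need one batched lookup per object type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (
    id INTEGER PRIMARY KEY,
    company TEXT NOT NULL,
    event_type TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE event_objects (
    id INTEGER PRIMARY KEY,
    event_id INTEGER NOT NULL REFERENCES event(id),
    object_id INTEGER NOT NULL,
    object_type TEXT NOT NULL
);
-- Hypothetical indexes supporting the "by event" and "by car/driver" lookups.
CREATE INDEX idx_event_objects_event ON event_objects(event_id);
CREATE INDEX idx_event_objects_object ON event_objects(object_type, object_id);
""")

# Made-up sample data: two events for one company, one for another.
conn.executemany("INSERT INTO event VALUES (?, ?, ?, ?)", [
    (1, "acme", "speeding", "2020-01-01"),
    (2, "acme", "fillup", "2020-01-02"),
    (3, "other", "collision", "2020-01-03"),
])
conn.executemany(
    "INSERT INTO event_objects (event_id, object_id, object_type) VALUES (?, ?, ?)",
    [(1, 10, "car"), (1, 20, "driver"), (2, 10, "car"), (3, 10, "car"), (3, 11, "car")],
)


def events_for_company(company):
    """Return {event_id: [(object_type, object_id), ...]} using a single joined query."""
    rows = conn.execute(
        """
        SELECT e.id, o.object_type, o.object_id
        FROM event AS e
        LEFT JOIN event_objects AS o ON o.event_id = e.id
        WHERE e.company = ?
        ORDER BY e.id, o.id
        """,
        (company,),
    ).fetchall()
    result = {}
    for event_id, object_type, object_id in rows:
        result.setdefault(event_id, [])
        if object_type is not None:  # events without attachments still appear
            result[event_id].append((object_type, object_id))
    return result


def events_for_object(object_type, object_id):
    """Return the ids of all events that reference the given car, driver, etc."""
    rows = conn.execute(
        """
        SELECT DISTINCT event_id FROM event_objects
        WHERE object_type = ? AND object_id = ?
        ORDER BY event_id
        """,
        (object_type, object_id),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling events_for_company("acme") groups the references per event in application code, and events_for_object("car", 10) covers the "all notifications belonging to a certain car" query.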
1 table per event: Simply create one table for each event type with a column for each attribute
Good:
Events of the same type can be queried very efficiently
Querying all events of a company/car etc. is only linear in the number of event types (as opposed to the number of related attributes times the number of fetched events for the 2 tables model)
Fits more nicely with the relational model
Bad:
Harder to query all events of a company/for a time frame (requires #types queries)
Harder to add a new attribute to an existing event type
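One way to soften the "#types queries" drawback is a view that unions the shared columns of every event table. The sketch below uses Python's sqlite3 module for illustration; the view name all_events, the column names and the sample rows are assumptions, and only two of the event types are modeled. Collisions, with their two cars, would need either two rows in the view or a separate treatment. Type-specific attributes such as speed or amount still have to be read from the individual tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE speeding_violation (
    id INTEGER PRIMARY KEY, company TEXT, created_at TEXT,
    speed INTEGER, car_id INTEGER, driver_id INTEGER
);
CREATE TABLE fuel_fillup (
    id INTEGER PRIMARY KEY, company TEXT, created_at TEXT,
    car_id INTEGER, amount REAL, position_id INTEGER
);
-- Hypothetical view exposing only the columns every event type shares.
CREATE VIEW all_events AS
    SELECT 'speeding_violation' AS event_type, id, company, created_at, car_id
    FROM speeding_violation
    UNION ALL
    SELECT 'fuel_fillup', id, company, created_at, car_id
    FROM fuel_fillup;

-- Made-up sample rows.
INSERT INTO speeding_violation VALUES (1, 'acme', '2020-01-03', 95, 10, 20);
INSERT INTO fuel_fillup VALUES (1, 'acme', '2020-01-01', 10, 42.5, 7);
INSERT INTO fuel_fillup VALUES (2, 'other', '2020-01-02', 11, 30.0, 8);
""")


def events_for_company(company):
    """All events of a company across event types, oldest first, in one query."""
    return conn.execute(
        "SELECT event_type, id, created_at FROM all_events "
        "WHERE company = ? ORDER BY created_at",
        (company,),
    ).fetchall()


def events_for_car(car_id):
    """All events referencing a car, again through the single view."""
    return conn.execute(
        "SELECT event_type, id FROM all_events WHERE car_id = ? ORDER BY created_at",
        (car_id,),
    ).fetchall()
```

Adding a new event type then means adding one table and one more branch to the view, which keeps the per-type tables fully typed.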
Conclusion
Based on the listed advantages and disadvantages I am tempted to go with the 1 table per event solution, but it still doesn't seem particularly elegant to me. I am sure I am not the first one to bump into this problem and would love to hear how others have tackled similar issues.
Related
I am looking to design a database schema to compare two products. Something like this https://www.capterra.com/agile-project-management-tools-software/compare/160498-147657/Clubhouse-vs-monday-com
Here is what I am thinking for the database schema design (only products of the same category can be compared; please note that the database is MongoDB):
Categories table tagging the category of a product.
Store all the features corresponding to a category in the categories table.
In the product table, store an array with one entry per feature, where key is the feature name, value is the value of this feature in the product and category_feature_id is the feature_id in the categories table.
However, this makes the product table very tightly coupled with the categories table. Has anyone worked on such a problem before? Any pointers will be appreciated. Here is an overview of the schema:
categories collection:
name: 'String'
features: [
{
name: 'string'
parent_id: 'ObjectID' // if this is a sub feature, it references its parent within this embedded document itself
}
]
products:
name: 'String'
features: [ // Embedded document with feature values
{
name: 'String',
value: Boolean,
category_feature_id: 'ObjectID' // feature_id in the categories.features array, mainly used for comparison only
}
]
I would consider making features a separate collection, and for each category or product, have a list of feature IDs. So for example:
Features collection:
{id: XXX, name: A}, {id: YYY, name: B}
Categories collection:
{ features: [{featureId: XXX, value: C}] }
Products collection:
{ features: [{featureId: YYY, value: D}] }
This has several advantages:
Conceptually, I would argue that features are independent of both categories and products. Unless you are sure that two categories will never share a feature, then you shouldn't have duplicate definitions of a single feature. Otherwise, if you ever want to update the feature later (e.g. its name, or other attributes), it will be a pain to do so.
This makes it easy to tie features to products and/or categories without coupling so tightly to the definitions within each category.
This allows you to essentially override category features in a product, if you want, by including the same feature in a category and a specific product. You can decide what this situation means to you. But one way to define this condition is that the product definition of the feature supersedes the category definition, making for a very flexible schema.
It allows users to search for single features across categories and products. For example, in the future, you may wish to allow users to search for a specific color across multiple categories and products. Treating features as 1st class objects would allow you to do that without needing to kludge around it by translating a user request into multiple category_feature_id's.
You don't need a category_feature_id field because each feature has the same id across products and categories, so it's easy to reference between a product and a category.
Anyway, this is my recommendation. And if you add an index to the features Array in both the categories and products collections, then doing db operations like lookups, joins, filters, etc. will be very fast.
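To make the override rule concrete, here is a small sketch in plain Python, with dicts standing in for MongoDB documents. It assumes the "product supersedes category" interpretation described above; the helper effective_features, the sample documents and the second category entry are made up for illustration.

```python
# Features live in their own collection, keyed by id (placeholder ids as above).
features = {
    "XXX": {"name": "A"},
    "YYY": {"name": "B"},
}

# Hypothetical category and product documents referencing features by id.
category = {
    "name": "example category",
    "features": [
        {"featureId": "XXX", "value": "C"},
        {"featureId": "YYY", "value": "category default"},
    ],
}
product = {
    "name": "example product",
    "features": [
        {"featureId": "YYY", "value": "D"},
    ],
}


def effective_features(category, product, features):
    """Merge category and product feature values; the product value wins on conflict."""
    merged = {f["featureId"]: f["value"] for f in category["features"]}
    merged.update({f["featureId"]: f["value"] for f in product["features"]})
    # Resolve ids to display names through the single features collection.
    return {features[feature_id]["name"]: value for feature_id, value in merged.items()}


resolved = effective_features(category, product, features)
```

Here feature B is defined on both sides, so the product's value D replaces the category default, while feature A is inherited from the category unchanged.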
EDIT (to respond to your comment):
The decision to denormalize the feature name is orthogonal to the decision of where to store the feature record. Let me translate that :-)
Normalized data means you keep only one copy of any data, and then reference that data whenever you need it. This way, there is only ever one definitive source for the data, and you don't run into problems where different copies of the data end up being changed and are no longer consistent.
Under relational theory, you want to normalize data as much as possible, because it's the easiest way to maintain consistency. If you only have one place to record a customer address, for example, you'll never end up in a situation where you have two addresses and you don't know which one is the right one. However, people frequently de-normalize data for performance reasons, namely, to avoid expensive and/or frequent queries. The decision to de-normalize data must weigh the performance benefits against the costs of manually maintaining data consistency (you must now write application code to ensure that the various copies of the data stay consistent when any one of them gets updated).
That's what I mean when I say de-normalization is orthogonal to the data structure: you choose the data structure that makes the most sense to accurately represent your data. Then you selectively de-normalize it for performance reasons. Of course, you don't choose a final data structure without considering performance impact, but conceptually, they are two different goals. Does that make sense?
So let's take a look at your example. Currently, you copy the feature name from the category feature list to the product feature list. This is a denormalization, one that allows you to avoid querying the category collection every time you need to list the product. You need to balance that performance advantage against the issues with data consistency. Because now, if someone changes the name in either the product or category record, you need to have application code to manually update the corresponding record in the other collection. And if you change the name on the category side, that might entail changing hundreds of product records.
I'm assuming you thought through these trade-offs and believe the performance advantage of the de-normalization is worth it. If that's the case, then nothing prevents you from de-normalizing from a separate feature collection as well. Just copy the name from the feature collection into the category or product document. You still gain all the advantages I listed, and the performance will be no worse than your current system.
OTOH, if you haven't thought through the performance advantages, and are just following this paradigm because "noSQL doesn't do joins" then my recommendation is don't be so dogmatic! :-) You can do joins in MongoDB quite fast, just as you can denormalize data in SQL tables quite easily. These aren't hard and fast rules.
FWIW, IMHO, I think de-normalization to avoid a simple query is a case of premature optimization. Unless you have a website serving >10k product pages a second along with >1k inserts or updates / sec causing extensive locking delays, an additional read query to a features collection (especially if you're properly indexed) will add very minimal overhead. And even in those scenarios, you can optimize the queries a lot before you need to start denormalizing (e.g., in a category page showing multiple products, you can do one batch query to retrieve all the feature records in a single query).
Note: there's one way to avoid both, which is to make each feature name unique, and then use that as the key. That is, don't store the featureId, just store the feature name, and query based on that if you need additional data from the features collection. However, I strongly recommend against this. The one thing I personally am dogmatic about is that a primary key should never contain any useful information. You may think it's clever right now, but a year from now, you will be cursing your decision (e.g. what happens when you decide to internationalize the site, and each feature has multiple names? What if you want to have more extensive filters, where each feature has multiple synonyms, many of which overlap?). So I don't recommend this route. Personally, I'd rather take the minimal additional overhead of a query.
Currently scoping out a new system. Like many systems, it will be required to store documents and link them to other kinds of item. In this instance a Document object can belong to a Job or it can belong to an Item (which in turn belongs to a Job).
We could do this by having a JobId and an ItemId against a Document and leaving one or the other blank if necessary, but that's going to mean annoying conditional logic in the handling code. So, two link tables seems a better idea.
However, it is likely that we will need to link Documents to other items in the system at some point in the future. There are Company and User objects, for example, and we might want to record Documents against those. There may be more.
That would entail a proliferation of link tables which, while effective, is messy and hard to follow.
This solution is in SQL Server and will be handled in code via Entity Framework.
Are there any design principles that can allow us to hook up Document objects with a variety of other system objects as required in a neater and more flexible way?
You could store two values: the id, and the type of object to which the document is attached. It doesn't allow the use of foreign keys, but is compatible with many application development frameworks.
If you have the partitioning option then you could dedicate different partitions to different object types.
You could also have multiple tables, one for job documents, one for item documents, and get an overview of all of them with a view that UNION ALL's them together. If you need uniqueness in that result set then you could use UUIDs for the primary key, or add an extra column to the view to express from which table the row was read.
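A minimal sketch of the multiple-tables-plus-view idea follows. It uses Python's sqlite3 module purely so it can be run anywhere; the question targets SQL Server, where the CREATE VIEW ... UNION ALL part would look much the same. All table, column and view names here are assumptions, and the owner_type column is the "extra column" that tells you which table a row came from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE item (id INTEGER PRIMARY KEY, job_id INTEGER REFERENCES job(id), name TEXT);

-- One document table per owner type, each with a real foreign key.
CREATE TABLE job_document (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL REFERENCES job(id),
    title TEXT
);
CREATE TABLE item_document (
    id INTEGER PRIMARY KEY,
    item_id INTEGER NOT NULL REFERENCES item(id),
    title TEXT
);

-- Hypothetical overview view; owner_type records the source table.
CREATE VIEW all_documents AS
    SELECT 'job' AS owner_type, job_id AS owner_id, id AS document_id, title
    FROM job_document
    UNION ALL
    SELECT 'item', item_id, id, title
    FROM item_document;

-- Made-up sample rows; note both documents happen to have id 1.
INSERT INTO job VALUES (1, 'Refit');
INSERT INTO item VALUES (1, 1, 'Pump');
INSERT INTO job_document VALUES (1, 1, 'contract');
INSERT INTO item_document VALUES (1, 1, 'photo');
""")


def all_documents():
    """Every document in the system, regardless of what it is attached to."""
    return conn.execute(
        "SELECT owner_type, owner_id, document_id, title "
        "FROM all_documents ORDER BY owner_type, document_id"
    ).fetchall()


def documents_for(owner_type, owner_id):
    """Titles of the documents attached to one specific owner."""
    rows = conn.execute(
        "SELECT title FROM all_documents WHERE owner_type = ? AND owner_id = ? "
        "ORDER BY document_id",
        (owner_type, owner_id),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the two document ids collide, (owner_type, document_id) is what identifies a row in the view; UUID primary keys would remove the need for that pairing.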
I'm making a to-do list thingy in my spare time for learning etc. I'm using SQL Server Compact 3.5 along with Entity Framework for data management. It is a desktop application, meant to be used by a single person.
I have close to no knowledge with database stuff, and am focusing my energies more on the UI side of things.
I was going along merrily implementing CRUD of tasks, when I thought it would be nice to have some scheduling for the tasks. Begin task in future, repetitions daily/weekly/monthly/yearly/custom etc.
I went on to try to design my DB to accommodate this with my limited knowledge and poof, I end up with like 14 new tables. I then searched online and found posts pointing to sysschedules on MSDN. All accomplished in one table. I lowered my head in shame and tried a puny attempt to improve my design. I got it down to 10 tables while including some stuff I liked from the sysschedules table.
This is my (simplified) schema now (explanation below image):
A Task can have a SchedulingInfo associated with it.
I forced OO into this, so SchedulingInfo is an abstract type which has various 'subclasses'.
TimeOfDayToStart_Ticks represents the time to start... since I don't want to store it as a datetime.
The subclasses:
CustomSchedule: Used to allow a task to run some day, or a set of days, in the future.
IntervalSchedule: e.g. run every day, or every 3 days, or every 4 hours, etc.
Monthly/Yearly-Schedule: Set of days to run every month/year
MonthlyRelativeSchedule: I stole this from the sysschedules thing. Holds a set of days that conform to things like every second(Frequency) Saturday(DayType), or the last weekday of the month, etc. (See previously mentioned link to see full explanation).
My code will retrieve a list of ScheduleInfo, sorted by NextRun. Dequeue a ScheduleInfo, instantiate a new Task with relevant details, re-calculate NextRun based on the subclass of ScheduleInfo, save the ScheduleInfo back to the DB.
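The loop described above can be sketched in memory with a priority queue. This is only an illustration in Python, not the Entity Framework code: the field names, the run_due helper and the sample tasks are assumptions, only the IntervalSchedule subclass is modeled, and persisting back to the DB is left out. It assumes a strictly positive interval, otherwise the loop would never terminate.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(order=True)
class IntervalSchedule:
    """Schedules compare by next_run only, so a heap keeps the earliest one on top."""
    next_run: datetime
    task_name: str = field(compare=False)
    interval: timedelta = field(compare=False)

    def advance(self):
        # Each subclass would recompute NextRun differently; this one just adds the interval.
        self.next_run += self.interval


def run_due(schedules, now):
    """Pop every schedule due at or before `now`, record a task for it, and re-queue it."""
    created = []
    while schedules and schedules[0].next_run <= now:
        schedule = heapq.heappop(schedules)
        created.append((schedule.task_name, schedule.next_run))  # stands in for "new Task"
        schedule.advance()
        heapq.heappush(schedules, schedule)  # stands in for saving back to the DB
    return created


# Made-up sample data.
schedules = [
    IntervalSchedule(datetime(2024, 1, 1, 9), "water plants", timedelta(days=1)),
    IntervalSchedule(datetime(2024, 1, 1, 8), "check mail", timedelta(hours=4)),
]
heapq.heapify(schedules)
created = run_due(schedules, datetime(2024, 1, 1, 12))
```

A schedule that has fallen behind produces one task per missed run until it catches up, which may or may not be what you want for a to-do list.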
I feel weird about the number of tables. Will this affect performance if there are like thousands of entries? Or is this just like yucky design, full of bad practices or some such? Should I just use the single-table approach?
Yes, I think your table flood will have a negative impact on performance. If YearlySchedule and the other stuff are derived entities from the base entity SchedulingInformation and you have separate tables for base and derived properties you are forced to use Table-Per-Type inheritance mapping which is known to be slow. (At least up to current version 4.1 of EF. It is announced that the generated SQL for queries with TPT mapping will be improved in the next release of EF.)
In my opinion your model is a typical case for Table-Per-Hierarchy mapping because I see four derived entity tables which only have a primary key column. So, these entities add nothing to the base class (except their navigation properties) and would only force unnecessary joins in queries.
I would throw these four classes away and also the fifth - IntervalSchedule - and add its single property Interval_Ticks to the SchedulingInformation table.
The four ...Specifiers tables could all refer then with their foreign keys to the SchedulingInformation table.
So, this would result in:
Five tables: SchedulingInformation and 4 x *Specifiers
One abstract base entity: SchedulingInformation
Five derived entities: *Schedule
Four entities: *Specifier
Each of the *Schedule entities (except IntervalSchedule) has a collection of the corresponding *Specifier entity (one-to-many relationship). And you map the five *Schedule entities to the same SchedulingInformation table via Table-Per-Hierarchy inheritance mapping.
That would be my primary plan to try and test.
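At the table level, the result might look roughly like the sketch below. It is written with Python's sqlite3 module so it can be tried quickly, not with Entity Framework. The Discriminator column mirrors the column EF uses by default for Table-Per-Hierarchy; MonthlySpecifiers, its columns, the Id columns and the sample rows are assumptions, and only one of the four *Specifiers tables is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- All five *Schedule entities share this one table.
CREATE TABLE SchedulingInformation (
    Id INTEGER PRIMARY KEY,
    Discriminator TEXT NOT NULL,             -- which *Schedule subclass the row represents
    NextRun TEXT NOT NULL,                   -- ISO-8601 text, so string order is time order
    TimeOfDayToStart_Ticks INTEGER NOT NULL,
    Interval_Ticks INTEGER                   -- used by IntervalSchedule rows only, NULL otherwise
);

-- Hypothetical example of one of the four *Specifiers tables.
CREATE TABLE MonthlySpecifiers (
    Id INTEGER PRIMARY KEY,
    SchedulingInformationId INTEGER NOT NULL REFERENCES SchedulingInformation(Id),
    DayOfMonth INTEGER NOT NULL
);

-- Made-up sample rows.
INSERT INTO SchedulingInformation VALUES (1, 'IntervalSchedule', '2024-01-01T08:00:00', 0, 144000000000);
INSERT INTO SchedulingInformation VALUES (2, 'MonthlySchedule',  '2024-01-01T07:00:00', 0, NULL);
INSERT INTO SchedulingInformation VALUES (3, 'CustomSchedule',   '2024-02-01T00:00:00', 0, NULL);
INSERT INTO MonthlySpecifiers VALUES (1, 2, 1);
INSERT INTO MonthlySpecifiers VALUES (2, 2, 15);
""")


def due_schedules(now):
    """All schedules due at or before `now`, earliest first, without any joins."""
    return conn.execute(
        "SELECT Id, Discriminator FROM SchedulingInformation "
        "WHERE NextRun <= ? ORDER BY NextRun",
        (now,),
    ).fetchall()


def specifier_days(schedule_id):
    """Days of the month attached to one schedule; only needed for the matching subclass."""
    rows = conn.execute(
        "SELECT DayOfMonth FROM MonthlySpecifiers "
        "WHERE SchedulingInformationId = ? ORDER BY DayOfMonth",
        (schedule_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The list sorted by NextRun comes from a single table, and a *Specifiers table is touched only once the subclass of a dequeued row is known.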
We are working on a mapping application that uses Google Maps API to display points on a map. All points are currently fetched from a MySQL database (holding some 5M + records). Currently all entities are stored in separate tables with attributes representing individual properties.
This presents the following problems:
Every time there's a new property we have to make changes in the database, application code and the front-end. This is all fine but some properties have to be added for all entities so that's when it becomes a nightmare to go through 50+ different tables and add new properties.
There's no way to find all entities which share any given property e.g. no way to find all schools/colleges or universities that have a geography dept (without querying schools,uni's and colleges separately).
Removing a property is equally painful.
No standards for defining properties in individual tables. Same property can exist with different name or data type in another table.
No way to link or group points based on their properties (somehow related to point 2).
We are thinking of redesigning the whole database, but without a DBA's help and with no professional DB design experience we are really struggling.
Another problem we're facing with the new design is that there are a lot of shared attributes/properties between entities.
For example:
An entity called "university" has 100+ attributes. Other entities (e.g. hospitals,banks,etc) share quite a few attributes with universities for example atm machines, parking, cafeteria etc etc.
We don't really want to have properties in a separate table [and then link them back to entities w/ foreign keys] as it will require us to add/remove them manually. Also, generalizing properties will result in groups containing 50+ attributes. Not all records (i.e. entities) require those properties.
So with keeping that in mind here's what we are thinking about the new design:
Have separate tables for each entity containing some basic info e.g. id,name,etc etc.
Have 2 tables attribute type and attribute to store properties information.
Link each entity (or a table if you like) to attribute using a many-to-many relation.
Store addresses in different table called addresses link entities via foreign keys.
We think this will allow us to be more flexible when adding, removing or querying on attributes.
This design, however, will result in an increased number of joins when fetching data, e.g. to display all "attributes" for a given university we might have a query with 20+ joins to fetch all related attributes in a single row.
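The 20+ joins only arise if every attribute has to come back as its own column. An alternative is to fetch one row per attribute with a single join and fold the rows into one record in application code. The sketch below shows this with Python's sqlite3 module (the real system is MySQL); the link table university_attribute, every column name and the sample data are assumptions about the proposed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE university (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE attribute_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE attribute (
    id INTEGER PRIMARY KEY,
    attribute_type_id INTEGER NOT NULL REFERENCES attribute_type(id),
    name TEXT NOT NULL
);
-- Hypothetical many-to-many link between one entity table and attribute.
CREATE TABLE university_attribute (
    university_id INTEGER NOT NULL REFERENCES university(id),
    attribute_id INTEGER NOT NULL REFERENCES attribute(id),
    value TEXT,
    PRIMARY KEY (university_id, attribute_id)
);

-- Made-up sample rows.
INSERT INTO university VALUES (1, 'Example University');
INSERT INTO university VALUES (2, 'Other University');
INSERT INTO attribute_type VALUES (1, 'facility');
INSERT INTO attribute VALUES (1, 1, 'parking');
INSERT INTO attribute VALUES (2, 1, 'cafeteria');
INSERT INTO university_attribute VALUES (1, 1, 'yes');
INSERT INTO university_attribute VALUES (1, 2, '2');
INSERT INTO university_attribute VALUES (2, 1, 'no');
""")


def attributes_of(university_id):
    """All attributes of one university as a dict, using one join instead of one per attribute."""
    rows = conn.execute(
        """
        SELECT a.name, ua.value
        FROM university_attribute AS ua
        JOIN attribute AS a ON a.id = ua.attribute_id
        WHERE ua.university_id = ?
        """,
        (university_id,),
    ).fetchall()
    return dict(rows)


def universities_with(attribute_name):
    """Ids of universities that have a given attribute recorded at all."""
    rows = conn.execute(
        """
        SELECT ua.university_id
        FROM university_attribute AS ua
        JOIN attribute AS a ON a.id = ua.attribute_id
        WHERE a.name = ?
        ORDER BY ua.university_id
        """,
        (attribute_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

The cost is that values are stored untyped (TEXT here) and the "single row" is assembled outside the database.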
We desperately need to know some opinions or possible flaws in this design approach.
Thanks for your time.
In trying to generalize your question without more specific examples, it's hard to truly critique your approach. If you'd like some more in depth analysis, try whipping up an ER diagram.
If your data model is changing so much that you're constantly adding/removing properties and many of these properties overlap, you might be better off using EAV.
Otherwise, if you want to maintain a relational approach but are finding a lot of overlap with properties, you can analyze the entities and look for abstractions that link to them.
Ex) My Db has Puppies, Kittens, and Walruses all with a hasFur and furColor attribute. Remove those attributes from the 3 tables and create a FurryAnimal table that links to each of those 3.
Of course, the simplest answer is to not touch the data model. Instead, create Views on the underlying tables that you can use to address (5), (4) and (2).
1 cannot be an issue. There is one place where your objects are defined. Everything else is generated/derived from that. Just refactor your code until this is the case.
2 is solved by having a metamodel, where you describe which properties are where. This is probably needed for 1 too.
You might want to totally avoid the problem by programming this in Smalltalk with Seaside on a Gemstone object oriented database. Then you can just have objects with collections and don't need so many joins.
This is a scenario I've seen in multiple places over the years; I'm wondering if anyone else has run across a better solution than I have...
My company sells a relatively small number of products, however the products we sell are highly specialized (i.e. in order to select a given product, a significant number of details must be provided about it). The problem is that while the amount of detail required to choose a given product is relatively constant, the kinds of details required vary greatly between products. For instance:
Product X might have identifying characteristics like (hypothetically)
'Color',
'Material'
'Mean Time to Failure'
but Product Y might have characteristics
'Thickness',
'Diameter'
'Power Source'
The problem (one of them, anyway) in creating an order system that utilizes both Product X and Product Y is that an Order Line has to refer, at some point, to what it is "selling". Since Product X and Product Y are defined in two different tables - and denormalization of products using a wide table scheme is not an option (the product definitions are quite deep) - it's difficult to see a clear way to define the Order Line in such a way that order entry, editing and reporting are practical.
Things I've Tried In the Past
Create a parent table called 'Product' with columns common to Product X and Product Y, then using 'Product' as the reference for the OrderLine table, and creating a FK relationship with 'Product' as the primary side between the tables for Product X and Product Y. This basically places the 'Product' table as the parent of both OrderLine and all the disparate product tables (e.g. Products X and Y). It works fine for order entry, but causes problems with order reporting or editing since the 'Product' record has to track what kind of product it is in order to determine how to join 'Product' to its more detailed child, Product X or Product Y. Advantages: key relationships are preserved. Disadvantages: reporting, editing at the order line/product level.
Create 'Product Type' and 'Product Key' columns at the Order Line level, then use some CASE logic or views to determine the customized product to which the line refers. This is similar to item (1), without the common 'Product' table. I consider it a more "quick and dirty" solution, since it completely does away with foreign keys between order lines and their product definitions. Advantages: quick solution. Disadvantages: same as item (1), plus lost RI.
Homogenize the product definitions by creating a common header table and using key/value pairs for the customized attributes (OrderLine [n] <- [1] Product [1] <- [n] ProductAttribute). Advantages: key relationships are preserved; no ambiguity about product definition. Disadvantages: reporting (retrieving a list of products with their attributes, for instance), data typing of attribute values, performance (fetching product attributes, inserting or updating product attributes etc.)
If anyone else has tried a different strategy with more success, I'd sure like to hear about it.
Thank you.
The first solution you describe is the best if you want to maintain data integrity, and if you have relatively few product types and seldom add new product types. This is the design I'd choose in your situation. Reporting is complex only if your reports need the product-specific attributes. If your reports need only the attributes in the common Products table, it's fine.
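A rough sketch of that first design, a common parent table plus one child table per product type, is below. It uses Python's sqlite3 module for convenience; the column names (taken loosely from the example characteristics in the question), the product_type discriminator and the sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked to
conn.executescript("""
CREATE TABLE Product (
    product_id INTEGER PRIMARY KEY,
    product_type TEXT NOT NULL CHECK (product_type IN ('X', 'Y')),
    name TEXT NOT NULL
);
-- Each child table shares its primary key with the parent row.
CREATE TABLE ProductX (
    product_id INTEGER PRIMARY KEY REFERENCES Product(product_id),
    color TEXT,
    material TEXT
);
CREATE TABLE ProductY (
    product_id INTEGER PRIMARY KEY REFERENCES Product(product_id),
    thickness REAL,
    diameter REAL
);
-- Order lines reference only the parent, so the foreign key is a real one.
CREATE TABLE OrderLine (
    order_line_id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES Product(product_id),
    quantity INTEGER NOT NULL
);

-- Made-up sample rows.
INSERT INTO Product VALUES (1, 'X', 'Widget');
INSERT INTO Product VALUES (2, 'Y', 'Gasket');
INSERT INTO ProductX VALUES (1, 'red', 'steel');
INSERT INTO ProductY VALUES (2, 0.5, 2.5);
INSERT INTO OrderLine VALUES (1, 1, 3);
INSERT INTO OrderLine VALUES (2, 2, 10);
""")


def simple_report():
    """A report that needs only the common Product columns: a single plain join."""
    return conn.execute(
        "SELECT ol.order_line_id, p.name, ol.quantity "
        "FROM OrderLine AS ol JOIN Product AS p ON p.product_id = ol.product_id "
        "ORDER BY ol.order_line_id"
    ).fetchall()


def detailed_report():
    """A report that needs product-specific columns: one LEFT JOIN per product type."""
    return conn.execute(
        """
        SELECT ol.order_line_id, p.name, p.product_type, x.color, y.diameter
        FROM OrderLine AS ol
        JOIN Product AS p ON p.product_id = ol.product_id
        LEFT JOIN ProductX AS x ON x.product_id = p.product_id
        LEFT JOIN ProductY AS y ON y.product_id = p.product_id
        ORDER BY ol.order_line_id
        """
    ).fetchall()
```

The detailed report grows by one LEFT JOIN per product type, which is why this stays manageable only while the number of types is small.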
The second solution you describe is called "Polymorphic Associations" and it's no good. Your "foreign key" isn't a real foreign key, so you can't use a DRI constraint to ensure data integrity. OO polymorphism doesn't have an analog in the relational model.
The third solution you describe, involving storing an attribute name as a string, is a design called "Entity-Attribute-Value" and you can tell this is a painful and expensive solution. There's no way to ensure data integrity, no way to make one attribute NOT NULL, no way to make sure a given product has a certain set of attributes. No way to restrict one attribute against a lookup table. Many types of aggregate queries become impossible to do in SQL, so you have to write lots of application code to do reports. Use the EAV design only if you must, for instance if you have an unlimited number of product types, the list of attributes may be different on every row, and your schema must accommodate new product types frequently, without code or schema changes.
Another solution is "Single-Table Inheritance." This uses an extremely wide table with a column for every attribute of every product. Leave NULLs in columns that are irrelevant to the product on a given row. This effectively means you can't declare an attribute as NOT NULL (unless it's in the group common to all products). Also, most RDBMS products have a limit on the number of columns in a single table, or the overall width in bytes of a row. So you're limited in the number of product types you can represent this way.
Hybrid solutions exist, for instance you can store common attributes normally, in columns, but product-specific attributes in an Entity-Attribute-Value table. Or you could store product-specific attributes in some other structured way, like XML or YAML, in a BLOB column of the Products table. But these hybrid solutions suffer because now some attributes must be fetched in a different way.
The ultimate solution for situations like this is to use a semantic data model, using RDF instead of a relational database. This shares some characteristics with EAV but it's much more ambitious. All metadata is stored in the same way as data, so every object is self-describing and you can query the list of attributes for a given product just as you would query data. Special products exist, such as Jena or Sesame, implementing this data model and a special query language that is different than SQL.
There's no magic bullet that you've overlooked.
You have what are sometimes called "disjoint subclasses". There's the superclass (Product) with two subclasses (ProductX) and (ProductY). This is a problem that -- for relational databases -- is Really Hard. [Another hard problem is Bill of Materials. Another hard problem is Graphs of Nodes and Arcs.]
You really want polymorphism, where OrderLine is linked to a subclass of Product, but doesn't know (or care) which specific subclass.
You don't have too many choices for modeling. You've pretty much identified the bad features of each. This is pretty much the whole universe of choices.
Push everything up to the superclass. That's the uni-table approach where you have Product with a discriminator (type="X" and type="Y") and a million columns. The columns of Product are the union of columns in ProductX and ProductY. There will be nulls all over the place because of unused columns.
Push everything down into the subclasses. In this case, you'll need a view which is the union of ProductX and ProductY. That view is what's joined to create a complete order. This is like the first solution, except it's built dynamically and doesn't optimize well.
Join Superclass instance to subclass instance. In this case, the Product table is the intersection of ProductX and ProductY columns. Each Product has a reference to a key either in ProductX or ProductY.
There isn't really a bold new direction. In the relational database world-view, those are the choices.
If, however, you elect to change the way you build application software, you can get out of this trap. If the application is object-oriented, you can do everything with first-class, polymorphic objects. You have to map from the kind-of-clunky relational processing; this happens twice: once when you fetch stuff from the database to create objects and once when you persist objects back to the database.
The advantage is that you can describe your processing succinctly and correctly. As objects, with subclass relationships.
The disadvantage is that your SQL devolves to simplistic bulk fetches, updates and inserts.
This becomes an advantage when the SQL is isolated into an ORM layer and managed as a kind of trivial implementation detail. Java programmers use iBatis (or Hibernate or TopLink or Cocoon), Python programmers use SQLAlchemy or SQLObject. The ORM does the database fetches and saves; your application directly manipulates Orders, Lines and Products.
This might get you started. It will need some refinement.
Table Product ( id PK, name, price, units_per_package)
Table Product_Attribs (id FK ref Product, AttribName, AttribValue)
Which would allow you to attach a list of attributes to the products. -- This is essentially your option 3
If you know a max number of attributes, You could go
Table Product (id PK, name, price, units_per_package, attrName_1, attrValue_1 ...)
Which would of course de-normalize the database, but make queries easier.
I prefer the first option because
It supports an arbitrary number of attributes.
Attribute names can be stored in another table, and referential integrity enforced so that those damn Canadians don't stick a "colour" in there and break reporting.
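Here is a quick sketch of the first option with the attribute names held in a lookup table, using Python's sqlite3 module. Product and Product_Attribs follow the tables above; the Attrib_Names table, the add_attribute helper and the sample rows are assumptions. SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, so that line matters here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE Product (
    id INTEGER PRIMARY KEY,
    name TEXT,
    price REAL,
    units_per_package INTEGER
);
-- Hypothetical lookup table holding the only attribute names that are allowed.
CREATE TABLE Attrib_Names (AttribName TEXT PRIMARY KEY);
CREATE TABLE Product_Attribs (
    id INTEGER NOT NULL REFERENCES Product(id),
    AttribName TEXT NOT NULL REFERENCES Attrib_Names(AttribName),
    AttribValue TEXT,
    PRIMARY KEY (id, AttribName)
);

-- Made-up sample rows.
INSERT INTO Product VALUES (1, 'Widget', 9.99, 12);
INSERT INTO Attrib_Names VALUES ('color');
INSERT INTO Attrib_Names VALUES ('material');
""")


def add_attribute(product_id, name, value):
    """Attach an attribute to a product; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Product_Attribs VALUES (?, ?, ?)",
                (product_id, name, value),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def attributes_of(product_id):
    """All attributes of a product as a dict."""
    rows = conn.execute(
        "SELECT AttribName, AttribValue FROM Product_Attribs WHERE id = ?",
        (product_id,),
    ).fetchall()
    return dict(rows)
```

With this in place add_attribute(1, "color", "red") succeeds, while add_attribute(1, "colour", "red") is rejected by the foreign key because that spelling is not in Attrib_Names.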
Does your product line ever change?
If it does, then creating a table per product will cost you dearly, and the key/value pairs idea will serve you well. That's the kind of direction down which I am naturally drawn.
I would create tables like this:
Attribute(attribute_id, description, is_listed)
-- contains values like "colour", "width", "power source", etc.
-- "is_listed" tells us if we can get a list of valid values:
AttributeValue(attribute_id, value)
-- lists of valid values for different attributes.
Product (product_id, description)
ProductAttribute (product_id, attribute_id)
-- tells us which attributes apply to which products
Order (order_id, etc)
OrderLine (order_id, order_line_id, product_id)
OrderLineProductAttributeValue (order_line_id, attribute_id, value)
-- tells us things like: order line 999 has "colour" of "blue"
The SQL to pull this together is not trivial, but it's not too complex either... and most of it will be write once and keep (either in stored procedures or your data access layer).
We do similar things with a number of types of entity.
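To give an idea of what "not trivial, but not too complex" means in practice, here is a sketch of those tables and one query that pulls an order line together, using Python's sqlite3 module. The primary keys, the quoting of "Order" (a reserved word), the two helper functions and the sample rows are assumptions layered on the table list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Attribute (
    attribute_id INTEGER PRIMARY KEY, description TEXT NOT NULL, is_listed INTEGER NOT NULL
);
CREATE TABLE AttributeValue (
    attribute_id INTEGER REFERENCES Attribute(attribute_id), value TEXT,
    PRIMARY KEY (attribute_id, value)
);
CREATE TABLE Product (product_id INTEGER PRIMARY KEY, description TEXT NOT NULL);
CREATE TABLE ProductAttribute (
    product_id INTEGER REFERENCES Product(product_id),
    attribute_id INTEGER REFERENCES Attribute(attribute_id),
    PRIMARY KEY (product_id, attribute_id)
);
CREATE TABLE "Order" (order_id INTEGER PRIMARY KEY);
CREATE TABLE OrderLine (
    order_line_id INTEGER PRIMARY KEY,
    order_id INTEGER REFERENCES "Order"(order_id),
    product_id INTEGER REFERENCES Product(product_id)
);
CREATE TABLE OrderLineProductAttributeValue (
    order_line_id INTEGER REFERENCES OrderLine(order_line_id),
    attribute_id INTEGER REFERENCES Attribute(attribute_id),
    value TEXT,
    PRIMARY KEY (order_line_id, attribute_id)
);

-- Made-up sample rows: order line 999 has "colour" of "blue".
INSERT INTO Attribute VALUES (1, 'colour', 1);
INSERT INTO Attribute VALUES (2, 'width', 0);
INSERT INTO AttributeValue VALUES (1, 'blue');
INSERT INTO AttributeValue VALUES (1, 'red');
INSERT INTO Product VALUES (1, 'widget');
INSERT INTO ProductAttribute VALUES (1, 1);
INSERT INTO ProductAttribute VALUES (1, 2);
INSERT INTO "Order" VALUES (500);
INSERT INTO OrderLine VALUES (999, 500, 1);
INSERT INTO OrderLineProductAttributeValue VALUES (999, 1, 'blue');
INSERT INTO OrderLineProductAttributeValue VALUES (999, 2, '30');
""")


def describe_order_line(order_line_id):
    """Product description plus every applicable attribute and its chosen value (or None)."""
    return conn.execute(
        """
        SELECT p.description, a.description, v.value
        FROM OrderLine AS ol
        JOIN Product AS p ON p.product_id = ol.product_id
        JOIN ProductAttribute AS pa ON pa.product_id = p.product_id
        JOIN Attribute AS a ON a.attribute_id = pa.attribute_id
        LEFT JOIN OrderLineProductAttributeValue AS v
               ON v.order_line_id = ol.order_line_id AND v.attribute_id = a.attribute_id
        WHERE ol.order_line_id = ?
        ORDER BY a.attribute_id
        """,
        (order_line_id,),
    ).fetchall()


def is_valid_value(attribute_id, value):
    """Listed attributes must use a value from AttributeValue; unlisted ones accept anything."""
    (is_listed,) = conn.execute(
        "SELECT is_listed FROM Attribute WHERE attribute_id = ?", (attribute_id,)
    ).fetchone()
    if not is_listed:
        return True
    found = conn.execute(
        "SELECT 1 FROM AttributeValue WHERE attribute_id = ? AND value = ?",
        (attribute_id, value),
    ).fetchone()
    return found is not None
```

The validation in is_valid_value lives in application code (or a stored procedure), which is the "write once and keep" part mentioned above.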
Chris and AJ: Thanks for your responses. The product line may change, but I would not term it "volatile".
The reason I dislike the third option is that it comes at the cost of metadata for the product attribute values. It essentially turns columns into rows, losing most of the advantages of the database column in the process (data type, default value, constraints, foreign key relationships etc.)
I've actually been involved in a past project where the product definition was done in this way. We essentially created a full product/product attribute definition system (data types, min/max occurrences, default values, 'required' flags, usage scenarios etc.) The system worked, ultimately, but came with a significant cost in overhead and performance (e.g. materialized views to visualize products, custom "smart" components to represent and validate data entry UI for product definition, another "smart" component to represent the product instance's customizable attributes on the order line, blahblahblah).
Again, thanks for your replies!