Is there a pattern to avoid ever-multiplying link tables in database design? - database

Currently scoping out a new system. Like many systems, it will be required to store documents and link them to other kinds of item. In this instance a Document object can belong to a Job or it can belong to an Item (which in turn belongs to a Job).
We could do this by having a JobId and an ItemId against a Document and leaving one or the other blank if necessary, but that's going to mean annoying conditional logic in the handling code. So, two link tables seems a better idea.
However, it is likely that we will need to link Documents to other items in the system at some point in the future. There are Company and User objects, for example, and we might want to record Documents against those. There may be more.
That would entail a proliferation of link tables which, while effective, is messy and hard to follow.
This solution is in SQL Server and will be handled in code via Entity Framework.
Are there any design principles that can allow us to hook up Document objects with a variety of other system objects as required in a neater and more flexible way?

You could store two values: the id, and the type of object to which the document is attached. It doesn't allow the use of foreign keys, but is compatible with many application development frameworks.
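A minimal sketch of that idea in SQL Server terms (the table and column names here are assumptions for illustration, not part of the question):
CREATE TABLE Document (
    DocumentId INT IDENTITY PRIMARY KEY,
    FileName   NVARCHAR(260) NOT NULL,
    OwnerType  VARCHAR(20) NOT NULL,  -- e.g. 'Job', 'Item', later 'Company' or 'User'
    OwnerId    INT NOT NULL           -- id of the owning row; no foreign key can be declared here
);

-- All documents attached to Job 42:
-- SELECT * FROM Document WHERE OwnerType = 'Job' AND OwnerId = 42;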
If you have the partitioning option then you could dedicate different partitions to different object types.
You could also have multiple tables, one for job documents, one for item documents, and get an overview of all of them with a view that UNION ALL's them together. If you need uniqueness in that result set then you could use UUIDs for the primary key, or add an extra column to the view to express from which table the row was read.
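As a rough sketch of the multiple-table variant, assuming two link tables JobDocument(JobId, DocumentId) and ItemDocument(ItemId, DocumentId), the view with the extra origin column could look like this:
CREATE VIEW DocumentLink AS
SELECT DocumentId, JobId  AS OwnerId, 'Job'  AS OwnerType FROM JobDocument
UNION ALL
SELECT DocumentId, ItemId AS OwnerId, 'Item' AS OwnerType FROM ItemDocument;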

Related

Database schema design for product comparison

I am looking to design a database schema to compare two products. Something like this https://www.capterra.com/agile-project-management-tools-software/compare/160498-147657/Clubhouse-vs-monday-com
Here is what I am thinking for the database schema design (only products of the same category can be compared; please note that the database is MongoDB):
Categories table tagging the category of a product.
Store all the features corresponding to a category in the categories table.
In the product table, store an array with one entry per feature, where the key is the feature name, the value is the value of this feature in the product, and category_feature_id is the feature_id in the categories table.
However, this makes the product table very tightly coupled with the categories table. Has anyone worked on such a problem before? Any pointers will be appreciated. Here is an overview of the schema:
categories collection:
name: 'String'
features: [
{
name: 'string'
parent_id: 'ObjectID' // if this is a sub feature it will reference within this embedded document itself
}
]
products:
name: 'String'
features: [ // Embedded document with feature values
{
name: 'String',
value: Boolean,
category_feature_id: 'ObjectID' // feature_id in the categories.features table, mainly used for comparison only
}
]
I would consider making features a separate collection, and for each category or product, have a list of feature IDs. So for example:
Features collection:
{id: XXX, name: A}, {id: YYY, name: B}
Categories collection:
{ features: [{featureId: XXX, value: C}] }
Products collection:
{ features: [{featureId: YYY, value: D}] }
This has several advantages:
Conceptually, I would argue that features are independent of both categories and products. Unless you are sure that two categories will never share a feature, you shouldn't have duplicate definitions of a single feature; otherwise, if you ever want to update the feature later (e.g. its name, or other attributes), it will be a pain to do so.
This makes it easy to tie features to products and/or categories without coupling so tightly to the definitions within each category.
This allows you to essentially override category features in a product, if you want, by including the same feature in a category and a specific product. You can decide what this situation means to you, but one way to define this condition is that the product definition of the feature supersedes the category definition, making for a very flexible schema.
It allows users to search for single features across categories and products. For example, in the future, you may wish to allow users to search for a specific color across multiple categories and products. Treating features as first-class objects would allow you to do that without needing to kludge around it by translating a user request into multiple category_feature_id's.
You don't need a category_feature_id field, because each feature has the same id across products and categories, so it's easy to reference between a product and a category.
Anyway, this is my recommendation. And if you add an index to the features Array in both the categories and products collections, then doing db operations like lookups, joins, filters, etc. will be very fast.
EDIT (to respond to your comment):
The decision to denormalize the feature name is orthogonal to the decision of where to store the feature record. Let me translate that :-)
Normalized data means you keep only one copy of any data, and then reference that data whenever you need it. This way, there is only ever one definitive source for the data, and you don't run into problems where different copies of the data end up being changed and are no longer consistent.
Under relational theory, you want to normalize data as much as possible, because it's the easiest way to maintain consistency. If you only have one place to record a customer address, for example, you'll never end up in a situation where you have two addresses and you don't know which one is the right one. However, people frequently de-normalize data for performance reasons, namely, to avoid expensive and/or frequent queries. The decision to de-normalize data must weigh the performance benefits against the costs of manually maintaining data consistency (you must now write application code to ensure that the various copies of the data stay consistent when any one of them gets updated).
That's what I mean by de-normalization is orthogonal to the data structure: you choose the data structure that makes the most sense to accurately represent your data. Then you selectively de-normalize it for performance reasons. Of course, you don't choose a final data structure without considering performance impact, but conceptually, they are two different goals. Does that make sense?
So let's take a look at your example. Currently, you copy the feature name from the category feature list to the product feature list. This is a denormalization, one that allows you to avoid querying the category collection every time you need to list the product. You need to balance that performance advantage against the issues with data consistency. Because now, if someone changes the name in either the product or category record, you need application code to manually update the corresponding record in the other collection. And if you change the name on the category side, that might entail changing hundreds of product records.
I'm assuming you thought through these trade-offs and believe the performance advantage of the de-normalization is worth it. If that's the case, then nothing prevents you from de-normalizing from a separate feature collection as well. Just copy the name from the feature collection into the category or product document. You still gain all the advantages I listed, and the performance will be no worse than your current system.
OTOH, if you haven't thought through the performance advantages, and are just following this paradigm because "noSQL doesn't do joins" then my recommendation is don't be so dogmatic! :-) You can do joins in MongoDB quite fast, just as you can denormalize data in SQL tables quite easily. These aren't hard and fast rules.
FWIW, IMHO, I think de-normalization to avoid a simple query is a case of premature optimization. Unless you have a website serving >10k product pages a second along with >1k inserts or updates / sec causing extensive locking delays, an additional read query to a features collection (especially if you're properly indexed) will add very minimal overhead. And even in those scenarios, you can optimize the queries a lot before you need to start denormalizing (e.g., in a category page showing multiple products, you can do one batch query to retrieve all the feature records in a single query).
Note: there's one way to avoid both, which is to make each feature name unique, and then use that as the key. That is, don't store the featureId, just store the feature name, and query based on that if you need additional data from the features collection. However, I strongly recommend against this. The one thing I personally am dogmatic about is that a primary key should never contain any useful information. You may think it's clever right now, but a year from now, you will be cursing your decision (e.g. what happens when you decide to internationalize the site, and each feature has multiple names? What if you want to have more extensive filters, where each feature has multiple synonyms, many of which overlap?). So I don't recommend this route. Personally, I'd rather take the minimal additional overhead of a query.

Designing a database with similar, but different Models

I have a system whereby you can create documents. You select the document type to create and a form is displayed. Data is then added to the form, and the document can be generated. In Laravel things are done via Models. I am creating a new Model for each document, but I don't think this is the best way. An example of my database:
So at the heart of it are projects. I create a new project; I can now create documents for this project. When I select project brief from a select box, a form is displayed whereby I can input :
Project roles
Project Data
Deliverables
Budget
It's three text fields and a standard input field. If I select reporting doc from the select menu, I have to input the data for this document (which is a couple of normal inputs, a couple of text fields, and a date). Although they are both documents, they expect different data (which is why I have created a Model for each document).
The problems: As seen in the diagram, I want to allow supporting documents to be uploaded alongside a document which is generated. I have a doc_upload table for this. So a document can have one or more doc_uploads.
Going back to the MVC structure, in my DocUpload model I can't say that DocUpload belongs to both ProjectBriefDoc and ProjectReportingDoc because it can only belong to one Model. So not only am I going to create a new model for every single document, I will have to create a new Upload model for each document as well. As more documents are added, I can see this becoming a nightmare to manage.
I am after a more generic Model which can handle different types of documents. My question relates to the different types of data I need to capture for each document, and how I can fit this into my design.
I have a design that can work, but I think it is a bad idea. I am looking for advice to improve this design, taking into account that each document requires different input, and each document will need to allow for file uploads.
You don't need to have a table/Model for each document type you'll create.
A more flexible approach would be to have a project_documents table, where you'll have a project_id and some data related to it, and then a doc_uploads table related to the project_documents table.
This way a project can have as many documents as your business will ever need, and each document can have as many files as it needs.
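A minimal sketch of that layout (the column names here are assumptions, not taken from the question):
CREATE TABLE project_documents (
    id         INT PRIMARY KEY,
    project_id INT NOT NULL REFERENCES projects(id),
    doc_type   VARCHAR(50) NOT NULL,  -- e.g. 'project_brief', 'reporting_doc'
    data       TEXT                   -- the document's form data
);

CREATE TABLE doc_uploads (
    id                  INT PRIMARY KEY,
    project_document_id INT NOT NULL REFERENCES project_documents(id),
    file_path           VARCHAR(255) NOT NULL
);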
You could try something like that:
If you still want to keep both tables, your doc_upload table in your example can have two foreign keys and two belongsTo() Laravel Model declarations without conflicts (it's not a marriage, it's an open relationship).
Or you could use Polymorphic Relations to do the same thing, but it's an anti-pattern of Database Design (because it will not ensure data integrity at the database level).
For a good reference about Database Design, google for "Bill Karwin" and "SQL Antipatterns".
This guy has a very good Slideshare presentation and a book written about this topic - he used to be an active SO user as well.
OK, I have a suggestion: you don't have to have such tight coupling on the doc_upload references. You can treat this as a stand-alone table in your model that is not pegged to a single entity. You can still use the ORM to CRUD your way through and manage this table.
What I would do is keep the doc_upload table and use it for all upload references for all documents, no matter which table/model the document resides in, and have the following fields in the doc_upload table:
documenttype (which can be the object name of the target document object)
documentid_fk (this is now the generic key to a single row in the appropriate document type table(s))
So given a document in a given table (you can derive the documenttype based on the model object), and knowing the id of the document itself because you just pulled it from the db context, you should be able to pull all related rows in the doc_upload table that match those two values.
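A sketch of that table and the lookup it enables (documenttype and documentid_fk are as described above; the other columns and values are assumptions):
CREATE TABLE doc_upload (
    id            INT PRIMARY KEY,
    documenttype  VARCHAR(50) NOT NULL,  -- model/object name of the target document
    documentid_fk INT NOT NULL,          -- id of the row in that document type's table
    file_path     VARCHAR(255) NOT NULL
);

-- All uploads attached to the ProjectBriefDoc with id 7:
SELECT * FROM doc_upload WHERE documenttype = 'ProjectBriefDoc' AND documentid_fk = 7;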
You may be able to use reflection in your model to know which Entity (doc type) you are in, and the key is just the key.
You will still have to create a new model Entity for each flavor of project document you wish to have, but that may not be too difficult if the rate of change is small.
You should be able to write a minimal amount of code to pull all related uploaded documents into your app.
You may use inheritance via a zero-or-one relation in the data model design.
IMO an abstract entity (table) called project-document, containing the shared properties of all documents, will serve you well.
project-brief, project-report and other types of documents will be children of the project-document table, each having a zero-or-one relation to it; the primary key of project-document will be both the foreign key and the primary key of the children.
Now a one-to-many relation between project-document and doc-upload will solve the problem.
I also suggest adding a unique constraint {project_id, doc_type} inside project-document as a cardinality check (if necessary).
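A rough SQL sketch of that shape (the brief's fields come from the question; everything else is assumed):
CREATE TABLE project_document (
    id         INT PRIMARY KEY,
    project_id INT NOT NULL REFERENCES projects(id),
    doc_type   VARCHAR(30) NOT NULL,
    UNIQUE (project_id, doc_type)          -- the suggested cardinality check
);

CREATE TABLE project_brief (
    id            INT PRIMARY KEY REFERENCES project_document(id),  -- PK doubles as FK to the parent
    project_roles TEXT,
    project_data  TEXT,
    deliverables  TEXT,
    budget        DECIMAL(12,2)
);

CREATE TABLE doc_upload (
    id                  INT PRIMARY KEY,
    project_document_id INT NOT NULL REFERENCES project_document(id),
    file_path           VARCHAR(255) NOT NULL
);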
As other answers are sort of alluding to, you probably don't want to have a different Model for different documents, but rather a single Model for "document" with different views on it for your different processes. Laravel seems to have a good "templating" system for implementing views:
http://laravel.com/docs/5.1/blade
http://daylerees.com/codebright-blade/

Database Design to handle newsfeed for different activities

I am going to create a new project where I need users to view their friends' activities and actions, just like Facebook and LinkedIn.
Each user is allowed to do 5 different types of activities, and each activity has different attributes; for example, activity X can be public/private, while activity Y will be assigned to categories. Some actions involve 1 user, others 2 or 3, etc. Eventually I have to aggregate all these 5 different types of activities on the newsfeed page.
How can I design a database that is efficient?
I have 3 designs in mind, please let me know your thoughts. Any new ideas will be greatly appreciated!
1- Separate tables: since there are nearly 3-4 different columns for each activity, it would be logical to separate each activity into its own table.
Pros: Clean database, and easy to develop.
Cons: We will need to query the database 5 times and aggregate the results to make a single newsfeed page.
2- One big table: This table will hold all activities with many unused columns. A new numeric column called "type" will be added to indicate the type of activity. Some attributes could be combined in an HStore field (since we are using Postgres); others will be queried a lot, so I don't think it is a good idea to include them in an HStore field.
Pros: Easy to pull the newsfeed.
Cons: Lots of reads/writes on the same table; the code will be a bit messier, and so will the database.
3- Hybrid: A solution would be to make one table containing all the newsfeed, with a polymorphic association to other tables that contain details of each specific activity.
Pros: Tidy code and database, easy to add new activities.
Cons: JOIN ALL THE TABLES to make a single newsfeed! Still better than making 5 different queries.
As I am writing this post I am starting to lean towards solution number 2 (a rough sketch follows below). Please advise!
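A minimal Postgres sketch of option 2, one table with a type column plus an hstore for the rarely queried, activity-specific attributes; all names here are illustrative assumptions:
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE activities (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    type        SMALLINT NOT NULL,     -- which of the 5 activity types this row is
    is_public   BOOLEAN,               -- frequently queried attributes stay as real columns
    category_id BIGINT,
    extras      HSTORE,                -- rarely queried, activity-specific attributes
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Newsfeed for a set of friends, newest first:
-- SELECT * FROM activities WHERE user_id IN (1, 2, 3) ORDER BY created_at DESC LIMIT 50;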
Thanks
I would consider a graph database for this, such as Neo4j. It allows very flexible attributes on either nodes (users) or links (types of relations).
For small sets and few joins, SQL databases are faster and more appropriate. But if your starting point is a 5-table join, graph databases seem simpler and offer similar performance (if not better).

Database Is-a relationship

My problem relates to DB schema development and is as follows.
I am developing a purchasing module which I want to use for purchasing items and SERVICES.
Following is my EER diagram (note that service has very few specialized attributes – max 2).
My question is whether to keep products and services in two tables or just in one table.
One table option –
Reduces complexity, as I will only need to specify an item id which refers to the item table, which will have an "item_type" field to identify whether it's a product or a service.
Two table option –
Will have to refer to the separate product or service table everywhere I want to reference them, and will have to keep an "item_type" field in every table which refers to either a product or a service.
Currently planning to use option 1, but want to know expert opinion on this matter. Highly appreciate your time and advice. Thanks.
I'd certainly go with the "two tables" option. You see, you have to distinguish Products and Services, so you may either use switch(item_type) { ... } in your program or entirely distinct code paths for Product and for Service. And if a need for updating the DB schema arises, the switch is harder to maintain.
The second reason is NULLs. I'd advise avoiding them as much as you can: they create more problems than they solve. With two tables you can declare all fields non-NULL and forget about NULL-processing. With the one table option, you have to manually write code to ensure that if item_type=product, then Product-specific fields are not NULL and Service-specific ones are, and that if item_type=service, then Service-specific fields are not NULL and Product-specific ones are. That's not pleasant work, and the DBMS can't do it for you (there is no NOT NULL IF another_field = value column constraint in SQL, or anything like this).
Go with two tables. It's easier to support. I once saw a DB where everything, every single piece of data, went into just two tables: there were pages and pages of code to make sure that the necessary fields were not NULL.
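For illustration, the two-table shape might look like this (beyond the names mentioned in the question, the columns are assumptions):
CREATE TABLE product (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    unit_price DECIMAL(12,2) NOT NULL    -- every column can be declared NOT NULL
);

CREATE TABLE service (
    service_id  INT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    hourly_rate DECIMAL(12,2) NOT NULL   -- the few service-specific attributes live only here
);

-- One way for a purchase line to reference exactly one of the two:
CREATE TABLE purchase_line (
    line_id    INT PRIMARY KEY,
    product_id INT REFERENCES product(product_id),
    service_id INT REFERENCES service(service_id),
    quantity   INT NOT NULL,
    CHECK ((product_id IS NULL AND service_id IS NOT NULL)
        OR (product_id IS NOT NULL AND service_id IS NULL))
);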
If I were implementing it, I would have gone for the two table option. It's kinda like the first rule of normalization of the schema: remove multi-valued attributes. Using item_type is not recommended; once you create separate tables you don't need the item_type, you can just use the foreign key relationship.
Consider reading this article :
http://en.wikipedia.org/wiki/Database_normalization
It should help.

Database design rules to follow for a programmer

We are working on a mapping application that uses Google Maps API to display points on a map. All points are currently fetched from a MySQL database (holding some 5M + records). Currently all entities are stored in separate tables with attributes representing individual properties.
This presents following problems:
1. Every time there's a new property we have to make changes in the database, application code and the front end. This is all fine, but some properties have to be added for all entities, and that's when it becomes a nightmare to go through 50+ different tables and add new properties.
2. There's no way to find all entities which share any given property, e.g. no way to find all schools, colleges or universities that have a geography dept (without querying schools, unis and colleges separately).
3. Removing a property is equally painful.
4. No standards for defining properties in individual tables. The same property can exist with a different name or data type in another table.
5. No way to link or group points based on their properties (somewhat related to point 2).
We are thinking of redesigning the whole database, but without a DBA's help and lacking professional DB design experience we are really struggling.
Another problem we're facing with the new design is that there are a lot of shared attributes/properties between entities.
For example:
An entity called "university" has 100+ attributes. Other entities (e.g. hospitals, banks, etc.) share quite a few attributes with universities, for example ATM machines, parking, cafeteria, etc.
We don't really want to have properties in a separate table [and then link them back to entities with foreign keys] as it will require us to add/remove them manually. Also, generalizing properties will result in groups containing 50+ attributes. Not all records (i.e. entities) require those properties.
So with keeping that in mind here's what we are thinking about the new design:
Have separate tables for each entity containing some basic info e.g. id,name,etc etc.
Have 2 tables, attribute type and attribute, to store property information.
Link each entity (or a table if you like) to attribute using a many-to-many relation.
Store addresses in a different table called addresses and link entities via foreign keys.
We think this will allow us to be more flexible when adding, removing or querying on attributes.
This design, however, will result in an increased number of joins when fetching data; e.g. to display all "attributes" for a given university we might have a query with 20+ joins to fetch all related attributes in a single row.
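For concreteness, a rough sketch of the proposed attribute tables (the names are assumptions based on the description above):
CREATE TABLE attribute_type (
    attribute_type_id INT PRIMARY KEY,
    name              VARCHAR(50) NOT NULL      -- e.g. 'boolean', 'text', 'number'
);

CREATE TABLE attribute (
    attribute_id      INT PRIMARY KEY,
    attribute_type_id INT NOT NULL REFERENCES attribute_type(attribute_type_id),
    name              VARCHAR(100) NOT NULL UNIQUE  -- one definition shared by all entities
);

-- Many-to-many link from one entity (here: university) to its attributes, with the value.
CREATE TABLE university_attribute (
    university_id INT NOT NULL REFERENCES university(university_id),
    attribute_id  INT NOT NULL REFERENCES attribute(attribute_id),
    value         VARCHAR(255),
    PRIMARY KEY (university_id, attribute_id)
);

-- "All entities with a geography dept" then becomes a join on attribute.name = 'geography dept'.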
We desperately need to know some opinions or possible flaws in this design approach.
Thanks for your time.
In trying to generalize your question without more specific examples, it's hard to truly critique your approach. If you'd like some more in depth analysis, try whipping up an ER diagram.
If your data model is changing so much that you're constantly adding/removing properties and many of these properties overlap, you might be better off using EAV.
Otherwise, if you want to maintain a relational approach but are finding a lot of overlap with properties, you can analyze the entities and look for abstractions that link to them.
Ex) My Db has Puppies, Kittens, and Walruses all with a hasFur and furColor attribute. Remove those attributes from the 3 tables and create a FurryAnimal table that links to each of those 3.
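In SQL terms, that example might look like the following (the table names come from the example; the columns are assumed):
CREATE TABLE FurryAnimal (
    furry_animal_id INT PRIMARY KEY,
    has_fur         BOOLEAN NOT NULL,
    fur_color       VARCHAR(30)
);

-- Each of the three tables links to the shared abstraction instead of repeating the columns.
CREATE TABLE Puppy (
    puppy_id        INT PRIMARY KEY,
    name            VARCHAR(50) NOT NULL,
    furry_animal_id INT NOT NULL REFERENCES FurryAnimal(furry_animal_id)
);
-- Kitten and Walrus would carry the same furry_animal_id reference.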
Of course, the simplest answer is to not touch the data model. Instead, create Views on the underlying tables that you can use to address (5), (4) and (2).
Problem 1 cannot be an issue. There is one place where your objects are defined; everything else is generated/derived from that. Just refactor your code until this is the case.
Problem 2 is solved by having a metamodel, where you describe which properties are where. This is probably needed for problem 1 too.
You might want to totally avoid the problem by programming this in Smalltalk with Seaside on a Gemstone object oriented database. Then you can just have objects with collections and don't need so many joins.
