How can I polymorphically structure a database?

This might be a stupid question, but I have very little experience. I have run into an issue while working with an Excel spreadsheet for a small factory.
It has a huge list of products that are grouped into families.
analogy: Corolla, Avensis, Landcruiser = Toyota
Furthermore the products have a list of tasks associated with them.
Corolla:
Step 1
Step 2
Step 3...
All products share tasks in the first few stages even across different families.
But some tasks occur at a different stage during production:
What may be step 6 in productX is step 5 in productY.
But productX and productY share steps 1-5. (And this is true across the board.)
I have three questions.
Is it possible to structure a database polymorphically? Common tasks could be placed in a base class and become more specific further down (as is common in OO).
If it is not possible, can you instead create a central table of unordered tasks and give each product some sort of priority mapping that puts those tasks in order?
Finally, has anyone encountered such a problem? I have a feeling there has to be a design pattern for this. It feels like a solution is just beyond my grasp.
Edit 1: The spreadsheet is mostly blank for the time being. The worksheets are the product names, and the string-integer combinations are the product numbers. Values will be filled in underneath, i.e. time (hr) and the amount of product that should be made in the specified time.

So, this is what I understood:
You need to store a mapping between products and tasks/steps. The latter should be stored in the order in which they are to be performed.
Some initial tasks are always common for all products.
You'd like to structure your database 'polymorphically'. Since you didn't mention what kind of database you are using, I'll assume it to be a relational one.
You can create your tables like so:
Product: each row stores data on one product. Primary key: product-name (or product-id, whatever)
Task: information on a task, such as time taken to finish it etc. Primary key: task-name/id.
ProductTaskMapping: contains the mapping of which tasks are to be done for which product, and in what order. Its schema will be as follows; you can also make the first two columns foreign keys.
product-name: refers to the primary key of the Product table.
task-name: refers to the primary key of the Task table.
priority, or sequence-number
CommonTask: Two columns:
task-name
priority
Also, there's no way to define 'inheritance' between two tables.
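A minimal DDL sketch of the tables described above (column names and types are only illustrative):

CREATE TABLE Product (
    product_name  VARCHAR(50) PRIMARY KEY
);

CREATE TABLE Task (
    task_name     VARCHAR(50) PRIMARY KEY,
    duration_hrs  DECIMAL(5,2)          -- or whatever per-task data you track
);

-- Product-specific ordering of tasks
CREATE TABLE ProductTaskMapping (
    product_name  VARCHAR(50) REFERENCES Product(product_name),
    task_name     VARCHAR(50) REFERENCES Task(task_name),
    priority      INT NOT NULL,         -- sequence number within this product
    PRIMARY KEY (product_name, task_name)
);

-- Tasks shared by every product, with their own ordering
CREATE TABLE CommonTask (
    task_name     VARCHAR(50) PRIMARY KEY REFERENCES Task(task_name),
    priority      INT NOT NULL
);

To get the full ordered task list for one product, you can then UNION that product's rows from ProductTaskMapping with the rows from CommonTask and sort by priority.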

Related

How to move from Excel to designing a Data Warehouse Model

I have just started with Data Warehouse modeling and I need help modeling a problem.
Let me tell you the facts: I work on flight data (aeronautical data),
so I have two linked Excel (fact) files: one file, 'order', and the other, 'services'.
The 'order' file sets out a summary of each flight (orderId, departure date, arrival date, city of departure, city of arrival, total amount collected, etc.).
The 'services' file lists the services provided per flight (orderId, service name, quantity, amount/qty, etc.),
with a 1-n relationship (order-services): each order has n services.
I already see some dimensions (Time, Location, etc ...). However, I would like to know how I could design my Data Warehouse, knowing that I have two fact files linked together by orderId.
I have thought about it: the star and snowflake schemas do not seem to work in my case (since I have two fact tables), and the galaxy schema requires dimensions in common. Where I am stuck is whether I should make the order table a dimension rather than a fact table, or instead make the services table a dimension, but both look like fact tables to me. I am getting a little confused.
How can I design my model?
First of all, realize that in a star schema it is not a problem to have more than one fact table, even connected ones - see the discussion here.
So the first draft will simply follow your two fact tables with the dimensions they natively provide.
Order is in one context a fact table, and in another context a dimension table for the Services table.
Depending on your expected queries, you may find it useful to denormalize some dimensions of the Order table into the Services table, so that Services directly carries the departure date, arrival date, etc. dimensions.
This will be done at load time in the ETL job.
I would be somewhat careful about denormalizing the measures from Order to Services, since that would basically eliminate the whole Order table.
There will be no problem with the measure total amount collected if this is a redundant sum of the service amounts - you may safely get rid of it.
But you will surely need the number of flights or the number of people transported - those measures are better defined in the Order fact table; you cannot simply replicate them in the N rows for each service.
A workaround is possible: define a main service for each order and store those measures only in that row, leaving the value NULL in the other rows. This can lead to unexpected results if queried naively, e.g. for the number of flights per service.
So basically I'd start with the two fact tables and denormalize some dimensions to the services if this would help to optimize the queries.
I would start with one fact table of Services. This fact would include all of the dimensions you might associate with the Order, including a degenerate dimension of OrderId.
Once this fact is built out and some information products are consuming it, return to the Order and re-evaluate it to see if there are any reporting needs which are not being served, or questions which are difficult to answer with the Services fact.
Joining two facts together is always a bad idea. Performance is terrible. You are always better off bringing the dimensions from, in your case, Order into Services. Don't forget to include the context of the dimension in the column name, with a corresponding role-playing dimension view for that context, e.g. OrderArrivalCity, OrderDepartureDate, OrderDepartureTime.
You can also get yourself a copy of Ralph Kimball's The Data Warehouse Toolkit.
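As a rough sketch of what that Services fact could look like (table and column names are only examples, and the role-playing view assumes a conformed DimDate dimension with the usual calendar columns, not shown here):

-- Grain: one row per service line on an order
CREATE TABLE FactService (
    OrderId               VARCHAR(20),   -- degenerate dimension: kept on the fact, no Order dimension table
    ServiceName           VARCHAR(100),
    Quantity              INT,
    Amount                NUMERIC(12,2),
    -- Order-context dimensions denormalized onto the service grain,
    -- with the context ('Order') kept in the column names
    OrderDepartureDateKey INT,
    OrderArrivalDateKey   INT,
    OrderDepartureCityKey INT,
    OrderArrivalCityKey   INT
);

-- Role-playing view: the shared date dimension joined in the 'order departure' role
CREATE VIEW DimOrderDepartureDate AS
SELECT DateKey, FullDate, CalendarYear, CalendarMonth
FROM DimDate;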

How much should data be split up?

I'm currently setting up a database with Microsoft Access.
The main goal is to rebuild a huge inventory in a well-structured manner.
The current inventory is full of duplication and redundancy, which I'm trying to reduce with Access. My question now is how far the data should be split up into its smallest logical parts.
The list includes a lot of different data. I have pretty much broken it down already; let me give you an overview:
For me, it seems that I could split up the different attributes of a room into separate tables, because each key shows up multiple times. For example, each room has a category (e.g. bureau) and a definition (e.g. conference room), and of course there are multiple rooms with the same category/definition.
Question is, does it make sense to split this into isolated tables? It feels like I am splitting this up way too much.
Your base data tables are:
Employees, Teams, Departments, Floors, Rooms, Workstations, Equipment.
Then you need lookup tables for things like: Employee_Gender or Room_Size (anything where you have to select from a fixed set of values).
Depending on how things work, Floors may be better used as a lookup table too by assigning teams directly to rooms rather than floors.
Also do not directly link Rooms and Devices. The link through Workstations is enough, unless you have devices assigned to rooms that do not belong to any workstation. Even then I would just create virtual workstation entries rather than have my table links loop.
If it is ever possible to have a team with employees from different departments, that part also needs to be different: employees would be assigned directly to departments and, independently, to teams, rather than being assigned to departments through their assignment to a team. In that case Team is also a lookup table rather than a main data table.
Size is just a value and should be stored as a value in Room, not as a foreign key. You can have a lookup table for Sizes, if they are standardized, just to make data entry easier, but it is not useful as a relation.
Room definition is also a lookup table. But names can change, so I would go with a foreign key here (from Room). You can also store the definition name directly, if that makes your life easier.
The rest is straightforward:
Room -> Floor
Device -> Room
There is no n:m relation needing a connector table here.
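A small DDL sketch of the core relations described above (names and types are only illustrative):

CREATE TABLE Floor (
    FloorId        INT PRIMARY KEY,
    FloorName      VARCHAR(50)
);

CREATE TABLE RoomDefinition (            -- lookup table, e.g. 'conference room'
    DefinitionId   INT PRIMARY KEY,
    DefinitionName VARCHAR(50)
);

CREATE TABLE Room (
    RoomId         INT PRIMARY KEY,
    FloorId        INT REFERENCES Floor(FloorId),
    DefinitionId   INT REFERENCES RoomDefinition(DefinitionId),
    SizeSqm        DECIMAL(6,1)          -- size stored as a plain value, not as a foreign key
);

CREATE TABLE Workstation (
    WorkstationId  INT PRIMARY KEY,
    RoomId         INT REFERENCES Room(RoomId)
);

CREATE TABLE Device (
    DeviceId       INT PRIMARY KEY,
    WorkstationId  INT REFERENCES Workstation(WorkstationId)   -- linked via Workstation, not directly to Room
);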

Database Design to handle newsfeed for different activities

I am going to create a new project where I need users to view their friends' activities and actions, just like on Facebook and LinkedIn.
Each user is allowed to do 5 different types of activities, and each activity has different attributes; for example, activity X can be public/private, while activity Y is assigned to categories. Some actions involve 1 user, others 2 or 3, etc. Eventually I have to aggregate all these 5 different types of activities on the news feed page.
How can I design a database that is efficient?
I have 3 designs in mind, please let me know your thoughts. Any new ideas will be greatly appreciated!
1- Separate tables: since there are nearly 3-4 different columns for each activity, it would be logical to separate each activity to its own table.
Pros: Clean database, and easy to develop.
Cons: It will need to query the database 5 times and aggregate results to make a single newsfeed page.
2- One big table: This table will hold all activities with many unused columns. A new numeric column called "type" will be added to indicate the type of activity. Some attributes could be combined in an HStore field (since we are using Postgres); others will be queried a lot, so I don't think it is a good idea to put those in an HStore field. (A rough sketch of this option is included below.)
Pros: Easy to pull newsfeed.
Cons: Lots of reads/writes on the same table, the code will be a bit messier, and so will the database.
3- Hybrid: A solution would be to make one table containing all the newsfeed, with a polymorphic association to other tables that contain details of each specific activity.
Pros: Tidy code and database, easy to add new activities.
Cons: JOIN ALL THE TABLES to make a single newsfeed! Still better than making 5 different queries.
As I am writing this post I am starting to lean towards solution number 2. Please advise!
Thanks
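To make option 2 concrete, here is a rough Postgres sketch (the column names are just examples of how the per-type attributes might be laid out):

-- Option 2: one activities table with a "type" discriminator and an hstore
-- column for the attributes that are never filtered on
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE activities (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    type        SMALLINT NOT NULL,           -- 1..5: which kind of activity this row is
    is_public   BOOLEAN,                     -- only used by activity X
    category_id BIGINT,                      -- only used by activity Y
    extra       hstore,                      -- rarely-queried attributes
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Feed queries read recent rows per user in one pass
CREATE INDEX activities_feed_idx ON activities (user_id, created_at DESC);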
I would consider a graph database for this, such as Neo4j. It allows very flexible attributes on either nodes (users) or links (types of relations).
For small sets and few joins, SQL databases are faster and more appropriate. But if your starting point is 5 table joins, graph databases seem simpler and offer similar performance (if not better).

Best database design/architecture for when you want changes to raw tables to automatically propagate up to summary tables

I am trying to figure out the best database/data architecture approach for a system where I need changes to raw, "lower level" tables to, ideally, automatically propagate up to various tables that store views based on the underlying data.
Let me give a simplified example.
Imagine I have the following tables:
various tables of food ingredients and prices at different supermarkets,
e.g. a Safeway table that has { ketchup: 1.19, butter: 0.99 }, and a Walmart table with { eggs: 1.99, butter: 0.79 }
a table that stores the cheapest location for each ingredient
eg. { butter => walmart: 0.79}
a table that stores recipes with the cheapest price of each recipe (made up from the cheapest prices of ingredients from table 2),
e.g. { "ketchupy-egg-breakfast" => total: 3.97, ingredients => { butter: 0.79, eggs: 1.99, ketchup: 1.19 } }
a table that stores the cheapest breakfast recipe among several alternative recipes.
Now imagine I have workers going out and updating the values of table (1) on an hourly basis. Is there a database design or architecture that would force an update of any entries in table 2 that rely on something that changed in table 1, and similarly propagate changes onwards to tables 3 and 4? I imagine this can be implemented in code through a series of cascading jobs, but I was wondering if there was a less complex and more elegant way of doing it.
Thanks for your help.
This looks like a good fit for the CQRS pattern. You should have just one data source when storing items.
From the linked article:
CQRS stands for Command Query Responsibility Segregation. It's a pattern that I first heard described by Greg Young. At its heart is a simple notion that you can use a different model to update information than the model you use to read information.
CQRS helps you take in one kind of information (Update model) and then transform it into something which you can query (Read/query model).
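In this scenario, a minimal relational take on that idea (a sketch only, assuming Postgres and hypothetical table names) is to treat the raw supermarket price table as the update model and expose the derived 'cheapest per ingredient' data as a materialized view, i.e. the read model, refreshed after each hourly batch of updates:

-- Update/write model: the raw price rows the workers update hourly
CREATE TABLE ingredient_price (
    supermarket VARCHAR(50),
    ingredient  VARCHAR(50),
    price       NUMERIC(8,2) NOT NULL,
    PRIMARY KEY (supermarket, ingredient)
);

-- Read/query model: what the recipe and summary queries actually consume
CREATE MATERIALIZED VIEW cheapest_ingredient AS
SELECT ingredient, MIN(price) AS cheapest_price
FROM ingredient_price
GROUP BY ingredient;

-- Run once after each hourly batch of price updates
REFRESH MATERIALIZED VIEW cheapest_ingredient;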
I think the 'elegant way' is to properly normalize your tables.
For instance, you would not have one table for Safeway and another table for Walmart, etc. Instead you would have one table, 'Retailer', that lists both Safeway and Walmart,
then one table, 'Product', to list eggs, ketchup, etc.,
then a third, say 'retail_item', that lists retailer_id, product_id, price, etc.
Then you can simply query for all the questions you posed...
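A sketch of that normalized layout (all names are illustrative), with an example of how 'cheapest location for an ingredient' becomes just a query rather than a separately maintained table, so nothing has to be propagated when prices change:

CREATE TABLE retailer (
    retailer_id INT PRIMARY KEY,
    name        VARCHAR(50)             -- 'Safeway', 'Walmart', ...
);

CREATE TABLE product (
    product_id  INT PRIMARY KEY,
    name        VARCHAR(50)             -- 'eggs', 'butter', 'ketchup', ...
);

CREATE TABLE retail_item (
    retailer_id INT REFERENCES retailer(retailer_id),
    product_id  INT REFERENCES product(product_id),
    price       NUMERIC(8,2) NOT NULL,  -- updated hourly by the workers
    PRIMARY KEY (retailer_id, product_id)
);

-- Cheapest place to buy butter, always based on the current prices
SELECT r.name, ri.price
FROM retail_item ri
JOIN retailer r ON r.retailer_id = ri.retailer_id
JOIN product  p ON p.product_id  = ri.product_id
WHERE p.name = 'butter'
ORDER BY ri.price
LIMIT 1;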

database design scenario

What's the right way to do this:
I have the following relationship between the entities RAW_MATERIAL_PRODUCT and FINISHED_PRODUCT: a FINISHED_PRODUCT has to be made of one or more Raw Material Products, and a Raw Material Product may be part of a Finished Product (so a many-to-many). I have the intersection entity, which I called ASSEMBLY, that tells me exactly which Raw Material Products a Finished Product is made of.
Good. Now I need to sell the Finished Products and compute the production cost. The PRODUCT_OUT entity comes in, which can contain only one FINISHED_PRODUCT, and a FINISHED_PRODUCT may be part of multiple PRODUCT_OUT rows.
It would be easy if, for example, Finished Product A were always made of 3 pieces of Raw Material Product a1, 2 of a2, etc. The problem is that the quantities may change.
The stock of a Raw Material Product is computed as
TotalIn - TotalOut
so I can't put a quantity attribute in ASSEMBLY, because I would get incorrect data when calculating the stock (if quantities are changed).
My only idea is to give up the FINISHED_PRODUCT entity and join PRODUCT_OUT directly to RAW_MATERIAL_PRODUCT, with the intersection entity containing a quantity attribute. But this seems kind of stupid, because almost all the time a FINISHED_PRODUCT is made of the same RAW_MATERIAL_PRODUCTS.
Is there a better way?
I'm not 100% sure I understand, but it sounds like essentially the recipe can change, and your model needs to account for this?
But this seems kind of stupid because almost all the time a
FINISHED_PRODUCT is made of the same RAW_MATERIAL_PRODUCTS.
Almost all the time, or all the time? I think that's a pretty critical question.
It seems to me that when you change the recipe, you should create a new FINISHED_PRODUCT row, which has a different set of RAW_MATERIAL_PRODUCTS based on the association in the ASSEMBLY table.
If you want to group different recipes of the same FINISHED_PRODUCT together (kind of like versioning!), create a FINISHED_PRODUCT_TYPE table with a 1:m relationship to the FINISHED_PRODUCT table.
Edit (quote from comment):
I totally agree with you that it should be a different product, but if I add
one screw to a product I can't really name it Product A with 1 extra
screw. And it seems this can happen. I didn't quite get the use of
creating a FINISHED_PRODUCT_TYPE table. Could you please explain?
Sure. So your FINISHED_PRODUCT_TYPE defines the name of the product, and possibly some other data (description, category, etc.). Then each row in FINISHED_PRODUCT is essentially a "version" of that product. So "Product A" would only exist in one place, a row in the FINISHED_PRODUCT_TYPE table, but there could be one or many versions of it in the FINISHED_PRODUCT table.
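A minimal sketch of that versioned layout (the table names follow the entities above; the columns are only an illustration):

CREATE TABLE FINISHED_PRODUCT_TYPE (     -- 'Product A' lives here, once
    type_id     INT PRIMARY KEY,
    name        VARCHAR(100),
    description VARCHAR(255)
);

CREATE TABLE FINISHED_PRODUCT (          -- one row per version/recipe of a product type
    finished_product_id INT PRIMARY KEY,
    type_id             INT REFERENCES FINISHED_PRODUCT_TYPE(type_id),
    version_no          INT
);

CREATE TABLE RAW_MATERIAL_PRODUCT (
    raw_material_id INT PRIMARY KEY,
    name            VARCHAR(100)
);

CREATE TABLE ASSEMBLY (                  -- quantities are fixed per version
    finished_product_id INT REFERENCES FINISHED_PRODUCT(finished_product_id),
    raw_material_id     INT REFERENCES RAW_MATERIAL_PRODUCT(raw_material_id),
    quantity            INT NOT NULL,
    PRIMARY KEY (finished_product_id, raw_material_id)
);

CREATE TABLE PRODUCT_OUT (               -- each sale references one specific product version
    product_out_id      INT PRIMARY KEY,
    finished_product_id INT REFERENCES FINISHED_PRODUCT(finished_product_id),
    quantity            INT NOT NULL
);

Because each PRODUCT_OUT row points at a specific version, the raw-material quantities recorded in ASSEMBLY for that version remain correct even after the recipe changes, so the TotalIn - TotalOut stock calculation is not affected.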
