Issues with ER model design - database

I am trying to design a model for our future database of our toys and certain measurements that have to be done post-production. I have trouble grasping how to model this. I have tried multiple ways, but none of them seem optimal and in the end I've always lost out on the connectivity between entities.
What I need to achieve is some kind of meaningful relationship between the following:
A toy (with some trivial properties).
A series of toys (multiple toys can be related to one series and a toy can only belong to one series).
Measurement steps. There are currently 6 of these steps. Each step has its own input parameters and these vary in type as well as in number (eg. only 3 parameters for measurement step 1 and 10 parameters for measurement step 2).
With each series, a sequence of these measurement steps is defined. Duplicates of tests are allowed (eg. measurement step 1 > measurement step 4 > measurement step 1 is a valid sequence). The sequence along with the parameters must be stored somewhere for future reference.
Each toy goes through the sequence of measurements that is defined by its series. All of the results must be stored somewhere (for each individual toy).
If I split the measurement steps into their own tables I can't reference them conditionally (as foreign keys) to some other table.
If I try to serialize part of the data I lose the ability to make connections between individual measurement steps, measurement results (at least with queries) etc.
I know people here generally hate/don't answer these kinds of "discussion-like" questions, but I'd ask of you to at least point out what is a good practice in a system where I need to store this locally on a machine, but need a database to hold the data - to move towards serial-like data and just do general relationships where it is easy to do so or keep trying to normalize it as much as possible?

If measurements steps share most of attributes (or are of the same type, like what you called PARAMETERS), and I understood correctly your definitions, I would make something like this.
It could be a starting point.
+----------------------------+ +------------------------------+
| TOYS | | TOY_SERIES |
+-----+----------------------+ +---------+--------------------+
| PK | ID_TOY | +----------+ PK, FK1 | ID_S +--------+
| | | | +------------------------------+ |
| FK1 | ID_S +---------+ | | ... | |
+----------------------------+ | | | |
| | ... | | | | |
| | | | | | |
+-----+----------------------+ +---------+--------------------+ |
|
|
|
|
+------------------------------+ |
| BR_SER_MEAS | |
+---------+--------------------+ |
| PK, FK1 | ID_S +--------+
| | |
| PK, FK2 | ID_M +--------+
| | | |
| PK | ID_SEQ | |
| | | |
+---------+--------------------+ |
|
|
+------------------------------+ |
| MEASURE_STEPS | |
+------------------------------+ |
| PK ID_M +--------+
+------------------------------+
| PARAM_01 |
| ... |
| PARAM_10 |
| |
| |
+------------------------------+

Related

Postgres query chain structure data

Assuming there is a gigantic organization with a crazy way to manage. Each employee has one or multiple managers, managers are employees themselves who have one or multiple managers on top.
employee table
| id | name |managers_id|
| -------- | -------------- |-----------|
| 1 | Smith | 5,6 |
| 2 | Matt | 1 |
| 3 | Bob | 1,2 |
| 4 | Adam | 1,3 |
| 5 | Suzi | 6 |
| 6 | Emily | 23,25 |
| ... | ... | ... |
It is a one-way management chain, no loops, meaning it goes A-B-C-D, A-a-b-C-D etc, no such case as A-B-C-D-A
The query is to get the management chains, say C has two management chains on top:
A-B-C
A-a-b-C
C also has one chain below:
C-D
The level of C along the chains is not a matter.
In theory, there is no limitation on the number of levels, the chain can keep going indefinitely.
I was thinking about 'inheritance' but probably it is not the solution.
Any tips on how to design this postgres dababase, please? Thank you.

Valid Warehousing Strategy? Type6?

As a database user, I'm having problems interpreting the data in one of our tables at work. When I questioned the data team, the solution architects told me it was done this way on purpose because it is a "Type 6" table.
From my limited googling, I think a Type 6 should look like this:
+--------------+------------------+------------------+------------+------------+---------------------+
| Customer_Key | Customer_Attrib1 | Customer_Attrib2 | Start_Date | End_Date | Record Updated Date |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 1 | A | 1/1/2001 | 6/8/2004 | 6/9/2004 |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 1 | A | 6/9/2004 | 4/11/2016 | 4/12/2016 |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 1 | A | 4/12/2016 | 4/3/2017 | 4/4/2017 |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 2 | B | 4/4/2017 | 5/18/2017 | 5/19/2017 |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 2 | B | 5/19/2017 | 12/31/9999 | 5/19/2017 |
+--------------+------------------+------------------+------------+------------+---------------------+
The activity in question is the Customer_Attrib1, how it changed from 1 to 2 on 5/18/2017.
I like this style because I can figure out what customer_attrib1 is at any point of time by using the start and end dates:
select customer_attrib1
from table
where customer_key=123
and '2017-03-01' between start_date and end_date;
However...
The table itself actually gets updated in arrears, to look like this:
+--------------+------------------+------------------+------------+------------+---------------------+
| Customer_Key | Customer_Attrib1 | Customer_Attrib2 | Start_Date | End_Date | Record Updated Date |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 2 | A | 1/1/2001 | 6/8/2004 | 5/19/2017 |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 2 | A | 6/9/2004 | 4/11/2016 | 5/19/2017 |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 2 | A | 4/12/2016 | 4/3/2017 | 5/19/2017 |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 2 | B | 4/4/2017 | 5/18/2017 | 5/19/2017 |
+--------------+------------------+------------------+------------+------------+---------------------+
| 123 | 2 | B | 5/19/2017 | 12/31/9999 | 5/19/2017 |
+--------------+------------------+------------------+------------+------------+---------------------+
Can you see how much trouble I have, if I want to go find what the customer_attrib1 was during March of 2016?
NOTE: There is a previous_customer_attrib1 column, but it also gets mass updated to the value of 1. I wanted to keep the table small enough to get the point across, which is why I didn't add it above.
The big question: Is this a valid warehousing strategy? Is this really what Type 6 is? Or is my solution architect wrong.
Follow up question: Would the answer be different if customer_attrib1 was a foreign key to another table?
Your first example looks like a plain ol Type 2 SCD. The second example looks like it is Type 1 (overwrite) on attribute 1 and Type 2 SCD on attribute 2.
Neither are a Type 6 as presented, which would be where you have a way to see both the history of the changes (in a Type 2 way, as per your first example) but also the current values, typically by holding a separate set of columns for the current values, or by linking to the current record. You mention the previous attrib 1 column, which is crucial to it being a Type 6. However you would not expect that to also be bulk updated, as otherwise you just get the one previous value, and you don't get to see any changes prior to that.
Different people refer to a Type 6 meaning different things. What you need in a type 6 is simply the value itself (which applies to the row at the time) and the current value (which is bulk updated when there is a change), along with (of course) the type 2 approach of creating new rows for each change.
To answer your question, yes, I can see how much trouble you have with the design that's been given to you. Its a valid strategy if and only if it meets the business requirements. These techniques are only there to help meet business needs.
If the attribute is a foreign key then it becomes a bit more tricky, and we'd need more info about how that foreign keyed table was tracking history to be able to answer whether that changes anything.

How can we validate tabular data in robot framework?

In Cucumber, we can directly validate the database table content in tabular format by mentioning the values in below format:
| Type | Code | Amount |
| A | HIGH | 27.72 |
| B | LOW | 9.28 |
| C | LOW | 4.43 |
Do we have something similar in Robot Framework. I need to run a query on the DB and the output looks like the above given table.
No, there is nothing built in to do exactly what you say. However, it's fairly straight-forward to write a keyword that takes a table of data and compares it to another table of data.
For example, you could write a keyword that takes the result of the query and then rows of information (though, the rows must all have exactly the same number of columns):
| | ${ResultOfQuery}= | <do the database query>
| | Database should contain | ${ResultOfQuery}
| | ... | #Type | Code | Amount
| | ... | A | HIGH | 27.72
| | ... | B | LOW | 9.28
| | ... | C | LOW | 4.43
Then it's just a matter of iterating over all of the arguments three at a time, and checking if the data has that value. It would look something like this:
**** Keywords ***
| Database should contain
| | [Arguments] | ${actual} | #{expected}
| | :FOR | ${type} | ${code} | ${amount} | IN | #{expected}
| | | <verify that the values are in ${actual}>
Even easier might be to write a python-based keyword, which makes it a bit easier to iterate over datasets.

What database technologies should I consider for building a scalable "running average" view?

We are working on an application where millions of users will be entering information at the same time. Suppose the application allows people to rate geographic regions on where they would like to live. Each participant is allowed to rate each region using a decimal value from 0-10. Each person belongs to one or more groups based upon attributes such as gender, and people that consider themselves active, or enjoy culture.
Every time a rating is made, we need to have a view which shows us the average rating for each region/group. I'm aware that most DB's have an "average" function, but for our purposes we need to be able to use our own function as we may use a the geometric mean instead of the arithmetic mean.
Below are some tables which might be used. Note: I did not include the relationship table PeopleGroups which map which groups a person is a member of for brevity purposes.
Regions People Groups RegionScoresByPerson
+-----+------------+ +-----+-------+ +-----+----------+ +-----+-----+-------+
| RID | NAME | | PID | Name | | GID | Name | | RID | PID | Score |
+-----+------------+ +-----+-------+ +-----+----------+ +-----+-----+-------+
| 1 | Flordia | | P1 | Alice | | G0 | Everyone | | 1 | P1 | 6 |
| 2 | California | | P2 | Bob | | G1 | Women | | 1 | P2 | 8 |
+-----+------------+ | P3 | Frank | | G2 | Men | | 1 | P3 | 3 |
| P4 | Mary | | G3 | Active | | 1 | P4 | 2 |
+-----+-------+ | G4 | Culture | | 1 | P1 | 7 |
+-----+----------+ | 1 | P2 | 5 |
| 1 | P3 | 8 |
| 1 | P4 | 2 |
+-----+-----+-------+
Our current implementation uses a similar set of tables for storing ratings, but we don't calculate averages real-time. Anytime we need the results (e.g. show me the average score California for women), we have to pull all the information into memory and run the calculations manually.
I was wondering how I leverage database technologies such as views, triggers, stored procedures, etc. to present to me a simple table that will allow me to get scores by for people and groups so we don't have to manually run calculations.
I would like some table like the following, where everything is handled by the DB. Any insert,update,delete actions on the RegionScoresByPerson or Groups tables would automatically be reflected in this table. If it is not apparent, the rows marked with * calculated rows. In this case I'm using a simple arithmetic average, but I the design should allow for any type of function.
EID stands for entity ID (a person or group)
Besides deciding how to build such a view, I'm unsure of what sort of datatypes to use (and index) for People and Groups. I suppose I'd like the index to be integers, but that would prevent me from creating the table below because I couldn't distinguish between Person 1 and Group 1 -- Would having ID's such as P1 and G1 be a performance hit? I'm obviously concerned about the design being scalable.
ScoreView
+-----------+-----+-------+
| RID | EID | Score |
| 1 | P1 | 6 |
| 1 | P2 | 8 |
| 1 | P3 | 3 |
| 1 | P4 | 2 |
| 1 | P1 | 7 |
| 1 | P2 | 5 |
| 1 | P3 | 8 |
| 1 | P4 | 2 |
| 1 | G0 | 4.75 |*
| 1 | G1 | 4 |*
| 1 | G2 | … |*
| 1 | G3 | … |*
+-----------+-----+-------+
Apache Flume is the open source tool designed to solve this kind of problem. Also have a look at Google Cloud Dataflow.
https://flume.apache.org/

Difficult kind of Hierachical Data in Relational Database

I have "components" which can be assembled in different ways into a "system". I want my database to hold all these "components", their type specific data and define how they are connected to each other to form a "system".
The systems are typically gearboxes and they can have rather complex branched designs. Let's start with an easy example:
This system is built up out of Masses (horizontal lines) and Stiffnesses (vertical lines). Gears and clutches are types of masses and come in pairs. Colors represent different branch speeds due to gear ratios. Here's a (bad) example of how I could store everything from this particular illustration:
ID | Type | Clutch | Ends | DrivenBy | NoOfTeeth| Mass | Stiffness
--- | ---- | ------ | ---- | --------- | -------- | ---- | ---------
1 | Mass | | Input1 | | | 5 |
2 | Stiffness | | | | | | 15
3 | Mass | 1.1 | | | | 2 |
4 | Mass | 1.2 | | | | 3 |
5 | Stiffness | | | | | | 20
6 | Gear | | | | 10 | 4 |
7 | Stiffness | | | | | | 30
8 | Gear | | | | 4 | 5 |
9 | Gear | | | 8 | 7 | 2 |
10 | Stiffness | | | | | | 40
11 | Mass | | | | | 4 |
12 | Stiffness | | Output1 | | | | 10
13 | Gear | | | 6 | 5 | 4 |
14 | Stiffness | | | | | | 20
15 | Mass | 2.1 | | | | 4 |
16 | Mass | 2.2 | | | | 3
17 | Stiffness | | | | | | 30
18 | Mass | | Output2 | | | 2 |
Obviously, this is not a very good way to store the data. This design pattern resembles somewhat of a "Repeated attributes" since each component type has a different attribute to be filled. I could create a table for each type of component, but things become more complex when looking at other examples, such as this 2-stage gearbox:
There are also examples with more than 1 input and several outputs, but I can't post more links due to low reputation.
Eitherway, you will see that the usual hierarchical data storage doesn't apply here because the data is not purely "tree-shaped" where everything branches off from 1 main branch.
I think that even though I could store data in the above mentioned way, I will get huge difficulties when it comes to the programming stage.
To add to the complexity, these gearboxes are actually sub-systems to a much bigger system.
So, any suggestions on a good way to store this type of data?*
Perhaps this is a possible way of doing it?
Here you will see that there is a "main" table called GearboxBranch, keeping track of all elements in the gearbox, giving them an id and to identify in which branch the element exists.
Then for the elements themselves, masses are defined in their dedicated table, so are stiffnesses. Gears and Clutches (which are types of masses) are then defined in their perspective tables. A recursive relationship is existing in the gear table, since one gear has to be driven by at least one other gear.
Furthermore, the table with Shaft Ends defines which of the elements in the gearbox are input or output and what number they have.
I can't seem to see any problems with this method, but I'm a little unsure how to get data out of the database. There will be considerable coding involved I'm afraid.

Resources