Postgres query chain structure data - database

Assuming there is a gigantic organization with a crazy way to manage. Each employee has one or multiple managers, managers are employees themselves who have one or multiple managers on top.
employee table
| id | name |managers_id|
| -------- | -------------- |-----------|
| 1 | Smith | 5,6 |
| 2 | Matt | 1 |
| 3 | Bob | 1,2 |
| 4 | Adam | 1,3 |
| 5 | Suzi | 6 |
| 6 | Emily | 23,25 |
| ... | ... | ... |
It is a one-way management chain, no loops, meaning it goes A-B-C-D, A-a-b-C-D etc, no such case as A-B-C-D-A
The query is to get the management chains, say C has two management chains on top:
A-B-C
A-a-b-C
C also has one chain below:
C-D
The level of C along the chains is not a matter.
In theory, there is no limitation on the number of levels, the chain can keep going indefinitely.
I was thinking about 'inheritance' but probably it is not the solution.
Any tips on how to design this postgres dababase, please? Thank you.

Related

What is this data referencing anti-pattern called?

I have a question related to a kind of duplication I see in databases from time to time. To ask this question, I need to set the stage a bit:
Let's say I have a database of TV shows. Its primary table Content stores information at various levels of granularity (Show -> Season -> Episode), using a parent column to denote hierarchy:
+----+---------------------------+-------------+----------+
| ID | ContentName | ContentType | ParentId |
+----+---------------------------+-------------+----------+
| 1 | Friends | Show | [null] |
| 2 | Season 1 | Season | 1 |
| 3 | The Pilot | Episode | 2 |
| 4 | The One with the Sonogram | Episode | 2 |
+----+---------------------------+-------------+----------+
Maybe this isn't ideal, but let's say it's good enough to work with and we're not looking to change it.
Now let's say we need to build a table that defines air dates. We can set these at any level, and they must apply down the hierarchy (e.g., if set at the Season level, it applies to all episodes within that season; if set at the Show level, it applies to all seasons and episodes).
So the original air dates might look like this:
+-------+-----------+------------+
| airId | ContentId | AirDate |
+-------+-----------+------------+
| 71 | 3 | 1994-09-22 |
| 72 | 4 | 1994-09-29 |
+-------+-----------+------------+
Whereas the air date for a streaming service might look like:
+-------+-----------+------------+
| airId | ContentId | AirDate |
+-------+-----------+------------+
| 91 | 1 | 2015-01-01 |
+-------+-----------+------------+
Cool. Everything's fine so far; we're adhering to 4NF (I think!) and we can proceed to our business logic.
Now we get to my question. If we implement our business logic in such a way that disregards the referential hierarchy, and instead duplicates the air dates down the hierarchy, what is this anti-pattern called? e.g., Let's say I set an air date at the Show level like above, but the business logic finds all child elements and creates an entry for each one, resulting in:
+-------+-----------+------------+
| airId | ContentId | AirDate |
+-------+-----------+------------+
| 91 | 1 | 2015-01-01 |
| 92 | 2 | 2015-01-01 |
| 93 | 3 | 2015-01-01 |
| 94 | 4 | 2015-01-01 |
+-------+-----------+------------+
There are some pretty clear problems with this, but please note that my question is not how to fix this. Just, is there a specific term for it? I want to call it something like, "disregarding data relationship" or, "ignoring referential context". Maybe it's not strictly a database anti-pattern, since in my example there's an external actor inserting the excess rows.

Issues with ER model design

I am trying to design a model for our future database of our toys and certain measurements that have to be done post-production. I have trouble grasping how to model this. I have tried multiple ways, but none of them seem optimal and in the end I've always lost out on the connectivity between entities.
What I need to achieve is some kind of meaningful relationship between the following:
A toy (with some trivial properties).
A series of toys (multiple toys can be related to one series and a toy can only belong to one series).
Measurement steps. There are currently 6 of these steps. Each step has its own input parameters and these vary in type as well as in number (eg. only 3 parameters for measurement step 1 and 10 parameters for measurement step 2).
With each series, a sequence of these measurement steps is defined. Duplicates of tests are allowed (eg. measurement step 1 > measurement step 4 > measurement step 1 is a valid sequence). The sequence along with the parameters must be stored somewhere for future reference.
Each toy goes through the sequence of measurements that is defined by its series. All of the results must be stored somewhere (for each individual toy).
If I split the measurement steps into their own tables I can't reference them conditionally (as foreign keys) to some other table.
If I try to serialize part of the data I lose the ability to make connections between individual measurement steps, measurement results (at least with queries) etc.
I know people here generally hate/don't answer these kinds of "discussion-like" questions, but I'd ask of you to at least point out what is a good practice in a system where I need to store this locally on a machine, but need a database to hold the data - to move towards serial-like data and just do general relationships where it is easy to do so or keep trying to normalize it as much as possible?
If measurements steps share most of attributes (or are of the same type, like what you called PARAMETERS), and I understood correctly your definitions, I would make something like this.
It could be a starting point.
+----------------------------+ +------------------------------+
| TOYS | | TOY_SERIES |
+-----+----------------------+ +---------+--------------------+
| PK | ID_TOY | +----------+ PK, FK1 | ID_S +--------+
| | | | +------------------------------+ |
| FK1 | ID_S +---------+ | | ... | |
+----------------------------+ | | | |
| | ... | | | | |
| | | | | | |
+-----+----------------------+ +---------+--------------------+ |
|
|
|
|
+------------------------------+ |
| BR_SER_MEAS | |
+---------+--------------------+ |
| PK, FK1 | ID_S +--------+
| | |
| PK, FK2 | ID_M +--------+
| | | |
| PK | ID_SEQ | |
| | | |
+---------+--------------------+ |
|
|
+------------------------------+ |
| MEASURE_STEPS | |
+------------------------------+ |
| PK ID_M +--------+
+------------------------------+
| PARAM_01 |
| ... |
| PARAM_10 |
| |
| |
+------------------------------+

How can we validate tabular data in robot framework?

In Cucumber, we can directly validate the database table content in tabular format by mentioning the values in below format:
| Type | Code | Amount |
| A | HIGH | 27.72 |
| B | LOW | 9.28 |
| C | LOW | 4.43 |
Do we have something similar in Robot Framework. I need to run a query on the DB and the output looks like the above given table.
No, there is nothing built in to do exactly what you say. However, it's fairly straight-forward to write a keyword that takes a table of data and compares it to another table of data.
For example, you could write a keyword that takes the result of the query and then rows of information (though, the rows must all have exactly the same number of columns):
| | ${ResultOfQuery}= | <do the database query>
| | Database should contain | ${ResultOfQuery}
| | ... | #Type | Code | Amount
| | ... | A | HIGH | 27.72
| | ... | B | LOW | 9.28
| | ... | C | LOW | 4.43
Then it's just a matter of iterating over all of the arguments three at a time, and checking if the data has that value. It would look something like this:
**** Keywords ***
| Database should contain
| | [Arguments] | ${actual} | #{expected}
| | :FOR | ${type} | ${code} | ${amount} | IN | #{expected}
| | | <verify that the values are in ${actual}>
Even easier might be to write a python-based keyword, which makes it a bit easier to iterate over datasets.

What database technologies should I consider for building a scalable "running average" view?

We are working on an application where millions of users will be entering information at the same time. Suppose the application allows people to rate geographic regions on where they would like to live. Each participant is allowed to rate each region using a decimal value from 0-10. Each person belongs to one or more groups based upon attributes such as gender, and people that consider themselves active, or enjoy culture.
Every time a rating is made, we need to have a view which shows us the average rating for each region/group. I'm aware that most DB's have an "average" function, but for our purposes we need to be able to use our own function as we may use a the geometric mean instead of the arithmetic mean.
Below are some tables which might be used. Note: I did not include the relationship table PeopleGroups which map which groups a person is a member of for brevity purposes.
Regions People Groups RegionScoresByPerson
+-----+------------+ +-----+-------+ +-----+----------+ +-----+-----+-------+
| RID | NAME | | PID | Name | | GID | Name | | RID | PID | Score |
+-----+------------+ +-----+-------+ +-----+----------+ +-----+-----+-------+
| 1 | Flordia | | P1 | Alice | | G0 | Everyone | | 1 | P1 | 6 |
| 2 | California | | P2 | Bob | | G1 | Women | | 1 | P2 | 8 |
+-----+------------+ | P3 | Frank | | G2 | Men | | 1 | P3 | 3 |
| P4 | Mary | | G3 | Active | | 1 | P4 | 2 |
+-----+-------+ | G4 | Culture | | 1 | P1 | 7 |
+-----+----------+ | 1 | P2 | 5 |
| 1 | P3 | 8 |
| 1 | P4 | 2 |
+-----+-----+-------+
Our current implementation uses a similar set of tables for storing ratings, but we don't calculate averages real-time. Anytime we need the results (e.g. show me the average score California for women), we have to pull all the information into memory and run the calculations manually.
I was wondering how I leverage database technologies such as views, triggers, stored procedures, etc. to present to me a simple table that will allow me to get scores by for people and groups so we don't have to manually run calculations.
I would like some table like the following, where everything is handled by the DB. Any insert,update,delete actions on the RegionScoresByPerson or Groups tables would automatically be reflected in this table. If it is not apparent, the rows marked with * calculated rows. In this case I'm using a simple arithmetic average, but I the design should allow for any type of function.
EID stands for entity ID (a person or group)
Besides deciding how to build such a view, I'm unsure of what sort of datatypes to use (and index) for People and Groups. I suppose I'd like the index to be integers, but that would prevent me from creating the table below because I couldn't distinguish between Person 1 and Group 1 -- Would having ID's such as P1 and G1 be a performance hit? I'm obviously concerned about the design being scalable.
ScoreView
+-----------+-----+-------+
| RID | EID | Score |
| 1 | P1 | 6 |
| 1 | P2 | 8 |
| 1 | P3 | 3 |
| 1 | P4 | 2 |
| 1 | P1 | 7 |
| 1 | P2 | 5 |
| 1 | P3 | 8 |
| 1 | P4 | 2 |
| 1 | G0 | 4.75 |*
| 1 | G1 | 4 |*
| 1 | G2 | … |*
| 1 | G3 | … |*
+-----------+-----+-------+
Apache Flume is the open source tool designed to solve this kind of problem. Also have a look at Google Cloud Dataflow.
https://flume.apache.org/

Difficult kind of Hierachical Data in Relational Database

I have "components" which can be assembled in different ways into a "system". I want my database to hold all these "components", their type specific data and define how they are connected to each other to form a "system".
The systems are typically gearboxes and they can have rather complex branched designs. Let's start with an easy example:
This system is built up out of Masses (horizontal lines) and Stiffnesses (vertical lines). Gears and clutches are types of masses and come in pairs. Colors represent different branch speeds due to gear ratios. Here's a (bad) example of how I could store everything from this particular illustration:
ID | Type | Clutch | Ends | DrivenBy | NoOfTeeth| Mass | Stiffness
--- | ---- | ------ | ---- | --------- | -------- | ---- | ---------
1 | Mass | | Input1 | | | 5 |
2 | Stiffness | | | | | | 15
3 | Mass | 1.1 | | | | 2 |
4 | Mass | 1.2 | | | | 3 |
5 | Stiffness | | | | | | 20
6 | Gear | | | | 10 | 4 |
7 | Stiffness | | | | | | 30
8 | Gear | | | | 4 | 5 |
9 | Gear | | | 8 | 7 | 2 |
10 | Stiffness | | | | | | 40
11 | Mass | | | | | 4 |
12 | Stiffness | | Output1 | | | | 10
13 | Gear | | | 6 | 5 | 4 |
14 | Stiffness | | | | | | 20
15 | Mass | 2.1 | | | | 4 |
16 | Mass | 2.2 | | | | 3
17 | Stiffness | | | | | | 30
18 | Mass | | Output2 | | | 2 |
Obviously, this is not a very good way to store the data. This design pattern resembles somewhat of a "Repeated attributes" since each component type has a different attribute to be filled. I could create a table for each type of component, but things become more complex when looking at other examples, such as this 2-stage gearbox:
There are also examples with more than 1 input and several outputs, but I can't post more links due to low reputation.
Eitherway, you will see that the usual hierarchical data storage doesn't apply here because the data is not purely "tree-shaped" where everything branches off from 1 main branch.
I think that even though I could store data in the above mentioned way, I will get huge difficulties when it comes to the programming stage.
To add to the complexity, these gearboxes are actually sub-systems to a much bigger system.
So, any suggestions on a good way to store this type of data?*
Perhaps this is a possible way of doing it?
Here you will see that there is a "main" table called GearboxBranch, keeping track of all elements in the gearbox, giving them an id and to identify in which branch the element exists.
Then for the elements themselves, masses are defined in their dedicated table, so are stiffnesses. Gears and Clutches (which are types of masses) are then defined in their perspective tables. A recursive relationship is existing in the gear table, since one gear has to be driven by at least one other gear.
Furthermore, the table with Shaft Ends defines which of the elements in the gearbox are input or output and what number they have.
I can't seem to see any problems with this method, but I'm a little unsure how to get data out of the database. There will be considerable coding involved I'm afraid.

Resources