Star schema, normalized dimensions, denormalized hierarchy level keys - data-modeling

Given the following star schema tables: a fact table with two dimensions and two measures.
#    geog_abb  time_date   amount  value
# 1: AL        2013-03-26  55.57   9113.3898
# 2: CO        2011-06-28  19.25   9846.6468
# 3: MI        2012-05-15  94.87   4762.5398
# 4: SC        2013-01-22  29.84    649.7681
# 5: ND        2014-12-03  37.05   6419.0224
The geography dimension has a single hierarchy with 3 levels.
#    geog_abb  geog_name   geog_division_name  geog_region_name
# 1: AK        Alaska      Pacific             West
# 2: AL        Alabama     East South Central  South
# 3: AR        Arkansas    West South Central  South
# 4: AZ        Arizona     Mountain            West
# 5: CA        California  Pacific             West
The time dimension has two hierarchies with 4 levels in each.
#    time_date   time_weekday  time_week  time_month  time_month_name  time_quarter  time_quarter_name  time_year
# 1: 2010-01-01  Friday        1          1           January          1             Q1                 2010
# 2: 2010-01-02  Saturday      1          1           January          1             Q1                 2010
# 3: 2010-01-03  Sunday        1          1           January          1             Q1                 2010
# 4: 2010-01-04  Monday        1          1           January          1             Q1                 2010
# 5: 2010-01-05  Tuesday       1          1           January          1             Q1                 2010
The examples are stripped of surrogate keys to improve readability. In the results there are hierarchy levels with no attributes beyond their key; don't let that bother you, they are still levels in a hierarchy.
In a star schema it is expressed as:

GEOGRAPHY (all fields)
        /
       /
   FACT
       \
        \
         TIME (all fields)
In a snowflake schema it is expressed as:

geog_region_name
        |
geog_division_name
        |
geog_abb (+ geog_name)
        /
       /
   FACT
       \
        \
     time_date
         |
hierarchies:  |
    weekly   / \   monthly
            /   \
           /     \
time_weekday    time_month (+ time_month_name)
     |               |
     |               |
 time_week      time_quarter (+ time_quarter_name)
     |               |
     |               |
 time_year       time_year
What would you call the following schema?
Does it have any specific name? Starflake? :)
    |>-- geog_region_name
    |
    |>-- geog_division_name
    |
    |>-- geog_abb (+ geog_name)
    |
    |
geography base
        /
       /
   FACT
       \
        \
    time base
        |
        |
        |>-- time_date
        |
        |>-- time_weekday
        |
        |>-- time_week
        |
        |>-- time_month (+ time_month_name)
        |
        |>-- time_quarter (+ time_quarter_name)
        |
        |>-- time_year
It basically has a dimension base table storing the keys of every level of every hierarchy within a dimension. There is no need for a recursive walk through the snowflake's levels, so potentially fewer joins. The data is still well normalized; only the keys are denormalized into the base table. All levels from all hierarchies are tied to the lowest-grain key of the dimension in the dimension base.
Additionally, having a dimension base table makes it possible to handle time-variant attributes and temporal queries in that table alone, at the granularity of a hierarchy level.
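As a sketch of what that could look like (pandas; the validity columns and the "Pacific Northwest" division change are invented purely for illustration):

import pandas as pd

# hypothetical time-variant geography base: the state -> division
# assignment carries a validity interval; the level tables themselves
# stay unchanged
geog_base = pd.DataFrame({
    'geog_abb':           ['AK', 'AK'],
    'geog_division_name': ['Pacific', 'Pacific Northwest'],  # invented change
    'geog_region_name':   ['West', 'West'],
    'valid_from':         ['1959-01-03', '2020-01-01'],
    'valid_to':           ['2019-12-31', '9999-12-31'],
})

# as-at query: which division did AK belong to on a given date?
asat = '2015-06-01'
print(geog_base[(geog_base['valid_from'] <= asat)
                & (asat <= geog_base['valid_to'])])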
Here is the tabular representation.
Still on natural keys!
fact
#    geog_abb  time_date   amount  value
# 1: AK        2010-01-01  154.43  12395.472
# 2: AK        2010-01-02   88.89   6257.639
# 3: AK        2010-01-03   81.74   7193.075
# 4: AK        2010-01-04  165.87  11150.619
# 5: AK        2010-01-05    8.75   6953.055
time dimension base
#    time_date   time_year  time_quarter  time_month  time_week  time_weekday
# 1: 2010-01-01  2010       1             1           1          Friday
# 2: 2010-01-02  2010       1             1           1          Saturday
# 3: 2010-01-03  2010       1             1           1          Sunday
# 4: 2010-01-04  2010       1             1           1          Monday
# 5: 2010-01-05  2010       1             1           1          Tuesday
time dimension normalization to hierarchy levels
# time_year
# 1: 2010
# 2: 2011
# 3: 2012
# 4: 2013
# 5: 2014

#    time_quarter  time_quarter_name
# 1: 1             Q1
# 2: 2             Q2
# 3: 3             Q3
# 4: 4             Q4

#    time_month  time_month_name
# 1: 1           January
# 2: 2           February
# 3: 3           March
# 4: 4           April
# 5: 5           May

# time_week
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5

# time_weekday
# 1: Friday
# 2: Monday
# 3: Saturday
# 4: Sunday
# 5: Thursday

#    time_date   time_week  time_weekday  time_year
# 1: 2010-01-01  1          Friday        2010
# 2: 2010-01-02  1          Saturday      2010
# 3: 2010-01-03  1          Sunday        2010
# 4: 2010-01-04  1          Monday        2010
# 5: 2010-01-05  1          Tuesday       2010
geography dimension base
#    geog_abb  geog_region_name  geog_division_name
# 1: AK        West              Pacific
# 2: AL        South             East South Central
# 3: AR        South             West South Central
# 4: AZ        West              Mountain
# 5: CA        West              Pacific
geography dimension normalization to hierarchy levels
# geog_region_name
# 1: North Central
# 2: Northeast
# 3: South
# 4: West

# geog_division_name
# 1: East North Central
# 2: East South Central
# 3: Middle Atlantic
# 4: Mountain
# 5: New England

#    geog_abb  geog_name   geog_division_name  geog_region_name
# 1: AK        Alaska      Pacific             West
# 2: AL        Alabama     East South Central  South
# 3: AR        Arkansas    West South Central  South
# 4: AZ        Arizona     Mountain            West
# 5: CA        California  Pacific             West
The dimension base could also store the primary key's attributes; this would de-duplicate the dimension's lowest level but would be less consistent (the time_date levels from both hierarchies would be folded into the time dimension base table).
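To make the join pattern concrete, here is a minimal runnable sketch (SQLite via Python, data abbreviated from the examples above; table names are illustrative). Rolling up to any level is one join to the dimension base plus one join to the level table, regardless of the level's depth:

import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
CREATE TABLE fact      (geog_abb TEXT, time_date TEXT, amount REAL, value REAL);
CREATE TABLE time_base (time_date TEXT PRIMARY KEY, time_year INT,
                        time_quarter INT, time_month INT, time_week INT,
                        time_weekday TEXT);
CREATE TABLE quarter   (time_quarter INT PRIMARY KEY, time_quarter_name TEXT);
""")
con.executemany("INSERT INTO fact VALUES (?,?,?,?)",
                [('AK', '2010-01-01', 154.43, 12395.472),
                 ('AK', '2010-04-02', 88.89, 6257.639)])
con.executemany("INSERT INTO time_base VALUES (?,?,?,?,?,?)",
                [('2010-01-01', 2010, 1, 1, 1, 'Friday'),
                 ('2010-04-02', 2010, 2, 4, 13, 'Friday')])
con.executemany("INSERT INTO quarter VALUES (?,?)", [(1, 'Q1'), (2, 'Q2')])

# one hop to the base, one hop to the level -- no walk through
# week -> month -> quarter as in the snowflake
for row in con.execute("""
        SELECT q.time_quarter_name, SUM(f.amount)
        FROM fact f
        JOIN time_base tb ON tb.time_date = f.time_date
        JOIN quarter   q  ON q.time_quarter = tb.time_quarter
        GROUP BY q.time_quarter_name"""):
    print(row)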
What drawbacks would such a schema have? I am not much concerned about the speed of joins and aggregates, or about query-tool adaptability.
Does it have a name? Is it being used? If not, why not?

You are building a snowflake schema with shortcuts.
It's used, and BI tools can easily use the shortcuts.
You can also have shortcuts from a parent level of a dimension to a fact table at the child level for that dimension. It works and lets you skip a join, but you need to store an additional column in the fact table.
The only concern is data integrity: if a parent-child relationship changes, you need to update not only the child table but also all other tables where this relationship is stored.
It's not a big deal if you regenerate your dimension tables from your normalized data every time, but you need to be careful, even more so if you store a parent ID in the fact table.
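As a sketch of that regeneration step (column names taken from the question, data abbreviated), the denormalized geography base can be rebuilt from the normalized level tables with plain joins, so each parent-child relationship is maintained in exactly one place:

import pandas as pd

# normalized level tables (abbreviated from the question)
division = pd.DataFrame({
    'geog_division_name': ['Pacific', 'Mountain'],
    'geog_region_name':   ['West', 'West'],
})
state = pd.DataFrame({
    'geog_abb':           ['AK', 'AZ', 'CA'],
    'geog_name':          ['Alaska', 'Arizona', 'California'],
    'geog_division_name': ['Pacific', 'Mountain', 'Pacific'],
})

# regenerate the denormalized base; if a division moves to another
# region, only the `division` table has to change before the rebuild
geog_base = state.merge(division, on='geog_division_name')
print(geog_base[['geog_abb', 'geog_region_name', 'geog_division_name']])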

What you are doing is not a snowflake schema... it is similar to "Data Vault" and our own variation, the "Link-Model". It essentially creates link tables containing just keys, which sit between fact tables and dim tables (and other dim tables), although we describe them as entity tables and measure tables.
The advantages are:
You can load dimension and fact tables in parallel, then populate the link tables.
Complicated practices like "as-at reporting" with "adjustments", as found in insurance, can be handled quite readily.
It is more intuitive to split slowly and quickly changing attributes into separate dimension tables that are just linked by the link tables. This is a time saving.
Adding new dimensions to fact tables is fairly simple and quick; after all, it is just adding an extra integer column to a table containing only integers.
Factless facts are far more intuitive than in a conventional schema: you can create relationships between dimensions without any fact record.
The downsides are:
A slightly more complicated schema structure, so we generally create Kimball models on top of the "Link-Model", as business users tend to understand those well.
Adding a new dimension to a fact table or extending a dimension table can be done easily, but the schema can become cluttered over time.
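As a rough sketch of the link-table idea (illustrative names only, SQLite via Python; not the actual Link-Model DDL):

import sqlite3

con = sqlite3.connect(':memory:')
# dimension and fact (measure) tables hold no foreign keys themselves;
# the link table between them contains nothing but integer keys
con.executescript("""
CREATE TABLE dim_geography (geog_id INTEGER PRIMARY KEY, geog_abb TEXT);
CREATE TABLE dim_time      (time_id INTEGER PRIMARY KEY, time_date TEXT);
CREATE TABLE fact_sales    (fact_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE link_sales (
    fact_id INTEGER REFERENCES fact_sales,
    geog_id INTEGER REFERENCES dim_geography,
    time_id INTEGER REFERENCES dim_time
    -- adding a dimension later = adding one more integer column here
);
""")

A factless relationship between two dimensions is then just a link row with no fact key.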

Related

Matching employees from different school and hometown

I am new to coding. I have an employee table that looks like this:
Name     Hometown    School
-------  ----------  ---------------------------------
Jeff     Illinois    Loyola University Chicago
Alice    California  New York University
William  Michigan    University of Illinois at Chicago
Fiona    California  Loyola University Chicago
Charles  Michigan    New York University
Linda    Indiana     Loyola University Chicago
I am trying to pair employees so that the two employees in a pair come from different states and different universities. Each person can be in only one pair. The expected table should look like:
employee1  employee2
---------  ---------
Jeff       Alice
William    Fiona
Charles    Linda
The real table has over 3,000 rows. I am trying to do it with SQL or Python, but I don't know where to start.
A straightforward approach is to pick employees one by one and search the rest of the table for a suitable peer; found peers are flagged so they are not paired again. Since in your case a peer is usually found after a few steps, this iteration will likely be faster than operations that construct whole data sets at once.
from io import StringIO
import pandas as pd

# read example employee table (columns separated by two or more spaces)
df = pd.read_table(StringIO("""Name     Hometown    School
Jeff     Illinois    Loyola University Chicago
Alice    California  New York University
William  Michigan    University of Illinois at Chicago
Fiona    California  Loyola University Chicago
Charles  Michigan    New York University
Linda    Indiana     Loyola University Chicago
"""), sep=r'\s{2,}', engine='python')

# create expected table; its length is half that of the above
ef = pd.DataFrame(index=pd.RangeIndex(len(df) // 2),
                  columns=['employee1', 'employee2'])
k = 0  # number of found pairs, index into expected table

# array of flags for already paired employees
paired = pd.Series(False, pd.RangeIndex(len(df)))

# go through the employee table and collect pairs
for i in range(len(df)):
    if paired[i]:
        continue
    for j in range(i + 1, len(df)):
        if not paired[j] \
                and df.iloc[j]['Hometown'] != df.iloc[i]['Hometown'] \
                and df.iloc[j]['School'] != df.iloc[i]['School']:
            # we found a pair - store it, mark employee j paired
            ef.iloc[k] = df.iloc[[i, j]]['Name'].values
            k += 1
            paired[j] = True
            break
    else:
        print("no peer for", df.iloc[i]['Name'])

print(ef)
output:
  employee1 employee2
0      Jeff     Alice
1   William     Fiona
2   Charles     Linda

TM1 - passing data between cubes by linking a measure in the target cube to a dimension in the source cube

TM1 - linking measures to dimensions
I have two cubes in TM1, and I am trying to source data from one cube by linking a calculated 'Age' field in the target cube to an 'Age' dimension in the source cube. However, while I can do this fine by writing code in the rules editor, I cannot work out how to do it using the rules Wizard. Unfortunately, policy in my company is that all TM1 models must be based around wizard-based rules, so I am hoping someone could explain how to do this via the wizard.
Cube 1 (the source) contains data on how quickly a loan balance reduces due to customer attrition, based on the loan's age in months - it looks a bit like this:
Age | Attrition %
----|------------
  1 | 5%
  2 | 6%
  3 | 7%
Cube 2 (the target) contains a loan balance and calculates how the balance reduces over several months, based on the data in Cube 1. It looks up the data in Cube 1 based on the age calculated in the first row of Cube 2, as:
current month - start month + 1.
So if we assume the loan started in July, for August the age would be:
8 - 7 + 1 = 2 months old.
For the loan starting in July, Cube 2 would look a bit like this:
                | Jul  | Aug   | Sep   |
----------------|------|-------|-------|
Age             |    1 |     2 |     3 |
Opening Balance | $100 |   $95 |   $89 |
Attrition %     |   5% |    6% |    7% |  <-- sourced from Cube 1 on basis of Age
Attrition $     |  -$5 | -$5.7 | -$6.3 |
Closing Balance |  $95 |   $89 |   $83 |
Creating this link is trivial in Excel, but whenever I try to do it using the TM1 Rules Wizard, I run into the problem that TM1 does not seem to allow the linking of a dimension ('Age' in cube 1) to a field within a dimension ('Age' in cube 2).
Can anyone advise?

Sequences in Graph Database

All,
I am new to the graph database area and want to know whether this type of example is applicable to a graph database.
Say I am looking at a baseball game. When each player goes to bat, there are 3 possible outcomes: hit, strikeout, or walk.
For each batter, and across the baseball season, I want to figure out the counts of the sequences.
For example, for batters that went to the plate n times, how many had a particular sequence (e.g., hit/walk/strikeout or hit/hit/hit/hit), and how many of the same batters repeated the same sequence, indexed by time? To explain further, time would let me know whether a particular sequence (e.g., hit/walk/strikeout or hit/hit/hit/hit) occurred during the beginning of the season, in the middle, or in the later half.
For a key-value type database, the raw data would look as follows:
Batter   Time   Game  Event      Bat
-------  -----  ----  ---------  ---
Charles  April  1     Hit        1
Charles  April  1     strikeout  2
Charles  April  1     Walk       3
Doug     April  1     Walk       1
Doug     April  1     Hit        2
Doug     April  1     strikeout  3
Charles  April  2     strikeout  1
Charles  April  2     strikeout  2
Doug     May    5     Hit        1
Doug     May    5     Hit        2
Doug     May    5     Hit        3
Doug     May    5     Hit        4
Hence, my output would appear as follows:
Sequence                 Freq  Unique Batters  Time
-----------------------  ----  --------------  -----
hit                      5000  600             April
walk/strikeout           3000  350             April
strikeout/strikeout/hit  2000  175             April
hit/hit/hit/hit/hit      1000   80             April
hit                      6000  800             May
walk/strikeout           3500  425             May
strikeout/strikeout/hit  2750  225             May
hit/hit/hit/hit/hit      1250  120             May
...                       ...  ...             ...
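For concreteness, the aggregation itself (independent of the storage engine) can be sketched in pandas; treating each batter/month/game as one sequence is an assumption here:

import pandas as pd

# raw at-bat events (the example data above)
events = pd.DataFrame({
    'Batter': ['Charles']*3 + ['Doug']*3 + ['Charles']*2 + ['Doug']*4,
    'Time':   ['April']*8 + ['May']*4,
    'Game':   [1]*6 + [2]*2 + [5]*4,
    'Event':  ['Hit', 'strikeout', 'Walk', 'Walk', 'Hit', 'strikeout',
               'strikeout', 'strikeout', 'Hit', 'Hit', 'Hit', 'Hit'],
    'Bat':    [1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 3, 4],
})

# build one sequence per batter/month/game, ordered by at-bat number
seqs = (events.sort_values('Bat')
              .groupby(['Batter', 'Time', 'Game'])['Event']
              .agg('/'.join)
              .reset_index(name='Sequence'))

# frequency and number of distinct batters per sequence and month
out = (seqs.groupby(['Sequence', 'Time'])
           .agg(Freq=('Batter', 'size'),
                UniqueBatters=('Batter', 'nunique'))
           .reset_index())
print(out)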
If this is feasible for a graph database, would it also scale? What if, instead of 3 possible outcomes per batter, there were 10,000 potential outcomes and 10,000,000 batters?
Moreover, the 10,000 unique outcomes would be sequenced in a combinatoric setting (e.g., 10,000 choose 2, 10,000 choose 3, etc.).
My question then is: if a graph database is appropriate, how would you propose setting up a solution?
Much thanks in advance.

How to design a Db table for attendance

I am currently working on a school management system but can't seem to figure out the best way to design my student attendance table.
INFO
School runs for 14 weeks and class holds 5 times a week. There can be up to 2,000 students per term, meaning attendance records can reach 14 x 5 x 2000 = 140,000 per term.
I am developing the application for a desktop using VB.Net and MS Access.
PROGRESS SO FAR
I have so far designed something that I am skeptic about.
table name: attendance
+----+--------+----------+-----------+--------+
| id | std_id | att_week | att_date  | status |
+----+--------+----------+-----------+--------+
| 1  | 0001   | 1        | 29/9/2015 | yes    |
| 2  | 0002   | 1        | 29/9/2015 | yes    |
+----+--------+----------+-----------+--------+
I quickly realized that designing it like this can easily yield 140,000 rows in a term.
I also thought of making the weekdays column names, but that would easily result in 14 x 5 = 70 columns.
What is the best way to design this table?
Friend, I think you should construct your table like this:
The table accepts only the absentees:
id  student_id  class  date
--  ----------  -----  ----------
1   11          7a     11/11/2020
2   21          6b     10/12/2020
and so on...
You could easily retrieve details like:
1] total absentees per class
2] total absences of a student in a date range
3] a per-day attendance report for each student, prepared from this data (a minimal sketch of such a report follows below)
Also, this would be extremely fast due to the smaller number of records, especially if you index on class_id and partition the tables by date range.
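A minimal sketch of point 3 (illustrative names, pandas): everyone enrolled who is not in the absentee table on a given date counts as present.

import pandas as pd

# hypothetical enrollment table plus the absentee table proposed above
students = pd.DataFrame({'student_id': [11, 21, 31],
                         'class':      ['7a', '6b', '7a']})
absent = pd.DataFrame({'student_id': [11, 21],
                       'date':       ['11/11/2020', '10/12/2020']})

# per-day report: present = not recorded as absent on that date
day = '11/11/2020'
absent_today = set(absent.loc[absent['date'] == day, 'student_id'])
report = students.assign(present=~students['student_id'].isin(absent_today))
print(report)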
Thank You!

MS Analysis Cube - one-to-many joins

I am building an OLAP cube in MS SQL Server BI Studio. I have two main tables that contain my measures and dimensions.
One table contains
Date | Keywords | Measure1
where date-keyword is the composite key.
The other table looks like
Date | Keyword | Product | Measure2 | Measure3
where date-keyword-product is the composite key.
My problem is that there can be a one-to-many relationship between date-keyword's in the first table and date-keyword's in the second table (as the second table has data broken down by product).
I want to be able to make queries that look something like this when filtered for a given Keyword:
                          Measure1  Measure2  Measure3
======================================================
Tuesday, January 01 2013        23        19        18
======================================================
Bike                            23
Car                             23        16        13
Motorcycle                      23
Caravan                         23         2         4
Van                             23         1         1
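Outside the cube, this relationship is just a merge at the coarser grain; a pandas sketch with made-up values shows how Measure1 repeats against every product row:

import pandas as pd

# coarser grain: date-keyword
t1 = pd.DataFrame({'Date': ['2013-01-01'], 'Keyword': ['vehicles'],
                   'Measure1': [23]})
# finer grain: date-keyword-product
t2 = pd.DataFrame({'Date': ['2013-01-01'] * 3,
                   'Keyword': ['vehicles'] * 3,
                   'Product': ['Car', 'Caravan', 'Van'],
                   'Measure2': [16, 2, 1],
                   'Measure3': [13, 4, 1]})

# Measure1 repeats for every product row with the same date-keyword:
# the one-to-many relationship described above
print(t2.merge(t1, on=['Date', 'Keyword'], how='left'))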
I've created dimensions for the Date and ProductType but I'm having problems creating the dimension for the Keywords. I can create a Keyword dimension that affects the measures from the second table but not the first.
Can anyone point me to any good tutorials for doing this sort of thing?
It turns out the first table had one row with all null values (a weird side effect of uploading an Excel file straight into the MS SQL Server db). Because the value that the cube was trying to apply the dimension to was null in this one row, the whole cube build and deploy failed with no useful error messages! Grr.
