BI - fact table design with incompatible grains - database

I'm quite new to designing BI databases, and there is a point I don't understand well.
I'm trying to import French census data, which gives the population of each city. For each city, the population is broken down by several age classifications that can't be mapped to one another.
For instance, let's say that one classification is 00 to 20 years old, 21 to 59, and 60+.
The other is much more precise: 00 to 02, 03 to 05, etc., but its bounds never line up with those of the first classification: I don't have 15 to 20 but 18 to 22, for example.
So those two classifications are incompatible. How can I use them in my fact table? Should I use two fact tables and two cubes? Should I use one fact table and two dimensions in one cube? But in that case, won't facts be double counted when I sum to get the total population of a city?
This is national census data, and national classifications, so changing that or estimating population to mix those classifications is not an option. And to be clear, one row doesn't relate to one person, but to one city. My facts are not individuals but cities' populations.
So the table looks like this:
Line 1: one city - one population amount - one code for the age dimension (e.g. 00 to 19 years old) of this population - one code (m/f) for the gender dimension of this population - date of the census
Line 2: same city - one population amount - one code for the age dimension (e.g. 20 to 34) of this population - one code (m/f) for the gender dimension - date of the census
And so on for a lot of cities, both genders, and multiple years.
I hope this question is clear enough, as English is not my native language and I'm quite new to databases and BI!
Thanks for helping me with that.

One possible solution using a single fact table and two dimensions for the age ranges:
1 - Categorical range based on the broadest census, for example:
Young 0-20
Adult 21-59
Senior 60+
You could then link the other census to this dimension with approximate values, for example 18-22 could be Young.
2 - Original age range. This dimension can be used for precise age ranges when you report on a single city; it can also help you evaluate the impact of the overlapping bounds (e.g. how many rows are in the Young / 18-22 range?)
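A minimal sketch of this idea, under stated assumptions: the `scheme` column, the midpoint rule used to assign a fine range to a broad category, and all population figures are hypothetical and only illustrate how one fact table can carry both classifications without double counting.

```python
from collections import defaultdict

# Broad categories from the first classification; None marks an open upper bound.
BROAD = [("Young", 0, 20), ("Adult", 21, 59), ("Senior", 60, None)]

def broad_category(age_low, age_high):
    """Approximate mapping (an assumption): pick the broad category that
    contains the midpoint of the range. age_high=None means e.g. 60+."""
    mid = age_low if age_high is None else (age_low + age_high) / 2
    for name, low, high in BROAD:
        if mid >= low and (high is None or mid < high + 1):
            return name
    raise ValueError(f"no category for {age_low}-{age_high}")

# Hypothetical fact rows: (city, scheme, age_low, age_high, population).
# "scheme" records which classification a row belongs to.
facts = [
    ("CityA", "broad", 0, 20, 300),
    ("CityA", "broad", 21, 59, 500),
    ("CityA", "broad", 60, None, 200),
    ("CityA", "fine", 0, 17, 250),
    ("CityA", "fine", 18, 22, 80),
    ("CityA", "fine", 23, 59, 470),
    ("CityA", "fine", 60, None, 200),
]

def population_by_category(facts, city, scheme):
    """Sum population per broad category using rows of ONE scheme only,
    so the same people are never counted twice."""
    totals = defaultdict(int)
    for row_city, row_scheme, low, high, population in facts:
        if row_city == city and row_scheme == scheme:
            totals[broad_category(low, high)] += population
    return dict(totals)

fine_totals = population_by_category(facts, "CityA", "fine")
broad_totals = population_by_category(facts, "CityA", "broad")
```

Any total for a city has to filter on a single scheme; summing across both schemes is exactly the double counting the question worries about. The per-category figures of the two schemes differ because the 18-22 range is only approximately "Young".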

You can create one dimension as below:
Young 1-20
Adult 21-59
Senior 60+
The classification per city is:
young_city_1: 1-20
young_city_2: 4-23
The table then has one row per year of age:
id  field1  field2        field3        field4 ...
1   1 year  young_city_1  other         ...
2   2 year  young_city_1  other         ...
3   3 year  young_city_1  other         ...
4   4 year  young_city_1  young_city_2  ...
Now you can report on any item and with any division.
I hope this helps.


automatic graphs

I have a Stata dataset of countries and the regions they belong to. I want to create a graph of one variable (varA) by year for each region. In each graph I want a line of varA for every country belonging to that region.
I want to do this automatically because there are a lot of countries (but only five regions).
How can I do this?
The dataset looks like this:
region varA
country 1 year 1 1
country 1 year 2 1
country 2 year 1 2
country 2 year 2 2
country 3 year 1 2
country 3 year 2 3
I tried using the foreach command to loop over the creation of the graphs, but I cannot figure out how to put the varA line series of every country in a region into that region's graph.
You don't show your code but I don't see that you need a loop here. The essence of what you want may be a line graph using by() to specify different regions.
This is an example you can run in your Stata. Some of the details won't necessarily apply to your dataset. company corresponds to country and group to region and invest to varA.
webuse grunfeld, clear
gen group = mod(company, 2)
xtset company year
line invest year, c(L) by(group, imargin(medsmall) note("")) ysc(log) yla(1 10 100 1000, ang(h)) xtitle("")

Star Schema - Unify data with varying structure from different sources

I am currently designing a star schema for a reporting database where an online product's performance is measured. The challenge is that I receive information which in principle measures the same facts (visits, purchases) and has the same dimensions (user gender, user age, day), but with varying granularity depending on the source. For example, given a total of 10 visits:
Source A returns a single line per day for the performance in the format:
Visits, Purchases, Gender, Age Range, Day (total visits = 15)
Source B returns two lines for a single day as it does not allow the combination of gender and age:
Visits, Purchases, Gender, Day (total visits = 10)
Visits, Purchases, Age Range, Day (total visits = 10)
The issue is that if I store them in the same fact table, I will get incorrect values when applying aggregate functions:
Day         Visits  Age    Gender  Source
19/04/2022  5       18-24  Male    A
19/04/2022  10      18-24  Female  A
19/04/2022  2       NULL   Male    B
19/04/2022  8       NULL   Female  B
19/04/2022  10      18-24  NULL    B
(The sum of the visits column would count 20 for source B even though we only have 10 visits for this source, they just appear double due to the different data structure)
Is there a best practice for cases where dimensions and facts are generally the same, but the raw data granularity is different?
You typically can only present the combined data at a grain that's compatible with all the sources, so (Day), (Age,Day), or (Gender,Day).
Alternatively you could "allocate" the Source B data, say applying the gender split for the day to each age group. The totals would work, but the drilldown wouldn't be meaningful.
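Both options can be sketched in Python, using the rows from the question's table. This is only an illustration under assumptions: the helper names `breakdown`, `visits_by` and `allocate_by_gender` are hypothetical, and `None` stands for NULL.

```python
from collections import defaultdict

DIMS = ("age", "gender")

rows = [
    {"day": "19/04/2022", "visits": 5, "age": "18-24", "gender": "Male", "source": "A"},
    {"day": "19/04/2022", "visits": 10, "age": "18-24", "gender": "Female", "source": "A"},
    {"day": "19/04/2022", "visits": 2, "age": None, "gender": "Male", "source": "B"},
    {"day": "19/04/2022", "visits": 8, "age": None, "gender": "Female", "source": "B"},
    {"day": "19/04/2022", "visits": 10, "age": "18-24", "gender": None, "source": "B"},
]

def breakdown(row):
    """The set of dimensions this row is actually broken down by."""
    return frozenset(d for d in DIMS if row[d] is not None)

def visits_by(rows, dims):
    """Sum visits at grain (day, *dims). For each source, use exactly one
    breakdown that contains the requested dimensions, so that the same
    visits are never counted twice."""
    wanted = frozenset(dims)
    chosen = {}
    for r in rows:
        if wanted <= breakdown(r):
            chosen.setdefault(r["source"], breakdown(r))
    totals = defaultdict(int)
    for r in rows:
        if chosen.get(r["source"]) == breakdown(r):
            totals[(r["day"],) + tuple(r[d] for d in dims)] += r["visits"]
    return dict(totals)

def allocate_by_gender(rows, source):
    """Split each age-only row of `source` by that day's gender shares.
    Assumes the gender rows of that day have a non-zero total."""
    out = []
    for a in rows:
        if a["source"] != source or breakdown(a) != {"age"}:
            continue
        genders = [g for g in rows if g["source"] == source
                   and g["day"] == a["day"] and breakdown(g) == {"gender"}]
        total = sum(g["visits"] for g in genders)
        for g in genders:
            out.append({"day": a["day"], "age": a["age"], "gender": g["gender"],
                        "visits": a["visits"] * g["visits"] / total})
    return out

per_day = visits_by(rows, [])
per_gender = visits_by(rows, ["gender"])
per_age_gender = visits_by(rows, ["age", "gender"])
allocated = allocate_by_gender(rows, "B")
```

At the (Age, Gender, Day) grain only Source A contributes, which is why that grain is not compatible with all sources. The allocated rows sum to the correct total, but the age-by-gender split for Source B is an estimate, not a measurement.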

Is there a way to consolidate multiple formulas into one

all:
I am trying to design a shared worksheet that measures salespeople's performance over a period of time. In addition to calculating # of units, sales price, and profit, I am trying to calculate how many new accounts were sold in the month (ideally, I'd like to be able to change the timeframe so I can calculate larger time periods like quarter, year, etc.).
In essence, I want to find out if a customer was sold to in the 12 months before the present month, and if not, that I will see the customer number and the salesperson who sold them.
So far, I was able to calculate that by adding three columns that each calculate a part of the process (see screenshot below):
Column H (SoldLastYear) - Shows customers that were sold in the year before this current month: =IF(AND(B2>=(TODAY()-365),B2<(TODAY()-DAY(TODAY())+1)),D2,"")
Column I (SoldNow) - Shows the customers that were sold this month, and if they are NOT found in column H, show "New Cust": =IFNA(IF(B2>TODAY()-DAY(TODAY()),VLOOKUP(D2,H:H,1,FALSE),""),"New Cust")
Column J (NewCust) - If Column I shows "New Cust", show me the customer number: =IF(I2="New Cust",D2,"")
Column K (SalesName) - if Column I shows "New Cust", show me the salesperson name: =IF(I2="New Cust",C2,"")
Does anyone have an idea how I can make this more efficient? Could an array formula work here, or will it be stuck in a loop since it's referring to other lines in the same column?
Any help would be appreciated!!
EDIT: Here is what I'm trying to achieve:
Instead of:
having Column H showing me what was sold in the 12 months before the 1st day of the current month (for today's date: 8/1/19-7/31/20);
Having Column I showing me what was sold in August 2020; and
Column I searching column H to see if that customer was sold in the timeframe specified in Column H
I want to have one column that does all three: One column that flags all sales made for the last 12 months from the beginning of the current month (so, 8/1/19 to 8/27/20), then compares sales made in current month (august) with the sales made before it, and lets me know the first time a customer shows up in current month IF it doesn't appear in the 12 months prior --> finds the new customers after a dormant period of 12 months.
I'm really just trying to find a way to make the formula better and less resource-intensive. With a large dataset, the three columns (copied a few times for different timeframes) really slow down Excel...
Here is an example of the end result:
Example of final product
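For reference, the logic described in the EDIT can be written down outside Excel as a short Python sketch. This is not an Excel formula, and it rests on assumptions: the function name `new_customers` is hypothetical, each sale is a (date, salesperson, customer) tuple mirroring columns B, C and D, and the lookback window is a calendar year before the first day of the current month rather than TODAY()-365.

```python
from datetime import date

def new_customers(sales, today):
    """Return {customer: salesperson} for customers sold in the current month
    who were NOT sold in the 12 months before the start of that month.
    The salesperson is the one on the customer's first sale of the month."""
    month_start = today.replace(day=1)
    window_start = month_start.replace(year=month_start.year - 1)

    # Customers with at least one sale in the 12 months before this month.
    sold_last_year = {customer for sale_date, _, customer in sales
                      if window_start <= sale_date < month_start}

    result = {}
    for sale_date, salesperson, customer in sorted(sales):
        in_current_month = month_start <= sale_date <= today
        if in_current_month and customer not in sold_last_year and customer not in result:
            result[customer] = salesperson
    return result

# Hypothetical sample data.
sales = [
    (date(2019, 9, 5), "Ann", "C1"),
    (date(2020, 8, 3), "Bob", "C1"),   # sold within the last year: not new
    (date(2020, 8, 10), "Ann", "C2"),  # first ever sale: new
    (date(2020, 8, 20), "Bob", "C2"),  # repeat within the month: ignored
    (date(2018, 5, 1), "Ann", "C3"),   # older than the 12-month window
    (date(2020, 8, 15), "Cat", "C3"),  # dormant customer returning: new
]
found = new_customers(sales, today=date(2020, 8, 27))
```

The set lookup replaces the per-row VLOOKUP over a whole column, which is the part that tends to get slow on large sheets.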

SSAS cube with date range records

I have to build a cube based on date range records, and not sure about the best way to proceed.
Imagine a cube of cars and warranty periods. Each car has a warranty start date and a warranty end date. There may also be extended warranty periods, so imagine:
CAR REG  TYPE      WARRANTY START  WARRANTY END
CAR A    PURCHASE  01/01/2016      31/01/2016
CAR A    EXTENDED  01/01/2017      30/06/2017
CAR A    EXTENDED  01/08/2017      30/01/2018  -- note, gap here
CAR B    PURCHASE  01/01/2016      31/01/2016
CAR B    EXTENDED  01/01/2017      30/06/2017
CAR B    EXTENDED  01/08/2017      30/01/2018  -- note, gap here
So multiple items, with multiple date ranges. There is a main table (CARS) with car details (colour, model, etc).
Now I want to build a cube, which is reportable at month level, cars under warranty/warranty type, etc.
So plan 1 was to build a view which explodes the above out by a join to a date table, report by month, and feed this into a cube. But the number of cars multiplied by the months covered leads to hundreds of millions of rows, which means the server sometimes runs out of TempDB space, and when it does run, the cube takes hours to build.
Is there a better way - such as a view for the car details, and then another view on the warranty table (how do I get SSAS to deal with months in a date range) - will the join in SSAS be more efficient than a join in a view in SQL?
Thanks all.
You can connect the start and end columns to the time dimension. In the report you can then use the ":" (range) operator to build a date range report.
You will find more details here: http://www.purplefrogsystems.com/blog/2013/04/mdx-between-start-date-and-end-date/
One approach which will work with drag-and-drop client tools like Excel or Power BI would be a many-to-many Date dimension. Since car A and B match, let's pretend there's a car C which has a warranty from 2015-07-30 to 2015-12-31.
Create a DimWarrantyDateRange dimension in which each row represents a unique combination of dates during which a warranty is active. The surrogate key is WarrantyDateRangeKey. Certainly the ETL that builds this table will be a bit expensive, but given the size of your data I think it's a worthwhile investment that will produce better query performance than if your m2m bridge table has one row per active day per car.
Each car should be assigned one WarrantyDateRangeKey. Add the WarrantyDateRangeKey column to your fact tables...
CAR REG WarrantyDateRangeKey
A 1
B 1
C 2
m2mWarrantyDateRange
WarrantyDateRangeKey DateKey
1 20160101
1 20160102
1 ...
1 20170629
1 20170630
1 20170801
1 20170802
1 ...
1 20180129
1 20180130
2 20150701
2 20150702
2 ...
2 20151230
2 20151231
The tables relate together as follows...
FactTable -> DimWarrantyDateRange <- m2mWarrantyDateRange -> DimDate
Then in the cube, DimWarrantyDateRange should be a dimension and m2mWarrantyDateRange should be a measure group with a count measure. DimDate should be a dimension. Then relate DimDate to FactTable as a many-to-many (m2m) dimension, using m2mWarrantyDateRange as the intermediate measure group.
Now in Excel or Power BI you should be able to filter to a particular date and it will filter down to the cars which had an active warranty on that day.
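The ETL step that assigns keys and fills the bridge table could look roughly like the following Python sketch. It is only an illustration under assumptions: the function names are hypothetical, the input is an in-memory dict rather than the real warranty table, cars A and B use the ranges from the question, and car C uses the range given in the prose above.

```python
from datetime import date, timedelta

def build_warranty_bridge(warranties):
    """warranties: {car: [(start, end), ...]} with inclusive dates.
    Returns (car_keys, bridge): car_keys maps car -> WarrantyDateRangeKey,
    bridge is a list of (WarrantyDateRangeKey, DateKey) rows."""
    key_by_ranges = {}
    car_keys = {}
    bridge = []
    for car, ranges in warranties.items():
        signature = tuple(sorted(ranges))  # identical range sets share one key
        if signature not in key_by_ranges:
            key = len(key_by_ranges) + 1
            key_by_ranges[signature] = key
            active_days = set()
            for start, end in signature:
                day = start
                while day <= end:
                    active_days.add(day)
                    day += timedelta(days=1)
            for day in sorted(active_days):
                bridge.append((key, int(day.strftime("%Y%m%d"))))
        car_keys[car] = key_by_ranges[signature]
    return car_keys, bridge

ranges_ab = [
    (date(2016, 1, 1), date(2016, 1, 31)),
    (date(2017, 1, 1), date(2017, 6, 30)),
    (date(2017, 8, 1), date(2018, 1, 30)),
]
warranties = {
    "A": ranges_ab,
    "B": list(ranges_ab),
    "C": [(date(2015, 7, 30), date(2015, 12, 31))],
}
car_keys, bridge = build_warranty_bridge(warranties)

def cars_active_on(date_key):
    """Cars whose date-range key has a bridge row for the given DateKey."""
    keys = {key for key, day_key in bridge if day_key == date_key}
    return sorted(car for car, key in car_keys.items() if key in keys)
```

Because A and B share one key, their active days are stored once instead of once per car, which is where the saving over a per-car, per-day bridge comes from.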

Correct Database Architecture

I don't know how to design my MySQL web database for a shop.
The scenario is a site selling guided tours.
Each tour can be either a Private, a Semi-Private or a Group tour. The price per person changes per tour type. But for Private tours, the price per person also varies with the number of persons, and it varies by different amounts depending on the tour. How would I create a 'Tour/Product' record?
e.g. Let's say:
Tour of Vatican (tour has various bits of data - name, description, meeting point, duration, etc). Semi-Private tour costs 50 euro per person. Group tour costs 45 euro per person. Private tour costs (140 euro for 1-2 people), or 180 euro for 3 people, or 200 euros for 4 people, or 225 euros for 5 people or 240 euro for 6 people or for 7 people or more it costs 43 euro per person.
HOWEVER for the Tour of Coliseum (tour has same bits of data - name, description, meeting point, duration, etc), Semi Private costs 40 per person. Group costs 25 per person. Private tour costs (100 euro for 1-2 people), or 135 euro for 3 people, or 160 euros for 4 people, or 175 euros for 5 people or 180 euro for 6 people or for 7 people or more it costs 25 euro per person.
How would I structure the data in the database - 2 tables? 3 tables?
Totally confused....
Thanks
Tom
From what I understand from your post, the price depends on three different things:
The tour: the price for the Tour of Vatican differs from the price for the Tour of Coliseum.
The type of the tour: Private, Semi-Private or Group.
The number of persons on the tour.
Since there is no constant price per person for any of the given options, I would go for a three-table approach.
The diagram is shown in the picture below and works under the following assumptions:
There are three tables: Tour (containing the description of each individual tour), TourPriceOptions (containing the individual price option records) and TourType (which will always contain just three records: Private, Semi-Private and Group Tour).
There are just two assumptions that you have to make:
A tour can have multiple price options (1 to many relationship)
A price option has exactly one tour type (many to 1 relationship, since several price options can share the same type)
How to code this up:
Whenever the administrator of the store creates another tour in the backend of the store, he should be able to add multiple price options. In order to do this you will need to:
Create a new tour: a function which inserts in to the database a new entry in the tour table.
Get the id of the newly created tour: if only one person adds information at any given time, a function that returns the id of the most recently added tour is a safe bet.
Add pricing options based on the id_tour: insert a new price option based on the id_tour variable. Remember to assign a tour_type from one of the already predefined categories.
Whenever you want to return these values, just write a query that allows you to retrieve information based on the tour the user is currently browsing.
Additional things to research: Dynamic Forms - They will help you when you don't know how many price options an admin might want to add for a specific tour
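A minimal sketch of the three-table approach, using Python's built-in sqlite3 as a stand-in for MySQL so that it runs without a server. The column names `min_persons`, `max_persons` and `per_person`, and the helper names `add_tour` and `quote`, are assumptions made for illustration; the prices are the Tour of Vatican figures from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tour (id_tour INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tour_type (id_tour_type INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE tour_price_option (
    id_price_option INTEGER PRIMARY KEY,
    id_tour INTEGER NOT NULL REFERENCES tour(id_tour),
    id_tour_type INTEGER NOT NULL REFERENCES tour_type(id_tour_type),
    min_persons INTEGER NOT NULL,
    max_persons INTEGER,          -- NULL means no upper bound
    price REAL NOT NULL,
    per_person INTEGER NOT NULL   -- 1: price per person, 0: flat price for the party
);
INSERT INTO tour_type (name) VALUES ('Private'), ('Semi-Private'), ('Group');
""")

def add_tour(name, options):
    """Insert a tour, read back its new id, then insert its price options."""
    cur = conn.execute("INSERT INTO tour (name) VALUES (?)", (name,))
    id_tour = cur.lastrowid  # id of the row just inserted on this connection
    for type_name, low, high, price, per_person in options:
        conn.execute(
            "INSERT INTO tour_price_option "
            "(id_tour, id_tour_type, min_persons, max_persons, price, per_person) "
            "SELECT ?, id_tour_type, ?, ?, ?, ? FROM tour_type WHERE name = ?",
            (id_tour, low, high, price, per_person, type_name))
    return id_tour

def quote(id_tour, type_name, persons):
    """Total price for a party of `persons` on the given tour and tour type."""
    row = conn.execute(
        "SELECT p.price, p.per_person FROM tour_price_option p "
        "JOIN tour_type t ON t.id_tour_type = p.id_tour_type "
        "WHERE p.id_tour = ? AND t.name = ? AND p.min_persons <= ? "
        "AND (p.max_persons IS NULL OR p.max_persons >= ?)",
        (id_tour, type_name, persons, persons)).fetchone()
    if row is None:
        raise LookupError("no price option matches")
    price, per_person = row
    return price * persons if per_person else price

vatican = add_tour("Tour of Vatican", [
    ("Semi-Private", 1, None, 50, 1),
    ("Group", 1, None, 45, 1),
    ("Private", 1, 2, 140, 0),
    ("Private", 3, 3, 180, 0),
    ("Private", 4, 4, 200, 0),
    ("Private", 5, 5, 225, 0),
    ("Private", 6, 6, 240, 0),
    ("Private", 7, None, 43, 1),
])
```

Reading the new id from the cursor of the insert itself avoids relying on "the latest added tour", which can be wrong when two administrators add tours at the same time.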
