First off, I realize that narrow fact tables are the ideal situation.
I am designing a healthcare data warehouse specifically for ingestion into Power BI. The problem I'm having is that I have over 100 different metrics included in just one report. Most of the data comes from the source like this:
Hospital             HospitalID  Date      Description     Number
Children's Hospital  20192       1/2/2021  Beds Needed     8
Children's Hospital  20192       1/2/2021  Covid Patients  2
We currently use logic like this in Power BI to pull each metric out:
Beds Needed = IF(Description = "Beds Needed", Number, 0)
We do this for over 100 metrics that business leaders need. My question is, there are two ways I'm thinking of doing this:
Option 1:
We put the logic like the above into the database and have every metric be its own column.

Date      HospitalID  Beds Needed  Covid Patients
1/2/2021  20192       8            2
Option 2:
I set up the fact table like so:

Date      HospitalID  DescriptionID  Number
1/2/2021  20192       12             8
1/2/2021  20192       11             2
And then create a dimension table like so:

Description     DescriptionID
Beds Needed     12
Covid Patients  11
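To see how Option 2 behaves, here is a minimal sketch using Python's sqlite3 module. Table and column names are assumptions based on the sample data, not the actual schema; the point is that any single metric can still be pulled out of the narrow fact table with a filtered join, much like the per-metric IF logic above:

```python
import sqlite3

# Illustrative Option 2 schema: narrow fact table + description dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE FactMetrics (Date TEXT, HospitalID INTEGER,
                          DescriptionID INTEGER, Number INTEGER);
CREATE TABLE DimDescription (DescriptionID INTEGER PRIMARY KEY,
                             Description TEXT);
INSERT INTO DimDescription VALUES (12, 'Beds Needed'), (11, 'Covid Patients');
INSERT INTO FactMetrics VALUES ('1/2/2021', 20192, 12, 8),
                               ('1/2/2021', 20192, 11, 2);
""")

# Any single metric is a simple filtered join -- no per-metric column needed.
rows = con.execute("""
    SELECT f.Date, f.HospitalID, f.Number
    FROM FactMetrics f
    JOIN DimDescription d ON d.DescriptionID = f.DescriptionID
    WHERE d.Description = 'Beds Needed'
""").fetchall()
print(rows)  # [('1/2/2021', 20192, 8)]
```

Adding metric number 101 then only requires a new row in the dimension table, not a schema change.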
The tables I currently have (in the format of the first table) are each around 200k rows, and there are 4 of them. There is one table supplying metrics that is around 20 million rows.
I have two tables: 'Sales' and 'Planed sales'. They represent sales data for about 80 different products, and planned sales for those products, for every day across two months. This is how they look:
Table 'Sales'

productName  Sold  Date
product_A    10    01.01.2022.
product_B    15    01.01.2022.
product_A    11    02.01.2022.
product_B    20    02.01.2022.
And then table 'Planed sales'

productName  Planed sales  Date
product_A    10            01.01.2022.
product_B    14            01.01.2022.
product_A    9             02.01.2022.
product_B    15            02.01.2022.
Product names are the same in both tables, and so are the dates; only the 'Sold' and 'Planed sales' values differ.
I somehow need to use these two tables in a SQL query for Power Pivot in Excel while importing the data, so that I can get a table looking something like this:
productName  Planed sales 01.01.  Sold 01.01.  Planed sales 02.01.  Sold 02.01.
product_A    10                   10           9                    11
product_B    14                   15           15                   20
I have lots of different products and a lot of dates, spanning at least two months.
I tried to just import both of these tables into Excel's Power Pivot with a simple 'select *' query and then create a relationship, but I don't have unique values because productName appears for every date. The unique key here would be 'productName' + 'Date', but I don't know how to add a composite key in Excel's Power Pivot, if that is the solution. I know the data is not well normalized, but I can't change it much; it is what it is. Am I approaching this problem correctly? Is this even possible? If it is, how could I do it?
You need a Product dimension. There's a lot of help on the web on dimensional modeling for Power BI and PowerPivot. It's worth learning.
Add a table that is just a list of all the unique productNames, then add a relationship from this new table to your two existing tables.
If you have a data source with the product list, use that. If you don't, combine the two tables and remove duplicates. You can do this in either DAX or Power Query, but Power Query is the better way.
In Power Query, append the two tables (Home > Append Queries), then right-click the productName column and choose Remove Other Columns, and right-click again to Remove Duplicates.
Then go into Power Pivot to create the relationships. In your pivot, use productName from the Product dimension, not from the other two tables. Best practice is to right-click the productName column in those two tables and hide it, to make sure the correct table is used.
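If it helps to see the append-and-deduplicate steps outside of Power Query, here is a rough Python analogue (column and table names assumed from the sample tables above; this is a sketch of the idea, not the Power Query implementation itself):

```python
# Toy versions of the two source tables.
sales = [
    {"productName": "product_A", "Sold": 10, "Date": "01.01.2022."},
    {"productName": "product_B", "Sold": 15, "Date": "01.01.2022."},
]
planned = [
    {"productName": "product_A", "Planed sales": 10, "Date": "01.01.2022."},
    {"productName": "product_B", "Planed sales": 14, "Date": "01.01.2022."},
]

# Append the two tables, keep only productName, and remove duplicates,
# preserving first-seen order (dict keys keep insertion order).
names = {row["productName"]: None for row in sales + planned}
product_dim = list(names)
print(product_dim)  # ['product_A', 'product_B']
```

The resulting one-column list is exactly what the Product dimension needs: one row per unique product, suitable as the "one" side of a one-to-many relationship to each source table.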
I currently have the information on all the employees and the courses they've taken. What I need in Power BI is both who has taken the courses and who hasn't. The idea is to measure how the organization is doing with trainings, and to be able to speak with bosses to arrange trainings for the employees who haven't taken them.
This is the structure right now:
Table Employees
ID_Employee Names Age Job title Department ...
1 Robert 44 Chief Procurement Officer Procurement ...
2 James 32 Warehouse Supervisor Operations ...
3 Natalie 47 Sales Manager Commercial ...
4 John 29 Planning Executive Operations ...
5 Matthew 33 Finance Executive Finance ...
Table Courses
ID_Course Course
1 Cybersecurity
2 Effective Negotiation
3 Safety Policies
4 Performance Analysis
Table EmployeexCourses
ID_Employee ID_Course Date_Taken
1 1 20201015
1 3 20201018
2 4 20201020
3 2 20201020
3 3 20201018
4 1 20201015
What I'm not sure about is how to proceed. I can easily get who has taken each course, because that's a table itself (EmployeexCourses), and I was using a pivot table to show each employee in EmployeexCourses with all the courses as columns and their dates as values. I want to use this data in Power BI, so my question is: what should I do to have the most flexible data in Power BI? Should I build a pivot table with all the employees and all the courses (this number is small and fixed for a while)? It would be a 1500-row table, and there would be multiple rows with no values in the columns (employees who have not taken any course). I'm a bit confused about how to build this table with all employees, because I'd need to combine 'EmployeexCourses' and 'Employees' and then turn that into a pivot table (I'm sure I can, I just haven't done it). Or should I build a separate table of all employees with their missing courses?
Which option is better? Or can you think of another option? I need it to be efficient and flexible in Power BI for easy use (filter by area, by missing course, and so on). Any idea is welcome, TIA.
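One common way to produce the "employees with their missing courses" table mentioned above is a cross join of all employees and all courses, followed by an anti-join against EmployeexCourses. A minimal sketch, using only the IDs from the sample data (an assumption; the real tables have more columns):

```python
# Employee and course IDs from the sample tables.
employees = [1, 2, 3, 4, 5]
courses = [1, 2, 3, 4]

# (ID_Employee, ID_Course) pairs present in EmployeexCourses.
taken = {(1, 1), (1, 3), (2, 4), (3, 2), (3, 3), (4, 1)}

# Cross join minus taken = every (employee, course) combination still missing.
missing = [(e, c) for e in employees for c in courses if (e, c) not in taken]
print(len(missing))  # 5 employees x 4 courses = 20 pairs, 6 taken, 14 missing
```

In Power BI the same shape can be produced in Power Query or DAX; a long table of (employee, course, taken-or-not) rows tends to filter and slice more flexibly than a wide pivoted one.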
Got a question regarding a big SSAS cube:
Scenario: building a product revenue cube on SQL Server Analysis Services (SSAS) 2012:
the product hierarchy is a dimension with 7 levels: cat1, cat2, … cat7 (product_id);
the fact table is at the product ID level, key = (customer id + bill month + product id); this gives around 200M rows per month and may still grow.
Concern: the prototype currently holds only 3 months of data, and cube browsing is already slow. Dragging and dropping 2-3 dimensions can take a couple of minutes or more.
Later on, 30+ other dimensions need to be added, and the data extended to 1-2 years (2-3B rows). So we are concerned about performance: can that much data be handled with an acceptable browsing speed?
Question: Is there a better way (design, performance tuning) for the above scenario?
E.g. another idea is to make the fact table flat, i.e. at the customer level, key = (customer id + bill month), so one customer has only one row per bill month, while adding 40+ columns, one for each product. That way the fact row count per month goes down to around 5M. But then we can't build/browse the product dimension anymore (an important requirement), can we?
Thanks in advance for any light shedding.
I am in the process of designing a small database application for a health center in my local community. The health center can receive both in- and out-patients.
The one area I am not sure how best to implement is billing the patient automatically for the drugs/medication they have been given. I don't want the user to type in the name and price of the drugs given to the patient. I want to automate it with a list of all available drugs and their current prices in a table, so that the user just selects a drug from a list and the software determines the total.
I also want to maintain the history of drug prices over time, something like: drug XXX was selling at $1234 in January 2008 and $4567 in September 2008. So if I am to print a receipt for a patient who visited in 2008, the patient should be billed at the 2008 rates, not the drug's current rate.
I am just asking for some general guidance and suggestions on the best database schema for the scenario described above.
Thanks a lot.
With a table of drugprices
DrugPriceID DrugID PriceStartDate DrugPrice
1 1 1-1-2008 1234
2 1 1-9-2008 4567
which links to a table of drugs on DrugID. The price applies until it is superseded by a new price.
A table of Patients, which links to a table of PatientOrders,
PatientOrderID PatientID OrderDate
5 3 4-5-2008
and a further table of OrderDrugs
OrderDrugID PatientOrderID DrugID
6 5 1
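The "price applies until superseded" rule in this schema translates into a query: for each order line, take the latest price whose start date is on or before the order date. A minimal illustration using Python's sqlite3 module (ISO-formatted dates are assumed here purely so string comparison works; the real schema may store dates differently):

```python
import sqlite3

# Toy DrugPrices table matching the answer's sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DrugPrices (DrugPriceID INTEGER, DrugID INTEGER,
                         PriceStartDate TEXT, DrugPrice INTEGER);
INSERT INTO DrugPrices VALUES (1, 1, '2008-01-01', 1234),
                              (2, 1, '2008-09-01', 4567);
""")

# For an order placed 2008-05-04, take the latest price that had started
# on or before the order date.
price = con.execute("""
    SELECT DrugPrice FROM DrugPrices
    WHERE DrugID = 1 AND PriceStartDate <= '2008-05-04'
    ORDER BY PriceStartDate DESC
    LIMIT 1
""").fetchone()[0]
print(price)  # 1234 -- the January price still applies in May
```

Because the order date, not the current date, drives the lookup, a receipt reprinted years later still bills at the rates in force when the patient visited.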
I'm trying to create a data warehouse from which we will create all business reports. I've already learned quite a lot about this and have a general idea of how to build the data warehouse. However, I ran into a problem when I started to wonder how I could combine product and sales information from two separate OLTP databases into a single data store.
The ETL process looks like this:
1. Transfer product data from the first OLTP database into the staging table stgProducts.
2. Merge product data from stgProducts into dimProducts: if a product has changed, the record is updated; new products are added as new records.
3. Transfer product data from the second OLTP database into stgProducts.
4. Merge product data from stgProducts into dimProducts, as in step 2.
The transfer of sales data is realized similarly.
If I have one products table, how do I connect it to the sales data from two different databases?
Speaking of the two databases, I mean two different ERP systems: one manages online sales, the other handles all other sales. The SKU of a product is the same in both systems, but the product ID is different in each.
Assuming that you're using a star schema (not always the best approach, BTW), your product dimension table should have a key that is unique to the DW. Thus, SKU 186 might have a DW-specific key of 1 and SKU 294 might have a DW-specific key of 2. Your fact table, which holds 'transaction records' (sales records?), will have a compound key composed of multiple foreign key columns (eg. product_key, date_key, location_key, etc.).
The foreign key to the product table in this case points to the DW-specific product key, NOT to the source system SKU.
The ETL to populate your fact table should 'translate' the source system product key into the DW-specific product key as you bring the data in.
NOTE: This is the general approach to populating fact tables. There can be variations based on specific requirements.
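That translation step can be sketched as a simple lookup keyed by (source system, product identifier). Everything below is invented for illustration (dictionary in place of a dimension table, made-up SKUs and amounts), not the asker's actual schema:

```python
# Lookup from (source_system_id, source product identifier) to the
# DW-specific surrogate key. In a real ETL this would be a join against
# the product dimension, not an in-memory dict.
product_dim = {
    (1, "SKU186"): 1,
    (2, "SKU186"): 1,   # same SKU in both ERPs maps to ONE DW product
    (2, "SKU294"): 2,
}

# Staged sales rows: (source system, product identifier, amount).
staged_sales = [(1, "SKU186", 100.0), (2, "SKU294", 55.0)]

# Translate each staged row's source key to the DW surrogate key.
fact_rows = [(product_dim[(src, sku)], amount)
             for src, sku, amount in staged_sales]
print(fact_rows)  # [(1, 100.0), (2, 55.0)]
```

Because the shared SKU resolves both source systems to one surrogate key, sales from either ERP roll up to the same product row in the warehouse.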
Expanding Ben's answer a bit, with the caveat that there is no one right answer in data warehousing - it is an expansive, mungeable area of IT with lots of schools of thought. Here is one possible direction you can pursue.
Assumptions:
1) You have 2 separate source databases, both with 2 tables: Product and Sales.
2) The separate source databases are isolated and may have conflicting primary key data.
3) You want to version[1] both the product and the sales tables. This is an important assumption, as fact tables are mostly not updated, and the sales table sounds like it belongs as a nice static fact table. Your question is unclear on whether you are expecting changes to sales, so I will assume you are.
4) A sales record can only ever be of 1 product (this is unlikely, but your question only mentions the 2 tables so I'll work from there; a 1-many relation will involve more tweaking around the bridging table).
Warehouse Design:
You will need 3 tables with the following columns:
PRODUCT_DIM
- PRODUCT_SK (Surrogate primary key, data warehouse database generated)
- SOURCE_SYSTEM_ID (DW-specific indicator as to which source OLTP database the record was sourced from - can be a string if you like)
- PRODUCT_NK (PK of product from source system, used for SCD operations)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
- PRODUCT_NAME (Product name from source table)
- Other Columns (Any other product columns you may need)
SALES_DIM
- SALES_SK (Surrogate primary key, data warehouse database generated)
- SOURCE_SYSTEM_ID (DW-specific indicator as to which source OLTP database the record was sourced from - can be a string if you like)
- SALES_NK (PK of sales record from source system, used for SCD operations)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
- SALE_AMOUNT (Sale amount from source table)
- Other Columns (Any other sales columns you may need)
PRODUCT_SALES_BRIDGE
- PRODUCT_SK (composite primary key)
- SALES_SK (composite primary key)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
The main things to note are the identifiers in the SALES and PRODUCT dim tables.
There is a Natural Key column for storing each record's primary key value exactly as it is in the source system.
Because you have stated you have multiple source systems, the additional SOURCE_SYSTEM_ID column is required so you can match records from your multiple source systems to their equivalent records in your warehouse. Otherwise you might have a product EGGS with an ID of 13 in your first source system and a product called MILK with an ID also of 13 in your second system. Without the additional SOURCE_SYSTEM_ID you would be forever cutting records for PRODUCT_DIM natural key 13. This will look something like this in your warehouse:
PRODUCT_SK SOURCE_SYSTEM_ID PRODUCT_NK .. PRODUCT_NAME
..
14 1 13 .. EGGS
15 2 13 .. MILK
..
The bridge table exists to prevent cutting new SALES or PRODUCT records each time a related record changes. Consider a $10 sale of Red Eggs. The next day, the Red Eggs product is renamed to "Super Red Eggs". This results in a new PRODUCT record for Red Eggs in the warehouse. If the SALES table included a direct link to PRODUCT_SK, a new SALES record would have to be cut solely because there was a new product SK for our Red Eggs. The bridging table moves the referential-integrity-induced cutting of new records from the DIMENSION/FACT tables into the bridge table. This also has the added benefit of making newcomers to the data warehouse very aware that they are operating in a different mindset from the traditional RDBMS.
The two Natural Key columns should help you solve your original question; the bridge table is just personal preference, added for completeness. If you already have a DW design that works for you, stick with it.
[1] I'll use "version" to mean whatever slowly changing dimension methodology you choose. Most people cheap out and just Type 2 their entire tables "just in case".
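For completeness, the Type 2 versioning the footnote refers to can be sketched as: when an incoming record differs from the current version, close the old row (set its DATE_TO) and insert a new one. Column names follow PRODUCT_DIM above, but the logic, dates, and in-memory representation are illustrative only:

```python
from datetime import date

# Current dimension contents: one active version of product NK 13.
dim = [{"PRODUCT_SK": 14, "SOURCE_SYSTEM_ID": 1, "PRODUCT_NK": 13,
        "DATE_FROM": date(2022, 1, 1), "DATE_TO": None,
        "PRODUCT_NAME": "RED EGGS"}]

def scd2_upsert(dim, source_system_id, nk, name, as_of):
    # Find the active (DATE_TO is null) version for this source system + NK.
    current = next((r for r in dim
                    if r["SOURCE_SYSTEM_ID"] == source_system_id
                    and r["PRODUCT_NK"] == nk and r["DATE_TO"] is None), None)
    if current and current["PRODUCT_NAME"] == name:
        return                       # no change, nothing to do
    if current:
        current["DATE_TO"] = as_of   # close the old version
    dim.append({"PRODUCT_SK": max(r["PRODUCT_SK"] for r in dim) + 1,
                "SOURCE_SYSTEM_ID": source_system_id, "PRODUCT_NK": nk,
                "DATE_FROM": as_of, "DATE_TO": None, "PRODUCT_NAME": name})

# The product is renamed: the old row is closed, a new version is cut.
scd2_upsert(dim, 1, 13, "SUPER RED EGGS", date(2022, 2, 1))
print(len(dim))           # 2 -- two versions of the same natural key
print(dim[0]["DATE_TO"])  # 2022-02-01 -- old version closed
```

Note how SOURCE_SYSTEM_ID plus PRODUCT_NK, not the surrogate key, is what identifies "the same product" across versions, which is exactly why both columns appear in the dimension.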