Ideas on Populating the Fact Table in a Data Mart - database

I am looking for ideas to populate a fact table in a data mart. Let's say I have the following dimensions:
Physician
Patient
date
geo_location
patient_demography
test
I have used two ETL tools to populate the dimension tables: Pentaho and Oracle Warehouse Builder. The date, patient demography, and geo location dimensions do not pull data from the operational store. All dimension tables have their own new surrogate key.
I now want to populate the fact table with the details of a visit by a patient. When a patient visits a physician on a particular date, he orders a test. This is the info in the fact table. There are other measures too that I am omitting for simplicity.
I can create a single join against the source system that returns all the columns required in the fact table. But I need to store the keys from the dimension tables for Patient, Physician, test, etc. What is the best way to achieve this?
Can ETL tools help in this?
Thank You
Krishna

Each dimension table should have a BusinessKey that uniquely identifies the object (person, date, location) that a table row describes. While loading the fact table, you have to look up the PrimaryKey (the surrogate key) in the dimension table, based on the BusinessKey. You can choose to look up the dimension table directly, or create a key-lookup table for each dimension just before loading the fact table.
Pentaho Kettle has the "Database Value Lookup" transformation step for this purpose. You may also want to look at the "Delivering Fact Tables" section of Kimball's Data Warehouse ETL Toolkit.
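The lookup described above can be sketched outside any ETL tool. This is a minimal illustration using Python's built-in sqlite3 module, not the way Kettle implements its step. All table and column names (stg_visit, dim_patient, patient_bk, and so on) are assumptions made up for the example, and the date is kept as a plain column rather than a date-dimension key to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: surrogate key (*_key) plus business key (*_bk) from the source system.
CREATE TABLE dim_patient   (patient_key   INTEGER PRIMARY KEY, patient_bk   TEXT UNIQUE);
CREATE TABLE dim_physician (physician_key INTEGER PRIMARY KEY, physician_bk TEXT UNIQUE);
CREATE TABLE dim_test      (test_key      INTEGER PRIMARY KEY, test_bk      TEXT UNIQUE);

-- Staging table: the "single join" from the source, still carrying business keys.
CREATE TABLE stg_visit (patient_bk TEXT, physician_bk TEXT, test_bk TEXT,
                        visit_date TEXT, fee REAL);

-- Fact table: stores only surrogate keys and measures.
CREATE TABLE fact_visit (patient_key INTEGER, physician_key INTEGER, test_key INTEGER,
                         visit_date TEXT, fee REAL);

INSERT INTO dim_patient   VALUES (101, 'P-1'), (102, 'P-2');
INSERT INTO dim_physician VALUES (201, 'DR-9');
INSERT INTO dim_test      VALUES (301, 'CBC'), (302, 'XRAY');
INSERT INTO stg_visit VALUES
    ('P-1', 'DR-9', 'CBC',  '2015-02-01', 40.0),
    ('P-2', 'DR-9', 'XRAY', '2015-02-02', 120.0);
""")

# Swap each business key for the matching surrogate key while loading the fact table.
conn.execute("""
INSERT INTO fact_visit (patient_key, physician_key, test_key, visit_date, fee)
SELECT p.patient_key, d.physician_key, t.test_key, s.visit_date, s.fee
FROM stg_visit s
JOIN dim_patient   p ON p.patient_bk   = s.patient_bk
JOIN dim_physician d ON d.physician_bk = s.physician_bk
JOIN dim_test      t ON t.test_bk      = s.test_bk
""")

fact_rows = conn.execute(
    "SELECT patient_key, physician_key, test_key, fee FROM fact_visit ORDER BY patient_key"
).fetchall()
print(fact_rows)
```

The inner joins silently drop staging rows whose business key has no match in a dimension, so a real load would probably route those rows to an error table or an "unknown" dimension member instead.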

Related

Temporal tables VS SCD 2

Can a temporal table be used to replace SCD type 2 in a data warehouse?
I use temporal tables in Azure SQL Database.
Typically no. Temporal tables are a good fit for staging tables, and can be used as a source to create slowly-changing dimensions if needed in a dimensional model.
The whole point of a dimensional model is to make writing queries easy. With an SCD, the fact table still has a simple single-column foreign key to the dimension table, so you get historically accurate dimension values for each fact row without complicating the queries.
To get the same result from a temporal table, you'd have to join both the main table and the archive (history) table, and filter them on both the business key of the dimension and the validfrom and validto dates.
Also, temporal tables only support system versioning, which means that validfrom and validto always refer to the clock of the database server. In a data warehouse you might want to use some other temporal reference in your data to model your SCD.
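The difference between the two query shapes can be shown with a small sketch. It uses Python's sqlite3 module rather than Azure SQL, so no real temporal table is involved: a single hypothetical customer_versions table stands in both for the SCD type 2 dimension and for the combined current and history rows of a temporal table. All names and dates are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per version of a customer, with a surrogate key and a validity range.
CREATE TABLE customer_versions (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key, one per version
    customer_bk  TEXT,                  -- business key, shared by all versions
    city         TEXT,
    valid_from   TEXT,
    valid_to     TEXT
);
INSERT INTO customer_versions VALUES
    (1, 'C-1', 'Oslo',   '2020-01-01', '2021-06-30'),
    (2, 'C-1', 'Bergen', '2021-07-01', '9999-12-31');

-- SCD style: the fact row already points at the version valid on the sale date.
CREATE TABLE fact_sale (customer_key INTEGER, sale_date TEXT, amount REAL);
INSERT INTO fact_sale VALUES (1, '2021-03-15', 50.0), (2, '2022-01-10', 75.0);

-- Temporal style: the row only knows the business key and its own date.
CREATE TABLE src_sale (customer_bk TEXT, sale_date TEXT, amount REAL);
INSERT INTO src_sale VALUES ('C-1', '2021-03-15', 50.0), ('C-1', '2022-01-10', 75.0);
""")

# SCD type 2: a plain single-column equality join.
scd_rows = conn.execute("""
SELECT f.sale_date, d.city
FROM fact_sale f
JOIN customer_versions d ON d.customer_key = f.customer_key
ORDER BY f.sale_date
""").fetchall()

# Temporal-table style: business key plus a date-range predicate on every query.
range_rows = conn.execute("""
SELECT s.sale_date, d.city
FROM src_sale s
JOIN customer_versions d
  ON d.customer_bk = s.customer_bk
 AND s.sale_date BETWEEN d.valid_from AND d.valid_to
ORDER BY s.sale_date
""").fetchall()

print(scd_rows)
print(range_rows)
```

Both queries return the same cities per sale; the SCD version simply pays the cost of the range lookup once, at load time, instead of in every report query.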

Best way to organize junction table

I'm currently building a small database in MS Access to track upgrades (45) on several machines (30) in a factory. The info is in an Excel spreadsheet where the rows are the upgrades and the columns are the machines. The Excel file shows, for each machine, whether a certain upgrade is already installed, to be installed, in development, etc.
I currently have one table with the details of each upgrade and another table with every machine and its own info.
To replicate the Excel associations I intend to make a junction table.
Should I make it with the upgrade as one field (1 column) and the machine as another? With a status field, that would give 3 columns and 30*45 rows.
Or should I imitate the Excel layout and put the upgrade as a field and each machine as an individual field, the values being the state of the upgrade?
Thanks in advance
A junction table is typically used to build a many-to-many relationship, and from what you describe, that seems to be your case. The fields of the junction table must include the key field(s) of the two tables between which you want to establish the many-to-many relationship. In your case, the junction table should have the key field(s) of table "machines" and the key field(s) of table "upgrades". You then build a one-to-many relationship with referential integrity from the table "machines", over the key fields, to the corresponding fields in the junction table. You do the same from the key field(s) of the table "upgrades" to the corresponding fields of the junction table. Then you populate the junction table with the corresponding data. It is quite common to include additional fields in the junction table to provide useful information, such as a date field to record when each upgrade was done on each specific machine, a comments field, or the person who did the upgrade.
If you want to see a concrete example, you can take a look at the junction table "T_Umbrellas_in_Capitals" in the database of examples that you can download from LightningGuide.net. This junction table supports a many-to-many relationship between the tables "T_Capital_cities" and "T_Umbrella_models".
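The layout described above (one row per machine/upgrade pair, with the status as an extra field) can be sketched as follows. This uses Python's sqlite3 module instead of MS Access, and the table names, column names, and sample data are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE machines (machine_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE upgrades (upgrade_id INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- Junction table: key fields of both parents plus the state of the upgrade.
CREATE TABLE machine_upgrades (
    machine_id INTEGER NOT NULL REFERENCES machines(machine_id),
    upgrade_id INTEGER NOT NULL REFERENCES upgrades(upgrade_id),
    status     TEXT    NOT NULL,
    PRIMARY KEY (machine_id, upgrade_id)   -- one row per machine/upgrade pair
);

INSERT INTO machines VALUES (1, 'Press A'), (2, 'Press B');
INSERT INTO upgrades VALUES (10, 'Guard sensor'), (11, 'Firmware 2.0');
INSERT INTO machine_upgrades VALUES
    (1, 10, 'installed'),
    (1, 11, 'to be installed'),
    (2, 10, 'in dev'),
    (2, 11, 'to be installed');
""")

# Which upgrades are still waiting to be installed, and on which machine?
pending = conn.execute("""
SELECT m.name, u.title
FROM machine_upgrades mu
JOIN machines m ON m.machine_id = mu.machine_id
JOIN upgrades u ON u.upgrade_id = mu.upgrade_id
WHERE mu.status = 'to be installed'
ORDER BY m.name, u.title
""").fetchall()
print(pending)
```

With this shape, adding a 31st machine is a matter of inserting rows, whereas the spreadsheet-style layout would need a new column and a change to every query that lists machines.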

Fact table referencing another fact table?

We recently started working on our data warehouse. We have Technicians, Salespersons, Date, Branch, and Customer as our dimensions. We also have transactional tables in the OLTP system, such as Sales Orders and Agreements, which reference each other in some situations. I'm planning to put the sales order and agreement info in fact tables, and I would like to reference all of the above-mentioned dimensions in both fact tables. But here comes my problem.
Sales orders and service agreements need to reference each other. In most cases, agreement information needs to be referenced from sales orders. Can two fact tables reference each other? The sales orders table in the OLTP system contains a million records, and the agreements table holds at least half a million. Can you let me know whether I can link these two fact tables?
Do you have something like an AgreementID which can be added to each fact table and then used to drill across? This is a degenerate dimension - a dimension key without a dimension table.
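A drill-across on a degenerate dimension might look like the sketch below: each fact table is aggregated separately by the shared key, and only the aggregated results are joined. It uses Python's sqlite3 module; the table names, column names, and figures are hypothetical, and the regular dimension keys are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- agreement_id is a degenerate dimension: a key stored in the fact rows
-- with no dimension table behind it.
CREATE TABLE fact_sales_order (order_id INTEGER, agreement_id TEXT, amount REAL);
CREATE TABLE fact_agreement   (agreement_id TEXT, contract_value REAL);

INSERT INTO fact_sales_order VALUES
    (1, 'AG-1', 100.0), (2, 'AG-1', 50.0), (3, 'AG-2', 30.0);
INSERT INTO fact_agreement VALUES
    ('AG-1', 1000.0), ('AG-2', 500.0), ('AG-3', 200.0);
""")

# Aggregate each fact table on its own, then join the two small result sets.
# Joining the raw fact rows directly would repeat contract_value per order.
drill_across = conn.execute("""
WITH so AS (
    SELECT agreement_id, SUM(amount) AS order_total
    FROM fact_sales_order GROUP BY agreement_id
),
ag AS (
    SELECT agreement_id, SUM(contract_value) AS contract_total
    FROM fact_agreement GROUP BY agreement_id
)
SELECT ag.agreement_id, ag.contract_total, so.order_total
FROM ag
LEFT JOIN so ON so.agreement_id = ag.agreement_id
ORDER BY ag.agreement_id
""").fetchall()
print(drill_across)
```

The LEFT JOIN keeps agreements that have no sales orders yet; their order total comes back as NULL (None in Python).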

How to create a 'sanitized' copy of our SQL Server database?

We're a manufacturing company, and we've hired a couple of data scientists to look for patterns and correlation in our manufacturing data. We want to give them a copy of our reporting database (SQL 2014), but it must be in a 'sanitized' form. This means that all table names get converted to 'Table1', 'Table2' etc., and column names in each table become 'Column1', 'Column2' etc. There will be roughly 100 tables, some having 30+ columns, and some tables have 2B+ rows.
I know there is a hard way to do this. This would be to manually create each table, with the sanitized table name and column names, and then use something like SSIS to bulk insert the rows from one table to another. This would be rather time consuming and tedious because of the manual SSIS column mapping required, and manual setup of each table.
I'm hoping someone has done something like this before and has a much faster, more efficient way.
By the way, the 'sanitized' database will have no indexes or foreign keys. Also, it may not seem to make any sense why we would want to do this, but this is what was agreed to by our Director of Manufacturing and the data scientists as the first round of an analysis that will involve many rounds.
You basically want to scrub the data and objects, correct? Here is what I would do.
1. Restore a backup of the db.
2. Drop all objects not needed (indexes, constraints, stored procedures, views, functions, triggers, etc.).
3. Create a table with two columns and populate it so that each row has the original table name and the new table name.
4. Write a script that iterates through the table, row by row, and renames your tables. Better yet, put the data into Excel and create a third column that builds the T-SQL you want to run, then cut/paste and execute it in SSMS.
5. Repeat step 4, but for all columns. It is best to query sys.columns to get all the objects you need, put them into Excel, and build your T-SQL.
6. Repeat again for any other objects needed.
Backup/restore will be quicker than dabbling in SSIS and data transfer.
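As an alternative to building the statements in Excel, the rename script from the steps above could be generated with a few lines of Python. This is only a sketch: the function name build_rename_script and the sample table and column names are made up, the input is assumed to be (schema, table, column) tuples already pulled from sys.columns joined to sys.tables, and object names containing dots, quotes, or brackets would need extra quoting that is not handled here. The generated statements use SQL Server's sp_rename procedure.

```python
def build_rename_script(columns):
    """Build sp_rename statements from (schema, table, column) tuples.

    Columns are renamed first, while the original table names are still valid,
    and the tables are renamed last.
    """
    table_map = {}      # (schema, table) -> 'TableN', in order of first appearance
    col_counts = {}     # (schema, table) -> number of columns seen so far
    statements = []

    for schema, table, column in columns:
        key = (schema, table)
        if key not in table_map:
            table_map[key] = f"Table{len(table_map) + 1}"
        col_counts[key] = col_counts.get(key, 0) + 1
        new_col = f"Column{col_counts[key]}"
        statements.append(
            f"EXEC sp_rename '{schema}.{table}.{column}', '{new_col}', 'COLUMN';"
        )

    for (schema, table), new_table in table_map.items():
        statements.append(f"EXEC sp_rename '{schema}.{table}', '{new_table}';")

    return table_map, statements


# Hypothetical metadata; in practice this would come from the restored database.
sample_columns = [
    ("dbo", "Furnace", "TempC"),
    ("dbo", "Furnace", "ShiftID"),
    ("dbo", "Batch", "BatchNo"),
]
table_map, statements = build_rename_script(sample_columns)
print("\n".join(statements))
```

Keeping table_map (and, if needed, an equivalent column map) somewhere private gives you the key for translating the data scientists' findings back to the real names.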
They can see the data but they can't see the column names? What can that possibly accomplish? What are you protecting by not revealing the table or column names? How is a data scientist supposed to evaluate data without context? Without a FK, all I see is a bunch of numbers in a column named colx. What are you expecting to accomplish? Get a confidentiality agreement. Consider an FK column customerID versus a materialID. Patterns in them have widely different meanings and call for different analysis. I would correlate a quality measure with materialID or shiftID but not with a customerID.
Oh look, there is a correlation between tableA.colB and tableX.colY. Well yes, that customer is a college team and they use aluminum bats.
On top of that you strip indexes (on tables with 2B+ rows), so the analysis they run will be slow. What does that accomplish?
As for the question as stated: do a backup and restore. Using the system tables, drop all triggers, FKs, indexes, and constraints. Don't forget to drop the triggers and constraints, as they may disclose some trade secret. Then rename the columns and then the tables.

Difference between Fact table and Dimension table?

While reading a book on Business Objects, I came across the terms fact table and dimension table.
I am trying to understand the difference between a dimension table and a fact table.
I read a couple of articles on the internet but I was not able to understand it clearly.
Could a simple example help me understand better?
In data warehouse modeling, a star schema and a snowflake schema consist of fact and dimension tables.
Fact Table:
It contains foreign keys to the primary keys of the dimension tables, together with the associated facts or measures (properties on which calculations can be made) such as quantity sold, amount sold, and average sales.
Dimension Tables:
Dimension tables provide descriptive information for all the measurements recorded in the fact table.
Dimension tables are relatively small in comparison to the fact table.
Commonly used dimensions are people, products, place, and time.
This appears to be a very simple answer on how to differentiate between fact and dimension tables!
It may help to think of dimensions as things or objects. A thing such
as a product can exist without ever being involved in a business
event. A dimension is your noun. It is something that can exist
independent of a business event, such as a sale. Products, employees,
equipment, are all things that exist. A dimension either does
something, or has something done to it.
Employees sell, customers buy. Employees and customers are examples of
dimensions, they do.
Products are sold, they are also dimensions as they have something
done to them.
Facts, are the verb. An entry in a fact table marks a discrete event
that happens to something from the dimension table. A product sale
would be recorded in a fact table. The event of the sale would be
noted by what product was sold, which employee sold it, and which
customer bought it. Product, Employee, and Customer are all dimensions
that describe the event, the sale.
In addition fact tables also typically have some kind of quantitative
data. The quantity sold, the price per item, total price, and so on.
Source:
http://arcanecode.com/2007/07/23/dimensions-versus-facts-in-data-warehousing/
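The noun/verb distinction in the quoted explanation can be made concrete with a tiny star schema. This is a sketch using Python's sqlite3 module; the tables, names, and numbers are invented for illustration and are not taken from the quoted article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: the nouns. They exist whether or not a sale ever happens.
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_employee (employee_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);

-- Fact: the verb. One row per sale event, keyed by the dimensions, plus measures.
CREATE TABLE fact_sale (
    product_key  INTEGER REFERENCES dim_product(product_key),
    employee_key INTEGER REFERENCES dim_employee(employee_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity     INTEGER,
    unit_price   REAL
);

INSERT INTO dim_product  VALUES (1, 'Bat'), (2, 'Glove'), (3, 'Helmet');
INSERT INTO dim_employee VALUES (1, 'Ann');
INSERT INTO dim_customer VALUES (1, 'College Team');
INSERT INTO fact_sale VALUES
    (1, 1, 1, 10, 25.0),
    (2, 1, 1,  4, 15.0),
    (1, 1, 1,  2, 25.0);
""")

# Measures are aggregated; dimension attributes label and group the result.
revenue_by_product = conn.execute("""
SELECT p.name, SUM(f.quantity) AS units, SUM(f.quantity * f.unit_price) AS revenue
FROM fact_sale f
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY p.name
ORDER BY p.name
""").fetchall()

# A product that was never sold still exists in its dimension.
unsold = conn.execute("""
SELECT p.name FROM dim_product p
WHERE NOT EXISTS (SELECT 1 FROM fact_sale f WHERE f.product_key = p.product_key)
""").fetchall()

print(revenue_by_product)
print(unsold)
```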
This is to answer the part:
I was trying to understand whether dimension tables can be fact table
as well or not?
The short answer (IMO) is no. That is because the two types of tables are created for different reasons. However, from a database design perspective, a dimension table could have a parent table, just as a fact table always has one or more dimension tables as parents. Also, fact tables may be aggregated, whereas dimension tables are not aggregated. Another reason is that fact tables are not supposed to be updated in place, whereas dimension tables could be updated in place in some cases.
More details:
Fact and dimension tables appear in what is commonly known as a star schema. A primary purpose of a star schema is to simplify a complex normalized set of tables and consolidate data (possibly from different systems) into one database structure that can be queried in a very efficient way.
In its simplest form, it contains a fact table (example: StoreSales) and one or more dimension tables. Each dimension entry has 0, 1, or more fact table entries associated with it (examples of dimension tables: Geography, Item, Supplier, Customer, Time, etc.). It would also be valid for the dimension to have a parent, in which case the model is of type "snowflake". However, designers attempt to avoid this kind of design since it causes more joins that slow performance. In the example of StoreSales, the Geography dimension could be composed of the columns (GeoID, ContinentName, CountryName, StateProvName, CityName, StartDate, EndDate).
In a snowflake model, you could have two normalized tables for Geo information, namely a Continent table and a Country table.
You can find plenty of examples on star schemas. Also, check this out to see an alternative view on the star schema model: Inmon vs. Kimball. Kimball has a good forum you may also want to check out here: Kimball Forum.
Edit: To answer comment about examples for 4NF:
Example for a fact table violating 4NF:
Sales Fact (ID, BranchID, SalesPersonID, ItemID, Amount, TimeID)
Example for a fact table not violating 4NF:
AggregatedSales (BranchID, TotalAmount)
Here the relation is in 4NF
The last example is rather uncommon.
Super simple explanation:
Fact table: a data table that maps lookup IDs together. Is usually one of the main tables central to your application.
Dimension table: a lookup table used to store values (such as city names or states) that are repeated frequently in the fact table.
Dimension table
A dimension table is a table which contains attributes of the measurements stored in fact tables. It consists of hierarchies, categories, and logic that can be used to traverse the nodes of those hierarchies.
Fact table contains the measurement of business processes, and it contains foreign keys for the dimension tables.
Example – If the business process is manufacturing of bricks
Average number of bricks produced by one person/machine – measure of the business process
a Fact = an action: a sale, a transaction, an access
a Dimension = an object: a seller, a customer, a date, a price
Then...
Facts reference dimensions for: when, where, what, who, how
The really interesting thing is deciding whether an attribute should be a dimension or a fact. For example, the price of each item in an order, or the maximum amount of insurance recorded in a contract. There is no generally correct way to approach these, only ones that make sense in the context.
PS: If I were to coin the jargon, I would prefer Log table and Object table.
In the simplest form, I think a dimension table is something like a 'Master' table - that keeps a list of all 'items', so to say.
A fact table is a transaction table which describes all the transactions. In addition, aggregated (grouped) data like total sales by sales person, total sales by branch - such kinds of tables also might exist as independent fact tables.
From my point of view,
Dimension table : Master Data
Fact table : Transactional Data
The fact table mainly consists of business facts and foreign keys that refer to primary keys in the dimension tables. A dimension table consists mainly of descriptive attributes that are textual fields.
A dimension table contains a surrogate key, natural key, and a set of attributes. On the contrary, a fact table contains a foreign key, measurements, and degenerated dimensions.
Dimension tables provide descriptive or contextual information for the measurement of a fact table. On the other hand, fact tables provide the measurements of an enterprise.
When comparing the size of the two tables, a fact table is bigger than a dimension table. In a schema, there are more dimension tables than fact tables, and a fact table holds relatively few measures compared with the number of attributes in the dimensions.
The dimension tables have to be loaded first. While loading the fact tables, one has to look up the dimension tables. This is because the fact table has measures, facts, and foreign keys that are the primary keys in the dimension tables.
Read more: Dimension Table and Fact Table | Difference Between | Dimension Table vs Fact Table http://www.differencebetween.net/technology/hardware-technology/dimension-table-and-fact-table/#ixzz3SBp8kPzo
For relational database users, a dimension is equivalent to a master table.
Fact is equivalent to Transaction table.
Dimension table : A table that maintains descriptive (characteristic) data is called a dimension table.
Example : Time Dimension, Product Dimension.
Fact table : A table that maintains the metrics or precalculated data is called a fact table.
Example : Sales Fact, Order Fact.
Star schema : One fact table linked with its dimension tables forms a star schema.