Fact Table With Non-Measure Data - data-modeling

In the model below, description is a free text field that describes why an employee was absent.
Can this description field be in the fact table and considered a degenerate dimension?
The value will mostly be used in listing reports or for dashboards where word clouds are used.

Your design is correct. There is nothing wrong with including free text as a degenerate dimension in a fact table.
Storing comments in a dimension makes sense only if comments are structured (i.e., if they are standardized and effectively have 1:M relations with the fact records). If they are stored as free text, and thus have 1:1 relations with the facts, then converting them into a dimension is a big mistake - you will end up with a dimension as tall as the fact table. In proper designs, dimensions are wide and short, while fact tables are narrow and tall. Tall dimensions are a problem, because they are very expensive in terms of performance.
They are also hard to use. Say you are using a reporting tool such as Power BI. If you store your free text as a degenerate dimension in a fact table, it's easy and intuitive to use - I can write something like:
Reason for Absence = SELECTEDVALUE( Fact[Description])
and the comment will be properly displayed in a report. Done.
But if you store the same comments in a dimension, well, good luck figuring out how to add them to the report.
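As a rough illustration of the degenerate-dimension approach outside of Power BI, here is a minimal sketch using SQLite from Python. The table name fact_absence, its columns, and the sample rows are all hypothetical and are not taken from the model in the question. It shows a listing query and a simple word count (the raw input for a word cloud) reading the description straight from the fact table, with no join.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")

# Hypothetical fact table: the free-text description lives on the fact row itself.
conn.execute("""
    CREATE TABLE fact_absence (
        date_key     INTEGER,
        employee_key INTEGER,
        hours_absent REAL,
        description  TEXT   -- degenerate dimension: no separate dimension table
    )
""")
conn.executemany(
    "INSERT INTO fact_absence VALUES (?, ?, ?, ?)",
    [
        (20240102, 1, 8.0, "Sick with flu"),
        (20240103, 2, 4.0, "Attending a funeral"),
        (20240104, 1, 8.0, "Still sick"),
    ],
)

# Listing report: the description comes back without joining anything.
listing = conn.execute(
    "SELECT date_key, employee_key, description FROM fact_absence ORDER BY date_key"
).fetchall()

# Word frequencies for a word cloud, computed from the same column.
word_counts = Counter(
    word.lower() for (_, _, text) in listing for word in text.split()
)
```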

Page 65 of The Data Warehouse Toolkit 3rd edition says the following:
Text Comments Dimension: Rather than treating freeform comments as textual metrics in a fact table, they should be stored outside the fact table in a separate comments dimension (or as attributes in a dimension with one row per transaction if the comments’ cardinality matches the number of unique transactions) with a corresponding foreign key in the fact table.
Kimball, Ralph; Ross, Margy. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (p. 65). Wiley. Kindle Edition.
On page 47 there is this example of a degenerate dimension:
For example, when an invoice has multiple line items, the line item
fact rows inherit all the descriptive dimension foreign keys of the
invoice, and the invoice is left with no unique content. But the
invoice number remains a valid dimension key for fact tables at the
line item level.
Kimball, Ralph; Ross, Margy. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (p. 47). Wiley. Kindle Edition.

No, descriptive text columns should not be included in fact tables. Instead, this column should be included in a dimension.
If you are looking to report on tags (keywords), I would create a dimension for these tags and parse the description to find the appropriate tag to associate with the fact. For example, I see two tags in the descriptions (funeral and sick). I would create a dimension DimAbsentReason to contain these tags.
If you need to keep the actual description, then you could create a dimension (DimAbsentReason) for the description and make the appropriate association to the fact table.
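The tag-parsing idea could look something like the sketch below. It assumes simple substring matching and a catch-all key of 0 for descriptions that contain no known tag; both are illustrative choices, and only the two tags and the DimAbsentReason name come from the answer above.

```python
# The two tags named in the answer; a real list would come from the business.
KNOWN_TAGS = ["funeral", "sick"]

# Assumed catch-all surrogate key for descriptions with no recognised tag.
UNKNOWN_KEY = 0

# DimAbsentReason modelled as a mapping from tag to surrogate key.
dim_absent_reason = {tag: key for key, tag in enumerate(KNOWN_TAGS, start=1)}


def reason_key(description: str) -> int:
    """Return the DimAbsentReason key for the first known tag found in the text."""
    text = description.lower()
    for tag in KNOWN_TAGS:
        if tag in text:
            return dim_absent_reason[tag]
    return UNKNOWN_KEY
```

During the load, each fact row would store reason_key(description) as its foreign key to DimAbsentReason.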

Related

Star Schema from multiple source tables

I am struggling in figuring out how to create a star schema from multiple source tables. I work at a trading firm so the data is related to user trading activity. The issue I am having is that our datasets do not have primary ids for every field that could be a dimension. Instead, we usually relate our data together using the combination of date and account number. Here is an example of 3 source tables...
I would like to turn this into a star schema, something that looks like ...
Is my only option to denormalize my source tables into one wide table (joining trades to position on account number and date, and joining the users table on account number), create keys for each dimension, then renormalize it into the star schema? Are star schemas ever built from multiple source tables?
Star schemas are almost always created from multiple source tables.
The normal process is:
1. Populate your dimension tables
2. Create a temporary/virtual fact record using your source data
3. Using this fact record, look up the relevant dimension keys
4. Write the actual fact record to your target fact table
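The process above can be sketched end to end with SQLite from Python. Everything named here (dim_account, dim_date, fact_trade, the source_trades rows) is a hypothetical stand-in for the trading data in the question, which relates records by date and account number rather than by primary ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_account (
        account_key    INTEGER PRIMARY KEY,
        account_number TEXT UNIQUE
    );
    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,
        calendar_date TEXT UNIQUE
    );
    CREATE TABLE fact_trade (
        account_key INTEGER REFERENCES dim_account(account_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        quantity    INTEGER
    );
""")

# Hypothetical source rows, keyed only by natural values (date, account number).
source_trades = [
    {"trade_date": "2024-01-02", "account_number": "A100", "quantity": 50},
    {"trade_date": "2024-01-02", "account_number": "B200", "quantity": 10},
    {"trade_date": "2024-01-03", "account_number": "A100", "quantity": -20},
]

# Step 1: populate the dimensions; the UNIQUE constraint makes repeats a no-op.
for row in source_trades:
    conn.execute(
        "INSERT OR IGNORE INTO dim_account (account_number) VALUES (?)",
        (row["account_number"],),
    )
    conn.execute(
        "INSERT OR IGNORE INTO dim_date (calendar_date) VALUES (?)",
        (row["trade_date"],),
    )

# Steps 2-4: treat each source row as the temporary fact record,
# look up its dimension keys, then write the real fact row.
for row in source_trades:
    account_key = conn.execute(
        "SELECT account_key FROM dim_account WHERE account_number = ?",
        (row["account_number"],),
    ).fetchone()[0]
    date_key = conn.execute(
        "SELECT date_key FROM dim_date WHERE calendar_date = ?",
        (row["trade_date"],),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO fact_trade VALUES (?, ?, ?)",
        (account_key, date_key, row["quantity"]),
    )

fact_rows = conn.execute(
    "SELECT account_key, date_key, quantity FROM fact_trade ORDER BY rowid"
).fetchall()
```

The fact table ends up holding only surrogate keys and the measure; the natural values stay in the dimensions.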
Data-warehousing is about query speed. The data-warehouse should not be concerned with data integrity. IT SHOULD NOT CLEAN OR CORRECT BAD DATA. It only needs to gather all the data together into a single record to present to the model for analysis. Denormalizing the data is how this is done.
In a star schema, dimensions do not know about each other and have no relationships with other dimensions. In a snowflake, dimensions are related to other dimensions. That is the primary difference between star and snowflake.
All the metadata options for events are rolled up into dimensions and used for slicing/filtering. All the measurable/calculation data for an event are in the event fact, along with a reference to the dimension(s) containing the relevant metadata. The Metadata/Dimension is reused across multiple fact records.
Based on the limited example you've provided, I'd suggest you research degenerate dimensions and junk dimensions. Your Trade and Position data may need to be turned into a fact and a dimension (degenerate), and some of your flag attributes may be best placed into a junk dimension.
You should also make sure your dimension keys are clear. You should not have multiple paths to a dimension (accountnumber: trade -> position -> user & trade -> user ) as that will cause inconsistent results when querying depending on which relationship you traverse.

Dimension Creation - Multiple Uses

We received some generic training related to TM1 and dimension creation and we were informed we'd need separate dimensions for the same values.
Let me describe: we transport goods, so we'd have an origin and destination province, and in typical database design I'd expect we'd have one "province" reference table, but we were informed we'd need an "origin" dimension and a "destination" dimension. This seems cumbersome, and it seems like we'd encounter the same issue with customers, services, etc.
Can someone clarify how this could work for us?
Again, I'd expect to see a "lookup" table in the database which contains all possible provinces (assumption is values in both columns would be the same), then you'd have an ID value in any column that used the "province" and join to the "lookup" table based on ID.
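For reference, the relational design described here (one lookup table, referenced twice) can be sketched in SQLite from Python. This is an illustration of the relational idea only, not of how TM1 models dimensions, and the table names, column names, and sample provinces are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One lookup table holding every possible province.
    CREATE TABLE province (
        province_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    -- Two columns play different roles but point at the same lookup.
    CREATE TABLE shipment (
        shipment_id             INTEGER PRIMARY KEY,
        origin_province_id      INTEGER REFERENCES province(province_id),
        destination_province_id INTEGER REFERENCES province(province_id)
    );
    INSERT INTO province VALUES (1, 'Ontario'), (2, 'Quebec');
    INSERT INTO shipment VALUES (100, 1, 2), (101, 2, 2);
""")

# The lookup is joined twice, once per role, under two aliases.
rows = conn.execute("""
    SELECT s.shipment_id, o.name, d.name
    FROM shipment AS s
    JOIN province AS o ON o.province_id = s.origin_province_id
    JOIN province AS d ON d.province_id = s.destination_province_id
    ORDER BY s.shipment_id
""").fetchall()
```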
in typical database design I'd expect we'd have one "province" reference table, but we were informed we'd need an "origin" dimension and a "destination" dimension
Following regular DB design, it makes sense to keep two data entities separate: one defines the source, the other defines the target. I think on this we'd both agree. If you could give more details it would be better.
Imagine a drop down list: two lists populated by one single "source", but represent two different values in DB.
assumption is values in both columns would be the same
If destination = origin, you don't need two dimensions then? :) This point needs clarification.
Besides your solution (a combination of all sources and destinations in a table with a unique ID, which could be a way of solving this), it seems it's resolvable by cube or dimension structure changes.
If at some dimension you'd use e.g. ProvinceOrigin and ProvinceDestination as string type elements, and populate them from one single dimension (dynamic attribute) then whenever you save the cube you'll have these two fields populated from one single dimension.
Obviously the best solution for you depends on your system architecture.

Best approach to avoid Too many columns and complexity in database design

Inventory Items :
Paper Size
-----
A0
A1
A2
etc
Paper Weight
------------
80gsm
150gsm etc
Paper mode
----------
Colour
Bw
Paper type
-----------
glass
silk
normal
Tabdividers and tabdivider Type
--------
Binding and Binding Types
--
Laminate and laminate Types
--
These are the inventory items, and all of them need to be stored in the invoice table.
How do you store them in a database using proper RDBMS design?
In my opinion, each list should get a master table, with retrieval via JOINs. However, adding that many tables to the database may be a little bit complex.
This normalisation also causes a problem when storing all this information against an invoice: it leads to too many columns in the invoice table.
The other way is to put all of them into one table with more columns, where each row is a combination of them (generated by an algorithm: 4 lists with 4 items gives over 24 records, each of which will have a reference ID).
Which one do you think is best, and why?
Your initial idea is correct. And anyone claiming that four tables is "a little bit complex" and/or "too many tables" shouldn't be doing database work. This is what RDBMSs are designed (and tuned) to do.
Each of these 4 items is an individual property of something so they can't simply be put, as is, into a table that merges them. As you had thought, you start with:
PaperSize
PaperWeight
PaperMode
PaperType
These are lookup tables and hence should have non-auto-incrementing ID fields.
These will be used as Foreign Key fields for the main paper-based entities.
Or if they can only exist in certain combinations, then there would need to be a relationship table to capture/manage what those valid combinations are. But those four paper "properties" would still be separate tables, which the relationship table references via Foreign Keys. Some people would put a separate ID field on that relationship table to uniquely identify the combination via a single value. Personally, I wouldn't do that unless there was a technical requirement such as Replication (or some other process/feature) that required that each table had a single-field key. Instead, I would just make the PK out of the four ID fields that point to those paper "property" lookup tables. Then those four fields would still go into any paper-based entities. At that point the main paper entity tables would look about the same as they would if there wasn't the relationship table, the difference being that instead of having 4 FKs of a single ID field each, one to each of the paper "property" tables, there would be a single FK of 4 ID fields pointing back to the PK of the relationship table.
Why not jam everything into a single table? Because:
It defeats the purpose of using a Relational Database Management System to flatten out the data into a non-relational structure.
It is harder to grow that structure over time
It makes finding all paper entities of a particular property clunkier
It makes finding all paper entities of a particular property slower / less efficient
maybe other reasons?
EDIT:
Regarding the new info (e.g. Invoice Table, etc) that wasn't in the question when I was writing the above, that should be abstracted via a Product/Inventory table that would capture these combinations. That is what I was referring to as the main paper entities. The Invoice table would simply refer to a ProductID/InventoryID (just as an example) and the Product/Inventory table would have these paper property IDs. I don't see why these properties would be in an Invoice table.
EDIT2:
Regarding the IDs of the "property" lookup tables, one reason that they should not be auto-incrementing is that their values should be taken from Enums in the app layer. These lookup tables are just a means of providing a "data dictionary" so that the database layer can have insight into what these values mean.
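A minimal sketch of this layout, using SQLite from Python, might look like the following. Only two of the four lookup tables are shown (PaperWeight and PaperType would follow the same pattern), the Product and InvoiceLine table names and columns are illustrative assumptions, and the enum values are made up for the example.

```python
import sqlite3
from enum import IntEnum


# App-layer enums: the lookup table IDs are taken from these,
# which is why the ID columns are not auto-incrementing.
class PaperSize(IntEnum):
    A0 = 1
    A1 = 2
    A2 = 3


class PaperMode(IntEnum):
    COLOUR = 1
    BW = 2


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE PaperSize (PaperSizeID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE PaperMode (PaperModeID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    -- Hypothetical main paper entity holding the property IDs.
    CREATE TABLE Product (
        ProductID   INTEGER PRIMARY KEY,
        PaperSizeID INTEGER NOT NULL REFERENCES PaperSize(PaperSizeID),
        PaperModeID INTEGER NOT NULL REFERENCES PaperMode(PaperModeID)
    );
    -- The invoice side refers only to the product, not to the properties.
    CREATE TABLE InvoiceLine (
        InvoiceID INTEGER NOT NULL,
        ProductID INTEGER NOT NULL REFERENCES Product(ProductID),
        Quantity  INTEGER NOT NULL
    );
""")

# Seed the lookup tables ("data dictionary") with explicit IDs from the enums.
conn.executemany(
    "INSERT INTO PaperSize VALUES (?, ?)", [(m.value, m.name) for m in PaperSize]
)
conn.executemany(
    "INSERT INTO PaperMode VALUES (?, ?)", [(m.value, m.name) for m in PaperMode]
)

conn.execute(
    "INSERT INTO Product VALUES (?, ?, ?)",
    (1, PaperSize.A1.value, PaperMode.COLOUR.value),
)
conn.execute("INSERT INTO InvoiceLine VALUES (?, ?, ?)", (5001, 1, 200))

# Reading an invoice line back with its paper properties resolved by JOINs.
invoice_rows = conn.execute("""
    SELECT il.InvoiceID, ps.Name, pm.Name, il.Quantity
    FROM InvoiceLine AS il
    JOIN Product   AS p  ON p.ProductID    = il.ProductID
    JOIN PaperSize AS ps ON ps.PaperSizeID = p.PaperSizeID
    JOIN PaperMode AS pm ON pm.PaperModeID = p.PaperModeID
""").fetchall()
```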

SSAS 2012 - Dimension Modeling

I am working with a structure that results in a lot of single-attribute dimensions that require no hierarchy. Examples:
Status(Status Name)
Type(Type Name)
I get the following warning when compiling the project:
"Avoid having multiple dimensions containing a single attribute. Consider unifying them if possible."
A large number of single attribute dimensions is workable for our users, but it causes a lot of clutter in the Excel pivot table. Dimensions are listed along with the single attribute which is redundant.
I would like to unify them as the warning suggests so that I have a single dimension called 'Attributes' which contains status/type/etc, but I am unsure of the best way to do so. It doesn't make conceptual sense to me with a parent/child dimension.
Any suggestions?
I agree this is a worthwhile change. I would construct a view that brings together the required attributes. Often they are all available on the fact/measure group table/view, so you can just use the same source object (in your DSV) to construct the dimension.
The tricky part may be the dimension key. The most flexible key is a Fact Surrogate Key, e.g. a unique value per Fact row, so that in the future you can add any other fact-based attributes without affecting the key. However, this will not scale indefinitely; you are probably OK up to at least 1 million rows.
Beyond that scale, I would concatenate the attributes to form the dimension key and deliver them to a new dimension table. I would normally do this back in the ETL layer. The identical concatenation logic must be used for both the dimension and fact.
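The concatenated-key approach could be sketched in an ETL step like the one below. The pipe delimiter, the attributes_key column name, and the sample rows are assumptions made for illustration; the important point is that a single function produces the key for both the dimension and the fact.

```python
def attribute_key(status: str, type_name: str) -> str:
    """Build the dimension key; the '|' delimiter is an arbitrary choice
    and must not occur inside the attribute values themselves."""
    return f"{status}|{type_name}"


# Hypothetical source rows carrying the single-attribute values and a measure.
source_rows = [
    {"status": "Open", "type": "Internal", "amount": 10.0},
    {"status": "Closed", "type": "Internal", "amount": 5.0},
    {"status": "Open", "type": "Internal", "amount": 7.5},
]

# Unified 'Attributes' dimension: one row per distinct combination.
dim_attributes = {}
for row in source_rows:
    key = attribute_key(row["status"], row["type"])
    dim_attributes[key] = {"Status Name": row["status"], "Type Name": row["type"]}

# Fact rows reuse the identical concatenation logic for their foreign key.
fact_rows = [
    {"attributes_key": attribute_key(r["status"], r["type"]), "amount": r["amount"]}
    for r in source_rows
]
```

The dimension grows with the number of distinct combinations rather than with the number of fact rows.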

Style Question - Database Table with Many Fields

I'm starting a new project where I have to parse a document and store it in a database. This document contains several sections of simple key-value pairs - about 10 sections and about 100 pairs in total. I could have one table per section, and they all map one-to-one to an aggregate. Or I could have one table with about 100 fields. I'm stuck because I don't want to make a single table that big, but I also don't want to make that many one-to-one mappings either. So, do I make the big table, or do I make a bunch of smaller tables? Effectively, there wouldn't really be a difference as far as I can tell. If there is, please let me know.
EDIT
An example is desired so I will provide something that might help.
Document
- Section Title 1
- k1: val1
- k2: val2
...
- Section Title 2
- k10: val10
...
...
- Section Title n
- kn-1: valn-1
- kn: valn
And I have to use a relational database so don't bother suggesting otherwise.
If you have many, many instances of this big document to store (now and/or over time), and if each instance of this document will have values for those 100+ columns, and if you want the power and flexibility inherent in storing all that data across rows and columns within an RDBMS, then I'd store it all as one big (albeit ugly) table.
If all the "items" in a given section are always filled, but individual sections may or may not be filled, then there might be value in having one table per section... but it doesn't sound like this is the case.
Be wary of those "ifs" above. If any of them are too shaky, then the big table idea may be more pain than it's worth, and alternate ideas (such as #9000's NoSQL idea) might be better.
If the data is just for read-only purposes and your xml doesn't require you to make DB schema changes (alters), then I don't see any problem de-normalizing to a single table. The other alternative might be to look at EAV models:
Table document(
PK - a surrogate key
name - the "natural" key
)
Table content(
PK - the PK of the parent document
section title
name
value
)
Yes, you have hundreds of rows of name/value pairs per document. However, you can easily add names and values without having to revise the database.
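The two tables outlined above might be realised as in this SQLite sketch from Python. The column names, the composite primary key on content, and the sample document name "example-doc" are assumptions added for the example; the section titles and keys mirror the placeholder outline in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (
        document_id INTEGER PRIMARY KEY,   -- surrogate key
        name        TEXT NOT NULL UNIQUE   -- the "natural" key
    );
    CREATE TABLE content (
        document_id   INTEGER NOT NULL REFERENCES document(document_id),
        section_title TEXT NOT NULL,
        name          TEXT NOT NULL,
        value         TEXT,
        -- Assumed: a key name appears at most once per section of a document.
        PRIMARY KEY (document_id, section_title, name)
    );
""")

# Hypothetical parser output: section title -> {key: value}.
parsed = {
    "Section Title 1": {"k1": "val1", "k2": "val2"},
    "Section Title 2": {"k10": "val10"},
}

doc_id = conn.execute(
    "INSERT INTO document (name) VALUES (?)", ("example-doc",)
).lastrowid

# One row per key-value pair; new keys need no schema change.
conn.executemany(
    "INSERT INTO content VALUES (?, ?, ?, ?)",
    [
        (doc_id, section, key, value)
        for section, pairs in parsed.items()
        for key, value in pairs.items()
    ],
)

# Reassemble one section of the document as a dictionary.
section_1 = dict(
    conn.execute(
        "SELECT name, value FROM content WHERE document_id = ? AND section_title = ?",
        (doc_id, "Section Title 1"),
    ).fetchall()
)
```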

Resources