I am trying to make sense of dimensional modeling. While reading a dimensional modeling book, I created a star schema.
The fact table is an accumulating snapshot table, and it has multiple date columns that are linked to a date dimension through a surrogate key.
FactApplicants
{
Interview_No_Show_Date_Key (FK)
Cancel_Date_Key (FK)
Interviewed_Date_Key (FK)
. ....
Applicant_Key(FK)
InquiryCount int
}
DimDate
{
Date_Key (PK, int),
FullDateUSA (char(10))
Date (datetime)
}
I have a well-defined process that I am building this star schema for. The fact table has a date field for each step of the process because I need to prepare funnel-style reports and activity reports. So the question really is:
Is this correct? Can a fact table refer to the same date dimension table multiple times?
The examples I see all over the internet seem to indicate this is correct, but I am having a hard time making it work with Pentaho reporting, so I am not sure whether it is a design problem or something I am doing incorrectly in Pentaho.
Yes, it is correct to refer to the date dimension multiple times.
Yes, a fact can refer to the same dimension multiple times. However, given only what I see in your example, I am not sure why you need the date dimension. The date in applicants is just a date and can be used as an attribute without referring to a separate date dimension. You would need a separate date dimension if, for example, (1) you want to ensure that only valid dates are used, (2) you want to elevate date to a full calendar in which other attributes describe a date, such as day of the week, weekday/weekend, or holiday, or (3) you want to roll up date to other levels, such as week, month, or year.
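To illustrate the point about referring to the same date dimension several times, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a cut-down version of the tables from the question (only two of the date keys, made-up rows) and joins DimDate once per role under a different alias. How Pentaho expects these aliases to be declared is not covered here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate (Date_Key INTEGER PRIMARY KEY, Date TEXT);
CREATE TABLE FactApplicants (
    Applicant_Key        INTEGER,
    Interviewed_Date_Key INTEGER REFERENCES DimDate(Date_Key),
    Cancel_Date_Key      INTEGER REFERENCES DimDate(Date_Key),
    InquiryCount         INTEGER
);
-- Sample rows are invented for illustration only.
INSERT INTO DimDate VALUES (1, '2017-01-10'), (2, '2017-02-03');
INSERT INTO FactApplicants VALUES (100, 1, 2, 3), (101, 1, NULL, 1);
""")

# One physical DimDate table, joined once per role under its own alias.
rows = conn.execute("""
SELECT f.Applicant_Key, interviewed.Date, cancelled.Date
FROM FactApplicants AS f
LEFT JOIN DimDate AS interviewed
       ON interviewed.Date_Key = f.Interviewed_Date_Key
LEFT JOIN DimDate AS cancelled
       ON cancelled.Date_Key = f.Cancel_Date_Key
ORDER BY f.Applicant_Key
""").fetchall()

for row in rows:
    print(row)
```

The LEFT JOINs keep applicants whose step has not happened yet (a NULL date key), which is the usual situation in an accumulating snapshot.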
Related
I'm working with a "slowly changing fact table" that keeps track of changes to a reservation by stamping each change with an effective start date and an effective end date. The latest change has an effective end date of 12/31/9999 23:59:59.999 so that it stays effective until a new change comes in. I have a second table called an "As-Of" date table that has every single date from 8/1/2008 to the current date with a timestamp of 23:59:59.999. I can join it back to the slowly changing fact where the "as-of" date is between the effective start and end dates in the fact. The purpose of this is to be able to go back to any specific date and see what the reservation looked like on that date. As you can imagine, each new as-of date adds another full set of joined rows, so the data grows quickly.
I've been tasked with creating an SSAS tabular model that has every "as-of" date available in a drop down for end users to select and be able to see data for that particular date. I'm concerned about storage and performance issues of having all as-of dates joined back to the fact table to provide end users the freedom to select any as-of date they want at any given time.
If I create a view that has the fact and as-of date table joined together, is it possible to pass the as-of dates they select in the drop down in SSAS back to a dynamic where clause in the view so that I am only joining the fact and as-of date tables dynamically for the as-of dates they actually need to see? Is it possible to create some type of "live" connection that only joins the fact and as-of date tables on the fly so I don't have to blow out the underlying data for each as-of date?
It seems as though the two tables will have to be joined in a SQL view before I bring it into SSAS since it doesn't seem possible to join two tables on more than one column in SSAS.
Can someone please tell me if this is something that is even technically possible? Or if you have any other ideas on the best way to represent this data in SSAS?
For some tables, you can use DirectQuery (live query) instead of Import.
DirectQuery Documentation
You can use a virtual relationship in your measure, where you can specify more than one column.
VirtualRelationship
Example:
CALCULATE (
    <target_measure>,
    TREATAS (
        SUMMARIZE (
            <lookup_table>,
            <lookup_granularity_column_1>,
            <lookup_granularity_column_2>
        ),
        <target_granularity_column_1>,
        <target_granularity_column_2>
    )
)
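For reference, the as-of lookup described in the question can be sketched in plain SQL, run here through Python's sqlite3 module. This is only an illustration of filtering the fact on one selected as-of date instead of materializing the join for every date. The table name reservation_fact, its columns, and the sample rows are assumptions, and the sketch says nothing about how SSAS would pass the selected date down to the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical slowly changing fact: one row per version of a reservation.
CREATE TABLE reservation_fact (
    reservation_id  INTEGER,
    status          TEXT,
    effective_start TEXT,
    effective_end   TEXT
);
INSERT INTO reservation_fact VALUES
    (1, 'booked',  '2008-08-01 00:00:00.000', '2008-08-10 12:00:00.000'),
    (1, 'changed', '2008-08-10 12:00:00.001', '9999-12-31 23:59:59.999');
""")

def reservations_as_of(conn, as_of):
    """Return the version of each reservation that was effective at as_of."""
    return conn.execute(
        """
        SELECT reservation_id, status
        FROM reservation_fact
        WHERE ? BETWEEN effective_start AND effective_end
        ORDER BY reservation_id
        """,
        (as_of,),
    ).fetchall()

# ISO-formatted timestamps compare correctly as text in SQLite.
before = reservations_as_of(conn, "2008-08-05 23:59:59.999")
after = reservations_as_of(conn, "2008-08-15 23:59:59.999")
print(before, after)
```

Only the versions effective on the requested date are read, so nothing has to be stored per as-of date.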
I'm currently designing a database and will employ a Date Dimension table that will contain the various possible groupings a user may wish to report on e.g. financial year, quarter etc. This will be role played through views to create versions for Order Date, Cancellation Date etc and then merged with other Dimension tables.
Common database design theory advocates a smart key of the form YYYYMMDD for joining to other tables; however, I wonder whether this is still valid. My understanding is that SQL Server, which is what I'm using, stores the date data type in three bytes, versus four bytes for an integer.
Assuming that I do not need rows that mark a date as "Unknown" or "Not Available" etc., is there any reason to use an integer smart key over date?
Thank you for your assistance.
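For readers unfamiliar with the YYYYMMDD smart key mentioned above, this small Python sketch shows how such an integer key maps to and from a calendar date. The function names are hypothetical and nothing here is specific to SQL Server.

```python
from datetime import date, datetime

def to_smart_key(d: date) -> int:
    """Encode a date as an integer of the form YYYYMMDD."""
    return d.year * 10000 + d.month * 100 + d.day

def from_smart_key(key: int) -> date:
    """Decode a YYYYMMDD integer back into a date."""
    return datetime.strptime(str(key), "%Y%m%d").date()

key = to_smart_key(date(2017, 8, 1))
print(key, from_smart_key(key))
```

Because the year comes first, the integer keys sort in the same order as the dates they encode.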
I am finding it difficult to understand how to get history from a fact table joined to a dimension that has Type 2 and Type 1 attributes for historic records that have changed. Currently I have a surrogate key and a business key in the dimension. The fact table holds the surrogate key, and I am currently using the SSIS Lookup component to bring back the row that has CurrentFlag set to Yes.
However, I am joining on the business key in the Lookup and returning the surrogate key. I know this is the main reason I can't get history; however, if I join on the business key as I am currently doing and also return the business key, the SSIS component will bring back just one row, regardless of how many versions of history exist for that business key.
I have been told to use lookups to populate fact tables, but this doesn't seem to give me the history, as it only returns one row regardless. So I just want to know how to return historic data between a fact and a dimension in SSIS.
Thank you
There are a few caveats when it comes to historical dimensions. Your end users will need to know what it is you are presenting, and understand the differences.
For example, consider the following scenario:
Customer A is located in Las Vegas in January 2017. They place an order for Product 123, which at that time costs $125.
Now, it's August. In the meantime, the Customer moved to Washington D.C. in May, and Product 123 was updated in July to cost $145.
Your end users will need to inform you what they want to see. In case you are not tracking history whatsoever, and simply truncate and load everything on a daily basis, your order report would show the following:
Customer A, located in Washington D.C. placed an order for $145 in January.
If you implement proper history tracking, with logic to identify the start and end date of each row in a dimension, you would join the fact table to the dimension using the natural key as well as the proper date interval. This should return a single dimension row for every row in the fact table. If it returns more, you have overlapping dates.
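The join on natural key plus date interval can be sketched as follows, using Python's sqlite3 module. Everything here is assumed for illustration: the table and column names (DimCustomer, StagedOrders, Valid_From, Valid_To), the half-open interval convention, and the exact day in May on which the customer moved. In SSIS the same condition would have to be expressed in the lookup query or in a staging join; that part is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical Type 2 dimension: one row per version of a customer.
CREATE TABLE DimCustomer (
    Customer_Key INTEGER PRIMARY KEY,  -- surrogate key
    Customer_BK  TEXT,                 -- business (natural) key
    City         TEXT,
    Valid_From   TEXT,                 -- inclusive
    Valid_To     TEXT                  -- exclusive
);
INSERT INTO DimCustomer VALUES
    (1, 'A', 'Las Vegas',       '2017-01-01', '2017-05-01'),
    (2, 'A', 'Washington D.C.', '2017-05-01', '9999-12-31');

-- Hypothetical staged fact rows that still carry the business key.
CREATE TABLE StagedOrders (Order_No INTEGER, Customer_BK TEXT, Order_Date TEXT);
INSERT INTO StagedOrders VALUES (10, 'A', '2017-01-15'), (11, 'A', '2017-08-02');
""")

# Match on business key AND on the interval that contains the order date,
# so each order picks up the surrogate key of the version valid at that time.
rows = conn.execute("""
SELECT s.Order_No, d.Customer_Key, d.City
FROM StagedOrders AS s
JOIN DimCustomer AS d
  ON d.Customer_BK = s.Customer_BK
 AND s.Order_Date >= d.Valid_From
 AND s.Order_Date <  d.Valid_To
ORDER BY s.Order_No
""").fetchall()

for row in rows:
    print(row)
```

With non-overlapping intervals each order matches exactly one dimension row; the January order resolves to the Las Vegas version and the August order to the Washington D.C. version.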
Can you show us the logic where you receive only a single value from the lookup, even though you have more records?
Please bear with me if this is a trivial question; I am a newbie.
I am in the design phase of an OLAP system where I need to show cost for a date range.
I have three other dimensions: product, vendor and language.
Should I add date as one more dimension?
My queries are mostly cost over a date range, like from 5-11-1997 to 01-09-2013.
What is the best way to do it?
You do need to add a Time Dimension. If all the Date/Time facts are just Dates (no Time part as in the example range) then you need to create a table/view which consists of a row for each Date in the domain range.
This table can also have extra fields for things like week, month, quarter, season, year that your users may be interested in querying. (If there are none of these, then just have one column with the date.)
You would need to tell the OLAP data model that this date column in the Time table is the PK, and that the dates in other tables are FKs to it. The OLAP engine will then allow this new Time Dimension to be used in queries just like any other dimension.
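As a rough relational sketch of the above, the following Python script (sqlite3 from the standard library) builds a small Time table with one row per date plus a few rollup columns, links a cost fact to it, and queries cost over a date range and by quarter. The table names DimTime and FactCost, the column names, and the sample costs are all assumptions; an OLAP engine would define the dimension in its own model rather than through ad hoc SQL.

```python
import sqlite3
from datetime import date, timedelta

def build_date_rows(start, end):
    """One row per calendar day: ISO date, year, month, quarter."""
    rows = []
    day = start
    while day <= end:
        rows.append((day.isoformat(), day.year, day.month, (day.month - 1) // 3 + 1))
        day += timedelta(days=1)
    return rows

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DimTime (date_value TEXT PRIMARY KEY, "
    "year INTEGER, month INTEGER, quarter INTEGER)"
)
date_rows = build_date_rows(date(1997, 1, 1), date(1997, 12, 31))
conn.executemany("INSERT INTO DimTime VALUES (?, ?, ?, ?)", date_rows)

conn.execute(
    "CREATE TABLE FactCost (date_value TEXT REFERENCES DimTime(date_value), "
    "product TEXT, cost REAL)"
)
# Sample costs are invented for illustration only.
conn.executemany(
    "INSERT INTO FactCost VALUES (?, ?, ?)",
    [("1997-11-05", "p1", 10.0), ("1997-11-20", "p1", 5.0), ("1997-12-24", "p2", 7.5)],
)

# Cost over a date range.
total = conn.execute(
    """
    SELECT SUM(f.cost)
    FROM FactCost AS f
    JOIN DimTime AS t ON t.date_value = f.date_value
    WHERE t.date_value BETWEEN ? AND ?
    """,
    ("1997-11-05", "1997-11-30"),
).fetchone()[0]

# The same fact rolled up by an attribute of the Time table.
by_quarter = conn.execute(
    """
    SELECT t.year, t.quarter, SUM(f.cost)
    FROM FactCost AS f
    JOIN DimTime AS t ON t.date_value = f.date_value
    GROUP BY t.year, t.quarter
    """
).fetchall()

print(total, by_quarter)
```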
I am working on a booking system for multiple properties, and the best-practice schema design is giving me a headache. Assume the site hosts, for example, 5000 properties, each of which is maintained by one user. Each property has a booking calendar. My current implementation is a two-table system with one table for the available dates and the other for the unavailable dates, each with a granularity of 1 day.
property_dates_available (property_id, date);
property_dates_booked (property_id, date);
However, I am unsure whether this is a good solution. In another question I read about a single-table solution with both states represented, but I wonder if it is a good idea to mix them. Also, should the booking calendar be mapped into the database table for a full year with all its 365 days, or would it be better to map only the days on which a property is available for booking? I am thinking of the number of rows increasing dramatically every year. I am also thinking of searching the database later for available properties, and I am not sure whether looking through 5000 * 365 rows might be a bad idea compared to, say, only 5000 * an average of 100 rows.
What would you generally recommend? Is this aspect ignorable? What is the best practice for implementing this?
I don't see why you need a separate table for available dates. If you have a table for booked dates (property_id, date), then you can easily query this table to find out which properties are available for a given date
select properties.property_name
from properties
where not exists
  (select 1
   from property_dates_booked
   where properties.property_id = property_dates_booked.property_id
   and property_dates_booked.date = :date)
:date being a parameter to the query
Only enter actual bookings into the property_dates_booked table (it would be easier to rename the table 'bookings'). If a property is not available for certain dates because of maintenance, then enter a booking for those dates where the customer is 'special' (maybe the 'customer' has a negative id).
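Here is a small runnable sketch of the availability query above, using Python's sqlite3 module. The property names and booked dates are invented, and the helper available_on is hypothetical; the composite primary key on (property_id, date) is an assumption that also gives the subquery an index to use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE properties (property_id INTEGER PRIMARY KEY, property_name TEXT);
CREATE TABLE property_dates_booked (
    property_id INTEGER REFERENCES properties(property_id),
    date        TEXT,
    PRIMARY KEY (property_id, date)
);
-- Sample rows are invented for illustration only.
INSERT INTO properties VALUES (1, 'Beach House'), (2, 'City Loft');
INSERT INTO property_dates_booked VALUES (1, '2024-07-01'), (1, '2024-07-02');
""")

def available_on(conn, day):
    """Names of properties with no booking row for the given day."""
    rows = conn.execute(
        """
        SELECT properties.property_name
        FROM properties
        WHERE NOT EXISTS
          (SELECT 1
           FROM property_dates_booked
           WHERE property_dates_booked.property_id = properties.property_id
             AND property_dates_booked.date = :date)
        ORDER BY properties.property_name
        """,
        {"date": day},
    ).fetchall()
    return [name for (name,) in rows]

print(available_on(conn, "2024-07-01"))
print(available_on(conn, "2024-07-03"))
```

Only booked days are stored, so the table grows with the number of bookings rather than with 365 rows per property per year.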