Implementing Date Range in OLAP systems

Please bear with me if this is a trivial question; I am a newbie.
I am in the design phase of an OLAP system where I need to show cost for a date range.
I have three other dimensions: product, vendor, and language.
Should I add date as one more dimension?
My queries are mostly cost over a date range, e.g. from 5-11-1997 to 01-09-2013.
What is the best way to do this?

You do need to add a Time dimension. If all the date/time facts are just dates (no time part, as in the example range), then you need to create a table/view with a row for each date in the domain range.
This table can also have extra fields for things like week, month, quarter, season, and year that your users may be interested in querying. (If there are none of these, then just have one column with the date.)
You would need to tell the OLAP data model that this date column in the Time table is the PK, and that the dates in other tables are FKs to it. The OLAP engine will then allow this new Time dimension to be used in queries just like any other dimension.
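For illustration, such a date table might look like this in generic SQL (a minimal sketch; the table and column names are just examples):
CREATE TABLE dim_date (
    date_key    DATE NOT NULL PRIMARY KEY,  -- dates in the fact tables are FKs to this
    year_no     INT  NOT NULL,
    quarter_no  INT  NOT NULL,              -- 1..4
    month_no    INT  NOT NULL,              -- 1..12
    week_no     INT  NOT NULL,              -- e.g. ISO week number
    season      VARCHAR(10)                 -- e.g. 'winter'
);
A date-range query then becomes a simple filter on the join to this table, e.g. WHERE dim_date.date_key BETWEEN :start_date AND :end_date.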

Related

Power BI Aggregation of End Tables

I am new to Power BI and database management and I want to clarify for myself how Power BI works, in reference to my last two questions (Database modelling Bridge Table, Power BI Report Bridge Table). I have a main_table with firm-specific information for each year, which is connected to an end_table that contains some quantitative information (e.g. sales data). The tables are modelled as a 1:N relationship so that I do not have to store the same values twice, which I thought was a good thing to do in data modelling.
I want to aggregate the value column of the end table over the group column Year. To my surprise, Power BI sums up the value column within the end table, when I would expect the aggregation over the group variable in the connected tables.
My basic example is based on this data and data model (you need to adjust the relationship manually):
# Firm-year rows; FK_id links each row to an end_table id (N:1)
main_table <- data.frame(id = 1:20, FK_id = sample(1:2, 20, replace = TRUE), Jahre = 2016:2020)
main_table <- rbind(main_table, data.frame(id = 21:25, FK_id = sample(2:3, 5, replace = TRUE), Jahre = 2015))
# One value per end_table id
end_table <- data.frame(id = 1:3, value = c(10, 20, 30))
If I take out all row-specific information and sum over value, it will always show the sum of the end table, which is 60, in each Year.
Making the connection bi-directional does not help; it just sums up the existing values of the end_table in each year. I get the correct results if I add the value column to the main table using Related value = RELATED(end_table[value]).
I am just wondering if there is another way to model or analyse this 1:N relationship in Power BI. This comes up frequently, and it feels a bit tedious to always add the column to the main table using RELATED() when it would be intuitive to just click both columns and expect the aggregation to be based on the grouping variable.
In any case, just asking this and my other two questions helped me a lot.
This is a bit of a weird modeling situation (even though it's not terribly uncommon). In general, it's handy to build star schemas where you have dimension tables in 1:N relationships to fact table(s).
In this setup, the items from the dimension tables (e.g. year or customer) are used in the columns and rows of a visual, and measures generally aggregate columns from the fact table (e.g. sales amount).
Your example inverts this: you are trying to sum over a column in your end table using the year as a dimension. As a result, it does not automatically behave as you'd expect.
In order to get the result you want, where Year is treated as a dimension, you need a measure that iterates main_table, which carries the Year attribute. You can write
SumValue = SUMX ( main_table, RELATED ( end_table[value] ) )
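Here SUMX iterates over the rows of main_table that are visible in the current filter context (the rows belonging to each Year on the axis), and RELATED follows the N:1 relationship to fetch the matching end_table[value] for each row, so a value is counted once per main_table row rather than once per end_table row.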

Smart Date Key Vs Date Data Type in Date Dimension Table

I'm currently designing a database and will employ a Date Dimension table that will contain the various possible groupings a user may wish to report on, e.g. financial year, quarter, etc. This will be role-played through views to create versions for Order Date, Cancellation Date, etc., and then merged with other dimension tables.
Common database design theory advocates a smart key of the form YYYYMMDD that is used to merge with other tables; however, I wonder if this is still valid. My understanding is that SQL Server, which is what I'm using, stores the Date data type in three bytes, versus four bytes for an integer.
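For illustration, the two candidate designs look like this (a hedged sketch; table and column names are just examples, and you would pick one of the two):
-- Option A: integer smart key of the form YYYYMMDD
CREATE TABLE DimDateSmart (
    DateKey  INT PRIMARY KEY,  -- e.g. 20240315
    FullDate DATE NOT NULL
);
-- Option B: the DATE type itself as the key (3 bytes vs 4 for INT in SQL Server)
CREATE TABLE DimDateNative (
    DateKey DATE PRIMARY KEY
);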
Assuming that I do not require records that would identify dates that are "Unknown" or "Not Available" etc., is there any reason to use an integer smart key over a date?
Thank you for your assistance.

Can an accumulating snapshot table have multiple dates in it?

I am trying to make sense of dimensional modeling. While reading a dimensional modeling book, I have created a star schema.
The fact table is an accumulating snapshot table, and it has multiple date columns which are linked to a date dimension using surrogate keys.
FactApplicants
{
    Interview_No_Show_Date_Key (FK),
    Cancel_Date_Key (FK),
    Interviewed_Date_Key (FK),
    ...
    Applicant_Key (FK),
    InquiryCount int
}
DimDate
{
    Date_Key (PK, int),
    FullDateUSA (char(10)),
    Date (datetime)
}
I do have a well-defined process for which I am trying to make this star schema. I have a date field in the fact table for each of these steps, as I need to prepare funnel-like reports and activity reports. So the question really is:
Is this correct? Can a fact table refer to the same date dimension table multiple times?
The examples I am seeing all over the internet seem to indicate this is correct, but I am having a hard time making it work with Pentaho reporting, so I am not sure if it is a design problem or something I am not doing correctly in Pentaho.
Yes, it is correct to refer to the date dimension multiple times.
Yes, a fact can refer to the same dimension multiple times. However, given only what I see in your example, I am not sure why you need the date dimension. The date in applicants is just a date and can be used as an attribute without referring to a separate date dimension. You would need a separate date dimension if, for example, (1) you want to ensure that only valid dates are used, (2) you want to elevate date to a full calendar in which other attributes describe a date, such as day of the week, weekday/weekend, or holiday, or (3) you want to roll dates up to other levels, such as week, month, or year.
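In case it helps, a query that uses the date dimension in multiple roles typically joins it once per role, along these lines (a sketch based on the table names above):
SELECT a.InquiryCount,
       d_cancel.Date    AS CancelDate,
       d_interview.Date AS InterviewedDate
FROM FactApplicants a
JOIN DimDate d_cancel    ON d_cancel.Date_Key = a.Cancel_Date_Key
JOIN DimDate d_interview ON d_interview.Date_Key = a.Interviewed_Date_Key;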

Database table structure for price list

I have about 10 tables holding records with date ranges and some value belonging to each range.
Each table has some meaning.
For example
rates
    start_date DATE
    end_date DATE
    price DOUBLE
availability
    start_date DATE
    end_date DATE
    availability INT
and then a dates table
    day DATE
which holds a row for each day for 2 years ahead.
The final result comes from joining these 10 tables to the dates table.
The query takes a while, because there are some other joins and subqueries involved.
I have been thinking about creating one bigger table containing the data from all 10 tables for each day, but the final table would have about 1.5M-2M records.
From testing, it seems quicker (0.2 s instead of about 1 s) to search this table than to join the tables and search the joined result.
Is there any real reason why it would be a bad idea to have a table with that many records?
The final table would look like
    day DATE
    price DOUBLE
    availability INT
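For what it's worth, building such a per-day table from the range tables could look roughly like this (a sketch using the example tables above; exact syntax varies by engine, and it assumes the ranges within each table do not overlap):
CREATE TABLE daily_data AS
SELECT d.day,
       r.price,
       a.availability
FROM dates d
LEFT JOIN rates r        ON d.day BETWEEN r.start_date AND r.end_date
LEFT JOIN availability a ON d.day BETWEEN a.start_date AND a.end_date;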
Thank you for your comments.
This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focus on one table at a time, I'd be tempted to say "leave well enough alone". That is, don't make a change if it ain't broke. If your usage involves multiple updates to one type of record, I'd be inclined to keep them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are fetching successive days at one time, you may find that having one row per day in the underlying table greatly simplifies your queries. And if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, allowing for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
I went down this road once and regretted it.
The fact that you have a projection of millions of rows tells me that dates from one table don't line up with dates from another table. This forces extra boundaries onto some attributes, because once everything is in one table, all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had a lot more combinations to deal with, and the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date: my "super" table had to be recalculated from the separate tables whenever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours, except I had only 3 tables: availability, pricing and margin. The fact was that the 3 were unrelated, so date ranges never aligned, leading to lots of artificial rows in the big table.

Best practice database schema for property booking calendar

I am working on a multiple-property booking system and am getting a headache over the best-practice schema design. Assume the site hosts, for example, 5000 properties, each of which is maintained by one user. Each property has a booking calendar. My current implementation is a two-table system, with one table for the available dates and the other for the unavailable dates, at a granularity of 1 day each.
property_dates_available (property_id, date);
property_dates_booked (property_id, date);
However, I feel unsure whether this is a good solution. In another question I read about a single-table solution with both states represented, but I wonder if it is a good idea to mix them up. Also, should the booking calendar map a full year with all its 365 days per property into the database table, or would it be better to map only the days a property is available for booking? I think of the dramatically increasing number of rows every year. I also think of later searching the database for available properties, and am not sure whether looking through 5000 * 365 rows might be a bad idea compared to, say, only 5000 * (on average) 100 rows.
What would you generally recommend? Is this aspect ignorable? What is the best-practice implementation?
I don't see why you need a separate table for available dates. If you have a table for booked dates (property_id, date), then you can easily query it to find out which properties are available on a given date:
select properties.property_name
from properties
where not exists
      (select 1
       from property_dates_booked
       where property_dates_booked.property_id = properties.property_id
         and property_dates_booked.date = :date)
:date being a parameter to the query
Only enter actual bookings into the property_dates_booked table (it would be easier to rename the table 'bookings'). If a property is not available for certain dates because of maintenance, then enter a booking for those dates where the customer is 'special' (maybe the 'customer' has a negative id).
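A minimal sketch of that renamed table (names are illustrative, matching the lowercase style of the query above):
create table bookings (
    property_id integer not null references properties (property_id),
    booked_date date    not null,
    customer_id integer not null,  -- a negative id can mark a maintenance block
    primary key (property_id, booked_date)
);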
