We have a requirement to come up with a strategy for showing sales revenue data weighted by date, with different weightings on different schedules.
We currently have a FactSales table with a grain of one row per order and a sales amount measure. We have separate DimDate and DimTime dimensions, and a DimBusinessUnit dimension with one row for each entity within the organization.
In DimDate we have a flag for the major US holidays, so we know reduced sales revenue may be expected on those dates. This flag applies globally.
The ask is that different business units may have different slow revenue days. For example, Mondays might be slow in one business unit and Fridays slow in another. For analysis it is desirable to capture these different schedules with a flag or a weighting.
Ultimately this would probably be reflected as a projected sales amount in a calculated measure.
How can I best add this weighting? Does it belong in the Date dimension, Business Unit dimension, or maybe a degenerate dimension in the Fact table, or something else altogether?
DimDate is probably not a good place to keep this information: each Business Unit (BU) may have a different schedule, so quite possibly you would need a flag on each date for every combination of BU and slow day. For example, if BU1 and BU2 both have a slow day on Monday, every Monday in your DimDate would need a way of showing that it is slow for both BU1 and BU2.
The BU dimension might be a better place, as the schedule is specific to each unit. You could extend the dimension with seven attributes, one per weekday, each flagged slow or not (e.g. true/false). You could also have a single bit-mask attribute, e.g. 0100000, where the position of the value corresponds to the day (M T W T F S S), 0 means not slow and 1 means slow; in this example Tuesday is a slow day.
This will also allow you to trace history, if you wish, by selecting a suitable SCD process.
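For illustration, a minimal sketch of the seven-flag variant (SQL Server syntax; the column names are my own invention, not from the original schema):

-- One slow-day flag per weekday on the business unit dimension.
ALTER TABLE DimBusinessUnit ADD
    SlowMonday    BIT NOT NULL DEFAULT 0,
    SlowTuesday   BIT NOT NULL DEFAULT 0,
    SlowWednesday BIT NOT NULL DEFAULT 0,
    SlowThursday  BIT NOT NULL DEFAULT 0,
    SlowFriday    BIT NOT NULL DEFAULT 0,
    SlowSaturday  BIT NOT NULL DEFAULT 0,
    SlowSunday    BIT NOT NULL DEFAULT 0;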
Another option may be a separate dimension, e.g. DimSchedule, together with a factless fact table:
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/factless-fact-table/
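A rough sketch of what that could look like (all names here are assumptions on my part): a factless fact table with one row per business unit per flagged date, plus an optional weight that a projected-sales calculated measure could multiply the sales amount by.

-- Hypothetical factless fact table for per-BU slow days.
CREATE TABLE FactBusinessUnitSchedule (
    DateKey         INT NOT NULL REFERENCES DimDate (DateKey),
    BusinessUnitKey INT NOT NULL REFERENCES DimBusinessUnit (BusinessUnitKey),
    SlowDayWeight   DECIMAL(5, 4) NOT NULL DEFAULT 1,  -- e.g. 0.6 = expect 60% of normal revenue
    PRIMARY KEY (DateKey, BusinessUnitKey)
);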
I hope this helps.
Your situation seems to be the same as the Multiple National Calendars problem described by Kimball:
http://www.kimballgroup.com/1998/12/think-globally-act-locally/
Where Kimball describes holidays in the left-most table, you could also add a "slow day" flag.
I am currently working on a web application that stores cooks' information in the user table. The application has a feature to search for cooks. If a cook is not available on May 3, 2016, we want to show a Not-Bookable or Not-Available message for that cook when a user searches for May 3, 2016. The solution we have come up with is to create a table named CooksAvailability with the following fields:
ID, //Primary key, auto increment
IDCook, //foreign key to user's table
Date, //date he is available on
AvailableForBreakFast, //bool field
AvailableForLunch, //bool field
AvailableForDinner, //bool field
BreakFastCookingPrice, //decimal nullable
LunchCookingPrice, //decimal nullable
DinnerCookingPrice //decimal nullable
With this schema we can tell whether a cook is available on a specific date. The problem with this approach is that it requires a lot of DB space: if a cook is available 280 days a year, it takes 280 rows to reflect just one cook's availability.
That is too much space, given that we may have thousands of cooks registered with our application. As you can see, there are separate CookingPrice fields for breakfast, lunch, and dinner: a cook can charge different rates for cooking on different dates and at different times.
Currently, we are looking for a smart solution that fulfils our requirements and consumes less space than our solution does.
You are storing a record for each day, and the main mistake that led you to this redundant design is that you did not separate the concepts enough.
I do not know whether a cook has an expected rate for a given meal, that is, a price one can assume in general if one has no additional information. If that is the case, then you can store these default prices in the table where you store the cooks.
Let's store the availability and the specific prices in different tables. If the availability rows do not have to carry prices, then you can store availability as intervals. In the other table, where you store the prices, you only need the prices that deviate from the expected price. So you will have availability intervals in one table, specific prices (where the price differs from the expected one) in another, and default meal prices in the cook table; if there is no special price, the default price is used.
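A sketch of that separation (every table and column name here is an assumption): defaults on the cook, availability as intervals, and a third table holding only the exceptional prices.

-- Default meal prices live with the cook.
CREATE TABLE Cook (
    CookId                INT PRIMARY KEY,
    DefaultBreakfastPrice DECIMAL(10, 2),
    DefaultLunchPrice     DECIMAL(10, 2),
    DefaultDinnerPrice    DECIMAL(10, 2)
);

-- Availability stored as intervals, one row per contiguous stretch.
CREATE TABLE CookAvailability (
    CookId   INT NOT NULL REFERENCES Cook (CookId),
    Meal     CHAR(1) NOT NULL,  -- 'B'reakfast, 'L'unch, 'D'inner
    FromDate DATE NOT NULL,
    ToDate   DATE NOT NULL
);

-- Only prices that deviate from the cook's default.
CREATE TABLE CookPriceOverride (
    CookId INT NOT NULL REFERENCES Cook (CookId),
    Meal   CHAR(1) NOT NULL,
    OnDate DATE NOT NULL,
    Price  DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (CookId, Meal, OnDate)
);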
To answer your question I should know more about the structure of the information.
For example, if most cooks are available over contiguous periods, it could be helpful to organize your availability table with avail_from_date and avail_to_date columns instead of a row for each day; this would reduce the number of rows.
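The availability check for a given date then becomes a simple range predicate, something like this sketch (the table and cook_id names are assumed):

-- Which cooks are available on 2016-05-03?
SELECT cook_id
FROM cook_availability
WHERE '2016-05-03' BETWEEN avail_from_date AND avail_to_date;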
The different prices for breakfast, lunch, and dinner would be better stored in the cooks table if the prices are not different each day. The same goes for breakfast, lunch, and dinner availability if it does not vary by day.
But if your information structure makes it necessary to keep a record for every cook for every day, that would be 365 * 280 = 102,200 records for a year, which is not very much for a SQL DB in my eyes. If you put the indexes in the right places, this will perform well.
There are a few questions that would help with the overall answer.
How often does availability change?
How often does price change?
Are there general patterns, e.g. cook X is available for breakfast and lunch, Monday - Wednesday each week?
Is there a normal availability / price over a certain period of time, but with short-term overrides / differences?
If availability and price change at different speeds, I would suggest you model them separately. That way you only need to show what has changed, rather than duplicating data that is constant.
Beyond that, there's a space / complexity trade-off to make.
At one extreme, you could have a hierarchy of configurations that override each other. So, for cook X there is set A, which says they can do breakfast Monday to Wednesday between dates 1 and 2. Then, also for cook X, there is set B, which says they can do lunch on Thursdays between dates 3 and 4. Assuming the dates run 1 -> 3 -> 4 -> 2, you can define whether set B overrides set A or adds to it. This is the most concise, but has quite a lot of business logic to work through to interpret it.
At the other extreme, you just say that for cook X, between dates 1 and 2, a single thing is true (an availability for a service, a price). You then find all the things that are true for a given date, possibly bringing in several separate records, e.g. a lunch availability for Monday, a lunch price for Monday, and so on.
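A sketch of that flatter extreme (the names are invented for illustration): each row states one thing that is true for a cook over a date range, and a search collects every row covering the requested date.

-- One row per fact: an availability or a price, valid over a range.
CREATE TABLE CookFact (
    CookId   INT NOT NULL,
    FactType VARCHAR(20) NOT NULL,  -- e.g. 'LUNCH_AVAILABLE', 'LUNCH_PRICE'
    Value    DECIMAL(10, 2) NULL,   -- NULL for pure availability facts
    FromDate DATE NOT NULL,
    ToDate   DATE NOT NULL
);

-- Everything that is true for cook 42 on a given date:
SELECT FactType, Value
FROM CookFact
WHERE CookId = 42
  AND '2016-05-03' BETWEEN FromDate AND ToDate;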
Please bear with me if this is a trivial question; I am a newbie.
I am in the design phase of an OLAP system where I need to show cost for a date range.
I have three other dimensions: product, vendor, and language.
Should I add date as one more dimension?
My queries are mostly for cost over a date range, e.g. from 5-11-1997 to 01-09-2013.
Which is the best way to do it?
You do need to add a Time dimension. If all the date/time facts are just dates (no time part, as in the example range), then you need to create a table/view with one row for each date in the domain range.
This table can also have extra fields for things like week, month, quarter, season, year that your users may be interested in querying. (If there are none of these, then just have one column with the date.)
You would then tell the OLAP data model that this date column in the Time table is the PK, and that the dates in other tables are FKs to it. The OLAP engine will then allow this new Time dimension to be used in queries just like any other dimension.
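As a sketch (SQL Server syntax; the table and column names here are my own, not prescribed by any OLAP engine), such a table can be generated once and then registered as the time dimension:

-- Hypothetical date table: one row per calendar date plus the
-- extra attributes users may want to query by.
CREATE TABLE DimDate (
    DateKey DATE PRIMARY KEY,
    WeekNo  INT NOT NULL,
    MonthNo INT NOT NULL,
    Quarter INT NOT NULL,
    Year    INT NOT NULL
);

-- Populate the domain range with a recursive CTE.
WITH d AS (
    SELECT CAST('1997-01-01' AS DATE) AS dt
    UNION ALL
    SELECT DATEADD(DAY, 1, dt) FROM d WHERE dt < '2013-12-31'
)
INSERT INTO DimDate (DateKey, WeekNo, MonthNo, Quarter, Year)
SELECT dt, DATEPART(WEEK, dt), MONTH(dt), DATEPART(QUARTER, dt), YEAR(dt)
FROM d
OPTION (MAXRECURSION 0);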
I have about 10 tables holding records with date ranges and some value belonging to each range.
Each table has its own meaning.
For example
rates
start_date DATE
end_date DATE
price DOUBLE
availability
start_date DATE
end_date DATE
availability INT
and then a dates table
day DATE
which holds one row per day for two years ahead.
The final result comes from joining these 10 tables to the dates table.
The query takes a bit longer, because there are some other joins and subqueries.
I have been thinking about creating one bigger table containing the data from all 10 tables for each day, but the final table would have about 1.5M - 2M records.
From testing it seems to be quicker (0.2s instead of about 1s) to search in this table instead of joining tables and searching in the joined result.
Is there any real reason why it should be bad idea to have a table with that many records?
The final table would look like
day DATE
price DOUBLE
availability INT
Thank you for your comments.
This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focused on one table at a time, I'd be tempted to say "leave well-enough alone". That is, don't make a change if it ain't broke. If your usage involved multiple updates to one type of record, I'd be inclined to leave them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are getting successive days at one time, you may find that having separate days in the underlying table greatly simplifies your queries. And, if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, giving room for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
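If you do reorganize, the combined table can be rebuilt from the source tables whenever they change. A sketch, assuming a target table named daily and the rates and availability tables from the question (and non-overlapping ranges within each source table):

-- Expand the range tables to one row per day by joining through dates.
INSERT INTO daily (day, price, availability)
SELECT d.day, r.price, a.availability
FROM dates AS d
LEFT JOIN rates AS r
       ON d.day BETWEEN r.start_date AND r.end_date
LEFT JOIN availability AS a
       ON d.day BETWEEN a.start_date AND a.end_date;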
I went down this road once and regretted it.
The fact that you project millions of rows tells me that dates from one table don't line up with dates from another table; this creates extra boundaries for some attributes, because in a single table all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had many more combinations to deal with, and the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date: my "super" table had to be recalculated from the separate tables whenever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours, except I had only 3 tables: availability, pricing, and margin. The fact was that the 3 were unrelated, so the date ranges never aligned, leading to lots of artificial rows in the big table.
I'm trying to figure out how I can create a calculated measure that counts only unique facts in my fact table. My fact table basically stores events from a historical perspective, but I need the measure to filter out redundant events.
Using sales as an example (since all material around OLAP uses sales in its examples):
The fact table stores sales EVENTS. When a sale is first made, it gets a unique sales reference, which is a column in the fact table. A unique sale can later be amended (items added or returned) or cancelled completely. The fact table stores these changes to a sale as separate rows.
If I create a count measure in SSAS, I get a count of all sales events, which means a unique sale is counted once for every change made to it (which in some reports is desirable). However, I also want a measure that counts unique sales rather than events, and not just by counting unique sales references: if the user filters by date, they should see the unique sales that still existed on that date (if a sale had been cancelled by that date, it should not be represented in the count at all).
How would I do this in MDX/SSAS? It seems I need the count to work from a subset produced by a query that finds the latest change to each sale along the time dimension.
In SQL it would be something like:
SELECT COUNT(*) FROM SalesFacts FACT1 WHERE Event <> 'Cancelled' AND
Timestamp = (SELECT MAX(Timestamp) FROM SalesFacts FACT2 WHERE FACT1.SalesRef = FACT2.SalesRef)
Is it possible, or even performant, to have subqueries in MDX?
In SSAS, create a measure based on the unique transaction ID (the sales number or order number), then set that measure's aggregate function to 'DistinctCount' in the Properties window.
Now it will count distinct order numbers under whichever dimension slice it finds itself.
The posted query could probably then be rewritten like this:
SELECT COUNT(DISTINCT SalesRef)
FROM SalesFacts
WHERE Event <> 'Cancelled'
A simple answer would be to have a 'sales count' column in your fact view / DSV query that supplies a 1 for the initial event, a 0 for all subsequent revisions, and a -1 if the event is cancelled. This 'journalling' approach plays nicely with incremental fact table loads.
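A sketch of that column in the fact view (the event names other than 'Cancelled' are assumptions; the question only shows that one):

-- Journalled sales count: +1 when the sale is first created, -1 when it
-- is cancelled, 0 for amendments; SUM() then yields the live-sales count.
SELECT SalesRef,
       Timestamp,
       CASE Event
           WHEN 'Created'   THEN 1   -- assumed name for the initial event
           WHEN 'Cancelled' THEN -1
           ELSE 0                    -- amendments and other revisions
       END AS SalesCount
FROM SalesFacts;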
Another approach, probably more useful in the long run, would be to have an Events dimension: you could then expose a calculated measure that was the count of the members in that dimension non-empty over a given measure in your fact table. However for sales this is essentially a degenerate dimension (a dimension based on a fact table) and might get very large. This may be inappropriate.
Sometimes the requirements may be more complicated. If you slice by time, do you need to know all the distinct events that existed then, even if they were later cancelled? That starts to get tricky: there's a recent post on Chris Webb's blog where he talks about one (slightly hairy) solution:
http://cwebbbi.wordpress.com/2011/01/22/solving-the-events-in-progress-problem-in-mdx-part-2role-playing-measure-groups/
I have the following requirement:
Sales Officer: Bob

        Week1  Week2  Week3  .................  Week52
Prod1    10     15     12    .................    14
Prod2    20     14     10    .................    17
  .
  .
  .
Sales supervisor will set the targets for each sales officer on weekly basis.
The sales officer may enter actual sales on a daily basis for each product, through a similar grid, against the set targets, e.g.:
Edit
In the above case the supervisor has set a target of 10 units for week 1. The sales officer then enters sales on a daily basis as 1, 2, 0, 1, 3, 2 = 9 (actual sales for week 1), so against the target of 10 units he has sold 9 units in week one.
I have already created the Employee and Product tables. Can anyone advise on the best practice for storing days and weeks in the database, against which the targets are stored and actual sales can be recorded?
I am thinking of storing the data in the following table:
EmpSales (EmployeeID, ProductID, SaleTarget, ActualSale, Date, WeekNo, Month)
Thanks in advance
This one is really easy in pure Relational modelling terms. I do not see the need for "denormalisation" of any kind.
Sales Data Model.
If you are unfamiliar with the Standard for modelling Relational databases, the IDEF1X Notation may be helpful.
Pure 5NF; full Declarative referential Integrity; no Nulls, no Update Anomalies; no GROUP BYs; pure Date arithmetic.
The SaleTarget is compared against SaleActual by projection, and may be in the same result set.
If you have Monthly and Annual Sales accounting, the extension required is a common calendar table with a bit of control or structure; eg. similar to Week, including rows for each Month and Year. Just let me know, and I will update the model.
I say 5NF because that is the minimum I provide in order to eliminate Update Anomalies, and most modellers are familiar with it. But if it does not scare you off, the two Sales tables are actually Sixth Normal Form.
This allows full Pivoting (weeks or months across the top; Products or Employees down the side; vice versa; any combination) without temporary tables or complex SQL. (Just ask.)
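For instance, a weeks-across-the-top pivot is a plain projection. A sketch only; the quantity and calendar column names are my guesses at the model, not its actual DDL:

-- Products down the side, weeks across the top, no temporary tables.
SELECT s.ProductId,
       SUM(CASE WHEN d.WeekNo = 1 THEN s.Quantity ELSE 0 END) AS Week1,
       SUM(CASE WHEN d.WeekNo = 2 THEN s.Quantity ELSE 0 END) AS Week2,
       SUM(CASE WHEN d.WeekNo = 3 THEN s.Quantity ELSE 0 END) AS Week3
FROM SaleActual AS s
JOIN Date AS d
  ON d.Date = s.Date
GROUP BY s.ProductId;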
I think it may even be self-explanatory, but I will supply the Verb Phrases which spell out the Business Rules, only because there are three Parents involved in each:
Each Employee is scheduled SaleTarget of Product for Week
Each Product is scheduled SaleTarget By Employee for Week
Each Employee did SaleActual of Product on Day
Each Product did SaleActual by Employee on Day
Comparison
I should have mentioned: notice there is no vertical (row) or horizontal (column) duplication. When columns are duplicated, e.g. StartDate and EndDate, you have broken 3NF (introduced Functional Dependencies) and introduced an Update Anomaly: the EndDate in any row is the StartDate of the next row (and making it one second earlier still counts as a dupe; it is a contrivance), so when updating, two rows instead of one have to be changed. More important, this structure is so simple (it is not a Time Series or "temporal" requirement) that the EndDate is not required.
Response to Comments
The Data Model has been updated to include the Month and Year requirements. You now need a CHECK constraint on SaleTarget to ensure that DateType is W for week. Loading the Date table is simple; you do not need the manually repeated cut-and-paste code posted on SQLTeam, which is famous for being sub-standard.
The SaleActual table now contains Daily, Weekly, Monthly, and Annual values, which of course you summarise programmatically on the first day of each Week, Month, and Year. First add the new row to Date.
5NF is pretty much the minimum required for standards compliance these days, so you need to get used to it. Basically there was a lot of argument among the academics (plus places like Wikipedia posting completely incorrect entries) about the NFs between 3NF and 5NF. The short and sweet definition of 5NF is that it is what 3NF was intended to be: zero data duplication, zero Update Anomalies (no duplicated columns to be updated transactionally).
Forget about 6NF for now. Any table that is in 6NF is in 5NF (and 4NF, BCNF, and 3NF). Just treat the two Sales tables as 5NF. When you have to write a pivoted report, say a year from now, that is when you will realise the value of this structure.
I personally would store the targets and actuals in separate rows, and most probably in separate tables:
Targets:
EmployeeId, PeriodId, ProductId, TargetValue
Sales:
EmployeeId, PeriodId, ProductId, SalesValue
In fact, in an integrated system, the second table is usually unnecessary (assuming that you have a complete sales recording system, this should be a projection/view of the actual recorded sales - with appropriate assignment of employee, period and product based on the model of that subsystem).
In order to fit your calendar requirements, I would almost certainly have a date table which will allow you to ensure all your various business rules for definitions of weeks and months without complex date logic. Determining periods and aggregating is then just facilitated with joins to the calendar table.
So the ActualSales would look something like this (with just a generic Period table, which might itself be a period and date table):
SELECT sp.EmployeeId
, p.ProductId
, pd.PeriodType
, pd.PeriodId
, SUM(id.Quantity * id.UnitPrice) AS TotalSales
FROM Invoice AS i
INNER JOIN InvoiceDetail AS id
ON id.InvoiceId = i.InvoiceId
INNER JOIN Employee AS sp
ON sp.EmployeeId = i.SalesPersonId
INNER JOIN Product AS p
ON id.ProductId = p.ProductId
INNER JOIN Period AS pd
ON pd.StartDate <= i.InvoiceDate
AND pd.EndDate > i.InvoiceDate
GROUP BY sp.EmployeeId, p.ProductId, pd.PeriodType, pd.PeriodId
In this case, data would be duplicated if you had overlapping periods (like daily, weekly, monthly), so you would need to aggregate ONLY one type of period - that's why I've specifically included it in this example view although it's redundant here.
I expect a generic Period table would look like:
PeriodId
PeriodType
StartDate
EndDate
This would be prepopulated with the various periods you want to report on:
'Q', 1/1/2010, 4/1/2010
'M', 1/1/2010, 2/1/2010
'M', 2/1/2010, 3/1/2010
'M', 3/1/2010, 4/1/2010
'W', 1/3/2010, 1/10/2010
'W', 1/10/2010, 1/17/2010
etc.
'D', 1/1/2010, 1/2/2010
'D', 1/2/2010, 1/3/2010
etc.
It makes very little sense to worry about holidays, except that you probably aren't going to assign a target for days someone isn't working; this is mainly about managing the assignments so that they are realistic. You can have a calendar table of days with various flags:
Calendar
DateId
Date
IsHoliday
Then you can include that table in your joins, e.g. to count the number of holidays/weekends in a period.
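For example, to count the working days available in each period (a sketch using the Calendar fields above and the Period table from earlier):

-- Days per period minus flagged holidays = working days.
SELECT p.PeriodId,
       COUNT(*) - SUM(CASE WHEN c.IsHoliday = 1 THEN 1 ELSE 0 END) AS WorkingDays
FROM Period AS p
JOIN Calendar AS c
  ON c.Date >= p.StartDate
 AND c.Date <  p.EndDate
GROUP BY p.PeriodId;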
This is typically an accounting/business thing, but you may want to look into standardizing your calendar. For instance, in media buys for TV advertising, they make each "quarter" equal and make each "month" standardized - 4 weeks, 4 weeks, 5 weeks. Obviously they make exceptions for holiday and special TV events, but this helps to smooth out the accounting and compare like periods more easily.
Personally I would go for a more generic "period" table.
period(periodId,startDate,endDate,weekNo,Month,Year)
and then add
empSales(EmployeeId,ProductId,SaleTarget,ActualSale,periodId).
This is a bit more flexible: you can easily introduce different time spans and either make the relative week field null or define a rule that maps the period onto a "standard" week. There is also less redundancy (note how the month and week have moved out of the empSales table), and it supports reporting and calculation. (BTW, you didn't include a Year field; is there a reason?)
Tallying things up should be easier too: assuming you have sales recorded by day, summing them between intervals is straightforward, and you avoid duplicating the "week" field all over the DB.
Note also that you can easily have targets on different, overlapping periods.
For example, you can set a weekly target for the week of 22-28 November (I am using the European convention of having the week start on Monday) and a special one-day period for Black Friday.
So:
Period:
periodId|startDate |endDate |weekNo|Month | Year|
0030020|22-NOV-2010|28-NOV-2010| 43 |November | 2010 |
0030026|26-NOV-2010|26-NOV-2010| null |November | 2010 |
empSales:
EmployeeId|ProductId|SaleTarget|ActualSale|periodId|
567689| 788585| 58 | 42 | 0030020|
567689| 788585| 28 | 32 | 0030026|
Note how Employee 567689 missed his weekly target but managed to go over his Black Friday target.
BTW, while working on this example, I realised you would do better to drop the "empSales" table and rename it "empTargets":
empTargets(EmployeeId,ProductId,SaleTarget,periodId).
because the actual sales figure is easily calculated on the fly with a UDF, or placed in a view; after all (joining through the period table, since the dates now live there), it is just a
select sum(s.items_sold)
from sales s
join period p on p.periodId = empTargets.periodId
where s.employeeId = empTargets.employeeId and
      s.productId = empTargets.productId and
      s.saleDate between p.startDate and p.endDate
so no need to store it directly in the table (in fact it could become a burden in case of returned items or other future corrections).
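Placed in a view, that could look like this (a sketch; the view name and the sales table's column names are assumed):

-- Targets joined to their period, with actuals computed on the fly.
create view empTargetActuals as
select t.employeeId, t.productId, t.periodId, t.saleTarget,
       (select sum(s.items_sold)
        from sales s
        where s.employeeId = t.employeeId
          and s.productId = t.productId
          and s.saleDate between p.startDate and p.endDate) as actualSale
from empTargets t
join period p on p.periodId = t.periodId;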