I have a table that holds flight schedule data. Every schedule has an effective_from and an effective_to date. I load this table from a flat file that doesn't provide these dates, so at load time I ask the user for them.
Suppose the user gave the current date as the from date and 31st March as the to date. Then on 1st March the user loads a new flight schedule, giving the current date as the from date and 31st May as the to date.
If I query the table for an effective date between 1st March and 31st March, the query returns two records for each flight, whereas I want only one record per flight, and it should be the latest one.
How do I do this? Should I handle it in the query, or check and correct the data while loading?
You need to identify the primary key for the data (which might be a 'business' key). There must be something that uniquely identifies each flight schedule (it sounds like it shouldn't include effective_from). Once that key is established, you check for it when importing and then either update the existing record or insert a new one.
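A minimal sketch of that update-or-insert step, using Python's built-in sqlite3 module. It assumes flight_number alone is the business key and that dates are stored as ISO strings; the table layout and the load_schedule helper are illustrative, not taken from the question. Note that updating in place overwrites the earlier schedule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE schedule (
        flight_number  TEXT PRIMARY KEY,  -- assumed business key
        effective_from TEXT NOT NULL,     -- ISO dates (YYYY-MM-DD), an assumption
        effective_to   TEXT NOT NULL
    )
    """
)


def load_schedule(conn, flight_number, effective_from, effective_to):
    """Update the row for this flight if it exists, otherwise insert it."""
    cur = conn.execute(
        "UPDATE schedule SET effective_from = ?, effective_to = ? "
        "WHERE flight_number = ?",
        (effective_from, effective_to, flight_number),
    )
    if cur.rowcount == 0:  # no existing row matched the key
        conn.execute(
            "INSERT INTO schedule (flight_number, effective_from, effective_to) "
            "VALUES (?, ?, ?)",
            (flight_number, effective_from, effective_to),
        )


load_schedule(conn, "XYZ12", "2009-01-01", "2009-03-31")
load_schedule(conn, "XYZ12", "2009-03-01", "2009-05-31")  # same key: updates
rows = conn.execute("SELECT * FROM schedule").fetchall()
```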
I assume that each flight has some unique ID; otherwise, how could you tell them apart? Then you can add an extra field "Active" to the schedule table.
When loading a new schedule, first query the existing records with the same flight ID and set them to Active=false. Enter the new record with Active=true.
The query is then simple: select * from schedule where active=1
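The Active-flag idea could look roughly like this with Python's sqlite3 module. The column names and the load_schedule helper are assumptions for illustration; the deactivate-then-insert pair runs in one transaction so a failed load does not leave a flight without an active row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE schedule (
        schedule_id    INTEGER PRIMARY KEY,
        flight_number  TEXT NOT NULL,
        effective_from TEXT NOT NULL,   -- ISO dates (YYYY-MM-DD), an assumption
        effective_to   TEXT NOT NULL,
        active         INTEGER NOT NULL DEFAULT 1
    )
    """
)


def load_schedule(conn, flight_number, effective_from, effective_to):
    """Deactivate older rows for the flight, then insert the new active row."""
    with conn:  # commits both statements together, or rolls both back
        conn.execute(
            "UPDATE schedule SET active = 0 "
            "WHERE flight_number = ? AND active = 1",
            (flight_number,),
        )
        conn.execute(
            "INSERT INTO schedule "
            "(flight_number, effective_from, effective_to, active) "
            "VALUES (?, ?, ?, 1)",
            (flight_number, effective_from, effective_to),
        )


load_schedule(conn, "XYZ12", "2009-01-01", "2009-03-31")
load_schedule(conn, "ABC12", "2009-01-01", "2009-04-30")
load_schedule(conn, "XYZ12", "2009-03-01", "2009-05-31")

active_rows = conn.execute(
    "SELECT flight_number, effective_from, effective_to "
    "FROM schedule WHERE active = 1 ORDER BY flight_number"
).fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM schedule").fetchone()[0]
```

Older schedules stay in the table with active = 0, so history is kept, but the flag only answers "what is current", not "what was in force on a given date".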
I developed this solution, but I am looking for an even better one if possible.
Table Schedule {
scheduleId, flightNumber, effective_from,effective_to
}
Data in Schedule table {
1, XYZ12, 01/01/2009, 31/03/2009
2, ABC12, 01/01/2009, 30/04/2009
}
Now the user loads another record: 3, XYZ12, 01/03/2009, 31/05/2009
select scheduleId from Schedule where flightNumber = 'XYZ12' and effective_from <= '31/05/2009' and effective_to >= '01/03/2009'
If the above query returns any result, that means there is an overlap and I should show an error to the user.
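A minimal sketch of the overlap check, using the standard interval test: two ranges overlap when each one starts on or before the day the other ends. This form also catches a new range that fully contains an existing one. It assumes ISO date strings (so that text comparison matches date order) and inclusive end dates; find_overlaps is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Schedule (
        scheduleId     INTEGER PRIMARY KEY,
        flightNumber   TEXT NOT NULL,
        effective_from TEXT NOT NULL,  -- ISO dates so text comparison sorts correctly
        effective_to   TEXT NOT NULL
    );
    INSERT INTO Schedule VALUES (1, 'XYZ12', '2009-01-01', '2009-03-31');
    INSERT INTO Schedule VALUES (2, 'ABC12', '2009-01-01', '2009-04-30');
    """
)


def find_overlaps(conn, flight_number, new_from, new_to):
    """Return ids of existing schedules for the flight that overlap the new range."""
    rows = conn.execute(
        "SELECT scheduleId FROM Schedule "
        "WHERE flightNumber = ? "
        "AND effective_from <= ? "   # existing range starts before the new one ends
        "AND effective_to >= ? "     # existing range ends after the new one starts
        "ORDER BY scheduleId",
        (flight_number, new_to, new_from),
    ).fetchall()
    return [row[0] for row in rows]


overlapping = find_overlaps(conn, "XYZ12", "2009-03-01", "2009-05-31")
if overlapping:
    print("Overlap with schedule(s):", overlapping)
```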
The problem description and the comment on one of the suggestions give the business rules:
querying flights with an effective date should return only one record per flight
the record returned should be the latest record
previous schedules must be kept in the table
The key to the answer is how to determine which is the latest record. The simplest answer would be to add a column that records the timestamp when the row is inserted. Now when you query for a flight and a given effective date, you just take the result with the latest inserted timestamp (which can be done by ordering on that timestamp with ORDER BY ... DESC and taking the first row returned).
You can do a similar query with just the effective date and return all flights - again, for each flight you just want to return the row that includes the effective date, but with the largest timestamp. There's a neat trick for finding the maximum of something on a group-by-group basis - left join the results with themselves so that left < right, then the maximum is the left value where the right is null. The author of High Performance MySQL gives a simple example of this.
It's much easier than trying to retroactively correct the older schedules - and, by the sound of things, the older schedules have to be kept intact to satisfy your business requirements. It also means you can answer historical questions - you can always find out what your schedule table used to look like on a given date - which means it's very handy when generating reports such as "This month's schedule changes" and so on.
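The self-join trick described above can be sketched as follows, using Python's sqlite3 module. The inserted_at column and the sample timestamps are assumptions; the LEFT JOIN looks for a newer row covering the same day for the same flight, and a row is the latest exactly when no such newer row exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE schedule (
        schedule_id    INTEGER PRIMARY KEY,
        flight_number  TEXT NOT NULL,
        effective_from TEXT NOT NULL,
        effective_to   TEXT NOT NULL,
        inserted_at    TEXT NOT NULL   -- assumed load timestamp column
    );
    INSERT INTO schedule VALUES
        (1, 'XYZ12', '2009-01-01', '2009-03-31', '2009-01-01 09:00:00'),
        (2, 'ABC12', '2009-01-01', '2009-04-30', '2009-01-01 09:00:00'),
        (3, 'XYZ12', '2009-03-01', '2009-05-31', '2009-03-01 09:00:00');
    """
)

LATEST_PER_FLIGHT = """
SELECT s.schedule_id, s.flight_number
FROM schedule AS s
LEFT JOIN schedule AS newer
       ON newer.flight_number = s.flight_number
      AND :day BETWEEN newer.effective_from AND newer.effective_to
      AND newer.inserted_at > s.inserted_at
WHERE :day BETWEEN s.effective_from AND s.effective_to
  AND newer.schedule_id IS NULL      -- no newer row covers this day
ORDER BY s.flight_number
"""


def schedules_on(conn, day):
    """Return (schedule_id, flight_number) of the latest schedule per flight."""
    return conn.execute(LATEST_PER_FLIGHT, {"day": day}).fetchall()


mid_march = schedules_on(conn, "2009-03-15")
february = schedules_on(conn, "2009-02-01")
```

On 15 March both XYZ12 rows cover the date, and only the later load (schedule 3) is returned; in February only schedule 1 applies, so it is still returned.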
Related
I'm working with a "slowly changing fact table" that keeps track of changes to a reservation by stamping each change with the effective start and effective end dates for which it applies. The latest change has an effective end date of 12/31/9999 23:59:59.999, which keeps it effective until a new change comes in. I have a second table, called an "As-Of" date table, that has every single date from 8/1/2008 to the present with a timestamp of 23:59:59.999; I can join it back to the slowly changing fact wherever the "as-of" date is between the effective start and end dates in the fact. The purpose of this is to be able to go back to any specific date and see what the reservation looked like on that date. As you can imagine, each new as-of date adds considerably more data.
I've been tasked with creating an SSAS tabular model that has every "as-of" date available in a drop down for end users to select and be able to see data for that particular date. I'm concerned about storage and performance issues of having all as-of dates joined back to the fact table to provide end users the freedom to select any as-of date they want at any given time.
If I create a view that has the fact and as-of date table joined together, is it possible to pass the as-of dates they select in the drop down in SSAS back to a dynamic where clause in the view so that I am only joining the fact and as-of date tables dynamically for the as-of dates they actually need to see? Is it possible to create some type of "live" connection that only joins the fact and as-of date tables on the fly so I don't have to blow out the underlying data for each as-of date?
It seems as though the two tables will have to be joined in a SQL view before I bring it into SSAS since it doesn't seem possible to join two tables on more than one column in SSAS.
Can someone please tell me if this is something that is even technically possible? Or if you have any other ideas on the best way to represent this data in SSAS?
For some tables, you can use DirectQuery (live query) instead of Import.
DirectQuery Documentation
You can use a virtual relationship in your measure (where you can specify more than one column).
VirtualRelationship
example:
CALCULATE (
    <target_measure>,
    TREATAS (
        SUMMARIZE (
            <lookup_table>,
            <lookup_granularity_column_1>,
            <lookup_granularity_column_2>
        ),
        <target_granularity_column_1>,
        <target_granularity_column_2>
    )
)
I have table with
fields: Id, ChId, ChValue, ChLoggingDate
The data is saved to the database every minute. I need a query that checks whether data exists for every minute of a particular weekday throughout the year. That is, for all Mondays in 2013: if a day has complete data, calculate the arithmetic mean over the year's Mondays.
You will need a table, or a table-valued function, that creates a timestamp for every minute; then you can join with that and check where the original table has no data.
That is the only way; any other query can only work with the data it has.
This is a common approach in reporting.
http://www.kodyaz.com/t-sql/create-date-time-intervals-table-in-sql-server.aspx
explains how; you just have to expand it to a times table. If you separate date and time, you can get away with a time table (00:00:00 to 23:59:59) and a separate date table.
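A small sketch of the "join against a minute table" idea, using Python's sqlite3 module rather than T-SQL. It assumes ChLoggingDate is stored truncated to the minute as 'YYYY-MM-DD HH:MM' text and checks only two sample Mondays for ChId 1; the minute_slots table and helper are illustrative names.

```python
import sqlite3
from datetime import date, datetime, timedelta


def minute_slots(day):
    """Return the 1440 'YYYY-MM-DD HH:MM' strings for one calendar day."""
    start = datetime.combine(day, datetime.min.time())
    return [
        (start + timedelta(minutes=i)).strftime("%Y-%m-%d %H:%M")
        for i in range(24 * 60)
    ]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ch_log (Id INTEGER PRIMARY KEY, ChId INTEGER, "
    "ChValue REAL, ChLoggingDate TEXT)"
)
conn.execute("CREATE TABLE minute_slots (slot TEXT PRIMARY KEY)")

# Sample data: the first Monday is complete, the second misses 12:00.
samples = {date(2013, 1, 7): 10.0, date(2013, 1, 14): 20.0}
for day, value in samples.items():
    slots = minute_slots(day)
    conn.executemany("INSERT INTO minute_slots VALUES (?)", [(s,) for s in slots])
    conn.executemany(
        "INSERT INTO ch_log (ChId, ChValue, ChLoggingDate) VALUES (1, ?, ?)",
        [(value, s) for s in slots if s != "2013-01-14 12:00"],
    )

# Minutes that exist in the calendar but have no logged value.
MISSING_SQL = """
SELECT m.slot
FROM minute_slots AS m
LEFT JOIN ch_log AS l ON l.ChLoggingDate = m.slot AND l.ChId = 1
WHERE l.Id IS NULL
"""
missing = conn.execute(MISSING_SQL).fetchall()

# Mean over days that have no missing minute at all.
complete_mean = conn.execute(
    "SELECT AVG(ChValue) FROM ch_log WHERE ChId = 1 "
    "AND substr(ChLoggingDate, 1, 10) NOT IN "
    "(SELECT substr(slot, 1, 10) FROM (" + MISSING_SQL + "))"
).fetchone()[0]
```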
I am trying to make sense of dimensional modeling. While reading a dimensional modeling book, I created a star schema.
The fact table is an accumulating snapshot table, and it has multiple date columns that are linked to a date dimension using a surrogate key.
FactApplicants
{
Interview_No_Show_Date_Key (FK)
Cancel_Date_Key (FK)
Interviewed_Date_Key (FK)
. ....
Applicant_Key(FK)
InquiryCount int
}
DimDate
{
Date_Key (PK, int),
FullDateUSA (char(10))
Date (datetime)
}
I have a well-defined process that I am trying to model with this star schema. The fact table has a date field for each step of the process, because I need to prepare funnel-style reports and activity reports. So the question really is:
Is this correct? Can a fact table refer to the same date dimension table multiple times?
The examples I am seeing all over the internet seem to indicate this is correct, but I am having a hard time making it work with Pentaho reporting, so I am not sure whether it's a design problem or something I am doing incorrectly in Pentaho.
Yes, it is correct to refer to the date dimension multiple times.
Yes, a fact can refer to the same dimension multiple times. However, given only what I see in your example, I am not sure why you need the date dimension. The date in applicants is just a date and can be used as an attribute without referring to a separate date dimension. It's just the attribute "date". You would need a separate date dimension if, for example, (1) you want to ensure that only valid dates are used, or (2) you want to elevate date to a full calendar in which other attributes are used to describe a date, such as day of the week, weekday/weekend, holiday, etc. or (3) you want to rollup date to other levels, such as week, month, year.
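In plain SQL, referring to the same date dimension several times just means joining it once per date column under a different alias. A minimal sketch with Python's sqlite3 module; the sample rows, the FullDate column, and the reduced column set are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DimDate (
        Date_Key INTEGER PRIMARY KEY,
        FullDate TEXT NOT NULL
    );
    CREATE TABLE FactApplicants (
        Applicant_Key        INTEGER NOT NULL,
        Interviewed_Date_Key INTEGER REFERENCES DimDate (Date_Key),
        Cancel_Date_Key      INTEGER REFERENCES DimDate (Date_Key),
        InquiryCount         INTEGER NOT NULL
    );
    INSERT INTO DimDate VALUES (20130107, '2013-01-07');
    INSERT INTO DimDate VALUES (20130110, '2013-01-10');
    -- Applicant 2 has not cancelled, so that date key is still empty.
    INSERT INTO FactApplicants VALUES (1, 20130107, 20130110, 1);
    INSERT INTO FactApplicants VALUES (2, 20130110, NULL, 2);
    """
)

# One join per role: the same DimDate table under two aliases.
rows = conn.execute(
    """
    SELECT f.Applicant_Key,
           interviewed.FullDate AS interviewed_on,
           cancelled.FullDate   AS cancelled_on
    FROM FactApplicants AS f
    LEFT JOIN DimDate AS interviewed
           ON interviewed.Date_Key = f.Interviewed_Date_Key
    LEFT JOIN DimDate AS cancelled
           ON cancelled.Date_Key = f.Cancel_Date_Key
    ORDER BY f.Applicant_Key
    """
).fetchall()
```

A reporting tool typically needs the same thing expressed in its own terms: one alias (or view) of the date table per date role.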
I need to create a database to manage a gas station.
I'm thinking of a basic product inventory and sales data model, but it needs some changes.
See http://www.databaseanswers.org/data_models/inventory_and_sales/index.htm. This is how they proceed: the manager takes stock of inventory and sales twice a day, and each time a gas pump attendant is in charge and takes responsibility for the sales.
How can I keep track of this?
Using the model that you provided, you could use the first model as a reference.
I would use all six (6) tables, namely:
1) Products
2) Product_Types
3) Product_In_Sales
4) Sales
5) Daily_Inventory_Level
6) Ref_Calendar
But I would make some changes by altering and adding:
First, I would include a SalesPerson table that has at least the following fields:
1) SalesPersonID
2) Lastname
3) Firstname
4) Alias
In line with that, I would add SalesPersonID as a foreign key in the Sales table.
Now, since you want inventory taken twice a day, you could approach it in several ways:
you could add a single-column primary key to the Daily_Inventory_Level table, or you could add a new field named Inventory_Daily_Flag that takes only the value 1 or 2. A 1 means the first inventory of the day and a 2 means the second. That also means the key of Daily_Inventory_Level (primary and foreign at the same time) would no longer be just Day_Date and ProductID but would include Inventory_Daily_Flag as well.
In line with that, you also need to add a field to Product_In_Sales, such as FlagForInventory, with a Boolean data type.
So, let's say a supervisor comes in to do the first inventory. The products sold in Product_In_Sales for the day would then be flagged as True in FlagForInventory and transferred to Daily_Inventory_Level with Inventory_Daily_Flag set to 1 to indicate the first inventory, and of course the Level would also be updated.
When the day ends and the second inventory is taken, the day's sales in Product_In_Sales whose FlagForInventory is still False would be flagged as True and transferred to Daily_Inventory_Level with Inventory_Daily_Flag set to 2, indicating the second inventory.
And of course you need to update the Level as well.
Does that make sense? If not, I could always change the approach. ;-)
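The composite key with Inventory_Daily_Flag could be sketched like this, using Python's sqlite3 module. Only the Daily_Inventory_Level table is shown, the CHECK constraint and the record_inventory helper are assumptions, and the sample stock levels are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Daily_Inventory_Level (
        Day_Date             TEXT    NOT NULL,
        ProductID            INTEGER NOT NULL,
        Inventory_Daily_Flag INTEGER NOT NULL
            CHECK (Inventory_Daily_Flag IN (1, 2)),  -- 1 = first count, 2 = second
        "Level"              INTEGER NOT NULL,
        PRIMARY KEY (Day_Date, ProductID, Inventory_Daily_Flag)
    )
    """
)


def record_inventory(conn, day, product_id, flag, level):
    """Insert one inventory count; return False if the key or flag is rejected."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Daily_Inventory_Level VALUES (?, ?, ?, ?)",
                (day, product_id, flag, level),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = record_inventory(conn, "2013-01-07", 1, 1, 500)      # first count of the day
second = record_inventory(conn, "2013-01-07", 1, 2, 320)     # second count
duplicate = record_inventory(conn, "2013-01-07", 1, 2, 300)  # same key again
invalid = record_inventory(conn, "2013-01-07", 1, 3, 300)    # flag outside 1..2
```

With this key the database itself enforces "at most two counts per product per day", instead of relying on the application to do so.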
I'm trying to figure out how I can create a calculated measure that produces a count of only unique facts in my fact table. My fact table basically stores events from a historical perspective. But I need the measure to filter out redundant events.
Using sales as an example (since all material around OLAP always uses sales in examples):
The fact table stores sales EVENTS. When a sale is first made, it has a unique sales reference, which is a column in the fact table. A unique sale, however, can be amended (items added or returned) or completely cancelled. The fact table stores these changes to a sale as different rows.
If I create a count measure using SSAS, I get a count of all sales events, which means a unique sale will be counted once for every change made to it (which in some reports is desirable). However, I also want a measure that produces a count of unique sales rather than events, and not just by counting unique sales references. If the user filters by date, they should see the unique sales that still exist on that date (if a sale was cancelled by that date, it should not be represented in the count at all).
How would I do this in MDX/SSAS? It seems like I need the count to work on a subset produced by a query that finds the latest change to each sale based on the time dimension.
In SQL it would be something like:
SELECT COUNT(*) FROM SalesFacts FACT1 WHERE Event <> 'Cancelled' AND
Timestamp = (SELECT MAX(Timestamp) FROM SalesFacts FACT2 WHERE FACT1.SalesRef=FACT2.SalesRef)
Is it possible, or even performant, to have subqueries in MDX?
In SSAS, create a measure that is based on the unique transaction ID (the sales number or order number), then set that measure's aggregate function to 'DistinctCount' in the properties window.
Now it should count distinct order numbers, under whichever dimension slice it finds itself under.
The posted query could probably be rewritten like this:
SELECT COUNT(DISTINCT SalesRef)
FROM SalesFacts
WHERE Event <> 'Cancelled'
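Whether this rewrite gives the same number as the query in the question depends on the data. The sketch below, using Python's sqlite3 module with made-up sample rows, shows one case where the two differ: a sale that was created and later cancelled still has a non-cancelled row, so the DISTINCT version counts it while the latest-row version does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE SalesFacts (SalesRef TEXT, Event TEXT, Timestamp TEXT);
    -- S1 is created and amended; S2 is created and later cancelled.
    INSERT INTO SalesFacts VALUES ('S1', 'Created',   '2011-01-01 10:00:00');
    INSERT INTO SalesFacts VALUES ('S1', 'Amended',   '2011-01-05 10:00:00');
    INSERT INTO SalesFacts VALUES ('S2', 'Created',   '2011-01-02 10:00:00');
    INSERT INTO SalesFacts VALUES ('S2', 'Cancelled', '2011-01-10 10:00:00');
    """
)

# Query from the question: count sales whose latest event is not a cancellation.
latest_row_count = conn.execute(
    """
    SELECT COUNT(*) FROM SalesFacts FACT1
    WHERE Event <> 'Cancelled'
      AND Timestamp = (SELECT MAX(Timestamp) FROM SalesFacts FACT2
                       WHERE FACT1.SalesRef = FACT2.SalesRef)
    """
).fetchone()[0]

# Rewritten query: count sales that have at least one non-cancelled event.
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT SalesRef) FROM SalesFacts WHERE Event <> 'Cancelled'"
).fetchone()[0]

print(latest_row_count, distinct_count)
```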
A simple answer would be just to have a 'sales count' column in your fact view / DSV query that supplies a 1 for an 'initial' event, a 0 for all subsequent revisions to the event, and a -1 if the event is cancelled. This 'journalling' approach plays nicely with incremental fact table loads.
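A rough sketch of the journalling column, using Python's sqlite3 module. The event names, sample rows, and the sales_as_of helper are assumptions; the point is that a plain SUM over the +1 / 0 / -1 column, filtered by date, gives the number of sales still alive on that date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE SalesFacts (SalesRef TEXT, Event TEXT, EventDate TEXT);
    INSERT INTO SalesFacts VALUES ('S1', 'Created',   '2011-01-01');
    INSERT INTO SalesFacts VALUES ('S2', 'Created',   '2011-01-02');
    INSERT INTO SalesFacts VALUES ('S1', 'Amended',   '2011-01-05');
    INSERT INTO SalesFacts VALUES ('S2', 'Cancelled', '2011-01-10');

    -- Fact view that adds the journalling column.
    CREATE VIEW SalesFactsView AS
    SELECT SalesRef, Event, EventDate,
           CASE Event
               WHEN 'Created'   THEN 1    -- initial event
               WHEN 'Cancelled' THEN -1   -- cancellation removes the sale
               ELSE 0                     -- any later revision
           END AS SalesCount
    FROM SalesFacts;
    """
)


def sales_as_of(conn, day):
    """Number of sales that exist (created and not yet cancelled) on a date."""
    return conn.execute(
        "SELECT COALESCE(SUM(SalesCount), 0) FROM SalesFactsView "
        "WHERE EventDate <= ?",
        (day,),
    ).fetchone()[0]


early_january = sales_as_of(conn, "2011-01-06")  # both sales still exist
end_of_january = sales_as_of(conn, "2011-01-31")  # S2 has been cancelled
```

Because the column is additive, the same measure can use an ordinary Sum aggregation in the cube rather than a distinct count.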
Another approach, probably more useful in the long run, would be to have an Events dimension: you could then expose a calculated measure that was the count of the members in that dimension non-empty over a given measure in your fact table. However for sales this is essentially a degenerate dimension (a dimension based on a fact table) and might get very large. This may be inappropriate.
Sometimes the requirements may be more complicated. If you slice by time, do you need to know all the distinct events that existed then, even if they were later cancelled? That starts to get tricky: there's a recent post on Chris Webb's blog where he talks about one (slightly hairy) solution:
http://cwebbbi.wordpress.com/2011/01/22/solving-the-events-in-progress-problem-in-mdx-part-2role-playing-measure-groups/