I am a rookie to DW. I have a Customer table with basic columns that rarely change, like Name, JoinedOn, etc., and another set of columns that can change over time, like "Status", "CustomerType", "PublishStatus", "BusinessStatus", "CurrentOwner", etc. At the moment there is no history. In the DW I would like to track when the following columns change: "Status", "CustomerType", "PublishStatus", "BusinessStatus", "CurrentOwner". I feel it would be better if I create another table to track these; the table will have the following columns:
"CustomerId", "Status","CustomerType","PublishStatus","BusinessStatus","CurrentOwner", "ExpiredOn","IsCurrent"
Is this the right approach? And if so, is this new table a fact table or a slowly changing dimension? I would like to run queries like: when did the CustomerType change from A to B? When was it Published? When the BusinessStatus changed, who was the owner?
Which way of modeling the customers is better really depends on how you are going to use the corresponding dimension. If you wanted, for example, to summarize some sales associated with the "Customer" dimension by "CustomerType" at the time of the sale, you could only do that if you keep the historical details as part of a slowly changing dimension.
You probably can run a lot of customer reports right on that table that represents a slowly changing "Customer" dimension. But if the number of your customers gets into the millions, you'd be better off creating a separate fact table (or tables) for customer status changes.
So, to summarize: start off with a slowly changing dimension. If the number of customers grows too large and reports on customer status changes become too slow, add a fact table for them and don't worry about duplicating data.
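For example, a minimal sketch of such a Type 2 dimension (the surrogate CustomerKey and EffectiveFrom columns are assumptions I'm adding on top of the columns in the question; they make the "when did it change" questions directly answerable):

    -- Type 2 slowly changing dimension: one row per version of a customer.
    CREATE TABLE DimCustomerHistory (
        CustomerKey    INT IDENTITY PRIMARY KEY,  -- surrogate key (assumed)
        CustomerId     INT NOT NULL,              -- business key
        Status         VARCHAR(50),
        CustomerType   VARCHAR(50),
        PublishStatus  VARCHAR(50),
        BusinessStatus VARCHAR(50),
        CurrentOwner   VARCHAR(100),
        EffectiveFrom  DATETIME NOT NULL,         -- assumed: when this version began
        ExpiredOn      DATETIME NULL,             -- NULL while the row is current
        IsCurrent      BIT NOT NULL
    );

    -- "When did CustomerType change from A to B?" becomes a self-join on
    -- consecutive versions (this assumes contiguous intervals, i.e. each
    -- row's ExpiredOn equals the next row's EffectiveFrom):
    SELECT curr.CustomerId, curr.EffectiveFrom AS ChangedOn
    FROM DimCustomerHistory curr
    JOIN DimCustomerHistory prev
      ON prev.CustomerId = curr.CustomerId
     AND prev.ExpiredOn  = curr.EffectiveFrom
    WHERE prev.CustomerType = 'A'
      AND curr.CustomerType = 'B';

"When was it Published?" and "who was the owner when BusinessStatus changed?" work the same way, since every version row carries all five tracked columns plus its effective dates.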
I'm starting to track a host of variables around my life (QuantifiedSelf). I have a lot of input sources, and I'm working on sticking it all into a database. I plan on using this database with R to ask arbitrary questions about my life ("Which routes are the fastest to work?", "What foods affect my mood?", etc.).
The key question I'm trying to answer here is "Do I process the input before sticking it into the database?"
Examples of "process":
Some of my input is a list of moods (one for each day). As of right now, there are only 5 available moods (a name with a rating between -2 and 2). Do I normalize this data and create two tables: a Mood table (with 5 items) and a DailyMood table?
If I process the data, then I lose the original data. Perhaps I change a mood to have a different name. If I do this in a normalized database, then I lose the information that, before the change, I had a mood called "oldName".
If I don't process the data, then I have duplication of data
Another input is a list of GPS locations (lat, long). However, most of my day is spent in a single spot, or spent driving. Do I process this data to create two tables "Locations" and "Routes"?
If I don't process the data, then I have a whole bunch of duplicate locations (at different timestamps), which is difficult to query and get good data out of.
If I process the data, then I lose the original data. I end up with a nice set of Locations and Routes that is easy to query, but if those locations or routes are wrong, I would have to redownload the input source and rebuild the database.
However, I feel like I'm stuck between two opposing "ideals":
If I process the data, then I don't have the original data.
If I don't process the data, then I have duplicate, hard to use data.
I've considered storing both the original and the calculated. This feels like I'm getting the worst of both worlds: Some of my tables aren't original, and would need a full recalculation if they are wrong, while other tables are original but hard to use and have duplicate data.
To some of the points in the comments: I think which data you store depends on the needs of your application, and I would approach each set of data through a use-case lens.
For the first use case, mood data, it sounds like there is value in being able to see this data over time (e.g., it appears that over the last month my mood has been improving) as well as to pull up individual events (e.g., on date x I ate a hamburger; how did this affect my mood in the subsequent mood entry after date x?).
If it were me, I would create a Mood table, with two attributes:
Name
Id (pk)
This table would essentially serve as a definition table. Here you could add attributes specific to the mood (such as description).
I would then create a MoodHistory table with the following attributes:
Timestamp
MoodId
IsCurrent (Boolean)
Before you enter a mood in your application, run UPDATE MoodHistory SET IsCurrent = 0 WHERE IsCurrent = 1, and then insert your new record with IsCurrent = 1. This structure is normalized, and by indexing or partitioning on the IsCurrent column (and honestly, even without any indexing/partitioning), you should always be able to query the current mood super quickly, even as your table grows quite large.
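A runnable sketch of that pattern, assuming SQL Server syntax (the example mood id is a placeholder):

    -- Definition table and history table as described above.
    CREATE TABLE Mood (
        Id   INT IDENTITY PRIMARY KEY,
        Name VARCHAR(50) NOT NULL
    );

    CREATE TABLE MoodHistory (
        [Timestamp] DATETIME NOT NULL DEFAULT GETDATE(),
        MoodId      INT NOT NULL REFERENCES Mood(Id),
        IsCurrent   BIT NOT NULL
    );

    -- Recording a new mood: expire the current row, then insert the new one.
    -- Wrapped in a transaction so there is never more than one current row.
    DECLARE @NewMoodId INT = 3;  -- placeholder: id of the mood being recorded
    BEGIN TRANSACTION;
        UPDATE MoodHistory SET IsCurrent = 0 WHERE IsCurrent = 1;
        INSERT INTO MoodHistory (MoodId, IsCurrent) VALUES (@NewMoodId, 1);
    COMMIT;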
For your second use case, this is quite dependent not only on your planned usage, but where the data is coming from (particularly for routes). I'm not sure how you are planning on grouping locations into "routes" but if you clarify in the comments, I'm happy to add to my answer.
For locations however, I'm assuming you're taking a location snapshot at some set time interval. I would create a LocationSnapshot table, structured similarly to the MoodHistory table, with the following attributes:
Timestamp
Latitude
Longitude
IsCurrent
By processing your IsCurrent data in a similar way to your MoodHistory data, it should be quite straightforward to grab the last entered location. You could also do some additional processing if you want to avoid duplicates. Essentially, before updating IsCurrent, query the row where IsCurrent = 1. Then compare that record's Latitude and Longitude to your new Latitude and Longitude before inserting the new record. If there is any change, proceed with the insert; otherwise, there is no need to insert a new record.
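A sketch of that duplicate check (again assuming SQL Server syntax; the incoming coordinates are placeholder values):

    CREATE TABLE LocationSnapshot (
        [Timestamp] DATETIME NOT NULL DEFAULT GETDATE(),
        Latitude    DECIMAL(9,6) NOT NULL,
        Longitude   DECIMAL(9,6) NOT NULL,
        IsCurrent   BIT NOT NULL
    );

    -- Only record a new snapshot if the position actually changed.
    DECLARE @NewLat  DECIMAL(9,6) = 38.897700,   -- placeholder: incoming reading
            @NewLong DECIMAL(9,6) = -77.036500;

    DECLARE @LastLat DECIMAL(9,6), @LastLong DECIMAL(9,6);
    SELECT @LastLat = Latitude, @LastLong = Longitude
    FROM LocationSnapshot
    WHERE IsCurrent = 1;

    IF @LastLat IS NULL OR @LastLat <> @NewLat OR @LastLong <> @NewLong
    BEGIN
        UPDATE LocationSnapshot SET IsCurrent = 0 WHERE IsCurrent = 1;
        INSERT INTO LocationSnapshot (Latitude, Longitude, IsCurrent)
        VALUES (@NewLat, @NewLong, 1);
    END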
You could also create a table of known locations such as KnownLocation:
Latitude
Longitude
Name
Joining to this table ON Latitude and Longitude should tell you when you were spending time at a particular location, say "Home" vs. "Work".
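For example (an exact-coordinate match, which assumes snapshots at a known place always report identical coordinates; in practice GPS jitter may force you to round or match within a tolerance):

    SELECT ls.[Timestamp], kl.Name
    FROM LocationSnapshot ls
    JOIN KnownLocation kl
      ON kl.Latitude  = ls.Latitude
     AND kl.Longitude = ls.Longitude
    ORDER BY ls.[Timestamp];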
I am finding it difficult to understand how you get the history data from a fact table joined to a dimension that has Type 2 and Type 1 attributes for historic records that have changed. Currently I have a Surrogate Key and a Business Key in the dimension. The fact table has the Surrogate Key, and I am currently using the SSIS Lookup component to bring back the row that has the CurrentFlag set to Yes.
However, I am joining on the Business Key in the Lookup and returning the Surrogate Key, which I know is the main reason I can't get history. But even if I join on the Business Key as I am currently doing and return the Business Key as well, the SSIS component will only bring back one row, regardless of how many versions of history you have against that Business Key.
I have been told to use lookups to populate fact tables; however, this doesn't seem to really give me the history, as it will only return one row regardless. So I just want to know how to return historic data between a fact table and a dimension in SSIS.
Thank you
There are a few caveats when it comes to historical dimensions. Your end users will need to know what it is you are presenting, and understand the differences.
For example, consider the following scenario:
Customer A is located in Las Vegas in January 2017. They place an order for Product 123, which at that time costs $125.
Now, it's August. In the meantime, the Customer moved to Washington D.C. in May, and Product 123 was updated in July to cost $145.
Your end users will need to inform you what they want to see. If you are not tracking history at all, and simply truncate and load everything on a daily basis, your order report would show the following:
Customer A, located in Washington D.C., placed an order for $145 in January.
If you implement proper history tracking, with logic to identify the start and end date of each row in a dimension, you would join the fact table to the dimension using the natural key as well as the proper date interval. This should return a single dimension row for every record in the fact table. If it returns more, you have overlapping date intervals.
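In T-SQL, that join looks roughly like this (table and column names are assumptions):

    -- Resolve each fact row to the dimension version that was valid
    -- on the order date: natural key plus date interval.
    SELECT f.OrderId, f.OrderDate, d.CustomerSK, d.City
    FROM FactOrders f
    JOIN DimCustomer d
      ON d.CustomerBK = f.CustomerBK
     AND f.OrderDate >= d.StartDate
     AND f.OrderDate <  COALESCE(d.EndDate, '9999-12-31');

In SSIS, one common way to apply the same logic is to put the date conditions into a custom lookup query, or to do the join in the source query itself, rather than relying on the default equality-only Lookup.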
Can you show us the logic where you receive only a single value from the lookup, even though you have more records?
A debate has started at work regarding a table design and auditing changes. We have a stock table that contains trucks we sell. The table has columns like mileage, location, price and stockdate, to name just a few.
The database is OLTP, so there are quite a few reads and updates when changes happen to an item of stock.
I'm happy to leave the table alone and have a shadow table auditing any inserts and updates. However, it's been suggested to move most of the stock columns into separate tables and to make these tables into slowly changing dimensions.
Personally I prefer all the data in one row. It seems a lot of hassle to join on 10 tables to bring back one stock record. And the updates will be a pain, because you'd have to check whether each property has changed its value, do an insert if it has, and update the previous entry's End Date. This can't be good for performance, can it?
Wouldn't it be better, if you wanted that level of auditing, to leave the table alone and move the data to an OLAP database?
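For reference, the shadow-table approach I have in mind is roughly the following (a sketch, with the column list trimmed down):

    -- Shadow table mirroring the audited columns, plus audit metadata.
    CREATE TABLE StockAudit (
        AuditId   INT IDENTITY PRIMARY KEY,
        StockId   INT NOT NULL,
        Mileage   INT,
        Location  VARCHAR(100),
        Price     DECIMAL(10,2),
        StockDate DATE,
        AuditedOn DATETIME NOT NULL DEFAULT GETDATE(),
        AuditType CHAR(1) NOT NULL  -- 'I' = insert, 'U' = update
    );

    -- Trigger that copies every insert/update into the shadow table.
    CREATE TRIGGER trg_Stock_Audit ON Stock
    AFTER INSERT, UPDATE
    AS
    BEGIN
        INSERT INTO StockAudit (StockId, Mileage, Location, Price, StockDate, AuditType)
        SELECT i.StockId, i.Mileage, i.Location, i.Price, i.StockDate,
               CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
        FROM inserted i;
    END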
I don't see the point in denormalizing your table just for auditing purposes.
You might want to look into change data capture to see if it solves your issue without further serious changes:
http://technet.microsoft.com/en-us/library/cc280519(v=sql.105).aspx
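For example, a minimal sketch of enabling CDC (SQL Server; the Stock table name is taken from the question, everything else is defaults):

    -- Enable change data capture at the database level, then on the table.
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Stock',
        @role_name     = NULL;  -- no gating role here; restrict access in production

    -- Changes then become queryable through the generated function, e.g.
    -- cdc.fn_cdc_get_all_changes_dbo_Stock(@from_lsn, @to_lsn, N'all').

This tracks every insert/update/delete without touching the table's structure, which fits the "leave the table alone" preference.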
I recently realized that I add some form of row-creation timestamp, and possibly an "updated on" field, to most of my tables. Suddenly I started thinking that perhaps every table in the database should have created and modified fields that are set in the model behind the scenes.
Does this sound correct? Are there any types of high-load tables (like sessions) or massive sized tables that this wouldn't be a good idea for?
I wouldn't put those fields (which I generally call audit fields) on every database table. If it's a low-traffic, high-value table (like Users, for instance), it goes on, no question. I'd also add creator and modifier. If it's a table that gets hit a lot (an operation history table, say), then maybe the benefit isn't worth the cost of increased insert time and storage space.
It's a call you'll need to make separately for each table.
Obviously, there isn't a single rule.
Most of my tables have date-related things: DateCreated, DateModified, and occasionally a Revision to track changes, and so on. Do whatever makes sense. Clearly, you can invent cases where it's appropriate and cases where it is not. If you're asking whether you should add them "by default" to most tables, I'd say "probably".
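If you do add them by default, a typical shape looks like this (SQL Server syntax; the Widget table is a made-up example, and a trigger is only one of several ways to maintain DateModified):

    CREATE TABLE Widget (
        WidgetId     INT IDENTITY PRIMARY KEY,
        Name         VARCHAR(100) NOT NULL,
        DateCreated  DATETIME NOT NULL DEFAULT GETDATE(),
        DateModified DATETIME NOT NULL DEFAULT GETDATE()
    );

    -- Keep DateModified current on every update. (An AFTER UPDATE trigger
    -- updating its own table does not re-fire itself unless the
    -- RECURSIVE_TRIGGERS database option is on, which is off by default.)
    CREATE TRIGGER trg_Widget_Modified ON Widget
    AFTER UPDATE
    AS
    BEGIN
        UPDATE w SET DateModified = GETDATE()
        FROM Widget w
        JOIN inserted i ON i.WidgetId = w.WidgetId;
    END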
(Sorry about the vagueness of the title; I can't think how to really say what I'm looking for without writing a book.)
So in our app, we allow users to change key pieces of data. I'm keeping records of who changed what, and when, in a log schema, but now the problem presents itself: how do I best represent that data in a view for reporting?
An example will help: a customer's data (say, billing address) changed on 4/4/09. Let's say that today, 10/19/09, I want to see all of their 2009 orders, before and after the change. I also want each order to display the billing address that was current as of the date of the order.
So I have 4 tables:
Orders (with order data)
Customers (with current customer data)
CustomerOrders (linking the two)
CustomerChange (which holds the date of the change, who made the change (employee id), what the old billing address was, and what they changed it to)
How do I best structure a view to be used by reporting so that the proper address is returned? Or am I better served by creating a reporting database and denormalizing the data there, which is what the reports group is requesting?
There is no need for a separate DB if this is the only thing you are going to do. You could just create a denormalized table/cube and populate and retrieve from it. If your data is voluminous, apply proper indexes to this table.
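For instance, a sketch of populating such a denormalized table from the tables in the question (ChangeDate, OldBillingAddress, and the other column names are assumptions):

    -- Each order row carries the billing address that was current on the
    -- order date, reconstructed from the change log.
    SELECT
        o.OrderId,
        o.OrderDate,
        c.CustomerId,
        COALESCE(cc.OldBillingAddress, c.BillingAddress) AS BillingAddressAtOrder
    INTO ReportCustomerOrders
    FROM Orders o
    JOIN CustomerOrders co ON co.OrderId = o.OrderId
    JOIN Customers c       ON c.CustomerId = co.CustomerId
    OUTER APPLY (
        -- the earliest change made AFTER the order date: its "old" value is
        -- the address that was in effect when the order was placed
        SELECT TOP (1) ch.OldBillingAddress
        FROM CustomerChange ch
        WHERE ch.CustomerId = co.CustomerId
          AND ch.ChangeDate > o.OrderDate
        ORDER BY ch.ChangeDate
    ) cc;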
Personally, I would design this so you don't need the change table for the report. It is bad practice to store an order without all the data as of the date of the order. You look up the address from the address table and store it with the order (the same goes for part numbers, company names, and anything else that changes over time). You never get information on an order by joining to the customer, address, part number, or price tables.
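A sketch of that write-time capture (it assumes Orders gains a BillingAddress column; the customer id is a placeholder):

    -- Copy the billing address onto the order at write time, so the order
    -- row is self-contained and never needs the change log for reporting.
    DECLARE @CustomerId INT = 42;  -- placeholder: the customer placing the order

    INSERT INTO Orders (OrderDate, BillingAddress)
    SELECT GETDATE(), c.BillingAddress
    FROM Customers c
    WHERE c.CustomerId = @CustomerId;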
Audit tables are more for fixing bad changes or looking up who made them than for reporting.