Duplicate data vs Calculated data in database

I'm starting to track a host of variables around my life (QuantifiedSelf). I have a lot of input sources, and I'm working on sticking it all into a database. I plan on using this database with R to ask arbitrary questions about my life ("Which routes are the fastest to work?", "What foods affect my mood?", etc.).
The key question I'm trying to answer here is "Do I process the input before sticking it into the database?"
Examples of "process":
Some of my input is a list of moods (one for each day). As of right now, there are only 5 available moods (a name paired with a rating between -2 and 2). Do I normalize this data and create two tables: a Mood table (with 5 items) and a DailyMood table?
If I process the data, then I lose the original data. Perhaps I later rename a mood. If I do this in a normalized database, then I lose the record that, before the change, I had a mood named "oldName".
If I don't process the data, then I have duplication of data.
Another input is a list of GPS locations (lat, long). However, most of my day is spent in a single spot or driving. Do I process this data to create two tables, "Locations" and "Routes"?
If I don't process the data, then I have a whole bunch of duplicate locations (at different timestamps), which makes it difficult to query and get good data out.
If I process the data, then I lose the original data. I end up with a nice set of Locations and Routes that is easy to query, but if those locations or routes are wrong, I have to redownload the input source and rebuild the database.
However, I feel like I'm stuck between two opposing "ideals":
If I process the data, then I don't have the original data.
If I don't process the data, then I have duplicate, hard to use data.
I've considered storing both the original and the calculated data. This feels like the worst of both worlds: some of my tables aren't original and would need a full recalculation if they are wrong, while other tables are original but hard to use and full of duplicate data.

To some of the points in the comments: I think which data you store depends on the needs of your application, and I would approach each set of data through a use-case lens.
For the first use case, mood data, it sounds like there is value in seeing the data over time (e.g. "over the last month, my mood has been improving") as well as in pulling up individual events (e.g. "on date x I ate a hamburger; how did this affect my mood in the subsequent mood entry?").
If it were me, I would create a Mood table with two attributes:
Name
Id (pk)
This table would essentially serve as a definition table. Here you could add attributes specific to the mood (such as description).
I would then create a MoodHistory table with the following attributes:
Timestamp
MoodId
IsCurrent (Boolean)
Before you enter a mood in your application, run UPDATE MoodHistory SET IsCurrent = 0 WHERE IsCurrent = 1, and then insert your new record with IsCurrent = 1. This structure is normalized, and by indexing or partitioning on the IsCurrent column (and honestly even without any indexing or partitioning), you should always be able to query the current mood very quickly, even as the table grows quite large.
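As a rough sketch, the two tables and the mood-recording step might look like the following SQL (column types here are assumptions; adjust for your RDBMS):

    CREATE TABLE Mood (
        Id          INTEGER PRIMARY KEY,
        Name        VARCHAR(50) NOT NULL,
        Description VARCHAR(255)              -- optional mood-specific attributes
    );

    CREATE TABLE MoodHistory (
        Timestamp DATETIME NOT NULL,
        MoodId    INTEGER  NOT NULL REFERENCES Mood (Id),
        IsCurrent BOOLEAN  NOT NULL DEFAULT 0
    );

    -- Recording a new mood: retire the old current row, then insert the new one.
    -- Wrapping both statements in a transaction keeps exactly one current row.
    BEGIN TRANSACTION;
    UPDATE MoodHistory SET IsCurrent = 0 WHERE IsCurrent = 1;
    INSERT INTO MoodHistory (Timestamp, MoodId, IsCurrent)
    VALUES (CURRENT_TIMESTAMP, 3, 1);         -- 3 = Id of the mood being recorded
    COMMIT;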
For your second use case, this is quite dependent not only on your planned usage, but where the data is coming from (particularly for routes). I'm not sure how you are planning on grouping locations into "routes" but if you clarify in the comments, I'm happy to add to my answer.
For locations, however, I'm assuming you're taking a location snapshot at some set time interval. I would create a LocationSnapshot table, structured similarly to the MoodHistory table, with the following attributes:
Timestamp
Latitude
Longitude
IsCurrent
By processing your IsCurrent data in the same way as your MoodHistory data, it should be quite straightforward to grab the last entered location. You could also do some additional processing if you want to avoid duplicates: before updating IsCurrent, query the row where IsCurrent = 1, then compare that record's Latitude and Longitude to the new Latitude and Longitude before inserting. If there is any change, proceed with the insert; otherwise, there is no need to insert a new record.
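A sketch of that table and the duplicate check (the types and the @lat/@lon placeholders are assumptions; parameter syntax varies by RDBMS):

    CREATE TABLE LocationSnapshot (
        Timestamp DATETIME      NOT NULL,
        Latitude  DECIMAL(9, 6) NOT NULL,
        Longitude DECIMAL(9, 6) NOT NULL,
        IsCurrent BOOLEAN       NOT NULL DEFAULT 0
    );

    -- Fetch the current reading and compare it to the new one in application code.
    SELECT Latitude, Longitude FROM LocationSnapshot WHERE IsCurrent = 1;

    -- Only if the coordinates changed:
    UPDATE LocationSnapshot SET IsCurrent = 0 WHERE IsCurrent = 1;
    INSERT INTO LocationSnapshot (Timestamp, Latitude, Longitude, IsCurrent)
    VALUES (CURRENT_TIMESTAMP, @lat, @lon, 1);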
You could also create a table of known locations such as KnownLocation:
Latitude
Longitude
Name
Joining to this table ON Latitude and Longitude should tell you when you were spending time at a particular location, say "Home" vs "Work".
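For instance (an exact-match join like this assumes the snapshots are rounded to the same precision as the known locations):

    CREATE TABLE KnownLocation (
        Latitude  DECIMAL(9, 6) NOT NULL,
        Longitude DECIMAL(9, 6) NOT NULL,
        Name      VARCHAR(50)   NOT NULL,
        PRIMARY KEY (Latitude, Longitude)
    );

    -- Label each snapshot with the place it matches.
    SELECT s.Timestamp, k.Name
    FROM LocationSnapshot s
    JOIN KnownLocation k
      ON k.Latitude  = s.Latitude
     AND k.Longitude = s.Longitude;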

Related

Loading Fact tables from SCD1 and SCD2 Dimension in SSIS

I am finding it difficult to understand how you get the history data when joining a fact table to a Dimension that keeps Type 2 (and Type 1) history for records that have changed. Currently I have a Surrogate Key and a Business Key in the Dim. The Fact table has the Surrogate Key, and I am currently using the SSIS Lookup Component to bring back the row that has the CurrentFlag set to Yes.
However, I am joining on the Business Key in the Lookup and returning the Surrogate, which I know is the main reason I can't get history. But even if I join on the Business Key as I am currently doing and return the Business Key as well, the SSIS component will only bring back one row, regardless of how many versions of history exist against that Business Key.
I have been told to use lookups to populate fact tables; however, this doesn't seem to give me the history, as it will only return one row regardless. So I just want to know how to return historic data between a fact and a dimension in SSIS.
Thank you
There are a few caveats when it comes to historical dimensions. Your end users will need to know what it is you are presenting, and understand the differences.
For example, consider the following scenario:
Customer A is located in Las Vegas in January 2017. They place an order for Product 123, which at that time costs $125.
Now, it's August. In the meantime, the Customer moved to Washington D.C. in May, and Product 123 was updated in July to cost $145.
Your end users will need to tell you what they want to see. If you are not tracking history at all, and simply truncate and load everything on a daily basis, your order report would show the following:
Customer A, located in Washington D.C., placed an order for $145 in January.
If you implement proper history tracking, with logic that identifies the start and end date of each row in a dimension, you would join the fact table to the dimension using the natural key as well as the proper date interval. This should return a single dimension row for every fact record. If it returns more, you have overlapping dates.
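In SQL terms the as-of join looks something like this (table and column names here are illustrative, not the asker's actual schema):

    -- Pick the dimension row whose validity interval covers the order date.
    SELECT f.OrderId, f.OrderDate, d.City
    FROM FactOrders f
    JOIN DimCustomer d
      ON d.CustomerBusinessKey = f.CustomerBusinessKey
     AND f.OrderDate >= d.RowStartDate
     AND f.OrderDate <  d.RowEndDate;  -- one row per fact if intervals don't overlap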
Can you show us the logic where you receive only a single value from the lookup, even though you have more records?

Slowly Changing Dimension Vs Fact

I am a rookie to DW. I have a Customer table with basic columns that rarely change, like Name, JoinedOn, etc., and another set of columns that can change over time, like "Status", "CustomerType", "PublishStatus", "BusinessStatus", "CurrentOwner". At the moment there is no history. In the DW I would like to track when the following columns change: "Status", "CustomerType", "PublishStatus", "BusinessStatus", "CurrentOwner". I feel it would be better to create another table to track these; the table would have the following columns:
"CustomerId", "Status","CustomerType","PublishStatus","BusinessStatus","CurrentOwner", "ExpiredOn","IsCurrent"
Is this the right approach? And if so, is this new table a fact or a slowly changing dimension? I would like to run queries like: when did the CustomerType change from A to B? When was it Published? When the BusinessStatus changed, who was the owner?
Which way of modeling the customers is better really depends on how you are going to use the corresponding dimension. If you wanted, for example, to summarize sales associated with the "Customer" dimension by the "CustomerType" at the time of the sale, you could do that only if you keep the historical details as part of a slowly changing dimension.
You can probably run a lot of customer reports right on that table that represents the slowly changing "Customer" dimension. But if the number of your customers gets into the millions, you'd be better off creating a separate fact table (or tables) for customer status changes.
So, to summarize: start off with a slowly changing dimension. If the number of customers grows too large and reports on customer status changes become too slow, add a fact table for them and don't worry about duplicating data.
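As a sketch, using the columns listed in the question plus an assumed EffectiveFrom date for each row, "when did CustomerType change from A to B" could be answered like this:

    -- Pair each history row with its predecessor and look for the A -> B transition.
    SELECT curr.CustomerId, curr.EffectiveFrom AS ChangedOn
    FROM CustomerHistory curr
    JOIN CustomerHistory prev
      ON prev.CustomerId = curr.CustomerId
     AND prev.ExpiredOn  = curr.EffectiveFrom
    WHERE prev.CustomerType = 'A'
      AND curr.CustomerType = 'B';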

Shared Table in Database, or Multiple Data? Storing Real-World Location in DB

I have what seems like a simple question but might be more difficult than anticipated. Let's say I'm trying to track the latitude and longitude of Users and Businesses. Right now, I have a table called locTable, which contains 3 columns: Index, Latitude, Longitude.
The tables that store information for the Users and Businesses contain a FK to the locTable. This lets me use one table to store the location data; however, I've noticed that doing queries on this data might be difficult.
Now, I could store the Latitude and Longitude information in each of the Users and Businesses tables; however, if I need to make changes regarding the data, I would have to update the queries along with two (or more) different tables.
What would you all suggest? Shared table or store the information separately?
First off, establish whether lat/long is a "lookup" table. I would not consider Lat/Long to be a lookup. What that means is you would not store an exhaustive list of every possible Lat/Long combo. They are often specified to 4 decimal places; in theory there could be infinite Lat/Long combos if you have infinite scale.
I would not consider it duplication to store lat/long in both tables. Think of birth dates. To avoid BirthDate duplication you could have a "BirthDates" table with a row for every day in the last 300 years. That would avoid BirthDate duplication across people, dogs, and companies, but it is not duplication in the "lookup" sense.
I am not suggesting it is wrong to store Lat/Long in its own table. I'm just suggesting it may not be considered duplication to store Lat/Long in both tables.
Are the locations changing? If so, reverse the foreign key: have the location table hold a foreign key to the Users and Businesses tables. Add a date column to the location table, and you can also track locations over time just by adding new rows.
With your model, you potentially have to update two tables in order to update a location. With the foreign key switched around, all you have to do is add new rows to the location table when a location changes.
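A minimal sketch of the reversed relationship, shown here for users only (all names are assumptions):

    CREATE TABLE UserLocation (
        UserId     INTEGER       NOT NULL REFERENCES Users (Id),
        RecordedOn DATETIME      NOT NULL,
        Latitude   DECIMAL(9, 6) NOT NULL,
        Longitude  DECIMAL(9, 6) NOT NULL
    );

    -- A move is just a new row; the latest row per user is the current location.
    SELECT Latitude, Longitude
    FROM UserLocation
    WHERE UserId = 42
    ORDER BY RecordedOn DESC
    LIMIT 1;  -- or TOP 1 / FETCH FIRST 1 ROW ONLY, depending on the RDBMS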

Best approach to views on archive data with change logs

(Sorry about the vagueness of the title; I can't think how to really say what I'm looking for without writing a book.)
So in our app, we allow users to change key pieces of data. I'm keeping records of who changed what when in a log schema, but now the problem presents itself: how do I best represent that data in a view for reporting?
An example will help: a customer's data (say, billing address) changed on 4/4/09. Let's say that today, 10/19/09, I want to see all of their 2009 orders, before and after the change. I also want each order to display the billing address that was current as of the date of the order.
So I have 4 tables:
Orders (with order data)
Customers (with current customer data)
CustomerOrders (linking the two)
CustomerChange (which holds the date of the change, who made the change (employee id), what the old billing address was, and what they changed it to)
How do I best structure a view to be used by reporting so that the proper address is returned? Or am I better served by creating a reporting database and denormalizing the data there, which is what the reports group is requesting?
There is no need for a separate DB if this is the only thing you are going to do. You could just create a denormalized table/cube and populate and retrieve from it. If your data is voluminous, apply proper indexes to this table.
Personally, I would design this so you don't need the change table for the report. It is bad practice to store an order without also storing all the data as of the date of the order. You look up the address from the address table and store it with the order (same for part numbers, company names, and anything else that changes over time). You should never have to get information about an order by joining to customer, address, part number, or price tables.
Audit tables are more for fixing bad changes or looking up who made them than for reporting.
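Following that advice, the order row itself carries the billing address as of the order date, and the report needs no join to the change log (illustrative columns):

    CREATE TABLE Orders (
        OrderId        INTEGER  PRIMARY KEY,
        CustomerId     INTEGER  NOT NULL,
        OrderDate      DATETIME NOT NULL,
        BillingAddress VARCHAR(255) NOT NULL  -- copied from the customer at order time
    );

    SELECT OrderId, OrderDate, BillingAddress
    FROM Orders
    WHERE CustomerId = 42
      AND OrderDate >= '2009-01-01';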

How to handle an immutable table referencing mutable tables?

In making a pretty standard online store in .NET, I've run into a bit of an architectural conundrum regarding my database. I have a table "Orders", referenced by a table "OrderItems". The latter references a table "Products".
Now, the Orders and OrderItems tables are in most respects immutable; that is, an order and its order items should look the same no matter when you're looking at the tables (for instance, printing a receipt for an order for bookkeeping each year should yield the same receipt the customer got at the time of the order).
I can think of two ways of achieving this behavior, one of which is in use today:
1. Denormalization, where values such as price of a product are copied to the orderitem table.
2. Making referenced tables immutable. The code that handles products could create a new product row whenever a value such as the price is changed. Mutable tables referencing Products would have their references updated, whereas the immutable ones would be fine and dandy with their old reference.
What is your preferred way of doing this? Is there a better, more clever way of doing this?
It depends. I'm working on a quite complex piece of enterprise software that includes a kind of document management and auditing and is used in pharmacy.
Normally, primitive values are denormalized. For instance, if you just need the state of the customer at the time the order was created, I would store it on the order.
There is always more complex data that needs to be available at almost every point in time. There are two approaches: you create a history of it, or you implement a revision control system, which is almost the same thing.
The history means that every state that ever existed is stored as a separate record, in the same or another table.
I implemented a revision control system, where I split records into two tables: one for the actual item, let's say a product, and the other for its versions. This way I can reference the product as a whole, or any specific version of it, because each has its own primary key.
This system is used for many entities. I can safely reference an object under revision control from an audit trail, for instance, or from other immutable records. At the beginning such a system seems more complex, but in the end it is very straightforward and solves many problems at once.
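A simplified sketch of that two-table split (names and columns are assumptions, not the actual schema described above):

    CREATE TABLE Product (
        ProductId        INTEGER PRIMARY KEY,
        CurrentVersionId INTEGER               -- convenience pointer to the latest version
    );

    CREATE TABLE ProductVersion (
        VersionId INTEGER PRIMARY KEY,
        ProductId INTEGER NOT NULL REFERENCES Product (ProductId),
        Price     DECIMAL(10, 2) NOT NULL,
        ValidFrom DATETIME       NOT NULL
    );

    -- Mutable records reference Product as a whole; immutable records (orders,
    -- audit trail) reference the exact ProductVersion current at the time.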
Storing the price in both the Product table and the OrderItem table is NOT denormalizing if the price can change over time. Normalization rules say that every "fact" should be recorded only once in the database. But in this case, just because both numbers are called "price" doesn't make them the same thing. One is the current price, the other is the price as of the date of the sale. These are very different things. Just like "customer zip code" and "store zip code" are completely different fields; the fact that both might be called "zip code" for short does not make them the same thing. Personally, I have a strong aversion to giving fields that hold different data the same name because it creates confusion. I would not call them both "Price": I would call one "Current_Price" and the other "Sale_Price" or something like that.
Not keeping the price at the time of the sale is clearly wrong. If we need to know this -- which we almost surely do -- then we need to save it.
Duplicating the entire product record for every sale or every time the price changes is also wrong. You almost surely have constant data about a product, like description and supplier, that does not change every time the price changes. If you duplicate the product record, you will be duplicating all this data, which definitely IS denormalization. This creates many potential problems. Like, if someone fixes a spelling error in the product description, we might now have the new record saying "4-slice toaster" while the old record says "4-slice taster". If we produce a report and sort on the description, they'll get separated and look like different products. Etc.
If the only data that changes about the product and that you care about is the price, then I'd just post the price into the OrderItem record.
If there's lots of data that changes, then you want to break the Product table into two tables: One for the data that is constant or whose history you don't care about, and another for data where you need to track the history. Like, have a ProductBase table with description, vendor, stock number, shipping weight, etc.; and a ProductMutable table with our cost, sale price, and anything else that routinely changes. You probably also want an as-of date, or at least an indication of which is current. The primary key of ProductMutable could then be Product_id plus As_of_date, or if you prefer simple sequential keys for all tables, fine, it at least has a reference to product_id. The OrderItem table references ProductMutable, NOT ProductBase. We find ProductBase via ProductMutable.
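A sketch of that split, using the table names from the paragraph above (column types and the OrderItem columns are assumptions):

    CREATE TABLE ProductBase (
        ProductId   INTEGER PRIMARY KEY,
        Description VARCHAR(255) NOT NULL,
        Vendor      VARCHAR(100),
        StockNumber VARCHAR(50)
    );

    CREATE TABLE ProductMutable (
        ProductId INTEGER  NOT NULL REFERENCES ProductBase (ProductId),
        AsOfDate  DATETIME NOT NULL,
        Cost      DECIMAL(10, 2) NOT NULL,
        SalePrice DECIMAL(10, 2) NOT NULL,
        PRIMARY KEY (ProductId, AsOfDate)
    );

    -- OrderItem references ProductMutable; ProductBase is reached through it.
    SELECT b.Description, m.SalePrice
    FROM OrderItem oi
    JOIN ProductMutable m ON m.ProductId = oi.ProductId
                         AND m.AsOfDate  = oi.PriceAsOfDate
    JOIN ProductBase    b ON b.ProductId = m.ProductId;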
I think Denormalization is the way to go.
Also, Product should not hold the price (when it changes from time to time, and when "price" means a different value to different people: retailers, customers, bulk sellers, etc.).
You could also have a price history table containing ProductID, FromDate, ToDate, Price, IsActive, to maintain the price history for a product.
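That history table, as described, might look like this (types assumed):

    CREATE TABLE PriceHistory (
        ProductID INTEGER  NOT NULL,
        FromDate  DATETIME NOT NULL,
        ToDate    DATETIME,                    -- NULL while the price is still active
        Price     DECIMAL(10, 2) NOT NULL,
        IsActive  BOOLEAN  NOT NULL
    );

    -- Price in effect on a given date:
    SELECT Price
    FROM PriceHistory
    WHERE ProductID = 123
      AND FromDate <= '2017-01-15'
      AND (ToDate IS NULL OR ToDate > '2017-01-15');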
