Database / table structure - similar entries vs. too much normalization?

I have designed a relational database that keeps track of various assets and their owners over time. One of the most important pieces of analysis I want to do is to track the value of those assets over time: expected original cost, actual original cost, actual cost, etc. So I have been putting data relating to cost / value in a separate table called “Support_Value”. To complicate things, some of the assets I’m tracking are in countries with foreign currencies, so I’m collecting cost / value data in US Dollars but also in local currencies (“LC”), which ends up doubling the number of columns I have in this table. I also use this table to keep track of the value of the asset owners themselves in a similar fashion.
My initial plan was to carve out separate tables to deal with (1) the various “qualities” of cost / value entries (i.e. “planned”, “upper” bound, “lower” bound, “estimated” by analysts, and “actual”) and (2) currencies. But I realize this is likely to break, as it doesn’t allow an initial “planned” cost to be subsequently revised unless we make that explicit by creating new columns for the revisions - and then there can be more than one revision. So still not perfect.
What I’m now envisaging is to create a different value table that would have the following columns:
ID (PK representing individual instances of cost / value estimates)
Currency (FK to my currency table)
Asset (FK to my assets table) - i.e. what this cost or value is referring to
Date (FK to my date table) - i.e. to track revisions actually
Type (i.e. “cost” or “value”)
Quality (i.e. “planned”, “upper”, “lower”, “estimated”, “actual”)
Valuation - i.e. the actual absolute amount in the currency designated in the second column
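In DDL terms, here is roughly what I have in mind (table, column, and type names are just placeholders, not final):

CREATE TABLE value_entry (
    id          INTEGER PRIMARY KEY,       -- one cost / value estimate
    currency_id INTEGER NOT NULL REFERENCES currency,
    asset_id    INTEGER NOT NULL REFERENCES asset,
    date_id     INTEGER NOT NULL REFERENCES calendar_date,  -- for tracking revisions
    entry_type  VARCHAR(10) NOT NULL,      -- 'cost' or 'value'
    quality     VARCHAR(10) NOT NULL,      -- 'planned', 'upper', 'lower', 'estimated', 'actual'
    valuation   DECIMAL(19, 4) NOT NULL    -- amount in the designated currency
);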
What do you think of this approach? Is this an improvement?
Thanks for any suggestion you could have!

Both approaches are fine.
But if you think you may need additional similar columns,
then the second approach is more extensible.
Your second approach does look over-normalized, though;
I suggest splitting the "Quality" column back into its parts.
Something like:
"ID"
"Currency"
"Asset"
"Date"
"Type"
"Planned"
"Lower"
"Upper"
"Estimated"
"Actual"
"Valuation"
Cheers.

Related

Duplicate data vs Calculated data in database

I'm starting to track a host of variables around my life (QuantifiedSelf). I have a lot of input sources, and I'm working on sticking it all into a database. I plan on using this database with R to ask arbitrary questions about my life ("Which routes are the fastest to work", or "What foods affect my mood", etc)
The key question I'm trying to answer here is "Do I process the input before sticking it into the database?"
Examples of "process":
Some of my input is a list of moods (one for each day). As of right now, there are only 5 available moods (name with a rating between -2 and 2). Do I normalize this data and create two tables: A Mood table (with 5 items) and a DailyMood table?
If I process the data, then I lose the original data. Perhaps I change a mood to have a different name. If I do this in a normalized database, then I lose the information that, before the change, I had a mood called "oldName".
If I don't process the data, then I have duplication of data
Another input is a list of GPS locations (lat, long). However, most of my day is spent in a single spot, or spent driving. Do I process this data to create two tables "Locations" and "Routes"?
If I don't process the data, then I have a whole bunch of duplicate locations (at different timestamps), which is difficult to query and get good data out of.
If I process the data, then I lose the original data. I end up with a nice set of Locations and Routes that is easy to query, but if those locations or routes are wrong, I would have to redownload the input source and rebuild the database.
However, I feel like I'm stuck between two opposing "ideals":
If I process the data, then I don't have the original data.
If I don't process the data, then I have duplicate, hard to use data.
I've considered storing both the original and the calculated. This feels like I'm getting the worst of both worlds: Some of my tables aren't original, and would need a full recalculation if they are wrong, while other tables are original but hard to use and have duplicate data.
To some of the points in the comments: I think which data you store depends on the needs of your application, and I would approach each set of data through a use-case lens.
For the first use case, mood data, it sounds like there is value in being able to see this data over time (i.e. it appears that over the last month, my mood has been improving) as well as to pull up individual events (i.e. on date x, I ate a hamburger, how did this affect my mood in the subsequent mood entry after date x).
If it were me, I would create a Mood table, with two attributes:
Name
Id (pk)
This table would essentially serve as a definition table. Here you could add attributes specific to the mood (such as description).
I would then create a MoodHistory table with the following attributes:
- Timestamp
- MoodId
- IsCurrent (Boolean)
Before you enter a mood in your application, UPDATE MoodHistory SET IsCurrent = 0 WHERE IsCurrent = 1, and then insert your new record with IsCurrent = 1. This structure is normalized and by indexing or partitioning by the IsCurrent column (and honestly even without any indexing/partitioning), even as your table grows quite large, you should always be able to query the current mood super quickly.
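Concretely, the write path could look like this (a sketch - run both statements in a single transaction so the flag flip and the insert succeed or fail together; the MoodId value is just an example):

UPDATE MoodHistory SET IsCurrent = 0 WHERE IsCurrent = 1;

INSERT INTO MoodHistory (Timestamp, MoodId, IsCurrent)
VALUES (CURRENT_TIMESTAMP, 3, 1);  -- 3 = id of the mood being recorded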
For your second use case, this is quite dependent not only on your planned usage, but where the data is coming from (particularly for routes). I'm not sure how you are planning on grouping locations into "routes" but if you clarify in the comments, I'm happy to add to my answer.
For locations however, I'm assuming you're taking a Location Snapshot during some set time interval. I would create a LocationSnapshot table structured similarly to the MoodHistory table:
This LocationSnapshot table would have the following attributes:
Timestamp
Latitude
Longitude
IsCurrent
By processing your IsCurrent data in a similar way to your MoodHistory data, it should be quite straightforward to grab the last entered location. You could also do some additional processing if you want to avoid duplicates. Essentially, before updating IsCurrent, query the row where IsCurrent = 1. Then compare that record's Latitude and Longitude to your new Latitude and Longitude before inserting the new record. If there is any change, proceed with the insert; otherwise, there is no need to insert a new record.
You could also create a table of known locations such as KnownLocation:
Latitude
Longitude
Name
Joining to this table ON Latitude and Longitude should tell you when you were spending time at a particular location, say "Home" vs "Work"
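For example (a sketch using the column names above; joining on raw latitude/longitude assumes the snapshots are rounded to a consistent precision before being stored):

SELECT ls.Timestamp, kl.Name
FROM LocationSnapshot ls
JOIN KnownLocation kl
  ON kl.Latitude = ls.Latitude
 AND kl.Longitude = ls.Longitude
ORDER BY ls.Timestamp;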

Athletics Ranking Database - Number of Tables

I'm fairly new to this so you may have to bear with me. I'm developing a database for a website with athletics rankings on them and I was curious as to how many tables would be the most efficient way of achieving this.
I currently have 2 tables, a table called 'athletes' which holds the details of all my runners (potentially around 600 people/records) which contains the following fields:
mid (member id - primary key)
firstname
lastname
gender
birthday
nationality
And a second table, 'results', which holds all of their performances and has the following fields:
mid
eid (event id - primary key)
eventdate
eventcategory (road, track, field etc)
eventdescription (100m, 200m, 400m etc)
hours
minutes
seconds
distance
points
location
The second table has around 2000 records in it already, and this will potentially quadruple over time: there are around 30 track events, 10 field, 10 road, plus cross country, relays, multi-events, etc., and with 600 athletes in my first table that adds up to a large number of records in my second table.
So what I was wondering is would it be cleaner/more efficient to have multiple tables to separate track, field, cross country etc?
I want to use the database to order peoples results based on their performance. If you would like to understand better what I am trying to emulate, take a look at this website http://thepowerof10.info
Changing the schema won't change the number of results. Even if you split the venue into a separate table, you'll still have one result per participant at each event.
The potential benefit of having a separate venue table would be better normalization. A runner can have many results, and a given venue can have many results on a given date. You won't have to repeat the venue information in every result record.
You'll want to pay attention to indexes. Every table must have a primary key. Add additional indexes for columns you use in WHERE clauses when you select.
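For example, with the columns from the question (which indexes you actually want depends on your WHERE clauses):

CREATE INDEX idx_results_mid ON results (mid);  -- look up one athlete's results
CREATE INDEX idx_results_event ON results (eventcategory, eventdescription);  -- filter by event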
Here's a discussion about normalization and what it can mean for you.
PS - Thousands of records won't be an issue. Large databases are on the order of giga- or tera-bytes.
My thought --
Don't break your events table into separate tables for each type (track, field, etc.). You'll have a much easier time querying the data back out if it's all there in the same table.
Otherwise, your two tables look fine -- it's a good start.

What is the best way to represent a constrained many-to-many relationship within a relational database?

I would like to establish a many-to-many relationship with a constraint that only one or no entity from each side of the relationship can be linked at any one time.
A good analogy to the problem is cars and parking garage spaces. There are many cars and many spaces. A space can contain one car or be empty; a car can only be in one space at a time, or no space (not parked).
We have a Cars table and a Spaces table (and possibly a linking table). Each row in the Cars table represents a unique instance of a car (with license, owner, model, etc.) and each row in the Spaces table represents a unique parking space (with address of garage, floor level, row and number). What is the best way to link these tables in the database and enforce the constraint described above?
(I am using C#, NHibernate and Oracle.)
If you're not worried about history - i.e. you only care about "now" - do this:
create table parking (
car_id integer references car,
space_id integer references space,
unique (car_id),
unique (space_id)
);
By making both the car and space references unique, you restrict each side to at most one link - i.e. a car can be parked in at most one space, and a space can have at most one car parked in it.
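For example (hypothetical ids):

insert into parking values (1, 10);  -- car 1 parks in space 10
insert into parking values (2, 10);  -- fails: space 10 is taken (unique space_id)
insert into parking values (1, 11);  -- fails: car 1 is already parked (unique car_id)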
In any relational database, many-to-many relationships must have a join table to represent the combinations. As provided in the answer above (but without much of the theoretical background), you cannot represent a many-to-many relationship without having a table in the middle to store all the combinations.
It was also mentioned in that solution that it only solves your problem if you don't need history. Trust me when I tell you that real-world applications almost always need to represent historical data. There are many ways to do this, but a simple method might be to create what's called a ternary relationship with an additional table. You could, in theory, create a "time" table that also links its primary key (say, a distinct timestamp) with the inherited keys of the other two source tables. This would enable you to prevent errors where two cars are located in the same parking spot at the same time. Using a time table allows you to re-use the same time data for multiple parking spots via a simple integer id.
So, your data tables might look like this:
table car
car_id (integers/numbers are fastest to index)
...
table parking_space
space_id
location
table timeslot
time_id integer
begin_datetime (don't use seconds unless you must!)
end_datetime (don't use seconds unless you must!)
Now, here's where it gets fun. You add the middle table with a composite primary key made up of car.car_id + parking_space.space_id + timeslot.time_id. There are other things you could optimize here, but you get the idea, I hope.
table reservation
car_id PK
parking_space_id PK
time_id PK (it's an integer - keep the increments as coarse as you can, 30 minutes or something; if you allow this to include seconds / milliseconds / etc, the advantages are cancelled out because you can't re-use the same value from the time table)
(this would also be the place to store variable rates, discounts, etc distinct to this particular account, reservation, etc).
Now you can reduce the amount of data, because you aren't replicating the timestamp in the join table (reservation). By using an integer, you can re-use that timeslot for multiple parking spaces, but you can also apply a constraint preventing two cars from renting a given spot for the same "timeslot". This would also make it easier to store some history about the customers - who knows, you might want to see reports on customers who rent more often and offer them discounts or something.
By using the ternary relationship model, you are making each spot unique to a given timeslot (perhaps with some added validation rules), so the system can only store one car in one parking spot for one given time period.
By using integers as keys instead of timestamps, you are assured that the database won't need to do any heavy lifting to index the keys and sort / query against. This is a common practice in data warehousing / OLAP reporting when you have massive datasets and you need efficiency. I think it applies here as well.
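A DDL sketch of that middle table (names are mine; the unique constraints are what keep two cars out of one spot, and one car out of two spots, for the same timeslot):

create table reservation (
  car_id integer not null references car,
  parking_space_id integer not null references parking_space,
  time_id integer not null references timeslot,
  primary key (car_id, parking_space_id, time_id),
  unique (parking_space_id, time_id),  -- one car per spot per timeslot
  unique (car_id, time_id)             -- a car can't be in two spots at once
);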
Create a third table:
parking
--------
car_id
space_id
start_dt
end_dt
For the constraint, I guess the problem with your situation is that you need to check a complex rule against the intersection table itself. If you try this in a trigger, it will report a mutating-table error.
One way to avoid this would be to replicate the table and query against that replica for the constraint.

How to handle an immutable table referencing mutable tables?

In making a pretty standard online store in .NET, I've run in to a bit of an architectural conundrum regarding my database. I have a table "Orders", referenced by a table "OrderItems". The latter references a table "Products".
Now, the orders and orderitems tables are in most aspects immutable, that is, an order created and its orderitems should look the same no matter when you're looking at the tables (for instance, printing a receipt for an order for bookkeeping each year should yield the same receipt the customer got at the time of the order).
I can think of two ways of achieving this behavior, one of which is in use today:
1. Denormalization, where values such as price of a product are copied to the orderitem table.
2. Making referenced tables immutable. The code that handles products could create a new product row whenever a value such as the price is changed. Mutable tables referencing the products table would have their references updated, whereas the immutable ones would be fine and dandy with their old reference.
What is your preferred way of doing this? Is there a better, more clever way of doing this?
It depends. I'm working on a quite complex piece of enterprise software that includes a kind of document management and auditing and is used in pharmacy.
Normally, primitive values are denormalized. For instance, if you just need the state of the customer at the time the order was created, I would store it on the order.
There is always more complex data, though, that needs to be available at almost every point in time. There are two approaches: you create a history of it, or you implement a revision control system, which is almost the same thing.
The history means that every state that ever existed is stored as a separate record, in the same or another table.
I implemented a revision control system, where I split records into two tables: one for the actual item, let's say a product, and the other for its versions. This way I can reference the product as a whole, or any specific version of it, because each has its own primary key.
This system is used for many entities. I can safely reference an object under revision control from an audit trail, for instance, or from other immutable records. At the beginning such a system seems more complex, but in the end it is very straightforward and solves many problems at once.
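A minimal sketch of that split (table and column names are mine, not the actual schema):

create table product (
  product_id integer primary key   -- stable identity for "the product as a whole"
);

create table product_version (
  product_version_id integer primary key,
  product_id integer not null references product,
  price decimal(19, 4) not null,
  valid_from timestamp not null
);

-- immutable records (audit trail entries, order items) reference
-- product_version_id; mutable ones reference product_id.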
Storing the price in both the Product table and the OrderItem table is NOT denormalizing if the price can change over time. Normalization rules say that every "fact" should be recorded only once in the database. But in this case, just because both numbers are called "price" doesn't make them the same thing. One is the current price, the other is the price as of the date of the sale. These are very different things. Just like "customer zip code" and "store zip code" are completely different fields; the fact that both might be called "zip code" for short does not make them the same thing. Personally, I have a strong aversion to giving fields that hold different data the same name because it creates confusion. I would not call them both "Price": I would call one "Current_Price" and the other "Sale_Price" or something like that.
Not keeping the price at the time of the sale is clearly wrong. If we need to know this -- which we almost surely do -- then we need to save it.
Duplicating the entire product record for every sale or every time the price changes is also wrong. You almost surely have constant data about a product, like description and supplier, that does not change every time the price changes. If you duplicate the product record, you will be duplicating all this data, which definitely IS denormalization. This creates many potential problems. Like, if someone fixes a spelling error in the product description, we might now have the new record saying "4-slice toaster" while the old record says "4-slice taster". If we produce a report and sort on the description, they'll get separated and look like different products. Etc.
If the only data that changes about the product and that you care about is the price, then I'd just post the price into the OrderItem record.
If there's lots of data that changes, then you want to break the Product table into two tables: One for the data that is constant or whose history you don't care about, and another for data where you need to track the history. Like, have a ProductBase table with description, vendor, stock number, shipping weight, etc.; and a ProductMutable table with our cost, sale price, and anything else that routinely changes. You probably also want an as-of date, or at least an indication of which is current. The primary key of ProductMutable could then be Product_id plus As_of_date, or if you prefer simple sequential keys for all tables, fine, it at least has a reference to product_id. The OrderItem table references ProductMutable, NOT ProductBase. We find ProductBase via ProductMutable.
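Sketched out, that might be (types are my guesses; note that OrderItem points at ProductMutable, not ProductBase):

create table ProductBase (
  product_id integer primary key,
  description varchar(200),
  vendor varchar(100),
  stock_number varchar(50),
  shipping_weight decimal(10, 2)
);

create table ProductMutable (
  product_mutable_id integer primary key,
  product_id integer not null references ProductBase,
  our_cost decimal(19, 4),
  sale_price decimal(19, 4),
  as_of_date date not null
);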
I think Denormalization is the way to go.
Also, Product should not have a price, since it changes from time to time and since "price" means different things to different people (retailers, customers, bulk sellers, etc.).
You could also have a price history table containing ProductID, FromDate, ToDate, Price, IsActive - to maintain the price history for a product.

How do you handle "special-case" data when modeling a database?

Our organization provides a variety of services to our clients (e.g., web hosting, tech support, custom programming, etc...). There's a page on our website that lists all available services and their corresponding prices. This was static data, but my boss wants it all pulled from a database instead.
There are about 100 services listed. Only two of them, however, have a non-numeric value for "price" (specifically, the strings "ISA" and "cost + 8%" - I really don't know what they're supposed to mean, so don't ask me).
I'd hate to make the "price" column a varchar just because of these two listings. My current approach is to create a special "price_display" field, which is either blank or contains the text to display in place of the price. This solution feels too much like a dirty hack though (it would needlessly complicate the queries), so is there a better solution?
Consider that this column is a price displayed to the customer that can contain anything.
You'd be inviting grief if you try to make it a numeric column. You're already struggling with two non-conforming values, and tomorrow your boss might want more...
PRICE ON APPLICATION!
CALL US FOR TODAY'S SPECIAL!!
You get the idea.
If you really need a numeric column then call it internalPrice or something, and put your numeric constraints on that column instead.
When I have had to do this sort of thing in the past I used:
Price    Unit   Display
10.00    item   null
100.00   box    null
null     null   "Call for Pricing"
Price would be a decimal datatype (any exact numeric, not float or real); Unit and Display would be some string data type.
Then use a CASE statement to display either the price per unit or the display text. Also put a constraint or trigger on the Display column so that it must be null unless Price is null; a constraint or trigger should likewise require a value in Unit when Price is not null.
This way you can calculate prices for an order where possible and leave them out when the price is not specified, yet still display both. I'd also put in a business rule to make sure the order total can't be computed until the "call for pricing" is resolved (which means you'd also need a way to insert the special pricing into the order details rather than just pulling from the price table).
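The display logic would then look something like this (a sketch; "Services" is a stand-in table name, and string concatenation syntax varies by DBMS):

SELECT CASE
         WHEN Price IS NOT NULL
           THEN CAST(Price AS VARCHAR(20)) || ' per ' || Unit
         ELSE Display
       END AS PriceDisplay
FROM Services;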
Ask yourself...
Will I be adding these values? Will I be sorting by price? Will I need to convert to other currency values?
OR
Will I just be displaying this value on a web page?
If this is just a laundry list and not used for computation the simplest solution is to store price as a string (varchar).
Perhaps use a 'type' indicator in the main table, with one child table allowing numeric price and another with character values. These could be combined into one table, but I generally avoid that. You could also use an intermediate link table with a quantity if you ever want to base price on quantity purchased.
Lots of choices:
1. All prices stored as varchars
2. Prices stored numerically, with an extra price_display field that overrides the number if populated
3. Prices stored numerically, with an extra price_display field for display purposes, populated manually or by trigger when the numeric price is updated (duplication of data that could get out of sync - yuk)
4. Special-case negative prices that map to special situations (simply yuk!!)
5. A varchar price, a prefix key field to a table of available prefixes ('cost +', ...), a suffix key field to a table of available suffixes, and a type field keyed to a list of types for the value in price ('$', '%', 'description'). Useful if you'd need to write complex queries against prices in the future.
I'd probably go for 2 as a pragmatic solution, and an extension of 5 if I needed something very general for a generic pricing system.
If this is the extent of your data model, then a varchar field is fine. Your normal prices - decimal as they may be - are probably useless for calculations anyway. How do you compare $10/GB for "data transfer" and $25/month for "domain hosting"?
Your data model for this particular project isn't about pricing, but about displaying pricing. Design with that in mind.
Of course - if you're storing the price a particular customer paid for a particular project, or trying to figure out what to charge a particular customer - then you have a different (more complex) domain. And you'll need a different model to support that.
Given that at least one of the alternate prices has a number involved, what about a Price column plus a price type? The normal entries could be a number for the dollar value with type 'Dollar'; the others could be 8 with type 'PercentOverCost', and null with type 'ISA' (for the Price and PriceType columns respectively).
You should probably have a PriceType table to validate against, and a PriceTypeID, if you go this route.
This would allow other types of pricing to be added in the future (unit pricing, foreign currency), give you a number, and also make it easier to know what type of pricing you are dealing with.
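A sketch of that design (names as described above; the example comments mirror the question's two special cases):

create table PriceType (
  PriceTypeID integer primary key,
  Name varchar(30) not null          -- 'Dollar', 'PercentOverCost', 'ISA', ...
);

create table ServicePrice (
  ServiceID integer primary key,
  Price decimal(19, 4),              -- e.g. 8 for 'PercentOverCost'; null for 'ISA'
  PriceTypeID integer not null references PriceType
);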
http://books.google.co.in/books?id=0f9oLxovqIMC&pg=PA352&lpg=PA352&dq=How+do+you+handle+%E2%80%9Cspecial-case%E2%80%9D+data+when+modeling+a+database%3F&source=bl&ots=KxN9eRgO9q&sig=NqWPvxceNJPoyZzVS4AUtE-FF5c&hl=en&ei=V3RlSpDtI4bVkAWkzbHNDg&sa=X&oi=book_result&ct=result&resnum=3
