Should I separate redundant data in my database?

Should I separate redundant data in my database? - database

I have a database app that stores prices for things in different places. Each price has the following data associated with it:
price
date
product ID
country
price type (factory/wholesale/retail)
The last three items (pID, country, pricetype) can be thought of as one composite item describing the purpose of the price; there is a lot of redundancy in this data. So I'm thinking: separate those out into their own table to save space and simplify queries.
Normal:
Prices (price_id, price, date, product_id, country_id, pricetype_id)
vs:
Prices (price_id, price, date, descriptor_id)
Descriptors (descriptor_id, product_id, country_id, pricetype_id)
Is this worth the added programming effort required? Will it be more or less extensible/maintainable in the long run?

Is this worth the added programming effort required?
Yes
Will it be more or less extensible/maintainable in the long run?
More extensible and easier to maintain.
In general
You should always normalize to at least 3NF.
See this article: http://databases.about.com/od/specificproducts/a/normalization.htm

It depends on the amount of data you are expecting in that table. If you have no performance/storage problems, you don't need separate tables (for performance reasons).
On the other hand, you will get all disadvantages that come with redundancy. You have to check your data for inconsistencies etc.
But: Regardless of the design you choose, there's still time to change the road you're on.

Related

Database design, an included attribute vs multiple joins? Confused

So I am taking a class in database design and management and am kind of confused from a design perspective. My example is an invoice system. I just made it up quick so it doesn't have a ton of complexity in it.
There are Customers, Orders, Invoices and Payments entities
Customers
CustId(PK),
Street,
Zip,
City,
..
Orders
OrderID(PK)
CustID(FK)
Date
Amt
....
Invoices
InvoiceID(PK),
OrderID(FK),
Date,
AmtDue,
AmtPaid,
....
Payments
PaymentNo(PK),
InvoiceID(FK),
PayMethod,
Date,
Amt,
...
Customer entity has a one to many relationship with Orders
Purchases entity has a one to many relationship with Invoices
Invoices Entity has a one to many relationship with Payments.
To get the results of a query to list all Payments made by a Customer the query would have to join Payments with the Invoice table, the Invoice table with the Orders table and the Orders table with the Customer table.
Is this the correct way to do it? One could also just put a custID in the payment entity which would then just require one join, but then there is unneeded information in the payment entity. Is this just a design thing or is it a performance issue?
Bonus question. Lets say there should be a report that says what the total customer balance is. Does there need to be a customer balance field in the database or can this be a calculated item that is produced by joining tables and adding up the amount billed vs amount paid?
Thanks!

Is this the correct way to do it?
Yes. Based on the information provided, it looks reasonable.
One could also just put a custID in the payment entity which would then just require one join, but then there is unneeded information in the payment entity. Is this just a design thing or is it a performance issue?
The question you're asking falls under "normal forms", often called normalization. Your target should be Boyce-Codd normal form (similar to 3NF), which should be described in your textbook. I will warn you that misinformation and misuderstanding of database design issues is very abundant on the interwebs, so beware of which answers you pay attention to.
The goal of normalization is to eliminate redundancy, and thus to eliminate "anomaliies", whereby two logically equivalent queries produce inconsistent results. If the same information is kept in two places, and is updated in only one, then two queries against the two different values will produce different -- i.e, inconsistent -- results.
In your example, if there is a Payments.CustID, should I believe that one, or the one derived from joining Payments to Orders? The same goes for total customer balance: do I believe the stored total, or the one I computed from the consituents?
If you are going to "denomalize for performance", as is so often alleged to be necessary, what are you going to do to ensure the redundant values are consistent?
Bonus question. Lets say there should be a report that says what the total customer balance is.
As a matter of fact, in practice balances are sort of a special case. It's often necessary to know the balance at points in time. While it's possible to compute, say, monthy account balances from inception based on transactions, as a practical matter applications usually "draw a line in the sand" and record the balance for future reference. Step are taken -- must be, for the sake of the business -- to ensure the historical information does not change or, if it does, that the recorded balance is updated to reflect the change. From that description alone, you can imagine that the work of enforcing consistency throughout the system is much more work than relying on the DBMS to enforce it. And that is why, insofar as is feasible, it's better to elimate all redundant data, and let the DBMS do the job it was designed to do.
In your analysis, seek Boyce-Codd normal form. Understand your data, eliminate the redundancies, and recognize the relations. Let the DBMS enforce referential integrity. Countless errors will be avoided, and time saved. Only when specific circumstances conspire to show that specific business requirements cannot be satisfied on a particular system with a given, correct design, does one begin the tedious and error-prone work of introducing redundant information and compensating for it with external controls.

"Is this the correct way to do it?" Of course, given your current design. But it's not the ONLY way. So you're studying DB "normalization" and seeing the pros and cons of the various "forms" of normalization. In the "real world" things can change on a dime, due to a management decision or whatever. I tend to use "compound primary keys" instead of simply one field for primary and others as FK. I handle my "FK" programmatically instead of relegating that responsibility to the DB.
I also create and utilize a number of "intermediate" tables, or sometimes "VIEWS", that I use more easily than a bunch of code with too many JOINs. (3rd Normal form addicts can hate, but my code runs faster than a scalded rabbit).
An Order means nothing without a Customer; an Invoice means nothing without an Order; a Payment is great, but means nothing without both an Order and Invoice. So lemme throw this out there -- what's wrong with having a "summary" type of entity that has Cust, Order, Invoice #, and Payment Id ?

database, separate table or columns for extra fields?

I have Stock table.
sku
quantity
quantity_sold
price
seller
I have StockExtra table for products that has date/time property (you can think of event tickets)
stock # references stock
quantity
quantity_sold
price
date_at
time_at
datetime_rule # foreign key to another table, it is a rule that describes when events occur
For event tickets, I use stock and seller from the Stock table, but use quantity from StockExtra table. Because a ticket at different date can have different quantity and price.
I've divided the tables but not so sure if it is the best practice.
Now I need to create another table to hold stock data for separate market stores.
(I'm making a system where seller can manage his inventory when he sells products over multiple stores)
One could sell event tickets in amazon.com and in ebay.com for instance.
And the price, quantity in each store might be different.
So there will be one to many relations from Stock to StoreStock.
Stock will hold default price and aggregated quantity/quantity_sold for all stores. StoreStock will hold data for an individual store.
And I'll also need one to many relations from StockExtra to StoreStock due to the same reason, i.e price/quantity might be different for each date/time for event tickets.
So with my current setup,
there will be Stock StockExtra and StoreStock.
Would it be better to have just Stock and StoreStock even though date/time related fields will be empty for non-ticket products?

You should think about reducing the complexity of your system by keeping everything in a consistent way. Keep complex scenarios in the same tables as simple scenarios.
By keeping the same information in different places (e.g. Stock vs StockExtra, or Stock vs StoreStock) you are creating a situation in which your code has to have extra branching to find the data depending on the situation.
When you're going after the data for a single transaction, branching isn't the end of the world, although it is more code to have to write, debug and maintain. However, when you go after data in aggregate, having it spread across multiple potential locations makes your data retrieval much more complicated.
I would recommend keeping everything at the most detailed level. Therefore everything goes in StoreStock even if there is only one store applicable to a particular situation. Then, unless you have a demonstrable performance issue, don't split off StockExtra from Stock - Just use nullable columns in Stock.
It's OK to keep default prices in Stock, but use a more descriptive name for the sake of the next guy that has to maintain your code. I'd advise against tracking sales quantity in the Stock table. Keep this in StoreStock only. Don't keep a pre-calculated quantity on hand value. This will inevitably be out of whack. Instead, track quantities added (receipts) and quantities removed (sales) and calculate quantity on hand dynamically. This will avoid inventory reconciliation problems.

Dimension Modeling: need help doing this star for a university

I am new to Dimension Modeling and I am working on doing a dimension model for a university. The current business process that I have picked up is actually sales/revenue. I have been reading different chapters of different books and although I think I have a good understanding of facts and dimensions I am having some tough time fitting the sales process on to the paper.
Ideally the sales process in the school is similar to other businesses where students are customers and the product is the "courses" they take. However in certain situation there are different product types and I don't know how to fit the product type. For example student pays an Application fee, late fee or transcript request fee which is not associated with any course. How do I fit these different type of revenue streams in my star?
What I have done so far is like this
Sales_FACT
====
Date_Key_FK
Product_Key_FK
Campus_Key_FK
Student_Key_FK
ChargeCredit_SKU
Amount
Product_Key
------
Product_Key_PK
SectionID
AcademicYear
AcademicTerm
AcademicSession
CourseCode
CourseName
ProductType???
Now for certain type of products (e.g. a transcript request fee) - I do not have the coursename,code, year term,session -- I am struggling how this will work.
Anyone has any input on this? or any helpful material/schema examples will definately appreciate them
Thanks,

You will find a plenty of those cases in future. Generally, you are experiencing this problem because you are mixing two different types of 'product' in your case.
It can be resolved logically or technically.
During the ETL process, in cleansing step, you can rewrite your null fields with sql (nvl, coalesce, CASE WHEN) with something related to field
nvl(CourseCode, 'No Course Code') as CourseCode
Then, when you group by ProductType, and CourseName you shoud get something like this:
ProductType CourseName sum(Amount)
------------------------------------------
AppFee Course1 345.13
AppFee Course4 8901.00
TranscriptFee No Course Name 245.99
Or, you can put it in separate tables. Even that is contradictory to your business process (can't have different products in fact row), sometimes terms you want to merge (i.e ApplicationFee and TranscriptFee) have many different grouping levels which is often too hard to map.
Edit:
No, snowflake make sense when there exists big dimension tables, high cardinality, many levels, as well as many to many relationship (i.e movies, categories). In your case good idea is to follow ERP/CRM database design, because it's current working solution. If there is no such reporting possibilty you want, you can make more generic dimension table:
Product-Service Dimension
--------------------------------------------
SurogateKey
NaruralKey
Type(Product/Sevrice/Other)
Level1(ProductType/ServiceType)
Level2(ProductSubType/ServiceSubType)
Level3
Level4
Attribute1
Attribute2

A product has one price, a price has one product - hasone nhibernate relationship?

I was a little miffed about the one-to-one relationship explanation on the 'I Think You Mean A Many To One' article.
In this instance for example, a product has one price because the business in question is small, niche, localized and supports only a single currency. Multiple prices per product make no sense in this case? I'm doubtful I'm grasping the concept correctly though, because everywhere I read says it will probably be a many-to-one even if you think it isn't?
Can somebody enlighten me please? :)

In an attempt to gain more reputation so that I can help in comments instead of an "answer" The one-to-many vs one-to-one is this
View a one-to-one as an extension of the table you are looking at.
Table B extends Table A. Meaning the information wasn't necessarily relevant enough to include in the table directly, but has a bidirectional relationship with each other. Basically meaning that As Table A, I am not dependent on the information in Table B, but Table B's information is very dependent on me. For the price example it means that Table A has a row related to a row in table B. So if you entering unique information in your Price table around every item to match in Table A, then this would be useful. As in say you had a description column about the item in your price table. Otherwise the price table in this case may just be irrelevant to have in the schema.
in a one-to-many relationship Table B usually has no reference back to Table A. So in the case of price, the items you are looking at do have a price, but prices aren't exclusive to items. So to better define, A number of things may have the price 9.99, but 9.99 only needs to exist in your pricing table once.

I am not familiar with the article you refer to. However, price is a classic example of a slowly changing dimension. Price may be constant at any point in time, but over time, the price changes.
Such dimensions are typically implemented by having effective and end dates for the period in question.
Now, at a given point in time, a product probably does have only one price. Things that affect the price -- coupons, discounts for the purchaser, volume discounts, for example -- are not properties of the product. These are properties of the transaction.
That said, there may be circumstances where a fixed volume discount does not make sense. So, the "price" for a product might include volume, as well as time.
In any case, I would agree with you that price is not a good example of a 1-1 relationship. There are other factors such as time and volume that affect it.

How to handle an immutable table referencing mutable tables?

In making a pretty standard online store in .NET, I've run in to a bit of an architectural conundrum regarding my database. I have a table "Orders", referenced by a table "OrderItems". The latter references a table "Products".
Now, the orders and orderitems tables are in most aspects immutable, that is, an order created and its orderitems should look the same no matter when you're looking at the tables (for instance, printing a receipt for an order for bookkeeping each year should yield the same receipt the customer got at the time of the order).
I can think of two ways of achieving this behavior, one of which is in use today:
1. Denormalization, where values such as price of a product are copied to the orderitem table.
2. Making referenced tables immutable. The code that handles products could create a new product whenever a value such as the price is changed. Mutable tables referencing the products one would have their references updated, whereas the immutable ones would be fine and dandy with their old reference
What is your preferred way of doing this? Is there a better, more clever way of doing this?

It depends. I'm writing on a quite complex enterprise software that includes a kind of document management and auditing and is used in pharmacy.
Normally, primitive values are denormalized. For instance, if you just need a current state of the customer when the order was created, I would stored it to the order.
There are always more complex data that that need to be available of almost every point in time. There are two approaches: you create a history of them, or you implement a revision control system, which is almost the same.
The history means that every state that ever existed is stored as a separate record, in the same or another table.
I implemented a revision control system, where I split records into two tables, one for the actual item, lets say a product, and the other one for its versions. This way I can reference the product as a whole, or any specific version of it, because both have its own primary key.
This system is used for many entities. I can safely reference an object under revision control from audit trail for instance or other immutable records. At the beginning it seems to be more complex to have such a system, but at the end it is very straight forward and solves many problems at once.

Storing the price in both the Product table and the OrderItem table is NOT denormalizing if the price can change over time. Normalization rules say that every "fact" should be recorded only once in the database. But in this case, just because both numbers are called "price" doesn't make them the same thing. One is the current price, the other is the price as of the date of the sale. These are very different things. Just like "customer zip code" and "store zip code" are completely different fields; the fact that both might be called "zip code" for short does not make them the same thing. Personally, I have a strong aversion to giving fields that hold different data the same name because it creates confusion. I would not call them both "Price": I would call one "Current_Price" and the other "Sale_Price" or something like that.
Not keeping the price at the time of the sale is clearly wrong. If we need to know this -- which we almost surely do -- than we need to save it.
Duplicating the entire product record for every sale or every time the price changes is also wrong. You almost surely have constant data about a product, like description and supplier, that does not change every time the price changes. If you duplicate the product record, you will be duplicating all this data, which definately IS denormalization. This creates many potential problems. Like, if someone fixes a spelling error in the product description, we might now have the new record saying "4-slice toaster" while the old record says "4-slice taster". If we produce a report and sort on the description, they'll get separated and look like different products. Etc.
If the only data that changes about the product and that you care about is the price, then I'd just post the price into the OrderItem record.
If there's lots of data that changes, then you want to break the Product table into two tables: One for the data that is constant or whose history you don't care about, and another for data where you need to track the history. Like, have a ProductBase table with description, vendor, stock number, shipping weight, etc.; and a ProductMutable table with our cost, sale price, and anything else that routinely changes. You probably also want an as-of date, or at least an indication of which is current. The primary key of ProductMutable could then be Product_id plus As_of_date, or if you prefer simple sequential keys for all tables, fine, it at least has a reference to product_id. The OrderItem table references ProductMutable, NOT ProductBase. We find ProductBase via ProductMutable.

I think Denormalization is the way to go.
Also, Product should not have price (when it changes from time to time & when price mean different value to different people -> retailers, customers, bulk sellers etc).
You could also have a price history table where it contains ProductID, FromDate, ToDate, Price, IsActive - to maintain the price history for a product.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight